CN107066553A - A short text classification method based on convolutional neural network and random forest - Google Patents

A short text classification method based on convolutional neural network and random forest

Info

Publication number
CN107066553A
CN107066553A
Authority
CN
China
Prior art keywords
feature
random forest
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710181062.0A
Other languages
Chinese (zh)
Other versions
CN107066553B (en)
Inventor
刘泽锦
王洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201710181062.0A priority Critical patent/CN107066553B/en
Publication of CN107066553A publication Critical patent/CN107066553A/en
Application granted granted Critical
Publication of CN107066553B publication Critical patent/CN107066553B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a short text classification method based on a convolutional neural network and a random forest, belonging to the fields of text classification and deep learning. To address the insufficient generalization ability caused by using Softmax as the classifier of a convolutional neural network, a short text classification algorithm combining a convolutional neural network and a random forest (CNN-RF) is proposed. The method first proposes a dual word-vector convolutional neural network to fully extract the high-order features of short texts, and then uses a random forest as the classifier of those high-order features, thereby improving short text classification performance. Results on three public experimental datasets show that CNN-RF has a clear advantage over other algorithms on multiple evaluation metrics.

Description

A short text classification method based on a convolutional neural network and a random forest
Technical field
The invention belongs to the fields of text classification and deep learning and relates to a short text classification method based on a convolutional neural network and a random forest. It is applicable to tasks such as the classification or sentiment classification of massive short text data, for example microblogs, SMS messages and user queries, and can also be used in system services such as search engines and information retrieval.
Background art
With the rapid development of the Internet in recent years, various information exchange platforms produce large amounts of short text (Short Text). These short texts touch every field of daily life and have become a frequent and widely accepted form of communication; e-commerce reviews, web information retrieval and intelligent question-answering systems, for example, are all sources of massive short text. How to mine useful information from massive short text has been widely studied in recent years. Text classification is an effective method of text mining, but because short texts are short and their term features are sparse, traditional long-text classification techniques are no longer applicable. Short text classification (Short Text Classification) can, to some extent, address the challenges faced in short text applications; it is a research hotspot for scholars at home and abroad and an important task in the field of natural language processing (NLP). Current text classification methods are mainly based on statistical learning or machine learning: a classifier is trained on a manually annotated corpus using statistical or machine learning methods and then applied to the dataset to be classified. Mainstream machine learning methods include Naive Bayes (NB), the Support Vector Machine (SVM), Logistic Regression (LR), Softmax Regression (SR), Random Forest (RF) and Deep Neural Networks (DNN). Long-text classification methods that have been successful are difficult to apply directly to short text classification, so classification algorithms for short text have become an urgent research problem. The main challenges of short text classification are:
1) Short text keyword features are sparse. Compared with long texts rich in terms, a short text often contains only a few effective keywords, and when the text is represented with a vector space model it is difficult to fully mine the associations between features;
2) In open domains (such as microblogs and search engines), information updates quickly; a single short text carries little information, yet the total volume of text is large and the overlap between items of information is small;
3) New words, new terminology and colloquialisms appear in large numbers and are generally difficult for existing classification systems to handle.
Scholars at home and abroad have carried out meaningful research on the short text classification problem. The first class of approaches is based on short text feature expansion: Bouaziz et al. use the Latent Dirichlet Allocation (LDA) model to learn the distribution of topics, and of words over topics, in Wikipedia data, expand short texts with frequent words under the same topic, perform feature selection on the expansion words with random semantic forests, and then classify. Other scholars obtain word co-occurrence pattern sets through association rule mining (FP-Growth) as the basis for text feature expansion, using word-relation confidence as the weight during expansion before classifying. XH Phan et al. build a global corpus by crawling massive Internet data, learn an LDA topic model of the global corpus, then perform model estimation on the short text corpus to be classified with the global LDA topic model to obtain the topic distribution of each short text, use that distribution to expand the features of the short text, and finally classify. Methods of this first class inevitably introduce noise during short text expansion, which degrades classification performance.
The second class comprises methods based on deep learning: Socher et al. applied the Recursive Neural Network (RNN) to sentence-level sentiment analysis and achieved notable improvements on classification tasks over several datasets such as SST; Kalchbrenner et al. [8] used convolutional neural networks (Convolutional Neural Network, CNN) for sentence-level short text classification and proposed the Dynamic Convolutional Neural Network (DCNN), which performed well on multiple datasets, further demonstrating the potential of convolutional neural networks in short text classification research. The input of neural network methods is usually randomly initialized or uses pre-trained word vectors. Word vectors can be trained in many ways: different corpora, models and preprocessing produce word vectors with different meanings, each portraying word semantics from a different angle. Because short text features are sparse, combining multiple kinds of word vectors can be considered in order to extract features fully and improve the feature extraction ability of the convolutional neural network. In addition, when Softmax is used as the classifier of a convolutional network, training is generally performed with the BP algorithm, which only considers minimizing the training error; owing to local minima, vanishing gradients, overfitting and similar phenomena, it is difficult to bring the neural network to its optimal generalization ability. Random forest is an ensemble learning method based on Bootstrap Aggregation (Bagging); by combining many decision trees, the model acquires strong tolerance and robustness to outliers and noise, overcoming the insufficient generalization ability of a single decision tree. Random forests have many advantages, such as:
1) little parameter tuning is needed and training is fast;
2) the training process is essentially free of overfitting;
3) robustness to noise disturbance is high.
Summary of the invention
It is an object of the invention to propose a short text classification algorithm (CNN-RF) combining a dual word-vector convolutional neural network with a random forest. The dual word-vector convolutional neural network takes two kinds of pre-trained word vectors as input, so it can fully extract short text features and overcome their sparsity; a random forest then performs the classification, strengthening the generalization ability of the model. Training of the CNN-RF model is divided into two stages: 1) the pre-training stage: the dual word-vector convolutional network is trained with Softmax as the classifier, and the model parameters are saved; 2) the classifier training stage: the parameters from the pre-training stage are kept fixed, the fully connected layer is fed into a random forest, the random forest is trained on the high-order features, and its parameters are saved. Experiments show that only a few epochs of pre-training are needed for the model of the classifier training stage to converge and reach good classification performance.
To achieve the above object, the technical solution adopted by the present invention is a short text classification method based on a convolutional neural network and a random forest, comprising the following steps:
Step 1: Segment all Chinese texts in the corpus to be classified, obtain two sets of word vectors for the corpus using the word2vec and GloVe word-vector training tools respectively, and represent each text as two matrices of equal dimensions; perform a two-dimensional convolution operation on each matrix to obtain two convolutional-layer feature maps.
Step 2: After the convolution operation, apply a pooling operation to each of the two convolutional-layer feature maps to obtain two pooling-layer feature matrices; apply a nonlinear sigmoid transformation to the pooling-layer feature matrices to obtain two pooling-layer feature maps.
Step 3: Perform a convolution operation on the two pooling-layer feature maps obtained in step 2 to obtain the final single fully connected layer feature map.
Step 4: Use the fully connected feature map obtained in step 3 as the input dataset of the random forest layer and apply Bootstrap sampling to this set. Bootstrap sampling is a statistical sampling method: for a dataset D with m samples, sampling with replacement m times yields a new dataset D'. Clearly D and D' have the same size, and because sampling is with replacement, some samples appear repeatedly in D' while others do not appear at all.
Step 5: Build a CART classification and regression tree for each of the multiple Bootstrap sample sets using the Gini index. The Gini index is used for feature selection: the feature space is split on the selected feature, the feature is removed from the feature set after the split, and feature selection and splitting are applied recursively to the left and right subtrees until a stopping condition is met. In addition, to prevent decision-tree overfitting, the method applies pre-pruning. The multiple decision trees are combined to make a joint classification decision for each sample, usually by majority voting (see the sketch after this step).
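The patent publishes no source code; the following is a minimal illustrative sketch of the Bootstrap sampling of step 4 and the Gini-based CART forest with majority voting of step 5, using NumPy and scikit-learn. The feature matrix X and labels y are hypothetical placeholders for the fully connected layer features, and the tree count and depth cap are assumed values.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))   # placeholder for the fully connected layer features
y = rng.integers(0, 5, size=1000)  # placeholder class labels

N_TREES = 80
trees = []
for _ in range(N_TREES):
    idx = rng.integers(0, len(X), size=len(X))       # draw m samples with replacement -> D'
    tree = DecisionTreeClassifier(criterion="gini",  # Gini index drives feature selection
                                  max_depth=10)      # pre-pruning via a depth cap
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Combine the trees by majority vote over their individual predictions
votes = np.stack([t.predict(X[:10]) for t in trees])                # shape (N_TREES, 10)
pred = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
print(pred)
```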
Compared with the prior art, the present invention has the following beneficial effects.
The random forest replaces the fully connected Softmax layer of the convolutional neural network, enhancing the robustness of the overall classification method, reducing model overfitting and strengthening generalization ability; the dual word-vector convolutional neural network can extract richer features; no complex parse tree is relied on, feature extraction requiring only convolution and max pooling over time (Max Pooling Over Time), and the resulting high-level abstract structural features are fed to the random forest layer for classification. From the bias-variance perspective, ensembling multiple models reduces the variance of the classification model and improves its stability. The method needs no complex feature expansion process, which would usually introduce noise and be laborious; it makes full use of the short text's own information and, compared with a traditional single-channel word-vector convolutional network, substantially alleviates the sparsity of short text data and can extract features fully. The max-pooling-over-time operation also solves the problem of variable-length short text input, and the dual pre-trained word-vector convolutional network effectively improves the accuracy of short text classification. Experiments show that only a few epochs of pre-training are needed for the method to reach good results.
Brief description of the drawings
Fig. 1 is the pre-trained word-vector generation model, a schematic of the skip-gram model
Fig. 2 is the classification model combining the convolutional neural network with the random forest
Fig. 3 compares accuracy (ACC) with NB, CART, RF and CNN on the three datasets
Fig. 4 compares precision (Pr), recall (Re) and F1 with NB, CART, RF and CNN on the Fudan dataset
Fig. 5 compares precision (Pr), recall (Re) and F1 with NB, CART, RF and CNN on the MR dataset
Fig. 6 compares precision (Pr), recall (Re) and F1 with NB, CART, RF and CNN on the Weibo dataset
Fig. 7.1 shows how the three evaluation metrics of the RF algorithm change with the number of decision trees on the Fudan dataset
Fig. 7.2 shows how the three evaluation metrics of the present method change with the number of decision trees on the Fudan dataset
Embodiment
In order to make the purpose, technical solution and features of the present invention clearer, the invention is further described below in conjunction with specific embodiments and with reference to the drawings.
The present invention replaces the fully connected Softmax layer of the convolutional neural network with a random forest (Random Forest), enhancing the robustness of the overall classification method, preventing model overfitting and strengthening generalization; it further uses a dual word-vector convolutional neural network suited to extracting richer high-order features. The specific improvements of the invention can be summarized as follows: 1) two sets of pre-trained word vectors are used instead of randomly initialized word vectors; compared with previous methods or bag-of-words models, this reduces feature dimensionality and extracts rich features; 2) randomly initialized word vectors also require parameter updates of the word-vector matrix, whereas this method needs no such operation, improving model efficiency; 3) no feature expansion or complex operations such as syntactic parse trees are introduced, avoiding noise in the model's subsequent feature extraction and classification; 4) as in a traditional neural network, features are first extracted with convolution-pooling-Softmax layers, and after a certain number of epochs the output of the fully connected layer becomes high-order structural features; 5) a random forest replaces Softmax for classification, which effectively improves the generalization ability of the model, prevents overfitting and strengthens classification performance. Experiments on three public datasets (Fudan, Weibo, MR) show that CNN-RF has a clear advantage over other methods on multiple evaluation metrics.
Fig. 1 shows the skip-gram model of the word2vec word vectors used by the invention; Fig. 2 shows the structure used by the short text classification method based on a convolutional neural network and a random forest. With the two sets of pre-trained word vectors, each short text in the corpus is first constructed into two word-vector matrices, 2-D convolution and max-pooling-over-time operations are performed, the features of the two channels are then combined with a convolution operation, pre-training is carried out, and finally the classification model is built with a random forest. The specific implementation process is divided into a pre-training stage and a classifier training stage:
One: The pre-training stage
Step 1: After obtaining the two sets of word vectors, for corpus D let x denote a text and let $x_i \in \mathbb{R}^k$ denote the word vector of the i-th word in the text; a sentence of length n is then expressed in the following form:

$$x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$$

Here $\oplus$ denotes vector concatenation and n is the length of the longest sentence in the training corpus. Texts shorter than n are padded with the special symbol <PAD>, whose vector representation is drawn from a uniform distribution on (-0.25, 0.25). With word-vector length k, every text x is now expressed as two single-channel (Channel) two-dimensional $n \times k$ matrices, which serve as the two input layers (as shown in the sketch below).
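As a minimal sketch of this padding scheme (assuming `w2v` and `glove` are dictionaries mapping a segmented token to a k-dimensional NumPy vector; the names are illustrative):

```python
import numpy as np

K = 100                                   # word-vector dimension used in the experiments
rng = np.random.default_rng(42)
PAD = rng.uniform(-0.25, 0.25, size=K)    # <PAD> vector, uniform on (-0.25, 0.25)

def text_to_matrix(tokens, vectors, n):
    """Stack word vectors row-wise and pad the sentence to length n with <PAD>."""
    rows = [vectors.get(t, PAD) for t in tokens[:n]]
    rows += [PAD] * (n - len(rows))
    return np.stack(rows)                 # shape (n, K): one single-channel input matrix

# Each text yields two single-channel n-by-k matrices, one per embedding:
# channel_a = text_to_matrix(tokens, w2v, n); channel_b = text_to_matrix(tokens, glove, n)
```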
Step 2: Perform a convolution operation on each of the two input layers, applying a filter $W \in \mathbb{R}^{h \times k}$ to the word-vector sequence $x_{i:i+h-1} = \{x_i, x_{i+1}, \ldots, x_{i+h-1}\}$:

$$C_i = f(W \cdot x_{i:i+h-1} + b)$$

where h is the size of the filter's word window, $b \in \mathbb{R}$ is a bias term and f is a nonlinear activation function. The filter W acts over the whole word-vector sequence $\{x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}\}$ to produce the convolutional-layer feature map

$$C_{conv} = [C_{conv,1}, C_{conv,2}, \ldots, C_{conv,n-h+1}]$$

To extract features fully, m filters of different spans are set during training, denoted $\{W_1, W_2, \ldots, W_m\}$, with $s_j$ filters of the j-th kind and generally $s_1 = s_2 = \cdots = s_m = s$, producing m × s feature maps. Max-pooling-over-time is then applied to each individual feature map $C_{conv}$ to obtain the most important feature of the feature map (a sketch follows this step):

$$\hat{C}_{pool} = \max[C_{conv,1}, C_{conv,2}, \ldots, C_{conv,n-h+1}]$$
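A minimal NumPy sketch of the single-filter convolution $C_i = f(W \cdot x_{i:i+h-1} + b)$ and of max-pooling-over-time follows; the sentence matrix, filter and activation are illustrative assumptions:

```python
import numpy as np

def conv_feature_map(sent, W, b, f=np.tanh):
    """Slide a filter W of word-window size h over the (n, k) word-vector matrix."""
    n, _ = sent.shape
    h = W.shape[0]
    return np.array([f(np.sum(W * sent[i:i + h]) + b) for i in range(n - h + 1)])

def max_pool_over_time(c_conv):
    return c_conv.max()   # keep only the most important feature of the map

rng = np.random.default_rng(1)
sent = rng.normal(size=(20, 100))       # n = 20 words, k = 100 dimensions
W, b = rng.normal(size=(3, 100)), 0.0   # one filter with word window h = 3
print(max_pool_over_time(conv_feature_map(sent, W, b)))
```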
Step 3: Step 2 produces m × s pooling-layer features, which are concatenated to obtain the pooling-layer feature $C_{pool}^{l}$, where l = 1, 2 indexes the pooling-layer features of the two sets of word vectors respectively.
Step 4: Perform a convolution operation on the two pooling-layer features to obtain the final fully connected layer feature $C_{final}$, with components $C_{final,i}$:

$$C_{final,i} = f\left(\sum_{l=1}^{2} W \cdot C_{pool,\,i:i+1}^{l} + b^{l}\right)$$
Step 5: A Softmax classifier is attached after the fully connected layer features. The model of the whole pre-training stage is trained with the Adam mini-batch gradient descent (Mini-batch Gradient Descent) algorithm, the parameters of each layer are adjusted with the BP algorithm, and the parameters θ of the whole CNN are recorded after convergence. Dropout and L2 regularization are used during training to prevent overfitting (a sketch of this stage follows).
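Since the patent publishes no source code, the following is a minimal TensorFlow/Keras sketch of the pre-training stage under stated assumptions: the maximum length N, feature width, class count and the names `Xa_train`, `Xb_train`, `y_train` are placeholders, and the cross-channel convolutional merge of step 4 is approximated here by concatenation followed by a regularized dense layer:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

N, K = 50, 100                     # assumed max sentence length; vector dimension 100
FILTER_SIZES, S = (2, 3, 4), 100   # filter spans and count per span, as in the experiments

def conv_pool(x):
    """Convolution plus max-pooling-over-time for one embedding channel."""
    pooled = []
    for h in FILTER_SIZES:
        c = layers.Conv2D(S, (h, K), activation="relu")(x)  # (batch, N-h+1, 1, S)
        pooled.append(layers.GlobalMaxPooling2D()(c))       # max over time -> (batch, S)
    return layers.Concatenate()(pooled)                     # (batch, m*s)

in_a = layers.Input(shape=(N, K, 1))   # word2vec channel
in_b = layers.Input(shape=(N, K, 1))   # GloVe channel
merged = layers.Concatenate()([conv_pool(in_a), conv_pool(in_b)])
merged = layers.Dropout(0.5)(merged)                        # Dropout 0.5 as in the experiments
feats = layers.Dense(300, activation="sigmoid", name="final_features",
                     kernel_regularizer=regularizers.l2(0.001))(merged)
out = layers.Dense(5, activation="softmax")(feats)          # Softmax head for pre-training

model = tf.keras.Model([in_a, in_b], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit([Xa_train, Xb_train], y_train, batch_size=64, epochs=3)  # few epochs suffice
```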
Two: The classifier training stage
Step 6: Read the parameters θ from step 5, replace the Softmax model with a random forest model, and feed the fully connected layer features $C_{final}$ into the random forest for training. First the number N of decision trees in the forest is set and Bootstrap sampling is performed to obtain N datasets; next the parameters $\theta_n$ of each of the N trees are learned. Because the training processes of the individual trees in the forest do not affect one another, parallel training is used in the experiments to accelerate this step.
Step 7: After the individual decision trees have been trained, the output of the CNN-RF model is obtained by voting (a sketch follows):

$$c^{*} = \arg\max_{c}\left\{\sum_{i=1}^{N} I\left(T_i(x) = c\right)\right\}$$

where $T_i(x)$ is the classification result of tree i for sample x (i.e., its vote), $c^{*}$ is the final class assigned to the sample, and N is the number of decision trees in the random forest. Because the dimension of the fully connected layer feature $C_{final}$ fed to the random forest is small (for typical datasets m × s < 10³), the cost of building the random forest is very small.
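A minimal sketch of this classifier training stage, continuing the assumptions of the previous block (`Xa_*`, `Xb_*` and `y_train` are hypothetical arrays):

```python
from sklearn.ensemble import RandomForestClassifier

# Keep the pre-trained CNN fixed and read out the fully connected layer features C_final
feature_extractor = tf.keras.Model(model.inputs,
                                   model.get_layer("final_features").output)
C_final_train = feature_extractor.predict([Xa_train, Xb_train])

rf = RandomForestClassifier(n_estimators=80,  # N trees; ~50-80 sufficed on the Fudan data
                            criterion="gini",
                            n_jobs=-1)        # trees are independent, so train in parallel
rf.fit(C_final_train, y_train)

# predict() aggregates the trees (scikit-learn averages class probabilities, which
# approximates the majority vote c* = argmax_c sum_i I(T_i(x) = c))
pred = rf.predict(feature_extractor.predict([Xa_test, Xb_test]))
```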
The method combines the feature extraction ability of the CNN with the generalization ability of the random forest; the generalization ability can be analyzed from three aspects: 1) from the statistical viewpoint, because the hypothesis space of a learning task is often very large, multiple hypotheses may reach the same level of performance on the training set, and a single decision tree may then generalize poorly because of mis-selection; 2) from the feature extraction viewpoint, the dual word vectors portray the meaning of a word from two angles, enriching the short text information and expanding the feature information relative to a single word vector; 3) in terms of representation, the true hypothesis of some learning tasks may not lie within the hypothesis space of the current decision-tree algorithm, and a single classification method may then fail to find a suitable hypothesis, whereas the Bootstrap sampling of the random forest reduces the model's dependence on the data and the variance of the model, giving it better generalization ability.
Experimental equipment and environment
Win7 32-bit operating system, Intel Xeon E5 processor, CPU frequency 3.30 GHz, 16 GB memory. The experimental code uses Python; the deep learning environment is TensorFlow combined with the scikit-learn framework.
Experimental results and explanation
The method was tested on the Fudan Chinese dataset, the Weibo dataset provided by NLPIR, and the MR review sentiment classification dataset. The Fudan Chinese dataset contains a training corpus of 9804 documents and a test corpus of 9833 documents in 20 categories; the invention uses the news titles in the Fudan Chinese dataset as the short text classification corpus and selects only 5 of its categories, namely C3-Art, C32-Agriculture, C34-Economy, C7-History and C38-Politics, 7120 title documents in total. The Weibo dataset contains 21 categories in total; the invention uses all categories except "humanities and art", "advertising and public welfare" and "campus", 18 categories and 36412 microblog texts in total. For the Weibo and MR datasets, which have no training/test split, 10-fold cross-validation was carried out in the experiments, making the results more convincing.
Preprocessing and parameter settings
In the experiments two sets of word vectors are used: the first is trained with word2vec skip-gram, the second with the GloVe model. The word vectors are trained on each dataset itself; only for the Fudan dataset are the news contents and news titles used together as the word-vector training corpus. In preprocessing, HanLP is used for Chinese word segmentation and stop-word removal. The dimension of both sets of word vectors is set to 100; the filter sizes in the convolutional neural network are 2, 3 and 4, with 100 filters of each size; the Dropout parameter is set to 0.5 and the L2 regularization parameter to 0.001. Because of differences in preprocessing, word-vector corpora and method choices, the experimental results of different authors on the same dataset deviate somewhat. To verify the classification performance of CNN-RF, the various classification models and the method of this paper were therefore implemented under an identical preprocessing mechanism for the comparative experiments of classification performance.
Experimental setup and evaluation metrics
The method proposed by the invention is compared with four algorithms: Naive Bayes (NB), the classification and regression tree (CART), random forest (RF) and the CNN network proposed by Kim. In NB, CART and RF, the feature vector used for classification is the sum of the word vectors of the words in the text. The experiments take accuracy, precision, recall and the F1 value (F1-measure) as evaluation criteria, calculated as follows:
1) accuracy: $ACC = \dfrac{TP + TN}{N}$
2) precision: $Pr = \dfrac{TP}{TP + FP}$
3) recall: $Re = \dfrac{TP}{TP + FN}$
4) F1 value (F1-measure): $F1 = \dfrac{2 \cdot Pr \cdot Re}{Pr + Re}$
where TP is the number of positive samples predicted as positive, TN the number of negative samples predicted as negative, FN the number of positive samples predicted as negative, FP the number of negative samples predicted as positive, and N the total number of samples. The experiments then analyze the influence of increasing the number of decision trees on the RF and CNN-RF methods, and finally compare the convergence speeds of the CNN-RF and CNN algorithms.
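As a minimal sketch, the four metrics can be computed with scikit-learn; `y_true` and `y_pred` are assumed label arrays, and macro averaging is one common choice for the multi-class case:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

acc = accuracy_score(y_true, y_pred)                    # (TP + TN) / N
pr  = precision_score(y_true, y_pred, average="macro")  # TP / (TP + FP), averaged per class
re  = recall_score(y_true, y_pred, average="macro")     # TP / (TP + FN), averaged per class
f1  = f1_score(y_true, y_pred, average="macro")         # 2 * Pr * Re / (Pr + Re)
print(f"ACC={acc:.4f} Pr={pr:.4f} Re={re:.4f} F1={f1:.4f}")
```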
Analysis of experimental results
First, an accuracy comparison of the five algorithms is carried out on the 3 datasets. As can be seen from Fig. 3, the CNN-RF method proposed by the invention achieves the highest accuracy on all 3 datasets, improving on CNN by 1.7% on the Fudan dataset, by 1.6% on the Weibo dataset and by 0.8% on the MR dataset. The deep-learning-based CNN method is second only to CNN-RF and better than the other three methods, while the accuracies of NB and CART are both below that of the ensemble method RF. The analysis shows that the ensemble method, which combines multiple models, improves generalization over single models but is weaker than the deep learning CNN method; CNN obtains better accuracy by extracting abstract structural features, and CNN-RF combines the advantages of both, hence its better results.
The results of the five algorithms on the Fudan Chinese dataset are shown in Fig. 4. The experimental data show that the RF algorithm exceeds the CART and NB algorithms on all three metrics of precision, recall and F1, confirming that the ensemble-learning-based method indeed adds tolerance to noise and enhances the generalization ability of the classifier. In precision, RF is 1.0% higher than CNN, but in recall CNN is 6.1% higher than RF, so overall CNN exceeds RF by 2.5% in F1; CNN also reaches the best recall among the methods, 92.8%, which is 0.6% higher than the CNN-RF algorithm. Except for recall, where it trails CNN, the CNN-RF algorithm further strengthens the generalization ability of the model, improving precision by 4.1% and F1 by 1.9% over CNN; CNN-RF achieves the best results in accuracy and F1.
The results of the five algorithms on the MR dataset are shown in Fig. 5; MR is a binary sentiment classification dataset. CNN-RF is highest on all three evaluation metrics, about 1.2% higher than CNN and 4.4% higher than RF in F1. Unlike on the other two datasets, the precision, recall and F1 of CNN-RF on the MR dataset all exceed those of CNN, by 1.5%, 1.1% and 1.3% respectively.
The results of the five algorithms on the Weibo dataset are shown in Fig. 6. From the data, the recall of RF is still poor, although its precision is 7.6% higher than that of the CNN algorithm; the CNN algorithm achieves the highest recall, 15.6% and 9.2% higher than the RF and CNN-RF algorithms respectively, which leaves RF's F1 5.1% lower than CNN's. But because CNN performs poorly in precision, its F1 is lower than that of CNN-RF. CNN-RF obtains the best results in precision and F1: in precision CNN-RF is 11% higher than CNN, and it reaches the best F1, 6% and 0.9% higher than RF and CNN respectively.
In summary, the CNN-RF method is insensitive to the length of the short text dataset, the dual word-vector convolutional neural network can extract features fully, and the generalization ability of the model is better than that of the other four algorithms. By contrast, the CART and NB algorithms perform worst; the RF ensemble learning approach lifts generalization to some extent but, because it only uses features formed by summing the word2vec word vectors of the initial words, its classification performance is worse than CNN-RF's. The CNN-RF method exploits the abstract high-order features extracted by the dual word-vector CNN and combines many decision trees to strengthen the generalization ability of the model; its overall performance is better than CNN and RF on all datasets. Relative to CNN, F1 improves by 1.9%, 0.9% and 1.3% on the 3 datasets respectively, and the experiments verify the effectiveness of the method of the invention.
Regarding the influence of the number of decision trees in the random forest, tests were carried out on the Fudan Chinese dataset; the results are shown in Fig. 7.1 and Fig. 7.2, where the number of decision trees increases from 10 to 200 in increments of 10, 20 settings in total. Fig. 7.1 shows the RF algorithm and Fig. 7.2 the method of this paper. It can be seen that the three evaluation metrics of both CNN-RF and RF rise at first as the number of trees n increases; for RF the three metrics stabilize once the number of trees reaches about 80, while for CNN-RF they essentially stabilize after the number reaches 50.

Claims (2)

1. A short text classification method based on a convolutional neural network and a random forest, characterized in that the method comprises the following steps:
Step 1: Segment all Chinese texts in the corpus to be classified, obtain two sets of word vectors for the corpus using the word2vec and GloVe word-vector training tools respectively, and represent each text as two matrices of equal dimensions; perform a two-dimensional convolution operation on each matrix to obtain two convolutional-layer feature maps;
Step 2: After the convolution operation, apply a pooling operation to each of the two convolutional-layer feature maps to obtain two pooling-layer feature matrices; apply a nonlinear sigmoid transformation to the pooling-layer feature matrices to obtain two pooling-layer feature maps;
Step 3: Perform a convolution operation on the two pooling-layer feature maps obtained in step 2 to obtain the final single fully connected layer feature map;
Step 4: Use the fully connected feature map obtained in step 3 as the input dataset of the random forest layer and apply Bootstrap sampling to this set; Bootstrap sampling is a statistical sampling method: for a dataset D with m samples, sampling with replacement m times yields a new dataset D'; clearly D and D' have the same size, and because sampling is with replacement, some samples appear repeatedly in D' while others do not appear at all;
Step 5: Build a CART classification and regression tree for each of the multiple Bootstrap sample sets using the Gini index; the Gini index is used for feature selection: the feature space is split on the selected feature, the feature is removed from the feature set after the split, and feature selection and splitting are applied recursively to the left and right subtrees until a stopping condition is met; in addition, to prevent decision-tree overfitting, the method applies pre-pruning; the multiple decision trees are combined to make a joint classification decision for each sample, usually by majority voting.
2. The short text classification method based on a convolutional neural network and a random forest according to claim 1, characterized in that:
the specific implementation process of the method is divided into a pre-training stage and a classifier training stage:
One: The pre-training stage
Step 1: After obtaining the two sets of word vectors, for corpus D let x denote a text and let $x_i \in \mathbb{R}^k$ denote the word vector of the i-th word in the text; a sentence of length n is then expressed in the following form:

$$x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$$

where $\oplus$ denotes vector concatenation and n is the length of the longest sentence in the training corpus; texts shorter than n are padded with the special symbol <PAD>, whose vector representation is drawn from a uniform distribution on (-0.25, 0.25); with word-vector length k, every text x is now expressed as two single-channel (Channel) two-dimensional $n \times k$ matrices, which serve as the two input layers;
Step 2: Perform a convolution operation on each of the two input layers, applying a filter $W \in \mathbb{R}^{h \times k}$ to the word-vector sequence $x_{i:i+h-1} = \{x_i, x_{i+1}, \ldots, x_{i+h-1}\}$:

$$C_i = f(W \cdot x_{i:i+h-1} + b)$$

where h is the size of the filter's word window, $b \in \mathbb{R}$ is a bias term and f is a nonlinear activation function; the filter W acts over the whole word-vector sequence $\{x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}\}$ to produce the convolutional-layer feature map

$$C_{conv} = [C_{conv,1}, C_{conv,2}, \ldots, C_{conv,n-h+1}]$$

to extract features fully, m filters of different spans are set during training, denoted $\{W_1, W_2, \ldots, W_m\}$, with $s_j$ filters of the j-th kind and generally $s_1 = s_2 = \cdots = s_m = s$, producing m × s feature maps; max-pooling-over-time is then applied to each individual feature map $C_{conv}$ to obtain the most important feature of the feature map:

$$\hat{C}_{pool} = \max[C_{conv,1}, C_{conv,2}, \ldots, C_{conv,n-h+1}]$$
Step 3: Step 2 produces m × s pooling-layer features, which are concatenated to obtain the pooling-layer feature $C_{pool}^{l}$, where l = 1, 2 indexes the pooling-layer features of the two sets of word vectors respectively;
Step 4: Perform a convolution operation on the two pooling-layer features to obtain the final fully connected layer feature $C_{final}$, with components $C_{final,i}$:

$$C_{final,i} = f\left(\sum_{l=1}^{2} W \cdot C_{pool,\,i:i+1}^{l} + b^{l}\right)$$
Step 5: A Softmax classifier is attached after the fully connected layer features; the model of the whole pre-training stage is trained with the Adam mini-batch gradient descent algorithm, the parameters of each layer are adjusted with the BP algorithm, and the parameters θ of the whole CNN are recorded after convergence; Dropout and L2 regularization are used during training to prevent overfitting;
Two: The classifier training stage
Step 6: Read the parameters θ from step 5, replace the Softmax model with a random forest model, and feed the fully connected layer features $C_{final}$ into the random forest for training; first the number N of decision trees in the forest is set and Bootstrap sampling is performed to obtain N datasets; next the parameters $\theta_n$ of each of the N trees are learned; because the training processes of the individual trees in the forest do not affect one another, parallel training is used in the experiments to accelerate this step;
Step 7: After the individual decision trees have been trained, the output of the CNN-RF model is obtained by voting:

$$c^{*} = \arg\max_{c}\left\{\sum_{i=1}^{N} I\left(T_i(x) = c\right)\right\}$$

where $T_i(x)$ is the classification result of tree i for sample x (i.e., its vote), $c^{*}$ is the final class assigned to the sample, and N is the number of decision trees in the random forest; because the dimension of the fully connected layer feature $C_{final}$ fed to the random forest is small (for typical datasets m × s < 10³), the cost of building the random forest is very small.
CN201710181062.0A 2017-03-24 2017-03-24 Short text classification method based on convolutional neural network and random forest Active CN107066553B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710181062.0A CN107066553B (en) 2017-03-24 2017-03-24 Short text classification method based on convolutional neural network and random forest

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710181062.0A CN107066553B (en) 2017-03-24 2017-03-24 Short text classification method based on convolutional neural network and random forest

Publications (2)

Publication Number Publication Date
CN107066553A true CN107066553A (en) 2017-08-18
CN107066553B (en) 2021-01-01

Family

ID=59618101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710181062.0A Active CN107066553B (en) 2017-03-24 2017-03-24 Short text classification method based on convolutional neural network and random forest

Country Status (1)

Country Link
CN (1) CN107066553B (en)

Cited By (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107368613A (en) * 2017-09-05 2017-11-21 中国科学院自动化研究所 Short text sentiment analysis method and device
CN107767378A (en) * 2017-11-13 2018-03-06 浙江中医药大学 The multi-modal Magnetic Resonance Image Segmentation methods of GBM based on deep neural network
CN107798331A (en) * 2017-09-05 2018-03-13 赵彦明 From zoom image sequence characteristic extracting method and device
CN107886474A (en) * 2017-11-22 2018-04-06 北京达佳互联信息技术有限公司 Image processing method, device and server
CN107957993A (en) * 2017-12-13 2018-04-24 北京邮电大学 The computational methods and device of english sentence similarity
CN108108751A (en) * 2017-12-08 2018-06-01 浙江师范大学 A kind of scene recognition method based on convolution multiple features and depth random forest
CN108108351A (en) * 2017-12-05 2018-06-01 华南理工大学 A kind of text sentiment classification method based on deep learning built-up pattern
CN108122562A (en) * 2018-01-16 2018-06-05 四川大学 A kind of audio frequency classification method based on convolutional neural networks and random forest
CN108231067A (en) * 2018-01-13 2018-06-29 福州大学 Sound scenery recognition methods based on convolutional neural networks and random forest classification
CN108509508A (en) * 2018-02-11 2018-09-07 北京原点时空信息技术有限公司 Short message big data inquiry based on Java technology and analysis system and its method
CN108733801A (en) * 2018-05-17 2018-11-02 武汉大学 A kind of moving-vision search method towards digital humanity
CN108776805A (en) * 2018-05-03 2018-11-09 北斗导航位置服务(北京)有限公司 It is a kind of establish image classification model, characteristics of image classification method and device
CN108829671A (en) * 2018-06-04 2018-11-16 北京百度网讯科技有限公司 Method, apparatus, storage medium and the terminal device of decision based on survey data
CN108875808A (en) * 2018-05-17 2018-11-23 延安职业技术学院 A kind of book classification method based on artificial intelligence
CN108920586A (en) * 2018-06-26 2018-11-30 北京工业大学 A kind of short text classification method based on depth nerve mapping support vector machines
CN108959924A (en) * 2018-06-12 2018-12-07 浙江工业大学 A kind of Android malicious code detecting method of word-based vector sum deep neural network
CN109002532A (en) * 2018-07-17 2018-12-14 电子科技大学 Behavior trend mining analysis method and system based on student data
CN109165294A (en) * 2018-08-21 2019-01-08 安徽讯飞智能科技有限公司 Short text classification method based on Bayesian classification
CN109214298A (en) * 2018-08-09 2019-01-15 盈盈(杭州)网络技术有限公司 A kind of Asia women face value Rating Model method based on depth convolutional network
CN109493886A (en) * 2018-12-13 2019-03-19 西安电子科技大学 Speech-emotion recognition method based on feature selecting and optimization
CN109543084A (en) * 2018-11-09 2019-03-29 西安交通大学 A method of establishing the detection model of the hidden sensitive text of network-oriented social media
CN109670182A (en) * 2018-12-21 2019-04-23 合肥工业大学 A kind of extremely short file classification method of magnanimity indicated based on text Hash vectorization
WO2019080484A1 (en) * 2017-10-26 2019-05-02 北京深鉴智能科技有限公司 Method of pruning convolutional neural network based on feature map variation
CN109739986A (en) * 2018-12-28 2019-05-10 合肥工业大学 A kind of complaint short text classification method based on Deep integrating study
CN109843401A (en) * 2017-10-17 2019-06-04 腾讯科技(深圳)有限公司 A kind of AI object behaviour model optimization method and device
CN110019787A (en) * 2017-09-30 2019-07-16 北京国双科技有限公司 Neural network model generation method, text emotion analysis method and relevant apparatus
CN110020431A (en) * 2019-03-06 2019-07-16 平安科技(深圳)有限公司 Feature extracting method, device, computer equipment and the storage medium of text information
CN110069634A (en) * 2019-04-24 2019-07-30 北京泰迪熊移动科技有限公司 A kind of method, apparatus and computer readable storage medium generating classification model
CN110134786A (en) * 2019-05-14 2019-08-16 南京大学 A kind of short text classification method based on theme term vector and convolutional neural networks
CN110222173A (en) * 2019-05-16 2019-09-10 吉林大学 Short text sensibility classification method and device neural network based
CN110263344A (en) * 2019-06-25 2019-09-20 名创优品(横琴)企业管理有限公司 A kind of text emotion analysis method, device and equipment based on mixed model
CN110309304A (en) * 2019-06-04 2019-10-08 平安科技(深圳)有限公司 A kind of file classification method, device, equipment and storage medium
CN110377915A (en) * 2019-07-25 2019-10-25 腾讯科技(深圳)有限公司 Sentiment analysis method, apparatus, storage medium and the equipment of text
CN110781333A (en) * 2019-06-26 2020-02-11 杭州鲁尔物联科技有限公司 Method for processing unstructured monitoring data of cable-stayed bridge based on machine learning
CN111353512A (en) * 2018-12-20 2020-06-30 长沙智能驾驶研究院有限公司 Obstacle classification method, obstacle classification device, storage medium and computer equipment
CN111352926A (en) * 2018-12-20 2020-06-30 北京沃东天骏信息技术有限公司 Data processing method, device, equipment and readable storage medium
CN111401063A (en) * 2020-06-03 2020-07-10 腾讯科技(深圳)有限公司 Text processing method and device based on multi-pool network and related equipment
CN111753081A (en) * 2019-03-28 2020-10-09 百度(美国)有限责任公司 Text classification system and method based on deep SKIP-GRAM network
CN111813939A (en) * 2020-07-13 2020-10-23 南京睿晖数据技术有限公司 Text classification method based on representation enhancement and fusion
CN111897921A (en) * 2020-08-04 2020-11-06 广西财经学院 Text retrieval method based on word vector learning and mode mining fusion expansion
WO2020233344A1 (en) * 2019-05-21 2020-11-26 深圳壹账通智能科技有限公司 Searching method and apparatus, and storage medium
CN112182219A (en) * 2020-10-09 2021-01-05 杭州电子科技大学 Online service abnormity detection method based on log semantic analysis
CN112329877A (en) * 2020-11-16 2021-02-05 山西三友和智慧信息技术股份有限公司 Voting mechanism-based web service classification method and system
CN112347247A (en) * 2020-10-29 2021-02-09 南京大学 Specific category text title binary classification method based on LDA and Bert
CN112418354A (en) * 2020-12-15 2021-02-26 江苏满运物流信息有限公司 Goods source information classification method and device, electronic equipment and storage medium
CN112487811A (en) * 2020-10-21 2021-03-12 上海旻浦科技有限公司 Cascading information extraction system and method based on reinforcement learning
WO2021082861A1 (en) * 2019-10-31 2021-05-06 平安科技(深圳)有限公司 Scoring method and apparatus, electronic device, and storage medium
CN113342970A (en) * 2020-11-24 2021-09-03 中电万维信息技术有限责任公司 Multi-label complex text classification method
CN114154561A (en) * 2021-11-15 2022-03-08 国家电网有限公司 Electric power data management method based on natural language processing and random forest
CN114511330A (en) * 2022-04-18 2022-05-17 山东省计算中心(国家超级计算济南中心) Improved CNN-RF-based Ethernet workshop Pompe deception office detection method and system
CN115064184A (en) * 2022-06-28 2022-09-16 镁佳(北京)科技有限公司 Audio file musical instrument content identification vector representation method and device
CN116226702A (en) * 2022-09-09 2023-06-06 武汉中数医疗科技有限公司 Thyroid sampling data identification method based on bioelectrical impedance
CN117473095A (en) * 2023-12-27 2024-01-30 合肥工业大学 Short text classification method and system based on theme enhancement word representation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160034788A1 (en) * 2014-07-30 2016-02-04 Adobe Systems Incorporated Learning image categorization using related attributes
CN106156781A (en) * 2016-07-12 2016-11-23 北京航空航天大学 Sequence convolutional neural networks construction method and image processing method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160034788A1 (en) * 2014-07-30 2016-02-04 Adobe Systems Incorporated Learning image categorization using related attributes
CN106156781A (en) * 2016-07-12 2016-11-23 北京航空航天大学 Sequence convolutional neural networks construction method and image processing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YOON KIM: "Convolutional Neural Networks for Sentence Classification", arXiv:1408.5882 (https://arxiv.org/abs/1408.5882) *
夏从零 (Xia Congling): "News text classification based on event convolution features" (基于事件卷积特征的新闻文本分类), 《计算机应用研究》 (Application Research of Computers) *

Cited By (80)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107798331A (en) * 2017-09-05 2018-03-13 赵彦明 From zoom image sequence characteristic extracting method and device
CN107798331B (en) * 2017-09-05 2021-11-26 赵彦明 Method and device for extracting characteristics of off-zoom image sequence
CN107368613B (en) * 2017-09-05 2020-02-28 中国科学院自动化研究所 Short text sentiment analysis method and device
CN107368613A (en) * 2017-09-05 2017-11-21 中国科学院自动化研究所 Short text sentiment analysis method and device
CN110019787A (en) * 2017-09-30 2019-07-16 北京国双科技有限公司 Neural network model generation method, text emotion analysis method and relevant apparatus
CN109843401B (en) * 2017-10-17 2020-11-24 腾讯科技(深圳)有限公司 AI object behavior model optimization method and device
CN109843401A (en) * 2017-10-17 2019-06-04 腾讯科技(深圳)有限公司 A kind of AI object behaviour model optimization method and device
WO2019080484A1 (en) * 2017-10-26 2019-05-02 北京深鉴智能科技有限公司 Method of pruning convolutional neural network based on feature map variation
CN107767378A (en) * 2017-11-13 2018-03-06 浙江中医药大学 The multi-modal Magnetic Resonance Image Segmentation methods of GBM based on deep neural network
CN107767378B (en) * 2017-11-13 2020-08-04 浙江中医药大学 GBM multi-mode magnetic resonance image segmentation method based on deep neural network
CN107886474A (en) * 2017-11-22 2018-04-06 北京达佳互联信息技术有限公司 Image processing method, device and server
CN108108351B (en) * 2017-12-05 2020-05-22 华南理工大学 Text emotion classification method based on deep learning combination model
CN108108351A (en) * 2017-12-05 2018-06-01 华南理工大学 A kind of text sentiment classification method based on deep learning built-up pattern
CN108108751B (en) * 2017-12-08 2021-11-12 浙江师范大学 Scene recognition method based on convolution multi-feature and deep random forest
CN108108751A (en) * 2017-12-08 2018-06-01 浙江师范大学 A kind of scene recognition method based on convolution multiple features and depth random forest
CN107957993B (en) * 2017-12-13 2020-09-25 北京邮电大学 English sentence similarity calculation method and device
CN107957993A (en) * 2017-12-13 2018-04-24 北京邮电大学 The computational methods and device of english sentence similarity
CN108231067A (en) * 2018-01-13 2018-06-29 福州大学 Sound scenery recognition methods based on convolutional neural networks and random forest classification
CN108122562A (en) * 2018-01-16 2018-06-05 四川大学 A kind of audio frequency classification method based on convolutional neural networks and random forest
CN108509508A (en) * 2018-02-11 2018-09-07 北京原点时空信息技术有限公司 Short message big data inquiry based on Java technology and analysis system and its method
CN108776805A (en) * 2018-05-03 2018-11-09 北斗导航位置服务(北京)有限公司 It is a kind of establish image classification model, characteristics of image classification method and device
CN108733801B (en) * 2018-05-17 2020-06-09 武汉大学 Digital-human-oriented mobile visual retrieval method
CN108733801A (en) * 2018-05-17 2018-11-02 武汉大学 A kind of moving-vision search method towards digital humanity
CN108875808A (en) * 2018-05-17 2018-11-23 延安职业技术学院 A kind of book classification method based on artificial intelligence
CN108829671B (en) * 2018-06-04 2021-08-20 北京百度网讯科技有限公司 Decision-making method and device based on survey data, storage medium and terminal equipment
CN108829671A (en) * 2018-06-04 2018-11-16 北京百度网讯科技有限公司 Method, apparatus, storage medium and the terminal device of decision based on survey data
CN108959924A (en) * 2018-06-12 2018-12-07 浙江工业大学 A kind of Android malicious code detecting method of word-based vector sum deep neural network
CN108920586A (en) * 2018-06-26 2018-11-30 北京工业大学 A kind of short text classification method based on depth nerve mapping support vector machines
CN109002532A (en) * 2018-07-17 2018-12-14 电子科技大学 Behavior trend mining analysis method and system based on student data
CN109214298B (en) * 2018-08-09 2021-06-08 盈盈(杭州)网络技术有限公司 Asian female color value scoring model method based on deep convolutional network
CN109214298A (en) * 2018-08-09 2019-01-15 盈盈(杭州)网络技术有限公司 A kind of Asia women face value Rating Model method based on depth convolutional network
CN109165294B (en) * 2018-08-21 2021-09-24 安徽讯飞智能科技有限公司 Short text classification method based on Bayesian classification
CN109165294A (en) * 2018-08-21 2019-01-08 安徽讯飞智能科技有限公司 Short text classification method based on Bayesian classification
CN109543084B (en) * 2018-11-09 2021-01-19 西安交通大学 Method for establishing detection model of hidden sensitive text facing network social media
CN109543084A (en) * 2018-11-09 2019-03-29 西安交通大学 A method of establishing the detection model of the hidden sensitive text of network-oriented social media
CN109493886A (en) * 2018-12-13 2019-03-19 西安电子科技大学 Speech-emotion recognition method based on feature selecting and optimization
CN111352926B (en) * 2018-12-20 2024-03-08 北京沃东天骏信息技术有限公司 Method, device, equipment and readable storage medium for data processing
CN111353512A (en) * 2018-12-20 2020-06-30 长沙智能驾驶研究院有限公司 Obstacle classification method, obstacle classification device, storage medium and computer equipment
CN111352926A (en) * 2018-12-20 2020-06-30 北京沃东天骏信息技术有限公司 Data processing method, device, equipment and readable storage medium
CN109670182B (en) * 2018-12-21 2023-03-24 合肥工业大学 Massive extremely short text classification method based on text hash vectorization representation
CN109670182A (en) * 2018-12-21 2019-04-23 合肥工业大学 A kind of extremely short file classification method of magnanimity indicated based on text Hash vectorization
CN109739986A (en) * 2018-12-28 2019-05-10 合肥工业大学 A kind of complaint short text classification method based on Deep integrating study
CN110020431A (en) * 2019-03-06 2019-07-16 平安科技(深圳)有限公司 Feature extracting method, device, computer equipment and the storage medium of text information
CN111753081B (en) * 2019-03-28 2023-06-09 百度(美国)有限责任公司 System and method for text classification based on deep SKIP-GRAM network
CN111753081A (en) * 2019-03-28 2020-10-09 百度(美国)有限责任公司 Text classification system and method based on deep SKIP-GRAM network
CN110069634A (en) * 2019-04-24 2019-07-30 北京泰迪熊移动科技有限公司 A method, apparatus and computer-readable storage medium for generating a classification model
CN110134786A (en) * 2019-05-14 2019-08-16 南京大学 A short text classification method based on topic word vectors and convolutional neural networks
CN110222173A (en) * 2019-05-16 2019-09-10 吉林大学 Neural-network-based short text sentiment classification method and device
CN110222173B (en) * 2019-05-16 2022-11-04 吉林大学 Short text emotion classification method and device based on neural network
WO2020233344A1 (en) * 2019-05-21 2020-11-26 深圳壹账通智能科技有限公司 Searching method and apparatus, and storage medium
CN110309304A (en) * 2019-06-04 2019-10-08 平安科技(深圳)有限公司 A text classification method, device, equipment and storage medium
CN110263344B (en) * 2019-06-25 2022-04-19 创优数字科技(广东)有限公司 Text emotion analysis method, device and equipment based on hybrid model
CN110263344A (en) * 2019-06-25 2019-09-20 名创优品(横琴)企业管理有限公司 A text emotion analysis method, device and equipment based on a hybrid model
CN110781333A (en) * 2019-06-26 2020-02-11 杭州鲁尔物联科技有限公司 Method for processing unstructured monitoring data of cable-stayed bridge based on machine learning
CN110377915B (en) * 2019-07-25 2022-11-29 腾讯科技(深圳)有限公司 Text emotion analysis method and device, storage medium and equipment
CN110377915A (en) * 2019-07-25 2019-10-25 腾讯科技(深圳)有限公司 Text emotion analysis method, apparatus, storage medium and equipment
WO2021082861A1 (en) * 2019-10-31 2021-05-06 平安科技(深圳)有限公司 Scoring method and apparatus, electronic device, and storage medium
CN111401063A (en) * 2020-06-03 2020-07-10 腾讯科技(深圳)有限公司 Text processing method and device based on multi-pool network and related equipment
CN111401063B (en) * 2020-06-03 2020-09-11 腾讯科技(深圳)有限公司 Text processing method and device based on multi-pool network and related equipment
CN111813939A (en) * 2020-07-13 2020-10-23 南京睿晖数据技术有限公司 Text classification method based on representation enhancement and fusion
CN111897921A (en) * 2020-08-04 2020-11-06 广西财经学院 Text retrieval method based on fused expansion of word vector learning and pattern mining
CN112182219A (en) * 2020-10-09 2021-01-05 杭州电子科技大学 Online service anomaly detection method based on log semantic analysis
CN112487811B (en) * 2020-10-21 2021-07-06 上海旻浦科技有限公司 Cascading information extraction system and method based on reinforcement learning
CN112487811A (en) * 2020-10-21 2021-03-12 上海旻浦科技有限公司 Cascading information extraction system and method based on reinforcement learning
CN112347247B (en) * 2020-10-29 2023-10-13 南京大学 Specific-category text title classification method based on LDA and BERT
CN112347247A (en) * 2020-10-29 2021-02-09 南京大学 A specific-category text title binary classification method based on LDA and BERT
CN112329877A (en) * 2020-11-16 2021-02-05 山西三友和智慧信息技术股份有限公司 Voting mechanism-based web service classification method and system
CN113342970A (en) * 2020-11-24 2021-09-03 中电万维信息技术有限责任公司 Multi-label complex text classification method
CN113342970B (en) * 2020-11-24 2023-01-03 中电万维信息技术有限责任公司 Multi-label complex text classification method
CN112418354B (en) * 2020-12-15 2022-07-15 江苏满运物流信息有限公司 Goods source information classification method and device, electronic equipment and storage medium
CN112418354A (en) * 2020-12-15 2021-02-26 江苏满运物流信息有限公司 Goods source information classification method and device, electronic equipment and storage medium
CN114154561A (en) * 2021-11-15 2022-03-08 国家电网有限公司 Electric power data management method based on natural language processing and random forest
CN114154561B (en) * 2021-11-15 2024-02-27 国家电网有限公司 Electric power data management method based on natural language processing and random forest
CN114511330B (en) * 2022-04-18 2022-12-13 山东省计算中心(国家超级计算济南中心) Ethereum Ponzi scheme detection method and system based on improved CNN-RF
CN114511330A (en) * 2022-04-18 2022-05-17 山东省计算中心(国家超级计算济南中心) An improved-CNN-RF-based Ethereum Ponzi scheme detection method and system
CN115064184A (en) * 2022-06-28 2022-09-16 镁佳(北京)科技有限公司 Method and device for vector representation of musical instrument content identification in audio files
CN116226702A (en) * 2022-09-09 2023-06-06 武汉中数医疗科技有限公司 Thyroid sampling data identification method based on bioelectrical impedance
CN116226702B (en) * 2022-09-09 2024-04-26 武汉中数医疗科技有限公司 Thyroid sampling data identification method based on bioelectrical impedance
CN117473095A (en) * 2023-12-27 2024-01-30 合肥工业大学 Short text classification method and system based on topic-enhanced word representation
CN117473095B (en) * 2023-12-27 2024-03-29 合肥工业大学 Short text classification method and system based on topic-enhanced word representation

Also Published As

Publication number Publication date
CN107066553B (en) 2021-01-01

Similar Documents

Publication Publication Date Title
CN107066553A (en) A short text classification method based on convolutional neural networks and random forest
CN108595632B (en) Hybrid neural network text classification method fusing abstract and main body characteristics
CN107193801B (en) Short text feature optimization and emotion analysis method based on deep belief network
CN106294593B (en) Relation extraction method combining clause-level distant supervision and semi-supervised ensemble learning
CN104765769B (en) A word-vector-based short text query expansion and retrieval method
CN100583101C (en) Text categorization feature selection and weight computation method based on domain knowledge
CN110866117A (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN103678670B (en) Micro-blog hot word and hot topic mining system and method
CN105528437B (en) A question answering system construction method based on structured text knowledge extraction
CN110298032A (en) Text classification corpus labeling training system
CN105389379A (en) Spam article classification method based on distributed feature representation of text
CN106202372A (en) A method for sentiment classification of online text information
CN103970729A (en) Multi-topic extraction method based on semantic categories
CN110175221B (en) Spam short message identification method combining word vectors with machine learning
CN107122349A (en) A text feature word extraction method based on word2vec-LDA models
CN103970730A (en) Method for extracting multiple subject terms from single Chinese text
KR20190063978A (en) Automatic classification method of unstructured data
CN104008090A (en) Multi-subject extraction method based on concept vector model
CN101751455A (en) Method for automatically generating titles using artificial intelligence technology
Mehmood et al. A precisely xtreme-multi channel hybrid approach for Roman Urdu sentiment analysis
CN104484380A (en) Personalized search method and personalized search device
CN110222172A (en) A multi-source network public opinion topic discovery method based on improved hierarchical clustering
CN108920586A (en) A short text classification method based on deep neural mapping support vector machines
Sadr et al. Unified topic-based semantic models: A study in computing the semantic relatedness of geographic terms
Halevy et al. Discovering structure in the universe of attribute names

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant