CN106815369B - A text classification method based on the XGBoost classification algorithm - Google Patents
A text classification method based on the XGBoost classification algorithm Download PDF Info
- Publication number
- CN106815369B (application CN201710060026.9A)
- Authority
- CN
- China
- Prior art keywords
- word
- text
- sample
- xgboost
- sorting algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The method provided by this invention extracts labeled words via Labeled-LDA to compute feature values, then performs text classification with the XGBoost classification algorithm. Compared with the common approach of building the feature space from a vector space model and classifying with an ordinary algorithm, it consumes far less memory. This is because Chinese text contains millions of distinct words, so a word-level feature space is extremely high-dimensional, consumes enormous memory, and may not even be processable on a single machine, whereas there are no more than about 10,000 Chinese characters in common use, often only about 1,000 in frequent use, so the dimensionality drops dramatically, and XGBoost accepts dictionary rather than matrix input. The invention also proposes Labeled-LDA as a novel supervised feature selection algorithm with latent semantics: feature selection with Labeled-LDA exploits both the semantic information that LDA mines from a large corpus and the class information carried by the text labels. Preprocessing is simple and no hand-crafted feature extraction is needed; combined with the powerful distributed ensemble learning algorithm XGBoost, this improves both classification accuracy and performance.
Description
Technical field
The present invention relates to the field of text classification, and in particular to a text classification method based on the XGBoost classification algorithm.
Background technique
Text classification methods are widely used in search engines, personalized recommendation systems, public opinion monitoring, and other fields, and are a key component of efficiently managing and accurately locating massive amounts of information.
The common framework for text classification is based on machine learning: data preprocessing, followed by feature extraction, feature selection, and classification.
Feature extraction represents text with a unified method or model that captures the text's features and can easily be converted into mathematical language, and hence into a mathematical model that a computer can process. The currently popular text representations are the vector space model and latent semantic analysis models. The vector space model has the advantages of simplicity, convenient computation, and little preprocessing, but it ignores semantic relations between features and represents text only shallowly.
Feature selection is one of the key problems in machine learning; the quality of its result directly affects the classifier's accuracy and generalization. Feature selection picks the most effective features out of a feature set in order to reduce the dimensionality of the feature space, remove irrelevant or redundant features, shorten running time, improve the interpretability of the analysis results, and uncover structure hidden in high-dimensional data. Depending on whether class information is available, feature selection divides into supervised and unsupervised methods. Common methods in text classification include document frequency, mutual information, information gain, and the chi-square statistic (CHI).
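As an illustration of the chi-square statistic (CHI) mentioned above, the following sketch computes the standard CHI score between one term and one class from a 2x2 contingency table of document counts. The counts here are made up for illustration; the function name is ours, not the patent's.

```python
def chi_square(a, b, c, d):
    """CHI statistic for a term-class contingency table.

    a: docs in the class containing the term
    b: docs outside the class containing the term
    c: docs in the class without the term
    d: docs outside the class without the term
    """
    n = a + b + c + d
    numerator = n * (a * d - c * b) ** 2
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    return numerator / denominator

# Hypothetical counts for one term and one class; a higher score
# means the term's occurrence depends more strongly on the class.
score = chi_square(10, 20, 30, 40)
print(round(score, 4))
```

A term that appears only inside the class (or only outside it) scores highest, which is why CHI is used to rank candidate features.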
Classification is the final and most important step of text classification. The naive Bayes algorithm is a typical classifier: it computes the probability that a text belongs to a particular class from Bayes' formula, estimates very few parameters, is not very sensitive to missing data, is simple and fast, and in theory attains the smallest error rate among classification algorithms. In practice this is not always the case, because its assumption that attributes are mutually independent rarely holds in real applications. Decision tree algorithms are easy to understand and interpret, handle both numerical and categorical attributes, and can produce feasible, well-performing results on large data sources in relatively little time, but they have difficulty handling missing data and tend to ignore correlations between attributes in the dataset. Other classification methods include LR, KNN, and SVM.
These are all base learners; ensemble learning combines multiple learners and often achieves markedly better generalization than any single learner. By how the individual learners are generated, ensemble methods currently fall into two classes: sequential methods, in which the individual learners depend strongly on one another and must be generated serially, represented by Boosting; and parallel methods, in which the individual learners have no strong dependence and can be generated simultaneously, represented by Bagging and random forests. From the viewpoint of the bias-variance decomposition, Boosting mainly reduces bias, so it can build a very strong ensemble from learners with quite weak generalization; Bagging mainly reduces variance, so it is most effective on learners that are sensitive to sample perturbation, such as unpruned decision trees and neural networks. The XGBoost classification algorithm is an ensemble method based on Boosting; compared with the traditional ensemble algorithm GBDT, the XGBoost classification algorithm has the following eight major advantages:
One: traditional GBDT uses CART as the base classifier, while the XGBoost classification algorithm also supports linear classifiers.
Two: traditional GBDT uses only first-order derivative information during optimization, whereas the XGBoost classification algorithm performs a second-order Taylor expansion of the cost function, using both first and second derivatives.
Three: the XGBoost classification algorithm adds a regularization term to the cost function to control model complexity. The regularizer contains the number of leaf nodes of the tree and the squared L2 norm of the scores output at each leaf node. From the bias-variance trade-off perspective, the regularizer reduces the model's variance, makes the learned model simpler, and prevents overfitting.
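For reference, the second-order expansion and the regularizer described in points two and three can be written in the notation of the XGBoost paper (a sketch; the patent text itself gives no formulas):

```latex
\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n}\left[\, g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}
```

where $g_i$ and $h_i$ are the first and second derivatives of the loss $l(y_i, \hat{y}_i^{(t-1)})$ with respect to the previous iteration's prediction, $T$ is the number of leaves of the new tree $f_t$, and $w$ is the vector of leaf scores.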
Four: shrinkage, equivalent to a learning rate. After each iteration, the XGBoost classification algorithm multiplies the leaf node weights by this coefficient, chiefly to weaken the influence of each tree and leave more room for the later trees to learn.
Five: column subsampling. The XGBoost classification algorithm borrows this technique from random forests; supporting column subsampling not only reduces overfitting but also reduces computation.
Six: handling of missing values. For samples with missing feature values, the XGBoost classification algorithm automatically learns the split direction.
Seven: parallelism. Boosting is a serial procedure, and the XGBoost classification algorithm's parallelism is not across trees: one iteration can only begin after the previous one finishes (the cost function of iteration t contains the predictions of the first t-1 iterations). Instead, the XGBoost classification algorithm is parallel at the feature granularity. The most time-consuming step in learning a decision tree is sorting the feature values (in order to determine the best split point). The XGBoost classification algorithm sorts the data once before training and stores it in a block structure that is reused across later iterations, which greatly reduces computation. This block structure also makes parallelism possible: when splitting a node, the gain of every feature must be computed and the feature with the maximum gain chosen for the split, and these per-feature gain computations can run in multiple threads.
Eight: a parallelizable approximate histogram algorithm. When a tree node is split, the gain of every candidate split point of every feature must be computed, i.e., all possible split points are enumerated greedily. When the data cannot be loaded into memory at once, or in distributed settings, this greedy algorithm becomes very inefficient, so the XGBoost classification algorithm also proposes a parallelizable approximate histogram algorithm for efficiently generating candidate split points.
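The difference between exact greedy enumeration and the approximate scheme in point eight can be sketched in plain Python: instead of testing every distinct feature value as a split point, candidates are drawn from quantiles of the feature. This is a simplified sketch; XGBoost's actual algorithm uses a weighted quantile sketch.

```python
def exact_candidates(values):
    """Greedy enumeration: every distinct feature value is a candidate split."""
    return sorted(set(values))

def quantile_candidates(values, num_bins):
    """Approximate scheme: only (num_bins - 1) quantile cut points are tried."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = [ordered[k * n // num_bins] for k in range(1, num_bins)]
    return sorted(set(cuts))

feature = [0.1 * i for i in range(1000)]      # 1000 distinct values
print(len(exact_candidates(feature)))          # 1000 candidates to evaluate
print(len(quantile_candidates(feature, 16)))   # only 15 candidates to evaluate
```

Evaluating 15 candidates instead of 1000 per feature is what makes the approximate algorithm practical for out-of-core and distributed training.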
Summary of the invention
To overcome the defects of the prior art described above, namely low classification performance, large memory consumption, and low classification accuracy, the present invention provides a text classification method based on the XGBoost classification algorithm.
To achieve the above objective of the invention, the technical solution adopted is as follows:
A text classification method based on the XGBoost classification algorithm, comprising the following steps:
S1. Obtain multiple samples, each comprising text content and a label for the text.
S2. Divide the samples obtained in step S1 into training samples and prediction samples in a certain proportion; the training samples form the training set and the prediction samples form the prediction set.
S3. For each training sample, separate every two adjacent characters in its text content with a space, then use the training sample's label as the Labeled-LDA label input and the training sample's text content as the Labeled-LDA text input.
S4. Set the number of Labeled-LDA iterations to K, then train iteratively on the training samples.
S5. After the iterations, each training sample yields two documents: one maps each word to its corresponding word code, and the other maps topics to words, i.e., the number of times each word code occurs under each topic. Merging the two documents gives the number of times each word in the training samples occurs under each topic. For each topic, sort the words by occurrence count and take the m words most related to that topic as the training samples' LLDA words.
S6. For each training sample, count the occurrences in its text content of each LLDA word obtained in step S5 and use that count as the value of the LLDA word; the resulting per-sample values for each LLDA word are input to the XGBoost classification algorithm, which is then trained.
S7. The model is now trained and the prediction set must be predicted: apply steps S3-S5 to the prediction set, then use the trained model to predict the class of each prediction sample in the prediction set.
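Steps S3-S6 can be sketched end to end in plain Python. Labeled-LDA training is replaced here by a simple per-label character-frequency count (an assumption for illustration only; the patent uses the actual Labeled-LDA topic-word counts), and the per-sample dictionaries produced at the end are the values that would be fed to XGBoost.

```python
from collections import Counter

# S1/S2: tiny hypothetical training set of (text, label) pairs;
# characters are already space-separated as required by S3.
train = [("团 购 大 优 惠", "ad"),
         ("新 闻 评 论 正 常", "normal"),
         ("超 低 价 抢 购", "ad")]

# S3/S4 stand-in: group character counts by label.
by_label = {}
for text, label in train:
    by_label.setdefault(label, Counter()).update(text.split())

# S5 stand-in: the m most frequent characters per label become LLDA words.
m = 3
llda_words = set()
for counts in by_label.values():
    llda_words.update(w for w, _ in counts.most_common(m))

# S6: per-sample dictionary of LLDA-word counts (zero counts omitted).
features = [{w: c for w, c in Counter(text.split()).items() if w in llda_words}
            for text, _ in train]
print(features[0])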
Preferably, in step S6, if an LLDA word occurs 0 times in the text, that LLDA word is not input to the XGBoost classification algorithm; that is, whereas ordinary classification algorithms take matrix input, this method takes dictionary input, which saves memory and handles missing values more conveniently, quickly, and sensibly.
Preferably, K is 1000.
Compared with the prior art, the beneficial effects of the present invention are:
The method provided by the invention extracts labeled words via Labeled-LDA to compute feature values, then performs text classification with the XGBoost classification algorithm. Compared with the common approach of building the feature space from a vector space model and classifying with an ordinary algorithm, it consumes far less memory. This is because Chinese text contains millions of distinct words, so a word-level feature space is extremely high-dimensional, consumes enormous memory, and may not even be processable on a single machine, whereas there are no more than about 10,000 Chinese characters in common use, often only about 1,000 in frequent use, so the dimensionality drops dramatically, and XGBoost accepts dictionary rather than matrix input. The invention also proposes Labeled-LDA as a novel supervised feature selection algorithm with latent semantics: feature selection with Labeled-LDA exploits both the semantic information that LDA mines from a large corpus and the class information carried by the text labels. Preprocessing is simple, no hand-crafted feature extraction is needed, and the powerful distributed ensemble learning algorithm XGBoost improves both classification accuracy and performance.
Detailed description of the invention
Fig. 1 is a flowchart of the method.
Specific embodiment
The attached figures are for illustrative purposes only and should not be construed as limiting the patent.
The present invention is further elaborated below in conjunction with the drawings and embodiments.
Embodiment 1
This embodiment comprises 3 concrete cases, classifying 3 text corpora with different characteristics: a public English corpus, WebKB, with samples lacking any content removed, and two Chinese corpora, one a public long-text corpus (the Fudan University text classification corpus) and the other a highly imbalanced Chinese short-text corpus of news comments divided into the two classes normal and advertisement, with a positive-to-negative ratio on the order of 2742/42416 = 0.065.
Table 1: Summary of the text classification datasets
As shown in Figure 1, the specific implementation steps of the present text classification method, which uses Labeled-LDA for feature extraction and the XGBoost classification algorithm, are as follows:
Step 1: Text Pretreatment
Prepare a batch of labeled text sets in advance, such as the 3 cases, and randomly divide training and prediction sets in an 8:2 ratio (news comments); if a public dataset already provides a split (WebKB and Fudan University text classification), use it directly. Denoise all texts and unify the encoding to UTF-8. For Chinese text, separate every character with a space to simplify subsequent processing. Since the present invention is based on character features, and commonly used characters number no more than about 10,000, and Labeled-LDA has strong feature selection ability, the preprocessing steps usual in Chinese text classification, such as removing punctuation, digits, and stop words, can be omitted; the experimental results also show that such preprocessing is unnecessary. For English text, preprocessing converts all capitals to lowercase and replaces punctuation with spaces, which better delimits word boundaries and matches English writing conventions.
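The preprocessing just described, spacing out Chinese characters, lowercasing English, and replacing punctuation with spaces, might look like the following sketch (the function names are ours, not the patent's):

```python
import re

def preprocess_chinese(text: str) -> str:
    """Separate every character with a space so each character is a feature."""
    return " ".join(ch for ch in text if not ch.isspace())

def preprocess_english(text: str) -> str:
    """Lowercase everything and replace punctuation with spaces."""
    return re.sub(r"[^\w\s]", " ", text.lower())

print(preprocess_chinese("文本分类"))
print(preprocess_english("Hello, World!"))
```

After this step both languages are uniform streams of space-separated tokens, which is all that Labeled-LDA and the later counting steps require.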
Step 2: Feature selection on the training set with Labeled-LDA
The experiments of the present invention use the Stanford natural language processing tool TMT. First set the various Labeled-LDA parameters, including the text segmentation mode, the text filtering mode, the number of labels, the number of iterations, and so on. Segmentation is by spaces, the number of labels depends on the specific corpus, and the number of iterations is uniformly 1000. After training, two files are obtained: a word-to-code file and a file of the occurrence counts of each word code under each topic. Merging the two files gives the occurrence counts of each word under each topic. Then, under each topic, sort the words by occurrence count and take the top N most frequent. The resulting words are those left after Labeled-LDA feature selection; there may be duplicates across topics, and after removing duplicates these words are the LLDA words. They carry rich semantic information, and because they were obtained by training under the strong supervision signal of the labels, they also have strong feature representation ability.
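Merging the two Labeled-LDA output files and selecting the top-N characters per topic, as described above, can be sketched as follows. File contents are replaced by in-memory dictionaries, and the file formats are assumptions for illustration.

```python
# File 1 (assumed format): word -> integer code.
word_to_code = {"价": 0, "购": 1, "闻": 2, "评": 3}

# File 2 (assumed format): (topic, word code) -> occurrence count.
topic_code_counts = {("ad", 0): 9, ("ad", 1): 7, ("ad", 2): 1,
                     ("normal", 2): 8, ("normal", 3): 6, ("normal", 1): 2}

# Merge: occurrence count of each *word* under each topic.
code_to_word = {c: w for w, c in word_to_code.items()}
topic_word_counts = {}
for (topic, code), count in topic_code_counts.items():
    topic_word_counts.setdefault(topic, {})[code_to_word[code]] = count

# Per topic, sort by count, keep the top N, and deduplicate across topics.
N = 2
llda_words = set()
for topic, counts in topic_word_counts.items():
    llda_words.update(sorted(counts, key=counts.get, reverse=True)[:N])
print(sorted(llda_words))
```

The deduplicated union over all topics is the final LLDA vocabulary used as the feature set.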
Step 3: Converting the training and prediction sets into XGBoost's input format
XGBoost's input format is simple, convenient, and sensible. For each sample of the training and prediction sets, because words are treated as features, whenever processing reaches a word that is an LLDA word, the feature value of the corresponding LLDA feature is incremented by 1. For each sample, an LLDA word that does not occur has a feature value of 0 for its corresponding feature, and that feature is simply not included in the XGBoost input.
Step 4: Parameter setting, training, and prediction for the XGBoost classification algorithm
Set the parameters of the XGBoost classification algorithm and choose the parameters with the best overall effect. The final parameter settings are: 200 iterations; the number of classes (depending on the dataset); classifier type gbtree; objective function multi:softmax; learning rate eta = 0.3; maximum tree depth max_depth = 6; and so on. The trained model then classifies the prediction set.
Step 5: Assessing model performance
Since the datasets differ greatly, different evaluation criteria are used for each dataset so that the evaluation is more reasonable. WebKB has 4 classes and is assessed with the micro-averaged F1 value (micro-F1) and the macro-averaged F1 value (macro-F1).
They are defined as follows:
P_i is the precision of the i-th class, R_i is the recall of the i-th class, and n is the number of classes. Fudan University text classification is assessed with micro-averaged precision (micro-P); news comments, whose classes are extremely imbalanced, are assessed with the precision P, recall R, and F value of the negative class.
The model defined by the present invention is denoted LLDA-XG; the experimental results are as follows:
Table 2: Classification performance on the WebKB corpus
| | RCC | Centroid | NB | Winnow | SVM | LLDA-XG |
|---|---|---|---|---|---|---|
| micro_F1 | 89.64 | 79.95 | 86.45 | 81.40 | 89.26 | 91.40 |
| macro_F1 | 88.02 | 78.47 | 86.01 | 77.85 | 87.66 | 90.48 |
Table 3: Classification performance on the news comment corpus
| | Bayes | SGDC | Bagging+KNN | SVM | RF | GBDT | Adaboost | LLDA-XG |
|---|---|---|---|---|---|---|---|---|
| P | 69.55 | 77.71 | 97.00 | 93.65 | 96.4 | 92.22 | 95.80 | 94.52 |
| R | 91.62 | 91.98 | 85.20 | 86.81 | 89.84 | 88.77 | 85.56 | 92.34 |
| F value | 79.08 | 84.24 | 90.70 | 90.10 | 92.99 | 90.46 | 90.40 | 93.42 |
Table 4: Classification performance on the Fudan University text classification corpus
| Model | Accuracy |
|---|---|
| Bag of words + logistic regression | 92.08 |
| Bigrams + logistic regression | 92.97 |
| Bag of words + SVM | 93.02 |
| Bigrams + SVM | 93.03 |
| Averaged word vectors + logistic regression | 86.89 |
| Labeled-LDA | 90.80 |
| Convolutional neural network | 94.04 |
| Recurrent convolutional network | 95.20 |
| LLDA-XG | 95.41 |
From Tables 2, 3, and 4 it can be seen that the LLDA-XG model achieves good results on the WebKB corpus, the news comment corpus, and the Fudan University text classification corpus. On the WebKB corpus, both micro_F1 and macro_F1 exceed 90. On the news comment corpus, LLDA-XG's precision is not the highest, but its recall and F value both are, showing that LLDA-XG maintains both very high precision and very high recall and favors balanced performance. On the Fudan University text classification corpus, the baseline results are taken from the doctoral thesis of Lai Siwei of the University of the Chinese Academy of Sciences, which studies combining neural networks with word vectors; LLDA-XG performs outstandingly there, with classification accuracy higher than the recurrent convolutional neural network, far less preprocessing, and much faster running time: an ordinary single machine obtains results in four or five minutes, whereas the neural networks require a large amount of computation time.
Table 5: Classification performance and running time of LLDA-XG on the WebKB corpus
| Number of features | 215 | 411 | 925 | 7288 |
|---|---|---|---|---|
| micro-F1 | 90.24 | 90.82 | 91.40 | 91.61 |
| macro-F1 | 89.19 | 89.98 | 90.48 | 90.69 |
| Time (s) | 3.740 | 4.560 | 5.759 | 8.984 |
Table 6: Classification performance and running time of LLDA-XG on the news comment corpus
| Number of features | 154 | 289 | 674 | 4154 |
|---|---|---|---|---|
| P | 94.39 | 93.91 | 94.53 | 94.13 |
| R | 90.02 | 90.73 | 92.34 | 91.44 |
| F | 92.15 | 92.29 | 93.42 | 92.77 |
| Time (s) | 6.739 | 7.716 | 8.756 | 10.589 |
Table 7: Classification performance and running time of LLDA-XG on the Fudan University text classification corpus
| Number of features | 408 | 665 | 1329 | 6707 |
|---|---|---|---|---|
| Accuracy | 94.59 | 95.20 | 95.41 | 95.39 |
| Time (s) | 145.701 | 206.362 | 278.524 | 342.354 |
In Table 5, when the number of features increases from 215 to 411, micro-F1 rises by 0.58 and macro-F1 by 0.79; from 411 to 925, micro-F1 rises by 0.58 and macro-F1 by 0.5; from 925 to 7288, roughly a 7-fold increase, micro-F1 and macro-F1 rise by only 0.21, a negligible gain, while the running time nearly doubles. In Table 6, when the number of features increases from 154 to 289, precision P actually drops by 0.48 while recall R rises by 0.71 and the F value by 0.14; from 289 to 674, precision rises by 0.62, recall by 1.61, and the F value by 1.13; but from 674 to 4154, precision drops by 0.4, recall by 0.9, and the F value by 0.65. On the Fudan University text classification task, accuracy rises by 0.61 when the number of features increases from 408 to 665, rises by 0.21 from 665 to 1329, and drops by 0.02 from 1329 to 6707.
From Tables 5, 6, and 7 it can be seen that the more features the LLDA-XG model uses, the longer it takes, while classification performance grows slowly and, at high feature counts, even declines, possibly because overtraining causes overfitting. This illustrates the importance of feature selection: the features that most affect classification performance are often a minority, the feature set contains many redundant or even noisy features, and more features mean more computation time with very limited performance gain, or even overfitting. It also shows that Labeled-LDA has very strong feature selection ability: whether for Chinese or English, long text or short text, it selects very good features from a large feature set, reducing overall running time while keeping performance stable and efficient. Moreover, the feature extraction of the invention is character-based, so extraction is very convenient; no manual work or expert knowledge is needed to spend vast resources extracting features, and the preprocessing is simple with little running time. Combined with the fast and efficient XGBoost classification algorithm, which handles missing values very well, overall performance is superior and stable. In short, experiments verify that the proposed text classification model LLDA-XG can be widely applied to text classification tasks; its preprocessing is simple and fast, its overall running time and performance are both outstanding, and it has high robustness and practical value.
Obviously, the above embodiments are merely examples given to clearly illustrate the present invention and are not a limitation on its embodiments. Those of ordinary skill in the art may make other variations or changes on the basis of the above description; it is neither necessary nor possible to exhaust all embodiments. Any modifications, equivalent replacements, and improvements made within the spirit and principle of the invention shall fall within the protection scope of the claims of the present invention.
Claims (3)
1. A text classification method based on the XGBoost classification algorithm, characterized by comprising the following steps:
S1. Obtain multiple samples, each comprising text content and a label for the text.
S2. Divide the samples obtained in step S1 into training samples and prediction samples in a certain proportion; the training samples form the training set and the prediction samples form the prediction set.
S3. For each training sample, separate every two adjacent characters in its text content with a space, then use the training sample's label as the Labeled-LDA label input and the training sample's text content as the Labeled-LDA text input.
S4. Set the number of Labeled-LDA iterations to K, then train iteratively on the training samples.
S5. After the iterations, each training sample yields two documents: one maps each word to its corresponding word code, and the other maps topics to words, i.e., the number of times each word code occurs under each topic. Merging the two documents gives the number of times each word in the training samples occurs under each topic. For each topic, sort the words by occurrence count and take the m words most related to that topic as the training samples' LLDA words.
S6. For each training sample, count the occurrences in its text content of each LLDA word obtained in step S5 and use that count as the value of the LLDA word; the resulting per-sample values for each LLDA word are input to the XGBoost classification algorithm, which is then trained.
S7. The model is now trained and the prediction set must be predicted: apply steps S3-S5 to the prediction set, then use the trained model to predict the class of each prediction sample in the prediction set.
2. The text classification method based on the XGBoost classification algorithm according to claim 1, characterized in that: in step S6, if an LLDA word occurs 0 times in the text, the LLDA word is not input to the XGBoost classification algorithm.
3. The text classification method based on the XGBoost classification algorithm according to claim 1, characterized in that: K is 1000.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710060026.9A CN106815369B (en) | 2017-01-24 | 2017-01-24 | A kind of file classification method based on Xgboost sorting algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710060026.9A CN106815369B (en) | 2017-01-24 | 2017-01-24 | A kind of file classification method based on Xgboost sorting algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106815369A CN106815369A (en) | 2017-06-09 |
CN106815369B true CN106815369B (en) | 2019-09-20 |
Family
ID=59112460
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710060026.9A Active CN106815369B (en) | 2017-01-24 | 2017-01-24 | A kind of file classification method based on Xgboost sorting algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106815369B (en) |
Families Citing this family (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107463965B (en) * | 2017-08-16 | 2024-03-26 | 湖州易有科技有限公司 | Deep learning-based fabric attribute picture acquisition and recognition method and recognition system |
CN107798083A (en) * | 2017-10-17 | 2018-03-13 | 广东广业开元科技有限公司 | A kind of information based on big data recommends method, system and device |
CN107992531B (en) * | 2017-11-21 | 2020-11-27 | 吉浦斯信息咨询(深圳)有限公司 | News personalized intelligent recommendation method and system based on deep learning |
CN108090201A (en) * | 2017-12-20 | 2018-05-29 | 珠海市君天电子科技有限公司 | A kind of method, apparatus and electronic equipment of article content classification |
CN108257052B (en) * | 2018-01-16 | 2022-04-22 | 中南大学 | Online student knowledge assessment method and system |
CN108536650B (en) * | 2018-04-03 | 2022-04-26 | 北京京东尚科信息技术有限公司 | Method and device for generating gradient lifting tree model |
CN108627798B (en) * | 2018-04-04 | 2022-03-11 | 北京工业大学 | WLAN indoor positioning algorithm based on linear discriminant analysis and gradient lifting tree |
CN108563638B (en) * | 2018-04-13 | 2021-08-10 | 武汉大学 | Microblog emotion analysis method based on topic identification and integrated learning |
CN108868805B (en) * | 2018-06-08 | 2019-11-15 | 西安电子科技大学 | Shield method for correcting error based on statistical analysis in conjunction with XGboost |
CN108985335B (en) * | 2018-06-19 | 2021-04-27 | 中国原子能科学研究院 | Integrated learning prediction method for irradiation swelling of nuclear reactor cladding material |
CN109189937B (en) | 2018-08-22 | 2021-02-09 | 创新先进技术有限公司 | Feature relationship recommendation method and device, computing device and storage medium |
WO2020082734A1 (en) * | 2018-10-24 | 2020-04-30 | 平安科技(深圳)有限公司 | Text emotion recognition method and apparatus, electronic device, and computer non-volatile readable storage medium |
CN109543187B (en) * | 2018-11-23 | 2021-09-17 | 中山大学 | Method and device for generating electronic medical record characteristics and storage medium |
CN109754002A (en) * | 2018-12-24 | 2019-05-14 | 上海大学 | A kind of steganalysis hybrid integrated method based on deep learning |
CN109933667A (en) * | 2019-03-19 | 2019-06-25 | 中国联合网络通信集团有限公司 | Textual classification model training method, file classification method and equipment |
CN110334732A (en) * | 2019-05-20 | 2019-10-15 | 北京思路创新科技有限公司 | A kind of Urban Air Pollution Methods and device based on machine learning |
CN110318327A (en) * | 2019-06-10 | 2019-10-11 | 长安大学 | A kind of surface evenness prediction technique based on random forest |
CN111079031B (en) * | 2019-12-27 | 2023-09-12 | 北京工业大学 | Weight classification method for importance of blog with respect to disaster information based on deep learning and XGBoost algorithm |
CN111753058B (en) * | 2020-06-30 | 2023-06-02 | 北京信息科技大学 | Text viewpoint mining method and system |
CN111831824B (en) * | 2020-07-16 | 2024-02-09 | 民生科技有限责任公司 | Public opinion positive and negative surface classification method |
CN111859074B (en) * | 2020-07-29 | 2023-12-29 | 东北大学 | Network public opinion information source influence evaluation method and system based on deep learning |
CN112699944B (en) * | 2020-12-31 | 2024-04-23 | 中国银联股份有限公司 | Training method, processing method, device, equipment and medium for returning list processing model |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104063472A (en) * | 2014-06-30 | 2014-09-24 | 电子科技大学 | KNN text classifying method for optimizing training sample set |
CN104331498A (en) * | 2014-11-19 | 2015-02-04 | 亚信科技(南京)有限公司 | Method for automatically classifying webpage content visited by Internet users |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060142993A1 (en) * | 2004-12-28 | 2006-06-29 | Sony Corporation | System and method for utilizing distance measures to perform text classification |
Application Events
Date | Event |
---|---|
2017-01-24 | Application CN201710060026.9A filed in China (CN); granted as patent CN106815369B; status: Active |
Non-Patent Citations (1)
Title |
---|
A New Text Classification Algorithm Based on the Labeled-LDA Model; Li Wenbo; Chinese Journal of Computers; 2008-04-30; full text *
Also Published As
Publication number | Publication date |
---|---|
CN106815369A (en) | 2017-06-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106815369B (en) | A kind of file classification method based on Xgboost sorting algorithm | |
CN110059181B (en) | Short text label method, system and device for large-scale classification system | |
Huang et al. | Naive Bayes classification algorithm based on small sample set | |
CN109740154A (en) | A kind of online comment fine granularity sentiment analysis method based on multi-task learning | |
CN107301171A (en) | A kind of text emotion analysis method and system learnt based on sentiment dictionary | |
Liu et al. | Deep learning approaches for link prediction in social network services | |
CN108197144B (en) | Hot topic discovery method based on BTM and Single-pass | |
Sundus et al. | A deep learning approach for arabic text classification | |
CN101587493A (en) | Text classification method | |
CN104834940A (en) | Medical image inspection disease classification method based on support vector machine (SVM) | |
CN110795564B (en) | Text classification method lacking negative cases | |
CN109299270A (en) | A kind of text data unsupervised clustering based on convolutional neural networks | |
MidhunChakkaravarthy | Evolutionary and incremental text document classifier using deep learning | |
CN103268346A (en) | Semi-supervised classification method and semi-supervised classification system | |
CN103514168B (en) | Data processing method and device | |
Kakade et al. | A neural network approach for text document classification and semantic text analytics | |
CN115098690B (en) | Multi-data document classification method and system based on cluster analysis | |
CN114881172A (en) | Software vulnerability automatic classification method based on weighted word vector and neural network | |
Zhu et al. | Chinese texts classification system | |
Nohuddin et al. | Content analytics based on random forest classification technique: An empirical evaluation using online news dataset | |
Al Hasan et al. | Clustering Analysis of Bangla News Articles with TF-IDF & CV Using Mini-Batch K-Means and K-Means | |
Klungpornkun et al. | Hierarchical text categorization using level based neural networks of word embedding sequences with sharing layer information | |
CN114625868A (en) | Electric power data text classification algorithm based on selective ensemble learning | |
Sami et al. | Incorporating random forest trees with particle swarm optimization for automatic image annotation | |
Pavithra et al. | A review article on semi-supervised clustering framework for high dimensional data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||