CN106815369B - A text classification method based on the XGBoost classification algorithm - Google Patents

A text classification method based on the XGBoost classification algorithm

Info

Publication number
CN106815369B
CN106815369B (Application CN201710060026.9A)
Authority
CN
China
Prior art keywords
word
text
sample
xgboost
classification algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710060026.9A
Other languages
Chinese (zh)
Other versions
CN106815369A (en)
Inventor
庞宇明
任江涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Sun Yat Sen University
Original Assignee
National Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Sun Yat Sen University filed Critical National Sun Yat Sen University
Priority to CN201710060026.9A priority Critical patent/CN106815369B/en
Publication of CN106815369A publication Critical patent/CN106815369A/en
Application granted granted Critical
Publication of CN106815369B publication Critical patent/CN106815369B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The method provided by the invention extracts label words via Labeled-LDA to compute feature values, then performs text classification with the XGBoost classification algorithm. Compared with the common approach of using a vector space model as the feature space and an ordinary classification algorithm, the memory required is greatly reduced. Chinese text contains many millions of distinct words, so a word-based feature space has very high dimensionality, consumes enormous memory, and may not even fit on a single machine; in contrast, there are fewer than 10,000 common Chinese characters, of which often only about 1,000 actually occur, so the dimensionality drops sharply, and XGBoost accepts a dictionary (sparse) format rather than a matrix as input. The invention also proposes Labeled-LDA as a novel supervised feature-selection algorithm with latent semantics: performing feature selection with Labeled-LDA exploits both the semantic information that LDA mines from a large corpus and the class information carried by the text labels. Preprocessing is simple, no hand-crafted features are needed, and combined with the powerful distributed ensemble learning algorithm XGBoost, the accuracy and performance of classification are improved.

Description

A text classification method based on the XGBoost classification algorithm
Technical field
The present invention relates to the field of text classification, and more particularly to a text classification method based on the XGBoost classification algorithm.
Background technique
Text classification methods have been widely applied in fields such as search engines, personalized recommendation systems, and public-opinion monitoring, and are an important part of efficiently managing and accurately locating massive amounts of information.
The common framework of a text classification method is based on a machine learning classification algorithm, i.e., data preprocessing followed by feature extraction, feature selection, and feature classification.
Feature extraction represents text with a unified method and model that captures the text's characteristics, can easily be converted into mathematical language, and can in turn be converted into a mathematical model that a computer can process. Popular text representation methods include the vector space model and latent semantic analysis models. The vector space model has the advantages of simplicity, convenient computation, and little required preprocessing, but also the drawbacks of ignoring the semantic relations between features and representing text with insufficient depth.
Feature selection is one of the key problems of machine learning; the quality of the feature selection result directly affects a classifier's accuracy and generalization ability. Feature selection picks the most effective features out of a set of features to reduce the dimensionality of the feature space, rejects irrelevant or redundant features, shortens running time, improves the interpretability of the analysis results, and reveals structure hidden in high-dimensional data. Depending on whether the data carry class information, feature selection divides into supervised and unsupervised methods. Common feature-selection methods in text classification include document frequency, mutual information, information gain, and the chi-square statistic (CHI).
Feature classification is the last and most important step in text classification. The naive Bayes classifier is a typical classification algorithm: it computes the probability that a text belongs to a particular class according to Bayes' formula, requires very few parameters to estimate, is insensitive to missing data, is simple and fast, and in theory has the smallest error rate among classification algorithms. In practice this is not always the case, because the method assumes the attributes are mutually independent, an assumption that often fails in real applications. Decision tree algorithms are easy to understand and interpret, can handle both numerical and categorical attributes, and can produce feasible, well-performing results on large data sources in a relatively short time, but decision trees handle missing data poorly and tend to ignore correlations between attributes in the data set. Other feature classification methods include LR, KNN, and SVM. All of these are base learners; ensemble learning, by combining multiple learners, often achieves generalization performance significantly superior to any single learner. According to how the individual learners are generated, ensemble methods currently divide into two classes: sequential methods, in which the individual learners have strong dependencies and must be generated serially, and parallel methods, in which the learners have no strong dependencies and can be generated simultaneously. The former is represented by Boosting, the latter by Bagging and random forests. From the bias-variance decomposition point of view, Boosting mainly reduces bias, so it can build a very strong ensemble from quite weak learners, while Bagging mainly reduces variance, so it is most effective on learners that are sensitive to sample perturbation, such as unpruned decision trees and neural networks. The XGBoost classification algorithm is an ensemble method based on Boosting; compared with the traditional GBDT ensemble algorithm, XGBoost has the following eight main advantages:
One, traditional GBDT uses CART as the base classifier; the XGBoost classification algorithm also supports linear classifiers.
Two, traditional GBDT uses only first-order derivative information during optimization; XGBoost performs a second-order Taylor expansion of the cost function, using both the first and second derivatives.
Three, XGBoost adds a regularization term to the cost function to control model complexity. The regularizer contains the number of leaf nodes of the tree and the sum of squared L2 norms of the scores output at each leaf node. From the bias-variance trade-off point of view, the regularization term reduces the model's variance, making the learned model simpler and preventing overfitting.
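Advantages two and three can be stated concretely. In the standard formulation from the XGBoost literature, the regularized objective minimized at iteration t is:

```latex
\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)
\approx \sum_{i=1}^{n}\left[\, g_i f_t(x_i) + \tfrac{1}{2}\, h_i f_t^{2}(x_i) \right] + \Omega(f_t)

g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right), \qquad
h_i = \partial^{2}_{\hat{y}_i^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right)

\Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^{2}
```

Here T is the number of leaf nodes and w_j is the score output on leaf j, so the penalty is exactly the leaf count plus the squared L2 norm of the leaf scores described in advantage three, and the g_i, h_i terms are the first and second derivatives used in advantage two.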
Four, shrinkage, equivalent to a learning rate: after each iteration, XGBoost multiplies the leaf-node weights by this coefficient, mainly to weaken the influence of each individual tree and leave more learning room for the trees that follow.
Five, column subsampling: XGBoost borrows the practice of random forests and supports column (feature) subsampling, which not only reduces overfitting but also reduces computation.
Six, handling of missing values: for samples whose value of a feature is missing, XGBoost can automatically learn the splitting direction.
Seven, parallelism support. Boosting is a serial structure, so XGBoost's parallelism is not tree-level parallelism: one iteration must finish before the next begins (the cost function at iteration t contains the predictions of the previous t-1 iterations). Instead, XGBoost parallelizes at the feature granularity. The most time-consuming step of decision tree learning is sorting the feature values (to determine the best split points); before training, XGBoost pre-sorts the data and stores it in a block structure that is reused in later iterations, greatly reducing computation. This block structure also makes parallelism possible: when splitting a node, the gain of every feature must be computed and the feature with the largest gain is chosen for the split, and the gain computations for the different features can run on multiple threads.
Eight, a parallelizable approximate histogram algorithm. When a tree node is split, the gain of every candidate split point of every feature must be computed, i.e., the greedy method enumerates all possible split points. When the data cannot be loaded into memory at once, or in the distributed setting, the greedy algorithm becomes very inefficient, so XGBoost also proposes a parallelizable approximate histogram algorithm for efficiently generating candidate split points.
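The idea behind advantage eight can be illustrated with a minimal pure-Python sketch: instead of enumerating every distinct feature value as a candidate split, propose only approximate quantile boundaries. The function name and binning rule below are illustrative, not XGBoost's actual API or its weighted quantile sketch.

```python
def quantile_candidates(values, num_bins):
    """Propose candidate split points as approximate quantiles of a feature,
    instead of enumerating every distinct value as the exact greedy method does."""
    ordered = sorted(values)
    n = len(ordered)
    # one candidate at each 1/num_bins quantile boundary (illustrative rule)
    return [ordered[min(n - 1, (i * n) // num_bins)] for i in range(1, num_bins)]

feature = [5.0, 1.0, 3.0, 9.0, 2.0, 8.0, 4.0, 7.0, 6.0, 10.0]
cands = quantile_candidates(feature, 4)  # only 3 candidates for 10 values
```

With 4 bins, only 3 candidate splits are evaluated regardless of how many distinct values the feature has, which is what makes the approximate method efficient on data that does not fit in memory.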
Summary of the invention
To overcome the defects of the above prior art, namely low classification performance, large memory consumption, and low classification accuracy, the present invention provides a text classification method based on the XGBoost classification algorithm.
To achieve the above object of the invention, the adopted technical solution is:
A text classification method based on the XGBoost classification algorithm, comprising the following steps:
S1. Obtain multiple samples, each sample comprising text content and the text's label;
S2. Divide all samples obtained in step S1 into training samples and prediction samples in a certain proportion, the training samples forming the training set and the prediction samples forming the prediction set;
S3. For each training sample, separate every two adjacent characters in its text content with a space, then feed the training sample's label to Labeled-LDA as the label input and the training sample's text content as the text input;
S4. Set the number of Labeled-LDA iterations to K, then train iteratively on the training samples;
S5. After iterating, each training sample yields two documents: one maps each word to its word coding, and the other maps topics to word codings, i.e., the number of times each word coding occurs under each topic. Merging the two documents gives the number of times each word of the training sample occurs under each topic. For each topic, sort the words by occurrence count and take the m words most related to the topic as the training sample's LLDA words;
S6. For each training sample, count the number of occurrences in its text content of every LLDA word obtained in step S5, take that count as the value of the LLDA word, feed each sample's value for each LLDA word into the XGBoost classification algorithm, and train the XGBoost classifier;
S7. The model is now trained. To predict the prediction set, apply steps S3 to S5 to the prediction set, then use the trained model to predict the class of every prediction sample in the prediction set.
Preferably, in step S6, if an LLDA word occurs 0 times in the text, that LLDA word is not fed into the XGBoost classification algorithm. That is, whereas ordinary classification algorithms take matrix input, this method takes dictionary (sparse) input, which saves memory and handles missing values more conveniently, quickly, and reasonably.
Preferably, K is 1000.
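The dictionary-style input described above can be sketched in a few lines of pure Python. The helper below writes one sample in the LibSVM-style sparse text format (`label index:value ...`), which the xgboost library can read; the function name and index scheme are illustrative, not part of the patented method.

```python
def to_sparse_line(label, feature_counts):
    """Encode one sample as a LibSVM-style sparse line: zero counts are
    simply omitted, so LLDA words that do not occur consume no memory."""
    parts = [str(label)]
    for idx in sorted(feature_counts):
        count = feature_counts[idx]
        if count != 0:  # only non-zero features are written out
            parts.append(f"{idx}:{count}")
    return " ".join(parts)

# a sample of class 1 in which LLDA words 0 and 7 occur and word 2 does not
line = to_sparse_line(1, {0: 3, 2: 0, 7: 1})
```

The absent feature (`2:0`) never appears in the output line, which is exactly how the dictionary format saves memory relative to a dense matrix.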
Compared with prior art, the beneficial effects of the present invention are:
The method provided by the invention extracts label words via Labeled-LDA to compute feature values, then performs text classification with the XGBoost classification algorithm. Compared with the common approach of using a vector space model as the feature space and an ordinary classification algorithm, the memory required is greatly reduced. Chinese text contains many millions of distinct words, so a word-based feature space has very high dimensionality, consumes enormous memory, and may not even fit on a single machine; in contrast, there are fewer than 10,000 common Chinese characters, of which often only about 1,000 actually occur, so the dimensionality drops sharply, and XGBoost accepts a dictionary (sparse) format rather than a matrix as input. The invention also proposes Labeled-LDA as a novel supervised feature-selection algorithm with latent semantics: performing feature selection with Labeled-LDA exploits both the semantic information that LDA mines from a large corpus and the class information carried by the text labels. Preprocessing is simple, no hand-crafted features are needed, and combined with the powerful distributed ensemble learning algorithm XGBoost, the accuracy and performance of classification are improved.
Detailed description of the invention
Fig. 1 is a flow chart of the method.
Specific embodiment
The attached figures are for illustrative purposes only and shall not be construed as limiting the patent;
The present invention is further described below in conjunction with the drawings and embodiments.
Embodiment 1
This embodiment comprises 3 concrete cases, classifying 3 text corpora with different characteristics: one public English corpus, WebKB, from which samples without any content were removed, and two Chinese corpora, one a public long-text corpus (the Fudan University text classification corpus) and the other a highly imbalanced Chinese short-text corpus of news comments divided into the two classes normal and advertisement, with a positive-to-negative ratio on the order of 2742/42416 = 0.065.
Table 1. Summary of the text classification data sets
As shown in Fig. 1, the specific implementation steps of the text classification method of the present invention, based on the XGBoost classification algorithm with Labeled-LDA used for feature extraction, are as follows:
Step 1: Text preprocessing
A batch of labeled text sets is prepared in advance, e.g., the 3 cases. The news-comment corpus is randomly split into training and prediction sets at a ratio of 8:2; if a public data set already provides a split (WebKB and the Fudan text classification corpus), that split is used directly. All texts are de-noised and uniformly encoded as UTF-8. For Chinese text, each character is separated with a space to ease subsequent processing. Because the present invention uses character-based features, the number of common characters does not exceed 10,000, and Labeled-LDA has strong feature-selection ability, the preprocessing steps commonly used in Chinese text classification, such as removing punctuation, digits, and stop words, can be omitted; the experimental results also show that such preprocessing is not necessary. For English text, preprocessing consists of lowercasing all capitals and replacing punctuation with spaces, which better delimits word boundaries and matches English writing conventions.
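The preprocessing described above can be sketched with two small pure-Python helpers: one that makes each Chinese character its own space-separated token, and one that lowercases English text and replaces punctuation with spaces. The function names are illustrative.

```python
import re

def preprocess_chinese(text):
    """Insert a space between every pair of adjacent characters so each
    Chinese character becomes its own token (the method is character-based)."""
    return " ".join(text.replace(" ", ""))

def preprocess_english(text):
    """Lowercase everything and replace punctuation with spaces to expose
    word boundaries, matching English writing conventions."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

zh = preprocess_chinese("文本分类")
en = preprocess_english("Text-Classification, WebKB!")
```

No stop-word removal or digit stripping is performed, mirroring the observation that Labeled-LDA's feature selection makes those steps unnecessary.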
Step 2: Feature selection on the training set with Labeled-LDA
The experiments use the Stanford natural language processing toolkit TMT. The various Labeled-LDA parameters are set first, including the text segmentation mode and text filtering mode, the number of labels, and the number of iterations; segmentation is by spaces, the number of labels depends on the particular corpus, and the number of iterations is uniformly 1000. After training, two files are obtained: a word-to-coding file and a file recording the number of times each word coding occurs under each topic. Merging the two files yields the number of times each word occurs under each topic; then, under each topic, the words are sorted by occurrence count and the top N most frequent are taken. The words so obtained are the words left after Labeled-LDA feature selection; they may contain repetitions, and after deduplication they are called the LLDA words. They carry rich semantic information and, because they are trained under the strong supervision of the labels, also have strong discriminative power.
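The merge-and-select step can be sketched as follows. The two TMT output files are represented here as in-memory dictionaries (a word-coding map and per-topic coding counts); the codes, counts, and function name are made up for illustration and are not TMT's actual file format.

```python
from collections import Counter

def top_words_per_topic(code_to_word, topic_code_counts, top_n):
    """Merge the word-coding map with the per-topic coding counts, keep the
    top_n most frequent words under each topic, and return the deduplicated
    union: the LLDA word set."""
    llda_words = set()
    for topic, code_counts in topic_code_counts.items():
        counts = Counter({code_to_word[c]: n for c, n in code_counts.items()})
        llda_words.update(w for w, _ in counts.most_common(top_n))
    return llda_words

code_to_word = {0: "ball", 1: "team", 2: "bank", 3: "loan"}
topic_code_counts = {
    "sports":  {0: 9, 1: 7, 2: 1},
    "finance": {2: 8, 3: 6, 0: 1},
}
words = top_words_per_topic(code_to_word, topic_code_counts, 2)
```

Because the union is a set, a word frequent under several topics appears only once, matching the deduplication described above.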
Step 3: Convert the training and prediction sets into XGBoost's input format
XGBoost's input format is simple, convenient, and reasonable. For each sample in the training and prediction sets, since characters are treated as features, each character is processed in turn: if the character is an LLDA word, the feature value of the corresponding LLDA-word feature is incremented by 1. For each sample, an LLDA word that does not occur has a feature value of 0 for the corresponding feature, and that feature is simply not included in the XGBoost input;
Step 4: Parameter setting, training, and prediction with the XGBoost classification algorithm
The XGBoost parameters are set and those giving the best overall results are chosen. The final XGBoost parameters are: 200 iterations; number of classes as appropriate for the corpus; classifier type gbtree; objective function multi:softmax; learning rate eta = 0.3; maximum tree depth max_depth = 6; etc. The trained model is then used to classify the prediction set;
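As a configuration sketch, the settings above can be written as a parameter dictionary using the xgboost library's conventional parameter keys; the value of `num_class` shown here (4, as for WebKB) is corpus-dependent, and the intended use, passing the dictionary to `xgboost.train(params, dtrain, num_boost_round=200)`, is stated as an assumption rather than reproduced from the patent.

```python
# Training configuration mirroring step 4 of the embodiment.
params = {
    "booster": "gbtree",           # classifier type
    "objective": "multi:softmax",  # direct multi-class prediction
    "num_class": 4,                # corpus-dependent (4 shown for WebKB)
    "eta": 0.3,                    # learning rate (shrinkage)
    "max_depth": 6,                # maximum tree depth
}
num_boost_round = 200              # number of boosting iterations
```
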
Step 5: Evaluate model performance
Since the data sets differ greatly, different evaluation criteria are used for different data sets so that the evaluation is more reasonable. WebKB has 4 classes and is evaluated with the micro-averaged F1 value (micro-F1) and the macro-averaged F1 value (macro-F1);
They are defined as follows:

$$\text{micro-}F_1 = \frac{2PR}{P+R}, \qquad P = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} (TP_i + FP_i)}, \qquad R = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} (TP_i + FN_i)}$$

$$\text{macro-}F_1 = \frac{1}{n} \sum_{i=1}^{n} \frac{2 P_i R_i}{P_i + R_i}$$

where P_i is the precision of the i-th class, R_i is the recall of the i-th class, and n is the number of classes. The Fudan text classification corpus is evaluated with micro-averaged precision (micro-P); the news-comment corpus, because its classes are highly imbalanced, is evaluated with the precision P, recall R, and F value of the negative class.
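The micro- and macro-averaged F1 values just described can be computed with a short pure-Python helper; the per-class true-positive/false-positive/false-negative counts are assumed given, and the function name is illustrative.

```python
def f1_scores(per_class):
    """Compute (micro-F1, macro-F1) from a list of per-class
    (tp, fp, fn) counts."""
    f1s = []
    for tp, fp, fn in per_class:
        p = tp / (tp + fp) if tp + fp else 0.0   # per-class precision
        r = tp / (tp + fn) if tp + fn else 0.0   # per-class recall
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    macro = sum(f1s) / len(f1s)                  # average of per-class F1
    TP = sum(tp for tp, _, _ in per_class)       # pool counts for micro
    FP = sum(fp for _, fp, _ in per_class)
    FN = sum(fn for _, _, fn in per_class)
    micro_p = TP / (TP + FP)
    micro_r = TP / (TP + FN)
    micro = 2 * micro_p * micro_r / (micro_p + micro_r)
    return micro, macro

micro, macro = f1_scores([(8, 2, 2), (6, 4, 4)])
```

Micro-averaging pools the counts over all classes before computing F1, while macro-averaging gives every class equal weight regardless of its size.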
The model defined in the present invention is denoted LLDA-XG, and the experimental results are as follows:
Table 2. Classification performance on the WebKB corpus

             RCC     Centroid  NB      Winnow  SVM     LLDA-XG
micro-F1     89.64   79.95     86.45   81.40   89.26   91.40
macro-F1     88.02   78.47     86.01   77.85   87.66   90.48
Table 3. Classification performance on the news-comment corpus

          Bayes   SGDC    Bagging+KNN  SVM     RF      GBDT    Adaboost  LLDA-XG
P         69.55   77.71   97.00        93.65   96.4    92.22   95.80     94.52
R         91.62   91.98   85.20        86.81   89.84   88.77   85.56     92.34
F value   79.08   84.24   90.70        90.10   92.99   90.46   90.40     93.42
Table 4. Classification performance on the Fudan text classification corpus

Model                                          Accuracy
Bag-of-words + logistic regression             92.08
Bigram + logistic regression                   92.97
Bag-of-words + SVM                             93.02
Bigram + SVM                                   93.03
Averaged word vectors + logistic regression    86.89
Labeled-LDA                                    90.80
Convolutional neural network                   94.04
Recurrent convolutional network                95.20
LLDA-XG                                        95.41
As can be seen from Tables 2, 3, and 4, the LLDA-XG model performs well on the WebKB corpus, the news-comment corpus, and the Fudan text classification corpus. On WebKB, both micro-F1 and macro-F1 exceed 90. On the news-comment corpus, LLDA-XG does not have the highest precision, but its recall and F value are the highest, showing that the model maintains both high precision and high recall and gives more balanced performance. On the Fudan corpus, the experimental reference is the doctoral thesis of Lai Siwei of the University of the Chinese Academy of Sciences, which studies combining neural networks with word vectors; LLDA-XG performs very well there, with classification accuracy higher than the recurrent convolutional neural network, far less preprocessing, and much shorter running time: results are obtained in four or five minutes on an ordinary single machine, whereas the neural network requires a large amount of computation time.
Table 5. Classification time of LLDA-XG on the WebKB corpus

Number of features  215     411     925     7288
micro-F1            90.24   90.82   91.40   91.61
macro-F1            89.19   89.98   90.48   90.69
Time (seconds)      3.740   4.560   5.759   8.984
Table 6. Classification time of LLDA-XG on the news-comment corpus

Number of features  154     289     674     4154
P                   94.39   93.91   94.53   94.13
R                   90.02   90.73   92.34   91.44
F                   92.15   92.29   93.42   92.77
Time (seconds)      6.739   7.716   8.756   10.589
Table 7. Classification time of LLDA-XG on the Fudan text classification corpus

Number of features  408       665       1329      6707
Accuracy            94.59     95.20     95.41     95.39
Time (seconds)      145.701   206.362   278.524   342.354
In Table 5, when the number of features increases from 215 to 411, micro-F1 rises by 0.58 and macro-F1 by 0.79; from 411 to 925, micro-F1 rises by 0.58 and macro-F1 by 0.5; from 925 to 7288, the number of features grows about 7-fold, yet micro-F1 and macro-F1 rise by only 0.21, a negligible gain, while the running time nearly doubles. In Table 6, when the number of features increases from 154 to 289, the precision P actually drops by 0.48 while the recall R rises by 0.71 and the F value by 0.14; from 289 to 674, precision rises by 0.62, recall by 1.61, and the F value by 1.13; but from 674 to 4154, precision drops by 0.4, recall by 0.9, and the F value by 0.65. On the Fudan text classification task, accuracy rises by 0.61 when the number of features increases from 408 to 665 and by 0.21 from 665 to 1329, but drops by 0.02 from 1329 to 6707. Tables 5, 6, and 7 show that the more features LLDA-XG uses, the longer it takes, while classification performance grows slowly and in the later stages even declines as features are added, possibly because overtraining causes overfitting. This illustrates the importance of feature selection: the features that most influence classification performance are often a minority, the feature set contains many redundant and even noisy features, and more features mean more running time for very limited performance gain or even overfitting. It also shows that Labeled-LDA has strong feature-selection ability: whether for Chinese or English, long text or short text, it selects very good features from a large pool, reducing the overall running time while keeping performance stable and efficient. Moreover, the feature extraction of the invention is character-based and very convenient: no manual work or expert knowledge is needed to extract features at great expense, the preprocessing is simple and takes little time, and combined with the fast and efficient XGBoost classification algorithm, which handles missing values very well, the overall performance is superior and stable. In short, experiments verify that the proposed text classification model LLDA-XG can be widely used in the respective text classification tasks; its preprocessing is simple and fast, its overall running time and performance are both outstanding, and it has high robustness and practical value.
Obviously, the above embodiments of the present invention are merely examples given to clearly illustrate the invention and are not intended to limit its embodiments. Those of ordinary skill in the art may make other variations or changes in different forms on the basis of the above description. It is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the invention shall be included within the protection scope of the claims of the present invention.

Claims (3)

1. A text classification method based on the XGBoost classification algorithm, characterized by comprising the following steps:
S1. obtaining multiple samples, each sample comprising text content and the text's label;
S2. dividing all samples obtained in step S1 into training samples and prediction samples in a certain proportion, the training samples forming a training set and the prediction samples forming a prediction set;
S3. for each training sample, separating every two adjacent characters in its text content with a space, then feeding the training sample's label to Labeled-LDA as the label input and the training sample's text content as the text input;
S4. setting the number of Labeled-LDA iterations to K and training iteratively on the training samples;
S5. after iterating, each training sample yielding two documents, one mapping each word to its word coding and the other mapping topics to word codings, i.e., the number of times each word coding occurs under each topic; merging the two documents to obtain the number of times each word of the training sample occurs under each topic; for each topic, sorting the words by occurrence count and taking the m words most related to the topic as the training sample's LLDA words;
S6. for each training sample, counting the number of occurrences in its text content of every LLDA word obtained in step S5, taking that count as the value of the LLDA word, feeding each sample's value for each LLDA word into the XGBoost classification algorithm, and training the XGBoost classifier;
S7. the model now being trained, predicting the prediction set by applying steps S3 to S5 to the prediction set and then using the trained model to predict the class of every prediction sample in the prediction set.
2. The text classification method based on the XGBoost classification algorithm according to claim 1, characterized in that: in step S6, if an LLDA word occurs 0 times in the text, the LLDA word is not input into the XGBoost classification algorithm.
3. The text classification method based on the XGBoost classification algorithm according to claim 1, characterized in that: K is 1000.
CN201710060026.9A 2017-01-24 2017-01-24 A text classification method based on the XGBoost classification algorithm Active CN106815369B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710060026.9A CN106815369B (en) 2017-01-24 2017-01-24 A text classification method based on the XGBoost classification algorithm


Publications (2)

Publication Number Publication Date
CN106815369A CN106815369A (en) 2017-06-09
CN106815369B true CN106815369B (en) 2019-09-20

Family

ID=59112460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710060026.9A Active CN106815369B (en) 2017-01-24 2017-01-24 A text classification method based on the XGBoost classification algorithm

Country Status (1)

Country Link
CN (1) CN106815369B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463965B (en) * 2017-08-16 2024-03-26 湖州易有科技有限公司 Deep learning-based fabric attribute picture acquisition and recognition method and recognition system
CN107798083A (en) * 2017-10-17 2018-03-13 广东广业开元科技有限公司 A kind of information based on big data recommends method, system and device
CN107992531B (en) * 2017-11-21 2020-11-27 吉浦斯信息咨询(深圳)有限公司 News personalized intelligent recommendation method and system based on deep learning
CN108090201A (en) * 2017-12-20 2018-05-29 珠海市君天电子科技有限公司 A kind of method, apparatus and electronic equipment of article content classification
CN108257052B (en) * 2018-01-16 2022-04-22 中南大学 Online student knowledge assessment method and system
CN108536650B (en) * 2018-04-03 2022-04-26 北京京东尚科信息技术有限公司 Method and device for generating gradient lifting tree model
CN108627798B (en) * 2018-04-04 2022-03-11 北京工业大学 WLAN indoor positioning algorithm based on linear discriminant analysis and gradient lifting tree
CN108563638B (en) * 2018-04-13 2021-08-10 武汉大学 Microblog emotion analysis method based on topic identification and integrated learning
CN108868805B (en) * 2018-06-08 2019-11-15 西安电子科技大学 Shield method for correcting error based on statistical analysis in conjunction with XGboost
CN108985335B (en) * 2018-06-19 2021-04-27 中国原子能科学研究院 Integrated learning prediction method for irradiation swelling of nuclear reactor cladding material
CN109189937B (en) 2018-08-22 2021-02-09 创新先进技术有限公司 Feature relationship recommendation method and device, computing device and storage medium
WO2020082734A1 (en) * 2018-10-24 2020-04-30 平安科技(深圳)有限公司 Text emotion recognition method and apparatus, electronic device, and computer non-volatile readable storage medium
CN109543187B (en) * 2018-11-23 2021-09-17 中山大学 Method and device for generating electronic medical record characteristics and storage medium
CN109754002A (en) * 2018-12-24 2019-05-14 上海大学 A hybrid ensemble method for steganalysis based on deep learning
CN109933667A (en) * 2019-03-19 2019-06-25 中国联合网络通信集团有限公司 Text classification model training method, text classification method and device
CN110334732A (en) * 2019-05-20 2019-10-15 北京思路创新科技有限公司 An urban air pollution prediction method and device based on machine learning
CN110318327A (en) * 2019-06-10 2019-10-11 长安大学 A road surface evenness prediction method based on random forest
CN111079031B (en) * 2019-12-27 2023-09-12 北京工业大学 Importance weight classification method for blog posts about disaster information based on deep learning and the XGBoost algorithm
CN111753058B (en) * 2020-06-30 2023-06-02 北京信息科技大学 Text viewpoint mining method and system
CN111831824B (en) * 2020-07-16 2024-02-09 民生科技有限责任公司 A method for classifying public opinion as positive or negative
CN111859074B (en) * 2020-07-29 2023-12-29 东北大学 Network public opinion information source influence evaluation method and system based on deep learning
CN112699944B (en) * 2020-12-31 2024-04-23 中国银联股份有限公司 Training method, processing method, device, equipment and medium for returning list processing model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104063472A (en) * 2014-06-30 2014-09-24 电子科技大学 A KNN text classification method with an optimized training sample set
CN104331498A (en) * 2014-11-19 2015-02-04 亚信科技(南京)有限公司 Method for automatically classifying webpage content visited by Internet users

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060142993A1 (en) * 2004-12-28 2006-06-29 Sony Corporation System and method for utilizing distance measures to perform text classification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A New Text Classification Algorithm Based on the Labeled-LDA Model; Li Wenbo; Chinese Journal of Computers (计算机学报); 2008-04-30; full text *

Also Published As

Publication number Publication date
CN106815369A (en) 2017-06-09

Similar Documents

Publication Publication Date Title
CN106815369B (en) A text classification method based on the XGBoost classification algorithm
CN110059181B (en) Short text label method, system and device for large-scale classification system
Huang et al. Naive Bayes classification algorithm based on small sample set
CN109740154A A fine-grained sentiment analysis method for online reviews based on multi-task learning
CN107301171A A text sentiment analysis method and system based on sentiment dictionary learning
Liu et al. Deep learning approaches for link prediction in social network services
CN108197144B (en) Hot topic discovery method based on BTM and Single-pass
Sundus et al. A deep learning approach for arabic text classification
CN101587493A (en) Text classification method
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN110795564B Text classification method for data lacking negative examples
CN109299270A An unsupervised text data clustering method based on convolutional neural networks
MidhunChakkaravarthy Evolutionary and incremental text document classifier using deep learning
CN103268346A (en) Semi-supervised classification method and semi-supervised classification system
CN103514168B (en) Data processing method and device
Kakade et al. A neural network approach for text document classification and semantic text analytics
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
CN114881172A (en) Software vulnerability automatic classification method based on weighted word vector and neural network
Zhu et al. Chinese texts classification system
Nohuddin et al. Content analytics based on random forest classification technique: An empirical evaluation using online news dataset
Al Hasan et al. Clustering Analysis of Bangla News Articles with TF-IDF & CV Using Mini-Batch K-Means and K-Means
Klungpornkun et al. Hierarchical text categorization using level based neural networks of word embedding sequences with sharing layer information
CN114625868A (en) Electric power data text classification algorithm based on selective ensemble learning
Sami et al. Incorporating random forest trees with particle swarm optimization for automatic image annotation
Pavithra et al. A review article on semi-supervised clustering framework for high dimensional data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant