CN106815369A - A kind of file classification method based on Xgboost sorting algorithms - Google Patents

A kind of file classification method based on Xgboost sorting algorithms Download PDF

Info

Publication number
CN106815369A
CN106815369A (application CN201710060026.9A)
Authority
CN
China
Prior art keywords
text
word
xgboost
sample
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710060026.9A
Other languages
Chinese (zh)
Other versions
CN106815369B (en
Inventor
庞宇明
任江涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Sun Yat Sen University
Original Assignee
National Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Sun Yat Sen University filed Critical National Sun Yat Sen University
Priority to CN201710060026.9A priority Critical patent/CN106815369B/en
Publication of CN106815369A publication Critical patent/CN106815369A/en
Application granted granted Critical
Publication of CN106815369B publication Critical patent/CN106815369B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The method provided by the invention extracts label words via Labeled-LDA to compute feature values, and then performs text classification with the XGBoost classification algorithm. Compared with methods that build the feature space from an ordinary vector space model and classify with a conventional algorithm, it consumes far less memory: the words contained in Chinese text number in the millions, so word-level features yield a very high-dimensional space that consumes enormous memory and may not even fit on a single machine, whereas Chinese characters in common use number fewer than 10,000, often only around 1,000 in practice, so the dimensionality drops substantially; moreover, XGBoost supports input in dictionary rather than matrix form. The invention also proposes a novel supervised feature selection algorithm with latent semantics, the Labeled-LDA algorithm: performing feature selection with Labeled-LDA both mines the semantic information of a large corpus through LDA and exploits the class information contained in the text labels. Preprocessing is simple and no painstaking feature engineering is required; combined with the powerful distributed ensemble learning algorithm XGBoost, the accuracy and performance of classification are improved.

Description

A text classification method based on the XGBoost classification algorithm
Technical field
The present invention relates to the field of text classification, and in particular to a text classification method based on the XGBoost classification algorithm.
Background technology
Text classification methods have been widely applied in fields such as search engines, personalized recommendation systems, and public opinion monitoring, and are a key link in efficiently managing and accurately locating massive amounts of information.
The conventional framework for text classification is based on machine learning classification algorithms and comprises steps such as data preprocessing, feature extraction, feature selection, and feature classification.
Feature extraction represents text with a unified method and model; the method or model captures the characteristics of the text and can readily be converted into mathematical language, and hence into a mathematical model that a computer can process. The currently popular text representation methods are the vector space model and latent semantic analysis models. The vector space model has the advantages of simplicity, convenient computation, and little preprocessing, but also the shortcomings of ignoring semantic relations between features and representing text with insufficient depth.
Feature selecting is one of key issue of machine learning, and the quality of feature selecting result directly affects grader Nicety of grading and Generalization Capability.Feature selecting is that some maximally effective features are picked out from one group of feature so as to reduce feature sky Between dimension, and reach and reject uncorrelated or redundancy feature, reduce run time and improve the intelligibility of analysis result, find Structure hidden in high dimensional data and other effects.Whether there is classification information according to data, feature selecting can be divided into supervision and nothing Supervise two classes.The feature selection approach commonly used in text classification has:Document frequencies, mutual information, information gain and chi The methods such as amount (CHI).
Feature classification is the final and most important link in text classification. The naive Bayes algorithm is a typical classification algorithm: using Bayes' rule, it computes the probability that a text belongs to a particular category. It requires very few estimated parameters, is relatively insensitive to missing data, is simple and fast, and in theory has the lowest error rate among classification algorithms; in practice this is not always the case, because the method assumes the attributes are mutually independent, an assumption that often fails in real applications. Decision tree algorithms are easy to understand and explain, can handle both categorical and numerical attributes, and produce workable, effective results on large data sources in relatively short time; but decision trees handle missing data poorly and tend to ignore correlations between attributes in the data set. Other feature classification methods include LR, KNN, and SVM. All of these are base learners; ensemble learning, by combining multiple learners, often achieves significantly better generalization than any single learner. According to how the individual learners are generated, current ensemble methods divide into two classes: those with strong dependencies between individual learners, which must be generated serially (sequential methods), and those without strong dependencies, which can be generated in parallel. The former is represented by Boosting, the latter by Bagging and random forests. From the viewpoint of the bias-variance decomposition, Boosting mainly reduces bias, so it can build a very strong ensemble out of learners with quite weak generalization performance. Bagging mainly reduces variance, so its effect is most apparent on learners that are easily perturbed by the sample, such as unpruned decision trees and neural networks. The XGBoost classification algorithm is an ensemble learning method based on Boosting; compared with the traditional ensemble algorithm GBDT, it has the following eight main advantages:
First, whereas traditional GBDT uses CART as the base classifier, the XGBoost classification algorithm also supports linear classifiers.
Second, traditional GBDT uses only first-order derivative information in optimization, while XGBoost performs a second-order Taylor expansion of the cost function, using both first and second derivatives.
Third, XGBoost adds a regularization term to the cost function to control model complexity. The term comprises the number of leaf nodes of the tree and the sum of squares of the L2 norm of the score output at each leaf node. From the viewpoint of the bias-variance tradeoff, the regularization term reduces the model's variance, yields a simpler learned model, and prevents overfitting.
Fourth, shrinkage, equivalent to a learning rate. After each iteration, XGBoost multiplies the leaf-node weights by this coefficient, mainly to weaken the influence of each individual tree and leave more learning room for later trees.
Fifth, column subsampling. XGBoost borrows this from random forests; it not only reduces overfitting but also reduces computation.
Sixth, handling of missing values. For samples whose feature values are missing, XGBoost can automatically learn their splitting direction.
Seventh, parallelism. Boosting is inherently sequential, so XGBoost's parallelism is not across trees; an iteration can only begin after the previous one finishes (the cost function of iteration t contains the predictions of iteration t-1). Instead, XGBoost parallelizes at the feature granularity. The most time-consuming step of learning a decision tree is sorting the feature values (to determine the best split point); before training, XGBoost pre-sorts the data and saves it in a block structure that is reused in later iterations, greatly reducing computation. The block structure also makes parallelism possible: when splitting a node, the gain of every feature must be computed and the feature with the largest gain chosen to split on, so the gain computations of the different features can run on multiple threads.
Eighth, a parallelizable approximate histogram algorithm. When splitting a tree node, the gain of every candidate split point of every feature must be computed, i.e., all possible split points are enumerated greedily. When the data cannot be loaded into memory at once, or in distributed settings, the greedy algorithm becomes very inefficient, so XGBoost also proposes a parallelizable approximate histogram algorithm that efficiently generates candidate split points.
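The second-order gain (advantages 2 and 3) and the pre-sorted single-pass split scan (advantage 7) can be sketched in a few lines. This is a simplified single-feature illustration, not the library's implementation; the function names and the default regularization values are my own assumptions:

```python
def split_gain(gl, hl, gr, hr, lam=1.0, gamma=0.0):
    """XGBoost-style structure-score gain for one candidate split.
    gl/hl: sums of first/second derivatives in the left child;
    gr/hr: the same sums for the right child (both derivative orders
    are used, per advantage 2). lam penalises leaf weights (advantage 3),
    gamma penalises adding a leaf."""
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr)) - gamma

def best_split(values, grads, hess, lam=1.0):
    """Scan one pre-sorted feature column in a single pass (advantage 7)
    and return (best gain, split value)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    gl = hl = 0.0
    gtot, htot = sum(grads), sum(hess)
    best = (float("-inf"), None)
    for pos in order[:-1]:
        gl += grads[pos]
        hl += hess[pos]
        gain = split_gain(gl, hl, gtot - gl, htot - hl, lam)
        if gain > best[0]:
            best = (gain, values[pos])
    return best

g, v = best_split([1, 2, 3, 4], [-1.0, -1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0])
assert v == 2  # the gradients change sign between the 2nd and 3rd points
```

The real library additionally routes missing values to a learned default direction and replaces the exact scan with the approximate histogram of advantage 8.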
Summary of the invention
To overcome the defects of the prior art described above, namely low classification performance, large memory consumption, and low classification accuracy, the present invention provides a text classification method based on the XGBoost classification algorithm.
To achieve the above objects of the invention, the adopted technical scheme is:
A text classification method based on the XGBoost classification algorithm, comprising the following steps:
S1. Obtain multiple samples, each sample comprising text content and the text's label;
S2. Divide all samples obtained in step S1 into training samples and prediction samples in a given proportion, the training samples forming the training set and the prediction samples forming the prediction set;
S3. For each training sample, separate every pair of adjacent characters in its text content with a space; then input the training sample's label as the Labeled-LDA label and the training sample's text content as the Labeled-LDA text;
S4. Set the number of Labeled-LDA iterations to K, then iteratively train on the training samples;
S5. After the iterations, each training sample yields two documents: one maps words to their word codes, the other maps topics to word codes, i.e., records the number of times each word code appears under each topic; merge the two documents to obtain the number of times each word in the training samples appears under each topic; for each topic, rank the words by occurrence count and select the m words most related to the topic as the training sample's LLDA words;
S6. For each training sample, count the occurrences in its text content of each LLDA word obtained in step S5 and take the count as the value of that feature; feed the resulting value of every sample on every LLDA word into the XGBoost classification algorithm, then train the XGBoost classifier;
S7. The model is now trained and the prediction set must be predicted: apply steps S3~S5 to the prediction set, then use the trained model to classify each prediction sample in the prediction set.
Preferably, in step S6, if an LLDA word occurs zero times in a text, that LLDA word is not fed into the XGBoost classification algorithm; that is, whereas an ordinary classification algorithm takes matrix-format input, this method takes dictionary-format input, which saves memory and handles missing values more conveniently, quickly, and reasonably.
Preferably, K is 1000.
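Step S6 together with the preferred dictionary-format input can be sketched as follows; the function name and the toy sample are my own, not from the patent:

```python
from collections import Counter

def llda_features(text, llda_words):
    """Step S6 sketch: count how often each selected LLDA word occurs in a
    sample's text (space-separated per step S3) and keep only non-zero
    counts, matching the dictionary-style input fed to XGBoost."""
    counts = Counter(text.split())
    return {w: counts[w] for w in llda_words if counts[w] > 0}

sample = "机 器 学 习 文 本 分 类 文 本"
feats = llda_features(sample, ["文", "本", "类", "图"])
assert feats == {"文": 2, "本": 2, "类": 1}  # "图" is absent, so it is omitted
```

Omitting the zero-count entries is exactly what keeps the representation sparse and memory-friendly.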
Compared with the prior art, the beneficial effects of the invention are:
The method provided by the invention extracts label words via Labeled-LDA to compute feature values, and then performs text classification with the XGBoost classification algorithm. Compared with methods that build the feature space from an ordinary vector space model and classify with a conventional algorithm, it consumes far less memory: the words contained in Chinese text number in the millions, so word-level features yield a very high-dimensional space that consumes enormous memory and may not even fit on a single machine, whereas Chinese characters in common use number fewer than 10,000, often only around 1,000 in practice, so the dimensionality drops substantially; moreover, XGBoost supports input in dictionary rather than matrix form. The invention also proposes a novel supervised feature selection algorithm with latent semantics, the Labeled-LDA algorithm: performing feature selection with Labeled-LDA both mines the semantic information of a large corpus through LDA and exploits the class information contained in the text labels. Preprocessing is simple and no painstaking feature engineering is required; combined with the powerful distributed ensemble learning algorithm XGBoost, the accuracy and performance of classification are improved.
Brief description of the drawings
Fig. 1 is the flow chart of method.
Specific embodiment
The accompanying drawing is for illustration only and shall not be construed as a limitation of this patent.
The present invention is further described below with reference to the drawing and the embodiments.
Embodiment 1
This embodiment comprises three concrete cases, classifying three text corpora with different characteristics: one public English corpus, WebKB, with samples lacking any content removed; and two Chinese corpora, one of which is a public long-text corpus, the Fudan University text classification corpus, and the other a highly imbalanced Chinese short-text corpus, news analysis, divided into the two classes normal and advertisement with a positive-to-negative ratio on the order of 2742/42416 = 0.065.
Table 1: Summary of the text classification data sets
As shown in Fig. 1, the specific implementation steps of the present text classification method, which uses Labeled-LDA for feature extraction and the XGBoost classification algorithm for classification, are as follows:
Step 1: Text preprocessing
A batch of classified text sets is prepared in advance, e.g. the three cases above. The training set and prediction set are randomly divided in an 8:2 ratio (news analysis); if a public data set already provides a division (WebKB and Fudan University text classification), it is used directly. All texts are cleaned and uniformly encoded as UTF-8. For Chinese text, every character is separated with a space for convenient downstream processing. Because the present invention uses character features, the commonly used characters number no more than 10,000, and Labeled-LDA has strong feature selection ability, the preprocessing conventional in Chinese text classification, such as removing punctuation, digits, and stop words, can be dispensed with; the experimental results also show that such preprocessing is unnecessary. For English text, preprocessing consists of converting all capitals to lowercase and replacing punctuation with spaces, which better delimits word boundaries and matches English writing style.
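As a rough sketch of this preprocessing step (the function names are mine; the patent specifies only the behavior):

```python
import string

def preprocess_chinese(text):
    """Chinese preprocessing per step 1: the method is character-based, so
    it only inserts a space between every pair of adjacent characters;
    punctuation and stop-word removal are left to Labeled-LDA's selection."""
    return " ".join(text.replace(" ", ""))

def preprocess_english(text):
    """English preprocessing per step 1: lowercase all capitals and replace
    punctuation with spaces to keep word boundaries."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table)

assert preprocess_chinese("文本分类") == "文 本 分 类"
assert preprocess_english("Hello, World!") == "hello  world "
```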
Step 2: Feature selection on the training set with Labeled-LDA
The experiments use the Stanford natural language processing tool TMT. The Labeled-LDA parameters are set first, including the text segmentation mode and text filtering mode, the number of labels, the number of iterations, and so on. Text is segmented on spaces; the number of labels depends on the actual corpus; the number of iterations is uniformly 1000. After training ends, two files are obtained: one maps words to word codes, the other records the number of times each word code appears under each topic. Merging the two files yields the number of occurrences of each word under each topic. Then, under each topic, the words are sorted by occurrence count and the top N are taken; the words so obtained are exactly those retained by Labeled-LDA feature selection. They may contain duplicates; after de-duplication they are the LLDA words. These words carry richer semantic information and, being the result of training under the strong supervision signal of the labels, have strong feature representation ability.
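The merge-and-rank selection described here can be sketched as follows, assuming the merged per-topic word counts are already in a dictionary; the function name and the toy numbers are illustrative only:

```python
def select_llda_words(topic_word_counts, top_n):
    """Step 2 sketch: given per-topic word counts (the merge of the two TMT
    output files), keep the top_n most frequent words under every topic and
    take their union, removing duplicates, to form the LLDA word set."""
    selected = set()
    for topic, counts in topic_word_counts.items():
        ranked = sorted(counts, key=counts.get, reverse=True)
        selected.update(ranked[:top_n])
    return selected

counts = {
    "sports": {"ball": 9, "game": 7, "team": 5, "the": 2},
    "finance": {"stock": 8, "game": 6, "bank": 4, "the": 1},
}
# "game" ranks highly under both topics but appears only once after dedup.
assert select_llda_words(counts, 2) == {"ball", "game", "stock"}
```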
Step 3: Converting the training set and prediction set into XGBoost's input format
XGBoost's input format is simple, convenient, and reasonable. For each sample of the training set and prediction set, since characters are treated as features, whenever the current character being processed is an LLDA word, the feature value corresponding to that LLDA word is incremented by 1; for each sample, an LLDA word that does not occur has feature value 0 for the corresponding feature, and that feature is simply not included in the XGBoost input;
Step 4: Parameter setting, training, and prediction of the XGBoost classification algorithm
The XGBoost parameters are set by selecting those with the best overall effect. The final XGBoost parameters are: 200 iterations; number of classes as appropriate to the data set; classifier type gbtree; objective function multi:softmax; learning rate eta = 0.3; maximum tree depth max_depth = 6; and so on. The trained model then classifies the prediction set;
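For illustration, the settings listed above could be collected into the kind of parameter dictionary that the xgboost package's training API accepts; the num_class value here is only an example (WebKB has 4 categories), and the actual xgboost training calls are omitted:

```python
# Hypothetical parameter dictionary mirroring step 4; the xgboost package
# itself is assumed and deliberately not imported here.
params = {
    "booster": "gbtree",           # classifier type
    "objective": "multi:softmax",  # multi-class objective
    "eta": 0.3,                    # learning rate (shrinkage, advantage 4)
    "max_depth": 6,                # maximum tree depth
    "num_class": 4,                # e.g. WebKB has 4 categories
}
num_boost_round = 200              # the 200 iterations set in step 4
```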
Step 5: Evaluating model performance
Because the data sets differ greatly, different evaluation criteria are used for different data sets so that the evaluation is more reasonable. WebKB has 4 categories and is evaluated with the micro-averaged F1 (micro-F1) and macro-averaged F1 (macro-F1), defined as follows, where P_i is the precision of the i-th class, R_i the recall of the i-th class, and n the number of classes:
macro-F1 = (1/n) * sum_{i=1..n} [ 2 * P_i * R_i / (P_i + R_i) ]
micro-F1 pools the counts of all n classes into a single precision P and recall R and takes 2 * P * R / (P + R).
Fudan University text classification is evaluated with the micro-averaged precision (micro-P); news analysis, whose classes are especially imbalanced, is evaluated with the precision P, recall R, and F value of the negative samples.
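The two averages can be sketched directly from these definitions; the function names and the toy numbers are mine, not from the patent:

```python
def macro_f1(per_class_pr):
    """Macro-averaged F1: average each class's F1, weighting classes equally.
    per_class_pr is a list of (precision, recall) pairs, one per class."""
    f1s = [2 * p * r / (p + r) if p + r else 0.0 for p, r in per_class_pr]
    return sum(f1s) / len(f1s)

def micro_f1(per_class_counts):
    """Micro-averaged F1: pool true positives, false positives, and false
    negatives over all classes before computing precision and recall.
    per_class_counts is a list of (tp, fp, fn) triples, one per class."""
    tp = sum(c[0] for c in per_class_counts)
    fp = sum(c[1] for c in per_class_counts)
    fn = sum(c[2] for c in per_class_counts)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

assert abs(macro_f1([(1.0, 1.0), (0.5, 0.5)]) - 0.75) < 1e-9
```

Micro averaging weights every document equally, so it favours the large classes; macro averaging weights every class equally, which is why both are reported for WebKB.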
The model defined by the present invention is denoted LLDA-XG. The experimental results are as follows:
Table 2: Classification performance on the WebKB corpus

          RCC    Centroid  NB     Winnow  SVM    LLDA-XG
micro-F1  89.64  79.95     86.45  81.40   89.26  91.40
macro-F1  88.02  78.47     86.01  77.85   87.66  90.48
Table 3: Classification performance on the news analysis corpus

   Bayes  SGDC   Bagging+KNN  SVM    RF     GBDT   Adaboost  LLDA-XG
P  69.55  77.71  97.00        93.65  96.40  92.22  95.80     94.52
R  91.62  91.98  85.20        86.81  89.84  88.77  85.56     92.34
F  79.08  84.24  90.70        90.10  92.99  90.46  90.40     93.42
Table 4: Classification performance on the Fudan University text classification corpus
As can be seen from Tables 2, 3, and 4, the LLDA-XG model achieves good results on the WebKB corpus, the news analysis corpus, and the Fudan University text classification corpus alike. On WebKB, both micro-F1 and macro-F1 exceed 90. On the news analysis corpus, the precision of LLDA-XG is not the highest, but its recall and F value both are, showing that LLDA-XG can keep precision high while also keeping recall high, with more emphasis on balanced performance. On the Fudan University text classification corpus, the experimental reference is Dr. Lai Siwei's doctoral dissertation at the University of the Chinese Academy of Sciences, which combines neural networks with word vectors; LLDA-XG performs outstandingly, with classification accuracy higher than the recurrent convolutional neural network, far less preprocessing, and a fast running time: an ordinary single computer obtains the result in four or five minutes, whereas the neural network requires a large amount of computation time.
Table 5: Classification time performance of LLDA-XG on the WebKB corpus

Number of features  215    411    925    7288
micro-F1            90.24  90.82  91.40  91.61
macro-F1            89.19  89.98  90.48  90.69
Time (s)            3.740  4.560  5.759  8.984
Table 6: Classification time performance of LLDA-XG on the news analysis corpus

Number of features  154    289    674    4154
P                   94.39  93.91  94.53  94.13
R                   90.02  90.73  92.34  91.44
F                   92.15  92.29  93.42  92.77
Time (s)            6.739  7.716  8.756  10.589
Table 7: Classification time performance of LLDA-XG on the Fudan University text classification corpus

Number of features  408      665      1329     6707
Accuracy            94.59    95.20    95.41    95.39
Time (s)            145.701  206.362  278.524  342.354
In Table 5, when the number of features rises from 215 to 411, micro-F1 rises by 0.58 and macro-F1 by 0.79; from 411 to 925, micro-F1 rises by 0.58 and macro-F1 by 0.5; from 925 to 7288, the number of features grows about 7-fold while micro-F1 and macro-F1 rise by only 0.21, a negligible gain, and the running time nearly doubles. In Table 6, when the number of features rises from 154 to 289, the precision P actually falls by 0.48 while the recall R rises by 0.71 and the F value by 0.14; from 289 to 674, precision rises by 0.62, recall by 1.61, and the F value by 1.13; but from 674 to 4154, precision falls by 0.4, recall by 0.9, and the F value by 0.65. On the Fudan University text classification task, accuracy rises by 0.61 when the number of features goes from 408 to 665 and by 0.21 from 665 to 1329, but falls by 0.02 from 1329 to 6707. Tables 5, 6, and 7 show that the more features the LLDA-XG model uses, the longer it takes, while classification performance grows slowly and even declines once there are too many features, probably overfitting caused by overtraining. This also illustrates the importance of feature selection: often a small minority of features has the largest influence on classification performance, while the feature set contains a large number of redundant or even noisy features; the more features, the longer the running time, with very limited performance gains or even overfitting. It also shows that Labeled-LDA has very strong feature selection ability: whether for Chinese or English text, long or short, it selects very high-quality features from a large pool, so the overall running time falls while performance stays stable and efficient. Moreover, the feature extraction of the invention is character-based and very convenient, requiring neither manual work nor expert knowledge and ample resources to extract features, and its simple preprocessing takes little time. Combined with the speed and efficiency of the XGBoost classification algorithm and its great ability to handle missing values, the overall performance is superior and stable. In summary, experiments verify that the proposed text classification model LLDA-XG can be widely applied to text classification tasks; its preprocessing is simple and fast, its overall running time and performance are both outstanding, and it has very high robustness and practical value.
Obviously, the above embodiment is merely an example given to clearly illustrate the present invention and is not a limitation on the embodiments of the invention. Those of ordinary skill in the art may make other changes in different forms on the basis of the above description; it is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the invention shall fall within the protection scope of the claims of the present invention.

Claims (3)

1. A text classification method based on the XGBoost classification algorithm, characterized by comprising the following steps:
S1. Obtain multiple samples, each sample comprising text content and the text's label;
S2. Divide all samples obtained in step S1 into training samples and prediction samples in a given proportion, the training samples forming the training set and the prediction samples forming the prediction set;
S3. For each training sample, separate every pair of adjacent characters in its text content with a space; then input the training sample's label as the Labeled-LDA label and the training sample's text content as the Labeled-LDA text;
S4. Set the number of Labeled-LDA iterations to K, then iteratively train on the training samples;
S5. After the iterations, each training sample yields two documents: one maps words to their word codes, the other maps topics to word codes, i.e., records the number of times each word code appears under each topic; merge the two documents to obtain the number of times each word in the training samples appears under each topic; for each topic, rank the words by occurrence count and select the m words most related to the topic as the training sample's LLDA words;
S6. For each training sample, count the occurrences in its text content of each LLDA word obtained in step S5 and take the count as the value of that feature; feed the resulting value of every sample on every LLDA word into the XGBoost classification algorithm, then train the XGBoost classifier;
S7. The model is now trained and the prediction set must be predicted: apply steps S3~S5 to the prediction set, then use the trained model to classify each prediction sample in the prediction set.
2. The text classification method based on the XGBoost classification algorithm according to claim 1, characterized in that in step S6, if an LLDA word occurs zero times in a text, that LLDA word is not fed into the XGBoost classification algorithm.
3. The text classification method based on the XGBoost classification algorithm according to claim 1, characterized in that K is 1000.
CN201710060026.9A 2017-01-24 2017-01-24 A kind of file classification method based on Xgboost sorting algorithm Active CN106815369B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710060026.9A CN106815369B (en) 2017-01-24 2017-01-24 A kind of file classification method based on Xgboost sorting algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710060026.9A CN106815369B (en) 2017-01-24 2017-01-24 A kind of file classification method based on Xgboost sorting algorithm

Publications (2)

Publication Number Publication Date
CN106815369A true CN106815369A (en) 2017-06-09
CN106815369B CN106815369B (en) 2019-09-20

Family

ID=59112460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710060026.9A Active CN106815369B (en) 2017-01-24 2017-01-24 A kind of file classification method based on Xgboost sorting algorithm

Country Status (1)

Country Link
CN (1) CN106815369B (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463965A (en) * 2017-08-16 2017-12-12 湖州易有科技有限公司 Deep learning-based fabric attribute image acquisition and recognition method and system
CN107798083A (en) * 2017-10-17 2018-03-13 广东广业开元科技有限公司 Big data-based information recommendation method, system and device
CN107992531A (en) * 2017-11-21 2018-05-04 吉浦斯信息咨询(深圳)有限公司 News personalized intelligent recommendation method and system based on deep learning
CN108090201A (en) * 2017-12-20 2018-05-29 珠海市君天电子科技有限公司 Article content classification method, apparatus and electronic device
CN108257052A (en) * 2018-01-16 2018-07-06 中南大学 Online student knowledge assessment method and system
CN108536650A (en) * 2018-04-03 2018-09-14 北京京东尚科信息技术有限公司 Method and device for generating gradient boosting tree model
CN108563638A (en) * 2018-04-13 2018-09-21 武汉大学 Microblog sentiment analysis method based on topic identification and ensemble learning
CN108627798A (en) * 2018-04-04 2018-10-09 北京工业大学 WLAN indoor positioning algorithm based on linear discriminant analysis and gradient boosting tree
CN108868805A (en) * 2018-06-08 2018-11-23 西安电子科技大学 Shield deviation correction method based on statistical analysis combined with XGBoost
CN108985335A (en) * 2018-06-19 2018-12-11 中国原子能科学研究院 Ensemble learning prediction method for void swelling of nuclear reactor cladding materials
CN109543187A (en) * 2018-11-23 2019-03-29 中山大学 Method, device and storage medium for generating electronic medical record features
CN109754002A (en) * 2018-12-24 2019-05-14 上海大学 Hybrid ensemble steganalysis method based on deep learning
CN109933667A (en) * 2019-03-19 2019-06-25 中国联合网络通信集团有限公司 Text classification model training method, text classification method and device
CN110318327A (en) * 2019-06-10 2019-10-11 长安大学 Road surface evenness prediction method based on random forest
CN110334732A (en) * 2019-05-20 2019-10-15 北京思路创新科技有限公司 Urban air pollution prediction method and device based on machine learning
CN111079031A (en) * 2019-12-27 2020-04-28 北京工业大学 Weighted classification method for blog disaster information importance based on deep learning and the XGBoost algorithm
WO2020082734A1 (en) * 2018-10-24 2020-04-30 平安科技(深圳)有限公司 Text emotion recognition method and apparatus, electronic device, and computer non-volatile readable storage medium
TWI705341B (en) * 2018-08-22 2020-09-21 香港商阿里巴巴集團服務有限公司 Feature relationship recommendation method and device, computing equipment and storage medium
CN111753058A (en) * 2020-06-30 2020-10-09 北京信息科技大学 Text viewpoint mining method and system
CN111831824A (en) * 2020-07-16 2020-10-27 民生科技有限责任公司 Public opinion positive and negative sentiment classification method
CN111859074A (en) * 2020-07-29 2020-10-30 东北大学 Internet public opinion information source influence assessment method and system based on deep learning
CN112699944A (en) * 2020-12-31 2021-04-23 中国银联股份有限公司 Order-return processing model training method, processing method, device, equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060142993A1 (en) * 2004-12-28 2006-06-29 Sony Corporation System and method for utilizing distance measures to perform text classification
CN104063472A (en) * 2014-06-30 2014-09-24 电子科技大学 KNN text classifying method for optimizing training sample set
CN104331498A (en) * 2014-11-19 2015-02-04 亚信科技(南京)有限公司 Method for automatically classifying webpage content visited by Internet users

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Wenbo: "A New Text Classification Algorithm Based on the Labeled-LDA Model", Chinese Journal of Computers *

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463965B (en) * 2017-08-16 2024-03-26 湖州易有科技有限公司 Deep learning-based fabric attribute picture acquisition and recognition method and recognition system
CN107463965A (en) * 2017-08-16 2017-12-12 湖州易有科技有限公司 Deep learning-based fabric attribute image acquisition and recognition method and system
CN107798083A (en) * 2017-10-17 2018-03-13 广东广业开元科技有限公司 Big data-based information recommendation method, system and device
CN107992531A (en) * 2017-11-21 2018-05-04 吉浦斯信息咨询(深圳)有限公司 News personalized intelligent recommendation method and system based on deep learning
CN107992531B (en) * 2017-11-21 2020-11-27 吉浦斯信息咨询(深圳)有限公司 News personalized intelligent recommendation method and system based on deep learning
CN108090201A (en) * 2017-12-20 2018-05-29 珠海市君天电子科技有限公司 Article content classification method, apparatus and electronic device
CN108257052B (en) * 2018-01-16 2022-04-22 中南大学 Online student knowledge assessment method and system
CN108257052A (en) * 2018-01-16 2018-07-06 中南大学 Online student knowledge assessment method and system
CN108536650B (en) * 2018-04-03 2022-04-26 北京京东尚科信息技术有限公司 Method and device for generating gradient boosting tree model
CN108536650A (en) * 2018-04-03 2018-09-14 北京京东尚科信息技术有限公司 Method and device for generating gradient boosting tree model
CN108627798A (en) * 2018-04-04 2018-10-09 北京工业大学 WLAN indoor positioning algorithm based on linear discriminant analysis and gradient boosting tree
CN108627798B (en) * 2018-04-04 2022-03-11 北京工业大学 WLAN indoor positioning algorithm based on linear discriminant analysis and gradient boosting tree
CN108563638A (en) * 2018-04-13 2018-09-21 武汉大学 Microblog sentiment analysis method based on topic identification and ensemble learning
CN108563638B (en) * 2018-04-13 2021-08-10 武汉大学 Microblog sentiment analysis method based on topic identification and ensemble learning
CN108868805A (en) * 2018-06-08 2018-11-23 西安电子科技大学 Shield deviation correction method based on statistical analysis combined with XGBoost
CN108868805B (en) * 2018-06-08 2019-11-15 西安电子科技大学 Shield deviation correction method based on statistical analysis combined with XGBoost
CN108985335A (en) * 2018-06-19 2018-12-11 中国原子能科学研究院 Ensemble learning prediction method for void swelling of nuclear reactor cladding materials
CN108985335B (en) * 2018-06-19 2021-04-27 中国原子能科学研究院 Ensemble learning prediction method for irradiation swelling of nuclear reactor cladding material
TWI705341B (en) * 2018-08-22 2020-09-21 香港商阿里巴巴集團服務有限公司 Feature relationship recommendation method and device, computing equipment and storage medium
US11244232B2 (en) 2018-08-22 2022-02-08 Advanced New Technologies Co., Ltd. Feature relationship recommendation method, apparatus, computing device, and storage medium
WO2020082734A1 (en) * 2018-10-24 2020-04-30 平安科技(深圳)有限公司 Text emotion recognition method and apparatus, electronic device, and computer non-volatile readable storage medium
CN109543187A (en) * 2018-11-23 2019-03-29 中山大学 Method, device and storage medium for generating electronic medical record features
CN109543187B (en) * 2018-11-23 2021-09-17 中山大学 Method, device and storage medium for generating electronic medical record features
CN109754002A (en) * 2018-12-24 2019-05-14 上海大学 Hybrid ensemble steganalysis method based on deep learning
CN109933667A (en) * 2019-03-19 2019-06-25 中国联合网络通信集团有限公司 Text classification model training method, text classification method and device
CN110334732A (en) * 2019-05-20 2019-10-15 北京思路创新科技有限公司 Urban air pollution prediction method and device based on machine learning
CN110318327A (en) * 2019-06-10 2019-10-11 长安大学 Road surface evenness prediction method based on random forest
CN111079031B (en) * 2019-12-27 2023-09-12 北京工业大学 Weighted classification method for blog disaster information importance based on deep learning and the XGBoost algorithm
CN111079031A (en) * 2019-12-27 2020-04-28 北京工业大学 Weighted classification method for blog disaster information importance based on deep learning and the XGBoost algorithm
CN111753058B (en) * 2020-06-30 2023-06-02 北京信息科技大学 Text viewpoint mining method and system
CN111753058A (en) * 2020-06-30 2020-10-09 北京信息科技大学 Text viewpoint mining method and system
CN111831824B (en) * 2020-07-16 2024-02-09 民生科技有限责任公司 Public opinion positive and negative sentiment classification method
CN111831824A (en) * 2020-07-16 2020-10-27 民生科技有限责任公司 Public opinion positive and negative sentiment classification method
CN111859074A (en) * 2020-07-29 2020-10-30 东北大学 Internet public opinion information source influence assessment method and system based on deep learning
CN111859074B (en) * 2020-07-29 2023-12-29 东北大学 Network public opinion information source influence evaluation method and system based on deep learning
CN112699944A (en) * 2020-12-31 2021-04-23 中国银联股份有限公司 Order-return processing model training method, processing method, device, equipment and medium
CN112699944B (en) * 2020-12-31 2024-04-23 中国银联股份有限公司 Order-return processing model training method, processing method, device, equipment and medium

Also Published As

Publication number Publication date
CN106815369B (en) 2019-09-20

Similar Documents

Publication Publication Date Title
CN106815369B (en) Text classification method based on the XGBoost classification algorithm
Hidayat et al. Sentiment analysis of twitter data related to Rinca Island development using Doc2Vec and SVM and logistic regression as classifier
CN109740154A Fine-grained sentiment analysis method for online comments based on multi-task learning
CN111125358B (en) Text classification method based on hypergraph
Li et al. Nonparametric bayes pachinko allocation
CN107301171A Text sentiment analysis method and system based on sentiment dictionary learning
CN107025284A Recognition method for sentiment tendency of online comment text and convolutional neural network model
CN109376242A Text classification algorithm based on recurrent neural network variants and convolutional neural networks
CN107944480A Enterprise industry classification method
CN108388651A Text classification method based on graph kernels and convolutional neural networks
CN108108355A Text sentiment analysis method and system based on deep learning
CN111339754B (en) Case public opinion abstract generation method based on case element sentence association graph convolution
CN109508453A Cross-media information target component correlation analysis system and association analysis method
CN105868184A (en) Chinese name recognition method based on recurrent neural network
CN109189926A Construction method for a technical paper corpus
CN101587493A (en) Text classification method
CN105373606A (en) Unbalanced data sampling method in improved C4.5 decision tree algorithm
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN106815310A Hierarchical clustering method and system for massive document sets
CN107145516A Text clustering method and system
CN107679135A Topic detection and tracking method and device for network text big data
CN109062958B (en) Primary school composition automatic classification method based on TextRank and convolutional neural network
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
CN111813939A (en) Text classification method based on representation enhancement and fusion
CN103268346A (en) Semi-supervised classification method and semi-supervised classification system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant