CN106815369A - A kind of file classification method based on Xgboost sorting algorithms - Google Patents
- Publication number
- CN106815369A CN106815369A CN201710060026.9A CN201710060026A CN106815369A CN 106815369 A CN106815369 A CN 106815369A CN 201710060026 A CN201710060026 A CN 201710060026A CN 106815369 A CN106815369 A CN 106815369A
- Authority
- CN
- China
- Prior art keywords
- text
- word
- xgboost
- sample
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
Abstract
The method provided by the present invention extracts label words with Labeled LDA to compute feature values, and then performs text classification with the Xgboost classification algorithm. Compared with methods that build the feature space from an ordinary vector space model and classify with a common classification algorithm, it consumes far less memory. The reason is that Chinese texts contain many millions of distinct words, so taking words as features yields a very high-dimensional space that consumes enormous memory and may not even be processable on a single machine, whereas there are no more than 10,000 Chinese characters in common use, and often only about 1,000 actually occur, so taking characters as features greatly reduces the dimensionality; moreover, Xgboost accepts input in dictionary rather than matrix form. The present invention also proposes a novel supervised feature selection algorithm with latent semantics, the Labeled LDA algorithm: performing feature selection with Labeled LDA both mines the semantic information of a large corpus through LDA and exploits the category information carried by the text labels. In addition, preprocessing is simple, no carefully hand-crafted features are needed, and the powerful distributed ensemble learning algorithm Xgboost further improves the accuracy and performance of classification.
Description
Technical field
The present invention relates to the field of text classification, and in particular to a text classification method based on the Xgboost classification algorithm.
Background technology
Text classification methods have been widely applied in fields such as search engines, personalized recommendation systems and public opinion monitoring, and are an important part of efficiently managing and accurately locating massive amounts of information.
The conventional framework of a text classification method is based on a machine learning classification algorithm and comprises steps such as data preprocessing, feature extraction, feature selection and classification.
Feature extraction represents a text with a unified method and model that captures the text's features and can easily be converted into mathematical language, and hence into a mathematical model that a computer can process. Currently popular text representation approaches include the vector space model and latent semantic analysis models. The vector space model has the advantages of simplicity, convenient computation and little preprocessing, but also the drawbacks of ignoring the semantic relations among features and representing the text at insufficient depth.
Feature selection is one of the key problems of machine learning, and the quality of its result directly affects a classifier's accuracy and generalization ability. Feature selection picks out the most effective features from a set of features so as to reduce the dimensionality of the feature space, remove irrelevant or redundant features, reduce running time, improve the interpretability of the analysis results, and uncover structure hidden in high-dimensional data. Depending on whether the data carry category information, feature selection divides into supervised and unsupervised methods. Feature selection methods commonly used in text classification include document frequency, mutual information, information gain and the chi-square statistic (CHI).
Classification is the final and most important step of text classification. The naive Bayes algorithm is a typical classification algorithm: it computes, via Bayes' formula, the probability that a text belongs to a particular category. It needs very few estimated parameters, is not very sensitive to missing data, is simple and fast, and in theory has the minimum error rate compared with other classification algorithms; in practice, however, this is not always the case, because the method assumes that the attributes are mutually independent, an assumption that often fails in real applications. Decision tree algorithms are easy to understand and interpret, can handle both numerical and categorical attributes, and can produce feasible, effective results on large data sources in a relatively short time, but decision trees struggle with missing data and tend to ignore the correlations among the attributes of a dataset. Other classification methods include LR, KNN and SVM. All of these are base learners; ensemble learning, by combining multiple learners, can often achieve generalization performance significantly superior to a single learner. According to how the individual learners are generated, current ensemble methods fall into two classes: sequential methods, in which strong dependencies exist among the individual learners, which must therefore be generated serially, and parallel methods, in which no strong dependencies exist and the learners can be generated simultaneously. The former is represented by Boosting, the latter by Bagging and random forests. From the standpoint of the bias-variance decomposition, Boosting mainly reduces bias, so it can build a very strong ensemble from base learners with quite weak generalization performance, while Bagging mainly reduces variance, so its effect is more pronounced on learners that are easily perturbed by the samples, such as unpruned decision trees and neural networks. The Xgboost classification algorithm is an ensemble method based on Boosting; compared with the traditional ensemble algorithm GBDT, it mainly has the following eight advantages:
First, whereas traditional GBDT uses CART as the base classifier, the Xgboost classification algorithm also supports linear classifiers.
Second, traditional GBDT uses only first-order derivative information in its optimization, while the Xgboost classification algorithm performs a second-order Taylor expansion of the cost function and uses both first- and second-order derivatives.
Third, the Xgboost classification algorithm adds a regularization term to the cost function to control model complexity. The regularization term comprises the number of leaf nodes of the tree and the sum of squares of the L2 norm of the scores output at each leaf node. From the standpoint of the bias-variance trade-off, the regularization term reduces the model's variance, makes the learned model simpler, and prevents overfitting.
Fourth, shrinkage, which is equivalent to a learning rate. After each iteration, the Xgboost classification algorithm multiplies the leaf weights by this coefficient, mainly to weaken the influence of each individual tree and leave more learning room for the trees that follow.
Fifth, column subsampling. The Xgboost classification algorithm borrows this practice from random forests; it not only reduces overfitting but also reduces computation.
Sixth, handling of missing values. For samples with missing feature values, the Xgboost classification algorithm can automatically learn their split direction.
Seventh, parallelism. Boosting is a serial procedure, so the parallelism of the Xgboost classification algorithm is not parallelism across trees; each iteration can start only after the previous one has finished (the cost function of the t-th iteration contains the predictions of iteration t-1). Rather, the Xgboost classification algorithm is parallel at the feature granularity. The most time-consuming step of decision tree learning is sorting the feature values (in order to determine the best split point); the Xgboost classification algorithm pre-sorts the data before training and stores the result in a block structure that is reused in later iterations, greatly reducing computation. This block structure also makes parallelism possible: when splitting a node, the gain of each feature must be computed and the feature with the maximum gain is chosen for the split, so the gain computations of the different features can be carried out in multiple threads.
Eighth, a parallelizable approximate histogram algorithm. When splitting a tree node, the gain of every candidate split point of every feature must be computed, i.e., all possible split points are enumerated greedily. When the data cannot be loaded into memory at once, or in the distributed case, this greedy algorithm becomes very inefficient, so the Xgboost classification algorithm also proposes a parallelizable approximate histogram algorithm to generate candidate split points efficiently.
Content of the invention
To solve the defects of the above prior art, namely low classification performance, high memory consumption and low classification accuracy, the present invention provides a text classification method based on the Xgboost classification algorithm.
To achieve the above object of the invention, the technical scheme adopted is:
A text classification method based on the Xgboost classification algorithm, comprising the following steps:
S1. Obtain multiple samples, each sample comprising the text content and the label of the text;
S2. Divide all samples obtained in step S1 into training samples and prediction samples in a certain proportion, the training samples forming the training set and the prediction samples forming the prediction set;
S3. For each training sample, separate every pair of adjacent characters in its text content with a space, then feed the label of the training sample to Labeled-LDA as the label input and the text content of the training sample to Labeled-LDA as the text input;
S4. Set the number of Labeled-LDA iterations to K, then iteratively train on the training samples;
S5. After the iterations, obtain two documents for each training sample: one on words and their corresponding word codes, the other on topics and word codes, i.e., the number of times each word code occurs under each topic. Merge the two documents to obtain the number of times each word of the training samples occurs under each topic. For each topic, sort the words by occurrence count and select the m words most related to the topic as the LLDA words of the training samples;
S6. For each training sample, count the number of occurrences in its text content of each LLDA word obtained in step S5, take that count as the value of the feature, so that each sample's value on each LLDA word is obtained, input these values into the Xgboost classification algorithm, and then train the Xgboost classification algorithm;
S7. The model is now trained and the prediction set must be predicted: apply steps S3 to S5 to the prediction set, then use the trained model to classify each prediction sample in the prediction set.
Preferably, in step S6, if an LLDA word occurs 0 times in a text, that LLDA word is not input into the Xgboost classification algorithm; that is, whereas common classification algorithms take matrix-form input, this method takes dictionary-format input, which saves memory and handles missing values more conveniently, quickly and reasonably.
Preferably, the K is 1000.
Compared with the prior art, the beneficial effects of the invention are:
The method provided by the present invention extracts label words with Labeled-LDA to compute feature values, and then performs text classification with the Xgboost classification algorithm. Compared with methods that build the feature space from an ordinary vector space model and classify with a common classification algorithm, it consumes far less memory. This is because Chinese texts contain many millions of distinct words, so taking words as features yields a very high-dimensional space that consumes enormous memory and may not even be processable on a single machine, whereas there are no more than 10,000 Chinese characters in common use, and often only about 1,000 actually occur, so taking characters as features greatly reduces the dimensionality; moreover, Xgboost accepts input in dictionary rather than matrix form. The present invention also proposes a novel supervised feature selection algorithm with latent semantics, the Labeled-LDA algorithm: performing feature selection with Labeled-LDA both mines the semantic information of a large corpus through LDA and exploits the category information carried by the text labels. In addition, preprocessing is simple, no carefully hand-crafted features are needed, and the powerful distributed ensemble learning algorithm Xgboost further improves the accuracy and performance of classification.
Brief description of the drawings
Fig. 1 is a flow chart of the method.
Specific embodiment
The accompanying drawing is for illustration only and cannot be construed as a limitation of this patent.
The present invention is further elaborated below with reference to the drawing and the embodiments.
Embodiment 1
This embodiment comprises 3 concrete cases, which classify 3 text corpora with different characteristics: a public English corpus, WebKB, from which samples without any content are removed, and two Chinese corpora, one of which is a public long-text corpus, the Fudan University text classification corpus, and the other a highly imbalanced Chinese short-text corpus of news comments, divided into two categories, normal and advertisement, with a positive-to-negative ratio on the order of 2742/42416 = 0.065.
Table 1. Summary of the text classification datasets
As shown in Fig. 1, the specific implementation steps of the present text classification method, which uses Labeled-LDA for feature extraction and is based on the Xgboost classification algorithm, are as follows:
Step 1: Text preprocessing
Prepare a batch of labelled text sets in advance, such as the 3 cases above. Randomly divide training and prediction sets in an 8:2 ratio (news comments), or use the existing split directly if the public dataset already has one (WebKB and the Fudan University text classification corpus). Clean all texts and unify the encoding to UTF-8. For Chinese text, separate every character with a space for the convenience of later processing. Since the present invention is based on character features, the commonly used characters number no more than 10,000, and Labeled-LDA has a powerful feature selection ability, the preprocessing steps conventional in Chinese text classification, such as removing punctuation, digits and stop words, can be dispensed with; the experimental results also show that such preprocessing is unnecessary. For English text, preprocessing consists of converting all capitals to lower case and replacing punctuation with spaces, which better delimits word boundaries and conforms to English writing style.
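The two preprocessing routes described in Step 1 can be sketched as follows; the function names are illustrative, not from the patent.

```python
import re

def preprocess_zh(text):
    # Chinese route: strip existing whitespace, then insert a space
    # between every pair of adjacent characters, so each character
    # becomes one token (the invention uses character features)
    return " ".join(text.replace(" ", ""))

def preprocess_en(text):
    # English route: lowercase everything and replace punctuation
    # (any non-alphanumeric run) with a space to expose word boundaries
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

zh = preprocess_zh("机器学习")
en = preprocess_en("Hello, World!")
```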
Step 2: Feature selection on the training set with Labeled-LDA
The experiments of the present invention use the Stanford natural language processing tool TMT. First set the parameters of Labeled-LDA, including the text segmentation mode, the text filtering mode, the number of labels, the number of iterations, etc. The text segmentation mode splits on spaces, the number of labels depends on the specific corpus, and the number of iterations is uniformly 1000. After training ends, two files are obtained: one mapping words to codes, the other giving the number of times each word code occurs under each topic. Merging the two files yields a file of the number of times each word occurs under each topic. Then, under each topic, sort the words by occurrence count and take the top N with the highest counts; the resulting words are exactly those retained after Labeled-LDA feature selection. These words may contain repetitions; after removing duplicates they are counted as the LLDA words. They carry richer semantic information and, being the result of training under the strong supervisory signal of the labels, have a very strong feature representation ability.
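The merge-and-select procedure of Step 2 can be sketched in-memory as below. The two dictionaries stand in for TMT's two output files (word-to-code mapping and per-topic code counts); all names, topics and counts are hypothetical.

```python
def select_llda_words(word_to_code, topic_code_counts, m=2):
    # merge the two "files": invert the word->code map, then for each
    # topic sort its codes by occurrence count and keep the top m words;
    # the set collapses duplicates across topics, as in Step 2
    code_to_word = {c: w for w, c in word_to_code.items()}
    selected = set()
    for topic, counts in topic_code_counts.items():
        top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:m]
        selected.update(code_to_word[c] for c, _ in top)
    return selected

# Hypothetical stand-ins for TMT's outputs:
word_to_code = {"股票": 0, "比赛": 1, "进球": 2, "基金": 3}
topic_code_counts = {
    "finance": {0: 12, 3: 9, 1: 1},
    "sports":  {1: 10, 2: 8, 0: 1},
}
llda_words = select_llda_words(word_to_code, topic_code_counts)
```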
Step 3: Convert the training set and prediction set into the input format of Xgboost
The input format of Xgboost is simple, convenient and reasonable. For each sample of the training set and the prediction set, since characters are treated as features, whenever the character currently being processed is an LLDA word, the feature value of the corresponding LLDA-word feature is incremented by 1; for each sample, any LLDA word that does not occur leaves the corresponding feature value at 0, and that feature is simply not included in the Xgboost input;
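A minimal sketch of Step 3, assuming a hypothetical LLDA-word-to-index map: counts are accumulated per sample, zero-count features are omitted, and the result is serialized in the sparse libsvm style that Xgboost accepts as dictionary-form input.

```python
def to_sparse_features(text_chars, llda_index):
    # llda_index maps each selected LLDA word (character) to a feature
    # index; count occurrences in the space-separated sample and keep
    # only nonzero entries, so absent LLDA words are never emitted
    feats = {}
    for ch in text_chars.split():
        if ch in llda_index:
            idx = llda_index[ch]
            feats[idx] = feats.get(idx, 0) + 1
    return feats

def to_libsvm_line(label, feats):
    # sparse "label idx:value ..." line, the dictionary-style format
    return str(label) + " " + " ".join(
        f"{i}:{v}" for i, v in sorted(feats.items()))

llda_index = {"股": 0, "票": 1, "球": 2}  # hypothetical LLDA words
line = to_libsvm_line(1, to_sparse_features("股 票 大 涨 股", llda_index))
```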
Step 4: Parameter setting, training and prediction of the Xgboost classification algorithm
Set the parameters of the Xgboost classification algorithm and choose those with the best overall effect. The final parameters are: 200 iterations; the number of classes (depending on the corpus); classifier type gbtree; objective function multi:softmax; learning rate eta = 0.3; maximum tree depth max_depth = 6; etc. The trained model is then used to classify the prediction set;
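The configuration stated in Step 4, expressed as an xgboost parameter dict. The num_class value and the training call are assumptions sketched in comments, since the source does not show the exact dataset handles.

```python
params = {
    "objective": "multi:softmax",  # multiclass objective with hard label output
    "num_class": 4,                # e.g. WebKB's 4 categories (assumption)
    "eta": 0.3,                    # learning rate / shrinkage, per Step 4
    "max_depth": 6,                # maximum tree depth, per Step 4
    "booster": "gbtree",           # tree base learner
}
num_round = 200                    # iteration count, per Step 4

# Typical training call (requires the xgboost package and data on disk):
# import xgboost as xgb
# dtrain = xgb.DMatrix("train.libsvm")
# model = xgb.train(params, dtrain, num_boost_round=num_round)
```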
Step 5: Evaluating model performance
Because the datasets differ greatly, different evaluation criteria are used for different datasets so that the criteria are more reasonable. WebKB has 4 categories and is evaluated with the micro-averaged F1 value (micro-F1) and the macro-averaged F1 value (macro-F1);
these are defined as follows:
macro-F1 = (1/n) Σ_i 2 P_i R_i / (P_i + R_i),  micro-F1 = 2 P_micro R_micro / (P_micro + R_micro),
where P_i is the precision of the i-th class, R_i is the recall of the i-th class, n is the number of classes, and P_micro and R_micro are the precision and recall obtained by pooling the counts of all classes. The Fudan University text classification corpus is evaluated with the micro-averaged precision (micro-P); the news comment corpus, because its classes are extremely imbalanced, is evaluated with the precision P, recall R and F value of the negative samples.
The model defined by the present invention is denoted LLDA-XG, and the experimental results are as follows:
Table 2. Classification performance on the WebKB corpus

| | RCC | Centroid | NB | Winnow | SVM | LLDA-XG |
|---|---|---|---|---|---|---|
| micro_F1 | 89.64 | 79.95 | 86.45 | 81.40 | 89.26 | 91.40 |
| macro_F1 | 88.02 | 78.47 | 86.01 | 77.85 | 87.66 | 90.48 |
Table 3. Classification performance on the news comment corpus

| | Bayes | SGDC | Bagging+KNN | SVM | RF | GBDT | Adaboost | LLDA-XG |
|---|---|---|---|---|---|---|---|---|
| P | 69.55 | 77.71 | 97.00 | 93.65 | 96.4 | 92.22 | 95.80 | 94.52 |
| R | 91.62 | 91.98 | 85.20 | 86.81 | 89.84 | 88.77 | 85.56 | 92.34 |
| F value | 79.08 | 84.24 | 90.70 | 90.10 | 92.99 | 90.46 | 90.40 | 93.42 |
Table 4. Classification performance on the Fudan University text classification corpus
As can be seen from Tables 2, 3 and 4, the LLDA-XG model achieves good results on the WebKB corpus, the news comment corpus and the Fudan University text classification corpus. On the WebKB corpus, both micro_F1 and macro_F1 exceed 90. On the news comment corpus, the precision of the LLDA-XG model is not the highest, but its recall and F value are both the highest, showing that the LLDA-XG model can maintain a very high precision while also maintaining a very high recall, i.e., it emphasizes balanced performance. On the Fudan University text classification corpus, the experimental reference is the doctoral thesis of Lai Siwei of the University of the Chinese Academy of Sciences, which combines neural networks with word vectors; the LLDA-XG model performs outstandingly, with a classification accuracy higher than that of the recurrent convolutional neural network, very little preprocessing, and a fast running time, obtaining results in four or five minutes on an ordinary single computer, whereas the neural network requires a large amount of computation time.
Table 5. Classification time performance of LLDA-XG on the WebKB corpus

| Number of features | 215 | 411 | 925 | 7288 |
|---|---|---|---|---|
| micro-F1 | 90.24 | 90.82 | 91.40 | 91.61 |
| macro-F1 | 89.19 | 89.98 | 90.48 | 90.69 |
| Time (seconds) | 3.740 | 4.560 | 5.759 | 8.984 |
Table 6. Classification time performance of LLDA-XG on the news comment corpus

| Number of features | 154 | 289 | 674 | 4154 |
|---|---|---|---|---|
| P | 94.39 | 93.91 | 94.53 | 94.13 |
| R | 90.02 | 90.73 | 92.34 | 91.44 |
| F | 92.15 | 92.29 | 93.42 | 92.77 |
| Time (seconds) | 6.739 | 7.716 | 8.756 | 10.589 |
Table 7. Classification time performance of LLDA-XG on the Fudan University text classification corpus

| Number of features | 408 | 665 | 1329 | 6707 |
|---|---|---|---|---|
| Accuracy | 94.59 | 95.20 | 95.41 | 95.39 |
| Time (seconds) | 145.701 | 206.362 | 278.524 | 342.354 |
In Table 5, when the number of features increases from 215 to 411, micro-F1 increases by 0.58 and macro-F1 by 0.79; from 411 to 925, micro-F1 increases by 0.58 and macro-F1 by 0.5; from 925 to 7288, the number of features grows roughly 7-fold, yet micro-F1 and macro-F1 increase by only 0.21, an insignificant gain, while the running time nearly doubles. In Table 6, when the number of features increases from 154 to 289, precision P actually declines by 0.48 while recall R grows by 0.71 and the F value by 0.14; from 289 to 674, precision increases by 0.62, recall by 1.61 and the F value by 1.13; but from 674 to 4154, precision declines by 0.4, recall by 0.9 and the F value by 0.65. On the Fudan University text classification task, when the number of features increases from 408 to 665, accuracy increases by 0.61; from 665 to 1329, accuracy increases by 0.21; and from 1329 to 6707, accuracy drops by 0.02. Tables 5, 6 and 7 show that for the LLDA-XG model, the more features there are, the longer the elapsed time, yet the classification performance grows slowly and may even decline as more features are added, probably a sign of overfitting caused by overtraining. This also illustrates the importance of feature selection: often a small number of features have the greatest influence on classification performance, while the feature set contains a large number of redundant and even noisy features, so more features consume more running time with very limited performance gain and may even cause overfitting. It also shows that Labeled-LDA has a very strong feature selection ability: whether for Chinese or English text, long text or short text, it selects features of very high quality from a large set of features, so that the overall running time decreases while performance remains stable and efficient. Moreover, the feature extraction of the present invention is based on characters and is very convenient; no manual effort or expert knowledge is needed to spend ample resources extracting features, and preprocessing is simple with little running time. Combined with the fast and efficient Xgboost classification algorithm, which has a great ability to handle missing values, the overall performance is superior and stable. In short, experiments verify that the text classification model LLDA-XG proposed by the present invention can be widely used in text classification tasks; its preprocessing is simple and quick, its overall running time and performance are both outstanding, and it has very high robustness and practical value.
Obviously, the above embodiments of the present invention are merely examples given for the clear illustration of the present invention and are not a limitation of its implementation. For those of ordinary skill in the field, changes of other different forms can also be made on the basis of the above description. It is neither necessary nor possible to exhaustively list all implementations. Any modification, equivalent replacement and improvement made within the spirit and principle of the present invention shall be included within the protection scope of the claims of the present invention.
Claims (3)
1. A text classification method based on the Xgboost classification algorithm, characterized by comprising the following steps:
S1. Obtain multiple samples, each sample comprising the text content and the label of the text;
S2. Divide all samples obtained in step S1 into training samples and prediction samples in a certain proportion, the training samples forming the training set and the prediction samples forming the prediction set;
S3. For each training sample, separate every pair of adjacent characters in its text content with a space, then feed the label of the training sample to Labeled-LDA as the label input and the text content of the training sample to Labeled-LDA as the text input;
S4. Set the number of Labeled-LDA iterations to K, then iteratively train on the training samples;
S5. After the iterations, obtain two documents for each training sample: one on words and their corresponding word codes, the other on topics and word codes, i.e., the number of times each word code occurs under each topic. Merge the two documents to obtain the number of times each word of the training samples occurs under each topic. For each topic, sort the words by occurrence count and select the m words most related to the topic as the LLDA words of the training samples;
S6. For each training sample, count the number of occurrences in its text content of each LLDA word obtained in step S5, take that count as the value of the feature, so that each sample's value on each LLDA word is obtained, input these values into the Xgboost classification algorithm, and then train the Xgboost classification algorithm;
S7. The model is now trained and the prediction set must be predicted: apply steps S3 to S5 to the prediction set, then use the trained model to classify each prediction sample in the prediction set.
2. The text classification method based on the Xgboost classification algorithm according to claim 1, characterized in that: in step S6, if an LLDA word occurs 0 times in a text, the LLDA word is not input into the Xgboost classification algorithm.
3. The text classification method based on the Xgboost classification algorithm according to claim 1, characterized in that: K is 1000.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710060026.9A CN106815369B (en) | 2017-01-24 | 2017-01-24 | A kind of file classification method based on Xgboost sorting algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106815369A true CN106815369A (en) | 2017-06-09 |
CN106815369B CN106815369B (en) | 2019-09-20 |
Family
ID=59112460
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710060026.9A Active CN106815369B (en) | 2017-01-24 | 2017-01-24 | A kind of file classification method based on Xgboost sorting algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106815369B (en) |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107463965A (en) * | 2017-08-16 | 2017-12-12 | 湖州易有科技有限公司 | Fabric attribute picture collection and recognition methods and identifying system based on deep learning |
2017
- 2017-01-24 CN CN201710060026.9A patent/CN106815369B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060142993A1 (en) * | 2004-12-28 | 2006-06-29 | Sony Corporation | System and method for utilizing distance measures to perform text classification |
CN104063472A (en) * | 2014-06-30 | 2014-09-24 | 电子科技大学 | KNN text classification method with optimized training sample set |
CN104331498A (en) * | 2014-11-19 | 2015-02-04 | 亚信科技(南京)有限公司 | Method for automatically classifying webpage content visited by Internet users |
Non-Patent Citations (1)
Title |
---|
Li Wenbo: "A New Text Classification Algorithm Based on the Labeled-LDA Model", Chinese Journal of Computers *
Cited By (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107463965B (en) * | 2017-08-16 | 2024-03-26 | 湖州易有科技有限公司 | Deep learning-based fabric attribute picture acquisition and recognition method and recognition system |
CN107463965A (en) * | 2017-08-16 | 2017-12-12 | 湖州易有科技有限公司 | Deep learning-based fabric attribute picture acquisition and recognition method and recognition system |
CN107798083A (en) * | 2017-10-17 | 2018-03-13 | 广东广业开元科技有限公司 | Big data-based information recommendation method, system and device |
CN107992531A (en) * | 2017-11-21 | 2018-05-04 | 吉浦斯信息咨询(深圳)有限公司 | News personalized intelligent recommendation method and system based on deep learning |
CN107992531B (en) * | 2017-11-21 | 2020-11-27 | 吉浦斯信息咨询(深圳)有限公司 | News personalized intelligent recommendation method and system based on deep learning |
CN108090201A (en) * | 2017-12-20 | 2018-05-29 | 珠海市君天电子科技有限公司 | Article content classification method, apparatus and electronic device |
CN108257052B (en) * | 2018-01-16 | 2022-04-22 | 中南大学 | Online student knowledge assessment method and system |
CN108257052A (en) * | 2018-01-16 | 2018-07-06 | 中南大学 | Online student knowledge assessment method and system |
CN108536650B (en) * | 2018-04-03 | 2022-04-26 | 北京京东尚科信息技术有限公司 | Method and device for generating gradient boosting tree model |
CN108536650A (en) * | 2018-04-03 | 2018-09-14 | 北京京东尚科信息技术有限公司 | Method and device for generating gradient boosting tree model |
CN108627798A (en) * | 2018-04-04 | 2018-10-09 | 北京工业大学 | WLAN indoor positioning algorithm based on linear discriminant analysis and gradient boosting tree |
CN108627798B (en) * | 2018-04-04 | 2022-03-11 | 北京工业大学 | WLAN indoor positioning algorithm based on linear discriminant analysis and gradient boosting tree |
CN108563638A (en) * | 2018-04-13 | 2018-09-21 | 武汉大学 | Microblog sentiment analysis method based on topic identification and ensemble learning |
CN108563638B (en) * | 2018-04-13 | 2021-08-10 | 武汉大学 | Microblog sentiment analysis method based on topic identification and ensemble learning |
CN108868805A (en) * | 2018-06-08 | 2018-11-23 | 西安电子科技大学 | Shield deviation correction method based on statistical analysis combined with XGBoost |
CN108868805B (en) * | 2018-06-08 | 2019-11-15 | 西安电子科技大学 | Shield deviation correction method based on statistical analysis combined with XGBoost |
CN108985335A (en) * | 2018-06-19 | 2018-12-11 | 中国原子能科学研究院 | Ensemble learning prediction method for void swelling of nuclear reactor cladding materials |
CN108985335B (en) * | 2018-06-19 | 2021-04-27 | 中国原子能科学研究院 | Ensemble learning prediction method for irradiation swelling of nuclear reactor cladding material |
TWI705341B (en) * | 2018-08-22 | 2020-09-21 | 香港商阿里巴巴集團服務有限公司 | Feature relationship recommendation method and device, computing equipment and storage medium |
US11244232B2 (en) | 2018-08-22 | 2022-02-08 | Advanced New Technologies Co., Ltd. | Feature relationship recommendation method, apparatus, computing device, and storage medium |
WO2020082734A1 (en) * | 2018-10-24 | 2020-04-30 | 平安科技(深圳)有限公司 | Text emotion recognition method and apparatus, electronic device, and computer non-volatile readable storage medium |
CN109543187A (en) * | 2018-11-23 | 2019-03-29 | 中山大学 | Method, device and storage medium for generating electronic medical record features |
CN109543187B (en) * | 2018-11-23 | 2021-09-17 | 中山大学 | Method, device and storage medium for generating electronic medical record features |
CN109754002A (en) * | 2018-12-24 | 2019-05-14 | 上海大学 | Hybrid ensemble steganalysis method based on deep learning |
CN109933667A (en) * | 2019-03-19 | 2019-06-25 | 中国联合网络通信集团有限公司 | Text classification model training method, text classification method and device |
CN110334732A (en) * | 2019-05-20 | 2019-10-15 | 北京思路创新科技有限公司 | Urban air pollution method and device based on machine learning |
CN110318327A (en) * | 2019-06-10 | 2019-10-11 | 长安大学 | Surface evenness prediction method based on random forest |
CN111079031B (en) * | 2019-12-27 | 2023-09-12 | 北京工业大学 | Weighted classification method for blog disaster information importance based on deep learning and the XGBoost algorithm |
CN111079031A (en) * | 2019-12-27 | 2020-04-28 | 北京工业大学 | Weighted classification method for blog disaster information importance based on deep learning and the XGBoost algorithm |
CN111753058B (en) * | 2020-06-30 | 2023-06-02 | 北京信息科技大学 | Text viewpoint mining method and system |
CN111753058A (en) * | 2020-06-30 | 2020-10-09 | 北京信息科技大学 | Text viewpoint mining method and system |
CN111831824B (en) * | 2020-07-16 | 2024-02-09 | 民生科技有限责任公司 | Public opinion positive and negative classification method |
CN111831824A (en) * | 2020-07-16 | 2020-10-27 | 民生科技有限责任公司 | Public opinion positive and negative classification method |
CN111859074A (en) * | 2020-07-29 | 2020-10-30 | 东北大学 | Internet public opinion information source influence assessment method and system based on deep learning |
CN111859074B (en) * | 2020-07-29 | 2023-12-29 | 东北大学 | Network public opinion information source influence evaluation method and system based on deep learning |
CN112699944A (en) * | 2020-12-31 | 2021-04-23 | 中国银联股份有限公司 | Order-returning processing model training method, processing method, device, equipment and medium |
CN112699944B (en) * | 2020-12-31 | 2024-04-23 | 中国银联股份有限公司 | Training method, processing method, device, equipment and medium for order-returning processing model |
Also Published As
Publication number | Publication date |
---|---|
CN106815369B (en) | 2019-09-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106815369B (en) | Text classification method based on the Xgboost classification algorithm | |
Hidayat et al. | Sentiment analysis of twitter data related to Rinca Island development using Doc2Vec and SVM and logistic regression as classifier
CN109740154A (en) | Fine-grained sentiment analysis method for online comments based on multi-task learning
CN111125358B (en) | Text classification method based on hypergraph
Li et al. | Nonparametric bayes pachinko allocation
CN107301171A (en) | Text sentiment analysis method and system based on sentiment dictionary learning
CN107025284A (en) | Recognition method for sentiment tendency of network comment text and convolutional neural network model
CN109376242A (en) | Text classification algorithm based on recurrent neural network variants and convolutional neural networks
CN107944480A (en) | Enterprise industry classification method
CN108388651A (en) | Text classification method based on graph kernels and convolutional neural networks
CN108108355A (en) | Text sentiment analysis method and system based on deep learning
CN111339754B (en) | Case public opinion abstract generation method based on case element sentence association graph convolution
CN109508453A (en) | Cross-media information target component correlation analysis system and its association analysis method
CN105868184A (en) | Chinese name recognition method based on recurrent neural network
CN109189926A (en) | Construction method of a technical paper corpus
CN101587493A (en) | Text classification method
CN105373606A (en) | Unbalanced data sampling method in improved C4.5 decision tree algorithm
CN104834940A (en) | Medical image examination disease classification method based on support vector machine (SVM)
CN106815310A (en) | Hierarchical clustering method and system for massive document sets
CN107145516A (en) | Text clustering method and system
CN107679135A (en) | Topic detection and tracking method and device for network-oriented text big data
CN109062958B (en) | Automatic classification method for primary school compositions based on TextRank and convolutional neural network
CN115098690B (en) | Multi-data document classification method and system based on cluster analysis
CN111813939A (en) | Text classification method based on representation enhancement and fusion
CN103268346A (en) | Semi-supervised classification method and semi-supervised classification system
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||