CN107092644A - A Chinese text classification method based on MPI and Adaboost.MH - Google Patents

A Chinese text classification method based on MPI and Adaboost.MH

Info

Publication number
CN107092644A
Authority
CN
China
Prior art keywords
feature words
chinese text
mpi
sample
label
Prior art date
2017-03-07
Legal status
Pending
Application number
CN201710131434.9A
Other languages
Chinese (zh)
Inventor
王进
高延雨
李颖
李航
余薇
高选人
邓欣
陈乔松
胡峰
Current Assignee
Chongqing University of Posts and Telecommunications
Original Assignee
Chongqing University of Posts and Telecommunications
Priority date
2017-03-07
Filing date
2017-03-07
Publication date
2017-08-25
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN201710131434.9A
Publication of CN107092644A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G06F 16/355: Class or cluster creation or modification
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411: Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06F 18/243: Classification techniques relating to the number of classes
    • G06F 18/2431: Multiple classes
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention discloses a Chinese text classification method based on MPI and Adaboost.MH, which addresses the problem that, when the data volume is large, Adaboost.MH training takes a long time and therefore the total Chinese text classification time becomes long. The method is as follows: Chinese texts that have gone through word segmentation are saved into a training data set, and the mutual information method is combined with MPI to perform feature word selection; all processes reduce-sum their local counts with the MPI_Reduce function of MPI to obtain the similarities, and feature words are selected according to the size of the similarity. Each process then assigns a weight to each selected feature word according to whether it appears in the Chinese texts the process holds. The per-process computation results are combined through MPI communication functions to obtain a text classification model, and the model is used to classify the Chinese texts to be classified. The invention greatly shortens the time needed to classify Chinese text.

Description

A Chinese text classification method based on MPI and Adaboost.MH
Technical field
The present invention relates to the field of text mining, and in particular to a Chinese text classification method based on MPI and Adaboost.MH.
Background art
Text classification is the process of assigning a text, according to its content, to the relevant categories of a known classification system. With the progress of science and technology, the development of society, the popularization of computers and the arrival of the Internet era, the number of network texts is increasing sharply, and the text classification task shows new characteristics. First, a large number of new texts that need to be classified are produced every day, and such data usually exceed the TB scale. Second, the categories of a text show diversity, i.e. a single text may belong to several categories at once; for example, one text may belong to history, may also belong to politics, and could also belong to science and technology.
Traditional single-label methods, such as decision trees, k-nearest neighbors, neural networks, genetic algorithms, naive Bayes and support vector machines, can no longer meet people's needs. Many multi-label classification methods have therefore appeared, chiefly BR, ECC, Adaboost.MH, MLKNN, CML, ML-DT and Rank-SVM.
The Adaboost.MH algorithm is an iterative algorithm that extends the single-label algorithm Adaboost to handle multi-label problems. Its core idea is to train different weak classifiers on the same training set and then combine these weak classifiers into one strong classifier. The weak classifier chosen for Adaboost.MH here is a one-level decision tree; the algorithm idea is simple and easy to implement. However, because the amount of text that needs to be classified is now very large, Adaboost.MH must perform many iterations of learning to guarantee its classification quality, and therefore requires a substantial training time.
To improve the efficiency of the Adaboost.MH algorithm and reduce the training time, the existing solution is mainly to parallelize the algorithm. The main parallelization approaches are OpenMP, Hadoop, Spark and MPI. OpenMP can run an algorithm with multiple threads on a single machine but cannot be used on a cluster; when the data volume is too large, the memory requirements on a single machine become too high and OpenMP no longer applies. Hadoop parallelizes an algorithm on a cluster under the MapReduce framework, but Hadoop is poor at handling iterative algorithms. Spark can likewise parallelize an algorithm on a cluster, but compared with MPI it is slower.
Content of the invention
For massive data, building the training set takes a long time, and training a classification model with the Adaboost.MH algorithm requires a large amount of time. The present invention therefore combines MPI with Adaboost.MH and proposes a parallel text classification method based on MPI and Adaboost.MH.
The technical scheme by which the invention addresses the time consumption of Chinese text classification is: the pretreated text is divided into p parts, and each process handles one of them; with MPI as the supporting layer, the processes communicate with one another to perform feature selection on the training texts, build the weight vectors, train the classification model and classify the texts to be classified. Chinese text classification is thus parallelized, and its time efficiency can be improved to a high degree.
In view of this, the technical solution adopted by the invention is a Chinese text classification method based on MPI and Adaboost.MH comprising the following steps:
(1) Text pretreatment: collect Chinese text files of different fields, perform Chinese word segmentation on the collected texts, then remove punctuation marks and stop words, and save the segmented terms, separated by space characters, into the training set data as preliminary features.
(2) Feature word selection: select among the preliminary features of the pretreated text using the mutual information method.
(3) Weight vector construction: for each Chinese text file of each process, scan to judge whether each selected feature word is in the Chinese text file; if the feature word appears in the file, its corresponding weight is 1, otherwise its corresponding weight is 0; the weight vector of the Chinese text file is built in this way.
(4) Text classification model construction: build the classification model using the Adaboost.MH algorithm.
(5) Classification of texts to be classified: classify the texts to be classified with the classification model built in step (4).
Further, the concrete steps of the feature word selection of step (2) are:
The training set data is first divided evenly into p parts, and each process reads one of them. Each process then counts its own values of A, B, C and N, where A is the number of Chinese texts in category c in which feature word t appears; B is the number of Chinese texts in the categories other than c in which feature word t appears; C is the number of Chinese texts in category c in which feature word t does not appear; and N is the total number of Chinese texts over all categories. The A, B, C and N of all processes are then reduce-summed with the MPI_Reduce function of MPI, and the result is saved in process 0; from the reduced sums, process 0 computes the similarity I between feature word t and category c. Finally, the similarities I of the feature words are sorted by quicksort, the n feature words with the largest similarity I are retained, and the selection result is broadcast to all processes, which select the feature words according to the received broadcast message. The similarity I is computed by the formula: I(t, c) = log( (A·N) / ((A+C)·(A+B)) ).
The building process of the classification model in step (4) above is as follows:
Step 1: according to the training set data, each process assigns weight 1/(mk) to each label of each sample it holds, where m is the number of training set samples, i.e. the number of Chinese texts, and k is the total number of sample categories, i.e. the number of categories a Chinese text may belong to, the categories being for example science and technology, politics, and so on.
Step 2: each process computes, according to the formula W_ℓ^{jb} = Σ_{i=1}^{m} D_t(i, ℓ)·[x_i ∈ X_j]·[Y_i(ℓ) = b], the weights W_ℓ^{jb} of all feature words over the samples it holds; the local weights are then reduce-summed with the MPI_Reduce reduction function to obtain the true W_ℓ^{jb}, which are stored in process 0. Here j indicates whether the feature word is present, ℓ indexes the label, b is -1 or 1 (-1 means the sample does not carry the label, 1 means it does), m is the number of training samples, D_t(i, ℓ) is the weight of the ℓ-th label of the i-th sample in the t-th iteration, x_i is the i-th training sample, X_j is the set of all training samples in which the feature word is present (j = 1) or absent (j = 0), and Y_i is the label vector of the i-th sample.
Step 3: in process 0, compute the Z_t of every feature word according to the formula Z_t = 2 Σ_{j∈{0,1}} Σ_ℓ √(W_ℓ^{j+}·W_ℓ^{j-}), where Z_t is the normalization factor, W_ℓ^{j+} is the weight sum in distribution D_t of the training samples whose ℓ-th label is 1 among those in which the feature word is present (j = 1) or absent (j = 0), and W_ℓ^{j-} is the corresponding weight sum of the training samples whose ℓ-th label is -1. Select the feature word w with the smallest Z_t as the feature word to be chosen; then compute from w the label confidences c_{jℓ}, where c_{1ℓ} represents the confidence that label ℓ is 1 when w is present and c_{0ℓ} the confidence when w is absent. Broadcast Z_t, w, c_{0ℓ} and c_{1ℓ} to all processes with the MPI_Bcast function; all processes store them into the structure rule. The label confidence c_{jℓ} is computed by the formula c_{jℓ} = ½ ln( (W_ℓ^{j+} + ε) / (W_ℓ^{j-} + ε) ), where ε is the smoothing factor, equal to 1/(mk), m being the number of training samples and k the number of labels of the training set, i.e. the number of text categories.
Step 4: each process updates the label weight distribution according to the update formula D_{t+1}(i, ℓ) = D_t(i, ℓ)·exp(-α_t·Y_i(ℓ)·h_t(x_i, ℓ)) / Z_t, where α_t = 1; h_t(x_i, ℓ) = c_{1ℓ} when w ∈ x_i, and h_t(x_i, ℓ) = c_{0ℓ} when w ∉ x_i; w is the chosen feature word, and h_t(x_i, ℓ) represents the confidence that the ℓ-th label of the i-th training sample is 1.
Step 5: repeat steps 2 to 4 T times, without performing the step-4 update in the T-th iteration, to obtain T one-level decision tree classifiers. The Chinese text files to be classified are divided evenly into p parts read by p processes; each process scans them against the complete set of selected feature words of all categories and assigns a weight to each feature word according to whether it appears in the Chinese text file, yielding the weight vectors of the texts to be classified, i.e. the test set.
The classification of the texts to be classified is as follows: after the Chinese text files to be classified have gone through the processing of steps (1), (2) and (3) above, their weight vectors, i.e. the test set, are obtained. The test set is then divided into p parts, each process reads one of them and classifies each of its samples with the T one-level decision tree classifiers; finally, the classification results are combined according to the formula H(x, ℓ) = sign( Σ_{t=1}^{T} h_t(x, ℓ) ) to give the final predicted labels.
Beneficial effects of the invention: by combining MPI with the Adaboost.MH algorithm, the invention realizes a parallel text classification algorithm and solves the problem of the long training time caused by the large number of iterations when the data set is very large, greatly shortening the time of Chinese text classification.
Brief description of the drawings
Fig. 1 shows the flow chart of the Chinese text classification method based on MPI and the Adaboost.MH algorithm;
Fig. 2 shows the flow chart of feature word selection, assuming the number of processes p is 4;
Fig. 3 shows the training flow chart of the classifier model, assuming the number of processes p is 4.
Embodiment
The invention is further described below with reference to the accompanying drawings.

As shown in Fig. 1, the invention comprises the following five steps.
1. Text pretreatment: collect Chinese text files of different fields by means such as web crawlers and searching network information, and perform word segmentation on the collected Chinese text files. Open-source segmentation packages such as IK or ICTCLAS can be used to segment the collected texts; punctuation marks and stop words are then removed, stop words being words that occur very frequently but carry no actual meaning, such as "and" and similar function words in Chinese. The segmented terms, separated by space characters, are saved into the local training set data as preliminary features.
2nd, feature selecting:Preliminary feature is selected by using mutual information method.First by MPI_Init functions Start p process, the process number of each process is obtained according to MPI_Comm_rank, obtain total using MPI_Comm_size functions Enter number of passes p, all training set datas are divided into p parts, each process is successively read portion therein, then each enters Journey counts respective A, B, C, N respectively, and it is inter process synchronization then to perform MPI_Barrier functions, and A is the feature in classification c The number of files that word t occurs;B is the number of files that Feature Words t occurs in other classifications except classification c;C is special in classification c Levy the number of files that word t does not occur;N is the summation of number of files in all categories.Then it is all according to MPI_Reduce function pairs A, B, C, N carry out reduction summation in process, obtain A, B, C, N relative to whole training set data, are as a result stored in process 0, Process 0 is according to formulaCalculate the similarity I (t, c) between Feature Words t and classification c, it is assumed that altogether There is a k classification, each Feature Words will calculate similarity with k classification, then the weights of each Feature Words are k obtained The average value of the sum of similarity, because quick sorting algorithm relative to other sort algorithms has preferable time performance, therefore enters The weights of 0 pair of all Feature Words of journey are ranked up by quick sorting algorithm, then retain the larger preceding n Feature Words of weights (calculate similarity between Feature Words and classification, sequencing of similarity, select the amount of calculation of Feature Words relatively small, therefore these are grasped Make to complete in process 0, without parallel), then the Feature Words of reservation are broadcast to all by process 0 by MPI_Bcast functions Process, all processes, come member-retaining portion Feature Words, delete remaining Feature Words according to these information.
3. Weight vector construction: weight vectors are built for the Chinese texts after feature word selection. For each Chinese text file of each process, scan to judge whether each selected feature word is in the file; if the feature word appears in the file, its corresponding weight is 1, otherwise its corresponding weight is 0. For example, if the feature word "government" appears in an article, the weight of the feature word "government" is set to 1, otherwise to 0. The weight vector of the Chinese text file is built in this way. The sample weight vectors of all processes are saved into the file train.csv in order of rank, after which all processes are terminated with the MPI_Finalize function.
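The 0/1 weighting itself is a simple membership test over the segmented terms; a minimal sketch is given below (the helper name build_weight_vector and the input layout are illustrative assumptions, not from the patent):

```c
#include <string.h>

/* Set weights[f] = 1 iff the f-th selected feature word occurs among the
 * space-separated terms of one segmented text, otherwise 0. */
void build_weight_vector(const char **terms, int n_terms,
                         const char **features, int n_features,
                         int *weights) {
    for (int f = 0; f < n_features; f++) {
        weights[f] = 0;
        for (int i = 0; i < n_terms; i++)
            if (strcmp(terms[i], features[f]) == 0) { weights[f] = 1; break; }
    }
}
```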
4. Text classification model construction:
Step 1: first, p processes are started with the MPI_Init function; each process obtains its rank with MPI_Comm_rank and the total number of processes p with MPI_Comm_size. The file train.csv is opened, and the training set train is divided into p blocks by rows according to the number of processes p, process i reading block r_i (r_0 denotes the rows read by process 0, and so on up to r_{p-1}; the numbers of rows read by the processes are obtained by even division and differ by at most 1). For example, when 14 elements are divided among 4 processes, the processes get 3, 4, 3 and 4 elements respectively. The formulas for the even division are, as checked in the sketch below: low = id × n ÷ p is the first position in each process, where id is the rank, n the total number of elements and p the number of processes; high = (id + 1) × n ÷ p - 1 is the last position in each process; and size = high - low + 1 is the number of elements in each process. Then the weight distribution is initialized, i.e. each process assigns weight 1/(mk) to each label of each sample it holds, where m is the total number of samples and k is the number of labels.
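The even-division formulas can be verified with a few lines of C that reproduce the 14-element/4-process example above:

```c
#include <stdio.h>

/* Even block partition: process `id` of `p` gets rows [low, high]. */
int main(void) {
    int n = 14, p = 4;                      /* the example from the text */
    for (int id = 0; id < p; id++) {
        int low  = id * n / p;              /* first row of this process */
        int high = (id + 1) * n / p - 1;    /* last row of this process  */
        int size = high - low + 1;
        printf("process %d: rows %d..%d (size %d)\n", id, low, high, size);
    }
    return 0;                               /* sizes printed: 3, 4, 3, 4 */
}
```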
Step 2: each process counts the weights W_ℓ^{jb} of all feature words over the samples it holds, where j ∈ {0,1} indicates whether sample x contains the feature word (1 if it does, 0 if not), b ∈ {-1,+1} indicates the value of the label, and ℓ indexes the label. The calculation formula is W_ℓ^{jb} = Σ_{i=1}^{m} D_t(i, ℓ)·[x_i ∈ X_j]·[Y_i(ℓ) = b], where m is the number of training samples, D_t(i, ℓ) is the weight of the ℓ-th label of the i-th sample in the t-th iteration, x_i is the i-th training sample, X_j is the set of all training samples in which the feature word is present (j = 1) or absent (j = 0), and Y_i is the label vector of the i-th sample.
Step 3: the W_ℓ^{jb} of all processes are reduce-summed with the MPI_Reduce function, the true W_ℓ^{jb} being obtained and stored in process 0. In process 0, the Z_t of every feature word it holds is obtained by the formula Z_t = 2 Σ_{j∈{0,1}} Σ_ℓ √(W_ℓ^{j+}·W_ℓ^{j-}), where + denotes b = +1, - denotes b = -1 and t is the iteration number. The feature word with the smallest Z_t is the feature word w that should be chosen; the label confidences are then computed from w by the formula c_{jℓ} = ½ ln( (W_ℓ^{j+} + ε) / (W_ℓ^{j-} + ε) ). The computed c_{jℓ} and the selected feature word w are broadcast to every process with the MPI_Bcast function; every process stores c_{jℓ} and w into rule, a structure whose members are the confidences c_{jℓ} and the feature word w.
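A hedged C/MPI sketch of this step for one candidate feature word is shown below. The accumulation of the local weight sums is omitted, and the fixed label count K and the function name stump_score are illustrative assumptions; only the reduction, the Z_t and c_{jℓ} formulas and the broadcast follow the text above.

```c
#include <math.h>
#include <mpi.h>

#define K 8   /* placeholder number of labels (an assumption for the sketch) */

/* Wlocal[j][b][l]: this process's weight sums for one feature word, where
 * j = 0/1 means the word is absent/present and b = 0/1 stands for label
 * value -1/+1. Returns Z_t on every process and fills c[j][l]. */
double stump_score(double Wlocal[2][2][K], double c[2][K], int m,
                   MPI_Comm comm) {
    double W[2][2][K] = {{{0}}};
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* True weight sums = reduce-sum over all processes, result in rank 0. */
    MPI_Reduce(Wlocal, W, 2 * 2 * K, MPI_DOUBLE, MPI_SUM, 0, comm);

    double Zt = 0.0, eps = 1.0 / (m * K);   /* smoothing factor 1/(mk) */
    if (rank == 0) {
        for (int j = 0; j < 2; j++)
            for (int l = 0; l < K; l++) {
                Zt += 2.0 * sqrt(W[j][1][l] * W[j][0][l]);
                /* c_{jl} = 0.5 * ln((W+ + eps) / (W- + eps)) */
                c[j][l] = 0.5 * log((W[j][1][l] + eps) / (W[j][0][l] + eps));
            }
    }
    /* Broadcast Z_t and the confidences so every process can update D. */
    MPI_Bcast(&Zt, 1, MPI_DOUBLE, 0, comm);
    MPI_Bcast(c, 2 * K, MPI_DOUBLE, 0, comm);
    return Zt;
}
```

In a full implementation this function would be evaluated for every candidate feature word, and the word with the smallest returned Z_t kept as w.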
Step 4: using the Z_t obtained, all processes update the weight distribution according to the formula D_{t+1}(i, ℓ) = D_t(i, ℓ)·exp(-α_t·Y_i(ℓ)·h_t(x_i, ℓ)) / Z_t, where α_t = 1, h_t(x_i, ℓ) = c_{1ℓ} when w ∈ x_i, and h_t(x_i, ℓ) = c_{0ℓ} when w ∉ x_i; h_t(x_i, ℓ) is the confidence that the ℓ-th label of the i-th training sample is 1, and w is the chosen feature word.
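The corresponding distribution update over the samples held by one process could be sketched as follows (the flattened array layout and the helper name update_distribution are assumptions for illustration):

```c
#include <math.h>

/* D[i*k + l]: weight of label l of local sample i; Y[i*k + l] in {-1,+1};
 * has_w[i] = 1 iff the chosen feature word w occurs in sample i;
 * c[j*k + l]: label confidences from step 3; alpha_t = 1 as in the text. */
void update_distribution(double *D, const int *Y, const int *has_w,
                         const double *c, double Zt, int n_local, int k) {
    for (int i = 0; i < n_local; i++) {
        int j = has_w[i];                       /* h_t(x_i, l) = c_{jl} */
        for (int l = 0; l < k; l++)
            D[i * k + l] *= exp(-(double)Y[i * k + l] * c[j * k + l]) / Zt;
    }
}
```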
Step 5: repeat steps 2 to 4 T times, without performing the step-4 update in the T-th iteration, to obtain T one-level decision tree classifiers. Since the rules preserved in every process are identical, the rules preserved in process 0 are saved into the file rule.csv, and all processes are closed with the MPI_Finalize function.
5. Classification of texts to be classified
The Chinese text files to be classified are divided evenly into p parts read by p processes. Each process scans the files against the complete set of selected feature words of all categories and assigns a weight to each feature word according to whether it appears in the Chinese text file, obtaining the weight vectors of the texts to be classified; each process then saves its weight vectors into the file test.csv in order of rank.
Then p processes are started with the MPI_Init function; each process obtains its rank with MPI_Comm_rank and the total number of processes p with MPI_Comm_size; the file rule.csv is opened, and every process reads in all the information in rule.csv.
Then the file test.csv is opened and the samples to be classified are divided evenly into p parts, each process reading one of them. Each process then obtains the categories of its texts to be classified according to the preserved weak classifier rules rule and the formula H(x, ℓ) = sign( Σ_{t=1}^{T} h_t(x, ℓ) ); finally, all processes are closed with the MPI_Finalize function.
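Putting the pieces together, the final voting for one test sample might be sketched as below; the Rule layout and the fixed label count K are illustrative assumptions standing in for the contents of rule.csv:

```c
#define K 8   /* placeholder label count (an assumption for the sketch) */

/* One stored rule: index of feature word w and confidences for the cases
 * "w absent" (c0) and "w present" (c1). */
typedef struct { int w; double c0[K]; double c1[K]; } Rule;

/* H(x,l) = sign(sum_t h_t(x,l)): sum the T stump confidences per label
 * and take the sign. x is the 0/1 weight vector of the test sample. */
void predict(const Rule *rules, int T, const int *x, int *label) {
    for (int l = 0; l < K; l++) {
        double score = 0.0;
        for (int t = 0; t < T; t++)
            score += x[rules[t].w] ? rules[t].c1[l] : rules[t].c0[l];
        label[l] = (score > 0) ? 1 : -1;
    }
}
```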

Claims (6)

1. A Chinese text classification method based on MPI and Adaboost.MH, comprising the following steps:
(1) text pretreatment: collecting Chinese text files of different fields, performing Chinese word segmentation on the collected texts, then removing punctuation marks and stop words, and saving the segmented terms, separated by space characters, into the training set data as preliminary features;
(2) feature word selection: selecting among the preliminary features of the pretreated text by the mutual information method;
(3) weight vector construction: for each Chinese text file of each process, scanning to judge whether each selected feature word is in the Chinese text file; if the feature word appears in the file, its corresponding weight is 1, otherwise its corresponding weight is 0; thereby building the weight vector of the Chinese text file;
(4) text classification model construction: building the classification model using the Adaboost.MH algorithm;
(5) classification of texts to be classified: classifying the texts to be classified with the classification model built in step (4).
2. The Chinese text classification method based on MPI and Adaboost.MH according to claim 1, characterized in that the concrete steps of the feature word selection of step (2) are:
first dividing the training set data evenly into p parts, each process reading one of them; each process then counting its own values of A, B, C and N, where A is the number of Chinese texts in category c in which feature word t appears, B is the number of Chinese texts in the categories other than c in which feature word t appears, C is the number of Chinese texts in category c in which feature word t does not appear, and N is the total number of Chinese texts over all categories; then reduce-summing the A, B, C and N of all processes with the MPI_Reduce function of MPI and saving the result in process 0, process 0 computing from the reduced sums the similarity I between feature word t and category c; finally sorting the similarities I of the feature words by quicksort, retaining the n feature words with the largest similarity I, and broadcasting the selection result to all processes, all processes selecting the feature words according to the received broadcast message.
3. The Chinese text classification method based on MPI and Adaboost.MH according to claim 2, characterized in that the similarity I is computed by the formula I(t, c) = log( (A·N) / ((A+C)·(A+B)) ).
4. The Chinese text classification method based on MPI and Adaboost.MH according to claim 1, characterized in that the building process of the classification model is as follows:
step 1: according to the training set data, each process assigning weight 1/(mk) to each label of each sample it holds, m being the number of training set samples and k the total number of sample categories;
step 2: each process computing, according to the formula W_ℓ^{jb} = Σ_{i=1}^{m} D_t(i, ℓ)·[x_i ∈ X_j]·[Y_i(ℓ) = b], the weights W_ℓ^{jb} of all feature words over the samples it holds, the results being stored in process 0, where j indicates whether the feature word is present, ℓ indexes the label, b is -1 or 1, -1 meaning the sample does not carry the label and 1 meaning it does, m is the number of training samples, D_t(i, ℓ) is the weight of the ℓ-th label of the i-th sample in the t-th iteration, x_i is the i-th training sample, X_j is the set of all training samples in which the feature word is present or absent, and Y_i is the label vector of the i-th sample;
step 3: in process 0, computing the Z_t of every feature word according to the formula Z_t = 2 Σ_{j∈{0,1}} Σ_ℓ √(W_ℓ^{j+}·W_ℓ^{j-}), where Z_t is the normalization factor, W_ℓ^{j+} is the weight sum in distribution D_t of the training samples whose ℓ-th label is 1 among those in which the feature word is present or absent, and W_ℓ^{j-} is the weight sum of the training samples whose ℓ-th label is -1; selecting the feature word w with the smallest Z_t as the feature word to be chosen; then computing from w the label confidences c_{1ℓ} and c_{0ℓ}, c_{1ℓ} representing the confidence that label ℓ is 1 when w is present and c_{0ℓ} the confidence when w is absent; broadcasting Z_t, w, c_{1ℓ} and c_{0ℓ} to all processes with the MPI_Bcast function, all processes storing them into the structure rule;
step 4: each process updating the label weight distribution according to the update formula D_{t+1}(i, ℓ) = D_t(i, ℓ)·exp(-α_t·Y_i(ℓ)·h_t(x_i, ℓ)) / Z_t, where α_t = 1, h_t(x_i, ℓ) = c_{1ℓ} when w ∈ x_i, and h_t(x_i, ℓ) = c_{0ℓ} when w ∉ x_i, w being the feature word and h_t(x_i, ℓ) representing the confidence that the ℓ-th label of the i-th training sample is 1;
step 5: repeating steps 2 to 4 T times, without performing the step-4 update in the T-th iteration, to obtain T one-level decision tree classifiers.
5. The Chinese text classification method based on MPI and Adaboost.MH according to claim 4, characterized in that the label confidence c_{jℓ} is computed by the formula c_{jℓ} = ½ ln( (W_ℓ^{j+} + ε) / (W_ℓ^{j-} + ε) ), where ε is the smoothing factor, equal to 1/(mk), m being the number of training samples and k the number of labels of the training set, i.e. the number of text categories.
6. The Chinese text classification method based on MPI and Adaboost.MH according to claim 4 or 5, characterized in that the classification of the texts to be classified is: after the Chinese text files to be classified have gone through the processing of steps (1), (2) and (3), each process classifying each sample it holds with the T one-level decision tree classifiers, and finally the classification results being combined according to the formula H(x, ℓ) = sign( Σ_{t=1}^{T} h_t(x, ℓ) ) to give the final predicted labels.
CN201710131434.9A 2017-03-07 2017-03-07 A Chinese text classification method based on MPI and Adaboost.MH Pending CN107092644A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710131434.9A CN107092644A (en) 2017-03-07 2017-03-07 A Chinese text classification method based on MPI and Adaboost.MH

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710131434.9A CN107092644A (en) 2017-03-07 2017-03-07 A Chinese text classification method based on MPI and Adaboost.MH

Publications (1)

Publication Number Publication Date
CN107092644A 2017-08-25

Family

ID=59648837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710131434.9A Pending CN107092644A (en) A Chinese text classification method based on MPI and Adaboost.MH

Country Status (1)

Country Link
CN (1) CN107092644A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509484A (en) * 2018-01-31 2018-09-07 腾讯科技(深圳)有限公司 Classifier construction and intelligent question and answer method, device, terminal and readable storage medium
CN108846128A (en) * 2018-06-30 2018-11-20 合肥工业大学 Cross-domain text classification method based on adaptive noise reduction encoder

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102707955A (en) * 2012-05-18 2012-10-03 天津大学 Method for realizing support vector machine by MPI programming and OpenMP programming
US20130212111A1 (en) * 2012-02-07 2013-08-15 Kirill Chashchin System and method for text categorization based on ontologies
CN105701084A (en) * 2015-12-28 2016-06-22 广东顺德中山大学卡内基梅隆大学国际联合研究院 Feature extraction method for text classification based on mutual information

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130212111A1 (en) * 2012-02-07 2013-08-15 Kirill Chashchin System and method for text categorization based on ontologies
CN102707955A (en) * 2012-05-18 2012-10-03 天津大学 Method for realizing support vector machine by MPI programming and OpenMP programming
CN105701084A (en) * 2015-12-28 2016-06-22 广东顺德中山大学卡内基梅隆大学国际联合研究院 Feature extraction method for text classification based on mutual information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JORGE L. REYES-ORTIZ et al.: "Big data analytics in the cloud: Spark on Hadoop vs MPI/OpenMP on Beowulf", Procedia Computer Science *
ROBERT E. SCHAPIRE et al.: "BoosTexter: A Boosting-based System for Text Categorization", Machine Learning *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509484A (en) * 2018-01-31 2018-09-07 腾讯科技(深圳)有限公司 Classifier construction and intelligent question and answer method, device, terminal and readable storage medium
CN108509484B (en) * 2018-01-31 2022-03-11 腾讯科技(深圳)有限公司 Classifier construction and intelligent question and answer method, device, terminal and readable storage medium
CN108846128A (en) * 2018-06-30 2018-11-20 合肥工业大学 Cross-domain text classification method based on adaptive noise reduction encoder
CN108846128B (en) * 2018-06-30 2021-09-14 合肥工业大学 Cross-domain text classification method based on adaptive noise reduction encoder

Similar Documents

Publication Publication Date Title
CN106815369B (en) A kind of file classification method based on Xgboost sorting algorithm
CN106779087B (en) A kind of general-purpose machinery learning data analysis platform
CN109189901B (en) Method for automatically discovering new classification and corresponding corpus in intelligent customer service system
CN108363810B (en) Text classification method and device
Boley et al. Training support vector machine using adaptive clustering
CN108364016A (en) Gradual semisupervised classification method based on multi-categorizer
CN103425996B (en) A kind of large-scale image recognition methods of parallel distributed
CN110309888A (en) A kind of image classification method and system based on layering multi-task learning
CN106599913A (en) Cluster-based multi-label imbalance biomedical data classification method
CN109299271A (en) Training sample generation, text data, public sentiment event category method and relevant device
CN102289522A (en) Method of intelligently classifying texts
CN101763431A (en) PL clustering method based on massive network public sentiment information
CN106156163B (en) Text classification method and device
CN104702465A (en) Parallel network flow classification method
CN102156871A (en) Image classification method based on category correlated codebook and classifier voting strategy
CN109446423B (en) System and method for judging sentiment of news and texts
CN110347791B (en) Topic recommendation method based on multi-label classification convolutional neural network
CN110442568A (en) Acquisition methods and device, storage medium, the electronic device of field label
CN110008365B (en) Image processing method, device and equipment and readable storage medium
CN105912525A (en) Sentiment classification method for semi-supervised learning based on theme characteristics
Elnagar et al. Automatic text tagging of Arabic news articles using ensemble deep learning models
CN112148868A (en) Law recommendation method based on law co-occurrence
CN110288028A (en) ECG detecting method, system, equipment and computer readable storage medium
Chu et al. Co-training based on semi-supervised ensemble classification approach for multi-label data stream
CN107092644A (en) A kind of Chinese Text Categorization based on MPI and Adaboost.MH

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2017-08-25)