CN107092644A - A Chinese text classification method based on MPI and Adaboost.MH - Google Patents
A Chinese text classification method based on MPI and Adaboost.MH
- Publication number
- CN107092644A (application CN201710131434.9A)
- Authority
- CN
- China
- Prior art keywords
- feature words
- chinese text
- mpi
- sample
- label
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention discloses a Chinese text classification method based on MPI and Adaboost.MH, which addresses the problem that, when the data volume is large, Adaboost.MH training takes so long that the total Chinese text classification time becomes excessive. The method comprises: saving Chinese text that has undergone word segmentation into a training data set; combining the mutual information method with MPI to select feature words, where all processes perform a reduction sum through the MPI_Reduce function in MPI to obtain the similarity between each feature word and each category, and feature words are selected according to the magnitude of this similarity; each process then assigning weights to the feature words according to whether the selected feature words appear in the Chinese texts it holds; and integrating the computation results of the processes through MPI communication functions to obtain a text classification model, which is used to classify the Chinese text to be classified. The present invention greatly shortens the time required to classify Chinese text.
Description
Technical field
The present invention relates to the field of text mining technology, and in particular to a Chinese text classification method based on MPI and Adaboost.MH.
Background art
Text classification is the process of assigning a text to its relevant categories, according to the information it contains, under a known category system. With the progress of science and technology, the development of society, the popularization of computers and the arrival of the network era, the quantity of network text is increasing sharply, and the text classification task presents new characteristics. First, a large amount of new text that needs to be classified is produced every day, and this data often exceeds the TB scale. Second, the classification of text shows diversity: a single text may belong to multiple categories; for example, one text may belong to history, may also belong to politics, and may also belong to science and technology.
Traditional single-label methods such as decision trees, k-nearest neighbors, neural networks, genetic algorithms, naive Bayes classification and support vector machines can no longer meet these needs. Many multi-label classification methods have therefore appeared, chiefly BR, ECC, Adaboost.MH, MLKNN, CML, ML-DT, Rank-SVM, etc.
The Adaboost.MH algorithm is an iterative algorithm that extends the single-label algorithm Adaboost to handle multiple labels. Its core idea is to train different weak classifiers on the same training set and then combine these weak classifiers into one strong classifier. The weak classifier selected by the Adaboost.MH algorithm herein is a one-level decision tree (a decision stump); the algorithm is simple in concept and easy to implement. However, because the amount of text that now needs to be classified is very large, the Adaboost.MH algorithm needs many learning iterations to guarantee its classification quality, and therefore requires a substantial training time.
To improve the efficiency of the Adaboost.MH algorithm and reduce the training time, the existing solution is mainly to parallelize the algorithm. The main parallelization approaches are OpenMP, Hadoop, Spark and MPI. OpenMP can parallelize an algorithm with multiple threads on a single machine but cannot be used on a cluster; when the data volume is too large, the memory requirement on the single machine becomes too high and OpenMP no longer applies. Hadoop parallelizes an algorithm on a cluster through the MapReduce framework, but Hadoop is not good at handling iterative algorithms. Spark can likewise parallelize an algorithm on a cluster, but compared with MPI it is slower.
Summary of the invention
To address the problem that, for massive data, building the training set takes a long time and training a classification model with the Adaboost.MH algorithm requires a great deal of time, the present invention combines MPI with Adaboost.MH and proposes a parallel text classification method based on MPI and Adaboost.MH.
The technical scheme of the present invention for solving the time-consumption problem of Chinese text classification is: the preprocessed text is divided into p parts, and each process handles one part. With MPI as the supporting layer, inter-process communication is used to complete feature selection on the training-set text, build the weight vectors, train the classification model and classify the text to be classified, thereby parallelizing Chinese text classification and greatly improving its time efficiency.
In view of this, the technical solution adopted by the present invention is a Chinese text classification method based on MPI and Adaboost.MH, comprising the following steps:
(1) Text preprocessing: collect Chinese text files from different fields, perform Chinese word segmentation on the collected text, then remove punctuation marks and stop words, separate the segmented terms with space characters and save them into the training data set as preliminary features.
(2) Feature word selection: select from the preliminary features of the preprocessed text using the mutual information method.
(3) Weight vector construction: for each Chinese text file of each process, scan to judge whether each selected feature word appears in the file; if the feature word appears in the file, its corresponding weight is 1, otherwise 0, thus building the weight vector of the Chinese text file.
(4) Text classification model construction: build the classification model using the Adaboost.MH algorithm.
(5) Classification of the text to be classified: classify the text to be classified according to the classification model built in step (4).
Further, the specific steps of feature word selection in step (2) are:
The training data set is first divided equally into p parts, and each process reads one part. Each process then counts its local values of A, B, C and N, where A is the number of Chinese text files in category c in which feature word t appears; B is the number of Chinese text files in categories other than c in which feature word t appears; C is the number of Chinese text files in category c in which feature word t does not appear; and N is the total number of Chinese text files over all categories. All processes then perform a reduction sum over A, B, C and N through the MPI_Reduce function in MPI, and the result is saved in process 0. From the reduced sums, process 0 calculates the similarity I between feature word t and category c. Finally the similarity values I of the feature words are sorted by quicksort, the n feature words with the largest similarity I are retained, and the selection result is broadcast to all processes, which select the feature words according to the broadcast message received. The calculation formula of the similarity I is:

I(t, c) = log( (A × N) / ((A + C) × (A + B)) )
The building process of the classification model in step (4) above is as follows:

Step 1: according to the training data set, each process assigns the weight 1/(mk) to each (sample, label) pair it holds, where m is the number of training samples, i.e. the number of Chinese texts, and k is the total number of sample categories, i.e. the number of possible categories of a Chinese text, such as science and technology, politics, etc.

Step 2: for each candidate feature word, each process computes over its local samples the weights

W_b^{jl} = Σ_{i: x_i ∈ X_j, Y_i[l] = b} D_t(i, l)

where j ∈ {0, 1} indicates whether the feature word is absent (j = 0) or present (j = 1), l denotes the l-th label, b is -1 or 1 (-1 meaning the sample does not carry the label, 1 meaning it does), m denotes the number of training samples, D_t(i, l) denotes the weight of the l-th label of the i-th sample in the t-th iteration, x_i denotes the i-th training sample, X_j denotes the set of all training samples in which the feature word is present (j = 1) or absent (j = 0), and Y_i denotes the label vector of the i-th sample. The local weights are then reduction-summed with the MPI_Reduce function to obtain the true W_b^{jl}, which is stored in process 0.

Step 3: in process 0, the quantity Z_t of every candidate feature word is calculated according to

Z_t = 2 Σ_j Σ_l sqrt( W_+^{jl} × W_-^{jl} )

where Z_t denotes the normalization factor, W_+^{jl} denotes the sum, in distribution D_t, of the weights of the training samples in which the feature word is present (j = 1) or absent (j = 0) and whose l-th label is 1, and W_-^{jl} denotes the corresponding sum for the samples whose l-th label is -1. The feature word w with the smallest Z_t is selected as the feature word to be chosen; then the label confidences c_{jl} are calculated from w according to

c_{jl} = (1/2) × ln( (W_+^{jl} + ε) / (W_-^{jl} + ε) )

where ε denotes the smoothing factor, equal to 1/(mk), m being the number of training samples and k the number of labels of the training set, i.e. the number of text categories. Z_t, w and the confidences c_{jl} are broadcast to all processes through the MPI_Bcast function, and all processes store them in the structure rule.

Step 4: each process updates the label weight distribution according to the weight-distribution update formula

D_{t+1}(i, l) = D_t(i, l) × exp( -α_t × Y_i[l] × c_{j(i)l} ) / Z_t

where α_t = 1, j(i) = 1 when w ∈ x_i and j(i) = 0 otherwise, w is the selected feature word, and c_{j(i)l} is the confidence that the l-th label of the i-th training sample is 1.

Step 5: steps 2 to 4 are repeated T times (the update of step 4 is not performed in the T-th iteration), obtaining T one-level decision tree classifiers.

Classifying the text to be classified: the Chinese text files to be classified are put through the processing of steps (1), (2) and (3) above to obtain their weight vectors, i.e. the test set; each process scans its files with the complete set of feature words of all categories and assigns each feature word a weight according to whether it appears in the file. The test set is then divided into p parts, each process reads one part and classifies each of its samples with the T one-level decision tree classifiers, and finally the classification results are integrated according to

H(x, l) = sign( Σ_{t=1}^{T} α_t × h_t(x, l) )

to give the final predicted categories, where h_t(x, l) = c_{j(x)l} is the output of the t-th weak classifier.
Beneficial effects of the present invention: by combining MPI with the Adaboost.MH algorithm, the present invention realizes a parallel text classification algorithm, solves the problem that the training time is long when the data set is very large because of the large number of iterations, and greatly shortens the time of Chinese text classification.
Brief description of the drawings
Fig. 1 shows the flow chart of the Chinese text classification method based on MPI and the Adaboost.MH algorithm;
Fig. 2 shows the flow chart of feature word selection, assuming the number of processes p is 4;
Fig. 3 shows the training flow chart of the classifier model, assuming the number of processes p is 4.
Embodiment
The present invention will be further described below in conjunction with the accompanying drawings.
As shown in Fig. 1, the present invention comprises the following five steps.
1. Text preprocessing: Chinese text files from different fields are collected by means such as web crawlers and searching the network for information, and word segmentation is performed on the collected files. Open-source segmentation packages such as IK or ICTCLAS can be used to perform Chinese word segmentation on the collected text; punctuation marks and stop words are then removed. Stop words are words that appear very frequently but carry no actual meaning, such as "的", "了" and "和" (and). The segmented terms, separated by space characters, are saved into the local training data set as preliminary features.
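A minimal serial sketch of this preprocessing step, assuming the text has already been segmented into terms (the patent uses IK or ICTCLAS for the segmentation itself; the stop-word list here is illustrative):

```python
import re

STOP_WORDS = {"的", "了", "和"}  # illustrative stop-word list

def preprocess(terms):
    """Remove punctuation-only tokens and stop words, join the rest with spaces."""
    kept = []
    for term in terms:
        if term in STOP_WORDS:
            continue
        if re.fullmatch(r"[\W_]+", term):  # token made only of punctuation
            continue
        kept.append(term)
    return " ".join(kept)

line = preprocess(["政府", "的", "工作", "，", "和", "报告"])
print(line)  # → 政府 工作 报告
```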
2. Feature selection: the preliminary features are selected using the mutual information method. First, p processes are started with the MPI_Init function; each process obtains its process rank with MPI_Comm_rank and the total number of processes p with the MPI_Comm_size function. All training data is divided into p parts, and each process reads one part. Each process then counts its own A, B, C and N, after which the MPI_Barrier function is executed for inter-process synchronization. A is the number of files in category c in which feature word t appears; B is the number of files in categories other than c in which feature word t appears; C is the number of files in category c in which feature word t does not appear; N is the total number of files over all categories. A, B, C and N of all processes are then reduction-summed with the MPI_Reduce function to obtain A, B, C and N relative to the whole training data set, and the result is stored in process 0. Process 0 calculates the similarity I(t, c) between feature word t and category c according to the formula

I(t, c) = log( (A × N) / ((A + C) × (A + B)) )

Assuming there are k categories in total, each feature word computes a similarity with each of the k categories, and the weight of a feature word is the average of the sum of its k similarity values. Because quicksort has better time performance than other sorting algorithms, process 0 sorts the weights of all feature words by quicksort and retains the n feature words with the largest weights. (Computing the similarities between feature words and categories, sorting them and selecting the feature words involves relatively little computation, so these operations are completed in process 0 without parallelization.) Process 0 then broadcasts the retained feature words to all processes through the MPI_Bcast function, and every process keeps those feature words and deletes the rest according to this information.
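Since MPI_Reduce with the sum operation simply adds the per-process values, the parallel counting step can be sketched serially by summing local (A, B, C, N) tuples; the counts below are illustrative, not from the patent:

```python
import math

def mutual_information(A, B, C, N):
    """I(t, c) = log(A*N / ((A+C)*(A+B))); 0 if word t never occurs in c."""
    if A == 0:
        return 0.0
    return math.log((A * N) / ((A + C) * (A + B)))

# local (A, B, C, N) counts of feature word t for category c on 3 "processes"
local_counts = [(4, 2, 1, 20), (3, 1, 2, 20), (3, 2, 2, 20)]
A, B, C, N = (sum(col) for col in zip(*local_counts))  # the "reduction sum"
print(A, B, C, N)  # → 10 5 5 60
print(round(mutual_information(A, B, C, N), 4))  # → 0.9808
```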
3. Weight vector construction: weight vectors are built for the Chinese texts after feature word selection. For each Chinese text file of each process, scan to judge whether each selected feature word appears in the file; if it does, the corresponding weight of that feature word is 1, otherwise 0. For example, if the feature word "government" appears in an article, the weight of "government" is set to 1, otherwise to 0. The weight vector of the Chinese text file is built in this way. The sample weight vectors of all processes are saved into the file train.csv in order of process rank, and all processes are then terminated with the MPI_Finalize function.
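The binary weight vector of step 3 can be sketched as follows (the feature-word list and document are illustrative):

```python
def weight_vector(selected_words, document_terms):
    """1 for every selected feature word occurring in the document, 0 otherwise."""
    present = set(document_terms)
    return [1 if w in present else 0 for w in selected_words]

features = ["政府", "经济", "科技", "历史"]
doc = "政府 工作 报告 经济".split()
print(weight_vector(features, doc))  # → [1, 1, 0, 0]
```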
4. Text classification model construction:

Step 1: p processes are first started with the MPI_Init function; each process obtains its process rank with MPI_Comm_rank and the total number of processes p with the MPI_Comm_size function. The file train.csv is opened, and the training set train is divided into p blocks by rows according to the number of processes p; the processes read the row blocks r_0, r_1, ..., r_{p-1} respectively (r_0 denotes the rows read by process 0; the numbers of rows read by the processes are obtained by even division and differ by at most 1). For example, 14 elements divided among 4 processes give the processes 3, 4, 3 and 4 elements respectively. The even-division formula is: low = id × n ÷ p, where low is the first position in each process, id denotes the process rank, n denotes the total number of elements and p the number of processes; high = (id + 1) × n ÷ p - 1, where high is the last position in each process; and size = high - low + 1 is the number of elements in each process (÷ denoting integer division). Then the weight distribution is initialized: each process assigns the weight 1/(mk) to each (sample, label) pair it holds, where m is the total number of samples and k is the number of labels.
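The even-division formula of step 1 (low = id × n ÷ p, high = (id + 1) × n ÷ p - 1, with integer division) can be sketched and checked against the 14-elements-over-4-processes example:

```python
def block_range(rank, n, p):
    """Row range held by process `rank` (the patent's `id`) out of p processes."""
    low = rank * n // p              # first index held by this process
    high = (rank + 1) * n // p - 1   # last index held by this process
    return low, high, high - low + 1

# 14 elements over 4 processes -> block sizes 3, 4, 3, 4 as in the example
sizes = [block_range(i, 14, 4)[2] for i in range(4)]
print(sizes)  # → [3, 4, 3, 4]
```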
Step 2: each process counts, over the samples it holds, the weights W_b^{jl}, where j ∈ {0, 1} indicates whether sample x contains the feature word (1 if it does, 0 if not), b ∈ {-1, +1} indicates the value of the label, and l denotes the l-th label. The calculation formula is

W_b^{jl} = Σ_{i: x_i ∈ X_j, Y_i[l] = b} D_t(i, l)

where m denotes the number of training samples, D_t(i, l) denotes the weight of the l-th label of the i-th sample in the t-th iteration, x_i denotes the i-th training sample, X_j denotes the set of all training samples in which the feature word is present (j = 1) or absent (j = 0), and Y_i denotes the label vector of the i-th sample.

Step 3: the W_b^{jl} of all processes are reduction-summed with the MPI_Reduce function, and the true W_b^{jl} is stored in process 0. In process 0 the Z_t of every candidate feature word is obtained with the calculation formula

Z_t = 2 Σ_j Σ_l sqrt( W_+^{jl} × W_-^{jl} )

where + denotes b = +1, - denotes b = -1 and t denotes the iteration number. The feature word with the smallest Z_t is the feature word w that should be chosen; the label confidences are then calculated from w with the formula

c_{jl} = (1/2) × ln( (W_+^{jl} + ε) / (W_-^{jl} + ε) )

The confidences c_{jl} and the selected feature word w are broadcast to every process with the MPI_Bcast function, and every process stores c_{jl} in rule, a structure whose members are c_{jl} and the feature word w.

Step 4: with the Z_t obtained, all processes update the weight distribution (the confidence that the l-th label of each training sample is 1) according to the formula

D_{t+1}(i, l) = D_t(i, l) × exp( -α_t × Y_i[l] × c_{j(i)l} ) / Z_t

where α_t = 1 and j(i) = 1 when w ∈ x_i, j(i) = 0 otherwise, w being the selected feature word.

Step 5: steps 2 to 4 are repeated T times (the update of step 4 is not performed in the T-th iteration), yielding T one-level decision tree classifiers. Because the rules preserved in each process are identical, the rules preserved in process 0 are saved into the file rule.csv, and all processes are closed with the MPI_Finalize function.
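One boosting round (steps 2 to 4) can be sketched serially, assuming the weights W_b^{jl} have already been reduction-summed across processes and using the quantities Z_t = 2ΣΣ√(W_+ W_-), c_{jl} = ½ln((W_+ + ε)/(W_- + ε)) and the update D ← D·exp(-Y·c)/Z_t with α_t = 1. The toy data (m = 4 samples, k = 2 labels, one candidate feature word) is illustrative:

```python
import math

presence = [1, 1, 0, 0]                          # 1 if the word occurs in sample i
Y = [[+1, -1], [+1, +1], [-1, -1], [-1, +1]]     # Y[i][l] in {-1, +1}
m, k = 4, 2
D = [[1.0 / (m * k)] * k for _ in range(m)]      # step 1: D_1(i, l) = 1/(mk)
eps = 1.0 / (m * k)                              # smoothing factor

# step 2: W[j][b][l] = sum of D(i, l) over samples with word-presence j, label b
W = {j: {b: [0.0] * k for b in (+1, -1)} for j in (0, 1)}
for i in range(m):
    for l in range(k):
        W[presence[i]][Y[i][l]][l] += D[i][l]

# step 3: normalization factor and smoothed label confidences
Z = 2 * sum(math.sqrt(W[j][+1][l] * W[j][-1][l]) for j in (0, 1) for l in range(k))
c = {j: [0.5 * math.log((W[j][+1][l] + eps) / (W[j][-1][l] + eps)) for l in range(k)]
     for j in (0, 1)}

# step 4: weight-distribution update with alpha_t = 1
D = [[D[i][l] * math.exp(-Y[i][l] * c[presence[i]][l]) / Z for l in range(k)]
     for i in range(m)]

print(Z)                  # → 0.5 (word presence predicts label 0 perfectly)
print(round(c[1][0], 4))  # → 0.5493 (= 0.5*ln 3, capped by the smoothing factor)
```

Note how the confidence for label 1, which the word does not predict at all, comes out as c[1][1] = 0.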
5. Classification of the text to be classified
The Chinese text files to be classified are divided evenly into p parts read by p processes. Using the complete set of feature words of all categories, each process scans its files for the feature words and assigns each feature word a weight according to whether it appears in the Chinese text file, thereby obtaining the weight vectors of the text to be classified; each process then saves its weight vectors into the file test.csv in order of process rank.
Then p processes are started with the MPI_Init function; each process obtains its process rank with MPI_Comm_rank and the total number of processes p with the MPI_Comm_size function, the file rule.csv is opened, and every process reads in all the information of rule.csv.
The file test.csv is then opened and the samples to be classified are divided equally into p parts, each process reading one part. Each process then obtains the categories of its texts to be classified according to the preserved weak-classifier rules rule and the formula

H(x, l) = sign( Σ_{t=1}^{T} α_t × h_t(x, l) )

and finally all processes are closed with the MPI_Finalize function.
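The final aggregation, summing the T weak-classifier confidences h_t(x, l) = c_t[j][l] (with j = 1 when the round's feature word occurs in x) and taking the sign, can be sketched as follows; the rule list is illustrative, with α_t = 1 as in the patent:

```python
rules = [  # one (feature word, confidences c[j][l]) rule per boosting round, T = 2
    {"w": "经济", "c": {1: [0.8, -0.3], 0: [-0.5, 0.1]}},
    {"w": "政府", "c": {1: [0.6, -0.2], 0: [-0.4, 0.2]}},
]

def predict(doc_terms, rules, k=2):
    """H(x, l) = sign(sum_t h_t(x, l)) for each of the k labels."""
    present = set(doc_terms)
    f = [0.0] * k
    for r in rules:
        j = 1 if r["w"] in present else 0
        for l in range(k):
            f[l] += r["c"][j][l]  # alpha_t = 1
    return [1 if v > 0 else -1 for v in f]

print(predict("政府 经济 工作".split(), rules))  # → [1, -1]
```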
Claims (6)
1. A Chinese text classification method based on MPI and Adaboost.MH, comprising the following steps:
(1) text preprocessing: collecting Chinese text files from different fields, performing Chinese word segmentation on the collected text, then removing punctuation marks and stop words, separating the segmented terms with space characters and saving them into the training data set as preliminary features;
(2) feature word selection: selecting from the preliminary features of the preprocessed text using the mutual information method;
(3) weight vector construction: for each Chinese text file of each process, scanning to judge whether each selected feature word appears in the Chinese text file; if the feature word appears in the file, its corresponding weight is 1, otherwise 0, thus building the weight vector of the Chinese text file;
(4) text classification model construction: building the classification model using the Adaboost.MH algorithm;
(5) classification of the text to be classified: classifying the text to be classified according to the classification model built in step (4).
2. The Chinese text classification method based on MPI and Adaboost.MH according to claim 1, characterized in that the specific steps of feature word selection in step (2) are: first dividing the training data set equally into p parts, each process reading one part; then each process counting its values of A, B, C and N, where A is the number of Chinese text files in category c in which feature word t appears, B is the number of Chinese text files in categories other than c in which feature word t appears, C is the number of Chinese text files in category c in which feature word t does not appear, and N is the total number of Chinese text files over all categories; then reduction-summing A, B, C and N of all processes through the MPI_Reduce function in MPI and saving the result in process 0, process 0 calculating the similarity I between feature word t and category c from the reduced sums; finally sorting the similarity values I of the feature words by quicksort, retaining the n feature words with the largest similarity I, and broadcasting the selection result to all processes, all processes selecting the feature words according to the broadcast message received.
3. The Chinese text classification method based on MPI and Adaboost.MH according to claim 2, characterized in that the calculation formula of the similarity I is:

I(t, c) = log( (A × N) / ((A + C) × (A + B)) )
4. The Chinese text classification method based on MPI and Adaboost.MH according to claim 1, characterized in that the building process of the classification model is as follows:
Step 1: according to the training data set, each process assigns the weight 1/(mk) to each (sample, label) pair it holds, where m is the number of training samples and k is the total number of sample categories;
Step 2: each process computes over its local samples the weights

W_b^{jl} = Σ_{i: x_i ∈ X_j, Y_i[l] = b} D_t(i, l)

which are reduction-summed into process 0, where j indicates whether the feature word is present, l denotes the l-th label, b is -1 or 1 (-1 meaning the sample does not carry the label, 1 meaning it does), m denotes the number of training samples, D_t(i, l) denotes the weight of the l-th label of the i-th sample in the t-th iteration, x_i denotes the i-th training sample, X_j denotes the set of all training samples in which the feature word is present or absent, and Y_i denotes the label vector of the i-th sample;
Step 3: in process 0, calculating the Z_t of all candidate feature words according to

Z_t = 2 Σ_j Σ_l sqrt( W_+^{jl} × W_-^{jl} )

where Z_t denotes the normalization factor, W_+^{jl} denotes the sum, in distribution D_t, of the weights of the training samples in which the feature word is present or absent and whose l-th label is 1, and W_-^{jl} the corresponding sum for the samples whose l-th label is -1; selecting the feature word w with the smallest Z_t as the feature word to be chosen; then calculating from w the label confidences c_{+jl}, denoting the confidence that the label is 1, and c_{-jl}, denoting the confidence that it is not; broadcasting Z_t, w and the confidences to all processes through the MPI_Bcast function, all processes storing them in the structure rule;
Step 4: each process updating the label weight distribution according to the weight-distribution update formula

D_{t+1}(i, l) = D_t(i, l) × exp( -α_t × Y_i[l] × c_{j(i)l} ) / Z_t

where α_t = 1, j(i) = 1 when w ∈ x_i and j(i) = 0 otherwise, w being the selected feature word and c_{j(i)l} the confidence that the l-th label of the i-th training sample is 1;
Step 5: repeating steps 2 to 4 T times (the update of step 4 not being performed in the T-th iteration), obtaining T one-level decision tree classifiers.
5. The Chinese text classification method based on MPI and Adaboost.MH according to claim 4, characterized in that the label confidence is calculated by the following formula:

c_{jl} = (1/2) × ln( (W_+^{jl} + ε) / (W_-^{jl} + ε) )

where ε denotes the smoothing factor, equal to 1/(mk), m being the number of training samples and k the number of labels of the training set, i.e. the number of text categories.
6. The Chinese text classification method based on MPI and Adaboost.MH according to claim 4 or 5, characterized in that the classification of the text to be classified is: after the Chinese text files to be classified have undergone the processing of steps (1), (2) and (3), classifying each sample with the T one-level decision tree classifiers, and finally integrating the classification results according to the formula

H(x, l) = sign( Σ_{t=1}^{T} α_t × h_t(x, l) )

to give the final predicted categories.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710131434.9A CN107092644A (en) | 2017-03-07 | 2017-03-07 | A Chinese text classification method based on MPI and Adaboost.MH
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710131434.9A CN107092644A (en) | 2017-03-07 | 2017-03-07 | A Chinese text classification method based on MPI and Adaboost.MH
Publications (1)
Publication Number | Publication Date |
---|---|
CN107092644A true CN107092644A (en) | 2017-08-25 |
Family
ID=59648837
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710131434.9A Pending CN107092644A (en) | 2017-03-07 | 2017-03-07 | A Chinese text classification method based on MPI and Adaboost.MH
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107092644A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108509484A (en) * | 2018-01-31 | 2018-09-07 | 腾讯科技(深圳)有限公司 | Classifier construction and intelligent question and answer method, device, terminal and readable storage medium |
CN108846128A (en) * | 2018-06-30 | 2018-11-20 | 合肥工业大学 | Cross-domain text classification method based on adaptive noise reduction encoder |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102707955A (en) * | 2012-05-18 | 2012-10-03 | 天津大学 | Method for realizing support vector machine by MPI programming and OpenMP programming |
US20130212111A1 (en) * | 2012-02-07 | 2013-08-15 | Kirill Chashchin | System and method for text categorization based on ontologies |
CN105701084A (en) * | 2015-12-28 | 2016-06-22 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Characteristic extraction method of text classification on the basis of mutual information |
- 2017-03-07: CN application CN201710131434.9A filed; published as CN107092644A, status Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130212111A1 (en) * | 2012-02-07 | 2013-08-15 | Kirill Chashchin | System and method for text categorization based on ontologies |
CN102707955A (en) * | 2012-05-18 | 2012-10-03 | 天津大学 | Method for realizing support vector machine by MPI programming and OpenMP programming |
CN105701084A (en) * | 2015-12-28 | 2016-06-22 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Characteristic extraction method of text classification on the basis of mutual information |
Non-Patent Citations (2)
Title |
---|
Jorge L. Reyes-Ortiz et al.: "Big data analytics in the cloud: Spark on Hadoop vs MPI/OpenMP on Beowulf", Procedia Computer Science * |
Robert E. Schapire et al.: "BoosTexter: A Boosting-based System for Text Categorization", Machine Learning * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108509484A (en) * | 2018-01-31 | 2018-09-07 | 腾讯科技(深圳)有限公司 | Grader is built and intelligent answer method, apparatus, terminal and readable storage medium storing program for executing |
CN108509484B (en) * | 2018-01-31 | 2022-03-11 | 腾讯科技(深圳)有限公司 | Classifier construction and intelligent question and answer method, device, terminal and readable storage medium |
CN108846128A (en) * | 2018-06-30 | 2018-11-20 | 合肥工业大学 | A kind of cross-domain texts classification method based on adaptive noise encoder |
CN108846128B (en) * | 2018-06-30 | 2021-09-14 | 合肥工业大学 | Cross-domain text classification method based on adaptive noise reduction encoder |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20170825 |