CN107092644A - A Chinese text classification method based on MPI and Adaboost.MH - Google Patents

A Chinese text classification method based on MPI and Adaboost.MH

Info

Publication number
CN107092644A
Authority
CN
China
Prior art keywords
feature words
chinese text
mpi
sample
label
Prior art date
2017-03-07
Legal status
Pending
Application number
CN201710131434.9A
Other languages
Chinese (zh)
Inventor
王进
高延雨
李颖
李航
余薇
高选人
邓欣
陈乔松
胡峰
Current Assignee
Chongqing University of Posts and Telecommunications
Original Assignee
Chongqing University of Posts and Telecommunications
Priority date
2017-03-07
Filing date
2017-03-07
Publication date
2017-08-25
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN201710131434.9A
Publication of CN107092644A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G06F 16/355: Class or cluster creation or modification
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411: Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06F 18/243: Classification techniques relating to the number of classes
    • G06F 18/2431: Multiple classes
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention discloses a Chinese text classification method based on MPI and Adaboost.MH, which addresses the problem that, when the data volume is large, Adaboost.MH training takes a long time and therefore the total Chinese text classification time becomes long. The method is as follows: Chinese texts that have gone through word segmentation are saved into a training data set, and the mutual information method is combined with MPI to perform feature word selection; all processes reduce-sum their local counts with the MPI_Reduce function of MPI to obtain the similarities, and feature words are selected according to the size of the similarity. Each process then assigns a weight to each selected feature word according to whether it appears in the Chinese texts the process holds. The per-process computation results are combined through MPI communication functions to obtain a text classification model, and the model is used to classify the Chinese texts to be classified. The invention greatly shortens the time needed to classify Chinese text.

Description

A Chinese text classification method based on MPI and Adaboost.MH
Technical field
The present invention relates to the field of text mining, and in particular to a Chinese text classification method based on MPI and Adaboost.MH.
Background art
Text classification is the process of assigning a text, according to its content, to the relevant categories of a known classification system. With the progress of science and technology, the development of society, the popularization of computers and the arrival of the Internet era, the number of network texts is increasing sharply, and the text classification task shows new characteristics. First, a large number of new texts that need to be classified are produced every day, and such data usually exceed the TB scale. Second, the categories of a text show diversity, i.e. a single text may belong to several categories at once; for example, one text may belong to history, may also belong to politics, and could also belong to science and technology.
Traditional single-label methods, such as decision trees, k-nearest neighbors, neural networks, genetic algorithms, naive Bayes and support vector machines, can no longer meet people's needs. Many multi-label classification methods have therefore appeared, chiefly BR, ECC, Adaboost.MH, MLKNN, CML, ML-DT and Rank-SVM.
The Adaboost.MH algorithm is an iterative algorithm that extends the single-label algorithm Adaboost to handle multi-label problems. Its core idea is to train different weak classifiers on the same training set and then combine these weak classifiers into one strong classifier. The weak classifier chosen for Adaboost.MH here is a one-level decision tree; the algorithm idea is simple and easy to implement. However, because the amount of text that needs to be classified is now very large, Adaboost.MH must perform many iterations of learning to guarantee its classification quality, and therefore requires a substantial training time.
To improve the efficiency of the Adaboost.MH algorithm and reduce the training time, the existing solution is mainly to parallelize the algorithm. The main parallelization approaches are OpenMP, Hadoop, Spark and MPI. OpenMP can run an algorithm with multiple threads on a single machine but cannot be used on a cluster; when the data volume is too large, the memory requirements on a single machine become too high and OpenMP no longer applies. Hadoop parallelizes an algorithm on a cluster under the MapReduce framework, but Hadoop is poor at handling iterative algorithms. Spark can likewise parallelize an algorithm on a cluster, but compared with MPI it is slower.
Content of the invention
For massive data, building the training set takes a long time, and training a classification model with the Adaboost.MH algorithm requires a large amount of time. The present invention therefore combines MPI with Adaboost.MH and proposes a parallel text classification method based on MPI and Adaboost.MH.
The technical scheme by which the invention addresses the time consumption of Chinese text classification is: the pretreated text is divided into p parts, and each process handles one of them; with MPI as the supporting layer, the processes communicate with one another to perform feature selection on the training texts, build the weight vectors, train the classification model and classify the texts to be classified. Chinese text classification is thus parallelized, and its time efficiency can be improved to a high degree.
In view of this, the technical solution adopted by the invention is a Chinese text classification method based on MPI and Adaboost.MH comprising the following steps:
(1) Text pretreatment: collect Chinese text files of different fields, perform Chinese word segmentation on the collected texts, then remove punctuation marks and stop words, and save the segmented terms, separated by space characters, into the training set data as preliminary features.
(2) Feature word selection: select among the preliminary features of the pretreated text using the mutual information method.
(3) Weight vector construction: for each Chinese text file of each process, scan to judge whether each selected feature word is in the Chinese text file; if the feature word appears in the file, its corresponding weight is 1, otherwise its corresponding weight is 0; the weight vector of the Chinese text file is built in this way.
(4) Text classification model construction: build the classification model using the Adaboost.MH algorithm.
(5) Classification of texts to be classified: classify the texts to be classified with the classification model built in step (4).
Further, the concrete steps of the feature word selection of step (2) are:
The training set data is first divided evenly into p parts, and each process reads one of them. Each process then counts its own values of A, B, C and N, where A is the number of Chinese texts in category c in which feature word t appears; B is the number of Chinese texts in the categories other than c in which feature word t appears; C is the number of Chinese texts in category c in which feature word t does not appear; and N is the total number of Chinese texts over all categories. The A, B, C and N of all processes are then reduce-summed with the MPI_Reduce function of MPI, and the result is saved in process 0; from the reduced sums, process 0 computes the similarity I between feature word t and category c. Finally, the similarities I of the feature words are sorted by quicksort, the n feature words with the largest similarity I are retained, and the selection result is broadcast to all processes, which select the feature words according to the received broadcast message. The similarity I is computed by the formula: I(t, c) = log( (A·N) / ((A+C)·(A+B)) ).
The building process of the classification model in step (4) above is as follows:
Step 1: according to the training set data, each process assigns weight 1/(mk) to each label of each sample it holds, where m is the number of training set samples, i.e. the number of Chinese texts, and k is the total number of sample categories, i.e. the number of categories a Chinese text may belong to, the categories being for example science and technology, politics, and so on.
Step 2: each process computes, according to the formula W_ℓ^{jb} = Σ_{i=1}^{m} D_t(i, ℓ)·[x_i ∈ X_j]·[Y_i(ℓ) = b], the weights W_ℓ^{jb} of all feature words over the samples it holds; the local weights are then reduce-summed with the MPI_Reduce reduction function to obtain the true W_ℓ^{jb}, which are stored in process 0. Here j indicates whether the feature word is present, ℓ indexes the label, b is -1 or 1 (-1 means the sample does not carry the label, 1 means it does), m is the number of training samples, D_t(i, ℓ) is the weight of the ℓ-th label of the i-th sample in the t-th iteration, x_i is the i-th training sample, X_j is the set of all training samples in which the feature word is present (j = 1) or absent (j = 0), and Y_i is the label vector of the i-th sample.
Step 3: in process 0, compute the Z_t of every feature word according to the formula Z_t = 2 Σ_{j∈{0,1}} Σ_ℓ √(W_ℓ^{j+}·W_ℓ^{j-}), where Z_t is the normalization factor, W_ℓ^{j+} is the weight sum in distribution D_t of the training samples whose ℓ-th label is 1 among those in which the feature word is present (j = 1) or absent (j = 0), and W_ℓ^{j-} is the corresponding weight sum of the training samples whose ℓ-th label is -1. Select the feature word w with the smallest Z_t as the feature word to be chosen; then compute from w the label confidences c_{jℓ}, where c_{1ℓ} represents the confidence that label ℓ is 1 when w is present and c_{0ℓ} the confidence when w is absent. Broadcast Z_t, w, c_{0ℓ} and c_{1ℓ} to all processes with the MPI_Bcast function; all processes store them into the structure rule. The label confidence c_{jℓ} is computed by the formula c_{jℓ} = ½ ln( (W_ℓ^{j+} + ε) / (W_ℓ^{j-} + ε) ), where ε is the smoothing factor, equal to 1/(mk), m being the number of training samples and k the number of labels of the training set, i.e. the number of text categories.
Step 4: each process updates the label weight distribution according to the update formula D_{t+1}(i, ℓ) = D_t(i, ℓ)·exp(-α_t·Y_i(ℓ)·h_t(x_i, ℓ)) / Z_t, where α_t = 1; h_t(x_i, ℓ) = c_{1ℓ} when w ∈ x_i, and h_t(x_i, ℓ) = c_{0ℓ} when w ∉ x_i; w is the chosen feature word, and h_t(x_i, ℓ) represents the confidence that the ℓ-th label of the i-th training sample is 1.
Step 5: repeat steps 2 to 4 T times, without performing the step-4 update in the T-th iteration, to obtain T one-level decision tree classifiers. The Chinese text files to be classified are divided evenly into p parts read by p processes; each process scans them against the complete set of selected feature words of all categories and assigns a weight to each feature word according to whether it appears in the Chinese text file, yielding the weight vectors of the texts to be classified, i.e. the test set.
The classification of the texts to be classified is as follows: after the Chinese text files to be classified have gone through the processing of steps (1), (2) and (3) above, their weight vectors, i.e. the test set, are obtained. The test set is then divided into p parts, each process reads one of them and classifies each of its samples with the T one-level decision tree classifiers; finally, the classification results are combined according to the formula H(x, ℓ) = sign( Σ_{t=1}^{T} h_t(x, ℓ) ) to give the final predicted labels.
Beneficial effects of the invention: by combining MPI with the Adaboost.MH algorithm, the invention realizes a parallel text classification algorithm and solves the problem of the long training time caused by the large number of iterations when the data set is very large, greatly shortening the time of Chinese text classification.
Brief description of the drawings
Fig. 1 shows the flow chart of the Chinese text classification method based on MPI and the Adaboost.MH algorithm;
Fig. 2 shows the flow chart of feature word selection, assuming the number of processes p is 4;
Fig. 3 shows the training flow chart of the classifier model, assuming the number of processes p is 4.
Embodiment
The invention is further described below with reference to the accompanying drawings.

As shown in Fig. 1, the invention comprises the following five steps.
1. Text pretreatment: collect Chinese text files of different fields by means such as web crawlers and searching network information, and perform word segmentation on the collected Chinese text files. Open-source segmentation packages such as IK or ICTCLAS can be used to segment the collected texts; punctuation marks and stop words are then removed, stop words being words that occur very frequently but carry no actual meaning, such as "and" and similar function words in Chinese. The segmented terms, separated by space characters, are saved into the local training set data as preliminary features.
2nd, feature selecting:Preliminary feature is selected by using mutual information method.First by MPI_Init functions Start p process, the process number of each process is obtained according to MPI_Comm_rank, obtain total using MPI_Comm_size functions Enter number of passes p, all training set datas are divided into p parts, each process is successively read portion therein, then each enters Journey counts respective A, B, C, N respectively, and it is inter process synchronization then to perform MPI_Barrier functions, and A is the feature in classification c The number of files that word t occurs;B is the number of files that Feature Words t occurs in other classifications except classification c;C is special in classification c Levy the number of files that word t does not occur;N is the summation of number of files in all categories.Then it is all according to MPI_Reduce function pairs A, B, C, N carry out reduction summation in process, obtain A, B, C, N relative to whole training set data, are as a result stored in process 0, Process 0 is according to formulaCalculate the similarity I (t, c) between Feature Words t and classification c, it is assumed that altogether There is a k classification, each Feature Words will calculate similarity with k classification, then the weights of each Feature Words are k obtained The average value of the sum of similarity, because quick sorting algorithm relative to other sort algorithms has preferable time performance, therefore enters The weights of 0 pair of all Feature Words of journey are ranked up by quick sorting algorithm, then retain the larger preceding n Feature Words of weights (calculate similarity between Feature Words and classification, sequencing of similarity, select the amount of calculation of Feature Words relatively small, therefore these are grasped Make to complete in process 0, without parallel), then the Feature Words of reservation are broadcast to all by process 0 by MPI_Bcast functions Process, all processes, come member-retaining portion Feature Words, delete remaining Feature Words according to these information.
3. Weight vector construction: weight vectors are built for the Chinese texts after feature word selection. For each Chinese text file of each process, scan to judge whether each selected feature word is in the file; if the feature word appears in the file, its corresponding weight is 1, otherwise its corresponding weight is 0. For example, if the feature word "government" appears in an article, the weight of the feature word "government" is set to 1, otherwise to 0. The weight vector of the Chinese text file is built in this way. The sample weight vectors of all processes are saved into the file train.csv in order of rank, after which all processes are terminated with the MPI_Finalize function.
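The 0/1 weighting itself is a simple membership test over the segmented terms; a minimal sketch is given below (the helper name build_weight_vector and the input layout are illustrative assumptions, not from the patent):

```c
#include <string.h>

/* Set weights[f] = 1 iff the f-th selected feature word occurs among the
 * space-separated terms of one segmented text, otherwise 0. */
void build_weight_vector(const char **terms, int n_terms,
                         const char **features, int n_features,
                         int *weights) {
    for (int f = 0; f < n_features; f++) {
        weights[f] = 0;
        for (int i = 0; i < n_terms; i++)
            if (strcmp(terms[i], features[f]) == 0) { weights[f] = 1; break; }
    }
}
```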
4. Text classification model construction:
Step 1: first, p processes are started with the MPI_Init function; each process obtains its rank with MPI_Comm_rank and the total number of processes p with MPI_Comm_size. The file train.csv is opened, and the training set train is divided into p blocks by rows according to the number of processes p, process i reading block r_i (r_0 denotes the rows read by process 0, and so on up to r_{p-1}; the numbers of rows read by the processes are obtained by even division and differ by at most 1). For example, when 14 elements are divided among 4 processes, the processes get 3, 4, 3 and 4 elements respectively. The formulas for the even division are, as checked in the sketch below: low = id × n ÷ p is the first position in each process, where id is the rank, n the total number of elements and p the number of processes; high = (id + 1) × n ÷ p - 1 is the last position in each process; and size = high - low + 1 is the number of elements in each process. Then the weight distribution is initialized, i.e. each process assigns weight 1/(mk) to each label of each sample it holds, where m is the total number of samples and k is the number of labels.
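The even-division formulas can be verified with a few lines of C that reproduce the 14-element/4-process example above:

```c
#include <stdio.h>

/* Even block partition: process `id` of `p` gets rows [low, high]. */
int main(void) {
    int n = 14, p = 4;                      /* the example from the text */
    for (int id = 0; id < p; id++) {
        int low  = id * n / p;              /* first row of this process */
        int high = (id + 1) * n / p - 1;    /* last row of this process  */
        int size = high - low + 1;
        printf("process %d: rows %d..%d (size %d)\n", id, low, high, size);
    }
    return 0;                               /* sizes printed: 3, 4, 3, 4 */
}
```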
Step 2: each process counts the weights W_ℓ^{jb} of all feature words over the samples it holds, where j ∈ {0,1} indicates whether sample x contains the feature word (1 if it does, 0 if not), b ∈ {-1,+1} indicates the value of the label, and ℓ indexes the label. The calculation formula is W_ℓ^{jb} = Σ_{i=1}^{m} D_t(i, ℓ)·[x_i ∈ X_j]·[Y_i(ℓ) = b], where m is the number of training samples, D_t(i, ℓ) is the weight of the ℓ-th label of the i-th sample in the t-th iteration, x_i is the i-th training sample, X_j is the set of all training samples in which the feature word is present (j = 1) or absent (j = 0), and Y_i is the label vector of the i-th sample.
Step 3: the W_ℓ^{jb} of all processes are reduce-summed with the MPI_Reduce function, the true W_ℓ^{jb} being obtained and stored in process 0. In process 0, the Z_t of every feature word it holds is obtained by the formula Z_t = 2 Σ_{j∈{0,1}} Σ_ℓ √(W_ℓ^{j+}·W_ℓ^{j-}), where + denotes b = +1, - denotes b = -1 and t is the iteration number. The feature word with the smallest Z_t is the feature word w that should be chosen; the label confidences are then computed from w by the formula c_{jℓ} = ½ ln( (W_ℓ^{j+} + ε) / (W_ℓ^{j-} + ε) ). The computed c_{jℓ} and the selected feature word w are broadcast to every process with the MPI_Bcast function; every process stores c_{jℓ} and w into rule, a structure whose members are the confidences c_{jℓ} and the feature word w.
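A hedged C/MPI sketch of this step for one candidate feature word is shown below. The accumulation of the local weight sums is omitted, and the fixed label count K and the function name stump_score are illustrative assumptions; only the reduction, the Z_t and c_{jℓ} formulas and the broadcast follow the text above.

```c
#include <math.h>
#include <mpi.h>

#define K 8   /* placeholder number of labels (an assumption for the sketch) */

/* Wlocal[j][b][l]: this process's weight sums for one feature word, where
 * j = 0/1 means the word is absent/present and b = 0/1 stands for label
 * value -1/+1. Returns Z_t on every process and fills c[j][l]. */
double stump_score(double Wlocal[2][2][K], double c[2][K], int m,
                   MPI_Comm comm) {
    double W[2][2][K] = {{{0}}};
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* True weight sums = reduce-sum over all processes, result in rank 0. */
    MPI_Reduce(Wlocal, W, 2 * 2 * K, MPI_DOUBLE, MPI_SUM, 0, comm);

    double Zt = 0.0, eps = 1.0 / (m * K);   /* smoothing factor 1/(mk) */
    if (rank == 0) {
        for (int j = 0; j < 2; j++)
            for (int l = 0; l < K; l++) {
                Zt += 2.0 * sqrt(W[j][1][l] * W[j][0][l]);
                /* c_{jl} = 0.5 * ln((W+ + eps) / (W- + eps)) */
                c[j][l] = 0.5 * log((W[j][1][l] + eps) / (W[j][0][l] + eps));
            }
    }
    /* Broadcast Z_t and the confidences so every process can update D. */
    MPI_Bcast(&Zt, 1, MPI_DOUBLE, 0, comm);
    MPI_Bcast(c, 2 * K, MPI_DOUBLE, 0, comm);
    return Zt;
}
```

In a full implementation this function would be evaluated for every candidate feature word, and the word with the smallest returned Z_t kept as w.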
Step 4: using the Z_t obtained, all processes update the weight distribution according to the formula D_{t+1}(i, ℓ) = D_t(i, ℓ)·exp(-α_t·Y_i(ℓ)·h_t(x_i, ℓ)) / Z_t, where α_t = 1, h_t(x_i, ℓ) = c_{1ℓ} when w ∈ x_i, and h_t(x_i, ℓ) = c_{0ℓ} when w ∉ x_i; h_t(x_i, ℓ) is the confidence that the ℓ-th label of the i-th training sample is 1, and w is the chosen feature word.
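The corresponding distribution update over the samples held by one process could be sketched as follows (the flattened array layout and the helper name update_distribution are assumptions for illustration):

```c
#include <math.h>

/* D[i*k + l]: weight of label l of local sample i; Y[i*k + l] in {-1,+1};
 * has_w[i] = 1 iff the chosen feature word w occurs in sample i;
 * c[j*k + l]: label confidences from step 3; alpha_t = 1 as in the text. */
void update_distribution(double *D, const int *Y, const int *has_w,
                         const double *c, double Zt, int n_local, int k) {
    for (int i = 0; i < n_local; i++) {
        int j = has_w[i];                       /* h_t(x_i, l) = c_{jl} */
        for (int l = 0; l < k; l++)
            D[i * k + l] *= exp(-(double)Y[i * k + l] * c[j * k + l]) / Zt;
    }
}
```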
Step 5: repeat steps 2 to 4 T times, without performing the step-4 update in the T-th iteration, to obtain T one-level decision tree classifiers. Since the rules preserved in every process are identical, the rules preserved in process 0 are saved into the file rule.csv, and all processes are closed with the MPI_Finalize function.
5. Classification of texts to be classified
The Chinese text files to be classified are divided evenly into p parts read by p processes. Each process scans the files against the complete set of selected feature words of all categories and assigns a weight to each feature word according to whether it appears in the Chinese text file, obtaining the weight vectors of the texts to be classified; each process then saves its weight vectors into the file test.csv in order of rank.
Then p processes are started with the MPI_Init function; each process obtains its rank with MPI_Comm_rank and the total number of processes p with MPI_Comm_size; the file rule.csv is opened, and every process reads in all the information in rule.csv.
Then the file test.csv is opened and the samples to be classified are divided evenly into p parts, each process reading one of them. Each process then obtains the categories of its texts to be classified according to the preserved weak classifier rules rule and the formula H(x, ℓ) = sign( Σ_{t=1}^{T} h_t(x, ℓ) ); finally, all processes are closed with the MPI_Finalize function.
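Putting the pieces together, the final voting for one test sample might be sketched as below; the Rule layout and the fixed label count K are illustrative assumptions standing in for the contents of rule.csv:

```c
#define K 8   /* placeholder label count (an assumption for the sketch) */

/* One stored rule: index of feature word w and confidences for the cases
 * "w absent" (c0) and "w present" (c1). */
typedef struct { int w; double c0[K]; double c1[K]; } Rule;

/* H(x,l) = sign(sum_t h_t(x,l)): sum the T stump confidences per label
 * and take the sign. x is the 0/1 weight vector of the test sample. */
void predict(const Rule *rules, int T, const int *x, int *label) {
    for (int l = 0; l < K; l++) {
        double score = 0.0;
        for (int t = 0; t < T; t++)
            score += x[rules[t].w] ? rules[t].c1[l] : rules[t].c0[l];
        label[l] = (score > 0) ? 1 : -1;
    }
}
```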

Claims (6)

1. A Chinese text classification method based on MPI and Adaboost.MH, comprising the following steps:
(1) text pretreatment: collecting Chinese text files of different fields, performing Chinese word segmentation on the collected texts, then removing punctuation marks and stop words, and saving the segmented terms, separated by space characters, into the training set data as preliminary features;
(2) feature word selection: selecting among the preliminary features of the pretreated text by the mutual information method;
(3) weight vector construction: for each Chinese text file of each process, scanning to judge whether each selected feature word is in the Chinese text file; if the feature word appears in the file, its corresponding weight is 1, otherwise its corresponding weight is 0; thereby building the weight vector of the Chinese text file;
(4) text classification model construction: building the classification model using the Adaboost.MH algorithm;
(5) classification of texts to be classified: classifying the texts to be classified with the classification model built in step (4).
2. The Chinese text classification method based on MPI and Adaboost.MH according to claim 1, characterized in that the concrete steps of the feature word selection of step (2) are:
first dividing the training set data evenly into p parts, each process reading one of them; each process then counting its own values of A, B, C and N, where A is the number of Chinese texts in category c in which feature word t appears, B is the number of Chinese texts in the categories other than c in which feature word t appears, C is the number of Chinese texts in category c in which feature word t does not appear, and N is the total number of Chinese texts over all categories; then reduce-summing the A, B, C and N of all processes with the MPI_Reduce function of MPI and saving the result in process 0, process 0 computing from the reduced sums the similarity I between feature word t and category c; finally sorting the similarities I of the feature words by quicksort, retaining the n feature words with the largest similarity I, and broadcasting the selection result to all processes, all processes selecting the feature words according to the received broadcast message.
3. The Chinese text classification method based on MPI and Adaboost.MH according to claim 2, characterized in that the similarity I is computed by the formula I(t, c) = log( (A·N) / ((A+C)·(A+B)) ).
4. The Chinese text classification method based on MPI and Adaboost.MH according to claim 1, characterized in that the building process of the classification model is as follows:
step 1: according to the training set data, each process assigning weight 1/(mk) to each label of each sample it holds, m being the number of training set samples and k the total number of sample categories;
step 2: each process computing, according to the formula W_ℓ^{jb} = Σ_{i=1}^{m} D_t(i, ℓ)·[x_i ∈ X_j]·[Y_i(ℓ) = b], the weights W_ℓ^{jb} of all feature words over the samples it holds, the results being stored in process 0, where j indicates whether the feature word is present, ℓ indexes the label, b is -1 or 1, -1 meaning the sample does not carry the label and 1 meaning it does, m is the number of training samples, D_t(i, ℓ) is the weight of the ℓ-th label of the i-th sample in the t-th iteration, x_i is the i-th training sample, X_j is the set of all training samples in which the feature word is present or absent, and Y_i is the label vector of the i-th sample;
step 3: in process 0, computing the Z_t of every feature word according to the formula Z_t = 2 Σ_{j∈{0,1}} Σ_ℓ √(W_ℓ^{j+}·W_ℓ^{j-}), where Z_t is the normalization factor, W_ℓ^{j+} is the weight sum in distribution D_t of the training samples whose ℓ-th label is 1 among those in which the feature word is present or absent, and W_ℓ^{j-} is the weight sum of the training samples whose ℓ-th label is -1; selecting the feature word w with the smallest Z_t as the feature word to be chosen; then computing from w the label confidences c_{1ℓ} and c_{0ℓ}, c_{1ℓ} representing the confidence that label ℓ is 1 when w is present and c_{0ℓ} the confidence when w is absent; broadcasting Z_t, w, c_{1ℓ} and c_{0ℓ} to all processes with the MPI_Bcast function, all processes storing them into the structure rule;
step 4: each process updating the label weight distribution according to the update formula D_{t+1}(i, ℓ) = D_t(i, ℓ)·exp(-α_t·Y_i(ℓ)·h_t(x_i, ℓ)) / Z_t, where α_t = 1, h_t(x_i, ℓ) = c_{1ℓ} when w ∈ x_i, and h_t(x_i, ℓ) = c_{0ℓ} when w ∉ x_i, w being the feature word and h_t(x_i, ℓ) representing the confidence that the ℓ-th label of the i-th training sample is 1;
step 5: repeating steps 2 to 4 T times, without performing the step-4 update in the T-th iteration, to obtain T one-level decision tree classifiers.
5. The Chinese text classification method based on MPI and Adaboost.MH according to claim 4, characterized in that the label confidence c_{jℓ} is computed by the formula c_{jℓ} = ½ ln( (W_ℓ^{j+} + ε) / (W_ℓ^{j-} + ε) ), where ε is the smoothing factor, equal to 1/(mk), m being the number of training samples and k the number of labels of the training set, i.e. the number of text categories.
6. The Chinese text classification method based on MPI and Adaboost.MH according to claim 4 or 5, characterized in that the classification of the texts to be classified is: after the Chinese text files to be classified have gone through the processing of steps (1), (2) and (3), each process classifying each sample it holds with the T one-level decision tree classifiers, and finally the classification results being combined according to the formula H(x, ℓ) = sign( Σ_{t=1}^{T} h_t(x, ℓ) ) to give the final predicted labels.
CN201710131434.9A 2017-03-07 2017-03-07 A Chinese text classification method based on MPI and Adaboost.MH Pending CN107092644A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710131434.9A CN107092644A (en) 2017-03-07 2017-03-07 A Chinese text classification method based on MPI and Adaboost.MH

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710131434.9A CN107092644A (en) 2017-03-07 2017-03-07 A Chinese text classification method based on MPI and Adaboost.MH

Publications (1)

Publication Number Publication Date
CN107092644A 2017-08-25

Family

ID=59648837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710131434.9A Pending CN107092644A (en) A Chinese text classification method based on MPI and Adaboost.MH

Country Status (1)

Country Link
CN (1) CN107092644A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509484A (en) * 2018-01-31 2018-09-07 腾讯科技(深圳)有限公司 Classifier construction and intelligent question and answer method, device, terminal and readable storage medium
CN108846128A (en) * 2018-06-30 2018-11-20 合肥工业大学 Cross-domain text classification method based on adaptive noise reduction encoder

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102707955A (en) * 2012-05-18 2012-10-03 天津大学 Method for realizing support vector machine by MPI programming and OpenMP programming
US20130212111A1 (en) * 2012-02-07 2013-08-15 Kirill Chashchin System and method for text categorization based on ontologies
CN105701084A (en) * 2015-12-28 2016-06-22 广东顺德中山大学卡内基梅隆大学国际联合研究院 Feature extraction method for text classification based on mutual information

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130212111A1 (en) * 2012-02-07 2013-08-15 Kirill Chashchin System and method for text categorization based on ontologies
CN102707955A (en) * 2012-05-18 2012-10-03 天津大学 Method for realizing support vector machine by MPI programming and OpenMP programming
CN105701084A (en) * 2015-12-28 2016-06-22 广东顺德中山大学卡内基梅隆大学国际联合研究院 Feature extraction method for text classification based on mutual information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JORGE L. REYES-ORTIZ et al.: "Big data analytics in the cloud: Spark on Hadoop vs MPI/OpenMP on Beowulf", Procedia Computer Science *
ROBERT E. SCHAPIRE et al.: "BoosTexter: A Boosting-based System for Text Categorization", Machine Learning *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509484A (en) * 2018-01-31 2018-09-07 腾讯科技(深圳)有限公司 Classifier construction and intelligent question and answer method, device, terminal and readable storage medium
CN108509484B (en) * 2018-01-31 2022-03-11 腾讯科技(深圳)有限公司 Classifier construction and intelligent question and answer method, device, terminal and readable storage medium
CN108846128A (en) * 2018-06-30 2018-11-20 合肥工业大学 Cross-domain text classification method based on adaptive noise reduction encoder
CN108846128B (en) * 2018-06-30 2021-09-14 合肥工业大学 Cross-domain text classification method based on adaptive noise reduction encoder

Similar Documents

Publication Publication Date Title
CN106815369B (en) A kind of file classification method based on Xgboost sorting algorithm
CN106779087B (en) A kind of general-purpose machinery learning data analysis platform
CN109189901B (en) Method for automatically discovering new classification and corresponding corpus in intelligent customer service system
CN108363810B (en) Text classification method and device
Boley et al. Training support vector machine using adaptive clustering
CN108364016A (en) Gradual semisupervised classification method based on multi-categorizer
CN103425996B (en) A kind of large-scale image recognition methods of parallel distributed
CN110309888A (en) A kind of image classification method and system based on layering multi-task learning
CN106599913A (en) Cluster-based multi-label imbalance biomedical data classification method
CN109299271A (en) Training sample generation, text data, public sentiment event category method and relevant device
CN102289522A (en) Method of intelligently classifying texts
CN101763431A (en) PL clustering method based on massive network public sentiment information
CN106156163B (en) Text classification method and device
CN104702465A (en) Parallel network flow classification method
CN102156871A (en) Image classification method based on category correlated codebook and classifier voting strategy
CN109446423B (en) System and method for judging sentiment of news and texts
CN110347791B (en) Topic recommendation method based on multi-label classification convolutional neural network
CN110442568A (en) Acquisition methods and device, storage medium, the electronic device of field label
CN110008365B (en) Image processing method, device and equipment and readable storage medium
CN105912525A (en) Sentiment classification method for semi-supervised learning based on theme characteristics
Elnagar et al. Automatic text tagging of Arabic news articles using ensemble deep learning models
CN112148868A (en) Law recommendation method based on law co-occurrence
CN110288028A (en) ECG detecting method, system, equipment and computer readable storage medium
Chu et al. Co-training based on semi-supervised ensemble classification approach for multi-label data stream
CN107092644A (en) A kind of Chinese Text Categorization based on MPI and Adaboost.MH

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2017-08-25)