CN107894980A - Multilingual sentence-pair text corpus classification method and classifier - Google Patents

Multilingual sentence-pair text corpus classification method and classifier

Info

Publication number
CN107894980A
CN107894980A (application CN201711276465.XA)
Authority
CN
China
Prior art keywords
languages
word
convolutional neural network
Chinese
English
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711276465.XA
Other languages
Chinese (zh)
Inventor
陈件
张井
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201711276465.XA priority Critical patent/CN107894980A/en
Publication of CN107894980A publication Critical patent/CN107894980A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a multilingual sentence-pair text corpus classification method and classifier. The classification method comprises: inputting data to be predicted and separating it by language; randomly shuffling the words of each language; feeding the randomly shuffled words of each language into the convolutional neural network model of the corresponding language, and predicting to obtain a prediction result; cross-verifying the prediction results of the multiple languages; and outputting a final judgment according to the cross-verification result. The classification method of the invention can improve classification precision.

Description

Multilingual sentence-pair text corpus classification method and classifier
Technical field
The present invention relates to the field of information technology, and in particular to a multilingual sentence-pair text corpus classification method and classifier.
Background technology
A text corpus is a basic resource that carries linguistic knowledge on an electronic computer. Complete corpora are used for language model construction, lexicography, text classification, and so on. Text classification is the automatic labelling of a text set (or other entities or objects) by a computer according to a given taxonomy or standard.
In essence, a text classification problem is no different from any other classification problem: the method matches certain features of the data to be classified. A perfect match is of course impossible, so the best match (according to some evaluation criterion) must be selected to complete the classification.
Current classification methods, however, struggle to classify accurately. Some existing platforms hold a large number of bilingual sentence pairs, most of which have no category label; even among the labelled pairs, only a small fraction is labelled accurately. Yet data retrieval, content distribution, routing, and other platform functions all depend on accurate corpus category labels. To make better use of the various corpus platforms, a practical classification method with high classification precision is therefore needed.
It should be noted that the above introduction to the technical background is intended only to facilitate a clear and complete explanation of the technical solution of the present application and to aid understanding by those skilled in the art. The above technical solutions should not be considered known to those skilled in the art merely because they are set forth in the background section of the present application.
Summary of the invention
In view of the above drawbacks of the prior art, the technical problem to be solved by the present invention is to provide a multilingual sentence-pair text corpus classification method and classifier that improve classification precision.
To achieve the above object, the invention provides a multilingual sentence-pair text corpus classification method, comprising:
inputting data to be predicted and separating it by language;
randomly shuffling the words of each language;
feeding the randomly shuffled words of each language into the convolutional neural network model of the corresponding language, and predicting to obtain a prediction result;
cross-verifying the prediction results of the multiple languages;
outputting a final judgment according to the cross-verification result.
Further, after the step of randomly shuffling the words of each language and before the step of feeding the shuffled words into the convolutional neural network model of the corresponding language and predicting, the method further comprises:
judging whether the number of random shuffles has reached a preset count; if so, cross-verifying the per-shuffle prediction results to obtain a combined prediction label over the preset count of shuffles;
making the final judgment according to that combined prediction label. In this embodiment, training a convolutional neural network model requires input vectors of identical dimension, so some long sentences must be truncated, which risks losing sentence information. The number of shuffles is therefore at least two, and the preset count can be set according to the number of languages, the complexity of the data to be predicted, and so on. Shuffling repeatedly before each truncation, followed by the subsequent steps, prevents truncation from discarding key sentence information, and also screens out sentences that are of poor quality, ambiguously classified, or unhelpful for training. The final judgment is based on predictions over multiple randomly shuffled copies of the text: if the repeated predictions agree, the prediction can naturally be considered more accurate; if their degree of disagreement exceeds a preset value, the sample can be assigned to an "unclassifiable" or "unknown" category. The preset count is chosen case by case so that the trained classifier meets the required precision.
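As a concrete illustration of the repeated-shuffle step, the sketch below combines one prediction per shuffle by majority vote, falling back to an "unknown" label when the votes disagree too much. It is a minimal stand-in under stated assumptions, not the patent's implementation: `predict_fn`, the toy labels, and the `max_disagreement` threshold are all illustrative.

```python
import random
from collections import Counter

def predict_with_shuffles(tokens, predict_fn, n_repeats=5, max_disagreement=0.4, seed=0):
    """Shuffle the token order n_repeats times, predict each time,
    and combine the per-shuffle labels by majority vote.
    If the winning label gets too small a share of the votes,
    return the placeholder label 'unknown' instead.
    predict_fn stands in for a trained per-language CNN classifier."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_repeats):
        shuffled = tokens[:]      # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        votes.append(predict_fn(shuffled))
    label, count = Counter(votes).most_common(1)[0]
    if 1 - count / n_repeats > max_disagreement:
        return "unknown"
    return label

# toy stand-in classifier: the label depends only on the token multiset,
# so shuffling never changes it and all votes agree
toy = lambda toks: "medicine" if "aspirin" in toks else "other"
print(predict_with_shuffles(["take", "aspirin", "daily"], toy))  # medicine
```

Because the toy classifier is permutation-invariant, all five votes always agree; with a real CNN over truncated inputs, the vote split is exactly what the disagreement threshold is meant to catch.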
Further, the preset count is 5. In this embodiment the preset count can vary with circumstances; for example, when the data to be predicted is a Chinese-English bilingual pair, the preset count may be set to about 5.
Further, the languages include Chinese and English, and the convolutional neural network models include a Chinese convolutional neural network model and an English convolutional neural network model. In this embodiment, a Chinese model and an English model are trained in advance; at prediction time each makes its own prediction, and running the two classifiers in parallel and then cross-verifying improves classification precision.
Further, after the step of inputting the data to be predicted and separating it by language, and before the step of randomly shuffling the words of each language, the method further comprises:
first applying Chinese word segmentation to the separated Chinese words and then Chinese stop-word filtering; and, in parallel, first applying space-based tokenisation to the separated English words and then English stop-word filtering. In this embodiment, whether for text vectorisation or for word-based operations such as stop-word filtering, the text must first be segmented into words, for example with a suitable segmenter. Stop-words are words used so frequently that they contribute almost nothing to the meaning of a sentence; they are of almost no help to the classification task and dilute the more discriminative words, so they are filtered out before training. Concretely, the stop-word list can be placed in a set and the text filtered against it with a suitable filtering routine.
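The per-language tokenisation and stop-word filtering described above can be sketched as follows. The greedy longest-match segmenter is a toy stand-in (the patent names no particular segmenter; a real system would use a tool such as jieba), and the word lists are illustrative assumptions.

```python
import re

def tokenize_english(text):
    """Space/punctuation tokenisation for the English half of a sentence pair."""
    return re.findall(r"[A-Za-z']+", text.lower())

def segment_chinese(text, dictionary):
    """Greedy longest-match segmentation over a small word dictionary --
    a stand-in for a real Chinese segmenter."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + 4), i, -1):  # try longest match first
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

def filter_stopwords(tokens, stopwords):
    """Drop high-frequency words that carry little class information."""
    return [t for t in tokens if t not in stopwords]

en_stop = {"the", "a", "of", "is"}          # illustrative stop-word sets
zh_dict = {"神经", "网络", "分类"}           # illustrative segmentation dictionary
zh_stop = {"的"}

print(filter_stopwords(tokenize_english("The network is a classifier"), en_stop))
print(filter_stopwords(segment_chinese("神经网络的分类", zh_dict), zh_stop))
```

The same two-stage pipeline (segment, then filter) runs at both training and prediction time, so the two stages are kept as separate functions.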
Further, the convolutional neural network model is obtained by preprocessing samples and training a convolutional neural network. In this embodiment the model is trained in advance from preprocessed samples by a convolutional neural network; in addition, while a suitable platform uses the model, partial information can be saved in order to optimise the model further. Concretely, the input to the CNN is a two-dimensional vector whose size may be 50 × (dictionary size). Because the dictionary is very large, these two-dimensional vectors are also very large and very sparse (containing a great many zeros), and a CNN handles such high-dimensional sparse vectors poorly; an embedding layer therefore reduces them to 50 × 200 or another suitable size (chosen according to the available computing power) before the second layer is trained to learn higher-level features. This layer acts in effect as a pre-training step that extracts primary text features. Moreover, after the model had been trained, further tests on new samples showed that the predictions for many samples failed a common-sense check: those samples in fact belonged to none of the known categories. These samples were removed, and a new "unknown" category was added specifically to receive samples that belong to no known class. Since such a sample belongs to no category, it was hypothesised that when it is fed into the CNN the standard deviation of the output probability distribution should be relatively small. Experiments confirmed the hypothesis, and a standard-deviation threshold (0.05) was determined, by which unclassifiable samples are assigned to the unknown category, improving the reliability of the classification results.
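The standard-deviation test for the "unknown" category can be written directly from the description above. The 0.05 threshold is the value the text reports; the example probability vectors are illustrative.

```python
import math

def is_unknown(probs, threshold=0.05):
    """Flag a prediction as 'unknown class' when the CNN's output
    distribution is too flat: the standard deviation of the class
    probabilities falls below the empirically chosen threshold (0.05)."""
    n = len(probs)
    mean = sum(probs) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in probs) / n)
    return std < threshold

print(is_unknown([0.26, 0.24, 0.25, 0.25]))  # near-uniform output -> True
print(is_unknown([0.90, 0.05, 0.03, 0.02]))  # confident output    -> False
```

A near-uniform output has a standard deviation close to zero, while a confident prediction concentrates mass on one class and pushes the deviation well above the threshold.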
Further, the training of the convolutional neural network models comprises the steps of:
separating the original text into Chinese and English;
applying Chinese word segmentation and then Chinese stop-word filtering to the Chinese words obtained by the separation; and, in parallel, applying space-based tokenisation and then English stop-word filtering to the separated English words;
selecting high-quality samples from the filtered Chinese and English words on the basis of a term dictionary;
training a Chinese convolutional neural network model and an English convolutional neural network model on the high-quality samples with a convolutional neural network. In this embodiment, because raw terminology entries are too coarse-grained, each term is first segmented into its smallest-granularity words and a dictionary is generated (for example with the Gensim library). The sample data is then projected onto the term dictionary, so that each sentence generates a vector of ("word", occurrence frequency) pairs. Summing the second component of these pairs gives the number of terms (vocabulary items) the whole sentence contains; from this, the ratio of term words to the sentence's total word count is obtained and taken as the degree to which the sentence belongs to the category in question. Concretely, the Sogou cell dictionary and the tmxmall platform, for example, hold terminology data of very good quality, with which samples of relatively high quality can be selected from the original samples. For a passage of corpus text (say one labelled "medicine"), the medical terminology data lets us count the frequency of medical terms appearing in the passage and the percentage of the passage's total word count they account for; combining these two indicators measures how strongly the passage really belongs to "medicine". A reasonable threshold is found by experimental testing (each category must be tested separately), and the high-quality corpus data is selected accordingly.
Further, before the convolutional neural network is trained on the high-quality samples, the samples must first be vectorised into vectors of identical length. In this embodiment vectorisation is performed not only at training time but also at prediction time: variable-length text must be turned into equal-length vectors before it can be fed into the convolutional neural network (CNN) for training. Concretely, because individual sentences in the corpus are short and carry little information, which hinders feature extraction and class discrimination, short sentences are first combined into longer segments of similar length (40 to 50 words). Each segment is then converted into a word vector with a frequency-based bag-of-words (Bag of Words) model: each vector component corresponds to one word, and the number is the frequency of that word in the segment, so the vector length equals the dictionary size; since a word rarely occurs more than once in a sentence, all frequencies default to 1. Finally, before the vector enters the neural network, its length is normalised (by zero-padding or truncation) so that every generated vector has length exactly 50.
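The fixed-length vectorisation step can be sketched as follows. The tiny corpus is illustrative, and the index-sequence form is one plausible reading of the text's mixed bag-of-words/fixed-length description, under the stated pad-with-0-or-truncate rule.

```python
def build_vocab(corpus):
    """Assign each distinct word an integer index; 0 is reserved for padding."""
    vocab = {}
    for tokens in corpus:
        for t in tokens:
            vocab.setdefault(t, len(vocab) + 1)
    return vocab

def vectorize(tokens, vocab, length=50):
    """Map tokens to indices, then truncate or zero-pad to a fixed length,
    matching the pad-with-0-or-truncate normalisation described above."""
    ids = [vocab.get(t, 0) for t in tokens][:length]
    return ids + [0] * (length - len(ids))

corpus = [["deep", "neural", "network"], ["neural", "text", "classifier"]]
vocab = build_vocab(corpus)
v = vectorize(["neural", "network", "classifier"], vocab)
print(len(v), v[:3])  # 50 [2, 3, 5]
```

Every output vector has length exactly 50 regardless of the input length, which is the property the CNN's fixed input dimension requires.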
Further, the formation of the term dictionary comprises the steps of:
first applying term segmentation to the original Chinese terminology and then stop-word filtering;
generating the term dictionary from the filtered original Chinese terminology. In this embodiment, the term dictionary helps obtain high-quality text, which improves the quality of the trained convolutional neural network models and in turn the classification precision.
The invention also discloses a multilingual sentence-pair text corpus classifier, comprising:
a language separation module for separating the input data to be predicted by language;
a random shuffle module for randomly shuffling the words of each language;
a convolutional neural network prediction module for feeding the randomly shuffled words of each language into the convolutional neural network model of the corresponding language and predicting to obtain a prediction result;
a cross-verification module for cross-verifying the prediction results of the multiple languages;
an output module for outputting the final judgment according to the cross-verification result.
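The five modules above can be wired together roughly as below. The per-language models are plain callables standing in for the trained CNNs, and the agree-or-unknown rule is a simplified stand-in for the cross-verification module; all names and toy labels are illustrative.

```python
import random

class SentencePairClassifier:
    """Skeleton wiring of the five claimed modules for a Chinese-English pair."""

    def __init__(self, zh_model, en_model):
        self.models = {"zh": zh_model, "en": en_model}

    def separate(self, pair):
        # language separation module: a sentence pair arrives as (zh, en)
        return {"zh": pair[0], "en": pair[1]}

    def predict(self, pair):
        parts = self.separate(pair)
        labels = {}
        for lang, tokens in parts.items():
            shuffled = tokens[:]
            random.shuffle(shuffled)        # random shuffle module
            labels[lang] = self.models[lang](shuffled)
        # cross-verification module: accept only when both languages agree
        zh, en = labels["zh"], labels["en"]
        return zh if zh == en else "unknown"  # output module

# toy permutation-invariant stand-ins for the two trained CNNs
zh = lambda toks: "medicine" if "药" in toks else "other"
en = lambda toks: "medicine" if "drug" in toks else "other"
clf = SentencePairClassifier(zh, en)
print(clf.predict((["这", "种", "药"], ["this", "drug"])))  # medicine
```

When the two per-language predictions disagree, the pair falls into the "unknown" bucket rather than being forced into a category, which is the precision gain the cross-verification step is claimed to provide.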
In the present invention a convolutional neural network model is trained in advance for each language. A multilingual sentence pair contains correlated prior knowledge across its languages, and this extra information benefits the classification, making it more precise. Before the models are applied, the languages in the data to be predicted must first be separated; the separated data is then fed to the classifier trained for each language, which makes full use of the characteristics of bilingual sentence pairs, and cross-verifying among the classifiers improves the precision of the classification. Furthermore, because each sample newly fed to the classifier must first be truncated, and truncation risks discarding key sentence information, the word order of each sentence is randomly shuffled before every input to the classifier, and only then is the classification judgment made.
The following description and accompanying drawings disclose particular embodiments of the present application in detail and indicate ways in which its principles may be employed. It should be understood that the embodiments of the present application are not thereby limited in scope; within the spirit and terms of the appended claims, they include many changes, modifications, and equivalents.
Features described and/or illustrated for one embodiment may be used in the same or a similar manner in one or more other embodiments, combined with features of other embodiments, or substituted for them.
It should be emphasised that the term "comprises/comprising", when used herein, denotes the presence of a feature, integer, step, or component, but does not exclude the presence or addition of one or more other features, integers, steps, or components.
Brief description of the drawings
The accompanying drawings are included to provide a further understanding of the embodiments of the present application and constitute a part of the specification; they illustrate the embodiments and, together with the written description, explain the principles of the application. Evidently, the drawings described below show only some embodiments of the application, and those of ordinary skill in the art can derive other drawings from them without inventive effort. In the drawings:
Fig. 1 is a flow chart of a multilingual sentence-pair text corpus classification method of the present invention;
Fig. 2 is a training flow chart of the convolutional neural network models of the present invention;
Fig. 3 is a schematic diagram of a multilingual sentence-pair text corpus classifier of the present invention.
Embodiment
To help those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings of those embodiments. The described embodiments are evidently only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art from the embodiments in the present application without inventive effort shall fall within the scope of protection of the present application.
Fig. 1 is a flow chart of a multilingual sentence-pair text corpus classification method of the present invention. With reference to Fig. 1, the present invention provides a multilingual sentence-pair text corpus classification method, comprising:
S1: inputting data to be predicted and separating it by language;
S2: randomly shuffling the words of each language;
S3: feeding the randomly shuffled words of each language into the convolutional neural network model of the corresponding language, and predicting to obtain a prediction result;
S4: cross-verifying the prediction results of the multiple languages;
S5: outputting a final judgment according to the cross-verification result.
In the present invention a convolutional neural network model is trained in advance for each language. A multilingual sentence pair contains correlated prior knowledge across its languages, and this extra information benefits the classification, making it more precise. Before the models are applied, the languages in the data to be predicted must first be separated; the separated data is then fed to the classifier trained for each language, which makes full use of the characteristics of bilingual sentence pairs, and cross-verifying among the classifiers improves the precision of the classification. Furthermore, because each sample newly fed to the classifier must first be truncated, and truncation risks discarding key sentence information, the word order of each sentence is randomly shuffled before every input to the classifier, and only then is the classification judgment made.
Optionally in this embodiment, after the step of randomly shuffling the words of each language and before the step of feeding the shuffled words into the convolutional neural network model of the corresponding language and predicting, the method further comprises:
judging whether the number of random shuffles has reached the preset count; if so, cross-verifying the per-shuffle prediction results to obtain a combined prediction label over the preset count of shuffles;
making the final judgment according to that combined prediction label. In this embodiment, training a convolutional neural network model requires input vectors of identical dimension, so some long sentences must be truncated, which risks losing sentence information. The number of shuffles is therefore at least two, and the preset count can be set according to the number of languages, the complexity of the data to be predicted, and so on. Shuffling repeatedly before each truncation, followed by the subsequent steps, prevents truncation from discarding key sentence information, and also screens out sentences that are of poor quality, ambiguously classified, or unhelpful for training. The final judgment is based on predictions over multiple randomly shuffled copies of the text: if the repeated predictions agree, the prediction can naturally be considered more accurate; if their degree of disagreement exceeds a preset value, the sample can be assigned to an "unclassifiable" or "unknown" category. The preset count is chosen case by case so that the trained classifier meets the required precision.
Optionally in this embodiment, the preset count is 5. The preset count can vary with circumstances; for example, when the data to be predicted is a Chinese-English bilingual pair, the preset count may be set to about 5.
Optionally in this embodiment, the languages include Chinese and English, and the convolutional neural network models include a Chinese convolutional neural network model and an English convolutional neural network model. A Chinese model and an English model are trained in advance; at prediction time each makes its own prediction, and running the two classifiers in parallel and then cross-verifying improves classification precision.
Optionally in this embodiment, after the step of inputting the data to be predicted and separating it by language, and before the step of randomly shuffling the words of each language, the method further comprises:
first applying Chinese word segmentation to the separated Chinese words and then Chinese stop-word filtering; and, in parallel, first applying space-based tokenisation to the separated English words and then English stop-word filtering. In this embodiment, whether for text vectorisation or for word-based operations such as stop-word filtering, the text must first be segmented into words, for example with a suitable segmenter. Stop-words are words used so frequently that they contribute almost nothing to the meaning of a sentence; they are of almost no help to the classification task and dilute the more discriminative words, so they are filtered out before training.
Optionally in this embodiment, the convolutional neural network model is obtained by preprocessing samples and training a convolutional neural network. The model is trained in advance from preprocessed samples by a convolutional neural network; in addition, while a suitable platform uses the model, partial information can be saved in order to optimise the model further. Concretely, the input to the CNN is a two-dimensional vector whose size may be 50 × (dictionary size). Because the dictionary is very large, these two-dimensional vectors are also very large and very sparse (containing a great many zeros), and a CNN handles such high-dimensional sparse vectors poorly; an embedding layer therefore reduces them to 50 × 200 or another suitable size (chosen according to the available computing power) before the second layer is trained to learn higher-level features. This layer acts in effect as a pre-training step that extracts primary text features. Moreover, after the model had been trained, further tests on new samples showed that the predictions for many samples failed a common-sense check: those samples in fact belonged to none of the known categories. These samples were removed, and a new "unknown" category was added specifically to receive samples that belong to no known class. Since such a sample belongs to no category, it was hypothesised that when it is fed into the CNN the standard deviation of the output probability distribution should be relatively small. Experiments showed that this was indeed the case, and a standard-deviation threshold (0.05) was successfully found, by which unclassifiable samples are assigned to the unknown category, improving the reliability of the classification results.
The two parameters are therefore referenced together, and it is finally decided to screen and retain those sentences whose term vocabulary count is at least x and whose term vocabulary ratio exceeds y. Because term quality and category features differ between categories, (x, y) also differs per category and must be determined experimentally from the original sample count and the number of samples to be selected; the concrete values of x and y can be set case by case.
Fig. 2 is a training flow chart of the convolutional neural network models of the present invention. With reference to Fig. 2 and Fig. 1, optionally in this embodiment the training of the convolutional neural network models comprises the steps of:
S31: separating the original text into Chinese and English;
S32: applying Chinese word segmentation and then Chinese stop-word filtering to the Chinese words obtained by the separation; and, in parallel, applying space-based tokenisation and then English stop-word filtering to the separated English words;
S33: selecting high-quality samples from the filtered Chinese and English words on the basis of the term dictionary;
S34: training a Chinese convolutional neural network model and an English convolutional neural network model on the high-quality samples with a convolutional neural network. In this embodiment, because raw terminology entries are too coarse-grained, each term is first segmented into its smallest-granularity words and a dictionary is generated (for example with the Gensim library). The sample data is then projected onto the term dictionary, so that each sentence generates a vector of ("word", occurrence frequency) pairs. Summing the second component of these pairs gives the number of terms (vocabulary items) the whole sentence contains; from this, the ratio of term words to the sentence's total word count is obtained and taken as the degree to which the sentence belongs to the category in question. Concretely, the Sogou cell dictionary and the tmxmall platform, for example, hold terminology data of very good quality, with which samples of relatively high quality can be selected from the original samples. For a passage of corpus text (say one labelled "medicine"), the medical terminology data lets us count the frequency of medical terms appearing in the passage and the percentage of the passage's total word count they account for; combining these two indicators measures how strongly the passage really belongs to "medicine". A reasonable threshold is found by experimental testing (each category must be tested separately), and the high-quality corpus data is selected accordingly.
In this embodiment, optionally, before the convolutional neural networks are trained on the high-quality samples, the high-quality samples need to be vectorized into vectors of identical length. In this embodiment, vectorization is performed not only in the training stage but also in the prediction stage: texts of variable length must be converted into vectors of identical length before they can be fed into the convolutional neural network (CNN) for training. Specifically, because the sentences in the corpus data are short, a single sentence carries little information, which is unfavorable for extracting features and distinguishing categories; short sentences are therefore combined into longer segments of similar length (40 to 50 words). Then, using a frequency-based bag-of-words (Bag of Words) model, each segment is converted into a term vector (each element of the vector corresponds to a word, and the number is the frequency with which that word appears in the segment; the vector length thus matches the dictionary size; furthermore, since the probability of a word actually appearing multiple times in one sentence is very small, all frequencies default to 1). Finally, before the vector is fed into the neural network, a length normalization is performed (by padding with 0 or truncating, vectors of length exactly 50 are generated).
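A minimal sketch of this preprocessing, under stated assumptions: the embodiment fixes the final length at 50 by zero-padding or truncation and merges short sentences into 40-to-50-word segments; the function names are hypothetical, and encoding each word as a dictionary index is one common reading of the bag-of-words description above (with all frequencies defaulting to 1).

```python
def merge_short_sentences(sentences, min_len=40, max_len=50):
    """Concatenate short sentences (lists of words) into segments of
    roughly 40-50 words, as the embodiment describes."""
    segments, current = [], []
    for sent in sentences:
        current.extend(sent)
        if len(current) >= min_len:
            segments.append(current[:max_len])
            current = current[max_len:]
    if current:  # keep the trailing remainder as its own segment
        segments.append(current)
    return segments

def vectorize(segment, vocab_index, length=50):
    """Map each in-vocabulary word to its dictionary index (its frequency
    is taken as 1, repeats being rare), then pad with 0 or truncate so
    every vector has length exactly `length`."""
    vec = [vocab_index[w] for w in segment if w in vocab_index]
    return vec[:length] + [0] * max(0, length - len(vec))
```

The fixed-length output is what lets segments of different sizes share one CNN input layer.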
In this embodiment, optionally, the formation of the term dictionary includes the steps of:
first performing term word segmentation on the original Chinese terms, then performing stop-word filtering;
generating the term dictionary from the original Chinese terms obtained by the filtering. In this embodiment, the formation of the term dictionary helps to obtain high-quality text, which improves the quality of the trained convolutional neural network models and in turn the classification precision.
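These two steps can be sketched as follows; the segmentation callback and the illustrative stop-word set are assumptions for the sketch, not part of the patent (a real system would plug in a Chinese segmenter such as jieba and a full stop-word list).

```python
import re

STOP_WORDS = {"的", "了", "和"}  # illustrative stop words only

def build_term_dictionary(raw_terms, segment=None):
    """Build the term dictionary: segment each raw term into
    minimal-granularity words, drop stop words, collect the rest.

    `segment` is the word-segmentation function; by default terms are
    split on non-word boundaries as a stand-in for a real segmenter.
    """
    if segment is None:
        segment = lambda term: re.findall(r"\w+", term)
    dictionary = set()
    for term in raw_terms:
        for word in segment(term):
            if word not in STOP_WORDS:
                dictionary.add(word)
    return dictionary
```

The resulting set is what steps S33/S34 project sample sentences onto when scoring term density.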
Fig. 3 is a schematic diagram of a multilingual sentence-pair text corpus classifier of the present invention, which can be understood with reference to Fig. 2 and Fig. 1. The present invention also discloses a multilingual sentence-pair text corpus classifier 100, including:
a language separation module 10, for performing language separation on input data to be predicted;
a random shuffling module 20, for randomly shuffling the words corresponding to each language;
a convolutional neural network model prediction module 30, for loading the randomly shuffled words of each language into the convolutional neural network model of the corresponding language, and predicting to obtain a prediction result;
a cross-verification module 40, for cross-verifying the prediction results corresponding to the multiple languages;
an output module 50, for outputting a final judgment result according to the cross-verification result.
The classifier of the present invention can apply the classification method described above. Convolutional neural network models are trained in advance for the different languages respectively. Multilingual sentences contain prior knowledge of the correlations between languages, and this extra information benefits the classification, so the classification precision is higher. In addition, before the convolutional neural network models are loaded, the languages in the data to be predicted must first be separated, and the separated data are predicted separately by the classifiers trained for the corresponding languages; this makes full use of the characteristics of bilingual sentence pairs, and cross-verification between the multiple classifiers improves the classification precision. Furthermore, because samples newly input to the classifier must first be truncated, and in order to prevent the truncation from losing key information of a sentence, the word order of each sentence is randomly shuffled each time before the sentence is input to the classifier, and the classification judgment is then performed.
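The module pipeline above, combined with the repeated shuffling of claim 2 (a preset number of rounds, claim 3's value being 5), might be sketched like this; the stand-in model callables and the majority-vote rule used for cross-verification are assumptions for illustration, not the patent's stated implementation.

```python
import random
from collections import Counter

def classify_sentence_pair(zh_words, en_words, zh_model, en_model, n_shuffles=5):
    """Sketch of classifier 100: repeat the shuffle-and-predict step a
    preset number of times, querying each language-specific model per
    round, then cross-verify by majority vote over all collected votes
    to form the integrated prediction label.

    zh_model / en_model stand in for the trained CNNs: any callable
    mapping a word list to a category label.
    """
    votes = []
    for _ in range(n_shuffles):
        zh, en = zh_words[:], en_words[:]
        random.shuffle(zh)  # shuffling guards against truncation dropping key words
        random.shuffle(en)
        votes.append(zh_model(zh))
        votes.append(en_model(en))
    # cross-verification: the integrated prediction label is the majority vote
    return Counter(votes).most_common(1)[0][0]
```

Any vote-aggregation rule could replace the majority vote; the essential point from the description is that both languages' predictions check each other.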
The preferred embodiments of the present invention have been described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and variations according to the concept of the present invention without creative effort. Therefore, any technical solution obtainable by a person skilled in the art through logical analysis, reasoning, or limited experimentation on the basis of the prior art under the concept of the present invention shall fall within the protection scope defined by the claims.

Claims (10)

1. A multilingual sentence-pair text corpus classification method, characterized by comprising:
inputting data to be predicted and performing language separation;
randomly shuffling the words corresponding to each language;
loading the randomly shuffled words of each language into a convolutional neural network model of the corresponding language, and predicting to obtain a prediction result;
cross-verifying the prediction results corresponding to the multiple languages;
outputting a final judgment result according to the cross-verification result.
2. The multilingual sentence-pair text corpus classification method according to claim 1, characterized in that after the step of randomly shuffling the words corresponding to each language, and before the step of loading the randomly shuffled words of each language into the convolutional neural network model of the corresponding language and predicting to obtain a prediction result, the method further comprises:
judging whether the number of random shuffles has reached a preset number, and if so, cross-verifying the corresponding prediction results round by round to obtain an integrated prediction label over the preset number of shuffles;
wherein the final judgment result is obtained by judging according to the integrated prediction label of the preset number of shuffles.
3. The multilingual sentence-pair text corpus classification method according to claim 2, characterized in that the preset number is 5.
4. The multilingual sentence-pair text corpus classification method according to claim 1, characterized in that the languages include a Chinese language and an English language, and the convolutional neural network models include a Chinese convolutional neural network model and an English convolutional neural network model.
5. The multilingual sentence-pair text corpus classification method according to claim 4, characterized in that after the step of inputting the data to be predicted and performing language separation, and before the step of randomly shuffling the words corresponding to each language, the method further comprises:
first performing Chinese word segmentation on the separated Chinese-language words and then performing Chinese stop-word filtering; meanwhile, first performing space-based English tokenization on the separated English-language words and then performing English stop-word filtering.
6. The multilingual sentence-pair text corpus classification method according to claim 1, characterized in that the convolutional neural network models are obtained by preprocessing samples and training with convolutional neural networks.
7. The multilingual sentence-pair text corpus classification method according to claim 6, characterized in that the training steps of the convolutional neural network models include:
performing a Chinese-English separation operation on the original text;
first performing Chinese word segmentation on the Chinese words obtained by the separation operation and then performing Chinese stop-word filtering; meanwhile, first performing space-based English tokenization on the separated English words and then performing English stop-word filtering;
selecting high-quality samples, based on a term dictionary, from the Chinese words and English words obtained by the filtering;
training on the high-quality samples with convolutional neural networks to obtain a Chinese convolutional neural network model and an English convolutional neural network model.
8. The multilingual sentence-pair text corpus classification method according to claim 7, characterized in that before the convolutional neural networks are trained on the high-quality samples, the high-quality samples need to be vectorized into vectors of identical length.
9. The multilingual sentence-pair text corpus classification method according to claim 7, characterized in that the formation of the term dictionary includes the steps of:
first performing term word segmentation on the original Chinese terms and then performing stop-word filtering;
generating the term dictionary from the original Chinese terms obtained by the filtering.
10. A multilingual sentence-pair text corpus classifier, characterized by comprising:
a language separation module, for performing language separation on input data to be predicted;
a random shuffling module, for randomly shuffling the words corresponding to each language;
a convolutional neural network model prediction module, for loading the randomly shuffled words of each language into the convolutional neural network model of the corresponding language, and predicting to obtain a prediction result;
a cross-verification module, for cross-verifying the prediction results corresponding to the multiple languages;
an output module, for outputting a final judgment result according to the cross-verification result.
CN201711276465.XA 2017-12-06 2017-12-06 Multilingual sentence-pair text corpus classification method and classifier Pending CN107894980A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711276465.XA CN107894980A (en) 2017-12-06 2017-12-06 Multilingual sentence-pair text corpus classification method and classifier

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711276465.XA CN107894980A (en) 2017-12-06 2017-12-06 Multilingual sentence-pair text corpus classification method and classifier

Publications (1)

Publication Number Publication Date
CN107894980A true CN107894980A (en) 2018-04-10

Family

ID=61806110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711276465.XA Pending CN107894980A (en) 2017-12-06 2017-12-06 A kind of multiple statement is to corpus of text sorting technique and grader

Country Status (1)

Country Link
CN (1) CN107894980A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109189932A (en) * 2018-09-06 2019-01-11 北京京东尚科信息技术有限公司 Text classification method and device, computer-readable storage medium
CN109885686A (en) * 2019-02-20 2019-06-14 延边大学 Multilingual text classification method fusing topic information and BiLSTM-CNN
CN110032714A (en) * 2019-02-25 2019-07-19 阿里巴巴集团控股有限公司 Corpus annotation feedback method and device
CN112084334A (en) * 2020-09-04 2020-12-15 中国平安财产保险股份有限公司 Corpus label classification method and device, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110098999A1 (en) * 2009-10-22 2011-04-28 National Research Council Of Canada Text categorization based on co-classification learning from multilingual corpora
CN103488623A (en) * 2013-09-04 2014-01-01 中国科学院计算技术研究所 Multilingual text data classification processing method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110098999A1 (en) * 2009-10-22 2011-04-28 National Research Council Of Canada Text categorization based on co-classification learning from multilingual corpora
CN103488623A (en) * 2013-09-04 2014-01-01 中国科学院计算技术研究所 Multilingual text data classification processing method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CSDN_CSDN__AI: "Solving Large-Scale Text Classification with Deep Learning (CNN, RNN, Attention): Survey and Practice", CSDN Blog *
Liu Li: "Research on Feature Description and Classifier Construction Methods in Chinese Text Classification", China Masters' and Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109189932A (en) * 2018-09-06 2019-01-11 北京京东尚科信息技术有限公司 Text classification method and device, computer-readable storage medium
CN109189932B (en) * 2018-09-06 2021-02-26 北京京东尚科信息技术有限公司 Text classification method and device and computer-readable storage medium
CN109885686A (en) * 2019-02-20 2019-06-14 延边大学 Multilingual text classification method fusing topic information and BiLSTM-CNN
CN110032714A (en) * 2019-02-25 2019-07-19 阿里巴巴集团控股有限公司 Corpus annotation feedback method and device
CN112084334A (en) * 2020-09-04 2020-12-15 中国平安财产保险股份有限公司 Corpus label classification method and device, computer equipment and storage medium
CN112084334B (en) * 2020-09-04 2023-11-21 中国平安财产保险股份有限公司 Label classification method and device for corpus, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN106202177B Text classification method and device
CN107894980A Multilingual sentence-pair text corpus classification method and classifier
CN102682124B Sentiment classification method and device for text
CN108376151A Question classification method, device, computer equipment and storage medium
CN107025284A Method for recognizing sentiment tendency of online comment text, and convolutional neural network model
CN106489149A Data annotation method and system based on data mining and crowdsourcing
CN109299271A Training sample generation, text data and public-opinion event classification methods, and related device
CN108536756A Emotion classification method and system based on bilingual information
CN108256104A Comprehensive classification method for Internet websites based on multidimensional features
CN111221939A Scoring method and device, and electronic equipment
CN110188047A Duplicate defect report detection method based on dual-channel convolutional neural networks
CN109120632A Network traffic anomaly detection method based on online feature selection
CN109271627A Text analysis method, apparatus, computer equipment and storage medium
CN109960727A Automatic detection method and system for personal privacy information in unstructured text
CN109902202A Video classification method and device
CN110472203A Duplicate-checking detection method, device, equipment and storage medium for articles
CN109284374A Method, apparatus, equipment and computer-readable storage medium for determining entity category
CN112199496A Power grid equipment defect text classification method based on multi-head attention mechanism and RCNN
CN108549723A Text concept classification method, device and server
CN109145108A Stacked classifier training method, classification method, device and computer equipment for text
CN114580418A Knowledge graph system for police physical training
CN112966708A Chinese crowdsourcing test report clustering method based on semantic similarity
Jabbar et al. Supervised learning approach for surface-mount device production
CN113297842A Text data augmentation method
CN112613321A Method and system for extracting entity attribute information in text

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180410

RJ01 Rejection of invention patent application after publication