CN101201820B

CN101201820B - Method and system for filtering bilingual corpora

Info

Publication number: CN101201820B
Application number: CN200710178309XA
Authority: CN
Inventors: 王刚; 高立琦; 刘挺; 王海洲
Original assignee: Harbin Institute of Technology; Beijing Kingsoft Software Co Ltd; Beijing Jinshan Digital Entertainment Technology Co Ltd
Current assignee: Harbin Institute of Technology; Beijing Kingsoft Software Co Ltd; Beijing Jinshan Digital Entertainment Technology Co Ltd
Priority date: 2007-11-28
Filing date: 2007-11-28
Publication date: 2010-06-02
Anticipated expiration: 2027-11-28
Also published as: CN101201820A

Abstract

The invention discloses a method for filtering a bilingual corpus, comprising the following steps: A. determining the sentence-length ratio feature value of an English-Chinese bilingual sentence pair; B. counting the number of words of each of several parts of speech in the English-Chinese bilingual sentence pair, calculating the number of those words that match corresponding words in a bilingual translation dictionary, and determining the mutual-translation feature value from the part-of-speech counts and the match counts; C. performing filtering classification with the sentence-length ratio feature value and the mutual-translation feature value, according to a classification model built in advance from a training set. The invention also discloses a system for filtering a bilingual corpus. The method and system are used to improve the generality, accuracy and recall of corpus filtering.

Description

Method and system for filtering a bilingual corpus

Technical field

The present invention relates to a corpus filtering method, and in particular to a method and system for filtering a bilingual corpus.

Background art

Corpus resources are increasingly recognized for their great value to natural language processing research. This is particularly true of parallel bilingual corpora, a special kind of corpus that contains translation information between two languages. A parallel bilingual corpus provides rich matching information between the two languages and has important applications in acquiring translation knowledge, building bilingual dictionaries, statistical or example-based machine translation, word sense disambiguation and similar fields; the benefit of a high-quality corpus is especially pronounced.

There are two main ways to build a corpus: traditional manual collection, and automatic sentence alignment by computer applied to a corpus aligned at the document level. Neither method guarantees a high-quality corpus; errors such as mismatched sentence pairs and garbled characters always remain.

The most common way to eliminate erroneous sentence pairs is to check the corpus by manual proofreading. Although this is highly accurate, it is time-consuming and laborious, and it becomes impractical when the corpus is very large.

Erroneous sentence pairs can also be eliminated from the corpus automatically by computer. The basic idea is to define some features for judging the matching quality of a sentence pair, score each feature, and then manually set a feature threshold based on experience. A bilingual sentence pair scoring above the threshold is judged a good sentence pair, and one scoring at or below the threshold is judged a bad sentence pair. Although this achieves a degree of automation, it lacks generality and its accuracy is not high. The threshold is set empirically, often from only a few corpus resources, and cannot cover the distribution of most corpora. Moreover, an empirically set threshold that is too low lowers accuracy, while one that is too high lowers recall.

Summary of the invention

The purpose of the present invention is to provide a method and system for filtering an English-Chinese bilingual corpus, so as to improve the generality, accuracy and recall of corpus filtering.

To solve the above problems, the invention provides a bilingual corpus filtering method comprising the following steps:

A. determining the sentence-length ratio feature value of a bilingual sentence pair;

B. counting, for each sentence of the bilingual sentence pair, the number of words of each of several parts of speech; calculating, for the words of those parts of speech, the number that match corresponding words in the bilingual translation dictionary; and determining the mutual-translation feature value from the part-of-speech counts and the match counts;

C. performing filtering classification with the sentence-length ratio feature value and the mutual-translation feature value, according to a classification model built in advance from a training set.

Preferably, building the classification model in advance from a training set specifically comprises:

C1. constructing the training set;

C2. calculating the sentence-length ratio feature value and the mutual-translation feature value according to steps A and B, respectively, and training with a classifier;

C3. determining the classification model.

Preferably, the training set is composed of good and bad sentence pairs drawn from the bilingual corpus in a certain proportion, and each pair is annotated with a class value: 1 for a good sentence pair and -1 for a bad sentence pair.

Preferably, before step A the method further comprises: determining a numeral matching feature value.

Determining the numeral matching feature value specifically comprises: converting the numerals in each sentence of the bilingual sentence pair into a uniform numeric form; when the converted numbers of the two sentences match, the numeral matching feature value is 1; when they do not match, the numeral matching feature value is 0.

Preferably, before step A the method further comprises: preprocessing that unifies the encoding of the bilingual sentence pair.

Preferably, the bilingual sentence pair is specifically an English-Chinese bilingual sentence pair, and the preprocessing that unifies its encoding specifically comprises:

11) converting full-width characters in the English-Chinese sentence pair to half-width;

12) converting traditional-Chinese encoding to the simplified-Chinese national standard (GB) encoding;

13) removing garbled characters.

Preferably, the bilingual sentence pair is specifically an English-Chinese bilingual sentence pair, and step A specifically comprises: deciding whether to measure the English-Chinese sentence pair in words or in characters, and dividing the number of words or characters in the Chinese sentence by the number of words or characters in the English sentence to obtain the sentence-length ratio feature value.

Preferably, the bilingual sentence pair is specifically an English-Chinese bilingual sentence pair, and counting the words of different parts of speech in the pair specifically means counting the nouns, verbs, adjectives and prepositions in the pair.

The present invention also provides a bilingual corpus filtering system, comprising a sentence-length ratio calculation unit, a mutual-translation calculation unit, a classification-model training unit and a classification unit.

The sentence-length ratio calculation unit is used to determine the sentence-length ratio feature value of a bilingual sentence pair.

The mutual-translation calculation unit is used to count, for each sentence of the bilingual sentence pair, the number of words of each of several parts of speech, to calculate, for the words of those parts of speech, the number that match corresponding words in the bilingual translation dictionary, and to determine the mutual-translation feature value from the part-of-speech counts and the match counts.

The classification unit is connected to the sentence-length ratio calculation unit and the mutual-translation calculation unit, and is used to perform filtering classification with the sentence-length ratio feature value and the mutual-translation feature value, according to a classification model built in advance from a training set.

Preferably, the classification-model training unit forms the training set from good and bad sentence pairs drawn from the bilingual corpus in a certain proportion, and annotates each pair with a class value: 1 for a good sentence pair and -1 for a bad sentence pair.

Preferably, the system further comprises a numeral matching unit, used to convert the numerals in each sentence of the bilingual sentence pair into a uniform numeric form; when the converted numbers of the two sentences match, the numeral matching feature value is 1; when they do not match, the numeral matching feature value is 0.

Compared with the prior art described above, the bilingual corpus filtering method of the embodiments of the invention determines the sentence-length ratio feature value and the mutual-translation feature value of a bilingual sentence pair, and then performs filtering classification with those feature values according to a classification model trained in advance. The method can therefore process a very large bilingual corpus quickly and conveniently. By training a classification model, the invention converts the corpus filtering problem into a binary classification problem, so that the weights of the matching features are determined in a more principled way; compared with the existing empirical method it is more general, and accuracy and recall improve accordingly.

Brief description of the drawings

Fig. 1 is a flow chart of a first embodiment of the bilingual corpus filtering method of the present invention;

Fig. 2 is a flow chart of building the classification model in Fig. 1;

Fig. 3 is a flow chart of a second embodiment of the bilingual corpus filtering method of the present invention;

Fig. 4 is a flow chart of building the classification model in Fig. 3;

Fig. 5 is a flow chart of a third embodiment of the bilingual corpus filtering method of the present invention;

Fig. 6 is a flow chart of the preprocessing in Fig. 5 that unifies the encoding of the bilingual sentence pair;

Fig. 7 is a structural diagram of a first embodiment of the bilingual corpus filtering system of the present invention;

Fig. 8 is a structural diagram of a second embodiment of the bilingual corpus filtering system of the present invention;

Fig. 9 is a structural diagram of a third embodiment of the bilingual corpus filtering system of the present invention.

Detailed description of the embodiments

The invention provides a method for filtering a bilingual corpus, used to improve the generality, accuracy and recall of corpus filtering.

Referring to Fig. 1 and Fig. 2, Fig. 1 is a flow chart of a first embodiment of the bilingual corpus filtering method of the present invention, and Fig. 2 is a flow chart of building the classification model in Fig. 1.

The bilingual corpus filtering method of the first embodiment of the present invention comprises the following steps:

S100. Determine the sentence-length ratio feature value of the bilingual sentence pair.

First decide whether to measure the bilingual sentence pair in words or in characters. The number of words or characters in the sentence of one language is divided by the number of words or characters in the sentence of the other language; the result is the sentence-length ratio feature value.

When the bilingual sentence pair is an English-Chinese pair, the number of words or characters in the Chinese sentence is divided by the number of words or characters in the English sentence to obtain the sentence-length ratio feature value. Computing sentence length in words or in characters gives similar results; word counts are generally chosen because they better reflect the sentence-length ratio of an English-Chinese pair.
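
Step S100 can be illustrated with a minimal Python sketch. It assumes that both sentences have already been split into words by some upstream tokenizer or word segmenter; the function name `length_ratio` and the sample pair are illustrative and do not come from the patent.

```python
def length_ratio(chinese_tokens, english_tokens):
    """Sentence-length ratio: Chinese word count divided by English word count."""
    if not english_tokens:
        # The text does not say how an empty English sentence is handled;
        # raising here is an assumption of this sketch.
        raise ValueError("English sentence has no tokens")
    return len(chinese_tokens) / len(english_tokens)


# Hypothetical, already segmented sentence pair.
zh = ["我", "喜欢", "读", "书"]
en = ["I", "like", "reading", "books"]
ratio = length_ratio(zh, en)
```

Passing lists of characters instead of lists of words gives the character-based variant that the text also allows.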

S200. Count, for each sentence of the bilingual sentence pair, the number of words of each of several parts of speech; calculate, for the words of those parts of speech, the number that match corresponding words in the bilingual translation dictionary; and determine the mutual-translation feature value from the part-of-speech counts and the match counts.

Counting the words of different parts of speech in the bilingual sentence pair specifically means counting the nouns, verbs, adjectives and prepositions in the pair.

First, part-of-speech tagging is performed on each sentence of the bilingual sentence pair. Then the number of words tagged as noun, verb, adjective or preposition is counted in each sentence. These four parts of speech are chosen with dictionary translation in mind, because the translations of words with these parts of speech are generally more discriminative.

For the words of the above parts of speech (noun, verb, adjective, preposition) in the Chinese sentence of an English-Chinese pair, translate them with a Chinese-English dictionary and look the translations up among the words of those parts of speech in the English sentence. If a translation is found, the word is matched, and the matches are counted. Conversely, for the words of those parts of speech in the English sentence, translate them with an English-Chinese dictionary and look the translations up among the words of those parts of speech in the Chinese sentence. If a translation is found, the word is matched, and the matches are counted.

Taking an English-Chinese bilingual sentence pair as an example, the mutual-translation feature value of the pair is calculated with the following formula:

V(c, e) = (T(c, e) / I(c)) * (T(e, c) / I(e))

where V(c, e) is the mutual-translation feature value of the English-Chinese bilingual sentence pair;

T(c, e) is the number of words of the above four parts of speech in the Chinese sentence for which a match is found in the English sentence using the Chinese-English dictionary;

T(e, c) is the number of words of the above four parts of speech in the English sentence for which a match is found in the Chinese sentence using the English-Chinese dictionary;

I(c) is the number of words of the above four parts of speech in the Chinese sentence of the pair;

I(e) is the number of words of the above four parts of speech in the English sentence of the pair.

Likewise, when the bilingual sentence pair is in two other languages, the above formula can also be applied.
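
The calculation of V(c, e) can be sketched in Python as follows. The sketch assumes that part-of-speech tagging has already produced (word, tag) pairs and that each dictionary maps a word to a list of candidate translations. The tag names in `KEPT_POS`, the toy dictionaries, and the choice to return 0.0 when I(c) or I(e) is zero are all assumptions of this sketch; the patent does not specify them.

```python
# Hypothetical tag names for noun, verb, adjective and preposition.
KEPT_POS = {"n", "v", "adj", "prep"}


def content_words(tagged):
    """Keep only the words whose tag is one of the four parts of speech."""
    return [word for word, pos in tagged if pos in KEPT_POS]


def match_count(src_words, tgt_words, dictionary):
    """Count source words with at least one dictionary translation in the target."""
    targets = set(tgt_words)
    return sum(1 for w in src_words if targets & set(dictionary.get(w, ())))


def mutual_translation(zh_tagged, en_tagged, zh_en, en_zh):
    """V(c, e) = (T(c, e) / I(c)) * (T(e, c) / I(e))."""
    zh_words = content_words(zh_tagged)
    en_words = content_words(en_tagged)
    if not zh_words or not en_words:
        return 0.0  # assumption: the source does not cover I(c) = 0 or I(e) = 0
    t_ce = match_count(zh_words, en_words, zh_en)
    t_ec = match_count(en_words, zh_words, en_zh)
    return (t_ce / len(zh_words)) * (t_ec / len(en_words))


# Toy data, for illustration only.
zh_tagged = [("我", "pron"), ("喜欢", "v"), ("书", "n")]
en_tagged = [("I", "pron"), ("like", "v"), ("books", "n")]
zh_en = {"喜欢": ["like", "love"], "书": ["book"]}
en_zh = {"like": ["喜欢"], "books": ["书"]}
v = mutual_translation(zh_tagged, en_tagged, zh_en, en_zh)
```

In this toy pair I(c) = I(e) = 2, T(c, e) = 1 (the dictionary lists "book", not the inflected "books") and T(e, c) = 2, so V(c, e) = 0.5. The missed match shows why a practical lookup would probably need lemmatization, which this sketch leaves out.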

S300. According to the classification model built in advance from a training set, perform filtering classification with the sentence-length ratio feature value and the mutual-translation feature value.

Building the classification model from a training set specifically comprises:

S301. Construct the training set.

The training set is composed of good and bad sentence pairs drawn from the bilingual corpus in a certain proportion, and each bilingual sentence pair is annotated with a class value: 1 for a good sentence pair and -1 for a bad sentence pair.

The training set may be formed by selecting bilingual sentence pairs from the bilingual corpus with good and bad pairs in a 1:1 ratio.

The training set should contain at least 50,000 pairs; the larger the training set, the better for training the classification model. The sources of the corpus material should be as broad as possible, since a wider distribution of material makes the trained classification model more general.

S302. Calculate the sentence-length ratio feature value and the mutual-translation feature value according to steps S100 and S200, respectively, and train with a classifier.

Annotation format of the training-set features: "class value + space + feature code:feature value + space + feature code:feature value ..."

One space is kept between the class value and the first feature code, and one space between each feature value and the next feature code. For example, the sentence-length ratio feature value may be set to 2 and the mutual-translation feature value to 3.
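
The annotation format can be illustrated with a short Python sketch that builds one training line. The helper name `format_training_line` is hypothetical, and using 2 and 3 as the feature codes of the sentence-length ratio and mutual-translation features is one possible reading of the example numbers in the text, not something the patent states outright.

```python
def format_training_line(class_value, features):
    """Build 'class value + space + feature code:feature value ...' for one pair.

    `features` maps an integer feature code to its feature value.
    """
    parts = [str(class_value)]
    for code in sorted(features):
        parts.append(f"{code}:{features[code]}")
    return " ".join(parts)


# Assumed feature codes: 2 = sentence-length ratio, 3 = mutual translation.
good_line = format_training_line(1, {2: 0.95, 3: 0.5})
bad_line = format_training_line(-1, {2: 1.4, 3: 0.0})
```

Here `good_line` is `1 2:0.95 3:0.5`. The layout resembles the sparse "label index:value" input format used by common SVM toolkits, although the patent does not name a specific tool.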

Training a classifier is a known technique; a general-purpose classifier such as an SVM (support vector machine) or a maximum-entropy classifier may be used.

S303. Determine the classification model.

After the classification model is built, bilingual sentence pairs labeled with class value "-1" are placed in the filter store for later processing. Bilingual sentence pairs labeled with class value "1" are retained in the bilingual corpus.
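
The routing of classified pairs can be sketched as follows. The sketch assumes, consistently with the class values defined for the training set, that pairs classified as 1 (good) stay in the corpus and pairs classified as -1 (bad) go to the filter store. The `predict` callable stands in for whatever trained classifier is used, and the threshold rule in the example is only a toy stand-in for such a classifier.

```python
def split_by_class(pairs, predict):
    """Return (kept, filtered): pairs predicted 1 are kept, all others filtered."""
    kept, filtered = [], []
    for pair in pairs:
        (kept if predict(pair) == 1 else filtered).append(pair)
    return kept, filtered


# Toy feature vectors: (sentence-length ratio, mutual-translation value).
pairs = [(1.0, 0.5), (3.2, 0.0), (0.9, 0.8)]


def toy_predict(pair):
    """Stand-in for a trained classifier; not the model described in the text."""
    return 1 if pair[1] > 0.2 else -1


kept, filtered = split_by_class(pairs, toy_predict)
```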

The bilingual corpus filtering method of this embodiment determines the sentence-length ratio feature value and the mutual-translation feature value of a bilingual sentence pair, and then performs filtering classification with those feature values according to a classification model built in advance from a training set. The method can therefore process a very large bilingual corpus quickly and conveniently. By classifying with the classification model, the invention converts the filtering problem of an English-Chinese bilingual corpus into a binary classification problem, so that the weights of the matching features are determined in a more principled way; compared with the existing empirical method it is more general, and accuracy and recall improve accordingly.

Referring to Fig. 3 and Fig. 4, Fig. 3 is a flow chart of a second embodiment of the bilingual corpus filtering method of the present invention, and Fig. 4 is a flow chart of building the classification model in Fig. 3.

Compared with the first embodiment, the second embodiment of the method adds a step of determining a numeral matching feature value.

The bilingual corpus filtering method of the second embodiment of the present invention comprises the following steps:

S10. Determine the numeral matching feature value.

The numerals in each sentence of the bilingual sentence pair are converted into a uniform numeric form. When the converted numbers of the two sentences match, the numeral matching feature value is 1. When they do not match, the numeral matching feature value is 0.

The process of determining the numeral matching feature value is described below, taking an English-Chinese bilingual sentence pair as an example.

First, part-of-speech tagging is performed on the Chinese sentence and the English sentence of the pair; the tagging method is a known technique and is not described in detail here.

Then the numerals tagged m (numeral) in the Chinese sentence, and the numerals tagged od (ordinal numeral) and cd (cardinal numeral) in the English sentence, are normalized.

For example, if the English sentence of the pair contains "$5 million" and the Chinese sentence contains "5,000,000", both are uniformly converted to 5000000.

The normalization is rule-based, that is, a set of conversion rules is formulated.

The conversion rules include rules for converting Chinese numerals to digits, for example: the Chinese numeral for "one" corresponds to "1", the Chinese numeral for "hundred" corresponds to "100", and so on.

The conversion rules include rules for converting English numerals to digits, for example: "one" corresponds to "1", "hundred" corresponds to "100", and so on.

The normalized numbers in the Chinese sentence and in the English sentence of the pair are then compared. If they match, the numeral matching feature value is 1. If they do not match, the numeral matching feature value is 0.
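
The rule-based normalization and comparison can be sketched in Python as follows. The rule tables are deliberately small and the function names are hypothetical. The sketch assumes that the numeral expressions have already been extracted from each sentence by part-of-speech tagging, that Chinese numerals appear either in Chinese characters or as digits followed by a unit such as 万, and that "match" means the two sentences yield the same collection of normalized numbers; the patent leaves these details open.

```python
import re

EN_UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
            "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
EN_SCALES = {"hundred": 100, "thousand": 1000, "million": 1000000}
ZH_DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
             "六": 6, "七": 7, "八": 8, "九": 9}
ZH_SMALL = {"十": 10, "百": 100, "千": 1000}
ZH_BIG = {"万": 10000, "亿": 100000000}


def normalize_english(text):
    """Convert an English numeral expression such as '$5 million' to an int."""
    total, current = 0, 0
    for tok in re.findall(r"[0-9][0-9,]*(?:\.[0-9]+)?|[a-z]+", text.lower()):
        if tok[0].isdigit():
            current += float(tok.replace(",", ""))
        elif tok in EN_UNITS:
            current += EN_UNITS[tok]
        elif tok == "hundred":
            current = (current or 1) * 100
        elif tok in EN_SCALES:
            total += (current or 1) * EN_SCALES[tok]
            current = 0
    return int(total + current)


def normalize_chinese(text):
    """Convert a Chinese numeral expression such as '五百万' to an int."""
    total, section, digit = 0, 0, 0
    for ch in text:
        if ch in ZH_DIGITS:
            digit = ZH_DIGITS[ch]
        elif ch.isdigit():
            digit = digit * 10 + int(ch)
        elif ch in ZH_SMALL:
            section += (digit or 1) * ZH_SMALL[ch]
            digit = 0
        elif ch in ZH_BIG:
            total += (section + digit) * ZH_BIG[ch]
            section = digit = 0
    return total + section + digit


def number_match_feature(zh_numerals, en_numerals):
    """Return 1 if both sides normalize to the same numbers, otherwise 0."""
    zh = sorted(normalize_chinese(n) for n in zh_numerals)
    en = sorted(normalize_english(n) for n in en_numerals)
    return 1 if zh == en else 0


feature = number_match_feature(["五百万"], ["$5 million"])
```

With these rules both "五百万" and "$5 million" normalize to 5000000, so `feature` is 1. A pair with no numerals on either side also yields 1 under this sketch, which is a design choice rather than something the text prescribes.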

Likewise, the process of determining the sentence-length ratio feature value is described taking an English-Chinese bilingual sentence pair as an example.

Decide whether to measure the English-Chinese sentence pair in words or in characters, and divide the number of words or characters in the Chinese sentence by the number of words or characters in the English sentence to obtain the sentence-length ratio feature value.

Computing sentence length in words or in characters gives similar results; word counts are generally chosen because they better reflect the sentence-length ratio of an English-Chinese pair.

Likewise, the process of determining the mutual-translation feature value is described taking an English-Chinese bilingual sentence pair as an example.

Counting the words of different parts of speech in the English-Chinese sentence pair specifically means counting the nouns, verbs, adjectives and prepositions in the pair.

First, part-of-speech tagging is performed on each sentence of the English-Chinese pair. Then the number of words tagged as noun, verb, adjective or preposition is counted in each sentence.

The mutual-translation feature value of the English-Chinese pair is calculated with the following formula:

V(c, e) = (T(c, e) / I(c)) * (T(e, c) / I(e))

S300A. According to the classification model trained in advance, perform filtering classification with the sentence-length ratio feature value, the mutual-translation feature value and the numeral matching feature value.

Likewise, the process of building the classification model for the second embodiment of the filtering method is described taking an English-Chinese bilingual sentence pair as an example.

Building the classification model specifically comprises:

S301A. Construct the training set.

The training set is composed of good and bad sentence pairs drawn from the English-Chinese bilingual corpus in a certain proportion, and each English-Chinese sentence pair is annotated with a class value: 1 for a good sentence pair and -1 for a bad sentence pair.

S302A. Calculate the numeral matching feature value, the sentence-length ratio feature value and the mutual-translation feature value according to steps S10, S100 and S200, respectively, and train with a classifier.

Annotation format of the training-set features: class value + space + feature code:feature value + space + feature code:feature value + space + feature code:feature value.

One space is kept between the class value and the first feature code, and one space between each feature value and the next feature code. For example, the numeral matching feature value may be set to 1, the sentence-length ratio feature value to 2, and the mutual-translation feature value to 3.

S303A. Determine the classification model.

After the classification model is built, English-Chinese sentence pairs labeled with class value "-1" are placed in the filter store for later processing. English-Chinese sentence pairs labeled with class value "1" are retained in the English-Chinese bilingual corpus.

By adding the step of determining the numeral matching feature value, the second embodiment of the method greatly improves filtering accuracy for bilingual sentence pairs that contain numerical information.

Referring to Fig. 5 and Fig. 6, Fig. 5 is a flow chart of a third embodiment of the bilingual corpus filtering method of the present invention, and Fig. 6 is a flow chart of the preprocessing in Fig. 5 that unifies the encoding of the bilingual sentence pair.

Compared with the first embodiment, the third embodiment of the method adds a preprocessing step that unifies the encoding of the bilingual sentence pair.

Likewise, the English-Chinese bilingual corpus filtering method of the third embodiment is described taking an English-Chinese bilingual sentence pair as an example.

The English-Chinese bilingual corpus filtering method of the third embodiment of the present invention comprises the following steps:

S1. Preprocessing that unifies the encoding of the English-Chinese bilingual sentence pair.

The preprocessing that unifies the encoding of the English-Chinese sentence pair specifically comprises:

S1a. Convert full-width characters in the English-Chinese sentence pair to half-width.

S1b. Convert Big5 code (traditional-Chinese encoding) to GB code (simplified-Chinese national standard encoding).

S1c. Remove garbled characters.

For the Chinese part of the English-Chinese sentence pair, garbled characters are detected against the GB code range, and anything outside that range is removed.

For the English part of the English-Chinese sentence pair, garbled characters are detected against the ASCII code range, and anything outside that range is removed.

Special symbols are handled as follows:

When the beginning of a sentence in some English-Chinese pairs contains a label such as "1, (1), (I), (i), 1), one," and the like, the label at the beginning of the sentence is deleted and the rest is kept.

When a sentence in some English-Chinese pairs contains special punctuation such as "======", "... ..." or "------", the symbols are deleted and the rest is kept.

The preprocessing that unifies the encoding of the English-Chinese sentence pair may comprise all three of steps S1a, S1b and S1c, or only one or two of them.
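
Parts of steps S1a and S1c and the special-symbol handling can be sketched in Python as follows. The sketch assumes the text has already been decoded to Unicode strings, uses Python's built-in `gb2312` codec as a stand-in for "the GB code range", and omits step S1b because traditional-to-simplified conversion needs a character mapping table that the patent does not give. The function names and regular expressions are illustrative only.

```python
import re

# Leading labels such as "(1)", "(I)", "(i)", "1)" or "1,"; the lookahead
# avoids stripping the start of a number like "1,000".
LEADING_LABEL = re.compile(
    r"^\s*(?:\(\s*(?:\d+|[ivxIVX]+)\s*\)|\d+\s*[,.)、](?!\d))\s*")
# Runs of special punctuation such as "======", "------" or "......".
SPECIAL_RUNS = re.compile(r"={2,}|-{2,}|\.{3,}|…+")


def to_half_width(text):
    """S1a: map full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)


def strip_leading_label(text):
    """Delete a numbering label at the start of a sentence, keep the rest."""
    return LEADING_LABEL.sub("", text, count=1)


def remove_special_runs(text):
    """Delete runs of special punctuation, keep the rest."""
    return SPECIAL_RUNS.sub("", text)


def is_clean_english(text):
    """S1c, English side: every character must be in the ASCII range."""
    return all(ord(ch) < 128 for ch in text)


def is_clean_chinese(text):
    """S1c, Chinese side: every character must be encodable as GB2312."""
    try:
        text.encode("gb2312")
    except UnicodeEncodeError:
        return False
    return True
```

For instance, `to_half_width("ＡＢＣ，１２３")` gives `ABC,123` and `strip_leading_label("(1) Hello")` gives `Hello`. Whether a sentence that fails the range check is dropped whole or only has the offending characters removed is left to the caller, since the text can be read either way.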

S100. Determine the sentence-length ratio feature value of the English-Chinese bilingual sentence pair.

S200. Count, for each sentence of the English-Chinese pair, the number of words of each of several parts of speech; calculate, for the words of those parts of speech, the number that match corresponding words in the Chinese-English dictionary or the English-Chinese dictionary; and determine the mutual-translation feature value from the part-of-speech counts and the match counts.

V(c, e) = (T(c, e) / I(c)) * (T(e, c) / I(e))

S300. According to the classification model trained in advance, perform filtering classification with the sentence-length ratio feature value and the mutual-translation feature value.

S301. Construct the training set.

Training a classifier is a known technique; a general-purpose classifier such as an SVM or a maximum-entropy classifier may be used.

S303. Determine the classification model.

By adding the preprocessing step that unifies the encoding of the bilingual sentence pair, the third embodiment of the method can further improve the accuracy of classification filtering.

The bilingual corpus filtering method of the invention may also add the preprocessing step that unifies the encoding of the bilingual sentence pair before step S10 of the second embodiment (determining the numeral matching feature value). This likewise improves the accuracy of classification filtering.

The present invention also provides a kind of filtering system of bilingualism corpora, is used to improve corpus versatility, accuracy rate and recall rate.

Referring to Fig. 7, this figure is filtering bilingualism corpora first kind of example structure figure of system of the present invention.

First kind of described filtering bilingualism corpora of embodiment of the present invention system comprises the long ratio computing unit 12 of sentence, the property translated computing unit 13, train classification models unit 14 and taxon 11 mutually.

The long ratio computing unit 12 of described sentence is used for the long ratio eigenwert of sentence of determining that bilingual sentence is right.

Described mutual translation computing unit 13, be used for adding up respectively the quantity of the different parts of speech of bilingual sentence centering, the quantity of corresponding speech coupling is determined the eigenwert of translation property mutually according to the quantity of different parts of speech and the quantity of described coupling in the speech that calculates described part of speech respectively and the described bilingual intertranslation dictionary.

Described disaggregated model unit 14, the disaggregated model that is used to set up.

Described disaggregated model unit 14 marks each right classification value according to a certain proportion of quality sentence in the bilingualism corpora simultaneously to forming training set, configures sentence to being 1, and bad sentence is to being-1.

Described long ratio computing unit 12 of sentence and described mutual translation computing unit 13 calculate long ratio eigenwert of the described training poem made up of lines from various poets and the eigenwert of translation property mutually respectively, utilize sorter to train.At last, the bilingual sentence that the classification value is labeled as " 1 " filters the storehouse to putting into, and waits until with aftertreatment.The classification value is labeled as the bilingual sentence of " 1 " to being retained in the bilingualism corpora, sets up disaggregated model.

The classification unit 11 is connected to the sentence-length ratio computing unit 12, the mutual-translation computing unit 13, and the classification model training unit 14, and is used to perform filtering classification on the sentence-length ratio feature value and the mutual-translation feature value according to the classification model established in advance from the training set.

In the bilingual corpus filtering system of this embodiment, the sentence-length ratio computing unit 12 determines the sentence-length ratio feature value of a bilingual sentence pair, the mutual-translation computing unit 13 determines its mutual-translation feature value, and the classification unit 11 performs filtering classification on these two feature values according to the classification model of the classification model training unit 14. The system can therefore process a bilingual corpus containing a very large amount of data quickly and conveniently. By means of the classification model training unit 14, the present invention converts the filtering of a bilingual corpus into a binary classification problem, so that the weights of the bilingual matching features are determined in a more principled way. Compared with existing experience-based methods, this approach is more general, and accuracy and recall improve accordingly.

Referring to Fig. 8, this figure is a structural diagram of the second embodiment of the bilingual corpus filtering system of the present invention.

Compared with the first embodiment, the second embodiment of the bilingual corpus filtering system of the present invention adds a number matching unit 15 connected to the classification unit.

The number matching unit 15 is used to convert the numbers in each sentence of the bilingual sentence pair to a unified numeric form. When the converted numbers of the two sentences match, the number matching feature value is determined to be 1; when they do not match, the number matching feature value is determined to be 0.
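
A minimal sketch of the number matching feature follows. It handles only Arabic digit strings, single-digit English number words, and single Chinese numeral characters. A realistic converter would also need compound numerals (for example 十, 百, "twenty") and would have to avoid treating 一 inside ordinary words as a number. Treating a pair with no numbers on either side as a match is an assumption. All names are hypothetical.

```python
import re

EN_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
            "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}
ZH_DIGITS = {"〇": "0", "零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
             "五": "5", "六": "6", "七": "7", "八": "8", "九": "9"}


def english_numbers(sentence):
    """Collect numbers in the English sentence as sorted digit strings."""
    tokens = re.findall(r"[a-z]+|\d+", sentence.lower())
    return sorted(EN_WORDS.get(t, t) for t in tokens
                  if t.isdigit() or t in EN_WORDS)


def chinese_numbers(sentence):
    """Replace Chinese numeral characters by digits, then collect digit runs."""
    unified = "".join(ZH_DIGITS.get(ch, ch) for ch in sentence)
    return sorted(re.findall(r"\d+", unified))


def number_match_feature(english, chinese):
    """1 if the unified numbers of both sentences are identical, otherwise 0."""
    return 1 if english_numbers(english) == chinese_numbers(chinese) else 0


feature = number_match_feature("He bought 3 books in 2007", "他在2007年买了三本书")
```

Comparing sorted lists rather than sets keeps repeated numbers significant, so a sentence that mentions "3" twice does not match one that mentions it once.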

The classification unit 11 performs filtering classification on the number matching feature value, the sentence-length ratio feature value, and the mutual-translation feature value according to the classification model established in advance by the classification model training unit 14.

By adding the number matching unit 15, the second embodiment of the system of the present invention greatly improves filtering accuracy when the system processes bilingual sentence pairs that contain numerical information.

Referring to Fig. 9, this figure is a structural diagram of the third embodiment of the bilingual corpus filtering system of the present invention.

Compared with the first embodiment, the third embodiment of the bilingual corpus filtering system of the present invention adds a preprocessing unit 16 connected to the classification unit.

The preprocessing unit 16 is used to preprocess the bilingual sentence pairs so that their encoding types are unified.

The preprocessing unit 16 comprises a full-width to half-width conversion subunit 16a and a garbled-character processing subunit 16c, both of which are connected to the classification unit 11.

The full-width to half-width conversion subunit 16a is used to convert the full-width characters of the bilingual sentence pair to half-width characters.
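
The conversion performed by subunit 16a can be illustrated with the standard Unicode relationship between the two forms: the full-width characters U+FF01 to U+FF5E sit exactly 0xFEE0 above their ASCII counterparts, and the ideographic space U+3000 corresponds to the ordinary space. The sketch assumes the text has already been decoded to Unicode; the function name `to_half_width` is hypothetical.

```python
def to_half_width(text: str) -> str:
    """Convert full-width ASCII variants and the ideographic space to
    their half-width forms; all other characters are left unchanged."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:               # ideographic (full-width) space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:   # full-width '!' through '~'
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)


converted = to_half_width("ＡＢＣ１２３，　ok")
```

Chinese characters and Chinese-only punctuation such as "。" fall outside the converted range and pass through untouched.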

The garbled-character processing subunit 16c is used to eliminate garbled characters.

The garbled-character processing subunit 16c handles special symbols as follows:

When a sentence of a bilingual sentence pair begins with a label, such as "1, (1), (I), (i), 1), one", the garbled-character processing subunit 16c deletes the sentence-initial label and keeps the rest of the sentence.

When a sentence of a bilingual sentence pair contains special punctuation, such as "======", "... ...", or "------", the garbled-character processing subunit 16c deletes the symbols and keeps the rest of the sentence.
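
The two special-symbol rules above can be sketched with regular expressions. The patterns are assumptions chosen to cover the examples in the text (parenthesized numbers or roman numerals, a number followed by ")" or "、", a Chinese numeral followed by "、", and runs of "=", "-", "." or "…"). A production rule set would need tuning against real data. All names are hypothetical.

```python
import re

# Sentence-initial labels such as "(1)", "(I)", "(i)", "1)", "1、", "一、".
LEADING_LABEL = re.compile(
    r"^\s*(?:\(\s*(?:\d+|[ivxIVX]+)\s*\)"
    r"|\d+\s*[)、]"
    r"|[一二三四五六七八九十]+\s*、)\s*"
)
# Decorative runs such as "======", "------", "......", "……".
SYMBOL_RUN = re.compile(r"={2,}|-{2,}|\.{3,}|…+")


def clean_sentence(sentence: str) -> str:
    """Delete a leading label and decorative symbol runs; keep the rest."""
    sentence = LEADING_LABEL.sub("", sentence, count=1)
    sentence = SYMBOL_RUN.sub("", sentence)
    # Collapse the whitespace left behind by the deletions.
    return re.sub(r"\s+", " ", sentence).strip()


cleaned = clean_sentence("1) Hello ====== world")
```

The label pattern deliberately omits "1." and "1," so that sentences beginning with a decimal or a list of numbers are not damaged; whether those forms should also be stripped depends on the corpus.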

When the bilingual corpus filtering system of the present invention is an English-Chinese bilingual corpus filtering system, the garbled-character processing subunit 16c eliminates garbled characters in the Chinese part of an English-Chinese sentence pair by checking against the GB code range and rejecting whatever falls outside that range.

When the bilingual corpus filtering system of the present invention is an English-Chinese bilingual corpus filtering system, the garbled-character processing subunit 16c checks the English part of the bilingual sentence pair against the ASCII code range and rejects whatever falls outside that range.

When the bilingual corpus filtering system of the present invention is an English-Chinese bilingual corpus filtering system, the preprocessing unit 16 comprises a Big5-to-GB conversion subunit 16b, which is used to convert Big5 code to GB code.
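
A minimal sketch of the re-encoding step, using Python's built-in `big5` and `gbk` codecs. Two assumptions are worth stating: GBK is used as the target because it is a superset of GB2312 that also contains traditional characters, and re-encoding alone does not map traditional characters to simplified ones, which would need a separate character mapping table. The function name `big5_to_gb` is hypothetical.

```python
def big5_to_gb(raw: bytes, target: str = "gbk") -> bytes:
    """Re-encode Big5 bytes into a GB-family encoding (GBK by default).

    Raises UnicodeDecodeError if `raw` is not valid Big5, and
    UnicodeEncodeError if a character does not exist in `target`.
    """
    return raw.decode("big5").encode(target)


big5_bytes = "中文".encode("big5")
gb_bytes = big5_to_gb(big5_bytes)
```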

The preprocessing unit 16 may comprise all three of the full-width to half-width conversion subunit 16a, the Big5-to-GB conversion subunit 16b, and the garbled-character processing subunit 16c, or it may comprise only one or two of these subunits.

By adding the preprocessing unit 16, the third embodiment of the bilingual corpus filtering system of the present invention unifies the encoding types of the bilingual sentence pairs and further improves the accuracy of the classification-based filtering.

The bilingual corpus filtering system of the embodiments of the present invention may also, on the basis of the second embodiment, further add the preprocessing unit 16 connected to the classification unit 11.

In that case, the preprocessing unit 16 comprises the full-width to half-width conversion subunit 16a, the Big5-to-GB conversion subunit 16b, and the garbled-character processing subunit 16c, all of which are connected to the classification unit 11.

Here too, the preprocessing unit 16 may comprise all three of the subunits 16a, 16b, and 16c, or only one or two of them.

The above is only a preferred implementation of the present invention and does not limit the scope of protection of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the claims of the present invention.

Claims

1. A bilingual corpus filtering method, characterized in that it comprises the following steps:

2. The bilingual corpus filtering method according to claim 1, characterized in that establishing the classification model in advance from the training set specifically comprises:

C1, constructing the training set;

C3, determining the classification model.

3. The bilingual corpus filtering method according to claim 2, characterized in that the training set is formed from a certain proportion of good and bad sentence pairs in the bilingual corpus, and each pair is labeled with a class value: 1 for a good sentence pair and -1 for a bad sentence pair.

4. The bilingual corpus filtering method according to claim 1, characterized in that, before step A, the method further comprises: determining a number matching feature value;

5. The bilingual corpus filtering method according to claim 1, characterized in that, before step A, the method further comprises: preprocessing the bilingual sentence pairs to unify their encoding types.

6. The bilingual corpus filtering method according to claim 5, characterized in that the bilingual sentence pair is specifically an English-Chinese bilingual sentence pair, and the preprocessing to unify the encoding types of the bilingual sentence pairs specifically comprises:

13) eliminating garbled characters.

7. The bilingual corpus filtering method according to claim 1, characterized in that the bilingual sentence pair is specifically an English-Chinese bilingual sentence pair, and step A is specifically: determining whether word counts or character counts are used for the English-Chinese bilingual sentence pair, and dividing the word count or character count of the Chinese sentence by the word count or character count of the English sentence to obtain the sentence-length ratio feature value.

8. The bilingual corpus filtering method according to claim 1, characterized in that the bilingual sentence pair is specifically an English-Chinese bilingual sentence pair, and counting the number of words of each part of speech in the English-Chinese bilingual sentence pair is specifically counting the numbers of nouns, verbs, adjectives, and prepositions in the English-Chinese bilingual sentence pair.

9. A bilingual corpus filtering system, characterized in that it comprises a sentence-length ratio computing unit, a mutual-translation computing unit, a classification model training unit, and a classification unit;

10. The bilingual corpus filtering system according to claim 1, characterized in that the classification model training unit forms the training set from a certain proportion of good and bad sentence pairs in the bilingual corpus and labels each pair with a class value: 1 for a good sentence pair and -1 for a bad sentence pair.

11. The bilingual corpus filtering system according to claim 1, characterized in that the system further comprises a number matching unit, which is used to convert the numbers in each sentence of the bilingual sentence pair to a unified numeric form; when the converted numbers of the two sentences match, the number matching feature value is determined to be 1, and when they do not match, the number matching feature value is determined to be 0.