CN101201820B - Method and system for filtering bilingualism corpora - Google Patents


Publication number
CN101201820B
Authority
CN
China
Prior art keywords
sentence
bilingual
english
speech
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN200710178309XA
Other languages
Chinese (zh)
Other versions
CN101201820A (en)
Inventor
王刚
高立琦
刘挺
王海洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Beijing Kingsoft Software Co Ltd
Beijing Jinshan Digital Entertainment Technology Co Ltd
Original Assignee
Harbin Institute of Technology
Beijing Kingsoft Software Co Ltd
Beijing Jinshan Digital Entertainment Technology Co Ltd


Abstract

The invention discloses a method for filtering a bilingual corpus, comprising the following steps: A. determining the sentence-length ratio feature value of an English-Chinese bilingual sentence pair; B. counting the words of each selected part of speech in both sentences of the pair, calculating how many of those words match their counterparts through a bilingual translation dictionary, and determining the mutual-translation feature value from the part-of-speech counts and the match counts; C. filtering and classifying the pair with the sentence-length ratio feature value and the mutual-translation feature value, according to a classification model built in advance from a training set. The invention also discloses a system for filtering a bilingual corpus. The method and system are intended to improve the generality, accuracy and recall of corpus filtering.

Description

Method and system for filtering a bilingual corpus
Technical field
The present invention relates to corpus filtering, and in particular to a method and system for filtering a bilingual corpus.
Background art
Corpus resources are increasingly recognized for their great value to natural language processing research. This is particularly true of parallel bilingual corpora, a special kind of corpus that contains translation information between two languages. A parallel bilingual corpus provides rich matching information between the two languages and is of important practical value for acquiring translation knowledge, building bilingual dictionaries, statistical or example-based machine translation, word sense disambiguation and similar fields. High-quality corpora are especially valuable.
There are two main ways to build a corpus. One is traditional manual collection; the other is to obtain sentence pairs by running a computer-based automatic sentence alignment method over a corpus that is aligned at the document level. Neither method guarantees a high-quality corpus: errors such as mismatched sentence pairs and garbled characters always remain.
The most common way to eliminate erroneous sentence pairs is to check the corpus by manual proofreading. Although this method is very accurate, it is time-consuming and laborious, and it becomes impractical when the corpus is very large.
Erroneous sentence pairs can also be eliminated automatically by computer. The basic idea is to define some features that indicate the matching quality of a sentence pair, score each feature, and then judge the pair against a feature threshold that is set manually from experience. A bilingual sentence pair that scores above the threshold is judged to be a good pair, and one that scores at or below the threshold is judged to be a bad pair. Although this approach achieves some automation, it lacks generality and its accuracy is not high. The threshold is set empirically, often from only a few corpus resources, so it cannot cover the distribution of most corpora. Moreover, a threshold that is set too low reduces accuracy, while one that is set too high reduces recall.
Summary of the invention
The object of the present invention is to provide a method and system for filtering an English-Chinese bilingual corpus that improve the generality, accuracy and recall of corpus filtering.
To solve the above problems, the invention provides a method for filtering a bilingual corpus, comprising the following steps:
A. determining the sentence-length ratio feature value of a bilingual sentence pair;
B. counting the words of each selected part of speech in both sentences of the pair, calculating how many words of those parts of speech match their counterparts through a bilingual translation dictionary, and determining the mutual-translation feature value from the part-of-speech counts and the match counts;
C. filtering and classifying the pair with the sentence-length ratio feature value and the mutual-translation feature value, according to a classification model built in advance from a training set.
Preferably, building the classification model in advance from a training set specifically comprises:
C1. constructing the training set;
C2. calculating the sentence-length ratio feature value and the mutual-translation feature value according to steps A and B, and training a classifier with them;
C3. determining the classification model.
Preferably, the training set consists of good and bad sentence pairs drawn from the bilingual corpus in a certain proportion, and each pair is labeled with a classification value: 1 for a good sentence pair and -1 for a bad sentence pair.
Preferably, the method further comprises, before step A: determining a numeral-matching feature value.
Determining the numeral-matching feature value specifically comprises: converting the numerals in each sentence of the bilingual pair into a unified digit form; when the converted numerals of the two sentences match, the numeral-matching feature value is set to 1; when they do not match, the numeral-matching feature value is set to 0.
Preferably, the method further comprises, before step A: preprocessing that unifies the encoding of the bilingual sentence pair.
Preferably, the bilingual sentence pair is an English-Chinese bilingual sentence pair, and the preprocessing that unifies its encoding specifically comprises:
11) converting full-width characters in the English-Chinese sentence pair to half-width characters;
12) converting traditional Chinese encoding to the simplified Chinese national standard (GB) encoding;
13) eliminating garbled characters.
Preferably, the bilingual sentence pair is an English-Chinese bilingual sentence pair, and step A specifically comprises: deciding whether sentence length is measured in words or in characters, and dividing the word count or character count of the Chinese sentence by the word count or character count of the English sentence to obtain the sentence-length ratio feature value.
Preferably, the bilingual sentence pair is an English-Chinese bilingual sentence pair, and counting the words of each part of speech specifically means counting the nouns, verbs, adjectives and prepositions in the English-Chinese sentence pair.
The present invention also provides a system for filtering a bilingual corpus, comprising a sentence-length ratio computing unit, a mutual-translation computing unit, a classification-model training unit and a classification unit.
The sentence-length ratio computing unit is configured to determine the sentence-length ratio feature value of a bilingual sentence pair.
The mutual-translation computing unit is configured to count the words of each selected part of speech in both sentences of the pair, calculate how many words of those parts of speech match their counterparts through a bilingual translation dictionary, and determine the mutual-translation feature value from the part-of-speech counts and the match counts.
The classification unit is connected to the sentence-length ratio computing unit and the mutual-translation computing unit, and is configured to filter and classify the pair with the sentence-length ratio feature value and the mutual-translation feature value, according to a classification model built in advance from a training set.
Preferably, the classification-model training unit builds the training set from good and bad sentence pairs drawn from the bilingual corpus in a certain proportion and labels each pair with a classification value: 1 for a good sentence pair and -1 for a bad sentence pair.
Preferably, the system further comprises a numeral-matching unit configured to convert the numerals in each sentence of the bilingual pair into a unified digit form; when the converted numerals of the two sentences match, the numeral-matching feature value is set to 1; when they do not match, it is set to 0.
Compared with the prior art described above, the bilingual corpus filtering method of the embodiments of the invention determines the sentence-length ratio feature value and the mutual-translation feature value of a bilingual sentence pair, and then filters and classifies the pair with those feature values according to a classification model trained in advance. The method can therefore process very large bilingual corpora quickly and conveniently. By training a classification model, the invention converts the problem of filtering a bilingual corpus into a binary classification problem, so that the weights of the matching features are determined in a more principled way. Compared with the existing experience-based method, it is more general, and accuracy and recall improve accordingly.
Brief description of the drawings
Fig. 1 is a flowchart of a first embodiment of the bilingual corpus filtering method of the present invention;
Fig. 2 is a flowchart of building the classification model in Fig. 1;
Fig. 3 is a flowchart of a second embodiment of the bilingual corpus filtering method of the present invention;
Fig. 4 is a flowchart of building the classification model in Fig. 3;
Fig. 5 is a flowchart of a third embodiment of the bilingual corpus filtering method of the present invention;
Fig. 6 is a flowchart of the preprocessing in Fig. 5 that unifies the encoding of the bilingual sentence pair;
Fig. 7 is a structural diagram of a first embodiment of the bilingual corpus filtering system of the present invention;
Fig. 8 is a structural diagram of a second embodiment of the bilingual corpus filtering system of the present invention;
Fig. 9 is a structural diagram of a third embodiment of the bilingual corpus filtering system of the present invention.
Detailed description of the embodiments
The invention provides a method for filtering a bilingual corpus that improves the generality, accuracy and recall of corpus filtering.
Referring to Fig. 1 and Fig. 2, Fig. 1 is a flowchart of the first embodiment of the bilingual corpus filtering method of the present invention, and Fig. 2 is a flowchart of building the classification model in Fig. 1.
The bilingual corpus filtering method of the first embodiment of the invention comprises the following steps:
S100. Determine the sentence-length ratio feature value of the bilingual sentence pair.
Decide whether sentence length is measured in words or in characters. The word count or character count of the sentence in one language is divided by the word count or character count of the sentence in the other language, and the resulting value is the sentence-length ratio feature value.
When the bilingual sentence pair is an English-Chinese pair, the word count or character count of the Chinese sentence is divided by the word count or character count of the English sentence to obtain the sentence-length ratio feature value. Measuring sentence length in words or in characters gives similar results, but word counts are generally chosen because they better reflect the length-ratio characteristics of English-Chinese sentence pairs.
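The ratio in step S100 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `length_ratio` is hypothetical, and both sentences are assumed to be tokenized already (the Chinese sentence by some word segmenter that the source does not name).

```python
def length_ratio(zh_tokens, en_tokens):
    """Chinese sentence length divided by English sentence length.

    Both arguments are token lists, so the ratio is measured in words.
    Passing lists of characters instead gives the character-based variant.
    """
    if not en_tokens:
        # Assumption: the source does not say how to treat an empty sentence.
        raise ValueError("English sentence has no tokens")
    return len(zh_tokens) / len(en_tokens)


zh = ["我", "喜欢", "读", "书"]        # assumed output of a Chinese word segmenter
en = "I like reading books".split()
ratio = length_ratio(zh, en)
```

Keeping the function agnostic about the unit lets the same code serve both the word-based and the character-based measurement that the text mentions.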
S200. Count the words of each selected part of speech in both sentences of the pair, calculate how many words of those parts of speech match their counterparts through a bilingual translation dictionary, and determine the mutual-translation feature value from the part-of-speech counts and the match counts.
Counting the words of each part of speech specifically means counting the nouns, verbs, adjectives and prepositions in the bilingual sentence pair.
First, both sentences of the pair are part-of-speech tagged. Then the number of words tagged as noun, verb, adjective or preposition is counted in each sentence. These four parts of speech are chosen with dictionary translation in mind, because translations of words with these parts of speech are generally more discriminative.
For each noun, verb, adjective or preposition in the Chinese sentence of an English-Chinese pair, its translations are looked up in a Chinese-English dictionary and searched for among the words of those parts of speech in the English sentence. If a translation is found, the word counts as a match, and the number of matches is tallied. Conversely, for each word of those parts of speech in the English sentence, its translations are looked up in an English-Chinese dictionary and searched for among the words of those parts of speech in the Chinese sentence. If a translation is found, the word counts as a match, and the number of matches is tallied.
Taking an English-Chinese sentence pair as an example, the mutual-translation feature value is calculated with the following formula:
V(c,e) = (T(c,e)/I(c)) * (T(e,c)/I(e))
where:
V(c,e): the mutual-translation feature value of the English-Chinese sentence pair;
T(c,e): the number of words of the four parts of speech in the Chinese sentence whose translations, looked up in the Chinese-English dictionary, are found in the English sentence;
T(e,c): the number of words of the four parts of speech in the English sentence whose translations, looked up in the English-Chinese dictionary, are found in the Chinese sentence;
I(c): the number of words of the four parts of speech in the Chinese sentence of the pair;
I(e): the number of words of the four parts of speech in the English sentence of the pair.
The same formula can be applied when the bilingual sentence pair is in any other two languages.
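The formula for V(c,e) can be illustrated with a short Python sketch. Everything named here is an assumption made for illustration: the tag names in `CONTENT_POS`, the toy dictionaries, and the choice to return 0.0 when a sentence contains no word of the four parts of speech (the source does not say what to do when I(c) or I(e) is zero). Sentences are assumed to arrive as lists of (word, tag) tuples from some part-of-speech tagger.

```python
# Hypothetical tag names for noun, verb, adjective, preposition.
CONTENT_POS = {"n", "v", "a", "p"}


def content_words(tagged):
    """Keep only the words whose tag is one of the four parts of speech."""
    return [word for word, pos in tagged if pos in CONTENT_POS]


def count_matches(src_words, tgt_words, dictionary):
    """Count source words with at least one dictionary translation in the target."""
    targets = set(tgt_words)
    return sum(1 for w in src_words if targets & set(dictionary.get(w, ())))


def mutual_translation(zh_tagged, en_tagged, zh_en, en_zh):
    """V(c,e) = (T(c,e) / I(c)) * (T(e,c) / I(e))."""
    zh_words = content_words(zh_tagged)
    en_words = content_words(en_tagged)
    if not zh_words or not en_words:
        return 0.0  # assumption: undefined in the source when a count is zero
    t_ce = count_matches(zh_words, en_words, zh_en)
    t_ec = count_matches(en_words, zh_words, en_zh)
    return (t_ce / len(zh_words)) * (t_ec / len(en_words))


zh_tagged = [("我", "r"), ("喜欢", "v"), ("书", "n")]
en_tagged = [("I", "r"), ("like", "v"), ("books", "n")]
zh_en = {"喜欢": ["like", "love"], "书": ["book"]}   # toy Chinese-English dictionary
en_zh = {"like": ["喜欢"], "books": ["书"]}          # toy English-Chinese dictionary
v = mutual_translation(zh_tagged, en_tagged, zh_en, en_zh)
```

In this toy pair "书" fails to match because the dictionary lists "book" while the sentence contains "books", which shows why a real system would likely need lemmatization before lookup; the source does not discuss this.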
S300. Filter and classify the pair with the sentence-length ratio feature value and the mutual-translation feature value, according to a classification model built in advance from a training set.
Building the classification model from a training set specifically comprises:
S301. Construct the training set.
The training set consists of good and bad sentence pairs drawn from the bilingual corpus in a certain proportion, and each bilingual sentence pair is labeled with a classification value: 1 for a good sentence pair and -1 for a bad sentence pair.
The training set may be formed by selecting good and bad sentence pairs from the bilingual corpus in a 1:1 ratio.
The training set should contain at least 50,000 pairs; the larger the training set, the better for training the classification model. The sources of the corpus should be as broad as possible, because a wider distribution of material makes the trained classification model more general.
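A minimal sketch of step S301 under the 1:1 ratio mentioned above. The function name `build_training_set` and the fixed random seed are illustrative assumptions; the source does not describe how pairs are sampled, only the ratio and the labels.

```python
import random


def build_training_set(good_pairs, bad_pairs, seed=0):
    """Sample equal numbers of good and bad pairs and label them 1 and -1."""
    n = min(len(good_pairs), len(bad_pairs))   # enforce the 1:1 ratio
    rng = random.Random(seed)                  # fixed seed for reproducibility
    labeled = [(pair, 1) for pair in rng.sample(good_pairs, n)]
    labeled += [(pair, -1) for pair in rng.sample(bad_pairs, n)]
    rng.shuffle(labeled)
    return labeled


good = [("我喜欢书", "I like books"), ("你好", "Hello"), ("谢谢", "Thank you")]
bad = [("你好", "The weather is nice")]
training_set = build_training_set(good, bad)
```

With only one bad pair available, the sketch keeps one good pair as well; at the scale the text recommends (50,000 pairs or more) both lists would be far larger.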
S302. Calculate the sentence-length ratio feature value and the mutual-translation feature value according to steps S100 and S200, and train a classifier with them.
The annotation format of the training-set features is: "classification value + space + feature code:feature value + space + feature code:feature value ..."
One space separates the classification value from the first feature code, and one space separates each feature value from the next feature code. For example, the feature code of the sentence-length ratio feature can be set to 2 and the feature code of the mutual-translation feature to 3.
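The annotation format can be sketched as a small formatting helper. It treats 2 and 3 as the feature codes of the sentence-length ratio and mutual-translation features, following the example in the text; the function name `format_features` and the `:g` number formatting are illustrative assumptions. The layout resembles the sparse text format read by tools such as LIBSVM, although the source does not name a specific tool.

```python
def format_features(label, features):
    """Render 'label code:value code:value ...' with codes in ascending order."""
    parts = [str(label)]
    parts += [f"{code}:{value:g}" for code, value in sorted(features.items())]
    return " ".join(parts)


# A good pair (label 1) with length ratio 1.25 and mutual-translation value 0.5.
line = format_features(1, {2: 1.25, 3: 0.5})
```

For the values above the helper produces the string `1 2:1.25 3:0.5`.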
Training a classifier is a known technique; a general-purpose classifier such as an SVM (support vector machine) or a maximum entropy classifier can be used.
S303. Determine the classification model.
After the classification model has been built, bilingual sentence pairs whose classification value is "-1" are placed in a filter repository for later processing, and bilingual sentence pairs whose classification value is "1" are retained in the bilingual corpus.
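The final routing step can be sketched as follows, assuming the labelling convention stated earlier (1 for a good pair, -1 for a bad pair), so that pairs classified -1 go to the filter repository and pairs classified 1 stay in the corpus. The function name `split_by_label` is hypothetical, and the labels are assumed to come from whatever trained classifier is in use.

```python
def split_by_label(pairs, labels):
    """Return (kept, filtered): pairs labeled 1 stay, pairs labeled -1 are set aside."""
    kept, filtered = [], []
    for pair, label in zip(pairs, labels):
        (kept if label == 1 else filtered).append(pair)
    return kept, filtered


pairs = [("你好", "Hello"), ("谢谢", "Good morning"), ("再见", "Goodbye")]
predicted = [1, -1, 1]          # assumed classifier output for the three pairs
kept, filtered = split_by_label(pairs, predicted)
```

Setting the rejected pairs aside rather than deleting them matches the text's note that they are held for later processing.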
The bilingual corpus filtering method of this embodiment determines the sentence-length ratio feature value and the mutual-translation feature value of a bilingual sentence pair, and then filters and classifies the pair with those feature values according to a classification model built in advance from a training set. The method can therefore process very large bilingual corpora quickly and conveniently. By using the classification model, the invention converts the problem of filtering an English-Chinese bilingual corpus into a binary classification problem, so that the weights of the matching features are determined in a more principled way. Compared with the existing experience-based method, it is more general, and accuracy and recall improve accordingly.
Referring to Fig. 3 and Fig. 4, Fig. 3 is a flowchart of the second embodiment of the bilingual corpus filtering method of the present invention, and Fig. 4 is a flowchart of building the classification model in Fig. 3.
Compared with the first embodiment, the second embodiment of the method adds a step of determining a numeral-matching feature value.
The bilingual corpus filtering method of the second embodiment of the invention comprises the following steps:
S10. Determine the numeral-matching feature value.
The numerals in each sentence of the bilingual pair are converted into a unified digit form. When the converted numerals of the two sentences match, the numeral-matching feature value is set to 1. When they do not match, the numeral-matching feature value is set to 0.
The process of determining the numeral-matching feature value is described below, taking an English-Chinese sentence pair as an example.
First, the Chinese sentence and the English sentence of the pair are each part-of-speech tagged. The tagging method is a known technique and is not described in detail here.
Then the words tagged m (numeral) in the Chinese sentence and the words tagged od (ordinal numeral) or cd (cardinal numeral) in the English sentence are normalized.
For example, if the English sentence of the pair contains "$5 million" and the Chinese sentence contains the equivalent Chinese numeral expression (rendered as "5,000,000" in this translation), both are converted to 5000000.
The normalization is rule-based: a set of conversion rules is drawn up.
The conversion rules include rules that map Chinese numerals to digits, for example "一" (one) corresponds to "1" and "百" (hundred) corresponds to "100".
The conversion rules also include rules that map English numerals to digits, for example "one" corresponds to "1" and "hundred" corresponds to "100".
After normalization, the numerals in the Chinese sentence are compared with the numerals in the English sentence. If they match, the numeral-matching feature value is 1. If they do not match, the numeral-matching feature value is 0.
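A rough sketch of rule-based numeral normalization and matching follows. It is deliberately much simpler than a real rule set and is not the patent's rule table: the word lists, the regular expressions and all function names are assumptions, it works on raw text rather than on part-of-speech tags, it handles only a digit or single numeral followed by scale words, and it treats two sentences without numerals as matching (the source does not cover that case).

```python
import re

EN_UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
            "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
EN_SCALES = {"hundred": 100, "thousand": 1_000, "million": 1_000_000}
ZH_DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
             "六": 6, "七": 7, "八": 8, "九": 9}
ZH_SCALES = {"十": 10, "百": 100, "千": 1_000, "万": 10_000}


def normalize_en_numbers(sentence):
    """Return the sorted integers mentioned in an English sentence."""
    values, current = [], None
    for tok in re.findall(r"\d[\d,]*|[a-z]+", sentence.lower()):
        if tok in EN_SCALES and current is not None:
            current *= EN_SCALES[tok]        # e.g. "5" followed by "million"
            continue
        if current is not None:
            values.append(current)           # the previous number is complete
        if tok[0].isdigit():
            current = int(tok.replace(",", ""))
        else:
            current = EN_UNITS.get(tok)      # None for ordinary words
    if current is not None:
        values.append(current)
    return sorted(values)


def normalize_zh_numbers(sentence):
    """Return the sorted integers mentioned in a Chinese sentence."""
    values = []
    pattern = r"(\d[\d,]*|[一二三四五六七八九])([十百千万]*)"
    for base, scales in re.findall(pattern, sentence):
        n = int(base.replace(",", "")) if base[0].isdigit() else ZH_DIGITS[base]
        for s in scales:
            n *= ZH_SCALES[s]                # e.g. "500" followed by "万"
        values.append(n)
    return sorted(values)


def number_match(zh_sentence, en_sentence):
    """1 if the normalized numerals of both sentences agree, else 0."""
    same = normalize_zh_numbers(zh_sentence) == normalize_en_numbers(en_sentence)
    return 1 if same else 0


feature = number_match("该公司投资了500万美元", "The company invested $5 million")
```

Known limitations of this sketch include compound numerals such as "三百二十" (read as two separate numbers) and the English word "one" used as a pronoun; restricting the input to tokens tagged as numerals, as the text describes, would avoid the latter.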
S100. Determine the sentence-length ratio feature value of the bilingual sentence pair.
The process of determining the sentence-length ratio feature value is again described with an English-Chinese sentence pair as an example.
Decide whether sentence length is measured in words or in characters, and divide the word count or character count of the Chinese sentence by the word count or character count of the English sentence to obtain the sentence-length ratio feature value.
Measuring sentence length in words or in characters gives similar results, but word counts are generally chosen because they better reflect the length-ratio characteristics of English-Chinese sentence pairs.
S200. Count the words of each selected part of speech in both sentences of the pair, calculate how many words of those parts of speech match their counterparts through a bilingual translation dictionary, and determine the mutual-translation feature value from the part-of-speech counts and the match counts.
The process of determining the mutual-translation feature value is again described with an English-Chinese sentence pair as an example.
Counting the words of each part of speech specifically means counting the nouns, verbs, adjectives and prepositions in the English-Chinese sentence pair.
First, both sentences of the English-Chinese pair are part-of-speech tagged. Then the number of words tagged as noun, verb, adjective or preposition is counted in each sentence.
For each noun, verb, adjective or preposition in the Chinese sentence of the pair, its translations are looked up in a Chinese-English dictionary and searched for among the words of those parts of speech in the English sentence. If a translation is found, the word counts as a match, and the number of matches is tallied. Conversely, for each word of those parts of speech in the English sentence, its translations are looked up in an English-Chinese dictionary and searched for among the words of those parts of speech in the Chinese sentence. If a translation is found, the word counts as a match, and the number of matches is tallied.
The mutual-translation feature value of the English-Chinese sentence pair is calculated with the following formula:
V(c,e) = (T(c,e)/I(c)) * (T(e,c)/I(e))
where:
V(c,e): the mutual-translation feature value of the English-Chinese sentence pair;
T(c,e): the number of words of the four parts of speech in the Chinese sentence whose translations, looked up in the Chinese-English dictionary, are found in the English sentence;
T(e,c): the number of words of the four parts of speech in the English sentence whose translations, looked up in the English-Chinese dictionary, are found in the Chinese sentence;
I(c): the number of words of the four parts of speech in the Chinese sentence of the pair;
I(e): the number of words of the four parts of speech in the English sentence of the pair.
S300A. Filter and classify the pair with the sentence-length ratio feature value, the mutual-translation feature value and the numeral-matching feature value, according to a classification model trained in advance.
The process of building the classification model for the second embodiment of the filtering method is again described with an English-Chinese sentence pair as an example.
Building the classification model specifically comprises:
S301A. Construct the training set.
The training set consists of good and bad sentence pairs drawn from the English-Chinese bilingual corpus in a certain proportion, and each English-Chinese sentence pair is labeled with a classification value: 1 for a good sentence pair and -1 for a bad sentence pair.
S302A. Calculate the numeral-matching feature value, the sentence-length ratio feature value and the mutual-translation feature value according to steps S10, S100 and S200, and train a classifier with them.
The annotation format of the training-set features is: classification value + space + feature code:feature value + space + feature code:feature value + space + feature code:feature value.
One space separates the classification value from the first feature code, and one space separates each feature value from the next feature code. For example, the feature code of the numeral-matching feature can be set to 1, that of the sentence-length ratio feature to 2, and that of the mutual-translation feature to 3.
S303A. Determine the classification model.
After the classification model has been built, English-Chinese sentence pairs whose classification value is "-1" are placed in a filter repository for later processing, and English-Chinese sentence pairs whose classification value is "1" are retained in the English-Chinese bilingual corpus.
The second embodiment of the method adds the step of determining the numeral-matching feature value, which greatly improves filtering accuracy for bilingual sentence pairs that contain numerical information.
Referring to Fig. 5 and Fig. 6, Fig. 5 is a flowchart of the third embodiment of the bilingual corpus filtering method of the present invention, and Fig. 6 is a flowchart of the preprocessing in Fig. 5 that unifies the encoding of the bilingual sentence pair.
Compared with the first embodiment, the third embodiment of the method adds a preprocessing step that unifies the encoding of the bilingual sentence pair.
The English-Chinese bilingual corpus filtering method of the third embodiment is again described with an English-Chinese sentence pair as an example.
The English-Chinese bilingual corpus filtering method of the third embodiment of the invention comprises the following steps:
S1. Preprocess the English-Chinese sentence pair to unify its encoding.
The preprocessing that unifies the encoding of the English-Chinese sentence pair specifically comprises:
S1a. converting full-width characters in the English-Chinese sentence pair to half-width characters;
S1b. converting Big5 code (traditional Chinese encoding) to GB code (the simplified Chinese national standard encoding);
S1c. eliminating garbled characters.
To eliminate garbled characters in the Chinese part of an English-Chinese sentence pair, the text is checked against the GB code range, and anything outside that range is rejected.
To eliminate garbled characters in the English part of an English-Chinese sentence pair, the text is checked against the ASCII code range, and anything outside that range is rejected.
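The two range checks can be sketched with Python's built-in codecs. This is an illustration under stated assumptions: the text is assumed to be already decoded into Unicode strings, "GB code" is taken to mean GB2312 (the source does not say which GB standard is meant), and the function names are hypothetical. Note that the `gb2312` codec also accepts ASCII characters, so Chinese sentences containing Latin letters or digits pass the check.

```python
def is_clean_english(text):
    """True when every character lies in the ASCII range."""
    return all(ord(ch) < 128 for ch in text)


def is_clean_chinese(text):
    """True when the whole text can be encoded with the GB2312 codec."""
    try:
        text.encode("gb2312")
    except UnicodeEncodeError:
        return False
    return True


ok_en = is_clean_english("I like books.")
ok_zh = is_clean_chinese("我喜欢读书")
bad_zh = is_clean_chinese("我\ufffd书")   # U+FFFD marks an undecodable byte
```

Whether a failing sentence pair is dropped entirely or only the offending characters are removed is a design choice the source leaves open; the sketch only reports the result of the check.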
Special symbols are handled as follows:
When a sentence of an English-Chinese pair begins with a label such as "1,", "(1)", "(I)", "(i)", "1)" or "一、" ("one,"), the label at the beginning of the sentence is deleted and the rest is kept.
When a sentence of an English-Chinese pair contains special punctuation such as "======", "……" or "------", the symbols are deleted and the rest is kept.
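The special-symbol handling can be approximated with two regular expressions. The patterns below are assumptions that cover only the label and symbol shapes listed in the text, not an exhaustive rule set, and the function name `strip_special_symbols` is hypothetical.

```python
import re

# A leading label such as "(1)", "(I)", "(i)", "1)", "1," or "一、".
# The (?!\d) guard keeps numbers like "1.5" or "1,000" intact.
LEADING_LABEL = re.compile(
    r"^\s*(?:\(\s*(?:\d+|[ivxIVX]+)\s*\)"
    r"|\d+\s*[,.)、](?!\d)"
    r"|[一二三四五六七八九十]+\s*、)\s*"
)
# Runs of "=", "-" or "." (three or more) and Chinese ellipses.
SYMBOL_RUN = re.compile(r"={3,}|-{3,}|\.{3,}|…+")


def strip_special_symbols(sentence):
    """Remove one leading label and any runs of special punctuation."""
    sentence = LEADING_LABEL.sub("", sentence, count=1)
    sentence = SYMBOL_RUN.sub("", sentence)
    return sentence.strip()


cleaned = strip_special_symbols("(1) This is a test.")
```

The threshold of three repeated characters is arbitrary; it is chosen so that ordinary hyphens and sentence-final periods are left alone.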
The preprocessing that unifies the encoding of the English-Chinese sentence pair may comprise all three steps S1a, S1b and S1c, or only one or two of them.
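Of the three preprocessing steps, the full-width to half-width conversion (S1a) is the easiest to show compactly. The sketch below relies on the Unicode layout, where the full-width forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from their ASCII counterparts and U+3000 is the ideographic space; the function name is hypothetical. The Big5 to GB conversion (S1b) is not shown because mapping traditional to simplified characters needs a character table that the Python standard library does not provide.

```python
def fullwidth_to_halfwidth(text):
    """Map full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                  # ideographic (full-width) space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:      # full-width "!" through "~"
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)                  # everything else is left untouched
    return "".join(out)


converted = fullwidth_to_halfwidth("\uff21\uff22\uff23\u3000\uff11\uff12\uff13\uff01")
```

This mapping also turns full-width punctuation such as the Chinese comma "," into its ASCII form, which may or may not be wanted for the Chinese side of a pair; the source does not distinguish the two sides for this step.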
S100: Determine the sentence-length ratio feature value of the English-Chinese bilingual sentence pair.
Decide whether the sentences of the English-Chinese bilingual sentence pair are measured in words or in characters; the word count or character count of the Chinese sentence divided by the word count or character count of the English sentence gives the sentence-length ratio feature value.
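A minimal sketch of step S100 follows. The function name and the `unit` parameter are hypothetical, and word counting assumes that the Chinese sentence has already been segmented into space-separated words, which the text does not describe.

```python
def sentence_length_ratio(chinese, english, unit="char"):
    """Chinese length divided by English length, in words or in characters."""
    if unit == "word":
        # Assumption: both sentences are already tokenized with spaces.
        zh_len = len(chinese.split())
        en_len = len(english.split())
    else:
        # Count characters, ignoring spaces.
        zh_len = len(chinese.replace(" ", ""))
        en_len = len(english.replace(" ", ""))
    if en_len == 0:
        return 0.0  # assumption: an empty English side yields 0
    return zh_len / en_len
```

Whichever unit is chosen, the same unit has to be used for training and for classification so that the feature values are comparable.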
S200: Count the words of each of the different parts of speech in the English-Chinese bilingual sentence pair, count how many of these words are matched through the corresponding entries of a Chinese-English dictionary or an English-Chinese dictionary, and determine the mutual-translation feature value from the part-of-speech counts and the match counts.
Counting the words of the different parts of speech in the English-Chinese bilingual sentence pair specifically means counting the nouns, verbs, adjectives and prepositions in the pair.
First, part-of-speech tagging is performed on each sentence of the English-Chinese bilingual sentence pair. Then, for each sentence, the words belonging to the four parts of speech (noun, verb, adjective and preposition) are counted.
Each word of the Chinese sentence that belongs to one of these parts of speech is translated with the Chinese-English dictionary, and the translation is looked up among the words of the English sentence that belong to these parts of speech. If it is found, the word is matched, and the number of matches is counted. Conversely, each word of the English sentence that belongs to one of these parts of speech is translated with the English-Chinese dictionary, and the translation is looked up among the words of the Chinese sentence that belong to these parts of speech. If it is found, the word is matched, and the number of matches is counted.
The mutual-translation feature value of the English-Chinese bilingual sentence pair is calculated with the following formula.
V(c,e)=(T(c,e)/I(c))*(T(e,c)/I(e))
where V(c,e) is the mutual-translation feature value of the English-Chinese bilingual sentence pair;
T(c,e) is the number of words of the above four parts of speech in the Chinese sentence for which a match is found in the English sentence using the Chinese-English dictionary;
T(e,c) is the number of words of the above four parts of speech in the English sentence for which a match is found in the Chinese sentence using the English-Chinese dictionary;
I(c) is the number of words of the above four parts of speech in the Chinese sentence of the pair;
I(e) is the number of words of the above four parts of speech in the English sentence of the pair.
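The formula can be illustrated with the following sketch. It assumes that part-of-speech tagging has already produced `(word, tag)` pairs, that the tag names in `TARGET_POS` are used, and that each dictionary is a plain mapping from a word to a list of translations; all of these, and the function names, are hypothetical choices. Returning 0 when I(c) or I(e) is 0 is also an assumption, since the text does not cover that case.

```python
TARGET_POS = {"noun", "verb", "adj", "prep"}  # the four parts of speech

def count_matches(src_tagged, tgt_tagged, dictionary):
    """Return (T, I): matched source words and total source words of the four POS."""
    tgt_words = {w for w, pos in tgt_tagged if pos in TARGET_POS}
    src_words = [w for w, pos in src_tagged if pos in TARGET_POS]
    # A source word matches if any dictionary translation occurs on the target side.
    matched = sum(1 for w in src_words if tgt_words & set(dictionary.get(w, ())))
    return matched, len(src_words)

def mutual_translation_value(zh_tagged, en_tagged, zh_en_dict, en_zh_dict):
    """V(c,e) = (T(c,e)/I(c)) * (T(e,c)/I(e))."""
    t_ce, i_c = count_matches(zh_tagged, en_tagged, zh_en_dict)
    t_ec, i_e = count_matches(en_tagged, zh_tagged, en_zh_dict)
    if i_c == 0 or i_e == 0:
        return 0.0  # assumption: undefined ratios are treated as 0
    return (t_ce / i_c) * (t_ec / i_e)

# Toy data, invented for illustration only.
zh = [("我", "pron"), ("喜欢", "verb"), ("苹果", "noun")]
en = [("I", "pron"), ("like", "verb"), ("apples", "noun")]
zh_en = {"喜欢": ["like", "love"], "苹果": ["apple"]}
en_zh = {"like": ["喜欢"], "apples": ["苹果"]}
value = mutual_translation_value(zh, en, zh_en, en_zh)
```

In the toy data, "苹果" translates to "apple" while the English sentence contains "apples", so T(c,e)/I(c) is 1/2 and T(e,c)/I(e) is 2/2. This shows that the result depends on how inflected forms are handled during lookup, a detail the text leaves open.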
S300: According to a classification model trained in advance, perform filtering classification using the sentence-length ratio feature value and the mutual-translation feature value.
Building the classification model from a training set specifically comprises:
S301: Construct the training set.
The training set consists of a certain proportion of good and bad sentence pairs taken from the English-Chinese bilingual corpus, each annotated with a class value: 1 for a good sentence pair and -1 for a bad sentence pair.
S302: Calculate the sentence-length ratio feature value and the mutual-translation feature value of each training pair according to steps S100 and S200, and train a classifier on them.
Training a classifier is known technology; a general-purpose classifier such as an SVM or a maximum entropy classifier may be used.
S303: Determine the classification model.
After the classification model has been built, English-Chinese bilingual sentence pairs whose class value is "-1" are placed in a filtered store for later processing, and pairs whose class value is "1" are retained in the English-Chinese bilingual corpus.
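Steps S301 to S303 and the resulting filtering can be sketched as below. The text only names SVM and maximum entropy as possible classifiers; this sketch uses a two-class maximum entropy model, which is equivalent to logistic regression, trained with plain gradient descent so that the example needs no external library. The training data, function names and hyperparameters are invented for illustration.

```python
import math

def train_binary_maxent(samples, labels, epochs=5000, lr=0.5):
    """Fit weights and bias by gradient descent on the logistic loss (labels are +1/-1)."""
    dim = len(samples[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(samples, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            # Derivative of log(1 + exp(-margin)) with respect to the score.
            g = -y / (1.0 + math.exp(margin))
            for i in range(dim):
                grad_w[i] += g * x[i]
            grad_b += g
        weights = [w - lr * gw / n for w, gw in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def classify(model, features):
    """Return 1 (good pair, retained) or -1 (bad pair, filtered out)."""
    weights, bias = model
    score = sum(w * xi for w, xi in zip(weights, features)) + bias
    return 1 if score >= 0 else -1

def split_corpus(model, featured_pairs):
    """Split (pair, features) items into retained pairs and the filtered store."""
    retained, filtered = [], []
    for pair, features in featured_pairs:
        (retained if classify(model, features) == 1 else filtered).append(pair)
    return retained, filtered

# Invented training set: (sentence-length ratio, mutual-translation value) -> class value.
TRAIN_X = [(1.0, 0.8), (0.9, 0.6), (1.1, 0.75), (1.0, 0.0), (3.0, 0.1), (0.2, 0.05)]
TRAIN_Y = [1, 1, 1, -1, -1, -1]
model = train_binary_maxent(TRAIN_X, TRAIN_Y)
```

A further feature, such as the number matching feature value of the second embodiment, would simply add one more component to each feature tuple.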
The third embodiment of the bilingual corpus filtering method of the present invention adds the preprocessing step that unifies the character encoding of the bilingual sentence pairs, which can further improve the accuracy of the filtering classification.
In the bilingual corpus filtering method of the present invention, this preprocessing step may also be added before step S10 of the second embodiment, in which the number matching feature value is determined. This likewise improves the accuracy of the filtering classification.
The present invention also provides a bilingual corpus filtering system for improving the generality, accuracy and recall of corpus filtering.
Refer to Fig. 7, which is a structural diagram of the first embodiment of the bilingual corpus filtering system of the present invention.
The bilingual corpus filtering system of the first embodiment of the present invention comprises a sentence-length ratio calculation unit 12, a mutual-translation calculation unit 13, a classification model training unit 14 and a classification unit 11.
The sentence-length ratio calculation unit 12 is used to determine the sentence-length ratio feature value of a bilingual sentence pair.
The mutual-translation calculation unit 13 is used to count the words of each of the different parts of speech in the bilingual sentence pair, to count how many of these words are matched by corresponding words in a bilingual translation dictionary, and to determine the mutual-translation feature value from the part-of-speech counts and the match counts.
The classification model training unit 14 is used to build the classification model.
The classification model training unit 14 forms a training set from a certain proportion of good and bad sentence pairs in the bilingual corpus and annotates each pair with a class value: 1 for a good sentence pair and -1 for a bad sentence pair.
The sentence-length ratio calculation unit 12 and the mutual-translation calculation unit 13 calculate the sentence-length ratio feature value and the mutual-translation feature value of the sentence pairs in the training set, and a classifier is trained on these values to build the classification model. Finally, bilingual sentence pairs whose class value is "-1" are placed in a filtered store for later processing, and bilingual sentence pairs whose class value is "1" are retained in the bilingual corpus.
The classification unit 11 is connected to the sentence-length ratio calculation unit 12, the mutual-translation calculation unit 13 and the classification model training unit 14, and is used to perform filtering classification using the sentence-length ratio feature value and the mutual-translation feature value, according to the classification model built in advance from the training set.
The bilingual corpus filtering system of this embodiment of the present invention comprises the sentence-length ratio calculation unit 12, which determines the sentence-length ratio feature value of a bilingual sentence pair, and the mutual-translation calculation unit 13, which determines the mutual-translation feature value; the classification unit 11 performs filtering classification on these two feature values according to the model of the classification model training unit 14. In this way the system can process a bilingual corpus with a huge volume of data quickly and easily. By using the classification model training unit 14, the present invention converts the filtering of a bilingual corpus into a binary classification problem, so that the weights of the matching features are determined in a more principled way than with existing empirical methods; the approach is more general, and accuracy and recall improve accordingly.
Refer to Fig. 8, which is a structural diagram of the second embodiment of the bilingual corpus filtering system of the present invention.
Compared with the first embodiment, the second embodiment of the bilingual corpus filtering system of the present invention adds a number matching unit 15 connected to the classification unit 11.
The number matching unit 15 is used to convert the numbers in both sentences of a bilingual sentence pair into a unified digit form; when the converted numbers of the two sentences match, the number matching feature value is determined to be 1, and when they do not match, the number matching feature value is determined to be 0.
The classification unit 11 performs filtering classification using the number matching feature value, the sentence-length ratio feature value and the mutual-translation feature value, according to the classification model built in advance by the classification model training unit 14.
The second embodiment of the system of the present invention adds the number matching unit 15, which greatly improves the filtering accuracy when the system processes bilingual sentence pairs that contain numerical information.
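The number matching performed by unit 15 can be sketched as follows. The text does not say how numbers are converted to digits, so the small lookup tables here are hypothetical and cover only single number words and single Chinese numeral characters; compound forms such as "twenty-one" or "二十" would need a real numeral parser.

```python
import re

FULLWIDTH_DIGITS = str.maketrans("０１２３４５６７８９", "0123456789")
# Hypothetical, deliberately tiny lookup tables.
EN_NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
                   "five": "5", "six": "6", "seven": "7", "eight": "8",
                   "nine": "9", "ten": "10"}
ZH_NUMBER_CHARS = {"零": "0", "〇": "0", "一": "1", "二": "2", "两": "2", "三": "3",
                   "四": "4", "五": "5", "六": "6", "七": "7", "八": "8",
                   "九": "9", "十": "10"}

def numbers_in_english(sentence):
    """Collect digit strings and number words, all rendered as digit strings."""
    text = sentence.translate(FULLWIDTH_DIGITS).lower()
    found = re.findall(r"\d+", text)
    found += [EN_NUMBER_WORDS[w] for w in re.findall(r"[a-z]+", text)
              if w in EN_NUMBER_WORDS]
    return sorted(found)

def numbers_in_chinese(sentence):
    """Collect digit strings and single numeral characters as digit strings."""
    text = sentence.translate(FULLWIDTH_DIGITS)
    found = re.findall(r"\d+", text)
    found += [ZH_NUMBER_CHARS[ch] for ch in text if ch in ZH_NUMBER_CHARS]
    return sorted(found)

def number_match_value(chinese, english):
    """1 if both sides contain the same numbers after conversion, else 0."""
    # Assumption: a pair with no numbers on either side counts as a match.
    return 1 if numbers_in_chinese(chinese) == numbers_in_english(english) else 0
```

Comparing sorted lists ignores word order, which differs between the two languages, but still requires every number to appear the same number of times on both sides.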
Refer to Fig. 9, which is a structural diagram of the third embodiment of the bilingual corpus filtering system of the present invention.
Compared with the first embodiment, the third embodiment of the bilingual corpus filtering system of the present invention adds a preprocessing unit 16 connected to the classification unit 11.
The preprocessing unit 16 is used for the preprocessing that unifies the character encoding of the bilingual sentence pairs.
The preprocessing unit 16 comprises a full-width to half-width conversion subunit 16a and a garbled-character processing subunit 16c, both connected to the classification unit 11.
The full-width to half-width conversion subunit 16a is used to convert full-width characters in the bilingual sentence pairs to half-width characters.
The garbled-character processing subunit 16c is used to remove garbled characters.
The garbled-character processing subunit 16c also handles special symbols:
When a sentence of a bilingual sentence pair begins with a label such as "1,", "(1)", "(I)", "(i)", "1)" or "one", the garbled-character processing subunit 16c deletes the label at the beginning of the sentence and retains the rest of the sentence.
When a sentence of a bilingual sentence pair contains special punctuation such as "======", "... ..." or "------", the garbled-character processing subunit 16c deletes these symbols and retains the remainder of the sentence.
When the bilingual corpus filtering system of the present invention is an English-Chinese bilingual corpus filtering system, the garbled-character processing subunit 16c checks the Chinese part of each English-Chinese bilingual sentence pair against the GB code range and removes characters outside this range.
When the bilingual corpus filtering system of the present invention is an English-Chinese bilingual corpus filtering system, the garbled-character processing subunit 16c checks the English part of each bilingual sentence pair against the ASCII code range and removes characters outside this range.
When the bilingual corpus filtering system of the present invention is an English-Chinese bilingual corpus filtering system, the preprocessing unit 16 further comprises a Big5-to-GB conversion subunit 16b, which is used to convert Big5 code to GB code.
The preprocessing unit 16 may comprise all of the full-width to half-width conversion subunit 16a, the Big5-to-GB conversion subunit 16b and the garbled-character processing subunit 16c, or only one or two of these subunits.
The third embodiment of the bilingual corpus filtering system of the present invention adds the preprocessing unit 16, which unifies the character encoding of the bilingual sentence pairs and further improves the accuracy of the filtering classification.
The bilingual corpus filtering system of the embodiments of the present invention may also add the preprocessing unit 16, connected to the classification unit 11, on the basis of the second embodiment.
In that case the preprocessing unit 16 comprises the full-width to half-width conversion subunit 16a, the Big5-to-GB conversion subunit 16b and the garbled-character processing subunit 16c, all connected to the classification unit 11.
Here too, the preprocessing unit 16 may comprise all of the full-width to half-width conversion subunit 16a, the Big5-to-GB conversion subunit 16b and the garbled-character processing subunit 16c, or only one or two of these subunits.
The above is only a preferred embodiment of the present invention and does not limit the scope of protection of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the claims of the present invention.

Claims (11)

1. A bilingual corpus filtering method, characterized in that it comprises the following steps:
A. determining the sentence-length ratio feature value of a bilingual sentence pair;
B. counting the words of each of the different parts of speech in the bilingual sentence pair, counting how many of the words of said parts of speech are matched by corresponding words in a bilingual translation dictionary, and determining a mutual-translation feature value from the part-of-speech counts and the match counts;
C. performing filtering classification using the sentence-length ratio feature value and the mutual-translation feature value, according to a classification model built in advance from a training set.
2. The bilingual corpus filtering method according to claim 1, characterized in that building the classification model in advance from the training set specifically comprises:
C1. constructing the training set;
C2. calculating the sentence-length ratio feature value and the mutual-translation feature value according to steps A and B respectively, and training a classifier on them;
C3. determining the classification model.
3. The bilingual corpus filtering method according to claim 2, characterized in that the training set consists of a certain proportion of good and bad sentence pairs in the bilingual corpus, each annotated with a class value, the class value being 1 for a good sentence pair and -1 for a bad sentence pair.
4. The bilingual corpus filtering method according to claim 1, characterized in that it further comprises, before step A: determining a number matching feature value;
determining the number matching feature value specifically comprises: converting the numbers in both sentences of the bilingual sentence pair into a unified digit form; when the converted numbers of the two sentences match, determining the number matching feature value to be 1; and when the numbers do not match, determining the number matching feature value to be 0.
5. The bilingual corpus filtering method according to claim 1, characterized in that it further comprises, before step A: preprocessing that unifies the character encoding of the bilingual sentence pair.
6. The bilingual corpus filtering method according to claim 5, characterized in that the bilingual sentence pair is specifically an English-Chinese bilingual sentence pair, and the preprocessing that unifies the character encoding of the bilingual sentence pair specifically comprises:
11) converting full-width characters in the English-Chinese bilingual sentence pair to half-width characters;
12) converting the Traditional Chinese encoding to the Simplified Chinese national standard encoding;
13) removing garbled characters.
7. The bilingual corpus filtering method according to claim 1, characterized in that the bilingual sentence pair is specifically an English-Chinese bilingual sentence pair, and step A specifically comprises: deciding whether the sentences of the English-Chinese bilingual sentence pair are measured in words or in characters, and dividing the word count or character count of the Chinese sentence by the word count or character count of the English sentence to obtain the sentence-length ratio feature value.
8. The bilingual corpus filtering method according to claim 1, characterized in that the bilingual sentence pair is specifically an English-Chinese bilingual sentence pair, and counting the words of the different parts of speech in the English-Chinese bilingual sentence pair specifically means counting the nouns, verbs, adjectives and prepositions in the English-Chinese bilingual sentence pair.
9. A bilingual corpus filtering system, characterized in that it comprises a sentence-length ratio calculation unit, a mutual-translation calculation unit, a classification model training unit and a classification unit;
the sentence-length ratio calculation unit is used to determine the sentence-length ratio feature value of a bilingual sentence pair;
the mutual-translation calculation unit is used to count the words of each of the different parts of speech in the bilingual sentence pair, to count how many of the words of said parts of speech are matched by corresponding words in a bilingual translation dictionary, and to determine the mutual-translation feature value from the part-of-speech counts and the match counts;
the classification unit is connected to the sentence-length ratio calculation unit and the mutual-translation calculation unit, and is used to perform filtering classification using the sentence-length ratio feature value and the mutual-translation feature value, according to a classification model built in advance from a training set.
10. The bilingual corpus filtering system according to claim 1, characterized in that the classification model training unit forms the training set from a certain proportion of good and bad sentence pairs in the bilingual corpus and annotates each pair with a class value, the class value being 1 for a good sentence pair and -1 for a bad sentence pair.
11. The bilingual corpus filtering system according to claim 1, characterized in that the system further comprises a number matching unit, which is used to convert the numbers in both sentences of the bilingual sentence pair into a unified digit form; when the converted numbers of the two sentences match, the number matching feature value is determined to be 1, and when the numbers do not match, the number matching feature value is determined to be 0.
CN200710178309XA 2007-11-28 2007-11-28 Method and system for filtering bilingualism corpora Expired - Fee Related CN101201820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200710178309XA CN101201820B (en) 2007-11-28 2007-11-28 Method and system for filtering bilingualism corpora

Publications (2)

Publication Number Publication Date
CN101201820A CN101201820A (en) 2008-06-18
CN101201820B (en) 2010-06-02

Family

ID=39516990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200710178309XA Expired - Fee Related CN101201820B (en) 2007-11-28 2007-11-28 Method and system for filtering bilingualism corpora

Country Status (1)

Country Link
CN (1) CN101201820B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1652106A (en) * 2004-02-04 2005-08-10 北京赛迪翻译技术有限公司 Machine translation method and apparatus based on language knowledge base
JP2007072594A (en) * 2005-09-05 2007-03-22 Sharp Corp Translation device, translation method, translation program and medium
CN101030196A (en) * 2006-02-28 2007-09-05 株式会社东芝 Method and apparatus for training bilingual word alignment model, method and apparatus for bilingual word alignment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100602

Termination date: 20141128

EXPY Termination of patent right or utility model