CN104391885A - Method for extracting chapter-level parallel phrase pair of comparable corpus based on parallel corpus training - Google Patents

Method for extracting chapter-level parallel phrase pair of comparable corpus based on parallel corpus training

Info

Publication number
CN104391885A
Authority
CN
China
Prior art keywords
phrase
parallel
translation
word
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410624648.6A
Other languages
Chinese (zh)
Other versions
CN104391885B (en)
Inventor
曹海龙
张捷鑫
赵铁军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Technology High-Tech Development Corporation
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN201410624648.6A priority Critical patent/CN104391885B/en
Publication of CN104391885A publication Critical patent/CN104391885A/en
Application granted granted Critical
Publication of CN104391885B publication Critical patent/CN104391885B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Abstract

The invention discloses a method for extracting chapter-level parallel phrase pairs from a comparable corpus based on parallel-corpus training, and relates to methods for extracting parallel phrase pairs from comparable corpora. The method addresses two problems: acquiring a parallel corpus is expensive, and applying the assumption that the two words or fragments with the most similar contexts are mutual translations to a comparable corpus depends heavily on a bilingual dictionary. The method comprises the following steps: 1, providing a source-language sentence set S and a target-language sentence set T; 2, obtaining the phrase-pair set of the parallel corpus; 3, obtaining the parallel phrase pairs of the parallel corpus; 4, obtaining the non-parallel phrase pairs of the parallel corpus; 5, obtaining a binary support vector machine classifier; 6, extracting candidate parallel phrase pairs <s, t>; 7, obtaining the noisy parallel phrase pairs of the comparable corpus; 8, obtaining the parallel phrase pairs of the comparable corpus; 9, obtaining an extended decoder. The method is applied to the field of extracting parallel phrase pairs from comparable corpora.

Description

A method, trained on a parallel corpus, for extracting parallel phrase pairs from a chapter-level comparable corpus
Technical field
The present invention relates to methods for extracting phrase translation pairs, and in particular to a chapter-level phrase translation pair extraction method.
Background technology
With the advent of high-coverage communication media such as radio, television and the Internet, the distance in time and space between people has suddenly shrunk, international exchanges have become ever more frequent and convenient, and the whole planet has come to feel like a small village in the boundless universe. To let people communicate without barriers, machine translation, the automatic translation from one language into another, has an enormous market demand and broad application prospects.
In recent years computing power has advanced by leaps and bounds, and the development and spread of the Internet, bilingual countries, and the multilingual documents of the United Nations have provided bilingual parallel corpora numbering in the tens of millions of sentence pairs. These have laid the necessary foundation for statistical machine translation methods, and many new models and methods proposed on this basis have achieved good results.
Building a statistical machine translation system is generally divided into two main steps, training and translation. The training step learns statistical knowledge from corpora and performs parameter training. The training of a typical phrase-based statistical machine translation system consists of three main parts: translation-model training on a large-scale bilingual corpus, language-model training on a monolingual target-language corpus, and parameter tuning; the scale of the parallel corpus used for training is the main factor affecting translation performance. For some language pairs, such as Chinese and English or Arabic and English, large amounts of parallel data are available, but for most language pairs this is not the case: their parallel data resources are very scarce or even nonexistent, for example Dard and English, or French and Japanese, which severely limits the performance of machine translation systems. Acquiring parallel corpora is quite expensive, so it is necessary to exploit other resources to train statistical machine translation systems. Compared with parallel corpora, comparable corpora exist in large quantities for every language pair and are easy to obtain; the web, news and magazines provide rich natural resources. These comparable corpora contain many bilingual documents carrying similar information, and how to incorporate this comparable-corpus information into statistical machine translation systems has attracted more and more attention. Researchers have used various methods to extract richer and more accurate parallel knowledge from comparable corpora and to add it to translation systems, improving translation performance.
Extraction of parallel knowledge from comparable corpora is mostly based on the distributional hypothesis. This hypothesis holds that two words or fragments that are translations of each other across languages also have similar, or even identical, contexts. On this basis, researchers map the contexts of unknown source-language and target-language words into a vector space via a bilingual dictionary and then compute the similarity between vectors, for example by cosine distance, Euclidean distance or skew distance, and take the two words or fragments with the most similar contexts to be mutual translations. Many new methods have been derived from this original approach, for example adding topic information, semantic information or transliteration information, and such methods achieve a certain effect. However, by the nature of the hypothesis itself, a parallel corpus has a symmetric structure and satisfies the hypothesis well, whereas a comparable corpus has an asymmetric structure and sometimes cannot satisfy it. Applying the assumption that the two words or fragments with the most similar contexts are mutual translations to comparable corpora therefore raises problems, and such methods depend heavily on a bilingual dictionary, whose seed-dictionary size directly affects the quality of parallel knowledge extraction.
Summary of the invention
The object of the invention is to solve the problems that the parallel data resources of statistical machine translation systems are very scarce or even nonexistent, that acquiring a parallel corpus is expensive, and that applying the assumption that the two words or fragments with the most similar contexts are mutual translations to a comparable corpus depends heavily on a bilingual dictionary; to this end a chapter-level comparable-corpus phrase translation pair extraction method is proposed.
The above object of the invention is achieved through the following technical solution:
Step 1: take a source-language sentence set S and a target-language sentence set T from the corpus, where the corpus comprises a parallel corpus and a comparable corpus;
Step 2: split S and T separately into phrases of the specified lengths, each phrase being 2 to 7 words long, and combine the resulting phrases pairwise to obtain the phrase-pair set of the parallel corpus, where every phrase pair must contain one phrase from S and one phrase from T;
Step 3: use the GIZA++ tool to extract bidirectional word translation tables from the parallel corpus, and use the parallel corpus to build a phrase-based statistical machine translation system with the Moses toolkit to obtain a phrase translation table; extract the training positive examples, namely the parallel phrase pairs of the parallel corpus, from the information in the bidirectional word translation tables and the phrase translation table; in the bidirectional word translation tables every word translation pair is followed by its translation probability, and the phrase translation table contains the bidirectional phrase translation probabilities, the bidirectional lexical weights and the word penalty (five probabilities) together with the word alignment inside each phrase pair;
Step 4: remove the parallel phrase pairs of the parallel corpus obtained in Step 3 from the phrase-pair set of the parallel corpus obtained in Step 2 to obtain the training negative examples, namely the non-parallel phrase pairs of the parallel corpus;
Step 5: extract classification features from the parallel phrase pairs of the parallel corpus and from the non-parallel phrase pairs of the parallel corpus; input the classification features into the SVMlight system and use the radial basis function kernel to obtain a binary support vector machine classifier;
Step 6: combine the sentences of the source-language articles of the comparable corpus with the sentences of the target language of the comparable corpus and filter them to obtain pseudo-parallel sentence pairs <S, T>; extract candidate parallel phrase pairs <s, t> from the pseudo-parallel sentence pairs, where s is a substring of sentence S of length i, minimum source-language phrase length ≤ i ≤ maximum source-language phrase length, and t is a substring of sentence T of length j, minimum target-language phrase length ≤ j ≤ maximum target-language phrase length;
Step 7: classify the candidate parallel phrase pairs <s, t> with the binary support vector machine classifier to obtain the noisy parallel phrase pairs of the comparable corpus;
Step 8: filter the noisy parallel phrase pairs of the comparable corpus by setting a threshold θ, θ ∈ (0, 1), and removing every phrase pair whose average word translation probability logarithm is lower than θ, thereby obtaining the parallel phrase pairs of the comparable corpus;
Step 9: add the parallel phrase pairs of the comparable corpus to the phrase table of the baseline decoder to obtain the extended decoder, where the baseline decoder is evaluated by a baseline BLEU score and the extended decoder by an extended BLEU score; this completes the method, trained on a parallel corpus, for extracting parallel phrase pairs from a chapter-level comparable corpus.
Invention effect
The object of the invention is to mine parallel phrases from comparable corpora and thus alleviate the scarcity of parallel data. The aim is to make full use of abundant comparable-corpus resources, obtain parallel phrases from them, and use them to improve the performance of phrase-based statistical machine translation systems.
The present invention converts the problem of extracting parallel phrases from a comparable corpus into a binary classification problem. Useful feature information is extracted from the training data to build a binary support vector machine classifier, and the classifier is used to separate parallel phrases from non-parallel phrases; the parallel phrases that the system finally extracts from the comparable corpus are added to the translation system to improve machine translation quality. This is a fully automatic generation and testing method.
The binary classifier is built in two parts, data acquisition and training:
In the training-data acquisition stage, given parallel source- and target-language sentences S and T, S and T are each split by the specified lengths to generate all possible phrases, which are then paired, each phrase pair containing one phrase from S and one phrase from T; the parallel-data information obtained from S and T with the GIZA++ tool is used to label the training phrases as positive or negative examples.
In the training stage, the parallel-data information is used to extract nineteen features from the training data as classification features. Because this is a nonlinear classification problem, the radial basis function kernel is applied to the support vector machine classifier. The training phrases obtained from the parallel corpus can thus be used to build the support vector machine classifier.
The performance of the invention is evaluated from two aspects, classifier performance and translation system performance:
The classification performance of the classifier is evaluated with standard measures, including precision, recall and accuracy. Test phrases are generated in the same way as the training phrases, but to ensure a fair test the parallel-data information used to label positive and negative examples must be consistent with that used to generate the training phrases.
The significance of the present invention is to obtain parallel phrases from comparable corpora so as to improve machine translation performance, so it must be tested whether the parallel phrases obtained by classification from the comparable corpus actually improve the machine translation system, evaluated according to a translation quality criterion. First a baseline decoder is trained with the existing small amount of parallel corpus; then the parallel phrases extracted from the comparable corpus by the classifier are added to the phrase table of the baseline system, an extended decoder is retrained, and the translation quality of the two decoders is evaluated separately (a minimal evaluation sketch follows).
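The patent reports its evaluation as BLEU scores but does not tie it to any particular scoring implementation. The fragment below is only a minimal sketch of comparing the two decoders with the sacrebleu package; the file names and the one-sentence-per-line layout are assumptions made for illustration.

    # Minimal sketch: compare the baseline and the extended decoder by corpus BLEU.
    # File names are hypothetical; one hypothesis/reference per line is assumed.
    import sacrebleu

    def read_lines(path):
        with open(path, encoding="utf-8") as f:
            return [line.strip() for line in f]

    references = read_lines("test.ref")          # human reference translations
    baseline_out = read_lines("baseline.hyp")    # output of the baseline decoder
    extended_out = read_lines("extended.hyp")    # output of the extended decoder

    baseline_bleu = sacrebleu.corpus_bleu(baseline_out, [references]).score
    extended_bleu = sacrebleu.corpus_bleu(extended_out, [references]).score
    print(f"baseline BLEU: {baseline_bleu:.2f}, extended BLEU: {extended_bleu:.2f}")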
The experimental results, namely the baseline BLEU score and the extended BLEU score, are shown in Table 3:
As can be seen from Table 3, the present invention can classify parallel and non-parallel phrases well; when the parallel phrases extracted from the comparable corpus with the method of the present invention are added to the translation system, the translation results are closer in meaning to the human reference translations.
Brief description of the drawings
Fig. 1 is a flow chart of the method, trained on a parallel corpus, for extracting parallel phrase pairs from a chapter-level comparable corpus.
Embodiment
Embodiment 1: the method of this embodiment, trained on a parallel corpus, for extracting parallel phrase pairs from a chapter-level comparable corpus is carried out according to the following steps:
Step 1: take a source-language sentence set S and a target-language sentence set T from the corpus, where the corpus comprises a parallel corpus and a comparable corpus;
Step 2: split S and T separately into phrases of the specified lengths, each phrase being 2 to 7 words long, and combine the resulting phrases pairwise to obtain all phrase pairs of the parallel corpus, where every phrase pair must contain one phrase from S and one phrase from T;
Step 3: use the GIZA++ tool to extract bidirectional word translation tables from the parallel corpus, and use the parallel corpus to build a phrase-based statistical machine translation system with the Moses toolkit to obtain a phrase translation table, most of whose phrase pairs are parallel phrase pairs; the training positive examples (the labelling of positive examples), namely the parallel phrase pairs of the parallel corpus, are obtained from the information in the bidirectional word translation tables and the phrase translation table; in the bidirectional word translation tables every word translation pair is followed by its translation probability, and by the normalization principle the probabilities of all possible translations of a given word sum to 1; the phrase translation table contains the bidirectional phrase translation probabilities, the bidirectional lexical weights and the word penalty (five probabilities) together with the word alignment inside each phrase pair (a minimal sketch of loading such word translation tables follows);
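The exact layout of GIZA++/Moses lexical tables varies between toolchains, so the sketch below simply assumes a whitespace-separated file with one "source-word target-word probability" triple per line and builds the two direction-specific dictionaries used in the later steps; the file names are placeholders.

    # Minimal sketch: load a bidirectional word translation table.
    # Assumed line format: source_word target_word probability
    from collections import defaultdict

    def load_lexical_table(path):
        table = defaultdict(dict)
        with open(path, encoding="utf-8") as f:
            for line in f:
                src, tgt, prob = line.split()
                table[src][tgt] = float(prob)   # translation probability for this direction
        return dict(table)

    s2t = load_lexical_table("lex.s2t")   # source-to-target table (hypothetical file name)
    t2s = load_lexical_table("lex.t2s")   # target-to-source table (hypothetical file name)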
Step 4: remove the parallel phrase pairs of the parallel corpus obtained in Step 3 from all phrase pairs of the parallel corpus obtained in Step 2 to obtain the training negative examples, namely the non-parallel phrase pairs of the parallel corpus;
Two issues should be noted while building the training data set:
(1) A training example may appear many times while positive and negative examples are being extracted, so duplicates are removed during extraction to ensure that every training example is unique;
(2) The training set obtained from the parallel corpus may be very large. If the training data used to train the classifier is too large it can cause overfitting, which severely degrades classifier performance, so the training data must be sampled and an appropriate number chosen as the final training examples. Provided the positive- and negative-example sets are of good quality, random sampling can be used, and suitable manual error checking can of course also be carried out (a small de-duplication and sampling sketch follows);
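As a small illustration of the two notes above, the sketch below de-duplicates extracted positive and negative phrase pairs and randomly samples a fixed number of each for classifier training; the pair data and the sample sizes are placeholders, not values from the patent.

    # Minimal sketch: de-duplicate training examples and sample a manageable subset.
    import random

    def dedup_and_sample(pairs, sample_size, seed=0):
        unique_pairs = list(set(pairs))      # keep each (source phrase, target phrase) once
        random.seed(seed)
        if len(unique_pairs) <= sample_size:
            return unique_pairs
        return random.sample(unique_pairs, sample_size)

    positive_pairs = [("平行 语料", "parallel corpus"), ("平行 语料", "parallel corpus")]
    negative_pairs = [("平行 语料", "the cat"), ("机器 翻译", "parallel corpus")]
    positives = dedup_and_sample(positive_pairs, 50000)   # sample sizes are illustrative only
    negatives = dedup_and_sample(negative_pairs, 50000)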
Step 5: extract classification features from the parallel phrase pairs of the parallel corpus and from the non-parallel phrase pairs of the parallel corpus; input the classification features into the SVMlight system and use the radial basis function kernel to obtain a binary support vector machine classifier (a minimal training sketch follows);
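The patent trains the classifier with SVMlight and a radial basis function kernel; as an illustration only, the sketch below uses scikit-learn's SVC as a stand-in for SVMlight, with a toy feature matrix in place of the real nineteen-dimensional feature vectors.

    # Minimal sketch: train a binary SVM with an RBF kernel (scikit-learn standing in for SVMlight).
    from sklearn.svm import SVC

    # Each row is a toy 3-dimensional feature vector for one phrase pair;
    # label 1 = parallel phrase pair (positive example), 0 = non-parallel (negative example).
    X = [[0, 1, 0.8], [1, 1, 0.9], [5, 0, 0.1], [4, 0, 0.2]]
    y = [1, 1, 0, 0]

    classifier = SVC(kernel="rbf", gamma="scale")
    classifier.fit(X, y)
    print(classifier.predict([[0, 1, 0.7]]))   # expected: class 1 (parallel)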
Step 6: the present invention converts the problem of extracting parallel phrase pairs from a comparable corpus into a binary classification problem. Before extracting parallel phrase pairs, the sentences of the source-language articles of the comparable corpus are first combined with the sentences of the target language of the comparable corpus and filtered to obtain pseudo-parallel sentence pairs <S, T>; candidate parallel phrase pairs <s, t> are then extracted from the pseudo-parallel sentence pairs, where s is a substring of sentence S of length i, minimum source-language phrase length ≤ i ≤ maximum source-language phrase length, and t is a substring of sentence T of length j, minimum target-language phrase length ≤ j ≤ maximum target-language phrase length; in this way all candidate phrase pairs are obtained (a minimal enumeration sketch follows);
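A minimal sketch of the enumeration described in Step 6, assuming whitespace tokenization and the 2-to-7-word phrase lengths used elsewhere in the patent; real pseudo-parallel sentence pairs would come from the filtering step.

    # Minimal sketch: enumerate candidate phrase pairs <s, t> from one pseudo-parallel sentence pair.
    MIN_LEN, MAX_LEN = 2, 7   # minimum and maximum phrase length in words

    def substrings(words, min_len=MIN_LEN, max_len=MAX_LEN):
        for start in range(len(words)):
            for length in range(min_len, max_len + 1):
                if start + length <= len(words):
                    yield tuple(words[start:start + length])

    def candidate_phrase_pairs(source_sentence, target_sentence):
        source_words = source_sentence.split()
        target_words = target_sentence.split()
        return [(s, t) for s in substrings(source_words) for t in substrings(target_words)]

    pairs = candidate_phrase_pairs("我们 使用 平行 语料 训练 分类器",
                                   "we train the classifier with a parallel corpus")
    print(len(pairs))   # number of candidate phrase pairs for this sentence pair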
Step 7: classify the candidate parallel phrase pairs <s, t> with the binary support vector machine classifier to obtain the noisy parallel phrase pairs of the comparable corpus; left untreated, this noise would degrade translation system performance;
Step 8: filter the noisy parallel phrase pairs of the comparable corpus; a threshold θ, θ ∈ (0, 1), is set empirically according to the actual situation, and every phrase pair whose average word translation probability logarithm is lower than θ is removed, giving the parallel phrase pairs of the comparable corpus;
Step 9: add the parallel phrase pairs of the comparable corpus to the phrase table of the baseline decoder, i.e. the phrase-based statistical machine translation system, to obtain the extended decoder; the baseline decoder is evaluated by its baseline BLEU score and the extended decoder by its extended BLEU score, as shown in Fig. 1 (a minimal phrase-table extension sketch follows). This completes the method, trained on a parallel corpus, for extracting parallel phrase pairs from a chapter-level comparable corpus.
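Phrase-table formats differ between Moses versions, so the sketch below only assumes the classic plain-text layout with '|||'-separated fields and five scores per entry, as described in Step 3; the phrase pair, the score values and the file name are placeholders.

    # Minimal sketch: append extracted phrase pairs to a Moses-style phrase table.
    new_pairs = [
        ("平行 语料", "parallel corpus", [0.5, 0.4, 0.5, 0.4, 2.718]),   # placeholder scores
    ]

    with open("phrase-table.extended", "a", encoding="utf-8") as table:   # hypothetical file name
        for source, target, scores in new_pairs:
            score_field = " ".join(f"{s:.6f}" for s in scores)
            table.write(f"{source} ||| {target} ||| {score_field}\n")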
Effect of this embodiment:
This embodiment converts the problem of extracting parallel phrases from a comparable corpus into a binary classification problem. Useful feature information is extracted from the training data to build a binary support vector machine classifier, and the classifier is used to separate parallel phrases from non-parallel phrases; the parallel phrases that the system finally extracts from the comparable corpus are added to the translation system to improve machine translation quality. This is a fully automatic generation and testing method.
The binary classifier is built in two parts, data acquisition and training:
In the training-data acquisition stage, given parallel source- and target-language sentences S and T, S and T are each split by the specified lengths to generate all possible phrases, which are then paired, each phrase pair containing one phrase from S and one phrase from T; the parallel-data information obtained from S and T with the GIZA++ tool is used to label the training phrases as positive or negative examples.
In the training stage, the parallel-data information is used to extract nineteen features from the training data as classification features. Because this is a nonlinear classification problem, the radial basis function kernel is applied to the support vector machine classifier. The training phrases obtained from the parallel corpus can thus be used to build the support vector machine classifier.
The performance of this embodiment is evaluated from two aspects, classifier performance and translation system performance:
The classification performance of the classifier is evaluated with standard measures, including precision, recall and accuracy. Test phrases are generated in the same way as the training phrases, but to ensure a fair test the parallel-data information used to label positive and negative examples must be consistent with that used to generate the training phrases.
The significance of this embodiment is to obtain parallel phrases from the comparable corpus so as to improve machine translation performance, so it must be tested whether the parallel phrases obtained by classification from the comparable corpus actually improve the machine translation system, evaluated according to a translation quality criterion. First a baseline decoder is trained with the existing small amount of parallel corpus; then the parallel phrases extracted from the comparable corpus by the classifier are added to the phrase table of the baseline system, an extended decoder is retrained, and the translation quality of the two decoders is evaluated separately.
The experimental results, namely the baseline BLEU score and the extended BLEU score, are shown in Table 3:
As can be seen from Table 3, this embodiment can classify parallel and non-parallel phrases well; when the parallel phrases extracted from the comparable corpus with the method of the present invention are added to the translation system, the translation results are closer in meaning to the human reference translations.
Embodiment 2: this embodiment differs from Embodiment 1 in that the detailed process for extracting the training positive examples (the labelling of positive examples) in Step 3 is:
(1) Let S_k be the word at the k-th position of a sentence in the source-language sentence set S and S(i, j) the word sequence from position i to position j in S, and let T_k' be the word at the k'-th position of a sentence in the target-language sentence set T and T(i', j') the word sequence from position i' to position j' in T; assume a threshold ε, ε ∈ (0, 1);
(2) The threshold is chosen empirically according to the actual situation; if the translation probability of two words in the bidirectional word translation tables is greater than the threshold ε, the two words S_k and T_k' are considered to translate each other;
(3) If and only if S_k and T_k' translate each other, i.e. are aligned, only when k ∈ [i, j] and k' ∈ [i', j'], S_k and T_k' do not translate each other, i.e. are not aligned, when k ∈ [i, j] and k' ∉ [i', j'], and S_k and T_k' do not translate each other when k ∉ [i, j] and k' ∈ [i', j'], then S(i, j) and T(i', j') are considered to translate each other and form an extracted training positive example (a minimal sketch of this check follows). The other steps and parameters are identical to Embodiment 1.
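Under the reconstruction above, a pair of spans S(i, j) and T(i', j') is a positive example when every word pair that translates each other (probability above ε in the bidirectional tables) lies either entirely inside or entirely outside the two spans. The sketch below checks that condition for one sentence pair; the s2t and t2s dictionaries are assumed to be the word translation tables of Step 3, the threshold value ε = 0.5 is the one used later in the example, and the toy data is illustrative only.

    # Minimal sketch: test whether spans S(i, j) and T(i', j') form a training positive example.
    EPSILON = 0.5   # alignment threshold epsilon

    def aligned(s_word, t_word, s2t, t2s, eps=EPSILON):
        # Two words are taken as translating each other if either directional probability exceeds eps.
        return (s2t.get(s_word, {}).get(t_word, 0.0) > eps
                or t2s.get(t_word, {}).get(s_word, 0.0) > eps)

    def is_positive_example(src_words, tgt_words, i, j, i2, j2, s2t, t2s):
        for k, s_word in enumerate(src_words):
            for k2, t_word in enumerate(tgt_words):
                if aligned(s_word, t_word, s2t, t2s):
                    inside_src = i <= k <= j
                    inside_tgt = i2 <= k2 <= j2
                    if inside_src != inside_tgt:   # an alignment link crosses a span boundary
                        return False
        return True

    s2t = {"猫": {"cat": 0.9}}
    t2s = {"cat": {"猫": 0.9}}
    print(is_positive_example(["一", "只", "猫"], ["a", "cat"], 2, 2, 1, 1, s2t, t2s))   # True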
Embodiment 3: this embodiment differs from Embodiment 1 or 2 in that the classification features extracted in Step 5 from the parallel phrase pairs of the parallel corpus and from the non-parallel phrase pairs of the parallel corpus are as follows:
(1) Phrase length difference: the absolute value of the difference between the source-language phrase length and the target-language phrase length;
(2) Same beginning: 1 if the first word of the source-language phrase and the first word of the target-language phrase translate each other, otherwise 0;
(3) Same ending: 1 if the last word of the source-language phrase and the last word of the target-language phrase translate each other, otherwise 0;
(4) Number of words in the phrase: the number of words contained in the source-language phrase and in the target-language phrase;
(5) Phrase length ratio: the ratio of the source-language phrase length to the target-language phrase length;
(6) Translated-word count: the number of words in the source-language phrase that have a corresponding translation in the target-language phrase, a word's translation probability p(s|t) being greater than a threshold η;
(7) Untranslated-word count: the number of words in the source-language phrase that have no corresponding translation in the target-language phrase;
(8) Translation ratio: the ratio of the number of words with translations to the total number of words in the phrase;
(9) Half translated: 1 if at least half of the words of the source-language phrase have translations in the target-language phrase, otherwise 0;
(10) Longest translated unit: the length of the longest contiguous word sequence in the source-language phrase whose words have translations in the target-language phrase;
(11) Longest untranslated unit: the length of the longest contiguous word sequence in the source-language phrase whose words have no translation in the target-language phrase;
Features (1) to (3) are independent of the direction between source and target language, while features (4) to (11) are direction-dependent and are computed in both the forward and reverse directions; therefore 19 features are extracted in total (a sketch computing a subset of these features follows). The other steps and parameters are identical to Embodiment 1 or 2.
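As an illustration of these features, the sketch below computes a representative subset for one phrase pair in the source-to-target direction; the full feature vector would repeat features (4) to (11) in the reverse direction. Treating "has a translation" as having some word in the other phrase with probability above a threshold η, and the table layout, are the same assumptions as in Step 3.

    # Minimal sketch: a subset of the classification features for one phrase pair.
    ETA = 0.1   # translation-probability threshold eta (illustrative value)

    def has_translation(word, other_words, table, eta=ETA):
        return any(table.get(word, {}).get(o, 0.0) > eta for o in other_words)

    def features(src_words, tgt_words, s2t):
        translated = [has_translation(w, tgt_words, s2t) for w in src_words]
        longest_run = longest_gap = run = gap = 0
        for t in translated:
            run, gap = (run + 1, 0) if t else (0, gap + 1)
            longest_run, longest_gap = max(longest_run, run), max(longest_gap, gap)
        return {
            "length_difference": abs(len(src_words) - len(tgt_words)),                  # feature (1)
            "same_beginning": int(has_translation(src_words[0], [tgt_words[0]], s2t)),  # feature (2)
            "same_ending": int(has_translation(src_words[-1], [tgt_words[-1]], s2t)),   # feature (3)
            "length_ratio": len(src_words) / len(tgt_words),                            # feature (5)
            "translated_count": sum(translated),                                        # feature (6)
            "untranslated_count": len(translated) - sum(translated),                    # feature (7)
            "translation_ratio": sum(translated) / len(translated),                     # feature (8)
            "half_translated": int(sum(translated) * 2 >= len(translated)),             # feature (9)
            "longest_translated_unit": longest_run,                                     # feature (10)
            "longest_untranslated_unit": longest_gap,                                   # feature (11)
        }

    s2t = {"平行": {"parallel": 0.7}, "语料": {"corpus": 0.6}}
    print(features(["平行", "语料"], ["parallel", "corpus"], s2t))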
Embodiment 4: this embodiment differs from one of Embodiments 1 to 3 in that in Step 6 the sentences of the source-language articles of the comparable corpus are combined with the sentences of the target language of the comparable corpus and filtered to obtain pseudo-parallel sentence pairs:
(1) the ratio of the word counts of the two sentences does not exceed 2;
(2) a dictionary is used to check that at least half of the words in one sentence have a translation in the other sentence;
A sentence pair is discarded if it fails either condition; sentence pairs meeting both conditions are regarded as pseudo-parallel sentence pairs. This process removes most non-parallel sentence pairs, but it also removes some approximately parallel pairs that do not satisfy the two filtering conditions, mainly because the dictionary does not contain every entry; such pairs are few and not necessarily reliable, so on the whole this filtering greatly helps the precision and robustness of the system. Inevitably the filter cannot remove all non-parallel sentence pairs, because the word-overlap condition is weak: stop words, for example, almost always have translations in the other language, and if they happen to match some content words and the overlap threshold is met, a non-parallel sentence pair may be mistaken for a parallel one (a minimal filtering sketch follows). The other steps and parameters are identical to one of Embodiments 1 to 3.
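A minimal sketch of the two filtering conditions, assuming whitespace tokenization and a simple source-to-target dictionary that maps each word to a set of possible translations; a real dictionary and tokenizer would of course be richer.

    # Minimal sketch: keep a sentence pair only if it passes both pseudo-parallel conditions.
    def is_pseudo_parallel(src_sentence, tgt_sentence, dictionary):
        src_words, tgt_words = src_sentence.split(), tgt_sentence.split()

        # Condition (1): the word-count ratio of the two sentences does not exceed 2.
        longer = max(len(src_words), len(tgt_words))
        shorter = min(len(src_words), len(tgt_words))
        if shorter == 0 or longer / shorter > 2:
            return False

        # Condition (2): at least half of the source words have a dictionary translation in the target sentence.
        tgt_set = set(tgt_words)
        covered = sum(1 for w in src_words if dictionary.get(w, set()) & tgt_set)
        return covered * 2 >= len(src_words)

    dictionary = {"平行": {"parallel"}, "语料": {"corpus", "corpora"}}
    print(is_pseudo_parallel("平行 语料", "a parallel corpus", dictionary))   # True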
Embodiment 5: this embodiment differs from one of Embodiments 1 to 4 in that the formula in Step 8 for the average of the logarithms of the word translation probabilities of each noisy parallel phrase pair in the comparable corpus is as follows:
phrasepri = (ln S_1 + ln S_2 + ... + ln S_n) / n
where S_i is the translation probability of the i-th word of the source phrase that has a translation in the target-language phrase, and n is the number of words of the source phrase that have translations in the target-language phrase; the translation probabilities of stop words are not included, because stop words contribute very little to the translation and are simply ignored (a minimal sketch of this filter follows). The other steps and parameters are identical to one of Embodiments 1 to 4.
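A minimal sketch of the phrasepri filter, assuming the word translation table of Step 3, a tiny illustrative stop-word list, and the value θ = 0.3 used later in the example. Because the average of logarithms is never positive, the sketch interprets the comparison with θ ∈ (0, 1) as a comparison against ln θ, i.e. it keeps a pair when the geometric mean of the word translation probabilities is at least θ; this interpretation is an assumption, not something stated in the patent.

    # Minimal sketch: filter noisy phrase pairs by the average log word translation probability.
    import math

    STOP_WORDS = {"the", "a", "an", "of", "的", "了"}   # illustrative stop-word list
    THETA = 0.3                                          # threshold theta from the example

    def phrasepri(src_words, tgt_words, s2t):
        # Average of ln(p) over source words with a translation in the target phrase; stop words skipped.
        probs = []
        for word in src_words:
            if word in STOP_WORDS:
                continue
            candidates = [p for t, p in s2t.get(word, {}).items() if t in tgt_words]
            if candidates:
                probs.append(max(candidates))
        if not probs:
            return float("-inf")
        return sum(math.log(p) for p in probs) / len(probs)

    def keep_pair(src_words, tgt_words, s2t, theta=THETA):
        return phrasepri(src_words, tgt_words, s2t) >= math.log(theta)

    s2t = {"平行": {"parallel": 0.7}, "语料": {"corpus": 0.6}}
    print(keep_pair(["平行", "语料"], ["parallel", "corpus"], s2t))   # True: geometric mean ~0.65 >= 0.3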
The following example is used to verify the beneficial effect of the present invention:
Example 1:
The method of this example, trained on a parallel corpus, for extracting parallel phrase pairs from a chapter-level comparable corpus is carried out according to the following steps:
Step 1: take a source-language sentence set S and a target-language sentence set T from the corpus, where the corpus comprises a parallel corpus and a comparable corpus;
Step 2: split S and T separately into phrases of the specified lengths, each phrase being 2 to 7 words long, and combine the resulting phrases pairwise to obtain all phrase pairs of the parallel corpus, where every phrase pair must contain one phrase from S and one phrase from T;
Step 3: use the GIZA++ tool to extract bidirectional word translation tables from the parallel corpus, and use the parallel corpus to build a phrase-based statistical machine translation system with the Moses toolkit to obtain a phrase translation table, most of whose phrase pairs are parallel phrase pairs; the training positive examples (the labelling of positive examples), namely the parallel phrase pairs of the parallel corpus, are obtained from the information in the bidirectional word translation tables and the phrase translation table; in the bidirectional word translation tables every word translation pair is followed by its translation probability, and by the normalization principle the probabilities of all possible translations of a given word sum to 1; the phrase translation table contains the bidirectional phrase translation probabilities, the bidirectional lexical weights and the word penalty (five probabilities) together with the word alignment inside each phrase pair;
(1) Let S_k be the word at the k-th position of a sentence in the source-language sentence set S and S(i, j) the word sequence from position i to position j in S, and let T_k' be the word at the k'-th position of a sentence in the target-language sentence set T and T(i', j') the word sequence from position i' to position j' in T; assume the threshold ε = 0.5, ε ∈ (0, 1);
(2) The threshold is chosen empirically according to the actual situation; if the translation probability of two words in the bidirectional word translation tables is greater than the threshold ε, the two words S_k and T_k' are considered to translate each other;
(3) If and only if S_k and T_k' translate each other, i.e. are aligned, only when k ∈ [i, j] and k' ∈ [i', j'], S_k and T_k' do not translate each other, i.e. are not aligned, when k ∈ [i, j] and k' ∉ [i', j'], and S_k and T_k' do not translate each other when k ∉ [i, j] and k' ∈ [i', j'], then S(i, j) and T(i', j') are considered to translate each other and form an extracted training positive example.
Step 4: remove the parallel phrase pairs of the parallel corpus obtained in Step 3 from all phrase pairs of the parallel corpus obtained in Step 2 to obtain the training negative examples, namely the non-parallel phrase pairs of the parallel corpus;
Two issues should be noted while building the training data set:
(1) A training example may appear many times while positive and negative examples are being extracted, so duplicates are removed during extraction to ensure that every training example is unique.
(2) The training set obtained from the parallel corpus may be very large. If the training data used to train the classifier is too large it can cause overfitting, which severely degrades classifier performance, so the training data must be sampled and an appropriate number chosen as the final training examples. Provided the positive- and negative-example sets are of good quality, random sampling can be used, and suitable manual error checking can of course also be carried out.
Step 5: extract classification features from the parallel phrase pairs of the parallel corpus and from the non-parallel phrase pairs of the parallel corpus; input the classification features into the SVMlight system and use the radial basis function kernel to obtain a binary support vector machine classifier;
The classification features extracted from the parallel phrase pairs of the parallel corpus and the non-parallel phrase pairs of the parallel corpus are as follows:
(1) Phrase length difference: the absolute value of the difference between the source-language phrase length and the target-language phrase length;
(2) Same beginning: 1 if the first word of the source-language phrase and the first word of the target-language phrase translate each other, otherwise 0;
(3) Same ending: 1 if the last word of the source-language phrase and the last word of the target-language phrase translate each other, otherwise 0;
(4) Number of words in the phrase: the number of words contained in the source-language phrase and in the target-language phrase;
(5) Phrase length ratio: the ratio of the source-language phrase length to the target-language phrase length;
(6) Translated-word count: the number of words in the source-language phrase that have a corresponding translation in the target-language phrase, a word's translation probability p(s|t) being greater than a threshold η;
(7) Untranslated-word count: the number of words in the source-language phrase that have no corresponding translation in the target-language phrase;
(8) Translation ratio: the ratio of the number of words with translations to the total number of words in the phrase;
(9) Half translated: 1 if at least half of the words of the source-language phrase have translations in the target-language phrase, otherwise 0;
(10) Longest translated unit: the length of the longest contiguous word sequence in the source-language phrase whose words have translations in the target-language phrase;
(11) Longest untranslated unit: the length of the longest contiguous word sequence in the source-language phrase whose words have no translation in the target-language phrase;
Features (1) to (3) are independent of the direction between source and target language, while features (4) to (11) are direction-dependent and are computed in both the forward and reverse directions; therefore 19 features are extracted in total.
The classification performance of the binary support vector machine classifier is evaluated in three respects: precision, recall and accuracy. Five groups of training data are randomly selected to obtain five different classifiers, which are tested with the same group of test data; the final results are shown in Table 1:
Table 1
Using the same classifier, five groups of test data are randomly selected and tested separately; the final results are shown in Table 2:
Table 2
From the above results it can be judged that the binary support vector machine classification method described in the invention classifies parallel and non-parallel phrase pairs well; testing with different training data and test data shows that the method performs stably on different data sets and achieves good results.
Step 6: the present invention converts the problem of extracting parallel phrase pairs from a comparable corpus into a binary classification problem. Before extracting parallel phrase pairs, the sentences of the source-language articles of the comparable corpus are first combined with the sentences of the target language of the comparable corpus and filtered to obtain pseudo-parallel sentence pairs <S, T>; candidate parallel phrase pairs <s, t> are then extracted from the pseudo-parallel sentence pairs, where s is a substring of sentence S of length i, minimum source-language phrase length ≤ i ≤ maximum source-language phrase length (2 ≤ i ≤ 7), and t is a substring of sentence T of length j, minimum target-language phrase length ≤ j ≤ maximum target-language phrase length (2 ≤ j ≤ 7); in this way all candidate phrase pairs are obtained.
Pseudo-parallel sentence pairs are obtained by filtering:
(1) the ratio of the word counts of the two sentences does not exceed 2;
(2) a dictionary is used to check that at least half of the words in one sentence have a translation in the other sentence;
A sentence pair is discarded if it fails either condition; sentence pairs meeting both conditions are regarded as pseudo-parallel sentence pairs. This process removes most non-parallel sentence pairs, but it also removes some approximately parallel pairs that do not satisfy the two filtering conditions, mainly because the dictionary does not contain every entry; such pairs are few and not necessarily reliable, so on the whole this filtering greatly helps the precision and robustness of the system. Inevitably the filter cannot remove all non-parallel sentence pairs, because the word-overlap condition is weak: stop words, for example, almost always have translations in the other language, and if they happen to match some content words and the overlap threshold is met, a non-parallel sentence pair may be mistaken for a parallel one.
Step 7: classify the candidate parallel phrase pairs <s, t> with the binary support vector machine classifier to obtain the noisy parallel phrase pairs of the comparable corpus; left untreated, this noise would degrade translation system performance;
Step 8: filter the noisy parallel phrase pairs of the comparable corpus; the threshold θ is set empirically according to the actual situation, here θ = 0.3, and every phrase pair whose average word translation probability logarithm is lower than θ is removed, giving the parallel phrase pairs of the comparable corpus;
The formula for the average of the logarithms of the word translation probabilities of each noisy parallel phrase pair in the comparable corpus is as follows:
phrasepri = (ln S_1 + ln S_2 + ... + ln S_n) / n
where S_i is the translation probability of the i-th word of the source phrase that has a translation in the target-language phrase, and n is the number of words of the source phrase that have translations in the target-language phrase; the translation probabilities of stop words are not included, because stop words contribute very little to the translation and are simply ignored.
Step 9: add the parallel phrase pairs of the comparable corpus to the phrase table of the baseline decoder, i.e. the phrase-based statistical machine translation system, to obtain the extended decoder; the baseline decoder is evaluated by its baseline BLEU score and the extended decoder by its extended BLEU score, which are shown in Table 3:
Table 3
From the above results it can be judged that the parallel phrases extracted with the binary classification method of the present invention are of high quality; adding the parallel phrase pairs extracted from the comparable corpus to the translation system improves translation performance and brings the results closer to human translations, and as the number of parallel phrase pairs increases the translation results become better and better. The experimental results show that the present invention classifies parallel and non-parallel phrase pairs well; the parallel phrase pairs extracted from the comparable corpus with the method of the present invention, once added to the translation system, yield translation results whose meaning is closer to human translations. This completes the method, trained on a parallel corpus, for extracting parallel phrase pairs from a chapter-level comparable corpus.
The present invention may also have various other embodiments. Without departing from the spirit and essence of the present invention, those skilled in the art can make various corresponding changes and modifications according to the present invention, and all such corresponding changes and modifications shall fall within the protection scope of the claims appended to the present invention.

Claims (5)

1. A method, trained on a parallel corpus, for extracting parallel phrase pairs from a chapter-level comparable corpus, characterized in that the chapter-level comparable-corpus phrase translation pair extraction method is carried out according to the following steps:
Step 1: take a source-language sentence set S and a target-language sentence set T from the corpus, where the corpus comprises a parallel corpus and a comparable corpus;
Step 2: split S and T separately into phrases of the specified lengths, each phrase being 2 to 7 words long, and combine the resulting phrases pairwise to obtain the phrase-pair set of the parallel corpus, where every phrase pair must contain one phrase from S and one phrase from T;
Step 3: use the GIZA++ tool to extract bidirectional word translation tables from the parallel corpus, and use the parallel corpus to build a phrase-based statistical machine translation system with the Moses toolkit to obtain a phrase translation table; extract the training positive examples, namely the parallel phrase pairs of the parallel corpus, from the information in the bidirectional word translation tables and the phrase translation table; in the bidirectional word translation tables every word translation pair is followed by its translation probability, and the phrase translation table contains the bidirectional phrase translation probabilities, the bidirectional lexical weights and the word penalty (five probabilities) together with the word alignment inside each phrase pair;
Step 4: remove the parallel phrase pairs of the parallel corpus obtained in Step 3 from the phrase-pair set of the parallel corpus obtained in Step 2 to obtain the training negative examples, namely the non-parallel phrase pairs of the parallel corpus;
Step 5: extract classification features from the parallel phrase pairs of the parallel corpus and from the non-parallel phrase pairs of the parallel corpus; input the classification features into the SVMlight system and use the radial basis function kernel to obtain a binary support vector machine classifier;
Step 6: combine the sentences of the source-language articles of the comparable corpus with the sentences of the target language of the comparable corpus and filter them to obtain pseudo-parallel sentence pairs <S, T>; extract candidate parallel phrase pairs <s, t> from the pseudo-parallel sentence pairs, where s is a substring of sentence S of length i, minimum source-language phrase length ≤ i ≤ maximum source-language phrase length, and t is a substring of sentence T of length j, minimum target-language phrase length ≤ j ≤ maximum target-language phrase length;
Step 7: classify the candidate parallel phrase pairs <s, t> with the binary support vector machine classifier to obtain the noisy parallel phrase pairs of the comparable corpus;
Step 8: filter the noisy parallel phrase pairs of the comparable corpus by setting a threshold θ, θ ∈ (0, 1), and removing every phrase pair whose average word translation probability logarithm is lower than θ, thereby obtaining the parallel phrase pairs of the comparable corpus;
Step 9: add the parallel phrase pairs of the comparable corpus to the phrase table of the baseline decoder to obtain the extended decoder, where the baseline decoder is evaluated by a baseline BLEU score and the extended decoder by an extended BLEU score; this completes the method, trained on a parallel corpus, for extracting parallel phrase pairs from a chapter-level comparable corpus.
2. The method for extracting parallel phrase pairs from a chapter-level comparable corpus based on parallel-corpus training according to claim 1, characterized in that the detailed process for extracting the training positive examples in Step 3 is:
(1) let S_k be the word at the k-th position of a sentence in the source-language sentence set S and S(i, j) the word sequence from position i to position j in S, and let T_k' be the word at the k'-th position of a sentence in the target-language sentence set T and T(i', j') the word sequence from position i' to position j' in T; assume a threshold ε, ε ∈ (0, 1);
(2) if the translation probability of two words in the bidirectional word translation tables is greater than the threshold ε, the two words S_k and T_k' are considered to translate each other;
(3) if and only if S_k and T_k' translate each other only when k ∈ [i, j] and k' ∈ [i', j'], S_k and T_k' do not translate each other when k ∈ [i, j] and k' ∉ [i', j'], and S_k and T_k' do not translate each other when k ∉ [i, j] and k' ∈ [i', j'], then S(i, j) and T(i', j') are considered to translate each other and form an extracted training positive example.
3. The method for extracting parallel phrase pairs from a chapter-level comparable corpus based on parallel-corpus training according to claim 1, characterized in that the classification features extracted in Step 5 from the parallel phrase pairs of the parallel corpus and from the non-parallel phrase pairs of the parallel corpus are as follows:
(1) Phrase length difference: the absolute value of the difference between the source-language phrase length and the target-language phrase length;
(2) Same beginning: 1 if the first word of the source-language phrase and the first word of the target-language phrase translate each other, otherwise 0;
(3) Same ending: 1 if the last word of the source-language phrase and the last word of the target-language phrase translate each other, otherwise 0;
(4) Number of words in the phrase: the number of words contained in the source-language phrase and in the target-language phrase;
(5) Phrase length ratio: the ratio of the source-language phrase length to the target-language phrase length;
(6) Translated-word count: the number of words in the source-language phrase that have a corresponding translation in the target-language phrase, a word's translation probability p(s|t) being greater than a threshold η;
(7) Untranslated-word count: the number of words in the source-language phrase that have no corresponding translation in the target-language phrase;
(8) Translation ratio: the ratio of the number of words with translations to the total number of words in the phrase;
(9) Half translated: 1 if at least half of the words of the source-language phrase have translations in the target-language phrase, otherwise 0;
(10) Longest translated unit: the length of the longest contiguous word sequence in the source-language phrase whose words have translations in the target-language phrase;
(11) Longest untranslated unit: the length of the longest contiguous word sequence in the source-language phrase whose words have no translation in the target-language phrase.
4. The method for extracting parallel phrase pairs from a chapter-level comparable corpus based on parallel-corpus training according to claim 1, characterized in that in Step 6 the sentences of the source-language articles of the comparable corpus are combined with the sentences of the target language of the comparable corpus and filtered to obtain pseudo-parallel sentence pairs:
(1) the ratio of the word counts of the two sentences does not exceed 2;
(2) a dictionary is used to check that at least half of the words in one sentence have a translation in the other sentence; sentence pairs meeting these two conditions are regarded as pseudo-parallel sentence pairs.
5. The method for extracting parallel phrase pairs from a chapter-level comparable corpus based on parallel-corpus training according to claim 1, characterized in that in Step 8 the threshold is θ, θ ∈ (0, 1), and the formula for the average of the logarithms of the word translation probabilities of each noisy parallel phrase pair in the comparable corpus is as follows:
phrasepri = (ln S_1 + ln S_2 + ... + ln S_n) / n
where S_i is the translation probability of the i-th word of the source phrase that has a translation in the target-language phrase, and n is the number of words of the source phrase that have translations in the target-language phrase.
CN201410624648.6A 2014-11-07 2014-11-07 A kind of abstracting method of the chapter level than the parallel phrase pair of language material trained based on parallel corpora Active CN104391885B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410624648.6A CN104391885B (en) 2014-11-07 2014-11-07 A kind of abstracting method of the chapter level than the parallel phrase pair of language material trained based on parallel corpora

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410624648.6A CN104391885B (en) 2014-11-07 2014-11-07 A kind of abstracting method of the chapter level than the parallel phrase pair of language material trained based on parallel corpora

Publications (2)

Publication Number Publication Date
CN104391885A true CN104391885A (en) 2015-03-04
CN104391885B CN104391885B (en) 2017-07-28

Family

ID=52609789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410624648.6A Active CN104391885B (en) 2014-11-07 2014-11-07 A kind of abstracting method of the chapter level than the parallel phrase pair of language material trained based on parallel corpora

Country Status (1)

Country Link
CN (1) CN104391885B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101989287A (en) * 2009-07-31 2011-03-23 富士通株式会社 Method and equipment for generating rule for statistics-based machine translation
KR20120060666A (en) * 2010-12-02 2012-06-12 에스케이플래닛 주식회사 Apparatus and method for extracting noun-phrase translation pairs of statistical machine translation
US8874433B2 (en) * 2011-05-20 2014-10-28 Microsoft Corporation Syntax-based augmentation of statistical machine translation phrase tables
CN102968411A (en) * 2012-10-24 2013-03-13 橙译中科信息技术(北京)有限公司 Multi-language machine intelligent auxiliary processing method and system
CN102999486A (en) * 2012-11-16 2013-03-27 沈阳雅译网络技术有限公司 Phrase rule extracting method based on combination

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017004179A (en) * 2015-06-08 2017-01-05 日本電信電話株式会社 Information processing method, device, and program
CN105260483A (en) * 2015-11-16 2016-01-20 金陵科技学院 Microblog-text-oriented cross-language topic detection device and method
CN106021224A (en) * 2016-05-13 2016-10-12 中国科学院自动化研究所 Bilingual discourse annotation method
CN106021224B (en) * 2016-05-13 2019-03-15 中国科学院自动化研究所 A kind of bilingual chapter mask method
CN107491441A (en) * 2016-06-13 2017-12-19 沈阳雅译网络技术有限公司 A kind of method based on the dynamic extraction translation template for forcing decoding
CN107491441B (en) * 2016-06-13 2020-07-17 沈阳雅译网络技术有限公司 Method for dynamically extracting translation template based on forced decoding
CN107610693A (en) * 2016-07-11 2018-01-19 科大讯飞股份有限公司 The construction method and device of text corpus
CN107610693B (en) * 2016-07-11 2021-01-29 科大讯飞股份有限公司 Text corpus construction method and device
CN106502997A (en) * 2016-10-08 2017-03-15 新译信息科技(深圳)有限公司 The appraisal procedure and system of phrase table filter efficiency
CN108153835A (en) * 2017-12-14 2018-06-12 新疆大学 A kind of dimension-Chinese is than language material automatic obtaining method
CN109783809B (en) * 2018-12-22 2022-04-12 昆明理工大学 Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus
CN109783809A (en) * 2018-12-22 2019-05-21 昆明理工大学 A method of alignment sentence is extracted from Laos-Chinese chapter grade alignment corpus
CN110309516A (en) * 2019-05-30 2019-10-08 清华大学 Training method, device and the electronic equipment of Machine Translation Model
CN110362820A (en) * 2019-06-17 2019-10-22 昆明理工大学 A kind of bilingual parallel sentence extraction method of old man based on Bi-LSTM algorithm
CN110362820B (en) * 2019-06-17 2022-11-01 昆明理工大学 Bi-LSTM algorithm-based method for extracting bilingual parallel sentences in old and Chinese
CN110489624A (en) * 2019-07-12 2019-11-22 昆明理工大学 The method that the pseudo- parallel sentence pairs of the Chinese based on sentence characteristics vector extract
CN110489624B (en) * 2019-07-12 2022-07-19 昆明理工大学 Method for extracting Hanyue pseudo parallel sentence pair based on sentence characteristic vector
JP2022511139A (en) * 2019-10-25 2022-01-31 北京小米智能科技有限公司 Information processing methods, devices and storage media
US11461561B2 (en) 2019-10-25 2022-10-04 Beijing Xiaomi Intelligent Technology Co., Ltd. Method and device for information processing, and storage medium
JP7208968B2 (en) 2019-10-25 2023-01-19 北京小米智能科技有限公司 Information processing method, device and storage medium
CN110852117A (en) * 2019-11-08 2020-02-28 沈阳雅译网络技术有限公司 Effective data enhancement method for improving translation effect of neural machine
CN110852117B (en) * 2019-11-08 2023-02-24 沈阳雅译网络技术有限公司 Effective data enhancement method for improving translation effect of neural machine
CN113806496A (en) * 2021-11-19 2021-12-17 航天宏康智能科技(北京)有限公司 Method and device for extracting entity from text sequence

Also Published As

Publication number Publication date
CN104391885B (en) 2017-07-28

Similar Documents

Publication Publication Date Title
CN104391885A (en) Method for extracting chapter-level parallel phrase pair of comparable corpus based on parallel corpus training
CN106547739B (en) A kind of text semantic similarity analysis method
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN105868184A (en) Chinese name recognition method based on recurrent neural network
CN101093478B (en) Method and system for identifying Chinese full name based on Chinese shortened form of entity
CN105808525A (en) Domain concept hypernym-hyponym relation extraction method based on similar concept pairs
Rigouts Terryn et al. Termeval 2020: Shared task on automatic term extraction using the annotated corpora for term extraction research (acter) dataset
CN112541343B (en) Semi-supervised counterstudy cross-language abstract generation method based on word alignment
CN112131872A (en) Document author duplicate name disambiguation method and construction system
CN102779135B (en) Method and device for obtaining cross-linguistic search resources and corresponding search method and device
CN107329960B (en) Unregistered word translating equipment and method in a kind of neural network machine translation of context-sensitive
CN102682000A (en) Text clustering method, question-answering system applying same and search engine applying same
CN104008092A (en) Method and system of relation characterizing, clustering and identifying based on the semanteme of semantic space mapping
Bustamante et al. No data to crawl? monolingual corpus creation from PDF files of truly low-resource languages in Peru
CN109190099B (en) Sentence pattern extraction method and device
CN106055560A (en) Method for collecting data of word segmentation dictionary based on statistical machine learning method
CN104298663A (en) Method for evaluating translation consistency in term field and statistical machine translation method
CN104572634A (en) Method for interactively extracting comparable corpus and bilingual dictionary and device thereof
CN106227836B (en) Unsupervised joint visual concept learning system and unsupervised joint visual concept learning method based on images and characters
CN106202039B (en) Vietnamese portmanteau word disambiguation method based on condition random field
CN104035918A (en) Chinese organization name abbreviation recognition system adopting context feature matching
Wei et al. Cross-modal knowledge distillation in multi-modal fake news detection
Khana et al. Named entity dataset for urdu named entity recognition task
CN108268669A (en) A kind of crucial new word discovery method based on multidimensional words and phrases feature and sentiment analysis
CN101763403A (en) Query translation method facing multi-lingual information retrieval system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200402

Address after: 150001 No. 118 West Dazhi Street, Nangang District, Harbin, Heilongjiang

Patentee after: Harbin University of Technology High-Tech Development Corporation

Address before: 150001 No. 92 West Dazhi Street, Nangang District, Harbin

Patentee before: HARBIN INSTITUTE OF TECHNOLOGY

TR01 Transfer of patent right