CN103955456A

CN103955456A - Sentence length penalty factor-based selection method for sentence rich in information amount

Info

Publication number: CN103955456A
Application number: CN201410168282.6A
Authority: CN
Inventors: 杜金华; 张萌
Original assignee: Xian University of Technology
Current assignee: Xi'an bonny Translation Co., Ltd.
Priority date: 2014-04-23
Filing date: 2014-04-23
Publication date: 2014-07-30

Abstract

The invention discloses a method for selecting information-rich sentences based on a sentence-length penalty factor. The method comprises: step 1, building an initial statistical machine translation system; step 2, constructing the set X of information quantization units and computing the information content; step 3, performing professional translation to obtain a set of parallel sentence pairs; step 4, updating the corpora; step 5, retraining the statistical machine translation system; step 6, iterating the process and evaluating the algorithm. The method effectively computes the information content of monolingual source-language sentences, balances the absolute information content of a selected sentence against its length, selects the most informative sentences accurately, and thereby maximizes the value of manual translation and the utility of limited data.

Description

Method for selecting information-rich sentences based on a sentence-length penalty factor

Technical field

The invention belongs to the technical field of computational linguistics and statistical machine translation, and relates to a method for selecting information-rich sentences based on a sentence-length penalty factor.

Background technology

Machine translation based on statistical or corpus-based methods is, in essence, data-driven translation, so both the scale and the quality of the data have a decisive impact on translation performance. A high-quality statistical machine translation system usually requires a large-scale bilingual parallel corpus, which is currently unavailable for many languages. Many methods already exist to alleviate this problem, such as paraphrasing techniques or the use of comparable corpora. For resource-poor languages, however, data scale remains the bottleneck of statistical machine translation research and one of the key problems still to be solved.

Most of the world's languages are still "low-density" languages: either few people speak them or, even when millions do, digitized parallel corpora remain very scarce. China, for example, has many ethnic minorities, and with economic development, research on and application of minority-language information processing has become an important means of revitalizing regional economies, promoting regional development and facilitating cultural exchange. Against this background, the demand for high-quality statistical machine translation systems for low-density languages is particularly urgent. Broadly, two approaches can meet this demand: (1) building a large-scale bilingual parallel corpus; (2) starting from an existing bilingual parallel corpus of some size, building a large-scale monolingual corpus and using an efficient method to generate bilingual data from it, thereby extending the usefulness of the data.

In practice, building a large-scale bilingual parallel corpus is a systems-engineering effort that requires substantial manpower, material and financial resources; a high-quality corpus in particular takes considerable time to mature. Meanwhile, information technology changes rapidly, and new words and new knowledge emerge constantly. A feasible method that makes effective use of the bilingual parallel corpora and monolingual corpora already built, so as to adapt to new knowledge and improve statistical machine translation quality for resource-poor languages, is therefore especially important and urgent.

The main problem of existing general-purpose algorithms for selecting information-rich sentences is the following. When sentence lengths in the large-scale monolingual corpus vary widely, existing methods tend to select shorter sentences. As a result, the generated bilingual corpus performs poorly both in its coverage of the test set and in the probability estimates of the translation engine's phrase table. A machine translation system trained on the bilingual corpus produced by the selection algorithm then performs worse than one trained on a bilingual corpus produced by random selection, which makes the selection of information-rich sentences pointless.

Summary of the invention

The object of the invention is to provide a method for selecting information-rich sentences based on a sentence-length penalty factor. It solves the problem in the prior art that shorter sentences tend to be selected, so that the generated bilingual corpus performs poorly both in test-set coverage and in the probability estimates of the translation engine's phrase table, which degrades machine translation performance.

The technical solution adopted by the invention is a method for selecting information-rich sentences based on a sentence-length penalty factor, implemented according to the following steps:

Step 1: Build the initial statistical machine translation system

Train a statistical machine translation system on the initial bilingual parallel corpus L = {(f_i, e_i)}, i = 1, ..., N, where L denotes the initial bilingual parallel corpus and (f_i, e_i) denotes the i-th parallel sentence pair in L, i.e. the i-th Chinese sentence f_i and the i-th English sentence e_i;

Step 2: Build the set X of information quantization units and compute the information content

According to the defined information representation unit x, a sentence set U_n is selected from the large-scale monolingual corpus U = {f_j}. The algorithm for selecting information-rich sentences with the sentence-length penalty factor is as follows:

where U denotes the large-scale monolingual corpus and U_n denotes the monolingual subset formed by the selected sentences. BP is the sentence-length penalty factor: whether a penalty is applied depends on the ratio of the average sentence length of the monolingual corpus to the length of the candidate sentence. If the ratio is greater than 1, the sentence is penalized; if it is less than 1, it is not;

X_{s}^{m} is the set of information quantization units available in sentence s, i.e. its set of phrases; P(x | U) and P(x | L) denote the probability of a phrase x in the large-scale monolingual corpus U and in the bilingual parallel corpus L respectively, and are computed as follows:

P(x \mid U) = \frac{\mathrm{Count}(x) + \epsilon}{\sum_{x \in X_{U}^{m}} \mathrm{Count}(x) + \epsilon} \qquad (2)

P(x \mid L) = \frac{\mathrm{Count}(x) + \epsilon}{\sum_{x \in X_{L}^{m}} \mathrm{Count}(x) + \epsilon} \qquad (3)

where ε is the smoothing factor, X_{U}^{m} is the set of possible phrases in the large-scale monolingual corpus U that can be used to compute sentence-level information content, X_{L}^{m} is the corresponding set of possible phrases in the bilingual parallel corpus L, and Count(x) is the number of times phrase x occurs, i.e. its frequency;
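Formulas (2) and (3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the helper names `extract_phrases`, `phrase_counts` and `smoothed_prob` are hypothetical, the superscript m is assumed to be a maximum phrase length (`max_len`), phrases are assumed to be contiguous word n-grams, and ε is read as being added once to the summed counts in the denominator.

```python
from collections import Counter


def extract_phrases(tokens, max_len=3):
    """Return all contiguous word sequences of length 1..max_len (assumed phrase definition)."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(tokens) - n + 1)
    ]


def phrase_counts(corpus, max_len=3):
    """Count phrase occurrences over a corpus given as a list of tokenized sentences."""
    counts = Counter()
    for tokens in corpus:
        counts.update(extract_phrases(tokens, max_len))
    return counts


def smoothed_prob(phrase, counts, epsilon=0.5):
    """Smoothed probability: (Count(x) + eps) / (sum of all counts + eps)."""
    total = sum(counts.values())
    # Counter returns 0 for unseen phrases, so they still get a non-zero probability.
    return (counts[phrase] + epsilon) / (total + epsilon)
```

Applying `phrase_counts` once to U and once to the source side of L gives the two count tables needed for P(x | U) and P(x | L); the smoothing keeps the probability of a phrase unseen in L above zero, so a ratio of the two probabilities stays finite.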

Step 3: Perform professional translation

U_n is handed to human translators for professional translation. The resulting set of translations, together with U_n, forms a set of mutually translated parallel sentence pairs;

Step 4: Update the corpora

The n monolingual sentences U_n selected in the previous step are removed from U, and U_n together with its translations is added to the bilingual parallel corpus L;

Step 5: Retrain the statistical machine translation system

The statistical machine translation system is retrained on the newly formed bilingual corpus, and the resulting translation engine is used to decode the test set again. The translations of the test set are then scored with the BLEU metric, and the improvement in score is compared with that of the random method to measure the performance of the sentence selection algorithm; a higher score indicates better performance;

Step 6: Iterate the process and evaluate the algorithm

Iteration starts again from step 2. Each iteration selects a new set of information-rich sentences, updates the data in U and L, and measures the performance of the sentence selection algorithm.
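The loop formed by steps 2 to 6 can be sketched as below. This is a schematic outline only: `score`, `translate` and `train_and_eval` are hypothetical callables standing in for the sentence scoring function, the human translator, and the retrain-then-score-with-BLEU step; none of these names come from the patent.

```python
def active_learning_loop(L, U, score, translate, train_and_eval, n=200, iterations=25):
    """Iteratively move the n highest-scoring sentences from U into L.

    L: list of (source, target) sentence pairs; U: list of source sentences.
    Returns the updated L, the remaining U, and one evaluation score per iteration.
    """
    L, U = list(L), list(U)  # work on copies so the caller's data is untouched
    history = []
    for _ in range(iterations):
        if not U:
            break
        # Step 2: rank candidate sentences by score, highest first.
        order = sorted(range(len(U)), key=lambda i: score(U[i], U, L), reverse=True)
        chosen = order[:n]
        selected = [U[i] for i in chosen]
        # Step 3: obtain translations for the selected sentences.
        pairs = [(s, translate(s)) for s in selected]
        # Step 4: remove the selected sentences from U and add the pairs to L.
        chosen_set = set(chosen)
        U = [s for i, s in enumerate(U) if i not in chosen_set]
        L.extend(pairs)
        # Step 5: retrain on the enlarged L and record the evaluation score.
        history.append(train_and_eval(L))
    return L, U, history
```

Selecting by index rather than by sentence text keeps duplicate sentences in U distinct, so exactly n entries are removed per iteration.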

The beneficial effects of the invention are as follows. By computing sentence-level information content over the monolingual corpus, the method extracts the most informative sentences for the limited human translation available, obtains high-quality bilingual parallel data, and improves statistical machine translation quality for languages with scarce bilingual resources. It balances the tension between sentence length and information content and extracts sentences that are genuinely information-rich, which both improves test-set coverage and keeps the phrase-table probability estimates of the translation engine accurate, extending the usefulness of existing data resources. For resource-scarce machine translation systems, such as those between Chinese and minority languages or between Chinese and other less widely used languages, the information-rich sentence selection algorithm with sentence-length penalty proposed by the invention can effectively obtain high-quality bilingual parallel data from a large-scale monolingual corpus.

Brief description of the drawings

Fig. 1 is a schematic diagram of the bilingual data generation process for resource-scarce statistical machine translation under the active learning framework of the inventive method;

Fig. 2 compares the performance of the statistical machine translation systems built with the existing geometric-phrase (Geom-phrase) method and with the random method at each iteration;

Fig. 3 shows the mean length of the information-rich sentences selected at each iteration by the existing Geom-phrase method and by random selection;

Fig. 4 shows the coverage of different test sets by the bilingual data generated at each iteration by the existing Geom-phrase method and by random selection;

Fig. 5 shows the mean length of the information-rich sentences selected at each iteration by the inventive method and by the two existing methods;

Fig. 6 compares the statistical machine translation performance of the inventive method and the two existing methods.

Detailed description

The invention is described in detail below with reference to the drawings and specific embodiments.

The method of the invention for selecting information-rich sentences based on a sentence-length penalty factor takes into account how the bilingual parallel data generated from information-rich monolingual sentences affects translation performance. It picks the most informative sentences from a large-scale monolingual corpus and hands them to human translators for professional translation, producing high-quality bilingual parallel data and thereby improving statistical machine translation quality for languages with scarce bilingual resources.

Fig. 1 shows the iterative process by which the inventive method, under an active learning framework, selects the most informative sentences and generates a bilingual parallel corpus from them. The bilingual parallel corpus generated in each iteration is added to the parallel corpus used to train the previous translation engine, and a new statistical machine translation system is then trained. The new system decodes the same test set, and the BLEU metric (the precision of matching contiguous word sequences, with scores in (0, 1)) is used to measure the quality of each newly added bilingual corpus. This indirectly measures the quality of the selected set of information-rich sentences and so reflects the effectiveness of the selection algorithm. The process has two prerequisites: 1) a small initial parallel corpus for building the baseline statistical machine translation system (such a system is existing technology); 2) a large-scale monolingual corpus, from which bilingual data is generated under the framework of the invention. The key question in this process is how to design an effective method for selecting information-rich sentences, which is the core of the invention.

Based on the above principle, the method is implemented according to the following steps:

Step 1: Build the initial statistical machine translation system

Train a statistical machine translation system on a small, resource-limited initial bilingual parallel corpus L = {(f_i, e_i)}, i = 1, ..., N, where L denotes the initial bilingual parallel corpus and (f_i, e_i) denotes the i-th parallel sentence pair in L, i.e. the i-th Chinese sentence f_i and the i-th English sentence e_i;

where U denotes the large-scale monolingual corpus and U_n denotes the monolingual subset formed by the selected sentences. BP is the sentence-length penalty factor of the inventive method. The basic idea is to decide whether to apply a penalty according to the ratio of the average sentence length of the monolingual corpus to the length of the candidate sentence: if the ratio is greater than 1, the sentence is penalized; if it is less than 1, it is not. Introducing this penalty factor is the inventive core of the method;

X is the information representation unit used for sentence-level information quantization; in the invention, x is defined as a word sequence (phrase) of the kind used by phrase-based statistical machine translation systems;

P(x \mid U) = \frac{\mathrm{Count}(x) + \epsilon}{\sum_{x \in X_{U}^{m}} \mathrm{Count}(x) + \epsilon} \qquad (2)

P(x \mid L) = \frac{\mathrm{Count}(x) + \epsilon}{\sum_{x \in X_{L}^{m}} \mathrm{Count}(x) + \epsilon} \qquad (3)

where ε is the smoothing factor, X_{U}^{m} is the set of possible phrases in the large-scale monolingual corpus U that can be used to compute sentence-level information content, X_{L}^{m} is the corresponding set of possible phrases in the bilingual parallel corpus L, and Count(x) is the number of times phrase x occurs, i.e. its frequency.
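The text states only when the sentence-length penalty BP applies (when the ratio of the average corpus sentence length to the candidate sentence length exceeds 1), not its functional form. The sketch below therefore assumes, purely for illustration, an exponential form modeled on the BLEU brevity penalty; the function name `length_penalty` and the expression `exp(1 - ratio)` are assumptions and are not taken from the patent.

```python
import math


def length_penalty(candidate_len, avg_corpus_len):
    """Illustrative sentence-length penalty (assumed form, not the patent's formula).

    ratio = average corpus sentence length / candidate sentence length.
    ratio > 1 (candidate shorter than average): penalize with exp(1 - ratio) < 1.
    ratio <= 1 (candidate at least average length): no penalty, return 1.0.
    """
    if candidate_len <= 0:
        return 0.0  # empty sentences carry no information in this sketch
    ratio = avg_corpus_len / candidate_len
    if ratio > 1.0:
        return math.exp(1.0 - ratio)
    return 1.0
```

Under this assumed form, a 10-word candidate in a corpus averaging 20 words gets a factor of exp(-1), roughly 0.37, while a 40-word candidate is left unpenalized. How BP enters the sentence score is fixed by formula (1) and is not assumed here.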

Step 3: Perform professional translation

Step 4: Update the corpora

Step 5: Retrain the statistical machine translation system

The statistical machine translation system is retrained on the newly formed bilingual corpus, and the resulting translation engine decodes the test set again. The translations of the test set are then scored with the BLEU metric (the precision of matching contiguous word sequences, with scores in (0, 1)), and the improvement in score is compared with that of the random method to measure the performance of the sentence selection algorithm; a higher score indicates better performance.

Step 6: Iterate the process and evaluate the algorithm

Iteration starts again from step 2. Each iteration selects a new set of information-rich sentences, updates the data in U and L, and measures the performance of the sentence selection algorithm; the process stops after 25 iterations. The mean BLEU score over the 25 iterations is computed and compared with the mean for the random method. The higher the mean, the better the algorithm: it is better at selecting highly informative sentences and improves machine translation performance more.

Example

The experiment uses the Chinese-English language pair and translation direction. From the Chinese-English FBIS bilingual parallel corpus provided by the open evaluation of the National Institute of Standards and Technology (NIST), 5K sentence pairs are randomly selected as the initial parallel corpus (mean sentence length on the Chinese side: 36.5 words), along with 20K sentences as the simulated monolingual (Chinese) corpus (mean sentence length: 36.4 words). The NIST 2006 test set serves as the development set for tuning the model parameters of the statistical machine translation system (1,664 sentences, each with four reference translations). The NIST 2005 and 2008 test sets are used to test translation performance; the former contains 1,083 sentences and the latter 1,357 sentences, each with four reference translations. Human translation in the active learning framework is simulated by the English side of the simulated monolingual data (which is in fact a sentence-aligned bilingual parallel corpus).

The number of iterations in the active learning framework is set to 25, and each iteration selects n = 200 information-rich sentences from U. The smoothing factor ε in formulas (2) and (3) is set to 0.5.

Fig. 2 compares an existing algorithm for selecting information-rich sentences, the geometric-phrase (Geom-phrase) method, with random selection. The basic formula of the Geom-phrase selection method is:

\phi(s) := \frac{1}{|X_{s}^{m}|} \sum_{x \in X_{s}^{m}} \log \frac{P(x \mid U)}{P(x \mid L)} \qquad (4)

The parameters in formula (4) are the same as in formula (1).
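The Geom-phrase score of formula (4) can be sketched as follows, assuming the denominator inside the logarithm is P(x | L), as the definitions of P(x | U) and P(x | L) imply. The function name `geom_phrase_score` and the dictionary-based count tables are hypothetical; the probabilities follow the smoothed form of formulas (2) and (3), with ε read as added once to the summed counts.

```python
import math


def geom_phrase_score(sentence_phrases, counts_U, counts_L, epsilon=0.5):
    """Average log-ratio of smoothed phrase probabilities over the phrases of one sentence.

    sentence_phrases: iterable of hashable phrases found in the sentence.
    counts_U, counts_L: dicts mapping phrase -> occurrence count in U and in L.
    """
    units = set(sentence_phrases)  # the formula sums over a set of phrases
    if not units:
        return 0.0
    total_U = sum(counts_U.values())
    total_L = sum(counts_L.values())
    log_ratio_sum = 0.0
    for x in units:
        p_u = (counts_U.get(x, 0) + epsilon) / (total_U + epsilon)
        p_l = (counts_L.get(x, 0) + epsilon) / (total_L + epsilon)
        log_ratio_sum += math.log(p_u / p_l)
    return log_ratio_sum / len(units)
```

Because the sum is divided by the number of phrases, the score is a per-phrase average, so a short sentence made of a few rare phrases can outscore a long sentence that contains many more new phrases in total.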

Each line in Fig. 2 shows the BLEU score of the statistical machine translation system after each iteration generates new bilingual parallel data under the active learning framework. Each node on a line represents one iteration, and each iteration is a complete cycle of selecting information-rich sentences, human translation, merging the new parallel corpus, training the machine translation model and decoding. The two upper lines in Fig. 2 are results on the NIST 2005 test set, and the two lower lines are results on the NIST 2008 test set. In each intertwined pair, one line is the system built with the existing Geom-phrase method and the other is the system built with the random method. The comparison shows that the existing Geom-phrase method (Geom phrase NIST05 and Geom phrase NIST08 in the figure) performs worse than the random method (random NIST05 and random NIST08 in the figure).

A statistical analysis was carried out on the existing Geom-phrase method and the random method. Fig. 3 shows the mean length of the sentences each method selects in each iteration. In every iteration, the sentences selected by the Geom-phrase method are shorter on average than those selected by the random method. Over the 25 iterations, the mean sentence length is 27.7 with variance 5.93 for the Geom-phrase method, against 36.5 with variance 1.23 for the random method. The Geom-phrase method is therefore less stable than the random method in both mean sentence length and dynamic range.
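Statistics of this kind could be computed along the following lines. The sketch is illustrative only: the function name `length_stats` is hypothetical, and it assumes that the reported variance is the population variance of the per-iteration mean sentence lengths, which the text does not specify.

```python
from statistics import mean, pvariance


def length_stats(selected_per_iteration):
    """Return (mean, variance) of the per-iteration mean sentence lengths.

    selected_per_iteration: one list of tokenized sentences per iteration.
    """
    # Mean sentence length (in tokens) of the batch selected in each iteration.
    per_iteration_means = [
        mean(len(sentence) for sentence in batch) for batch in selected_per_iteration
    ]
    # Population variance is an assumption; sample variance would use statistics.variance.
    return mean(per_iteration_means), pvariance(per_iteration_means)
```

A lower mean indicates a bias toward short sentences, and a higher variance indicates that the selected sentence lengths fluctuate more from one iteration to the next.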

In general, for a sentence in the large-scale monolingual corpus, the more new words it contains (relative to the existing bilingual parallel corpus), the more information it carries; the more old words it contains, the less information it carries, but the more accurate the probability estimates of its phrases. A statistical analysis was therefore carried out on the test-set coverage of the 200 sentences selected in each iteration under the active learning framework. Fig. 4 shows the coverage of the different test sets by the bilingual data generated in each iteration. In every iteration, the bilingual parallel data generated by the existing Geom-phrase method has higher coverage than the random method on all three test sets (NIST05, NIST06 and NIST08), which indicates that the sentences selected by the Geom-phrase method contain more new words. Yet the higher coverage did not improve translation quality. Combining the sentence-length statistics with the coverage statistics, the root cause of the existing Geom-phrase method performing worse than the random method is that the information content of the sentences it selects is only "relatively maximal": the quantified information is not normalized by sentence length. Although the selected sentences contain many new words, they are short, so their absolute information content is insufficient, and the resulting translation performance is worse than with the bilingual data generated by the random method.
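The text does not define how coverage is measured. One simple reading, used here purely as an assumption, is the fraction of distinct test-set words that also occur in the training data; the function name `word_coverage` is hypothetical.

```python
def word_coverage(train_sentences, test_sentences):
    """Fraction of distinct test-set words that also appear in the training sentences.

    Both arguments are lists of tokenized sentences (lists of words).
    """
    train_vocab = {word for sentence in train_sentences for word in sentence}
    test_vocab = {word for sentence in test_sentences for word in sentence}
    if not test_vocab:
        return 0.0  # avoid dividing by zero on an empty test set
    return len(test_vocab & train_vocab) / len(test_vocab)
```

The same function works at phrase level if each sentence is first replaced by its list of phrases, which would be closer to what a phrase table actually needs to cover.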

Based on the above statistical analysis, the invention proposes a Geom-phrase-based method for selecting information-rich sentences with a sentence-length penalty factor. It converts relative entropy into absolute information content and selects the monolingual sentences with the largest absolute information content for human translation.

Fig. 5 compares the mean length of the sentences selected at each iteration by the inventive method (phrase-pen), the existing Geom-phrase method (phrase) and the random method (random). As Fig. 5 shows, once the sentence-length penalty factor is added, the sentence lengths of the corpus are taken into account, so the mean length of each selected sentence set increases and the fluctuation in sentence length decreases.

Fig. 6 compares the statistical machine translation performance (BLEU score) at each iteration of the inventive method (Phrase-penalty), the existing Geom-phrase method (Phrase) and the random method (Random). As Fig. 6 shows, with the sentence-length penalty factor added, the inventive method significantly improves machine translation performance over the random method. The specific figures for this example are given in Table 1.

Table 1. Translation performance of the inventive method compared with the other two methods

Table 1 compares the statistical machine translation performance of the inventive method, the existing Geom-phrase method and the random method, averaged over 25 iterations. In the table, the "BLEU" column gives the translation results evaluated with the 4-gram BLEU automatic evaluation metric; values range from 0 to 100%, and larger is better. "Phrase" denotes the existing Geom-phrase method, "Random" denotes the random method, and "Phrase-penalty" denotes the selection method with sentence-length penalty factor proposed by the invention.

As Table 1 shows, compared with the random method and the existing Geom-phrase method, the proposed method with sentence-length penalty factor improves translation performance by 0.33 and 0.14 points respectively on the NIST05 data set and by 0.45 and 0.33 points respectively on the NIST08 data set, a significant improvement in system performance. The proposed method is therefore effective and feasible.

Claims

1. A method for selecting information-rich sentences based on a sentence-length penalty factor, characterized in that it is implemented according to the following steps:

Step 1: Build the initial statistical machine translation system

P(x \mid U) = \frac{\mathrm{Count}(x) + \epsilon}{\sum_{x \in X_{U}^{m}} \mathrm{Count}(x) + \epsilon} \qquad (2)

P(x \mid L) = \frac{\mathrm{Count}(x) + \epsilon}{\sum_{x \in X_{L}^{m}} \mathrm{Count}(x) + \epsilon} \qquad (3)

Step 3: Perform professional translation

Step 4: Update the corpora

Step 5: Retrain the statistical machine translation system

Step 6: Iterate the process and evaluate the algorithm

2. The method for selecting information-rich sentences based on a sentence-length penalty factor according to claim 1, characterized in that, in step 6, the process stops after 25 iterations; the mean BLEU score over the 25 iterations is computed and compared with the mean for the random method to measure the performance of the sentence selection algorithm, a higher mean indicating a better algorithm.