CN111680520A - Chinese-Vietnamese neural machine translation method based on synonym data augmentation - Google Patents

Chinese-Vietnamese neural machine translation method based on synonym data augmentation

Info

Publication number
CN111680520A
CN111680520A
Authority
CN
China
Prior art keywords
low
synonym
sentences
words
neural machine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010366635.9A
Other languages
Chinese (zh)
Inventor
高盛祥
尤丛丛
余正涛
毛存礼
潘润海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202010366635.9A priority Critical patent/CN111680520A/en
Publication of CN111680520A publication Critical patent/CN111680520A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a Chinese-Vietnamese neural machine translation method based on synonym data augmentation, and belongs to the technical field of natural language processing. The scarcity of Chinese-Vietnamese parallel corpus resources greatly limits the performance of Chinese-Vietnamese machine translation. The method first obtains a synonym list for the low-frequency words of one language by learning monolingual word vectors, then performs synonym replacement on the low-frequency words, screens the replaced sentences with a language model, and finally pairs the screened sentences with the corresponding sentences of the other language to obtain an expanded parallel corpus. The invention provides strong support for expanding the Chinese-Vietnamese parallel corpus and alleviates the poor translation performance of Chinese-Vietnamese neural machine translation caused by scarce corpus resources.

Description

Chinese-Vietnamese neural machine translation method based on synonym data augmentation
Technical Field
The invention relates to a Chinese-Vietnamese neural machine translation method based on synonym data augmentation, and belongs to the technical field of natural language processing.
Background
In the field of Chinese and Vietnamese natural language processing, the construction of a high-quality Chinese-Vietnamese parallel corpus is the foundation and prerequisite of Chinese-Vietnamese neural machine translation. The quality and scale of the parallel corpus directly affect translation performance, so a high-quality parallel corpus must be constructed. At present, two types of data augmentation methods dominate the machine translation field: vocabulary replacement and back-translation. Because Vietnamese is a low-resource language, a bilingual dictionary is difficult to obtain, yet cross-lingual word replacement requires dictionary resources of a certain scale and is therefore hard to realize. Chinese and Vietnamese, however, both have large amounts of monolingual data, from which word vectors and related words are easy to obtain through monolingual training. The invention therefore provides a bilingual parallel corpus data augmentation method based on synonym replacement of low-frequency words: word vectors are built from monolingual data, synonyms of low-frequency words are retrieved from them, and an expanded parallel corpus is obtained through synonym replacement.
Disclosure of Invention
The invention provides a Chinese-Vietnamese neural machine translation method based on synonym data augmentation, which solves the problem of low neural machine translation performance caused by the scarcity of Chinese-Vietnamese parallel corpus resources.
The technical scheme of the invention is as follows: the Chinese-Vietnamese neural machine translation method based on synonym data augmentation comprises the following specific steps:
Step1, constructing a low-frequency word list V_R for one language from the original corpus;
Step2, searching synonyms of the low-frequency words through monolingual word vectors;
Step3, carrying out synonym replacement on the low-frequency words;
Step4, screening the replaced sentences with a language model;
and Step5, pairing the screened sentences with the corresponding sentences of the other language to obtain the expanded parallel corpus.
Further, the specific steps of Step1 are as follows:
Step1.1, firstly, constructing a vocabulary V over the source-language side of the training corpus;
Step1.2, adding words whose frequency N in V satisfies 1 ≤ N < 5 to the low-frequency word list as low-frequency words, thereby forming the low-frequency word list V_R.
Further, the specific steps of Step2 are as follows:
Step2.1, training feature vector representations of monolingual words;
Step2.2, judging the similarity between two words by calculating the cosine value between their word vectors, and thereby retrieving the synonyms of the low-frequency words;
Step2.3, constructing a synonym list of the low-frequency words.
Further, the specific steps of Step3 are as follows:
Step3.1, given a source sentence S and a position i in S;
Step3.2, where S_i is a low-frequency word, replacing S_i in S with one of its synonyms s_i', thereby generating a synonymous sentence S' of S;
and Step3.3, adding the replaced sentence to the synonym candidate list of S.
Further, the specific steps of Step4 are as follows:
Step4.1, scoring the sentences in the synonym candidate list of a given source sentence S with a KenLM language model;
and Step4.2, retaining the top M highest-scoring sentences, where M is a positive integer.
Further, for a sentence pair (S, T): since synonym replacement is performed only on the source-language side, the source sentence S and the augmented sentence S' are synonymous; Step5 therefore leaves the target sentence unchanged, and the sentence pair (S', T) is added directly to the training set to participate in model training.
The invention has the beneficial effects that: it provides strong support for expanding the Chinese-Vietnamese parallel corpus; it alleviates the poor Chinese-Vietnamese neural machine translation performance caused by scarce corpus resources, obtains a better BLEU value, and helps the translation model better recognize low-frequency words on the source-language side, thereby improving translation quality.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is an exemplary diagram of synonym substitution in the present invention;
FIG. 3 is a diagram illustrating target-sentence matching in the present invention.
Detailed Description
Example 1: as shown in FIGS. 1 to 3, the Chinese-Vietnamese neural machine translation method based on synonym data augmentation comprises the following specific steps:
Step1, firstly, a vocabulary V is constructed over the source-language side of the training corpus, and then words whose frequency N in V satisfies 1 ≤ N < 5 are added to the low-frequency word list as low-frequency words, thereby forming the low-frequency word list V_R.
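Step1 can be sketched as follows. This is a minimal Python illustration, not the patent's implementation; the function name and the toy corpus are hypothetical.

```python
from collections import Counter

def build_low_freq_list(sentences, lo=1, hi=5):
    """Step1: build the vocabulary V from tokenized source sentences and
    collect words whose frequency N satisfies lo <= N < hi into V_R."""
    vocab = Counter(w for s in sentences for w in s)
    v_r = {w for w, n in vocab.items() if lo <= n < hi}
    return vocab, v_r

# toy tokenized source-side corpus (illustrative only)
corpus = [
    ["我", "喜欢", "苹果"],
    ["我", "喜欢", "香蕉"],
    ["他", "讨厌", "香蕉"],
]
vocab, v_r = build_low_freq_list(corpus)
# every word occurs fewer than 5 times here, so all of them enter V_R
```

In a real setting only a small fraction of V falls into V_R; the 1 ≤ N < 5 band matches Step1.2 above.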
Step2, finding synonyms of low-frequency words through monolingual word vectors: the similarity between two words can be judged by calculating their distance in the monolingual semantic space. The method therefore uses large-scale Chinese and Vietnamese monolingual corpora to train feature vectors for monolingual words, judges the similarity between two words by the cosine value between their vectors, and thereby retrieves the synonyms of the low-frequency words to construct the synonym list of low-frequency words;
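The cosine ranking in Step2 can be sketched as below. The two-dimensional vectors are toy stand-ins for trained monolingual embeddings; the function names are illustrative, not from the patent.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def top_synonyms(word, vectors, k=1):
    """Rank all other words by cosine similarity to `word` and
    return the k most similar ones (the synonym candidates)."""
    ranked = sorted(
        ((cosine(vectors[word], u), w) for w, u in vectors.items() if w != word),
        reverse=True,
    )
    return [w for _, w in ranked[:k]]

# toy 2-d vectors standing in for trained monolingual embeddings
vecs = {
    "苹果": [1.0, 0.1],   # "apple"
    "香蕉": [0.9, 0.2],   # "banana" - points in a similar direction
    "讨厌": [-0.2, 1.0],  # "dislike" - nearly orthogonal
}
top_synonyms("苹果", vecs, k=1)  # -> ["香蕉"]
```

With real embeddings the k best-scoring words per low-frequency word form its row in the synonym list of Step2.3.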
Step3, synonym replacement of low-frequency words: given a source sentence S and a position i in S at which S_i is a low-frequency word, S_i is replaced in S with one of its synonyms s_i', thereby generating a synonymous sentence S' of S; the replaced sentence is added to the synonym candidate list of S. FIG. 2 shows a synonym replacement example.
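The replacement in Step3 amounts to substituting each synonym at each low-frequency position. A minimal sketch (names and example data are illustrative):

```python
def replacement_candidates(sentence, low_freq, synonyms):
    """Step3: for every position i where S_i is a low-frequency word,
    substitute each synonym s_i' to generate synonymous sentences S'."""
    candidates = []
    for i, word in enumerate(sentence):
        if word in low_freq:
            for syn in synonyms.get(word, []):
                candidates.append(sentence[:i] + [syn] + sentence[i + 1:])
    return candidates

s = ["我", "喜欢", "苹果"]
replacement_candidates(s, {"苹果"}, {"苹果": ["香蕉"]})
# -> [["我", "喜欢", "香蕉"]]
```

Each candidate differs from S in exactly one position, which keeps S and S' synonymous at the sentence level.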
Step4, screening the replaced sentences with the language model: the sentences in the synonym candidate list of a given source sentence S are scored with a KenLM language model, and the top M highest-scoring sentences are retained, where M is a positive integer.
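The screening step can be sketched with a tiny add-one-smoothed trigram scorer standing in for KenLM (the patent's KenLM model also uses n = 3); the class and the toy training data are illustrative, not the patent's implementation.

```python
from collections import Counter
from math import log

class TrigramLM:
    """Tiny add-one-smoothed trigram scorer; a stand-in for the
    KenLM model (n = 3) used in Step4."""
    def __init__(self, sentences):
        self.tri, self.bi = Counter(), Counter()
        words = set()
        for s in sentences:
            words.update(s)
            toks = ["<s>", "<s>"] + s + ["</s>"]
            for i in range(len(toks) - 2):
                self.bi[(toks[i], toks[i + 1])] += 1
                self.tri[(toks[i], toks[i + 1], toks[i + 2])] += 1
        self.v = len(words) + 1  # +1 for </s>

    def score(self, s):
        """Log-probability of a tokenized sentence under the model."""
        toks = ["<s>", "<s>"] + s + ["</s>"]
        lp = 0.0
        for i in range(len(toks) - 2):
            num = self.tri[(toks[i], toks[i + 1], toks[i + 2])] + 1
            den = self.bi[(toks[i], toks[i + 1])] + self.v
            lp += log(num / den)
        return lp

def filter_top_m(candidates, lm, m):
    """Step4.2: keep only the M highest-scoring candidate sentences."""
    return sorted(candidates, key=lm.score, reverse=True)[:m]

lm = TrigramLM([["我", "喜欢", "苹果"], ["我", "喜欢", "香蕉"]])
candidates = [["我", "喜欢", "香蕉"], ["香蕉", "我", "喜欢"]]
filter_top_m(candidates, lm, 1)  # the fluent word order scores higher
```

With the actual KenLM Python bindings, the scorer would instead be something like `kenlm.Model(path).score(" ".join(tokens))` on an ARPA/binary model trained from the monolingual corpus (assumption about the bindings' usage, not stated in the patent).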
Step5, pairing the screened sentences with the sentences of the other language to obtain the expanded parallel corpus: for a sentence pair (S, T), since synonym replacement is performed only on the source-language side, the source sentence S and the augmented sentence S' are synonymous; Step5 therefore leaves the target sentence unchanged, and the sentence pair (S', T) is added directly to the training set to participate in model training. FIG. 3 shows target-sentence matching.
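Step5 reduces to pairing every screened source sentence S' with the unchanged target T. A minimal sketch, where `candidates_for` is a hypothetical callable standing in for Steps 2-4 (replacement plus language-model screening):

```python
def augment_corpus(pairs, candidates_for):
    """Step5: the target sentence T stays unchanged; every screened
    synonymous source sentence S' is paired with T and the pair
    (S', T) joins the training set."""
    augmented = list(pairs)
    for s, t in pairs:
        for s_prime in candidates_for(s):
            augmented.append((s_prime, t))
    return augmented

pairs = [(["我", "喜欢", "苹果"], ["tôi", "thích", "táo"])]
# `candidates_for` stands in for Steps 2-4 (illustrative lambda)
aug = augment_corpus(pairs, lambda s: [["我", "喜欢", "香蕉"]])
# -> the original pair plus (S', T) sharing the same target sentence
```

Because T is reused verbatim, the target side of the expanded corpus contains only genuine sentences, which Experiment two below identifies as the reason unidirectional replacement works best.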
The BLEU value evaluates the translation accuracy of a neural machine translation model and can be used to measure model quality; the invention adopts the BLEU value as the evaluation standard of the synonym-data-augmentation-based Chinese-Vietnamese neural machine translation model.
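For reference, BLEU can be sketched as the geometric mean of modified n-gram precisions times a brevity penalty. This is a minimal single-sentence, single-reference illustration, not the corpus-level implementation used in the experiments below:

```python
from collections import Counter
from math import exp, log

def bleu(candidate, reference, max_n=4):
    """Minimal single-sentence, single-reference BLEU sketch: geometric
    mean of modified (clipped) n-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(clipped / max(sum(cand.values()), 1))
    if min(precisions) == 0.0:
        return 0.0  # some n-gram order has no overlap at all
    bp = 1.0 if len(candidate) >= len(reference) else \
        exp(1.0 - len(reference) / len(candidate))
    return bp * exp(sum(log(p) for p in precisions) / max_n)
```

An identical candidate and reference score 1.0; production toolkits add corpus-level aggregation and smoothing on top of this core.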
To verify the effect of the synonym data augmentation method, 200,000 Chinese-Vietnamese parallel sentence pairs were collected as training data before augmentation, and a machine translation experiment was set up to compare results before and after augmentation, using a Transformer as the neural machine translation model. The hidden layer size of the model is 512 dimensions, the batch size is 64, and training runs for 20 epochs. In all experiments, the translation model vocabulary was restricted to the most common 30k words in each of Chinese and Vietnamese. Note that data augmentation does not introduce new words into the vocabulary. The KenLM language model was trained on 5 GB of Chinese text from Wikipedia; it is an n-gram language model with n set to 3.
In order to verify the effectiveness of the method, three groups of experiments are designed and evaluated through a BLEU value.
Experiment one: to achieve the best effect, synonym replacement is carried out on the source-language side with the target-language side matched, and a comparison test of experimental parameter selection is performed to evaluate the influence of (a) the method used to construct the low-frequency-word synonym list and (b) the number N of synonyms per low-frequency word on the translation model. Two construction methods are compared: 1) the Synonyms toolkit is used to build the synonym list of source-side low-frequency words; 2) a Chinese BERT model generates word vectors for source-side words, the cosine similarity between words in the vocabulary V and the low-frequency word list V_R is computed, and the synonym list is constructed in descending order of similarity. In the experiment, the number N of synonyms per low-frequency word is set from 1 to 5, raising the frequency of each low-frequency word to 5. Note that different synonym counts N do not change the corpus size: the augmented Chinese-Vietnamese parallel sentence pairs number 310k in every setting. See Table 1.
TABLE 1 comparison of BLEU values for translation Performance for Chinese-Vietnamese with different parameters
(Table 1 is provided as an image in the original publication; its values are discussed below.)
Table 1 shows that building the synonym list of source-side low-frequency words with BERT achieves higher BLEU values than building it with Synonyms. This is because the underlying technique of Synonyms is word2vec, and BERT, unlike word2vec, takes context information into account. A synonym list built with BERT therefore more readily yields highly parallel Chinese-Vietnamese sentence pairs after low-frequency word replacement, so the method of the invention adopts BERT to construct the synonym list of source-side low-frequency words.
The BERT results in Table 1 show that the translation model performs best when the number N of synonyms per source-side low-frequency word is 1, reaching a BLEU value of 11.4. The BLEU value decreases as the number of synonyms increases, reaching its lowest value of 8.6 at N = 5. This is because the synonym list is ordered by descending cosine similarity: the larger N is, the higher the probability that sentence pairs of low parallelism survive language-model screening and target-sentence matching, and the resulting corpus noise lowers the BLEU value. The method of the invention therefore sets the number N of synonyms per low-frequency word to 1.
Experiment two: to select the best realization of the idea, translation performance (BLEU) is compared between unidirectional word replacement (replacing words on the source-language side and matching the target-language side) and bidirectional word replacement (additionally replacing words on the target-language side and matching the source-language side on top of the unidirectional replacement). See Table 2.
TABLE 2 Effect of unidirectional and bidirectional word substitution on translation Performance BLEU values
(Table 2 is provided as an image in the original publication; its values are discussed below.)
TDA(C) denotes the unidirectional word replacement method, and TDA(C-V) the bidirectional one. Note that when target-side words are replaced and the source side is matched, the synonym list of target-side low-frequency words is constructed using the Vietnamese word vectors trained by E. Grave et al. with fastText on Wikipedia; the other experimental parameters are consistent with TDA(C).
The data in Table 2 show that unidirectional word replacement improves translation performance more than bidirectional word replacement. Analysis of Table 2 shows that bidirectional replacement adds 160k sentence pairs over the base corpus, of which 110k come from unidirectional (source-side) replacement and 50k from replacing target-side words and matching the source side; this is because the source-language vocabulary of the original corpus is twice the size of the target vocabulary. Unidirectional replacement works better because translation quality depends on the fluency of the target language rather than the source language: the target sentence must be a genuine sentence for the translation result to be fluent and accurate. The 50k sentence pairs obtained by replacing target-side words therefore reduce the translation effect, and the invention adopts unidirectional word replacement.
Experiment three: to demonstrate the advantage of the method, it is compared with a baseline and with back-translation. The size of the corpus directly affects the quality of a translation model, so to compare the method with back-translation on an equal footing, the size of the back-translation-augmented corpus was also set to 310k. Since Vietnamese is a low-resource language, the quality of the augmented corpus would drop noticeably if a self-trained translation model were used for back-translation; back-translation was therefore performed with Google Translate. To avoid introducing new words after augmentation, back-translation was based on the training corpus: the target side of the training corpus was translated into the source language by Google Translate to construct a Chinese-Vietnamese parallel corpus, from which 110k parallel sentence pairs were randomly extracted and added to model training. See Table 3.
TABLE 3 comparison of BLEU values for translation Performance for Chinese-Vietnamese in different methods
Model             Data   BLEU
Baseline          200k   9.6
Back-translation  310k   10.3
TDA(C)            310k   11.4
Baseline denotes the unaugmented baseline, Back-translation the back-translation method, and TDA(C) the method of the present invention.
The data in Table 3 show that both the proposed method and back-translation effectively improve translation quality, gaining 1.8 and 0.7 BLEU over the baseline respectively. We attribute the best BLEU value to the fact that TDA(C) helps the Chinese-Vietnamese neural machine translation model better recognize low-frequency words on the source-language side, thereby improving translation quality.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (6)

1. A Chinese-Vietnamese neural machine translation method based on synonym data augmentation, characterized by comprising the following specific steps:
Step1, constructing a low-frequency word list V_R for one language from the original corpus;
Step2, searching synonyms of the low-frequency words through monolingual word vectors;
Step3, carrying out synonym replacement on the low-frequency words;
Step4, screening the replaced sentences with a language model;
and Step5, pairing the screened sentences with the corresponding sentences of the other language to obtain the expanded parallel corpus.
2. The Chinese-Vietnamese neural machine translation method based on synonym data augmentation of claim 1, wherein the specific steps of Step1 are as follows:
Step1.1, firstly, constructing a vocabulary V over the source-language side of the training corpus;
Step1.2, adding words whose frequency N in V satisfies 1 ≤ N < 5 to the low-frequency word list as low-frequency words, thereby forming the low-frequency word list V_R.
3. The Chinese-Vietnamese neural machine translation method based on synonym data augmentation of claim 1, wherein the specific steps of Step2 are as follows:
Step2.1, training feature vector representations of monolingual words;
Step2.2, judging the similarity between two words by calculating the cosine value between their word vectors, and thereby retrieving the synonyms of the low-frequency words;
Step2.3, constructing a synonym list of the low-frequency words.
4. The Chinese-Vietnamese neural machine translation method based on synonym data augmentation of claim 1, wherein the specific steps of Step3 are as follows:
Step3.1, given a source sentence S and a position i in S;
Step3.2, where S_i is a low-frequency word, replacing S_i in S with one of its synonyms s_i', thereby generating a synonymous sentence S' of S;
and Step3.3, adding the replaced sentence to the synonym candidate list of S.
5. The Chinese-Vietnamese neural machine translation method based on synonym data augmentation of claim 1, wherein the specific steps of Step4 are as follows:
Step4.1, scoring the sentences in the synonym candidate list of a given source sentence S with a KenLM language model;
and Step4.2, retaining the top M highest-scoring sentences, where M is a positive integer.
6. The Chinese-Vietnamese neural machine translation method based on synonym data augmentation of claim 1, wherein: for a sentence pair (S, T), since synonym replacement is performed only on the source-language side, the source sentence S and the augmented sentence S' are synonymous; Step5 leaves the target sentence unchanged, and the sentence pair (S', T) is added directly to the training set to participate in model training.
CN202010366635.9A 2020-04-30 2020-04-30 Synonym data enhancement-based Hanyue neural machine translation method Pending CN111680520A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010366635.9A CN111680520A (en) 2020-04-30 2020-04-30 Synonym data enhancement-based Hanyue neural machine translation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010366635.9A CN111680520A (en) 2020-04-30 2020-04-30 Synonym data enhancement-based Hanyue neural machine translation method

Publications (1)

Publication Number Publication Date
CN111680520A true CN111680520A (en) 2020-09-18

Family

ID=72452220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010366635.9A Pending CN111680520A (en) 2020-04-30 2020-04-30 Synonym data enhancement-based Hanyue neural machine translation method

Country Status (1)

Country Link
CN (1) CN111680520A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215017A (en) * 2020-10-22 2021-01-12 内蒙古工业大学 Mongolian Chinese machine translation method based on pseudo parallel corpus construction
CN112668325A (en) * 2020-12-18 2021-04-16 平安科技(深圳)有限公司 Machine translation enhancing method, system, terminal and storage medium
CN112686028A (en) * 2020-12-25 2021-04-20 掌阅科技股份有限公司 Text translation method based on similar words, computing equipment and computer storage medium
WO2023116655A1 (en) * 2021-12-20 2023-06-29 华为技术有限公司 Communication method and apparatus

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101739395A (en) * 2009-12-31 2010-06-16 程光远 Machine translation method and system
CN108920473A (en) * 2018-07-04 2018-11-30 中译语通科技股份有限公司 A kind of data enhancing machine translation method based on similar word and synonym replacement
CN110472252A (en) * 2019-08-15 2019-11-19 昆明理工大学 The method of the more neural machine translation of the Chinese based on transfer learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101739395A (en) * 2009-12-31 2010-06-16 程光远 Machine translation method and system
CN108920473A (en) * 2018-07-04 2018-11-30 中译语通科技股份有限公司 A kind of data enhancing machine translation method based on similar word and synonym replacement
CN110472252A (en) * 2019-08-15 2019-11-19 昆明理工大学 The method of the more neural machine translation of the Chinese based on transfer learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Marzieh Fadaee: "Data Augmentation for Low-Resource Neural Machine Translation", Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Short Papers) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215017A (en) * 2020-10-22 2021-01-12 内蒙古工业大学 Mongolian Chinese machine translation method based on pseudo parallel corpus construction
CN112215017B (en) * 2020-10-22 2022-04-29 内蒙古工业大学 Mongolian Chinese machine translation method based on pseudo parallel corpus construction
CN112668325A (en) * 2020-12-18 2021-04-16 平安科技(深圳)有限公司 Machine translation enhancing method, system, terminal and storage medium
CN112668325B (en) * 2020-12-18 2024-05-10 平安科技(深圳)有限公司 Machine translation enhancement method, system, terminal and storage medium
CN112686028A (en) * 2020-12-25 2021-04-20 掌阅科技股份有限公司 Text translation method based on similar words, computing equipment and computer storage medium
WO2023116655A1 (en) * 2021-12-20 2023-06-29 华为技术有限公司 Communication method and apparatus

Similar Documents

Publication Publication Date Title
CN108399163B (en) Text similarity measurement method combining word aggregation and word combination semantic features
Deng et al. Structure-grounded pretraining for text-to-SQL
CN111680520A (en) Synonym data enhancement-based Hanyue neural machine translation method
US10268685B2 (en) Statistics-based machine translation method, apparatus and electronic device
WO2020062770A1 (en) Method and apparatus for constructing domain dictionary, and device and storage medium
KR101923650B1 (en) System and Method for Sentence Embedding and Similar Question Retrieving
CN108021555A (en) A kind of Question sentence parsing measure based on depth convolutional neural networks
CN102663129A (en) Medical field deep question and answer method and medical retrieval system
He et al. Question answering over linked data using first-order logic
CN108491399B (en) Chinese-English machine translation method based on context iterative analysis
Chakravarthi et al. Bilingual lexicon induction across orthographically-distinct under-resourced Dravidian languages
Xiong et al. HANSpeller: a unified framework for Chinese spelling correction
CN114388141A (en) Medicine relation extraction method based on medicine entity word mask and Insert-BERT structure
Zhang et al. Chinese-English mixed text normalization
CN116757188A (en) Cross-language information retrieval training method based on alignment query entity pairs
CN111325015A (en) Document duplicate checking method and system based on semantic analysis
Rytting et al. Spelling correction for dialectal Arabic dictionary lookup
Wang et al. Chinese text error correction suggestion generation based on SoundShape code
Chen et al. Semi-supervised lexicon learning for wide-coverage semantic parsing
KR102341563B1 (en) Method for extracting professional text data using mediating text data topics
CN111709245A (en) Chinese-Yuan pseudo parallel sentence pair extraction method based on semantic self-adaptive coding
Lei et al. An enhanced computational feature selection method for medical synonym identification via bilingualism and multi-corpus training
CN115034239B (en) Machine translation method of Han-Yue nerve based on noise reduction prototype sequence
Trieu et al. A New Feature to Improve Moore's Sentence Alignment Method
CN113946666B (en) Domain-awareness-based simple question knowledge base question-answering method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200918