CN113051936A - Method for enhancing Hanyue neural machine translation based on low-frequency word representation - Google Patents

Method for enhancing Hanyue neural machine translation based on low-frequency word representation Download PDF

Info

Publication number
CN113051936A
CN113051936A
Authority
CN
China
Prior art keywords
low-frequency
chinese
word
vietnamese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110280508.1A
Other languages
Chinese (zh)
Inventor
余正涛
杨福岸
高盛祥
王振晗
朱俊国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202110280508.1A priority Critical patent/CN113051936A/en
Publication of CN113051936A publication Critical patent/CN113051936A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/53Processing of non-Latin text

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method for enhancing Hanyue (Chinese-Vietnamese) neural machine translation based on low-frequency word representation, and belongs to the field of natural language processing. Low-frequency words are a key factor affecting the performance of neural machine translation models. Because low-frequency words appear only rarely in the data set, the representations learned for them during training are not accurate enough, and this problem is more pronounced in low-resource neural machine translation tasks. The method learns the probability distribution of low-frequency words from monolingual context information, recomputes the word embeddings of the low-frequency words according to that distribution, and then retrains the Transformer model on the resulting word embeddings, effectively alleviating the inaccurate representation of low-frequency words. Experiments on the two low-resource translation tasks, Chinese-Vietnamese and Vietnamese-Chinese, show that the proposed method improves over the baseline model by 8.58% and 6.06% respectively.

Description

Method for enhancing Hanyue neural machine translation based on low-frequency word representation
Technical Field
The invention relates to a method for enhancing Hanyue neural machine translation based on low-frequency word representation, and belongs to the technical field of natural language processing.
Background
The core of word representation enhancement is learning a more accurate representation for each word, and the difficulty lies in representing low-frequency words. In general, there are roughly two approaches to word representation enhancement: (1) methods based on integrating external knowledge, which blend in prior knowledge so that words carry richer meanings; (2) methods based on internal knowledge enhancement, which relearn word representations from monolingual data so that they contain richer translation information and become more accurate. Both approaches can enhance word representations to some extent, making the enhanced meanings more consistent with the sentences they appear in, but neither specifically enhances the representation of low-frequency words, so the problem of poorly translated low-frequency words remains unsolved.
Disclosure of Invention
The invention provides a method for enhancing Hanyue neural machine translation based on low-frequency word representation, which solves the problem that low-frequency words are not well represented in neural machine translation by introducing a language model and a low-frequency word dictionary into a Transformer translation model.
The technical scheme of the invention is as follows: a method of Hanyue neural machine translation enhancement based on low-frequency word representation, comprising:
Step1, collecting a Chinese-Vietnamese bilingual corpus, and preprocessing the collected corpus;
step2, learning the probability distribution of each word through a language model;
Step3, constructing Chinese and Vietnamese low-frequency word dictionaries;
Step4, judging the low-frequency words in the translation model input by using the Chinese-Vietnamese low-frequency word dictionaries constructed in Step3, and updating the representations of the original low-frequency words with the Step2 probability distributions, thereby obtaining a new representation form of the translation model input;
and Step5, retraining the Transformer translation model on the basis of the characterization form obtained in Step4.
As a further scheme of the present invention, the Step1 specifically comprises the following steps:
Step1.1, having linguistic experts translate the English side of the public IWSLT English-Vietnamese bilingual parallel corpus into Chinese to obtain a Chinese-Vietnamese parallel corpus;
Step1.2, cleaning and word-segmenting the corpus, finally obtaining 127,481 Chinese-Vietnamese parallel sentence pairs;
Step1.3, segmenting Chinese sentences with the jieba segmentation tool, and separating punctuation in Vietnamese with a tokenizer.
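The Vietnamese side of Step1.3 can be sketched as follows. Vietnamese already separates syllables with spaces, so the tokenizer only needs to detach punctuation from words; the helper below is a minimal, hypothetical stand-in for that step (the patent names only "a tokenizer", not a specific tool).

```python
import re

def tokenize_vietnamese(sentence: str) -> list[str]:
    """Cut punctuation apart from words, then split on whitespace.

    A minimal stand-in for the punctuation-cutting tokenizer of Step1.3.
    """
    spaced = re.sub(r'([.,!?;:"()])', r" \1 ", sentence)
    return spaced.split()

print(tokenize_vietnamese("Xin chào, thế giới!"))
```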
As a further scheme of the present invention, the Step2 specifically comprises the following steps:
step2.1, for any word w in the dictionary, the probability distribution is:
P(w) = (P_1(w), P_2(w), P_3(w), ..., P_|V|(w))    (1)
satisfies the following conditions:
Σ_{j=1}^{|V|} P_j(w) = 1    (2)
Step2.2, a language model computes the conditional probability of every word in V given the preceding context; for the t-th word x_t in a sentence:
P_j(x_t) = LM(w_j | x_{<t})    (3).
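Equation (3) says that at each position t the language model yields a full distribution over the vocabulary conditioned on the prefix x_{<t}. The sketch below illustrates this shape with a softmax over per-word scores; the scoring function is an arbitrary placeholder, not the patent's 6-layer Transformer decoder.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def lm_distribution(prefix, vocab_size, score):
    """P_j(x_t) = LM(w_j | x_{<t}) for j = 1..|V| (eq. 3), as a softmax
    over per-word scores. `score` stands in for the decoder's logit head."""
    return softmax([score(prefix, j) for j in range(vocab_size)])

# Toy score: prefer the word whose id equals the prefix length.
dist = lm_distribution([3, 1], vocab_size=5,
                       score=lambda p, j: 1.0 if j == len(p) else 0.0)
```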
as a further scheme of the present invention, the Step3 specifically comprises the following steps:
step3.1, respectively counting word frequencies of Chinese and Vietnamese;
Step3.2, defining low-frequency words according to the word frequency distribution: the word grade is determined by the maximum-value method, i.e. words are ranked by occurrence count from high to low, a word's grade is its word order value k, and the dictionary d_k collects the words with word order value k;
Step3.3, constructing the low-frequency word dictionary D_K:
D_K = d_1 ∪ d_2 ∪ ... ∪ d_K    (4).
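Steps 3.1-3.3 amount to a frequency count followed by a union of rank classes. A sketch, under the reading (consistent with the experiments, where "frequency ≤ K" defines the K-class dictionary) that d_k collects the words occurring exactly k times and D_K = d_1 ∪ ... ∪ d_K:

```python
from collections import Counter

def build_lowfreq_dict(tokens, K):
    """Return the per-class dictionaries d_k (words occurring exactly k
    times) and D_K, the union d_1 ∪ ... ∪ d_K (eq. 4)."""
    freq = Counter(tokens)
    d = {k: {w for w, c in freq.items() if c == k} for k in range(1, K + 1)}
    D_K = set().union(*d.values()) if d else set()
    return d, D_K

d, D2 = build_lowfreq_dict("a a a b b c".split(), K=2)
```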
As a further scheme of the present invention, the Step4 specifically comprises the following steps:
Step4.1, using the constructed low-frequency word dictionary D_K to judge which words in the input sentence are low-frequency words: Y if x_t ∈ D_K, otherwise N;
Step4.2, if Y, updating the corresponding x_t with the P(x_t) trained by the language model; if N, keeping it unchanged, thereby obtaining a new source sequence X';
step4.3, multiplying the obtained new source end sequence X' with a word embedding matrix E of a dictionary V to obtain the input of a translation model:
input=X'E (5)。
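In equation (5), each row of X' is either a one-hot vector (ordinary word) or the language-model distribution (low-frequency word), so X'E yields either the word's own embedding or a probability-weighted mixture of all embeddings. A small sketch with plain lists; the dimensions and values are illustrative only:

```python
def model_input(rows, E):
    """Compute input = X'E (eq. 5): each row is a distribution over the
    vocabulary, E is the |V| x d word embedding matrix."""
    dim = len(E[0])
    return [[sum(r[j] * E[j][k] for j in range(len(E))) for k in range(dim)]
            for r in rows]

E = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # |V| = 3, embedding dim d = 2
one_hot = [0.0, 1.0, 0.0]                 # ordinary word -> its own embedding
smoothed = [0.5, 0.5, 0.0]                # low-frequency word -> mixture
inp = model_input([one_hot, smoothed], E)
```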
As a further scheme of the present invention, Step5 further includes:
step5.1, finally obtaining a translation result through a translation model Transformer:
output=Transformer(input,Y) (6)。
the invention has the beneficial effects that:
1. according to the invention, a language model and a low-frequency word dictionary are introduced into a Transformer model, so that the problem that the low-frequency words are not well represented in neural machine translation can be effectively solved.
2. The method can further improve the performance of a machine translation model on a classical Transformer model and a Transformer + LM model without distinguishing word frequency.
3. The experimental results show that the proposed low-frequency-word-representation-enhanced Chinese-Vietnamese neural machine translation method improves the BLEU4 score over the baseline model by 8.58% and 6.06% on the Chinese-Vietnamese and Vietnamese-Chinese low-resource translation tasks respectively.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a translation model architecture of the present invention;
FIG. 3 is a diagram illustrating the influence of the K-class low-frequency word dictionary on the Chinese-Vietnamese model according to the present invention;
FIG. 4 is a diagram illustrating the influence of the K-class low-frequency word dictionary on the Vietnamese-Chinese model according to the present invention.
Detailed Description
For the purpose of describing the invention in more detail and facilitating understanding for those skilled in the art, the present invention will be further described with reference to the accompanying drawings and examples, which are provided for illustration and understanding of the present invention and are not intended to limit the present invention.
Example 1: as shown in FIGS. 1-4, a method for enhanced Hanyue (Chinese-Vietnamese) neural machine translation based on low-frequency word representation comprises the following steps:
Step1, collecting a Chinese-Vietnamese bilingual corpus, and preprocessing the collected corpus;
Step1.1, having linguistic experts translate the English side of the public IWSLT English-Vietnamese bilingual parallel corpus into Chinese to obtain a Chinese-Vietnamese parallel corpus;
Step1.2, cleaning and word-segmenting the corpus, finally obtaining 127,481 Chinese-Vietnamese parallel sentence pairs;
Step1.3, segmenting Chinese sentences with the jieba segmentation tool, and separating punctuation in Vietnamese with a tokenizer.
Step2, learning the probability distribution of each word through a language model;
the language model learns the probability distribution of the low-frequency words by utilizing the monolingual data context information, namely, for a given source end and target end sentence pair, the probability distribution of each word is obtained through the language model;
the purpose of the language model is to obtain the probability distribution of each low frequency word in a lexicon of vocabulary size | V |, which for any low frequency word w is:
P(w) = (P_1(w), P_2(w), P_3(w), ..., P_|V|(w))    (1)
satisfies the following conditions:
Σ_{j=1}^{|V|} P_j(w) = 1    (2)
the probability distribution P (w) of the low-frequency words w can be calculated by various methods, the invention utilizes a pre-trained 6-layer Transformer decoder as a language model to calculate the conditional probability of all words before P (w) and V, and the t x th word in a sentence istThe words, there are:
Pj(xt)=LM(wj|x<t) (3)
wherein LM (w)j|x<t) Representing the probability of the jth word after it appears in the dictionary, the probability distribution computed by the language model can be seen as a smooth approximation of one-hot since they have the same vocabulary size since they are trained using the same corpus as the translation model.
Step3, constructing Chinese and Vietnamese low-frequency word dictionaries;
as a further scheme of the present invention, the Step3 specifically comprises the following steps:
step3.1, respectively counting word frequencies of Chinese and Vietnamese;
Step3.2, defining low-frequency words according to the word frequency distribution: the word grade is determined by the maximum-value method, i.e. words are ranked by occurrence count from high to low, a word's grade is its word order value k, and the dictionary d_k collects the words with word order value k;
Step3.3, constructing the low-frequency word dictionary D_K:
D_K = d_1 ∪ d_2 ∪ ... ∪ d_K    (4).
Specifically, Chinese and Vietnamese low-frequency word dictionaries are constructed separately by statistics over the Chinese and Vietnamese training sets. The dictionary of words whose word order value equals k is the word order value k-class low-frequency word dictionary d_k, and the dictionary of words whose word order value k is less than or equal to K is the K-class low-frequency word dictionary D_K. For each word order value k (k = 1 to 10), the k-class and K-class low-frequency word dictionaries are constructed. The dictionary coverage rate of each low-frequency word dictionary is then computed as the ratio of the size of the low-frequency word dictionary to the size of the total dictionary, where the total dictionary is obtained from the training set.
The Chinese dictionary vocabulary has a size of 47356 and the training set contains 2275526 words in total. The word order value k classes (k = 1 to 10) contain 18496, 6656, 3787, 2508, 1812, 1397, 1067, 832, 719 and 593 words respectively. The Vietnamese dictionary vocabulary has a size of 22732 and the training set contains 3189350 words in total; the corresponding k classes contain 9428, 3188, 1667, 1006, 718, 514, 393, 340, 188 and 223 words respectively.
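The dictionary coverage rate defined above is simply |D_K| / |V|. Using the Chinese class sizes just listed, the cumulative coverage at K = 5 can be checked directly; it lands near the 70.25% quoted later for the best Chinese-Vietnamese setting (small rounding differences are expected):

```python
zh_class_sizes = [18496, 6656, 3787, 2508, 1812, 1397, 1067, 832, 719, 593]
zh_vocab = 47356

def coverage(class_sizes, K, vocab_size):
    """Coverage rate of the K-class dictionary: |D_K| / |V|."""
    return sum(class_sizes[:K]) / vocab_size

cov5 = coverage(zh_class_sizes, 5, zh_vocab)
print(f"{cov5:.2%}")  # roughly 70%
```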
Step4, judging the low-frequency words in the translation model input by using the Chinese-Vietnamese low-frequency word dictionaries constructed in Step3, and updating the representations of the original low-frequency words with the Step2 probability distributions, thereby obtaining a new representation form of the translation model input;
as a further scheme of the present invention, the Step4 specifically comprises the following steps:
Step4.1, using the constructed low-frequency word dictionary D_K to judge which words in the input sentence are low-frequency words: Y if x_t ∈ D_K, otherwise N;
Step4.2, if Y, updating the corresponding x_t with the P(x_t) trained by the language model; if N, keeping it unchanged, thereby obtaining a new source sequence X';
step4.3, multiplying the obtained new source end sequence X' with a word embedding matrix E of a dictionary V to obtain the input of a translation model:
input=X'E (5)。
and Step5, retraining the Transformer translation model on the basis of the characterization form obtained in Step4. Finally, obtaining a translation result through a translation model Transformer:
output=Transformer(input,Y) (6)。
To train and validate the models effectively, 2,000 sentence pairs each were randomly drawn from the Chinese-Vietnamese bilingual parallel data as the test set and the validation set, with the remainder used as the training set; the specific data information is shown in Table 1:
TABLE 1 data size and data set partitioning
In the Chinese-Vietnamese translation task, a Transformer Decoder is adopted as a Chinese language model. The training set and the verification set of the Chinese language model are derived from Chinese corpus in the translation model, and the scales of the training set and the verification set are 127,481 and 2,000 Chinese monolingual data respectively; in the Vietnamese-Chinese translation task, the structure of a language model is the same as that of a model in the Chinese-Vietnamese translation task, and a training set and a verification set of the Vietnamese language model are derived from Vietnamese monolingual corpus in the translation model, and the Vietnamese monolingual data are 127,481 and 2,000 pieces of Vietnamese monolingual data respectively.
Low-frequency words perform poorly in low-resource Chinese-Vietnamese neural machine translation. In order to distinguish low-frequency words from other words and enhance their representations, a low-frequency word dictionary is constructed. The Chinese-Vietnamese word order value k-class low-frequency words are shown in Table 2:
TABLE 2 Chinese-Vietnamese word order value k-class low-frequency words
In the invention, the vocabulary size of the Chinese dictionary is 47,356 and that of the Vietnamese dictionary is 22,732; the maximum batch size is 2048 tokens, the maximum sentence length is 128, the maximum number of epochs is 100, dropout is set to 0.1, the word embedding dimension is 512, and the hidden layer dimension is 512. All models were trained with the Adam optimizer with an initial learning rate of 10^-4.
After the language model training is completed, the optimal parameters of the model are saved and held fixed while the translation model is trained. The proposed method is verified on the Chinese-Vietnamese and Vietnamese-Chinese tasks using the Chinese-Vietnamese parallel data. Bootstrap ("self-help") resampling with 1000 resamples is adopted, and the BLEU4 value on the test set is used as the evaluation index at a significance level of p < 0.05.
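The bootstrap resampling used for evaluation can be sketched as follows: resample the sentence-level scores with replacement 1000 times and read off a percentile interval for the mean. The BLEU4 scoring itself is assumed to be computed elsewhere; plain numbers stand in here.

```python
import random

def bootstrap_interval(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of sentence-level scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n
                   for _ in range(n_resamples))
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

lo, hi = bootstrap_interval([0.31, 0.28, 0.35, 0.30, 0.33])
```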
The present invention employs the following two baseline models. One is the classical Transformer model (Transformer): the Transformer_base model is used in both the Chinese-Vietnamese and Vietnamese-Chinese translation tasks. The other adds a language model on top of the Transformer (Transformer + LM): the translation model input is randomly replaced using the language model's output with replacement probability γ = 0.15 (the optimal setting used in document [2]), and experiments are performed on both translation tasks.
To verify the effectiveness of the method, the experiments compare against the two baseline models, the classical Transformer model and the Transformer + LM model used in the prior art (corpus scale: 127,481 Chinese-Vietnamese parallel sentence pairs). Experiments were carried out on the Chinese-Vietnamese and Vietnamese-Chinese translation tasks; the results are the BLEU4 scores of each translation model, as shown in Table 3.
TABLE 3 experimental results of Chinese-Vietnamese, Vietnamese-Chinese
As can be seen from the above table, in both the Chinese-Vietnamese and Vietnamese-Chinese translation tasks the Transformer + LM model improves the BLEU4 value over the classical Transformer model by 0.87 and 0.59 respectively, and the proposed method improves over the Transformer + LM model by a further 0.84 and 0.68. These results show that the proposed method outperforms both the Transformer and the Transformer + LM models in both directions, demonstrating that low-frequency word representation enhancement is effective for Chinese-Vietnamese and Vietnamese-Chinese translation. The Transformer + LM model is better than the classical Transformer because it randomly introduces word context information through the language model, so the randomly selected words obtain richer information. The proposed method improves further still because it takes the information of low-frequency words into account and performs contextual probability estimation only for low-frequency words; failing to distinguish low-frequency from non-low-frequency words degrades performance. The experimental results show that the method alleviates the poor translation of low-frequency words and has obvious advantages on both translation tasks.
In order to analyze the influence of the occurrence frequency of low-frequency words on the method, following the procedure of FIG. 1, model performance was tested on the Chinese-Vietnamese and Vietnamese-Chinese translation tasks for words whose occurrence frequency is less than or equal to K (K = 1, 2, ..., 10). The results are shown in FIGS. 3 and 4.
As can be seen from FIGS. 3 and 4, in the Chinese-Vietnamese and Vietnamese-Chinese translation tasks the overall trend first rises and then falls as K increases; when K is 5 and 6 respectively, i.e. low-frequency words are defined as those appearing in the training set with frequency less than or equal to 5 and 6 (70.25% and 70.66% of the vocabulary size respectively), the BLEU4 value reaches its maximum. K = 0 corresponds to the classical Transformer model, and for K = 1, 2, ..., 10 the model outperforms the classical Transformer. On the rising part of the curve, the method's model surpasses the Transformer + LM model once K reaches 3; on the falling part, at K = 9 and 10 the Transformer + LM model is slightly better than the method.
As shown in FIG. 3, when K = 0 (the classical Transformer model), the Transformer + LM model is superior because it introduces the context information of random words. When K ≤ 5, the model effect increases steadily: the words in the K-class low-frequency word dictionary occur rarely and cannot be well represented in the translation model, so replacing their representations with their contextual distributions enriches the low-frequency words with context semantic information, and performance grows steadily. When K > 5, words with occurrence frequency above 5 are added to the low-frequency word dictionary; such words can already be trained well, and their trained representations are better than the enhanced form provided by the language model, so the newly added words no longer improve translation performance. Hence the translation effect keeps decreasing once K exceeds 5.
TABLE 4 Han-Yue translation example analysis
TABLE 5 analysis of examples of Yuehan-Han translation
As can be seen from Table 4, the method translates low-frequency words better. In the Chinese-Vietnamese example, the low-frequency word "prerequisite" appears only 5 times in the Chinese training set; the method translates it into a Vietnamese phrase (rendered as an image in the original publication) whose meaning matches the reference translation and fits the context of the sentence. The baseline translation, taken from the Transformer + LM model, renders it as "nhu ..." (image in the original), which fails to convey the meaning of the low-frequency word "prerequisite". As shown in Table 5, in the Vietnamese-Chinese example, the word "ng-n" falls in the low-frequency word dictionary, appearing 6 times in the Vietnamese training set; both the reference translation and the translation of the present method render it as "depressed", while the baseline Transformer + LM model translates it as "urgent", which is far from the true translation and not acceptable. The method therefore represents and translates low-frequency words adequately and better captures their meaning in the sentence.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (6)

1. A method for enhancing Hanyue (Chinese-Vietnamese) neural machine translation based on low-frequency word representation, characterized by comprising:
Step1, collecting a Chinese-Vietnamese bilingual corpus, and preprocessing the collected corpus;
Step2, learning the probability distribution of each word through a language model;
Step3, constructing Chinese and Vietnamese low-frequency word dictionaries;
Step4, using the Chinese-Vietnamese low-frequency word dictionaries constructed in Step3 to judge the low-frequency words in the translation model input, and updating the representations of the original low-frequency words with the Step2 probability distributions, thereby obtaining a new representation form of the translation model input;
Step5, retraining the Transformer translation model on the basis of the representations obtained in Step4.
2. The method according to claim 1, characterized in that Step1 specifically comprises:
Step1.1, having linguistic experts translate the English side of the public IWSLT English-Vietnamese bilingual parallel corpus into Chinese to obtain a Chinese-Vietnamese parallel corpus;
Step1.2, cleaning and word-segmenting the corpus, finally obtaining 127,481 Chinese-Vietnamese parallel sentence pairs;
Step1.3, segmenting Chinese sentences with the jieba segmentation tool, and separating punctuation in Vietnamese with a tokenizer.
3. The method according to claim 1, characterized in that Step2 specifically comprises:
Step2.1, for any word w in the dictionary, the probability distribution is:
P(w) = (P_1(w), P_2(w), P_3(w), ..., P_|V|(w))    (1)
satisfying:
Σ_{j=1}^{|V|} P_j(w) = 1    (2)
Step2.2, a language model computes the conditional probability of every word in V given the preceding context; for the t-th word x_t in a sentence:
P_j(x_t) = LM(w_j | x_{<t})    (3).
4. The method according to claim 1, characterized in that Step3 specifically comprises:
Step3.1, counting the word frequencies of Chinese and Vietnamese separately;
Step3.2, defining low-frequency words according to the word frequency distribution: the word grade is determined by the maximum-value method, i.e. words are ranked by occurrence count from high to low, a word's grade is its word order value k, and the dictionary d_k collects the words with word order value k;
Step3.3, constructing the low-frequency word dictionary D_K:
D_K = d_1 ∪ d_2 ∪ ... ∪ d_K    (4).
5. The method according to claim 1, characterized in that Step4 specifically comprises:
Step4.1, using the constructed low-frequency word dictionary D_K to judge which words in the input sentence are low-frequency words: Y if x_t ∈ D_K, otherwise N;
Step4.2, if Y, updating the corresponding x_t with the P(x_t) trained by the language model; if N, keeping it unchanged, thereby obtaining a new source sequence X';
Step4.3, multiplying the new source sequence X' by the word embedding matrix E of the dictionary V to obtain the translation model input:
input = X'E    (5).
6. The method according to claim 1, characterized in that Step5 further comprises:
Step5.1, finally obtaining the translation result through the Transformer translation model:
output = Transformer(input, Y)    (6).
CN202110280508.1A 2021-03-16 2021-03-16 Method for enhancing Hanyue neural machine translation based on low-frequency word representation Pending CN113051936A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110280508.1A CN113051936A (en) 2021-03-16 2021-03-16 Method for enhancing Hanyue neural machine translation based on low-frequency word representation


Publications (1)

Publication Number Publication Date
CN113051936A 2021-06-29

Family

ID=76512520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110280508.1A Pending CN113051936A (en) 2021-03-16 2021-03-16 Method for enhancing Hanyue neural machine translation based on low-frequency word representation

Country Status (1)

Country Link
CN (1) CN113051936A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108038725A (en) * 2017-12-04 2018-05-15 China Jiliang University A machine-learning-based analysis method for e-commerce product customer satisfaction
CN109117480A (en) * 2018-08-17 2019-01-01 Tencent Technology (Shenzhen) Co., Ltd. Word prediction method, apparatus, computer device and storage medium
CN111428518A (en) * 2019-01-09 2020-07-17 iFLYTEK Co., Ltd. Low-frequency word translation method and device
CN112215017A (en) * 2020-10-22 2021-01-12 Inner Mongolia University of Technology A Mongolian-Chinese machine translation method based on pseudo-parallel corpus construction


Similar Documents

Publication Publication Date Title
CN110442760B (en) A synonym mining method and device for question answering retrieval system
CN108399163B (en) Text similarity measurement method combining word aggregation and word combination semantic features
CN110543639B (en) An English sentence simplification algorithm based on pre-trained Transformer language model
CN111125349A (en) Graph model text abstract generation method based on word frequency and semantics
CN106980609A A named entity recognition method using conditional random fields based on word vector representation
CN108073571B (en) Multi-language text quality evaluation method and system and intelligent text processing system
CN113032550B (en) An opinion summary evaluation system based on pre-trained language model
CN112541364A Chinese-Vietnamese neural machine translation method fusing multi-level linguistic feature knowledge
CN111339772B (en) Russian text emotion analysis method, electronic device and storage medium
CN111291558A (en) Image description automatic evaluation method based on non-paired learning
CN118333067B Lao-Chinese neural machine translation method based on code-transcription-enhanced word embedding transfer
CN111832281A (en) Composition scoring method, device, computer equipment and computer-readable storage medium
CN109298796A A word association method and device
CN116976361A (en) RLHF-based self-adaptive machine translation method and storage medium
CN115033753A (en) Training corpus construction method, text processing method and device
Peng Applied mathematics and nonlinear sciences
CN113012685B (en) Audio recognition method, device, electronic device and storage medium
CN107451116A A statistical analysis method for raw big data in mobile applications
CN111815426B (en) Data processing method and terminal related to financial investment and research
CN111859915A (en) An English text category recognition method and system based on word frequency saliency level
CN116757188A (en) A cross-language information retrieval training method based on aligned query entity pairs
CN113051936A (en) Method for enhancing Hanyue neural machine translation based on low-frequency word representation
CN111626318A (en) Multi-language harmful information feature intelligent mining method based on deep learning
CN112149405A (en) Program compiling error information feature extraction method based on convolutional neural network
Yao et al. Chinese long text summarization using improved sequence-to-sequence LSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210629)