CN110472252B - Method for Chinese-Vietnamese neural machine translation based on transfer learning - Google Patents

Method for Chinese-Vietnamese neural machine translation based on transfer learning

Info

Publication number
CN110472252B
CN110472252B · CN201910751450.7A
Authority
CN
China
Prior art keywords
english
chinese
machine translation
neural machine
translation model
Prior art date
Legal status
Active
Application number
CN201910751450.7A
Other languages
Chinese (zh)
Other versions
CN110472252A (en)
Inventor
余正涛
黄继豪
郭军军
文永华
高盛祥
王振晗
Current Assignee
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN201910751450.7A
Publication of CN110472252A
Application granted
Publication of CN110472252B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method for Chinese-Vietnamese neural machine translation based on transfer learning, belonging to the technical field of natural language processing. The invention comprises the following steps: corpus collection and preprocessing: collecting and preprocessing Chinese-Vietnamese, English-Vietnamese, and Chinese-English parallel sentence pairs; generating a Chinese-English-Vietnamese trilingual parallel corpus from the Chinese-English and English-Vietnamese parallel corpora; training a Chinese-English neural machine translation model and an English-Vietnamese neural machine translation model, and initializing the parameters of the Chinese-Vietnamese neural machine translation model with the parameters of the pre-trained models; and fine-tuning the initialized Chinese-Vietnamese neural machine translation model on the Chinese-Vietnamese parallel corpus to obtain the Chinese-Vietnamese neural machine translation model used for Chinese-Vietnamese translation. The invention can effectively improve the performance of Chinese-Vietnamese neural machine translation.

Description

Method for Chinese-Vietnamese neural machine translation based on transfer learning
Technical Field
The invention relates to a method for Chinese-Vietnamese neural machine translation based on transfer learning, belonging to the technical field of natural language processing.
Background
In recent years, exchanges between China and Vietnam have become increasingly frequent, and the demand for translation technology in low-resource scenarios such as Chinese-Vietnamese keeps growing. However, the performance of Chinese-Vietnamese neural machine translation is currently far from ideal, so improving the Chinese-Vietnamese neural machine translation system plays a very important role in communication between the two countries. End-to-end neural machine translation (NMT) is a new translation paradigm that uses a neural network to learn the mapping from source-language text to target-language text directly. Neural machine translation has achieved good translation quality on resource-rich language pairs and compelling results on many translation tasks. On the Chinese-Vietnamese task, however, it is still limited by the scale and quality of the parallel corpus: corpus resources are scarce and no large-scale Chinese-Vietnamese parallel corpus exists, so Chinese-Vietnamese neural machine translation performs poorly. Improving Chinese-Vietnamese machine translation therefore has a very important application prospect.
Pivot-language and transfer-learning methods are currently among the effective approaches to the poor performance of neural machine translation in low-resource scenarios. A pivot language bridges the source and target languages: the existing source-pivot and pivot-target parallel corpora are used to train source-to-pivot and pivot-to-target translation models, respectively. The advantage of this method is that translation between the source and target languages becomes possible even when no bilingual corpus is available for the language pair in the low-resource scenario. In addition, the neural machine translation task essentially requires the model to produce target-language sentences without losing the information in the source-language sentences, which makes it well suited to transfer learning. Compared with the pivot-language method, transfer learning can directly improve the parameters of the source-target model, so many researchers have carried out research in this area: the parameters of a model trained on a resource-rich language pair are used to initialize the parameters of the translation model in the low-resource scenario. However, these training processes lack the guidance of the small-scale bilingual parallel corpus, so the multilingual input is noisy. Moreover, the above methods focus on improving the parameters of the low-resource model as a whole and make no separate improvement to the encoder or the decoder. Chinese-Vietnamese neural machine translation is neural machine translation in a low-resource scenario: its training corpus is scarce, but large-scale Chinese-English and English-Vietnamese parallel corpora exist, so it is well suited to transfer learning and the pivot-language method. The invention therefore proposes a method for Chinese-Vietnamese neural machine translation based on transfer learning, which addresses the poor performance of Chinese-Vietnamese machine translation in the low-resource scenario.
Disclosure of Invention
The invention provides a method for Chinese-Vietnamese neural machine translation based on transfer learning, which is used to address the poor translation performance of Chinese-Vietnamese neural machine translation.
The technical scheme of the invention is as follows: the method for Chinese-Vietnamese neural machine translation based on transfer learning comprises the following specific steps:
Step 1, corpus collection and preprocessing: collect and preprocess Chinese-Vietnamese, English-Vietnamese, and Chinese-English parallel sentence pairs;
As a preferred embodiment of the present invention, Step 1 specifically comprises the following steps:
Step 1.1, crawl Chinese-Vietnamese, English-Vietnamese, and Chinese-English parallel sentence pairs with a web crawler, and set aside part of the data as a test set and a validation set;
Step 1.2, manually screen the crawled corpora, then perform word segmentation, replace Arabic numerals with the token "num", and filter out garbled text, so that the neural machine translation model achieves better results.
Step 2, generate a Chinese-English-Vietnamese trilingual parallel corpus from the Chinese-English and English-Vietnamese parallel corpora;
As a preferred embodiment of the present invention, Step 2 comprises the following specific steps:
Step 2.1, on the existing Chinese-English and English-Vietnamese data sets, apply back-translation through the pivot language, English: train an attention-based English-Chinese neural machine translation model on the English-Chinese parallel corpus, and then use the trained model to back-translate the English side of the English-Vietnamese parallel corpus into Chinese, thereby obtaining a Chinese-English-Vietnamese parallel corpus;
Step 2.2, apply a data-enhancement method to the Chinese-English-Vietnamese parallel corpus obtained in Step 2.1, replacing rare words in the Vietnamese corpus to expand the Chinese-English-Vietnamese parallel corpus.
Step 3, train a Chinese-English neural machine translation model and an English-Vietnamese neural machine translation model, and initialize the parameters of the Chinese-Vietnamese neural machine translation model with the parameters of the pre-trained models;
As a preferred embodiment of the present invention, Step 3 comprises the following specific steps:
In the basic neural machine translation model the source language is encoded into a fixed-length vector, but a fixed-length vector cannot adequately express the semantic information of the source-language sentence and its relation to the context; to solve this problem, an attention mechanism is introduced into the trained neural machine translation model;
Step 3.1, train a neural machine translation model with an attention mechanism on the Chinese-English and the English-Vietnamese parallel corpora respectively, obtaining an attention-based Chinese-English neural machine translation model and an attention-based English-Vietnamese neural machine translation model;
Step 3.2, initialize the encoder and decoder parameters of the Chinese-Vietnamese neural machine translation model with the Chinese encoder parameters of the Chinese-English model and the Vietnamese decoder parameters of the English-Vietnamese model.
Step 4, fine-tune the initialized Chinese-Vietnamese neural machine translation model on the Chinese-Vietnamese parallel corpus to obtain the Chinese-Vietnamese neural machine translation model used for Chinese-Vietnamese translation.
Because corpus resources are scarce and no large-scale Chinese-Vietnamese parallel corpus is available, the encoder of the Chinese-Vietnamese model yields poor semantic representations, which hurts Chinese-Vietnamese translation performance. Large-scale Chinese-English and English-Vietnamese parallel corpora do exist, however, and reusing the parameters of neural machine translation models trained on them is exactly the idea of transfer learning.
In Step 3:
The basic neural machine translation model represents a source-language sentence as a single fixed-length vector. Its disadvantage is that a fixed-length vector cannot fully express the semantic information of the source-language sentence and its relation to the context. An attention mechanism allows a neural network to focus on only part of its input, i.e. to select particular inputs. Attention-based neural machine translation first encodes the source-language sentence into a sequence of vectors; then, when generating each target-language word, it dynamically attends to the source-language words related to the word being generated, which greatly strengthens the expressive power of neural machine translation.
Neural machine translation is a data-driven language conversion process, and its performance depends on the scale and quality of the parallel corpus. The scale and quality of the Chinese-Vietnamese parallel corpus are limited, so the training data are insufficient and the encoder-decoder parameters cannot be optimized well. Transfer learning applies knowledge learned on one task to similar tasks: in a task under a low-resource scenario, the parameters learned from a high-resource task are used to improve the performance of the low-resource task, which reduces the amount of data the task requires. The invention therefore pre-trains attention-based Chinese-English and English-Vietnamese neural machine translation models on the large-scale Chinese-English and English-Vietnamese corpora, and initializes the encoder and decoder parameters of the attention-based Chinese-Vietnamese model with the Chinese encoder and the Vietnamese decoder.
The beneficial effects of the invention are:
1. First, a Chinese-English-Vietnamese parallel corpus is obtained by back-translation with the Chinese-English parallel corpus together with data enhancement, and it is added to the training corpus, which makes the parameters used to initialize the model in the next step more relevant;
2. The invention pre-trains the neural machine translation models on the Chinese-English-Vietnamese parallel corpus and initializes the encoder and decoder parameters of the Chinese-Vietnamese neural machine translation model with the parameters of the Chinese encoder and the Vietnamese decoder, so that the Chinese-Vietnamese model does not start training from randomly initialized parameters and can express semantic information more accurately. Finally, fine-tuning on the small-scale Chinese-Vietnamese corpus yields the Chinese-Vietnamese neural machine translation model; this further optimizes the initialized model and can effectively improve Chinese-Vietnamese neural machine translation performance;
3. The invention adopts the idea of transfer learning, so that the encoder of the Chinese-Vietnamese neural machine translation model represents the semantic information of the source language better and the decoding works better.
Drawings
FIG. 1 is a detailed flow chart of the present invention;
FIG. 2 is a flow chart of the training process of the proposed transfer-learning-based Chinese-Vietnamese neural machine translation.
Detailed Description
Example 1: as shown in FIGS. 1-2, a method for Chinese-Vietnamese neural machine translation based on transfer learning comprises the following steps:
Step 1, crawl the training corpora with a web crawler: 100,000 Chinese-Vietnamese sentence pairs, 700,000 English-Vietnamese sentence pairs, and 50,000,000 Chinese-English sentence pairs; manually screen the crawled corpora and filter out garbled text; set aside part of the data as a test set and a validation set;
The crawled corpora are manually screened and then word-segmented; Arabic numerals are replaced with the token "num" and garbled text is filtered out, as in the sketch below.
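For illustration only, the following Python sketch shows one way this preprocessing could be implemented; the file layout, the garbled-text heuristic, and the function names are assumptions of this sketch rather than details specified by the patent (word segmentation itself would be done beforehand with an external segmenter).

```python
import re

def is_garbled(line: str) -> bool:
    # Drop lines containing the Unicode replacement character or stray
    # control characters; the patent does not specify its garbled-text
    # filter, so this heuristic is an assumption.
    return any(ch == "\ufffd" or (ord(ch) < 32 and ch not in "\t\n")
               for ch in line)

def preprocess(line: str) -> str:
    # Replace every run of Arabic numerals (optionally with a decimal
    # part) by the single token "num", as Step 1.2 prescribes.
    line = re.sub(r"\d+(?:\.\d+)?", "num", line)
    return " ".join(line.split())

def clean_corpus(src_path: str, dst_path: str) -> None:
    with open(src_path, encoding="utf-8") as fin, \
         open(dst_path, "w", encoding="utf-8") as fout:
        for raw in fin:
            if is_garbled(raw):
                continue          # garbled-text filtering
            fout.write(preprocess(raw) + "\n")
```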
Step2, in the existing data set of Chinese-English and English-Vietnamese, a method for retranslating axial language and English is used, firstly, a 4-layer attention-based neural machine translation system with a word list of 32000 trains an attention-based English-Chinese neural machine translation model by adopting large-scale English-Chinese parallel linguistic data, and secondly, the trained attention-based English-Chinese neural machine translation model retranslates English in English-Vietnamese parallel linguistic data into Chinese, so that Chinese-English-Vietnamese parallel linguistic data are obtained;
The Chinese-English-Vietnamese parallel corpus obtained in Step 2.1 is then expanded with a data-enhancement method that replaces rare words in the Vietnamese corpus: the rare-word frequency threshold in the Vietnamese corpus is set to 20, only one rare word is replaced at a time, and replacing the rare words in sentence pairs expands the Chinese-English-Vietnamese parallel corpus;
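The following sketch illustrates one possible form of this rare-word augmentation under the stated threshold of 20; the patent does not specify how the substitute word is chosen, so drawing it from the same rare vocabulary is an assumption of this sketch.

```python
from collections import Counter
import random

def rare_vocab(vi_sentences, threshold: int = 20):
    # Step 2.2: words whose frequency in the Vietnamese corpus does not
    # exceed the stated threshold of 20 are treated as rare.
    counts = Counter(w for s in vi_sentences for w in s.split())
    return {w for w, c in counts.items() if c <= threshold}

def augment(triples, rare):
    # Replace exactly one rare Vietnamese word per sentence pair (the
    # patent replaces only one rare word at a time); choosing the
    # substitute from the rare vocabulary is an assumption here.
    new_triples = []
    for zh, en, vi in triples:
        words = vi.split()
        positions = [i for i, w in enumerate(words) if w in rare]
        if not positions:
            continue
        i = random.choice(positions)
        candidates = sorted(rare - {words[i]})
        if not candidates:
            continue
        words[i] = random.choice(candidates)
        new_triples.append((zh, en, " ".join(words)))
    return new_triples
```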
Step 3, train the Chinese-English and English-Vietnamese neural machine translation models, and initialize the parameters of the Chinese-Vietnamese neural machine translation model with the parameters of the pre-trained models;
To solve the problem that the basic neural machine translation model encodes the source language into a fixed-length vector that cannot adequately express the semantic information of the source-language sentence and its relation to the context, an attention mechanism is introduced into the trained neural machine translation model;
As a preferred embodiment of the present invention, Step 3 comprises the following specific steps:
Step 3.1, train a neural machine translation model with an attention mechanism on the Chinese-English and English-Vietnamese parallel corpora respectively, obtaining an attention-based Chinese-English neural machine translation model and an attention-based English-Vietnamese neural machine translation model;
As shown in FIG. 2, two models (Pre-train Model A and Pre-train Model B) are first obtained by training on the Chinese-English and the English-Vietnamese parallel corpora. In training both the attention-based Chinese-English model and the attention-based English-Vietnamese model, the sequence of source-language words is denoted w^x = (w^x_1, ..., w^x_n) and the sequence of target-language words is denoted w^z = (w^z_1, ..., w^z_m). Let GloVe(w^x) be the sequence of GloVe vectors corresponding to the words in w^x, and let z be the sequence of randomly initialized word vectors corresponding to the words in w^z. GloVe(w^x) is fed to a two-layer, bidirectional LSTM (Long Short-Term Memory network), called the NMT-LSTM, which computes the sequence of hidden states:

h = NMT-LSTM(GloVe(w^x))    (1)
In this machine translation model, the NMT-LSTM feeds an attention-driven decoder that at each step computes, from a context-adjusted hidden state h̃_t, the conditional probability of the next target word. At step t, given the previous target-word embedding z_{t-1}, the decoder, a unidirectional two-layer LSTM, combines it with the previous context-adjusted hidden state h̃_{t-1} to obtain the hidden state h^dec_t, as follows:

h^dec_t = LSTM([z_{t-1}; h̃_{t-1}], h^dec_{t-1})    (2)

The decoder then computes a vector of attention weights α_t measuring the relevance of each encoding step to the current decoder state:

α_t = softmax(H(W_1 h^dec_t + b_1))    (3)

where H is the stack of the encoder hidden states h along the time dimension. The context-adjusted hidden state h̃_t is obtained by a weighted sum of the encoder states under the attention weights, followed by a tanh nonlinearity; the specific formula is:

h̃_t = tanh(W_2 [Hᵀ α_t; h^dec_t] + b_2)    (4)

The probability distribution over output words is generated by a final transformation of the context-adjusted hidden state:

p(w^z_t | X, w^z_1, ..., w^z_{t-1}) = softmax(W_out h̃_t + b_out)    (5)
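As a concrete illustration of equations (1)-(5), a minimal PyTorch sketch follows; the class name, layer dimensions, and the use of nn.LSTM modules are assumptions of this sketch rather than the patent's exact implementation. The logits it returns are the argument of the softmax in equation (5).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnNMT(nn.Module):
    """Attention-based NMT model following equations (1)-(5)."""

    def __init__(self, emb_dim: int, hid_dim: int, vocab_size: int):
        super().__init__()
        # Eq. (1): two-layer bidirectional NMT-LSTM encoder
        # (hid_dim is assumed even, so the two directions sum to hid_dim).
        self.encoder = nn.LSTM(emb_dim, hid_dim // 2, num_layers=2,
                               bidirectional=True, batch_first=True)
        # Eq. (2): unidirectional two-layer decoder LSTM whose input is
        # the concatenation [z_{t-1}; h~_{t-1}].
        self.decoder = nn.LSTM(emb_dim + hid_dim, hid_dim, num_layers=2,
                               batch_first=True)
        self.W1 = nn.Linear(hid_dim, hid_dim)        # eq. (3): W_1, b_1
        self.W2 = nn.Linear(2 * hid_dim, hid_dim)    # eq. (4): W_2, b_2
        self.out = nn.Linear(hid_dim, vocab_size)    # eq. (5): W_out, b_out

    def forward(self, src_emb: torch.Tensor, tgt_emb: torch.Tensor):
        # src_emb: (B, S, emb_dim) GloVe embeddings of w^x.
        # tgt_emb: (B, T, emb_dim) target embeddings, shifted right so
        # that tgt_emb[:, t] plays the role of z_{t-1} (teacher forcing).
        H, _ = self.encoder(src_emb)               # (1): h = NMT-LSTM(...)
        B = tgt_emb.size(0)
        h_tilde = src_emb.new_zeros(B, H.size(-1))  # h~_0
        state, logits = None, []
        for t in range(tgt_emb.size(1)):
            dec_in = torch.cat([tgt_emb[:, t], h_tilde], dim=-1).unsqueeze(1)
            dec_out, state = self.decoder(dec_in, state)          # (2)
            h_dec = dec_out.squeeze(1)
            # (3): alpha_t = softmax(H (W_1 h^dec_t + b_1))
            scores = torch.bmm(H, self.W1(h_dec).unsqueeze(-1)).squeeze(-1)
            alpha = F.softmax(scores, dim=-1)
            # (4): h~_t = tanh(W_2 [H^T alpha_t; h^dec_t] + b_2)
            ctx = torch.bmm(alpha.unsqueeze(1), H).squeeze(1)
            h_tilde = torch.tanh(self.W2(torch.cat([ctx, h_dec], dim=-1)))
            logits.append(self.out(h_tilde))        # (5), before softmax
        return torch.stack(logits, dim=1)           # (B, T, vocab_size)
```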
Step 3.2, when training the Chinese-to-Vietnamese neural machine translation model, the Chinese encoder parameters of the Chinese-English neural machine translation model are used to initialize the encoder parameters of the Chinese-Vietnamese model, and the Vietnamese decoder parameters of the English-Vietnamese neural machine translation model are used to initialize the decoder parameters of the Chinese-Vietnamese model.
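The following sketch illustrates this initialization under the assumption that all three models share the layout of the AttnNMT sketch above and that the Chinese-Vietnamese and English-Vietnamese models share the Vietnamese target vocabulary; the parameter-name prefixes are taken from that sketch, not from the patent.

```python
def transfer_init(zh_vi_model: AttnNMT, zh_en_model: AttnNMT,
                  en_vi_model: AttnNMT) -> None:
    """Step 3.2: initialize the zh-vi model from the two pre-trained models."""
    state = zh_vi_model.state_dict()
    for name, tensor in zh_en_model.state_dict().items():
        if name.startswith("encoder."):        # Chinese encoder parameters
            state[name] = tensor.clone()
    for name, tensor in en_vi_model.state_dict().items():
        if not name.startswith("encoder."):    # Vietnamese decoder,
            state[name] = tensor.clone()       # attention, and output layers
    zh_vi_model.load_state_dict(state)
```

Copying every non-encoder parameter transfers the attention and output layers together with the decoder LSTM, i.e. the whole target side is initialized from the English-Vietnamese model.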
Step 4, fine-tune the initialized Chinese-Vietnamese neural machine translation model on the Chinese-Vietnamese parallel corpus to obtain the Chinese-Vietnamese neural machine translation model used for Chinese-Vietnamese translation.
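A minimal fine-tuning sketch for this step is given below; it assumes the AttnNMT interface from the sketch above, and the optimizer, learning rate, and batch format are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fine_tune(model, zh_vi_batches, epochs: int = 10, lr: float = 1e-4):
    """Step 4: continue training the transfer-initialized model on the
    small Chinese-Vietnamese parallel corpus (teacher forcing)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for src_emb, tgt_emb, tgt_ids in zh_vi_batches:
            logits = model(src_emb, tgt_emb)
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   tgt_ids.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
```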
The model with the initialized parameters is fine-tuned (Fine-tune Model C) on the Chinese-Vietnamese parallel corpus to obtain the Chinese-Vietnamese neural machine translation model. Table 1 compares the BLEU scores of the baseline systems and the transfer-learning-based Chinese-Vietnamese neural machine translation model (TLNMT) in both the Chinese-to-Vietnamese and the Vietnamese-to-Chinese translation directions, and Table 2 gives example translations comparing the baseline systems and TLNMT in the Chinese-to-Vietnamese direction.
Table 1. BLEU comparison of the different models
[Table 1 is provided as an image in the original publication; the gains it reports are quoted in the text below.]
Table 2. Example translations of the different models
[Table 2 is provided as an image in the original publication.]
The experimental results show that the TLNMT method for Chinese-Vietnamese bilingual neural machine translation clearly outperforms the other methods. Compared with the NMT baseline, TLNMT improves BLEU by 4.48 in the Chinese-to-Vietnamese direction and by 1.66 in the Vietnamese-to-Chinese direction. Compared with the OpenNMT model, TLNMT gains 1.16 BLEU in the Chinese-to-Vietnamese direction and 1.05 BLEU in the Vietnamese-to-Chinese direction.
From the first sentence group in Table 2, the OpenNMT translation is inaccurate: the words rendered as "Hubble" and "trace", whose Vietnamese forms appear as images in the original publication, are left untranslated. In the training set and test set, numbers are replaced by "num". In the second sentence group, the OpenNMT translation omits even more words than in the first group, such as the words rendered as "edge", "diffraction", and "soft" (again shown as Vietnamese text images in the original); and the "num" token of the source sentence does not appear in the OpenNMT translation but does appear in the TLNMT translation. The reason for these problems is that the missing words occur infrequently in the training corpus, so the neural machine translation model cannot learn good semantic representations of these low-frequency words and drops them. The invention adopts the ideas of transfer learning and pivot language, so that the encoder of the Chinese-Vietnamese neural machine translation model expresses the semantic information of the source language better and the decoding works better, which is why TLNMT produces better translations.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to those embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims (3)

1. A method for Chinese-Vietnamese neural machine translation based on transfer learning, characterized by comprising the following specific steps:
Step 1, corpus collection and preprocessing: collecting and preprocessing Chinese-Vietnamese, English-Vietnamese, and Chinese-English parallel sentence pairs;
Step 2, generating a Chinese-English-Vietnamese trilingual parallel corpus from the Chinese-English and English-Vietnamese parallel corpora;
Step 3, training a Chinese-English neural machine translation model and an English-Vietnamese neural machine translation model, and initializing the parameters of the Chinese-Vietnamese neural machine translation model with the parameters of the pre-trained models;
Step 3 is specifically as follows:
introducing an attention mechanism into the trained neural machine translation model; training a neural machine translation model with the attention mechanism on the Chinese-English and English-Vietnamese parallel corpora respectively to obtain an attention-based Chinese-English neural machine translation model and an attention-based English-Vietnamese neural machine translation model; and then initializing the encoder and decoder parameters of the Chinese-Vietnamese neural machine translation model with the Chinese encoder parameters of the Chinese-English model and the Vietnamese decoder parameters of the English-Vietnamese model;
Step 4, fine-tuning the initialized Chinese-Vietnamese neural machine translation model on the Chinese-Vietnamese parallel corpus to obtain the Chinese-Vietnamese neural machine translation model used for Chinese-Vietnamese translation.
2. The method for Chinese-Vietnamese neural machine translation based on transfer learning according to claim 1, wherein Step 1 specifically comprises:
Step 1.1, crawling Chinese-Vietnamese, English-Vietnamese, and Chinese-English parallel sentence pairs with a web crawler, and setting aside part of the data as a test set and a validation set;
Step 1.2, manually screening the crawled corpora, then performing word segmentation, replacing Arabic numerals with the token "num", and filtering out garbled text.
3. The method for Chinese-Vietnamese neural machine translation based on transfer learning according to claim 1, wherein Step 2 specifically comprises:
Step 2.1, on the existing Chinese-English and English-Vietnamese data sets, applying back-translation through the pivot language, English: training an attention-based English-Chinese neural machine translation model on the English-Chinese parallel corpus, and then using the trained model to back-translate the English side of the English-Vietnamese parallel corpus into Chinese, thereby obtaining a Chinese-English-Vietnamese parallel corpus;
Step 2.2, applying a data-enhancement method to the Chinese-English-Vietnamese parallel corpus obtained in Step 2.1, replacing rare words in the Vietnamese corpus to expand the Chinese-English-Vietnamese parallel corpus.
CN201910751450.7A 2019-08-15 2019-08-15 Method for Chinese-Vietnamese neural machine translation based on transfer learning Active CN110472252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910751450.7A CN110472252B (en) 2019-08-15 2019-08-15 Method for Chinese-Vietnamese neural machine translation based on transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910751450.7A CN110472252B (en) 2019-08-15 2019-08-15 Method for Chinese-Vietnamese neural machine translation based on transfer learning

Publications (2)

Publication Number Publication Date
CN110472252A CN110472252A (en) 2019-11-19
CN110472252B (en) 2022-12-13

Family

ID=68511726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910751450.7A Active CN110472252B (en) 2019-08-15 2019-08-15 Method for translating Hanyue neural machine based on transfer learning

Country Status (1)

Country Link
CN (1) CN110472252B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111104807B (en) * 2019-12-06 2024-05-24 北京搜狗科技发展有限公司 Data processing method and device and electronic equipment
CN111178094B (en) * 2019-12-20 2023-04-07 沈阳雅译网络技术有限公司 Pre-training-based scarce resource neural machine translation training method
CN111680520A (en) * 2020-04-30 2020-09-18 昆明理工大学 Synonym data enhancement-based Chinese-Vietnamese neural machine translation method
CN112287694A (en) * 2020-09-18 2021-01-29 昆明理工大学 Shared encoder-based Chinese-Vietnamese unsupervised neural machine translation method
CN112257460B (en) * 2020-09-25 2022-06-21 昆明理工大学 Pivot-based Chinese-Vietnamese joint training neural machine translation method
CN112215017B (en) * 2020-10-22 2022-04-29 内蒙古工业大学 Mongolian Chinese machine translation method based on pseudo parallel corpus construction
CN112633018B (en) * 2020-12-28 2022-04-15 内蒙古工业大学 Mongolian Chinese neural machine translation method based on data enhancement
CN113239708B (en) * 2021-04-28 2023-06-20 华为技术有限公司 Model training method, translation method and device
CN113657122B (en) * 2021-09-07 2023-12-15 内蒙古工业大学 Mongolian machine translation method of pseudo parallel corpus integrating transfer learning


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8645289B2 (en) * 2010-12-16 2014-02-04 Microsoft Corporation Structured cross-lingual relevance feedback for enhancing search results

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5787386A (en) * 1992-02-11 1998-07-28 Xerox Corporation Compact encoding of multi-lingual translation dictionaries
CN102111160A (en) * 2010-11-23 2011-06-29 中国科学技术大学 Coding and decoding system and codec for reactive system test
US10268685B2 (en) * 2015-08-25 2019-04-23 Alibaba Group Holding Limited Statistics-based machine translation method, apparatus and electronic device
CN107092594A (en) * 2017-04-19 2017-08-25 厦门大学 Graph-based bilingual recursive autoencoder
CN108363704A (en) * 2018-03-02 2018-08-03 北京理工大学 Neural machine translation corpus expansion method based on a statistical phrase table
CN108536687A (en) * 2018-04-20 2018-09-14 王立山 Language translation method and system for a mind machine based on a predicate-calculus-like form
CN108829684A (en) * 2018-05-07 2018-11-16 内蒙古工业大学 Mongolian-Chinese neural machine translation method based on a transfer learning strategy
CN109213851A (en) * 2018-07-04 2019-01-15 中国科学院自动化研究所 Cross-lingual transfer method for spoken language understanding in dialogue systems
CN109117483A (en) * 2018-07-27 2019-01-01 清华大学 Training method and device for a neural network machine translation model
CN109446535A (en) * 2018-10-22 2019-03-08 内蒙古工业大学 Mongolian-Chinese neural machine translation method based on a triangle framework

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A Correlational Encoder Decoder Architecture for Pivot Based Sequence Generation; Amrita Saha et al.; arXiv:1606.04754; 2016-06-15 *
Multilingual Neural Machine Translation for Low-Resource Languages; Surafel M. Lakew et al.; Italian Journal of Computational Linguistics; 2018-04-01; vol. 4, no. 1; pp. 11-25 *
Research on multilingual neural machine translation based on a pivot language; Liu Qingmin et al.; Science and Technology Innovation; 2019-02-15; pp. 86-87 *
Research on a Mongolian-Chinese query term expansion method based on cross-lingual word vector models; Ma Lujia et al.; Journal of Chinese Information Processing; 2019-06-15; vol. 33, no. 6; pp. 27-34 *
A survey of neural machine translation; Li Yachao et al.; Chinese Journal of Computers; 2018-12-15; vol. 41, no. 12; pp. 2735-2755 *

Also Published As

Publication number Publication date
CN110472252A (en) 2019-11-19

Similar Documents

Publication Publication Date Title
CN110472252B (en) Method for Chinese-Vietnamese neural machine translation based on transfer learning
CN110334361B (en) Neural machine translation method for Chinese language
CN107357789B (en) Neural machine translation method fusing multi-language coding information
CN111382580B (en) Encoder-decoder framework pre-training method for neural machine translation
CN109359294B (en) Ancient Chinese translation method based on neural machine translation
CN111178094B (en) Pre-training-based scarce resource neural machine translation training method
CN109271643A (en) A kind of training method of translation model, interpretation method and device
CN111916067A (en) Training method and device of voice recognition model, electronic equipment and storage medium
CN112287688B (en) English-Burmese bilingual parallel sentence pair extraction method and device integrating pre-training language model and structural features
CN108829684A (en) Mongolian-Chinese neural machine translation method based on a transfer learning strategy
CN111783462A (en) Chinese named entity recognition model and method based on dual neural network fusion
CN111241816B (en) Automatic news headline generation method
CN113283244B (en) Pre-training model-based bidding data named entity identification method
CN110688862A (en) Mongolian-Chinese inter-translation method based on transfer learning
CN110163181B (en) Sign language identification method and device
CN114757182A (en) BERT short text sentiment analysis method for improving training mode
CN111581970B (en) Text recognition method, device and storage medium for network context
CN104462072A (en) Input method and device oriented at computer-assisting translation
CN112464676A (en) Machine translation result scoring method and device
CN110569505A (en) text input method and device
CN110427629A (en) Semi-supervised text simplified model training method and system
CN113190656A (en) Chinese named entity extraction method based on multi-label framework and fusion features
CN109145946B (en) Intelligent image recognition and description method
CN111666756A (en) Sequence model text abstract generation method based on topic fusion
CN113609284A (en) Method and device for automatically generating text abstract fused with multivariate semantics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant