CN111753557B - Chinese-Vietnamese unsupervised neural machine translation method fusing EMD-minimized bilingual dictionary


Info

Publication number
CN111753557B
Authority
CN
China
Prior art keywords
chinese, word, bilingual, emd, dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010096013.9A
Other languages
Chinese (zh)
Other versions
CN111753557A (en)
Inventor
余正涛
薛明亚
高盛祥
赖华
翟家欣
朱恩昌
陈玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202010096013.9A
Publication of CN111753557A
Application granted
Publication of CN111753557B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques


Abstract

The invention relates to a Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary, belonging to the technical field of machine translation. The invention comprises the following steps: corpus collection, crawling Chinese and Vietnamese monolingual sentences with a web crawler; first, training monolingual word embeddings for Chinese and Vietnamese separately, and obtaining a Chinese-Vietnamese bilingual dictionary by training to minimize the EMD between the word embedding distributions; then, using this dictionary as a seed dictionary to train Chinese-Vietnamese bilingual word embeddings; finally, applying the bilingual word embeddings in an unsupervised machine translation model with a shared encoder, constructing a Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary. The method can effectively improve the performance of Chinese-Vietnamese unsupervised neural machine translation.

Description

Chinese-Vietnamese unsupervised neural machine translation method fusing EMD-minimized bilingual dictionary
Technical Field
The invention relates to a Chinese-Vietnamese unsupervised neural machine translation method fused with an EMD (Earth Mover's Distance) minimized bilingual dictionary, belonging to the technical field of machine translation.
Background
Neural machine translation is a machine translation approach proposed in recent years; its translation quality has surpassed that of statistical machine translation on many language pairs, and it has become the mainstream translation method. However, neural machine translation requires large-scale parallel corpora to achieve good results, and when training data is insufficient, translation quality is poor. Parallel corpora between Chinese and Vietnamese are scarce and not readily available, so Chinese-Vietnamese machine translation is a typical low-resource machine translation task. Large amounts of monolingual data do exist for both Chinese and Vietnamese, however, and Chinese-Vietnamese unsupervised neural machine translation that uses only monolingual corpora is explored herein. This is of great value for promoting exchange and cooperation between the two countries, and of important theoretical and practical value for research on machine translation of low-resource languages.
Current research on unsupervised machine translation mainly comprises unsupervised machine translation based on adversarial learning and unsupervised machine translation based on a shared encoder (shared space). Lample et al. proposed mapping sentences of two different monolingual corpora into the same space, learning to reconstruct from both languages in a shared feature space, and thereby realizing unsupervised neural machine translation using only monolingual corpora. Artetxe et al. modified the model with pre-trained unsupervised bilingual word embeddings, using a shared encoder and separate decoders to perform unsupervised neural machine translation using only monolingual corpora. The weight-sharing unsupervised machine translation model proposed by Yang et al. preserves the characteristics and internal features of each language better than the shared-encoder model, so as to improve translation quality, and Lample et al. further improved unsupervised neural machine translation by combining neural machine translation with phrase-based statistical machine translation. Lample et al. also proposed cross-lingual language model pre-training for initializing the lookup table, improving the quality of pre-trained cross-lingual word embeddings and significantly improving the performance of unsupervised machine translation models. These methods use cognate words between similar languages, or digit alignment extracted from monolingual corpora, as the initial cross-lingual signal, and then extend it by learning to realize unsupervised neural machine translation. Chinese and Vietnamese differ greatly, and no usable cognates exist between them, so methods relying on cognates are not feasible for the Chinese-Vietnamese language pair; the shared-encoder unsupervised neural machine translation of Artetxe et al., built on unsupervised bilingual word vectors, suits the characteristics of language pairs with large differences. The invention therefore chooses to extend the work of Artetxe et al.; however, the quality of bilingual word embeddings learned using only the Arabic numerals shared between the two languages is limited, so the idea of the invention is to improve unsupervised bilingual word embedding quality in order to improve Chinese-Vietnamese unsupervised neural machine translation quality.
In unsupervised machine translation using only Chinese and Vietnamese monolingual corpora, machine translation is difficult to realize directly, but acquiring a bilingual dictionary is relatively easy. The invention therefore trains a bilingual dictionary from the Chinese and Vietnamese monolingual corpora and then uses this Chinese-Vietnamese bilingual dictionary as a seed dictionary to guide the training of higher-quality bilingual word embeddings, thereby improving Chinese-Vietnamese unsupervised neural machine translation quality. Zhang et al. proposed exploiting the similarity of the word vector space distributions of two languages and training a bilingual dictionary by EMD minimization; the whole process is an unsupervised training scheme using only monolingual corpora, with quality comparable to supervised methods, which suits the large differences between Chinese and Vietnamese. A Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary is therefore proposed herein.
The method first regards the word embeddings of the Chinese and Vietnamese monolingual corpora as two probability distributions and obtains a Chinese-Vietnamese bilingual dictionary by training to minimize the EMD between the two word embedding distributions; it then uses this bilingual dictionary as a seed dictionary and trains Chinese-Vietnamese bilingual word embeddings with a self-learning method, finally realizing Chinese-Vietnamese unsupervised neural machine translation on a shared-encoder model.
Disclosure of Invention
The invention provides a Chinese-Vietnamese unsupervised neural machine translation method fused with an EMD-minimized bilingual dictionary, which is used in unsupervised translation systems for low-resource languages and improves Chinese-Vietnamese unsupervised neural machine translation performance.
The technical scheme of the invention is as follows: the Chinese-Vietnamese unsupervised neural machine translation method fusing the EMD-minimized bilingual dictionary comprises the following specific steps:
step1, corpus collection: crawling Chinese and Vietnamese monolingual corpora by using a web crawler; the monolingual corpus is mainly from Chinese and Vietnamese monolingual news websites;
step2, corpus preprocessing: on the basis of Step1, performing word segmentation and part-of-speech tagging on the Chinese and Vietnamese monolingual sentences with a word segmentation and part-of-speech tagging tool, and obtaining Chinese and Vietnamese monolingual word embeddings with a word vector training tool, yielding monolingual word vectors; when the separately trained Chinese and Vietnamese monolingual word vectors are mapped into a vector space, the two monolingual word vector spaces exhibit approximate isomorphism, meaning that a linear mapping exists that can approximately connect the two spaces;
step3, unsupervised bilingual dictionary based on EMD minimization: on the basis of Step2, training an unsupervised Chinese-Vietnamese bilingual dictionary from the Chinese and Vietnamese monolingual word vectors using an EMD-minimization method;
as a preferred embodiment of the present invention, the Step3 specifically comprises the following steps:
step3, an EMD-minimization method is used between the Chinese word vector distribution and the Vietnamese word vector distribution: the word vectors are regarded as probability distributions, the distance between the distributions is taken as a vocabulary-level criterion, and training proceeds in an unsupervised manner without any seed dictionary to find the minimum EMD between the Chinese and Vietnamese word vector distributions, obtaining the Chinese-Vietnamese bilingual dictionary.
The dots in fig. 3 are regarded as mounds of earth and the squares as potholes, and their sizes represent the volumes of the mounds and potholes, i.e. the corresponding weights. In the example of fig. 3, all weights are equal. Under this setting, it is desired to move the mounds to fill the potholes at minimal overall cost, measured by the product of the distance moved and the volume of earth moved. The arrows in fig. 3(b) represent the optimal movement scheme in this example, and this scheme can be read directly as the result of vocabulary translation. From a microscopic view, once the earth of the mound matched to the "music" pothole has been used up entirely in filling it, that mound no longer competes for the "dance" pothole, so the remaining mound is responsible for filling the "dance" pothole. From a macroscopic view, minimizing the overall movement cost takes global information into account, thereby overcoming the locality of nearest-neighbor search and alleviating the hubness problem. This notion of global weighted matching can be implemented mathematically with EMD, whose very name derives from the above metaphor.
The corresponding optimization problem is

$$\min_{W}\ \sum_{i=1}^{V_{s}}\sum_{j=1}^{V_{t}} C_{ij}W_{ij}\qquad \mathrm{s.t.}\quad W_{ij}\ge 0,\quad \sum_{j=1}^{V_{t}} W_{ij}=t_{i},\quad \sum_{i=1}^{V_{s}} W_{ij}=s_{j}$$

where $V_{s}$ represents the size of the source-language vocabulary, $V_{t}$ the size of the target-language vocabulary, $C_{ij}$ the distance between the $i$-th mound and the $j$-th pothole, $t_{i}$ the volume of the $i$-th mound, and $s_{j}$ the volume of the $j$-th pothole. The decision variables $W_{ij}$ of the optimization problem represent the volume of earth moved from the $i$-th mound to the $j$-th pothole, so the objective function minimizes the overall movement cost. After solving, the non-zero $W_{ij}$ indicate the translation pairs. Experiments show that vocabulary translation using EMD can outperform nearest-neighbor retrieval.
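For concreteness, this transportation problem can be sketched in Python with the POT (Python Optimal Transport) library; the mound volumes t, pothole volumes s, and the toy random vectors below are illustrative stand-ins for the real Chinese and Vietnamese word vector distributions, not the patent's actual data.

```python
import numpy as np
import ot  # Python Optimal Transport: pip install pot

rng = np.random.default_rng(0)
Vs, Vt, dim = 6, 6, 50              # toy vocabulary sizes and embedding dim
X = rng.normal(size=(Vs, dim))      # source word vectors (the "mounds")
Y = rng.normal(size=(Vt, dim))      # target word vectors (the "potholes")

t = np.full(Vs, 1.0 / Vs)           # t_i: volume of the i-th mound
s = np.full(Vt, 1.0 / Vt)           # s_j: volume of the j-th pothole
C = ot.dist(X, Y, metric="euclidean")  # C_ij: mound-to-pothole distances

# Solve min_W sum_ij C_ij * W_ij  s.t.  W >= 0, row sums = t, column sums = s.
W = ot.emd(t, s, C)

# Non-zero entries of the transport plan are read off as translation pairs.
pairs = [(i, j) for i in range(Vs) for j in range(Vt) if W[i, j] > 1e-9]
print("EMD =", float((W * C).sum()), "candidate pairs:", pairs)
```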
In order to better exploit the ability of EMD to handle the one-word-multiple-translations phenomenon, it is proposed herein to introduce EMD into the training process of the bilingual word vectors: EMD participates in the training objective function as a regularization term, so that the trained bilingual word vectors better capture the one-word-multiple-translations phenomenon. Its effect is verified experimentally.
Adversarial learning methods can also be viewed within this framework, since adversarial learning implicitly optimizes the Jensen-Shannon divergence. But for the vocabulary translation task, other, better distribution distances can be chosen. Since EMD is a distance between distributions that is well suited to the vocabulary translation task, EMD is considered as the vocabulary-level criterion to guide the learning of the linear mapping, i.e., finding a mapping G that minimizes the EMD between the mapped source-language word vector distribution and the target-language word vector distribution, as shown in FIG. 4. Expressed mathematically,
$$G^{*}=\arg\min_{G}\ \mathrm{EMD}\left(p_{G(x)},\,p_{y}\right)$$

where $p_{G(x)}$ represents the distribution of the source-language word vectors after mapping by G, and $p_{y}$ represents the target-language word vector distribution.
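One concrete way to search for such a mapping (a sketch under assumptions, not necessarily the exact optimization of Zhang et al.) is to alternate between solving the transport plan with G fixed and solving an orthogonal Procrustes problem with the plan fixed:

```python
import numpy as np
import ot

def minimize_emd(X, Y, n_iter=20):
    """X: (Vs, d) source embeddings; Y: (Vt, d) target embeddings."""
    Vs, d = X.shape
    Vt = Y.shape[0]
    t = np.full(Vs, 1.0 / Vs)          # uniform mound volumes
    s = np.full(Vt, 1.0 / Vt)          # uniform pothole volumes
    G = np.eye(d)                      # start from the identity mapping
    for _ in range(n_iter):
        # Step 1: with G fixed, solve the optimal transport plan W.
        C = ot.dist(X @ G, Y, metric="sqeuclidean")
        W = ot.emd(t, s, C)
        # Step 2: with W fixed, the best orthogonal G solves a Procrustes
        # problem: G = U V^T from the SVD of X^T W Y.
        U, _, Vh = np.linalg.svd(X.T @ W @ Y)
        G = U @ Vh
    return G, W                        # mapping and final transport plan
```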
Step4, obtaining Chinese-Vietnamese bilingual word embeddings: on the basis of Step2 and Step3, the unsupervised EMD-minimized bilingual dictionary is used as a seed dictionary, and a self-learning model guides the learning of bilingual word embeddings, generating Chinese-Vietnamese bilingual word embeddings;
word embedding mapping: assuming that the word embedding matrices for the languages chinese and vietnamese are X and Y respectively,
Figure RE-GDA0002477274680000046
a vector for the ith word of the source language,
Figure RE-GDA0002477274680000047
a vector for the jth word of the target language; the dictionary D is a binary matrix, whenWhen the ith word of the source language is aligned with the jth word of the target language, D ij And =1. The goal of word mapping is to find a mapping matrix W such that the mapped word is
Figure RE-GDA0002477274680000048
And
Figure RE-GDA0002477274680000049
is closest to the Euclidean distance of (i.e. is
Figure RE-GDA0002477274680000042
After normalizing and centering the matrices X and Y and setting W as an orthogonal matrix, the above problem of solving euclidean distances is equivalent to maximizing the dot product:
Figure RE-GDA0002477274680000043
tr represents trace operation of the matrix, and the optimal solution can be obtained by solving W x = UV T (U, V represent two orthogonal matrices), subjected to singular value decomposition, X T DY=U∑V T . Given that the matrix D is sparse, the solution can be obtained in linear time.
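A minimal NumPy sketch of this closed-form solution; the normalization helper and the toy binary dictionary D are illustrative assumptions:

```python
import numpy as np

def normalize(M):
    """Length-normalize the rows, then mean-center, as assumed above."""
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    return M - M.mean(axis=0)

def learn_mapping(X, Y, D):
    """X: (Vs, d) source embeddings, Y: (Vt, d) target embeddings,
    D: (Vs, Vt) binary matrix with D[i, j] = 1 for seed dictionary pairs.
    Returns the orthogonal W* = U V^T from the SVD X^T D Y = U S V^T."""
    U, _, Vh = np.linalg.svd(X.T @ D @ Y)
    return U @ Vh
```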
Dictionary self-learning: according to nearest-neighbor retrieval, each source-language word is assigned the closest target-language word, the aligned word pairs are added to the dictionary, and the procedure iterates until convergence.
Taking FIG. 5 as an example, suppose the aligned word pairs in the initial dictionary are (horse-ngựa, dog-chó). One mapping is made according to the initial dictionary, bringing the mapped "horse" close to "ngựa" and "dog" close to "chó" in Euclidean distance. The closest corresponding words are then sought in the mapped space for the other words; "cat" is found to be closest to "mèo" and is therefore also added to the dictionary. With (horse-ngựa, dog-chó, cat-mèo) as the new reference dictionary, recomputing the Euclidean distances yields a new mapping matrix W and hence a new alignment result.
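The loop can be sketched as follows, reusing the normalize and learn_mapping helpers above; re-aligning every source word each round and stopping when the dictionary is stable are illustrative choices:

```python
import numpy as np

def self_learning(X, Y, D, max_iter=50):
    """Iterate mapping and nearest-neighbor alignment until convergence."""
    X, Y = normalize(X), normalize(Y)
    W = learn_mapping(X, Y, D)
    for _ in range(max_iter):
        sim = (X @ W) @ Y.T                # similarities in the mapped space
        nn = sim.argmax(axis=1)            # nearest target word per source word
        D_new = np.zeros_like(D)
        D_new[np.arange(X.shape[0]), nn] = 1
        if (D_new == D).all():             # dictionary stable: converged
            break
        D = D_new
        W = learn_mapping(X, Y, D)         # re-estimate the mapping
    return W, D
```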
After training, translation is carried out using beam search; the beam size must be chosen to balance translation time against search accuracy.
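A toy beam-search decoder illustrating this time/accuracy trade-off; the step function standing in for the trained decoder (returning candidate tokens with log-probabilities for a prefix) is hypothetical:

```python
def beam_search(step, bos, eos, beam_size=5, max_len=50):
    beams = [([bos], 0.0)]                       # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, logp in step(prefix):       # expand each live hypothesis
                candidates.append((prefix + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:                            # every hypothesis has ended
            break
    # Length-normalized best hypothesis; a larger beam_size searches more
    # hypotheses per step at proportionally higher decoding cost.
    return max(finished + beams, key=lambda c: c[1] / len(c[0]))
```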
An unsupervised bilingual dictionary trained by EMD minimization is fused in, and this unsupervised dictionary serves as the seed dictionary to improve the dictionary self-learning effect and thus the quality of the bilingual word vectors.
And Step5, on the basis of Step4, the bilingual word vectors are applied in an unsupervised neural machine translation model with a shared encoder, and a Chinese-Vietnamese unsupervised neural machine translation model fused with the EMD-minimized bilingual dictionary is obtained by training.
The method provided by the invention fuses an unsupervised EMD-minimized bilingual dictionary into the shared-encoder model of Artetxe et al., and has a stronger ability to mine cross-lingual information from the Chinese and Vietnamese monolingual corpora than the original model. The model structure is shown in fig. 6; the model follows the standard attention-based encoder-decoder proposed by Bahdanau et al. It consists of a shared encoder and two decoders, corresponding to the source language and the target language respectively. The encoder is a two-layer bidirectional recurrent neural network (BiGRU), and each decoder is a two-layer unidirectional recurrent neural network (UniGRU). For the attention mechanism, the global attention method and the general alignment function proposed by Luong et al. are used herein. On the encoder side, the pre-trained Chinese-Vietnamese bilingual dictionary and bilingual word vectors are used; the encoder accepts the input sequence and generates language-independent representations. The word vectors on the decoder side are continuously updated during training, and training and translation are carried out through the two decoders.
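A condensed PyTorch sketch of this structure; the hidden size, embedding dimension, and the omission of the Luong attention are illustrative simplifications, not the patent's exact implementation:

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Two-layer bidirectional GRU shared by Chinese and Vietnamese."""
    def __init__(self, bilingual_emb, hidden=512):
        super().__init__()
        # Pre-trained cross-lingual embeddings, kept fixed during training.
        self.emb = nn.Embedding.from_pretrained(bilingual_emb, freeze=True)
        self.rnn = nn.GRU(bilingual_emb.size(1), hidden, num_layers=2,
                          bidirectional=True, batch_first=True)

    def forward(self, ids):
        states, _ = self.rnn(self.emb(ids))  # language-independent states
        return states

class Decoder(nn.Module):
    """Two-layer unidirectional GRU decoder for one language."""
    def __init__(self, vocab_size, hidden=512, emb_dim=300):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)  # updated during training
        self.rnn = nn.GRU(emb_dim, 2 * hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, ids, enc_states):
        # Luong-style global attention over enc_states is omitted for brevity.
        out, _ = self.rnn(self.emb(ids))
        return self.out(out)

# One shared encoder and one decoder per language:
# encoder = SharedEncoder(pretrained_bilingual_embeddings)
# dec_zh, dec_vi = Decoder(zh_vocab_size), Decoder(vi_vocab_size)
```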
For each sentence in Chinese (L1), the model alternates between two training steps: denoising, which optimizes the probability of encoding a noised version of the sentence with the shared encoder and reconstructing it with the L1 decoder; and on-the-fly back-translation, which translates the sentence in inference mode (encoding it with the shared encoder and decoding it with the Vietnamese (L2) decoder) and then optimizes the probability of encoding this translated sentence with the shared encoder and recovering the original sentence with the L1 decoder. Training alternates between sentences in L1 and L2, the latter using the analogous steps.
Dual structure: while NMT systems are typically built for a particular translation direction (e.g., Chinese→Vietnamese or Vietnamese→Chinese), the dual nature of machine translation is exploited herein to handle both directions simultaneously (Chinese↔Vietnamese).
Shared encoder: similar to Ha et al., Lee et al., and Johnson et al., the system herein uses one encoder shared by both languages, i.e., both Chinese and Vietnamese are encoded with the same encoder. The shared encoder aims to represent both languages in a language-independent way, after which each decoder should decode into its corresponding language.
Pre-trained fixed bilingual word embeddings: while most neural machine translation systems randomly initialize their word vectors and update them during training, pre-trained cross-lingual word vectors are used in the encoder and kept fixed during the training process. The encoder thus has language-independent word-level representations, and it only needs to learn how to combine them to build representations of larger phrases.
Experiments by Artetxe et al. demonstrate that adding denoising and back-translation to the system helps improve translation quality, and the invention uses a shared-encoder system with denoising and back-translation.
For each sentence in Chinese (L1), the system is trained in two steps. Denoising: the probability of encoding a noised version of the sentence with the shared encoder and reconstructing it with the L1 decoder is optimized, as in fig. 7(a). Back-translation: the sentence is translated in inference mode (encoded with the shared encoder and decoded with the Vietnamese (L2) decoder, as in fig. 7(b)), and then the probability of encoding the translated sentence with the shared encoder and recovering the source sentence with the L1 decoder is optimized. These two steps are performed alternately for L1 and L2; the training steps for L2 are analogous, as in fig. 7(c) and (d). Neural machine translation systems are usually trained on a parallel corpus; since only monolingual corpora are available, supervised training cannot be used in the present scenario. However, with the model architecture of fig. 6, the whole system can be trained in an unsupervised manner by combining denoising and back-translation:
Denoising: owing to the shared encoder and the dual structure of machine translation, the system herein can be trained directly to reconstruct the input sentence. Specifically, the system encodes an input sentence in a given language using the shared encoder and then reconstructs the source sentence using the decoder of that language. Since pre-trained cross-lingual word vectors are used in the shared encoder, which learns to combine the embeddings of the two languages into language-independent representations, each decoder should learn to decode such representations into its corresponding language. In inference mode, the source-language decoder is simply replaced by the target-language decoder, so that the system can generate a translation of the input text from the language-independent representations produced by the encoder.
Random noise is introduced into the input sentence herein. The idea, following the principle of the denoising auto-encoder, is to train the system to reconstruct the original version of a corrupted input sentence. To this end, the word order of the input sentence is altered by random swaps of consecutive words: for a sequence of N elements, N/2 such random swaps are performed. The system thus has to learn the internal structure of the language to recover the correct word order. At the same time, preventing the system from relying too heavily on the word order of the input sequence better accounts for the actual word-order differences across the two languages.
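A minimal sketch of this corruption step:

```python
import random

def add_noise(tokens):
    """Swap random adjacent words N/2 times for a sequence of N tokens."""
    tokens = list(tokens)
    n = len(tokens)
    for _ in range(n // 2):
        i = random.randrange(n - 1)            # pick a random adjacent pair
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return tokens
```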
Back-translation: despite the denoising strategy, the above training is still essentially a copying task with some synthetic perturbations and, most importantly, involves only one language at a time, whereas the final goal is translation between the two languages. In order to train the present system in a true translation setting without violating the constraint of using only monolingual corpora, the back-translation method proposed by Sennrich et al. is added to the system. Specifically, given an input sentence in one language, the system translates it into the other language in inference mode using greedy decoding (i.e., using the shared encoder and the decoder of the other language). In this way, pseudo-parallel sentence pairs are obtained, and the system is trained to predict the original sentence from the synthetic translation.
It is noted that, in contrast to standard back-translation, which uses an independent model to back-translate the entire corpus at once, here each mini-batch of sentences is back-translated on the fly using the model being trained, taking advantage of the dual structure of the proposed architecture. Thus, as training progresses and the model improves, it produces better synthetic sentence pairs through back-translation, which in turn helps to further improve the model in subsequent iterations.
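The alternating schedule can be sketched as follows; model.train_step and model.translate are hypothetical stand-ins for the system's actual loss step and greedy inference-mode decoding, and add_noise is the corruption sketched earlier:

```python
def train_epoch(zh_batches, vi_batches, model):
    for zh, vi in zip(zh_batches, vi_batches):
        # Denoising: encode a corrupted sentence, reconstruct the original
        # with the decoder of the same language.
        model.train_step(src=add_noise(zh), tgt=zh, decoder="zh")
        model.train_step(src=add_noise(vi), tgt=vi, decoder="vi")

        # On-the-fly back-translation: translate each batch with the model
        # currently being trained, then learn to recover the original.
        pseudo_vi = model.translate(zh, decoder="vi")   # zh -> vi, greedy
        model.train_step(src=pseudo_vi, tgt=zh, decoder="zh")
        pseudo_zh = model.translate(vi, decoder="zh")   # vi -> zh, greedy
        model.train_step(src=pseudo_zh, tgt=vi, decoder="vi")
```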
Because Chinese and Vietnamese differ greatly and share no cognates, and in accordance with this characteristic, the method of minimizing the EMD between word vector distributions is introduced to learn a Chinese-Vietnamese bilingual dictionary from the Chinese and Vietnamese monolingual corpora; a Chinese-Vietnamese unsupervised neural machine translation method fusing the EMD-minimized bilingual dictionary is proposed, improving the performance of unsupervised neural machine translation.
The beneficial effects of the invention are:
the invention realizes the unsupervised neural machine translation system of the language with larger Chinese cross-language difference, improves the capability of the unsupervised neural machine translation model of the shared encoder model to acquire cross-language information of the language with larger difference, and further improves the unsupervised neural machine translation quality of the Chinese cross-language. The unsupervised Chinese-Vietnamese language translation model is expanded from similar languages containing homologous words to a Chinese-Vietnamese language task with large difference, and the performance of the unsupervised neural machine translation model of the shared encoder is improved.
Drawings
FIG. 1 is a flow chart of the phrase-based Chinese-Vietnamese pseudo-parallel sentence pair generation method proposed by the present invention;
FIG. 2 shows the monolingual word vector spaces of Chinese and Vietnamese of the present invention;
FIG. 3 is a diagram of the hubness problem of the present invention;
FIG. 4 is an Earth Mover's Distance minimization learning diagram of the present invention;
FIG. 5 is a schematic diagram of the word mapping process using number alignment of the present invention;
FIG. 6 is the Chinese-Vietnamese unsupervised NMT model fusing the EMD-minimized bilingual dictionary of the present invention;
FIG. 7 is a diagram of the four processes of training the Chinese-Vietnamese unsupervised NMT model fusing the EMD-minimized bilingual dictionary of the present invention.
Detailed Description
Example 1: as shown in figs. 1-7, the Chinese-Vietnamese unsupervised neural machine translation method fusing the EMD-minimized bilingual dictionary. Step1, corpus collection: 58 million Chinese monolingual sentences and 30 million Vietnamese monolingual sentences are crawled from the Internet.
Step2, corpus preprocessing: on the basis of Step1, the Chinese and Vietnamese monolingual sentences are segmented and part-of-speech tagged, and monolingual word vectors are obtained by training. Vietnamese is segmented and part-of-speech tagged with the underthesea Vietnamese word segmentation tool, and Chinese with the jieba word segmentation tool. word2vec is used to train monolingual word vectors for Chinese and Vietnamese; 300-dimensional word vectors are trained for each language with the skip-gram model, for the subsequent training of bilingual word vectors after the dictionary is added.
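A sketch of this embedding step with gensim's word2vec implementation; the 300 dimensions and the skip-gram setting follow the text, while the remaining hyperparameters are assumptions:

```python
from gensim.models import Word2Vec

def train_embeddings(tokenized_sentences):
    """tokenized_sentences: list of token lists from the segmentation step."""
    model = Word2Vec(sentences=tokenized_sentences,
                     vector_size=300,   # 300-dimensional vectors, as in the text
                     sg=1,              # skip-gram model
                     workers=4)
    return model.wv                     # keyed word vectors

# zh_vectors = train_embeddings(zh_segmented)
# vi_vectors = train_embeddings(vi_segmented)
```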
When the separately trained Chinese and Vietnamese monolingual word vectors are mapped into a vector space, as shown in FIG. 2, the two monolingual word vector spaces exhibit approximate isomorphism, meaning that a linear mapping exists that can approximately connect the two spaces.
Step3, unsupervised bilingual dictionary based on EMD minimization: on the basis of Step2, an unsupervised Chinese-Vietnamese bilingual dictionary is trained from the Chinese and Vietnamese monolingual word vectors using the EMD-minimization method.
Further, the Step3 specifically comprises the following steps:
step3, an EMD-minimization method is used between the Chinese and Vietnamese word vector distributions; the word vectors are taken as probability distributions and the distance between the distributions as the vocabulary-level criterion; training proceeds in an unsupervised manner, without any seed dictionary, to find the minimum EMD between the Chinese and Vietnamese word vector distributions, obtaining the Chinese-Vietnamese bilingual dictionary;
The bilingual dictionary is trained with the method proposed by Zhang et al., and 50-dimensional word vectors are trained for Chinese and Vietnamese with word2vec. The 50-dimensional word vectors are trained with the CBOW model under default hyperparameters, the vocabulary is restricted to nouns occurring no fewer than 1000 times, and the experimental results are shown in Table 1.
TABLE 1. Size of the Chinese-Vietnamese bilingual dictionary generated by EMD minimization
Step4, obtaining Chinese-Vietnamese bilingual word embeddings: on the basis of Step2 and Step3, the unsupervised EMD-minimized bilingual dictionary is used as the seed dictionary to guide the learning of bilingual word embeddings with the self-learning model, generating Chinese-Vietnamese bilingual word embeddings;
In Step4, word embedding mapping is performed: assume the word embedding matrices of Chinese and Vietnamese are X and Y respectively, where $X_{i*}$ is the vector of the $i$-th word of the source language and $Y_{j*}$ is the vector of the $j$-th word of the target language. The dictionary D is a binary matrix with $D_{ij}=1$ when the $i$-th source-language word is aligned with the $j$-th target-language word. The goal of word mapping is to find a mapping matrix W such that the mapped $X_{i*}W$ has the shortest Euclidean distance to $Y_{j*}$, i.e.

$$W^{*}=\arg\min_{W}\sum_{i}\sum_{j}D_{ij}\left\lVert X_{i*}W-Y_{j*}\right\rVert^{2}$$

After normalizing and centering the matrices X and Y and setting W as an orthogonal matrix, the above problem of minimizing Euclidean distances amounts to maximizing the dot product:

$$W^{*}=\arg\max_{W}\ \mathrm{Tr}\!\left(XWY^{T}D^{T}\right)$$

where Tr represents the trace operation of the matrix. The optimal solution is $W^{*}=UV^{T}$, where U and V are the orthogonal matrices of the singular value decomposition $X^{T}DY=U\Sigma V^{T}$; given that the matrix D is sparse, the solution is obtained in linear time;
dictionary self-learning: according to nearest-neighbor retrieval, each source-language word is assigned the closest target-language word, the aligned word pairs are added to the dictionary, and the procedure iterates until convergence.
And Step5, on the basis of Step4, the bilingual word vectors are applied in the unsupervised neural machine translation model with a shared encoder, and the Chinese-Vietnamese unsupervised neural machine translation model fused with the EMD-minimized bilingual dictionary is obtained by training.
Further, in Step 5:
the experiment is mainly divided into the following five parts: the unsupervised baseline model is used for translating Chinese-cross, fusing UNMT of an EMD minimized bilingual dictionary, adding 1 ten thousand parallel linguistic data and 10 ten thousand parallel linguistic data respectively on the basis of the method model, and directly using 1 ten thousand parallel linguistic data and 10 ten thousand parallel linguistic data to train on a GNMT model and a Transform model.
Unsupervised model training: the translation system is trained using only monolingual corpora. The 1st experiment applies the baseline model to train the Chinese-Vietnamese unsupervised translation model. The 2nd is the method herein: Chinese-Vietnamese UNMT fusing the EMD-minimized bilingual dictionary on top of the baseline.
Semi-supervised model training: in most cases, the languages under study have a small amount of parallel corpora, which can be used to improve the performance of the model, but the corpus size is not large enough to directly train a complete conventional NMT system. So, in addition to the monolingual corpora, a small amount of parallel corpora is added: experiments are performed with 10,000 and 100,000 parallel sentence pairs on top of the method presented herein.
Supervised model training: conventional supervised neural machine translation models are trained with the 10,000 and 100,000 parallel sentence pairs added in the semi-supervised experiments above, for comparison with the semi-supervised experiments.
TABLE 2. Comparison of Chinese-Vietnamese machine translation experiments with different methods
Comparing rows 1 and 2 of the experimental results in Table 2, fusing the unsupervised bilingual dictionary into the unsupervised model gains about 2.5 BLEU over the baseline system, indicating that the model herein captures more cross-lingual information from the monolingual corpora, improves the quality of the bilingual word vectors, and thus improves translation quality. Row 3 shows the semi-supervised system with 10,000 parallel sentence pairs added: Chinese-Vietnamese reaches 10.02 BLEU and Vietnamese-Chinese 13.91 BLEU. Comparing rows 5, 6, 7 and 8, it is easy to see that the method presented herein works better in the low-resource setting; with only 10,000 parallel sentence pairs the system nearly matches the effect of directly training a model with 100,000 parallel sentence pairs. From the comparison of rows 4 and 8, when 100,000 parallel sentence pairs are added, both the Chinese-Vietnamese and Vietnamese-Chinese directions exceed the Transformer model.
TABLE 3. Example analysis of Chinese-Vietnamese unsupervised machine translation with different methods
From the translation results in Table 3, although the model still suffers from inaccurate translations caused by learning bias, the translation quality of the method herein is clearly improved over the baseline system.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (5)

1. The Chinese-Vietnamese unsupervised neural machine translation method fused with the EMD-minimized bilingual dictionary is characterized in that:
the method comprises the following specific steps:
step1, corpus collection: crawling Chinese and Vietnamese monolingual corpora by using a web crawler;
step2, corpus preprocessing: on the basis of Step1, performing word segmentation and part-of-speech tagging on the Chinese and Vietnamese monolingual sentences, and training to obtain monolingual word vectors;
step3, unsupervised bilingual dictionary based on EMD minimization: on the basis of Step2, training an unsupervised Chinese-Vietnamese bilingual dictionary from the Chinese and Vietnamese monolingual word vectors using an EMD-minimization method;
step4, obtaining Chinese-Vietnamese bilingual word embeddings: on the basis of Step2 and Step3, using the unsupervised EMD-minimized bilingual dictionary as a seed dictionary to guide the learning of bilingual word embeddings, generating Chinese-Vietnamese bilingual word embeddings;
step5, on the basis of Step4, applying the bilingual word vectors in an unsupervised neural machine translation model with a shared encoder, and training to obtain a Chinese-Vietnamese unsupervised neural machine translation model fused with the EMD-minimized bilingual dictionary;
in Step4, word embedding mapping is performed: assume the word embedding matrices of Chinese and Vietnamese are X and Y respectively, where $X_{i*}$ is the vector of the $i$-th word of the source language and $Y_{j*}$ is the vector of the $j$-th word of the target language; the dictionary D is a binary matrix with $D_{ij}=1$ when the $i$-th source-language word is aligned with the $j$-th target-language word; the goal of word mapping is to find a mapping matrix W such that the mapped $X_{i*}W$ is closest in Euclidean distance to $Y_{j*}$, i.e.

$$W^{*}=\arg\min_{W}\sum_{i}\sum_{j}D_{ij}\left\lVert X_{i*}W-Y_{j*}\right\rVert^{2}$$

after normalizing and centering the matrices X and Y and setting W as an orthogonal matrix, the above problem of minimizing Euclidean distances amounts to maximizing the dot product:

$$W^{*}=\arg\max_{W}\ \mathrm{Tr}\!\left(XWY^{T}D^{T}\right)$$

wherein Tr represents the trace operation of the matrix; the optimal solution is $W^{*}=UV^{T}$, where U and V are the orthogonal matrices of the singular value decomposition $X^{T}DY=U\Sigma V^{T}$; given that the matrix D is sparse, the solution is obtained in linear time;
the dictionary self-learning comprises the following steps: the mapped word vectors of the source-language words and the word vectors of the target-language words lie in the same space; according to nearest-neighbor retrieval, each source-language word is assigned the closest target-language word, the aligned word pairs are added to the dictionary, and the procedure iterates until convergence.
2. The Chinese-Vietnamese unsupervised neural machine translation method fusing the EMD-minimized bilingual dictionary according to claim 1, wherein the specific steps of Step2 are as follows:
step2, performing word segmentation and part-of-speech tagging on the Chinese and Vietnamese monolingual sentences with a word segmentation and part-of-speech tagging tool, and obtaining Chinese and Vietnamese monolingual word embeddings with a word vector training tool.
3. The Chinese-Vietnamese unsupervised neural machine translation method fusing the EMD-minimized bilingual dictionary according to claim 1, wherein the specific steps of Step3 are as follows:
step3, an EMD-minimization method is used between the Chinese word vector distribution and the Vietnamese word vector distribution: the word vectors are regarded as probability distributions, the distance between the distributions is taken as a vocabulary-level criterion, and training proceeds in an unsupervised manner without any seed dictionary to find the minimum EMD between the Chinese and Vietnamese word vector distributions, obtaining the Chinese-Vietnamese bilingual dictionary.
4. The Chinese-Vietnamese unsupervised neural machine translation method fusing the EMD-minimized bilingual dictionary according to claim 1, wherein the specific steps of Step4 are as follows:
using the Chinese-Vietnamese bilingual dictionary obtained in Step3 as a seed dictionary; guiding the training of Chinese-Vietnamese bilingual word embeddings with a self-learning model; and obtaining the Chinese-Vietnamese bilingual word embeddings by training.
5. The Chinese-Vietnamese unsupervised neural machine translation method fusing the EMD-minimized bilingual dictionary according to claim 1, wherein in Step5:
and embedding and applying the trained bilingual words fused with the EMD bilingual dictionary in the model of the shared encoder by using a shared encoder model, so as to realize word-level correspondence between Chinese and more bilingual words and train a unsupervised neural machine translation model of the Chinese.
CN202010096013.9A 2020-02-17 2020-02-17 Chinese-Vietnamese unsupervised neural machine translation method fusing EMD-minimized bilingual dictionary Active CN111753557B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010096013.9A CN111753557B (en) 2020-02-17 2020-02-17 Chinese-Vietnamese unsupervised neural machine translation method fusing EMD-minimized bilingual dictionary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010096013.9A CN111753557B (en) 2020-02-17 2020-02-17 Chinese-Vietnamese unsupervised neural machine translation method fusing EMD-minimized bilingual dictionary

Publications (2)

Publication Number Publication Date
CN111753557A CN111753557A (en) 2020-10-09
CN111753557B true CN111753557B (en) 2022-12-20

Family

ID=72673087

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010096013.9A Active CN111753557B (en) 2020-02-17 2020-02-17 Chinese-Vietnamese unsupervised neural machine translation method fusing EMD-minimized bilingual dictionary

Country Status (1)

Country Link
CN (1) CN111753557B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287694A (en) * 2020-09-18 2021-01-29 昆明理工大学 Shared encoder-based Chinese-crossing unsupervised neural machine translation method
CN112633018B (en) * 2020-12-28 2022-04-15 内蒙古工业大学 Mongolian Chinese neural machine translation method based on data enhancement
CN112836527B (en) * 2021-01-31 2023-11-21 云知声智能科技股份有限公司 Training method, system, equipment and storage medium of machine translation model
CN112926324B (en) * 2021-02-05 2022-07-29 昆明理工大学 Vietnamese event entity recognition method integrating dictionary and anti-migration
CN113076398B (en) * 2021-03-30 2022-07-29 昆明理工大学 Cross-language information retrieval method based on bilingual dictionary mapping guidance
CN113343672B (en) * 2021-06-21 2022-12-16 哈尔滨工业大学 Unsupervised bilingual dictionary construction method based on corpus merging
CN113591496A (en) * 2021-07-15 2021-11-02 清华大学 Bilingual word alignment method and system
CN113591490B (en) * 2021-07-29 2023-05-26 北京有竹居网络技术有限公司 Information processing method and device and electronic equipment
CN114595688B (en) * 2022-01-06 2023-03-10 昆明理工大学 Chinese cross-language word embedding method fusing word cluster constraint
CN114078468B (en) * 2022-01-19 2022-05-13 广州小鹏汽车科技有限公司 Voice multi-language recognition method, device, terminal and storage medium
CN114492476A (en) * 2022-01-30 2022-05-13 天津大学 Language code conversion vocabulary overlapping enhancement method for unsupervised neural machine translation
CN115618885A (en) * 2022-09-22 2023-01-17 无锡捷通数智科技有限公司 Statement translation method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102018110369A1 (en) * 2017-04-28 2018-10-31 Intel Corporation IMPROVEMENT OF AUTONOMOUS MACHINES BY CLOUD, ERROR CORRECTION AND PREDICTION
CN108897797A (en) * 2018-06-12 2018-11-27 腾讯科技(深圳)有限公司 Update training method, device, storage medium and the electronic equipment of dialog model
CN110334881A (en) * 2019-07-17 2019-10-15 深圳大学 A kind of Financial Time Series Forecasting method based on length memory network and depth data cleaning, device and server

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858046B (en) * 2018-02-09 2024-03-08 谷歌有限责任公司 Learning long-term dependencies in neural networks using assistance loss

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102018110369A1 (en) * 2017-04-28 2018-10-31 Intel Corporation IMPROVEMENT OF AUTONOMOUS MACHINES BY CLOUD, ERROR CORRECTION AND PREDICTION
CN108897797A (en) * 2018-06-12 2018-11-27 腾讯科技(深圳)有限公司 Update training method, device, storage medium and the electronic equipment of dialog model
CN110334881A (en) * 2019-07-17 2019-10-15 深圳大学 A kind of Financial Time Series Forecasting method based on length memory network and depth data cleaning, device and server

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Fully Character-Level Neural Machine Translation without Explicit Segmentation";Jason Lee 等;《Transactions of the Association for Computational Linguistics》;20170531(第5期);第365-378页 *
"倾向近邻关联的神经机器翻译";王坤 等;《计算机科学》;20190515;第46卷(第5期);第198-202页 *

Also Published As

Publication number Publication date
CN111753557A (en) 2020-10-09

Similar Documents

Publication Publication Date Title
CN111753557B (en) Chinese-Vietnamese unsupervised neural machine translation method fusing EMD-minimized bilingual dictionary
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN109582789B (en) Text multi-label classification method based on semantic unit information
Lee et al. Learning dense representations of phrases at scale
CN109635124B (en) Remote supervision relation extraction method combined with background knowledge
CN110334361B (en) Neural machine translation method for Chinese language
Liang et al. An end-to-end discriminative approach to machine translation
Gu et al. Unpaired image captioning by language pivoting
US10255275B2 (en) Method and system for generation of candidate translations
CN109190131B (en) Neural machine translation-based English word and case joint prediction method thereof
Liu et al. A recursive recurrent neural network for statistical machine translation
CN109933808B (en) Neural machine translation method based on dynamic configuration decoding
CN105068998A (en) Translation method and translation device based on neural network model
Garg et al. Machine translation: a literature review
CN106484682A (en) Based on the machine translation method of statistics, device and electronic equipment
Li et al. Spelling error correction using a nested RNN model and pseudo training data
CN108460028A (en) Sentence weight is incorporated to the field adaptive method of neural machine translation
Qiu et al. Dependency-Based Local Attention Approach to Neural Machine Translation.
Ahmadnia et al. Neural machine translation advised by statistical machine translation: The case of farsi-spanish bilingually low-resource scenario
US11586833B2 (en) System and method for bi-directional translation using sum-product networks
Espla-Gomis et al. Using machine translation to provide target-language edit hints in computer aided translation based on translation memories
CN113468883B (en) Fusion method and device of position information and computer readable storage medium
CN112926344A (en) Word vector replacement data enhancement-based machine translation model training method and device, electronic equipment and storage medium
CN112287694A (en) Shared encoder-based Chinese-crossing unsupervised neural machine translation method
CN110321568B (en) Chinese-Yue convolution neural machine translation method based on fusion of part of speech and position information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant