CN111753557B - Chinese-Vietnamese unsupervised neural machine translation method fusing EMD-minimized bilingual dictionary


Info

Publication number
CN111753557B
Authority
CN
China
Prior art keywords
chinese, word, bilingual, emd, dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010096013.9A
Other languages
Chinese (zh)
Other versions
CN111753557A (en)
Inventor
余正涛
薛明亚
高盛祥
赖华
翟家欣
朱恩昌
陈玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202010096013.9A
Publication of CN111753557A
Application granted
Publication of CN111753557B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques


Abstract

The invention relates to a Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary, belonging to the technical field of machine translation. The invention comprises the following steps: corpus collection, crawling Chinese and Vietnamese monolingual sentences with a web crawler; first, training monolingual word embeddings for Chinese and Vietnamese separately, and obtaining a Chinese-Vietnamese bilingual dictionary by training to minimize the EMD between the word embedding distributions; then, using this dictionary as a seed dictionary to train Chinese-Vietnamese bilingual word embeddings; finally, applying the bilingual word embeddings in an unsupervised machine translation model with a shared encoder, constructing a Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary. The method can effectively improve the performance of Chinese-Vietnamese unsupervised neural machine translation.

Description

Chinese-Vietnamese unsupervised neural machine translation method fusing EMD-minimized bilingual dictionary
Technical Field
The invention relates to a Chinese-Vietnamese unsupervised neural machine translation method fused with an EMD (Earth Mover's Distance) minimized bilingual dictionary, belonging to the technical field of machine translation.
Background
Neural machine translation is a machine translation approach proposed in recent years; its translation quality has surpassed that of statistical machine translation on many language pairs, and it has become the mainstream translation method. However, neural machine translation requires large-scale parallel corpora to achieve good results, and when training data is insufficient, translation quality is poor. Parallel corpora between Chinese and Vietnamese are scarce and not readily available, so Chinese-Vietnamese machine translation is a typical low-resource machine translation task. Large amounts of monolingual data do exist for both Chinese and Vietnamese, however, and Chinese-Vietnamese unsupervised neural machine translation that uses only monolingual corpora is explored herein. This is of great value for promoting exchange and cooperation between the two countries, and of important theoretical and practical value for research on machine translation of low-resource languages.
Current research on unsupervised machine translation mainly comprises unsupervised machine translation based on adversarial learning and unsupervised machine translation based on a shared encoder (shared space). Lample et al. proposed mapping sentences of two different monolingual corpora into the same space, learning to reconstruct from both languages in a shared feature space, and thereby realizing unsupervised neural machine translation using only monolingual corpora. Artetxe et al. modified the model with pre-trained unsupervised bilingual word embeddings, using a shared encoder and separate decoders to perform unsupervised neural machine translation using only monolingual corpora. The weight-sharing unsupervised machine translation model proposed by Yang et al. preserves the characteristics and internal features of each language better than the shared-encoder model, so as to improve translation quality, and Lample et al. further improved unsupervised neural machine translation by combining neural machine translation with phrase-based statistical machine translation. Lample et al. also proposed cross-lingual language model pre-training for initializing the lookup table, improving the quality of pre-trained cross-lingual word embeddings and significantly improving the performance of unsupervised machine translation models. These methods use cognate words between similar languages, or digit alignment extracted from monolingual corpora, as the initial cross-lingual signal, and then extend it by learning to realize unsupervised neural machine translation. Chinese and Vietnamese differ greatly, and no usable cognates exist between them, so methods relying on cognates are not feasible for the Chinese-Vietnamese language pair; the shared-encoder unsupervised neural machine translation of Artetxe et al., built on unsupervised bilingual word vectors, suits the characteristics of language pairs with large differences. The invention therefore chooses to extend the work of Artetxe et al.; however, the quality of bilingual word embeddings learned using only the Arabic numerals shared between the two languages is limited, so the idea of the invention is to improve unsupervised bilingual word embedding quality in order to improve Chinese-Vietnamese unsupervised neural machine translation quality.
In unsupervised machine translation using only Chinese and Vietnamese monolingual corpora, machine translation is difficult to realize directly, but acquiring a bilingual dictionary is relatively easy. The invention therefore trains a bilingual dictionary from the Chinese and Vietnamese monolingual corpora and then uses this Chinese-Vietnamese bilingual dictionary as a seed dictionary to guide the training of higher-quality bilingual word embeddings, thereby improving Chinese-Vietnamese unsupervised neural machine translation quality. Zhang et al. proposed exploiting the similarity of the word vector space distributions of two languages and training a bilingual dictionary by EMD minimization; the whole process is an unsupervised training scheme using only monolingual corpora, with quality comparable to supervised methods, which suits the large differences between Chinese and Vietnamese. A Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary is therefore proposed herein.
The method first regards the word embeddings of the Chinese and Vietnamese monolingual corpora as two probability distributions and obtains a Chinese-Vietnamese bilingual dictionary by training to minimize the EMD between the two word embedding distributions; it then uses this bilingual dictionary as a seed dictionary and trains Chinese-Vietnamese bilingual word embeddings with a self-learning method, finally realizing Chinese-Vietnamese unsupervised neural machine translation on a shared-encoder model.
Disclosure of Invention
The invention provides a Chinese-Vietnamese unsupervised neural machine translation method fused with an EMD-minimized bilingual dictionary, which is used in unsupervised translation systems for low-resource languages and improves Chinese-Vietnamese unsupervised neural machine translation performance.
The technical scheme of the invention is as follows: the Chinese-Vietnamese unsupervised neural machine translation method fusing the EMD-minimized bilingual dictionary comprises the following specific steps:
step1, corpus collection: crawling Chinese and Vietnamese monolingual corpora by using a web crawler; the monolingual corpus is mainly from Chinese and Vietnamese monolingual news websites;
step2, corpus preprocessing: on the basis of Step1, performing word segmentation and part-of-speech tagging on the Chinese and Vietnamese monolingual sentences with a word segmentation and part-of-speech tagging tool, and obtaining Chinese and Vietnamese monolingual word embeddings with a word vector training tool, yielding monolingual word vectors; when the separately trained Chinese and Vietnamese monolingual word vectors are mapped into a vector space, the two monolingual word vector spaces exhibit approximate isomorphism, meaning that a linear mapping exists that can approximately connect the two spaces;
step3, unsupervised bilingual dictionary based on EMD minimization: on the basis of Step2, training an unsupervised Chinese-Vietnamese bilingual dictionary from the Chinese and Vietnamese monolingual word vectors using an EMD-minimization method;
as a preferred embodiment of the present invention, the Step3 specifically comprises the following steps:
step3, an EMD-minimization method is used between the Chinese word vector distribution and the Vietnamese word vector distribution: the word vectors are regarded as probability distributions, the distance between the distributions is taken as a vocabulary-level criterion, and training proceeds in an unsupervised manner without any seed dictionary to find the minimum EMD between the Chinese and Vietnamese word vector distributions, obtaining the Chinese-Vietnamese bilingual dictionary.
The dots in fig. 3 are regarded as mounds of earth and the squares as potholes, and their sizes represent the volumes of the mounds and potholes, i.e. the corresponding weights. In the example of fig. 3, all weights are equal. Under this setting, it is desired to move the mounds to fill the potholes at minimal overall cost, measured by the product of the distance moved and the volume of earth moved. The arrows in fig. 3(b) represent the optimal movement scheme in this example, and this scheme can be read directly as the result of vocabulary translation. From a microscopic view, once the earth of the mound matched to the "music" pothole has been used up entirely in filling it, that mound no longer competes for the "dance" pothole, so the remaining mound is responsible for filling the "dance" pothole. From a macroscopic view, minimizing the overall movement cost takes global information into account, thereby overcoming the locality of nearest-neighbor search and alleviating the hubness problem. This notion of global weighted matching can be implemented mathematically with EMD, whose very name derives from the above metaphor.
The corresponding optimization problem is

$$\min_{W}\ \sum_{i=1}^{V_{s}}\sum_{j=1}^{V_{t}} C_{ij}W_{ij}\qquad \mathrm{s.t.}\quad W_{ij}\ge 0,\quad \sum_{j=1}^{V_{t}} W_{ij}=t_{i},\quad \sum_{i=1}^{V_{s}} W_{ij}=s_{j}$$

where $V_{s}$ represents the size of the source-language vocabulary, $V_{t}$ the size of the target-language vocabulary, $C_{ij}$ the distance between the $i$-th mound and the $j$-th pothole, $t_{i}$ the volume of the $i$-th mound, and $s_{j}$ the volume of the $j$-th pothole. The decision variables $W_{ij}$ of the optimization problem represent the volume of earth moved from the $i$-th mound to the $j$-th pothole, so the objective function minimizes the overall movement cost. After solving, the non-zero $W_{ij}$ indicate the translation pairs. Experiments show that vocabulary translation using EMD can outperform nearest-neighbor retrieval.
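For concreteness, this transportation problem can be sketched in Python with the POT (Python Optimal Transport) library; the mound volumes t, pothole volumes s, and the toy random vectors below are illustrative stand-ins for the real Chinese and Vietnamese word vector distributions, not the patent's actual data.

```python
import numpy as np
import ot  # Python Optimal Transport: pip install pot

rng = np.random.default_rng(0)
Vs, Vt, dim = 6, 6, 50              # toy vocabulary sizes and embedding dim
X = rng.normal(size=(Vs, dim))      # source word vectors (the "mounds")
Y = rng.normal(size=(Vt, dim))      # target word vectors (the "potholes")

t = np.full(Vs, 1.0 / Vs)           # t_i: volume of the i-th mound
s = np.full(Vt, 1.0 / Vt)           # s_j: volume of the j-th pothole
C = ot.dist(X, Y, metric="euclidean")  # C_ij: mound-to-pothole distances

# Solve min_W sum_ij C_ij * W_ij  s.t.  W >= 0, row sums = t, column sums = s.
W = ot.emd(t, s, C)

# Non-zero entries of the transport plan are read off as translation pairs.
pairs = [(i, j) for i in range(Vs) for j in range(Vt) if W[i, j] > 1e-9]
print("EMD =", float((W * C).sum()), "candidate pairs:", pairs)
```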
In order to better exploit the ability of EMD to handle the one-word-multiple-translations phenomenon, it is proposed herein to introduce EMD into the training process of the bilingual word vectors: EMD participates in the training objective function as a regularization term, so that the trained bilingual word vectors better capture the one-word-multiple-translations phenomenon. Its effect is verified experimentally.
Adversarial learning methods can also be viewed within this framework, since adversarial learning implicitly optimizes the Jensen-Shannon divergence. But for the vocabulary translation task, other, better distribution distances can be chosen. Since EMD is a distance between distributions that is well suited to the vocabulary translation task, EMD is considered as the vocabulary-level criterion to guide the learning of the linear mapping, i.e., finding a mapping G that minimizes the EMD between the mapped source-language word vector distribution and the target-language word vector distribution, as shown in FIG. 4. Expressed mathematically,
$$G^{*}=\arg\min_{G}\ \mathrm{EMD}\left(p_{G(x)},\,p_{y}\right)$$

where $p_{G(x)}$ represents the distribution of the source-language word vectors after mapping by G, and $p_{y}$ represents the target-language word vector distribution.
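One concrete way to search for such a mapping (a sketch under assumptions, not necessarily the exact optimization of Zhang et al.) is to alternate between solving the transport plan with G fixed and solving an orthogonal Procrustes problem with the plan fixed:

```python
import numpy as np
import ot

def minimize_emd(X, Y, n_iter=20):
    """X: (Vs, d) source embeddings; Y: (Vt, d) target embeddings."""
    Vs, d = X.shape
    Vt = Y.shape[0]
    t = np.full(Vs, 1.0 / Vs)          # uniform mound volumes
    s = np.full(Vt, 1.0 / Vt)          # uniform pothole volumes
    G = np.eye(d)                      # start from the identity mapping
    for _ in range(n_iter):
        # Step 1: with G fixed, solve the optimal transport plan W.
        C = ot.dist(X @ G, Y, metric="sqeuclidean")
        W = ot.emd(t, s, C)
        # Step 2: with W fixed, the best orthogonal G solves a Procrustes
        # problem: G = U V^T from the SVD of X^T W Y.
        U, _, Vh = np.linalg.svd(X.T @ W @ Y)
        G = U @ Vh
    return G, W                        # mapping and final transport plan
```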
Step4, obtaining Chinese-Vietnamese bilingual word embeddings: on the basis of Step2 and Step3, the unsupervised EMD-minimized bilingual dictionary is used as a seed dictionary, and a self-learning model guides the learning of bilingual word embeddings, generating Chinese-Vietnamese bilingual word embeddings;
word embedding mapping: assuming that the word embedding matrices for the languages chinese and vietnamese are X and Y respectively,
Figure RE-GDA0002477274680000046
a vector for the ith word of the source language,
Figure RE-GDA0002477274680000047
a vector for the jth word of the target language; the dictionary D is a binary matrix, whenWhen the ith word of the source language is aligned with the jth word of the target language, D ij And =1. The goal of word mapping is to find a mapping matrix W such that the mapped word is
Figure RE-GDA0002477274680000048
And
Figure RE-GDA0002477274680000049
is closest to the Euclidean distance of (i.e. is
Figure RE-GDA0002477274680000042
After normalizing and centering the matrices X and Y and setting W as an orthogonal matrix, the above problem of solving euclidean distances is equivalent to maximizing the dot product:
Figure RE-GDA0002477274680000043
tr represents trace operation of the matrix, and the optimal solution can be obtained by solving W x = UV T (U, V represent two orthogonal matrices), subjected to singular value decomposition, X T DY=U∑V T . Given that the matrix D is sparse, the solution can be obtained in linear time.
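A minimal NumPy sketch of this closed-form solution; the normalization helper and the toy binary dictionary D are illustrative assumptions:

```python
import numpy as np

def normalize(M):
    """Length-normalize the rows, then mean-center, as assumed above."""
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    return M - M.mean(axis=0)

def learn_mapping(X, Y, D):
    """X: (Vs, d) source embeddings, Y: (Vt, d) target embeddings,
    D: (Vs, Vt) binary matrix with D[i, j] = 1 for seed dictionary pairs.
    Returns the orthogonal W* = U V^T from the SVD X^T D Y = U S V^T."""
    U, _, Vh = np.linalg.svd(X.T @ D @ Y)
    return U @ Vh
```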
Dictionary self-learning: according to nearest-neighbor retrieval, each source-language word is assigned the closest target-language word, the aligned word pairs are added to the dictionary, and the procedure iterates until convergence.
Taking FIG. 5 as an example, suppose the aligned word pairs in the initial dictionary are (horse-ngựa, dog-chó). One mapping is made according to the initial dictionary, bringing the mapped "horse" close to "ngựa" and "dog" close to "chó" in Euclidean distance. The closest corresponding words are then sought in the mapped space for the other words; "cat" is found to be closest to "mèo" and is therefore also added to the dictionary. With (horse-ngựa, dog-chó, cat-mèo) as the new reference dictionary, recomputing the Euclidean distances yields a new mapping matrix W and hence a new alignment result.
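The loop can be sketched as follows, reusing the normalize and learn_mapping helpers above; re-aligning every source word each round and stopping when the dictionary is stable are illustrative choices:

```python
import numpy as np

def self_learning(X, Y, D, max_iter=50):
    """Iterate mapping and nearest-neighbor alignment until convergence."""
    X, Y = normalize(X), normalize(Y)
    W = learn_mapping(X, Y, D)
    for _ in range(max_iter):
        sim = (X @ W) @ Y.T                # similarities in the mapped space
        nn = sim.argmax(axis=1)            # nearest target word per source word
        D_new = np.zeros_like(D)
        D_new[np.arange(X.shape[0]), nn] = 1
        if (D_new == D).all():             # dictionary stable: converged
            break
        D = D_new
        W = learn_mapping(X, Y, D)         # re-estimate the mapping
    return W, D
```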
After training, translation is carried out using beam search; the beam size must be chosen to balance translation time against search accuracy.
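A toy beam-search decoder illustrating this time/accuracy trade-off; the step function standing in for the trained decoder (returning candidate tokens with log-probabilities for a prefix) is hypothetical:

```python
def beam_search(step, bos, eos, beam_size=5, max_len=50):
    beams = [([bos], 0.0)]                       # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, logp in step(prefix):       # expand each live hypothesis
                candidates.append((prefix + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:                            # every hypothesis has ended
            break
    # Length-normalized best hypothesis; a larger beam_size searches more
    # hypotheses per step at proportionally higher decoding cost.
    return max(finished + beams, key=lambda c: c[1] / len(c[0]))
```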
An unsupervised bilingual dictionary trained by EMD minimization is fused in, and this unsupervised dictionary serves as the seed dictionary to improve the dictionary self-learning effect and thus the quality of the bilingual word vectors.
And Step5, on the basis of Step4, the bilingual word vectors are applied in an unsupervised neural machine translation model with a shared encoder, and a Chinese-Vietnamese unsupervised neural machine translation model fused with the EMD-minimized bilingual dictionary is obtained by training.
The method provided by the invention fuses an unsupervised EMD-minimized bilingual dictionary into the shared-encoder model of Artetxe et al., and has a stronger ability to mine cross-lingual information from the Chinese and Vietnamese monolingual corpora than the original model. The model structure is shown in fig. 6; the model follows the standard attention-based encoder-decoder proposed by Bahdanau et al. It consists of a shared encoder and two decoders, corresponding to the source language and the target language respectively. The encoder is a two-layer bidirectional recurrent neural network (BiGRU), and each decoder is a two-layer unidirectional recurrent neural network (UniGRU). For the attention mechanism, the global attention method and the general alignment function proposed by Luong et al. are used herein. On the encoder side, the pre-trained Chinese-Vietnamese bilingual dictionary and bilingual word vectors are used; the encoder accepts the input sequence and generates language-independent representations. The word vectors on the decoder side are continuously updated during training, and training and translation are carried out through the two decoders.
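A condensed PyTorch sketch of this structure; the hidden size, embedding dimension, and the omission of the Luong attention are illustrative simplifications, not the patent's exact implementation:

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Two-layer bidirectional GRU shared by Chinese and Vietnamese."""
    def __init__(self, bilingual_emb, hidden=512):
        super().__init__()
        # Pre-trained cross-lingual embeddings, kept fixed during training.
        self.emb = nn.Embedding.from_pretrained(bilingual_emb, freeze=True)
        self.rnn = nn.GRU(bilingual_emb.size(1), hidden, num_layers=2,
                          bidirectional=True, batch_first=True)

    def forward(self, ids):
        states, _ = self.rnn(self.emb(ids))  # language-independent states
        return states

class Decoder(nn.Module):
    """Two-layer unidirectional GRU decoder for one language."""
    def __init__(self, vocab_size, hidden=512, emb_dim=300):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)  # updated during training
        self.rnn = nn.GRU(emb_dim, 2 * hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, ids, enc_states):
        # Luong-style global attention over enc_states is omitted for brevity.
        out, _ = self.rnn(self.emb(ids))
        return self.out(out)

# One shared encoder and one decoder per language:
# encoder = SharedEncoder(pretrained_bilingual_embeddings)
# dec_zh, dec_vi = Decoder(zh_vocab_size), Decoder(vi_vocab_size)
```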
For each sentence in Chinese (L1), the model alternates between two training steps: denoising, which optimizes the probability of encoding a noised version of the sentence with the shared encoder and reconstructing it with the L1 decoder; and on-the-fly back-translation, which translates the sentence in inference mode (encoding it with the shared encoder and decoding it with the Vietnamese (L2) decoder) and then optimizes the probability of encoding this translated sentence with the shared encoder and recovering the original sentence with the L1 decoder. Training alternates between sentences in L1 and L2, the latter using the analogous steps.
Dual structure: while NMT systems are typically built for a particular translation direction (e.g., Chinese→Vietnamese or Vietnamese→Chinese), the dual nature of machine translation is exploited herein to handle both directions simultaneously (Chinese↔Vietnamese).
Shared encoder: similar to Ha et al., Lee et al., and Johnson et al., the system herein uses one encoder shared by both languages, i.e., both Chinese and Vietnamese are encoded with the same encoder. The shared encoder aims to represent both languages in a language-independent way, after which each decoder should decode into its corresponding language.
Pre-trained fixed bilingual word embeddings: while most neural machine translation systems randomly initialize their word vectors and update them during training, pre-trained cross-lingual word vectors are used in the encoder and kept fixed during the training process. The encoder thus has language-independent word-level representations, and it only needs to learn how to combine them to build representations of larger phrases.
Experiments by Artetxe et al. demonstrate that adding denoising and back-translation to the system helps improve translation quality, and the invention uses a shared-encoder system with denoising and back-translation.
For each sentence in Chinese (L1), the system is trained in two steps. Denoising: the probability of encoding a noised version of the sentence with the shared encoder and reconstructing it with the L1 decoder is optimized, as in fig. 7(a). Back-translation: the sentence is translated in inference mode (encoded with the shared encoder and decoded with the Vietnamese (L2) decoder, as in fig. 7(b)), and then the probability of encoding the translated sentence with the shared encoder and recovering the source sentence with the L1 decoder is optimized. These two steps are performed alternately for L1 and L2; the training steps for L2 are analogous, as in fig. 7(c) and (d). Neural machine translation systems are usually trained on a parallel corpus; since only monolingual corpora are available, supervised training cannot be used in the present scenario. However, with the model architecture of fig. 6, the whole system can be trained in an unsupervised manner by combining denoising and back-translation:
Denoising: owing to the shared encoder and the dual structure of machine translation, the system herein can be trained directly to reconstruct the input sentence. Specifically, the system encodes an input sentence in a given language using the shared encoder and then reconstructs the source sentence using the decoder of that language. Since pre-trained cross-lingual word vectors are used in the shared encoder, which learns to combine the embeddings of the two languages into language-independent representations, each decoder should learn to decode such representations into its corresponding language. In inference mode, the source-language decoder is simply replaced by the target-language decoder, so that the system can generate a translation of the input text from the language-independent representations produced by the encoder.
Random noise is introduced into the input sentence herein. The idea, following the principle of the denoising auto-encoder, is to train the system to reconstruct the original version of a corrupted input sentence. To this end, the word order of the input sentence is altered by random swaps of consecutive words: for a sequence of N elements, N/2 such random swaps are performed. The system thus has to learn the internal structure of the language to recover the correct word order. At the same time, preventing the system from relying too heavily on the word order of the input sequence better accounts for the actual word-order differences across the two languages.
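A minimal sketch of this corruption step:

```python
import random

def add_noise(tokens):
    """Swap random adjacent words N/2 times for a sequence of N tokens."""
    tokens = list(tokens)
    n = len(tokens)
    for _ in range(n // 2):
        i = random.randrange(n - 1)            # pick a random adjacent pair
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return tokens
```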
Back-translation: despite the denoising strategy, the above training is still essentially a copying task with some synthetic perturbations and, most importantly, involves only one language at a time, whereas the final goal is translation between the two languages. In order to train the present system in a true translation setting without violating the constraint of using only monolingual corpora, the back-translation method proposed by Sennrich et al. is added to the system. Specifically, given an input sentence in one language, the system translates it into the other language in inference mode using greedy decoding (i.e., using the shared encoder and the decoder of the other language). In this way, pseudo-parallel sentence pairs are obtained, and the system is trained to predict the original sentence from the synthetic translation.
It is noted that, in contrast to standard back-translation, which uses an independent model to back-translate the entire corpus at once, here each mini-batch of sentences is back-translated on the fly using the model being trained, taking advantage of the dual structure of the proposed architecture. Thus, as training progresses and the model improves, it produces better synthetic sentence pairs through back-translation, which in turn helps to further improve the model in subsequent iterations.
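The alternating schedule can be sketched as follows; model.train_step and model.translate are hypothetical stand-ins for the system's actual loss step and greedy inference-mode decoding, and add_noise is the corruption sketched earlier:

```python
def train_epoch(zh_batches, vi_batches, model):
    for zh, vi in zip(zh_batches, vi_batches):
        # Denoising: encode a corrupted sentence, reconstruct the original
        # with the decoder of the same language.
        model.train_step(src=add_noise(zh), tgt=zh, decoder="zh")
        model.train_step(src=add_noise(vi), tgt=vi, decoder="vi")

        # On-the-fly back-translation: translate each batch with the model
        # currently being trained, then learn to recover the original.
        pseudo_vi = model.translate(zh, decoder="vi")   # zh -> vi, greedy
        model.train_step(src=pseudo_vi, tgt=zh, decoder="zh")
        pseudo_zh = model.translate(vi, decoder="zh")   # vi -> zh, greedy
        model.train_step(src=pseudo_zh, tgt=vi, decoder="vi")
```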
Because Chinese and Vietnamese differ greatly and share no cognates, and in accordance with this characteristic, the method of minimizing the EMD between word vector distributions is introduced to learn a Chinese-Vietnamese bilingual dictionary from the Chinese and Vietnamese monolingual corpora; a Chinese-Vietnamese unsupervised neural machine translation method fusing the EMD-minimized bilingual dictionary is proposed, improving the performance of unsupervised neural machine translation.
The beneficial effects of the invention are:
the invention realizes the unsupervised neural machine translation system of the language with larger Chinese cross-language difference, improves the capability of the unsupervised neural machine translation model of the shared encoder model to acquire cross-language information of the language with larger difference, and further improves the unsupervised neural machine translation quality of the Chinese cross-language. The unsupervised Chinese-Vietnamese language translation model is expanded from similar languages containing homologous words to a Chinese-Vietnamese language task with large difference, and the performance of the unsupervised neural machine translation model of the shared encoder is improved.
Drawings
FIG. 1 is a flow chart of the phrase-based Chinese-Vietnamese pseudo-parallel sentence pair generation method proposed by the present invention;
FIG. 2 shows the monolingual word vector spaces of Chinese and Vietnamese of the present invention;
FIG. 3 is a diagram of the hubness problem of the present invention;
FIG. 4 is an Earth Mover's Distance minimization learning diagram of the present invention;
FIG. 5 is a schematic diagram of the word mapping process using number alignment of the present invention;
FIG. 6 is the Chinese-Vietnamese unsupervised NMT model fusing the EMD-minimized bilingual dictionary of the present invention;
FIG. 7 is a diagram of the four processes of training the Chinese-Vietnamese unsupervised NMT model fusing the EMD-minimized bilingual dictionary of the present invention.
Detailed Description
Example 1: as shown in figs. 1-7, the Chinese-Vietnamese unsupervised neural machine translation method fusing the EMD-minimized bilingual dictionary. Step1, corpus collection: 58 million Chinese monolingual sentences and 30 million Vietnamese monolingual sentences are crawled from the Internet.
Step2, corpus preprocessing: on the basis of Step1, the Chinese and Vietnamese monolingual sentences are segmented and part-of-speech tagged, and monolingual word vectors are obtained by training. Vietnamese is segmented and part-of-speech tagged with the underthesea Vietnamese word segmentation tool, and Chinese with the jieba word segmentation tool. word2vec is used to train monolingual word vectors for Chinese and Vietnamese; 300-dimensional word vectors are trained for each language with the skip-gram model, for the subsequent training of bilingual word vectors after the dictionary is added.
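A sketch of this embedding step with gensim's word2vec implementation; the 300 dimensions and the skip-gram setting follow the text, while the remaining hyperparameters are assumptions:

```python
from gensim.models import Word2Vec

def train_embeddings(tokenized_sentences):
    """tokenized_sentences: list of token lists from the segmentation step."""
    model = Word2Vec(sentences=tokenized_sentences,
                     vector_size=300,   # 300-dimensional vectors, as in the text
                     sg=1,              # skip-gram model
                     workers=4)
    return model.wv                     # keyed word vectors

# zh_vectors = train_embeddings(zh_segmented)
# vi_vectors = train_embeddings(vi_segmented)
```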
When the separately trained Chinese and Vietnamese monolingual word vectors are mapped into a vector space, as shown in FIG. 2, the two monolingual word vector spaces exhibit approximate isomorphism, meaning that a linear mapping exists that can approximately connect the two spaces.
Step3, unsupervised bilingual dictionary based on EMD minimization: on the basis of Step2, an unsupervised Chinese-Vietnamese bilingual dictionary is trained from the Chinese and Vietnamese monolingual word vectors using the EMD-minimization method.
Further, the Step3 specifically comprises the following steps:
step3, an EMD-minimization method is used between the Chinese and Vietnamese word vector distributions; the word vectors are taken as probability distributions and the distance between the distributions as the vocabulary-level criterion; training proceeds in an unsupervised manner, without any seed dictionary, to find the minimum EMD between the Chinese and Vietnamese word vector distributions, obtaining the Chinese-Vietnamese bilingual dictionary;
The bilingual dictionary is trained with the method proposed by Zhang et al., and 50-dimensional word vectors are trained for Chinese and Vietnamese with word2vec. The 50-dimensional word vectors are trained with the CBOW model under default hyperparameters, the vocabulary is restricted to nouns occurring no fewer than 1000 times, and the experimental results are shown in Table 1.
TABLE 1. Size of the Chinese-Vietnamese bilingual dictionary generated by EMD minimization
Step4, obtaining Chinese-Vietnamese bilingual word embeddings: on the basis of Step2 and Step3, the unsupervised EMD-minimized bilingual dictionary is used as the seed dictionary to guide the learning of bilingual word embeddings with the self-learning model, generating Chinese-Vietnamese bilingual word embeddings;
In Step4, word embedding mapping is performed: assume the word embedding matrices of Chinese and Vietnamese are X and Y respectively, where $X_{i*}$ is the vector of the $i$-th word of the source language and $Y_{j*}$ is the vector of the $j$-th word of the target language. The dictionary D is a binary matrix with $D_{ij}=1$ when the $i$-th source-language word is aligned with the $j$-th target-language word. The goal of word mapping is to find a mapping matrix W such that the mapped $X_{i*}W$ has the shortest Euclidean distance to $Y_{j*}$, i.e.

$$W^{*}=\arg\min_{W}\sum_{i}\sum_{j}D_{ij}\left\lVert X_{i*}W-Y_{j*}\right\rVert^{2}$$

After normalizing and centering the matrices X and Y and setting W as an orthogonal matrix, the above problem of minimizing Euclidean distances amounts to maximizing the dot product:

$$W^{*}=\arg\max_{W}\ \mathrm{Tr}\!\left(XWY^{T}D^{T}\right)$$

where Tr represents the trace operation of the matrix. The optimal solution is $W^{*}=UV^{T}$, where U and V are the orthogonal matrices of the singular value decomposition $X^{T}DY=U\Sigma V^{T}$; given that the matrix D is sparse, the solution is obtained in linear time;
dictionary self-learning: according to nearest-neighbor retrieval, each source-language word is assigned the closest target-language word, the aligned word pairs are added to the dictionary, and the procedure iterates until convergence.
And Step5, on the basis of Step4, the bilingual word vectors are applied in the unsupervised neural machine translation model with a shared encoder, and the Chinese-Vietnamese unsupervised neural machine translation model fused with the EMD-minimized bilingual dictionary is obtained by training.
Further, in Step 5:
the experiment is mainly divided into the following five parts: the unsupervised baseline model is used for translating Chinese-cross, fusing UNMT of an EMD minimized bilingual dictionary, adding 1 ten thousand parallel linguistic data and 10 ten thousand parallel linguistic data respectively on the basis of the method model, and directly using 1 ten thousand parallel linguistic data and 10 ten thousand parallel linguistic data to train on a GNMT model and a Transform model.
Unsupervised model training: the translation system is trained using only monolingual corpora. The 1st experiment applies the baseline model to train the Chinese-Vietnamese unsupervised translation model. The 2nd is the method herein: Chinese-Vietnamese UNMT fusing the EMD-minimized bilingual dictionary on top of the baseline.
Semi-supervised model training: in most cases, the languages under study have a small amount of parallel corpora, which can be used to improve the performance of the model, but the corpus size is not large enough to directly train a complete conventional NMT system. So, in addition to the monolingual corpora, a small amount of parallel corpora is added: experiments are performed with 10,000 and 100,000 parallel sentence pairs on top of the method presented herein.
Supervised model training: conventional supervised neural machine translation models are trained with the 10,000 and 100,000 parallel sentence pairs added in the semi-supervised experiments above, for comparison with the semi-supervised experiments.
TABLE 2. Comparison of Chinese-Vietnamese machine translation experiments with different methods
Comparing rows 1 and 2 of the experimental results in Table 2, fusing the unsupervised bilingual dictionary into the unsupervised model gains about 2.5 BLEU over the baseline system, indicating that the model herein captures more cross-lingual information from the monolingual corpora, improves the quality of the bilingual word vectors, and thus improves translation quality. Row 3 shows the semi-supervised system with 10,000 parallel sentence pairs added: Chinese-Vietnamese reaches 10.02 BLEU and Vietnamese-Chinese 13.91 BLEU. Comparing rows 5, 6, 7 and 8, it is easy to see that the method presented herein works better in the low-resource setting; with only 10,000 parallel sentence pairs the system nearly matches the effect of directly training a model with 100,000 parallel sentence pairs. From the comparison of rows 4 and 8, when 100,000 parallel sentence pairs are added, both the Chinese-Vietnamese and Vietnamese-Chinese directions exceed the Transformer model.
TABLE 3. Example analysis of Chinese-Vietnamese unsupervised machine translation with different methods
From the translation results in Table 3, although the model still suffers from inaccurate translations caused by learning bias, the translation quality of the method herein is clearly improved over the baseline system.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (5)

1. The Chinese-Vietnamese unsupervised neural machine translation method fused with the EMD-minimized bilingual dictionary is characterized in that:
the method comprises the following specific steps:
step1, corpus collection: crawling Chinese and Vietnamese monolingual corpora by using a web crawler;
step2, corpus preprocessing: on the basis of Step1, performing word segmentation and part-of-speech tagging on the Chinese and Vietnamese monolingual sentences, and training to obtain monolingual word vectors;
step3, unsupervised bilingual dictionary based on EMD minimization: on the basis of Step2, training an unsupervised Chinese-Vietnamese bilingual dictionary from the Chinese and Vietnamese monolingual word vectors using an EMD-minimization method;
step4, obtaining Chinese-Vietnamese bilingual word embeddings: on the basis of Step2 and Step3, using the unsupervised EMD-minimized bilingual dictionary as a seed dictionary to guide the learning of bilingual word embeddings, generating Chinese-Vietnamese bilingual word embeddings;
step5, on the basis of Step4, applying the bilingual word vectors in an unsupervised neural machine translation model with a shared encoder, and training to obtain a Chinese-Vietnamese unsupervised neural machine translation model fused with the EMD-minimized bilingual dictionary;
in Step4, word embedding mapping is performed: assume the word embedding matrices of Chinese and Vietnamese are X and Y respectively, where $X_{i*}$ is the vector of the $i$-th word of the source language and $Y_{j*}$ is the vector of the $j$-th word of the target language; the dictionary D is a binary matrix with $D_{ij}=1$ when the $i$-th source-language word is aligned with the $j$-th target-language word; the goal of word mapping is to find a mapping matrix W such that the mapped $X_{i*}W$ is closest in Euclidean distance to $Y_{j*}$, i.e.

$$W^{*}=\arg\min_{W}\sum_{i}\sum_{j}D_{ij}\left\lVert X_{i*}W-Y_{j*}\right\rVert^{2}$$

after normalizing and centering the matrices X and Y and setting W as an orthogonal matrix, the above problem of minimizing Euclidean distances amounts to maximizing the dot product:

$$W^{*}=\arg\max_{W}\ \mathrm{Tr}\!\left(XWY^{T}D^{T}\right)$$

wherein Tr represents the trace operation of the matrix; the optimal solution is $W^{*}=UV^{T}$, where U and V are the orthogonal matrices of the singular value decomposition $X^{T}DY=U\Sigma V^{T}$; given that the matrix D is sparse, the solution is obtained in linear time;
the dictionary self-learning comprises the following steps: the mapped word vectors of the source-language words and the word vectors of the target-language words lie in the same space; according to nearest-neighbor retrieval, each source-language word is assigned the closest target-language word, the aligned word pairs are added to the dictionary, and the procedure iterates until convergence.
2. The Chinese-Vietnamese unsupervised neural machine translation method fusing the EMD-minimized bilingual dictionary according to claim 1, wherein the specific steps of Step2 are as follows:
step2, performing word segmentation and part-of-speech tagging on the Chinese and Vietnamese monolingual sentences with a word segmentation and part-of-speech tagging tool, and obtaining Chinese and Vietnamese monolingual word embeddings with a word vector training tool.
3. The Chinese-Vietnamese unsupervised neural machine translation method fusing the EMD-minimized bilingual dictionary according to claim 1, wherein the specific steps of Step3 are as follows:
step3, an EMD-minimization method is used between the Chinese word vector distribution and the Vietnamese word vector distribution: the word vectors are regarded as probability distributions, the distance between the distributions is taken as a vocabulary-level criterion, and training proceeds in an unsupervised manner without any seed dictionary to find the minimum EMD between the Chinese and Vietnamese word vector distributions, obtaining the Chinese-Vietnamese bilingual dictionary.
4. The Chinese-Vietnamese unsupervised neural machine translation method fusing the EMD-minimized bilingual dictionary according to claim 1, wherein the specific steps of Step4 are as follows:
using the Chinese-Vietnamese bilingual dictionary obtained in Step3 as a seed dictionary; guiding the training of Chinese-Vietnamese bilingual word embeddings with a self-learning model; and obtaining the Chinese-Vietnamese bilingual word embeddings by training.
5. The Chinese-Vietnamese unsupervised neural machine translation method fusing the EMD-minimized bilingual dictionary according to claim 1, wherein in Step5:
and embedding and applying the trained bilingual words fused with the EMD bilingual dictionary in the model of the shared encoder by using a shared encoder model, so as to realize word-level correspondence between Chinese and more bilingual words and train a unsupervised neural machine translation model of the Chinese.
CN202010096013.9A 2020-02-17 2020-02-17 Chinese-Vietnamese unsupervised neural machine translation method fusing EMD-minimized bilingual dictionary Active CN111753557B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010096013.9A CN111753557B (en) 2020-02-17 2020-02-17 Chinese-Vietnamese unsupervised neural machine translation method fusing EMD-minimized bilingual dictionary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010096013.9A CN111753557B (en) 2020-02-17 2020-02-17 Chinese-Vietnamese unsupervised neural machine translation method fusing EMD-minimized bilingual dictionary

Publications (2)

Publication Number Publication Date
CN111753557A CN111753557A (en) 2020-10-09
CN111753557B true CN111753557B (en) 2022-12-20

Family

ID=72673087

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010096013.9A Active CN111753557B (en) 2020-02-17 2020-02-17 Chinese-Vietnamese unsupervised neural machine translation method fusing EMD-minimized bilingual dictionary

Country Status (1)

Country Link
CN (1) CN111753557B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287694A (en) * 2020-09-18 2021-01-29 昆明理工大学 Shared encoder-based Chinese-crossing unsupervised neural machine translation method
CN112633018B (en) * 2020-12-28 2022-04-15 内蒙古工业大学 Mongolian Chinese neural machine translation method based on data enhancement
CN112836527B (en) * 2021-01-31 2023-11-21 云知声智能科技股份有限公司 Training method, system, equipment and storage medium of machine translation model
CN112926324B (en) * 2021-02-05 2022-07-29 昆明理工大学 Vietnamese event entity recognition method integrating dictionary and anti-migration
CN113076398B (en) * 2021-03-30 2022-07-29 昆明理工大学 Cross-language information retrieval method based on bilingual dictionary mapping guidance
CN113343672B (en) * 2021-06-21 2022-12-16 哈尔滨工业大学 Unsupervised bilingual dictionary construction method based on corpus merging
CN113591496A (en) * 2021-07-15 2021-11-02 清华大学 Bilingual word alignment method and system
CN113591490B (en) * 2021-07-29 2023-05-26 北京有竹居网络技术有限公司 Information processing method and device and electronic equipment
CN114595688B (en) * 2022-01-06 2023-03-10 昆明理工大学 Chinese cross-language word embedding method fusing word cluster constraint
CN114078468B (en) * 2022-01-19 2022-05-13 广州小鹏汽车科技有限公司 Voice multi-language recognition method, device, terminal and storage medium
CN114492476A (en) * 2022-01-30 2022-05-13 天津大学 Language code conversion vocabulary overlapping enhancement method for unsupervised neural machine translation
CN115618885A (en) * 2022-09-22 2023-01-17 无锡捷通数智科技有限公司 Statement translation method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102018110369A1 (en) * 2017-04-28 2018-10-31 Intel Corporation IMPROVEMENT OF AUTONOMOUS MACHINES BY CLOUD, ERROR CORRECTION AND PREDICTION
CN108897797A (en) * 2018-06-12 2018-11-27 腾讯科技(深圳)有限公司 Update training method, device, storage medium and the electronic equipment of dialog model
CN110334881A (en) * 2019-07-17 2019-10-15 深圳大学 A kind of Financial Time Series Forecasting method based on length memory network and depth data cleaning, device and server

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858046B (en) * 2018-02-09 2024-03-08 谷歌有限责任公司 Learning long-term dependencies in neural networks using assistance loss

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102018110369A1 (en) * 2017-04-28 2018-10-31 Intel Corporation IMPROVEMENT OF AUTONOMOUS MACHINES BY CLOUD, ERROR CORRECTION AND PREDICTION
CN108897797A (en) * 2018-06-12 2018-11-27 腾讯科技(深圳)有限公司 Update training method, device, storage medium and the electronic equipment of dialog model
CN110334881A (en) * 2019-07-17 2019-10-15 深圳大学 A kind of Financial Time Series Forecasting method based on length memory network and depth data cleaning, device and server

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Fully Character-Level Neural Machine Translation without Explicit Segmentation";Jason Lee 等;《Transactions of the Association for Computational Linguistics》;20170531(第5期);第365-378页 *
"倾向近邻关联的神经机器翻译";王坤 等;《计算机科学》;20190515;第46卷(第5期);第198-202页 *

Also Published As

Publication number Publication date
CN111753557A (en) 2020-10-09

Similar Documents

Publication Publication Date Title
CN111753557B (en) Chinese-Vietnamese unsupervised neural machine translation method fusing EMD-minimized bilingual dictionary
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN109582789B (en) Text multi-label classification method based on semantic unit information
Lee et al. Learning dense representations of phrases at scale
CN109635124B (en) Remote supervision relation extraction method combined with background knowledge
CN110334361B (en) Neural machine translation method for Chinese language
Liang et al. An end-to-end discriminative approach to machine translation
Gu et al. Unpaired image captioning by language pivoting
US10255275B2 (en) Method and system for generation of candidate translations
CN109190131B (en) Neural machine translation-based English word and case joint prediction method thereof
Liu et al. A recursive recurrent neural network for statistical machine translation
CN109933808B (en) Neural machine translation method based on dynamic configuration decoding
CN105068998A (en) Translation method and translation device based on neural network model
Garg et al. Machine translation: a literature review
CN106484682A (en) Based on the machine translation method of statistics, device and electronic equipment
Li et al. Spelling error correction using a nested RNN model and pseudo training data
CN108460028A (en) Sentence weight is incorporated to the field adaptive method of neural machine translation
Qiu et al. Dependency-Based Local Attention Approach to Neural Machine Translation.
Ahmadnia et al. Neural machine translation advised by statistical machine translation: The case of farsi-spanish bilingually low-resource scenario
US11586833B2 (en) System and method for bi-directional translation using sum-product networks
Espla-Gomis et al. Using machine translation to provide target-language edit hints in computer aided translation based on translation memories
CN113468883B (en) Fusion method and device of position information and computer readable storage medium
CN112926344A (en) Word vector replacement data enhancement-based machine translation model training method and device, electronic equipment and storage medium
CN112287694A (en) Shared encoder-based Chinese-crossing unsupervised neural machine translation method
CN110321568B (en) Chinese-Yue convolution neural machine translation method based on fusion of part of speech and position information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant