CN111753557A - Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary - Google Patents

Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary

Info

Publication number
CN111753557A
CN111753557A (application CN202010096013.9A)
Authority
CN
China
Prior art keywords
chinese
word
bilingual
emd
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010096013.9A
Other languages
Chinese (zh)
Other versions
CN111753557B (en)
Inventor
余正涛
薛明亚
高盛祥
赖华
翟家欣
朱恩昌
陈玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202010096013.9A
Publication of CN111753557A
Application granted
Publication of CN111753557B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary, belonging to the technical field of machine translation. The invention comprises the following steps: corpus collection, in which Chinese and Vietnamese monolingual sentences are crawled with a web crawler; first, monolingual word embeddings are trained separately for Chinese and Vietnamese, and a Chinese-Vietnamese bilingual dictionary is obtained by training to minimize the EMD between the word embedding distributions; this dictionary is then used as a seed dictionary to train Chinese-Vietnamese bilingual word embeddings; finally, the bilingual word embeddings are applied in a shared-encoder unsupervised machine translation model to construct a Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary. The method can effectively improve the performance of Chinese-Vietnamese unsupervised neural machine translation.

Description

Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary
Technical Field
The invention relates to a Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD (Earth Mover's Distance) minimized bilingual dictionary, belonging to the technical field of machine translation.
Background
Neural machine translation is a machine translation method proposed in recent years; its translation quality has surpassed statistical machine translation on many language pairs, and it has become the mainstream translation method. However, neural machine translation requires large-scale parallel corpora to perform well, and when training data are insufficient the translation quality is poor. Parallel corpora between Chinese and Vietnamese are scarce and not readily available, so Chinese-Vietnamese machine translation is a typical low-resource machine translation task. Nevertheless, large amounts of Chinese and Vietnamese monolingual data exist, and Chinese-Vietnamese unsupervised neural machine translation using only monolingual corpora is explored herein. This is of great value for promoting exchange and cooperation between the two countries, and of important theoretical and practical value for research on machine translation of low-resource languages.
At present, research on unsupervised machine translation mainly follows two lines: unsupervised machine translation based on adversarial learning, and unsupervised machine translation based on a shared encoder (shared space). Lample et al. proposed mapping sentences from two different monolingual corpora into the same space, learning to reconstruct in both languages from this shared feature space, and thereby realizing unsupervised neural machine translation with monolingual corpora only. Artetxe et al. modified the model by pre-training unsupervised bilingual word embeddings and using a shared encoder with separate decoders, providing unsupervised neural machine translation that uses only monolingual corpora. The weight-sharing unsupervised machine translation model proposed by Yang et al. better preserves the characteristics and internal features of each language compared with the shared-encoder model, improving translation quality, and Lample et al. obtained further improvements by combining neural machine translation with phrase-based statistical machine translation. Lample et al. also proposed cross-lingual language model pre-training for initializing the lookup table, improving the quality of the pre-trained cross-lingual word embeddings and significantly improving the performance of unsupervised machine translation models. These works use cognate words between similar languages, or digit alignment from the monolingual corpora, as the initial cross-lingual signal, and then extend it by learning to realize unsupervised neural machine translation. Chinese and Vietnamese differ greatly and share no usable cognates, so methods relying on cognate words are not feasible for the Chinese-Vietnamese language pair; the shared-encoder unsupervised neural machine translation of Artetxe et al., which is built on unsupervised bilingual word vectors, fits the characteristics of such a distant language pair. The invention therefore chooses to extend the work of Artetxe et al.; however, the quality of bilingual word embeddings learned only from Arabic numerals shared between the languages is limited, so the idea of the invention is to improve the quality of the unsupervised bilingual word embeddings in order to improve the quality of Chinese-Vietnamese unsupervised neural machine translation.
In unsupervised machine translation using only Chinese and Vietnamese monolingual corpora, machine translation is difficult to realize directly, but acquiring a bilingual dictionary is comparatively easy. The invention therefore first trains a Chinese-Vietnamese bilingual dictionary on the Chinese and Vietnamese monolingual corpora, and then uses this dictionary as a seed dictionary to guide the training of higher-quality bilingual word embeddings, thereby improving the quality of Chinese-Vietnamese unsupervised neural machine translation. Zhang et al. proposed exploiting the similarity of the word vector space distributions of two languages and training a bilingual dictionary by EMD minimization; the whole process uses only monolingual corpora in an unsupervised manner, yet achieves quality comparable to supervised methods, which suits the large differences between Chinese and Vietnamese. Chinese-Vietnamese unsupervised neural machine translation fusing an EMD-minimized bilingual dictionary is therefore proposed herein.
The method first regards the monolingual word embeddings of Chinese and Vietnamese as two probability distributions and obtains a Chinese-Vietnamese bilingual dictionary by training to minimize the EMD between the two embedding distributions; it then uses this dictionary as a seed dictionary and trains Chinese-Vietnamese bilingual word embeddings with a self-learning method; finally, it realizes Chinese-Vietnamese unsupervised neural machine translation on a shared-encoder model.
Disclosure of Invention
The invention provides a Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary, which is used in unsupervised translation systems for low-resource languages and improves Chinese-Vietnamese unsupervised neural machine translation performance.
The technical scheme of the invention is as follows: the Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary comprises the following specific steps:
step1, corpus collection: crawling Chinese and Vietnamese monolingual corpora by using a web crawler; the monolingual corpus is mainly from Chinese and Vietnamese monolingual news websites;
step2, corpus preprocessing: on the basis of Step1, word segmentation and part-of-speech tagging are performed on the Chinese and Vietnamese monolingual sentences with segmentation and tagging tools, and Chinese and Vietnamese monolingual word embeddings are obtained with a word vector training tool, yielding monolingual word vectors. When the separately trained Chinese and Vietnamese monolingual word vectors are mapped into a vector space, the monolingual word vector spaces of the two languages appear approximately isomorphic, meaning that a linear mapping exists that can approximately connect the two spaces;
step3, unsupervised bilingual dictionary based on EMD minimization: on the basis of Step2, an unsupervised Chinese-Vietnamese bilingual dictionary is trained from the Chinese and Vietnamese monolingual word vectors using an EMD (Earth Mover's Distance) minimization method;
as a preferred embodiment of the present invention, the Step3 specifically comprises the following steps:
step3, the EMD minimization method is used between the Chinese word vector distribution and the Vietnamese word vector distribution: the word vectors are regarded as probability distributions, the distance between the distributions serves as the lexicon-level criterion, and unsupervised training, without any seed dictionary, finds the minimum EMD between the two distributions, yielding a Chinese-Vietnamese bilingual dictionary.
The circles in fig. 3 are regarded as mounds of earth and the squares as holes, with their sizes representing the volume of each mound and hole, i.e., the corresponding weights. In the example of fig. 3 all weights are equal. In this setting, the aim is to fill the holes at the minimum overall cost of moving earth, measured by the product of the distance moved and the volume moved. The arrows in fig. 3(b) represent the optimal moving scheme in this example, and this scheme can be read directly as a lexicon translation result. From a microscopic view, because the earth of the mound assigned to the "music" hole is used up entirely in filling it, that mound no longer competes for the "dance" hole, and the remaining mound is left responsible for filling the "dance" hole. From a macroscopic view, minimizing the overall moving cost takes global information into account, overcoming the locality of nearest-neighbor search and thus alleviating the hubness problem. The notion of global weighted matching in this metaphor can be implemented mathematically with EMD, whose very name derives from the metaphor.
\min_{W} \sum_{i=1}^{V_s} \sum_{j=1}^{V_t} C_{ij} W_{ij}
\text{s.t. } W_{ij} \ge 0, \quad \sum_{j=1}^{V_t} W_{ij} = t_i, \quad \sum_{i=1}^{V_s} W_{ij} = s_j
wherein V_s represents the size of the source-language vocabulary, V_t the size of the target-language vocabulary, C_{ij} the distance between the i-th mound and the j-th hole, t_i the volume of the i-th mound, and s_j the volume of the j-th hole. The decision variables W_{ij} of the optimization problem represent the volume of earth moved from the i-th mound to the j-th hole, so the objective function minimizes the overall moving cost. After solving, the nonzero W_{ij} indicate translation pairs; experiments show that lexicon translation using EMD can perform better than nearest-neighbor retrieval.
In order to better exploit EMD's ability to handle the phenomenon of one word having multiple translations, it is proposed herein to introduce EMD into the training process of the bilingual word vectors. In the training objective, EMD participates as a regularization term, so that the bilingual word vectors obtained by training better capture one-to-many translation. Its effect is verified experimentally.
The adversarial learning approach can also be viewed in this framework, since adversarial learning implicitly optimizes the Jensen-Shannon divergence; but for the lexicon translation task, other, better-suited distribution distances are available. Since EMD is a distance between distributions well suited to lexicon translation, it is used as the lexicon-level criterion to guide the learning of the linear mapping, i.e., a mapping G is sought that minimizes the EMD between the mapped source-language word vector distribution and the target-language word vector distribution, as shown in FIG. 4. In mathematical form,
\min_{G} \mathrm{EMD}\left(p_{G(x)}, p_{y}\right)
wherein p_{G(x)} represents the distribution of the source-language word vectors after mapping by G, and p_y represents the target-language word vector distribution.
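For illustration only, this objective can be evaluated for one candidate mapping G with an off-the-shelf optimal transport solver; the toy embeddings, uniform word weights, and use of the POT package below are assumptions, not the training procedure of Zhang et al.:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))        # toy Chinese word vectors
Y = rng.normal(size=(200, 50))        # toy Vietnamese word vectors
G = np.eye(50)                        # one candidate linear mapping G

a = np.full(len(X), 1.0 / len(X))     # uniform mass per source word
b = np.full(len(Y), 1.0 / len(Y))     # uniform mass per target word
M = ot.dist(X @ G, Y)                 # pairwise squared Euclidean costs
print("EMD(p_G(x), p_y) =", ot.emd2(a, b, M))
# Training would search over G (Zhang et al. use adversarial updates)
# to drive this quantity down.
```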
Step4, obtaining the bilingual word embedding: on the basis of the steps of Step2 and Step3, an unsupervised bilingual dictionary based on EMD minimization is used as a seed dictionary to guide the learning of bilingual word embedding by using a self-learning model; generating Chinese-more bilingual word embedding;
word embedding mapping: assuming that the word embedding matrices for the languages chinese and vietnamese are X and Y respectively,
Figure RE-GDA0002477274680000046
a vector for the ith word of the source language,
Figure RE-GDA0002477274680000047
a vector for the jth word of the target language; the dictionary D is a binary matrix, D is when the ith word of the source language is aligned with the jth word of the target languageij1. The goal of word mapping is to find a mapping matrix W such that the mapped word is
Figure RE-GDA0002477274680000048
And
Figure RE-GDA0002477274680000049
is closest to the Euclidean distance of (i.e. is
Figure RE-GDA0002477274680000042
After normalizing and centering the matrices X and Y and setting W as an orthogonal matrix, the above problem of solving euclidean distances is equivalent to maximizing the dot product:
Figure RE-GDA0002477274680000043
tr represents trace operation of the matrix, and the optimal solution can be obtained by solving W ═ UVT(U, V denotes two orthogonal matrices), subjected to singular value decomposition, XTDY=U∑VT. Given that the matrix D is sparse, a solution can be obtained in linear time.
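A minimal numpy sketch of this closed-form solution follows; the toy embedding matrices and seed dictionary are illustrative assumptions:

```python
import numpy as np

def normalize(M):
    M = M / np.linalg.norm(M, axis=1, keepdims=True)  # unit length
    return M - M.mean(axis=0)                         # mean centering

def procrustes_mapping(X, Y, D):
    # Optimal orthogonal W maximizing tr(X W Y^T D^T): W = U V^T,
    # where X^T D Y = U S V^T.
    U, _, Vt = np.linalg.svd(X.T @ D @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
X = normalize(rng.normal(size=(1000, 300)))   # toy Chinese embeddings
Y = normalize(rng.normal(size=(1000, 300)))   # toy Vietnamese embeddings
D = np.zeros((1000, 1000))
D[np.arange(50), np.arange(50)] = 1           # toy seed dictionary pairs
W = procrustes_mapping(X, Y, D)               # mapped source space: X @ W
```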
Dictionary self-learning: according to nearest-neighbor retrieval, each source-language word is assigned the closest target-language word, the aligned word pairs are added to the dictionary, and the process iterates until convergence.
Taking FIG. 5 as an example, the word pairs aligned in the initial dictionary are (horse-ngựa, dog-chó). One mapping is computed from this dictionary, such that the mapped "horse" is close to "ngựa" and "dog" is close to "chó" in Euclidean distance. Then, in the mapped space, the closest corresponding word is found for each remaining word; "cat" is found to be closest to "mèo", so it is also added to the dictionary. With (horse-ngựa, dog-chó, cat-mèo) as the new seed dictionary, re-minimizing the Euclidean distance yields a new mapping matrix W, and thus a new alignment result.
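The loop can be sketched as follows, reusing the Procrustes solution above; the convergence test and toy inputs (assumed already normalized and centered) are illustrative assumptions rather than the patent's exact procedure:

```python
import numpy as np

def self_learning(X, Y, D, iters=10):
    """Alternate mapping induction (Procrustes) and dictionary induction
    (nearest neighbor) until the dictionary stops changing."""
    for _ in range(iters):
        U, _, Vt = np.linalg.svd(X.T @ D @ Y)   # solve mapping from dictionary
        W = U @ Vt
        sims = (X @ W) @ Y.T                    # cosine sims for unit vectors
        nn = sims.argmax(axis=1)                # closest target for each source
        D_new = np.zeros_like(D)
        D_new[np.arange(X.shape[0]), nn] = 1    # re-induced dictionary
        if np.array_equal(D_new, D):            # converged: unchanged
            break
        D = D_new
    return W, D
```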
After training, translation is performed using beam search; the beam size must be chosen to balance translation time against search accuracy.
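For illustration, a compact beam search over a generic next-token scoring function might look as follows; the `step` interface and the length normalization are assumptions, not the decoder API actually used:

```python
def beam_search(step, bos, eos, beam_size=5, max_len=50):
    # Each hypothesis: (token sequence, cumulative log-probability).
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:
                finished.append((seq, score))   # completed hypothesis
                continue
            for tok, logp in step(seq):         # top next-token extensions
                candidates.append((seq + [tok], score + logp))
        if not candidates:
            break
        # Keep only the beam_size best partial translations.
        beams = sorted(candidates, key=lambda h: h[1], reverse=True)[:beam_size]
    finished.extend(beams)
    # Length-normalize so short outputs are not unfairly favored.
    return max(finished, key=lambda h: h[1] / len(h[0]))[0]
```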
Fusing the unsupervised bilingual dictionary trained by EMD minimization, and using it as the seed dictionary, improves the effect of dictionary self-learning and further improves the quality of the bilingual word vectors.
And Step5, on the basis of Step4, the bilingual word vectors are applied in the shared-encoder unsupervised neural machine translation model, and a Chinese-Vietnamese unsupervised neural machine translation model fusing the EMD-minimized bilingual dictionary is obtained by training.
The method provided by the invention fuses an unsupervised EMD-minimized bilingual dictionary on the basis of the shared-encoder model of Artetxe et al., and has a stronger ability than the original model to mine cross-lingual information from the Chinese and Vietnamese monolingual corpora. The model structure is shown in fig. 6; the model follows the standard attention-based encoder-decoder architecture proposed by Bahdanau et al. It consists of one shared encoder and two decoders, corresponding to the source language and the target language respectively. The encoder is a two-layer bidirectional recurrent neural network (BiGRU), and each decoder is a two-layer unidirectional recurrent neural network (GRU). For the attention mechanism, the global attention method with the general alignment function proposed by Luong et al. is used herein. On the encoder side, the pre-trained Chinese-Vietnamese bilingual dictionary and bilingual word vectors are used to accept the input sequence and generate language-independent representations. The word vectors on the decoder side are continuously updated during training, and training and translation are carried out through the two decoders.
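A hedged PyTorch sketch of this architecture is given below; the hidden sizes, module names, and the "general" attention implementation are illustrative assumptions, not the patent's exact implementation:

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Two-layer bidirectional GRU shared by Chinese and Vietnamese."""
    def __init__(self, embed, hidden=512):
        super().__init__()
        self.embed = embed  # fixed, pre-trained bilingual embeddings
        self.rnn = nn.GRU(embed.embedding_dim, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)

    def forward(self, tokens):                   # tokens: (batch, src_len)
        outs, _ = self.rnn(self.embed(tokens))   # (batch, src_len, 2*hidden)
        return outs

class Decoder(nn.Module):
    """Two-layer GRU decoder with Luong-style 'general' global attention."""
    def __init__(self, vocab, embed_dim=300, hidden=1024):
        super().__init__()
        # Decoder embeddings are trainable, unlike the encoder's.
        self.embed = nn.Embedding(vocab, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden, num_layers=2, batch_first=True)
        # 'general' score h_t^T W_a h_s; assumes encoder dim == hidden here.
        self.attn_general = nn.Linear(hidden, hidden, bias=False)
        self.attn_out = nn.Linear(2 * hidden, hidden)
        self.proj = nn.Linear(hidden, vocab)

    def forward(self, tokens, enc_outs, state=None):
        out, state = self.rnn(self.embed(tokens), state)
        scores = torch.bmm(self.attn_general(out), enc_outs.transpose(1, 2))
        ctx = torch.bmm(scores.softmax(dim=-1), enc_outs)  # weighted context
        mixed = torch.tanh(self.attn_out(torch.cat([out, ctx], dim=-1)))
        return self.proj(mixed), state

# One shared encoder, two language-specific decoders (zh and vi).
embed = nn.Embedding.from_pretrained(torch.randn(30000, 300), freeze=True)
encoder = SharedEncoder(embed)
dec_zh, dec_vi = Decoder(vocab=30000), Decoder(vocab=25000)
```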
For each sentence in Chinese (L1), the model alternates between two training steps: denoising, which optimizes the probability of encoding a noised version of the sentence with the shared encoder and reconstructing it with the L1 decoder; and on-the-fly back-translation, which translates the sentence in inference mode (encoding it with the shared encoder and decoding it with the Vietnamese (L2) decoder), and then optimizes the probability of encoding this translated sentence with the shared encoder and recovering the original sentence with the L1 decoder. Training alternates between sentences in L1 and L2, with analogous steps for L2.
Dual structure: while NMT systems are typically built for a particular translation direction (e.g., Chinese->Vietnamese or Vietnamese->Chinese), the duality of machine translation is exploited herein to handle both directions simultaneously (e.g., Chinese<->Vietnamese).
Shared encoder: similar to Ha et al., Lee et al., and Johnson et al., the system herein uses a single encoder shared by both languages, i.e., both Chinese and Vietnamese are encoded with the same encoder. The shared encoder is intended to produce language-independent representations of both languages, from which each decoder decodes into its corresponding language.
Pre-trained fixed bilingual word embeddings: while most neural machine translation systems randomly initialize their word vectors and update them during training, pre-trained cross-lingual word vectors are used in the encoder here and remain fixed throughout training. The encoder thus receives language-independent word-level representations and only needs to learn how to combine them to build representations of larger phrases.
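In PyTorch, for example, such fixed encoder embeddings can be expressed as follows (the random stand-in weights are an assumption; in practice they would be the pre-trained Chinese-Vietnamese bilingual vectors):

```python
import torch
import torch.nn as nn

weights = torch.randn(30000, 300)   # stand-in for pre-trained bilingual vectors
embed = nn.Embedding.from_pretrained(
    weights,
    freeze=True,  # encoder embeddings stay fixed during training
)
# Decoder embeddings, by contrast, are ordinary trainable nn.Embedding layers.
```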
Experiments in Artetxe et al. have shown that adding denoising and back-translation to the system helps improve translation quality, and the invention uses a shared-encoder system with denoising and back-translation.
For each sentence in Chinese (L1), the system is trained in two steps. Denoising: the probability of encoding a noised version of the sentence with the shared encoder and reconstructing it with the L1 decoder is optimized, as in fig. 7(a). Back-translation: the sentence is translated in inference mode (encoded with the shared encoder and decoded with the Vietnamese (L2) decoder, as in fig. 7(b)), and then the probability of encoding this translated sentence with the shared encoder and recovering the source sentence with the L1 decoder is optimized. These two steps alternate between L1 and L2; the training steps for L2 are analogous, as in fig. 7(c) and (d). Neural machine translation systems are usually trained on parallel corpora; since only monolingual corpora are available, supervised training cannot be used in the present scenario. However, with the model architecture of fig. 6, the entire system can be trained unsupervised by combining two methods, denoising and back-translation:
denoising: due to the use of a shared encoder and the use of the dual structure of machine translation, the system herein can be trained directly to reconstruct the input sentence. In particular, the system encodes an input sentence in a given language using a shared encoder, and then reconstructs the source sentence using a decoder for that language. Given that pre-trained cross-language word vectors are used in a shared encoder, which learns to combine the embedding of two languages into a language-independent characterization, each decoder should learn to decode such characterization into the corresponding language. In inference mode, the source language decoder is replaced by the target language decoder only, so that the system can generate a translation of the input text using the language independent tokens generated by the encoder.
Random noise is introduced into the input sentence herein. The idea, following the denoising autoencoder principle, is to train the system to reconstruct the original version of a corrupted input sentence. To this end, the word order of the input sentence is altered by random swaps of consecutive words; for a sequence of N elements, N/2 such random swaps are performed. The system thus has to learn the internal structure of the language to recover the correct word order; at the same time, it is prevented from relying excessively on the word order of the input sequence, so as to better account for actual word-order differences across languages.
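A minimal sketch of this noise model, assuming a sentence given as a list of words:

```python
import random

def add_swap_noise(words, rng=random):
    """Corrupt a sentence by n//2 random swaps of adjacent words."""
    words = list(words)
    n = len(words)
    if n < 2:
        return words
    for _ in range(n // 2):
        i = rng.randrange(n - 1)              # random adjacent position
        words[i], words[i + 1] = words[i + 1], words[i]
    return words

print(add_swap_noise("the system learns word order".split()))
```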
Back-translation: despite the denoising strategy, training so far is still a copying task with some synthetic perturbations and, most importantly, involves a single language at a time, disregarding the final goal of translating between the two languages. In order to train the system in a true translation setting without violating the constraint of using only monolingual corpora, the back-translation method proposed by Sennrich et al. is added to the system. Specifically, given an input sentence in one language, the system translates it into the other language in inference mode using greedy decoding (i.e., using the shared encoder and the other language's decoder). In this way, pseudo-parallel sentence pairs are obtained, and the system is trained to predict the original sentence from the synthetic translation.
It is noted that, in contrast to standard back-translation, which uses an independent model to back-translate the entire corpus at once, here each mini-batch of sentences is back-translated on the fly using the model being trained, exploiting the dual structure of the proposed architecture. Thus, as training progresses and the model improves, it produces better synthetic sentence pairs through back-translation, which in turn help to further improve the model in subsequent iterations.
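One such on-the-fly update for the Chinese (L1) direction can be sketched as below; `translate_greedy` and `train_step` are assumed stand-ins for inference with the current model and one cross-entropy update, not the actual training API:

```python
def backtranslation_update(batch_zh, model, translate_greedy, train_step):
    # 1) Inference mode: translate the Chinese batch into Vietnamese with
    #    the CURRENT model (shared encoder + Vietnamese decoder, greedy).
    pseudo_vi = [translate_greedy(model, sent, src="zh", tgt="vi")
                 for sent in batch_zh]
    # 2) Training mode: encode the pseudo-Vietnamese with the shared encoder
    #    and train the Chinese decoder to recover the original sentences.
    return train_step(model, src_batch=pseudo_vi, tgt_batch=batch_zh,
                      src_lang="vi", tgt_lang="zh")
```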
Because Chinese and Vietnamese differ greatly and share no cognate words, and in accordance with this characteristic of the language pair, the method of minimizing the EMD between word vector distributions is introduced to learn a Chinese-Vietnamese dictionary from the Chinese and Vietnamese monolingual corpora; a Chinese-Vietnamese unsupervised neural machine translation method fusing the EMD-minimized bilingual dictionary is proposed, improving the performance of unsupervised neural machine translation.
The invention has the beneficial effects that:
the invention realizes the unsupervised neural machine translation system of the languages with larger Chinese cross-distance difference, improves the capability of the shared encoder model unsupervised neural machine translation model to acquire cross-language information of the languages with larger difference, and further improves the unsupervised neural machine translation quality of the Chinese cross-distance. The unsupervised Chinese-Vietnamese language translation model has the advantages that unsupervised operation is expanded from similar languages containing homologous words to Chinese-Vietnamese language tasks with large differences, and the performance of the unsupervised neural machine translation model of the shared encoder is improved.
Drawings
FIG. 1 is a flow chart of the Chinese-Vietnamese unsupervised neural machine translation method proposed by the present invention;
FIG. 2 shows the monolingual word vector spaces of Chinese and Vietnamese of the present invention;
FIG. 3 illustrates the hubness problem of the present invention;
FIG. 4 is an Earth Mover's Distance minimization learning diagram of the present invention;
FIG. 5 is a schematic diagram of the word mapping process using number alignment of the present invention;
FIG. 6 is the Chinese-Vietnamese unsupervised NMT model fusing the EMD-minimized bilingual dictionary of the present invention;
FIG. 7 shows the 4 processes of training the Chinese-Vietnamese unsupervised NMT model fusing the EMD-minimized bilingual dictionary.
Detailed Description
Example 1: as shown in figs. 1-7, in the Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary, Step1 first collects monolingual corpora: 58 million Chinese monolingual sentences and 30 million Vietnamese monolingual sentences are crawled from the Internet.
Step2, corpus preprocessing: on the basis of Step1, the Chinese and Vietnamese monolingual sentences are segmented and part-of-speech tagged, and monolingual word vectors are obtained by training. The underthesea Vietnamese word segmentation tool is used for Vietnamese segmentation and part-of-speech tagging, and the jieba segmentation tool is used for Chinese. word2vec is used to train monolingual word vectors for Chinese and Vietnamese, 300-dimensional for each language, with the skip-gram model. These vectors are later used, once the seed dictionary has been added, to train the bilingual word vectors.
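For illustration, the preprocessing pipeline with the tools named above might look as follows; the sample sentences, tool options, and output file name are assumptions:

```python
import jieba.posseg as pseg
from underthesea import pos_tag
from gensim.models import Word2Vec

# Chinese: segmentation + POS tagging with jieba.
zh = [(p.word, p.flag) for p in pseg.cut("神经机器翻译需要大规模平行语料。")]
# Vietnamese: POS tagging (which also tokenizes) with underthesea.
vi = pos_tag("dịch máy nơ-ron cần ngữ liệu song song quy mô lớn")

# Skip-gram word2vec (sg=1), 300 dimensions, on the segmented corpus;
# the real corpus would be millions of segmented sentences.
zh_sentences = [[w for w, _ in zh]]
model = Word2Vec(zh_sentences, vector_size=300, sg=1, window=5,
                 min_count=1, workers=4)
model.wv.save_word2vec_format("zh.vec")   # hypothetical output file name
```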
When the separately trained Chinese and Vietnamese monolingual word vectors are mapped into a vector space as shown in FIG. 2, the monolingual word vector spaces of the two languages appear approximately isomorphic, meaning that a linear mapping exists that can approximately connect the two spaces.
Step3, unsupervised bilingual dictionary based on EMD minimization: on the basis of Step2, an unsupervised Chinese-Vietnamese bilingual dictionary is trained from the Chinese and Vietnamese monolingual word vectors using the EMD minimization method.
Further, the Step3 includes the specific steps of:
step3, the EMD minimization method is used between the Chinese word vector distribution and the Vietnamese word vector distribution: the word vectors are regarded as probability distributions, and the distance between the distributions serves as the lexicon-level criterion; unsupervised training without any seed dictionary finds the minimum EMD between the Chinese and Vietnamese word vector distributions, and a Chinese-Vietnamese bilingual dictionary is obtained;
The bilingual dictionary is trained with the method proposed by Zhang et al., and 50-dimensional word vectors are trained for Chinese and Vietnamese with word2vec. The 50-dimensional word vectors are trained with the CBOW model under default hyper-parameters, the vocabulary is restricted to nouns occurring no fewer than 1,000 times, and the experimental results are shown in Table 1.
TABLE 1 Number of Chinese-Vietnamese dictionary entries generated based on EMD minimization
[Table 1 is provided as an image in the original publication and is not recoverable here.]
Step4, obtaining Chinese-Vietnamese bilingual word embeddings: on the basis of Step2 and Step3, the unsupervised EMD-minimized bilingual dictionary is used as a seed dictionary to guide the learning of bilingual word embeddings with a self-learning model, generating Chinese-Vietnamese bilingual word embeddings;
In Step4, word embedding mapping is performed: let the word embedding matrices of Chinese and Vietnamese be X and Y respectively, with X_{i*} the vector of the i-th source-language word and Y_{j*} the vector of the j-th target-language word. The dictionary D is a binary matrix, with D_{ij}=1 when the i-th source-language word is aligned with the j-th target-language word. The goal of word mapping is to find a mapping matrix W such that the mapped X_{i*}W is closest in Euclidean distance to Y_{j*}, i.e.
W^{*} = \arg\min_{W} \sum_{i} \sum_{j} D_{ij} \lVert X_{i*}W - Y_{j*} \rVert^{2}
After normalizing and mean-centering the matrices X and Y and constraining W to be an orthogonal matrix, the above Euclidean-distance problem is equivalent to maximizing the dot product:
W^{*} = \arg\max_{W} \operatorname{tr}\left(X W Y^{T} D^{T}\right)
where tr denotes the trace operation of the matrix. The optimal solution W^{*} = U V^{T} (U and V are two orthogonal matrices) is obtained by the singular value decomposition X^{T} D Y = U \Sigma V^{T}; given that the matrix D is sparse, a solution is obtained in linear time;
dictionary self-learning: according to nearest-neighbor retrieval, each source-language word is assigned the closest target-language word, the aligned word pairs are added to the dictionary, and the process iterates until convergence.
And Step5, on the basis of Step4, the bilingual word vectors are applied in the shared-encoder unsupervised neural machine translation model, and a Chinese-Vietnamese unsupervised neural machine translation model fusing the EMD-minimized bilingual dictionary is obtained by training.
Further, in Step 5:
the experiment is mainly divided into the following five parts: unsupervised baseline model translation on Chinese-cross, UNMT fusing EMD minimized bilingual dictionary, adding 1 ten thousand and 10 ten thousand parallel corpuses on the basis of the method model, and directly using 1 ten thousand and 10 ten thousand parallel corpuses to train on GNMT and Transform with supervised models.
Unsupervised model training: the translation system is trained only on monolingual corpora. The first benchmark experiment applies the baseline model to train a Chinese-Vietnamese unsupervised translation model. The second is the method herein: Chinese-Vietnamese UNMT fusing the EMD-minimized bilingual dictionary on top of the baseline.
Semi-supervised model training: in most cases, the languages under study have a small amount of parallel corpora that can be used to improve model performance, but the corpus scale is not large enough to train a complete conventional NMT system directly. Therefore, in addition to the monolingual corpora, a small amount of parallel data is added: experiments are performed with 10,000 and 100,000 parallel sentence pairs on top of the method presented herein.
Supervised model training: conventional supervised neural machine translation models are trained with the same 10,000 and 100,000 parallel sentence pairs added in the semi-supervised experiments above, for comparison with the semi-supervised experiments.
TABLE 2 Comparison of Chinese-Vietnamese machine translation experiments under different methods
[Table 2 is provided as an image in the original publication and is not recoverable here.]
As can be seen from comparing rows 1 and 2 of the experimental results in Table 2, fusing the unsupervised bilingual dictionary into the unsupervised model gains about 2.5 BLEU over the baseline system, indicating that the proposed model captures more cross-lingual information from the monolingual corpora, improves the quality of the bilingual word vectors, and thereby improves translation quality. From row 3, the semi-supervised system with 10,000 added parallel sentence pairs reaches 10.02 BLEU for Chinese->Vietnamese and 13.91 BLEU for Vietnamese->Chinese; comparison among rows 5, 6, 7 and 8 shows that, relative to models trained directly on the parallel corpora, the method provided herein achieves a better effect. From comparing rows 4 and 8, when 100,000 parallel sentence pairs are added, both the Chinese->Vietnamese and Vietnamese->Chinese directions exceed the Transformer model.
TABLE 3 Example analysis of Chinese-Vietnamese unsupervised machine translation under different methods
[Table 3 is provided as an image in the original publication and is not recoverable here.]
From the experimental translations in Table 3, although the model still suffers from inaccurate translations caused by learning bias, the translation quality of the method is clearly improved compared with the baseline system.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (6)

1. The Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary, characterized in that:
the method comprises the following specific steps:
step1, corpus collection: crawling Chinese and Vietnamese monolingual corpora by using a web crawler;
step2, corpus preprocessing: on the basis of Step1, segmenting Chinese and Vietnamese monolingual sentences and marking parts of speech to obtain monolingual word vectors through training;
step3, unsupervised bilingual dictionary based on EMD minimization: on the basis of Step2, training an unsupervised Chinese-Vietnamese bilingual dictionary from the Chinese and Vietnamese monolingual word vectors by using an EMD (Earth Mover's Distance) minimization method;
step4, obtaining Chinese-Vietnamese bilingual word embeddings: on the basis of Step2 and Step3, using the unsupervised EMD-minimized bilingual dictionary as a seed dictionary to guide the learning of bilingual word embeddings, generating Chinese-Vietnamese bilingual word embeddings;
and Step5, on the basis of the Step4, applying the bilingual word vectors to an unsupervised neural machine translation model of the shared encoder, and training to obtain a Chinese-to-more unsupervised neural machine translation model fused with the EMD minimized bilingual dictionary.
2. The Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary according to claim 1, wherein the specific steps of Step2 are as follows:
step2, segmenting the Chinese and Vietnamese monolingual sentences and tagging parts of speech: the segmentation and part-of-speech tagging of the Chinese and Vietnamese monolingual corpora are performed with segmentation and tagging tools, and Chinese and Vietnamese monolingual word embeddings are obtained with a word vector training tool.
3. The Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary according to claim 1, wherein the specific steps of Step3 are as follows:
step3, the EMD minimization method is used between the Chinese word vector distribution and the Vietnamese word vector distribution: the word vectors are regarded as probability distributions, the distance between the distributions serves as the lexicon-level criterion, and unsupervised training, without any seed dictionary, finds the minimum EMD between the two distributions, yielding a Chinese-Vietnamese bilingual dictionary.
4. The Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary according to claim 1, wherein the specific steps of Step4 are as follows:
using the Chinese-Vietnamese bilingual dictionary obtained in Step3 as a seed dictionary; guiding the training of Chinese-Vietnamese word embeddings with a self-learning model; and obtaining Chinese-Vietnamese bilingual word embeddings by training.
5. The Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary according to claim 1, wherein in Step 5:
and embedding and applying the trained bilingual words fused with the EMD bilingual dictionary in the model of the shared encoder by using a shared encoder model, so as to realize word-level correspondence between the Chinese language and the more bilingual language and train an unsupervised neural machine translation model of the Chinese language.
6. The Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary according to claim 1, wherein in Step4, word embedding mapping is performed: let the word embedding matrices of Chinese and Vietnamese be X and Y respectively, with X_{i*} the vector of the i-th source-language word and Y_{j*} the vector of the j-th target-language word; the dictionary D is a binary matrix, with D_{ij}=1 when the i-th source-language word is aligned with the j-th target-language word; the goal of word mapping is to find a mapping matrix W such that the mapped X_{i*}W is closest in Euclidean distance to Y_{j*}, i.e.
W^{*} = \arg\min_{W} \sum_{i} \sum_{j} D_{ij} \lVert X_{i*}W - Y_{j*} \rVert^{2}
after normalizing and mean-centering the matrices X and Y and constraining W to be an orthogonal matrix, the above Euclidean-distance problem is equivalent to maximizing the dot product:
W^{*} = \arg\max_{W} \operatorname{tr}\left(X W Y^{T} D^{T}\right)
where tr denotes the trace operation of the matrix; the optimal solution W^{*} = U V^{T} (U, V are two orthogonal matrices) is obtained by the singular value decomposition X^{T} D Y = U \Sigma V^{T}; given that the matrix D is sparse, a solution is obtained in linear time;
the dictionary self-learns as follows: according to nearest-neighbor retrieval, each source-language word is assigned the closest target-language word, the aligned word pairs are added to the dictionary, and the process iterates until convergence.
CN202010096013.9A 2020-02-17 2020-02-17 Chinese-Vietnamese unsupervised neural machine translation method fusing EMD-minimized bilingual dictionary Active CN111753557B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010096013.9A CN111753557B (en) 2020-02-17 2020-02-17 Chinese-Vietnamese unsupervised neural machine translation method fusing EMD-minimized bilingual dictionary

Publications (2)

Publication Number Publication Date
CN111753557A true CN111753557A (en) 2020-10-09
CN111753557B CN111753557B (en) 2022-12-20

Family

ID=72673087

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010096013.9A Active CN111753557B (en) Chinese-Vietnamese unsupervised neural machine translation method fusing EMD-minimized bilingual dictionary

Country Status (1)

Country Link
CN (1) CN111753557B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102018110369A1 (en) * 2017-04-28 2018-10-31 Intel Corporation IMPROVEMENT OF AUTONOMOUS MACHINES BY CLOUD, ERROR CORRECTION AND PREDICTION
US20190251449A1 (en) * 2018-02-09 2019-08-15 Google Llc Learning longer-term dependencies in neural network using auxiliary losses
CN108897797A (en) * 2018-06-12 2018-11-27 腾讯科技(深圳)有限公司 Update training method, device, storage medium and the electronic equipment of dialog model
CN110334881A (en) * 2019-07-17 2019-10-15 深圳大学 A kind of Financial Time Series Forecasting method based on length memory network and depth data cleaning, device and server

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JASON LEE et al.: "Fully Character-Level Neural Machine Translation without Explicit Segmentation", Transactions of the Association for Computational Linguistics *
WANG Kun et al.: "Neural Machine Translation with Nearest-Neighbor Association Tendency", Computer Science *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287694A (en) * 2020-09-18 2021-01-29 昆明理工大学 Shared encoder-based Chinese-crossing unsupervised neural machine translation method
CN112633018A (en) * 2020-12-28 2021-04-09 内蒙古工业大学 Mongolian Chinese neural machine translation method based on data enhancement
CN112836527A (en) * 2021-01-31 2021-05-25 云知声智能科技股份有限公司 Training method, system, equipment and storage medium of machine translation model
CN112836527B (en) * 2021-01-31 2023-11-21 云知声智能科技股份有限公司 Training method, system, equipment and storage medium of machine translation model
CN112926324B (en) * 2021-02-05 2022-07-29 昆明理工大学 Vietnamese event entity recognition method integrating dictionary and anti-migration
CN112926324A (en) * 2021-02-05 2021-06-08 昆明理工大学 Vietnamese event entity recognition method integrating dictionary and anti-migration
CN113076398A (en) * 2021-03-30 2021-07-06 昆明理工大学 Cross-language information retrieval method based on bilingual dictionary mapping guidance
CN113343672B (en) * 2021-06-21 2022-12-16 哈尔滨工业大学 Unsupervised bilingual dictionary construction method based on corpus merging
CN113343672A (en) * 2021-06-21 2021-09-03 哈尔滨工业大学 Unsupervised bilingual dictionary construction method based on corpus merging
CN113591496A (en) * 2021-07-15 2021-11-02 清华大学 Bilingual word alignment method and system
CN113591490A (en) * 2021-07-29 2021-11-02 北京有竹居网络技术有限公司 Information processing method and device and electronic equipment
CN113591490B (en) * 2021-07-29 2023-05-26 北京有竹居网络技术有限公司 Information processing method and device and electronic equipment
CN114595688A (en) * 2022-01-06 2022-06-07 昆明理工大学 Chinese cross-language word embedding method fusing word cluster constraint
CN114595688B (en) * 2022-01-06 2023-03-10 昆明理工大学 Chinese cross-language word embedding method fusing word cluster constraint
CN114078468A (en) * 2022-01-19 2022-02-22 广州小鹏汽车科技有限公司 Voice multi-language recognition method, device, terminal and storage medium
CN114078468B (en) * 2022-01-19 2022-05-13 广州小鹏汽车科技有限公司 Voice multi-language recognition method, device, terminal and storage medium
CN114492476A (en) * 2022-01-30 2022-05-13 天津大学 Language code conversion vocabulary overlapping enhancement method for unsupervised neural machine translation
CN115618885A (en) * 2022-09-22 2023-01-17 无锡捷通数智科技有限公司 Statement translation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111753557B (en) 2022-12-20

Similar Documents

Publication Publication Date Title
CN111753557B (en) Chinese-Vietnamese unsupervised neural machine translation method fusing EMD-minimized bilingual dictionary
CN109582789B (en) Text multi-label classification method based on semantic unit information
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
Lee et al. Learning dense representations of phrases at scale
US20210390271A1 (en) Neural machine translation systems
CN109635124B (en) Remote supervision relation extraction method combined with background knowledge
Gu et al. Unpaired image captioning by language pivoting
Liang et al. An end-to-end discriminative approach to machine translation
CN109933808B (en) Neural machine translation method based on dynamic configuration decoding
CN105068998A (en) Translation method and translation device based on neural network model
WO2022020467A1 (en) System and method for training multilingual machine translation evaluation models
CN112926344A (en) Word vector replacement data enhancement-based machine translation model training method and device, electronic equipment and storage medium
CN116663578A (en) Neural machine translation method based on strategy gradient method improvement
Yassine et al. Leveraging subword embeddings for multinational address parsing
CN112287694A (en) Shared encoder-based Chinese-crossing unsupervised neural machine translation method
Ahmadnia et al. Neural machine translation advised by statistical machine translation: The case of farsi-spanish bilingually low-resource scenario
Espla-Gomis et al. Using machine translation to provide target-language edit hints in computer aided translation based on translation memories
US11586833B2 (en) System and method for bi-directional translation using sum-product networks
Zhu et al. Code-Switching Can be Better Aligners: Advancing Cross-Lingual SLU through Representation-Level and Prediction-Level Alignment
Meyer et al. Subword segmental machine translation: Unifying segmentation and target sentence generation
CN110321568B (en) Chinese-Yue convolution neural machine translation method based on fusion of part of speech and position information
Farajian et al. FBK’s neural machine translation systems for IWSLT 2016
CN113591493B (en) Translation model training method and translation model device
Narzary et al. Attention based English-Bodo neural machine translation system for tourism domain
Adjeisah et al. Twi corpus: a massively Twi-to-handful languages parallel bible corpus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant