CN110598221B - Method for improving Mongolian-Chinese translation quality by constructing a Mongolian-Chinese parallel corpus using a generative adversarial network - Google Patents

Method for improving Mongolian-Chinese translation quality by constructing a Mongolian-Chinese parallel corpus using a generative adversarial network

Info

Publication number: CN110598221B
Authority: CN (China)
Prior art keywords: Mongolian, sentence, layer, Chinese, capsule
Legal status: Active
Application number: CN201910807617.7A
Other languages: Chinese (zh)
Other versions: CN110598221A
Inventors: 苏依拉, 孙晓骞, 王宇飞, 赵亚平, 张振, 高芬, 贺玉玺, 王昊
Current Assignee: Inner Mongolia University of Technology
Original Assignee: Inner Mongolia University of Technology
Application filed by Inner Mongolia University of Technology; priority to CN201910807617.7A, granted as CN110598221B

Classifications

    • G Physics → G06 Computing; Calculating or Counting → G06N Computing arrangements based on specific computational models → G06N3/00 Computing arrangements based on biological models → G06N3/02 Neural networks
        • G06N3/04 Architecture, e.g. interconnection topology
            • G06N3/044 Recurrent networks, e.g. Hopfield networks
            • G06N3/045 Combinations of networks
        • G06N3/08 Learning methods


Abstract

A method for improving Mongolian-Chinese translation quality constructs a Mongolian-Chinese parallel corpus with a generative adversarial network comprising a generator and a discriminator. The generator uses a hybrid encoder to encode the source-language Mongolian sentence into a vector representation, and a decoder based on a bidirectional Transformer, combined with a sparse attention mechanism, converts that representation into the target-language Chinese sentence, thereby generating Chinese sentences closer to human translation and more Mongolian-Chinese parallel corpus data. The method addresses the severe shortage of Mongolian-Chinese parallel data and the inability of NMT to guarantee the naturalness, adequacy, and accuracy of translation results.

Description

Method for improving Mongolian-Chinese translation quality by constructing a Mongolian-Chinese parallel corpus using a generative adversarial network
Technical Field
The invention belongs to the technical field of machine translation, and particularly relates to a method for improving Mongolian-Chinese translation quality by constructing a Mongolian-Chinese parallel corpus using a generative adversarial network.
Background
Machine translation, which uses a computer to automatically translate one language into another, is one of the most powerful means of overcoming language barriers. In recent years, large search and service companies such as Google and Baidu have conducted extensive research on machine translation and contributed substantially to obtaining high-quality machine translations, so that translation between major languages has approached human level, and millions of people communicate across language barriers using online translation systems and mobile applications. In the recent wave of deep learning, machine translation has become an important component in promoting global communication.
The Seq2Seq-based neural machine translation (NMT) framework consists of an encoder, which reads an input sequence and outputs a vector, and a decoder, which reads the vector and produces an output sequence. Since 2013 the framework has evolved rapidly, achieving significant improvements in translation quality over statistical machine translation. Sentence-level maximum likelihood estimation, the gating units of LSTM and GRU, and attention mechanisms have improved NMT's ability to translate long sentences. Ashish Vaswani et al. proposed the Transformer architecture in 2017, which relies entirely on attention to draw global dependencies between inputs and outputs. It enables parallel computation, effectively reduces model training time, and improves machine translation quality to a certain extent, avoiding defects of RNNs and their derivative networks such as slowness and lack of parallelism.
At present, neural machine translation has been successful, but even the best NMT systems fall far short of human expectations, and translation quality still needs to improve. NMT usually trains the model by maximum likelihood estimation, i.e., maximizing the probability of the true target sentence conditioned on the source sentence. The model thus generates the best candidate word for the current time step, but in the long run the translation of the entire sentence may not be the best translation, which leaves a hidden danger for NMT; even the powerful Transformer is no exception. Compared with real human translation, this objective cannot guarantee the naturalness, adequacy, and accuracy of the translation result.
In addition, translation between major languages is relatively mature, but machine translation for low-resource languages faces many challenges, above all a severe shortage of corpora, since constructing parallel corpora manually is very expensive; the translation quality is therefore still unsatisfactory.
Disclosure of Invention
To overcome the defects of the prior art, the invention aims to provide a method for improving Mongolian-Chinese translation quality by constructing a Mongolian-Chinese parallel corpus with a generative adversarial network. It mainly addresses the severe shortage of Mongolian-Chinese parallel data and NMT's inability to guarantee the naturalness, adequacy, and accuracy of translation results, by applying a generative adversarial network to Mongolian-Chinese neural machine translation.
To achieve this purpose, the invention adopts the following technical scheme:
the method for improving the translation quality of the Mongolian Chinese by constructing the Mongolian Chinese parallel corpus through the generation countermeasure network is characterized in that the generation countermeasure network is used in the Mongolian Chinese machine translation to relieve the problem of low translation quality of the Mongolian Chinese machine caused by the shortage of the Mongolian Chinese parallel corpus and minimize the difference between human translation and translation given by an NMT model, the generation countermeasure network mainly comprises a generator and a discriminator, and the generator can effectively utilize the Mongolian Chinese monolingual data to relieve the problem of the shortage of the Mongolian Chinese parallel corpus in a machine translation task. In the generator, in order to relieve the UNK phenomenon in Mongolian Chinese machine translation, a hybrid encoder is used for encoding Mongolian of a source language sentence into vector representation, a bidirectional Transformer-based decoder is combined with a sparse attention mechanism to convert the vector representation into Chinese of a target language sentence, and therefore Mongolian sentences closer to human translation and more Mongolian parallel linguistic data are generated, and the quality and the efficiency of Mongolian Chinese machine translation are improved. In the discriminator, the difference between the Chinese sentence generated by the generator and the human translation is judged, the generator mainly aims to generate Mongolian sentences which are closer to the human translation and effectively generate more Mongolian parallel linguistic data by using Mongolian single-language data, and the discriminator aims to calculate the difference between the Chinese sentence generated by the generator and the Chinese sentence translated by the human translation. And (3) performing countertraining on the generator and the discriminator until the discriminator considers that the Chinese sentence generated by the generator is very similar to the human translation, namely the generator and the discriminator realize Nash balance, obtaining a high-quality Mongolian Chinese machine translation system and a large number of Mongolian Chinese parallel data sets, and performing Mongolian Chinese translation by using the Mongolian Chinese machine translation system.
The hybrid encoder is composed of a sentence encoder and a word encoder. To capture semantic information between sentences while keeping the encoder efficient, the sentence encoder is a bidirectional Transformer and the word encoder uses a bidirectional LSTM, which improves the word encoder's efficiency while preserving encoding quality. The bidirectional Transformer is the optimized Transformer1: first, a gated linear unit is added to the original Transformer to effectively extract important information from the source-language sentence and discard redundant information; second, a branch structure is added to effectively capture diverse semantic information in source-language sentences; finally, a capsule network is added on the branch structure and after the third layer normalization, so the encoder can capture the exact position of a word in the source-language sentence, further enhancing encoder accuracy and encoding quality. In the decoder, the bidirectional Transformer is the optimized Transformer2: a branch structure is first added to the original Transformer; second, a capsule network is added; finally, a Swish activation function is added to effectively improve the decoder's decoding accuracy.
The word encoder and the sentence encoder encode the source-language sentence in sequence, and the results are then fused through a fusion function to obtain a vector representation with context information. The word encoder represents each word in vector form, constructing a vector representation of the Mongolian sentence with the word as the basic unit; the model formula is:

h1_i = \Phi(h1_{i-1}, W_i)

where \Phi is the activation function, W_i is the weight, and h1_{i-1} is the hidden-layer state of the (i-1)-th word.

The sentence encoder represents the whole Mongolian sentence in vector form, constructing a vector representation with the sentence as the basic unit; the model formula is:

h2_i = \sum_j \hat{\alpha}_{i,j} v_j

where v_j is the value of the j-th word and the normalized attention weight \hat{\alpha}_{i,j} is calculated as:

\hat{\alpha}_{i,j} = \exp(\alpha_{i,j}) / \sum_k \exp(\alpha_{i,k})

where \alpha_{i,j} is calculated as:

\alpha_{i,j} = (q_i \cdot k_j) / \sqrt{d}

where q_i is the query of the i-th word, k_j is the key of the j-th word, \cdot denotes the dot-product operation, and d is the dimension of q and k.

The fusion function is:

\psi(h1_i, h2_i) = a_1 h1_i + a_2 h2_i

where \psi is the fusion function and a_1, a_2 are randomly initialized corresponding weights; through the two encodings, the encoder fuses vector information containing both sentences and words.
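As an illustration of this hybrid encoding, the following is a minimal PyTorch sketch, assuming a bidirectional LSTM word encoder (h1), non-causal Transformer encoder layers as the sentence encoder (h2), and the fusion \psi(h1, h2) = a_1 h1 + a_2 h2 with randomly initialized weights. All module names and sizes are illustrative; the patent's optimized Transformer1 with gated linear units, branches, and capsule networks is not reproduced here.

import torch
import torch.nn as nn

class HybridEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # word encoder: bidirectional LSTM producing h1
        self.word_enc = nn.LSTM(d_model, d_model // 2,
                                bidirectional=True, batch_first=True)
        # sentence encoder: bidirectional (non-causal) Transformer producing h2
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.sent_enc = nn.TransformerEncoder(layer, num_layers=6)
        # randomly initialized fusion weights a1, a2
        self.a = nn.Parameter(torch.rand(2))

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        x = self.embed(tokens)
        h1, _ = self.word_enc(x)                # word-level states, (batch, seq, d)
        h2 = self.sent_enc(x)                   # sentence-level states, (batch, seq, d)
        return self.a[0] * h1 + self.a[1] * h2  # psi(h1, h2) = a1*h1 + a2*h2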
In the sentence encoder, "bidirectional Transformer" means the whole text sequence is read at once, i.e., learning is based on both sides of the sentence rather than reading sequentially from left to right or right to left, so the contextual relationships between words in the text can be learned.
In the decoder, the bidirectional Transformer reads the vector representation of the source-language sentence at once, i.e., decoding is based on both sides of the whole sentence's vector representation, further improving the decoder's decoding accuracy.
To enhance its discrimination capability, the discriminator is a multi-scale discriminator, able to discriminate both the general sentence meaning and the detail information (such as phrases and words) of the Chinese sentences generated by the generator, assisting the generator to produce sentences closer to the real translation. Meanwhile, to overcome the translational invariance of convolutional neural networks, the multi-scale discriminator is implemented with a capsule network, which effectively improves discrimination without reducing training efficiency. Translational invariance here means, for example, that in face recognition a convolutional neural network considers anything with eyes, a mouth, and similar features to be a face, ignoring the exact positions of the facial features. If such a network were used as the discriminator in a generative adversarial network, its translational invariance would lead it to treat any generated Chinese sentence containing all the words of a human-translated sentence as a human translation, ignoring the words' position information and thus misjudging. The capsule network comprises a convolutional layer, a primary capsule layer, a convolutional capsule layer, and a fully connected capsule layer. To represent multiple discriminators with one network and improve training efficiency, the activation values of different sub-layers of the convolutional layer represent activations of sentences at different granularities: lower layers represent activations of words, higher layers represent activations of the whole sentence, and finally the feature maps of the different layers are transformed into feature maps of the same scale with channel number 1.
Given a sentence pair (x, y), each capsule network first constructs a 2D-image-like representation by concatenating the embedding vectors of the words in x and y, i.e., for the i-th word x_i in the source-language sentence x and the j-th word y_j in the target-language sentence y:

M^{(i,j)} = [x_i^T; y_j^T]

where x_i^T denotes the transpose of x_i, y_j^T denotes the transpose of y_j, and M^{(i,j)} is the matrix constructed from the i-th source-language word x_i and the j-th target-language word y_j, i.e., the virtual 2D image representation;
based on the virtual 2D image representation, the similarity between the sentence y' translated by the generator and the sentence y translated by the human under the condition of the source language sentence x is captured through the convolution layer, the main capsule layer, the convolution capsule layer and the fully-connected capsule layer of the capsule network in sequence.
The specific process by which the virtual 2D image representation passes in turn through the convolutional layer, primary capsule layer, convolutional capsule layer, and fully connected capsule layer of the capsule network is as follows:
(1) In the convolutional layer, a convolution with a 9 × 9 kernel and stride 1 is first performed, capturing the correspondence between sentences in x and y through the feature mapping:

c^{(1,f)}_{i,j} = f(W^{(1,f)} * M^{(i,j)} + b^{(1,f)})

where f is the activation function of the first convolution operation, W^{(1,f)} is the weight of the first convolution operation, M^{(i,j)} is the matrix formed from the i-th source-language word x_i and the j-th target-language word y_j, and b^{(1,f)} is the bias of the first convolution operation.

Then a convolution with stride 1 and a 3 × 3 kernel is performed, capturing the correspondence between words in x and y through the feature mapping:

c^{(2,f)}_{i,j} = f(W^{(2,f)} * M^{(i,j)} + b^{(1,f)})

where W^{(2,f)} is the weight of the second convolution operation, and f, M^{(i,j)}, and b^{(1,f)} are the same as in the first convolution operation.

The two convolution operations yield two feature maps of different sizes, c^{(1,f)} and c^{(2,f)}. The smaller of the two feature maps is then padded so both have the same size, and the final feature map is obtained by averaging the two equally sized feature maps:

c_{i,j} = (c^{(1,f)}_{i,j} + c^{(2,f)}_{i,j}) / 2
(2) Entering the primary capsule layer, the output of the convolutional layer is calculated as:

p = g(W_b M + b_1)

where g is the nonlinear squash function applied over the entire vector, M is the input to the capsule, b_1 is the capsule bias, and W_b is a weight; in the primary capsule layer, the capsule replaces the scalar output of the convolution operation with a vector output;

(3) a dynamic routing algorithm replaces max pooling, dynamically strengthening or weakening weights to obtain effective features (see the sketch after this list);

(4) entering the convolutional capsule layer: after this layer, all capsules in the previous layer become a series of capsules, and the weights are further dynamically strengthened or weakened through the dynamic routing algorithm, obtaining more effective features;

(5) entering the fully connected capsule layer, which connects all extracted features;

(6) all features are input into the multilayer perceptron, and the activation function is used to obtain the probability that the data pair (x, y') generated by the generator is real data (x, y), i.e., its degree of similarity to a real sentence.
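The squash nonlinearity g(·) of step (2) and the dynamic routing of steps (3)-(4) can be sketched as follows, in the standard formulation of Sabour et al. (2017); tensor shapes and the routing-iteration count are illustrative assumptions rather than the patent's exact configuration.

import torch

def squash(v, dim=-1, eps=1e-8):
    # g(v) = (|v|^2 / (1 + |v|^2)) * v / |v|
    sq = (v ** 2).sum(dim=dim, keepdim=True)
    return (sq / (1.0 + sq)) * v / torch.sqrt(sq + eps)

def dynamic_routing(u_hat, n_iters=3):
    # u_hat: (n_in, n_out, d_out) prediction vectors from the lower-level capsules
    b = torch.zeros(u_hat.shape[0], u_hat.shape[1])  # routing logits
    for _ in range(n_iters):
        c = torch.softmax(b, dim=1)                  # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=0)     # weighted sum per output capsule
        v = squash(s)                                # (n_out, d_out) output capsules
        b = b + (u_hat * v.unsqueeze(0)).sum(-1)     # strengthen agreeing routes
    return v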
The final training objective of the generative adversarial network is:

\min_G \max_D V(D, G) = E_{(x,y) \sim P_{data}(x,y)}[\log D(x,y)] + E_{y' \sim G(y'|x)}[\log(1 - D(x,y'))]

where G denotes the generator, D denotes the discriminator, and V(D, G) denotes the loss function of generator G and discriminator D; \min_G \max_D means that D maximizes the loss function V(D, G) while G minimizes it; E denotes expectation; P_{data}(x,y) denotes the distribution of the parallel corpus, and D(x, y) the probability that the discriminator considers the pair to be human-translated when the source sentence x and target sentence y of the parallel corpus are input to discriminator D; G(y'|x) denotes the generator's distribution over target sentences y' generated from the source sentence x, and D(x, y') the probability that the discriminator considers (x, y') to be human-translated; x denotes the Mongolian sentence in the parallel corpus, i.e., the source-language sentence; y denotes the Chinese sentence in the parallel corpus, i.e., the human translation result; and y' denotes the Chinese sentence generated by the generator, i.e., the generator's translation result.
In the process:

the additional components of the lattice in the Mongolian corpus are processed, by removing the control symbols in Mongolian sentences together with the additional components of the lattice, leaving only the stem parts;

the Mongolian corpus is segmented at different granularities and the Chinese is word-segmented, which alleviates the UNK phenomenon in Mongolian-Chinese machine translation; processing the additional components of the Mongolian lattice further improves Mongolian-Chinese machine translation quality.
The segmentation method, sketched in code after these steps, is:

(1) the corpus to be preprocessed is first cut into the smallest units, which for Mongolian are the Mongolian letters;

(2) then the occurrences of all adjacent smallest-unit combinations in the corpus are counted and ranked, the most frequent combination is found and added to the dictionary, while the lowest-frequency word in the dictionary is deleted, keeping the dictionary size unchanged;

(3) steps (1) and (2) are repeated until the frequency of the dictionary's words in the corpus is higher than a set value.
neural Machine Translation (NMT) achieves better translation in an end-to-end framework, but the best NMT systems have a larger gap from human expectations. Compared with the method, the method can minimize the difference between human translation and the translation given by the NMT model, relieve the data sparsity problem in Mongolian Chinese machine translation and relieve the UNK phenomenon in the Mongolian Chinese machine translation, thereby not only obtaining a high-quality Mongolian Chinese machine translation system, but also obtaining a large number of Mongolian Chinese parallel data sets.
Drawings
FIG. 1 is an optimized Transformer1 architecture.
FIG. 2 is an optimized Transformer2 architecture.
Fig. 3 is a generator frame.
Fig. 4 is a discriminator framework.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples.
The invention, a method for improving Mongolian-Chinese translation quality by constructing a Mongolian-Chinese parallel corpus with a generative adversarial network, mainly comprises constructing the encoder and decoder and constructing the discriminator model.
FIG. 1 shows the optimized Transformer1 architecture. First, a gated linear unit is added to the original Transformer to effectively acquire important information in the source-language sentence and discard redundant information; second, a branch structure is added, which can effectively capture diverse semantic information in source-language sentences; finally, a capsule network is added on the branch structure and after the third layer normalization, so the encoder can capture the exact position of a word in the source-language sentence, further enhancing encoder accuracy.
FIG. 2 shows the optimized Transformer2 architecture. A branch structure is added to the original Transformer; second, a capsule network is added; finally, a Swish activation function is added, which effectively improves the decoder's decoding accuracy.
Fig. 3 shows the generator framework, composed mainly of three parts: a hybrid encoder, sparse attention, and a decoder. The encoder accepts an input Mongolian sentence (rendered in Mongolian script in the original figure): the whole sentence is first encoded bidirectionally by the bidirectional optimized Transformer while a bidirectional-LSTM-based encoder encodes the words in the sentence; the two encoders' representations are then fused with the fusion function to produce the source-language encoded representation. The decoder, combined with a sparse attention mechanism, then decodes the source-language encoded representation into the target-language Chinese sentence meaning "it will rain tomorrow".
Fig. 4 shows the capsule network framework, including the convolutional layer, primary capsule layer, convolutional capsule layer, fully connected capsule layer, etc. The convolutional layer comprises two sub-layers, one capturing sentence-level features and the other word-level features, realizing the multi-scale discrimination function.
Alleviating the data sparsity problem in Mongolian-Chinese machine translation:
The generator in the generative adversarial network can effectively address the existing data sparsity in Mongolian-Chinese machine translation. Specifically, the generator is pre-trained on the aligned Mongolian-Chinese corpus to obtain a pre-trained model; with the model's help, Mongolian monolingual data is used to generate Mongolian-Chinese pseudo-bilingual data, and then, with the discriminator's help, Chinese sentences closer to human translation are generated to form an aligned Mongolian-Chinese corpus.
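A sketch of this pseudo-parallel construction, under stated assumptions: generator.translate and discriminator.score are hypothetical interfaces, and the filtering threshold is an illustrative choice, since the patent describes the discriminator assisting selection without fixing a cutoff.

def build_pseudo_corpus(generator, discriminator, mongolian_monolingual,
                        keep_threshold=0.5):
    # Translate monolingual Mongolian and keep pairs the discriminator
    # scores as close to human translation.
    pseudo_pairs = []
    for mn in mongolian_monolingual:
        zh = generator.translate(mn)          # pseudo target sentence y'
        score = discriminator.score(mn, zh)   # similarity to human translation
        if score >= keep_threshold:
            pseudo_pairs.append((mn, zh))
    return pseudo_pairs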
Improving the accuracy and naturalness of Mongolian-Chinese machine translation:
The Chinese sentences generated by the generator are often stiff and unnatural. In the invention, the discriminator acts as a teacher for the generator, assisting it to generate more natural and accurate Chinese sentences. The multi-scale discriminator means this teacher can judge, from multiple aspects, whether a sentence generated by the generator resembles the human translation.
Decision variables: the Mongolian sentence x is input at the encoder end of the generator, and the corresponding machine-translated Chinese sentence y' is output at the decoder end of the generator. The Mongolian sentence x, the corresponding human-translated Chinese sentence y, and the corresponding generator-translated Chinese sentence y' are input at the input of the discriminator.
The invention comprises the following parts:
1. The Mongolian-Chinese neural machine translation system model based on the generative adversarial network comprises the following parts:
A. Description of the hybrid encoder in the generator of the GAN-based Mongolian-Chinese neural machine translation system: the hybrid encoder is formed by fusing a sentence encoder and a word encoder, which encode the source-language sentence in sequence. The word encoder represents each word in vector form, constructing a vector representation of the Mongolian sentence with the word as the basic unit; the model formula is:

h1_i = \Phi(h1_{i-1}, W_i)

where \Phi is the activation function, W_i is the weight, and h1_{i-1} is the hidden-layer state of the (i-1)-th word.

The sentence encoder represents the entire Mongolian sentence in vector form, constructing a vector representation with the sentence as the basic unit. In the sentence encoder, the bidirectional optimized Transformer1 reads the whole text sequence at once, i.e., it learns from both sides of the sentence, so the contextual relationships within sentences can be learned and parallelization realized. The model formula is:

h2_i = \sum_j \hat{\alpha}_{i,j} v_j

where v_j is the value of the j-th word and \hat{\alpha}_{i,j} is calculated as:

\hat{\alpha}_{i,j} = \exp(\alpha_{i,j}) / \sum_k \exp(\alpha_{i,k})

where \alpha_{i,j} is calculated as:

\alpha_{i,j} = (q_i \cdot k_j) / \sqrt{d}

where q_i is the query of the i-th word, k_j is the key of the j-th word, \cdot denotes the dot-product operation, and d is the dimension of q and k.

Finally, fusion through the fusion function yields a vector representation with context information. The fusion function is:

\psi(h1_i, h2_i) = a_1 h1_i + a_2 h2_i

where \psi is the fusion function and a_1, a_2 are randomly initialized corresponding weights; through the two granularities of encoding, the encoder fuses vector information containing both sentences and words.
B. Description of the decoder in the generator of the GAN-based Mongolian-Chinese neural machine translation system: the decoder consists of a bidirectional optimized Transformer2, which is basically similar to the optimized Transformer1 in the encoder, except that a Swish activation function is added. During decoding, the decoder combines a sparse attention mechanism to decode the vector representation of the source-language sentence into the target-language sentence.
C. Description of the discriminator of the GAN-based Mongolian-Chinese neural machine translation system: the discriminator is a multi-scale discriminator, which can discriminate not only the general sentence meaning of the Chinese sentences generated by the generator but also their detail information (e.g., phrases and words), assisting the generator to generate sentences closer to the real translation. The multi-scale discriminator is implemented with a capsule network comprising a convolutional layer, a primary capsule layer, a convolutional capsule layer, and a fully connected capsule layer. To represent multiple discriminators with one network and improve training efficiency, the activation values of different sub-layers of the convolutional layer represent activations of sentences at different granularities: lower layers represent word activations, higher layers represent whole-sentence activations, and finally the feature maps of different layers are transformed into feature maps of the same scale with channel number 1. Specifically, given a sentence pair (x, y), each capsule network first constructs a 2D-image-like representation by concatenating the embedding vectors of the words in x and y, i.e., for the i-th word x_i in the source-language sentence x and the j-th word y_j in the target-language sentence y:

M^{(i,j)} = [x_i^T; y_j^T]

where x_i^T denotes the transpose of x_i, y_j^T denotes the transpose of y_j, and M^{(i,j)} is the matrix constructed from the i-th source-language word x_i and the j-th target-language word y_j, i.e., the virtual 2D image representation.
Based on this virtual 2D image representation, the similarity between the sentence y' translated by the generator and the human-translated sentence y, conditioned on the source-language sentence x, is captured sequentially through the convolutional layer, primary capsule layer, convolutional capsule layer, and fully connected capsule layer of the capsule network. The specific process is as follows:
(1) In the convolutional layer, a convolution with a 9 × 9 kernel and stride 1 is first performed, capturing the correspondence between sentences in x and y through the feature mapping:

c^{(1,f)}_{i,j} = f(W^{(1,f)} * M^{(i,j)} + b^{(1,f)})

where f is the activation function of the first convolution operation, W^{(1,f)} is the weight of the first convolution operation, M^{(i,j)} is the matrix formed from the i-th source-language word x_i and the j-th target-language word y_j, and b^{(1,f)} is the bias of the first convolution operation.

Then a convolution with stride 1 and a 3 × 3 kernel is performed, capturing the correspondence between words in x and y through the feature mapping:

c^{(2,f)}_{i,j} = f(W^{(2,f)} * M^{(i,j)} + b^{(1,f)})

where W^{(2,f)} is the weight of the second convolution operation, and f, M^{(i,j)}, and b^{(1,f)} are the same as in the first convolution operation.

The two convolution operations yield two feature maps of different sizes, c^{(1,f)} and c^{(2,f)}. The smaller of the two feature maps is then padded so both have the same size, and the final feature map is obtained by averaging the two equally sized feature maps:

c_{i,j} = (c^{(1,f)}_{i,j} + c^{(2,f)}_{i,j}) / 2
(2) The output of the convolutional layer is calculated as:

p = g(W_b M + b_1)

where g is the nonlinear squash function applied over the entire vector, M is the input to the capsule, b_1 is the capsule bias, and W_b is a weight. This is the primary capsule layer, in which the capsule replaces the scalar output of the convolution operation with a vector output.
(3) A dynamic routing algorithm replaces max pooling, dynamically strengthening or weakening weights to obtain effective features.

(4) Entering the convolutional capsule layer: after this layer, all capsules in the previous layer become a series of capsules, and the weights are further dynamically strengthened or weakened through the dynamic routing algorithm, obtaining more effective features.

(5) Entering the fully connected capsule layer, which connects all extracted features.

(6) All features are input into the multilayer perceptron, and the activation function is used to obtain the probability that the data pair (x, y') generated by the generator is the true data (x, y).
2. An optimized Mongolian-Chinese machine translation model, comprising the following parts:

A. BPE processing of Mongolian

Mongolian is a purely phonetic (alphabetic) script, no different in its spelling principle from the major alphabetic scripts of Western Europe and the world. The BPE technique is an algorithm that segments phonetic script by counting the frequency of adjacent characters: consecutive characters with high co-occurrence frequency are treated as one combination. Since the various roots and affixes of Mongolian are generally high-frequency Mongolian character combinations, the BPE algorithm is applied to Mongolian segmentation. The specific algorithm is described as follows:
(1) The corpus to be preprocessed is first cut into the smallest units, which for Mongolian are the Mongolian letters.

(2) Then the occurrences of all adjacent smallest-unit combinations in the corpus are counted and ranked, the most frequent combination is found and added to the dictionary, while the lowest-frequency word in the dictionary is deleted, keeping the dictionary size unchanged.

(3) Steps (1) and (2) are repeated until the frequency of the dictionary's words in the corpus is higher than a set value.
B. Removing the additional components of the lattice in Mongolian

In the Mongolian corpus, the components between Mongolian spaces and ordinary spaces are labeled as additional components of the lattice (case suffixes). These components are needed between Mongolian words, but they carry only grammatical meaning and no semantics; adding the lattice's additional components makes a Mongolian sentence fluent. In Mongolian-Chinese machine translation, if these components are not processed, the machine translation model treats Mongolian spaces as ordinary spaces, so a single Mongolian word is easily split in the middle and recognized as two or even several words. This significantly increases Mongolian sentence length, which hurts translation quality and the final BLEU evaluation. The invention therefore removes the control symbols from Mongolian sentences together with the lattice's additional components, leaving only the stem part.
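A sketch of this cleanup step: in common encodings of traditional Mongolian, the suffix is attached to its stem with the narrow no-break space U+202F (the "Mongolian space"), and U+180B-U+180E are Mongolian control characters (free variation selectors and the vowel separator). Treating exactly these code points as the markers is an assumption about the corpus encoding.

import re

CONTROL = re.compile(r"[\u180b-\u180e]")   # FVS1-FVS3, Mongolian vowel separator
SUFFIX = re.compile(r"\u202f\S+")          # NNBSP plus the attached case suffix

def strip_lattice_suffixes(sentence: str) -> str:
    sentence = CONTROL.sub("", sentence)   # drop control symbols
    sentence = SUFFIX.sub("", sentence)    # drop case suffixes, keep the stems
    return sentence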
C. Segmenting Chinese words

Chinese belongs to the Sino-Tibetan language family; each sentence consists only of individual characters and punctuation marks, so a computer can only treat the whole sentence as one unit, which is inconvenient for computation and processing. The Chinese corpus therefore needs to be segmented into words before training the Mongolian-Chinese machine translation model.
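For the Chinese side, a common choice is the jieba segmenter; the patent does not name a specific tool, so this is an illustrative sketch.

import jieba

def segment_chinese(sentence: str) -> str:
    # e.g. "明天下雨" -> "明天 下雨": insert spaces so words become model units
    return " ".join(jieba.cut(sentence))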
The whole process of the invention is as follows:

(1) Build the hybrid encoder of the generator in the generative adversarial network.
(2) Build the decoder of the generator in the generative adversarial network.
(3) Build the discriminator in the generative adversarial network.
(4) Process the lattice components of the Mongolian corpus.
(5) Segment the Mongolian corpus at different granularities.
(6) Segment the Chinese into words.
(7) Train the generator.
(8) Generate negative data with the trained generator model.
(9) Train the discriminator.
(10) Perform adversarial training.
(11) Test the BLEU value of the resulting Mongolian-Chinese machine translation model (a scoring sketch follows below).
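Step (11) might be scored as in the following sketch, using sacreBLEU as an assumed (not patent-specified) tool; model.translate is a hypothetical interface.

import sacrebleu

def evaluate_bleu(model, test_mongolian, reference_chinese):
    # reference_chinese: one human Chinese reference per test sentence
    hypotheses = [model.translate(mn) for mn in test_mongolian]
    bleu = sacrebleu.corpus_bleu(hypotheses, [reference_chinese])
    return bleu.score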

Claims (9)

1. A method for improving Mongolian-Chinese translation quality by constructing a Mongolian-Chinese parallel corpus with a generative adversarial network, the network mainly comprising a generator and a discriminator, wherein in the generator a hybrid encoder encodes the source-language Mongolian sentence into a vector representation and a decoder based on a bidirectional Transformer, combined with a sparse attention mechanism, converts the vector representation into the target-language Chinese sentence, thereby generating Chinese sentences close to human translation and Mongolian-Chinese parallel corpus data; in the discriminator the difference between the Chinese sentences generated by the generator and the human translation is judged; the generator and the discriminator are trained adversarially until the discriminator judges the Chinese sentences generated by the generator to be very similar to the human translation, i.e., the generator and discriminator reach a Nash equilibrium, obtaining a Mongolian-Chinese machine translation system and a Mongolian-Chinese parallel data set, the Mongolian-Chinese machine translation system being used for Mongolian-Chinese translation; characterized in that the discriminator is a multi-scale discriminator able to discriminate the general sentence meaning and the detail information of the Chinese sentences generated by the generator, assisting the generator to generate sentences close to the real translation; the multi-scale discriminator is implemented with a capsule network comprising a convolutional layer, a primary capsule layer, a convolutional capsule layer, and a fully connected capsule layer; in the convolutional layer, the activation values of different sub-layers represent activations of sentences at different granularities, lower layers representing word activations and higher layers representing whole-sentence activations, and finally the feature maps of different layers are transformed into feature maps of the same scale with channel number 1.
2. The method for improving Mongolian-Chinese translation quality by constructing a Mongolian-Chinese parallel corpus with a generative adversarial network according to claim 1, wherein the hybrid encoder comprises a sentence encoder and a word encoder, the sentence encoder comprising a bidirectional Transformer and the word encoder using a bidirectional LSTM, the bidirectional Transformer being the optimized Transformer1: first, a gated linear unit is added to the original Transformer to obtain important information in the source-language sentence and discard redundant information; second, a branch structure is added to capture diverse semantic information in source-language sentences, the branch structure comprising 2 capsule networks and 2 activation functions, the output of each capsule network being connected to 1 activation function; finally, a capsule network is added on the branch structure and after the third layer normalization, so the encoder can capture the exact position of a word in the source-language sentence; in the decoder, the bidirectional Transformer is the optimized Transformer2: a branch structure is first added to the original Transformer, comprising 2 multi-head attention mechanisms, 2 capsule networks, and 1 activation function, the 2 multi-head attention mechanisms being located between the first and second layer normalizations, and the 2 capsule networks and 1 activation function between the second and third layer normalizations, wherein the output of 1 capsule network is connected to the 1 activation function and 1 capsule network sits directly between the second and third layer normalizations; second, a capsule network connecting the output of the third layer normalization to the input of the fourth layer normalization is added; finally, the Swish activation function is added.
3. The method for improving Mongolian-Chinese translation quality by constructing a Mongolian-Chinese parallel corpus with a generative adversarial network according to claim 2, wherein the word encoder and the sentence encoder encode the source-language sentence in sequence, and the results are then fused through a fusion function to obtain a vector representation with context information, wherein the word encoder represents each word in vector form to construct the vector representation of the Mongolian sentence with the word as the basic unit, the model formula being:

h1_i = \Phi(h1_{i-1}, W_i)

where \Phi is the activation function, W_i is the weight, and h1_{i-1} is the hidden-layer state of the (i-1)-th word;

the sentence encoder represents the whole Mongolian sentence in vector form, constructing a vector representation with the sentence as the basic unit, the model formula being:

h2_i = \sum_j \hat{\alpha}_{i,j} v_j

where v_j is the value of the j-th word and \hat{\alpha}_{i,j} is calculated as:

\hat{\alpha}_{i,j} = \exp(\alpha_{i,j}) / \sum_k \exp(\alpha_{i,k})

where \alpha_{i,j} is calculated as:

\alpha_{i,j} = (q_i \cdot k_j) / \sqrt{d}

where q_i is the query of the i-th word, k_j is the key of the j-th word, \cdot denotes the dot-product operation, and d is the dimension of q and k;

the fusion function is:

\psi(h1_i, h2_i) = a_1 h1_i + a_2 h2_i

where \psi is the fusion function and a_1, a_2 are randomly initialized corresponding weights; through the two encodings, the encoder fuses vector information containing both sentences and words.
4. The method for improving Mongolian-Chinese translation quality by constructing a Mongolian-Chinese parallel corpus with a generative adversarial network according to claim 2, wherein in the sentence encoder the bidirectional Transformer reads the whole text sequence at once, i.e., learns from both sides of the sentence, so as to learn the contextual relationships between words in the text.
5. The method for improving Mongolian-Chinese translation quality by constructing a Mongolian-Chinese parallel corpus with a generative adversarial network according to claim 2, wherein in the decoder the bidirectional Transformer reads the vector representation of the source-language sentence at once, i.e., decodes based on both sides of the whole sentence's vector representation.
6. The method for improving Mongolian-Chinese translation quality by constructing a Mongolian-Chinese parallel corpus with a generative adversarial network according to claim 1, wherein given a sentence pair (x, y), each capsule network first constructs a 2D-image-like representation by concatenating the embedding vectors of the words in x and y, i.e., for the i-th word x_i in the source-language sentence x and the j-th word y_j in the target-language sentence y:

M^{(i,j)} = [x_i^T; y_j^T]

where x_i^T denotes the transpose of x_i, y_j^T denotes the transpose of y_j, and M^{(i,j)} is the matrix constructed from the i-th source-language word x_i and the j-th target-language word y_j, i.e., the virtual 2D image representation;

based on the virtual 2D image representation, the similarity between the sentence y' translated by the generator and the human-translated sentence y, conditioned on the source-language sentence x, is captured in turn through the convolutional layer, primary capsule layer, convolutional capsule layer, and fully connected capsule layer of the capsule network.
7. The method for improving Mongolian-Chinese translation quality by constructing a Mongolian-Chinese parallel corpus with a generative adversarial network according to claim 6, wherein the virtual 2D image representation passes in turn through the convolutional layer, primary capsule layer, convolutional capsule layer, and fully connected capsule layer of the capsule network as follows:

(1) in the convolutional layer, a convolution with stride 1 and a 9 × 9 kernel is first performed, capturing the correspondence between sentences in x and y through the feature mapping:

c^{(1,f)}_{i,j} = f(W^{(1,f)} * M^{(i,j)} + b^{(1,f)})

where f is the activation function of the first convolution operation, W^{(1,f)} is the weight of the first convolution operation, M^{(i,j)} is the matrix formed from the i-th source-language word x_i and the j-th target-language word y_j, and b^{(1,f)} is the bias of the first convolution operation;

then a convolution with stride 1 and a 3 × 3 kernel is performed, capturing the correspondence between words in x and y through the feature mapping:

c^{(2,f)}_{i,j} = f(W^{(2,f)} * M^{(i,j)} + b^{(1,f)})

where W^{(2,f)} is the weight of the second convolution operation, and f, M^{(i,j)}, and b^{(1,f)} are the same as in the first convolution operation;

the two convolution operations yield two feature maps of different sizes, c^{(1,f)} and c^{(2,f)}; the smaller of the two feature maps is then padded so both have the same size, and the final feature map is obtained by averaging the two equally sized feature maps:

c_{i,j} = (c^{(1,f)}_{i,j} + c^{(2,f)}_{i,j}) / 2

(2) entering the primary capsule layer, the output of the convolutional layer is calculated as:

p = g(W_b M + b_1)

where g is the nonlinear squash function applied over the entire vector, M is the input to the capsule, which is also the output of the convolutional layer, b_1 is the capsule bias, and W_b is a weight; in the primary capsule layer, the capsule replaces the scalar output of the convolution operation with a vector output;

(3) a dynamic routing algorithm replaces max pooling, dynamically strengthening or weakening weights to obtain effective features;

(4) entering the convolutional capsule layer: after this layer, all capsules in the previous layer become a series of capsules, and the weights are further dynamically strengthened or weakened through the dynamic routing algorithm, obtaining effective features;

(5) entering the fully connected capsule layer, which connects all extracted features;

(6) all features are input into the multilayer perceptron, and the activation function is used to obtain the probability that the data pair (x, y') generated by the generator is real data (x, y), i.e., the degree of similarity to a real sentence.
8. The method for improving Mongolian-Chinese translation quality by constructing a Mongolian-Chinese parallel corpus with a generative adversarial network according to claim 1, wherein the final training objective of the generative adversarial network is:

\min_G \max_D V(D, G) = E_{(x,y) \sim P_{data}(x,y)}[\log D(x,y)] + E_{y' \sim G(y'|x)}[\log(1 - D(x,y'))]

where G denotes the generator, D denotes the discriminator, and V(D, G) denotes the loss function of generator G and discriminator D; \min_G \max_D means that D maximizes the loss function V(D, G) while G minimizes it; E denotes expectation; P_{data}(x,y) denotes the distribution of the parallel corpus, and D(x, y) the probability that the discriminator considers the pair to be human-translated when the source sentence x and target sentence y of the parallel corpus are input to discriminator D; G(y'|x) denotes the generator's distribution over target sentences y' generated from the source sentence x, and D(x, y') the probability that the discriminator considers (x, y') to be human-translated; x denotes the Mongolian sentence in the parallel corpus, i.e., the source-language sentence; y denotes the Chinese sentence in the parallel corpus, i.e., the human translation result; and y' denotes the Chinese sentence generated by the generator, i.e., the generator's translation result.
9. The method for improving Mongolian-Chinese translation quality by constructing a Mongolian-Chinese parallel corpus with a generative adversarial network according to claim 1, wherein the process comprises:

processing the additional components of the lattice in the Mongolian corpus, by removing the control symbols in Mongolian sentences together with the additional components of the lattice, leaving only the stem parts;

segmenting the Mongolian corpus at different granularities, by:

(1) first cutting the corpus to be preprocessed into the smallest units, which for Mongolian are the Mongolian letters;

(2) then counting and ranking the occurrences of all adjacent smallest-unit combinations in the corpus, finding the most frequent combination and adding it to the dictionary, while deleting the lowest-frequency word in the dictionary to keep the dictionary size unchanged;

(3) repeating steps (1) and (2) until the frequency of the dictionary's words in the corpus is higher than a set value;

and segmenting the Chinese into words.
CN201910807617.7A, filed 2019-08-29 (priority date 2019-08-29): Method for improving Mongolian-Chinese translation quality by constructing a Mongolian-Chinese parallel corpus using a generative adversarial network. Active, granted as CN110598221B.

Priority Application

CN201910807617.7A, priority and filing date 2019-08-29; granted as CN110598221B.

Publications

CN110598221A (application publication): 2019-12-20
CN110598221B (granted patent): 2020-07-07

Family ID: 68856234; Country: CN (China)




Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant