CN114676708B - Low-resource neural machine translation method based on multi-strategy prototype generation - Google Patents
Low-resource neural machine translation method based on multi-strategy prototype generation
- Publication number
- CN114676708B CN202210293213.2A
- Authority
- CN
- China
- Prior art keywords
- prototype
- sentence
- matching
- attention
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000013519 translation Methods 0.000 title claims abstract description 74
- 238000000034 method Methods 0.000 title claims abstract description 58
- 230000001537 neural effect Effects 0.000 title claims abstract description 25
- 230000007246 mechanism Effects 0.000 claims abstract description 13
- 230000006870 function Effects 0.000 claims abstract description 4
- 230000014616 translation Effects 0.000 claims description 71
- 238000004364 calculation method Methods 0.000 claims description 19
- 238000012360 testing method Methods 0.000 claims description 15
- 238000012549 training Methods 0.000 claims description 12
- 238000012795 verification Methods 0.000 claims description 11
- 238000013528 artificial neural network Methods 0.000 claims description 10
- 238000012512 characterization method Methods 0.000 claims description 10
- 230000000694 effects Effects 0.000 claims description 10
- 230000008569 process Effects 0.000 claims description 10
- 238000011156 evaluation Methods 0.000 claims description 6
- 238000005457 optimization Methods 0.000 claims description 6
- 238000012216 screening Methods 0.000 claims description 6
- 239000000203 mixture Substances 0.000 claims description 3
- 238000007781 pre-processing Methods 0.000 claims description 3
- 230000009466 transformation Effects 0.000 claims description 3
- 238000003058 natural language processing Methods 0.000 abstract description 2
- 230000006872 improvement Effects 0.000 description 5
- 238000011160 research Methods 0.000 description 2
- 238000006467 substitution reaction Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000015556 catabolic process Effects 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000006731 degradation reaction Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 239000012634 fragment Substances 0.000 description 1
- 230000010354 integration Effects 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2468—Fuzzy queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2471—Distributed queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Fuzzy Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Biophysics (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Databases & Information Systems (AREA)
- Automation & Control Theory (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a low-resource neural machine translation method based on multi-strategy prototype generation, belonging to the technical field of natural language processing. The method comprises the following steps. First, a prototype sequence is retrieved by combining keyword matching and distributed representation matching; if no match is obtained, a usable pseudo prototype sequence is generated with a pseudo prototype generation method. Second, to use prototype sequences efficiently, the conventional encoder-decoder framework is improved: the encoding end receives the prototype sequence input through an additional encoder, while the decoding end controls the information flow with a gating mechanism and uses an improved loss function to reduce the influence of low-quality prototype sequences on the model. Starting from only a small amount of parallel corpus, the method effectively improves both the quantity and the quality of retrieval results, and is suitable for neural machine translation in low-resource environments and between similar languages.
Description
Technical Field
The invention relates to a low-resource neural machine translation method based on multi-strategy prototype generation, belonging to the technical field of natural language processing.
Background
In recent years, with the development of end-to-end translation models and the attention mechanism, neural machine translation (Neural Machine Translation, NMT) has advanced rapidly; its performance on mainstream language pairs quickly surpassed that of statistical machine translation, and it has become the dominant machine translation paradigm. Researchers have proposed various methods to improve the performance of neural machine translation. Among them, prototype methods based on prototype sequence integration have received much attention. In resource-rich scenarios, using a similar translation as the target-side prototype sequence can effectively improve the performance of neural machine translation. In low-resource scenarios, however, the lack of parallel corpus resources means that prototype sequences either cannot be retrieved or are of poor quality. Exploring how to use prototype sequences effectively to improve neural machine translation performance in low-resource scenarios therefore has significant research and application value.
A prototype sequence is a target-side sentence that exists in the translation memory and carries semantic information of the target language. Prototype methods exploit this target-side semantic information by introducing prototype sequences into the translation process, where they implicitly guide word alignment and constrain decoding. Current research on prototype methods focuses mainly on two aspects: prototype retrieval and prototype utilization. Prototype sequence retrieval is well developed in resource-rich scenarios because large-scale translation memories exist there; retrieval from such memories yields high-quality prototype sequences that effectively improve translation performance. In low-resource scenarios, however, the scale and quality of parallel corpora are limited, and traditional prototype sequence retrieval methods often fail to find usable prototypes, which limits their benefit to the downstream translation task. On the utilization side, researchers have proposed many improvements, particularly in how prototype sequences are incorporated into the translation model as encoder inputs: for example, adopting a dual-encoder structure to encode the input sentence and the prototype sequence simultaneously, and introducing a gating mechanism at the decoding end to balance the information ratio between the source sentence and the prototype sequence. Although these methods improve translation performance, they remain oriented mainly toward resource-rich scenarios, with few improvements targeted specifically at low-resource scenarios. The invention therefore proposes a low-resource neural machine translation method based on multi-strategy prototype generation, which better improves low-resource translation performance through an improved prototype acquisition method and a dedicated translation framework structure.
Disclosure of Invention
The invention provides a low-resource neural machine translation method based on multi-strategy prototype generation. It improves the efficiency and quality of prototype sequence acquisition by combining a traditional retrieval method with the proposed pseudo prototype generation method, and fuses the retrieved prototypes into the encoder-decoder framework through modifications of the neural network structure, so that the semantic information contained in prototype sequences is exploited to the greatest extent while the influence of low-quality sequences is weakened, thereby improving the performance of low-resource neural machine translation.
The technical scheme of the invention is as follows: the low-resource neural machine translation method based on multi-strategy prototype generation comprises the following specific steps:
Step1, corpus preprocessing: preprocessing parallel training, validation and test corpora of different scales for model training, parameter tuning and effect testing; constructing a multilingual global dictionary and a keyword dictionary for generating pseudo prototypes;
Step2, prototype generation: prototype generation is performed with a generation method that mixes multiple strategies, so as to guarantee the availability of a prototype sequence; the idea of this step is as follows: first, prototype retrieval is performed by combining fuzzy matching and distributed representation matching, and if no prototype is retrieved, keywords in the input sentence are replaced through word replacement operations to obtain a pseudo prototype sequence;
Step3, constructing a translation model incorporating the prototype sequence: the encoder-decoder structure of the traditional attention-based neural machine translation model is improved to better integrate the prototype sequence, and the corpora of Step1 and Step2 are used as model input to generate the final translation.
As a preferred scheme of the invention, Step1 specifically comprises the following steps:
Step1.1, model training uses IWSLT15, a data set in common use in the machine translation field; the translation tasks are English-Vietnamese, English-Chinese and English-German; for validation and testing, tst2012 is selected as the validation set for parameter optimization and model selection, and tst2013 is selected as the test set for evaluation;
Step1.2, an English-Vietnamese-Chinese-German global substitution dictionary is constructed using PanLex, Wikipedia, a laboratory-built English-Chinese-Southeast-Asian-language dictionary, and the Google Translate interface;
Step1.3, on the basis of Step1.2, the keyword dictionary is obtained through annotation-based screening, with all entities retained during screening; to avoid over-concentration on a few hot nouns, noun entries are retrieved in the corpus and ranked by occurrence frequency.
As a preferred scheme of the invention, Step2 comprises the following specific steps:
Step2.1, prototype retrieval is performed by combining fuzzy matching and distributed representation matching; the concrete implementation is as follows: the translation memory is a set of L parallel sentence pairs {(s_l, t_l): l = 1, …, L}, where s_l is a source sentence and t_l the corresponding target sentence; for a given input sentence x, keywords are first matched in the translation memory for retrieval; fuzzy matching is adopted as the keyword matching method, defined as:
FM(x, s_i) = 1 - ED(x, s_i) / max(|x|, |s_i|)
where ED(x, s_i) is the edit distance between x and s_i, and |x| and |s_i| are the sentence lengths of x and s_i;
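For illustration, a minimal Python sketch of this fuzzy matching score, assuming whitespace tokenization and the score form reconstructed above (the patent fixes neither choice):

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance over token lists.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def fuzzy_match_score(x, s_i):
    # FM(x, s_i) = 1 - ED(x, s_i) / max(|x|, |s_i|); 1.0 means identical sentences.
    x_tok, s_tok = x.split(), s_i.split()
    return 1.0 - edit_distance(x_tok, s_tok) / max(len(x_tok), len(s_tok))
```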
Unlike keyword-based matching methods, distributed representation matching retrieves according to the distance between sentence vector representations; it performs similarity retrieval using semantic information to some extent and thus provides a retrieval perspective different from that of keyword matching. The distributed representation matching based on cosine similarity is defined as:
sim(x, s_i) = (h_x · h_si) / (||h_x|| ||h_si||)
where h_x and h_si are the vector representations of x and s_i respectively, and ||h_x|| is the norm of the vector h_x. To enable fast calculation, the multilingual pre-training model mBERT is first used to obtain the vector representations of sentences x and s_i, and the faiss tool is then used to perform similarity matching on these representations;
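A sketch of this retrieval path, assuming the Hugging Face checkpoint bert-base-multilingual-cased with mean pooling as the sentence representation and an inner-product faiss index over L2-normalized vectors (equivalent to cosine similarity); the patent names mBERT and faiss but specifies neither the pooling strategy nor the index type:

```python
import faiss
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mbert = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

def encode(sentences):
    # Mean-pool the last hidden layer into one L2-normalized vector per sentence.
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        states = mbert(**batch).last_hidden_state          # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    vecs = (states * mask).sum(1) / mask.sum(1)            # masked mean over tokens
    return torch.nn.functional.normalize(vecs, dim=-1).numpy()

memory_src = ["hello world", "good morning"]  # source side s_1 ... s_L (toy data)
index = faiss.IndexFlatIP(768)                # inner product == cosine on unit vectors
index.add(encode(memory_src))

def distributed_topk(x, k=5):
    # Return ids and cosine scores of the top-k matching source sentences.
    scores, ids = index.search(encode([x]), k)
    return ids[0], scores[0]
```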
When fuzzy matching obtains the best-matching source sentence s_best, the set s' = {s_1, s_2, …, s_k} of top-k matching results is additionally obtained through distributed representation matching; if s_best ∈ s', the target-side sentence t_best corresponding to s_best is selected as the prototype sequence. When fuzzy matching fails to retrieve a matching source sentence, or s_best ∉ s', the source sentence s_best is instead retrieved through distributed representation matching;
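The selection rule above can be written down directly. A sketch reusing fuzzy_match_score and distributed_topk from the previous sketches; the fuzzy-match acceptance threshold tau and the distributed-similarity floor sim_min are hypothetical knobs, not values given by the patent:

```python
import numpy as np

def retrieve_prototype(x, memory_src, memory_tgt, tau=0.5, k=5, sim_min=0.8):
    # Fuzzy matching: best-scoring source sentence s_best over the whole memory.
    scores = [fuzzy_match_score(x, s) for s in memory_src]
    best = int(np.argmax(scores))

    # Distributed representation matching: top-k candidate set s'.
    topk_ids, topk_sims = distributed_topk(x, k)

    if scores[best] >= tau and best in topk_ids:
        return memory_tgt[best]              # t_best: confirmed by both strategies
    if topk_sims[0] >= sim_min:
        return memory_tgt[int(topk_ids[0])]  # fall back to the distributed best match
    return None  # nothing usable retrieved -> pseudo prototype generation (Step2.2)
```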
Step2.2, if no prototype is retrieved in Step2.1, keyword replacement is performed on the input sentence to generate a pseudo prototype, i.e., word-replacement-based pseudo prototype generation; specifically, the method comprises the following two replacement strategies:
Global replacement: when retrieval fails for the input sentence, words in the input sentence are replaced as extensively as possible using the bilingual dictionary, following a maximization principle; the replaced sentence is called a pseudo prototype sequence;
Keyword replacement: important nouns and entities are extracted from the bilingual dictionary to construct the keyword dictionary; when retrieval fails for the input sentence, this dictionary is used to replace keywords in the input sentence and generate a pseudo prototype sequence, with the upper limit on the number of replacements kept below a set threshold. The expectation is that the pseudo prototype sequence, which mixes source-side and important target-side vocabulary, provides guidance for translation generation on the basis of the shared vocabulary; a sketch of both replacement strategies follows.
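Both replacement strategies reduce to dictionary-driven token substitution. A minimal sketch assuming plain dict lookups, whitespace tokenization, and single-word dictionary entries; max_repl stands in for the replacement-count threshold, which the patent leaves configurable (the embodiment later reports 3 as the best value):

```python
def global_replace(sentence, global_dict):
    # Replace every word that has a dictionary entry (maximization principle).
    return " ".join(global_dict.get(w, w) for w in sentence.split())

def keyword_replace(sentence, keyword_dict, max_repl=3):
    # Replace at most max_repl keywords (important nouns and entities).
    out, done = [], 0
    for w in sentence.split():
        if done < max_repl and w in keyword_dict:
            out.append(keyword_dict[w])
            done += 1
        else:
            out.append(w)
    return " ".join(out)

# Toy English->German keyword dictionary, purely for illustration.
print(keyword_replace("the cat sat on the mat", {"cat": "Katze", "mat": "Matte"}))
# -> "the Katze sat on the Matte"
```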
As a preferred embodiment of the present invention, Step3 includes:
The encoding end adopts a dual-encoder structure that receives the sentence input and the prototype sequence input simultaneously and encodes each into a corresponding hidden state representation. The sentence encoder is a standard Transformer encoder formed by stacking multiple layers, each layer consisting of 2 sublayers, namely a multi-head self-attention layer and a feed-forward neural network layer, both using residual connections and layer normalization. Given an input sentence x = (x_1, x_2, …, x_m), the sentence encoder encodes it into the corresponding hidden state sequence h_x = (h_x1, h_x2, …, h_xm), where h_xi is the hidden state corresponding to x_i. The prototype encoder is structurally identical to the sentence encoder; given a prototype sequence t = (t_1, t_2, …, t_n), it encodes it into the corresponding hidden state sequence h_t = (h_t1, h_t2, …, h_tn), where h_ti is the hidden state corresponding to t_i. A sketch of this dual-encoder structure follows.
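A minimal PyTorch sketch of the dual-encoder front end; the layer count, model width, and head count are illustrative rather than taken from the patent, the shared embedding reflects the shared vocabulary mentioned in Step2.2, and positional encodings are omitted for brevity:

```python
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # shared source/prototype vocabulary
        make = lambda: nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.sent_enc = nn.TransformerEncoder(make(), num_layers=num_layers)
        self.proto_enc = nn.TransformerEncoder(make(), num_layers=num_layers)  # same structure

    def forward(self, x_ids, t_ids):
        h_x = self.sent_enc(self.embed(x_ids))   # hidden states of the input sentence
        h_t = self.proto_enc(self.embed(t_ids))  # hidden states of the prototype sequence
        return h_x, h_t
```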
As a preferred embodiment of the present invention, Step3 includes:
The decoding end integrates a gating mechanism, which uses the self-learning capability of the neural network to optimize the ratio between sentence information and prototype information and to control the information flow during decoding. The improved decoder consists of three sublayers: (1) a self-attention layer; (2) an improved codec attention layer; and (3) a fully connected feed-forward network layer. The improved codec attention layer consists of a sentence codec attention module and a prototype codec attention module; upon receiving the output s_self of the multi-head self-attention layer at time step i and the output h_x of the sentence encoder, the sentence codec attention module performs the attention calculation.
As a preferred embodiment of the present invention, in Step3, the attention calculation performed by the sentence codec attention module is:
s_x = MultiHeadAtt(s_self, h_x, h_x)
where MultiHeadAtt(·) is the multi-head attention calculation; similarly, the prototype codec attention is calculated as:
s_t = MultiHeadAtt(s_self, h_t, h_t)
Subsequently, the sentence codec attention output s_x and the prototype codec attention output s_t are concatenated to calculate the gating variable α:
α = sigmoid(W_α[s_x; s_t] + b_α)
where W_α and b_α are trainable parameters; α is then used to calculate the final output of the codec attention layer:
s_enc_dec = α * s_x + (1 - α) * s_t
s_enc_dec is then fed as input into the fully connected feed-forward network:
s_ffn = f(s_enc_dec)
where f(·) is defined as f(x) = max(0, xW_1 + b_1)W_2 + b_2, in which W_1, W_2, b_1 and b_2 are trainable parameters; the final translation y_i at time step i is calculated as:
P(y_i | y_<i; x; t, θ) = softmax(σ(s_ffn))
where t is the prototype sequence and σ(·) is a linear transformation function.
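A minimal PyTorch sketch of the improved codec attention layer and feed-forward step described by the formulas above. The dimensions are illustrative, the gate α is computed elementwise per position (one common realization of the formula), and the residual connections and layer normalization around each sublayer are omitted for brevity:

```python
import torch
import torch.nn as nn

class GatedCodecAttention(nn.Module):
    def __init__(self, d_model=512, nhead=8, d_ff=2048):
        super().__init__()
        self.att_x = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.att_t = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)  # W_alpha, b_alpha
        self.ffn = nn.Sequential(                    # f(x) = max(0, xW1 + b1)W2 + b2
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, s_self, h_x, h_t):
        s_x, _ = self.att_x(s_self, h_x, h_x)        # sentence codec attention
        s_t, _ = self.att_t(s_self, h_t, h_t)        # prototype codec attention
        alpha = torch.sigmoid(self.gate(torch.cat([s_x, s_t], dim=-1)))
        s_enc_dec = alpha * s_x + (1 - alpha) * s_t  # gated fusion of the two streams
        return self.ffn(s_enc_dec)                   # s_ffn
```

A linear projection σ(·) over s_ffn followed by softmax then yields P(y_i | y_<i; x; t, θ) as stated above.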
The beneficial effects of the invention are as follows:
1. The invention obtains prototype sequences by combining a prototype sequence retrieval method with a word-replacement-based pseudo prototype generation method, maximizing the number of available prototype sequences in low-resource scenarios while guaranteeing sequence quality;
2. The invention improves the encoder-decoder translation framework: the encoding end uses a dual-encoder structure in which the sentence encoder and the prototype encoder respectively encode the input sentence and the retrieved (or generated) prototype, while the decoding end uses a gating mechanism to control the ratio and flow of information;
3. The loss calculation method of the neural machine translation model is improved: the model exploits the semantic information contained in high-quality prototype sequences while weakening the negative influence of low-quality prototype sequences on the translation model, ultimately improving translation performance in low-resource scenarios and also yielding better translation fluency.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of the overall structure of the model proposed by the present invention;
FIG. 3 shows the effect of keyword replacement times on model performance.
Detailed Description
Example 1: as shown in fig. 1-3, a low-resource neural machine translation method based on multi-strategy prototype generation comprises the following specific steps:
Step1, corpus preprocessing: preprocessing parallel training, validation and test corpora of different scales for model training, parameter tuning and effect testing; constructing a multilingual global dictionary and a keyword dictionary for generating pseudo prototypes;
Step1.1, model training uses IWSLT15, a data set in common use in the machine translation field; the translation tasks are English-Vietnamese, English-Chinese and English-German; for validation and testing, tst2012 is selected as the validation set for parameter optimization and model selection, and tst2013 is selected as the test set for evaluation;
Step1.2, an English-Vietnamese-Chinese-German global substitution dictionary is constructed using PanLex, Wikipedia, a laboratory-built English-Chinese-Southeast-Asian-language dictionary, and the Google Translate interface;
Step1.3, on the basis of Step1.2, the keyword dictionary is obtained through annotation-based screening, with all entities retained during screening; to avoid over-concentration on a few hot nouns, noun entries are retrieved in the corpus and ranked by occurrence frequency.
The preprocessed parallel corpora are divided by scale into two types: small-scale parallel corpora and large-scale parallel corpora. Applying the method of the invention to parallel corpora of different scales makes it possible to observe how growing corpus scale affects information utilization and to verify the assumption that the proposed method suits scenarios where parallel corpus resources are scarce. Table 1 lists the experimental data.
Table 1 experimental data
Step2, prototype generation: prototype generation is performed with a generation method that mixes multiple strategies, so as to guarantee the availability of a prototype sequence; the idea of this step is as follows: first, prototype retrieval is performed by combining fuzzy matching and distributed representation matching, and if no prototype is retrieved, keywords in the input sentence are replaced through word replacement operations to obtain a pseudo prototype sequence;
Step3, constructing a translation model incorporating the prototype sequence: the encoder-decoder structure of the traditional attention-based neural machine translation model is improved to better integrate the prototype sequence, and the corpora of Step1 and Step2 are used as model input to generate the final translation.
As a preferred scheme of the invention, Step2 comprises the following specific steps:
Step2.1, prototype retrieval is performed by combining fuzzy matching and distributed representation matching; the concrete implementation is as follows: the translation memory is a set of L parallel sentence pairs {(s_l, t_l): l = 1, …, L}, where s_l is a source sentence and t_l the corresponding target sentence; for a given input sentence x, keywords are first matched in the translation memory for retrieval; in a low-resource environment, long sentences are scarce in the corpus, so matched fragments are short and can hardly form an effective similarity measure; therefore, fuzzy matching is used as the keyword matching method instead of N-gram matching, defined as:
FM(x, s_i) = 1 - ED(x, s_i) / max(|x|, |s_i|)
where ED(x, s_i) is the edit distance between x and s_i, and |x| and |s_i| are the sentence lengths of x and s_i;
Unlike keyword-based matching methods, distributed representation matching retrieves according to the distance between sentence vector representations; it performs similarity retrieval using semantic information to some extent and thus provides a retrieval perspective different from that of keyword matching. The distributed representation matching based on cosine similarity is defined as:
sim(x, s_i) = (h_x · h_si) / (||h_x|| ||h_si||)
where h_x and h_si are the vector representations of x and s_i respectively, and ||h_x|| is the norm of the vector h_x. To enable fast calculation, the multilingual pre-training model mBERT is first used to obtain the vector representations of sentences x and s_i, and the faiss tool is then used to perform similarity matching on these representations;
When fuzzy matching obtains the best-matching source sentence s_best, the set s' = {s_1, s_2, …, s_k} of top-k matching results is additionally obtained through distributed representation matching; if s_best ∈ s', the target-side sentence t_best corresponding to s_best is selected as the prototype sequence. When fuzzy matching fails to retrieve a matching source sentence, or s_best ∉ s', the source sentence s_best is instead retrieved through distributed representation matching;
Step2.2, if no prototype is retrieved in Step2.1, keyword replacement is performed on the input sentence to generate a pseudo prototype, i.e., word-replacement-based pseudo prototype generation; specifically, the method comprises the following two replacement strategies:
Global replacement: when retrieval fails for the input sentence, words in the input sentence are replaced as extensively as possible using the bilingual dictionary, following a maximization principle; the replaced sentence is called a pseudo prototype sequence;
Keyword replacement: important nouns and entities are extracted from the bilingual dictionary to construct the keyword dictionary; when retrieval fails for the input sentence, this dictionary is used to replace keywords in the input sentence and generate a pseudo prototype sequence, with the upper limit on the number of replacements kept below a set threshold. The expectation is that the pseudo prototype sequence, which mixes source-side and important target-side vocabulary, provides guidance for translation generation on the basis of the shared vocabulary.
To illustrate the prototype retrieval effect of the invention in low-resource scenarios, the proposed method is compared with the baseline prototype retrieval method on data of different scales. Table 2 shows the improvement in matching quality brought by the hybrid prototype retrieval model.
Table 2 Comparison of fuzzy matching and hybrid prototype retrieval effects
As the results in Table 2 show, in low-resource scenarios the number of prototype sequences obtained by fuzzy matching alone is clearly insufficient. Hybrid prototype matching, which combines fuzzy matching and distributed representation matching, alleviates this problem to some extent. On the large-scale data set WMT14, combining distributed representation matching yields better matching results than the fuzzy matching strategy alone. On the small-scale data set IWSLT15, combining fuzzy matching with distributed representation matching adds semantic-level consideration on top of keyword matching, further improving the quality of the prototype sequences. The hybrid prototype retrieval method proposed by the invention is therefore well suited to low-resource scenarios.
FIG. 3 illustrates the effect of the number of keyword replacements on model performance for the Vietnamese-English translation task. When generating pseudo prototypes through keyword replacement, a replacement threshold is first set empirically and then tuned on validation-set performance. The evaluation shows that, under the sequential dictionary-traversal strategy, a small replacement threshold yields prototype sequences that differ too little from the original text to provide effective guidance for the translation process, while a large threshold degenerates toward global replacement; the model performs best when the number of keyword replacements is set to 3.
As a preferred embodiment of the present invention, Step3 includes:
The encoding end adopts a dual-encoder structure that receives the sentence input and the prototype sequence input simultaneously and encodes each into a corresponding hidden state representation. The sentence encoder is a standard Transformer encoder formed by stacking multiple layers, each layer consisting of 2 sublayers, namely a multi-head self-attention layer and a feed-forward neural network layer, both using residual connections and layer normalization. Given an input sentence x = (x_1, x_2, …, x_m), the sentence encoder encodes it into the corresponding hidden state sequence h_x = (h_x1, h_x2, …, h_xm), where h_xi is the hidden state corresponding to x_i. The prototype encoder is structurally identical to the sentence encoder; given a prototype sequence t = (t_1, t_2, …, t_n), it encodes it into the corresponding hidden state sequence h_t = (h_t1, h_t2, …, h_tn), where h_ti is the hidden state corresponding to t_i.
As a preferred embodiment of the present invention, Step3 includes:
The decoding end integrates a gating mechanism, which uses the self-learning capability of the neural network to optimize the ratio between sentence information and prototype information and to control the information flow during decoding. The improved decoder consists of three sublayers: (1) a self-attention layer; (2) an improved codec attention layer; and (3) a fully connected feed-forward network layer. The improved codec attention layer consists of a sentence codec attention module and a prototype codec attention module; upon receiving the output s_self of the multi-head self-attention layer at time step i and the output h_x of the sentence encoder, the sentence codec attention module performs the attention calculation.
As a preferred embodiment of the present invention, in Step3, the attention calculation performed by the sentence codec attention module is:
s_x = MultiHeadAtt(s_self, h_x, h_x)
where MultiHeadAtt(·) is the multi-head attention calculation; similarly, the prototype codec attention is calculated as:
s_t = MultiHeadAtt(s_self, h_t, h_t)
Subsequently, the sentence codec attention output s_x and the prototype codec attention output s_t are concatenated to calculate the gating variable α:
α = sigmoid(W_α[s_x; s_t] + b_α)
where W_α and b_α are trainable parameters; α is then used to calculate the final output of the codec attention layer:
s_enc_dec = α * s_x + (1 - α) * s_t
s_enc_dec is then fed as input into the fully connected feed-forward network:
s_ffn = f(s_enc_dec)
where f(·) is defined as f(x) = max(0, xW_1 + b_1)W_2 + b_2, in which W_1, W_2, b_1 and b_2 are trainable parameters; the final translation y_i at time step i is calculated as:
P(y_i | y_<i; x; t, θ) = softmax(σ(s_ffn))
where t is the prototype sequence and σ(·) is a linear transformation function.
To illustrate the translation effect of the invention, the translations generated by the invention are compared against baseline systems. Tables 3 and 4 show the improvements on the small-scale corpora.
Table 3 BLEU evaluation results (%)
Table 4 RIBES evaluation results (%)
The results in Tables 3 and 4 show that the proposed method learns prototype sequence representations through an additional prototype encoder at the encoding end, while at the decoding end it effectively exploits the semantic information contained in high-quality prototype sequences and reduces the noise introduced by low-quality ones, effectively improving translation quality and fluency in low-resource scenarios.
While the present invention has been described in detail with reference to the drawings, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.
Claims (5)
1. A low-resource neural machine translation method based on multi-strategy prototype generation, characterized by comprising the following specific steps:
Step1, corpus preprocessing: preprocessing parallel training, validation and test corpora of different scales for model training, parameter tuning and effect testing; constructing a multilingual global dictionary and a keyword dictionary for generating pseudo prototypes;
Step2, prototype generation: performing prototype generation with a generation method that mixes multiple strategies, so as to guarantee the availability of a prototype sequence; the idea of this step is as follows: first, prototype retrieval is performed by combining fuzzy matching and distributed representation matching, and if no prototype is retrieved, keywords in the input sentence are replaced through word replacement operations to obtain a pseudo prototype sequence;
Step3, constructing a translation model incorporating the prototype sequence: improving the encoder-decoder structure of the traditional attention-based neural machine translation model to better integrate the prototype sequence, and using the corpora of Step1 and Step2 as model input to generate the final translation;
Step2 specifically comprises the following steps:
Step2.1, prototype retrieval is performed by combining fuzzy matching and distributed representation matching; the concrete implementation is as follows: the translation memory is a set of L parallel sentence pairs {(s_l, t_l): l = 1, …, L}, where s_l is a source sentence and t_l the corresponding target sentence; for a given input sentence x, keywords are first matched in the translation memory for retrieval; fuzzy matching is adopted as the keyword matching method, defined as:
FM(x, s_i) = 1 - ED(x, s_i) / max(|x|, |s_i|)
where ED(x, s_i) is the edit distance between x and s_i, and |x| and |s_i| are the sentence lengths of x and s_i;
unlike keyword-based matching methods, distributed representation matching retrieves according to the distance between sentence vector representations; it performs similarity retrieval using semantic information to some extent and thus provides a retrieval perspective different from that of keyword matching; the distributed representation matching based on cosine similarity is defined as:
sim(x, s_i) = (h_x · h_si) / (||h_x|| ||h_si||)
where h_x and h_si are the vector representations of x and s_i respectively, and ||h_x|| is the norm of the vector h_x; to enable fast calculation, the multilingual pre-training model mBERT is first used to obtain the vector representations of sentences x and s_i, and the faiss tool is then used to perform similarity matching on these representations;
when fuzzy matching obtains the best-matching source sentence s_best, the set s' = {s_1, s_2, …, s_k} of top-k matching results is additionally obtained through distributed representation matching; if s_best ∈ s', the target-side sentence t_best corresponding to s_best is selected as the prototype sequence; when fuzzy matching fails to retrieve a matching source sentence, or s_best ∉ s', the source sentence s_best is instead retrieved through distributed representation matching;
Step2.2, if no prototype is retrieved in Step2.1, keyword replacement is performed on the input sentence to generate a pseudo prototype, i.e., word-replacement-based pseudo prototype generation; specifically, the method comprises the following two replacement strategies:
global replacement: when retrieval fails for the input sentence, words in the input sentence are replaced as extensively as possible using the bilingual dictionary, following a maximization principle; the replaced sentence is called a pseudo prototype sequence;
keyword replacement: important nouns and entities are extracted from the bilingual dictionary to construct the keyword dictionary; when retrieval fails for the input sentence, this dictionary is used to replace keywords in the input sentence and generate a pseudo prototype sequence, with the upper limit on the number of replacements kept below a set threshold; the expectation is that the pseudo prototype sequence, which mixes source-side and important target-side vocabulary, provides guidance for translation generation on the basis of the shared vocabulary.
2. The low-resource neural machine translation method based on multi-strategy prototype generation according to claim 1, characterized in that Step1 specifically comprises:
Step1.1, model training uses IWSLT15, a data set in common use in the machine translation field; the translation tasks are English-Vietnamese, English-Chinese and English-German; for validation and testing, tst2012 is selected as the validation set for parameter optimization and model selection, and tst2013 is selected as the test set for evaluation;
Step1.2, an English-Vietnamese-Chinese-German global substitution dictionary is constructed using PanLex, Wikipedia, a laboratory-built English-Chinese-Southeast-Asian-language dictionary, and the Google Translate interface;
Step1.3, on the basis of Step1.2, the keyword dictionary is obtained through annotation-based screening, with all entities retained during screening; to avoid over-concentration on a few hot nouns, noun entries are retrieved in the corpus and ranked by occurrence frequency.
3. The low-resource neural machine translation method based on multi-strategy prototype generation according to claim 1, characterized in that Step3 comprises:
Step3.1, the encoding end adopts a dual-encoder structure that receives the sentence input and the prototype sequence input simultaneously and encodes each into a corresponding hidden state representation; the sentence encoder is a standard Transformer encoder formed by stacking multiple layers, each layer consisting of 2 sublayers, namely a multi-head self-attention layer and a feed-forward neural network layer, both using residual connections and layer normalization; given an input sentence x = (x_1, x_2, …, x_m), the sentence encoder encodes it into the corresponding hidden state sequence h_x = (h_x1, h_x2, …, h_xm), where h_xi is the hidden state corresponding to x_i; the prototype encoder is structurally identical to the sentence encoder; given a prototype sequence t = (t_1, t_2, …, t_n), the prototype encoder encodes it into the corresponding hidden state sequence h_t = (h_t1, h_t2, …, h_tn), where h_ti is the hidden state corresponding to t_i.
4. The low-resource neural machine translation method based on multi-strategy prototype generation according to claim 1, characterized in that Step3 comprises:
the decoding end integrates a gating mechanism, which uses the self-learning capability of the neural network to optimize the ratio between sentence information and prototype information and to control the information flow during decoding; the improved decoder consists of three sublayers: (1) a self-attention layer; (2) an improved codec attention layer; and (3) a fully connected feed-forward network layer; the improved codec attention layer consists of a sentence codec attention module and a prototype codec attention module; upon receiving the output s_self of the multi-head self-attention layer at time step i and the output h_x of the sentence encoder, the sentence codec attention module performs the attention calculation.
5. The low-resource neural machine translation method based on multi-strategy prototype generation according to claim 4, characterized in that in Step3, the attention calculation performed by the sentence codec attention module is:
s_x = MultiHeadAtt(s_self, h_x, h_x)
where MultiHeadAtt(·) is the multi-head attention calculation; similarly, the prototype codec attention is calculated as:
s_t = MultiHeadAtt(s_self, h_t, h_t)
subsequently, the sentence codec attention output s_x and the prototype codec attention output s_t are concatenated to calculate the gating variable α:
α = sigmoid(W_α[s_x; s_t] + b_α)
where W_α and b_α are trainable parameters; α is then used to calculate the final output of the codec attention layer:
s_enc_dec = α * s_x + (1 - α) * s_t
s_enc_dec is then fed as input into the fully connected feed-forward network:
s_ffn = f(s_enc_dec)
where f(·) is defined as f(x) = max(0, xW_1 + b_1)W_2 + b_2, in which W_1, W_2, b_1 and b_2 are trainable parameters; the final translation y_i at time step i is calculated as:
P(y_i | y_<i; x; t, θ) = softmax(σ(s_ffn))
where t is the prototype sequence and σ(·) is a linear transformation function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210293213.2A CN114676708B (en) | 2022-03-24 | 2022-03-24 | Low-resource neural machine translation method based on multi-strategy prototype generation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210293213.2A CN114676708B (en) | 2022-03-24 | 2022-03-24 | Low-resource neural machine translation method based on multi-strategy prototype generation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114676708A CN114676708A (en) | 2022-06-28 |
CN114676708B true CN114676708B (en) | 2024-04-23 |
Family
ID=82073905
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210293213.2A Active CN114676708B (en) | 2022-03-24 | 2022-03-24 | Low-resource neural machine translation method based on multi-strategy prototype generation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114676708B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117993396A * | 2024-01-23 | 2024-05-07 | Harbin Institute of Technology | RAG-based large model machine translation method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10346548B1 (en) * | 2016-09-26 | 2019-07-09 | Lilt, Inc. | Apparatus and method for prefix-constrained decoding in a neural machine translation system |
CN110059323A * | 2019-04-22 | 2019-07-26 | Soochow University | Multi-domain neural machine translation method based on self-attention mechanism |
CN110489766A * | 2019-07-25 | 2019-11-22 | Kunming University of Science and Technology | Chinese-Vietnamese low-resource neural machine translation method based on encoding summarization and decoding weighting |
CN112507734A * | 2020-11-19 | 2021-03-16 | Nanjing University | Neural machine translation system based on Romanized Uyghur |
-
2022
- 2022-03-24 CN CN202210293213.2A patent/CN114676708B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10346548B1 (en) * | 2016-09-26 | 2019-07-09 | Lilt, Inc. | Apparatus and method for prefix-constrained decoding in a neural machine translation system |
CN110059323A * | 2019-04-22 | 2019-07-26 | Soochow University | Multi-domain neural machine translation method based on self-attention mechanism |
CN110489766A * | 2019-07-25 | 2019-11-22 | Kunming University of Science and Technology | Chinese-Vietnamese low-resource neural machine translation method based on encoding summarization and decoding weighting |
CN112507734A * | 2020-11-19 | 2021-03-16 | Nanjing University | Neural machine translation system based on Romanized Uyghur |
Non-Patent Citations (2)
Title |
---|
Efficient Low-Resource Neural Machine Translation with Reread and Feedback Mechanism; Yu Zhiqiang et al.; ACM Transactions on Asian and Low-Resource Language Information Processing; 2020-01-09; Vol. 19, No. 3; 1-13 *
Low-Resource Neural Machine Translation Based on Multi-Strategy Prototype Generation (基于多策略原型生成的低资源神经机器翻译); Yu Zhiqiang et al.; Journal of Software (软件学报); 2023-04-28; Vol. 34, No. 11; 5113-5125 *
Also Published As
Publication number | Publication date |
---|---|
CN114676708A (en) | 2022-06-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102577514B1 (en) | Method, apparatus for text generation, device and storage medium | |
Liu et al. | A recursive recurrent neural network for statistical machine translation | |
Li et al. | Text compression-aided transformer encoding | |
CN111160050A (en) | Chapter-level neural machine translation method based on context memory network | |
CN110619043A (en) | Automatic text abstract generation method based on dynamic word vector | |
CN113901847A (en) | Neural machine translation method based on source language syntax enhanced decoding | |
CN114676708B (en) | Low-resource neural machine translation method based on multi-strategy prototype generation | |
Heo et al. | Multimodal neural machine translation with weakly labeled images | |
CN115114940A (en) | Machine translation style migration method and system based on curriculum pre-training | |
CN113657125B (en) | Mongolian non-autoregressive machine translation method based on knowledge graph | |
CN114648024A (en) | Chinese cross-language abstract generation method based on multi-type word information guidance | |
CN112380882B (en) | Mongolian Chinese neural machine translation method with error correction function | |
CN115017924B (en) | Construction of neural machine translation model for cross-language translation and translation method thereof | |
CN114707523B (en) | Image-multilingual subtitle conversion method based on interactive converter | |
Zhang et al. | Guidance module network for video captioning | |
CN112347753B (en) | Abstract generation method and system applied to reading robot | |
Chang et al. | Improving language translation using the hidden Markov model | |
Wu | A chinese-english machine translation model based on deep neural network | |
CN113157855A (en) | Text summarization method and system fusing semantic and context information | |
Zhu | Exploration on Korean-Chinese collaborative translation method based on recursive recurrent neural network | |
Maqsood | Evaluating newsQA dataset with ALBERT | |
CN114611487B (en) | Unsupervised Thai dependency syntax analysis method based on dynamic word embedding alignment | |
Yu et al. | Semantic extraction for sentence representation via reinforcement learning | |
Getachew et al. | Gex'ez-English Bi-Directional Neural Machine Translation Using Transformer | |
Song et al. | RUC_AIM3 at TRECVID 2019: Video to Text. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |