CN114676708B - Low-resource neural machine translation method based on multi-strategy prototype generation - Google Patents

Low-resource neural machine translation method based on multi-strategy prototype generation

Info

Publication number
CN114676708B
Authority
CN
China
Prior art keywords
prototype
sentence
matching
attention
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210293213.2A
Other languages
Chinese (zh)
Other versions
CN114676708A (en
Inventor
余正涛 (Yu Zhengtao)
朱恩昌 (Zhu Enchang)
于志强 (Yu Zhiqiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202210293213.2A priority Critical patent/CN114676708B/en
Publication of CN114676708A publication Critical patent/CN114676708A/en
Application granted granted Critical
Publication of CN114676708B publication Critical patent/CN114676708B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2468Fuzzy queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471Distributed queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Fuzzy Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Automation & Control Theory (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a low-resource neural machine translation method based on multi-strategy prototype generation, belonging to the technical field of natural language processing. The method comprises the following steps: first, a prototype sequence is retrieved by combining keyword matching and distributed representation matching; if no match is obtained, a usable pseudo-prototype sequence is generated by a pseudo-prototype generation method. Second, to use prototype sequences efficiently, the conventional encoder-decoder framework is improved: the encoding end receives the prototype-sequence input through an additional encoder, and the decoding end controls the information flow with a gating mechanism while an improved loss function reduces the influence of low-quality prototype sequences on the model. The method effectively improves the quantity and quality of retrieval results on the basis of a small parallel corpus, and is suitable for neural machine translation in low-resource environments and similar-language environments.

Description

Low-resource neural machine translation method based on multi-strategy prototype generation
Technical Field
The invention relates to a low-resource neural machine translation method based on multi-strategy prototype generation, belonging to the technical field of natural language processing.
Background
In recent years, with the development of end-to-end translation models and attention mechanisms, neural machine translation (Neural Machine Translation, NMT) has advanced rapidly: its performance on mainstream language pairs quickly surpassed that of statistical machine translation, and it has become the dominant machine translation paradigm. Researchers have proposed various methods to further improve the performance of neural machine translation. Among them, methods based on prototype-sequence integration have received much attention. In resource-rich settings, using a similar translation as the target-side prototype sequence can effectively improve NMT performance. In low-resource settings, however, the lack of parallel corpus resources means that prototype sequences either cannot be retrieved or are of poor quality. Exploring how to effectively exploit prototype sequences to improve NMT performance in low-resource settings therefore has great research and application value.
A prototype sequence is a target-side sentence that exists in the translation memory and carries semantic information of the target language. Prototype methods exploit target-side semantic information by introducing prototype sequences into the translation process, where they implicitly guide word alignment and constrain decoding. Current research on prototype methods focuses mainly on prototype retrieval and prototype utilization. Prototype-sequence retrieval is well developed in resource-rich settings because a large-scale translation memory exists there; retrieval can then yield high-quality prototype sequences and effectively improve translation performance. In low-resource settings, however, the scale and quality of parallel corpora are limited, and traditional prototype-sequence retrieval often fails to find usable prototypes, limiting the benefit to the downstream translation task. In addition, researchers have proposed many improvements to prototype-sequence utilization, especially in how prototype sequences are incorporated as encoder inputs into translation models; for example, a dual-encoder structure encodes the input sentence and the prototype sequence simultaneously, and a gating mechanism at the decoding end balances the information proportion between the source sentence and the prototype sequence. While these methods all improve translation performance, they remain oriented mainly toward resource-rich settings and are seldom adapted specifically to low-resource ones. The invention therefore provides a low-resource neural machine translation method based on multi-strategy prototype generation, which better improves low-resource NMT performance through an improved prototype acquisition method and a dedicated translation framework.
Disclosure of Invention
The invention provides a low-resource neural machine translation method based on multi-strategy prototype generation. It improves the efficiency and quality of prototype-sequence acquisition by combining a traditional retrieval method with the proposed pseudo-prototype generation method, and integrates the retrieved prototypes into the encoder-decoder framework through changes to the neural network structure, so that the semantic information contained in prototype sequences is exploited to the greatest extent while the influence of low-quality sequences is weakened, thereby improving the performance of low-resource neural machine translation.
The technical scheme of the invention is as follows: the low-resource neural machine translation method based on multi-strategy prototype generation comprises the following specific steps:
Step1, corpus preprocessing: preprocess parallel training, validation, and test corpora of different scales for model training, parameter tuning, and performance testing; construct a multilingual global dictionary and a keyword dictionary for pseudo-prototype generation;
Step2, prototype generation: prototype generation is performed with a prototype generation method based on a mixture of multiple strategies to ensure the availability of a prototype sequence; the specific idea of this step is as follows: first, prototype retrieval is performed by combining fuzzy matching and distributed representation matching; if no prototype is retrieved, keywords in the input sentence are replaced by a word replacement operation to obtain a pseudo-prototype sequence;
Step3, constructing a translation model that integrates the prototype sequence: the encoder-decoder structure of the traditional attention-based neural machine translation model is improved to better integrate the prototype sequence, and the corpora of Step1 and Step2 are used as model input to generate the final translation.
As a preferred scheme of the invention, Step1 specifically comprises the following steps:
Step1.1, model training is performed using the general machine-translation dataset IWSLT15, with English-Vietnamese, English-Chinese, and English-German as translation tasks; for validation and testing, tst2012 is selected as the validation set for parameter tuning and model selection, and tst2013 as the test set for evaluation;
Step1.2, PanLex, Wikipedia, a laboratory-built English-Chinese-Southeast-Asian-languages dictionary, and the Google Translate interface are used to construct an English-Vietnamese-Chinese-German global replacement dictionary;
Step1.3, on the basis of Step1.2, the keyword dictionary is obtained by tagging and screening, with all entities retained during screening; to avoid over-concentration on a few hot nouns, nouns are looked up in the corpus and ranked in inverse order of occurrence frequency.
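As an illustration of this construction, a minimal Python sketch follows; the entry format of the global dictionary (translation, part-of-speech tag, entity flag) and the exact use of the frequency ranking are assumptions, since the patent does not specify the data layout.

from collections import Counter

def build_keyword_dict(global_dict, corpus_sentences):
    # global_dict maps a source word to (translation, pos_tag, is_entity).
    freq = Counter(tok for sent in corpus_sentences for tok in sent.split())
    # All entities are retained unconditionally.
    entities = {s: t for s, (t, pos, ent) in global_dict.items() if ent}
    nouns = [(s, t) for s, (t, pos, ent) in global_dict.items()
             if not ent and pos == "NOUN"]
    # Rank nouns by inverse corpus frequency so a few hot nouns do not dominate.
    nouns.sort(key=lambda st: freq[st[0]])
    keyword_dict = dict(nouns)
    keyword_dict.update(entities)
    return keyword_dict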
As a preferred scheme of the invention, Step2 comprises the following specific steps:
Step2.1, prototype retrieval is performed by combining fuzzy matching and distributed representation matching; the specific implementation is as follows: the translation memory is a set of L parallel sentence pairs {(s_l, t_l): l = 1, …, L}, where s_l is a source sentence and t_l is the corresponding target sentence; for a given input sentence x, keyword matching is first performed against the translation memory; fuzzy matching is adopted as the keyword matching method, defined as:
FM(x, s_i) = 1 - ED(x, s_i) / |x|
where ED(x, s_i) is the edit distance between x and s_i, and |x| is the sentence length of x;
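As a concrete illustration, the following minimal Python sketch computes a word-level edit distance and the fuzzy match score defined above; whitespace tokenization and normalization by |x| follow the definition given here and are assumptions rather than an exact reproduction of the patented implementation.

def edit_distance(a, b):
    # Word-level Levenshtein distance computed with a single rolling row.
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # delete a[i-1]
                        dp[j - 1] + 1,                  # insert b[j-1]
                        prev + (a[i - 1] != b[j - 1]))  # substitute
            prev = cur
    return dp[n]

def fuzzy_match_score(x, s):
    # FM(x, s) = 1 - ED(x, s) / |x| over whitespace-tokenized sentences.
    xt, st = x.split(), s.split()
    return 1.0 - edit_distance(xt, st) / len(xt)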
Unlike keyword-based matching methods, distributed representation matching retrieves according to the distance between sentence vector representations, a similarity retrieval means that uses semantic information to some extent, and thus provides a retrieval perspective different from keyword matching; distributed representation matching based on cosine similarity is defined as:
sim(x, s_i) = (h_x · h_si) / (||h_x|| ||h_si||)
where h_x and h_si are the vector representations of x and s_i respectively, and ||h_x|| is the norm of vector h_x; to achieve fast computation, the multilingual pre-trained model mBERT is first used to obtain the vector representations of sentences x and s_i, and the faiss tool then performs similarity matching over these representations;
When fuzzy matching retrieves an optimal matching source sentence s_best, distributed representation matching is used to obtain the set s' = {s_1, s_2, …, s_k} of top-k matching results; if s_best ∈ s', the target sentence t_best corresponding to s_best is selected as the prototype sequence; when fuzzy matching fails to retrieve a matching source sentence, or s_best ∉ s', the source sentence s_best is instead retrieved by distributed representation matching;
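For illustration, a hedged sketch of the distributed-representation retrieval path with mBERT and faiss follows. The checkpoint name bert-base-multilingual-cased, mean pooling over the last hidden states, and k = 5 are assumptions; the patent states only that mBERT produces the sentence vectors and that faiss performs the similarity search.

import faiss
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
enc = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(sentences):
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        vecs = enc(**batch).last_hidden_state.mean(dim=1)  # mean pooling (assumed)
    vecs = vecs.numpy().astype("float32")
    faiss.normalize_L2(vecs)  # unit length, so inner product equals cosine similarity
    return vecs

# Source sides s_l of the translation memory (illustrative examples).
memory_sources = ["the weather is nice today", "he bought a new car"]
index = faiss.IndexFlatIP(768)  # 768 = mBERT hidden size
index.add(embed(memory_sources))

def retrieve_top_k(x, k=5):
    # Returns (memory position, cosine similarity) pairs for the top-k matches.
    scores, ids = index.search(embed([x]), k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))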
Step2.2, if no prototype is retrieved by Step2.1, keyword replacement is performed on the input sentence to generate a pseudo prototype, i.e., word-replacement-based pseudo-prototype generation; specifically, the following two replacement strategies are included:
Global replacement: when retrieval fails for the input sentence, words in the input sentence are replaced as extensively as possible using the bilingual dictionary under a maximization principle; the replaced sentence is called a pseudo-prototype sequence;
Keyword replacement: important nouns and entities are extracted from the bilingual dictionary to construct a keyword dictionary; when retrieval fails for the input sentence, keywords in the input sentence are replaced using this dictionary to generate a pseudo-prototype sequence, with the number of replacements capped below a set threshold; the expectation is that the pseudo-prototype sequence, which mixes source-side and important target-side vocabulary, provides guidance for translation generation on the basis of the shared vocabulary.
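The two replacement strategies can be sketched in Python as follows; the dictionaries are plain source-to-target word maps, and the default cap of 3 replacements mirrors the threshold found best in the experiments reported below; both are illustrative assumptions rather than the patented implementation.

def global_replace(sentence, bilingual_dict):
    # Best-effort replacement: swap every word the global dictionary covers.
    return " ".join(bilingual_dict.get(w, w) for w in sentence.split())

def keyword_replace(sentence, keyword_dict, max_subs=3):
    # Replace at most max_subs keywords, yielding a mixed-language pseudo
    # prototype that shares vocabulary with the target side.
    out, n = [], 0
    for w in sentence.split():
        if n < max_subs and w in keyword_dict:
            out.append(keyword_dict[w])
            n += 1
        else:
            out.append(w)
    return " ".join(out)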
As a preferred embodiment of the present invention, Step3 includes:
The encoding end adopts a dual-encoder structure that receives the sentence input and the prototype-sequence input simultaneously and encodes them into corresponding hidden-state representations; the sentence encoder is a standard Transformer encoder formed by stacking multiple layers, each consisting of 2 sublayers: a multi-head self-attention layer and a feed-forward network layer, both using residual connections and layer normalization; given an input sentence x = (x_1, x_2, …, x_m), the sentence encoder encodes it into the corresponding hidden-state sequence h_x = (h_x1, h_x2, …, h_xm), where h_xi is the hidden state corresponding to x_i; the prototype encoder is structurally identical to the sentence encoder; given a prototype sequence t = (t_1, t_2, …, t_n), it encodes it into the corresponding hidden-state sequence h_t = (h_t1, h_t2, …, h_tn), where h_ti is the hidden state corresponding to t_i.
As a preferred embodiment of the present invention, Step3 includes:
The decoding end integrates a gating mechanism, using the self-learning capability of the neural network to optimize the proportion between sentence information and prototype information and to control the information flow during decoding; the improved decoder consists of three sublayers: (1) a self-attention layer; (2) an improved encoder-decoder attention layer; (3) a fully connected feed-forward network layer; the improved encoder-decoder attention layer consists of a sentence encoder-decoder attention module and a prototype encoder-decoder attention module; upon receiving the output s_self of the multi-head self-attention layer at time step i and the output h_x of the sentence encoder, the sentence encoder-decoder attention module performs the attention calculation.
As a preferred embodiment of the present invention, in Step3, the attention calculation performed by the sentence encoder-decoder attention module comprises:
s_x = MultiHeadAtt(s_self, h_x, h_x)
where MultiHeadAtt(·) is the multi-head attention calculation; similarly, the prototype encoder-decoder attention is calculated as:
s_t = MultiHeadAtt(s_self, h_t, h_t)
Subsequently, the sentence encoder-decoder attention output s_x and the prototype encoder-decoder attention output s_t are concatenated to compute the gating variable α:
α = sigmoid(W_α[s_x; s_t] + b_α)
where W_α and b_α are trainable parameters; α is then used to compute the final output of the encoder-decoder attention layer:
s_enc_dec = α * s_x + (1 - α) * s_t
s_enc_dec is then fed as input into the fully connected feed-forward network:
s_ffn = f(s_enc_dec)
where f(x) is defined as f(x) = max(0, xW_1 + b_1)W_2 + b_2, with W_1, W_2, b_1 and b_2 trainable parameters; the final translation y_i at time step i is computed as:
P(y_i | y_<i; x; t, θ) = softmax(σ(s_ffn))
where t is the prototype sequence and σ(·) is a linear transformation function.
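The gated fusion above can be illustrated with a minimal PyTorch sketch. Layer normalization, dropout, residual connections, and attention masks are omitted for brevity; the elementwise form of the gate α, the hyperparameters (d_model = 512, 8 heads, d_ff = 2048), and the class name are assumptions, so this is a sketch of the mechanism rather than the patented implementation.

import torch
import torch.nn as nn

class GatedEncDecAttention(nn.Module):
    # Decoder sublayer sketch: two encoder-decoder attention branches
    # (sentence and prototype) fused by a learned gate, followed by the FFN.
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.att_x = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.att_t = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)  # W_alpha, b_alpha
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, s_self, h_x, h_t):
        s_x, _ = self.att_x(s_self, h_x, h_x)  # sentence enc-dec attention
        s_t, _ = self.att_t(s_self, h_t, h_t)  # prototype enc-dec attention
        alpha = torch.sigmoid(self.gate(torch.cat([s_x, s_t], dim=-1)))
        s_enc_dec = alpha * s_x + (1 - alpha) * s_t
        return self.ffn(s_enc_dec)  # s_ffn; a linear projection + softmax follows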
The beneficial effects of the invention are as follows:
1. The invention obtains prototype sequences by combining a prototype-sequence retrieval method with a word-replacement-based pseudo-prototype generation method, maximizing the number of available prototype sequences in low-resource scenarios while preserving sequence quality;
2. The invention improves the encoder-decoder translation framework: the encoding end uses a dual-encoder structure in which a sentence encoder and a prototype encoder encode the input sentence and the retrieved (or generated) prototype respectively, and the decoding end uses a gating mechanism to control the proportion and flow of information;
3. The loss calculation method of the neural machine translation model is improved. The model weakens the negative influence of low-quality prototype sequences on the translation model while exploiting the semantic information contained in high-quality prototype sequences, ultimately improving translation performance in low-resource scenarios while also yielding better translation fluency.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of the overall structure of the model proposed by the present invention;
FIG. 3 shows the effect of keyword replacement times on model performance.
Detailed Description
Example 1: as shown in Figures 1-3, a low-resource neural machine translation method based on multi-strategy prototype generation comprises the following specific steps:
Step1, corpus preprocessing: preprocess parallel training, validation, and test corpora of different scales for model training, parameter tuning, and performance testing; construct a multilingual global dictionary and a keyword dictionary for pseudo-prototype generation;
Step1.1, model training is performed using the general machine-translation dataset IWSLT15, with English-Vietnamese, English-Chinese, and English-German as translation tasks; for validation and testing, tst2012 is selected as the validation set for parameter tuning and model selection, and tst2013 as the test set for evaluation;
Step1.2, PanLex, Wikipedia, a laboratory-built English-Chinese-Southeast-Asian-languages dictionary, and the Google Translate interface are used to construct an English-Vietnamese-Chinese-German global replacement dictionary;
Step1.3, on the basis of Step1.2, the keyword dictionary is obtained by tagging and screening, with all entities retained during screening; to avoid over-concentration on a few hot nouns, nouns are looked up in the corpus and ranked in inverse order of occurrence frequency.
The preprocessed parallel corpora are divided into two types by scale: small-scale and large-scale parallel corpora. Applying the method of the invention to parallel corpora of different scales makes it possible to observe how growing corpus scale affects information utilization, and to verify the hypothesis that the proposed method suits scenarios where parallel corpus resources are scarce. Table 1 lists the experimental data.
Table 1 Experimental data
Step2, prototype generation: prototype generation is performed with a prototype generation method based on a mixture of multiple strategies to ensure the availability of a prototype sequence; the specific idea of this step is as follows: first, prototype retrieval is performed by combining fuzzy matching and distributed representation matching; if no prototype is retrieved, keywords in the input sentence are replaced by a word replacement operation to obtain a pseudo-prototype sequence;
Step3, constructing a translation model that integrates the prototype sequence: the encoder-decoder structure of the traditional attention-based neural machine translation model is improved to better integrate the prototype sequence, and the corpora of Step1 and Step2 are used as model input to generate the final translation.
As a preferred scheme of the invention, Step2 comprises the following specific steps:
Step2.1, prototype retrieval is performed by combining fuzzy matching and distributed representation matching; the specific implementation is as follows: the translation memory is a set of L parallel sentence pairs {(s_l, t_l): l = 1, …, L}, where s_l is a source sentence and t_l is the corresponding target sentence; for a given input sentence x, keyword matching is first performed against the translation memory; in a low-resource environment, long sentences are scarce in the corpus, so matched fragments are short and can hardly form an effective similarity measure; therefore, instead of N-gram matching, fuzzy matching is used as the keyword matching method, defined as:
FM(x, s_i) = 1 - ED(x, s_i) / |x|
where ED(x, s_i) is the edit distance between x and s_i, and |x| is the sentence length of x;
Unlike keyword-based matching methods, distributed representation matching retrieves according to the distance between sentence vector representations, a similarity retrieval means that uses semantic information to some extent, and thus provides a retrieval perspective different from keyword matching; distributed representation matching based on cosine similarity is defined as:
sim(x, s_i) = (h_x · h_si) / (||h_x|| ||h_si||)
where h_x and h_si are the vector representations of x and s_i respectively, and ||h_x|| is the norm of vector h_x; to achieve fast computation, the multilingual pre-trained model mBERT is first used to obtain the vector representations of sentences x and s_i, and the faiss tool then performs similarity matching over these representations;
When fuzzy matching retrieves an optimal matching source sentence s_best, distributed representation matching is used to obtain the set s' = {s_1, s_2, …, s_k} of top-k matching results; if s_best ∈ s', the target sentence t_best corresponding to s_best is selected as the prototype sequence; when fuzzy matching fails to retrieve a matching source sentence, or s_best ∉ s', the source sentence s_best is instead retrieved by distributed representation matching;
Step2.2, if no prototype is retrieved by Step2.1, keyword replacement is performed on the input sentence to generate a pseudo prototype, i.e., word-replacement-based pseudo-prototype generation; specifically, the following two replacement strategies are included:
Global replacement: when retrieval fails for the input sentence, words in the input sentence are replaced as extensively as possible using the bilingual dictionary under a maximization principle; the replaced sentence is called a pseudo-prototype sequence;
Keyword replacement: important nouns and entities are extracted from the bilingual dictionary to construct a keyword dictionary; when retrieval fails for the input sentence, keywords in the input sentence are replaced using this dictionary to generate a pseudo-prototype sequence, with the number of replacements capped below a set threshold; the expectation is that the pseudo-prototype sequence, which mixes source-side and important target-side vocabulary, provides guidance for translation generation on the basis of the shared vocabulary.
To illustrate the prototype retrieval effect of the invention in low-resource scenarios, the proposed method is compared with baseline prototype retrieval at different data scales. Table 2 shows the improvement in matching quality brought by the hybrid prototype retrieval model.
Table 2 Comparison of fuzzy matching and hybrid prototype retrieval effects
As the results in Table 2 show, in low-resource scenarios the number of prototype sequences retrieved by fuzzy matching alone is clearly insufficient. Hybrid prototype matching, which combines fuzzy matching and distributed representation matching, alleviates this problem to some extent. On the large-scale dataset WMT14, combining distributed representation matching yields better matching results than the fuzzy matching strategy alone. On the small-scale dataset IWSLT15, combining fuzzy matching with distributed representation matching adds semantic-level consideration on top of keyword matching, further improving prototype-sequence quality. The hybrid prototype retrieval method proposed by the invention is therefore well suited to low-resource scenarios.
FIG. 3 illustrates the effect of the number of keyword replacements on model performance for the English-Vietnamese translation task. When generating pseudo prototypes via keyword replacement, a replacement threshold is first set empirically and then tuned on validation-set performance. The evaluation shows that, under the sequential dictionary-traversal strategy, a threshold set too small yields prototype sequences that differ little from the original text and can hardly provide effective guidance for translation, while a threshold set too large degenerates toward global replacement; the model performs best when the number of keyword replacements is set to 3.
As a preferred embodiment of the present invention, Step3 includes:
The encoding end adopts a dual-encoder structure that receives the sentence input and the prototype-sequence input simultaneously and encodes them into corresponding hidden-state representations; the sentence encoder is a standard Transformer encoder formed by stacking multiple layers, each consisting of 2 sublayers: a multi-head self-attention layer and a feed-forward network layer, both using residual connections and layer normalization; given an input sentence x = (x_1, x_2, …, x_m), the sentence encoder encodes it into the corresponding hidden-state sequence h_x = (h_x1, h_x2, …, h_xm), where h_xi is the hidden state corresponding to x_i; the prototype encoder is structurally identical to the sentence encoder; given a prototype sequence t = (t_1, t_2, …, t_n), it encodes it into the corresponding hidden-state sequence h_t = (h_t1, h_t2, …, h_tn), where h_ti is the hidden state corresponding to t_i.
As a preferred embodiment of the present invention, Step3 includes:
The decoding end integrates a gating mechanism, using the self-learning capability of the neural network to optimize the proportion between sentence information and prototype information and to control the information flow during decoding; the improved decoder consists of three sublayers: (1) a self-attention layer; (2) an improved encoder-decoder attention layer; (3) a fully connected feed-forward network layer; the improved encoder-decoder attention layer consists of a sentence encoder-decoder attention module and a prototype encoder-decoder attention module; upon receiving the output s_self of the multi-head self-attention layer at time step i and the output h_x of the sentence encoder, the sentence encoder-decoder attention module performs the attention calculation.
As a preferred embodiment of the present invention, in Step3, the attention calculation performed by the sentence encoder-decoder attention module comprises:
s_x = MultiHeadAtt(s_self, h_x, h_x)
where MultiHeadAtt(·) is the multi-head attention calculation; similarly, the prototype encoder-decoder attention is calculated as:
s_t = MultiHeadAtt(s_self, h_t, h_t)
Subsequently, the sentence encoder-decoder attention output s_x and the prototype encoder-decoder attention output s_t are concatenated to compute the gating variable α:
α = sigmoid(W_α[s_x; s_t] + b_α)
where W_α and b_α are trainable parameters; α is then used to compute the final output of the encoder-decoder attention layer:
s_enc_dec = α * s_x + (1 - α) * s_t
s_enc_dec is then fed as input into the fully connected feed-forward network:
s_ffn = f(s_enc_dec)
where f(x) is defined as f(x) = max(0, xW_1 + b_1)W_2 + b_2, with W_1, W_2, b_1 and b_2 trainable parameters; the final translation y_i at time step i is computed as:
P(y_i | y_<i; x; t, θ) = softmax(σ(s_ffn))
where t is the prototype sequence and σ(·) is a linear transformation function.
To illustrate the translation effect of the invention, the translations generated by the invention are compared against a baseline system; Tables 3 and 4 show the results on the small-scale corpus.
Table 3 BLEU evaluation results (%)
Table 4 RIBES evaluation results (%)
As the results in Tables 3 and 4 show, the proposed method learns prototype-sequence representations through an additional prototype encoder at the encoding end, while the decoding end effectively exploits the semantic information contained in high-quality prototype sequences and reduces the noise introduced by low-quality ones, effectively improving translation quality and fluency in low-resource scenarios.
While the present invention has been described in detail with reference to the drawings, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (5)

1. A low-resource neural machine translation method based on multi-strategy prototype generation, characterized by comprising the following specific steps:
Step1, corpus preprocessing: preprocess parallel training, validation, and test corpora of different scales for model training, parameter tuning, and performance testing; construct a multilingual global dictionary and a keyword dictionary for pseudo-prototype generation;
Step2, prototype generation: prototype generation is performed with a prototype generation method based on a mixture of multiple strategies to ensure the availability of a prototype sequence; the specific idea of this step is as follows: first, prototype retrieval is performed by combining fuzzy matching and distributed representation matching; if no prototype is retrieved, keywords in the input sentence are replaced by a word replacement operation to obtain a pseudo-prototype sequence;
Step3, constructing a translation model that integrates the prototype sequence: the encoder-decoder structure of the traditional attention-based neural machine translation model is improved to better integrate the prototype sequence, and the corpora of Step1 and Step2 are used as model input to generate the final translation;
the Step2 specifically comprises the following steps:
Step2.1, prototype retrieval is performed by combining fuzzy matching and distributed representation matching; the specific implementation is as follows: the translation memory is a set of L parallel sentence pairs {(s_l, t_l): l = 1, …, L}, where s_l is a source sentence and t_l is the corresponding target sentence; for a given input sentence x, keyword matching is first performed against the translation memory; fuzzy matching is adopted as the keyword matching method, defined as:
FM(x, s_i) = 1 - ED(x, s_i) / |x|
where ED(x, s_i) is the edit distance between x and s_i, and |x| is the sentence length of x;
Unlike keyword-based matching methods, distributed representation matching retrieves according to the distance between sentence vector representations, a similarity retrieval means that uses semantic information to some extent, and thus provides a retrieval perspective different from keyword matching; distributed representation matching based on cosine similarity is defined as:
sim(x, s_i) = (h_x · h_si) / (||h_x|| ||h_si||)
where h_x and h_si are the vector representations of x and s_i respectively, and ||h_x|| is the norm of vector h_x; to achieve fast computation, the multilingual pre-trained model mBERT is first used to obtain the vector representations of sentences x and s_i, and the faiss tool then performs similarity matching over these representations;
When fuzzy matching retrieves an optimal matching source sentence s_best, distributed representation matching is used to obtain the set s' = {s_1, s_2, …, s_k} of top-k matching results; if s_best ∈ s', the target sentence t_best corresponding to s_best is selected as the prototype sequence; when fuzzy matching fails to retrieve a matching source sentence, or s_best ∉ s', the source sentence s_best is instead retrieved by distributed representation matching;
Step2.2, if no prototype is retrieved by Step2.1, keyword replacement is performed on the input sentence to generate a pseudo prototype, i.e., word-replacement-based pseudo-prototype generation; specifically, the following two replacement strategies are included:
Global replacement: when retrieval fails for the input sentence, words in the input sentence are replaced as extensively as possible using the bilingual dictionary under a maximization principle; the replaced sentence is called a pseudo-prototype sequence;
Keyword replacement: important nouns and entities are extracted from the bilingual dictionary to construct a keyword dictionary; when retrieval fails for the input sentence, keywords in the input sentence are replaced using this dictionary to generate a pseudo-prototype sequence, with the number of replacements capped below a set threshold; the expectation is that the pseudo-prototype sequence, which mixes source-side and important target-side vocabulary, provides guidance for translation generation on the basis of the shared vocabulary.
2. The low-resource neural machine translation method based on multi-strategy prototype generation according to claim 1, wherein Step1 specifically comprises the following steps:
Step1.1, model training is performed using the general machine-translation dataset IWSLT15, with English-Vietnamese, English-Chinese, and English-German as translation tasks; for validation and testing, tst2012 is selected as the validation set for parameter tuning and model selection, and tst2013 as the test set for evaluation;
Step1.2, PanLex, Wikipedia, a laboratory-built English-Chinese-Southeast-Asian-languages dictionary, and the Google Translate interface are used to construct an English-Vietnamese-Chinese-German global replacement dictionary;
Step1.3, on the basis of Step1.2, the keyword dictionary is obtained by tagging and screening, with all entities retained during screening; to avoid over-concentration on a few hot nouns, nouns are looked up in the corpus and ranked in inverse order of occurrence frequency.
3. The low-resource neural machine translation method based on multi-strategy prototype generation according to claim 1, wherein Step3 includes:
Step3.1, the encoding end adopts a dual-encoder structure that receives the sentence input and the prototype-sequence input simultaneously and encodes them into corresponding hidden-state representations; the sentence encoder is a standard Transformer encoder formed by stacking multiple layers, each consisting of 2 sublayers: a multi-head self-attention layer and a feed-forward network layer, both using residual connections and layer normalization; given an input sentence x = (x_1, x_2, …, x_m), the sentence encoder encodes it into the corresponding hidden-state sequence h_x = (h_x1, h_x2, …, h_xm), where h_xi is the hidden state corresponding to x_i; the prototype encoder is structurally identical to the sentence encoder; given a prototype sequence t = (t_1, t_2, …, t_n), it encodes it into the corresponding hidden-state sequence h_t = (h_t1, h_t2, …, h_tn), where h_ti is the hidden state corresponding to t_i.
4. The low-resource neural machine translation method based on multi-strategy prototype generation according to claim 1, wherein Step3 includes:
The decoding end integrates a gating mechanism, using the self-learning capability of the neural network to optimize the proportion between sentence information and prototype information and to control the information flow during decoding; the improved decoder consists of three sublayers: (1) a self-attention layer; (2) an improved encoder-decoder attention layer; (3) a fully connected feed-forward network layer; the improved encoder-decoder attention layer consists of a sentence encoder-decoder attention module and a prototype encoder-decoder attention module; upon receiving the output s_self of the multi-head self-attention layer at time step i and the output h_x of the sentence encoder, the sentence encoder-decoder attention module performs the attention calculation.
5. The low-resource neural machine translation method based on multi-strategy prototype generation according to claim 4, wherein in Step3 the attention calculation performed by the sentence encoder-decoder attention module comprises:
s_x = MultiHeadAtt(s_self, h_x, h_x)
where MultiHeadAtt(·) is the multi-head attention calculation; similarly, the prototype encoder-decoder attention is calculated as:
s_t = MultiHeadAtt(s_self, h_t, h_t)
Subsequently, the sentence encoder-decoder attention output s_x and the prototype encoder-decoder attention output s_t are concatenated to compute the gating variable α:
α = sigmoid(W_α[s_x; s_t] + b_α)
where W_α and b_α are trainable parameters; α is then used to compute the final output of the encoder-decoder attention layer:
s_enc_dec = α * s_x + (1 - α) * s_t
s_enc_dec is then fed as input into the fully connected feed-forward network:
s_ffn = f(s_enc_dec)
where f(x) is defined as f(x) = max(0, xW_1 + b_1)W_2 + b_2, with W_1, W_2, b_1 and b_2 trainable parameters; the final translation y_i at time step i is computed as:
P(y_i | y_<i; x; t, θ) = softmax(σ(s_ffn))
where t is the prototype sequence and σ(·) is a linear transformation function.
CN202210293213.2A 2022-03-24 2022-03-24 Low-resource neural machine translation method based on multi-strategy prototype generation Active CN114676708B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210293213.2A CN114676708B (en) 2022-03-24 2022-03-24 Low-resource neural machine translation method based on multi-strategy prototype generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210293213.2A CN114676708B (en) 2022-03-24 2022-03-24 Low-resource neural machine translation method based on multi-strategy prototype generation

Publications (2)

Publication Number Publication Date
CN114676708A CN114676708A (en) 2022-06-28
CN114676708B true CN114676708B (en) 2024-04-23

Family

ID=82073905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210293213.2A Active CN114676708B (en) 2022-03-24 2022-03-24 Low-resource neural machine translation method based on multi-strategy prototype generation

Country Status (1)

Country Link
CN (1) CN114676708B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117993396A * 2024-01-23 2024-05-07 Harbin Institute of Technology RAG-based large model machine translation method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10346548B1 (en) * 2016-09-26 2019-07-09 Lilt, Inc. Apparatus and method for prefix-constrained decoding in a neural machine translation system
CN110059323A (en) * 2019-04-22 2019-07-26 苏州大学 Based on the multi-field neural machine translation method from attention mechanism
CN110489766A (en) * 2019-07-25 2019-11-22 昆明理工大学 The Chinese-weighed based on coding conclusion-decoding gets over low-resource nerve machine translation method
CN112507734A (en) * 2020-11-19 2021-03-16 南京大学 Roman Uygur language-based neural machine translation system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10346548B1 (en) * 2016-09-26 2019-07-09 Lilt, Inc. Apparatus and method for prefix-constrained decoding in a neural machine translation system
CN110059323A (en) * 2019-04-22 2019-07-26 苏州大学 Based on the multi-field neural machine translation method from attention mechanism
CN110489766A (en) * 2019-07-25 2019-11-22 昆明理工大学 The Chinese-weighed based on coding conclusion-decoding gets over low-resource nerve machine translation method
CN112507734A (en) * 2020-11-19 2021-03-16 南京大学 Roman Uygur language-based neural machine translation system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Efficient low-resource neural machine translation with reread and feedback mechanism; Yu Zhiqiang et al.; ACM Transactions on Asian and Low-Resource Language Information Processing; 2020-01-09; Vol. 19, No. 3; 1-13 *
Low-resource neural machine translation based on multi-strategy prototype generation (基于多策略原型生成的低资源神经机器翻译); Yu Zhiqiang et al.; Journal of Software (软件学报); 2023-04-28; Vol. 34, No. 11; 5113-5125 *

Also Published As

Publication number Publication date
CN114676708A (en) 2022-06-28

Similar Documents

Publication Publication Date Title
KR102577514B1 (en) Method, apparatus for text generation, device and storage medium
Liu et al. A recursive recurrent neural network for statistical machine translation
Li et al. Text compression-aided transformer encoding
CN111160050A (en) Chapter-level neural machine translation method based on context memory network
CN110619043A (en) Automatic text abstract generation method based on dynamic word vector
CN113901847A (en) Neural machine translation method based on source language syntax enhanced decoding
CN114676708B (en) Low-resource neural machine translation method based on multi-strategy prototype generation
Heo et al. Multimodal neural machine translation with weakly labeled images
CN115114940A (en) Machine translation style migration method and system based on curriculum pre-training
CN113657125B (en) Mongolian non-autoregressive machine translation method based on knowledge graph
CN114648024A (en) Chinese cross-language abstract generation method based on multi-type word information guidance
CN112380882B (en) Mongolian Chinese neural machine translation method with error correction function
CN115017924B (en) Construction of neural machine translation model for cross-language translation and translation method thereof
CN114707523B (en) Image-multilingual subtitle conversion method based on interactive converter
Zhang et al. Guidance module network for video captioning
CN112347753B (en) Abstract generation method and system applied to reading robot
Chang et al. Improving language translation using the hidden Markov model
Wu A chinese-english machine translation model based on deep neural network
CN113157855A (en) Text summarization method and system fusing semantic and context information
Zhu Exploration on Korean-Chinese collaborative translation method based on recursive recurrent neural network
Maqsood Evaluating newsQA dataset with ALBERT
CN114611487B (en) Unsupervised Thai dependency syntax analysis method based on dynamic word embedding alignment
Yu et al. Semantic extraction for sentence representation via reinforcement learning
Getachew et al. Gex'ez-English Bi-Directional Neural Machine Translation Using Transformer
Song et al. RUC_AIM3 at TRECVID 2019: Video to Text.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant