CN111274826B - Semantic information fusion-based low-frequency word translation method


Info

Publication number
CN111274826B
Authority
CN
China
Prior art keywords
low-frequency words
vector
vector representation
Prior art date
Legal status
Active
Application number
CN202010060672.7A
Other languages
Chinese (zh)
Other versions
CN111274826A (en)
Inventor
张学强
董晓飞
曹峰
石霖
孙明俊
Current Assignee
Nanjing New Generation Artificial Intelligence Research Institute Co ltd
Original Assignee
Nanjing New Generation Artificial Intelligence Research Institute Co ltd
Priority date
Filing date
Publication date
Application filed by Nanjing New Generation Artificial Intelligence Research Institute Co ltd
Priority to CN202010060672.7A
Publication of CN111274826A
Application granted
Publication of CN111274826B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks


Abstract

The invention provides a low-frequency word translation method based on semantic information fusion, belonging to the field of machine translation. A bilingual sentence pair, consisting of a source-language sentence x and its corresponding target-language sentence y, is input into a translation system; the subword sequences of the low-frequency words in the source-language sentence and the corresponding target translations in the target-language sentence are obtained; the low-frequency words in the bilingual pair (x, y) are replaced with wildcards UNKi to obtain a new bilingual pair $(\hat{x}, \hat{y})$; and the vector representations of the source-language and/or target-language low-frequency words are fused with the vector representation of the wildcard UNKi. Closely following the core idea of semantic fusion, the invention provides three concrete fusion forms, based on the source-language low-frequency word vector representation, the target-language low-frequency word vector representation, and the low-frequency word vector representations at both ends, and makes full use of the low-frequency word's vectors in two languages and two vector spaces to represent its semantic information.

Description

Semantic information fusion-based low-frequency word translation method
Technical Field
The invention relates to the field of machine translation, in particular to the low-frequency word translation task in neural machine translation systems. The semantic vector representations of low-frequency words at the source end and the target end are fully utilized during model training and decoding, improving the translation quality of the low-frequency words and even of the whole sentence.
Background
Low-frequency words are words that occur rarely or never in a large-scale bilingual parallel corpus. Depending on their frequency, they are also commonly called unknown words or out-of-vocabulary (OOV) words in natural language processing. Because of their frequency sparsity, uniqueness of translation and other characteristics, low-frequency word translation has always been a key difficulty in machine translation research. In particular, in today's mainstream neural machine translation, where the vocabulary is limited and the modeling process depends on vector representations, the low-frequency word translation problem is receiving increasing attention from academia and industry.
With the further development of globalization, machine translation has become an important research topic for communication between speakers of different languages. The quality of low-frequency word translation directly affects whether machine translation technology and applications can successfully reach practical and industrial use. Traditional low-frequency word processing methods fall into two categories. First, subword segmentation methods represented by Byte Pair Encoding (BPE) reduce the number of modeling units by further segmenting words into subwords. Second, low-frequency words are converted into wildcards, which are replaced with the target low-frequency translations after decoding to form the final complete translation. For the second method, however, directly replacing low-frequency words with wildcards leaves the sentence's semantic information incomplete, which in turn harms the completeness and fluency of the translations generated by decoding. The invention provides a low-frequency word translation method based on semantic fusion that fuses the vector representations of the low-frequency word and the wildcard during training and decoding of the neural machine translation model, so that the semantic information of both is explicitly considered at the same time, effectively improving the accuracy and fluency of translations.
Throughout the development of machine translation, from rule-based to statistics-based to deep-learning-based systems, low-frequency word translation has always been a problem demanding urgent solution. As described above, low-frequency word processing has derived two broad categories. The first, from the segmentation angle, generates subword units of smaller granularity by counting the occurrence frequencies of subwords in a large-scale corpus; the typical method of this category is Byte Pair Encoding (BPE). The second, from the replacement angle, expresses nouns or noun phrases in a sentence with wildcards before translation and, in post-editing after translation, replaces these special marks with the target low-frequency words; the typical method of this category is wildcard replacement translation.
The subword-based low-frequency word translation methods are as follows:
Based on a counting model, neural machine translation selects the N most frequent words, subwords or characters as modeling units under the premise of a limited vocabulary size; the remaining words or phrases are expressed as combinations of these modeling units. There are two typical methods:
the method comprises the following steps: word model modeling
The word model is a model using a word as a modeling unit. In natural language, the more the upper level unit has rich and various expression forms, and the more the lower level unit has a relatively single form. Like the line, surface and surface in mathematics, the characters in natural language constitute words, words constitute phrases and phrases constitute sentences. Statistically, although the total number of Chinese characters exceeds 8 ten thousand, the number of commonly used Chinese characters is only about 3500, and it is enough to combine thousands of words or phrases. Therefore, this method is often used in the field of machine translation where the number of modeling units is severely limited. In the end-to-end neural machine translation, the effect is better than that of a modeling mode taking a word as a unit on the whole, and the method is widely applied once.
Method two: byte pair encoding
Byte pair encoding is a data compression method proposed by Gage in 1994; its idea is to recursively use a single, unused byte to represent the most frequently co-occurring byte pair in a sequence. Applied analogously to Chinese subword segmentation, the first N character pairs with the highest co-occurrence frequency in Chinese sentences are taken as modeling units. For example, for the word "机器人" (robot), the co-occurrence frequency of "机" and "器" is typically high in a large-scale corpus, while the frequency with which all three characters "机", "器" and "人" co-occur may be relatively low. In this case the byte pair encoding method segments "机器人" into the subwords "机器" and "人" as two different modeling units. In end-to-end neural machine translation, this joint character-word modeling generally performs better than modeling with single characters or whole words alone.
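To make the merge step concrete, the following minimal Python sketch (a toy illustration, not code from the patent) counts frequency-weighted co-occurrences of adjacent symbols and applies one merge, reproducing the segmentation of 机器人 into 机器 and 人:

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Rewrite every word, replacing `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus keyed by character tuples with corpus frequencies.
corpus = {("机", "器"): 9, ("机", "器", "人"): 2}
best = most_frequent_pair(corpus)      # ("机", "器") co-occurs most often
corpus = merge_pair(corpus, best)      # "机器人" becomes ("机器", "人")
```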
The replacement-based low-frequency word translation methods are as follows:
the method comprises the following steps: word in set replacement
The core idea of the intra-word replacement method is that the intra-word with the highest frequency and most similar to the low-frequency words in the large-scale corpus is adopted to replace the low-frequency words. According to the realization principle of the current mainstream neural machine translation method, a word list with fixed dimensionality needs to be generated in advance, and the method usually adopted is to count all M words appearing in large-scale linguistic data
Figure GDA0002445555890000021
Frequency of (2)
Figure GDA0002445555890000022
The first N words in descending order are selected according to the word frequency to form a word list (W)N. At this point we will include the words in the vocabulary
Figure GDA0002445555890000031
Called words in the set, correspondingly takes the rest M-N words
Figure GDA0002445555890000032
Called an out-of-set word. The general method of the intra-set word replacement method is to match an intra-set word with the most similar semanteme for each out-set word by calculating the vector distance between word vectors. In the model training and decoding process, all the out-of-set words which are difficult to process are converted into the in-set words, and the target translation of the out-of-set words is only converted back into the translation after decoding, so that the aim of solving the translation of the low-frequency words is fulfilled.
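A minimal sketch of the vocabulary construction and in-set replacement just described; the nearest-neighbor search over word vectors is abstracted into a precomputed `similar` mapping, which is an assumption made here for brevity:

```python
from collections import Counter

def build_vocab(corpus_tokens, n):
    """Select the N most frequent of the M corpus words as in-set words."""
    freq = Counter(corpus_tokens)
    return {w for w, _ in freq.most_common(n)}

def replace_out_of_set(sentence, vocab, similar):
    """Replace each out-of-set word with its most similar in-set word
    (in a real system, found by word-vector distance)."""
    return [w if w in vocab else similar.get(w, "<unk>") for w in sentence]

vocab = build_vocab(["the", "cat", "sat", "the", "the", "cat"], n=2)
print(replace_out_of_set(["the", "cat", "sat"], vocab, {"sat": "cat"}))
# ['the', 'cat', 'cat']
```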
Method two: low-frequency word category replacement
The advantage of method one is that a semantically similar in-set replacement word preserves the meaning of the source sentence as much as possible. Its drawback is that, in attention-based neural machine translation with soft alignment between the source and target sentences, the position of the replacement word in the translation is hard to identify, so the target translation of the out-of-set word is hard to substitute back. One way to solve this problem is to replace each out-of-set word with a wildcard for its category. For example, person names in a bilingual sentence pair are typically replaced with the wildcard "$_person", and place and organization names with "$_location" and "$_organization" respectively. Finally, these category symbols are replaced with the target translations of the low-frequency person, place and organization names to complete the translation process. The advantage of this method is that the special wildcards survive unchanged in the target translation, making the final substitution easy. Its drawbacks are sensitivity to the low-frequency word's category and a tendency toward disorder in post-editing when a sentence contains several low-frequency words of the same category.
Method three: UNKi replacement
To alleviate the problems of method two, the UNKi replacement method was proposed. Its replacement principle is not to recognize the low-frequency word's category but to uniformly replace the low-frequency words in a sentence with the wildcards UNKi (i = 1, 2, 3, ...). This not only avoids the inconsistency between a low-frequency word and its context caused by category recognition errors, but also solves the problem of substituting the low-frequency words back in order during translation.
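A minimal sketch of UNKi replacement, assuming the low-frequency spans have already been located by word-frequency statistics; this reproduces the numbered-wildcard form used throughout the description:

```python
def replace_with_unki(tokens, low_freq_spans):
    """Replace each low-frequency span with a numbered wildcard UNK1, UNK2, ...
    `low_freq_spans` is a list of (start, end) index pairs, left to right."""
    out, i, k = [], 0, 0
    for start, end in low_freq_spans:
        out.extend(tokens[i:start])
        k += 1
        out.append(f"UNK{k}")
        i = end
    out.extend(tokens[i:])
    return out

# x = (x1, ..., x9): spans (1, 3) and (5, 8) become UNK1 and UNK2
x = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"]
print(replace_with_unki(x, [(1, 3), (5, 8)]))
# ['x1', 'UNK1', 'x4', 'x5', 'UNK2', 'x9']
```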
In addition, some low-frequency word processing methods jointly use subword segmentation and a replacement mechanism: on top of subword segmentation, the lower-frequency subwords are further replaced, yielding better translation performance. On the basis of jointly adopting subword segmentation and UNKi replacement, the invention innovatively proposes fusing the vector representations of the low-frequency word and the UNKi wildcard, so as to effectively improve the translation of the low-frequency words and even of the whole sentence.
The prior art has the following disadvantages:
In a machine translation system, especially an end-to-end neural machine translation system, segmenting subwords with a statistical method or handling low-frequency words with a replacement method is a feasible approach. However, once a low-frequency word has been replaced under the existing replacement schemes, the influence of the low-frequency word's information on the semantic encoding of the wildcard's context is no longer considered. The low-frequency word may be translated, yet it connects poorly with its context in the translation, i.e., the fluency of the translation degrades.
Disclosure of Invention
The invention provides a novel solution for low-frequency words, targeting the wildcard (UNKi) replacement method. During the encoding of the source-language sentence, the vector representations of the source-language and/or target-language low-frequency words are fused with the vector representation of the wildcard UNKi, improving the translation of the low-frequency words and their context. In addition, considering that current neural machine translation systems commonly combine subword segmentation with a replacement mechanism to address low-frequency word translation, a low-frequency word is generally segmented into a subword sequence. The invention therefore also provides a low-frequency word vector encoding method based on a long short-term memory (LSTM) network, to obtain complete semantic vector representations of the several subwords contained in the source-language and target-language low-frequency words. Finally, the vector representations of the source-end and target-end low-frequency words are fused with the wildcard UNKi to improve the translation of the low-frequency words and even of the full text.
In order to achieve this purpose, the technical scheme adopted by the invention is as follows: a low-frequency word translation method based on semantic information fusion, characterized by: inputting a bilingual sentence pair, consisting of a source-language sentence x and its corresponding target-language sentence y, into the translation system; obtaining the subword sequences of the low-frequency words in the source-language sentence; obtaining the target translations corresponding to the low-frequency words in the target-language sentence; replacing the low-frequency words in the bilingual sentence pair (x, y) with wildcards UNKi to obtain a new bilingual sentence pair $(\hat{x}, \hat{y})$; and fusing the vector representations of the source-language and/or target-language low-frequency words with the vector representation of the wildcard UNKi.
Further, the subword sequences of the low-frequency words to be processed in the source-language sentence are obtained through word-frequency statistics. Bilingual low-frequency words are difficult to model because their internal statistical regularities are relatively vague and they occur very rarely in the training corpus. The invention therefore translates low-frequency words preferentially by dictionary lookup and, if the lookup fails, by a low-frequency word translation model.

Dictionary lookup finds the translation corresponding to a word by consulting a dictionary during translation. Its advantage is that a vocabulary of low-frequency words (entities, terms, proper nouns and the like) can be constructed in advance for the specific application scenario and domain of the machine translation system; as long as the low-frequency word to be looked up hits the vocabulary, the returned translation is guaranteed to be completely correct and appropriate to the situation. Its drawback is that the hit rate depends on the scale of the constructed vocabulary and is hard to guarantee.

The translation model supplements dictionary lookup; its advantage is that it outputs the highest-probability target translation for any given low-frequency word. The invention preferentially adopts a character model to cope with the shortness and low frequency of low-frequency words. The character model takes the character, rather than the word, as the processing unit during translation.
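The lookup-first, model-fallback strategy can be sketched as follows; `lexicon` and `char_model` are hypothetical stand-ins for the prebuilt low-frequency word list and the trained character-level translation model:

```python
def translate_low_freq(word, lexicon, char_model):
    """Dictionary lookup first; fall back to a character-level model."""
    hit = lexicon.get(word)
    if hit is not None:
        return hit                        # a vocabulary hit is always correct
    chars = list(word)                    # characters, not words, as units
    return char_model.translate(chars)    # highest-probability translation
```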
Further, the translation method that fuses the vector representation of the source-language low-frequency word with the vector representation of the wildcard UNKi comprises the following steps:

Step 1: take out the replaced source-language sentence $\hat{x}$ and the vector representations $\hat{v}$ of all its words.

Step 2: take out the vector representations of the low-frequency word's subwords in the source-language sentence and encode them with an LSTM to obtain the vector representation $s_{unk_i}$.

Step 3: apply a weighted sum and a nonlinear transformation to the low-frequency word's wildcard vector representation and the source-language subword-sequence vector representation to obtain the low-frequency word's final vector representation $V_i$:

$$V_i = \tanh(W_{u,unk_i} \cdot u_{unk_i} + W_{s,unk_i} \cdot s_{unk_i})$$

where tanh is the hyperbolic tangent function, $u_{unk_i}$ is the wildcard's vector representation, $W_{u,unk_i}$ is the weight assigned to $u_{unk_i}$ according to the specific contextual semantic environment, and $W_{s,unk_i}$ is the weight assigned to the LSTM-encoded $s_{unk_i}$.
The translation method that fuses the vector representation of the target-language low-frequency word with the vector representation of the wildcard UNKi comprises the following steps:

Step 1: take out the replaced source-language sentence $\hat{x}$ and the vector representations $\hat{v}$ of all its words.

Step 2: take out the vector representations of the low-frequency word's subwords in the target-language sentence and encode them with an LSTM to obtain the vector representation $t_{unk_j}$.

Step 3: apply a weighted sum and a nonlinear transformation to the low-frequency word's wildcard vector representation and the target-language subword-sequence vector representation to obtain the low-frequency word's final vector representation $V_j$:

$$V_j = \tanh(W_{u,unk_j} \cdot u_{unk_j} + W_{t,unk_j} \cdot t_{unk_j})$$

where tanh is the hyperbolic tangent function, $u_{unk_j}$ is the wildcard's vector representation, $W_{u,unk_j}$ is the weight assigned to $u_{unk_j}$ according to the specific contextual semantic environment, and $W_{t,unk_j}$ is the weight assigned to the LSTM-encoded $t_{unk_j}$.
The translation method that fuses the vector representations of both the source-language and target-language low-frequency words with the vector representation of the wildcard UNKi comprises the following steps:

Step 1: take out the replaced source-language sentence $\hat{x}$ and the vector representations $\hat{v}$ of all its words.

Step 2: take out the vector representations of the low-frequency word's subwords in the source-language and target-language sentences and encode them with LSTMs to obtain the vector representations $s_{unk_m}$ and $t_{unk_m}$ respectively.

Step 3: apply a weighted sum and a nonlinear transformation to the low-frequency word's wildcard vector representation, the source-language subword-sequence vector representation and the target-language subword-sequence vector representation to obtain the low-frequency word's final vector representation $V_m$:

$$V_m = \tanh(W_{u,unk_m} \cdot u_{unk_m} + W_{s,unk_m} \cdot s_{unk_m} + W_{t,unk_m} \cdot t_{unk_m})$$

where tanh is the hyperbolic tangent function, $u_{unk_m}$ is the wildcard's vector representation, $W_{u,unk_m}$ is the weight assigned to $u_{unk_m}$ according to the specific contextual semantic environment, $W_{s,unk_m}$ is the weight assigned to $s_{unk_m}$, and $W_{t,unk_m}$ is the weight assigned to $t_{unk_m}$.
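For concreteness, the three fusion forms can be written as one module; the following is a PyTorch sketch under assumed names and dimensions, not the patent's reference implementation. Omitting the source or target term recovers the single-end variants $V_i$ and $V_j$:

```python
import torch
import torch.nn as nn

class SemanticFusion(nn.Module):
    """V = tanh(W_u*u + W_s*s + W_t*t): fuse the wildcard vector u with the
    LSTM-encoded source subword vector s and/or target subword vector t."""
    def __init__(self, dim):
        super().__init__()
        self.W_u = nn.Linear(dim, dim, bias=False)
        self.W_s = nn.Linear(dim, dim, bias=False)
        self.W_t = nn.Linear(dim, dim, bias=False)

    def forward(self, u, s=None, t=None):
        z = self.W_u(u)
        if s is not None:          # source-end term
            z = z + self.W_s(s)
        if t is not None:          # target-end term
            z = z + self.W_t(t)
        return torch.tanh(z)

fusion = SemanticFusion(dim=512)
u, s, t = torch.randn(512), torch.randn(512), torch.randn(512)
V_m = fusion(u, s, t)              # both-end fusion
V_i = fusion(u, s=s)               # source-only fusion
```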
In the specific implementation of fusing the low-frequency word vectors of the source end and the target end, the invention offers three main advantages:

First, regarding the sources of the low-frequency word vectors to be fused, the vector representations of the low-frequency word's semantic information in both the source-language and target-language vector spaces are used simultaneously.

Second, regarding the encoding of a low-frequency word's multiple subword vectors, a long short-term memory (LSTM) network scans the subword sequence in reverse, so that the first subword contributes the most important information to the final vector representation.

Third, regarding the semantic fusion of the low-frequency word, the hyperbolic tangent tanh is adopted as the activation function to fuse the three vectors, which amounts to a nonlinear transformation after a weighted summation of them.
Compared with the prior art, the low-frequency word processing method based on semantic fusion provided by the invention has the following advantages:
1. The method fuses the vector representations of low-frequency words and wildcards in the Encoder module of end-to-end neural machine translation, so that more accurate and complete information is encoded into the intermediate vector, helping the Attention and Decoder modules generate higher-quality translations.
2. Besides solving low-frequency word translation, the method also helps improve the translation of common named entities, complex noun phrases and technical terms in natural-language sentences.
3. Closely following the core idea of semantic fusion, the invention provides three concrete fusion forms, based on the source-language low-frequency word vector representation, the target-language low-frequency word vector representation, and the low-frequency word vector representations at both ends, making full use of the low-frequency word's vectors in two languages and two vector spaces to represent its semantic information.
4. Through subword segmentation and re-encoding, the proposed semantic fusion method obtains good subword vector representations by model learning, on the premise that the subwords within a low-frequency word retain relatively high frequency, and then obtains the low-frequency word's complete vector representation by LSTM-encoding the subword vectors.
5. The proposed semantic fusion method encodes the low-frequency word's subword sequence in reverse with an LSTM to obtain its complete vector representation, mitigating the inaccurate vector representations caused by the sparsity of low-frequency words. The semantic-fusion-based low-frequency word translation method is therefore applicable to neural machine translation with whole words, subwords or characters as modeling units.
Drawings
FIG. 1 is a diagram of a neural network translation model based on RNN and Attention in this embodiment.
Fig. 2 is a schematic diagram of a semantic fusion process of a vector representation of a source language low-frequency word and a vector representation of a wildcard UNKi in this embodiment.
Fig. 3 is a schematic diagram of a semantic fusion process of the vector representation of the low-frequency word in the target language and the vector representation of the wildcard UNKi in this embodiment.
Fig. 4 is a schematic diagram of the hyperbolic tangent function tanh in this embodiment.
Fig. 5 is a schematic diagram of a semantic fusion process of the vector representations of the source language low-frequency words and the target language low-frequency words and the vector representation of the wildcard UNKi in this embodiment.
Detailed Description
In order to help those skilled in the art better understand the technical solution of the invention, the flow of a neural machine translation system is first described using a translation system based on a recurrent neural network (RNN) and an attention mechanism (Attention) as an example; the same framework is then used to describe how to effectively fuse the vector representations of the source-language and target-language low-frequency words with the vector representation of the wildcard UNKi. It should be noted that the invention also extends to other neural network translation systems, such as translation systems based on convolutional neural networks (CNN) or entirely on attention.
Description of the RNN- and Attention-based translation system:

As shown in FIG. 1, the RNN- and Attention-based neural network translation model takes as input a source-language sentence $x = (x_1, x_2, x_3, \dots, x_m)$ to be translated and outputs a target-language sentence $y = (y_1, y_2, y_3, \dots, y_n)$, where the source and target sentences have lengths m and n respectively. The translation framework comprises three modules: a bidirectional-RNN-based Encoder, an Attention module and an RNN-based Decoder. The flow of each module is described below.

Encoder module flow:

This module computes the representation of each word of the input source sentence in its sentence context. Given a source-language sentence $x = (x_1, x_2, x_3, \dots, x_T)$, pre-trained or randomly initialized word vectors are first loaded, and the vector representation $v_i$ of each word $x_i$ is obtained by word-vector table lookup. From these word vectors, a forward recurrent neural network computes a representation $f_i$ in which each word sees the historical vocabulary information, and a backward recurrent neural network computes a representation $b_i$ in which each word sees the future vocabulary information. Finally the two are concatenated as $h_i = [f_i : b_i]$, the representation of each word within the sentence. The recurrent network here can be a plain RNN or one of its gated variants, GRU or LSTM. Because the computation of each representation uses both forward historical information and backward future information, it captures each word's information in its sentence context well.
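A minimal PyTorch sketch of this bidirectional encoding (a GRU variant with illustrative dimensions, not the patent's reference implementation):

```python
import torch
import torch.nn as nn

class BiRNNEncoder(nn.Module):
    """Compute h_i = [f_i : b_i] for every source word."""
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # word-vector lookup
        self.rnn = nn.GRU(emb_dim, hid_dim,
                          bidirectional=True, batch_first=True)

    def forward(self, x):          # x: (batch, T) word indices
        v = self.embed(x)          # v_i for each word
        h, _ = self.rnn(v)         # (batch, T, 2*hid_dim), i.e. [f_i : b_i]
        return h

enc = BiRNNEncoder(vocab_size=30000, emb_dim=256, hid_dim=256)
h = enc(torch.randint(0, 30000, (1, 9)))   # representations for 9 words
```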
Attention module flow:

This module computes the source-sentence information representation $c_i$ on which the i-th decoding step depends. Let $s_{i-1}$ be the decoder RNN's hidden state at the previous step; then $c_i$ is computed as:

$$c_i = \sum_{j=1}^{T} \alpha_{ij} h_j, \qquad \alpha_{ij} = \frac{\exp(a(s_{i-1}, h_j))}{\sum_{k=1}^{T} \exp(a(s_{i-1}, h_k))}$$

where $a(s_{i-1}, h_j)$ is a general scoring function of the variables $s_{i-1}$ and $h_j$ that can be realized in many ways; a simple, classical form is

$$a(s_{i-1}, h_j) = v^\top \tanh(W s_{i-1} + U h_j)$$

Thus the source-sentence semantic representation generated at the i-th decoding step is a weighted average of the source words, with the weights determining how much attention each source word receives at the current step.
Decoder module flow:

This module generates the target-language sentence with a recurrent neural network, based on the dynamically generated source-sentence vector representation $c_i$ at each step and the decoder state $s_{i-1}$ of the previous step. The specific computation is:

$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$

where $f(\cdot)$ is the transformation function of the RNN, which may be a plain structure or a GRU or LSTM structure with a gating mechanism. The probability that $y_i$ is the k-th word $V_k$ of the target-language vocabulary is

$$P(y_i = V_k) = \mathrm{softmax}(b_k(s_i))$$

where $b_k(\cdot)$ is the transformation function associated with the k-th target word. After the word probabilities over the target vocabulary have been computed at each decoding step, the Beam Search algorithm yields the final decoded sequence $y = (y_1, y_2, y_3, \dots, y_n)$ that maximizes the output probability $P(y \mid x)$ of the whole sentence.
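A toy single-sentence sketch of the attention computation above, with the classical score $a(s, h) = v^\top \tanh(W s + U h)$ (tensor names and shapes are illustrative assumptions):

```python
import torch

def attention_context(s_prev, H, W, U, v):
    """c_i = sum_j alpha_ij * h_j over T encoder states."""
    # s_prev: (hid,); H: (T, 2*hid); W: (hid, hid); U: (hid, 2*hid); v: (hid,)
    scores = torch.tanh(s_prev @ W.T + H @ U.T) @ v   # a(s_{i-1}, h_j), (T,)
    alpha = torch.softmax(scores, dim=0)              # attention weights
    return alpha @ H                                  # weighted average c_i

hid, T = 4, 9
c = attention_context(torch.randn(hid), torch.randn(T, 2 * hid),
                      torch.randn(hid, hid), torch.randn(hid, 2 * hid),
                      torch.randn(hid))               # c: (2*hid,)
```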
The low-frequency word translation method based on semantic fusion is explained as follows:

Assume a bilingual sentence pair is input into the end-to-end neural machine translation system described above, with source-language sentence $x = (x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, \dots, x_m)$ and corresponding target-language sentence $y = (y_1, y_2, y_3, y_4, y_5, y_6, y_7, y_8, y_9, \dots, y_n)$. Word-frequency statistics identify the subword sequences $S_{unk_1} = (x_2, x_3)$ and $S_{unk_2} = (x_6, x_7, x_8)$ in the source sentence as low-frequency words to be processed, and dictionary lookup or model translation yields the target translations $T_{unk_1} = (y_2, y_3, y_4)$ and $T_{unk_2} = (y_7, y_8, y_9)$ corresponding to these two low-frequency words in the target sentence. After the wildcards UNKi replace the low-frequency words in the bilingual pair (x, y), a new bilingual pair $(\hat{x}, \hat{y})$ is obtained, where

$$\hat{x} = (x_1, u_1, x_4, x_5, u_2, x_9, \dots, x_m), \qquad \hat{y} = (y_1, u_1, y_5, y_6, u_2, y_{10}, \dots, y_n)$$
At this point, the purpose of low-frequency word semantic fusion is, during training of the neural network translation model, to fuse the vector representations of the source-language low-frequency words $S_{unk_1}(x_2, x_3)$ and $S_{unk_2}(x_6, x_7, x_8)$ and/or the target-language low-frequency words $T_{unk_1}(y_2, y_3, y_4)$ and $T_{unk_2}(y_7, y_8, y_9)$ with the vector representations of the wildcards $u_1$ and $u_2$. According to whether the low-frequency word vector to be fused comes from the source language, the target language, or both, the invention divides the fusion method into three schemes: Source-end low-frequency word vectors (Source), Target-end low-frequency word vectors (Target) and both-end low-frequency word vectors (Source + Target). Of the three, the Target and Source + Target schemes, which involve target low-frequency words, require the low-frequency word translations in advance; the low-frequency word translation method is therefore explained first, followed by the three fusion methods.
low frequency word translation
The invention mainly adopts a strategy of combining two schemes of searching word lists and model translation for the translation of low-frequency words. For bilingual low-frequency words, certain difficulty exists in modeling due to the fact that the internal statistical rule of the bilingual low-frequency words is relatively fuzzy and the occurrence frequency of the bilingual low-frequency words in training corpus is very low. Therefore, the method for preferentially using the dictionary lookup for the translation of the low-frequency words is adopted, and if the lookup fails, the customized low-frequency word translation model is adopted for translation.
The dictionary lookup method has the advantage that a low-frequency word (entity, term, proper noun and the like) vocabulary can be constructed in advance according to the specific application scene and field of machine translation. As long as the low-frequency words to be searched hit the word list, the returned translation is ensured to be completely correct and to be in line with a specific situation. The method has the defect that the hit rate is difficult to ensure in the searching process depending on the scale of word list construction.
The customized model translation is a supplement to the dictionary lookup method, and has the advantage that the target translation with the highest probability is output for the given low-frequency words to be translated. The invention adopts the character model to solve the problems of short low-frequency words and low frequency. Taking Nanjing Yangtze River Bridge and its translation "Nanjing Yangtze River Bridge" as an example, the bilingual forms under the character model are Nanjing Yangtze River Bridge and [ N a N j i N g ] [ Y a N g t z e ] [ R i v e R ] [ B R i d g e ]. The frequency of a modeling unit in the training data is improved by utilizing the word model, so that the translation performance of the low-frequency words is greatly improved.
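A small sketch of the character-level preprocessing assumed by this model, turning both sides of the example into character modeling units:

```python
def to_char_units(zh, en):
    """Split a bilingual pair into character-level modeling units."""
    src = list(zh)                              # one unit per Chinese character
    tgt = [" ".join(word) for word in en.split()]
    return src, tgt

src, tgt = to_char_units("南京长江大桥", "Nanjing Yangtze River Bridge")
# src: ['南', '京', '长', '江', '大', '桥']
# tgt: ['N a n j i n g', 'Y a n g t z e', 'R i v e r', 'B r i d g e']
```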
Fusing source-end low-frequency word vectors

As shown in FIG. 2, the main idea of this method is that the encoder fuses the vector representations of the source-language low-frequency words with the vector representations of the wildcards during training and decoding of the neural network translation model.

The low-frequency word translation method fusing source-end vectors is embodied mainly in the encoding process of neural machine translation and comprises three steps.

Step 1: take out the replaced source-language sentence $\hat{x}$ and the vector representations $\hat{v}$ of all its words. It should be noted that the word vectors may be obtained by model pre-training or randomly initialized from some distribution; the difference in final effect between the two is not obvious, because the word vectors are updated continuously during training of the neural network translation model.

Step 2: take out the vector representations $S_{unk_1}(v_2, v_3)$ and $S_{unk_2}(v_6, v_7, v_8)$ of the low-frequency words $S_{unk_1}(x_2, x_3)$ and $S_{unk_2}(x_6, x_7, x_8)$ in the sentence, and encode each with an LSTM to obtain the vector representations $s_{unk_1}$ and $s_{unk_2}$.

Step 3: apply a weighted sum and a nonlinear transformation to the wildcard vector representation $u_{unk_1}$ of the low-frequency word $S_{unk_1}$ and its subword-sequence vector representation $s_{unk_1}$ to obtain the final vector representation $V_1$ of $S_{unk_1}$:

$$V_1 = \tanh(W_{u,unk_1} \cdot u_{unk_1} + W_{s,unk_1} \cdot s_{unk_1})$$

In the same way, the fused vector representation $V_2$ of the low-frequency word $S_{unk_2}$ is obtained:

$$V_2 = \tanh(W_{u,unk_2} \cdot u_{unk_2} + W_{s,unk_2} \cdot s_{unk_2})$$

After these three steps, the low-frequency word translation method fusing source-end vectors is realized.
Because the final vector representations $V_1$ and $V_2$ of the low-frequency words $S_{unk_1}$ and $S_{unk_2}$ contain both the semantic information of the historical and future context carried by the wildcards and the semantic information of the low-frequency words themselves, the representations subsequently generated by the bidirectional RNN encoding contain more accurate and complete information. This in turn helps the Attention and Decoder modules generate more accurate low-frequency word translations and target translations.
Fusing target-end low-frequency word vectors

As shown in FIG. 3, and similarly to the previous method, the main idea of fusing target-end low-frequency word vectors is that the encoder fuses the vector representations of the target-language low-frequency words with the vector representations of the wildcards during training and decoding of the neural network translation model.

The low-frequency word translation method fusing target-end vectors completes three steps in the encoding process of neural machine translation.

Step 1: take out the replaced source-language sentence $\hat{x}$ and the vector representations $\hat{v}$ of all its words.

Step 2: take out the vector representations $T_{unk_1}(v_2, v_3, v_4)$ and $T_{unk_2}(v_7, v_8, v_9)$ of the low-frequency words $T_{unk_1}(y_2, y_3, y_4)$ and $T_{unk_2}(y_7, y_8, y_9)$ in the sentence, and encode each with an LSTM to obtain the vector representations $t_{unk_1}$ and $t_{unk_2}$.

Step 3: apply a weighted sum and a nonlinear transformation to the wildcard vector representation $u_{unk_1}$ of the low-frequency word $T_{unk_1}$ and its subword-sequence vector representation $t_{unk_1}$ to obtain the final vector representation $V_1$ of $T_{unk_1}$:

$$V_1 = \tanh(W_{u,unk_1} \cdot u_{unk_1} + W_{t,unk_1} \cdot t_{unk_1})$$

In the same way, the fused vector representation $V_2$ of the low-frequency word $T_{unk_2}$ is obtained:

$$V_2 = \tanh(W_{u,unk_2} \cdot u_{unk_2} + W_{t,unk_2} \cdot t_{unk_2})$$

After these three steps, the low-frequency word translation method fusing target-end vectors is realized.
Fusing both-end low-frequency word vectors

As shown in FIG. 5, a schematic diagram of the semantic fusion of the vector representations of source-language and target-language low-frequency words with the vector representation of the wildcard UNKi, the method of fusing the low-frequency word vectors at both ends integrates the two previous methods and is a key point of the invention. The low-frequency word translation method fusing both-end vectors completes three steps in the encoding process of neural machine translation.

Step 1: take out the replaced source-language sentence $\hat{x}$ and the vector representations $\hat{v}$ of all its words.

Step 2: take out the vector representations $S_{unk_1}(v_2, v_3)$ and $S_{unk_2}(v_6, v_7, v_8)$ of the source-language low-frequency words $S_{unk_1}(x_2, x_3)$ and $S_{unk_2}(x_6, x_7, x_8)$ in the sentence, and encode each with an LSTM to obtain the vector representations $s_{unk_1}$ and $s_{unk_2}$. Then take out the vector representations $T_{unk_1}(v_2, v_3, v_4)$ and $T_{unk_2}(v_7, v_8, v_9)$ of the target-language low-frequency words $T_{unk_1}(y_2, y_3, y_4)$ and $T_{unk_2}(y_7, y_8, y_9)$, and encode each with an LSTM to obtain the vector representations $t_{unk_1}$ and $t_{unk_2}$.

Step 3: apply a weighted sum and a nonlinear transformation to the wildcard vector representation $u_{unk_1}$ of the low-frequency word $T_{unk_1}$, the source-language subword-sequence vector representation $s_{unk_1}$ and the target-language subword-sequence vector representation $t_{unk_1}$ to obtain the final vector representation $V_1$:

$$V_1 = \tanh(W_{u,unk_1} \cdot u_{unk_1} + W_{s,unk_1} \cdot s_{unk_1} + W_{t,unk_1} \cdot t_{unk_1})$$

In the same way, the fused vector representation $V_2$ of the low-frequency word $T_{unk_2}$ is obtained:

$$V_2 = \tanh(W_{u,unk_2} \cdot u_{unk_2} + W_{s,unk_2} \cdot s_{unk_2} + W_{t,unk_2} \cdot t_{unk_2})$$

After these three steps, the low-frequency word translation method fusing both-end vectors is realized.
In the specific implementation of fusing the low-frequency word vectors at both ends, the invention contains innovations and advantages in three main aspects:

First, regarding the sources of the low-frequency word vectors to be fused, the vector representations of the low-frequency word's semantic information in both the source-language and target-language vector spaces are used simultaneously.

Second, regarding the encoding of a low-frequency word's multiple subword vectors, a long short-term memory (LSTM) network scans the subword sequence in reverse, so that the first subword contributes the most important information to the final vector representation.

Third, regarding the semantic fusion of the low-frequency word, the hyperbolic tangent tanh is adopted as the activation function to fuse the three vectors, which amounts to a nonlinear transformation after a weighted summation of them.
The innovations and advantages of these three aspects are set forth in detail below. First, the vector representations of the low-frequency word's semantic information in the two vector spaces of the source and target languages are used simultaneously. Because a word may be low-frequency in one language while its counterpart at the other end of the bilingual pair is not necessarily low-frequency, this approach lets the model extract the low-frequency word's semantic features from two different languages in a complementary fashion during training, effectively mitigating the low-frequency deficiency. Furthermore, because the vector representations of the target low-frequency words are supplied during source-sentence encoding, a copying mechanism from source sentence to target sentence can be learned during model training. Specifically, the source end fuses in the vector representations $T_{unk_1}(v_2, v_3, v_4)$ and $T_{unk_2}(v_7, v_8, v_9)$ of the target low-frequency words $T_{unk_1}(y_2, y_3, y_4)$ and $T_{unk_2}(y_7, y_8, y_9)$ during encoding, and the output target translation also contains $T_{unk_1}(y_2, y_3, y_4)$ and $T_{unk_2}(y_7, y_8, y_9)$. This realizes a direct prompt: through training, the neural network translation model learns a mechanism that outputs the source-prompted target low-frequency words, together with the other context, directly into the predicted translation.
Second, a long short-term memory (LSTM) network is used to scan the subword sequence in reverse, so that the first subword contributes the most important information to the final vector representation. LSTM networks are good at modeling natural language in machine translation, converting sentences of arbitrary length into floating-point vectors of a fixed dimension, remembering the important words of a sentence and retaining that memory over long spans. The LSTM is a special structural variant of the RNN model: it adds three control units, namely an input gate, an output gate and a forget gate. For information entering the LSTM, the three gates judge what proportion should be remembered, forgotten and output, which effectively alleviates the long-distance dependency problem in neural networks.
[Figure: LSTM encoding of a low-frequency word's subword sequence]
The invention uses an LSTM network to scan the low-frequency word's subword sequence in reverse, which ensures that the first subword plays the greatest role in the vector representation and thus that the connection between the low-frequency word and its context is smoother. As shown in the figure above, the LSTM encoding of a low-frequency word's subword sequence is illustrated with the low-frequency word $S_{unk_2}(x_6, x_7, x_8)$. As in sentence encoding, the vector representations $v_6, v_7, v_8$ of all the subwords are first taken out and then input one by one, from right to left, into the LSTM network for encoding:

$$s_{unk_2} = \mathrm{LSTM}(v_8, v_7, v_6)$$

Finally, the output vector $s_{unk_2}$ of the LSTM network is the vector representation of the low-frequency word $S_{unk_2}(x_6, x_7, x_8)$, and $x_6$, being the last input, plays the greatest role in it.
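A PyTorch sketch of this reverse-order subword encoding (dimensions and names are illustrative assumptions):

```python
import torch
import torch.nn as nn

def encode_subwords_reversed(subword_vecs, lstm):
    """Feed subword vectors right to left, so the first subword enters last
    and dominates the final state: s_unk2 = LSTM(v8, v7, v6)."""
    rev = torch.flip(subword_vecs, dims=[0])   # (L, dim) in reverse order
    _, (h_n, _) = lstm(rev.unsqueeze(0))       # batch of one sequence
    return h_n[-1, 0]                          # final hidden state, (dim,)

lstm = nn.LSTM(input_size=256, hidden_size=256, batch_first=True)
v = torch.randn(3, 256)                        # v6, v7, v8
s_unk2 = encode_subwords_reversed(v, lstm)     # v6 is read last
```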
Third, the hyperbolic tangent tanh is adopted as the activation function to fuse the three vectors, which amounts to a nonlinear transformation after their weighted summation. Taking the low-frequency word $S_{unk_1}(x_2, x_3)$ as an example, the semantic fusion is computed as:

$$V_1 = \tanh(W_{u,unk_1} \cdot u_{unk_1} + W_{s,unk_1} \cdot s_{unk_1} + W_{t,unk_1} \cdot t_{unk_1})$$

The weighted-sum operation lets the model automatically learn, for each group of samples and according to the specific contextual semantic environment, the selection weights $W_{u,unk_1}$, $W_{s,unk_1}$ and $W_{t,unk_1}$ of the three vectors; that is, these weight matrices belong to the adaptive parameters of model training. The larger a weight, the more influence the corresponding vector has on the encoder output. The nonlinear transformation is realized mainly by the activation function, which is essential for an artificial neural network model to learn and express complex, nonlinear mappings. The invention adopts the hyperbolic tangent tanh as the activation function of the semantic fusion module in the neural network translation model; it is computed as:

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

As shown in FIG. 4, the hyperbolic tangent tanh adopted by the invention has two advantages.

First, when the input x is very large or very small, the slope of tanh approaches zero. That is, only when the weighted sum of the three vectors falls within a certain range does the tanh nonlinearity produce a large gradient; once the weighted sum exceeds that range, tanh approaches saturation and barely responds. The tanh function thus screens out abnormal values of the weighted sum, preventing an overly large or small result from biasing the encoder output and the updating of the context vector.

Second, as its graph shows, tanh is smooth, easy to differentiate, and produces output centered at 0 and bounded in (-1, 1). These properties make parameter updates with tanh efficient and rescale the weighted sum of the three vectors into a fixed range after the nonlinear transformation, which is especially important in matrix-operation-dense settings such as neural networks.
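A quick numeric check of the saturation behavior described above (plain Python; the printed values follow directly from the tanh formula):

```python
import math

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

# The gradient 1 - tanh(x)^2 vanishes as the weighted sum grows in magnitude.
for x in [0.0, 0.5, 2.0, 10.0]:
    print(f"x={x:5.1f}  tanh={tanh(x):+.4f}  grad={1 - tanh(x) ** 2:.6f}")
# x=  0.0  tanh=+0.0000  grad=1.000000
# x=  0.5  tanh=+0.4621  grad=0.786448
# x=  2.0  tanh=+0.9640  grad=0.070651
# x= 10.0  tanh=+1.0000  grad=0.000000
```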
In neural machine translation, the intermediate vector produced by the Encoder module's encoding of the source sentence directly determines the quality of the target translation generated by the Attention and Decoder modules. The semantic fusion method can therefore incorporate the low-frequency word's semantic information while preserving the semantic information of the original wildcard; the final vector representation of the low-frequency word then carries both the contextual information of the sentence and a full account of the low-frequency word's own semantics.
The invention provides a low-frequency word processing method based on semantic fusion. The method fuses the vector representations of low-frequency words and wildcards in the Encoder module of end-to-end neural machine translation, so that more accurate and complete information is encoded into the intermediate vector, helping the Attention and Decoder modules generate higher-quality translations. Besides solving low-frequency word translation, the method also helps improve the translation of common named entities, complex noun phrases and technical terms in natural-language sentences.
The above description is only a preferred embodiment of the invention and is not intended to limit it; all simple modifications, changes and equivalent structural variations made to the above embodiment in accordance with the technical essence of the invention fall within the protection scope of its technical solution.

Claims (4)

1. A low-frequency word translation method based on semantic information fusion is characterized by comprising the following steps: inputting a bilingual sentence pair in a translation system, wherein a source language sentence x and a target language sentence y corresponding to the source language sentence obtain a sub-word sequence of low-frequency words in the source language sentence, obtain a target translation of the low-frequency words in the target language sentence, and obtain a new bilingual sentence pair after replacing the low-frequency words in the bilingual sentence pair (x, y) by a wildcard character UNKi
Figure FDA0002813577090000011
Fusing vector representations of source language low-frequency words and/or target language low-frequency words with vector representations of wildcard characters UNKi;
the translation method for fusing the vector representation of the source language low-frequency word and the vector representation of the wildcard UNKi comprises the following steps:
step one, taking out the replaced source language sentence
Figure FDA0002813577090000012
Vector characterization of all words in
Figure FDA0002813577090000013
Step two, taking out the vector representation of the low-frequency words in the source language sentence, and coding the vector representation by adopting the LSTM to obtain the vector representation
Figure FDA0002813577090000014
Step three, weighted summation and nonlinear transformation are respectively carried out on the wildcard vector representation of the low-frequency words and the source language sub-word sequence vector representation, and the final vector representation V of the low-frequency words is obtainedi
Figure FDA0002813577090000015
Wherein tanh is a hyperbolic tangent function,
Figure FDA0002813577090000016
is a vector characterization of wildcards, Wu,unkiIs LSTM encoder based on specific context semantic environment pairs
Figure FDA0002813577090000017
Weight of (1), Ws,unkiIs LSTM encoder based on specific context semantic environment pairs
Figure FDA0002813577090000018
The weight of (c);
the translation method for fusing the vector representation of the low-frequency word of the target language and the vector representation of the wildcard UNKi comprises the following steps:
step one, taking out the replaced source language sentence
Figure FDA0002813577090000019
Vector characterization of all words in
Figure FDA00028135770900000110
And step two, taking out the vector representation of the low-frequency words in the target language sentence, and coding the vector representation by adopting LSTM to obtain the vector representationv unkj
Step three, weighted summation and nonlinear transformation are respectively carried out on the wildcard vector representation of the low-frequency words and the target language sub-word sequence vector representation, and the final vector representation V of the low-frequency words is obtainedj
Figure FDA00028135770900000111
Wherein tanh is a hyperbolic tangent function,
Figure FDA00028135770900000112
is a vector characterization of wildcards, Wu,unkjIs LSTM encoder based on specific context semantic environment pairs
Figure FDA00028135770900000113
Weight of (1), Wt,unkjIs LSTM encoder based on specific context semantic environment pairsv unkjThe weight of (c);
the translation method for fusing the vector representation of the source language low-frequency words and the target language low-frequency words with the vector representation of the wildcard UNKi comprises the following steps:
step one, taking out the replaced source language sentence
Figure FDA0002813577090000021
Vector characterization of all words in
Figure FDA0002813577090000022
Step two, vector representations of low-frequency words in source language sentences and target language sentences are taken out and are respectively encoded by adopting LSTM to obtain vector representations
Figure FDA0002813577090000023
Andv unkm
step three, weighted summation and nonlinear transformation are respectively carried out on the wildcard vector representation of the low-frequency words, the source language sub-word sequence vector representation and the target language sub-word sequence vector representation to obtain the final vector representation V of the low-frequency wordsm
Figure FDA0002813577090000024
Wherein tanh is a hyperbolic tangent function,
Figure FDA0002813577090000025
is a vector characterization of wildcards, Wu,unkmIs LSTM encoder based on specific context semantic environment pairs
Figure FDA0002813577090000026
Weight of (1), Ws,unkmIs LSTM encoder based on specific context semantic environment pairs
Figure FDA0002813577090000027
Weight of (1), Wt,unkmIs LSTM encoder based on specific context semantic environment pairsv unkmThe weight of (c);
The vector representations of all words in the replaced source-language sentence $S'$ are initialized randomly according to a certain distribution.
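As a hedged sketch of the combined source-and-target fusion, and of the random initialization just mentioned (for which nn.Embedding's default random initializer stands in), the earlier example might be extended as follows; again, every identifier here is illustrative, not taken from the patent:

```python
import torch
import torch.nn as nn

class BilingualSubwordFusion(nn.Module):
    """Sketch: fuse the wildcard vector with separately LSTM-encoded
    source and target subword sequences (the V_m formula above)."""
    def __init__(self, src_vocab, tgt_vocab, dim):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)   # randomly initialized
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)   # randomly initialized
        self.unk_emb = nn.Parameter(torch.randn(dim)) # u_{unk_m}
        self.src_lstm = nn.LSTM(dim, dim, batch_first=True)
        self.tgt_lstm = nn.LSTM(dim, dim, batch_first=True)
        self.W_u = nn.Linear(dim, dim, bias=False)    # W_{u,unk_m}
        self.W_s = nn.Linear(dim, dim, bias=False)    # W_{s,unk_m}
        self.W_t = nn.Linear(dim, dim, bias=False)    # W_{t,unk_m}

    @staticmethod
    def encode(lstm, emb, ids):
        # Final LSTM hidden state as the subword-sequence representation
        _, (h_n, _) = lstm(emb(ids))
        return h_n[-1].squeeze(0)

    def forward(self, src_subword_ids, tgt_subword_ids):
        v_s = self.encode(self.src_lstm, self.src_emb, src_subword_ids)
        v_t = self.encode(self.tgt_lstm, self.tgt_emb, tgt_subword_ids)
        # V_m = tanh(W_u·u_{unk_m} + W_s·v^s_{unk_m} + W_t·v^t_{unk_m})
        return torch.tanh(self.W_u(self.unk_emb)
                          + self.W_s(v_s) + self.W_t(v_t))

fusion = BilingualSubwordFusion(src_vocab=8000, tgt_vocab=8000, dim=256)
V_m = fusion(torch.tensor([[17, 342, 99]]), torch.tensor([[4, 77]]))
print(V_m.shape)  # torch.Size([256])
```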
2. The method for translating low-frequency words according to claim 1, wherein the subword sequence of the low-frequency word to be processed in the source-language sentence is obtained through word-frequency statistics.
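A minimal sketch of such word-frequency statistics, assuming whitespace-tokenized sentences and an arbitrary frequency threshold of 2 (both assumptions, not values from the patent):

```python
from collections import Counter

def low_frequency_words(sentences, threshold=2):
    """Count word frequencies across a corpus and return the words whose
    count falls below the threshold -- the candidates to be replaced by
    a wildcard and re-encoded from their subword sequences."""
    counts = Counter(word for sent in sentences for word in sent.split())
    return {word for word, count in counts.items() if count < threshold}

corpus = ["the cat sat", "the dog sat", "an axolotl appeared"]
print(low_frequency_words(corpus))
# {'cat', 'dog', 'an', 'axolotl', 'appeared'} -- each occurs only once
```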
3. The method for translating low-frequency words according to claim 1, wherein the target translation corresponding to the low-frequency word in the target-language sentence is obtained through dictionary lookup or model translation.
4. The method for translating low-frequency words according to claim 3, wherein the low-frequency word is first translated by dictionary lookup; if the lookup fails, it is translated by a word model.
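The dictionary-first fallback of claim 4 amounts to the following control flow; this is a sketch under the assumption that the translation model is exposed as a plain callable (model_translate is a hypothetical stand-in):

```python
def translate_low_frequency_word(word, dictionary, model_translate):
    """Try dictionary lookup first; only if the lookup fails, fall back
    to translating the word with the model."""
    target = dictionary.get(word)
    return target if target is not None else model_translate(word)

# Hypothetical usage
lexicon = {"axolotl": "美西螈"}
model_translate = lambda w: f"<model:{w}>"
print(translate_low_frequency_word("axolotl", lexicon, model_translate))  # 美西螈
print(translate_low_frequency_word("quokka", lexicon, model_translate))   # <model:quokka>
```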
CN202010060672.7A 2020-01-19 2020-01-19 Semantic information fusion-based low-frequency word translation method Active CN111274826B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010060672.7A CN111274826B (en) 2020-01-19 2020-01-19 Semantic information fusion-based low-frequency word translation method

Publications (2)

Publication Number Publication Date
CN111274826A (en) 2020-06-12
CN111274826B (en) 2021-02-05

Family

ID=71000709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010060672.7A Active CN111274826B (en) 2020-01-19 2020-01-19 Semantic information fusion-based low-frequency word translation method

Country Status (1)

Country Link
CN (1) CN111274826B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177414A (en) * 2021-04-27 2021-07-27 桂林电子科技大学 Semantic feature processing method and device and storage medium
CN113378567B (en) * 2021-07-05 2022-05-10 广东工业大学 Chinese short text classification method for improving low-frequency words

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982329A (en) * 2012-11-02 2013-03-20 华南理工大学 Segmentation recognition and semantic analysis integration translation method for mobile devices
US8990066B2 (en) * 2012-01-31 2015-03-24 Microsoft Corporation Resolving out-of-vocabulary words during machine translation
CN107329960A * 2017-06-29 2017-11-07 哈尔滨工业大学 Context-sensitive out-of-vocabulary word translation apparatus and method for neural machine translation
CN108345590A * 2017-12-28 2018-07-31 北京搜狗科技发展有限公司 Translation method, apparatus, electronic device and storage medium
CN110457715A * 2019-07-15 2019-11-15 昆明理工大学 Out-of-vocabulary word processing method for Chinese-Vietnamese neural machine translation incorporating a classification dictionary
CN111428518A (en) * 2019-01-09 2020-07-17 科大讯飞股份有限公司 Low-frequency word translation method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109359304B (en) * 2018-08-22 2023-04-18 新译信息科技(深圳)有限公司 Restrictive neural network machine translation method and storage medium
CN109902314B (en) * 2019-04-18 2023-11-24 中译语通科技股份有限公司 Term translation method and device
CN110334362B (en) * 2019-07-12 2023-04-07 北京百奥知信息科技有限公司 Method for solving and generating untranslated words based on medical neural machine translation

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Chinese-English OOV Term Translation with Web Mining, Multiple Feature Fusion and Supervised Learning; Yun Zhao et al.; Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data; 2014-12-31; 234-246 *
Using Sublexical Translations to Handle the OOV Problem in Machine Translation; Chungchi Huang et al.; ACM Transactions on Asian Language Information Processing; 2011-09-30; 1-16 *
Research on Automatic Construction Techniques for Chinese-Japanese Bilingual Parallel Corpora; Yin Cunyan; China Doctoral Dissertations Full-text Database, Information Science and Technology; 2016-04-15 (No. 4); I138-91 *
Research on Out-of-Vocabulary Word Processing for Neural Machine Translation Fusing Semantic Concepts; Li Shaotong; China Master's Theses Full-text Database, Information Science and Technology; 2018-06-15 (No. 6); I138-2159 *
Research on Out-of-Vocabulary Word Translation for Neural Network Translation Models of Chinese-English Patent Documents; Zheng Xiaokang; China Master's Theses Full-text Database, Information Science and Technology; 2018-01-15 (No. 1); I138-2157 *

Also Published As

Publication number Publication date
CN111274826A (en) 2020-06-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant