CN115048946A - Chapter-level neural machine translation method fusing topic information - Google Patents

Chapter-level neural machine translation method fusing topic information

Info

Publication number
CN115048946A
CN115048946A (application CN202210665757.7A)
Authority
CN
China
Prior art keywords
word
context
sentence
model
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210665757.7A
Other languages
Chinese (zh)
Other versions
CN115048946B (en)
Inventor
余正涛
陈玺文
高盛祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202210665757.7A priority Critical patent/CN115048946B/en
Publication of CN115048946A publication Critical patent/CN115048946A/en
Application granted granted Critical
Publication of CN115048946B publication Critical patent/CN115048946B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/42 Data-driven translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a chapter-level neural machine translation method fusing topic information, and belongs to the field of natural language processing. The chapter-level parallel corpus is preprocessed and segmented with BPE; the topics of the source-language chapters are then trained with a word-embedding topic model, the chapter text is vectorized to obtain the word embedding of each word, and at the encoding end of the neural machine translation model the word embeddings produced by the topic model are added to the source-language word embeddings as input, after which the translation model is trained. The method uses the topic model to obtain topic information and fuses it into the source-language encoding in the form of word embeddings, providing more context information at the encoding stage and alleviating the pronoun consistency problem in chapter-level neural machine translation; compared with the Context-Aware Transformer model, it improves the BLEU value by 0.26, 0.27 and 0.29 on the English-German, English-French and Chinese-English language pairs respectively.

Description

Chapter-level neural machine translation method fusing topic information
Technical Field
The invention relates to a chapter-level neural machine translation method fusing topic information, and belongs to the technical field of natural language processing.
Background
Chapter-level neural machine translation builds on independent sentence translation by adding context information, which alleviates the coherence and coreference problems in the chapter-level translation process and improves the translation quality of the model; it therefore has important application value.
In recent years, research on chapter-level machine translation at home and abroad has focused on two aspects: extracting context feature information on the one hand, and combining the context information with the translation model on the other. In research on obtaining context feature information, existing methods encode the source-language context and the local or global context of the target language.
In research on combining context information with the translation model, some methods store the most recent context hidden states through a cache mechanism to guide the generation of the translation; other methods obtain more word-to-word and sentence-to-sentence information through a hierarchical network and selectively use it during encoding or decoding. Although existing chapter-level translation models can learn local context from explicit collocations to handle word sense ambiguity, accurately translating words in implicit collocations remains a challenge, that is, phrases consisting of two or more words cannot be translated accurately. Although the evaluation metrics of these chapter-level models have improved, the models still perform poorly on pronoun consistency, and this problem can be effectively alleviated by integrating topic information into the translation model. The present method uses the ETM topic model to train on the chapter corpus and obtain the word embeddings of the words corresponding to the topics of the chapter; the key point is that these embeddings supply, for each original word, context information about the words related to its topic, thereby improving the consistency of chapter-level neural machine translation.
Disclosure of Invention
The invention provides a chapter-level neural machine translation method fused with topic information, which is used for solving the problem of inconsistent pronoun translation across contexts in the chapter-level translation process.
The technical scheme of the invention is as follows: the chapter-level neural machine translation method fusing the theme information comprises the following specific steps:
Step1, using the open-source IWSLT TED speech bilingual data, and performing related preprocessing such as context alignment and BPE segmentation on the data for training.
Step2, removing low-frequency and high-frequency words from the context sentences of the source-language chapter corpus, generating a vocabulary, training topic word embeddings with the open-source ETM topic word embedding model, and extracting the topic word embeddings.
Step3, adding the topic word embeddings obtained by training to obtain a single word embedding vector topic_s, and adding it to each word embedding vector of the corresponding context sentence to obtain the final word embedding E = {e_1, e_2, e_3, ..., e_m}; E is taken as the input of the context encoder of the translation model.
Step4, pre-training a Transformer model on context-free bilingual corpora, fixing the encoder and decoder parameters of the Transformer model, introducing an additional context encoder to encode the context sentence and an additional attention layer to associate the context information with the current sentence information of the Transformer, and using a gating mechanism to produce the final output of the encoding end.
As a further scheme of the invention, the Step1 comprises the following specific steps:
Step1.1, obtaining chapter-level bilingual parallel corpora based on the IWSLT TED speech data set.
Step1.2, removing special symbols from the bilingual corpora and performing word segmentation: the Chinese corpus is segmented with the jieba tool, the English, German and French corpora are tokenized with the MOSES tool and segmented with BPE, and sentences shorter than 6 tokens after segmentation are removed.
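The preprocessing in Step1.2 can be sketched in Python as follows. This is a minimal illustration that assumes the jieba package is available, uses a simple regular expression for special-symbol removal, and leaves Moses tokenization and BPE to the standard external tools (tokenizer.perl, subword-nmt); the helper names and thresholds here are illustrative rather than part of the original disclosure.

```python
# Illustrative preprocessing sketch (Step 1.2); assumes the jieba package is installed.
# Moses tokenization and BPE are applied afterwards with the usual external tools.
import re
import jieba

MIN_LEN = 6  # drop sentences shorter than 6 tokens after segmentation


def clean(text: str) -> str:
    # remove special symbols, keeping word characters, whitespace and basic punctuation
    return re.sub(r"[^\w\s.,!?;:()'-]", " ", text).strip()


def segment(text: str, lang: str) -> list[str]:
    text = clean(text)
    if lang == "zh":
        return [tok for tok in jieba.lcut(text) if tok.strip()]
    # placeholder whitespace tokenization for en/de/fr before Moses/BPE
    return text.split()


def filter_pair(src: str, tgt: str, src_lang: str, tgt_lang: str):
    s, t = segment(src, src_lang), segment(tgt, tgt_lang)
    if len(s) < MIN_LEN or len(t) < MIN_LEN:
        return None  # discard too-short sentence pairs
    return " ".join(s), " ".join(t)


if __name__ == "__main__":
    print(filter_pair("这是一个用于演示的例句", "This is an example sentence for the demo", "zh", "en"))
```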
As a further scheme of the invention, the Step2 comprises the following specific steps:
Step2.1, preprocessing the chapter context sentences after BPE segmentation, removing words that occur fewer than 2 times or more than 800 times, and randomly initializing the word embedding vector of each word.
Step2.2, setting the number of topics to 70-80 based on the ETM (Embedded Topic Model), and training the topic model with the preprocessed context sentences.
Step2.3, the ETM first draws the topic distribution of the m-th document from a logistic-normal distribution, then samples a topic for the n-th word from a categorical distribution over that topic distribution; the vocabulary distribution of a topic is obtained by taking the dot product of the word embedding matrix ρ with the topic embedding α and applying softmax.
The topic generation process is as follows: for each document m, sample the topic probability distribution of the document θ_m = softmax(δ_m); for each word w_{m,n} in document m, select a latent topic z_{m,n} ~ Cat(θ_m) and generate the word
w_{m,n} ~ Cat(softmax(ρ^T α_{z_{m,n}}))
where ρ is an L×V matrix, L is the word embedding size, V is the vocabulary size, θ_m denotes the topic probability distribution of the m-th document, δ_m obeys a normal distribution δ_m ~ N(0, I), z_{m,n} denotes the latent topic of the n-th word of the m-th document, w_{m,n} denotes the n-th word of the m-th document, M denotes the total number of documents, and N_m denotes the total number of words in the m-th document. θ_m follows a logistic-normal distribution, of which I is the parameter. β denotes the word distribution parameter of each topic,
β_k = softmax(ρ^T α_k)
representing the word distribution of each topic.
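The generative process described above can be illustrated with the following toy NumPy sketch; the dimensions (K topics, V vocabulary words, L-dimensional embeddings) and the random setup are assumptions for illustration, not the training configuration of the invention.

```python
# Toy sketch of the ETM generative process described above (assumed sizes).
import numpy as np

rng = np.random.default_rng(0)
K, V, L, N_m = 4, 1000, 32, 20          # topics, vocab size, embedding size, words in doc m

rho = rng.normal(size=(L, V))           # word embedding matrix (L x V)
alpha = rng.normal(size=(K, L))         # topic embeddings alpha_k


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


delta_m = rng.normal(size=K)            # delta_m ~ N(0, I)
theta_m = softmax(delta_m)              # topic probability distribution of document m

words = []
for _ in range(N_m):
    z = rng.choice(K, p=theta_m)                 # z_{m,n} ~ Cat(theta_m)
    beta_z = softmax(rho.T @ alpha[z])           # word distribution of topic z
    words.append(rng.choice(V, p=beta_z))        # w_{m,n} ~ Cat(softmax(rho^T alpha_z))
print(words[:10])
```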
Step2.4, in the training stage of the topic model, variational inference is adopted to calculate the log marginal likelihood estimate of the model parameters; the loss function is calculated with formulas (1) to (4), and the model parameters ρ, α and ν are then updated by gradient descent.
To calculate the maximum marginal likelihood estimate of ρ and α:
L(ρ, α) = Σ_m log p(w_m | ρ, α) (1)
First, the conditional distribution of each word is calculated:
p(w_{m,n} | δ_m, ρ, α) = Σ_k θ_{m,k} β_{k,w_{m,n}} (2)
The word distribution parameter β_k of topic k is calculated as:
β_k = softmax(ρ^T α_k) (3)
Finally, the log marginal likelihood estimate of ρ, α and ν is calculated with variational inference:
L(ρ, α, ν) = Σ_m ( E_{q(δ_m; w_m, ν)}[ log p(w_m | δ_m, ρ, α) ] - KL( q(δ_m; w_m, ν) || p(δ_m) ) ) (4)
where θ_{m,k} denotes the topic distribution, ν denotes the variational parameters, E_q[·] denotes the mathematical expectation, KL(·) denotes the KL divergence, and q(·) denotes a Gaussian distribution.
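The variational objective of formulas (1) to (4) can be sketched in PyTorch as follows; the encoder architecture, hidden size and variable names are illustrative assumptions, and only the loss structure (reconstruction term plus KL divergence to the standard normal prior) follows the description above.

```python
# Sketch of the ETM variational objective (ELBO) from formulas (1)-(4); assumed architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ETMSketch(nn.Module):
    def __init__(self, vocab_size: int, n_topics: int, embed_dim: int, hidden: int = 256):
        super().__init__()
        self.rho = nn.Parameter(torch.randn(vocab_size, embed_dim))    # word embeddings (V x L)
        self.alpha = nn.Parameter(torch.randn(n_topics, embed_dim))    # topic embeddings (K x L)
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, n_topics)        # variational mean
        self.logvar = nn.Linear(hidden, n_topics)    # variational log-variance (nu)

    def forward(self, bow):                          # bow: (batch, V) word counts
        h = self.encoder(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        delta = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterized delta_m
        theta = F.softmax(delta, dim=-1)                              # theta_m = softmax(delta_m)
        beta = F.softmax(self.alpha @ self.rho.t(), dim=-1)           # beta_k = softmax(rho^T alpha_k), eq. (3)
        log_word_prob = torch.log(theta @ beta + 1e-10)               # log p(w | delta, rho, alpha), eq. (2)
        recon = -(bow * log_word_prob).sum(dim=-1)                    # negative reconstruction term
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)   # KL(q(delta) || N(0, I)), eq. (4)
        return (recon + kl).mean()                                    # negative ELBO to minimize


loss = ETMSketch(vocab_size=2000, n_topics=75, embed_dim=300)(torch.rand(8, 2000))
```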
As a further scheme of the invention, the Step3 comprises the following specific steps:
Step3.1, first, using the word embedding table obtained by training the topic model, after the context sentence is segmented, the topic word embedding representation of the context sentence is obtained by looking up the vocabulary. Then the word embeddings of the topic word representation are added together to obtain a vector topic_s, as shown in equation (5):
topic_s = Σ_{i=1}^{m} t_i (5)
where t_i is the i-th topic word embedding of the sentence and m is the number of words.
Step3.2, finally, topic_s is added to each word embedding vector x_i of the context sentence to obtain the final context encoder input E = {e_1, e_2, e_3, ..., e_m}, as shown in equation (6):
e_i = x_i + topic_s (6)
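A minimal sketch of the fusion in formulas (5) and (6), assuming the topic word embeddings t_i and the context-sentence word embeddings x_i are already available as tensors:

```python
# Sketch of formulas (5)-(6): fuse topic word embeddings into context word embeddings.
import torch


def fuse_topic(topic_word_embeds: torch.Tensor, context_word_embeds: torch.Tensor) -> torch.Tensor:
    """topic_word_embeds: (m, d) embeddings t_i looked up from the ETM table;
    context_word_embeds: (m, d) word embeddings x_i of the context sentence."""
    topic_s = topic_word_embeds.sum(dim=0)   # eq. (5); a mean over the m words is an equally plausible reading
    return context_word_embeds + topic_s     # eq. (6): e_i = x_i + topic_s, broadcast over the m words


E = fuse_topic(torch.randn(12, 512), torch.randn(12, 512))   # E = {e_1, ..., e_m}, context encoder input
```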
as a further scheme of the invention, the Step4 comprises the following specific steps:
Step4.1, given a source language document consisting of a sequence of k sentences X = {x^(1), x^(2), ..., x^(k)}, where x^(k) = {x_1^(k), x_2^(k), ..., x_I^(k)} denotes that the k-th sentence of the source language contains I words, the corresponding target language document is a sequence of k sentences Y = {y^(1), y^(2), ..., y^(k)}, where y^(k) = {y_1^(k), y_2^(k), ..., y_J^(k)} denotes that the k-th sentence of the target language contains J words. The probability of translating the chapter is therefore expressed as shown in formula (7):
P(Y | X) = Π_k Π_m P(y_m^(k) | y_{<m}^(k), x^(k), X^{<k}) (7)
where y_{<m}^(k) denotes the first m-1 translated words and X^{<k} denotes the context sentences of the k-th sentence.
Step4.2, the chapter-level neural machine translation model adopts the same word embedding scheme, encoder and decoder as the Transformer. To reflect the differences and ordering of corpus words at different positions of a sentence, position embeddings are used to add positional features after the word embedding matrix is calculated; the positions are represented by formulas (8) and (9):
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) (8)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) (9)
The core of the encoder is the self-attention part, as shown in equation (10):
Attention(Q, K, V) = softmax(QK^T / √d_k) V (10)
where Q, K and V are the input word vector matrices and d_k is the input vector dimension. The self-attention mechanism calculates the degree of association between each word in the sentence and all words in the sentence, and each word is weighted and summed with the calculated association degrees to obtain a new semantic representation of each word. In addition, multi-head attention is adopted, which expands the model's ability to focus on different positions and gives the attention layer multiple "representation subspaces", as shown in formulas (11) and (12):
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O (11)
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) (12)
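For reference, a compact PyTorch sketch of the scaled dot-product attention of equation (10) is given below; multi-head attention as in equations (11) and (12) is available in practice as torch.nn.MultiheadAttention.

```python
# Scaled dot-product attention, equation (10).
import math
import torch
import torch.nn.functional as F


def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # association degree between words
    return F.softmax(scores, dim=-1) @ V                # weighted sum -> new semantic representation


out = attention(torch.randn(2, 10, 64), torch.randn(2, 10, 64), torch.randn(2, 10, 64))
```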
the Step4.3, chapter machine translation model introduces a context encoder and a context attention layer, the representation of the context sentence and the representation of the current sentence are subjected to new representation through the context attention layer, Q of the context attention layer is a representation matrix of the sentence output by the current sentence encoder, and K and V are representation matrices of the context sentence output by the context encoder.
In order to balance the weight of the new sentence representation after mixing the context representations with the current sentence representation, a context gate is added in the calculation of the mixed context representation, as shown in formula (13) and formula (14):
g j =σ(W g [s j ,c j ]+b g ) (13)
s j =g j ⊙s j +(1-g j )⊙c j (14)
wherein s is j Is the output of the current sentence encoder, c j Is the output of the context attention layer and σ is the sigmoid function.
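The context gate of formulas (13) and (14) can be sketched as follows; the module layout and dimensions are assumptions for illustration.

```python
# Context gate, formulas (13)-(14): mix current-sentence and context representations.
import torch
import torch.nn as nn


class ContextGate(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.w_g = nn.Linear(2 * d_model, d_model)   # W_g and b_g

    def forward(self, s: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        """s: current-sentence encoder output s_j; c: context attention output c_j."""
        g = torch.sigmoid(self.w_g(torch.cat([s, c], dim=-1)))   # eq. (13)
        return g * s + (1 - g) * c                                # eq. (14)


mixed = ContextGate(512)(torch.randn(2, 10, 512), torch.randn(2, 10, 512))
```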
Finally, during decoding, the mixed context representation is obtained from the encoder through a multi-head attention mechanism to produce an output, which is then combined with the previous input and used as the input of the decoder, until the end-of-sequence symbol is output.
The invention has the beneficial effects that:
1. Different from other translation models, the topic information is fused into the chapter-level translation model, providing additional context information for the word embeddings of the chapter model; compared with models that obtain context information only by encoding the context, the sentences translated by this process are of better quality.
2. The words in the chapters are topic-classified by the word-embedding topic model, so the word embeddings can learn latent topic information; compared with other topic models, representing words with word embeddings better uncovers the latent semantic structure of the text.
3. Compared with other neural machine translation models that can only translate independent sentences, the chapter-level neural machine translation model can constrain decoding by encoding the context sentences.
Drawings
FIG. 1 is a detailed flowchart of a chapter-level neural machine translation method incorporating subject information according to the present invention;
FIG. 2 is a schematic diagram of the detailed structure for fusing topic information into word embeddings in the chapter-level neural machine translation method fusing topic information according to the present invention;
FIG. 3 is a conceptual diagram of topic generation in the chapter-level neural machine translation method fusing topic information according to the present invention;
FIG. 4 shows the change in BLEU value as the number of topics increases according to the present invention.
Detailed Description
Example 1: as shown in fig. 1-4, a chapter-level neural machine translation method of fused topic information includes the following specific steps:
Step1, using the open-source IWSLT TED speech bilingual data, and performing related preprocessing such as context alignment and BPE segmentation on the data for training.
Step1.1, obtaining chapter-level bilingual parallel corpora based on the IWSLT TED speech data set. In the experiments of the invention, the Chinese-English data use the IWSLT TED2017 speech data set, which contains about 230,000 sentence pairs as the training set, while the validation set and test set contain 879 and 1,557 parallel sentence pairs respectively; the English-French data use the IWSLT TED2017 speech data set, which contains about 250,000 sentence pairs as the training set, while the validation set and test set contain 890 and 1,210 parallel sentence pairs respectively; the English-German data use the IWSLT TED2017 speech data set, which contains about 200,000 sentence pairs as the training set, while the validation set and test set contain 888 and 993 parallel sentence pairs respectively. The experimental data are summarized in Table 1. In the preprocessing of the experimental data, the JIEBA (https://github.com/JIEBA) Chinese word segmentation tool is first used to segment the Chinese text, and MOSES is then used to tokenize all training data.
TABLE 1 chapter neural machine translation data table with fused subject information
Language pair (IWSLT TED2017) | Training sentence pairs | Validation sentence pairs | Test sentence pairs
Chinese-English | ~230,000 | 879 | 1,557
English-French | ~250,000 | 890 | 1,210
English-German | ~200,000 | 888 | 993
To address the problem of unaligned chapter corpora, the text alignment algorithm Vecalign is used to align the comparable chapter corpora. The algorithm scores one-to-one, one-to-many or many-to-many bilingual sentence pairs with a scoring function based on the normalized cosine distance of multilingual sentence embeddings, and then generates aligned sentence pairs from these scores using a dynamic programming approximation.
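The scoring idea can be illustrated with the sketch below, which computes a cosine distance between multilingual sentence embeddings normalized by the average distance to randomly sampled sentences; this mirrors the idea behind Vecalign's cost function but is not its actual implementation, and the source of the embeddings is left unspecified.

```python
# Illustrative normalized cosine-distance score for candidate sentence pairs
# (the idea behind Vecalign's scoring; not its actual implementation).
import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def normalized_score(src_vec: np.ndarray, tgt_vec: np.ndarray,
                     rand_src: np.ndarray, rand_tgt: np.ndarray) -> float:
    """Cosine distance of the candidate pair, normalized by the average distance
    to randomly sampled sentences; lower is better, fed to the dynamic-programming aligner."""
    num = cosine_distance(src_vec, tgt_vec)
    denom = np.mean([cosine_distance(src_vec, r) for r in rand_tgt] +
                    [cosine_distance(r, tgt_vec) for r in rand_src]) + 1e-9
    return num / denom


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src, tgt = rng.normal(size=256), rng.normal(size=256)
    rand_src, rand_tgt = rng.normal(size=(5, 256)), rng.normal(size=(5, 256))
    print(normalized_score(src, tgt, rand_src, rand_tgt))
```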
Step1.2, removing special symbols from the bilingual corpora and performing word segmentation: the Chinese corpus is segmented with the jieba tool, the English, German and French corpora are tokenized with the MOSES tool and segmented with BPE, and sentences shorter than 6 tokens after segmentation are removed.
Step2, removing low-frequency and high-frequency words from the context sentences of the source-language chapter corpus, generating a vocabulary, training topic word embeddings with the open-source ETM topic word embedding model, and extracting the topic word embeddings.
As a further scheme of the invention, the Step2 comprises the following specific steps:
Step2.1, preprocessing the chapter context sentences after BPE segmentation, removing words that occur fewer than 2 times or more than 800 times, and randomly initializing the word embedding vector of each word.
Step2.2, setting the number of topics to 70-80 based on the ETM (Embedded Topic Model), and training the topic model with the preprocessed context sentences.
Step2.3, the ETM first draws the topic distribution of the m-th document from a logistic-normal distribution, then samples a topic for the n-th word from a categorical distribution over that topic distribution; the vocabulary distribution of a topic is obtained by taking the dot product of the word embedding matrix ρ with the topic embedding α and applying softmax.
The topic generation process is as follows: for each document m, sample the topic probability distribution of the document θ_m = softmax(δ_m); for each word w_{m,n} in document m, select a latent topic z_{m,n} ~ Cat(θ_m) and generate the word
w_{m,n} ~ Cat(softmax(ρ^T α_{z_{m,n}}))
where ρ is an L×V matrix, L is the word embedding size, V is the vocabulary size, θ_m denotes the topic probability distribution of the m-th document, δ_m obeys a normal distribution δ_m ~ N(0, I), z_{m,n} denotes the latent topic of the n-th word of the m-th document, w_{m,n} denotes the n-th word of the m-th document, M denotes the total number of documents, and N_m denotes the total number of words in the m-th document. θ_m follows a logistic-normal distribution, of which I is the parameter. β denotes the word distribution parameter of each topic,
β_k = softmax(ρ^T α_k)
representing the word distribution of each topic.
Step2.4, in the training stage of the topic model, variational inference is adopted to calculate the log marginal likelihood estimate of the model parameters; the loss function is calculated with formulas (1) to (4), and the model parameters ρ, α and ν are then updated by gradient descent.
To calculate the maximum marginal likelihood estimate of ρ and α:
L(ρ, α) = Σ_m log p(w_m | ρ, α) (1)
First, the conditional distribution of each word is calculated:
p(w_{m,n} | δ_m, ρ, α) = Σ_k θ_{m,k} β_{k,w_{m,n}} (2)
The word distribution parameter β_k of topic k is calculated as:
β_k = softmax(ρ^T α_k) (3)
Finally, the log marginal likelihood estimate of ρ, α and ν is calculated with variational inference:
L(ρ, α, ν) = Σ_m ( E_{q(δ_m; w_m, ν)}[ log p(w_m | δ_m, ρ, α) ] - KL( q(δ_m; w_m, ν) || p(δ_m) ) ) (4)
where θ_{m,k} denotes the topic distribution, ν denotes the variational parameters, E_q[·] denotes the mathematical expectation, KL(·) denotes the KL divergence, and q(·) denotes a Gaussian distribution.
Step3, adding the topic word embeddings obtained by training to obtain a single word embedding vector topic_s, and adding it to each word embedding vector of the corresponding context sentence to obtain the final word embedding E = {e_1, e_2, e_3, ..., e_m}; E is taken as the input of the context encoder of the translation model.
As a further scheme of the invention, the Step3 comprises the following specific steps:
Step3.1, first, using the word embedding table obtained by training the topic model, after the context sentence is segmented, the topic word embedding representation of the context sentence is obtained by looking up the vocabulary. Then the word embeddings of the topic word representation are added together to obtain a vector topic_s, as shown in equation (5):
topic_s = Σ_{i=1}^{m} t_i (5)
where t_i is the i-th topic word embedding of the sentence and m is the number of words.
Step3.2, finally, topic_s is added to each word embedding vector x_i of the context sentence to obtain the final context encoder input E = {e_1, e_2, e_3, ..., e_m}, as shown in equation (6):
e_i = x_i + topic_s (6)
Step4, pre-training a Transformer model on context-free bilingual corpora, fixing the encoder and decoder parameters of the Transformer model, introducing an additional context encoder to encode the context sentence and an additional attention layer to associate the context information with the current sentence information of the Transformer, and using a gating mechanism to produce the final output of the encoding end.
As a further scheme of the invention, the Step4 comprises the following specific steps:
Step4.1, given a source language document consisting of a sequence of k sentences X = {x^(1), x^(2), ..., x^(k)}, where x^(k) = {x_1^(k), x_2^(k), ..., x_I^(k)} denotes that the k-th sentence of the source language contains I words, the corresponding target language document is a sequence of k sentences Y = {y^(1), y^(2), ..., y^(k)}, where y^(k) = {y_1^(k), y_2^(k), ..., y_J^(k)} denotes that the k-th sentence of the target language contains J words. The probability of translating the chapter is therefore expressed as shown in formula (7):
P(Y | X) = Π_k Π_m P(y_m^(k) | y_{<m}^(k), x^(k), X^{<k}) (7)
where y_{<m}^(k) denotes the first m-1 translated words and X^{<k} denotes the context sentences of the k-th sentence.
Step4.2, the chapter-level neural machine translation model adopts the same word embedding scheme, encoder and decoder as the Transformer. To reflect the differences and ordering of corpus words at different positions of a sentence, position embeddings are used to add positional features after the word embedding matrix is calculated; the positions are represented by formulas (8) and (9):
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) (8)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) (9)
The core of the encoder is the self-attention part, as shown in equation (10):
Attention(Q, K, V) = softmax(QK^T / √d_k) V (10)
where Q, K and V are the input word vector matrices and d_k is the input vector dimension. The self-attention mechanism calculates the degree of association between each word in the sentence and all words in the sentence, and each word is weighted and summed with the calculated association degrees to obtain a new semantic representation of each word. In addition, multi-head attention is adopted, which expands the model's ability to focus on different positions and gives the attention layer multiple "representation subspaces", as shown in formulas (11) and (12):
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O (11)
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) (12)
Step4.3, the chapter-level machine translation model introduces a context encoder and a context attention layer. The representation of the context sentence and the representation of the current sentence are combined into a new representation through the context attention layer: Q of the context attention layer is the representation matrix of the sentence output by the current-sentence encoder, and K and V are the representation matrices of the context sentence output by the context encoder.
To balance the weights of the context representation and the current sentence representation in the new mixed sentence representation, a context gate is added when calculating the mixed context representation, as shown in formulas (13) and (14):
g_j = σ(W_g [s_j, c_j] + b_g) (13)
s_j = g_j ⊙ s_j + (1 - g_j) ⊙ c_j (14)
where s_j is the output of the current-sentence encoder, c_j is the output of the context attention layer, and σ is the sigmoid function.
Finally, during decoding, the mixed context representation is obtained from the encoder through a multi-head attention mechanism to produce an output, which is then combined with the previous input and used as the input of the decoder, until the end-of-sequence symbol is output.
To better compare with existing work, the invention adopts the common evaluation metric for neural machine translation, the BLEU value, as the standard for measuring model performance; the specific calculation process is shown in formulas (15) to (17):
p_n = Σ_{C ∈ Candidates} Σ_{n-gram ∈ C} Count_clip(n-gram) / Σ_{C' ∈ Candidates} Σ_{n-gram' ∈ C'} Count(n-gram') (15)
BP = 1 if c > r; BP = e^(1 - r/c) if c ≤ r (16)
BLEU = BP · exp( Σ_{n=1}^{N} w_n log p_n ) (17)
where p_n denotes the modified n-gram precision of the text block and Candidate denotes the sentence generated by the neural network. An n-gram is a sequence of n tokens obtained by sliding a window of size n over a sentence. Count_clip(n-gram) denotes the count of an n-gram clipped by its count in the reference translation, and Count(n-gram') denotes the count of n-gram' in the Candidate. The first summation is over all Candidates, and the second is over all n-grams in a Candidate. To keep the score from being biased toward short outputs, BLEU introduces a length penalty factor BP in the final scoring result, where c is the length of the machine translation and r is the length of the reference translation, as given in formula (16).
Because the precision of each n-gram statistic decreases exponentially as the order increases, to balance the contribution of each order the statistics are averaged in geometric-mean form with weights w_n and then multiplied by the length penalty factor to obtain the final BLEU value.
(1) Verification of BLEU improvements on the 6 translation tasks between Chinese and English, English and French, and English and German
To verify the effectiveness of the chapter-level neural machine translation method fusing topic information, the model is compared with five baseline models, which are set as follows:
1) LSTM (Long Short-Term Memory) model: the input sentence is encoded with a multi-layer long short-term memory network and then decoded with another multi-layer long short-term memory network.
2) CNN (Convolutional Neural Network) model: the model is based on a convolutional neural network, introduces an attention mechanism for encoding and decoding, and can process a large number of sequences in parallel.
3) Transformer model: the model uses an attention-based encoder stack and decoder stack to encode and decode isolated sentences.
4) Outside-Context-Aware-Transformer (Outside-CA) model: the model improves the Transformer encoding layer; an additional context encoder encodes the sentence preceding the current sentence to obtain a context representation, which is fused with the encoded representation of the current sentence through a gating mechanism.
5) Inside-Context-Aware-Transformer (Inside-CA) model: the model improves the Transformer decoding layer; the encoder first encodes the sentence preceding the current sentence to obtain a context representation, then in the decoding layer the context representation and the current sentence representation are each combined with the currently decoded sentence through an attention layer to obtain mixed representations, and finally the context representation and the current sentence representation are fused and output through a gating mechanism.
All models were trained and tested using the same data set, with the parameter settings kept consistent, and the experimental results are shown in table 2.
TABLE 2 comparison of machine translation experiments in different methods
The data in Table 2 show that the Topic-Context-Aware-Transformer model, i.e., the Topic-CA model of the present invention, improves the BLEU value by 0.25 on the Chinese-to-English translation task, 0.29 on the English-to-Chinese task, 0.24 on the English-to-French task, 0.27 on the French-to-English task, 0.17 on the English-to-German task, and 0.26 on the German-to-English task, which indicates that the word embeddings trained by the topic model provide significant help in improving the translation quality of the chapter-level translation model. Analysis shows that the topic embeddings carry additional information shared among words of the same topic; by adding the topic information before the encoder encodes the context, the model can fully learn the relation between the current sentence and the context as well as between the current sentence and the chapter topic, so that the translation generated by the decoder better conforms to the context and the topic of the chapter, improving model performance.
(2) Verification of the impact of different topic numbers on BLEU values
To verify the performance of the topic-information fusion method under different numbers of topics, the experiments set the number of topics on the test set to {3, 5, 10, 15, 20, 25, 30, 40, 60, 80, 100, 150} to compare how the topic word embeddings generated with an increasing number of topics affect the BLEU value of the translations produced by the model of the present invention, as shown in FIG. 4. The BLEU value first rises as the number of topics increases and reaches its highest value of 14.79 when the number of topics is 5. After the number of topics exceeds 5, the overall trend of the BLEU value is to decrease first, then recover, and finally approach a constant value, which indicates that with an appropriate number of topics the topic words obtained by topic model training contain more effective associated topic information, thereby helping the context encoder obtain more effective context information and improving the translation quality of the model to a certain extent. On the other hand, an excessive number of topics fragments the word embedding information, which reduces the translation quality of the model.
(3) Example analysis
To intuitively reflect the influence of the topic-information fusion method on translation accuracy, the translation results of the translation models in the Chinese-to-English and English-to-Chinese directions are taken as examples below, and the influence of the source-language topic information on the translations generated by the model is analyzed. The translation quality comparison is shown in Table 3. In the Chinese-to-English examples, in example sentence 1 the Transformer model and the CA-Transformer model mistranslate the word rendered as "entopica", while the method of the present invention translates it correctly. In example sentence 2, the Transformer model and the CA-Transformer model translate the source word as "vision", whereas the translation produced by the method of the present invention is closer in meaning to the corresponding word in the reference translation, which shows that by introducing topic word embeddings the translation model can learn information about topic-related words, exerting a certain constraining effect on the noun words of the generated translation. Although the method still omits some content compared with the reference translation, the translation quality is greatly improved compared with the baseline models.
TABLE 3 analysis of machine translation examples of Chinese-English and English-Chinese by different methods
While the present invention has been described in detail with reference to the embodiments, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (5)

1. A chapter-level neural machine translation method fusing topic information is characterized by comprising the following steps: the method comprises the following specific steps:
step1, using bilingual data, and performing context alignment and BPE word segmentation related preprocessing on the bilingual data for training;
step2, removing low-frequency words and high-frequency words from context sentences of the source language chapter corpus, generating a word list, training topic word embeddings by using an open-source ETM (Embedded Topic Model) topic word embedding model, and extracting the topic word embeddings;
step3, performing vector addition on the subject word embedding obtained by training to obtain a single word embedding vector, and performing addition on the single word embedding vector and each vector of the corresponding context sentence to obtain final word embedding, wherein the final word embedding is used as the input of a context coder of a translation model;
step4, pre-training a Transformer model on context-free bilingual corpora, fixing the encoder and decoder parameters of the Transformer model, introducing an additional context encoder to encode the context sentence and an additional attention layer to associate the context information with the current sentence information of the Transformer, and using a gating mechanism to produce the final output of the encoding end.
2. The chapter-level neural machine translation method fusing topic information according to claim 1, characterized in that: the specific steps of Step1 are as follows:
step1.1, obtaining chapter-level bilingual parallel corpora based on a TED speech data set of IWSLT;
step1.2, removing special symbols from the bilingual corpora and performing word segmentation: the Chinese corpus is segmented with the jieba tool, the English, German and French corpora are tokenized with the MOSES tool and segmented with BPE, and sentences shorter than 6 tokens after segmentation are removed.
3. The discourse-level neural machine translation method fusing topic information according to claim 1, wherein: the specific steps of Step2 are as follows:
step2.1, preprocessing the context sentences of the chapters after BPE segmentation, removing words that occur fewer than 2 times or more than 800 times, and randomly initializing the word embedding vector of each word;
step2.2, setting the number of the topics to be 70-80 based on an ETM model, and training the topic model by using the preprocessed context sentences;
step2.3, the ETM first draws the topic distribution of the m-th document from a logistic-normal distribution, then samples a topic for the n-th word from a categorical distribution over that topic distribution; the vocabulary distribution of a topic is obtained by taking the dot product of the word embedding matrix ρ with the topic embedding α and applying softmax;
the topic generation process is as follows: for each document m, sample the topic probability distribution of the document θ_m = softmax(δ_m); for each word w_{m,n} in document m, select a latent topic z_{m,n} ~ Cat(θ_m) and generate the word
w_{m,n} ~ Cat(softmax(ρ^T α_{z_{m,n}}))
where ρ is an L×V matrix, L is the word embedding size, V is the vocabulary size, θ_m denotes the topic probability distribution of the m-th document, δ_m obeys a normal distribution δ_m ~ N(0, I), z_{m,n} denotes the latent topic of the n-th word of the m-th document, w_{m,n} denotes the n-th word of the m-th document, and θ_m follows a logistic-normal distribution, of which I is the parameter;
step2.4, in the training stage of the topic model, variational inference is adopted to calculate the log marginal likelihood estimate of the model parameters; the loss function is calculated with formulas (1) to (4), and the model parameters ρ, α and ν are then updated by gradient descent;
to calculate the maximum marginal likelihood estimate of ρ and α:
L(ρ, α) = Σ_m log p(w_m | ρ, α) (1)
first, the conditional distribution of each word is calculated:
p(w_{m,n} | δ_m, ρ, α) = Σ_k θ_{m,k} β_{k,w_{m,n}} (2)
the word distribution parameter β_k of topic k is calculated as:
β_k = softmax(ρ^T α_k) (3)
finally, the log marginal likelihood estimate of ρ, α and ν is calculated with variational inference:
L(ρ, α, ν) = Σ_m ( E_{q(δ_m; w_m, ν)}[ log p(w_m | δ_m, ρ, α) ] - KL( q(δ_m; w_m, ν) || p(δ_m) ) ) (4)
where θ_{m,k} denotes the topic distribution, ν denotes the variational parameters, E_q[·] denotes the mathematical expectation, KL(·) denotes the KL divergence, and q(·) denotes a Gaussian distribution.
4. The chapter-level neural machine translation method fusing topic information according to claim 1, characterized in that: the specific steps of Step3 are as follows:
step3.1, first, using the word embedding table obtained by training the topic model, after the context sentence is segmented, the topic word embedding representation of the context sentence is obtained by looking up the vocabulary; then the word embeddings of the topic word representation are added together to obtain a single word embedding vector topic_s, as shown in equation (5):
topic_s = Σ_{i=1}^{m} t_i (5)
where t_i is the i-th topic word embedding of the sentence and m is the number of words;
step3.2, finally, topic_s is added to each word embedding vector x_i of the context sentence to obtain the final context encoder input E = {e_1, e_2, e_3, ..., e_m}, as shown in equation (6):
e_i = x_i + topic_s (6).
5. The chapter-level neural machine translation method fusing topic information according to claim 1, characterized in that: the specific steps of Step4 are as follows:
step4.1, given a source language document consisting of a sequence of k sentences X = {x^(1), x^(2), ..., x^(k)}, where x^(k) = {x_1^(k), x_2^(k), ..., x_I^(k)} denotes that the k-th sentence of the source language contains I words, the corresponding target language document is a sequence of k sentences Y = {y^(1), y^(2), ..., y^(k)}, where y^(k) = {y_1^(k), y_2^(k), ..., y_J^(k)} denotes that the k-th sentence of the target language contains J words, so the probability of translating the chapter is expressed as shown in formula (7):
P(Y | X) = Π_k Π_m P(y_m^(k) | y_{<m}^(k), x^(k), X^{<k}) (7)
where y_{<m}^(k) denotes the first m-1 translated words and X^{<k} denotes the context sentences of the k-th sentence;
step4.2, the chapter-level neural machine translation model adopts the same word embedding scheme, encoder and decoder as the Transformer; to reflect the differences and ordering of corpus words at different positions of a sentence, position embeddings are used to add positional features after the word embedding matrix is calculated, and the positions are represented by formulas (8) and (9):
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) (8)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) (9)
the core of the encoder is the self-attention part, as shown in equation (10):
Attention(Q, K, V) = softmax(QK^T / √d_k) V (10)
where Q, K and V are the input word vector matrices and d_k is the input vector dimension; the self-attention mechanism calculates the degree of association between each word in the sentence and all words in the sentence, and each word is weighted and summed with the calculated association degrees to obtain a new semantic representation of each word; in addition, multi-head attention is adopted, which expands the model's ability to focus on different positions and gives the attention layer multiple "representation subspaces", as shown in formulas (11) and (12):
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O (11)
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) (12)
the Step4.3, chapter machine translation model introduces a context encoder and a context attention layer, the representation of the context sentence and the representation of the current sentence are subjected to new representation through the context attention layer, Q of the context attention layer is a representation matrix of the sentence output by the current sentence encoder, and K and V are representation matrices of the context sentence output by the context encoder.
In order to balance the weight of the new sentence representation after mixing the context representations with the current sentence representation, a context gate is added in the calculation of the mixed context representation, as shown in formula (13) and formula (14):
g j =σ(W g [s j ,c j ]+b g ) (13)
s j =g j ⊙s j +(1-g j )⊙c j (14)
wherein s is j Is the output of the current sentence encoder, c j Is the output of the context attention layer and σ is the sigmoid function.
Finally, during decoding, the mixed context representation is obtained from the encoder through a multi-head attention mechanism to produce an output, which is then combined with the previous input and used as the input of the decoder, until the end-of-sequence symbol is output.
CN202210665757.7A 2022-06-14 2022-06-14 Chapter-level neural machine translation method integrating theme information Active CN115048946B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210665757.7A CN115048946B (en) 2022-06-14 2022-06-14 Chapter-level neural machine translation method integrating theme information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210665757.7A CN115048946B (en) 2022-06-14 2022-06-14 Chapter-level neural machine translation method integrating theme information

Publications (2)

Publication Number Publication Date
CN115048946A true CN115048946A (en) 2022-09-13
CN115048946B CN115048946B (en) 2024-06-21

Family

ID=83162369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210665757.7A Active CN115048946B (en) 2022-06-14 2022-06-14 Chapter-level neural machine translation method integrating theme information

Country Status (1)

Country Link
CN (1) CN115048946B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100306249A1 (en) * 2009-05-27 2010-12-02 James Hill Social network systems and methods
CN112287698A (en) * 2020-12-25 2021-01-29 北京百度网讯科技有限公司 Chapter translation method and device, electronic equipment and storage medium
CN112541340A (en) * 2020-12-18 2021-03-23 昆明理工大学 Weak supervision involved microblog evaluation object identification method based on variation double-theme representation
CN112948558A (en) * 2021-03-10 2021-06-11 中国人民解放军国防科技大学 Method and device for generating context-enhanced problems facing open domain dialog system
US20210286810A1 (en) * 2020-03-10 2021-09-16 Korea Advanced Institute Of Science And Technology Method And Apparatus For Generating Context Category Dataset
WO2022028689A1 (en) * 2020-08-05 2022-02-10 Siemens Aktiengesellschaft Method for a language modeling and device supporting the same
CN114357976A (en) * 2022-01-12 2022-04-15 合肥工业大学 Multi-round dialog generation method and system based on information enhancement
CN114385802A (en) * 2022-01-10 2022-04-22 重庆邮电大学 Common-emotion conversation generation method integrating theme prediction and emotion inference
CN114595700A (en) * 2021-12-20 2022-06-07 昆明理工大学 Zero-pronoun and chapter information fused Hanyue neural machine translation method

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100306249A1 (en) * 2009-05-27 2010-12-02 James Hill Social network systems and methods
US20210286810A1 (en) * 2020-03-10 2021-09-16 Korea Advanced Institute Of Science And Technology Method And Apparatus For Generating Context Category Dataset
WO2022028689A1 (en) * 2020-08-05 2022-02-10 Siemens Aktiengesellschaft Method for a language modeling and device supporting the same
CN112541340A (en) * 2020-12-18 2021-03-23 昆明理工大学 Weak supervision involved microblog evaluation object identification method based on variation double-theme representation
CN112287698A (en) * 2020-12-25 2021-01-29 北京百度网讯科技有限公司 Chapter translation method and device, electronic equipment and storage medium
CN112948558A (en) * 2021-03-10 2021-06-11 中国人民解放军国防科技大学 Method and device for generating context-enhanced problems facing open domain dialog system
CN114595700A (en) * 2021-12-20 2022-06-07 昆明理工大学 Zero-pronoun and chapter information fused Hanyue neural machine translation method
CN114385802A (en) * 2022-01-10 2022-04-22 重庆邮电大学 Common-emotion conversation generation method integrating theme prediction and emotion inference
CN114357976A (en) * 2022-01-12 2022-04-15 合肥工业大学 Multi-round dialog generation method and system based on information enhancement

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
KEHAI CHEN et al.: "Neural Machine Translation With Sentence-Level Topic Context", IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, vol. 27, no. 12, 23 August 2019 (2019-08-23), pages 1970, XP011744799, DOI: 10.1109/TASLP.2019.2937190 *
YUAN JIABIN: "Research and Application of Document Semantic Representation Methods", China Master's Theses Full-text Database, Information Science and Technology, 15 April 2021 (2021-04-15), pages 138-913 *
CHEN XIWEN: "Research on Chapter-Level Neural Machine Translation Methods Fusing Chapter Information", China Master's Theses Full-text Database, Information Science and Technology, 10 May 2024 (2024-05-10), pages 1-73 *
CHEN XIWEN et al.: "Chapter-Level Neural Machine Translation Fusing Topic Information", Journal of Yunnan University (Natural Sciences Edition), vol. 45, no. 6, 11 April 2023 (2023-04-11), pages 1197-1207 *

Also Published As

Publication number Publication date
CN115048946B (en) 2024-06-21

Similar Documents

Publication Publication Date Title
CN110348016B (en) Text abstract generation method based on sentence correlation attention mechanism
CN109635124B (en) Remote supervision relation extraction method combined with background knowledge
CN111651589B (en) Two-stage text abstract generation method for long document
CN110688862A (en) Mongolian-Chinese inter-translation method based on transfer learning
CN106484682A (en) Based on the machine translation method of statistics, device and electronic equipment
CN105068997B (en) The construction method and device of parallel corpora
WO2007041117A1 (en) Weighted linear model
CN113743133B (en) Chinese cross-language abstracting method integrating word granularity probability mapping information
CN111274827B (en) Suffix translation method based on multi-target learning of word bag
CN110427619B (en) Chinese text automatic proofreading method based on multi-channel fusion and reordering
CN110717345A (en) Translation realignment recurrent neural network cross-language machine translation method
CN113468895A (en) Non-autoregressive neural machine translation method based on decoder input enhancement
CN116663578A (en) Neural machine translation method based on strategy gradient method improvement
Liu Neural question generation based on Seq2Seq
CN110502759B (en) Method for processing Chinese-Yue hybrid network neural machine translation out-of-set words fused into classification dictionary
CN112380882B (en) Mongolian Chinese neural machine translation method with error correction function
CN114595700A (en) Zero-pronoun and chapter information fused Hanyue neural machine translation method
CN116956946B (en) Machine translation text fine granularity error type identification and positioning method
CN113033153A (en) Neural machine translation model fusing key information based on Transformer model
CN111274826B (en) Semantic information fusion-based low-frequency word translation method
CN113627172A (en) Entity identification method and system based on multi-granularity feature fusion and uncertain denoising
CN115017924B (en) Construction of neural machine translation model for cross-language translation and translation method thereof
CN115048946B (en) Chapter-level neural machine translation method integrating theme information
CN115659172A (en) Generation type text summarization method based on key information mask and copy
CN114648024A (en) Chinese cross-language abstract generation method based on multi-type word information guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant