CN115048946A - Chapter-level neural machine translation method fusing topic information - Google Patents

Chapter-level neural machine translation method fusing topic information

Info

Publication number
CN115048946A
CN115048946A (application CN202210665757.7A)
Authority
CN
China
Prior art keywords
word
context
sentence
model
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210665757.7A
Other languages
Chinese (zh)
Other versions
CN115048946B (en)
Inventor
余正涛
陈玺文
高盛祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202210665757.7A priority Critical patent/CN115048946B/en
Publication of CN115048946A publication Critical patent/CN115048946A/en
Application granted granted Critical
Publication of CN115048946B publication Critical patent/CN115048946B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/42 Data-driven translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a chapter-level neural machine translation method fusing topic information, and belongs to the field of natural language processing. The chapter-level parallel corpus is preprocessed and segmented with BPE; the topics of the source-language chapters are then trained with a word-embedding topic model, the chapter text is vectorized to obtain the word embedding of each word, and at the encoding end of the neural machine translation model the word embeddings produced by the topic model are added to the source-language word embeddings as input, after which the translation model is trained. The method uses the topic model to obtain topic information and fuses it into the source-language encoding in the form of word embeddings, providing more context information at the encoding stage and alleviating the pronoun consistency problem in chapter-level neural machine translation; compared with the Context-Aware Transformer model, it improves the BLEU value by 0.26, 0.27 and 0.29 on the English-German, English-French and Chinese-English language pairs respectively.

Description

Chapter-level neural machine translation method fusing topic information
Technical Field
The invention relates to a chapter-level neural machine translation method fusing topic information, and belongs to the technical field of natural language processing.
Background
Chapter-level neural machine translation builds on independent sentence translation by adding context information, which alleviates the coherence and coreference problems in the chapter-level translation process and improves the translation quality of the model; it therefore has important application value.
In recent years, research on chapter-level machine translation at home and abroad has focused on two aspects: extracting context feature information on the one hand, and combining the context information with the translation model on the other. In research on obtaining context feature information, existing methods encode the source-language context and the local or global context of the target language.
In research on combining context information with the translation model, some methods store the most recent context hidden states through a cache mechanism to guide the generation of the translation; other methods obtain more word-to-word and sentence-to-sentence information through a hierarchical network and selectively use it during encoding or decoding. Although existing chapter-level translation models can learn local context from explicit collocations to handle word sense ambiguity, accurately translating words in implicit collocations remains a challenge, that is, phrases consisting of two or more words cannot be translated accurately. Although the evaluation metrics of these chapter-level models have improved, the models still perform poorly on pronoun consistency, and this problem can be effectively alleviated by integrating topic information into the translation model. The present method uses the ETM topic model to train on the chapter corpus and obtain the word embeddings of the words corresponding to the topics of the chapter; the key point is that these embeddings supply, for each original word, context information about the words related to its topic, thereby improving the consistency of chapter-level neural machine translation.
Disclosure of Invention
The invention provides a chapter-level neural machine translation method fused with topic information, which is used for solving the problem of inconsistent pronoun translation across contexts in the chapter-level translation process.
The technical scheme of the invention is as follows: the chapter-level neural machine translation method fusing the theme information comprises the following specific steps:
Step1, using the open-source IWSLT TED speech bilingual data, and performing related preprocessing such as context alignment and BPE segmentation on the data for training.
Step2, removing low-frequency and high-frequency words from the context sentences of the source-language chapter corpus, generating a vocabulary, training topic word embeddings with the open-source ETM topic word embedding model, and extracting the topic word embeddings.
Step3, adding the topic word embeddings obtained by training to obtain a single word embedding vector topic_s, and adding it to each word embedding vector of the corresponding context sentence to obtain the final word embedding E = {e_1, e_2, e_3, ..., e_m}; E is taken as the input of the context encoder of the translation model.
Step4, pre-training a Transformer model on context-free bilingual corpora, fixing the encoder and decoder parameters of the Transformer model, introducing an additional context encoder to encode the context sentence and an additional attention layer to associate the context information with the current sentence information of the Transformer, and using a gating mechanism to produce the final output of the encoding end.
As a further scheme of the invention, the Step1 comprises the following specific steps:
Step1.1, obtaining chapter-level bilingual parallel corpora based on the IWSLT TED speech data set.
Step1.2, removing special symbols from the bilingual corpora and performing word segmentation: the Chinese corpus is segmented with the jieba tool, the English, German and French corpora are tokenized with the MOSES tool and segmented with BPE, and sentences shorter than 6 tokens after segmentation are removed.
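The preprocessing in Step1.2 can be sketched in Python as follows. This is a minimal illustration that assumes the jieba package is available, uses a simple regular expression for special-symbol removal, and leaves Moses tokenization and BPE to the standard external tools (tokenizer.perl, subword-nmt); the helper names and thresholds here are illustrative rather than part of the original disclosure.

```python
# Illustrative preprocessing sketch (Step 1.2); assumes the jieba package is installed.
# Moses tokenization and BPE are applied afterwards with the usual external tools.
import re
import jieba

MIN_LEN = 6  # drop sentences shorter than 6 tokens after segmentation


def clean(text: str) -> str:
    # remove special symbols, keeping word characters, whitespace and basic punctuation
    return re.sub(r"[^\w\s.,!?;:()'-]", " ", text).strip()


def segment(text: str, lang: str) -> list[str]:
    text = clean(text)
    if lang == "zh":
        return [tok for tok in jieba.lcut(text) if tok.strip()]
    # placeholder whitespace tokenization for en/de/fr before Moses/BPE
    return text.split()


def filter_pair(src: str, tgt: str, src_lang: str, tgt_lang: str):
    s, t = segment(src, src_lang), segment(tgt, tgt_lang)
    if len(s) < MIN_LEN or len(t) < MIN_LEN:
        return None  # discard too-short sentence pairs
    return " ".join(s), " ".join(t)


if __name__ == "__main__":
    print(filter_pair("这是一个用于演示的例句", "This is an example sentence for the demo", "zh", "en"))
```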
As a further scheme of the invention, the Step2 comprises the following specific steps:
Step2.1, preprocessing the chapter context sentences after BPE segmentation, removing words that occur fewer than 2 times or more than 800 times, and randomly initializing the word embedding vector of each word.
Step2.2, setting the number of topics to 70-80 based on the ETM (Embedded Topic Model), and training the topic model with the preprocessed context sentences.
Step2.3, the ETM first draws the topic distribution of the m-th document from a logistic-normal distribution, then samples a topic for the n-th word from a categorical distribution over that topic distribution; the vocabulary distribution of a topic is obtained by taking the dot product of the word embedding matrix ρ with the topic embedding α and applying softmax.
The topic generation process is as follows: for each document m, sample the topic probability distribution of the document θ_m = softmax(δ_m); for each word w_{m,n} in document m, select a latent topic z_{m,n} ~ Cat(θ_m) and generate the word
w_{m,n} ~ Cat(softmax(ρ^T α_{z_{m,n}}))
where ρ is an L×V matrix, L is the word embedding size, V is the vocabulary size, θ_m denotes the topic probability distribution of the m-th document, δ_m obeys a normal distribution δ_m ~ N(0, I), z_{m,n} denotes the latent topic of the n-th word of the m-th document, w_{m,n} denotes the n-th word of the m-th document, M denotes the total number of documents, and N_m denotes the total number of words in the m-th document. θ_m follows a logistic-normal distribution, of which I is the parameter. β denotes the word distribution parameter of each topic,
β_k = softmax(ρ^T α_k)
representing the word distribution of each topic.
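The generative process described above can be illustrated with the following toy NumPy sketch; the dimensions (K topics, V vocabulary words, L-dimensional embeddings) and the random setup are assumptions for illustration, not the training configuration of the invention.

```python
# Toy sketch of the ETM generative process described above (assumed sizes).
import numpy as np

rng = np.random.default_rng(0)
K, V, L, N_m = 4, 1000, 32, 20          # topics, vocab size, embedding size, words in doc m

rho = rng.normal(size=(L, V))           # word embedding matrix (L x V)
alpha = rng.normal(size=(K, L))         # topic embeddings alpha_k


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


delta_m = rng.normal(size=K)            # delta_m ~ N(0, I)
theta_m = softmax(delta_m)              # topic probability distribution of document m

words = []
for _ in range(N_m):
    z = rng.choice(K, p=theta_m)                 # z_{m,n} ~ Cat(theta_m)
    beta_z = softmax(rho.T @ alpha[z])           # word distribution of topic z
    words.append(rng.choice(V, p=beta_z))        # w_{m,n} ~ Cat(softmax(rho^T alpha_z))
print(words[:10])
```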
Step2.4, in the training stage of the topic model, variational inference is adopted to calculate the log marginal likelihood estimate of the model parameters; the loss function is calculated with formulas (1) to (4), and the model parameters ρ, α and ν are then updated by gradient descent.
To calculate the maximum marginal likelihood estimate of ρ and α:
L(ρ, α) = Σ_m log p(w_m | ρ, α) (1)
First, the conditional distribution of each word is calculated:
p(w_{m,n} | δ_m, ρ, α) = Σ_k θ_{m,k} β_{k,w_{m,n}} (2)
The word distribution parameter β_k of topic k is calculated as:
β_k = softmax(ρ^T α_k) (3)
Finally, the log marginal likelihood estimate of ρ, α and ν is calculated with variational inference:
L(ρ, α, ν) = Σ_m ( E_{q(δ_m; w_m, ν)}[ log p(w_m | δ_m, ρ, α) ] - KL( q(δ_m; w_m, ν) || p(δ_m) ) ) (4)
where θ_{m,k} denotes the topic distribution, ν denotes the variational parameters, E_q[·] denotes the mathematical expectation, KL(·) denotes the KL divergence, and q(·) denotes a Gaussian distribution.
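The variational objective of formulas (1) to (4) can be sketched in PyTorch as follows; the encoder architecture, hidden size and variable names are illustrative assumptions, and only the loss structure (reconstruction term plus KL divergence to the standard normal prior) follows the description above.

```python
# Sketch of the ETM variational objective (ELBO) from formulas (1)-(4); assumed architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ETMSketch(nn.Module):
    def __init__(self, vocab_size: int, n_topics: int, embed_dim: int, hidden: int = 256):
        super().__init__()
        self.rho = nn.Parameter(torch.randn(vocab_size, embed_dim))    # word embeddings (V x L)
        self.alpha = nn.Parameter(torch.randn(n_topics, embed_dim))    # topic embeddings (K x L)
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, n_topics)        # variational mean
        self.logvar = nn.Linear(hidden, n_topics)    # variational log-variance (nu)

    def forward(self, bow):                          # bow: (batch, V) word counts
        h = self.encoder(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        delta = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterized delta_m
        theta = F.softmax(delta, dim=-1)                              # theta_m = softmax(delta_m)
        beta = F.softmax(self.alpha @ self.rho.t(), dim=-1)           # beta_k = softmax(rho^T alpha_k), eq. (3)
        log_word_prob = torch.log(theta @ beta + 1e-10)               # log p(w | delta, rho, alpha), eq. (2)
        recon = -(bow * log_word_prob).sum(dim=-1)                    # negative reconstruction term
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)   # KL(q(delta) || N(0, I)), eq. (4)
        return (recon + kl).mean()                                    # negative ELBO to minimize


loss = ETMSketch(vocab_size=2000, n_topics=75, embed_dim=300)(torch.rand(8, 2000))
```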
As a further scheme of the invention, the Step3 comprises the following specific steps:
Step3.1, first, using the word embedding table obtained by training the topic model, after the context sentence is segmented, the topic word embedding representation of the context sentence is obtained by looking up the vocabulary. Then the word embeddings of the topic word representation are added together to obtain a vector topic_s, as shown in equation (5):
topic_s = Σ_{i=1}^{m} t_i (5)
where t_i is the i-th topic word embedding of the sentence and m is the number of words.
Step3.2, finally, topic_s is added to each word embedding vector x_i of the context sentence to obtain the final context encoder input E = {e_1, e_2, e_3, ..., e_m}, as shown in equation (6):
e_i = x_i + topic_s (6)
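A minimal sketch of the fusion in formulas (5) and (6), assuming the topic word embeddings t_i and the context-sentence word embeddings x_i are already available as tensors:

```python
# Sketch of formulas (5)-(6): fuse topic word embeddings into context word embeddings.
import torch


def fuse_topic(topic_word_embeds: torch.Tensor, context_word_embeds: torch.Tensor) -> torch.Tensor:
    """topic_word_embeds: (m, d) embeddings t_i looked up from the ETM table;
    context_word_embeds: (m, d) word embeddings x_i of the context sentence."""
    topic_s = topic_word_embeds.sum(dim=0)   # eq. (5); a mean over the m words is an equally plausible reading
    return context_word_embeds + topic_s     # eq. (6): e_i = x_i + topic_s, broadcast over the m words


E = fuse_topic(torch.randn(12, 512), torch.randn(12, 512))   # E = {e_1, ..., e_m}, context encoder input
```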
as a further scheme of the invention, the Step4 comprises the following specific steps:
Step4.1, given a source language document consisting of a sequence of k sentences X = {x^(1), x^(2), ..., x^(k)}, where x^(k) = {x_1^(k), x_2^(k), ..., x_I^(k)} denotes that the k-th sentence of the source language contains I words, the corresponding target language document is a sequence of k sentences Y = {y^(1), y^(2), ..., y^(k)}, where y^(k) = {y_1^(k), y_2^(k), ..., y_J^(k)} denotes that the k-th sentence of the target language contains J words. The probability of translating the chapter is therefore expressed as shown in formula (7):
P(Y | X) = Π_k Π_m P(y_m^(k) | y_{<m}^(k), x^(k), X^{<k}) (7)
where y_{<m}^(k) denotes the first m-1 translated words and X^{<k} denotes the context sentences of the k-th sentence.
Step4.2, the chapter-level neural machine translation model adopts the same word embedding scheme, encoder and decoder as the Transformer. To reflect the differences and ordering of corpus words at different positions of a sentence, position embeddings are used to add positional features after the word embedding matrix is calculated; the positions are represented by formulas (8) and (9):
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) (8)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) (9)
The core of the encoder is the self-attention part, as shown in equation (10):
Attention(Q, K, V) = softmax(QK^T / √d_k) V (10)
where Q, K and V are the input word vector matrices and d_k is the input vector dimension. The self-attention mechanism calculates the degree of association between each word in the sentence and all words in the sentence, and each word is weighted and summed with the calculated association degrees to obtain a new semantic representation of each word. In addition, multi-head attention is adopted, which expands the model's ability to focus on different positions and gives the attention layer multiple "representation subspaces", as shown in formulas (11) and (12):
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O (11)
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) (12)
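For reference, a compact PyTorch sketch of the scaled dot-product attention of equation (10) is given below; multi-head attention as in equations (11) and (12) is available in practice as torch.nn.MultiheadAttention.

```python
# Scaled dot-product attention, equation (10).
import math
import torch
import torch.nn.functional as F


def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # association degree between words
    return F.softmax(scores, dim=-1) @ V                # weighted sum -> new semantic representation


out = attention(torch.randn(2, 10, 64), torch.randn(2, 10, 64), torch.randn(2, 10, 64))
```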
the Step4.3, chapter machine translation model introduces a context encoder and a context attention layer, the representation of the context sentence and the representation of the current sentence are subjected to new representation through the context attention layer, Q of the context attention layer is a representation matrix of the sentence output by the current sentence encoder, and K and V are representation matrices of the context sentence output by the context encoder.
In order to balance the weight of the new sentence representation after mixing the context representations with the current sentence representation, a context gate is added in the calculation of the mixed context representation, as shown in formula (13) and formula (14):
g j =σ(W g [s j ,c j ]+b g ) (13)
s j =g j ⊙s j +(1-g j )⊙c j (14)
wherein s is j Is the output of the current sentence encoder, c j Is the output of the context attention layer and σ is the sigmoid function.
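The context gate of formulas (13) and (14) can be sketched as follows; the module layout and dimensions are assumptions for illustration.

```python
# Context gate, formulas (13)-(14): mix current-sentence and context representations.
import torch
import torch.nn as nn


class ContextGate(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.w_g = nn.Linear(2 * d_model, d_model)   # W_g and b_g

    def forward(self, s: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        """s: current-sentence encoder output s_j; c: context attention output c_j."""
        g = torch.sigmoid(self.w_g(torch.cat([s, c], dim=-1)))   # eq. (13)
        return g * s + (1 - g) * c                                # eq. (14)


mixed = ContextGate(512)(torch.randn(2, 10, 512), torch.randn(2, 10, 512))
```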
Finally, during decoding, the mixed context representation is obtained from the encoder through a multi-head attention mechanism to produce an output, which is then combined with the previous input and used as the input of the decoder, until the end-of-sequence symbol is output.
The invention has the beneficial effects that:
1. Different from other translation models, the topic information is fused into the chapter-level translation model, providing additional context information for the word embeddings of the chapter model; compared with models that obtain context information only by encoding the context, the sentences translated by this process are of better quality.
2. The words in the chapters are topic-classified by the word-embedding topic model, so the word embeddings can learn latent topic information; compared with other topic models, representing words with word embeddings better uncovers the latent semantic structure of the text.
3. Compared with other neural machine translation models that can only translate independent sentences, the chapter-level neural machine translation model can constrain decoding by encoding the context sentences.
Drawings
FIG. 1 is a detailed flowchart of a chapter-level neural machine translation method incorporating subject information according to the present invention;
FIG. 2 is a schematic diagram of the detailed structure for fusing topic information into word embeddings in the chapter-level neural machine translation method fusing topic information according to the present invention;
FIG. 3 is a conceptual diagram of topic generation in the chapter-level neural machine translation method fusing topic information according to the present invention;
FIG. 4 shows the change in BLEU value as the number of topics increases according to the present invention.
Detailed Description
Example 1: as shown in fig. 1-4, a chapter-level neural machine translation method of fused topic information includes the following specific steps:
Step1, using the open-source IWSLT TED speech bilingual data, and performing related preprocessing such as context alignment and BPE segmentation on the data for training.
Step1.1, obtaining chapter-level bilingual parallel corpora based on the IWSLT TED speech data set. In the experiments of the invention, the Chinese-English data use the IWSLT TED2017 speech data set, which contains about 230,000 sentence pairs as the training set, while the validation set and test set contain 879 and 1,557 parallel sentence pairs respectively; the English-French data use the IWSLT TED2017 speech data set, which contains about 250,000 sentence pairs as the training set, while the validation set and test set contain 890 and 1,210 parallel sentence pairs respectively; the English-German data use the IWSLT TED2017 speech data set, which contains about 200,000 sentence pairs as the training set, while the validation set and test set contain 888 and 993 parallel sentence pairs respectively. The experimental data are summarized in Table 1. In the preprocessing of the experimental data, the JIEBA (https://github.com/JIEBA) Chinese word segmentation tool is first used to segment the Chinese text, and MOSES is then used to tokenize all training data.
TABLE 1 chapter neural machine translation data table with fused subject information
Language pair (IWSLT TED2017) | Training sentence pairs | Validation sentence pairs | Test sentence pairs
Chinese-English | ~230,000 | 879 | 1,557
English-French | ~250,000 | 890 | 1,210
English-German | ~200,000 | 888 | 993
To address the problem of unaligned chapter corpora, the text alignment algorithm Vecalign is used to align the comparable chapter corpora. The algorithm scores one-to-one, one-to-many or many-to-many bilingual sentence pairs with a scoring function based on the normalized cosine distance of multilingual sentence embeddings, and then generates aligned sentence pairs from these scores using a dynamic programming approximation.
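The scoring idea can be illustrated with the sketch below, which computes a cosine distance between multilingual sentence embeddings normalized by the average distance to randomly sampled sentences; this mirrors the idea behind Vecalign's cost function but is not its actual implementation, and the source of the embeddings is left unspecified.

```python
# Illustrative normalized cosine-distance score for candidate sentence pairs
# (the idea behind Vecalign's scoring; not its actual implementation).
import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def normalized_score(src_vec: np.ndarray, tgt_vec: np.ndarray,
                     rand_src: np.ndarray, rand_tgt: np.ndarray) -> float:
    """Cosine distance of the candidate pair, normalized by the average distance
    to randomly sampled sentences; lower is better, fed to the dynamic-programming aligner."""
    num = cosine_distance(src_vec, tgt_vec)
    denom = np.mean([cosine_distance(src_vec, r) for r in rand_tgt] +
                    [cosine_distance(r, tgt_vec) for r in rand_src]) + 1e-9
    return num / denom


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src, tgt = rng.normal(size=256), rng.normal(size=256)
    rand_src, rand_tgt = rng.normal(size=(5, 256)), rng.normal(size=(5, 256))
    print(normalized_score(src, tgt, rand_src, rand_tgt))
```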
Step1.2, removing special symbols from the bilingual corpora and performing word segmentation: the Chinese corpus is segmented with the jieba tool, the English, German and French corpora are tokenized with the MOSES tool and segmented with BPE, and sentences shorter than 6 tokens after segmentation are removed.
Step2, removing low-frequency and high-frequency words from the context sentences of the source-language chapter corpus, generating a vocabulary, training topic word embeddings with the open-source ETM topic word embedding model, and extracting the topic word embeddings.
As a further scheme of the invention, the Step2 comprises the following specific steps:
Step2.1, preprocessing the chapter context sentences after BPE segmentation, removing words that occur fewer than 2 times or more than 800 times, and randomly initializing the word embedding vector of each word.
Step2.2, setting the number of topics to 70-80 based on the ETM (Embedded Topic Model), and training the topic model with the preprocessed context sentences.
Step2.3, the ETM first draws the topic distribution of the m-th document from a logistic-normal distribution, then samples a topic for the n-th word from a categorical distribution over that topic distribution; the vocabulary distribution of a topic is obtained by taking the dot product of the word embedding matrix ρ with the topic embedding α and applying softmax.
The topic generation process is as follows: for each document m, sample the topic probability distribution of the document θ_m = softmax(δ_m); for each word w_{m,n} in document m, select a latent topic z_{m,n} ~ Cat(θ_m) and generate the word
w_{m,n} ~ Cat(softmax(ρ^T α_{z_{m,n}}))
where ρ is an L×V matrix, L is the word embedding size, V is the vocabulary size, θ_m denotes the topic probability distribution of the m-th document, δ_m obeys a normal distribution δ_m ~ N(0, I), z_{m,n} denotes the latent topic of the n-th word of the m-th document, w_{m,n} denotes the n-th word of the m-th document, M denotes the total number of documents, and N_m denotes the total number of words in the m-th document. θ_m follows a logistic-normal distribution, of which I is the parameter. β denotes the word distribution parameter of each topic,
β_k = softmax(ρ^T α_k)
representing the word distribution of each topic.
Step2.4, in the training stage of the topic model, variational inference is adopted to calculate the log marginal likelihood estimate of the model parameters; the loss function is calculated with formulas (1) to (4), and the model parameters ρ, α and ν are then updated by gradient descent.
To calculate the maximum marginal likelihood estimate of ρ and α:
L(ρ, α) = Σ_m log p(w_m | ρ, α) (1)
First, the conditional distribution of each word is calculated:
p(w_{m,n} | δ_m, ρ, α) = Σ_k θ_{m,k} β_{k,w_{m,n}} (2)
The word distribution parameter β_k of topic k is calculated as:
β_k = softmax(ρ^T α_k) (3)
Finally, the log marginal likelihood estimate of ρ, α and ν is calculated with variational inference:
L(ρ, α, ν) = Σ_m ( E_{q(δ_m; w_m, ν)}[ log p(w_m | δ_m, ρ, α) ] - KL( q(δ_m; w_m, ν) || p(δ_m) ) ) (4)
where θ_{m,k} denotes the topic distribution, ν denotes the variational parameters, E_q[·] denotes the mathematical expectation, KL(·) denotes the KL divergence, and q(·) denotes a Gaussian distribution.
Step3, adding the topic word embeddings obtained by training to obtain a single word embedding vector topic_s, and adding it to each word embedding vector of the corresponding context sentence to obtain the final word embedding E = {e_1, e_2, e_3, ..., e_m}; E is taken as the input of the context encoder of the translation model.
As a further scheme of the invention, the Step3 comprises the following specific steps:
Step3.1, first, using the word embedding table obtained by training the topic model, after the context sentence is segmented, the topic word embedding representation of the context sentence is obtained by looking up the vocabulary. Then the word embeddings of the topic word representation are added together to obtain a vector topic_s, as shown in equation (5):
topic_s = Σ_{i=1}^{m} t_i (5)
where t_i is the i-th topic word embedding of the sentence and m is the number of words.
Step3.2, finally, topic_s is added to each word embedding vector x_i of the context sentence to obtain the final context encoder input E = {e_1, e_2, e_3, ..., e_m}, as shown in equation (6):
e_i = x_i + topic_s (6)
Step4, pre-training a Transformer model on context-free bilingual corpora, fixing the encoder and decoder parameters of the Transformer model, introducing an additional context encoder to encode the context sentence and an additional attention layer to associate the context information with the current sentence information of the Transformer, and using a gating mechanism to produce the final output of the encoding end.
As a further scheme of the invention, the Step4 comprises the following specific steps:
Step4.1, given a source language document consisting of a sequence of k sentences X = {x^(1), x^(2), ..., x^(k)}, where x^(k) = {x_1^(k), x_2^(k), ..., x_I^(k)} denotes that the k-th sentence of the source language contains I words, the corresponding target language document is a sequence of k sentences Y = {y^(1), y^(2), ..., y^(k)}, where y^(k) = {y_1^(k), y_2^(k), ..., y_J^(k)} denotes that the k-th sentence of the target language contains J words. The probability of translating the chapter is therefore expressed as shown in formula (7):
P(Y | X) = Π_k Π_m P(y_m^(k) | y_{<m}^(k), x^(k), X^{<k}) (7)
where y_{<m}^(k) denotes the first m-1 translated words and X^{<k} denotes the context sentences of the k-th sentence.
Step4.2, the chapter-level neural machine translation model adopts the same word embedding scheme, encoder and decoder as the Transformer. To reflect the differences and ordering of corpus words at different positions of a sentence, position embeddings are used to add positional features after the word embedding matrix is calculated; the positions are represented by formulas (8) and (9):
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) (8)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) (9)
The core of the encoder is the self-attention part, as shown in equation (10):
Attention(Q, K, V) = softmax(QK^T / √d_k) V (10)
where Q, K and V are the input word vector matrices and d_k is the input vector dimension. The self-attention mechanism calculates the degree of association between each word in the sentence and all words in the sentence, and each word is weighted and summed with the calculated association degrees to obtain a new semantic representation of each word. In addition, multi-head attention is adopted, which expands the model's ability to focus on different positions and gives the attention layer multiple "representation subspaces", as shown in formulas (11) and (12):
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O (11)
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) (12)
Step4.3, the chapter-level machine translation model introduces a context encoder and a context attention layer. The representation of the context sentence and the representation of the current sentence are combined into a new representation through the context attention layer: Q of the context attention layer is the representation matrix of the sentence output by the current-sentence encoder, and K and V are the representation matrices of the context sentence output by the context encoder.
To balance the weights of the context representation and the current sentence representation in the new mixed sentence representation, a context gate is added when calculating the mixed context representation, as shown in formulas (13) and (14):
g_j = σ(W_g [s_j, c_j] + b_g) (13)
s_j = g_j ⊙ s_j + (1 - g_j) ⊙ c_j (14)
where s_j is the output of the current-sentence encoder, c_j is the output of the context attention layer, and σ is the sigmoid function.
Finally, during decoding, the mixed context representation is obtained from the encoder through a multi-head attention mechanism to produce an output, which is then combined with the previous input and used as the input of the decoder, until the end-of-sequence symbol is output.
To better compare with existing work, the invention adopts the common evaluation metric for neural machine translation, the BLEU value, as the standard for measuring model performance; the specific calculation process is shown in formulas (15) to (17):
p_n = Σ_{C ∈ Candidates} Σ_{n-gram ∈ C} Count_clip(n-gram) / Σ_{C' ∈ Candidates} Σ_{n-gram' ∈ C'} Count(n-gram') (15)
BP = 1 if c > r; BP = e^(1 - r/c) if c ≤ r (16)
BLEU = BP · exp( Σ_{n=1}^{N} w_n log p_n ) (17)
where p_n denotes the modified n-gram precision of the text block and Candidate denotes the sentence generated by the neural network. An n-gram is a sequence of n tokens obtained by sliding a window of size n over a sentence. Count_clip(n-gram) denotes the count of an n-gram clipped by its count in the reference translation, and Count(n-gram') denotes the count of n-gram' in the Candidate. The first summation is over all Candidates, and the second is over all n-grams in a Candidate. To keep the score from being biased toward short outputs, BLEU introduces a length penalty factor BP in the final scoring result, where c is the length of the machine translation and r is the length of the reference translation, as given in formula (16).
Because the precision of each n-gram statistic decreases exponentially as the order increases, to balance the contribution of each order the statistics are averaged in geometric-mean form with weights w_n and then multiplied by the length penalty factor to obtain the final BLEU value.
(1) Verification of BLEU improvements on the 6 translation tasks between Chinese and English, English and French, and English and German
To verify the effectiveness of the chapter-level neural machine translation method fusing topic information, the model is compared with five baseline models, which are set as follows:
1) LSTM (Long Short-Term Memory) model: the input sentence is encoded with a multi-layer long short-term memory network and then decoded with another multi-layer long short-term memory network.
2) CNN (Convolutional Neural Network) model: the model is based on a convolutional neural network, introduces an attention mechanism for encoding and decoding, and can process a large number of sequences in parallel.
3) Transformer model: the model uses an attention-based encoder stack and decoder stack to encode and decode isolated sentences.
4) Outside-Context-Aware-Transformer (Outside-CA) model: the model improves the Transformer encoding layer; an additional context encoder encodes the sentence preceding the current sentence to obtain a context representation, which is fused with the encoded representation of the current sentence through a gating mechanism.
5) Inside-Context-Aware-Transformer (Inside-CA) model: the model improves the Transformer decoding layer; the encoder first encodes the sentence preceding the current sentence to obtain a context representation, then in the decoding layer the context representation and the current sentence representation are each combined with the currently decoded sentence through an attention layer to obtain mixed representations, and finally the context representation and the current sentence representation are fused and output through a gating mechanism.
All models were trained and tested using the same data set, with the parameter settings kept consistent, and the experimental results are shown in table 2.
TABLE 2 comparison of machine translation experiments in different methods
The data in Table 2 show that the Topic-Context-Aware-Transformer model, i.e., the Topic-CA model of the present invention, improves the BLEU value by 0.25 on the Chinese-to-English translation task, 0.29 on the English-to-Chinese task, 0.24 on the English-to-French task, 0.27 on the French-to-English task, 0.17 on the English-to-German task, and 0.26 on the German-to-English task, which indicates that the word embeddings trained by the topic model provide significant help in improving the translation quality of the chapter-level translation model. Analysis shows that the topic embeddings carry additional information shared among words of the same topic; by adding the topic information before the encoder encodes the context, the model can fully learn the relation between the current sentence and the context as well as between the current sentence and the chapter topic, so that the translation generated by the decoder better conforms to the context and the topic of the chapter, improving model performance.
(2) Verification of the impact of different topic numbers on BLEU values
To verify the performance of the topic-information fusion method under different numbers of topics, the experiments set the number of topics on the test set to {3, 5, 10, 15, 20, 25, 30, 40, 60, 80, 100, 150} to compare how the topic word embeddings generated with an increasing number of topics affect the BLEU value of the translations produced by the model of the present invention, as shown in FIG. 4. The BLEU value first rises as the number of topics increases and reaches its highest value of 14.79 when the number of topics is 5. After the number of topics exceeds 5, the overall trend of the BLEU value is to decrease first, then recover, and finally approach a constant value, which indicates that with an appropriate number of topics the topic words obtained by topic model training contain more effective associated topic information, thereby helping the context encoder obtain more effective context information and improving the translation quality of the model to a certain extent. On the other hand, an excessive number of topics fragments the word embedding information, which reduces the translation quality of the model.
(3) Example analysis
To intuitively reflect the influence of the topic-information fusion method on translation accuracy, the translation results of the translation models in the Chinese-to-English and English-to-Chinese directions are taken as examples below, and the influence of the source-language topic information on the translations generated by the model is analyzed. The translation quality comparison is shown in Table 3. In the Chinese-to-English examples, in example sentence 1 the Transformer model and the CA-Transformer model mistranslate the word rendered as "entopica", while the method of the present invention translates it correctly. In example sentence 2, the Transformer model and the CA-Transformer model translate the source word as "vision", whereas the translation produced by the method of the present invention is closer in meaning to the corresponding word in the reference translation, which shows that by introducing topic word embeddings the translation model can learn information about topic-related words, exerting a certain constraining effect on the noun words of the generated translation. Although the method still omits some content compared with the reference translation, the translation quality is greatly improved compared with the baseline models.
TABLE 3 analysis of machine translation examples of Chinese-English and English-Chinese by different methods
While the present invention has been described in detail with reference to the embodiments, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (5)

1. A chapter-level neural machine translation method fusing topic information is characterized by comprising the following steps: the method comprises the following specific steps:
step1, using bilingual data, and performing context alignment and BPE word segmentation related preprocessing on the bilingual data for training;
step2, removing low-frequency words and high-frequency words from context sentences of the source language chapter corpus, generating a word list, training topic word embeddings by using an open-source ETM (Embedded Topic Model) topic word embedding model, and extracting the topic word embeddings;
step3, performing vector addition on the subject word embedding obtained by training to obtain a single word embedding vector, and performing addition on the single word embedding vector and each vector of the corresponding context sentence to obtain final word embedding, wherein the final word embedding is used as the input of a context coder of a translation model;
step4, pre-training a Transformer model on context-free bilingual corpora, fixing the encoder and decoder parameters of the Transformer model, introducing an additional context encoder to encode the context sentence and an additional attention layer to associate the context information with the current sentence information of the Transformer, and using a gating mechanism to produce the final output of the encoding end.
2. The chapter-level neural machine translation method fusing topic information according to claim 1, characterized in that: the specific steps of Step1 are as follows:
step1.1, obtaining chapter-level bilingual parallel corpora based on a TED speech data set of IWSLT;
step1.2, removing special symbols from the bilingual corpora and performing word segmentation: the Chinese corpus is segmented with the jieba tool, the English, German and French corpora are tokenized with the MOSES tool and segmented with BPE, and sentences shorter than 6 tokens after segmentation are removed.
3. The discourse-level neural machine translation method fusing topic information according to claim 1, wherein: the specific steps of Step2 are as follows:
step2.1, preprocessing the context sentences of the chapters after BPE segmentation, removing words that occur fewer than 2 times or more than 800 times, and randomly initializing the word embedding vector of each word;
step2.2, setting the number of the topics to be 70-80 based on an ETM model, and training the topic model by using the preprocessed context sentences;
step2.3, the ETM first draws the topic distribution of the m-th document from a logistic-normal distribution, then samples a topic for the n-th word from a categorical distribution over that topic distribution; the vocabulary distribution of a topic is obtained by taking the dot product of the word embedding matrix ρ with the topic embedding α and applying softmax;
the topic generation process is as follows: for each document m, sample the topic probability distribution of the document θ_m = softmax(δ_m); for each word w_{m,n} in document m, select a latent topic z_{m,n} ~ Cat(θ_m) and generate the word
w_{m,n} ~ Cat(softmax(ρ^T α_{z_{m,n}}))
where ρ is an L×V matrix, L is the word embedding size, V is the vocabulary size, θ_m denotes the topic probability distribution of the m-th document, δ_m obeys a normal distribution δ_m ~ N(0, I), z_{m,n} denotes the latent topic of the n-th word of the m-th document, w_{m,n} denotes the n-th word of the m-th document, and θ_m follows a logistic-normal distribution, of which I is the parameter;
step2.4, in the training stage of the topic model, variational inference is adopted to calculate the log marginal likelihood estimate of the model parameters; the loss function is calculated with formulas (1) to (4), and the model parameters ρ, α and ν are then updated by gradient descent;
to calculate the maximum marginal likelihood estimate of ρ and α:
L(ρ, α) = Σ_m log p(w_m | ρ, α) (1)
first, the conditional distribution of each word is calculated:
p(w_{m,n} | δ_m, ρ, α) = Σ_k θ_{m,k} β_{k,w_{m,n}} (2)
the word distribution parameter β_k of topic k is calculated as:
β_k = softmax(ρ^T α_k) (3)
finally, the log marginal likelihood estimate of ρ, α and ν is calculated with variational inference:
L(ρ, α, ν) = Σ_m ( E_{q(δ_m; w_m, ν)}[ log p(w_m | δ_m, ρ, α) ] - KL( q(δ_m; w_m, ν) || p(δ_m) ) ) (4)
where θ_{m,k} denotes the topic distribution, ν denotes the variational parameters, E_q[·] denotes the mathematical expectation, KL(·) denotes the KL divergence, and q(·) denotes a Gaussian distribution.
4. The chapter-level neural machine translation method fusing topic information according to claim 1, characterized in that: the specific steps of Step3 are as follows:
step3.1, first, using the word embedding table obtained by training the topic model, after the context sentence is segmented, the topic word embedding representation of the context sentence is obtained by looking up the vocabulary; then the word embeddings of the topic word representation are added together to obtain a single word embedding vector topic_s, as shown in equation (5):
topic_s = Σ_{i=1}^{m} t_i (5)
where t_i is the i-th topic word embedding of the sentence and m is the number of words;
step3.2, finally, topic_s is added to each word embedding vector x_i of the context sentence to obtain the final context encoder input E = {e_1, e_2, e_3, ..., e_m}, as shown in equation (6):
e_i = x_i + topic_s (6).
5. The chapter-level neural machine translation method fusing topic information according to claim 1, characterized in that: the specific steps of Step4 are as follows:
step4.1, given a source language document consisting of a sequence of k sentences X = {x^(1), x^(2), ..., x^(k)}, where x^(k) = {x_1^(k), x_2^(k), ..., x_I^(k)} denotes that the k-th sentence of the source language contains I words, the corresponding target language document is a sequence of k sentences Y = {y^(1), y^(2), ..., y^(k)}, where y^(k) = {y_1^(k), y_2^(k), ..., y_J^(k)} denotes that the k-th sentence of the target language contains J words, so the probability of translating the chapter is expressed as shown in formula (7):
P(Y | X) = Π_k Π_m P(y_m^(k) | y_{<m}^(k), x^(k), X^{<k}) (7)
where y_{<m}^(k) denotes the first m-1 translated words and X^{<k} denotes the context sentences of the k-th sentence;
step4.2, the chapter-level neural machine translation model adopts the same word embedding scheme, encoder and decoder as the Transformer; to reflect the differences and ordering of corpus words at different positions of a sentence, position embeddings are used to add positional features after the word embedding matrix is calculated, and the positions are represented by formulas (8) and (9):
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) (8)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) (9)
the core of the encoder is the self-attention part, as shown in equation (10):
Attention(Q, K, V) = softmax(QK^T / √d_k) V (10)
where Q, K and V are the input word vector matrices and d_k is the input vector dimension; the self-attention mechanism calculates the degree of association between each word in the sentence and all words in the sentence, and each word is weighted and summed with the calculated association degrees to obtain a new semantic representation of each word; in addition, multi-head attention is adopted, which expands the model's ability to focus on different positions and gives the attention layer multiple "representation subspaces", as shown in formulas (11) and (12):
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O (11)
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) (12)
the Step4.3, chapter machine translation model introduces a context encoder and a context attention layer, the representation of the context sentence and the representation of the current sentence are subjected to new representation through the context attention layer, Q of the context attention layer is a representation matrix of the sentence output by the current sentence encoder, and K and V are representation matrices of the context sentence output by the context encoder.
In order to balance the weight of the new sentence representation after mixing the context representations with the current sentence representation, a context gate is added in the calculation of the mixed context representation, as shown in formula (13) and formula (14):
g j =σ(W g [s j ,c j ]+b g ) (13)
s j =g j ⊙s j +(1-g j )⊙c j (14)
wherein s is j Is the output of the current sentence encoder, c j Is the output of the context attention layer and σ is the sigmoid function.
Finally, during decoding, the mixed context representation is obtained from the encoder through a multi-head attention mechanism to produce an output, which is then combined with the previous input and used as the input of the decoder, until the end-of-sequence symbol is output.
CN202210665757.7A 2022-06-14 2022-06-14 Chapter-level neural machine translation method integrating theme information Active CN115048946B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210665757.7A CN115048946B (en) 2022-06-14 2022-06-14 Chapter-level neural machine translation method integrating theme information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210665757.7A CN115048946B (en) 2022-06-14 2022-06-14 Chapter-level neural machine translation method integrating theme information

Publications (2)

Publication Number Publication Date
CN115048946A true CN115048946A (en) 2022-09-13
CN115048946B CN115048946B (en) 2024-06-21

Family

ID=83162369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210665757.7A Active CN115048946B (en) 2022-06-14 2022-06-14 Chapter-level neural machine translation method integrating theme information

Country Status (1)

Country Link
CN (1) CN115048946B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100306249A1 (en) * 2009-05-27 2010-12-02 James Hill Social network systems and methods
CN112287698A (en) * 2020-12-25 2021-01-29 北京百度网讯科技有限公司 Chapter translation method and device, electronic equipment and storage medium
CN112541340A (en) * 2020-12-18 2021-03-23 昆明理工大学 Weak supervision involved microblog evaluation object identification method based on variation double-theme representation
CN112948558A (en) * 2021-03-10 2021-06-11 中国人民解放军国防科技大学 Method and device for generating context-enhanced problems facing open domain dialog system
US20210286810A1 (en) * 2020-03-10 2021-09-16 Korea Advanced Institute Of Science And Technology Method And Apparatus For Generating Context Category Dataset
WO2022028689A1 (en) * 2020-08-05 2022-02-10 Siemens Aktiengesellschaft Method for a language modeling and device supporting the same
CN114357976A (en) * 2022-01-12 2022-04-15 合肥工业大学 Multi-round dialog generation method and system based on information enhancement
CN114385802A (en) * 2022-01-10 2022-04-22 重庆邮电大学 Common-emotion conversation generation method integrating theme prediction and emotion inference
CN114595700A (en) * 2021-12-20 2022-06-07 昆明理工大学 Zero-pronoun and chapter information fused Hanyue neural machine translation method

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100306249A1 (en) * 2009-05-27 2010-12-02 James Hill Social network systems and methods
US20210286810A1 (en) * 2020-03-10 2021-09-16 Korea Advanced Institute Of Science And Technology Method And Apparatus For Generating Context Category Dataset
WO2022028689A1 (en) * 2020-08-05 2022-02-10 Siemens Aktiengesellschaft Method for a language modeling and device supporting the same
CN112541340A (en) * 2020-12-18 2021-03-23 昆明理工大学 Weak supervision involved microblog evaluation object identification method based on variation double-theme representation
CN112287698A (en) * 2020-12-25 2021-01-29 北京百度网讯科技有限公司 Chapter translation method and device, electronic equipment and storage medium
CN112948558A (en) * 2021-03-10 2021-06-11 中国人民解放军国防科技大学 Method and device for generating context-enhanced problems facing open domain dialog system
CN114595700A (en) * 2021-12-20 2022-06-07 昆明理工大学 Zero-pronoun and chapter information fused Hanyue neural machine translation method
CN114385802A (en) * 2022-01-10 2022-04-22 重庆邮电大学 Common-emotion conversation generation method integrating theme prediction and emotion inference
CN114357976A (en) * 2022-01-12 2022-04-15 合肥工业大学 Multi-round dialog generation method and system based on information enhancement

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
KEHAI CHEN et al.: "Neural Machine Translation With Sentence-Level Topic Context", IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, vol. 27, no. 12, 23 August 2019 (2019-08-23), pages 1970, XP011744799, DOI: 10.1109/TASLP.2019.2937190 *
YUAN JIABIN: "Research and Application of Document Semantic Representation Methods", China Master's Theses Full-text Database, Information Science and Technology, 15 April 2021 (2021-04-15), pages 138-913 *
CHEN XIWEN: "Research on Chapter-Level Neural Machine Translation Methods Fusing Chapter Information", China Master's Theses Full-text Database, Information Science and Technology, 10 May 2024 (2024-05-10), pages 1-73 *
CHEN XIWEN et al.: "Chapter-Level Neural Machine Translation Fusing Topic Information", Journal of Yunnan University (Natural Sciences Edition), vol. 45, no. 6, 11 April 2023 (2023-04-11), pages 1197-1207 *

Also Published As

Publication number Publication date
CN115048946B (en) 2024-06-21

Similar Documents

Publication Publication Date Title
CN110348016B (en) Text abstract generation method based on sentence correlation attention mechanism
CN109635124B (en) Remote supervision relation extraction method combined with background knowledge
CN111651589B (en) Two-stage text abstract generation method for long document
CN110688862A (en) Mongolian-Chinese inter-translation method based on transfer learning
CN106484682A (en) Based on the machine translation method of statistics, device and electronic equipment
CN105068997B (en) The construction method and device of parallel corpora
WO2007041117A1 (en) Weighted linear model
CN113743133B (en) Chinese cross-language abstracting method integrating word granularity probability mapping information
CN111274827B (en) Suffix translation method based on multi-target learning of word bag
CN110427619B (en) Chinese text automatic proofreading method based on multi-channel fusion and reordering
CN110717345A (en) Translation realignment recurrent neural network cross-language machine translation method
CN113468895A (en) Non-autoregressive neural machine translation method based on decoder input enhancement
CN116663578A (en) Neural machine translation method based on strategy gradient method improvement
Liu Neural question generation based on Seq2Seq
CN110502759B (en) Method for processing Chinese-Yue hybrid network neural machine translation out-of-set words fused into classification dictionary
CN112380882B (en) Mongolian Chinese neural machine translation method with error correction function
CN114595700A (en) Zero-pronoun and chapter information fused Hanyue neural machine translation method
CN116956946B (en) Machine translation text fine granularity error type identification and positioning method
CN113033153A (en) Neural machine translation model fusing key information based on Transformer model
CN111274826B (en) Semantic information fusion-based low-frequency word translation method
CN113627172A (en) Entity identification method and system based on multi-granularity feature fusion and uncertain denoising
CN115017924B (en) Construction of neural machine translation model for cross-language translation and translation method thereof
CN115048946B (en) Chapter-level neural machine translation method integrating theme information
CN115659172A (en) Generation type text summarization method based on key information mask and copy
CN114648024A (en) Chinese cross-language abstract generation method based on multi-type word information guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant