CN113887249B - Mongolian neural machine translation method based on dependency syntax information and Transformer model - Google Patents

Mongolian neural machine translation method based on dependency syntax information and Transformer model

Info

Publication number
CN113887249B
CN113887249B (application CN202111113538.XA)
Authority
CN
China
Prior art keywords
mongolian
matrix
chinese
model
dependency syntax
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111113538.XA
Other languages
Chinese (zh)
Other versions
CN113887249A (en
Inventor
仁庆道尔吉
程坤
庞蕊
刘馨远
麻泽蕊
尹玉娟
吉亚图
苏依拉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University of Technology
Original Assignee
Inner Mongolia University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University of Technology filed Critical Inner Mongolia University of Technology
Priority to CN202111113538.XA priority Critical patent/CN113887249B/en
Publication of CN113887249A publication Critical patent/CN113887249A/en
Application granted granted Critical
Publication of CN113887249B publication Critical patent/CN113887249B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Mongolian neural machine translation method based on dependency syntax information and a Transformer model, comprising the following steps: extracting Chinese dependency syntax information from the Mongolian-Chinese parallel corpus and converting it into an adjacency matrix; adding an extra output path to the Transformer model, so that one path predicts the Chinese target sentence while the other path learns Chinese dependency syntax knowledge; and explicitly adding a module for learning Mongolian syntax information on the encoder side, using Chinese syntax knowledge to reinforce the learning of Mongolian syntax, whereas the original Transformer model by default only learns the syntax information of the source and target languages implicitly.

Description

Mongolian neural machine translation method based on dependency syntax information and Transformer model
Technical Field
The invention belongs to the technical field of artificial intelligence and natural language processing, relates to end-to-end translation from Mongolian to Chinese, and in particular relates to a Mongolian neural machine translation method based on dependency syntax information and a Transformer model.
Background
Different nations have different cultures, beliefs and customs, and strengthening communication among them is an important way to promote friendship. The Mongolian people are members of the large family of the Chinese nation, and Mongolian-Chinese translation can effectively promote exchange between Mongolian and Chinese cultures. Only when language is no longer an obstacle can technology and culture truly spread: through Mongolian-Chinese translation, the outstanding culture, art, knowledge, customs and beliefs of the Mongolian people can be propagated, and more people can come to know Mongolian culture. Mongolian-Chinese translation thus effectively promotes cultural exchange between the Mongolian and Chinese nations and plays a promoting role in the development of Mongolian culture and the comprehensive development of the Chinese nation.
Compared with machine translation for other languages, Mongolian-Chinese machine translation has fewer results and weaker frontier research. Its research methods have developed along with the innovation of machine translation technology worldwide, generally shifting from rule-based to statistical approaches, and from statistical to neural machine translation; in this process, researchers have also fused two or more methods to improve translation quality. At present, machine translation is mainly implemented with neural machine translation technology, and its application to Mongolian-Chinese translation still leaves much room for improvement.
Mongolian has many words with the same semantics but different word forms, which can prevent the dictionary used in neural machine translation from containing enough Mongolian words and prevent the translation model from learning good word vectors, thereby degrading the performance of neural machine translation.
Disclosure of Invention
(I) Technical problems solved
In order to overcome the defects of the prior art, the invention aims to provide a Mongolian neural machine translation method based on dependency syntax information and a Transformer model. A Mongolian syntax learning module is explicitly added to the encoder of the Transformer, and the adjacency matrix of the Chinese dependency syntax tree is added after the decoder to guide the model to learn Mongolian syntax; after training is finished, Mongolian syntax information can be recovered from the Mongolian syntax learning module.
(II) technical scheme
A Mongolian-Chinese machine translation method based on dependency syntax information and a Transformer model comprises the following steps. First, dependency syntax analysis is performed with Stanford CoreNLP on the Chinese corpus in the Mongolian-Chinese parallel corpus training set to obtain dependency syntax trees, which are then converted into adjacency matrices made symmetric along the main diagonal, namely the Chinese dependency syntax adjacency matrices. Second, the Transformer model is improved: a unidirectional LSTM recurrent unit is added after the output of the last decoder, the output matrix of the unidirectional LSTM recurrent unit and the Chinese dependency syntax adjacency matrix are subtracted element-wise to obtain a loss value L2, and L2 is added to the cross-entropy loss value L1 of the Transformer model's original output to form the total loss value of the model; in addition, the encoder of the Transformer model is improved by setting a matrix for learning Mongolian syntax in the encoder. Finally, the model is trained with the training set, fine-tuned with the validation set, and evaluated with the test set.
Preferably, the flow of extracting the Chinese dependency syntax adjacency matrix is as follows:
step 1: dividing the Mongolian parallel corpus data set into a training set, a verification set and a test set;
step 2: performing dependency syntax analysis on the Chinese corpus in the Mongolian parallel corpus training set by utilizing Stanford CoreNLP to obtain a Chinese dependency syntax tree;
Step 3: convert the Chinese dependency syntax tree obtained in step 2 into an adjacency matrix for storage and make it symmetric along the main diagonal, yielding the Chinese dependency syntax adjacency matrix.
Preferably, the process of improving the Transformer model, adding one output path and adding a matrix for learning Mongolian grammar on the encoder is as follows:
step 1: modify the Transformer model into dual outputs, adding a unidirectional LSTM recurrent unit after the last decoder;
Step 2: subtract the output matrix of the unidirectional LSTM recurrent unit and the Chinese dependency syntax adjacency matrix element-wise to obtain a loss value L2, and add L2 to the cross-entropy loss value L1 of the Transformer model's original output to form the total loss value of the model;
step 3: set a matrix for learning Mongolian grammar in the encoder of the Transformer model.
Preferably, for the LSTM recurrent unit in step 1: the input of the LSTM is the output of the decoder, and the size of its output matrix is [batch_size × seq_len × seq_len].
Preferably, in step 2, the loss value L2 obtained by element-wise subtraction of the output matrix of the unidirectional LSTM recurrent unit and the corresponding Chinese dependency syntax adjacency matrix is summed with the loss of the other path and used as the total loss of the improved Transformer model.
Preferably, the encoder in step 3 adds a matrix for learning Mongolian grammar. The Transformer encoder uses a self-attention mechanism, whose formula is:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
To strengthen the model's learning of source-language grammar information, the original formula is changed to add a learnable matrix M to the attention scores:
Attention(Q, K, V) = softmax(QK^T / √d_k + M) V
The M matrix is used for learning Mongolian grammar; after being guided by the Chinese dependency syntax information, the Mongolian syntax information matrix forms different weights among the Mongolian word segments, thereby strengthening the connections among them.
(III) beneficial effects
The invention provides a Mongolian-Chinese neural machine translation method based on dependency syntax information and a Transformer model, which has the following effects: the invention explicitly uses the syntax information of the target-language side to guide the model to learn the syntax information of the source language, making translation between Mongolian and Chinese more convenient, facilitating communication between the Mongolian and Chinese peoples, and promoting the establishment of friendly relations between the two groups.
Drawings
FIG. 1 is a flow chart of the translation method system of the present invention;
fig. 2 is a system diagram of the Transformer modified to dual outputs.
Detailed Description
FIG. 1 shows an embodiment of a Mongolian-Chinese machine translation method based on dependency syntax information and a Transformer model. First, dependency syntax analysis is performed with Stanford CoreNLP on the training set divided from the Mongolian-Chinese parallel corpus to obtain dependency syntax trees, which are converted into adjacency matrices for storage; the resulting matrices are made symmetric along the main diagonal to obtain the Chinese dependency syntax adjacency matrices. Second, the Transformer model is modified: a Mongolian grammar learning matrix is added to the encoder, and a unidirectional LSTM recurrent unit is added after the last decoder; the element-wise difference between the output matrix of the LSTM and the corresponding Chinese dependency syntax adjacency matrix serves as the loss value of this path, and the sum of the two paths' loss values serves as the total loss value of the model.
Specific:
1. the process of constructing the Chinese dependency adjacency matrix:
step 1: dividing the Mongolian parallel corpus data set into a training set, a verification set and a test set;
step 2: performing dependency syntax analysis on the Chinese corpus in the Mongolian parallel corpus training set by utilizing Stanford CoreNLP to obtain a Chinese dependency syntax tree;
Step 3: convert the Chinese dependency syntax tree obtained in step 2 into an adjacency matrix for storage and make it symmetric along the main diagonal, yielding the Chinese dependency syntax adjacency matrix.
When Stanford CoreNLP performs dependency syntax analysis, it first segments the sentence into words and then analyzes the relations between words. Each relation is a triple: the first element is the relation between the two words, the second element is the index of the head (parent) word, and the third element is the index of the dependent (child) word. An example:
The sentence is: i want to eat the fried shredded potatoes.
Stanford CoreNLP segments it into the following words: 'I', 'want', 'eat', 'fried', 'shredded potatoes', '.'. The resulting dependency syntax analysis is: [('ROOT', 0, 2), ('nsubj', 2, 1), ('ccomp', 2, 3), ('conj', 3, 4), ('dobj', 4, 5), ('punct', 2, 6)].
It is converted into an adjacency matrix for storage according to the following rules: first, construct a square matrix whose length and width equal the number of segmented words; for every relation except the ROOT relation, use the second and third elements as coordinates and set a marker bit 1 at the corresponding position of the square matrix, filling unmarked positions with 0; after traversing all relations, make the square matrix symmetric along its main diagonal. The Chinese dependency syntax adjacency matrix of the sentence is:
0 1 0 0 0 0
1 0 1 0 0 1
0 1 0 1 0 0
0 0 1 0 1 0
0 0 0 1 0 0
0 1 0 0 0 0
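The storage rule above can be sketched in a few lines of Python. This is illustrative code, not taken from the patent; the function name is hypothetical, and NumPy is used for the matrix:

```python
import numpy as np

def dependency_adjacency(triples, num_words):
    """Build a symmetric adjacency matrix from dependency triples.

    Each triple is (relation, head_index, child_index) with 1-based word
    indices, as produced by Stanford CoreNLP; the ROOT relation (head
    index 0) is skipped, matching the rule described in the text above.
    """
    m = np.zeros((num_words, num_words), dtype=np.float32)
    for rel, head, child in triples:
        if rel == "ROOT":
            continue
        m[head - 1, child - 1] = 1.0   # marker bit at the (head, child) coordinate
    return np.maximum(m, m.T)          # symmetrize along the main diagonal

# The example parse from the text above ("I want to eat fried shredded potatoes."):
triples = [("ROOT", 0, 2), ("nsubj", 2, 1), ("ccomp", 2, 3),
           ("conj", 3, 4), ("dobj", 4, 5), ("punct", 2, 6)]
adj = dependency_adjacency(triples, 6)
```

Running this on the example sentence reproduces the 6 × 6 matrix shown above, with each of the five non-ROOT relations contributing a symmetric pair of 1s.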
2. FIG. 2 is a block diagram of the Transformer modified to dual outputs. The process of modifying the Transformer model, adding one output path and adding a matrix for learning Mongolian grammar to the encoder is:
step 1: modify the Transformer model into dual outputs, adding a unidirectional LSTM recurrent unit after the last decoder;
Step 2: subtract the output matrix of the unidirectional LSTM recurrent unit and the Chinese dependency syntax adjacency matrix element-wise to obtain a loss value L2, and add L2 to the cross-entropy loss value L1 of the Transformer model's original output to form the total loss value of the model;
step 3: set a matrix for learning Mongolian grammar in the encoder of the Transformer model.
In step 1, for the LSTM recurrent unit: the input of the LSTM is the output of the decoder, and the size of its output matrix is [batch_size × seq_len × seq_len].
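The dimension mapping described here, [batch_size × seq_len × d_model] in and [batch_size × seq_len × seq_len] out, can be sketched with PyTorch's `nn.LSTM` by setting `hidden_size` equal to the sentence length. This is an assumed implementation, not the patent's code, and the concrete dimensions are illustrative:

```python
import torch
import torch.nn as nn

# Illustrative dimensions (not from the patent).
batch_size, seq_len, d_model = 4, 10, 16

# Unidirectional LSTM head added after the last decoder: because
# hidden_size == seq_len, each time step emits a seq_len-sized row,
# so the whole output forms a [batch, seq_len, seq_len] matrix that
# can be compared with the dependency adjacency matrix.
syntax_head = nn.LSTM(
    input_size=d_model,    # word-vector embedding dimension
    hidden_size=seq_len,   # hidden neurons = training sentence length
    num_layers=1,
    bias=True,
    batch_first=True,      # batch is the first dimension of the output
    dropout=0.0,
    bidirectional=False,   # unidirectional, per the text
)

decoder_out = torch.randn(batch_size, seq_len, d_model)  # stand-in decoder output
pred_adj, _ = syntax_head(decoder_out)
print(pred_adj.shape)  # torch.Size([4, 10, 10])
```

Note that the `nonlinearity` hyperparameter mentioned later in claim 1 applies to PyTorch's plain `nn.RNN`, not to `nn.LSTM`, so it is omitted here.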
In step 2, the loss value L2 obtained by element-wise subtraction of the output matrix of the unidirectional LSTM recurrent unit and the corresponding Chinese dependency syntax adjacency matrix is summed with the loss of the other path and used as the total loss of the improved Transformer model.
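The two-path loss can be sketched as follows. The text only says the matrices are "subtracted element-wise", so the reduction of that difference to a scalar (here the mean of absolute differences) is an assumption, as are all tensor shapes:

```python
import torch
import torch.nn.functional as F

batch_size, seq_len, vocab = 2, 5, 100  # illustrative sizes

# Path 1: ordinary target-sentence prediction -> cross-entropy loss L1.
logits = torch.randn(batch_size, seq_len, vocab, requires_grad=True)
targets = torch.randint(0, vocab, (batch_size, seq_len))
l1 = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))

# Path 2: LSTM output vs. Chinese dependency adjacency matrix -> loss L2.
pred_adj = torch.rand(batch_size, seq_len, seq_len, requires_grad=True)
gold_adj = torch.randint(0, 2, (batch_size, seq_len, seq_len)).float()
l2 = (pred_adj - gold_adj).abs().mean()  # element-wise subtraction, reduced to a scalar

total = l1 + l2       # total loss of the improved model
total.backward()      # both paths receive gradients from the single total loss
```

Summing the two losses lets one backward pass train the translation path and the syntax-learning path jointly.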
The encoder in step 3 adds a matrix for learning Mongolian grammar. A self-attention mechanism is used in the Transformer encoder, whose formula is:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
To strengthen the model's learning of source-language grammar information, the original formula is changed to add a learnable matrix M to the attention scores:
Attention(Q, K, V) = softmax(QK^T / √d_k + M) V
The M matrix is used to strengthen the learning of Mongolian grammar.
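A minimal sketch of this modified self-attention follows. The exact placement of M is a reconstruction (the patent's formula figure is not reproduced in the text); here it is added to the scaled attention scores before the softmax, which matches the description of M forming weights among word segments:

```python
import math
import torch
import torch.nn.functional as F

def attention_with_syntax(q, k, v, m):
    """Scaled dot-product attention with a learnable syntax matrix M.

    q, k, v: [seq_len, d_k] tensors; m: [seq_len, seq_len] bias added
    to the attention scores, so M directly reweights token-to-token links.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    scores = scores + m                      # M biases attention between word segments
    return F.softmax(scores, dim=-1) @ v

seq_len, d_k = 6, 64
q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)
v = torch.randn(seq_len, d_k)
M = torch.zeros(seq_len, seq_len, requires_grad=True)  # learned during training
out = attention_with_syntax(q, k, v, M)
```

With M initialized to zero the formula reduces exactly to standard scaled dot-product attention; training then moves M away from zero under the guidance of the Chinese dependency syntax loss.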

Claims (2)

1. A Mongolian-Chinese neural machine translation method based on dependency syntax information and a Transformer model, characterized in that: firstly, dependency syntax analysis is performed with Stanford CoreNLP on the Chinese corpus in the Mongolian-Chinese parallel corpus, and the obtained Chinese dependency syntax trees are stored as adjacency matrices; secondly, the Transformer model is modified: the Transformer model is converted from single output to dual outputs, wherein one path is used for predicting the Chinese target sentence and the other path is used for learning Chinese syntax knowledge, and a matrix for learning Mongolian grammar is added to the encoder of the Transformer model; then the model is trained in combination with the Chinese dependency syntax knowledge, which is used to make the model learn Mongolian grammar and thereby improve the quality of Mongolian-Chinese machine translation; the process of improving the Transformer model, adding one output path and adding a matrix for learning Mongolian grammar on the encoder, is as follows: step 1: modify the Transformer model into dual outputs, adding a unidirectional LSTM recurrent unit after the last decoder; step 2: subtract the output matrix of the unidirectional LSTM recurrent unit and the Chinese dependency syntax adjacency matrix element-wise to obtain a loss value L2, and add L2 to the cross-entropy loss value L1 of the Transformer model's original output to form the total loss value of the model; step 3: set a matrix for learning Mongolian grammar in the encoder of the Transformer model;
In step 1, for the LSTM recurrent unit, the newly added recurrent layer: the output matrix size of the Transformer model decoder is [batch_size × seq_len × d_model], where batch_size is the number of training samples per batch, seq_len is the training sentence length, and d_model is the word-vector embedding dimension; a unidirectional LSTM recurrent unit is added after the output of the decoder, and the hyperparameters of the unidirectional LSTM unit are set as follows: input_size is the dimension of the input feature, i.e. the word-vector dimension, equal to d_model; hidden_size is the number of hidden-layer neurons, equal to seq_len; num_layers defines the number of layers of the network; nonlinearity defines the activation function; bias defines whether bias is used; batch_first defines whether the batch size is the first dimension of the output matrix; dropout defines the probability of randomly deactivating certain neurons; bidirectional defines whether a bidirectional LSTM is used; the input of the LSTM is the output of the decoder, and the size of its output matrix is [batch_size × seq_len × seq_len];
in step 2, the loss value L2 obtained by element-wise subtraction of the output matrix of the unidirectional LSTM recurrent unit and the corresponding Chinese dependency syntax adjacency matrix is summed with the loss of the other path and used as the total loss of the improved Transformer model;
the encoder in step 3 adds a matrix for learning Mongolian grammar: the Transformer encoder uses a self-attention mechanism, whose formula is:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
wherein Q, K, V are matrices of the same dimensions; by analogy with translation, Q is the distribution of the current target language to be expressed, K represents the source-language syntax structure, and V is the source-language distribution; the correlation is calculated with the dot product of Q and K, and the calculated correlation is then multiplied by V to obtain the correspondence between the source and target languages; to strengthen the model's learning of source-language grammar information, the original formula is changed to add a learnable matrix M to the attention scores:
Attention(Q, K, V) = softmax(QK^T / √d_k + M) V
The M matrix is used for learning Mongolian grammar; after being guided by the Chinese dependency syntax information, the Mongolian syntax information matrix forms different weights among the Mongolian word segments, thereby strengthening the connections among them.
2. The Mongolian-Chinese machine translation method based on dependency syntax information and a Transformer model according to claim 1, wherein the flow of extracting the Chinese dependency syntax adjacency matrix is as follows:
step 1: dividing the Mongolian parallel corpus data set into a training set, a verification set and a test set;
step 2: performing dependency syntax analysis on the Chinese corpus in the Mongolian parallel corpus training set by utilizing Stanford CoreNLP to obtain a Chinese dependency syntax tree;
Step 3: convert the Chinese dependency syntax tree obtained in step 2 into an adjacency matrix for storage and make it symmetric along the main diagonal, yielding the Chinese dependency syntax adjacency matrix.
CN202111113538.XA 2021-09-23 2021-09-23 Mongolian neural machine translation method based on dependency syntax information and Transformer model Active CN113887249B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111113538.XA CN113887249B (en) 2021-09-23 2021-09-23 Mongolian neural machine translation method based on dependency syntax information and Transformer model


Publications (2)

Publication Number Publication Date
CN113887249A CN113887249A (en) 2022-01-04
CN113887249B true CN113887249B (en) 2024-07-12

Family

ID=79010202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111113538.XA Active CN113887249B (en) 2021-09-23 2021-09-23 Mongolian neural machine translation method based on dependency syntax information and Transformer model

Country Status (1)

Country Link
CN (1) CN113887249B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116720531B (en) * 2023-06-20 2024-05-28 内蒙古工业大学 Mongolian neural machine translation method based on source language syntax dependency and quantization matrix

Citations (2)

Publication number Priority date Publication date Assignee Title
CN110688862A (en) * 2019-08-29 2020-01-14 内蒙古工业大学 Mongolian-Chinese inter-translation method based on transfer learning
CN113095092A (en) * 2021-04-19 2021-07-09 南京大学 Method for improving translation quality of non-autoregressive neural machine through modeling synergistic relationship

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
KR101762866B1 (en) * 2010-11-05 2017-08-16 에스케이플래닛 주식회사 Statistical translation apparatus by separating syntactic translation model from lexical translation model and statistical translation method
KR102203355B1 (en) * 2020-01-21 2021-01-18 김종호 System and method extracting experience information according to experience of product


Also Published As

Publication number Publication date
CN113887249A (en) 2022-01-04

Similar Documents

Publication Publication Date Title
CN111241294B (en) Relationship extraction method of graph convolution network based on dependency analysis and keywords
CN108280064B (en) Combined processing method for word segmentation, part of speech tagging, entity recognition and syntactic analysis
CN108009285B (en) Forest Ecology man-machine interaction method based on natural language processing
CN112559702B (en) Method for generating natural language problem in civil construction information field based on Transformer
CN110717334A (en) Text emotion analysis method based on BERT model and double-channel attention
CN110413986A (en) A kind of text cluster multi-document auto-abstracting method and system improving term vector model
CN112270193A (en) Chinese named entity identification method based on BERT-FLAT
CN108874878A (en) A kind of building system and method for knowledge mapping
CN110020438A (en) Enterprise or tissue Chinese entity disambiguation method and device based on recognition sequence
CN113505209A (en) Intelligent question-answering system for automobile field
CN113971394B (en) Text repetition rewriting system
CN108364066B (en) Artificial neural network chip and its application method based on N-GRAM and WFST model
CN117763363A (en) Cross-network academic community resource recommendation method based on knowledge graph and prompt learning
CN113887249B (en) Mongolian neural machine translation method based on dependency syntax information and transducer model
CN110297894A (en) A kind of Intelligent dialogue generation method based on auxiliary network
CN117350378A (en) Natural language understanding algorithm based on semantic matching and knowledge graph
CN117407615A (en) Web information extraction method and system based on reinforcement learning
CN112163069A (en) Text classification method based on graph neural network node feature propagation optimization
CN116340507A (en) Aspect-level emotion analysis method based on mixed weight and double-channel graph convolution
CN117235256A (en) Emotion analysis classification method under multi-class knowledge system
CN116304064A (en) Text classification method based on extraction
CN115858736A (en) Emotion text generation method based on emotion prompt fine adjustment
CN115481636A (en) Technical efficacy matrix construction method for technical literature
CN112464673B (en) Language meaning understanding method for fusing meaning original information
CN111581339B (en) Method for extracting gene events of biomedical literature based on tree-shaped LSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant