CN114595700A - Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information - Google Patents

Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information

Info

Publication number
CN114595700A
Authority
CN
China
Prior art keywords
pronouns
sentence
context
zero
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111557675.2A
Other languages
Chinese (zh)
Inventor
余正涛
王麒鼎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202111557675.2A priority Critical patent/CN114595700A/en
Publication of CN114595700A publication Critical patent/CN114595700A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information, and belongs to the technical field of natural language processing. The invention comprises the following steps: constructing a Chinese-Vietnamese-English trilingual aligned discourse dataset and annotating the Chinese-Vietnamese data with zero-pronoun classes; using a self-attention mechanism to separately encode the features of the source sentence and its context; pooling and concatenating the source-sentence and context features and classifying the syntactic component of the zero pronoun; and predicting the target sentence from the source-sentence and context features through two attention sublayers. The invention adopts joint learning to learn and update the parameters of the main translation model and the auxiliary classification model simultaneously, combining the classification task with the translation task. Adding discourse information to the classification task improves the classification accuracy of zero pronouns, and the discourse information likewise improves translation performance. By fusing zero pronouns and discourse information, the invention effectively improves Chinese-Vietnamese neural machine translation performance.

Description

Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information
Technical Field
The invention relates to a Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information, and belongs to the technical field of natural language processing.
Background
Exchanges between China and Vietnam are increasingly close, and demand for Chinese-Vietnamese translation technology keeps growing. Translation in the low-resource Chinese-Vietnamese setting is being studied with increasing success. However, current research targets formal-register scenarios such as the translation of news texts and official documents. For informal registers such as online comments and everyday spoken dialogue, the same translation models perform markedly worse. The main reason is that pronouns serving as subjects or objects are frequently omitted in spoken language and everyday dialogue. This is called the zero-pronoun phenomenon, and the missing pronouns are called zero pronouns. For humans these omissions pose little difficulty, since the pronoun can be inferred from the speaker's tone, the utterance, and the surrounding context. For machine translation, however, such sentences raise difficulties of completeness and correctness. Cheng et al. first attempted to recover omitted pronouns using rules. Tan et al. used a special tagging scheme to integrate translations of the tagged omitted pronouns into a neural machine translation model as external knowledge. Although the Transformer can capture rich semantic information with its multi-head attention mechanism, it translates omitted pronouns correctly only in simple cases; translation of omitted pronouns in complex sentences is often unsatisfactory.
Disclosure of Invention
The invention provides a Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information. It implicitly injects effective zero-pronoun information into the translation model to alleviate mistranslations of omitted pronouns, and combines the classification task and the translation task in a joint-learning framework so that the two interact: classification provides the translation task with more zero-pronoun information, while translation helps the classification task resolve ambiguity.

The technical scheme of the invention is as follows. The Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information comprises the following specific steps:

Step 1: crawl and collect Chinese-Vietnamese-English parallel data with a web crawler; locate the pronouns missing from the Chinese and Vietnamese sides using a matrix alignment method; analyze the syntactic component of each omitted pronoun with the dependency parsing library DDParser and use it as the gold label of the classification task, yielding a Chinese-Vietnamese comparable corpus dataset annotated with zero-pronoun information. Joint learning allows the parameters of the classification model and the translation model to be learned and updated simultaneously.

Step 2: encode the source sentence and its context separately through word embedding and position embedding, extract features with a Transformer encoder, pool and concatenate the resulting source-sentence and context features into a new representation, and feed the representation to a classifier for classification; at the same time, extract target-sentence features with a Transformer decoder and compute attention against the source-sentence features and the context features respectively.
As a further aspect of the invention, Step 1 comprises the following specific steps:

Step 1.1: crawl and collect Chinese-Vietnamese-English trilingual parallel data with a web crawler;

Step 1.2: predict zero pronouns in the Chinese-Vietnamese data as follows: (1) exploit the fact that pronouns are not omitted in spoken English to generate Chinese-English and Vietnamese-English alignment matrices; (2) obtain the approximate pronoun positions from the alignment matrices and use the English pronouns to predict the pronouns missing from the Chinese and Vietnamese sides; (3) insert each candidate pronoun at each candidate position in the original text and score the results with a language model to obtain the most likely missing pronoun and its position (a sketch of this recovery step appears after this list);

Step 1.3: using the predicted pronouns, analyze the syntactic component of each omitted pronoun with the dependency parsing library DDParser and label the missing-pronoun type of the Chinese-Vietnamese data: a missing subject pronoun is labeled S, a missing object pronoun is labeled O, a missing pronoun in another modifier role (e.g., attributive) is labeled A, and a sentence with no missing pronoun is labeled N.
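For illustration only (the patent itself gives no code), the following Python sketch shows one way the alignment-and-scoring step of Step 1.2 could be realized: an English pronoun is projected through a word-alignment map into the pronoun-dropping sentence, and the candidate insertions are ranked with a language model. All names here (recover_dropped_pronoun, lm_score, align) are hypothetical.

```python
# Hypothetical sketch of Step 1.2: restore a dropped pronoun in a Chinese
# sentence by projecting English pronouns through a word alignment, then
# keep the insertion that a language model scores as most fluent.
from typing import Callable, Dict, List, Tuple

EN_TO_ZH_PRONOUN = {"i": "我", "you": "你", "he": "他",
                    "she": "她", "we": "我们", "they": "他们"}

def recover_dropped_pronoun(
    zh_tokens: List[str],
    en_tokens: List[str],
    align: Dict[int, int],                  # English token index -> Chinese insertion position
    lm_score: Callable[[List[str]], float]  # higher score = more fluent sentence
) -> Tuple[List[str], str, int]:
    """Return (sentence, pronoun, position) for the best-scoring insertion."""
    best = (lm_score(zh_tokens), zh_tokens, "", -1)   # baseline: insert nothing
    for en_idx, en_tok in enumerate(en_tokens):
        zh_pron = EN_TO_ZH_PRONOUN.get(en_tok.lower())
        if zh_pron is None or en_idx not in align:
            continue                                   # not a pronoun, or unaligned
        pos = align[en_idx]
        candidate = zh_tokens[:pos] + [zh_pron] + zh_tokens[pos:]
        score = lm_score(candidate)
        if score > best[0]:
            best = (score, candidate, zh_pron, pos)
    _, sentence, pronoun, position = best
    return sentence, pronoun, position

# Toy usage with a trivial stand-in "language model" that rewards the pronoun.
sent, pron, pos = recover_dropped_pronoun(
    ["昨天", "去", "学校", "了"], ["i", "went", "to", "school", "yesterday"],
    align={0: 0}, lm_score=lambda s: float("我" in s))
```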
As a further aspect of the invention, in Step 2 the feature encoding of the source sentence through word embedding and position embedding comprises the following steps:

The source-sentence encoding module extracts the text features of the source sentence. Let X = {x^(1), ..., x^(k), ..., x^(K)} denote a source-language document consisting of K source sentences, where

x^(k) = (x^(k)_1, x^(k)_2, ..., x^(k)_I)

indicates that the k-th source sentence contains I words. Similarly, the corresponding target-language document is denoted Y = {y^(1), ..., y^(k), ..., y^(K)}, where

y^(k) = (y^(k)_1, y^(k)_2, ..., y^(k)_J)

indicates that the k-th target sentence contains J words, and y^(k)_j denotes the j-th word of the k-th target sentence. <X, Y> is assumed to form a parallel document and <x^(k), y^(k)> a parallel sentence pair. The word embeddings of the source sentence are encoded with a Transformer encoder module. To make use of word order, a position encoding with the same dimension as the word embedding is added to the word-embedding representation in the encoder. The core of the encoder is the self-attention mechanism: before multi-head attention is computed, the input representation is mapped to queries (Q), keys (K), and values (V), as follows:

E_src = E(x_1, x_2, ..., x_I)   (1)

E_src = Q = K = V   (2)

Attention(Q, K, V) = softmax(Q K^T / √d_k) V   (3)

where E_src is the word-embedding representation of the source sentence, d is the word-vector dimension of the source sentence, Q, K, V ∈ R^(I×d) are the query, key, and value matrices respectively, and √d_k is a scaling factor;
To exploit the high parallelism of attention, a multi-head attention mechanism applies scaled dot-product attention several times in parallel: Q, K, and V are each linearly projected h times with different projections, scaled dot-product attention is computed over the h projections in parallel, and the attention outputs are concatenated and projected to obtain a new representation. Multi-head attention lets the model jointly attend to information from different representation subspaces at different positions:

head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)   (4)

H_src = MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) W^O   (5)

where H_src ∈ R^(I×d) is the encoded output of the source sentence, W^Q_i, W^K_i, W^V_i ∈ R^(d×d_k) and W^O ∈ R^(d×d) are trainable parameters, and d_k = d/h.
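As a minimal sketch of equations (1)-(5), not the patent's implementation, the following PyTorch code computes multi-head scaled dot-product self-attention over one source sentence of I tokens with model dimension d and h heads; the residual connections, layer normalization, and feed-forward sublayer of a full Transformer encoder layer are omitted.

```python
# Sketch of eqs. (1)-(5): multi-head scaled dot-product self-attention,
# assuming Q = K = V = E_src and d_k = d/h. Illustrative only.
import math
import torch

def multi_head_self_attention(E_src, h, W_q, W_k, W_v, W_o):
    I, d = E_src.shape
    d_k = d // h                                        # per-head dimension
    Q, K, V = E_src @ W_q, E_src @ W_k, E_src @ W_v     # eq. (2): all from E_src
    # reshape into h heads: (h, I, d_k)
    Q, K, V = (x.view(I, h, d_k).transpose(0, 1) for x in (Q, K, V))
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # eq. (3): scaled dot product
    heads = torch.softmax(scores, dim=-1) @ V           # eq. (4): h attention heads
    concat = heads.transpose(0, 1).reshape(I, d)        # Concat(head_1, ..., head_h)
    return concat @ W_o                                 # eq. (5): H_src

d, h, I = 512, 8, 10
W_q, W_k, W_v, W_o = (torch.randn(d, d) * d ** -0.5 for _ in range(4))
H_src = multi_head_self_attention(torch.randn(I, d), h, W_q, W_k, W_v, W_o)  # (10, 512)
```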
As a further aspect of the invention, in Step 2 the feature encoding of the source-sentence context through word embedding and position embedding comprises the following steps:

The context encoding module extracts the text features of the source sentence's context and encodes the context embeddings with a Transformer encoder module:

E_con = E(x_1, x_2, ..., x_I)   (6)

E_con = Q = K = V   (7)

H_con = Transformer_encoder(Q, K, V)   (8)

where E_con is the word-embedding representation of the context input text, Q, K, V ∈ R^(I×d) are the query, key, and value matrices respectively, and H_con ∈ R^(I×d) is the output of the context encoding module.
As a further aspect of the invention, in Step 2, pooling and concatenating the obtained source-sentence and context features into a new representation and feeding the representation to the classifier for classification comprises the following steps:

The classification module classifies the zero-pronoun category using the text features of the source sentence and its context. H_src ∈ R^(I×d) and H_con ∈ R^(I×d) denote the encoder outputs of the current sentence and of its context respectively. The hidden states H_src and H_con are passed through Max-pooling and Mean-pooling operations and represented as vectors U and V respectively:

U = [f_max(H_src); f_mean(H_src)]   (9)

V = [f_max(H_con); f_mean(H_con)]   (10)

where U, V ∈ R^(2d) and f_max and f_mean denote the Max-pooling and Mean-pooling functions. U and V are concatenated to form the classifier input:

o = [U; V]   (11)

where o ∈ R^(4d). Finally, the zero-pronoun classification result is obtained through a fully connected layer and a four-way softmax layer:

z = softmax(σ(o W_1 + b_1) W_2 + b_2)   (12)

where W_1 ∈ R^(4d×d), W_2 ∈ R^(d×4), b_1 ∈ R^(1×d), and b_2 ∈ R^(1×4) are model parameters and σ is the sigmoid function.
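The following is a minimal sketch of equations (9)-(12), assuming the shapes stated above (H_src, H_con ∈ R^(I×d), U, V ∈ R^(2d), o ∈ R^(4d)) and a four-way output layer; it is illustrative, not the patent's implementation.

```python
# Sketch of eqs. (9)-(12): pool the sentence and context encodings, concatenate,
# and classify into the four zero-pronoun classes {S, O, A, N}.
import torch

def classify_zero_pronoun(H_src, H_con, W1, b1, W2, b2):
    U = torch.cat([H_src.max(dim=0).values, H_src.mean(dim=0)])  # eq. (9):  (2d,)
    V = torch.cat([H_con.max(dim=0).values, H_con.mean(dim=0)])  # eq. (10): (2d,)
    o = torch.cat([U, V])                                        # eq. (11): (4d,)
    hidden = torch.sigmoid(o @ W1 + b1)                          # σ(o W1 + b1)
    return torch.softmax(hidden @ W2 + b2, dim=-1)               # eq. (12): 4 class probs

d = 512
W1, b1 = torch.randn(4 * d, d) * 0.02, torch.zeros(d)
W2, b2 = torch.randn(d, 4) * 0.02, torch.zeros(4)
probs = classify_zero_pronoun(torch.randn(10, d), torch.randn(30, d), W1, b1, W2, b2)
```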
As a further aspect of the invention, in Step 2, extracting target-sentence features with the Transformer decoder and computing attention against the source-sentence and context features comprises the following steps:

The decoding module largely follows the standard Transformer decoder, with one difference: a context-attention sublayer is added between the multi-head masked self-attention sublayer and the encoder-decoder attention sublayer, so that context information can better improve translation performance. Unlike the encoder input, the decoder input consists only of the target-side sentence corresponding to the current source sentence. The decoder output is mapped into the target-side vocabulary space, the prediction probability of each word in the vocabulary is computed with a softmax function, and finally the loss between the predicted and gold results is computed.
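The following PyTorch sketch illustrates the modified decoder layer described above, with a context-attention sublayer inserted between the masked self-attention sublayer and the encoder-decoder attention sublayer. The use of nn.MultiheadAttention and the post-norm residual placement are assumptions for illustration, not details given in the patent.

```python
# Sketch of the modified decoder layer: masked self-attention, then attention
# over the context encoding H_con (the added sublayer), then attention over
# the source encoding H_src, then the feed-forward sublayer.
import torch
import torch.nn as nn

class ContextDecoderLayer(nn.Module):
    def __init__(self, d: int, h: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, h, batch_first=True)
        self.ctx_attn = nn.MultiheadAttention(d, h, batch_first=True)  # added sublayer
        self.src_attn = nn.MultiheadAttention(d, h, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.norms = nn.ModuleList(nn.LayerNorm(d) for _ in range(4))

    def forward(self, y, H_src, H_con, causal_mask):
        y = self.norms[0](y + self.self_attn(y, y, y, attn_mask=causal_mask)[0])
        y = self.norms[1](y + self.ctx_attn(y, H_con, H_con)[0])   # context attention
        y = self.norms[2](y + self.src_attn(y, H_src, H_src)[0])   # encoder-decoder attention
        return self.norms[3](y + self.ffn(y))

layer = ContextDecoderLayer(d=512, h=8)
mask = torch.triu(torch.ones(7, 7, dtype=torch.bool), diagonal=1)  # causal mask
out = layer(torch.randn(1, 7, 512), torch.randn(1, 10, 512),
            torch.randn(1, 30, 512), mask)  # (1, 7, 512)
```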
As a further aspect of the invention, in Step 2 the joint-learning loss function consists of two parts: the translation loss of the neural machine translation model and the classification loss of the zero-pronoun prediction.

The loss function at the translation target side is:

L_trans(θ) = -Σ_{n=1}^{D} Σ_{t=1}^{S_n} log P(Y_{n,t} | X_{n,t}; θ)   (13)

log P(Y_{n,t} | X_{n,t}; θ) = Σ_{j=1}^{m_{n,t}} log P(y_j | y_{<j}, X_{n,t}; θ)   (14)

where D is the number of parallel documents in the training set, S_n is the number of sentences in the n-th parallel document pair, X_n and Y_n are the source-side and target-side sentences of the n-th parallel document pair, m_{n,t} is the total number of tokens in the t-th sentence of the n-th parallel document pair, and θ denotes the training parameters of the model.

The loss function of the zero-pronoun classification is:

L_cls = -Σ_{i=1}^{N} Σ_{c=1}^{C} y^(i)_c log p^(i)_c   (15)

where N is the number of training examples of the zero-pronoun classification task, C is the number of class labels, and p^(i)_c is the probability that the model predicts class c.

Finally, the training objective of joint learning is:

L(θ) = L_trans(θ) + α · L_cls(θ)   (16)

where α is the weight of the zero-pronoun classification loss; the model sets α to 1.0.
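A minimal sketch of the joint objective (13)-(16) with α = 1.0, using token-level cross-entropy for the translation loss and class-level cross-entropy for the classification loss; the tensor names and shapes are hypothetical.

```python
# Sketch of eq. (16): total loss = translation cross-entropy (13)-(14)
# + alpha * zero-pronoun classification cross-entropy (15). One backward
# pass then updates the translation model and the classifier jointly.
import torch
import torch.nn.functional as F

def joint_loss(trans_logits,          # (T, vocab): decoder outputs for T target tokens
               target_ids,            # (T,): gold target token ids
               cls_logits,            # (C,): zero-pronoun class scores
               cls_label,             # (): gold class index over {S, O, A, N}
               alpha: float = 1.0):
    l_trans = F.cross_entropy(trans_logits, target_ids)   # eqs. (13)-(14)
    l_cls = F.cross_entropy(cls_logits.unsqueeze(0),      # eq. (15)
                            cls_label.unsqueeze(0))
    return l_trans + alpha * l_cls                        # eq. (16)

trans_logits = torch.randn(12, 32000, requires_grad=True)
cls_logits = torch.randn(4, requires_grad=True)
loss = joint_loss(trans_logits, torch.randint(0, 32000, (12,)),
                  cls_logits, torch.tensor(2))
loss.backward()  # gradients flow into both tasks' parameters
```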
The beneficial effects of the invention are:

1. For Chinese-Vietnamese neural machine translation of informal registers, machine translation performs unsatisfactorily on online vocabulary, everyday dialogue, spoken language, and similar scenarios, because such text often departs from standard grammar, most typically through pronoun omission (the zero-pronoun phenomenon). The invention improves machine translation by filling in this missing grammatical information.

2. The invention simplifies the pipeline of first restoring pronouns and then translating into a single end-to-end task, which avoids propagating errors from the pronoun-prediction task into the translation task. With joint learning, the classification task and the translation task interact: classification provides the translation task with more zero-pronoun information, while translation helps the classification task resolve ambiguity.

3. Context information is integrated into both the classification task and the translation task, markedly improving both the zero-pronoun classification task and the Chinese-Vietnamese machine translation task: adding discourse information to the classification task raises zero-pronoun classification accuracy, and the same discourse information also improves translation performance.

4. Using the multi-head attention of the Transformer architecture, the invention captures richer semantic features and offers good parallelism.
Drawings
FIG. 1 is a block flow diagram of the method of the present invention.
Detailed Description
Example 1: as shown in FIG. 1, the Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information comprises the following specific steps:
Step 1: crawl and collect Chinese-Vietnamese-English parallel data with a web crawler; locate the pronouns missing from the Chinese and Vietnamese sides using a matrix alignment method; analyze the syntactic component of each omitted pronoun with the dependency parsing library DDParser and use it as the gold label of the classification task, yielding a Chinese-Vietnamese comparable corpus dataset annotated with zero-pronoun information. Joint learning allows the parameters of the classification model and the translation model to be learned and updated simultaneously.

Step 1.1: crawl and collect Chinese-Vietnamese-English trilingual parallel data with a web crawler;

Step 1.2: predict zero pronouns in the Chinese-Vietnamese data as follows: (1) exploit the fact that pronouns are not omitted in spoken English to generate Chinese-English and Vietnamese-English alignment matrices; (2) obtain the approximate pronoun positions from the alignment matrices and use the English pronouns to predict the pronouns missing from the Chinese and Vietnamese sides; (3) insert each candidate pronoun at each candidate position in the original text and score the results with a language model to obtain the most likely missing pronoun and its position;

Table 1: example labels for the classification task (shown as an image in the original publication).
Step 1.3: using the predicted pronouns, analyze the syntactic component of each omitted pronoun with the dependency parsing library DDParser and label the missing-pronoun type of the Chinese-Vietnamese data: a missing subject pronoun is labeled S, a missing object pronoun is labeled O, a missing pronoun in another modifier role (e.g., attributive) is labeled A, and a sentence with no missing pronoun is labeled N. The corpus scale of the experiments is shown in Table 2:
TABLE 2. Corpus statistics for the experiments

Data             Sentences   Documents
Training set     294K        58.8K
Validation set   3.21K       0.64K
Test set         3.15K       0.63K
Step 2: encode the source sentence and its context separately through word embedding and position embedding, extract features with a Transformer encoder, pool and concatenate the resulting source-sentence and context features into a new representation, and feed the representation to a classifier for classification; at the same time, extract target-sentence features with a Transformer decoder and compute attention against the source-sentence features and the context features respectively.

As a further aspect of the invention, in Step 2 the feature encoding of the source sentence through word embedding and position embedding comprises the following steps:
The source-sentence encoding module extracts the text features of the source sentence. Let X = {x^(1), ..., x^(k), ..., x^(K)} denote a source-language document consisting of K source sentences, where

x^(k) = (x^(k)_1, x^(k)_2, ..., x^(k)_I)

indicates that the k-th source sentence contains I words. Similarly, the corresponding target-language document is denoted Y = {y^(1), ..., y^(k), ..., y^(K)}, where

y^(k) = (y^(k)_1, y^(k)_2, ..., y^(k)_J)

indicates that the k-th target sentence contains J words, and y^(k)_j denotes the j-th word of the k-th target sentence. <X, Y> is assumed to form a parallel document and <x^(k), y^(k)> a parallel sentence pair. The word embeddings of the source sentence are encoded with a Transformer encoder module. To make use of word order, a position encoding with the same dimension as the word embedding is added to the word-embedding representation in the encoder. The core of the encoder is the self-attention mechanism: before multi-head attention is computed, the input representation is mapped to queries (Q), keys (K), and values (V), as follows:

E_src = E(x_1, x_2, ..., x_I)   (1)

E_src = Q = K = V   (2)

Attention(Q, K, V) = softmax(Q K^T / √d_k) V   (3)

where E_src is the word-embedding representation of the source sentence, d is the word-vector dimension of the source sentence, Q, K, V ∈ R^(I×d) are the query, key, and value matrices respectively, and √d_k is a scaling factor.

To exploit the high parallelism of attention, a multi-head attention mechanism applies scaled dot-product attention several times in parallel: Q, K, and V are each linearly projected h times with different projections, scaled dot-product attention is computed over the h projections in parallel, and the attention outputs are concatenated and projected to obtain a new representation. Multi-head attention lets the model jointly attend to information from different representation subspaces at different positions:

head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)   (4)

H_src = MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) W^O   (5)

where H_src ∈ R^(I×d) is the encoded output of the source sentence, W^Q_i, W^K_i, W^V_i ∈ R^(d×d_k) and W^O ∈ R^(d×d) are trainable parameters, and d_k = d/h.
As a further aspect of the invention, in Step 2 the feature encoding of the source-sentence context through word embedding and position embedding comprises the following steps:

The context encoding module extracts the text features of the source sentence's context and encodes the context embeddings with a Transformer encoder module:

E_con = E(x_1, x_2, ..., x_I)   (6)

E_con = Q = K = V   (7)

H_con = Transformer_encoder(Q, K, V)   (8)

where E_con is the word-embedding representation of the context input text, Q, K, V ∈ R^(I×d) are the query, key, and value matrices respectively, and H_con ∈ R^(I×d) is the output of the context encoding module.
As a further aspect of the invention, in Step 2, pooling and concatenating the obtained source-sentence and context features into a new representation and feeding the representation to the classifier for classification comprises the following steps:

The classification module classifies the zero-pronoun category using the text features of the source sentence and its context. H_src ∈ R^(I×d) and H_con ∈ R^(I×d) denote the encoder outputs of the current sentence and of its context respectively. The hidden states H_src and H_con are passed through Max-pooling and Mean-pooling operations and represented as vectors U and V respectively:

U = [f_max(H_src); f_mean(H_src)]   (9)

V = [f_max(H_con); f_mean(H_con)]   (10)

where U, V ∈ R^(2d) and f_max and f_mean denote the Max-pooling and Mean-pooling functions. U and V are concatenated to form the classifier input:

o = [U; V]   (11)

where o ∈ R^(4d). Finally, the zero-pronoun classification result is obtained through a fully connected layer and a four-way softmax layer:

z = softmax(σ(o W_1 + b_1) W_2 + b_2)   (12)

where W_1 ∈ R^(4d×d), W_2 ∈ R^(d×4), b_1 ∈ R^(1×d), and b_2 ∈ R^(1×4) are model parameters and σ is the sigmoid function.
As a further aspect of the invention, in Step 2, extracting target-sentence features with the Transformer decoder and computing attention against the source-sentence and context features comprises the following steps:

The decoding module largely follows the standard Transformer decoder, with one difference: a context-attention sublayer is added between the multi-head masked self-attention sublayer and the encoder-decoder attention sublayer, so that context information can better improve translation performance. Unlike the encoder input, the decoder input consists only of the target-side sentence corresponding to the current source sentence. The decoder output is mapped into the target-side vocabulary space, the prediction probability of each word in the vocabulary is computed with a softmax function, and finally the loss between the predicted and gold results is computed.
As a further aspect of the invention, in Step 2 the joint-learning loss function consists of two parts: the translation loss of the neural machine translation model and the classification loss of the zero-pronoun prediction.

The loss function at the translation target side is:

L_trans(θ) = -Σ_{n=1}^{D} Σ_{t=1}^{S_n} log P(Y_{n,t} | X_{n,t}; θ)   (13)

log P(Y_{n,t} | X_{n,t}; θ) = Σ_{j=1}^{m_{n,t}} log P(y_j | y_{<j}, X_{n,t}; θ)   (14)

where D is the number of parallel documents in the training set, S_n is the number of sentences in the n-th parallel document pair, X_n and Y_n are the source-side and target-side sentences of the n-th parallel document pair, m_{n,t} is the total number of tokens in the t-th sentence of the n-th parallel document pair, and θ denotes the training parameters of the model.

The loss function of the zero-pronoun classification is:

L_cls = -Σ_{i=1}^{N} Σ_{c=1}^{C} y^(i)_c log p^(i)_c   (15)

where N is the number of training examples of the zero-pronoun classification task, C is the number of class labels, and p^(i)_c is the probability that the model predicts class c.

Finally, the training objective of joint learning is:

L(θ) = L_trans(θ) + α · L_cls(θ)   (16)

where α is the weight of the zero-pronoun classification loss; the model sets α to 1.0.
Finally, the Adam optimizer is selected: it converges quickly and stably, and iteratively updates the neural-network weights based on the training data. The learning rate (step size), which determines how far each iteration of gradient descent moves in the negative gradient direction, is set to 5e-5; too small a step size converges slowly, while too large a step size can move away from the optimal solution. Learning rates were therefore tested from small to large, and 5e-5 was selected as the best.
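For illustration, a minimal training loop under these settings (Adam, learning rate 5e-5); the model and data below are stand-ins, not the patent's.

```python
# Hypothetical sketch: Adam with lr = 5e-5, iteratively updating the weights.
import torch

model = torch.nn.Linear(8, 4)                        # stand-in for the full model
batches = [(torch.randn(16, 8), torch.randint(0, 4, (16,))) for _ in range(3)]
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
loss_fn = torch.nn.CrossEntropyLoss()

for x, y in batches:                                 # stand-in mini-batches
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)                      # stands in for the joint loss
    loss.backward()
    optimizer.step()                                 # Adam weight update
```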
To illustrate the effect of the invention, two groups of comparative experiments were set up: the first verifies the effectiveness of the model, and the second verifies the influence of omitted pronouns on Chinese-Vietnamese neural machine translation.
(1) Model validation
The proposed model CMT-G&A and its simplified variants were evaluated on the training and test data of Table 2; the results are shown in Table 3:
TABLE 3. Chinese-Vietnamese and Vietnamese-Chinese translation results (shown as an image in the original publication).
Table 3 reports the performance of the proposed model on the Chinese-Vietnamese and Vietnamese-Chinese translation tasks. After combining the zero-pronoun classification task, BLEU improves over the baseline by 0.64 on Chinese-Vietnamese translation and by 0.44 on Vietnamese-Chinese translation, demonstrating the effectiveness of the joint-learning method: fusing zero-pronoun information into machine translation through the joint classification task effectively raises BLEU. Adding only the context information yields BLEU gains of 0.71 and 0.75 respectively. The final model, which integrates context information on top of the joint task, improves BLEU by 1.42 and 1.31 and raises the accuracy of the classification task by about 3%. Context information therefore not only improves translation quality but also benefits the zero-pronoun classification task.
(2) Influence of omitted pronouns on Chinese-Vietnamese neural machine translation

In the movie and television subtitle dataset used by the invention, many sentences omit pronouns, but even more sentences in the dataset are complete sentences with no omitted pronoun. To explore the difference in translation performance between sentences with and without omitted pronouns, the following experiment was performed: the original test set was split, according to whether a pronoun is omitted, into a dropped-pronoun test set (DP) and a no-dropped-pronoun test set (NP), and the baseline model and the proposed model were then tested on each. The results are shown in Table 4:
TABLE 4. Comparison of CMT-G&A with the simplified models (shown as an image in the original publication).
Table 4 shows that on the no-dropped-pronoun set, both the baseline model and the proposed model score higher than on the full test set. This indicates that pronoun omission does affect machine translation: sentences without omitted pronouns translate better. On the no-dropped-pronoun set, the proposed model improves over the baseline by 1.38 BLEU on Chinese-Vietnamese translation and 1.31 BLEU on Vietnamese-Chinese translation, showing that the discourse information fused by the proposed model contributes semantic information beyond zero pronouns that helps translation. On the dropped-pronoun set, the translation quality of both the baseline and the proposed model drops markedly relative to the full test set; nevertheless, the proposed model improves over the baseline by 1.43 BLEU and 1.38 BLEU on Chinese-Vietnamese and Vietnamese-Chinese translation respectively. This shows that the standard Transformer model struggles to handle pronoun omission in translation, and that the proposed model effectively alleviates the translation errors caused by omitted pronouns.

The experimental data above demonstrate that fusing missing pronouns into the machine translation model using their syntactic information and context, classifying the omitted pronouns from the source sentence and its context sentences under a joint-learning model structure, and integrating context information effectively improve machine translation performance. The Transformer encoding module also captures long-range dependencies better and improves the parallelism of the model. Experiments show that the proposed method achieves the best results compared with several baseline models. For the Chinese-Vietnamese neural machine translation task, the proposed method fusing zero pronouns and discourse information effectively improves translation performance.
While the invention has been described in detail with reference to the embodiment shown in the drawings, the invention is not limited to this embodiment, and various changes may be made within the knowledge of those skilled in the art without departing from the spirit of the invention.

Claims (7)

1. A Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information, characterized by comprising the following specific steps:

Step 1: crawl and collect Chinese-Vietnamese-English parallel data with a web crawler; locate the pronouns missing from the Chinese and Vietnamese sides using a matrix alignment method; analyze the syntactic component of each omitted pronoun with the dependency parsing library DDParser and use it as the gold label of the classification task, yielding a Chinese-Vietnamese comparable corpus dataset annotated with zero-pronoun information;

Step 2: encode the source sentence and its context separately through word embedding and position embedding, extract features with a Transformer encoder, pool and concatenate the resulting source-sentence and context features into a new representation, and feed the representation to a classifier for classification; at the same time, extract target-sentence features with a Transformer decoder and compute attention against the source-sentence features and the context features respectively.
2. The Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information according to claim 1, characterized in that Step 1 comprises the following specific steps:

Step 1.1: crawl and collect Chinese-Vietnamese-English trilingual parallel data with a web crawler;

Step 1.2: predict zero pronouns in the Chinese-Vietnamese data as follows: (1) exploit the fact that pronouns are not omitted in spoken English to generate Chinese-English and Vietnamese-English alignment matrices; (2) obtain the approximate pronoun positions from the alignment matrices and use the English pronouns to predict the pronouns missing from the Chinese and Vietnamese sides; (3) insert each candidate pronoun at each candidate position in the original text and score the results with a language model to obtain the most likely missing pronoun and its position;

Step 1.3: using the predicted pronouns, analyze the syntactic component of each omitted pronoun with the dependency parsing library DDParser and label the missing-pronoun type of the Chinese-Vietnamese data: a missing subject pronoun is labeled S, a missing object pronoun is labeled O, a missing pronoun in another modifier role (e.g., attributive) is labeled A, and a sentence with no missing pronoun is labeled N.
3. The Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information according to claim 1, characterized in that in Step 2 the feature encoding of the source sentence through word embedding and position embedding comprises the following steps:

The source-sentence encoding module extracts the text features of the source sentence. Let X = {x^(1), ..., x^(k), ..., x^(K)} denote a source-language document consisting of K source sentences, where

x^(k) = (x^(k)_1, x^(k)_2, ..., x^(k)_I)

indicates that the k-th source sentence contains I words. Similarly, the corresponding target-language document is denoted Y = {y^(1), ..., y^(k), ..., y^(K)}, where

y^(k) = (y^(k)_1, y^(k)_2, ..., y^(k)_J)

indicates that the k-th target sentence contains J words, and y^(k)_j denotes the j-th word of the k-th target sentence. <X, Y> is assumed to form a parallel document and <x^(k), y^(k)> a parallel sentence pair. The word embeddings of the source sentence are encoded with a Transformer encoder module. To make use of word order, a position encoding with the same dimension as the word embedding is added to the word-embedding representation in the encoder. The core of the encoder is the self-attention mechanism: before multi-head attention is computed, the input representation is mapped to queries (Q), keys (K), and values (V), as follows:

E_src = E(x_1, x_2, ..., x_I)   (1)

E_src = Q = K = V   (2)

Attention(Q, K, V) = softmax(Q K^T / √d_k) V   (3)

where E_src is the word-embedding representation of the source sentence, d is the word-vector dimension of the source sentence, Q, K, V ∈ R^(I×d) are the query, key, and value matrices respectively, and √d_k is a scaling factor;

To exploit the high parallelism of attention, a multi-head attention mechanism applies scaled dot-product attention several times in parallel: Q, K, and V are each linearly projected h times with different projections, scaled dot-product attention is computed over the h projections in parallel, and the attention outputs are concatenated and projected to obtain a new representation. Multi-head attention lets the model jointly attend to information from different representation subspaces at different positions:

head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)   (4)

H_src = MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) W^O   (5)

where H_src ∈ R^(I×d) is the encoded output of the source sentence, W^Q_i, W^K_i, W^V_i ∈ R^(d×d_k) and W^O ∈ R^(d×d) are trainable parameters, and d_k = d/h.
4. The Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information according to claim 1, characterized in that in Step 2 the feature encoding of the source-sentence context through word embedding and position embedding comprises the following steps:

The context encoding module extracts the text features of the source sentence's context and encodes the context embeddings with a Transformer encoder module:

E_con = E(x_1, x_2, ..., x_I)   (6)

E_con = Q = K = V   (7)

H_con = Transformer_encoder(Q, K, V)   (8)

where E_con is the word-embedding representation of the context input text, Q, K, V ∈ R^(I×d) are the query, key, and value matrices respectively, and H_con ∈ R^(I×d) is the output of the context encoding module.
5. The Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information according to claim 1, characterized in that in Step 2, pooling and concatenating the obtained source-sentence and context features into a new representation and feeding the representation to the classifier for classification comprises the following steps:

The classification module classifies the zero-pronoun category using the text features of the source sentence and its context. H_src ∈ R^(I×d) and H_con ∈ R^(I×d) denote the encoder outputs of the current sentence and of its context respectively. The hidden states H_src and H_con are passed through Max-pooling and Mean-pooling operations and represented as vectors U and V respectively:

U = [f_max(H_src); f_mean(H_src)]   (9)

V = [f_max(H_con); f_mean(H_con)]   (10)

where U, V ∈ R^(2d) and f_max and f_mean denote the Max-pooling and Mean-pooling functions. U and V are concatenated to form the classifier input:

o = [U; V]   (11)

where o ∈ R^(4d). Finally, the zero-pronoun classification result is obtained through a fully connected layer and a four-way softmax layer:

z = softmax(σ(o W_1 + b_1) W_2 + b_2)   (12)

where W_1 ∈ R^(4d×d), W_2 ∈ R^(d×4), b_1 ∈ R^(1×d), and b_2 ∈ R^(1×4) are model parameters and σ is the sigmoid function.
6. The Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information according to claim 1, characterized in that in Step 2, extracting target-sentence features with the Transformer decoder and computing attention against the source-sentence and context features comprises the following steps:

The decoding module largely follows the standard Transformer decoder, with one difference: a context-attention sublayer is added between the multi-head masked self-attention sublayer and the encoder-decoder attention sublayer, so that context information can better improve translation performance. Unlike the encoder input, the decoder input consists only of the target-side sentence corresponding to the current source sentence. The decoder output is mapped into the target-side vocabulary space, the prediction probability of each word in the vocabulary is computed with a softmax function, and finally the loss between the predicted and gold results is computed.
7. The Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information according to claim 3, characterized in that in Step 2 the joint-learning loss function consists of two parts: the translation loss of the neural machine translation model and the classification loss of the zero-pronoun prediction;

The loss function at the translation target side is:

L_trans(θ) = -Σ_{n=1}^{D} Σ_{t=1}^{S_n} log P(Y_{n,t} | X_{n,t}; θ)   (13)

log P(Y_{n,t} | X_{n,t}; θ) = Σ_{j=1}^{m_{n,t}} log P(y_j | y_{<j}, X_{n,t}; θ)   (14)

where D is the number of parallel documents in the training set, S_n is the number of sentences in the n-th parallel document pair, X_n and Y_n are the source-side and target-side sentences of the n-th parallel document pair, m_{n,t} is the total number of tokens in the t-th sentence of the n-th parallel document pair, and θ denotes the training parameters of the model;

The loss function of the zero-pronoun classification is:

L_cls = -Σ_{i=1}^{N} Σ_{c=1}^{C} y^(i)_c log p^(i)_c   (15)

where N is the number of training examples of the zero-pronoun classification task, C is the number of class labels, and p^(i)_c is the probability that the model predicts class c;

Finally, the training objective of joint learning is:

L(θ) = L_trans(θ) + α · L_cls(θ)   (16)

where α is the weight of the zero-pronoun classification loss; the model sets α to 1.0.
CN202111557675.2A 2021-12-20 2021-12-20 Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information Pending CN114595700A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111557675.2A 2021-12-20 2021-12-20 Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111557675.2A 2021-12-20 2021-12-20 Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information

Publications (1)

Publication Number Publication Date
CN114595700A 2022-06-07

Family

ID=81803154

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111557675.2A 2021-12-20 2021-12-20 Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information (Pending)

Country Status (1)

Country Link
CN (1) CN114595700A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549644A (en) * 2018-04-12 2018-09-18 苏州大学 Omission pronominal translation method towards neural machine translation
CN109492223A (en) * 2018-11-06 2019-03-19 北京邮电大学 A kind of Chinese missing pronoun complementing method based on ANN Reasoning
CN111488733A (en) * 2020-04-07 2020-08-04 苏州大学 Chinese zero-index resolution method and system based on Mask mechanism and twin network
CN111666774A (en) * 2020-04-24 2020-09-15 北京大学 Machine translation method and device based on document context
CN112256868A (en) * 2020-09-30 2021-01-22 华为技术有限公司 Zero-reference resolution method, method for training zero-reference resolution model and electronic equipment
CN112507733A (en) * 2020-11-06 2021-03-16 昆明理工大学 Dependency graph network-based Hanyue neural machine translation method
CN112613326A (en) * 2020-12-18 2021-04-06 北京理工大学 Tibetan language neural machine translation method fusing syntactic structure
CN113095091A (en) * 2021-04-09 2021-07-09 天津大学 Chapter machine translation system and method capable of selecting context information
CN113743133A (en) * 2021-08-20 2021-12-03 昆明理工大学 Chinese cross-language abstract method fusing word granularity probability mapping information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LONGYUE WANG ET AL: "Dropped Pronoun Generation for Dialogue Machine Translation", 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
汪浩 et al.: "融合零指代识别的篇章级机器翻译" [Document-level machine translation incorporating zero-anaphora recognition], 《HTTPS://SCHOLAR.GOOGLE.COM/CITATIONS?VIEW_OP=VIEW_CITATION&HL=EN&USER=KY7Q3NEAAAAJ&CITATION_FOR_VIEW=KY7Q3NEAAAAJ:MVMSD5A6BFQC》 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116108862A (en) * 2023-04-07 2023-05-12 北京澜舟科技有限公司 Chapter-level machine translation model construction method, chapter-level machine translation model construction system and storage medium

Similar Documents

Publication Publication Date Title
CN110390103B (en) Automatic short text summarization method and system based on double encoders
WO2021233112A1 (en) Multimodal machine learning-based translation method, device, equipment, and storage medium
Zhang et al. SG-Net: Syntax guided transformer for language representation
CN110516244B (en) Automatic sentence filling method based on BERT
CN113343683B (en) Chinese new word discovery method and device integrating self-encoder and countertraining
CN111401079A (en) Training method and device of neural network machine translation model and storage medium
CN112765345A (en) Text abstract automatic generation method and system fusing pre-training model
CN114998670B (en) Multi-mode information pre-training method and system
CN112613326B (en) Tibetan language neural machine translation method fusing syntactic structure
CN110717341B (en) Method and device for constructing old-Chinese bilingual corpus with Thai as pivot
CN112183094A (en) Chinese grammar debugging method and system based on multivariate text features
CN113569562B (en) Method and system for reducing cross-modal and cross-language barriers of end-to-end voice translation
Zhang et al. Future-aware knowledge distillation for neural machine translation
Yun et al. Improving context-aware neural machine translation using self-attentive sentence embedding
CN114969304A (en) Case public opinion multi-document generation type abstract method based on element graph attention
González-Gallardo et al. Sentence boundary detection for French with subword-level information vectors and convolutional neural networks
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN115510863A (en) Question matching task oriented data enhancement method
Parvin et al. Transformer-based local-global guidance for image captioning
Wu et al. Joint intent detection model for task-oriented human-computer dialogue system using asynchronous training
CN113033153A (en) Neural machine translation model fusing key information based on Transformer model
CN114595700A (en) Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information
CN116720531A (en) Mongolian neural machine translation method based on source language syntax dependency and quantization matrix
Solomon et al. Amharic Language Image Captions Generation Using Hybridized Attention-Based Deep Neural Networks
CN115659172A (en) Generation type text summarization method based on key information mask and copy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220607)