CN114595700A - Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information - Google Patents

Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information

Info

Publication number
CN114595700A
Authority
CN
China
Prior art keywords
pronouns
sentence
context
zero
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111557675.2A
Other languages
Chinese (zh)
Inventor
余正涛
王麒鼎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202111557675.2A priority Critical patent/CN114595700A/en
Publication of CN114595700A publication Critical patent/CN114595700A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information, and belongs to the technical field of natural language processing. The invention comprises the following steps: constructing a Chinese-Vietnamese-English trilingual aligned discourse dataset and annotating the Chinese-Vietnamese data with zero-pronoun classes; using a self-attention mechanism to separately encode the features of the source sentence and its context; pooling and concatenating the source-sentence and context features and classifying the syntactic component of the zero pronoun; and predicting the target sentence from the source-sentence and context features through two attention sublayers. The invention adopts joint learning to learn and update the parameters of the main translation model and the auxiliary classification model simultaneously, combining the classification task with the translation task. Adding discourse information to the classification task improves the classification accuracy of zero pronouns, and the discourse information likewise improves translation performance. By fusing zero pronouns and discourse information, the invention effectively improves Chinese-Vietnamese neural machine translation performance.

Description

Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information
Technical Field
The invention relates to a Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information, and belongs to the technical field of natural language processing.
Background
Exchanges between China and Vietnam are increasingly close, and demand for Chinese-Vietnamese translation technology keeps growing. Translation in the low-resource Chinese-Vietnamese setting is being studied with increasing success. However, current research targets formal-register scenarios such as the translation of news texts and official documents. For informal registers such as online comments and everyday spoken dialogue, the same translation models perform markedly worse. The main reason is that pronouns serving as subjects or objects are frequently omitted in spoken language and everyday dialogue. This is called the zero-pronoun phenomenon, and the missing pronouns are called zero pronouns. For humans these omissions pose little difficulty, since the pronoun can be inferred from the speaker's tone, the utterance, and the surrounding context. For machine translation, however, such sentences raise difficulties of completeness and correctness. Cheng et al. first attempted to recover omitted pronouns using rules. Tan et al. used a special tagging scheme to integrate translations of the tagged omitted pronouns into a neural machine translation model as external knowledge. Although the Transformer can capture rich semantic information with its multi-head attention mechanism, it translates omitted pronouns correctly only in simple cases; translation of omitted pronouns in complex sentences is often unsatisfactory.
Disclosure of Invention
The invention provides a Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information. It implicitly injects effective zero-pronoun information into the translation model to alleviate mistranslations of omitted pronouns, and combines the classification task and the translation task in a joint-learning framework so that the two interact: classification provides the translation task with more zero-pronoun information, while translation helps the classification task resolve ambiguity.

The technical scheme of the invention is as follows. The Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information comprises the following specific steps:

Step 1: crawl and collect Chinese-Vietnamese-English parallel data with a web crawler; locate the pronouns missing from the Chinese and Vietnamese sides using a matrix alignment method; analyze the syntactic component of each omitted pronoun with the dependency parsing library DDParser and use it as the gold label of the classification task, yielding a Chinese-Vietnamese comparable corpus dataset annotated with zero-pronoun information. Joint learning allows the parameters of the classification model and the translation model to be learned and updated simultaneously.

Step 2: encode the source sentence and its context separately through word embedding and position embedding, extract features with a Transformer encoder, pool and concatenate the resulting source-sentence and context features into a new representation, and feed the representation to a classifier for classification; at the same time, extract target-sentence features with a Transformer decoder and compute attention against the source-sentence features and the context features respectively.
As a further aspect of the invention, Step 1 comprises the following specific steps:

Step 1.1: crawl and collect Chinese-Vietnamese-English trilingual parallel data with a web crawler;

Step 1.2: predict zero pronouns in the Chinese-Vietnamese data as follows: (1) exploit the fact that pronouns are not omitted in spoken English to generate Chinese-English and Vietnamese-English alignment matrices; (2) obtain the approximate pronoun positions from the alignment matrices and use the English pronouns to predict the pronouns missing from the Chinese and Vietnamese sides; (3) insert each candidate pronoun at each candidate position in the original text and score the results with a language model to obtain the most likely missing pronoun and its position (a sketch of this recovery step appears after this list);

Step 1.3: using the predicted pronouns, analyze the syntactic component of each omitted pronoun with the dependency parsing library DDParser and label the missing-pronoun type of the Chinese-Vietnamese data: a missing subject pronoun is labeled S, a missing object pronoun is labeled O, a missing pronoun in another modifier role (e.g., attributive) is labeled A, and a sentence with no missing pronoun is labeled N.
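For illustration only (the patent itself gives no code), the following Python sketch shows one way the alignment-and-scoring step of Step 1.2 could be realized: an English pronoun is projected through a word-alignment map into the pronoun-dropping sentence, and the candidate insertions are ranked with a language model. All names here (recover_dropped_pronoun, lm_score, align) are hypothetical.

```python
# Hypothetical sketch of Step 1.2: restore a dropped pronoun in a Chinese
# sentence by projecting English pronouns through a word alignment, then
# keep the insertion that a language model scores as most fluent.
from typing import Callable, Dict, List, Tuple

EN_TO_ZH_PRONOUN = {"i": "我", "you": "你", "he": "他",
                    "she": "她", "we": "我们", "they": "他们"}

def recover_dropped_pronoun(
    zh_tokens: List[str],
    en_tokens: List[str],
    align: Dict[int, int],                  # English token index -> Chinese insertion position
    lm_score: Callable[[List[str]], float]  # higher score = more fluent sentence
) -> Tuple[List[str], str, int]:
    """Return (sentence, pronoun, position) for the best-scoring insertion."""
    best = (lm_score(zh_tokens), zh_tokens, "", -1)   # baseline: insert nothing
    for en_idx, en_tok in enumerate(en_tokens):
        zh_pron = EN_TO_ZH_PRONOUN.get(en_tok.lower())
        if zh_pron is None or en_idx not in align:
            continue                                   # not a pronoun, or unaligned
        pos = align[en_idx]
        candidate = zh_tokens[:pos] + [zh_pron] + zh_tokens[pos:]
        score = lm_score(candidate)
        if score > best[0]:
            best = (score, candidate, zh_pron, pos)
    _, sentence, pronoun, position = best
    return sentence, pronoun, position

# Toy usage with a trivial stand-in "language model" that rewards the pronoun.
sent, pron, pos = recover_dropped_pronoun(
    ["昨天", "去", "学校", "了"], ["i", "went", "to", "school", "yesterday"],
    align={0: 0}, lm_score=lambda s: float("我" in s))
```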
As a further aspect of the invention, in Step 2 the feature encoding of the source sentence through word embedding and position embedding comprises the following steps:

The source-sentence encoding module extracts the text features of the source sentence. Let X = {x^(1), ..., x^(k), ..., x^(K)} denote a source-language document consisting of K source sentences, where

x^(k) = (x^(k)_1, x^(k)_2, ..., x^(k)_I)

indicates that the k-th source sentence contains I words. Similarly, the corresponding target-language document is denoted Y = {y^(1), ..., y^(k), ..., y^(K)}, where

y^(k) = (y^(k)_1, y^(k)_2, ..., y^(k)_J)

indicates that the k-th target sentence contains J words, and y^(k)_j denotes the j-th word of the k-th target sentence. <X, Y> is assumed to form a parallel document and <x^(k), y^(k)> a parallel sentence pair. The word embeddings of the source sentence are encoded with a Transformer encoder module. To make use of word order, a position encoding with the same dimension as the word embedding is added to the word-embedding representation in the encoder. The core of the encoder is the self-attention mechanism: before multi-head attention is computed, the input representation is mapped to queries (Q), keys (K), and values (V), as follows:

E_src = E(x_1, x_2, ..., x_I)   (1)

E_src = Q = K = V   (2)

Attention(Q, K, V) = softmax(Q K^T / √d_k) V   (3)

where E_src is the word-embedding representation of the source sentence, d is the word-vector dimension of the source sentence, Q, K, V ∈ R^(I×d) are the query, key, and value matrices respectively, and √d_k is a scaling factor;
To exploit the high parallelism of attention, a multi-head attention mechanism applies scaled dot-product attention several times in parallel: Q, K, and V are each linearly projected h times with different projections, scaled dot-product attention is computed over the h projections in parallel, and the attention outputs are concatenated and projected to obtain a new representation. Multi-head attention lets the model jointly attend to information from different representation subspaces at different positions:

head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)   (4)

H_src = MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) W^O   (5)

where H_src ∈ R^(I×d) is the encoded output of the source sentence, W^Q_i, W^K_i, W^V_i ∈ R^(d×d_k) and W^O ∈ R^(d×d) are trainable parameters, and d_k = d/h.
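As a minimal sketch of equations (1)-(5), not the patent's implementation, the following PyTorch code computes multi-head scaled dot-product self-attention over one source sentence of I tokens with model dimension d and h heads; the residual connections, layer normalization, and feed-forward sublayer of a full Transformer encoder layer are omitted.

```python
# Sketch of eqs. (1)-(5): multi-head scaled dot-product self-attention,
# assuming Q = K = V = E_src and d_k = d/h. Illustrative only.
import math
import torch

def multi_head_self_attention(E_src, h, W_q, W_k, W_v, W_o):
    I, d = E_src.shape
    d_k = d // h                                        # per-head dimension
    Q, K, V = E_src @ W_q, E_src @ W_k, E_src @ W_v     # eq. (2): all from E_src
    # reshape into h heads: (h, I, d_k)
    Q, K, V = (x.view(I, h, d_k).transpose(0, 1) for x in (Q, K, V))
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # eq. (3): scaled dot product
    heads = torch.softmax(scores, dim=-1) @ V           # eq. (4): h attention heads
    concat = heads.transpose(0, 1).reshape(I, d)        # Concat(head_1, ..., head_h)
    return concat @ W_o                                 # eq. (5): H_src

d, h, I = 512, 8, 10
W_q, W_k, W_v, W_o = (torch.randn(d, d) * d ** -0.5 for _ in range(4))
H_src = multi_head_self_attention(torch.randn(I, d), h, W_q, W_k, W_v, W_o)  # (10, 512)
```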
As a further aspect of the invention, in Step 2 the feature encoding of the source-sentence context through word embedding and position embedding comprises the following steps:

The context encoding module extracts the text features of the source sentence's context and encodes the context embeddings with a Transformer encoder module:

E_con = E(x_1, x_2, ..., x_I)   (6)

E_con = Q = K = V   (7)

H_con = Transformer_encoder(Q, K, V)   (8)

where E_con is the word-embedding representation of the context input text, Q, K, V ∈ R^(I×d) are the query, key, and value matrices respectively, and H_con ∈ R^(I×d) is the output of the context encoding module.
As a further aspect of the invention, in Step 2, pooling and concatenating the obtained source-sentence and context features into a new representation and feeding the representation to the classifier for classification comprises the following steps:

The classification module classifies the zero-pronoun category using the text features of the source sentence and its context. H_src ∈ R^(I×d) and H_con ∈ R^(I×d) denote the encoder outputs of the current sentence and of its context respectively. The hidden states H_src and H_con are passed through Max-pooling and Mean-pooling operations and represented as vectors U and V respectively:

U = [f_max(H_src); f_mean(H_src)]   (9)

V = [f_max(H_con); f_mean(H_con)]   (10)

where U, V ∈ R^(2d) and f_max and f_mean denote the Max-pooling and Mean-pooling functions. U and V are concatenated to form the classifier input:

o = [U; V]   (11)

where o ∈ R^(4d). Finally, the zero-pronoun classification result is obtained through a fully connected layer and a four-way softmax layer:

z = softmax(σ(o W_1 + b_1) W_2 + b_2)   (12)

where W_1 ∈ R^(4d×d), W_2 ∈ R^(d×4), b_1 ∈ R^(1×d), and b_2 ∈ R^(1×4) are model parameters and σ is the sigmoid function.
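The following is a minimal sketch of equations (9)-(12), assuming the shapes stated above (H_src, H_con ∈ R^(I×d), U, V ∈ R^(2d), o ∈ R^(4d)) and a four-way output layer; it is illustrative, not the patent's implementation.

```python
# Sketch of eqs. (9)-(12): pool the sentence and context encodings, concatenate,
# and classify into the four zero-pronoun classes {S, O, A, N}.
import torch

def classify_zero_pronoun(H_src, H_con, W1, b1, W2, b2):
    U = torch.cat([H_src.max(dim=0).values, H_src.mean(dim=0)])  # eq. (9):  (2d,)
    V = torch.cat([H_con.max(dim=0).values, H_con.mean(dim=0)])  # eq. (10): (2d,)
    o = torch.cat([U, V])                                        # eq. (11): (4d,)
    hidden = torch.sigmoid(o @ W1 + b1)                          # σ(o W1 + b1)
    return torch.softmax(hidden @ W2 + b2, dim=-1)               # eq. (12): 4 class probs

d = 512
W1, b1 = torch.randn(4 * d, d) * 0.02, torch.zeros(d)
W2, b2 = torch.randn(d, 4) * 0.02, torch.zeros(4)
probs = classify_zero_pronoun(torch.randn(10, d), torch.randn(30, d), W1, b1, W2, b2)
```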
As a further aspect of the invention, in Step 2, extracting target-sentence features with the Transformer decoder and computing attention against the source-sentence and context features comprises the following steps:

The decoding module largely follows the standard Transformer decoder, with one difference: a context-attention sublayer is added between the multi-head masked self-attention sublayer and the encoder-decoder attention sublayer, so that context information can better improve translation performance. Unlike the encoder input, the decoder input consists only of the target-side sentence corresponding to the current source sentence. The decoder output is mapped into the target-side vocabulary space, the prediction probability of each word in the vocabulary is computed with a softmax function, and finally the loss between the predicted and gold results is computed.
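The following PyTorch sketch illustrates the modified decoder layer described above, with a context-attention sublayer inserted between the masked self-attention sublayer and the encoder-decoder attention sublayer. The use of nn.MultiheadAttention and the post-norm residual placement are assumptions for illustration, not details given in the patent.

```python
# Sketch of the modified decoder layer: masked self-attention, then attention
# over the context encoding H_con (the added sublayer), then attention over
# the source encoding H_src, then the feed-forward sublayer.
import torch
import torch.nn as nn

class ContextDecoderLayer(nn.Module):
    def __init__(self, d: int, h: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, h, batch_first=True)
        self.ctx_attn = nn.MultiheadAttention(d, h, batch_first=True)  # added sublayer
        self.src_attn = nn.MultiheadAttention(d, h, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.norms = nn.ModuleList(nn.LayerNorm(d) for _ in range(4))

    def forward(self, y, H_src, H_con, causal_mask):
        y = self.norms[0](y + self.self_attn(y, y, y, attn_mask=causal_mask)[0])
        y = self.norms[1](y + self.ctx_attn(y, H_con, H_con)[0])   # context attention
        y = self.norms[2](y + self.src_attn(y, H_src, H_src)[0])   # encoder-decoder attention
        return self.norms[3](y + self.ffn(y))

layer = ContextDecoderLayer(d=512, h=8)
mask = torch.triu(torch.ones(7, 7, dtype=torch.bool), diagonal=1)  # causal mask
out = layer(torch.randn(1, 7, 512), torch.randn(1, 10, 512),
            torch.randn(1, 30, 512), mask)  # (1, 7, 512)
```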
As a further aspect of the invention, in Step 2 the joint-learning loss function consists of two parts: the translation loss of the neural machine translation model and the classification loss of the zero-pronoun prediction.

The loss function at the translation target side is:

L_trans(θ) = -Σ_{n=1}^{D} Σ_{t=1}^{S_n} log P(Y_{n,t} | X_{n,t}; θ)   (13)

log P(Y_{n,t} | X_{n,t}; θ) = Σ_{j=1}^{m_{n,t}} log P(y_j | y_{<j}, X_{n,t}; θ)   (14)

where D is the number of parallel documents in the training set, S_n is the number of sentences in the n-th parallel document pair, X_n and Y_n are the source-side and target-side sentences of the n-th parallel document pair, m_{n,t} is the total number of tokens in the t-th sentence of the n-th parallel document pair, and θ denotes the training parameters of the model.

The loss function of the zero-pronoun classification is:

L_cls = -Σ_{i=1}^{N} Σ_{c=1}^{C} y^(i)_c log p^(i)_c   (15)

where N is the number of training examples of the zero-pronoun classification task, C is the number of class labels, and p^(i)_c is the probability that the model predicts class c.

Finally, the training objective of joint learning is:

L(θ) = L_trans(θ) + α · L_cls(θ)   (16)

where α is the weight of the zero-pronoun classification loss; the model sets α to 1.0.
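A minimal sketch of the joint objective (13)-(16) with α = 1.0, using token-level cross-entropy for the translation loss and class-level cross-entropy for the classification loss; the tensor names and shapes are hypothetical.

```python
# Sketch of eq. (16): total loss = translation cross-entropy (13)-(14)
# + alpha * zero-pronoun classification cross-entropy (15). One backward
# pass then updates the translation model and the classifier jointly.
import torch
import torch.nn.functional as F

def joint_loss(trans_logits,          # (T, vocab): decoder outputs for T target tokens
               target_ids,            # (T,): gold target token ids
               cls_logits,            # (C,): zero-pronoun class scores
               cls_label,             # (): gold class index over {S, O, A, N}
               alpha: float = 1.0):
    l_trans = F.cross_entropy(trans_logits, target_ids)   # eqs. (13)-(14)
    l_cls = F.cross_entropy(cls_logits.unsqueeze(0),      # eq. (15)
                            cls_label.unsqueeze(0))
    return l_trans + alpha * l_cls                        # eq. (16)

trans_logits = torch.randn(12, 32000, requires_grad=True)
cls_logits = torch.randn(4, requires_grad=True)
loss = joint_loss(trans_logits, torch.randint(0, 32000, (12,)),
                  cls_logits, torch.tensor(2))
loss.backward()  # gradients flow into both tasks' parameters
```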
The beneficial effects of the invention are:

1. For Chinese-Vietnamese neural machine translation of informal registers, machine translation performs unsatisfactorily on online vocabulary, everyday dialogue, spoken language, and similar scenarios, because such text often departs from standard grammar, most typically through pronoun omission (the zero-pronoun phenomenon). The invention improves machine translation by filling in this missing grammatical information.

2. The invention simplifies the pipeline of first restoring pronouns and then translating into a single end-to-end task, which avoids propagating errors from the pronoun-prediction task into the translation task. With joint learning, the classification task and the translation task interact: classification provides the translation task with more zero-pronoun information, while translation helps the classification task resolve ambiguity.

3. Context information is integrated into both the classification task and the translation task, markedly improving both the zero-pronoun classification task and the Chinese-Vietnamese machine translation task: adding discourse information to the classification task raises zero-pronoun classification accuracy, and the same discourse information also improves translation performance.

4. Using the multi-head attention of the Transformer architecture, the invention captures richer semantic features and offers good parallelism.
Drawings
FIG. 1 is a block flow diagram of the method of the present invention.
Detailed Description
Example 1: as shown in FIG. 1, the Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information comprises the following specific steps:
Step 1: crawl and collect Chinese-Vietnamese-English parallel data with a web crawler; locate the pronouns missing from the Chinese and Vietnamese sides using a matrix alignment method; analyze the syntactic component of each omitted pronoun with the dependency parsing library DDParser and use it as the gold label of the classification task, yielding a Chinese-Vietnamese comparable corpus dataset annotated with zero-pronoun information. Joint learning allows the parameters of the classification model and the translation model to be learned and updated simultaneously.

Step 1.1: crawl and collect Chinese-Vietnamese-English trilingual parallel data with a web crawler;

Step 1.2: predict zero pronouns in the Chinese-Vietnamese data as follows: (1) exploit the fact that pronouns are not omitted in spoken English to generate Chinese-English and Vietnamese-English alignment matrices; (2) obtain the approximate pronoun positions from the alignment matrices and use the English pronouns to predict the pronouns missing from the Chinese and Vietnamese sides; (3) insert each candidate pronoun at each candidate position in the original text and score the results with a language model to obtain the most likely missing pronoun and its position;

Table 1: example labels for the classification task (shown as an image in the original publication).
Step 1.3: using the predicted pronouns, analyze the syntactic component of each omitted pronoun with the dependency parsing library DDParser and label the missing-pronoun type of the Chinese-Vietnamese data: a missing subject pronoun is labeled S, a missing object pronoun is labeled O, a missing pronoun in another modifier role (e.g., attributive) is labeled A, and a sentence with no missing pronoun is labeled N. The corpus scale of the experiments is shown in Table 2:
TABLE 2. Corpus statistics for the experiments

Data             Sentences   Documents
Training set     294K        58.8K
Validation set   3.21K       0.64K
Test set         3.15K       0.63K
Step 2: encode the source sentence and its context separately through word embedding and position embedding, extract features with a Transformer encoder, pool and concatenate the resulting source-sentence and context features into a new representation, and feed the representation to a classifier for classification; at the same time, extract target-sentence features with a Transformer decoder and compute attention against the source-sentence features and the context features respectively.

As a further aspect of the invention, in Step 2 the feature encoding of the source sentence through word embedding and position embedding comprises the following steps:
The source-sentence encoding module extracts the text features of the source sentence. Let X = {x^(1), ..., x^(k), ..., x^(K)} denote a source-language document consisting of K source sentences, where

x^(k) = (x^(k)_1, x^(k)_2, ..., x^(k)_I)

indicates that the k-th source sentence contains I words. Similarly, the corresponding target-language document is denoted Y = {y^(1), ..., y^(k), ..., y^(K)}, where

y^(k) = (y^(k)_1, y^(k)_2, ..., y^(k)_J)

indicates that the k-th target sentence contains J words, and y^(k)_j denotes the j-th word of the k-th target sentence. <X, Y> is assumed to form a parallel document and <x^(k), y^(k)> a parallel sentence pair. The word embeddings of the source sentence are encoded with a Transformer encoder module. To make use of word order, a position encoding with the same dimension as the word embedding is added to the word-embedding representation in the encoder. The core of the encoder is the self-attention mechanism: before multi-head attention is computed, the input representation is mapped to queries (Q), keys (K), and values (V), as follows:

E_src = E(x_1, x_2, ..., x_I)   (1)

E_src = Q = K = V   (2)

Attention(Q, K, V) = softmax(Q K^T / √d_k) V   (3)

where E_src is the word-embedding representation of the source sentence, d is the word-vector dimension of the source sentence, Q, K, V ∈ R^(I×d) are the query, key, and value matrices respectively, and √d_k is a scaling factor.

To exploit the high parallelism of attention, a multi-head attention mechanism applies scaled dot-product attention several times in parallel: Q, K, and V are each linearly projected h times with different projections, scaled dot-product attention is computed over the h projections in parallel, and the attention outputs are concatenated and projected to obtain a new representation. Multi-head attention lets the model jointly attend to information from different representation subspaces at different positions:

head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)   (4)

H_src = MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) W^O   (5)

where H_src ∈ R^(I×d) is the encoded output of the source sentence, W^Q_i, W^K_i, W^V_i ∈ R^(d×d_k) and W^O ∈ R^(d×d) are trainable parameters, and d_k = d/h.
As a further aspect of the invention, in Step 2 the feature encoding of the source-sentence context through word embedding and position embedding comprises the following steps:

The context encoding module extracts the text features of the source sentence's context and encodes the context embeddings with a Transformer encoder module:

E_con = E(x_1, x_2, ..., x_I)   (6)

E_con = Q = K = V   (7)

H_con = Transformer_encoder(Q, K, V)   (8)

where E_con is the word-embedding representation of the context input text, Q, K, V ∈ R^(I×d) are the query, key, and value matrices respectively, and H_con ∈ R^(I×d) is the output of the context encoding module.
As a further aspect of the invention, in Step 2, pooling and concatenating the obtained source-sentence and context features into a new representation and feeding the representation to the classifier for classification comprises the following steps:

The classification module classifies the zero-pronoun category using the text features of the source sentence and its context. H_src ∈ R^(I×d) and H_con ∈ R^(I×d) denote the encoder outputs of the current sentence and of its context respectively. The hidden states H_src and H_con are passed through Max-pooling and Mean-pooling operations and represented as vectors U and V respectively:

U = [f_max(H_src); f_mean(H_src)]   (9)

V = [f_max(H_con); f_mean(H_con)]   (10)

where U, V ∈ R^(2d) and f_max and f_mean denote the Max-pooling and Mean-pooling functions. U and V are concatenated to form the classifier input:

o = [U; V]   (11)

where o ∈ R^(4d). Finally, the zero-pronoun classification result is obtained through a fully connected layer and a four-way softmax layer:

z = softmax(σ(o W_1 + b_1) W_2 + b_2)   (12)

where W_1 ∈ R^(4d×d), W_2 ∈ R^(d×4), b_1 ∈ R^(1×d), and b_2 ∈ R^(1×4) are model parameters and σ is the sigmoid function.
As a further aspect of the invention, in Step 2, extracting target-sentence features with the Transformer decoder and computing attention against the source-sentence and context features comprises the following steps:

The decoding module largely follows the standard Transformer decoder, with one difference: a context-attention sublayer is added between the multi-head masked self-attention sublayer and the encoder-decoder attention sublayer, so that context information can better improve translation performance. Unlike the encoder input, the decoder input consists only of the target-side sentence corresponding to the current source sentence. The decoder output is mapped into the target-side vocabulary space, the prediction probability of each word in the vocabulary is computed with a softmax function, and finally the loss between the predicted and gold results is computed.
As a further aspect of the invention, in Step 2 the joint-learning loss function consists of two parts: the translation loss of the neural machine translation model and the classification loss of the zero-pronoun prediction.

The loss function at the translation target side is:

L_trans(θ) = -Σ_{n=1}^{D} Σ_{t=1}^{S_n} log P(Y_{n,t} | X_{n,t}; θ)   (13)

log P(Y_{n,t} | X_{n,t}; θ) = Σ_{j=1}^{m_{n,t}} log P(y_j | y_{<j}, X_{n,t}; θ)   (14)

where D is the number of parallel documents in the training set, S_n is the number of sentences in the n-th parallel document pair, X_n and Y_n are the source-side and target-side sentences of the n-th parallel document pair, m_{n,t} is the total number of tokens in the t-th sentence of the n-th parallel document pair, and θ denotes the training parameters of the model.

The loss function of the zero-pronoun classification is:

L_cls = -Σ_{i=1}^{N} Σ_{c=1}^{C} y^(i)_c log p^(i)_c   (15)

where N is the number of training examples of the zero-pronoun classification task, C is the number of class labels, and p^(i)_c is the probability that the model predicts class c.

Finally, the training objective of joint learning is:

L(θ) = L_trans(θ) + α · L_cls(θ)   (16)

where α is the weight of the zero-pronoun classification loss; the model sets α to 1.0.
Finally, the Adam optimizer is selected: it converges quickly and stably, and iteratively updates the neural-network weights based on the training data. The learning rate (step size), which determines how far each iteration of gradient descent moves in the negative gradient direction, is set to 5e-5; too small a step size converges slowly, while too large a step size can move away from the optimal solution. Learning rates were therefore tested from small to large, and 5e-5 was selected as the best.
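For illustration, a minimal training loop under these settings (Adam, learning rate 5e-5); the model and data below are stand-ins, not the patent's.

```python
# Hypothetical sketch: Adam with lr = 5e-5, iteratively updating the weights.
import torch

model = torch.nn.Linear(8, 4)                        # stand-in for the full model
batches = [(torch.randn(16, 8), torch.randint(0, 4, (16,))) for _ in range(3)]
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
loss_fn = torch.nn.CrossEntropyLoss()

for x, y in batches:                                 # stand-in mini-batches
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)                      # stands in for the joint loss
    loss.backward()
    optimizer.step()                                 # Adam weight update
```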
To illustrate the effect of the invention, two groups of comparative experiments were set up: the first verifies the effectiveness of the model, and the second verifies the influence of omitted pronouns on Chinese-Vietnamese neural machine translation.
(1) Model validation
The proposed model CMT-G&A and its simplified variants were evaluated on the training and test data of Table 2; the results are shown in Table 3:
TABLE 3. Chinese-Vietnamese and Vietnamese-Chinese translation results (shown as an image in the original publication).
Table 3 reports the performance of the proposed model on the Chinese-Vietnamese and Vietnamese-Chinese translation tasks. After combining the zero-pronoun classification task, BLEU improves over the baseline by 0.64 on Chinese-Vietnamese translation and by 0.44 on Vietnamese-Chinese translation, demonstrating the effectiveness of the joint-learning method: fusing zero-pronoun information into machine translation through the joint classification task effectively raises BLEU. Adding only the context information yields BLEU gains of 0.71 and 0.75 respectively. The final model, which integrates context information on top of the joint task, improves BLEU by 1.42 and 1.31 and raises the accuracy of the classification task by about 3%. Context information therefore not only improves translation quality but also benefits the zero-pronoun classification task.
(2) Influence of omitted pronouns on Chinese-Vietnamese neural machine translation

In the movie and television subtitle dataset used by the invention, many sentences omit pronouns, but even more sentences in the dataset are complete sentences with no omitted pronoun. To explore the difference in translation performance between sentences with and without omitted pronouns, the following experiment was performed: the original test set was split, according to whether a pronoun is omitted, into a dropped-pronoun test set (DP) and a no-dropped-pronoun test set (NP), and the baseline model and the proposed model were then tested on each. The results are shown in Table 4:
TABLE 4. Comparison of CMT-G&A with the simplified models (shown as an image in the original publication).
Table 4 shows that on the no-dropped-pronoun set, both the baseline model and the proposed model score higher than on the full test set. This indicates that pronoun omission does affect machine translation: sentences without omitted pronouns translate better. On the no-dropped-pronoun set, the proposed model improves over the baseline by 1.38 BLEU on Chinese-Vietnamese translation and 1.31 BLEU on Vietnamese-Chinese translation, showing that the discourse information fused by the proposed model contributes semantic information beyond zero pronouns that helps translation. On the dropped-pronoun set, the translation quality of both the baseline and the proposed model drops markedly relative to the full test set; nevertheless, the proposed model improves over the baseline by 1.43 BLEU and 1.38 BLEU on Chinese-Vietnamese and Vietnamese-Chinese translation respectively. This shows that the standard Transformer model struggles to handle pronoun omission in translation, and that the proposed model effectively alleviates the translation errors caused by omitted pronouns.

The experimental data above demonstrate that fusing missing pronouns into the machine translation model using their syntactic information and context, classifying the omitted pronouns from the source sentence and its context sentences under a joint-learning model structure, and integrating context information effectively improve machine translation performance. The Transformer encoding module also captures long-range dependencies better and improves the parallelism of the model. Experiments show that the proposed method achieves the best results compared with several baseline models. For the Chinese-Vietnamese neural machine translation task, the proposed method fusing zero pronouns and discourse information effectively improves translation performance.
While the invention has been described in detail with reference to the embodiment shown in the drawings, the invention is not limited to this embodiment, and various changes may be made within the knowledge of those skilled in the art without departing from the spirit of the invention.

Claims (7)

1. A Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information, characterized by comprising the following specific steps:

Step 1: crawl and collect Chinese-Vietnamese-English parallel data with a web crawler; locate the pronouns missing from the Chinese and Vietnamese sides using a matrix alignment method; analyze the syntactic component of each omitted pronoun with the dependency parsing library DDParser and use it as the gold label of the classification task, yielding a Chinese-Vietnamese comparable corpus dataset annotated with zero-pronoun information;

Step 2: encode the source sentence and its context separately through word embedding and position embedding, extract features with a Transformer encoder, pool and concatenate the resulting source-sentence and context features into a new representation, and feed the representation to a classifier for classification; at the same time, extract target-sentence features with a Transformer decoder and compute attention against the source-sentence features and the context features respectively.
2. The Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information according to claim 1, characterized in that Step 1 comprises the following specific steps:

Step 1.1: crawl and collect Chinese-Vietnamese-English trilingual parallel data with a web crawler;

Step 1.2: predict zero pronouns in the Chinese-Vietnamese data as follows: (1) exploit the fact that pronouns are not omitted in spoken English to generate Chinese-English and Vietnamese-English alignment matrices; (2) obtain the approximate pronoun positions from the alignment matrices and use the English pronouns to predict the pronouns missing from the Chinese and Vietnamese sides; (3) insert each candidate pronoun at each candidate position in the original text and score the results with a language model to obtain the most likely missing pronoun and its position;

Step 1.3: using the predicted pronouns, analyze the syntactic component of each omitted pronoun with the dependency parsing library DDParser and label the missing-pronoun type of the Chinese-Vietnamese data: a missing subject pronoun is labeled S, a missing object pronoun is labeled O, a missing pronoun in another modifier role (e.g., attributive) is labeled A, and a sentence with no missing pronoun is labeled N.
3. The Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information according to claim 1, characterized in that in Step 2 the feature encoding of the source sentence through word embedding and position embedding comprises the following steps:

The source-sentence encoding module extracts the text features of the source sentence. Let X = {x^(1), ..., x^(k), ..., x^(K)} denote a source-language document consisting of K source sentences, where

x^(k) = (x^(k)_1, x^(k)_2, ..., x^(k)_I)

indicates that the k-th source sentence contains I words. Similarly, the corresponding target-language document is denoted Y = {y^(1), ..., y^(k), ..., y^(K)}, where

y^(k) = (y^(k)_1, y^(k)_2, ..., y^(k)_J)

indicates that the k-th target sentence contains J words, and y^(k)_j denotes the j-th word of the k-th target sentence. <X, Y> is assumed to form a parallel document and <x^(k), y^(k)> a parallel sentence pair. The word embeddings of the source sentence are encoded with a Transformer encoder module. To make use of word order, a position encoding with the same dimension as the word embedding is added to the word-embedding representation in the encoder. The core of the encoder is the self-attention mechanism: before multi-head attention is computed, the input representation is mapped to queries (Q), keys (K), and values (V), as follows:

E_src = E(x_1, x_2, ..., x_I)   (1)

E_src = Q = K = V   (2)

Attention(Q, K, V) = softmax(Q K^T / √d_k) V   (3)

where E_src is the word-embedding representation of the source sentence, d is the word-vector dimension of the source sentence, Q, K, V ∈ R^(I×d) are the query, key, and value matrices respectively, and √d_k is a scaling factor;

To exploit the high parallelism of attention, a multi-head attention mechanism applies scaled dot-product attention several times in parallel: Q, K, and V are each linearly projected h times with different projections, scaled dot-product attention is computed over the h projections in parallel, and the attention outputs are concatenated and projected to obtain a new representation. Multi-head attention lets the model jointly attend to information from different representation subspaces at different positions:

head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)   (4)

H_src = MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) W^O   (5)

where H_src ∈ R^(I×d) is the encoded output of the source sentence, W^Q_i, W^K_i, W^V_i ∈ R^(d×d_k) and W^O ∈ R^(d×d) are trainable parameters, and d_k = d/h.
4. The Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information according to claim 1, characterized in that in Step 2 the feature encoding of the source-sentence context through word embedding and position embedding comprises the following steps:

The context encoding module extracts the text features of the source sentence's context and encodes the context embeddings with a Transformer encoder module:

E_con = E(x_1, x_2, ..., x_I)   (6)

E_con = Q = K = V   (7)

H_con = Transformer_encoder(Q, K, V)   (8)

where E_con is the word-embedding representation of the context input text, Q, K, V ∈ R^(I×d) are the query, key, and value matrices respectively, and H_con ∈ R^(I×d) is the output of the context encoding module.
5. The Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information according to claim 1, characterized in that in Step 2, pooling and concatenating the obtained source-sentence and context features into a new representation and feeding the representation to the classifier for classification comprises the following steps:

The classification module classifies the zero-pronoun category using the text features of the source sentence and its context. H_src ∈ R^(I×d) and H_con ∈ R^(I×d) denote the encoder outputs of the current sentence and of its context respectively. The hidden states H_src and H_con are passed through Max-pooling and Mean-pooling operations and represented as vectors U and V respectively:

U = [f_max(H_src); f_mean(H_src)]   (9)

V = [f_max(H_con); f_mean(H_con)]   (10)

where U, V ∈ R^(2d) and f_max and f_mean denote the Max-pooling and Mean-pooling functions. U and V are concatenated to form the classifier input:

o = [U; V]   (11)

where o ∈ R^(4d). Finally, the zero-pronoun classification result is obtained through a fully connected layer and a four-way softmax layer:

z = softmax(σ(o W_1 + b_1) W_2 + b_2)   (12)

where W_1 ∈ R^(4d×d), W_2 ∈ R^(d×4), b_1 ∈ R^(1×d), and b_2 ∈ R^(1×4) are model parameters and σ is the sigmoid function.
6. The Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information according to claim 1, characterized in that in Step 2, extracting target-sentence features with the Transformer decoder and computing attention against the source-sentence and context features comprises the following steps:

The decoding module largely follows the standard Transformer decoder, with one difference: a context-attention sublayer is added between the multi-head masked self-attention sublayer and the encoder-decoder attention sublayer, so that context information can better improve translation performance. Unlike the encoder input, the decoder input consists only of the target-side sentence corresponding to the current source sentence. The decoder output is mapped into the target-side vocabulary space, the prediction probability of each word in the vocabulary is computed with a softmax function, and finally the loss between the predicted and gold results is computed.
7. The Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information according to claim 3, characterized in that in Step 2 the joint-learning loss function consists of two parts: the translation loss of the neural machine translation model and the classification loss of the zero-pronoun prediction;

The loss function at the translation target side is:

L_trans(θ) = -Σ_{n=1}^{D} Σ_{t=1}^{S_n} log P(Y_{n,t} | X_{n,t}; θ)   (13)

log P(Y_{n,t} | X_{n,t}; θ) = Σ_{j=1}^{m_{n,t}} log P(y_j | y_{<j}, X_{n,t}; θ)   (14)

where D is the number of parallel documents in the training set, S_n is the number of sentences in the n-th parallel document pair, X_n and Y_n are the source-side and target-side sentences of the n-th parallel document pair, m_{n,t} is the total number of tokens in the t-th sentence of the n-th parallel document pair, and θ denotes the training parameters of the model;

The loss function of the zero-pronoun classification is:

L_cls = -Σ_{i=1}^{N} Σ_{c=1}^{C} y^(i)_c log p^(i)_c   (15)

where N is the number of training examples of the zero-pronoun classification task, C is the number of class labels, and p^(i)_c is the probability that the model predicts class c;

Finally, the training objective of joint learning is:

L(θ) = L_trans(θ) + α · L_cls(θ)   (16)

where α is the weight of the zero-pronoun classification loss; the model sets α to 1.0.
CN202111557675.2A 2021-12-20 2021-12-20 Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information Pending CN114595700A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111557675.2A 2021-12-20 2021-12-20 Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111557675.2A 2021-12-20 2021-12-20 Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information

Publications (1)

Publication Number Publication Date
CN114595700A 2022-06-07

Family

ID=81803154

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111557675.2A 2021-12-20 2021-12-20 Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information (Pending)

Country Status (1)

Country Link
CN (1) CN114595700A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549644A (en) * 2018-04-12 2018-09-18 苏州大学 Omission pronominal translation method towards neural machine translation
CN109492223A (en) * 2018-11-06 2019-03-19 北京邮电大学 A kind of Chinese missing pronoun complementing method based on ANN Reasoning
CN111488733A (en) * 2020-04-07 2020-08-04 苏州大学 Chinese zero-index resolution method and system based on Mask mechanism and twin network
CN111666774A (en) * 2020-04-24 2020-09-15 北京大学 Machine translation method and device based on document context
CN112256868A (en) * 2020-09-30 2021-01-22 华为技术有限公司 Zero-reference resolution method, method for training zero-reference resolution model and electronic equipment
CN112507733A (en) * 2020-11-06 2021-03-16 昆明理工大学 Dependency graph network-based Hanyue neural machine translation method
CN112613326A (en) * 2020-12-18 2021-04-06 北京理工大学 Tibetan language neural machine translation method fusing syntactic structure
CN113095091A (en) * 2021-04-09 2021-07-09 天津大学 Chapter machine translation system and method capable of selecting context information
CN113743133A (en) * 2021-08-20 2021-12-03 昆明理工大学 Chinese cross-language abstract method fusing word granularity probability mapping information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LONGYUE WANG ET AL: "Dropped Pronoun Generation for Dialogue Machine Translation", 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
汪浩 et al.: "融合零指代识别的篇章级机器翻译" [Document-level machine translation incorporating zero-anaphora recognition], 《HTTPS://SCHOLAR.GOOGLE.COM/CITATIONS?VIEW_OP=VIEW_CITATION&HL=EN&USER=KY7Q3NEAAAAJ&CITATION_FOR_VIEW=KY7Q3NEAAAAJ:MVMSD5A6BFQC》 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116108862A (en) * 2023-04-07 2023-05-12 北京澜舟科技有限公司 Chapter-level machine translation model construction method, chapter-level machine translation model construction system and storage medium

Similar Documents

Publication Publication Date Title
CN110390103B (en) Automatic short text summarization method and system based on double encoders
WO2021233112A1 (en) Multimodal machine learning-based translation method, device, equipment, and storage medium
Zhang et al. SG-Net: Syntax guided transformer for language representation
CN110516244B (en) Automatic sentence filling method based on BERT
CN113343683B (en) Chinese new word discovery method and device integrating self-encoder and countertraining
CN111401079A (en) Training method and device of neural network machine translation model and storage medium
CN112765345A (en) Text abstract automatic generation method and system fusing pre-training model
CN114998670B (en) Multi-mode information pre-training method and system
CN112613326B (en) Tibetan language neural machine translation method fusing syntactic structure
CN110717341B (en) Method and device for constructing old-Chinese bilingual corpus with Thai as pivot
CN112183094A (en) Chinese grammar debugging method and system based on multivariate text features
CN113569562B (en) Method and system for reducing cross-modal and cross-language barriers of end-to-end voice translation
Zhang et al. Future-aware knowledge distillation for neural machine translation
Yun et al. Improving context-aware neural machine translation using self-attentive sentence embedding
CN114969304A (en) Case public opinion multi-document generation type abstract method based on element graph attention
González-Gallardo et al. Sentence boundary detection for French with subword-level information vectors and convolutional neural networks
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN115510863A (en) Question matching task oriented data enhancement method
Parvin et al. Transformer-based local-global guidance for image captioning
Wu et al. Joint intent detection model for task-oriented human-computer dialogue system using asynchronous training
CN113033153A (en) Neural machine translation model fusing key information based on Transformer model
CN114595700A (en) Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information
CN116720531A (en) Mongolian neural machine translation method based on source language syntax dependency and quantization matrix
Solomon et al. Amharic Language Image Captions Generation Using Hybridized Attention-Based Deep Neural Networks
CN115659172A (en) Generation type text summarization method based on key information mask and copy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220607)