CN114595700A - Zero-pronoun and chapter information fused Hanyue neural machine translation method - Google Patents
Zero-pronoun and chapter information fused Hanyue neural machine translation method
- Publication number
- CN114595700A CN114595700A CN202111557675.2A CN202111557675A CN114595700A CN 114595700 A CN114595700 A CN 114595700A CN 202111557675 A CN202111557675 A CN 202111557675A CN 114595700 A CN114595700 A CN 114595700A
- Authority
- CN
- China
- Prior art keywords
- pronouns
- sentence
- context
- zero
- classification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000013519 translation Methods 0.000 title claims abstract description 82
- 238000000034 method Methods 0.000 title claims abstract description 28
- 230000001537 neural effect Effects 0.000 title claims abstract description 19
- 238000011176 pooling Methods 0.000 claims abstract description 16
- 239000013598 vector Substances 0.000 claims description 24
- 230000006870 function Effects 0.000 claims description 18
- 238000012549 training Methods 0.000 claims description 18
- 238000005516 engineering process Methods 0.000 claims description 9
- 239000011159 matrix material Substances 0.000 claims description 9
- 238000004458 analytical method Methods 0.000 claims description 8
- 238000004364 calculation method Methods 0.000 claims description 6
- 230000009193 crawling Effects 0.000 claims description 6
- 238000002372 labelling Methods 0.000 claims description 6
- 238000003058 natural language processing Methods 0.000 abstract description 2
- 238000012360 testing method Methods 0.000 description 13
- 238000002474 experimental method Methods 0.000 description 8
- 230000000694 effects Effects 0.000 description 6
- 238000012512 characterization method Methods 0.000 description 4
- 238000013145 classification model Methods 0.000 description 2
- 238000012795 verification Methods 0.000 description 2
- 238000013528 artificial neural network Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000000052 comparative effect Effects 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 230000036651 mood Effects 0.000 description 1
- 230000008092 positive effect Effects 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 238000010200 validation analysis Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biophysics (AREA)
- Evolutionary Biology (AREA)
- Probability & Statistics with Applications (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information, and belongs to the technical field of natural language processing. The invention comprises the following steps: constructing a Chinese-Vietnamese-English trilingual aligned discourse dataset and annotating the Chinese-Vietnamese data with zero-pronoun classes; using a self-attention mechanism to obtain bilingual features of the source sentence and of its context; pooling and concatenating the source-sentence and context features and classifying the syntactic component of the zero pronoun; and predicting the target sentence from the source-sentence and context features through two attention sublayers. The invention adopts joint learning to learn and update the parameters of the main-task model and the auxiliary model simultaneously, combining the classification task with the translation task. Adding discourse information to the classification task improves zero-pronoun classification accuracy, and the discourse information also effectively improves translation performance. By fusing zero pronouns and discourse information, the invention effectively improves Chinese-Vietnamese neural machine translation performance.
Description
Technical Field
The invention relates to a Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information, and belongs to the technical field of natural language processing.
Background
Exchanges between China and Vietnam are increasingly close, and demand for Chinese-Vietnamese translation technology keeps growing. Translation under the low-resource Chinese-Vietnamese scenario is being studied more and more. However, current research targets formal registers, such as the translation of news text and official documents. For informal text, such as internet comments and everyday spoken dialogue, the same translation models perform noticeably worse. The main reason for the poor translation performance is that pronouns such as subjects and objects are often omitted in spoken and everyday dialogue. This is called the zero-pronoun phenomenon, and the missing pronouns are called zero pronouns. For humans these omissions are usually unproblematic, since pronoun information can be inferred from the speaker's tone, the speech context, contextual information, and so on. For machine translation, however, such sentences raise various difficulties of completeness and correctness. Cheng et al. first attempted to recover omitted pronouns with rules. Tan et al. used a special tagging method to integrate the translations of tagged omitted pronouns into a neural machine translation model as external knowledge. Although the Transformer can capture more semantic information with its multi-head attention mechanism, it handles omitted pronouns correctly only in simple cases; translation of omitted pronouns in complex sentences is often unsatisfactory.
Disclosure of Invention
The invention provides a Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information, which implicitly adds effective zero-pronoun information into the translation model to alleviate mistranslation of omitted pronouns. A joint-learning framework combines the classification task and the translation task so that the two interact: classification provides more zero-pronoun information for translation, and translation helps classification resolve ambiguity and similar problems.
The technical scheme of the invention is as follows: the Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information comprises the following specific steps:
Step1, crawling, collecting, and constructing Chinese-Vietnamese-English parallel data with web-crawler technology; finding the pronouns missing in Chinese and Vietnamese with a matrix-alignment method; analyzing the syntactic components of the omitted pronouns with the dependency-parsing library DDParser and using them as the gold labels of the classification task, thereby obtaining a Chinese-Vietnamese comparable corpus annotated with zero-pronoun information. Joint learning allows the parameters of the classification model and the translation model to be learned and updated simultaneously.
Step2, feature-encoding the source sentence and its context through word embedding and position embedding, extracting features with a Transformer encoder, pooling and concatenating the resulting source-sentence and context features into a new representation, and feeding that representation into a classifier; meanwhile, extracting target-sentence features with a Transformer decoder and computing attention with the source-sentence features and the context features respectively.
As a further scheme of the invention, the Step1 comprises the following specific steps:
Step1.1, crawling, collecting, and constructing Chinese-Vietnamese-English trilingual parallel data with web-crawler technology;
Step1.2, performing zero-pronoun prediction on the Chinese-Vietnamese data as follows: (1) generate Chinese-English and Vietnamese-English alignment matrices, exploiting the property that pronouns are not omitted in spoken English; (2) obtain the approximate pronoun positions from the alignment matrices and predict the pronouns missing in Chinese and Vietnamese from the English pronouns; (3) put each candidate pronoun back at each candidate position in the original text and score the results with a language model to obtain the most probable missing-pronoun position and missing pronoun;
Step1.3, according to the predicted pronouns, analyze the syntactic components of the omitted pronouns with the dependency-parsing library DDParser and label the pronoun-missing type of the Chinese-Vietnamese data: a missing subject pronoun is labeled S, a missing object pronoun is labeled O, a missing adverbial pronoun is labeled A, and sentences with no missing pronoun are labeled N.
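The alignment-and-labeling procedure of Step1.2 and Step1.3 can be sketched as follows. This is a minimal illustration under assumed inputs: the 0/1 alignment matrix and the dependency relation of the recovered pronoun are hypothetical stand-ins for the output of a real word aligner and of DDParser (whose actual relation tags differ), and the language-model scoring step is omitted.

```python
# Sketch of Step1.2-Step1.3: find an English pronoun with no Chinese counterpart
# in the alignment matrix, then map the recovered pronoun's dependency relation
# to one of the four class labels. All inputs below are hypothetical examples.

EN_PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}

def unaligned_pronouns(en_tokens, align):
    """align[i][j] == 1 iff English token i aligns to Chinese token j.
    An English pronoun whose row is all zeros has no Chinese counterpart,
    i.e. the pronoun was dropped on the Chinese side."""
    return [i for i, tok in enumerate(en_tokens)
            if tok.lower() in EN_PRONOUNS and not any(align[i])]

# Step1.3: dependency relation of the recovered pronoun -> class label
# S = missing subject, O = missing object, A = missing adverbial, N = no omission
DEPREL_TO_LABEL = {"nsubj": "S", "dobj": "O", "advmod": "A"}

def zp_label(deprel):
    return DEPREL_TO_LABEL.get(deprel, "N")

# "I gave it to him" aligned to a Chinese sentence that drops the subject "I":
en = ["I", "gave", "it", "to", "him"]
align = [
    [0, 0, 0],  # "I"   -> nothing: candidate dropped pronoun
    [1, 0, 0],  # "gave"
    [0, 1, 0],  # "it"
    [0, 0, 0],  # "to"  (function word; unaligned but not a pronoun)
    [0, 0, 1],  # "him"
]
missing = unaligned_pronouns(en, align)   # positions of dropped pronouns
label = zp_label("nsubj")                 # the dropped "I" was a subject
```

In the real pipeline, each candidate pronoun is additionally inserted back into the sentence and rescored with a language model before the final position and pronoun are chosen.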
As a further aspect of the present invention, in Step2, the feature encoding of the source sentence through word embedding and position embedding comprises the following steps:
Text features of the source sentence are extracted with the source-sentence encoding module. Let X = {x^(1), ..., x^(k), ..., x^(K)} denote a source-language document composed of K source sentences, where x^(k) = (x_1^(k), ..., x_I^(k)) is the k-th source sentence containing I words. Similarly, the corresponding target-language document is denoted Y = {y^(1), ..., y^(k), ..., y^(K)}, where y^(k) = (y_1^(k), ..., y_J^(k)) is the k-th target sentence containing J words and y_j^(k) is the j-th word of the k-th target sentence. <X, Y> forms a parallel document and <x^(k), y^(k)> a parallel sentence pair. The Transformer encoding module encodes the source-sentence embeddings; to exploit word order, a position encoding with the same dimension as the word embedding is added to the word-embedding representation in the encoding module. The core of the encoding module is the self-attention mechanism: when computing with the multi-head attention module, the input representation is projected into query (Q), key (K), and value (V) respectively, as follows:
E_src = E(x_1, x_2, ..., x_I) (1)
Q = K = V = E_src (2)
Attention(Q, K, V) = softmax(QK^T / √d_k)V (3)
where E_src is the word-embedding representation of the source sentence, d is the word-vector dimension of the source sentence, Q, K, V ∈ R^(I×d) are the query, key, and value matrices respectively, and √d_k is the scaling factor;
To exploit the parallelism of attention, a multi-head attention mechanism performs scaled dot-product attention multiple times in parallel: Q, K, and V are linearly projected h times with different learned projections, scaled dot-product attention is computed on the h projections in parallel, and the attention results are concatenated and projected again into a new representation. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions;
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V) (4)
H_src = MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h)W^O (5)
where H_src ∈ R^(I×d) is the encoded output of the source sentence; W_i^Q, W_i^K, W_i^V ∈ R^(d×d_k) and W^O ∈ R^(d×d) are trainable parameters, with d_k = d/h.
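The encoder-side computation of equations (1), (2), (4), and (5), together with the scaled dot-product attention they rely on, can be sketched in NumPy. Residual connections, layer normalization, and the feed-forward sublayer are omitted, and the random weight matrices are stand-ins for the trained parameters W_i^Q, W_i^K, W_i^V, W^O.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(X, h, d):
    # h parallel projections (Eq. 4), concatenated and re-projected (Eq. 5)
    rng = np.random.default_rng(0)
    d_k = d // h
    heads = []
    for _ in range(h):
        WQ, WK, WV = [rng.standard_normal((d, d_k)) * 0.1 for _ in range(3)]
        heads.append(attention(X @ WQ, X @ WK, X @ WV))
    WO = rng.standard_normal((d, d)) * 0.1
    return np.concatenate(heads, axis=-1) @ WO

I, d, h = 5, 16, 4  # I words, word-vector dimension d, h attention heads
E_src = np.random.default_rng(1).standard_normal((I, d))  # Q = K = V = E_src
H_src = multi_head(E_src, h, d)  # encoded source sentence, H_src in R^(I x d)
```

The same computation, applied to the context embedding E_con, yields the context representation H_con used below.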
As a further aspect of the present invention, Step2, the feature encoding of the source sentence context by word embedding and position embedding includes the following steps:
Text features of the source-sentence context are extracted with the context encoding module, and the context features are embedded and encoded with the Transformer encoding module;
E_con = E(x_1, x_2, ..., x_I) (6)
Q = K = V = E_con (7)
H_con = Transformer_encoder(Q, K, V) (8)
where E_con is the word-embedding representation of the context input text; Q, K, V ∈ R^(I×d) are the query, key, and value matrices respectively; and H_con ∈ R^(I×d) is the output of the context encoding module.
As a further scheme of the present invention, in Step2, pooling and concatenating the obtained source-sentence features and context features into a new representation and feeding that representation into the classifier comprises the following steps:
The classification module classifies zero-pronoun categories using the text features of the source sentence and its context. H_src ∈ R^(I×d) and H_con ∈ R^(I×d) denote the encoder outputs of the current sentence and of its context respectively. The hidden states H_src and H_con are passed through Max-pooling and Mean-pooling and represented as the vectors U and V:
U = [f_max(H_src); f_mean(H_src)] (9)
V = [f_max(H_con); f_mean(H_con)] (10)
where U, V ∈ R^(2d) and f_P denotes the pooling function. U and V are concatenated to form the classifier input:
o = [U; V] (11)
where o ∈ R^(4d). Finally, the zero-pronoun classification result is obtained through a fully connected layer and a four-way softmax layer:
z = softmax(σ(oW_1 + b_1)W_2 + b_2) (12)
where W_1 ∈ R^(4d×d), W_2 ∈ R^(d×4), b_1 ∈ R^(1×d), and b_2 ∈ R^4 are model parameters and σ is the sigmoid function.
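The pooling-and-classification step amounts to max- and mean-pooling each encoder output over the word axis, concatenating, and applying the two-layer classifier of equation (12) over the four labels {S, O, A, N}. A NumPy sketch, in which the encoder outputs and all weights are random placeholders for the trained H_src, H_con, W_1, W_2:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pool(H):
    # concatenate max-pooling and mean-pooling over the word axis -> R^(2d)
    return np.concatenate([H.max(axis=0), H.mean(axis=0)])

rng = np.random.default_rng(0)
I, d = 5, 16
H_src = rng.standard_normal((I, d))  # stand-in current-sentence encoder output
H_con = rng.standard_normal((I, d))  # stand-in context encoder output

o = np.concatenate([pool(H_src), pool(H_con)])            # o in R^(4d)
W1, b1 = rng.standard_normal((4 * d, d)) * 0.1, np.zeros(d)
W2, b2 = rng.standard_normal((d, 4)) * 0.1, np.zeros(4)   # four classes S/O/A/N
z = softmax(sigmoid(o @ W1 + b1) @ W2 + b2)               # class probabilities
```

The argmax over z gives the predicted zero-pronoun category of the current sentence.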
As a further aspect of the present invention, in Step2, extracting target-sentence features with the Transformer decoder and computing attention with the source-sentence features and the context features respectively comprises the following steps:
The decoding module largely follows the standard Transformer decoder, except that a context-attention sublayer is added between the multi-head masked self-attention sublayer and the encoder-decoder attention sublayer, so that the context information can better improve translation performance. Unlike the encoder input, the decoder input consists only of the target-side sentence corresponding to the current source sentence. The decoder output is mapped into the target-side vocabulary space, the prediction probability of every word in the vocabulary is computed with the softmax function, and finally the loss between the predicted and true results is computed;
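The sublayer ordering described above can be sketched as one simplified decoder layer in NumPy: masked self-attention, then the added context-attention sublayer, then encoder-decoder attention. Layer normalization, multi-head projections, and the feed-forward sublayer are omitted, so this is a structural sketch under assumed shapes, not the full decoder.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(Q, K, V):
    # single-head scaled dot-product attention
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def decoder_layer(Y, H_src, H_con, causal_mask):
    """Sublayer order described in the text: masked self-attention ->
    added context-attention sublayer -> encoder-decoder attention
    (residual connections kept; LayerNorm and feed-forward omitted)."""
    d = Y.shape[-1]
    scores = Y @ Y.T / np.sqrt(d)
    scores = np.where(causal_mask, scores, -1e9)  # hide future target words
    Y = Y + softmax(scores) @ Y                   # masked self-attention
    Y = Y + attend(Y, H_con, H_con)               # context-attention sublayer
    Y = Y + attend(Y, H_src, H_src)               # encoder-decoder attention
    return Y

J, I, d = 4, 5, 16  # J target words, I source/context words, model dim d
rng = np.random.default_rng(0)
Y = rng.standard_normal((J, d))       # embedded target-side input
H_src = rng.standard_normal((I, d))   # stand-in for the source encoder output
H_con = rng.standard_normal((I, d))   # stand-in for the context encoder output
mask = np.tril(np.ones((J, J), dtype=bool))  # lower-triangular causal mask
out = decoder_layer(Y, H_src, H_con, mask)
```

A projection of `out` into the vocabulary space followed by softmax would then give the per-word prediction probabilities.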
as a further scheme of the present invention, in Step2, the loss function of the joint learning is divided into two parts, namely, the translation loss of the neural machine translation model and the classification loss of the zero pronoun prediction;
The loss function of the translation side is:
L_trans(θ) = -Σ_{n=1..D} Σ_{t=1..S_n} Σ_{i=1..m_{n,t}} log P(y_i^{n,t} | y_{<i}^{n,t}, X_n; θ) (13)
where D is the number of parallel documents in the training set, S_n is the number of sentences of the n-th parallel document pair, X_n and Y_n are the source-side and target-side sentences of the n-th parallel document pair, m_{n,t} is the total number of tokens in the t-th sentence of the n-th parallel document pair, and θ denotes the model's training parameters;
The loss function of the zero-pronoun classification is the cross-entropy:
L_class = -Σ_{n=1..N} Σ_{c=1..C} y_c^{n} log p_c^{n} (14)
where N is the number of training examples of the zero-pronoun classification task, C is the number of class labels, y_c^{n} indicates whether class c is the true label of example n, and p_c^{n} is the probability the model assigns to class c;
Finally, the training objective of joint learning is:
L = L_trans + α·L_class (15)
where α is the weight of the zero-pronoun classification loss; α is set to 1.0 in the model.
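Numerically, the joint objective is just the sum of the two negative log-likelihoods with the classification term weighted by α. A small sketch with hypothetical gold-token and gold-class probabilities:

```python
import numpy as np

def nll(gold_probs):
    # negative log-likelihood summed over the gold items
    return -np.log(np.asarray(gold_probs)).sum()

# hypothetical probabilities the translation model assigns to the gold tokens
L_trans = nll([0.60, 0.30, 0.80])
# hypothetical probabilities the classifier assigns to the gold classes
L_class = nll([0.70, 0.50])

alpha = 1.0                      # classification-loss weight used in the model
L = L_trans + alpha * L_class    # joint training objective
```

Both terms are minimized together, so gradients from the classification task also update the shared encoders used by translation.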
The invention has the beneficial effects that:
1. For Chinese-Vietnamese neural machine translation of informal text, machine translation performs poorly on internet vocabulary, daily conversation, spoken language, and similar scenarios, because such text often contains expressions that do not follow standard grammar, most typically pronoun omission (the zero-pronoun phenomenon). The invention improves machine translation by filling in the missing grammatical information.
2. The invention reduces the pipeline of pronoun recovery followed by re-translation to a single end-to-end task, avoiding the propagation of pronoun-prediction errors into the translation task. With joint learning, the classification task and the translation task interact: classification provides more zero-pronoun information for translation, and translation helps classification resolve ambiguity and similar problems.
3. Context information is integrated into both the classification task and the translation task, significantly improving both the zero-pronoun classification task and the Chinese-Vietnamese machine translation task: adding discourse information to the classification task raises zero-pronoun classification accuracy, and the discourse information also effectively improves translation performance;
4. The Multi-Head Attention of the Transformer structure captures richer semantic features and parallelizes well.
Drawings
FIG. 1 is a block flow diagram of the method of the present invention.
Detailed Description
Example 1: as shown in FIG. 1, a Chinese-Vietnamese neural machine translation method fusing zero pronouns and discourse information comprises the following specific steps:
Step1, crawling, collecting, and constructing Chinese-Vietnamese-English parallel data with web-crawler technology; finding the pronouns missing in Chinese and Vietnamese with a matrix-alignment method; analyzing the syntactic components of the omitted pronouns with the dependency-parsing library DDParser and using them as the gold labels of the classification task, thereby obtaining a Chinese-Vietnamese comparable corpus annotated with zero-pronoun information. Joint learning allows the parameters of the classification model and the translation model to be learned and updated simultaneously.
Step1.1, crawling, collecting, and constructing Chinese-Vietnamese-English trilingual parallel data with web-crawler technology;
Step1.2, performing zero-pronoun prediction on the Chinese-Vietnamese data as follows: (1) generate Chinese-English and Vietnamese-English alignment matrices, exploiting the property that pronouns are not omitted in spoken English; (2) obtain the approximate pronoun positions from the alignment matrices and predict the pronouns missing in Chinese and Vietnamese from the English pronouns; (3) put each candidate pronoun back at each candidate position in the original text and score the results with a language model to obtain the most probable missing-pronoun position and missing pronoun;
Table 1 gives an example of the labels used for the classification task.
Step1.3, according to the predicted pronouns, analyze the syntactic components of the omitted pronouns with the dependency-parsing library DDParser and label the pronoun-missing type of the Chinese-Vietnamese data: a missing subject pronoun is labeled S, a missing object pronoun is labeled O, a missing adverbial pronoun is labeled A, and sentences with no missing pronoun are labeled N. The corpus scale of the experiment is shown in Table 2:
TABLE 2 Corpus-scale statistics of the experiment
Data | Sentences | Documents
Training set | 294K | 58.8K
Validation set | 3.21K | 0.64K
Test set | 3.15K | 0.63K
Step2 then proceeds exactly as described in the Disclosure above: the source sentence and its context are feature-encoded through word embedding and position embedding and encoded with the Transformer encoder; the resulting source-sentence and context features are pooled and concatenated and fed into the classifier for zero-pronoun classification; target-sentence features are extracted with the Transformer decoder, with attention computed against the source-sentence and context features through the added context-attention sublayer; and the model is trained with the joint loss combining the translation loss and the zero-pronoun classification loss, with the classification-loss weight α set to 1.0.
Finally, the Adam optimizer is selected: it converges quickly, its convergence process is stable, and it iteratively updates the neural-network weights from the training data. The learning rate (step size), which determines how far each gradient-descent iteration moves in the negative gradient direction, is set to 5e-5: too small a step converges slowly, while too large a step may overshoot the optimal solution. Testing step sizes from small to large, 5e-5 was selected as the best value.
To illustrate the effect of the invention, several groups of comparative experiments were set up. The first group verifies the effectiveness of the model, and the second group verifies the influence of omitted pronouns on Chinese-Vietnamese neural machine translation.
(1) Model validation
The invention tests the performance of the model CMT-G & A and its simplified variants on the training and test data of Table 2; the test results are shown in Table 3:
TABLE 3 Comparison of Chinese-to-Vietnamese and Vietnamese-to-Chinese translation results
Table 3 shows the performance of the proposed model on the Chinese-Vietnamese translation tasks. Compared with the baseline model, combining the zero-pronoun classification task brings BLEU gains of 0.64 on Chinese-to-Vietnamese translation and 0.44 on Vietnamese-to-Chinese translation, which demonstrates the effectiveness of the joint-learning method: fusing zero-pronoun information into machine translation through the joint classification task effectively raises the BLEU score. Adding only the context information yields BLEU gains of 0.71 and 0.75 respectively. The final model, which integrates context information on top of the joint task, improves BLEU by 1.42 and 1.31 and raises classification accuracy by about 3%. Context information therefore not only improves translation quality but also benefits the zero-pronoun classification task.
(2) Verifying the influence of omitted pronouns on Chinese-Vietnamese neural machine translation
In the movie and television subtitle dataset used by the invention, many sentences omit pronouns, but even more sentences in the dataset are complete sentences without omitted pronouns. To explore the difference in translation performance between sentences with and without omitted pronouns, the following experiment was performed: the original test set was divided into an omitted-pronoun test set (DP) and a no-omission test set (NP) according to whether an omitted pronoun exists, and both the baseline model and the proposed model were tested on each. The results are shown in Table 4:
TABLE 4 comparison of CMT-G & A with simplified model
As the analysis of Table 4 shows, on the no-dropped-pronoun test set both the baseline model and the proposed model obtain higher scores than on the full test set. This indicates that pronoun omission affects machine translation: sentences without dropped pronouns are translated better. On the no-dropped-pronoun test set, the proposed model improves over the baseline by 1.38 and 1.31 BLEU on Chinese-to-Vietnamese and Vietnamese-to-Chinese translation respectively, which shows that the discourse information fused by the proposed model contributes semantic information beyond the zero-pronoun information itself. On the dropped-pronoun test set, the translation quality of both the baseline model and the proposed model drops markedly compared with the full test set. However, the proposed model still improves over the baseline by 1.43 and 1.38 BLEU in the two translation directions respectively. This shows that the standard Transformer model has difficulty handling pronoun omission in the translation task, while the proposed model can effectively mitigate translation errors caused by dropped pronouns.
The experimental data prove that fusing the missing pronouns into the machine translation model via their syntactic information and the context, classifying the omitted pronouns from the source sentence and its context sentences within a joint-learning model structure, and incorporating the context information effectively improve machine translation performance. Meanwhile, the Transformer encoding module better captures long-range dependencies and improves the parallelism of the model. Experiments show that the proposed method achieves the best results against several baseline models. For the Hanyue neural machine translation task, the proposed method fusing zero pronouns and discourse information effectively improves Hanyue neural machine translation performance.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.
Claims (7)
1. The Hanyue (Chinese-Vietnamese) neural machine translation method fusing zero pronouns and discourse information is characterized by comprising the following specific steps:
step1, crawling and collecting Chinese-Vietnamese-English parallel data through web crawler technology, finding the pronouns missing in Chinese and Vietnamese using a matrix alignment method, analyzing the syntactic components of the omitted pronouns with the dependency syntax analysis library DDParser, and using these components as the true labels of the classification task, thereby obtaining a Chinese-Vietnamese comparable corpus annotated with zero-pronoun information;
step2, performing feature encoding on the source sentence and its context respectively through word embedding and position embedding, extracting features with a Transformer encoder, pooling and concatenating the resulting source-sentence features and context features into a new representation, and inputting this representation into a classifier for classification; simultaneously, extracting target-sentence features with a Transformer decoder and performing attention computation with the source-sentence features and the context features respectively.
2. The Hanyue neural machine translation method fusing zero pronouns and discourse information according to claim 1, characterized in that the specific steps of Step1 are as follows:
step1.1, crawling and collecting Chinese-Vietnamese-English parallel data through web crawler technology;
step1.2, performing zero-pronoun prediction on the Chinese-Vietnamese data, specifically: (1) generating Chinese-English and Vietnamese-English alignment matrices, exploiting the property that pronouns are not omitted in spoken English; (2) obtaining the approximate positions of the pronouns from the alignment matrices and predicting the pronouns missing in Chinese and Vietnamese from the English pronouns; (3) inserting each candidate pronoun back at each candidate position in the original text and scoring with a language model to obtain the most probable missing-pronoun position and missing pronoun;
step1.3, analyzing the syntactic components of the omitted pronouns with the dependency syntax analysis library DDParser according to the predicted pronouns, and labeling the missing-pronoun type of the Chinese-Vietnamese data: a missing subject pronoun is labeled S, a missing object pronoun is labeled O, a missing attributive pronoun is labeled A, and a sentence with no missing pronoun is labeled N.
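The alignment-based recovery of Step1.2 can be illustrated with a minimal sketch. This is an assumption-laden toy, not the patent's implementation: the pronoun set, token lists, and alignment pairs below are hypothetical, and a real pipeline would use an actual word aligner and a language model for scoring.

```python
# Hypothetical sketch: locate English pronouns with no aligned Chinese token,
# i.e. candidate dropped (zero) pronouns on the Chinese side.
EN_PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}

def find_dropped_pronouns(en_tokens, zh_tokens, align):
    """align: set of (en_idx, zh_idx) pairs produced by a word aligner.
    Returns (en_idx, pronoun) pairs for English pronouns that have no
    Chinese counterpart, i.e. candidate zero pronouns."""
    aligned_en = {e for e, _ in align}
    return [(i, tok) for i, tok in enumerate(en_tokens)
            if tok.lower() in EN_PRONOUNS and i not in aligned_en]

en = ["I", "bought", "it", "yesterday"]
zh = ["昨天", "买", "了"]            # subject and object pronouns dropped
align = {(1, 1), (3, 0)}             # "bought"->买, "yesterday"->昨天
print(find_dropped_pronouns(en, zh, align))  # [(0, 'I'), (2, 'it')]
```

In the full method, each candidate pronoun would then be re-inserted at candidate positions and scored by a language model, as described in sub-step (3).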
3. The Hanyue neural machine translation method fusing zero pronouns and discourse information according to claim 1, characterized in that in Step2 the feature encoding of the source sentence through word embedding and position embedding comprises the following steps:
extracting text features of the source sentences with the source-sentence encoding module; let X = {x^(1), ..., x^(k), ..., x^(K)} denote a source-language document composed of K source sentences, where the k-th source sentence x^(k) = (x_1^(k), ..., x_I^(k)) contains I words; similarly, the corresponding target-language document is denoted Y = {y^(1), ..., y^(k), ..., y^(K)}, where the k-th target sentence y^(k) = (y_1^(k), ..., y_J^(k)) contains J words and y_j^(k) denotes the j-th word of the k-th target sentence; <X, Y> forms a parallel document and <x^(k), y^(k)> a parallel sentence pair; the Transformer encoding module is used for feature-embedding encoding of the source sentence; to exploit word order, a position encoding with the same dimension as the word embedding representation is added to it in the encoding module; the core of the encoding module is the self-attention mechanism, and for the multi-head attention computation the input representation is processed into a query (Q), key (K), and value (V), as follows:
E_src = E(x_1, x_2, ..., x_I) (1)
E_src = Q = K = V (2)
Attention(Q, K, V) = softmax(QK^T/√d)V (3)
where E_src is the word embedding representation of the source sentence, d is the word vector dimension of the source sentence, Q, K, V ∈ R^(I×d) are the query vector, key vector, and value vector respectively, and √d is the scaling factor;
to exploit the high parallelism of attention, a multi-head attention mechanism performs scaled dot-product attention several times in parallel: Q, K, V are linearly projected h times with different learned projections, scaled dot-product attention is computed on the h projections in parallel, and finally the attention results are concatenated to obtain a new representation; multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions;
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V) (4)
H_src = MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h)W^O (5)
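As an illustration only (not the patent's implementation), the computation in equations (1)-(5) can be sketched in NumPy; all dimensions and the random projection matrices below are assumptions made for the demo:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention, eq. (3)
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_head(E, h, W_Q, W_K, W_V, W_O):
    # E: (I, d) word + position embeddings; Q = K = V = E, eq. (2)
    heads = [attention(E @ W_Q[i], E @ W_K[i], E @ W_V[i])  # eq. (4)
             for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O             # eq. (5)

rng = np.random.default_rng(0)
I, d, h = 5, 16, 4                       # toy sentence length / dims
E = rng.normal(size=(I, d))
W_Q = W_K = W_V = rng.normal(size=(h, d, d // h))
W_O = rng.normal(size=(d, d))
H_src = multi_head(E, h, W_Q, W_K, W_V, W_O)
print(H_src.shape)  # (5, 16)
```

Each head attends in a d/h-dimensional subspace; concatenating the h heads restores the model dimension d, matching H_src ∈ R^(I×d) in the claim.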
4. The Hanyue neural machine translation method fusing zero pronouns and discourse information according to claim 1, characterized in that in Step2 the feature encoding of the source-sentence context through word embedding and position embedding comprises the following steps:
extracting text features of the source-sentence context with the context encoding module, and performing feature-embedding encoding of the context with the Transformer encoding module;
E_con = E(x_1, x_2, ..., x_I) (6)
E_con = Q = K = V (7)
H_con = Transformer_encoder(Q, K, V) (8)
where E_con is the word embedding representation of the context input text; Q, K, V ∈ R^(I×d) are the query vector, key vector, and value vector respectively; and H_con ∈ R^(I×d) is the output of the context encoding module.
5. The Hanyue neural machine translation method fusing zero pronouns and discourse information according to claim 1, characterized in that in Step2 the pooling and concatenation of the obtained source-sentence features and context features into a new representation, and its input into a classifier for classification, comprise the following steps:
the classification module classifies the zero-pronoun category using the text features of the source sentence and its context; H_src ∈ R^(I×d) and H_con ∈ R^(I×d) denote the encoder outputs of the current sentence and of its context respectively; the hidden states H_src and H_con are passed through the Max-pooling and Mean-pooling operations and represented as vectors U and V respectively, i.e.:
U = f_P(H_src) (9)
V = f_P(H_con) (10)
where U, V ∈ R^(2d) and f_P is the pooling function concatenating the max-pooled and mean-pooled representations; U and V are then concatenated to form the classifier input, namely:
o=[U;V] (11)
where o ∈ R^(4d); finally, the classification result of zero-pronoun recognition is obtained through a fully connected layer and a four-class softmax layer:
z = softmax(σ(oW_1 + b_1)W_2 + b_2) (12)
where W_1 ∈ R^(4d×d), W_2 ∈ R^(d×4), b_1 ∈ R^(1×d), b_2 ∈ R^(1×4) are model parameters and σ is the sigmoid function.
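A minimal NumPy sketch of the classification head in equations (9)-(12) follows, assuming four zero-pronoun classes {S, O, A, N}; the dimensions and random parameters are illustrative only:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pool(H):
    # f_P: concatenate max-pooling and mean-pooling over tokens -> R^(2d)
    return np.concatenate([H.max(axis=0), H.mean(axis=0)])

def classify(H_src, H_con, W1, b1, W2, b2):
    o = np.concatenate([pool(H_src), pool(H_con)])   # o in R^(4d), eq. (11)
    hidden = 1 / (1 + np.exp(-(o @ W1 + b1)))        # sigma = sigmoid
    return softmax(hidden @ W2 + b2)                 # eq. (12)

rng = np.random.default_rng(1)
I, d, C = 7, 8, 4                  # C = 4 classes: S, O, A, N
H_src = rng.normal(size=(I, d))    # encoder output, current sentence
H_con = rng.normal(size=(I, d))    # encoder output, context
W1, b1 = rng.normal(size=(4 * d, d)), rng.normal(size=d)
W2, b2 = rng.normal(size=(d, C)), rng.normal(size=C)
z = classify(H_src, H_con, W1, b1, W2, b2)
print(z.shape)  # (4,)
```

The concatenation of max- and mean-pooling is why U, V ∈ R^(2d) and hence o ∈ R^(4d).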
6. The Hanyue neural machine translation method fusing zero pronouns and discourse information according to claim 1, characterized in that in Step2 the extraction of target-sentence features with the Transformer decoder and the attention computation with the source-sentence features and the context features respectively comprise the following steps:
the decoding module is largely consistent with the standard Transformer decoder, except that a context-attention sublayer is added between the multi-head masked self-attention sublayer and the encoder-decoder attention sublayer, so that the context information better improves the translation task; unlike the input to the encoder side, the input to the decoder side is only the target-side sentence corresponding to the current source-side sentence; the decoder output is mapped to the target-side vocabulary space, the prediction probability of each word in the vocabulary is computed with a softmax function, and finally the loss between the predicted result and the true result is computed.
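The sublayer ordering of the modified decoder layer can be sketched abstractly; `decoder_layer`, `attend`, and `ffn` are assumed names standing in for the actual sublayers, and residual connections are shown while layer normalization is omitted for brevity:

```python
# Illustrative sublayer ordering only, not the patent's code:
# the context-attention sublayer sits between masked self-attention
# and encoder-decoder attention.
def decoder_layer(y, H_src, H_con, attend, ffn):
    y = y + attend(y, y)        # masked multi-head self-attention
    y = y + attend(y, H_con)    # added context-attention sublayer
    y = y + attend(y, H_src)    # encoder-decoder attention
    return y + ffn(y)           # position-wise feed-forward

# trivial smoke run with stub sublayers that contribute nothing
out = decoder_layer(1.0, 2.0, 3.0, lambda q, kv: 0.0, lambda y: 0.0)
print(out)  # 1.0
```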
7. The Hanyue neural machine translation method fusing zero pronouns and discourse information according to claim 3, characterized in that in Step2 the loss function of joint learning is divided into two parts: the translation loss of the neural machine translation model and the classification loss of zero-pronoun prediction;
the loss function for the translation target side is:
L_MT(θ) = -Σ_{n=1}^{D} Σ_{t=1}^{S_n} Σ_{m=1}^{m_{n,t}} log P(y_m^{n,t} | y_{<m}^{n,t}, X_n; θ) (13)
where D denotes the number of parallel documents in the training set, S_n the number of sentences in the n-th parallel document pair, X_n and Y_n the source-side and target-side sentences of the n-th parallel document pair, m_{n,t} the total number of tokens in the t-th sentence of the n-th parallel document pair, and θ the training parameters of the model;
the loss function for the zero-pronoun classification is:
L_ZP = -Σ_{n=1}^{N} Σ_{c=1}^{C} y_c^{(n)} log p_c^{(n)} (14)
where N denotes the number of training examples of the zero-pronoun classification task, C the number of class labels, y_c^{(n)} the true-label indicator, and p_c^{(n)} the probability of the model predicting class c;
finally, the training objective of joint learning is:
L = L_MT + α·L_ZP (15)
where α is the weight parameter of the zero-pronoun classification loss, set to 1.0 in the model.
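The joint objective L = L_MT + α·L_ZP can be illustrated with a toy computation; the probabilities below are made-up placeholders for the model probabilities assigned to the gold tokens and the gold zero-pronoun class:

```python
import numpy as np

def nll(prob_of_gold):
    """Negative log-likelihood of the gold tokens/labels."""
    return -float(np.sum(np.log(prob_of_gold)))

# illustrative model probabilities for the gold target tokens of one sentence
p_tokens = np.array([0.9, 0.7, 0.8])
# illustrative model probability for the gold zero-pronoun class (one example)
p_class = np.array([0.6])

alpha = 1.0                                   # weight from the claim
loss = nll(p_tokens) + alpha * nll(p_class)   # L = L_MT + alpha * L_ZP
print(round(loss, 4))  # 1.196
```

Because α = 1.0, the translation and classification losses contribute equally to the gradient during joint training.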
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111557675.2A CN114595700A (en) | 2021-12-20 | 2021-12-20 | Zero-pronoun and chapter information fused Hanyue neural machine translation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114595700A true CN114595700A (en) | 2022-06-07 |
Family
ID=81803154
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111557675.2A Pending CN114595700A (en) | 2021-12-20 | 2021-12-20 | Zero-pronoun and chapter information fused Hanyue neural machine translation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114595700A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115048946A (en) * | 2022-06-14 | 2022-09-13 | 昆明理工大学 | Chapter-level neural machine translation method fusing topic information |
CN116108862A (en) * | 2023-04-07 | 2023-05-12 | 北京澜舟科技有限公司 | Chapter-level machine translation model construction method, chapter-level machine translation model construction system and storage medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108549644A (en) * | 2018-04-12 | 2018-09-18 | 苏州大学 | Omission pronominal translation method towards neural machine translation |
CN109492223A (en) * | 2018-11-06 | 2019-03-19 | 北京邮电大学 | A kind of Chinese missing pronoun complementing method based on ANN Reasoning |
CN111488733A (en) * | 2020-04-07 | 2020-08-04 | 苏州大学 | Chinese zero-index resolution method and system based on Mask mechanism and twin network |
CN111666774A (en) * | 2020-04-24 | 2020-09-15 | 北京大学 | Machine translation method and device based on document context |
CN112256868A (en) * | 2020-09-30 | 2021-01-22 | 华为技术有限公司 | Zero-reference resolution method, method for training zero-reference resolution model and electronic equipment |
CN112507733A (en) * | 2020-11-06 | 2021-03-16 | 昆明理工大学 | Dependency graph network-based Hanyue neural machine translation method |
CN112613326A (en) * | 2020-12-18 | 2021-04-06 | 北京理工大学 | Tibetan language neural machine translation method fusing syntactic structure |
CN113095091A (en) * | 2021-04-09 | 2021-07-09 | 天津大学 | Chapter machine translation system and method capable of selecting context information |
CN113743133A (en) * | 2021-08-20 | 2021-12-03 | 昆明理工大学 | Chinese cross-language abstract method fusing word granularity probability mapping information |
Non-Patent Citations (2)
Title |
---|
LONGYUE WANG ET AL: "Dropped Pronoun Generation for Dialogue Machine Translation", 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
WANG HAO ET AL: "Document-Level Machine Translation Fusing Zero-Anaphora Recognition" (融合零指代识别的篇章级机器翻译), 《HTTPS://SCHOLAR.GOOGLE.COM/CITATIONS?VIEW_OP=VIEW_CITATION&HL=EN&USER=KY7Q3NEAAAAJ&CITATION_FOR_VIEW=KY7Q3NEAAAAJ:MVMSD5A6BFQC》 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20220607 |
|