CN112541364A - Chinese-Vietnamese neural machine translation method fusing multi-level language feature knowledge - Google Patents

Chinese-Vietnamese neural machine translation method fusing multi-level language feature knowledge

Info

Publication number
CN112541364A
CN112541364A
Authority
CN
China
Prior art keywords
phrase
word
tree
chinese
machine translation
Prior art date
Legal status
Pending
Application number
CN202011409192.3A
Other languages
Chinese (zh)
Inventor
余正涛
邹翔
赖华
徐毓
文永华
朱俊国
Current Assignee
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
2020-12-03 Priority to CN202011409192.3A
2021-03-23 Publication of CN112541364A
Status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a Chinese-Vietnamese neural machine translation method fusing multi-level language feature knowledge, which separately fuses and analyzes linguistic feature knowledge at three different levels: characters, words and phrases. First, character-based word representations are obtained with a bidirectional LSTM and combined with pre-trained word vectors, letting the model dynamically select between word-vector and character information. Second, by constructing a phrase-tree encoder on top of a standard sequence encoder, phrase information within a sentence is further merged into the sequence transformation process of Chinese-Vietnamese neural machine translation. Experimental results show that this fusion method can effectively exploit language feature knowledge at different levels to compensate for the scarcity of Chinese-Vietnamese resources, and improves the performance of the Chinese-Vietnamese translation model to a certain extent.

Description

Chinese-Vietnamese neural machine translation method fusing multi-level language feature knowledge
Technical Field
The invention relates to a Chinese-Vietnamese neural machine translation method fusing multi-level language feature knowledge, and belongs to the technical field of natural language processing.
Background
Chinese-Vietnamese is a typical low-resource language pair with few available resources, so the problem of insufficient resources must be addressed by exploiting language feature knowledge at different levels. Vietnamese has rich morphological variation and diverse grammatical structures; the invention aims to fully mine and exploit language feature knowledge at different levels to address the resource scarcity faced by Chinese-Vietnamese neural machine translation.
Language feature knowledge at different levels refers to the semantic information contained in symbol sequences at different levels, such as characters, words and phrases. Most existing neural machine translation operates on words, but training word vectors requires large-scale corpora, and out-of-vocabulary words easily arise during translation. Researchers have therefore considered fully exploiting the information inside words, starting from smaller granularities. Considering the rich morphological variation and diverse grammatical structures of Vietnamese, the invention uses three levels (characters, words and phrases) as a multi-level representation of the linguistic symbol sequence. The character sequence effectively captures the morphological variation of Vietnamese: any Vietnamese word is a combination of a character sequence, and the character sequence effectively represents the information contained in the word, which alleviates to some extent the rare-word problem that easily arises with small-scale corpora. The word sequence intuitively depicts the semantic information of the source language, matches the habitual human mode of expression, and was the earliest and most immediately effective translation unit used in machine translation. The phrase sequence contains word-order and syntactic-structure information, which helps with the long-distance dependency problem in Chinese-Vietnamese neural machine translation. The invention therefore proposes a Chinese-Vietnamese neural machine translation method fusing multi-level language feature knowledge.
Disclosure of Invention
The invention provides a Chinese-Vietnamese neural machine translation method fusing multi-level language feature knowledge, which effectively exploits language feature knowledge at different levels to compensate for the scarcity of Chinese-Vietnamese resources and improves the performance of the Chinese-Vietnamese translation model to a certain extent.
The invention provides a Chinese-Vietnamese neural machine translation method fusing multi-level language feature knowledge (characters, words and phrases). The method separately fuses and analyzes language feature knowledge at three different levels: characters (Character), words (Word) and phrases (Phrase). To exploit this knowledge effectively, the invention first obtains character-based word vector representations through a bidirectional LSTM, then combines these character-based representations with pre-trained word vectors, letting the model dynamically select between the word vectors and the character information through an attention mechanism. Second, by constructing a phrase-tree encoder on top of a standard sequence encoder, phrase information within a sentence is further merged into the sequence transformation process of Chinese-Vietnamese neural machine translation. Experimental results show that this fusion method, the preferred technical scheme obtained during experimentation, can effectively exploit language feature knowledge at different levels to compensate for the scarcity of Chinese-Vietnamese resources and improves the performance of the Chinese-Vietnamese translation model to a certain extent.
The technical scheme of the invention is as follows: a Chinese-Vietnamese neural machine translation method fusing multi-level language feature knowledge, comprising the following specific steps:
step1, corpus collection and preprocessing: collecting Chinese and Vietnamese parallel data, and preprocessing each side with preprocessing tools suited to the characteristics of Chinese and Vietnamese respectively;
step2, on the basis of Step1, obtaining vectors of the characters within a word using a bidirectional LSTM, and combining the word vector computed from character training with the pre-trained word vector to obtain a word vector fusing character features;
step3, deep semantic feature fusion: in head-driven phrase structure grammar, a sentence is composed of several phrase units and is represented as a binary tree; according to this sentence structure, a phrase-tree-based encoder is constructed on top of the standard sequence encoder, further merging phrase feature knowledge into the word-based standard sequence encoder;
this completes the realization of Chinese-Vietnamese neural machine translation fusing language feature knowledge at different levels.
Further, the Step1 includes the specific steps of:
step1.1, obtaining 140K Chinese-Vietnamese parallel sentence pairs through web crawling and manual collection, of which 2K parallel sentence pairs are used as the test set and 2K parallel sentence pairs as the validation set;
step1.2, the Chinese data are segmented with the Jieba word segmenter, and phrase syntax parsing is performed with Stanford University's StanfordNLP toolkit; the Vietnamese data are parsed with a Vietnamese phrase syntax parsing tool to obtain Vietnamese phrase trees.
Further, the Step2 includes the specific steps of:
step2.1, in neural machine translation, natural language must be characterized as feature vectors to serve as model input, and the semantic vector representation of a word is computed from the information of the characters within the word;
step2.2, combining the word vector computed from character training with the pre-trained word vector by a weighted combination to obtain the optimal representation of a semantic unit;
step2.3, common words already have high-quality word vector representations, so the character representation is aligned with the word vector by optimizing the vector, and the word vector fusing character features is finally obtained through training.
Further, the Step3 includes the specific steps of:
step3.1, in head-driven phrase structure grammar, a sentence is composed of several phrases and is represented as a binary tree, where each node in the binary tree is represented by an LSTM unit and a sentence vector is constructed from phrase vectors in a bottom-up manner;
step3.2, when computing the LSTM units of the leaf nodes, the model is allowed to compute different representations for the same word when it occurs multiple times in a sentence; the model then has two different sentence vectors, one from the sequence encoder and the other from the phrase-tree-based encoder, so another Tree-LSTM unit is provided that takes the final sequence encoder unit and the phrase-tree-based encoder unit as its two child units and is used to initialize the decoder unit;
step3.3, an attention mechanism is introduced into the phrase tree-sequence model so that the model attends not only to the sequence hidden units but also to the phrase hidden units; when decoding a target word the model can learn which words or phrases in the source sentence are important, further merging phrase feature knowledge on top of the word-based encoder.
The invention has the beneficial effects that: through the fused representation of linguistic feature knowledge at three different levels, characters, words and phrases, the invention introduces the semantic information contained in different symbol sequences into the neural machine translation process. Experimental results show that the method effectively exploits language feature knowledge at different levels and improves the performance of Chinese-Vietnamese neural machine translation to a certain extent.
Drawings
FIG. 1 is a flow chart of neural machine translation incorporating knowledge of multi-level linguistic features in accordance with the present invention;
FIG. 2 is a schematic diagram of shallow semantic feature fusion in accordance with the present invention;
FIG. 3 is an exemplary diagram of a Vietnamese phrase structure tree according to the present invention;
FIG. 4 is a diagram of the phrase tree-sequence based neural machine translation of the present invention.
Detailed Description
Example 1: as shown in figs. 1-4, a Chinese-Vietnamese neural machine translation method fusing multi-level language feature knowledge comprises the following specific steps:
step1, corpus collection and preprocessing: collecting Chinese and Vietnamese parallel data, and preprocessing each side with preprocessing tools suited to the characteristics of Chinese and Vietnamese respectively;
step1.1, obtaining 140K Chinese-Vietnamese parallel sentence pairs through web crawling and manual collection, of which 2K parallel sentence pairs are used as the test set and 2K parallel sentence pairs as the validation set;
step1.2, in the preprocessing of the experimental data, the Chinese data are segmented with the Jieba word segmentation tool, and phrase syntax parsing is performed with Stanford University's StanfordNLP toolkit. The Vietnamese data are parsed with a Vietnamese phrase syntax parsing tool to obtain Vietnamese phrase trees: because few open-source syntax parsing tools exist for Vietnamese, the Vietnamese phrase syntax parser developed by Li Ying et al. is used to parse the Vietnamese data into phrase trees. The experimental data used are shown in Table 1, and an illustrative preprocessing sketch is given after the table.
Table 1 Experimental data settings [the table is an image in the original publication and its contents are not reproduced here]
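By way of illustration only, the following minimal Python sketch shows this kind of preprocessing. It is not the patent's exact pipeline: it assumes the open-source jieba package for Chinese segmentation and an NLTK client talking to a locally running Stanford CoreNLP server loaded with the Chinese models; the Vietnamese phrase parser by Li Ying et al. is not publicly packaged and is omitted here.

    # Illustrative preprocessing sketch (assumed tooling, not the patent's exact pipeline).
    import jieba
    from nltk.parse.corenlp import CoreNLPParser

    def segment_chinese(sentence):
        """Segment a Chinese sentence into words with the Jieba segmenter."""
        return list(jieba.cut(sentence))

    def parse_chinese_phrases(tokens):
        """Constituency-parse pre-segmented tokens via a CoreNLP server
        (assumed to be running locally with the Chinese models loaded)."""
        parser = CoreNLPParser(url="http://localhost:9000")
        return next(parser.parse(tokens))  # an nltk.Tree phrase-structure tree

    tokens = segment_chinese("他喜欢游泳")   # e.g. ['他', '喜欢', '游泳']
    tree = parse_chinese_phrases(tokens)     # the patent then binarizes such trees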
Step2, on the basis of Step1, obtaining vectors of the characters within a word using a bidirectional LSTM, and combining the word vector computed from character training with the pre-trained word vector to obtain a word vector fusing character features;
step2.1, in neural machine translation, natural language must be characterized as feature vectors to serve as model input. We therefore first consider how to compute the semantic vector representation of a word from the information of the characters within it. As shown in fig. 2, the words in a sentence are decomposed into characters, giving a character embedding sequence (c_1, ..., c_R) that is passed through a bidirectional LSTM:

h_i^f = LSTM_f(c_i, h_{i-1}^f), h_i^b = LSTM_b(c_i, h_{i+1}^b) (1)

The last hidden vector of each LSTM direction is concatenated as the character representation of a single word and then passed through a separate non-linear layer:

h* = [h_R^f ; h_1^b], m = tanh(W_m h*) (2)

where W_m is a weight matrix mapping the concatenated hidden vectors of the two LSTMs to the combined word representation m constructed from single characters;
in the shallow semantic feature fusion of characters and words, word2vec pre-training word vectors are used, the word embedding dimension is 256-dimensional, and words which appear only once in training data are replaced by universal OOV marks and are still used in character components. All the numbers in the corpus are replaced by the character "0", the embedding length of the character is set to 50, and random initialization is performed. The LSTM layer size for each direction is set to 200, the combination indicates that m has the same dimension as word embedding, the default learning rate is 1.0, and the batch size is 64.
Step2.2, the word vector computed from character training and the pre-trained word vector are combined by a weighted combination to obtain the optimal representation of a semantic unit. Equations (3) and (4) let the model combine, for each word, its word embedding x and its character composition m:

z = σ(W_z^(3) tanh(W_z^(1) x + W_z^(2) m)) (3)

x~ = z · x + (1 - z) · m (4)

where W_z^(1), W_z^(2) and W_z^(3) are weight matrices for z, and σ(·) is the logistic function with values in the range [0, 1]. The vectors z, x and m have the same dimensions, which allows the model to dynamically decide how much information to use from the word embedding and from the character component;
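A corresponding sketch of the gate in equations (3)-(4), under the same PyTorch assumption; the three linear maps mirror W_z^(1), W_z^(2) and W_z^(3):

    # Gating of eqs. (3)-(4): z decides, per dimension, how much to take
    # from the pre-trained word embedding x and the character vector m.
    import torch
    import torch.nn as nn

    class CharWordGate(nn.Module):
        def __init__(self, dim=256):
            super().__init__()
            self.W_z1 = nn.Linear(dim, dim, bias=False)
            self.W_z2 = nn.Linear(dim, dim, bias=False)
            self.W_z3 = nn.Linear(dim, dim, bias=False)

        def forward(self, x, m):
            z = torch.sigmoid(self.W_z3(torch.tanh(self.W_z1(x) + self.W_z2(m))))
            return z * x + (1.0 - z) * m   # x~, eq. (4)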
step2.3, common words themselves already have high-quality word vector representations, so the character representation is aligned with the word vector by optimizing the vector m:

E~ = E + Σ_t g_t (1 - cos(m^(t), x_t)) (5)

g_t = 0 if word t is OOV, and 1 otherwise (6)

where m^(t) is the representation dynamically built from the individual characters of the t-th input word. The character component should not learn from the embeddings of rare words, so g_t is set to 0 for OOV words. The word vector x~ fusing character features is finally obtained through training.
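The auxiliary objective of equations (5)-(6) can be sketched as follows, again assuming PyTorch; the boolean OOV mask is our own convention:

    # Auxiliary loss of eqs. (5)-(6): pull the character composition m^(t)
    # toward the word embedding x_t for in-vocabulary words; OOV words get
    # g_t = 0 so their (unreliable) embeddings are not learned from.
    import torch
    import torch.nn.functional as F

    def char_alignment_loss(m, x, is_oov):
        """m, x: (T, dim); is_oov: (T,) bool. Returns sum_t g_t (1 - cos(m^(t), x_t))."""
        g = (~is_oov).float()                     # eq. (6)
        cos = F.cosine_similarity(m, x, dim=-1)   # cos(m^(t), x_t)
        return (g * (1.0 - cos)).sum()            # added to the main loss E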
Step3, deep semantic feature fusion: in head-driven phrase structure grammar, a sentence is composed of several phrase units and is represented as a binary tree; according to this sentence structure, a phrase-tree-based encoder is constructed on top of the standard sequence encoder, further merging phrase feature knowledge into the word-based standard sequence encoder. In head-driven phrase structure grammar, a sentence is composed of several phrase units and represented as a binary tree, as shown in FIG. 3.
Step3.1, based on this sentence structure, we construct a phrase-tree-based encoder on top of the standard sequence encoder, as shown in FIG. 4. Each node in the binary tree is represented by an LSTM unit, and a sentence vector is constructed from phrase vectors in a bottom-up manner. The k-th parent hidden unit h_k^(phr) is computed from the left and right child hidden units h_k^l and h_k^r:

h_k^(phr) = f_tree(h_k^l, h_k^r) (7)

where f_tree is a non-linear function. Each non-leaf node is likewise represented by an LSTM unit, and the LSTM unit of a parent node is computed from its two child LSTM units. The hidden unit h_k^(phr) and memory cell c_k^(phr) of the k-th parent node are computed as follows:

i_k = σ(W_i^l h_k^l + W_i^r h_k^r + b_i)
f_k^l = σ(W_fl^l h_k^l + W_fl^r h_k^r + b_fl)
f_k^r = σ(W_fr^l h_k^l + W_fr^r h_k^r + b_fr)
o_k = σ(W_o^l h_k^l + W_o^r h_k^r + b_o)
u~_k = tanh(W_u^l h_k^l + W_u^r h_k^r + b_u)
c_k^(phr) = i_k ⊙ u~_k + f_k^l ⊙ c_k^l + f_k^r ⊙ c_k^r
h_k^(phr) = o_k ⊙ tanh(c_k^(phr)) (8)

where i_k, f_k^l, f_k^r, o_k and u~_k are the input gate, the left and right forget gates, the output gate and the update state of the memory cell, respectively; c_k^l and c_k^r are the memory cells of the left and right child units; the W terms are weight matrices, the b terms are bias vectors, and ⊙ denotes the element-wise product;
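A minimal sketch of the parent update in equation (8), assuming PyTorch; the five gates are packed into one linear map over the concatenated child states, which is algebraically equivalent to the separate left/right weight matrices above:

    # Binary Tree-LSTM cell: one input gate, separate left/right forget
    # gates, an output gate and a candidate state, all computed from the
    # two child hidden states; memory cells combine as in eq. (8).
    import torch
    import torch.nn as nn

    class BinaryTreeLSTMCell(nn.Module):
        def __init__(self, dim=256):
            super().__init__()
            self.gates = nn.Linear(2 * dim, 5 * dim)   # i, f_l, f_r, o, u
            self.dim = dim

        def forward(self, h_l, c_l, h_r, c_r):
            a = self.gates(torch.cat([h_l, h_r], dim=-1))
            i, f_l, f_r, o, u = a.split(self.dim, dim=-1)
            c = (torch.sigmoid(i) * torch.tanh(u)
                 + torch.sigmoid(f_l) * c_l + torch.sigmoid(f_r) * c_r)
            h = torch.sigmoid(o) * torch.tanh(c)        # h_k^(phr)
            return h, c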
step3.2, when computing the LSTM units of the leaf nodes, we allow the model to compute different representations of the same word when it occurs multiple times in a sentence. The model now has two different sentence vectors: one from the sequence encoder and the other from the phrase-tree-based encoder. We provide another Tree-LSTM unit that takes the final sequence encoder unit h_n and the phrase-tree-based encoder unit h_root^(phr) as its two child units to initialize the decoder unit s_1:

s_1 = g_tree(h_n, h_root^(phr))

where the function g_tree has the same form as the function f_tree. This initialization allows the decoder to capture information from both the sequence data and the phrase structure. When the parser cannot output a parse tree for a sentence, setting h_root^(phr) = 0 makes the sentence be encoded by the sequence encoder alone, so any sentence can be processed by the phrase tree-sequence based encoder;
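A sketch of this initialization, reusing the BinaryTreeLSTMCell above as g_tree (our own naming); a zero phrase state makes the model fall back to the plain sequence encoder when no parse tree is available:

    # Decoder initialization: s_1 = g_tree(h_n, h_root^(phr)); when the
    # parser fails, the phrase state is zeroed and only the sequence
    # encoder's final state informs the decoder.
    import torch

    def init_decoder(g_tree, h_n, c_n, h_root, c_root, has_parse):
        if not has_parse:
            h_root, c_root = torch.zeros_like(h_n), torch.zeros_like(c_n)
        return g_tree(h_n, c_n, h_root, c_root)   # (s_1, initial memory cell)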
step3.3, an attention mechanism is introduced into the phrase tree-sequence model so that the model attends not only to the sequence hidden units but also to the phrase hidden units; when decoding a target word the model can learn which words or phrases in the source sentence are important, further merging phrase feature knowledge on top of the word-based encoder.
The j-th context vector d_j is computed by attention weights α_j(i) over both the sequence hidden units and the phrase hidden units:

d_j = Σ_{i=1}^{n} α_j(i) h_i + Σ_{k=1}^{n-1} α_j(n+k) h_k^(phr)

If the binary tree has n leaves, it has n-1 phrase nodes, so the attention spans 2n-1 source units in total, and the final decoder output is set to s~_j = tanh(W_d [s_j ; d_j] + b_d).
In addition, the invention adopts the input-feeding approach: the previous attentional hidden unit s~_{j-1} is fed into the computation of the current target hidden unit s_j together with the previously predicted word y_{j-1}:

s_j = g_dec([y_{j-1} ; s~_{j-1}], s_{j-1})

where [y_{j-1} ; s~_{j-1}] is the concatenation of the embedding of y_{j-1} and s~_{j-1}. The input-feeding approach helps enrich the computation of the decoder, and experiments show that it improves the BLEU score.
This completes the realization of Chinese-Vietnamese neural machine translation fusing language feature knowledge at different levels.
In deep semantic feature fusion, the hidden units and word embeddings are 256-dimensional, the forget gate bias terms of the LSTM and Tree-LSTM are initialized to 1.0, and the remaining model parameters are uniformly initialized in [-0.1, 0.1]. The model parameters are optimized with plain SGD, with an initial learning rate of 1.0 and a batch size of 128. When the loss worsens, the learning rate is halved. The gradient norm is clipped to 3.0 to avoid the exploding gradient problem. The models are evaluated with the automatic BLEU metric in the experiments.
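The training configuration described above can be sketched as follows, assuming PyTorch; the model and data are stand-ins:

    # SGD with initial learning rate 1.0, halved when validation loss
    # worsens; gradient norm clipped to 3.0; batches of 128.
    import torch
    import torch.nn as nn

    model = nn.Linear(8, 1)   # stand-in for the full translation model
    optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5)   # halve the learning rate

    for epoch in range(3):
        for _ in range(10):                  # stand-in training batches
            x, y = torch.randn(128, 8), torch.randn(128, 1)
            optimizer.zero_grad()
            loss = nn.functional.mse_loss(model(x), y)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 3.0)
            optimizer.step()
        val = nn.functional.mse_loss(model(torch.randn(32, 8)),
                                     torch.randn(32, 1))
        scheduler.step(val)                  # step on the validation loss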
To study the influence of multi-level language feature knowledge on Chinese-Vietnamese neural machine translation performance, the experiments compare a character-only model (LSTM+C), a word-only model (LSTM+W), a phrase-tree-only model (Tree-LSTM), a model fusing only characters and words (LSTM+C+W), a model fusing only characters and phrase trees (Tree-LSTM+C), a model fusing only words and phrase trees (Tree-LSTM+W), and the model proposed by the invention (Tree-LSTM+C+W). The results of the experiment are shown in Table 2.
TABLE 2 Influence of fusing different levels of linguistic feature knowledge on the BLEU score [the table is an image in the original publication and its contents are not reproduced here]
Comparing the experimental results in Table 2, the model fusing only characters and words (LSTM+C+W), the model fusing only characters and phrases (Tree-LSTM+C) and the model fusing only words and phrases (Tree-LSTM+W) all achieve higher BLEU scores than the three models without feature fusion. Compared with the model fusing only characters and words (LSTM+C+W), the model of the invention (Tree-LSTM+C+W) improves the BLEU score by 0.95 percentage points; compared with the model fusing only characters and phrases (Tree-LSTM+C), by 0.69 percentage points; and compared with the model fusing only words and phrases (Tree-LSTM+W), by 0.58 percentage points. By deeply mining and exploiting characters, words and phrases, the invention effectively introduces language feature knowledge at different levels into neural machine translation, improving the performance of Chinese-Vietnamese neural machine translation to a certain extent.
As can also be seen from Table 2, the character-only model (LSTM+C) performs poorly, with a BLEU score 0.68 percentage points lower than the word-only model (LSTM+W) and 1.24 percentage points lower than the phrase-tree-only model (Tree-LSTM). The likely reason is that although using characters alone reduces data sparsity to some extent, it greatly increases sentence length and thus the difficulty of learning long-distance dependencies. A model operating only on characters is therefore less competitive, which is also why, in the shallow semantic feature fusion stage, the invention does not replace word embeddings entirely with character embeddings but combines the two, allowing the model to fully exploit information at both granularity levels.
On this basis, to visually observe and compare the translation effects of the different models, a translation quality comparison is carried out on the outputs of four models fusing different levels of language feature knowledge. The results of the experiment are shown in Table 3.
TABLE 3 Comparison of translation quality [the table is an image in the original publication and its contents are not reproduced here]
As can be seen from Table 3, the translation quality of the model of the invention (Tree-LSTM+C+W) is higher: it translates the word "swimming" correctly, whereas the other models translate it inaccurately (the Vietnamese forms appear as images in the original publication and are not reproduced here). For the source "No matter how harsh the environment is", the reference translation is a Vietnamese sentence glossed roughly as "(no matter) (environment) (harsh) (how)", while the translation of the (LSTM+C+W) model is glossed roughly as "(no matter) (environment) (how) (harsh)", a clear syntactic error caused by the pure sequence model's lack of phrase syntax information. In the model of the invention, fusing character features yields high-quality word vector representations while the phrase tree preserves the syntactic information of the text; by fully exploiting multi-level language feature knowledge, the performance of Chinese-Vietnamese neural machine translation is improved.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the invention is not limited to these embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the invention.

Claims (4)

1. A Chinese-Vietnamese neural machine translation method fusing multi-level language feature knowledge, characterized by comprising the following specific steps:
step1, corpus collection and preprocessing: collecting Chinese and Vietnamese parallel data, and preprocessing each side with preprocessing tools suited to the characteristics of Chinese and Vietnamese respectively;
step2, on the basis of Step1, obtaining vectors of the characters within a word using a bidirectional LSTM, and combining the word vector computed from character training with the pre-trained word vector to obtain a word vector fusing character features;
step3, deep semantic feature fusion: in head-driven phrase structure grammar, a sentence is composed of several phrase units and is represented as a binary tree; according to this sentence structure, a phrase-tree-based encoder is constructed on top of the standard sequence encoder, further merging phrase feature knowledge into the word-based standard sequence encoder;
this completes the realization of Chinese-Vietnamese neural machine translation fusing language feature knowledge at different levels.
2. The Chinese-Vietnamese neural machine translation method fusing multi-level language feature knowledge of claim 1, wherein the specific steps of Step1 are as follows:
step1.1, obtaining 140K Chinese-Vietnamese parallel sentence pairs through web crawling and manual collection, of which 2K parallel sentence pairs are used as the test set and 2K parallel sentence pairs as the validation set;
step1.2, the Chinese data are segmented with the Jieba word segmenter, and phrase syntax parsing is performed with Stanford University's StanfordNLP toolkit; the Vietnamese data are parsed with a Vietnamese phrase syntax parsing tool to obtain Vietnamese phrase trees.
3. The Chinese-Vietnamese neural machine translation method fusing multi-level language feature knowledge of claim 1, wherein the specific steps of Step2 are as follows:
step2.1, in neural machine translation, natural language must be characterized as feature vectors to serve as model input, and the semantic vector representation of a word is computed from the information of the characters within the word;
step2.2, combining the word vector computed from character training with the pre-trained word vector by a weighted combination to obtain the optimal representation of a semantic unit;
step2.3, common words already have high-quality word vector representations, so the character representation is aligned with the word vector by optimizing the vector, and the word vector fusing character features is finally obtained through training.
4. The Chinese-Vietnamese neural machine translation method fusing multi-level language feature knowledge of claim 1, wherein the specific steps of Step3 are as follows:
step3.1, in head-driven phrase structure grammar, a sentence is composed of several phrases and is represented as a binary tree, where each node in the binary tree is represented by an LSTM unit and a sentence vector is constructed from phrase vectors in a bottom-up manner;
step3.2, when computing the LSTM units of the leaf nodes, the model is allowed to compute different representations for the same word when it occurs multiple times in a sentence; the model then has two different sentence vectors, one from the sequence encoder and the other from the phrase-tree-based encoder, so another Tree-LSTM unit is provided that takes the final sequence encoder unit and the phrase-tree-based encoder unit as its two child units and is used to initialize the decoder unit;
step3.3, an attention mechanism is introduced into the phrase tree-sequence model so that the model attends not only to the sequence hidden units but also to the phrase hidden units; when decoding a target word the model can learn which words or phrases in the source sentence are important, further merging phrase feature knowledge on top of the word-based encoder.
CN202011409192.3A 2020-12-03 2020-12-03 Chinese-Vietnamese neural machine translation method fusing multi-level language feature knowledge Pending CN112541364A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011409192.3A 2020-12-03 2020-12-03 Chinese-Vietnamese neural machine translation method fusing multi-level language feature knowledge

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011409192.3A 2020-12-03 2020-12-03 Chinese-Vietnamese neural machine translation method fusing multi-level language feature knowledge

Publications (1)

Publication Number Publication Date
CN112541364A true CN112541364A (en) 2021-03-23

Family

ID=75016025

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011409192.3A (Pending) 2020-12-03 2020-12-03 Chinese-Vietnamese neural machine translation method fusing multi-level language feature knowledge

Country Status (1)

Country Link
CN (1) CN112541364A (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108363695A (en) * 2018-02-23 2018-08-03 西南交通大学 A kind of user comment attribute extraction method based on bidirectional dependency syntax tree characterization
CN108628823A (en) * 2018-03-14 2018-10-09 中山大学 In conjunction with the name entity recognition method of attention mechanism and multitask coordinated training
CN108920565A (en) * 2018-06-21 2018-11-30 苏州大学 A kind of picture header generation method, device and computer readable storage medium
CN109299262A (en) * 2018-10-09 2019-02-01 中山大学 A kind of text implication relation recognition methods for merging more granular informations
CN109858032A (en) * 2019-02-14 2019-06-07 程淑玉 Merge more granularity sentences interaction natural language inference model of Attention mechanism
CN109933795A (en) * 2019-03-19 2019-06-25 上海交通大学 Based on context-emotion term vector text emotion analysis system
CN110362723A (en) * 2019-05-31 2019-10-22 平安国际智慧城市科技股份有限公司 A kind of topic character representation method, apparatus and storage medium
CN110377918A (en) * 2019-07-15 2019-10-25 昆明理工大学 Merge the more neural machine translation method of the Chinese-of syntax analytic tree
CN110442880A (en) * 2019-08-06 2019-11-12 上海海事大学 A kind of interpretation method, device and the storage medium of machine translation translation
CN110728155A (en) * 2019-09-27 2020-01-24 内蒙古工业大学 Tree-to-sequence-based Mongolian Chinese machine translation method
CN110825845A (en) * 2019-10-23 2020-02-21 中南大学 Hierarchical text classification method based on character and self-attention mechanism and Chinese text classification method
CN111222318A (en) * 2019-11-19 2020-06-02 陈一飞 Trigger word recognition method based on two-channel bidirectional LSTM-CRF network
CN110879940A (en) * 2019-11-21 2020-03-13 哈尔滨理工大学 Machine translation method and system based on deep neural network
CN111353306A (en) * 2020-02-22 2020-06-30 杭州电子科技大学 Entity relationship and dependency Tree-LSTM-based combined event extraction method

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
AKIKO ERIGUCHI et al.: "Tree-to-Sequence Attentional Neural Machine Translation", arXiv:1603.06075v3 *
CHAO SU et al.: "Neural machine translation with Gumbel Tree-LSTM based encoder", Journal of Visual Communication and Image Representation *
MAHTAB AHMED et al.: "Improving Tree-LSTM with Tree Attention", arXiv:1901.00066v1 *
WANJIN CHE et al.: "Towards Integrated Classification Lexicon for Handling Unknown Words in Chinese-Vietnamese Neural Machine Translation", ACM *
CHANG BAOBAO: "Advances in graph-decoding dependency parsing based on deep learning", Journal of Shanxi University (Natural Science Edition) *
KANG SHIZE et al.: "Ontology alignment method based on word vectors and concept context information", Journal of Information Engineering University *
PU LIUQING et al.: "Chinese-Vietnamese neural machine translation method based on dependency graph networks", Journal of Chinese Information Processing *
WANG ZHENHAN et al.: "Chinese-Vietnamese convolutional neural machine translation fusing syntactic parse trees", Journal of Software *
ZHAO YAOU et al.: "Sentiment analysis fusing language-model-based word embeddings and multi-scale convolutional neural networks", Journal of Computer Applications *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065358A (en) * 2021-04-07 2021-07-02 齐鲁工业大学 Text-to-semantic matching method based on multi-granularity alignment for bank consultation service
CN113065358B (en) * 2021-04-07 2022-05-24 齐鲁工业大学 Text-to-semantic matching method based on multi-granularity alignment for bank consultation service
CN113392656A (en) * 2021-06-18 2021-09-14 电子科技大学 Neural machine translation method fusing push-and-knock network and character coding
CN113609849A (en) * 2021-07-07 2021-11-05 内蒙古工业大学 Mongolian multi-mode fine-grained emotion analysis method fused with priori knowledge model
CN113901840A (en) * 2021-09-15 2022-01-07 昆明理工大学 Text generation evaluation method based on multi-granularity features
CN113901840B (en) * 2021-09-15 2024-04-19 昆明理工大学 Text generation evaluation method based on multi-granularity characteristics

Similar Documents

Publication Publication Date Title
CN110717334B (en) Text emotion analysis method based on BERT model and double-channel attention
CN108460013B (en) Sequence labeling model and method based on fine-grained word representation model
CN108829801B (en) Event trigger word extraction method based on document level attention mechanism
CN111241294B (en) Relationship extraction method of graph convolution network based on dependency analysis and keywords
CN112541364A (en) Chinese-Vietnamese neural machine translation method fusing multi-level language feature knowledge
CN109086267B (en) Chinese word segmentation method based on deep learning
CN107578106B (en) Neural network natural language reasoning method fusing word semantic knowledge
CN109359294B (en) Ancient Chinese translation method based on neural machine translation
CN109657239A (en) The Chinese name entity recognition method learnt based on attention mechanism and language model
CN108846017A (en) The end-to-end classification method of extensive newsletter archive based on Bi-GRU and word vector
CN110502753A (en) A kind of deep learning sentiment analysis model and its analysis method based on semantically enhancement
CN111160467A (en) Image description method based on conditional random field and internal semantic attention
CN111078866B (en) Chinese text abstract generation method based on sequence-to-sequence model
CN106502985A (en) A kind of neural network modeling approach and device for generating title
CN111125333B (en) Generation type knowledge question-answering method based on expression learning and multi-layer covering mechanism
CN115510814B (en) Chapter-level complex problem generation method based on dual planning
CN113254604B (en) Reference specification-based professional text generation method and device
CN111353040A (en) GRU-based attribute level emotion analysis method
CN110083824A (en) A kind of Laotian segmenting method based on Multi-Model Combination neural network
Zhang et al. A BERT fine-tuning model for targeted sentiment analysis of Chinese online course reviews
CN113822054A (en) Chinese grammar error correction method and device based on data enhancement
Ansari et al. Language Identification of Hindi-English tweets using code-mixed BERT
CN114564953A (en) Emotion target extraction model based on multiple word embedding fusion and attention mechanism
CN108763198B (en) Automatic generation method for related work in generative academic paper
CN112507717A (en) Medical field entity classification method fusing entity keyword features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20210323