CN109492232A - Transformer-based Mongolian-Chinese machine translation method with enhanced semantic feature information - Google Patents

Transformer-based Mongolian-Chinese machine translation method with enhanced semantic feature information Download PDF

Info

Publication number
CN109492232A
Authority
CN
China
Prior art keywords
sublayer
similarity
semantic
indicate
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811231017.2A
Other languages
Chinese (zh)
Inventor
苏依拉
张振
高芬
王宇飞
孙晓骞
牛向华
赵亚平
卞乐乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University of Technology
Original Assignee
Inner Mongolia University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University of Technology filed Critical Inner Mongolia University of Technology
Priority to CN201811231017.2A priority Critical patent/CN109492232A/en
Publication of CN109492232A publication Critical patent/CN109492232A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

This invention presents a Mongolian-Chinese machine translation method with enhanced semantic feature information based on the Transformer model. First, starting from the linguistic features of Mongolian, the method identifies the characteristics of stems, affixes and case-marking components, and incorporates these linguistic features into the training of the model. Second, against the background of distributed word representations used to measure the degree of similarity between two words, it comprehensively analyzes the influence of depth, density and semantic overlap on concept semantic similarity. During translation the invention uses a Transformer model, a multi-layer encoder-decoder architecture built with trigonometric-function positional encoding and an enhanced multi-head attention mechanism, so that it relies entirely on attention to draw global dependencies between input and output, eliminating recurrence and convolution.

Description

Transformer-based Mongolian-Chinese machine translation method with enhanced semantic feature information
Technical field
The invention belongs to the field of machine translation technology, and in particular to a Transformer-based Mongolian-Chinese machine translation method with enhanced semantic feature information.
Background art
Mongolian is an agglutinative language belonging to the Altaic family. Written Mongolian includes traditional Mongolian and Cyrillic Mongolian; the "Mongolian" in the Mongolian-Chinese translation system studied here refers to translation from traditional Mongolian into Chinese. Traditional Mongolian is an alphabetic script whose letter forms are not unique: the form of a letter depends on its position in the word, which may be isolated, word-initial, word-medial or word-final. Mongolian words are formed as root + suffix, and suffixes fall into two classes: derivational suffixes, which attach to the root and give the word a new meaning (a root followed by one or more derivational suffixes forms a stem), and inflectional suffixes, which attach to the stem and express grammatical meaning. Mongolian nouns and verbs undergo many changes of tense, number and case, all realized by attaching suffixes, so Mongolian morphology is extremely complex. In addition, Mongolian word order differs greatly from Chinese: the Mongolian verb follows the subject and object and stands at the end of the sentence, whereas in Chinese the verb stands between subject and object.
Compared with one-hot representations, which differ in only a single vector dimension, the distributed representation of a word uses a low-dimensional dense real-valued vector. In this low-dimensional vector space the similarity between two words can conveniently be measured by distance or angle. On the technical side, against the background of statistical language modeling, Google released Word2vec in 2013, a software tool for training word vectors. Given a corpus, Word2vec can quickly and effectively express a word in vector form through an optimized training model, providing a new tool for applied research in natural language processing. Word2vec relies on skip-gram or continuous bag-of-words (CBOW) models to build neural word embeddings. However, Word2vec still has limitations when computing semantic relatedness. On the one hand it uses only the local context of the translation to be generated as the basis for prediction, without exploiting global context, so contextual information is used insufficiently and there is room for improvement in semantic feature extraction. On the other hand, the structure of the framework itself limits parallel computation, so computational efficiency also needs to be improved.
Traditional machine translation systems are mostly based on recurrent neural networks (RNN), long short-term memory (LSTM) or gated recurrent units (GRU). In the past few years these methods have become the state of the art for sequence modeling and transduction problems such as machine translation. However, recurrent models compute along the symbol positions of the input and output sequences. Aligning positions with steps in computation time, they generate a sequence of hidden states h_t at position t as a function of the previous hidden state h_{t-1}. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths because memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks and conditional computation, while also improving model performance in the latter case. The fundamental constraint of sequential computation, however, remains.
The current encoder-decoder framework is the main model for solving sequence-to-sequence problems. The model uses an encoder to compress the source-language sentence into a representation, and a decoder to generate the target-language sentence from that compressed representation. The benefit of this structure is that it models the mapping between the two sentences end to end: all parameters of the model are trained under a single objective function, and model performance is good. Fig. 1 illustrates the structure of the encoder-decoder model as a bottom-up machine translation process.
The encoder and decoder can use neural networks of different structures, such as RNNs or CNNs. An RNN compresses the sequence step by step along time. When an RNN is used, a bidirectional RNN structure is generally adopted: one RNN compresses the sequence from left to right, another from right to left, and the two representations are concatenated as the final distributed representation of the sequence. In this structure, because the elements of the sequence are processed in order, the interaction distance between two words can be regarded as their relative distance. As sentences grow longer and relative distances increase, there is a clear theoretical limit on how well information can be processed.
When a CNN structure is used, a multi-layer structure is generally adopted to move from local to global representations of the sequence. Modeling a sentence with an RNN takes a temporal view; modeling it with a CNN takes a structural view. Sequence-to-sequence models built on RNNs include RNNSearch and GNMT; sequence-to-sequence models built on CNNs include ConvS2S, which embodies a local-to-global feature extraction process in which the interaction distance between words is proportional to their distance in the sentence: words far apart can only meet at higher CNN nodes before interacting, and this process can lose more information.
Summary of the invention
To overcome the above shortcomings of the prior art, the purpose of the present invention is to provide a Transformer-based Mongolian-Chinese machine translation method with enhanced semantic feature information. The system is based entirely on the attention mechanism and completely eliminates recurrence and convolution. Experiments show that the system is superior in quality and easier to parallelize, requires less time to train, and reaches 45.4 BLEU on a translation task over a 1.2-million-sentence Mongolian-Chinese parallel corpus, achieving higher translation quality.
To achieve the above goal, the technical solution adopted by the present invention is: a Transformer-based Mongolian-Chinese machine translation method with enhanced semantic feature information, characterized in that a Transformer model is used during translation. The Transformer model is a multi-layer encoder-decoder architecture built with trigonometric-function positional encoding and an enhanced multi-head attention mechanism, so that it relies entirely on attention to draw global dependencies between input and output, eliminating recurrence and convolution.
Before translation, the data is first preprocessed so that the deep neural network can better extract features. Preprocessing means segmenting the stems, affixes and case-marking components in the Mongolian corpus to reduce data sparsity, while the Chinese side is segmented into characters; the linguistic features of Mongolian stems, affixes and case-marking components are identified and incorporated into training.
The segmentation includes fine-grained affix segmentation, coarse-grained stem segmentation, and small-scale segmentation of case-marking components.
After the data has been preprocessed, the influence of depth, density and semantic overlap on concept semantic similarity is combined with similarity algorithms based on semantic distance and information content to build a similarity matrix. Principal component analysis is then applied: the similarity matrix is transformed into a principal-component matrix, the contribution rate of each principal component is computed and used as a weight, and the weighted result gives the final concept semantic similarity.
The similarity matrix is expressed as
X_sim = (x_i1, x_i2, x_i3, x_i4, x_i5)^T, i = 1, 2, 3, ..., n
and the final concept semantic similarity is computed as
δ_sim = r_1·y_sim1 + r_2·y_sim2 + r_3·y_sim3 + r_4·y_sim4 + r_5·y_sim5
where X_sim denotes the similarity matrix; x_i1 denotes Ds_i, x_i2 denotes Ks_i, x_i3 denotes Zs_i, x_i4 denotes Ss_i, x_i5 denotes Is_i; n is the number of concept-word pairs in the comparison set; x_i = (Ds_i, Ks_i, Zs_i, Ss_i, Is_i) is a vector in the principal-component input sample set, where each dimension represents the result of one partial similarity computation in the combined similarity module: Ds_i denotes the relationship between the semantic distance and similarity of the i-th element, Ks_i the semantic similarity in terms of depth, Zs_i the density impact factor of the concept word c, Ss_i the similarity in terms of semantic overlap, and Is_i the similarity in terms of information content; δ_sim denotes the concept semantic similarity; y_sim1, y_sim2, y_sim3, y_sim4, y_sim5 are the principal components extracted by principal component analysis of the similarity matrix X_sim, and r_1, r_2, r_3, r_4, r_5 are the contribution rates of the respective principal components.
The multi-head attention mechanism is described as mapping a query and a set of key-value pairs to an output, where the query, keys, values and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
The encoder consists of N identical layers, each with two sublayers: the first is a multi-head attention sublayer and the second is a feed-forward sublayer. The input and output of each sublayer are connected by a residual connection, and each sublayer is followed by a layer-normalization step to speed up model convergence;
The decoder consists of N identical layers, each with three sublayers. The first is a masked multi-head attention sublayer for modeling the target-side sentence generated so far; during training a mask matrix ensures that each multi-head attention computation only attends to the first t-1 words. The second is a multi-head attention sublayer over the encoder output, i.e. the attention mechanism between encoder and decoder, which looks up the relevant semantic information in the source language. The third is a feed-forward sublayer identical to the one in the encoder. The input and output of each sublayer are connected by a residual connection and followed by a layer-normalization step to speed up model convergence.
The multi-layer encoder-decoder architecture is constructed as follows:
In the encoder, the output of each sublayer is LayerNorm(x + Sublayer(x)), where LayerNorm() denotes the layer-normalization function, Sublayer() is the function implemented by the residually connected sublayer itself based on multi-head attention, and x denotes the vector input to the current layer. The Mongolian sentence is converted into vectors with word2vec and used as input to the first encoder layer. To facilitate the residual connections, all sublayers and the embedding layer produce outputs of dimension d_model = 512.
The feed-forward sublayer of the encoder is implemented with two linear transformations and one ReLU non-linear activation, computed as follows:
FFN(x) = γ(0, xW_1 + b_1)W_2 + b_2
where x denotes the encoder input information, W_1 the weight of the input vector, b_1 the bias factor of the multi-head attention mechanism, γ(0, xW_1 + b_1) the input of the feed-forward sublayer after the ReLU activation, W_2 the weight of that vector, b_2 the bias factor of the feed-forward function, and γ the non-linear activation function of the encoder layer.
The positional encoding uses trigonometric functions, taking the absolute position as the variable of the trigonometric function:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where pos is the position and i the dimension, i.e. each dimension of the positional encoding corresponds to a sinusoid, with wavelengths forming a geometric progression from 2π to 10000·2π; d_model is the dimension of the embedding layer after positional encoding, and 2i ranges from a minimum of 0 to a maximum of d_model.
Compared with the prior art, the advantages of the present invention are as follows:
1. The present invention uses a Transformer-based sequence modeling method. The sequence-to-sequence model still follows the classical encoder-decoder structure, but instead of RNNs or CNNs it uses multi-head attention as the sequence modeling mechanism, so that long-distance dependency information is captured more easily.
2. The present invention segments the stems, affixes and case-marking components in the Mongolian corpus. Case-marking components are a special kind of affix in Mongolian; unlike ordinary affixes, they express only grammatical meaning and carry no semantic content. Segmenting the case-marking components in the corpus reduces data sparsity on the one hand, and better preserves Mongolian stem information on the other.
3. For the severe data sparsity caused by Mongolian word formation, the present invention proposes three segmentation schemes of different granularity: fine-grained affix segmentation, coarse-grained stem segmentation, and small-scale segmentation of case-marking components. Experiments show that combining stem segmentation with case-component segmentation improves translation quality the most.
4. Against the background of distributed representations used to measure the similarity between two words, the present invention analyzes the influence of depth, density and semantic overlap on concept semantic similarity, integrates the traditional semantic-distance and information-content similarity algorithms to build a similarity matrix, applies principal component analysis to transform the original similarity matrix into a new principal-component matrix, computes the principal component contribution rates, uses them as weights, and obtains the final concept semantic similarity.
Description of the drawings
Fig. 1 is the Transformer-based Mongolian-Chinese machine translation framework of the present invention.
Fig. 2 is the model diagram of sequence modeling based on multi-head attention of the present invention.
Fig. 3 is the "soft" attention model diagram of the present invention.
Fig. 4 is the multi-head attention model diagram of the present invention.
Fig. 5 is the morpheme segmentation flowchart of the present invention.
Fig. 6 is the computation model of the weights in the multi-head attention mechanism of the present invention.
Fig. 7 is a schematic diagram of sequence modeling with a bidirectional RNN in the present invention.
Fig. 8 is a schematic diagram of sequence modeling with a multi-layer CNN in the present invention.
Fig. 9 is the similarity distribution of 65 randomly selected word pairs under the distributed algorithm for combined concept semantic similarity of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below with reference to the accompanying drawings and examples.
In the Transformer-based Mongolian-Chinese machine translation method of the present invention, the Mongolian corpus is first preprocessed. Then, against the background of the word-vector models of word2vec, the influence of depth, density and semantic overlap on concept semantic similarity is combined with similarity algorithms based on semantic distance and information content to build a similarity matrix; principal component analysis is applied, the similarity matrix is transformed into a principal-component matrix, the contribution rates are computed and used as weights, and the final concept semantic similarity is obtained. Finally, a Transformer model is used during translation, relying entirely on attention to draw global dependencies between input and output and eliminating recurrence and convolution, where the Transformer model is a multi-layer encoder-decoder architecture built with trigonometric-function positional encoding and an enhanced multi-head attention mechanism.
Mongolian corpus preprocessing: dictionary-based morpheme segmentation. To perform segmentation, a dictionary of the Mongolian corpus is first generated with the word-frequency statistics tool OpenNMT.dict. After the dictionary is generated, the stems in it are collected into a stem table; the parts other than the stem table form the affix table. Based on the stem table and affix table, a reverse maximum matching algorithm segments each Mongolian word into morphemes; the segmentation procedure is shown in Fig. 5. For each Mongolian word to be processed, all dictionary records are matched one by one; if the word contains a record, it is segmented so that the case-marking component is detached, and the word is finally separated into two parts: the case-marking component as one part and the remainder as the other.
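The reverse maximum matching step can be sketched as follows. This is a minimal illustration rather than the patented implementation: the dictionary contents, the Latin-transliterated word form and the maximum window length are made-up placeholders.

```python
# A minimal sketch of reverse maximum matching over a stem/affix dictionary;
# the vocabulary entries and the example word are hypothetical placeholders.
def reverse_max_match(word, vocab, max_len=8):
    """Segment `word` from right to left, always taking the longest
    suffix substring that appears in `vocab`."""
    pieces = []
    end = len(word)
    while end > 0:
        start = max(0, end - max_len)
        # shrink the window from the left until a dictionary entry matches
        while start < end - 1 and word[start:end] not in vocab:
            start += 1
        pieces.append(word[start:end])
        end = start
    return list(reversed(pieces))

# hypothetical dictionary built from the corpus (stems + affixes + case markers)
vocab = {"yabu", "na", "bar", "ogei"}
print(reverse_max_match("yabuna", vocab))   # -> ['yabu', 'na']
```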
After the Mongolian-Chinese bilingual corpus is processed with a unified encoding, a bilingual dictionary is built on this basis. The modeling of the Transformer model comprises building the multi-layer encoder-decoder structure, positional encoding with trigonometric functions, and model construction based on the enhanced multi-head attention mechanism; the training optimization method and regularization strategy of the model are also improved.
For the information-content-based algorithm, the present invention analyzes the distributed representations of words and finds that the more sub-concepts a concept contains, the less information content it carries, and gives a model for computing the information content I of a distributed word representation:
where h(c) denotes the number of all sub-concept nodes of the concept-word node c, and max_wn is a constant denoting the total number of concept nodes in the semantic classification tree.
The present invention proposes an aggregated weighting method based on principal component analysis: principal component analysis is introduced into the weight computation, and the contribution rate of each principal component is used as the weight in the aggregated weighted similarity computation. This alleviates the curse of dimensionality and the gradient explosion problem and helps the model converge quickly.
The aggregated weighting algorithm based on principal component analysis in the present invention consists of three parts: multi-angle similarity computation, similarity matrix extraction, and weight computation based on principal component analysis.
Part 1: multi-angle similarity computation
Semantic similarity is analyzed from the perspectives of semantic distance, depth and density, semantic overlap, and information content, and the calculation formula of each partial semantic similarity is given.
(1) Semantic distance
The relationship between semantic distance and similarity is expressed as
where c_1 and c_2 are the distributed vectors of the two concepts being compared; a is an adjustable parameter, here taken as the average semantic distance of the concept pairs in the comparison set; and D(c_1, c_2) is the semantic distance between the two concepts, i.e. the shortest path between c_1 and c_2.
(2) Depth and density
The higher the level of a node in the semantic tree, the more abstract the concept word it represents; the lower the level, the more concrete the concept word. If the maximum depths of the semantic trees containing the compared concept words c_1 and c_2 are K_max(c_1) and K_max(c_2), and the node depths of c_1 and c_2 are K(c_1) and K(c_2), then the semantic similarity in terms of depth is computed as
In a semantic hierarchy tree, the denser a local region, the more finely that region divides the concepts, and the larger the semantic similarity between the concept words in the region. The density impact factor of the concept word c is
where n(c) is the number of direct descendants of the concept-word node c, and n(O) is the maximum number of direct descendants over all nodes of the sub-semantic-tree O containing c. Based on this, the formula for the semantic similarity of the compared concept words c_1 and c_2 in terms of density is obtained.
(3) Semantic overlap
Let the root node of the semantic hierarchy tree be R, and let c_1 and c_2 be any two concept-word nodes. S(c_1) is the number of nodes on the path from c_1 to the root node R, and S(c_2) is the number of nodes on the path from c_2 to R. S(c_1) ∩ S(c_2) denotes the set of nodes passed through jointly by c_1 and c_2 on the way to R (the intersection), and S(c_1) ∪ S(c_2) denotes the union of the node set from c_1 to R and the node set from c_2 to R. The similarity in terms of semantic overlap is then expressed as
(4) Information content
To define the similarity in terms of information content, the following algorithm is proposed to compute the I value. The calculation formula is
where c_1 and c_2 denote the distributed vectors of the two concepts being compared, I(c_1) denotes the sum of the vector dimensions of all child nodes with the concept vector c_1 as parent node, and I(c_2) denotes the sum of the vector dimensions of all child nodes with the concept vector c_2 as parent node.
Part 2: similarity matrix extraction
Suppose there are n pairs of concept words in the comparison set, and let x_i = (Ds_i, Ks_i, Zs_i, Ss_i, Is_i) be a vector in the principal-component input sample set, where each dimension represents the result of one partial similarity computation in the combined similarity module: Ds_i denotes the relationship between the semantic distance and similarity of the i-th element, Ks_i the semantic similarity in terms of depth, Zs_i the density impact factor of the concept word c, Ss_i the similarity in terms of semantic overlap, and Is_i the similarity in terms of information content.
The similarity matrix is then expressed as
X_sim = (x_i1, x_i2, x_i3, x_i4, x_i5)^T, i = 1, 2, 3, ..., n
Part 3: weight computation based on principal component analysis
Principal component analysis is a multivariate statistical method that converts multiple indicators into a few comprehensive indicators while losing little information. The comprehensive indicators are called principal components; each principal component is a linear combination of the original variables, and the principal components are mutually uncorrelated, which gives them certain advantages over the original variables. In principal component analysis the weight of each principal component is assigned according to its contribution rate rather than determined subjectively, which overcomes the defect of subjectively assigned weights in multivariate analysis and makes the result objective and reasonable.
Principal component analysis of the constructed similarity matrix X_sim extracts the principal components
Y = (y_sim1, y_sim2, y_sim3, y_sim4, y_sim5)
With the contribution rates of the principal components (r_1, r_2, r_3, r_4, r_5), the final concept semantic similarity is computed as
δ_sim = r_1·y_sim1 + r_2·y_sim2 + r_3·y_sim3 + r_4·y_sim4 + r_5·y_sim5
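The PCA-based aggregated weighting can be sketched in a few lines of numpy. This is a minimal sketch under assumptions: the five per-aspect similarity scores (Ds, Ks, Zs, Ss, Is) are taken as already computed, and the numbers below are invented stand-ins, not values from the patent's experiments.

```python
import numpy as np

# rows = concept pairs, columns = (Ds, Ks, Zs, Ss, Is); values are made up
X = np.array([
    [0.82, 0.75, 0.60, 0.70, 0.66],
    [0.40, 0.55, 0.35, 0.45, 0.50],
    [0.91, 0.80, 0.72, 0.85, 0.78],
    [0.55, 0.60, 0.48, 0.52, 0.57],
])

Xc = X - X.mean(axis=0)                      # center each column
cov = np.cov(Xc, rowvar=False)               # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # eigendecomposition (ascending)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Y = Xc @ eigvecs                             # principal-component scores y_sim1..y_sim5
r = eigvals / eigvals.sum()                  # contribution rates r1..r5 used as weights

delta_sim = Y @ r                            # weighted combination per concept pair
print(delta_sim)
```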
The multi-layer encoder-decoder structure is constructed as follows:
In the encoder, the output of each sublayer is LayerNorm(x + Sublayer(x)), where LayerNorm() denotes the layer-normalization function, Sublayer() is the function implemented by the residually connected sublayer itself based on multi-head attention, and x denotes the vector input to the current layer. The Mongolian sentence is converted into vectors with word2vec and used as input to the first encoder layer. To facilitate the residual connections, all sublayers and the embedding layer produce outputs of dimension d_model = 512.
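A minimal numpy sketch of the LayerNorm(x + Sublayer(x)) wrapper follows; the sublayer itself is stubbed out, and only the dimension d_model = 512 is taken from the description.

```python
import numpy as np

d_model = 512

def layer_norm(x, eps=1e-6):
    # normalize each position's vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_block(x, sublayer):
    # residual connection followed by layer normalization
    return layer_norm(x + sublayer(x))

x = np.random.randn(10, d_model)             # 10 positions of a sentence
out = residual_block(x, lambda h: 0.1 * h)   # stand-in for an attention / FFN sublayer
print(out.shape)                             # (10, 512)
```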
Fig. 1 illustrates one encoder layer and one decoder layer of the Transformer structure.
With reference to Fig. 1, the Nx on the left represents one encoder layer, which contains two sublayers: the first is a multi-head attention sublayer and the second a feed-forward sublayer. The input and output of each sublayer are connected by a residual connection, which in theory allows gradients to flow back well. Each sublayer is followed by a layer-normalization step, whose use speeds up the convergence of the model. The computation of the multi-head attention sublayer is discussed in detail in the model construction for the enhanced multi-head attention mechanism. The feed-forward sublayer is implemented with two linear transformations and one ReLU non-linear activation, computed as follows:
FFN(x) = γ(0, xW_1 + b_1)W_2 + b_2
where x denotes the encoder input information, W_1 the weight of the input vector, b_1 the bias factor of the multi-head attention mechanism, γ(0, xW_1 + b_1) the input of the feed-forward sublayer after the ReLU activation, W_2 the weight of that vector, b_2 the bias factor of the feed-forward function, and γ the non-linear activation function of the encoder layer. The encoder input information is the vector obtained after adding the positional encoding to the embedding-layer information; the input of the feed-forward sublayer is the output of the first encoder sublayer.
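The feed-forward sublayer can be sketched as follows. The inner dimension d_ff = 2048 is the value used in the original Transformer and is an assumption here, since the text does not state it; the weights are random stand-ins.

```python
import numpy as np

# Position-wise feed-forward sublayer: FFN(x) = ReLU(x W1 + b1) W2 + b2
d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)

def ffn(x):
    hidden = np.maximum(0.0, x @ W1 + b1)    # gamma(0, xW1 + b1): the ReLU activation
    return hidden @ W2 + b2

x = np.random.randn(10, d_model)             # 10 positions, each a d_model vector
print(ffn(x).shape)                          # (10, 512)
```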
With reference to Fig. 1, the Nx on the right represents one decoder layer, which contains three sublayers. The first is a masked multi-head attention sublayer for modeling the target-side sentence generated so far; during training a mask matrix is needed so that each multi-head attention computation only attends to the first t-1 words. The second is a multi-head attention sublayer, the attention mechanism between encoder and decoder, which looks up the relevant semantic information in the source language; this part is computed in the same way as the attention of other sequence-to-sequence models, and Transformer uses the dot-product form. The third is a feed-forward sublayer, identical to the feed-forward sublayer in the encoder. Each sublayer again has residual connections and layer-normalization operations to speed up model convergence.
The positional encoding method of the present invention using trigonometric functions is as follows:
The way multi-head attention models a sequence has neither the sequential character of an RNN nor the structural character of a CNN, but rather the character of a bag of words. Put another way, the mechanism treats a sequence as a flat structure: no matter how far apart two words are, their distance in multi-head attention is 1. Such a modeling approach actually loses the relative-distance relationships between words. For example, three sentences such as "the ox ate the grass", "the grass ate the ox" and "ate the ox the grass", which permute the same words, would be modeled with identical representations for each word.
To alleviate this problem, in Transformer the present invention maps the position of each word in the sentence to a vector and adds it to the embedding layer. This idea is not new: CNN models also have difficulty modeling relative position (temporal information), and Facebook proposed a positional encoding method for them. One direct approach is to model absolute position directly in the embedding layer, i.e. to map the position i of word W_i to a vector and add it to the embedding layer, but the drawback of this approach is that it can only model sequences of finite length.
The present invention uses a new way of modeling temporal information: the periodicity of trigonometric functions is used to model the relative positional relationships between words. Specifically, the absolute position is used as the variable of the trigonometric functions:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where pos is the position and i the dimension; that is, each dimension of the positional encoding corresponds to a sinusoid, with wavelengths forming a geometric progression from 2π to 10000·2π. The present invention chose this function because it allows the model to learn relative positions easily: for any fixed offset k, PE_(pos+k) can be expressed as a linear function of PE_pos. d_model is the dimension of the embedding layer after positional encoding, and 2i ranges from a minimum of 0 to a maximum of d_model.
Trigonometric functions have good periodicity: at fixed intervals the value of the dependent variable repeats, a property that can be used to model relative distance. On the other hand, the range of trigonometric functions is [-1, 1], which provides suitable values for the embedding-layer elements.
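A minimal sketch of the sinusoidal positional encoding described above, assuming the standard sin/cos form with wavelengths from 2π to 10000·2π:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # even dimension 2i uses sin(pos / 10000^(2i/d_model)); odd dimension 2i+1 uses cos
    pos = np.arange(max_len)[:, None]                       # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]                   # (1, d_model/2)
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)                                             # (50, 512)
# the encoding is added to the embedding layer before the first encoder layer
```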
The model construction method based on the enhanced multi-head attention mechanism is as follows:
Fig. 2 illustrates the sequence modeling method based on multi-head attention. Note that, for clarity, some connecting lines are omitted in the figure: every word in the "source-language sentence vector" layer (i.e. the source-language morpheme vectors) is fully connected to every node in the first multi-head attention layer, and the nodes between the first and second multi-head attention layers are also fully connected. It can be seen that in this modeling approach the interaction distance between any two words is 1, regardless of their relative distance. In this mode the semantics of each word is determined by considering its relationship with all words in the whole sentence. Multi-head attention makes this global interaction more complex and able to capture more information.
In summary, multi-head attention can capture long-distance dependency knowledge when modeling sequence problems and has a better theoretical basis.
The mathematical formulation of multi-head attention is described below, starting from the attention mechanism itself.
1. Attention mechanism (model)
When a neural network processes a large amount of input information, it can borrow the attention mechanism of the human brain and select only some key information for processing, improving the efficiency of the neural network. In current neural network models, max pooling and gating mechanisms can be regarded approximately as bottom-up, saliency-based attention. In addition, top-down convergent attention is also an effective way of selecting information. Take reading comprehension as an example: given a long article, a question is asked about its content. The question may be related to only one or two sentences in a paragraph, and the rest is irrelevant. To reduce the computational burden of the neural network, only the relevant passages need to be picked out for subsequent processing, without feeding the whole article into the network.
Let x_{1:N} = [x_1, ..., x_N] denote N pieces of input information. To save computing resources, not all N inputs need to be fed into the neural network; only the information relevant to the task need be selected from x_{1:N}. Given a task-related query vector q, the attention variable z ∈ [1, N] denotes the index of the selected information, i.e. z = i means the i-th input is selected. For convenience of computation, a "soft" information selection mechanism is adopted: first, given q and x_{1:N}, the probability α_i of selecting the i-th input is computed,
where s(x_i, q) is a scoring function that can be computed in the following three ways:
Additive model: s(x_i, q) = v^T tanh(W x_i + U q)
Dot-product model: s(x_i, q) = x_i^T q
Multiplicative model:
where W, U and v are learnable network parameters and T denotes matrix transposition.
The attention distribution α_i can be interpreted as the degree to which the i-th piece of information is attended to under the query q. The "soft" information selection mechanism then encodes the input information as a weighted sum of the inputs under this attention distribution.
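The "soft" selection described above can be sketched with the dot-product scoring function s(x_i, q) = x_i^T q; the input vectors and query below are random stand-ins for illustration.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

N, d = 6, 8
X = np.random.randn(N, d)        # input information x_1 ... x_N
q = np.random.randn(d)           # task-related query vector

alpha = softmax(X @ q)           # attention distribution alpha_i over the N inputs
context = alpha @ X              # soft selection: weighted sum of the inputs
print(alpha.round(3), context.shape)
```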
Fig. 3 gives an example of the "soft" attention mechanism.
2. Variants of the attention mechanism
2.1 Key-value pair attention
More generally, the input information can be represented in key-value pair format, where the "key" K is used to compute the attention distribution α_i and the "value" V is used to generate the selected information. With (k, v)_{1:N} = [(k_1, v_1), ..., (k_N, v_N)] denoting the N pieces of input information, the attention function for a given task-related query vector q is
where s(k_i, q) denotes the scoring function.
Fig. 4 gives an example of the key-value pair attention mechanism. If k_i = v_i in the key-value pair mode, it is equivalent to the ordinary attention mechanism.
2.2 Scaled dot-product attention
The scaled dot-product attention algorithm is described in terms of key-value pairs K-V and a query vector q in a very abstract way. Here we assume that the "key" K and the "value" V in the key-value pair correspond to the same vector, i.e. K = V, as shown in Fig. 6; the query vector q corresponds to the word vectors of the target sentence.
The operation has three steps:
1. Each query vector q makes a dot product with the "keys" K.
2. The results are normalized with softmax so that the weights stay in the probability range [0, 1].
3. The weights are finally multiplied by the "values" V to obtain the attention vector.
The mathematical expression is as follows,
where the scaling factor is 1/√d_k (d_k being the key dimension) and T denotes matrix transposition.
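A minimal sketch of scaled dot-product attention under the K = V assumption made above; the shapes and random inputs are illustrative only.

```python
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # step 1: dot products, scaled by 1/sqrt(d_k)
    weights = softmax(scores, axis=-1)       # step 2: normalize weights into [0, 1]
    return weights @ V                       # step 3: weighted sum of the values

Q = np.random.randn(4, 64)                   # 4 target-side positions (queries)
K = V = np.random.randn(6, 64)               # 6 source-side positions (keys = values)
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 64)
```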
2.3 Multi-head attention
Multi-head attention uses multiple queries q_{1:M} = {q_1, ..., q_M} to select multiple pieces of information from the input in parallel; each attention head attends to a different part of the input information.
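Multi-head attention can be sketched by running several independently projected scaled dot-product attentions in parallel and concatenating the heads; h = 8 and the random projection matrices are assumptions for illustration, not values taken from the patent.

```python
import numpy as np

def multi_head_attention(Q, K, V, h=8):
    d_model = Q.shape[-1]
    d_k = d_model // h
    heads = []
    for _ in range(h):
        # each head uses its own learned projections; random stand-ins here
        Wq, Wk, Wv = (np.random.randn(d_model, d_k) * 0.02 for _ in range(3))
        q, k, v = Q @ Wq, K @ Wk, V @ Wv
        scores = q @ k.T / np.sqrt(d_k)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        heads.append(weights @ v)            # each head attends to a different part
    Wo = np.random.randn(d_model, d_model) * 0.02
    return np.concatenate(heads, axis=-1) @ Wo   # concatenate heads, project back

Q = np.random.randn(4, 512)
K = V = np.random.randn(6, 512)
print(multi_head_attention(Q, K, V).shape)   # (4, 512)
```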
The present invention improves the training optimization method and regularization strategy of the model as follows:
The model is trained with the Adam method, and the present invention adopts a warm-up learning-rate schedule, as shown in the formula:
The formula means that a hyperparameter warmup_steps must be set in advance.
A. When the training step step_num is smaller than this value, the learning rate is determined by the second term in the brackets, a linear function of step_num with positive slope.
B. When the training step step_num is greater than warmup_steps, the learning rate is determined by the first term in the brackets, a power function with a negative exponent.
Overall, therefore, the learning rate first rises and then decays, which is conducive to fast model convergence.
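The warm-up schedule can be sketched as below. The exact formula is not reproduced in this text, so the sketch assumes the original Transformer form lrate = d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5), which matches the rising-then-decaying behaviour described in A and B; warmup_steps = 4000 is an assumed value.

```python
def learning_rate(step_num, d_model=512, warmup_steps=4000):
    rising = step_num * warmup_steps ** -1.5      # linear term, dominant while step_num < warmup_steps
    decaying = step_num ** -0.5                   # negative-exponent power term, dominant afterwards
    return d_model ** -0.5 * min(rising, decaying)

for step in (100, 4000, 20000):
    print(step, round(learning_rate(step), 6))
```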
Two important regularization methods are also used in the model. One is the common dropout method, applied after each sublayer and in the attention computation. The other is label smoothing: during training, when the cross entropy is computed, the target is no longer a one-hot reference answer; instead, each zero entry is filled with a small non-zero value. This enhances the robustness of the model and raises its BLEU score.
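Label smoothing as described, filling each zero of the one-hot target with a small non-zero value, can be sketched as follows; the smoothing value epsilon = 0.1 is an assumption for illustration.

```python
import numpy as np

def smooth_labels(target_index, vocab_size, epsilon=0.1):
    # spread epsilon over the non-target entries and keep 1 - epsilon on the target
    dist = np.full(vocab_size, epsilon / (vocab_size - 1))
    dist[target_index] = 1.0 - epsilon
    return dist

print(smooth_labels(target_index=2, vocab_size=5))
# cross entropy is then computed against this smoothed distribution instead of one-hot
```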
In summary, the Transformer-based sequence modeling method of the present invention still follows the classical encoder-decoder structure of sequence-to-sequence models, but differs in that it uses multi-head attention rather than an RNN or CNN as the sequence modeling mechanism. The theoretical advantage of multi-head attention is that it captures "long-distance dependency information" more easily. "Long-distance dependency information" can be understood as follows: 1) a word is a symbol that can express diverse semantic information (the ambiguity problem); 2) the meaning of a word is determined by the context in which it appears (disambiguation by context); 3) some words need only a small context window to determine their meaning (short-distance dependency), while others need a large context window (long-distance dependency).
For example, consider the following two sentences:
"There are many dujuan on the mountain; when spring comes they bloom all over the hills, and they are very beautiful."
"There are many dujuan on the mountain; when spring comes they call all over the hills, and the sound carries far."
In these two sentences the Chinese word "杜鹃" (dujuan) refers to the azalea flower and to the cuckoo bird respectively. In machine translation, without looking at words far away from it, this word is difficult to translate correctly. This is an obvious example in which the long-distance dependency between words can clearly be seen. Of course, most word meanings can be determined within a small contextual window, so cases like the example above account for a relatively small proportion of language. What we want is a model that can learn short-distance dependency knowledge well and can also learn long-distance dependencies.
The multi-head attention mechanism in the Transformer of the present invention can, in theory, better capture such long- and short-distance dependency knowledge. Below, the three sequence modeling methods based on RNN, CNN and Transformer are compared in terms of the interaction distance between any two words.
Fig. 7 shows sequence modeling with a bidirectional RNN. Because the elements of the sequence are processed in order, the interaction distance between two words can be regarded as their relative distance: the interaction distance between W1 and Wn is n-1. In theory an RNN model with gating can selectively store and forget historical information and performs better than a plain RNN structure, but with a fixed number of gating parameters this ability is limited. As sentences grow longer and relative distances increase, there is a clear theoretical upper limit.
Fig. 8 illustrates sequence modeling with a multi-layer CNN. The CNN units of the first layer cover a small semantic context; the second layer covers a larger one, and so on: the deeper the CNN unit, the larger the context it covers. A word first interacts with nearby words on the bottom CNN units, and then with more distant words on higher-level CNN units. The multi-layer CNN structure therefore embodies a local-to-global feature extraction process in which the interaction distance between words is proportional to their distance in the sentence. Words far apart can only meet on higher CNN nodes before interacting, and this process can lose more information.
Fig. 2 shows that the multi-head attention based sequence modeling method of the present invention is clearly superior to these two approaches and can capture more information.
A specific Mongolian-Chinese translation example follows.
Experiments use a 1.2-million-sentence Mongolian-Chinese parallel corpus as the data set to verify the effect of the invention.
To address the severe data sparsity in the Mongolian corpus, three processing schemes are applied: affix segmentation, stem segmentation and segmentation of case-marking components. The granularity of affix segmentation is small, the granularity of stem segmentation is relatively large, and the segmentation of case-marking components is similar to stem segmentation but with an even larger granularity.
The present invention tests these three segmentation methods on the corpus separately; the experimental results are shown in Table 1.
Table 1
The experimental results in the table show that all segmentation methods improve translation quality. Stem segmentation raises the BLEU value by 1.02; although the improvement from case-component segmentation alone is not obvious, when it acts together with stem segmentation the BLEU gain reaches 1.14. The reason affix segmentation does worse than stem segmentation, we believe, is mainly that affix segmentation is too fine-grained, so the sentence length after segmentation grows considerably, and neural machine translation handles long sentences poorly, so the effect suffers. After adding the distributed-representation-based combined concept semantic similarity computation, BLEU improves by 5.88. We then randomly select 65 word pairs and set up a coordinate system with the word pairs and similarity values as coordinates to analyze the distribution of the similarity values computed by the distributed combined concept semantic similarity algorithm. As can be seen from Fig. 9, the continuity of the values obtained by this algorithm is fairly good, indicating that the similarity values computed by the algorithm correlate well with human similarity ratings. Finally, after the two preprocessing steps above, the preprocessed data is split into training, validation and test sets in a fixed proportion and used to train the Transformer model; the BLEU value improves by 10.16, and the training effect is clearly better than the RNN.

Claims (10)

1. A Transformer-based Mongolian-Chinese machine translation method with enhanced semantic feature information, characterized in that a Transformer model is used during translation, the Transformer model being a multi-layer encoder-decoder architecture built with trigonometric-function positional encoding and an enhanced multi-head attention mechanism, so that it relies entirely on attention to draw global dependencies between input and output, eliminating recurrence and convolution.
2. The Transformer-based Mongolian-Chinese machine translation method with enhanced semantic feature information according to claim 1, characterized in that, before translation, the data is first preprocessed; the preprocessing segments the stems, affixes and case-marking components in the Mongolian corpus to reduce data sparsity, identifies the linguistic features of Mongolian stems, affixes and case-marking components, and incorporates these linguistic features into training.
3. The Transformer-based Mongolian-Chinese machine translation method with enhanced semantic feature information according to claim 2, characterized in that the segmentation includes fine-grained affix segmentation, coarse-grained stem segmentation, and small-scale segmentation of case-marking components.
4. The Transformer-based Mongolian-Chinese machine translation method with enhanced semantic feature information according to claim 1, characterized in that, after the data has been preprocessed, the influence of depth, density and semantic overlap on concept semantic similarity is combined with similarity algorithms based on semantic distance and information content to build a similarity matrix; principal component analysis is then applied, the similarity matrix is transformed into a principal-component matrix, the principal component contribution rates are computed and used as weights, and the final concept semantic similarity is obtained.
5. The Transformer-based Mongolian-Chinese machine translation method with enhanced semantic feature information according to claim 4, characterized in that the similarity matrix is expressed as
X_sim = (x_i1, x_i2, x_i3, x_i4, x_i5)^T, i = 1, 2, 3, ..., n
and the final concept semantic similarity is computed as
δ_sim = r_1·y_sim1 + r_2·y_sim2 + r_3·y_sim3 + r_4·y_sim4 + r_5·y_sim5
where X_sim denotes the similarity matrix; x_i1 denotes Ds_i, x_i2 denotes Ks_i, x_i3 denotes Zs_i, x_i4 denotes Ss_i, x_i5 denotes Is_i; n is the number of concept-word pairs in the comparison set; x_i = (Ds_i, Ks_i, Zs_i, Ss_i, Is_i) is a vector in the principal-component input sample set, where each dimension represents the result of one partial similarity computation in the combined similarity module: Ds_i denotes the relationship between the semantic distance and similarity of the i-th element, Ks_i the semantic similarity in terms of depth, Zs_i the density impact factor of the concept word c, Ss_i the similarity in terms of semantic overlap, and Is_i the similarity in terms of information content; δ_sim denotes the concept semantic similarity; y_sim1, y_sim2, y_sim3, y_sim4, y_sim5 are the principal components extracted by principal component analysis of the similarity matrix X_sim, and r_1, r_2, r_3, r_4, r_5 are the contribution rates of the respective principal components.
6. The Transformer-based Mongolian-Chinese machine translation method with enhanced semantic feature information according to claim 1, characterized in that the multi-head attention mechanism is described as mapping a query and a set of key-value pairs to an output, where the query, keys, values and output are all vectors; the output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
7. The Transformer-based Mongolian-Chinese machine translation method with enhanced semantic feature information according to claim 1, characterized in that
the encoder consists of N identical layers, each with two sublayers: the first is a multi-head attention sublayer and the second a feed-forward sublayer; the input and output of each sublayer are connected by a residual connection, and each sublayer is followed by a layer-normalization step to speed up model convergence;
the decoder consists of N identical layers, each with three sublayers: the first is a masked multi-head attention sublayer for modeling the target-side sentence generated so far, where during training a mask matrix ensures that each multi-head attention computation only attends to the first t-1 words; the second is a multi-head attention sublayer, the attention mechanism between encoder and decoder, which looks up the relevant semantic information in the source language; the third is a feed-forward sublayer identical to the one in the encoder; the input and output of each sublayer are connected by a residual connection and followed by a layer-normalization step to speed up model convergence.
8. The Transformer-based Mongolian-Chinese machine translation method with enhanced semantic feature information according to claim 1 or 7, characterized in that the multi-layer encoder-decoder architecture is constructed as follows:
in the encoder, the output of each sublayer is LayerNorm(x + Sublayer(x)), where LayerNorm() denotes the layer-normalization function, Sublayer() is the function implemented by the residually connected sublayer itself based on multi-head attention, and x denotes the vector input to the current layer; the Mongolian sentence is converted into vectors with word2vec and used as input to the first encoder layer; to facilitate the residual connections, all sublayers and the embedding layer produce outputs of dimension d_model = 512.
9. according to claim 1 or the illiteracy Chinese machine translation side of the 7 enhancing semantic feature information based on Transformer Method, which is characterized in that
The feed-forward sublayer of the encoder is realized with two linear transformations and one ReLU nonlinear activation; the specific calculation formula is as follows:
FFN(x) = γ(0, xW1 + b1)W2 + b2
where x denotes the encoder input information, W1 denotes the weight corresponding to the input vector, b1 denotes the bias factor of the multi-head attention mechanism, (0, xW1 + b1) denotes the input-layer information of the feed-forward sublayer, W2 denotes the weight corresponding to the input of the second linear transformation, b2 denotes the bias factor of the feed-forward function, and γ denotes the nonlinear activation function of the encoder layer (the ReLU), so that γ(0, xW1 + b1) = max(0, xW1 + b1).
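A sketch of the feed-forward sublayer of claim 9, with γ realised as the ReLU activation max(0, ·); the inner dimension 2048 shown in the shape comment is the usual Transformer choice and is an assumption, not taken from the claim.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = relu(x @ W1 + b1) @ W2 + b2 : two linear transformations with
    one ReLU nonlinearity in between, applied identically at every position."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU, i.e. the γ(0, xW1 + b1) term
    return hidden @ W2 + b2

# Illustrative shapes with d_model = 512 and an assumed inner dimension of 2048:
# x: (seq_len, 512), W1: (512, 2048), b1: (2048,), W2: (2048, 512), b2: (512,)
```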
10. The Transformer-based Mongolian-Chinese machine translation method with enhanced semantic feature information according to claim 1 or 7, characterized in that the positional encoding is computed with trigonometric functions, the absolute position being used as the variable of the trigonometric function, according to the following formula:
In the formula, pos is the position and i is the dimension, i.e. each dimension of the positional encoding corresponds to a sinusoid; the wavelengths form a geometric progression from 2π to 10000·2π; dmodel is the dimension of the embedding layer after positional encoding, and 2i ranges from a minimum of 0 to a maximum of dmodel.
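The formula itself is not reproduced in the text above; as a reading aid, the standard sinusoidal positional encoding of the Transformer (Vaswani et al., listed in the non-patent citations below), which matches this description, is:

```latex
PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
```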
CN201811231017.2A 2018-10-22 2018-10-22 A kind of illiteracy Chinese machine translation method of the enhancing semantic feature information based on Transformer Pending CN109492232A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811231017.2A CN109492232A (en) 2018-10-22 2018-10-22 A kind of illiteracy Chinese machine translation method of the enhancing semantic feature information based on Transformer

Publications (1)

Publication Number Publication Date
CN109492232A true CN109492232A (en) 2019-03-19

Family

ID=65692441

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811231017.2A Pending CN109492232A (en) 2018-10-22 2018-10-22 A kind of illiteracy Chinese machine translation method of the enhancing semantic feature information based on Transformer

Country Status (1)

Country Link
CN (1) CN109492232A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105957518A (en) * 2016-06-16 2016-09-21 内蒙古大学 Mongolian large vocabulary continuous speech recognition method
CN107967262A (en) * 2017-11-02 2018-04-27 内蒙古工业大学 A kind of neutral net covers Chinese machine translation method
CN108681539A (en) * 2018-05-07 2018-10-19 内蒙古工业大学 A kind of illiteracy Chinese nerve interpretation method based on convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ASHISH VASWANI ET AL.: "Attention Is All You Need", 《31ST CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS (NIPS 2017)》 *
王桐 等: "WordNet中的综合概念语义相似度计算方法", 《北京邮电大学学报》 *

Cited By (72)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110597947B (en) * 2019-03-20 2023-03-28 桂林电子科技大学 Reading understanding system and method based on global and local attention interaction
CN110597947A (en) * 2019-03-20 2019-12-20 桂林电子科技大学 Reading understanding system and method based on global and local attention interaction
CN110083826A (en) * 2019-03-21 2019-08-02 昆明理工大学 A kind of old man's bilingual alignment method based on Transformer model
CN112037776A (en) * 2019-05-16 2020-12-04 武汉Tcl集团工业研究院有限公司 Voice recognition method, voice recognition device and terminal equipment
CN110196946A (en) * 2019-05-29 2019-09-03 华南理工大学 A kind of personalized recommendation method based on deep learning
CN110196946B (en) * 2019-05-29 2021-03-30 华南理工大学 Personalized recommendation method based on deep learning
CN110349676A (en) * 2019-06-14 2019-10-18 华南师范大学 Timing physiological data classification method, device, storage medium and processor
CN110349676B (en) * 2019-06-14 2021-10-29 华南师范大学 Time-series physiological data classification method and device, storage medium and processor
WO2020253060A1 (en) * 2019-06-17 2020-12-24 平安科技(深圳)有限公司 Speech recognition method, model training method, apparatus and device, and storage medium
CN110297887A (en) * 2019-06-26 2019-10-01 山东大学 Service robot personalization conversational system and method based on cloud platform
CN110619034A (en) * 2019-06-27 2019-12-27 中山大学 Text keyword generation method based on Transformer model
CN110321962B (en) * 2019-07-09 2021-10-08 北京金山数字娱乐科技有限公司 Data processing method and device
CN110321962A (en) * 2019-07-09 2019-10-11 北京金山数字娱乐科技有限公司 A kind of data processing method and device
CN110321961A (en) * 2019-07-09 2019-10-11 北京金山数字娱乐科技有限公司 A kind of data processing method and device
CN110427493B (en) * 2019-07-11 2022-04-08 新华三大数据技术有限公司 Electronic medical record processing method, model training method and related device
CN110427493A (en) * 2019-07-11 2019-11-08 新华三大数据技术有限公司 Electronic health record processing method, model training method and relevant apparatus
CN110390340A (en) * 2019-07-18 2019-10-29 暗物智能科技(广州)有限公司 The training method and detection method of feature coding model, vision relationship detection model
CN111488742A (en) * 2019-08-19 2020-08-04 北京京东尚科信息技术有限公司 Method and device for translation
CN111488742B (en) * 2019-08-19 2021-06-29 北京京东尚科信息技术有限公司 Method and device for translation
CN110704587A (en) * 2019-08-22 2020-01-17 平安科技(深圳)有限公司 Text answer searching method and device
CN110704587B (en) * 2019-08-22 2023-10-20 平安科技(深圳)有限公司 Text answer searching method and device
CN110598221A (en) * 2019-08-29 2019-12-20 内蒙古工业大学 Method for improving translation quality of Mongolian Chinese by constructing Mongolian Chinese parallel corpus by using generated confrontation network
CN110543551A (en) * 2019-09-04 2019-12-06 北京香侬慧语科技有限责任公司 question and statement processing method and device
CN110543551B (en) * 2019-09-04 2022-11-08 北京香侬慧语科技有限责任公司 Question and statement processing method and device
CN110717343B (en) * 2019-09-27 2023-03-14 电子科技大学 Optimal alignment method based on transformer attention mechanism output
CN110674647A (en) * 2019-09-27 2020-01-10 电子科技大学 Layer fusion method based on Transformer model and computer equipment
CN110717343A (en) * 2019-09-27 2020-01-21 电子科技大学 Optimal alignment method based on transformer attention mechanism output
CN110765768A (en) * 2019-10-16 2020-02-07 北京工业大学 Optimized text abstract generation method
US11556723B2 (en) 2019-10-24 2023-01-17 Beijing Xiaomi Intelligent Technology Co., Ltd. Neural network model compression method, corpus translation method and device
CN110795535A (en) * 2019-10-28 2020-02-14 桂林电子科技大学 Reading understanding method for depth separable convolution residual block
CN110827219A (en) * 2019-10-31 2020-02-21 北京小米智能科技有限公司 Training method, device and medium of image processing model
CN110827219B (en) * 2019-10-31 2023-04-07 北京小米智能科技有限公司 Training method, device and medium of image processing model
CN111105423A (en) * 2019-12-17 2020-05-05 北京小白世纪网络科技有限公司 Deep learning-based kidney segmentation method in CT image
CN111105423B (en) * 2019-12-17 2021-06-29 北京小白世纪网络科技有限公司 Deep learning-based kidney segmentation method in CT image
CN111080032A (en) * 2019-12-30 2020-04-28 成都数之联科技有限公司 Load prediction method based on Transformer structure
CN111080032B (en) * 2019-12-30 2023-08-29 成都数之联科技股份有限公司 Load prediction method based on transducer structure
CN111353315A (en) * 2020-01-21 2020-06-30 沈阳雅译网络技术有限公司 Deep neural machine translation system based on random residual algorithm
CN111353315B (en) * 2020-01-21 2023-04-25 沈阳雅译网络技术有限公司 Deep nerve machine translation system based on random residual error algorithm
CN111382583A (en) * 2020-03-03 2020-07-07 新疆大学 Chinese-Uygur name translation system with mixed multiple strategies
CN111428509A (en) * 2020-03-05 2020-07-17 北京一览群智数据科技有限责任公司 Latin letter-based Uygur language processing method and system
CN111310485B (en) * 2020-03-12 2022-06-21 南京大学 Machine translation method, device and storage medium
CN111310485A (en) * 2020-03-12 2020-06-19 南京大学 Machine translation method, device and storage medium
CN111444695B (en) * 2020-03-25 2022-03-01 腾讯科技(深圳)有限公司 Text generation method, device and equipment based on artificial intelligence and storage medium
CN111444695A (en) * 2020-03-25 2020-07-24 腾讯科技(深圳)有限公司 Text generation method, device and equipment based on artificial intelligence and storage medium
CN111507328A (en) * 2020-04-13 2020-08-07 北京爱咔咔信息技术有限公司 Text recognition and model training method, system, equipment and readable storage medium
CN111581987A (en) * 2020-04-13 2020-08-25 广州天鹏计算机科技有限公司 Disease classification code recognition method, device and storage medium
CN111428443A (en) * 2020-04-15 2020-07-17 中国电子科技网络信息安全有限公司 Entity linking method based on entity context semantic interaction
CN111428443B (en) * 2020-04-15 2022-09-13 中国电子科技网络信息安全有限公司 Entity linking method based on entity context semantic interaction
CN111401052A (en) * 2020-04-24 2020-07-10 南京莱科智能工程研究院有限公司 Semantic understanding-based multilingual text matching method and system
CN111626062B (en) * 2020-05-29 2023-05-30 思必驰科技股份有限公司 Text semantic coding method and system
CN111626062A (en) * 2020-05-29 2020-09-04 苏州思必驰信息科技有限公司 Text semantic coding method and system
CN112185104A (en) * 2020-08-22 2021-01-05 南京理工大学 Traffic big data restoration method based on countermeasure autoencoder
CN112185104B (en) * 2020-08-22 2021-12-10 南京理工大学 Traffic big data restoration method based on countermeasure autoencoder
CN112084794A (en) * 2020-09-18 2020-12-15 西藏大学 Tibetan-Chinese translation method and device
CN112507733A (en) * 2020-11-06 2021-03-16 昆明理工大学 Dependency graph network-based Hanyue neural machine translation method
CN112329760A (en) * 2020-11-17 2021-02-05 内蒙古工业大学 Method for recognizing and translating Mongolian in printed form from end to end based on space transformation network
CN112329760B (en) * 2020-11-17 2021-12-21 内蒙古工业大学 Method for recognizing and translating Mongolian in printed form from end to end based on space transformation network
CN112580373A (en) * 2020-12-26 2021-03-30 内蒙古工业大学 High-quality Mongolian unsupervised neural machine translation method
CN112580373B (en) * 2020-12-26 2023-06-27 内蒙古工业大学 High-quality Mongolian non-supervision neural machine translation method
CN112947930A (en) * 2021-01-29 2021-06-11 南通大学 Method for automatically generating Python pseudo code based on Transformer
CN113065432A (en) * 2021-03-23 2021-07-02 内蒙古工业大学 Handwritten Mongolian recognition method based on data enhancement and ECA-Net
CN113076398B (en) * 2021-03-30 2022-07-29 昆明理工大学 Cross-language information retrieval method based on bilingual dictionary mapping guidance
CN113076398A (en) * 2021-03-30 2021-07-06 昆明理工大学 Cross-language information retrieval method based on bilingual dictionary mapping guidance
CN113095091A (en) * 2021-04-09 2021-07-09 天津大学 Chapter machine translation system and method capable of selecting context information
CN113761841A (en) * 2021-04-19 2021-12-07 腾讯科技(深圳)有限公司 Method for converting text data into acoustic features
CN113177546A (en) * 2021-04-30 2021-07-27 中国科学技术大学 Target detection method based on sparse attention module
CN113297841A (en) * 2021-05-24 2021-08-24 哈尔滨工业大学 Neural machine translation method based on pre-training double-word vectors
CN113255597A (en) * 2021-06-29 2021-08-13 南京视察者智能科技有限公司 Transformer-based behavior analysis method and device and terminal equipment thereof
CN116186249A (en) * 2022-10-24 2023-05-30 数采小博科技发展有限公司 Item prediction robot for electronic commerce commodity and implementation method thereof
CN116186249B (en) * 2022-10-24 2023-10-13 数采小博科技发展有限公司 Item prediction robot for electronic commerce commodity and implementation method thereof
CN117711417A (en) * 2024-02-05 2024-03-15 武汉大学 Voice quality enhancement method and system based on frequency domain self-attention network
CN117711417B (en) * 2024-02-05 2024-04-30 武汉大学 Voice quality enhancement method and system based on frequency domain self-attention network

Similar Documents

Publication Publication Date Title
CN109492232A (en) A kind of illiteracy Chinese machine translation method of the enhancing semantic feature information based on Transformer
CN110334219B (en) Knowledge graph representation learning method based on attention mechanism integrated with text semantic features
JP7468929B2 (en) How to acquire geographical knowledge
CN106202010B (en) Method and apparatus based on deep neural network building Law Text syntax tree
CN111581401B (en) Local citation recommendation system and method based on depth correlation matching
CN111444343B (en) Cross-border national culture text classification method based on knowledge representation
CN109492227A (en) It is a kind of that understanding method is read based on the machine of bull attention mechanism and Dynamic iterations
CN110377686A (en) A kind of address information Feature Extraction Method based on deep neural network model
CN107729311B (en) Chinese text feature extraction method fusing text moods
CN110222140A (en) A kind of cross-module state search method based on confrontation study and asymmetric Hash
CN107480132A (en) A kind of classic poetry generation method of image content-based
CN106055675B (en) A kind of Relation extraction method based on convolutional neural networks and apart from supervision
CN106650789A (en) Image description generation method based on depth LSTM network
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN105938485A (en) Image description method based on convolution cyclic hybrid model
CN103778227A (en) Method for screening useful images from retrieved images
CN108268449A (en) A kind of text semantic label abstracting method based on lexical item cluster
CN105528437A (en) Question-answering system construction method based on structured text knowledge extraction
CN111881677A (en) Address matching algorithm based on deep learning model
Tang et al. Deep sequential fusion LSTM network for image description
CN111291556A (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN110765755A (en) Semantic similarity feature extraction method based on double selection gates
CN113515632B (en) Text classification method based on graph path knowledge extraction
CN110795565A (en) Semantic recognition-based alias mining method, device, medium and electronic equipment
CN113553440A (en) Medical entity relationship extraction method based on hierarchical reasoning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190319