CN115017921B - Korean-Chinese neural machine translation method based on multi-granularity representation - Google Patents

Korean-Chinese neural machine translation method based on multi-granularity representation

Info

Publication number
CN115017921B
CN115017921B CN202210228940.0A
Authority
CN
China
Prior art keywords
granularity
machine translation
sequence
sub
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210228940.0A
Other languages
Chinese (zh)
Other versions
CN115017921A (en)
Inventor
赵亚慧
金晶
崔荣一
金国哲
张振国
李德
李飞雨
姜克鑫
王苑儒
刘帆
夏明会
鲁雅鑫
赵晓辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yanbian University
Original Assignee
Yanbian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yanbian University filed Critical Yanbian University
Priority to CN202210228940.0A priority Critical patent/CN115017921B/en
Publication of CN115017921A publication Critical patent/CN115017921A/en
Application granted granted Critical
Publication of CN115017921B publication Critical patent/CN115017921B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Korean-Chinese neural machine translation method based on multi-granularity representation, which comprises the following steps: collecting Korean corpus text data and preprocessing it to obtain a multi-granularity sequence representation of the corpus text data; and constructing a neural machine translation model and translating the multi-granularity sequences of the corpus text data with the neural machine translation model to obtain a target-language translation. The invention uses the language structure information of the source language to improve the performance of the machine translation model and strengthens the modeling of Korean syntactic and semantic information.

Description

Korean-Chinese neural machine translation method based on multi-granularity representation
Technical Field
The invention belongs to the field of machine translation in natural language processing, and in particular relates to a Korean-Chinese neural machine translation method based on multi-granularity representation.
Background
The machine translation task is the process of using a computer to automatically translate a source language into a target language that preserves its semantics, and it is one of the important research directions in natural language processing. In scenarios with modest translation-quality requirements, or in domain-specific translation tasks, machine translation offers clear advantages in translation speed and performance and is widely applied. Given its complexity and broad applicability, the natural language processing community treats the task as an important research direction, and machine translation has become one of the most active research topics in the field.
Machine translation methods mainly include rule-based, statistics-based and neural-network-based approaches. After the first neural machine translation models were proposed, a large number of encoder-decoder neural machine translation models were developed in quick succession, repeatedly setting new records for translation quality and speed. As machine learning techniques such as deep learning have matured, neural machine translation models have attracted wide attention for their superior performance and their freedom from extensive manual intervention. Although neural machine translation models have far surpassed traditional models in performance, they still have substantial room for further development.
Korean is the language of the Korean ethnic group in China, one of the 24 minority nationalities in China with their own written language. Research on Korean-Chinese translation helps to promote cultural exchange among ethnic groups, supports the spread, preservation and development of Korean ethnic culture, and provides a scientific and cultural foundation. Domestic machine translation research on minority languages has focused mainly on a few languages such as Mongolian, Tibetan and Uyghur, while research on Korean-Chinese neural machine translation is almost blank.
Korean is a low-resource language: related research has progressed slowly owing to the scarcity of corpus resources and the characteristics of the language itself, and factors such as corpus scale, domain and quality greatly limit the development of Korean-to-Chinese machine translation research. In addition, Korean is an agglutinative language, in which affixes attached to roots produce rich morphological variation; with relatively scarce bilingual resources, a large model cannot be trained well and translations may be unfaithful.
Disclosure of Invention
The invention uses prior structural knowledge from linguistics to guide the language model and obtain a better attention distribution. By strengthening the parts of the model that under-learn the rich morphological variation of Korean, the model gains the ability to capture information from different subspaces through different tokenizations, which encourages model diversity, avoids the poor utilization caused by a fixed input granularity during decoding, and at the same time alleviates the problem of limited usable data.
In order to solve the above problems, the invention provides the following scheme: a Korean-Chinese neural machine translation method based on multi-granularity representation, comprising:
collecting Korean corpus text data and preprocessing it to obtain a multi-granularity sequence representation of the corpus text data;
and constructing a neural machine translation model and translating the multi-granularity sequences of the corpus text data with the neural machine translation model to obtain a target-language translation.
Preferably, the preprocessing includes performing multi-granularity segmentation on the Korean corpus text data with Korean language processing units that conform to the characteristics of the Korean language and are suitable for machine translation, obtaining the multi-granularity sequence representation.
Preferably, the multi-granularity segmentation by these processing units includes designing sub-syllable and sub-morpheme granularity processing units based on the root-affix word-formation of Korean text, and obtaining a syllable granularity sequence with a word granularity processing method.
Preferably, the processing by the sub-syllable and sub-morpheme granularity processing units comprises the following steps:
at the sub-syllable granularity, a WordPiece subword-vocabulary construction algorithm selects and merges the subword pair with the highest likelihood over the whole training data, yielding a sub-syllable granularity sequence;
at the sub-morpheme granularity, WordPiece is combined with Korean morpheme analysis to obtain the token sequence: after the merge pair with the highest likelihood is computed with WordPiece, the KoNLPy morpheme analyzer extracts the morpheme and part-of-speech information in each sentence, and tokens conforming to Korean grammatical structure are merged and re-segmented, yielding a sub-morpheme granularity sequence.
Preferably, before translating the multi-granularity sequences of the corpus text data with the neural machine translation model,
sentence representation vectors are obtained from the multi-granularity sequence representation, and sentence features are extracted through a multi-head multi-granularity attention structure, obtaining multi-granularity sentence features;
and the multi-granularity sentence features are dynamically masked with a granularity-aware masking method.
Preferably, the dynamic masking of the multi-granularity sentence features masks identical token information in the two sequences of different granularity, the sub-syllable granularity sequence and the syllable granularity sequence, so that the attention mechanism focuses more on the semantic information produced by the different segmentations.
Preferably, translating the multi-granularity sequences of the corpus text data with the neural machine translation model further comprises translating them until the multi-granularity-representation-based neural machine translation model converges, obtaining the target-language translation.
Compared with the prior art, the method has the following advantages:
(1) The invention uses prior structural knowledge from linguistics to guide the language model; it exploits the rich morphological variation of Korean to integrate multi-granularity text representations into the attention mechanism, adds perturbation to the sentence representation, and mitigates overfitting under low-resource conditions;
(2) The invention proposes a multi-head multi-granularity attention mechanism that captures the distinct language structure information carried by each granularity in a text sequence; by obtaining the structure information of different granularities simultaneously, the language model can fully capture linguistic features and alleviate the information redundancy of the multi-head attention mechanism.
Moreover, unlike directly concatenating (Concat) the multi-granularity information, where the auxiliary information is irreversibly fused with the primary granularity segmentation and brings extra redundancy at decoding time, the proposed method keeps the attention obtained in each encoder layer in a vector space consistent with the primary token sequence, while still introducing auxiliary information to model the interactions of different granularities within a sentence.
(3) The invention proposes a granularity-aware masking method that enhances the differences among the token structures of the different segmentations of a sequence; it improves the performance of the machine translation model with the language structure information of the source language and strengthens the modeling of Korean syntactic and semantic information.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a basic block diagram of a translation model according to an embodiment of the present invention;
FIG. 3 is a detailed structural diagram of the multi-head multi-granularity attention of an embodiment of the present invention;
FIG. 4 is a detailed block diagram of a granularity awareness masking method according to an embodiment of the present invention;
fig. 5 is an exemplary diagram of a granularity aware masking method according to an embodiment of this invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
As shown in FIG. 1, the invention provides a Korean-Chinese neural machine translation method based on multi-granularity representation, comprising the following steps:
collecting Korean corpus text data and preprocessing it to obtain a multi-granularity sequence representation of the corpus text data;
and constructing a neural machine translation model and translating the multi-granularity sequences of the corpus text data with the neural machine translation model to obtain a target-language translation.
The preprocessing includes performing multi-granularity segmentation on the Korean corpus text data with Korean language processing units that conform to the characteristics of the Korean language and are suitable for machine translation, obtaining the multi-granularity sequence representation.
The multi-granularity segmentation by these processing units includes designing sub-syllable and sub-morpheme granularity processing units based on the root-affix word-formation of Korean text, and obtaining a syllable granularity sequence with a word granularity processing method.
The processing by the sub-syllable and sub-morpheme granularity processing units comprises the following steps:
at the sub-syllable granularity, a WordPiece subword-vocabulary construction algorithm selects and merges the subword pair with the highest likelihood over the whole training data, yielding a sub-syllable granularity sequence;
at the sub-morpheme granularity, WordPiece is combined with Korean morpheme analysis to obtain the token sequence: after the merge pair with the highest likelihood is computed with WordPiece, the KoNLPy morpheme analyzer extracts the morpheme and part-of-speech information in each sentence, and tokens conforming to Korean grammatical structure are merged and re-segmented, yielding a sub-morpheme granularity sequence.
Before translating the multi-granularity sequences of the corpus text data with the neural machine translation model,
sentence representation vectors are obtained from the multi-granularity sequence representation, and sentence features are extracted through a multi-head multi-granularity attention structure, obtaining multi-granularity sentence features;
and the multi-granularity sentence features are dynamically masked with a granularity-aware masking method.
The dynamic masking of the multi-granularity sentence features masks identical token information in the two sequences of different granularity, the sub-syllable granularity sequence and the syllable granularity sequence, so that the attention mechanism focuses more on the semantic information produced by the different segmentations.
Translating the multi-granularity sequences of the corpus text data with the neural machine translation model further comprises translating them until the multi-granularity-representation-based neural machine translation model converges, obtaining the target-language translation.
Example 1
As shown in figs. 1-5, the Korean-Chinese machine translation method based on multi-granularity representation provided by the invention comprises the following steps: a data collection and preprocessing step, in which the input Korean text is segmented at several granularities that fit the Korean language structure and suit machine translation; a machine translation step, in which multi-granularity interaction features add language structure information to the model and enrich the attention distribution; and dynamic masking with a granularity-aware masking method, so that the attention mechanism focuses more on the different structural information within sentences and the local modeling of the model is strengthened.
The method specifically comprises the following steps:
step (1): preprocessing the Korean corpus with Korean language processing units that fit the Korean language structure and are suitable for machine translation, combined with other processing methods, to obtain the multi-granularity sequence representation;
step (2) extracting sentence characteristics from the sentence representation vector obtained in the step (1) by using a multi-head multi-granularity attention structure;
step (3) dynamically masking the multi-granularity sentence features in the step (2) by adopting a granularity perception masking method;
and step (4): repeating step (2) and step (3) in the neural machine translation model until the multi-granularity-representation-based neural machine translation model converges, obtaining the target-language translation.
In step (1), sub-syllable and sub-morpheme language processing units are designed for Korean according to its root-affix word-formation, and a syllable granularity sequence is obtained for Korean with a word granularity processing method.
The segmentation of the sub-syllable and sub-morpheme granularity processing units is modeled as follows:
at the sub-syllable granularity, a WordPiece subword-vocabulary construction algorithm selects and merges the subword pair with the highest likelihood over the whole training data, and the merges yield the sub-syllable granularity sequence. At the sub-morpheme granularity, WordPiece is combined with Korean morpheme analysis to obtain the token sequence: after the merge pair with the highest likelihood is computed with WordPiece, the KoNLPy morpheme analyzer extracts the morpheme and part-of-speech information in each sentence, and tokens conforming to Korean grammatical structure are merged and re-segmented, yielding the sub-morpheme granularity sequence.
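The WordPiece-style merge step described above can be sketched in plain Python. This is a minimal illustration, not the patent's implementation: it scores adjacent unit pairs with the WordPiece likelihood criterion count(ab)/(count(a)·count(b)) and performs a single merge, the subsequent KoNLPy morpheme re-segmentation is omitted, and the toy corpus uses romanized placeholder units rather than real Korean jamo.

```python
from collections import Counter

def wordpiece_merge_step(corpus_tokens):
    """Perform one WordPiece-style merge: pick the adjacent unit pair that
    maximizes count(ab) / (count(a) * count(b)) and merge it corpus-wide."""
    unit_counts = Counter(u for seq in corpus_tokens for u in seq)
    pair_counts = Counter()
    for seq in corpus_tokens:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
    if not pair_counts:
        return corpus_tokens, None
    best = max(pair_counts,
               key=lambda p: pair_counts[p] / (unit_counts[p[0]] * unit_counts[p[1]]))
    merged_corpus = []
    for seq in corpus_tokens:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(seq[i] + seq[i + 1])  # merge the chosen pair
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged_corpus.append(out)
    return merged_corpus, best

# Toy corpus of placeholder units (not real Korean data).
corpus = [["l", "o", "w"], ["l", "o", "w", "er"], ["n", "e", "w"]]
merged, pair = wordpiece_merge_step(corpus)
print(pair, merged)
```

Note that the likelihood criterion prefers pairs of rare units: here ("n", "e") wins over the more frequent ("l", "o") because its units occur nowhere else.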
Multi-granularity text representations are used to redesign the multi-head attention mechanism in a Transformer-based neural machine translation model.
The step (2) is specifically realized as follows:
Building on the Transformer, after Q, K and V are obtained, the language model computes the relation between each word and every other word; a score expresses the strength of each relation, and the higher the score, the more important the relation between the two words. In the multi-head multi-granularity attention structure, Q and K are computed from the auxiliary information, namely the sub-syllable granularity sequence X_subsyl and the syllable granularity sequence X_syl, while V comes from the primary token sequence X. The modeling formulas are:

Q_i = X_subsyl · W_i^Q,  K_i = X_syl · W_i^K,  V_i = X · W_i^V
wherein W is a neural network weight matrix, and i is the ith head in the multi-head multi-granularity attention structure.
After the multi-granularity text representations are mapped through the parameter matrices, attention is computed to realize multi-head multi-granularity attention:

head_i = Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k) V_i
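A single head of the attention above can be sketched with NumPy. This is a hedged illustration under the stated assumption that Q is projected from the sub-syllable sequence, K from the syllable sequence, and V from the primary token sequence; the function names and toy dimensions are ours, not the patent's.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_granularity_head(X_subsyl, X_syl, X_main, Wq, Wk, Wv, mask=None):
    # Q from the sub-syllable sequence, K from the syllable sequence,
    # V from the primary token sequence (assumed layout, per the text above).
    Q, K, V = X_subsyl @ Wq, X_syl @ Wk, X_main @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])      # scaled dot-product scores
    if mask is not None:
        scores = np.where(mask, -1e9, scores)    # masked pairs get ~zero weight
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d = 4, 8  # toy sequence length and dimension, not the patent's settings
X_ss, X_sy, X_main = (rng.normal(size=(n, d)) for _ in range(3))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = multi_granularity_head(X_ss, X_sy, X_main, Wq, Wk, Wv)
print(out.shape)
```

The `mask` argument is where the granularity-aware mask of the next section plugs in.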
During training of the multi-head multi-granularity attention mechanism, the goal is to fuse language structure information across granularities; to enhance the differences in language structure between granularities, the granularity-aware masking method is combined with the multi-head multi-granularity attention mechanism to generate the sentence vector representation under the attention mechanism.
The specific implementation of the step (3) is as follows:
In the granularity-aware mask, identical token information in the two sequences of different granularity, the sub-syllable granularity sequence and the syllable granularity sequence, is masked, so that the attention mechanism focuses more on the semantic information produced by the different segmentations. The masking method masks, in the Q and K sequences, the positions of tokens that are segmented identically in both granularities.
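The mask itself reduces to comparing tokens across the two segmentations. A minimal sketch follows; the romanized tokens are hypothetical placeholders, not data from the patent.

```python
def granularity_aware_mask(subsyl_tokens, syl_tokens):
    """Boolean matrix (len(subsyl) x len(syl)): True where the query token
    (sub-syllable granularity) equals the key token (syllable granularity),
    i.e. positions whose segmentation did not differ between granularities."""
    return [[q == k for k in syl_tokens] for q in subsyl_tokens]

subsyl = ["na", "nu", "n", "hak", "gyo"]  # hypothetical sub-syllable split
syl    = ["na", "nun", "hak", "gyo"]      # hypothetical syllable split
mask = granularity_aware_mask(subsyl, syl)
print(mask)
```

Here three query/key pairs ("na", "hak", "gyo") survive both segmentations unchanged and would be masked, mirroring the three-token example in fig. 5.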
from the perspective of computational linguistics, the invention combines modern linguistic theory and natural language processing related theory and technology to carry out structural construction on Korean language units; in the machine translation process, the Korean language multi-granularity text representation is fused, language structure information is added for the model by utilizing multi-granularity interaction characteristics, and the attention distribution is enriched, so that the model models multi-granularity token relation, and the translation conforming to the Korean language grammar is learned. The dynamic masking is carried out by a granularity perception masking method, so that a multi-head multi-granularity attention mechanism focuses more on different structural information in sentences, the difference among the structures is enhanced, the language structural information of a source language is utilized to improve the performance of a machine translation model, and the korean syntax and semantic information modeling capability is enhanced.
Example two
As shown in figs. 1-5, the multi-granularity-representation-based Korean-Chinese machine translation method fuses source-language sequence information through the input and feature extraction at the encoder, on top of a Transformer-based language model, to capture the hierarchical structure of the language, i.e., sentence structure. The proposed basic model is shown in fig. 2. The invention provides two designs: the multi-head multi-granularity attention structure and the granularity-aware masking method, whose detailed structures are shown in figs. 3 and 4, respectively. In the multi-head multi-granularity attention structure, the Q, K and V sequences of the attention structure fuse the multi-granularity source-language sequence information reasonably and effectively, preserving the validity of the sequence information while adding perturbation, which mitigates overfitting under low-resource conditions. The granularity-aware masking method can choose to mask identical token information through a dynamic masking strategy, so that the model focuses more on the different hierarchical structures in the language. The masking strategy used when the same token appears in different-granularity segmentations of the same input sequence is shown in fig. 5: when the syllable granularity sequence and the sub-syllable granularity sequence contain the same tokens, the granularity-aware masking method masks those tokens (three tokens in the example of fig. 5).
In this example, all experiments were run on a deep-learning training server and implemented in Python, specifically: Ubuntu 20.04.1 LTS, 109.8 GB of memory, three GeForce RTX 2080 Ti GPUs, Python 3.7 and PyTorch 1.9.1.
In this embodiment, the Transformer-base model is used as the baseline: the encoder and decoder each have 6 layers, each layer contains 8 attention heads, and the output dimension of the feed-forward network layer is 1024. The model is trained with the Adam optimizer, the learning rate is set to 0.001, the loss function is cross entropy, and dropout = 0.3. BLEU-4, computed with the multi-bleu.perl script, is adopted as the evaluation metric; it is a common measure of the similarity between a machine translation and a reference text.
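For reference, the hyperparameters reported above can be collected in a single configuration mapping; the key names are our own choice, and only the values come from the text.

```python
# Hyperparameters reported for the Transformer-base setup of this embodiment.
config = {
    "encoder_layers": 6,
    "decoder_layers": 6,
    "attention_heads": 8,
    "ffn_output_dim": 1024,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "loss": "cross_entropy",
    "dropout": 0.3,
    "eval_metric": "BLEU-4",  # computed with multi-bleu.perl
}
print(config)
```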
The data used in this embodiment come from the corpus built by our laboratory's project on a comprehensive platform for Chinese and Korean scientific and technological information processing; the corpus consists of abstracts of parallel scientific and technical literature and includes both long and short texts. Limited by corpus scale, the collected monolingual corpus was expanded by back-translation to obtain pseudo-parallel sentence pairs, which were added to the original training data to mitigate the low-resource neural machine translation problem. The data used in the experiments are shown in Table 1.
TABLE 1
This embodiment mainly improves the encoder part of the Transformer, and the implementation is divided into three parts:
(1) Multi-granularity segmentation. For Korean-Chinese neural machine translation, the source language is Korean and the target language is Chinese. The invention applies three segmentation granularities to Korean: sub-syllable, sub-morpheme and syllable. The union of the vocabularies extracted at the three granularities is used as the source-language dictionary; the maximum source-sequence length is defined as 512, sequences longer than 512 are truncated to the first 512 tokens, and shorter sequences are zero-padded.
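The dictionary lookup, truncation and zero-padding of part (1) can be sketched as follows; the vocabulary, token strings and helper name are illustrative assumptions.

```python
PAD_ID = 0
MAX_LEN = 512

def to_source_ids(tokens, vocab, max_len=MAX_LEN):
    # Truncate to the first max_len tokens, map unknown tokens to <unk>,
    # then zero-pad up to max_len (as described in part (1)).
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens[:max_len]]
    ids += [PAD_ID] * (max_len - len(ids))
    return ids

# Toy vocabulary over the union of the three granularity vocabularies.
vocab = {"<pad>": 0, "<unk>": 1, "hak": 2, "gyo": 3}
seq = to_source_ids(["hak", "gyo", "mystery"], vocab)
print(len(seq), seq[:4])  # → 512 [2, 3, 1, 0]
```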
(2) Multi-head multi-granularity attention. The Transformer's single input sequence is replaced with the multiple segmentation sequences of the Korean representation, and the three multi-granularity representation sequences are fed into the encoder for word-vector initialization, positional encoding, multi-head attention computation and other operations. Computing and forward-propagating the multi-granularity Korean representation sequences extracts features better and supplies several kinds of sequence representation information.
(3) Granularity-aware mask. For the three sequences fed into the multi-head attention, the focus should be on the parts that differ between sequences, so the ids of tokens segmented identically in the Q and K sequences are masked, while differently segmented tokens are left unchanged.
To verify the performance of the proposed model, we compared our method with existing related work, including conventional methods as well as Transformer-based and related improved methods; the results are shown in Table 2.
TABLE 2
The experimental results show that adding multi-granularity text representation information (Transformer+MGSA) to Korean-Chinese machine translation improves BLEU by 1.15 over the Transformer baseline, and adding the granularity-aware masking method on top of it (Transformer+MGSA+GA-MASK) improves BLEU by 1.18 over the baseline. Compared with other models, our model enhances Korean language structure information through MGSA, and the complete MG-Transformer model further improves performance by adding GA-MASK.
Because NMT encoder layers capture syntactic and semantic features at different levels, the proposed multi-head multi-granularity attention structure was applied to different blocks for verification. The results are shown in Table 3.
TABLE 3
Here "1" denotes the bottom encoder layer and "6" the top layer. As multi-granularity text representation is added to more layers, the parameter count grows and translation quality fluctuates, but every setting outperforms the Transformer baseline, improving translation quality overall. The BLEU score reaches 22.68 when the multi-granularity representation is applied to layers 1-2.
The invention also examines different mask settings to verify the influence of multi-granularity information of identically segmented tokens on the task. In the experiments, the masking value for token ids of identical segmentations is set to 0, 0.05, 0.1, 0.2, and 0.5; the results are shown in Table 4.
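One plausible reading of this experiment (an assumption on our part, since the text does not give the exact formula) is that attention scores at shared-token positions are scaled by a small constant alpha rather than fully removed:

```python
# Soft granularity mask sketch: shared-token scores are damped by alpha
# in {0, 0.05, 0.1, 0.2, 0.5}; alpha = 0 recovers a hard mask.
def soft_granularity_mask(scores, same_token, alpha):
    """Scale each score by alpha where same_token marks an identical token pair."""
    return [[s * alpha if same else s
             for s, same in zip(row, srow)]
            for row, srow in zip(scores, same_token)]

scores = [[1.0, 0.5], [0.2, 0.8]]
same   = [[True, False], [False, True]]   # shared-token positions
print(soft_granularity_mask(scores, same, 0.1))
```

With alpha = 0 the shared tokens contribute nothing to attention; larger alpha lets some same-token information through, which is what Table 4 varies.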
TABLE 4
Translation examples of the Korean-Chinese machine translation of the present invention compared with the baseline model are shown in Table 5.
TABLE 5
The invention discloses a Korean-Chinese machine translation method based on multi-granularity characterization. The translation model adopts a granularity structure suited to the Korean-Chinese machine translation task and exploits multi-granularity semantic and structural information; a multi-granularity multi-head attention mechanism is proposed to improve attention capture and alleviate information redundancy among heads. Second, a granularity-aware masking method is designed for the multi-sequence information at the encoder side; it applies a dynamic masking strategy to the different encoder-side sequences, focuses on the segmentation boundaries and semantic information of the different granularities, and strengthens the differences between granularities. Experiments show that the model significantly improves performance on the Korean-Chinese NMT task and genuinely attends to the distinctive grammatical information of the Korean language, thereby enhancing the modeling of Korean syntactic and semantic information.
The above embodiments merely illustrate preferred embodiments of the present invention and are not intended to limit its scope; various modifications and improvements made by those skilled in the art to the technical solutions of the present invention, without departing from its design spirit, shall fall within the protection scope defined by the claims.

Claims (1)

1. A Korean-Chinese machine translation method based on multi-granularity characterization, comprising the following steps:
collecting Korean corpus text data and preprocessing it to obtain a multi-granularity sequence representation of the corpus text data;
constructing a neural machine translation model, and translating the multi-granularity sequence of the corpus text data with the neural machine translation model to obtain a target-language translation;
wherein, before translating the multi-granularity sequence of the corpus text data with the neural machine translation model, the method further comprises:
obtaining sentence representation vectors according to the multi-granularity sequence representation, and extracting sentence features through a multi-head multi-granularity attention structure to obtain multi-granularity sentence features; and
dynamically masking the multi-granularity sentence features based on a granularity-aware masking method;
wherein dynamically masking the multi-granularity sentence features based on the granularity-aware masking method comprises masking identical token information in the two different-granularity sequences, the sub-syllable granularity sequence and the syllable granularity sequence, so that the attention mechanism focuses more on the semantic information yielded by the different granularity segmentations across sequences;
wherein the preprocessing comprises performing multi-granularity division of the Korean text data through Korean language processing units that conform to Korean language characteristics and are suitable for machine translation, to obtain the multi-granularity sequence representation;
wherein the multi-granularity division by the Korean language processing units comprises designing sub-word-segment and sub-morpheme granularity processing units based on the root-affix word-formation rules of Korean text, and obtaining a syllable granularity sequence in combination with a word granularity processing method;
the processing of the sub-speech segments and sub-morpheme granularity processing units comprises the following steps of,
selecting and combining the sub word pairs to calculate the sub word pair with the highest likelihood of the whole training data by adopting a WordPiecesub word list construction algorithm in the sub word section granularity, and obtaining a sub word section granularity sequence;
the sub-morpheme granularity adopts a method of combining WordPiece with Korean morpheme analysis to obtain a token sequence; after obtaining a merging sub-word pair with highest likelihood ratio according to WordPiece calculation, analyzing morphological elements and part-of-speech information in sentences by using a KoNLPy morphological element analyzer, merging and segmenting a token conforming to a Korean grammar structure to obtain a sub-morphological element granularity sequence;
wherein translating the multi-granularity sequence of the corpus text data further comprises translating the multi-granularity sequence of the corpus text data until the neural machine translation model based on multi-granularity characterization converges, obtaining the target-language translation.
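The WordPiece-style merge selection in the claim can be sketched in a few lines; the corpus below is invented, and the scoring function is the commonly used WordPiece likelihood criterion count(ab) / (count(a) * count(b)), which we assume matches the claim's "highest likelihood" wording:

```python
# Sketch of one WordPiece merge step: among adjacent unit pairs in the
# training data, pick the pair whose merge most increases corpus likelihood.
from collections import Counter

def best_merge(sequences):
    """Return the adjacent pair with the highest likelihood-gain score."""
    unit_counts = Counter(u for seq in sequences for u in seq)
    pair_counts = Counter((a, b) for seq in sequences
                          for a, b in zip(seq, seq[1:]))
    return max(pair_counts,
               key=lambda p: pair_counts[p] / (unit_counts[p[0]] * unit_counts[p[1]]))

# Toy syllable-segmented corpus: "한글" always co-occurs, so it merges first.
corpus = [["한", "글"], ["한", "글"], ["국", "어"], ["국", "문"], ["어", "문"]]
print(best_merge(corpus))
```

Repeating this step and then re-splitting merged tokens against a morpheme analyzer's output would yield the sub-morpheme granularity sequence described above.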
CN202210228940.0A 2022-03-10 2022-03-10 Korean-Chinese neural machine translation method based on multi-granularity characterization Active CN115017921B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210228940.0A CN115017921B (en) 2022-03-10 2022-03-10 Korean-Chinese neural machine translation method based on multi-granularity characterization


Publications (2)

Publication Number Publication Date
CN115017921A CN115017921A (en) 2022-09-06
CN115017921B (en) 2023-08-01

Family

ID=83067616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210228940.0A Active CN115017921B (en) 2022-03-10 2022-03-10 Towards Chinese nerve machine translation method based on multi-granularity characterization

Country Status (1)

Country Link
CN (1) CN115017921B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111144142A (en) * 2019-12-30 2020-05-12 昆明理工大学 Hanyue neural machine translation method based on depth separable convolution

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102069692B1 (en) * 2017-10-26 2020-01-23 한국전자통신연구원 Neural machine translation method and apparatus
CN109885686A (en) * 2019-02-20 2019-06-14 延边大学 A kind of multilingual file classification method merging subject information and BiLSTM-CNN
CN112765996B (en) * 2021-01-19 2021-08-31 延边大学 Middle-heading machine translation method based on reinforcement learning and machine translation quality evaluation



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant