CN115688904B - Translation model construction method based on noun translation prompt

Info

Publication number: CN115688904B
Authority: CN (China)
Prior art keywords: noun, translation, model, data, list
Legal status: Active
Application number: CN202211348033.6A
Other languages: Chinese (zh)
Other versions: CN115688904A
Inventors: 迟雨桐, 冯少辉, 李鹏
Assignee: Beijing Iplus Teck Co ltd (original assignee)
Application filed by Beijing Iplus Teck Co ltd; published as CN115688904A, granted as CN115688904B.

Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a translation model construction method based on noun translation prompts, belongs to the technical field of natural language processing, and solves the problems of inaccurate noun and proper-noun translation, missed translation and mistranslation in prior-art machine translation models. Noun recognition is performed on parallel corpora of the two languages to be translated to obtain an original-text noun set and a translated-text noun set, from which the training samples of the translation model to be trained and the corresponding adjustment matrices are constructed. The adjustment matrix is used to adjust the attention calculation of the model, and the translation model is trained with the text augmented with noun translation prompts as input and the reference translation as target output, yielding the finally trained translation model NPTrans. Based on input data containing noun translation prompts and on the adjustment of the adjustment matrix, the noun-translation accuracy of the model is ensured to a certain extent, the problems of missed and mistaken noun translation are alleviated, and the noun-translation accuracy of the machine translation model is improved.

Description

Translation model construction method based on noun translation prompt
Technical Field
The invention relates to the technical field of natural language processing, in particular to a translation model construction method based on noun translation prompt.
Background
Machine translation, also called automatic translation, is one of the important directions of artificial intelligence: the process of converting one natural language (the source language) into another natural language (the target language) using a computer. With economic globalization and the rapid development of the Internet, machine translation technology plays an increasingly important role in promoting political, economic and cultural exchanges among countries, so research on machine translation technology has important practical significance.
Early machine translation technology was based on statistical machine translation (SMT), which treats translation as a probability problem and performs disambiguation and translation selection directly from statistical results, thereby avoiding the hard problem of language understanding. However, because corpus selection and processing require an enormous amount of engineering, few general-domain machine translation systems are based on statistical methods. In recent years, neural machine translation (NMT) based on deep learning networks has been widely adopted; its multi-layer network structure learns the context of the source text well, extracts semantic features, and generates more fluent and well-formed translations, improving machine translation quality.
However, deep learning-based methods have drawbacks, the most important of which is the incorrect translation of nouns and proper nouns. Inaccurate noun translation includes both missed translation (directly skipping a noun or a noun segment without translating it) and mistranslation (translating it incorrectly); the missed-translation problem is particularly acute when translation data and training examples for a language pair are scarce. Because existing machine translation models suffer from inaccurate, missed and mistaken translations of nouns and proper nouns, a machine translation model that ensures the accuracy of noun translation is badly needed.
Disclosure of Invention
In view of the above analysis, the present invention aims to provide a translation model construction method based on noun translation prompts, used to solve the problems of inaccurate, missed and mistaken translations of nouns and proper nouns in existing machine translation models.
In one aspect, an embodiment of the invention provides a translation model construction method based on noun translation prompts, which comprises the following steps:
acquiring parallel corpus data of two languages to be translated to obtain a data set D;
identifying nouns in the original text and the translated text of each piece of data in the data set D to obtain an original-text noun set S_word and a translated-text noun set S_word-trans of each piece of data;
obtaining training samples X_input of all data in D and the corresponding adjustment matrices M_train through data construction, wherein X_input = [x_1, x_2, …, x_g], M_train = [M_1, M_2, …, M_g], a single training sample x_i consists of the text x_input augmented with noun translation prompts and the target translation x_gold, i ∈ [1, 2, …, g], and g is the number of data items;
importing the adjustment matrices M_train into the translation model to be trained, and training the model with the training samples X_input to obtain the finally trained translation model NPTrans.
Further, the data construction comprises the following steps:
carrying out data cleaning on the parallel corpus data to obtain a cleaned data set D;
splicing, after the original text of each piece of data in the data set D, the translations of all nouns in the noun translation set S_word-trans in order, to obtain the input texts X_input of the translation model;
constructing, for each text x_input augmented with noun translation prompts, a list List_index of corresponding position relationships; determining the values M_ij of the elements of the adjustment matrix according to List_index, and inserting special symbols in the start and end rows to obtain the adjustment matrix M_g corresponding to the single training sample x_i, thereby obtaining the adjustment matrices M_train corresponding to all the data.
Further, constructing the list List_index of corresponding position relationships for each text x_input augmented with noun translation prompts comprises the following steps:
representing, for each input text x_input of the translation model, the position of each pair of noun and noun translation in x_input by a pair of tuples;
each noun-translation position tuple pair forming a sub-list;
connecting all the noun-translation position tuple pair sub-lists to form the list List_index of position relationships of the text.
Further, the values and constraints of the adjustment matrix elements M_ij are as follows:
M_ij = 0, if (i ≤ len(x_i0) + 1 or i = L) and (j ≤ len(x_i0) + 1 or j = L);
M_ij = 0, if i falls within the tuple List_index[z][0] and j within the tuple List_index[z][1], or vice versa, for some z ∈ [1, len(List_index)];
M_ij = -∞, otherwise;
wherein len(x_i0) represents the length of the original text x_i0 of a single training sample after cleaning and before adding the translation prompts, L = len(x_input) + 2, len(List_index) represents the length of List_index (i.e. the number of sub-lists), List_index[z][0] represents the first tuple in the z-th sub-list of List_index, and List_index[z][1] represents the second tuple in the z-th sub-list.
Further, the attention of the model after importing the adjustment matrices M_train is calculated using the following function:
Attention(Q_i, K_i, V_i) = softmax((Q_i · K_i^T) / √d_ki + M_i) · V_i
wherein Q_i, K_i, V_i are the Query, Key and Value matrices when calculating the attention of x_i, and d_ki is the dimension of Q_i or K_i;
calculating the loss between the prediction result x_pred and the target result x_gold using the following function:
Loss = CrossEntropy(x_pred, x_gold)
minimizing the Loss and updating the model weights, and training until the Loss no longer decreases;
calculating the accuracy of the model translation using the following function:
BLEU = BP · exp(Σ_{n=1}^{N} w_n · log p_n)
wherein p_n is the proportion of correctly predicted n-grams in the prediction result x_pred, w_n is the n-gram weight, and BP is the brevity penalty factor.
Further, importing the adjustment matrices M_train into the translation model comprises:
expanding the adjustment matrix M rightwards and downwards by adding 0-value elements, according to the model's preset maximum input length, to a size of L_max × L_max, obtaining M_train';
importing M_train' into the coding layer of the translation model.
Further, identifying the noun set S_word contained in the data set D comprises:
marking the nouns in the original text of each piece of data in the data set D using a tagging tool with built-in part-of-speech tagging; or
performing noun recognition on the original texts in the data set D using a noun recognition model trained as required.
Further, obtaining the noun translations corresponding to all nouns in the noun set S_word comprises:
obtaining the translation w_trans of a noun w to be matched in the original-text noun set S_word by querying the dictionary dict_noun;
searching the translations of the parallel corpus in the data set D for a word matching the translation w_trans; if present, adding w_trans to the translated-text noun set S_word-trans, and if not, deleting w from the original-text noun set S_word.
Further, obtaining the translation w_trans of an original-text noun w in the data set D comprises:
1) acquiring the noun w to be matched in the original-text noun set S_word;
2) directly querying the dictionary dict_noun with w as the key name; if a corresponding value exists, directly taking that value as the translation, and if not, proceeding to the next step;
3) calculating the similarity scores between the noun w to be matched and all keys of dict_noun, keys = {key_1, key_2, …, key_x}, to obtain a score set S = {s_1, s_2, …, s_x}, where x is the length of dict_noun;
4) finding the position of the element with the maximum value in the score set S, and, if the number of maximum elements in S is greater than 1, randomly taking one of them as the maximum element;
5) finding the corresponding key-value pair key_max and value_max in dict_noun according to the position of the maximum element, and using value_max as the translation.
Further, the translation model to be trained is constructed based on the transformers framework and comprises an encoder and a decoder, each of which comprises multiple layers of identical self-attention residual structures.
Compared with the prior art, the invention has at least one of the following beneficial effects:
1. The translation model is trained on a pre-constructed training set containing noun translation prompts together with the adjustment matrices, so that the model learns the intrinsic relation between nouns and their translations, improving the prompt-based noun-translation accuracy of the translation model.
2. The attention calculation of the model is adjusted by the constructed adjustment matrix, so that the model no longer calculates attention between a noun translation and the other original-text characters, and only calculates attention between a noun translation and its noun in the original text, improving the accuracy of the model.
3. By constructing input data containing noun translation prompts and the adjustment matrix, the translation model can translate nouns accurately, solving the problems of inaccurate, missed and mistaken noun and proper-noun translation in existing translation models.
In the invention, the technical schemes can be mutually combined to realize more preferable combination schemes. Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views.
FIG. 1 is a flow chart of a translation model construction method based on noun translation hints according to an embodiment of the invention;
FIG. 2 is a schematic block diagram of a method for constructing a translation model based on noun translation hints according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a method for constructing an adjustment matrix according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof, and together with the description serve to explain the principles of the invention, and are not intended to limit the scope of the invention.
In one embodiment of the present invention, as shown in fig. 1, a method for constructing a translation model based on noun translation hints is disclosed, comprising:
step S110, acquiring parallel corpus data of two languages to be translated, and obtaining a data set D; the two languages to be translated refer to two languages before and after translation; parallel corpus data refers to a corpus set formed by original text sentences and translation sentences corresponding to the original text sentences.
Step S120, identifying nouns in the original text and the translated text of each piece of data in the data set D to obtain the original-text noun set S_word and the translated-text noun set S_word-trans. Specifically, the noun set S_word can be obtained by feeding the texts in the data set D into the noun recognition module. The noun translations corresponding to all nouns in S_word can be looked up through the noun inter-translation dictionary built into the noun query module; the translations of the parallel corpus in the data set D are then searched for words matching these noun translations, and if a match exists, the matched word is added to the translated-text noun set S_word-trans; if not, the noun is deleted from the original-text noun set S_word. Preferably, among the nouns for which no matching translation is found, important original-text nouns can be screened out by manual identification, and the translated nouns in the corresponding parallel corpus can be added to the dictionary as new entries.
Step S130, obtaining training samples X_input of all data in D and the corresponding adjustment matrices M_train through data construction, wherein X_input = [x_1, x_2, …, x_g], M_train = [M_1, M_2, …, M_g], a single training sample x_i consists of the text x_input augmented with noun translation prompts and the target translation x_gold, i ∈ [1, 2, …, g], and g is the number of data items; the noun translation prompts are all the translations in the noun translation set S_word-trans.
Step S140, importing the adjustment matrices M_train into the translation model to be trained, and training the model with the training samples X_input to obtain the finally trained translation model NPTrans. Specifically, after adjustment by the adjustment matrices M_train, the translation model, when calculating attention, no longer calculates attention between a noun translation and the other original-text characters, and only calculates attention between a noun translation and its noun in the original text.
According to the embodiment of the invention, the translation model is trained on a pre-constructed training set containing noun translation prompts together with the adjustment matrices, so that the model learns the intrinsic relation between nouns and their translations, improving the prompt-based noun-translation accuracy of the model; the attention calculation of the model is adjusted by the constructed adjustment matrix, so that the model no longer calculates attention between noun translations and the other original-text characters and only calculates attention between noun translations and the nouns in the original text, improving the accuracy of the model; and by constructing input data containing noun translation prompts together with the adjustment matrix, the translation model can translate nouns accurately, solving the problems of inaccurate, missed and mistaken noun and proper-noun translation in existing translation models.
In a specific embodiment, the noun recognition module in step S120 is a tagging tool with built-in part-of-speech tagging or a noun recognition model trained as required. Optionally, the part-of-speech tagging tool is the jieba word-segmentation toolkit.
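As an illustrative sketch (not part of the patent text), extracting the original-text noun set S_word with the jieba toolkit could look as follows; in jieba's tag set, part-of-speech flags beginning with "n" mark nouns:

```python
# Sketch: noun recognition with jieba's part-of-speech tagger.
# Flags starting with "n" (n, nr, ns, nt, nz, ...) denote nouns.
import jieba.posseg as pseg

def extract_nouns(text):
    """Return the nouns of a sentence, in order of appearance."""
    return [p.word for p in pseg.cut(text) if p.flag.startswith("n")]

# extract_nouns("中国和美国贸易往来。") would return ["中国", "美国"]
```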
In a specific embodiment, the noun query module in step S120 comprises a noun dictionary and a query program;
wherein the noun dictionary is a dictionary (dict) containing all nouns required by the user; the keys of the dictionary are the nouns expressed in the language to be translated, and the values are the corresponding nouns expressed in the target language;
illustratively, for Chinese-to-English translation, the data structure of the dictionary is:
dict_noun = { "中国": "China", "美国": "America", … }
Optionally, the user may construct the noun dictionary from existing resources, by building it from scratch, etc.
The query program uses non-exact matching; preferably, matching is performed with a text-similarity algorithm;
further, the text-similarity matching steps are as follows (see the sketch after this list):
1. Directly query the dictionary dict_noun with any noun w to be matched as the key name; if a corresponding value exists, directly take that value as the translation w_trans; if not, proceed to the next step.
2. Calculate the similarity scores between the noun w to be matched and all keys of dict_noun, keys = {key_1, key_2, …, key_x}, obtaining a score set S = {s_1, s_2, …, s_x}, where x is the length of dict_noun. The similarity score is computed from the following quantities: len(w), the length of the word w; len(key_i), the length of the i-th key; exp(·), the exponential function; count_same, the number of overlapping grams between w and key_i under the n-gram; and count_n-gram, the number of grams of w under the n-gram, with n taking 1 to 3.
3. Find the position of the element with the maximum value in the score set S; if the number of maximum elements in S is greater than 1, randomly take one of them as the maximum element.
4. Find the corresponding key-value pair key_max and value_max in dict_noun according to the position of the maximum element, and use value_max as the translation w_trans.
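The similarity formula itself appears only as an image in the source publication, so the scoring function in the sketch below is an assumption built from the quantities named above (character n-gram overlap for n = 1 to 3 and an exp(·) length term); the lookup flow, exact key query followed by scoring and a random tie-break, follows steps 1 to 4:

```python
# Sketch of the noun-query module. The toy dict_noun and the
# similarity() scoring are illustrative; the patent's exact
# similarity formula is not reproduced here.
import math
import random

dict_noun = {"中国": "China", "美国": "America"}  # toy noun dictionary

def ngrams(s, n):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def similarity(w, key):
    """ASSUMED score: n-gram overlap (n = 1..3) with a length penalty."""
    score = 0.0
    for n in range(1, 4):
        grams_w = ngrams(w, n)
        if not grams_w:
            continue
        grams_key = set(ngrams(key, n))
        count_same = sum(1 for g in grams_w if g in grams_key)  # overlapping grams
        score += count_same / len(grams_w)                      # count_same / count_n-gram
    return score * math.exp(-abs(len(w) - len(key)))            # exp(.) length term

def query_translation(w):
    if w in dict_noun:                                 # step 1: exact key lookup
        return dict_noun[w]
    keys = list(dict_noun)
    scores = [similarity(w, k) for k in keys]          # step 2: score set S
    best = max(scores)
    ties = [k for k, s in zip(keys, scores) if s == best]
    return dict_noun[random.choice(ties)]              # steps 3-4: random tie-break
```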
In a specific embodiment, the data construction in step S130 includes data cleaning, training sample construction and adjustment matrix construction;
the data cleaning cleans the parallel corpus data to obtain the cleaned data set D, including removing whitespace and redundant invalid characters, and unifying traditional and simplified characters (e.g. for Chinese);
the training sample construction includes: splicing, after the original text of each piece of data in the data set D, the corresponding noun translation set S_word-trans in order, with spaces separating the translations of the nouns, to obtain a single training sample x_i of the translation model, thereby obtaining the training sample set X_input;
the adjustment matrix construction includes: constructing, for each text x_input augmented with noun translation prompts, the list List_index of corresponding position relationships, and constructing, according to List_index, the adjustment matrix M_g corresponding to the single training sample x_i, thereby obtaining the adjustment matrices M_train corresponding to all the data.
Wherein constructing the list List_index of corresponding position relationships for each text x_input augmented with noun translation prompts comprises the following steps:
in each text x_input augmented with noun translation prompts, the position of each noun or noun translation in x_input is represented by a tuple; each noun-translation position tuple pair forms a sub-list; and all noun-translation position tuple pairs are connected to form one large list, namely List_index. By way of example, Table 1 illustrates the construction of an input text and its position-relationship list.
Table 1 Example of constructing an input text and its position-relationship list
Original text: 中国和美国贸易往来。 (trade between China and the United States)
S_word: {中国, 美国}
S_word-trans: {China, America}
X_input: 中国和美国贸易往来。China America
List_index: [[(1,2),(11,11)],[(4,5),(12,12)]]
It should be noted that the above List_index is only an example; in practice it must also be adjusted according to the word-segmentation result of X_input. Here, by default, Chinese text is segmented into single characters, English text is segmented into words, and the spaces between English words are not counted in the segmentation.
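A minimal sketch of this construction for the Table 1 example, assuming per-character segmentation of the Chinese original, one token per appended English translation, and first-occurrence noun positions (the helper name is illustrative):

```python
# Sketch: build X_input by appending the noun translations after the
# original text, and record List_index with 1-based positions.
def build_input(original, noun_pairs):
    """noun_pairs: list of (noun, translation) tuples."""
    x_input = original + " ".join(t for _, t in noun_pairs)
    list_index = []
    pos = len(original)          # characters 1..len(original) are the original text
    for noun, _trans in noun_pairs:
        start = original.index(noun) + 1     # 1-based position of first occurrence
        pos += 1                             # each appended translation is one token
        list_index.append([(start, start + len(noun) - 1), (pos, pos)])
    return x_input, list_index

x_input, list_index = build_input("中国和美国贸易往来。",
                                  [("中国", "China"), ("美国", "America")])
# list_index == [[(1, 2), (11, 11)], [(4, 5), (12, 12)]], as in Table 1
```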
Constructing the adjustment matrix M_g corresponding to a single training sample x_i, as shown in FIG. 3, includes: determining the values of the elements of the adjustment matrix M_g corresponding to the single training sample x_i according to List_index; special symbols are inserted in the start and end rows of the matrix, so that L = len(x_input) + 2; optionally, the special symbols inserted in the start and end rows of the matrix are [CLS] and [SEP] respectively;
Further, the value and constraints of the element M_i,j in the i-th row and j-th column of M_g are as follows:
(1) M_i,j = 0 when i and j each satisfy any one of the following conditions:
condition 1: less than or equal to len(x_i0) + 1;
condition 2: equal to L;
(2) M_i,j = 0 when i and j respectively fall within the two tuples List_one[0] and List_one[1] of some sub-list List_one in List_index;
(3) in the remaining cases, M_i,j is negative infinity (-∞);
optionally, -1e4 or -1e9 is used instead of negative infinity (-∞);
optionally, when -1e4 is used instead of negative infinity (-∞), the value of M_i,j is expressed as follows:
M_i,j = 0, if (i ≤ len(x_i0) + 1 or i = L) and (j ≤ len(x_i0) + 1 or j = L);
M_i,j = 0, if i falls within List_index[z][0] and j within List_index[z][1], or vice versa, for some z ∈ [1, len(List_index)];
M_i,j = -1e4, otherwise;
wherein len(x_i0) represents the length of the original text x_i0 after cleaning and before adding the prompts, len(List_index) represents the length of List_index (i.e. the number of sub-lists), List_index[z][0] represents the first tuple in the z-th sub-list of List_index, and List_index[z][1] represents the second tuple in the z-th sub-list.
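A sketch of the per-sample matrix construction under the rules above, using -1e4 in place of negative infinity; the numpy representation and the function name are illustrative:

```python
# Sketch: build the adjustment matrix M_g for one sample from
# List_index. With [CLS] in row 0, the 1-based position p of a token
# in x_input maps to 0-based row/column p; [SEP] occupies the last row.
import numpy as np

MASK = -1e4  # stands in for minus infinity

def build_adjust_matrix(len_x0, len_x_input, list_index):
    L = len_x_input + 2                      # L = len(x_input) + 2
    M = np.full((L, L), MASK, dtype=np.float32)

    def allowed(i):                          # 1-based matrix index
        return i <= len_x0 + 1 or i == L     # [CLS] + original text, or [SEP]

    for i in range(1, L + 1):                # condition (1)
        for j in range(1, L + 1):
            if allowed(i) and allowed(j):
                M[i - 1, j - 1] = 0.0
    for (ns, ne), (ts, te) in list_index:    # condition (2): noun <-> translation
        for i in range(ns, ne + 1):
            for j in range(ts, te + 1):
                M[i, j] = M[j, i] = 0.0      # position p sits at 0-based index p
    return M

# build_adjust_matrix(10, 12, list_index) reproduces the FIG. 3 pattern
```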
In a specific embodiment, the translation model to be trained employs a neural machine translation model comprising an encoder and a decoder. As shown in FIG. 2, the above step S140 can be further refined into the following steps:
step S210: the adjustment matrix M train Importing the coding layer of the translation model to be trained, and adjusting the calculation of parameters in the model;
specifically, the adjustment matrix M train Importing the training to be trainedAn encoding layer of a translation model, comprising: the maximum input length L preset according to the translation model to be trained max Expanding the adjustment matrix M to the right and downwards to a size L by adding 0 value elements max ×L max Obtaining M train 'A'; will M train ' import the coding layer;
specifically, the neural machine translation model is constructed by using a transformers framework, and comprises an encoder and a decoder, wherein the encoder and the decoder both comprise multiple layers of identical self-Attention residual structures, and an adjustment matrix is added to calculate self-Attention (Attention), and the calculation formula is as follows:
wherein Q, K, V is a matrix of Query, key, value, d in self-attention mechanism k Is the dimension of Q or K (both identical).
Preferably, the encoder and the decoder each comprise 12 layers of identical self-attention residual structures;
preferably, the dimension of the Query or Key in the self-attention mechanism is d_k = 64.
Illustratively, as in FIG. 3, after the adjustment matrix M_train' is added, the neural machine translation model no longer calculates the attention between noun translations and the other original-text characters (the grey part), and only calculates the attention between each noun translation and its noun in the original text (the four entries a_1,11, a_2,11, a_4,12 and a_5,12 in the figure); the remaining white part is the attention within the original text x_0.
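A sketch of the adjusted self-attention, i.e. plain scaled dot-product attention with the adjustment matrix added to the logits before the softmax, so that masked noun-translation/original-character pairs receive near-zero attention weight:

```python
# Sketch: scaled dot-product attention with the adjustment matrix M
# added to the attention logits before the softmax.
import torch
import torch.nn.functional as F

def adjusted_attention(Q, K, V, M):
    # Q, K, V: (batch, seq, d_k); M: (seq, seq) filled with 0 / -1e4
    d_k = Q.size(-1)
    logits = Q @ K.transpose(-2, -1) / d_k ** 0.5 + M
    return F.softmax(logits, dim=-1) @ V
```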
Step S220: training the translation model into which the adjustment matrices M_train have been imported with the training samples X_input, to obtain the finally trained translation model NPTrans.
Specifically, X_input is divided into a training set D_train, a validation set D_valid and a test set D_test; the adjustment matrices M_train are imported into the translation model, the model is trained with D_train and validated with D_valid after each round of training, and the round of the model with the best validation result is taken as the final model NPTrans. Preferably, the ratio of the training set D_train, the validation set D_valid and the test set D_test is 8:1:1.
Further, during training, for each text x_i, the attention of that text at the encoder is calculated using the following formula:
Attention(Q_i, K_i, V_i) = softmax((Q_i · K_i^T) / √d_k + M_i') · V_i
wherein Q_i, K_i, V_i are the Query, Key and Value matrices when calculating the attention of x_i, d_k is the dimension of Q_i or K_i (the two are identical), generally taken as d_k = 64, and M_i' is the expanded adjustment matrix of x_i.
The loss function between the prediction result x_pred and the target result x_gold is expressed as:
Loss = CrossEntropy(x_pred, x_gold)
The Loss is minimized and the model weights are updated; training proceeds until the Loss no longer decreases.
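One training step under this loss, sketched with PyTorch; the model's call signature and the batch fields are placeholders rather than the patent's API:

```python
# Sketch of one training step: cross-entropy between the predicted
# token distribution and the target translation, then a weight update.
import torch.nn.functional as F

def train_step(model, optimizer, batch):
    logits = model(batch["input_ids"], adjust_matrix=batch["M"])       # (B, T, vocab)
    loss = F.cross_entropy(logits.transpose(1, 2), batch["gold_ids"])  # (B, V, T) vs (B, T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```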
During validation, the accuracy of the model translation is calculated using the BLEU score:
BLEU = BP · exp(Σ_{n=1}^{N} w_n · log p_n)
wherein p_n is the proportion of correctly predicted n-grams in the prediction result x_pred, i.e.
p_n = (number of n-grams of x_pred that also appear in x_gold) / (total number of n-grams of x_pred)
and w_n is the weight of the n-gram term; BP is the brevity penalty factor, which penalizes the case where the length of the prediction result x_pred is smaller than the length of x_gold:
BP = 1, if len(x_pred) ≥ len(x_gold); otherwise BP = exp(1 - len(x_gold) / len(x_pred))
After the round of the model with the best validation result has been taken as the final model NPTrans, testing can be performed with D_test.
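A compact sketch of the BLEU computation matching the formulas above, with uniform weights w_n = 1/N; in practice an existing implementation such as sacrebleu would more likely be used:

```python
# Sketch: sentence-level BLEU with uniform n-gram weights and the
# brevity penalty defined above. pred and gold are token lists.
import math
from collections import Counter

def bleu(pred, gold, max_n=4):
    log_p = 0.0
    for n in range(1, max_n + 1):
        pred_ngrams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
        gold_ngrams = Counter(tuple(gold[i:i + n]) for i in range(len(gold) - n + 1))
        overlap = sum((pred_ngrams & gold_ngrams).values())    # clipped matches
        total = max(sum(pred_ngrams.values()), 1)
        log_p += math.log(max(overlap / total, 1e-9)) / max_n  # w_n = 1/N
    if len(pred) >= len(gold):
        bp = 1.0                                               # no length penalty
    else:
        bp = math.exp(1 - len(gold) / max(len(pred), 1))       # brevity penalty
    return bp * math.exp(log_p)
```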
Further, during actual translation, the noun set and the noun translation set of the text to be translated are constructed through the noun recognition module and the noun query module, and the input text of the translation model and its adjustment matrix M are further constructed through the data construction; the input text is translated with the trained final model NPTrans, the attention calculation of the model is adjusted with the adjustment matrix, and the translation is finally output.
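Putting the modules together, the translation-time flow could be sketched as follows, reusing the helper sketches above; model.generate stands in for the actual decoding call and is a placeholder:

```python
# Sketch of inference: noun recognition, dictionary lookup, prompted
# input and adjustment-matrix construction, then decoding with NPTrans.
# Assumes per-character segmentation of the source, as in the sketches above.
def translate(text, model):
    nouns = extract_nouns(text)                          # noun recognition module
    pairs = [(w, query_translation(w)) for w in nouns]   # noun query module
    x_input, list_index = build_input(text, pairs)
    n_tokens = len(text) + len(pairs)                    # characters + appended translations
    M = pad_adjust_matrix(build_adjust_matrix(len(text), n_tokens, list_index))
    return model.generate(x_input, adjust_matrix=M)      # placeholder decoding call
```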
Compared with the prior art, the embodiment of the invention trains the translation model on a pre-constructed training set containing noun translation prompts together with the adjustment matrices, so that the model learns the intrinsic relation between nouns and their translations, improving the prompt-based noun-translation accuracy of the model; the attention calculation of the model is adjusted by the constructed adjustment matrix, so that the model no longer calculates attention between noun translations and the other original-text characters and only calculates attention between noun translations and the nouns in the original text, improving the accuracy of the model; and by constructing input data containing noun translation prompts together with the adjustment matrix, the translation model can translate nouns accurately, solving the problems of inaccurate, missed and mistaken noun and proper-noun translation in existing translation models.
Those skilled in the art will appreciate that all or part of the flow of the methods of the above embodiments may be accomplished by a computer program instructing associated hardware, where the program may be stored on a computer-readable storage medium such as a magnetic disk, an optical disk, a read-only memory or a random access memory.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention.

Claims (9)

1. A translation model construction method based on noun translation prompts, characterized by comprising the following steps:
acquiring parallel corpus data of two languages to be translated to obtain a data set D;
identifying nouns in the original text and the translated text of each piece of data in the data set D to obtain an original-text noun set S_word and a translated-text noun set S_word-trans of each piece of data;
obtaining training samples X_input of all data in D and the corresponding adjustment matrices M_train through data construction, wherein X_input = [x_1, x_2, …, x_g], M_train = [M_1, M_2, …, M_g], a single training sample x_i consists of the text x_input augmented with noun translation prompts and the target translation x_gold, i ∈ [1, 2, …, g], and g is the number of data items;
importing the adjustment matrices M_train into the translation model to be trained, and training the model with the training samples X_input to obtain the finally trained translation model NPTrans, which comprises:
calculating the attention of the model after importing the adjustment matrices M_train using the following function:
Attention(Q_i, K_i, V_i) = softmax((Q_i · K_i^T) / √d_ki + M_i) · V_i
wherein Q_i, K_i, V_i are the Query, Key and Value matrices when calculating the attention of x_i, and d_ki is the dimension of Q_i or K_i;
calculating the loss between the prediction result x_pred and the target result x_gold using the following function:
Loss = CrossEntropy(x_pred, x_gold)
minimizing the Loss and updating the model weights, and training until the Loss no longer decreases;
calculating the accuracy of the model translation using the following function:
BLEU = BP · exp(Σ_{n=1}^{N} w_n · log p_n)
wherein p_n is the proportion of correctly predicted n-grams in the prediction result x_pred, w_n is the n-gram weight, and BP is the brevity penalty factor.
2. The translation model construction method according to claim 1, wherein the data construction comprises the steps of:
carrying out data cleaning on the parallel corpus data to obtain a cleaned data set D;
splicing, after the original text of each piece of data in the data set D, the translations of all nouns in the noun translation set S_word-trans in order, to obtain the input texts X_input of the translation model;
constructing, for each text x_input augmented with noun translation prompts, a list List_index of corresponding position relationships; determining the values M_ij of the elements of the adjustment matrix according to List_index, and inserting special symbols in the start and end rows to obtain the adjustment matrix M_g corresponding to the single training sample x_i, thereby obtaining the adjustment matrices M_train corresponding to all the data.
3. The translation model construction method according to claim 2, wherein constructing the list List_index of corresponding position relationships for each text x_input augmented with noun translation prompts comprises the following steps:
representing, for each input text x_input of the translation model, the position of each pair of noun and noun translation in x_input by a pair of tuples;
each noun-translation position tuple pair forming a sub-list;
connecting all the noun-translation position tuple pair sub-lists to form the list List_index of position relationships of the text.
4. The translation model construction method according to claim 2 or 3, wherein the values and constraints of the adjustment matrix elements M_ij are as follows:
M_ij = 0, if (i ≤ len(x_i0) + 1 or i = L) and (j ≤ len(x_i0) + 1 or j = L);
M_ij = 0, if i falls within the tuple List_index[z][0] and j within the tuple List_index[z][1], or vice versa, for some z ∈ [1, len(List_index)];
M_ij = -∞, otherwise;
wherein len(x_i0) represents the length of the original text x_i0 of a single training sample after cleaning and before adding the translation prompts, L = len(x_input) + 2, len(List_index) represents the length of List_index (i.e. the number of sub-lists), List_index[z][0] represents the first tuple in the z-th sub-list of List_index, and List_index[z][1] represents the second tuple in the z-th sub-list.
5. The translation model construction method according to claim 1, wherein importing the adjustment matrices M_train into the translation model comprises:
expanding the adjustment matrix M rightwards and downwards by adding 0-value elements, according to the model's preset maximum input length, to a size of L_max × L_max, obtaining M_train';
importing M_train' into the coding layer of the translation model.
6. The translation model construction method according to claim 1, wherein identifying the noun set S_word contained in the data set D comprises:
marking the nouns in the original text of each piece of data in the data set D using a tagging tool with built-in part-of-speech tagging; or
performing noun recognition on the original texts in the data set D using a noun recognition model trained as required.
7. The translation model construction method according to claim 1 or 6, wherein obtaining the noun translations corresponding to all nouns in the noun set S_word comprises:
obtaining the translation w_trans of a noun w to be matched in the original-text noun set S_word by querying the dictionary dict_noun;
searching the translations of the parallel corpus in the data set D for a word matching the translation w_trans; if present, adding w_trans to the translated-text noun set S_word-trans, and if not, deleting w from the original-text noun set S_word.
8. The translation model construction method according to claim 7, wherein obtaining the translation w_trans of an original-text noun w in the data set D comprises:
1) acquiring the noun w to be matched in the original-text noun set S_word;
2) directly querying the dictionary dict_noun with w as the key name; if a corresponding value exists, directly taking that value as the translation, and if not, proceeding to the next step;
3) calculating the similarity scores between the noun w to be matched and all keys of dict_noun, keys = {key_1, key_2, …, key_x}, to obtain a score set S = {s_1, s_2, …, s_x}, where x is the length of dict_noun;
4) finding the position of the element with the maximum value in the score set S, and, if the number of maximum elements in S is greater than 1, randomly taking one of them as the maximum element;
5) finding the corresponding key-value pair key_max and value_max in dict_noun according to the position of the maximum element, and using value_max as the translation.
9. The translation model construction method according to claim 1, wherein the translation model to be trained is constructed based on the transformers framework and comprises an encoder and a decoder, each of which comprises multiple layers of identical self-attention residual structures.
CN202211348033.6A (priority date 2022-10-31, filing date 2022-10-31): Translation model construction method based on noun translation prompt. Status: Active. Granted publication: CN115688904B (en).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202211348033.6A | 2022-10-31 | 2022-10-31 | Translation model construction method based on noun translation prompt (granted as CN115688904B)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202211348033.6A | 2022-10-31 | 2022-10-31 | Translation model construction method based on noun translation prompt (granted as CN115688904B)

Publications (2)

Publication Number | Publication Date
CN115688904A (en) | 2023-02-03
CN115688904B (en) | 2023-07-18

Family

Family ID: 85047111

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202211348033.6A (Active) | Translation model construction method based on noun translation prompt | 2022-10-31 | 2022-10-31

Country Status (1)

Country Link
CN (1) CN115688904B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication Number | Priority Date | Publication Date | Assignee | Title
KR102449842B1 * | 2017-11-30 | 2022-09-30 | 삼성전자주식회사 | Method for training language model and apparatus therefor
CN114091481A * | 2020-08-24 | 2022-02-25 | 四川医枢科技股份有限公司 | Medical machine translation method based on sentence translation keywords
CN114925708A * | 2022-05-24 | 2022-08-19 | 昆明理工大学 | Thai-Chinese neural machine translation method fusing unsupervised dependency syntax
CN115017923A * | 2022-05-30 | 2022-09-06 | 华东师范大学 | Professional term vocabulary alignment replacement method based on Transformer translation model

Also Published As

Publication Number | Publication Date
CN115688904A (en) | 2023-02-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant