CN115017923A

CN115017923A - Professional term vocabulary alignment replacement method based on Transformer translation model

Info

Publication number: CN115017923A
Application number: CN202210598164.3A
Authority: CN
Inventors: 王晓玲; 郑焕然; 朱威
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2022-05-30
Filing date: 2022-05-30
Publication date: 2022-09-06

Abstract

The invention discloses a professional term word alignment and replacement method based on a Transformer translation model. The method comprises: constructing and training a Transformer-based translation model; inputting a source language text to be translated into the trained translation model and translating it to obtain an initial translation in the target language, while simultaneously obtaining a plurality of reference correlation matrices between source language words and target language words; aligning the source language words and the target language words according to the reference correlation matrices to obtain aligned word pairs; and searching whether any source language term in a preset professional term library exists in the source language text and, if so, finding the set of words in the initial translation that are aligned with the source language term and replacing them with the professional term translation, thereby obtaining a final translation. The invention uses the correlation matrices inside the translation model to align the source language text with the translation, and corrects the translation using the professional term library, thereby improving the accuracy of the translation.

Description

Professional term vocabulary alignment replacement method based on Transformer translation model

Technical Field

The invention belongs to the technical field of machine translation, and particularly relates to a professional term vocabulary alignment replacement method based on a Transformer translation model.

Background

Machine translation is the process of using a computer to convert text in a source language into a target language. For example, translation software is often used in daily life to translate English (the source language) into Chinese (the target language). Manual translation, while highly accurate, is time-consuming and labor-intensive. Machine translation is much faster, although its output is less accurate than that of a human translator. Its advantages therefore show when a large amount of text needs to be translated and the requirement on precision is not especially high, such as browsing-type tasks over massive data. Work that is impractical to translate manually may take machine translation only hours or even minutes to complete. Translation plays an important role for human beings. On one hand, differences in language, writing, culture, and geography make translation an important need; on the other hand, translation accelerates the integration of different civilizations and promotes the development of the world. Because the need for translation is so great, machine translation has long been one of the most actively researched topics.

Today, under some conditions, the results of machine translation have approached those of human translation. However, the results are still unsatisfactory when machine translation is applied to a specific field, such as the medical field. The medical field has a very large number of specialized terms, which are uncommon in existing public parallel corpora and are therefore difficult for models to learn. In some scenarios the requirement on the translation accuracy of these specialized terms is very high, so how to use a term library to modify and improve the translation produced by a translation model is an important problem in adapting the model to specific fields. Conventional attention-based alignment replacement algorithms tend to rely only on the cross-attention scores computed by the last layer of the model decoder, so the alignment results they produce are not necessarily accurate. Other alignment replacement algorithms add constraints to the model's decoding process, which sometimes affects the final translation result of the model.

In summary, existing methods for using a term library to correct and improve translations suffer from insufficiently accurate alignment results and may degrade the quality of the translation generated by the model, which hinders the adaptation of translation models to specific fields.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a professional term vocabulary alignment and replacement method based on a Transformer translation model. The method uses the correlation matrices in the translation model to accurately align the input source language text with the translation generated by the model, and uses a professional term library to correct the translation, thereby improving its accuracy.

In order to achieve the above object, the professional term vocabulary alignment replacement method based on a Transformer translation model according to the present invention comprises the following steps:

S1: constructing a Transformer-based translation model comprising M coding layers and M decoding layers, and training the translation model with parallel corpora of the source language and the target language collected in advance;

S2: inputting a source language text to be translated into the Transformer-based translation model trained in step S1, and translating to obtain an initial translation in the target language, where the number of words in the source language text is denoted $D_r$ and the number of target language words in the initial translation is denoted $D_t$; while obtaining the initial translation, also obtaining the correlation matrices of size $D_t \times D_r$ calculated by the cross-attention mechanism in the M decoding layers of the translation model, each element representing the relevance value between the source language word and the target language word at the corresponding positions; selecting N of the M decoding layers as reference decoding layers as required, and taking the corresponding correlation matrices as the reference correlation matrices $R_n$, $n = 1, 2, \ldots, N$;

S3: for each target language word in the initial translation, determining, in each of the N reference correlation matrices $R_n$, the source language word in the source language text with the maximum relevance value as a pending source language word $W_{d,n}$, with the corresponding relevance value denoted $C_{d,n}$, $d = 1, 2, \ldots, D_t$; the N pending source language words $W_{d,n}$ corresponding to each target language word form the pending source language word set $\phi_d$ of that target language word; denoting the number of pending source language words in the set $\phi_d$ as K, for the k-th pending source language word, counting its occurrence frequency $f_{k,n}$ and the mean $\bar{C}_{k,n}$ of its relevance values, and weighting the two to obtain the score of the pending source language word:

$$S_{k,n} = \alpha f_{k,n} + \beta \bar{C}_{k,n}$$

where $\alpha$ and $\beta$ are preset weights with $\alpha + \beta = 1$; finally, from the K pending source language words of each target language word, selecting the one with the largest score $S_{k,n}$ as the aligned source language word of the target language word, thereby obtaining aligned word pairs;

S4: querying, according to a preset professional term library of the source language and the target language, whether each source language term exists in the input source language text; if not, performing no operation; if so, using the alignment between source language words and target language words determined in step S3 to find the set of target language words in the initial translation aligned with the source language term, and replacing those target language words with the target language translation of the professional term corresponding to the source language term, thereby obtaining the final translation.

The professional term word alignment replacement method based on a Transformer translation model according to the invention comprises: constructing and training a Transformer-based translation model; inputting a source language text to be translated into the trained translation model and translating it to obtain an initial translation in the target language, while simultaneously obtaining a plurality of reference correlation matrices between source language words and target language words; aligning the source language words and the target language words according to the reference correlation matrices to obtain aligned word pairs; and searching whether any source language term in a preset professional term library exists in the source language text and, if so, finding the set of words in the initial translation that are aligned with the source language term and replacing them with the professional term translation to obtain the final translation.

The invention has the following beneficial effects:

1) Borrowing the idea of model ensembling, the invention performs word alignment by combining the correlation matrices of multiple layers, instead of using only the correlation matrix generated by the last decoder layer of the translation model, thereby enhancing the accuracy and robustness of word alignment;

2) The alignment replacement of professional terms is separated from the generation of the translation: certain words are replaced and corrected only after the translation model has generated the translation, so the decoding process of the translation model is not subject to external interference and the translation performance of the model is not affected;

3) The invention can also introduce a heuristic method to decide whether to replace a professional term, filtering out some incorrect alignment results and thereby further safeguarding the quality of the final translation.

In summary, the invention makes full use of the properties of the Transformer-based translation model to accurately align the input source language text with the translation generated by the model without affecting the performance of the translation model, and then uses the professional term library to correct the translation of professional terms, thereby improving the translation accuracy of the model for text in specific fields.

Drawings

FIG. 1 is a flowchart of an embodiment of the professional term vocabulary alignment replacement method based on a Transformer translation model according to the present invention;

FIG. 2 is a structural diagram of the Transformer-based translation model in the present embodiment.

Detailed Description

The following describes embodiments of the present invention with reference to the accompanying drawings so that those skilled in the art can better understand the invention. It is expressly noted that, in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the present invention.

Examples

FIG. 1 is a flowchart of an embodiment of the professional term vocabulary alignment replacement method based on a Transformer translation model according to the present invention. As shown in FIG. 1, the method comprises the following specific steps:

S101: constructing and training a translation model based on a Transformer:

The Transformer is a structure proposed in the article by Vaswani A, Shazeer N, Parmar N, et al. Accordingly, the invention constructs a Transformer-based translation model comprising M coding layers and M decoding layers, and trains the translation model with parallel corpora of the source language and the target language collected in advance.

FIG. 2 is a structural diagram of the Transformer-based translation model in the present embodiment. As shown in FIG. 2, the model in this embodiment uses 6 coding layers (the encoder) and 6 decoding layers (the decoder). Each encoder layer comprises two modules, a self-attention module and a feed-forward neural network module; the self-attention mechanism helps the current node attend to the global information of the input sentence and thus obtain contextual semantics. Each decoder layer also contains these two modules, with a cross-attention module between them. The cross-attention module helps the current node acquire the content in the source language sentence that it needs to attend to. The correlation matrix calculated by cross-attention therefore contains alignment information between the source language sentence and the translation words predicted by the model, and can be used for alignment.
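As a rough illustration of where such a correlation matrix comes from, the sketch below computes scaled dot-product cross-attention weights in plain Python. The function name `cross_attention_weights`, the toy vectors, and the single-head simplification are assumptions made for illustration only; the embodiment does not state how attention heads are handled.

```python
import math
from typing import List

Matrix = List[List[float]]

def cross_attention_weights(queries: Matrix, keys: Matrix) -> Matrix:
    """Return a D_t x D_r matrix of softmax(q . k / sqrt(d)) weights.

    queries: one vector per target-language position (decoder side).
    keys:    one vector per source-language position (encoder output).
    """
    d = len(keys[0])
    weights = []
    for q in queries:
        # Scaled dot product between this target position and every source position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Numerically stable softmax over the source positions.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical toy vectors: 2 target positions, 3 source positions, dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
R = cross_attention_weights(queries, keys)
for row in R:
    print([round(w, 3) for w in row])
```

Each row of the result sums to 1 and can be read as how strongly one target position attends to each source position, which is the sense in which the matrix carries alignment information.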

S102: translating to obtain a translated text and acquiring a reference correlation matrix:

The source language text to be translated is input into the Transformer-based translation model trained in step S101 and translated to obtain an initial translation in the target language. The number of words in the source language text is denoted $D_r$, and the number of target language words in the initial translation is denoted $D_t$. While obtaining the initial translation, the correlation matrices of size $D_t \times D_r$ calculated by the cross-attention mechanism in the M decoding layers of the translation model are also obtained, where each element represents the relevance value between the source language word and the target language word at the corresponding positions. N of the M decoding layers are selected as reference decoding layers as required, and the corresponding correlation matrices are taken as the reference correlation matrices $R_n$, $n = 1, 2, \ldots, N$.
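The selection at the end of this step can be sketched as follows, assuming the M per-layer correlation matrices have already been collected into a Python list. How they are extracted from a particular framework is outside this sketch, and the names `select_reference_matrices`, `layer_matrices`, and `reference_layers` are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]

def select_reference_matrices(layer_matrices: Sequence[Matrix],
                              reference_layers: Sequence[int]) -> List[Matrix]:
    """Pick the N reference correlation matrices R_1..R_N out of the M per-layer ones."""
    if not layer_matrices:
        raise ValueError("no correlation matrices were collected")
    d_t, d_r = len(layer_matrices[0]), len(layer_matrices[0][0])
    for matrix in layer_matrices:
        # Every layer must yield a D_t x D_r matrix for the same sentence pair.
        if len(matrix) != d_t or any(len(row) != d_r for row in matrix):
            raise ValueError("correlation matrices must all be D_t x D_r")
    return [layer_matrices[i] for i in reference_layers]

# Hypothetical placeholders: M = 6 layers, D_t = 2, D_r = 3; the values are dummies.
M = 6
layer_matrices = [[[float(layer)] * 3 for _ in range(2)] for layer in range(M)]
references = select_reference_matrices(layer_matrices, reference_layers=[3, 4, 5])
print(len(references))
```

Layer indices are zero-based here, so `[3, 4, 5]` stands for the last three of six decoding layers.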

It can be seen that the present invention uses the correlation matrices of multiple decoding layers, rather than relying solely on the correlation matrix of the last layer for alignment. Because the training data is often noisy, the computed correlation matrix occasionally deviates; an alignment obtained from the last layer's correlation matrix alone is therefore not robust and in some cases contains obvious errors. Using the correlation matrices of multiple reference decoding layers can offset part of this noise, thereby enhancing the robustness of the alignment algorithm.

Because the decoding layers of a translation model may perform different functions, their correlation matrices differ to some degree. The invention finds that simply fusing the correlation matrices of all decoding layers does not yield the most accurate alignment results. Therefore, to make the selected correlation matrices more accurate, the invention searches for the best combination of decoding-layer correlation matrices experimentally, as follows:

An external alignment tool (such as GIZA++) is used to obtain alignment results between the source language and the target language, which serve as ground-truth labels. The source language and target language texts are then input into the translation model, and the correlation matrices calculated by the cross-attention mechanism in the M decoding layers are obtained. Alignments are extracted from each correlation matrix separately and compared with the ground-truth labels to compute the Alignment Error Rate (AER) of each alignment result. The decoding layers whose correlation matrices yield an alignment error rate below a preset threshold are taken as the reference decoding layers.
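A minimal sketch of this layer selection is given below. The text does not define AER, so the sketch assumes the commonly used definition with a single set of gold links, AER = 1 - 2|A ∩ G| / (|A| + |G|); the function names, the toy links, and the threshold of 0.3 are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Link = Tuple[int, int]  # (target word index, source word index)

def alignment_error_rate(predicted: Set[Link], gold: Set[Link]) -> float:
    """AER with a single gold set: 1 - 2|A & G| / (|A| + |G|)."""
    if not predicted and not gold:
        return 0.0
    return 1.0 - 2.0 * len(predicted & gold) / (len(predicted) + len(gold))

def select_reference_layers(per_layer_links: Dict[int, Set[Link]],
                            gold: Set[Link], threshold: float) -> List[int]:
    """Keep the decoding layers whose AER is below the preset threshold."""
    return [layer for layer in sorted(per_layer_links)
            if alignment_error_rate(per_layer_links[layer], gold) < threshold]

# Hypothetical toy data: gold links from an external aligner and links from 3 layers.
gold = {(0, 0), (1, 1), (2, 2), (3, 4)}
per_layer_links = {
    1: {(0, 3), (1, 0), (2, 2), (3, 1)},   # 1 of 4 links correct
    2: {(0, 0), (1, 1), (2, 1), (3, 4)},   # 3 of 4 links correct
    3: {(0, 0), (1, 1), (2, 2), (3, 4)},   # all links correct
}
print(select_reference_layers(per_layer_links, gold, threshold=0.3))
```

In practice the error rate would be accumulated over a whole held-out corpus rather than a single sentence pair; the single pair here only keeps the example short.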

Taking the translation model shown in FIG. 2 as an example, experiments show that the alignment error rates produced by the last three of the 6 decoding layers are below the preset threshold. The first three layers are shallow layers of the decoder and may serve different purposes from the last three, so their alignment results differ from those of the last three layers and their alignment error rates are higher. Therefore, this embodiment uses only the correlation matrices of the last three layers as the reference for alignment.

S103: word alignment based on a reference relevance matrix:

Owing to the principle of the attention mechanism, the correlation matrix produced by the trained translation model naturally reflects how much the representation of each word in the source language text contributes to generating each word in the translation. The larger the contribution, the larger the corresponding value in the correlation matrix, and the higher the probability that the source language word and the target language word are aligned. The alignment result is therefore determined from the N reference correlation matrices obtained in step S102, as follows:

For each target language word in the initial translation, the source language word with the maximum relevance value in each of the N reference correlation matrices $R_n$ is determined as a pending source language word $W_{d,n}$, and the corresponding relevance value is denoted $C_{d,n}$, $d = 1, 2, \ldots, D_t$. The N pending source language words $W_{d,n}$ corresponding to each target language word form the pending source language word set $\phi_d$ of that target language word. The number of pending source language words in the set $\phi_d$ is denoted K. For the k-th pending source language word, its occurrence frequency $f_{k,n}$ (i.e., the ratio of its number of occurrences among the N pending source language words $W_{d,n}$ to N) and the mean $\bar{C}_{k,n}$ of its relevance values are computed and weighted to obtain the score of the pending source language word:

$$S_{k,n} = \alpha f_{k,n} + \beta \bar{C}_{k,n}$$

where $\alpha$ and $\beta$ are preset weights with $\alpha + \beta = 1$. Finally, from the K pending source language words of each target language word, the one with the largest score $S_{k,n}$ is selected as the aligned source language word of the target language word, yielding an aligned word pair.
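The scoring procedure for a single target language word can be sketched as below, assuming the score is the weighted sum of occurrence frequency and mean relevance value described above. The function name `align_target_word` and the toy matrices are hypothetical, and ties are broken by first occurrence, which the text does not specify.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]

def align_target_word(d: int, references: Sequence[Matrix],
                      alpha: float = 0.5, beta: float = 0.5) -> Tuple[int, Dict[int, float]]:
    """Return the aligned source index for target word d and the candidate scores."""
    n = len(references)
    values: Dict[int, List[float]] = defaultdict(list)
    for matrix in references:
        row = matrix[d]
        # Pending source word W_{d,n}: the column with the largest relevance value C_{d,n}.
        best = max(range(len(row)), key=row.__getitem__)
        values[best].append(row[best])
    scores: Dict[int, float] = {}
    for source_index, cs in values.items():
        frequency = len(cs) / n              # share of the N matrices that picked this word
        mean_relevance = sum(cs) / len(cs)   # mean of its relevance values
        scores[source_index] = alpha * frequency + beta * mean_relevance
    aligned = max(scores, key=scores.get)
    return aligned, scores

# Hypothetical 1 x 3 reference matrices (one target word, three source words).
references = [[[0.6, 0.3, 0.1]], [[0.2, 0.5, 0.3]], [[0.7, 0.2, 0.1]]]
aligned, scores = align_target_word(0, references)
print(aligned, {k: round(v, 3) for k, v in scores.items()})
```

Combining frequency with mean relevance means that a source word picked by most layers can win even when a single layer assigns a slightly higher value to another word.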

In this embodiment, the source language is English, the target language is Chinese, the source language text is "COVID-19 is an infectious disease", and the obtained initial translation is "pneumonia is an infectious disease". As described above, the correlation matrices obtained from the last 3 decoding layers are used as the reference correlation matrices in this embodiment. Table 1 is the 1st reference correlation matrix in this embodiment.

Target word	COVID-19	is	an	infectious	disease
pneumonia	0.63	0.12	0.08	0.04	0.13
is	0.21	0.45	0.12	0.06	0.23
a kind of	0.11	0.18	0.28	0.22	0.21
infectious disease	0.02	0.22	0.07	0.33	0.36

TABLE 1

Table 2 is the 2 nd reference correlation matrix in this embodiment.

Target word	COVID-19	is	an	infectious	disease
pneumonia	0.33	0.13	0.08	0.34	0.12
is	0.24	0.35	0.18	0.16	0.07
a kind of	0.16	0.11	0.35	0.22	0.21
infectious disease	0.03	0.12	0.17	0.32	0.36

TABLE 2

Table 3 is the 3 rd reference correlation matrix in this embodiment.

Target word	COVID-19	is	an	infectious	disease
pneumonia	0.46	0.18	0.09	0.13	0.14
is	0.26	0.35	0.13	0.14	0.12
a kind of	0.20	0.33	0.24	0.10	0.13
infectious disease	0.17	0.14	0.01	0.37	0.31

TABLE 3

First, for each target language word, the source language word with the maximum relevance value in each reference correlation matrix is selected as a pending source language word. Table 4 lists the pending source language words and their relevance values for each target language word in this embodiment.

Target word	R_1	R_2	R_3
pneumonia	0(0.63)	3(0.34)	0(0.46)
is	1(0.45)	1(0.35)	1(0.35)
a kind of	2(0.28)	2(0.35)	1(0.33)
infectious disease	4(0.36)	4(0.36)	3(0.37)

TABLE 4

In Table 4, source language words are represented by their column numbers (counting from 0), with the corresponding relevance values in parentheses. Taking the target language word "pneumonia" as an example, its pending source language words are "COVID-19", with occurrence frequency 2/3 and mean relevance value 0.545, and "infectious", with occurrence frequency 1/3 and mean relevance value 0.34. In this embodiment, the weights α and β are both set to 0.5, so the scores of the two pending source language words are 0.61 and 0.34, respectively, and the alignment result is "COVID-19". Similarly, the alignment result of "is" is "is", the alignment result of "a kind of" is "an", and the alignment result of "infectious disease" is "disease", which is selected by two of the three reference correlation matrices.
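The two scores for "pneumonia" can be checked by hand. The calculation below assumes the score is the weighted sum of occurrence frequency and mean relevance value with α = β = 0.5, which reproduces the stated values; the subscripts simply name the two candidate source language words.

```latex
\begin{aligned}
S_{\text{COVID-19}} &= 0.5 \times \tfrac{2}{3} + 0.5 \times \tfrac{0.63 + 0.46}{2}
                     = 0.3333 + 0.2725 \approx 0.61, \\
S_{\text{infectious}} &= 0.5 \times \tfrac{1}{3} + 0.5 \times 0.34
                       = 0.1667 + 0.1700 \approx 0.34.
\end{aligned}
```

Since 0.61 > 0.34, "COVID-19" is chosen even though the second reference correlation matrix on its own favors "infectious".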

S104: professional term query and replacement:

According to a preset professional term library of the source language and the target language, it is queried whether each source language term exists in the input source language text. If not, no operation is performed. If so, the alignment between source language words and target language words determined in step S103 is used to find the set of target language words in the initial translation aligned with the source language term, and those target language words are replaced with the target language translation of the professional term corresponding to the source language term, thereby obtaining the final translation.
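A minimal sketch of this query-and-replace step follows. It assumes single-word source terms and an alignment stored as a mapping from target word index to source word index; the function name `replace_terms` and this data layout are hypothetical, and the target words are written as English glosses of the embodiment's example sentence.

```python
from typing import Dict, List, Optional, Sequence

def replace_terms(source_words: Sequence[str], target_words: Sequence[str],
                  alignment: Dict[int, int], term_base: Dict[str, str]) -> List[str]:
    """Replace target words aligned with a source term by the term-base translation.

    alignment maps each target word index to its aligned source word index.
    Only single-word source terms are handled, to keep the sketch short.
    """
    result: List[Optional[str]] = list(target_words)
    for source_index, word in enumerate(source_words):
        if word not in term_base:
            continue  # not a professional term: nothing to do
        aligned = [t for t, s in alignment.items() if s == source_index]
        if not aligned:
            continue  # the term has no aligned target word
        # Put the term translation at the first aligned position, drop the rest.
        first = min(aligned)
        result[first] = term_base[word]
        for t in aligned:
            if t != first:
                result[t] = None
    return [w for w in result if w is not None]

source_words = ["COVID-19", "is", "an", "infectious", "disease"]
target_words = ["pneumonia", "is", "a kind of", "infectious disease"]
alignment = {0: 0, 1: 1, 2: 2}  # target index -> source index; only the entries needed here
term_base = {"COVID-19": "new coronary pneumonia"}
final_words = replace_terms(source_words, target_words, alignment, term_base)
print(" ".join(final_words))
```

Because the replacement runs on the finished translation, the decoding process of the translation model itself is left untouched.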

A source language term may contain multiple words, and a source language term present in the source language sentence may be aligned with multiple words in the initial translation. Therefore, a heuristic can be used to decide, case by case, whether to perform the replacement operation for a source language term found in the source language text, as follows:

The number of words contained in the source language term to be replaced is denoted $L_1$, and the number of words in the initial translation aligned with that source language term is denoted $L_2$. If $v \times L_1 < L_2 < u \times L_1$, the replacement operation is performed, that is, the part of the initial translation corresponding to the source language term is replaced with the corresponding term translation in the term library; otherwise, the replacement operation is not performed. Here v and u are coefficients preset according to the translation habits between the source language and the target language.

In this way, some incorrect alignment results are filtered out, which reduces their influence on the final translation.
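The length check can be written as a one-line predicate. In the sketch below, the function name `should_replace` is hypothetical, and the default coefficients v = 0.5 and u = 2 are the values the embodiment uses in its worked example.

```python
def should_replace(term_length: int, aligned_length: int,
                   v: float = 0.5, u: float = 2.0) -> bool:
    """Return True when v * L1 < L2 < u * L1, i.e. the aligned span has a plausible length."""
    return v * term_length < aligned_length < u * term_length

# L1 = 1 source word aligned with L2 = 1 target word: replacement is allowed.
print(should_replace(1, 1))
# L1 = 1 source word aligned with L2 = 4 target words: likely a misalignment, skip.
print(should_replace(1, 4))
```

Both bounds are strict, so a term with no aligned target words (L2 = 0) is never replaced.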

Table 5 is an example of the term base in the present embodiment.

Source language term	COVID-19	etoposide	cisplatin	mometasone	SARS-CoV-2
Target language translation	New coronary pneumonia	Etoposide	Cis-platinum	Mometasone	Novel coronavirus

TABLE 5

In this embodiment, only the term "COVID-19" from the term library is found in the source language sentence, and its alignment result is ["COVID-19": "pneumonia"].

For the heuristic, the coefficients are set to u = 2 and v = 0.5. For the alignment result ["COVID-19": "pneumonia"] of the term "COVID-19", the length of the source language term is $L_1 = 1$ and the length of the corresponding word set "pneumonia" is $L_2 = 1$, which lies within the set range ($0.5 < 1 < 2$), so the replacement operation can be performed for the term "COVID-19".

Thus, "pneumonia" is replaced with the term translation "new coronary pneumonia" from the term library, and the final translation obtained for the source language text "COVID-19 is an infectious disease" is "New coronary pneumonia is an infectious disease". The translation processed by the method is therefore more accurate than the initial translation.

Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments. Various changes will be apparent to those skilled in the art, and as long as they fall within the spirit and scope of the invention as defined by the appended claims, all inventions making use of the inventive concept are protected.

Claims

1. A professional term vocabulary alignment replacement method based on a Transformer translation model, characterized by comprising the following steps:

S1: constructing a Transformer-based translation model comprising M coding layers and M decoding layers, and training the translation model with parallel corpora of the source language and the target language collected in advance;

S2: inputting the source language text to be translated into the Transformer-based translation model trained in step S1, and translating to obtain an initial translation in the target language, where the number of words in the source language text is denoted $D_r$ and the number of target language words in the initial translation is denoted $D_t$; while obtaining the initial translation, also obtaining the correlation matrices of size $D_t \times D_r$ calculated by the cross-attention mechanism in the M decoding layers of the translation model, each element representing the relevance value between the source language word and the target language word at the corresponding positions; selecting N of the M decoding layers as reference decoding layers as required, and taking the corresponding correlation matrices as the reference correlation matrices $R_n$, $n = 1, 2, \ldots, N$;

S3: for each target language word in the initial translation, determining, in each of the N reference correlation matrices $R_n$, the source language word in the source language text with the maximum relevance value as a pending source language word $W_{d,n}$, with the corresponding relevance value denoted $C_{d,n}$, $d = 1, 2, \ldots, D_t$; the N pending source language words $W_{d,n}$ corresponding to each target language word form the pending source language word set $\phi_d$ of that target language word; denoting the number of pending source language words in the set $\phi_d$ as K, for the k-th pending source language word, counting its occurrence frequency $f_{k,n}$ and the mean $\bar{C}_{k,n}$ of its relevance values, and weighting the two to obtain the score of the pending source language word:

$$S_{k,n} = \alpha f_{k,n} + \beta \bar{C}_{k,n}$$

where $\alpha$ and $\beta$ are preset weights with $\alpha + \beta = 1$; finally, from the K pending source language words of each target language word, selecting the one with the largest score $S_{k,n}$ as the aligned source language word of the target language word, thereby obtaining aligned word pairs;

S4: querying, according to a preset professional term library of the source language and the target language, whether each source language term exists in the input source language text; if not, performing no operation; if so, using the alignment between source language words and target language words determined in step S3 to find the set of target language words in the initial translation aligned with the source language term, and replacing those target language words with the target language translation of the professional term corresponding to the source language term, thereby obtaining the final translation.

2. The professional term vocabulary alignment replacement method according to claim 1, wherein the reference decoding layers in step S2 are selected as follows:

using an external alignment tool to obtain alignment results between the source language and the target language as ground-truth labels; inputting the source language and target language texts into the translation model and obtaining the correlation matrices calculated by the cross-attention mechanism in the M decoding layers of the translation model; extracting alignments from each correlation matrix separately, comparing the extracted alignment results with the ground-truth labels, and computing the alignment error rate of each alignment result; and taking the decoding layers whose correlation matrices yield an alignment error rate below a preset threshold as the reference decoding layers.

3. The professional term vocabulary alignment replacement method according to claim 1, wherein step S4 further comprises using a heuristic method to determine whether to perform the replacement operation on a source language term present in the source language text, as follows:

the number of words contained in the source language term to be replaced is denoted $L_1$, and the number of words in the initial translation aligned with that source language term is denoted $L_2$; if $v \times L_1 < L_2 < u \times L_1$, the replacement operation is performed; otherwise, the replacement operation is not performed, wherein v and u are coefficients preset according to the translation habits between the source language and the target language.