CN112464676A - Machine translation result scoring method and device - Google Patents

Machine translation result scoring method and device

Info

Publication number
CN112464676A
CN112464676A (application CN202011395504.XA)
Authority
CN
China
Prior art keywords
translation
sentence
training
target language
source language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011395504.XA
Other languages
Chinese (zh)
Inventor
刘绍孔 (Liu Shaokong)
李健 (Li Jian)
武卫东 (Wu Weidong)
陈明 (Chen Ming)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sinovoice Technology Co Ltd
Original Assignee
Beijing Sinovoice Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sinovoice Technology Co Ltd filed Critical Beijing Sinovoice Technology Co Ltd
Priority to CN202011395504.XA priority Critical patent/CN112464676A/en
Publication of CN112464676A publication Critical patent/CN112464676A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/51Translation evaluation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present application relate to a machine translation result scoring method and device. The method comprises the following steps: receiving a source language corpus for translation; an initial source language translation model first encodes a sentence to be translated in the source language corpus into a source language sentence expression vector, and then decodes the source language sentence expression vector to obtain a target language translation sentence; an initial target language translation model first encodes the target language translation sentence into a target language sentence expression vector, and then decodes the target language sentence expression vector to obtain a source language translation sentence; and a score calculation model performs scoring based on the target language sentence representation vector and the source language sentence representation vector to obtain a translation result score. Embodiments of the present application can score translation results accurately, are not limited by a shared vocabulary, and can be generalized to more natural language processing tasks.

Description

Machine translation result scoring method and device
Technical Field
The embodiment of the application relates to the technical field of machine translation, in particular to a method and a device for scoring a machine translation result.
Background
Machine translation, also known as automatic translation, is the process of converting one natural language (the source language) into another (the target language) using a computer. It is a branch of computational linguistics, one of the ultimate goals of artificial intelligence, and has important scientific research value as well as important practical value. With the rapid development of economic globalization and the Internet, machine translation technology plays an increasingly important role in promoting political, economic and cultural exchange.
However, the translation results obtained by machine translation in practice are not always accurate: sometimes a correct and accurate translation is obtained, and sometimes a partially wrong or low-quality translation is obtained. Translation results therefore need to be scored for effective evaluation.
In the prior art, machine translation results are scored through a vocabulary shared by the mutually translated languages. However, this shared-vocabulary calculation gives every dimension of a word vector the same weight (for example, all 1), is limited by the size of the shared vocabulary, and does not score translation results accurately.
Disclosure of Invention
Based on the technical problems, the embodiment of the application provides a machine translation result scoring method and device, and aims to solve the problem of how to accurately score a translation result.
In a first aspect, an embodiment of the present application provides a machine translation result scoring method, where the method includes:
receiving a source language corpus for translation, wherein the source language corpus comprises a plurality of sentences to be translated;
the initial source language translation model firstly encodes the sentence to be translated into a source language sentence expression vector, and then decodes the source language sentence expression vector to obtain a target language translation sentence;
the initial target language translation model firstly encodes the target language translation sentence into a target language sentence expression vector, and then decodes the target language sentence expression vector to obtain a source language translation sentence;
and scoring by using a score calculation model based on the target language sentence representation vector and the source language sentence representation vector to obtain a translation result score.
Optionally, the score calculation model is a trained model, and training the score calculation model comprises: collecting a score calculation model training corpus, wherein the training corpus comprises at least two translation sentence pairs, and each translation sentence pair comprises at least one of: a source language training sentence, the sentence expression vector of the source language training sentence, a target language training sentence with the same text meaning as the source language training sentence, and the sentence expression vector of the target language training sentence;
training a predetermined model using the sentence expression vectors of the source language training sentences and the sentence expression vectors of the target language training sentences, wherein the training loss function is

loss = -∑_(X,Y) V_X × W × V_Y + L1(W)    (1)

where X is the index of a source language training sentence, V_X is the sentence expression vector of that source language training sentence, Y is the index of the target language training sentence with the same text meaning as the source language training sentence, V_Y is the sentence expression vector of that target language training sentence, L1(W) is a preset regularization term, and W is the score calculation model parameter matrix.
Optionally, the method further comprises:
collecting a source language translation model training corpus, wherein the training corpus comprises at least one source language translation training pair, and each source language translation training pair comprises at least one of: a sentence to be translated, the target language translation sentence corresponding to the sentence to be translated, and a translation result score;
and converting the translation result scores of the source language translation training pairs into translation weights, and updating the initial source language translation model by using all sentences to be translated, the translation weights and the target language translation sentences.
Optionally, the method further comprises:
collecting a target language translation model training corpus, wherein the training corpus comprises at least one target language translation training pair, and each target language translation training pair comprises at least one of: a sentence to be translated, the target language translation sentence corresponding to the sentence to be translated, and a translation result score;
and converting the translation result scores of all the target language translation training pairs into translation weights, and updating the initial target language translation model by using all the target language translation sentences, the translation weights and the sentences to be translated.
Optionally, the method further comprises:
judging the score of the translation result;
and removing the target language translation sentences with low translation result scores and keeping the target language translation sentences with high translation result scores.
A second aspect of the embodiments of the present application provides a machine translation result scoring device, where the device includes:
the language material receiving module is used for receiving a source language material for translation, and the source language material comprises a plurality of sentences to be translated;
the source language translation module is configured with an initial source language translation model and is used for firstly encoding the sentence to be translated into a source language sentence expression vector, and then decoding the source language sentence expression vector to obtain a target language translation sentence;
the target language translation module is configured with an initial target language translation model and used for firstly encoding the target language translation sentences into target language sentence expression vectors and then decoding the target language sentence expression vectors to obtain source language translation sentences;
and the score calculating module is used for scoring by using a score calculating model based on the target language sentence representation vector and the source language sentence representation vector to obtain a translation result score.
Optionally, the score calculation model is a trained model, and the apparatus further includes:
the score computation model corpus receiving module is used for receiving a score computation model corpus, wherein the score computation model corpus pair at least comprises two translation statement pairs, and the translation statement pairs at least comprise one of the following: the method comprises the following steps of (1) a source language training sentence, a sentence expression vector of the source language training sentence, a target language training sentence with the same text meaning as the source language training sentence, and a sentence expression vector of the target language training sentence;
the score calculation model training module is used for training a preset model using the sentence expression vectors of the source language training sentences and the sentence expression vectors of the target language training sentences, wherein the training loss function is

loss = -∑_(X,Y) V_X × W × V_Y + L1(W)    (1)

where X is the index of a source language training sentence, V_X is the sentence expression vector of that source language training sentence, Y is the index of the target language training sentence with the same text meaning as the source language training sentence, V_Y is the sentence expression vector of that target language training sentence, L1(W) is a preset regularization term, and W is the score calculation model parameter matrix.
Optionally, the apparatus further comprises:
the source language translation model training material collection module is used for collecting a source language translation model training corpus, wherein the training corpus comprises at least one source language translation training pair, and each source language translation training pair comprises at least one of: a sentence to be translated, the target language translation sentence corresponding to the sentence to be translated, and a translation result score;
the source language translation model training weight generation module is used for converting the translation result score of each source language translation training pair into a translation weight;
And the source language translation model training module is used for updating the initial source language translation model by utilizing all sentences to be translated, translation weights and target language translation sentences.
Optionally, the apparatus further comprises:
the target language translation model training material collection module is used for collecting a target language translation model training corpus, wherein the training corpus comprises at least one target language translation training pair, and each target language translation training pair comprises at least one of: a sentence to be translated, the target language translation sentence corresponding to the sentence to be translated, and a translation result score;
the target language translation model training weight generation module is used for converting the translation result scores of all target language translation training pairs into translation weights;
and the target language translation model training module is used for updating the initial target language translation model by utilizing all the target language translation sentences, the translation weight and the sentences to be translated.
Optionally, the apparatus further comprises:
and the screening module is used for judging the translation result scores, removing the target language translation sentences with low translation result scores and reserving the target language translation sentences with high translation result scores.
The machine translation result scoring method and device provided by the application exploit the relevance between the word expression vectors of different languages generated by encoding during translation, and achieve accurate scoring of translation results through the word vectors of the source language and the target language, without being limited by the dimensionality or size of a shared vocabulary. The method can be generalized to more natural language processing tasks and reduces the complexity of natural language processing.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments of the present application will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.
FIG. 1 is a flow chart of a machine translation result scoring method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a machine translation result scoring apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
First, existing machine translation result scoring is introduced. In the prior art, a vocabulary shared by the mutually translated languages can be used to score a translation result: sentence pairs of the two languages obtain, through their corresponding translation models, semantic vectors of each sentence in the same mapping space. Because a shared vocabulary contains words of both the source and target languages in one table, each row of the vocabulary represents a word of either the source or the target language; for example, in a shared vocabulary for English-German translation, row 1 might correspond to the word vector of an English word while row 10001 corresponds to the word vector of the German word das. When the source and target languages adopt a shared vocabulary, the word vector dimensions of the two languages are the same and the feature corresponding to each dimension is also the same, so a score or weight can be obtained by computing the inner product of two word vectors. However, this shared-vocabulary calculation is limited by the size of the shared vocabulary and gives every dimension of a word vector the same weight (for example, 1).
The present method exploits the relevance between the word expression vectors of different languages generated by encoding during translation, and achieves accurate scoring of translation results based on the word vectors of the source language and the target language. By adopting a more general matrix multiplication, the similarity calculation is generalized from a dot product to a more general form, and the corpus weight calculation is likewise generalized, so that the word vectors of the source and target languages are no longer constrained by the shared-vocabulary requirement of equal dimensionality.
Referring to fig. 1, fig. 1 is a flowchart of a machine translation result scoring method according to an embodiment of the present application. As shown in fig. 1, the method comprises the steps of:
step S101, receiving a source language corpus for translation, wherein the source language corpus comprises a plurality of sentences to be translated. Machine translation entails translating a corpus in a certain source language into a corpus in a certain specified target language. For example, Chinese (hereinafter referred to as "ch") is a source language, English (hereinafter referred to as "en") is a target language, and a ch corpus is a dialog or an article formed by a plurality of sensor sentences.
Step S102, the initial source language translation model firstly encodes the sentence to be translated into a source language sentence expression vector, and then the target language translation sentence is obtained through decoding based on the source language sentence expression vector.
In one embodiment of the present application, the source language translation model adopts a neural network structure. Neural machine translation models all use an Encoder-Decoder framework, where the Encoder comprises one or several layers of neural networks and the Decoder likewise comprises one or several layers of neural networks. The purpose of the Encoder is to generate a semantic space: it extracts the information of the original sentence and replaces the original sentence with an abstract semantic representation. That is, the source language translation model converts the original human natural language into word expression vectors that a machine can process, and all word expression vectors in a sentence are mean-pooled to obtain the sentence expression vector of that sentence.
The Decoder in the translation model converts the abstract semantics, namely the source language sentence expression vector, into a sentence in the target language, so that the generated sentence faithfully expresses the meaning of the original sentence and conforms to the logical habits of the target language.
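The mean pooling described for the Encoder can be sketched as follows; this is a minimal illustration with hypothetical 4-dimensional word expression vectors, not the patented model itself:

```python
import numpy as np

def sentence_representation(word_vectors):
    # Mean-pool all word expression vectors of a sentence into one
    # sentence expression vector, as described for the Encoder above.
    return np.mean(np.stack(word_vectors), axis=0)

# Hypothetical word expression vectors for a three-word sentence.
words = [np.array([1.0, 0.0, 2.0, 0.0]),
         np.array([0.0, 2.0, 0.0, 2.0]),
         np.array([2.0, 1.0, 1.0, 1.0])]
sent_vec = sentence_representation(words)
print(sent_vec)  # → [1. 1. 1. 1.]
```

In a real system the word vectors would come from the Encoder of the source language translation model; here they are fixed numbers chosen for readability.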
Step S103, the initial target language translation model firstly encodes the target language translation sentence into a target language sentence expression vector, and then decodes the target language sentence expression vector to obtain a source language translation sentence.
and inputting the target language translation sentence obtained in the step S102 into a preset target language translation model for translation to obtain a target language sentence expression vector and a source language translation sentence of the target language translation sentence.
And step S104, scoring is carried out by using a score calculation model based on the target language sentence expression vector and the source language sentence expression vector to obtain a translation result score.
The source language sentence representative vector obtained in step S102 and the target language sentence representative vector obtained in step S103 are input to a score calculation model trained in advance, and a score of a translation result can be obtained.
In one embodiment of the present application, the score calculation model receives the source language sentence representation vector V_x and the target language sentence representation vector V_y, and calculates the score using the following formula:

score = V_x × W × V_y    (2)

The word vector representation spaces of the source language and the target language are different, so the sentence representation vectors generated by the two translation models, namely V_x and V_y, are also in different representation spaces; for example, their dimensions may differ. The score therefore cannot be calculated directly as the dot product of the two vectors, and the more general form V_x × W × V_y is required, where W is the score calculation model parameter matrix.
According to the machine translation result scoring method, a source language corpus used for translation is received, the source language corpus comprises a plurality of sentences to be translated, an initial source language translation model firstly encodes the sentences to be translated into source language sentence expression vectors, then the source language sentence expression vectors are decoded to obtain target language translated sentences, the initial target language translation model firstly encodes the target language translated sentences into target language sentence expression vectors, then the target language sentence expression vectors are decoded to obtain source language translated sentences, and scoring is carried out by using a score calculation model on the basis of the target language sentence expression vectors and the source language sentence expression vectors to obtain translation result scores. By utilizing the relevance of word expression vectors of different languages generated by coding in the translation process, the accurate scoring of the translation result is realized through the word vectors of the source language and the target language, the limitation of dimensionality, size and the like of a shared word list is avoided, the method can be popularized to more natural language processing processes, and the complexity of natural language processing is reduced.
In an optional embodiment of the present application, the score calculation model is a trained model, and the training of the score calculation model includes:
collecting a score calculation model training corpus, wherein the training corpus comprises at least two translation sentence pairs, and each translation sentence pair comprises at least one of: a source language training sentence, the sentence expression vector of the source language training sentence, a target language training sentence with the same text meaning as the source language training sentence, and the sentence expression vector of the target language training sentence;
training a predetermined model using the sentence expression vectors of the source language training sentences and the sentence expression vectors of the target language training sentences, wherein the training loss function is

loss = -∑_(X,Y) V_X × W × V_Y + L1(W)    (1)

where X is the index of a source language training sentence, V_X is the sentence expression vector of that source language training sentence, Y is the index of the target language training sentence with the same text meaning as the source language training sentence, V_Y is the sentence expression vector of that target language training sentence, L1(W) is a preset regularization term, and W is the score calculation model parameter matrix.
When the score calculation model is trained, a source language training sentence and its paired target language training sentence are constructed as a positive sample (because the target language training sentence has the same text meaning as the source language training sentence, the sentences of a positive sample are mutual translations), while a source language training sentence and the target language training sentence of another translation pair form a negative sample. Obviously, the score of a positive sample should be higher than that of a negative sample. The model is obtained by optimizing the loss function over sentence pairs that are, and are not, mutual translations as the training corpus, and the score or weight of a sentence pair is then calculated by the model.
The score calculation model parameter matrix W is randomly initialized at the start of training and then optimized on existing labeled score corpora.
The loss function is used to optimize the model parameters. An L1 regularization term is added to the loss function because L1 regularization encourages parameter sparsity, which fits the observation that one dimension in the source language semantic vector corresponds to only one or a few dimensions in the target language.
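A toy sketch of this training setup, under illustrative assumptions (tiny hand-picked vectors, a plain subgradient step, and the positive/negative samples described above extending loss (1)):

```python
import numpy as np

# Positive samples: mutual translations. Negative samples: mismatched pairs.
pos = [(np.array([1.0, 0.0]), np.array([1.0, 0.0, 0.0])),
       (np.array([0.0, 1.0]), np.array([0.0, 1.0, 0.0]))]
neg = [(np.array([1.0, 0.0]), np.array([0.0, 1.0, 0.0])),
       (np.array([0.0, 1.0]), np.array([1.0, 0.0, 0.0]))]

W = np.zeros((2, 3))   # score calculation model parameter matrix
lam, lr = 0.01, 0.1    # L1 strength and learning rate (assumed values)

for _ in range(50):
    # d/dW of -vx·W·vy is -outer(vx, vy); the L1 term contributes
    # lam * sign(W) as a subgradient.
    grad = (-sum(np.outer(vx, vy) for vx, vy in pos)
            + sum(np.outer(vx, vy) for vx, vy in neg)
            + lam * np.sign(W))
    W -= lr * grad

score = lambda vx, vy: vx @ W @ vy
print(score(*pos[0]) > score(*neg[0]))  # → True
```

After a few steps, positive samples score higher than negative samples, which is the ordering the loss is designed to produce.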
In an optional embodiment of the present application, the method further comprises:
collecting a source language translation model training corpus, wherein the training corpus comprises at least one source language translation training pair, and each source language translation training pair comprises at least one of: a sentence to be translated, the target language translation sentence corresponding to the sentence to be translated, and a translation result score.
And converting the translation result scores of the source language translation training pairs into translation weights, and updating the initial source language translation model by using all sentences to be translated, the translation weights and the target language translation sentences.
The translation result scores of all translation sentence pairs are converted into weights. The weights may be generated by min-max normalizing the scores of all sentence pairs, or by processing the translation result scores of all sentence pairs with a softmax function; in either case the values lie in [0, 1]. When the initial translation model is updated using the sentences to be translated and the target language translation sentences in the source language translation training pairs, the learning rate is multiplied by the weight when each connection weight is updated, so that pairs with lower weights change the parameters less. Because sentences that were previously translated correctly have high scores, their weights are high, which drives the translation model steadily closer to correct translation results.
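The two score-to-weight conversions mentioned above can be sketched as follows; the scores are illustrative values:

```python
import numpy as np

# Hypothetical translation result scores for four sentence pairs.
scores = np.array([0.2, 1.0, 3.8, 2.0])

# Option 1: min-max normalization into [0, 1].
minmax_w = (scores - scores.min()) / (scores.max() - scores.min())

# Option 2: softmax (shifted by the max for numerical stability);
# each weight lies in [0, 1] and the weights sum to 1.
softmax_w = np.exp(scores - scores.max())
softmax_w /= softmax_w.sum()

print(minmax_w)   # highest-scoring pair gets weight 1.0
print(softmax_w)  # weights sum to 1
```

During the model update, each connection's effective learning rate would be multiplied by the weight of the sentence pair being trained on, so low-scoring pairs move the parameters less.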
In this embodiment of the application, new translation pairs can also be added: scores are manually labeled for a new translation sentence pair or preset scores are used, weights for the new pairs are generated from these scores together with the previously generated translation result scores, and the weighted new translation pairs are likewise used to train the translation model.
In an optional embodiment of the present application, the method further comprises: collecting a target language translation model training corpus, wherein the training corpus comprises at least one target language translation training pair, and each target language translation training pair comprises at least one of: a sentence to be translated, the target language translation sentence corresponding to the sentence to be translated, and a translation result score;
and converting the translation result scores of all the target language translation training pairs into translation weights, and updating the initial target language translation model by using all the target language translation sentences, the translation weights and the sentences to be translated.
The translation result scores of all translation sentence pairs are converted into weights in the same way: by min-max normalizing the scores of all sentence pairs, or by processing them with a softmax function, giving values in [0, 1]. When the initial target language translation model is updated using the target language translation sentences and the sentences to be translated in the target language translation training pairs, the learning rate is multiplied by the weight when each connection weight is updated, so that pairs with lower weights change the parameters less; high-scoring, correctly translated sentences thus drive the translation model steadily closer to correct translation results.
In an optional embodiment of the present application, the method further comprises:
judging the score of the translation result;
and removing the target language translation sentences with low translation result scores and keeping the target language translation sentences with high translation result scores.
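A minimal sketch of this screening step; the sentences, scores, and threshold are hypothetical, since the method specifies only that low-scoring translations are removed and high-scoring ones kept:

```python
# Hypothetical (target language translation sentence, translation result score) pairs.
results = [("sentence A", 0.91), ("sentence B", 0.35), ("sentence C", 0.78)]
THRESHOLD = 0.5  # assumed cut-off separating "high" from "low" scores

# Keep only translations whose score clears the threshold.
kept = [sent for sent, score in results if score >= THRESHOLD]
print(kept)  # → ['sentence A', 'sentence C']
```

In practice the threshold could also be chosen relatively, e.g. keeping the top fraction of pairs by score rather than using a fixed cut-off.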
Based on the same inventive concept, an embodiment of the present application provides a machine translation result scoring device. Referring to fig. 2, fig. 2 is a schematic diagram of a machine translation result scoring apparatus according to an embodiment of the present application. As shown in fig. 2, the apparatus includes:
the corpus receiving module 201 is configured to receive a source language corpus for translation, where the source language corpus includes a plurality of sentences to be translated;
the source language translation module 202 is configured with an initial source language translation model, and is configured to encode the sentence to be translated into a source language sentence expression vector, and decode the source language sentence expression vector to obtain a target language translation sentence;
the target language translation module 203 is configured with an initial target language translation model, and is configured to encode the target language translation sentence into a target language sentence expression vector, and decode the target language sentence expression vector to obtain a source language translation sentence;
and the score calculating module 204 is configured to score by using a score calculating model based on the target language sentence representation vector and the source language sentence representation vector to obtain a translation result score.
In an optional embodiment of the present application, the score calculation model is a trained model, and the training of the score calculation model includes:
collecting a score calculation model training corpus, wherein the training corpus comprises at least two translation sentence pairs, and each translation sentence pair comprises at least one of the following: a source language training sentence, a sentence expression vector of the source language training sentence, a target language training sentence having the same textual meaning as the source language training sentence, and a sentence expression vector of the target language training sentence;
training a preset model by using the sentence expression vectors of the source language training sentences and the sentence expression vectors of the target language training sentences, wherein the training loss function is

-∑_(X,Y) V_X × W × V_Y + L1(W)    (1);

where X is the index of a source language training sentence, V_X is the sentence expression vector of that source language training sentence, Y is the index of the target language training sentence having the same textual meaning as the source language training sentence, V_Y is the sentence expression vector of that target language training sentence, L1(W) is a preset regularization term, and W is the score calculation model parameter matrix.
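The loss in equation (1) negates the summed bilinear scores V_X × W × V_Y (so that maximizing pair compatibility minimizes the loss) and adds an L1 penalty on W. A minimal sketch with plain Python lists follows; the vector dimensions, the L1 coefficient, and the function names are illustrative assumptions.

```python
def bilinear_score(v_x, W, v_y):
    # Computes v_x^T · W · v_y: a scalar compatibility score for one
    # (source, target) sentence pair, using W as the parameter matrix.
    return sum(v_x[i] * W[i][j] * v_y[j]
               for i in range(len(v_x)) for j in range(len(v_y)))

def score_model_loss(pairs, W, l1_coef=0.01):
    # Equation (1): negated sum of bilinear scores over all training
    # pairs, plus an L1 regularization term on the parameter matrix W.
    data_term = -sum(bilinear_score(vx, W, vy) for vx, vy in pairs)
    l1_term = l1_coef * sum(abs(w) for row in W for w in row)
    return data_term + l1_term
```

Training the preset model then amounts to minimizing `score_model_loss` over W; with the identity-like W above, well-aligned vector pairs yield large bilinear scores and hence a lower loss.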
In an optional embodiment of the present application, the apparatus further comprises:
the source language translation model training material collection module is used for collecting a source language translation model training corpus, the training corpus comprises at least one source language translation training pair, and each source language translation training pair comprises at least one of the following: a sentence to be translated, a target language translation sentence corresponding to the sentence to be translated, and a translation result score;
a source language translation model training weight generation module for converting the translation result score of each source language translation training pair into a translation weight;
And the source language translation model training module is used for updating the initial source language translation model by utilizing all sentences to be translated, translation weights and target language translation sentences.
In an optional embodiment of the present application, the apparatus further comprises:
the target language translation model training material collection module is used for collecting a target language translation model training corpus, the training corpus comprises at least one target language translation training pair, and each target language translation training pair comprises at least one of the following: a sentence to be translated, a target language translation sentence corresponding to the sentence to be translated, and a translation result score;
the target language translation model training weight generation module is used for converting the translation result scores of all target language translation training pairs into translation weights;
and the target language translation model training module is used for updating the initial target language translation model by utilizing all the target language translation sentences, the translation weight and the sentences to be translated.
In an optional embodiment of the present application, the apparatus further comprises:
and the screening module is used for judging the translation result scores, removing the target language translation sentences with low translation result scores and reserving the target language translation sentences with high translation result scores.
Since the device embodiment is substantially similar to the method embodiment, its description is kept brief; for relevant details, refer to the corresponding description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The machine translation result scoring method and device provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementation of the present application, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A machine translation result scoring method is characterized by comprising the following steps:
receiving a source language corpus for translation, wherein the source language corpus comprises a plurality of sentences to be translated;
the initial source language translation model firstly encodes the sentence to be translated into a source language sentence expression vector, and then decodes the source language sentence expression vector to obtain a target language translation sentence;
the initial target language translation model firstly encodes the target language translation sentence into a target language sentence expression vector, and then decodes the target language sentence expression vector to obtain a source language translation sentence;
and scoring by using a score calculation model based on the target language sentence representation vector and the source language sentence representation vector to obtain a translation result score.
2. The method of claim 1, wherein the score computation model is a trained model, and the training of the score computation model comprises:
collecting a score calculation model training corpus, wherein the training corpus comprises at least two translation sentence pairs, and each translation sentence pair comprises at least one of the following: a source language training sentence, a sentence expression vector of the source language training sentence, a target language training sentence having the same textual meaning as the source language training sentence, and a sentence expression vector of the target language training sentence;
training a preset model by using the sentence expression vectors of the source language training sentences and the sentence expression vectors of the target language training sentences, wherein the training loss function is

-∑_(X,Y) V_X × W × V_Y + L1(W)    (1);

where X is the index of a source language training sentence, V_X is the sentence expression vector of that source language training sentence, Y is the index of the target language training sentence having the same textual meaning as the source language training sentence, V_Y is the sentence expression vector of that target language training sentence, L1(W) is a preset regularization term, and W is the score calculation model parameter matrix.
3. The method of claim 1, further comprising:
collecting a source language translation model training corpus, wherein the training corpus comprises at least one source language translation training pair, and each source language translation training pair comprises at least one of the following: a sentence to be translated, a target language translation sentence corresponding to the sentence to be translated, and a translation result score;
and converting the translation result scores of the source language translation training pairs into translation weights, and updating the initial source language translation model by using all sentences to be translated, the translation weights and the target language translation sentences.
4. The method of claim 1, further comprising:
collecting a target language translation model training corpus, wherein the training corpus comprises at least one target language translation training pair, and each target language translation training pair comprises at least one of the following: a sentence to be translated, a target language translation sentence corresponding to the sentence to be translated, and a translation result score;
and converting the translation result scores of all the target language translation training pairs into translation weights, and updating the initial target language translation model by using all the target language translation sentences, the translation weights and the sentences to be translated.
5. The method of claim 1, further comprising:
judging the score of the translation result;
and removing the target language translation sentences with low translation result scores and keeping the target language translation sentences with high translation result scores.
6. A machine translation result scoring apparatus, comprising:
the language material receiving module is used for receiving a source language material for translation, and the source language material comprises a plurality of sentences to be translated;
the source language translation module is configured with an initial source language translation model and used for coding the sentence to be translated into a source language sentence expression vector and decoding the source language sentence expression vector to obtain a target language translation sentence;
the target language translation module is configured with an initial target language translation model and used for firstly encoding the target language translation sentences into target language sentence expression vectors and then decoding the target language sentence expression vectors to obtain source language translation sentences;
and the score calculating module is used for scoring by using a score calculating model based on the target language sentence representation vector and the source language sentence representation vector to obtain a translation result score.
7. The apparatus of claim 6, wherein the score computation model is a trained model, and the apparatus further comprises:
the score calculation model corpus collection module is used for collecting a score calculation model training corpus, the training corpus comprises at least two translation sentence pairs, and each translation sentence pair comprises at least one of the following: a source language training sentence, a sentence expression vector of the source language training sentence, a target language training sentence having the same textual meaning as the source language training sentence, and a sentence expression vector of the target language training sentence;
a score calculation model training module for training a preset model by using the sentence expression vectors of the source language training sentences and the sentence expression vectors of the target language training sentences, wherein the training loss function is

-∑_(X,Y) V_X × W × V_Y + L1(W)    (1);

where X is the index of a source language training sentence, V_X is the sentence expression vector of that source language training sentence, Y is the index of the target language training sentence having the same textual meaning as the source language training sentence, V_Y is the sentence expression vector of that target language training sentence, L1(W) is a preset regularization term, and W is the score calculation model parameter matrix.
8. The apparatus of claim 6, further comprising:
the source language translation model training material collection module is used for collecting a source language translation model training corpus, the training corpus comprises at least one source language translation training pair, and each source language translation training pair comprises at least one of the following: a sentence to be translated, a target language translation sentence corresponding to the sentence to be translated, and a translation result score;
the source language translation model training weight generation module is used for converting the translation result scores of the source language translation training pairs into translation weights;
and the source language translation model training module is used for updating the initial source language translation model by utilizing all sentences to be translated, translation weights and target language translation sentences.
9. The apparatus of claim 6, further comprising:
the target language translation model training material collection module is used for collecting a target language translation model training corpus, the training corpus comprises at least one target language translation training pair, and each target language translation training pair comprises at least one of the following: a sentence to be translated, a target language translation sentence corresponding to the sentence to be translated, and a translation result score;
the target language translation model training weight generation module is used for converting the translation result scores of all target language translation training pairs into translation weights;
and the target language translation model training module is used for updating the initial target language translation model by utilizing all the target language translation sentences, the translation weight and the sentences to be translated.
10. The apparatus of claim 6, further comprising:
and the screening module is used for judging the translation result scores, removing the target language translation sentences with low translation result scores and reserving the target language translation sentences with high translation result scores.
CN202011395504.XA 2020-12-02 2020-12-02 Machine translation result scoring method and device Pending CN112464676A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011395504.XA CN112464676A (en) 2020-12-02 2020-12-02 Machine translation result scoring method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011395504.XA CN112464676A (en) 2020-12-02 2020-12-02 Machine translation result scoring method and device

Publications (1)

Publication Number Publication Date
CN112464676A true CN112464676A (en) 2021-03-09

Family

ID=74805340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011395504.XA Pending CN112464676A (en) 2020-12-02 2020-12-02 Machine translation result scoring method and device

Country Status (1)

Country Link
CN (1) CN112464676A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515960A (en) * 2021-07-14 2021-10-19 厦门大学 Automatic translation quality evaluation method fusing syntactic information
CN113569582A (en) * 2021-07-02 2021-10-29 中译语通科技股份有限公司 Method for improving zero sample translation capability of multi-language neural machine translation model
CN113761950A (en) * 2021-04-28 2021-12-07 腾讯科技(深圳)有限公司 Translation model testing method and device
CN113869069A (en) * 2021-09-10 2021-12-31 厦门大学 Machine translation method based on dynamic selection of decoding path of translation tree structure
CN114298061A (en) * 2022-03-07 2022-04-08 阿里巴巴(中国)有限公司 Machine translation and model training quality evaluation method, electronic device and storage medium
CN114580442A (en) * 2022-02-28 2022-06-03 北京百度网讯科技有限公司 Model training method and device, electronic equipment and storage medium
CN114996459A (en) * 2022-03-18 2022-09-02 星宙数智科技(珠海)有限公司 Parallel corpus classification method and device, computer equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2661535A1 (en) * 2006-08-18 2008-02-21 National Research Council Of Canada Means and method for training a statistical machine translation system
US20170060855A1 (en) * 2015-08-25 2017-03-02 Alibaba Group Holding Limited Method and system for generation of candidate translations
CN106484682A (en) * 2015-08-25 2017-03-08 阿里巴巴集团控股有限公司 Based on the machine translation method of statistics, device and electronic equipment
CN110852116A (en) * 2019-11-07 2020-02-28 腾讯科技(深圳)有限公司 Non-autoregressive neural machine translation method, device, computer equipment and medium
CN110874537A (en) * 2018-08-31 2020-03-10 阿里巴巴集团控股有限公司 Generation method of multi-language translation model, translation method and translation equipment
CN111460838A (en) * 2020-04-23 2020-07-28 腾讯科技(深圳)有限公司 Pre-training method and device of intelligent translation model and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
姚亮; 洪宇; 刘昊; 刘乐; 姚建民: "Domain adaptation for translation models based on semantic distribution similarity" (基于语义分布相似度的翻译模型领域自适应研究), Journal of Shandong University (Natural Science), no. 07, 31 May 2016 (2016-05-31) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761950A (en) * 2021-04-28 2021-12-07 腾讯科技(深圳)有限公司 Translation model testing method and device
CN113569582A (en) * 2021-07-02 2021-10-29 中译语通科技股份有限公司 Method for improving zero sample translation capability of multi-language neural machine translation model
CN113515960A (en) * 2021-07-14 2021-10-19 厦门大学 Automatic translation quality evaluation method fusing syntactic information
CN113515960B (en) * 2021-07-14 2024-04-02 厦门大学 Automatic translation quality assessment method integrating syntax information
CN113869069A (en) * 2021-09-10 2021-12-31 厦门大学 Machine translation method based on dynamic selection of decoding path of translation tree structure
CN114580442A (en) * 2022-02-28 2022-06-03 北京百度网讯科技有限公司 Model training method and device, electronic equipment and storage medium
CN114298061A (en) * 2022-03-07 2022-04-08 阿里巴巴(中国)有限公司 Machine translation and model training quality evaluation method, electronic device and storage medium
CN114298061B (en) * 2022-03-07 2022-12-06 阿里巴巴(中国)有限公司 Machine translation and model training quality evaluation method, electronic device and storage medium
CN114996459A (en) * 2022-03-18 2022-09-02 星宙数智科技(珠海)有限公司 Parallel corpus classification method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112464676A (en) Machine translation result scoring method and device
CN112257453B (en) Chinese-Yue text similarity calculation method fusing keywords and semantic features
CN106484682A (en) Based on the machine translation method of statistics, device and electronic equipment
CN111859987A (en) Text processing method, and training method and device of target task model
CN114676234A (en) Model training method and related equipment
CN113223506B (en) Speech recognition model training method and speech recognition method
CN113822054A (en) Chinese grammar error correction method and device based on data enhancement
CN110633456B (en) Language identification method, language identification device, server and storage medium
US20230205994A1 (en) Performing machine learning tasks using instruction-tuned neural networks
CN116861929A (en) Machine translation system based on deep learning
CN115033753A (en) Training corpus construction method, text processing method and device
CN112528168B (en) Social network text emotion analysis method based on deformable self-attention mechanism
CN113822052A (en) Text error detection method and device, electronic equipment and storage medium
CN115438678B (en) Machine translation method, device, electronic equipment and storage medium
CN116955644A (en) Knowledge fusion method, system and storage medium based on knowledge graph
CN114936567B (en) Knowledge distillation-based unsupervised machine translation quality estimation method and device
CN112287641B (en) Synonym sentence generating method, system, terminal and storage medium
CN111090720B (en) Hot word adding method and device
Menevşe et al. A Framework for Automatic Generation of Spoken Question-Answering Data
CN114492464A (en) Dialog generation method and system based on bidirectional asynchronous sequence
CN114185573A (en) Implementation and online updating system and method for human-computer interaction machine translation system
CN113849634A (en) Method for improving interpretability of depth model recommendation scheme
Li et al. Towards Robust Chinese Spelling Check Systems: Multi-round Error Correction with Ensemble Enhancement
CN114218369A (en) Reading comprehension data enhancement method and device based on translation
CN114638231B (en) Entity linking method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination