CN114201975B - Translation model training method, translation method and translation device

Translation model training method, translation method and translation device

Info

Publication number
CN114201975B
Authority
CN
China
Prior art keywords
sentence
language sentence
source language
translation model
target language
Prior art date
Legal status
Active
Application number
CN202111250312.4A
Other languages
Chinese (zh)
Other versions
CN114201975A (en
Inventor
刘恒双
张为泰
许瑞阳
Current Assignee
University of Science and Technology of China USTC
iFlytek Co Ltd
Original Assignee
University of Science and Technology of China USTC
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC and iFlytek Co Ltd
Priority to CN202111250312.4A
Publication of CN114201975A
Application granted
Publication of CN114201975B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/42 Data-driven translation
    • G06F 40/44 Statistical methods, e.g. probability models
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Abstract

The embodiment of the invention provides a translation model training method, a translation method and a translation device. The model training method comprises the following steps: inputting a source language sentence from a parallel bilingual sentence pair and a noisy source language sentence into a translation model respectively, to obtain a first predicted target language sentence and a second predicted target language sentence, and obtaining the first and second prediction probability distributions of the translation model and/or the first and second feature vectors output by each hidden layer; determining the current training loss of the translation model based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and the target language sentence corresponding to the noisy source language sentence, and the first and second feature vectors and/or the first and second prediction probability distributions; and adjusting the parameters of the translation model accordingly. The embodiment of the invention improves the robustness of the translation model while keeping the training method simple and model training stable.

Description

Translation model training method, translation method and translation device
Technical Field
The present invention relates to the field of machine translation technologies, and in particular, to a translation model training method, a translation method, and a translation device.
Background
The scanning translation pen is an intelligent terminal product that integrates optical character recognition (OCR), machine translation and speech synthesis technologies. Its working process is as follows: first, an image of a paper document is captured and the characters in the image are recognized by OCR to obtain a source language text; the source language text is then translated by machine translation to obtain a target language text; finally, the target language text is played aloud by speech synthesis. To ensure translation quality, certain specifications need to be followed when using the scanning translation pen: for example, when capturing an image of a paper document, the pen is recommended to be held at a 45-degree angle to the desktop, and the pen tip needs to be aligned with the line of the paper document to be translated. In actual use, however, it is difficult to guarantee that these specifications are strictly followed, so the source language text obtained by OCR contains a large amount of noise. Because machine translation is very sensitive to noise, a small disturbance may cause a large change in the translation result and thus degrade the quality of the machine translation. Improving the robustness of the translation model can therefore improve the quality of the machine translation.
The existing methods for improving the robustness of a translation model mainly combine data enhancement with adversarial training. However, combining data enhancement with adversarial training requires building and training an additional discriminator, which makes the training method complicated and model training unstable.
Disclosure of Invention
The embodiment of the invention provides a translation model training method, a translation method and a translation device, to overcome the defects of the prior art, in which methods for improving the robustness of a translation model are complicated and model training is unstable.
The embodiment of the invention provides a translation model training method, which comprises the following steps:
inputting a source language sentence in a parallel bilingual sentence pair into a translation model to obtain a first prediction target language sentence output by the translation model, and obtaining a first prediction probability distribution of the translation model and/or a first feature vector output by each hidden layer of the translation model;
inputting the noisy source language sentence into the translation model to obtain a second predicted target language sentence output by the translation model, and obtaining a second predicted probability distribution of the translation model and/or a second feature vector output by each hidden layer of the translation model; the noisy source language sentence is obtained based on data enhancement of the source language sentence;
determining a current training loss of the translation model based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, and the first feature vector and the second feature vector and/or the first prediction probability distribution and the second prediction probability distribution;
based on the determined current training loss, parameters of the translation model are adjusted.
According to a translation model training method of an embodiment of the present invention, the determining of the current training loss of the translation model based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, and the first feature vector and the second feature vector and/or the first prediction probability distribution and the second prediction probability distribution, includes:
determining a first training loss based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, and the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence;
determining a second training loss based on the first feature vector and the second feature vector; and/or determining a third training loss based on the first prediction probability distribution and the second prediction probability distribution;
and carrying out weighted summation on the first training loss, the second training loss and/or the third training loss to obtain the current training loss of the translation model.
According to an embodiment of the present invention, the determining of the first training loss based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, and the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, includes:
determining a first training loss component based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair;
determining a second training loss component based on the target language sentence in the parallel bilingual sentence pair corresponding to the second predicted target language sentence and the noisy source language sentence;
and accumulating the first training loss component and the second training loss component to obtain the first training loss.
According to an embodiment of the present invention, in the method for training a translation model, the determining of the second training loss based on the first feature vector and the second feature vector includes:
determining, based on the first feature vector and the second feature vector of each hidden layer, a third training loss component corresponding to that hidden layer;
and accumulating the determined third training loss components of all the hidden layers to obtain the second training loss.
According to an embodiment of the present invention, the performing of data enhancement on the source language sentence to obtain the noisy source language sentence includes:
replacing characters in the source language sentence based on a character recognition error comparison table to obtain the noisy source language sentence, wherein the character recognition error comparison table is obtained based on optical character recognition; and/or,
deleting characters from the sentence-head or sentence-tail word of the source language sentence to obtain the noisy source language sentence; and/or,
deleting punctuation marks at the tail of the source language sentence or adding a punctuation mark at the head of the source language sentence to obtain the noisy source language sentence.
According to an embodiment of the present invention, in the method for training a translation model, the replacing of characters in the source language sentence based on the character recognition error comparison table to obtain the noisy source language sentence includes:
counting the number of characters in the source language sentence, and determining the number of characters to be replaced in the source language sentence according to a preset proportion;
determining the characters to be replaced in the source language sentence based on the frequency of occurrence of the characters in the source language sentence in the character recognition error comparison table and the determined number of the characters to be replaced;
based on the determined character to be replaced, obtaining a corresponding error character from the character recognition error comparison table, and replacing the character in the source language sentence to obtain the noisy source language sentence.
According to an embodiment of the present invention, the obtaining, based on the determined character to be replaced, of a corresponding error character from the character recognition error comparison table and the replacing of the character in the source language sentence to obtain the noisy source language sentence includes:
if the determined character to be replaced corresponds to two or more error characters in the character recognition error comparison table, determining the error character to use for replacement based on the frequencies of occurrence of those error characters in the character recognition error comparison table;
and acquiring the determined error character for replacement from the character recognition error comparison table, and replacing the corresponding character in the source language sentence to obtain the noisy source language sentence.
According to an embodiment of the present invention, in the method for training a translation model, the deleting of characters from the sentence-head or sentence-tail word of the source language sentence to obtain the noisy source language sentence includes:
determining the number of characters contained in the sentence-head or sentence-tail word of the source language sentence;
if the number of characters contained in the determined sentence-head or sentence-tail word meets the preset number, determining the number of characters to be deleted from that word; wherein the preset number is set according to the language of the source language sentence;
and deleting characters from the sentence-head or sentence-tail word of the source language sentence based on the determined number of characters to be deleted, so as to obtain the noisy source language sentence.
According to an embodiment of the present invention, the determining of the number of characters to be deleted from the sentence-head or sentence-tail word of the source language sentence includes:
if the language of the source language sentence is English, determining the number of characters to be deleted from the sentence-head or sentence-tail word based on a Gaussian distribution;
and if the language of the source language sentence is Chinese, determining the number of characters to be deleted from the sentence-head or sentence-tail word as one character.
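As a rough illustration of this deletion noise, the sketch below deletes edge characters from the sentence-head or sentence-tail word; the Gaussian parameters and the minimum word length are illustrative assumptions, since the claims do not specify them.

```python
import random

# Hedged sketch of edge-character deletion noise. mu, sigma and min_len
# are illustrative assumptions, not values from the patent.
def delete_edge_chars(sentence: str, lang: str, min_len: int = 4,
                      mu: float = 2.0, sigma: float = 1.0) -> str:
    if lang == "zh":
        # Chinese: delete exactly one character at the sentence head or tail.
        if len(sentence) < 2:
            return sentence
        return sentence[1:] if random.random() < 0.5 else sentence[:-1]
    # English: sample the number of deleted characters from a Gaussian.
    words = sentence.split()
    if not words:
        return sentence
    head = random.random() < 0.5          # corrupt sentence head or tail
    word = words[0] if head else words[-1]
    if len(word) < min_len:               # word too short to corrupt safely
        return sentence
    k = max(1, min(len(word) - 1, round(random.gauss(mu, sigma))))
    if head:
        words[0] = word[k:]
    else:
        words[-1] = word[:len(word) - k]
    return " ".join(words)
```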
According to an embodiment of the present invention, before inputting a source language sentence in a parallel bilingual sentence pair into a translation model to obtain a first predicted target language sentence output by the translation model, and obtaining a first predicted probability distribution of the translation model and/or a first feature vector output by each hidden layer of the translation model, the method further includes:
inputting a source language text sample into the translation model to obtain a predicted target language text output by the translation model; the source language text sample comprises the source language sentences in the parallel bilingual sentence pairs and the noisy source language sentences;
determining a current training loss of the translation model based on the predicted target language text and the real target language text of the source language text sample; wherein the real target language text comprises the target language sentences in the parallel bilingual sentence pairs;
Based on the determined current training loss, parameters of the translation model are adjusted.
According to the translation model training method of one embodiment of the present invention, the source language text sample further includes clause fragments of the source language sentences in the parallel bilingual sentence pairs, and the real target language text further includes the corresponding clause fragments in the target language sentences of the parallel bilingual sentence pairs.
According to a translation model training method of an embodiment of the present invention, the step of obtaining clause fragments of a source language sentence in the parallel bilingual sentence pair and the corresponding clause fragments in the target language sentence includes:
performing word alignment on the source language sentence and the target language sentence in the parallel bilingual sentence pair;
and extracting, based on the word-aligned source language sentence and target language sentence, clause fragments of the source language sentence and the corresponding clause fragments in the target language sentence.
According to one embodiment of the present invention, the extracting, based on word alignment, of clause fragments of the source language sentence and the corresponding clause fragments in the target language sentence includes:
extracting short clause fragments of the source language sentence and of the target language sentence based on the word-aligned sentences; wherein a short clause fragment contains no internal punctuation marks;
and extracting long clause fragments of the source language sentence and of the target language sentence based on the word-aligned sentences; wherein a long clause fragment contains internal punctuation marks.
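As a rough illustration of alignment-based fragment extraction, the sketch below pulls one aligned fragment pair out of a word-aligned sentence pair. The consistency check is the standard phrase-extraction condition and is an assumption about the method; the short/long distinction would then be made by checking the extracted source fragment for internal punctuation.

```python
# Hedged sketch: extract the target-side fragment aligned to a chosen source
# span. The consistency rule (no alignment link crossing the span boundary)
# is an assumption; the patent does not spell out the exact extraction rule.
def extract_fragment(src_tokens, tgt_tokens, alignments, src_start, src_end):
    # alignments: set of (src_idx, tgt_idx) word-alignment links
    linked = [t for s, t in alignments if src_start <= s <= src_end]
    if not linked:
        return None
    t_lo, t_hi = min(linked), max(linked)
    # consistency: no word inside the target span may align outside the source span
    for s, t in alignments:
        if t_lo <= t <= t_hi and not (src_start <= s <= src_end):
            return None
    return (" ".join(src_tokens[src_start:src_end + 1]),
            " ".join(tgt_tokens[t_lo:t_hi + 1]))

# e.g. extract_fragment("我 喜欢 苹果".split(), "I like apples".split(),
#                       {(0, 0), (1, 1), (2, 2)}, 1, 2)
# returns ("喜欢 苹果", "like apples")
```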
The embodiment of the invention also provides a translation method, which comprises the following steps:
collecting an image of a text to be translated, and performing text recognition on the collected image of the text to be translated to obtain a source language text;
translating the source language text through a translation model to obtain a target language text; wherein the translation model is trained based on the translation model training method described in any one of the above.
The embodiment of the invention also provides a translation model training device, which comprises:
the first prediction module is used for inputting source language sentences in parallel bilingual sentence pairs into a translation model to obtain first prediction target language sentences output by the translation model, and obtaining first prediction probability distribution of the translation model and/or first feature vectors output by each hidden layer of the translation model;
the second prediction module is used for inputting the noisy source language sentence into the translation model to obtain a second predicted target language sentence output by the translation model, and obtaining a second prediction probability distribution of the translation model and/or a second feature vector output by each hidden layer of the translation model; the noisy source language sentence is obtained based on data enhancement of the source language sentence;
a loss calculation module, configured to determine the current training loss of the translation model based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, and the first feature vector and the second feature vector and/or the first prediction probability distribution and the second prediction probability distribution;
and the parameter adjustment module is used for adjusting parameters of the translation model based on the determined current training loss.
The embodiment of the invention also provides a translation device, which comprises:
the text recognition module is used for collecting images of the text to be translated and recognizing the collected images of the text to be translated to obtain a source language text;
the machine translation module is used for translating the source language text through a translation model to obtain a target language text; wherein the translation model is trained based on the translation model training method described in any one of the above.
The embodiment of the invention also provides electronic equipment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the translation model training method according to any one of the above or the steps of the translation method when executing the program.
Embodiments of the present invention also provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a translation model training method as described in any one of the above, or the steps of the translation method described above.
According to the translation model training method and the translation device provided by the embodiments of the invention, the source language sentence in a parallel bilingual sentence pair and the noisy source language sentence obtained by data enhancement of that sentence are input into the translation model separately, and two forward propagation passes are performed. The prediction probability distributions of the two passes and/or the feature vectors output by each hidden layer of the translation model are obtained. The current training loss of the translation model is determined based on the predicted target language sentences output by the two forward passes and the target language sentence in the parallel bilingual sentence pair, together with the obtained prediction probability distributions and/or hidden-layer feature vectors, and the parameters of the translation model are adjusted accordingly. Because the prediction probability distribution obtained from the noisy source language sentence and that obtained from the clean source language sentence constrain each other, and/or the hidden-layer feature vectors obtained from the two inputs constrain each other, the translation model produces consistent outputs whether or not the source language sentence contains noise, which improves its robustness. Since no additional discriminator needs to be built or trained, the training method is simple and model training is stable.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a translation model training method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for determining current training loss of a translation model according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for training a translation model according to another embodiment of the present invention;
FIG. 4 is a flowchart of a method for enhancing data of a source language sentence to obtain a noisy source language sentence according to an embodiment of the present invention;
FIG. 5 is a flowchart of a method for replacing characters in a source language sentence to obtain a noisy source language sentence according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating a method for obtaining an error character from a character recognition error lookup table to replace a character in a source language sentence according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of training a translation model by obtaining noisy text through data enhancement by character replacement according to an embodiment of the present invention;
FIG. 8 is a flowchart of a method for deleting characters in a source language sentence to obtain a noisy source language sentence according to an embodiment of the present invention;
FIG. 9 is a flowchart of a translation model training method according to another embodiment of the present invention;
FIG. 10 is a flowchart of a method for obtaining clause fragments of a source language sentence and a target language sentence of parallel double sentence pairs according to an embodiment of the present invention;
FIG. 11 is a schematic flow chart of a translation method according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of a construction of a translation model training device according to an embodiment of the present invention;
FIG. 13 is a schematic diagram of a translation device according to an embodiment of the present invention;
fig. 14 is a schematic entity structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
To ensure translation quality, certain specifications need to be followed when using the scanning translation pen: for example, when capturing an image of a paper document, the pen is recommended to be held at a 45-degree angle to the desktop, and the pen tip needs to be aligned with the line of the paper document to be translated. In actual use, however, it is difficult to guarantee that these specifications are strictly followed, so the source language text obtained by OCR contains a large amount of noise. Because machine translation is very sensitive to noise, a small disturbance may cause a large change in the translation result and thus degrade the quality of the machine translation. Improving the robustness of the translation model can therefore improve the quality of the machine translation.
The existing methods for improving the robustness of a translation model mainly combine data enhancement with adversarial training. However, this approach requires an additional discriminator, which increases the number of parameters, and the discriminator is highly sensitive to its parameter configuration, so the training method is complicated and model training is unstable.
In this regard, the embodiment of the present invention provides a translation model training method. The source language sentence in a parallel bilingual sentence pair and the noisy source language sentence obtained by data enhancement of that sentence are input into the translation model separately, and two forward propagation passes are performed. The prediction probability distributions of the two passes and/or the feature vectors output by each hidden layer of the translation model are obtained. The current training loss of the translation model is then determined based on the predicted target language sentences output by the two forward passes and the real target language sentence, together with the obtained prediction probability distributions and/or hidden-layer feature vectors, and the parameters of the translation model are adjusted. Because the prediction probability distribution and/or hidden-layer feature vectors obtained from the noisy source language sentence and those obtained from the clean source language sentence constrain each other, the translation model becomes insensitive to noise in the source language sentence. Fig. 1 is a flow chart of a translation model training method according to an embodiment of the present invention; as shown in Fig. 1, the method at least includes:
Step 101, inputting a source language sentence in a parallel bilingual sentence pair into a translation model to obtain a first predicted target language sentence output by the translation model, and obtaining a first prediction probability distribution of the translation model and/or a first feature vector output by each hidden layer of the translation model.
In the embodiment of the invention, the translation model may be a deep learning model that translates one language, called the source language, into another language, called the target language. An existing deep learning model, such as the Transformer model, may be used. The translation model generally adopts an encoder-decoder structure and includes at least one encoder and at least one decoder, the number of encoders generally being equal to the number of decoders; each encoder and each decoder may be a multi-layer network structure containing multiple hidden layers. For example, the Transformer model includes 6 encoders and 6 decoders. The 6 encoders have the same network structure, and each encoder includes 2 hidden layers: the 1st hidden layer is a self-attention layer and the 2nd hidden layer is a feedforward neural network layer. The 6 decoders also have the same network structure, and each decoder includes 3 hidden layers: the 1st hidden layer is a masked multi-head self-attention layer, the 2nd hidden layer is a multi-head attention layer over the encoder output, and the 3rd hidden layer is a feedforward neural network layer.
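As a concrete illustration, the following minimal PyTorch sketch builds an encoder-decoder model of the kind described; the vocabulary sizes and model dimension are illustrative assumptions, not values taken from the patent.

```python
import torch.nn as nn

# Minimal sketch of the encoder-decoder translation model described above
# (6 encoder layers, each with self-attention + feedforward sublayers;
# 6 decoder layers, each with masked self-attention, encoder-decoder
# attention, and feedforward sublayers). Sizes are illustrative assumptions.
SRC_VOCAB, TGT_VOCAB, D_MODEL = 32000, 32000, 512

class TranslationModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_embed = nn.Embedding(SRC_VOCAB, D_MODEL)
        self.tgt_embed = nn.Embedding(TGT_VOCAB, D_MODEL)
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=8,
            num_encoder_layers=6, num_decoder_layers=6,
            batch_first=True)
        self.proj = nn.Linear(D_MODEL, TGT_VOCAB)

    def forward(self, src_ids, tgt_ids):
        hidden = self.transformer(self.src_embed(src_ids),
                                  self.tgt_embed(tgt_ids))
        return self.proj(hidden)  # logits; softmax yields the prediction distribution
```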
A large number of general-purpose parallel bilingual texts with standard wording and reasonable grammar are generally adopted as training samples, so that the translation model learns the mapping relationship between the two languages from them. A parallel bilingual text comprises a source language text and a target language text; the embodiment of the invention does not limit their languages. For example, the source language text in a parallel bilingual text may be Chinese and the target language text English. Parallel bilingual texts are mostly obtained from spoken-communication scenarios, the news domain, electronic documents and the like, or crawled from the Internet; the embodiment of the invention does not limit the method of obtaining them. Parallel bilingual text may include parallel bilingual sentence pairs; because the translation object of a scanning translation pen is typically a sentence, parallel bilingual sentence pairs may be used to train its translation model, for example.
In the embodiment of the invention, when the translation model is trained, the source language sentence in the parallel bilingual sentence pair can be input into the translation model, forward propagation (Forward Propagation) processing is carried out, and a first prediction target language sentence output by the translation model is obtained. In the forward propagation process, an input source language sentence sequentially passes through an encoder and a decoder in a translation model, a first feature vector is output at each hidden layer of the encoder and the decoder, and a first prediction probability distribution is output at a softmax layer. Therefore, in the process of training the translation model through the source language sentence in the parallel bilingual sentence pair, the first prediction probability distribution of the softmax layer output may be obtained from the translation model, or the first feature vector of each hidden layer output may be obtained from the translation model, or the first prediction probability distribution of the softmax layer output and the first feature vector of each hidden layer output may be obtained from the translation model at the same time.
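One way to collect the per-layer feature vectors and the prediction probability distribution in a single forward pass is with forward hooks. This is an implementation assumption rather than a mechanism prescribed by the patent; the sketch builds on the `TranslationModel` sketch above.

```python
import torch.nn.functional as F

# Hedged sketch: register a forward hook on every encoder/decoder layer so
# that each hidden layer's output feature vector is captured during the pass.
def forward_with_features(model, src_ids, tgt_ids):
    features, hooks = [], []
    layers = (list(model.transformer.encoder.layers)
              + list(model.transformer.decoder.layers))
    for layer in layers:
        hooks.append(layer.register_forward_hook(
            lambda _mod, _inp, out: features.append(out)))
    logits = model(src_ids, tgt_ids)
    for h in hooks:
        h.remove()
    # F.softmax(logits, dim=-1) gives the prediction probability distribution;
    # `features` holds the feature vector output by each hidden layer.
    return logits, features
```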
Step 102, inputting the noisy source language sentence into a translation model to obtain a second prediction target language sentence output by the translation model, and obtaining a second prediction probability distribution of the translation model and/or a second feature vector output by each hidden layer of the translation model; the noisy source language sentence is obtained based on data enhancement of the source language sentence.
In the embodiment of the invention, the noisy source language sentence can be obtained by carrying out data enhancement processing on the source language sentence in the parallel bilingual sentence pair. The data enhancement processing can be performed on the source language sentence by adopting the existing text data enhancement mode, for example, the data enhancement processing can be performed on the source language sentence by means of synonym substitution, random deletion, random exchange, random insertion, random masking and the like, or the data enhancement processing can be performed on the source language sentence according to the mode of designing data enhancement according to the application scene of the translation model, and the implementation mode of the data enhancement is not limited in the embodiment of the invention.
In the embodiment of the invention, when the translation model is trained, the noisy source language sentence obtained by data enhancement of the source language sentence in the step 101 can be input into the translation model, and forward propagation processing is performed to obtain a second predicted target language sentence output by the translation model. In the forward propagation process, the input noisy source language sentence is sequentially processed by an encoder and a decoder in a translation model, a second feature vector is output at each hidden layer of the encoder and the decoder, and a second prediction probability distribution is output at a softmax layer. Therefore, in the process of training the translation model through the noisy source language sentence, the second prediction probability distribution can be obtained from the translation model, or the second feature vector output by each hidden layer can be obtained from the translation model, or the second prediction probability distribution and the second feature vector output by each hidden layer can be obtained from the translation model at the same time.
Step 103, determining the current training loss of the translation model based on the target language sentence in the parallel bilingual sentence pair corresponding to the first predicted target language sentence and the parallel bilingual sentence pair, the target language sentence in the parallel bilingual sentence pair corresponding to the second predicted target language sentence and the noisy source language sentence, and the first feature vector and the second feature vector and/or the first prediction probability distribution and the second prediction probability distribution.
In the embodiment of the invention, when a source language sentence in a parallel bilingual sentence pair and a noisy source language sentence obtained by carrying out data enhancement on the source language sentence are respectively input into a translation model to obtain a first prediction target language sentence and a second prediction target language sentence, if only the first prediction probability distribution and the second prediction probability distribution are obtained from the translation model, the current training loss of the translation model can be calculated according to the target language sentence in the parallel bilingual sentence pair, which corresponds to the noisy source language sentence, and the first prediction probability distribution and the second prediction probability distribution; if only the first feature vector and the second feature vector output by each hidden layer are obtained from the translation model, the current training loss of the translation model can be calculated according to the target language sentence in the parallel bilingual sentence pair corresponding to the first predicted target language sentence and the parallel bilingual sentence, the target language sentence in the parallel bilingual sentence pair corresponding to the second predicted target language sentence and the noisy source language sentence, and the first feature vector and the second feature vector; if the first prediction probability distribution and the second prediction probability distribution, and the first feature vector and the second feature vector output by each hidden layer are obtained from the translation model at the same time, the current training loss of the translation model can be calculated according to the target language sentence in the parallel bilingual sentence pair corresponding to the first prediction target language sentence and the parallel bilingual sentence pair, the target language sentence in the parallel bilingual sentence pair corresponding to the second prediction target language sentence and the noisy source language sentence, the first feature vector and the second feature vector, and the first prediction probability distribution and the second prediction probability distribution.
According to the embodiment of the invention, the current training loss of the translation model is obtained by sequentially calculating the target language sentence in the parallel bilingual sentence pair corresponding to the first predicted target language sentence and the parallel bilingual sentence pair, the target language sentence in the parallel bilingual sentence pair corresponding to the second predicted target language sentence and the noisy source language sentence, and the first feature vector and the second feature vector and/or the first prediction probability distribution and the second prediction probability distribution through a preset loss function; or, calculating the target language sentence in the parallel bilingual sentence pair corresponding to the first predicted target language sentence and the parallel bilingual sentence pair, the target language sentence in the parallel bilingual sentence pair corresponding to the second predicted target language sentence and the noisy source language sentence, and the first feature vector and the second feature vector and/or the first prediction probability distribution and the second prediction probability distribution respectively through a preset loss function to obtain various training losses of the translation model, wherein the various training losses of the translation model can comprise losses of a predicted result and a real result, hidden layer losses and/or prediction probability distribution losses, and then carrying out weighted summation on various training losses of the translation model to obtain the current training losses of the translation model; the embodiment of the invention does not limit the implementation mode of determining the current training loss of the translation model based on the target language sentence in the first prediction target language sentence and parallel bilingual sentence pair, the target language sentence in the parallel bilingual sentence pair corresponding to the second prediction target language sentence and the noisy source language sentence, and the first feature vector and the second feature vector and/or the first prediction probability distribution and the second prediction probability distribution.
The preset loss function may be an existing loss function, and the type of the loss function for calculating the current training loss of the translation model is not limited in the embodiment of the present invention. When the various training losses of the translation model are calculated in advance and then the current training losses of the translation model are calculated through weighted summation, the various training losses of the translation model can be calculated by adopting the same loss function, or the various training losses of the translation model can be calculated by adopting different loss functions.
Step 104, adjusting parameters of the translation model based on the determined current training loss.
In the embodiment of the invention, after the current training loss of the translation model is obtained, parameters of the translation model can be adjusted through back propagation (Backward Propagation) processing according to the current training loss of the translation model until the translation model after the parameters are adjusted meets the preset convergence condition, and the training of the translation model is stopped. For example, the preset convergence condition may be a preset error value, and if the error of the translation model after the parameter adjustment is smaller than the preset error value, the training of the translation model is stopped; or the preset convergence condition may be preset iteration times, and if the iteration times of the translation model after the parameters are adjusted reach the preset iteration times, the training of the translation model is stopped.
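A hedged sketch of this stopping logic follows, assuming an illustrative error threshold and iteration limit, with a hypothetical `compute_current_training_loss` helper standing in for steps 101 to 103.

```python
# Hedged sketch of the convergence check: stop when the current training loss
# drops below a preset error value or a preset iteration count is reached.
# Both values are illustrative assumptions.
def train(model, optimizer, batches, max_iters=100_000, eps=1e-3):
    for step, batch in zip(range(max_iters), batches):
        loss = compute_current_training_loss(model, batch)  # hypothetical helper
        optimizer.zero_grad()
        loss.backward()    # back propagation
        optimizer.step()   # adjust the parameters of the translation model
        if loss.item() < eps:   # preset error value reached: stop training
            break
```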
It should be noted that the translation model of the embodiment of the present invention may be a translation model that translates between at least two languages. For any two mutually translated languages, parallel bilingual sentence pairs consisting of sentences of the two languages may be used to train the translation model; in such a pair, if the sentence of one language serves as the source language sentence, the sentence of the other language serves as the target language sentence.
According to the translation model training method provided by the embodiment of the invention, the source language sentence in a parallel bilingual sentence pair and the noisy source language sentence obtained by data enhancement of that sentence are input into the translation model separately, and two forward propagation passes are performed. The prediction probability distributions of the two passes and/or the feature vectors output by each hidden layer of the translation model are obtained, the current training loss of the translation model is determined based on the predicted target language sentences output by the two passes and the target language sentence in the parallel bilingual sentence pair, together with the obtained prediction probability distributions and/or hidden-layer feature vectors, and the parameters of the translation model are adjusted. The prediction probability distribution obtained from the noisy source language sentence and that obtained from the clean source language sentence constrain each other, and/or the hidden-layer feature vectors obtained from the two inputs constrain each other, so the translation model produces consistent outputs whether or not the source language sentence contains noise, which improves its robustness. Because no additional discriminator needs to be built or trained, the training method is simple and model training is stable.
Fig. 2 is a flow chart of a method for determining a current training loss of a translation model according to an embodiment of the present invention, where, as shown in fig. 2, the method at least includes:
step 201, determining a first training loss based on the target language sentence in the parallel bilingual sentence pair corresponding to the first predicted target language sentence and the parallel bilingual sentence pair, and the target language sentence in the parallel bilingual sentence pair corresponding to the second predicted target language sentence and the noisy source language sentence.
In an embodiment of the invention, the first training loss is used for representing a difference between a predicted result and a real result output by the translation model. The first training loss component may be determined based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair; then determining a second training loss component based on the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence; and finally, accumulating the first training loss component and the second training loss component to obtain the first training loss. Or, the second training loss component may be determined based on the target language sentence in the parallel bilingual sentence pair corresponding to the second predicted target language sentence and the noisy source language sentence; then determining a first training loss component based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair; and finally, accumulating the first training loss component and the second training loss component to obtain the first training loss. The embodiment of the invention does not limit the sequence of determining the first training loss component and the second training loss component.
The first training loss component and the second training loss component may be obtained by applying a preset first loss function to the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, and to the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, respectively. The preset first loss function may be an existing loss function, for example a Cross Entropy (CE) loss function; the embodiment of the present invention does not limit the type of the first loss function used to calculate the first training loss of the translation model.
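A minimal sketch of this first training loss, assuming token-level cross entropy over logits shaped (batch, sequence, vocabulary):

```python
import torch.nn.functional as F

# First training loss: CE of the clean-input prediction against the reference
# target sentence, plus CE of the noisy-input prediction against the same
# reference, accumulated as described above.
def first_training_loss(clean_logits, noisy_logits, tgt_ids):
    loss_clean = F.cross_entropy(clean_logits.transpose(1, 2), tgt_ids)  # 1st component
    loss_noisy = F.cross_entropy(noisy_logits.transpose(1, 2), tgt_ids)  # 2nd component
    return loss_clean + loss_noisy
```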
Step 202, determining a second training loss based on the first feature vector and the second feature vector; and/or determining a third training loss based on the first predictive probability distribution and the second predictive probability distribution.
In an embodiment of the invention, a second training penalty is used to characterize the difference between feature vectors of the hidden layer output of the translation model with and without noise, and a third training penalty is used to characterize the difference between the predictive probability distributions of the translation model output with and without noise. Under the condition that only the first characteristic vector and the second characteristic vector output by each hidden layer are obtained from the translation model, the third training loss component corresponding to each hidden layer in each hidden layer can be respectively determined firstly based on the first characteristic vector and the second characteristic vector of each hidden layer; and then accumulating the third training loss component of each hidden layer in the determined hidden layers to obtain a second training loss. In the case where only the first predictive probability distribution and the second predictive probability distribution are obtained from the translation model, the third training loss may be determined based on the first predictive probability distribution and the second predictive probability distribution. Under the condition that a first prediction probability distribution and a second prediction probability distribution, and a first feature vector and a second feature vector output by each hidden layer are obtained from a translation model at the same time, a second training loss can be determined firstly based on the first feature vector and the second feature vector; a third training loss is then determined based on the first predictive probability distribution and the second predictive probability distribution. Alternatively, the third training loss may be determined first based on the first predictive probability distribution and the second predictive probability distribution; a second training penalty is then determined based on the first feature vector and the second feature vector. The embodiment of the invention does not limit the sequence of determining the second training loss and the third training loss.
The first feature vector and the second feature vector of each hidden layer in each hidden layer can be calculated through a preset second loss function, so that a third training loss component corresponding to each hidden layer in each hidden layer is obtained; and calculating the first prediction probability distribution and the second prediction probability distribution through a preset third loss function to obtain a third training loss. The predetermined second loss function and the predetermined third loss function may be the same loss function, or may be different loss functions, which is not limited in the embodiment of the present invention. The preset second loss function and the preset third loss function may be existing loss functions, for example, the preset second loss function may be a mean square error (Mean Squared Error, abbreviated as MSE) loss function, the preset third loss function may be a KL divergence (Kullback-Leibler divergence) loss function, and the type of the second loss function for calculating the second training loss of the translation model and the type of the third loss function for calculating the third training loss of the translation model are not limited in the embodiment of the present invention.
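Under the example loss choices above (an MSE second loss function and a KL divergence third loss function), minimal sketches of the second and third training losses, assuming the clean and noisy feature lists are aligned layer by layer:

```python
import torch.nn.functional as F

# Second training loss: MSE between the feature vectors of corresponding
# hidden layers from the clean and noisy forward passes, accumulated.
def second_training_loss(clean_feats, noisy_feats):
    return sum(F.mse_loss(c, n) for c, n in zip(clean_feats, noisy_feats))

# Third training loss: KL divergence between the two prediction probability
# distributions. F.kl_div takes log-probabilities as input and probabilities
# as target, computing KL(target || input), here KL(clean || noisy).
def third_training_loss(clean_probs, noisy_probs):
    return F.kl_div(noisy_probs.log(), clean_probs, reduction="batchmean")
```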
Step 203, performing weighted summation on the first training loss, the second training loss and/or the third training loss to obtain a current training loss of the translation model.
In the embodiment of the invention, under the condition that the first feature vector and the second feature vector output by each hidden layer are only obtained from the translation model, after the first training loss and the second training loss of the translation model are obtained, the current training loss of the translation model can be obtained by carrying out weighted summation on the first training loss and the second training loss; under the condition that only the first prediction probability distribution and the second prediction probability distribution are obtained from the translation model, after the first training loss and the third training loss of the translation model are obtained, the current training loss of the translation model can be obtained by carrying out weighted summation on the first training loss and the third training loss; under the condition that a first prediction probability distribution and a second prediction probability distribution, and a first feature vector and a second feature vector output by each hidden layer are obtained from the translation model, after a first training loss, a second training loss and a third training loss of the translation model are obtained, the current training loss of the translation model can be obtained by carrying out weighted summation on the first training loss, the second training loss and the third training loss.
The weights of the first training loss, the second training loss and the third training loss may be set empirically; the weight of the first training loss should be greater than the weight of the second training loss and the weight of the third training loss.
According to the embodiment of the invention, the current training loss of the translation model is obtained by respectively calculating the various training losses of the translation model and carrying out weighted summation on the various training losses of the translation model, and the duty ratio of the various training losses in the current training loss of the translation model can be reasonably set through the weight, so that the noise robustness of the translation model can be improved on the basis of ensuring effective training of the translation model.
FIG. 3 is a schematic flow chart of a method for training a translation model according to another embodiment of the present invention. As shown in FIG. 3, clean data, i.e. the source language sentence in a parallel bilingual sentence pair, is input into the translation model for the first forward propagation pass: the first feature vector output by each hidden layer is obtained at each hidden layer of the encoder and decoder, and the first prediction probability distribution is obtained at the softmax layer of the translation model. Noisy data, i.e. the noisy source language sentence obtained by data enhancement of the source language sentence, is obtained by adding noise to the clean data and is input into the translation model for the second forward propagation pass: the second feature vector output by each hidden layer is obtained at each hidden layer of the encoder and decoder, and the second prediction probability distribution is obtained at the softmax layer. The data enhancement may replace characters based on a character recognition error comparison table, delete characters from the sentence-head or sentence-tail word, delete the punctuation mark at the tail of the sentence, or add a punctuation mark at the head of the sentence. The mean square error (MSE) loss between the first and second feature vectors of each hidden layer, the KL divergence loss between the first and second prediction probability distributions, and the cross entropy (CE) losses between the predicted and real results of the two forward passes are then calculated, and the three types of losses are weighted and summed to obtain the current training loss of the translation model. Finally, back propagation is performed according to the current training loss, and the parameters of the translation model are updated.
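Putting the sketches above together (the `TranslationModel`, `forward_with_features`, and the three loss functions), a hedged sketch of one full training step of the FIG. 3 procedure follows; the loss weights are illustrative assumptions chosen so that the first loss dominates, as stated above.

```python
import torch.nn.functional as F

# One training step of the FIG. 3 procedure: two forward passes (clean and
# noisy input), CE + MSE + KL losses, weighted summation, one backward pass.
# The weights w1 > w2, w3 are illustrative assumptions.
def train_step(model, optimizer, src_ids, noisy_src_ids, tgt_ids,
               w1=1.0, w2=0.1, w3=0.1):
    clean_logits, clean_feats = forward_with_features(model, src_ids, tgt_ids)
    noisy_logits, noisy_feats = forward_with_features(model, noisy_src_ids, tgt_ids)
    clean_probs = F.softmax(clean_logits, dim=-1)
    noisy_probs = F.softmax(noisy_logits, dim=-1)
    loss = (w1 * first_training_loss(clean_logits, noisy_logits, tgt_ids)
            + w2 * second_training_loss(clean_feats, noisy_feats)
            + w3 * third_training_loss(clean_probs, noisy_probs))
    optimizer.zero_grad()
    loss.backward()      # back propagation
    optimizer.step()     # update the parameters of the translation model
    return loss.item()
```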
According to the embodiment of the invention, the translation model learns from the noisy and the non-noisy data simultaneously, and its parameters are trained with the Mean Square Error (MSE) loss and the KL divergence loss, so that the hidden-layer feature vectors and the prediction probability distribution of the resulting translation model are insensitive to input noise. The robustness of the translation model to noise is thereby improved, and the translation quality obtained with the translation model is effectively guaranteed.
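The two-pass training step of FIG. 3 might look like the following PyTorch sketch. It assumes a model whose forward call returns both the output logits and the list of hidden-layer feature vectors of the encoder and decoder; this interface, the loss weights and all names are assumptions made for illustration:

    import torch
    import torch.nn.functional as F

    def robust_training_step(model, optimizer, clean_src, noisy_src, tgt):
        """One step of the FIG. 3 flow: two forward passes, weighted CE + MSE + KL."""
        optimizer.zero_grad()
        # First forward propagation: clean source sentence.
        logits_clean, hiddens_clean = model(clean_src, tgt)   # assumed interface
        # Second forward propagation: noisy source sentence.
        logits_noisy, hiddens_noisy = model(noisy_src, tgt)
        # First training loss: cross entropy of both predictions vs. the reference.
        ce = (F.cross_entropy(logits_clean.transpose(1, 2), tgt)
              + F.cross_entropy(logits_noisy.transpose(1, 2), tgt))
        # Second training loss: MSE between the feature vectors of each hidden layer.
        mse = sum(F.mse_loss(hc, hn)
                  for hc, hn in zip(hiddens_clean, hiddens_noisy))
        # Third training loss: KL divergence between the two probability distributions.
        kl = F.kl_div(F.log_softmax(logits_noisy, dim=-1),
                      F.softmax(logits_clean, dim=-1), reduction="batchmean")
        loss = 1.0 * ce + 0.5 * mse + 0.5 * kl    # illustrative weights
        loss.backward()
        optimizer.step()
        return loss.item()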
Because a translation model is trained on a large number of parallel bilingual texts as training samples and learns the mapping relation between the two languages, it is a data-driven model: the quantity, quality and form of the training samples have a great influence on its translation quality, and its behavior essentially remains a memorization and generalization of the data. At present, translation models are usually trained on a large number of general parallel bilingual texts, most of which are well-formed and grammatical, so a translation model obtained by such training can translate documents in most application scenes. Existing data enhancement methods, such as synonym replacement, random deletion, random exchange, random insertion and random masking, have a certain effect on improving the robustness of the translation model to noise in most application scenes.
However, in the scanning translation pen application scene, the noise in the input text of the translation model is mostly caused by irregular use of the scanning translation pen, such as OCR recognition errors, incomplete head and tail characters, missing punctuation and extra punctuation, as shown in Table 1. The noisy text obtained by enhancing general text with the existing data enhancement methods does not match the noise actually present in the input text of this application scene, so the existing data enhancement methods have a limited effect on improving the robustness of the translation model to noise in the scanning translation pen application scene.
TABLE 1
In view of this, for the noise introduced into the input text of the translation model by irregular use of the scanning translation pen, the embodiment of the invention provides a data enhancement method suited to the scanning translation pen application scene, and trains the translation model with the noisy text obtained by applying this data enhancement method to the input text, so as to effectively improve the robustness of the translation model to noise in this application scene. Fig. 4 is a flow chart of a method for enhancing a source language sentence to obtain a noisy source language sentence according to an embodiment of the present invention; as shown in Fig. 4, the method at least includes:
Step 401, replacing characters in the source language sentence based on the character recognition error comparison table to obtain a noisy source language sentence; wherein the character recognition error comparison table is obtained based on optical character recognition.
In the embodiment of the invention, the character recognition error comparison table can be obtained based on an optical character recognition model: a large number of picture samples are input into the optical character recognition model for character recognition, and the recognized text is compared at character level with the text labeled on the picture samples to obtain the character recognition error comparison table. Besides the correct characters and the wrong characters, the table may also record the frequency of occurrence of each character. After the table is obtained, characters in the source language sentence can be replaced with the wrong characters recorded in the table according to the recorded frequencies, thereby obtaining the noisy source language sentence. For Chinese, the table records the correct Chinese characters, the wrong Chinese characters and their frequencies of occurrence, so the Chinese characters in the source language sentence are replaced according to the table to obtain the noisy source language sentence; for English, the table records the correct English words, the wrong English words and their frequencies of occurrence, so the English words in the source language sentence are replaced according to the table to obtain the noisy source language sentence.
By using the character recognition error comparison table obtained through the optical character recognition model and replacing characters in the source language sentence according to their frequencies of occurrence, the resulting noisy source language sentence realistically simulates the noise produced by OCR recognition errors. The noisy source language sentence obtained through character replacement and the target language sentence of the corresponding parallel bilingual sentence pair form a new parallel bilingual sentence pair; training the translation model with this new pair enhances the robustness of the translation model to OCR recognition errors.
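A minimal sketch of building such a character recognition error comparison table, assuming pairs of recognized and labeled texts are available and using character-level alignment from Python's standard difflib; the table's data structure is an illustrative assumption:

    from collections import Counter
    from difflib import SequenceMatcher

    def build_error_table(samples):
        """Build a character recognition error comparison table.

        samples: iterable of (recognized_text, labeled_text) pairs obtained by
        running picture samples through the OCR model. Returns a mapping
        {correct_char: Counter({wrong_char: frequency})}.
        """
        table = {}
        for recognized, labeled in samples:
            matcher = SequenceMatcher(None, labeled, recognized)
            for tag, i1, i2, j1, j2 in matcher.get_opcodes():
                # Count only same-length substitutions as recognition errors.
                if tag == "replace" and (i2 - i1) == (j2 - j1):
                    for correct, wrong in zip(labeled[i1:i2], recognized[j1:j2]):
                        table.setdefault(correct, Counter())[wrong] += 1
        return table

For English, the same comparison can be run at word level by splitting both texts on whitespace before alignment.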
Step 402, deleting characters in the sentence-head or sentence-tail word of the source language sentence to obtain the noisy source language sentence.
In the embodiment of the invention, the number of characters to be deleted from the sentence-head or sentence-tail word of the source language sentence can be determined according to the number of characters that the word contains, and the characters are then deleted accordingly to obtain the noisy source language sentence. The number of contained characters and the number of characters to be deleted may satisfy a preset functional relation, from which the number of characters to delete is determined; alternatively, they may have a preset correspondence, from which the number of characters to delete is determined. The embodiment of the invention does not limit the way in which the number of characters to be deleted is determined.
Deleting characters in the sentence-head or sentence-tail word of the source language sentence realistically simulates the noise produced by incomplete head and tail characters. The noisy source language sentence obtained through character deletion and the target language sentence of the corresponding parallel bilingual sentence pair form a new parallel bilingual sentence pair; training the translation model with this new pair enhances the robustness of the translation model to incomplete head and tail characters.
Step 403, deleting the punctuation mark at the tail of the source language sentence, or adding a punctuation mark at the head of the source language sentence, to obtain the noisy source language sentence.
Because the translation model is sensitive to punctuation marks, when the tail of a source sentence lacks punctuation or carries extra punctuation, the translation produced by the translation model for the same source sentence can vary considerably. Although no serious error occurs in the translation (for example, a near-synonym may be substituted while the meaning of the sentence remains unchanged), this obviously cannot meet application scenes with high requirements on translation accuracy, such as learning and education. In the embodiment of the invention, the punctuation mark at the end of the source language sentence is deleted, or a punctuation mark is added at its head, to obtain the noisy source language sentence; the deleted or added punctuation mark may be a comma, a full stop, a question mark, an exclamation mark, etc., and the embodiment of the invention does not limit the type of punctuation mark deleted or added.
Deleting the punctuation mark at the tail of the source language sentence, or adding one at its head, realistically simulates the noise produced by head and tail punctuation problems. The noisy source language sentence obtained by deleting or adding punctuation and the target language sentence of the corresponding parallel bilingual sentence pair form a new parallel bilingual sentence pair; training the translation model with this new pair enhances the robustness of the translation model to head and tail punctuation problems.
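A minimal sketch of this punctuation-based enhancement; the punctuation inventories and the even split between deletion and insertion are illustrative assumptions:

    import random

    TAIL_PUNCT = ".?!,。？！，"
    HEAD_PUNCT = [",", ".", "，", "。"]

    def punctuation_noise(sentence):
        """Delete the sentence-tail punctuation mark, or add one at the head."""
        if sentence and sentence[-1] in TAIL_PUNCT and random.random() < 0.5:
            return sentence[:-1]                     # missing tail punctuation
        return random.choice(HEAD_PUNCT) + sentence  # stray head punctuation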
It should be noted that steps 401, 402 and 403 may be performed simultaneously or in a certain order, which is not limited by the embodiment of the present invention. According to the specific application scenario, at least one of steps 401, 402 and 403 may be used to enhance the source language sentence to obtain the noisy source language sentence, which is likewise not limited by the embodiment of the present invention.
According to the embodiment of the invention, a data enhancement method suited to the scanning translation pen application scene is provided for the noise problems of the input text in that scene; the input text of the translation model is enhanced with this method to obtain noisy text, and the noisy text is used to train the translation model, thereby improving the robustness of the translation model to the noise of this application scene.
FIG. 5 is a flowchart of a method for replacing characters in a source language sentence to obtain a noisy source language sentence according to an embodiment of the present invention, where, as shown in FIG. 5, the method at least includes:
step 501, counting the number of characters in the source language sentence, and determining the number of characters to be replaced in the source language sentence according to a preset proportion.
In the embodiment of the present invention, the total number of characters in the source language sentence is obtained by counting, and the number of characters to be replaced is calculated from a preset proportion and this total, for example a preset proportion of 5%. The value of the preset proportion may be determined from statistics on the erroneous characters in OCR recognition errors.
For Chinese, the number of Chinese characters in the source language sentence is counted, and the number of Chinese characters to be replaced is determined according to the preset proportion; for English, the number of English words in the source language sentence is counted, and the number of English words to be replaced is determined according to the preset proportion.
Step 502, determining the character to be replaced in the source language sentence based on the frequency of occurrence of the character in the source language sentence in the character recognition error comparison table and the determined number of characters to be replaced.
In the embodiment of the invention, the character recognition error comparison table is queried for every character in the source language sentence, yielding the recorded frequency of occurrence of each character; the characters to be replaced are then determined from these frequencies together with the determined number of characters to be replaced. For example, the characters may be sorted by their frequencies of occurrence from high to low, and the top-ranked characters selected as the characters to be replaced according to the determined number.
For Chinese, determining the Chinese characters to be replaced in the source language sentence based on the occurrence frequency of the Chinese characters in the source language sentence in the character recognition error comparison table and the determined number of the Chinese characters to be replaced; for English, determining English words to be replaced in the source language sentence based on the frequency of English words in the source language sentence in the character recognition error comparison table and the determined number of English words to be replaced.
Step 503, based on the determined character to be replaced, obtaining the corresponding error character from the character recognition error comparison table, and replacing the character in the source language sentence to obtain the noisy source language sentence.
In the embodiment of the invention, the character recognition error comparison table can be queried according to the character to be replaced in the determined source language sentence, the corresponding error character recorded in the character recognition error comparison table is obtained, and the corresponding character in the source language sentence is replaced according to the obtained error character, so that the noisy source language sentence is obtained.
For Chinese, based on the determined Chinese character to be replaced, obtaining a corresponding error Chinese character from a character recognition error comparison table, and replacing the Chinese character in the source language sentence to obtain a noisy source language sentence; for English, based on the determined English word to be replaced, obtaining the corresponding error English word from the character recognition error comparison table, and replacing the English word in the source language sentence to obtain the noisy source language sentence.
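Steps 501 and 502 might be sketched as follows, reusing the error-table structure assumed earlier; the 5% default follows the example above, and the ranking-by-frequency selection is one simple reading of step 502:

    import math

    def select_chars_to_replace(sentence, error_table, ratio=0.05):
        """Steps 501-502: determine which characters to replace.

        error_table is the {correct_char: Counter} mapping sketched earlier;
        ratio follows the 5% example. Characters are ranked by how often
        they are mis-recognized and the top-ranked ones are selected.
        """
        n_replace = math.ceil(ratio * len(sentence))
        scored = [(sum(error_table.get(ch, {}).values()), i, ch)
                  for i, ch in enumerate(sentence)]
        scored.sort(reverse=True)            # highest error frequency first
        return [(i, ch) for freq, i, ch in scored[:n_replace] if freq > 0]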
FIG. 6 is a flowchart of a method for obtaining error characters from the character recognition error comparison table to replace characters in the source language sentence according to an embodiment of the present invention; as shown in FIG. 6, the method at least includes:
Step 601, determining whether the determined character to be replaced corresponds to two or more error characters in the character recognition error comparison table.
If the determined character to be replaced corresponds to two or more error characters in the character recognition error comparison table, executing step 602; if it corresponds to only one error character, executing step 603.
Step 602, determining the error character for replacement based on the frequency of occurrence of the corresponding two or more error characters in the character recognition error comparison table.
Step 603, determining the single error character corresponding to it in the character recognition error comparison table as the error character for replacement.
In the embodiment of the invention, after the characters to be replaced in the source language sentence are determined, whether each determined character to be replaced corresponds to more than two error characters in the character recognition error comparison table or not can be judged by inquiring the character recognition error comparison table.
If the determined character to be replaced corresponds to two or more error characters in the character recognition error comparison table, the error character for replacement is determined according to the frequencies of occurrence of those error characters. For example, the frequencies of the corresponding error characters may be sorted from high to low, and the top-ranked character selected as the error character for replacement.
If the determined character to be replaced does not correspond to two or more error characters in the character recognition error comparison table, that is, it corresponds to exactly one error character, that error character is determined as the error character for replacement.
Step 604, the determined error characters for replacement are obtained from the character recognition error comparison table, and the characters in the source language sentence are replaced, so that the noisy source language sentence is obtained.
In the embodiment of the invention, after the error characters for replacement are determined according to the character recognition error comparison table, the determined error characters for replacement can be obtained from the character recognition error comparison table, and the corresponding characters in the source language sentence are replaced to obtain the noisy source language sentence.
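Continuing the sketch, steps 601 to 604 can be realized by picking, for each selected character, the most frequent error character recorded in the table (one straightforward reading of step 602); the Counter-based table from the earlier sketch is assumed:

    def replace_with_error_chars(sentence, error_table, to_replace):
        """Steps 601-604: substitute each selected character with an error character."""
        chars = list(sentence)
        for i, ch in to_replace:
            candidates = error_table.get(ch)
            if not candidates:
                continue
            # With two or more recorded error characters, take the most frequent
            # one (step 602); with exactly one, most_common(1) returns it (step 603).
            wrong, _ = candidates.most_common(1)[0]
            chars[i] = wrong                        # step 604
        return "".join(chars)

Used together with the earlier helpers: noisy = replace_with_error_chars(src, table, select_chars_to_replace(src, table)).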
Fig. 7 is a schematic diagram of training a translation model with noisy text obtained through data enhancement by character replacement. As shown in Fig. 7, a large number of picture samples are recognized with an OCR engine, and the recognized text is compared with the manually labeled OCR text of the picture samples to obtain the character recognition error comparison table. The word "love" in the source language sentence "I love an apple" is then replaced with the erroneous English word "live" recorded in the character recognition error comparison table, giving the noisy source language sentence "I live an apple", which is paired with the target language sentence of "I love an apple" to train the translation model.
Fig. 8 is a flow chart of a method for deleting characters in a source language sentence to obtain a noisy source language sentence according to an embodiment of the present invention, where, as shown in fig. 8, the method at least includes:
step 801, determining the number of characters contained in words at the beginning or end of a sentence in a source language sentence.
Step 802, determining whether the number of characters contained in the determined words of the beginning or end of the sentence accords with the preset number.
If the number of characters contained in the determined words of the sentence head or the sentence tail accords with the preset number, executing step 803; if the number of characters contained in the words at the head or tail of the sentence does not accord with the preset number, ending the method.
Step 803, determining the number of characters to be deleted in words at the beginning or end of the sentence in the source language sentence.
In the embodiment of the invention, the number of characters contained in the sentence-head or sentence-tail word of the source language sentence is determined, and it is judged whether that number meets the preset number. If it does, characters in the sentence-head or sentence-tail word are deleted, thereby obtaining the noisy source language sentence; if it does not, no characters are deleted and processing of the source language sentence ends.
Because an English word generally contains many characters while a Chinese word contains few, and Chinese must first be segmented into words, the incompleteness of head and tail characters differs greatly between Chinese and English; therefore, when performing data enhancement by deleting characters in the sentence-head or sentence-tail word, the preset number needs to be set according to the language of the source language sentence. For example, for English words the preset number is greater than 5 characters, that is, characters are deleted from the sentence-head or sentence-tail English word only when it contains more than 5 characters; for Chinese words the preset number is greater than or equal to 2 characters, that is, characters are deleted from the sentence-head or sentence-tail Chinese word only when it contains at least 2 characters. An existing word segmentation tool, such as the Language Technology Platform (LTP), may be used for Chinese word segmentation; the embodiment of the present invention does not limit the word segmentation tool used.
In the embodiment of the invention, after judging whether the number of characters in the sentence-head or sentence-tail word meets the preset number, the number of characters to be deleted from that word can be determined. If the language of the source language sentence is English, analysis shows that real head and tail character incompleteness approximately follows a Gaussian distribution, so the number of characters to be deleted can be determined from a Gaussian distribution to match the actual scanning translation pen application scene; for example, the mean of the Gaussian distribution may be 2 and the variance 1, with non-integer values rounded and negative values set to zero. If the language of the source language sentence is Chinese, the number of characters to be deleted from the sentence-head or sentence-tail word can be set to one character.
Step 804, deleting the characters in the words of the sentence head or the sentence tail in the source language sentence based on the determined number of the characters to be deleted in the words of the sentence head or the sentence tail, so as to obtain the noisy source language sentence.
In the embodiment of the invention, after the number of characters to be deleted in the words of the sentence head or the sentence tail is determined, the corresponding number of characters in the words of the sentence head in the source language sentence or the corresponding number of characters in the words of the sentence tail in the source language sentence can be deleted according to the determined number of characters to be deleted in the words of the sentence head or the sentence tail, so that the noisy source language sentence is obtained.
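A sketch of the head/tail character deletion of FIG. 8, under the thresholds and Gaussian parameters given above; treating the sentence as a pre-segmented word list and keeping at least one character of the word are assumptions made for illustration:

    import random

    def delete_edge_chars(words, is_english=True, at_head=True):
        """FIG. 8: delete characters in the sentence-head or sentence-tail word.

        words: the sentence as a list of words (Chinese is assumed to be
        pre-segmented, e.g. with LTP). Returns a new word list.
        """
        idx = 0 if at_head else -1
        word = words[idx]
        if is_english:
            if len(word) <= 5:                 # preset number: more than 5 chars
                return words
            # Deleted-character count ~ Gaussian(mean 2, variance 1),
            # rounded to an integer, negative values set to zero.
            k = max(0, round(random.gauss(2, 1)))
            k = min(k, len(word) - 1)          # assumption: keep one character
        else:
            if len(word) < 2:                  # preset number: at least 2 chars
                return words
            k = 1                              # Chinese: delete one character
        out = list(words)
        out[idx] = word[k:] if at_head else word[:len(word) - k]
        return out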
The embodiment of the invention can also perform conventional training on the translation model before the training that improves its noise robustness; the robustness training is then performed on the translation model obtained after the conventional training has converged. Therefore, the embodiment of the invention also provides a translation model training method comprising two stages: a first, conventional training stage and a second, enhanced training stage for improving robustness. Fig. 9 is a flow chart of a translation model training method according to another embodiment of the present invention; as shown in Fig. 9, the method at least includes:
Step 901, inputting a source language text sample into the translation model to obtain the predicted target language text output by the translation model; wherein the source language text sample includes the source language sentences in the parallel bilingual sentence pairs and the noisy source language sentences.
Step 902, determining a current training loss of the translation model based on the predicted target language text and the real target language text of the source language text sample; wherein the real target language text comprises target language sentences in parallel bilingual sentence pairs.
Step 903, adjusting parameters of the translation model based on the determined current training loss.
In the embodiment of the present invention, steps 901, 902 and 903 constitute the first-stage conventional training. The source language text sample may include the source language sentences in the parallel bilingual sentence pairs and the noisy source language sentences, where the noisy source language sentences may be obtained by performing data enhancement on the source language sentences; for example, the source language sentences and noisy source language sentences of step 901 may be those of the second-stage steps 904 and 905, respectively. The real target language text may include the target language sentences in the parallel bilingual sentence pairs, for example the target language sentences corresponding to the source language sentences of step 904 and to the noisy source language sentences of step 905. A preset loss function may be used to compare the predicted target language text with the real target language text of the source language text sample to obtain the current training loss of the translation model; the embodiment of the present invention does not limit the type of this loss function, which may be, for example, a cross entropy loss function.
Step 904, inputting the source language sentence in the parallel bilingual sentence pair into the translation model to obtain a first prediction target language sentence output by the translation model, and obtaining a first prediction probability distribution of the translation model and/or a first feature vector output by each hidden layer of the translation model.
Step 905, inputting the noisy source language sentence into the translation model to obtain a second predicted target language sentence output by the translation model, and obtaining a second predicted probability distribution of the translation model and/or a second feature vector output by each hidden layer of the translation model; the noisy source language sentence is obtained based on data enhancement of the source language sentence.
Step 906, determining a current training loss of the translation model based on the target language sentence in the parallel bilingual sentence pair corresponding to the first predicted target language sentence and the parallel bilingual sentence pair, the target language sentence in the parallel bilingual sentence pair corresponding to the second predicted target language sentence and the noisy source language sentence, and the first feature vector and the second feature vector and/or the first prediction probability distribution and the second prediction probability distribution.
Step 907, based on the determined current training loss, adjusts parameters of the translation model.
In an embodiment of the present invention, steps 904, 905, 906 and 907 constitute the second-stage robustness enhancement training. For the description of these steps, reference may be made to the descriptions of steps 101, 102, 103 and 104 in Fig. 1, which are not repeated here.
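The two-stage schedule might be organized as in the following sketch, which reuses the robust_training_step sketched earlier; the stage-1 step, the loader interfaces and the epoch counts are illustrative assumptions:

    import torch.nn.functional as F

    def conventional_step(model, optimizer, src, tgt):
        """Stage-1 step (902): plain cross entropy on the predicted target text."""
        optimizer.zero_grad()
        logits, _ = model(src, tgt)                    # same assumed interface
        loss = F.cross_entropy(logits.transpose(1, 2), tgt)
        loss.backward()
        optimizer.step()

    def train_two_stage(model, optimizer, stage1_loader, stage2_loader,
                        stage1_epochs=10, stage2_epochs=5):
        """Conventional training first, then robustness-enhancing training."""
        # Stage 1 (steps 901-903): clean sentences, noisy sentences and clause
        # fragments are all treated as ordinary source language text samples.
        for _ in range(stage1_epochs):
            for src, tgt in stage1_loader:
                conventional_step(model, optimizer, src, tgt)
        # Stage 2 (steps 904-907): start from the converged model and apply
        # the two-pass loss (see robust_training_step above).
        for _ in range(stage2_epochs):
            for clean_src, noisy_src, tgt in stage2_loader:
                robust_training_step(model, optimizer, clean_src, noisy_src, tgt)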
In some alternative examples, the source language text sample may further include clause fragments of the source language sentence in the parallel bilingual sentence pair, and the real target language text may further include clause fragments of the target language sentence in the parallel bilingual sentence pair corresponding to the clause fragments of the source language sentence.
FIG. 10 is a flowchart of a method for obtaining clause fragments of the source language sentence and the target language sentence of a parallel bilingual sentence pair according to an embodiment of the present invention; as shown in FIG. 10, the method at least includes:
step 1001, word alignment is performed on the source language sentence and the target language sentence in the parallel bilingual sentence pair.
In general, the source language text and the target language text in the parallel bilingual text used to train a translation model form sentence pairs by sentence alignment, namely parallel bilingual sentence pairs. In the scanning translation pen application scene, irregular use of the pen can lead to incompletely scanned sentences whose grammatical structures lack subjects, objects, predicates and the like, as shown in Table 2, so that sentence fragments appear in the input text of the translation model; the translation effect of the translation model on such sentence fragments therefore needs to be enhanced.
TABLE 2
To improve the translation effect of the translation model on sentence fragments, the embodiment of the invention provides a data enhancement method that performs word alignment on complete sentence pairs and extracts clause fragments from the aligned pairs. A word alignment tool may be used to align the source language sentence and the target language sentence of a parallel bilingual sentence pair; for example, the open-source mgiza++ tool may be used.
Step 1002, extracting clause fragments of the source language sentence and clause fragments corresponding to the clause fragments of the source language sentence in the target language sentence based on the word-aligned source language sentence and target language sentence.
In the embodiment of the invention, after word alignment is performed on a parallel bilingual sentence pair, clause fragments of the source language sentence and the corresponding clause fragments of the target language sentence can be extracted from the word-aligned sentences, thereby obtaining sentence fragments for training the translation model. Short clause fragments, i.e. fragments with no punctuation mark in the middle, can be extracted from the word-aligned source and target language sentences. For example, the short clause fragment of the source language sentence "continuously developing" corresponds to the target-language fragment "sustainable development"; the source fragment "service enterprise" corresponds to "Service enterprises"; and the source fragment "their wedding commemorative day" corresponds to "their wedding anniversary". Long clause fragments, i.e. fragments containing punctuation marks in the middle, for example 1 to 2 such marks, can also be extracted. For example, the long clause fragment of the source language sentence "corresponding countermeasures, thereby realizing sustainable development of the travel industry" corresponds to "corresponding measures to achieve the sustainable development of tourism"; the source fragment "service enterprises and service items" corresponds to "Service enterprises and service projects"; and the source fragment "mediated on their wedding anniversary, in line with the deep craving of the parties" corresponds to "mediation on their wedding anniversary is in line with the deep desire of the parties".
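A sketch of extracting an aligned clause fragment pair from word-aligned sentences; the link format and the consistency check (no target word inside the fragment may align outside the chosen source span, the usual phrase-extraction condition) are assumptions, since the embodiment does not fix them:

    def extract_fragment(src_tokens, tgt_tokens, links, s_start, s_end):
        """Extract the target clause fragment aligned to src_tokens[s_start:s_end].

        links: set of (src_index, tgt_index) word-alignment pairs, e.g. from an
        aligner such as mgiza++. Returns (src_fragment, tgt_fragment), or None
        when the pair is not consistently aligned.
        """
        tgt_idx = [j for i, j in links if s_start <= i < s_end]
        if not tgt_idx:
            return None
        t_start, t_end = min(tgt_idx), max(tgt_idx) + 1
        # Consistency: no word inside the target span may align to a source
        # word outside the chosen source span.
        for i, j in links:
            if t_start <= j < t_end and not (s_start <= i < s_end):
                return None
        return (" ".join(src_tokens[s_start:s_end]),
                " ".join(tgt_tokens[t_start:t_end]))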
Based on any of the foregoing embodiments, fig. 11 is a schematic flow chart of a translation method according to an embodiment of the present invention, where the method at least includes:
Step 1101, collecting an image of the text to be translated, and performing text recognition on the collected image to obtain the source language text.
In the embodiment of the invention, the image of the text to be translated can be acquired through the image acquisition equipment, and the acquired image of the text to be translated is subjected to character recognition by utilizing the optical character recognition model to obtain the source language text. For example, the image capturing device may be a video camera, a still camera, a scanner, or the like, which is not limited by the embodiment of the present invention.
And step 1102, translating the source language text through the translation model to obtain the target language text.
In the embodiment of the present invention, the translation model may be trained based on the translation model training method provided in any one of the above embodiments.
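For illustration, the inference pipeline of steps 1101 and 1102 reduces to the following sketch; the engine and model interfaces are assumed names, not APIs fixed by the embodiment:

    def scan_and_translate(image, ocr_model, translation_model):
        """Scanning-pen inference: recognize the captured image, then translate."""
        source_text = ocr_model.recognize(image)                 # step 1101
        return translation_model.translate(source_text)          # step 1102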
According to the translation method provided by the embodiment of the invention, when the translation model is trained, the prediction probability distribution obtained from the noisy source language sentence and the one obtained from the source language sentence constrain each other, and/or the feature vectors output by each hidden layer for the noisy and the clean source language sentence constrain each other, so that the translation model learns similar prediction probability distributions and/or feature vectors whether or not the source language sentence is noisy. The translation model is thus insensitive to noise in the source language sentence, its robustness to noise is improved, and the translation quality obtained with the translation model can be effectively guaranteed.
Based on the translation model training method provided by any one of the embodiments, the embodiment of the present invention further provides a translation model training device, and fig. 12 is a schematic structural diagram of the translation model training device provided by the embodiment of the present invention, and as shown in fig. 12, the translation model training device at least includes:
the first prediction module 1210 is configured to input a source language sentence in a parallel bilingual sentence pair into a translation model, obtain a first predicted target language sentence output by the translation model, and obtain a first prediction probability distribution of the translation model and/or a first feature vector output by each hidden layer of the translation model.
The second prediction module 1220 is configured to input the noisy source language sentence into the translation model, obtain a second predicted target language sentence output by the translation model, and obtain a second predicted probability distribution of the translation model and/or a second feature vector output by each hidden layer of the translation model; the noisy source language sentence is obtained based on data enhancement of the source language sentence.
The loss calculation module 1230 is configured to determine a current training loss of the translation model based on the target language sentence in the pair of the first predicted target language sentence and the parallel bilingual sentence, the target language sentence in the pair of the parallel bilingual sentence corresponding to the second predicted target language sentence and the noisy source language sentence, and the first feature vector and the second feature vector and/or the first prediction probability distribution and the second prediction probability distribution.
A parameter adjustment module 1240 for adjusting parameters of the translation model based on the determined current training loss.
According to the translation model training device provided by the embodiment of the invention, the source language sentences in the parallel bilingual sentence pairs and the noisy source language sentences obtained by data enhancement of those sentences are respectively input into the translation model for two forward propagations, and the prediction probability distributions of the translation model and/or the feature vectors output by each hidden layer are obtained for both propagations. The current training loss of the translation model is determined based on the predicted target language sentences output in the two forward propagations and the target language sentences in the parallel bilingual sentence pairs, together with the obtained prediction probability distributions and/or hidden-layer feature vectors, and the parameters of the translation model are adjusted accordingly. The prediction probability distribution obtained from the noisy source language sentence and the one obtained from the clean source language sentence constrain each other, and/or the feature vectors output by each hidden layer for the noisy and the clean sentence constrain each other, so that the translation model learns similar prediction probability distributions and/or feature vectors whether or not the source language sentence is noisy; the model thus becomes insensitive to input noise and its robustness to noise is improved. Moreover, no additional model is required to distinguish noisy from clean input, so the training method is simple and the model training is stable.
Based on any of the above embodiments, the loss calculation module 1230 includes:
the first loss calculation unit is used for determining a first training loss based on the target language sentence in the parallel bilingual sentence pair corresponding to the first predicted target language sentence and the parallel bilingual sentence pair and the target language sentence in the parallel bilingual sentence pair corresponding to the second predicted target language sentence and the noisy source language sentence.
A second loss calculation unit configured to determine a second training loss based on the first feature vector and the second feature vector; and/or,
a third loss calculation unit configured to determine a third training loss based on the first predictive probability distribution and the second predictive probability distribution;
and the training loss calculation unit is used for carrying out weighted summation on the first training loss and the second training loss and/or the third training loss to obtain the current training loss of the translation model.
Based on any of the above embodiments, the first loss calculation unit includes:
a first loss component calculation subunit configured to determine a first training loss component based on the first predicted target language sentence and a target language sentence in the parallel bilingual sentence pair;
a second loss component calculation subunit, configured to determine a second training loss component based on the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence;
And the first loss calculation subunit is used for accumulating the first training loss component and the second training loss component to obtain the first training loss.
Based on any of the above embodiments, the second loss calculation unit includes:
a third loss component calculation subunit, configured to determine, for each hidden layer, the training loss component corresponding to that hidden layer based on the first feature vector and the second feature vector of that hidden layer;
and the second loss calculation subunit is used for accumulating the determined training loss components of all the hidden layers to obtain the second training loss.
Based on any of the above embodiments, the translation model training device further includes:
the first data enhancement module is used for replacing characters in the source language sentence based on a character recognition error comparison table to obtain the noisy source language sentence; wherein the character recognition error comparison table is obtained based on optical character recognition; and/or,
the second data enhancement module is used for deleting characters in sentence head or sentence tail words in the source language sentence to obtain the noisy source language sentence; and/or,
And the third data enhancement module is used for deleting punctuation marks at the tail of the source language sentence or adding a punctuation mark at the sentence head of the source language sentence to obtain the noisy source language sentence.
Based on any of the above embodiments, the first data enhancement module includes:
the character quantity counting unit is used for counting the character quantity in the source language sentence and determining the character quantity to be replaced in the source language sentence according to a preset proportion;
a to-be-replaced character determining unit, configured to determine a character to be replaced in the source language sentence based on a frequency of occurrence of the character in the source language sentence in the character recognition error comparison table and the determined number of characters to be replaced;
and the character replacement unit is used for acquiring corresponding error characters from the character recognition error comparison table based on the determined characters to be replaced, and replacing the characters in the source language sentence to obtain the noisy source language sentence.
Based on any of the above embodiments, the character replacement unit is configured to:
if the determined character to be replaced corresponds to more than two error characters in the character recognition error comparison table, determining the error characters for replacement based on the frequency of occurrence of the more than two error characters corresponding to the character recognition error comparison table;
And acquiring the determined error characters for replacement from the character recognition error comparison table, and replacing the characters in the source language sentence to obtain the noisy source language sentence.
Based on any of the above embodiments, the second data enhancement module includes:
a character number calculating unit, configured to determine the number of characters contained in words at the beginning or end of a sentence in the source language sentence;
the to-be-deleted character number determining unit is used for determining the number of characters to be deleted in the sentence-head or sentence-tail word of the source language sentence if the number of characters contained in that word meets the preset number; wherein the preset number is set according to the language of the source language sentence;
and the character deleting unit is used for deleting the characters in the words of the sentence head or the sentence tail in the source language sentence based on the determined number of the characters to be deleted in the words of the sentence head or the sentence tail, so as to obtain the noisy source language sentence.
Based on any of the above embodiments, the to-be-deleted character number determining unit is configured to:
if the language of the source language sentence is English, determining the number of characters to be deleted in words at the head or tail of the sentence in the source language sentence based on Gaussian distribution;
And if the language of the source language sentence is Chinese, determining the number of characters to be deleted in the word at the head or tail of the sentence in the source language sentence as one character.
Based on any of the above embodiments, the translation model training device further includes:
the third prediction module is used for inputting the source language text sample into the translation model to obtain a predicted target language text output by the translation model; the source language text sample comprises the source language sentences in the parallel bilingual sentence pairs and the noisy source language sentences;
the loss calculation module is further configured to determine a current training loss of the translation model based on the predicted target language text and the real target language text of the source language text sample; wherein the real target language text comprises the target language sentences in the parallel bilingual sentence pairs;
the parameter adjustment module is further configured to adjust parameters of the translation model based on the determined current training loss.
Based on any of the above embodiments, the source language text sample further includes clause fragments of the source language sentences in the parallel bilingual sentence pairs, and the real target language text further includes the clause fragments of the target language sentences corresponding to the clause fragments of the source language sentences.
Based on any of the above embodiments, the translation model training device further includes:
a fourth data enhancement module, configured to perform word alignment on the source language sentence and the target language sentence in the parallel bilingual sentence pair; and extracting clause fragments of the source language sentence and clause fragments corresponding to the clause fragments of the source language sentence in the target language sentence based on the source language sentence and the target language sentence with the aligned words.
Based on any of the above embodiments, the fourth data enhancement module is configured to:
extracting short clause fragments of the source language sentence and short clause fragments of the target language sentence based on the word-aligned source language sentence and target language sentence; wherein, the middle of the short clause segment has no punctuation mark;
extracting long clause fragments of the source language sentence and long clause fragments of the target language sentence based on the word-aligned source language sentence and target language sentence; and punctuation marks are included in the middle of the long clause segment.
Based on the translation method provided by any of the foregoing embodiments, the embodiment of the present invention further provides a translation device, and fig. 13 is a schematic structural diagram of the translation device provided by the embodiment of the present invention, where, as shown in fig. 13, the translation device at least includes:
The text recognition module 1310 is configured to collect an image of the text to be translated, and perform text recognition on the collected image of the text to be translated to obtain a source language text.
A machine translation module 1320, configured to translate the source language text through a translation model to obtain a target language text; the translation model is obtained by training based on the translation model training method provided by any embodiment.
According to the translation device provided by the embodiment of the invention, the translation model learns similar prediction probability distributions and/or feature vectors during training whether or not the source language sentence is noisy, making the translation model insensitive to noise in the source language sentence; the robustness of the translation model to noise is thereby improved, and the translation quality obtained with the translation model can be effectively guaranteed.
Fig. 14 illustrates a physical structure diagram of an electronic device, as shown in fig. 14, which may include: processor 1410, communication interface (Communications Interface) 1420, memory 1430 and communication bus 1440, wherein processor 1410, communication interface 1420 and memory 1430 communicate with each other via communication bus 1440. The processor 1410 may call logic instructions in the memory 1430 to perform the following method: inputting a source language sentence in a parallel bilingual sentence pair into a translation model to obtain a first prediction target language sentence output by the translation model, and obtaining a first prediction probability distribution of the translation model and/or a first feature vector output by each hidden layer of the translation model; inputting the noisy source language sentence into the translation model to obtain a second predicted target language sentence output by the translation model, and obtaining a second predicted probability distribution of the translation model and/or a second feature vector output by each hidden layer of the translation model; the noisy source language sentence is obtained based on data enhancement of the source language sentence; determining a current training loss of the translation model based on the target language sentence in the first predicted target language sentence and the parallel bilingual sentence pair, the target language sentence in the parallel bilingual sentence pair corresponding to the second predicted target language sentence and the noisy source language sentence, and the first feature vector and the second feature vector and/or the first predictive probability distribution and the second predictive probability distribution; based on the determined current training loss, parameters of the translation model are adjusted.
Further, the processor 1410 may call logic instructions in the memory 1430 to perform the following method: collecting an image of a text to be translated, and performing text recognition on the collected image of the text to be translated to obtain a source language text; translating the source language text through a translation model to obtain a target language text; the translation model is obtained by training based on a translation model training method.
In addition, the logic instructions in the memory 1430 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
Embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, are capable of performing the methods provided by the above-described method embodiments, for example comprising: inputting a source language sentence in a parallel bilingual sentence pair into a translation model to obtain a first prediction target language sentence output by the translation model, and obtaining a first prediction probability distribution of the translation model and/or a first feature vector output by each hidden layer of the translation model; inputting the noisy source language sentence into the translation model to obtain a second predicted target language sentence output by the translation model, and obtaining a second predicted probability distribution of the translation model and/or a second feature vector output by each hidden layer of the translation model; the noisy source language sentence is obtained based on data enhancement of the source language sentence; determining a current training loss of the translation model based on the target language sentence in the first predicted target language sentence and the parallel bilingual sentence pair, the target language sentence in the parallel bilingual sentence pair corresponding to the second predicted target language sentence and the noisy source language sentence, and the first feature vector and the second feature vector and/or the first predictive probability distribution and the second predictive probability distribution; based on the determined current training loss, parameters of the translation model are adjusted.
Embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, are capable of performing the methods provided by the above-described method embodiments, for example comprising: collecting an image of a text to be translated, and performing text recognition on the collected image of the text to be translated to obtain a source language text; translating the source language text through a translation model to obtain a target language text; the translation model is obtained by training based on a translation model training method.
Embodiments of the present invention also provide a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the methods provided by the above embodiments, for example comprising: inputting a source language sentence in a parallel bilingual sentence pair into a translation model to obtain a first predicted target language sentence output by the translation model, and obtaining a first prediction probability distribution of the translation model and/or a first feature vector output by each hidden layer of the translation model; inputting a noisy source language sentence into the translation model to obtain a second predicted target language sentence output by the translation model, and obtaining a second prediction probability distribution of the translation model and/or a second feature vector output by each hidden layer of the translation model, the noisy source language sentence being obtained by data enhancement of the source language sentence; determining a current training loss of the translation model based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, and the first feature vector and the second feature vector and/or the first prediction probability distribution and the second prediction probability distribution; and adjusting parameters of the translation model based on the determined current training loss.
Embodiments of the present invention also provide a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the methods provided by the above embodiments, for example comprising: collecting an image of a text to be translated, and performing text recognition on the collected image of the text to be translated to obtain a source language text; and translating the source language text through a translation model to obtain a target language text, the translation model being trained based on the translation model training method described above.
The apparatus embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or, of course, by means of hardware. Based on this understanding, the foregoing technical solution, in essence, or the part thereof contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which comprises several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (18)

1. A method for training a translation model, comprising:
inputting a source language sentence in a parallel bilingual sentence pair into a translation model to obtain a first predicted target language sentence output by the translation model, and obtaining a first prediction probability distribution of the translation model and/or a first feature vector output by each hidden layer of the translation model;
inputting the noisy source language sentence into the translation model to obtain a second predicted target language sentence output by the translation model, and obtaining a second predicted probability distribution of the translation model and/or a second feature vector output by each hidden layer of the translation model; the noisy source language sentence is obtained based on data enhancement of the source language sentence;
determining a current training loss of the translation model based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, and the first feature vector and the second feature vector and/or the first prediction probability distribution and the second prediction probability distribution; and
based on the determined current training loss, parameters of the translation model are adjusted.
2. The method according to claim 1, wherein determining the current training loss of the translation model based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, and the first feature vector and the second feature vector and/or the first prediction probability distribution and the second prediction probability distribution comprises:
determining a first training loss based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, and the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence;
determining a second training loss based on the first feature vector and the second feature vector, and/or determining a third training loss based on the first prediction probability distribution and the second prediction probability distribution; and
performing a weighted summation of the first training loss and the second training loss and/or the third training loss to obtain the current training loss of the translation model.
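As a minimal illustration of this weighted summation, the following Python sketch combines the three losses; the weights w1 to w3 are hypothetical hyperparameters, as the claim fixes no values.

```python
def current_training_loss(first, second=None, third=None,
                          w1=1.0, w2=0.5, w3=0.5):
    # Weighted summation of the first training loss with the second
    # and/or third training losses, mirroring claim 2. The weights
    # are assumed hyperparameters, not values from the patent.
    total = w1 * first
    if second is not None:
        total += w2 * second
    if third is not None:
        total += w3 * third
    return total
```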
3. The method of claim 2, wherein determining the first training loss based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, and the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, comprises:
determining a first training loss component based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair;
determining a second training loss component based on the target language sentence in the parallel bilingual sentence pair corresponding to the second predicted target language sentence and the noisy source language sentence;
and accumulating the first training loss component and the second training loss component to obtain the first training loss.
4. The method of claim 2, wherein determining the second training loss based on the first feature vector and the second feature vector comprises:
determining, for each hidden layer, a third training loss component based on the first feature vector and the second feature vector output by that hidden layer; and
accumulating the determined third training loss components of the hidden layers to obtain the second training loss.
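For illustration, a small Python sketch of this per-layer accumulation; mean-squared error is an assumed distance between feature vectors, as the claim does not name one.

```python
import torch

def second_training_loss(clean_states, noisy_states):
    # clean_states / noisy_states: lists with one feature tensor per
    # hidden layer, taken from the clean and the noisy forward pass.
    # A per-layer loss component is computed and the components are
    # accumulated over all hidden layers, as in claim 4.
    return sum(torch.mean((c - n) ** 2)
               for c, n in zip(clean_states, noisy_states))
```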
5. The method for training a translation model according to any one of claims 1 to 4, wherein performing data enhancement on the source language sentence to obtain the noisy source language sentence comprises:
replacing characters in the source language sentence based on a character recognition error comparison table to obtain the noisy source language sentence, wherein the character recognition error comparison table is obtained based on optical character recognition; and/or
deleting characters in a word at the sentence head or sentence tail of the source language sentence to obtain the noisy source language sentence; and/or
deleting punctuation marks at the tail of the source language sentence or adding a punctuation mark at the head of the source language sentence to obtain the noisy source language sentence.
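As an illustration of the third noising strategy (the character-replacement and character-deletion strategies are sketched after claims 7 and 9 below), a minimal Python sketch follows; the punctuation set and the coin flip between deleting and adding a mark are assumptions of the sketch.

```python
import random

PUNCT = ",.!?;:"

def punct_noise(sentence: str) -> str:
    # Claim 5, third strategy: delete a punctuation mark at the
    # sentence tail, or add one at the sentence head.
    if sentence and sentence[-1] in PUNCT and random.random() < 0.5:
        return sentence[:-1]                    # delete tail punctuation
    return random.choice(PUNCT) + sentence      # add head punctuation
```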
6. The method for training a translation model according to claim 5, wherein replacing the characters in the source language sentence based on the character recognition error comparison table to obtain the noisy source language sentence comprises:
counting the number of characters in the source language sentence, and determining the number of characters to be replaced in the source language sentence according to a preset proportion;
determining the characters to be replaced in the source language sentence based on the frequency of occurrence of the characters in the source language sentence in the character recognition error comparison table and the determined number of the characters to be replaced;
based on the determined character to be replaced, obtaining a corresponding error character from the character recognition error comparison table, and replacing the character in the source language sentence to obtain the noisy source language sentence.
7. The method for training a translation model according to claim 6, wherein obtaining, based on the determined character to be replaced, a corresponding error character from the character recognition error comparison table and replacing the character in the source language sentence to obtain the noisy source language sentence comprises:
if the determined character to be replaced corresponds to two or more error characters in the character recognition error comparison table, determining the error character to be used for replacement based on the frequencies of occurrence of the two or more corresponding error characters in the character recognition error comparison table; and
acquiring the determined error character for replacement from the character recognition error comparison table, and replacing the character in the source language sentence to obtain the noisy source language sentence.
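A minimal Python sketch of the replacement procedure of claims 6 and 7; the shape of the comparison table (a mapping from a character to per-error-character frequencies) and the 10% preset proportion are assumptions of the sketch.

```python
import random

def ocr_confusion_noise(sentence, confusion, ratio=0.1):
    # confusion: dict mapping a character to a dict of
    # {error character: observed OCR error frequency}, built from
    # optical character recognition output.
    chars = list(sentence)
    n_replace = max(1, int(len(chars) * ratio))
    # Prefer positions whose characters occur most frequently in the
    # error comparison table (claim 6).
    candidates = sorted((i for i, c in enumerate(chars) if c in confusion),
                        key=lambda i: -sum(confusion[chars[i]].values()))
    for i in candidates[:n_replace]:
        errs = confusion[chars[i]]
        # With two or more candidate error characters, choose one in
        # proportion to its frequency in the table (claim 7).
        chars[i] = random.choices(list(errs), weights=list(errs.values()))[0]
    return "".join(chars)
```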
8. The method for training a translation model according to claim 5, wherein deleting characters in a sentence head or a sentence tail word in the source language sentence to obtain a noisy source language sentence comprises:
determining the number of characters contained in words at the head or tail of a sentence in the source language sentence;
if the number of characters contained in the determined word at the sentence head or sentence tail matches a preset number, determining the number of characters to be deleted from the word at the sentence head or sentence tail in the source language sentence; wherein the preset number is set according to the language of the source language sentence;
and deleting the characters in the words of the sentence head or the sentence tail in the source language sentence based on the determined number of the characters to be deleted in the words of the sentence head or the sentence tail, so as to obtain the noisy source language sentence.
9. The method for training a translation model according to claim 8, wherein said determining the number of characters to be deleted in words at the beginning or end of a sentence in said source language sentence comprises:
if the language of the source language sentence is English, determining the number of characters to be deleted in words at the head or tail of the sentence in the source language sentence based on Gaussian distribution;
and if the language of the source language sentence is Chinese, determining the number of characters to be deleted in the word at the head or tail of the sentence in the source language sentence as one character.
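For illustration, a small Python sketch of claim 9's language-dependent deletion count; the Gaussian mean and standard deviation are assumptions, as the claim names only the distribution.

```python
import random

def chars_to_delete(word: str, language: str) -> int:
    # Claim 9: for English, draw the deletion count from a Gaussian
    # distribution; for Chinese, delete exactly one character.
    if language == "english":
        n = round(random.gauss(1.5, 1.0))       # assumed parameters
        return max(1, min(len(word) - 1, n))    # keep at least one char
    return 1
```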
10. The method for training a translation model according to any one of claims 1 to 4 or 6 to 9, wherein before inputting the source language sentence in the parallel bilingual sentence pair into the translation model to obtain a first predicted target language sentence output by the translation model, and obtaining a first predicted probability distribution of the translation model and/or a first feature vector output by each hidden layer of the translation model, the method further comprises:
inputting a source language text sample into the translation model to obtain a predicted target language text output by the translation model; wherein the source language text sample comprises source language sentences in the parallel bilingual sentence pairs and the noisy source language sentences;
determining a current training loss of the translation model based on the predicted target language text and the real target language text of the source language text sample; wherein the real target language text comprises target language sentences in the parallel bilingual sentence pairs; and
based on the determined current training loss, parameters of the translation model are adjusted.
11. The method of claim 10, wherein the source language text sample further comprises clause fragments of a source language sentence in the parallel bilingual sentence pair, and the real target language text further comprises clause fragments of a target language sentence in the parallel bilingual sentence pair corresponding to the clause fragments of the source language sentence.
12. The method of claim 11, wherein the step of obtaining clause fragments of a source language sentence in the parallel bilingual sentence pair and clause fragments of a target language sentence in the parallel bilingual sentence pair corresponding to the clause fragments of the source language sentence comprises:
performing word alignment on the source language sentence and the target language sentence in the parallel bilingual sentence pair; and
extracting, based on the word-aligned source language sentence and target language sentence, clause fragments of the source language sentence and the clause fragments of the target language sentence corresponding to the clause fragments of the source language sentence.
13. The method for training a translation model according to claim 12, wherein extracting, based on the word-aligned source language sentence and target language sentence, clause fragments of the source language sentence and the corresponding clause fragments of the target language sentence comprises:
extracting short clause fragments of the source language sentence and short clause fragments of the target language sentence based on the word-aligned source language sentence and target language sentence; wherein a short clause fragment contains no internal punctuation mark; and
extracting long clause fragments of the source language sentence and long clause fragments of the target language sentence based on the word-aligned source language sentence and target language sentence; wherein a long clause fragment contains internal punctuation marks.
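A simplified Python sketch of short-fragment extraction under word alignment: source spans between punctuation marks are projected onto the target through the alignment. Splitting only on a fixed punctuation set is a simplifying assumption; a full extractor would also verify that the projected target span aligns back inside the source span.

```python
def extract_short_fragments(src_tokens, tgt_tokens, alignment,
                            punct=(",", ";", ":", ".")):
    # alignment: iterable of (src_index, tgt_index) word-alignment
    # pairs produced by a word aligner.
    pairs = list(alignment)
    fragments, start = [], 0
    for end in range(len(src_tokens) + 1):
        # A source span closed by punctuation (or the sentence end)
        # is a short clause fragment: no internal punctuation.
        if end == len(src_tokens) or src_tokens[end] in punct:
            span = range(start, end)
            tgt_idx = [j for i, j in pairs if i in span]
            if tgt_idx:
                fragments.append((src_tokens[start:end],
                                  tgt_tokens[min(tgt_idx):max(tgt_idx) + 1]))
            start = end + 1
    return fragments
```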
14. A method of translation, comprising:
collecting an image of a text to be translated, and performing text recognition on the collected image of the text to be translated to obtain a source language text;
translating the source language text through a translation model to obtain a target language text; wherein the translation model is trained based on the translation model training method according to any one of claims 1 to 13.
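For illustration, the claim 14 pipeline reduces to two calls; `recognize_text` and `translation_model.translate` are hypothetical stand-ins for the OCR component and the trained translation model.

```python
def translate_image(image, recognize_text, translation_model):
    # Claim 14: recognize the text in the captured image, then
    # translate the recognized source language text.
    source_text = recognize_text(image)
    return translation_model.translate(source_text)
```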
15. A translation model training device, comprising:
a first prediction module, used for inputting a source language sentence in a parallel bilingual sentence pair into a translation model to obtain a first predicted target language sentence output by the translation model, and for obtaining a first prediction probability distribution of the translation model and/or a first feature vector output by each hidden layer of the translation model;
a second prediction module, used for inputting a noisy source language sentence into the translation model to obtain a second predicted target language sentence output by the translation model, and for obtaining a second prediction probability distribution of the translation model and/or a second feature vector output by each hidden layer of the translation model; wherein the noisy source language sentence is obtained based on data enhancement of the source language sentence;
a loss calculation module, used for determining a current training loss of the translation model based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, and the first feature vector and the second feature vector and/or the first prediction probability distribution and the second prediction probability distribution; and
a parameter adjustment module, used for adjusting parameters of the translation model based on the determined current training loss.
16. A translation apparatus, comprising:
a text recognition module, used for collecting an image of a text to be translated and performing text recognition on the collected image of the text to be translated to obtain a source language text; and
a machine translation module, used for translating the source language text through a translation model to obtain a target language text; wherein the translation model is trained based on the translation model training method according to any one of claims 1 to 13.
17. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the translation model training method according to any one of claims 1 to 13 or of the translation method according to claim 14.
18. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the translation model training method according to any one of claims 1 to 13 or the steps of the translation method according to claim 14.
CN202111250312.4A 2021-10-26 2021-10-26 Translation model training method, translation method and translation device Active CN114201975B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111250312.4A CN114201975B (en) 2021-10-26 2021-10-26 Translation model training method, translation method and translation device

Publications (2)

Publication Number Publication Date
CN114201975A CN114201975A (en) 2022-03-18
CN114201975B true CN114201975B (en) 2024-04-12

Family

ID=80646370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111250312.4A Active CN114201975B (en) 2021-10-26 2021-10-26 Translation model training method, translation method and translation device

Country Status (1)

Country Link
CN (1) CN114201975B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114611532B (en) * 2022-05-06 2022-08-19 北京百度网讯科技有限公司 Language model training method and device, and target translation error detection method and device
CN116167388A (en) * 2022-12-27 2023-05-26 无锡捷通数智科技有限公司 Training method, device, equipment and storage medium for special word translation model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011180941A (en) * 2010-03-03 2011-09-15 National Institute Of Information & Communication Technology Phrase table generator and computer program therefor
CN108829685A (en) * 2018-05-07 2018-11-16 内蒙古工业大学 A kind of illiteracy Chinese inter-translation method based on single language training
CN110334361A (en) * 2019-07-12 2019-10-15 电子科技大学 A kind of neural machine translation method towards rare foreign languages language
CN110874537A (en) * 2018-08-31 2020-03-10 阿里巴巴集团控股有限公司 Generation method of multi-language translation model, translation method and translation equipment
CN111460838A (en) * 2020-04-23 2020-07-28 腾讯科技(深圳)有限公司 Pre-training method and device of intelligent translation model and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI753325B (en) * 2019-11-25 2022-01-21 國立中央大學 Computing device and method for generating machine translation model and machine-translation device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yao Liang; Hong Yu; Liu Hao; Liu Le; Yao Jianmin. Research on domain adaptation of translation models based on semantic distribution similarity. Journal of Shandong University (Natural Science), 2016, (No. 07), full text. *

Also Published As

Publication number Publication date
CN114201975A (en) 2022-03-18

Similar Documents

Publication Publication Date Title
CN114201975B (en) Translation model training method, translation method and translation device
CN110428820B (en) Chinese and English mixed speech recognition method and device
CN110032938B (en) Tibetan recognition method and device and electronic equipment
TW201918913A (en) Machine processing and text correction method and device, computing equipment and storage media
CN111062376A (en) Text recognition method based on optical character recognition and error correction tight coupling processing
CN109858029B (en) Data preprocessing method for improving overall quality of corpus
WO2022088570A1 (en) Method and apparatus for post-editing of translation, electronic device, and storage medium
CN113408535B (en) OCR error correction method based on Chinese character level features and language model
CN110598224A (en) Translation model training method, text processing device and storage medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN110033008A (en) A kind of iamge description generation method concluded based on modal transformation and text
CN112766000B (en) Machine translation method and system based on pre-training model
CN112257437A (en) Voice recognition error correction method and device, electronic equipment and storage medium
CN112016271A (en) Language style conversion model training method, text processing method and device
CN112580326A (en) Punctuation mark model and training system thereof
CN112307130A (en) Document-level remote supervision relation extraction method and system
CN112686030B (en) Grammar error correction method, grammar error correction device, electronic equipment and storage medium
CN113204978A (en) Machine translation enhancement training method and system
CN109657244B (en) English long sentence automatic segmentation method and system
CN109325237B (en) Complete sentence recognition method and system for machine translation
CN111144134A (en) Translation engine automatic evaluation system based on OpenKiwi
CN110610006A (en) Morphological double-channel Chinese word embedding method based on strokes and glyphs
CN115631502A (en) Character recognition method, character recognition device, model training method, electronic device and medium
CN115525749A (en) Voice question-answering method, device, electronic equipment and storage medium
CN112836528A (en) Machine translation post-editing method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20230516

Address after: 230026 No. 96, Jinzhai Road, Hefei, Anhui

Applicant after: University of Science and Technology of China

Applicant after: IFLYTEK Co.,Ltd.

Address before: 230088 666 Wangjiang West Road, Hefei hi tech Development Zone, Anhui

Applicant before: IFLYTEK Co.,Ltd.

GR01 Patent grant
GR01 Patent grant