CN114201975A - Translation model training method, translation method and device - Google Patents


Publication number
CN114201975A
CN114201975A (application number CN202111250312.4A); granted as CN114201975B
Authority
CN
China
Prior art keywords
sentence
source language
language sentence
translation model
target language
Prior art date
Legal status
Granted
Application number
CN202111250312.4A
Other languages
Chinese (zh)
Other versions
CN114201975B (en)
Inventor
刘恒双
张为泰
许瑞阳
Current Assignee
University of Science and Technology of China USTC
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202111250312.4A
Publication of CN114201975A
Application granted
Publication of CN114201975B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language
    • G06F40/42: Data-driven translation
    • G06F40/44: Statistical methods, e.g. probability models
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language
    • G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a translation model training method, a translation method, and corresponding devices. The model training method comprises the following steps: respectively inputting the source language sentence of a parallel bilingual sentence pair and a noisy version of that sentence into a translation model to obtain a first and a second predicted target language sentence, while obtaining the first and second predicted probability distributions of the translation model and/or the first and second feature vectors output by each hidden layer; and determining the current training loss of the translation model, and adjusting its parameters, based on the first predicted target language sentence and the target language sentence of the parallel bilingual sentence pair, the second predicted target language sentence and the target language sentence corresponding to the noisy source language sentence, and the first and second feature vectors and/or the first and second predicted probability distributions. The embodiment of the invention can improve the robustness of the translation model; the training method is simple, and model training is stable.

Description

Translation model training method, translation method and device
Technical Field
The invention relates to the technical field of machine translation, in particular to a translation model training method, a translation method and a translation device.
Background
A scanning translation pen is an intelligent terminal product that integrates optical character recognition (OCR), machine translation, and speech synthesis. Its workflow is as follows: first, an image of a paper document is captured and the characters in the image are recognized by OCR to obtain a source language text; then the source language text is translated by machine translation to obtain a target language text; finally, the target language text is played aloud by speech synthesis. To guarantee translation quality, the pen must be used according to certain specifications: for example, when capturing an image of a paper document, the pen should be held at an angle of about 45 degrees to the desktop, and the line of sight of the pen tip must be aligned with the text line of the paper document to be translated. In practice it is difficult to guarantee that these specifications are strictly followed, so the source language text produced by OCR contains a large amount of noise. Because machine translation is sensitive to noise, a small perturbation of the input can change the translation result drastically and thereby reduce the quality of machine translation. Improving the robustness of the translation model therefore improves the quality of machine translation.
The existing methods for improving the robustness of a translation model mainly combine data enhancement with adversarial training. However, this combination requires an additional discriminator and additional training of that discriminator, so the training method is complicated and model training is unstable.
Disclosure of Invention
The embodiment of the invention provides a translation model training method, a translation method, and corresponding devices, which overcome the defects of the prior art that methods for improving the robustness of a translation model are complex and model training is unstable.
The embodiment of the invention provides a translation model training method, which comprises the following steps:
inputting a source language sentence in a parallel bilingual sentence pair into a translation model to obtain a first predicted target language sentence output by the translation model, and acquiring first predicted probability distribution of the translation model and/or first feature vectors output by hidden layers of the translation model;
inputting the source language sentences added with noise into the translation model to obtain second predicted target language sentences output by the translation model, and acquiring second predicted probability distribution of the translation model and/or second feature vectors output by each hidden layer of the translation model; the source language sentence added with the noise is obtained by performing data enhancement on the source language sentence;
determining a current training loss of the translation model based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, and the first feature vector and the second feature vector and/or the first predicted probability distribution and the second predicted probability distribution;
adjusting parameters of the translation model based on the determined current training loss.
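The four steps above can be sketched end to end. The following is an illustrative toy, not the patented model: the "translation model" here is a one-hidden-layer network over a single token, and the loss weights (1.0, 0.5, 0.5), the mean squared feature distance, and the KL divergence between the two predicted distributions are assumed choices that this description leaves open.

```python
import numpy as np

# Toy sketch of one training step: two forward passes (clean and noisy input),
# then a combined loss over predictions, feature vectors, and distributions.
rng = np.random.default_rng(0)
VOCAB, HIDDEN = 8, 4
W1 = rng.standard_normal((VOCAB, HIDDEN)) * 0.5
W2 = rng.standard_normal((HIDDEN, VOCAB)) * 0.5

def forward(x):
    """One forward-propagation pass: returns the predicted probability
    distribution and the hidden layer's feature vector."""
    h = np.tanh(x @ W1)
    z = h @ W2
    p = np.exp(z - z.max())
    return p / p.sum(), h

x_clean = np.eye(VOCAB)[2]                            # "source language sentence" (one-hot token)
x_noisy = x_clean + 0.1 * rng.standard_normal(VOCAB)  # data-enhanced, noise-added input

p1, h1 = forward(x_clean)   # first predicted distribution / first feature vector
p2, h2 = forward(x_noisy)   # second predicted distribution / second feature vector

target = 5                                              # reference target token index
first_loss = -np.log(p1[target]) - np.log(p2[target])   # translation loss components
second_loss = float(np.mean((h1 - h2) ** 2))            # feature-vector consistency
third_loss = float(np.sum(p1 * np.log(p1 / p2)))        # KL between the two distributions
current_loss = 1.0 * first_loss + 0.5 * second_loss + 0.5 * third_loss
```

The two consistency terms pull the noisy pass toward the clean pass, which is the mutual-constraint idea the description relies on.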
According to an embodiment of the present invention, the method for training a translation model, wherein determining a current training loss of the translation model based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, and the first feature vector and the second feature vector and/or the first predicted probability distribution and the second predicted probability distribution, includes:
determining a first training loss based on the first predicted target language sentence and a target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and a target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence;
determining a second training loss based on the first feature vector and the second feature vector; and/or determining a third training loss based on the first predictive probability distribution and the second predictive probability distribution;
and performing a weighted summation of the first training loss and the second training loss and/or the third training loss to obtain the current training loss of the translation model.
According to an embodiment of the present invention, the determining a first training loss based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, and the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, includes:
determining a first training loss component based on the first predicted target language statement and a target language statement in the parallel bilingual statement pair;
determining a second training loss component based on the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence;
and accumulating the first training loss component and the second training loss component to obtain the first training loss.
According to an embodiment of the present invention, the method for training a translation model, wherein the determining a second training loss based on the first feature vector and the second feature vector, includes:
determining, for each hidden layer, a corresponding third training loss component based on the first feature vector and the second feature vector of that hidden layer;
and accumulating the determined third training loss components of each hidden layer to obtain the second training loss.
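The accumulation just described can be sketched in a few lines, assuming mean squared difference as the per-layer distance (the claim does not fix the distance measure):

```python
import numpy as np

def second_training_loss(first_feats, second_feats):
    """One component per hidden layer, then accumulate over layers."""
    components = [float(np.mean((a - b) ** 2))     # per-hidden-layer component
                  for a, b in zip(first_feats, second_feats)]
    return sum(components)

clean = [np.array([1.0, 2.0]), np.array([0.5, 0.5])]  # features from clean input
noisy = [np.array([1.0, 2.5]), np.array([0.5, 0.0])]  # features from noisy input
loss = second_training_loss(clean, noisy)             # 0.125 + 0.125 = 0.25
```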
According to the translation model training method of an embodiment of the present invention, the data enhancement of the source language sentence to obtain the noisy source language sentence includes:
replacing characters in the source language sentence based on a character recognition error lookup table to obtain the noisy source language sentence, wherein the character recognition error lookup table is obtained based on optical character recognition; and/or
deleting characters from the word at the beginning or the end of the source language sentence to obtain the noisy source language sentence; and/or
deleting the punctuation mark at the end of the source language sentence, or adding a punctuation mark at the beginning of the source language sentence, to obtain the noisy source language sentence.
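The three noise types can be sketched as follows. The confusion table, the replacement ratio, and the punctuation handling are hypothetical stand-ins; a real table would be mined from OCR recognition errors:

```python
import random
import string

OCR_ERRORS = {"l": ["1"], "o": ["0"], "e": ["c"]}  # hypothetical OCR confusion table

def replace_chars(sentence, ratio=0.2, seed=0):
    """Noise type 1: swap characters for known OCR mis-recognitions."""
    rng = random.Random(seed)
    chars = list(sentence)
    slots = [i for i, c in enumerate(chars) if c in OCR_ERRORS]
    k = min(len(slots), max(1, int(len(chars) * ratio)))
    for i in rng.sample(slots, k):
        chars[i] = rng.choice(OCR_ERRORS[chars[i]])
    return "".join(chars)

def drop_edge_chars(sentence, head=True, n=1):
    """Noise type 2: delete n characters from the first or last word."""
    words = sentence.split()
    idx = 0 if head else -1
    words[idx] = words[idx][n:] if head else words[idx][:-n]
    return " ".join(words)

def perturb_punct(sentence, head_mark=None):
    """Noise type 3: add a mark at the beginning, or drop the final mark."""
    if head_mark:
        return head_mark + sentence
    if sentence and sentence[-1] in string.punctuation:
        return sentence[:-1]
    return sentence
```

Each function keeps the sentence meaningful while simulating the scanning errors described in the background section.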
According to the translation model training method of an embodiment of the present invention, the replacing characters in the source language sentence based on the character recognition error lookup table to obtain the noisy source language sentence includes:
counting the number of characters in the source language sentence, and determining the number of characters to be replaced in the source language sentence according to a preset proportion;
determining the characters to be replaced in the source language sentence based on the frequency with which the characters of the source language sentence occur in the character recognition error lookup table and on the determined number of characters to be replaced;
and obtaining the corresponding error characters from the character recognition error lookup table based on the determined characters to be replaced, and replacing those characters in the source language sentence to obtain the noisy source language sentence.
According to the translation model training method of an embodiment of the present invention, the obtaining of corresponding error characters from the character recognition error lookup table based on the determined characters to be replaced, and the replacing of the characters in the source language sentence to obtain the noisy source language sentence, includes:
if a determined character to be replaced corresponds to two or more error characters in the character recognition error lookup table, determining the error character to use for replacement based on the frequencies of those error characters in the lookup table;
and obtaining the determined replacement error character from the character recognition error lookup table and replacing the character in the source language sentence to obtain the noisy source language sentence.
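The frequency-based choice among several candidate error characters can be sketched as below; the confusion counts are invented for illustration (a real table would record how often OCR produced each wrong character):

```python
import random
from collections import Counter

# Hypothetical lookup table with per-candidate frequencies.
ERROR_TABLE = {
    "l": Counter({"1": 40, "I": 10}),  # "l" mis-read as "1" far more often than "I"
    "o": Counter({"0": 25}),
}

def pick_replacement(char, seed=0):
    """Choose the error character; sample by frequency when two or more exist."""
    counts = ERROR_TABLE[char]
    if len(counts) >= 2:
        chars, weights = zip(*counts.items())
        return random.Random(seed).choices(chars, weights=weights, k=1)[0]
    return next(iter(counts))
```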
According to the translation model training method of an embodiment of the present invention, deleting characters in words at the beginning or the end of a sentence in the source language sentence to obtain a noisy source language sentence, includes:
determining the number of characters contained in the word at the beginning or the end of the source language sentence;
if the number of characters contained in that word meets the preset number, determining the number of characters to be deleted from the word, wherein the preset number is set according to the language of the source language sentence;
and deleting characters from the word at the beginning or the end of the source language sentence based on the determined number of characters to be deleted, to obtain the noisy source language sentence.
According to a translation model training method of an embodiment of the present invention, the determining the number of characters to be deleted in words of a sentence head or a sentence tail in the source language sentence includes:
if the language of the source language sentence is English, determining, based on a Gaussian distribution, the number of characters to be deleted from the word at the beginning or the end of the sentence;
and if the language of the source language sentence is Chinese, determining that the number of characters to be deleted in words at the beginning or the end of the sentence in the source language sentence is one character.
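The language-dependent deletion count can be sketched as follows. The Gaussian parameters (mean 1.5, standard deviation 1.0) are assumptions, since the embodiment only states that the English count follows a Gaussian distribution while the Chinese count is fixed at one character:

```python
import random

def chars_to_delete(word, language, seed=0):
    """Number of characters to delete from a head/tail word, by language."""
    if language == "zh":
        return 1
    n = int(round(abs(random.Random(seed).gauss(1.5, 1.0))))
    return max(1, min(n, max(1, len(word) - 1)))  # always leave at least one char
```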
According to the translation model training method of an embodiment of the present invention, before inputting a source language sentence in a parallel bilingual sentence pair into a translation model to obtain a first predicted target language sentence output by the translation model, and acquiring a first predicted probability distribution of the translation model and/or a first feature vector output by each hidden layer of the translation model, the method further includes:
inputting a source language text sample into the translation model to obtain a predicted target language text output by the translation model; wherein the source language text sample comprises a source language sentence in the parallel bilingual sentence pair and the noisy source language sentence;
determining a current training loss of the translation model based on the predicted target language text and a real target language text of the source language text sample; wherein the real target language text comprises a target language sentence in the parallel bilingual sentence pair;
adjusting parameters of the translation model based on the determined current training loss.
According to the translation model training method of an embodiment of the present invention, the source language text sample further includes a clause fragment of the source language sentence in the parallel bilingual sentence pair, and the real target language text further includes a clause fragment corresponding to the clause fragment of the source language sentence in the target language sentence in the parallel bilingual sentence pair.
According to the translation model training method of an embodiment of the present invention, the step of obtaining the clause segment of the source language sentence in the parallel bilingual sentence pair and the clause segment corresponding to the clause segment of the source language sentence in the target language sentence in the parallel bilingual sentence pair includes:
performing word alignment on a source language sentence and a target language sentence in the parallel bilingual sentence pair;
and extracting clause segments of the source language sentence and clause segments corresponding to the clause segments of the source language sentence in the target language sentence based on the source language sentence and the target language sentence which are aligned.
According to the translation model training method of an embodiment of the present invention, the extracting of clause segments of the source language sentence, and of the clause segments in the target language sentence corresponding to them, based on the word-aligned source language sentence and target language sentence, includes:
extracting short clause segments of the source language sentence and short clause segments of the target language sentence based on the word-aligned source language sentence and target language sentence, wherein a short clause segment contains no internal punctuation mark;
extracting long clause segments of the source language sentence and long clause segments of the target language sentence based on the word-aligned source language sentence and target language sentence, wherein a long clause segment contains internal punctuation marks.
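The extraction of aligned clause segments can be sketched as follows. The alignment format (source token index to target token index, as a word aligner such as fast_align would produce) and the example sentence pair are hypothetical:

```python
# Hypothetical word-aligned sentence pair.
src = ["the", "cat", "sat", ",", "then", "left"]
tgt = ["猫", "坐下", "，", "然后", "离开"]
alignment = {0: 0, 1: 0, 2: 1, 3: 2, 4: 3, 5: 4}

def extract_segment(src_span, alignment, tgt):
    """Target clause segment: the contiguous target span covering every
    token aligned to the chosen source span."""
    hits = sorted(alignment[i] for i in src_span if i in alignment)
    if not hits:
        return []
    return tgt[hits[0]:hits[-1] + 1]

short_seg = extract_segment(range(0, 3), alignment, tgt)  # no punctuation inside
long_seg = extract_segment(range(0, 6), alignment, tgt)   # spans the comma
```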
The embodiment of the invention also provides a translation method, which comprises the following steps:
acquiring an image of a text to be translated, and performing character recognition on the acquired image of the text to be translated to obtain a source language text;
translating the source language text through a translation model to obtain a target language text; wherein the translation model is trained by the translation model training method described above.
An embodiment of the present invention further provides a translation model training device, including:
the first prediction module is used for inputting a source language sentence in a parallel bilingual sentence pair into a translation model to obtain a first prediction target language sentence output by the translation model, and acquiring first prediction probability distribution of the translation model and/or first feature vectors output by hidden layers of the translation model;
the second prediction module is used for inputting the source language sentences added with noise into the translation model to obtain second prediction target language sentences output by the translation model and acquiring second prediction probability distribution of the translation model and/or second feature vectors output by each hidden layer of the translation model; the source language sentence added with the noise is obtained by performing data enhancement on the source language sentence;
a loss calculation module for determining a current training loss of the translation model based on the first predicted target language sentence and a target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and a target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, and the first feature vector and the second feature vector and/or the first predicted probability distribution and the second predicted probability distribution;
a parameter adjustment module to adjust parameters of the translation model based on the determined current training loss.
An embodiment of the present invention further provides a translation apparatus, including:
the character recognition module is used for acquiring an image of a text to be translated and performing character recognition on the acquired image of the text to be translated to obtain a source language text;
the machine translation module is used for translating the source language text through a translation model to obtain a target language text; wherein the translation model is trained by the translation model training method described above.
An embodiment of the present invention further provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of any translation model training method described above or of any translation method described above.
Embodiments of the present invention further provide a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of any translation model training method described above or of any translation method described above.
In the method, the source language sentence of a parallel bilingual sentence pair and the noisy source language sentence obtained by data enhancement of that sentence are each input into the translation model, and two forward propagation passes are performed. From each pass, the predicted probability distribution of the translation model and/or the feature vectors output by its hidden layers are collected, the current training loss of the translation model is determined from them, and the parameters of the translation model are adjusted. The predicted probability distribution obtained from the noisy source language sentence and the predicted probability distribution obtained from the original source language sentence constrain each other, and/or the hidden-layer feature vectors obtained from the two inputs constrain each other, so that the translation model learns similar predicted probability distributions and/or feature vectors whether or not noise is added to the source language sentence. The model thus becomes insensitive to the presence of noise in the input, its robustness to noise is improved, and the translation quality obtained with the model is effectively guaranteed. Compared with methods combining data enhancement and adversarial training, no additional discriminator and no additional discriminator training are needed, so the training method is simple and model training is stable.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a translation model training method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for determining a current training loss of a translation model according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a translation model training method according to another embodiment of the present invention;
FIG. 4 is a flowchart illustrating a method for performing data enhancement on a source language sentence to obtain a noisy source language sentence according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating a method for replacing characters in a source language sentence to obtain a noisy source language sentence according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating a method for replacing characters in a source language sentence by obtaining incorrect characters from a character recognition error lookup table according to an embodiment of the present invention;
FIG. 7 is a schematic diagram illustrating training of a translation model with noisy text obtained by data enhancement through character replacement according to an embodiment of the present invention;
FIG. 8 is a flowchart illustrating a method for deleting characters in a source language sentence to obtain a noisy source language sentence according to an embodiment of the present invention;
FIG. 9 is a flowchart illustrating a translation model training method according to another embodiment of the present invention;
FIG. 10 is a flowchart illustrating a method for obtaining clause fragments of a source language sentence and a target language sentence of a parallel bilingual sentence pair according to an exemplary embodiment of the present invention;
FIG. 11 is a flowchart illustrating a translation method according to an embodiment of the present invention;
FIG. 12 is a schematic structural diagram of a translation model training apparatus according to an embodiment of the present invention;
FIG. 13 is a schematic diagram of a component structure of a translation apparatus according to an embodiment of the present invention;
FIG. 14 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
To guarantee translation quality, a scanning translation pen must be used according to certain specifications: for example, when capturing an image of a paper document, the pen should be held at an angle of about 45 degrees to the desktop, and the line of sight of the pen tip must be aligned with the text line of the paper document to be translated. In practice it is difficult to guarantee that these specifications are strictly followed, so the source language text produced by OCR contains a large amount of noise. Because machine translation is sensitive to noise, a small perturbation of the input can change the translation result drastically and thereby reduce the quality of machine translation. Improving the robustness of the translation model therefore improves the quality of machine translation.
The existing methods for improving the robustness of a translation model mainly combine data enhancement with adversarial training. However, this combination requires an additional discriminator; the discriminator increases the number of model parameters and places high demands on parameter configuration, so the training method is complex and model training is unstable.
To this end, an embodiment of the present invention provides a translation model training method. The source language sentence of a parallel bilingual sentence pair and the noisy source language sentence obtained by data enhancement of that sentence are respectively input into a translation model, and two forward propagation passes are performed to obtain the predicted probability distribution of the translation model and/or the feature vectors output by each of its hidden layers. The current training loss of the translation model is determined based on the predicted target language sentences and the real target language sentences of the two passes, together with the predicted probability distributions and/or hidden-layer feature vectors obtained from the two passes, and the parameters of the translation model are adjusted. The predicted probability distribution and/or hidden-layer feature vectors obtained from the noisy source language sentence and those obtained from the original source language sentence constrain each other, so that the translation model is insensitive to whether noise is added to the source language sentence, which effectively improves its robustness. FIG. 1 is a schematic flow chart of the translation model training method provided by an embodiment of the present invention; as shown in FIG. 1, the method at least includes:
step 101, inputting a source language sentence in a parallel bilingual sentence pair into a translation model to obtain a first predicted target language sentence output by the translation model, and acquiring a first predicted probability distribution of the translation model and/or a first feature vector output by each hidden layer of the translation model.
In this embodiment of the present invention, the translation model may be a deep learning model that translates one language (the source language) into another (the target language); an existing deep learning model such as the Transformer may be used. A translation model generally adopts an encoder-decoder structure and includes at least one encoder and at least one decoder, the number of encoders generally being equal to the number of decoders; each encoder and each decoder may be a multi-layer network structure containing several hidden layers. For example, the Transformer model includes 6 encoders and 6 decoders. The 6 encoders share the same network structure, each comprising 2 hidden layers: the 1st is a self-attention layer and the 2nd is a feedforward neural network layer. The 6 decoders also share the same network structure, each comprising 3 hidden layers: the 1st is a masked multi-head self-attention layer, the 2nd is a multi-head attention layer over the encoder output, and the 3rd is a feedforward neural network layer.
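Writing the described layout out as data makes the hidden-layer count explicit. This is only a tally of the structure described above (6 encoders with 2 hidden layers, 6 decoders with 3), not an implementation:

```python
ENCODER_LAYERS = ["self-attention", "feed-forward"]
DECODER_LAYERS = ["masked multi-head self-attention",
                  "multi-head attention over encoder output",
                  "feed-forward"]
encoders = [list(ENCODER_LAYERS) for _ in range(6)]
decoders = [list(DECODER_LAYERS) for _ in range(6)]

# Number of feature vectors the training method can collect per forward pass:
n_hidden_layers = sum(map(len, encoders)) + sum(map(len, decoders))  # 6*2 + 6*3
```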
Usually, a large amount of general parallel bilingual text with standard wording and reasonable grammar is used as training samples, so that the translation model learns the mapping relation between the two languages from the parallel bilingual text. The parallel bilingual text comprises source language text and target language text; the embodiment of the invention does not limit the languages of the source language text and the target language text. For example, the source language text may be Chinese and the target language text English. Parallel bilingual text is mostly collected from spoken-language communication scenes, the news field, electronic documents and the like, or crawled from the Internet, and the embodiment of the invention does not limit the acquisition mode of the parallel bilingual text. The parallel bilingual text may include parallel bilingual sentence pairs; for example, for a scanning translation pen, since the translation objects are usually sentences, the translation model of the scanning translation pen may be trained using parallel bilingual sentence pairs.
In the embodiment of the present invention, when the translation model is trained, the source language sentence in the parallel bilingual sentence pair may be input into the translation model, and Forward Propagation processing is performed to obtain the first predicted target language sentence output by the translation model. During forward propagation, the input source language sentence is processed in sequence by the encoders and decoders of the translation model; first feature vectors are output at the hidden layers of the encoders and decoders, and a first prediction probability distribution is output at the softmax layer. Therefore, in the process of training the translation model on the source language sentence in the parallel bilingual sentence pair, the first prediction probability distribution output by the softmax layer may be obtained from the translation model, or the first feature vectors output by each hidden layer may be obtained, or both may be obtained simultaneously, which is not limited in the embodiment of the present invention.
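As an illustrative sketch only (the patent does not prescribe an implementation), the idea of recording every hidden layer's feature vector plus the final softmax distribution during one forward pass can be shown with a toy numeric model:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_with_traces(features, layer_weights):
    """Toy forward pass: each 'hidden layer' scales the feature vector;
    the vector after every layer and the final softmax probability
    distribution are both recorded, mirroring step 101."""
    traces = []
    for w in layer_weights:
        features = [w * f for f in features]
        traces.append(list(features))
    return traces, softmax(features)

traces, probs = forward_with_traces([1.0, 2.0, 3.0], [0.5, 2.0])
print(len(traces), sum(probs))  # one trace per hidden layer; probs sum to 1
```

In a real Transformer the traces would be the hidden-layer outputs of the encoders and decoders rather than scaled copies of the input, but the bookkeeping is the same.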
Step 102, inputting the noisy source language sentence into the translation model to obtain a second predicted target language sentence output by the translation model, and acquiring a second prediction probability distribution of the translation model and/or the second feature vectors output by the hidden layers of the translation model; the noisy source language sentence is obtained by performing data enhancement on the source language sentence.
In the embodiment of the invention, the noisy source language sentence can be obtained by performing data enhancement processing on the source language sentence in the parallel bilingual sentence pair. An existing text data enhancement mode may be used, for example synonym replacement, random deletion, random exchange, random insertion, or random masking; alternatively, a data enhancement mode may be designed according to the application scenario of the translation model.
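As a hedged sketch of the generic enhancement modes listed above (the patent fixes no implementation), token-level deletion, exchange, and masking might look like:

```python
import random

def random_delete(tokens, p=0.1, rng=random):
    """Drop each token with probability p; keep at least one token."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else tokens[:1]

def random_exchange(tokens, rng=random):
    """Swap two randomly chosen positions."""
    out = list(tokens)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_mask(tokens, p=0.1, mask="<mask>", rng=random):
    """Replace each token with a mask symbol with probability p."""
    return [mask if rng.random() < p else t for t in tokens]

rng = random.Random(7)
print(random_mask("the cat sat on the mat".split(), p=0.5, rng=rng))
```

Synonym replacement and random insertion would additionally need a synonym dictionary, which is why they are omitted from this minimal sketch.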
In the embodiment of the present invention, when training the translation model, the noisy source language sentence obtained by performing data enhancement on the source language sentence in step 101 may be input into the translation model, and forward propagation processing is performed to obtain the second predicted target language sentence output by the translation model. During forward propagation, the input noisy source language sentence is processed in sequence by the encoders and decoders of the translation model; second feature vectors are output at the hidden layers of the encoders and decoders, and a second prediction probability distribution is output at the softmax layer. Therefore, in the process of training the translation model on the noisy source language sentence, the second prediction probability distribution output by the softmax layer may be obtained from the translation model, or the second feature vectors output by each hidden layer may be obtained, or both may be obtained simultaneously, which is not limited in the embodiment of the present invention.
Step 103, determining the current training loss of the translation model based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, and the first and second feature vectors and/or the first and second prediction probability distributions.
In the embodiment of the invention, after the source language sentence in the parallel bilingual sentence pair and the noisy source language sentence obtained by data enhancement are respectively input into the translation model to obtain the first predicted target language sentence and the second predicted target language sentence, three cases arise. If only the first and second prediction probability distributions are obtained from the translation model, the current training loss can be calculated from the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, and the first and second prediction probability distributions. If only the first and second feature vectors output by the hidden layers are obtained, the current training loss can be calculated from the same predicted and real target language sentences together with the first and second feature vectors. If both the prediction probability distributions and the hidden-layer feature vectors are obtained, the current training loss can be calculated from the predicted and real target language sentences, the first and second feature vectors, and the first and second prediction probability distributions.
In the embodiment of the invention, the above quantities may be processed through a preset loss function to directly obtain the current training loss of the translation model. Alternatively, a preset loss function may be applied separately to each group of quantities to obtain various training losses of the translation model, which may include a loss between the predicted results and the real results, a loss between the hidden-layer feature vectors, and/or a loss between the prediction probability distributions; the various training losses are then weighted and summed to obtain the current training loss of the translation model. The embodiment of the invention does not limit the implementation mode of determining the current training loss of the translation model from these quantities.
The preset loss function may adopt an existing loss function, and the embodiment of the present invention does not limit the type of loss function used to calculate the current training loss of the translation model. When the various training losses of the translation model are calculated first and then weighted and summed, the same loss function may be used for all of them, or different loss functions may be used for each, which is likewise not limited in the embodiment of the invention.
Step 104, adjusting parameters of the translation model based on the determined current training loss.
In the embodiment of the present invention, after the current training loss of the translation model is obtained, the parameters of the translation model may be adjusted through Backward Propagation processing according to the current training loss, and training continues until the translation model with adjusted parameters meets a preset convergence condition. For example, the preset convergence condition may be a preset error value: if the error of the translation model after parameter adjustment is smaller than the preset error value, training is stopped. Alternatively, the preset convergence condition may be a preset number of iterations: if the number of iterations of the translation model after parameter adjustment reaches the preset number, training is stopped.
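The stopping logic described above (error threshold or iteration budget) can be sketched as follows; `train_step` is a hypothetical callback standing in for one forward/backward pass:

```python
def train_until_converged(train_step, max_iters=1000, target_error=1e-3):
    """Run train_step() repeatedly; stop when the returned error drops
    below target_error (first convergence condition) or when the
    iteration budget is exhausted (second convergence condition)."""
    error = float("inf")
    for iteration in range(1, max_iters + 1):
        error = train_step()
        if error < target_error:
            return iteration, error  # converged by error threshold
    return max_iters, error          # stopped by iteration count

# A fake step whose error shrinks each call, for illustration only.
errors = iter([0.4, 0.2, 0.0005])
print(train_until_converged(lambda: next(errors), max_iters=10))
```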
It should be noted that the translation model of the embodiment of the present invention may be a translation model for implementing mutual translation of at least two languages. For any two mutually translated languages, a parallel bilingual sentence pair consisting of sentences of the two mutually translated languages can be adopted to train the translation model, wherein in the parallel bilingual sentence pair, if a sentence of one language is used as a source language sentence, a sentence of the other language is used as a target language sentence.
In the translation model training method provided by the embodiment of the invention, the source language sentence in a parallel bilingual sentence pair and the noisy source language sentence obtained by data enhancement of that sentence are respectively input into the translation model, and two forward propagation passes are performed to obtain the prediction probability distributions of the translation model and/or the feature vectors output by its hidden layers. The current training loss of the translation model is determined based on the predicted target language sentences output by the two forward propagation passes and the target language sentence in the parallel bilingual sentence pair, together with the obtained prediction probability distributions and/or hidden-layer feature vectors, and the parameters of the translation model are adjusted. The prediction probability distribution obtained from the noisy source language sentence and that obtained from the clean source language sentence constrain each other, and/or the hidden-layer feature vectors obtained from the noisy source language sentence and those obtained from the clean source language sentence constrain each other, so that the translation model learns similar prediction probability distributions and/or feature vectors whether or not noise is added to the source language sentence. The translation model is thus insensitive to the presence of noise in the source language sentence, which improves its robustness to noise and effectively ensures the quality of translation performed with the translation model. Moreover, compared with methods that combine data enhancement with adversarial training, this method needs no additional discriminator and no extra training of such a discriminator; the training method is simple and the model training is stable.
Fig. 2 is a schematic flowchart of a method for determining a current training loss of a translation model according to an embodiment of the present invention, as shown in fig. 2, the method at least includes:
Step 201, determining a first training loss based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, and on the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence.
In an embodiment of the present invention, the first training loss is used to characterize the difference between the predicted results output by the translation model and the real results. A first training loss component may be determined based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair; a second training loss component may be determined based on the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence; the first and second training loss components are then accumulated to obtain the first training loss. The two components may be determined in either order, and the embodiment of the invention does not limit the order of determining the first training loss component and the second training loss component.
The first training loss component and the second training loss component may be obtained by calculating, through a preset first loss function, the first predicted target language sentence against the target language sentence in the parallel bilingual sentence pair, and the second predicted target language sentence against the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, respectively. The preset first loss function may adopt an existing loss function, for example a Cross Entropy (CE) loss function, and the embodiment of the present invention does not limit the type of the first loss function used to calculate the first training loss of the translation model.
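A minimal numeric sketch of the first training loss under the cross-entropy choice mentioned above; averaging over target tokens is an assumption for illustration, not something the patent specifies:

```python
import math

def cross_entropy(probs, target_index):
    # Negative log-probability assigned to the reference token.
    return -math.log(probs[target_index])

def first_training_loss(clean_probs, noisy_probs, target_ids):
    """First component: CE of the clean pass against the reference;
    second component: CE of the noisy pass; the two are accumulated."""
    ce_clean = sum(cross_entropy(p, t)
                   for p, t in zip(clean_probs, target_ids)) / len(target_ids)
    ce_noisy = sum(cross_entropy(p, t)
                   for p, t in zip(noisy_probs, target_ids)) / len(target_ids)
    return ce_clean + ce_noisy

# One-token example: both passes put probability 0.5 on the reference.
print(first_training_loss([[0.5, 0.5]], [[0.5, 0.5]], [0]))
```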
Step 202, determining a second training loss based on the first feature vector and the second feature vector; and/or determining a third training loss based on the first predictive probability distribution and the second predictive probability distribution.
In an embodiment of the invention, the second training loss is used to characterize the difference between the feature vectors output by the hidden layers of the translation model with and without noise added, and the third training loss is used to characterize the difference between the prediction probability distributions output by the translation model with and without noise added. In the case where only the first and second feature vectors output by the hidden layers are obtained from the translation model, a third training loss component corresponding to each hidden layer may be determined based on that layer's first and second feature vectors, and the determined components of all the hidden layers are then accumulated to obtain the second training loss. In the case where only the first and second prediction probability distributions are obtained from the translation model, the third training loss may be determined based on the first and second prediction probability distributions. In the case where both the prediction probability distributions and the hidden-layer feature vectors are obtained from the translation model, the second training loss and the third training loss are both determined, in either order; the embodiment of the invention does not limit the order of determining the second training loss and the third training loss.
The third training loss component corresponding to each hidden layer may be obtained by calculating that layer's first and second feature vectors through a preset second loss function, and the third training loss may be obtained by calculating the first and second prediction probability distributions through a preset third loss function. The preset second loss function and the preset third loss function may be the same loss function or different loss functions, which is not limited in the embodiment of the present invention. Both may adopt existing loss functions; for example, the preset second loss function may be a Mean Squared Error (MSE) loss function, and the preset third loss function may be a Kullback-Leibler (KL) divergence loss function. The embodiment of the present invention does not limit the type of the second loss function used to calculate the second training loss or the type of the third loss function used to calculate the third training loss.
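Minimal sketches of the two distances mentioned above; the per-layer accumulation into the second training loss follows step 202, while the epsilon smoothing in the KL term is an added safeguard, not part of the patent:

```python
import math

def mse(u, v):
    # Mean squared error between two equal-length feature vectors.
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two probability distributions.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def second_training_loss(clean_layer_feats, noisy_layer_feats):
    """Accumulate the per-hidden-layer MSE components."""
    return sum(mse(a, b) for a, b in zip(clean_layer_feats, noisy_layer_feats))

print(second_training_loss([[1.0, 2.0], [3.0, 4.0]],
                           [[1.0, 2.0], [3.0, 5.0]]))  # one coordinate differs
```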
Step 203, carrying out weighted summation on the first training loss and the second training loss and/or the third training loss to obtain the current training loss of the translation model.
In the embodiment of the invention, in the case where only the first and second feature vectors output by the hidden layers are obtained from the translation model, after the first training loss and the second training loss are obtained, the current training loss of the translation model can be obtained by weighted summation of the first and second training losses. In the case where only the first and second prediction probability distributions are obtained, after the first training loss and the third training loss are obtained, the current training loss can be obtained by weighted summation of the first and third training losses. In the case where both the prediction probability distributions and the hidden-layer feature vectors are obtained, after the first, second, and third training losses are obtained, the current training loss can be obtained by weighted summation of all three.
The weights of the first, second, and third training losses can be set empirically, with the weight of the first training loss larger than the weights of the second and third training losses.
According to the embodiment of the invention, the various training losses of the translation model are calculated separately and then weighted and summed to obtain the current training loss, and the proportion of each training loss within the current training loss can be set reasonably through the weights, so that the noise robustness of the translation model can be improved while effective training of the translation model is still ensured.
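The weighted summation can be sketched as below; the weight values are illustrative assumptions, chosen only to respect the stated constraint that the first loss weighs most:

```python
def current_training_loss(first, second=None, third=None,
                          w1=1.0, w2=0.3, w3=0.3):
    """Weighted sum of the available training losses; second and third
    are optional, matching the three cases described above."""
    total = w1 * first
    if second is not None:
        total += w2 * second
    if third is not None:
        total += w3 * third
    return total

print(current_training_loss(2.0, second=1.0, third=1.0))  # w1*2.0 + w2*1.0 + w3*1.0
```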
Fig. 3 is a flowchart illustrating a method for training a translation model according to another embodiment of the present invention. As shown in Fig. 3, clean data, i.e., the source language sentence in a parallel bilingual sentence pair, is input into the translation model for a first forward propagation pass; the first feature vectors are obtained at the hidden layers of the encoders and decoders of the translation model, and the first prediction probability distribution is obtained at the softmax layer of the translation model. Noisy data is obtained by adding noise to the clean data, i.e., the noisy source language sentence obtained by performing data enhancement processing on the source language sentence in the parallel bilingual sentence pair; the noisy data is input into the translation model for a second forward propagation pass, the second feature vectors are obtained at the hidden layers of the encoders and decoders, and the second prediction probability distribution is obtained at the softmax layer. The data enhancement processing may replace characters based on a character recognition error comparison table, delete characters in the word at the beginning or end of the sentence, delete the punctuation mark at the end of the sentence, or add a punctuation mark at the beginning of the sentence. The Mean Squared Error (MSE) loss between the first and second feature vectors of each hidden layer across the two forward propagation passes, the KL divergence loss between the first and second prediction probability distributions, and the Cross Entropy (CE) loss between the predicted results output by the translation model in the two passes and the real results are calculated, and the three types of loss are weighted and summed to obtain the current training loss of the translation model. Backward propagation processing is then performed according to the current training loss, and the parameters of the translation model are updated.
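The two punctuation-oriented noise modes just listed (deleting the sentence-final punctuation mark, adding one at the sentence start) can be sketched as follows; the punctuation inventory is an illustrative assumption:

```python
import random

# Illustrative set of sentence punctuation (Chinese and English marks).
PUNCT = "。，！？；：.,!?;:"

def drop_final_punct(sentence):
    """Simulate a scan that stops before the sentence-final punctuation."""
    if sentence and sentence[-1] in PUNCT:
        return sentence[:-1]
    return sentence

def prepend_punct(sentence, rng=random):
    """Simulate a scan that starts early and captures a stray mark
    from the preceding sentence."""
    return rng.choice(PUNCT) + sentence

print(drop_final_punct("It is raining."))
```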
According to the embodiment of the invention, the translation model learns from the noisy data and the clean data at the same time, and its parameters are trained using the Mean Squared Error (MSE) loss and the KL divergence loss, so that the hidden-layer feature vectors of the resulting translation model are insensitive to input noise and its prediction probability distribution is insensitive to noise, thereby improving the robustness of the translation model to noise and effectively ensuring the quality of translation performed with the translation model.
Because the translation model is trained on a large amount of parallel bilingual text to learn the mapping relation between two languages, it is a data-driven model: the quantity, quality, and form of the training samples have a large influence on the translation quality, and the model essentially memorizes and summarizes the data. At present, a large amount of general parallel bilingual text, most of it with standard wording and reasonable grammar, is usually adopted for training, and the resulting translation model is competent for document translation in most application scenarios. Existing data enhancement methods, such as synonym replacement, random deletion, random exchange, random insertion, and random masking, have a certain effect on improving the robustness of the translation model to noise in most application scenarios.
However, in the application scenario of the scanning translation pen, the noise in the input text of the translation model is mostly caused by irregular use of the scanning translation pen, for example OCR recognition errors, missing characters at the beginning or end of the text, missing punctuation, or extra punctuation, as shown in Table 1. Noisy text obtained by applying the existing data enhancement methods to general text does not match the noise actually present in the input text of the scanning translation pen scenario, so the existing methods have limited effect on improving the robustness of the translation model to noise in this scenario.
TABLE 1
(The content of Table 1 is provided as images in the original publication; it illustrates the noise types listed above.)
In view of this, the following embodiments of the present invention provide, for the application scenario of the scanning translation pen, a data enhancement method suited to that scenario, in order to solve the noise problem in the input text of the translation model caused by irregular use of the scanning translation pen. The input text of the translation model is enhanced with this method to obtain noisy text, and the translation model is trained on it, which effectively improves the robustness of the translation model to noise in the scanning translation pen scenario. Fig. 4 is a flowchart illustrating a method for performing data enhancement on a source language sentence to obtain a noisy source language sentence according to an embodiment of the present invention. As shown in Fig. 4, the method at least includes:
Step 401, replacing characters in the source language sentence based on a character recognition error comparison table to obtain a noisy source language sentence, wherein the character recognition error comparison table is obtained based on optical character recognition.
In the embodiment of the invention, the character recognition error comparison table can be obtained based on an optical character recognition model: a large number of picture samples are input into the optical character recognition model for character recognition, and the recognized text is compared at the character level with the text annotated on the picture samples to obtain the character recognition error comparison table. In addition to the correct characters and the incorrect characters, the resulting table may also include the frequency of occurrence of each character. After the table is obtained, the characters in the source language sentence can be replaced with the incorrect characters recorded in the table, according to the recorded occurrence frequencies, to obtain the noisy source language sentence. For Chinese, the table records correct Chinese characters, wrong Chinese characters, and their occurrence frequencies, so the Chinese characters in the source language sentence are replaced according to the table to obtain the noisy source language sentence; for English, the table records correct English words, wrong English words, and their occurrence frequencies, so the English words in the source language sentence are replaced according to the table to obtain the noisy source language sentence.
By using a character recognition error comparison table obtained through an optical character recognition model and replacing characters in the source language sentence according to the frequency of character occurrence, the resulting noisy source language sentence can realistically simulate the noise generated by OCR recognition errors. The noisy source language sentence obtained by character replacement and the target language sentence in the corresponding parallel bilingual sentence pair form a new parallel bilingual sentence pair, and training the translation model with the new pair can enhance the robustness of the translation model to OCR recognition errors.
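A minimal sketch of step 401; the table fragment below is hypothetical, standing in for the confusions and frequencies that would be mined from a real OCR model:

```python
import random

# Hypothetical fragment of a character recognition error comparison
# table: correct character -> list of (wrong character, frequency).
OCR_ERROR_TABLE = {
    "0": [("O", 8), ("o", 2)],
    "l": [("1", 6), ("I", 4)],
}

def ocr_noise(sentence, p=0.3, rng=random):
    """Replace characters with OCR confusions, sampled in proportion
    to their recorded occurrence frequencies (step 401)."""
    out = []
    for ch in sentence:
        candidates = OCR_ERROR_TABLE.get(ch)
        if candidates and rng.random() < p:
            chars, freqs = zip(*candidates)
            out.append(rng.choices(chars, weights=freqs, k=1)[0])
        else:
            out.append(ch)
    return "".join(out)

print(ocr_noise("l0ve", p=1.0, rng=random.Random(3)))
```

For English, the same structure would map correct words to wrong words rather than single characters, as described above.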
Step 402, deleting characters in words at the beginning or end of a sentence in the source language sentence to obtain a noisy source language sentence.
In the embodiment of the invention, the number of characters to be deleted from the word at the beginning or end of the source language sentence can be determined according to the number of characters that word contains, and the characters are then deleted accordingly to obtain the noisy source language sentence. The number of characters contained in the word and the number of characters to be deleted may satisfy a preset functional relationship, in which case the deletion count is determined from that function; alternatively, the two quantities may be linked by a preset correspondence, in which case the deletion count is determined from that correspondence. The embodiment of the invention does not limit how the number of characters to be deleted from the word at the beginning or end of the sentence is determined.
Deleting characters from the word at the beginning or end of the source language sentence yields a noisy source language sentence that realistically simulates the noise produced by incomplete leading or trailing characters. The noisy source language sentence obtained by character deletion and the target language sentence of the corresponding parallel bilingual sentence pair form a new parallel bilingual sentence pair, and training the translation model with this new pair enhances the robustness of the translation model to incomplete leading and trailing characters.
Step 403, deleting punctuation marks at the end of the source language sentence, or adding a punctuation mark at the beginning of the source language sentence to obtain the source language sentence with noise.
Because the translation model is sensitive to punctuation marks, the translation of the same source sentence may vary considerably when the sentence-final punctuation mark is missing or when extra punctuation marks appear at the beginning of the sentence. Although such variation may not amount to a serious error — for example, a synonym may be substituted in the translation without changing the meaning of the sentence — it clearly cannot satisfy application scenarios with high requirements for translation accuracy, such as learning and education. In the embodiment of the present invention, a noisy source language sentence is obtained by deleting the punctuation mark at the end of the source language sentence or adding a punctuation mark at its beginning; the deleted or added punctuation mark may be, for example, a comma, a full stop, a question mark or an exclamation mark.
Deleting the punctuation mark at the end of the source language sentence, or adding a punctuation mark at its beginning, yields a noisy source language sentence that realistically simulates the noise caused by head and tail punctuation problems. The noisy source language sentence obtained by deleting or adding punctuation marks and the target language sentence of the corresponding parallel bilingual sentence pair form a new parallel bilingual sentence pair, and training the translation model with this new pair enhances the robustness of the translation model to head and tail punctuation problems.
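The punctuation-noise operation of step 403 can be sketched as follows. This is a minimal illustration with assumed punctuation inventories and an assumed 50/50 choice between the two variants; neither detail is specified in the patent text.

```python
import random

END_PUNCT = "。！？，.!?,"          # assumed set of sentence-final marks
ADD_PUNCT = ["，", "。", ",", "."]  # assumed candidates to prepend

def punctuation_noise(sentence):
    """Step 403: either drop a trailing punctuation mark or prepend one,
    simulating the head/tail punctuation noise described above."""
    if sentence and sentence[-1] in END_PUNCT and random.random() < 0.5:
        return sentence[:-1]                    # delete trailing punctuation
    return random.choice(ADD_PUNCT) + sentence  # add leading punctuation

noisy = punctuation_noise("I have an apple.")
```

The noisy sentence is then paired with the unchanged target language sentence to form a new training pair, as described above.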
It should be noted that steps 401, 402 and 403 may be executed simultaneously or in any order. Depending on the specific application scenario, at least one of steps 401, 402 and 403 may be used to perform data enhancement on a source language sentence to obtain a noisy source language sentence; the embodiment of the present invention does not limit this.
Aiming at the noise problems in the input text of the scanning translation pen application scenario, the embodiment of the invention provides a data enhancement method suited to that scenario: the input text of the translation model is enhanced with this method to obtain noisy text, and the translation model is then trained with the noisy text thus obtained.
Fig. 5 is a flowchart illustrating a method for replacing characters in a source language sentence to obtain a noisy source language sentence according to an embodiment of the present invention, as shown in fig. 5, where the method at least includes:
step 501, counting the number of characters in the source language sentence, and determining the number of characters to be replaced in the source language sentence according to a preset proportion.
In the embodiment of the present invention, the total number of characters in the source language sentence is obtained by counting its characters, and the number of characters to be replaced is calculated from this total and a preset proportion, for example 5%. The value of the preset proportion can be determined from statistics on erroneous characters in OCR recognition errors, which is not limited in the embodiment of the present invention.
For Chinese, counting the number of Chinese characters in a source language sentence, and determining the number of Chinese characters to be replaced in the source language sentence according to a preset proportion; for English, the number of English words in the source language sentence is counted, and the number of English words to be replaced in the source language sentence is determined according to a preset proportion.
Step 502, determining characters to be replaced in the source language sentence based on the frequency of occurrence of characters in the source language sentence in the character recognition error lookup table and the determined number of characters to be replaced.
In the embodiment of the present invention, the character recognition error comparison table may be queried for all characters in the source language sentence, so as to obtain the occurrence frequency of all characters in the source language sentence in the characters recorded in the character recognition error comparison table, and determine the characters to be replaced in the source language sentence according to the obtained occurrence frequency of the characters corresponding to all characters in the source language sentence and the determined number of the characters to be replaced. For example, the obtained frequencies of occurrence of the characters corresponding to all the characters in the source language sentence may be sorted from high to low, and the characters sorted in the front may be selected as the characters to be replaced in the source language sentence according to the determined number of the characters to be replaced.
For Chinese, determining Chinese characters to be replaced in a source language sentence based on the occurrence frequency of the Chinese characters in the source language sentence in a character recognition error comparison table and the determined quantity of the Chinese characters to be replaced; for English, the English word to be replaced in the source language sentence is determined based on the frequency of occurrence of the English word in the source language sentence in the character recognition error lookup table and the determined number of English words to be replaced.
Step 503, based on the determined character to be replaced, obtaining a corresponding error character from the character recognition error comparison table, and replacing the character in the source language sentence to obtain a source language sentence with noise.
In the embodiment of the present invention, the character recognition error comparison table may be queried according to the determined character to be replaced in the source language sentence, the corresponding error character recorded in the character recognition error comparison table is obtained, and the corresponding character in the source language sentence is replaced according to the obtained error character, so as to obtain the source language sentence with noise.
For Chinese, based on the determined Chinese character to be replaced, acquiring a corresponding error Chinese character from a character recognition error comparison table, and replacing the Chinese character in the source language sentence to obtain a source language sentence with noise; and for English, acquiring a corresponding wrong English word from the character recognition error comparison table based on the determined English word to be replaced, and replacing the English word in the source language sentence to obtain the source language sentence added with noise.
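Steps 501–503 can be sketched as follows for the English, word-level case. This is an illustrative example only: the `error_table` structure (word → recorded erroneous forms with frequencies), the use of the summed error frequencies as a word's "occurrence frequency", and the 5% default ratio are assumptions consistent with, but not dictated by, the description above.

```python
import math

def add_replacement_noise(sentence, error_table, ratio=0.05):
    """Steps 501-503: pick the `ratio` fraction of words with the highest
    recorded frequency in the error table and swap each for an error form."""
    words = sentence.split()                     # English: word-level units
    n_replace = max(1, math.floor(len(words) * ratio))   # step 501
    # Step 502: a word's frequency = total error occurrences recorded for it.
    freq = {i: sum(error_table[w].values())
            for i, w in enumerate(words) if w in error_table}
    # Highest-frequency words are replaced first.
    targets = sorted(freq, key=freq.get, reverse=True)[:n_replace]
    for i in targets:
        # Step 503: take the most frequent recorded error form.
        errors = error_table[words[i]]
        words[i] = max(errors, key=errors.get)
    return " ".join(words)

table = {"have": {"hove": 3, "how": 1}}
noisy = add_replacement_noise("I have an apple", table)
# "have" is replaced by "hove", its most frequent recorded error
```

For Chinese, the same procedure would operate on individual Chinese characters instead of space-separated words.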
Fig. 6 is a flowchart illustrating a method for replacing a character in a source language sentence by obtaining an incorrect character from a character recognition error lookup table according to an embodiment of the present invention, where as shown in fig. 6, the method at least includes:
step 601, judging whether the determined character to be replaced corresponds to more than two wrong characters in the character recognition error comparison table.
If the determined character to be replaced corresponds to more than two wrong characters in the character recognition error comparison table, executing step 602; if the determined character to be replaced corresponds to less than two wrong characters in the character recognition error comparison table, step 603 is executed.
Step 602, determining the error character for replacement based on the frequency of more than two corresponding error characters in the character recognition error comparison table.
Step 603, determining a corresponding error character in the character recognition error comparison table as an error character for replacement.
In the embodiment of the invention, after the characters to be replaced in the source language sentence are determined, whether each determined character to be replaced corresponds to more than two error characters in the character recognition error comparison table can be judged by inquiring the character recognition error comparison table.
And if the determined character to be replaced corresponds to more than two error characters in the character recognition error comparison table, determining the error character for replacement according to the frequency of the corresponding more than two error characters recorded in the character recognition error comparison table. For example, the frequencies of the two or more corresponding error characters can be sorted from high to low, and the character sorted at the top is selected as the error character for replacement.
If the determined character to be replaced does not correspond to more than two error characters in the character recognition error comparison table, that is, the determined character to be replaced corresponds to one error character in the character recognition error comparison table, determining the corresponding one error character in the character recognition error comparison table as the error character for replacement.
Step 604, obtaining the determined error characters for replacement from the character recognition error lookup table, and replacing the characters in the source language sentence to obtain a noisy source language sentence.
In the embodiment of the present invention, after determining the incorrect character for replacement according to the character recognition incorrect comparison table, the determined incorrect character for replacement may be obtained from the character recognition incorrect comparison table, and the corresponding character in the source language sentence is replaced, so as to obtain the source language sentence with noise.
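The branching of steps 601–603 can be sketched as follows; the helper name and table structure are hypothetical, and "sorted from high to low, take the top" is implemented here as a simple maximum over frequencies.

```python
def pick_error_char(char, error_table):
    """Steps 601-603: if a character has two or more recorded error forms,
    take the most frequent one (step 602); with a single recorded form,
    use it directly (step 603)."""
    errors = error_table.get(char, {})
    if not errors:
        return char                   # no recorded error: leave unchanged
    if len(errors) >= 2:              # step 602: rank error forms by frequency
        return max(errors, key=errors.get)
    return next(iter(errors))         # step 603: the single recorded error

replacement = pick_error_char("have", {"have": {"hove": 3, "how": 1}})
# → "hove", since it is the more frequent of the two recorded errors
```

Step 604 then substitutes the returned error form for the character in the source language sentence.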
Fig. 7 is a schematic diagram of training a translation model with noisy text obtained through data enhancement by character replacement according to an embodiment of the present invention. As shown in fig. 7, an OCR engine is used to recognize a large number of picture samples, and the recognized text is compared at the character level with the manually annotated text of the picture samples to obtain a character recognition error lookup table. The erroneous English word "how" recorded in the lookup table then replaces the word "have" in the source language sentence "I have an apple", yielding the noisy source language sentence "I how an apple", which together with the target language sentence of the original sentence "I have an apple" forms a new pair for training the translation model.
Fig. 8 is a flowchart illustrating a method for deleting characters in a source language sentence to obtain a noisy source language sentence according to an embodiment of the present invention, as shown in fig. 8, where the method at least includes:
in step 801, the number of characters contained in the words at the beginning or end of a sentence in the source language sentence is determined.
Step 802, determining whether the number of characters included in the determined words at the beginning or end of the sentence matches a preset number.
If the number of characters contained in the word at the beginning or end of the sentence conforms to the preset number, execute step 803; otherwise, the procedure ends.
In step 803, the number of characters to be deleted in the words at the beginning or end of the sentence in the source language sentence is determined.
In the embodiment of the invention, the number of characters contained in the word at the beginning or end of the source language sentence is determined, and it is judged whether this number conforms to a preset number. If it does, characters in that word are deleted to obtain the noisy source language sentence; if it does not, no characters are deleted and the processing of the source language sentence ends.
Because Chinese words and English words differ greatly in the number of characters they contain — Chinese words contain few characters, so Chinese word segmentation must be performed first — the head and tail characters differ substantially between Chinese and English, and the preset number must therefore be set according to the language of the source language sentence when performing data enhancement by deleting characters from words at the beginning or end of a sentence. For example, for English words the preset number is greater than 5 characters: characters are deleted from an English word at the beginning or end of a sentence only when it contains more than 5 characters. For Chinese words the preset number is greater than or equal to 2 characters: characters are deleted from a Chinese word at the beginning or end of a sentence only when it contains at least 2 characters. The embodiment of the invention does not limit the word segmentation tool used for Chinese word segmentation; an existing tool such as the Language Technology Platform (LTP) may be used.
In the embodiment of the present invention, after judging whether the number of characters in the word at the beginning or end of the sentence conforms to the preset number, the number of characters to be deleted from that word can be determined. If the source language sentence is in English, analysis of data with incomplete leading and trailing characters shows that such data approximately follows a Gaussian distribution, so the number of characters to be deleted can be drawn from a Gaussian distribution to match the actual application scenario of the scanning translation pen. For example, the Gaussian distribution may have mean 2 and variance 1, with non-integer samples rounded and negative samples set to zero. If the source language sentence is in Chinese, the number of characters to be deleted from the word at the beginning or end of the sentence may be set to one character.
And step 804, deleting the characters in the words of the sentence head or the sentence tail in the source language sentence based on the number of the characters to be deleted in the words of the sentence head or the sentence tail, so as to obtain the source language sentence with noise.
In the embodiment of the present invention, after the number of characters to be deleted in the words of the beginning or the end of the sentence is determined, a corresponding number of characters in the words of the beginning of the sentence in the source language sentence may be deleted according to the determined number of characters to be deleted in the words of the beginning or the end of the sentence, or a corresponding number of characters in the words of the end of the sentence in the source language sentence may be deleted, so as to obtain the source language sentence with noise.
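Steps 801–804 can be sketched as follows for the English case, using the Gaussian parameters (mean 2, variance 1, rounded, clipped at zero) and the 5-character threshold given above; the Chinese case, which requires word segmentation (e.g. with LTP) before deleting a single character, is omitted from this sketch.

```python
import random

def deletion_count(word):
    """Steps 802-803 for English: words of 5 characters or fewer are left
    intact; longer words get a Gaussian-distributed deletion count with
    mean 2 and variance 1, rounded and clipped at zero."""
    if len(word) <= 5:
        return 0
    return max(0, round(random.gauss(2, 1)))

def truncate_head(sentence):
    """Step 804: delete the sampled number of characters from the start of
    the sentence-initial word (the sentence-final case is symmetric)."""
    words = sentence.split()
    if words:
        words[0] = words[0][deletion_count(words[0]):]
    return " ".join(w for w in words if w)

noisy = truncate_head("wonderful day")
```

The resulting noisy sentence simulates a scan that started mid-word, and is paired with the unchanged target language sentence for training.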
According to the embodiment of the invention, before the translation model is trained to improve the noise robustness, the translation model can be trained conventionally, and the training to improve the noise robustness of the translation model is carried out on the translation model obtained by conventional training convergence. Therefore, the embodiment of the present invention further provides a translation model training method including two stages, where the first stage is a conventional training stage, and the second stage is an enhanced training stage for improving robustness. Fig. 9 is a schematic flowchart of a translation model training method according to another embodiment of the present invention, as shown in fig. 9, the method at least includes:
step 901, inputting a source language text sample into a translation model to obtain a predicted target language text output by the translation model; wherein the source language text sample comprises a source language sentence and a noisy source language sentence in a parallel bilingual sentence pair.
Step 902, determining a current training loss of the translation model based on the predicted target language text and the real target language text of the source language text sample; wherein the real target language text comprises target language sentences in parallel bilingual sentence pairs.
Step 903, adjusting parameters of the translation model based on the determined current training loss.
In the embodiment of the present invention, step 901, step 902 and step 903 are steps of the first stage of regular training. The source language text sample may include a source language sentence in the parallel bilingual sentence pair and a source language sentence with noise, and the source language sentence with noise may be obtained by performing data enhancement processing on the source language sentence in the parallel bilingual sentence pair, for example, the source language sentence in step 901 and the source language sentence with noise may be the source language sentence in step 904 and the source language sentence with noise in step 905, respectively. The real target language text may include a target language sentence in a parallel bilingual sentence pair, for example, the target language sentence in step 901 may be the target language sentence corresponding to the source language sentence in step 904 and the target language sentence corresponding to the noisy source language sentence in step 905. The current training loss of the translation model can be obtained by calculating the real target language texts of the predicted target language texts and the source language text samples by adopting the existing loss function.
Step 904, inputting the source language sentence in the parallel bilingual sentence pair into the translation model to obtain the first predicted target language sentence output by the translation model, and obtaining the first predicted probability distribution of the translation model and/or the first feature vector output by each hidden layer of the translation model.
Step 905, inputting the source language sentence added with the noise into the translation model to obtain a second predicted target language sentence output by the translation model, and acquiring second predicted probability distribution of the translation model and/or second feature vectors output by each hidden layer of the translation model; the source language sentence added with the noise is obtained based on data enhancement of the source language sentence.
Step 906, determining the current training loss of the translation model based on the target language sentence in the parallel bilingual sentence pair corresponding to the first predicted target language sentence and the parallel bilingual sentence pair, the target language sentence in the parallel bilingual sentence pair corresponding to the second predicted target language sentence and the noisy source language sentence, and the first feature vector and the second feature vector and/or the first predicted probability distribution and the second predicted probability distribution.
Step 907, adjust parameters of the translation model based on the determined current training loss.
In the embodiment of the present invention, step 904, step 905, step 906 and step 907 are steps of the enhanced training for improving robustness in the second stage. The description of step 904, step 905, step 906 and step 907 can refer to the description of step 101, step 102, step 103 and step 104 in fig. 1, and thus the description is not repeated.
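One possible form of the step-906 loss is sketched below. The exact combination is not specified in this excerpt, so the symmetric-KL term, the squared-distance feature term, and the weights `alpha` and `beta` are assumptions; the sketch only illustrates how the two forward passes (steps 904 and 905) could constrain each other.

```python
import math

def kl(p, q):
    """KL divergence between two discrete prediction distributions
    (assumes q is strictly positive wherever p is)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def consistency_loss(ce_clean, ce_noisy, p_clean, p_noisy,
                     h_clean, h_noisy, alpha=1.0, beta=1.0):
    """Assumed step-906 loss: the two translation (cross-entropy) losses,
    a symmetric KL term tying the clean and noisy prediction distributions
    together, and a squared-distance term tying the hidden-layer feature
    vectors together.  `alpha` and `beta` are hypothetical weights."""
    kl_term = 0.5 * (kl(p_clean, p_noisy) + kl(p_noisy, p_clean))
    feat_term = sum((a - b) ** 2 for a, b in zip(h_clean, h_noisy))
    return ce_clean + ce_noisy + alpha * kl_term + beta * feat_term

loss = consistency_loss(1.2, 1.5,
                        [0.7, 0.2, 0.1], [0.6, 0.3, 0.1],
                        [0.5, -0.2], [0.4, -0.1])
```

When the clean and noisy passes agree exactly, the extra terms vanish and the loss reduces to the sum of the two translation losses, which is the behavior the mutual constraint is designed to encourage.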
In some optional examples, the source language text sample may further include a clause fragment of the source language sentence in the parallel bilingual sentence pair, and the real target language text may further include a clause fragment of the target language sentence in the parallel bilingual sentence pair corresponding to the clause fragment of the source language sentence.
Fig. 10 is a flowchart illustrating a method for obtaining clause fragments of a source language sentence and a target language sentence of a parallel bilingual sentence pair according to an embodiment of the present invention, as shown in fig. 10, the method at least includes:
step 1001, performing word alignment on a source language sentence and a target language sentence in a parallel bilingual sentence pair.
In the application scenario of the scanning translation pen, irregular operation of the pen leads to incomplete scanning of a sentence, so that the grammatical structure of the scanned text lacks a subject, object or predicate, as shown in table 2, producing sentence fragments in the input text of the translation model. The translation effect of the translation model on such sentence fragments therefore needs to be enhanced.
TABLE 2
[Table 2 is provided as an image in the original publication: examples of scanned sentence fragments lacking a subject, object or predicate.]
In order to improve the translation effect of the translation model on sentence fragments, the embodiment of the invention provides a data enhancement method that extracts clause segments from complete sentence pairs based on word alignment. The embodiment of the invention does not limit the existing word alignment tool used; any such tool may be used to align the words of the source language sentence and the target language sentence in a parallel bilingual sentence pair.
Step 1002, extracting clause segments of the source language sentence and clause segments corresponding to the clause segments of the source language sentence in the target language sentence based on the source language sentence and the target language sentence which are aligned.
In the embodiment of the invention, after word alignment is performed on the parallel bilingual sentence pair, clause segments of the source language sentence and the corresponding clause segments of the target language sentence can be extracted from the word-aligned sentences, yielding sentence fragments for training the translation model. Short clause segments of the source language sentence and of the target language sentence can be extracted from the word-aligned sentences, where a short clause segment contains no punctuation mark in its interior. For example, the short clause segment of the source language sentence is "continuously developing", and the corresponding short clause segment of the target language sentence is "sustainable development"; the short clause segment of the source language sentence is "service enterprise", and the corresponding short clause segment of the target language sentence is "Service enterprises"; the short clause segment of the source language sentence is "their wedding anniversary", and the corresponding short clause segment of the target language sentence is "their wedding anniversary". Long clause segments of the source language sentence and of the target language sentence may likewise be extracted from the word-aligned sentences, where a long clause segment includes punctuation marks in its interior, for example 1 to 2 punctuation marks.
For example, the long clause segment of the source language sentence is "corresponding countermeasures, thereby realizing the sustainable development of the tourism industry", and the corresponding long clause segment of the target language sentence is "corresponding measures to achieve the sustainable development of tourism"; the long clause segment of the source language sentence is "service enterprise, service project", and the corresponding long clause segment of the target language sentence is "Service enterprises and Service projects"; the long clause segment of the source language sentence is "celebration on the wedding anniversary, which accords with the deep desire of the parties", and the corresponding long clause segment of the target language sentence is "celebration on the wedding anniversary is in line with the deep desire of the parties".
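The extraction of aligned clause segments can be sketched with the standard phrase-pair consistency check from statistical machine translation; this is an illustrative implementation under assumed inputs (alignment links as (source index, target index) pairs), not necessarily the patent's exact procedure.

```python
def extract_fragment(src_tokens, tgt_tokens, alignment, i, j):
    """Given word-alignment links (src_idx, tgt_idx), return the target
    clause segment aligned to the source span [i, j), or None if any link
    crosses the span boundary (the phrase-pair consistency check)."""
    tgt_idx = {t for s, t in alignment if i <= s < j}
    if not tgt_idx:
        return None
    lo, hi = min(tgt_idx), max(tgt_idx)
    # Reject the span if a target word inside [lo, hi] is aligned to a
    # source word outside [i, j).
    if any(lo <= t <= hi and not (i <= s < j) for s, t in alignment):
        return None
    return " ".join(tgt_tokens[lo:hi + 1])

# Hypothetical tokens mirroring the "service enterprise" example above.
src = ["服务", "企业"]
tgt = ["Service", "enterprises"]
frag = extract_fragment(src, tgt, [(0, 0), (1, 1)], 0, 2)
```

Each extracted source-language segment and its target-language counterpart then form a new sentence-fragment training pair.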
Based on any of the above embodiments, fig. 11 is a schematic flow chart of a translation method provided in an embodiment of the present invention, where the method at least includes:
step 1101, collecting an image of a text to be translated, and performing character recognition on the collected image of the text to be translated to obtain a source language text.
In the embodiment of the invention, the image of the text to be translated can be acquired through the image acquisition equipment, and the acquired image of the text to be translated is subjected to character recognition by using the optical character recognition model to obtain the source language text. For example, the image capturing device may be a video camera, a still camera, a scanner, or the like, which is not limited in this embodiment of the present invention.
Step 1102, translating the source language text through the translation model to obtain a target language text.
In the embodiment of the present invention, the translation model may be obtained by training based on the translation model training method provided in any one of the above embodiments.
According to the translation method provided by the embodiment of the invention, when the translation model is trained, the prediction probability distribution obtained from the noisy source language sentence and the one obtained from the original source language sentence constrain each other, and/or the feature vectors output by each hidden layer for the noisy source language sentence and those output for the original source language sentence constrain each other. The translation model thereby learns similar prediction probability distributions and/or feature vectors whether or not noise is added to the source language sentence, making it insensitive to the presence of such noise. This improves the robustness of the translation model to noise and effectively guarantees the quality of translation performed with the translation model.
Based on the translation model training method provided in any of the embodiments, an embodiment of the present invention further provides a translation model training device, fig. 12 is a schematic diagram of a composition structure of the translation model training device provided in an embodiment of the present invention, and as shown in fig. 12, the translation model training device at least includes:
the first prediction module 1210 is configured to input a source language sentence in the parallel bilingual sentence pair into the translation model, obtain a first predicted target language sentence output by the translation model, and obtain a first predicted probability distribution of the translation model and/or a first feature vector output by each hidden layer of the translation model.
The second prediction module 1220 is configured to input the source language sentence with noise into the translation model, obtain a second predicted target language sentence output by the translation model, and obtain a second prediction probability distribution of the translation model and/or a second feature vector output by each hidden layer of the translation model; the source language sentence added with the noise is obtained based on data enhancement of the source language sentence.
A loss calculation module 1230, configured to determine a current training loss of the translation model based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, and the first feature vector and the second feature vector and/or the first predicted probability distribution and the second predicted probability distribution.
A parameter adjustment module 1240 for adjusting parameters of the translation model based on the determined current training loss.
The translation model training device provided by the embodiment of the invention inputs the source language sentence in the parallel bilingual sentence pair and the noisy source language sentence obtained by data enhancement of that source language sentence into the translation model respectively, performing two forward propagation passes to obtain the prediction probability distribution of the translation model and/or the feature vectors output by each hidden layer of the translation model. The current training loss of the translation model is determined based on the predicted target language sentences output by the two forward passes together with the target language sentence in the parallel bilingual sentence pair, and on the prediction probability distributions and/or hidden-layer feature vectors obtained from the two forward passes, and the parameters of the translation model are adjusted accordingly. Because the prediction probability distribution obtained from the noisy source language sentence and the prediction probability distribution obtained from the source language sentence constrain each other, and/or the feature vectors output by each hidden layer for the noisy source language sentence and for the source language sentence constrain each other, the translation model learns similar prediction probability distributions and/or feature vectors whether or not noise is added to the source language sentence. The translation model is therefore insensitive to whether the source language sentence contains noise, which improves the robustness of the translation model to noise and effectively ensures the quality of translation performed with the translation model. Compared with a method combining data enhancement and adversarial training, this approach requires no additional discriminator and no extra training of a discriminator, so the training method is simple and model training is stable.
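The two-pass training scheme described above can be sketched in minimal form. The loss-weight values and the helper names (`symmetric_kl`, `current_training_loss`) are illustrative assumptions rather than the patent's reference implementation; a real system would compute these quantities over batched tensors inside a neural framework.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of vocabulary logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def symmetric_kl(p, q, eps=1e-9):
    # Bidirectional KL divergence that constrains the prediction
    # distributions of the clean and noisy forward passes against each other.
    kl_pq = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return (kl_pq + kl_qp) / 2.0

def feature_distance(feats_clean, feats_noisy):
    # Second training loss: mean squared distance between the feature
    # vectors output by each hidden layer, accumulated over all layers.
    total = 0.0
    for a, b in zip(feats_clean, feats_noisy):
        total += sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return total

def current_training_loss(ce_clean, ce_noisy, p_clean, p_noisy,
                          feats_clean, feats_noisy,
                          w_ce=1.0, w_feat=0.5, w_prob=0.5):
    # First training loss: translation (cross-entropy) losses of the two
    # forward passes; then weighted summation with the consistency terms.
    first = ce_clean + ce_noisy
    second = feature_distance(feats_clean, feats_noisy)
    third = symmetric_kl(p_clean, p_noisy)
    return w_ce * first + w_feat * second + w_prob * third
```

When the clean and noisy passes produce identical distributions and features, the second and third terms vanish and only the translation losses remain, which is the fixed point toward which the mutual constraint pushes the model.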
Based on any of the above embodiments, the loss calculating module 1230 includes:
and the first loss calculation unit is used for determining the first training loss based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, and on the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence.
A second loss calculation unit for determining a second training loss based on the first feature vector and the second feature vector; and/or,
a third loss calculation unit configured to determine a third training loss based on the first prediction probability distribution and the second prediction probability distribution;
and the training loss calculation unit is used for performing weighted summation on the first training loss and the second training loss and/or the third training loss to obtain the current training loss of the translation model.
Based on any one of the embodiments above, the first loss calculation unit includes:
a first loss component calculation subunit, configured to determine a first training loss component based on the first predicted target language statement and a target language statement in the parallel bilingual statement pair;
a second loss component calculation subunit, configured to determine a second training loss component based on the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence;
a first loss calculating subunit, configured to accumulate the first training loss component and the second training loss component to obtain the first training loss.
Based on any one of the above embodiments, the second loss calculation unit includes:
a third loss component calculating subunit, configured to determine, based on the first feature vector and the second feature vector of each hidden layer, a third training loss component corresponding to each hidden layer in each hidden layer respectively;
and the second loss calculation subunit is configured to accumulate the determined third training loss components of each hidden layer to obtain the second training loss.
Based on any one of the above embodiments, the translation model training apparatus further includes:
the first data enhancement module is used for replacing characters in the source language sentence based on a character recognition error comparison table to obtain the source language sentence added with noise; wherein the character recognition error comparison table is obtained based on optical character recognition; and/or,
the second data enhancement module is used for deleting characters in words at the beginning or the end of a sentence in the source language sentence to obtain the source language sentence added with noise; and/or,
and the third data enhancement module is used for deleting punctuation marks at the tail of the source language sentence or adding a punctuation mark at the beginning of the source language sentence to obtain the source language sentence added with noise.
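As a concrete illustration of the third data enhancement module, the punctuation perturbation can be sketched as below; the set of sentence-final marks and the default inserted mark are illustrative assumptions.

```python
def punctuation_noise(sentence, mode="drop_end", mark=","):
    # Simulates OCR punctuation noise: either delete the punctuation mark
    # at the end of the sentence, or add a stray mark at the beginning.
    end_marks = ".!?,;。！？，；"
    if mode == "drop_end" and sentence and sentence[-1] in end_marks:
        return sentence[:-1]
    if mode == "add_begin":
        return mark + sentence
    return sentence
```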
Based on any of the above embodiments, the first data enhancement module includes:
the character number counting unit is used for counting the number of characters in the source language sentence and determining the number of characters to be replaced in the source language sentence according to a preset proportion;
a to-be-replaced character determining unit, configured to determine a character to be replaced in the source language sentence based on the frequency of occurrence of the character in the source language sentence in the character recognition error lookup table and the determined number of the characters to be replaced;
and the character replacing unit is used for acquiring corresponding error characters from the character recognition error comparison table based on the determined characters to be replaced, and replacing the characters in the source language sentence to obtain the source language sentence with noise.
Based on any of the above embodiments, the character replacement unit is configured to:
if the determined character to be replaced corresponds to more than two error characters in the character recognition error comparison table, determining the error character for replacement based on the frequency of the more than two corresponding error characters in the character recognition error comparison table;
and acquiring the determined error characters for replacement from the character recognition error comparison table, and replacing the characters in the source language sentence to obtain the source language sentence with noise.
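A minimal sketch of the frequency-based replacement just described; the confusion table `OCR_ERRORS` and the replacement ratio are invented placeholders — a real table would be mined by aligning optical character recognition output against ground-truth text.

```python
from collections import Counter

# Hypothetical OCR confusion observations as (correct char, recognized char)
# pairs; repeated pairs encode how frequently each error occurs.
OCR_ERRORS = [("l", "1"), ("l", "I"), ("l", "1"), ("o", "0"), ("o", "0")]

def pick_replacement(char, table=OCR_ERRORS):
    # Among the error characters recorded for `char`, return the one that
    # occurs most frequently in the confusion table, or None if absent.
    counts = Counter(wrong for right, wrong in table if right == char)
    return counts.most_common(1)[0][0] if counts else None

def add_char_noise(sentence, ratio=0.34, table=OCR_ERRORS):
    # Count the characters, derive the number to replace from the preset
    # ratio, and substitute replaceable characters with their most
    # frequent OCR error characters.
    n_replace = max(1, int(len(sentence) * ratio))
    chars, replaced = list(sentence), 0
    for i, c in enumerate(chars):
        if replaced >= n_replace:
            break
        sub = pick_replacement(c, table)
        if sub is not None:
            chars[i] = sub
            replaced += 1
    return "".join(chars)
```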
Based on any embodiment above, the second data enhancement module includes:
a character number calculating unit for determining the number of characters contained in the words of the beginning or end of the sentence in the source language sentence;
a to-be-deleted character number determining unit, configured to determine the number of characters to be deleted in the words of the sentence start or the sentence end in the source language sentence if the number of characters included in the determined words of the sentence start or the sentence end matches a preset number; wherein the preset number is set according to the language of the source language sentence;
and the character deleting unit is used for deleting characters in words of the sentence head or the sentence tail in the source language sentence based on the number of the characters to be deleted in the determined words of the sentence head or the sentence tail to obtain the source language sentence added with noise.
Based on any of the above embodiments, the number of characters to be deleted determining unit is configured to:
if the language of the source language sentence is English, determining the number of characters to be deleted in words at the beginning or the end of the sentence in the source language sentence based on a Gaussian distribution;
and if the language of the source language sentence is Chinese, determining that the number of characters to be deleted in words at the beginning or the end of the sentence in the source language sentence is one character.
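The language-dependent deletion counts can be sketched as follows; the Gaussian mean and standard deviation are illustrative assumptions, as is deleting from the end of the word rather than the beginning.

```python
import random

def chars_to_delete(word, language, rng=None):
    # English: number of deleted characters drawn from a Gaussian
    # distribution (illustrative mean/stddev); Chinese: exactly one character.
    if language == "zh":
        return 1
    rng = rng or random.Random(0)
    n = int(round(abs(rng.gauss(mu=1.0, sigma=1.0))))
    # Always delete at least one character, but leave part of the word.
    return max(1, min(n, max(len(word) - 1, 1)))

def delete_word_chars(word, n):
    # Delete n characters from the end of a sentence-initial or
    # sentence-final word to produce the noisy form.
    return word[: max(len(word) - n, 0)]
```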
Based on any one of the above embodiments, the translation model training apparatus further includes:
the third prediction module is used for inputting the source language text sample into the translation model to obtain a predicted target language text output by the translation model; wherein the source language text sample comprises a source language sentence in the parallel bilingual sentence pair and the noisy source language sentence;
the loss calculation module is further configured to determine a current training loss of the translation model based on the predicted target language text and a real target language text of the source language text sample; wherein the real target language text comprises a target language sentence in the parallel bilingual sentence pair;
the parameter adjustment module is further configured to adjust a parameter of the translation model based on the determined current training loss.
Based on any of the above embodiments, the source language text sample further includes a clause fragment of the source language sentence in the parallel bilingual sentence pair, and the real target language text further includes a clause fragment corresponding to the clause fragment of the source language sentence in the target language sentence in the parallel bilingual sentence pair.
Based on any one of the above embodiments, the translation model training apparatus further includes:
the fourth data enhancement module is used for performing word alignment on the source language sentence and the target language sentence in the parallel bilingual sentence pair; and extracting clause segments of the source language sentence and clause segments corresponding to the clause segments of the source language sentence in the target language sentence based on the source language sentence and the target language sentence which are aligned.
Based on any of the above embodiments, the fourth data enhancement module is configured to:
extracting short clause segments of the source language sentence and short clause segments of the target language sentence based on the word-aligned source language sentence and target language sentence; wherein no punctuation mark appears within a short clause segment;
extracting long clause segments of the source language sentence and long clause segments of the target language sentence based on the word-aligned source language sentence and target language sentence; wherein punctuation marks are included within a long clause segment.
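Given word alignments (for example from an external word aligner), extracting the target-side fragment that corresponds to a source clause span can be sketched as below; the half-open span convention and the assumption that the aligned target positions form one contiguous range are simplifications.

```python
def extract_aligned_fragment(src_tokens, tgt_tokens, alignment, src_span):
    # alignment: list of (source index, target index) pairs produced by a
    # word aligner; src_span: half-open [lo, hi) span of a source clause.
    # Returns the contiguous target-side fragment covering every target
    # position aligned to the source clause.
    lo, hi = src_span
    tgt_idx = [t for s, t in alignment if lo <= s < hi]
    if not tgt_idx:
        return []
    return tgt_tokens[min(tgt_idx): max(tgt_idx) + 1]
```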
Based on the translation method provided by any of the above embodiments, an embodiment of the present invention further provides a translation apparatus. Fig. 13 is a schematic diagram of the composition structure of the translation apparatus provided by the embodiment of the present invention, and as shown in fig. 13, the translation apparatus at least includes:
the character recognition module 1310 is configured to collect an image of a text to be translated, and perform character recognition on the collected image of the text to be translated to obtain a source language text.
A machine translation module 1320, configured to translate the source language text through a translation model to obtain a target language text; the translation model is obtained by training based on the translation model training method provided by any one of the embodiments.
According to the translation device provided by the embodiment of the invention, when the translation model is trained, the prediction probability distribution obtained from the noisy source language sentence and the prediction probability distribution obtained from the source language sentence constrain each other, and/or the feature vectors output by each hidden layer for the noisy source language sentence and the feature vectors output by each hidden layer for the source language sentence constrain each other, so that the translation model learns similar prediction probability distributions and/or feature vectors regardless of whether noise is added to the source language sentence. The translation model is thus insensitive to whether the source language sentence contains noise, which improves the robustness of the translation model to noise and effectively ensures the quality of translation performed with the translation model.
Fig. 14 illustrates a physical structure diagram of an electronic device, and as shown in fig. 14, the electronic device may include: a processor (processor)1410, a communication Interface (Communications Interface)1420, a memory (memory)1430 and a communication bus 1440, wherein the processor 1410, the communication Interface 1420 and the memory 1430 communicate with each other via the communication bus 1440. Processor 1410 may call logic instructions in memory 1430 to perform the following method: inputting a source language sentence in a parallel bilingual sentence pair into a translation model to obtain a first predicted target language sentence output by the translation model, and acquiring first predicted probability distribution of the translation model and/or first feature vectors output by hidden layers of the translation model; inputting the source language sentences added with noise into the translation model to obtain second predicted target language sentences output by the translation model, and acquiring second predicted probability distribution of the translation model and/or second feature vectors output by each hidden layer of the translation model; the source language sentence added with the noise is obtained by performing data enhancement on the source language sentence; determining a current training loss of the translation model based on the first predicted target language sentence and a target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and a target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, and the first feature vector and the second feature vector and/or the first predicted probability distribution and the second predicted probability distribution; adjusting parameters of the translation model based on the determined current training loss.
Further, processor 1410 may call logic instructions in memory 1430 to perform the following method: acquiring an image of a text to be translated, and performing character recognition on the acquired image of the text to be translated to obtain a source language text; translating the source language text through a translation model to obtain a target language text; the translation model is obtained by training based on a translation model training method.
In addition, the logic instructions in the memory 1430 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Embodiments of the present invention also provide a computer program product, the computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions, which when executed by a computer, enable the computer to perform the methods provided by the above-mentioned method embodiments, for example, including: inputting a source language sentence in a parallel bilingual sentence pair into a translation model to obtain a first predicted target language sentence output by the translation model, and acquiring first predicted probability distribution of the translation model and/or first feature vectors output by hidden layers of the translation model; inputting the source language sentences added with noise into the translation model to obtain second predicted target language sentences output by the translation model, and acquiring second predicted probability distribution of the translation model and/or second feature vectors output by each hidden layer of the translation model; the source language sentence added with the noise is obtained by performing data enhancement on the source language sentence; determining a current training loss of the translation model based on the first predicted target language sentence and a target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and a target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, and the first feature vector and the second feature vector and/or the first predicted probability distribution and the second predicted probability distribution; adjusting parameters of the translation model based on the determined current training loss.
Embodiments of the present invention also provide a computer program product, the computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions, which when executed by a computer, enable the computer to perform the methods provided by the above-mentioned method embodiments, for example, including: acquiring an image of a text to be translated, and performing character recognition on the acquired image of the text to be translated to obtain a source language text; translating the source language text through a translation model to obtain a target language text; the translation model is obtained by training based on a translation model training method.
Embodiments of the present invention further provide a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the method provided in the foregoing embodiments, the method including: inputting a source language sentence in a parallel bilingual sentence pair into a translation model to obtain a first predicted target language sentence output by the translation model, and acquiring first predicted probability distribution of the translation model and/or first feature vectors output by hidden layers of the translation model; inputting the source language sentences added with noise into the translation model to obtain second predicted target language sentences output by the translation model, and acquiring second predicted probability distribution of the translation model and/or second feature vectors output by each hidden layer of the translation model; the source language sentence added with the noise is obtained by performing data enhancement on the source language sentence; determining a current training loss of the translation model based on the first predicted target language sentence and a target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and a target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, and the first feature vector and the second feature vector and/or the first predicted probability distribution and the second predicted probability distribution; adjusting parameters of the translation model based on the determined current training loss.
Embodiments of the present invention further provide a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the method provided in the foregoing embodiments, the method including: acquiring an image of a text to be translated, and performing character recognition on the acquired image of the text to be translated to obtain a source language text; translating the source language text through a translation model to obtain a target language text; the translation model is obtained by training based on a translation model training method.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (18)

1. A translation model training method is characterized by comprising the following steps:
inputting a source language sentence in a parallel bilingual sentence pair into a translation model to obtain a first predicted target language sentence output by the translation model, and acquiring first predicted probability distribution of the translation model and/or first feature vectors output by hidden layers of the translation model;
inputting the source language sentences added with noise into the translation model to obtain second predicted target language sentences output by the translation model, and acquiring second predicted probability distribution of the translation model and/or second feature vectors output by each hidden layer of the translation model; the source language sentence added with the noise is obtained by performing data enhancement on the source language sentence;
determining a current training loss of the translation model based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, and the first feature vector and the second feature vector and/or the first predicted probability distribution and the second predicted probability distribution;
adjusting parameters of the translation model based on the determined current training loss.
2. The translation model training method according to claim 1, wherein the determining a current training loss of the translation model based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, and the first feature vector and the second feature vector and/or the first predicted probability distribution and the second predicted probability distribution comprises:
determining a first training loss based on the first predicted target language sentence and a target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and a target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence;
determining a second training loss based on the first feature vector and the second feature vector; and/or determining a third training loss based on the first predictive probability distribution and the second predictive probability distribution;
and carrying out weighted summation on the first training loss, the second training loss and/or the third training loss to obtain the current training loss of the translation model.
3. The translation model training method of claim 2, wherein said determining a first training loss based on the first predicted target language sentence and the target language sentence in the parallel bilingual sentence pair and the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence comprises:
determining a first training loss component based on the first predicted target language statement and a target language statement in the parallel bilingual statement pair;
determining a second training loss component based on the second predicted target language sentence and the target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence;
and accumulating the first training loss component and the second training loss component to obtain the first training loss.
4. The translation model training method of claim 2, wherein said determining a second training loss based on the first feature vector and the second feature vector comprises:
respectively determining a third training loss component corresponding to each hidden layer in each hidden layer based on the first feature vector and the second feature vector of each hidden layer;
and accumulating the determined third training loss components of each hidden layer to obtain the second training loss.
5. The translation model training method of any one of claims 1 to 4, wherein the data enhancement of the source language sentence to obtain the noisy source language sentence comprises:
replacing characters in the source language sentence based on a character recognition error comparison table to obtain the source language sentence with noise; wherein the character recognition error comparison table is obtained based on optical character recognition; and/or,
deleting characters in words at the beginning or the end of a sentence in the source language sentence to obtain the source language sentence added with the noise; and/or,
and deleting punctuation marks at the tail of the source language sentence, or adding a punctuation mark at the beginning of the source language sentence to obtain the source language sentence added with the noise.
6. The translation model training method of claim 5, wherein said replacing characters in said source language sentence based on a character recognition error lookup table to obtain said noisy source language sentence, comprises:
counting the number of characters in the source language sentence, and determining the number of characters to be replaced in the source language sentence according to a preset proportion;
determining characters to be replaced in the source language sentence based on the frequency of occurrence of the characters in the source language sentence in the character recognition error comparison table and the determined number of the characters to be replaced;
and acquiring corresponding error characters from the character recognition error comparison table based on the determined characters to be replaced, and replacing the characters in the source language sentence to obtain the source language sentence with noise.
7. The translation model training method of claim 6, wherein the obtaining of corresponding erroneous characters from the character recognition error lookup table based on the determined characters to be replaced and the replacement of the characters in the source language sentence to obtain the noisy source language sentence comprises:
if the determined character to be replaced corresponds to more than two error characters in the character recognition error comparison table, determining the error character for replacement based on the frequency of the more than two corresponding error characters in the character recognition error comparison table;
and acquiring the determined error characters for replacement from the character recognition error comparison table, and replacing the characters in the source language sentence to obtain the source language sentence with noise.
8. The translation model training method of claim 5, wherein said deleting characters in words at the beginning or end of a sentence in said source language sentence to obtain a noisy source language sentence comprises:
determining the number of characters contained in words of sentence heads or sentence tails in the source language sentence;
if the number of characters contained in the words of the sentence head or the sentence tail is consistent with the preset number, determining the number of characters to be deleted in the words of the sentence head or the sentence tail in the source language sentence; wherein the preset number is set according to the language of the source language sentence;
and deleting the characters in the words of the sentence head or the sentence tail in the source language sentence based on the determined number of the characters to be deleted in the words of the sentence head or the sentence tail to obtain the source language sentence added with the noise.
9. The translation model training method according to claim 8, wherein the determining of the number of characters to be deleted in the word at the beginning or end of the sentence in the source language sentence comprises:
if the language of the source language sentence is English, determining the number of characters to be deleted based on a Gaussian distribution;
and if the language of the source language sentence is Chinese, determining the number of characters to be deleted to be one character.
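Claims 8 and 9 together determine how many characters to drop from a boundary word. A minimal sketch, assuming a preset minimum word length of 4 and Gaussian parameters of mean 1.5 and standard deviation 1.0; the patent fixes none of these values, so they are stand-ins:

```python
import random

def chars_to_delete(word, language, min_len=4, rng=random):
    """Number of characters to delete from a sentence-initial or
    sentence-final word. Words shorter than `min_len` (a stand-in for the
    patent's language-dependent preset number) are left untouched.
    English words lose a Gaussian-sampled number of characters; Chinese
    words lose exactly one."""
    if len(word) < min_len:
        return 0
    if language == "zh":
        return 1
    # Gaussian draw, rounded and clipped so the word is never emptied.
    n = int(round(rng.gauss(1.5, 1.0)))
    return max(1, min(n, len(word) - 1))

def clip_word_end(word, language, rng=random):
    """Simulate OCR truncation by dropping characters from the word's tail."""
    n = chars_to_delete(word, language, rng=rng)
    return word[: len(word) - n] if n else word
```

The clipping bounds ensure at least one character is removed once a word qualifies, and at least one always survives.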
10. The method for training translation models according to any one of claims 1 to 4 or 6 to 9, wherein before inputting the source language sentence in the parallel bilingual sentence pair into the translation model to obtain the first predicted target language sentence output by the translation model, and obtaining the first predicted probability distribution of the translation model and/or the first feature vector output by each hidden layer of the translation model, the method further comprises:
inputting a source language text sample into the translation model to obtain a predicted target language text output by the translation model; wherein the source language text sample comprises a source language sentence in the parallel bilingual sentence pair and the noisy source language sentence;
determining a current training loss of the translation model based on the predicted target language text and a real target language text of the source language text sample; wherein the real target language text comprises a target language sentence in the parallel bilingual sentence pair;
adjusting parameters of the translation model based on the determined current training loss.
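Claim 10 describes a warm-up stage that trains on both the clean and the noisy source against the same reference translation, before the consistency-based training of the earlier claims. A toy sketch, assuming the model returns one probability distribution per target token (the `model` interface and averaging are illustrative assumptions):

```python
import math

def cross_entropy(pred_dists, target_ids):
    """Token-level negative log-likelihood of the reference sentence
    under the model's predicted per-token distributions."""
    return -sum(math.log(dist[t])
                for dist, t in zip(pred_dists, target_ids)) / len(target_ids)

def pretrain_loss(model, samples):
    """Average loss over (clean source, noisy source, reference ids)
    triples: both versions of the source are scored against the same
    reference target, as in claim 10."""
    total = 0.0
    for src, noisy_src, tgt_ids in samples:
        total += cross_entropy(model(src), tgt_ids)
        total += cross_entropy(model(noisy_src), tgt_ids)
    return total / (2 * len(samples))
```

A model that puts full probability on every reference token incurs zero loss, which is the sanity check for the sign conventions above.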
11. The translation model training method according to claim 10, wherein the source language text sample further includes a clause fragment of the source language sentence in the parallel bilingual sentence pair, and the real target language text further includes a clause fragment of the target language sentence in the parallel bilingual sentence pair corresponding to the clause fragment of the source language sentence.
12. The translation model training method according to claim 11, wherein the step of obtaining the clause fragment of the source language sentence in the parallel bilingual sentence pair and the clause fragment of the target language sentence in the parallel bilingual sentence pair corresponding to the clause fragment of the source language sentence comprises:
performing word alignment on a source language sentence and a target language sentence in the parallel bilingual sentence pair;
and extracting, based on the word-aligned source language sentence and target language sentence, clause segments of the source language sentence and the corresponding clause segments in the target language sentence.
13. The translation model training method according to claim 12, wherein the extracting of the clause fragments of the source language sentence and the clause fragments corresponding to the clause fragments of the source language sentence in the target language sentence based on the word-aligned source language sentence and the target language sentence comprises:
extracting short clause segments of the source language sentence and short clause segments of the target language sentence based on the word-aligned source language sentence and target language sentence; wherein no punctuation mark appears within a short clause segment;
and extracting long clause segments of the source language sentence and long clause segments of the target language sentence based on the word-aligned source language sentence and target language sentence; wherein a long clause segment contains at least one punctuation mark within it.
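Claims 12 and 13 extract parallel clause fragments from word-aligned sentence pairs. A sketch of the short-fragment case, assuming the alignment is given as a source-index to target-index mapping (real aligners such as GIZA++ or fast_align emit link pairs that reduce to this form):

```python
def extract_clause_pairs(src_tokens, tgt_tokens, alignment, src_punct=",;"):
    """Split the source on punctuation into short clauses and map each
    clause to the span of target tokens its words align to."""
    clauses, start = [], 0
    for i, tok in enumerate(src_tokens + ["<eos>"]):
        if tok in src_punct or tok == "<eos>":
            if i > start:
                tgt_idx = [alignment[j] for j in range(start, i) if j in alignment]
                if tgt_idx:
                    # Take the contiguous target span covering all aligned tokens.
                    clauses.append((src_tokens[start:i],
                                    tgt_tokens[min(tgt_idx): max(tgt_idx) + 1]))
            start = i + 1
    return clauses
```

Long fragments would be obtained the same way but splitting on sentence-final punctuation only, so that internal commas survive inside a fragment.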
14. A method of translation, comprising:
acquiring an image of a text to be translated, and performing character recognition on the acquired image of the text to be translated to obtain a source language text;
translating the source language text through a translation model to obtain a target language text; wherein the translation model is trained based on the translation model training method according to any one of claims 1 to 13.
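The pipeline of claim 14 is simply character recognition followed by translation. In the sketch below, `ocr` and `translator` are assumed callables standing in for whichever recognizer and trained translation model are used; no specific library API is implied:

```python
def translate_image(image, ocr, translator):
    """Claim 14's pipeline: recognize the source language text in the
    image, then translate it with the trained model."""
    source_text = ocr(image)
    return translator(source_text)
```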
15. A translation model training apparatus, comprising:
the first prediction module is used for inputting a source language sentence in a parallel bilingual sentence pair into a translation model to obtain a first predicted target language sentence output by the translation model, and acquiring a first predicted probability distribution of the translation model and/or first feature vectors output by the hidden layers of the translation model;
the second prediction module is used for inputting the noisy source language sentence into the translation model to obtain a second predicted target language sentence output by the translation model, and acquiring a second predicted probability distribution of the translation model and/or second feature vectors output by the hidden layers of the translation model; wherein the noisy source language sentence is obtained by performing data augmentation on the source language sentence;
a loss calculation module for determining a current training loss of the translation model based on the first predicted target language sentence and a target language sentence in the parallel bilingual sentence pair, the second predicted target language sentence and a target language sentence in the parallel bilingual sentence pair corresponding to the noisy source language sentence, and the first feature vector and the second feature vector and/or the first predicted probability distribution and the second predicted probability distribution;
a parameter adjustment module to adjust parameters of the translation model based on the determined current training loss.
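The loss-calculation module combines the usual translation losses with terms pulling the noisy-input pass toward the clean-input pass. A sketch of those consistency terms only, using a symmetric KL divergence on output distributions and a squared distance on hidden-layer features; the weights `alpha` and `beta` and the exact distance measures are assumptions, as the claims do not fix them:

```python
import math

def kl_div(p, q, eps=1e-9):
    """KL(p || q) between two discrete distributions over the same vocabulary;
    eps guards against zero probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(clean_probs, noisy_probs, clean_feats, noisy_feats,
                     alpha=1.0, beta=1.0):
    """Distribution term: symmetric KL between the per-token output
    distributions of the clean and noisy passes. Feature term: squared
    distance between corresponding hidden-layer feature vectors."""
    dist_term = sum(0.5 * (kl_div(p, q) + kl_div(q, p))
                    for p, q in zip(clean_probs, noisy_probs))
    feat_term = sum((a - b) ** 2
                    for fa, fb in zip(clean_feats, noisy_feats)
                    for a, b in zip(fa, fb))
    return alpha * dist_term + beta * feat_term
```

When the noisy pass reproduces the clean pass exactly, both terms vanish, which is the behavior the training objective rewards.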
16. A translation apparatus, comprising:
the character recognition module is used for acquiring an image of a text to be translated and performing character recognition on the acquired image of the text to be translated to obtain a source language text;
the machine translation module is used for translating the source language text through a translation model to obtain a target language text; wherein the translation model is trained based on the translation model training method according to any one of claims 1 to 13.
17. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the translation model training method according to any one of claims 1 to 13 or of the translation method according to claim 14.
18. A non-transitory computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the translation model training method according to any one of claims 1 to 13 or of the translation method according to claim 14.
CN202111250312.4A 2021-10-26 2021-10-26 Translation model training method, translation method and translation device Active CN114201975B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111250312.4A CN114201975B (en) 2021-10-26 2021-10-26 Translation model training method, translation method and translation device

Publications (2)

Publication Number Publication Date
CN114201975A true CN114201975A (en) 2022-03-18
CN114201975B CN114201975B (en) 2024-04-12

Family ID: 80646370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111250312.4A Active CN114201975B (en) 2021-10-26 2021-10-26 Translation model training method, translation method and translation device

Country Status (1)

Country Link
CN (1) CN114201975B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114611532A (en) * 2022-05-06 2022-06-10 北京百度网讯科技有限公司 Language model training method and device, and target translation error detection method and device
CN116167388A (en) * 2022-12-27 2023-05-26 无锡捷通数智科技有限公司 Training method, device, equipment and storage medium for special word translation model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011180941A (en) * 2010-03-03 2011-09-15 National Institute Of Information & Communication Technology Phrase table generator and computer program therefor
CN108829685A (en) * 2018-05-07 2018-11-16 内蒙古工业大学 A kind of illiteracy Chinese inter-translation method based on single language training
CN110334361A (en) * 2019-07-12 2019-10-15 电子科技大学 A kind of neural machine translation method towards rare foreign languages language
CN110874537A (en) * 2018-08-31 2020-03-10 阿里巴巴集团控股有限公司 Generation method of multi-language translation model, translation method and translation equipment
CN111460838A (en) * 2020-04-23 2020-07-28 腾讯科技(深圳)有限公司 Pre-training method and device of intelligent translation model and storage medium
US20210157991A1 (en) * 2019-11-25 2021-05-27 National Central University Computing device and method for generating machine translation model and machine-translation device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YAO Liang; HONG Yu; LIU Hao; LIU Le; YAO Jianmin: "Domain Adaptation for Translation Models Based on Semantic Distribution Similarity", Journal of Shandong University (Natural Science), no. 07, 31 May 2016 (2016-05-31) *

Also Published As

Publication number Publication date
CN114201975B (en) 2024-04-12

Similar Documents

Publication Publication Date Title
TW201918913A (en) Machine processing and text correction method and device, computing equipment and storage media
CN114201975B (en) Translation model training method, translation method and translation device
CN110428820B (en) Chinese and English mixed speech recognition method and device
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN109858029B (en) Data preprocessing method for improving overall quality of corpus
CN113408535B (en) OCR error correction method based on Chinese character level features and language model
WO2022088570A1 (en) Method and apparatus for post-editing of translation, electronic device, and storage medium
CN112257437A (en) Voice recognition error correction method and device, electronic equipment and storage medium
CN112329447A (en) Training method of Chinese error correction model, and Chinese error correction method and device
CN113033185B (en) Standard text error correction method and device, electronic equipment and storage medium
CN112329482A (en) Machine translation method, device, electronic equipment and readable storage medium
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
CN107818173B (en) Vector space model-based Chinese false comment filtering method
CN112686030B (en) Grammar error correction method, grammar error correction device, electronic equipment and storage medium
US20220292587A1 (en) Method and apparatus for displaying product review information, electronic device and storage medium
CN109657244B (en) English long sentence automatic segmentation method and system
CN111144134A (en) Translation engine automatic evaluation system based on OpenKiwi
CN115525749A (en) Voice question-answering method, device, electronic equipment and storage medium
CN115631502A (en) Character recognition method, character recognition device, model training method, electronic device and medium
CN114519358A (en) Translation quality evaluation method and device, electronic equipment and storage medium
CN111310457B (en) Word mismatching recognition method and device, electronic equipment and storage medium
KR102562692B1 (en) System and method for providing sentence punctuation
CN112836528A (en) Machine translation post-editing method and system
CN113988047A (en) Corpus screening method and apparatus
CN113011149A (en) Text error correction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230516

Address after: 230026 No. 96, Jinzhai Road, Hefei, Anhui

Applicant after: University of Science and Technology of China

Applicant after: IFLYTEK Co.,Ltd.

Address before: 230088 666 Wangjiang West Road, Hefei hi tech Development Zone, Anhui

Applicant before: IFLYTEK Co.,Ltd.

GR01 Patent grant