CN114444521A - Machine translation method, device, equipment and storage medium - Google Patents

Machine translation method, device, equipment and storage medium

Info

Publication number
CN114444521A
CN114444521A (Application No. CN202210026585.9A)
Authority
CN
China
Prior art keywords
text
text sequence
source
sequence
corpus
Prior art date
Legal status
Pending
Application number
CN202210026585.9A
Other languages
Chinese (zh)
Inventor
李响
穆畅
Current Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd and Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202210026585.9A
Publication of CN114444521A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems

Abstract

The disclosure relates to a machine translation method, apparatus, device, and storage medium. The method includes: acquiring a source text sequence of a current sentence to be translated; inputting the source text sequence and historical sentences into a pre-trained detection model for detection to obtain a processed source text sequence; and inputting the processed source text sequence into a pre-trained translation model to obtain a target text sequence corresponding to the current sentence to be translated, where the target text sequence is the translation result corresponding to the source text sequence. The method and device can reduce the influence of factors such as a speaker's accent on the speech recognition result, improve the accuracy of the speech recognition result, and in turn improve the accuracy of the subsequent translation of that result, thereby improving the quality of machine translation and meeting users' demands.

Description

Machine translation method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of neural machine translation technologies, and in particular, to a machine translation method, apparatus, device, and storage medium.
Background
At present, in machine simultaneous interpretation, the speech of a speaker is usually converted into source text in real time using speech recognition technology, the source text is then machine-translated into target text, and finally the target text is displayed on a screen or played back through speech synthesis.
However, in the above scheme, the speech recognition process is easily affected by factors such as the speaker's accent, so errors exist in the recognized source text. This in turn degrades the accuracy of the target text obtained by machine translation based on that source text, and the demand for high-quality machine translation cannot be met.
Disclosure of Invention
To overcome the problems in the related art, embodiments of the present disclosure provide a machine translation method, apparatus, device, and storage medium to solve the defects in the related art.
According to a first aspect of embodiments of the present disclosure, there is provided a machine translation method, the method including:
acquiring a source text sequence of a current sentence to be translated;
inputting the source text sequence and the historical sentences into a pre-trained detection model for detection to obtain a processed source text sequence, wherein the historical sentences comprise sentences translated before the current sentence to be translated, and the detection model is used for detecting whether texts in the source text sequence are wrong texts;
and inputting the processed source text sequence into a pre-trained translation model to obtain a target text sequence corresponding to the current sentence to be translated, wherein the target text sequence is a translation result corresponding to the source text sequence.
In some embodiments, the inputting the source text sequence and the historical sentences into a pre-trained detection model for detection includes:
determining pronunciation information of each text in the source text sequence, and forming the pronunciation information of each text into a pronunciation information sequence;
inputting the pronunciation information sequence and the target text sequence of the historical sentence into the detection model to obtain the probability that each text in the source text sequence predicted by the detection model is a correct text;
and confirming the text with the probability less than or equal to a set probability threshold as an error text.
In some embodiments, the method further comprises:
and replacing the error text in the source text sequence with corresponding pronunciation information to obtain a processed source text sequence.
In some embodiments, the inputting the pronunciation information sequence and the target text sequence of the historical sentence into the detection model to obtain a probability that each text in the source text sequence predicted by the detection model is a correct text includes:
determining a target vector for each text in the pronunciation information sequence and the target text sequence of the historical sentence based on a detection model;
and outputting, based on the target vector and an encoder in the detection model, the probability that each text in the source text sequence is a correct text, wherein the target vector includes a word-encoding representation sub-vector, a block-encoding representation sub-vector, and a position-encoding representation sub-vector of the text.
In some embodiments, the method further comprises training the detection model in advance based on:
obtaining sample source corpora of a sample sentence to be translated;
determining pronunciation information of each sample source text in the sample source corpus, and forming the pronunciation information of each sample source text into a sample pronunciation corpus;
acquiring a sample target text corresponding to each pronunciation information in the sample pronunciation corpus, and forming the sample target text corresponding to each pronunciation information into a sample target corpus;
training the detection model based on the sample pronunciation corpus, the sample target corpus and a sample historical target corpus of a sample historical sentence, wherein the sample historical sentence comprises a translated sentence before the sample sentence to be translated.
In some embodiments, the method further comprises training the translation model in advance based on:
acquiring first sample data, wherein the first sample data comprises a source corpus and a corresponding target corpus, the source corpus is a corpus of a text to be translated, and the target corpus is a translation result corresponding to the text to be translated;
replacing texts randomly selected at a preset proportion from the source corpus with corresponding pronunciation information to obtain a mixed corpus;
forming second sample data by using the mixed corpus and the target corpus, wherein the mixed corpus is used as a source corpus of the second sample data, and the target corpus is used as a target corpus of the second sample data;
and taking the second sample data and the first sample data as training data to train the translation model.
In some embodiments, the current sentence to be translated comprises a sentence input by speech, and the source text sequence comprises a text sequence recognized from the speech-input sentence by a speech recognition technology.
According to a second aspect of embodiments of the present disclosure, there is provided a machine translation apparatus, the apparatus comprising:
the source text sequence acquisition module is used for acquiring a source text sequence of a current sentence to be translated;
a processed text sequence obtaining module, configured to input the source text sequence and a historical sentence into a pre-trained detection model for detection, so as to obtain a processed source text sequence, where the historical sentence includes a sentence that has been translated before the current sentence to be translated, and the detection model is used to detect whether a text in the source text sequence is an error text;
and the target text sequence acquisition module is used for inputting the processed source text sequence into a pre-trained translation model to obtain a target text sequence corresponding to the current sentence to be translated, and the target text sequence is a translation result corresponding to the source text sequence.
In some embodiments, the process text sequence obtaining module includes:
the pronunciation information sequence determining unit is used for determining the pronunciation information of each text in the source text sequence and forming the pronunciation information of each text into a pronunciation information sequence;
a correct text probability obtaining unit, configured to input the pronunciation information sequence and the target text sequence of the historical sentence into the detection model, and obtain a probability that each text in the source text sequence predicted by the detection model is a correct text;
and the error text confirming unit is used for confirming the text with the probability less than or equal to the set probability threshold as the error text.
In some embodiments, the processing text sequence obtaining module further comprises:
and the processed text sequence acquisition unit is used for replacing the error text in the source text sequence with corresponding pronunciation information to obtain a processed source text sequence.
In some embodiments, the correct text probability obtaining unit is further configured to:
determining a target vector for each text in the pronunciation information sequence and the target text sequence of the historical sentence based on a detection model;
and outputting, based on the target vector and an encoder in the detection model, the probability that each text in the source text sequence is a correct text, wherein the target vector includes a word-encoding representation sub-vector, a block-encoding representation sub-vector, and a position-encoding representation sub-vector of the text.
In some embodiments, the apparatus further comprises a detection model training module;
the detection model training module comprises:
the sample source corpus acquiring unit is used for acquiring sample source corpora of the sample sentence to be translated;
the system comprises a sample pronunciation corpus forming unit, a pronunciation analysis unit and a pronunciation analysis unit, wherein the sample pronunciation corpus forming unit is used for determining pronunciation information of each sample source text in the sample source corpus and forming the pronunciation information of each sample source text into a sample pronunciation corpus;
the sample target corpus forming unit is used for acquiring a sample target text corresponding to each pronunciation information in the sample pronunciation corpus and forming the sample target text corresponding to each pronunciation information into a sample target corpus;
and the detection model training unit is used for training the detection model based on the sample pronunciation corpus, the sample target corpus and a sample historical target corpus of a sample historical statement, wherein the sample historical statement comprises a translated statement before the sample to-be-translated statement.
In some embodiments, the apparatus further comprises a translation model training module;
the translation model training module comprises:
a first data obtaining unit, configured to obtain first sample data, where the first sample data includes a source corpus and a corresponding target corpus, the source corpus is a corpus of a text to be translated, and the target corpus is a translation result corresponding to the text to be translated;
the mixed corpus acquiring unit is used for replacing texts with preset proportions, which are randomly selected from the source corpus, with corresponding pronunciation information to obtain a mixed corpus;
a second data obtaining unit, configured to combine the mixed corpus and the target corpus into a second sample data, where the mixed corpus is used as a source corpus of the second sample data, and the target corpus is used as a target corpus of the second sample data;
and the translation model training unit is used for taking the second sample data and the first sample data as training data to train the translation model.
In some embodiments, the current sentence to be translated comprises a sentence input by speech, and the source text sequence comprises a text sequence recognized from the speech-input sentence by a speech recognition technology.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic apparatus, the apparatus comprising:
a processor and a memory for storing a computer program;
wherein the processor is configured to, when executing the computer program, implement:
acquiring a source text sequence of a current sentence to be translated;
inputting the source text sequence and the historical sentences into a pre-trained detection model for detection to obtain a processed source text sequence, wherein the historical sentences comprise sentences translated before the current sentence to be translated, and the detection model is used for detecting whether texts in the source text sequence are wrong texts;
and inputting the processed source text sequence into a pre-trained translation model to obtain a target text sequence corresponding to the current sentence to be translated, wherein the target text sequence is a translation result corresponding to the source text sequence.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements:
acquiring a source text sequence of a current sentence to be translated;
inputting the source text sequence and the historical sentences into a pre-trained detection model for detection to obtain a processed source text sequence, wherein the historical sentences comprise sentences translated before the current sentence to be translated, and the detection model is used for detecting whether texts in the source text sequence are wrong texts;
and inputting the processed source text sequence into a pre-trained translation model to obtain a target text sequence corresponding to the current sentence to be translated, wherein the target text sequence is a translation result corresponding to the source text sequence.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
the method comprises the steps of obtaining a source text sequence of a current sentence to be translated, inputting the source text sequence and historical sentences into a pre-trained detection model for detection to obtain a processed source text sequence, and further inputting the processed source text sequence into the pre-trained translation model to obtain a target text sequence corresponding to the current sentence to be translated.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram illustrating a method of machine translation in accordance with an exemplary embodiment of the present disclosure;
FIG. 2A is a flow diagram illustrating how a processed source text sequence is obtained according to an exemplary embodiment of the present disclosure;
FIG. 2B is a flow chart illustrating how the probability that each text in the source text sequence predicted by the detection model is the correct text is derived according to an exemplary embodiment of the present disclosure;
FIG. 2C is a schematic structural diagram of the detection model according to an exemplary embodiment of the present disclosure;
FIG. 3 is a flow chart illustrating how the detection model is trained according to an exemplary embodiment of the present disclosure;
FIG. 4 is a flowchart illustrating how the translation model is trained in accordance with an exemplary embodiment of the present disclosure;
FIG. 5 is a block diagram illustrating a machine translation device according to an exemplary embodiment of the present disclosure;
FIG. 6 is a block diagram illustrating yet another machine translation device in accordance with an exemplary embodiment of the present disclosure;
fig. 7 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
At present, speech translation mainly adopts a cascade of speech recognition followed by machine translation. For example, in a speech translation scenario with Chinese as the source language, homophone errors may occur during speech recognition, including errors in which both pinyin and tone match the intended character, and errors in which the pinyin matches but the tone does not. Such errors degrade the accuracy of the subsequent machine translation performed on the speech recognition result.
In view of the above, the present disclosure provides a machine translation method. In the method provided by the disclosure, a source text sequence of a current sentence to be translated is obtained; inputting the source text sequence and the historical sentences into a pre-trained detection model for detection to obtain a processed source text sequence, wherein the historical sentences comprise sentences translated before the current sentence to be translated, and the detection model is used for detecting whether texts in the source text sequence are wrong texts; and inputting the processed source text sequence into a pre-trained translation model to obtain a target text sequence corresponding to the current sentence to be translated, wherein the target text sequence is a translation result corresponding to the source text sequence.
Inputting the source text sequence and the historical sentences into the pre-trained detection model for detection to obtain a processed source text sequence includes: determining pronunciation information for each text in the source text sequence and composing the pronunciation information of each text into a pronunciation information sequence; inputting the pronunciation information sequence and the target text sequence of the historical sentences into the detection model to obtain the probability, as predicted by the detection model, that each text in the source text sequence is a correct text; confirming texts whose probability is less than or equal to a set probability threshold as erroneous texts; and replacing the erroneous texts in the source text sequence with corresponding pronunciation information to obtain the processed source text sequence. Obtaining the probability that each text in the source text sequence is a correct text includes: determining a target vector for each text in the pronunciation information sequence and in the target text sequence of the historical sentences based on the detection model; and outputting, based on the target vector and an encoder in the detection model, the probability that each text in the source text sequence is a correct text, where the target vector includes a word-encoding representation sub-vector, a block-encoding representation sub-vector, and a position-encoding representation sub-vector of the text. When the translation model is trained, texts at a preset proportion in the source corpus are replaced with corresponding pronunciation information to obtain a mixed corpus; the source corpus, the mixed corpus, and the target corpus are composed into training data, and the translation model is trained on this training data.
The method of the present disclosure may be applied wherever homophone errors occur in a source-language sequence. Such homophone errors may arise from two sources: first, when speech input by a user is recognized using speech recognition technology, homophone errors may be produced in the recognition result, i.e., homophone errors arising in a simultaneous-interpretation application scenario; second, when a user inputs a source-language sequence to be translated through, for example, a keyboard or a handwriting pad, homophone errors may be typed in, i.e., homophone errors arising in an online-translation scenario.
The present disclosure provides methods for translating from a source language that has both textual information and corresponding pronunciation information (e.g., Chinese, with Chinese characters and their corresponding pinyin) into other target languages such as English, French, or German.
The method provided by the disclosure can reduce the influence of factors such as speaker accent on the voice recognition result, improve the accuracy of the voice recognition result, and further improve the accuracy of subsequent translation of the voice recognition result, thereby improving the quality of machine translation and meeting the requirements of users.
FIG. 1 is a flow diagram illustrating a machine translation method according to an exemplary embodiment. The method of this embodiment can be applied to terminal devices supporting machine translation functions (such as intelligent translators, wearable devices, smartphones, and tablet computers).
As shown in fig. 1, the method comprises the following steps S101-S103:
in step S101, a source text sequence of a current sentence to be translated is acquired.
In this embodiment, the terminal device may obtain the source text sequence of the current sentence to be translated. The source text sequence may be a text sequence obtained by recognizing, with a speech recognition technology, the current sentence to be translated as spoken by the user. Alternatively, in other embodiments, the source text sequence may be the text sequence of a current sentence to be translated that the user inputs through a keyboard or a handwriting pad. Taking Chinese as the source language to be translated, the source text sequence may include a Chinese character sequence recognized by a speech recognition technology or input by the user through a keyboard or handwriting pad.
In step S102, the source text sequence and the historical sentences are input into a pre-trained detection model for detection, so as to obtain a processed source text sequence.
In this embodiment, the historical sentences may include sentences that have been translated before the current sentence to be translated, such as the previous sentence before the current sentence to be translated (e.g., the previous sentence or sentences of the currently spoken sentence to be translated, or the previous sentence or sentences of the sentence to be translated that are input through a keyboard or a tablet, etc.).
The detection model is used to detect whether each text in the source text sequence is erroneous, so that erroneous text can subsequently be replaced with its corresponding pronunciation information.
In step S103, the processed source text sequence is input into a pre-trained translation model, so as to obtain a target text sequence corresponding to the current sentence to be translated.
In this embodiment, the source text sequence and the corresponding historical sentences are processed by the trained detection model so that homophone error texts identified in the source text sequence can be replaced with their corresponding pronunciation information, yielding the processed source text sequence. The processed source text sequence is then input into the trained translation model to obtain the target text sequence of the current sentence to be translated, i.e., the translation result corresponding to the source text sequence of the current sentence to be translated.
As can be seen from the above description, the method of this embodiment obtains the source text sequence of the current sentence to be translated, inputs the source text sequence and the historical sentences into a pre-trained detection model to obtain a processed source text sequence, and then inputs the processed source text sequence into a pre-trained translation model to obtain the target text sequence corresponding to the current sentence to be translated. Because the source text sequence and the corresponding historical sentences are first passed through a detection model that detects erroneous text, the influence of factors such as the speaker's accent on the speech recognition result can be reduced and the accuracy of the speech recognition result improved. This in turn improves the accuracy of the subsequent translation of the speech recognition result, improving the quality of machine translation and meeting users' demands.
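For illustration only, the following is a minimal end-to-end sketch of steps S101-S103, assuming the pronunciation mapping, detection model, and translation model are exposed as simple callables; none of these names or signatures come from the patent itself.

```python
def translate_sentence(source_texts, history_target_texts, to_pinyin,
                       detector, translator, threshold=0.95):
    """Sketch of steps S101-S103: detect likely homophone errors in the
    recognized source texts, replace them with pronunciation information,
    and translate the processed (possibly mixed text/pinyin) sequence.

    to_pinyin, detector, and translator are assumed stand-ins for the
    pronunciation mapping table, the detection model, and the translation
    model described in this disclosure.
    """
    pinyins = [to_pinyin(t) for t in source_texts]
    # Detection model predicts, per position, the probability the text is correct.
    probs = detector(pinyins, history_target_texts)
    processed = [py if p <= threshold else t
                 for t, py, p in zip(source_texts, pinyins, probs)]
    return translator(processed)
```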
FIG. 2A is a flow diagram illustrating how a processed source text sequence is obtained according to an exemplary embodiment of the present disclosure; the present embodiment is exemplified by how to obtain the processed source text sequence on the basis of the above embodiments. As shown in fig. 2A, inputting the source text sequence and the historical sentence into a pre-trained detection model for detection in the step S102 to obtain a processed source text sequence may include the following steps S201 to S204:
in step S201, pronunciation information of each text in the source text sequence is determined, and the pronunciation information of each text is composed into a pronunciation information sequence.
In this embodiment, after the source text sequence is obtained, the pronunciation information of each text in the source text sequence may be determined based on a pre-constructed correspondence between the text and the pronunciation information, and the pronunciation information of each text may be composed into a pronunciation information sequence.
For example, when the source text sequence is a Chinese character sequence, the correspondence between text and pronunciation information may include a Chinese pinyin mapping table. If the source text sequence is "I like the future" (我喜欢未来), the corresponding pronunciation information sequence may be "wo xi huan wei lai".
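A minimal sketch of this step, assuming a toy character-to-pinyin mapping; a real system might use a full pinyin dictionary (for instance the pypinyin package, which is not named in the patent).

```python
# Hypothetical fragment of the pre-constructed text-to-pronunciation
# correspondence (a Chinese pinyin mapping table).
PINYIN_TABLE = {"我": "wo", "喜": "xi", "欢": "huan", "未": "wei", "来": "lai"}

def to_pronunciation_sequence(source_text_sequence):
    """Step S201: compose the pronunciation information of each text
    into a pronunciation information sequence."""
    return [PINYIN_TABLE.get(ch, "<unk>") for ch in source_text_sequence]

print(to_pronunciation_sequence("我喜欢未来"))
# ['wo', 'xi', 'huan', 'wei', 'lai']
```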
In step S202, the pronunciation information sequence and the target text sequence of the historical sentence are input to the detection model, and the probability that each text in the source text sequence predicted by the detection model is a correct text is obtained.
In this embodiment, after obtaining the pronunciation information sequence of the source text sequence, the pronunciation information sequence and the target text sequence of the historical sentence may be input to the detection model, so as to obtain the probability that each text in the source text sequence predicted by the detection model is a correct text.
For example, fig. 2B is a flowchart illustrating how to obtain a probability that each text in the source text sequence predicted by the detection model is a correct text according to an exemplary embodiment of the present disclosure, and fig. 2C is a schematic structural diagram of the detection model according to an exemplary embodiment of the present disclosure. As shown in fig. 2B, the step S202 may further include the following steps S221 to S222:
in step S221, a target vector of each text in the pronunciation information sequence and the target text sequence of the history sentence is determined based on a detection model.
In step S222, the probability that each text in the source text sequence is a correct text is output based on the target vector and the encoder in the detection model.
Wherein the target vector comprises a word encoding representation sub-vector, a block encoding representation sub-vector, and a position encoding representation sub-vector of the text.
The source text sequence of the current sentence to be translated is still "I like the future" (我喜欢未来), and the corresponding pronunciation information sequence is "wo xi huan wei lai". The target text sequence of the historical sentence translated before the current sentence is "Which new energy vehicle do you like". As shown in fig. 2C, the left side of the detection model of this embodiment receives the target text sequence of the historical sentence together with the corresponding target vectors (the "□ + □ + □" below each Chinese character in fig. 2C), and the right side receives the source text sequence currently to be detected, "I like the future", as its pronunciation information sequence "wo xi huan wei lai" together with the corresponding target vectors (the "□ + □ + □" below each pinyin token in fig. 2C). Specifically, to determine the target vector of each text, a block-encoding representation sub-vector <context> and a word-encoding representation sub-vector <phone> may be added on top of the position-encoding representation sub-vector each text already has. The block-encoding sub-vector distinguishes whether the text comes from the historical sentence or from the current sentence to be translated, and the word-encoding sub-vector distinguishes the pronunciation information of the text. The target vector of each text, i.e., the representation vector finally input to the Transformer encoder of the detection model, is thus the sum of the corresponding word-encoding representation sub-vector (the first □ below the text), block-encoding representation sub-vector (the second □, e.g., EA or EB), and position-encoding representation sub-vector (the third □, e.g., P1 or P2).
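A minimal sketch of this target-vector construction, assuming learned embedding tables in PyTorch (the patent does not name a framework); the vocabulary size, number of blocks, maximum length, and model dimension are illustrative assumptions, as the patent specifies only that the three sub-vectors are summed.

```python
import torch
import torch.nn as nn

class TargetVectorEmbedding(nn.Module):
    """Target vector = word-encoding + block-encoding + position-encoding
    sub-vectors, summed per text (sketch; hyperparameters are assumptions)."""

    def __init__(self, vocab_size=8000, num_blocks=2, max_len=512, d_model=256):
        super().__init__()
        self.word = nn.Embedding(vocab_size, d_model)   # text or pinyin token
        self.block = nn.Embedding(num_blocks, d_model)  # history (EA) vs. current (EB)
        self.pos = nn.Embedding(max_len, d_model)       # position (P1, P2, ...)

    def forward(self, token_ids, block_ids):
        # token_ids, block_ids: (batch, seq_len) integer tensors
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.word(token_ids) + self.block(block_ids) + self.pos(positions)
```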
In this embodiment, the architecture of the encoder may adopt the standard Transformer encoder architecture from the related art, which may include N encoder layers. Unlike a standard Transformer encoder, in this embodiment only the encoder outputs corresponding to the pronunciation information sequence of the source text sequence of the current sentence to be translated ("wo xi huan wei lai" above) are used to predict the probability that the original text recognized for each piece of pronunciation information is correct (i.e., the probability that each text in the source text sequence is a correct text); the target text sequence of the historical sentence does not participate in the final prediction and only provides auxiliary information through the bottom N-1 Transformer encoder layers.
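A minimal sketch of this prediction head, assuming PyTorch's stock Transformer encoder; reading the correctness probability only at the current sentence's pinyin positions follows the description above, while all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class HomophoneDetector(nn.Module):
    """Sketch of the detection model: an N-layer Transformer encoder whose
    outputs are read only at the pinyin positions of the current sentence;
    history tokens still contribute context through self-attention."""

    def __init__(self, d_model=256, nhead=4, num_layers=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.correct_prob = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, target_vectors, pinyin_mask):
        # target_vectors: (batch, seq, d_model) summed sub-vectors
        # pinyin_mask: (batch, seq) bool, True at current-sentence pinyin positions
        hidden = self.encoder(target_vectors)
        probs = self.correct_prob(hidden).squeeze(-1)   # (batch, seq)
        return probs.masked_fill(~pinyin_mask, 1.0)     # history positions unused
```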
In step S203, the text with the probability less than or equal to the set probability threshold is determined as an error text.
In step S204, the error text in the source text sequence is replaced with corresponding pronunciation information, so as to obtain a processed source text sequence.
In this embodiment, after the pronunciation information sequence and the target text sequence of the historical sentence are input into the detection model to obtain the probability that each text in the source text sequence is a correct text, any text whose probability is less than or equal to the set probability threshold is determined to be erroneous, and the erroneous texts in the source text sequence can then be replaced with their corresponding pronunciation information to obtain the processed source text sequence. The pronunciation information is determined as in step S201, which is not repeated here.
In summary, in this embodiment the source text sequence of the current sentence to be translated is converted into corresponding pronunciation information, and the detection model then predicts, for each piece of pronunciation information, the probability that its corresponding text is a correct recognition result. A text whose probability is less than or equal to the set threshold is marked as erroneous, and its pronunciation information is retained in the sequence in place of the text; a text whose probability is greater than the set threshold is confirmed as correct, and the recognized text itself is retained. For example, if the probability that the text corresponding to "wei" in "wo xi huan wei lai" is a correct recognition result is less than or equal to 0.95, that text is marked as erroneous and its pronunciation information "wei" is retained, while the remaining texts are confirmed as correct and retained. The sequence to be translated that the top-level Transformer encoder finally passes to the downstream translation model can therefore be a mixed sequence of texts and pronunciation information, such as "I like wei coming" (我喜欢wei来).
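A minimal sketch of this replacement logic, assuming per-text correctness probabilities are already available from the detection model; the 0.95 threshold follows the example above.

```python
def build_mixed_sequence(texts, pinyins, probs, threshold=0.95):
    """Steps S203-S204 (sketch): texts whose correctness probability is at
    or below the threshold are replaced by their pronunciation information."""
    return [py if p <= threshold else t
            for t, py, p in zip(texts, pinyins, probs)]

mixed = build_mixed_sequence(
    ["我", "喜", "欢", "未", "来"],
    ["wo", "xi", "huan", "wei", "lai"],
    [0.99, 0.99, 0.99, 0.40, 0.99],   # illustrative detector outputs
)
print(mixed)  # ['我', '喜', '欢', 'wei', '来'] -> "I like wei coming"
```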
FIG. 3 is a flow chart illustrating how the detection model is trained according to an exemplary embodiment of the present disclosure; the present embodiment is exemplified by how to train the detection model based on the above embodiments. As shown in fig. 3, the present embodiment further includes training the detection model in advance based on the following steps S301 to S304:
in step S301, a sample source corpus of a sample to-be-translated sentence is obtained.
In step S302, pronunciation information of each sample source text in the sample source corpus is determined, and the pronunciation information of each sample source text is composed into a sample pronunciation corpus.
In step S303, a sample target text corresponding to each pronunciation information in the sample pronunciation corpus is obtained, and the sample target text corresponding to each pronunciation information is composed into a sample target corpus.
In step S304, the detection model is trained based on the sample pronunciation corpus, the sample target corpus, and a sample history target corpus of a sample history sentence, where the sample history sentence includes a translated sentence before the sample sentence to be translated.
For example, after obtaining a sample source corpus of a sample sentence to be translated, the pronunciation information of each sample source text in the sample source corpus may be determined based on a pre-constructed correspondence between the text and the pronunciation information, so as to combine the pronunciation information of each sample source text into a sample pronunciation corpus. For example, in the case that the sample source corpus is a chinese corpus, the correspondence between the text and the pronunciation information may include a chinese pinyin mapping table or the like.
For example, pronunciation information of z sample source texts may be randomly selected from the sample pronunciation corpus, with M = [m1, m2, ..., mz] denoting the indices of the selected pronunciation information within the sample pronunciation corpus. The z pieces of pronunciation information are then masked in the sample pronunciation corpus, each replaced with a set special symbol such as "$mask", to obtain a masked sample pronunciation corpus, denoted P_M. The masked pronunciation information corresponds to z sample source texts in the sample source corpus, and the corpus composed of those z sample source texts is denoted X_M. It should be noted that the z pieces of pronunciation information are selected randomly, i.e., their positions in the sample pronunciation corpus need not be contiguous.
On this basis, the detection model predicts, from the non-masked pronunciation information in P_M, what the sample source text at each masked position is. For example, let the sample source corpus X be "I like the future" (我喜欢未来) and the sample pronunciation corpus P be "wo xi huan wei lai". Randomly masking the pronunciation information corresponding to one sample source text in X yields the masked sample pronunciation corpus P_M = "wo xi huan $mask lai", in which case X_M is the single text corresponding to "wei". The model then predicts, based on "wo xi huan $mask lai", the sample source text at the position of "$mask". For this example, when training the detection model, P_M = "wo xi huan $mask lai" is used as the model input and X_M, the text corresponding to "wei", as the model output.
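A minimal sketch of this masked-sample construction, assuming the texts and their pronunciation sequence are aligned lists; names are illustrative.

```python
import random

MASK = "$mask"

def mask_pronunciation_corpus(texts, pinyins, z, seed=None):
    """Randomly mask z pronunciation units; return the masked corpus P_M,
    the target texts X_M, and the selected indices M (not necessarily
    contiguous)."""
    rng = random.Random(seed)
    M = sorted(rng.sample(range(len(pinyins)), z))
    masked = set(M)
    P_M = [MASK if i in masked else p for i, p in enumerate(pinyins)]
    X_M = [texts[i] for i in M]   # what the detection model must predict
    return P_M, X_M, M

P_M, X_M, M = mask_pronunciation_corpus(
    ["我", "喜", "欢", "未", "来"],
    ["wo", "xi", "huan", "wei", "lai"],
    z=1,
)
print(P_M, X_M)  # e.g. ['wo', 'xi', 'huan', '$mask', 'lai'] ['未']
```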
Further, the training of the detection model may use the negative log-likelihood loss function of the neural network model as the training optimization target, as shown in the following formula (1):

    L = -sum_{i=1..n} log P(y_i | x_i, ctx_i)    (1)

where n is the sample data size (e.g., number of sentences) used to train the model, x_i is the pronunciation sequence to be predicted in the i-th training sample, ctx_i is the historical sentence (i.e., the text of the previous sentence) in the i-th sample, and y_i is the sample source text sequence corresponding to the pronunciation sequence to be predicted in the i-th sample.
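A minimal sketch of this objective, assuming the masked-position predictions form a categorical distribution over the source-text vocabulary (PyTorch is an assumed framework; the patent does not name one):

```python
import torch
import torch.nn.functional as F

def detection_nll_loss(logits, target_ids):
    """Formula (1), sketched: negative log-likelihood of the true source
    texts y_i at the masked positions, given x_i and ctx_i.

    logits:     (num_masked, vocab_size) unnormalized scores
    target_ids: (num_masked,) indices of the true sample source texts
    """
    # cross_entropy averages -log P(y_i | x_i, ctx_i) over masked positions
    return F.cross_entropy(logits, target_ids)
```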
The detection model may be used to determine a target vector for each text in the pronunciation information sequence and in the target text sequence of the historical sentence, and to output, based on the target vector and the encoder in the detection model, the probability that each text in the source text sequence is a correct text; the target vector may include a word-encoding representation sub-vector, a block-encoding representation sub-vector, and a position-encoding representation sub-vector of the text. Specifically, to determine the target vector of each text, a block-encoding sub-vector <context> and a word-encoding sub-vector <phone> may be added on top of the position-encoding sub-vector each text already has. The block-encoding sub-vector distinguishes whether the text comes from the historical sentence or the current sentence to be translated, and the word-encoding sub-vector distinguishes the pronunciation information of the text. The target vector of each text, i.e., the representation vector finally input to the Transformer encoder of the detection model, is thus the sum of the word-encoding, block-encoding, and position-encoding representation sub-vectors corresponding to the text. Illustratively, the position encoding may use the sinusoidal (trigonometric) position encoding of the Transformer model.
In this embodiment, the architecture of the encoder may adopt the standard Transformer encoder architecture from the related art, which may include N encoder layers. Unlike a standard Transformer encoder, in this embodiment only the encoder outputs corresponding to the pronunciation information sequence of the source text sequence of the current sentence to be translated are used to predict the probability that the original text recognized for each piece of pronunciation information is correct (i.e., the probability that each text in the source text sequence is a correct text); the target text sequence of the historical sentence does not participate in the final prediction and only provides auxiliary information through the bottom N-1 Transformer encoder layers.
In this embodiment, the training termination condition of the detection model may include the following two conditions:
1) training stops when N rounds (epochs) of training over the sample data have been completed;
2) the detection model under training is tested on a given development (validation) set, and training stops if accuracy fails to improve for N consecutive evaluations.
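A minimal sketch of these two termination conditions as a training loop; train_fn, eval_fn, and the hyperparameter names are illustrative assumptions.

```python
def train_with_early_stopping(model, train_fn, eval_fn, max_epochs, patience):
    """Stop when max_epochs rounds finish (condition 1) or when validation
    accuracy has not improved for `patience` consecutive evaluations
    (condition 2)."""
    best_acc, stale = 0.0, 0
    for _ in range(max_epochs):
        train_fn(model)            # one epoch over the sample data
        acc = eval_fn(model)       # accuracy on the development set
        if acc > best_acc:
            best_acc, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return model
```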
When the detection model is tested, the current source text sequence to be detected is converted into its corresponding pronunciation information sequence, and the detection model predicts, for each piece of pronunciation information in the sequence, the probability that the corresponding source text is correct. Source texts whose probability is less than or equal to the set threshold are marked as recognition errors, and their pronunciation information is retained in their place; source texts whose probability is greater than the set threshold are retained as-is.
As can be seen from the above description, this embodiment obtains a sample source corpus of sample sentences to be translated, determines the pronunciation information of each sample source text in the sample source corpus, and composes that pronunciation information into a sample pronunciation corpus. It then obtains the sample target text corresponding to each piece of pronunciation information in the sample pronunciation corpus, composes those into a sample target corpus, and trains the detection model on the sample pronunciation corpus, the sample target corpus, and the sample historical target corpus of the sample historical sentences. A detection model can thus be trained accurately, laying the foundation for subsequently obtaining a processed source text sequence from a source text sequence, historical sentences, and the pre-trained detection model, improving the quality of machine translation based on the processed source text sequence, and meeting users' quality requirements for machine translation.
FIG. 4 is a flowchart illustrating how the translation model is trained in accordance with an exemplary embodiment of the present disclosure; the present embodiment illustrates how to train the translation model on the basis of the above embodiments. As shown in fig. 4, the present embodiment further includes training the translation model in advance based on the following steps S401 to S404:
in step S401, first sample data is acquired.
In this embodiment, the first sample data includes a source corpus and a corresponding target corpus, the source corpus is a corpus of a text to be translated, and the target corpus is a translation result corresponding to the text to be translated.
For example, if the source language is chinese and the target language is english, the source corpus of the first sample data may be chinese corpuses and the target corpus may be english corpuses corresponding to the chinese corpuses.
In step S402, texts randomly selected at a preset proportion from the source corpus are replaced with corresponding pronunciation information to obtain a mixed corpus.
In this embodiment, in order to reduce the influence of the homophonic wrong characters, texts with a preset proportion randomly selected from the source corpus may be replaced with corresponding pronunciation information, so as to obtain a mixed corpus. Taking the source corpus as an example of the Chinese character corpus, the Chinese characters with the preset proportion in the Chinese character corpus can be replaced by the corresponding Chinese pinyin, so that the corpus in which the Chinese characters and the pinyin are mixed is obtained.
It should be noted that the preset proportion may be set based on actual needs, for example 10%. If the proportion does not yield an integer number of texts, the count may be rounded up or down, which is not limited in this embodiment. Illustratively, if the length of the source corpus (i.e., the number of texts) is less than 10 and greater than or equal to 5, one text may be replaced at random.
Replacing texts randomly selected at a preset proportion from the source corpus with their corresponding pronunciation information gives the trained translation model a better generalization effect.
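A minimal sketch of this corpus-mixing step; the 10% ratio, the round-down policy, and the short-sentence rule follow the examples above but remain implementation choices.

```python
import math
import random

def make_mixed_corpus(texts, pinyins, ratio=0.10, seed=None):
    """Step S402 (sketch): replace a preset proportion of randomly selected
    texts in the source corpus with their pronunciation information."""
    rng = random.Random(seed)
    n = len(texts)
    if 5 <= n < 10:
        k = 1                      # short sentence: replace one text
    else:
        k = math.floor(n * ratio)  # round a non-integer count down
    idx = set(rng.sample(range(n), k))
    return [pinyins[i] if i in idx else t for i, t in enumerate(texts)]
```

The second sample data then pairs each mixed corpus with the unchanged target corpus, and the translation model is trained on the union of the first and second sample data (steps S403 to S404).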
In step S403, the mixed corpus and the target corpus are combined into second sample data.
In this embodiment, the mixed corpus may serve as a source corpus of the second sample data, and the target corpus may still serve as a target corpus of the second sample data.
In step S404, the translation model is trained using the second sample data and the first sample data as training data.
With this approach, texts randomly selected at a preset proportion from the source corpus of the first sample data are replaced with their pronunciation information to obtain a mixed corpus, the mixed corpus and the target corpus form the second sample data, and the translation model is trained on the second sample data together with the first sample data. Because source corpora that have not had any text replaced with pronunciation information remain in the training data, completely correct corpora without pronunciation information are guaranteed to appear in the sample data used to train the translation model. This processing improves the robustness of the translation model against speech recognition errors involving homophones written with different characters, effectively enhancing neural machine translation quality.
It can be understood that the above training method of the translation model concerns only the sample data used for training and is independent of any specific model architecture or training regime; a neural machine translation model such as a Transformer, RNN, or CNN can therefore be used based on actual needs, which is not limited in this embodiment.
FIG. 5 is a block diagram illustrating a machine translation apparatus according to an exemplary embodiment. The apparatus of this embodiment can be applied to terminal devices supporting machine translation functions (such as intelligent translators, wearable devices, smartphones, and tablet computers).
As shown in fig. 5, the apparatus includes: a source text sequence obtaining module 110, a processed text sequence obtaining module 120, and a target text sequence obtaining module 130, wherein:
a source text sequence obtaining module 110, configured to obtain a source text sequence of a current sentence to be translated;
a processed text sequence obtaining module 120, configured to input the source text sequence and a historical sentence into a pre-trained detection model for detection, so as to obtain a processed source text sequence, where the historical sentence includes a sentence that has been translated before the current sentence to be translated, and the detection model is used to detect whether a text in the source text sequence is an error text;
a target text sequence obtaining module 130, configured to input the processed source text sequence into a pre-trained translation model, so as to obtain a target text sequence corresponding to the current sentence to be translated, where the target text sequence is a translation result corresponding to the source text sequence.
As can be seen from the above description, the apparatus of this embodiment obtains the source text sequence of the current sentence to be translated, inputs the source text sequence and the historical sentences into a pre-trained detection model to obtain a processed source text sequence, and then inputs the processed source text sequence into a pre-trained translation model to obtain the target text sequence corresponding to the current sentence to be translated. Because the source text sequence and the corresponding historical sentences are first passed through a detection model that detects erroneous text, the influence of factors such as the speaker's accent on the speech recognition result can be reduced and the accuracy of the speech recognition result improved. This in turn improves the accuracy of the subsequent translation of the speech recognition result, improving the quality of machine translation and meeting users' demands.
FIG. 6 is a block diagram illustrating a machine translation device in accordance with yet another exemplary embodiment; the device of the embodiment can be applied to terminal equipment (such as a smart translator, a wearable device, a smart phone, a tablet computer and the like) supporting a machine translation function. The source text sequence obtaining module 210, the processed text sequence obtaining module 220, and the target text sequence obtaining module 230 have the same functions as the source text sequence obtaining module 110, the processed text sequence obtaining module 120, and the target text sequence obtaining module 130 in the embodiment shown in fig. 5, and are not described herein again. As shown in fig. 6, the processing text sequence obtaining module 220 may include:
a pronunciation information sequence determining unit 221, configured to determine pronunciation information of each text in the source text sequence, and form the pronunciation information of each text into a pronunciation information sequence;
a correct text probability obtaining unit 222, configured to input the pronunciation information sequence and the target text sequence of the historical sentence into the detection model, and obtain a probability that each text in the source text sequence predicted by the detection model is a correct text;
and an error text confirming unit 223, configured to confirm the text with the probability less than or equal to the set probability threshold as an error text.
In an embodiment, the processing text sequence obtaining module further includes:
a processed text sequence obtaining unit 224, configured to replace the error text in the source text sequence with corresponding pronunciation information, so as to obtain a processed source text sequence.
In an embodiment, the correct text probability obtaining unit is further configured to:
determining a target vector for each text in the pronunciation information sequence and the target text sequence of the historical sentence based on a detection model;
and outputting, based on the target vector and an encoder in the detection model, the probability that each text in the source text sequence is a correct text, wherein the target vector includes a word-encoding representation sub-vector, a block-encoding representation sub-vector, and a position-encoding representation sub-vector of the text.
In an embodiment, the apparatus may further include a detection model training module 240;
the detection model training module 240 may include:
a sample source corpus obtaining unit 241, configured to obtain a sample source corpus of a sample sentence to be translated;
a sample pronunciation corpus composing unit 242, configured to determine pronunciation information of each sample source text in the sample source corpus, and compose the pronunciation information of each sample source text into a sample pronunciation corpus;
a sample target corpus composing unit 243, configured to obtain a sample target text corresponding to each piece of pronunciation information in the sample pronunciation corpus, and compose the sample target text corresponding to each piece of pronunciation information into a sample target corpus;
a detection model training unit 244, configured to train the detection model based on the sample pronunciation corpus, the sample target corpus, and a sample history target corpus of a sample history sentence, where the sample history sentence includes a translated sentence before the sample to-be-translated sentence.
In an embodiment, the apparatus may further include a translation model training module 250;
the translation model training module 250 may include:
a first data obtaining unit 251, configured to obtain first sample data, where the first sample data includes a source corpus and a corresponding target corpus, the source corpus is a corpus of a text to be translated, and the target corpus is a translation result corresponding to the text to be translated;
a mixed corpus obtaining unit 252, configured to replace texts randomly selected at a preset proportion from the source corpus with corresponding pronunciation information to obtain a mixed corpus;
a second data obtaining unit 253, configured to combine the mixed corpus and the target corpus into second sample data, where the mixed corpus is used as a source corpus of the second sample data, and the target corpus is used as a target corpus of the second sample data;
a translation model training unit 254, configured to train the translation model by using the second sample data and the first sample data as training data.
In an embodiment, the current sentence to be translated includes a sentence input by speech, and the source text sequence includes a text sequence recognized from the speech-input sentence by a speech recognition technology.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
FIG. 7 is a block diagram illustrating an electronic device in accordance with an example embodiment. For example, the device 900 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, and the like.
Referring to fig. 7, device 900 may include one or more of the following components: processing component 902, memory 904, power component 906, multimedia component 908, audio component 910, input/output (I/O) interface 912, sensor component 914, and communication component 916.
The processing component 902 generally controls the overall operation of the device 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing component 902 may include one or more processors 920 to execute instructions to perform all or a portion of the steps of the methods described above. Further, processing component 902 can include one or more modules that facilitate interaction between processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operation at the device 900. Examples of such data include instructions for any application or method operating on device 900, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 904 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power component 906 provides power to the various components of the device 900. The power components 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 900.
The multimedia component 908 includes a screen that provides an output interface between the device 900 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touch, slide, and gesture actions on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 900 is in an operating mode, such as a shooting mode or a video mode. Each of the front camera and the rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 910 is configured to output and/or input audio signals. For example, audio component 910 includes a Microphone (MIC) configured to receive external audio signals when device 900 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 904 or transmitted via the communication component 916. In some embodiments, audio component 910 also includes a speaker for outputting audio signals.
I/O interface 912 provides an interface between processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 914 includes one or more sensors for providing status assessments of various aspects of the device 900. For example, the sensor component 914 may detect an open/closed state of the device 900 and the relative positioning of components, such as the display and keypad of the device 900; the sensor component 914 may also detect a change in the position of the device 900 or of a component of the device 900, the presence or absence of user contact with the device 900, the orientation or acceleration/deceleration of the device 900, and a change in the temperature of the device 900. The sensor component 914 may also include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate wired or wireless communication between the device 900 and other devices. The device 900 may access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G, or 5G, or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast-associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 916 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 904 comprising instructions, executable by the processor 920 of the device 900 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of machine translation, the method comprising:
acquiring a source text sequence of a current sentence to be translated;
inputting the source text sequence and the historical sentences into a pre-trained detection model for detection to obtain a processed source text sequence, wherein the historical sentences comprise sentences translated before the current sentence to be translated, and the detection model is used for detecting whether texts in the source text sequence are error texts;
and inputting the processed source text sequence into a pre-trained translation model to obtain a target text sequence corresponding to the current sentence to be translated, wherein the target text sequence is a translation result corresponding to the source text sequence.
2. The method of claim 1, wherein the inputting the source text sequence and the historical sentences into a pre-trained detection model for detection comprises:
determining pronunciation information of each text in the source text sequence, and forming the pronunciation information of each text into a pronunciation information sequence;
inputting the pronunciation information sequence and the target text sequence of the historical sentence into the detection model to obtain the probability that each text in the source text sequence predicted by the detection model is a correct text;
and determining a text whose probability is less than or equal to a set probability threshold to be an error text.
3. The method of claim 2, further comprising:
and replacing the error text in the source text sequence with corresponding pronunciation information to obtain a processed source text sequence.
4. The method of claim 2, wherein the inputting the pronunciation information sequence and the target text sequence of the historical sentence into the detection model to obtain a probability that each text in the source text sequence predicted by the detection model is a correct text comprises:
determining a target vector of each text in the pronunciation information sequence and the target text sequence of the historical sentence based on a detection model;
inputting the target vector into an encoder in the detection model, and outputting a probability that each text in the source text sequence is a correct text, wherein the target vector comprises a word-coded representation sub-vector, a block-coded representation sub-vector, and a position-coded representation sub-vector of the text.
5. The method of claim 1, further comprising training the detection model in advance based on:
obtaining sample source corpora of a sample sentence to be translated;
determining pronunciation information of each sample source text in the sample source corpus, and forming the pronunciation information of each sample source text into a sample pronunciation corpus;
acquiring a sample target text corresponding to each pronunciation information in the sample pronunciation corpus, and forming the sample target text corresponding to each pronunciation information into a sample target corpus;
training the detection model based on the sample pronunciation corpus, the sample target corpus and a sample historical target corpus of a sample historical sentence, wherein the sample historical sentence comprises a translated sentence before the sample sentence to be translated.
6. The method of claim 1, further comprising training the translation model in advance based on:
acquiring first sample data, wherein the first sample data comprises a source corpus and a corresponding target corpus, the source corpus is a corpus of a text to be translated, and the target corpus is a translation result corresponding to the text to be translated;
replacing a preset proportion of texts randomly selected from the source corpus with corresponding pronunciation information to obtain a mixed corpus;
forming second sample data by using the mixed corpus and the target corpus, wherein the mixed corpus is used as a source corpus of the second sample data, and the target corpus is used as a target corpus of the second sample data;
and taking the second sample data and the first sample data as training data to train the translation model.
7. The method of claim 1, wherein the current sentence to be translated comprises a sentence based on speech input, and wherein the source text sequence comprises a text sequence recognized from the speech-input sentence by a speech recognition technology.
8. A machine translation apparatus, the apparatus comprising:
the source text sequence acquisition module is used for acquiring a source text sequence of a current sentence to be translated;
a processed text sequence obtaining module, configured to input the source text sequence and a historical sentence into a pre-trained detection model for detection, so as to obtain a processed source text sequence, where the historical sentence includes a sentence that has been translated before the current sentence to be translated, and the detection model is used to detect whether a text in the source text sequence is an error text;
and the target text sequence acquisition module is used for inputting the processed source text sequence into a pre-trained translation model to obtain a target text sequence corresponding to the current sentence to be translated, and the target text sequence is a translation result corresponding to the source text sequence.
9. An electronic device, characterized in that the device comprises:
a processor and a memory for storing a computer program;
wherein the processor is configured to, when executing the computer program, implement:
acquiring a source text sequence of a current sentence to be translated;
inputting the source text sequence and the historical sentences into a pre-trained detection model for detection to obtain a processed source text sequence, wherein the historical sentences comprise sentences translated before the current sentence to be translated, and the detection model is used for detecting whether texts in the source text sequence are error texts;
and inputting the processed source text sequence into a pre-trained translation model to obtain a target text sequence corresponding to the current sentence to be translated, wherein the target text sequence is a translation result corresponding to the source text sequence.
10. A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing:
acquiring a source text sequence of a current sentence to be translated;
inputting the source text sequence and the historical sentences into a pre-trained detection model for detection to obtain a processed source text sequence, wherein the historical sentences comprise sentences translated before the current sentence to be translated, and the detection model is used for detecting whether texts in the source text sequence are error texts;
and inputting the processed source text sequence into a pre-trained translation model to obtain a target text sequence corresponding to the current sentence to be translated, wherein the target text sequence is a translation result corresponding to the source text sequence.
CN202210026585.9A 2022-01-11 2022-01-11 Machine translation method, device, equipment and storage medium Pending CN114444521A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210026585.9A CN114444521A (en) 2022-01-11 2022-01-11 Machine translation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210026585.9A CN114444521A (en) 2022-01-11 2022-01-11 Machine translation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114444521A 2022-05-06

Family

ID=81367896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210026585.9A Pending CN114444521A (en) 2022-01-11 2022-01-11 Machine translation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114444521A (en)

Similar Documents

Publication Publication Date Title
CN108537207B (en) Lip language identification method, device, storage medium and mobile terminal
CN107564526B (en) Processing method, apparatus and machine-readable medium
CN111414772B (en) Machine translation method, device and medium
CN111831806B (en) Semantic integrity determination method, device, electronic equipment and storage medium
CN112735396A (en) Speech recognition error correction method, device and storage medium
CN114154459A (en) Speech recognition text processing method and device, electronic equipment and storage medium
CN111382748B (en) Image translation method, device and storage medium
CN113673261A (en) Data generation method and device and readable storage medium
CN112528671A (en) Semantic analysis method, semantic analysis device and storage medium
CN111369978A (en) Data processing method and device and data processing device
CN108733657B (en) Attention parameter correction method and device in neural machine translation and electronic equipment
CN113343675A (en) Subtitle generating method and device for generating subtitles
CN110781674B (en) Information processing method, device, computer equipment and storage medium
CN112036195A (en) Machine translation method, device and storage medium
CN110795014B (en) Data processing method and device and data processing device
CN114444521A (en) Machine translation method, device, equipment and storage medium
CN116127062A (en) Training method of pre-training language model, text emotion classification method and device
CN111090998A (en) Sign language conversion method and device and sign language conversion device
CN114462410A (en) Entity identification method, device, terminal and storage medium
CN111241238B (en) User evaluation method, device, electronic equipment and storage medium
CN110837741B (en) Machine translation method, device and system
CN113420553A (en) Text generation method and device, storage medium and electronic equipment
CN113901832A (en) Man-machine conversation method, device, storage medium and electronic equipment
CN113239707A (en) Text translation method, text translation device and storage medium
CN112199963A (en) Text processing method and device and text processing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination