CN111414772B - Machine translation method, device and medium - Google Patents

Machine translation method, device and medium

Info

Publication number
CN111414772B
CN111414772B (application CN202010171952.5A)
Authority
CN
China
Prior art keywords
pinyin
corpus
chinese character
sequence
chinese
Prior art date
Legal status
Active
Application number
CN202010171952.5A
Other languages
Chinese (zh)
Other versions
CN111414772A (en)
Inventor
孙于惠
李响
Current Assignee
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202010171952.5A
Publication of CN111414772A
Application granted
Publication of CN111414772B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The disclosure relates to a machine translation method, a device and a medium. The method comprises the following steps: acquiring a source language sequence to be translated, wherein the source language sequence is a Chinese character sequence; based on the Chinese character sequence and a trained discrimination model, acquiring a processed source language sequence, wherein the discrimination model is used for discriminating whether Chinese characters in the Chinese character sequence are replaced by corresponding pinyin or not; and inputting the processed source language sequence into a trained translation model to obtain a prediction result of the trained translation model. The method can improve the robustness of the translation model, thereby effectively enhancing the neural machine translation quality.

Description

Machine translation method, device and medium
Technical Field
The present disclosure relates to the field of neural machine translation technology, and in particular, to a machine translation method, apparatus, and medium.
Background
Neural machine translation is currently the most mainstream machine translation approach. Although translation quality has improved greatly, neural machine translation is very sensitive to the input text to be translated: even a small number of errors in the source language that do not affect a human reader's understanding of the semantics will generally prevent the model from producing a correct translation. Methods for enhancing the robustness of neural machine translation mainly include: (1) correcting errors in the Chinese input text, thereby improving the quality of subsequent machine translation; (2) improving the training method of the neural machine translation model to strengthen the model's resistance to noisy training samples and thus improve the robustness of the translation model.
One common type of source language error is the homophone error. A homophone error here may be an error in which both the pinyin and the tone are identical, or an error in which the pinyin is identical but the tone differs.
Existing methods for improving model robustness against homophone errors include the following:
1. Construct homophone error corpora offline, i.e., add a certain proportion of homophone training corpora to the original training data to improve the model's resistance to noise. However, constructing noise offline cannot exhaust the many unknown homophone errors.
2. Homophone errors arise mainly because the pinyin is the same while the written forms differ, so a model can be trained from the pinyin of the source language to the text of the target language, i.e., the original input is converted into pinyin and fed to the model for learning. However, the information conveyed by pinyin is not as rich as that of Chinese characters.
3. Add the vector representation of each Chinese character of the text to be translated to the vector representation of that character's pinyin (concatenation may also be used) to form the character's final vector representation used in training, in the hope of enhancing the model's noise resistance. However, this construction is indiscriminate: most correct corpora do not need added pinyin information, and adding it degrades the translation of correct corpora.
Disclosure of Invention
In order to overcome the problems in the related art, the present disclosure provides a method, an apparatus, and a medium for training a translation model.
According to a first aspect of embodiments of the present disclosure, there is provided a machine translation method, the method comprising:
acquiring a source language sequence to be translated, wherein the source language sequence is a Chinese character sequence;
based on the Chinese character sequence and a trained discrimination model, acquiring a processed source language sequence, wherein the discrimination model is used for discriminating whether Chinese characters in the Chinese character sequence are replaced by corresponding pinyin or not;
and inputting the processed source language sequence into a trained translation model to obtain a prediction result of the trained translation model.
The method for obtaining the processed source language sequence based on the Chinese character sequence and the trained discrimination model comprises the following steps:
the pinyin of each Chinese character in the Chinese character sequence is obtained, and the pinyin of each Chinese character is formed into a pinyin sequence, wherein the Chinese character sequence comprises m Chinese characters, and m is a positive integer greater than or equal to 1;
masking the pinyin of each Chinese character in the pinyin sequences in turn, obtaining m masked pinyin sequences, and executing the following operations for each masked pinyin sequence: inputting the masked pinyin sequence into the trained discrimination model, and obtaining the probability of correct Chinese characters predicted by the trained discrimination model, wherein the probability of correct Chinese characters is the probability of outputting corresponding Chinese characters in the Chinese character sequence at the masked position;
When the probability of the correct Chinese character is smaller than the set probability, replacing the corresponding Chinese character in the Chinese character sequence with the corresponding pinyin, and obtaining the processed source language sequence.
Wherein the translation model is trained by:
acquiring first training data, wherein the first training data comprises source linguistic data and corresponding target linguistic data, and the source linguistic data is Chinese linguistic data;
replacing at least one Chinese character in the Chinese character corpus with a corresponding pinyin to obtain a mixed corpus;
the mixed corpus and the target corpus form second training data, wherein the mixed corpus is second training data source corpus and the target corpus is second training data target corpus;
the translation model is trained based on the second training data and the first training data.
The method for replacing at least one Chinese character in the Chinese character corpus with the corresponding pinyin comprises at least one of the following modes:
in a first mode, at least one Chinese character randomly selected from the Chinese character corpus is replaced by a corresponding pinyin;
and in a second mode, determining at least one homophonic error word in the Chinese character corpus, and replacing the at least one homophonic error word with a corresponding pinyin.
Wherein, the determining at least one homonym error word in the Chinese corpus comprises:
the pinyin of each Chinese character in the Chinese character corpus is obtained, and the pinyin of each Chinese character is formed into the pinyin corpus;
selecting the pinyin of a Chinese character from the pinyin corpus as the pinyin to be masked, masking the pinyin to be masked in the pinyin corpus, and obtaining the masking corpus;
acquiring Chinese characters corresponding to the pinyin to be masked in the Chinese character corpus, and taking the Chinese characters corresponding to the pinyin to be masked as target words;
inputting the mask corpus into the trained discrimination model to obtain the probability of the masked pinyin predicted by the trained discrimination model corresponding to the target word;
and when the probability is smaller than a set threshold value, determining that the target word is a homophonic error word.
Wherein the discriminant model is trained by:
acquiring a discrimination model source corpus, wherein the discrimination model source corpus is a Chinese character corpus;
the pinyin of each Chinese character in the discrimination model source corpus is obtained to form discrimination model pinyin corpus;
selecting the pinyin of z Chinese characters from the pinyin corpus of the discrimination model, wherein the pinyin corpus of the discrimination model comprises the pinyin of n Chinese characters, and z is more than or equal to 1 and less than n/5;
Masking the pinyin of the selected z Chinese characters in the pinyin corpus of the discrimination model to obtain the pinyin corpus of the discrimination model after masking;
acquiring Chinese characters corresponding to the pinyin of the selected z Chinese characters in the discriminant model source corpus, and forming the Chinese characters corresponding to the pinyin of the selected z Chinese characters into a discriminant model target corpus;
training the discriminant model based on the masked discriminant model pinyin corpus and the discriminant model target corpus.
Wherein the source language sequence is derived from speech data based on speech recognition techniques.
According to a second aspect of embodiments of the present disclosure, there is provided a machine translation apparatus, the apparatus comprising:
the system comprises a source language sequence acquisition module, a translation module and a translation module, wherein the source language sequence acquisition module is used for acquiring a source language sequence to be translated, and the source language sequence is a Chinese character sequence;
the source language sequence processing module is set to acquire a processed source language sequence based on the Chinese character sequence and a trained discrimination model, wherein the discrimination model is used for discriminating whether Chinese characters in the Chinese character sequence are replaced by corresponding pinyin or not;
the prediction result obtaining module is configured to input the processed source language sequence into a trained translation model to obtain a prediction result of the trained translation model.
Wherein the source language sequence processing module is further configured to:
the pinyin of each Chinese character in the Chinese character sequence is obtained, and the pinyin of each Chinese character is formed into a pinyin sequence, wherein the Chinese character sequence comprises m Chinese characters, and m is a positive integer greater than or equal to 1;
masking the pinyin of each Chinese character in the pinyin sequences in turn, obtaining m masked pinyin sequences, and executing the following operations for each masked pinyin sequence: inputting the masked pinyin sequence into the trained discrimination model, and obtaining the probability of correct Chinese characters predicted by the trained discrimination model, wherein the probability of correct Chinese characters is the probability of outputting corresponding Chinese characters in the Chinese character sequence at the masked position;
when the probability of the correct Chinese character is smaller than the set probability, replacing the corresponding Chinese character in the Chinese character sequence with the corresponding pinyin, and obtaining the processed source language sequence.
Wherein the apparatus further comprises a translation model training module configured to:
acquiring first training data, wherein the first training data comprises source linguistic data and corresponding target linguistic data, and the source linguistic data is Chinese linguistic data;
replacing at least one Chinese character in the Chinese character corpus with a corresponding pinyin to obtain a mixed corpus;
The mixed corpus and the target corpus form second training data, wherein the mixed corpus is second training data source corpus and the target corpus is second training data target corpus;
the translation model is trained based on the second training data and the first training data.
The translation model training module is further configured to replace at least one Chinese character in the Chinese character corpus with a corresponding pinyin by at least one of the following modes:
in a first mode, at least one Chinese character randomly selected from the Chinese character corpus is replaced by a corresponding pinyin;
and in a second mode, determining at least one homophonic error word in the Chinese character corpus, and replacing the at least one homophonic error word with a corresponding pinyin.
Wherein the translation model training module is further configured to determine at least one homonym error word in the chinese corpus by:
the pinyin of each Chinese character in the Chinese character corpus is obtained, and the pinyin of each Chinese character is formed into the pinyin corpus;
selecting the pinyin of a Chinese character from the pinyin corpus as the pinyin to be masked, masking the pinyin to be masked in the pinyin corpus, and obtaining the masking corpus;
Acquiring Chinese characters corresponding to the pinyin to be masked in the Chinese character corpus, and taking the Chinese characters corresponding to the pinyin to be masked as target words;
inputting the mask corpus into the trained discrimination model to obtain the probability of the masked pinyin predicted by the trained discrimination model corresponding to the target word;
and when the probability is smaller than a set threshold value, determining that the target word is a homophonic error word.
Wherein the apparatus further comprises a discriminant model training module, the discriminant model training module being arranged to:
acquiring a discrimination model source corpus, wherein the discrimination model source corpus is a Chinese character corpus;
the pinyin of each Chinese character in the discrimination model source corpus is obtained to form discrimination model pinyin corpus;
selecting the pinyin of z Chinese characters from the pinyin corpus of the discrimination model, wherein the pinyin corpus of the discrimination model comprises the pinyin of n Chinese characters, and z is more than or equal to 1 and less than n/5;
masking the pinyin of the selected z Chinese characters in the pinyin corpus of the discrimination model to obtain the pinyin corpus of the discrimination model after masking;
acquiring Chinese characters corresponding to the pinyin of the selected z Chinese characters in the discriminant model source corpus, and forming the Chinese characters corresponding to the pinyin of the selected z Chinese characters into a discriminant model target corpus;
Training the discriminant model based on the masked discriminant model pinyin corpus and the discriminant model target corpus.
Wherein the source language sequence is derived from speech data based on speech recognition techniques.
According to a third aspect of embodiments of the present disclosure, there is provided a machine translation apparatus comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to, when executing the executable instructions, implement the steps of:
acquiring a source language sequence to be translated, wherein the source language sequence is a Chinese character sequence;
based on the Chinese character sequence and a trained discrimination model, acquiring a processed source language sequence, wherein the discrimination model is used for discriminating whether Chinese characters in the Chinese character sequence are replaced by corresponding pinyin or not;
and inputting the processed source language sequence into a trained translation model to obtain a prediction result of the trained translation model.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer readable storage medium having instructions stored thereon which, when executed by a processor of an apparatus, cause the apparatus to perform a machine translation method, the method comprising:
Acquiring a source language sequence to be translated, wherein the source language sequence is a Chinese character sequence;
based on the Chinese character sequence and a trained discrimination model, acquiring a processed source language sequence, wherein the discrimination model is used for discriminating whether Chinese characters in the Chinese character sequence are replaced by corresponding pinyin or not;
and inputting the processed source language sequence into a trained translation model to obtain a prediction result of the trained translation model.
With the method of the disclosure, at least one Chinese character in the Chinese character corpus is replaced with its corresponding pinyin to obtain a mixed corpus; the Chinese character corpus, the mixed corpus, and the target corpus are combined into training data, and the translation model is trained on the combined training data. The method can improve the robustness of the translation model against speech recognition errors involving Chinese homophones, thereby effectively enhancing neural machine translation quality, while ensuring that the translation of noise-free real text to be translated does not degrade.
The method maintains a good translation effect on the normal test set while greatly improving translation of Chinese source text containing homophone error words. In addition, the method is simple to implement: it does not require constructing a large number of homophone error corpora offline, it dynamically constructs the mixed corpus of Chinese characters and pinyin, and it can cover essentially all situations provided the number of training steps is sufficient.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flowchart illustrating a method of training a translation model, according to an example embodiment.
FIG. 2 is a flowchart illustrating a method of training a translation model, according to an example embodiment.
FIG. 3 is a block diagram illustrating a training apparatus for a translation model, according to an example embodiment.
Fig. 4 is a block diagram of an apparatus according to an example embodiment.
Fig. 5 is a block diagram of an apparatus according to an example embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
At present, for speech translation scenarios with Chinese as the source language, translation mainly adopts a cascade of speech recognition followed by machine translation. Homophone errors may therefore occur during speech recognition, including errors in which both the pinyin and the tone are identical and errors in which the pinyin is identical but the tone differs.
For homophone errors, the simplest approach is to construct a large amount of homophone error data offline and mix it with the original data to train the model. Suppose the training data contains a pair (X, Y) with X = "他爱你" and Y = "He loves you". A character in the source sentence is randomly replaced with one of its homophones; for example, if "他" (tā, "he") is replaced with "踏" (tà, "pedal"), then X' = "踏爱你", and the replaced pair (X', Y) is added to the training corpus. This approach has two major drawbacks: first, offline data construction cannot exhaust all possible cases; second, a character such as "他" has very many homophones, and if each one is substituted once, the amount of training data becomes too large and the training period too long.
Another approach to homophone errors is to train a model from the pinyin of the source language to the text of the target language. Suppose one sentence pair in the training data is (X, Y), with X = "他爱你" and Y = "He loves you". X is converted into pinyin to obtain X' = "ta ai ni", and (X', Y) is fed to the model for learning as training corpus. This method works well on text to be translated that contains homophone errors, but performance drops noticeably on a normal test set without homophone errors, because the information conveyed by pinyin is not as rich as that of the written text. In the example above, should "ta" in X' be translated as "she" or "he"?
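As an illustration of this conversion from Chinese characters to pinyin, the following is a minimal sketch using the third-party pypinyin library; the library choice is an assumption, since the disclosure does not name a specific tool.

```python
# A minimal sketch of the pinyin conversion step; pypinyin is an assumption.
from pypinyin import lazy_pinyin

X = "他爱你"                # source sentence in Chinese characters
P = lazy_pinyin(X)          # ['ta', 'ai', 'ni'], tones omitted
X_prime = " ".join(P)       # "ta ai ni", the pinyin source fed to the model
print(X_prime)
```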
Yet another approach adds the word-level vector representation of each character of the text to be translated to the pinyin vector representation of that character (concatenation may also be used) to form the character's final vector representation used in training, in order to improve the model's noise resistance. Let the training data be (X, Y), where X = [x_1, x_2, ..., x_n]. The word vector of each character x_i is denoted e_i, so the word vector sequence of X is E_X = [e_1, e_2, ..., e_n]. Assuming the Chinese is processed at character level, the pinyin corresponding to each x_i is denoted p_i, giving the pinyin sequence P = [p_1, p_2, ..., p_n] of X; the vector sequence of this pinyin sequence is E_P = [z_1, z_2, ..., z_n]. Finally, e_i + z_i is used as the representation of x_i in training. The final word vectors constructed in this way are indiscriminate: most correct corpora do not need added pinyin information, and adding it degrades the translation of correct corpora.
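The vector combination e_i + z_i described above can be sketched as follows; the vocabulary sizes, embedding dimension, and use of PyTorch are illustrative assumptions rather than the patent's implementation.

```python
# A minimal sketch of the character-plus-pinyin embedding; sizes are assumed.
import torch
import torch.nn as nn

char_vocab, pinyin_vocab, dim = 8000, 400, 512    # assumed sizes

char_emb = nn.Embedding(char_vocab, dim)           # produces e_i for character x_i
pinyin_emb = nn.Embedding(pinyin_vocab, dim)       # produces z_i for pinyin p_i

def mixed_representation(char_ids, pinyin_ids):
    """Return e_i + z_i for every position, the final vector used in training."""
    return char_emb(char_ids) + pinyin_emb(pinyin_ids)

# Example with one 3-character sentence; the IDs are placeholders.
E = mixed_representation(torch.tensor([[21, 57, 103]]), torch.tensor([[5, 9, 14]]))
print(E.shape)   # torch.Size([1, 3, 512])
```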
In view of the foregoing, the present disclosure provides a machine translation method. In the method provided by the disclosure, a source language sequence to be translated is acquired, the source language sequence being a Chinese character sequence; a processed source language sequence is acquired based on the Chinese character sequence and a trained discrimination model; and the processed source language sequence is input into a trained translation model to obtain the prediction result of the trained translation model. When the translation model is trained, at least one Chinese character in the Chinese character corpus is replaced with its corresponding pinyin to obtain a mixed corpus; the Chinese character corpus, the mixed corpus, and the target corpus form the training data, and the translation model is trained on this data. The method can improve the robustness of the translation model against speech recognition errors involving Chinese homophones, thereby effectively enhancing neural machine translation quality, while ensuring that the translation of noise-free real text to be translated does not degrade.
The method of the present disclosure is particularly applicable when homophone error words occur in the source language sequence. Such homophone errors may arise from two sources: one is that homophones are misrecognized when the user's speech input is recognized by a speech recognition technique; the other is that the user types homophone error words when entering the source language sequence to be translated, for example via a keyboard or a handwriting pad.
The methods provided by the present disclosure are applicable to translation from Chinese into other languages, such as English, French, or German.
FIG. 1 is a flowchart illustrating a method of training a translation model, as shown in FIG. 1, according to an exemplary embodiment, the method comprising the steps of:
step 101, obtaining a source language sequence to be translated, wherein the source language sequence is a Chinese character sequence;
102, acquiring a processed source language sequence based on the Chinese character sequence and a trained discrimination model, wherein the discrimination model is used for discriminating whether Chinese characters in the Chinese character sequence are replaced by corresponding pinyin or not;
step 103, inputting the processed source language sequence into a trained translation model, and obtaining a prediction result of the trained translation model.
In step 101, the source language sequence is a Chinese character sequence, which may be a Chinese character sequence input by the user through a keyboard or a handwriting pad, or a Chinese character sequence obtained by recognizing speech input by the user through a speech recognition technique.
In step 102 and step 103, based on the trained discrimination model, the Chinese character sequence is processed, homophonic error words in the Chinese character sequence are replaced by corresponding pinyin, and then the trained translation model is input to obtain a prediction result, namely a translation result.
By adopting the method, homophonic error words in the Chinese character sequence are replaced by corresponding pinyin, so that the problem of translation errors caused by Chinese homophonic error words is avoided, the robustness of a translation model is improved, and the neural machine translation quality is enhanced.
In an optional implementation manner, the acquiring the processed source language sequence based on the Chinese character sequence and the trained discrimination model includes:
the pinyin of each Chinese character in the Chinese character sequence is obtained, and the pinyin of each Chinese character is formed into a pinyin sequence, wherein the Chinese character sequence comprises m Chinese characters, and m is a positive integer greater than or equal to 1;
masking the pinyin of each Chinese character in the pinyin sequences in turn, obtaining m masked pinyin sequences, and executing the following operations for each masked pinyin sequence: inputting the masked pinyin sequence into the trained discrimination model, and obtaining the probability of correct Chinese characters predicted by the trained discrimination model, wherein the probability of correct Chinese characters is the probability of outputting corresponding Chinese characters in the Chinese character sequence at the masked position;
When the probability of the correct Chinese character is smaller than the set probability, replacing the corresponding Chinese character in the Chinese character sequence with the corresponding pinyin, and obtaining the processed source language sequence.
Here, the corresponding pinyin sequence is obtained from the Chinese character sequence; the pinyin of one Chinese character at a time is masked in the pinyin sequence, and the trained discrimination model is used in turn to judge the probability that the masked position outputs the corresponding Chinese character of the Chinese character sequence. When this probability is smaller than the set probability, the corresponding Chinese character in the Chinese character sequence is replaced with its pinyin. The judgment starts from the first Chinese character of the Chinese character sequence and proceeds character by character until every Chinese character in the sequence has been judged.
The discrimination (prediction) process of the discrimination model is briefly described as follows:
Given the masked pinyin corpus P_M, the discrimination model produces the prediction result T = [t_1, t_2, ..., t_z], where t_j denotes the prediction for the j-th masked position. The model output is finally passed through a normalization (e.g., a softmax function) to obtain, for each t_j, a probability distribution over the target-side vocabulary. The specific prediction operation of the discrimination model is known to those skilled in the art and is not described in detail here.
In the above-described embodiment of the present disclosure, the probability of t_j is obtained and compared with the set probability to determine whether the target word is a homophone error word.
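A minimal sketch of this comparison follows; the `discriminator` callable stands in for the trained discrimination model, and its dictionary-style output as well as the 0.5 threshold are assumed, not defined by the disclosure.

```python
# Assumes `discriminator(masked_pinyin_seq)` returns, for the masked position,
# a mapping from candidate Chinese characters to probabilities (softmax output).
def is_homophone_error(discriminator, masked_pinyin_seq, original_char, set_probability=0.5):
    """Return True when the original character's probability falls below the set probability."""
    distribution = discriminator(masked_pinyin_seq)
    p_correct = distribution.get(original_char, 0.0)
    return p_correct < set_probability
```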
In an alternative embodiment, the translation model is trained by:
acquiring first training data, wherein the first training data comprises source linguistic data and corresponding target linguistic data, and the source linguistic data is Chinese linguistic data;
replacing at least one Chinese character in the Chinese character corpus with a corresponding pinyin to obtain a mixed corpus;
the mixed corpus and the target corpus form second training data, wherein the mixed corpus is second training data source corpus and the target corpus is second training data target corpus;
the translation model is trained based on the second training data and the first training data.
The source corpus of the first training data is a Chinese corpus, and the target corpus is an English corpus corresponding to the Chinese corpus.
In order to reduce the influence of homonym error words, at least one Chinese character in the Chinese character corpus can be replaced by the pinyin of the Chinese character, so that the Chinese character and pinyin mixed corpus is obtained. The mixed corpus is used as a second training data source corpus, and the first training data target corpus is used as a second training data target corpus.
And training the translation model by using second training data and first training data which are formed by the mixed corpus and the target corpus.
With this approach, Chinese characters in the source corpus (the Chinese character corpus) are replaced with their pinyin to generate the mixed corpus. The Chinese character corpus, the mixed corpus, and the target corpus form the new training data on which the translation model is trained. The Chinese corpus that has not been replaced with pinyin is also included in the training data, ensuring that completely correct Chinese corpus without pinyin still appears in the translation model's data. This improves the robustness of the translation model against speech recognition errors involving Chinese homophones, thereby effectively enhancing neural machine translation quality.
In an alternative embodiment, the replacing at least one chinese character in the chinese character corpus with a corresponding pinyin includes at least one of the following manners:
in a first mode, at least one Chinese character randomly selected from the Chinese character corpus is replaced by a corresponding pinyin;
and in a second mode, determining at least one homophonic error word in the Chinese character corpus, and replacing the at least one homophonic error word with a corresponding pinyin.
At least one Chinese character in the Chinese character corpus can be replaced with pinyin in two ways. In the first mode, Chinese characters randomly selected from the Chinese character corpus are replaced with their pinyin, which gives the trained translation model better generalization. In the second mode, homophone error words in the Chinese character corpus are selected and replaced with their pinyin, which gives better robustness in certain language environments. In a preferred embodiment, the first mode and the second mode may be used simultaneously to obtain the mixed corpus.
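A minimal sketch of mode one (random replacement) and of assembling the resulting second training data is given below; the 10% replacement ratio, the pypinyin dependency, and the space-joined output format are assumptions.

```python
import random
from pypinyin import lazy_pinyin

def replace_random_with_pinyin(sentence, ratio=0.1):
    """Mode one: replace a random subset of Chinese characters with their pinyin."""
    chars = list(sentence)
    pinyins = lazy_pinyin(sentence)
    k = max(1, int(len(chars) * ratio))
    for i in random.sample(range(len(chars)), k):
        chars[i] = pinyins[i]
    return " ".join(chars)            # mixed corpus of Chinese characters and pinyin

# First training data: (Chinese source corpus, target corpus).
first_data = [("他爱你", "He loves you")]
# Second training data: (mixed corpus, same target corpus).
second_data = [(replace_random_with_pinyin(src), tgt) for src, tgt in first_data]
training_data = first_data + second_data   # the translation model is trained on both
```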
In an optional implementation manner, the determining at least one homonym error word in the Chinese character corpus includes:
the pinyin of each Chinese character in the Chinese character corpus is obtained, and the pinyin of each Chinese character is formed into the pinyin corpus;
selecting the pinyin of a Chinese character from the pinyin corpus as the pinyin to be masked, masking the pinyin to be masked in the pinyin corpus, and obtaining the masking corpus;
acquiring Chinese characters corresponding to the pinyin to be masked in the Chinese character corpus, and taking the Chinese characters corresponding to the pinyin to be masked as target words;
inputting the mask corpus into the trained discrimination model to obtain the probability of the masked pinyin predicted by the trained discrimination model corresponding to the target word;
and when the probability is smaller than a set threshold value, determining that the target word is a homophonic error word.
Let the Chinese character corpus be X = [x_1, x_2, ..., x_n], where the pinyin corresponding to each Chinese character x_i is denoted p_i (1 ≤ i ≤ n); the corresponding pinyin corpus is P = [p_1, p_2, ..., p_n].
In the following example, it is determined whether each Chinese character in the Chinese character corpus is a homophonic error word, and if so, the Chinese character is replaced by pinyin.
Specifically, the pinyin corresponding to one Chinese character in the pinyin corpus is selected in turn; for example, the pinyin corresponding to the first Chinese character is selected and masked to obtain the mask corpus, and the first Chinese character in the Chinese character corpus is taken as the target word. The mask corpus is input into the trained discrimination model, and the probability that the position of the masked pinyin corresponds to the target word is obtained. The set probability is chosen according to the actual translation scenario. When the obtained probability is smaller than the set probability, the target word is determined to be a homophone error word, and the target word is replaced with its pinyin in the Chinese character corpus.
In an alternative embodiment, the discriminant model is trained by:
Acquiring a discrimination model source corpus, wherein the discrimination model source corpus is a Chinese character corpus;
the pinyin of each Chinese character in the discrimination model source corpus is obtained to form discrimination model pinyin corpus;
selecting the pinyin of z Chinese characters from the pinyin corpus of the discrimination model, wherein the pinyin corpus of the discrimination model comprises the pinyin of n Chinese characters, and z is more than or equal to 1 and less than n/5;
masking the pinyin of the selected z Chinese characters in the pinyin corpus of the discrimination model to obtain the pinyin corpus of the discrimination model after masking;
acquiring Chinese characters corresponding to the pinyin of the selected z Chinese characters in the discriminant model source corpus, and forming the Chinese characters corresponding to the pinyin of the selected z Chinese characters into a discriminant model target corpus;
training the discriminant model based on the masked discriminant model pinyin corpus and the discriminant model target corpus.
The pinyin of z Chinese characters is randomly selected from the pinyin corpus P, with M = [m_1, m_2, ..., m_z] denoting the indices of those z pinyins in the pinyin corpus. These z pinyins are masked in the pinyin corpus, i.e., replaced with a special symbol such as "$MASK", to obtain the masked pinyin corpus, denoted P_M. The masked pinyins correspond to z Chinese characters in the Chinese character corpus, and the corpus formed by these z Chinese characters is X_M. P_M is used as the discrimination model source corpus and X_M as the discrimination model target corpus to train the discrimination model. The z pinyins are selected randomly, and their positions in the pinyin corpus are not required to be contiguous.
The discrimination model can predict, based on the unmasked pinyins in P, which Chinese character corresponds to the position of the masked pinyin. For example, let X = "我爱你" and P = "wo ai ni"; after randomly masking in P the pinyin corresponding to one Chinese character of X, P_M = "wo ai $MASK" and X_M = "你". In this example, based on "wo ai $MASK", the model predicts which Chinese character corresponds to the "$MASK" position.
For the above example, when training the discrimination model, P_M = "wo ai $MASK" and X_M = "你" are used as the input and output of the model, respectively.
The maximum value of z is generally n/5, i.e., 20% of n; in practice about 10% of n is typically used. The 10% can of course be adjusted to the specific situation; the percentage that works best is determined empirically during actual training. When z is 10% of n, if a sentence contains more than 10 Chinese characters, 10% of its characters are selected and converted into pinyin; if a sentence contains fewer than 10 characters, one randomly chosen position can be converted into pinyin.
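The construction of one discrimination-model training example, with roughly 10% of the pinyins masked, can be sketched as follows; the pypinyin dependency and the "$MASK" spelling of the mask symbol are assumptions consistent with the example above.

```python
import random
from pypinyin import lazy_pinyin

MASK = "$MASK"

def make_discriminator_example(sentence, mask_ratio=0.1):
    """Mask the pinyin of about 10% of the characters; the masked characters are the targets."""
    chars = list(sentence)
    pinyins = lazy_pinyin(sentence)
    z = max(1, int(len(chars) * mask_ratio))          # at least one masked position
    positions = set(random.sample(range(len(chars)), z))
    source = [MASK if i in positions else p for i, p in enumerate(pinyins)]
    targets = [chars[i] for i in sorted(positions)]
    return " ".join(source), targets

# For "我爱你" this may yield, e.g., ("wo ai $MASK", ["你"]).
print(make_discriminator_example("我爱你"))
```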
As described above, replacing some of the homophone error words in the Chinese corpus with pinyin can improve the robustness of the translation model; if all homophone error words in the Chinese corpus are replaced with pinyin, the robustness of the translation model can be improved even further.
In an alternative embodiment, the source language sequence is derived from speech data based on speech recognition techniques. I.e. the user inputs speech, which is recognized as text based on speech recognition techniques.
In an alternative embodiment, the discriminant model is a BERT language model.
The BERT language model is a masked language model. Depending on actual conditions, a GPT language model, an ELMo language model, or another language model may be selected instead.
In an alternative embodiment, the translation model is a translation model based on the Transformer framework.
The translation model is trained on the mixed corpus of Chinese characters and pinyin and is a model based on the Transformer framework.
Specific embodiments according to the present disclosure are described below in connection with a specific application scenario. In this embodiment, Chinese is translated into English, the source corpus contains more than 10 Chinese characters, the BERT language model is used as the discrimination model, and the discrimination model and the translation model are trained by the methods described above. Homophone error words are then determined by the trained BERT language model and replaced with the corresponding pinyin. As shown in FIG. 2, this embodiment includes the following steps.
Step 201, a source language sequence to be translated is obtained, and the source language sequence is a Chinese character sequence.
Step 202, the pinyin of each Chinese character in the Chinese character sequence is obtained, and the pinyin of each Chinese character is formed into a pinyin sequence.
In step 203, the pinyin corresponding to the first Chinese character in the Chinese character sequence is masked; this pinyin is called the first pinyin, and the pinyin sequence with the first pinyin masked is obtained.
Step 204, inputting the masked pinyin sequence into the trained BERT language model to obtain the correct Chinese character probability predicted by the trained BERT language model.
And 205, when the probability of correct Chinese characters is smaller than the set probability, replacing the corresponding Chinese characters in the Chinese character sequence with corresponding pinyin.
Step 206, the pinyin in the pinyin sequence corresponding to the second Chinese character in the Chinese character sequence is masked; this pinyin is called the second pinyin, the pinyin sequence with the second pinyin masked is obtained, and steps 204 and 205 are repeated based on it.
Step 207, the pinyins corresponding to the third, fourth, ..., and last Chinese characters of the Chinese character sequence are masked in turn, and steps 204 and 205 are repeated for each.
Step 208, the processed source language sequence is obtained.
Step 209, inputting the processed source language sequence into a trained translation model, and obtaining a prediction result of the trained translation model.
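The inference flow of steps 201 to 209 can be summarized in the sketch below; `bert_predict` and `translate` stand in for the trained BERT discrimination model and the trained translation model, and their interfaces, the pypinyin dependency, and the 0.5 set probability are assumptions.

```python
from pypinyin import lazy_pinyin

MASK = "$MASK"
SET_PROBABILITY = 0.5   # assumed value of the set probability

def process_source_sequence(sentence, bert_predict):
    """Steps 202-208: replace characters the discrimination model judges unlikely with their pinyin."""
    chars = list(sentence)
    pinyins = lazy_pinyin(sentence)            # step 202: pinyin sequence
    processed = list(chars)
    for i in range(len(chars)):                # steps 203, 206, 207: mask each pinyin in turn
        masked = [MASK if j == i else p for j, p in enumerate(pinyins)]
        p_correct = bert_predict(masked).get(chars[i], 0.0)   # step 204
        if p_correct < SET_PROBABILITY:        # step 205
            processed[i] = pinyins[i]
    return " ".join(processed)                 # step 208: processed source language sequence

def machine_translate(sentence, bert_predict, translate):
    return translate(process_source_sequence(sentence, bert_predict))   # step 209
```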
With the method of the disclosure, at least one Chinese character in the Chinese character corpus is replaced with its corresponding pinyin to obtain a mixed corpus; the Chinese character corpus, the mixed corpus, and the target corpus are combined into training data, and the translation model is trained on the combined training data. The method can improve the robustness of the translation model against speech recognition errors involving Chinese homophones, thereby effectively enhancing neural machine translation quality, while ensuring that the translation of noise-free real text to be translated does not degrade.
The present disclosure also provides a training device for a translation model, as shown in fig. 3, where the device includes:
the source language sequence obtaining module 301 is configured to obtain a source language sequence to be translated, where the source language sequence is a Chinese character sequence;
the source language sequence processing module 302 is configured to obtain a processed source language sequence based on the Chinese character sequence and a trained discrimination model, where the discrimination model is used to discriminate whether the Chinese characters in the Chinese character sequence are replaced by corresponding pinyin;
The prediction result obtaining module 303 is configured to input the processed source language sequence into a trained translation model, and obtain a prediction result of the trained translation model.
In an alternative embodiment, the source language sequence processing module 302 is further configured to:
the pinyin of each Chinese character in the Chinese character sequence is obtained, and the pinyin of each Chinese character is formed into a pinyin sequence, wherein the Chinese character sequence comprises m Chinese characters, and m is a positive integer greater than or equal to 1;
masking the pinyin of each Chinese character in the pinyin sequences in turn, obtaining m masked pinyin sequences, and executing the following operations for each masked pinyin sequence: inputting the masked pinyin sequence into the trained discrimination model, and obtaining the probability of correct Chinese characters predicted by the trained discrimination model, wherein the probability of correct Chinese characters is the probability of outputting corresponding Chinese characters in the Chinese character sequence at the masked position;
when the probability of the correct Chinese character is smaller than the set probability, replacing the corresponding Chinese character in the Chinese character sequence with the corresponding pinyin, and obtaining the processed source language sequence.
In an alternative embodiment, the apparatus further comprises a translation model training module configured to:
Acquiring first training data, wherein the first training data comprises source linguistic data and corresponding target linguistic data, and the source linguistic data is Chinese linguistic data;
replacing at least one Chinese character in the Chinese character corpus with a corresponding pinyin to obtain a mixed corpus;
the mixed corpus and the target corpus form second training data, wherein the mixed corpus is second training data source corpus and the target corpus is second training data target corpus;
the translation model is trained based on the second training data and the first training data.
In an alternative embodiment, the translation model training module is further configured to replace at least one chinese character in the chinese character corpus with a corresponding pinyin by at least one of the following means:
in a first mode, at least one Chinese character randomly selected from the Chinese character corpus is replaced by a corresponding pinyin;
and in a second mode, determining at least one homophonic error word in the Chinese character corpus, and replacing the at least one homophonic error word with a corresponding pinyin.
In an alternative embodiment, the translation model training module is further configured to determine at least one homonym in the chinese corpus by:
The pinyin of each Chinese character in the Chinese character corpus is obtained, and the pinyin of each Chinese character is formed into the pinyin corpus;
selecting the pinyin of a Chinese character from the pinyin corpus as the pinyin to be masked, masking the pinyin to be masked in the pinyin corpus, and obtaining the masking corpus;
acquiring Chinese characters corresponding to the pinyin to be masked in the Chinese character corpus, and taking the Chinese characters corresponding to the pinyin to be masked as target words;
inputting the mask corpus into the trained discrimination model to obtain the probability of the masked pinyin predicted by the trained discrimination model corresponding to the target word;
and when the probability is smaller than a set threshold value, determining that the target word is a homophonic error word.
In an alternative embodiment, the apparatus further comprises a discriminant model training module, said discriminant model training module being arranged to:
acquiring a discrimination model source corpus, wherein the discrimination model source corpus is a Chinese character corpus;
the pinyin of each Chinese character in the discrimination model source corpus is obtained to form discrimination model pinyin corpus;
selecting the pinyin of z Chinese characters from the pinyin corpus of the discrimination model, wherein the pinyin corpus of the discrimination model comprises the pinyin of n Chinese characters, and z is more than or equal to 1 and less than n/5;
Masking the pinyin of the selected z Chinese characters in the pinyin corpus of the discrimination model to obtain the pinyin corpus of the discrimination model after masking;
acquiring Chinese characters corresponding to the pinyin of the selected z Chinese characters in the discriminant model source corpus, and forming the Chinese characters corresponding to the pinyin of the selected z Chinese characters into a discriminant model target corpus;
training the discriminant model based on the masked discriminant model pinyin corpus and the discriminant model target corpus.
In an alternative embodiment, the source language sequence is derived from speech data based on speech recognition techniques.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
With the method of the disclosure, at least one Chinese character in the Chinese character corpus is replaced with its corresponding pinyin to obtain a mixed corpus; the Chinese character corpus, the mixed corpus, and the target corpus are combined into training data, and the translation model is trained on the combined training data. The method can improve the robustness of the translation model against speech recognition errors involving Chinese homophones, thereby effectively enhancing neural machine translation quality, while ensuring that the translation of noise-free real text to be translated does not degrade.
The method maintains a good translation effect on the normal test set while greatly improving translation of Chinese source text containing homophone error words. In addition, the method is simple to implement: it does not require constructing a large number of homophone error corpora offline, it dynamically constructs the mixed corpus of Chinese characters and pinyin, and it can cover essentially all situations provided the number of training steps is sufficient.
Fig. 4 is a block diagram illustrating a machine translation device 400 according to an example embodiment.
Referring to fig. 4, apparatus 400 may include one or more of the following components: a processing component 402, a memory 404, a power component 406, a multimedia component 408, an audio component 410, an input/output (I/O) interface 412, a sensor component 414, and a communication component 416.
The processing component 402 generally controls the overall operation of the apparatus 400, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 402 may include one or more processors 420 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 402 can include one or more modules that facilitate interaction between the processing component 402 and other components. For example, the processing component 402 may include a multimedia module to facilitate interaction between the multimedia component 408 and the processing component 402.
Memory 404 is configured to store various types of data to support operations at device 400. Examples of such data include instructions for any application or method operating on the apparatus 400, contact data, phonebook data, messages, pictures, videos, and the like. The memory 404 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power component 406 provides power to the various components of the device 400. The power components 406 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the apparatus 400.
The multimedia component 408 includes a screen between the device 400 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 408 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the device 400 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 410 is configured to output and/or input audio signals. For example, the audio component 410 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 400 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 404 or transmitted via the communication component 416. In some embodiments, audio component 410 further includes a speaker for outputting audio signals.
The I/O interface 412 provides an interface between the processing component 402 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 414 includes one or more sensors for providing status assessment of various aspects of the apparatus 400. For example, the sensor assembly 414 may detect the on/off state of the device 400 and the relative positioning of components such as the display and keypad of the apparatus 400; the sensor assembly 414 may also detect a change in position of the apparatus 400 or of a component of the apparatus 400, the presence or absence of user contact with the apparatus 400, the orientation or acceleration/deceleration of the apparatus 400, and a change in temperature of the apparatus 400. The sensor assembly 414 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. The sensor assembly 414 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 414 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 416 is configured to facilitate communication between the apparatus 400 and other devices in a wired or wireless manner. The apparatus 400 may access a wireless network based on a communication standard, such as WiFi,2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 416 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 416 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 400 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, such as memory 404, including instructions executable by processor 420 of apparatus 400 to perform the above-described method. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
A non-transitory computer readable storage medium has instructions stored thereon which, when executed by a processor of a mobile terminal, cause the mobile terminal to perform a machine translation method, the method comprising: acquiring a source language sequence to be translated, wherein the source language sequence is a Chinese character sequence; based on the Chinese character sequence and a trained discrimination model, acquiring a processed source language sequence, wherein the discrimination model is used for discriminating whether Chinese characters in the Chinese character sequence are replaced by corresponding pinyin or not; and inputting the processed source language sequence into a trained translation model to obtain a prediction result of the trained translation model.
FIG. 5 is a block diagram illustrating a machine translation apparatus 500 according to an exemplary embodiment. For example, the apparatus 500 may be provided as a server. Referring to FIG. 5, the apparatus 500 includes a processing component 522, which further includes one or more processors, and memory resources represented by a memory 532 for storing instructions, such as application programs, executable by the processing component 522. The application programs stored in the memory 532 may include one or more modules, each of which corresponds to a set of instructions. Further, the processing component 522 is configured to execute the instructions to perform the above-described method: acquiring a source language sequence to be translated, wherein the source language sequence is a Chinese character sequence; based on the Chinese character sequence and a trained discrimination model, acquiring a processed source language sequence, wherein the discrimination model is used for discriminating whether Chinese characters in the Chinese character sequence are replaced by corresponding pinyin or not; and inputting the processed source language sequence into a trained translation model to obtain a prediction result of the trained translation model.
The apparatus 500 may also include a power component 526 configured to perform power management of the apparatus 500, a wired or wireless network interface 550 configured to connect the apparatus 500 to a network, and an input/output (I/O) interface 558. The apparatus 500 may operate based on an operating system stored in the memory 532, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application that follow its general principles and include such departures from the present disclosure as come within known or customary practice in the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the application being indicated by the following claims.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings and described above, and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (14)

1. A machine translation method, the method comprising:
acquiring a source language sequence to be translated, wherein the source language sequence is a Chinese character sequence;
based on the Chinese character sequence and a trained discrimination model, acquiring a processed source language sequence, wherein the discrimination model is used for discriminating whether Chinese characters in the Chinese character sequence are replaced by corresponding pinyin or not;
inputting the processed source language sequence into a trained translation model to obtain a prediction result of the trained translation model;
wherein the obtaining the processed source language sequence based on the Chinese character sequence and the trained discrimination model comprises the following steps:
the pinyin of each Chinese character in the Chinese character sequence is obtained, and the pinyin of each Chinese character is formed into a pinyin sequence, wherein the Chinese character sequence comprises m Chinese characters, and m is a positive integer greater than or equal to 1;
masking the pinyin of each Chinese character in the pinyin sequence in turn to obtain m masked pinyin sequences, and executing the following operations for each masked pinyin sequence: inputting the masked pinyin sequence into the trained discrimination model, and obtaining the probability of the correct Chinese character predicted by the trained discrimination model, wherein the probability of the correct Chinese character is the probability of outputting the corresponding Chinese character in the Chinese character sequence at the masked position;
when the probability of the correct Chinese character is smaller than the set probability, replacing the corresponding Chinese character in the Chinese character sequence with the corresponding pinyin, and obtaining the processed source language sequence.
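In an exemplary embodiment, the replacement procedure recited in claim 1 may be sketched in Python as follows; the pinyin conversion function get_pinyin, the discrimination model interface discriminator.predict, and the value of the set probability are assumptions made only for illustration and are not part of the disclosure.

from typing import Callable, List

MASK = "[MASK]"

def preprocess_source(
    hanzi_seq: List[str],                  # the m Chinese characters of the source language sequence
    get_pinyin: Callable[[str], str],      # assumed: maps one Chinese character to its pinyin
    discriminator,                         # assumed: trained discrimination model exposing
                                           #   predict(masked_seq, position) -> {character: probability}
    set_probability: float = 0.5,          # assumed threshold value
) -> List[str]:
    pinyin_seq = [get_pinyin(ch) for ch in hanzi_seq]
    processed = list(hanzi_seq)
    # mask the pinyin of each character in turn, giving m masked pinyin sequences
    for i, ch in enumerate(hanzi_seq):
        masked = pinyin_seq[:i] + [MASK] + pinyin_seq[i + 1:]
        # probability that the original character is output at the masked position
        p_correct = discriminator.predict(masked, i).get(ch, 0.0)
        if p_correct < set_probability:
            # a low probability suggests a noisy character; fall back to its pinyin
            processed[i] = pinyin_seq[i]
    return processed

The processed sequence returned by this sketch would then be passed to the trained translation model in place of the raw Chinese character sequence.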
2. The method of claim 1, wherein the translation model is trained by:
acquiring first training data, wherein the first training data comprises a source corpus and a corresponding target corpus, and the source corpus is a Chinese character corpus;
replacing at least one Chinese character in the Chinese character corpus with a corresponding pinyin to obtain a mixed corpus;
the mixed corpus and the target corpus form second training data, wherein the mixed corpus is the source corpus of the second training data and the target corpus is the target corpus of the second training data;
the translation model is trained based on the second training data and the first training data.
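In an exemplary embodiment, the construction of the second training data in claim 2 may be sketched as follows; the conversion function to_pinyin and the replacement rate are illustrative assumptions, since the claim only requires that at least one Chinese character be replaced with its pinyin.

import random
from typing import Callable, List, Tuple

def make_mixed_sentence(sentence: str, to_pinyin: Callable[[str], str], replace_rate: float = 0.15) -> str:
    """Replace at least one randomly selected character of a Chinese sentence with its pinyin."""
    chars = list(sentence)
    if not chars:
        return sentence
    k = max(1, int(len(chars) * replace_rate))
    for idx in random.sample(range(len(chars)), k):
        chars[idx] = to_pinyin(chars[idx])
    return "".join(chars)

def build_training_data(
    first_training_data: List[Tuple[str, str]],   # (Chinese source corpus, target corpus) pairs
    to_pinyin: Callable[[str], str],
) -> List[Tuple[str, str]]:
    # second training data: mixed corpus as source, unchanged target corpus as target
    second_training_data = [
        (make_mixed_sentence(src, to_pinyin), tgt) for src, tgt in first_training_data
    ]
    # the translation model is then trained on the first and second training data together
    return first_training_data + second_training_data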
3. The method of claim 2, wherein said replacing at least one Chinese character in said Chinese character corpus with a corresponding pinyin comprises at least one of:
in a first mode, at least one Chinese character randomly selected from the Chinese character corpus is replaced by a corresponding pinyin;
and in a second mode, determining at least one homophonic error word in the Chinese character corpus, and replacing the at least one homophonic error word with a corresponding pinyin.
4. The method of claim 3, wherein said determining at least one homophonic error word in said Chinese character corpus comprises:
the pinyin of each Chinese character in the Chinese character corpus is obtained, and the pinyin of each Chinese character is formed into the pinyin corpus;
selecting the pinyin of a Chinese character from the pinyin corpus as the pinyin to be masked, masking the pinyin to be masked in the pinyin corpus, and obtaining a masked corpus;
acquiring Chinese characters corresponding to the pinyin to be masked in the Chinese character corpus, and taking the Chinese characters corresponding to the pinyin to be masked as target words;
inputting the masked corpus into the trained discrimination model to obtain the probability, predicted by the trained discrimination model, that the masked pinyin corresponds to the target word;
and when the probability is smaller than a set threshold value, determining that the target word is a homophonic error word.
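In an exemplary embodiment, the homophonic-error-word check of claim 4 may be sketched as follows; to_pinyin, the discriminator.predict interface, and the threshold value are illustrative assumptions.

from typing import Callable, List

def is_homophonic_error_word(
    hanzi_corpus: List[str],               # the Chinese character corpus, one character per entry
    index: int,                            # position of the candidate target word
    to_pinyin: Callable[[str], str],
    discriminator,                         # assumed trained discrimination model
    threshold: float = 0.5,                # assumed set threshold value
) -> bool:
    pinyin_corpus = [to_pinyin(ch) for ch in hanzi_corpus]
    # mask the pinyin to be masked, giving the masked corpus
    masked_corpus = pinyin_corpus[:index] + ["[MASK]"] + pinyin_corpus[index + 1:]
    target_word = hanzi_corpus[index]
    # probability, predicted by the discrimination model, that the masked pinyin corresponds to the target word
    p = discriminator.predict(masked_corpus, index).get(target_word, 0.0)
    return p < threshold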
5. The method of claim 1, wherein the discrimination model is trained by:
acquiring a discrimination model source corpus, wherein the discrimination model source corpus is a Chinese character corpus;
the pinyin of each Chinese character in the discrimination model source corpus is obtained to form discrimination model pinyin corpus;
selecting the pinyin of z Chinese characters from the pinyin corpus of the discrimination model, wherein the pinyin corpus of the discrimination model comprises the pinyin of n Chinese characters, and z is greater than or equal to 1 and less than n/5;
masking the pinyin of the selected z Chinese characters in the pinyin corpus of the discrimination model to obtain the pinyin corpus of the discrimination model after masking;
acquiring Chinese characters corresponding to the pinyin of the selected z Chinese characters in the discrimination model source corpus, and forming the Chinese characters corresponding to the pinyin of the selected z Chinese characters into a discrimination model target corpus;
training the discrimination model based on the masked discrimination model pinyin corpus and the discrimination model target corpus.
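In an exemplary embodiment, one training example for the discrimination model of claim 5 may be built as sketched below; the sampling of z respects 1 ≤ z < n/5 as recited, while to_pinyin, the mask token, and the single-character granularity are illustrative assumptions.

import random
from typing import Callable, Dict, List, Tuple

def make_discrimination_example(
    source_corpus: List[str],              # discrimination model source corpus (Chinese characters)
    to_pinyin: Callable[[str], str],
) -> Tuple[List[str], Dict[int, str]]:
    n = len(source_corpus)
    pinyin_corpus = [to_pinyin(ch) for ch in source_corpus]
    # choose z with 1 <= z < n/5
    z = random.randint(1, max(1, (n - 1) // 5))
    positions = set(random.sample(range(n), z))
    # masked discrimination model pinyin corpus
    masked_corpus = ["[MASK]" if i in positions else p for i, p in enumerate(pinyin_corpus)]
    # discrimination model target corpus: the characters behind the masked pinyin
    target_corpus = {i: source_corpus[i] for i in sorted(positions)}
    return masked_corpus, target_corpus

Training then proceeds by asking the discrimination model to recover target_corpus from masked_corpus, in the spirit of a masked language model objective.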
6. The method of claim 1, wherein the source language sequence is obtained from speech data based on a speech recognition technique.
7. A machine translation apparatus, the apparatus comprising:
the system comprises a source language sequence acquisition module, a translation module and a translation module, wherein the source language sequence acquisition module is used for acquiring a source language sequence to be translated, and the source language sequence is a Chinese character sequence;
the source language sequence processing module is set to acquire a processed source language sequence based on the Chinese character sequence and a trained discrimination model, wherein the discrimination model is used for discriminating whether Chinese characters in the Chinese character sequence are replaced by corresponding pinyin or not;
The prediction result acquisition module is used for inputting the processed source language sequence into a trained translation model to acquire a prediction result of the trained translation model;
the source language sequence processing module is further configured to:
the pinyin of each Chinese character in the Chinese character sequence is obtained, and the pinyin of each Chinese character is formed into a pinyin sequence, wherein the Chinese character sequence comprises m Chinese characters, and m is a positive integer greater than or equal to 1;
masking the pinyin of each Chinese character in the pinyin sequence in turn to obtain m masked pinyin sequences, and executing the following operations for each masked pinyin sequence: inputting the masked pinyin sequence into the trained discrimination model, and obtaining the probability of the correct Chinese character predicted by the trained discrimination model, wherein the probability of the correct Chinese character is the probability of outputting the corresponding Chinese character in the Chinese character sequence at the masked position;
when the probability of the correct Chinese character is smaller than the set probability, replacing the corresponding Chinese character in the Chinese character sequence with the corresponding pinyin, and obtaining the processed source language sequence.
8. The apparatus of claim 7, further comprising a translation model training module configured to:
acquiring first training data, wherein the first training data comprises a source corpus and a corresponding target corpus, and the source corpus is a Chinese character corpus;
replacing at least one Chinese character in the Chinese character corpus with a corresponding pinyin to obtain a mixed corpus;
the mixed corpus and the target corpus form second training data, wherein the mixed corpus is the source corpus of the second training data and the target corpus is the target corpus of the second training data;
the translation model is trained based on the second training data and the first training data.
9. The apparatus of claim 8, wherein the translation model training module is further configured to replace at least one Chinese character in the Chinese character corpus with a corresponding pinyin by at least one of:
in a first mode, at least one Chinese character randomly selected from the Chinese character corpus is replaced by a corresponding pinyin;
and in a second mode, determining at least one homophonic error word in the Chinese character corpus, and replacing the at least one homophonic error word with a corresponding pinyin.
10. The apparatus of claim 9, wherein the translation model training module is further configured to determine at least one homophonic error word in the Chinese character corpus by:
the pinyin of each Chinese character in the Chinese character corpus is obtained, and the pinyin of each Chinese character is formed into the pinyin corpus;
selecting the pinyin of a Chinese character from the pinyin corpus as the pinyin to be masked, masking the pinyin to be masked in the pinyin corpus, and obtaining a masked corpus;
acquiring Chinese characters corresponding to the pinyin to be masked in the Chinese character corpus, and taking the Chinese characters corresponding to the pinyin to be masked as target words;
inputting the masked corpus into the trained discrimination model to obtain the probability, predicted by the trained discrimination model, that the masked pinyin corresponds to the target word;
and when the probability is smaller than a set threshold value, determining that the target word is a homophonic error word.
11. The apparatus of claim 7, further comprising a discrimination model training module configured to:
acquiring a discrimination model source corpus, wherein the discrimination model source corpus is a Chinese character corpus;
the pinyin of each Chinese character in the discrimination model source corpus is obtained to form discrimination model pinyin corpus;
selecting the pinyin of z Chinese characters from the pinyin corpus of the discrimination model, wherein the pinyin corpus of the discrimination model comprises the pinyin of n Chinese characters, and z is greater than or equal to 1 and less than n/5;
masking the pinyin of the selected z Chinese characters in the pinyin corpus of the discrimination model to obtain the pinyin corpus of the discrimination model after masking;
acquiring Chinese characters corresponding to the pinyin of the selected z Chinese characters in the discrimination model source corpus, and forming the Chinese characters corresponding to the pinyin of the selected z Chinese characters into a discrimination model target corpus;
training the discrimination model based on the masked discrimination model pinyin corpus and the discrimination model target corpus.
12. The apparatus of claim 7, wherein the source language sequence is obtained from speech data based on a speech recognition technique.
13. A machine translation device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to, when executing the executable instructions, implement the steps of:
acquiring a source language sequence to be translated, wherein the source language sequence is a Chinese character sequence;
based on the Chinese character sequence and a trained discrimination model, acquiring a processed source language sequence, wherein the discrimination model is used for discriminating whether Chinese characters in the Chinese character sequence are replaced by corresponding pinyin or not;
inputting the processed source language sequence into a trained translation model to obtain a prediction result of the trained translation model;
wherein the obtaining the processed source language sequence based on the Chinese character sequence and the trained discrimination model comprises the following steps:
the pinyin of each Chinese character in the Chinese character sequence is obtained, and the pinyin of each Chinese character is formed into a pinyin sequence, wherein the Chinese character sequence comprises m Chinese characters, and m is a positive integer greater than or equal to 1;
masking the pinyin of each Chinese character in the pinyin sequence in turn to obtain m masked pinyin sequences, and executing the following operations for each masked pinyin sequence: inputting the masked pinyin sequence into the trained discrimination model, and obtaining the probability of the correct Chinese character predicted by the trained discrimination model, wherein the probability of the correct Chinese character is the probability of outputting the corresponding Chinese character in the Chinese character sequence at the masked position;
when the probability of the correct Chinese character is smaller than the set probability, replacing the corresponding Chinese character in the Chinese character sequence with the corresponding pinyin, and obtaining the processed source language sequence.
14. A non-transitory computer-readable storage medium storing instructions that, when executed by a processor of an apparatus, cause the apparatus to perform a machine translation method, the method comprising:
acquiring a source language sequence to be translated, wherein the source language sequence is a Chinese character sequence;
based on the Chinese character sequence and a trained discrimination model, acquiring a processed source language sequence, wherein the discrimination model is used for discriminating whether Chinese characters in the Chinese character sequence are replaced by corresponding pinyin or not;
inputting the processed source language sequence into a trained translation model to obtain a prediction result of the trained translation model;
wherein the obtaining the processed source language sequence based on the Chinese character sequence and the trained discrimination model comprises the following steps:
the pinyin of each Chinese character in the Chinese character sequence is obtained, and the pinyin of each Chinese character is formed into a pinyin sequence, wherein the Chinese character sequence comprises m Chinese characters, and m is a positive integer greater than or equal to 1;
masking the pinyin of each Chinese character in the pinyin sequence in turn to obtain m masked pinyin sequences, and executing the following operations for each masked pinyin sequence: inputting the masked pinyin sequence into the trained discrimination model, and obtaining the probability of the correct Chinese character predicted by the trained discrimination model, wherein the probability of the correct Chinese character is the probability of outputting the corresponding Chinese character in the Chinese character sequence at the masked position;
when the probability of the correct Chinese character is smaller than the set probability, replacing the corresponding Chinese character in the Chinese character sequence with the corresponding pinyin, and obtaining the processed source language sequence.
CN202010171952.5A 2020-03-12 2020-03-12 Machine translation method, device and medium Active CN111414772B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010171952.5A CN111414772B (en) 2020-03-12 2020-03-12 Machine translation method, device and medium

Publications (2)

Publication Number Publication Date
CN111414772A CN111414772A (en) 2020-07-14
CN111414772B true CN111414772B (en) 2023-09-26

Family

ID=71492892

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010171952.5A Active CN111414772B (en) 2020-03-12 2020-03-12 Machine translation method, device and medium

Country Status (1)

Country Link
CN (1) CN111414772B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111768765B (en) * 2020-07-30 2022-08-19 华为技术有限公司 Language model generation method and electronic equipment
CN112069795B (en) * 2020-08-28 2023-05-30 平安科技(深圳)有限公司 Corpus detection method, device, equipment and medium based on mask language model
CN113761950A (en) * 2021-04-28 2021-12-07 腾讯科技(深圳)有限公司 Translation model testing method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7478033B2 (en) * 2004-03-16 2009-01-13 Google Inc. Systems and methods for translating Chinese pinyin to Chinese characters

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006010163A2 (en) * 2004-07-23 2006-01-26 America Online Incorporated User interface and database structure for chinese phrasal stroke and phonetic text input
CN101788978A (en) * 2009-12-30 2010-07-28 中国科学院自动化研究所 Chinese and foreign spoken language automatic translation method combining Chinese pinyin and character
CN104850237A (en) * 2014-02-19 2015-08-19 马舜尧 Method for generating and processing derived candidate item in input method
CN110147554A (en) * 2018-08-24 2019-08-20 腾讯科技(深圳)有限公司 Simultaneous interpreting method, device and computer equipment
CN110110041A (en) * 2019-03-15 2019-08-09 平安科技(深圳)有限公司 Wrong word correcting method, device, computer installation and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Junpeng; Song Dingxin; Zhang Yiming; Huang Degen. A Neural Machine Translation System Fusing Multiple Data Generalization Strategies. Journal of Jiangxi Normal University (Natural Science Edition), 2020, No. 01, full text. *
Cao Yichao; Gao Yi; Li Miao; Feng Tao; Wang Rujing; Fu Sha. Research on Mongolian-Chinese Neural Machine Translation Based on Monolingual Corpora and Word Embedding Alignment. Journal of Chinese Information Processing, 2020, No. 02, full text. *

Also Published As

Publication number Publication date
CN111414772A (en) 2020-07-14

Similar Documents

Publication Publication Date Title
CN111414772B (en) Machine translation method, device and medium
CN111831806B (en) Semantic integrity determination method, device, electronic equipment and storage medium
CN107564526B (en) Processing method, apparatus and machine-readable medium
CN112735396A (en) Speech recognition error correction method, device and storage medium
CN107291260B (en) Information input method and device for inputting information
CN108733657B (en) Attention parameter correction method and device in neural machine translation and electronic equipment
CN112199032A (en) Expression recommendation method and device and electronic equipment
CN109979435B (en) Data processing method and device for data processing
CN108073294B (en) Intelligent word forming method and device for intelligent word forming
CN114462410A (en) Entity identification method, device, terminal and storage medium
CN112863499B (en) Speech recognition method and device, storage medium
CN110837741B (en) Machine translation method, device and system
CN110781689B (en) Information processing method, device and storage medium
CN112837668B (en) Voice processing method and device for processing voice
CN110780749B (en) Character string error correction method and device
CN108345590B (en) Translation method, translation device, electronic equipment and storage medium
CN112149432A (en) Method and device for translating chapters by machine and storage medium
CN109308126B (en) Candidate word display method and device
CN113515618A (en) Voice processing method, apparatus and medium
CN111414731B (en) Text labeling method and device
CN110084065B (en) Data desensitization method and device
US20230196001A1 (en) Sentence conversion techniques
CN110716653B (en) Method and device for determining association source
CN109669549B (en) Candidate content generation method and device for candidate content generation
CN114444521A (en) Machine translation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant