CN112767918B - Russian-Chinese speech translation method, Russian-Chinese speech translation device and storage medium - Google Patents

Russian-Chinese speech translation method, Russian-Chinese speech translation device and storage medium

Info

Publication number
CN112767918B
Authority
CN
China
Prior art keywords
chinese
russian
training
translated
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110018492.7A
Other languages
Chinese (zh)
Other versions
CN112767918A (en)
Inventor
马延周
李宏欣
杨政
易绵竹
张一尼
卢国超
闫丹辉
张婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force
Publication of CN112767918A
Application granted
Publication of CN112767918B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L 19/16 Vocoder architecture
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure relates to a Russian-Chinese speech translation method, a Russian-Chinese speech translation apparatus, and a storage medium. The method includes: obtaining the Russian speech to be translated and converting it into a Mel spectrogram to be translated; translating the Mel spectrogram to be translated into a target Mel spectrogram through a pre-trained Russian-Chinese speech translation model; and obtaining, according to the target Mel spectrogram, the Chinese speech corresponding to the Russian speech to be translated. Because the Russian speech to be translated is first converted into a Mel spectrogram, which accurately represents its acoustic characteristics, and the pre-trained model translates that spectrogram directly into the target Mel spectrogram, the accuracy loss that accumulates across cascaded recognition, translation, and synthesis stages is reduced when the corresponding Chinese speech is obtained, which helps improve translation quality. The method also helps accelerate the Russian-Chinese speech translation rate and improve the capability of processing Russian-language information.

Description

Russian-Chinese speech translation method, Russian-Chinese speech translation device and storage medium
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a Russian-Chinese speech translation method, a Russian-Chinese speech translation device, and a storage medium.
Background
Existing speech translation systems mainly target languages with wide application and many speakers, such as English, Japanese, and German; speech translation for national and other low-resource languages has received far less attention.
In the related art, Russian-Chinese speech translation mainly takes one of two forms. The first performs speech recognition on the Russian speech to be translated, followed by text translation and speech synthesis. However, with this cascaded approach, errors accumulate across the recognition, translation, and synthesis steps, and the quality of the finally translated Chinese speech is poor. The other form is manual translation; because translators are scarce, speech information is processed slowly and its value is easily lost, leaving a large amount of important information unprocessed in time.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a Russian-Chinese speech translation method, a Russian-Chinese speech translation device, and a storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided a Russian-Chinese speech translation method, including: obtaining the Russian speech to be translated, and converting it into a Mel spectrogram to be translated; translating the Mel spectrogram to be translated into a target Mel spectrogram through a pre-trained Russian-Chinese speech translation model; and obtaining, according to the target Mel spectrogram, the Chinese speech corresponding to the Russian speech to be translated.
In an embodiment, the Russian-Chinese speech translation model includes a long short-term memory (LSTM) network, a local attention mechanism, and a bidirectional long short-term memory (Bi-LSTM) network. Translating the Mel spectrogram to be translated into the target Mel spectrogram through the pre-trained model includes: encoding multiple frames of the Mel spectrogram to be translated through the LSTM network to obtain an intermediate vector to be translated; focusing the intermediate vector to be translated based on the local attention mechanism to determine its attention vector; and decoding the focused intermediate vector through the Bi-LSTM network to obtain the target Mel spectrogram of the corresponding Chinese speech.
In another embodiment, the Russian-Chinese speech translation model is trained as follows: acquiring a plurality of training Mel spectrograms corresponding to a training speech set, where the training speech set includes a plurality of Russian training utterances and each training Mel spectrogram corresponds to one Russian training utterance; acquiring a plurality of Chinese Mel spectrograms corresponding to a Chinese speech set, where each Chinese Mel spectrogram corresponds to the Chinese training utterance of a Russian training utterance; inputting the training Mel spectrograms into an end-to-end model and obtaining, based on a local attention mechanism, translated Mel spectrograms corresponding to the training Mel spectrograms; and training the end-to-end model based on the translated Mel spectrograms and the Chinese Mel spectrograms to obtain the Russian-Chinese speech translation model.
In yet another embodiment, training the end-to-end model based on the translated Mel spectrograms and the Chinese Mel spectrograms to obtain the Russian-Chinese speech translation model includes: obtaining, according to the translated Mel spectrogram, the Chinese speech corresponding to the Russian training utterance; obtaining the Chinese training speech corresponding to the target Mel spectrogram; and training the end-to-end model based on a comparison between the Chinese speech and the Chinese training speech to obtain the Russian-Chinese speech translation model.
In yet another embodiment, before the Russian-Chinese speech translation model is obtained, the training method further includes: determining the fluency of the Chinese speech.
In yet another embodiment, training the end-to-end model based on the comparison between the Chinese speech and the Chinese training speech includes: acquiring the Chinese text of the Chinese speech and the Chinese training text of the Chinese training speech; determining the error rate between the Chinese text and the Chinese training text; if the error rate is smaller than an error threshold, stopping training to obtain the Russian-Chinese speech translation model; and if the error rate is greater than or equal to the error threshold, continuing to train the end-to-end model.
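This stopping criterion can be sketched as a character error rate computed from the edit distance between the two texts. The sketch below is not from the patent; the threshold value and function names are hypothetical.

```python
def levenshtein(ref, hyp):
    # dynamic-programming edit distance over characters
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n]

def char_error_rate(ref, hyp):
    # edits needed to turn the hypothesis into the reference, per reference character
    return levenshtein(ref, hyp) / max(len(ref), 1)

ERROR_THRESHOLD = 0.05  # hypothetical threshold, not specified by the patent

def should_stop_training(ref_text, hyp_text):
    # stop once the error rate falls below the threshold
    return char_error_rate(ref_text, hyp_text) < ERROR_THRESHOLD
```

For Chinese text, comparing at the character level is natural because characters are the basic written unit; a word-level rate would additionally require word segmentation.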
In yet another embodiment, the Chinese training speech has the same sampling frequency as the corresponding Russian training speech.
In another embodiment, obtaining, according to the target Mel spectrogram, the Chinese speech corresponding to the Russian speech to be translated includes: reconstructing the target Mel spectrogram through a vocoder to obtain the Chinese speech corresponding to the Russian speech to be translated.
According to a second aspect of the embodiments of the present disclosure, there is provided a Russian-Chinese speech translation apparatus, including: an obtaining unit configured to obtain the Russian speech to be translated and convert it into a Mel spectrogram to be translated; a translation unit configured to translate the Mel spectrogram to be translated into a target Mel spectrogram through a pre-trained Russian-Chinese speech translation model; and a conversion unit configured to obtain, according to the target Mel spectrogram, the Chinese speech corresponding to the Russian speech to be translated.
In an embodiment, the Russian-Chinese speech translation model includes a long short-term memory (LSTM) network, a local attention mechanism, and a bidirectional long short-term memory (Bi-LSTM) network. The translation unit translates the Mel spectrogram to be translated into the target Mel spectrogram as follows: encoding multiple frames of the Mel spectrogram to be translated through the LSTM network to obtain an intermediate vector to be translated; focusing the intermediate vector to be translated based on the local attention mechanism to determine its attention vector; and decoding the focused intermediate vector through the Bi-LSTM network to obtain the target Mel spectrogram of the corresponding Chinese speech.
In another embodiment, the Russian-Chinese speech translation model is trained as follows: acquiring a plurality of training Mel spectrograms corresponding to a training speech set, where the training speech set includes a plurality of Russian training utterances and each training Mel spectrogram corresponds to one Russian training utterance; acquiring a plurality of Chinese Mel spectrograms corresponding to a Chinese speech set, where each Chinese Mel spectrogram corresponds to the Chinese training utterance of a Russian training utterance; inputting the training Mel spectrograms into an end-to-end model and obtaining, based on a local attention mechanism, translated Mel spectrograms corresponding to the training Mel spectrograms; and training the end-to-end model based on the translated Mel spectrograms and the Chinese Mel spectrograms to obtain the Russian-Chinese speech translation model.
In yet another embodiment, the end-to-end model is trained based on the translated Mel spectrograms and the Chinese Mel spectrograms as follows: obtaining, according to the translated Mel spectrogram, the Chinese speech corresponding to the Russian training utterance; obtaining the Chinese training speech corresponding to the target Mel spectrogram; and training the end-to-end model based on a comparison between the Chinese speech and the Chinese training speech to obtain the Russian-Chinese speech translation model.
In yet another embodiment, before the Russian-Chinese speech translation model is obtained, the training method further includes: determining the fluency of the Chinese speech.
In yet another embodiment, the end-to-end model is trained based on the comparison between the Chinese speech and the Chinese training speech as follows: acquiring the Chinese text of the Chinese speech and the Chinese training text of the Chinese training speech; determining the error rate between the Chinese text and the Chinese training text; if the error rate is smaller than an error threshold, stopping training to obtain the Russian-Chinese speech translation model; and if the error rate is greater than or equal to the error threshold, continuing to train the end-to-end model.
In yet another embodiment, the Chinese training speech has the same sampling frequency as the corresponding Russian training speech.
In another embodiment, the conversion unit obtains, according to the target Mel spectrogram, the Chinese speech corresponding to the Russian speech to be translated as follows: reconstructing the target Mel spectrogram through a vocoder to obtain the Chinese speech corresponding to the Russian speech to be translated.
According to a third aspect of the embodiments of the present disclosure, there is provided a Russian-Chinese speech translation apparatus, including: a memory for storing instructions; and a processor for calling the instructions stored in the memory to execute the Russian-Chinese speech translation method provided by any one of the above embodiments.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium having stored therein instructions which, when executed by a processor, perform the Russian-Chinese speech translation method provided by any one of the above embodiments.
The technical solutions provided by the embodiments of the present disclosure can have the following beneficial effects. Because the Russian speech to be translated is first converted into a Mel spectrogram, the obtained Mel spectrogram accurately represents the acoustic characteristics of the Russian speech; the pre-trained Russian-Chinese speech translation model then translates it into the target Mel spectrogram, so that when the Chinese speech corresponding to the Russian speech is obtained, the accuracy loss of a cascaded pipeline is reduced and translation quality is improved. Because a pre-trained model performs the translation, the Russian-Chinese speech translation rate is increased and the capability of processing Russian-language information is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flow chart illustrating a Russian-Chinese speech translation method, according to an exemplary embodiment.
Fig. 2 is a schematic diagram illustrating a translation flow, according to an exemplary embodiment.
Fig. 3 is a flow chart illustrating another Russian-Chinese speech translation method, according to an exemplary embodiment.
Fig. 4 is a schematic diagram of a framework, according to an exemplary embodiment.
Fig. 5 is a schematic diagram illustrating another translation flow, according to an exemplary embodiment.
Fig. 6 is a block diagram illustrating a Russian-Chinese speech translation device, according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
Fig. 1 is a flowchart illustrating a Russian-Chinese speech translation method according to an exemplary embodiment. As shown in Fig. 1, the method includes the following steps S11 to S13.
In step S11, the Russian speech to be translated is obtained and converted into a Mel spectrogram to be translated.
In the embodiment of the present disclosure, the Russian speech to be translated may be obtained from a local speech library or from the cloud. Its content may include daily conversations, Russia-China diplomatic exchanges, military news, academic exchanges, and the like, which is not limited in this disclosure.
In practical applications, Russian has the following pronunciation characteristics: it has few vowels but many consonants; most consonants form voiced/voiceless and hard/soft pairs; vowels are markedly reduced in unstressed syllables, and their sound values are sometimes ambiguous. Stress can fall on different syllables in different words, with no fixed position, and may shift position when a word is inflected.
Therefore, to improve translation accuracy so that the translated Chinese speech truly reflects the content expressed by the Russian speech, the Russian speech to be translated is converted into a Mel spectrogram before translation. The Mel spectrogram is a spectral representation that reflects the characteristics of speech, and the Mel frequency scale matches the auditory characteristics of the human ear. The frequency peaks in the Mel spectrogram clearly reveal the formants of the speech and the boundaries between phonemes, so when the Russian-Chinese speech translation model translates the Mel spectrogram to be translated, the boundaries between words and sentences in the Russian speech are clear, which saves word-segmentation and recognition time and accelerates translation.
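The conversion described above can be sketched as a generic log-Mel spectrogram computation. This is not the patent's exact procedure; the frame length, hop size, sampling rate, and filter count below are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # triangular filters whose centers are evenly spaced on the Mel scale
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)   # rising edge
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)  # falling edge
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=128, n_mels=40):
    # frame the signal, window it, take the power spectrum, apply the
    # Mel filterbank, and log-compress; output shape is (frames, n_mels)
    window = np.hanning(n_fft)
    frames = [signal[s:s + n_fft] * window
              for s in range(0, len(signal) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
```

Each row of the result is one frame of the Mel spectrogram; a sequence of such frames is what the translation model consumes.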
In step S12, the Mel spectrogram to be translated is translated into the target Mel spectrogram through a pre-trained Russian-Chinese speech translation model.
In the embodiment of the disclosure, the Mel spectrogram to be translated is used as the input of the Russian-Chinese speech translation model; the model performs Russian-to-Chinese translation on it and outputs the translated target Mel spectrogram. The target Mel spectrogram represents the Mel spectrogram of the Chinese speech corresponding to the Russian speech to be translated.
In one example, the Russian-Chinese speech translation model can translate in both directions, i.e., it can translate a Mel spectrogram characterizing Russian into one characterizing Chinese and vice versa. In another example, the model can only translate a Mel spectrogram characterizing Russian into one characterizing Chinese.
In step S13, the Chinese speech corresponding to the Russian speech to be translated is obtained according to the target Mel spectrogram.
In the embodiment of the disclosure, the target Mel spectrogram represents the Mel spectrogram of the Chinese speech corresponding to the Russian speech to be translated; the acoustic characteristics of the Chinese speech can therefore be read directly from the target Mel spectrogram, and the Chinese speech corresponding to the Russian speech to be translated can be obtained quickly.
In one example, the target Mel spectrogram may be converted into Chinese speech by a vocoder. A vocoder is a speech analysis-synthesis system. When synthesizing Chinese speech from the target Mel spectrogram, the response of the vocal tract is first modeled using linear prediction, i.e., the target Mel spectrogram is reconstructed based on linear prediction; speech is then synthesized from the reconstructed spectrogram. The result is the Chinese speech corresponding to the Russian speech to be translated.
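The patent describes linear-prediction-based reconstruction. As an illustrative stand-in for how a vocoder can recover a waveform from a spectrogram, the classic Griffin-Lim algorithm is sketched below; this is not the patent's method, all names are ours, and it operates on a linear-frequency magnitude spectrogram (a Mel spectrogram would first need to be mapped back to linear frequencies).

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    # short-time Fourier transform: windowed frames -> complex spectra
    w = np.hanning(n_fft)
    frames = [x[s:s + n_fft] * w for s in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(frames, axis=1)

def istft(S, n_fft=512, hop=128):
    # overlap-add inverse with window-power normalization
    w = np.hanning(n_fft)
    n = (S.shape[0] - 1) * hop + n_fft
    x, norm = np.zeros(n), np.zeros(n)
    for i, spec in enumerate(S):
        x[i * hop:i * hop + n_fft] += np.fft.irfft(spec, n=n_fft) * w
        norm[i * hop:i * hop + n_fft] += w ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_iter=32, n_fft=512, hop=128):
    # iteratively estimate a phase consistent with the given magnitudes
    phase = np.exp(2j * np.pi * np.random.RandomState(0).rand(*magnitude.shape))
    for _ in range(n_iter):
        x = istft(magnitude * phase, n_fft, hop)
        phase = np.exp(1j * np.angle(stft(x, n_fft, hop)))
    return istft(magnitude * phase, n_fft, hop)
```

Modern systems typically replace this with a neural vocoder, but the iteration above captures the core idea of reconstructing a waveform whose spectrogram matches the target.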
Through the above embodiment, the Mel spectrogram converted from the Russian speech to be translated truly reflects the pronunciation characteristics of that speech, so the Russian-Chinese speech translation model can translate directly from the Mel spectrogram without an intermediate step that degrades accuracy. This improves the accuracy with which the model translates the Mel spectrogram and, ultimately, the quality of the Chinese speech corresponding to the Russian speech to be translated.
In one embodiment, the pre-trained Russian-Chinese speech translation model may be composed of a long short-term memory network (LSTM), a local attention mechanism, and a bidirectional long short-term memory network (Bi-LSTM). The translation flow of the model may be as shown in Fig. 2. During translation, multiple frames of the Mel spectrogram to be translated are input into the model and encoded by several LSTM layers; the encoded frames are represented by an intermediate vector to be translated. The intermediate vector is then focused by the local attention mechanism: a window is set, and the encoding attention is concentrated within the window interval, yielding the attention vector, i.e., the intermediate vector after focusing. In practice, the same Russian word can take multiple grammatical forms, each expressing a different meaning, so adopting a local attention mechanism improves translation accuracy and yields better results. The output attention vector is passed to the Bi-LSTM for decoding, which exploits the relationship between the current attention vector and its context to improve semantic accuracy, so that the target Mel spectrogram, once converted into Chinese speech, accurately expresses the content of the Russian speech to be translated.
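The windowed focusing step can be illustrated with a minimal dot-product attention restricted to a window around a chosen position. This is a generic sketch under our own naming and assumptions, not the patent's exact mechanism.

```python
import numpy as np

def local_attention(decoder_state, encoder_states, center, window=4):
    # attend only to encoder positions inside [center - window, center + window]
    lo = max(0, center - window)
    hi = min(len(encoder_states), center + window + 1)
    scores = encoder_states[lo:hi] @ decoder_state   # dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax within the window
    context = weights @ encoder_states[lo:hi]        # the attention vector
    return context, weights
```

Restricting the softmax to a window keeps the attention concentrated on nearby frames, which is the "focusing" behavior described above; global attention would instead score every encoder position.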
In an example, different loss functions used when training the Bi-LSTM, i.e., different ways of adjusting the model's self-learned parameters given the correct output, can influence the result to different degrees.
The following embodiments specifically illustrate the training process of the Russian-Chinese speech translation model.
Fig. 3 is a flowchart illustrating a method of training the Russian-Chinese speech translation model according to an exemplary embodiment. As shown in Fig. 3, the training method includes the following steps S21 to S24.
In step S21, a plurality of training Mel spectrograms corresponding to a training speech set are acquired.
In the embodiment of the disclosure, before the end-to-end model is trained to obtain the Russian-Chinese speech translation model, a plurality of training Mel spectrograms corresponding to the training speech set are acquired and used as the training input. The training speech set includes a plurality of Russian training utterances, and each training Mel spectrogram corresponds to one Russian training utterance. In one example, each training Mel spectrogram may be obtained by converting the spectrum of each Russian training utterance through a Mel-scale filter bank.
In step S22, a plurality of Chinese Mel spectrograms corresponding to a Chinese speech set are acquired, each Chinese Mel spectrogram corresponding to the Chinese training utterance of a Russian training utterance.
In the embodiment of the disclosure, to facilitate verifying the accuracy of the training result, the Chinese training speech corresponding to each Russian training utterance in the training speech set is predetermined to obtain the Chinese speech set, and the Chinese Mel spectrogram corresponding to each Chinese training utterance in the Chinese speech set is then acquired.
In step S23, the training Mel spectrogram is input into the end-to-end model, and a translated Mel spectrogram corresponding to the training Mel spectrogram is obtained based on the local attention mechanism.
In the embodiment of the disclosure, while the end-to-end model is being trained into the Russian-Chinese speech translation model, the training Mel spectrogram is fed into the model; through the focusing of the local attention mechanism and the model's self-training, the translated Mel spectrogram corresponding to the training Mel spectrogram is output.
The Russian-Chinese speech translation model is constructed on the neural network framework of an end-to-end model, and a local attention mechanism is added to that framework to improve translation accuracy. Because the end-to-end model has strong adaptive capability, each parameter in the Russian-Chinese speech translation model can be learned automatically from the input training Mel spectrograms during training without manual intervention, which reduces manual preprocessing and post-processing. Moreover, training the end-to-end model on its neural network framework gives the model more room for automatic adjustment during training, thereby enhancing the overall fit of the model.
Furthermore, in Russian, the same word may appear in several different forms, and different forms can carry different meanings. Therefore, according to the added local attention mechanism, the window in the local attention mechanism strengthens, in a focused manner, the contextual connection between words, thereby effectively handling lexical ambiguity in Russian and improving translation accuracy.
In step S24, the end-to-end model is trained based on the translated Mel spectrogram and the Chinese Mel spectrogram, resulting in the Russian-Chinese speech translation model.
In the embodiment of the disclosure, the training progress of the Russian-Chinese speech translation model is determined based on the translated Mel spectrogram output by the model and the pre-acquired Chinese training speech corresponding to the Russian training speech. When the training result meets the specified requirement, training of the Russian-Chinese speech translation model is completed, and the trained model is obtained.
In an embodiment, feature-sequence comparison is performed between the translated Mel spectrogram and the Chinese Mel spectrogram, and the cosine distance between them is determined. If the cosine distances between a first number of feature sequences in the translated Mel spectrogram and the Chinese Mel spectrogram all exceed a cosine distance threshold, the translated Mel spectrogram is highly similar to the Chinese Mel spectrogram, and the translation result of the Russian-Chinese speech translation model is determined to be qualified. During training, if for each of the plurality of translated Mel spectrograms the cosine distances between the first number of feature sequences and the corresponding Chinese Mel spectrogram all exceed the cosine distance threshold, training is deemed complete, and the trained Russian-Chinese speech translation model is obtained.
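This comparison step can be sketched as follows (a minimal illustration; the window of frames compared, the threshold of 0.9, and the assumption that the two spectrograms are time-aligned and of equal shape are all choices made here for illustration, not values from the disclosure):

```python
import numpy as np

def frame_cosine_scores(translated, reference):
    # Compare time-aligned frames (columns) of two Mel spectrograms
    # of equal shape (n_mels, n_frames); returns one score per frame.
    num = np.sum(translated * reference, axis=0)
    den = np.linalg.norm(translated, axis=0) * np.linalg.norm(reference, axis=0)
    return num / np.maximum(den, 1e-12)

def is_qualified(translated, reference, first_n=10, threshold=0.9):
    # Qualified if the first `first_n` frame scores all exceed the threshold.
    return bool(np.all(frame_cosine_scores(translated, reference)[:first_n] > threshold))
```

Identical spectrograms score 1.0 per frame and pass; spectrograms with disjoint spectral content score 0 and fail.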
In another embodiment, the Russian-Chinese speech translation model is obtained according to a comparison result between the Chinese speech converted from the translated Mel spectrogram, which corresponds to the Russian training speech, and the Chinese training speech corresponding to the target Mel spectrogram.
In order to determine the training situation of the Russian-Chinese speech translation model, for the training Mel spectrogram corresponding to the current Russian training speech, the translated Mel spectrogram corresponding to that training Mel spectrogram is converted by a vocoder into playable output speech, thereby obtaining the Chinese speech corresponding to the Russian training speech. The Chinese training speech corresponding to the current training Mel spectrogram is also acquired. The Chinese speech converted by the vocoder from the translated Mel spectrogram is compared with the Chinese training speech, and a comparison result between the two is determined. In one example, if the similarity between the two is determined from the comparison result to be greater than a specified similarity threshold, the accuracy of the translation result of the Russian-Chinese speech translation model meets the qualification requirement. If the similarity between the two is less than or equal to the specified similarity threshold, the accuracy of the translation result does not meet the qualification requirement, and training still needs to continue. In one example, the similarity between the Chinese speech and the Chinese training speech may be determined manually, so that whether the two are similar can be determined quickly.
In yet another embodiment, the training situation of the Russian-Chinese speech translation model may also be determined manually. The Chinese meaning of the Russian speech behind each training Mel spectrogram is determined in advance, the Chinese speech translated from the Russian training speech is played, and the discrepancy between that Chinese speech and the predetermined Chinese meaning is assessed manually to determine whether mistranslation, omission, or under-translation occurred during translation. If the number of errors is smaller than a first error threshold, the Russian-Chinese speech translation model is deemed trained, and the trained model is obtained. If the number of errors is greater than or equal to the first error threshold, training of the model is incomplete, and the end-to-end model still needs to be trained further.
In yet another embodiment, in order to improve the translation quality of the Russian-Chinese speech translation model, the process of training the end-to-end model further includes: determining the fluency of the Chinese speech, that is, the fluency of the Chinese speech obtained by converting the translated Mel spectrogram through the vocoder. Based on the fluency of the Chinese speech during playback, it can be determined whether omissions occurred during translation by the Russian-Chinese speech translation model. If no stuttering or similar discontinuity occurs while the Chinese speech is played, its fluency meets the fluency requirement, and no omission occurred while translating the training Mel spectrogram. If stuttering or similar discontinuity does occur during playback, the fluency requirement is not met, and an omission occurred while translating the training Mel spectrogram.
In yet another embodiment, in the process of training the end-to-end model to obtain the Russian-Chinese speech translation model, the output Chinese Mel spectrogram is converted into Chinese speech, and the Chinese text corresponding to that Chinese speech is then determined. In one example, the Chinese speech may be recognized by a speech recognition engine to obtain the corresponding Chinese text. The training situation of the Russian-Chinese speech translation model is determined according to the error rate between the Chinese text of the Chinese speech and the Chinese training text of the Chinese training speech, which helps to determine intuitively and clearly whether mistranslation, omission, or similar phenomena occurred during training. With the Chinese training text as the benchmark, the Chinese text is compared against it, and the error rate between the two can be determined from errors such as a mismatched word count or mistranslated content. If the error rate between the Chinese text and the Chinese training text is smaller than an error threshold, the Russian-Chinese speech translation model is deemed trained, training can be stopped, and the trained model is obtained. If the error rate is greater than or equal to the error threshold, training of the model is incomplete, and the end-to-end model still needs to be trained further.
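The disclosure does not fix a particular error-rate formula; one common choice, sketched below as an assumption, is a character error rate computed from edit distance, which naturally counts omissions (deletions), insertions, and mistranslations (substitutions) against the benchmark training text:

```python
def edit_distance(ref, hyp):
    # Levenshtein distance via dynamic programming.
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion (omission)
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution (mistranslation)
    return d[m][n]

def char_error_rate(ref, hyp):
    # Errors relative to the length of the benchmark training text.
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

The resulting rate can be compared against the error threshold to decide whether to stop training.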
In yet another embodiment, the training speech set and the Chinese speech set used for training may be obtained from a pre-constructed Russian-Chinese bilingual corpus. A plurality of Russian-Chinese parallel speech pairs are collected into the corpus, and the text of each Russian speech sample is stored together with the text of its corresponding Chinese speech. In one example, the content of the parallel speech may be selected from Russian news text and may include any one or more of the following: exchanges between China and Russia, foreign exchanges of the two countries, updates to each country's weaponry and equipment, and personnel changes. In another example, the vocabulary involved may include commonly used words, proper nouns, and specific phrases. Proper nouns may include person names, place names, and organization names. Specific phrases may include military terms such as weaponry, military ranks, and activities.
In one example, in actual pronunciation, Russian is spoken faster than Chinese, but Russian words are longer and the same sentence involves more Russian words. To facilitate training of the Russian-Chinese speech translation model, the length of a Russian training speech sample is therefore limited to a first specified duration for storage, for example 10 seconds. Chinese pronunciation is shorter and the same sentence involves fewer Chinese words; to preserve the correspondence with the Russian training speech and ensure sentence integrity, the length of a Chinese training speech sample is limited to a second specified duration for storage, for example 8 seconds.
In another example, to facilitate training of the Russian-Chinese speech translation model, the Russian training speech and the corresponding Chinese training speech are collected at the same sampling frequency when they are stored. Using the same sampling frequency ensures that the training Mel spectrogram converted from the Russian training speech and the Chinese Mel spectrogram converted from the Chinese training speech follow the same time-frequency variation pattern. During training, the time-frequency variation pattern of the translated Mel spectrogram output by the model can thus stay consistent with that of the Chinese Mel spectrogram, so that the training situation can be determined quickly and the model can converge quickly. In one example, the Russian training speech and the corresponding Chinese training speech are also collected with the same number of quantization bits, for example 16 bits, to facilitate computer processing.
In one implementation scenario, when training the Russian-Chinese speech translation model, training may also proceed from any of the concepts of the embodiments described above to enable conversion of Chinese into Russian. The Chinese Mel spectrogram converted from the Chinese training speech is taken as the input of the model, the training Mel spectrogram converted from the Russian training speech is taken as the reference Mel spectrogram for the training result, the training situation of the model is determined based on the Russian Mel spectrogram output by the model and the reference Mel spectrogram, and training of translating the Chinese Mel spectrogram into the corresponding Russian Mel spectrogram is thereby completed.
In another implementation scenario, the present disclosure also provides a Russian-Chinese speech translation prototype system. The framework of the prototype system may be as shown in the framework diagram of FIG. 4. The prototype system involves the speech-signal feature representation of the Mel spectrogram, the end-to-end speech translation model architecture, attention-mechanism optimization, waveform reconstruction by the vocoder, and related content; it is designed and developed on TensorFlow and comprehensively applies the theory of the Russian-Chinese speech translation model. TensorFlow is an artificial intelligence learning system capable of passing complex data structures to an artificial neural network for analysis and processing, realizing a computation in which tensors flow from one end of a dataflow graph to the other. The prototype system may include a model training process and a model testing process.
To realize the system function design, the framework of the prototype system mainly comprises three modules: a Mel spectrogram generation module that converts the input, an end-to-end model module that trains and builds the model, and a vocoder module that reconstructs the waveform. The Mel spectrogram generation module comprises processing steps such as speech-signal framing, windowing, Fourier transform, and Mel filtering, and converts the input speech into Mel spectrogram features for output. The vocoder module is the inverse of the Mel spectrogram generation module: it takes Mel spectrogram features as input and obtains the speech waveform output through inverse Mel filtering, the Griffin-Lim algorithm, and de-emphasis. The model training module includes an LSTM encoding module, a local attention mechanism, and a Bi-LSTM decoding module for training the Russian-Chinese speech translation model and storing it in the system.
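The Griffin-Lim step of the vocoder module can be sketched as follows (a minimal illustration using scipy's STFT; the FFT size, hop, and iteration count are illustrative assumptions, and the inverse Mel filtering and de-emphasis stages described above are omitted):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(magnitude, n_fft=512, n_iter=32, fs=16000):
    # Iteratively estimate a phase consistent with the given magnitude
    # spectrogram (shape: n_fft//2+1 x n_frames), then invert to a waveform.
    noverlap = n_fft - n_fft // 4
    angles = np.exp(2j * np.pi * np.random.rand(*magnitude.shape))
    for _ in range(n_iter):
        _, wave = istft(magnitude * angles, fs=fs, nperseg=n_fft, noverlap=noverlap)
        _, _, spec = stft(wave, fs=fs, nperseg=n_fft, noverlap=noverlap)
        # Guard against off-by-one frame counts from the STFT round trip
        if spec.shape[1] < magnitude.shape[1]:
            spec = np.pad(spec, ((0, 0), (0, magnitude.shape[1] - spec.shape[1])))
        else:
            spec = spec[:, :magnitude.shape[1]]
        angles = np.exp(1j * np.angle(spec))
    _, wave = istft(magnitude * angles, fs=fs, nperseg=n_fft, noverlap=noverlap)
    return wave
```

Given a magnitude spectrogram recovered from Mel features, this returns a playable time-domain waveform.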
When training is performed in the model training module, the Russian training speech used for training and its corresponding Chinese training speech are first turned into Mel spectrograms by the Mel spectrogram generation module to serve as the input of the Russian-Chinese speech translation model, and the model is then built and stored by the model training module. When testing is performed in the model testing module, the Russian test speech is turned into a Mel spectrogram by the Mel spectrogram generation module, the translated Mel spectrogram is then obtained through the prediction output of the trained Russian-Chinese speech translation model, and after the vocoder reconstructs the waveform of the translated Mel spectrogram, the corresponding Chinese speech is output.
In another implementation scenario, the process of translating using the Russian-Chinese speech translation model may be as shown in FIG. 5, which is a schematic diagram of a translation flow according to an exemplary embodiment. The LSTM encoding module of the model adopts an LSTM network, and the Bi-LSTM decoding module adopts a Bi-LSTM network. The Mel spectrogram to be translated of the source-language speech is taken as the input of the model, and each frame parameter of the Mel spectrogram to be translated is fed into the LSTM network for encoding to obtain an intermediate vector to be translated. The intermediate vector to be translated is focused based on the local attention mechanism, and the attention vector of the intermediate vector to be translated is determined. The intermediate vector to be translated, with its attention vector confirmed, is decoded by the bidirectional long short-term memory network in the Bi-LSTM decoding module, each frame parameter of the target Mel spectrogram is output, and the target Mel spectrogram of the corresponding Chinese speech is obtained after the Mel spectrogram to be translated is translated.
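The window-based focusing of the local attention mechanism can be sketched as follows (a minimal dot-product illustration over a window of encoder frames; the window size and scoring function are illustrative assumptions and do not reproduce the disclosure's exact architecture):

```python
import numpy as np

def local_attention(enc_states, query, center, window=5):
    # enc_states: (T, d) encoder outputs; query: (d,) current decoder state.
    # Attend only to frames near `center` (local attention), rather than
    # to all T frames as in global attention.
    lo, hi = max(0, center - window), min(len(enc_states), center + window + 1)
    local = enc_states[lo:hi]                 # (w, d) windowed states
    scores = local @ query                    # dot-product alignment scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over the window only
    return weights @ local                    # context (attention) vector, (d,)
```

At each decoding step, the context vector returned here is what the Bi-LSTM decoder consumes alongside the intermediate vector.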
The overall framework of the Russian-Chinese speech translation system constructed in this manner conforms to system design and development specifications, and the design of each module can meet the requirements of high cohesion and low coupling, so that, on the basis of complete functionality within each module, the data structures passed between modules remain simple. Because the Russian-Chinese speech translation model is a neural network framework constructed from an end-to-end model, the adaptivity of the system is enhanced. Furthermore, the input parameters of the model, namely the input Mel spectrogram, are obtained through self-learning on training speech and need not be adjusted to suit different input corpora. In one example, because the system involves few user operations, an interactive page design can be adopted in practical applications to simplify the operation process, so that users can operate it conveniently without necessarily possessing background knowledge.
Based on the same conception, the embodiment of the disclosure also provides a Russian-Chinese speech translation device.
It will be appreciated that, in order to implement the above-described functions, the Russian Chinese language translation device provided in the embodiments of the present disclosure includes corresponding hardware structures and/or software modules that perform the respective functions. The disclosed embodiments may be implemented in hardware or a combination of hardware and computer software, in combination with the various example elements and algorithm steps disclosed in the embodiments of the disclosure. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application, but such implementation is not to be considered as beyond the scope of the embodiments of the present disclosure.
Fig. 6 is a block diagram illustrating a Russian-Chinese speech translation device according to an exemplary embodiment. Referring to fig. 6, the Russian-Chinese speech translation apparatus 100 includes an acquisition unit 101, a translation unit 102, and a conversion unit 103.
The obtaining unit 101 is configured to obtain the Russian speech to be translated and convert it into a Mel spectrogram to be translated.
The translation unit 102 is configured to translate the Mel spectrogram to be translated into the target Mel spectrogram through a pre-trained Russian-Chinese speech translation model.
The conversion unit 103 is configured to obtain, according to the target Mel spectrogram, the Chinese speech corresponding to the Russian speech to be translated.
In one embodiment, the Russian-Chinese speech translation model includes: a long short-term memory (LSTM) network, a local attention mechanism, and a bidirectional long short-term memory (Bi-LSTM) network. The translation unit 102 translates the Mel spectrogram to be translated into the target Mel spectrogram through the pre-trained Russian-Chinese speech translation model in the following manner. The multi-frame Mel spectrogram to be translated is encoded through the LSTM network to obtain an intermediate vector to be translated. The intermediate vector to be translated is focused based on the local attention mechanism, and its attention vector is determined. The intermediate vector to be translated, with its attention vector confirmed, is decoded through the Bi-LSTM network to obtain the target Mel spectrogram of the corresponding Chinese speech after the Mel spectrogram to be translated is translated.
In another embodiment, the Russian-Chinese speech translation model is trained as follows. A plurality of training Mel spectrograms corresponding to a training speech set are obtained, where the training speech set comprises a plurality of Russian training speech samples and the training Mel spectrograms correspond to the Russian training speech samples. A plurality of Chinese Mel spectrograms corresponding to a Chinese speech set are acquired, where the Chinese Mel spectrograms correspond to the Chinese training speech paired with the Russian training speech. The training Mel spectrogram is input into an end-to-end model, and a translated Mel spectrogram corresponding to the training Mel spectrogram is obtained based on a local attention mechanism. The end-to-end model is trained based on the translated Mel spectrogram and the Chinese Mel spectrogram to obtain the Russian-Chinese speech translation model.
In yet another embodiment, the end-to-end model is trained based on the translated Mel spectrogram and the Chinese Mel spectrogram in the following manner to obtain the Russian-Chinese speech translation model: the Chinese speech corresponding to the Russian training speech is obtained according to the translated Mel spectrogram; the Chinese training speech corresponding to the target Mel spectrogram is obtained; and the end-to-end model is trained based on the comparison result between the Chinese speech and the Chinese training speech to obtain the Russian-Chinese speech translation model.
In yet another embodiment, before the Russian-Chinese speech translation model is obtained, its training mode further includes: determining the fluency of the Chinese speech.
In yet another embodiment, the end-to-end model is trained based on the comparison result between the Chinese speech and the Chinese training speech in the following manner to obtain the Russian-Chinese speech translation model: the Chinese text of the Chinese speech and the Chinese training text of the Chinese training speech are acquired; the error rate between the Chinese text and the Chinese training text is determined, and if the error rate is smaller than an error threshold, training is stopped to obtain the Russian-Chinese speech translation model; if the error rate is greater than or equal to the error threshold, training of the end-to-end model continues.
In yet another embodiment, the chinese training speech is sampled at the same frequency as the corresponding russian training speech.
In yet another embodiment, the conversion unit 103 obtains the Chinese speech corresponding to the Russian speech to be translated according to the target Mel spectrogram in the following manner: the target Mel spectrogram is reconstructed through the vocoder to obtain the Chinese speech corresponding to the Russian speech to be translated.
The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in the embodiments of the method and will not be described again here.
Further, in exemplary embodiments, the Russian-Chinese speech translation apparatus may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for performing the methods described above. For example, the Russian-Chinese speech translation device includes: a memory for storing instructions; and a processor for calling the instructions stored in the memory to execute the Russian-Chinese speech translation method provided by any one of the embodiments.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as a memory, comprising instructions executable by a processor of the Russian Chinese language translation device to perform the above method. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
It is further understood that the term "plurality" in this disclosure means two or more, and other adjectives are similar thereto. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship. The singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It is further understood that the terms "first," "second," and the like are used to describe various information, but such information should not be limited to these terms. These terms are only used to distinguish one type of information from another and do not denote a particular order or importance. Indeed, the expressions "first", "second", etc. may be used entirely interchangeably. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure.
It will be further understood that "connected" includes both direct connection, where no other element is present, and indirect connection, where another element is present, unless specifically stated otherwise.
It will be further understood that although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (16)

1. A method for Russian Chinese language translation, the method comprising:
obtaining russian voice to be translated, and converting the russian voice to be translated into a mel spectrogram to be translated;
translating the mel spectrogram to be translated into a target mel spectrogram through a pre-trained Russian Chinese language translation model;
according to the target Mel spectrogram, obtaining the Chinese voice corresponding to the Russian voice to be translated;
wherein the Russian Chinese language translation model comprises: a long-short-time memory network, a local attention mechanism and a bidirectional long-short-time memory network;
the translating the mel spectrogram to be translated into a target mel spectrogram through a pre-trained Russian Chinese language translation model comprises the following steps:
coding the mel spectrogram to be translated through a long-short-time memory network to obtain an intermediate vector to be translated;
focusing the intermediate vector to be translated based on a local attention mechanism, and determining an attention vector of the intermediate vector to be translated;
and decoding the intermediate vector to be translated after confirming the attention vector through a bidirectional long-short-time memory network to obtain a target Mel spectrogram of the corresponding Chinese language after translating the Mel spectrogram to be translated.
2. The Russian Chinese phonetic translation method of claim 1, wherein the Russian Chinese phonetic translation model is trained by:
acquiring a plurality of training mel spectrograms corresponding to a training voice set, wherein the training voice set comprises a plurality of russian training voices, and the training mel spectrograms correspond to the russian training voices;
acquiring a plurality of Chinese Mel spectrograms corresponding to a Chinese phonetic set, wherein the Chinese Mel spectrograms correspond to the Chinese training phonetic of the Russian training phonetic;
inputting the training mel spectrogram into an end-to-end model, and obtaining a translation mel spectrogram corresponding to the training mel spectrogram based on a local attention mechanism;
and training the end-to-end model based on the translated Mel spectrogram and the Chinese Mel spectrogram to obtain the Russian Chinese language translation model.
3. The Russian Chinese language translation method of claim 2, wherein the training the end-to-end model based on the translated Mel spectrum and the Chinese Mel spectrum to obtain the Russian Chinese language translation model comprises:
according to the translated Mel spectrogram, obtaining the Chinese voice corresponding to the Russian training voice;
acquiring Chinese training speech corresponding to the target Mel spectrogram;
and training the end-to-end model based on a comparison result between the Chinese speech and the Chinese training speech to obtain the Russian Chinese speech translation model.
4. The Russian-Chinese speech translation method according to claim 3, wherein, before the Russian-Chinese speech translation model is obtained, its training further comprises:
determining the fluency of the Chinese speech.
5. The Russian-Chinese speech translation method according to claim 3, wherein training the end-to-end model based on the comparison result between the Chinese speech and the Chinese training speech to obtain the Russian-Chinese speech translation model comprises:
acquiring a Chinese text of the Chinese speech and a Chinese training text of the Chinese training speech;
determining an error rate between the Chinese text and the Chinese training text, and, if the error rate is smaller than an error threshold, stopping training to obtain the Russian-Chinese speech translation model;
and if the error rate between the Chinese text and the Chinese training text is greater than or equal to the error threshold, continuing to train the end-to-end model.
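The error rate in claim 5 compares the recognized Chinese text against the Chinese training text. One common formulation is the character error rate: Levenshtein edit distance over characters divided by the reference length. The implementation and the stopping threshold of 0.1 below are illustrative assumptions, not values given in the patent:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings via dynamic programming."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                     # all deletions
    for j in range(n + 1):
        dp[0][j] = j                     # all insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[m][n]

def char_error_rate(ref, hyp):
    return edit_distance(ref, hyp) / max(len(ref), 1)

def should_stop(ref, hyp, threshold=0.1):
    """Stop training once the character error rate drops below threshold."""
    return char_error_rate(ref, hyp) < threshold
```

Character-level comparison suits Chinese, where a character is the natural unit and word segmentation would add its own errors.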
6. The Russian-Chinese speech translation method according to claim 3, wherein the Chinese training speech has the same sampling frequency as the corresponding Russian training speech.
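Claim 6 requires the paired Chinese and Russian training speech to share one sampling frequency; when source recordings differ, one side can be resampled during preprocessing. A minimal linear-interpolation sketch follows (a production pipeline would use a band-limited or polyphase resampler to avoid aliasing; the rates shown are illustrative):

```python
import numpy as np

def resample(x, sr_in, sr_out):
    """Resample a 1-D signal from sr_in to sr_out by linear interpolation,
    so both sides of a training pair share one sampling frequency."""
    n_out = int(round(len(x) * sr_out / sr_in))
    t_out = np.arange(n_out) * (sr_in / sr_out)   # output times in input samples
    return np.interp(t_out, np.arange(len(x)), x)
```

Matching sampling frequencies keeps the Mel spectrogram frame rate and frequency axis consistent between source and target, so the model never has to compensate for a scale mismatch.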
7. The Russian-Chinese speech translation method of claim 1, wherein obtaining, according to the target Mel spectrogram, the Chinese speech corresponding to the Russian speech to be translated comprises:
reconstructing the target Mel spectrogram through a vocoder to obtain the Chinese speech corresponding to the Russian speech to be translated.
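The vocoder in claim 7 turns a spectrogram back into a waveform. One classical choice is Griffin-Lim phase recovery; the sketch below operates on a linear magnitude spectrogram (a Mel spectrogram would first be mapped back to the linear scale, e.g. via a pseudo-inverse of the Mel filterbank), and the FFT size, hop length, and iteration count are illustrative assumptions, not the patent's vocoder:

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(S, n_fft=512, hop=128):
    win = np.hanning(n_fft)
    frames = np.fft.irfft(S, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    norm = np.zeros_like(out)
    for k, f in enumerate(frames):        # overlap-add with window norm
        out[k * hop:k * hop + n_fft] += f * win
        norm[k * hop:k * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=30, n_fft=512, hop=128):
    """Recover a waveform from a magnitude spectrogram by iteratively
    re-estimating the phase (Griffin-Lim algorithm)."""
    angles = np.exp(2j * np.pi * np.random.rand(*mag.shape))
    for _ in range(n_iter):
        y = istft(mag * angles, n_fft, hop)
        angles = np.exp(1j * np.angle(stft(y, n_fft, hop)))
    return istft(mag * angles, n_fft, hop)
```

Neural vocoders generally produce more natural speech than Griffin-Lim, but the iterative method needs no training data and makes the spectrogram-to-waveform step explicit.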
8. A Russian-Chinese speech translation device, comprising:
an acquisition unit, a translation unit and a conversion unit, wherein the acquisition unit is configured to acquire the Russian speech to be translated and convert the Russian speech to be translated into a Mel spectrogram to be translated;
the translation unit is configured to translate the Mel spectrogram to be translated into a target Mel spectrogram through a pre-trained Russian-Chinese speech translation model;
the conversion unit is configured to obtain, according to the target Mel spectrogram, the Chinese speech corresponding to the Russian speech to be translated;
wherein the Russian-Chinese speech translation model comprises: a long short-term memory network, a local attention mechanism and a bidirectional long short-term memory network;
and the translation unit translates the Mel spectrogram to be translated into the target Mel spectrogram through the pre-trained Russian-Chinese speech translation model by:
encoding the Mel spectrogram to be translated through the long short-term memory network to obtain an intermediate vector to be translated;
focusing on the intermediate vector to be translated through the local attention mechanism, and determining an attention vector of the intermediate vector to be translated;
and decoding the intermediate vector to be translated, as guided by the attention vector, through the bidirectional long short-term memory network, to obtain the target Mel spectrogram of the corresponding Chinese speech translated from the Mel spectrogram to be translated.
9. The Russian-Chinese speech translation device of claim 8, wherein the Russian-Chinese speech translation model is trained by:
acquiring a plurality of training Mel spectrograms corresponding to a training speech set, wherein the training speech set comprises a plurality of Russian training speech samples, and the training Mel spectrograms correspond to the Russian training speech;
acquiring a plurality of Chinese Mel spectrograms corresponding to a Chinese speech set, wherein the Chinese Mel spectrograms correspond to the Chinese training speech paired with the Russian training speech;
inputting the training Mel spectrograms into an end-to-end model, and obtaining, based on a local attention mechanism, a translated Mel spectrogram corresponding to each training Mel spectrogram;
and training the end-to-end model based on the translated Mel spectrograms and the Chinese Mel spectrograms to obtain the Russian-Chinese speech translation model.
10. The Russian-Chinese speech translation device of claim 9, wherein the end-to-end model is trained based on the translated Mel spectrogram and the Chinese Mel spectrogram to obtain the Russian-Chinese speech translation model by:
obtaining, according to the translated Mel spectrogram, the Chinese speech corresponding to the Russian training speech;
acquiring the Chinese training speech corresponding to the target Mel spectrogram;
and training the end-to-end model based on a comparison result between the Chinese speech and the Chinese training speech to obtain the Russian-Chinese speech translation model.
11. The Russian-Chinese speech translation device of claim 10, wherein, before the Russian-Chinese speech translation model is obtained, its training further comprises:
determining the fluency of the Chinese speech.
12. The Russian-Chinese speech translation device of claim 10, wherein the end-to-end model is trained based on the comparison result between the Chinese speech and the Chinese training speech by:
acquiring a Chinese text of the Chinese speech and a Chinese training text of the Chinese training speech;
determining an error rate between the Chinese text and the Chinese training text, and, if the error rate is smaller than an error threshold, stopping training to obtain the Russian-Chinese speech translation model;
and if the error rate between the Chinese text and the Chinese training text is greater than or equal to the error threshold, continuing to train the end-to-end model.
13. The Russian-Chinese speech translation device of claim 10, wherein the Chinese training speech has the same sampling frequency as the corresponding Russian training speech.
14. The Russian-Chinese speech translation device according to claim 8, wherein the conversion unit obtains, according to the target Mel spectrogram, the Chinese speech corresponding to the Russian speech to be translated by:
reconstructing the target Mel spectrogram through a vocoder to obtain the Chinese speech corresponding to the Russian speech to be translated.
15. A Russian-Chinese speech translation device, comprising:
a memory for storing instructions; and
a processor for invoking the instructions stored in the memory to perform the Russian-Chinese speech translation method as recited in any one of claims 1-7.
16. A computer-readable storage medium having stored therein instructions which, when executed by a processor, perform the Russian-Chinese speech translation method as recited in any one of claims 1-7.
CN202110018492.7A 2020-12-30 2021-01-07 Russian Chinese language translation method, Russian Chinese language translation device and storage medium Active CN112767918B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011611327 2020-12-30
CN2020116113274 2020-12-30

Publications (2)

Publication Number Publication Date
CN112767918A (en) 2021-05-07
CN112767918B (en) 2023-12-01

Family

ID=75700638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110018492.7A Active CN112767918B (en) 2020-12-30 2021-01-07 Russian Chinese language translation method, Russian Chinese language translation device and storage medium

Country Status (1)

Country Link
CN (1) CN112767918B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113612602B (en) * 2021-07-13 2023-11-28 中国人民解放军战略支援部队信息工程大学 Quantum key security assessment method, quantum key security assessment device and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804427A (en) * 2018-06-12 2018-11-13 深圳市译家智能科技有限公司 Speech robot interpretation method and device
CN108986793A (en) * 2018-09-28 2018-12-11 北京百度网讯科技有限公司 translation processing method, device and equipment
CN109033094A (en) * 2018-07-18 2018-12-18 五邑大学 A kind of writing in classical Chinese writings in the vernacular inter-translation method and system based on sequence to series neural network model
CN110136690A (en) * 2019-05-22 2019-08-16 平安科技(深圳)有限公司 Phoneme synthesizing method, device and computer readable storage medium
CN111161702A (en) * 2019-12-23 2020-05-15 爱驰汽车有限公司 Personalized speech synthesis method and device, electronic equipment and storage medium
WO2020205233A1 (en) * 2019-03-29 2020-10-08 Google Llc Direct speech-to-speech translation via machine learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804427A (en) * 2018-06-12 2018-11-13 深圳市译家智能科技有限公司 Speech robot interpretation method and device
CN109033094A (en) * 2018-07-18 2018-12-18 五邑大学 A kind of writing in classical Chinese writings in the vernacular inter-translation method and system based on sequence to series neural network model
CN108986793A (en) * 2018-09-28 2018-12-11 北京百度网讯科技有限公司 translation processing method, device and equipment
WO2020205233A1 (en) * 2019-03-29 2020-10-08 Google Llc Direct speech-to-speech translation via machine learning
CN110136690A (en) * 2019-05-22 2019-08-16 平安科技(深圳)有限公司 Phoneme synthesizing method, device and computer readable storage medium
CN111161702A (en) * 2019-12-23 2020-05-15 爱驰汽车有限公司 Personalized speech synthesis method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Local Monotonic Attention Mechanism for End-to-End Speech and Language Processing"; Tjandra A; arXiv:1705.08091v2 [cs.CL]; entire document *
"Research on Mongolian-Chinese Machine Translation Based on LSTM"; Liu Wanwan; Su Yila; Wu Ni'er; Renqing Daoerji; Computer Engineering and Science, (10); entire document *

Also Published As

Publication number Publication date
CN112767918A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
CN113439301B (en) Method and system for machine learning
EP4029010B1 (en) Neural text-to-speech synthesis with multi-level context features
CN111833843B (en) Speech synthesis method and system
Yanagita et al. Neural iTTS: Toward synthesizing speech in real-time with end-to-end neural text-to-speech framework
WO2022043712A1 (en) A text-to-speech synthesis method and system, and a method of training a text-to-speech synthesis system
Liu et al. Simple and effective unsupervised speech synthesis
CN113205792A (en) Mongolian speech synthesis method based on Transformer and WaveNet
Qu et al. Lipsound2: Self-supervised pre-training for lip-to-speech reconstruction and lip reading
Dossou et al. OkwuGbé: End-to-End Speech Recognition for Fon and Igbo
CN112767918B (en) Russian Chinese language translation method, russian Chinese language translation device and storage medium
Peymanfard et al. Lip reading using external viseme decoding
Mihajlik et al. BEA-base: A benchmark for ASR of spontaneous Hungarian
Liu et al. A novel method for Mandarin speech synthesis by inserting prosodic structure prediction into Tacotron2
Rolland et al. Multilingual transfer learning for children automatic speech recognition
Li et al. End-to-end mongolian text-to-speech system
Wang [Retracted] Research on Open Oral English Scoring System Based on Neural Network
Zahariev et al. An approach to speech ambiguities eliminating using semantically-acoustical analysis
JP7357518B2 (en) Speech synthesis device and program
Fujimoto et al. Semi-supervised learning based on hierarchical generative models for end-to-end speech synthesis
Shareef et al. Collaborative Training of Acoustic Encoder for Recognizing the Impaired Children Speech
Gu et al. A Study on the improvement of chinese automatic speech recognition accuracy using a lexicon
Zhao Simulation of Chinese Speech Synthesis System Based on Wavelet Optimized Neural Network
Liu Design of Automatic Speech Evaluation System of Professional English for the Navy based on Intelligent Recognition Technology
Korostik et al. The stc text-to-speech system for blizzard challenge 2019
Gupta et al. A study on the techniques for speech to speech translation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant