WO2019165748A1 - Speech translation method and apparatus

Speech translation method and apparatus

Info

Publication number
WO2019165748A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
text
text unit
speech
voice
Prior art date
Application number
PCT/CN2018/095766
Other languages
English (en)
French (fr)
Inventor
王雨蒙
徐伟
江源
胡国平
胡郁
Original Assignee
科大讯飞股份有限公司
Priority date: 2018-02-28 (from Chinese patent application No. 201810167142.5)
Application filed by 科大讯飞股份有限公司
Publication of WO2019165748A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 - Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Definitions

  • the present application relates to the field of computer technologies, and in particular, to a voice translation method and apparatus.
  • the resulting post-translation speech carries only the timbre characteristics of the speaker built into the speech synthesis model; to the listener, it sounds like a completely different speaker from the source speaker.
  • the main purpose of the embodiments of the present application is to provide a speech translation method and apparatus, which can enable the translated speech to have the timbre characteristics of the source speaker when translating the speech of the source speaker.
  • the embodiment of the present application provides a voice translation method, including:
  • acquiring a first target voice of a source speaker;
  • generating a second target voice by performing voice translation on the first target voice, wherein the language of the second target voice is different from the language of the first target voice, and the second target voice carries the timbre characteristics of the source speaker.
  • optionally, performing voice translation on the first target voice to generate the second target voice includes: generating speech recognition text by performing speech recognition on the first target voice; generating translated text by performing text translation on the speech recognition text; and generating the second target voice by performing speech synthesis on the translated text.
  • optionally, generating the second target voice by performing speech synthesis on the translated text includes: segmenting the translated text into text units of a preset size to obtain target text units; acquiring acoustic parameters of each target text unit, where the acoustic parameters carry the timbre characteristics of the source speaker; and performing speech synthesis on the translated text according to the acoustic parameters of each target text unit to generate the second target voice.
  • the method further includes:
  • the obtaining acoustic parameters of each target text unit includes:
  • Acoustic parameters of each target text unit are obtained using the first acoustic model.
  • the method further includes:
  • the obtaining acoustic parameters of each target text unit includes:
  • Acoustic parameters of each target text unit are obtained using the second acoustic model.
  • the method further includes:
  • converting the second sample text unit to obtain a first converted text unit including:
  • the determined second converted text unit corresponding to the third sample text unit is used as the first converted text unit.
  • the method further includes:
  • converting the second sample text unit to obtain a first converted text unit including:
  • the second sample text unit is converted to obtain a first converted text unit.
  • the embodiment of the present application further provides a voice translation apparatus, including:
  • a voice acquiring unit configured to acquire a first target voice of the source speaker
  • a voice translation unit configured to generate a second target voice by performing voice translation on the first target voice, where the language of the second target voice is different from the language of the first target voice, and the second target voice carries the timbre characteristics of the source speaker.
  • the embodiment of the present application further provides a voice translation apparatus, including: a processor, a memory, and a system bus;
  • the processor and the memory are connected by the system bus;
  • the memory is for storing one or more programs, the one or more programs comprising instructions that, when executed by the processor, cause the processor to perform the method of any of the above.
  • the embodiment of the present application further provides a computer readable storage medium comprising instructions, when executed on a computer, causing the computer to perform the method of any of the above.
  • the embodiment of the present application further provides a computer program product, when the computer program product is run on a terminal device, causing the terminal device to perform the method described in any one of the above.
  • according to the voice translation method and device, after the first target voice of the source speaker is acquired, voice translation is performed on the first target voice to generate a second target voice, where the language of the second target voice is different from the language of the first target voice, and the second target voice carries the timbre characteristics of the source speaker. It can be seen that when the speech of the source speaker, that is, the pre-translation speech, is translated, the timbre characteristics of the source speaker are taken into account, so the post-translation speech also has the timbre characteristics of the source speaker. As a result, the post-translation speech sounds more like speech spoken directly by the source speaker.
  • FIG. 1 is a schematic flowchart of a voice translation method according to an embodiment of the present application.
  • FIG. 2 is a second schematic flowchart of a voice translation method according to an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of a speech synthesis model provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a method for constructing an acoustic model according to an embodiment of the present application
  • FIG. 5 is a second schematic flowchart of a method for constructing an acoustic model according to an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a sample text unit collection method according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a relationship between phoneme sequences provided by an embodiment of the present application.
  • FIG. 8 is a schematic flowchart diagram of a method for constructing a codec model according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of an encoding process according to an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a voice translation apparatus according to an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of hardware of a voice translation apparatus according to an embodiment of the present disclosure.
  • in the current speech translation technology, after the speech of the source speaker is translated, the obtained post-translation speech carries only the timbre characteristics of the speaker built into the synthesis model; to the listener, it has the timbre of a completely different speaker from the source speaker. That is, it sounds as if one person is talking and another person then gives the translation, with two different people's voices.
  • to this end, the embodiment of the present application provides a voice translation method and apparatus. When the voice of the source speaker, that is, the pre-translation speech, is to be translated into another language, a speech synthesis model belonging to the source speaker is used to perform the speech translation, so that the post-translation speech has the timbre characteristics of the source speaker. The post-translation speech therefore sounds more like speech spoken directly by the source speaker, which improves the user experience.
  • FIG. 1 is a schematic flowchart of a voice translation method according to an embodiment, where the method includes the following steps:
  • S101 Acquire a first target voice of the source speaker.
  • for ease of distinction, this embodiment defines the voice to be translated, that is, the pre-translation speech, as the first target voice, and defines the speaker who utters the first target voice as the source speaker.
  • this embodiment does not limit the source of the first target voice; for example, the first target voice may be a person's live voice or a recorded voice, or may be a special-effect voice obtained by machine-processing the live or recorded voice.
  • this embodiment also does not limit the length of the first target voice; for example, the first target voice may be a word, a sentence, or a paragraph.
  • S102 Generate a second target voice by performing voice translation on the first target voice, where a language of the second target voice is different from a language of the first target voice, and the second target voice carries a The tone characteristics of the source speaker.
  • for ease of distinction, the voice obtained by translating the first target voice is defined as the second target voice. It should be noted that when the first target voice is the above-mentioned machine-processed special-effect voice, the same special-effect processing needs to be further applied to the second target voice obtained after translation.
  • This embodiment does not limit the language types of the first target voice and the second target voice, as long as the language types of the first target voice and the second target voice are different but the voice meanings are the same.
  • for example, the first target voice is the Chinese "你好" and the second target voice is the English "hello"; or, the first target voice is the English "hello" and the second target voice is the Chinese "你好".
  • in practical applications, a user such as the source speaker can preset the post-translation language on the translation machine; after the translation machine obtains the first target speech of the source speaker, it can perform speech translation so that the translated second target speech is in the preset translation language.
  • in this embodiment, the timbre characteristics of the source speaker may be collected in advance to construct a speech synthesis model belonging to the source speaker. Based on this, when the first target speech of the source speaker is translated, the speech synthesis model belonging to the source speaker may be used to perform the speech translation, so that the translated second target speech is given the timbre characteristics of the source speaker. This timbre-adaptive approach makes the listener feel that the second target speech has the speaking effect of the source speaker, that is, the pre-translation speech and the post-translation speech are the same or similar in timbre.
  • in summary, according to the voice translation method provided in this embodiment, after the first target voice of the source speaker is acquired, a second target voice is generated by performing voice translation on the first target voice, where the language of the second target voice is different from the language of the first target voice, and the second target voice carries the timbre characteristics of the source speaker. It can be seen that when the speech of the source speaker, that is, the pre-translation speech, is translated, the timbre characteristics of the source speaker are taken into account, so the post-translation speech also has the timbre characteristics of the source speaker and sounds more like speech spoken directly by the source speaker.
  • FIG. 2 is a schematic flowchart of a voice translation method according to an embodiment, where the method includes the following steps:
  • S201 Acquire a first target voice of the source speaker.
  • it should be noted that S201 in this embodiment is the same as S101 in the first embodiment; for a related description, refer to the first embodiment, and details are not described herein again.
  • S202 Generate speech recognition text by performing speech recognition on the first target speech.
  • the first target speech is converted into speech recognition text by a speech recognition technology, such as an artificial neural network based speech recognition technology.
  • for example, if the first target voice is the Chinese voice "你好", Chinese text "你好" can be obtained by performing speech recognition on it.
  • S203 Generate translated text by performing text translation on the voice recognition text.
  • for example, assuming the pre-translation language is Chinese and the post-translation language is set to English, the speech recognition text is Chinese text, and the Chinese text can be passed through a Chinese-to-English translation model to obtain English translated text; for instance, the Chinese text "你好" is translated into the English text "hello".
  • S204 Generate a second target voice by performing speech synthesis on the translated text, where a language of the second target voice is different from a language of the first target voice, and the second target voice carries the source Pronunciation of the person's tone characteristics.
  • in the current state of speech translation, the difference in timbre between the post-translation speech and the pre-translation speech is very noticeable. To overcome this defect, this embodiment can build a model in advance using the speech acoustic parameters of the source speaker to obtain a speech synthesis model belonging to the source speaker. In this way, when the translated text is synthesized into speech, the speech synthesis model can be used so that the translated speech, that is, the second target speech, has the timbre characteristics of the source speaker, achieving the auditory effect of the source speaker speaking and translating himself. For example, if the translated text is the English text "hello", the translated speech, that is, the second target voice, is the English voice "hello".
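  • as an illustration of the S202-S204 flow only, the following sketch shows how the three steps could be chained; the component objects (asr_model, mt_model, tts_model) are placeholders for whatever speech recognition, text translation, and source-speaker speech synthesis models are actually used, not APIs defined by this application.

```python
# Minimal sketch of the S202-S204 pipeline (illustrative placeholders only).
def translate_speech(first_target_voice, asr_model, mt_model, tts_model):
    """Translate the source speaker's speech while keeping his or her timbre."""
    # S202: speech recognition -> speech recognition text (e.g. "你好")
    recognition_text = asr_model.transcribe(first_target_voice)
    # S203: text translation -> translated text in the post-translation language
    translated_text = mt_model.translate(recognition_text)     # e.g. "hello"
    # S204: speech synthesis with a synthesis model built for the source
    # speaker, so the output carries the source speaker's timbre features.
    second_target_voice = tts_model.synthesize(translated_text)
    return second_target_voice
```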
  • the speech synthesis model may include an acoustic model and a duration model, as shown in the schematic diagram of the speech synthesis model shown in FIG.
  • after the translated text of the first target speech is obtained, text analysis is first performed on the translated text to determine the syllable information in the translated text and the phonemes composing each syllable. The phoneme information is then input into the acoustic model shown in FIG. 3, so that the acoustic model determines and outputs, for each phoneme, acoustic parameters that carry the timbre characteristics of the source speaker, where the acoustic parameters may include parameters such as the frequency spectrum and the fundamental frequency.
  • the phoneme information is also input into the duration model shown in FIG. 3, so that the duration model outputs a duration parameter; this embodiment does not limit how the duration parameter is determined. As an example, the speech rate of the first target speech may be determined, or a default speech rate may be used, and the time it takes to read the translated text at that speech rate is calculated and used as the duration parameter.
  • next, the speech synthesis model uses the acoustic parameters output by the acoustic model so that each phoneme in the translated text is pronounced according to its corresponding acoustic parameters, and uses the duration parameter output by the duration model so that the pronunciation follows the specified duration, thereby synthesizing translated speech with the timbre characteristics of the source speaker, that is, the second target speech.
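  • the following is a minimal sketch of how a FIG. 3 style synthesis step could combine per-phoneme acoustic parameters with durations; per-phoneme duration prediction, the 5 ms frame shift, and the vocoder interface are assumptions made for illustration, not details specified by this application.

```python
import numpy as np

FRAME_SHIFT_S = 0.005  # assumed 5 ms frame shift

def synthesize(phonemes, acoustic_model, duration_model, vocoder):
    """Expand per-phoneme spectrum/F0 to frames and hand them to a vocoder."""
    spectra, f0 = [], []
    for ph in phonemes:
        params = acoustic_model.predict(ph)   # spectrum/F0 carrying the source
        dur_s = duration_model.predict(ph)    # speaker's timbre; duration in s
        n_frames = max(1, int(round(dur_s / FRAME_SHIFT_S)))
        spectra.append(np.tile(params["spectrum"], (n_frames, 1)))
        f0.append(np.full(n_frames, params["f0"]))
    return vocoder.generate(np.concatenate(spectra), np.concatenate(f0))
```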
  • S204 may be implemented in the following manner, and specifically includes the following steps:
  • Step A The translated text is segmented according to a predetermined size text unit to obtain each target text unit.
  • the translated text is divided into text units of a preset size; for example, when the translated text is Chinese text, it may be divided in units of phonemes, syllables, or characters, and when the translated text is English text, it may be divided in units of phonemes, words, and so on. For ease of distinction, this embodiment defines each text unit divided from the translated text as a target text unit.
  • Step B Acquire acoustic parameters of each target text unit, wherein the acoustic parameters carry the timbre characteristics of the source speaker.
  • in this embodiment, the acoustic parameters of each target text unit can be obtained by using the acoustic model shown in FIG. 3. Since the acoustic model belongs to the source speaker, the acoustic parameters obtained with it will have the timbre characteristics of the source speaker.
  • Step C Perform speech synthesis on the translated text according to acoustic parameters of each target text unit to generate a second target speech.
  • after the acoustic parameters of each target text unit in the translated text are obtained through step B (for example, parameters such as the spectrum and the fundamental frequency), the speech synthesis model shown in FIG. 3 can make each target text unit pronounced according to its corresponding acoustic parameters, thereby synthesizing the translated text into a second target speech with the timbre characteristics of the source speaker.
  • in summary, according to the voice translation method provided in this embodiment, after the first target voice of the source speaker is acquired, text translation is performed on the speech recognition text of the first target voice, and then the acoustic parameters of each text unit in the translated text are obtained to perform speech synthesis and generate the second target speech. Since the acoustic parameters carry the timbre characteristics of the source speaker, the translated speech also has the timbre characteristics of the source speaker, so that the translated speech sounds more like speech spoken directly by the source speaker.
  • This embodiment will introduce a construction method of the acoustic model in the second embodiment, and introduce a specific implementation of the step B in the second embodiment, that is, how to obtain the acoustic parameters of the target text unit using the acoustic model.
  • in this embodiment, when the source speaker first obtains the translation machine, he or she can make a recording according to the instructions in the manual, which is used to construct the acoustic model. The recording content is optional, and the source speaker can choose the language according to his or her own reading ability; that is, the recording language selected by the source speaker may be the same as or different from the language of the post-translation speech (that is, the second target voice).
  • This embodiment will specifically introduce the construction method of the acoustic model based on the above two different language selection results.
  • in the first construction method of the acoustic model, the recording language selected by the source speaker is the same as the language of the post-translation speech (that is, the second target speech); this construction method is described in detail below.
  • FIG. 4 is a schematic flowchart of a method for constructing an acoustic model according to an embodiment, where the method includes the following steps:
  • S401 Acquire a first sample voice of the source speaker, wherein a language of the first sample voice is the same as a language of the second target voice.
  • in this embodiment, in order for the post-translation speech, that is, the second target speech, to be pronounced with the timbre characteristics of the source speaker, a recording of the source speaker may be obtained. The recording may be in the same language as the post-translation speech, and the text corresponding to the recording should cover as much as possible all the phoneme content of that language. For ease of distinction, this embodiment defines this recording as the first sample speech.
  • take the case where the pre-translation speech, that is, the first target speech, is Chinese speech and the post-translation speech, that is, the second target speech, is English speech as an example. First, it is confirmed whether the source speaker is able to read English normally; for example, the translation machine may ask the source speaker whether he or she can read English aloud. If the source speaker replies "I can read English" by voice or by a button, the translation machine can present a small amount of fixed English text and prompt the source speaker to read it aloud. The fixed English text covers all English phonemes as far as possible, and the source speaker reads the fixed English text so that the translation machine can obtain the speech of the fixed English text; this speech is the first sample speech.
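  • whether a candidate fixed text actually covers all phonemes of the recording language can be checked mechanically; the sketch below is illustrative only, and the grapheme-to-phoneme function g2p and the phoneme inventory are assumed inputs rather than components described in this application.

```python
def missing_phonemes(prompt_text, g2p, phoneme_inventory):
    """Return the phonemes of the language not covered by the prompt text.

    g2p is any grapheme-to-phoneme function returning a list of phonemes,
    and phoneme_inventory is the full phoneme set of the recording language.
    """
    return set(phoneme_inventory) - set(g2p(prompt_text))

# Hypothetical usage: extend the fixed text until nothing is missing, then
# prompt the source speaker to read it aloud.
# assert not missing_phonemes(fixed_english_text, english_g2p, ENGLISH_PHONEMES)
```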
  • S402 Segment the identification text of the first sample voice according to the text unit of the preset size to obtain each first sample text unit.
  • after the first sample speech is obtained, it is converted into speech recognition text by a speech recognition technology, such as a speech recognition technology based on an artificial neural network. The speech recognition text is then divided into text units of the preset size (the same division unit as in step A of the second embodiment), for example, in units of phonemes. For ease of distinction, this embodiment defines each text unit divided from the speech recognition text as a first sample text unit.
  • S403 Extract a first speech segment corresponding to the first sample text unit from the first sample speech, and extract an acoustic parameter from the first speech segment.
  • in this embodiment, the first sample speech is divided in the same way as the recognition text of the first sample speech, so that the speech segment corresponding to each first sample text unit in the first sample speech can be determined. For example, the recognition text of the first sample speech and the first sample speech are both divided in units of phonemes, thereby obtaining the speech segment corresponding to each phoneme of the recognition text. For ease of distinction, this embodiment defines the speech segment corresponding to a first sample text unit as a first speech segment.
  • then, for each first sample text unit, corresponding acoustic parameters, such as the spectrum and the fundamental frequency, are extracted from its corresponding first speech segment, thereby obtaining timbre characteristic data of the source speaker.
  • S404 Construct a first acoustic model by using each of the first sample text units and the acoustic parameters corresponding to the first sample text unit.
  • specifically, the respective first sample text units and the acoustic parameters corresponding to each first sample text unit may be stored to form a first data set. Taking the case where the text unit in the first data set is a phoneme as an example, it should be noted that if the first data set cannot cover all the phonemes of the post-translation language, the uncovered phonemes and default acoustic parameters set for those phonemes can be added to the first data set. An acoustic model belonging to the source speaker can then be constructed based on the correspondence between the first sample text units and the acoustic parameters in the first data set; that is, the first data set is directly used as training data to train an acoustic model of the source speaker, and the training process is the same as in the prior art. This embodiment defines the acoustic model constructed in this way as the first acoustic model.
  • based on the first acoustic model, step B of the second embodiment, "acquiring the acoustic parameters of each target text unit", may specifically include: acquiring the acoustic parameters of each target text unit by using the first acoustic model. That is, the acoustic parameters of each target text unit are directly generated by using the acoustic model of the source speaker, namely the first acoustic model; the specific generation method may be the same as in the prior art, for example an existing parameter-based speech synthesis method.
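  • purely as an illustration of S401-S404, the sketch below builds a phoneme-indexed table of the source speaker's acoustic parameters from one recording; align_phonemes and extract_spectrum_f0 stand in for any forced-alignment and feature-extraction tools, and the table is only one possible form of the "first data set" from which the first acoustic model is trained or looked up.

```python
from collections import defaultdict

def build_first_acoustic_model(sample_wave, recognition_text,
                               align_phonemes, extract_spectrum_f0,
                               all_phonemes, default_params):
    data_set = defaultdict(list)
    # S402/S403: segment the recognition text into phonemes and locate the
    # speech segment (start, end) of each phoneme in the first sample speech.
    for phoneme, start, end in align_phonemes(sample_wave, recognition_text):
        segment = sample_wave[start:end]
        data_set[phoneme].append(extract_spectrum_f0(segment))  # spectrum + F0

    # S404: phonemes of the post-translation language not covered by the
    # recording get default acoustic parameters, covering the whole inventory.
    for phoneme in all_phonemes:
        data_set.setdefault(phoneme, [default_params])
    return data_set  # training data / lookup table for the first acoustic model
```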
  • in the second construction method of the acoustic model, the recording language selected by the source speaker is different from the language of the post-translation speech (that is, the second target speech); this construction method is described in detail below.
  • FIG. 5 is a schematic flowchart diagram of another acoustic model construction method provided by the embodiment, where the method includes the following steps:
  • S501 Acquire a second sample speech of the source speaker, wherein a language of the second sample voice is different from a language of the second target voice.
  • in this embodiment, a recording of the source speaker may be obtained, and the recording may be in a language different from that of the post-translation speech; for example, the recording may be in the same language as the pre-translation speech, that is, the first target speech. The text corresponding to the recording should cover as much as possible all the phoneme content of that language. For ease of distinction, this embodiment defines this recording as the second sample speech.
  • take the case where the pre-translation speech, that is, the first target speech, is Chinese speech and the post-translation speech, that is, the second target speech, is English speech as an example. The translation machine may ask the source speaker whether he or she can read English aloud; if the source speaker replies "I cannot read English" by voice or by a button, the translation machine can offer a language selection. If the source speaker selects Chinese, the translation machine can present a small amount of fixed Chinese text and prompt the source speaker to read it aloud. The fixed Chinese text covers all Chinese phonemes as much as possible, and the source speaker reads the fixed Chinese text so that the translation machine can obtain the speech of the fixed Chinese text; this speech is the second sample speech.
  • S502 Segment the identification text of the second sample voice according to the text unit of the preset size to obtain each second sample text unit.
  • after the second sample speech is obtained, it is converted into speech recognition text by a speech recognition technology, such as a speech recognition technology based on an artificial neural network. The speech recognition text is then divided into text units of the preset size (the same division unit as in step A of the second embodiment), for example, in units of phonemes. For ease of distinction, this embodiment defines each text unit divided from the speech recognition text as a second sample text unit.
  • S503 Convert the second sample text unit to obtain a first converted text unit, wherein the first converted text unit is a text unit used by a language of the second target voice.
  • since the second sample text units belong to the recording language rather than the post-translation language, each second sample text unit needs to be converted into a text unit corresponding to the post-translation language. For ease of distinction, the converted text unit is defined as the first converted text unit. For example, if a second sample text unit is a Chinese phoneme and the post-translation language is English, the first converted text unit is an English phoneme.
  • S504 Extract a second speech segment corresponding to the second sample text unit from the second sample speech, and extract an acoustic parameter from the second speech segment to obtain a corresponding to the first converted text unit. Acoustic parameters.
  • for example, the recognition text of the second sample speech and the second sample speech are both divided in units of phonemes, thereby obtaining the speech segment corresponding to each phoneme of the recognition text. For ease of distinction, this embodiment defines the speech segment corresponding to a second sample text unit as a second speech segment.
  • then, for each second sample text unit, corresponding acoustic parameters, such as the spectrum and the fundamental frequency, are extracted from its corresponding second speech segment and used as the acoustic parameters of the first converted text unit corresponding to that second sample text unit.
  • S505 Construct a second acoustic model by using each second sample text unit, a first converted text unit corresponding to the second sample text unit, and an acoustic parameter corresponding to the first converted text unit.
  • specifically, the respective second sample text units, the first converted text unit corresponding to each second sample text unit, and the acoustic parameters corresponding to each first converted text unit may be stored to form a second data set. Taking the case where the text unit in the second data set is a phoneme as an example, it should be noted that if the second data set cannot cover all the phonemes of the post-translation language, the uncovered phonemes and default acoustic parameters set for those phonemes can be added to the second data set. An acoustic model belonging to the source speaker can then be constructed based on the correspondence in the second data set between the pre-conversion phonemes and the converted phonemes, and between the converted phonemes and the acoustic parameters; that is, the second data set is directly used as training data to train an acoustic model of the source speaker, and the training process is the same as in the prior art. This embodiment defines the acoustic model constructed in this way as the second acoustic model.
  • based on the second acoustic model, step B of the second embodiment, "acquiring the acoustic parameters of each target text unit", may specifically include: acquiring the acoustic parameters of each target text unit by using the second acoustic model. That is, the acoustic parameters of each target text unit are directly generated by using the acoustic model of the source speaker, namely the second acoustic model; the specific generation method may be the same as in the prior art, for example an existing parameter-based speech synthesis method.
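  • the corresponding sketch for S501-S505 differs only in that each recording-language phoneme is first converted to a phoneme of the post-translation language before its acoustic parameters are stored; convert_phoneme stands in for the text unit mapping model of the fourth embodiment, and the other helpers are the same assumed placeholders as in the previous sketch.

```python
def build_second_acoustic_model(sample_wave, recognition_text,
                                align_phonemes, extract_spectrum_f0,
                                convert_phoneme, target_phonemes,
                                default_params):
    data_set = {}
    for src_phoneme, start, end in align_phonemes(sample_wave, recognition_text):
        tgt_phoneme = convert_phoneme(src_phoneme)   # first converted text unit
        params = extract_spectrum_f0(sample_wave[start:end])
        data_set.setdefault(tgt_phoneme, []).append(params)

    for phoneme in target_phonemes:                  # cover the whole inventory
        data_set.setdefault(phoneme, [default_params])
    return data_set  # "second data set" keyed by converted (target) phonemes
```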
  • in summary, according to the voice translation method provided in this embodiment, after the first target voice of the source speaker is acquired, text translation is performed on the speech recognition text of the first target voice, and then the acoustic parameters of each text unit in the translated text are obtained to perform speech synthesis and generate the second target speech. The acoustic parameters of each text unit can be determined by an acoustic model of the source speaker constructed in advance, and since the acoustic parameters carry the timbre characteristics of the source speaker, the translated speech also has the timbre characteristics of the source speaker and sounds more like speech spoken directly by the source speaker.
  • This embodiment will introduce a specific implementation manner of S503 in the third embodiment.
  • in this embodiment, a text unit mapping model needs to be constructed in advance, so that S503 can be implemented by using the text unit conversion function of the text unit mapping model. This embodiment introduces two methods of constructing the text unit mapping model.
  • in the first construction method of the text unit mapping model, the correspondence between the text unit sequences of the two languages is established directly, and the conversion between text units is carried out according to this correspondence. This construction method is described in detail below.
  • FIG. 6 is a schematic flowchart of a sample text unit collection method provided by the embodiment, where the method includes the following steps:
  • S601 Collect a plurality of first sample texts, wherein a language of the first sample text is the same as a language of the second sample voice.
  • for ease of distinction, this embodiment defines each collected text corpus as a first sample text. This embodiment does not limit the form of the first sample text; the first sample text may be a word, a sentence, or a paragraph. For example, a large amount of Chinese text may be collected, and each Chinese text is a first sample text.
  • S602 The first sample text is segmented according to the preset size text unit to obtain each third sample text unit.
  • the first sample text is divided into text units of the preset size (the same division unit as in step A of the second embodiment), for example, in units of phonemes. For ease of distinction, this embodiment defines each text unit divided from the first sample text as a third sample text unit.
  • specifically, when dividing in units of phonemes, the Chinese text needs to be converted into Chinese pinyin, and each Chinese phoneme in the pinyin is marked to obtain a Chinese phoneme sequence (as shown in FIG. 7). For example, for the Chinese text "你好", the Chinese pinyin "[n i][h ao]" can be obtained, from which the four Chinese phonemes "n", "i", "h", "ao" are marked in order, that is, four third sample text units.
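  • one possible way to obtain such a Chinese phoneme sequence is sketched below; the pypinyin library and the treatment of "initial + final" as the phoneme set are assumptions made for illustration and are not prescribed by this application.

```python
from pypinyin import pinyin, Style

def chinese_phoneme_sequence(text):
    """Mark the Chinese phonemes (initial + final of each syllable) in order."""
    initials = pinyin(text, style=Style.INITIALS, strict=False)
    finals = pinyin(text, style=Style.FINALS, strict=False)
    phonemes = []
    for (ini,), (fin,) in zip(initials, finals):
        if ini:
            phonemes.append(ini)   # e.g. "n", "h"
        if fin:
            phonemes.append(fin)   # e.g. "i", "ao"
    return phonemes

print(chinese_phoneme_sequence("你好"))  # expected: ['n', 'i', 'h', 'ao']
```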
  • S603 converting the third sample text unit to obtain a second converted text unit, wherein the second converted text unit is that the third sample text unit is pronounced in a pronunciation manner of the second target voice. Text unit.
  • specifically, the first sample text may be annotated with the pronunciation of the post-translation speech, that is, the pronunciation of the second target speech, so that for each third sample text unit in the first sample text, the corresponding text unit can be found from the annotated pronunciation. For ease of distinction, this embodiment defines the corresponding text unit as the second converted text unit.
  • for example, the first sample text is the Chinese text "你好", and the post-translation speech, that is, the second target speech, is English speech. The pronunciation of "你好" can be annotated by means of English phonetic symbols, and four English phonemes are marked from the annotation in order, that is, four second converted text units, so that the four Chinese-form third sample text units "n", "i", "h", "ao" correspond in turn to the four English phonemes.
  • in this embodiment, the respective third sample text units and the second converted text unit corresponding to each third sample text unit may be stored to form a text unit set. It should be noted that, since the second converted text units in the text unit set belong to the phonemes of the post-translation language, the second converted text units in the text unit set should cover as much as possible all text units of the post-translation language.
  • specifically, the third sample text units in the text unit set and their corresponding second converted text units may be organized directly into a mapping table; based on this, the text unit mapping model can implement step S503 in the third embodiment according to the mapping relationship between text units.
  • in this implementation, step S503, "converting the second sample text unit to obtain a first converted text unit", may specifically include: determining a third sample text unit that is the same as the second sample text unit; and using the second converted text unit corresponding to the determined third sample text unit as the first converted text unit.
  • that is, a third sample text unit identical to the second sample text unit is looked up in the phoneme set, and the second converted text unit corresponding to that third sample text unit is determined based on the phoneme mapping relationship and used as the converted phoneme of the second sample text unit, namely as the first converted text unit.
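  • the table-based text unit mapping model amounts to a dictionary built from the (third sample text unit, second converted text unit) pairs; the sketch below is illustrative, and the English phoneme symbols are placeholders, since the exact symbols used in the example above are not reproduced in this text.

```python
# Hypothetical Chinese-phoneme -> English-phoneme entries of the text unit set.
phoneme_mapping = {
    "n": "n",
    "i": "iy",
    "h": "hh",
    "ao": "aw",
}

def convert_text_unit(second_sample_text_unit, mapping=phoneme_mapping):
    """S503, first implementation: find the identical third sample text unit
    in the table and return its second converted text unit."""
    return mapping[second_sample_text_unit]   # the first converted text unit
```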
  • in the second construction method of the text unit mapping model, a network model between the text unit sequences of the two languages is trained, such as the codec (encoder-decoder) model shown in FIG. 7, and this network model is used as the text unit mapping model. Using such a mapping model can make the text unit mapping results more accurate. This construction method is described in detail below.
  • S801 Collect a plurality of second sample texts, wherein the language of the second sample text is the same as the language of the second sample voice.
  • step S801 is similar to step S601; it is only necessary to replace the first sample text in S601 with the second sample text. See the related description of S601, and details are not described here again.
  • S802 Segment the second sample text according to the preset size text unit to obtain each fourth sample text unit.
  • step S802 is similar to step S602; it is only necessary to replace the first sample text in S602 with the second sample text and the third sample text unit with the fourth sample text unit. See the related description of S602, and details are not described here again.
  • S803 converting the fourth sample text unit to obtain a third converted text unit, wherein the third converted text unit is that the fourth sample text unit is pronounced in a pronunciation manner of the second target voice. Text unit.
  • step S803 is similar to step S603; it is only necessary to replace the third sample text unit in S603 with the fourth sample text unit and the second converted text unit with the third converted text unit. See the related description of S603, and details are not described here again.
  • in this embodiment, a network model between the text unit systems of the two languages may be trained by using the fourth sample text unit sequences and the third converted text unit sequences, and the network model may include the encoding network and the decoding network shown in FIG. 7. The codec model is introduced below, taking the case where the fourth sample text unit sequence is a Chinese phoneme sequence and the third converted text unit sequence is an English phoneme sequence as an example.
  • in this embodiment, a layer of syllable information is added so that the encoding network can handle different syllables, thereby optimizing the phoneme combinations within syllables and the overall phoneme mapping.
  • specifically, the encoding network may include three encoding processes, namely the encoding of the phonemes within each syllable, the encoding between syllables, and the encoding of all phonemes in the text, where each subsequent encoding takes the result of the previous encoding into account.
  • the encoding process of the encoding network is described below by taking FIG. 9 as an example.
  • assume that a collected second sample text is the Chinese text "你好", and the corresponding fourth sample text unit sequence is "n", "i", "h", "ao".
  • first, all the Chinese phonemes "n", "i", "h", and "ao" belonging to the Chinese text are uniformly vectorized, for example using a method such as Word2Vector, and the Chinese phonemes belonging to the same syllable are encoded by one pass of a Bidirectional Long Short-Term Memory (BLSTM) network. The resulting encoding captures the relationship between the phonemes within each syllable, that is, it learns that the combination and order relationship between "n" and "i" corresponds to the Chinese syllable "ni", and that the combination and order relationship between "h" and "ao" corresponds to the Chinese syllable "hao".
  • then, the syllables "ni" and "hao" of the Chinese text are vectorized, for example using Word2Vector; the encoding result of the first-layer BLSTM network (that is, the intra-syllable phoneme learning network shown in FIG. 9) is combined with the vector of each syllable and encoded by a bidirectional BLSTM network between syllables. The resulting encoding captures the relationship between syllables, that is, it learns that the combination and order relationship between "ni" and "hao" corresponds to the Chinese text "你好".
  • finally, the encoding result of the second-layer BLSTM network (that is, the inter-syllable learning network shown in FIG. 9) is combined with the vector features of all the phonemes in each syllable for a third layer of BLSTM encoding; the resulting encoding captures the relationship between the phonemes in the Chinese text, that is, it learns that the combination and order relationship among "n", "i", "h", and "ao" corresponds to the Chinese text "你好".
  • the third-layer encoding result is used as the input of the decoding network shown in FIG. 7, and the decoding network correspondingly outputs the English phoneme sequence, that is, the English phonemes corresponding in order to "n", "i", "h", "ao".
  • it can be seen that the codec model not only learns the combination and order relationships between two or more syllables, but also learns the combination and order relationships, within their syllables, of the individual phonemes of each syllable.
  • in this way, the Chinese phoneme sequence of a Chinese text can be mapped, according to its combination and order relationships within the Chinese text, to a better-matched English phoneme sequence; whether the Chinese text is a short word or a long sentence, the corresponding English phoneme sequence yields a better articulation effect, which makes the correspondence between the phoneme sequences more flexible and accurate.
  • it should be noted that the codec model is not limited to training between Chinese and English phoneme sequences, and is applicable to any two different languages.
  • step S503 in the third embodiment can be implemented based on the learning result of the codec model.
  • in this implementation, step S503, "converting the second sample text unit to obtain a first converted text unit", may specifically include: converting the second sample text unit by using the codec model to obtain the first converted text unit.
  • specifically, the second sample text units are fed into the pre-built codec model, which outputs the converted first converted text units. During the conversion, the codec model can select, based on its learning results, the first converted text unit that best matches each second sample text unit. Compared with the first implementation of S503, this implementation learns in advance the actual combinations of text unit sequences in the different languages, which makes the converted text units more accurate.
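  • as a rough illustration of the three-level encoding described around FIG. 9 (phonemes within a syllable, syllables, all phonemes of the text) followed by a decoder that emits target-language phonemes, a PyTorch-style sketch is given below; the use of PyTorch, the layer sizes, the mean-pooled syllable summaries, and the simple decoder are all assumptions, since this application describes the networks only qualitatively.

```python
import torch
import torch.nn as nn

class HierarchicalPhonemeCodec(nn.Module):
    def __init__(self, n_src_phonemes, n_src_syllables, n_tgt_phonemes, d=64):
        super().__init__()
        self.phone_emb = nn.Embedding(n_src_phonemes, d)
        self.syll_emb = nn.Embedding(n_src_syllables, d)
        # layer 1: phonemes within one syllable
        self.intra = nn.LSTM(d, d, bidirectional=True, batch_first=True)
        # layer 2: between syllables (syllable vector + layer-1 summary)
        self.inter = nn.LSTM(3 * d, d, bidirectional=True, batch_first=True)
        # layer 3: all phonemes of the text (phoneme vector + layer-2 context)
        self.full = nn.LSTM(3 * d, d, bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(2 * d, 2 * d, batch_first=True)
        self.out = nn.Linear(2 * d, n_tgt_phonemes)

    def forward(self, syllable_phonemes, syllable_ids, max_out_len):
        # syllable_phonemes: list of 1-D LongTensors (phoneme ids per syllable)
        # syllable_ids:      1-D LongTensor with one id per syllable
        phone_vecs, syll_summaries = [], []
        for phones in syllable_phonemes:
            e = self.phone_emb(phones).unsqueeze(0)          # (1, n_ph, d)
            h, _ = self.intra(e)                             # (1, n_ph, 2d)
            phone_vecs.append(e)
            syll_summaries.append(h.mean(dim=1))             # per-syllable summary
        layer1 = torch.stack(syll_summaries, dim=1)          # (1, n_syll, 2d)
        syll_in = torch.cat(
            [self.syll_emb(syllable_ids).unsqueeze(0), layer1], dim=-1)
        layer2, _ = self.inter(syll_in)                      # (1, n_syll, 2d)

        per_phone = [torch.cat(
            [e, layer2[:, i:i + 1].expand(-1, e.size(1), -1)], dim=-1)
            for i, e in enumerate(phone_vecs)]
        layer3, _ = self.full(torch.cat(per_phone, dim=1))   # (1, total_ph, 2d)

        context = layer3.mean(dim=1, keepdim=True)           # text-level context
        dec_in = context.expand(-1, max_out_len, -1).contiguous()
        dec_out, _ = self.decoder(dec_in)
        return self.out(dec_out)          # logits over target-language phonemes
```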
  • in summary, according to the voice translation method provided in this embodiment, when the text unit sequence of the source speaker's recorded text needs to be converted into a text unit sequence of the post-translation language, a text unit mapping model can be constructed in advance, either based on the correspondence between text unit sequences of the different languages or by training a codec network, and the required text unit conversion result can be obtained by using this text unit mapping model.
  • referring to FIG. 10, the voice translation apparatus 1000 provided in this embodiment includes:
  • a voice acquiring unit 1001 configured to acquire a first target voice of the source speaker
  • the voice translation unit 1002 is configured to generate a second target voice by performing voice translation on the first target voice, where the language of the second target voice is different from the language of the first target voice, and the second The target voice carries the timbre characteristics of the source speaker.
  • the voice translation unit 1002 may include:
  • a text recognition subunit configured to generate a voice recognition text by performing voice recognition on the first target voice
  • a text translation subunit configured to generate a translated text by performing text translation on the speech recognition text
  • a speech translation subunit configured to generate a second target speech by performing speech synthesis on the translated text.
  • the voice translation subunit may include:
  • a target unit dividing subunit configured to segment the translated text according to a preset size text unit to obtain each target text unit
  • An acoustic parameter acquisition subunit configured to acquire an acoustic parameter of each target text unit, wherein the acoustic parameter carries a timbre characteristic of the source speaker;
  • the translation speech generation subunit is configured to perform speech synthesis on the translated text according to acoustic parameters of each target text unit to generate a second target speech.
  • the apparatus 1000 may further include:
  • a first sample acquiring unit configured to acquire a first sample voice of the source speaker, wherein a language of the first sample voice is the same as a language of the second target voice;
  • a first sample dividing unit configured to segment the identification text of the first sample voice according to the preset size text unit to obtain each first sample text unit
  • a first segment extracting unit configured to extract, from the first sample voice, a first voice segment corresponding to the first sample text unit
  • a first parameter extraction unit configured to extract an acoustic parameter from the first speech segment
  • a first model building unit configured to construct a first acoustic model by using respective first sample text units and acoustic parameters corresponding to the first sample text unit;
  • the acoustic parameter acquisition subunit may be specifically configured to acquire acoustic parameters of each target text unit by using the first acoustic model.
  • the apparatus 1000 may further include:
  • a second sample acquiring unit configured to acquire a second sample voice of the source speaker, wherein a language of the second sample voice is different from a language of the second target voice;
  • a second sample dividing unit configured to segment the identification text of the second sample voice according to the preset size text unit to obtain each second sample text unit;
  • a text unit conversion unit configured to convert the second sample text unit to obtain a first converted text unit, wherein the first converted text unit is a text unit used by a language of the second target voice;
  • a second segment extracting unit configured to extract, from the second sample speech, a second voice segment corresponding to the second sample text unit
  • a second parameter extraction unit configured to extract an acoustic parameter from the second speech segment, to obtain an acoustic parameter corresponding to the first converted text unit
  • a second model building unit configured to construct a second acoustic model by using each second sample text unit, the first converted text unit corresponding to the second sample text unit, and the acoustic parameters corresponding to the first converted text unit;
  • the acoustic parameter acquisition subunit may be specifically configured to acquire acoustic parameters of each target text unit by using the second acoustic model.
  • the apparatus 1000 may further include:
  • a first text collecting unit configured to collect a plurality of first sample texts, wherein a language of the first sample text is the same as a language of the second sample voice;
  • a third sample dividing unit configured to segment the first sample text according to the preset size text unit to obtain each third sample text unit
  • a first unit conversion unit configured to convert the third sample text unit to obtain a second converted text unit, wherein the second converted text unit is a text unit in which the third sample text unit is pronounced in the pronunciation manner of the second target voice;
  • in this case, the text unit conversion unit may include: a text unit determining subunit configured to determine a third sample text unit that is the same as the second sample text unit; and
  • a text unit conversion subunit configured to use the second converted text unit corresponding to the determined third sample text unit as the first converted text unit.
  • the apparatus 1000 may further include:
  • a second text collecting unit configured to collect a plurality of second sample texts, wherein a language of the second sample text is the same as a language of the second sample voice;
  • a fourth sample dividing unit configured to divide the second sample text according to a predetermined unit size of the text, to obtain each fourth sample text unit
  • a second unit conversion unit configured to convert the fourth sample text unit to obtain a third converted text unit, wherein the third converted text unit is a text unit in which the fourth sample text unit is pronounced in the pronunciation manner of the second target voice;
  • a codec model building unit configured to, for the syllables in the second sample text, construct a codec model by learning the combination and order relationships, within the corresponding syllable, of the fourth sample text units belonging to the same syllable, learning the combination and order relationships of at least two consecutive syllables in the second sample text, and learning the combination and order relationships, in the second sample text, of the fourth sample text units within the at least two consecutive syllables;
  • the text unit conversion unit may be specifically configured to convert the second sample text unit by using the codec model to obtain a first converted text unit.
  • referring to FIG. 11, the voice translation apparatus 1100 includes a memory 1101, a receiver 1102, and a processor 1103 connected to the memory 1101 and the receiver 1102, respectively.
  • the memory 1101 is configured to store a set of program instructions
  • the processor 1103 is configured to invoke the program instructions stored by the memory 1101 to perform the following operations:
  • generating a second target voice by performing voice translation on the first target voice, wherein the language of the second target voice is different from the language of the first target voice, and the second target voice carries the timbre characteristics of the source speaker.
  • the processor 1103 is further configured to invoke a program instruction stored by the memory 1101 to perform the following operations:
  • a second target speech is generated by speech synthesis of the translated text.
  • the processor 1103 is further configured to invoke a program instruction stored by the memory 1101 to perform the following operations:
  • the translated text is synthesized by speech according to acoustic parameters of each target text unit to generate a second target speech.
  • the processor 1103 is further configured to invoke a program instruction stored by the memory 1101 to perform the following operations:
  • Acoustic parameters of each target text unit are obtained using the first acoustic model.
  • the processor 1103 is further configured to invoke a program instruction stored by the memory 1101 to perform the following operations:
  • Acoustic parameters of each target text unit are obtained using the second acoustic model.
  • the processor 1103 is further configured to invoke a program instruction stored by the memory 1101 to perform the following operations:
  • the determined second converted text unit corresponding to the third sample text unit is used as the first converted text unit.
  • the processor 1103 is further configured to invoke a program instruction stored by the memory 1101 to perform the following operations:
  • the second sample text unit is converted to obtain a first converted text unit.
  • the embodiment further provides a computer readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform any one of the above described speech translation methods.
  • the embodiment further provides a computer program product, when the computer program product runs on the terminal device, causing the terminal device to perform any one of the foregoing voice translation methods.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A speech translation method and apparatus. The method includes: acquiring a first target voice of a source speaker (S101); and generating a second target voice by performing speech translation on the first target voice, where the language of the second target voice is different from the language of the first target voice, and the second target voice carries the timbre characteristics of the source speaker (S102). It can be seen that when the speech of the source speaker, that is, the pre-translation speech, is translated, the timbre characteristics of the source speaker are taken into account, so the post-translation speech also has the timbre characteristics of the source speaker and thus sounds more like speech spoken directly by the source speaker.

Description

Speech translation method and apparatus
This application claims priority to Chinese Patent Application No. 201810167142.5, entitled "Speech Translation Method and Apparatus" and filed with the Chinese Patent Office on February 28, 2018, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
This application relates to the field of computer technology, and in particular to a speech translation method and apparatus.
BACKGROUND
As artificial intelligence technology matures, people increasingly look to intelligent technology to solve practical problems. For example, in the past one had to spend a great deal of time learning a new language before being able to communicate with its native speakers; now, with a translation machine built around speech recognition, intelligent translation, and speech synthesis, one can speak in one language and have the machine translate and speak the translated meaning aloud.
However, in current speech translation technology, after the speech of the source speaker is translated, the resulting post-translation speech carries only the timbre characteristics of the speaker built into the speech synthesis model; to the listener, it has the timbre of a completely different speaker from the source speaker.
SUMMARY
The main purpose of the embodiments of this application is to provide a speech translation method and apparatus that enable the translated speech to have the timbre characteristics of the source speaker when the speech of the source speaker is translated.
An embodiment of this application provides a speech translation method, including:
acquiring a first target voice of a source speaker; and
generating a second target voice by performing speech translation on the first target voice, where the language of the second target voice is different from the language of the first target voice, and the second target voice carries the timbre characteristics of the source speaker.
Optionally, generating the second target voice by performing speech translation on the first target voice includes:
generating speech recognition text by performing speech recognition on the first target voice;
generating translated text by performing text translation on the speech recognition text; and
generating the second target voice by performing speech synthesis on the translated text.
Optionally, generating the second target voice by performing speech synthesis on the translated text includes:
segmenting the translated text into text units of a preset size to obtain target text units;
acquiring acoustic parameters of each target text unit, where the acoustic parameters carry the timbre characteristics of the source speaker; and
performing speech synthesis on the translated text according to the acoustic parameters of each target text unit to generate the second target voice.
Optionally, the method further includes:
acquiring a first sample speech of the source speaker, where the language of the first sample speech is the same as the language of the second target voice;
segmenting the recognition text of the first sample speech into text units of the preset size to obtain first sample text units;
extracting, from the first sample speech, a first speech segment corresponding to each first sample text unit;
extracting acoustic parameters from the first speech segment; and
constructing a first acoustic model by using the first sample text units and the acoustic parameters corresponding to the first sample text units;
in this case, acquiring the acoustic parameters of each target text unit includes:
acquiring the acoustic parameters of each target text unit by using the first acoustic model.
Optionally, the method further includes:
acquiring a second sample speech of the source speaker, where the language of the second sample speech is different from the language of the second target voice;
segmenting the recognition text of the second sample speech into text units of the preset size to obtain second sample text units;
converting each second sample text unit to obtain a first converted text unit, where the first converted text unit is a text unit used by the language of the second target voice;
extracting, from the second sample speech, a second speech segment corresponding to each second sample text unit;
extracting acoustic parameters from the second speech segment to obtain the acoustic parameters corresponding to the first converted text unit; and
constructing a second acoustic model by using the second sample text units, the first converted text units corresponding to the second sample text units, and the acoustic parameters corresponding to the first converted text units;
in this case, acquiring the acoustic parameters of each target text unit includes:
acquiring the acoustic parameters of each target text unit by using the second acoustic model.
Optionally, the method further includes:
collecting a plurality of first sample texts, where the language of the first sample texts is the same as the language of the second sample speech;
segmenting each first sample text into text units of the preset size to obtain third sample text units; and
converting each third sample text unit to obtain a second converted text unit, where the second converted text unit is a text unit in which the third sample text unit is pronounced in the pronunciation manner of the second target voice;
in this case, converting the second sample text unit to obtain the first converted text unit includes:
determining a third sample text unit that is the same as the second sample text unit; and
using the second converted text unit corresponding to the determined third sample text unit as the first converted text unit.
Optionally, the method further includes:
collecting a plurality of second sample texts, where the language of the second sample texts is the same as the language of the second sample speech;
segmenting each second sample text into text units of the preset size to obtain fourth sample text units;
converting each fourth sample text unit to obtain a third converted text unit, where the third converted text unit is a text unit in which the fourth sample text unit is pronounced in the pronunciation manner of the second target voice; and
for the syllables in the second sample text, constructing a codec model by learning the combination and order relationships, within the corresponding syllable, of the fourth sample text units belonging to the same syllable, learning the combination and order relationships of at least two consecutive syllables in the second sample text, and learning the combination and order relationships, in the second sample text, of the fourth sample text units within the at least two consecutive syllables;
in this case, converting the second sample text unit to obtain the first converted text unit includes:
converting the second sample text unit by using the codec model to obtain the first converted text unit.
An embodiment of this application further provides a speech translation apparatus, including:
a voice acquiring unit configured to acquire a first target voice of a source speaker; and
a speech translation unit configured to generate a second target voice by performing speech translation on the first target voice, where the language of the second target voice is different from the language of the first target voice, and the second target voice carries the timbre characteristics of the source speaker.
An embodiment of this application further provides a speech translation apparatus, including a processor, a memory, and a system bus, where
the processor and the memory are connected by the system bus; and
the memory is configured to store one or more programs, the one or more programs including instructions that, when executed by the processor, cause the processor to perform any of the methods described above.
An embodiment of this application further provides a computer-readable storage medium including instructions that, when run on a computer, cause the computer to perform any of the methods described above.
An embodiment of this application further provides a computer program product that, when run on a terminal device, causes the terminal device to perform any of the methods described above.
According to the speech translation method and apparatus provided in the embodiments of this application, after the first target voice of the source speaker is acquired, speech translation is performed on the first target voice to generate a second target voice, where the language of the second target voice is different from the language of the first target voice, and the second target voice carries the timbre characteristics of the source speaker. It can be seen that when the speech of the source speaker, that is, the pre-translation speech, is translated, the timbre characteristics of the source speaker are taken into account, so the post-translation speech also has the timbre characteristics of the source speaker and thus sounds more like speech spoken directly by the source speaker.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a first schematic flowchart of a speech translation method according to an embodiment of this application;
FIG. 2 is a second schematic flowchart of a speech translation method according to an embodiment of this application;
FIG. 3 is a schematic diagram of a speech synthesis model according to an embodiment of this application;
FIG. 4 is a first schematic flowchart of an acoustic model construction method according to an embodiment of this application;
FIG. 5 is a second schematic flowchart of an acoustic model construction method according to an embodiment of this application;
FIG. 6 is a schematic flowchart of a sample text unit collection method according to an embodiment of this application;
FIG. 7 is a schematic diagram of the relationship between phoneme sequences according to an embodiment of this application;
FIG. 8 is a schematic flowchart of a codec model construction method according to an embodiment of this application;
FIG. 9 is a schematic diagram of an encoding process according to an embodiment of this application;
FIG. 10 is a schematic structural diagram of a speech translation apparatus according to an embodiment of this application;
FIG. 11 is a schematic diagram of the hardware structure of a speech translation apparatus according to an embodiment of this application.
DETAILED DESCRIPTION
In current speech translation technology, after the speech of the source speaker is translated, the resulting post-translation speech carries only the timbre characteristics of the speaker built into the synthesis model; to the listener, it has the timbre of a completely different speaker from the source speaker. That is, it sounds as if one person is talking and another person then gives the translation, with two different people's voices.
To this end, the embodiments of this application provide a speech translation method and apparatus. When the speech of the source speaker, that is, the pre-translation speech, needs to be translated into another language, a speech synthesis model belonging to the source speaker is used to perform the speech translation, so that the post-translation speech has the timbre characteristics of the source speaker. The post-translation speech therefore sounds more like speech spoken directly by the source speaker, which improves the user experience.
To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort shall fall within the protection scope of this application.
第一实施例
参见图1,为本实施例提供的一种语音翻译方法的流程示意图,该方法包括以下步骤:
S101:获取源发音人的第一目标语音。
为便于区分,本实施例将需要进行翻译的语音即翻译前语音,定义为第一目标语音,并将说出所述第一目标语音的说话人定义为源发音人。
本实施例不限定所述第一目标语音的来源,例如,所述第一目标语音可以是某人的真实语音或是录制语音,也可以是对所述真实语音或所述录制语音进行机器处理后的特效语音。
本实施例也不限定所述第一目标语音的长度,例如,所述第一目标语音可以是一个词、也可以是一句话、还可以是一段话。
S102:通过对所述第一目标语音进行语音翻译,生成第二目标语音,其中,所述第二目标语音的语种与所述第一目标语音的语种不同,所述第二目标语音携带了所述源发音人的音色特征。
为便于区分,本实施例将对第一目标语音进行翻译后的语音,定义为第二目标语音。需要说明的是,当第一目标语音为上述经机器处理后的特效语音时,还需要进一步将翻译后得到的第二目标语音也进行相同方式的特效处理。
本实施例不限定第一目标语音与第二目标语音的语种类型,只要第一目标语音与第二目标语音的语种类型不同但语音意思相同即可。例如,第一目标语音为中文“你好”,第二目标语音为英文“hello”;或者,第一目标语音为英文“hello”,第二目标语音为中文“你好”。
实际应用中,用户比如源发音人,可以为翻译机预设翻译后的语种,当翻译机的语音合成模型获取到源发音人的第一目标语音后,便可以将其进行语音翻译,使翻译后的第二目标语音为预设的翻译语种。
本实施例中，可以预先采集源发音人的音色特征，用来构建属于源发音人的语音合成模型，基于此，当对源发音人的第一目标语音进行语音翻译时，可以采用属于源发音人的语音合成模型进行语音翻译，使翻译后的第二目标语音被赋予源发音人的音色特征，这种音色自适应方式，使听者在听感上觉得第二目标语音具有源发音人的说话效果，即，使翻译前语音与翻译后语音在音色效果上相同或相近。
综上,本实施例提供的一种语音翻译方法,当获取到源发音人的第一目标语音后,通过对第一目标语音进行语音翻译,生成第二目标语音,其中,第二目标语音的语种与第一目标语音的语种不同,第二目标语音携带了源发音人的音色特征。可见,在对源发音人的语音即翻译前语音进行语音翻译时,由于考虑了源发音人本身具有的音色特征,使得翻译后语音也具有源发音人的音色特征,从而使得该翻译后语音听起来更像是源发音人直接说出的语音。
第二实施例
本实施例将结合附图,通过下述S202-S204介绍上述第一实施例中S102的具体实现方式。
参见图2,为本实施例提供的一种语音翻译方法的流程示意图,该方法包括以下步骤:
S201:获取源发音人的第一目标语音。
需要说明的是,本实施例中的S201与第一实施例中的S101一致,相关说明请参见第一实施例,在此不再赘述。
S202:通过对所述第一目标语音进行语音识别,生成语音识别文本。
当获取到第一目标语音后,通过语音识别技术,比如基于人工神经网络的语音识别技术,将第一目标语音转换成语音识别文本。
例如,第一目标语音为中文语音“你好”,对其进行语音识别,可以得到中文文本“你好”。
S203:通过对所述语音识别文本进行文本翻译,生成翻译文本。
例如,假设翻译前语种为中文、翻译后语种被设定为英文,那么,语音识别文本为中文文本,可以将该中文文本通过“中→英”的翻译模型,得到英文翻译文本,比如将中文文本“你好”进行文本翻译,得到英文文本“hello”。
S204：通过对所述翻译文本进行语音合成，生成第二目标语音，其中，所述第二目标语音的语种与所述第一目标语音的语种不同，所述第二目标语音携带了所述源发音人的音色特征。
针对目前的语音翻译现状,翻译后语音与翻译前语音在音色上的区分度是非常明显的,为克服该缺陷,本实施例可以预先利用源发音人的语音声学参数进行建模,得到属于源发音人的语音合成模型。这样,当将所述翻译文本合成语音时,可以利用该语音合成模型,使翻译后语音即第二目标语音具有源发音人的音色特征,达到源发音人自己说话、自己翻译的听感效果。例如,所述翻译文本为英文文本“hello”,翻译后语音即第二目标语音为英文语音“hello”。
具体地,语音合成模型可以包括声学模型和时长模型,如图3所示的语音合成模型示意图。
在得到第一目标语音的翻译文本后,首先要对该翻译文本进行文本分析处理,确定该翻译文本中每个音节信息,并获取组成每个音节的各个音素信息,然后将这些音素信息输入至图3所示的声学模型,以便该声学模型确定并输出每一音素的声学参数,该声学参数携带了源发音人的音色特征,其中,该声学参数可以包括频谱、基频等参数。此外,还要将上述音素信息输入至图3所示的时长模型,以便该时长模型输出时长参数,本实施例不限制时长参数的确定方式。作为一种示例,可以确定第一目标语音的语速或采用默认语速,计算翻译文本按照该语速进行阅读时所花费的时长,将该时长作为时长参数。
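作为上述时长参数计算方式的一个极简示意（其中把“语速”近似为每秒发出的音素个数，该度量方式及默认取值均为本文示例性的假设，并非本申请限定的实现）：

```python
# 示意性草图：按第一目标语音的语速估算翻译文本的发音时长（度量方式为假设的简化方式）
def estimate_duration(source_phoneme_count, source_duration_sec, target_phoneme_count,
                      default_rate=10.0):
    # 语速：源语音中每秒发出的音素个数；无法估计时退回默认语速
    rate = (source_phoneme_count / source_duration_sec) if source_duration_sec > 0 else default_rate
    # 翻译文本按该语速阅读所需的时长（秒），作为时长参数
    return target_phoneme_count / rate

print(estimate_duration(source_phoneme_count=4, source_duration_sec=0.8, target_phoneme_count=4))  # 0.8
```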
接下来,语音合成模型将利用声学模型输出的声学参数,使翻译文本中每一音素按照对应的声学参数进行发音,语音合成模型还利用时长模型输出的时长参数,按照指定的时长进行发音,从而合成带有源发音人的音色特征的翻译语音,即得到第二目标语音。
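为直观说明上述“文本分析、声学模型、时长模型、合成发音”的数据流向，下面给出一个示意性的代码草图（其中 text_analyzer、acoustic_model、duration_model、vocoder 等接口均为假设的占位接口，仅表达流程，并非本申请限定的实现）：

```python
# 极简示意：基于参数的语音合成流程（所有接口均为假设，仅演示数据流向）
def synthesize(translated_text, text_analyzer, acoustic_model, duration_model, vocoder):
    # 1. 文本分析：把翻译文本拆分成音素序列
    phonemes = text_analyzer(translated_text)
    # 2. 声学模型：为每个音素预测声学参数（频谱、基频等，携带源发音人音色）
    acoustic_params = [acoustic_model(p) for p in phonemes]
    # 3. 时长模型：为每个音素预测发音时长
    durations = [duration_model(p) for p in phonemes]
    # 4. 声码器：按声学参数与时长合成翻译后的语音波形
    return vocoder(acoustic_params, durations)

# 用法示意：用最简单的占位函数演示调用方式
if __name__ == "__main__":
    analyzer = lambda text: list(text)                      # 占位：按字符当作“音素”
    acoustic = lambda p: {"f0": 120.0, "spectrum": [0.0]}   # 占位：固定声学参数
    duration = lambda p: 0.1                                # 占位：固定时长 0.1 秒
    vocoder = lambda params, durs: b""                      # 占位：返回空波形
    wav = synthesize("hello", analyzer, acoustic, duration, vocoder)
```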
在本实施例的一种实现方式中,可以按照下述方式实现S204,具体可以包括以下步骤:
步骤A:将所述翻译文本按照预设大小的文本单位进行切分,得到各个目标文本单位。
将翻译文本按照预设大小的文本单位进行划分，比如，当翻译文本为中文文本时，可以以音素、字节或字等为单位进行划分，又比如，当翻译文本为英文文本时，可以以音素、单词等为单位进行划分。为便于区分，本实施例将从翻译文本中划分出的每一文本单位定义为目标文本单位。
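下面用一个示意性的代码草图说明“按预设大小的文本单位切分”这一步骤（其中的发音词典 PRON_DICT 及词典缺失时的退化规则均为假设的举例）：

```python
# 示意性草图：把翻译文本按预设大小的文本单位切分（发音词典为假设数据）
PRON_DICT = {"hello": ["hh", "ah", "l", "ow"], "world": ["w", "er", "l", "d"]}

def split_text(text, unit="phoneme"):
    words = text.lower().split()
    if unit == "word":            # 以单词为文本单位
        return words
    units = []                    # 以音素为文本单位：查假设的发音词典
    for w in words:
        units.extend(PRON_DICT.get(w, list(w)))  # 词典缺失时退化为按字母切分
    return units

print(split_text("hello world"))            # ['hh', 'ah', 'l', 'ow', 'w', 'er', 'l', 'd']
print(split_text("hello world", "word"))    # ['hello', 'world']
```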
步骤B:获取各个目标文本单位的声学参数,其中,所述声学参数携带了所述源发音人的音色特征。
本实施例可以利用图3所示声学模型,获取每一目标文本单位的声学参数,由于该声学模型是属于源发音人的声学模型,所以,利用该声学模型获取的声学参数将具有源发音人的音色特征。
需要说明的是,图3所示声学模型的构建方法以及如何利用该声学模型获取目标文本单位的声学参数,将在后续第三实施例中进行具体介绍。
步骤C:根据各个目标文本单位的声学参数,对所述翻译文本进行语音合成,生成第二目标语音。
当通过步骤B获取到翻译文本中每一目标文本单位的声学参数（比如，该声学参数可以包括频谱、基频等参数）后，图3所示的语音合成模型可以使每一目标文本单位按照对应的声学参数进行发音，从而将翻译文本合成为具有源发音人的音色特征的第二目标语音。
综上,本实施例提供的一种语音翻译方法,当获取到源发音人的第一目标语音后,对第一目标语音的语音识别文本进行文本翻译,然后,通过获取翻译文本中每一文本单位的声学参数进行语音合成,生成第二目标语音。由于声学参数中携带了源发音人的音色特征,使得翻译后语音也具有源发音人的音色特征,从而使得该翻译后语音听起来更像是源发音人直接说出的语音。
第三实施例
本实施例将介绍第二实施例中声学模型的构建方法,以及,介绍第二实施例中步骤B的具体实现方式,即如何利用该声学模型获取目标文本单位的声学参数。
在本实施例中，当源发音人首次拿到翻译机后，可以按照说明书提示进行录音，用以构建声学模型；录音所用的语种是可选的，源发音人可以根据自己的朗读能力进行语种选择，也就是说，源发音人选择的录音语种，可以与翻译后语音（即第二目标语音）的语种相同或不同。本实施例将分别基于上述两种不同的语种选择结果，对声学模型的构建方法进行具体介绍。
在声学模型的第一种构建方法中,源发音人选择的录音语种,与翻译后语音(即第二目标语音)的语种相同,下面对该模型构建方法进行具体介绍。
参见图4,为本实施例提供的一种声学模型构建方法的流程示意图,该方法包括以下步骤:
S401:获取所述源发音人的第一样本语音,其中,所述第一样本语音的语种与所述第二目标语音的语种相同。
在本实施例中,为了使翻译后语音即第二目标语音,能够按照源发音人的音色特征进行发音,可以获取源发音人的一段录音,该段录音可以与翻译后语音的语种相同,并且,该段录音的对应文本,应尽量涵盖该文本语种的所有音素内容。
为便于区分,本实施例将该段录音定义为第一样本语音。
现以翻译前语音即第一目标语音为中文语音、翻译后语音即第二目标语音为英文语音为例,首先,确认源发音人是否有正常朗读英文的能力,比如,翻译机可以询问源发音人是否可以朗读英文,若源发音人通过语音或按键等形式回复“可以朗读英文”,则翻译机可以给出一段少量的固定英文文本并提示源发音人朗读该固定英文文本,该固定英文文本尽量涵盖所有的英文音素,源发音人对该固定英文文本进行朗读,以便翻译机获取该固定英文文本的语音,该语音即为所述第一样本语音。
S402:将所述第一样本语音的识别文本按照所述预设大小的文本单位进行切分,得到各个第一样本文本单位。
当获取到第一样本语音后,通过语音识别技术,比如基于人工神经网络的语音识别技术,将第一样本语音转换成语音识别文本。然后,将该语音识别文本按照预设大小的文本单位(与第二实施例中步骤A的划分单位相同)进行划分,比如以音素为单位进行划分,为便于区分,本实施例将从该语音识别文本中划分出的每一文本单位定义为第一样本文本单位。
S403:从所述第一样本语音中提取与所述第一样本文本单位对应的第一语音片段,并从所述第一语音片段中提取声学参数。
按照对第一样本语音的识别文本进行的文本划分方式,对第一样本语音进行划分,这样,便可以确定每一第一样本文本单位在第一样本语音中对应的语音片段,比如,将第一样本语音的识别文本以及第一样本语音,均以音素为单位进行划分,从而得到该识别文本中每一音素对应的语音片段。为便于区分,本实施例将第一样本文本单位对应的语音片段定义为第一语音片段。
对于每一第一样本文本单位,从与其对应的第一语音片段中提取相应的声学参数,如频谱、基频等,这样,便获取到了源发音人的音色特征数据。
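作为从语音片段中提取频谱、基频等声学参数的一个示意性草图（假设使用开源库 librosa，采样率、梅尔维数等参数均为示例取值，并非本申请限定的实现）：

```python
# 示意性草图：从某一文本单位对应的语音片段中提取基频与频谱（librosa 为假设使用的开源库）
import numpy as np
import librosa

def extract_acoustic_params(segment, sr=16000):
    # 基频（F0）：使用 pyin 估计，取有声帧的均值作为该片段的基频特征
    f0, voiced_flag, _ = librosa.pyin(segment,
                                      fmin=librosa.note_to_hz("C2"),
                                      fmax=librosa.note_to_hz("C7"),
                                      sr=sr)
    f0_mean = float(np.nanmean(f0)) if np.any(voiced_flag) else 0.0
    # 频谱：这里以梅尔频谱的帧均值作为示例性的频谱特征
    mel = librosa.feature.melspectrogram(y=segment, sr=sr, n_mels=40)
    return {"f0": f0_mean, "spectrum": mel.mean(axis=1)}

# 用法示意：segment 可以来自对第一样本语音按音素边界切分后的波形片段
# wav, sr = librosa.load("sample.wav", sr=16000)
# params = extract_acoustic_params(wav[start:end], sr)
```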
S404:利用各个第一样本文本单位以及与所述第一样本文本单位对应的声学参数,构建第一声学模型。
可以将各个第一样本文本单位、以及每一第一样本文本单位对应的声学参数进行存储,以形成第一数据集合。以第一数据集合中的文本单位为音素为例,需要说明的是,如果第一数据集合无法涵盖翻译后语种的所有音素,可以将未涵盖的音素以及为这些音素设置的默认声学参数,添加至第一数据集合中。这样,便可以基于第一数据集合中第一样本文本单位与声学参数之间的对应关系,构建属于源发音人的声学模型,具体构建时,直接将第一数据集合作为训练数据,训练源发音人的声学模型,训练过程与现有技术相同,本实施例将构建的声学模型定义为第一声学模型。
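第一数据集合的组织方式可以参考下面的示意性草图（数据结构、默认声学参数等均为假设的举例）：

```python
# 示意性草图：构建“第一样本文本单位 -> 声学参数”的第一数据集合（数值均为假设）
from collections import defaultdict

def build_first_dataset(sample_units, unit_params, all_phonemes, default_params):
    """sample_units: 第一样本文本单位列表; unit_params: 与之一一对应的声学参数列表;
    all_phonemes: 翻译后语种的全部音素; default_params: 未涵盖音素的默认声学参数"""
    dataset = defaultdict(list)
    for unit, params in zip(sample_units, unit_params):
        dataset[unit].append(params)                 # 同一音素可能出现多次，保留全部样本
    for ph in all_phonemes:
        if ph not in dataset:
            dataset[ph].append(default_params)       # 补齐未涵盖的音素
    return dict(dataset)

dataset = build_first_dataset(["n", "i", "h", "ao"],
                              [{"f0": 210}, {"f0": 190}, {"f0": 0}, {"f0": 200}],
                              ["n", "i", "h", "ao", "e"],
                              {"f0": 150})
# 之后即可把 dataset 作为训练数据训练属于源发音人的第一声学模型
```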
在一种实施方式中,该声学模型可以实现第二实施例中的步骤B“获取各个目标文本单位的声学参数”,具体可以包括:利用所述第一声学模型,获取各个目标文本单位的声学参数。在本实施方式中,利用源发音人的声学模型即第一声学模型,直接生成每一目标文本单位的声学参数,具体生成方法可以与现有技术相同,比如,该生成方法可以是现有的基于参数的语音合成方法。
在声学模型的第二种构建方法中,源发音人选择的录音语种,与翻译后语音(即第二目标语音)的语种不同,下面对该模型构建方法进行具体介绍。
参见图5,为本实施例提供的另一种声学模型构建方法的流程示意图,该方法包括以下步骤:
S501:获取所述源发音人的第二样本语音,其中,所述第二样本语音的语种与所述第二目标语音的语种不同。
在本实施例中,为了使翻译后语音即第二目标语音,能够按照源发音人的音色特征进行发音,可以获取源发音人的一段录音,该段录音可以与翻译后语音的语种不同,比如该段录音可以与翻译前语音即第一目标语音的语种相同,并且,该段录音的对应文本,应尽量涵盖该文本语种的所有音素内容。
为便于区分，本实施例将该段录音定义为第二样本语音。
现仍以翻译前语音即第一目标语音为中文语音、翻译后语音即第二目标语音为英文语音为例，首先，确认源发音人是否有正常朗读英文的能力，比如，翻译机可以询问源发音人是否可以朗读英文，若源发音人通过语音或按键等形式回复“不可以朗读英文”，则翻译机可以提供语种选择项，假设源发音人选择中文，则翻译机可以给出一段少量的固定中文文本并提示源发音人朗读该固定中文文本，该固定中文文本尽量涵盖所有的中文音素，源发音人对该固定中文文本进行朗读，以便翻译机获取该固定中文文本的语音，该语音即为所述第二样本语音。
S502:将所述第二样本语音的识别文本按照所述预设大小的文本单位进行切分,得到各个第二样本文本单位。
当获取到第二样本语音后,通过语音识别技术,比如基于人工神经网络的语音识别技术,将第二样本语音转换成语音识别文本。然后,将该语音识别文本按照预设大小的文本单位(与第二实施例中步骤A的划分单位相同)进行划分,比如以音素为单位进行划分,为便于区分,本实施例将从该语音识别文本中划分出的每一文本单位定义为第二样本文本单位。
S503:将所述第二样本文本单位进行转换,得到第一转换文本单位,其中,所述第一转换文本单位是所述第二目标语音的语种所使用的文本单位。
对于每一第二样本文本单位,需要将该第二样本文本单位转换成翻译后语种对应的文本单位,本实施例将转换后的文本单位定义为第一转换文本单位。例如,假设第二样本文本单位为中文音素、翻译后语种为英文,则第一转换文本单位为英文音素。
需要说明的是,具体的文本单位转换方式,将在后续第四实施例中进行具体介绍。
S504:从所述第二样本语音中提取与所述第二样本文本单位对应的第二语音片段,并从所述第二语音片段中提取声学参数,得到与所述第一转换文本单位对应的声学参数。
按照对第二样本语音的识别文本进行的文本划分方式,对第二样本语音进行划分,这样,便可以确定每一第二样本文本单位在第二样本语音中对应的语音片段,比如,将第二样本语音的识别文本以及第二样本语音,均以音素为单位进行划分,从而得到该识别文本中每一音素对应的语音片段。为便于区分,本实施例将第二样本文本单位对应的语音片段定义为第二语音片段。
对于每一第二样本文本单位,从与其对应的第二语音片段中提取相应的声学参数,如频谱、基频等,将其作为与第二样本文本单位对应的第一转换文本单位的声学参数。
S505:利用各个第二样本文本单位、与所述第二样本文本单位对应的第一转换文本单位、以及与所述第一转换文本单位对应的声学参数,构建第二声学模型。
可以将各个第二样本文本单位、与每一第二样本文本单位对应的第一转换文本单位、以及每一第一转换文本单位对应的声学参数进行存储,以形成第二数据集合。以第二数据集合中的文本单位为音素为例,需要说明的是,如果第二数据集合无法涵盖翻译后语种的所有音素,可以将未涵盖的音素以及为这些音素设置的默认声学参数,添加至第二数据集合中。这样,便可以基于第二数据集合中转换前音素与转换后音素、以及转换后音素与声学参数之间的对应关系,构建属于源发音人的声学模型,具体构建时,直接将第二数据集合作为训练数据,训练源发音人的声学模型,训练过程与现有技术相同,本实施例将构建的声学模型定义为第二声学模型。
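第二数据集合的三元对应关系可以参考下面的示意性草图（其中转换后音素以 ASCII 近似表示，声学参数取值均为假设的举例）：

```python
# 示意性草图：第二数据集合把“转换前音素 -> 转换后音素 -> 声学参数”组织在一起（取值均为假设）
second_dataset = [
    {"source_unit": "n",  "converted_unit": "n",  "params": {"f0": 210, "spectrum": [0.1]}},
    {"source_unit": "i",  "converted_unit": "i:", "params": {"f0": 190, "spectrum": [0.2]}},
    {"source_unit": "h",  "converted_unit": "h",  "params": {"f0": 0,   "spectrum": [0.3]}},
    {"source_unit": "ao", "converted_unit": "au", "params": {"f0": 200, "spectrum": [0.4]}},
]

# 训练第二声学模型时，可按“转换后音素 -> 声学参数”的对应关系整理训练数据
training_pairs = [(item["converted_unit"], item["params"]) for item in second_dataset]
```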
在一种实施方式中,该声学模型可以实现第二实施例中的步骤B“获取各个目标文本单位的声学参数”,具体可以包括:利用所述第二声学模型,获取各个目标文本单位的声学参数。在本实施方式中,利用源发音人的声学模型即第二声学模型,直接生成每一目标文本单位的声学参数,具体生成方法可以与现有技术相同,比如,该生成方法可以是现有的基于参数的语音合成方法。
综上,本实施例提供的一种语音翻译方法,当获取到源发音人的第一目标语音后,对第一目标语音的语音识别文本进行文本翻译,然后,通过获取翻译文本中每一文本单位的声学参数进行语音合成,生成第二目标语音。其中,可以通过预先构建源发音人的声学模型来确定每一文本单位的声学参数,由于声学参数中携带了源发音人的音色特征,使得翻译后语音也具有源发音人的音色特征,从而使得该翻译后语音听起来更像是源发音人直接说出的语音。
第四实施例
本实施例将介绍第三实施例中S503的具体实现方式，为了实现S503，需要预先构建文本单位映射模型，以便利用该文本单位映射模型的文本单位转换功能实现S503。本实施例介绍了两种文本单位映射模型的构建方法。
在文本单位映射模型的第一种构建方法中,直接建立两种语种的文本单位序列之间的对应关系,根据该对应关系实现文本单位之间的转换,下面对该模型构建方法进行具体介绍。
如图6所示,为本实施例提供的一种样本文本单位收集方法的流程示意图,该方法包括以下步骤:
S601:收集多个第一样本文本,其中,所述第一样本文本的语种与所述第二样本语音的语种相同。
为了实现S503，即为了将第二样本语音（源发音人的录制语音）的识别文本中的各个第二样本文本单位，对应转换成翻译后语种所使用的文本单位，需要预先收集与第二样本语音的语种相同的大量文本语料，本实施例将收集的每一文本语料定义为第一样本文本。本实施例不限制所述第一样本文本的形式，所述第一样本文本可以是一个词、一句话或一段话。
例如,假设第二样本语音为中文语音,那么,需要预先收集大量的中文文本语料(如图7所示),每一中文文本即为第一样本文本。
S602:将所述第一样本文本按照所述预设大小的文本单位进行切分,得到各个第三样本文本单位。
将该第一样本文本按照预设大小的文本单位进行划分（与第二实施例中步骤A的划分单位相同），比如以音素为单位进行划分，为便于区分，本实施例将从该第一样本文本中划分出的每一文本单位定义为第三样本文本单位。
继续上个步骤的例子,假设第一样本文本为中文文本,需要将该中文文本转换成中文拼音,并对该中文拼音中的每一中文音素进行标记,得到中文音素序列(如图7所示),比如,中文文本“你好”,可以得到中文拼音“[n i][h ao]”,并从中依次标记出“n”、“i”、“h”、“ao”这四个中文音素,即四个第三样本文本单位。
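作为把中文文本切分为中文音素序列的一个示意性草图（假设使用开源库 pypinyin 获取声母与韵母；若不使用该库，也可以改为查固定的拼音表）：

```python
# 示意性草图：把中文文本切分为中文音素序列（pypinyin 为假设使用的开源库）
from pypinyin import lazy_pinyin, Style

def chinese_text_to_phonemes(text):
    initials = lazy_pinyin(text, style=Style.INITIALS)   # 每个字的声母，如 ['n', 'h']
    finals = lazy_pinyin(text, style=Style.FINALS)       # 每个字的韵母，如 ['i', 'ao']
    phonemes = []
    for ini, fin in zip(initials, finals):
        if ini:                     # 有的字没有声母（零声母），此时只保留韵母
            phonemes.append(ini)
        phonemes.append(fin)
    return phonemes

print(chinese_text_to_phonemes("你好"))   # ['n', 'i', 'h', 'ao']
```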
S603:将所述第三样本文本单位进行转换,得到第二转换文本单位,其中,所述第二转换文本单位是所述第三样本文本单位以所述第二目标语音的发音方式进行发音的文本单位。
可以将第一样本文本以翻译后语音即第二目标语音的发音方式来标注读音,这样,对于第一样本文本中的每一第三样本文本单位,可以从该标注读音中找到与之对应的文本单位,为便于区分,本实施例将该对应的文本单位定义为第二转换文本单位。
继续上个步骤的例子，假设第一样本文本为中文文本“你好”、翻译后语音即第二目标语音为英文语音，那么，“你好”可以通过英文音标的方式来标注读音（音标原文为图片 PCTCN2018095766-appb-000001，无法以文本呈现），并从中依次标记出“n”、[与“i”对应的英文元音音标，见图片appb-000002]、“h”、[与“ao”对应的英文元音音标，见图片appb-000003]这四个英文音素，即四个第二转换文本单位。这样，上述四个中文形式的第三样本文本单位“n”、“i”、“h”、“ao”，依次对应这四个英文形式的第二转换文本单位“n”、[图片appb-000004所示音标]、“h”、[图片appb-000005所示音标]。
可以理解的是，同一中文汉字（比如“岳”）在不同中文词语或句子中的发音方式可能不同，因此，组成该汉字的第三样本文本单位对应的第二转换文本单位也可能不同。当然，这种情形同样存在于其它语种，但在本实施例中，只要转换前后的音素标记内容遵循固定的发音规则即可。
基于上述内容,可以将各个第三样本文本单位、以及每一第三样本文本单位对应的第二转换文本单位进行存储,以形成文本单位集合。需要说明的是,由于该文本单位集合中的第二转换文本单位属于翻译后语种的音素,因此,应尽量使该文本单位集合中的第二转换文本单位覆盖翻译后语种的所有文本单位。
在构建文本单位映射模型时,可以直接对该文本单位集合中的第三样本文本单位与其对应的第二转换文本单位做表格式的映射,基于此,文本单位映射模型便可以基于文本单位之间的映射关系,实现第三实施例中的步骤S503。
在第一种实现方式中,步骤S503“将所述第二样本文本单位进行转换,得到第一转换文本单位”具体可以包括:确定与所述第二样本文本单位相同的第三样本文本单位;将所确定的第三样本文本单位对应的第二转换文本单位,作为第一转换文本单位。在本实施方式中,对于每一第二样本文本单位,从上述音素集合中查询与该第二样本文本单位相同的第三样本文本单位,并基于音素映射关系,确定与该第三样本文本单位对应的第二转换文本单位,将其作为该第二样本文本单位的转换后音素即第一转换文本单位。
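上述第一种实现方式可以用一张音素映射表直接完成查询，示意性草图如下（映射表内容为假设的举例，实际映射表应尽量覆盖翻译后语种的全部文本单位）：

```python
# 示意性草图：基于映射表把第二样本文本单位转换为第一转换文本单位（映射表为假设举例）
PHONEME_MAP = {"n": "n", "i": "i:", "h": "h", "ao": "au"}

def convert_units(second_sample_units, phoneme_map=PHONEME_MAP):
    converted = []
    for unit in second_sample_units:
        # 确定与第二样本文本单位相同的第三样本文本单位，再取其对应的第二转换文本单位
        converted.append(phoneme_map.get(unit, unit))   # 表中缺失时原样保留，仅为兜底策略
    return converted

print(convert_units(["n", "i", "h", "ao"]))   # ['n', 'i:', 'h', 'au']
```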
在文本单位映射模型的第二种构建方法中,训练两种语种的文本单位序列之间的网络模型,比如图7所示编解码模型,将该网络模型作为文本单位映射模型,通过该文本单位映射模型可以使文本单位映射结果更准确,下面对该模型构建方法进行具体介绍。
在第二种构建方式中,参见图8所示的一种编解码模型构建方法的流程示意图,包括以下步骤:
S801:收集多个第二样本文本,其中,所述第二样本文本的语种与所述第二样本语音的语种相同。
需要说明的是,本步骤S801与步骤S601类似,只需将S601中的第一样本文本替换为第二样本文本即可,相关内容请参见S601的相关介绍,在此不再赘述。
S802:将所述第二样本文本按照所述预设大小的文本单位进行切分,得到各个第四样本文本单位。
需要说明的是,本步骤S802与步骤S602类似,只需将S602中的第一样本文本替换为第二样本文本、将第三样本文本单位替换为第四样本文本单位即可,相关内容请参见S602的相关介绍,在此不再赘述。
S803:将所述第四样本文本单位进行转换,得到第三转换文本单位,其中,所述第三转换文本单位是所述第四样本文本单位以所述第二目标语音的发音方式进行发音的文本单位。
需要说明的是,本步骤S803与步骤S603类似,只需将S603中的第三样本文本单位替换为第四样本文本单位、第二转换文本单位替换为第三转换文本单位即可,相关内容请参见S603的相关介绍,在此不再赘述。
S804:对于所述第二样本文本中的音节,通过学习属于同一音节的第四样本文本单位在对应音节中的组合关系和顺序关系、学习至少两个连续音节在所述第二样本文本中的组合关系和顺序关系、以及学习至少两个连续音节中的第四样本文本单位在所述第二样本文本中的组合关系和顺序关系,构建编解码模型。
在本实施例中，可以利用第四样本文本单位序列以及第三转换文本单位序列，训练这两种语种的文本单位体系之间的网络模型，该网络模型可以包括图7所示的编码网络和解码网络。后续将以第四样本文本单位序列为中文音素序列、第三转换文本单位序列为英文音素序列为例，对该编解码模型进行介绍。
具体地,通过加入一层音节信息来实现所述编码网络对不同音节之间的衔接处理能力,达到优化音节内的音素组合和整体音素映射的作用。所述编码网络可以包含三个编码过程,分别为音节内音素的编码过程、音节间的编码过程、文本中的所有音素的编码过程,每次编码时,后面的编码需要考虑前面编码的结果,下面以图9为例介绍所述编码网络的编码过程。
如图9所示,假设收集到的某第二样本文本为中文文本比如“你好”,则第四样本文本单位序列为“n”、“i”、“h”、“ao”。首先,将属于该中文文本的所有中文音素“n”、“i”、“h”、“ao”统一进行向量化处理,比如使用Word2Vector等方法,并将属于同一音节的中文音素之间通过一次双向长短期记忆神经网络(Bidirectional Long Short-term Memory,BLSTM)进行编码,得到的编码结果包含了音节内音素与音素之间的关系,即,学习“n”与“i”之间的组合关系和顺序关系对应于汉语音节“ni”,以及,学习“h”与“ao”之间的组合关系和顺序关系对应于汉语音节“hao”。
然后,对该中文文本的所有音节“ni”、“hao”进行向量化处理,比如使用Word2Vector等方法,在获取了第一层BLSTM网络(即图9所示的音节内音素学习网络)的编码结果后,将第一层编码结果结合每个音节的向量,通过一次音节之间双向BLSTM网络编码,得到的编码结果包含音节与音节之间的关系,即,学习“ni”与“hao”之间的组合关系和顺序关系对应于中文文本“你好”。
最后,将第二层BLSTM网络(即图9所示的音节间学习网络)的编码结果,结合每个音节中所有音素的向量特征进行第三层BLSTM编码,得到相应编码结果包含了该中文文本中音素与音素之间的关系,即,学习“n”、“i”、“h”、“ao”之间的组合关系和顺序关系对应于中文文本“你好”。
经上述三层编码后，将第三层编码结果作为图7所示解码网络的输入，图7所示的解码网络将对应输出英文音素序列“n”、[图片PCTCN2018095766-appb-000006所示音标]、“h”、[图片PCTCN2018095766-appb-000007所示音标]。
可以理解的是，当使用大量中文文本对编解码模型进行训练时，编解码模型学习了两个或两个以上音节之间的组合关系和顺序关系，也学习了每一音节的各个音素在该音节中的组合关系和顺序关系。当需要将某中文文本的中文音素序列转换为英文音素序列时，基于这种学习结果，可以将该中文文本的中文音素序列，按照其在该中文文本中的组合关系和顺序关系，选择与之更为搭配的英文音素序列，而且，不论该中文文本是较短的词语还是较长的句子，对应的英文音素序列均具有较好的衔接效果，这种方式使得音素序列之间的对应结果更灵活准确。
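下面给出一个与图9思路相近的分层编码网络的示意性草图（假设使用 PyTorch 实现；嵌入维度、隐层维度等超参数均为示例取值，且做了大量简化，仅用于说明三层编码的组织方式，并非本申请限定的实现）：

```python
# 示意性草图：分层 BLSTM 编码（音节内音素 -> 音节间 -> 全部音素），超参数均为示例取值
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, num_phonemes, num_syllables, emb_dim=32, hidden=64):
        super().__init__()
        self.ph_emb = nn.Embedding(num_phonemes, emb_dim)      # 音素向量化（起到类似 Word2Vector 的作用）
        self.syl_emb = nn.Embedding(num_syllables, emb_dim)    # 音节向量化
        self.intra_syllable = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.inter_syllable = nn.LSTM(emb_dim + 2 * hidden, hidden, bidirectional=True, batch_first=True)
        self.all_phoneme = nn.LSTM(emb_dim + 2 * hidden, hidden, bidirectional=True, batch_first=True)

    def forward(self, phoneme_ids, syllable_ids, syllable_index):
        # phoneme_ids: [T] 音素序列; syllable_ids: [S] 音节序列;
        # syllable_index: [T] 每个音素属于第几个音节
        ph_vec = self.ph_emb(phoneme_ids).unsqueeze(0)                        # [1, T, E]
        intra_out, _ = self.intra_syllable(ph_vec)                            # 第一层：音节内音素编码
        # 把每个音节内的音素编码取均值，作为该音节的上下文特征
        syl_ctx = torch.stack([intra_out[0][syllable_index == s].mean(dim=0)
                               for s in range(syllable_ids.size(0))]).unsqueeze(0)   # [1, S, 2H]
        syl_vec = self.syl_emb(syllable_ids).unsqueeze(0)                     # [1, S, E]
        inter_out, _ = self.inter_syllable(torch.cat([syl_vec, syl_ctx], dim=-1))    # 第二层：音节间编码
        # 把每个音素所属音节的第二层编码取回，与音素向量拼接后做第三层编码
        per_ph_syl = inter_out[0][syllable_index].unsqueeze(0)                # [1, T, 2H]
        final_out, _ = self.all_phoneme(torch.cat([ph_vec, per_ph_syl], dim=-1))      # 第三层：全部音素编码
        return final_out                                                       # 交给解码网络输出目标语种音素序列

# 用法示意："你好" -> 音素 [n, i, h, ao]（编号 0..3），音节 [ni, hao]（编号 0..1）
enc = HierarchicalEncoder(num_phonemes=10, num_syllables=10)
out = enc(torch.tensor([0, 1, 2, 3]), torch.tensor([0, 1]), torch.tensor([0, 0, 1, 1]))
print(out.shape)   # torch.Size([1, 4, 128])
```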
需要说明的是,编解码模型不限于在中文音素序列与英文音素序列之间的训练,其适用于任意两种不同语种之间。
基于上述内容,便可以基于编解码模型的学习结果,实现第三实施例中的步骤S503。在第二种实现方式中,步骤S503“将所述第二样本文本单位进行转换,得到第一转换文本单位”具体可以包括:利用所述编解码模型,将所述第二样本文本单位进行转换,得到第一转换文本单位。在本实施方式中,将所述第二样本文本单位作为预先构建的编解码模型的输入,输出即可得到转换后的第一转换文本单位,在转换过程中,编解码模型可以基于上述学习结果,根据各个第二样本文本单位之间的组合关系和顺序关系,选择与每一第二样本文本单位搭配的第一转换文本单位,相对于S503的第一种实现方式,由于本实现方式预先学习了不同语种的文本单位序列之间的实际搭配方式,使得转换后的文本单位更为准确。
综上，本实施例提供的一种语音翻译方法，对于源发音人录音的识别文本，当需要将其文本单位序列转换为翻译后语种的文本单位序列时，可以预先构建文本单位映射模型，该模型既可以基于不同语种的文本单位序列之间的对应关系构建，也可以通过训练编解码网络来构建，再通过该文本单位映射模型进行文本单位转换，即可获取需要的文本单位转换结果。
第五实施例
参见图10,为本实施例提供的一种语音翻译装置的组成示意图,该语音翻译装置1000包括:
语音获取单元1001,用于获取源发音人的第一目标语音;
语音翻译单元1002,用于通过对所述第一目标语音进行语音翻译,生成第二目标语音,其中,所述第二目标语音的语种与所述第一目标语音的语种不同,所述第二目标语音携带了所述源发音人的音色特征。
在本实施例的一种实现方式中,所述语音翻译单元1002可以包括:
文本识别子单元,用于通过对所述第一目标语音进行语音识别,生成语音识别文本;
文本翻译子单元,用于通过对所述语音识别文本进行文本翻译,生成翻译文本;
语音翻译子单元,用于通过对所述翻译文本进行语音合成,生成第二目标语音。
在本实施例的一种实现方式中,所述语音翻译子单元可以包括:
目标单位划分子单元,用于将所述翻译文本按照预设大小的文本单位进行切分,得到各个目标文本单位;
声学参数获取子单元,用于获取各个目标文本单位的声学参数,其中,所述声学参数携带了所述源发音人的音色特征;
翻译语音生成子单元,用于根据各个目标文本单位的声学参数,对所述翻译文本进行语音合成,生成第二目标语音。
在本实施例的一种实现方式中,所述装置1000还可以包括:
第一样本获取单元,用于获取所述源发音人的第一样本语音,其中,所述第一样本语音的语种与所述第二目标语音的语种相同;
第一样本划分单元,用于将所述第一样本语音的识别文本按照所述预设大小的文本单位进行切分,得到各个第一样本文本单位;
第一片段提取单元,用于从所述第一样本语音中提取与所述第一样本文本单位对应的第一语音片段;
第一参数提取单元,用于从所述第一语音片段中提取声学参数;
第一模型构建单元,用于利用各个第一样本文本单位以及与所述第一样本文本单位对应的声学参数,构建第一声学模型;
则,所述声学参数获取子单元,具体可以用于利用所述第一声学模型,获取各个目标文本单位的声学参数。
在本实施例的一种实现方式中,所述装置1000还可以包括:
第二样本获取单元,用于获取所述源发音人的第二样本语音,其中,所述第二样本语音的语种与所述第二目标语音的语种不同;
第二样本划分单元，用于将所述第二样本语音的识别文本按照所述预设大小的文本单位进行切分，得到各个第二样本文本单位；
文本单位转换单元,用于将所述第二样本文本单位进行转换,得到第一转换文本单位,其中,所述第一转换文本单位是所述第二目标语音的语种所使用的文本单位;
第二片段提取单元,用于从所述第二样本语音中提取与所述第二样本文本单位对应的第二语音片段;
第二参数提取单元,用于从所述第二语音片段中提取声学参数,得到与所述第一转换文本单位对应的声学参数;
第二模型构建单元,用于利用各个第二样本文本单位、与所述第二样本文本单位对应的第一转换文本单位、以及与所述第一转换文本单位对应的声学参数,构建第二声学模型;
则,所述声学参数获取子单元,具体可以用于利用所述第二声学模型,获取各个目标文本单位的声学参数。
在本实施例的一种实现方式中,所述装置1000还可以包括:
第一文本收集单元,用于收集多个第一样本文本,其中,所述第一样本文本的语种与所述第二样本语音的语种相同;
第三样本划分单元,用于将所述第一样本文本按照所述预设大小的文本单位进行切分,得到各个第三样本文本单位;
第一单位转换单元,用于将所述第三样本文本单位进行转换,得到第二转换文本单位,其中,所述第二转换文本单位是所述第三样本文本单位以所述第二目标语音的发音方式进行发音的文本单位;
则,所述文本单位转换单元可以包括:
相同单位确定子单元,用于确定与所述第二样本文本单位相同的第三样本文本单位;
文本单位转换子单元,用于将所确定的第三样本文本单位对应的第二转换文本单位,作为第一转换文本单位。
在本实施例的一种实现方式中,所述装置1000还可以包括:
第二文本收集单元,用于收集多个第二样本文本,其中,所述第二样本文本的语种与所述第二样本语音的语种相同;
第四样本划分单元，用于将所述第二样本文本按照所述预设大小的文本单位进行切分，得到各个第四样本文本单位；
第二单位转换单元,用于将所述第四样本文本单位进行转换,得到第三转换文本单位,其中,所述第三转换文本单位是所述第四样本文本单位以所述第二目标语音的发音方式进行发音的文本单位;
编解码模型构建单元,用于对于所述第二样本文本中的音节,通过学习属于同一音节的第四样本文本单位在对应音节中的组合关系和顺序关系、学习至少两个连续音节在所述第二样本文本中的组合关系和顺序关系、以及学习至少两个连续音节中的第四样本文本单位在所述第二样本文本中的组合关系和顺序关系,构建编解码模型;
则,所述文本单位转换单元,具体可以用于利用所述编解码模型,将所述第二样本文本单位进行转换,得到第一转换文本单位。
第六实施例
参见图11,为本实施例提供的一种语音翻译装置的硬件结构示意图,所述语音翻译装置1100包括存储器1101和接收器1102,以及分别与所述存储器1101和所述接收器1102连接的处理器1103,所述存储器1101用于存储一组程序指令,所述处理器1103用于调用所述存储器1101存储的程序指令执行如下操作:
获取源发音人的第一目标语音;
通过对所述第一目标语音进行语音翻译,生成第二目标语音,其中,所述第二目标语音的语种与所述第一目标语音的语种不同,所述第二目标语音携带了所述源发音人的音色特征。
在本实施例的一种实现方式中,所述处理器1103还用于调用所述存储器1101存储的程序指令执行如下操作:
通过对所述第一目标语音进行语音识别,生成语音识别文本;
通过对所述语音识别文本进行文本翻译,生成翻译文本;
通过对所述翻译文本进行语音合成,生成第二目标语音。
在本实施例的一种实现方式中,所述处理器1103还用于调用所述存储器1101存储的程序指令执行如下操作:
将所述翻译文本按照预设大小的文本单位进行切分,得到各个目标文本单位;
获取各个目标文本单位的声学参数,其中,所述声学参数携带了所述源发音人的音色特征;
根据各个目标文本单位的声学参数,对所述翻译文本进行语音合成,生成第二目标语音。
在本实施例的一种实现方式中,所述处理器1103还用于调用所述存储器1101存储的程序指令执行如下操作:
获取所述源发音人的第一样本语音,其中,所述第一样本语音的语种与所述第二目标语音的语种相同;
将所述第一样本语音的识别文本按照所述预设大小的文本单位进行切分,得到各个第一样本文本单位;
从所述第一样本语音中提取与所述第一样本文本单位对应的第一语音片段;
从所述第一语音片段中提取声学参数;
利用各个第一样本文本单位以及与所述第一样本文本单位对应的声学参数,构建第一声学模型;
利用所述第一声学模型,获取各个目标文本单位的声学参数。
在本实施例的一种实现方式中,所述处理器1103还用于调用所述存储器1101存储的程序指令执行如下操作:
获取所述源发音人的第二样本语音,其中,所述第二样本语音的语种与所述第二目标语音的语种不同;
将所述第二样本语音的识别文本按照所述预设大小的文本单位进行切分,得到各个第二样本文本单位;
将所述第二样本文本单位进行转换,得到第一转换文本单位,其中,所述第一转换文本单位是所述第二目标语音的语种所使用的文本单位;
从所述第二样本语音中提取与所述第二样本文本单位对应的第二语音片段;
从所述第二语音片段中提取声学参数,得到与所述第一转换文本单位对应的声学参数;
利用各个第二样本文本单位、与所述第二样本文本单位对应的第一转换文本单位、以及与所述第一转换文本单位对应的声学参数，构建第二声学模型；
利用所述第二声学模型,获取各个目标文本单位的声学参数。
在本实施例的一种实现方式中,所述处理器1103还用于调用所述存储器1101存储的程序指令执行如下操作:
收集多个第一样本文本,其中,所述第一样本文本的语种与所述第二样本语音的语种相同;
将所述第一样本文本按照所述预设大小的文本单位进行切分,得到各个第三样本文本单位;
将所述第三样本文本单位进行转换,得到第二转换文本单位,其中,所述第二转换文本单位是所述第三样本文本单位以所述第二目标语音的发音方式进行发音的文本单位;
确定与所述第二样本文本单位相同的第三样本文本单位;
将所确定的第三样本文本单位对应的第二转换文本单位,作为第一转换文本单位。
在本实施例的一种实现方式中,所述处理器1103还用于调用所述存储器1101存储的程序指令执行如下操作:
收集多个第二样本文本,其中,所述第二样本文本的语种与所述第二样本语音的语种相同;
将所述第二样本文本按照所述预设大小的文本单位进行切分，得到各个第四样本文本单位；
将所述第四样本文本单位进行转换,得到第三转换文本单位,其中,所述第三转换文本单位是所述第四样本文本单位以所述第二目标语音的发音方式进行发音的文本单位;
对于所述第二样本文本中的音节,通过学习属于同一音节的第四样本文本单位在对应音节中的组合关系和顺序关系、学习至少两个连续音节在所述第二样本文本中的组合关系和顺序关系、以及学习至少两个连续音节中的第四样本文本单位在所述第二样本文本中的组合关系和顺序关系,构建编解码模型;
利用所述编解码模型,将所述第二样本文本单位进行转换,得到第一转换文本单位。
此外,本实施例还提供了一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得计算机执行上述语音翻译方法中的任意一种实现方式。
进一步地,本实施例还提供了一种计算机程序产品,所述计算机程序产品在终端设备上运行时,使得所述终端设备执行上述语音翻译方法中的任意一种实现方式。
通过以上的实施方式的描述可知,本领域的技术人员可以清楚地了解到上述实施例方法中的全部或部分步骤可借助软件加必需的通用硬件平台的方式来实现。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者诸如媒体网关等网络通信设备,等等)执行本申请各个实施例或者实施例的某些部分所述的方法。
需要说明的是,本说明书中各个实施例采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似部分互相参见即可。对于实施例公开的装置而言,由于其与实施例公开的方法相对应,所以描述的比较简单,相关之处参见方法部分说明即可。
还需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。
对所公开的实施例的上述说明,使本领域专业技术人员能够实现或使用本申请。对这些实施例的多种修改对本领域的专业技术人员来说将是显而易见的,本文中所定义的一般原理可以在不脱离本申请的精神或范围的情况下,在其它实施例中实现。因此,本申请将不会被限制于本文所示的这些实施例,而是要符合与本文所公开的原理和新颖特点相一致的最宽的范围。

Claims (11)

  1. 一种语音翻译方法,其特征在于,包括:
    获取源发音人的第一目标语音;
    通过对所述第一目标语音进行语音翻译,生成第二目标语音,其中,所述第二目标语音的语种与所述第一目标语音的语种不同,所述第二目标语音携带了所述源发音人的音色特征。
  2. 根据权利要求1所述的方法,其特征在于,所述通过对所述第一目标语音进行语音翻译,生成第二目标语音,包括:
    通过对所述第一目标语音进行语音识别,生成语音识别文本;
    通过对所述语音识别文本进行文本翻译,生成翻译文本;
    通过对所述翻译文本进行语音合成,生成第二目标语音。
  3. 根据权利要求2所述的方法,其特征在于,所述通过对所述翻译文本进行语音合成,生成第二目标语音,包括:
    将所述翻译文本按照预设大小的文本单位进行切分,得到各个目标文本单位;
    获取各个目标文本单位的声学参数,其中,所述声学参数携带了所述源发音人的音色特征;
    根据各个目标文本单位的声学参数,对所述翻译文本进行语音合成,生成第二目标语音。
  4. 根据权利要求3所述的方法,其特征在于,所述方法还包括:
    获取所述源发音人的第一样本语音,其中,所述第一样本语音的语种与所述第二目标语音的语种相同;
    将所述第一样本语音的识别文本按照所述预设大小的文本单位进行切分,得到各个第一样本文本单位;
    从所述第一样本语音中提取与所述第一样本文本单位对应的第一语音片段;
    从所述第一语音片段中提取声学参数;
    利用各个第一样本文本单位以及与所述第一样本文本单位对应的声学参数,构建第一声学模型;
    则,所述获取各个目标文本单位的声学参数,包括:
    利用所述第一声学模型,获取各个目标文本单位的声学参数。
  5. 根据权利要求3所述的方法,其特征在于,所述方法还包括:
    获取所述源发音人的第二样本语音,其中,所述第二样本语音的语种与所述第二目标语音的语种不同;
    将所述第二样本语音的识别文本按照所述预设大小的文本单位进行切分,得到各个第二样本文本单位;
    将所述第二样本文本单位进行转换,得到第一转换文本单位,其中,所述第一转换文本单位是所述第二目标语音的语种所使用的文本单位;
    从所述第二样本语音中提取与所述第二样本文本单位对应的第二语音片段;
    从所述第二语音片段中提取声学参数,得到与所述第一转换文本单位对应的声学参数;
    利用各个第二样本文本单位、与所述第二样本文本单位对应的第一转换文本单位、以及与所述第一转换文本单位对应的声学参数,构建第二声学模型;
    则,所述获取各个目标文本单位的声学参数,包括:
    利用所述第二声学模型,获取各个目标文本单位的声学参数。
  6. 根据权利要求5所述的方法,其特征在于,所述方法还包括:
    收集多个第一样本文本,其中,所述第一样本文本的语种与所述第二样本语音的语种相同;
    将所述第一样本文本按照所述预设大小的文本单位进行切分,得到各个第三样本文本单位;
    将所述第三样本文本单位进行转换,得到第二转换文本单位,其中,所述第二转换文本单位是所述第三样本文本单位以所述第二目标语音的发音方式进行发音的文本单位;
    则,所述将所述第二样本文本单位进行转换,得到第一转换文本单位,包括:
    确定与所述第二样本文本单位相同的第三样本文本单位;
    将所确定的第三样本文本单位对应的第二转换文本单位,作为第一转换文本单位。
  7. 根据权利要求5所述的方法,其特征在于,所述方法还包括:
    收集多个第二样本文本,其中,所述第二样本文本的语种与所述第二样本语音的语种相同;
    将所述第二样本文本按照所述预设大小的文本单位进行切分，得到各个第四样本文本单位；
    将所述第四样本文本单位进行转换,得到第三转换文本单位,其中,所述第三转换文本单位是所述第四样本文本单位以所述第二目标语音的发音方式进行发音的文本单位;
    对于所述第二样本文本中的音节,通过学习属于同一音节的第四样本文本单位在对应音节中的组合关系和顺序关系、学习至少两个连续音节在所述第二样本文本中的组合关系和顺序关系、以及学习至少两个连续音节中的第四样本文本单位在所述第二样本文本中的组合关系和顺序关系,构建编解码模型;
    则,所述将所述第二样本文本单位进行转换,得到第一转换文本单位,包括:
    利用所述编解码模型,将所述第二样本文本单位进行转换,得到第一转换文本单位。
  8. 一种语音翻译装置,其特征在于,包括:
    语音获取单元,用于获取源发音人的第一目标语音;
    语音翻译单元,用于通过对所述第一目标语音进行语音翻译,生成第二目标语音,其中,所述第二目标语音的语种与所述第一目标语音的语种不同,所述第二目标语音携带了所述源发音人的音色特征。
  9. 一种语音翻译装置,其特征在于,包括:处理器、存储器、系统总线;
    所述处理器以及所述存储器通过所述系统总线相连;
    所述存储器用于存储一个或多个程序,所述一个或多个程序包括指令,所述指令当被所述处理器执行时使所述处理器执行如权利要求1-7任一项所述的方法。
  10. 一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得计算机执行如权利要求1-7任意一项所述的方法。
  11. 一种计算机程序产品，其特征在于，所述计算机程序产品在终端设备上运行时，使得所述终端设备执行权利要求1-7任一项所述的方法。
PCT/CN2018/095766 2018-02-28 2018-07-16 一种语音翻译方法及装置 WO2019165748A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810167142.5A CN108447486B (zh) 2018-02-28 2018-02-28 一种语音翻译方法及装置
CN201810167142.5 2018-02-28

Publications (1)

Publication Number Publication Date
WO2019165748A1 true WO2019165748A1 (zh) 2019-09-06

Family

ID=63192800

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/095766 WO2019165748A1 (zh) 2018-02-28 2018-07-16 一种语音翻译方法及装置

Country Status (2)

Country Link
CN (1) CN108447486B (zh)
WO (1) WO2019165748A1 (zh)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112382297A (zh) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 用于生成音频的方法、装置、设备和介质
CN112509553A (zh) * 2020-12-02 2021-03-16 出门问问(苏州)信息科技有限公司 一种语音合成方法、装置以及计算机可读存储介质
CN112530404A (zh) * 2020-11-30 2021-03-19 深圳市优必选科技股份有限公司 一种语音合成方法、语音合成装置及智能设备
CN112818707A (zh) * 2021-01-19 2021-05-18 传神语联网网络科技股份有限公司 基于逆向文本共识的多翻引擎协作语音翻译系统与方法
CN113327575A (zh) * 2021-05-31 2021-08-31 广州虎牙科技有限公司 一种语音合成方法、装置、计算机设备和存储介质
CN113808576A (zh) * 2020-06-16 2021-12-17 阿里巴巴集团控股有限公司 语音转换方法、装置及计算机系统
CN114818748A (zh) * 2022-05-10 2022-07-29 北京百度网讯科技有限公司 用于生成翻译模型的方法、翻译方法及装置
EP4266306A1 (en) * 2022-04-22 2023-10-25 Papercup Technologies Limited A speech processing system and a method of processing a speech signal

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109119063B (zh) * 2018-08-31 2019-11-22 腾讯科技(深圳)有限公司 视频配音生成方法、装置、设备及存储介质
CN109300469A (zh) * 2018-09-05 2019-02-01 满金坝(深圳)科技有限公司 基于机器学习的同声传译方法及装置
CN108986793A (zh) * 2018-09-28 2018-12-11 北京百度网讯科技有限公司 翻译处理方法、装置及设备
CN109448698A (zh) * 2018-10-17 2019-03-08 深圳壹账通智能科技有限公司 同声传译方法、装置、计算机设备和存储介质
CN109754808B (zh) * 2018-12-13 2024-02-13 平安科技(深圳)有限公司 语音转换文字的方法、装置、计算机设备及存储介质
CN112420008A (zh) * 2019-08-22 2021-02-26 北京峰趣互联网信息服务有限公司 录制歌曲的方法、装置、电子设备及存储介质
CN110610720B (zh) * 2019-09-19 2022-02-25 北京搜狗科技发展有限公司 一种数据处理方法、装置和用于数据处理的装置
CN110619867B (zh) * 2019-09-27 2020-11-03 百度在线网络技术(北京)有限公司 语音合成模型的训练方法、装置、电子设备及存储介质
CN110970014B (zh) * 2019-10-31 2023-12-15 阿里巴巴集团控股有限公司 语音转换、文件生成、播音、语音处理方法、设备及介质
CN111105781B (zh) * 2019-12-23 2022-09-23 联想(北京)有限公司 语音处理方法、装置、电子设备以及介质
CN114467141A (zh) * 2019-12-31 2022-05-10 深圳市欢太科技有限公司 语音处理方法、装置、设备以及存储介质
CN111368559A (zh) * 2020-02-28 2020-07-03 北京字节跳动网络技术有限公司 语音翻译方法、装置、电子设备及存储介质
CN113539233A (zh) * 2020-04-16 2021-10-22 北京搜狗科技发展有限公司 一种语音处理方法、装置和电子设备
CN111696518A (zh) * 2020-06-05 2020-09-22 四川纵横六合科技股份有限公司 一种基于文本的自动化语音合成方法
CN111785258B (zh) * 2020-07-13 2022-02-01 四川长虹电器股份有限公司 一种基于说话人特征的个性化语音翻译方法和装置
CN113160793A (zh) * 2021-04-23 2021-07-23 平安科技(深圳)有限公司 基于低资源语言的语音合成方法、装置、设备及存储介质
CN113362818A (zh) * 2021-05-08 2021-09-07 山西三友和智慧信息技术股份有限公司 一种基于人工智能的语音交互指导系统及方法
CN116343751B (zh) * 2023-05-29 2023-08-11 深圳市泰为软件开发有限公司 基于语音翻译的音频分析方法及装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105786801A (zh) * 2014-12-22 2016-07-20 中兴通讯股份有限公司 一种语音翻译方法、通讯方法及相关装置
CN106156009A (zh) * 2015-04-13 2016-11-23 中兴通讯股份有限公司 语音翻译方法及装置
CN107465816A (zh) * 2017-07-25 2017-12-12 广西定能电子科技有限公司 一种通话即时原声语音翻译的通话终端及方法
CN107731232A (zh) * 2017-10-17 2018-02-23 深圳市沃特沃德股份有限公司 语音翻译方法和装置

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1553381A (zh) * 2003-05-26 2004-12-08 杨宏惠 多语种对应目录式语言数据库及同步电脑互译、交流方法
JP2008032834A (ja) * 2006-07-26 2008-02-14 Toshiba Corp 音声翻訳装置及びその方法
JP4481972B2 (ja) * 2006-09-28 2010-06-16 株式会社東芝 音声翻訳装置、音声翻訳方法及び音声翻訳プログラム
CN101727904B (zh) * 2008-10-31 2013-04-24 国际商业机器公司 语音翻译方法和装置
KR101154011B1 (ko) * 2010-06-07 2012-06-08 주식회사 서비전자 다중 모델 적응화와 음성인식장치 및 방법
CN102821259B (zh) * 2012-07-20 2016-12-21 冠捷显示科技(厦门)有限公司 具有多国语言语音翻译的tv系统及其实现方法
KR102069697B1 (ko) * 2013-07-29 2020-02-24 한국전자통신연구원 자동 통역 장치 및 방법
KR20150105075A (ko) * 2014-03-07 2015-09-16 한국전자통신연구원 자동 통역 장치 및 방법
CN104252861B (zh) * 2014-09-11 2018-04-13 百度在线网络技术(北京)有限公司 视频语音转换方法、装置和服务器
JP2016057986A (ja) * 2014-09-11 2016-04-21 株式会社東芝 音声翻訳装置、方法およびプログラム
CN105390141B (zh) * 2015-10-14 2019-10-18 科大讯飞股份有限公司 声音转换方法和装置
KR102525209B1 (ko) * 2016-03-03 2023-04-25 한국전자통신연구원 원시 발화자의 목소리와 유사한 특성을 갖는 합성음을 생성하는 자동 통역 시스템 및 그 동작 방법
CN106791913A (zh) * 2016-12-30 2017-05-31 深圳市九洲电器有限公司 数字电视节目同声翻译输出方法及系统
CN107632980B (zh) * 2017-08-03 2020-10-27 北京搜狗科技发展有限公司 语音翻译方法和装置、用于语音翻译的装置
CN107992485A (zh) * 2017-11-27 2018-05-04 北京搜狗科技发展有限公司 一种同声传译方法及装置

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105786801A (zh) * 2014-12-22 2016-07-20 中兴通讯股份有限公司 一种语音翻译方法、通讯方法及相关装置
CN106156009A (zh) * 2015-04-13 2016-11-23 中兴通讯股份有限公司 语音翻译方法及装置
CN107465816A (zh) * 2017-07-25 2017-12-12 广西定能电子科技有限公司 一种通话即时原声语音翻译的通话终端及方法
CN107731232A (zh) * 2017-10-17 2018-02-23 深圳市沃特沃德股份有限公司 语音翻译方法和装置

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113808576A (zh) * 2020-06-16 2021-12-17 阿里巴巴集团控股有限公司 语音转换方法、装置及计算机系统
CN112382297A (zh) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 用于生成音频的方法、装置、设备和介质
CN112530404A (zh) * 2020-11-30 2021-03-19 深圳市优必选科技股份有限公司 一种语音合成方法、语音合成装置及智能设备
CN112509553A (zh) * 2020-12-02 2021-03-16 出门问问(苏州)信息科技有限公司 一种语音合成方法、装置以及计算机可读存储介质
CN112509553B (zh) * 2020-12-02 2023-08-01 问问智能信息科技有限公司 一种语音合成方法、装置以及计算机可读存储介质
CN112818707A (zh) * 2021-01-19 2021-05-18 传神语联网网络科技股份有限公司 基于逆向文本共识的多翻引擎协作语音翻译系统与方法
CN112818707B (zh) * 2021-01-19 2024-02-27 传神语联网网络科技股份有限公司 基于逆向文本共识的多翻引擎协作语音翻译系统与方法
CN113327575A (zh) * 2021-05-31 2021-08-31 广州虎牙科技有限公司 一种语音合成方法、装置、计算机设备和存储介质
CN113327575B (zh) * 2021-05-31 2024-03-01 广州虎牙科技有限公司 一种语音合成方法、装置、计算机设备和存储介质
EP4266306A1 (en) * 2022-04-22 2023-10-25 Papercup Technologies Limited A speech processing system and a method of processing a speech signal
CN114818748A (zh) * 2022-05-10 2022-07-29 北京百度网讯科技有限公司 用于生成翻译模型的方法、翻译方法及装置

Also Published As

Publication number Publication date
CN108447486A (zh) 2018-08-24
CN108447486B (zh) 2021-12-03

Similar Documents

Publication Publication Date Title
WO2019165748A1 (zh) 一种语音翻译方法及装置
KR102581346B1 (ko) 다국어 음성 합성 및 언어간 음성 복제
JP2017058674A (ja) 音声認識のための装置及び方法、変換パラメータ学習のための装置及び方法、コンピュータプログラム並びに電子機器
TWI244638B (en) Method and apparatus for constructing Chinese new words by the input voice
JP2020034883A (ja) 音声合成装置及びプログラム
Abushariah et al. Phonetically rich and balanced text and speech corpora for Arabic language
El Ouahabi et al. Toward an automatic speech recognition system for amazigh-tarifit language
KR20150105075A (ko) 자동 통역 장치 및 방법
Shahriar et al. A communication platform between bangla and sign language
Bachate et al. Automatic speech recognition systems for regional languages in India
TWI467566B (zh) 多語言語音合成方法
Erro et al. ZureTTS: Online platform for obtaining personalized synthetic voices
CN116933806A (zh) 一种同传翻译系统及同传翻译终端
CN110310620B (zh) 基于原生发音强化学习的语音融合方法
CN114254649A (zh) 一种语言模型的训练方法、装置、存储介质及设备
Kano et al. An end-to-end model for cross-lingual transformation of paralinguistic information
CN111489742B (zh) 声学模型训练方法、语音识别方法、装置及电子设备
CN113870833A (zh) 语音合成相关系统、方法、装置及设备
JP2021085943A (ja) 音声合成装置及びプログラム
Mohamed et al. A cascaded speech to Arabic sign language machine translator using adaptation
Dalva Automatic speech recognition system for Turkish spoken language
JP7012935B1 (ja) プログラム、情報処理装置、方法
WO2019106068A1 (en) Speech signal processing and evaluation
Mohammad et al. Phonetically rich and balanced text and speech corpora for Arabic language
Thomas Audibly: Speech to American Sign Language converter

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18908168

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18908168

Country of ref document: EP

Kind code of ref document: A1