CN113436606B - Original sound speech translation method - Google Patents

Original sound speech translation method

Info

Publication number
CN113436606B
CN113436606B (application CN202110602693.1A)
Authority
CN
China
Prior art keywords
voice
language
module
learning
translation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110602693.1A
Other languages
Chinese (zh)
Other versions
CN113436606A (en)
Inventor
孟强祥
田俊麟
宋昱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Introduction Of Chinese Technology Shenzhen Co ltd
Original Assignee
Introduction Of Chinese Technology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Introduction Of Chinese Technology Shenzhen Co ltd
Priority to CN202110602693.1A
Publication of CN113436606A
Application granted
Publication of CN113436606B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/086 Detection of language
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems

Abstract

The invention discloses an acoustic (original-sound) speech translation method in the technical field of speech translation, comprising the following steps: a voice acquisition module collects the source-language speech, and a voice feature learning module extracts the speaker's voice features and sends them to a deep neural network (DNN) for training and learning; an STT module converts the source speech into text information, which is acquired by the translation module and the language feature learning module respectively; the language features of the source language are extracted and recorded in the language feature learning module; and voice synthesis simulation is performed by the voice synthesis module. The invention sends the language pronunciation features to the DNN as feature values for training and learning, obtains after learning a language feature model vector and a human voice feature model vector for reference by the translation and synthesis modules respectively, performs voice synthesis simulation through the synthesized voice module, and outputs a voice similar to the speaker's, so that the translated synthesized speech closely matches the speaker's characteristics.

Description

Original sound speech translation method
Technical Field
The invention relates to the technical field of speech translation, and in particular to an acoustic speech translation method.
Background
The development of artificial intelligence technology has greatly advanced the development and application of speech translation. In the speech translation process, the speaker's source speech signal is first converted into source text information, the source text is converted into target-language text by a text translation module, and a target-language speech signal is then generated by a speech synthesis module and played back to complete the translation.
Disclosure of Invention
The invention aims to provide an acoustic speech translation method that remedies the deficiencies in the prior art.
To achieve the above purpose, the invention provides the following technical scheme: an acoustic speech translation method comprising the following steps:
Step one, source-language voice acquisition: voice information is collected by the voice acquisition module and then sent to the voice feature learning module and the STT (Speech-To-Text) module.
Step two, the voice feature learning module extracts the voice features of the speaker; after extraction, the features are learned by a deep neural network (DNN) to establish a voice feature model. The language pronunciation features are sent into the DNN as feature values for training and learning, and after learning, a language feature model vector and a human voice feature model vector are obtained for reference by the translation module and the synthesis module respectively.
Step three, the STT module converts the source-language speech into text information, which is acquired by the translation module and the language feature learning module respectively. The language feature learning module extracts and records the language features of the source language; after these features are learned by the DNN, the language feature model is corrected, and the parameters used by the model serve as important reference parameters of the translation module and as pre-judgment information for the translation;
Step four, voice synthesis simulation is performed by the synthesized voice module. The corrected language feature model obtained after translation and DNN learning serves as the information basis of the voice output, and the language information is simulated and output; a synthesized voice model is established by combining a time-interval model and a fundamental-frequency model to generate a time-frequency spectrum signal, and the synthesized voice module performs synthesis processing with the Griffin-Lim algorithm to obtain a speech signal with the corresponding voice characteristics, wherein the synthesized voice model is:
x_{i+1} = F^{-1}(S · P_i), where P_i = F(x_i) / |F(x_i)|,

S is the given time-frequency spectrum signal,
x_i is the signal reconstructed at the i-th iteration,
F is the short-time Fourier transform,
F^{-1} is its inverse transform,
S_i and P_i denote, respectively, the magnitude and phase of the short-time Fourier transform of x_i;
Step five, the signal is reconstructed iteratively until the synthesized language and voice characteristics are closest to the speaker's, and the translated content is played in real time to complete the speech translation process.
Preferably, the source speech acquisition in step one includes preprocessing and judgment of the sound signal. The preprocessing includes signal-optimizing operations such as speech enhancement, background sound elimination and echo suppression; the judgment determines whether the sound signal contains language information, and if no language information is detected, the current information is discarded.
Preferably, a pre-trained acoustic feature model is provided in step two, and the model is corrected each time a new speech acoustic feature is learned.
Preferably, the sound feature learning module in step two includes feature extraction. The extracted features mainly cover language pronunciation, such as vowels, consonants and voiced sounds, and also cover the speaker's pronunciation characteristics, such as sound intensity, pitch and timbre.
Preferably, the main modules of the translation process in step three execute synchronously in real time, while the learning of voice and language features and the model correction can execute asynchronously, so that the real-time performance of the translation process is not affected.
In the technical scheme, the invention provides the following technical effects and advantages:
the invention collects the voice information through the voice collecting module, the language pronunciation characteristic is sent to the deep neural network DNN for training and learning as the characteristic value, the language characteristic model characteristic vector and the human voice characteristic model characteristic vector which are respectively used for reference of the translation and synthesis module are obtained after learning, meanwhile, the STT module converts the character information of the source voice, the language characteristic model is corrected after the learning of the deep neural network DNN and is used as the pre-judging information of the translation, then the voice synthesis simulation is carried out through the synthesis voice module, the voice similar to the voice of the speaker is sent out after the synthesis based on the language information with the speaking style of the speaker, and therefore, the synthesized voice after the translation is highly close to the characteristic of the speaker.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. It is apparent that the following drawings cover only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them.
Fig. 1 is a schematic view of the overall structure of the present invention.
Fig. 2 is a flow chart of sound feature extraction according to the present invention.
FIG. 3 is a diagram of ADSR envelope representation according to the present invention.
FIG. 4 is a logical block diagram of model reconstruction in accordance with the present invention.
Description of reference numerals:
A (Attack): the time from silence to the pronunciation peak; this is the energy burst phase;
D (Decay): the time for the pronunciation to drop from the peak to a stable level;
S (Sustain): the time interval of stable pronunciation;
R (Release): the time for the sound to fall back after the pronunciation ends.
Detailed Description
In order to make the technical solutions of the present invention better understood, the present invention is described in further detail below with reference to the accompanying drawings.
The invention provides an acoustic speech translation method, which comprises the following steps:
Step one, source-language voice acquisition: voice information is collected by the voice acquisition module and then sent to the voice feature learning module and the STT (Speech-To-Text) module.
Step two, the voice feature learning module extracts the voice features of the speaker; after extraction, the features are learned by a deep neural network (DNN) to establish a voice feature model. The language pronunciation features are sent into the DNN as feature values for training and learning, and after learning, a language feature model vector and a human voice feature model vector are obtained for reference by the translation module and the synthesis module respectively, as the sketch below illustrates.
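As a non-limiting illustration of step two, the following minimal sketch shows one way such a feature-learning network could be organized. The class name VoiceFeatureDNN, the layer sizes, the 40 input features and the 128-dimensional output vector are assumptions made for the example, not values prescribed by this method.

```python
# Illustrative sketch only: a small DNN that maps per-frame pronunciation
# feature values to one fixed-length feature-model vector per utterance.
# Layer sizes, names, and dimensions are assumed for the example.
import torch
import torch.nn as nn

class VoiceFeatureDNN(nn.Module):
    def __init__(self, n_features: int = 40, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (n_frames, n_features); average over time to obtain one
        # feature-model vector for the translation/synthesis modules.
        return self.net(frames).mean(dim=0)

model = VoiceFeatureDNN()
frames = torch.randn(200, 40)     # stand-in for extracted pronunciation features
feature_vector = model(frames)    # e.g. a 128-dimensional model vector
```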
Step three, the STT module converts the source-language speech into text information, which is acquired by the translation module and the language feature learning module respectively. The language feature learning module extracts and records the language features of the source language; after these features are learned by the DNN, the language feature model is corrected, and the parameters used by the model serve as important reference parameters of the translation module and as pre-judgment information for the translation;
Step four, voice synthesis simulation is performed by the synthesized voice module. The corrected language feature model obtained after translation and DNN learning serves as the information basis of the voice output, and the language information is simulated and output; a synthesized voice model is established by combining a time-interval model and a fundamental-frequency model to generate a time-frequency spectrum signal, and the synthesized voice module performs synthesis processing with the Griffin-Lim algorithm to obtain a speech signal with the corresponding voice characteristics, wherein the synthesized voice model is:
x_{i+1} = F^{-1}(S · P_i), where P_i = F(x_i) / |F(x_i)|,

S is the given time-frequency spectrum signal,
x_i is the signal reconstructed at the i-th iteration,
F is the short-time Fourier transform,
F^{-1} is its inverse transform,
S_i and P_i denote, respectively, the magnitude and phase of the short-time Fourier transform of x_i;
given a time-frequency spectrum signal S, a signal must be reconstructed whose time-frequency spectrum is as close to S as possible;
The human voice features include:
Intensity: the strength of the pronunciation, i.e. the vibration amplitude of the audio signal;
Pitch: the vibration frequency of the audio signal;
Timbre: an important index by which a speaker's voice is distinguished from that of other people. Timbre is determined by the corresponding spectral envelope (Envelope); the ADSR envelope consists of four parameters, Attack, Decay, Sustain and Release, and for the same text the voices of different people differ mainly in these four parameters, as the sketch after this list illustrates;
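A rough sketch of reading the four ADSR parameters off a measured amplitude envelope follows; the 90% decay level and 10% release floor are thresholds assumed for the example, not values prescribed by this method.

```python
# Illustrative ADSR estimation from a sampled amplitude envelope, following
# the Attack/Decay/Sustain/Release phases described above and in FIG. 3.
import numpy as np

def adsr_times(envelope: np.ndarray, sr: int) -> dict:
    peak = int(envelope.argmax())
    attack = peak / sr                                     # A: silence to peak
    sustain_level = 0.9 * envelope[peak]                   # assumed decay target
    decay_end = peak + int(np.argmax(envelope[peak:] <= sustain_level))
    decay = (decay_end - peak) / sr                        # D: peak to stable level
    floor = 0.1 * envelope[peak]                           # assumed release floor
    rel_start = len(envelope) - int(np.argmax(envelope[::-1] > floor))
    sustain = max(rel_start - decay_end, 0) / sr           # S: stable interval
    release = (len(envelope) - rel_start) / sr             # R: fall back to silence
    return {"attack": attack, "decay": decay, "sustain": sustain, "release": release}
```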
Step five, the signal is reconstructed iteratively until the synthesized language and voice characteristics are closest to the speaker's, and the translated content is played in real time to complete the speech translation process;
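By way of illustration only, this iterative reconstruction can be realized as in the minimal sketch below, assuming a NumPy magnitude spectrogram and the librosa STFT routines; the iteration count, the random phase initialization and the numerical floor are choices made for the example. Each pass clamps the magnitude to the given spectrum S while keeping only the phase of the current estimate, so the reconstruction's spectrum moves closer to S.

```python
# Minimal Griffin-Lim sketch matching the update above: keep the phase P_i of
# the current estimate x_i, clamp the magnitude to the given spectrum S.
import numpy as np
import librosa

def griffin_lim(S: np.ndarray, n_iter: int = 60) -> np.ndarray:
    # S: target magnitude spectrogram, shape (1 + n_fft // 2, n_frames)
    n_fft = 2 * (S.shape[0] - 1)
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(S.shape))   # random initial phase P_0
    for _ in range(n_iter):
        x = librosa.istft(S * phase)                   # x_i = F^{-1}(S . P_{i-1})
        D = librosa.stft(x, n_fft=n_fft)               # F(x_i)
        phase = D / np.maximum(np.abs(D), 1e-8)        # P_i = F(x_i) / |F(x_i)|
    return librosa.istft(S * phase)                    # final reconstruction
```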
Further, in the above technical solution, the source speech acquisition in step one includes preprocessing and judgment of the speech signal. The preprocessing includes signal-optimizing operations such as speech enhancement, background sound elimination and echo suppression; the judgment determines whether the speech signal contains language information, and if no language information is detected, the current information is discarded;
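As a non-limiting illustration, the judgment step could be realized with a short-time-energy voice activity check such as the sketch below; the frame length and energy threshold are assumed values, not parameters of the method.

```python
# Illustrative energy-based judgment: report whether any frame of the buffer
# carries speech-like energy; callers discard the buffer when it does not.
import numpy as np

def contains_speech(signal: np.ndarray, frame_len: int = 512,
                    energy_threshold: float = 1e-3) -> bool:
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        return False
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energies = (frames ** 2).mean(axis=1)        # short-time energy per frame
    return bool((energies > energy_threshold).any())
```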
Further, in the above technical solution, a pre-trained acoustic feature model is provided in step two, and the model is corrected each time a new speech acoustic feature is learned;
Further, in the above technical solution, the sound feature learning module in step two includes feature extraction. The extracted features mainly cover language pronunciation, such as vowels, consonants and voiced sounds, and also cover the speaker's pronunciation characteristics, such as sound intensity, pitch and timbre;
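For illustration, the named pronunciation characteristics could be estimated with common signal-processing routines, as in the sketch below; the use of librosa, the file name and the parameter values are assumptions of the example, not choices made by the patent.

```python
# Illustrative extraction of the speaker-pronunciation characteristics named
# above: intensity (amplitude), pitch (vibration frequency) and timbre cues
# (spectral envelope). Routines and parameters are example choices.
import numpy as np
import librosa

y, sr = librosa.load("speaker.wav", sr=16000)              # hypothetical input
intensity = librosa.feature.rms(y=y)[0]                    # sound intensity
f0, voiced, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)  # pitch contour
envelope = np.abs(librosa.stft(y))                         # spectral envelope
```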
Furthermore, in the above technical solution, the main modules of the translation process in step three execute synchronously in real time, while the learning of voice and language features and the model correction can execute asynchronously, so that the real-time performance of the translation process is not affected;
the implementation mode is specifically as follows: after voice information is collected by a voice collecting module, the voice information is sent to a voice characteristic learning module and an STT module, voice characteristics are extracted and then are learned by a deep neural network DNN to establish a voice characteristic model, language pronunciation characteristics are used as characteristic values to be sent to the deep neural network DNN for training and learning, and after learning, a language characteristic model characteristic vector and a human voice characteristic model characteristic vector which are respectively used for reference of a translation and synthesis module are obtained, meanwhile, the STT module converts the character information of the source language, the language feature learning extracts and records the language feature of the source language, the language feature model is corrected after the deep neural network DNN learning and is used as the translation prejudgment information, then the speech synthesis simulation is carried out through the speech synthesis module, based on the language information with the speaking style of the speaker, the synthesized voice is synthesized to make a sound similar to the voice of the speaker, so that the synthesized voice after translation is highly close to the characteristics of the speaker.
While certain exemplary embodiments of the present invention have been described above by way of illustration only, it will be apparent to those of ordinary skill in the art that the described embodiments may be modified in various different ways without departing from the spirit and scope of the invention. Accordingly, the drawings and description are illustrative in nature and should not be construed as limiting the scope of the invention.

Claims (4)

1. An acoustic speech translation method, comprising the steps of:
step one, source-language voice acquisition, wherein voice information is collected by the voice acquisition module and then sent to the voice feature learning module and the STT (Speech-To-Text) module;
step two, extracting the voice features of the speaker by the voice feature learning module, establishing a voice feature model through deep neural network (DNN) learning after the features are extracted, sending the language pronunciation features into the DNN as feature values for training and learning, and obtaining after learning a language feature model vector and a voice feature model vector for reference;
step three, converting the source-language speech into text information by the STT module, the text information being acquired by the translation module and the language feature learning module respectively, wherein the language feature learning module extracts and records the language features of the source language, the language feature model is corrected after DNN learning of these features, and the parameters used by the language feature model serve as important reference parameters of the translation module and as pre-judgment information for the translation;
step four, performing voice synthesis simulation through the synthesized voice module, using the corrected language feature model obtained after translation and DNN learning as the information basis of the voice output, simulating and outputting the language information, establishing a synthesized voice model by combining a time-interval model and a fundamental-frequency model to generate a time-frequency spectrum signal, and performing synthesis processing in the synthesized voice module with the Griffin-Lim algorithm to obtain a speech signal with the corresponding voice characteristics, wherein the synthesized voice model is:
x_{i+1} = F^{-1}(S · P_i), where P_i = F(x_i) / |F(x_i)|,

S is the given time-frequency spectrum signal,
x_i is the signal reconstructed at the i-th iteration,
F is the short-time Fourier transform,
F^{-1} is its inverse transform,
S_i and P_i denote, respectively, the magnitude and phase of the short-time Fourier transform of x_i;
and step five, reconstructing the signal iteratively until the synthesized language and voice characteristics are closest to the speaker's, and translating and playing in real time according to the translated content to complete the speech translation process.
2. The acoustic speech translation method according to claim 1, wherein the source-language voice acquisition in step one comprises preprocessing and judgment of the sound signal, the preprocessing comprises speech enhancement, background sound elimination and echo suppression, and the judgment comprises determining whether the sound signal contains language information, the current information being discarded if no language information is detected.
3. The acoustic speech translation method according to claim 1, wherein a pre-trained voice feature model is provided in step two, and the voice feature model is corrected each time a new voice feature is learned.
4. The acoustic speech translation method according to claim 1, wherein the voice feature learning module in step two comprises feature extraction, the extracted features comprise language pronunciation features, namely vowels, consonants and voiced sounds, and further comprise the speaker's pronunciation characteristics, namely sound intensity, pitch and timbre.
CN202110602693.1A | Priority 2021-05-31 | Filed 2021-05-31 | Original sound speech translation method | Active | Granted as CN113436606B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110602693.1A | 2021-05-31 | 2021-05-31 | Original sound speech translation method (CN113436606B)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110602693.1A | 2021-05-31 | 2021-05-31 | Original sound speech translation method (CN113436606B)

Publications (2)

Publication Number | Publication Date
CN113436606A (en) | 2021-09-24
CN113436606B (en) | 2022-03-22

Family

ID=77804065

Family Applications (1)

Application Number | Title | Priority Date | Filing Date | Status
CN202110602693.1A | Original sound speech translation method | 2021-05-31 | 2021-05-31 | Active (CN113436606B)

Country Status (1)

Country Link
CN (1) CN113436606B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102478763B1 (en) 2022-06-28 2022-12-19 (주)액션파워 Method for speech recognition with grapheme information
CN115312029B (en) * 2022-10-12 2023-01-31 之江实验室 Voice translation method and system based on voice depth characterization mapping
CN116416969A (en) * 2023-06-09 2023-07-11 深圳市江元科技(集团)有限公司 Multi-language real-time translation method, system and medium based on big data

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102117614B (en) * 2010-01-05 2013-01-02 索尼爱立信移动通讯有限公司 Personalized text-to-speech synthesis and personalized speech feature extraction
US9564120B2 (en) * 2010-05-14 2017-02-07 General Motors Llc Speech adaptation in speech synthesis
US10446143B2 (en) * 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
CN107146601B (en) * 2017-04-07 2020-07-24 南京邮电大学 Rear-end i-vector enhancement method for speaker recognition system
JP7178028B2 (en) * 2018-01-11 2022-11-25 ネオサピエンス株式会社 Speech translation method and system using multilingual text-to-speech synthesis model
CN111785258B (en) * 2020-07-13 2022-02-01 四川长虹电器股份有限公司 Personalized voice translation method and device based on speaker characteristics

Also Published As

Publication number Publication date
CN113436606A (en) 2021-09-24

Similar Documents

Publication Publication Date Title
CN113436606B (en) Original sound speech translation method
Ai et al. A neural vocoder with hierarchical generation of amplitude and phase spectra for statistical parametric speech synthesis
CN110648684B (en) Bone conduction voice enhancement waveform generation method based on WaveNet
CN109616131B (en) Digital real-time voice sound changing method
CN115294970B (en) Voice conversion method, device and storage medium for pathological voice
JP7124373B2 (en) LEARNING DEVICE, SOUND GENERATOR, METHOD AND PROGRAM
Doi et al. Statistical approach to enhancing esophageal speech based on Gaussian mixture models
Tobing et al. Baseline system of Voice Conversion Challenge 2020 with cyclic variational autoencoder and Parallel WaveGAN
Vallés-Pérez et al. Improving multi-speaker TTS prosody variance with a residual encoder and normalizing flows
CN111326170A (en) Method and device for converting ear voice into normal voice by combining time-frequency domain expansion convolution
Toth et al. Synthesizing speech from electromyography using voice transformation techniques
CN115862590A (en) Text-driven speech synthesis method based on characteristic pyramid
CN113314109B (en) Voice generation method based on cycle generation network
Du et al. Effective wavenet adaptation for voice conversion with limited data
CN114550701A (en) Deep neural network-based Chinese electronic larynx voice conversion device and method
Pan et al. Bone-conducted speech to air-conducted speech conversion based on cycle-consistent adversarial networks
CN113744715A (en) Vocoder speech synthesis method, device, computer equipment and storage medium
TWI746138B (en) System for clarifying a dysarthria voice and method thereof
CN112992118B (en) Speech model training and synthesizing method with few linguistic data
CN112967538B (en) English pronunciation information acquisition system
Chandra et al. Towards The Development Of Accent Conversion Model For (L1) Bengali Speaker Using Cycle Consistent Adversarial Network (Cyclegan)
Lin et al. Investigation of neural network approaches for unified spectral and prosodic feature enhancement
CN117334179A (en) Method, device and storage medium for real-time simulation of designated character tone by digital person
Zhou et al. An improved algorithm of GMM voice conversion system based on changing the time-scale
CN114974271A (en) Voice reconstruction method based on sound channel filtering and glottal excitation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant