CN113436606B - Original sound speech translation method - Google Patents
Original sound speech translation method
- Publication number
- CN113436606B (application CN202110602693.1A)
- Authority
- CN
- China
- Prior art keywords
- voice
- language
- module
- learning
- translation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/005—Language recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/086—Detection of language
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Abstract
The invention discloses an acoustic speech translation method in the technical field of speech translation, comprising the following steps: the source language voice collection and voice feature learning modules extract the speaker's voice features, which are sent to a deep neural network (DNN) for training and learning; an STT module converts the source speech into text information, which is obtained by the translation module and the language feature learning module respectively, the language features of the source language being extracted and recorded in the language feature learning module; and voice synthesis simulation is carried out by the voice synthesis module. The invention sends the language pronunciation features as feature values into a DNN for training and learning, and after learning obtains a language feature model feature vector and a human voice feature model feature vector for reference by the translation and synthesis modules respectively. Voice synthesis simulation is then carried out by the synthesized voice module, producing a voice similar to the speaker's own, so that the translated synthesized voice is highly close to the speaker's characteristics.
Description
Technical Field
The invention relates to the technical field of voice translation, in particular to an acoustic voice translation method.
Background
The development of artificial intelligence technology has driven great advances in, and wide application of, speech translation. In the speech translation process, the speaker's source voice signal is first converted into source text information; the source text is converted into text in the target language by a text translation module; and a voice signal in the target language is then generated by a voice synthesis module and played, completing the translation.
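This conventional cascade can be sketched as three chained stages. All three stage functions below are hypothetical placeholders (the function names and the toy dictionary are illustrative, not a real API):

```python
# Minimal sketch of the conventional cascade pipeline described above:
# source speech -> source text -> target-language text -> synthesized speech.

def speech_to_text(audio):
    """Placeholder STT stage: convert source speech to source text."""
    return "hello world"  # stand-in transcription

def translate_text(text, target_lang):
    """Placeholder text-translation stage using a toy word dictionary."""
    lexicon = {"hello": "bonjour", "world": "monde"}
    return " ".join(lexicon.get(w, w) for w in text.split())

def text_to_speech(text):
    """Placeholder TTS stage: return a dummy waveform-like list in [-1, 1)."""
    return [ord(c) / 128.0 - 1.0 for c in text]

def cascade_translate(audio, target_lang="fr"):
    text = speech_to_text(audio)
    translated = translate_text(text, target_lang)
    return text_to_speech(translated)

waveform = cascade_translate(audio=None)
print(len(waveform))  # 13: one toy sample per character of "bonjour monde"
```

The point of the sketch is the data flow: each stage consumes the previous stage's output, which is why errors and the loss of speaker characteristics accumulate along the chain.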
Disclosure of Invention
The invention aims to provide an acoustic speech translation method to solve the defects in the prior art.
In order to achieve the above purpose, the invention provides the following technical scheme: an acoustic speech translation method comprising the steps of:
step one, source language voice collection, wherein voice information is collected by the voice collection module and then sent to the voice feature learning module and the STT (Speech-To-Text) module.
Step two, the voice feature learning module extracts the speaker's voice features. After extraction, the features are learned by a deep neural network (DNN) to establish a voice feature model: the language pronunciation features are sent into the DNN as feature values for training and learning, and after learning, a language feature model feature vector and a human voice feature model feature vector are obtained for reference by the translation and synthesis modules respectively.
Step three, the STT module converts the source language speech into text information, which is obtained by the translation module and the language feature learning module respectively. The language feature learning module extracts and records the language features of the source language; after these features are learned by the DNN, the language feature model is corrected, and the parameters used by the model serve as important reference parameters for the translation module and as translation pre-judgment information;
step four, voice synthesis simulation is performed by the synthesized voice module. The language feature model, corrected after translation and DNN learning, serves as the information basis for voice output, and the language information is simulated and output. A synthesized voice model is established by combining a time-interval model and a fundamental-frequency model to generate a time-frequency spectrum signal, and the synthesized voice module applies the Griffin-Lim algorithm to obtain a voice signal with the corresponding voice characteristics, wherein the synthesized voice model is as follows:
S is the given time-frequency spectrum signal,
x_i is the signal reconstructed at the i-th iteration,
F is the short-time Fourier transform,
F^{-1} is its inverse transform,
S_i and P_i are respectively the magnitude and phase of the short-time Fourier transform of x_i, i.e. F(x_i) = S_i · P_i; each iteration keeps the phase and replaces the magnitude with the target, x_{i+1} = F^{-1}(S · P_i);
and step five, the signal is iteratively reconstructed until the synthesized speech is closest to the speaker's language and voice characteristics; the translated content is then played in real time, completing the speech translation process.
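The iteration in steps four and five can be sketched numerically. The following is a minimal, self-contained Griffin-Lim illustration assuming a Hann-windowed STFT with 50% overlap; the frame and hop sizes are illustrative choices, not values from the patent:

```python
import numpy as np

# Griffin-Lim sketch matching the variables above: given a target magnitude
# spectrogram S, iterate x_{i+1} = F^{-1}(S * P_i), where P_i is the phase
# of F(x_i).

N, HOP = 256, 128
WIN = 0.5 * (1 - np.cos(2 * np.pi * np.arange(N) / N))  # periodic Hann window

def stft(x):
    frames = [x[k:k + N] * WIN for k in range(0, len(x) - N + 1, HOP)]
    return np.fft.rfft(np.array(frames), axis=1)        # F: one row per frame

def istft(Z, length):
    x = np.zeros(length)
    norm = np.zeros(length)
    for k, spec in enumerate(Z):                         # F^{-1} by overlap-add
        start = k * HOP
        x[start:start + N] += np.fft.irfft(spec, n=N) * WIN
        norm[start:start + N] += WIN ** 2
    return x / np.maximum(norm, 1e-8)                    # window-sum correction

def griffin_lim(S, length, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    # Start from the inverse transform of S with random phase.
    x = istft(S * np.exp(1j * rng.uniform(0, 2 * np.pi, S.shape)), length)
    for _ in range(n_iter):
        Z = stft(x)                                      # F(x_i) = S_i * P_i
        phase = Z / np.maximum(np.abs(Z), 1e-12)         # keep the phase P_i
        x = istft(S * phase, length)                     # swap in magnitude S
    return x

# Target spectrogram taken from a real signal, so a consistent solution exists.
t = np.arange(4096) / 16000.0
target = np.sin(2 * np.pi * 440 * t)
S = np.abs(stft(target))
recovered = griffin_lim(S, len(target))
```

Each pass projects the current estimate onto the set of signals whose spectrogram magnitude equals S, which is the "continuous reconstruction" step five refers to.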
Preferably, the source speech collection in step one includes preprocessing and judgment of the sound signal. The preprocessing includes signal-optimizing operations such as speech enhancement, background sound elimination, and echo suppression; the judgment determines whether the sound signal contains language information, and if no language information is detected, the current information is discarded.
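The "judgment" part of this preprocessing can be illustrated with a simple frame-energy check; the frame length and threshold below are assumptions for illustration, not values from the patent:

```python
import numpy as np

# Toy voice-activity judgment: discard a signal when no frame carries
# speech-like energy.

def frame_energies(signal, frame_len=256):
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    return np.mean(frames ** 2, axis=1)   # mean power per frame

def contains_speech(signal, frame_len=256, threshold=1e-3):
    """Return True if any frame's mean energy exceeds the threshold."""
    return bool(np.any(frame_energies(signal, frame_len) > threshold))

rng = np.random.default_rng(0)
silence = 1e-4 * rng.standard_normal(4096)                 # near-silent noise
tone = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000)   # a clear tone

print(contains_speech(silence))  # False: the judgment step discards it
print(contains_speech(tone))     # True: passed on for processing
```

A production system would use a trained voice-activity detector rather than a fixed threshold; the sketch only shows where the discard decision sits in the pipeline.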
Preferably, the voice feature model in step two is initialized from a pre-trained voice feature model, and the model is corrected each time a new speech voice feature is learned.
Preferably, the voice feature learning module in step two includes feature extraction. The extracted features mainly cover language pronunciation, such as vowels, consonants, and voiced sounds, and also cover the speaker's pronunciation characteristics, such as intensity, pitch, and timbre.
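Two of the named pronunciation characteristics can be sketched with textbook estimators: RMS amplitude for intensity, and an autocorrelation peak for pitch. This is an illustrative sketch, not the patent's extractor:

```python
import numpy as np

# Per-frame intensity (RMS) and pitch (autocorrelation-based F0 estimate).

def intensity(frame):
    """RMS amplitude of one frame."""
    return float(np.sqrt(np.mean(frame ** 2)))

def pitch_autocorr(frame, sr, fmin=50.0, fmax=500.0):
    """Estimate fundamental frequency from the autocorrelation peak."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # plausible pitch-lag range
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

sr = 16000
t = np.arange(2048) / sr
frame = 0.3 * np.sin(2 * np.pi * 200 * t)     # a 200 Hz "voiced" frame

print(round(intensity(frame), 3))        # 0.212, i.e. 0.3 / sqrt(2)
print(round(pitch_autocorr(frame, sr)))  # 200
```

Timbre would additionally require the spectral envelope (e.g. cepstral features), which is beyond this sketch.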
Preferably, the main modules of the translation process in step three execute synchronously in real time, while the learning of voice and language features and the model correction process can execute asynchronously, so the real-time performance of the translation process is not affected.
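The synchronous-translation / asynchronous-learning split can be sketched with a queue and a background worker; the module internals below are stand-ins, not the patent's implementation:

```python
import queue
import threading

# The real-time path returns immediately; feature learning is handed off to
# a background thread so it cannot add latency to translation.

learn_queue: "queue.Queue" = queue.Queue()
learned = []

def learner():
    """Background worker: consumes features, updates models asynchronously."""
    while True:
        item = learn_queue.get()
        if item is None:          # sentinel: shut down
            break
        learned.append(f"model updated with {item}")
        learn_queue.task_done()

def translate_realtime(text):
    """Synchronous path: translate immediately, defer learning."""
    learn_queue.put(text)         # non-blocking hand-off to the learner
    return text.upper()           # stand-in for actual translation

worker = threading.Thread(target=learner, daemon=True)
worker.start()
outputs = [translate_realtime(s) for s in ("hello", "world")]
learn_queue.put(None)
worker.join()
print(outputs)        # ['HELLO', 'WORLD']
print(len(learned))   # 2: both utterances were also learned from
```

The queue decouples the two rates: translation proceeds at speech rate while model correction catches up whenever it can.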
In the technical scheme, the invention provides the following technical effects and advantages:
the invention collects the voice information through the voice collecting module, the language pronunciation characteristic is sent to the deep neural network DNN for training and learning as the characteristic value, the language characteristic model characteristic vector and the human voice characteristic model characteristic vector which are respectively used for reference of the translation and synthesis module are obtained after learning, meanwhile, the STT module converts the character information of the source voice, the language characteristic model is corrected after the learning of the deep neural network DNN and is used as the pre-judging information of the translation, then the voice synthesis simulation is carried out through the synthesis voice module, the voice similar to the voice of the speaker is sent out after the synthesis based on the language information with the speaking style of the speaker, and therefore, the synthesized voice after the translation is highly close to the characteristic of the speaker.
Drawings
To illustrate the embodiments of the present application and the technical solutions in the prior art more clearly, the drawings used in the embodiments are briefly described below. The following drawings show only some embodiments of the invention; those skilled in the art can derive other drawings from them.
Fig. 1 is a schematic view of the overall structure of the present invention.
Fig. 2 is a flow chart of sound feature extraction according to the present invention.
FIG. 3 is a diagram of ADSR envelope representation according to the present invention.
FIG. 4 is a logical block diagram of model reconstruction in accordance with the present invention.
Description of reference numerals:
a: the time from silence to the pronunciation peak; this is the energy burst phase;
d: the time for the pronunciation to decay from the peak to a stable level;
s: the time interval of stable pronunciation;
r: the time for the sound to fall back after pronunciation ends.
Detailed Description
To make the technical solutions of the present invention better understood, they are now described in further detail with reference to the accompanying drawings.
The invention provides an acoustic speech translation method, which comprises the following steps:
step one, source language voice collection, wherein voice information is collected by the voice collection module and then sent to the voice feature learning module and the STT (Speech-To-Text) module.
Step two, the voice feature learning module extracts the speaker's voice features. After extraction, the features are learned by a deep neural network (DNN) to establish a voice feature model: the language pronunciation features are sent into the DNN as feature values for training and learning, and after learning, a language feature model feature vector and a human voice feature model feature vector are obtained for reference by the translation and synthesis modules respectively.
Step three, the STT module converts the source language speech into text information, which is obtained by the translation module and the language feature learning module respectively. The language feature learning module extracts and records the language features of the source language; after these features are learned by the DNN, the language feature model is corrected, and the parameters used by the model serve as important reference parameters for the translation module and as translation pre-judgment information;
step four, voice synthesis simulation is performed by the synthesized voice module. The language feature model, corrected after translation and DNN learning, serves as the information basis for voice output, and the language information is simulated and output. A synthesized voice model is established by combining a time-interval model and a fundamental-frequency model to generate a time-frequency spectrum signal, and the synthesized voice module applies the Griffin-Lim algorithm to obtain a voice signal with the corresponding voice characteristics, wherein the synthesized voice model is as follows:
S is the given time-frequency spectrum signal,
x_i is the signal reconstructed at the i-th iteration,
F is the short-time Fourier transform,
F^{-1} is its inverse transform,
S_i and P_i are respectively the magnitude and phase of the short-time Fourier transform of x_i, i.e. F(x_i) = S_i · P_i; each iteration keeps the phase and replaces the magnitude with the target, x_{i+1} = F^{-1}(S · P_i);
given the time-frequency spectrum signal S, a signal must be reconstructed whose own time-frequency spectrum is as close to S as possible;
The human voice features include:
Intensity: the strength of pronunciation, i.e., the vibration amplitude of the audio signal,
Pitch: the vibration frequency of the audio signal,
Timbre: the key index by which a speaker's voice is distinguished from others'. Timbre is determined by the corresponding spectral envelope, described by the ADSR model with four parameters: Attack, Decay, Sustain, and Release. For the same text, different speakers' voices differ mainly in these four parameters;
step five, the signal is iteratively reconstructed until the synthesized speech is closest to the speaker's language and voice characteristics; the translated content is then played in real time, completing the speech translation process;
further, in the above technical solution, the source speech collection in step one includes preprocessing and judgment of the speech signal. The preprocessing includes signal-optimizing operations such as speech enhancement, background sound elimination, and echo suppression; the judgment determines whether the speech signal contains language information, and if no language information is detected, the current information is discarded;
further, in the above technical solution, the voice feature model in step two is initialized from a pre-trained voice feature model, and the model is corrected each time a new speech voice feature is learned;
further, in the above technical solution, the voice feature learning module in step two includes feature extraction. The extracted features mainly cover language pronunciation, such as vowels, consonants, and voiced sounds, and also cover the speaker's pronunciation characteristics, such as intensity, pitch, and timbre;
furthermore, in the above technical solution, the main modules of the translation process in step three execute synchronously in real time, while the learning of sound and language features and the model correction process can execute asynchronously, so the real-time performance of the translation process is not affected;
the implementation mode is specifically as follows: after voice information is collected by a voice collecting module, the voice information is sent to a voice characteristic learning module and an STT module, voice characteristics are extracted and then are learned by a deep neural network DNN to establish a voice characteristic model, language pronunciation characteristics are used as characteristic values to be sent to the deep neural network DNN for training and learning, and after learning, a language characteristic model characteristic vector and a human voice characteristic model characteristic vector which are respectively used for reference of a translation and synthesis module are obtained, meanwhile, the STT module converts the character information of the source language, the language feature learning extracts and records the language feature of the source language, the language feature model is corrected after the deep neural network DNN learning and is used as the translation prejudgment information, then the speech synthesis simulation is carried out through the speech synthesis module, based on the language information with the speaking style of the speaker, the synthesized voice is synthesized to make a sound similar to the voice of the speaker, so that the synthesized voice after translation is highly close to the characteristics of the speaker.
While certain exemplary embodiments of the present invention have been described above by way of illustration only, it will be apparent to those of ordinary skill in the art that the described embodiments may be modified in various different ways without departing from the spirit and scope of the invention. Accordingly, the drawings and description are illustrative in nature and should not be construed as limiting the scope of the invention.
Claims (4)
1. An acoustic speech translation method, comprising the steps of:
step one, source language voice acquisition, wherein voice information is acquired by the voice acquisition module and then sent to the voice feature learning module and the STT (Speech-To-Text) module;
step two, the voice feature learning module extracts the speaker's voice features; after extraction, a voice feature model is established through deep neural network (DNN) learning, the language pronunciation features are sent into the DNN as feature values for training and learning, and after learning, reference language feature model feature vectors and voice feature model feature vectors are obtained;
step three, the STT module converts the source language speech into text information, which is obtained by the translation module and the language feature learning module respectively, wherein the language feature learning extracts and records the language features of the source language; after DNN learning of these features, the language feature model is corrected, and the parameters used by the language feature model serve as important reference parameters for the translation module and as translation pre-judgment information;
step four, voice synthesis simulation is performed by the synthesized voice module. The language feature model, corrected after translation and DNN learning, serves as the information basis for voice output, and the language information is simulated and output. A synthesized voice model is established by combining a time-interval model and a fundamental-frequency model to generate a time-frequency spectrum signal, and the synthesized voice module applies the Griffin-Lim algorithm to obtain a voice signal with the corresponding voice characteristics, wherein the synthesized voice model is as follows:
S is the given time-frequency spectrum signal,
x_i is the signal reconstructed at the i-th iteration,
F is the short-time Fourier transform,
F^{-1} is its inverse transform,
S_i and P_i are respectively the magnitude and phase of the short-time Fourier transform of x_i, i.e. F(x_i) = S_i · P_i; each iteration keeps the phase and replaces the magnitude with the target, x_{i+1} = F^{-1}(S · P_i);
and step five, the signal is iteratively reconstructed until the synthesized speech is closest to the speaker's language and voice characteristics; the translated content is then played in real time, completing the speech translation process.
2. The acoustic speech translation method according to claim 1, wherein: in the first step, the source language voice acquisition comprises preprocessing and judgment of voice signals, the preprocessing comprises voice enhancement, background sound elimination and echo suppression, the judgment comprises judging whether the voice signals contain language information, and if the language information is not detected, the current information is discarded.
3. The acoustic speech translation method according to claim 1, wherein: in step two, a pre-trained voice feature model is provided, and the voice feature model is corrected each time a new voice feature is learned.
4. The acoustic speech translation method according to claim 1, wherein: the voice feature learning module in step two comprises feature extraction, the extracted features comprising language pronunciation features, namely vowels, consonants, and voiced sounds, and further comprising the speaker's pronunciation features, namely intensity, pitch, and timbre.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110602693.1A CN113436606B (en) | 2021-05-31 | 2021-05-31 | Original sound speech translation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110602693.1A CN113436606B (en) | 2021-05-31 | 2021-05-31 | Original sound speech translation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113436606A CN113436606A (en) | 2021-09-24 |
CN113436606B (en) | 2022-03-22
Family
ID=77804065
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110602693.1A Active CN113436606B (en) | 2021-05-31 | 2021-05-31 | Original sound speech translation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113436606B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102478763B1 (en) | 2022-06-28 | 2022-12-19 | (주)액션파워 | Method for speech recognition with grapheme information |
CN115312029B (en) * | 2022-10-12 | 2023-01-31 | 之江实验室 | Voice translation method and system based on voice depth characterization mapping |
CN116416969A (en) * | 2023-06-09 | 2023-07-11 | 深圳市江元科技(集团)有限公司 | Multi-language real-time translation method, system and medium based on big data |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102117614B (en) * | 2010-01-05 | 2013-01-02 | 索尼爱立信移动通讯有限公司 | Personalized text-to-speech synthesis and personalized speech feature extraction |
US9564120B2 (en) * | 2010-05-14 | 2017-02-07 | General Motors Llc | Speech adaptation in speech synthesis |
US10446143B2 (en) * | 2016-03-14 | 2019-10-15 | Apple Inc. | Identification of voice inputs providing credentials |
CN107146601B (en) * | 2017-04-07 | 2020-07-24 | 南京邮电大学 | Rear-end i-vector enhancement method for speaker recognition system |
JP7178028B2 (en) * | 2018-01-11 | 2022-11-25 | ネオサピエンス株式会社 | Speech translation method and system using multilingual text-to-speech synthesis model |
CN111785258B (en) * | 2020-07-13 | 2022-02-01 | 四川长虹电器股份有限公司 | Personalized voice translation method and device based on speaker characteristics |
- 2021-05-31: CN application CN202110602693.1A, patent CN113436606B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN113436606A (en) | 2021-09-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113436606B (en) | Original sound speech translation method | |
Ai et al. | A neural vocoder with hierarchical generation of amplitude and phase spectra for statistical parametric speech synthesis | |
CN110648684B (en) | Bone conduction voice enhancement waveform generation method based on WaveNet | |
CN109616131B (en) | Digital real-time voice sound changing method | |
CN115294970B (en) | Voice conversion method, device and storage medium for pathological voice | |
JP7124373B2 (en) | LEARNING DEVICE, SOUND GENERATOR, METHOD AND PROGRAM | |
Doi et al. | Statistical approach to enhancing esophageal speech based on Gaussian mixture models | |
Tobing et al. | Baseline system of Voice Conversion Challenge 2020 with cyclic variational autoencoder and Parallel WaveGAN | |
Vallés-Pérez et al. | Improving multi-speaker TTS prosody variance with a residual encoder and normalizing flows | |
CN111326170A (en) | Method and device for converting ear voice into normal voice by combining time-frequency domain expansion convolution | |
Toth et al. | Synthesizing speech from electromyography using voice transformation techniques | |
CN115862590A (en) | Text-driven speech synthesis method based on characteristic pyramid | |
CN113314109B (en) | Voice generation method based on cycle generation network | |
Du et al. | Effective wavenet adaptation for voice conversion with limited data | |
CN114550701A (en) | Deep neural network-based Chinese electronic larynx voice conversion device and method | |
Pan et al. | Bone-conducted speech to air-conducted speech conversion based on cycleconsistent adversarial networks | |
CN113744715A (en) | Vocoder speech synthesis method, device, computer equipment and storage medium | |
TWI746138B (en) | System for clarifying a dysarthria voice and method thereof | |
CN112992118B (en) | Speech model training and synthesizing method with few linguistic data | |
CN112967538B (en) | English pronunciation information acquisition system | |
Chandra et al. | Towards The Development Of Accent Conversion Model For (L1) Bengali Speaker Using Cycle Consistent Adversarial Network (Cyclegan) | |
Lin et al. | Investigation of neural network approaches for unified spectral and prosodic feature enhancement | |
CN117334179A (en) | Method, device and storage medium for real-time simulation of designated character tone by digital person | |
Zhou et al. | An improved algorithm of GMM voice conversion system based on changing the time-scale | |
CN114974271A (en) | Voice reconstruction method based on sound channel filtering and glottal excitation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||