CN113436606B - Original sound speech translation method - Google Patents

Original sound speech translation method

Info

Publication number
CN113436606B
CN113436606B (application CN202110602693.1A)
Authority
CN
China
Prior art keywords
voice
language
module
learning
translation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110602693.1A
Other languages
Chinese (zh)
Other versions
CN113436606A (en)
Inventor
孟强祥
田俊麟
宋昱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Introduction Of Chinese Technology Shenzhen Co ltd
Original Assignee
Introduction Of Chinese Technology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Introduction Of Chinese Technology Shenzhen Co ltd
Priority to CN202110602693.1A
Publication of CN113436606A
Application granted
Publication of CN113436606B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/086 Detection of language
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems

Abstract

The invention discloses an acoustic (original-sound) speech translation method in the technical field of speech translation, comprising the following steps: a voice acquisition module collects the source-language speech, and a voice feature learning module extracts the speaker's voice features and sends them to a deep neural network (DNN) for training and learning; an STT module converts the source speech into text information, which is acquired by the translation module and the language feature learning module respectively; the language features of the source language are extracted and recorded in the language feature learning module; and voice synthesis simulation is performed by the voice synthesis module. The invention sends the language pronunciation features to the DNN as feature values for training and learning, obtains after learning a language feature model vector and a human voice feature model vector for reference by the translation and synthesis modules respectively, performs voice synthesis simulation through the synthesized voice module, and outputs a voice similar to the speaker's, so that the translated synthesized speech closely matches the speaker's characteristics.

Description

Original sound speech translation method
Technical Field
The invention relates to the technical field of speech translation, and in particular to an acoustic speech translation method.
Background
The development of artificial intelligence technology has greatly advanced the development and application of speech translation. In the speech translation process, the speaker's source speech signal is first converted into source text information, the source text is converted into target-language text by a text translation module, and a target-language speech signal is then generated by a speech synthesis module and played back to complete the translation.
Disclosure of Invention
The invention aims to provide an acoustic speech translation method that remedies the deficiencies in the prior art.
To achieve the above purpose, the invention provides the following technical scheme: an acoustic speech translation method comprising the following steps:
Step one, source-language voice acquisition: voice information is collected by the voice acquisition module and then sent to the voice feature learning module and the STT (Speech-To-Text) module.
Step two, the voice feature learning module extracts the voice features of the speaker; after extraction, the features are learned by a deep neural network (DNN) to establish a voice feature model. The language pronunciation features are sent into the DNN as feature values for training and learning, and after learning, a language feature model vector and a human voice feature model vector are obtained for reference by the translation module and the synthesis module respectively.
Step three, the STT module converts the source-language speech into text information, which is acquired by the translation module and the language feature learning module respectively. The language feature learning module extracts and records the language features of the source language; after these features are learned by the DNN, the language feature model is corrected, and the parameters used by the model serve as important reference parameters of the translation module and as pre-judgment information for the translation;
Step four, voice synthesis simulation is performed by the synthesized voice module. The corrected language feature model obtained after translation and DNN learning serves as the information basis of the voice output, and the language information is simulated and output; a synthesized voice model is established by combining a time-interval model and a fundamental-frequency model to generate a time-frequency spectrum signal, and the synthesized voice module performs synthesis processing with the Griffin-Lim algorithm to obtain a speech signal with the corresponding voice characteristics, wherein the synthesized voice model is:
x_{i+1} = F^{-1}(S · P_i), where P_i = F(x_i) / |F(x_i)|,

S is the given time-frequency spectrum signal,
x_i is the signal reconstructed at the i-th iteration,
F is the short-time Fourier transform,
F^{-1} is its inverse transform,
S_i and P_i denote, respectively, the magnitude and phase of the short-time Fourier transform of x_i;
Step five, the signal is reconstructed iteratively until the synthesized language and voice characteristics are closest to the speaker's, and the translated content is played in real time to complete the speech translation process.
Preferably, the source speech acquisition in step one includes preprocessing and judgment of the sound signal. The preprocessing includes signal-optimizing operations such as speech enhancement, background sound elimination and echo suppression; the judgment determines whether the sound signal contains language information, and if no language information is detected, the current information is discarded.
Preferably, a pre-trained acoustic feature model is provided in step two, and the model is corrected each time a new speech acoustic feature is learned.
Preferably, the sound feature learning module in step two includes feature extraction. The extracted features mainly cover language pronunciation, such as vowels, consonants and voiced sounds, and also cover the speaker's pronunciation characteristics, such as sound intensity, pitch and timbre.
Preferably, the main modules of the translation process in step three execute synchronously in real time, while the learning of voice and language features and the model correction can execute asynchronously, so that the real-time performance of the translation process is not affected.
In the technical scheme, the invention provides the following technical effects and advantages:
the invention collects the voice information through the voice collecting module, the language pronunciation characteristic is sent to the deep neural network DNN for training and learning as the characteristic value, the language characteristic model characteristic vector and the human voice characteristic model characteristic vector which are respectively used for reference of the translation and synthesis module are obtained after learning, meanwhile, the STT module converts the character information of the source voice, the language characteristic model is corrected after the learning of the deep neural network DNN and is used as the pre-judging information of the translation, then the voice synthesis simulation is carried out through the synthesis voice module, the voice similar to the voice of the speaker is sent out after the synthesis based on the language information with the speaking style of the speaker, and therefore, the synthesized voice after the translation is highly close to the characteristic of the speaker.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. It is apparent that the following drawings cover only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them.
Fig. 1 is a schematic view of the overall structure of the present invention.
Fig. 2 is a flow chart of sound feature extraction according to the present invention.
FIG. 3 is a diagram of ADSR envelope representation according to the present invention.
FIG. 4 is a logical block diagram of model reconstruction in accordance with the present invention.
Description of reference numerals:
A (Attack): the time from silence to the pronunciation peak; this is the energy burst phase;
D (Decay): the time for the pronunciation to drop from the peak to a stable level;
S (Sustain): the time interval of stable pronunciation;
R (Release): the time for the sound to fall back after the pronunciation ends.
Detailed Description
In order to make the technical solutions of the present invention better understood, the present invention is described in further detail below with reference to the accompanying drawings.
The invention provides an acoustic speech translation method, which comprises the following steps:
Step one, source-language voice acquisition: voice information is collected by the voice acquisition module and then sent to the voice feature learning module and the STT (Speech-To-Text) module.
Step two, the voice feature learning module extracts the voice features of the speaker; after extraction, the features are learned by a deep neural network (DNN) to establish a voice feature model. The language pronunciation features are sent into the DNN as feature values for training and learning, and after learning, a language feature model vector and a human voice feature model vector are obtained for reference by the translation module and the synthesis module respectively, as the sketch below illustrates.
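As a non-limiting illustration of step two, the following minimal sketch shows one way such a feature-learning network could be organized. The class name VoiceFeatureDNN, the layer sizes, the 40 input features and the 128-dimensional output vector are assumptions made for the example, not values prescribed by this method.

```python
# Illustrative sketch only: a small DNN that maps per-frame pronunciation
# feature values to one fixed-length feature-model vector per utterance.
# Layer sizes, names, and dimensions are assumed for the example.
import torch
import torch.nn as nn

class VoiceFeatureDNN(nn.Module):
    def __init__(self, n_features: int = 40, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (n_frames, n_features); average over time to obtain one
        # feature-model vector for the translation/synthesis modules.
        return self.net(frames).mean(dim=0)

model = VoiceFeatureDNN()
frames = torch.randn(200, 40)     # stand-in for extracted pronunciation features
feature_vector = model(frames)    # e.g. a 128-dimensional model vector
```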
Step three, the STT module converts the source-language speech into text information, which is acquired by the translation module and the language feature learning module respectively. The language feature learning module extracts and records the language features of the source language; after these features are learned by the DNN, the language feature model is corrected, and the parameters used by the model serve as important reference parameters of the translation module and as pre-judgment information for the translation;
Step four, voice synthesis simulation is performed by the synthesized voice module. The corrected language feature model obtained after translation and DNN learning serves as the information basis of the voice output, and the language information is simulated and output; a synthesized voice model is established by combining a time-interval model and a fundamental-frequency model to generate a time-frequency spectrum signal, and the synthesized voice module performs synthesis processing with the Griffin-Lim algorithm to obtain a speech signal with the corresponding voice characteristics, wherein the synthesized voice model is:
x_{i+1} = F^{-1}(S · P_i), where P_i = F(x_i) / |F(x_i)|,

S is the given time-frequency spectrum signal,
x_i is the signal reconstructed at the i-th iteration,
F is the short-time Fourier transform,
F^{-1} is its inverse transform,
S_i and P_i denote, respectively, the magnitude and phase of the short-time Fourier transform of x_i;
given a time-frequency spectrum signal S, a signal must be reconstructed whose time-frequency spectrum is as close to S as possible;
The human voice features include:
Intensity: the strength of the pronunciation, i.e. the vibration amplitude of the audio signal;
Pitch: the vibration frequency of the audio signal;
Timbre: an important index by which a speaker's voice is distinguished from that of other people. Timbre is determined by the corresponding spectral envelope (Envelope); the ADSR envelope consists of four parameters, Attack, Decay, Sustain and Release, and for the same text the voices of different people differ mainly in these four parameters, as the sketch after this list illustrates;
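A rough sketch of reading the four ADSR parameters off a measured amplitude envelope follows; the 90% decay level and 10% release floor are thresholds assumed for the example, not values prescribed by this method.

```python
# Illustrative ADSR estimation from a sampled amplitude envelope, following
# the Attack/Decay/Sustain/Release phases described above and in FIG. 3.
import numpy as np

def adsr_times(envelope: np.ndarray, sr: int) -> dict:
    peak = int(envelope.argmax())
    attack = peak / sr                                     # A: silence to peak
    sustain_level = 0.9 * envelope[peak]                   # assumed decay target
    decay_end = peak + int(np.argmax(envelope[peak:] <= sustain_level))
    decay = (decay_end - peak) / sr                        # D: peak to stable level
    floor = 0.1 * envelope[peak]                           # assumed release floor
    rel_start = len(envelope) - int(np.argmax(envelope[::-1] > floor))
    sustain = max(rel_start - decay_end, 0) / sr           # S: stable interval
    release = (len(envelope) - rel_start) / sr             # R: fall back to silence
    return {"attack": attack, "decay": decay, "sustain": sustain, "release": release}
```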
Step five, the signal is reconstructed iteratively until the synthesized language and voice characteristics are closest to the speaker's, and the translated content is played in real time to complete the speech translation process;
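By way of illustration only, this iterative reconstruction can be realized as in the minimal sketch below, assuming a NumPy magnitude spectrogram and the librosa STFT routines; the iteration count, the random phase initialization and the numerical floor are choices made for the example. Each pass clamps the magnitude to the given spectrum S while keeping only the phase of the current estimate, so the reconstruction's spectrum moves closer to S.

```python
# Minimal Griffin-Lim sketch matching the update above: keep the phase P_i of
# the current estimate x_i, clamp the magnitude to the given spectrum S.
import numpy as np
import librosa

def griffin_lim(S: np.ndarray, n_iter: int = 60) -> np.ndarray:
    # S: target magnitude spectrogram, shape (1 + n_fft // 2, n_frames)
    n_fft = 2 * (S.shape[0] - 1)
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(S.shape))   # random initial phase P_0
    for _ in range(n_iter):
        x = librosa.istft(S * phase)                   # x_i = F^{-1}(S . P_{i-1})
        D = librosa.stft(x, n_fft=n_fft)               # F(x_i)
        phase = D / np.maximum(np.abs(D), 1e-8)        # P_i = F(x_i) / |F(x_i)|
    return librosa.istft(S * phase)                    # final reconstruction
```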
Further, in the above technical solution, the source speech acquisition in step one includes preprocessing and judgment of the speech signal. The preprocessing includes signal-optimizing operations such as speech enhancement, background sound elimination and echo suppression; the judgment determines whether the speech signal contains language information, and if no language information is detected, the current information is discarded;
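As a non-limiting illustration, the judgment step could be realized with a short-time-energy voice activity check such as the sketch below; the frame length and energy threshold are assumed values, not parameters of the method.

```python
# Illustrative energy-based judgment: report whether any frame of the buffer
# carries speech-like energy; callers discard the buffer when it does not.
import numpy as np

def contains_speech(signal: np.ndarray, frame_len: int = 512,
                    energy_threshold: float = 1e-3) -> bool:
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        return False
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energies = (frames ** 2).mean(axis=1)        # short-time energy per frame
    return bool((energies > energy_threshold).any())
```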
Further, in the above technical solution, a pre-trained acoustic feature model is provided in step two, and the model is corrected each time a new speech acoustic feature is learned;
Further, in the above technical solution, the sound feature learning module in step two includes feature extraction. The extracted features mainly cover language pronunciation, such as vowels, consonants and voiced sounds, and also cover the speaker's pronunciation characteristics, such as sound intensity, pitch and timbre;
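For illustration, the named pronunciation characteristics could be estimated with common signal-processing routines, as in the sketch below; the use of librosa, the file name and the parameter values are assumptions of the example, not choices made by the patent.

```python
# Illustrative extraction of the speaker-pronunciation characteristics named
# above: intensity (amplitude), pitch (vibration frequency) and timbre cues
# (spectral envelope). Routines and parameters are example choices.
import numpy as np
import librosa

y, sr = librosa.load("speaker.wav", sr=16000)              # hypothetical input
intensity = librosa.feature.rms(y=y)[0]                    # sound intensity
f0, voiced, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)  # pitch contour
envelope = np.abs(librosa.stft(y))                         # spectral envelope
```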
Furthermore, in the above technical solution, the main modules of the translation process in step three execute synchronously in real time, while the learning of voice and language features and the model correction can execute asynchronously, so that the real-time performance of the translation process is not affected;
the implementation mode is specifically as follows: after voice information is collected by a voice collecting module, the voice information is sent to a voice characteristic learning module and an STT module, voice characteristics are extracted and then are learned by a deep neural network DNN to establish a voice characteristic model, language pronunciation characteristics are used as characteristic values to be sent to the deep neural network DNN for training and learning, and after learning, a language characteristic model characteristic vector and a human voice characteristic model characteristic vector which are respectively used for reference of a translation and synthesis module are obtained, meanwhile, the STT module converts the character information of the source language, the language feature learning extracts and records the language feature of the source language, the language feature model is corrected after the deep neural network DNN learning and is used as the translation prejudgment information, then the speech synthesis simulation is carried out through the speech synthesis module, based on the language information with the speaking style of the speaker, the synthesized voice is synthesized to make a sound similar to the voice of the speaker, so that the synthesized voice after translation is highly close to the characteristics of the speaker.
While certain exemplary embodiments of the present invention have been described above by way of illustration only, it will be apparent to those of ordinary skill in the art that the described embodiments may be modified in various different ways without departing from the spirit and scope of the invention. Accordingly, the drawings and description are illustrative in nature and should not be construed as limiting the scope of the invention.

Claims (4)

1. An acoustic speech translation method, comprising the steps of:
step one, source-language voice acquisition, wherein voice information is collected by the voice acquisition module and then sent to the voice feature learning module and the STT (Speech-To-Text) module;
step two, extracting the voice features of the speaker by the voice feature learning module, establishing a voice feature model through deep neural network (DNN) learning after the features are extracted, sending the language pronunciation features into the DNN as feature values for training and learning, and obtaining after learning a language feature model vector and a voice feature model vector for reference;
step three, converting the source-language speech into text information by the STT module, the text information being acquired by the translation module and the language feature learning module respectively, wherein the language feature learning module extracts and records the language features of the source language, the language feature model is corrected after DNN learning of these features, and the parameters used by the language feature model serve as important reference parameters of the translation module and as pre-judgment information for the translation;
step four, performing voice synthesis simulation through the synthesized voice module, using the corrected language feature model obtained after translation and DNN learning as the information basis of the voice output, simulating and outputting the language information, establishing a synthesized voice model by combining a time-interval model and a fundamental-frequency model to generate a time-frequency spectrum signal, and performing synthesis processing in the synthesized voice module with the Griffin-Lim algorithm to obtain a speech signal with the corresponding voice characteristics, wherein the synthesized voice model is:
x_{i+1} = F^{-1}(S · P_i), where P_i = F(x_i) / |F(x_i)|,

S is the given time-frequency spectrum signal,
x_i is the signal reconstructed at the i-th iteration,
F is the short-time Fourier transform,
F^{-1} is its inverse transform,
S_i and P_i denote, respectively, the magnitude and phase of the short-time Fourier transform of x_i;
and step five, reconstructing the signal iteratively until the synthesized language and voice characteristics are closest to the speaker's, and translating and playing in real time according to the translated content to complete the speech translation process.
2. The acoustic speech translation method according to claim 1, wherein the source-language voice acquisition in step one comprises preprocessing and judgment of the sound signal, the preprocessing comprises speech enhancement, background sound elimination and echo suppression, and the judgment comprises determining whether the sound signal contains language information, the current information being discarded if no language information is detected.
3. The acoustic speech translation method according to claim 1, wherein a pre-trained voice feature model is provided in step two, and the voice feature model is corrected each time a new voice feature is learned.
4. The acoustic speech translation method according to claim 1, wherein the voice feature learning module in step two comprises feature extraction, the extracted features comprise language pronunciation features, namely vowels, consonants and voiced sounds, and further comprise the speaker's pronunciation characteristics, namely sound intensity, pitch and timbre.
CN202110602693.1A | Priority 2021-05-31 | Filed 2021-05-31 | Original sound speech translation method | Active | Granted as CN113436606B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110602693.1A | 2021-05-31 | 2021-05-31 | Original sound speech translation method (CN113436606B)

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202110602693.1A | 2021-05-31 | 2021-05-31 | Original sound speech translation method (CN113436606B)

Publications (2)

Publication Number | Publication Date
CN113436606A (en) | 2021-09-24
CN113436606B (en) | 2022-03-22

Family

ID=77804065

Family Applications (1)

Application Number | Title | Priority Date | Filing Date | Status
CN202110602693.1A | Original sound speech translation method | 2021-05-31 | 2021-05-31 | Active (CN113436606B)

Country Status (1)

Country Link
CN (1) CN113436606B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102478763B1 (en) 2022-06-28 2022-12-19 (주)액션파워 Method for speech recognition with grapheme information
CN115312029B (en) * 2022-10-12 2023-01-31 之江实验室 Voice translation method and system based on voice depth characterization mapping
CN116416969A (en) * 2023-06-09 2023-07-11 深圳市江元科技(集团)有限公司 Multi-language real-time translation method, system and medium based on big data

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102117614B (en) * 2010-01-05 2013-01-02 索尼爱立信移动通讯有限公司 Personalized text-to-speech synthesis and personalized speech feature extraction
US9564120B2 (en) * 2010-05-14 2017-02-07 General Motors Llc Speech adaptation in speech synthesis
US10446143B2 (en) * 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
CN107146601B (en) * 2017-04-07 2020-07-24 南京邮电大学 Rear-end i-vector enhancement method for speaker recognition system
JP7178028B2 (en) * 2018-01-11 2022-11-25 ネオサピエンス株式会社 Speech translation method and system using multilingual text-to-speech synthesis model
CN111785258B (en) * 2020-07-13 2022-02-01 四川长虹电器股份有限公司 Personalized voice translation method and device based on speaker characteristics

Also Published As

Publication number Publication date
CN113436606A (en) 2021-09-24

Similar Documents

Publication Publication Date Title
CN113436606B (en) Original sound speech translation method
Ai et al. A neural vocoder with hierarchical generation of amplitude and phase spectra for statistical parametric speech synthesis
CN110648684B (en) Bone conduction voice enhancement waveform generation method based on WaveNet
CN109616131B (en) Digital real-time voice sound changing method
CN115294970B (en) Voice conversion method, device and storage medium for pathological voice
JP7124373B2 (en) LEARNING DEVICE, SOUND GENERATOR, METHOD AND PROGRAM
Doi et al. Statistical approach to enhancing esophageal speech based on Gaussian mixture models
Tobing et al. Baseline system of Voice Conversion Challenge 2020 with cyclic variational autoencoder and Parallel WaveGAN
Vallés-Pérez et al. Improving multi-speaker TTS prosody variance with a residual encoder and normalizing flows
CN111326170A (en) Method and device for converting ear voice into normal voice by combining time-frequency domain expansion convolution
Toth et al. Synthesizing speech from electromyography using voice transformation techniques
CN115862590A (en) Text-driven speech synthesis method based on characteristic pyramid
CN113314109B (en) Voice generation method based on cycle generation network
Du et al. Effective wavenet adaptation for voice conversion with limited data
CN114550701A (en) Deep neural network-based Chinese electronic larynx voice conversion device and method
Pan et al. Bone-conducted speech to air-conducted speech conversion based on cycle-consistent adversarial networks
CN113744715A (en) Vocoder speech synthesis method, device, computer equipment and storage medium
TWI746138B (en) System for clarifying a dysarthria voice and method thereof
CN112992118B (en) Speech model training and synthesizing method with few linguistic data
CN112967538B (en) English pronunciation information acquisition system
Chandra et al. Towards The Development Of Accent Conversion Model For (L1) Bengali Speaker Using Cycle Consistent Adversarial Network (Cyclegan)
Lin et al. Investigation of neural network approaches for unified spectral and prosodic feature enhancement
CN117334179A (en) Method, device and storage medium for real-time simulation of designated character tone by digital person
Zhou et al. An improved algorithm of GMM voice conversion system based on changing the time-scale
CN114974271A (en) Voice reconstruction method based on sound channel filtering and glottal excitation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant