CN1534595A - Speech sound change over synthesis device and its method - Google Patents


Info

Publication number
CN1534595A
CN1534595A CNA031160506A CN03116050A
Authority
CN
China
Prior art keywords
voice
speech
unspecified person
unit sequence
specific people
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA031160506A
Other languages
Chinese (zh)
Inventor
张江安
张钦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZHONGYING ELECTRONICS (SHANGHAI) CO Ltd
Original Assignee
ZHONGYING ELECTRONICS (SHANGHAI) CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZHONGYING ELECTRONICS (SHANGHAI) CO Ltd filed Critical ZHONGYING ELECTRONICS (SHANGHAI) CO Ltd
Priority to CNA031160506A priority Critical patent/CN1534595A/en
Publication of CN1534595A publication Critical patent/CN1534595A/en
Pending legal-status Critical Current

Abstract

A speech conversion and synthesis device comprises a speech analysis module, a speech recognition module, and a speech synthesis module that outputs a specific speaker's voice. Based on the analysis and recognition results, the method converts the speech of an arbitrary (unspecified) speaker into the voice of a specific speaker designated by the user.

Description

Speech conversion and synthesis device and method thereof
Technical field
The present invention relates to a speech conversion and synthesis device and method, and in particular to a speech conversion and synthesis device and method for converting the speech of an unspecified speaker into the voice of a specific speaker.
Background technology
Voice conversion technology has wide applications in Text-To-Speech (TTS) system design, voice disguise, toy design, and similar areas. Research on voice conversion essentially focuses on how to establish a conversion relationship between a source speaker and a target speaker based on their speech data.
Known voice conversion methods include vector quantization with codebook mapping, linear transformation, artificial neural networks, and Gaussian mixture models. All of these methods can be used to establish a conversion relationship between speakers' characteristic parameters, such as frequency-domain feature parameters. However, they can only establish a one-to-one conversion relationship, that is, a relationship between one specific source speaker's voice and one specific target speaker's voice. A speech conversion system built with these methods can therefore serve only a specific user and must be rebuilt for each new user. Consequently, known voice conversion methods are unsuitable for applications such as voice disguise or toys, which require converting the speech of arbitrary, unspecified speakers into a specific speaker's voice.
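The one-to-one limitation of these known methods can be seen in a minimal sketch of the vector-quantization/codebook-mapping approach. This is illustrative NumPy code, not the patent's implementation: the function names, the deterministic k-means initialization, and the toy data are assumptions. The point is that the learned mapping is tied to one time-aligned source/target speaker pair.

```python
import numpy as np

def train_codebook_mapping(src_feats, tgt_feats, n_codes=2, n_iter=10):
    """Toy codebook mapping: cluster the source speaker's feature frames
    (k-means), then map each source codeword to the mean of the target
    frames aligned with it. Assumes src_feats and tgt_feats are
    time-aligned (same frame count) -- hence one fixed speaker pair."""
    # deterministic init: pick evenly spaced frames as initial codewords
    idx = np.linspace(0, len(src_feats) - 1, n_codes).astype(int)
    codes = src_feats[idx].astype(float).copy()
    for _ in range(n_iter):
        assign = np.argmin(((src_feats[:, None] - codes[None]) ** 2).sum(-1), axis=1)
        for k in range(n_codes):
            if np.any(assign == k):
                codes[k] = src_feats[assign == k].mean(axis=0)
    mapping = np.stack([tgt_feats[assign == k].mean(axis=0)
                        if np.any(assign == k) else codes[k]
                        for k in range(n_codes)])
    return codes, mapping

def convert_frame(frame, codes, mapping):
    """Replace a source frame by its mapped target codeword."""
    k = int(np.argmin(((codes - frame) ** 2).sum(-1)))
    return mapping[k]
```

Retraining for a new source speaker means recollecting aligned data and rerunning the whole procedure, which is exactly the drawback the invention avoids.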
Summary of the invention
Accordingly, the present invention provides a speech conversion and synthesis device that uses speaker-independent speech recognition technology to recognize the speech of an unspecified speaker, and then synthesizes the corresponding speech data from a specific speaker's speech database according to the recognition result, thereby obtaining the specific speaker's voice.
The present invention also proposes a speech conversion and synthesis method that recognizes the acquired speech of an unspecified speaker and synthesizes the corresponding speech data to obtain a specific speaker's voice.
To achieve the above and other objects, the present invention proposes a speech conversion and synthesis device comprising a speech analysis module, a speech recognition module, and a speech synthesis module.
The speech analysis module receives the unspecified speaker's speech acquired by the device, divides it into frames, and separates it into unvoiced segments and voiced segments. The unvoiced segments are output directly to the output terminal, while the voiced segments are analyzed to produce spectral features and prosodic information.
The speech recognition module is coupled to the speech analysis module. It receives the spectral features transmitted by the speech analysis module, identifies the speech unit sequence contained in the corresponding speech segment, determines the time length (duration) of each speech unit, and outputs the result. The speech recognition module comprises an unspecified-speaker speech database and a speech recognition unit. The database stores the model parameters of all speech units used for speaker-independent recognition; the speech recognition unit, coupled to the database, looks up the database upon receiving spectral features to identify the speech unit sequence contained in the corresponding speech segment.
The speech synthesis module is coupled to the speech recognition module and the speech analysis module. It receives the durations, the speech unit sequence, and the prosodic information, synthesizes speech from the unit data corresponding to the speech unit sequence, and finally outputs the resulting specific speaker's voice through the output terminal. The speech synthesis module comprises a specific-speaker speech database and a speech synthesis unit. The database stores the specific speaker's speech unit data corresponding to the speech unit model parameters; the synthesis unit, coupled to the database, looks up the database upon receiving the speech unit sequence to retrieve the corresponding unit data.
According to a preferred embodiment of the present invention, the unspecified-speaker speech database is built with hidden Markov models (HMMs), and the HMM corresponding to each speech unit is obtained by training on a large amount of continuous speech from unspecified speakers.
According to a preferred embodiment, there may be one or more specific-speaker speech databases, each corresponding to its own specific speaker.
According to a preferred embodiment, the prosodic information comprises the pitch period and the short-time energy.
According to a preferred embodiment, frame division means cutting the series of unspecified-speaker speech samples at a preset time interval.
According to a preferred embodiment, the speech recognition module performs recognition only at the phonetic level, without recognizing semantic units (such as words).
To achieve the above and other objects, the present invention also proposes a speech conversion and synthesis method for converting acquired unspecified-speaker speech into a specific speaker's voice. In the method, the speech analysis module acquires the unspecified speaker's speech, divides it into frames, and separates it into unvoiced and voiced segments; it then analyzes the voiced segments to obtain spectral features and prosodic information. The speech recognition module identifies, from the spectral features, the speech unit sequence contained in the corresponding speech segment and determines the durations of the units. Finally, the speech synthesis module synthesizes the specific speaker's voice from the unit data corresponding to the speech unit sequence and the unvoiced segments, according to the speech unit sequence, durations, and prosodic information, and outputs it through the output terminal.
In order to make the above and other objects, features, and advantages of the present invention more apparent, a preferred embodiment is described in detail below with reference to the accompanying drawings:
Description of drawings
Fig. 1 is a functional block diagram of a speech conversion and synthesis device according to a preferred embodiment of the present invention;
Fig. 2 is a circuit block diagram of an implementation using a digital signal processor according to a preferred embodiment of the present invention; and
Fig. 3 is a flowchart of a speech conversion and synthesis method according to a preferred embodiment of the present invention.
Embodiment
Please refer to Fig. 1, which illustrates the functional block diagram of a speech conversion and synthesis device according to a preferred embodiment of the present invention. The speech conversion and synthesis device 100 can be applied to TTS system design, voice disguise, toy design, and similar areas, and comprises a speech analysis module 110, a speech recognition module 120, and a speech synthesis module 130.
The speech analysis module 110 receives the unspecified speaker's speech acquired by the device 100, divides it into frames, and separates it into unvoiced segments and voiced segments. The unvoiced segments are output directly to the output terminal, while the voiced segments are analyzed into spectral features and prosodic information before being output; the prosodic information comprises the pitch period and the short-time energy.
Frame division means cutting the series of unspecified-speaker speech samples at a preset time interval; for example, every 20 milliseconds of speech is defined as one frame. The preset interval may be fixed when the device 100 leaves the factory.
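The framing and analysis steps above can be sketched as follows. This is a hedged illustration: the patent fixes only the 20 ms frame length, so the 8 kHz sample rate, the energy-only voiced/unvoiced rule, and its threshold are all assumptions.

```python
import numpy as np

SAMPLE_RATE = 8000   # assumed rate; the patent only fixes the 20 ms frame length
FRAME_MS = 20

def split_frames(signal, sr=SAMPLE_RATE, frame_ms=FRAME_MS):
    """Cut the input speech into consecutive, non-overlapping 20 ms frames."""
    n = int(sr * frame_ms / 1000)
    return signal[:len(signal) // n * n].reshape(-1, n)

def short_time_energy(frame):
    """Mean squared amplitude of one frame (the prosodic energy feature)."""
    return float(np.mean(np.asarray(frame, dtype=float) ** 2))

def pitch_period(frame, sr=SAMPLE_RATE, f_lo=50, f_hi=400):
    """Crude autocorrelation pitch-period estimate, in samples."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0..len-1
    lo, hi = sr // f_hi, min(sr // f_lo, len(ac) - 1)   # plausible pitch lags
    return lo + int(np.argmax(ac[lo:hi]))

def is_voiced(frame, energy_thresh=0.01):
    """Toy voiced/unvoiced decision by energy alone (threshold is illustrative)."""
    return short_time_energy(frame) > energy_thresh
```

A 100 Hz tone sampled at 8 kHz, for instance, yields a pitch period of about 80 samples per 20 ms frame.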
The speech recognition module 120 is coupled to the speech analysis module 110. It receives the spectral features transmitted by the module 110, identifies the speech unit sequence contained in the corresponding speech segment, determines the duration of each unit in the sequence, and outputs the result.
The speech recognition module 120 comprises an unspecified-speaker speech database 124 and a speech recognition unit 122. The database 124 stores the models of all speech units used for speaker-independent recognition, and the recognition unit 122, coupled to the database 124, looks up the database upon receiving spectral features to identify the speech unit sequence contained in the corresponding speech segment.
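As an illustrative stand-in for the HMM recognition stage, the sketch below uses one diagonal-Gaussian state per speech unit and a Viterbi search with a self-transition bias, then collapses the best frame labels into (unit, duration) pairs — the two quantities the recognition module passes on. The model form, probabilities, and unit names are assumptions, not the patent's models.

```python
import numpy as np

def viterbi_units(features, unit_means, unit_vars, stay_prob=0.8):
    """Minimal one-state-HMM-per-unit decoder: each unit is a diagonal-
    Gaussian state; Viterbi finds the best unit label per frame, then
    consecutive identical labels are collapsed into (unit, duration)."""
    units = list(unit_means)
    T, n = len(features), len(units)
    mu = np.stack([unit_means[u] for u in units])
    var = np.stack([unit_vars[u] for u in units])
    # log-likelihood of each frame under each unit's Gaussian (constants dropped)
    ll = -0.5 * (((features[:, None] - mu[None]) ** 2 / var[None]).sum(-1)
                 + np.log(var[None]).sum(-1))
    log_trans = np.full((n, n), np.log((1 - stay_prob) / max(n - 1, 1)))
    np.fill_diagonal(log_trans, np.log(stay_prob))
    delta = ll[0].copy()
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # scores[i, j]: from unit i to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + ll[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # backtrack the best path
        path.append(int(back[t, path[-1]]))
    path.reverse()
    # collapse runs into a unit sequence with per-unit durations (in frames)
    seq = []
    for idx in path:
        if seq and seq[-1][0] == units[idx]:
            seq[-1] = (units[idx], seq[-1][1] + 1)
        else:
            seq.append((units[idx], 1))
    return seq
```

The self-transition bias (`stay_prob`) plays the role of the HMM's duration behavior: it discourages spurious unit switches between adjacent frames.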
The speech synthesis module 130 is coupled to the speech recognition module 120 and the speech analysis module 110. It receives the durations and speech unit sequence transmitted by the recognition module 120 and the prosodic information transmitted by the analysis module 110, synthesizes speech from the unit data corresponding to the speech unit sequence, and finally outputs the specific speaker's voice through the output terminal.
The speech synthesis module 130 comprises a plurality of specific-speaker speech databases D1~DN, which store the specific speakers' speech unit data corresponding to the speech unit model parameters, and a speech synthesis unit 132 coupled to these databases. Upon receiving the speech unit sequence, the synthesis unit 132 looks up the databases D1~DN to retrieve the unit data corresponding to the sequence.
In the preferred embodiment, there may be one or more specific-speaker speech databases D1~DN, each corresponding to its own specific speaker.
In the preferred embodiment, the unspecified-speaker speech database is built with hidden Markov models (HMMs), and the HMM corresponding to each speech unit is obtained by training on a large amount of continuous speech from unspecified speakers.
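A minimal sketch of building such a speaker-independent model database: a single diagonal Gaussian per unit, fitted from pooled labeled frames of many speakers, stands in for full HMM training. The data layout and names are illustrative assumptions.

```python
import numpy as np

def train_unit_models(labeled_frames):
    """Fit one diagonal Gaussian per speech unit from pooled, labeled feature
    frames of many speakers -- a stand-in for the HMM training the patent
    describes over large amounts of unspecified-speaker continuous speech."""
    models = {}
    for unit, frames in labeled_frames.items():
        x = np.asarray(frames, dtype=float)
        models[unit] = {"mean": x.mean(axis=0),
                        "var": x.var(axis=0) + 1e-6}  # floor to avoid zero variance
    return models
```

Because the pooled data spans many speakers, the resulting models are not tied to any one user — the property that lets the device accept arbitrary input voices.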
In the preferred embodiment, the speech recognition module 120 performs recognition only at the phonetic level, without recognizing semantic units (such as words).
The device 100 operates as follows. The speech analysis module 110 receives the acquired unspecified-speaker speech, divides it into frames, and separates it into unvoiced and voiced segments; the unvoiced segments are output directly to the output terminal, while the voiced segments are analyzed to obtain spectral features and prosodic information. Next, the speech recognition module 120 receives the spectral features transmitted by the module 110, identifies the speech unit sequence contained in the corresponding speech segment, determines the unit durations, and outputs them. Finally, the speech synthesis module 130 receives the durations and unit sequence from the module 120 and the prosodic information from the module 110, synthesizes speech from the corresponding unit data, and outputs the specific speaker's voice through the output terminal.
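The synthesis step can be sketched as simple concatenative synthesis. This is an assumption-laden illustration: each recognized (unit, duration) pair selects a stored target-speaker waveform, tiled or trimmed to the recognized duration and rescaled to the prosodic energy target; a real system would also apply pitch-period modification, omitted here.

```python
import numpy as np

def synthesize(unit_seq, unit_db, frame_len=160, target_energy=None):
    """Toy concatenative synthesis: pull the specific speaker's stored
    waveform for each recognized unit, fit it to the recognized duration
    (in frames), and optionally rescale to the short-time-energy target."""
    out = []
    for i, (unit, dur) in enumerate(unit_seq):
        template = np.asarray(unit_db[unit], dtype=float)
        n = dur * frame_len
        seg = np.tile(template, -(-n // len(template)))[:n]  # ceil-tile, then trim
        if target_energy is not None:
            e = np.mean(seg ** 2)
            if e > 0:
                seg = seg * np.sqrt(target_energy[i] / e)    # match prosodic energy
        out.append(seg)
    return np.concatenate(out) if out else np.zeros(0)
```

Transferring the source speaker's energy contour onto the target speaker's units is one concrete way the prosodic information from module 110 can shape the output voice.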
Please refer to Fig. 2, which illustrates a circuit block diagram of an implementation using a digital signal processor according to the preferred embodiment. In Fig. 2, the voice conversion device 100 comprises an analog-to-digital converter 200, a digital signal processor 210, a digital-to-analog converter 220, an unspecified-speaker speech database 230, and a plurality of specific-speaker speech databases D1~DN.
The analog-to-digital converter 200 serves as the speech input port; it converts the received analog speech signal of the unspecified speaker into a digital signal and outputs it. The digital signal processor 210 performs the computations of the speech conversion, including the analysis and recognition of the unspecified speaker's speech and the synthesis of the specific speaker's speech. The digital-to-analog converter 220 serves as the speech output port; it converts the specific speaker's digital speech signal into an analog signal and outputs it. The unspecified-speaker speech database 230, a read-only memory, stores the speech conversion program and the hidden Markov model (HMM) parameters. The specific-speaker speech databases D1~DN are memories storing the speech data of the respective specific speakers.
In the preferred embodiment, the digital signal processor 210 comprises an input buffer 212, a digital signal processing core 214, and an output buffer 216. The input buffer 212 stores the spectral and prosodic parameters of the input speech segments; the processing core 214 performs the speech conversion computations; and the output buffer 216 stores the output speech.
Please refer to Fig. 3, which illustrates the flowchart of a speech conversion and synthesis method according to the preferred embodiment; for ease of understanding, please refer to Fig. 1 and Fig. 3 together. In the method, the speech analysis module 110 acquires the unspecified speaker's speech (s302), divides it into frames, and separates it into unvoiced and voiced segments (s304); it then analyzes the voiced segments to obtain spectral features and prosodic information (s306). The speech recognition module 120 looks up the unspecified-speaker speech database 124 according to the spectral features to identify the speech unit sequence contained in the corresponding speech segment, and determines the unit durations. Finally, the speech synthesis module 130 receives the speech unit sequence, durations, and prosodic information, retrieves from the specific-speaker speech databases D1~DN the unit data corresponding to the sequence, synthesizes the specific speaker's voice from the unvoiced segments and the retrieved unit data according to the sequence, durations, and prosodic information, and outputs it through the output terminal.
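Putting the steps of Fig. 3 together, an end-to-end toy pipeline might look like this. Everything speaker-specific here — the unit name, the nearest-mean recognizer standing in for the HMM stage, and the 120 Hz target template — is hypothetical, chosen only to make the flow runnable.

```python
import numpy as np

def convert_speech(signal, sr=8000, frame_ms=20, energy_thresh=0.01):
    """End-to-end sketch of the method flow: frame the input (s302/s304),
    pass unvoiced (low-energy) frames straight to the output, and replace
    each voiced frame by a unit drawn from a toy one-entry specific-speaker
    database. A nearest-mean rule stands in for HMM recognition."""
    n = int(sr * frame_ms / 1000)
    frames = signal[:len(signal) // n * n].reshape(-1, n)
    unit_means = {"u1": 0.5}  # hypothetical model: expected mean |x| of the unit
    unit_db = {"u1": np.sin(2 * np.pi * np.arange(n) * 120 / sr)}  # target voice
    out = []
    for f in frames:
        if np.mean(f ** 2) <= energy_thresh:          # unvoiced: copy through
            out.append(f)
        else:                                          # voiced: recognize + resynthesize
            unit = min(unit_means,
                       key=lambda u: abs(float(np.mean(np.abs(f))) - unit_means[u]))
            out.append(unit_db[unit])
    return np.concatenate(out) if out else np.zeros(0)
```

Note how the unvoiced path is a pure copy, matching the description's point that unvoiced segments go directly to the output terminal.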
In summary, the speech conversion and synthesis device and method of the present invention have the following advantages:
(1) They can convert any acquired speech into a specific speaker's voice without adjustment during use, and therefore adapt readily to new users.
(2) Without changing the structure or parameters of the device, simply adding a new specific-speaker speech database gives the device the ability to convert speech into the new specific speaker's voice.
Although the present invention has been disclosed above with a preferred embodiment, this embodiment is not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention; the scope of protection of the present invention is therefore defined by the appended claims.

Claims (11)

1. A speech conversion and synthesis device, adapted to convert and synthesize acquired unspecified-speaker speech, the device comprising:
a speech analysis module, which receives the unspecified-speaker speech, divides it into frames, and separates it into an unvoiced segment and a voiced segment, wherein the unvoiced segment is output to an output terminal, and the voiced segment is analyzed into a spectral feature and prosodic information before being output;
a speech recognition module, coupled to the speech analysis module, which receives the spectral feature transmitted by the speech analysis module, identifies a speech unit sequence contained in the speech segment corresponding to the spectral feature, determines a duration of the speech unit sequence, and outputs the result; and
a speech synthesis module, coupled to the speech recognition module and the speech analysis module, which receives the prosodic information, the duration, and the speech unit sequence, synthesizes a specific speaker's voice from the specific speaker's speech unit data corresponding to the speech unit sequence according to the speech unit sequence, the duration, and the prosodic information, and outputs the specific speaker's voice through the output terminal.
2. The speech conversion and synthesis device as claimed in claim 1, wherein the speech recognition module comprises:
an unspecified-speaker speech database, which stores the speech units used for recognizing the unspecified-speaker speech; and
a speech recognition unit, coupled to the unspecified-speaker speech database, which, upon receiving the spectral feature, looks up the unspecified-speaker speech database to identify the speech unit sequence contained in the speech segment corresponding to the spectral feature.
3. The speech conversion and synthesis device as claimed in claim 2, wherein the unspecified-speaker speech database is built with a hidden Markov model, and the hidden Markov model is obtained by training on a large amount of continuous speech from unspecified speakers.
4. The speech conversion and synthesis device as claimed in claim 1, wherein the speech synthesis module comprises:
a specific-speaker speech database, which stores the specific speaker's speech unit data corresponding to the speech unit sequence; and
a speech synthesis unit, coupled to the specific-speaker speech database, which, upon receiving the speech unit sequence, looks up the specific-speaker speech database to retrieve the specific speaker's speech unit data corresponding to the speech unit sequence.
5. The speech conversion and synthesis device as claimed in claim 4, wherein the specific-speaker speech database stores speech data of at least one specific speaker.
6. The speech conversion and synthesis device as claimed in claim 1, wherein the prosodic information comprises a pitch period and a short-time energy.
7. The speech conversion and synthesis device as claimed in claim 1, wherein dividing the unspecified-speaker speech into frames means cutting the series of unspecified-speaker speech samples at a preset time interval.
8. The speech conversion and synthesis device as claimed in claim 1, wherein the speech recognition module performs recognition only at the phonetic level, without recognizing semantic units.
9. A speech conversion and synthesis method, comprising the following steps:
acquiring unspecified-speaker speech;
dividing the unspecified-speaker speech into frames and separating it into an unvoiced segment and a voiced segment;
analyzing the voiced segment to obtain a spectral feature and prosodic information;
identifying, according to the spectral feature, a speech unit sequence contained in the corresponding speech segment, and determining a duration of the speech unit sequence; and
synthesizing a specific speaker's voice from the speech unit data corresponding to the speech unit sequence and the unvoiced segment, according to the speech unit sequence, the duration, and the prosodic information, and outputting the result.
10. The speech conversion and synthesis method as claimed in claim 9, wherein the prosodic information comprises a pitch period and a short-time energy.
11. The speech conversion and synthesis method as claimed in claim 9, wherein dividing the unspecified-speaker speech into frames means cutting the series of unspecified-speaker speech samples at a preset time interval.
CNA031160506A 2003-03-28 2003-03-28 Speech sound change over synthesis device and its method Pending CN1534595A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA031160506A CN1534595A (en) 2003-03-28 2003-03-28 Speech sound change over synthesis device and its method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA031160506A CN1534595A (en) 2003-03-28 2003-03-28 Speech sound change over synthesis device and its method

Publications (1)

Publication Number Publication Date
CN1534595A true CN1534595A (en) 2004-10-06

Family

ID=34284550

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA031160506A Pending CN1534595A (en) 2003-03-28 2003-03-28 Speech sound change over synthesis device and its method

Country Status (1)

Country Link
CN (1) CN1534595A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100349206C (en) * 2005-09-12 2007-11-14 周运南 Text-to-speech interchanging device
CN102737628A (en) * 2012-07-04 2012-10-17 哈尔滨工业大学深圳研究生院 Method for converting voice based on linear predictive coding and radial basis function neural network
CN103794206B (en) * 2014-02-24 2017-04-19 联想(北京)有限公司 Method for converting text data into voice data and terminal equipment
CN103794206A (en) * 2014-02-24 2014-05-14 联想(北京)有限公司 Method for converting text data into voice data and terminal equipment
CN104123932A (en) * 2014-07-29 2014-10-29 科大讯飞股份有限公司 Voice conversion system and method
CN105227966A (en) * 2015-09-29 2016-01-06 深圳Tcl新技术有限公司 To televise control method, server and control system of televising
CN105206257B (en) * 2015-10-14 2019-01-18 科大讯飞股份有限公司 A kind of sound converting method and device
CN105206257A (en) * 2015-10-14 2015-12-30 科大讯飞股份有限公司 Voice conversion method and device
WO2017067206A1 (en) * 2015-10-20 2017-04-27 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and device
US10410621B2 (en) 2015-10-20 2019-09-10 Baidu Online Network Technology (Beijing) Co., Ltd. Training method for multiple personalized acoustic models, and voice synthesis method and device
CN105654941A (en) * 2016-01-20 2016-06-08 华南理工大学 Voice change method and device based on specific target person voice change ratio parameter
CN109935225A (en) * 2017-12-15 2019-06-25 富泰华工业(深圳)有限公司 Character information processor and method, computer storage medium and mobile terminal
WO2021120145A1 (en) * 2019-12-20 2021-06-24 深圳市优必选科技股份有限公司 Voice conversion method and apparatus, computer device and computer-readable storage medium

Similar Documents

Publication Publication Date Title
CN1169115C (en) Prosodic databases holding fundamental frequency templates for use in speech synthesis
Purwins et al. Deep learning for audio signal processing
CN1156819C (en) Method of producing individual characteristic speech sound from text
CN101599271B (en) Recognition method of digital music emotion
Lu et al. Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis
CN102496363B (en) Correction method for Chinese speech synthesis tone
CN1819017A (en) Method for extracting feature vectors for speech recognition
CN105654939A (en) Voice synthesis method based on voice vector textual characteristics
CN116364055B (en) Speech generation method, device, equipment and medium based on pre-training language model
CN1300049A (en) Method and apparatus for identifying speech sound of chinese language common speech
CN1534595A (en) Speech sound change over synthesis device and its method
CN1924994A (en) Embedded language synthetic method and system
CN113035228A (en) Acoustic feature extraction method, device, equipment and storage medium
KR20200088263A (en) Method and system of text to multiple speech
KR20190135853A (en) Method and system of text to multiple speech
EP3363015A1 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
Gaol et al. Match to win: Analysing sequences lengths for efficient self-supervised learning in speech and audio
CN114495969A (en) Voice recognition method integrating voice enhancement
CN109979441A (en) A kind of birds recognition methods based on deep learning
CN113297383A (en) Knowledge distillation-based speech emotion classification method
CN112242134A (en) Speech synthesis method and device
CN116580698A (en) Speech synthesis method, device, computer equipment and medium based on artificial intelligence
CN116913244A (en) Speech synthesis method, equipment and medium
CN1113330C (en) Phoneme regulating method for phoneme synthesis
CN115206284A (en) Model training method, device, server and medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication