CN1658281A - Voice operation device, method and recording medium for recording voice operation program - Google Patents

Voice operation device, method and recording medium for recording voice operation program Download PDF

Info

Publication number
CN1658281A
CN1658281A · CN2005100074542A · CN200510007454A
Authority
CN
China
Prior art keywords
mentioned
phoneme
information
voice quality
formant
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2005100074542A
Other languages
Chinese (zh)
Other versions
CN100337104C (en)
Inventor
川原毅彦
剑持秀纪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yamaha Corp
Original Assignee
Yamaha Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yamaha Corp filed Critical Yamaha Corp
Publication of CN1658281A publication Critical patent/CN1658281A/en
Application granted granted Critical
Publication of CN100337104C publication Critical patent/CN100337104C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Document Processing Apparatus (AREA)
  • Telephonic Communication Services (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

Provided are a speech synthesizer and related method capable of synthesizing speech of various voice qualities even in an environment where severe restrictions are imposed on hardware resources. The speech synthesizer 100, which holds only one kind of phoneme data, is provided with a voice quality change unit 250 and a voice quality database 260. The voice quality change unit 250 searches the voice quality database 260 using a voice quality data number supplied from a text analysis unit 220 as a search key, and obtains a voice quality parameter. Based on the obtained voice quality parameter, the voice quality change unit 250 changes the voice quality of each sound represented by the phoneme data obtained by a phoneme data acquisition unit 230.

Description

Speech synthesis device, speech synthesis method, and recording medium storing a speech synthesis program
Technical field
The present invention relates to a speech synthesis device, a speech synthesis method, and a recording medium storing a speech synthesis program, for generating synthetic speech from input text information.
Background technology
Fig. 17 shows the structure of a conventional speech synthesizer 100 that generates synthetic speech from input text information.
The input unit 110 receives text information such as "こんにちわ" from an operating unit or the like (not shown) and supplies it to the text analysis unit 120. Using a word dictionary and the like, the text analysis unit 120 performs word analysis, syntax analysis, and so on on the received text information, generating phoneme information representing each phoneme of the moras "こ", "ん", "に", "ち", "わ", together with prosodic information representing the length, pitch, and intensity of each phoneme, and supplies these to the speech synthesis unit 130. For each piece of phoneme information supplied from the text analysis unit 120, the speech synthesis unit 130 obtains the speech data of the corresponding mora (hereinafter, phoneme data) from the phoneme database 140. The speech synthesis unit 130 then processes and concatenates the obtained phoneme data as appropriate according to the prosodic information to generate a synthetic speech signal, which is output as synthetic speech from a loudspeaker or the like. By listening to the synthetic speech output from the speech synthesizer, the user can confirm the content of the input text information.
However, the above phoneme database registers only one kind of phoneme data, uttered by a specific speaker (for example, a male speaker). Consequently, when text of the kind favored by, for example, a young woman is output as synthetic speech having that specific speaker's voice quality, the user perceives a mismatch between the voice quality and the content of the speech.
To solve this problem, a technique has been proposed in which multiple kinds of phoneme data (for example, different phoneme data for a man, a woman, a child, an elderly person, and so on) are registered in the phoneme database in advance, the optimum phoneme data is selected according to the content of the input text information and the like, and the selected phoneme data is used to generate the synthetic speech (see, for example, Patent Document 1).
Patent Document 1: Japanese Unexamined Patent Publication No. 2000-339137 (pages 3-4)
Summary of the invention
The technique disclosed in Patent Document 1 can indeed produce suitable synthetic speech, but to do so it must register multiple kinds of phoneme data in the phoneme database. This raises a problem: in a portable terminal or other device whose hardware resources, such as memory and CPU, are severely limited, such multiple kinds of phoneme data cannot be installed, and consequently the device cannot generate synthetic speech of various voice qualities.
The present invention was made in view of the above problems, and its object is to provide a speech synthesis device, a speech synthesis method, and a recording medium storing a speech synthesis program that can generate synthetic speech of various voice qualities even in an environment where hardware resources are severely limited.
To solve the above problems, the speech synthesis device of the present invention is characterized by comprising: an acquisition unit that obtains, from text information input to the speech synthesis device, phoneme designation information designating the phonemes of the synthetic speech and voice quality designation information designating the voice quality of the synthetic speech; a first storage unit that stores a plurality of phoneme data each representing a phoneme; a second storage unit that stores multiple kinds of phoneme data processing information, each being information for changing the voice quality of the phonemes and representing the processing to be applied to the phoneme data; a first extraction unit that extracts, from the first storage unit, the phoneme data corresponding to the phonemes represented by the phoneme designation information; a second extraction unit that extracts, from the second storage unit, the phoneme data processing information corresponding to the voice quality represented by the voice quality designation information; and a generation unit that processes the extracted phoneme data according to the extracted phoneme data processing information and generates the synthetic speech.
In the speech synthesis device of the present invention, preferably each phoneme data includes formant information representing the formants of the phoneme, the phoneme data processing information includes formant change information representing how the formants are to be changed, and the generation unit changes the formant information according to the formant change information and generates the synthetic speech from the signal waveform obtained by summing the changed formants.
In the speech synthesis device of the present invention, preferably the formant information consists of pairs of a formant frequency and a formant amplitude, the formant change information includes formant frequency change information representing how the formant frequencies are to be changed and formant amplitude change information representing how the formant amplitudes are to be changed, and the generation unit changes each formant frequency and each formant amplitude of the phoneme represented by the phoneme data according to the formant frequency change information and the formant amplitude change information, thereby obtaining the changed formants.
In the speech synthesis device of the present invention, preferably the acquisition unit obtains from the text information, in addition to the phoneme designation information and the voice quality designation information, pitch designation information designating the pitch of the synthetic speech, and the generation unit obtains the synthetic speech by imparting the pitch represented by the pitch designation information to the signal waveform produced by summing the changed formants.
In the speech synthesis device of the present invention, preferably the text information includes the voice quality designation information, and the acquisition unit obtains the voice quality designation information from the text information.
In the speech synthesis device of the present invention, preferably the acquisition unit extracts keywords from the text information and, from the extracted keywords, judges the voice quality suited to the text information.
The present invention also provides a speech synthesis method characterized by comprising: an acquisition step of obtaining, from text information input to a speech synthesis device, phoneme designation information designating the phonemes of synthetic speech and voice quality designation information designating the voice quality of the synthetic speech; a first extraction step of extracting, from a first storage unit storing a plurality of phoneme data each representing a phoneme, the phoneme data corresponding to the phonemes represented by the phoneme designation information; a second extraction step of extracting, from a second storage unit storing multiple kinds of phoneme data processing information, each being information for changing the voice quality of the phonemes and representing the processing to be applied to the phoneme data, the phoneme data processing information corresponding to the voice quality represented by the voice quality designation information; and a generation step of processing the extracted phoneme data according to the extracted phoneme data processing information and generating the synthetic speech.
Effect of the Invention
As described above, according to the present invention, synthetic speech of various voice qualities can be generated even in an environment where hardware resources are severely limited.
Description of drawings
Fig. 1 is a block diagram of the functional structure of the speech synthesis device of the present embodiment.
Fig. 2 is a diagram illustrating an example of the text information of the embodiment.
Fig. 3 is a diagram illustrating an example of the registered contents of the phoneme database of the embodiment.
Fig. 4 is a diagram illustrating an example of the phoneme data structure of the embodiment.
Fig. 5 is a diagram for explaining the frame information contained in the phoneme data of the embodiment.
Fig. 6 is a diagram illustrating an example of the registered contents of the voice quality database of the embodiment.
Fig. 7 is a diagram showing an example of the structure of a voice quality parameter of the embodiment.
Fig. 8 is a flowchart showing the voice quality change processing of the embodiment.
Fig. 9 is a diagram illustrating an example of the mapping function of the embodiment.
Fig. 10 is a diagram showing the analysis result of a male phoneme in the embodiment.
Fig. 11 is a diagram showing the analysis result of a female phoneme in the embodiment.
Fig. 12 is a diagram illustrating an example of the oscillation table of the embodiment.
Fig. 13 is a diagram illustrating the relation between the jitter values read from the oscillation table and time in the embodiment.
Fig. 14 is a diagram for explaining the oscillation of the formant frequency in the embodiment.
Fig. 15 is a diagram for explaining the pitch imparting processing of the embodiment.
Fig. 16 is a diagram illustrating the waveform of a specific formant after the voice quality change processing and the pitch imparting processing of the embodiment.
Fig. 17 is a diagram showing the functional structure of a conventional speech synthesis device.
Embodiment
Embodiments of the present invention are described below with reference to the drawings.
A. Present embodiment
Fig. 1 shows the functional structure of the speech synthesis device 100 of the present embodiment. The present embodiment assumes that the speech synthesis device 100 is installed in a portable terminal whose hardware resources are severely limited, such as a mobile phone, a PHS (Personal Handyphone System), or a PDA (Personal Digital Assistant); however, the invention is not limited to this and can be applied to various electronic devices.
The input unit 210 supplies text information, input via an operating unit or the like (not shown), to the text analysis unit 220. Fig. 2 illustrates an example of the text information.
The text content information represents the text to be output as synthetic speech (for example "こんにちわ"). Although Fig. 2 shows text content information written only in hiragana, the text content information is not limited to hiragana and may be expressed in various characters such as kanji, Roman letters, and katakana, as well as various symbols.
The voice quality data number (voice quality designation information) is a unique number (K1 to Kn in Fig. 2) for identifying one of a plurality of voice quality parameters (phoneme data processing information) described later. In the present embodiment, by appropriately selecting among these voice quality parameters, synthetic speech of various voice qualities can be obtained from the single kind of phoneme data uttered by a specific speaker (assumed here to be a male speaker), as described in detail later.
The pitch information (pitch designation information) is information for imparting a pitch to the synthetic speech (in other words, for designating the pitch of the synthetic speech); it consists of information designating a note of the scale from "C" to "B" (see Fig. 2).
The text analysis unit 220 analyzes the text information supplied from the input unit 210 and supplies the analysis results to the phoneme data acquisition unit 230, the voice quality change unit 250, and the speech signal generation unit 270. Specifically, when supplied with the text information shown in Fig. 2, the text analysis unit 220 first decomposes the text content information "こんにちわ" into the mora phonemes "こ", "ん", "に", "ち", "わ". A mora is a unit of pronunciation, basically a syllable consisting of one consonant and one vowel.
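The decomposition into moras can be sketched as follows. This is a minimal illustration assuming hiragana input; the small-kana rule is an invented detail for illustration, since the patent does not specify the algorithm used by the text analysis unit 220.

```python
# Minimal sketch: split hiragana text into moras, as the text analysis
# unit 220 is described as doing. The small-kana rule is an assumption
# for illustration; the patent does not specify the algorithm.
SMALL_KANA = set("ゃゅょぁぃぅぇぉ")

def split_moras(text: str) -> list[str]:
    """Split a hiragana string into moras, attaching a small kana
    (e.g. the ゃ of 'しゃ') to the preceding character."""
    moras: list[str] = []
    for ch in text:
        if ch in SMALL_KANA and moras:
            moras[-1] += ch  # a contracted sound forms one mora with its base
        else:
            moras.append(ch)
    return moras

print(split_moras("こんにちわ"))  # → ['こ', 'ん', 'に', 'ち', 'わ']
```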
After decomposing the text content information into mora phonemes in this manner, the text analysis unit (acquisition unit) 220 generates phoneme information (phoneme designation information) designating each phoneme of the synthetic speech and supplies it in turn to the phoneme data acquisition unit 230. Next, the text analysis unit 220 obtains the voice quality data number (for example K3) and the pitch information (for example C) from the text information, supplies the obtained voice quality data number to the voice quality change unit 250, and supplies the obtained pitch information to the speech signal generation unit 270.
The phoneme data acquisition unit (first extraction unit) 230 searches the phoneme database 240 using the phoneme information supplied from the text analysis unit 220 as a key, thereby obtaining the phoneme data corresponding to the phoneme represented by the phoneme information. Fig. 3 illustrates the registered contents of the phoneme database 240. As shown in Fig. 3, the phoneme database (first storage unit) 240 registers a series of phoneme data 1 to m representing the mora phonemes ("あ", "い", …, "ん", etc.) uttered by the male speaker, together with the number of these registered phoneme data (hereinafter, the registered phoneme data count) and the like.
Fig. 4 illustrates the structure of the phoneme data representing a certain phoneme (for example "こ"), and Fig. 5 explains the frame information contained in the phoneme data. Part A of Fig. 5 shows the relation between the speech waveform vw produced when the above male speaker utters a certain phoneme (for example "こ") and the frames FR, while parts B, C, and D of Fig. 5 show the formant analysis results for the 1st frame FR1, the 2nd frame FR2, and the n-th frame FRn, respectively.
As shown in Fig. 4, the phoneme data consists of the 1st to n-th frame information. Each frame information contains the 1st to k-th formant information, obtained by formant analysis of the corresponding frame FR (see Fig. 5), and a voiced/unvoiced flag indicating whether the speech of the frame is voiced or unvoiced (for example, "1" = voiced, "0" = unvoiced).
The 1st to k-th formant information of each frame information consists of pairs of a formant frequency F and a formant amplitude A representing the corresponding formant (see parts B to D of Fig. 5). For example, the 1st to k-th formant information of the 1st frame information consists of the pairs of formant frequency and formant amplitude (F11, A11), (F12, A12), …, (F1k, A1k) (see part B of Fig. 5), and the 1st to k-th formant information of the n-th frame information consists of the pairs (Fn1, An1), (Fn2, An2), …, (Fnk, Ank) (see part D of Fig. 5).
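The frame and formant layout described above can be sketched as simple data structures. The field names and example values below are illustrative assumptions, not taken from the patent.

```python
from dataclasses import dataclass

# Sketch of the phoneme-data layout of Figs. 4 and 5: per-frame formant
# (frequency, amplitude) pairs plus a voiced/unvoiced flag. Field names
# and example values are illustrative assumptions.
@dataclass
class FormantInfo:
    frequency: float  # formant frequency F (Hz)
    amplitude: float  # formant amplitude A

@dataclass
class FrameInfo:
    formants: list    # 1st to k-th formant information of the frame
    voiced: bool      # voiced/unvoiced flag: True = voiced ("1")

@dataclass
class PhonemeData:
    phoneme: str      # e.g. "こ"
    frames: list      # 1st to n-th frame information

frame1 = FrameInfo(
    formants=[FormantInfo(frequency=250.0, amplitude=0.8),   # (F11, A11)
              FormantInfo(frequency=2100.0, amplitude=0.3)], # (F12, A12)
    voiced=True,
)
ko = PhonemeData(phoneme="こ", frames=[frame1])
print(len(ko.frames[0].formants))  # → 2
```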
After obtaining the phoneme data corresponding to each phoneme information supplied from the text analysis unit 220 (the phoneme information representing "こ", "ん", "に", "ち", "わ", etc.), the phoneme data acquisition unit 230 supplies these phoneme data to the voice quality change unit 250.
The voice quality change unit 250 changes the voice quality of the phonemes represented by the phoneme data obtained by the phoneme data acquisition unit 230. Specifically, the voice quality change unit (second extraction unit) 250 first searches the voice quality database (second storage unit) 260 using the voice quality data number supplied from the text analysis unit 220 as a search key, and obtains the corresponding voice quality parameter. The voice quality change unit 250 then changes the voice quality of each phoneme according to the obtained voice quality parameter.
Fig. 6 illustrates the registered contents of the voice quality database 260.
As shown in Fig. 6, the voice quality database (second storage unit) 260 stores, as the information needed to change the voice quality of the phonemes, multiple voice quality parameters 1 to L representing the processing to be applied to the phoneme data, together with registration count information representing the number of these voice quality parameters.
Fig. 7 shows an example of the structure of a voice quality parameter.
As shown in Fig. 7, a voice quality parameter (phoneme data processing information) contains the voice quality data number that identifies the parameter, a gender change flag indicating whether the gender of the synthetic speech is to be changed, and the 1st to k-th formant change information representing how the 1st to k-th formants are to be changed. When the gender change flag is set to "1", the voice quality change unit 250 performs processing that changes the gender of the synthetic speech (hereinafter, gender change processing); when the flag is set to "0", the gender change processing is not performed (described in detail later). In the present embodiment, since the single kind of phoneme data is assumed to be uttered by a male speaker, setting the gender change flag to "1" changes the character of the synthetic speech from male to female; with the flag set to "0", the synthetic speech keeps its male character unchanged.
Each formant change information, in turn, contains basic waveform selection information for selecting the basic waveform (sine wave or the like, described later) of the formant, formant frequency change information representing how the formant frequency is to be changed, and formant amplitude change information representing how the formant amplitude is to be changed.
The formant frequency change information contains values representing the change amount, oscillation speed, and oscillation amplitude of the formant frequency, and the formant amplitude change information likewise contains values representing the change amount, oscillation speed, and oscillation amplitude of the formant amplitude. The change amount, oscillation speed, and oscillation amplitude of the formant frequency and formant amplitude are described in detail later.
Fig. 8 is a flowchart of the voice quality change processing performed by the voice quality change unit 250.
On receiving a voice quality data number from the text analysis unit 220, the voice quality change unit (generation unit) 250 searches the voice quality database 260 using the number as a search key and obtains the corresponding voice quality parameter (step S1). The voice quality change unit 250 then refers to the gender change flag in the obtained voice quality parameter and judges whether the gender of the synthetic speech should be changed, that is, whether the gender change processing should be performed (step S2). When the gender change flag is set to "0", so that the voice quality change unit 250 judges that no gender change should be performed, step S3 is skipped and processing proceeds to step S4; when the flag is set to "1", so that the voice quality change unit 250 judges that the gender change should be performed, processing proceeds to step S3 and the gender change processing is carried out.
Fig. 9 illustrates the mapping function mf, stored in a storage unit (not shown), that is used for the gender change processing; Figs. 10 and 11 show the analysis results of a male and a female uttering the same phoneme (for example "あ"), respectively. In Fig. 9, the horizontal axis of the mapping function mf represents the input frequency (the formant frequency input to the voice quality change unit 250), the vertical axis represents the output frequency (the changed formant frequency output from the voice quality change unit 250), and fmax is the maximum formant frequency that can be input. In the analysis graphs g1 and g2 of Figs. 10 and 11, the horizontal axis represents frequency and the vertical axis represents amplitude.
Comparison of the analysis graphs g1 and g2 of Figs. 10 and 11 shows that the 1st to 4th formant frequencies fm1 to fm4 of the male phoneme are lower than the 1st to 4th formant frequencies ff1 to ff4 of the female phoneme. In the present embodiment, therefore, as shown in Fig. 9, a mapping function mf lying above the line n1 (input frequency = output frequency; see the broken line) is used to change a phoneme with male characteristics into a phoneme with female characteristics (see the solid line).
Specifically, the voice quality change unit 250 uses the mapping function mf of Fig. 9 to shift each formant frequency of the input phoneme data toward higher frequencies. Each formant frequency of the input male phoneme is thereby changed to a formant frequency with female characteristics. Conversely, when the formant frequencies of a female phoneme are input, a mapping function mf' lying below the line n1 (see the dash-dot line in Fig. 9) can be used.
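As an illustration, a mapping function lying above the identity line can be realized with a power-law warp. The specific curve below is an assumption: the patent only defines mf graphically as a function above the line n1, and fmax is given an assumed value here.

```python
# Sketch of the gender-change mapping of Fig. 9. A mapping function that lies
# above the line "output = input" raises every formant frequency, shifting a
# male phoneme toward female characteristics; the power-law curve used here
# is an assumption (the patent defines mf only graphically).
F_MAX = 8000.0  # assumed maximum input formant frequency (fmax)

def map_male_to_female(freq_in: float, warp: float = 0.85) -> float:
    """Warp a formant frequency upward: for 0 < warp < 1 the curve
    fmax * (f/fmax)**warp lies above the identity line on (0, fmax)."""
    x = min(max(freq_in, 0.0), F_MAX) / F_MAX
    return F_MAX * (x ** warp)

def map_female_to_male(freq_in: float, warp: float = 0.85) -> float:
    """Inverse curve, lying below the identity line (mf' in Fig. 9)."""
    x = min(max(freq_in, 0.0), F_MAX) / F_MAX
    return F_MAX * (x ** (1.0 / warp))

f_male = 700.0
f_female = map_male_to_female(f_male)
assert f_female > f_male  # formant frequencies move toward higher values
assert abs(map_female_to_male(f_female) - f_male) < 1e-6  # mf' inverts mf
```

The endpoints 0 and fmax map to themselves, so the warp only reshapes the interior of the frequency range, matching the shape of the curves in Fig. 9.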
After the gender change processing, the voice quality change unit 250 proceeds to step S4 and converts each formant frequency according to the change amount given in the corresponding formant change information. The voice quality change unit 250 then oscillates each converted formant frequency, performing the frequency oscillation processing (step S5).
Fig. 12 illustrates the oscillation table TA, stored in a storage unit (not shown), that is used in the frequency oscillation processing, and Fig. 13 illustrates the relation between the jitter values read from this oscillation table TA and time. For convenience of explanation, the present embodiment assumes that the same oscillation table TA is used to oscillate each of the formant frequencies, but different oscillation tables, with different jitter values and the like, may be used for the individual formant frequencies.
The oscillation table TA is a table in which jitter values are registered in time order. The voice quality change unit 250 controls the speed at which the jitter values registered in the oscillation table TA are read (or the number of jitter values skipped, i.e., not read) according to the oscillation speed of the formant frequency given in the formant change information, and performs the frequency oscillation processing by multiplying each jitter value read by the oscillation amplitude of the formant frequency given in the formant change information. This yields a waveform in which the formant frequency fm oscillates with oscillation speed sp and oscillation amplitude lv, as shown in Fig. 14. In the present embodiment, to reduce the amount of computation for the oscillation of the formant frequency, the oscillation table TA is used as illustrated; however, instead of using the oscillation table TA, the oscillation of the formant frequency may be obtained from a predetermined function.
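The table read-out described above can be sketched as follows. The table contents (one sine period) and the parameter units are assumptions for illustration; the patent does not specify what jitter values the table holds.

```python
import math

# Sketch of the table-driven frequency oscillation of Figs. 12-14: jitter
# values are read from a table in time order, the read step is governed by
# the oscillation speed, and each value read is multiplied by the oscillation
# amplitude before being added to the formant frequency. The table contents
# (one sine period) and parameter units are assumptions.
VIBRATION_TABLE = [math.sin(2 * math.pi * i / 64) for i in range(64)]

def oscillated_frequency(base_freq, n_samples, speed, amplitude):
    """Return n_samples formant-frequency values oscillating around base_freq.
    `speed` is the table read step per sample (a larger step skips entries,
    i.e. oscillates faster); `amplitude` scales the jitter values (Hz)."""
    out = []
    pos = 0.0
    for _ in range(n_samples):
        jitter = VIBRATION_TABLE[int(pos) % len(VIBRATION_TABLE)]
        out.append(base_freq + amplitude * jitter)
        pos += speed
    return out

freqs = oscillated_frequency(base_freq=500.0, n_samples=128, speed=2.0, amplitude=10.0)
print(round(min(freqs)), round(max(freqs)))  # oscillates within 500 ± 10 Hz
```

Reading the table instead of evaluating a function per sample is exactly the computation-saving trade-off the embodiment describes.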
After the frequency oscillation processing, the voice quality change unit 250 proceeds to step S6 and converts each formant amplitude according to the change amount given in the corresponding formant change information. The voice quality change unit 250 then oscillates each converted formant amplitude, performing the amplitude oscillation processing (step S7), and ends the processing. The oscillation table used in the amplitude oscillation processing, and the operation of oscillating each formant amplitude using it, can be explained in much the same way as the oscillation of the formant frequencies described above, so the explanation is omitted here. The formant amplitudes may be oscillated using the same oscillation table as the formant frequencies, or using a different one.
After changing the voice quality of each phoneme according to the obtained voice quality parameter (phoneme data processing information), that is, after processing the phoneme data, the voice quality change unit (generation unit) 250 supplies the basic waveform selection information, the formant frequencies, and the formant amplitudes of each formant to the speech signal generation unit 270.
On receiving the basic-waveform selection information supplied from the voice-quality changing unit 250, the voice signal generating unit 270 acquires from the waveform database 280 the waveform data indicated by that selection information. The basic waveform indicated by the selection information may differ from formant to formant: for example, the basic waveform of a low-frequency formant may be a sine wave, while the basic waveform of a high-frequency formant, which expresses the speaker's individuality, may be a non-sinusoidal waveform (for example a square wave or a sawtooth wave). Of course, a single basic waveform (for example a sine wave) may be used throughout instead of multiple basic waveforms.
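The per-formant waveform choice can be illustrated with a small sketch. The waveform shapes and the selection table below are assumptions for illustration; the patent only states that the basic waveform may differ per formant and names sine, square, and sawtooth as possibilities.

```python
import math

def basic_waveform(kind, n=64):
    """One cycle of a named basic waveform, sampled at n points."""
    if kind == "sine":
        return [math.sin(2 * math.pi * i / n) for i in range(n)]
    if kind == "square":
        return [1.0 if i < n // 2 else -1.0 for i in range(n)]
    if kind == "sawtooth":
        return [2.0 * i / n - 1.0 for i in range(n)]
    raise ValueError(kind)

# Hypothetical selection: a sine for the low formants, a sawtooth (richer in
# harmonics) for a high formant carrying the speaker's individuality.
selection = {"F1": "sine", "F2": "sine", "F3": "sawtooth"}
waves = {name: basic_waveform(kind) for name, kind in selection.items()}
```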
Having selected each piece of waveform data in this manner, the voice signal generating unit (generation unit) 270 generates the formant waveform of each formant from the selected waveform data, the formant frequencies, and the formant amplitudes. The voice signal generating unit (generation unit) 270 then adds the formant waveforms together to generate a synthetic speech signal. The voice signal generating unit 270 then performs processing that gives the generated synthetic speech signal a pitch (hereinafter called pitch-giving processing), the pitch being that indicated by the pitch information (pitch designation information) supplied from the text analysis unit 220.
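The additive step can be sketched compactly. The sample rate, function names, and the use of a plain sine for every formant are assumptions made for brevity; as noted above, the patent allows a different basic waveform per formant.

```python
import math

SR = 8000  # assumed sample rate in Hz

def formant_wave(freq, amp, n_samples, sr=SR):
    """One formant rendered as an amplitude-scaled sine at the formant frequency."""
    return [amp * math.sin(2 * math.pi * freq * i / sr) for i in range(n_samples)]

def synthesize(formants, n_samples):
    """Add the per-formant waveforms sample by sample (the additive operation)."""
    total = [0.0] * n_samples
    for freq, amp in formants:
        wave = formant_wave(freq, amp, n_samples)
        total = [t + w for t, w in zip(total, wave)]
    return total

# (frequency in Hz, amplitude) pairs for three illustrative formants.
signal = synthesize([(500, 1.0), (1500, 0.5), (2500, 0.25)], n_samples=160)
```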
Figure 15 illustrates the pitch-giving processing. For ease of explanation, Figure 15 shows the case in which a pitch is given to a sinusoidal synthetic speech signal.
The voice signal generating unit 270 calculates the period of the temporal envelope tp shown in Figure 15 from the pitch information supplied from the text analysis unit 220. The pitch of the synthetic speech depends on the period of the temporal envelope tp: the longer the period, the lower the pitch, and the shorter the period, the higher the pitch. Having obtained the period of the temporal envelope tp in this manner, the voice signal generating unit 270 repeatedly multiplies the synthetic speech signal by the temporal envelope tp at the obtained period, thereby obtaining a synthetic speech signal that has been given the prescribed pitch.
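The envelope multiplication can be sketched as follows. The envelope shape (a simple linear decay per period) and the numbers are assumptions; the patent specifies only that an envelope is repeated at the pitch period and multiplied into the signal, and that a longer period yields a lower pitch.

```python
def apply_pitch(signal, envelope_period):
    """Multiply the signal by a repeating decaying envelope; the repetition
    period of the envelope sets the perceived pitch (longer period, lower pitch)."""
    out = []
    for i, s in enumerate(signal):
        phase = (i % envelope_period) / envelope_period
        envelope = 1.0 - phase  # assumed linear decay within each pitch period
        out.append(s * envelope)
    return out

# At an assumed 8 kHz sample rate, an 80-sample envelope period
# corresponds to a 100 Hz pitch.
pitched = apply_pitch([1.0] * 160, envelope_period=80)
```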
Figure 16 shows an example of the formant waveform of a particular formant after the voice-quality change processing and the pitch-giving processing have been performed. As shown in Figure 16, the processing related to the voice-quality change (for example the oscillation of the formant frequency and the formant amplitude) can be controlled per frame (in frame units). After obtaining the synthetic speech signal given the prescribed pitch as described above, the voice signal generating unit (generation unit) 270 outputs it to the outside as synthetic speech. The user can thus confirm, as synthetic speech of the desired voice quality, the content of the text (such as "こんにちわ") input to the speech synthesis device 100.
As described above, the speech synthesis device of the present embodiment can perform various kinds of voice-quality change processing on a per-formant basis in the voice-quality changing unit, so even when only one kind of phoneme data is stored (that is, phoneme data of only one specific speaker), speech of various voice qualities can be synthesized.
B. Other
In the embodiment described above, the text information input to the speech synthesis device 100 contains pitch information (see Fig. 2), but the text information may also omit pitch information. For that case, alternative pitch information (see the bracketed part of Fig. 3) may be registered in the phoneme database 240 in advance, and when the text information contains no pitch information, the pitch indicated by this alternative pitch information (for example C, with vibrato) may be used as the pitch of the synthetic speech. Besides the alternative pitch information, the number of pieces of formant information in each frame shown in Fig. 4 (formant-count information; see the bracketed part of Fig. 3) may also be registered in the phoneme database 240 in advance.
The various functions of the speech synthesis device 100 described above are realized by a CPU (or DSP) executing a program stored in a memory such as a ROM. This program may be recorded on a recording medium such as a CD-ROM and distributed, or may be distributed via a communication network such as the Internet.
In the above description, the voice-quality change processing is performed on the basis of voice-quality data obtained from the text information. Alternatively, keywords may be automatically extracted from the input text information, and the extracted keywords may be used to consult a database, set in advance in the electronic device, that associates keywords with voice qualities, thereby automatically determining the voice quality suited to the text information.
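The keyword-based determination can be sketched with a toy lookup. The table entries, keyword strings, and function name are hypothetical; the patent says only that a preset keyword-to-voice-quality database is consulted.

```python
# Hypothetical keyword -> voice-quality table; entries are illustrative only.
KEYWORD_TO_VOICE_QUALITY = {
    "congratulations": "bright",
    "condolence": "soft",
    "warning": "firm",
}

def infer_voice_quality(text, default="neutral"):
    """Pick a voice quality by scanning the text for registered keywords."""
    lowered = text.lower()
    for keyword, quality in KEYWORD_TO_VOICE_QUALITY.items():
        if keyword in lowered:
            return quality
    return default

result = infer_voice_quality("A WARNING was issued this morning")
```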

Claims (7)

1. A speech synthesis device, characterized by comprising:
an acquiring unit that acquires, from text information input to the speech synthesis device, phoneme designation information for designating a phoneme of synthetic speech and voice-quality designation information for designating the voice quality of the synthetic speech;
a first storage unit that stores a plurality of phoneme data each representing a phoneme;
a second storage unit that stores multiple kinds of phoneme data processing information, each being information for changing the voice quality of a phoneme and representing the processing to be applied to the phoneme data;
a first extracting unit that extracts, from the first storage unit, the phoneme data corresponding to the phoneme indicated by the phoneme designation information;
a second extracting unit that extracts, from the second storage unit, the phoneme data processing information corresponding to the voice quality indicated by the voice-quality designation information; and
a generation unit that processes the extracted phoneme data according to the extracted phoneme data processing information and generates the synthetic speech.
2. The speech synthesis device according to claim 1, characterized in that
each phoneme data includes formant information representing the formants of the phoneme,
the phoneme data processing information includes formant modification information representing the changes to be made to the formants, and
the generation unit changes the formant information according to the formant modification information and then adds together the signal waveforms generated from the changed formant information, thereby generating the synthetic speech.
3. The speech synthesis device according to claim 2, characterized in that
the formant information consists of paired formant frequencies and formant amplitudes,
the formant modification information includes formant frequency modification information representing the changes to be made to the formant frequencies and formant amplitude modification information representing the changes to be made to the formant amplitudes, and
the generation unit changes each formant frequency and each formant amplitude of the phoneme represented by the phoneme data according to the formant frequency modification information and the formant amplitude modification information, respectively, thereby obtaining the changed formant information.
4. The speech synthesis device according to claim 2 or 3, characterized in that
the acquiring unit acquires from the text information, in addition to the phoneme designation information and the voice-quality designation information, pitch designation information for designating the pitch of the synthetic speech, and
the generation unit obtains the synthetic speech by giving the pitch indicated by the pitch designation information to a composite signal waveform, the composite signal waveform being obtained by adding together the signal waveforms generated from the changed formant information.
5. The speech synthesis device according to claim 1, characterized in that the text information includes the voice-quality designation information, and the acquiring unit acquires the voice-quality designation information from the text information.
6. The speech synthesis device according to claim 1, characterized in that the acquiring unit extracts keywords from the text information and determines, from the extracted keywords, the voice quality suited to the text information.
7. A speech synthesis method, characterized by comprising the following steps:
an acquiring step of acquiring, from text information input to a speech synthesis device, phoneme designation information for designating a phoneme of synthetic speech and voice-quality designation information for designating the voice quality of the synthetic speech;
a first extraction step of extracting, from a first storage unit that stores a plurality of phoneme data each representing a phoneme, the phoneme data corresponding to the phoneme indicated by the phoneme designation information;
a second extraction step of extracting, from a second storage unit, the phoneme data processing information corresponding to the voice quality indicated by the voice-quality designation information, the second storage unit storing multiple kinds of phoneme data processing information, each being information for changing the voice quality of a phoneme and representing the processing to be applied to the phoneme data; and
a generation step of processing the extracted phoneme data according to the extracted phoneme data processing information and generating the synthetic speech.
CNB2005100074542A 2004-02-20 2005-02-21 Voice operation device, method and recording medium for recording voice operation program Expired - Fee Related CN100337104C (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2004044852A JP2005234337A (en) 2004-02-20 2004-02-20 Device, method, and program for speech synthesis
JP2004044852 2004-02-20

Publications (2)

Publication Number Publication Date
CN1658281A true CN1658281A (en) 2005-08-24
CN100337104C CN100337104C (en) 2007-09-12

Family

ID=35007713

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005100074542A Expired - Fee Related CN100337104C (en) 2004-02-20 2005-02-21 Voice operation device, method and recording medium for recording voice operation program

Country Status (4)

Country Link
JP (1) JP2005234337A (en)
KR (1) KR100759172B1 (en)
CN (1) CN100337104C (en)
TW (1) TW200535235A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111542875A (en) * 2018-01-11 2020-08-14 雅马哈株式会社 Speech synthesis method, speech synthesis device, and program

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5510852B2 (en) * 2010-07-20 2014-06-04 独立行政法人産業技術総合研究所 Singing voice synthesis system reflecting voice color change and singing voice synthesis method reflecting voice color change
JP7309155B2 (en) * 2019-01-10 2023-07-18 グリー株式会社 Computer program, server device, terminal device and audio signal processing method

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3633963B2 (en) * 1994-09-14 2005-03-30 株式会社河合楽器製作所 Musical sound generating apparatus and musical sound generating method
DE69617480T2 (en) * 1995-01-13 2002-10-24 Yamaha Corp Device for processing a digital sound signal
JPH1078952A (en) * 1996-07-29 1998-03-24 Internatl Business Mach Corp <Ibm> Voice synthesizing method and device therefor and hypertext control method and controller
CN1113330C (en) * 1997-08-15 2003-07-02 英业达股份有限公司 Phoneme regulating method for phoneme synthesis
US6823309B1 (en) * 1999-03-25 2004-11-23 Matsushita Electric Industrial Co., Ltd. Speech synthesizing system and method for modifying prosody based on match to database
JP2000339137A (en) * 1999-05-31 2000-12-08 Sanyo Electric Co Ltd Electronic mail receiving system
JP2002268699A (en) * 2001-03-09 2002-09-20 Sony Corp Device and method for voice synthesis, program, and recording medium
JP3732793B2 (en) * 2001-03-26 2006-01-11 株式会社東芝 Speech synthesis method, speech synthesis apparatus, and recording medium
JP2003031936A (en) * 2001-07-19 2003-01-31 Murata Mach Ltd Printed board
JP2003295882A (en) * 2002-04-02 2003-10-15 Canon Inc Text structure for speech synthesis, speech synthesizing method, speech synthesizer and computer program therefor
EP1630791A4 (en) * 2003-06-05 2008-05-28 Kenwood Corp Speech synthesis device, speech synthesis method, and program
KR20050041749A (en) * 2003-10-31 2005-05-04 한국전자통신연구원 Voice synthesis apparatus depending on domain and speaker by using broadcasting voice data, method for forming voice synthesis database and voice synthesis service system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111542875A (en) * 2018-01-11 2020-08-14 雅马哈株式会社 Speech synthesis method, speech synthesis device, and program
CN111542875B (en) * 2018-01-11 2023-08-11 雅马哈株式会社 Voice synthesis method, voice synthesis device and storage medium

Also Published As

Publication number Publication date
JP2005234337A (en) 2005-09-02
TWI300551B (en) 2008-09-01
KR20060043023A (en) 2006-05-15
KR100759172B1 (en) 2007-09-14
CN100337104C (en) 2007-09-12
TW200535235A (en) 2005-11-01

Similar Documents

Publication Publication Date Title
CN1108603C (en) Voice synthesis method and device, and computer ready-read medium with recoding voice synthesizing program
CN1169115C (en) Prosodic databases holding fundamental frequency templates for use in speech synthesis
CN101872615B (en) System and method for distributed text-to-speech synthesis and intelligibility
CN1117344C (en) Voice synthetic method and device, dictionary constructional method and computer ready-read medium
US20080126093A1 (en) Method, Apparatus and Computer Program Product for Providing a Language Based Interactive Multimedia System
JP6806662B2 (en) Speech synthesis system, statistical model generator, speech synthesizer, speech synthesis method
CN100337104C (en) Voice operation device, method and recording medium for recording voice operation program
CN1801321A (en) System and method for text-to-speech
CN1201284C (en) Rapid decoding method for voice identifying system
CN1108572C (en) Mechanical Chinese to japanese two-way translating machine
CN1811912A (en) Minor sound base phonetic synthesis method
CN1032391C (en) Chinese character-phonetics transfer method and system edited based on waveform
CN1661673A (en) Speech synthesizer,method and recording medium for speech recording synthetic program
CN1787072A (en) Method for synthesizing pronunciation based on rhythm model and parameter selecting voice
CN1664922A (en) Pitch model production device, method and pitch model production program
CN1455388A (en) Voice identifying system and compression method of characteristic vector set for voice identifying system
CN1190773C (en) Voice identifying system and compression method of characteristic vector set for voice identifying system
CN1251175C (en) An audio synthesis method
CN1238805C (en) Method and apparatus for compressing voice library
CN1607576A (en) A speech recognition system
CN1979636B (en) Method for converting phonetic symbol to speech
CN1647152A (en) Method for synthesizing speech
KR20060056406A (en) Improvements to an utterance waveform corpus
CN1828723A (en) Dispersion type language processing system and its method for outputting agency information
CN1259648C (en) Phonetic recognition system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20070912

Termination date: 20150221

EXPY Termination of patent right or utility model