EP1093111B1 - Amplitude control for speech synthesis - Google Patents

Amplitude control for speech synthesis

Info

Publication number
EP1093111B1
EP1093111B1 (application EP00121304A)
Authority
EP
European Patent Office
Prior art keywords
frame
speech
signal
power
phoneme
Prior art date
Legal status
Expired - Lifetime
Application number
EP00121304A
Other languages
German (de)
French (fr)
Other versions
EP1093111A3 (en)
EP1093111A2 (en)
Inventor
Katsumi Amano (Pioneer Corporation)
Shisei Cho (Pioneer Corporation)
Soichi Toyama (Pioneer Corporation)
Hiroyuki Ishihara (Pioneer Corporation)
Current Assignee
Pioneer Corp
Original Assignee
Pioneer Corp
Priority date
Filing date
Publication date
Application filed by Pioneer Corp filed Critical Pioneer Corp
Publication of EP1093111A2 publication Critical patent/EP1093111A2/en
Publication of EP1093111A3 publication Critical patent/EP1093111A3/en
Application granted granted Critical
Publication of EP1093111B1 publication Critical patent/EP1093111B1/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/06 Elementary speech units used in speech synthesisers; Concatenation rules
    • G10L13/07 Concatenation rules


Description

1. FIELD OF THE INVENTION
The present invention relates to a speech synthesis method for artificially generating speech waveform signals.
2. BACKGROUND OF THE RELATED ART
Speech waveforms of natural speech can be expressed by connecting basic units, each made by continuously connecting phonemes of one vowel (V) and one consonant (C) in a form such as "CV", "CVC" or "VCV".
Accordingly, a conversation can be created by means of synthetic speech by processing and registering such phonemes as data (phoneme data) in advance, reading out phoneme data corresponding to a conversation from the registered phoneme data in sequence, and generating sounds corresponding to respective read-out phoneme data.
To create a database based on the above-mentioned phoneme data, firstly, a given document is read by a person, and his/her speech is recorded. Then, speech signals reproduced from the recorded speech are divided into the above-mentioned phonemes. Various data indicative of these phonemes are registered as phoneme data. Then, in order to synthesize speech, the respective phoneme data are connected and output as continuous speech.
However, the connected phonemes are segmented from separately recorded speech. Hence, irregularities exist in the vocal power with which the phonemes are uttered. Therefore, a problem arises in that the synthesized speech is unnatural when the uttered phonemes are merely connected together.
To solve this problem, EP-A-0 427 485 discloses a method and an apparatus for synthesizing speech in which the amplitudes of the speech waveform signal are adjusted by normalizing the phonemes (i.e. VCV segments) so that the powers at both ends of each phoneme coincide with the average power of each vowel, thereby connecting the phonemes together smoothly.
SUMMARY OF THE INVENTION
It is an object of the present invention to provide an improved method for synthesizing speech in order to generate natural sounding synthetic speech.
This object is achieved by a method for synthesizing speech comprising the steps defined in claim 1. In particular, the method for synthesizing speech comprises the steps of: generating an excitation signal corresponding to an input text signal; filtering the excitation signal with a linear predictive coefficient calculated from respective phonemes in a phoneme series to generate a speech waveform signal; adjusting the amplitude of the speech waveform signal to a level based on a speech envelope signal to generate an amplitude adjustment waveform signal; and generating an acoustic output corresponding to the amplitude adjustment waveform signal. The speech envelope signal is obtained by the steps of: dividing each of the phonemes in the phoneme series into a plurality of frames having a predetermined time length; summing squares of speech samples in the frame as a frame power value, for each frame of the plurality of frames; obtaining a standardized frame power as a function of the frame power value and the head and tail frame power values of the phoneme, for each frame of the plurality of frames; summing squares of sample values in a frame of the excitation signal to obtain a frame power correction value; providing power frequency characteristics based on the linear predictive coefficient, for each frame of said plurality of frames; calculating an average value of power values sampled from the power frequency characteristics at a predetermined frequency interval as a mean frame power value, for each frame of said plurality of frames; and providing the speech envelope signal as a function of the standardized frame power value, the frame power correction value, and the mean frame power value.
As described above, the levels of the head and tail portions of the respective phonemes are always maintained at predetermined levels without substantially deforming the synthesized speech waveform. Therefore, the phonemes are connected together smoothly so that natural sounding synthesized speech can be generated.
BRIEF DESCRIPTION OF THE DRAWINGS
The aforementioned aspects and other features of the invention are explained in the following description, taken in connection with the accompanying drawing figures wherein:
  • Fig. 1 is a block diagram showing a speech synthesis apparatus according to the present invention,
  • Fig. 2 is a block diagram showing an apparatus for generating phoneme data and speech synthesis parameters,
  • Fig. 3 is a flow chart showing steps for generating phoneme data,
  • Fig. 4 is a view showing a memory map in a memory 33,
  • Fig. 5 is a flow chart showing steps for calculating speech synthesis parameters, and
  • Fig. 6 is a view showing a speech synthesis control routine based on a speech synthesis method of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
    Fig. 1 is a block diagram showing a text speech synthesis device for reading a given document (text) by synthesizing the speech by means of a method according to the present invention.
    In Fig. 1, a text analyzing circuit 21 generates intermediate language character string information, including information such as accents and phrases peculiar to the respective language, from a character string based on the input text signals. The text analyzing circuit 21 then supplies intermediate language character string signals CL corresponding to the above information to a speech synthesis control circuit 22.
    A phoneme data memory 20, a RAM (Random Access Memory) 27, and a ROM (Read Only Memory) 28 are connected to the speech synthesis control circuit 22.
    The phoneme data memory 20 stores phoneme data corresponding to various phonemes which have been sampled from actual human voice, and speech synthesizing parameters (standardized frame power values and mean frame power values) used for the speech synthesis.
    A sound source module 23 is provided with a pulse generator 231 for generating impulse signals having a frequency corresponding to a pitch frequency designating signal K supplied from the speech synthesis control circuit 22, and a noise generator 232 for generating noise signals carrying an unvoiced sound. The sound source module 23 alternatively selects the impulse signal and the noise signal in response to a sound source selection signal SV supplied from the speech synthesis control circuit 22. The sound source module 23 then supplies the selected signal as a frequency signal Q to a vocal tract filter 24.
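    For illustration, the behaviour of the sound source module can be sketched as follows in Python; the sampling rate, frame length, and noise amplitude are assumptions made for this sketch, not values given in the patent.

```python
import numpy as np

def excitation_frame(voiced, pitch_hz, fs=8000, frame_len=80, rng=None):
    """One frame of the excitation signal Q: an impulse train at the pitch
    frequency designated by K for voiced sounds (pulse generator 231), or
    white noise carrying an unvoiced sound (noise generator 232). The
    branch mirrors the sound source selection signal SV."""
    rng = rng if rng is not None else np.random.default_rng(0)
    if voiced:
        q = np.zeros(frame_len)
        period = max(1, int(round(fs / pitch_hz)))  # samples per pitch period
        q[::period] = 1.0                           # impulse train
    else:
        q = 0.1 * rng.standard_normal(frame_len)    # noise signal (assumed level)
    return q
```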
    The vocal tract filter 24 may include a FIR (Finite Impulse Response) digital filter, for example. The vocal tract filter 24 filters a frequency signal Q supplied from the sound source module 23 with a filtering coefficient corresponding to a linear predictive code signal LP supplied from the speech synthesis control circuit 22, thereby generating a speech waveform signal VF.
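    The filtering step might be sketched as below. Note that the patent names a FIR filter only as one example; this sketch uses the more common all-pole LPC synthesis form 1/A(z), which is an assumption about the filter structure.

```python
import numpy as np
from scipy.signal import lfilter

def vocal_tract_filter(q, lpc):
    """Filter an excitation frame q into a speech waveform signal VF.
    `lpc` holds a1..a15 of A(z) = 1 + a1 z^-1 + ... + a15 z^-15, as
    supplied by the linear predictive code signal LP."""
    return lfilter([1.0], np.concatenate(([1.0], np.asarray(lpc))), q)
```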
    An amplitude adjustment circuit 25 generates an amplitude adjustment waveform signal VAUD by adjusting the amplitude of a speech waveform signal VF to a level based on a speech envelope signal Vm supplied from the speech synthesis control circuit 22. The amplitude adjustment circuit 25 then supplies the amplitude adjustment waveform signal VAUD to a speaker 26. The speaker 26 generates an acoustic output corresponding to the amplitude adjustment waveform signal VAUD. That is, the speaker 26 produces the read-out speech based on the input text signals, as explained hereinafter.
    A method will be described hereinafter for generating the above-mentioned phoneme data and speech synthesis parameters stored in the phoneme data memory 20.
    Fig. 2 is a block diagram showing an apparatus for generating speech synthesis parameters.
    In Fig. 2, a speech recorder 32 records human speech received by a microphone 31. The speech recorder 32 supplies speech signals reproduced from the recorded speech to a phoneme data generating device 30.
    The phoneme data generating device 30 sequentially samples the speech signal supplied from the speech recorder 32 to generate speech samples. The phoneme data generating device 30 then stores the samples in a predetermined domain of a memory 33. The phoneme data generating device 30 then executes steps for generating phonemes, as shown in Fig. 3.
    In Fig. 3, the phoneme data generating device 30 reads out speech samples stored in the memory 33 in sequence. The phoneme data generating device 30 then divides the series of speech samples into phonemes such as "VCV" (step S1).
    For example, the Japanese spoken phrase "mokutekichi ni" is segmented to mo/oku/ute/eki/iti/ini/i. The Japanese spoken phrase "moyosimono" is segmented to mo/oyo/osi/imo/ono/o. The Japanese spoken phrase "moyorino" is segmented to mo/oyo/ori/ino/o. The Japanese spoken phrase "mokuhyono" is segmented to mo/oku/uhyo/ono/o.
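    These examples follow a regular pattern that a small sketch can reproduce; it is a simplification that ignores moraic nasals, long vowels, and geminates, which the patent does not detail.

```python
def vcv_segments(morae):
    """Split (consonant, vowel) morae into VCV units: a head CV segment,
    then previous-vowel + consonant + vowel across each mora boundary,
    then the trailing vowel.
    >>> vcv_segments([("m", "o"), ("k", "u"), ("t", "e"), ("k", "i")])
    ['mo', 'oku', 'ute', 'eki', 'i']
    """
    c0, v0 = morae[0]
    segments = [c0 + v0]                  # head CV unit
    for c, v in morae[1:]:
        segments.append(v0 + c + v)       # VCV unit spanning a mora boundary
        v0 = v
    segments.append(v0)                   # trailing vowel unit
    return segments
```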
    Subsequently, the phoneme data generating device 30 divides each segmented phoneme into frames of a predetermined length, for example, 10 ms (step S2). Control information including the name of the phoneme to which each frame belongs, the frame length of the phoneme, and the frame number is added to each divided frame. The frame is then stored in a given domain of the memory 33 (step S3). Then, the phoneme data generating device 30 performs linear predictive coding (LPC) analysis on every frame of the waveform of each phoneme to generate a linear predictive coding coefficient (hereinafter called "LPC coefficient") of order 15. The resultant coefficient is stored in a memory domain 1 of the memory 33 as shown in Fig. 4 (step S4). It should be noted that the LPC coefficient resulting from step S4 is a so-called speech spectral envelope parameter corresponding to a filter coefficient of the vocal tract filter 24. Subsequently, the phoneme data generating device 30 reads out the LPC coefficient in the memory domain 1 of the memory 33, and supplies the LPC coefficient as the phoneme data (step S5). This phoneme data is stored in the phoneme data memory 20.
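    Steps S2 and S4 can be sketched as follows; the autocorrelation method with the Levinson-Durbin recursion is a standard way to obtain LPC coefficients, but the patent does not specify the analysis method, and the 8 kHz sampling rate is an assumption.

```python
import numpy as np

def split_frames(samples, fs=8000, frame_ms=10):
    """Step S2: divide a phoneme into frames of a predetermined length (10 ms)."""
    samples = np.asarray(samples, dtype=float)
    flen = int(fs * frame_ms / 1000)              # 80 samples at the assumed 8 kHz
    nframes = len(samples) // flen
    return samples[:nframes * flen].reshape(nframes, flen)

def lpc_coefficients(frame, order=15):
    """Step S4: order-15 LPC analysis of one frame (autocorrelation method,
    Levinson-Durbin recursion); assumes a non-silent frame (r[0] > 0).
    Returns a1..a15 of A(z) = 1 + a1 z^-1 + ... + a15 z^-15."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    r = np.array([np.dot(frame[: n - i], frame[i:]) for i in range(order + 1)])
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                            # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]       # symmetric coefficient update
        a[i] = k
        err *= 1.0 - k * k                        # prediction error update
    return a[1:]
```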
    Then, the phoneme data generating device 30 calculates speech synthesis parameters as shown in Fig. 5 on respective phonemes stored in the memory 33.
    In Fig. 5, the phoneme data generating device 30 calculates the sum of the squares of all speech sample values in each frame of the one phoneme that is subject to processing (hereinafter called "subject phoneme") in order to generate the speech power of that frame. Then, as shown in Fig. 4, the speech power is stored in a memory domain 2 of the memory 33 as a frame power Pc (step S12).
    Subsequently, the phoneme data generating device 30 stores "0" indicative of the head frame number in a built-in register n (not shown) (step S13). Then, the phoneme data generating device 30 generates the relative position in the subject phoneme of the frame n indicated by the frame number stored in the built-in register n (step S14). The relative position is expressed by the following formula: r = (n - 1)/N wherein,
  • r: relative position, and
  • N: the number of all frames in the subject phoneme.
    Then, the phoneme data generating device 30 reads out the frame power Pc of the frame n from the memory domain 2 of the memory 33 shown in Fig. 4 (step S15). The phoneme data generating device 30 reads out the frame powers corresponding to the head and tail frames of the subject phoneme as the head and tail frame powers Pa and Pb, respectively, among the frame powers Pc in the memory domain 2 (step S16).
    Then, the phoneme data generating device 30 generates a standardized frame power Pn in the frame n indicated by a built-in register n, by executing the following calculation (1) using the head and tail frame powers Pa, Pb, the frame power Pc obtained in step S15 and the relative position r. Pn = Pc / [(1 - r) Pa + r Pb] Then, the phoneme data generating device 30 stores the standardized frame power Pn in a memory domain 3 of the memory 33 (step S17).
    That is, the phoneme data generating device 30 generates the frame power value in the frame n when the frame powers in the head and tail frames of this subject phoneme are set to "1".
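    Steps S12 to S17 then amount to the following per-phoneme computation, following the text's formula and frame-numbering convention directly:

```python
import numpy as np

def standardized_frame_powers(frames):
    """Steps S12-S17: frame power Pc (sum of squared samples) and the
    standardized frame power Pn = Pc / [(1 - r) Pa + r Pb], with the
    relative position r = (n - 1) / N as given in the text."""
    frames = np.asarray(frames, dtype=float)
    pc = np.sum(frames ** 2, axis=1)      # frame power Pc, step S12
    N = len(frames)
    pa, pb = pc[0], pc[-1]                # head and tail frame powers, step S16
    r = (np.arange(N) - 1) / N            # relative position, step S14
    return pc / ((1 - r) * pa + r * pb)   # equation (1), step S17
```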
    Then, the phoneme data generating device 30 reads out the LPC coefficient corresponding to the frame n indicated by the built-in register n from the memory domain 1 of the memory 33 shown in Fig. 4. The phoneme data generating device 30 then generates power frequency characteristics in the frame n based on the LPC coefficient (step S18). Thereafter, the phoneme data generating device 30 samples a power value from the power frequency characteristics every predetermined frequency interval, and then stores the average value of these power values as a mean frame power Gf in a memory domain 4 of the memory 33 shown in Fig. 4 (step S19).
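    Steps S18 and S19 can be sketched using the power spectrum of the all-pole model derived from the LPC coefficient; the number of sampling points is an assumption, since the text only specifies "a predetermined frequency interval".

```python
import numpy as np
from scipy.signal import freqz

def mean_frame_power(lpc, n_points=64):
    """Steps S18-S19: power frequency characteristics of 1/A(z) from the
    frame's LPC coefficient, sampled at a fixed frequency interval and
    averaged into the mean frame power Gf."""
    _, h = freqz([1.0], np.concatenate(([1.0], np.asarray(lpc))), worN=n_points)
    return float(np.mean(np.abs(h) ** 2))   # average of the sampled power values
```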
    Then, the phoneme data generating device 30 adds "1" to the frame number n stored in the built-in register n to generate a new frame number n, the new frame number n replacing the previous frame number n, and stores the new frame number n in the built-in register n by substitution (step S20). Subsequently, the phoneme data generating device 30 determines whether the frame number stored in the built-in register n equals (N-1) (step S21).
    In step S21, if the frame number stored in the built-in register n does not equal (N-1), the phoneme data generating device 30 returns to the step S14, and repeats the above-mentioned operation. Such an operation stores the standardized frame power Pn and the mean frame power Gf corresponding to each of the head frame to (N-1)th frames of a subject phoneme in the memory domains 3 and 4, as shown in Fig. 4.
    In the step S21, if the frame number stored in the built-in register n equals (N-1), the phoneme data generating device 30 respectively reads out the standardized frame power Pn and the mean frame power Gf stored in the memory domains 3 and 4 of the memory 33 shown in Fig. 4, and outputs the standardized frame power Pn and the mean frame power Gf (step S23). The standardized frame power Pn and the mean frame power Gf are stored in the phoneme data memory 20 as speech synthesis parameters.
    That is, the respective phoneme data obtained by the procedure shown in Fig. 3 is associated with the standardized frame power Pn and the mean frame power Gf obtained by the procedure shown in Fig. 5 to store the resultant data in the phoneme data memory 20.
    The speech synthesis control circuit 22 shown in Fig. 1 receives the phoneme data and speech synthesis parameters corresponding to the intermediate language character string signals CL from the text analyzing circuit 21, by using software stored in the ROM 28. The speech synthesis control circuit 22 then controls the speech synthesis as explained hereinafter.
    The speech synthesis control circuit 22 divides segments of the intermediate language character string signals CL into phonemes consisting of "VCV", and then receives the phoneme data corresponding to the respective phonemes from the phoneme data memory 20 sequentially. The speech synthesis control circuit 22 then supplies a pitch frequency designation signal K for designating the pitch frequency to the sound source module 23. Then, the speech synthesis control circuit 22 synthesizes speech from the respective phoneme data in the order in which they are read from the phoneme data memory 20.
    Fig. 6 shows a speech synthesizing control procedure.
    In Fig. 6, the speech synthesis control circuit 22 selects the data for the one phoneme subject to processing (hereinafter called "subject phoneme data") in the received order as mentioned above. The speech synthesis control circuit 22 then stores "0", indicative of the head frame number in the phoneme data, in the built-in register n (not shown) (step S101). Subsequently, the speech synthesis control circuit 22 supplies a sound source selection signal SV to the sound source module 23 (step S102). The sound source selection signal SV indicates whether the phoneme corresponding to the above-mentioned subject phoneme data is a voiced sound or an unvoiced sound. Depending on the sound source selection signal SV, the sound source module 23 generates as a frequency signal Q either a noise signal or an impulse signal having the frequency designated by the pitch frequency designation signal K.
    Subsequently, the speech synthesis control circuit 22 samples the frequency signal Q supplied from the sound source module 23 at every predetermined interval. The control circuit 22 then calculates the sum of squares of the respective sample values in a frame to generate a frame power correction value Gs. Then, the speech synthesis control circuit 22 stores the frame power correction value Gs in a built-in register G (not shown) (step S103). Then, the speech synthesis control circuit 22 supplies the LPC coefficient to the vocal tract filter 24 as the linear predictive coding signal LP (step S104). It is noted that the LPC coefficient corresponds to the frame n indicated by the built-in register n in the subject phoneme data. Then, the speech synthesis control circuit 22 reads out the standardized frame power Pn and the mean frame power Gf corresponding to the frame n indicated by the above-mentioned built-in register n in the subject phoneme data from the phoneme data memory 20 (step S105). Thereafter, the speech synthesis control circuit 22 calculates a speech envelope signal Vm by the following computation with the standardized frame power Pn, the mean frame power Gf, and the frame power correction value Gs stored in the built-in register G. The speech synthesis control circuit 22 then supplies the speech envelope signal Vm to an amplitude adjustment circuit 25 (step S106): Vm = Pn / (Gs · Gf)
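    Putting steps S103 to S106 together with the earlier sketches gives the following per-frame loop body; whether the envelope Vm scales the waveform amplitude directly or, for example, through a square root is not specified in the text, so direct scaling is assumed here.

```python
import numpy as np

def synthesize_frame(q, lpc, pn):
    """One pass of the Fig. 6 loop (steps S103-S106), reusing the sketches
    above: Gs from the excitation, VF from the vocal tract filter, Gf from
    the LPC spectrum, then Vm = Pn / (Gs * Gf) applied as the amplitude."""
    gs = np.sum(np.asarray(q, dtype=float) ** 2)  # frame power correction Gs (S103)
    vf = vocal_tract_filter(q, lpc)               # speech waveform signal VF (S104)
    gf = mean_frame_power(lpc)                    # mean frame power Gf (S105)
    vm = pn / (gs * gf)                           # speech envelope signal Vm (S106)
    return vm * vf                                # amplitude adjustment waveform VAUD
```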
    By means of the step S106, the amplitude adjustment circuit 25 adjusts the amplitude of the speech waveform signal VF supplied from the vocal tract filter 24 to a level corresponding to the above-mentioned speech envelope signal Vm. Since the connecting portions of respective phonemes are always maintained at a predetermined level through this amplitude adjustment, the connection of phonemes becomes smooth and hence, natural sounding synthesized speech is produced.
    Subsequently, the speech synthesis control circuit 22 determines whether the frame number n stored in the built-in register n is smaller than the total number of frames in the subject phoneme data N by 1, that is, whether the frame number n equals (N - 1) (step S107). In the step S107, if it is determined that n does not equal (N-1), the speech synthesis control circuit 22 adds "1" to the frame number stored in the built-in register n, and stores this value as a new frame number in the built-in register n by substitution (step S108). After the step S108, the speech synthesis control circuit 22 returns to the step S103, and then repeats the above-mentioned operation.
    On the other hand, in the step S107, if it is determined that the frame number n stored in the built-in register n equals (N-1), the speech synthesis control circuit 22 returns to the step S101, and repeats the phonemic synthesis process for the next phoneme data in the same manner.
    The present invention has been explained heretofore in conjunction with the preferred embodiment. However, it should be understood that those skilled in the art could easily conceive various modifications falling within the scope of the appended claims.

    Claims (2)

    1. A method for synthesizing speech, comprising the steps of:
      generating an excitation signal (Q) corresponding to an input text signal;
      filtering said excitation signal (Q) with a linear predictive coefficient (LPC) calculated from respective phonemes in a phoneme series to generate a speech waveform signal (VF);
      adjusting the amplitude of said speech waveform signal (VF) to a level based on a speech envelope signal (Vm) to generate an amplitude adjustment waveform signal (VAUD); and
      generating an acoustic output corresponding to said amplitude adjustment waveform signal (VAUD), wherein
      said speech envelope signal (Vm) is obtained by the steps of:
      dividing each of said phonemes in a phoneme series into a plurality of frames N having a predetermined time length;
      summing squares of speech samples in a frame as a frame power value (Pc), for each frame of said plurality of frames;
      obtaining a standardized frame power (Pn), for each frame n of said plurality of frames, as a function expressed as Pn = Pc / [(1-r)Pa + rPb] wherein Pc is said frame power value, Pa and Pb are said head and tail frame power values of the subject phoneme, and r is the relative position r=(n-1)/N;
      summing squares of sample values in a frame of said excitation signal (Q) to obtain a frame power correction value (Gs);
      providing power frequency characteristics based on said linear predictive coefficient (LPC), for each frame of said plurality of frames;
      calculating an average value of power values sampled from said power frequency characteristics at a predetermined frequency interval as a mean frame power value (Gf), for each frame of said plurality of frames; and
      providing said speech envelope signal (Vm) as a function expressed as Vm = Pn / (Gs · Gf)
         wherein Pn is said standardized frame power value, Gs is said frame power correction value, and Gf is said mean frame power value.
    2. The method according to claim 1, wherein said excitation signal (Q) includes an impulse signal carrying a voiced sound and a noise signal carrying an unvoiced sound.
    EP00121304A 1999-10-15 2000-10-06 Amplitude control for speech synthesis Expired - Lifetime EP1093111B1 (en)

    Applications Claiming Priority (2)

    Application Number Priority Date Filing Date Title
    JP29435799 1999-10-15
    JP29435799A JP2001117576A (en) 1999-10-15 1999-10-15 Voice synthesizing method

    Publications (3)

    Publication Number Publication Date
    EP1093111A2 EP1093111A2 (en) 2001-04-18
    EP1093111A3 EP1093111A3 (en) 2002-09-04
    EP1093111B1 true EP1093111B1 (en) 2005-12-28

    Family

    ID=17806674

    Family Applications (1)

    Application Number Title Priority Date Filing Date
    EP00121304A Expired - Lifetime EP1093111B1 (en) 1999-10-15 2000-10-06 Amplitude control for speech synthesis

    Country Status (4)

    Country Link
    US (1) US7130799B1 (en)
    EP (1) EP1093111B1 (en)
    JP (1) JP2001117576A (en)
    DE (1) DE60025120T2 (en)

    Families Citing this family (4)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    JP3728173B2 (en) * 2000-03-31 2005-12-21 キヤノン株式会社 Speech synthesis method, apparatus and storage medium
    US7860256B1 (en) * 2004-04-09 2010-12-28 Apple Inc. Artificial-reverberation generating device
    JP4209461B1 (en) * 2008-07-11 2009-01-14 株式会社オトデザイナーズ Synthetic speech creation method and apparatus
    JP6047922B2 (en) * 2011-06-01 2016-12-21 ヤマハ株式会社 Speech synthesis apparatus and speech synthesis method

    Family Cites Families (4)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    DE69028072T2 (en) * 1989-11-06 1997-01-09 Canon Kk Method and device for speech synthesis
    KR19980702608A (en) * 1995-03-07 1998-08-05 Evershed Michael Speech synthesizer
    AU6044298A (en) * 1997-01-27 1998-08-26 Entropic Research Laboratory, Inc. Voice conversion system and methodology
    JP3361066B2 (en) * 1998-11-30 2003-01-07 松下電器産業株式会社 Voice synthesis method and apparatus

    Also Published As

    Publication number Publication date
    EP1093111A3 (en) 2002-09-04
    DE60025120D1 (en) 2006-02-02
    EP1093111A2 (en) 2001-04-18
    JP2001117576A (en) 2001-04-27
    US7130799B1 (en) 2006-10-31
    DE60025120T2 (en) 2006-09-14


    Legal Events

    Date Code Title Description
    PUAI Public reference made under article 153(3) EPC to a published international application that has entered the european phase (original code: 0009012)
    AK Designated contracting states (kind code of ref document: A2): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE
    AX Request for extension of the european patent: AL; LT; LV; MK; RO; SI
    PUAL Search report despatched (original code: 0009013)
    AK Designated contracting states (kind code of ref document: A3): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE
    AX Request for extension of the european patent: AL; LT; LV; MK; RO; SI
    20021016 17P Request for examination filed
    20030313 17Q First examination report despatched
    AKX Designation fees paid: DE FR GB
    GRAP Despatch of communication of intention to grant a patent (original code: EPIDOSNIGR1)
    GRAS Grant fee paid (original code: EPIDOSNIGR3)
    GRAA (expected) grant (original code: 0009210)
    AK Designated contracting states (kind code of ref document: B1): DE FR GB
    REG Reference to a national code: GB, legal event code FG4D
    20060202 REF Corresponds to: ref document number 60025120, country DE, kind code P
    ET Fr: translation filed
    20060901 REG Reference to a national code: GB, legal event code 746
    PLBE No opposition filed within time limit (original code: 0009261)
    STAA Status: no opposition filed within time limit
    20060929 26N No opposition filed
    20071030 PGFP Annual fee paid to national office: DE, year of fee payment 8
    20071018 PGFP Annual fee paid to national office: GB, year of fee payment 8
    20071031 PGFP Annual fee paid to national office: FR, year of fee payment 8
    20081006 GBPC Gb: european patent ceased through non-payment of renewal fee
    20090630 REG Reference to a national code: FR, legal event code ST
    20090501 PG25 Lapsed in a contracting state: DE, lapse because of non-payment of due fees
    20081031 PG25 Lapsed in a contracting state: FR, lapse because of non-payment of due fees
    20081006 PG25 Lapsed in a contracting state: GB, lapse because of non-payment of due fees