WO2006129814A1 - Speech synthesis method and apparatus - Google Patents

Speech synthesis method and apparatus

Info

Publication number: WO2006129814A1
Application number: PCT/JP2006/311139
Authority: WIPO (PCT)
Prior art keywords: segment, prosodic, modification, conducted, prosodic modification
Other languages: French (fr)
Inventors: Masayuki Yamada, Yasuo Okutani, Michio Aizawa
Original assignee: Canon Kabushiki Kaisha
Application filed by Canon Kabushiki Kaisha
Priority date: 2005-05-31
Filing date: 2006-05-29
Publication date: 2006-12-07

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/06: Elementary speech units used in speech synthesisers; Concatenation rules
    • G10L13/07: Concatenation rules

Abstract

A speech synthesis method includes selecting a segment, determining whether to conduct prosodic modification on the selected segment, calculating a target value of prosodic modification of a segment on which prosodic modification has been determined to be conducted based on a result of the determination, conducting prosodic modification such that a prosody of the segment on which prosodic modification has been determined to be conducted takes the target value of prosodic modification, and concatenating the segment on which prosodic modification has been conducted or a segment on which prosodic modification has been determined not to be conducted as a result of the determination.

Description

DESCRIPTION
SPEECH SYNTHESIS METHOD AND APPARATUS
TECHNICAL FIELD
The present invention relates to a speech synthesis method for synthesizing desired speech.
BACKGROUND ART
A speech synthesis technology for synthesizing desired speech is known. Speech synthesis is realized by concatenating speech segments corresponding to the desired speech content and adjusting them so as to achieve the desired prosody. One of the typical speech synthesis technologies is based on the speech source-vocal tract model. In this model, a speech segment is a vocal tract parameter sequence. Using these vocal tract parameters, a filtering process is conducted on a pulse sequence simulating vocal cord vibration, or on noise simulating the noise caused by exhalation, thus obtaining synthesized speech.
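As an illustration of this source-filter scheme, the following sketch filters a pulse sequence through an all-pole filter. The sampling rate, pitch, and filter coefficients are illustrative assumptions and are not taken from this disclosure.

```python
# Minimal source-filter synthesis sketch (speech source-vocal tract model).
# Sampling rate, pitch, and filter coefficients are illustrative assumptions.
import numpy as np
from scipy.signal import lfilter

fs = 16000                       # sampling rate (Hz), assumed
f0 = 100.0                       # rate of simulated vocal cord vibration (Hz)
n = fs // 2                      # half a second of samples

# Voiced excitation: a pulse sequence simulating vocal cord vibration.
excitation = np.zeros(n)
excitation[::int(fs / f0)] = 1.0
# For unvoiced sounds, noise simulating exhalation would be used instead:
# excitation = 0.01 * np.random.randn(n)

# Stand-in for one frame of the "vocal tract parameter sequence":
# an all-pole filter with assumed (stable) coefficients.
a = [1.0, -1.3, 0.8]
synthesized = lfilter([1.0], a, excitation)
```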
More recently, a speech synthesis technology referred to as corpus-based speech synthesis has become widely used (for example, refer to Segi, Takagi, "Segmental Selection from Broadcast News Recordings for a High Quality Concatinative Speech Synthesis", Technical Report of IEICE, SP2003-35, pp. 1-6, June 2003). In this technology, various variations of speech are pre-recorded, and only the concatenation of segments is conducted in the synthesis. In corpus-based speech synthesis, prosody adjustment is conducted by selecting a segment with the desired prosody from a large number of segments as appropriate.
Generally, corpus-based speech synthesis can produce more natural, higher-quality synthesized speech than speech synthesis based on the speech source-vocal tract model. This is said to be because corpus-based speech synthesis includes no speech-transforming process, such as modeling or signal processing, that would degrade the speech.
However, in the case where a segment with the desired prosody is not found, the quality of the synthesized speech produced by the corpus-based method described above is degraded. In particular, a prosodic gap is generated between the segment not having the desired prosody and the adjoining segments, causing a severe loss of naturalness in the synthesized speech.
DISCLOSURE OF INVENTION
According to an aspect of the present invention, there is provided a speech synthesis method including selecting a segment, determining whether to conduct prosodic modification on the selected segment, calculating a target value of prosodic modification of a segment on which prosodic modification has been determined to be conducted, conducting prosodic modification such that a prosody of the segment on which prosodic modification has been determined to be conducted takes the calculated target value of prosodic modification, and concatenating the segment on which prosodic modification has been conducted or a segment on which prosodic modification has been determined not to be conducted.
According to another aspect of the present invention, there is provided a speech synthesis apparatus including a selecting unit configured to select a segment, a determining unit configured to determine whether to conduct prosodic modification on the selected segment, a calculating unit configured to calculate a target value of prosodic modification of a segment on which prosodic modification has been determined to be conducted based on a determination result by the determining unit, a modification unit configured to conduct prosodic modification such that a prosody of the segment on which prosodic modification has been determined to be conducted takes the calculated target value of prosodic modification, and a segment concatenating unit configured to concatenate the segment on which prosodic modification has been conducted or a segment on which prosodic modification has been determined not to be conducted based on a determination result by the determining unit.
Further features of the present invention will become apparent from the following detailed description of exemplary embodiments with reference to the attached drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate exemplary embodiments of the invention and, together with the description, serve to explain the principles of the invention.
Fig. 1 is a block diagram illustrating a hardware configuration of a speech synthesis apparatus according to an exemplary embodiment of the present invention.
Fig. 2 is a flowchart of the process flow according to a first exemplary embodiment.
Fig. 3 is a flowchart of the process flow according to a second exemplary embodiment.
Fig. 4 is a flowchart of the process flow according to a third exemplary embodiment.
BEST MODE FOR CARRYING OUT THE INVENTION
Exemplary embodiments of the invention will be described in detail below with reference to the drawings.
First Exemplary Embodiment
Fig. 1 illustrates a hardware configuration of a speech synthesis apparatus according to a first exemplary embodiment of the present invention. A central processing unit 1 conducts processing such as numerical processing and control, and conducts the numerical processing according to the procedure of the present invention. A speech output unit 2 outputs speech. An input unit 3 includes, for example, a touch panel, a keyboard, a mouse, a button, or some combination thereof, and is used by a user to instruct an operation to be conducted by the apparatus. The input unit 3 may be omitted in the case where the apparatus operates autonomously without any instruction from the user. An external storage unit 4 includes a disk or a nonvolatile memory which stores a language analysis dictionary 401, a prosody prediction parameter 402, and a speech segment database 403. In addition, the external storage unit 4 stores information that should be retained permanently among the various information stored in the RAM 6. Furthermore, the external storage unit 4 may take a transportable form, such as a CD-ROM or a memory card, which can increase convenience.
A read-only memory (ROM) 5 stores program code 501 for implementing the present invention, fixed data (not shown), and so on. The use of the external storage unit 4 and the ROM 5 is arbitrary in the present invention. For example, the program code 501 may be installed on the external storage unit 4 instead of the ROM 5. A memory 6, such as a random access memory (RAM), stores temporary information, temporary data, and various flags. The above-described units 1 to 6 are connected with one another via a bus 7.
The process flow in the first exemplary embodiment is described next with reference to Fig. 2. In step S1, an input speech synthesis target (input sequence) is analyzed. In the case where the speech synthesis target is a natural language, such as "Kyou-wa yoi tenki-desu" in Japanese, a natural language processing method such as morphological analysis or syntax analysis (parsing) is used in this step. The language analysis dictionary 401 is used accordingly in conducting the analysis. On the other hand, in the case where the speech synthesis target is written in an artificial language for speech synthesis, such as "KYO'OWA/YO'I/TE'NKIDESU" in Japanese, a dedicated analyzing process is used in this step. In step S2, a phoneme sequence is decided based on the result of the analysis in step S1. In step S3, factors for selecting segments are obtained based on the result of the analysis in step S1. The factors for selecting segments include, for example, the number of moras or the accent type of the phrase to which each phoneme belongs, the position within the sentence, phrase, or word, and the phoneme type. In step S4, the appropriate segment is selected from the speech segment database 403 stored in the external storage unit 4, based on the result of step S3. Besides using the result of step S3 as the information for selecting segments, the selection can be made so that the gap in spectral shape or prosody between adjoining segments is kept small.
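As a concrete sketch of the selection in step S4, one common realization is a dynamic-programming search that balances a target cost (mismatch against the selection factors of step S3) against a join cost (the gap in spectral shape or prosody between adjoining segments). The cost functions and candidate layout below are assumptions; this disclosure does not fix them.

```python
# Hypothetical sketch of step S4: choose one segment per phoneme from the
# speech segment database so that target cost plus join cost is minimized.
# Both cost functions and the candidate layout are illustrative assumptions.
def select_segments(candidates, target_cost, join_cost):
    """candidates[i] is the list of database segments for phoneme i."""
    # best[i][j]: (cumulative cost, backpointer) for candidate j of phoneme i.
    best = [[(target_cost(c), None) for c in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for cur in candidates[i]:
            costs = [best[i - 1][k][0] + join_cost(prev, cur)
                     for k, prev in enumerate(candidates[i - 1])]
            k = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k] + target_cost(cur), k))
        best.append(row)
    # Trace the cheapest path back from the last phoneme.
    j = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    chosen = []
    for i in range(len(candidates) - 1, -1, -1):
        chosen.append(candidates[i][j])
        if i > 0:
            j = best[i][j][1]
    return chosen[::-1]
```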
In step S5, the prosodic value of the segment selected in step S4 is obtained. The prosodic value of a segment can be measured directly from the selected segment, or a value measured in advance and stored in the external storage unit 4 can be read out and used as the prosodic value of the segment. Generally, prosody refers to fundamental frequency (F0), duration, and power. The present process may obtain only part of the prosodic information: prosodic modification is basically not conducted in corpus-based speech synthesis, and it is not necessary to obtain information that will not be subjected to prosodic modification.
In step S6, the degree of dissociation of the prosodic value of the segment obtained in step S5 from that of the adjoining segment is evaluated. In the case where the degree of dissociation is greater than a threshold value, the process proceeds to step S7. In the case where the degree of dissociation is not greater than the threshold value, the process proceeds to step S9.
In the case where the prosody is F0, a plurality of values correspond to one segment, so several methods can be considered for evaluating the degree of dissociation. For example, a representative value such as the average, intermediate, maximum, or minimum value of F0, the F0 at the center of the segment, or the F0 at the end-point of the segment may be used. Furthermore, the slope of F0 may be used. In step S7, the prosodic value after prosodic modification is calculated. In the simplest scheme, a constant is added to, or multiplied into, the prosodic value of the segment obtained in step S5 so that the degree of dissociation used in step S6 becomes minimal. Alternatively, the amount of prosodic modification can be limited so that the degree of dissociation in step S6 falls within the threshold. Furthermore, the prosodic value of a segment which has been determined in step S6 not to be modified can be interpolated and used as the prosodic value after prosodic modification.
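A minimal numerical sketch of steps S5 to S7 follows, taking the average F0 as the representative value and using the multiplicative correction described above; the ratio threshold is an assumed tuning value, not given in this disclosure.

```python
import numpy as np

F0_RATIO_THRESHOLD = 1.2   # assumed threshold (~3 semitones); not from the patent

def representative_f0(f0_contour):
    # Step S5: here, the average F0 over voiced frames (numpy array, 0 = unvoiced).
    voiced = f0_contour[f0_contour > 0]
    return float(np.mean(voiced))

def plan_prosodic_modification(segment_f0, neighbor_f0):
    """Steps S6-S7 sketch: evaluate the degree of dissociation from the
    adjoining segment and, if it exceeds the threshold, return a target F0
    contour obtained by multiplying by a constant (the simplest scheme in
    the text). Returns None when no modification is needed."""
    ratio = representative_f0(segment_f0) / representative_f0(neighbor_f0)
    dissociation = max(ratio, 1.0 / ratio)
    if dissociation <= F0_RATIO_THRESHOLD:
        return None                # no modification; proceed directly to step S9
    # Multiply so the representative values match, minimizing the dissociation.
    return segment_f0 / ratio
```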
In step S8, the prosodic value of the segment is modified based on the prosodic value calculated in step S7. Various methods used in conventional speech synthesis (for example, the PSOLA (pitch synchronous overlap add) method) can be used to modify the prosody.
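PSOLA is named here only as one example. The sketch below is a heavily simplified pitch-synchronous overlap-add that re-places two-period Hann-windowed frames at rescaled pitch marks; the given pitch marks, and the omission of duration and unvoiced handling, are simplifying assumptions.

```python
import numpy as np

def psola_pitch_scale(x, pitch_marks, factor):
    """Simplified TD-PSOLA-style sketch: scale F0 by `factor` (>1 raises
    pitch). Pitch marks (sample indices) are assumed given; duration
    modification and unvoiced handling, which a real implementation needs,
    are omitted."""
    out = np.zeros(len(x))
    periods = np.diff(pitch_marks)
    # Synthesis marks: compress/stretch the pitch periods by the factor.
    new_marks = [pitch_marks[0]]
    for p in periods:
        new_marks.append(new_marks[-1] + int(round(p / factor)))
    for i in range(1, len(pitch_marks) - 1):
        m, p = pitch_marks[i], int(periods[i - 1])
        frame = x[m - p : m + p]
        if p <= 0 or len(frame) < 2 * p:
            continue                       # skip degenerate frames at the edges
        frame = frame * np.hanning(2 * p)  # two-period Hann-windowed frame
        c = new_marks[i]
        if c - p >= 0 and c + p <= len(out):
            out[c - p : c + p] += frame    # overlap-add at the new mark
    return out
```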
In step S9, based on the result of step S6, either the segments selected in step S4 or the segments in which the prosody was modified in step S8 are concatenated and output as a synthesized speech.
According to the above exemplary embodiment, even in the case where a segment with the desired prosody is not found, the prosodic gap between the segment not having a desired prosody and the adjoining segment is reduced. As a result, the quality of the synthesized speech is prevented from being greatly degraded.
Second Exemplary Embodiment
According to a second exemplary embodiment, a method is performed in which prosody prediction is conducted based on the result of a language analysis and the predicted prosody is used.
The process flow of the second exemplary embodiment is described with reference to Fig. 3. The processes in steps S1 and S2 are similar to those in the first exemplary embodiment. In step S101, a factor for predicting the prosody is obtained based on the result of the analysis in step S1. The factor for predicting the prosody includes, for example, the number of moras or the accent type of the phrase to which each phoneme belongs, the position within the sentence, phrase, or word, the phoneme type, or information on the adjoining phrase or word.
In step S102, the prosody is predicted based on the prosody prediction factor obtained in step S101 and the prosody prediction parameter 402 stored in the external storage unit 4. Generally, prosody refers to fundamental frequency (F0), duration, and power. The present process may be conducted by obtaining only a part of the prosodic information, since the prosodic value of the segment can be used in conducting a corpus-based speech synthesis and it is not necessary to predict all of the prosodic information.
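The form of the prosody prediction parameter 402 is left open in this disclosure; the sketch below assumes a simple linear model over the encoded prediction factors, purely for illustration.

```python
import numpy as np

def predict_prosody(factor_vectors, weights, bias):
    """Step S102 sketch: map the prosody prediction factors of step S101 to
    predicted prosodic values. A linear model is assumed here; `weights` and
    `bias` stand in for the prosody prediction parameter 402, whose actual
    form the patent does not specify."""
    x = np.asarray(factor_vectors, dtype=float)  # (n_phonemes, n_features)
    return x @ weights + bias                    # e.g. (n_phonemes, 2): log-F0, duration
```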
Next, the processes of obtaining the segment selection factor (step S3), selecting the segment (step S4), and obtaining the prosodic value of the segment (step S5) are conducted as in the first exemplary embodiment. Since prosody prediction is conducted in the present exemplary embodiment, the prosody predicted in step S102 may be used as information for selecting the segment.
In step S103, the degree of dissociation of the prosodic value predicted in step S102 from the prosodic value of the segment obtained in step S5 is evaluated. In the case where the degree of dissociation is greater than a threshold value, the process proceeds to step S104. In the case where the degree of dissociation is not greater than the threshold value, the process proceeds to step S9.
In the case where the prosody is F0, a plurality of values correspond to one segment, so various methods of evaluating the degree of dissociation can be considered. For example, a representative value such as the average, intermediate, maximum, or minimum value of F0, the F0 at the center of the segment, or a mean square error between the predicted prosodic value and the prosodic value of the segment can be used.
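For the mean-square-error variant of step S103, a short sketch is given below; the linear resampling of the segment contour to the predicted contour's length is an assumed alignment, which this disclosure leaves open.

```python
import numpy as np

def dissociation_mse(predicted_f0, segment_f0):
    """Step S103 sketch: mean square error between the predicted F0 contour
    and the segment's F0 contour, after resampling to a common length."""
    predicted_f0 = np.asarray(predicted_f0, dtype=float)
    xs = np.linspace(0.0, len(segment_f0) - 1, len(predicted_f0))
    resampled = np.interp(xs, np.arange(len(segment_f0)), segment_f0)
    return float(np.mean((predicted_f0 - resampled) ** 2))
```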
In step S104, the prosodic value after modifying the prosody is calculated. In the simplest scheme, the prosodic value after prosodic modification can be set equal to the prosodic value predicted in step S102. Alternatively, the amount of prosodic modification can be limited so that the degree of dissociation in step S103 falls within the threshold. Steps S8 and S9 are similar to those in the first exemplary embodiment.
Third Exemplary Embodiment
According to a third exemplary embodiment, the segment is re-selected in the case of a segment on which prosodic modification has been determined to be conducted in the above-described exemplary embodiments. The process flow of the third exemplary embodiment is described with reference to Fig. 4. The processes in steps S1 to S5 are similar to those in the second exemplary embodiment. In step S103, a determination similar to that in the second exemplary embodiment is made. In the case where the degree of dissociation is greater than the threshold value, the process proceeds to step S201. In the case where the degree of dissociation is not greater than the threshold value, the process proceeds to step S9. In step S201, a factor for re-selecting the segment is obtained. In addition to the factors used in step S3, information on the segments on which prosodic modification has been determined not to be conducted can be used as a factor for re-selecting the segment. For example, the consecutiveness of the prosody can be improved by using the prosodic values of the segments on which prosodic modification has been determined not to be conducted in step S103 (see the sketch after this paragraph). Alternatively, the spectral consecutiveness of the segment to be re-selected and the segments obtained in step S5 can be considered. In step S202, as in step S4, the segment is re-selected based on the result of step S201.
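The sketch referenced above illustrates one assumed form of the re-selection factor of step S201: a cost term that favors candidates whose prosody bridges the unmodified adjoining segments. It would be added to the ordinary selection costs of step S4 (see the earlier selection sketch); the weighting is an assumed tuning constant.

```python
def reselection_cost(candidate_f0, left_f0, right_f0, weight=1.0):
    """Step S201 sketch: score a re-selection candidate against the
    representative F0 of the adjoining segments that will not be modified,
    favoring candidates whose F0 bridges the two. `weight` is an assumed
    tuning constant, not specified in the patent."""
    midpoint = 0.5 * (left_f0 + right_f0)
    return weight * abs(candidate_f0 - midpoint)
```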
Furthermore, as in the first exemplary embodiment, in the case where the degree of dissociation of the prosodic values between adjoining segments is greater than a threshold value, prosodic modification is conducted (steps S6, S7, and S8). Finally, a synthesized speech is output as in the above-described exemplary embodiments (step S9).
According to the third exemplary embodiment, a more appropriate segment can be selected.
The present invention can also be achieved by supplying a storage medium storing program code (software) which realizes the functions of the above-described exemplary embodiments to a system or an apparatus, and causing a computer (or a CPU or micro processing unit (MPU)) of the system or the apparatus to read and execute the program code stored in the storage medium.
In this case, the program code itself that is read from the storage medium realizes the functions of the above-described exemplary embodiments.
The storage medium for supplying the program code includes, for example, a flexible disk, a hard disk, an optical disk, a magneto-optical disk, a compact disk-ROM (CD-ROM), a CD-recordable (CD-R), a magnetic tape, a nonvolatile memory card, and a ROM. Furthermore, in addition to realizing the functions of the above-described exemplary embodiments by executing the program code read by a computer, the present invention also includes a case in which an operating system (OS) running on the computer performs a part or the whole of the actual process according to instructions of the program code, and that process realizes the functions of the above-described exemplary embodiments.
Furthermore, the present invention also includes a case in which, after the program code is read from the storage medium and written into a memory of a function expansion board inserted in the computer or a function expansion unit connected to the computer, a CPU in the function expansion board or the function expansion unit performs a part of or the whole process according to instructions of the program code, and that process realizes the functions of the above-described exemplary embodiments.
While the present invention has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all modifications, equivalent structures, and functions.
This application claims priority from Japanese Patent Application No. 2005-159123 filed May 31, 2005, which is hereby incorporated by reference herein in its entirety.

Claims

1. A speech synthesis method comprising: selecting a segment; determining whether to conduct prosodic modification on the selected segment; calculating a target value of prosodic modification of the selected segment when it is determined that prosodic modification is to be conducted on the selected segment; conducting prosodic modification such that a prosody of the selected segment on which prosodic modification has been determined to be conducted takes the calculated target value of prosodic modification; and concatenating the selected segment on which prosodic modification has been conducted or the selected segment on which prosodic modification has been determined not to be conducted.
2. A speech synthesis method as claimed in claim 1, wherein determining whether to conduct prosodic modification on the selected segment includes determining whether to conduct prosodic modification based on a degree of dissociation of prosody from that of an adjoining segment.
3. A speech synthesis method as claimed in claim 1, wherein calculating the target value of prosodic modification of the selected segment includes calculating the target value of prosodic modification based on a prosodic value of the segment on which prosodic modification has been determined not to be conducted.
4. A speech synthesis method as claimed in claim 1, further comprising predicting a prosody, wherein determining whether to conduct prosodic modification on the selected segment includes determining whether to conduct prosodic modification based on a degree of dissociation from the predicted prosody.
5. A speech synthesis method as claimed in claim 1, further comprising predicting a prosody, wherein calculating the target value of prosodic modification of the selected segment includes calculating the target value of prosodic modification based on the predicted prosody.
6. A speech synthesis method as claimed in claim 1, further comprising re-selecting a segment for a segment on which prosodic modification has been determined to be conducted.
7. A speech synthesis method as claimed in claim 6, wherein re-selecting the segment includes re- selecting the segment based on information on the segment on which prosodic modification has been determined not to be conducted.
8. A speech synthesis method as claimed in claim 6, further comprising re-determining whether to conduct prosodic modification on the re-selected segment, wherein prosodic modification is conducted on a segment on which prosodic modification has been re-determined to be conducted.
9. A control program for causing a computer to execute the speech synthesis method as claimed in claim 1.
10. A speech synthesis apparatus comprising: a selecting unit configured to select a segment; a determining unit configured to determine whether to conduct prosodic modification on the selected segment; a calculating unit configured to calculate a target value of prosodic modification of a segment on which prosodic modification has been determined to be conducted based on a determination result by the determining unit; a modification unit configured to conduct prosodic modification such that a prosody of the segment on which prosodic modification has been determined to be conducted takes the calculated target value of prosodic modification; and a segment concatenating unit configured to concatenate the segment on which prosodic modification has been conducted or a segment on which prosodic modification has been determined not to be conducted based on a determination result by the determining unit.
PCT/JP2006/311139 2005-05-31 2006-05-29 Speech synthesis method and apparatus WO2006129814A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2005-159123 2005-05-31
JP2005159123A JP2006337476A (en) 2005-05-31 2005-05-31 Voice synthesis method and system

Publications (1)

Publication Number Publication Date
WO2006129814A1 true WO2006129814A1 (en) 2006-12-07

Family

ID=37481739

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2006/311139 WO2006129814A1 (en) 2005-05-31 2006-05-29 Speech synthesis method and apparatus

Country Status (2)

Country Link
JP (1) JP2006337476A (en)
WO (1) WO2006129814A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8027835B2 (en) 2007-07-11 2011-09-27 Canon Kabushiki Kaisha Speech processing apparatus having a speech synthesis unit that performs speech synthesis while selectively changing recorded-speech-playback and text-to-speech and method
CN104361896A (en) * 2014-12-04 2015-02-18 上海流利说信息技术有限公司 Voice quality evaluation equipment, method and system
CN104361895A (en) * 2014-12-04 2015-02-18 上海流利说信息技术有限公司 Voice quality evaluation equipment, method and system

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008139919A1 (en) * 2007-05-08 2008-11-20 Nec Corporation Speech synthesizer, speech synthesizing method, and speech synthesizing program
JP5029884B2 (en) * 2007-05-22 2012-09-19 富士通株式会社 Prosody generation device, prosody generation method, and prosody generation program

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH01284898A (en) * 1988-05-11 1989-11-16 Nippon Telegr & Teleph Corp <Ntt> Voice synthesizing device
JPH08263090A (en) * 1995-03-20 1996-10-11 N T T Data Tsushin Kk Synthesis unit accumulating method and synthesis unit dictionary device
JPH1097289A (en) * 1996-09-20 1998-04-14 N T T Data Tsushin Kk Phoneme selecting method, voice synthesizer and instruction storing device
JP2000066695A (en) * 1998-08-18 2000-03-03 Ntt Data Corp Element dictionary, and voice synthesizing method and device therefor
JP2001265374A (en) * 2000-03-14 2001-09-28 Omron Corp Voice synthesizing device and recording medium
JP2003233386A (en) * 2002-02-08 2003-08-22 Nippon Telegr & Teleph Corp <Ntt> Voice synthesizing method, voice synthesizer and voice synthesizing program
JP2004012700A (en) * 2002-06-05 2004-01-15 Canon Inc Method and apparatus for synthesizing voice and method and apparatus for preparing dictionary
JP2004354644A (en) * 2003-05-28 2004-12-16 Nippon Telegr & Teleph Corp <Ntt> Speech synthesizing method, device and computer program therefor, and information storage medium stored with same
JP2005091747A (en) * 2003-09-17 2005-04-07 Mitsubishi Electric Corp Speech synthesizer

Also Published As

Publication number Publication date
JP2006337476A (en) 2006-12-14

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 11579864

Country of ref document: US

121 Ep: the epo has been informed by wipo that ep was designated in this application
DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: RU

122 Ep: pct application non-entry in european phase

Ref document number: 06766437

Country of ref document: EP

Kind code of ref document: A1