WO2008038994A1 - Method for converting pronunciation using boundary pause intensity and text-to-speech synthesis system based on the same - Google Patents

Method for converting pronunciation using boundary pause intensity and text-to-speech synthesis system based on the same

Info

Publication number
WO2008038994A1
Authority
WO
WIPO (PCT)
Prior art keywords
pronunciation
training
conversion
string
information
Application number
PCT/KR2007/004699
Other languages
French (fr)
Inventor
Jong Jin Kim
Moon Hwan Park
Original Assignee
Electronics And Telecommunications Research Institute
Application filed by Electronics And Telecommunications Research Institute filed Critical Electronics And Telecommunications Research Institute
Publication of WO2008038994A1 publication Critical patent/WO2008038994A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/06 Elementary speech units used in speech synthesisers; Concatenation rules
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice


Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Provided are a method for converting pronunciation using boundary pause intensity and a text-to-speech synthesis system based on the same. The synthesis database is processed to reflect a pronunciation variation phenomenon at a word boundary depending on reading with breath, a speaker-dependent pronunciation conversion model is created based on the processed synthesis database, and feature parameters for pronunciation conversion are extracted, upon speech synthesis, from the language analysis result for an input sentence and from inter-word boundary strength information, and applied to the pronunciation conversion model to automatically generate a pronunciation string. Thus, a sophisticated pronunciation string can be generated particularly upon inter-word pronunciation conversion, thereby improving quality of the synthesized sound in the text-to-speech synthesis system.

Description

METHOD FOR CONVERTING PRONUNCIATION USING
BOUNDARY PAUSE INTENSITY AND TEXT-TO-SPEECH
SYNTHESIS SYSTEM BASED ON THE SAME
Technical Field
[1] The present invention relates to a method for converting pronunciation using boundary pause intensity and a text-to-speech synthesis system based on the same, and more particularly, to a method capable of generating a sophisticated pronunciation string using feature parameters for pronunciation-string generation and boundary pause intensity upon inter-word pronunciation conversion, and a text-to-speech synthesis system based on the method.
Background Art
[2] A text-to-speech synthesis system extracts text information from a given text sentence, selects the most suitable of pre-recorded speech segments based on the extracted text information, and combines the selected segments to produce an audibly recognizable speech sentence. The text-to-speech synthesis system is widely used in automatic answering systems, mobile-phone-number searching systems, automatic notification systems located in public places, etc.
[3] Converting an input text sentence to a correct speech pronunciation in such a text-to-speech synthesis system is considered to be an important process. That is, a pronunciation conversion function of generating a sound value from an input text sentence is of great importance.
[4] The pronunciation conversion refers to converting a given spelling notation (letter value) to a corresponding pronunciation notation (pronunciation value), as in the Korean example "학교" → "학꾜" (HH AA KK J OW). Examples of a conventional pronunciation conversion method include a pronunciation conversion method based on phonological change rules, a statistical pronunciation conversion method using a pronunciation string dictionary, a statistical pronunciation conversion method using a pronunciation-transferred learning DB, and the like.
[5] In the pronunciation conversion method based on phonological change rules, a pronunciation string of an input text is automatically generated according to phonological rules. In this case, it is difficult to determine priorities among the phonological rules. Further, the method cannot reflect a phonological phenomenon at a word boundary since the phonological rules generally relate to a phonological phenomenon within a word.
[6] The phonological phenomenon at the word boundary will now be described in greater detail. For example, when the phrase "겨울 나그네" (KYEO UL NA GEU NAE) is pronounced, it may be pronounced as a continuous sound, "겨울라그네" (KYEOULLAGEUNAE), or as "겨울 ‖ 나그네" (KYEOUL ‖ NAGEUNAE), depending on the speaker's reading with breath. In other words, even though a continuous sound or a phonological change at a word boundary is determined by the speaker's reading with breath, i.e., the boundary strength of reading with breath, the method performs the pronunciation conversion based only on the phonological change rules. Accordingly, the method cannot reflect pronunciation variation depending on the inter-word pause strength.
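This dependency can be made concrete with a small sketch (not from the patent): a hypothetical rule applier that performs the liaison of the example above only when the pause at the word gap is weak. The rule table, phone symbols, and threshold are all illustrative assumptions.

```python
# Hedged illustration: inter-word liaison gated by boundary pause strength.
# The rule table, romanized phones, and threshold are assumptions for
# illustration, not the patent's actual rule set or pause scale.

LIAISON = {("L", "N"): ("L", "L")}  # /l/ + /n/ -> /l l/, as in KYEOUL + NAGEUNAE

def join_words(phones_a, phones_b, boundary_strength):
    """Concatenate two word pronunciations, applying liaison only when the
    pause at the boundary is weak (the words are read in one breath)."""
    if boundary_strength < 2:  # assumed threshold for "one breath group"
        key = (phones_a[-1], phones_b[0])
        if key in LIAISON:
            last, first = LIAISON[key]
            return phones_a[:-1] + [last, first] + phones_b[1:]
    return phones_a + ["<pause>"] + phones_b  # strong pause: words stay intact

kyeoul = ["K", "Y", "EO", "U", "L"]
nageune = ["N", "A", "G", "EU", "N", "AE"]
print(join_words(kyeoul, nageune, 1))  # continuous reading: ... L L A ...
print(join_words(kyeoul, nageune, 3))  # paused reading: KYEOUL <pause> NAGEUNAE
```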
[7] Meanwhile, the statistical pronunciation conversion method using a pronunciation string dictionary performs pronunciation transfer on a variety of text corpora to build a pronunciation string dictionary, performs learning on the pronunciation string dictionary using a variety of statistical learning methods to create a pronunciation conversion model, and generates a pronunciation string for an input text based on the pronunciation conversion model. This method can solve the difficulties of exceptional pronunciation processing and rule priority determination, but cannot reflect an allophone model or the speaker-dependent pronunciation conversion features required for speech synthesis, because it uses the text corpus-based pronunciation string dictionary to perform the pronunciation string conversion.
[8] In actual utterance, there is a difference between the phoneme string a speaker actually pronounces and the phoneme string appearing in the pronunciation string dictionary. Accordingly, when uniform phoneme conversion is performed based on the pronunciation string dictionary without considering the unique pronunciation features of each speaker, an unnatural pronunciation string is produced, particularly at word boundaries, thereby degrading the naturalness and clarity of the synthesized sound due to erroneous concatenation of speech segments.
[9] Meanwhile, the statistical pronunciation conversion method using a pronunciation-transferred learning DB performs statistical training based on a speaker speech DB used in an actual synthesis system to perform pronunciation string conversion. This method can reflect an allophone model and perform speaker-dependent pronunciation conversion. However, even though a dominant factor that phonologically determines the continuity of inter-word pronunciation is the inter-word pause strength, the method mainly performs the pronunciation conversion using only the speaker's phoneme string information and thus cannot predict the inter-word pause strength. As a result, the method cannot correctly reflect pronunciation conversion features at word boundaries.
[10] As a result, since the above-described pronunciation conversion methods cannot reflect a pronunciation change depending on a difference in reading with breath, i.e., the inter-word pause strength, they cannot generate a sophisticated pronunciation string upon inter-word pronunciation conversion and accordingly are limited in generating a natural synthesized sound.
Disclosure of Invention
Technical Problem
[11] The present invention is directed to a pronunciation conversion method capable of improving quality of a synthesized sound by generating a sophisticated pronunciation string using boundary pause intensity upon inter-word pronunciation conversion to reflect a pronunciation variation phenomenon at a word boundary depending on reading with breath, and a text-to-speech synthesis system based on the method.
Technical Solution
[12] One aspect of the present invention provides a method for converting pronunciation using boundary pause intensity, the method comprising the steps of: (a) processing a synthesis database to reflect a pronunciation variation phenomenon at a word boundary; (b) extracting feature parameters from the processed synthesis database to build a training DB; (c) training the training DB based on the extracted feature parameters to create a pronunciation conversion model; (d) when a text is input, performing pre-processing and language analysis on the input text and predicting inter-word boundary strength for the input text; (e) extracting feature parameters for pronunciation conversion from the input text; and (f) generating a pronunciation string for the input text using the extracted feature parameters based on the pronunciation conversion model.
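Read as software, steps (a) through (f) form an offline training phase and an online synthesis phase. The skeleton below is a hedged structural sketch of that flow; every function name and stub body is a hypothetical placeholder, not the patent's algorithm.

```python
# Hedged skeleton of steps (a)-(f); all names and stub bodies are
# hypothetical placeholders, so only the data flow is illustrated.

def process_synthesis_db(db):                 # (a) reflect word-boundary variation
    return db

def build_training_db(processed):             # (b) extract feature parameters
    return [("toy_features", "toy_phones") for _ in processed]

def train_conversion_model(training_db):      # (c) statistical learning
    return dict(training_db)

def text_to_phones(text, model):              # (d)-(f) online conversion
    boundaries = [1] * text.count(" ")        # (d) stub boundary-strength prediction
    features = "toy_features"                 # (e) stub feature extraction
    return model.get(features, "?")           # (f) stub model application

model = train_conversion_model(build_training_db(process_synthesis_db(["utt-1"])))
print(text_to_phones("KYEOUL NAGEUNAE", model))  # -> 'toy_phones'
```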
[13] Another aspect of the present invention provides a text-to-speech synthesis system based on pronunciation conversion using boundary pause intensity comprising: a synthesis database processor for processing a synthesis database to reflect a pronunciation variation phenomenon at a word boundary; a training DB generator for extracting feature parameters from the processed synthesis database to build a training DB; a pronunciation conversion model generator for training the training DB based on the extracted feature parameters to create a pronunciation conversion model, and extracting words frequently causing a pronunciation conversion error in the training process to build an exceptional pronunciation dictionary; a pre-processor for performing pre-processing on an input text; a language analyzer for receiving the pre-processed input text and performing language analysis on the pre-processed input text to predict inter-word boundary strength based on the language analysis result; a feature extractor for extracting feature parameters for pronunciation conversion using the language analysis result and the predicted inter-word boundary strength information; a pronunciation string generator for generating a pronunciation string for the input text based on the pronunciation conversion model or the exceptional pronunciation dictionary; and a synthesized-sound generator for generating and outputting a synthesized sound for the generated pronunciation string.
Advantageous Effects
[14] As described above, according to the present invention, a sophisticated pronunciation string can be generated using the boundary pause intensity upon inter-word pronunciation conversion, thereby improving quality of the synthesized sound in the text-to-speech synthesis system.
[15] Furthermore, the pronunciation conversion model is created in consideration of the previously pronunciation-converted phoneme environment as well as front and rear phoneme environments, thereby improving pronunciation conversion accuracy over conventional speech synthesis methods.
[16] In addition, the exceptional pronunciation dictionary is built from words frequently decided to be erroneous in a previous training process, and a pronunciation string for an exceptional expression is automatically generated, thereby improving performance of the text-to-speech synthesis system.
Brief Description of the Drawings
[17] FIG. 1 is a flowchart illustrating a method for converting pronunciation using boundary pause intensity according to the present invention;
[18] FIG. 2 is a flowchart illustrating a step of processing a synthesis database in FIG. 1;
[19] FIG. 3 is a flowchart illustrating a step of building a training DB based on the processed synthesis database in FIG. 1;
[20] FIG. 4 is a flowchart illustrating feature parameters extracted to build the training DB in FIG. 3;
[21] FIG. 5 is a flowchart illustrating a step of creating a pronunciation conversion model in FIG. 1; and
[22] FIG. 6 is a block diagram illustrating a text-to-speech synthesis system based on pronunciation conversion using boundary pause intensity according to the present invention.
[23] * Description of Major Symbols in the above Figures.
[24] 100: Synthesis database
[25] 100A: Processed synthesis database
[26] 200: Training DB
[27] 300: Pronunciation conversion model
[28] 400: Exceptional pronunciation dictionary
[29] 610: Synthesis database processor
[30] 620: Training DB generator
[31] 630: Pronunciation conversion model generator
[32] 640: Pre-processor
[33] 650: Language analyzer
[34] 660: Feature extractor
[35] 670: Pronunciation string generator
[36] 680: Synthesized sound generator
Mode for the Invention
[37] Hereinafter, exemplary embodiments of the present invention will be described in detail. However, the present invention is not limited to the exemplary embodiments disclosed below, but can be implemented in various forms. Therefore, the present exemplary embodiments are provided for complete disclosure of the present invention and to fully convey the scope of the present invention to those of ordinary skill in the art.
[38] FIG. 1 is a flowchart illustrating a method for converting pronunciation using boundary pause intensity according to the present invention.
[39] First, referring to FIG. 1, a synthesis database 100 is processed to reflect a pronunciation variation phenomenon at a word boundary depending on reading with breath (S1110). The synthesis database 100 is processed because the source speech data in the synthesis database 100 does not fully reflect the phonological phenomenon at a word boundary occurring due to the speaker's reading with breath, causing an error upon pronunciation conversion. Processing the synthesis database 100 will now be described in greater detail with reference to FIG. 2.
[40] FIG. 2 is a flowchart illustrating the step (S1110) of processing the synthesis database 100 in FIG. 1.
[41] First, with reference to FIG. 2, pronunciation transfer is manually, semi-automatically, or automatically performed on the synthesis database 100 to generate a pronunciation string (S1111).
[42] Pause strength is then tagged to the pronunciation string generated through the pronunciation transfer process in a syllable unit, a phoneme unit, or a word unit (S1112).
[43] Since the pause-strength-tagged pronunciation string is merely a pronunciation transfer of the spelling together with phoneme timing, a determination is made as to whether the speech signal matches the pronunciation, and an allophone of each phoneme is labeled according to the determination, so that the pronunciation string becomes more useful for speech synthesis (S1113).
[44] Any error in the labeled pronunciation string is then corrected. Consequently, the processed synthesis database 100A is obtained (S1114).
[45] The processed synthesis database 100A includes an utterance list, a pronunciation transfer list in the utterance list, phoneme-specific labeling data reflecting an allophone in a pronunciation transfer content, and boundary pause intensity tagged data.
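Purely as an illustration (the patent does not specify a storage format), one record of the processed database 100A might bundle the four kinds of data just listed as follows; all field names and the numeric pause scale are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical record for one utterance in the processed database 100A;
# field names and the numeric pause scale are illustrative assumptions.
@dataclass
class ProcessedUtterance:
    text: str                                     # entry in the utterance list
    pronunciation: List[str]                      # pronunciation transfer result
    phone_labels: List[Tuple[str, float, float]]  # allophone label, start, end (s)
    boundary_pauses: List[Tuple[int, int]]        # (word index, pause strength)

utt = ProcessedUtterance(
    text="KYEOUL NAGEUNAE",
    pronunciation=["K", "Y", "EO", "U", "L", "L", "A", "G", "EU", "N", "AE"],
    phone_labels=[("K", 0.00, 0.05), ("Y", 0.05, 0.09)],  # truncated toy timing
    boundary_pauses=[(0, 1)],                     # weak pause after the first word
)
print(utt.text, utt.boundary_pauses)
```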
[46] Referring back to FIG. 1, after the processed synthesis database 100A is obtained through the above-described processes, a training DB 200 for a pronunciation conversion model is built using the processed synthesis database 100A. A step (S1120) of building the training DB 200 will now be described in greater detail with reference to FIG. 3.
[47] FIG. 3 is a flowchart illustrating the step of building a training DB 200 based on the processed synthesis database 100A in FIG. 1.
[48] First, with reference to FIG. 3, language information, rhythm information, and allophone information of each phoneme are calculated using the utterance list, language tagging information for each sentence in the utterance list, the phoneme-specific labeling data, and the boundary pause intensity tagged data based on the context information in the processed synthesis database 100A (S1121).
[49] Feature parameters are then extracted from the language information, rhythm information, and allophone information of each phoneme, and the training DB 200 is built based on the extracted feature parameters (S1122 to S1123).
[50] The feature parameters extracted to build the training DB 200 include context information of spellings at the left and right of a current spelling, previously pronunciation-converted phoneme context information of the current spelling, boundary strength information of the current spelling, syllable-specific distance information of the current spelling, morpheme information, and the like. The parameters will now be described in greater detail with reference to FIG. 4.
[51] FIG. 4 is a flowchart illustrating feature parameters extracted to build the training DB 200 in FIG. 3.
[52] Referring to FIG. 4, parameters such as the context information of spellings at the left and right of a current spelling, the previously pronunciation-converted phoneme context information of the current spelling, the boundary strength information of the current spelling, the syllable-specific distance information of the current spelling, the morpheme information, and the like may be extracted to build the training DB 200. The feature parameters will now be described in greater detail, starting from the consolidated sketch below.
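The sketch assembles these five parameter types into one feature vector for a single spelling position; the window size, encodings, and field names are assumptions for illustration, not the patent's specification.

```python
# Hedged sketch: building one feature vector per spelling position.
# Window size, encodings, and field names are illustrative assumptions.

def features_for_position(spellings, converted_so_far, boundary_strengths,
                          morph_tags, i, window=2):
    pad = ["<pad>"] * window
    padded = pad + list(spellings) + pad
    j = i + window                      # index of spelling i in the padded list
    return {
        # n-gram-like context of spellings at the left and right of position i
        "left_context": tuple(padded[j - window:j]),
        "right_context": tuple(padded[j + 1:j + 1 + window]),
        # phones already produced to the left (left-to-right conversion)
        "prev_phones": tuple(converted_so_far[-window:]),
        # predicted pause strength at the boundary nearest this spelling
        "boundary_strength": boundary_strengths[i],
        # syllable distance from the start/end of the prosodic phrase
        "dist_from_start": i,
        "dist_from_end": len(spellings) - 1 - i,
        # morpheme tag of the word containing this spelling
        "morph": morph_tags[i],
    }

fv = features_for_position(list("hakgyo"), ["HH", "AA"], [0] * 6, ["NOUN"] * 6, 2)
print(fv["left_context"], fv["right_context"], fv["prev_phones"])
```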
[53] The context information of spellings at the left and right of a current spelling is similar in meaning to an n-gram model and is used to reflect that the pronunciation conversion result of the current spelling depends on information about the left and right spellings.
[54] The previously pronunciation-converted phoneme context information of the current spelling is used to particularly reflect that the current pronunciation conversion is affected by previous pronunciation conversions when pronunciation conversion is performed at the allophone level and a pronunciation change model is applied to an input text in a left-to-right manner, as sketched below.
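Here is a minimal sketch of that left-to-right application, with a hypothetical `model` callable standing in for the statistical pronunciation change model: each position is converted using the phones already produced, so earlier conversions condition later ones.

```python
# Hedged sketch of left-to-right conversion: the phones produced so far are
# fed back as context, so previous conversions affect the current one.

def convert_left_to_right(spellings, model, context=2):
    phones = []
    for s in spellings:
        # `model` is a hypothetical callable mapping a spelling plus the
        # previously converted phone context to allophone-level phones.
        phones.extend(model(s, tuple(phones[-context:])))
    return phones

toy_model = lambda s, prev: [s.upper()]  # stand-in for the statistical model
print(convert_left_to_right(list("hakgyo"), toy_model))
```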
[55] The boundary strength information of the current spelling is used to particularly reflect that the range of influence of a pronunciation conversion depends rhythmically and phonologically on the boundary strength: a segmental influence of an inter-word pronunciation conversion is applied between words within a rhythm group below an accentual phrase, while the pronunciation conversion does not cross a boundary between words at or above the accentual-phrase level. The boundary strength is shown in Table 1:
[56] Table 1 (boundary pause strength levels; the table appears only as an image in the original publication).
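One way to read the constraint just described is as a chunking step: words are grouped into rhythm groups by comparing each boundary's pause strength against an assumed accentual-phrase threshold, and inter-word conversion is applied only inside a group. The numeric scale, the threshold, and the third word below are assumptions, since the original Table 1 is not reproduced here.

```python
# Hedged sketch: words are chunked so that inter-word pronunciation conversion
# is applied only inside a rhythm group below the accentual-phrase level.
# The numeric scale and threshold are assumptions (Table 1 is not reproduced).

ACCENTUAL_PHRASE = 2  # assumed: strengths >= 2 mark an accentual-phrase boundary

def rhythm_groups(words, boundary_strengths):
    groups, current = [], [words[0]]
    for word, strength in zip(words[1:], boundary_strengths):
        if strength >= ACCENTUAL_PHRASE:
            groups.append(current)   # strong boundary: conversion stops here
            current = [word]
        else:
            current.append(word)     # weak boundary: convert across this gap
    groups.append(current)
    return groups

print(rhythm_groups(["KYEOUL", "NAGEUNAE", "IYAGI"], [1, 3]))
# -> [['KYEOUL', 'NAGEUNAE'], ['IYAGI']]
```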
[57] Meanwhile, the syllable-specific distance information of the current spelling is used to reflect that the pronunciation conversion is affected depending on whether a pronunciation conversion unit is located at the start or end of the accentual phrase or intonation phrase or in the middle thereof.
[58] The morpheme information is particularly used as one boundary of a pronunciation conversion unit because the meaning of a word affects the pronunciation conversion.
[59] Referring back to FIG. 1, after the training DB 200 is built through the above-described processes, the training DB 200 is trained based on the extracted feature parameters in order to create a pronunciation conversion model 300 (S1130). Creating the pronunciation conversion model 300 will now be described in greater detail with reference to FIG. 5.
[60] FIG. 5 is a flowchart illustrating the step (S1130) of creating the pronunciation conversion model in FIG. 1.
[61] First, a learning model parameter for statistical learning of the training DB 200 is determined (S1131), and the training DB 200 is trained based on the determined learning model parameter (S1132).
[62] Pronunciation conversion performance of the determined learning model parameter is then evaluated based on the training result (S1133).
[63] Evaluation of the pronunciation conversion performance of the learning model parameter is performed because the parameter for training must be differently determined according to the characteristics of the training DB 200. The training DB 200 may be trained using several parameters, which differently affect pronunciation conversion performance in an initial syllable, a medial syllable and a final syllable.
[64] In this embodiment, the statistical learning method may include several methods. For example, when a decision tree model is used, the step of determining the learning model parameter (S1131) corresponds to a step of creating a decision tree, and the step of evaluating the pronunciation conversion performance of the learning model parameter (S1133) corresponds to a step of assessing the performance of the generated decision tree, for example by 10-fold cross-validation.
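A hedged sketch of steps S1131 through S1135 follows, with scikit-learn as a stand-in toolkit; the patent names decision trees and 10-fold validation but no particular library, and the feature vectors, labels, and parameter grid below are toy assumptions.

```python
# Hedged sketch of steps S1131-S1135 with scikit-learn as a stand-in toolkit;
# the toy data, parameter grid, and scoring are illustrative assumptions.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X = [[0, 1, 0], [1, 0, 2], [0, 2, 1], [1, 1, 0]] * 10  # toy feature vectors
y = ["K", "KK", "K", "G"] * 10                          # toy target phones

best_score, best_params = -1.0, None
for max_depth in (3, 5, None):          # S1131: candidate learning parameters
    clf = DecisionTreeClassifier(max_depth=max_depth, min_samples_leaf=2)
    scores = cross_val_score(clf, X, y, cv=10)   # S1133: 10-fold validation
    if scores.mean() > best_score:               # S1134: keep the best parameter
        best_score, best_params = scores.mean(), {"max_depth": max_depth}

final_model = DecisionTreeClassifier(**best_params).fit(X, y)  # S1135
print(best_params, round(best_score, 3))
```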
[65] A determination is then made as to whether the determined learning model parameter is a parameter having the highest pronunciation conversion performance, based on the pronunciation conversion performance evaluation result (S1134).
[66] Here, when the decision tree model is used, the parameter that minimizes the risk cost in terms of the depth of the decision tree, the number of terminal nodes, the final measure contained in each terminal node, and the like is determined as the parameter having the highest pronunciation conversion performance.
[67] If the parameter of the determined learning model is determined to be the parameter having the highest pronunciation conversion performance, the pronunciation conversion model 300 is generated based on the training result (S1135). Otherwise, the process returns to the learning model parameter determining step (S1131).
[68] In other words, the trained pronunciation conversion model 300 is generated using the parameter having the highest pronunciation conversion performance depending on the characteristics of the training DB 200 through such processes.
[69] Meanwhile, it is difficult for the pronunciation conversion model 300 based on such a statistical modeling scheme to have 100% accuracy, and in particular, a data partition-based learning method such as the decision tree method has poor performance for low-frequency contexts. In the present invention, the step of evaluating pronunciation conversion performance of the learning model parameter (S1133) comprises building the exceptional pronunciation dictionary 400 using words containing frequently erroneous contexts, and utilizing the exceptional pronunciation dictionary 400 to convert the pronunciation string. This will now be described.
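A hedged sketch of that dictionary-building idea: during evaluation, per-word conversion errors are counted and frequent offenders are stored with their reference pronunciations. The threshold and toy data are assumptions.

```python
# Hedged sketch of building the exceptional pronunciation dictionary 400:
# words whose conversions are frequently wrong during evaluation are stored
# with their reference pronunciations. Threshold and data are toy assumptions.
from collections import Counter

def build_exception_dict(eval_pairs, reference, min_errors=3):
    errors = Counter(word for word, predicted in eval_pairs
                     if predicted != reference[word])
    return {word: reference[word]
            for word, count in errors.items() if count >= min_errors}

reference = {"hakgyo": "HH AA KK J OW"}
eval_pairs = [("hakgyo", "HH AA K G Y OW")] * 3     # three toy mis-conversions
print(build_exception_dict(eval_pairs, reference))  # -> {'hakgyo': 'HH AA KK J OW'}
```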
[70] Referring to FIG. 1, after the pronunciation conversion model 300 and the exceptional pronunciation dictionary 400 are built through the above-described process, pre-processing and language analysis are performed on an input text when the text is input (S1140).
[71] Preferably, the pre-processing step includes numerical conversion, symbol conversion, typographical error correction, etc.
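A minimal sketch of such pre-processing follows, assuming tiny lookup tables; a real system would need locale-aware number expansion and a proper spelling corrector.

```python
# Hedged sketch of the pre-processing step (numbers, symbols, simple typos);
# the tiny lookup tables are illustrative assumptions.
import re

DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three"}
SYMBOLS = {"%": "percent", "&": "and"}
TYPOS = {"teh": "the"}

def preprocess(text):
    text = re.sub(r"\d", lambda m: " " + DIGITS.get(m.group(), m.group()) + " ",
                  text)                              # numerical conversion
    for sym, word in SYMBOLS.items():                # symbol conversion
        text = text.replace(sym, " " + word + " ")
    words = [TYPOS.get(w, w) for w in text.split()]  # typo correction
    return " ".join(words)

print(preprocess("teh rate is 3%"))  # -> "the rate is three percent"
```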
[72] Inter-word boundary strength is then predicted based on the language analysis result (S1150).
[73] Context information for pronunciation conversion is then automatically generated using the language analysis result and the predicted inter-word boundary strength information, and feature parameters for pronunciation conversion are extracted based on the generated context information (S1160).
[74] A determination is then made as to whether the exceptional pronunciation dictionary 400 is applied, i.e., whether the words of the input text are included in the exceptional pronunciation dictionary 400 (S1170).
[75] If the words of the input text are not included in the exceptional pronunciation dictionary 400, a pronunciation string for the input text is automatically generated based on the pronunciation conversion model 300 (S1180).
[76] If the words of the input text are included in the exceptional pronunciation dictionary 400, i.e., if words frequently decided to be erroneous in the previous training process are included in the input text, a pronunciation string for the input text is generated based on the exceptional pronunciation dictionary 400 (S1190).
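The branch in steps S1170 to S1190 can be condensed into one hedged function; the names and the toy fallback model are hypothetical.

```python
# Hedged sketch of steps S1170-S1190: exception-dictionary lookup first,
# statistical model otherwise. All names are hypothetical.

def pronounce(words, exception_dict, model):
    phones = []
    for word in words:
        if word in exception_dict:            # S1170/S1190: dictionary hit
            phones.append(exception_dict[word])
        else:                                 # S1180: fall back to the model
            phones.append(model(word))
    return phones

toy_model = lambda w: w.upper()               # stand-in for model 300
print(pronounce(["hakgyo", "nageune"], {"hakgyo": "HH AA KK J OW"}, toy_model))
```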
[77] A synthesized sound for the generated pronunciation string is then generated and output (S1200).
[78] Thus, according to the present invention, the synthesis database is processed to reflect the pronunciation variation phenomenon at the word boundary depending on reading with breath, the speaker-dependent pronunciation conversion model 300 is created based on the processed synthesis database, and the feature parameters for pronunciation conversion are extracted based on the language analysis result for an input sentence upon speech synthesis and the inter-word boundary strength information and applied to the pronunciation conversion model 300 in order to automatically generate the pronunciation string. As a result, a sophisticated pronunciation string can be generated particularly upon inter-word pronunciation conversion, thereby improving quality of a synthesized sound in the text-to-speech synthesis system.
[79] An example of the text-to-speech synthesis system based on the pronunciation conversion using the boundary pause intensity according to an exemplary embodiment of the present invention will now be described.
[80] FIG. 6 is a block diagram of a text-to-speech synthesis system based on pronunciation conversion using boundary pause intensity according to the present invention.
[81] Referring to FIG. 6, the text-to-speech synthesis system according to the present invention comprises a synthesis database processor 610, a training DB generator 620, a pronunciation conversion model generator 630, a pre-processor 640, a language analyzer 650, a feature extractor 660, a pronunciation string generator 670, and a synthesized-sound generator 680.
[82] The synthesis database processor 610 processes the synthesis database 100 to reflect the pronunciation variation phenomenon at a word boundary. Processing the synthesis database has been described in detail with reference to FIG. 2, and thus a description thereof will be omitted.
[83] The training DB generator 620 builds the training DB 200 for the pronunciation conversion model based on the synthesis database 100A processed by the synthesis database processor 610. The training DB generator 620 calculates language information, rhythm information, and allophone information of each phoneme using the utterance list, language tagging information for each sentence in the utterance list, the phoneme-specific labeling data, and the boundary pause intensity tagged data based on the context information in the processed synthesis database 100A.
[84] The feature parameters extracted to build the training DB 200 include context information of spellings at the left and right of a current spelling, previously pronunciation-converted phoneme context information of the current spelling, boundary strength information of the current spelling, syllable-specific distance information of the current spelling, morpheme information, and the like. The feature parameters have been described in detail with reference to FIG. 4, and thus a description thereof will be omitted.
[85] The pronunciation conversion model generator 630 trains the training DB 200 based on the extracted feature parameters to create the pronunciation conversion model 300 and builds the exceptional pronunciation dictionary 400 based on words including contexts frequently decided to be erroneous in the training process. In particular, the pronunciation conversion model generator 630 trains the training DB 200 based on the feature parameters having the highest pronunciation conversion performance depending on the characteristics of the training DB 200 to create the pronunciation conversion model 300. This has been described in detail with reference to FIG. 5, and thus a description thereof will be omitted.
[86] Meanwhile, when a text is input, the pre-processor 640 performs pre-processing including numerical conversion, symbol conversion, and typographical error correction on the input text.
[87] The language analyzer 650 receives the input text from the pre-processor 640, performs language analysis on the input text, and predicts inter-word boundary strength based on the language analysis result.
[88] The feature extractor 660 automatically generates context information for pronunciation conversion using the language analysis result and the predicted inter-word boundary strength information, and extracts feature parameters for pronunciation conversion based on the generated context information.
[89] The pronunciation string generator 670 determines whether the words of the input text are included in the exceptional pronunciation dictionary 400. If the words of the input text are included in the exceptional pronunciation dictionary 400, the pronunciation string generator 670 generates a pronunciation string for the input text based on the exceptional pronunciation dictionary 400.
[90] If the words of the input text are not included in the exceptional pronunciation dictionary 400, the pronunciation string generator 670 automatically generates a pronunciation string for the input text based on the pronunciation conversion model 300.
[91] The synthesized-sound generator 680 generates and outputs a synthesized sound for the pronunciation string generated by the pronunciation string generator 670.
[92] Thus, with the text-to-speech synthesis system of the present invention, the synthesis database 100 is processed to reflect the pronunciation variation phenomenon at the word boundary depending on reading with breath, the speaker-dependent pronunciation conversion model is created based on the processed synthesis database 100, and the feature parameters for pronunciation conversion are extracted based on the language analysis result for an input sentence upon speech synthesis and the inter-word boundary strength information and applied to the pronunciation conversion model in order to automatically generate the pronunciation string. As a result, a sophisticated pronunciation string can be generated particularly upon inter-word pronunciation conversion, thereby improving quality of a synthesized sound in the text-to-speech synthesis system.
[93] Meanwhile, the embodiments of the present invention may be implemented as a program that can be executed on a computer, and may be realized on a general-purpose digital computer that executes the program from a computer-readable recording medium.
[94] While the invention has been shown and described with reference to certain exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

Claims
[1] A method for converting pronunciation using boundary pause intensity, the method comprising the steps of:
(a) processing a synthesis database to reflect a pronunciation variation phenomenon at a word boundary;
(b) extracting feature parameters from the processed synthesis database to build a training DB;
(c) training the training DB based on the extracted feature parameters to create a pronunciation conversion model;
(d) when a text is input, performing pre-processing and language analysis on the input text and predicting inter-word boundary strength for the input text;
(e) extracting feature parameters for pronunciation conversion from the input text; and
(f) generating a pronunciation string for the input text using the extracted feature parameters based on the pronunciation conversion model.
[2] The method according to claim 1, wherein step (a) comprises the steps of: performing pronunciation transfer on the synthesis database to generate a pronunciation string; tagging pause strength to the generated pronunciation string in a syllable unit, a phoneme unit, or a word unit; labeling an allophone of a phoneme in the tagged pronunciation string; and correcting an error in the labeled pronunciation string.
[3] The method according to claim 1, wherein step (b) comprises the steps of: calculating language information, rhythm information, and allophone information of each phoneme based on context information in the processed synthesis database; and extracting feature parameters from the calculated language information, rhythm information, and allophone information of each phoneme and building a training DB based on the extracted feature parameters.
[4] The method according to claim 3, wherein the extracted feature parameters comprise at least one of context information of spellings at the left and right of a current spelling, previously pronunciation-converted phoneme context information of the current spelling, boundary strength information of the current spelling, syllable-specific distance information of the current spelling, and morpheme information.
[5] The method according to claim 1, wherein step (c) comprises: a first step of determining learning model parameters for statistical learning of the training DB; a second step of training the training DB based on the determined learning model parameters; a third step of evaluating pronunciation conversion performance of the determined learning model parameters based on the training result; a fourth step of determining whether the determined learning model parameter is a parameter having the highest pronunciation conversion performance based on the evaluation result of the pronunciation conversion performance; and a fifth step of, when the determined learning model parameter is determined to be a parameter having the highest pronunciation conversion performance, creating the pronunciation conversion model based on the training result.
[6] The method according to claim 5, wherein the third step comprises the step of extracting words frequently causing a pronunciation conversion error to build an exceptional pronunciation dictionary.
[7] The method according to claim 1, further comprising the step of generating the pronunciation string for the input text based on the exceptional pronunciation dictionary when words of the input text are included in the exceptional pronunciation dictionary.
[8] A text-to-speech synthesis system based on pronunciation conversion using boundary pause intensity, the system comprising:
a synthesis database processor for processing a synthesis database to reflect a pronunciation variation phenomenon at a word boundary;
a training DB generator for extracting feature parameters from the processed synthesis database to build a training DB;
a pronunciation conversion model generator for training the training DB based on the extracted feature parameters to create a pronunciation conversion model, and for extracting words frequently causing a pronunciation conversion error in the training process to build an exceptional pronunciation dictionary;
a pre-processor for performing pre-processing on an input text;
a language analyzer for receiving the pre-processed input text and performing language analysis on it to predict inter-word boundary strength based on the language analysis result;
a feature extractor for extracting feature parameters for pronunciation conversion using the language analysis result and the predicted inter-word boundary strength information;
a pronunciation string generator for generating a pronunciation string for the input text based on the pronunciation conversion model or the exceptional pronunciation dictionary; and
a synthesized-sound generator for generating and outputting a synthesized sound for the generated pronunciation string.
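Purely as a structural sketch, the claimed components can be wired into one pipeline object; every class name and stand-in step below is hypothetical rather than the disclosed system.

    # Hypothetical wiring of the claim-8 components into one pipeline.
    class BoundaryPauseTTS:
        def __init__(self, model, exc_dict):
            self.model = model        # pronunciation conversion model
            self.exc_dict = exc_dict  # exceptional pronunciation dictionary

        def speak(self, text):
            words = text.strip().split()                  # pre-processor / language analyzer
            strengths = [2 if i == len(words) - 1 else 1  # boundary-strength stand-in
                         for i in range(len(words))]
            feats = list(zip(words, strengths))           # feature extractor
            pron = [self.exc_dict.get(w, self.model.get(w, w.upper()))
                    for w, _ in feats]                    # pronunciation string generator
            return "<waveform for: " + " ".join(pron) + ">"  # synthesized-sound stand-in

    tts = BoundaryPauseTTS(model={"kan": "K AA N"}, exc_dict={})
    print(tts.speak("kan da"))  # <waveform for: K AA N DA>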
[9] The system according to claim 8, wherein the synthesis database processor performs pronunciation transfer on the synthesis database to generate the pronunciation string, tags pause strength to the generated pronunciation string in a syllable unit, a phoneme unit, or a word unit, labels an allophone of a phoneme in the tagged pronunciation string, and corrects an error in the labeled pronunciation string.
[10] The system according to claim 8, wherein the training DB generator extracts from the processed synthesis database at least one feature parameter of context information of spellings at the left and right of a current spelling, previously pronunciation-converted phoneme context information of the current spelling, boundary strength information of the current spelling, syllable-specific distance information of the current spelling, and morpheme information, and generates the training DB based on the extracted feature parameter.
[11] The system according to claim 10, wherein the pronunciation conversion model generator trains the training DB based on one of the feature parameters having the highest pronunciation conversion performance to create the pronunciation conversion model.
[12] The system according to claim 8, wherein the pronunciation string generator generates the pronunciation string for the input text based on the exceptional pronunciation dictionary when words of the input text are included in the exceptional pronunciation dictionary, and based on the pronunciation conversion model when the words of the input text are not included in the exceptional pronunciation dictionary.
PCT/KR2007/004699 (priority date 2006-09-29, filing date 2007-09-27): Method for converting pronunciation using boundary pause intensity and text-to-speech synthesis system based on the same; published as WO2008038994A1 (en)

Applications Claiming Priority (2)

KR1020060096296A (priority date 2006-09-29, filing date 2006-09-29): The method for converting pronunciation using boundary pause intensity and text-to-speech synthesis system based on the same
KR10-2006-0096296 (priority date 2006-09-29)

Publications (1)

WO2008038994A1 (en), published 2008-04-03

Family

ID=39230372

Family Applications (1)

PCT/KR2007/004699 / WO2008038994A1 (priority date 2006-09-29, filing date 2007-09-27): Method for converting pronunciation using boundary pause intensity and text-to-speech synthesis system based on the same

Country Status (2)

KR (1): KR20080030338A (en)
WO (1): WO2008038994A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
CN103581857A * (priority date 2013-11-05, publication date 2014-02-12), Huawei Device Co., Ltd.: Method for giving voice prompt, text-to-speech server and terminals

Families Citing this family (1)

* Cited by examiner, † Cited by third party
CN103856626A * (priority date 2012-11-29, publication date 2014-06-11), Beijing Qianxiang Wangjing Technology Development Co., Ltd.: Customization method and device of individual voice

Patent Citations (4)

* Cited by examiner, † Cited by third party
KR20000055673A * (priority date 1999-02-09, publication date 2000-09-15), Yun Jong-yong: Method and apparatus for prosodic phrasing for speech synthesis
US6823309B1 * (priority date 1999-03-25, publication date 2004-11-23), Matsushita Electric Industrial Co., Ltd.: Speech synthesizing system and method for modifying prosody based on match to database
JP2001005479A * (priority date 1999-06-23, publication date 2001-01-12), Ricoh Co., Ltd.: Voice output device
KR20050123007A * (priority date 2004-06-26, publication date 2005-12-29), Pusan National University Industry-Academic Cooperation Foundation: A system for generating technique for generating korean phonetic alphabet

Also Published As

KR20080030338 (en), published 2008-04-04

Similar Documents

Publication Title
US8566099B2 (en) Tabulating triphone sequences by 5-phoneme contexts for speech synthesis
US6751592B1 (en) Speech synthesizing apparatus, and recording medium that stores text-to-speech conversion program and can be read mechanically
US7711562B1 (en) System and method for testing a TTS voice
US6029132A (en) Method for letter-to-sound in text-to-speech synthesis
US7630898B1 (en) System and method for preparing a pronunciation dictionary for a text-to-speech voice
JP4038211B2 (en) Speech synthesis apparatus, speech synthesis method, and speech synthesis system
US7742921B1 (en) System and method for correcting errors when generating a TTS voice
JPH0922297A (en) Method and apparatus for voice-to-text conversion
Ekpenyong et al. Statistical parametric speech synthesis for Ibibio
Chen et al. The USTC system for Blizzard Challenge 2011
Xydas et al. The DEMOSTHeNES speech composer
JP5180800B2 (en) Recording medium for storing statistical pronunciation variation model, automatic speech recognition system, and computer program
Pradhan et al. Building speech synthesis systems for Indian languages
KR20010018064A (en) Apparatus and method for text-to-speech conversion using phonetic environment and intervening pause duration
WO2008038994A1 (en) Method for converting pronunciation using boundary pause intensity and text-to-speech synthesis system based on the same
JP2004226505A (en) Pitch pattern generating method, and method, system, and program for speech synthesis
JP3589972B2 (en) Speech synthesizer
Pucher et al. Resources for speech synthesis of Viennese varieties
Demenko et al. Prosody annotation for unit selection TTS synthesis
Kempton et al. Corpus phonetics for under-documented languages: a vowel harmony example
JP2024017194A (en) Speech synthesis device, speech synthesis method and program
Toma et al. Automatic rule-based syllabication for Romanian
Hirst Empirical models of tone, rhythm and intonation for the analysis of speech prosody
Khalil et al. Optimization of Arabic database and an implementation for Arabic speech synthesis system using HMM: HTS_ARAB_TALK
CN117711370A (en) Phonetic notation method and device, voice synthesis system, storage medium and electronic equipment

Legal Events

121 (EP): the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 07808457; Country of ref document: EP; Kind code of ref document: A1)
NENP: non-entry into the national phase (Ref country code: DE)
122 (EP): PCT application non-entry in European phase (Ref document number: 07808457; Country of ref document: EP; Kind code of ref document: A1)