WO2008056590A1 - Text-to-speech synthesis device, program and text-to-speech synthesis method - Google Patents

Text-to-speech synthesis device, program and text-to-speech synthesis method Download PDF

Info

Publication number
WO2008056590A1
WO2008056590A1 PCT/JP2007/071285 JP2007071285W
Authority
WO
WIPO (PCT)
Prior art keywords
candidate
text
sound quality
quality evaluation
evaluation score
Prior art date
Application number
PCT/JP2007/071285
Other languages
French (fr)
Japanese (ja)
Inventor
Reishi Kondo
Shinichi Doi
Original Assignee
Nec Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nec Corporation filed Critical Nec Corporation
Priority to JP2008543045A priority Critical patent/JPWO2008056590A1/en
Publication of WO2008056590A1 publication Critical patent/WO2008056590A1/en

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/06Elementary speech units used in speech synthesisers; Concatenation rules

Definitions

  • the present invention relates to a text-to-speech synthesis technology, and more particularly to a text-to-speech synthesis technology having a function of arranging input text within a range that does not impair the meaning of the text and making it a synthesized speech.
  • Text-to-speech synthesis technology generates speech that reads out the content of given text, which in the case of Japanese is mainly text composed of kanji and kana.
  • A widely used method first estimates the reading of the input text to generate a phonetic symbol string, and then generates speech for that string.
  • To generate speech from phonetic symbol strings, a waveform concatenation method that cuts and pastes pre-recorded natural utterances, or a parametric method that generates a time series of parameters representing acoustic features and then generates speech from those parameters, is typically used.
  • Japanese Patent Application Laid-Open No. 2004-258406 discloses a text-to-speech synthesis system that can generate a natural prosodic pattern (phonetic symbol string) even if dependency analysis is incorrect.
  • Japanese Patent Laid-Open No. 2004-138728 discloses a speech synthesizer in which a basic pitch pattern, one form of prosodic information, is translated in the frequency direction while its outline is maintained to obtain modified pitch patterns; a speech unit sequence matching the modified pitch pattern and the linguistic information is selected; and the modified pitch pattern that best matches the selected speech unit sequence is combined with that unit sequence to produce the synthesized speech. By providing prosodic information, speech units, and prosody control means in this way, the pitch deformation amount of the selected speech units is suppressed, and synthesized speech without sudden changes in sound quality can be created.
  • Patent Document 1 Japanese Patent Application Laid-Open No. 2004-258406
  • Patent Document 2 Japanese Patent Application Laid-Open No. 2004-138728
  • Non-Patent Document 1 Shikano, Ito, Kawara, Takeda, Yamamoto “IT Text Speech Recognition System”, Ohm, 2001, p. 13-14
  • Patent Documents 1 and 2 and Non-Patent Document 1 described above are incorporated herein by reference.
  • the following is an analysis of the related art according to the present invention.
  • Many of the above-described conventional techniques mainly aim at generating natural-sounding speech, and have the problem that the resulting synthesized speech is not necessarily beautiful.
  • the present invention has been made paying attention to this point, and an object of the present invention is to provide a text-to-speech synthesis technique capable of improving the quality of synthesized speech to be generated.
  • According to a first aspect of the present invention, there are provided a text-to-speech synthesizer comprising: a text candidate generation unit that analyzes input text and generates a plurality of candidate texts having synonymous expressions; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each of the candidate texts; a selection unit that selects the candidate text having the best sound quality evaluation score from the plurality of candidate texts; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate text; a program for realizing the text-to-speech synthesizer; and a text-to-speech synthesis method implemented using the text-to-speech synthesizer.
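The overall flow of this aspect — generate synonymous candidates, score each, select the best, synthesize — can be sketched as follows. This is a minimal illustration with hypothetical stand-in functions, not the patented implementation:

```python
def synthesize_best(text, generate_candidates, score, synthesize):
    """Generate synonymous candidate texts, score each for expected
    sound quality, and synthesize speech for the best-scoring one."""
    candidates = generate_candidates(text)   # text candidate generation unit
    best = max(candidates, key=score)        # score calculation + selection units
    return synthesize(best)                  # speech synthesis unit

# Toy stand-ins: two fixed candidates, a score that favors shorter text,
# and a synthesizer that just tags its input.
result = synthesize_best(
    "start quickly",
    generate_candidates=lambda t: [t, "commence rapidly"],
    score=lambda t: -len(t),
    synthesize=lambda t: f"<speech:{t}>",
)
```

The real units would share state (e.g. morphological analyses, segment sequences), but the selection itself is just an argmax over the score.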
  • According to another aspect of the present invention, there are provided a text-to-speech synthesizer comprising: a morpheme analysis candidate generation unit that calculates a plurality of candidate morpheme analysis results from input text; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate morpheme analysis result; a selection unit that selects the candidate morpheme analysis result having the best sound quality evaluation score from the plurality of candidates; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate morpheme analysis result; a program for realizing the text-to-speech synthesizer; and a text-to-speech synthesis method implemented using the text-to-speech synthesizer.
  • According to another aspect of the present invention, there are provided a text-to-speech synthesizer comprising: a reading generation unit that generates a reading result from input text; a pause insertion unit that generates a plurality of candidate reading results with pauses by expanding all combinations of insertion/non-insertion of a pause into the reading result; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate reading result with pauses; a selection unit that selects, from the plurality of candidates, the candidate reading result with pauses having the best sound quality evaluation score; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate; a program for realizing the text-to-speech synthesizer; and a text-to-speech synthesis method implemented using the text-to-speech synthesizer.
  • According to another aspect of the present invention, there are provided a text-to-speech synthesizer comprising: a reading generation unit that generates a reading result from input text; a reading deformation unit that generates a plurality of candidate reading results by expanding all combinations of replacement/non-replacement of specific sounds in the reading result with other sounds; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate reading result; a selection unit that selects the candidate reading result having the best sound quality evaluation score from the plurality of candidates; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate reading result; a program for realizing the text-to-speech synthesizer; and a text-to-speech synthesis method implemented using the text-to-speech synthesizer.
  • According to the present invention, the quality of synthesized speech can be dramatically improved.
  • the reason is that the candidate having the best sound quality evaluation score is selected from among a plurality of candidates created from the input text.
  • FIG. 1 is a block diagram showing a configuration of a text-to-speech synthesis system according to a first embodiment of the present invention.
  • FIG. 2 is a diagram showing entries stored in the synonym dictionary of the text-to-speech synthesis system according to the first embodiment of the present invention.
  • FIG. 3 is a diagram for explaining the operation of a text candidate generation unit of the text-to-speech synthesis system according to the first embodiment of the present invention.
  • FIG. 6 is a block diagram showing the configuration of the text-to-speech synthesis system according to the fifth embodiment of the present invention.
  • FIG. 7 is a diagram showing entries stored in the personality (feature) dictionary of the text-to-speech synthesis system according to the fifth embodiment of the present invention.
  • FIG. 8 is a diagram for explaining the operation of a text candidate generation unit of a text-to-speech synthesis system according to a fifth embodiment of the present invention.
  • FIG. 9 is a block diagram showing the configuration of the text-to-speech synthesis system according to the sixth embodiment of the present invention.
  • FIG. 10 is a diagram showing entries stored in the filler dictionary of the text-to-speech synthesis system according to the sixth embodiment of the present invention.
  • FIG. 11 is a diagram for explaining the operation of the text candidate generation unit of the text speech synthesis system according to the sixth embodiment of the present invention.
  • FIG. 13 is a block diagram showing the configuration of the text-to-speech synthesis system according to the eleventh embodiment of the present invention.
  • FIG. 15 is a diagram for explaining a method for calculating a sound quality evaluation score in the text-to-speech synthesis system according to the thirteenth embodiment of the present invention.
  • FIG. 16 is a diagram for explaining a method for calculating a sound quality evaluation score in the text-to-speech synthesis system according to the fourteenth embodiment of the present invention.
  • According to the present invention, it is possible to obtain a text-to-speech synthesizer comprising: a text candidate generation unit that analyzes input text and generates a plurality of candidate texts having synonymous expressions; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate text; a selection unit that selects the candidate text having the best sound quality evaluation score from the plurality of candidate texts; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate text.
  • The text-to-speech synthesizer may further include a synonym dictionary in which synonyms are stored in association with each other in units of words or phrases. The text candidate generation unit retrieves from the synonym dictionary the synonyms corresponding to words or phrases in the input text, and generates a plurality of candidate texts by expanding all combinations of substitution/non-substitution at the word or phrase level.
  • The text-to-speech synthesizer may further include a synonymous expression dictionary that stores, for a given phrase expression, phrase expressions having the same meaning. The text candidate generation unit searches the synonymous expression dictionary for expressions corresponding to phrase expressions in the input text, and generates a plurality of candidate texts by expanding all combinations of substitution/non-substitution at the phrase expression level.
  • The text candidate generation unit can also be operated to generate the plurality of candidate texts by modifying the sentence structure of the input text, under the condition that the meaning of the input text is not changed.
  • The text candidate generation unit can also be operated to generate a plurality of candidate texts by expanding all combinations of changing/not changing the politeness level of words or phrases in the input text.
  • The text-to-speech synthesizer may further include a personality feature dictionary that stores individual-level differences in expression in association with each other. The text candidate generation unit searches the personality feature dictionary for alternative expressions corresponding to expressions in the input text, and generates a plurality of candidate texts by expanding all combinations of substitution/non-substitution of these individual-level expression differences.
  • The text-to-speech synthesizer may further include a filler dictionary describing fillers and interjections. The text candidate generation unit refers to the filler dictionary and generates a plurality of candidate texts by expanding all combinations of inserting/not inserting a filler or interjection between words of the input text.
  • Alternatively, the text candidate generation unit may refer to the filler dictionary and generate a plurality of candidate texts by expanding all combinations of deleting/not deleting fillers or interjections found in the input text.
  • The text candidate generation unit is also operable to generate the plurality of candidate texts by changing some or all of the words or phrases in the input text into repeated expressions.
  • The text candidate generation unit can be operated so that, among the expanded results, only expressions that rhyme are used as candidate texts.
  • The text candidate generation unit can be operated so that, among the expanded results, only expressions in which every accent phrase consists of 5 or 7 morae are used as candidate texts.
  • According to the present invention, it is also possible to obtain a text-to-speech synthesizer comprising: a morpheme analysis candidate generation unit that calculates a plurality of candidate morpheme analysis results from input text; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate morpheme analysis result; a selection unit that selects the candidate morpheme analysis result having the best sound quality evaluation score from the plurality of candidates; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate morpheme analysis result.
  • According to the present invention, it is also possible to obtain a text-to-speech synthesizer comprising: a reading generation unit that generates a reading result from input text; a pause insertion unit that generates a plurality of candidate reading results with pauses by expanding all combinations of insertion/non-insertion of a pause into the reading result; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate reading result with pauses; a selection unit that selects, from the plurality of candidates, the candidate reading result with pauses having the best sound quality evaluation score; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate.
  • According to the present invention, it is also possible to obtain a text-to-speech synthesizer comprising: a reading generation unit that generates a reading result from input text; a reading deformation unit that generates a plurality of candidate reading results by expanding all combinations of replacement/non-replacement of specific sounds in the reading result with other sounds; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate reading result; a selection unit that selects the candidate reading result having the best sound quality evaluation score from the plurality of candidates; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate reading result.
  • As the sound quality evaluation score, a score representing the smoothness of the pitch pattern of the synthesized speech can be used.
  • As the sound quality evaluation score, a score representing the difference between an estimated pitch pattern and the pitch pattern of the synthesized speech can also be used.
  • As the sound quality evaluation score, a score representing the difference between an estimated rhythm and the rhythm of the synthesized speech can also be used.
  • According to the present invention, a text-to-speech synthesis method and a program for carrying out the method can also be obtained.
  • FIG. 1 is a block diagram showing the configuration of the text-to-speech synthesis system according to the first embodiment of the present invention.
  • the text-to-speech synthesis system according to the present embodiment includes a text input unit 101, a text candidate generation unit 102, a synonym dictionary 201, a sound quality evaluation score calculation unit 103, and a selection unit 104. , And a voice synthesizer 105.
  • the text input unit 101 is means for inputting text.
  • The present invention can be applied to any language; in this embodiment, however, the description assumes Japanese text written in mixed kanji and kana.
  • the text candidate generation unit 102 is means for generating a plurality of candidate texts by referring to the synonym dictionary 201 and transforming the text into another expression that does not change the meaning.
  • FIG. 2 is a diagram showing entries stored in the synonym dictionary 201, in which synonyms are stored in association with each other in units of words or phrases.
  • The text candidate generation unit 102 morphologically analyzes the text as shown in FIG. 3, searches the synonym dictionary 201 for each morpheme, and, if a matching entry exists, marks the corresponding morpheme as replaceable with the corresponding synonym.
  • The text candidate generation unit 102 then generates a plurality of candidate texts by expanding all combinations of replacing/not replacing each replaceable morpheme of the text.
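The substitution/non-substitution expansion over marked morphemes can be sketched as a Cartesian product. The dictionary contents and morpheme segmentation below are illustrative stand-ins, not entries from the patent's synonym dictionary 201:

```python
from itertools import product

# Hypothetical synonym dictionary: morpheme -> registered synonyms.
SYNONYMS = {"start": ["begin", "commence"], "quickly": ["rapidly"]}

def candidate_texts(morphemes):
    """For each morpheme, the options are the morpheme itself plus any
    registered synonyms; every combination of per-morpheme choices
    yields one candidate text (cf. FIG. 3)."""
    options = [[m] + SYNONYMS.get(m, []) for m in morphemes]
    return [" ".join(choice) for choice in product(*options)]

cands = candidate_texts(["we", "start", "quickly"])
# 1 x 3 x 2 = 6 candidate texts
```

Note the candidate count is the product of the option counts, so it grows exponentially with the number of replaceable morphemes — which is why the embodiment later discusses memory-limited variants.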
  • the sound quality evaluation score calculation unit 103 is a means for calculating a sound quality evaluation score for each of the plurality of candidate texts, and details thereof will be described later.
  • the selection unit 104 is a means for selecting one candidate text having the best corresponding sound quality evaluation score from the plurality of candidate texts.
  • the speech synthesizer 105 is means for generating synthesized speech corresponding to the selected candidate text.
  • the speech synthesizer 105 is assumed to be a waveform editing type text speech synthesizer that generates synthesized speech by connecting and editing prerecorded natural speech.
  • the sound quality evaluation method in the text-to-speech synthesis system according to the present embodiment will be described in detail.
  • The sound quality evaluation score is an objective evaluation value of the speech quality of the synthesized sound that results from synthesizing the candidate text.
  • Various scores used within the speech synthesis process can serve as approximations of this speech quality evaluation value; in the present invention these approximations are also referred to as sound quality evaluation scores.
  • In the following, the case where the unit connection score corresponding to the segment sequence selected by the speech synthesis unit is used as the sound quality evaluation score will be explained.
  • The unit connection score d(i) is the negation of the sum of the weighted Euclidean distances of the 10-dimensional cepstra at each unit connection boundary; writing c_j(k) and c'_j(k) for the k-th cepstrum coefficients on either side of the j-th boundary, d(i) = -Σ_j sqrt( Σ_{k=1..10} a(k)·(c_j(k) - c'_j(k))² ).
  • Here a(k) is a weight given in advance.
  • the unit connection score is calculated by the sound quality evaluation score calculation unit 103 and the speech synthesis unit 105 by the same method.
  • This unit connection score d(i) can be regarded as a score representing the smoothness of the spectrum across the segment boundaries of the synthesized speech: the smaller the cepstral difference at each unit connection point, the higher d(i) becomes, and the higher the resulting sound quality.
  • the weighted Euclidean distance of the 10-dimensional cepstrum is used as the unit connection score, but the number of dimensions can be adjusted as appropriate.
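As a sketch, the negated sum of weighted Euclidean cepstral distances over unit boundaries might be computed as follows. The vectors and weights are illustrative; in a real system each unit would carry boundary cepstra extracted from the recorded speech:

```python
import math

def unit_connection_score(boundary_pairs, weights):
    """d(i): negated sum, over all unit connection boundaries, of the
    weighted Euclidean distance between the cepstra on either side of
    each boundary. A smoother join (smaller cepstral jump) scores higher."""
    total = 0.0
    for left, right in boundary_pairs:
        total += math.sqrt(sum(w * (a - b) ** 2
                               for w, a, b in zip(weights, left, right)))
    return -total

# One boundary in 3 dimensions (10 in the text), unit weights.
w = [1.0, 1.0, 1.0]
smooth = unit_connection_score([([1, 2, 3], [1, 2, 3])], w)  # identical cepstra
rough = unit_connection_score([([1, 2, 3], [4, 2, 3])], w)   # spectral jump
```

With identical boundary cepstra the score is 0 (the maximum); any spectral discontinuity makes it negative, so maximizing the score prefers the smoothest segment sequence.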
  • Instead of the cepstrum, the Δcepstrum (the time-direction differential of the cepstrum), the MFCC (Mel-Frequency Cepstrum Coefficient; see Non-Patent Document 1), or the ΔMFCC (the time-direction differential of the MFCC) can be used with an appropriate number of dimensions, and a vector combining several of these values can of course also be used.
  • When synthesizing speech for a given text, the waveform-editing text-to-speech synthesizer can choose among a plurality of natural-speech segment sequences, and it operates to select the segment sequence that maximizes the unit connection score.
  • As a result, synthesized speech that has the same content as the given text and the best sound quality evaluation score, as approximated by the unit connection score used in the speech synthesis unit, can be generated.
  • the text candidate generation unit 102 can be integrated with a text analysis process that forms part of the speech synthesis unit 105.
  • In that case, the morphological analysis results coincide, improving accuracy and reducing the amount of computation.
  • the selection unit 104 outputs candidate text, and the speech synthesis unit 105 synthesizes speech by using the candidate text as input.
  • Until the sound quality evaluation score is generated, the operations of the text candidate generation unit 102 and the speech synthesis unit 105 overlap.
  • The overlapping part can therefore be performed by the text candidate generation unit 102 alone, with its intermediate result output as speech synthesis data.
  • In that case, the selection unit 104 outputs the speech synthesis data (generated by the text candidate generation unit 102) corresponding to the selected candidate text, and the speech synthesis unit 105 synthesizes speech from the selected data. As a result, no operation is duplicated and the amount of computation can be reduced.
  • The text candidate generation unit 102 obtains all possible combinations of replaceability for each morpheme of the text, assuming there are no memory restrictions (FIG. 3). If memory is limited, it may instead obtain only as many combinations as fit within the restriction, or generate text candidates one at a time while retaining only those with high sound quality evaluation scores, and select the text from among those.
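The one-at-a-time strategy under memory limits can be sketched with a bounded heap. The keep size and the scoring function here are illustrative choices, not values from the patent:

```python
import heapq

def top_candidates(candidate_iter, score, keep=5):
    """Consume candidate texts one by one, retaining only the `keep`
    best by sound quality score, so the full combination set is never
    materialized in memory."""
    heap = []  # min-heap of (score, tiebreak, text); worst kept item on top
    for i, text in enumerate(candidate_iter):
        entry = (score(text), i, text)
        if len(heap) < keep:
            heapq.heappush(heap, entry)
        else:
            heapq.heappushpop(heap, entry)  # drop the worst of keep+1
    return [t for _, _, t in sorted(heap, reverse=True)]

# Toy run: score = length, keep the two longest "candidates".
best = top_candidates(iter(["a", "abc", "ab", "abcd"]), score=len, keep=2)
```

Memory use is O(keep) regardless of how many combinations the expansion would produce.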
  • FIG. 5 is a block diagram showing the configuration of the text-to-speech synthesis system according to the second embodiment of the present invention.
  • the items described in the first embodiment are omitted.
  • The text candidate generation unit 102 of the present embodiment generates candidate texts by transforming the grammar of the input text into synonymous expressions, based on synonymous-expression transformation rules described in advance in the text candidate generation unit 102. Specific examples are described below.
  • For example, the text candidate generation unit 102 replaces the pair of a noun such as "transformation" and the auxiliary verb "do" with the corresponding verb phrase, e.g. "transform."
  • Thus, when the input text is "perform word replacement," the text candidate generation unit 102 also adopts the verb-phrase form "replace the word" as a candidate text.
  • The text candidate generation unit 102 applies the transformation rules by simple string matching; accuracy can, however, be improved by performing morphological analysis and confirming the part of speech of the matched expression.
  • The text candidate generation unit 102 of this embodiment generates expanded candidate texts by transforming the sentence structure of the input text.
  • The first type of sentence-structure transformation in the text candidate generation unit 102 is transformation into an inversion/emphasis construction. For example, if the input text is "A to B," the text candidate generation unit 102 also adopts the inverted/emphasized form "B is A" as a candidate text.
  • A second type of sentence-structure transformation in the text candidate generation unit 102 is transformation into a noun-ending (taigen-dome) construction. For example, when the input text is "I am the one who posted," the text candidate generation unit 102 also adopts the noun-ending variant of the sentence as a candidate text.
  • A third type of sentence-structure transformation in the text candidate generation unit 102 is transformation by reordering particle phrases (scrambling). For example, if the input text is "I read a book with my brother," the text candidate generation unit 102 also adopts as candidate texts the variants obtained by permuting the particle phrases, such as "With my brother, I read a book" and "A book, I read with my brother."
  • A fourth type of sentence-structure transformation in the text candidate generation unit 102 is transformation by exchanging a noun form and a verb form. For example, the text candidate generation unit 102 also adopts the variant obtained by this exchange (e.g. "perform word replacement" versus "replace the word") as a candidate text.
  • A fifth type of sentence-structure transformation in the text candidate generation unit 102 is transformation into the passive voice. For example, the text candidate generation unit 102 also adopts the passive form, e.g. "She was called by him," as a candidate text.
  • After a plurality of candidate texts has been generated in this way, sound quality evaluation and selection of the best candidate are performed by the same processing as in the first embodiment.
  • The text candidate generation unit 102 of this embodiment generates expanded candidate texts by transforming the politeness level of the input text.
  • A specific example of this transformation follows.
  • When the input text is "I bought a book," the text candidate generation unit 102 also adopts a more polite rendering of the same sentence as a candidate text (in Japanese, a distinction carried mainly by the verb ending).
  • FIG. 6 is a block diagram showing a configuration of a text-to-speech synthesis system according to the fifth embodiment of the present invention, and includes a personality (feature) dictionary 202.
  • the items described in the first embodiment are omitted.
  • The text candidate generation unit 102 of the present embodiment refers to the personality (feature) dictionary 202 and generates a plurality of candidate texts by rewriting the input text into individually characteristic expressions.
  • FIG. 7 shows entries stored in the personality (feature) dictionary 202; for example, "ketta," a term used in the Nagoya region for "bicycle," is stored as a replaceable vocabulary item for the entry "bicycle."
  • The text candidate generation unit 102 performs morphological analysis on the text, searches the personality (feature) dictionary 202 for each morpheme, and, if a matching entry exists, marks the corresponding morpheme as replaceable with the associated individual-level morpheme (replaceable vocabulary).
  • FIG. 9 is a block diagram showing the configuration of the text-to-speech synthesis system according to the sixth embodiment of the present invention, which includes a filler dictionary 203.
  • the items described in the first embodiment will be omitted.
  • The text candidate generation unit 102 of this embodiment refers to the filler dictionary 203 and generates a plurality of candidate texts by expanding all combinations of inserting or not inserting fillers between words of the input text.
  • FIG. 10 is a diagram showing entries stored in the filler dictionary 203.
  • Fillers such as "ēto" and "gohon gohon" are so-called hesitation expressions, words used interjectionally in conversation, and such words are stored in the dictionary in association with each other. In the case of English, for example, "you know," "I mean," and "well" would be stored.
  • The text candidate generation unit 102 performs morphological analysis on the text as shown in FIG. 11 and expands all possibilities of inserting the contents of the filler dictionary 203 between morphemes. As shown in the lower part of FIG. 11, for a morpheme that matches an entry in the filler dictionary 203, the unit also expands the combinations of replacing it with other fillers or deleting it.
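The insertion/non-insertion expansion at each morpheme gap can be sketched as follows. The filler inventory and the word segmentation are illustrative English stand-ins for the dictionary of FIG. 10:

```python
from itertools import product

FILLERS = ["you know", "I mean"]  # stand-in filler dictionary entries

def filler_candidates(morphemes, fillers=FILLERS):
    """At every gap between adjacent morphemes, either insert one filler
    or insert nothing; every combination of per-gap choices yields one
    candidate text."""
    gaps = len(morphemes) - 1
    results = []
    for choices in product([None] + fillers, repeat=gaps):
        parts = [morphemes[0]]
        for filler, morpheme in zip(choices, morphemes[1:]):
            if filler is not None:
                parts.append(filler)
            parts.append(morpheme)
        results.append(" ".join(parts))
    return results

cands = filler_candidates(["it", "really", "works"])
# 2 gaps x 3 choices per gap -> 9 candidate texts
```

The replacement/deletion expansion for morphemes that already match filler entries would add further per-morpheme options in the same combinatorial fashion.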
  • the text candidate generation unit 102 generates expanded candidate text by repeating a phrase in the input text a plurality of times.
  • A specific example of this repetition transformation follows.
  • the maximum number of repetitions and the total number of repetitions for the entire sentence can be set in the text candidate generation unit 102 in advance.
  • The text candidate generation unit 102 tests whether each expanded result rhymes, excludes those that do not, and outputs only the rhyming results as candidate texts.
  • the sound quality evaluation and the selection of the best candidate are performed by the same processing as in the first embodiment.
  • This reduces the load of the sound quality evaluation process and has the effect of preventing the generation of unnatural candidate texts.
  • Here a test based on one type of rhyme is exemplified, but the same kind of test can be performed for other types, such as alliteration.
  • As an example, suppose the synonym dictionary contains synonym pairs covering "communication network"/"contact network" and "was filled with surprise"/"was astounded," and the text "The boring contact network was filled with surprise" is input, yielding four candidate texts.
  • Suppose the number of morae per phrase of each candidate text is: (1) 5+7+5+5 morae, (2) 5+8+5+5 morae, (3) 5+7+5+5 morae, and (4) 5+8+5+5 morae.
  • Since only candidates (1) "The boring communication network was filled with surprise" and (3) "The boring contact network was astounded" consist entirely of phrases of 5 or 7 morae, the text candidate generation unit 102 outputs these two as candidate texts.
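The 5-or-7-mora filter over the four candidates can be sketched directly from the per-phrase mora counts given above (the counts are reproduced from the example; the filtering rule is the only added logic):

```python
def all_phrases_5_or_7(mora_counts):
    """A candidate survives only if every accent phrase has 5 or 7 morae."""
    return all(n in (5, 7) for n in mora_counts)

# Per-phrase mora counts of candidates (1)-(4) from the example.
candidates = {1: [5, 7, 5, 5], 2: [5, 8, 5, 5],
              3: [5, 7, 5, 5], 4: [5, 8, 5, 5]}
kept = sorted(k for k, counts in candidates.items()
              if all_phrases_5_or_7(counts))
# -> candidates (1) and (3) survive, matching the text
```

A real implementation would need a Japanese mora counter over each accent phrase; here the counts are supplied as data.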
  • the sound quality evaluation and the selection of the best candidate are performed by the same processing as in the first embodiment.
  • FIG. 12 is a block diagram showing the configuration of the text-to-speech synthesis system according to the tenth embodiment of the present invention.
  • the items described in the second embodiment are omitted.
  • The text-to-speech synthesis system of this embodiment differs from the second embodiment in that the text candidate generation unit 102 is replaced by a morpheme analysis candidate generation unit 106 that performs morphological analysis on the input text.
  • The morpheme analysis candidate generation unit 106 outputs a plurality of morpheme analysis candidates, and the sound quality evaluation score calculation unit 103 generates the reading of each candidate and assigns it a sound quality evaluation score.
  • for an input such as "Tokyo ni shijō ga aru" ("There is a market in Tokyo"), the morphological analysis candidate generation unit 106 performs morphological analysis and outputs four candidates: "Tokyo (noun) / ni (particle) / market (shijō, noun) / ga (particle) / aru (verb)", "higashi (noun) / Kyoto (kyōto, noun) / ni (particle) / market (shijō, noun) / ga (particle) / aru (verb)", "Tokyo (noun) / ni (particle) / market (ichiba, noun) / ga (particle) / aru (verb)", and "higashi (noun) / Kyoto (kyōto, noun) / ni (particle) / market (ichiba, noun) / ga (particle) / aru (verb)".
  • the sound quality evaluation score calculation unit 103 generates the readings "tōkyō-ni / shijō-ga / aru", "higashi-kyōto-ni / shijō-ga / aru", "tōkyō-ni / ichiba-ga / aru", and "higashi-kyōto-ni / ichiba-ga / aru", and calculates the sound quality evaluation score corresponding to each.
  • FIG. 13 is a block diagram showing the configuration of the text-to-speech synthesis system according to the eleventh embodiment of the present invention.
  • descriptions of the items already covered in the second embodiment are omitted.
  • the text-to-speech synthesis system differs from the second embodiment in that the text candidate generation unit 102 is replaced with a reading generation unit 107 that generates a reading result for the input text, and
  • a pause insertion unit 108 that generates a plurality of candidate reading results with pauses by estimating the positions where a pause can be inserted and expanding the insertion/non-insertion combinations.
  • the sound quality evaluation score calculation unit 103 operates to obtain a sound quality evaluation score for each candidate reading result with pauses.
  • for an input such as "I am going to the bookstore", the reading generation unit 107 generates the reading result "watashi-wa / hon'ya-e / iku".
  • the pause insertion unit 108 assumes that a pause (denoted P) can be inserted at each accent phrase boundary indicated by "/", and generates four candidate reading results with pauses: "watashi-wa / hon'ya-e / iku", "watashi-wa P hon'ya-e / iku", "watashi-wa / hon'ya-e P iku", and "watashi-wa P hon'ya-e P iku".
  • the sound quality evaluation score calculation unit 103 calculates a sound quality evaluation score corresponding to each.
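The expansion of pause insertion/non-insertion at accent phrase boundaries can be sketched as follows (a minimal illustration; the romanized reading and the marker "P" follow the example above, and the function name is hypothetical):

```python
from itertools import product

def expand_pauses(accent_phrases):
    """Expand insertion/non-insertion of a pause 'P' at every accent
    phrase boundary ('/'), yielding 2**(n-1) candidates for n phrases."""
    n_boundaries = len(accent_phrases) - 1
    for flags in product((False, True), repeat=n_boundaries):
        out = [accent_phrases[0]]
        for phrase, pause in zip(accent_phrases[1:], flags):
            out.append("P" if pause else "/")
            out.append(phrase)
        yield " ".join(out)

for cand in expand_pauses(["watashi-wa", "hon'ya-e", "iku"]):
    print(cand)
```

With two boundaries this produces the four candidates of the example; each would then receive its own sound quality evaluation score.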
  • because the acoustic environment before and after an accent phrase boundary changes depending on whether a pause is inserted, the combination of pause insertions that can be uttered with the best sound quality can be selected.
  • since the sound quality score calculation reflects the pause insertion tendencies of the original utterances that contain the segments making up the synthesized speech, the individuality of the speaker of the original utterances is also expressed.
  • FIG. 14 is a block diagram showing the configuration of the text-to-speech synthesis system according to the twelfth embodiment of the present invention.
  • descriptions of the items already covered in the second embodiment are omitted.
  • the text-to-speech synthesis system differs from the second embodiment in that the text candidate generation unit 102 is replaced with a reading generation unit 107 that generates a reading result for the input text, and
  • a reading deformation unit 109 that generates a plurality of candidate reading results by expanding the reading result, changing its phonemes according to rules given in advance.
  • the sound quality evaluation score calculation unit 103 operates to obtain a sound quality evaluation score for each candidate reading result.
  • the reading generation unit 107 assigns and outputs the reading result "shigoto-o / hidoku / hikitsuida" for the input text.
  • the reading result contains two occurrences of the phoneme "hi", so the reading deformation unit 109 expands the replacement/non-replacement of each with "shi" and
  • generates the four candidate reading results "shigoto-o / hidoku / hikitsuida", "shigoto-o / shidoku / hikitsuida", "shigoto-o / hidoku / shikitsuida", and "shigoto-o / shidoku / shikitsuida".
  • the sound quality evaluation score calculator 103 calculates a sound quality evaluation score corresponding to each.
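The replacement/non-replacement expansion can be sketched at the mora level as follows (the mora segmentation and the function name are illustrative assumptions):

```python
from itertools import product

def expand_phoneme(moras, src="hi", dst="shi"):
    """Expand replacement/non-replacement of each mora equal to `src`
    with `dst`, yielding 2**k candidate readings for k occurrences."""
    idx = [i for i, m in enumerate(moras) if m == src]
    for flags in product((src, dst), repeat=len(idx)):
        out = list(moras)
        for i, rep in zip(idx, flags):
            out[i] = rep
        yield out

moras = ["shi", "go", "to", "o", "hi", "do", "ku", "hi", "ki", "tsu", "i", "da"]
cands = list(expand_phoneme(moras))
print(len(cands))  # two occurrences of "hi" -> 4 candidates
```

Working on mora units rather than raw strings avoids accidentally matching "hi" inside moras such as "shi"; each of the four candidates would then receive its own sound quality evaluation score.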
  • in the present embodiment, a unit selection score, which is an approximate value of the sound quality of the synthesized speech, is used as the sound quality evaluation score.
  • as this score, the pitch frequency difference at the end points between the segments, which are the concatenated sections of the synthesized speech, is measured in Hz, and the sum of the absolute values of these differences is used.
  • FIG. 15 shows an example of the calculation of the sound quality evaluation score when the synthesized speech is composed of four segments.
  • a unit of the original utterance is assigned to each segment from segment 1 to segment 4.
  • let these original utterance units be original utterance 1 to original utterance 4.
  • let the pitch frequency on original utterance 1 corresponding to the start point of segment 1 be F1b, and the pitch frequency corresponding to its end point be F1e.
  • similarly, segment 2 to segment 4 have pitch frequencies F2b, F2e, F3b, F3e, F4b, and F4e corresponding to their start and end points.
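Under the notation of F1b/F1e for start/end pitch frequencies, the concatenation-smoothness score of FIG. 15 might be computed as in this sketch (the numeric pitch values are invented for illustration):

```python
def concat_pitch_score(segments):
    """Sum of absolute pitch differences (Hz) across segment joins.
    `segments` is a list of (F_begin, F_end) pairs in order; a smaller
    score means a smoother pitch pattern at the concatenation points."""
    return sum(abs(prev_end - next_begin)
               for (_, prev_end), (next_begin, _) in zip(segments, segments[1:]))

# four segments with endpoint pitch frequencies (F1b, F1e) ... (F4b, F4e)
segs = [(120.0, 130.0), (128.0, 140.0), (141.0, 150.0), (149.0, 145.0)]
print(concat_pitch_score(segs))  # |130-128| + |140-141| + |150-149| = 4.0
```

The same skeleton accommodates the variants mentioned in the text by replacing the absolute difference with a squared, weighted, or log-domain difference.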
  • instead of using the sum of the absolute differences of the raw pitch frequencies, a calculation using the sum of squares, a weighted sum, or pitch frequencies expressed as logarithmic values may be used.
  • when a general waveform concatenation method is used for waveform generation in speech synthesis, the pitch frequency of the original utterance directly becomes the pitch frequency of the corresponding section of the synthesized speech.
  • when a waveform editing method is used, the pitch frequency is edited before use; in that case, the same calculation can be performed using the pitch frequency after editing instead of the pitch frequency of the original utterance.
  • the value of the sound quality evaluation score in the present embodiment is a score representing the smoothness of the pitch pattern of the synthesized speech, and can be said to approximate the sound quality of the synthesized speech. Therefore, the method of the present embodiment can also generate the synthesized speech having the best sound quality evaluation score.
  • in the present embodiment as well, a unit selection score that is an approximate value of the sound quality of the synthesized speech is used as the sound quality evaluation score.
  • as this score, an estimated pitch frequency trajectory is calculated during the speech synthesis process, the difference between this estimate and the pitch frequency of the actual synthesized speech is measured in Hz, and the sum of the absolute values over the entire speech interval is used.
  • FIG. 16 shows an example of the calculation of the sound quality evaluation score when the synthesized speech consists of four segments. Here, a unit of the original utterance is assigned to each segment from segment 1 to segment 4.
  • the pitch frequency of the selected speech unit may be different from the pitch frequency calculated for speech synthesis.
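A minimal sketch of this fidelity score (the frame-wise pitch values are invented for illustration; a real system would compare the two trajectories sampled on a common time grid):

```python
def pitch_fidelity_score(target_f0, synth_f0):
    """Sum of absolute differences (Hz) between the pitch trajectory
    estimated during synthesis and the pitch of the actual synthesized
    speech over the whole utterance; smaller means more faithful."""
    assert len(target_f0) == len(synth_f0), "trajectories must share a time grid"
    return sum(abs(t - s) for t, s in zip(target_f0, synth_f0))

target = [120.0, 125.0, 130.0, 128.0]  # estimated trajectory (Hz per frame)
synth  = [121.0, 124.0, 133.0, 128.0]  # pitch of the actual synthesized speech
print(pitch_fidelity_score(target, synth))  # 1 + 1 + 3 + 0 = 5.0
```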
  • the value of the sound quality evaluation score in this embodiment is a score representing how faithfully the calculated pitch frequency is reproduced, and in this respect it approximates the sound quality of the synthesized speech.
  • the synthesized voice having the highest sound quality evaluation score can also be generated by the method of the present embodiment.
  • in the present embodiment as well, a unit selection score that is an approximate value of the sound quality of the synthesized speech is used as the sound quality evaluation score.
  • as this score, the time length of each speech unit is calculated during the speech synthesis process, the difference between this value and the time length of the corresponding unit in the actual synthesized speech is measured in milliseconds, and the sum of the absolute values over the entire speech interval is used.
  • the time length of the selected speech unit may be different from the time length calculated for speech synthesis.
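A minimal sketch of the duration-based score (the millisecond values are invented for illustration):

```python
def duration_fidelity_score(predicted_ms, actual_ms):
    """Sum of absolute differences (ms) between the unit durations
    calculated during synthesis and the durations of the corresponding
    units in the actual synthesized speech; smaller means more faithful."""
    assert len(predicted_ms) == len(actual_ms), "one duration per unit"
    return sum(abs(p - a) for p, a in zip(predicted_ms, actual_ms))

predicted = [80, 120, 95, 140]  # calculated unit durations (ms)
actual    = [85, 118, 95, 150]  # durations in the synthesized speech (ms)
print(duration_fidelity_score(predicted, actual))  # 5 + 2 + 0 + 10 = 17
```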
  • the value of the sound quality evaluation score in this embodiment is a score representing how faithfully the calculated time length is reproduced, and in this respect it approximates the sound quality of the synthesized speech.
  • the synthesized voice having the highest sound quality evaluation score can also be generated by the method of the present embodiment.
  • the fields to which the present invention can be applied are substantially the same as the fields that require speech synthesis technology.
  • in particular, the effect of the present invention is considered to be best demonstrated in fields that do not require strict fidelity to the input text, such as spoken dialogue for robots.

Abstract

When speech is generated, the speech is not necessarily clear, owing to such factors as the synthesis method and the ordering of the speech units. A text-to-speech synthesis device comprises a text candidate generation unit (102) for generating a plurality of candidate texts composed of synonymous expressions by analyzing an input text, a sound quality evaluation score calculation unit (103) for calculating a sound quality evaluation score for each of the candidate texts, a selection unit (104) for selecting the candidate text with the best sound quality evaluation score from among the plurality of candidate texts, and a speech synthesis unit (105) for generating a synthetic speech corresponding to the selected candidate text. The device thus transforms the input text to the extent that the content of the sentence is not changed, and selects and outputs the synthetic speech corresponding to the candidate text with the best sound quality evaluation score (see Fig. 1).

Description

Specification
Text-to-speech synthesis device, program therefor, and text-to-speech synthesis method

TECHNICAL FIELD OF THE INVENTION
[0001] (Related application) This application claims the priority of the prior Japanese Patent Application No. 2006-302954 (filed on November 8, 2006), the entire contents of which are incorporated herein by reference.
The present invention relates to a text-to-speech synthesis technology, and more particularly to a text-to-speech synthesis technology having a function of arranging input text within a range that does not impair the meaning of the text and making it into synthesized speech.
Background art
[0002] Text-to-speech synthesis technology is a technology for generating speech that reads out the content of a given text, which, in the case of Japanese, mainly consists of sentences written in a mixture of kanji and kana.
[0003] A widely used method is to estimate the reading of the input text to generate a phonetic symbol string, and then to generate speech for that phonetic symbol string. Methods for generating speech from a phonetic symbol string include a waveform concatenation method, which cuts and pastes prerecorded natural utterances, and a parametric method, which generates a parameter time series representing acoustic features and generates speech from those parameters.
[0004] In this speech synthesis technology, how to generate natural synthesized speech is a major research theme. For example, Japanese Patent Application Laid-Open No. 2004-258406 discloses a text-to-speech synthesis system that can generate a natural prosodic pattern (phonetic symbol string) even when dependency analysis is incorrect.
[0005] Japanese Patent Application Laid-Open No. 2004-138728 discloses a speech synthesizer provided with unit and prosody control means that translates a basic pitch pattern, which is one piece of prosodic information, in the frequency direction while maintaining its outline, selects speech unit sequences matching the translated, deformed pitch pattern and the linguistic information, and adopts the best-matching speech unit sequence together with its associated deformed pitch pattern as the speech units and prosodic information used for synthesis, thereby suppressing the pitch deformation amount of the selected speech units and creating synthesized speech free of sudden changes in sound quality.
[0006] Patent Document 1: Japanese Patent Application Laid-Open No. 2004-258406
Patent Document 2: Japanese Patent Application Laid-Open No. 2004-138728
Non-Patent Document 1: Shikano, Ito, Kawahara, Takeda, Yamamoto, "IT Text: Speech Recognition Systems", Ohmsha, 2001, pp. 13-14
Disclosure of the invention
Problems to be solved by the invention
[0007] The disclosures of Patent Documents 1 and 2 and Non-Patent Document 1 above are incorporated herein by reference. The following analysis of the related art is given by the present invention. Most of the conventional techniques described above focus on generating natural-sounding speech, but they have the problem that the synthesized speech generated as a result is not necessarily of good quality. The present invention has been made in view of this point, and its object is to provide a text-to-speech synthesis technique capable of improving the quality of the generated synthesized speech.

Means for solving the problem
[0008] According to a first aspect of the present invention, there are provided: a text-to-speech synthesis device comprising a text candidate generation unit that analyzes input text and generates a plurality of candidate texts consisting of synonymous expressions, a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate text, a selection unit that selects the candidate text having the best sound quality evaluation score from the plurality of candidate texts, and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate text; a program for realizing the text-to-speech synthesis device; and a speech synthesis method implemented using the text-to-speech synthesis device.
[0009] According to a second aspect of the present invention, there are provided: a text-to-speech synthesis device comprising a morphological analysis candidate generation unit that calculates a plurality of candidate morphological analysis results from input text, a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate morphological analysis result, a selection unit that selects the candidate morphological analysis result having the best sound quality evaluation score from the plurality of candidates, and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate morphological analysis result; a program for realizing the text-to-speech synthesis device; and a speech synthesis method implemented using the text-to-speech synthesis device.

[0010] According to a third aspect of the present invention, there are provided: a text-to-speech synthesis device comprising a reading generation unit that generates a reading result from input text, a pause insertion unit that generates a plurality of candidate reading results with pauses by combining the insertion/non-insertion of pauses (phrase breaks) into the reading result, a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate reading result with pauses, a selection unit that selects the candidate reading result with pauses having the best sound quality evaluation score from the plurality of candidates, and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate; a program for realizing the text-to-speech synthesis device; and a speech synthesis method implemented using the text-to-speech synthesis device.
[0011] According to a fourth aspect of the present invention, there are provided: a text-to-speech synthesis device comprising a reading generation unit that generates a reading result from input text, a reading deformation unit that generates a plurality of candidate reading results by expanding the combinations of replacing/not replacing specific sounds contained in the reading result with other sounds, a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate reading result, a selection unit that selects the candidate reading result having the best sound quality evaluation score from the plurality of candidates, and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate; a program for realizing the text-to-speech synthesis device; and a speech synthesis method implemented using the text-to-speech synthesis device.
Effects of the invention
[0012] According to the present invention, the quality of synthesized speech can be dramatically improved. The reason is that the system is configured to adopt, from among a plurality of candidates created from the input text, the candidate having the best sound quality evaluation score.
[0013] According to the present invention, it is also possible to generate synthesized speech that makes use of the original characteristics of the natural speech possessed by the speech synthesis means.
Brief Description of Drawings
[0014] FIG. 1 is a block diagram showing the configuration of the text-to-speech synthesis system according to the first embodiment of the present invention.
FIG. 2 is a diagram showing entries stored in the synonym dictionary of the text-to-speech synthesis system according to the first embodiment of the present invention.
FIG. 3 is a diagram for explaining the operation of the text candidate generation unit of the text-to-speech synthesis system according to the first embodiment of the present invention.
FIG. 4 is a block diagram showing a modified configuration of the text-to-speech synthesis system according to the first embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of the text-to-speech synthesis system according to the second embodiment of the present invention.
FIG. 6 is a block diagram showing the configuration of the text-to-speech synthesis system according to the fifth embodiment of the present invention.
FIG. 7 is a diagram showing entries stored in the personality feature dictionary of the text-to-speech synthesis system according to the fifth embodiment of the present invention.
FIG. 8 is a diagram for explaining the operation of the text candidate generation unit of the text-to-speech synthesis system according to the fifth embodiment of the present invention.
FIG. 9 is a block diagram showing the configuration of the text-to-speech synthesis system according to the sixth embodiment of the present invention.
FIG. 10 is a diagram showing entries stored in the filler dictionary of the text-to-speech synthesis system according to the sixth embodiment of the present invention.
FIG. 11 is a diagram for explaining the operation of the text candidate generation unit of the text-to-speech synthesis system according to the sixth embodiment of the present invention.
FIG. 12 is a block diagram showing the configuration of the text-to-speech synthesis system according to the tenth embodiment of the present invention.
FIG. 13 is a block diagram showing the configuration of the text-to-speech synthesis system according to the eleventh embodiment of the present invention.
FIG. 14 is a block diagram showing the configuration of the text-to-speech synthesis system according to the twelfth embodiment of the present invention.
FIG. 15 is a diagram for explaining the method for calculating the sound quality evaluation score in the text-to-speech synthesis system according to the thirteenth embodiment of the present invention.
FIG. 16 is a diagram for explaining the method for calculating the sound quality evaluation score in the text-to-speech synthesis system according to the fourteenth embodiment of the present invention.
Explanation of symbols
[0015] 101 Text input unit
102 Text candidate generation unit
103 Sound quality evaluation score calculation unit
104 Selection unit
105 Speech synthesis unit
106 Morphological analysis candidate generation unit
107 Reading generation unit
108 Pause insertion unit
109 Reading deformation unit
201 Synonym dictionary
202 Personality feature dictionary
203 Filler dictionary
BEST MODE FOR CARRYING OUT THE INVENTION
[0016] Various possible modes of the present invention are shown below.
[0017] (Form 1)
According to the first aspect of the present invention, a text-to-speech synthesis device can be obtained comprising: a text candidate generation unit that analyzes input text and generates a plurality of candidate texts consisting of synonymous expressions; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate text; a selection unit that selects the candidate text having the best sound quality evaluation score from the plurality of candidate texts; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate text.
[0018] (Form 2)
The text-to-speech synthesis device may further comprise a synonym dictionary in which synonyms are stored in association with words or phrases, and the text candidate generation unit may be operated to search the synonym dictionary for synonyms corresponding to the words or phrases contained in the input text and to generate a plurality of candidate texts by expanding the combinations of substitution/non-substitution of those words or phrases.
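As a sketch of the substitution/non-substitution expansion this form describes (the tokenization, the dictionary entries, and the function name are illustrative assumptions based on the ninth-embodiment example):

```python
from itertools import product

def expand_candidates(tokens, synonym_dict):
    """Expand substitution/non-substitution of each word that has a
    synonym dictionary entry, yielding every combination as one
    candidate text."""
    options = [[t] + synonym_dict.get(t, []) for t in tokens]
    for combo in product(*options):
        yield " ".join(combo)

syns = {"contact-network": ["communication-network"],
        "surprise": ["astonishment"]}
cands = list(expand_candidates(
    ["the", "boring", "contact-network", "was", "filled", "with", "surprise"],
    syns))
print(len(cands))  # two dictionary hits -> 2 * 2 = 4 candidate texts
```

Each candidate would then be passed to the sound quality evaluation score calculation unit, and the best-scoring one synthesized.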
[0019] (Form 3)
The text-to-speech synthesis device may further comprise a synonymous expression dictionary that stores, in association with a given phrase expression, phrase expressions having the same meaning, and the text candidate generation unit may be operated to search the synonymous expression dictionary for synonymous expressions corresponding to the phrase expressions contained in the input text and to generate a plurality of candidate texts by expanding the combinations of substitution/non-substitution at the phrase expression level.
[0020] (Form 4)
Furthermore, the text candidate generation unit may be operated to generate the plurality of candidate texts by transforming the sentence structure of the input text under the condition that the meaning of the input text is not changed.
[0021] (Form 5)
Furthermore, the text candidate generation unit may be operated to generate a plurality of candidate texts by expanding the combinations of changing/not changing the politeness of expression of the words or phrases contained in the input text.
[0022] (Form 6)
The text-to-speech synthesis device may further comprise a personality feature dictionary that stores expression differences at the individual level in association with one another, and the text candidate generation unit may be operated to search the personality feature dictionary for other synonymous expressions corresponding to the expressions contained in the input text and to generate a plurality of candidate texts by expanding the combinations of substitution/non-substitution based on those individual-level expression differences.
[0023] (Form 7)
The text-to-speech synthesis device may further comprise a filler dictionary describing fillers and interjections, and the text candidate generation unit may be operated to refer to the filler dictionary and generate a plurality of candidate texts by expanding the combinations of inserting/not inserting fillers or interjections between the words of the input text.
[0024] (Form 8)
The text-to-speech synthesis device may further comprise a filler dictionary describing fillers and interjections, and the text candidate generation unit may be operated to refer to the filler dictionary and generate a plurality of candidate texts by expanding the combinations of deleting/not deleting fillers or interjections from the input text.
[0025] (Form 9)
Furthermore, the text candidate generation unit may be operated to generate the plurality of candidate texts by changing part or all of the words or phrases contained in the input text into repeated expressions.
[0026] (Form 10)
Furthermore, the text candidate generation unit may be operated so that, among the expanded results, only expressions that rhyme are used as candidate texts.
[0027] (Form 11)
Furthermore, the text candidate generation unit may be operated so that, among the expanded results, only expressions in which the number of mora of each accent phrase is 5 or 7 are used as candidate texts.
[0028] (Form 12)
According to the second aspect of the present invention, a text-to-speech synthesis device can be obtained comprising: a morphological analysis candidate generation unit that calculates a plurality of candidate morphological analysis results from input text; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate morphological analysis result; a selection unit that selects the candidate morphological analysis result having the best sound quality evaluation score from the plurality of candidates; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate morphological analysis result.
[0029] (Form 13)
According to the third aspect of the present invention, a text-to-speech synthesis device can be obtained comprising: a reading generation unit that generates a reading result from input text; a pause insertion unit that generates a plurality of candidate reading results with pauses by combining the insertion/non-insertion of pauses (phrase breaks) into the reading result; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate reading result with pauses; a selection unit that selects the candidate reading result with pauses having the best sound quality evaluation score from the plurality of candidates; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate.
[0030] (Form 14)
According to a fourth aspect of the present invention, there is provided a text-to-speech synthesis device comprising: a reading generation unit that generates a reading (pronunciation) from input text; a reading modification unit that generates a plurality of candidate readings by expanding the combinations of substituting or not substituting other sounds for specific sounds contained in the reading; a sound quality evaluation score calculation unit that computes a sound quality evaluation score for each candidate reading; a selection unit that selects, from the plurality of candidate readings, the candidate with the best sound quality evaluation score; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate reading.
[0031] (Form 15)
As the sound quality evaluation score, a score representing the smoothness of the pitch pattern of the synthesized speech can be used.
[0032] (Form 16)
As the sound quality evaluation score, a score representing the difference between an estimated pitch pattern and the pitch pattern of the synthesized speech can be used.
[0033] (Form 17)
As the sound quality evaluation score, a score representing the difference between an estimated rhythm and the rhythm of the synthesized speech can be used.
[0034] (Form 18)
As the sound quality evaluation score, a score representing the smoothness of the spectrum between the segments from which the synthesized speech is assembled can be used.
[0035] (Other forms)
A text-to-speech synthesis method according to the present invention, and a program for carrying out the method, can likewise be developed in the same forms as the text-to-speech synthesis devices described above.
[0036] [First embodiment]
Next, the best mode for carrying out the present invention is described in detail with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the text-to-speech synthesis system according to the first embodiment of the present invention. Referring to FIG. 1, the text-to-speech synthesis system according to the present embodiment comprises a text input unit 101, a text candidate generation unit 102, a synonym dictionary 201, a sound quality evaluation score calculation unit 103, a selection unit 104, and a speech synthesis unit 105.
[0037] The text input unit 101 is a means for inputting text. In principle, the present invention is applicable to any language; in the present embodiment, however, the description assumes Japanese sentences written in mixed kana and kanji.
[0038] The text candidate generation unit 102 is a means that refers to the synonym dictionary 201 and generates a plurality of candidate texts by transforming the input text into other expressions that do not change its meaning.
[0039] FIG. 2 shows the entries stored in the synonym dictionary 201, in which synonyms are stored in association with one another in units of words or phrases.
[0040] The text candidate generation unit 102 performs morphological analysis on the text as shown in FIG. 3, looks up each morpheme in the synonym dictionary 201, and, when a matching entry is found, marks the morpheme as replaceable by the corresponding synonym.
[0041] Finally, the text candidate generation unit 102 generates a plurality of candidate texts from all combinations of the possible substitutions for the morphemes of the text.
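The combinational expansion of [0040]-[0041] can be sketched as follows. This is a minimal illustration only: the miniature dictionary and the pre-segmented morpheme list are hypothetical stand-ins for the synonym dictionary 201 and the output of the morphological analyzer, which the embodiment assumes are given.

```python
from itertools import product

# Hypothetical miniature synonym dictionary (a stand-in for FIG. 2):
# each entry maps a surface form to its interchangeable synonyms.
SYNONYM_DICT = {
    "順調": ["堅調"],
    "業績": ["実績"],
}

def generate_candidate_texts(morphemes):
    """Expand every keep-or-substitute combination over the morphemes:
    a morpheme with a dictionary hit may stay as-is or be replaced by
    any of its registered synonyms."""
    options = [[m] + SYNONYM_DICT.get(m, []) for m in morphemes]
    return ["".join(choice) for choice in product(*options)]

# Morphological analysis itself is assumed done upstream; we start from
# an already segmented morpheme sequence.
candidates = generate_candidate_texts(["順調", "な", "業績"])
```

The number of candidates is the product of the per-morpheme option counts (here 2 x 1 x 2 = 4), which is why paragraph [0062] notes that the expansion may need to be limited under memory constraints.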
[0042] The sound quality evaluation score calculation unit 103 is a means for calculating a sound quality evaluation score for each of the plurality of candidate texts; its details are described later.
[0043] The selection unit 104 is a means for selecting, from the plurality of candidate texts, the one candidate text whose corresponding sound quality evaluation score is best.
[0044] The speech synthesis unit 105 is a means for generating synthesized speech corresponding to the selected candidate text. In the present embodiment, the speech synthesis unit 105 is assumed to be a waveform-concatenation text-to-speech synthesizer that generates synthesized speech by concatenating and editing prerecorded natural speech.

[0045] Next, the sound quality evaluation method used in the text-to-speech synthesis system according to the present embodiment is described in detail.
[0046] The sound quality evaluation score is an objective evaluation value of the speech quality of the synthesized speech obtained by synthesizing a candidate text. Since it is generally difficult to obtain an exact objective evaluation value of speech quality, various scores used during the speech synthesis process may be employed as approximations of the sound quality evaluation score. In the present invention, such approximations are also referred to as sound quality evaluation scores.
[0047] As a concrete example of the sound quality evaluation score, the following description uses, for a candidate text, the value of the unit concatenation score corresponding to the segment sequence selected by the speech synthesizer.
[0048] The unit concatenation score is defined as the negative of the sum, over all segment concatenation boundaries, of the weighted Euclidean distances between 10-dimensional cepstra.
[0049] That is, for the i-th segment sequence (a(1)_i, a(2)_i, ..., a(N)_i), let the 10-dimensional cepstrum at the start of each segment a(j)_i be Cep_b(i, j) = (cep_b(i, j, 1), cep_b(i, j, 2), ..., cep_b(i, j, 10)), and let the 10-dimensional cepstrum at its end be Cep_e(i, j) = (cep_e(i, j, 1), cep_e(i, j, 2), ..., cep_e(i, j, 10)). The unit concatenation score d(i) for the i-th segment sequence is then expressed as d(i) = −Σ_{j=1}^{N−1} Σ_{k=1}^{10} α(k) |cep_e(i, j, k) − cep_b(i, j+1, k)|².
[0050] Here, α(k) is a weight given in advance.
[0051] This unit concatenation score is calculated by the same method in the sound quality evaluation score calculation unit 103 and in the speech synthesis unit 105.
[0052] The value d(i) of this unit concatenation score can be regarded as a score representing the smoothness of the spectrum between the segments into which the synthesized speech is divided: the larger the value d(i), the smaller the cepstral differences at the concatenation points, and the higher the sound quality of the resulting synthesized speech.
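The score d(i) of [0049] and the maximization of [0054] can be sketched as follows. This is an illustrative sketch: the segment representation (dictionaries holding a start and an end cepstrum vector) and the toy boundary values are assumptions, not the synthesizer's actual data structures.

```python
def unit_concat_score(segments, alpha):
    """d(i) = -sum over boundaries j and dimensions k of
    alpha[k] * |cep_e(i, j, k) - cep_b(i, j+1, k)|^2.
    A larger (less negative) score means smoother spectral joins."""
    total = 0.0
    for left, right in zip(segments, segments[1:]):
        total += sum(a * (e - b) ** 2
                     for a, e, b in zip(alpha, left["cep_e"], right["cep_b"]))
    return -total

def select_best_sequence(candidate_sequences, alpha):
    # Among the candidate segment sequences, pick the one whose
    # unit concatenation score is largest.
    return max(candidate_sequences, key=lambda seq: unit_concat_score(seq, alpha))

alpha = [1.0] * 10  # the predetermined weights alpha(k) of [0050]
smooth = [{"cep_b": [0.0] * 10, "cep_e": [0.0] * 10},
          {"cep_b": [0.0] * 10, "cep_e": [0.0] * 10}]          # boundary cepstra match
rough = [{"cep_b": [0.0] * 10, "cep_e": [0.0] * 10},
         {"cep_b": [1.0] + [0.0] * 9, "cep_e": [0.0] * 10}]    # 1.0 mismatch in dim 1
```

With these toy values the perfectly matching boundary yields the maximal score 0, while the mismatched boundary is penalized, so `select_best_sequence` prefers the smoother sequence.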
[0053] Here, the weighted Euclidean distance between 10-dimensional cepstra was used as the unit concatenation score, but the number of dimensions can be adjusted as appropriate. Instead of the cepstrum, it is also possible to use, with a suitable number of dimensions, the Δ-cepstrum (the time derivative of the cepstrum), the MFCC (Mel-Frequency Cepstrum Coefficient; see Non-Patent Document 1), or the ΔMFCC (the time derivative of the MFCC). A vector composed of several of these values can of course also be used.
[0054] When synthesizing speech for a given text, the waveform-concatenation text-to-speech synthesizer described above can take a plurality of natural-speech segment sequences, and it operates so as to select, among them, the segment sequence whose unit concatenation score is largest.
[0055] In this way, synthesized speech can be generated that has the same content as the given text and the highest sound quality evaluation score, as approximated by the unit selection score used in the speech synthesizer.
[0056] In the example of FIG. 2, the synonym dictionary 201 was described as storing surface forms only; by additionally describing parts of speech, hiragana-katakana conversion rules, and the like, matching can be made more flexible or more strict.
[0057] The text candidate generation unit 102 can also be integrated with the text analysis processing that forms part of the speech synthesis unit 105. In that case, because the morphological analysis results coincide, the accuracy can be improved and the amount of computation reduced.
[0058] In the present embodiment, the selection unit 104 outputs a candidate text, and the speech synthesis unit 105 synthesizes speech taking that candidate text as input. The operations of the text candidate generation unit 102 and the speech synthesis unit 105 are identical up to the point where the sound quality evaluation score is generated.
[0059] Alternatively, as shown in FIG. 4, the portion performing this common processing (the processing up to generation of the sound quality evaluation score) can be carried out by the text candidate generation unit 102, with the intermediate results output as speech synthesis data.
[0060] The selection unit 104 then outputs the speech synthesis data (generated by the text candidate generation unit 102) corresponding to the selected candidate text, and the speech synthesis unit 105 synthesizes speech using the selected speech synthesis data. This eliminates the duplicated processing and makes it possible to reduce the amount of computation.
[0061] As described above, the present embodiment makes it possible to convey information using alternative words and expressions that can be generated with higher sound quality, without changing the information content to be conveyed. Moreover, when candidate texts are selected on the basis of the sound quality evaluation score, words and expressions that occurred frequently in the original utterances from which the synthesis segments were recorded are more likely to be used, so the individuality of the speaker of the original utterances is expressed. Furthermore, even when the given text is in a written style, it may be replaced by a synonymous spoken-style expression used in the original utterances, in which case an expression that is easier to understand by ear, or that sounds less unnatural, can be obtained.
[0062] The embodiment described above assumed no memory or similar constraints, with the text candidate generation unit 102 enumerating all combinations of the possible substitutions for the morphemes of the text (see FIG. 3). Where such constraints exist, the combinations may be limited to those the constraints allow, or candidate texts may be generated sequentially while retaining only those with high sound quality evaluation scores.
[0063] [Second embodiment]
Next, a second embodiment of the present invention, in which the synonym dictionary can be omitted, is described in detail with reference to the drawings. FIG. 5 is a block diagram showing the configuration of the text-to-speech synthesis system according to the second embodiment of the present invention. Matters already described for the first embodiment are omitted below.
[0064] The text candidate generation unit 102 of the present embodiment generates candidate texts expanded into synonymous expressions by transforming the grammar of the input text according to transformation rules for synonymous expressions described in advance within the text candidate generation unit 102. Concrete examples follow.
[0065] For example, when the input text is 「文法を変形する」 ("transform the grammar"), the text candidate generation unit 102 transforms it by replacing the pair of the noun 「変形」 and the auxiliary verb 「する」 with the verb phrase 「形を変える」 ("change the shape").
[0066] Any transformation of particles that becomes necessary is performed at the same time. As a result, in addition to 「文法を変形する」, 「文法の形を変える」 is also taken as a candidate text.
[0067] As another example, when the input text is 「単語を置換する」 ("substitute words"), the text candidate generation unit 102 transforms it by replacing the pair of the verbal stem 「置換」 and the auxiliary verb 「する」 with the verb phrase 「置き換える」 or 「差し替える」 (both meaning "replace").
[0068] After a plurality of candidate texts have been generated in this way, sound quality evaluation and selection of the best candidate are performed by the same processing as in the first embodiment.
[0069] The text candidate generation unit 102 may apply the transformation rules simply by string matching on the text, but the accuracy can be improved by additionally performing morphological analysis and confirming the part of speech of the expression in question.
[0070] [Third embodiment]
Next, a third embodiment of the present invention, in which the synonym dictionary can likewise be omitted, is described. Since the configuration of this embodiment is the same as that of the second embodiment, the operation of the text candidate generation unit 102, which is the point of difference, is described in detail.
[0071] Unlike the second embodiment, the text candidate generation unit 102 according to the present embodiment generates expanded candidate texts by transforming the sentence structure of the input text. Concrete examples of this transformation of sentence structure follow.
[0072] The first type of sentence-structure transformation in the text candidate generation unit 102 is transformation into an inverted, emphatic construction. For example, when the input text is 「Aを Bする」 ("do B to A"), the text candidate generation unit 102 also takes the inverted, emphatic construction 「Bしたのは Aだ」 ("it is A that was B-ed") as a candidate text.
[0073] The second type of sentence-structure transformation in the text candidate generation unit 102 is transformation into a construction ending in a noun phrase (taigen-dome). For example, when the input text is 「投稿したのは私です」 ("the one who posted is me"), the text candidate generation unit 102 also takes the noun-phrase-final construction 「投稿したのは私」 as a candidate text.
[0074] The third type of sentence-structure transformation in the text candidate generation unit 102 is transformation by reordering of particle phrases (scrambling). For example, when the input text is 「私と兄は本を読んだ」 ("my brother and I read a book"), the text candidate generation unit 102 also takes the scrambled variants 「私と本を兄は読んだ」, 「兄は私と本を読んだ」, 「兄は本を私と読んだ」, 「本を兄は私と読んだ」, and 「本を私と兄は読んだ」 as candidate texts.
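Under the simplifying assumption that the sentence has already been split into movable particle phrases plus a fixed sentence-final predicate, the scrambling expansion can be sketched as:

```python
from itertools import permutations

def scramble_candidates(particle_phrases, predicate):
    # Permute the particle phrases while keeping the predicate sentence-final;
    # the original word order is itself one of the permutations.
    return ["".join(order) + predicate for order in permutations(particle_phrases)]

candidates = scramble_candidates(["私と", "兄は", "本を"], "読んだ")
```

For the three phrases of the example this yields 3! = 6 candidates: the input sentence plus the five scrambled variants. A production system would additionally filter out orderings that change the meaning, which this sketch does not attempt.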
[0075] The fourth type of sentence-structure transformation in the text candidate generation unit 102 is transformation by interchange of noun and verb forms. For example, when the input text is 「単語を置換する」 ("substitute words"), the text candidate generation unit 102 also takes 「単語の置換を行う」 ("perform substitution of words"), obtained by interchanging the noun and verb forms, as a candidate text.
[0076] The fifth type of sentence-structure transformation in the text candidate generation unit 102 is transformation into the passive voice. For example, when the input text is 「彼は彼女を呼んだ」 ("he called her"), the text candidate generation unit 102 also takes the passive transformation 「彼女は彼に呼ばれた。」 ("she was called by him") as a candidate text.

[0077] After a plurality of candidate texts have been generated in this way, sound quality evaluation and selection of the best candidate are performed by the same processing as in the first embodiment.
[0078] [Fourth embodiment]
Next, a fourth embodiment of the present invention, in which the synonym dictionary can likewise be omitted, is described. Since the configuration of this embodiment is the same as those of the second and third embodiments, the operation of the text candidate generation unit 102, which is the point of difference, is described in detail.
[0079] Unlike the second and third embodiments, the text candidate generation unit 102 according to the present embodiment generates expanded candidate texts by transforming the politeness level of the expressions in the input text. A concrete example of this transformation follows.
[0080] For example, when the input text is 「私は本を買った」 ("I bought a book", plain style), the text candidate generation unit 102 also takes the more polite expression 「私は本を買いました。」 as a candidate text.
[0081] After a plurality of candidate texts have been generated in this way, sound quality evaluation and selection of the best candidate are performed by the same processing as in the first embodiment.
[0082] [Fifth embodiment]
Next, a fifth embodiment of the present invention, in which a personality (feature) dictionary is installed in place of the synonym dictionary of the first embodiment, is described in detail with reference to the drawings. FIG. 6 is a block diagram showing the configuration of the text-to-speech synthesis system according to the fifth embodiment of the present invention, which includes a personality (feature) dictionary 202. Matters already described for the first embodiment are omitted below.
[0083] The text candidate generation unit 102 of the present embodiment refers to the personality (feature) dictionary 202 and generates a plurality of candidate texts by transforming the input text into other expressions that do not change its meaning.
[0084] FIG. 7 shows the entries stored in the personality (feature) dictionary 202; for the entry 「自転車」 ("bicycle"), 「ケッタ」 is stored in association as a replaceable word. 「ケッタ」 is a word for "bicycle" used in the Nagoya region.
[0085] The text candidate generation unit 102 performs morphological analysis on the text as shown in FIG. 8, looks up each morpheme in the personality (feature) dictionary 202, and, when a matching entry is found, marks the morpheme as replaceable by the corresponding personal-characteristic morpheme (the replaceable word).
[0086] After a plurality of candidate texts have been generated in this way, sound quality evaluation and selection of the best candidate are performed by the same processing as in the first embodiment.
[0087] In the present embodiment, even when the natural utterances used by the speech synthesis unit vary in word choice because of dialect or the like, words and phoneme sequences that appear infrequently in those utterances are replaced by words and phoneme sequences that the speaker of the original utterances used more frequently, so that synthesized speech of good sound quality can be obtained.
[0088] [Sixth embodiment]
Next, a sixth embodiment of the present invention, in which a filler dictionary is installed in place of the synonym dictionary of the first embodiment, is described in detail with reference to the drawings. FIG. 9 is a block diagram showing the configuration of the text-to-speech synthesis system according to the sixth embodiment of the present invention, which includes a filler dictionary 203. Matters already described for the first embodiment are omitted below.
[0089] The text candidate generation unit 102 of the present embodiment refers to the filler dictionary 203 and generates a plurality of candidate texts by expanding the combinations of inserting or not inserting a filler between the words of the input text.
[0090] FIG. 10 shows the entries stored in the filler dictionary 203: fillers, i.e., connective words and interjection-like expressions with little meaning of their own used in conversation, such as 「えーと」 ("er...") and 「ゴホゴホ」 (a coughing sound), are stored in association. In English, for example, "you know", "I mean", and "Well" would be stored.
[0091] The text candidate generation unit 102 performs morphological analysis on the text as shown in FIG. 11 and expands every possibility of inserting the contents of the filler dictionary 203 between the morphemes. As shown in the lower part of FIG. 11, for a morpheme that itself matches the filler dictionary 203, the text candidate generation unit 102 also expands the combinations of replacing it with another filler or deleting it.
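The insertion expansion of [0091] can be sketched as follows. The two-entry filler list and the pre-segmented input are hypothetical, and the per-sentence cap anticipates the limit mentioned in [0093].

```python
from itertools import product

FILLERS = ["えーと", "まあ"]  # hypothetical filler-dictionary entries

def filler_candidates(morphemes, max_total=2):
    """At every gap between morphemes, either insert nothing or one of the
    dictionary fillers; expand all combinations, capping the total number
    of insertions per sentence."""
    gaps = len(morphemes) - 1
    results = []
    for choice in product([""] + FILLERS, repeat=gaps):
        if sum(1 for c in choice if c) > max_total:
            continue  # enforce the per-sentence insertion limit
        out = [morphemes[0]]
        for filler, morph in zip(choice, morphemes[1:]):
            if filler:
                out.append(filler)
            out.append(morph)
        results.append("".join(out))
    return results

candidates = filler_candidates(["明日は", "晴れ", "です"])
```

With two gaps and two fillers this yields 3 x 3 = 9 candidates, ranging from the unchanged sentence to one with a filler in every gap; each then receives a sound quality evaluation score as in the first embodiment.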
[0092] After a plurality of candidate texts have been generated in this way, sound quality evaluation and selection of the best candidate are performed by the same processing as in the first embodiment.
[0093] In the present embodiment, inserting a filler can, for example, smooth a phoneme sequence that would otherwise join poorly, so that synthesized speech of good sound quality is obtained and the individuality of the speaker of the original utterances is expressed. Instead of expanding every possible filler insertion, the number of fillers insertable in the whole sentence and the number of consecutive insertions may also be controlled (limited).

[0094] [Seventh embodiment]
Next, a seventh embodiment of the present invention, in which all of the above dictionaries can be omitted, is described. Since the configuration of this embodiment is the same as that of the second embodiment, the operation of the text candidate generation unit 102, which is the point of difference, is described in detail.
[0095] Unlike the embodiments described above, the text candidate generation unit 102 according to the present embodiment generates expanded candidate texts by repeating a phrase in the input text a plurality of times. A concrete example of this transformation follows.
[0096] For example, when the input text is 「私は、鉛筆を 3本買います」 ("I will buy three pencils"), the text candidate generation unit 102 also takes 「私は、鉛筆を、鉛筆を 3本買います」, 「私は、鉛筆を 3本、 3本買います」, and 「私は、鉛筆を、鉛筆を 3本、 3本買います」 as candidate texts.
[0097] As described above, the plurality of candidate texts can also be generated by recasting some of the words or phrases, or whole phrases, contained in the input text as repeated expressions. After a plurality of candidate texts have been generated in this way, sound quality evaluation and selection of the best candidate are performed by the same processing as in the first embodiment.
[0098] The maximum number of repetitions of a phrase and the total number of repetitions in the whole sentence can be set in the text candidate generation unit 102 in advance.
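The repetition expansion of [0095]-[0098] can be sketched as follows, assuming the sentence is already split into phrases that are rejoined with 「、」; the phrase list, the joining punctuation, and the limit value are illustrative choices, not the embodiment's exact formatting.

```python
from itertools import product

def repetition_candidates(phrases, max_repeat=2):
    # Each phrase may be uttered 1..max_repeat times ([0098]'s preset limit);
    # expand all combinations of repetition counts.
    expansions = product(range(1, max_repeat + 1), repeat=len(phrases))
    return ["、".join(sum(([p] * n for p, n in zip(phrases, counts)), []))
            for counts in expansions]

candidates = repetition_candidates(["私は", "鉛筆を", "3本", "買います"])
```

With four phrases and a limit of two repetitions this yields 2^4 = 16 candidates, including the unrepeated original and variants such as 「私は、鉛筆を、鉛筆を、3本、買います」.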
[0099] [Eighth embodiment]
Next, an eighth embodiment of the present invention, which modifies the first embodiment, is described. Since the configuration of this embodiment is the same as that of the first embodiment, the operation of the text candidate generation unit 102, which is the point of difference, is described in detail.
[0100] Unlike the first embodiment described above, the text candidate generation unit 102 according to the present embodiment, after expanding synonyms with the synonym dictionary, tests whether the phrases alliterate. A concrete example of this test processing follows.
[0101] For example, consider the case where only the synonym 「堅調」 is registered for the entry 「順調」 in the synonym dictionary 201, and the text 「順調な業績を、謙遜して語る」 ("to speak humbly of favorable results") is input.
[0102] According to the first embodiment, two candidate texts are generated: 「順調な業績を、謙遜して語る」 and, in addition, 「堅調な業績を、謙遜して語る」.
[0103] Examining the two candidate texts, the initial sounds of the breath groups of the former, 「順調な業績を、謙遜して語る」, are 「じゅ」 and 「け」, so it does not alliterate. In the latter, 「堅調な業績を、謙遜して語る」, they are 「け」 and 「け」, so it does alliterate.
[0104] The text candidate generation unit 102 therefore excludes the former by the above alliteration test and outputs only the latter, alliterating text as a candidate text.
[0105] Sound quality evaluation and selection of the best candidate are then performed, by the same processing as in the first embodiment, only on the candidate texts thus selected from the plurality of candidate texts by the test.
[0106] In this embodiment, besides reducing the load of the sound quality evaluation process, the effect of preventing the generation of candidate texts that are unpleasant to listen to is achieved. Although the concrete example above illustrates a test based on alliteration (head rhyme), a similar test can be performed for end rhyme.
[0107] [Ninth Embodiment]
Next, a ninth embodiment of the present invention, obtained by modifying the first embodiment, will be described. Since the configuration of this embodiment is the same as that of the first embodiment, the operation of the text candidate generation unit 102, the point in which it differs, will be described in detail.
[0108] Unlike the first embodiment described above, the text candidate generation unit 102 according to this embodiment, after expanding synonyms using the synonym dictionary, tests whether the number of morae in the reading of each phrase is 5 or 7. A concrete example of this test processing is described below.
[0109] For example, consider the case where the synonym 「連絡ネット」 is registered for the entry 「連絡網」 ("contact network") and the synonym 「驚愕」 ("astonishment") for the entry 「驚き」 ("surprise") in the synonym dictionary 201, and the text 「退屈な連絡網が、驚きに満たされた」 ("the boring contact network was filled with surprise") is input.
[0110] Following the first embodiment, a total of four candidate texts are generated: (1) 「退屈な連絡網が、驚きに満たされた」 and, in addition, (2) 「退屈な連絡ネットが、驚きに満たされた」, (3) 「退屈な連絡網が、驚愕に満たされた」, and (4) 「退屈な連絡ネットが、驚愕に満たされた」.
[0111] The mora counts of the phrases of these candidate texts are: (1) 5 morae + 7 morae + 5 morae + 5 morae; (2) 5 morae + 8 morae + 5 morae + 5 morae; (3) 5 morae + 7 morae + 5 morae + 5 morae; and (4) 5 morae + 8 morae + 5 morae + 5 morae.
[0112] The text candidate generation unit 102 therefore outputs as candidate texts the two candidates in which every phrase consists of 5 or 7 morae: (1) 「退屈な連絡網が、驚きに満たされた」 and (3) 「退屈な連絡網が、驚愕に満たされた」.
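A minimal sketch of the 5-or-7-mora test follows. Mora counting here is simplified (each kana counts as one mora, with small ゃゅょ merged into the preceding kana), and the phrase readings are illustrative assumptions.

```python
# Keep only candidates whose phrases ('/'-separated) all have 5 or 7 morae.

def mora_count(kana: str) -> int:
    small = "ゃゅょぁぃぅぇぉ"
    return sum(1 for ch in kana if ch not in small)

def all_phrases_5_or_7(reading: str) -> bool:
    return all(mora_count(p) in (5, 7) for p in reading.split("/"))

# Phrase readings of the four candidate texts of the example.
candidates = {
    1: "たいくつな/れんらくもうが/おどろきに/みたされた",    # 5+7+5+5
    2: "たいくつな/れんらくねっとが/おどろきに/みたされた",  # 5+8+5+5
    3: "たいくつな/れんらくもうが/きょうがくに/みたされた",  # 5+7+5+5
    4: "たいくつな/れんらくねっとが/きょうがくに/みたされた",  # 5+8+5+5
}

selected = [n for n, reading in candidates.items() if all_phrases_5_or_7(reading)]
print(selected)  # → [1, 3]
```

Relaxing the membership test (e.g. allowing a count of 6 or 8) would accommodate the hypermetric ji-amari phrases mentioned below.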
[0113] Sound quality evaluation and selection of the best candidate are then performed, by the same processing as in the first embodiment, only on the candidate texts thus selected from the plurality of candidate texts by the test.
[0114] This embodiment, too, not only reduces the load of the sound quality evaluation process but also prevents the generation of candidate texts that are unpleasant to listen to, from a viewpoint, the number of morae, different from that of the eighth embodiment. Although the concrete example above tests whether each phrase consists of exactly 5 or 7 morae, it is also possible to accommodate so-called ji-amari (hypermetric) phrases that are not exactly 5 or 7 morae.
[0115] [Tenth Embodiment]
Next, a tenth embodiment of the present invention, obtained by modifying the second embodiment, will be described in detail with reference to the drawings. FIG. 12 is a block diagram showing the configuration of the text-to-speech synthesis system according to the tenth embodiment of the present invention. In the following, matters already described in the second embodiment are omitted.
[0116] Referring to FIG. 12, the text-to-speech synthesis system according to this embodiment differs from the second embodiment in that it comprises, in place of the text candidate generation unit 102, a morphological analysis candidate generation unit 106 that performs morphological analysis on the input text.
[0117] In general, morphological analysis does not yield a unique result; there are multiple morphological analysis candidates, and in text-to-speech synthesis a single optimal analysis result is usually selected from among them and used.
[0118] In the text-to-speech synthesis system according to this embodiment, therefore, the morphological analysis candidate generation unit 106 outputs a plurality of morphological analysis candidates, and the sound quality evaluation score calculation unit 103 assigns a reading to each of these candidates and computes its sound quality evaluation score.
[0119] In the following, the difference from the second embodiment is explained using the example in which the input text is 「東京都に市場がある」 ("there is a market in Tokyo").
[0120] The morphological analysis candidate generation unit 106 performs morphological analysis and outputs four candidate analyses: 「東京都(とうきょうと, noun)/に(particle)/市場(しじょう, noun)/が(particle)/ある(verb)」, 「東(ひがし, noun)/京都(きょうと, noun)/に(particle)/市場(しじょう, noun)/が(particle)/ある(verb)」, 「東京都(とうきょうと, noun)/に(particle)/市場(いちば, noun)/が(particle)/ある(verb)」, and 「東(ひがし, noun)/京都(きょうと, noun)/に(particle)/市場(いちば, noun)/が(particle)/ある(verb)」.
[0121] Here, 「/」 denotes an accent-phrase boundary and 「'」 denotes the accent position.
[0122] Morphological analyses other than these are also possible at this point, but low-probability analyses have been excluded using grammatical knowledge and a morpheme bigram model.
[0123] Corresponding to these morphological analysis results, the sound quality evaluation score calculation unit 103 generates the readings 「とーきょ'ーとに/しじょーがあ'る」, 「ひがしきょ'ーとに/しじょーがあ'る」, 「とーきょ'ーとに/い'ちばが/あ'る」, and 「ひがしきょ'ーとに/い'ちばが/あ'る」, and computes the sound quality evaluation score corresponding to each.
[0124] In this example, none of the candidates is incorrect in the absence of contextual knowledge, so whichever is selected and uttered is acceptable; it therefore becomes possible to use the morphological analysis result selected from the viewpoint of which candidate can be uttered with good sound quality.
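A minimal sketch of how the scoring and selection could combine follows: every candidate reading receives a score and the best-scoring one is kept. The scoring function below is a placeholder assumption standing in for the unit-selection-based score of unit 103; here it simply prefers shorter readings so that the sketch is runnable.

```python
# Select the morphological analysis candidate with the best (highest) score.

candidates = [
    "とーきょ'ーとに/しじょーがあ'る",
    "ひがしきょ'ーとに/しじょーがあ'る",
    "とーきょ'ーとに/い'ちばが/あ'る",
    "ひがしきょ'ーとに/い'ちばが/あ'る",
]

def sound_quality_score(reading: str) -> float:
    # Placeholder: a real implementation would (partially) synthesize the
    # reading and return, e.g., its unit selection score.
    return -len(reading)

best = max(candidates, key=sound_quality_score)
print(best)
```

Only the scoring function differs between the embodiments; the surrounding generate-score-select loop stays the same.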
[0125] [Eleventh Embodiment]
Next, an eleventh embodiment of the present invention, obtained by modifying the second embodiment, will be described in detail with reference to the drawings. FIG. 13 is a block diagram showing the configuration of the text-to-speech synthesis system according to the eleventh embodiment of the present invention. In the following, matters already described in the second embodiment are omitted.
[0126] Referring to FIG. 13, the text-to-speech synthesis system according to this embodiment differs from the second embodiment in that it comprises, in place of the text candidate generation unit 102, a reading generation unit 107 that generates a reading result for the input text, and a pause insertion unit 108 that estimates the positions at which pauses can be inserted into the reading result and generates a plurality of candidate readings with pauses (pause-annotated pronunciation candidates) by expanding the insert/do-not-insert combinations. The sound quality evaluation score calculation unit 103 operates so as to compute a sound quality evaluation score for each of these candidate readings with pauses.
[0127] In the following, the difference from the second embodiment is explained using the example in which the input text is 「私は本屋へ行く」 ("I go to the bookstore").
[0128] The reading generation unit 107 generates the reading result 「わたしわ/ほ'んやえ/いく」. The pause insertion unit 108 estimates that a pause can be inserted at each accent-phrase boundary indicated by 「/」, and generates four candidate readings with pauses (pause-annotated pronunciation candidates): 「わたしわ/ほ'んやえ/いく」, 「わたしわPほ'んやえ/いく」, 「わたしわ/ほ'んやえPいく」, and 「わたしわPほ'んやえPいく」.
[0129] Here, 「/」 denotes an accent-phrase boundary, 「'」 the accent position, and 「P」 a pause position.
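A minimal sketch of the pause expansion follows. Each accent-phrase boundary may independently remain a plain boundary ('/') or become a pause ('P'), so n boundaries yield 2**n candidates. The function name is an illustrative assumption.

```python
from itertools import product

def pause_candidates(reading: str) -> list[str]:
    """Expand every '/' boundary into either '/' or a pause 'P'."""
    parts = reading.split("/")
    n = len(parts) - 1  # number of boundaries
    results = []
    for seps in product("/P", repeat=n):
        out = parts[0]
        for sep, part in zip(seps, parts[1:]):
            out += sep + part
        results.append(out)
    return results

cands = pause_candidates("わたしわ/ほ'んやえ/いく")
for c in cands:
    print(c)  # 4 candidates for the 2 boundaries of the example
```

Each of these candidates would then be scored by the sound quality evaluation score calculation unit and the best one selected.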
[0130] For each of these candidate readings with pauses (pause-annotated pronunciation candidates), the sound quality evaluation score calculation unit 103 computes the corresponding sound quality evaluation score.
[0131] According to this embodiment, since the acoustic environment before and after an accent-phrase boundary changes depending on whether or not a pause is inserted, the combination of pause insertions that yields the best sound quality can be used. In addition, because the computation of the speech quality score reflects the pause-insertion probabilities of the original utterances from which the synthesis units were recorded, the individuality of the original speaker is also expressed.
[0132] [Twelfth Embodiment]
Next, a twelfth embodiment of the present invention, obtained by modifying the second embodiment, will be described in detail with reference to the drawings. FIG. 14 is a block diagram showing the configuration of the text-to-speech synthesis system according to the twelfth embodiment of the present invention. In the following, matters already described in the second embodiment are omitted.
[0133] Referring to FIG. 14, the text-to-speech synthesis system according to this embodiment differs from the second embodiment in that it comprises, in place of the text candidate generation unit 102, a reading generation unit 107 that generates a reading result for the input text, and a reading transformation unit 109 that generates a plurality of candidate readings (pronunciation candidates) by altering the phonemes of the reading result according to rules given in advance. The sound quality evaluation score calculation unit 103 operates so as to compute a sound quality evaluation score for each of these candidate readings.
[0134] In the following, the difference from the second embodiment is explained for the case where the rule that the phoneme 「ひ」 (hi) may be transformed into 「し」 (shi) is given in advance and the input text is 「仕事を一人で引継いだ」 ("I took over the work by myself"). Incidentally, this rule transforming 「ひ」 into 「し」 is a simple transformation rule for speakers of the Edo dialect.
[0135] The reading generation unit 107 assigns and outputs the reading 「しごとを/ひとりで/ひきつ'いだ」 for the input text. Since the phoneme 「ひ」 occurs at two places in this reading, the reading transformation unit 109 generates, from the combinations of changing these to 「し」, four candidate readings (pronunciation candidates): 「しごとを/ひとりで/ひきつ'いだ」, 「しごとを/しとりで/ひきつ'いだ」, 「しごとを/ひとりで/しきつ'いだ」, and 「しごとを/しとりで/しきつ'いだ」.
[0136] Here, 「/」 denotes an accent-phrase boundary and 「'」 the accent position.
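A minimal sketch of the phoneme-level expansion follows, using the single rule ひ→し of the example. Each occurrence of the source phoneme is independently left unchanged or transformed, so k occurrences yield 2**k candidates. Names and defaults are illustrative assumptions.

```python
from itertools import product

def transform_candidates(reading: str, src: str = "ひ", dst: str = "し") -> list[str]:
    """Expand all keep/transform combinations of occurrences of src."""
    positions = [i for i, ch in enumerate(reading) if ch == src]
    results = []
    for choice in product((src, dst), repeat=len(positions)):
        chars = list(reading)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        results.append("".join(chars))
    return results

cands = transform_candidates("しごとを/ひとりで/ひきつ'いだ")
for c in cands:
    print(c)  # 4 candidates for the 2 occurrences of ひ
```

A rule set with several source/target pairs would be applied the same way, expanding each rule's occurrences in turn.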
[0137] For each of these candidate readings (pronunciation candidates), the sound quality evaluation score calculation unit 103 computes the corresponding sound quality evaluation score.
[0138] According to this embodiment, in speech synthesis that uses speech recorded from a speaker who tends to realize a particular phoneme as a different one, that characteristic can be exploited to produce utterances of good sound quality and to reproduce the individuality of the original speaker.
[0139] [Thirteenth Embodiment]
Next, a thirteenth embodiment of the present invention, obtained by modifying the second embodiment, will be described. Since the configuration of this embodiment is the same as that of the second embodiment, the operation of the sound quality evaluation score calculation unit 103, the point in which it differs, will be described in detail.
[0140] In the second embodiment, a unit selection score, an approximation of the sound quality of the synthesized speech, was used as the sound quality evaluation score. In this embodiment, by contrast, the differences in pitch frequency at the end points between segments, the sections into which the synthesized speech is divided, are measured in Hz, and the sum of their absolute values is used as the sound quality evaluation score.
[0141] FIG. 15 shows an example of computing the score when the synthesized speech consists of four segments. Here, a section of an original utterance is assigned to each of segments 1 through 4, and the assigned sections are denoted original utterance 1 through original utterance 4.
[0142] The pitch frequencies on original utterance 1 that correspond to the end points of segment 1 are the starting pitch frequency F1b and the ending pitch frequency F1e. Segments 2 through 4 likewise have pitch frequencies F2b, F2e, F3b, F3e, F4b, and F4e corresponding to their end points.
[0143] Here, at each junction, excluding the start point F1b and the end point F4e of the synthesized speech, the differences between the ending pitch frequency of one segment and the starting pitch frequency of the segment connected to it are summed:

D = |F1e − F2b| + |F2e − F3b| + |F3e − F4b|

and this value is taken as the sound quality evaluation score for that combination of original utterances.
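A minimal sketch of this junction score follows. The segment end-point frequencies are illustrative numbers, and the representation of a segment as a (start, end) pitch pair is an assumption for the sketch.

```python
# Sum of absolute pitch jumps (Hz) at the junctions between consecutive
# segments: D = |F1e-F2b| + |F2e-F3b| + |F3e-F4b| for four segments.

def junction_score(segments: list[tuple[float, float]]) -> float:
    """segments: (start_pitch_hz, end_pitch_hz) per segment, in order."""
    return sum(
        abs(prev_end - next_begin)
        for (_, prev_end), (next_begin, _) in zip(segments, segments[1:])
    )

# (F1b, F1e), (F2b, F2e), (F3b, F3e), (F4b, F4e)
segs = [(120.0, 140.0), (138.0, 150.0), (155.0, 130.0), (128.0, 110.0)]
D = junction_score(segs)
print(D)  # |140-138| + |150-155| + |130-128| = 9.0
```

A smaller D means smaller pitch discontinuities at the joins, i.e. a smoother pitch pattern, so here the *lowest* score would be selected as best.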
[0144] Instead of the sum of the absolute differences of the raw pitch frequencies used above, a sum of squares, a weighted sum, or a computation using logarithmic pitch frequencies may be used. Also, for simplicity, the above example assumes the waveform concatenation method commonly used for waveform generation in speech synthesis, in which the pitch frequency of the original utterance becomes the pitch frequency of the corresponding section of the synthesized speech as-is; however, the method is also applicable when the pitch frequency is edited before use by a waveform editing method. In that case, the same computation can be performed on the edited pitch frequency instead of the pitch frequency of the original utterance.
[0145] As is clear from the above description, the value of the sound quality evaluation score in this embodiment is a score representing the smoothness of the pitch pattern of the synthesized speech, and it can be said to approximate the voice quality of the synthesized speech. Therefore, the method of this embodiment can also generate the synthesized speech having the best sound quality evaluation score.
[0146] [Fourteenth Embodiment]
Next, a fourteenth embodiment of the present invention, obtained by modifying the second embodiment, will be described. Since the configuration of this embodiment is the same as that of the second embodiment, the operation of the sound quality evaluation score calculation unit 103, the point in which it differs, will be described in detail.
[0147] In the second embodiment, a unit selection score, an approximation of the sound quality of the synthesized speech, was used as the sound quality evaluation score. In this embodiment, by contrast, an estimate of the pitch-frequency trajectory is computed in the course of speech synthesis, the difference between that estimate and the pitch frequency of the actual synthesized speech is measured in Hz, and the sum of the absolute values over the entire speech interval is used as the sound quality evaluation score.
[0148] FIG. 16 shows an example of computing the score when the synthesized speech consists of four segments. Here, a section of an original utterance is assigned to each of segments 1 through 4.
[0149] Here, from the pitch frequency fo(t) of the original utterances (original utterance 1 through original utterance 4) assigned to the corresponding times, and the estimated pitch-frequency trajectory ft(t), the sound quality evaluation score is defined as

D = Σ (t = t1, …, t2) |ft(t) − fo(t)|.
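A minimal sketch of this trajectory-deviation score follows, with the trajectory sampled at discrete frames. The sampled values are illustrative numbers.

```python
# Deviation between the estimated pitch trajectory ft(t) and the pitch fo(t)
# realized by the selected units, summed over the speech interval.

def trajectory_score(ft: list[float], fo: list[float]) -> float:
    """D = sum over sampled times t of |ft(t) - fo(t)| (Hz)."""
    assert len(ft) == len(fo)
    return sum(abs(a - b) for a, b in zip(ft, fo))

ft = [120.0, 125.0, 130.0, 128.0]  # estimated trajectory, one value per frame
fo = [118.0, 126.0, 133.0, 128.0]  # pitch of the units actually selected
D = trajectory_score(ft, fo)
print(D)  # 2 + 1 + 3 + 0 = 6.0
```

As with the junction score, a smaller D indicates a closer match to the computed trajectory, so the lowest-scoring candidate would be selected.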
[0150] In this embodiment too, as in the thirteenth embodiment above, a sum of squares, a weighted sum, or a computation using logarithmic pitch frequencies may be used instead of the sum of the absolute differences of the raw pitch frequencies. Also, for simplicity, the above example assumes the waveform concatenation method commonly used for waveform generation in speech synthesis, in which the pitch frequency of the original utterance becomes the pitch frequency of the corresponding section of the synthesized speech as-is; however, the method is also applicable when the pitch frequency is edited before use by a waveform editing method. In that case, the same computation can be performed on the edited pitch frequency instead of the pitch frequency of the original utterance.
[0151] Particularly in the case of the waveform-concatenation speech synthesis method, the pitch frequency of a selected speech unit may differ from the pitch frequency computed for speech synthesis. The value of the sound quality evaluation score in this embodiment is a score representing how faithfully the computed pitch frequency is reproduced, and it can be said to approximate the voice quality of the synthesized speech.
[0152] Therefore, the method of this embodiment can also generate the synthesized speech having the best sound quality evaluation score.
[0153] [Fifteenth Embodiment]
Next, a fifteenth embodiment of the present invention, obtained by modifying the second embodiment, will be described. Since the configuration of this embodiment is the same as that of the second embodiment, the operation of the sound quality evaluation score calculation unit 103, the point in which it differs, will be described in detail.
[0154] In the second embodiment, a unit selection score, an approximation of the sound quality of the synthesized speech, was used as the sound quality evaluation score. In this embodiment, by contrast, the duration of each speech unit is computed in the course of speech synthesis, the difference between that value and the duration of the corresponding unit of the actual synthesized speech is measured in milliseconds, and the sum of the absolute values over the entire speech interval is used as the sound quality evaluation score.
[0155] As noted above, particularly in the case of the waveform-concatenation speech synthesis method, the duration of a selected speech unit may differ from the duration computed for synthesis. The value of the sound quality evaluation score in this embodiment is a score representing how faithfully the computed durations are reproduced, and in this respect it approximates the voice quality of the synthesized speech.
[0156] Therefore, the method of this embodiment can also generate the synthesized speech having the best sound quality evaluation score.
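A minimal sketch of the duration-deviation score of this embodiment follows. The per-unit durations are illustrative numbers.

```python
# Deviation between the durations computed for synthesis and the durations
# of the selected units, in milliseconds, summed over all units.

def duration_score(target_ms: list[float], selected_ms: list[float]) -> float:
    """D = sum over units of |target duration - selected-unit duration|."""
    assert len(target_ms) == len(selected_ms)
    return sum(abs(t - s) for t, s in zip(target_ms, selected_ms))

target = [80.0, 120.0, 95.0, 60.0]    # durations computed for synthesis
selected = [78.0, 125.0, 95.0, 55.0]  # durations of the chosen units
D = duration_score(target, selected)
print(D)  # 2 + 5 + 0 + 5 = 12.0
```

Again the lowest D marks the candidate that most faithfully realizes the computed timing.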
Industrial Applicability
[0157] The fields to which the present invention can be applied largely coincide with those that require speech synthesis technology; the invention is considered to be particularly effective in fields that do not require strict fidelity to the wording of the text, such as spoken dialogue for robots.
[0158] Within the scope of the entire disclosure of the present invention (including the claims), the embodiments and examples may be modified and adjusted based on its basic technical concept. Furthermore, various combinations and selections of the various disclosed elements are possible within the scope of the claims of the present invention.
[0159] For example, although the embodiments above were described using mixed kanji-kana Japanese sentences as the input text, sentences in other languages can also be used as input text by preparing dictionaries in those languages corresponding to the dictionaries shown in the embodiments.

Claims

[1] A text-to-speech synthesis device comprising:
a text candidate generation unit that analyzes input text and generates a plurality of candidate texts consisting of synonymous expressions;
a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each of the candidate texts;
a selection unit that selects, from the plurality of candidate texts, the candidate text having the best sound quality evaluation score; and
a speech synthesis unit that generates synthesized speech corresponding to the selected candidate text.
[2] The text-to-speech synthesis device according to claim 1, further comprising a synonym dictionary that stores synonyms in association with words or phrases, wherein the text candidate generation unit searches the synonym dictionary for synonyms corresponding to the words or phrases contained in the input text and generates the plurality of candidate texts by combinatorially expanding replacement/non-replacement at the word or phrase level.
[3] The text-to-speech synthesis device according to claim 1 or 2, further comprising a synonymous expression dictionary that stores, in association with a given phrase expression, synonymous expressions, i.e., phrase expressions having the same meaning, wherein the text candidate generation unit searches the synonymous expression dictionary for synonymous expressions corresponding to the phrase expressions contained in the input text and generates the plurality of candidate texts by combinatorially expanding replacement/non-replacement at the phrase expression level.
[4] The text-to-speech synthesis device according to any one of claims 1 to 3, wherein the text candidate generation unit generates the plurality of candidate texts by transforming the sentence structure of the input text under the condition that the meaning of the input text is not changed.
[5] The text-to-speech synthesis device according to any one of claims 1 to 4, wherein the text candidate generation unit generates the plurality of candidate texts by combinatorially expanding changing/not changing the politeness level of the words or phrases contained in the input text.
[6] The text-to-speech synthesis device according to any one of claims 1 to 5, further comprising a personality feature dictionary that stores expression variants at the level of individual speakers, wherein the text candidate generation unit searches the personality feature dictionary for other synonymous expressions corresponding to the expressions contained in the input text and generates the plurality of candidate texts by combinatorially expanding replacement/non-replacement according to these individual-level expression variants.
[7] The text-to-speech synthesis device according to any one of claims 1 to 6, further comprising a filler dictionary describing fillers and interjections, wherein the text candidate generation unit refers to the filler dictionary and generates the plurality of candidate texts by combinatorially expanding insertion/non-insertion of the fillers or interjections between the words of the input text.
[8] The text-to-speech synthesis device according to any one of claims 1 to 7, further comprising a filler dictionary describing fillers and interjections, wherein the text candidate generation unit refers to the filler dictionary and generates the plurality of candidate texts by combinatorially expanding deletion/non-deletion of fillers or interjections from the input text.
[9] The text-to-speech synthesis device according to any one of claims 1 to 8, wherein the text candidate generation unit generates the plurality of candidate texts by rewriting part of a word or phrase, or a whole phrase, contained in the input text as a reduplicated expression.
[10] Wherein the text candidate generation unit takes, among the expanded results, only rhyming expressions as candidate texts,
The text-to-speech synthesis device according to any one of claims 1 to 9.
[11] Wherein the text candidate generation unit takes, among the expanded results, only expressions in which every accent phrase consists of exactly 5 or 7 morae as candidate texts,
The text-to-speech synthesis device according to any one of claims 1 to 10.
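The candidate generation of claims 7 through 11 amounts to a combinatorial expansion followed by optional filtering. The sketch below illustrates that idea only; the function names, the toy filler list, and the use of whitespace-separated words as stand-ins for accent phrases are all assumptions for illustration, not taken from the patent.

```python
from itertools import product

# Toy filler list standing in for the filler dictionary (illustrative assumption).
FILLERS = ["um"]

def expand_with_fillers(words, fillers=FILLERS):
    """Enumerate candidate texts by inserting or not inserting a filler
    into each gap between words (claim 7's insertion/non-insertion)."""
    gaps = len(words) - 1
    candidates = []
    for choices in product([None] + fillers, repeat=gaps):
        out = [words[0]]
        for word, gap in zip(words[1:], choices):
            if gap is not None:
                out.append(gap)
            out.append(word)
        candidates.append(" ".join(out))
    return candidates

def filter_by_mora(candidates, mora_count, allowed=(5, 7)):
    """Keep only candidates whose accent phrases all have 5 or 7 morae
    (claim 11); `mora_count` maps an accent phrase to its mora count and
    is assumed to be supplied by the surrounding system."""
    return [c for c in candidates
            if all(mora_count(p) in allowed for p in c.split(" "))]

cands = expand_with_fillers(["hello", "world", "today"])
# Two gaps, one optional filler each, so the expansion yields 2**2 = 4 candidates.
```

A rhyme filter (claim 10) would slot in the same way as `filter_by_mora`: a predicate over the expanded candidates, applied before scoring.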
[12] A text-to-speech synthesis device comprising:
a morphological analysis candidate generation unit that computes a plurality of candidate morphological analysis results from input text;
a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate morphological analysis result;
a selection unit that selects, from the plurality of candidate morphological analysis results, the candidate morphological analysis result having the best sound quality evaluation score; and
a speech synthesis unit that generates synthesized speech corresponding to the selected candidate morphological analysis result.
[13] A text-to-speech synthesis device comprising:
a reading generation unit that generates a reading (pronunciation) from input text;
a pause insertion unit that generates a plurality of candidate readings with pauses (pronunciation candidates with pauses) by combinations of insertion/non-insertion of pauses (phrase breaks) into the reading;
a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate reading with pauses;
a selection unit that selects, from the plurality of candidate readings with pauses, the candidate reading with pauses having the best sound quality evaluation score; and
a speech synthesis unit that generates synthesized speech corresponding to the selected candidate reading with pauses.
[14] A text-to-speech synthesis device comprising:
a reading generation unit that generates a reading (pronunciation) from input text;
a reading transformation unit that generates a plurality of candidate readings (pronunciation candidates) by combinatorially expanding substitution/non-substitution of specific sounds contained in the reading with other sounds;
a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate reading;
a selection unit that selects, from the plurality of candidate readings, the candidate reading having the best sound quality evaluation score; and
a speech synthesis unit that generates synthesized speech corresponding to the selected candidate reading.
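Claims 12 through 14 share one generate/score/select shape: enumerate candidates, score each for sound quality, pick the best, and synthesize it. A minimal sketch of that loop follows; the scorer and synthesizer here are placeholders, and every name in the snippet is an assumption for illustration.

```python
def select_best(candidates, score):
    """Pick the candidate with the best (here: highest) sound quality
    evaluation score, as in the selection units of claims 12-14."""
    return max(candidates, key=score)

def synthesize(candidate):
    # Placeholder for the speech synthesis unit; a real system would
    # return a waveform for the chosen candidate.
    return f"<waveform for {candidate!r}>"

# Toy scorer: prefer shorter candidates (a stand-in for a real sound
# quality evaluation score such as those in claims 15-18).
score = lambda c: -len(c)

best = select_best(["hello world", "hi world"], score)
audio = synthesize(best)
```

The same loop applies unchanged whether the candidates are alternative texts, morphological analyses, pause placements, or sound substitutions; only the candidate generator and the scorer vary.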
[15] Wherein the sound quality evaluation score is a score representing the smoothness of the pitch pattern of the synthesized speech,
The text-to-speech synthesis device according to any one of claims 1 to 14.
[16] Wherein the sound quality evaluation score is a score representing the difference between an estimated pitch pattern and the pitch pattern of the synthesized speech,
The text-to-speech synthesis device according to any one of claims 1 to 15.
[17] Wherein the sound quality evaluation score is a score representing the difference between an estimated rhythm and the rhythm of the synthesized speech,
The text-to-speech synthesis device according to any one of claims 1 to 16.
[18] Wherein the sound quality evaluation score is a score representing the spectral smoothness between segments when the synthesized speech is created,
The text-to-speech synthesis device according to any one of claims 1 to 17.
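The scores of claims 15 and 16 can be illustrated with, for example, a negated sum of squared second differences for smoothness and a negated sum of squared residuals for the pattern difference. These exact formulas are assumptions chosen for illustration; the patent does not commit to a specific formula.

```python
def pitch_smoothness_score(pitch):
    """Score a pitch contour (e.g. Hz per frame): smoother contours score
    higher. Uses the negated sum of squared second differences, so a
    perfectly linear contour scores 0 (claim 15 style)."""
    penalty = sum((pitch[i + 1] - 2 * pitch[i] + pitch[i - 1]) ** 2
                  for i in range(1, len(pitch) - 1))
    return -penalty

def pitch_difference_score(estimated, synthesized):
    """Claim 16 style: negated sum of squared differences between an
    estimated pitch pattern and the synthesized one (0 is a perfect match)."""
    return -sum((a - b) ** 2 for a, b in zip(estimated, synthesized))

linear = [100.0, 110.0, 120.0, 130.0]   # perfectly smooth contour
jagged = [100.0, 130.0, 100.0, 130.0]   # oscillating contour
```

The rhythm score of claim 17 would take the same residual form over phoneme durations, and claim 18's spectral score over spectra at segment joins.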
[19] A program to be executed by a computer constituting a text-to-speech synthesis device, causing the computer to function as:
means for analyzing input text and generating a plurality of candidate texts consisting of synonymous expressions;
means for calculating a sound quality evaluation score for each candidate text;
means for selecting, from the plurality of candidate texts, the candidate text having the best sound quality evaluation score; and
means for generating synthesized speech corresponding to the selected candidate text.
[20] A program to be executed by a computer constituting a text-to-speech synthesis device, causing the computer to function as:
means for computing a plurality of candidate morphological analysis results from input text;
means for calculating a sound quality evaluation score for each candidate morphological analysis result;
means for selecting, from the plurality of candidate morphological analysis results, the candidate morphological analysis result having the best sound quality evaluation score; and
means for generating synthesized speech corresponding to the selected candidate morphological analysis result.
[21] A program to be executed by a computer constituting a text-to-speech synthesis device, causing the computer to function as:
means for generating a reading (pronunciation) from input text;
means for generating a plurality of candidate readings with pauses (pronunciation candidates with pauses) by combinations of insertion/non-insertion of pauses (phrase breaks) into the reading;
means for calculating a sound quality evaluation score for each candidate reading with pauses;
means for selecting, from the plurality of candidate readings with pauses, the candidate reading with pauses having the best sound quality evaluation score; and
means for generating synthesized speech corresponding to the selected candidate reading with pauses.
[22] A program to be executed by a computer constituting a text-to-speech synthesis device, causing the computer to function as:
means for generating a reading (pronunciation) from input text;
means for generating a plurality of candidate readings (pronunciation candidates) by combinatorially expanding substitution/non-substitution of specific sounds contained in the reading with other sounds;
means for calculating a sound quality evaluation score for each candidate reading;
means for selecting, from the plurality of candidate readings, the candidate reading having the best sound quality evaluation score; and
means for generating synthesized speech corresponding to the selected candidate reading.
[23] A text-to-speech synthesis method using a computer, comprising the steps of:
the computer analyzing input text and generating a plurality of candidate texts consisting of synonymous expressions;
the computer calculating a sound quality evaluation score for each candidate text;
the computer selecting, from the plurality of candidate texts, the candidate text having the best sound quality evaluation score; and
the computer generating synthesized speech corresponding to the selected candidate text.
[24] A text-to-speech synthesis method using a computer, comprising the steps of:
the computer computing a plurality of candidate morphological analysis results from input text;
the computer calculating a sound quality evaluation score for each candidate morphological analysis result;
the computer selecting, from the plurality of candidate morphological analysis results, the candidate morphological analysis result having the best sound quality evaluation score; and
the computer generating synthesized speech corresponding to the selected candidate morphological analysis result.
[25] A text-to-speech synthesis method using a computer, comprising the steps of:
the computer generating a reading (pronunciation) from input text;
the computer generating a plurality of candidate readings with pauses (pronunciation candidates with pauses) by combinations of insertion/non-insertion of pauses (phrase breaks) into the reading;
the computer calculating a sound quality evaluation score for each candidate reading with pauses;
the computer selecting, from the plurality of candidate readings with pauses, the candidate reading with pauses having the best sound quality evaluation score; and
the computer generating synthesized speech corresponding to the selected candidate reading with pauses.
[26] A text-to-speech synthesis method using a computer, comprising the steps of:
the computer generating a reading (pronunciation) from input text;
the computer generating a plurality of candidate readings (pronunciation candidates) by combinatorially expanding substitution/non-substitution of specific sounds contained in the reading with other sounds;
the computer calculating a sound quality evaluation score for each candidate reading;
the computer selecting, from the plurality of candidate readings, the candidate reading having the best sound quality evaluation score; and
the computer generating synthesized speech corresponding to the selected candidate reading.
PCT/JP2007/071285 2006-11-08 2007-11-01 Text-to-speech synthesis device, program and text-to-speech synthesis method WO2008056590A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2008543045A JPWO2008056590A1 (en) 2006-11-08 2007-11-01 Text-to-speech synthesizer, program thereof, and text-to-speech synthesis method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2006302954 2006-11-08
JP2006-302954 2006-11-08

Publications (1)

Publication Number Publication Date
WO2008056590A1 true WO2008056590A1 (en) 2008-05-15

Family

ID=39364408

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2007/071285 WO2008056590A1 (en) 2006-11-08 2007-11-01 Text-to-speech synthesis device, program and text-to-speech synthesis method

Country Status (2)

Country Link
JP (1) JPWO2008056590A1 (en)
WO (1) WO2008056590A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003302993A (en) * 2002-04-10 2003-10-24 Canon Inc Voice synthesizer, voice synthesis method, program and storage device
JP2004118004A (en) * 2002-09-27 2004-04-15 Asahi Kasei Corp Voice synthesizer
JP2004126205A (en) * 2002-10-02 2004-04-22 Nippon Telegr & Teleph Corp <Ntt> Method, device, and program for voice synthesis
JP2006227425A (en) * 2005-02-18 2006-08-31 National Institute Of Information & Communication Technology Speech reproducing device and utterance support device
JP2006243104A (en) * 2005-03-01 2006-09-14 Canon Inc Speech synthesizing method

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010092710A1 (en) * 2009-02-16 2010-08-19 株式会社東芝 Speech processing device, speech processing method, and speech processing program
US8650034B2 (en) 2009-02-16 2014-02-11 Kabushiki Kaisha Toshiba Speech processing device, speech processing method, and computer program product for speech processing
JP2014044289A (en) * 2012-08-27 2014-03-13 Animo:Kk Program, method and device for formatting text
JP2014048443A (en) * 2012-08-31 2014-03-17 Nippon Telegr & Teleph Corp <Ntt> Voice synthesis system, voice synthesis method, and voice synthesis program
JP2016004267A (en) * 2014-06-19 2016-01-12 バイドゥ オンライン ネットワーク テクノロジー(ペキン) カンパニー リミテッド Method and apparatus for speech synthesis based on large corpus
KR20190057792A (en) * 2017-11-20 2019-05-29 주식회사 마인즈랩 System for generating a sentence for machine learning and method for generating a similar sentence using thereof
KR102102388B1 (en) * 2017-11-20 2020-04-21 주식회사 마인즈랩 System for generating a sentence for machine learning and method for generating a similar sentence using thereof
JP2020060642A (en) * 2018-10-09 2020-04-16 株式会社日立ソリューションズ・テクノロジー Speech synthesis system and speech synthesizer
JP7110055B2 (en) 2018-10-09 2022-08-01 株式会社日立ソリューションズ・テクノロジー Speech synthesis system and speech synthesizer

Also Published As

Publication number Publication date
JPWO2008056590A1 (en) 2010-02-25

Similar Documents

Publication Publication Date Title
KR102199067B1 (en) Method of multilingual text-to-speech synthesis
JP4302788B2 (en) Prosodic database containing fundamental frequency templates for speech synthesis
US20100057435A1 (en) System and method for speech-to-speech translation
JP2008545995A (en) Hybrid speech synthesizer, method and application
Kayte et al. Hidden Markov model based speech synthesis: A review
JP5198046B2 (en) Voice processing apparatus and program thereof
JPH11249677A (en) Rhythm control method for voice synthesizer
JP4704254B2 (en) Reading correction device
WO2008056590A1 (en) Text-to-speech synthesis device, program and text-to-speech synthesis method
JP2006227589A (en) Device and method for speech synthesis
Badino et al. Language independent phoneme mapping for foreign TTS
HIROSE et al. A system for the synthesis of high-quality speech from texts on general weather conditions
KR0146549B1 (en) Korean language text acoustic translation method
Black Speech synthesis for educational technology
JP7406418B2 (en) Voice quality conversion system and voice quality conversion method
Bonafonte et al. The UPC TTS system description for the 2008 blizzard challenge
Janyoi et al. An Isarn dialect HMM-based text-to-speech system
Anberbir et al. Grapheme-to-phoneme conversion for Amharic text-to-speech system
Iyanda et al. Development of a Yorúbà Textto-Speech System Using Festival
Williams Text-to-speech synthesis for Welsh and Welsh English.
Mustafa et al. EM-HTS: real-time HMM-based Malay emotional speech synthesis.
Cabral et al. The ADAPT entry to the Blizzard Challenge 2016
Delić et al. A Review of AlfaNum Speech Technologies for Serbian, Croatian and Macedonian
Khalil et al. Optimization of Arabic database and an implementation for Arabic speech synthesis system using HMM: HTS_ARAB_TALK
Sawada et al. Constructing text-to-speech systems for languages with unknown pronunciations

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07831019

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2008543045

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07831019

Country of ref document: EP

Kind code of ref document: A1