WO2008056590A1 - Text-to-speech synthesis device, program and text-to-speech synthesis method - Google Patents

Text-to-speech synthesis device, program and text-to-speech synthesis method Download PDF

Info

Publication number
WO2008056590A1
WO2008056590A1 PCT/JP2007/071285 JP2007071285W
Authority
WO
WIPO (PCT)
Prior art keywords
candidate
text
sound quality
quality evaluation
evaluation score
Prior art date
Application number
PCT/JP2007/071285
Other languages
French (fr)
Japanese (ja)
Inventor
Reishi Kondo
Shinichi Doi
Original Assignee
Nec Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nec Corporation filed Critical Nec Corporation
Priority to JP2008543045A priority Critical patent/JPWO2008056590A1/en
Publication of WO2008056590A1 publication Critical patent/WO2008056590A1/en

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/06Elementary speech units used in speech synthesisers; Concatenation rules

Definitions

  • the present invention relates to a text-to-speech synthesis technology, and more particularly to a text-to-speech synthesis technology having a function of arranging input text within a range that does not impair the meaning of the text and making it a synthesized speech.
  • Text-to-speech synthesis technology generates speech that reads out the content of given text, which in the case of Japanese is mainly text composed of kanji and kana.
  • A widely used method first estimates the reading of the input text to generate a phonetic symbol string, and then generates speech for that string.
  • To generate speech from phonetic symbol strings, a waveform concatenation method that cuts and pastes pre-recorded natural utterances, or a parametric method that generates a time series of parameters representing acoustic features and then generates speech from those parameters, is typically used.
  • Japanese Patent Application Laid-Open No. 2004-258406 discloses a text-to-speech synthesis system that can generate a natural prosodic pattern (phonetic symbol string) even if dependency analysis is incorrect.
  • Japanese Patent Laid-Open No. 2004-138728 discloses a speech synthesizer in which a basic pitch pattern, one form of prosodic information, is translated in the frequency direction while its outline is maintained to obtain modified pitch patterns; a speech unit sequence matching the modified pitch pattern and the linguistic information is selected; and the modified pitch pattern that best matches the selected speech unit sequence is combined with that unit sequence to produce the synthesized speech. By providing prosodic information, speech units, and prosody control means in this way, the pitch deformation amount of the selected speech units is suppressed, and synthesized speech without sudden changes in sound quality can be created.
  • Patent Document 1 Japanese Patent Application Laid-Open No. 2004-258406
  • Patent Document 2 Japanese Patent Application Laid-Open No. 2004-138728
  • Non-Patent Document 1 Shikano, Ito, Kawara, Takeda, Yamamoto “IT Text Speech Recognition System”, Ohm, 2001, p. 13-14
  • Patent Documents 1 and 2 and Non-Patent Document 1 described above are incorporated herein by reference.
  • the following is an analysis of the related art according to the present invention.
  • Many of the above-described conventional techniques mainly aim at generating natural-sounding speech, and have the problem that the resulting synthesized speech is not necessarily beautiful.
  • the present invention has been made paying attention to this point, and an object of the present invention is to provide a text-to-speech synthesis technique capable of improving the quality of synthesized speech to be generated.
  • According to a first aspect of the present invention, there are provided a text-to-speech synthesizer comprising: a text candidate generation unit that analyzes input text and generates a plurality of candidate texts having synonymous expressions; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each of the candidate texts; a selection unit that selects the candidate text having the best sound quality evaluation score from the plurality of candidate texts; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate text; a program for realizing the text-to-speech synthesizer; and a text-to-speech synthesis method implemented using the text-to-speech synthesizer.
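The overall flow of this aspect — generate synonymous candidates, score each, select the best, synthesize — can be sketched as follows. This is a minimal illustration with hypothetical stand-in functions, not the patented implementation:

```python
def synthesize_best(text, generate_candidates, score, synthesize):
    """Generate synonymous candidate texts, score each for expected
    sound quality, and synthesize speech for the best-scoring one."""
    candidates = generate_candidates(text)   # text candidate generation unit
    best = max(candidates, key=score)        # score calculation + selection units
    return synthesize(best)                  # speech synthesis unit

# Toy stand-ins: two fixed candidates, a score that favors shorter text,
# and a synthesizer that just tags its input.
result = synthesize_best(
    "start quickly",
    generate_candidates=lambda t: [t, "commence rapidly"],
    score=lambda t: -len(t),
    synthesize=lambda t: f"<speech:{t}>",
)
```

The real units would share state (e.g. morphological analyses, segment sequences), but the selection itself is just an argmax over the score.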
  • According to another aspect of the present invention, there are provided a text-to-speech synthesizer comprising: a morpheme analysis candidate generation unit that calculates a plurality of candidate morpheme analysis results from input text; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate morpheme analysis result; a selection unit that selects the candidate morpheme analysis result having the best sound quality evaluation score from the plurality of candidates; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate morpheme analysis result; a program for realizing the text-to-speech synthesizer; and a text-to-speech synthesis method implemented using the text-to-speech synthesizer.
  • According to another aspect of the present invention, there are provided a text-to-speech synthesizer comprising: a reading generation unit that generates a reading result from input text; a pause insertion unit that generates a plurality of candidate reading results with pauses by expanding all combinations of insertion/non-insertion of a pause into the reading result; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate reading result with pauses; a selection unit that selects, from the plurality of candidates, the candidate reading result with pauses having the best sound quality evaluation score; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate; a program for realizing the text-to-speech synthesizer; and a text-to-speech synthesis method implemented using the text-to-speech synthesizer.
  • According to another aspect of the present invention, there are provided a text-to-speech synthesizer comprising: a reading generation unit that generates a reading result from input text; a reading deformation unit that generates a plurality of candidate reading results by expanding all combinations of replacement/non-replacement of specific sounds in the reading result with other sounds; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate reading result; a selection unit that selects the candidate reading result having the best sound quality evaluation score from the plurality of candidates; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate reading result; a program for realizing the text-to-speech synthesizer; and a text-to-speech synthesis method implemented using the text-to-speech synthesizer.
  • According to the present invention, the quality of synthesized speech can be dramatically improved.
  • the reason is that the candidate having the best sound quality evaluation score is selected from among a plurality of candidates created from the input text.
  • FIG. 1 is a block diagram showing a configuration of a text-to-speech synthesis system according to a first embodiment of the present invention.
  • FIG. 2 is a diagram showing entries stored in the synonym dictionary of the text-to-speech synthesis system according to the first embodiment of the present invention.
  • FIG. 3 is a diagram for explaining the operation of a text candidate generation unit of the text-to-speech synthesis system according to the first embodiment of the present invention.
  • FIG. 6 is a block diagram showing the configuration of the text-to-speech synthesis system according to the fifth embodiment of the present invention.
  • FIG. 7 is a diagram showing entries stored in the personality (feature) dictionary of the text-to-speech synthesis system according to the fifth embodiment of the present invention.
  • FIG. 8 is a diagram for explaining the operation of a text candidate generation unit of a text-to-speech synthesis system according to a fifth embodiment of the present invention.
  • FIG. 9 is a block diagram showing the configuration of the text-to-speech synthesis system according to the sixth embodiment of the present invention.
  • FIG. 10 is a diagram showing entries stored in the filler dictionary of the text-to-speech synthesis system according to the sixth embodiment of the present invention.
  • FIG. 11 is a diagram for explaining the operation of the text candidate generation unit of the text speech synthesis system according to the sixth embodiment of the present invention.
  • FIG. 13 is a block diagram showing the configuration of the text-to-speech synthesis system according to the eleventh embodiment of the present invention.
  • FIG. 15 is a diagram for explaining a method for calculating a sound quality evaluation score in the text-to-speech synthesis system according to the thirteenth embodiment of the present invention.
  • FIG. 16 is a diagram for explaining a method for calculating a sound quality evaluation score in the text-to-speech synthesis system according to the fourteenth embodiment of the present invention.
  • According to the present invention, it is possible to obtain a text-to-speech synthesizer comprising: a text candidate generation unit that analyzes input text and generates a plurality of candidate texts having synonymous expressions; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate text; a selection unit that selects the candidate text having the best sound quality evaluation score from the plurality of candidate texts; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate text.
  • The text-to-speech synthesizer may further include a synonym dictionary in which synonyms are stored in association with each other in units of words or phrases. The text candidate generation unit retrieves from the synonym dictionary the synonyms corresponding to words or phrases in the input text, and generates a plurality of candidate texts by expanding all combinations of substitution/non-substitution at the word or phrase level.
  • The text-to-speech synthesizer may further include a synonymous expression dictionary that stores, for a given phrase expression, phrase expressions having the same meaning. The text candidate generation unit searches the synonymous expression dictionary for expressions corresponding to phrase expressions in the input text, and generates a plurality of candidate texts by expanding all combinations of substitution/non-substitution at the phrase expression level.
  • The text candidate generation unit can also be operated to generate the plurality of candidate texts by modifying the sentence structure of the input text, under the condition that the meaning of the input text is not changed.
  • The text candidate generation unit can also be operated to generate a plurality of candidate texts by expanding all combinations of changing/not changing the politeness level of words or phrases in the input text.
  • The text-to-speech synthesizer may further include a personality feature dictionary that stores individual-level differences in expression in association with each other. The text candidate generation unit searches the personality feature dictionary for alternative expressions corresponding to expressions in the input text, and generates a plurality of candidate texts by expanding all combinations of substitution/non-substitution of these individual-level expression differences.
  • The text-to-speech synthesizer may further include a filler dictionary describing fillers and interjections. The text candidate generation unit refers to the filler dictionary and generates a plurality of candidate texts by expanding all combinations of inserting/not inserting a filler or interjection between words of the input text.
  • Alternatively, the text candidate generation unit may refer to the filler dictionary and generate a plurality of candidate texts by expanding all combinations of deleting/not deleting fillers or interjections found in the input text.
  • The text candidate generation unit is also operable to generate the plurality of candidate texts by changing some or all of the words or phrases in the input text into repeated expressions.
  • The text candidate generation unit can be operated so that, among the expanded results, only expressions that rhyme are used as candidate texts.
  • The text candidate generation unit can be operated so that, among the expanded results, only expressions in which every accent phrase consists of 5 or 7 morae are used as candidate texts.
  • According to the present invention, it is also possible to obtain a text-to-speech synthesizer comprising: a morpheme analysis candidate generation unit that calculates a plurality of candidate morpheme analysis results from input text; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate morpheme analysis result; a selection unit that selects the candidate morpheme analysis result having the best sound quality evaluation score from the plurality of candidates; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate morpheme analysis result.
  • According to the present invention, it is also possible to obtain a text-to-speech synthesizer comprising: a reading generation unit that generates a reading result from input text; a pause insertion unit that generates a plurality of candidate reading results with pauses by expanding all combinations of insertion/non-insertion of a pause into the reading result; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate reading result with pauses; a selection unit that selects, from the plurality of candidates, the candidate reading result with pauses having the best sound quality evaluation score; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate.
  • According to the present invention, it is also possible to obtain a text-to-speech synthesizer comprising: a reading generation unit that generates a reading result from input text; a reading deformation unit that generates a plurality of candidate reading results by expanding all combinations of replacement/non-replacement of specific sounds in the reading result with other sounds; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate reading result; a selection unit that selects the candidate reading result having the best sound quality evaluation score from the plurality of candidates; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate reading result.
  • As the sound quality evaluation score, a score representing the smoothness of the pitch pattern of the synthesized speech can be used.
  • As the sound quality evaluation score, a score representing the difference between an estimated pitch pattern and the pitch pattern of the synthesized speech can also be used.
  • As the sound quality evaluation score, a score representing the difference between an estimated rhythm and the rhythm of the synthesized speech can also be used.
  • According to the present invention, a text-to-speech synthesis method and a program for carrying out the method can also be obtained.
  • FIG. 1 is a block diagram showing the configuration of the text-to-speech synthesis system according to the first embodiment of the present invention.
  • the text-to-speech synthesis system according to the present embodiment includes a text input unit 101, a text candidate generation unit 102, a synonym dictionary 201, a sound quality evaluation score calculation unit 103, and a selection unit 104. , And a voice synthesizer 105.
  • the text input unit 101 is means for inputting text.
  • The present invention can be applied to any language; in this embodiment, however, the description assumes Japanese text written in mixed kanji and kana.
  • the text candidate generation unit 102 is means for generating a plurality of candidate texts by referring to the synonym dictionary 201 and transforming the text into another expression that does not change the meaning.
  • FIG. 2 is a diagram showing entries stored in the synonym dictionary 201, in which synonyms are stored in association with each other in units of words or phrases.
  • The text candidate generation unit 102 morphologically analyzes the text as shown in FIG. 3, searches the synonym dictionary 201 for each morpheme, and, if a matching entry exists, marks the corresponding morpheme as replaceable with the corresponding synonym.
  • The text candidate generation unit 102 then generates a plurality of candidate texts by expanding all combinations of replacing/not replacing each replaceable morpheme of the text.
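The substitution/non-substitution expansion over marked morphemes can be sketched as a Cartesian product. The dictionary contents and morpheme segmentation below are illustrative stand-ins, not entries from the patent's synonym dictionary 201:

```python
from itertools import product

# Hypothetical synonym dictionary: morpheme -> registered synonyms.
SYNONYMS = {"start": ["begin", "commence"], "quickly": ["rapidly"]}

def candidate_texts(morphemes):
    """For each morpheme, the options are the morpheme itself plus any
    registered synonyms; every combination of per-morpheme choices
    yields one candidate text (cf. FIG. 3)."""
    options = [[m] + SYNONYMS.get(m, []) for m in morphemes]
    return [" ".join(choice) for choice in product(*options)]

cands = candidate_texts(["we", "start", "quickly"])
# 1 x 3 x 2 = 6 candidate texts
```

Note the candidate count is the product of the option counts, so it grows exponentially with the number of replaceable morphemes — which is why the embodiment later discusses memory-limited variants.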
  • the sound quality evaluation score calculation unit 103 is a means for calculating a sound quality evaluation score for each of the plurality of candidate texts, and details thereof will be described later.
  • the selection unit 104 is a means for selecting one candidate text having the best corresponding sound quality evaluation score from the plurality of candidate texts.
  • the speech synthesizer 105 is means for generating synthesized speech corresponding to the selected candidate text.
  • the speech synthesizer 105 is assumed to be a waveform editing type text speech synthesizer that generates synthesized speech by connecting and editing prerecorded natural speech.
  • the sound quality evaluation method in the text-to-speech synthesis system according to the present embodiment will be described in detail.
  • The sound quality evaluation score is an objective evaluation value of the speech quality of the synthesized sound that results from synthesizing the candidate text.
  • Various scores used within the speech synthesis process can serve as approximations of this speech quality evaluation value; in the present invention these approximations are also referred to as sound quality evaluation scores.
  • In the following, the case where the unit connection score corresponding to the segment sequence selected by the speech synthesis unit is used as the sound quality evaluation score will be explained.
  • The unit connection score d(i) is the negation of the sum of the weighted Euclidean distances of the 10-dimensional cepstra at each unit connection boundary; writing c_j(k) and c'_j(k) for the k-th cepstrum coefficients on either side of the j-th boundary, d(i) = -Σ_j sqrt( Σ_{k=1..10} a(k)·(c_j(k) - c'_j(k))² ).
  • Here a(k) is a weight given in advance.
  • the unit connection score is calculated by the sound quality evaluation score calculation unit 103 and the speech synthesis unit 105 by the same method.
  • This unit connection score d(i) can be regarded as a score representing the smoothness of the spectrum across the segment boundaries of the synthesized speech: the smaller the cepstral difference at each unit connection point, the higher d(i) becomes, and the higher the resulting sound quality.
  • the weighted Euclidean distance of the 10-dimensional cepstrum is used as the unit connection score, but the number of dimensions can be adjusted as appropriate.
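As a sketch, the negated sum of weighted Euclidean cepstral distances over unit boundaries might be computed as follows. The vectors and weights are illustrative; in a real system each unit would carry boundary cepstra extracted from the recorded speech:

```python
import math

def unit_connection_score(boundary_pairs, weights):
    """d(i): negated sum, over all unit connection boundaries, of the
    weighted Euclidean distance between the cepstra on either side of
    each boundary. A smoother join (smaller cepstral jump) scores higher."""
    total = 0.0
    for left, right in boundary_pairs:
        total += math.sqrt(sum(w * (a - b) ** 2
                               for w, a, b in zip(weights, left, right)))
    return -total

# One boundary in 3 dimensions (10 in the text), unit weights.
w = [1.0, 1.0, 1.0]
smooth = unit_connection_score([([1, 2, 3], [1, 2, 3])], w)  # identical cepstra
rough = unit_connection_score([([1, 2, 3], [4, 2, 3])], w)   # spectral jump
```

With identical boundary cepstra the score is 0 (the maximum); any spectral discontinuity makes it negative, so maximizing the score prefers the smoothest segment sequence.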
  • Instead of the cepstrum, the Δcepstrum (the time-direction differential of the cepstrum), the MFCC (Mel-Frequency Cepstrum Coefficient; see Non-Patent Document 1), or the ΔMFCC (the time-direction differential of the MFCC) can be used with an appropriate number of dimensions, and a vector combining several of these values can of course also be used.
  • When synthesizing speech for a given text, the waveform-editing text-to-speech synthesizer can choose among a plurality of natural-speech segment sequences, and it operates to select the segment sequence that maximizes the unit connection score.
  • As a result, synthesized speech that has the same content as the given text and the best sound quality evaluation score, as approximated by the unit connection score used in the speech synthesis unit, can be generated.
  • the text candidate generation unit 102 can be integrated with a text analysis process that forms part of the speech synthesis unit 105.
  • In that case, the morphological analysis results coincide, improving accuracy and reducing the amount of computation.
  • the selection unit 104 outputs candidate text, and the speech synthesis unit 105 synthesizes speech by using the candidate text as input.
  • Until the sound quality evaluation score is generated, the operations of the text candidate generation unit 102 and the speech synthesis unit 105 overlap.
  • The overlapping part can therefore be performed by the text candidate generation unit 102 alone, with its intermediate result output as speech synthesis data.
  • In that case, the selection unit 104 outputs the speech synthesis data (generated by the text candidate generation unit 102) corresponding to the selected candidate text, and the speech synthesis unit 105 synthesizes speech from the selected data. As a result, no operation is duplicated and the amount of computation can be reduced.
  • The text candidate generation unit 102 obtains all possible combinations of replaceability for each morpheme of the text, assuming there are no memory restrictions (FIG. 3). If memory is limited, it may instead obtain only as many combinations as fit within the restriction, or generate text candidates one at a time while retaining only those with high sound quality evaluation scores, and select the text from among those.
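The one-at-a-time strategy under memory limits can be sketched with a bounded heap. The keep size and the scoring function here are illustrative choices, not values from the patent:

```python
import heapq

def top_candidates(candidate_iter, score, keep=5):
    """Consume candidate texts one by one, retaining only the `keep`
    best by sound quality score, so the full combination set is never
    materialized in memory."""
    heap = []  # min-heap of (score, tiebreak, text); worst kept item on top
    for i, text in enumerate(candidate_iter):
        entry = (score(text), i, text)
        if len(heap) < keep:
            heapq.heappush(heap, entry)
        else:
            heapq.heappushpop(heap, entry)  # drop the worst of keep+1
    return [t for _, _, t in sorted(heap, reverse=True)]

# Toy run: score = length, keep the two longest "candidates".
best = top_candidates(iter(["a", "abc", "ab", "abcd"]), score=len, keep=2)
```

Memory use is O(keep) regardless of how many combinations the expansion would produce.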
  • FIG. 5 is a block diagram showing the configuration of the text-to-speech synthesis system according to the second embodiment of the present invention.
  • the items described in the first embodiment are omitted.
  • The text candidate generation unit 102 of the present embodiment generates candidate texts by transforming the grammar of the input text into synonymous expressions, based on synonymous-expression transformation rules described in advance in the text candidate generation unit 102. Specific examples are described below.
  • For example, the text candidate generation unit 102 replaces the pair of a noun such as "transformation" and the auxiliary verb "do" with the corresponding verb phrase, e.g. "transform."
  • Thus, when the input text is "perform word replacement," the text candidate generation unit 102 also adopts the verb-phrase form "replace the word" as a candidate text.
  • The text candidate generation unit 102 applies the transformation rules by simple string matching; accuracy can, however, be improved by performing morphological analysis and confirming the part of speech of the matched expression.
  • The text candidate generation unit 102 of this embodiment generates expanded candidate texts by transforming the sentence structure of the input text.
  • The first type of sentence-structure transformation in the text candidate generation unit 102 is transformation into an inversion/emphasis construction. For example, if the input text is "A to B," the text candidate generation unit 102 also adopts the inverted/emphasized form "B is A" as a candidate text.
  • A second type of sentence-structure transformation in the text candidate generation unit 102 is transformation into a noun-ending (taigen-dome) construction. For example, when the input text is "I am the one who posted," the text candidate generation unit 102 also adopts the noun-ending variant of the sentence as a candidate text.
  • A third type of sentence-structure transformation in the text candidate generation unit 102 is transformation by reordering particle phrases (scrambling). For example, if the input text is "I read a book with my brother," the text candidate generation unit 102 also adopts as candidate texts the variants obtained by permuting the particle phrases, such as "With my brother, I read a book" and "A book, I read with my brother."
  • A fourth type of sentence-structure transformation in the text candidate generation unit 102 is transformation by exchanging a noun form and a verb form. For example, the text candidate generation unit 102 also adopts the variant obtained by this exchange (e.g. "perform word replacement" versus "replace the word") as a candidate text.
  • A fifth type of sentence-structure transformation in the text candidate generation unit 102 is transformation into the passive voice. For example, the text candidate generation unit 102 also adopts the passive form, e.g. "She was called by him," as a candidate text.
  • After a plurality of candidate texts has been generated in this way, sound quality evaluation and selection of the best candidate are performed by the same processing as in the first embodiment.
  • The text candidate generation unit 102 of this embodiment generates expanded candidate texts by transforming the politeness level of the input text.
  • A specific example of this transformation follows.
  • When the input text is "I bought a book," the text candidate generation unit 102 also adopts a more polite rendering of the same sentence as a candidate text (in Japanese, a distinction carried mainly by the verb ending).
  • FIG. 6 is a block diagram showing a configuration of a text-to-speech synthesis system according to the fifth embodiment of the present invention, and includes a personality (feature) dictionary 202.
  • the items described in the first embodiment are omitted.
  • The text candidate generation unit 102 of the present embodiment refers to the personality (feature) dictionary 202 and generates a plurality of candidate texts by rewriting the input text into individually characteristic expressions.
  • FIG. 7 shows entries stored in the personality (feature) dictionary 202; for example, "ketta," a term used in the Nagoya region for "bicycle," is stored as a replaceable vocabulary item for the entry "bicycle."
  • The text candidate generation unit 102 performs morphological analysis on the text, searches the personality (feature) dictionary 202 for each morpheme, and, if a matching entry exists, marks the corresponding morpheme as replaceable with the associated individual-level morpheme (replaceable vocabulary).
  • FIG. 9 is a block diagram showing the configuration of the text-to-speech synthesis system according to the sixth embodiment of the present invention, which includes a filler dictionary 203.
  • the items described in the first embodiment will be omitted.
  • The text candidate generation unit 102 of this embodiment refers to the filler dictionary 203 and generates a plurality of candidate texts by expanding all combinations of inserting or not inserting fillers between words of the input text.
  • FIG. 10 is a diagram showing entries stored in the filler dictionary 203.
  • Fillers such as "ēto" and "gohon gohon" are so-called hesitation expressions, words used interjectionally in conversation, and such words are stored in the dictionary in association with each other. In the case of English, for example, "you know," "I mean," and "well" would be stored.
  • The text candidate generation unit 102 performs morphological analysis on the text as shown in FIG. 11 and expands all possibilities of inserting the contents of the filler dictionary 203 between morphemes. As shown in the lower part of FIG. 11, for a morpheme that matches an entry in the filler dictionary 203, the unit also expands the combinations of replacing it with other fillers or deleting it.
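The insertion/non-insertion expansion at each morpheme gap can be sketched as follows. The filler inventory and the word segmentation are illustrative English stand-ins for the dictionary of FIG. 10:

```python
from itertools import product

FILLERS = ["you know", "I mean"]  # stand-in filler dictionary entries

def filler_candidates(morphemes, fillers=FILLERS):
    """At every gap between adjacent morphemes, either insert one filler
    or insert nothing; every combination of per-gap choices yields one
    candidate text."""
    gaps = len(morphemes) - 1
    results = []
    for choices in product([None] + fillers, repeat=gaps):
        parts = [morphemes[0]]
        for filler, morpheme in zip(choices, morphemes[1:]):
            if filler is not None:
                parts.append(filler)
            parts.append(morpheme)
        results.append(" ".join(parts))
    return results

cands = filler_candidates(["it", "really", "works"])
# 2 gaps x 3 choices per gap -> 9 candidate texts
```

The replacement/deletion expansion for morphemes that already match filler entries would add further per-morpheme options in the same combinatorial fashion.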
  • the text candidate generation unit 102 generates expanded candidate text by repeating a phrase in the input text a plurality of times.
  • A specific example of this repetition transformation follows.
  • the maximum number of repetitions and the total number of repetitions for the entire sentence can be set in the text candidate generation unit 102 in advance.
  • The text candidate generation unit 102 tests whether each expanded result rhymes, excludes those that do not, and outputs only the rhyming results as candidate texts.
  • the sound quality evaluation and the selection of the best candidate are performed by the same processing as in the first embodiment.
  • This reduces the load of the sound quality evaluation process and has the effect of preventing the generation of unnatural candidate texts.
  • Here a test based on one type of rhyme is exemplified, but the same kind of test can be performed for other types, such as alliteration.
  • As an example, suppose the synonym dictionary contains synonym pairs covering "communication network"/"contact network" and "was filled with surprise"/"was astounded," and the text "The boring contact network was filled with surprise" is input, yielding four candidate texts.
  • Suppose the number of morae per phrase of each candidate text is: (1) 5+7+5+5 morae, (2) 5+8+5+5 morae, (3) 5+7+5+5 morae, and (4) 5+8+5+5 morae.
  • Since only candidates (1) "The boring communication network was filled with surprise" and (3) "The boring contact network was astounded" consist entirely of phrases of 5 or 7 morae, the text candidate generation unit 102 outputs these two as candidate texts.
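The 5-or-7-mora filter over the four candidates can be sketched directly from the per-phrase mora counts given above (the counts are reproduced from the example; the filtering rule is the only added logic):

```python
def all_phrases_5_or_7(mora_counts):
    """A candidate survives only if every accent phrase has 5 or 7 morae."""
    return all(n in (5, 7) for n in mora_counts)

# Per-phrase mora counts of candidates (1)-(4) from the example.
candidates = {1: [5, 7, 5, 5], 2: [5, 8, 5, 5],
              3: [5, 7, 5, 5], 4: [5, 8, 5, 5]}
kept = sorted(k for k, counts in candidates.items()
              if all_phrases_5_or_7(counts))
# -> candidates (1) and (3) survive, matching the text
```

A real implementation would need a Japanese mora counter over each accent phrase; here the counts are supplied as data.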
  • the sound quality evaluation and the selection of the best candidate are performed by the same processing as in the first embodiment.
  • FIG. 12 is a block diagram showing the configuration of the text-to-speech synthesis system according to the tenth embodiment of the present invention.
  • the items described in the second embodiment are omitted.
  • The text-to-speech synthesis system of this embodiment differs from the second embodiment in that the text candidate generation unit 102 is replaced by a morpheme analysis candidate generation unit 106 that performs morphological analysis on the input text.
  • The morpheme analysis candidate generation unit 106 outputs a plurality of morpheme analysis candidates, and the sound quality evaluation score calculation unit 103 generates the reading of each candidate and assigns it a sound quality evaluation score.
  • for an input such as "Tokyo ni shijō ga aru" ("There is a market in Tokyo"), the morphological analysis candidate generation unit 106 performs morphological analysis and outputs four candidates: "Tokyo (noun) / ni (particle) / market (shijō, noun) / ga (particle) / aru (verb)", "higashi (noun) / Kyoto (kyōto, noun) / ni (particle) / market (shijō, noun) / ga (particle) / aru (verb)", "Tokyo (noun) / ni (particle) / market (ichiba, noun) / ga (particle) / aru (verb)", and "higashi (noun) / Kyoto (kyōto, noun) / ni (particle) / market (ichiba, noun) / ga (particle) / aru (verb)".
  • the sound quality evaluation score calculation unit 103 generates the readings "tōkyō-ni / shijō-ga / aru", "higashi-kyōto-ni / shijō-ga / aru", "tōkyō-ni / ichiba-ga / aru", and "higashi-kyōto-ni / ichiba-ga / aru", and calculates the sound quality evaluation score corresponding to each.
  • FIG. 13 is a block diagram showing the configuration of the text-to-speech synthesis system according to the eleventh embodiment of the present invention.
  • descriptions of the items already covered in the second embodiment are omitted.
  • the text-to-speech synthesis system differs from the second embodiment in that the text candidate generation unit 102 is replaced with a reading generation unit 107 that generates a reading result for the input text, and
  • a pause insertion unit 108 that generates a plurality of candidate reading results with pauses by estimating the positions where a pause can be inserted and expanding the insertion/non-insertion combinations.
  • the sound quality evaluation score calculation unit 103 operates to obtain a sound quality evaluation score for each candidate reading result with pauses.
  • for an input such as "I am going to the bookstore", the reading generation unit 107 generates the reading result "watashi-wa / hon'ya-e / iku".
  • the pause insertion unit 108 assumes that a pause (denoted P) can be inserted at each accent phrase boundary indicated by "/", and generates four candidate reading results with pauses: "watashi-wa / hon'ya-e / iku", "watashi-wa P hon'ya-e / iku", "watashi-wa / hon'ya-e P iku", and "watashi-wa P hon'ya-e P iku".
  • the sound quality evaluation score calculation unit 103 calculates a sound quality evaluation score corresponding to each.
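The expansion of pause insertion/non-insertion at accent phrase boundaries can be sketched as follows (a minimal illustration; the romanized reading and the marker "P" follow the example above, and the function name is hypothetical):

```python
from itertools import product

def expand_pauses(accent_phrases):
    """Expand insertion/non-insertion of a pause 'P' at every accent
    phrase boundary ('/'), yielding 2**(n-1) candidates for n phrases."""
    n_boundaries = len(accent_phrases) - 1
    for flags in product((False, True), repeat=n_boundaries):
        out = [accent_phrases[0]]
        for phrase, pause in zip(accent_phrases[1:], flags):
            out.append("P" if pause else "/")
            out.append(phrase)
        yield " ".join(out)

for cand in expand_pauses(["watashi-wa", "hon'ya-e", "iku"]):
    print(cand)
```

With two boundaries this produces the four candidates of the example; each would then receive its own sound quality evaluation score.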
  • because the acoustic environment before and after an accent phrase boundary changes depending on whether a pause is inserted, the combination of pause insertions that can be uttered with the best sound quality can be selected.
  • since the sound quality score calculation reflects the pause insertion tendencies of the original utterances that contain the segments making up the synthesized speech, the individuality of the speaker of the original utterances is also expressed.
  • FIG. 14 is a block diagram showing the configuration of the text-to-speech synthesis system according to the twelfth embodiment of the present invention.
  • descriptions of the items already covered in the second embodiment are omitted.
  • the text-to-speech synthesis system differs from the second embodiment in that the text candidate generation unit 102 is replaced with a reading generation unit 107 that generates a reading result for the input text, and
  • a reading deformation unit 109 that generates a plurality of candidate reading results by expanding the reading result, changing its phonemes according to rules given in advance.
  • the sound quality evaluation score calculation unit 103 operates to obtain a sound quality evaluation score for each candidate reading result.
  • the reading generation unit 107 assigns and outputs the reading result "shigoto-o / hidoku / hikitsuida" for the input text.
  • the reading result contains two occurrences of the phoneme "hi", so the reading deformation unit 109 expands the replacement/non-replacement of each with "shi" and
  • generates the four candidate reading results "shigoto-o / hidoku / hikitsuida", "shigoto-o / shidoku / hikitsuida", "shigoto-o / hidoku / shikitsuida", and "shigoto-o / shidoku / shikitsuida".
  • the sound quality evaluation score calculator 103 calculates a sound quality evaluation score corresponding to each.
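The replacement/non-replacement expansion can be sketched at the mora level as follows (the mora segmentation and the function name are illustrative assumptions):

```python
from itertools import product

def expand_phoneme(moras, src="hi", dst="shi"):
    """Expand replacement/non-replacement of each mora equal to `src`
    with `dst`, yielding 2**k candidate readings for k occurrences."""
    idx = [i for i, m in enumerate(moras) if m == src]
    for flags in product((src, dst), repeat=len(idx)):
        out = list(moras)
        for i, rep in zip(idx, flags):
            out[i] = rep
        yield out

moras = ["shi", "go", "to", "o", "hi", "do", "ku", "hi", "ki", "tsu", "i", "da"]
cands = list(expand_phoneme(moras))
print(len(cands))  # two occurrences of "hi" -> 4 candidates
```

Working on mora units rather than raw strings avoids accidentally matching "hi" inside moras such as "shi"; each of the four candidates would then receive its own sound quality evaluation score.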
  • in the present embodiment, a unit selection score, which is an approximate value of the sound quality of the synthesized speech, is used as the sound quality evaluation score.
  • as this score, the pitch frequency difference at the end points between the segments, which are the concatenated sections of the synthesized speech, is measured in Hz, and the sum of the absolute values of these differences is used.
  • FIG. 15 shows an example of the calculation of the sound quality evaluation score when the synthesized speech is composed of four segments.
  • a unit of the original utterance is assigned to each segment from segment 1 to segment 4.
  • let these original utterance units be original utterance 1 to original utterance 4.
  • let the pitch frequency on original utterance 1 corresponding to the start point of segment 1 be F1b, and the pitch frequency corresponding to its end point be F1e.
  • similarly, segment 2 to segment 4 have pitch frequencies F2b, F2e, F3b, F3e, F4b, and F4e corresponding to their start and end points.
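Under the notation of F1b/F1e for start/end pitch frequencies, the concatenation-smoothness score of FIG. 15 might be computed as in this sketch (the numeric pitch values are invented for illustration):

```python
def concat_pitch_score(segments):
    """Sum of absolute pitch differences (Hz) across segment joins.
    `segments` is a list of (F_begin, F_end) pairs in order; a smaller
    score means a smoother pitch pattern at the concatenation points."""
    return sum(abs(prev_end - next_begin)
               for (_, prev_end), (next_begin, _) in zip(segments, segments[1:]))

# four segments with endpoint pitch frequencies (F1b, F1e) ... (F4b, F4e)
segs = [(120.0, 130.0), (128.0, 140.0), (141.0, 150.0), (149.0, 145.0)]
print(concat_pitch_score(segs))  # |130-128| + |140-141| + |150-149| = 4.0
```

The same skeleton accommodates the variants mentioned in the text by replacing the absolute difference with a squared, weighted, or log-domain difference.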
  • instead of using the sum of the absolute differences of the raw pitch frequencies, a calculation using the sum of squares, a weighted sum, or pitch frequencies expressed as logarithmic values may be used.
  • when a general waveform concatenation method is used for waveform generation in speech synthesis, the pitch frequency of the original utterance directly becomes the pitch frequency of the corresponding section of the synthesized speech.
  • when a waveform editing method is used, the pitch frequency is edited before use; in that case, the same calculation can be performed using the pitch frequency after editing instead of the pitch frequency of the original utterance.
  • the value of the sound quality evaluation score in the present embodiment is a score representing the smoothness of the pitch pattern of the synthesized speech, and can be said to approximate the sound quality of the synthesized speech. Therefore, the method of the present embodiment can also generate the synthesized speech having the best sound quality evaluation score.
  • in the present embodiment as well, a unit selection score that is an approximate value of the sound quality of the synthesized speech is used as the sound quality evaluation score.
  • as this score, an estimated pitch frequency trajectory is calculated during the speech synthesis process, the difference between this estimate and the pitch frequency of the actual synthesized speech is measured in Hz, and the sum of the absolute values over the entire speech interval is used.
  • FIG. 16 shows an example of the calculation of the sound quality evaluation score when the synthesized speech consists of four segments. Here, a unit of the original utterance is assigned to each segment from segment 1 to segment 4.
  • the pitch frequency of the selected speech unit may be different from the pitch frequency calculated for speech synthesis.
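A minimal sketch of this fidelity score (the frame-wise pitch values are invented for illustration; a real system would compare the two trajectories sampled on a common time grid):

```python
def pitch_fidelity_score(target_f0, synth_f0):
    """Sum of absolute differences (Hz) between the pitch trajectory
    estimated during synthesis and the pitch of the actual synthesized
    speech over the whole utterance; smaller means more faithful."""
    assert len(target_f0) == len(synth_f0), "trajectories must share a time grid"
    return sum(abs(t - s) for t, s in zip(target_f0, synth_f0))

target = [120.0, 125.0, 130.0, 128.0]  # estimated trajectory (Hz per frame)
synth  = [121.0, 124.0, 133.0, 128.0]  # pitch of the actual synthesized speech
print(pitch_fidelity_score(target, synth))  # 1 + 1 + 3 + 0 = 5.0
```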
  • the value of the sound quality evaluation score in this embodiment is a score representing how faithfully the calculated pitch frequency is reproduced, and in this respect it approximates the sound quality of the synthesized speech.
  • the synthesized voice having the highest sound quality evaluation score can also be generated by the method of the present embodiment.
  • in the present embodiment as well, a unit selection score that is an approximate value of the sound quality of the synthesized speech is used as the sound quality evaluation score.
  • as this score, the time length of each speech unit is calculated during the speech synthesis process, the difference between this value and the time length of the corresponding unit in the actual synthesized speech is measured in milliseconds, and the sum of the absolute values over the entire speech interval is used.
  • the time length of the selected speech unit may be different from the time length calculated for speech synthesis.
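A minimal sketch of the duration-based score (the millisecond values are invented for illustration):

```python
def duration_fidelity_score(predicted_ms, actual_ms):
    """Sum of absolute differences (ms) between the unit durations
    calculated during synthesis and the durations of the corresponding
    units in the actual synthesized speech; smaller means more faithful."""
    assert len(predicted_ms) == len(actual_ms), "one duration per unit"
    return sum(abs(p - a) for p, a in zip(predicted_ms, actual_ms))

predicted = [80, 120, 95, 140]  # calculated unit durations (ms)
actual    = [85, 118, 95, 150]  # durations in the synthesized speech (ms)
print(duration_fidelity_score(predicted, actual))  # 5 + 2 + 0 + 10 = 17
```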
  • the value of the sound quality evaluation score in this embodiment is a score representing how faithfully the calculated time length is reproduced, and in this respect it approximates the sound quality of the synthesized speech.
  • the synthesized voice having the highest sound quality evaluation score can also be generated by the method of the present embodiment.
  • the fields to which the present invention can be applied are substantially the same as the fields that require speech synthesis technology.
  • in particular, the effect of the present invention is considered to be best demonstrated in fields that do not require strict fidelity to the input text, such as spoken dialogue for robots.

Abstract

When speech is generated, the speech is not necessarily clear, owing to such factors as the synthesis method and the ordering of the speech units. A text-to-speech synthesis device comprises a text candidate generation unit (102) for generating a plurality of candidate texts composed of synonymous expressions by analyzing an input text, a sound quality evaluation score calculation unit (103) for calculating a sound quality evaluation score for each of the candidate texts, a selection unit (104) for selecting the candidate text with the best sound quality evaluation score from among the plurality of candidate texts, and a speech synthesis unit (105) for generating a synthetic speech corresponding to the selected candidate text. The device thus transforms the input text to the extent that the content of the sentence is not changed, and selects and outputs the synthetic speech corresponding to the candidate text with the best sound quality evaluation score (see Fig. 1).

Description

Specification
Text-to-speech synthesis device, program therefor, and text-to-speech synthesis method

TECHNICAL FIELD OF THE INVENTION
[0001] (Related application) This application claims the priority of the prior Japanese Patent Application No. 2006-302954 (filed on November 8, 2006), the entire contents of which are incorporated herein by reference.
The present invention relates to a text-to-speech synthesis technology, and more particularly to a text-to-speech synthesis technology having a function of arranging input text within a range that does not impair the meaning of the text and making it into synthesized speech.
Background art
[0002] Text-to-speech synthesis technology is a technology for generating speech that reads out the content of a given text, which, in the case of Japanese, mainly consists of sentences written in a mixture of kanji and kana.
[0003] A widely used method is to estimate the reading of the input text to generate a phonetic symbol string, and then to generate speech for that phonetic symbol string. Methods for generating speech from a phonetic symbol string include a waveform concatenation method, which cuts and pastes prerecorded natural utterances, and a parametric method, which generates a parameter time series representing acoustic features and generates speech from those parameters.
[0004] In this speech synthesis technology, how to generate natural synthesized speech is a major research theme. For example, Japanese Patent Application Laid-Open No. 2004-258406 discloses a text-to-speech synthesis system that can generate a natural prosodic pattern (phonetic symbol string) even when dependency analysis is incorrect.
[0005] Japanese Patent Application Laid-Open No. 2004-138728 discloses a speech synthesizer provided with unit and prosody control means that translates a basic pitch pattern, which is one piece of prosodic information, in the frequency direction while maintaining its outline, selects speech unit sequences matching the translated, deformed pitch pattern and the linguistic information, and adopts the best-matching speech unit sequence together with its associated deformed pitch pattern as the speech units and prosodic information used for synthesis, thereby suppressing the pitch deformation amount of the selected speech units and creating synthesized speech free of sudden changes in sound quality.
[0006] Patent Document 1: Japanese Patent Application Laid-Open No. 2004-258406
Patent Document 2: Japanese Patent Application Laid-Open No. 2004-138728
Non-Patent Document 1: Shikano, Ito, Kawahara, Takeda, Yamamoto, "IT Text: Speech Recognition Systems", Ohmsha, 2001, pp. 13-14
Disclosure of the invention
Problems to be solved by the invention
[0007] The disclosures of Patent Documents 1 and 2 and Non-Patent Document 1 above are incorporated herein by reference. The following analysis of the related art is given by the present invention. Most of the conventional techniques described above focus on generating natural-sounding speech, but they have the problem that the synthesized speech generated as a result is not necessarily of good quality. The present invention has been made in view of this point, and its object is to provide a text-to-speech synthesis technique capable of improving the quality of the generated synthesized speech.

Means for solving the problem
[0008] According to a first aspect of the present invention, there are provided: a text-to-speech synthesis device comprising a text candidate generation unit that analyzes input text and generates a plurality of candidate texts consisting of synonymous expressions, a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate text, a selection unit that selects the candidate text having the best sound quality evaluation score from the plurality of candidate texts, and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate text; a program for realizing the text-to-speech synthesis device; and a speech synthesis method implemented using the text-to-speech synthesis device.
[0009] According to a second aspect of the present invention, there are provided: a text-to-speech synthesis device comprising a morphological analysis candidate generation unit that calculates a plurality of candidate morphological analysis results from input text, a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate morphological analysis result, a selection unit that selects the candidate morphological analysis result having the best sound quality evaluation score from the plurality of candidates, and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate morphological analysis result; a program for realizing the text-to-speech synthesis device; and a speech synthesis method implemented using the text-to-speech synthesis device.

[0010] According to a third aspect of the present invention, there are provided: a text-to-speech synthesis device comprising a reading generation unit that generates a reading result from input text, a pause insertion unit that generates a plurality of candidate reading results with pauses by combining the insertion/non-insertion of pauses (phrase breaks) into the reading result, a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate reading result with pauses, a selection unit that selects the candidate reading result with pauses having the best sound quality evaluation score from the plurality of candidates, and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate; a program for realizing the text-to-speech synthesis device; and a speech synthesis method implemented using the text-to-speech synthesis device.
[0011] According to a fourth aspect of the present invention, there are provided: a text-to-speech synthesis device comprising a reading generation unit that generates a reading result from input text, a reading deformation unit that generates a plurality of candidate reading results by expanding the combinations of replacing/not replacing specific sounds contained in the reading result with other sounds, a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate reading result, a selection unit that selects the candidate reading result having the best sound quality evaluation score from the plurality of candidates, and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate; a program for realizing the text-to-speech synthesis device; and a speech synthesis method implemented using the text-to-speech synthesis device.
Effects of the invention
[0012] According to the present invention, the quality of synthesized speech can be dramatically improved. The reason is that the system is configured to adopt, from among a plurality of candidates created from the input text, the candidate having the best sound quality evaluation score.
[0013] According to the present invention, it is also possible to generate synthesized speech that makes use of the original characteristics of the natural speech possessed by the speech synthesis means.
Brief Description of Drawings
[0014] FIG. 1 is a block diagram showing the configuration of the text-to-speech synthesis system according to the first embodiment of the present invention.
FIG. 2 is a diagram showing entries stored in the synonym dictionary of the text-to-speech synthesis system according to the first embodiment of the present invention.
FIG. 3 is a diagram for explaining the operation of the text candidate generation unit of the text-to-speech synthesis system according to the first embodiment of the present invention.
FIG. 4 is a block diagram showing a modified configuration of the text-to-speech synthesis system according to the first embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of the text-to-speech synthesis system according to the second embodiment of the present invention.
FIG. 6 is a block diagram showing the configuration of the text-to-speech synthesis system according to the fifth embodiment of the present invention.
FIG. 7 is a diagram showing entries stored in the personality feature dictionary of the text-to-speech synthesis system according to the fifth embodiment of the present invention.
FIG. 8 is a diagram for explaining the operation of the text candidate generation unit of the text-to-speech synthesis system according to the fifth embodiment of the present invention.
FIG. 9 is a block diagram showing the configuration of the text-to-speech synthesis system according to the sixth embodiment of the present invention.
FIG. 10 is a diagram showing entries stored in the filler dictionary of the text-to-speech synthesis system according to the sixth embodiment of the present invention.
FIG. 11 is a diagram for explaining the operation of the text candidate generation unit of the text-to-speech synthesis system according to the sixth embodiment of the present invention.
FIG. 12 is a block diagram showing the configuration of the text-to-speech synthesis system according to the tenth embodiment of the present invention.
FIG. 13 is a block diagram showing the configuration of the text-to-speech synthesis system according to the eleventh embodiment of the present invention.
FIG. 14 is a block diagram showing the configuration of the text-to-speech synthesis system according to the twelfth embodiment of the present invention.
FIG. 15 is a diagram for explaining the method for calculating the sound quality evaluation score in the text-to-speech synthesis system according to the thirteenth embodiment of the present invention.
FIG. 16 is a diagram for explaining the method for calculating the sound quality evaluation score in the text-to-speech synthesis system according to the fourteenth embodiment of the present invention.
Explanation of symbols
[0015] 101 Text input unit
102 Text candidate generation unit
103 Sound quality evaluation score calculation unit
104 Selection unit
105 Speech synthesis unit
106 Morphological analysis candidate generation unit
107 Reading generation unit
108 Pause insertion unit
109 Reading deformation unit
201 Synonym dictionary
202 Personality feature dictionary
203 Filler dictionary
BEST MODE FOR CARRYING OUT THE INVENTION
[0016] Various possible modes of the present invention are shown below.
[0017] (Form 1)
According to the first aspect of the present invention, a text-to-speech synthesis device can be obtained comprising: a text candidate generation unit that analyzes input text and generates a plurality of candidate texts consisting of synonymous expressions; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate text; a selection unit that selects the candidate text having the best sound quality evaluation score from the plurality of candidate texts; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate text.
[0018] (Form 2)
The text-to-speech synthesis device may further comprise a synonym dictionary in which synonyms are stored in association with words or phrases, and the text candidate generation unit may be operated to search the synonym dictionary for synonyms corresponding to the words or phrases contained in the input text and to generate a plurality of candidate texts by expanding the combinations of substitution/non-substitution of those words or phrases.
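As a sketch of the substitution/non-substitution expansion this form describes (the tokenization, the dictionary entries, and the function name are illustrative assumptions based on the ninth-embodiment example):

```python
from itertools import product

def expand_candidates(tokens, synonym_dict):
    """Expand substitution/non-substitution of each word that has a
    synonym dictionary entry, yielding every combination as one
    candidate text."""
    options = [[t] + synonym_dict.get(t, []) for t in tokens]
    for combo in product(*options):
        yield " ".join(combo)

syns = {"contact-network": ["communication-network"],
        "surprise": ["astonishment"]}
cands = list(expand_candidates(
    ["the", "boring", "contact-network", "was", "filled", "with", "surprise"],
    syns))
print(len(cands))  # two dictionary hits -> 2 * 2 = 4 candidate texts
```

Each candidate would then be passed to the sound quality evaluation score calculation unit, and the best-scoring one synthesized.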
[0019] (Form 3)
The text-to-speech synthesis device may further comprise a synonymous expression dictionary that stores, in association with a given phrase expression, phrase expressions having the same meaning, and the text candidate generation unit may be operated to search the synonymous expression dictionary for synonymous expressions corresponding to the phrase expressions contained in the input text and to generate a plurality of candidate texts by expanding the combinations of substitution/non-substitution at the phrase expression level.
[0020] (Form 4)
Furthermore, the text candidate generation unit may be operated to generate the plurality of candidate texts by transforming the sentence structure of the input text under the condition that the meaning of the input text is not changed.
[0021] (Form 5)
Furthermore, the text candidate generation unit may be operated to generate a plurality of candidate texts by expanding the combinations of changing/not changing the politeness of expression of the words or phrases contained in the input text.
[0022] (Form 6)
The text-to-speech synthesis device may further comprise a personality feature dictionary that stores expression differences at the individual level in association with one another, and the text candidate generation unit may be operated to search the personality feature dictionary for other synonymous expressions corresponding to the expressions contained in the input text and to generate a plurality of candidate texts by expanding the combinations of substitution/non-substitution based on those individual-level expression differences.
[0023] (Form 7)
The text-to-speech synthesis device may further comprise a filler dictionary describing fillers and interjections, and the text candidate generation unit may be operated to refer to the filler dictionary and generate a plurality of candidate texts by expanding the combinations of inserting/not inserting fillers or interjections between the words of the input text.
[0024] (Form 8)
The text-to-speech synthesis device may further comprise a filler dictionary describing fillers and interjections, and the text candidate generation unit may be operated to refer to the filler dictionary and generate a plurality of candidate texts by expanding the combinations of deleting/not deleting fillers or interjections from the input text.
[0025] (Form 9)
Furthermore, the text candidate generation unit may be operated to generate the plurality of candidate texts by changing part or all of the words or phrases contained in the input text into repeated expressions.
[0026] (Form 10)
Furthermore, the text candidate generation unit may be operated so that, among the expanded results, only expressions that rhyme are used as candidate texts.
[0027] (Form 11)
Furthermore, the text candidate generation unit may be operated so that, among the expanded results, only expressions in which the number of mora of each accent phrase is 5 or 7 are used as candidate texts.
[0028] (Form 12)
According to the second aspect of the present invention, a text-to-speech synthesis device can be obtained comprising: a morphological analysis candidate generation unit that calculates a plurality of candidate morphological analysis results from input text; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate morphological analysis result; a selection unit that selects the candidate morphological analysis result having the best sound quality evaluation score from the plurality of candidates; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate morphological analysis result.
[0029] (Form 13)
According to the third aspect of the present invention, a text-to-speech synthesis device can be obtained comprising: a reading generation unit that generates a reading result from input text; a pause insertion unit that generates a plurality of candidate reading results with pauses by combining the insertion/non-insertion of pauses (phrase breaks) into the reading result; a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate reading result with pauses; a selection unit that selects the candidate reading result with pauses having the best sound quality evaluation score from the plurality of candidates; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate.
[0030] (Form 14)
According to a fourth aspect of the present invention, there is provided a text-to-speech synthesis device comprising: a reading generation unit that generates a reading (pronunciation) from input text; a reading modification unit that generates a plurality of candidate readings by expanding the combinations of substituting or not substituting other sounds for specific sounds contained in the reading; a sound quality evaluation score calculation unit that computes a sound quality evaluation score for each candidate reading; a selection unit that selects, from the plurality of candidate readings, the candidate with the best sound quality evaluation score; and a speech synthesis unit that generates synthesized speech corresponding to the selected candidate reading.
[0031] (Form 15)
As the sound quality evaluation score, a score representing the smoothness of the pitch pattern of the synthesized speech can be used.
[0032] (Form 16)
As the sound quality evaluation score, a score representing the difference between an estimated pitch pattern and the pitch pattern of the synthesized speech can be used.
[0033] (Form 17)
As the sound quality evaluation score, a score representing the difference between an estimated rhythm and the rhythm of the synthesized speech can be used.
[0034] (Form 18)
As the sound quality evaluation score, a score representing the smoothness of the spectrum between the segments from which the synthesized speech is assembled can be used.
[0035] (Other forms)
A text-to-speech synthesis method according to the present invention, and a program for carrying out the method, can likewise be developed in the same forms as the text-to-speech synthesis devices described above.
[0036] [First embodiment]
Next, the best mode for carrying out the present invention is described in detail with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the text-to-speech synthesis system according to the first embodiment of the present invention. Referring to FIG. 1, the text-to-speech synthesis system according to the present embodiment comprises a text input unit 101, a text candidate generation unit 102, a synonym dictionary 201, a sound quality evaluation score calculation unit 103, a selection unit 104, and a speech synthesis unit 105.
[0037] The text input unit 101 is a means for inputting text. In principle, the present invention is applicable to any language; in the present embodiment, however, the description assumes Japanese sentences written in mixed kana and kanji.
[0038] The text candidate generation unit 102 is a means that refers to the synonym dictionary 201 and generates a plurality of candidate texts by transforming the input text into other expressions that do not change its meaning.
[0039] FIG. 2 shows the entries stored in the synonym dictionary 201, in which synonyms are stored in association with one another in units of words or phrases.
[0040] The text candidate generation unit 102 performs morphological analysis on the text as shown in FIG. 3, looks up each morpheme in the synonym dictionary 201, and, when a matching entry is found, marks the morpheme as replaceable by the corresponding synonym.
[0041] Finally, the text candidate generation unit 102 generates a plurality of candidate texts from all combinations of the possible substitutions for the morphemes of the text.
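The combinational expansion of [0040]-[0041] can be sketched as follows. This is a minimal illustration only: the miniature dictionary and the pre-segmented morpheme list are hypothetical stand-ins for the synonym dictionary 201 and the output of the morphological analyzer, which the embodiment assumes are given.

```python
from itertools import product

# Hypothetical miniature synonym dictionary (a stand-in for FIG. 2):
# each entry maps a surface form to its interchangeable synonyms.
SYNONYM_DICT = {
    "順調": ["堅調"],
    "業績": ["実績"],
}

def generate_candidate_texts(morphemes):
    """Expand every keep-or-substitute combination over the morphemes:
    a morpheme with a dictionary hit may stay as-is or be replaced by
    any of its registered synonyms."""
    options = [[m] + SYNONYM_DICT.get(m, []) for m in morphemes]
    return ["".join(choice) for choice in product(*options)]

# Morphological analysis itself is assumed done upstream; we start from
# an already segmented morpheme sequence.
candidates = generate_candidate_texts(["順調", "な", "業績"])
```

The number of candidates is the product of the per-morpheme option counts (here 2 x 1 x 2 = 4), which is why paragraph [0062] notes that the expansion may need to be limited under memory constraints.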
[0042] The sound quality evaluation score calculation unit 103 is a means for calculating a sound quality evaluation score for each of the plurality of candidate texts; its details are described later.
[0043] The selection unit 104 is a means for selecting, from the plurality of candidate texts, the one candidate text whose corresponding sound quality evaluation score is best.
[0044] The speech synthesis unit 105 is a means for generating synthesized speech corresponding to the selected candidate text. In the present embodiment, the speech synthesis unit 105 is assumed to be a waveform-concatenation text-to-speech synthesizer that generates synthesized speech by concatenating and editing prerecorded natural speech.

[0045] Next, the sound quality evaluation method used in the text-to-speech synthesis system according to the present embodiment is described in detail.
[0046] The sound quality evaluation score is an objective evaluation value of the speech quality of the synthesized speech obtained by synthesizing a candidate text. Since it is generally difficult to obtain an exact objective evaluation value of speech quality, various scores used during the speech synthesis process may be employed as approximations of the sound quality evaluation score. In the present invention, such approximations are also referred to as sound quality evaluation scores.
[0047] As a concrete example of the sound quality evaluation score, the following description uses, for a candidate text, the value of the unit concatenation score corresponding to the segment sequence selected by the speech synthesizer.
[0048] The unit concatenation score is defined as the negative of the sum, over all segment concatenation boundaries, of the weighted Euclidean distances between 10-dimensional cepstra.
[0049] That is, for the i-th segment sequence (a(1)_i, a(2)_i, ..., a(N)_i), let the 10-dimensional cepstrum at the start of each segment a(j)_i be Cep_b(i, j) = (cep_b(i, j, 1), cep_b(i, j, 2), ..., cep_b(i, j, 10)), and let the 10-dimensional cepstrum at its end be Cep_e(i, j) = (cep_e(i, j, 1), cep_e(i, j, 2), ..., cep_e(i, j, 10)). The unit concatenation score d(i) for the i-th segment sequence is then expressed as d(i) = −Σ_{j=1}^{N−1} Σ_{k=1}^{10} α(k) |cep_e(i, j, k) − cep_b(i, j+1, k)|².
[0050] Here, α(k) is a weight given in advance.
[0051] This unit concatenation score is calculated by the same method in the sound quality evaluation score calculation unit 103 and in the speech synthesis unit 105.
[0052] The value d(i) of this unit concatenation score can be regarded as a score representing the smoothness of the spectrum between the segments into which the synthesized speech is divided: the larger the value d(i), the smaller the cepstral differences at the concatenation points, and the higher the sound quality of the resulting synthesized speech.
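The score d(i) of [0049] and the maximization of [0054] can be sketched as follows. This is an illustrative sketch: the segment representation (dictionaries holding a start and an end cepstrum vector) and the toy boundary values are assumptions, not the synthesizer's actual data structures.

```python
def unit_concat_score(segments, alpha):
    """d(i) = -sum over boundaries j and dimensions k of
    alpha[k] * |cep_e(i, j, k) - cep_b(i, j+1, k)|^2.
    A larger (less negative) score means smoother spectral joins."""
    total = 0.0
    for left, right in zip(segments, segments[1:]):
        total += sum(a * (e - b) ** 2
                     for a, e, b in zip(alpha, left["cep_e"], right["cep_b"]))
    return -total

def select_best_sequence(candidate_sequences, alpha):
    # Among the candidate segment sequences, pick the one whose
    # unit concatenation score is largest.
    return max(candidate_sequences, key=lambda seq: unit_concat_score(seq, alpha))

alpha = [1.0] * 10  # the predetermined weights alpha(k) of [0050]
smooth = [{"cep_b": [0.0] * 10, "cep_e": [0.0] * 10},
          {"cep_b": [0.0] * 10, "cep_e": [0.0] * 10}]          # boundary cepstra match
rough = [{"cep_b": [0.0] * 10, "cep_e": [0.0] * 10},
         {"cep_b": [1.0] + [0.0] * 9, "cep_e": [0.0] * 10}]    # 1.0 mismatch in dim 1
```

With these toy values the perfectly matching boundary yields the maximal score 0, while the mismatched boundary is penalized, so `select_best_sequence` prefers the smoother sequence.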
[0053] Here, the weighted Euclidean distance between 10-dimensional cepstra was used as the unit concatenation score, but the number of dimensions can be adjusted as appropriate. Instead of the cepstrum, it is also possible to use, with a suitable number of dimensions, the Δ-cepstrum (the time derivative of the cepstrum), the MFCC (Mel-Frequency Cepstrum Coefficient; see Non-Patent Document 1), or the ΔMFCC (the time derivative of the MFCC). A vector composed of several of these values can of course also be used.
[0054] When synthesizing speech for a given text, the waveform-concatenation text-to-speech synthesizer described above can take a plurality of natural-speech segment sequences, and it operates so as to select, among them, the segment sequence whose unit concatenation score is largest.
[0055] In this way, synthesized speech can be generated that has the same content as the given text and the highest sound quality evaluation score, as approximated by the unit selection score used in the speech synthesizer.
[0056] In the example of FIG. 2, the synonym dictionary 201 was described as storing surface forms only; by additionally describing parts of speech, hiragana-katakana conversion rules, and the like, matching can be made more flexible or more strict.
[0057] The text candidate generation unit 102 can also be integrated with the text analysis processing that forms part of the speech synthesis unit 105. In that case, because the morphological analysis results coincide, the accuracy can be improved and the amount of computation reduced.
[0058] In the present embodiment, the selection unit 104 outputs a candidate text, and the speech synthesis unit 105 synthesizes speech taking that candidate text as input. The operations of the text candidate generation unit 102 and the speech synthesis unit 105 are identical up to the point where the sound quality evaluation score is generated.
[0059] Alternatively, as shown in FIG. 4, the portion performing this common processing (the processing up to generation of the sound quality evaluation score) can be carried out by the text candidate generation unit 102, with the intermediate results output as speech synthesis data.
[0060] The selection unit 104 then outputs the speech synthesis data (generated by the text candidate generation unit 102) corresponding to the selected candidate text, and the speech synthesis unit 105 synthesizes speech using the selected speech synthesis data. This eliminates the duplicated processing and makes it possible to reduce the amount of computation.
[0061] As described above, the present embodiment makes it possible to convey information using alternative words and expressions that can be generated with higher sound quality, without changing the information content to be conveyed. Moreover, when candidate texts are selected on the basis of the sound quality evaluation score, words and expressions that occurred frequently in the original utterances from which the synthesis segments were recorded are more likely to be used, so the individuality of the speaker of the original utterances is expressed. Furthermore, even when the given text is in a written style, it may be replaced by a synonymous spoken-style expression used in the original utterances, in which case an expression that is easier to understand by ear, or that sounds less unnatural, can be obtained.
[0062] The embodiment described above assumed no memory or similar constraints, with the text candidate generation unit 102 enumerating all combinations of the possible substitutions for the morphemes of the text (see FIG. 3). Where such constraints exist, the combinations may be limited to those the constraints allow, or candidate texts may be generated sequentially while retaining only those with high sound quality evaluation scores.
[0063] [Second embodiment]
Next, a second embodiment of the present invention, in which the synonym dictionary can be omitted, is described in detail with reference to the drawings. FIG. 5 is a block diagram showing the configuration of the text-to-speech synthesis system according to the second embodiment of the present invention. Matters already described for the first embodiment are omitted below.
[0064] The text candidate generation unit 102 of the present embodiment generates candidate texts expanded into synonymous expressions by transforming the grammar of the input text according to transformation rules for synonymous expressions described in advance within the text candidate generation unit 102. Concrete examples follow.
[0065] For example, when the input text is 「文法を変形する」 ("transform the grammar"), the text candidate generation unit 102 transforms it by replacing the pair of the noun 「変形」 and the auxiliary verb 「する」 with the verb phrase 「形を変える」 ("change the shape").
[0066] Any transformation of particles that becomes necessary is performed at the same time. As a result, in addition to 「文法を変形する」, 「文法の形を変える」 is also taken as a candidate text.
[0067] As another example, when the input text is 「単語を置換する」 ("substitute words"), the text candidate generation unit 102 transforms it by replacing the pair of the verbal stem 「置換」 and the auxiliary verb 「する」 with the verb phrase 「置き換える」 or 「差し替える」 (both meaning "replace").
[0068] After a plurality of candidate texts have been generated in this way, sound quality evaluation and selection of the best candidate are performed by the same processing as in the first embodiment.
[0069] The text candidate generation unit 102 may apply the transformation rules simply by string matching on the text, but the accuracy can be improved by additionally performing morphological analysis and confirming the part of speech of the expression in question.
[0070] [Third embodiment]
Next, a third embodiment of the present invention, in which the synonym dictionary can likewise be omitted, is described. Since the configuration of this embodiment is the same as that of the second embodiment, the operation of the text candidate generation unit 102, which is the point of difference, is described in detail.
[0071] Unlike the second embodiment, the text candidate generation unit 102 according to the present embodiment generates expanded candidate texts by transforming the sentence structure of the input text. Concrete examples of this transformation of sentence structure follow.
[0072] The first type of sentence-structure transformation in the text candidate generation unit 102 is transformation into an inverted, emphatic construction. For example, when the input text is 「Aを Bする」 ("do B to A"), the text candidate generation unit 102 also takes the inverted, emphatic construction 「Bしたのは Aだ」 ("it is A that was B-ed") as a candidate text.
[0073] The second type of sentence-structure transformation in the text candidate generation unit 102 is transformation into a construction ending in a noun phrase (taigen-dome). For example, when the input text is 「投稿したのは私です」 ("the one who posted is me"), the text candidate generation unit 102 also takes the noun-phrase-final construction 「投稿したのは私」 as a candidate text.
[0074] The third type of sentence-structure transformation in the text candidate generation unit 102 is transformation by reordering of particle phrases (scrambling). For example, when the input text is 「私と兄は本を読んだ」 ("my brother and I read a book"), the text candidate generation unit 102 also takes the scrambled variants 「私と本を兄は読んだ」, 「兄は私と本を読んだ」, 「兄は本を私と読んだ」, 「本を兄は私と読んだ」, and 「本を私と兄は読んだ」 as candidate texts.
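Under the simplifying assumption that the sentence has already been split into movable particle phrases plus a fixed sentence-final predicate, the scrambling expansion can be sketched as:

```python
from itertools import permutations

def scramble_candidates(particle_phrases, predicate):
    # Permute the particle phrases while keeping the predicate sentence-final;
    # the original word order is itself one of the permutations.
    return ["".join(order) + predicate for order in permutations(particle_phrases)]

candidates = scramble_candidates(["私と", "兄は", "本を"], "読んだ")
```

For the three phrases of the example this yields 3! = 6 candidates: the input sentence plus the five scrambled variants. A production system would additionally filter out orderings that change the meaning, which this sketch does not attempt.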
[0075] The fourth type of sentence-structure transformation in the text candidate generation unit 102 is transformation by interchange of noun and verb forms. For example, when the input text is 「単語を置換する」 ("substitute words"), the text candidate generation unit 102 also takes 「単語の置換を行う」 ("perform substitution of words"), obtained by interchanging the noun and verb forms, as a candidate text.
[0076] The fifth type of sentence-structure transformation in the text candidate generation unit 102 is transformation into the passive voice. For example, when the input text is 「彼は彼女を呼んだ」 ("he called her"), the text candidate generation unit 102 also takes the passive transformation 「彼女は彼に呼ばれた。」 ("she was called by him") as a candidate text.

[0077] After a plurality of candidate texts have been generated in this way, sound quality evaluation and selection of the best candidate are performed by the same processing as in the first embodiment.
[0078] [Fourth embodiment]
Next, a fourth embodiment of the present invention, in which the synonym dictionary can likewise be omitted, is described. Since the configuration of this embodiment is the same as those of the second and third embodiments, the operation of the text candidate generation unit 102, which is the point of difference, is described in detail.
[0079] Unlike the second and third embodiments, the text candidate generation unit 102 according to the present embodiment generates expanded candidate texts by transforming the politeness level of the expressions in the input text. A concrete example of this transformation follows.
[0080] For example, when the input text is 「私は本を買った」 ("I bought a book", plain style), the text candidate generation unit 102 also takes the more polite expression 「私は本を買いました。」 as a candidate text.
[0081] After a plurality of candidate texts have been generated in this way, sound quality evaluation and selection of the best candidate are performed by the same processing as in the first embodiment.
[0082] [Fifth embodiment]
Next, a fifth embodiment of the present invention, in which a personality (feature) dictionary is installed in place of the synonym dictionary of the first embodiment, is described in detail with reference to the drawings. FIG. 6 is a block diagram showing the configuration of the text-to-speech synthesis system according to the fifth embodiment of the present invention, which includes a personality (feature) dictionary 202. Matters already described for the first embodiment are omitted below.
[0083] The text candidate generation unit 102 of the present embodiment refers to the personality (feature) dictionary 202 and generates a plurality of candidate texts by transforming the input text into other expressions that do not change its meaning.
[0084] FIG. 7 shows the entries stored in the personality (feature) dictionary 202; for the entry 「自転車」 ("bicycle"), 「ケッタ」 is stored in association as a replaceable word. 「ケッタ」 is a word for "bicycle" used in the Nagoya region.
[0085] The text candidate generation unit 102 performs morphological analysis on the text as shown in FIG. 8, looks up each morpheme in the personality (feature) dictionary 202, and, when a matching entry is found, marks the morpheme as replaceable by the corresponding personal-characteristic morpheme (the replaceable word).
[0086] After a plurality of candidate texts have been generated in this way, sound quality evaluation and selection of the best candidate are performed by the same processing as in the first embodiment.
[0087] In the present embodiment, even when the natural utterances used by the speech synthesis unit vary in word choice because of dialect or the like, words and phoneme sequences that appear infrequently in those utterances are replaced by words and phoneme sequences that the speaker of the original utterances used more frequently, so that synthesized speech of good sound quality can be obtained.
[0088] [Sixth embodiment]
Next, a sixth embodiment of the present invention, in which a filler dictionary is installed in place of the synonym dictionary of the first embodiment, is described in detail with reference to the drawings. FIG. 9 is a block diagram showing the configuration of the text-to-speech synthesis system according to the sixth embodiment of the present invention, which includes a filler dictionary 203. Matters already described for the first embodiment are omitted below.
[0089] The text candidate generation unit 102 of the present embodiment refers to the filler dictionary 203 and generates a plurality of candidate texts by expanding the combinations of inserting or not inserting a filler between the words of the input text.
[0090] FIG. 10 shows the entries stored in the filler dictionary 203: fillers, i.e., connective words and interjection-like expressions with little meaning of their own used in conversation, such as 「えーと」 ("er...") and 「ゴホゴホ」 (a coughing sound), are stored in association. In English, for example, "you know", "I mean", and "Well" would be stored.
[0091] The text candidate generation unit 102 performs morphological analysis on the text as shown in FIG. 11 and expands every possibility of inserting the contents of the filler dictionary 203 between the morphemes. As shown in the lower part of FIG. 11, for a morpheme that itself matches the filler dictionary 203, the text candidate generation unit 102 also expands the combinations of replacing it with another filler or deleting it.
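The insertion expansion of [0091] can be sketched as follows. The two-entry filler list and the pre-segmented input are hypothetical, and the per-sentence cap anticipates the limit mentioned in [0093].

```python
from itertools import product

FILLERS = ["えーと", "まあ"]  # hypothetical filler-dictionary entries

def filler_candidates(morphemes, max_total=2):
    """At every gap between morphemes, either insert nothing or one of the
    dictionary fillers; expand all combinations, capping the total number
    of insertions per sentence."""
    gaps = len(morphemes) - 1
    results = []
    for choice in product([""] + FILLERS, repeat=gaps):
        if sum(1 for c in choice if c) > max_total:
            continue  # enforce the per-sentence insertion limit
        out = [morphemes[0]]
        for filler, morph in zip(choice, morphemes[1:]):
            if filler:
                out.append(filler)
            out.append(morph)
        results.append("".join(out))
    return results

candidates = filler_candidates(["明日は", "晴れ", "です"])
```

With two gaps and two fillers this yields 3 x 3 = 9 candidates, ranging from the unchanged sentence to one with a filler in every gap; each then receives a sound quality evaluation score as in the first embodiment.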
[0092] After a plurality of candidate texts have been generated in this way, sound quality evaluation and selection of the best candidate are performed by the same processing as in the first embodiment.
[0093] In the present embodiment, inserting a filler can, for example, smooth a phoneme sequence that would otherwise join poorly, so that synthesized speech of good sound quality is obtained and the individuality of the speaker of the original utterances is expressed. Instead of expanding every possible filler insertion, the number of fillers insertable in the whole sentence and the number of consecutive insertions may also be controlled (limited).

[0094] [Seventh embodiment]
Next, a seventh embodiment of the present invention, in which all of the above dictionaries can be omitted, is described. Since the configuration of this embodiment is the same as that of the second embodiment, the operation of the text candidate generation unit 102, which is the point of difference, is described in detail.
[0095] Unlike the embodiments described above, the text candidate generation unit 102 according to the present embodiment generates expanded candidate texts by repeating a phrase in the input text a plurality of times. A concrete example of this transformation follows.
[0096] For example, when the input text is 「私は、鉛筆を 3本買います」 ("I will buy three pencils"), the text candidate generation unit 102 also takes 「私は、鉛筆を、鉛筆を 3本買います」, 「私は、鉛筆を 3本、 3本買います」, and 「私は、鉛筆を、鉛筆を 3本、 3本買います」 as candidate texts.
[0097] As described above, the plurality of candidate texts can also be generated by recasting some of the words or phrases, or whole phrases, contained in the input text as repeated expressions. After a plurality of candidate texts have been generated in this way, sound quality evaluation and selection of the best candidate are performed by the same processing as in the first embodiment.
[0098] The maximum number of repetitions of a phrase and the total number of repetitions in the whole sentence can be set in the text candidate generation unit 102 in advance.
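The repetition expansion of [0095]-[0098] can be sketched as follows, assuming the sentence is already split into phrases that are rejoined with 「、」; the phrase list, the joining punctuation, and the limit value are illustrative choices, not the embodiment's exact formatting.

```python
from itertools import product

def repetition_candidates(phrases, max_repeat=2):
    # Each phrase may be uttered 1..max_repeat times ([0098]'s preset limit);
    # expand all combinations of repetition counts.
    expansions = product(range(1, max_repeat + 1), repeat=len(phrases))
    return ["、".join(sum(([p] * n for p, n in zip(phrases, counts)), []))
            for counts in expansions]

candidates = repetition_candidates(["私は", "鉛筆を", "3本", "買います"])
```

With four phrases and a limit of two repetitions this yields 2^4 = 16 candidates, including the unrepeated original and variants such as 「私は、鉛筆を、鉛筆を、3本、買います」.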
[0099] [Eighth embodiment]
Next, an eighth embodiment of the present invention, which modifies the first embodiment, is described. Since the configuration of this embodiment is the same as that of the first embodiment, the operation of the text candidate generation unit 102, which is the point of difference, is described in detail.
[0100] Unlike the first embodiment described above, the text candidate generation unit 102 according to the present embodiment, after expanding synonyms with the synonym dictionary, tests whether the phrases alliterate. A concrete example of this test processing follows.
[0101] For example, consider the case where only the synonym 「堅調」 is registered for the entry 「順調」 in the synonym dictionary 201, and the text 「順調な業績を、謙遜して語る」 ("to speak humbly of favorable results") is input.
[0102] According to the first embodiment, two candidate texts are generated: 「順調な業績を、謙遜して語る」 and, in addition, 「堅調な業績を、謙遜して語る」.
[0103] Examining the two candidate texts, the initial sounds of the breath groups of the former, 「順調な業績を、謙遜して語る」, are 「じゅ」 and 「け」, so it does not alliterate. In the latter, 「堅調な業績を、謙遜して語る」, they are 「け」 and 「け」, so it does alliterate.
[0104] The text candidate generation unit 102 therefore excludes the former by the above alliteration test and outputs only the latter, alliterating text as a candidate text.
[0105] Sound quality evaluation and selection of the best candidate are then performed, by the same processing as in the first embodiment, only on the candidate texts thus selected from the plurality of candidate texts by the test.
[0106] In this embodiment, besides reducing the load of the sound quality evaluation process, the effect of preventing the generation of candidate texts that are unpleasant to listen to is achieved. Although the concrete example above illustrates a test based on alliteration (head rhyme), a similar test can be performed for end rhyme.
[0107] [Ninth Embodiment]
Next, a ninth embodiment of the present invention, obtained by modifying the first embodiment, will be described. Since the configuration of this embodiment is the same as that of the first embodiment, the operation of the text candidate generation unit 102, the point in which it differs, will be described in detail.
[0108] Unlike the first embodiment described above, the text candidate generation unit 102 according to this embodiment, after expanding synonyms using the synonym dictionary, tests whether the number of morae in the reading of each phrase is 5 or 7. A concrete example of this test processing is described below.
[0109] For example, consider the case where the synonym 「連絡ネット」 is registered for the entry 「連絡網」 ("contact network") and the synonym 「驚愕」 ("astonishment") for the entry 「驚き」 ("surprise") in the synonym dictionary 201, and the text 「退屈な連絡網が、驚きに満たされた」 ("the boring contact network was filled with surprise") is input.
[0110] Following the first embodiment, a total of four candidate texts are generated: (1) 「退屈な連絡網が、驚きに満たされた」 and, in addition, (2) 「退屈な連絡ネットが、驚きに満たされた」, (3) 「退屈な連絡網が、驚愕に満たされた」, and (4) 「退屈な連絡ネットが、驚愕に満たされた」.
[0111] The mora counts of the phrases of these candidate texts are: (1) 5 morae + 7 morae + 5 morae + 5 morae; (2) 5 morae + 8 morae + 5 morae + 5 morae; (3) 5 morae + 7 morae + 5 morae + 5 morae; and (4) 5 morae + 8 morae + 5 morae + 5 morae.
[0112] The text candidate generation unit 102 therefore outputs as candidate texts the two candidates in which every phrase consists of 5 or 7 morae: (1) 「退屈な連絡網が、驚きに満たされた」 and (3) 「退屈な連絡網が、驚愕に満たされた」.
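A minimal sketch of the 5-or-7-mora test follows. Mora counting here is simplified (each kana counts as one mora, with small ゃゅょ merged into the preceding kana), and the phrase readings are illustrative assumptions.

```python
# Keep only candidates whose phrases ('/'-separated) all have 5 or 7 morae.

def mora_count(kana: str) -> int:
    small = "ゃゅょぁぃぅぇぉ"
    return sum(1 for ch in kana if ch not in small)

def all_phrases_5_or_7(reading: str) -> bool:
    return all(mora_count(p) in (5, 7) for p in reading.split("/"))

# Phrase readings of the four candidate texts of the example.
candidates = {
    1: "たいくつな/れんらくもうが/おどろきに/みたされた",    # 5+7+5+5
    2: "たいくつな/れんらくねっとが/おどろきに/みたされた",  # 5+8+5+5
    3: "たいくつな/れんらくもうが/きょうがくに/みたされた",  # 5+7+5+5
    4: "たいくつな/れんらくねっとが/きょうがくに/みたされた",  # 5+8+5+5
}

selected = [n for n, reading in candidates.items() if all_phrases_5_or_7(reading)]
print(selected)  # → [1, 3]
```

Relaxing the membership test (e.g. allowing a count of 6 or 8) would accommodate the hypermetric ji-amari phrases mentioned below.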
[0113] Sound quality evaluation and selection of the best candidate are then performed, by the same processing as in the first embodiment, only on the candidate texts thus selected from the plurality of candidate texts by the test.
[0114] This embodiment, too, not only reduces the load of the sound quality evaluation process but also prevents the generation of candidate texts that are unpleasant to listen to, from a viewpoint, the number of morae, different from that of the eighth embodiment. Although the concrete example above tests whether each phrase consists of exactly 5 or 7 morae, it is also possible to accommodate so-called ji-amari (hypermetric) phrases that are not exactly 5 or 7 morae.
[0115] [Tenth Embodiment]
Next, a tenth embodiment of the present invention, obtained by modifying the second embodiment, will be described in detail with reference to the drawings. FIG. 12 is a block diagram showing the configuration of the text-to-speech synthesis system according to the tenth embodiment of the present invention. In the following, matters already described in the second embodiment are omitted.
[0116] Referring to FIG. 12, the text-to-speech synthesis system according to this embodiment differs from the second embodiment in that it comprises, in place of the text candidate generation unit 102, a morphological analysis candidate generation unit 106 that performs morphological analysis on the input text.
[0117] In general, morphological analysis does not yield a unique result; there are multiple morphological analysis candidates, and in text-to-speech synthesis a single optimal analysis result is usually selected from among them and used.
[0118] In the text-to-speech synthesis system according to this embodiment, therefore, the morphological analysis candidate generation unit 106 outputs a plurality of morphological analysis candidates, and the sound quality evaluation score calculation unit 103 assigns a reading to each of these candidates and computes its sound quality evaluation score.
[0119] In the following, the difference from the second embodiment is explained using the example in which the input text is 「東京都に市場がある」 ("there is a market in Tokyo").
[0120] The morphological analysis candidate generation unit 106 performs morphological analysis and outputs four candidate analyses: 「東京都(とうきょうと, noun)/に(particle)/市場(しじょう, noun)/が(particle)/ある(verb)」, 「東(ひがし, noun)/京都(きょうと, noun)/に(particle)/市場(しじょう, noun)/が(particle)/ある(verb)」, 「東京都(とうきょうと, noun)/に(particle)/市場(いちば, noun)/が(particle)/ある(verb)」, and 「東(ひがし, noun)/京都(きょうと, noun)/に(particle)/市場(いちば, noun)/が(particle)/ある(verb)」.
[0121] Here, 「/」 denotes an accent-phrase boundary and 「'」 denotes the accent position.
[0122] Morphological analyses other than these are also possible at this point, but low-probability analyses have been excluded using grammatical knowledge and a morpheme bigram model.
[0123] Corresponding to these morphological analysis results, the sound quality evaluation score calculation unit 103 generates the readings 「とーきょ'ーとに/しじょーがあ'る」, 「ひがしきょ'ーとに/しじょーがあ'る」, 「とーきょ'ーとに/い'ちばが/あ'る」, and 「ひがしきょ'ーとに/い'ちばが/あ'る」, and computes the sound quality evaluation score corresponding to each.
[0124] In this example, none of the candidates is incorrect in the absence of contextual knowledge, so whichever is selected and uttered is acceptable; it therefore becomes possible to use the morphological analysis result selected from the viewpoint of which candidate can be uttered with good sound quality.
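A minimal sketch of how the scoring and selection could combine follows: every candidate reading receives a score and the best-scoring one is kept. The scoring function below is a placeholder assumption standing in for the unit-selection-based score of unit 103; here it simply prefers shorter readings so that the sketch is runnable.

```python
# Select the morphological analysis candidate with the best (highest) score.

candidates = [
    "とーきょ'ーとに/しじょーがあ'る",
    "ひがしきょ'ーとに/しじょーがあ'る",
    "とーきょ'ーとに/い'ちばが/あ'る",
    "ひがしきょ'ーとに/い'ちばが/あ'る",
]

def sound_quality_score(reading: str) -> float:
    # Placeholder: a real implementation would (partially) synthesize the
    # reading and return, e.g., its unit selection score.
    return -len(reading)

best = max(candidates, key=sound_quality_score)
print(best)
```

Only the scoring function differs between the embodiments; the surrounding generate-score-select loop stays the same.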
[0125] [Eleventh Embodiment]
Next, an eleventh embodiment of the present invention, obtained by modifying the second embodiment, will be described in detail with reference to the drawings. FIG. 13 is a block diagram showing the configuration of the text-to-speech synthesis system according to the eleventh embodiment of the present invention. In the following, matters already described in the second embodiment are omitted.
[0126] Referring to FIG. 13, the text-to-speech synthesis system according to this embodiment differs from the second embodiment in that it comprises, in place of the text candidate generation unit 102, a reading generation unit 107 that generates a reading result for the input text, and a pause insertion unit 108 that estimates the positions at which pauses can be inserted into the reading result and generates a plurality of candidate readings with pauses (pause-annotated pronunciation candidates) by expanding the insert/do-not-insert combinations. The sound quality evaluation score calculation unit 103 operates so as to compute a sound quality evaluation score for each of these candidate readings with pauses.
[0127] In the following, the difference from the second embodiment is explained using the example in which the input text is 「私は本屋へ行く」 ("I go to the bookstore").
[0128] The reading generation unit 107 generates the reading result 「わたしわ/ほ'んやえ/いく」. The pause insertion unit 108 estimates that a pause can be inserted at each accent-phrase boundary indicated by 「/」, and generates four candidate readings with pauses (pause-annotated pronunciation candidates): 「わたしわ/ほ'んやえ/いく」, 「わたしわPほ'んやえ/いく」, 「わたしわ/ほ'んやえPいく」, and 「わたしわPほ'んやえPいく」.
[0129] Here, 「/」 denotes an accent-phrase boundary, 「'」 the accent position, and 「P」 a pause position.
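A minimal sketch of the pause expansion follows. Each accent-phrase boundary may independently remain a plain boundary ('/') or become a pause ('P'), so n boundaries yield 2**n candidates. The function name is an illustrative assumption.

```python
from itertools import product

def pause_candidates(reading: str) -> list[str]:
    """Expand every '/' boundary into either '/' or a pause 'P'."""
    parts = reading.split("/")
    n = len(parts) - 1  # number of boundaries
    results = []
    for seps in product("/P", repeat=n):
        out = parts[0]
        for sep, part in zip(seps, parts[1:]):
            out += sep + part
        results.append(out)
    return results

cands = pause_candidates("わたしわ/ほ'んやえ/いく")
for c in cands:
    print(c)  # 4 candidates for the 2 boundaries of the example
```

Each of these candidates would then be scored by the sound quality evaluation score calculation unit and the best one selected.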
[0130] For each of these candidate readings with pauses (pause-annotated pronunciation candidates), the sound quality evaluation score calculation unit 103 computes the corresponding sound quality evaluation score.
[0131] According to this embodiment, since the acoustic environment before and after an accent-phrase boundary changes depending on whether or not a pause is inserted, the combination of pause insertions that yields the best sound quality can be used. In addition, because the computation of the speech quality score reflects the pause-insertion probabilities of the original utterances from which the synthesis units were recorded, the individuality of the original speaker is also expressed.
[0132] [Twelfth Embodiment]
Next, a twelfth embodiment of the present invention, obtained by modifying the second embodiment, will be described in detail with reference to the drawings. FIG. 14 is a block diagram showing the configuration of the text-to-speech synthesis system according to the twelfth embodiment of the present invention. In the following, matters already described in the second embodiment are omitted.
[0133] Referring to FIG. 14, the text-to-speech synthesis system according to this embodiment differs from the second embodiment in that it comprises, in place of the text candidate generation unit 102, a reading generation unit 107 that generates a reading result for the input text, and a reading transformation unit 109 that generates a plurality of candidate readings (pronunciation candidates) by altering the phonemes of the reading result according to rules given in advance. The sound quality evaluation score calculation unit 103 operates so as to compute a sound quality evaluation score for each of these candidate readings.
[0134] In the following, the difference from the second embodiment is explained for the case where the rule that the phoneme 「ひ」 (hi) may be transformed into 「し」 (shi) is given in advance and the input text is 「仕事を一人で引継いだ」 ("I took over the work by myself"). Incidentally, this rule transforming 「ひ」 into 「し」 is a simple transformation rule for speakers of the Edo dialect.
[0135] The reading generation unit 107 assigns and outputs the reading 「しごとを/ひとりで/ひきつ'いだ」 for the input text. Since the phoneme 「ひ」 occurs at two places in this reading, the reading transformation unit 109 generates, from the combinations of changing these to 「し」, four candidate readings (pronunciation candidates): 「しごとを/ひとりで/ひきつ'いだ」, 「しごとを/しとりで/ひきつ'いだ」, 「しごとを/ひとりで/しきつ'いだ」, and 「しごとを/しとりで/しきつ'いだ」.
[0136] Here, 「/」 denotes an accent-phrase boundary and 「'」 the accent position.
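A minimal sketch of the phoneme-level expansion follows, using the single rule ひ→し of the example. Each occurrence of the source phoneme is independently left unchanged or transformed, so k occurrences yield 2**k candidates. Names and defaults are illustrative assumptions.

```python
from itertools import product

def transform_candidates(reading: str, src: str = "ひ", dst: str = "し") -> list[str]:
    """Expand all keep/transform combinations of occurrences of src."""
    positions = [i for i, ch in enumerate(reading) if ch == src]
    results = []
    for choice in product((src, dst), repeat=len(positions)):
        chars = list(reading)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        results.append("".join(chars))
    return results

cands = transform_candidates("しごとを/ひとりで/ひきつ'いだ")
for c in cands:
    print(c)  # 4 candidates for the 2 occurrences of ひ
```

A rule set with several source/target pairs would be applied the same way, expanding each rule's occurrences in turn.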
[0137] For each of these candidate readings (pronunciation candidates), the sound quality evaluation score calculation unit 103 computes the corresponding sound quality evaluation score.
[0138] According to this embodiment, in speech synthesis that uses speech recorded from a speaker who tends to realize a particular phoneme as a different one, that characteristic can be exploited to produce utterances of good sound quality and to reproduce the individuality of the original speaker.
[0139] [Thirteenth Embodiment]
Next, a thirteenth embodiment of the present invention, obtained by modifying the second embodiment, will be described. Since the configuration of this embodiment is the same as that of the second embodiment, the operation of the sound quality evaluation score calculation unit 103, the point in which it differs, will be described in detail.
[0140] In the second embodiment, a unit selection score, an approximation of the sound quality of the synthesized speech, was used as the sound quality evaluation score. In this embodiment, by contrast, the differences in pitch frequency at the end points between segments, the sections into which the synthesized speech is divided, are measured in Hz, and the sum of their absolute values is used as the sound quality evaluation score.
[0141] FIG. 15 shows an example of computing the score when the synthesized speech consists of four segments. Here, a section of an original utterance is assigned to each of segments 1 through 4, and the assigned sections are denoted original utterance 1 through original utterance 4.
[0142] The pitch frequencies on original utterance 1 that correspond to the end points of segment 1 are the starting pitch frequency F1b and the ending pitch frequency F1e. Segments 2 through 4 likewise have pitch frequencies F2b, F2e, F3b, F3e, F4b, and F4e corresponding to their end points.
[0143] Here, at each junction, excluding the start point F1b and the end point F4e of the synthesized speech, the differences between the ending pitch frequency of one segment and the starting pitch frequency of the segment connected to it are summed:

D = |F1e − F2b| + |F2e − F3b| + |F3e − F4b|

and this value is taken as the sound quality evaluation score for that combination of original utterances.
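A minimal sketch of this junction score follows. The segment end-point frequencies are illustrative numbers, and the representation of a segment as a (start, end) pitch pair is an assumption for the sketch.

```python
# Sum of absolute pitch jumps (Hz) at the junctions between consecutive
# segments: D = |F1e-F2b| + |F2e-F3b| + |F3e-F4b| for four segments.

def junction_score(segments: list[tuple[float, float]]) -> float:
    """segments: (start_pitch_hz, end_pitch_hz) per segment, in order."""
    return sum(
        abs(prev_end - next_begin)
        for (_, prev_end), (next_begin, _) in zip(segments, segments[1:])
    )

# (F1b, F1e), (F2b, F2e), (F3b, F3e), (F4b, F4e)
segs = [(120.0, 140.0), (138.0, 150.0), (155.0, 130.0), (128.0, 110.0)]
D = junction_score(segs)
print(D)  # |140-138| + |150-155| + |130-128| = 9.0
```

A smaller D means smaller pitch discontinuities at the joins, i.e. a smoother pitch pattern, so here the *lowest* score would be selected as best.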
[0144] Instead of the sum of the absolute differences of the raw pitch frequencies used above, a sum of squares, a weighted sum, or a computation using logarithmic pitch frequencies may be used. Also, for simplicity, the above example assumes the waveform concatenation method commonly used for waveform generation in speech synthesis, in which the pitch frequency of the original utterance becomes the pitch frequency of the corresponding section of the synthesized speech as-is; however, the method is also applicable when the pitch frequency is edited before use by a waveform editing method. In that case, the same computation can be performed on the edited pitch frequency instead of the pitch frequency of the original utterance.
[0145] As is clear from the above description, the value of the sound quality evaluation score in this embodiment is a score representing the smoothness of the pitch pattern of the synthesized speech, and it can be said to approximate the voice quality of the synthesized speech. Therefore, the method of this embodiment can also generate the synthesized speech having the best sound quality evaluation score.
[0146] [Fourteenth Embodiment]
Next, a fourteenth embodiment of the present invention, obtained by modifying the second embodiment, will be described. Since the configuration of this embodiment is the same as that of the second embodiment, the operation of the sound quality evaluation score calculation unit 103, the point in which it differs, will be described in detail.
[0147] In the second embodiment, a unit selection score, an approximation of the sound quality of the synthesized speech, was used as the sound quality evaluation score. In this embodiment, by contrast, an estimate of the pitch-frequency trajectory is computed in the course of speech synthesis, the difference between that estimate and the pitch frequency of the actual synthesized speech is measured in Hz, and the sum of the absolute values over the entire speech interval is used as the sound quality evaluation score.
[0148] FIG. 16 shows an example of computing the score when the synthesized speech consists of four segments. Here, a section of an original utterance is assigned to each of segments 1 through 4.
[0149] Here, from the pitch frequency fo(t) of the original utterances (original utterance 1 through original utterance 4) assigned to the corresponding times, and the estimated pitch-frequency trajectory ft(t), the sound quality evaluation score is defined as

D = Σ (t = t1, …, t2) |ft(t) − fo(t)|.
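A minimal sketch of this trajectory-deviation score follows, with the trajectory sampled at discrete frames. The sampled values are illustrative numbers.

```python
# Deviation between the estimated pitch trajectory ft(t) and the pitch fo(t)
# realized by the selected units, summed over the speech interval.

def trajectory_score(ft: list[float], fo: list[float]) -> float:
    """D = sum over sampled times t of |ft(t) - fo(t)| (Hz)."""
    assert len(ft) == len(fo)
    return sum(abs(a - b) for a, b in zip(ft, fo))

ft = [120.0, 125.0, 130.0, 128.0]  # estimated trajectory, one value per frame
fo = [118.0, 126.0, 133.0, 128.0]  # pitch of the units actually selected
D = trajectory_score(ft, fo)
print(D)  # 2 + 1 + 3 + 0 = 6.0
```

As with the junction score, a smaller D indicates a closer match to the computed trajectory, so the lowest-scoring candidate would be selected.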
[0150] In this embodiment too, as in the thirteenth embodiment above, a sum of squares, a weighted sum, or a computation using logarithmic pitch frequencies may be used instead of the sum of the absolute differences of the raw pitch frequencies. Also, for simplicity, the above example assumes the waveform concatenation method commonly used for waveform generation in speech synthesis, in which the pitch frequency of the original utterance becomes the pitch frequency of the corresponding section of the synthesized speech as-is; however, the method is also applicable when the pitch frequency is edited before use by a waveform editing method. In that case, the same computation can be performed on the edited pitch frequency instead of the pitch frequency of the original utterance.
[0151] Particularly in the case of the waveform-concatenation speech synthesis method, the pitch frequency of a selected speech unit may differ from the pitch frequency computed for speech synthesis. The value of the sound quality evaluation score in this embodiment is a score representing how faithfully the computed pitch frequency is reproduced, and it can be said to approximate the voice quality of the synthesized speech.
[0152] Therefore, the method of this embodiment can also generate the synthesized speech having the best sound quality evaluation score.
[0153] [Fifteenth Embodiment]
Next, a fifteenth embodiment of the present invention, obtained by modifying the second embodiment, will be described. Since the configuration of this embodiment is the same as that of the second embodiment, the operation of the sound quality evaluation score calculation unit 103, the point in which it differs, will be described in detail.
[0154] In the second embodiment, a unit selection score, an approximation of the sound quality of the synthesized speech, was used as the sound quality evaluation score. In this embodiment, by contrast, the duration of each speech unit is computed in the course of speech synthesis, the difference between that value and the duration of the corresponding unit of the actual synthesized speech is measured in milliseconds, and the sum of the absolute values over the entire speech interval is used as the sound quality evaluation score.
[0155] As noted above, particularly in the case of the waveform-concatenation speech synthesis method, the duration of a selected speech unit may differ from the duration computed for synthesis. The value of the sound quality evaluation score in this embodiment is a score representing how faithfully the computed durations are reproduced, and in this respect it approximates the voice quality of the synthesized speech.
[0156] Therefore, the method of this embodiment can also generate the synthesized speech having the best sound quality evaluation score.
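A minimal sketch of the duration-deviation score of this embodiment follows. The per-unit durations are illustrative numbers.

```python
# Deviation between the durations computed for synthesis and the durations
# of the selected units, in milliseconds, summed over all units.

def duration_score(target_ms: list[float], selected_ms: list[float]) -> float:
    """D = sum over units of |target duration - selected-unit duration|."""
    assert len(target_ms) == len(selected_ms)
    return sum(abs(t - s) for t, s in zip(target_ms, selected_ms))

target = [80.0, 120.0, 95.0, 60.0]    # durations computed for synthesis
selected = [78.0, 125.0, 95.0, 55.0]  # durations of the chosen units
D = duration_score(target, selected)
print(D)  # 2 + 5 + 0 + 5 = 12.0
```

Again the lowest D marks the candidate that most faithfully realizes the computed timing.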
Industrial Applicability
[0157] The fields to which the present invention can be applied largely coincide with those that require speech synthesis technology; the invention is considered to be particularly effective in fields that do not require strict fidelity to the wording of the text, such as spoken dialogue for robots.
[0158] Within the scope of the entire disclosure of the present invention (including the claims), the embodiments and examples may be modified and adjusted based on its basic technical concept. Furthermore, various combinations and selections of the various disclosed elements are possible within the scope of the claims of the present invention.
[0159] For example, although the embodiments above were described using mixed kanji-kana Japanese sentences as the input text, sentences in other languages can also be used as input text by preparing dictionaries in those languages corresponding to the dictionaries shown in the embodiments.

Claims

[1] A text-to-speech synthesis device comprising:
a text candidate generation unit that analyzes input text and generates a plurality of candidate texts consisting of synonymous expressions;
a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each of the candidate texts;
a selection unit that selects, from the plurality of candidate texts, the candidate text having the best sound quality evaluation score; and
a speech synthesis unit that generates synthesized speech corresponding to the selected candidate text.
[2] The text-to-speech synthesis device according to claim 1, further comprising a synonym dictionary that stores synonyms in association with words or phrases, wherein the text candidate generation unit searches the synonym dictionary for synonyms corresponding to the words or phrases contained in the input text and generates the plurality of candidate texts by combinatorially expanding replacement/non-replacement at the word or phrase level.
[3] The text-to-speech synthesis device according to claim 1 or 2, further comprising a synonymous expression dictionary that stores, in association with a given phrase expression, synonymous expressions, i.e., phrase expressions having the same meaning, wherein the text candidate generation unit searches the synonymous expression dictionary for synonymous expressions corresponding to the phrase expressions contained in the input text and generates the plurality of candidate texts by combinatorially expanding replacement/non-replacement at the phrase expression level.
[4] The text-to-speech synthesis device according to any one of claims 1 to 3, wherein the text candidate generation unit generates the plurality of candidate texts by transforming the sentence structure of the input text under the condition that the meaning of the input text is not changed.
[5] The text-to-speech synthesis device according to any one of claims 1 to 4, wherein the text candidate generation unit generates the plurality of candidate texts by combinatorially expanding changing/not changing the politeness level of the words or phrases contained in the input text.
[6] The text-to-speech synthesis device according to any one of claims 1 to 5, further comprising a personality feature dictionary that stores expression variants at the level of individual speakers, wherein the text candidate generation unit searches the personality feature dictionary for other synonymous expressions corresponding to the expressions contained in the input text and generates the plurality of candidate texts by combinatorially expanding replacement/non-replacement according to these individual-level expression variants.
[7] The text-to-speech synthesis device according to any one of claims 1 to 6, further comprising a filler dictionary describing fillers and interjections, wherein the text candidate generation unit refers to the filler dictionary and generates the plurality of candidate texts by combinatorially expanding insertion/non-insertion of the fillers or interjections between the words of the input text.
[8] The text-to-speech synthesis device according to any one of claims 1 to 7, further comprising a filler dictionary describing fillers and interjections, wherein the text candidate generation unit refers to the filler dictionary and generates the plurality of candidate texts by combinatorially expanding deletion/non-deletion of fillers or interjections from the input text.
[9] The text-to-speech synthesis device according to any one of claims 1 to 8, wherein the text candidate generation unit generates the plurality of candidate texts by rewriting part of a word or phrase, or a whole phrase, contained in the input text as a reduplicated expression.
[10] Wherein the text candidate generation unit takes, among the expanded results, only rhyming expressions as candidate texts,
The text-to-speech synthesis device according to any one of claims 1 to 9.
[11] Wherein the text candidate generation unit takes, among the expanded results, only expressions in which every accent phrase consists of exactly 5 or 7 morae as candidate texts,
The text-to-speech synthesis device according to any one of claims 1 to 10.
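The candidate generation of claims 7 through 11 amounts to a combinatorial expansion followed by optional filtering. The sketch below illustrates that idea only; the function names, the toy filler list, and the use of whitespace-separated words as stand-ins for accent phrases are all assumptions for illustration, not taken from the patent.

```python
from itertools import product

# Toy filler list standing in for the filler dictionary (illustrative assumption).
FILLERS = ["um"]

def expand_with_fillers(words, fillers=FILLERS):
    """Enumerate candidate texts by inserting or not inserting a filler
    into each gap between words (claim 7's insertion/non-insertion)."""
    gaps = len(words) - 1
    candidates = []
    for choices in product([None] + fillers, repeat=gaps):
        out = [words[0]]
        for word, gap in zip(words[1:], choices):
            if gap is not None:
                out.append(gap)
            out.append(word)
        candidates.append(" ".join(out))
    return candidates

def filter_by_mora(candidates, mora_count, allowed=(5, 7)):
    """Keep only candidates whose accent phrases all have 5 or 7 morae
    (claim 11); `mora_count` maps an accent phrase to its mora count and
    is assumed to be supplied by the surrounding system."""
    return [c for c in candidates
            if all(mora_count(p) in allowed for p in c.split(" "))]

cands = expand_with_fillers(["hello", "world", "today"])
# Two gaps, one optional filler each, so the expansion yields 2**2 = 4 candidates.
```

A rhyme filter (claim 10) would slot in the same way as `filter_by_mora`: a predicate over the expanded candidates, applied before scoring.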
[12] A text-to-speech synthesis device comprising:
a morphological analysis candidate generation unit that computes a plurality of candidate morphological analysis results from input text;
a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate morphological analysis result;
a selection unit that selects, from the plurality of candidate morphological analysis results, the candidate morphological analysis result having the best sound quality evaluation score; and
a speech synthesis unit that generates synthesized speech corresponding to the selected candidate morphological analysis result.
[13] A text-to-speech synthesis device comprising:
a reading generation unit that generates a reading (pronunciation) from input text;
a pause insertion unit that generates a plurality of candidate readings with pauses (pronunciation candidates with pauses) by combinations of insertion/non-insertion of pauses (phrase breaks) into the reading;
a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate reading with pauses;
a selection unit that selects, from the plurality of candidate readings with pauses, the candidate reading with pauses having the best sound quality evaluation score; and
a speech synthesis unit that generates synthesized speech corresponding to the selected candidate reading with pauses.
[14] A text-to-speech synthesis device comprising:
a reading generation unit that generates a reading (pronunciation) from input text;
a reading transformation unit that generates a plurality of candidate readings (pronunciation candidates) by combinatorially expanding substitution/non-substitution of specific sounds contained in the reading with other sounds;
a sound quality evaluation score calculation unit that calculates a sound quality evaluation score for each candidate reading;
a selection unit that selects, from the plurality of candidate readings, the candidate reading having the best sound quality evaluation score; and
a speech synthesis unit that generates synthesized speech corresponding to the selected candidate reading.
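Claims 12 through 14 share one generate/score/select shape: enumerate candidates, score each for sound quality, pick the best, and synthesize it. A minimal sketch of that loop follows; the scorer and synthesizer here are placeholders, and every name in the snippet is an assumption for illustration.

```python
def select_best(candidates, score):
    """Pick the candidate with the best (here: highest) sound quality
    evaluation score, as in the selection units of claims 12-14."""
    return max(candidates, key=score)

def synthesize(candidate):
    # Placeholder for the speech synthesis unit; a real system would
    # return a waveform for the chosen candidate.
    return f"<waveform for {candidate!r}>"

# Toy scorer: prefer shorter candidates (a stand-in for a real sound
# quality evaluation score such as those in claims 15-18).
score = lambda c: -len(c)

best = select_best(["hello world", "hi world"], score)
audio = synthesize(best)
```

The same loop applies unchanged whether the candidates are alternative texts, morphological analyses, pause placements, or sound substitutions; only the candidate generator and the scorer vary.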
[15] Wherein the sound quality evaluation score is a score representing the smoothness of the pitch pattern of the synthesized speech,
The text-to-speech synthesis device according to any one of claims 1 to 14.
[16] Wherein the sound quality evaluation score is a score representing the difference between an estimated pitch pattern and the pitch pattern of the synthesized speech,
The text-to-speech synthesis device according to any one of claims 1 to 15.
[17] Wherein the sound quality evaluation score is a score representing the difference between an estimated rhythm and the rhythm of the synthesized speech,
The text-to-speech synthesis device according to any one of claims 1 to 16.
[18] Wherein the sound quality evaluation score is a score representing the spectral smoothness between segments when the synthesized speech is created,
The text-to-speech synthesis device according to any one of claims 1 to 17.
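The scores of claims 15 and 16 can be illustrated with, for example, a negated sum of squared second differences for smoothness and a negated sum of squared residuals for the pattern difference. These exact formulas are assumptions chosen for illustration; the patent does not commit to a specific formula.

```python
def pitch_smoothness_score(pitch):
    """Score a pitch contour (e.g. Hz per frame): smoother contours score
    higher. Uses the negated sum of squared second differences, so a
    perfectly linear contour scores 0 (claim 15 style)."""
    penalty = sum((pitch[i + 1] - 2 * pitch[i] + pitch[i - 1]) ** 2
                  for i in range(1, len(pitch) - 1))
    return -penalty

def pitch_difference_score(estimated, synthesized):
    """Claim 16 style: negated sum of squared differences between an
    estimated pitch pattern and the synthesized one (0 is a perfect match)."""
    return -sum((a - b) ** 2 for a, b in zip(estimated, synthesized))

linear = [100.0, 110.0, 120.0, 130.0]   # perfectly smooth contour
jagged = [100.0, 130.0, 100.0, 130.0]   # oscillating contour
```

The rhythm score of claim 17 would take the same residual form over phoneme durations, and claim 18's spectral score over spectra at segment joins.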
[19] A program to be executed by a computer constituting a text-to-speech synthesis device, causing the computer to function as:
means for analyzing input text and generating a plurality of candidate texts consisting of synonymous expressions;
means for calculating a sound quality evaluation score for each candidate text;
means for selecting, from the plurality of candidate texts, the candidate text having the best sound quality evaluation score; and
means for generating synthesized speech corresponding to the selected candidate text.
[20] A program to be executed by a computer constituting a text-to-speech synthesis device, causing the computer to function as:
means for computing a plurality of candidate morphological analysis results from input text;
means for calculating a sound quality evaluation score for each candidate morphological analysis result;
means for selecting, from the plurality of candidate morphological analysis results, the candidate morphological analysis result having the best sound quality evaluation score; and
means for generating synthesized speech corresponding to the selected candidate morphological analysis result.
[21] A program to be executed by a computer constituting a text-to-speech synthesis device, causing the computer to function as:
means for generating a reading (pronunciation) from input text;
means for generating a plurality of candidate readings with pauses (pronunciation candidates with pauses) by combinations of insertion/non-insertion of pauses (phrase breaks) into the reading;
means for calculating a sound quality evaluation score for each candidate reading with pauses;
means for selecting, from the plurality of candidate readings with pauses, the candidate reading with pauses having the best sound quality evaluation score; and
means for generating synthesized speech corresponding to the selected candidate reading with pauses.
[22] A program to be executed by a computer constituting a text-to-speech synthesis device, causing the computer to function as:
means for generating a reading (pronunciation) from input text;
means for generating a plurality of candidate readings (pronunciation candidates) by combinatorially expanding substitution/non-substitution of specific sounds contained in the reading with other sounds;
means for calculating a sound quality evaluation score for each candidate reading;
means for selecting, from the plurality of candidate readings, the candidate reading having the best sound quality evaluation score; and
means for generating synthesized speech corresponding to the selected candidate reading.
[23] A text-to-speech synthesis method using a computer, comprising the steps of:
the computer analyzing input text and generating a plurality of candidate texts consisting of synonymous expressions;
the computer calculating a sound quality evaluation score for each candidate text;
the computer selecting, from the plurality of candidate texts, the candidate text having the best sound quality evaluation score; and
the computer generating synthesized speech corresponding to the selected candidate text.
[24] A text-to-speech synthesis method using a computer, comprising the steps of:
the computer computing a plurality of candidate morphological analysis results from input text;
the computer calculating a sound quality evaluation score for each candidate morphological analysis result;
the computer selecting, from the plurality of candidate morphological analysis results, the candidate morphological analysis result having the best sound quality evaluation score; and
the computer generating synthesized speech corresponding to the selected candidate morphological analysis result.
[25] A text-to-speech synthesis method using a computer, comprising the steps of:
the computer generating a reading (pronunciation) from input text;
the computer generating a plurality of candidate readings with pauses (pronunciation candidates with pauses) by combinations of insertion/non-insertion of pauses (phrase breaks) into the reading;
the computer calculating a sound quality evaluation score for each candidate reading with pauses;
the computer selecting, from the plurality of candidate readings with pauses, the candidate reading with pauses having the best sound quality evaluation score; and
the computer generating synthesized speech corresponding to the selected candidate reading with pauses.
[26] A text-to-speech synthesis method using a computer, comprising the steps of:
the computer generating a reading (pronunciation) from input text;
the computer generating a plurality of candidate readings (pronunciation candidates) by combinatorially expanding substitution/non-substitution of specific sounds contained in the reading with other sounds;
the computer calculating a sound quality evaluation score for each candidate reading;
the computer selecting, from the plurality of candidate readings, the candidate reading having the best sound quality evaluation score; and
the computer generating synthesized speech corresponding to the selected candidate reading.
PCT/JP2007/071285 2006-11-08 2007-11-01 Text-to-speech synthesis device, program and text-to-speech synthesis method WO2008056590A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2008543045A JPWO2008056590A1 (en) 2006-11-08 2007-11-01 Text-to-speech synthesizer, program thereof, and text-to-speech synthesis method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2006302954 2006-11-08
JP2006-302954 2006-11-08

Publications (1)

Publication Number Publication Date
WO2008056590A1 true WO2008056590A1 (en) 2008-05-15

Family

ID=39364408

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2007/071285 WO2008056590A1 (en) 2006-11-08 2007-11-01 Text-to-speech synthesis device, program and text-to-speech synthesis method

Country Status (2)

Country Link
JP (1) JPWO2008056590A1 (en)
WO (1) WO2008056590A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003302993A (en) * 2002-04-10 2003-10-24 Canon Inc Voice synthesizer, voice synthesis method, program and storage device
JP2004118004A (en) * 2002-09-27 2004-04-15 Asahi Kasei Corp Voice synthesizer
JP2004126205A (en) * 2002-10-02 2004-04-22 Nippon Telegr & Teleph Corp <Ntt> Method, device, and program for voice synthesis
JP2006227425A (en) * 2005-02-18 2006-08-31 National Institute Of Information & Communication Technology Speech reproducing device and utterance support device
JP2006243104A (en) * 2005-03-01 2006-09-14 Canon Inc Speech synthesizing method

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010092710A1 (en) * 2009-02-16 2010-08-19 株式会社東芝 Speech processing device, speech processing method, and speech processing program
US8650034B2 (en) 2009-02-16 2014-02-11 Kabushiki Kaisha Toshiba Speech processing device, speech processing method, and computer program product for speech processing
JP2014044289A (en) * 2012-08-27 2014-03-13 Animo:Kk Program, method and device for formatting text
JP2014048443A (en) * 2012-08-31 2014-03-17 Nippon Telegr & Teleph Corp <Ntt> Voice synthesis system, voice synthesis method, and voice synthesis program
JP2016004267A (en) * 2014-06-19 2016-01-12 バイドゥ オンライン ネットワーク テクノロジー(ペキン) カンパニー リミテッド Method and apparatus for speech synthesis based on large corpus
KR20190057792A (en) * 2017-11-20 2019-05-29 주식회사 마인즈랩 System for generating a sentence for machine learning and method for generating a similar sentence using thereof
KR102102388B1 (en) * 2017-11-20 2020-04-21 주식회사 마인즈랩 System for generating a sentence for machine learning and method for generating a similar sentence using thereof
JP2020060642A (en) * 2018-10-09 2020-04-16 株式会社日立ソリューションズ・テクノロジー Speech synthesis system and speech synthesizer
JP7110055B2 (en) 2018-10-09 2022-08-01 株式会社日立ソリューションズ・テクノロジー Speech synthesis system and speech synthesizer

Also Published As

Publication number Publication date
JPWO2008056590A1 (en) 2010-02-25

Similar Documents

Publication Publication Date Title
KR102199067B1 (en) Method of multilingual text-to-speech synthesis
JP4302788B2 (en) Prosodic database containing fundamental frequency templates for speech synthesis
US20100057435A1 (en) System and method for speech-to-speech translation
JP2008545995A (en) Hybrid speech synthesizer, method and application
Kayte et al. Hidden Markov model based speech synthesis: A review
JP5198046B2 (en) Voice processing apparatus and program thereof
JPH11249677A (en) Rhythm control method for voice synthesizer
JP4704254B2 (en) Reading correction device
WO2008056590A1 (en) Text-to-speech synthesis device, program and text-to-speech synthesis method
JP2006227589A (en) Device and method for speech synthesis
Badino et al. Language independent phoneme mapping for foreign TTS
HIROSE et al. A system for the synthesis of high-quality speech from texts on general weather conditions
KR0146549B1 (en) Korean language text acoustic translation method
Black Speech synthesis for educational technology
JP7406418B2 (en) Voice quality conversion system and voice quality conversion method
Bonafonte et al. The UPC TTS system description for the 2008 blizzard challenge
Janyoi et al. An Isarn dialect HMM-based text-to-speech system
Anberbir et al. Grapheme-to-phoneme conversion for Amharic text-to-speech system
Iyanda et al. Development of a Yorúbà Textto-Speech System Using Festival
Williams Text-to-speech synthesis for Welsh and Welsh English.
Mustafa et al. EM-HTS: real-time HMM-based Malay emotional speech synthesis.
Cabral et al. The ADAPT entry to the Blizzard Challenge 2016
Delić et al. A Review of AlfaNum Speech Technologies for Serbian, Croatian and Macedonian
Khalil et al. Optimization of Arabic database and an implementation for Arabic speech synthesis system using HMM: HTS_ARAB_TALK
Sawada et al. Constructing text-to-speech systems for languages with unknown pronunciations

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07831019

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2008543045

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07831019

Country of ref document: EP

Kind code of ref document: A1