JP2003108180A

JP2003108180A - Method and device for voice synthesis

Info

Publication number: JP2003108180A
Application number: JP2001294722A
Authority: JP
Inventors: Yoshiteru Uchiyama; 喜照内山
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 2001-09-26
Filing date: 2001-09-26
Publication date: 2003-04-11

Abstract

PROBLEM TO BE SOLVED: To obtain a high-quality speech synthesis result by optimizing the parameters used for speech synthesis before performing the synthesis. SOLUTION: A speech synthesis device 1 receives a text, synthesizes speech for it, and outputs a speech signal. A speech recognition device 2 recognizes the speech signal output from the speech synthesis device 1 and outputs the recognition result as text. A text comparison unit 3 compares the recognized text with the text input to the speech synthesis device 1 and supplies the comparison result to the speech synthesis device 1. The speech synthesis device 1 changes the synthesis parameters until it receives a signal indicating a match, thereby obtaining optimum parameters, and outputs the speech signal synthesized with those parameters as the speech synthesis result.

Description

Detailed Description of the Invention

[0001]

Field of the Invention

The present invention relates to a speech synthesis method and a speech synthesis apparatus for generating a speech signal from text.

[0002]

Description of the Related Art

Rule-based speech synthesis, which generates a speech signal from a given text, has long been widely used in information processing technologies that handle speech. In this description, rule-based speech synthesis is referred to simply as speech synthesis.

[0003] Recent advances in speech synthesis techniques are making it possible to generate more natural synthetic speech. There is still room for improvement, however, in generating synthetic speech that appropriately expresses the content of the text and the situation at hand and that is close to a natural human voice.

[0004]

Problems to Be Solved by the Invention

Briefly, speech synthesis converts an input text into reading information (phonetic transcription information) through language processing that uses a language dictionary and language parameters, and then generates a speech signal from that reading information through phoneme processing that uses a phoneme dictionary and prosodic parameters. In many cases, however, the generated speech signal does not properly reflect the content of the text.

[0005] For example, suppose the input text contains a character string such as "800F" (whose intended reading here is "happyaku-efu"). When that text is synthesized by the procedure described above, the "F" of "800F" may be interpreted as "floor" (kai), and a speech signal such as "happyaku-kai" may be output. There is also the problem that the accent cannot be determined properly, so that, for example, "hashi" (bridge) is output as a speech signal with the accent of "hashi" (chopsticks).

[0006] One conceivable way to deal with such problems is to evaluate the speech synthesis result automatically and, based on the evaluation, apply corrections so that a proper speech synthesis result is obtained. For Japanese in particular, however, the language processing is complex, so it is difficult to evaluate a speech synthesis result automatically and objectively and to apply corrections that yield a proper result.

[0007] Accordingly, an object of the present invention is to provide a speech synthesis method and apparatus that, when synthesizing text to generate a speech signal, perform learning so that an optimum speech synthesis result is obtained.

[0008]

Means for Solving the Problems

To achieve the above object, in the speech synthesis method of the present invention, a text is input to speech synthesis means, the text is synthesized using the parameters required for speech synthesis and output as a speech signal, the speech signal is subjected to speech recognition, the text obtained as the recognition result is compared with the text input to the speech synthesis means, the parameters are set to certain values based on the comparison result and taken as learning parameters, and the speech signal synthesized with the learning parameters is output as the speech synthesis result.

[0009] In this speech synthesis method, the process of setting the parameters to certain values based on the comparison result and taking them as learning parameters changes the parameters until the text obtained as the recognition result matches the text input to the speech synthesis means, and takes the parameters in effect when the two match as the learning parameters. The learning parameters can also be saved.
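The loop in [0009] can be sketched as follows. This is a minimal illustration, assuming the parameter is varied by walking a finite list of candidate values; the description does not say how the parameter is changed or what happens when no value produces a match. `learn_parameter`, `toy_synthesize`, and `toy_recognize` are hypothetical names, and the toy synthesizer and recognizer stand in for real engines by passing romanized readings instead of audio.

```python
from typing import Callable, Iterable, Optional

def learn_parameter(
    text: str,
    candidates: Iterable[str],
    synthesize: Callable[[str, str], str],
    recognize: Callable[[str], str],
) -> Optional[str]:
    """Change the parameter until the recognized text equals the input text."""
    for param in candidates:
        signal = synthesize(text, param)
        if recognize(signal) == text:
            return param  # the parameter at the moment of the match
    return None  # assumption: give up when every candidate has been tried

# Toy stand-ins: the "signal" is just the reading chosen for the text.
TOY_RECOGNIZER = {"happyaku-kai": "800階", "happyaku-efu": "800F"}

def toy_synthesize(text: str, reading: str) -> str:
    return reading

def toy_recognize(signal: str) -> str:
    return TOY_RECOGNIZER[signal]

learned = learn_parameter(
    "800F", ["happyaku-kai", "happyaku-efu"], toy_synthesize, toy_recognize
)
print(learned)
```

Returning `None` on exhaustion is a design choice of this sketch only; a real system would need its own fallback.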

[0010] Also, in the speech synthesis method of the present invention, when a text is input to speech synthesis means and synthesized using the parameters required for speech synthesis, and a plurality of candidates exist for a parameter, one of the candidates is selected and speech synthesis is performed using the selected parameter. The synthesis result is subjected to speech recognition, the similarity between the recognition result and the text input to the speech synthesis means is determined, and, based on the similarity determination result, one of the candidate parameters is selected and taken as the learning parameter. The speech signal synthesized with that learning parameter is output as the speech synthesis result.

[0011] In this speech synthesis method, the process of selecting one of the candidate parameters based on the similarity determination result and taking it as the learning parameter selects the candidate parameters one after another until the similarity reaches or exceeds a predetermined value, and takes the parameter for which the similarity reached or exceeded that value as the learning parameter. The learning parameter can also be saved.
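The threshold-based selection in [0011] might look like the sketch below. The description does not define the similarity measure, so this example assumes a character-level ratio from Python's `difflib.SequenceMatcher`; the threshold of 0.9, the function names, and the toy recognizer table are all hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Assumed similarity measure: difflib's ratio in the range [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def select_candidate(text, candidates, synthesize, recognize, threshold=0.9):
    """Try candidates in order until the similarity reaches the threshold."""
    for param in candidates:
        recognized = recognize(synthesize(text, param))
        score = similarity(text, recognized)
        if score >= threshold:
            return param, score
    return None, 0.0  # assumption: no candidate was good enough

# Toy stand-ins: the "signal" is the reading chosen for the letter "F".
TOY_RECOGNIZER = {"kai": "A-800階", "efu": "A-800F"}

def toy_synthesize(text, reading):
    return reading

def toy_recognize(signal):
    return TOY_RECOGNIZER[signal]

chosen, score = select_candidate(
    "A-800F", ["kai", "efu"], toy_synthesize, toy_recognize
)
print(chosen, score)
```

With these toy strings the first candidate shares five of six characters with the input and falls below the threshold, so the second candidate is selected.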

[0012] The speech synthesis apparatus of the present invention comprises: speech synthesis means that receives a text, synthesizes it using the parameters required for speech synthesis, and outputs a speech signal; speech recognition means that recognizes the speech signal output from the speech synthesis means and outputs the recognition result as text; and text comparison means that compares the text obtained as the recognition result with the text input to the speech synthesis means. Based on the comparison result output from the text comparison means, the speech synthesis means sets the parameters to certain values, takes them as learning parameters, and outputs the speech signal synthesized with the learning parameters as the speech synthesis result.

[0013] In this speech synthesis apparatus, the process of setting the parameters to certain values based on the comparison result and taking them as learning parameters changes the parameters until the text obtained as the recognition result matches the text input to the speech synthesis means, and takes the parameters in effect when the two match as the learning parameters.

[0014] The speech synthesis means is provided with parameter storage means, in which the learning parameters can be saved.

[0015] In another aspect, the speech synthesis apparatus of the present invention comprises: speech synthesis means that, when a text is input and synthesized using the parameters required for speech synthesis and a plurality of candidates exist for a parameter, selects one of the candidates, performs speech synthesis using the selected parameter, and outputs a speech signal; and speech recognition means that recognizes the speech signal output from the speech synthesis means, determines the similarity between the recognition result and the text input to the speech synthesis means, and can output the similarity determination result. Based on the similarity determination result from the speech recognition means, the speech synthesis means selects one of the candidate parameters, takes it as the learning parameter, and outputs the speech signal synthesized with that learning parameter as the speech synthesis result.

[0016] The process of selecting one of the candidate parameters based on the similarity determination result and taking it as the learning parameter selects the candidate parameters one after another until the similarity reaches or exceeds a predetermined value, and takes the parameter for which the similarity reached or exceeded that value as the learning parameter.

[0017] The speech synthesis means is provided with parameter storage means, in which the learning parameter can be saved.

[0018] As described above, the present invention synthesizes a text using the parameters required for speech synthesis and outputs a speech signal, recognizes that speech signal, compares the text obtained as the recognition result with the input text, and changes the parameters until the recognized text matches the text input to the speech synthesis means. The parameters best suited to the input text can therefore be set as the learning parameters, and a high-quality speech synthesis result can be obtained for that input text.

[0019] Furthermore, by saving the parameters obtained at that time, proper speech synthesis can be performed in subsequent synthesis simply by referring to the saved parameters whenever the same text is input.

[0020] The present invention can also be applied to the case where a plurality of candidates exist for a parameter, one of the candidates is selected, and speech synthesis is performed using the selected parameter. As above, the parameter best suited to the input text can be set as the learning parameter, so a high-quality speech synthesis result can be obtained for that input text.

[0021] In this case, because the optimum parameter is selected from among a plurality of candidates, the invention is particularly effective when synthesizing words whose reading or accent is difficult to determine. Here too, the optimum parameter is saved, so in subsequent synthesis proper speech synthesis can be performed simply by referring to it whenever the same text is input.

[0022]

Embodiments of the Invention

Embodiments of the present invention are described below. Two embodiments are described here.

[0023] [First Embodiment] FIG. 1 shows the basic configuration of a first embodiment of the speech synthesis method and speech synthesis apparatus according to the present invention, which comprises a speech synthesis device 1, a speech recognition device 2, and a text comparison unit 3.

[0024] In outline, the processing of the first embodiment is as follows. The text to be synthesized (referred to as the input text) is synthesized by the speech synthesis device 1 using the parameters required for speech synthesis, and the resulting speech signal is output. The speech recognition device 2 recognizes this speech signal and outputs the recognized text (referred to as the output text). The text comparison unit 3 compares the output text with the input text and feeds the comparison result back to the speech synthesis device 1. For any portion of the output text that differs from the input text, the speech synthesis device 1 changes the synthesis parameters and synthesizes again, and this procedure is repeated until the output text matches the input text. When the two match, the parameters in effect at that time are taken as the learning parameters, and the speech signal synthesized with them is output as the speech synthesis result. The details are described below.
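The per-portion feedback in [0024], where only the differing portions get new parameters, can be sketched as below. The sketch assumes the text is already split into tokens, that each token has an ordered list of candidate readings, and that recognition can be modeled token by token; none of this is specified in the description. `feedback_loop` and the toy tables are hypothetical.

```python
def feedback_loop(tokens, candidates, recognize_unit):
    """Re-synthesize only the differing portions until output equals input.

    tokens:         input text split into units (assumed pre-tokenized)
    candidates:     token -> ordered list of candidate readings
    recognize_unit: reading -> recognized token (stand-in for a recognizer)
    """
    choice = [0] * len(tokens)  # index of the current candidate per token
    while True:
        readings = [candidates[t][choice[i]] for i, t in enumerate(tokens)]
        output = [recognize_unit(r) for r in readings]
        differing = [i for i, (t, o) in enumerate(zip(tokens, output)) if t != o]
        if not differing:
            return readings  # parameters at the moment of the match
        for i in differing:
            choice[i] += 1  # change the parameter for that portion only
            if choice[i] >= len(candidates[tokens[i]]):
                return None  # assumption: stop when candidates run out

TOY_CANDIDATES = {"800": ["happyaku"], "F": ["kai", "efu"]}
TOY_RECOGNIZER = {"happyaku": "800", "kai": "階", "efu": "F"}

result = feedback_loop(["800", "F"], TOY_CANDIDATES, TOY_RECOGNIZER.get)
print(result)
```

Each round either returns or advances at least one candidate index, so the loop always terminates.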

[0025] FIG. 2 shows the configuration of FIG. 1 in more detail. The speech synthesis device 1 includes a language processing unit 11, a language dictionary unit 12, a phoneme processing unit 13, a phoneme dictionary unit 14, a parameter generation unit 15, and so on. The parameters generated by the parameter generation unit 15 include language parameters, which determine the readings of the character strings making up the text, sentence breaks, and the like, and prosodic parameters, which determine accent, fundamental frequency, utterance duration, and the like. The language parameters are supplied to the language processing unit 11, and the prosodic parameters are supplied to the phoneme processing unit 13.

[0026] The speech recognition device 2 has a speech recognition processing unit 21 and converts an input speech signal into text for output. The speech recognition processing unit 21 does not use a recognition technique specific to the present invention, but it is assumed to have recognition performance high enough to recognize the input speech signal ideally. Its recognition result is supplied to the text comparison unit 3 as text (the output text).

[0027] In this configuration, when an input text is given to the speech synthesis device 1, the language processing unit 11 refers to the language dictionary 12, performs language processing using the language parameters generated by the parameter generation unit 15, and outputs reading information (phonetic transcription information), which is supplied to the phoneme processing unit 13. The phoneme processing unit 13 refers to the phoneme dictionary 14, performs phoneme processing using the prosodic parameters generated by the parameter generation unit 15, and outputs a speech signal.
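The two-stage flow in [0027] can be illustrated with a toy pipeline. This is only a sketch of the data flow: the class names, fields, and dictionary contents are hypothetical, the input is assumed to be pre-tokenized, the phoneme dictionary lookup is omitted, and the "speech signal" is reduced to a list of (reading, fundamental frequency, duration) tuples rather than audio.

```python
from dataclasses import dataclass, field

@dataclass
class LanguageParams:
    # Readings forced for specific tokens; these override the dictionary.
    readings: dict = field(default_factory=dict)

@dataclass
class ProsodyParams:
    f0_hz: float = 120.0   # fundamental frequency
    duration_ms: int = 80  # utterance duration per unit

# Toy language dictionary in which "F" defaults to the reading for "floor".
LANGUAGE_DICT = {"800": "happyaku", "F": "kai"}

def language_processing(tokens, lang):
    """Stage 1: text -> reading information, using dictionary + language params."""
    return [lang.readings.get(t, LANGUAGE_DICT.get(t, t)) for t in tokens]

def phoneme_processing(readings, prosody):
    """Stage 2: reading information -> toy signal, using prosodic params."""
    return [(r, prosody.f0_hz, prosody.duration_ms) for r in readings]

def synthesize(tokens, lang, prosody):
    return phoneme_processing(language_processing(tokens, lang), prosody)

default = synthesize(["800", "F"], LanguageParams(), ProsodyParams())
fixed = synthesize(["800", "F"], LanguageParams({"F": "efu"}), ProsodyParams())
print(default)
print(fixed)
```

Keeping the two parameter groups in separate types mirrors the routing in the text: language parameters go only to the first stage and prosodic parameters only to the second.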

[0028] The speech signal output from the phoneme processing unit 13 is supplied to the speech recognition device 2, where the speech recognition processing unit 21 performs speech recognition. The text obtained as the recognition result (the output text) is output and supplied to the text comparison unit 3.

[0029] In this speech recognition processing, recognition is performed, for example, by taking the surrounding context into account and using a speech recognition dictionary and the like. Several recognition candidates are output in ranked order as the recognition result, the first-ranked candidate is supplied to the text comparison unit 3 as the recognition result, and that first-ranked candidate is compared with the input text.

[0030] The text comparison unit 3 compares, as character strings, the output text from the speech recognition device 2 with the input text given to the speech synthesis device 1 and determines whether any portion differs. If so, it notifies the parameter generation unit 15 of information indicating the differing portion.
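One way to realize the string comparison in [0030] is sketched below using Python's standard `difflib`. The description does not say how differing portions are located, so the use of `SequenceMatcher` opcodes and the function name `differing_portions` are assumptions of this sketch. The sample strings are the input and output texts of the FIG. 3 example discussed in this description.

```python
from difflib import SequenceMatcher

def differing_portions(input_text: str, output_text: str):
    """Return (start, end, expected, got) for each span that is not equal.

    start/end index into input_text; an empty list means the texts match.
    """
    matcher = SequenceMatcher(None, input_text, output_text, autojunk=False)
    return [
        (i1, i2, input_text[i1:i2], output_text[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

input_text = "コンピュータに対して発せられた"
output_text = "コンピュータに対して8000られた"

for start, end, expected, got in differing_portions(input_text, output_text):
    print(start, end, expected, got)
```

An empty result plays the role of the match signal, and a non-empty result is the information about differing portions that would be passed back to parameter generation.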

[0031] On receiving information indicating a differing portion from the text comparison unit 3, the parameter generation unit 15 generates parameters different from the previous ones. Speech synthesis is performed with those parameters, and the resulting speech signal is output.

[0032] The speech signal generated with these new parameters is input to the speech recognition device 2 and recognized again, and the text obtained as the recognition result (the output text) is supplied to the text comparison unit 3. The text comparison unit 3 compares this output text with the input text and, if any portion differs, supplies information indicating the differing portion to the parameter generation unit 15.

[0033] On receiving information indicating a differing portion from the text comparison unit 3, the parameter generation unit 15 again generates parameters different from the previous ones. Speech synthesis is performed with those parameters, and the resulting speech signal is output.

[0034] This processing is repeated. When the comparison shows that the output text obtained as the recognition result matches the input text given to the speech synthesis device 1, the text comparison unit 3 outputs a match signal, which is supplied to the parameter generation unit 15.

[0035] On receiving the match signal, the parameter generation unit 15 outputs the speech signal synthesized at that time to the outside as the speech synthesis result. The parameters in effect at that time are learned parameters (referred to as learning parameters) that enable optimum speech synthesis for the input text. The speech synthesis result obtained with the learning parameters can therefore be regarded as the optimum result for that input text.

[0036] In other words, if the speech signal synthesized by the speech synthesis device 1 is recognized by the speech recognition device 2 and the text output as the recognition result is identical to the text input to the speech synthesis device 1, the synthesis can be said to have been performed properly, and the synthesis result at that time can be said to be the optimum result for the input text.

[0037] FIG. 3 shows an example of an input text and an output text. Suppose the input text shown in FIG. 3(a) is given to the speech synthesis device 1, which synthesizes it and outputs a speech signal as the synthesis result for that input text, and that recognition of this speech signal by the speech recognition device 2 yields the text (output text) shown in FIG. 3(b).

[0038] In the output text shown in FIG. 3(b), the underlined portion is the portion that differs from the input text. In this example, for the input text "コンピュータに対して発せられた..." ("... uttered to the computer ..."), the output text is "コンピュータに対して8000られた...", in which "発せられた" (hasserareta, "uttered") has become "8000られた" ("8000-rareta").

[0039] That is, in this example the portion "発せられた" ("uttered") was not synthesized properly and was output as a speech signal pronounced "hassen-rareta". The speech recognition device 2 recognized that portion exactly as pronounced, "hassen" being the reading of 8000, so the recognized text became "8000られた".

[0040] The text comparison unit 3 therefore notifies the parameter generation unit 15 of information indicating this differing portion. In response, the parameter generation unit 15 generates, for that portion, parameters different from the previous ones (phoneme selection for the reading, fundamental frequency, accent, utterance duration, and so on) and sends them to the language processing unit 11 and the phoneme processing unit 13.

[0041] As a result, the "発せられた" ("uttered") portion of the input text "コンピュータに対して発せられた..." is synthesized again using the new parameters, and the resulting speech signal is output. This speech signal is input to the speech recognition device 2, the text obtained as the recognition result (the output text) is output, and the text comparison unit 3 again compares it with the input text.

[0042] If the comparison determines that the two match, the parameters in effect at that time become the learning parameters, and the speech signal synthesized with them is output as the speech synthesis result.

[0043] As described above, in the first embodiment a text (the input text) is synthesized by the speech synthesis device 1, the resulting speech signal is supplied to the speech recognition device 2 and recognized, and the text obtained as the recognition result (the output text) is compared with the input text. If any portion of the output text differs from the input text, that portion is synthesized again using parameters different from the previous ones, the resulting speech signal is supplied to the speech recognition device 2 and recognized, and the recognized text (the output text) is compared with the input text. This processing is repeated until the output text matches the input text.

[0044] When the output text matches the input text, the speech signal synthesized with the parameters in effect at that time (the learning parameters) is output as the speech synthesis result. The optimum speech synthesis result for that input text is thereby obtained.

[0045] Thus, if the speech signal synthesized by the speech synthesis device 1 is recognized by the speech recognition device 2 and the text output as the recognition result is identical to the text input to the speech synthesis device 1, the synthesis can be said to have been performed properly. Moreover, the fact that the synthesized speech is recognized correctly means that it is also highly intelligible to human hearing and is high-quality synthetic speech.

[0046] In the first embodiment, the learning parameters can also be saved. FIG. 4 shows the configuration of a speech synthesis apparatus with a function for saving the learning parameters. It adds a parameter storage unit 16 to the configuration of FIG. 2 and is otherwise the same as FIG. 2, so identical parts are given identical reference numerals.

[0047] With the parameter storage unit 16 provided in this way, the parameters in effect when the output text matched the input text (the learning parameters) are stored in the parameter storage unit 16.

[0048] The parameter storage unit 16 thus accumulates parameters that yield the optimum speech synthesis result for each of the various texts input so far. When the same text is input again later, synthesizing it with the parameters stored for that text yields the optimum result for that text, and the processing for generating learning parameters described above is no longer needed.
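The reuse of stored learning parameters in [0048] amounts to a lookup keyed by the input text. The sketch below assumes an in-memory dictionary and exact-match keys; the class `ParameterStore`, its method name, and the shape of the stored parameters are hypothetical, and persistence is not modeled.

```python
class ParameterStore:
    """Toy stand-in for the parameter storage unit: text -> learning parameters."""

    def __init__(self):
        self._params = {}

    def get_or_learn(self, text, learn):
        # Run the (expensive) learning loop only for texts not seen before.
        if text not in self._params:
            self._params[text] = learn(text)
        return self._params[text]

# Hypothetical learner that records how often it is actually invoked.
calls = []

def toy_learn(text):
    calls.append(text)
    return {"reading": "happyaku-efu"}

store = ParameterStore()
first = store.get_or_learn("800F", toy_learn)
second = store.get_or_learn("800F", toy_learn)
print(first, second, len(calls))
```

The second request for the same text is served from the store, so the learning loop runs only once per distinct text.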

[0049] [Second Embodiment] FIG. 5 is a block diagram illustrating a second embodiment of the present invention. Like the first embodiment, the second embodiment has a speech synthesis device 1 and a speech recognition device 2.

[0050] As in FIG. 2, the speech synthesizer 1 has a language processing unit 11, a language dictionary unit 12, a phoneme processing unit 13, a phoneme dictionary unit 14, and a parameter generation unit 15. In the second embodiment it additionally has a parameter selection unit 17.

[0051] The speech recognition device 2 has a speech recognition processing unit 21 similar to that of the first embodiment, which performs speech recognition and outputs a recognition result. In the second embodiment it also has a similarity determination unit 22, which determines the similarity between the recognition result from the speech recognition processing unit 21 and the input text (the text input to the speech synthesizer 1).

[0052] The second embodiment addresses the case in which, when the speech synthesizer 1 performs speech synthesis, there are multiple candidates for a parameter needed for the synthesis. It uses the recognition result of the speech recognition device 2 as the means of determining the optimum parameter among those candidates. Multiple candidates arise, for example, when the reading or the accent of the input text cannot be determined. Specific examples are described below.

[0053] For example, consider the case where an input text such as "... Company has developed the new product A-800F" is input to the speech synthesizer 1. The speech synthesizer 1 has a parameter selection unit 17 that can select multiple candidates one after another as the parameter for speech synthesis. Suppose that, from the parameter generation unit 15, the parameter selection unit 17 selects as the first candidate for the reading of the "F" in "800F" the parameter for the reading "kai" (階, "floor").

[0054] The synthesized speech output from the speech synthesizer 1 is then "happyaku kai", and the corresponding voice signal is given to the speech recognition device 2. In the speech recognition device 2, the speech recognition processing unit 21 recognizes the input voice signal using a speech recognition dictionary and the like while taking the surrounding context into account, and outputs as the recognition result several recognition candidates, in rank order from the top, together with their similarities.

[0055] FIG. 6(a) shows an example of the top recognition candidates and their similarities obtained for the voice signal "happyaku kai". In this example the candidates and similarities are output in rank order: the first candidate is "800回" (800 times) with similarity 43, the second is "800階" (800th floor) with similarity 30, and the third is "100回" (100 times) with similarity 22.

[0056] The similarity here is a numerical value indicating how likely a recognition candidate obtained for the input voice signal is to be correct, with 100 as the maximum. For example, for the voice signal "happyaku kai", the recognition candidate "800回" has a likelihood of 43.

[0057] The recognition result from the speech recognition processing unit 21 shown in FIG. 6(a) (the top recognition candidates and their similarities) is given to the similarity determination unit 22, which is also given the input text that was input to the speech synthesizer 1.

[0058] The similarity determination unit 22 therefore selects one of the top recognition candidates on the basis of the input text, using the surrounding context and the like, outputs the similarity of the selected candidate as the similarity determination result, and gives it to the parameter selection unit 17.

[0059] In this case the input text is "... Company has developed the new product A-800F", so for the "800F" portion of the text the similarity determination unit 22 selects, from the surrounding context and the like, the second-ranked candidate "800階" among the top candidates shown in FIG. 6(a), and gives its similarity of 30 to the parameter selection unit 17.
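The role of the similarity determination unit 22 in this step can be sketched as picking, from the ranked candidates, the one that agrees with the input text and reporting its similarity. The description does not say how the context is used to make that choice; the sketch below assumes it reduces to an exact comparison against an expected written form, and the function name `judge_similarity` and the `None` return for "no agreeing candidate" are likewise assumptions.

```python
from typing import List, Optional, Tuple

Candidate = Tuple[str, int]  # (recognized text, similarity on a 0-100 scale)


def judge_similarity(candidates: List[Candidate], expected: str) -> Optional[int]:
    """Return the similarity of the candidate that agrees with the input text.

    Returns None when no candidate agrees, a case the description does not cover.
    """
    for text, similarity in candidates:
        if text == expected:
            return similarity
    return None


# Candidates from FIG. 6(a) for the synthesized speech "happyaku kai".
fig_6a = [("800回", 43), ("800階", 30), ("100回", 22)]
result = judge_similarity(fig_6a, "800階")
print(result)  # 30
```

Note that the selected candidate is not the top-ranked one: the unit reports 30 for "800階" even though "800回" scored 43, which is the behaviour the paragraph above describes.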

[0060] The parameter selection unit 17 determines whether the similarity has reached a predetermined value. Here the similarity is a low value, 30, so the unit determines that the predetermined value has not been reached. The parameter selection unit 17 therefore selects the second candidate for the reading parameter of the "800F" portion. Suppose that the second candidate is the parameter that produces the synthesized speech "happyaku efu".

[0061] The synthesized speech output from the speech synthesizer 1 is then "happyaku efu". The corresponding voice signal is given to the speech recognition device 2 and recognized by the speech recognition processing unit 21. As the recognition result, the top recognition candidates and their similarities shown in FIG. 6(b) are output and given to the similarity determination unit 22.

[0062] In the example of FIG. 6(b), the candidates and similarities are output in rank order and given to the similarity determination unit 22: the first candidate is "800F" with similarity 80, and the second is "100F" with similarity 22.

[0063] The similarity determination unit 22 selects one of the top recognition candidates on the basis of the input text that was input to the speech synthesizer 1, using the surrounding context and the like, and gives the similarity of the selected candidate to the parameter selection unit 17 as the similarity determination result.

[0064] In this case the input text is "... Company has developed the new product A-800F", so as the recognition result for the "800F" portion of the text the similarity determination unit 22 selects, from the surrounding context and the like, the first-ranked candidate "800F" among the top candidates shown in FIG. 6(b), and gives its similarity of 80 to the parameter selection unit 17.

[0065] The parameter selection unit 17 determines whether the similarity has reached the predetermined value. Here the similarity is a high value, 80, so it is determined that the predetermined value has been reached.

[0066] The parameter selection unit 17 therefore treats the parameter in use at that time (the second candidate) as correct and adopts it as the learning parameter, and the voice signal obtained by speech synthesis with that learning parameter is output as the speech synthesis result.
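The candidate-by-candidate decision made by the parameter selection unit 17 in the "800F" example can be sketched as a loop with a threshold test. The description gives only "a predetermined value", so the threshold of 50 below is an assumption chosen to lie between the two similarities in the example (30 and 80). The function name and the lookup table standing in for the synthesize, recognize, and judge steps are also illustrative.

```python
from typing import Callable, Optional, Sequence

THRESHOLD = 50  # assumed; the description only says "a predetermined value"

# Similarities reported in the worked example:
# 30 for the reading "kai" (FIG. 6(a)) and 80 for the reading "efu" (FIG. 6(b)).
EXAMPLE_SIMILARITY = {"kai": 30, "efu": 80}


def select_parameter(
    candidates: Sequence[str],
    similarity_for: Callable[[str], int],
    threshold: int = THRESHOLD,
) -> Optional[str]:
    """Try candidates in order; return the first whose similarity reaches the threshold."""
    for candidate in candidates:
        # Stands in for: synthesize with this candidate, recognize, judge similarity.
        if similarity_for(candidate) >= threshold:
            return candidate  # adopted as the learning parameter
    return None  # no candidate qualified; the description does not cover this case


learned = select_parameter(["kai", "efu"], EXAMPLE_SIMILARITY.__getitem__)
print(learned)  # efu
```

The loop stops at the first candidate that qualifies rather than scoring all candidates and taking the maximum, which matches the sequential selection described in the text.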

[0067] As another example, suppose that an input text such as "... a long bridge (橋, hashi) over the river ..." is given to the speech synthesizer 1, that the speech synthesizer 1 cannot determine the accent position for "橋", and that multiple candidates are therefore output as the parameter expressing the accent of "橋".

[0068] First, when a parameter that places the accent on "ha" is given as the first candidate, the voice signal output as the speech synthesis result corresponds to "hashi" with the accent on "ha".

[0069] When the speech recognition processing unit 21 recognizes this voice signal, which corresponds to "hashi" with the accent on "ha", using a speech recognition dictionary and the like while taking the surrounding context into account, it outputs the top recognition candidates and their similarities as the recognition result, as in the "happyaku efu" example above.

[0070] These top recognition candidates and their similarities are given to the similarity determination unit 22, which selects one of the recognition candidates and its similarity on the basis of the input text and outputs the similarity of the selected candidate as the similarity determination result.

[0071] Here the similarity determination unit 22 selects "橋" (bridge) from among the top recognition candidates on the basis of the input text. The similarity of "橋" (assumed here to be 40) is given to the parameter selection unit 17 as the similarity determination result.

[0072] The parameter selection unit 17 determines that the similarity of 40 given by the similarity determination unit 22 as the similarity determination result has not reached the predetermined value, and selects another parameter as the second candidate.

[0073] Suppose that a parameter placing the accent on the "shi" of "hashi" is selected as the second candidate. Speech synthesis is performed with this second-candidate parameter, and the resulting voice signal is given to the speech recognition device 2.

[0074] In this case the voice signal produced as the speech synthesis result corresponds to "hashi" with the accent on "shi".

[0075] The speech recognition processing unit 21 recognizes this signal in the same way and outputs the top recognition candidates and their similarities as the recognition result. This output is given to the similarity determination unit 22, which selects one of the recognition candidates and its similarity on the basis of the input text and outputs the similarity of the selected candidate as the similarity determination result.

[0076] Here again the similarity determination unit 22 selects "橋" (bridge) from among the top recognition candidates on the basis of the input text. The similarity of "橋" (assumed here to be 90) is given to the parameter selection unit 17 as the similarity determination result.

[0077] The parameter selection unit 17 determines that the similarity of 90 given by the similarity determination unit 22 as the similarity determination result is higher than the predetermined value, adopts the second-candidate parameter as the learning parameter, and outputs the voice signal obtained by speech synthesis with that learning parameter as the speech synthesis result.

[0078] Thus, in the second embodiment, when a text is input whose reading, accent position, or the like the speech synthesizer 1 cannot determine, several candidate parameters are selected for the undetermined portion and speech is synthesized for each candidate. The resulting voice signal is subjected to speech recognition, and the top recognition candidates and their similarities are output as the recognition result. From these candidates and similarities, the similarity determination unit 22 selects, on the basis of the input text, the similarity corresponding to one of the recognition candidates, and the optimum parameter is selected according to the magnitude of that similarity.

[0079] In other words, the speech synthesizer 1 first synthesizes speech with a parameter selected as a candidate, gives the synthesis result to the speech recognition device 2, and judges from the similarity sent back by the speech recognition device 2 whether the parameter is appropriate. This process is repeated until a similarity of the predetermined value is obtained, at which point the parameter is fixed as the correct one.

[0080] The voice signal obtained by speech synthesis with the fixed parameter (the learning parameter) is then output as the speech synthesis result.

[0081] As a result, even for a word unknown to the speech synthesizer 1, an optimum speech synthesis result for the input text can be generated as long as speech recognition is possible.

[0082] In the second embodiment, as in the first embodiment described above, the learning parameters can be saved. FIG. 7 shows the configuration of a speech synthesizer that has a function for saving the learning parameters. It adds a parameter storage unit 16 to the configuration of FIG. 5 and is otherwise identical to FIG. 5, so the same parts are given the same reference numerals.

[0083] Because the learning parameters obtained by the processing described above are stored in the parameter storage unit 16, the parameter storage unit 16 accumulates parameters that yield an optimum speech synthesis result for each of the various texts input so far. When the same text is input again later, synthesizing speech with the parameters stored for that text gives the optimum result for that text, and the learning-parameter generation process described above can be skipped.

[0084] The present invention is not limited to the embodiments described above and can be modified in various ways without departing from its gist. For example, in the second embodiment described above, the top recognition candidates and their similarities are output as the speech recognition result, and the similarity determination unit 22 selects, on the basis of the input text, the similarity corresponding to one of those candidates. If the input text is instead given to the speech recognition processing unit 21, the speech recognition processing unit 21 knows the correct answer for the recognition result, so it can omit obtaining many recognition candidates for the input voice signal, calculating their similarities, and sorting them by similarity. This simplifies the processing required for speech recognition, allows faster recognition, and speeds up the overall processing. In that case the similarity determination unit 22 is no longer needed, and the similarity obtained as the speech recognition result can be given directly to the parameter selection unit 17.
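The saving described in this modification can be illustrated by counting scoring operations: ranking every vocabulary entry versus scoring only the known input text. The sketch is a toy under stated assumptions. The "voice signal" is represented as a romanized string, the vocabulary entries are hypothetical, and `difflib.SequenceMatcher` is used purely as a stand-in for acoustic matching, not as the recognizer described in the source.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

# Toy vocabulary (hypothetical): written form -> romanized reading.
VOCABULARY: Dict[str, str] = {
    "800F": "happyakuefu",
    "100F": "hyakuefu",
    "800回": "happyakukai",
}

score_calls: List[str] = []  # records every scoring call, to compare the workload


def toy_similarity(signal: str, reading: str) -> int:
    """Toy stand-in for acoustic matching: string similarity scaled to 0-100."""
    score_calls.append(reading)
    return round(100 * SequenceMatcher(None, signal, reading).ratio())


def n_best(signal: str) -> List[Tuple[str, int]]:
    """Second-embodiment style: score every entry, then rank by similarity."""
    scored = [(text, toy_similarity(signal, r)) for text, r in VOCABULARY.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def score_known_text(signal: str, input_text: str) -> int:
    """Modified style: the input text is known, so only its reading is scored."""
    return toy_similarity(signal, VOCABULARY[input_text])
```

With this toy vocabulary, `n_best` performs one scoring call per entry plus a sort, while `score_known_text` performs a single call and returns a similarity that could go straight to the parameter selection unit 17.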

[0085] A processing program describing the processing procedure for implementing the present invention described above can be created and recorded on a recording medium such as a floppy disk, an optical disk, or a hard disk, and the present invention also includes a recording medium on which that processing program is recorded. The processing program may also be obtained from a network.

【００８６】[0086]

[Effects of the Invention] As described above, according to the present invention, a text is synthesized into speech using the parameters needed for speech synthesis and output as a voice signal, the voice signal is recognized, and the text obtained as the speech recognition result is compared with the input text. The parameters are changed until the recognized text matches the text input to the speech synthesizing means, so the optimum parameters for the input text can be set as the learning parameters. A high-quality speech synthesis result can thus be obtained for the input text.
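The match-until-equal loop summarized here can be sketched as follows. The parameter being varied (a speaking rate), the toy synthesizer and recognizer, and the rule that recognition succeeds only at moderate rates are all assumptions made for illustration; the source does not specify which parameters are varied or in what order.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

P = TypeVar("P")  # parameter type
S = TypeVar("S")  # voice signal type


def learn_parameter(
    text: str,
    values: Iterable[P],
    synthesize: Callable[[str, P], S],
    recognize: Callable[[S], str],
) -> Optional[P]:
    """Change the parameter until the recognized text equals the input text."""
    for value in values:
        signal = synthesize(text, value)
        if recognize(signal) == text:  # role of the text comparison unit
            return value  # this value becomes the learning parameter
    return None  # no value produced a match (not covered by the description)


def toy_synthesize(text: str, rate: float) -> Tuple[str, float]:
    # Toy "voice signal": the text paired with a hypothetical speaking rate.
    return (text, rate)


def toy_recognize(signal: Tuple[str, float]) -> str:
    text, rate = signal
    # Pretend recognition only succeeds at moderate speaking rates (assumption).
    return text if rate <= 1.2 else text[:-1]


learned_rate = learn_parameter("A-800F", [2.0, 1.5, 1.0], toy_synthesize, toy_recognize)
```

Unlike the threshold test of the second embodiment, the stopping condition here is exact equality between the recognized text and the input text.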

[0087] Furthermore, because the parameters in effect at that time are saved, in subsequent speech synthesis the same text can be synthesized properly simply by referring to the saved parameters.

[0088] The present invention can also be applied to the case where multiple candidates exist for a parameter, one of the candidates is selected, and speech is synthesized with the selected parameter. In this form too, as above, the optimum parameter for the input text can be set as the learning parameter, so a high-quality speech synthesis result can be obtained for the input text.

[0089] In this case, because the optimum parameter is selected from among multiple candidates, the invention is particularly effective when synthesizing words whose reading or accent is difficult to determine. Here too the optimum parameter is saved, so in subsequent speech synthesis the same text can be synthesized properly simply by referring to it.

[Brief description of drawings]

FIG. 1 is a basic configuration diagram illustrating the first embodiment of the speech synthesis method and device according to the present invention.

FIG. 2 is a diagram showing the configuration of FIG. 1 in detail.

FIG. 3 is a diagram showing an example of an input text and an output text, used to explain the first embodiment.

FIG. 4 is a configuration diagram in which a parameter storage unit for saving the learning parameters is added to FIG. 2.

FIG. 5 is a configuration diagram illustrating the second embodiment of the speech synthesis method and device according to the present invention.

FIG. 6 shows examples of the top recognition candidates and their similarities output from the speech recognition processing unit 21 in the second embodiment: (a) shows the case where the input voice signal is "happyaku kai", and (b) shows the case where the input voice signal is "happyaku efu".

FIG. 7 is a configuration diagram in which a parameter storage unit for saving the learning parameters is added to FIG. 5.

[Explanation of symbols]

1 Speech synthesizer
2 Speech recognition device
3 Text comparison unit
11 Language processing unit
12 Language dictionary unit
13 Phoneme processing unit
14 Phoneme dictionary unit
15 Parameter generation unit
16 Parameter storage unit
17 Parameter selection unit
21 Speech recognition processing unit
22 Similarity determination unit

Continuation of front page: (51) Int.Cl.⁷ G10L 15/22; FI G10L 3/00 537C

Claims

1. A speech synthesis method characterized in that a text is input to a voice synthesizing means, the text is subjected to voice synthesis processing using parameters necessary for voice synthesis and output as a voice signal, the voice signal is recognized by voice recognition, the text obtained as the voice recognition result is compared with the text input to the voice synthesizing means, the parameters are set to certain values on the basis of the comparison result and used as learning parameters, and the voice signal obtained by voice synthesis with the learning parameters is output as the voice synthesis result.

2. The speech synthesis method according to claim 1, wherein the process of setting the parameters to certain values on the basis of the comparison result and using them as learning parameters is a process of changing the parameters until the text obtained as the voice recognition result matches the text input to the voice synthesizing means, and using the parameters in effect when the two match as the learning parameters.

3. The speech synthesis method according to claim 1, wherein the learning parameter is stored.

4. A speech synthesis method characterized in that, when a text is input to a voice synthesizing means and subjected to voice synthesis processing using parameters necessary for voice synthesis, and a plurality of candidates exist for a parameter, one of the plurality of candidates is selected, voice synthesis is performed using the selected parameter, the voice synthesis result is recognized by voice recognition, the similarity between the voice recognition result and the text input to the voice synthesizing means is determined, one of the candidate parameters is selected on the basis of the similarity determination result and used as a learning parameter, and the voice signal obtained by voice synthesis with the learning parameter is output as the voice synthesis result.

5. The speech synthesis method according to claim 4, wherein the process of selecting one of the candidate parameters on the basis of the similarity determination result and using the selected parameter as a learning parameter is a process of selecting the candidate parameters one after another until the similarity becomes equal to or greater than a predetermined value, and using the parameter whose similarity is equal to or greater than the predetermined value as the learning parameter.

6. The speech synthesis method according to claim 4, wherein the learning parameter is stored.

7. A voice synthesizing apparatus characterized by comprising: a voice synthesizing means that receives a text, performs voice synthesis processing on the text using parameters necessary for voice synthesis, and outputs a voice signal; a voice recognition means that recognizes the voice signal output from the voice synthesizing means and outputs the voice recognition result as text; and a text comparison means that compares the text obtained as the recognition result by the voice recognition means with the text input to the voice synthesizing means, wherein the voice synthesizing means sets the parameters to certain values on the basis of the result, output from the text comparison means, of comparing the text obtained as the recognition result with the text input to the voice synthesizing means, uses them as learning parameters, and outputs the voice signal obtained by voice synthesis with the learning parameters as the voice synthesis result.

8. The voice synthesizing apparatus according to claim 7, wherein the process of setting the parameters to certain values on the basis of the comparison result and using them as learning parameters is a process of changing the parameters until, according to the comparison result output from the text comparison means, the text obtained as the recognition result matches the text input to the voice synthesizing means, and using the parameters in effect when the two match as the learning parameters.

9. The voice synthesizing apparatus according to claim 7, wherein the voice synthesizing means is provided with a parameter storage means, and the learning parameter is stored in the parameter storage means.

10. A voice synthesizing apparatus characterized by comprising: a voice synthesizing means that, when a text is input and subjected to voice synthesis processing using parameters necessary for voice synthesis and a plurality of candidates exist for a parameter, selects one of the plurality of candidates, performs voice synthesis processing using the selected parameter, and outputs a voice signal; and a voice recognition means that recognizes the voice signal output from the voice synthesizing means, determines the similarity between the recognition result and the text input to the voice synthesizing means, and can output the similarity determination result, wherein the voice synthesizing means selects one of the candidate parameters on the basis of the similarity determination result from the voice recognition means, uses the selected parameter as a learning parameter, and outputs the voice signal obtained by voice synthesis with the learning parameter as the voice synthesis result.

11. The voice synthesizing apparatus according to claim 10, wherein the process of selecting one of the candidate parameters on the basis of the similarity determination result and using the selected parameter as a learning parameter is a process of selecting the candidate parameters one after another until the similarity becomes equal to or greater than a predetermined value, and using the parameter whose similarity is equal to or greater than the predetermined value as the learning parameter.

12. The voice synthesizing apparatus according to claim 10, wherein the voice synthesizing unit is provided with a parameter storage unit, and the learning parameter is stored in the parameter storage unit.