JP4297433B2 - Speech synthesis method and apparatus - Google Patents


Info

Publication number
JP4297433B2
JP4297433B2 (application JP2004139861A)
Authority
JP
Japan
Prior art keywords
speech
unit
units
synthesized
speaker
Prior art date
Legal status
Expired - Fee Related
Application number
JP2004139861A
Other languages
Japanese (ja)
Other versions
JP2005321631A (en)
Inventor
未来 長谷部
秀之 水野
Current Assignee
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP2004139861A
Publication of JP2005321631A
Application granted
Publication of JP4297433B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesis method, and an apparatus therefor, that delivers higher-quality synthesized speech than before according to the situation in which the synthesized speech is reproduced.

SOLUTION: A DB search unit 2 retrieves, from a database 1 storing a speech corpus that contains speech units of the various units constituting a language, the speech units usable for synthesis for each unit in an input text. A speech unit selection unit 3 selects the optimal combination of speech units from the retrieved per-unit candidates. A situation determination unit 4 determines whether the acoustic device that will reproduce the synthesized speech is a headphone or a speaker. With a headphone, each speech unit is output to a connection unit 6 as it is; with a speaker, each speech unit is routed through a signal processing unit 5, which applies signal processing that corrects the prosody before outputting it to the connection unit 6.

COPYRIGHT: (C) 2006, JPO&NCIPI

Description

The present invention relates to text-to-speech synthesis technology that converts text into speech and reproduces (outputs) it, where the text is entered directly from a keyboard or the like, read from a storage medium on which it was stored in advance, or received from another device via a communication medium.

Text-to-speech synthesis is currently used in a wide range of situations, such as telephone-based information guidance systems (stock price guidance and the like) and the reading aloud of e-mail and Web pages.

One conventional speech synthesis method (1) works as follows: from a speech corpus containing speech units at the various units that make up a language (phonemes, phonological units, words, and so on), the speech units usable for synthesis that correspond to a unit in the input text are retrieved; the optimal speech unit is selected from among the retrieved candidates; this is repeated for every unit in the text; and the selected units are connected as they are to produce the synthesized speech (see Patent Document 1).

Alternatively, (2) each selected speech unit is first subjected to signal processing that matches it to the target prosody, and the processed units are then connected to produce the synthesized speech (see Non-Patent Document 1).

Japanese Patent No. 2761552; Satoshi Takano, Masanobu Abe, "A New F0 Modification Algorithm by Manipulating Harmonics of Magnitude Spectrum", Eurospeech '99; Takano, Abe, "Evaluation of the effect of prosody modification on sound quality by partial substitution experiments", Proceedings of the Acoustical Society of Japan, 1-7-11, 2000(3), pp. 217-218.

However, method (1) can synthesize speech with a natural, lifelike quality, but the prosody of the synthesized speech may differ from the target prosody; method (2) achieves the target prosody, but the signal processing may impair the lifelike quality of the speech.

In other words, conventional text-to-speech synthesis suffers from a trade-off between the naturalness of the synthesized speech and the accuracy of its prosody, and the two cannot be satisfied at the same time.

Thus, current speech synthesis technology has not achieved quality comparable to human speech, and there has been strong demand for improving the quality of synthesized speech.

An object of the present invention is to realize a speech synthesis method and apparatus that can provide synthesized speech of higher quality than before, according to the situation in which the synthesis technology is used, and in particular the situation in which the synthesized speech is reproduced.

As described above, the naturalness of synthesized speech and the accuracy of its prosody are in a trade-off relationship: attempting to synthesize the prosody exactly degrades the sound quality.

The degree of quality degradation caused by the signal processing used to reproduce the prosody exactly depends on the amount and direction of the prosodic modification, the signal processing method, and other factors (see Non-Patent Documents 1 and 2).

Moreover, the degradation is perceived differently depending on whether the synthesized speech is heard through headphones or through a speaker; naturally, it is easier to notice when listening attentively to the synthesized speech alone through headphones.

The playback situation thus varies: the synthesized speech may be reproduced through headphones or through a speaker, and in the latter case the speaker may be in a quiet place such as a room or in a noisy place such as a station concourse. How noticeable the quality degradation is differs with each situation and environment.

The present invention takes the playback situation into account: in situations where degradation caused by signal processing is inconspicuous, signal processing is applied to correct the prosody. The user is thus presented with synthesized speech whose prosody is correct yet whose degradation is not perceived, resolving the trade-off between sound quality and prosodic correctness described above.

According to the present invention, synthesized speech of higher quality than before can be provided in accordance with the situation in which it is reproduced.

In this application, "headphone" covers every acoustic device that is used in direct contact with the body of a single person and conveys the audio signal to that person's ears, and "speaker" covers every acoustic device that is used without contact with the human body and conveys the audio signal, mainly through the air, to the ears of one or more people.

The characteristic feature of the present invention is that the usage situation of the synthesis technology is taken into account: whether to apply prosodic modification by signal processing is decided according to the situation in which the synthesized speech is reproduced. In environments with enough noise to mask the degradation, the prosody can be corrected before playback, yielding higher-quality speech than before.

As a result, higher-quality synthesized speech can be provided in information guidance and other services, and speech synthesis becomes usable in fields where quality problems previously ruled it out.

FIG. 1 shows an example embodiment of the speech synthesizer of the present invention. In the figure, 1 is a database (DB), 2 is a database search unit (DB search unit), 3 is a speech unit selection unit, 4 is a situation determination unit, 5 is a signal processing unit, and 6 is a connection unit. The operation of each unit is described below together with its configuration.

DB 1 stores a speech corpus containing speech units at the various units that make up a language (phonemes, phonological units, words, and so on). Specifically, the corpus contains the information needed for synthesis: speech waveforms, prosodic information, phoneme label sequences corresponding to the utterance content, label data marking the speech unit boundaries, and the like.
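A corpus entry of the kind just described can be pictured as a simple record. The field names below are illustrative assumptions, not terminology from the patent:

```python
from dataclasses import dataclass

@dataclass
class CorpusEntry:
    """One utterance stored in DB 1 (field names are assumptions)."""
    waveform: list        # speech waveform samples
    f0_contour: list      # prosodic information, e.g. F0 per frame
    phoneme_labels: list  # phoneme label sequence of the utterance
    boundaries: list      # sample indices marking speech unit boundaries
```

The DB search unit would match phoneme label subsequences of such entries against the units of the input text.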

The DB search unit 2 takes as input the input information 101, which comprises the input text together with control information for synthesis: the phoneme sequence of the text produced by a text analysis unit (not shown), the target prosody, the designation of the database and signal processing method to use, and so on. For each unit in the text it retrieves from the database 1 the speech units usable for synthesis, and passes the retrieved speech units 102 to the speech unit selection unit 3 as the search result.

The speech unit selection unit 3 selects the optimal combination of speech units from the per-unit candidates 102 passed from the DB search unit 2. Specifically, it computes a cost from factors affecting synthesized speech quality (prosody, phonetic context, connectivity, linguistic information, and so on) and searches for the combination of speech units that minimizes this cost. The resulting optimal combination 103 is passed to the situation determination unit 4 as the selection result.
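The cost-minimizing search over candidate units is typically realized as a dynamic-programming (Viterbi-style) search. The following is a minimal sketch under assumed cost-function interfaces, not the patent's actual implementation:

```python
def select_units(candidates, target_cost, join_cost):
    """Viterbi-style search for the cost-minimizing combination of
    speech units: one candidate is chosen per unit position in the
    text. `target_cost(c)` scores a candidate against the synthesis
    target; `join_cost(p, c)` scores the junction between successive
    candidates. Both interfaces are assumptions for illustration."""
    n = len(candidates)
    # best[i][j]: minimum total cost of any path ending at candidate j
    # of position i; back[i][j]: predecessor index realizing that cost.
    best = [[target_cost(c) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, n):
        row, brow = [], []
        for c in candidates[i]:
            costs = [best[i - 1][k] + join_cost(p, c)
                     for k, p in enumerate(candidates[i - 1])]
            k = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k] + target_cost(c))
            brow.append(k)
        best.append(row)
        back.append(brow)
    # Trace back the optimal path from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(n)]
```

With zero target costs and a join cost that favors the pair ("a", "c"), the search picks exactly that pair, illustrating how junction quality steers the selection.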

The situation determination unit 4 passes the optimal combination 103 received from the speech unit selection unit 3 to either the signal processing unit 5 or the connection unit 6, as described later, according to the situation in which the synthesized speech will be reproduced.

When the signal processing unit 5 receives the combination 103 from the situation determination unit 4, it applies to each speech unit in the combination signal processing that corrects the prosody to match the synthesis target, and passes the result to the connection unit 6 as the corrected combination 104.
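The patent leaves the prosody-correcting signal processing to existing techniques (for example, the F0-manipulation algorithm of Non-Patent Document 1). As a deliberately naive stand-in, duration can be modified by linear-interpolation resampling; a real system would use PSOLA-like processing to control F0 and duration independently and avoid this method's pitch artefacts:

```python
def stretch(samples, target_len):
    """Naive duration modification by linear-interpolation resampling.
    Illustrative only: resampling shifts pitch along with duration,
    which proper prosody correction avoids."""
    if target_len <= 1:
        return samples[:1]
    out = []
    scale = (len(samples) - 1) / (target_len - 1)
    for i in range(target_len):
        x = i * scale                       # position in the source
        j = min(int(x), len(samples) - 2)   # left neighbour index
        frac = x - j
        # linear interpolation between the two neighbouring samples
        out.append(samples[j] * (1 - frac) + samples[j + 1] * frac)
    return out
```

For example, stretching a two-sample ramp to three samples inserts the midpoint value between the endpoints.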

The connection unit 6 connects the combination 103 received from the situation determination unit 4, or the corrected combination 104 received from the signal processing unit 5, and outputs the result as synthesized speech (data) 105 to a playback acoustic device or the like (not shown).

The configuration and operation of the DB 1, DB search unit 2, speech unit selection unit 3, signal processing unit 5, connection unit 6, and the text analysis unit (not shown) are no different from those of existing speech synthesizers, so their details are omitted.

FIG. 2 shows the flow of processing in the situation determination unit 4; its operation is described below following this flow.

The situation determination unit 4 receives the optimal combination 103 from the speech unit selection unit 3, together with device identification information 106 indicating whether the acoustic device that will reproduce the synthesized speech is a headphone or a speaker, and, when that device is a speaker, the ambient noise level (information) 107 at the speaker's installation location. From the device identification information 106 it determines whether the playback device is a headphone or a speaker (s1). If it is a headphone, the influence of quality degradation from signal processing is judged to be large, and the combination 103 is passed to the connection unit 6 as it is (s2).

If, on the other hand, the device identification information 106 indicates that the playback device is a speaker, the unit determines whether the noise level 107 is at or below a preset threshold (s3). If it is at or below the threshold, the influence of the degradation is judged to be large, and the combination 103 is passed to the connection unit 6 (s2); if it exceeds the threshold, the influence is judged to be small, and the combination 103 is passed to the signal processing unit 5 (s4).

Step s3 may also be omitted, so that whenever the playback device is a speaker the influence of the degradation is always treated as small and the combination 103 is passed to the signal processing unit 5.
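The decision flow of steps s1 to s4, including the optional omission of s3, can be sketched as follows; the function and parameter names are illustrative, not from the patent:

```python
def synthesize(units, device, noise_level=None, noise_threshold=None,
               correct_prosody=lambda u: u):
    """Route the selected unit combination per FIG. 2, steps s1-s4.
    `units` is a list of speech units (here, lists of samples);
    `correct_prosody` stands in for signal processing unit 5."""
    def concatenate(us):                    # connection unit 6
        out = []
        for u in us:
            out.extend(u)
        return out

    if device == "headphone":               # s1: degradation easily heard
        return concatenate(units)           # s2: connect as-is
    # Playback device is a speaker. Step s3 is optional: passing
    # noise_threshold=None skips it and always applies processing.
    if noise_threshold is not None and noise_level <= noise_threshold:
        return concatenate(units)           # s2: quiet surroundings
    return concatenate(correct_prosody(u) for u in units)   # s4
```

With headphones the units pass through untouched; with a speaker in a noisy place, each unit is run through the prosody corrector before concatenation.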

The device identification information 106 may be, for example, an explicit designation entered by the user indicating whether the acoustic device is a headphone or a speaker, or a signal indicating whether a headphone plug is inserted into the headphone jack of a device such as a personal computer or television. The noise level 107 may be obtained by passing the output of a microphone or the like placed near the speaker's installation location through a suitable integration circuit, yielding a signal averaged to some extent over time.
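The time-averaged noise estimate described above can be approximated digitally by averaging the power of recent microphone samples. The window length and dB reference below are assumptions:

```python
import math

def noise_level_db(samples, window=256):
    """Crude estimate of the ambient noise level 107: mean power of
    the most recent `window` microphone samples, in dB relative to an
    assumed full-scale amplitude of 1.0."""
    if len(samples) < window:
        window = len(samples)
    # Averaging the squared signal approximates the analogue
    # integrator applied to the microphone output in the text.
    power = sum(s * s for s in samples[-window:]) / window
    return 10.0 * math.log10(power + 1e-12)   # floor avoids log10(0)
```

Silence comes out far below any plausible threshold, while a full-scale signal sits near 0 dB, so a threshold such as the one tested in step s3 can be compared directly against this value.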

FIG. 1: Block diagram showing an example embodiment of the speech synthesizer of the present invention.
FIG. 2: Flow chart of the processing in the situation determination unit of FIG. 1.

Explanation of reference numerals

1: database (DB), 2: database search unit (DB search unit), 3: speech unit selection unit, 4: situation determination unit, 5: signal processing unit, 6: connection unit.

Claims (2)

1. A speech synthesis method in which a computer, using a database storing a speech corpus that contains speech units of the units constituting a language, retrieves from the database the speech units usable for synthesis that correspond to the units in an input text, selects the optimal speech unit from among the retrieved speech units, repeats this for all of the units in the text, and connects the selected speech units to produce synthesized speech, wherein the computer:
determines whether the acoustic device that will reproduce the synthesized speech is a headphone or a speaker;
if the device is a headphone, connects the selected speech units as they are to produce the synthesized speech;
if the device is a speaker, further determines whether the ambient noise level at the speaker's installation location is at or below a predetermined threshold;
if the noise level is at or below the threshold, connects the selected speech units as they are to produce the synthesized speech; and
if it exceeds the threshold, applies signal processing that corrects the prosody to each selected speech unit and then connects them to produce the synthesized speech.
2. A speech synthesis apparatus comprising a database storing a speech corpus that contains speech units of the units constituting a language, a database search unit that retrieves from the database the speech units usable for synthesis for each unit in an input text, a speech unit selection unit that selects the optimal speech unit from among the retrieved per-unit speech units, and a connection unit that connects the speech units corresponding to the units in the text to produce synthesized speech, the apparatus further comprising:
a signal processing unit that applies signal processing correcting the prosody to a speech unit; and
a situation determination unit that determines whether the acoustic device that will reproduce the synthesized speech is a headphone or a speaker; if it is a headphone, outputs each speech unit selected by the speech unit selection unit to the connection unit as it is; if it is a speaker, further determines whether the ambient noise level at the speaker's installation location is at or below a predetermined threshold, and outputs each selected speech unit to the connection unit as it is when the level is at or below the threshold, or via the signal processing unit when it exceeds the threshold.
JP2004139861A 2004-05-10 2004-05-10 Speech synthesis method and apparatus Expired - Fee Related JP4297433B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2004139861A JP4297433B2 (en) 2004-05-10 2004-05-10 Speech synthesis method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2004139861A JP4297433B2 (en) 2004-05-10 2004-05-10 Speech synthesis method and apparatus

Publications (2)

Publication Number Publication Date
JP2005321631A JP2005321631A (en) 2005-11-17
JP4297433B2 true JP4297433B2 (en) 2009-07-15

Family

ID=35468968

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2004139861A Expired - Fee Related JP4297433B2 (en) 2004-05-10 2004-05-10 Speech synthesis method and apparatus

Country Status (1)

Country Link
JP (1) JP4297433B2 (en)

Also Published As

Publication number Publication date
JP2005321631A (en) 2005-11-17

Similar Documents

Publication Publication Date Title
US10685638B2 (en) Audio scene apparatus
JP6118838B2 (en) Information processing apparatus, information processing system, information processing method, and information processing program
US8781836B2 (en) Hearing assistance system for providing consistent human speech
JP5750380B2 (en) Speech translation apparatus, speech translation method, and speech translation program
JP2008096483A (en) Sound output control device and sound output control method
JP2000148182A (en) Editing system and method used for transcription of telephone message
WO2018038235A1 (en) Auditory training device, auditory training method, and program
US11727949B2 (en) Methods and apparatus for reducing stuttering
US20050080626A1 (en) Voice output device and method
JP2013072903A (en) Synthesis dictionary creation device and synthesis dictionary creation method
US11367457B2 (en) Method for detecting ambient noise to change the playing voice frequency and sound playing device thereof
WO2011122522A1 (en) Ambient expression selection system, ambient expression selection method, and program
JP2011186143A (en) 2011-09-22 Speech synthesizer, speech synthesis method for learning user's behavior, and program
JP4564416B2 (en) Speech synthesis apparatus and speech synthesis program
JP4297433B2 (en) Speech synthesis method and apparatus
JP3555490B2 (en) Voice conversion system
KR101611224B1 (en) Audio interface
JP5052107B2 (en) Voice reproduction device and voice reproduction method
JP6251219B2 (en) Synthetic dictionary creation device, synthetic dictionary creation method, and synthetic dictionary creation program
JP4817949B2 (en) In-vehicle machine
JP5049310B2 (en) Speech learning / synthesis system and speech learning / synthesis method
JP4758931B2 (en) Speech synthesis apparatus, method, program, and recording medium thereof
JP6353402B2 (en) Acoustic digital watermark system, digital watermark embedding apparatus, digital watermark reading apparatus, method and program thereof
JP2016186646A (en) Voice translation apparatus, voice translation method and voice translation program
JP2007256815A (en) Voice-reproducing apparatus, voice-reproducing method, and voice reproduction program

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20060718

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20090119

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20090122

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20090303

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20090408

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20090410

R150 Certificate of patent or registration of utility model

Ref document number: 4297433

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150


FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20120424

Year of fee payment: 3

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20130424

Year of fee payment: 4

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20140424

Year of fee payment: 5

S531 Written request for registration of change of domicile

Free format text: JAPANESE INTERMEDIATE CODE: R313531

R350 Written notification of registration of transfer

Free format text: JAPANESE INTERMEDIATE CODE: R350

LAPS Cancellation because of no payment of annual fees