JP2005309173A

JP2005309173A - Speech synthesis controller, method thereof and program thereof, and data generating device for speech synthesis

Info

Publication number: JP2005309173A
Application number: JP2004127698A
Authority: JP
Inventors: Kazuhisa Iguchi; 和久井口; Tomoyasu Komori; 智康小森; Hiroyuki Segi; 寛之世木; Shuichi Aoki; 秀一青木; Yoshiaki Shishikui; 善明鹿喰; Yoshihiro Fujita; 欣裕藤田
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2004-04-23
Filing date: 2004-04-23
Publication date: 2005-11-04

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesis controller capable of easily synchronizing synthesized speech with a signal to be synchronized, such as a video signal having a specified display time.

SOLUTION: When text to be read aloud is uttered through speech synthesis, the speech synthesis controller 1 uses an utterance start timing acquisition part 13 to acquire an utterance start position, which indicates where in the text speech synthesis starts, and an utterance start time, which is the time at which utterance from that position starts. A speech synthesis start control part 15 outputs the utterance start timing obtained by the utterance start timing acquisition part 13 to a speech synthesizer 7 to control the start of speech synthesis by the speech synthesizer 7.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a speech synthesis control device that controls a speech synthesizer so that text to be read aloud is synthesized and uttered as synthesized speech at a predetermined utterance timing, to a method and a program therefor, and to a speech synthesis data generation device that generates the speech synthesis data used to control the speech synthesizer.

In recent years, speech synthesizers have been developed that generate synthesized speech from text to be read aloud and output the synthesized speech in synchronization with a moving image.
One known speech synthesizer of this type (a synchronization system between a moving image and a text-to-speech converter) analyzes the lip shape displayed in the moving image and its movement, acquires as lip information the time at which the lips start to move and the duration of the lip movement, estimates the characters in the text that best match the lip shape, and synthesizes the estimated characters so that they fit the lip shape, thereby synchronizing the synthesized speech with the moving image (the movement of the lips) (see, for example, Patent Document 1).

In the synchronization system between a moving image and a text-to-speech converter described in Patent Document 1, the duration of each phoneme during speech synthesis is adjusted to match the lip shape, so that the text and the moving image are synchronized with reference to the lip shape. To this end, the system estimates the lip shape to assign to each phoneme from phoneme-specific articulation characteristics, such as the place and manner of articulation, according to the phoneme string and the phoneme durations. The system then compares the estimated lip shapes with the lip shapes in the moving image (video), as arranged on the time axis according to position information and duration, adjusts the phoneme durations so that the similarity between the two is maximized, generates the synthesized speech, and utters it in synchronization with the moving image.

Another known speech synthesizer of this type (a display control device) synthesizes speech for each piece of text of a predetermined length and moves the mouth of a character on the display screen only for the time required for the synthesized speech (a time within the audiovisually acceptable range), thereby synchronizing the synthesized speech with the moving image (see, for example, Patent Document 2).

The display control device described in Patent Document 2 divides the text into word strings of a predetermined length, using as a criterion a characteristic count that represents the length of the text, such as the number of commas, periods, or words, and counts the length of each word string. It then calculates the word string display time needed to synthesize each word string and utter it at a constant utterance speed. The device outputs the synthesized speech of the word string for the calculated word string display time and displays the moving image (for example, one in which the character's mouth moves) for the same time.

Thus, the synchronization system described in Patent Document 1 synchronizes synthesized speech with a moving image in which lips are displayed by uttering the synthesized speech in time with the lip movement shown in the moving image. The display control device described in Patent Document 2 utters synthesized speech at a constant utterance speed by calculating the time required for the utterance from the length of the text.
Patent Document 1: Japanese Patent Laid-Open No. 10-171486 (paragraphs 0015 to 0030, FIGS. 2 and 3)
Patent Document 2: Japanese Patent Laid-Open No. 5-313686 (paragraphs 0023 to 0048, FIGS. 2, 6 and 7)

However, because the synchronization system described in Patent Document 1 adjusts phoneme durations to match the lip shape and synchronizes the text with the moving image by reference to the lip shape, it cannot synchronize synthesized speech with video content that contains scenes in which no lips are visible. For example, when synthesized speech is to be uttered in synchronization with a silent film, the lips are not always on screen, so the speech cannot be uttered as synthesized speech.

Furthermore, because the display control device described in Patent Document 2 utters the entire text to be synthesized at a constant utterance speed, it cannot change the utterance speed for each arbitrary word or phrase.

The present invention has been made to solve the above problems. Its object is to provide a speech synthesis control device, a method and a program therefor, and a speech synthesis data generation device that can easily synchronize synthesized speech with a synchronization target signal, such as a video signal, having a predetermined display time.

The present invention was devised to achieve the above object. First, the speech synthesis control device according to claim 1 controls a speech synthesizer so that text to be read aloud is synthesized and uttered as synthesized speech, based on a preset utterance start position, which indicates the position in the text at which speech synthesis starts, and a preset utterance start time, which indicates the time at which utterance of the text as synthesized speech starts. The device comprises utterance start timing acquisition means and speech synthesis start control means.

With this configuration, the speech synthesis control device uses the utterance start timing acquisition means to acquire the utterance start position and the utterance start time as the utterance start timing.
The utterance start position designates an arbitrary position in the text, namely the word at which utterance begins in each predetermined utterance range. It may be an utterance start character, that is, an arbitrary character in the text at which utterance begins in each predetermined utterance range. The utterance start character need not be a character of the text itself; it may be, for example, a phonetic character such as a Roman letter or kana generated from the text. The utterance start time is a time measured from a predetermined reference time and designates when utterance of the utterance start character begins.

The speech synthesis control device then uses the speech synthesis start control means to control the start of utterance by the speech synthesizer, so that the synthesizer synthesizes the text from the utterance start position and utters it as synthesized speech from the utterance start time.
In this case, the speech synthesis control device must either include timekeeping means for measuring time or be supplied with a time signal from external timekeeping means. The device outputs the utterance start time to the speech synthesizer so that speech synthesis starts at the utterance start timing determined with reference to the time measured by the timekeeping means.
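As a minimal sketch of the start control described above, the following Python picks the next utterance start timing against a clock value and reports how long to wait before instructing the synthesizer to start. The names `StartTiming` and `next_start` are hypothetical and do not appear in the patent, and passing the current time as a plain number is an assumption that stands in for the timekeeping means.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class StartTiming:
    """One utterance start timing (hypothetical structure)."""
    position: int      # index of the utterance start character in the text
    start_time: float  # seconds measured from the reference time


def next_start(timings: List[StartTiming], now: float) -> Optional[Tuple[StartTiming, float]]:
    """Return the earliest timing not yet passed and the delay until it, or None."""
    pending = [t for t in timings if t.start_time >= now]
    if not pending:
        return None
    nxt = min(pending, key=lambda t: t.start_time)
    return nxt, nxt.start_time - now
```

A caller would wait for the returned delay and then ask the synthesizer to begin synthesis at `position`; how that request is issued depends on the synthesizer and is not specified here.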

The speech synthesis control device according to claim 2 is the device according to claim 1, further comprising utterance speed determination means and utterance speed control means.
With this configuration, the device uses the utterance speed determination means to determine, for each preset utterance range (the range of text to be synthesized, starting at an utterance start position), the utterance speed at which the range is uttered as synthesized speech and the format of that speed. The device uses the utterance speed control means to output the utterance speed determined by the utterance speed determination means to the speech synthesizer, controlling its synthesis so that each utterance range is uttered at the determined speed.
In this case, either the utterance start timing acquisition means or the speech synthesis control means may define the utterance range. The utterance range can be defined as the character string running from a given utterance start character to the character (the utterance end character) immediately preceding the next utterance start character.
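The definition of the utterance range can be illustrated with a short Python sketch that cuts a text at its utterance start characters. The function name `split_utterance_ranges` is hypothetical, positions are assumed to be zero-based character indices, and letting the last range run to the end of the text is an assumption, since the passage only defines ranges that are followed by another start character.

```python
from typing import List


def split_utterance_ranges(text: str, start_positions: List[int]) -> List[str]:
    """Split text into utterance ranges, each beginning at an utterance start character."""
    starts = sorted(set(start_positions))
    if any(p < 0 or p >= len(text) for p in starts):
        raise ValueError("utterance start position outside the text")
    # Each range ends one character before the next start character;
    # the final range is assumed to extend to the end of the text.
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]
```

For example, start positions 0, 3 and 6 in an eight-character text give three ranges of three, three and two characters.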

The speech synthesis control device according to claim 3 is the device according to claim 1, further comprising utterance speed determination means, utterance speed measurement means, speech rate conversion control means, and speech rate conversion means.
With this configuration, the device uses the utterance speed determination means to determine, for each preset utterance range (the range of text to be synthesized, starting at an utterance start position), the utterance speed at which the range is uttered as synthesized speech and the format of that speed.

The device uses the utterance speed measurement means to measure the utterance speed of the synthesized speech uttered by the speech synthesizer. Using the speech rate conversion control means, it compares the utterance speed determined by the utterance speed determination means with the utterance speed measured by the utterance speed measurement means, and outputs a speech rate control signal for converting the utterance speed of the synthesized speech so that it approaches the determined utterance speed.
Using the speech rate conversion means, the device outputs, based on the speech rate control signal from the speech rate conversion control means, synthesized speech whose utterance speed has been converted so as to approach the utterance speed determined by the utterance speed determination means.
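One way to picture the comparison of target and measured speeds is as a playback-speed factor, sketched below in Python. This is an illustration under assumptions, not the patent's method: the name `rate_control_signal`, the use of a simple ratio as the control signal, and the clamping limits of 0.5 and 2.0 are all hypothetical.

```python
def rate_control_signal(target_speed: float, measured_speed: float,
                        min_factor: float = 0.5, max_factor: float = 2.0) -> float:
    """Return a speed-up factor that moves the measured speed toward the target.

    Speeds are in any common unit (for example characters per second).
    A factor above 1.0 means the synthesized speech should be sped up.
    """
    if target_speed <= 0 or measured_speed <= 0:
        raise ValueError("speeds must be positive")
    factor = target_speed / measured_speed
    # Clamp so that a single conversion step stays within assumed limits.
    return max(min_factor, min(max_factor, factor))
```

A speech rate converter would apply the returned factor to the synthesizer's output; when the factor is clamped, the output only approaches the target speed rather than reaching it in one step.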

The speech synthesis control device according to claim 4 is the device according to claim 2 or claim 3, in which the utterance speed determination means determines the format of the utterance speed.
With this configuration, the utterance speed determination means determines the utterance speed in one of the following formats: a format based on the number of characters in the utterance range and the time required to utter the range, where that time is calculated from the utterance start time of the range's utterance start position and the next utterance start time of the next utterance start position; a format based on the number of characters uttered per unit time; a format based on the utterance time required to utter the range; or a format based on a ratio to a reference utterance speed.
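The four formats can be compared by converting each to a common unit. The Python sketch below normalizes them to characters per second; the function names and the choice of characters per second as the common unit are assumptions made for illustration only.

```python
def speed_from_start_times(n_chars: int, start_time: float, next_start_time: float) -> float:
    """Format 1: characters in the range divided by the gap between start times."""
    duration = next_start_time - start_time
    if duration <= 0:
        raise ValueError("next start time must be later than start time")
    return n_chars / duration


def speed_from_chars_per_unit(chars: float, unit_seconds: float) -> float:
    """Format 2: characters uttered per unit time (e.g. per 60 seconds)."""
    if unit_seconds <= 0:
        raise ValueError("unit time must be positive")
    return chars / unit_seconds


def speed_from_duration(n_chars: int, duration: float) -> float:
    """Format 3: time required to utter the range, given its character count."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    return n_chars / duration


def speed_from_ratio(ratio: float, reference_speed: float) -> float:
    """Format 4: ratio to a reference utterance speed."""
    return ratio * reference_speed
```

With a 12-character range that starts at 2.0 s and is followed by a start at 5.0 s, the first format yields 4 characters per second, the same value as a 3-second duration in the third format.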

The speech synthesis data generation device according to claim 5 generates the text to be read aloud, the utterance start position, and the utterance start time corresponding to the utterance start position that are supplied to the speech synthesis control device according to claim 1. The device comprises speech recognition means, utterance start timing acquisition means, and speech synthesis data storage means.

With this configuration, the speech synthesis data generation device uses the speech recognition means to recognize speech and generate text, and uses the utterance start timing acquisition means to acquire, as the utterance start timing, the utterance start position in the generated text and the corresponding utterance start time. The device then uses the speech synthesis data storage means to store the text generated by the speech recognition means and the utterance start timing acquired by the utterance start timing acquisition means as speech synthesis data.

The function of the utterance start timing acquisition means in this speech synthesis data generation device may be included in the speech recognition means. The device may be provided on either the transmitting side or the receiving side. Here, the "transmitting side" is a broadcaster or web provider that supplies content as text and audio together with video, or any of various storage media on which content is stored, and is connected to the receiving side by a wired or wireless data communication network. The "receiving side" outputs the content delivered from the transmitting side over the wired or wireless data communication network, with the synchronization target signal, such as video, and the synthesized speech synchronized.

The speech synthesis control method according to claim 6 controls a speech synthesizer so that text to be read aloud is synthesized and uttered as synthesized speech, based on a preset utterance start position, which indicates the position in the text at which speech synthesis starts, and a preset utterance start time, which indicates the time at which utterance of the text as synthesized speech starts. The method includes an utterance start timing acquisition step and a speech synthesis start control step.

According to this procedure, the utterance start timing acquisition step acquires the utterance start position and the utterance start time as the utterance start timing. The speech synthesis start control step then controls the start of utterance by the speech synthesizer, so that the synthesizer synthesizes the text from the utterance start position and utters it as synthesized speech from the utterance start time.

The speech synthesis control program according to claim 7 causes a computer to function as utterance start timing acquisition means and speech synthesis start control means, in order to control a speech synthesizer so that text to be read aloud is synthesized and uttered as synthesized speech, based on a preset utterance start position, which indicates the position in the text at which speech synthesis starts, and a preset utterance start time, which indicates the time at which utterance of the text as synthesized speech starts.

With this configuration, the speech synthesis control program uses the utterance start timing acquisition means to acquire the utterance start position and the utterance start time as the utterance start timing. It uses the speech synthesis start control means to control the start of utterance by the speech synthesizer, so that the synthesizer synthesizes the text from the utterance start position and utters it as synthesized speech from the utterance start time.

According to the invention of claim 1, claim 6, or claim 7, the utterance of synthesized speech starting at a predetermined utterance start time can easily be synchronized with a synchronization target signal, such as a video signal, having a predetermined display time by aligning the two at an arbitrary time.

According to the invention of claim 2, the speech synthesizer can be made to synthesize the text and utter synthesized speech at a predetermined utterance speed determined for each utterance range. Under the utterance speed control means, which controls the utterance speed, the speech synthesizer switches to the predetermined utterance speed at the utterance start timing sent from the speech synthesis start control means and utters the synthesized speech.

According to the invention of claim 3, the utterance speed can be adjusted to the target speed even when the utterance speed cannot be switched within the speech synthesizer (means). Because the synthesized speech output by a general-purpose speech synthesizer (means) can thus be adjusted to the target utterance speed, the invention is economical for the user, which in turn promotes wider adoption of the speech synthesis control device itself.

According to the invention of claim 4, the utterance speed can be specified and determined in a format that the speech synthesizer can handle. When the format of the utterance speed determined by the utterance speed determination means differs from the format used by the speech synthesizer, the utterance speed control means can convert the determined utterance speed into a format that the synthesizer can handle and control the synthesizer accordingly.

According to the invention of claim 5, based on the speech of a person talking, the audio data can be converted into text by the character recognition means, the utterance start position and utterance start time of the person's speech can be acquired as the utterance start timing, and the text and the utterance start timing can be stored as speech synthesis data. The person's speech is not limited to previously recorded audio data; it may also be speech picked up in real time, as in a live broadcast.

Therefore, the present invention can provide a speech synthesis control device, a method and a program therefor, and a speech synthesis data generation device that can easily synchronize synthesized speech with a synchronization target signal, such as a video signal, having a predetermined display time.

Incidentally, synthesized speech at a given utterance speed could also be synchronized with video by changing the display speed of the video. However, changing the display speed alters how the video looks and, in the case of a film for example, can diminish its value as a work. It is therefore useful to vary the utterance speed, as in the present invention, to synchronize the video with the synthesized speech.

Hereinafter, first to fifth embodiments of the present invention will be described with reference to the drawings.

(First embodiment)
[Configuration of the speech synthesis control device]
FIG. 1 is a block diagram showing the configuration of the speech synthesis control device according to the first embodiment of the present invention. FIG. 2 illustrates the concept of the utterance start character in the speech synthesis control device according to the embodiment of the present invention. FIG. 3 illustrates the concept of the utterance speed in the speech synthesis control device according to the embodiment of the present invention: (a) shows an example in which the utterance speed is specified directly, (b) an example in which it is specified by the time taken to utter a given range, (c) an example in which it is calculated from the utterance start timing, and (d) an example in which it is specified as a ratio to a predetermined standard utterance speed.

First, the configuration of the speech synthesis control device 1 will be described with reference to FIG. 1.
The speech synthesis control device 1 controls the speech synthesizer 7 so that text is synthesized and uttered as synthesized speech at a predetermined utterance speed, based on a preset utterance start position, which indicates the position in the text at which speech synthesis starts, and a preset utterance start time, which indicates the time at which utterance of the text as synthesized speech starts. To this end, it mainly comprises speech synthesis data input means 10, speech synthesis data storage means 11, analysis and separation means 12, utterance start timing acquisition means 13, speech synthesis start control means 14, utterance speed determination means 15, and utterance speed control means 16.

The speech synthesis data input means 10 inputs speech synthesis data into the speech synthesis control device 1. Examples of the speech synthesis data input means 10 include a receiver for terrestrial digital broadcasting, a receiver for cable television broadcasting, various interfaces for connecting to a network such as the Internet, and reading devices for various removable media.

The speech synthesis data packages the utterance start timing, the utterance speed, and the text together (as a stream). The utterance start timing is determined by a preset utterance start position, which indicates the position in the text at which speech synthesis starts, and a preset utterance start time, which indicates the time at which utterance of the text as synthesized speech starts. Alternatively, the speech synthesis data may package only the utterance start timing and the utterance speed, with the text provided separately.
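A package of this kind could be modeled as below. This Python sketch is only an illustration: the patent does not specify a wire format, so the JSON encoding, the class names `RangeEntry` and `SynthesisData`, and the field names are all assumptions. The optional `text` field reflects the variant in which the text is delivered separately.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class RangeEntry:
    start_position: int            # utterance start position in the text
    start_time: float              # utterance start time, seconds from the reference time
    speed: Optional[float] = None  # utterance speed for the range, if specified


@dataclass
class SynthesisData:
    entries: List[RangeEntry]
    text: Optional[str] = None     # None when the text is provided separately


def pack(data: SynthesisData) -> str:
    """Serialize the package to a JSON string (assumed format)."""
    return json.dumps(asdict(data), ensure_ascii=False)


def unpack(payload: str) -> SynthesisData:
    """Rebuild the package from its JSON string."""
    raw = json.loads(payload)
    entries = [RangeEntry(**entry) for entry in raw["entries"]]
    return SynthesisData(entries=entries, text=raw["text"])
```

Packing and then unpacking returns an equal object, whether or not the text is included.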

The utterance start position designates an arbitrary position in the text, namely the word at which utterance begins in each predetermined utterance range. It may be an utterance start character, that is, an arbitrary character in the text at which utterance begins in each predetermined utterance range. The utterance start character may be the beginning of a clause of the text, a phoneme, or a character. It need not be a character of the text itself and may be, for example, a phonetic character such as a Roman letter or kana generated from the text.

When synthesized speech is output, the utterance start character may be a speech unit, the basic unit of the speech that is concatenated. Speech units include, for example, phonemes, syllables, VCV (Vowel-Consonant-Vowel) units, CVC (Consonant-Vowel-Consonant) units, units obtained by clustering, and composite units.
Phonetic characters represent the reading of an ideographic character such as a kanji with at least one character, and form a character string arranged in the time-series order of the text. Therefore, when the text consists of ideographic characters such as kanji, the first phonetic character in the reading of the ideographic character at which utterance is to start may be used as the utterance start character.

In this specification the text is described as a sentence of mixed kanji and kana, but it is not limited to this; it may be, for example, Japanese written only in kanji or only in kana. The language of the text is not limited to Japanese and may be any of various languages such as English, German, or Chinese. For speech synthesis, no conversion is needed when the notation directly represents phonemes, as kana and romaji do; when the characters alone cannot express the pronunciation, as in mixed kanji-kana text, the text must be converted into phonetic characters. The term phonetic characters here covers not only those generated from ideographic characters such as kanji but also romaji and kana notation that directly represents phonemes.
The speech synthesizer 7 may store a dictionary suited to the text to be synthesized and output synthesized speech in accordance with that dictionary.

As described above, the utterance start position designates an arbitrary position in the text and, for each predetermined utterance range, designates the word that serves as the utterance start character with which utterance begins. The concept of the utterance start character in the speech synthesis control device according to the embodiment of the present invention is now described with reference to FIG. 2, using the example text 「よみあげようテキストのはつわそくどをしていするほうほうのれい。」 (written entirely in kana; roughly, "An example of a method for specifying the utterance speed of text to be read aloud.").

For example, in this text the utterance range beginning with 「よ」, utterance start character a₁, is 「よみあげようテキストの」, and its utterance end character is 「の」, the character immediately before 「は」, utterance start character a₂. Similarly, the utterance range beginning with 「は」 (a₂) is 「はつわそくどをしていする」, and its utterance end character is 「る」, the character immediately before 「ほ」, utterance start character a₃. The utterance range beginning with 「ほ」 (a₃) is 「ほうほうのれい」, and its utterance end character is 「い」, the character before the full stop 「。」.

Accordingly, the ordinal position that identifies an utterance start character is defined by the number of characters preceding it. In the example, 「よ」 (a₁) has no preceding character and is therefore 1st. 「は」 (a₂) is preceded by the string 「よみあげようテキストの」, so its position is the length of that string (11 characters) plus 1, that is, 12th. The position of 「ほ」 (a₃) is defined in the same way.
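The position rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `utterance_ranges` and the choice to strip a trailing full stop or comma from each range are hypothetical and do not come from the specification.

```python
TEXT = "よみあげようテキストのはつわそくどをしていするほうほうのれい。"

def utterance_ranges(text, starts):
    """Split text at 1-based utterance start positions (hypothetical helper).

    Returns (start position, start character, end character, range text)
    for each range. Trailing punctuation is dropped from the range text.
    """
    # A sentinel one past the last character closes the final range.
    bounds = list(starts) + [len(text) + 1]
    result = []
    for begin, end in zip(bounds, bounds[1:]):
        chunk = text[begin - 1:end - 1].rstrip("。、")
        result.append((begin, chunk[0], chunk[-1], chunk))
    return result

for row in utterance_ranges(TEXT, [1, 12, 24]):
    print(row)
```

With start positions 1, 12, and 24 the three ranges have 11, 12, and 7 characters, matching the counts used for the example text.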

The way the position of an utterance start character is specified is not limited to counting the characters that precede it; any method that uniquely determines the utterance start character within the text to be read aloud may be used.
For example, the position can be specified by pairing the ordinal number of an utterance start character with the number of characters up to the next utterance start character. In the example of FIG. 2, specifying "1st: 11 characters" indicates that the first utterance start character is 「よ」 (a₁) and that the next is 「は」 (a₂). Specifying "2nd: 12 characters" indicates that the next utterance start character is 「ほ」 (a₃). If it is agreed in advance that the entries are transmitted in order from the first, the ordinal numbers can be omitted and the utterance start characters specified by the character counts alone, such as "11 characters", "12 characters".
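As a rough sketch of the count-only form, the start positions can be recovered by a running sum of the transmitted character counts. The function name `starts_from_counts` is hypothetical, and the sketch assumes the counts arrive in order from the first range.

```python
from itertools import accumulate

TEXT = "よみあげようテキストのはつわそくどをしていするほうほうのれい。"

def starts_from_counts(counts):
    """Turn per-range character counts into 1-based start positions.

    counts[i] is the number of characters from the i-th utterance start
    character up to, but not including, the next utterance start character.
    """
    return [1] + [1 + total for total in accumulate(counts)]

starts = starts_from_counts([11, 12])
print(starts, [TEXT[p - 1] for p in starts])
```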

The full stop 「。」 signals the end of an utterance that was started from any utterance start character. The comma 「、」 can likewise be used as a signal for the end of such an utterance. Full stops and commas may or may not be counted as characters when determining positions. However, although they are not uttered, they serve as pauses that separate sentences and passages, so it is preferable to count them when calculating the utterance speed.

The designation is not limited to the utterance start characters a₁, a₂, a₃ themselves. For example, the characters one position later (b₁ 「み」, b₂ 「つ」, b₃ 「う」) or one position earlier (c₁, no character; c₂ 「の」; c₃ 「る」) may be designated, and the utterance start timing acquisition means 13 may interpret the designation to obtain the utterance start characters a₁, a₂, a₃.
The designated character need not be exactly one position before or after; so that the user can choose freely when setting utterance start characters, it may be any number of characters before or after.
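One way to picture this indirect designation is as a position plus a signed offset. The sketch below assumes a hypothetical convention in which the offset is positive when the designated character follows the utterance start character and negative when it precedes it; position 0 stands for "no character" before the start of the text.

```python
TEXT = "よみあげようテキストのはつわそくどをしていするほうほうのれい。"

def resolve_start(designated_pos, offset):
    """Recover the 1-based utterance start position.

    designated_pos: 1-based position of the designated character
                    (0 means "no character", i.e. before the text starts).
    offset: +k if the designated character is k characters after the
            utterance start character, -k if it is k characters before.
    """
    return designated_pos - offset

# b2 「つ」 (position 13, one after) and c2 「の」 (position 11, one before)
# both resolve to the same utterance start character.
print(TEXT[resolve_start(13, +1) - 1], TEXT[resolve_start(11, -1) - 1])
```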

The utterance start time is a time measured from a predetermined reference time and specifies when utterance of the utterance start character begins. It can be expressed, for example, as a time relative to the beginning of the sentence, a time relative to the previous utterance start time, a time relative to the beginning of the content, or a time based on Greenwich Mean Time. It is particularly preferable to start timing from the utterance start time of the first utterance start character of the text and to specify the utterance start times of the subsequent utterance start characters relative to it (time relative to the beginning of the sentence). A time based on Greenwich Mean Time can be used, for example, when predetermined content is broadcast at a predetermined time.
The utterance start timing may also designate only the utterance start character as the utterance start position. In that case, however, the utterance start times must be stored in the speech synthesis control device 1 as prescribed values, one for each utterance start character in the order of utterance.
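The different time references can be converted into one another. The following sketch, with hypothetical function names and times held as integer milliseconds to avoid rounding, converts times given relative to the previous utterance start into times relative to the beginning of the sentence, and then into absolute UTC times for a broadcast starting at an assumed instant.

```python
from datetime import datetime, timedelta, timezone
from itertools import accumulate

def from_previous_to_head(deltas_ms):
    """Times relative to the previous start -> times relative to the first start."""
    return list(accumulate(deltas_ms))

def to_absolute(head_relative_ms, first_start_utc):
    """Times relative to the first start -> absolute UTC datetimes."""
    return [first_start_utc + timedelta(milliseconds=ms) for ms in head_relative_ms]

head_relative = from_previous_to_head([0, 2200, 2000])
# The broadcast start instant below is an arbitrary illustrative value.
on_air = datetime(2004, 4, 23, 12, 0, 0, tzinfo=timezone.utc)
print(head_relative)
print(to_absolute(head_relative, on_air))
```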

The utterance speed is determined by the relationship between the time required to utter a preset utterance range, which indicates the extent of the text to be synthesized starting from the utterance start position, and the number of characters in that range.
As stated above, the utterance speed is determined by the relationship between the number of uttered characters and time (hereinafter, the "character count-time relationship"), but it may be specified in any way that can be calculated from or converted into that relationship. For example, it may be specified by the number of phonemes or words uttered within a fixed time, by the time required to output the synthesized speech of an utterance range, by calculation from the interval between utterance start timings, or as a ratio to the standard utterance speed of the speech synthesizer 7.

The concept of the utterance speed is now described with reference to FIG. 3, again using the example text 「よみあげようテキストのはつわそくどをしていするほうほうのれい。」.
(a) "Example of specifying by utterance speed" sets the utterance speed of each utterance range as the number of characters uttered per unit time. In the example, the range 「よみあげようテキストの」 (from utterance start character 「よ」 to utterance end character 「の」) is set to 5 characters per second, the range 「はつわそくどをしていする」 (from 「は」 to 「る」) to 6 characters per second, and the range 「ほうほうのれい」 (from 「ほ」 to 「い」) to 5 characters per second.

(b) "Example of specifying by the time taken to utter a range" specifies the utterance speed of each utterance range by the time required to utter it. In the example, the range 「よみあげようテキストの」 (from 「よ」 to 「の」) is set to 2.2 seconds (first), the range 「はつわそくどをしていする」 (from 「は」 to 「る」) to 2 seconds (second), and the range 「ほうほうのれい」 (from 「ほ」 to 「い」) to 1.4 seconds (third).

(c) "Example of calculating from the utterance start timing" specifies the utterance speed of each utterance range from the relationship between the number of characters in the range and the time between its utterance start time and utterance end time (number of characters ÷ time). In this case the utterance speed is specified by the utterance start characters and utterance start times described above. In the example, utterance starts with the utterance start time of 「よ」 as the reference, 00.00 seconds; the utterance start time of 「は」 is 02.20 seconds and that of 「ほ」 is 04.20 seconds.

(d) "Example of specifying by a ratio to a predetermined standard utterance speed" specifies the utterance speed of each utterance range as a ratio to a reference utterance speed. In the example, the range 「よみあげようテキストの」 (from 「よ」 to 「の」) is set to 1.0× speed, the range 「はつわそくどをしていする」 (from 「は」 to 「る」) to 1.2× speed, and the range 「ほうほうのれい」 (from 「ほ」 to 「い」) to 1.0× speed. Here 1.0× speed is the reference utterance speed. If the reference is, for example, 5 characters per second, 1.2× speed corresponds to 6 characters per second.
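The four forms (a) to (d) all reduce to the same character count-time relationship. The sketch below normalizes each form to characters per second using the figures from the example; the function names are hypothetical, the reference speed of 5 characters per second is the illustrative value mentioned in (d), and the character counts 11, 12, and 7 exclude the final full stop.

```python
def speed_from_rate(chars_per_sec):
    """(a) Already given in characters per second."""
    return float(chars_per_sec)

def speed_from_duration(n_chars, seconds):
    """(b) Given the time allotted to the whole range."""
    return n_chars / seconds

def speed_from_start_times(n_chars, start, next_start):
    """(c) Given this range's start time and the next range's start time."""
    return n_chars / (next_start - start)

def speed_from_ratio(ratio, base_chars_per_sec=5.0):
    """(d) Given a multiple of a reference speed (5 chars/s assumed here)."""
    return ratio * base_chars_per_sec

# Second utterance range of the example, expressed in each of the four forms.
print(speed_from_rate(6),
      speed_from_duration(12, 2.0),
      speed_from_start_times(12, 2.2, 4.2),
      speed_from_ratio(1.2))
```

Floating-point results may differ from the nominal value in the last digit, so a caller comparing speeds would likely want a tolerance rather than exact equality.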

Besides using the text to be read aloud as is, predetermined delimiters (delimiter strings) that mark the utterance start characters may be inserted into the text to specify any or all of the utterance start character, the utterance start time, and the utterance speed.
The case of specifying them with XML (eXtensible Markup Language) is described here.

[Description example 1 in XML]
<s time="00:00" speed="1.0">よみあげようテキストの</s> ... (line 1)
<s time="2.2sec" speed="1.2">はつわそくどを</s> ... (line 2)
<s time="4.2sec" speed="1.0">していするほうほうのれい。</s> ... (line 3)

Here the utterance range is specified by the <s> tag, the utterance start time of the range by the time attribute, and the utterance speed of the range by the speed attribute.
As lines 1 to 3 show, the example text 「よみあげようテキストのはつわそくどをしていするほうほうのれい。」 is divided into utterance ranges enclosed in <s> tags, and an utterance start time and an utterance speed are specified on each line.
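A minimal sketch of reading such a description with Python's standard `xml.etree.ElementTree` follows. Several things here are assumptions rather than part of the specification: the three `<s>` elements are wrapped in a hypothetical `<speech>` root so that the input is a single well-formed document, attribute values use straight quotes, "00:00" is read as minutes:seconds, and the helper names `parse_time` and `parse_ranges` are illustrative.

```python
import xml.etree.ElementTree as ET

# The <speech> root element is an assumption added for well-formedness.
DOC = """<speech>
<s time="00:00" speed="1.0">よみあげようテキストの</s>
<s time="2.2sec" speed="1.2">はつわそくどを</s>
<s time="4.2sec" speed="1.0">していするほうほうのれい。</s>
</speech>"""

def parse_time(value):
    """Convert 'N.Nsec' or 'MM:SS' (assumed meaning) into seconds."""
    if value.endswith("sec"):
        return float(value[:-3])
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + float(seconds)

def parse_ranges(xml_text):
    """Return (range text, start time in seconds, speed ratio) per <s> element."""
    root = ET.fromstring(xml_text)
    return [(s.text, parse_time(s.get("time")), float(s.get("speed")))
            for s in root.iter("s")]

for entry in parse_ranges(DOC):
    print(entry)
```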

The grammar, such as tag and attribute names and notation, is not limited to this example as long as the utterance start time and utterance speed can be specified.
Nor does the language have to be XML; any delimiter (delimiter string) may be used as long as it can be distinguished from the text to be read aloud.

The utterance start time or the utterance speed may also be omitted, provided the behavior in the omitted case is defined in advance.
For example, it may be defined in advance that when the utterance start time is omitted, utterance starts immediately after output of the preceding utterance range ends, and that when the utterance speed is omitted, the utterance speed of the preceding utterance range is used (1× speed for the first utterance range). An XML description example for this case follows.

[Description example 2 in XML]
<s time="00:00" speed="1.0">よみあげようテキストの</s> ... (line 1)
<s speed="1.2">はつわそくどを</s> ... (line 2)
<s time="4.2sec">していするほうほうのれい。</s> ... (line 3)

In this case, the utterance start time of the utterance range 「はつわそくどを」 on line 2 is the time immediately after output of the range 「よみあげようテキストの」 on the preceding line 1 ends, and the utterance speed of the range 「していするほうほうのれい」 on line 3 is 1.2× speed, the same as the range 「はつわそくどを」 on the preceding line 2.
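The default rules can be sketched as a single pass that carries the previous range's end time and speed forward. This is an illustration only: the function name `apply_defaults`, the use of `None` for an omitted value, and the reference speed of 5 characters per second (needed to estimate when the previous range ends) are assumptions.

```python
def apply_defaults(ranges, base_chars_per_sec=5.0):
    """Fill in omitted start times and speed ratios.

    ranges: list of (text, start time in seconds or None, speed ratio or None).
    An omitted time becomes the estimated end of the previous range; an
    omitted speed becomes the previous range's speed (1.0 for the first).
    """
    resolved = []
    prev_end, prev_speed = 0.0, 1.0
    for text, time, speed in ranges:
        speed = prev_speed if speed is None else speed
        time = prev_end if time is None else time
        resolved.append((text, time, speed))
        # Estimated end of this range at the resolved speed.
        prev_end = time + len(text) / (speed * base_chars_per_sec)
        prev_speed = speed
    return resolved

example2 = [
    ("よみあげようテキストの", 0.0, 1.0),
    ("はつわそくどを", None, 1.2),
    ("していするほうほうのれい。", 4.2, None),
]
for entry in apply_defaults(example2):
    print(entry)
```

Under these assumptions line 2 starts about 2.2 seconds in (11 characters at 5 characters per second) and line 3 inherits the 1.2× speed.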

Returning to FIG. 1, the speech synthesis data storage means 11 stores the speech synthesis data input by the speech synthesis data input means 10. The speech synthesis data storage means 11 is, for example, a hard disk or a semiconductor device.

The analysis/separation means 12 reads the speech synthesis data from the speech synthesis data storage means 11, analyzes it, and separates it into the text, the utterance start timing, and the utterance speed. It sends the text to the speech synthesizer 7, the utterance start timing to the utterance start timing acquisition means 13, and the utterance speed to the utterance speed determination means 15.
To distinguish the text, the utterance start timing, and the utterance speed, flags may, for example, be attached before and after each item of data so that the analysis/separation means 12 can tell them apart by recognizing the flags. The means is not limited to this; it may instead recognize the characteristic parts that specify the utterance start timing and the utterance speed. When the data is described in XML, the tags described above can serve as the flags.

The utterance start timing acquisition means 13 acquires the utterance start position and the utterance start time described above as the utterance start timing. Here the utterance start timing acquisition means 13 is described as determining, by referring to the preset rules explained with reference to FIG. 2, the utterance start character that marks the utterance end position, but this may instead be done by the speech synthesis start control means 14 or by a separate, independent means.

The speech synthesis start control means 14 outputs an utterance start signal to the speech synthesizer 7 so that utterance begins at the utterance start time of the utterance start timing acquired by the utterance start timing acquisition means 13, thereby controlling the start of speech synthesis by the speech synthesizer 7. In addition to outputting this signal, the speech synthesis start control means 14 may count the number of characters up to the next utterance start character, determine the utterance end character, and output the utterance end time of the utterance end character to the speech synthesizer 7 as a signal.
The utterance speed determination means 15 acquires, via the analysis/separation means 12, the utterance speed for each utterance start timing acquired by the utterance start timing acquisition means 13, and determines the acquired utterance speed as the utterance speed for that utterance start timing.

As described with reference to FIG. 3, the utterance speed control means 16 outputs the utterance speed determined by the utterance speed determination means 15 to the speech synthesizer 7 and controls speech synthesis by the speech synthesizer 7 so that the utterance speed determined for each utterance start timing is achieved.
In the case described with reference to FIG. 3(b), when the speech synthesizer 7 is designed to have its utterance speed set in characters per second, the utterance speed control means 16 calculates the utterance speed for the first range as 11 characters ÷ 2.2 seconds and sends the result, 5 characters per second, to the speech synthesizer 7 as the utterance speed. For the second range it calculates 12 characters ÷ 2 seconds and sends 6 characters per second. For the third range it calculates 7 characters ÷ 1.4 seconds and sends 5 characters per second.

When the speech synthesizer 7 is designed to have its utterance speed set as a time, the utterance speed control means 16 sends the times specified for the first to third ranges to the speech synthesizer 7; the speech synthesizer 7 calculates the utterance speed in characters per second as described above and outputs synthesized speech whose utterance speed is adjusted accordingly for each utterance range.
In the case described with reference to FIG. 3(c), when the speech synthesizer 7 is designed to have its utterance speed set by utterance start timings, the utterance speed control means 16 sends each utterance start timing to the speech synthesizer 7; the speech synthesizer 7 calculates the utterance speed in characters per second as described above and outputs synthesized speech whose utterance speed is adjusted accordingly for each utterance range.

The utterance range 「よみあげようテキストの」 (from 「よ」 to 「の」) must be finished within the first 2.2 seconds, because the utterance start time of the next utterance start character 「は」 is 02.20 seconds. The utterance speed control means 16 therefore calculates 11 characters ÷ 2.2 seconds and sends the result, 5 characters per second, to the speech synthesizer 7 as the utterance speed.
Next, the utterance range 「はつわそくどをしていする」 (from 「は」 to 「る」) must be finished within the 2 seconds given by (04.20 - 02.20), because the utterance start time of the next utterance start character 「ほ」 is 04.20 seconds. The utterance speed control means 16 therefore calculates 12 characters ÷ 2 seconds and sends the result, 6 characters per second, to the speech synthesizer 7 as the utterance speed.

If further text follows this example, the utterance speed of the range 「ほうほうのれい」 (from 「ほ」 to 「い」) can be determined from the utterance start time of the utterance start character of the next sentence. If no text follows, however, utterance ends with this example, so specifying only the utterance start characters and utterance start times does not determine the utterance speed of the range 「ほうほうのれい」.

In that case the speed can be determined by, for example, treating a further sentence as virtually present after the range 「ほうほうのれい」 and specifying an utterance start character 「 」 (a space) for it together with an utterance start time. The character designated as this virtual utterance start character is not limited to a space; any character or symbol may be assigned. A flag indicating that the entry is virtual may also be added to the data.
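The effect of the virtual utterance start character can be sketched as follows: with one extra (position, time) mark after the last range, every range, including the final one, gets a speed from consecutive marks. The function name `speeds_from_timings` is hypothetical, times are integer milliseconds, and the virtual start time of 5600 ms is an assumed value chosen to match the 1.4 seconds given for the third range in FIG. 3(b).

```python
def speeds_from_timings(marks):
    """Characters per second for each range between consecutive marks.

    marks: list of (1-based start position, start time in ms). The last
    mark is the virtual utterance start character placed after the final
    range, so it contributes an end time but no range of its own.
    """
    return [(p2 - p1) * 1000 / (t2 - t1)
            for (p1, t1), (p2, t2) in zip(marks, marks[1:])]

# よ at 1, は at 12, ほ at 24, virtual start character at 31 (after い).
marks = [(1, 0), (12, 2200), (24, 4200), (31, 5600)]
print(speeds_from_timings(marks))
```

Without the final mark the list would contain only the first two speeds, which is exactly the gap the virtual entry closes.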

Alternatively, the utterance speed to be used when none is specified may be defined in advance. For example, if it is defined that an utterance range with no specified speed continues at the speed of the preceding range, and that the first range uses 1.0× speed, no utterance start character for a virtual next sentence is needed.

In the case described with reference to FIG. 3(d), when the speech synthesizer 7 is designed to have its utterance speed set as a multiple of a reference speed, the utterance speed control means 16 sends the multiple to the speech synthesizer 7; the speech synthesizer 7 multiplies the reference utterance speed by it to obtain the number of characters per second and outputs synthesized speech whose utterance speed is adjusted accordingly for each utterance range. When the speech synthesizer 7 is instead designed to have its utterance speed set in characters per second, the utterance speed control means 16 multiplies the reference utterance speed by the multiple to obtain the number of characters per second and sends the result to the speech synthesizer 7, which outputs synthesized speech whose utterance speed is adjusted accordingly for each utterance range.

The example here used text to be read aloud consisting only of characters, but the utterance speed can be specified in the same way for auxiliary information for speech synthesis, such as phoneme information, and for data specialized for speech synthesis that contains no characters. The utterance speed may be specified for any part of the text. For example, by specifying an utterance speed for each phrase of the text, or the time to be spent uttering each phrase, fast-spoken and slowly spoken parts can be created within a single sentence. For parts with no specified utterance speed, the utterance speed determination means 15 determines the speed by a predetermined method, such as applying the previous utterance speed or the standard utterance speed of the speech synthesizer 7.

[Main functional configuration of the speech synthesizer, video display device, and synchronizer]
The speech synthesizer 7 performs word analysis on the text to generate a phonetic word string annotated with readings, pauses, and accents, generates a duration and a fundamental frequency pattern for each phoneme of the phonetic word string, and outputs, as an electrical signal, synthesized speech generated by referring to a speech waveform database on the basis of the fundamental frequency pattern.
In the word analysis, the speech synthesizer 7 refers to a dictionary to analyze the text into words and determine the "reading" of the word string, analyzes the dependency structure of the word string to determine its "pauses", generates the phonetic word string corresponding to the word string from the readings and pauses, and determines the "accents" from the phonetic word string. The synthesized speech output as an electrical signal by the speech synthesizer 7 is output from a speaker (not shown).

The speech synthesizer 7 in particular includes an utterance start timing setting means and an utterance speed setting means (neither shown). The utterance start timing setting means sets, for speech synthesis, the utterance start timing sent from the speech synthesis start control means 14: it sets the utterance start character in the text sent from the analysis/separation means 12 and sets the utterance start time of that character. The utterance speed setting means sets the utterance speed, sent from the utterance speed control means 16, of each utterance range based on the utterance start timing.

The video display device 8 displays the video that is to be synchronized with the synthesized speech output from the speech synthesizer 7. Examples of the video display device 8 include a digital broadcast television receiver capable of handling text supplied as a digital signal, a personal computer, a projector, and a mobile phone with a video display function. Here it is assumed to be a television receiver, specifically one for digital terrestrial broadcasting, which has recently begun. Although its configuration is not illustrated, the video display device 8 mainly includes receiving means for receiving a broadcast wave, synchronization separation means for separating the video signal and the synchronization signal, horizontal synchronization separation means for separating the horizontal synchronization signal from the synchronization signal, vertical synchronization separation means for separating the vertical synchronization signal from the synchronization signal, and display means for displaying the video signal on a display screen in synchronization with the horizontal and vertical synchronization signals.

The synchronizer 9 controls the output timing of the speech synthesizer 7 and the video display device 8 so that the synthesized speech and the video are output in synchronization. The synchronizer 9 includes timekeeping means (not shown) that keeps an arbitrary time, and supplies the time kept by the timekeeping means to the speech synthesizer 7 and the video display device 8.

The speech synthesizer 7 outputs synthesized speech under control based on the utterance start timing from the speech synthesis control device 1, whereas the video display device 8 outputs video on the basis of the horizontal and vertical synchronization signals. Therefore, by aligning the utterance start time of the utterance start timing and the output times of the horizontal and vertical synchronization signals with the time kept by the timekeeping means (not shown) of the synchronizer 9, the synthesized speech output from the speech synthesizer 7 and the video signal output from the video display device 8 can be synchronized.
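The idea of timing both outputs against one shared clock can be illustrated with a minimal Python sketch. The class names `SyncClock` and `Scheduler`, the event labels, and the use of seconds are assumptions made for illustration only; the patent does not specify how the synchronizer 9 is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class SyncClock:
    """Hypothetical stand-in for the timekeeping means of the synchronizer 9."""
    now: float = 0.0  # seconds on the shared time base

    def advance(self, dt: float) -> float:
        self.now += dt
        return self.now


@dataclass
class Scheduler:
    """Releases events whose start time has been reached on the shared clock."""
    clock: SyncClock
    pending: list = field(default_factory=list)  # (start_time, label) pairs

    def schedule(self, start_time: float, label: str) -> None:
        self.pending.append((start_time, label))
        self.pending.sort()

    def due(self) -> list:
        """Return and remove every event whose start time is <= clock.now."""
        fired = [e for e in self.pending if e[0] <= self.clock.now]
        self.pending = [e for e in self.pending if e[0] > self.clock.now]
        return fired


clock = SyncClock()
sched = Scheduler(clock)
sched.schedule(2.0, "speech: utterance 1")
sched.schedule(2.0, "video: scene 1")
sched.schedule(5.0, "speech: utterance 2")
clock.advance(2.0)
first = sched.due()  # both events timed for t = 2.0 are released together
```

Because speech and video events are compared against the same `clock.now`, events given the same start time are released in the same pass, which is the property the paragraph above relies on.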

[Operation of the speech synthesis control device]
Next, the operation of the speech synthesis control device according to the first embodiment will be described with reference to FIG. 4 (and to FIGS. 1 to 3 as appropriate).
FIG. 4 is a flowchart showing the operation of the speech synthesis control device according to the first embodiment of the present invention.
<Speech synthesis data reception, input, and storage step>
In the speech synthesis control device 1, when the speech synthesis data input means 10 receives speech synthesis data from outside, the speech synthesis data storage means 11 stores the speech synthesis data (step SA1).
<Speech synthesis data acquisition and analysis step>
The analysis/separation means 12 reads the speech synthesis data from the speech synthesis data storage means 11 and analyzes it (step SA2).

<Speech synthesis data separation step>
The analysis/separation means 12 separates the text, the utterance start timing, and the speech rate from the speech synthesis data (step SA3).
The analysis/separation means 12 determines whether the read speech synthesis data contains text. If it does (Yes in step SA31), the analysis/separation means 12 extracts the text (step SA34), outputs the text to the speech synthesizer 7 (step SA8), and moves the process to step SA9.

If the read speech synthesis data is not text (No in step SA31), the analysis/separation means 12 determines whether the separated speech synthesis data is an utterance start timing. If it is (Yes in step SA32), the analysis/separation means 12 extracts it as the utterance start timing (step SA35) and moves the process to step SA4.
If the separated speech synthesis data is not an utterance start timing (No in step SA32), the analysis/separation means 12 determines whether it is a speech rate. If it is (Yes in step SA33), the analysis/separation means 12 acquires it as the speech rate (step SA36) and moves the process to step SA6.

If the separated speech synthesis data is not a speech rate either (No in step SA33), the analysis/separation means 12 judges it to be other data accompanying the speech synthesis data and moves the process to step SA9. One example of such other data is a video signal. When the speech synthesis data and the video signal are carried in a single stream, the speech synthesis control device 1 may receive that stream first, separate the video signal with the analysis/separation means 12, and send the video signal (including the horizontal and vertical synchronization signals) to the video display device 8.
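The branching of steps SA31 to SA33 can be sketched as a simple dispatch on a per-record flag. This is only an illustrative sketch: the flag values `"T"`, `"S"`, and `"R"`, the `(flag, payload)` record layout, and the names `Kind` and `separate` are assumptions, since the text says only that the data types are distinguished by a flag.

```python
from enum import Enum


class Kind(Enum):
    TEXT = "text"
    START_TIMING = "start_timing"
    SPEECH_RATE = "speech_rate"
    OTHER = "other"  # e.g. a video signal carried in the same stream


def separate(records):
    """Route each (flag, payload) record the way steps SA31-SA33 branch."""
    routed = {kind: [] for kind in Kind}
    for flag, payload in records:
        if flag == "T":      # SA31 Yes: text, goes to the speech synthesizer
            routed[Kind.TEXT].append(payload)
        elif flag == "S":    # SA32 Yes: utterance start timing
            routed[Kind.START_TIMING].append(payload)
        elif flag == "R":    # SA33 Yes: speech rate
            routed[Kind.SPEECH_RATE].append(payload)
        else:                # SA33 No: other accompanying data
            routed[Kind.OTHER].append(payload)
    return routed


stream = [
    ("S", {"position": 0, "time": 12.0}),
    ("R", 1.2),
    ("T", "こんにちは"),
    ("V", b"\x00\x01"),
]
routed = separate(stream)
```

Each list in `routed` corresponds to one destination in the flowchart: text for the speech synthesizer, start timings for the acquisition means 13, speech rates for the determination means 15, and everything else passed over.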

<Utterance start timing acquisition step>
The utterance start timing acquisition means 13 then acquires the utterance start position and the utterance start time that define the utterance start character, which together constitute the utterance start timing extracted by the analysis/separation means 12, and passes the acquired utterance start timing to the speech synthesis start control means 14 (step SA4).
<Speech synthesis start control step>
The speech synthesis start control means 14 instructs the speech synthesizer 7 on the timing for starting the utterance, so that the speech synthesizer 7 begins uttering the synthesized speech at the utterance start time of the utterance start timing passed from the utterance start timing acquisition means 13 (steps SA5 and SA8), and moves the process to step SA9.

<Speech rate determination step>
The speech rate determination means 15 acquires the speech rate extracted by the analysis/separation means 12 and determines whether it is in a format that can be set in the speech synthesizer 7. If it is, the acquired speech rate is adopted as the speech rate of the synthesized speech output from the speech synthesizer 7. If it is not, the acquired speech rate is converted into a settable format and the converted value is adopted as the speech rate of the synthesized speech output from the speech synthesizer 7 (step SA6).

As an example of a format that cannot be set, suppose the speech synthesizer 7 handles the speech rate as "a ratio to a predetermined standard speech rate (speed multiplier)", while the speech rate acquired via the analysis/separation means 12 is "specified by the time taken to utter a certain range". In that case the speech rate determination means 15 converts the time-based specification into the ratio-based specification and adopts the result as the speech rate. For this purpose, conversion rules between speech rate formats must be registered in a storage unit (not shown) that the speech rate determination means 15 can refer to.
<Speech rate control step>
The speech rate control means 16 times the sending of the speech rate determined by the speech rate determination means 15, outputs the speech rate to the speech synthesizer 7 (steps SA7 and SA8), and moves the process to step SA9.
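The format conversion described for the speech rate determination step can be illustrated with a short calculation. The sketch below assumes that the standard speech rate is known in units (for example morae) per second and that the number of units in the range is available; neither the function name `duration_to_ratio` nor these parameters come from the patent, which states only that a conversion rule must be registered.

```python
def duration_to_ratio(unit_count: int, target_seconds: float,
                      standard_units_per_second: float) -> float:
    """Convert "utter this range in target_seconds" into a speed multiplier.

    The multiplier is the time the range would take at the standard rate
    divided by the time it is allowed to take.
    """
    if target_seconds <= 0 or standard_units_per_second <= 0:
        raise ValueError("durations and rates must be positive")
    standard_seconds = unit_count / standard_units_per_second
    return standard_seconds / target_seconds


# 40 units at an assumed standard rate of 8 units/s take 5.0 s;
# fitting them into 4.0 s therefore needs a multiplier of 5.0 / 4.0.
ratio = duration_to_ratio(unit_count=40, target_seconds=4.0,
                          standard_units_per_second=8.0)
```

A multiplier above 1 means the range must be spoken faster than the standard rate, and a multiplier below 1 means slower.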

<Speech synthesis data presence check step>
If speech synthesis data to be output remains in the speech synthesis data storage means 11, the analysis/separation means 12 returns the process to step SA2 (Yes in step SA9). If no such data remains, the process ends (No in step SA9).

According to the first embodiment, text can be uttered as synthesized speech so that the utterance from each utterance start position, beginning at its predetermined utterance start time, is spoken at a predetermined speech rate. This has the effect that a synchronization target signal, such as a video signal with a predetermined display time, and the synthesized speech can easily be synchronized on the basis of the utterance start timing. Consequently, without relying on character information such as lip movements, synthesized speech including fast and slow talking can be uttered in synchronization with any time point of video content such as movies (for example, movies with foreign-language audio, so-called dubbed movies, and silent movies), videos, and television broadcasts. For audio content such as radio broadcasts as well, synthesized speech with a variable speech rate can be uttered at any time.

(Second embodiment)
The following description of the second embodiment of the present invention focuses on the components specific to it and their operation. Components identical to those of the first embodiment are given the same reference numerals, and their description and the description of their operation are omitted.
[Configuration of the speech synthesis control device]
FIG. 5 is a block diagram showing the configuration of the speech synthesis control device according to the second embodiment of the present invention.
The speech synthesis control device 2 controls the speech synthesizer 7 so that text is synthesized and uttered as synthesized speech at a constant speech rate, on the basis of a preset utterance start position, which indicates where speech synthesis of the text to be read aloud starts, and a preset utterance start time, which indicates when the text starts to be uttered as synthesized speech. To this end it mainly includes speech synthesis data input means 10, speech synthesis data storage means 11, analysis/separation means 22, utterance start timing acquisition means 13, and speech synthesis start control means 14.

In this embodiment, the speech synthesis data consists of text and utterance start timing and does not include a speech rate. The speech synthesis control device 2 controls the speech synthesizer 7. As in the first embodiment, the speech synthesizer 7 is synchronized with the video display device 8 by the synchronizer 9.
The analysis/separation means 22 reads the speech synthesis data from the speech synthesis data storage means 11, analyzes it, separates it into text and utterance start timing, sends the text to the speech synthesizer 7, and sends the utterance start timing to the utterance start timing acquisition means 13. As described above, the text and the utterance start timing can be distinguished by a flag.

[Operation of the speech synthesis control device]
FIG. 6 is a flowchart showing the operation of the speech synthesis control device according to the second embodiment of the present invention. Steps identical to those in the flowchart of the first embodiment shown in FIG. 4 are given the same step numbers, and their description is omitted.
The operation of the speech synthesis control device 2 differs from that of the speech synthesis control device 1 mainly in the speech synthesis data separation step.

<Speech synthesis data separation step>
The analysis/separation means 22 separates the text and the utterance start timing from the speech synthesis data (step SB3). Because the speech synthesis data does not include a speech rate, steps SA33 and SA36 are absent, which is the main difference from the first embodiment.
As in the first embodiment, the analysis/separation means 22 determines whether the separated speech synthesis data is text. If it is (Yes in step SA31), the analysis/separation means 22 extracts it as text (step SA34), outputs the text to the speech synthesizer 7 (step SA8), and moves the process to step SA9.

Similarly, if the separated speech synthesis data is not text (No in step SA31), the analysis/separation means 22 determines whether it is an utterance start timing. If it is (Yes in step SA32), the analysis/separation means 22 extracts it as the utterance start timing (step SA35) and moves the process to step SA4.
If the separated speech synthesis data is not an utterance start timing (No in step SA32), the analysis/separation means 22 judges it to be the other data accompanying the speech synthesis data described above and moves the process to step SA9.

According to the second embodiment, the speech synthesis control device 2 controls the speech synthesis of the speech synthesizer 7 on the basis of the utterance start timing, so text can be uttered as synthesized speech at a constant speech rate from each utterance start position at its predetermined utterance start time.

(Third embodiment)
The following description of the third embodiment of the present invention focuses on the components specific to it and their operation. Components identical to those of the first embodiment are given the same reference numerals, and their description and the description of their operation are omitted. The third embodiment integrates the speech synthesis control device 1 and the speech synthesizer 7 of the first embodiment, building the speech synthesis control function into the speech synthesizer 3. In the following, means for realizing the speech synthesis control function that are identical to those described in the first embodiment are given the same reference numerals, and their description is omitted.

[Configuration of the speech synthesizer]
FIG. 7 is a block diagram showing the configuration of the speech synthesizer according to the third embodiment of the present invention.
The speech synthesizer 3 synthesizes text and utters it as synthesized speech on the basis of a preset utterance start position, which indicates where speech synthesis of the text to be read aloud starts, and a preset utterance start time, which indicates when the text starts to be uttered as synthesized speech. To this end it mainly includes speech synthesis data input means 10, speech synthesis data storage means 11, analysis/separation means 12, utterance start timing acquisition means 13, speech synthesis start control means 14, speech rate determination means 15, speech rate measurement means 36, speech-rate conversion control means 37, speech-rate conversion means 38, and speech synthesis means 39.

Thus the speech synthesizer 3 does not include the speech rate control means 16 of the speech synthesis control device 1 (see FIG. 1). In its place it includes the speech rate measurement means 36, the speech-rate conversion control means 37, and the speech-rate conversion means 38, and it additionally includes the speech synthesis means 39. The means other than the speech synthesis means 39 realize the speech synthesis control function that controls the speech synthesis of the speech synthesis means 39.

The speech rate measurement means 36 measures the speech rate of the synthesized speech output by the speech synthesis means 39. It can do so, for example, by recognizing the number of phonemes and measuring the time they take.
The speech-rate conversion control means 37 compares the speech rate determined by the speech rate determination means 15 with the speech rate measured by the speech rate measurement means 36, and outputs a speech-rate control signal for converting the rate of the output of the speech synthesis means 39 so that the speech rate of the synthesized speech approaches the rate determined by the speech rate determination means 15.
If the speech rate of the speech synthesis means 39 is known, the speech rate measurement means 36 may be omitted. In that case, the speech-rate conversion control means 37 uses the known speech rate in place of the rate measured by the speech rate measurement means 36.
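The measurement and comparison described above can be sketched as follows. The patent does not define the form of the speech-rate control signal, so representing it as a single multiplicative factor (target rate divided by current rate) is an assumption, as are the function names `measure_rate` and `control_factor` and the phonemes-per-second unit.

```python
from typing import Optional


def measure_rate(phoneme_count: int, elapsed_seconds: float) -> float:
    """Speech rate as phonemes per second over the measured interval."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return phoneme_count / elapsed_seconds


def control_factor(target_rate: float, measured: Optional[float],
                   known_rate: Optional[float] = None) -> float:
    """Factor by which the synthesized speech should be sped up or slowed.

    Falls back to a known synthesizer rate when no measurement is
    available, mirroring the variant without the measurement means.
    """
    current = measured if measured is not None else known_rate
    if current is None or current <= 0:
        raise ValueError("a positive measured or known rate is required")
    return target_rate / current


# 30 phonemes in 3.0 s is 10 phonemes/s; reaching 12 phonemes/s needs x1.2.
factor = control_factor(12.0, measure_rate(30, 3.0))
# Without a measurement, an assumed known rate of 8 phonemes/s gives x1.5.
fallback = control_factor(12.0, None, known_rate=8.0)
```

A factor above 1 asks the conversion stage to speed the speech up, and a factor below 1 asks it to slow the speech down.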

On the basis of the speech-rate control signal output from the speech-rate conversion control means 37, the speech-rate conversion means 38 outputs synthesized speech whose rate has been converted so that the speech rate of the synthesized speech output from the speech synthesis means 39 approaches the rate determined by the speech rate determination means 15. The speech-rate conversion means 38 divides the synthesized speech uttered by the speech synthesis means 39 into segments of a fixed period. It can raise the playback speed by omitting some of the segments and joining the rest, or lower it by inserting repeated copies of some segments before joining. For example, the phoneme "あ" (a) can be divided into segments and shortened to a clipped "あっ" by omitting some of them, or lengthened to a drawn-out "あー" by inserting some, thereby converting the speech rate.
The speech synthesis means 39 includes the same means as the speech synthesizer 7 described in the first and second embodiments and operates in the same way, so its description is omitted. Like the speech synthesizer 7 of the first embodiment, the speech synthesis means 39 is synchronized with the video display device 8 by the synchronizer 9.
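The segment-based conversion described for the speech-rate conversion means 38 can be sketched on a plain list of samples. This is a simplified illustration under stated assumptions: the function name `change_speed`, the fractional-index selection rule, and the absence of any smoothing at the joins are choices made here, not details from the patent.

```python
def change_speed(samples, segment_len, factor):
    """Time-scale audio by dropping or repeating fixed-length segments.

    factor > 1 speeds playback up (some segments are skipped);
    factor < 1 slows it down (some segments are emitted more than once).
    """
    if segment_len <= 0 or factor <= 0:
        raise ValueError("segment_len and factor must be positive")
    segments = [samples[i:i + segment_len]
                for i in range(0, len(samples), segment_len)]
    out = []
    position = 0.0  # fractional index into the input segments
    while int(position) < len(segments):
        out.extend(segments[int(position)])
        position += factor
    return out


samples = list(range(8))                 # stand-in for audio samples
faster = change_speed(samples, 2, 2.0)   # keeps every second segment
slower = change_speed(samples, 2, 0.5)   # emits every segment twice
```

Joining segments directly, as here, leaves discontinuities at the boundaries; practical time-scale modification usually overlaps and cross-fades neighbouring segments, which this sketch leaves out for brevity.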

[Operation of the speech synthesizer]
FIG. 8 is a flowchart showing the operation of the speech synthesizer according to the third embodiment of the present invention. Steps identical to those in the flowchart of the first embodiment shown in FIG. 4 are given the same step numbers, and their description is omitted.
The operation of the speech synthesizer 3 differs from that of the speech synthesis control device 1 mainly in two respects. Because it includes the speech synthesis means 39, it has a speech synthesis step (step SC1). In place of the speech rate control step (step SA7), it has a speech rate measurement step (step SC2), a speech-rate conversion control step (step SC3), and a speech-rate conversion step (step SC4). Steps SC1 to SC4 are described below.

<Speech synthesis step>
When text has been passed from the analysis/separation means 12, the speech synthesis means 39 synthesizes the text in accordance with the cue, output by the speech synthesis start control means 14, that is based on the utterance start position of the utterance start timing, and utters it as synthesized speech at a predetermined speech rate at each utterance start time (step SC1).
<Speech rate measurement step>
The speech rate measurement means 36 measures the speech rate of the synthesized speech uttered by the speech synthesis means 39 (step SC2).
<Speech-rate conversion control step>
The speech-rate conversion control means 37 compares the speech rate determined by the speech rate determination means 15 with the speech rate measured by the speech rate measurement means 36, and outputs a speech-rate control signal for converting the rate of the speech uttered by the speech synthesis means 39 so that the speech rate of the synthesized speech approaches the rate determined by the speech rate determination means 15 (step SC3).

<Speech-rate conversion step>
On the basis of the speech-rate control signal output from the speech-rate conversion control means 37, the speech-rate conversion means 38 outputs synthesized speech whose rate has been converted so that the speech rate of the synthesized speech uttered by the speech synthesis means 39 approaches the rate determined by the speech rate determination means 15 (step SC4).
According to the third embodiment, the means that realize the speech synthesis control function of the speech synthesizer 3 control the speech synthesis of the speech synthesis means 39 on the basis of the utterance start timing and the speech rate, so text can be uttered as synthesized speech at a predetermined speech rate from each utterance start position at its predetermined utterance start time.

(Fourth embodiment)
The following description of the fourth embodiment of the present invention focuses on the components specific to it and their operation. Components identical to those of the first embodiment are given the same reference numerals, and their description and the description of their operation are omitted.
[Configuration of the speech synthesis control device]
FIG. 9 is a block diagram showing the configuration of the speech synthesis control device according to the fourth embodiment of the present invention.
The speech synthesis control device 4 controls the speech synthesizer 7 so that text is synthesized and uttered as synthesized speech at a predetermined speech rate, on the basis of a preset utterance start position, which indicates where speech synthesis of the text to be read aloud starts, and a preset utterance start time, which indicates when the text starts to be uttered as synthesized speech. To this end it mainly includes input means 40, text storage means 41, utterance start timing storage means 42, speech rate storage means 43, utterance start timing acquisition means 13, speech synthesis start control means 14, speech rate determination means 15, and speech rate control means 16. In other words, the speech synthesis control device 4 has the input means 40, the text storage means 41, the utterance start timing storage means 42, and the speech rate storage means 43 in place of the speech synthesis data input means 10, the speech synthesis data storage means 11, and the analysis/separation means 12.

The input means 40 inputs the text, the utterance start timing, and the speech rate into the speech synthesis control device 4 as separate items. Alternatively, the utterance start timing and the speech rate may form a single set of speech synthesis data in a data structure separate from the text. In that case, the text may either be input directly to the speech synthesizer 7 without passing through the analysis/separation means 12, or be passed through the analysis/separation means 12.
The text storage means 41 stores the text input by the input means 40.
The utterance start timing storage means 42 stores the utterance start timing input by the input means 40.
The speech rate storage means 43 stores the speech rate input by the input means 40.
The text storage means 41, the utterance start timing storage means 42, and the speech rate storage means 43 need not be separate storage media; they can also be implemented by allocating different storage areas of a single storage medium.

[Operation of the speech synthesis control device]
Next, the operation of the speech synthesis control device according to the fourth embodiment will be described with reference to FIG. 10 (and to FIG. 9 as appropriate).
FIG. 10 is a flowchart showing the operation of the speech synthesis control device according to the fourth embodiment of the present invention. Steps identical to those in the flowchart of the first embodiment shown in FIG. 4 are given the same step numbers, and their description is omitted. The operation of the speech synthesis control device 4 of the fourth embodiment differs from the first embodiment in that, instead of receiving speech synthesis data as a single stream, it receives the text, the utterance start timing, and the speech rate as separate input data. The steps for handling input data that are specific to this embodiment are described below.

<Input data receiving step>
In the speech synthesis control device 4, when the input means 40 receives input data, for example as a broadcast wave (step SD1), it performs the input data discrimination and storage process (step SD2).
<Input data discrimination and storage step>
The input means 40 determines whether the input data is text. If it is text (Yes in step SD21), the input means 40 recognizes it as text (step SD24) and stores it in the text storage means 41 (step SD27). In response to a command from a control means (not shown), the text storage means 41 outputs the text to the speech synthesizer 7 (step SA8), and the process moves to step SD9.

If the input data is not text (No in step SD21), the input means 40 determines whether the input data is an utterance start timing. If it is (Yes in step SD22), the input means 40 recognizes it as an utterance start timing (step SD25) and stores it in the utterance start timing storage means 42 (step SD28). In response to a command from the control means (not shown), the utterance start timing storage means 42 outputs the utterance start timing to the utterance start timing acquisition means 13, and the process moves to step SA4.
If the input data is not an utterance start timing either (No in step SD22), the input means 40 determines whether the input data is an utterance speed. If it is (Yes in step SD23), the input means 40 recognizes it as an utterance speed (step SD25) and stores it in the utterance speed storage means 43 (step SD28). In response to a command from the control means (not shown), the utterance speed storage means 43 outputs the utterance speed to the utterance speed determination means 15, and the process moves to step SA6.

If the input data is not an utterance speed either (No in step SD23), the input means 40 treats it as other data, as described above, and moves the process to step SD9.
<Input data presence check step>
The input means 40 checks whether further input data has been received. If there is input data, the process returns to step SD2 (Yes in step SD9); if there is none, the process ends (No in step SD9).
According to the fourth embodiment, the speech synthesis control device 4 controls the speech synthesis of the speech synthesizer 7 on the basis of the utterance start timing and the utterance speed, so the text can be uttered as synthesized speech at the predetermined utterance speed from each utterance start position at its predetermined utterance start time.
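The discrimination and storage loop of steps SD2 to SD9 can be pictured with the following minimal Python sketch. It is only an illustration under stated assumptions: the tag strings, the `Stores` class, and the `classify_and_store` function are hypothetical names, and the source does not specify how the kind of each input item is encoded.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple

# Hypothetical tags for the kinds of input data (the encoding is an assumption).
TEXT, START_TIMING, SPEED = "text", "start_timing", "speed"


@dataclass
class Stores:
    """Stand-ins for storage means 41, 42 and 43 (may share one storage medium)."""
    text: List[str] = field(default_factory=list)
    start_timing: List[Tuple[int, float]] = field(default_factory=list)  # (position, time in s)
    speed: List[float] = field(default_factory=list)


def classify_and_store(items: Iterable[Tuple[str, object]], stores: Stores) -> int:
    """Route each input item to its store; return how many items were 'other data'."""
    ignored = 0
    for kind, payload in items:          # repeat while input data remains (step SD9)
        if kind == TEXT:                 # is it text? (step SD21)
            stores.text.append(payload)
        elif kind == START_TIMING:       # is it an utterance start timing? (step SD22)
            stores.start_timing.append(payload)
        elif kind == SPEED:              # is it an utterance speed? (step SD23)
            stores.speed.append(payload)
        else:                            # other data: nothing is stored
            ignored += 1
    return ignored


stores = Stores()
ignored = classify_and_store(
    [(TEXT, "Good evening."), (START_TIMING, (0, 12.5)), (SPEED, 6.0), ("video", b"")],
    stores,
)
```

Keeping the three stores as fields of one object mirrors the remark that the storage means may be separate areas of a single medium.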

The configuration and operation of a speech synthesis data generation device 5 that generates the utterance start timing and the utterance speed are described below as a fifth embodiment.
(Fifth embodiment)
[Configuration of the speech synthesis data generation device]
FIG. 11 is a block diagram showing the configuration of the speech synthesis data generation device according to the fifth embodiment of the present invention.
To generate the text to be read aloud, the utterance start position, and the utterance start time corresponding to that utterance start position, the speech synthesis data generation device 5 mainly comprises a speech signal input means 50, a speech recognition means 51, an utterance start timing acquisition means 52, an utterance speed measurement means 53, a speech synthesis data generation means 54, a speech synthesis data storage means 55, and an output means 56.

The speech signal input means 50 converts sound waves into a speech signal in the form of an electrical signal and inputs it. Although the signal into which the sound waves are converted is described here as an electrical signal, it is not limited to this and may be, for example, an optical signal.
The speech recognition means 51 recognizes the speech signal input by the speech signal input means 50 and generates text. The speech recognition means 51 A/D-converts the speech signal, which is input as an analog signal, into a digital speech signal. It then extracts features from the speech signal by referring to speech spectra that serve as a standard word model, and recognizes phonemes by converting those feature elements into the characters that match them. Finally, the speech recognition means 51 refers to a word dictionary and a grammar dictionary to recognize the phonemes as words and sentences, and generates the text.

The utterance start timing acquisition means 52 acquires, as the utterance start timing, the utterance start position of the text generated by the speech recognition means 51 and the utterance start time corresponding to that utterance start position. Any character may serve as the utterance start character, but in particular the character at which utterance begins after a silent state is acquired, and the time at that moment is acquired as the utterance start time.
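As an illustration of acquiring the utterance start timing from recognized text, the sketch below marks a character as an utterance start character when it follows a sufficiently long silence. The per-character timestamps, the function name `utterance_starts`, and the 0.5-second silence threshold are assumptions made for this example; the source does not state how silence is detected.

```python
from typing import List, Tuple


def utterance_starts(
    timed_chars: List[Tuple[str, float, float]],
    min_silence: float = 0.5,  # hypothetical threshold, in seconds
) -> List[Tuple[int, float]]:
    """Return (utterance start position, utterance start time) pairs.

    timed_chars holds (character, start time, end time) in seconds, in order.
    A character starts an utterance if it is the first one, or if the gap
    since the previous character ended is at least min_silence.
    """
    starts = []
    prev_end = None
    for index, (_char, start, end) in enumerate(timed_chars):
        if prev_end is None or start - prev_end >= min_silence:
            starts.append((index, start))
        prev_end = end
    return starts


chars = [("A", 0.0, 0.2), ("B", 0.2, 0.4), ("C", 1.5, 1.7), ("D", 1.7, 1.9)]
timings = utterance_starts(chars)
```

Here the position is simply the character index in the recognized text, which is one possible way to represent an utterance start position.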

The utterance speed measurement means 53 starts measuring the utterance speed from the time at which the utterance start timing acquisition means 52 acquired the utterance start character. In this case, the measurement is tied to the times of the phonemes extracted as features by the speech recognition means 51: the utterance speed can be calculated from the time elapsed between an arbitrary phoneme and a later phoneme a given number of phonemes further on, and that number of phonemes.
The utterance speed measurement means 53 also measures the utterance speed of the speech input by the speech signal input means 50 in correspondence with the text generated by the speech recognition means 51. It can likewise calculate the utterance speed from the time between the moment the utterance start timing acquisition means 52 acquired the utterance start character and the moment a silent state is reached, and the number of characters uttered in between.
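Both ways of measuring the utterance speed reduce to dividing a count by an elapsed time. The following sketch shows the arithmetic; the function names and the choice of seconds as the time unit are assumptions for illustration only.

```python
from typing import Sequence


def chars_per_second(char_count: int, start_time: float, silence_time: float) -> float:
    """Utterance speed from the utterance start time to the onset of silence."""
    duration = silence_time - start_time
    if duration <= 0:
        raise ValueError("silence_time must be later than start_time")
    return char_count / duration


def phonemes_per_second(phoneme_times: Sequence[float], first: int, count: int) -> float:
    """Utterance speed over `count` phonemes starting at index `first`.

    phoneme_times[i] is the time (in seconds) at which phoneme i was extracted.
    """
    duration = phoneme_times[first + count] - phoneme_times[first]
    if duration <= 0:
        raise ValueError("phoneme times must be strictly increasing")
    return count / duration


# 30 characters uttered between t = 2.0 s and silence at t = 7.0 s.
rate = chars_per_second(30, 2.0, 7.0)
```

With these example values the character-based rate is 30 / 5 = 6 characters per second.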

The speech synthesis data generation means 54 generates speech synthesis data from the text generated by the speech recognition means 51, the utterance start timing acquired by the utterance start timing acquisition means 52, and the utterance speed measured by the utterance speed measurement means 53. The speech synthesis data is described here as a single packaged stream, but the text alone may be kept separate while the utterance start timing and the utterance speed together form the speech synthesis data, or all three may be kept separate from one another.
The speech synthesis data storage means 55 stores the speech synthesis data generated by the speech synthesis data generation means 54.
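The two packaging options can be sketched as follows. The JSON layout, the class names `Segment` and `SpeechSynthesisData`, and the functions `pack` and `split` are hypothetical; the source does not define a concrete data format for speech synthesis data.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass
class Segment:
    start_position: int  # index of the utterance start character in the text
    start_time: float    # utterance start time, in seconds
    speed: float         # utterance speed, in characters per second


@dataclass
class SpeechSynthesisData:
    text: str
    segments: List[Segment]


def pack(data: SpeechSynthesisData) -> str:
    """Package text, start timings and speeds as one stream."""
    return json.dumps(asdict(data), ensure_ascii=False)


def split(data: SpeechSynthesisData) -> Tuple[str, str]:
    """Keep the text separate from the timing and speed control data."""
    control = json.dumps([asdict(segment) for segment in data.segments])
    return data.text, control


data = SpeechSynthesisData(
    text="Good evening. Here is the news.",
    segments=[Segment(0, 0.0, 5.0), Segment(14, 3.0, 6.5)],
)
stream = pack(data)
text, control = split(data)
```

The single-stream form suits a device that parses everything itself, while the split form matches the case in which the text bypasses the control path.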

The output means 56 outputs the speech synthesis data stored in the speech synthesis data storage means 55 to the speech synthesis control device 1, 2, or 4 of the first, second, or fourth embodiment, or to the speech synthesizer 3 of the third embodiment. The speech synthesis data generation device 5 need not be installed at the same location as the speech synthesis control device 1, 2, or 4 or the speech synthesizer 3; it may be installed remotely and exchange data over a communication line or broadcast radio waves. When installed at the same location, the speech synthesis control device 1, 2, or 4 or the speech synthesizer 3 and the speech synthesis data generation device 5 may or may not be housed in the same enclosure.

[Operation of the speech synthesis data generation device]
FIG. 12 is a flowchart showing the operation of the speech synthesis data generation device according to the fifth embodiment of the present invention.
On receiving sound waves, the speech signal input means 50 converts them into a speech signal in the form of an electrical signal and inputs it (step SE1). The utterance start timing acquisition means 52 acquires the utterance start timing (step SE2), and the utterance speed measurement means 53 measures the utterance speed (step SE3). The speech recognition means 51 recognizes the speech signal and generates text (step SE4). Steps SE2 to SE4 are repeated until the speech signal ends (step SE5). Once the speech signal has ended, the speech synthesis data generation means 54 combines the text, the utterance start timing, and the utterance speed into speech synthesis data (step SE6) and stores it in the speech synthesis data storage means 55 (step SE7).

According to the fifth embodiment, the speech synthesis data generation device 5 can obtain the text, the utterance start timing, and the utterance speed as packaged speech synthesis data.
In this embodiment, the speech recognition means 51, the utterance start timing acquisition means 52, and the utterance speed measurement means 53 have been described as separate functional elements, but the functions of the utterance start timing acquisition means and the utterance speed measurement means may instead be provided within the speech recognition means 51.

The means that implement the speech synthesis control function in the speech synthesis control devices 1, 2, and 4 of the first, second, and fourth embodiments, or in the speech synthesizer 3 of the third embodiment, can be realized by having a general-purpose computer execute a program that operates the computer's arithmetic unit and storage devices. This program (the speech synthesis control program) can be distributed over a communication line, or written to a recording medium such as a CD-ROM and distributed.

Similarly, the speech synthesis data generation device 5 of the fifth embodiment can be realized by having a general-purpose computer execute a program that operates the computer's arithmetic unit and storage devices. This program (the speech synthesis data generation program) can be distributed over a communication line, or written to a recording medium such as a CD-ROM and distributed.

The speech synthesis control devices 1, 2, and 4 of the first, second, and fourth embodiments have been described as separate from the speech synthesizer 7, but they may be built into the speech synthesizer 7, as in the speech synthesizer 3 of the third embodiment. Conversely, in the speech synthesizer 3 of the third embodiment, the means that implement the speech synthesis control function may be separate from the speech synthesis means 39.

In each embodiment, the video signal of a moving image has been described as the synchronization target signal, but the signal is not limited to this and may be, for example, an audio signal. A synchronization target signal is content arranged in time series, like a video signal or an audio signal. A slide show of still images can therefore also be treated as a synchronization target signal, in the sense that the displayed still image is switched in time series. For a slide show, however, the synchronization target signal should include the still images together with a setting value that specifies the interval, in seconds, at which they are displayed.

Because the synchronization signal only needs to have a time series, the time itself can also serve as the synchronization signal. This is useful, for example, when a predetermined amount of synthesized speech has to be uttered quickly within a given period, such as when a predetermined amount of news is output as synthesized speech during a one-minute news broadcast. Suppose that, within that minute, a headline item is broadcast for 10 seconds and the corresponding detailed commentary for 20 seconds. The synthesized speech for the first 10 seconds can be output more slowly than the reference utterance speed to capture the audience's interest, and the content-heavy detailed commentary can be output slightly faster than the reference utterance speed.
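The news example can be worked through numerically. In the sketch below, the 10-second and 20-second durations come from the description, while the character counts (50 and 140) and the reference utterance speed (6 characters per second) are invented purely for illustration.

```python
def required_rate(char_count: int, duration_s: float) -> float:
    """Characters per second needed to fit char_count characters into duration_s."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return char_count / duration_s


def ratio_to_reference(rate: float, reference_rate: float) -> float:
    """Utterance speed expressed as a ratio to the reference utterance speed."""
    return rate / reference_rate


REFERENCE_RATE = 6.0  # hypothetical reference speed, characters per second

headline_rate = required_rate(50, 10.0)     # headline item: 10-second slot
commentary_rate = required_rate(140, 20.0)  # detailed commentary: 20-second slot

headline_ratio = ratio_to_reference(headline_rate, REFERENCE_RATE)
commentary_ratio = ratio_to_reference(commentary_rate, REFERENCE_RATE)
```

With these assumed numbers the headline is read at 5 characters per second (a ratio below 1, slower than the reference) and the commentary at 7 characters per second (a ratio above 1, slightly faster).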

FIG. 1 is a block diagram showing the configuration of the speech synthesis control device according to the first embodiment of the present invention.
FIG. 2 is a diagram explaining the concept of the utterance start character of the speech synthesis control device according to an embodiment of the present invention.
FIG. 3 is a diagram explaining the concept of the utterance speed of the speech synthesis control device according to an embodiment of the present invention: (a) shows an example in which the utterance speed is specified directly, (b) an example in which it is specified by the time taken to utter a given range, (c) an example in which it is calculated from the utterance start timing, and (d) an example in which it is specified as a ratio to a predetermined standard utterance speed.
FIG. 4 is a flowchart showing the operation of the speech synthesis control device according to the first embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of the speech synthesis control device according to the second embodiment of the present invention.
FIG. 6 is a flowchart showing the operation of the speech synthesis control device according to the second embodiment of the present invention.
FIG. 7 is a block diagram showing the configuration of the speech synthesizer according to the third embodiment of the present invention.
FIG. 8 is a flowchart showing the operation of the speech synthesizer according to the third embodiment of the present invention.
FIG. 9 is a block diagram showing the configuration of the speech synthesis control device according to the fourth embodiment of the present invention.
FIG. 10 is a flowchart showing the operation of the speech synthesis control device according to the fourth embodiment of the present invention.
FIG. 11 is a block diagram showing the configuration of the speech synthesis data generation device according to the fifth embodiment of the present invention.
FIG. 12 is a flowchart showing the operation of the speech synthesis data generation device according to the fifth embodiment of the present invention.

Explanation of symbols

1, 2, 4: Speech synthesis control device
3, 7: Speech synthesizer
5, 6: Speech synthesis data generation device
10, 20, 30: Speech synthesis data input means
11, 21, 31: Speech synthesis data storage means
12, 22, 32: Analysis/separation means
13, 23, 33, 44: Utterance start timing acquisition means
14, 24, 34, 45: Speech synthesis start control means
15, 35, 46: Utterance speed determination means
36: Utterance speed measurement means
37: Speech rate conversion control means
38: Speech rate conversion means
40: Input means
41: Text storage means
42: Utterance start timing storage means
43: Utterance speed storage means
47: Utterance speed control means
50: Speech signal input means
51: Speech recognition means
52: Utterance start timing acquisition means
53: Utterance speed measurement means
54: Speech synthesis data generation means
55: Speech synthesis data storage means
56: Output means

Claims

A speech synthesis control device that controls a speech synthesizer so that, when text to be read aloud is synthesized and uttered as synthesized speech, the text is synthesized and uttered on the basis of a preset utterance start position indicating the position in the text at which speech synthesis starts and a preset utterance start time indicating the time at which utterance of the text as synthesized speech starts, the device comprising:
utterance start timing acquisition means for acquiring the utterance start position and the utterance start time as an utterance start timing; and
speech synthesis start control means for controlling the start of speech synthesis by the speech synthesizer so that the speech synthesizer synthesizes the text from the utterance start position and utters it as synthesized speech from the utterance start time.

The speech synthesis control device according to claim 1, further comprising:
utterance speed determination means for determining the utterance speed, and the format of that utterance speed, at which the text is uttered as synthesized speech for each predetermined utterance range, the utterance range indicating a range of the text to be synthesized starting from the utterance start position; and
utterance speed control means for outputting the utterance speed determined by the utterance speed determination means to the speech synthesizer and controlling the speech synthesis of the speech synthesizer so that the utterance speed determined for each utterance range is obtained.

The speech synthesis control device according to claim 1, further comprising:
utterance speed determination means for determining the utterance speed, and the format of that utterance speed, at which the text is uttered as synthesized speech for each predetermined utterance range, the utterance range indicating a range of the text to be synthesized starting from the utterance start position;
utterance speed measurement means for measuring the utterance speed of the synthesized speech uttered by the speech synthesizer;
speech rate conversion control means for comparing the utterance speed determined by the utterance speed determination means with the utterance speed measured by the utterance speed measurement means, and outputting a speech rate control signal for converting the speech rate of the synthesized speech so that the utterance speed of the synthesized speech uttered by the speech synthesizer approaches the utterance speed determined by the utterance speed determination means; and
speech rate conversion means for converting, on the basis of the speech rate control signal output from the speech rate conversion control means, the speech rate of the synthesized speech uttered by the speech synthesizer so that it approaches the utterance speed determined by the utterance speed determination means, and outputting the converted synthesized speech.

The speech synthesis control device according to claim 2 or 3, wherein the utterance speed determination means determines the utterance speed in any one of the following formats: a format expressed on the basis of the time required to utter the utterance range, calculated from the utterance start time of the utterance start position of that utterance range and the next utterance start time of the next utterance start position at which the next utterance starts, together with the number of characters in the utterance range; a format expressed on the basis of the number of characters uttered per unit time; a format expressed on the basis of the utterance time required to utter the utterance range; or a format expressed on the basis of a ratio to a standard utterance speed.

A speech synthesis data generation device that generates the text to be read aloud, the utterance start position, and the utterance start time corresponding to the utterance start position, for provision to the speech synthesis control device according to claim 1, the device comprising:
speech recognition means for recognizing speech and generating the text;
utterance start timing acquisition means for acquiring, as the utterance start timing, the utterance start position of the text generated by the speech recognition means and the utterance start time corresponding to that utterance start position; and
speech synthesis data storage means for storing, as speech synthesis data, the text generated by the speech recognition means and the utterance start timing acquired by the utterance start timing acquisition means.

A speech synthesis control method for controlling a speech synthesizer so that, when text to be read aloud is synthesized and uttered as synthesized speech, the text is synthesized and uttered on the basis of a preset utterance start position indicating the position in the text at which speech synthesis starts and a preset utterance start time indicating the time at which utterance of the text as synthesized speech starts, the method comprising:
an utterance start timing acquisition step of acquiring the utterance start position and the utterance start time as an utterance start timing; and
a speech synthesis start control step of controlling the start of speech synthesis by the speech synthesizer so that the speech synthesizer synthesizes the text from the utterance start position and utters it as synthesized speech from the utterance start time.

A speech synthesis control program that, in order to control a speech synthesizer so that, when text to be read aloud is synthesized and uttered as synthesized speech, the text is synthesized and uttered on the basis of a preset utterance start position indicating the position in the text at which speech synthesis starts and a preset utterance start time indicating the time at which utterance of the text as synthesized speech starts, causes a computer to function as:
utterance start timing acquisition means for acquiring the utterance start position and the utterance start time as an utterance start timing; and
speech synthesis start control means for controlling the start of speech synthesis by the speech synthesizer so that the speech synthesizer synthesizes the text from the utterance start position and utters it as synthesized speech from the utterance start time.