JP2003099078A

JP2003099078A - Synthetic voice reproduction method and synthesized voice reproduction device

Info

Publication number: JP2003099078A
Application number: JP2001287402A
Authority: JP
Inventors: Hiroshi Hasegawa; 浩長谷川
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 2001-09-20
Filing date: 2001-09-20
Publication date: 2003-04-04

Abstract

(57) [Abstract]

[Problem] Merely listening to synthesized speech obtained by speech synthesis of a text conveys no information beyond the content of the text itself.

[Solution] The device comprises: a text analysis unit 2 that analyzes a text and extracts the text information to be synthesized together with, as sound source position definition information, the information needed to define the sound source position of the synthesized speech in a virtual sound space; a speech synthesis unit 4 that performs speech synthesis on the text information extracted by the text analysis unit 2; a sound source position definition unit 6 that defines a virtual sound source position in space from the sound source position definition information; a sound image localization processing unit 7 that localizes the synthesized speech obtained by the speech synthesis unit 4 at the virtual sound source position defined by the sound source position definition unit 6; and a stereophonic sound image output unit 8 that outputs the synthesized speech from the position localized by the sound image localization processing unit 7.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a synthesized speech reproduction method and a synthesized speech reproduction device that generate a speech signal from text and output it as synthesized speech.

[0002]

[Prior Art] Speech synthesis technology, which generates a speech signal from text and outputs it as synthesized speech, is widely used in the field of information technology that handles speech.

[0003] In conventional speech synthesis, the synthesized speech output is in most cases single-channel monaural. Consequently, when the text to be synthesized is, for example, a record of a dialogue or meeting among several speakers, the synthesized speech obtained from that text makes it difficult to tell the speakers apart and difficult to grasp the situation, such as where the speakers were positioned relative to one another.

[0004] Likewise, when several kinds of text exist as synthesis targets and their synthesis results are output one after another, or when a single long text is synthesized and its results are output one after another, listening to the output alone makes it difficult to tell which text the synthesized speech belongs to, or which part of the text is being spoken.

[0005] This is because, as noted above, conventional speech synthesis technology simply synthesizes the text and outputs the result as monaural speech. The only data it can provide to the user is the synthesized speech for the text; no additional information of any kind is available.

[0006] One conceivable way to solve this problem is to output the synthesized speech as multi-channel stereophonic sound. Techniques for generating multi-channel stereophonic sound already exist. For example, the "Audio Information Conversion Device" of JP-A-8-331697 (hereinafter, the first prior art) sets arbitrary frequency band characteristics on a stereo signal to give the reproduced sound of the original stereo signal a sense of spaciousness, and adds pseudo high-frequency components to alter timbre and sound quality, thereby generating a sound field closer to natural sound.

[0007] The "Sound Information Providing Device and Sound Information Selection Method" of JP-A-9-90963 (hereinafter, the second prior art) arranges multiple pieces of sound information in a virtual sound space and provides sound information selection means so that the user can select among them.

[0008]

[Problem to Be Solved by the Invention] As described above, the first prior art is designed to give the reproduced sound of the original stereo signal a sense of spaciousness, and it cannot solve the problems with synthesized speech described above.

[0009] The second prior art, as described above, arranges multiple pieces of sound information in a virtual sound space and provides sound information selection means so that the user can select among them. It likewise cannot solve the problems with synthesized speech described above.

[0010] The object of the present invention is therefore to provide a synthesized speech reproduction method and a synthesized speech reproduction device that, by attaching spatial position information to synthesized speech when reproducing it, can also convey to the user various additional information beyond the synthesized speech itself, such as which part of a text is being spoken or, when several texts exist, which text is being spoken.

[0011]

[Means for Solving the Problem] To achieve this object, the synthesized speech reproduction method of the present invention analyzes a text input as a speech synthesis target, extracts the text to be synthesized, and also extracts, as sound source position definition information, the information needed to define the sound source position of the synthesized speech in a virtual sound space. The method uses this sound source position definition information to define a sound source position in the virtual sound space, performs speech synthesis on the text to be synthesized, localizes the resulting synthesized speech at the defined virtual sound source position in the sound space, and outputs the synthesized speech from that localized position.
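
The data flow of this method can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names (`analyze`, `define_positions`, `synthesize`, `reproduce`) are hypothetical, each non-empty line is assumed to be one unit of text, positions are represented only as abstract slot numbers, and the synthesizer is a placeholder that returns silent samples.

```python
from dataclasses import dataclass


@dataclass
class AnalysisResult:
    texts: list[str]        # text units to be synthesized
    position_info: dict     # sound source position definition information


def analyze(text: str) -> AnalysisResult:
    # Assumption: every non-empty line is one unit of text.
    units = [line.strip() for line in text.splitlines() if line.strip()]
    return AnalysisResult(texts=units, position_info={"unit_count": len(units)})


def define_positions(position_info: dict) -> list[int]:
    # Assumption: a position is just a slot number in the virtual sound space.
    return list(range(position_info["unit_count"]))


def synthesize(text: str) -> list[float]:
    # Placeholder for a real synthesizer: one silent sample per character.
    return [0.0] * len(text)


def reproduce(text: str) -> list[tuple[int, list[float]]]:
    result = analyze(text)
    slots = define_positions(result.position_info)
    # Pair each synthesized unit with the slot it should be localized at.
    return [(slot, synthesize(unit)) for slot, unit in zip(slots, result.texts)]
```

Keeping analysis, position definition, synthesis, and localization as separate steps mirrors the separation the method describes, so any one step could be replaced without touching the others.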

[0012] In this synthesized speech reproduction method, when the text to be synthesized contains multiple coherent units of content, a virtual sound source position in the sound space is defined for each content unit, and the synthesized speech for each unit is output from the sound source position associated with it.

[0013] When a sound source position is defined for each content unit and the units occur in an order that follows the flow of the text, the sound source positions of the units are defined so that they change in that same order.
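
One way to obtain positions that change in the same order as the content units is to space them evenly from left to right. The sketch below assumes a one-dimensional position expressed as a pan value between -1.0 (left end) and +1.0 (right end); both the representation and the name `ordered_positions` are illustrative assumptions, since the text does not fix a coordinate system.

```python
def ordered_positions(unit_count: int) -> list[float]:
    """Return one pan value per content unit, ordered left (-1.0) to right (+1.0)."""
    if unit_count <= 0:
        return []
    if unit_count == 1:
        return [0.0]            # a single unit is placed at the center
    step = 2.0 / (unit_count - 1)
    return [-1.0 + i * step for i in range(unit_count)]
```

With three units this yields the left end, the center, and the right end, so the position moves in one direction as the text proceeds.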

[0014] In the same method, when the synthesis target consists of multiple independent texts, a virtual sound source position in the sound space is defined for each text, and the synthesized speech for each text is output from the sound source position associated with it.

[0015] When the independent texts can be classified by some criterion, a virtual sound source position in the sound space is defined for each class, and the synthesized speech for the texts belonging to a class is output from the sound source position associated with that class.
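
A per-class assignment of this kind might look like the following sketch. It assumes each text already carries a class label (how texts are classified is not specified here), and it represents a position as a slot number assigned in order of first appearance; `positions_by_class` and `playback_plan` are hypothetical names.

```python
def positions_by_class(texts: list[tuple[str, str]]) -> dict[str, int]:
    """Map each class label to a position slot, in order of first appearance."""
    slots: dict[str, int] = {}
    for label, _body in texts:
        if label not in slots:
            slots[label] = len(slots)
    return slots


def playback_plan(texts: list[tuple[str, str]]) -> list[tuple[int, str]]:
    """Pair every text with the slot of the class it belongs to."""
    slots = positions_by_class(texts)
    return [(slots[label], body) for label, body in texts]
```

Texts that share a class share a slot, so the position alone tells the listener which class is being spoken.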

[0016] In the same method, when the synthesis targets are texts residing in multiple applications on an information processing system with a multitasking function, a virtual sound source position in the sound space is defined for each application, and the synthesized speech for text in an application is output from the sound source position associated with that application.

[0017] The synthesized speech reproduction device of the present invention comprises: text analysis means that analyzes the text to be synthesized, extracts the text information to be synthesized, and extracts, as sound source position definition information, the information needed to define the sound source position of the synthesized speech in a virtual sound space; sound source position definition means that defines a virtual sound source position in the sound space using the sound source position definition information extracted by the text analysis means; speech synthesis means that performs speech synthesis on the text information extracted by the text analysis means; sound image localization means that localizes the synthesized speech obtained by the speech synthesis means at the virtual sound source position defined by the sound source position definition means; and stereophonic sound image output means that outputs the synthesized speech from the sound source position localized by the sound image localization means.

[0018] In this synthesized speech reproduction device, when the text to be synthesized contains multiple content units, a virtual sound source position in the sound space is defined for each content unit, and the synthesized speech for each unit is output from the sound source position associated with it.

[0019] When a sound source position is defined for each content unit and the units occur in an order that follows the flow of the text, the sound source positions of the units are defined so that they change in that same order.

[0020] In the same device, when the synthesis target consists of multiple independent texts, a virtual sound source position in the sound space is defined for each text, and the synthesized speech for each text is output from the sound source position associated with it.

[0021] When the independent texts can be classified by some criterion, a virtual sound source position in the sound space is defined for each class, and the synthesized speech for the texts belonging to a class is output from the sound source position associated with that class.

[0022] In the same device, when the synthesis targets are texts residing in multiple applications on an information processing system with a multitasking function, a virtual sound source position in the sound space is defined for each application, and the synthesized speech for text in an application is output from the sound source position associated with that application.

[0023] Thus, by attaching spatial position information to synthesized speech information when reproducing it, the present invention can also convey to the user various additional information beyond the synthesized speech, increasing the amount of information that reaches the user.

[0024] That is, merely synthesizing a text and outputting it as synthesized speech conveys only the content of the text to the user by voice. According to the present invention, however, various additional information beyond the synthesized speech can be conveyed to the user, such as which part of a text is being spoken or, when several texts exist, which text is being spoken.

[0025] Specifically, when the text to be synthesized contains multiple content units, a sound source position is defined for each unit, and the synthesized speech for the text of each unit is output from the corresponding sound source position.

[0026] For example, when a text such as the minutes of a meeting among several speakers is synthesized, the sound source position of the synthesized speech can be made different for each speaker. This makes it clear which speaker said what and makes the course of the meeting easier to follow.

[0027] The sound source position of the synthesized speech can also be varied for each content unit corresponding to a paragraph of the text or to a stage of its four-part narrative structure (kishotenketsu: introduction, development, turn, conclusion). When content units follow the flow of the writing in this way, defining the sound source position of each unit so that the positions change in an order that follows that flow makes the flow easier for the listener to understand. The listener can also tell intuitively, from the position the synthesized speech is coming from, which part of the text is currently being spoken.

[0028] When the synthesis target consists of multiple independent texts, defining a sound source position for each text and outputting each text's synthesized speech from its own position lets the listener tell intuitively, from the position the speech is coming from, which text is currently being spoken.

[0029] When such texts can be classified into several groups by some criterion, a sound source position can instead be defined for each class. The listener can then tell intuitively, from the position the speech is coming from, which class the text currently being spoken belongs to.

[0030] When the synthesis targets are texts residing in multiple applications on an information processing system with a multitasking function, a sound source position can be defined for each application. The listener can then tell intuitively, from the position the speech is coming from, which application the text currently being spoken resides in.

[0031] Thus, by attaching position information to synthesized speech when reproducing it, the present invention can convey to the user, in addition to the synthesized speech itself, various additional information such as which part of a text is being spoken or, when several texts exist, which text is being spoken.

[0032] Furthermore, when the user wants to hear again the synthesized speech for a certain text or for a particular part of a text, the position from which that speech was output makes it easy to specify what to replay. This is extremely convenient when the user wants to know the content of a text, or of a particular part of it, in more detail.

[0033]

[Embodiments of the Invention] Embodiments of the present invention are described below. The description covers both the synthesized speech reproduction method and the synthesized speech reproduction device of the present invention.

[0034] FIG. 1 shows the configuration of the synthesized speech reproduction device of the present invention. The operation of each component is described later. Listing only the components first, the device comprises text data to be synthesized (hereinafter simply "text") 1, a text analysis unit 2, a text analysis dictionary 3, a speech synthesis unit 4, a speech synthesis dictionary 5, a sound source position definition unit 6, a sound image localization processing unit 7, and a stereophonic sound image output unit 8.

[0035] The text analysis unit 2 analyzes the text 1 with reference to the text analysis dictionary 3. Its functions include the morphological analysis needed for speech synthesis and the ability to grasp sentence structure and the logical flow of the writing. From the content of the text it extracts the text to be synthesized and the information needed to define the sound source position of the synthesized speech in a virtual sound space (this information is called sound source position definition information). It passes the text information to be synthesized to the speech synthesis unit 4 and the sound source position definition information to the sound source position definition unit 6. This is the basic operation of the text analysis unit 2; it also has various other functions for realizing the specific examples below, which are described as they arise.

[0036] The speech synthesis unit 4 performs speech synthesis, using the speech synthesis dictionary 5, on the text to be synthesized that is passed from the text analysis unit 2. Known speech synthesis techniques can be used for the speech synthesis unit 4.

[0037] The sound source position definition unit 6 uses the sound source position definition information passed from the text analysis unit 2 to define, in the sound space, the position of the synthesized speech produced by the speech synthesis unit 4 as a virtual sound source position.

[0038] The sound image localization processing unit 7 localizes the synthesized speech from the speech synthesis unit 4 at the virtual sound source position defined by the sound source position definition unit 6. Known techniques can be used for the sound image localization processing unit 7 and the sound source position definition unit 6.
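
The text leaves the localization technique open ("known techniques"). As one simple stand-in, and only as an assumption for illustration, a mono signal can be placed between two loudspeakers by constant-power amplitude panning. The sketch below is not the disclosed method; the names `pan_gains` and `localize` and the pan range of -1.0 (left) to +1.0 (right) are hypothetical.

```python
import math


def pan_gains(pan: float) -> tuple[float, float]:
    """Constant-power (left, right) gains for pan in [-1.0 (left), +1.0 (right)]."""
    pan = max(-1.0, min(1.0, pan))          # clamp out-of-range values
    angle = (pan + 1.0) * math.pi / 4.0     # maps pan to 0 .. pi/2
    return math.cos(angle), math.sin(angle)


def localize(mono: list[float], pan: float) -> list[tuple[float, float]]:
    """Turn mono samples into (left, right) sample pairs at the given pan."""
    left_gain, right_gain = pan_gains(pan)
    return [(s * left_gain, s * right_gain) for s in mono]
```

Because the squared gains always sum to one, the perceived loudness stays roughly constant as the position moves, which suits speech that switches position between units.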

[0039] The stereophonic sound image output unit 8 outputs the synthesized speech from the speech synthesis unit 4 from the virtual sound source position defined by the sound source position definition unit 6.

[0040] The operation of the present invention in this configuration is described below with reference to several specific examples.

[0041] [Specific Example 1] Specific Example 1 synthesizes a text consisting of content spoken by multiple speakers, such as meeting minutes or a dialogue, and outputs the synthesized speech.

[0042] For example, as shown in FIG. 2, suppose that three speakers (speaker A1, speaker A2, and speaker A3) hold a meeting seated around a round table 10 in the order A1, A2, A3, and that the minutes of speakers A1, A2, and A3 exist as text. To indicate what each speaker said, the text associates each speaker with his or her utterances, for example as speaker A1: "...", speaker A2: "...", speaker A3: "...".

[0043] In FIG. 1, the text 1, in which speakers A1, A2, and A3 are associated with their utterances, is given to the text analysis unit 2. The text analysis unit 2 treats the utterances of speaker A1, of speaker A2, and of speaker A3 as one unit per speaker and passes each utterance to the speech synthesis unit 4. Because there are three speakers here, it also passes to the sound source position definition unit 6, as sound source position definition information, the fact that there are three speakers and information indicating, for example, the arrangement of speakers A1, A2, and A3.
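
The analysis step for such minutes can be sketched as follows. The sketch assumes each line has the form `speaker: utterance` with an ASCII colon, and it uses the order in which speakers first appear as a stand-in for their arrangement; both are assumptions, and `analyze_minutes` is a hypothetical name.

```python
def analyze_minutes(text: str) -> tuple[list[tuple[str, str]], dict]:
    """Split 'speaker: utterance' lines into utterances and position info."""
    utterances = []     # (speaker, utterance) pairs in spoken order
    speakers = []       # speakers in order of first appearance
    for line in text.splitlines():
        speaker, sep, body = line.partition(":")
        if not sep:
            continue    # skip lines that carry no speaker label
        speaker, body = speaker.strip(), body.strip()
        utterances.append((speaker, body))
        if speaker not in speakers:
            speakers.append(speaker)
    position_info = {"speaker_count": len(speakers), "order": speakers}
    return utterances, position_info
```

The utterances would go to the synthesizer and `position_info` to the position definition step, matching the two outputs of the text analysis unit 2.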

[0044] The sound source position definition unit 6 uses the sound source position definition information passed from the text analysis unit 2 to define virtual sound source positions in the sound space. Meanwhile, the speech synthesis unit 4 synthesizes the utterances of each speaker and passes the synthesized speech to the sound image localization processing unit 7. The sound image localization processing unit 7 localizes the synthesized speech for speakers A1, A2, and A3 from the speech synthesis unit 4 at the virtual sound source positions defined by the sound source position definition unit 6.

[0045] As a result, the stereophonic sound image output unit 8 outputs the synthesized speech for the utterances of speakers A1, A2, and A3 from the sound source position assigned to each speaker.

[0046] For example, as shown in FIG. 3, the synthesized speech for speaker A1 is heard from near the left end of the sound space 20, that for speaker A2 from near the center, and that for speaker A3 from near the right end. The synthesized speech for each of speakers A1, A2, and A3 is thus heard from a different position.

[0047] Because the synthesized speech for each speaker's utterances comes from a different position, a listener to speech synthesized from text produced by several speakers, such as meeting minutes or a dialogue, can distinguish the speakers while listening. Moreover, the sound source position for each speaker can be set to reflect, to some degree, that speaker's original position (see FIG. 2), which makes the situation easier to grasp and gives a sense of being present at the meeting or conversation.

[0048] [Specific Example 2] In Specific Example 2, when a long text or the like is output as synthesized speech, the text is divided into content units that follow the flow of the writing, and a sound source position for each unit is defined in the virtual sound space.

[0049] As a first example (called Specific Example 2-1), suppose that a text 1 consisting of several content units, as shown in FIG. 4, is input to the text analysis unit 2. In the example of FIG. 4 the content units are paragraphs, and three paragraphs B1, B2, and B3 are assumed. The writing flows in the order B1, B2, B3, as indicated by arrow a.

[0050] When the text 1 containing paragraphs B1, B2, and B3 is input to the text analysis unit 2, the text of each paragraph is passed to the speech synthesis unit 4 as text to be synthesized. Information indicating that there are three paragraphs and how paragraphs B1, B2, and B3 are connected is passed to the sound source position definition unit 6 as sound source position definition information.
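
A paragraph-based analysis of this kind might be sketched as below. It assumes paragraphs are separated by blank lines and represents the connection between paragraphs as index pairs in reading order; the separator, the `links` representation, and the name `analyze_paragraphs` are all illustrative assumptions.

```python
def analyze_paragraphs(text: str) -> tuple[list[str], dict]:
    """Split text on blank lines; return paragraphs and position info."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    position_info = {
        "paragraph_count": len(paragraphs),
        # Each paragraph is followed by the next one in reading order.
        "links": [(i, i + 1) for i in range(len(paragraphs) - 1)],
    }
    return paragraphs, position_info
```

For a text with three paragraphs this reports a count of three and the chain 0 to 1 to 2, which is the information the position definition step needs to lay the paragraphs out in order.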

[0051] The sound source position definition unit 6 uses the sound source position definition information passed from the text analysis unit 2 to define virtual sound source positions in the sound space. Meanwhile, the speech synthesis unit 4 synthesizes speech for the text of each of paragraphs B1, B2, and B3 and passes the synthesized speech to the sound image localization processing unit 7. The sound image localization processing unit 7 localizes the synthesized speech for each paragraph at the virtual sound source position defined by the sound source position definition unit 6.
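The localization step can be illustrated with a minimal sketch. The source does not say how the sound image localization processing unit 7 places a sound, so the constant-power stereo panning below is a stand-in technique chosen only for illustration, and the names `constant_power_gains` and `localize` are hypothetical.

```python
import math


def constant_power_gains(pan):
    """Return (left_gain, right_gain) for a pan value in [-1.0, 1.0].

    Assumption: -1.0 means the left end of the sound space, 0.0 the
    center, and 1.0 the right end. Constant-power panning is a stand-in
    for the localization method, which the source does not specify.
    """
    if not -1.0 <= pan <= 1.0:
        raise ValueError("pan must be within [-1.0, 1.0]")
    angle = (pan + 1.0) * math.pi / 4.0  # maps pan to 0 .. pi/2
    return math.cos(angle), math.sin(angle)


def localize(mono_samples, pan):
    """Turn mono samples into (left, right) pairs placed at `pan`."""
    left_gain, right_gain = constant_power_gains(pan)
    return [(s * left_gain, s * right_gain) for s in mono_samples]
```

With this sketch, `localize(samples, -1.0)` sends the speech entirely to the left channel, while a pan of `0.0` splits it equally between both channels at constant total power.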

[0052] As a result, the three-dimensional sound image output unit 8 outputs the synthesized speech for paragraphs B1, B2, and B3 from a different position in the sound space for each paragraph.

[0053] For example, as shown in FIG. 5, the synthesized speech for paragraph B1 is heard from near the left end of the sound space 20, that for paragraph B2 from near the center, and that for paragraph B3 from near the right end, so that the synthesized speech for each paragraph is heard from a different position in the sound space 20.

[0054] When the whole text is divided into groups that follow its flow (paragraphs, in this case), the sound source positions of the groups can be defined so that they change in one direction (indicated by arrow b) along the flow of the text, as shown in FIG. 5. The sound source position then moves step by step in one direction as the text progresses, which makes the content of the text easier for the listener to follow.
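The one-directional arrangement described above can be sketched as an even spacing of positions between the two ends of the sound space. This is only an illustration under stated assumptions: the source does not say the spacing is even, and the function name `linear_positions` and the pan scale (-1.0 for the left end, 1.0 for the right end) are hypothetical.

```python
def linear_positions(count, left=-1.0, right=1.0):
    """Return `count` pan positions ordered from `left` to `right`.

    Assumption: content groups are spread evenly. The source only
    requires that positions change in one direction along the text.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    if count == 1:
        return [(left + right) / 2.0]
    step = (right - left) / (count - 1)
    return [left + i * step for i in range(count)]


# Three paragraphs B1, B2, B3 map to the left end, center, and right end.
paragraph_pans = dict(zip(["B1", "B2", "B3"], linear_positions(3)))
```

Passing `left=1.0, right=-1.0` gives the reverse direction, which the text also allows.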

[0055] In this way, the synthesized speech for each paragraph is emitted from a different position in the sound space, and the sound source position moves step by step in one direction along the flow of the text. Even when the synthesized speech is generated from a long text, the listener can tell intuitively which part of the text is currently being read.

[0056] Accordingly, when only a certain part of the text is read out selectively, the listener can tell roughly where in the whole text that part lies. Also, when the listener wants to hear a certain part again later, that part can be selected by specifying a position in the sound space based on its sound source position.
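Selecting a part for replay by its position can be sketched as a nearest-position lookup. The source does not describe how a position is specified or matched, so the nearest-neighbor rule, the pan scale, and the name `segment_at` are assumptions made for illustration.

```python
def segment_at(positions, requested):
    """Return the name of the segment whose pan is closest to `requested`.

    `positions` maps a segment name to its pan position. Choosing the
    nearest segment is an assumed rule, not one stated in the source.
    """
    if not positions:
        raise ValueError("no segments defined")
    return min(positions, key=lambda name: abs(positions[name] - requested))


paragraph_pans = {"B1": -1.0, "B2": 0.0, "B3": 1.0}
replay_target = segment_at(paragraph_pans, 0.8)  # user points near the right end
```

Here a request near the right end resolves to paragraph B3, which could then be synthesized and played again.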

[0057] Although paragraphs are used here as the example of content groups, content groups are not limited to paragraphs. For example, a text that has several headings, each followed by its own body text, can be handled in the same way.

[0058] As a second example (referred to as Specific Example 2-2), several logical elements that make up a text are treated as content groups in the same way as in Specific Example 2-1, and a sound source position is defined in the virtual sound space for each element. For example, if the text has a clearly expressed introduction, development, turn, and conclusion structure (kishōtenketsu), each part of that structure is treated as an element, and a sound source position is defined in the virtual sound space for each element.

[0059] A technical paper is one example of a text with such a clearly expressed structure. In this case, the text analysis unit 2 reads the technical paper as text 1 to be synthesized and analyzes it to determine its structure. The elements corresponding to the parts of the structure (for example, title, abstract, preface, body, and conclusion) are identified, and the text of each element is passed to the speech synthesis unit 4. At the same time, information on how many elements there are and information indicating how the elements are connected are passed to the sound source position definition unit 6 as sound source position definition information. Here, the title, abstract, and preface together form one element C1; the body is divided into two parts, which form elements C2 and C3; and the conclusion forms element C4.

[0060] The sound source position definition unit 6 uses the sound source position definition information passed from the text analysis unit 2 to determine virtual sound source positions in the sound space. Meanwhile, the speech synthesis unit 4 synthesizes speech for the text of each of elements C1, C2, C3, and C4 and passes the synthesized speech to the sound image localization processing unit 7. The sound image localization processing unit 7 localizes the synthesized speech for elements C1, C2, C3, and C4 at the virtual sound source positions determined by the sound source position definition unit 6.

[0061] As a result, the three-dimensional sound image output unit 8 outputs the synthesized speech obtained from the text of elements C1, C2, C3, and C4 from a different position in the sound space for each element.

[0062] For example, as shown in FIG. 6, the synthesized speech for elements C1, C2, C3, and C4 is heard in the sound space 20 with its sound source position moving step by step from the left end toward the right end.

[0063] As in Specific Example 2-1, if the sound source position of each group (here, elements C1, C2, C3, and C4 corresponding to the parts of the structure) is defined so that it changes in one direction (the direction of arrow b) along the flow of the text, the sound source position moves step by step in one direction as the text progresses, as shown in FIG. 6. This makes it easier for the listener to follow the development of the argument.

[0064] In this way, the synthesized speech for each element of the text is emitted from a different position, so the listener can tell intuitively which part of the text is currently being read. This is particularly effective for texts with a clearly expressed structure, such as technical papers. Moreover, because a sound source position is defined in the sound space for each element of the text, a part that the listener wants to hear again later can easily be selected by specifying a position in the sound space based on the position of its sound, as in Specific Example 2-1.

[0065] In Specific Examples 2-1 and 2-2, the sound source position was described as moving step by step in one direction, from the left end toward the right end (the reverse direction is equally possible). When many sound source positions need to be defined, they may instead be arranged so as to trace a large loop in the sound space 20. The essential point is that the sound source positions are defined so that the sound source changes step by step along the flow of the text.
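The loop arrangement can be sketched by placing positions at equal angles on a circle around the listener. The source says only that the positions may trace a large loop, so the circular shape, the starting point at the listener's left, the direction of travel, and the name `loop_positions` are all assumptions made for illustration.

```python
import math


def loop_positions(count, radius=1.0):
    """Return `count` (x, y) points on a circle around the listener.

    Assumed coordinates: x grows to the listener's right, y to the front.
    The loop starts at the left and proceeds via front, right, and rear.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    points = []
    for i in range(count):
        angle = math.pi - 2.0 * math.pi * i / count
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
    return points
```

For four content groups this yields positions at the left, front, right, and rear in that order, so the sound source still moves step by step along the flow of the text.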

[0066] [Specific Example 3] In Specific Example 3, when there are several independent texts, a sound source position is defined for each text, and the synthesized speech for each text is output from the corresponding sound source position. For example, if there are three texts (not shown; referred to as D1, D2, and D3), the contents of texts D1, D2, and D3 are passed to the speech synthesis unit 4 and synthesized text by text. At the same time, sound source position definition information indicating that there are three texts D1, D2, and D3 is passed to the sound source position definition unit 6, which defines the virtual sound source positions.

[0067] As a result, the three-dimensional sound image output unit 8 outputs the synthesized speech for texts D1, D2, and D3 from a different sound source position for each text.

[0068] When there are many such independent texts, the texts can be clustered according to their content or the like, and a sound source position in the sound space can be defined for each resulting cluster.

[0069] FIG. 7 shows the result of such clustering. Here, the clustering is assumed to have produced four clusters, E1, E2, E3, and E4 (in FIG. 7, the black dots inside each cluster represent the individual texts belonging to that cluster).

[0070] This clustering is performed by the text analysis unit 2. That is, the text analysis unit 2 analyzes each of the many input texts 1 and clusters them based on the analysis results. The texts of each resulting cluster (here, clusters E1, E2, E3, and E4) are passed to the speech synthesis unit 4, and sound source position definition information such as the number of clusters is passed to the sound source position definition unit 6.

[0071] The sound source position definition unit 6 uses the sound source position definition information passed from the text analysis unit 2 to determine virtual sound source positions in the sound space. Meanwhile, the speech synthesis unit 4 synthesizes speech for each text belonging to clusters E1, E2, E3, and E4 and passes the synthesized speech to the sound image localization processing unit 7. The sound image localization processing unit 7 localizes the synthesized speech of each text at the virtual sound source position determined for its cluster by the sound source position definition unit 6.

[0072] As a result, the three-dimensional sound image output unit 8 outputs the synthesized speech obtained from the texts belonging to clusters E1, E2, E3, and E4 from a different position in the sound space for each cluster. For example, as shown in FIG. 8, texts belonging to cluster E1 are heard from the rear center of the sound space 20, texts belonging to cluster E2 from the left end, texts belonging to cluster E3 from the front center, and texts belonging to cluster E4 from the right end, so that each cluster is heard from a different position.
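The cluster-to-position assignment of FIG. 8 can be sketched as a lookup table. The coordinates, the text identifiers, and the name `playback_plan` are hypothetical, and the cluster membership is taken as given because the source does not specify the clustering algorithm.

```python
# Assumed coordinates: x grows to the listener's right, y to the front.
CLUSTER_POSITIONS = {
    "E1": (0.0, -1.0),  # rear center
    "E2": (-1.0, 0.0),  # left end
    "E3": (0.0, 1.0),   # front center
    "E4": (1.0, 0.0),   # right end
}


def playback_plan(cluster_of):
    """Group text ids by the sound source position of their cluster.

    `cluster_of` maps a text id to a cluster name; it stands in for the
    output of the text analysis unit's clustering.
    """
    plan = {}
    for text_id, cluster in cluster_of.items():
        plan.setdefault(CLUSTER_POSITIONS[cluster], []).append(text_id)
    return plan


plan = playback_plan({"t1": "E2", "t2": "E4", "t3": "E2"})
```

In this usage, texts `t1` and `t3` share the left-end position of cluster E2, while `t2` is placed at the right end with cluster E4.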

[0073] Because the synthesized speech for the texts of each cluster is emitted from a different position, it is easy to grasp which cluster a given text belongs to, and the listener can tell intuitively which cluster's text is currently being read.

[0074] Moreover, because a sound source position is defined in the sound space for each cluster, when the listener wants to hear the texts of a certain cluster again, that cluster can be selected by specifying a position in the sound space based on the position of the corresponding sound.

[0075] As a similar example to Specific Example 3, for a large number of texts expressing opinions, such as discussion results or readers' letters that are in favor, against, or neutral, a sound source position in the sound space can be defined for each kind of opinion.

[0076] The text analysis unit 2 also performs the processing that judges, from the content of each of the many texts, whether it expresses an opinion in favor, against, or neutral, and classifies it accordingly. The texts belonging to the resulting groups, namely group F1 (in favor), group F2 (against), and group F3 (neutral), are passed to the speech synthesis unit 4. At the same time, information such as the fact that the input texts are classified into three groups (in favor, against, and neutral) is passed to the sound source position definition unit 6 as sound source position definition information.

[0077] As a result, the three-dimensional sound image output unit 8 outputs the synthesized speech obtained from the texts belonging to group F1 (in favor), group F2 (against), and group F3 (neutral) from a different position in the sound space for each opinion group.

[0078] For example, as shown in FIG. 9, the synthesized speech for texts in group F1 (in favor) is heard from the left end of the sound space 20, that for texts in group F2 (against) from the right end, and that for texts in group F3 (neutral) from near the center, so that the synthesized speech for each opinion group is heard from a different position.

[0079] Because the synthesized speech for the texts of each opinion is emitted from a different position, it is easy to grasp which opinion a given text expresses, and the listener can tell intuitively which opinion's text is currently being read.

[0080] Moreover, because a sound source position is defined in the sound space for each opinion, when the listener wants to hear the texts expressing a certain opinion again, that opinion can be selected by specifying a position in the sound space based on the position of the corresponding sound.

[0081] [Specific Example 4] Specific Example 4 is an example in which, in an information processing system with a multitasking function, the sound source is localized at a different position for each application.

[0082] As a first example (referred to as Specific Example 4-1), when several windows or dialogs are open in the GUI (Graphical User Interface) of a personal computer (hereinafter, PC) or the like, a different sound source position in the sound space is assigned to each window or dialog.

[0083] For example, suppose three windows G1, G2, and G3 are open on the screen 30 of the PC as shown in FIG. 10, and synthesized speech is to be output for the text in each window. As shown in FIG. 11, the sound source position of the synthesized speech for the text in window G1 is defined toward the upper left of the sound space 20, that for window G2 directly above, and that for window G3 toward the upper right, so that each window has a different sound source position.

[0084] The listener can thus tell intuitively, from the position where the synthesized speech is output, which window's text is being read.

[0085] As a second example of Specific Example 4 (referred to as Specific Example 4-2), consider a portable information device 40 that has, for example, a scheduler function, an e-mail sending/receiving function, and a web news browsing function, and that can output sound as stereophonic sound through headphones 41 or the like, as shown in FIG. 12. For such a device, a sound source position can be defined in the sound space 20, as shown in FIG. 13, for each of application H1, which provides the scheduler function, application H2, which provides the e-mail sending/receiving function, and application H3, which provides the web news browsing function.

[0086] In the example of FIG. 13, the sound source position of the synthesized speech for text created by the scheduler application H1 is defined near the left end of the sound space 20, that for text created by the e-mail application H2 near the center, and that for text acquired by the web news browsing application H3 near the right end.

[0087] The listener can thus tell intuitively, from the position where each synthesized speech is output, which application's text is being read.

[0088] As in Specific Examples 4-1 and 4-2, a sound source position can be defined for each application. Furthermore, even within a single application, a sound source can be localized for each of the multiple texts obtained by that application.

[0089] For example, in an application that sends and receives e-mail, correspondents can be divided into several groups according to whether a message is work-related, from family, from a friend, and so on, and a sound source position can be assigned to each group. The listener can then tell intuitively what kind of correspondent a message is from.

[0090] In Specific Example 4, when text 1 of each application is input to the text analysis unit 2, data indicating which application the text belongs to may be attached to text 1 as a header. The text analysis unit 2 can then identify the application by examining the header and pass that identification to the sound source position definition unit 6 as sound source position definition information. The text belonging to each application is passed to the speech synthesis unit 4 together with information identifying the application.
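The header-based identification can be sketched as follows. The source does not define the header format, so the `App: <name>` line followed by a blank line is a hypothetical format, as are the application names, the pan values, and the functions `split_app_header` and `route`. The left, center, and right placement follows the FIG. 13 example.

```python
# Assumed pan scale: -1.0 is the left end, 0.0 the center, 1.0 the right end.
APP_POSITIONS = {"scheduler": -1.0, "mail": 0.0, "news": 1.0}


def split_app_header(raw):
    """Split 'App: <name>\\n\\n<body>' into (name, body).

    The header format is an assumption; the source states only that data
    identifying the application is attached to the text as a header.
    """
    header, separator, body = raw.partition("\n\n")
    if not separator or not header.startswith("App:"):
        raise ValueError("missing application header")
    return header[len("App:"):].strip(), body


def route(raw):
    """Return (application, pan position, text to synthesize)."""
    app, body = split_app_header(raw)
    return app, APP_POSITIONS[app], body


result = route("App: mail\n\nMeeting moved to 10:00.")
```

In this usage the application name goes to position definition, the pan value selects the center of the sound space, and only the body text is handed on for synthesis.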

[0091] In Specific Example 4, when an application has several frames, a sound source position can also be assigned to each frame.

[0092] The present invention is not limited to the specific examples described above, and various modifications can be made without departing from the gist of the invention.

[0093] A processing program describing the processing procedure for implementing the present invention as described above can be created and recorded on a recording medium such as a floppy disk, an optical disk, or a hard disk, and the present invention also covers a recording medium on which the processing program is recorded. The processing program may also be obtained from a network.

[0094]

[Effects of the Invention] As described above, according to the present invention, a text input as a speech synthesis target is analyzed to extract the text to be synthesized and to extract, as sound source position definition information, the information needed to define the sound source position of the synthesized speech. A virtual sound source position in the sound space is defined using the sound source position definition information, speech synthesis is performed on the text to be synthesized, the resulting synthesized speech is localized at the defined virtual sound source position, and the synthesized speech is output from that localized position. This increases the amount of information that can be conveyed to the user. Merely synthesizing a text and outputting it as synthesized speech conveys only the content of the text. According to the present invention, by contrast, various kinds of additional information beyond the synthesized speech itself can be conveyed to the user, such as which part of the text is currently being read or, when there are several texts, which text is being read.

[Brief description of drawings]

FIG. 1 is a configuration diagram of the synthetic voice reproduction device of the present invention.

FIG. 2 illustrates Specific Example 1 of the embodiment of the present invention and shows several speakers seated around a round table.

FIG. 3 illustrates an example in which the sound source positions of the synthesized speech for the texts of the speakers shown in FIG. 2 are defined in the sound space for each speaker.

FIG. 4 shows an example of a text consisting of several paragraphs, used to explain Specific Example 2-1 of the embodiment of the present invention.

FIG. 5 illustrates an example in which the sound source positions of the synthesized speech for the paragraphs of the text of FIG. 4 are defined in the sound space for each paragraph.

FIG. 6 illustrates Specific Example 2-2 of the embodiment of the present invention: an example in which, for a text with a clearly expressed structure (such as introduction, development, turn, and conclusion), the sound source positions of the synthesized speech for the elements are defined in the sound space for each element.

FIG. 7 shows an example in which several texts, used to explain Specific Example 3 of the embodiment of the present invention, have been clustered into several clusters.

FIG. 8 illustrates an example in which the sound source positions of the synthesized speech for the texts belonging to the clusters obtained by the clustering shown in FIG. 7 are defined in the sound space for each cluster.

FIG. 9 illustrates, as a similar example to Specific Example 4, an example in which the sound source positions of the synthesized speech for a large number of texts expressing opinions (in favor, against, neutral, and so on), such as discussion results or readers' letters, are defined in the sound space for each kind of opinion.

FIG. 10 illustrates Specific Example 4-1 of the embodiment of the present invention and shows an example in which a different window is displayed on the screen for each application in an information processing system with a multitasking function.

FIG. 11 illustrates an example in which the sound source positions of the synthesized speech for the texts of the applications shown in FIG. 10 are defined in the sound space for each application.

FIG. 12 illustrates Specific Example 4-2 of the embodiment of the present invention and shows an example of a portable information device that has several functions and can output sound as stereophonic sound.

FIG. 13 illustrates an example in which, for a portable information device such as the one shown in FIG. 12, the sound source positions of the synthesized speech for the texts created or acquired by the applications corresponding to the respective functions are defined in the sound space for each application.

[Explanation of symbols]

1 Text data to be speech-synthesized
2 Text analysis unit
3 Text analysis dictionary
4 Speech synthesis unit
5 Speech synthesis dictionary
6 Sound source position definition unit
7 Sound image localization processing unit
8 Three-dimensional sound image output unit
A1, A2, A3 Speakers
B1, B2, B3 Paragraphs
C1, C2, C3, C4 Elements corresponding to introduction, development, turn, and conclusion (kishōtenketsu)
E1, E2, E3, E4 Clusters obtained by clustering
F1, F2, F3 Opinion groups (in favor, opposed, neutral)
G1, G2, G3 Windows opened on the screen
H1 Application for the scheduler function
H2 Application for the e-mail sending/receiving function
H3 Application for the web news browsing function

Claims

[Claims]

1. A synthetic voice reproduction method characterized by: analyzing a text input as a speech synthesis target; extracting the text to be speech-synthesized and, as sound source position definition information, the information necessary for defining the sound source position of the synthesized speech in a virtual sound space; defining a sound source position in the virtual sound space using the sound source position definition information; performing speech synthesis processing on the text to be speech-synthesized; localizing the resulting synthesized speech at the virtual sound source position defined in the sound space; and outputting the synthesized speech from the localized position.

2. The synthetic voice reproduction method according to claim 1, wherein, when a plurality of content groups are present in the text to be speech-synthesized, a virtual sound source position in the sound space is defined for each content group, and the synthesized speech for each content group is output from the sound source position associated with that content group.

3. The synthetic voice reproduction method according to claim 2, wherein, in defining a sound source position for each content group, when the content groups occur in an order that follows the flow of the text, the sound source positions for the content groups are defined so as to change in that order along the flow.

4. The synthetic voice reproduction method according to claim 1, wherein, when the text to be speech-synthesized consists of a plurality of independent texts, a virtual sound source position in the sound space is defined for each text, and the synthesized speech for each text is output from the sound source position associated with that text.

5. The synthetic voice reproduction method according to claim 4, wherein, when the plurality of independent texts can be classified by a certain criterion, a virtual sound source position in the sound space is defined for each classification, and the synthesized speech for the texts belonging to each classification is output from the sound source position associated with that classification.

6. The synthetic voice reproduction method according to claim 1, wherein, when the text to be speech-synthesized is text existing in a plurality of applications in an information processing system having a multitask function, a virtual sound source position in the sound space is defined for each application, and the synthesized speech for the text existing in each application is output from the sound source position associated with that application.

7. A synthesized voice reproduction device characterized by comprising: text analysis means for analyzing a text to be speech-synthesized and extracting the text information to be speech-synthesized together with sound source position definition information for defining the sound source position of the synthesized speech in a virtual sound space; sound source position defining means for defining a virtual sound source position in the sound space using the sound source position definition information extracted by the text analysis means; speech synthesizing means for performing speech synthesis processing on the text information to be speech-synthesized extracted by the text analysis means; sound image localization means for localizing the synthesized speech obtained by the speech synthesizing means at the virtual sound source position in the sound space defined by the sound source position defining means; and three-dimensional sound image output means for outputting the synthesized speech from the sound source position localized by the sound image localization means.

8. The synthesized voice reproduction device according to claim 7, wherein, when a plurality of content groups are present in the text to be speech-synthesized, a virtual sound source position in the sound space is defined for each content group, and the synthesized speech for each content group is output from the sound source position associated with that content group.

9. The synthesized voice reproduction device according to claim 8, wherein, in defining a sound source position for each content group, when the content groups occur in an order that follows the flow of the text, the sound source positions for the content groups are defined so as to change in that order along the flow.

10. The synthesized voice reproduction device according to claim 7, wherein, when the text to be speech-synthesized consists of a plurality of independent texts, a virtual sound source position in the sound space is defined for each text, and the synthesized speech for each text is output from the sound source position associated with that text.

11. The synthesized voice reproduction device according to claim 10, wherein, when the plurality of independent texts can be classified by a certain criterion, a virtual sound source position in the sound space is defined for each classification, and the synthesized speech for the texts belonging to each classification is output from the sound source position associated with that classification.

12. The synthesized voice reproduction device according to claim 7, wherein, when the text to be speech-synthesized is text existing in a plurality of applications in an information processing system having a multitask function, a virtual sound source position in the sound space is defined for each application, and the synthesized speech for the text existing in each application is output from the sound source position associated with that application.
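
As a non-limiting illustration of the flow recited in claims 1 and 7, the following Python sketch shows one way the text analysis, sound source position definition, and localization steps might fit together. Everything named here is an assumption made for illustration and does not come from the specification: the `group: text` input format, the function names `analyze`, `define_positions`, and `pan_gains`, and the azimuth range of -60 to +60 degrees. Constant-power stereo panning is swapped in as a simple stand-in for the sound image localization processing, and the speech synthesis step itself is omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    group: str  # hypothetical label: speaker, paragraph, cluster, or application
    text: str   # text to be speech-synthesized


def analyze(tagged_lines):
    """Text analysis step (sketch): split 'group: text' lines into Segments.

    The 'group: text' format is an assumption of this example only.
    """
    segments = []
    for line in tagged_lines:
        group, _, text = line.partition(":")
        segments.append(Segment(group.strip(), text.strip()))
    return segments


def define_positions(segments, left_deg=-60.0, right_deg=60.0):
    """Position definition step (sketch): one azimuth per group.

    Groups are spread evenly from left to right in order of first appearance,
    so positions change in order along the flow of the text.
    """
    groups = list(dict.fromkeys(s.group for s in segments))
    if len(groups) <= 1:
        return {g: 0.0 for g in groups}
    step = (right_deg - left_deg) / (len(groups) - 1)
    return {g: left_deg + i * step for i, g in enumerate(groups)}


def pan_gains(azimuth_deg, max_deg=90.0):
    """Localization stand-in: constant-power panning gains (left, right).

    -max_deg maps to full left, 0 to center, +max_deg to full right.
    """
    theta = (azimuth_deg / max_deg + 1.0) * math.pi / 4.0  # range 0 .. pi/2
    return math.cos(theta), math.sin(theta)


if __name__ == "__main__":
    segs = analyze(["A1: Good morning.", "A2: Hello.", "A3: Hi.", "A1: Shall we begin?"])
    positions = define_positions(segs)
    for seg in segs:
        left, right = pan_gains(positions[seg.group])
        print(f"{seg.group} at {positions[seg.group]:+.0f} deg, "
              f"L={left:.2f} R={right:.2f}: {seg.text}")
```

In this sketch the group label plays the role of the sound source position definition information: the same mapping could be keyed by speaker, paragraph, classification, or application, corresponding to the dependent claims. An actual device would pass each segment to a speech synthesizer and apply a true three-dimensional localization method in place of the two-channel gains.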