JP2002073069A

JP2002073069A - Voice synthesizer, voice synthesis method and information storage medium

Info

Publication number: JP2002073069A
Application number: JP2000263544A
Authority: JP
Inventors: Toshiyuki Mizoguchi; 稔幸溝口; Osamu Kasai; 治笠井
Original assignee: Konami Computer Entertainment Co Ltd; Konami Computer Entertainment Tokyo Inc
Current assignee: Konami Computer Entertainment Co Ltd; Konami Computer Entertainment Tokyo Inc
Priority date: 2000-08-31
Filing date: 2000-08-31
Publication date: 2002-03-12
Anticipated expiration: 2020-08-31
Also published as: JP3718116B2

Abstract

PROBLEM TO BE SOLVED: To improve the quality of synthesized speech by connecting two pieces of basic voice data that are to be reproduced consecutively at a section suitable for connection, within a phoneme section corresponding to the same phoneme. SOLUTION: First basic voice data (CV format) representing 'na' and second basic voice data (VCV format) representing 'aka' are connected in a phoneme section (V section) representing the same vowel 'a' to generate synthesized voice data representing 'naka'. In the first basic voice data, a connection candidate section is set inside at least the phoneme section representing 'a'; likewise, in the second basic voice data, a connection candidate section is set inside at least the phoneme section representing the leading 'a'. A connection position is then determined within each of these connection candidate sections, and the first and second basic voice data are connected at those positions to generate the synthesized voice data.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech synthesizer, a speech synthesis method, and an information storage medium, and more particularly to a technique for reliably connecting two pieces of basic voice data that are to be reproduced consecutively at a position suitable for connection.

[0002]

[Description of the Related Art] Because speech is the most natural means of communication for humans, speech synthesis technology is increasingly used not only in household appliances but also in home and commercial game machines and in game software. For example, if the player's name is entered as text in advance, and the entered characters are synthesized into speech and spoken at appropriate points in the game, the game becomes more enjoyable.

[0003] Various speech synthesis techniques have been proposed. Among them, one technique prepares in advance a large number of pieces of basic voice data, each recording either a natural speech waveform itself or parameters for restoring the waveform of natural speech or of speech close to it, and combines them according to, for example, a character string entered by the user to generate synthesized voice data representing the waveform of the synthesized speech. This technique is highly useful because it can make the synthesized speech relatively natural.

[0004]

[Problems to Be Solved by the Invention] Specifically, the above speech synthesis technique selects, from a large number of pieces of basic voice data, a basic voice data sequence corresponding to the synthesized speech, and connects the selected pieces to generate synthesized voice data for reproducing the synthesized speech. When, of two consecutive pieces of basic voice data in the selected sequence, the one reproduced first ends with a certain phoneme and the one reproduced next begins with the same phoneme, the conventional approach searches the two pieces for a portion where their waveforms are similar (in a parameter editing method, a portion where the parameters are similar; the same applies hereinafter) and connects them there. For example, when basic voice data is recorded in VCV (vowel-consonant-vowel) format and consecutive pieces are connected in a V section corresponding to the same phoneme, or when basic voice data is recorded in CVC (consonant-vowel-consonant) format and consecutive pieces are connected in a C section corresponding to the same phoneme, a timing at which the waveforms are similar is searched for within the V section or C section of the preceding and following pieces, and the two pieces are connected at that timing. The same applies when basic voice data recorded in VCV format is connected after basic voice data recorded in CV (consonant-vowel) format in a V section corresponding to the same phoneme, and when basic voice data recorded in CV format is connected after basic voice data recorded in CVC format in a C section corresponding to the same phoneme. When two pieces of basic voice data are connected in a section corresponding to the same phoneme in this way, connecting them where the waveforms are similar makes the joint inconspicuous and improves the quality of the synthesized speech.

[0005] However, even when two pieces of basic voice data are connected at a portion where their waveforms are similar, the waveforms may happen to be similar in a transient section (a portion that transitions from one phoneme to another; for example, the middle of VC or CV in basic voice data recorded in VCV format) or in a leading or trailing section of the basic voice data. If the pieces are connected at such a portion, the joint instead becomes conspicuous and the quality of the synthesized speech deteriorates.

[0006] The present invention has been made in view of the above problem, and its object is to provide a speech synthesizer, a speech synthesis method, and an information storage medium that can reliably connect two pieces of basic voice data to be reproduced consecutively at a position suitable for connection, thereby improving the quality of the synthesized speech.

[0007]

[Means for Solving the Problems] To solve the above problem, a speech synthesizer according to the present invention includes: basic voice data storage means for storing a plurality of pieces of basic voice data; basic voice data sequence selection means for selecting, from the plurality of pieces of basic voice data, a basic voice data sequence corresponding to the synthesized speech; and synthesized voice data generation means for connecting the selected basic voice data sequence to generate synthesized voice data for reproducing the synthesized speech. The speech synthesizer further includes connection candidate section specifying data storage means for storing connection candidate section specifying data that is associated with each of at least two of the plurality of pieces of basic voice data and that specifies, for each of the two pieces, a connection candidate section set inside a section corresponding to a predetermined phoneme. When the basic voice data sequence selected by the basic voice data sequence selection means contains the two pieces of basic voice data adjacent to each other, the synthesized voice data generation means reads the connection candidate section specifying data corresponding to the two pieces from the connection candidate section specifying data storage means, determines a connection position for each of the two pieces within the range of the connection candidate section specified by that data, and connects the two pieces at the connection positions.
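
The storage and lookup described in this paragraph can be sketched roughly as follows. This is only an illustration under stated assumptions: the table `CANDIDATE_SECTIONS`, the "head"/"tail" keys, the function name, and all sample indices are hypothetical and do not come from the patent.

```python
from typing import Dict, List, Tuple

Section = Tuple[int, int]  # (start, end) sample indices, end exclusive

# Hypothetical connection candidate section specifying data.
# "tail" is the section used when a unit is followed by another unit,
# "head" is the section used when a unit is preceded by another unit.
CANDIDATE_SECTIONS: Dict[str, Dict[str, Section]] = {
    "na": {"tail": (300, 500)},
    "aka": {"head": (100, 250), "tail": (900, 1100)},
}


def sections_for_adjacent_units(
    sequence: List[str],
) -> List[Tuple[str, Section, str, Section]]:
    """For each adjacent pair in the selected sequence, read the two
    candidate sections that bound where the pair may be joined."""
    result = []
    for first, second in zip(sequence, sequence[1:]):
        tail = CANDIDATE_SECTIONS[first]["tail"]
        head = CANDIDATE_SECTIONS[second]["head"]
        result.append((first, tail, second, head))
    return result
```

The connection position for each unit would then be chosen somewhere inside the returned sections; how it is chosen is left open here.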

[0008] Inside a section corresponding to a predetermined phoneme in basic voice data, there are, depending on the phoneme, sections suitable for connection and sections unsuitable for connection, such as transient sections and stable sounding sections. According to the present invention, the connection candidate section can be reliably set to a section suitable for connecting basic voice data, so that the two pieces of basic voice data to be reproduced consecutively are connected in a section suitable for connection within the section corresponding to the same phoneme (the predetermined phoneme), thereby improving the quality of the synthesized speech. If the basic voice data is recorded voice data, the synthesized speech can be made even more natural.

[0009] In one aspect of the present invention, the speech synthesizer further includes connection candidate position specifying data storage means for storing connection candidate position specifying data that specifies a plurality of connection candidate positions set within each connection candidate section. The synthesized voice data generation means uses the connection candidate position specifying data stored in the connection candidate position specifying data storage means to identify the connection candidate positions set within the connection candidate sections specified by the connection candidate section specifying data corresponding to the two pieces of basic voice data, and selects the connection position for each of the two pieces from among those connection candidate positions. According to this aspect, because positions within the connection candidate section that are suitable as connection candidate positions can be specified in advance by the connection candidate section specifying data, the joint between pieces of basic voice data can be made inconspicuous with an even lighter processing load, thereby improving the quality of the synthesized speech.
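
One way to picture why preset candidate positions keep the processing load light: only a handful of candidate pairs need to be compared, rather than every sample offset. The sketch below picks the pair whose following samples differ least. The similarity criterion, the `window` parameter, and the function name are assumptions made for illustration; this paragraph of the patent does not say how a position is selected from the candidates.

```python
from typing import List, Sequence, Tuple


def pick_connection_positions(
    first: Sequence[float],
    first_candidates: List[int],
    second: Sequence[float],
    second_candidates: List[int],
    window: int = 4,
) -> Tuple[int, int]:
    """Return the candidate pair whose next `window` samples differ least
    (sum of squared differences). Only preset candidates are examined."""
    best_pair = None
    best_cost = float("inf")
    for p1 in first_candidates:
        for p2 in second_candidates:
            a = first[p1:p1 + window]
            b = second[p2:p2 + window]
            if len(a) < window or len(b) < window:
                continue  # candidate too close to the end of the data
            cost = sum((x - y) ** 2 for x, y in zip(a, b))
            if cost < best_cost:
                best_pair, best_cost = (p1, p2), cost
    if best_pair is None:
        raise ValueError("no usable candidate pair")
    return best_pair
```

With two or three candidates per unit, the loop runs only a few iterations regardless of how long the phoneme sections are.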

[0010] In one aspect of the present invention, the connection candidate section is set within a stable utterance section of the predetermined phoneme. This makes the joint between pieces of basic voice data inconspicuous.

[0011] In one aspect of the present invention, the speech synthesizer further includes syllable length determination means for determining the length of a syllable containing the predetermined phoneme corresponding to the joint between the two pieces of basic voice data, and the synthesized voice data generation means determines the connection positions based on the syllable length determined by the syllable length determination means. In this way, the length of the syllable containing the predetermined phoneme at the joint between the two pieces can be adjusted to the length determined by the syllable length determination means.
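
A minimal sketch of length-driven selection, assuming the joined syllable consists of the first unit from the syllable start up to its connection position plus the second unit from its connection position to the end of the shared phoneme. The function name, parameters, and the nearest-length rule are hypothetical choices for illustration.

```python
from itertools import product
from typing import List, Tuple


def pick_positions_for_length(
    syllable_start: int,          # where the syllable begins in the first unit
    first_candidates: List[int],  # candidate positions in the first unit
    second_candidates: List[int], # candidate positions in the second unit
    phoneme_end: int,             # where the shared phoneme ends in the second unit
    target_length: int,           # desired syllable length in samples
) -> Tuple[int, int]:
    """Choose the candidate pair whose joined syllable length is closest
    to the target length."""

    def joined_length(p1: int, p2: int) -> int:
        return (p1 - syllable_start) + (phoneme_end - p2)

    return min(
        product(first_candidates, second_candidates),
        key=lambda pair: abs(joined_length(*pair) - target_length),
    )
```

Choosing a later position in the first unit or an earlier one in the second lengthens the syllable, and the opposite shortens it.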

[0012] In one aspect of the present invention, the basic voice data sequence selection means includes symbol string input means for inputting a symbol string representing the speech to be synthesized, and selects the basic voice data sequence based on the symbol string input through the symbol string input means. In this way, synthesized speech corresponding to the input symbol string can be obtained.
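
As a rough illustration of selecting a unit sequence from a symbol string, the sketch below splits a lowercase romanized string into one leading CV unit followed by VCV units, so that adjacent units share a vowel. The romanized input, the vowel set, and the function name are assumptions; the patent does not specify the input notation or the selection procedure at this point.

```python
from typing import List

VOWELS = set("aeiou")


def select_unit_sequence(symbols: str) -> List[str]:
    """Split a romanized string into a leading CV (or V) unit followed by
    VCV units, each starting with the vowel that ended the previous unit."""
    units: List[str] = []
    onset = ""           # consonants collected since the last vowel
    previous_vowel = ""  # vowel that ended the previous unit
    for ch in symbols:
        if ch not in VOWELS:
            onset += ch
            continue
        if previous_vowel:
            units.append(previous_vowel + onset + ch)  # VCV unit
        else:
            units.append(onset + ch)                   # leading CV unit
        previous_vowel, onset = ch, ""
    if onset:
        raise ValueError("trailing consonants are not handled in this sketch")
    return units
```

For the input "naka" this yields the units "na" and "aka", which share the vowel "a" where they meet.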

[0013] A speech synthesis method according to the present invention includes the steps of: acquiring two pieces of connection candidate section specifying data, which are associated with first and second basic voice data respectively and which each specify a connection candidate section set inside a section corresponding to a predetermined phoneme in the first and second basic voice data; determining a connection position for each of the first and second basic voice data within the range of the connection candidate section specified by the corresponding connection candidate section specifying data; and connecting the first and second basic voice data at the connection positions.

[0014] An information storage medium according to the present invention stores a program for causing a computer to execute the steps of: acquiring two pieces of connection candidate section specifying data, which are associated with first and second basic voice data respectively and which each specify a connection candidate section set inside a section corresponding to a predetermined phoneme in the first and second basic voice data; determining a connection position for each of the first and second basic voice data within the range of the connection candidate section specified by the corresponding connection candidate section specifying data; and connecting the first and second basic voice data at the connection positions.

[0015] Inside the section corresponding to the predetermined phoneme in the first and second basic voice data, there are, depending on the phoneme, sections suitable for connection and sections unsuitable for connection, such as transient sections and stable sounding sections. According to the present invention, the connection candidate sections can be set to sections suitable for connecting the first and second basic voice data, so that the first and second basic voice data are connected in a section suitable for connection within the section corresponding to the predetermined phoneme, thereby improving the quality of the synthesized speech.

[0016]

[Embodiments of the Invention] Preferred embodiments of the present invention will be described below in detail with reference to the drawings.

[0017] In the speech synthesis method according to the present embodiment, pieces of basic voice data are connected in sections corresponding to the same phoneme (phoneme sections) to generate synthesized voice data. A connection candidate section is set in advance inside each of those phoneme sections. If the phoneme section corresponds to a vowel, for example, the connection candidate section is set within a stable sounding section near the center, which depends little on the phonemic environment. If the phoneme section corresponds to a consonant, the connection candidate section is set near the center, in a section unaffected by the transition from or to another phoneme, that is, within a stable sounding section that depends little on the phonemic environment. The connection position of each piece of basic voice data is then determined within the range of this connection candidate section, and the pieces are connected at those positions to form the synthesized voice data.

[0018] In the speech synthesis method according to the present embodiment, a plurality of connection candidate positions are set in advance within each connection candidate section, and the specific connection position in each piece of basic voice data is chosen from among these connection candidate positions, so synthesized voice data can be generated with extremely light processing. The connection candidate positions are candidates for specific positions suitable for connecting basic voice data. For example, within the connection candidate section of each phoneme, a start point (pitch mark) that identifies one pitch period of each periodic waveform (which appears every 1/f0, where f0 is the fundamental frequency of the phoneme) may be chosen as a connection candidate position. Alternatively, for example, glottal closure points may be chosen as connection candidate positions.
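
To make the 1/f0 spacing concrete, the sketch below lists evenly spaced pitch-mark positions inside a connection candidate section. It idealizes the phoneme as having a constant f0 and a known first pitch mark; real pitch marks would be detected from the waveform and are not perfectly regular. The function name and all parameters are hypothetical.

```python
from typing import List


def pitch_mark_candidates(
    section_start: int,
    section_end: int,
    first_mark: int,
    f0_hz: float,
    sample_rate: int,
) -> List[int]:
    """List sample indices spaced one pitch period (1/f0 seconds) apart,
    starting at `first_mark` and staying inside the candidate section."""
    period = round(sample_rate / f0_hz)  # samples per pitch period
    if period <= 0:
        raise ValueError("f0 is too high for this sample rate")
    if not section_start <= first_mark < section_end:
        raise ValueError("first pitch mark lies outside the candidate section")
    return list(range(first_mark, section_end, period))
```

For example, at a sample rate of 8000 Hz and f0 of 100 Hz the period is 80 samples, so a section spanning samples 400 to 800 with a first mark at 410 holds five candidate positions.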

[0019] FIG. 1 illustrates how first basic voice data representing "na" and second basic voice data representing "aka" are connected to generate synthesized voice data representing "naka". FIG. 1(a) shows the first basic voice data, FIG. 1(b) shows the second basic voice data, and FIG. 1(c) shows the synthesized voice data. In the figure, hatching indicates connection candidate sections and vertical lines indicate the boundaries between phoneme sections. The first basic voice data is recorded in CV format, and the second basic voice data is recorded in VCV format. The synthesized voice data is formed by connecting them in the phoneme section (V section) representing "a" in the first basic voice data and the phoneme section (V section) representing the leading "a" in the second basic voice data. That is, the figure shows the case where first basic voice data recorded in CV format and second basic voice data recorded in VCV format are connected in phoneme sections representing the same vowel "a" to generate synthesized voice data. In the first basic voice data, a connection candidate section is set inside at least the phoneme section representing "a", and the connection position is determined within the range of this connection candidate section. Specifically, the connection position is determined from among the connection candidate positions set in advance within this connection candidate section. Similarly, in the second basic voice data, a connection candidate section is set inside at least the phoneme section representing the leading "a", and the connection position is determined within the range of this connection candidate section. Here too, the connection position is determined from among the connection candidate positions set in advance within this connection candidate section.

[0020] The first and second basic voice data are then connected at these connection positions to generate the synthesized voice data. At this time, the portion of the first basic voice data after its connection position and the portion of the second basic voice data before its connection position are discarded. Here, the connection candidate sections are set in V sections, within the stable sounding section of the vowel. Therefore, at every position in a connection candidate section, the waveform (the waveform itself, spectral parameters, or the like) can stably reproduce the sound "a". By determining the connection positions within the range of the connection candidate sections, the first and second basic voice data can be connected with an inconspicuous joint using less computation than the conventional technique, which searches the phoneme section representing "a" in the first basic voice data and the phoneme section representing "a" in the second basic voice data for a portion where the waveforms are similar and connects the two there. In addition, because the connection positions are determined within the connection candidate sections, the first and second basic voice data are reliably prevented from being connected at a position unsuitable for connection.
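
The connection step for "na" and "aka" can be sketched as a simple splice. In this illustration each list element stands for a stretch of waveform labeled by its phoneme; the function name, the section tuples, and the choice to keep the sample at the second unit's connection position while dropping the one at the first unit's are assumptions.

```python
from typing import List, Sequence, Tuple


def connect(
    first: Sequence,
    first_pos: int,
    first_section: Tuple[int, int],
    second: Sequence,
    second_pos: int,
    second_section: Tuple[int, int],
) -> List:
    """Join two units: keep the first unit before its connection position
    and the second unit from its connection position onward."""
    for pos, (start, end) in (
        (first_pos, first_section),
        (second_pos, second_section),
    ):
        if not start <= pos < end:
            raise ValueError(
                "connection position outside its connection candidate section"
            )
    return list(first[:first_pos]) + list(second[second_pos:])


# Symbolic stand-ins for the waveforms of "na" and "aka".
na = list("nnaaaa")
aka = list("aaaakkaa")
naka = connect(na, 4, (3, 5), aka, 2, (1, 3))
```

Rejecting positions outside the candidate sections is what keeps the join away from transient, leading, and trailing portions of each unit.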

[0021] Next, FIG. 2 illustrates how first basic voice data representing "hash" and second basic voice data representing "shi" are connected to generate synthesized voice data representing "hashi". FIG. 2(a) shows the first basic voice data, FIG. 2(b) shows the second basic voice data, and FIG. 2(c) shows the synthesized voice data. In the figure, hatching indicates connection candidate sections and vertical lines indicate the boundaries between phoneme sections. The first basic voice data is recorded in CVC format, and the second basic voice data is recorded in CV format. The synthesized voice data is formed by connecting them in the phoneme section (C section) representing "sh" in the first basic voice data and the phoneme section (C section) representing "sh" in the second basic voice data. That is, the figure shows the case where first basic voice data recorded in CVC format and second basic voice data recorded in CV format are connected in phoneme sections representing the same consonant "sh" to generate synthesized voice data. In the first basic voice data, a connection candidate section is set inside at least the phoneme section representing "sh", and the connection position is determined within the range of this connection candidate section. Specifically, the connection position is determined from among the connection candidate positions set in advance within this connection candidate section. Similarly, in the second basic voice data, a connection candidate section is set inside at least the phoneme section representing "sh", and the connection position is determined within the range of this connection candidate section. Here too, the connection position is determined from among the connection candidate positions set in advance within this connection candidate section.

[0022] The first and second basic voice data are then connected at these connection positions to generate the synthesized voice data. At this time, the portion of the first basic voice data after its connection position and the portion of the second basic voice data before its connection position are discarded. Here, the connection candidate sections are set in C sections, within the stable sounding section of the consonant. Therefore, at every position in a connection candidate section, the waveform (the waveform itself, spectral parameters, or the like) can stably reproduce the sound "sh". By determining the connection positions within the range of the connection candidate sections, the first and second basic voice data can be connected with an inconspicuous joint using less computation than the conventional technique, which searches the phoneme section representing "sh" in the first basic voice data and the phoneme section representing "sh" in the second basic voice data for a portion where the waveforms are similar and connects the two there. In addition, because the connection positions are determined within the connection candidate sections, the first and second basic voice data are reliably prevented from being connected at a position unsuitable for connection.

[0023] FIG. 3 shows the configuration of a game device according to one embodiment of the present invention. The following describes an example in which the speech synthesizer according to the present invention is implemented by the game device 10 shown in the figure. The game device 10 is configured by connecting a monitor 18 and a speaker 22 to a home game machine 11 and loading a DVD-ROM 25, which serves as an information storage medium. Here, the DVD-ROM 25 is used to supply the game program and game data to the home game machine 11, but any other information storage medium, such as a CD-ROM or a ROM card, may be used. The game program and game data may also be supplied to the home game machine 11 from a remote location via a communication network.

[0024] In the home game machine 11, a microprocessor 14, an image processing unit 16, a main memory 26, and an input/output processing unit 30 are connected by a bus 12 so that they can exchange data with one another, and a controller 32, an audio processing unit 20, and a DVD playback unit 24 are connected to the input/output processing unit 30. The components of the home game machine 11 other than the controller 32 are housed in a casing. A home television set, for example, is used as the monitor 18, and its built-in speaker, for example, is used as the speaker 22.

[0025] The microprocessor 14 controls each part of the home game machine 11 based on an operating system (OS) stored in a ROM (not shown) and the game program read from the DVD-ROM 25. The bus 12 is used to exchange addresses and data among the parts of the home game machine 11. The game program and game data read from the DVD-ROM 25 are written to the main memory 26 as needed. The image processing unit 16 includes a VRAM; it receives image data sent from the microprocessor 14, draws a game screen in the VRAM, converts its contents into a predetermined video signal, and outputs the signal to the monitor 18 at a predetermined timing.

[0026] The input/output processing unit 30 is an interface that relays data communication between the microprocessor 14 and each of the controller 32, the audio processing unit 20, and the DVD playback unit 24. The controller 32 is input means with which the player operates the game. The input/output processing unit 30 scans the operation state of the various buttons of the controller 32 at a fixed interval (for example, every 1/60 second) and passes an operation signal representing the scan result to the microprocessor 14 via the bus 12. The microprocessor 14 determines the player's game operation based on the operation signal. The audio processing unit 20 includes a sound buffer; it reproduces data such as music and game sound effects read from the DVD-ROM 25 and stored in the sound buffer, and outputs it from the speaker 22. In addition, when synthesized voice data that has been generated by the microprocessor 14 and stored in the main memory 26 or in a memory card (not shown) connected to the input/output processing unit 30 is transferred to it, the audio processing unit 20 reproduces and outputs that data from the speaker 22. The DVD playback unit 24 reads the game program and game data recorded on the DVD-ROM 25 in accordance with instructions from the microprocessor 14.

[0027] In the game device 10 configured as above, a speech synthesis database, part of which is shown in FIG. 4, is stored in advance on the DVD-ROM 25. The speech synthesis database stores a large number of basic voice data. (Here the waveform data itself is held as the basic voice data, but a scheme that holds various parameters from which the waveform can be restored may be adopted instead.) In this embodiment, data recorded in CV format, data recorded in VCV format, and the like are assumed to be stored comprehensively as the basic voice data, but a scheme that comprehensively records data in CVC format, data in CV format, and the like may be adopted instead.

[0028] FIG. 4(a) shows, as an example, the recorded contents of the speech synthesis database for the basic voice data representing "aka". As shown there, for each basic voice data the database additionally stores, for each phoneme that the basic voice data represents, the phoneme type, the start timing of the phoneme section, connection candidate section specifying data, and connection candidate position specifying data. The phoneme type is recorded as a phoneme symbol. The connection candidate section specifying data specifies the connection candidate section set inside each phoneme section, for example by recording its start timing and end timing. The connection candidate position specifying data specifies a plurality of candidates for the concrete connection position within the connection candidate section. In each connection candidate section, the first connection candidate position coincides with the start timing of the section, and the last connection candidate position coincides with its end timing. For this reason, only the connection candidate position specifying data need be stored in the speech synthesis database, and separate storage of the connection candidate section specifying data may be omitted. In that case, the entries of the connection candidate position specifying data that specify the first and last connection candidate positions also serve as the connection candidate section specifying data. FIG. 4(b) shows the relationship among the start timing $t_n$ of each phoneme section, the connection candidate section specifying data $t_s^{(n)}$ and $t_e^{(n)}$, and the connection candidate position specifying data $t^{(n)}(1)$ to $t^{(n)}(N)$. In addition, for each basic voice data, the pitch and volume of the V section are stored (not shown). The pitch and volume are referred to when selecting the basic voice data string corresponding to the input text. As described later, the game device 10 connects basic voice data in V sections, so connection candidate section specifying data is unnecessary for C sections, and its recording in the speech synthesis database may be omitted.
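The per-phoneme side data described above can be pictured with a small Python sketch. The class names, field names, and all timing values below are hypothetical and chosen only for illustration; the specification does not prescribe a storage layout. The sketch follows the variant in which only the candidate positions are stored and the section boundaries are derived from the first and last of them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhonemeRecord:
    """Side data for one phoneme of a basic voice data entry (hypothetical layout)."""
    symbol: str                       # phoneme type, e.g. "a" or "k"
    start: float                      # start timing t_n of the phoneme section
    candidate_positions: List[float]  # t^(n)(1) .. t^(n)(N); left empty for C sections

    @property
    def candidate_start(self) -> float:
        # t_s^(n): the first candidate position doubles as the section start
        return self.candidate_positions[0]

    @property
    def candidate_end(self) -> float:
        # t_e^(n): the last candidate position doubles as the section end
        return self.candidate_positions[-1]


@dataclass
class BasicVoiceData:
    label: str                    # e.g. "aka"
    end: float                    # end timing of the whole waveform
    phonemes: List[PhonemeRecord]


# Illustrative timings in milliseconds; they are not taken from the patent.
AKA = BasicVoiceData(
    label="aka",
    end=300.0,
    phonemes=[
        PhonemeRecord("a", 0.0, [20.0, 40.0, 60.0, 80.0]),
        PhonemeRecord("k", 100.0, []),
        PhonemeRecord("a", 160.0, [180.0, 210.0, 240.0]),
    ],
)
```

Deriving the section boundaries from the position list keeps the two kinds of data from drifting apart, at the cost of requiring at least one candidate position for every V section.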

[0029] FIG. 5 is a flowchart illustrating the speech synthesis processing executed by the game device 10. The processing shown in the figure generates synthesized voice data and stores it in the main memory 26 or the like. It is executed based on the game program stored on the DVD-ROM 25, for example at the start of the game. The synthesized voice data generated by this processing is read from the main memory 26 or the like as appropriate according to the game program and transferred to the audio processing unit 20. The audio processing unit 20 then reproduces the synthesized voice data, and the synthesized voice is output from the speaker 22. In this way the synthesized voice can be used to liven up the game.

[0030] As shown in the figure, in this speech synthesis processing the player first uses the controller 32 to input text (a symbol string) such as his or her own name (S101). For example, a text list is displayed on the monitor 18, and when the player designates, in order, the text representing the name or the like with the controller 32, that text is temporarily stored in the main memory 26. The text input here is the target of speech synthesis. Next, the input text is analyzed (S102). Specifically, the microprocessor 14 converts the input text into a phoneme string and re-expresses it as a combination of CV and VCV units.
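As an illustration of the re-expression in S102, the following Python sketch splits a romanized phoneme string into a leading CV unit followed by VCV units, so that "naka" becomes "na" and "aka" as in the abstract. The function name `to_cv_vcv_units`, the five-vowel set, and the restriction to simple (C)V syllables are assumptions made for this sketch and are not details from the specification.

```python
from typing import List

VOWELS = set("aeiou")  # assumed vowel inventory for romanized input


def to_cv_vcv_units(phonemes: str) -> List[str]:
    """Re-express a phoneme string as a CV unit followed by VCV units."""
    # First group the phonemes into (C)V syllables.
    syllables: List[str] = []
    onset = ""
    for ph in phonemes:
        if ph in VOWELS:
            syllables.append(onset + ph)
            onset = ""
        else:
            onset += ph
    if onset:
        raise ValueError("a trailing consonant is not handled in this sketch")

    # Every unit after the first starts with the vowel of the previous
    # syllable, so adjacent units overlap in one shared vowel.
    units: List[str] = []
    prev_vowel = ""
    for syl in syllables:
        units.append(prev_vowel + syl)
        prev_vowel = syl[-1]
    return units


print(to_cv_vcv_units("naka"))
```

The shared vowel between adjacent units is what later makes it possible to join two basic voice data inside a V section.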

[0031] Further, the microprocessor 14 determines the pitch, volume, and length of each syllable contained in the input text (S103). For example, the pitch, volume, and length of each syllable of several representative texts are stored in advance on the DVD-ROM 25 as prosody model data, and the pitch, volume, and length of each syllable of the representative text closest to the text input in S101 are adopted as those of each syllable of the input text. If the two texts do not match completely, the pitch, volume, and length may be corrected by a predetermined algorithm.
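The lookup in S103 might be sketched as follows. The specification does not say how "closest" is measured, so this sketch assumes string similarity as computed by Python's `difflib.SequenceMatcher`; the model contents, the function name `nearest_prosody`, and the numeric values are likewise hypothetical. The correction step for texts that do not match completely is not shown.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

Prosody = Tuple[float, float, float]  # (pitch, volume, length) of one syllable


def nearest_prosody(text: str, model: Dict[str, List[Prosody]]) -> Tuple[str, List[Prosody]]:
    """Return the representative text most similar to `text` and its prosody."""
    # Assumed similarity measure: SequenceMatcher ratio in [0, 1].
    best = max(model, key=lambda rep: SequenceMatcher(None, text, rep).ratio())
    return best, model[best]


# Illustrative prosody model data; the values are made up.
PROSODY_MODEL: Dict[str, List[Prosody]] = {
    "naka": [(120.0, 0.8, 130.0), (110.0, 0.7, 150.0)],
    "tanaka": [(125.0, 0.8, 120.0), (120.0, 0.8, 120.0), (105.0, 0.6, 150.0)],
}

rep, prosody = nearest_prosody("nakai", PROSODY_MODEL)
```

Here the input has three syllables while the chosen representative has two, which is exactly the situation in which a correction algorithm would be needed.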

[0032] After that, based on the expression of the input text as a combination of CV and VCV units obtained by the text analysis (S102), the basic voice data string corresponding to the input text is selected (S104). Specifically, for each CV or VCV unit obtained by the text analysis, the basic voice data whose V-section pitch and volume are closest is selected, and the selected data are arranged in the order of the input text to obtain the basic voice data string. The pitch and volume of the V section are obtained from the speech synthesis database.
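A minimal sketch of the selection in S104 is given below. It assumes that the target pitch and volume for each unit come from S103 and that closeness is a plain squared distance over unscaled pitch and volume; the specification states neither, and the names `Candidate` and `select_data_string` and all values are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class Candidate(NamedTuple):
    data_id: str   # hypothetical identifier of one stored recording
    pitch: float   # V-section pitch stored in the database
    volume: float  # V-section volume stored in the database


def select_data_string(
    units: List[str],
    targets: List[Tuple[float, float]],
    database: Dict[str, List[Candidate]],
) -> List[str]:
    """Pick, per unit, the recording nearest to the target (pitch, volume)."""
    chosen: List[str] = []
    for unit, (pitch, volume) in zip(units, targets):
        best = min(
            database[unit],
            # Assumed metric: squared Euclidean distance, no weighting.
            key=lambda c: (c.pitch - pitch) ** 2 + (c.volume - volume) ** 2,
        )
        chosen.append(best.data_id)
    return chosen


DATABASE = {
    "na": [Candidate("na#1", 120.0, 0.5), Candidate("na#2", 180.0, 0.7)],
    "aka": [Candidate("aka#1", 110.0, 0.6), Candidate("aka#2", 170.0, 0.8)],
}
selected = select_data_string(["na", "aka"], [(175.0, 0.7), (115.0, 0.6)], DATABASE)
```

In practice pitch and volume have very different scales, so a real implementation would probably normalize or weight the two terms.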

[0033] Further, a connection position is determined for each V section except the last (S105). Specifically, based on the length of each syllable of the input text determined in S103, the connection position is determined within the connection candidate section set inside each V section (phoneme section) so that the length of each sound in the synthesized voice actually becomes the determined length. The connection candidate section is identified by reading the connection candidate section specifying data stored in the speech synthesis database for each phoneme section. If there is freedom in determining the connection position, a further criterion may be used. For example, both of the two connection positions may be set as close as possible to the center of the connection candidate section, where the sound depends less on the phoneme environment. If the microprocessor 14 has data processing capacity to spare, it may search the connection candidate sections for portions whose waveforms are similar and use them as the connection positions. Even in this case the connection position is determined within the connection candidate section, so a position unsuitable for connection is prevented from becoming the connection position, and the quality of the synthesized voice can be improved.

[0034] FIG. 6 is a flowchart showing an example of the connection position determination processing in detail. In this processing, a concrete connection position is selected from the connection candidate positions set in advance within the connection candidate section. FIG. 7 schematically shows how the first basic voice data and the second basic voice data are connected. FIG. 7(a) shows a connection mode in which the entire connection candidate section of the phoneme at the connection portion of the first basic voice data (hereinafter, the "preceding connection phoneme") is included in the synthesized voice data, followed by a part of the connection candidate section of the phoneme at the connection portion of the second basic voice data (hereinafter, the "following connection phoneme"). FIG. 7(b) shows a connection mode in which a part of the connection candidate section of the preceding connection phoneme is included in the synthesized voice data, followed by the portion of the second basic voice data from the end timing of the connection candidate section of the following connection phoneme onward, so that the connection candidate section of the following connection phoneme is not included in the synthesized voice data.

[0035] As shown in FIG. 6, the connection position determination processing first acquires the length $l_1$ (see FIG. 7) of the phoneme immediately before the connection phoneme (S201). Since the connection phoneme is a vowel here, the length of the consonant located before it is acquired as $l_1$. The length $l_1$ can be obtained by subtracting the start timings of the phonemes recorded in the speech synthesis database. Next, the length $l_2$ (see FIG. 7) of the portion of the preceding connection phoneme that lies before its connection candidate section is acquired (S202). The length $l_2$ can be obtained by subtracting the start timing of the preceding connection phoneme from the start timing $t_s^{(n)}$ of its connection candidate section. Similarly, the length $l_3$ (see FIG. 7) of the portion of the following connection phoneme that lies after its connection candidate section is acquired (S203). The length $l_3$ can be obtained by subtracting the end timing of the connection candidate section of the following connection phoneme from the start timing of the phoneme that comes after the following connection phoneme. Further, the length $L$ of the syllable containing the connection phoneme is acquired (S204). The value already obtained in S103 (FIG. 5) is used as the length $L$.
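The three subtractions in S201 to S203 can be written out as a short Python sketch. The function and parameter names are hypothetical, and the timings in the example (milliseconds, for "na" joined to "aka") are made up for illustration.

```python
from typing import Tuple


def joint_lengths(
    prev_phoneme_start: float,  # start of the consonant before the joint vowel (first data)
    joint_vowel_start: float,   # start of the joint vowel in the first data
    front_cand_start: float,    # start of its connection candidate section
    rear_cand_end: float,       # end of the candidate section in the second data
    rear_next_start: float,     # start of the phoneme after the joint vowel (second data)
) -> Tuple[float, float, float]:
    """Return (l1, l2, l3) as defined for S201 to S203."""
    l1 = joint_vowel_start - prev_phoneme_start  # length of the preceding consonant
    l2 = front_cand_start - joint_vowel_start    # vowel portion before the candidate section
    l3 = rear_next_start - rear_cand_end         # vowel portion after the candidate section
    return l1, l2, l3


# "na": n starts at 0, a starts at 50, candidate section of a starts at 70.
# "aka": candidate section of the first a ends at 80, k starts at 100.
l1, l2, l3 = joint_lengths(0.0, 50.0, 70.0, 80.0, 100.0)
```

These three lengths are fixed by the recordings; only the time spent inside the two connection candidate sections can be adjusted to reach the target syllable length.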

[0036] Next, the length of the connection candidate section of the preceding connection phoneme, $t_e^{(\alpha)} - t_s^{(\alpha)}$, is calculated (S205), where $\alpha$ is the index of the preceding connection phoneme. It is then determined whether $L - (l_1 + l_2 + l_3)$ is greater than or equal to that length (S206). The quantity $L - (l_1 + l_2 + l_3)$ is the total length of connection candidate section to be included in the connection portion of the synthesized voice; the test checks whether this length can be covered by the connection candidate section of the preceding connection phoneme alone, or whether all or part of the connection candidate section of the following connection phoneme must also be added. If $L - (l_1 + l_2 + l_3) \geq t_e^{(\alpha)} - t_s^{(\alpha)}$, the connection candidate section of the following connection phoneme must also be included in the synthesized voice, and the $n$ satisfying $L - (l_1 + l_2 + l_3) - (t_e^{(\alpha)} - t_s^{(\alpha)}) \approx t_e^{(\beta)} - t^{(\beta)}(n)$ is searched for (S207), where $\beta$ is the index of the following connection phoneme. Then $t_e^{(\alpha)}$ in the preceding connection phoneme and $t^{(\beta)}(n)$ in the following connection phoneme are determined as the connection positions of the first and second basic voice data, respectively (S208). If, on the other hand, $L - (l_1 + l_2 + l_3) < t_e^{(\alpha)} - t_s^{(\alpha)}$, the connection candidate section of the following connection phoneme need not be included in the synthesized voice, and the $n$ satisfying $L - (l_1 + l_2 + l_3) \approx t^{(\alpha)}(n) - t_s^{(\alpha)}$ is searched for (S209). Then $t^{(\alpha)}(n)$ in the preceding connection phoneme and $t_e^{(\beta)}$ in the following connection phoneme are determined as the connection positions of the first and second basic voice data, respectively (S210).
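The branch in S205 to S210 can be sketched in Python as follows. The sketch reads "approximately equal" as "the candidate position with the smallest absolute difference", which is an interpretation rather than something the specification states; the function name, parameter names, and example timings are hypothetical.

```python
from typing import List, Tuple


def choose_connection_positions(
    L: float, l1: float, l2: float, l3: float,
    front_positions: List[float],  # t^(alpha)(1..N) in the first basic voice data
    rear_positions: List[float],   # t^(beta)(1..N) in the second basic voice data
) -> Tuple[float, float]:
    """Return (cut in first data, cut in second data)."""
    t_s_a, t_e_a = front_positions[0], front_positions[-1]
    t_e_b = rear_positions[-1]
    budget = L - (l1 + l2 + l3)  # candidate-section time the joint must supply
    front_len = t_e_a - t_s_a    # S205

    if budget >= front_len:      # S206: front section alone is not enough
        remainder = budget - front_len
        # S207: nearest n with t_e^(beta) - t^(beta)(n) ~ remainder
        rear_cut = min(rear_positions, key=lambda t: abs((t_e_b - t) - remainder))
        return t_e_a, rear_cut   # S208

    # S209: nearest n with t^(alpha)(n) - t_s^(alpha) ~ budget
    front_cut = min(front_positions, key=lambda t: abs((t - t_s_a) - budget))
    return front_cut, t_e_b      # S210


FRONT = [70.0, 90.0, 110.0, 130.0]
REAR = [20.0, 40.0, 60.0, 80.0]
long_case = choose_connection_positions(190.0, 50.0, 20.0, 20.0, FRONT, REAR)
short_case = choose_connection_positions(130.0, 50.0, 20.0, 20.0, FRONT, REAR)
```

With these made-up numbers the long syllable uses the whole front candidate section plus 40 ms of the rear one (the mode of FIG. 7(a)), while the short syllable stops 40 ms into the front section and skips the rear section entirely (the mode of FIG. 7(b)). Because only the listed candidate positions are tried, the search costs a handful of subtractions per joint.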

[0037] Returning to FIG. 5, the basic voice data string selected in S104 is then connected (S106). At this time, the start portion or end portion of each basic voice data is determined by the connection positions determined in S105 (S208, S210). That is, the portion of the basic voice data lying between the connection positions determined in S105 is read from the speech synthesis database and connected to the preceding basic voice data (the one reproduced earlier). The basic voice data string is connected in this way to create the synthesized voice data. The created synthesized voice data is then stored in the main memory 26 (S107). The stored synthesized voice data is read as appropriate according to the game program (not shown) and reproduced and output from the speaker 22 as a game effect. A non-volatile memory card may also be detachably connected to the input/output processing unit 30, and the synthesized voice data may be stored there. In that way, the synthesized voice can be output immediately at the next play without creating the synthesized voice data again.
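The connection step S106 amounts to trimming each basic voice data at its connection positions and appending the pieces in order. The Python sketch below assumes that the connection positions have already been converted to sample indices and that the pieces are joined directly, with no cross-fade or smoothing; the specification describes neither detail, and the function name and sample values are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# (samples, start_cut, end_cut); None keeps the data from its beginning / to its end.
Piece = Tuple[Sequence[int], Optional[int], Optional[int]]


def join_data_string(pieces: List[Piece]) -> List[int]:
    """Concatenate the portion of each basic voice data between its cuts."""
    out: List[int] = []
    for samples, start_cut, end_cut in pieces:
        out.extend(samples[start_cut:end_cut])
    return out


# Stand-in waveforms: ten samples each, made up for illustration.
NA = list(range(0, 10))
AKA = list(range(100, 110))

# Keep "na" up to its cut at index 7, then "aka" from its cut at index 4.
synth = join_data_string([(NA, None, 7), (AKA, 4, None)])
```

For a string of three or more basic voice data, each middle entry would carry both a start cut (from the joint with its predecessor) and an end cut (from the joint with its successor).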

[0038] According to the game device 10 described above, when two basic voice data are connected in a section corresponding to the same phoneme, the connection position is determined within a connection candidate section set in advance inside that section. This prevents the two basic voice data from being connected at a position unsuitable for connection and improves the quality of the synthesized voice. Moreover, if the connection candidate section is set within the stable utterance section, the joint between the two basic voice data can be made inconspicuous as long as the connection position is determined within the connection candidate section, so high-quality synthesized voice data can be generated simply, without heavy data processing such as waveform comparison or parameter comparison. Furthermore, because a plurality of connection candidate positions are prepared and the concrete connection position is selected from among them, high-quality synthesized voice data can be generated with extremely light processing.

[0039] The present invention is not limited to the embodiment described above.

[0040] For example, in the above description, basic voice data recorded in CV format and VCV format are connected in V sections to generate the synthesized voice data. When basic voice data recorded in CVC format and CV format are connected in C sections to generate synthesized voice data, setting connection candidate sections in the C sections likewise allows two basic voice data to be reliably connected at a position suitable for connection. More generally, regardless of the format of the basic voice data, when two basic voice data are connected in phoneme sections representing the same phoneme, setting a connection candidate section inside each such phoneme section and determining the connection position within it allows the two basic voice data to be reliably connected at a position suitable for connection.

[0041] The above description concerns an example in which the present invention is implemented using the consumer game machine 11, but the present invention is equally applicable to arcade game machines. In that case, it is desirable to use a faster storage device in place of the DVD-ROM 25 and the DVD reproducing unit 24, and to form the monitor 18 and the speaker 22 integrally with the machine.

[0042] Further, in the above description the DVD-ROM 25 storing the game program and game data is used in the consumer game machine 11, but any computer, such as a personal computer, may be used as long as it can read an information storage medium on which the game program and game data are recorded and perform information processing based on what it has read.

[0043]

[Effects of the Invention] As described above, in the present invention, when two basic voice data are connected in a section corresponding to the same phoneme, the connection candidate section data corresponding to that section is read, and the connection position is determined within the connection candidate section that the data specifies. The two basic voice data can therefore be reliably connected at a position suitable for connection, which improves the quality of the synthesized voice. Moreover, if the connection position is selected from among connection candidate positions, high-quality synthesized voice can be generated with relatively light processing.

[Brief description of the drawings]

FIG. 1 is a diagram illustrating an example of the speech synthesis method according to the embodiment of the present invention.

FIG. 2 is a diagram illustrating another example of the speech synthesis method according to the embodiment of the present invention.

FIG. 3 is a diagram showing the configuration of the game device according to the embodiment of the present invention.

FIG. 4 is a diagram illustrating the data additionally stored for each basic voice data.

FIG. 5 is a flowchart illustrating the speech synthesis processing executed by the game device according to the embodiment of the present invention.

FIG. 6 is a flowchart illustrating the connection position determination processing for basic voice data in detail.

FIG. 7 is a diagram showing connection modes of basic voice data.

[Explanation of symbols]

10 game device, 11 consumer game machine, 12 bus, 14 microprocessor, 16 image processing unit, 18 monitor, 20 audio processing unit, 22 speaker, 24 DVD reproducing unit, 25 DVD-ROM, 26 main memory, 30 input/output processing unit, 32 controller.

Claims

1. A speech synthesizer comprising: basic voice data storage means for storing a plurality of basic voice data; basic voice data string selecting means for selecting, from the plurality of basic voice data, a basic voice data string corresponding to a synthesized voice; and synthesized voice data generating means for generating synthesized voice data for reproducing the synthesized voice by connecting the selected basic voice data string, the speech synthesizer further comprising connection candidate section specifying data storage means for storing, in association with at least two of the plurality of basic voice data, connection candidate section specifying data that respectively specify connection candidate sections set inside sections corresponding to a predetermined phoneme in the two basic voice data, wherein, when the basic voice data string selected by the basic voice data string selecting means contains the two basic voice data adjacent to each other, the synthesized voice data generating means reads the connection candidate section specifying data corresponding to the two basic voice data from the connection candidate section specifying data storage means, determines a connection position for each of the two basic voice data within the connection candidate section thus specified, and connects the two basic voice data at the connection positions.

2. The speech synthesizer according to claim 1, further comprising connection candidate position specifying data storage means for storing connection candidate position specifying data that specify a plurality of connection candidate positions set in each connection candidate section, wherein the synthesized voice data generating means further identifies, from the connection candidate position specifying data stored in the connection candidate position specifying data storage means, the connection candidate positions set in the connection candidate sections specified by the connection candidate section specifying data corresponding to the two basic voice data, and selects the connection position for each of the two basic voice data from among those connection candidate positions.

3. The speech synthesizer according to claim 1, wherein the connection candidate section is set within a stable utterance section of the predetermined phoneme.

4. The speech synthesizer according to claim 1, further comprising syllable length determining means for determining the length of the syllable that contains the predetermined phoneme corresponding to the connection portion between the two basic voice data, wherein the synthesized voice data generating means determines the connection position based on the syllable length determined by the syllable length determining means.

5. The speech synthesizer according to claim 1, wherein the basic voice data string selecting means includes symbol string input means for inputting a symbol string representing the voice to be synthesized, and selects the basic voice data string based on the symbol string input by the symbol string input means.

6. A speech synthesis method comprising: a step of acquiring two connection candidate section specifying data that are associated with first and second basic voice data, respectively, and that specify connection candidate sections set inside sections corresponding to a predetermined phoneme in the first and second basic voice data; a step of determining a connection position for each of the first and second basic voice data within the connection candidate sections specified by the two connection candidate section specifying data; and a step of connecting the first and second basic voice data at the connection positions.

7. An information storage medium storing a program for causing a computer to execute: a step of acquiring two connection candidate section specifying data that are associated with first and second basic voice data, respectively, and that specify connection candidate sections set inside sections corresponding to a predetermined phoneme in the first and second basic voice data; a step of determining a connection position for each of the first and second basic voice data within the connection candidate sections specified by the two connection candidate section specifying data; and a step of connecting the first and second basic voice data at the connection positions.