JP4424024B2 - Segment-connected speech synthesizer and method

Info

Publication number: JP4424024B2
Application number: JP2004075185A
Authority: JP
Inventors: 信行西澤; 恒河井
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2004-03-16
Filing date: 2004-03-16
Publication date: 2010-03-03
Anticipated expiration: 2024-03-16
Also published as: JP2005266010A

Description

The present invention relates to speech synthesizers, and more particularly to a speech synthesizer that produces speech matching a synthesizer command by selecting and connecting speech segments on the basis of a predetermined cost function.

Speech recognition and speech synthesis are important technologies for implementing interfaces between humans and computer-based systems. Combined with artificial intelligence technology, they let users access a variety of services without being aware that the other party is a computer system.

Speech synthesis is especially important as the interface through which a system delivers output to humans. Humans are sensitive to unnaturalness in synthesized speech. If a user finds the synthesized speech unnatural, the user's own utterances are affected, and the dialogue between the human and the system may break down as a result.

A recent approach to speech synthesis collects a large number of human utterances in advance and builds a database in which speech segments, in units of words, syllables, phonemes, and the like, are associated with phoneme labels. At synthesis time, the segments judged most appropriate are selected from among those corresponding to the specified words, syllables, or phonemes, and are connected. This is called segment-connected speech synthesis. A phoneme label normally describes the phoneme symbol of each phoneme together with its start and end times. It may additionally include acoustic features of that interval, such as MFCC (Mel-Frequency Cepstrum Coefficients) and the fundamental frequency (F0), as well as the phoneme symbols of the preceding and following segments.

In segment-connected speech synthesis, the issue is how to retrieve appropriate speech segments from the database with respect to a given synthesis target.

The data constituting a synthesis target typically includes phonemes and speech features such as F0, duration, MFCC, and power. These are hereinafter referred to as a "synthesizer command".

In segment-connected speech synthesis, an evaluation function called a "cost" is defined to express both the deviation between the synthesizer command and the F0, duration, MFCC, power, and so on of a speech segment, and the degradation in naturalness that accompanies connection. The optimal speech segment sequence is determined by finding the speech segments that minimize this cost.

The applicant of the present application has proposed segment-connected speech synthesis in which this "cost" is decomposed into "sub-costs", each corresponding to a particular feature of speech, and is defined as their combination (for example, a linear sum). See, for example, Patent Document 1.
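The linear-sum combination of sub-costs can be sketched as follows. This is a minimal illustration, not the formulation of Patent Document 1: the feature names (`f0`, `duration`), the two sub-cost functions, and the weights are assumptions made for the example, and the sketch scores each candidate in isolation, so it omits the connection cost between adjacent segments.

```python
from typing import Callable, Mapping, Sequence, Tuple

# Hypothetical representation: a target and a candidate are feature dictionaries.
Features = Mapping[str, float]
SubCost = Callable[[Features, Features], float]


def f0_cost(target: Features, candidate: Features) -> float:
    """Assumed sub-cost: absolute F0 difference in Hz."""
    return abs(target["f0"] - candidate["f0"])


def duration_cost(target: Features, candidate: Features) -> float:
    """Assumed sub-cost: absolute duration difference in seconds."""
    return abs(target["duration"] - candidate["duration"])


def total_cost(target: Features, candidate: Features,
               sub_costs: Sequence[Tuple[float, SubCost]]) -> float:
    """Cost defined as a weighted linear sum of sub-costs."""
    return sum(weight * fn(target, candidate) for weight, fn in sub_costs)


def select_best(target: Features, candidates: Sequence[Features],
                sub_costs: Sequence[Tuple[float, SubCost]]) -> Features:
    """Pick the candidate with the smallest total cost."""
    return min(candidates, key=lambda c: total_cost(target, c, sub_costs))
```

A call such as `select_best(target, candidates, [(1.0, f0_cost), (100.0, duration_cost)])` returns the candidate with the smallest weighted sum; the weights set how strongly each speech feature influences the choice.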

Some sub-costs are calculated from physical quantities, while others are obtained from rules prepared in advance from symbolic information. The former often involve non-linear operations over multiple sample values, so their computational cost is relatively high. The latter often take the form of a simple table lookup, in which case the computation needed for the sub-cost is very small.

This is only one example; more generally, the computational cost of each sub-cost often varies greatly depending on its type.

Separately from the above, sub-costs can also be broadly divided into those belonging to the target cost and those belonging to the connection cost. The target cost represents the error between the synthesis target and a segment candidate. The connection cost represents the error (discontinuity) between adjacent segments in the synthesized speech.

When such speech synthesis is to be performed in real time, the issue is how to perform segment selection and synthesis at high speed. To speed up the processing, it is desirable both to compute the segment selection costs quickly and to speed up the processing that connects the selected speech segments.

Actually connecting speech segments requires their waveform data. Each item of waveform data is relatively large, and raising the quality of the synthesized speech requires a large number of speech segments in the speech segment database. As a result, the database as a whole becomes large. Conventionally, therefore, only the acoustic features of the segment data are held in memory for the segment selection cost calculation, while the speech segment database is kept on a relatively high-capacity storage device such as a fixed hard disk, and the waveform data is read out when the segments are connected, after they have been selected.

Patent Document 1: JP 2003-208188 A (paragraphs 0014 to 0047)

However, high-capacity storage devices have relatively slow access speeds. Consequently, acquiring the waveform data of the speech segments after segment selection takes a long time, and speech synthesis as a whole takes longer.

An object of the present invention is therefore to provide a segment-connected speech synthesis apparatus and method capable of synthesizing speech at high speed while keeping the number of available speech segments large.

A segment-connected speech synthesis apparatus according to a first aspect of the present invention calculates a cost including a plurality of sub-costs between a synthesized speech target and speech segment candidates and, based on that cost, performs speech synthesis by selecting and connecting speech segments from a speech segment database including a plurality of speech segment candidates. Each speech segment candidate includes acoustic feature data and waveform data of the speech segment. The apparatus includes a first storage device for storing the speech segment database, and a second storage device, accessible at higher speed than the first storage device, for storing the acoustic feature data of the plurality of speech segment candidates stored in the speech segment database and the waveform data of those speech segment candidates selected from among them according to a predetermined criterion. The plurality of sub-costs includes an access speed cost relating to the access speed of the storage device in which the waveform data of each speech segment candidate is stored. The acoustic feature data of each of the plurality of speech segment candidates includes a first flag indicating whether the waveform data of that candidate is stored in the first or the second storage device. The apparatus further includes selection means for selecting, from the plurality of speech segment candidates and based on the acoustic features stored in the second storage device, one speech segment candidate whose cost with respect to the synthesized speech target, calculated including the plurality of sub-costs, satisfies a predetermined condition; and connection means for reading the speech waveform of the speech segment candidate selected by the selection means from either the first storage device or the second storage device according to the first flag corresponding to the selected speech waveform, connecting the waveforms according to the synthesized speech target, and outputting a synthesized speech waveform.

Preferably, the speech segment database stored in the first storage device carries selection criterion information that serves as the criterion for selecting, from among the plurality of speech segment candidates in the database, those whose waveform data should be stored in the second storage device. The segment-connected speech synthesis apparatus further includes loading means for loading into the second storage device the waveform data of the speech segment candidates selected according to the selection criterion information from among the candidates in the speech segment database stored in the first storage device.

More preferably, the apparatus further includes selection criterion information generating means for generating the selection criterion information of the speech segment database based on test data.

The selection criterion information generating means may include speech synthesis simulation means for simulating speech synthesis based on the test data using the speech segment candidates included in the speech segment database, and frequency recording means for recording, for each of the plurality of speech segment candidates included in the speech segment database, the frequency with which it was selected during speech synthesis by the speech synthesis simulation means. The selection criterion information may be the frequency information of each of the plurality of speech segment candidates included in the speech segment database, as recorded by the frequency recording means.

Preferably, the loading means includes means for selecting, based on the frequency information, a predetermined number of the speech segments in the speech segment database in descending order of the frequency with which they were selected in speech synthesis by the speech synthesis simulation means, and for loading the waveform data of the selected speech segments into the second storage device.

More preferably, the speech synthesis simulation means includes means for selecting from the speech segment database one speech segment candidate whose cost satisfies a predetermined condition. This cost is calculated between the synthesized speech target obtained from the test data and the plurality of speech segment candidates included in the speech segment database, and includes, of the plurality of sub-costs, those other than the access speed cost, together with a selection frequency cost calculated from the frequency, recorded by the frequency recording means, with which each speech segment candidate has been selected.

The segment-connected speech synthesis apparatus may further include a cache memory; a cache memory management mechanism provided for the cache memory; means for storing second flags, provided in correspondence with the plurality of speech segment candidates, each recording whether the corresponding candidate has been selected by the selection means; means for initializing the value of each second flag to a predetermined first value; and flag updating means for updating, each time one of the speech segment candidates is selected by the selection means, the second flag corresponding to the selected candidate to a predetermined second value different from the first value. The access speed cost is calculated based on the first and second flags, and the selection means may include means for selecting, from the plurality of speech segment candidates and based on the acoustic features stored in the second storage device, one speech segment candidate whose cost with respect to the synthesized speech target, calculated including the plurality of sub-costs including the access speed cost, satisfies a predetermined condition.

A segment-connected speech synthesis method according to a second aspect of the present invention calculates a cost including a plurality of sub-costs between a synthesized speech target and speech segment candidates and, based on that cost, performs speech synthesis by selecting and connecting speech segments from a speech segment database including a plurality of speech segment candidates. Each speech segment candidate includes acoustic feature data and waveform data of the speech segment. The method includes the steps of storing the speech segment database in a first storage device, and storing, in a second storage device accessible at higher speed than the first storage device, the acoustic feature data of the plurality of speech segment candidates stored in the speech segment database and the waveform data of those speech segment candidates selected from among them according to a predetermined criterion. The plurality of sub-costs includes an access speed cost relating to the access speed of the storage device in which the waveform data of each speech segment candidate is stored. The acoustic feature data of each of the plurality of speech segment candidates includes a first flag indicating whether the waveform data of that candidate is stored in the first or the second storage device. The method further includes a selection step of selecting, from the plurality of speech segment candidates and based on the acoustic features stored in the second storage device, one speech segment candidate whose cost with respect to the synthesized speech target, calculated including the plurality of sub-costs, satisfies a predetermined condition; and a connection step of reading the speech waveform of the speech segment candidate selected in the selection step from either the first storage device or the second storage device according to the first flag corresponding to the selected speech waveform, connecting the waveforms according to the synthesizer command, and outputting a synthesized speech waveform.

Preferably, the speech segment database stored in the first storage device has selection criterion information that serves as the criterion for selecting, from among the plurality of speech segment candidates in the database, those whose waveform data should be stored in the second storage device. The segment-connected speech synthesis method further includes a loading step of loading into the second storage device the waveform data of the speech segment candidates selected according to the selection criterion information from among the candidates in the speech segment database stored in the first storage device.

More preferably, the segment-connected speech synthesis method further includes a selection criterion information generation step of generating the selection criterion information of the speech segment database based on test data.

The selection criterion information generation step may include a speech synthesis simulation step of simulating speech synthesis based on the test data using the speech segment candidates included in the speech segment database, and a frequency recording step of recording, for each of the plurality of speech segment candidates included in the speech segment database, the frequency with which it was selected during speech synthesis in the speech synthesis simulation step. The selection criterion information may be the frequency information of each of the plurality of speech segment candidates included in the speech segment database, as recorded in the frequency recording step.

Preferably, the loading step includes a step of selecting, based on the frequency information, a predetermined number of the speech segments in the speech segment database in descending order of the frequency with which they were selected in speech synthesis in the speech synthesis simulation step, and loading the waveform data of the selected speech segments into the second storage device.

More preferably, the speech synthesis simulation step includes a step of selecting from the speech segment database one speech segment candidate whose cost satisfies a predetermined condition. This cost is calculated between the synthesized speech target obtained from the test data and the plurality of speech segment candidates included in the speech segment database, and includes, of the plurality of sub-costs, those other than the access speed cost, together with a selection frequency cost calculated from the frequency, recorded in the frequency recording step, with which each speech segment candidate has been selected.

The segment-connected speech synthesis method may be executed on a computer having a cache memory and a cache memory management mechanism provided for the cache memory. The method may further include the steps of storing, in a predetermined storage device, second flags provided in correspondence with the plurality of speech segment candidates, each recording whether the corresponding candidate has been selected in the selection step; initializing the value of each second flag to a predetermined first value; and updating, each time one of the speech segment candidates is selected in the selection step, the second flag corresponding to the selected candidate to a predetermined second value different from the first value. The access speed cost is calculated based on the first and second flags. The selection step may include a step of selecting, from the plurality of speech segment candidates and based on the acoustic features stored in the second storage device, one speech segment candidate whose cost with respect to the synthesized speech target, calculated including the plurality of sub-costs including the access speed cost, satisfies a predetermined condition.

In segment-selection speech synthesis, the speech segments that are actually chosen are unevenly distributed. Storing the segments that are likely to be chosen in a storage device (for example, memory) that can be accessed faster than the speech segment database therefore raises the overall speed of speech synthesis. In addition, at segment selection time, the access speed of the storage device holding each segment's waveform data is factored into the cost. The cost corresponding to the access speed of the storage device is called the "access speed cost" in this specification. Its formula is designed in advance so that the cost is smaller (closer to 0) the faster the storage device and larger the slower it is. Speech segments whose waveform data is stored in a fast storage device then become more likely to be selected in actual synthesis, which raises the overall speed of speech synthesis.

FIG. 1 is a block diagram of a speech synthesis system 20 according to an embodiment of the present invention. Referring to FIG. 1, the speech synthesis system 20 includes a speech segment DB 30 similar to a conventional one, and a frequency information generation device 32. Using test data 42, which consists of a large number of texts for speech synthesis, the frequency information generation device 32 simulates speech synthesis with the speech segment DB 30, calculates for each speech segment in the speech segment DB 30 the frequency with which it was used in synthesis, and generates a speech segment DB 34 with frequency information, which has frequency information 44.

The speech synthesis system 20 further includes a speech synthesizer 38 that receives as input a synthesizer command 36 obtained by analyzing the target text, selects and connects appropriate speech segments from those in the speech segment DB 34 with frequency information, and outputs a synthesized speech waveform 40. The speech synthesizer 38 includes a memory 48 for storing in advance the acoustic information of each speech segment in the speech segment DB 34 with frequency information, which is used for segment selection, and those speech segments in the DB 34 whose frequency of occurrence, as indicated by the frequency information 44, is high.

The speech synthesis system 20 further includes a speech segment data loading device 46. When the speech synthesizer 38 starts up, the loading device 46 loads into the memory 48 the acoustic information of each speech segment in the speech segment DB 34 with frequency information, together with the waveform data of a predetermined number of the most frequently occurring speech segments, selected from the DB 34 by referring to the frequency information 44.

Referring to FIG. 2, the frequency information generation device 32 uses the speech segment DB 30 to process each input in the test data 42 in the same way as actual speech synthesis, and thereby calculates the usage frequency of each speech segment. The frequency information generation device 32 includes a synthesizer command creation unit 60 that receives each text sentence of the test data 42 and creates a synthesizer command 62 serving as the synthesis target; a target cost calculation unit 68 that calculates the target cost between the synthesizer command 62 and the acoustic features of each speech segment in the speech segment DB 30; a connection cost calculation unit 70 that calculates the connection cost between the synthesizer command 62 and the acoustic features of each speech segment in the speech segment DB 30; and a selection frequency cost calculation unit 72 that calculates a selection frequency cost reflecting the frequency with which each speech segment in the speech segment DB 30 has been selected in speech synthesis.

The selection frequency cost is a cost introduced to deliberately bias the speech segments selected by the frequency information generation device 32. It is calculated by a formula whose value becomes smaller the more often a speech segment has been selected and larger the less often. In the present embodiment, the selection frequency cost Cs_i of the i-th speech segment is calculated by the following equation.

Here n_i is the number of times the i-th speech segment has been selected, N is the sum of the occurrence counts of all segments, and a, N₀, and r are suitable constants (with r > 0, a > 0, and N ≥ 0). According to this equation, the selection frequency cost Cs_i of the i-th speech segment is a when the selection frequency n_i is 0, and approaches 0 as n_i increases. To compute it, the selection frequency cost calculation unit 72 uses the frequency information of each speech segment stored in the frequency information 44.

The frequency information generation device 32 further includes a segment selection unit 64 and a frequency information update unit 66. The segment selection unit 64 calculates a total cost from the target cost calculated by the target cost calculation unit 68, the connection cost calculated by the connection cost calculation unit 70, and the selection frequency cost calculated by the selection frequency cost calculation unit 72, and simulates segment selection in actual speech synthesis by selecting the speech segment with the smallest total cost. The frequency information update unit 66 updates the frequency information 44 for the speech segment selected by the segment selection unit 64.
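The loop formed by the segment selection unit 64 and the frequency information update unit 66 might be sketched as below. Everything here is a hypothetical simplification: segments are plain string IDs, `base_cost` stands for the combined target and connection costs, and `selection_frequency_cost` is a stand-in chosen only so that it equals a when the count is 0 and decays toward 0 as the count grows. It is not the equation of the embodiment, and it uses only the constants a and r.

```python
import math
from collections import Counter
from typing import Callable, Iterable, Sequence


def selection_frequency_cost(n_i: int, a: float = 1.0, r: float = 0.5) -> float:
    """Stand-in assumption (not the embodiment's equation):
    equals a at n_i == 0 and decays toward 0 as n_i grows."""
    return a * math.exp(-r * n_i)


def simulate_selection(targets: Iterable[str],
                       candidates_for: Callable[[str], Sequence[str]],
                       base_cost: Callable[[str, str], float]) -> Counter:
    """Simulate segment selection over test targets and count selections."""
    counts: Counter = Counter()
    for target in targets:
        # Total cost = base cost (target + connection) + selection frequency cost.
        best = min(
            candidates_for(target),
            key=lambda seg: base_cost(target, seg)
            + selection_frequency_cost(counts[seg]),
        )
        counts[best] += 1  # role of the frequency information update unit
    return counts
```

Because a segment's cost drops each time it is chosen, later targets tend to reuse segments that were already selected, which concentrates the counts on a small set of segments.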

The data that the speech segment data loading device 46 shown in FIG. 1 loads into the memory 48 is described with reference to FIG. 3. As shown in FIG. 3, the memory 48 includes an acoustic feature storage area 120 for storing the acoustic features 130, ... of all speech segments in the speech segment DB 34 with frequency information, and a waveform data storage area 122 for storing the waveform data of those speech segments in the DB 34 whose frequency of occurrence recorded in the frequency information 44 is at or above a predetermined value. Both are loaded into the memory 48 by the speech segment data loading device 46 when the speech synthesizer 38 starts up.
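The start-up load performed by the speech segment data loading device 46 could look roughly like the following sketch. The `SegmentEntry` class, its field names, and the `read_waveform` callback are hypothetical; the sketch uses the "top predetermined number" rule, and a threshold rule would filter on `frequency` instead of slicing the ranked list.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SegmentEntry:
    """Hypothetical in-memory record for one speech segment."""
    segment_id: str
    frequency: int         # selection count taken from the frequency information
    first_flag: int = 0    # 0: waveform only in the DB, 1: waveform in memory
    second_flag: int = 0   # 0: not read yet, 1: waveform has been read


def load_frequent_waveforms(entries: List[SegmentEntry],
                            read_waveform: Callable[[str], bytes],
                            top_n: int) -> Dict[str, bytes]:
    """Load waveforms of the top_n most frequent segments and set their first flag."""
    ranked = sorted(entries, key=lambda e: e.frequency, reverse=True)
    waveform_area: Dict[str, bytes] = {}
    for entry in ranked[:top_n]:
        waveform_area[entry.segment_id] = read_waveform(entry.segment_id)
        entry.first_flag = 1  # waveform is now held in the fast storage
    return waveform_area
```

The returned dictionary plays the role of the waveform data storage area, while the list of entries corresponds to the acoustic feature storage area that is always fully resident.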

As described above, each of the acoustic features 130 includes a phoneme label, fundamental frequency (F0), MFCC, power (not shown), and duration (not shown). In addition, it includes a first flag (F₁) 140 indicating whether the waveform data of the speech segment is stored in the speech segment DB 34 with frequency information or in the waveform data storage area 122 of the memory 48, and a second flag (F₂) 142 indicating whether the waveform data of the speech segment has been read recently. The flag 140 is set by the speech segment data loading device 46 when, after loading the acoustic features into the acoustic feature storage area 120, it loads the waveform data of the high-frequency speech segments into the waveform data storage area 122.

本実施の形態では、第１のフラグ１４０が０の場合には波形データは頻度情報付き音声素片ＤＢ３４に格納されていることを表し、フラグが１の場合には波形データがメモリ４８の波形データ格納領域１２２に格納されていることを示す。従ってこの第１のフラグ１４０の値を見ることで波形データをどこから読出せばよいかが判定できる。 In the present embodiment, a first flag 140 of 0 indicates that the waveform data is stored in the speech unit DB 34 with frequency information, and a value of 1 indicates that the waveform data is stored in the waveform data storage area 122 of the memory 48. The value of the first flag 140 therefore determines where the waveform data should be read from.

フラグ１４２は最初に０に初期化され、実際に音声合成を行ないながら、波形データが読出されたときに「１」に更新される。これは、装置をコンピュータで実現する場合、ハードディスク又はメモリから読出されたデータはメモリよりもさらに高速アクセス可能なキャッシュに格納されることがあることを考慮したものである。この第２のフラグ１４２の値はアクセス速度コストに反映される。 The flag 142 is initially initialized to 0 and updated to “1” when the waveform data is read while actually performing speech synthesis. This is because the data read from the hard disk or the memory may be stored in a cache that can be accessed at a higher speed than the memory when the device is realized by a computer. The value of the second flag 142 is reflected in the access speed cost.
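
The per-unit record described above can be pictured as a small data structure. The following is a minimal sketch only: the class name, the field names, and the `waveform_address` field are illustrative assumptions, not names used in the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AcousticFeature:
    """Hypothetical record for one speech unit in the acoustic feature storage area 120."""
    phoneme_label: str
    f0: float               # fundamental frequency (F0)
    mfcc: List[float]       # MFCC vector
    power: float
    duration: float
    waveform_address: int   # assumed field: where the waveform data is located
    f1: int = 0             # first flag 140: 0 = waveform in DB 34, 1 = waveform in memory 48
    f2: int = 0             # second flag 142: 0 = not yet read, 1 = read at least once


def waveform_source(unit: AcousticFeature) -> str:
    """Decide where to read the waveform from, using only the first flag."""
    return "memory" if unit.f1 == 1 else "database"


def mark_read(unit: AcousticFeature) -> None:
    """Record that the waveform has been read (second flag set to 1)."""
    unit.f2 = 1
```

Both flags start at 0; the first is set at load time and the second during synthesis, which matches the order of events described in the text.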

図４を参照して、音声合成装置３８は、前述のメモリ４８に加え、それぞれ合成目標を定める合成器指令３６を受け、メモリ４８に記憶されている音声素片であって、かつ合成器指令３６により指定された音素ラベルを持つ音声素片の音響特徴量と合成器指令３６との間のターゲットコストを算出するためのターゲットコスト算出部８２と、同じくメモリ４８内の音声素片と合成器指令３６との間の接続コストを算出するための接続コスト算出部８４と、メモリ４８に記憶されている音声素片に対応する第１及び第２のフラグ１４０、１４２に基づいてアクセス速度コストを算出するためのアクセス速度コスト算出部８６とを含む。 Referring to FIG. 4, in addition to the memory 48 described above, the speech synthesizer 38 includes: a target cost calculation unit 82 that receives synthesizer commands 36, each defining a synthesis target, and calculates the target cost between the synthesizer command 36 and the acoustic features of those speech units stored in the memory 48 that have the phoneme label specified by the synthesizer command 36; a connection cost calculation unit 84 that likewise calculates the connection cost between the speech units in the memory 48 and the synthesizer command 36; and an access speed cost calculation unit 86 that calculates the access speed cost based on the first and second flags 140 and 142 corresponding to the speech units stored in the memory 48.

音声合成装置３８はさらに、ターゲットコスト算出部８２により算出されたターゲットコスト、接続コスト算出部８４により算出された接続コスト、及びアクセス速度コスト算出部８６により算出されたアクセス速度コストに基づいて総コストを算出し、総コストの最小の音声素片を選択するための素片選択部８０と、素片選択部８０により選択された音声素片の波形データを接続して合成音声波形４０を出力するための接続部８８とを含む。 The speech synthesizer 38 further includes a unit selection unit 80 that calculates a total cost from the target cost calculated by the target cost calculation unit 82, the connection cost calculated by the connection cost calculation unit 84, and the access speed cost calculated by the access speed cost calculation unit 86, and selects the speech unit with the minimum total cost; and a connection unit 88 that connects the waveform data of the speech units selected by the unit selection unit 80 and outputs a synthesized speech waveform 40.

接続部８８は、素片選択部８０により指定された音声素片を頻度情報付き音声素片ＤＢ３４又はメモリ４８から読出すため、次のような信号を出力する。すなわち、接続部８８は、素片選択部８０により指定された音声素片の第１のフラグ１４０に対応するレベルをとり、波形データをメモリ４８と頻度情報付き音声素片ＤＢ３４とのいずれから読出すかを指定するための選択信号１００と、波形データを読出すアドレスを指定するアドレス信号１０２とを出力する機能を持つ。選択信号１００は、指定された音声素片のフラグが１のときにはＨレベルをとり、それ以外のときにはＬレベルをとる。 The connection unit 88 outputs the following signals in order to read the speech unit designated by the unit selection unit 80 from the speech unit DB 34 with frequency information or the memory 48. That is, the connection unit 88 takes a level corresponding to the first flag 140 of the speech unit specified by the unit selection unit 80, and reads the waveform data from either the memory 48 or the speech unit DB 34 with frequency information. It has a function of outputting a selection signal 100 for designating the address and an address signal 102 for designating the address for reading the waveform data. The selection signal 100 takes the H level when the flag of the designated speech unit is 1, and takes the L level otherwise.

音声合成装置３８は、接続部８８による波形データの読出を行なうための機能ブロックとして、第１及び第２の入力を持ち、選択信号１００のレベルに応じて第１及び第２の入力の信号のいずれかを選択して出力するための選択回路９０と、選択信号１００のレベルを反転して反転選択信号１０４を出力するための反転回路９２と、Ｈレベルの反転選択信号１０４を受けると、アドレス信号１０２により指定されるアドレスの波形データを頻度情報付き音声素片ＤＢ３４から読出して選択回路９０の第１の入力に与えるためのアクセス部９４とを含む。一方、メモリ４８は、Ｈレベルの選択信号１００を受けると、アドレス信号１０２により指定されるアドレスの波形データを選択回路９０の第２の入力に与える。選択回路９０は、選択信号１００がＨレベルのときは第２の入力の信号を、Ｌレベルのときには第１の入力の信号を選択して出力する。 The speech synthesizer 38 has first and second inputs as functional blocks for reading out waveform data by the connection unit 88, and the first and second input signals according to the level of the selection signal 100. When receiving a selection circuit 90 for selecting and outputting one, an inversion circuit 92 for inverting the level of the selection signal 100 and outputting an inversion selection signal 104, and an inversion selection signal 104 of H level, And an access unit 94 for reading out waveform data at an address designated by the signal 102 from the speech unit DB 34 with frequency information and supplying the waveform data to the first input of the selection circuit 90. On the other hand, when the memory 48 receives the selection signal 100 of H level, the memory 48 gives the waveform data of the address specified by the address signal 102 to the second input of the selection circuit 90. The selection circuit 90 selects and outputs the second input signal when the selection signal 100 is at the H level, and selects the first input signal when the selection signal 100 is at the L level.

音声合成装置３８はさらに、接続部８８により読出が指示された波形データについて、メモリ４８のうち、対応する音声素片の第２のフラグ１４２（Ｆ₂）を「１」に更新するためのフラグ更新部９６を含む。頻度情報付き音声素片ＤＢ３４又はメモリ４８から読出されたデータは、いずれの場合も、コンピュータのキャッシュメモリに格納されることが通常である。キャッシュメモリはメモリ４８と比較してもさらに高速にアクセス可能である。一度でも読出された波形データはキャッシュメモリに格納されている可能性が高いので、このように第２のフラグを更新し,次のアクセス速度算出部でのコスト計算に反映させ、より選択されやすくする。 The speech synthesizer 38 further includes a flag update unit 96 that, for waveform data whose reading has been instructed by the connection unit 88, updates the second flag 142 (F₂) of the corresponding speech unit in the memory 48 to "1". Data read from either the speech unit DB 34 with frequency information or the memory 48 is normally stored in the computer's cache memory, which can be accessed even faster than the memory 48. Because waveform data that has been read even once is likely to be in the cache memory, the second flag is updated in this way and reflected in the subsequent cost calculation by the access speed cost calculation unit, making that speech unit more likely to be selected.

なお、キャッシュメモリの容量には限りがあるため、何らかのアルゴリズムによってキャッシュに格納されているデータを選択して削除し、そこに新しいデータを格納する。本来であればキャッシュメモリにどの波形データが格納されているかを把握できればよいが、キャッシュメモリはコンピュータハードウェアにより管理されており、ソフトウェアでキャッシュの内容について知ることはできない。従って本実施の形態では、キャッシュメモリに実際にどのようなデータが格納されているかとは別に、一度でも読出されたことのあるデータについてはキャッシュメモリに格納されているものと想定した設計としている。もちろん、キャッシュメモリに記憶されている波形データがどの音声素片に対応するものであるかを容易に知ることができれば、それをアクセス速度コストの計算に反映させることが好ましい。 Because the capacity of the cache memory is limited, data stored in the cache is selected by some algorithm and evicted, and new data is stored in its place. Ideally one would know which waveform data is held in the cache memory, but the cache memory is managed by the computer hardware, and software cannot inspect its contents. The present embodiment is therefore designed on the assumption that data that has been read even once is stored in the cache memory, regardless of what data the cache memory actually holds. Of course, if it were possible to determine easily which speech units the waveform data in the cache memory corresponds to, it would be preferable to reflect that in the calculation of the access speed cost.

本実施の形態で使用されるサブコストは、前述したアクセス速度コスト以外に、基本周波数（Ｆ０）誤差、継続長誤差、ＭＦＣＣ誤差、Ｆ０不連続誤差、ＭＦＣＣ不連続誤差、音素環境誤差にそれぞれ対応する６種類のサブコストを含む。これらのうち、前３者はターゲットコストに属し、後３者は接続コストに属する。 The sub-costs used in the present embodiment correspond to the fundamental frequency (F0) error, duration error, MFCC error, F0 discontinuous error, MFCC discontinuous error, and phoneme environment error in addition to the access speed cost described above. Includes 6 types of sub-costs. Among these, the former three belong to the target cost, and the latter three belong to the connection cost.

本実施の形態に係る素片選択部６４によるコスト計算では、コストＣ_０は以下のようにしてサブコストから計算される。 In the cost calculation by the segment selection unit 64 according to the present embodiment, the cost C ₀ is calculated from the sub cost as follows.

ただし、Ｃ_i1（ｉ1＝１〜３）はターゲットサブコスト、Ｃ_i2（ｉ2＝１〜３）は接続コスト、ｗ_i1（ｉ1＝１〜３）はターゲットサブコスト間に定義された重み、ｗ_i2（ｉ2＝１〜３）は接続サブコスト間に定義された重み、ｐ_１及びｐ_２はそれぞれ、ターゲットコストと接続コスト間に定義された重み、ｗ_sはアクセス速度コストの重み、Ｆ₁及びＦ₂はそれぞれ第１及び第２のフラグの値である。また重みｗsの大きさは、他のサブコストの値の範囲を考慮して適当に決定し、品質に影響を及ぼす極端な偏りが生じないようにする。

Here, C_i1 (i1 = 1 to 3) are the target sub-costs, C_i2 (i2 = 1 to 3) are the connection sub-costs, w_i1 (i1 = 1 to 3) are the weights defined among the target sub-costs, w_i2 (i2 = 1 to 3) are the weights defined among the connection sub-costs, p_1 and p_2 are the weights defined between the target cost and the connection cost, w_s is the weight of the access speed cost, and F₁ and F₂ are the values of the first and second flags, respectively. The magnitude of the weight w_s is chosen with the value ranges of the other sub-costs in mind, so that it does not introduce an extreme bias that would affect quality.
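
The cost equation itself is not reproduced in this text, so the sketch below only illustrates how the quantities defined above could be combined. The weighted sums over the target and connection sub-costs follow the definitions directly. The access speed term, however, is an assumption: the hypothetical `access_penalty` function is chosen only so that the cost is lower when F₁ = 1 (waveform in memory) and when F₂ = 1 (waveform read before), and the patent's actual expression may differ.

```python
from typing import Sequence


def access_penalty(f1: int, f2: int) -> float:
    """Hypothetical penalty in [0, 1]; NOT the patent's formula.

    It is smaller when the waveform is in fast memory (f1 == 1)
    and when it has already been read once (f2 == 1).
    """
    return ((1 - f1) + (1 - f2)) / 2.0


def total_cost(target_sub: Sequence[float], concat_sub: Sequence[float],
               w_target: Sequence[float], w_concat: Sequence[float],
               p1: float, p2: float, w_s: float, f1: int, f2: int) -> float:
    """Weighted target cost + weighted connection cost + access speed cost."""
    target = sum(w * c for w, c in zip(w_target, target_sub))
    concat = sum(w * c for w, c in zip(w_concat, concat_sub))
    return p1 * target + p2 * concat + w_s * access_penalty(f1, f2)


def select_unit(candidates: dict) -> str:
    """Return the key of the candidate with the minimum total cost.

    `candidates` maps a unit id to the keyword arguments of total_cost.
    """
    return min(candidates, key=lambda uid: total_cost(**candidates[uid]))
```

With equal acoustic sub-costs, a candidate whose flags are both 1 receives a lower total than one whose flags are both 0, which is the behaviour the text attributes to the access speed cost. How strongly this tips the selection depends on w_s, as the paragraph above notes.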

上記した第１の実施の形態に係る音声合成システム２０は以下のように動作する。大きく分けてこの音声合成システム２０の動作には二つの局面がある。第１の局面は頻度情報付き音声素片ＤＢ３４の作成であり、第２の局面は第１の局面で作成された頻度情報付き音声素片ＤＢ３４を用いた音声合成である。以下、順に説明する。 The speech synthesis system 20 according to the first embodiment described above operates as follows. Broadly speaking, there are two aspects to the operation of the speech synthesis system 20. The first aspect is the creation of the speech unit DB34 with frequency information, and the second aspect is the speech synthesis using the speech unit DB34 with frequency information created in the first aspect. Hereinafter, it demonstrates in order.

まず第１の局面では頻度情報生成装置３２が以下のようにして頻度情報付き音声素片ＤＢ３４を作成する。図２を参照して、まず音声素片ＤＢ３０を用意する。音声素片ＤＢ３０には頻度情報は付されていない。また、頻度情報４４を記憶すべき領域をメモリ上に確保しておく。さらに、頻度情報４４中の、各音声素片の選択頻度回数を全て０に初期化する。 First, in the first aspect, the frequency information generating device 32 creates a speech element DB 34 with frequency information as follows. With reference to FIG. 2, first, a speech segment DB 30 is prepared. The frequency information is not attached to the speech unit DB 30. Further, an area for storing the frequency information 44 is secured on the memory. Further, the frequency of selection of each speech unit in the frequency information 44 is initialized to 0.

テストデータ４２の第１のテキストを頻度情報生成装置３２に与えると、合成器指令作成部６０がそのテキストに基づいて合成目標の音素ごとに合成器指令６２を作成し、素片選択部６４に与える。素片選択部６４は、この合成器指令６２により指定された音響特徴量をターゲットコスト算出部６８及び接続コスト算出部７０に与える。ターゲットコスト算出部６８及び接続コスト算出部７０は、与えられた音響特徴量に基づき、音声素片ＤＢ３０に含まれる各音声素片との間でターゲットコスト及び接続コストを算出し素片選択部６４に与える。選択頻度コスト算出部７２は、式（１）に従って各音声素片候補の選択頻度コストを算出し素片選択部６４に与える。第１回目の処理では選択頻度はいずれも０であるから、選択頻度コストはいすれも式（１）より「ａ」となる。 When the first text of the test data 42 is given to the frequency information generating device 32, the synthesizer command creation unit 60 creates a synthesizer command 62 for each phoneme of the synthesis target based on that text and gives it to the unit selection unit 64. The unit selection unit 64 gives the acoustic features specified by the synthesizer command 62 to the target cost calculation unit 68 and the connection cost calculation unit 70. Based on the given acoustic features, the target cost calculation unit 68 and the connection cost calculation unit 70 calculate the target cost and the connection cost with respect to each speech unit in the speech unit DB 30 and give them to the unit selection unit 64. The selection frequency cost calculation unit 72 calculates the selection frequency cost of each speech unit candidate according to equation (1) and gives it to the unit selection unit 64. In the first pass every selection frequency is 0, so by equation (1) every selection frequency cost is "a".

素片選択部６４は、ターゲットコスト算出部６８から与えられたターゲットコスト、接続コスト算出部７０から与えられた接続コスト、及び選択頻度コスト算出部７２から与えられた選択頻度コストから式（２）によって総コストを算出する。素片選択部６４はこの総コストが最小の音声素片を選択し、選択された音声素片を示す情報を頻度情報更新部６６に与える。 The unit selection unit 64 calculates the total cost by equation (2) from the target cost given by the target cost calculation unit 68, the connection cost given by the connection cost calculation unit 70, and the selection frequency cost given by the selection frequency cost calculation unit 72. The unit selection unit 64 selects the speech unit with the minimum total cost and gives information identifying the selected speech unit to the frequency information update unit 66.

頻度情報更新部６６は、頻度情報４４の中の頻度情報のうち、素片選択部６４により選択された音素の頻度に１を加算する。以上で第１番目の音素に対する処理を終了する。 The frequency information update unit 66 adds 1 to the frequency of the phoneme selected by the segment selection unit 64 among the frequency information in the frequency information 44. This completes the process for the first phoneme.

同様の処理を、最初のテキストの各音素に対して繰り返す。この繰り返しにより、頻度情報４４は徐々に更新されていく。選択頻度コスト算出部７２による選択頻度コストの算出においては、頻度情報４４の内容が反映される。すなわち、選択された回数が多くなるほど選択頻度コストは小さくなる。従って、選択されたことのある素片候補についてはその後の素片選択で選択される可能性が高くなる。その結果、こうした処理を繰り返すと、互いによく似た音響特徴慮をもつ音声素片同士であって、選択されたことのある素片選択候補はさらに選択されやすく、選択されたことのない候補はさらに選択されにくくなる。 The same processing is repeated for each phoneme of the first text, and the frequency information 44 is gradually updated as a result. The contents of the frequency information 44 are reflected in the selection frequency cost calculated by the selection frequency cost calculation unit 72: the more often a unit has been selected, the smaller its selection frequency cost. A unit candidate that has been selected before is therefore more likely to be selected in subsequent unit selection. As this processing is repeated, among speech units with closely similar acoustic features, candidates that have already been selected become still easier to select, and candidates that have never been selected become still harder to select.

テストデータ４２の全てのテキストについて上記した処理を繰り返すことにより、頻度情報付き音声素片ＤＢ３４が完成する。頻度情報付き音声素片ＤＢ３４が完成すると、音声合成装置３８による音声合成が可能となる。 By repeating the above-described processing for all the texts of the test data 42, the speech unit DB 34 with frequency information is completed. When the speech unit DB 34 with frequency information is completed, speech synthesis by the speech synthesizer 38 becomes possible.
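
The frequency-gathering pass described above can be sketched as a simple loop. Equation (1) is not reproduced in this section, so `selection_frequency_cost` below is a hypothetical stand-in that satisfies only the two properties the text states: it equals "a" when a unit has never been selected, and it decreases as the selection count grows. The function names and the `base_cost` callback (standing in for the combined target and connection costs) are likewise assumptions.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, Sequence


def selection_frequency_cost(count: int, a: float = 1.0) -> float:
    """Hypothetical form: equals `a` at count 0 and shrinks as count grows."""
    return a / (1.0 + count)


def simulate_selection(targets: Iterable[Hashable],
                       candidates: Sequence[Hashable],
                       base_cost: Callable[[Hashable, Hashable], float],
                       a: float = 1.0) -> Counter:
    """Run unit selection over test targets and count how often each unit wins."""
    freq = Counter({unit: 0 for unit in candidates})  # all counts start at 0
    for target in targets:
        best = min(
            candidates,
            key=lambda u: base_cost(target, u) + selection_frequency_cost(freq[u], a),
        )
        freq[best] += 1  # role of the frequency information update unit 66
    return freq
```

With two acoustically similar candidates, whichever wins the first selection keeps winning afterwards, because its selection frequency cost keeps falling while the other's stays at "a". This is the deliberate bias the text describes.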

音声素片の出現頻度を意図的に偏らせることにより、特徴空間において素片の密度が高い領域において、一部の素片のみがよく選択されるようになり、それ以外の素片の出現頻度は下がる。その分、素片の密度が低い部分でそれ以外の素片の頻度の順位が相対的に上がる。これにより、実際の合成時に高速アクセス可能な記憶装置に波形データが格納される音声素片の分布が広がり、高速な記憶装置に格納された波形データに対応する音声素片がより頻繁に選択されるようになる。 By intentionally biasing the appearance frequency of speech units in this way, only some of the units come to be selected often in regions of the feature space where units are dense, and the appearance frequency of the other units there falls. Correspondingly, the frequency ranking of units in sparse regions rises in relative terms. As a result, the speech units whose waveform data is placed in the fast-access storage device at synthesis time are distributed more widely, and speech units whose waveform data is held in the fast storage device are selected more often.

実際の音声合成は以下のようにして行なわれる。図１を参照して、最初に音声素片データロード装置４６により、頻度情報付き音声素片ＤＢ３４中の音声素片の音響特徴量がメモリ４８に格納される。音響特徴量に付随する第１及び第２のフラグの値は０で初期化される。次に、頻度情報４４を基準とし、上位の所定個数の音声素片の波形データがメモリ４８に格納される。波形データをメモリ４８にロードした音声素片については、メモリ４８に格納された音響特徴量に付随する第１のフラグ１４０（図３参照）の値を「１」に設定する。 Actual speech synthesis is performed as follows. Referring to FIG. 1, first, an acoustic feature quantity of a speech unit in speech unit DB 34 with frequency information is stored in memory 48 by speech unit data loading device 46. The values of the first and second flags associated with the acoustic feature quantity are initialized with 0. Next, the waveform data of a predetermined number of upper speech units is stored in the memory 48 with the frequency information 44 as a reference. For the speech element having the waveform data loaded into the memory 48, the value of the first flag 140 (see FIG. 3) associated with the acoustic feature quantity stored in the memory 48 is set to “1”.
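
The start-up loading step described above can be sketched as follows. This is an illustration under stated assumptions: the dictionaries standing in for the speech unit DB 34, the frequency information 44, and the memory 48 are hypothetical, as are the function and key names.

```python
from typing import Dict, Tuple


def load_units(db_waveforms: Dict[str, bytes],
               frequency: Dict[str, int],
               top_n: int) -> Tuple[Dict[str, Dict[str, int]], Dict[str, bytes]]:
    """Build the in-memory flag table and preload the top-N waveforms.

    Returns (flags, memory_waveforms), where flags[unit_id] holds the
    first flag "f1" and the second flag "f2" for that unit.
    """
    # Step 1: every unit gets its flags, both initialized to 0.
    flags = {uid: {"f1": 0, "f2": 0} for uid in db_waveforms}

    # Step 2: rank units by recorded selection frequency, highest first.
    ranked = sorted(db_waveforms, key=lambda uid: frequency.get(uid, 0), reverse=True)

    # Step 3: copy the waveforms of the top-N units into fast memory
    # and set their first flag to 1.
    memory_waveforms: Dict[str, bytes] = {}
    for uid in ranked[:top_n]:
        memory_waveforms[uid] = db_waveforms[uid]
        flags[uid]["f1"] = 1
    return flags, memory_waveforms
```

The sketch follows the "top predetermined number" criterion of this paragraph. A threshold on the frequency value would only change how `ranked` is filtered.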

音声素片データロード装置４６によるメモリ４８へのデータのロードが終わると、実際の音声合成が開始される。図４を参照して、合成器指令３６が与えられると、素片選択部８０はターゲットコスト算出部８２、接続コスト算出部８４及びアクセス速度コスト算出部８６に合成器指令６２により指定された音響特徴量を与える。ターゲットコスト算出部８２及び接続コスト算出部８４は、与えられた音響特徴量を用い、メモリ４８に格納されている音声素片候補のうち、指定された音素ラベルの音声素片候補の音響特徴量との間でターゲットコスト及び接続コストを算出し、素片選択部８０に与える。アクセス速度コスト算出部８６は、各音声素片候補の第１のフラグ１４０及び第２のフラグ１４２の値に基づきアクセス速度コストを算出し素片選択部８０に与える。 When the speech unit data loading device 46 has finished loading data into the memory 48, actual speech synthesis starts. Referring to FIG. 4, when a synthesizer command 36 is given, the unit selection unit 80 gives the acoustic features specified by the synthesizer command 62 to the target cost calculation unit 82, the connection cost calculation unit 84, and the access speed cost calculation unit 86. Using the given acoustic features, the target cost calculation unit 82 and the connection cost calculation unit 84 calculate the target cost and the connection cost with respect to the acoustic features of those speech unit candidates stored in the memory 48 that have the specified phoneme label, and give them to the unit selection unit 80. The access speed cost calculation unit 86 calculates the access speed cost based on the values of the first flag 140 and the second flag 142 of each speech unit candidate and gives it to the unit selection unit 80.

素片選択部８０は、ターゲットコスト算出部８２、接続コスト算出部８４、及びアクセス速度コスト算出部８６から与えられたターゲットコスト、接続コスト、及びアクセスコストに基づき、式（２）に従って総コストを算出する。素片選択部８０はさらに、そのようにして算出された総コストが最小の音声素片候補を選択し、その音声素片候補を示す情報と、音響特徴量とを接続部８８に与える。接続部８８は、与えられた音響特徴量の中の、波形データアドレスをアドレス信号１０２に、第１のフラグ１４０を選択信号１００に、それぞれ出力する。 The segment selection unit 80 calculates the total cost according to the equation (2) based on the target cost, the connection cost, and the access cost given from the target cost calculation unit 82, the connection cost calculation unit 84, and the access speed cost calculation unit 86. calculate. The segment selection unit 80 further selects a speech unit candidate having the smallest total cost calculated as described above, and gives information indicating the speech unit candidate and an acoustic feature amount to the connection unit 88. The connection unit 88 outputs a waveform data address to the address signal 102 and a first flag 140 to the selection signal 100 in the given acoustic feature amount.

例えば選択された音声素片の波形データがメモリ４８に格納されている場合、その音声素片の第１のフラグの値は１であり、選択信号１００はＨレベルとなる。アドレス信号１０２はメモリ４８の、選択された音声素片の波形データのアドレスとなる。アドレス信号１０２はメモリ４８に与えられる。Ｈレベルの選択信号１００がメモリ４８に与えられるので、メモリ４８はアドレス信号１０２により指定されるアドレスの波形データを読出し、選択回路９０の第２の入力に与える。反転選択信号１０４はＬレベルなのでアクセス部９４は何もしない。 For example, when the waveform data of the selected speech unit is stored in the memory 48, the value of the first flag of the speech unit is 1, and the selection signal 100 is H level. The address signal 102 is the address of the waveform data of the selected speech unit in the memory 48. Address signal 102 is provided to memory 48. Since the H level selection signal 100 is applied to the memory 48, the memory 48 reads the waveform data at the address specified by the address signal 102 and supplies it to the second input of the selection circuit 90. Since the inversion selection signal 104 is at the L level, the access unit 94 does nothing.

選択信号１００がＨレベルなので、選択回路９０は第２の入力を選択する。すなわち、選択回路９０はメモリ４８からの出力を接続部８８に与える。接続部８８はこの波形データを用いて波形接続を行なう。 Since the selection signal 100 is at the H level, the selection circuit 90 selects the second input. In other words, the selection circuit 90 gives the output from the memory 48 to the connection unit 88. The connection unit 88 performs waveform connection using this waveform data.

また、選択された音声素片の波形データが頻度情報付き音声素片ＤＢ３４に格納されている場合、その音声素片の第１のフラグの値は０であり、選択信号１００はＬレベルとなる。アドレス信号１０２は頻度情報付き音声素片ＤＢ３４の、選択された音声素片の波形データのアドレスとなる。アドレス信号１０２はアクセス部９４に与えられる。Ｈレベルの反転選択信号１０４がアクセス部９４に与えられるので、アクセス部９４はアドレス信号１０２により指定されるアドレスの波形データを頻度情報付き音声素片ＤＢ３４から読出し、選択回路９０の第１の入力に与える。選択信号１００はＬレベルなのでメモリ４８からは何も出力されない。 When the waveform data of the selected speech unit is stored in the speech unit DB 34 with frequency information, the value of the first flag of that speech unit is 0 and the selection signal 100 is at the L level. The address signal 102 carries the address of the waveform data of the selected speech unit in the speech unit DB 34 with frequency information, and is given to the access unit 94. Because the H-level inverted selection signal 104 is given to the access unit 94, the access unit 94 reads the waveform data at the address specified by the address signal 102 from the speech unit DB 34 with frequency information and gives it to the first input of the selection circuit 90. Because the selection signal 100 is at the L level, nothing is output from the memory 48.

選択信号１００がＬレベルなので選択回路９０は第１の入力の信号を選択して接続部８８に与える。すなわちこの場合、頻度情報付き音声素片ＤＢ３４から読出された波形データが接続部８８に与えられ、波形接続に用いられる。 Since the selection signal 100 is at the L level, the selection circuit 90 selects the signal of the first input and supplies it to the connection unit 88. That is, in this case, the waveform data read from the speech element DB 34 with frequency information is given to the connection unit 88 and used for waveform connection.
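
The two read paths described above (memory 48 when the first flag is 1, speech unit DB 34 via the access unit 94 when it is 0) can be modelled in software as follows. This is a sketch only: the dictionaries standing in for the two stores and the function name are hypothetical, and the boolean variables mirror the H/L signal levels rather than any real hardware interface.

```python
from typing import Dict, Optional


def read_waveform(f1: int, address: int,
                  memory: Dict[int, bytes],
                  database: Dict[int, bytes]) -> bytes:
    """Route a waveform read according to the first flag of the selected unit."""
    select = (f1 == 1)        # selection signal 100: H (True) when the flag is 1
    inverted = not select     # inverted selection signal 104 from the inverter 92

    # The access unit 94 reads from the DB only when the inverted signal is H.
    first_input: Optional[bytes] = database[address] if inverted else None
    # The memory 48 outputs data only when the selection signal is H.
    second_input: Optional[bytes] = memory[address] if select else None

    # Selection circuit 90: second input on H, first input on L.
    chosen = second_input if select else first_input
    assert chosen is not None
    return chosen
```

Exactly one of the two stores is accessed per read, so a unit whose waveform is only in the database never triggers a lookup in fast memory, and vice versa.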

メモリ４８の波形データにせよ、頻度情報付き音声素片ＤＢ３４の波形データにせよ、最近接続部８８により読出されたものは図示しないキャッシュメモリに格納される可能性が高い。従ってフラグ更新部９６は、メモリ４８中の、接続部８８により波形データが読出された音声素片の音響特徴量に付随する第２のフラグ１４２の値を「１」に更新する。この結果、同じ音声素片が次に選択された場合、アクセス速度コストはより小さくなり、同じ音声素片が選択される可能性が高くなる。この音声素片に対応する波形データはキャッシュに格納されていて高速アクセス可能である可能性が高い。従って全体として波形データの読出が高速化される可能性が高い。 Regardless of the waveform data in the memory 48 or the waveform data in the speech element DB 34 with frequency information, the data read by the connection unit 88 is likely to be stored in a cache memory (not shown). Accordingly, the flag update unit 96 updates the value of the second flag 142 associated with the acoustic feature amount of the speech unit from which the waveform data is read by the connection unit 88 in the memory 48 to “1”. As a result, when the same speech unit is selected next, the access speed cost becomes smaller, and the possibility that the same speech unit is selected increases. It is highly possible that the waveform data corresponding to the speech segment is stored in the cache and can be accessed at high speed. Therefore, there is a high possibility that the reading of waveform data as a whole is speeded up.

以上のように本実施の形態に係る音声合成システム２０では、音声素片ＤＢ３０を使用してテストデータ４２による音声合成実験を行なって、各音声頻度が選択された頻度を調べる。音声合成実験の途中では、音声素片が選択された回数に応じ、選択された素片はさらに選択されやすく、そうでない素片はさらに選択されにくくなるように、選択頻度コストというサブコストを導入し、音声素片の選択に人為的な偏りが生じるようにする。 As described above, in the speech synthesis system 20 according to the present embodiment, a speech synthesis experiment is run on the test data 42 using the speech unit DB 30 to determine how often each speech unit is selected. During the experiment, a sub-cost called the selection frequency cost is introduced so that, according to how many times each speech unit has been selected, units already selected become easier to select and the others become harder to select, producing a deliberate bias in speech unit selection.

音声合成時には、この頻度が上位の所定個数の音声素片候補の波形データをメモリに記憶しておく。さらに、サブコストとして、波形データが記憶されている記憶装置のアクセス速度を反映したアクセス速度コストを定義し、高速アクセス可能な記憶装置に波形データが記憶されている音声素片候補が選択されやすくする。その結果、メモリなどの高速アクセス可能な記憶媒体に記憶された音声素片が選択されやすくなり、全体として音声合成処理が高速化される。 At the time of speech synthesis, waveform data of a predetermined number of speech segment candidates having higher frequencies is stored in the memory. Furthermore, as a sub-cost, an access speed cost reflecting the access speed of the storage device storing the waveform data is defined, and the speech unit candidate storing the waveform data in the storage device capable of high-speed access is easily selected. . As a result, a speech unit stored in a storage medium such as a memory that can be accessed at high speed can be easily selected, and the speech synthesis process as a whole is accelerated.

さらに、キャッシュメモリなどが使用されることを考慮し、最近選択された音声波形についてはアクセス速度コストが少なく算出されるようにアクセス速度コストを設計する。これにより、コンピュータによるキャッシュ制御を利用した処理速度の向上を図ることができる。 Further, in consideration of the use of a cache memory or the like, the access speed cost is designed so that the access speed cost is calculated to be low for the recently selected speech waveform. Thereby, it is possible to improve the processing speed using the cache control by the computer.

また、上記した実施の形態の装置では、一旦頻度情報４４が作成された後は、頻度情報４４を更新しないことを前提としている。しかし本発明はそのような実施の形態には限定されない。例えば音声合成時、頻度情報４４のをメモリ４８に転記し、素片候補が選択されるたびに頻度情報４４を更新し、全体の処理が終了したとき、または所定回数の素片候補の選択が行なわれるたびごとに、もとの頻度情報４４に書き戻すようにしてもよい。こうすることで、実験だけでなく実際のデータに基づく音声合成での選択頻度に基づいて、波形データの格納場所を決定することができる。 In the apparatus of the above-described embodiment, it is assumed that the frequency information 44 is not updated after the frequency information 44 is once created. However, the present invention is not limited to such an embodiment. For example, at the time of speech synthesis, the frequency information 44 is transferred to the memory 48, and the frequency information 44 is updated each time a segment candidate is selected. When the entire process is completed, or a predetermined number of segment candidates is selected. Each time it is performed, the original frequency information 44 may be written back. By doing so, it is possible to determine the storage location of the waveform data based on the selection frequency in the speech synthesis based on the actual data as well as the experiment.

今回開示された実施の形態は単に例示であって、本発明が上記した実施の形態のみに制限されるわけではない。本発明の範囲は、発明の詳細な説明の記載を参酌した上で、特許請求の範囲の各請求項によって示され、そこに記載された文言と均等の意味および範囲内でのすべての変更を含む。 The embodiment disclosed herein is merely an example, and the present invention is not limited to the above-described embodiment. The scope of the present invention is indicated by each of the claims after taking into account the description of the detailed description of the invention, and all modifications within the meaning and scope equivalent to the wording described therein are intended. Including.

本発明の一実施の形態にかかる音声合成システム２０のブロック図である。 FIG. 1 is a block diagram of a speech synthesis system 20 according to an embodiment of the present invention.

図１に示す頻度情報生成装置３２のブロック図である。 FIG. 2 is a block diagram of the frequency information generating device 32 shown in FIG. 1.

図１に示すメモリ４８の一部の記憶領域の構成を模式的に示す図である。 FIG. 3 is a diagram schematically showing the configuration of part of the storage area of the memory 48 shown in FIG. 1.

図１に示す音声合成装置３８のブロック図である。 FIG. 4 is a block diagram of the speech synthesizer 38 shown in FIG. 1.

Explanation of symbols

２０音声合成システム、３０音声素片ＤＢ、３２頻度情報生成装置、３４頻度情報付き音声素片ＤＢ、３６,６２合成器指令、３８音声合成装置、４０合成音声波形、４２テストデータ、４４頻度情報、４６音声素片データロード装置、４８メモリ、６０合成器指令作成部、６４,８０素片選択部、６６頻度情報更新部、６８,８２ターゲットコスト算出部、７０,８４接続コスト算出部、７２選択頻度コスト算出部、８６アクセス速度コスト算出部、８８接続部、９０選択回路、１００選択信号、１０２アドレス信号、１０４反転選択信号 20: speech synthesis system; 30: speech unit DB; 32: frequency information generating device; 34: speech unit DB with frequency information; 36, 62: synthesizer command; 38: speech synthesizer; 40: synthesized speech waveform; 42: test data; 44: frequency information; 46: speech unit data loading device; 48: memory; 60: synthesizer command creation unit; 64, 80: unit selection unit; 66: frequency information update unit; 68, 82: target cost calculation unit; 70, 84: connection cost calculation unit; 72: selection frequency cost calculation unit; 86: access speed cost calculation unit; 88: connection unit; 90: selection circuit; 100: selection signal; 102: address signal; 104: inverted selection signal

Claims

A unit-connected speech synthesizer that calculates a cost, including a plurality of sub-costs, between a target of synthesized speech and speech unit candidates, and performs speech synthesis by selecting and connecting speech units, based on the cost, from a speech unit database including a plurality of speech unit candidates, wherein each speech unit candidate includes acoustic feature data and waveform data of the speech unit, the speech synthesizer comprising:
A first storage device for storing the speech segment database;
A second storage device, accessible at a higher speed than the first storage device, for storing the acoustic feature data of the plurality of speech unit candidates stored in the speech unit database, and the waveform data of those speech unit candidates that are selected, based on a predetermined criterion, from among the plurality of speech unit candidates stored in the speech unit database,
The plurality of sub-costs includes an access speed cost related to an access speed to a storage device in which waveform data of each speech unit candidate is stored,
The acoustic feature data of each of the plurality of speech unit candidates includes a first flag indicating in which of the first and second storage devices the waveform data of that speech unit candidate is stored.
The speech synthesizer further includes:
Selection means for selecting, from the plurality of speech unit candidates and based on the acoustic feature data stored in the second storage device, one speech unit candidate for which the cost calculated including the plurality of sub-costs with respect to the target of the synthesized speech satisfies a predetermined condition;
Connection means for reading the speech waveform of the speech unit candidate selected by the selection means from either the first storage device or the second storage device, based on the first flag corresponding to the selected speech waveform, connecting it in accordance with the target of the synthesized speech, and outputting a synthesized speech waveform.

The speech unit database stored in the first storage device is a speech to store corresponding waveform data in the second storage device among a plurality of speech unit candidates included in the speech unit database. Selection criteria information that serves as a criterion for selecting segment candidates is attached.
The unit-connected speech synthesizer further includes a speech unit selected according to the selection criterion information from the plurality of speech unit candidates included in the speech unit database stored in the first storage device. The unit connection type speech synthesizer according to claim 1, further comprising loading means for loading candidate waveform data into the second storage device.

The unit connection type speech synthesizer according to claim 2, further comprising selection criterion information generation means for generating the selection criterion information of the speech segment database based on test data.

The selection criterion information generating means includes:
Speech synthesis simulation means for simulating speech synthesis using speech segment candidates included in the speech segment database based on the test data;
Frequency recording means for recording the frequency selected during speech synthesis by the speech synthesis simulation means for each of the plurality of speech unit candidates included in the speech unit database;
The unit connection type speech synthesis according to claim 3, wherein the selection criterion information is frequency information of each of the plurality of speech unit candidates included in the speech unit database recorded by the frequency recording unit. apparatus.

Based on the frequency information, the loading unit selects a predetermined number of high-frequency ones selected by speech synthesis by the speech synthesis simulation unit from among speech units included in the speech unit database, 5. The unit-connected speech synthesizer according to claim 4, further comprising means for loading waveform data of a selected speech unit into the second storage device.

The unit-connected speech synthesizer according to claim 4 or 5, wherein the speech synthesis simulation means includes means for selecting, from the speech unit database, one speech unit candidate for which a cost satisfies a predetermined condition, the cost being calculated between a target of synthesized speech obtained from the test data and the plurality of speech unit candidates included in the speech unit database, and including those of the plurality of sub-costs other than the access speed cost together with a selection frequency cost calculated based on the frequency, recorded by the frequency recording means, with which each speech unit candidate has been selected.

The unit connection type speech synthesizer according to any one of claims 1 to 6, further comprising:
a cache memory;
a cache memory management mechanism provided for the cache memory;
means for storing a second flag that is provided for each of the plurality of speech unit candidates and records whether or not the corresponding speech unit candidate has been selected by the selection means;
means for initializing the value of the second flag to a predetermined first value; and
flag updating means for updating, each time one of the plurality of speech unit candidates is selected by the selection means, the second flag corresponding to the selected speech unit candidate to a predetermined second value different from the first value,
wherein the access speed cost is calculated based on the first and second flags, and
the selection means includes means for selecting, from the plurality of speech unit candidates and based on the acoustic feature amounts stored in the second storage device, one speech unit candidate for which the cost calculated including the plurality of sub-costs, the access speed cost among them, satisfies a predetermined condition with respect to the target of the synthesized speech.
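
One way to derive an access speed cost from the two flags is sketched below in Python. The three-level scheme (zero for waveform data in the second storage device, a small cost for a previously selected unit that may still be in the cache memory, a full cost otherwise) and the numeric values are assumptions; the claim only requires that the cost be calculated based on the first and second flags. The names `UnitFlags`, `access_speed_cost`, and `mark_selected` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class UnitFlags:
    in_fast_storage: bool              # first flag: waveform is in the second storage device
    previously_selected: bool = False  # second flag, initialised to the first value

def access_speed_cost(flags, cached_cost=0.2, slow_cost=1.0):
    """Assumed mapping from the two flags to an access speed cost."""
    if flags.in_fast_storage:
        return 0.0            # waveform is already in fast storage
    if flags.previously_selected:
        return cached_cost    # waveform was read before and may still be cached
    return slow_cost          # waveform must be read from the slow storage device

def mark_selected(flags):
    # Flag update performed each time the unit is selected.
    flags.previously_selected = True

unit = UnitFlags(in_fast_storage=False)
cost_before = access_speed_cost(unit)
mark_selected(unit)
cost_after = access_speed_cost(unit)
```

The second flag only approximates the cache state, since the cache memory management mechanism may have evicted the data in the meantime.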

A unit connection type speech synthesis method for performing speech synthesis by calculating a cost including a plurality of sub-costs between a target of synthesized speech and speech unit candidates, and selecting and connecting speech units, based on the cost, from a speech unit database including a plurality of speech unit candidates, wherein each speech unit candidate includes acoustic feature data of the speech unit and waveform data of the speech unit, the method comprising:
storing the speech segment database in a first storage device; and
storing, in a second storage device accessible at a higher speed than the first storage device, the acoustic feature data of the plurality of speech unit candidates stored in the speech unit database and the waveform data of the speech unit candidates selected, based on a predetermined criterion, from among the plurality of speech unit candidates stored in the speech unit database,
wherein the plurality of sub-costs includes an access speed cost related to the access speed of the storage device in which the waveform data of each speech unit candidate is stored, and
the acoustic feature data of each of the plurality of speech unit candidates includes a first flag indicating in which of the first and second storage devices the waveform data of that speech unit candidate is stored,
the speech synthesis method further comprising:
a selection step of selecting, from the plurality of speech unit candidates and based on the acoustic feature data stored in the second storage device, one speech unit candidate for which the cost calculated including the plurality of sub-costs satisfies a predetermined condition with respect to the target of the synthesized speech; and
a connection step of reading out the speech waveform of the speech unit candidate selected in the selection step from either the first storage device or the second storage device, based on the first flag corresponding to the selected speech unit candidate, connecting the speech waveforms in accordance with the synthesizer command, and outputting a synthesized speech waveform.
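
The selection step and the connection step can be pictured with the following Python sketch. It is a simplification under stated assumptions: the target cost is an F0 distance, the access speed cost depends only on the first flag (`in_fast`), waveforms are byte strings joined by plain concatenation, and the names `select_unit`, `read_waveform`, and `synthesize` are hypothetical.

```python
def select_unit(target, features, weights):
    """Select the candidate whose weighted sum of sub-costs is lowest."""
    def total_cost(feature):
        target_cost = abs(target["f0"] - feature["f0"])   # assumed target sub-cost
        access_cost = 0.0 if feature["in_fast"] else 1.0  # from the first flag
        return weights["target"] * target_cost + weights["access"] * access_cost
    return min(features, key=total_cost)

def read_waveform(feature, fast_store, slow_store):
    # The first flag decides which storage device is read.
    store = fast_store if feature["in_fast"] else slow_store
    return store[feature["id"]]

def synthesize(targets, features_by_phoneme, fast_store, slow_store, weights):
    output = bytearray()
    for target in targets:
        unit = select_unit(target, features_by_phoneme[target["phoneme"]], weights)
        output += read_waveform(unit, fast_store, slow_store)  # naive concatenation
    return bytes(output)

# Acoustic feature data for all candidates is held in fast storage.
features = {"a": [{"id": "a1", "f0": 120.0, "in_fast": True},
                  {"id": "a2", "f0": 126.0, "in_fast": False}]}
slow_store = {"a1": b"A", "a2": b"B"}  # first storage device: every waveform
fast_store = {"a1": b"A"}              # second storage device: selected waveforms only
targets = [{"phoneme": "a", "f0": 125.0}]

with_access = synthesize(targets, features, fast_store, slow_store,
                         {"target": 0.1, "access": 1.0})
without_access = synthesize(targets, features, fast_store, slow_store,
                            {"target": 0.1, "access": 0.0})
```

In this toy case the access speed cost steers selection toward the unit whose waveform is in fast storage, even though the other unit is acoustically closer to the target.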

The unit connection type speech synthesis method according to claim 8, wherein the speech unit database stored in the first storage device has attached to it selection criterion information that serves as the criterion for selecting, from among the plurality of speech unit candidates included in the speech unit database, the speech unit candidates whose waveform data is to be stored in the second storage device, and
the method further includes a loading step of loading, into the second storage device, the waveform data of the speech unit candidates selected according to the selection criterion information from among the plurality of speech unit candidates included in the speech unit database stored in the first storage device.

The unit connection type speech synthesis method according to claim 9, further comprising a selection criterion information generation step of generating the selection criterion information of the speech segment database based on test data.

The unit connection type speech synthesis method according to claim 10, wherein the selection criterion information generation step includes:
a speech synthesis simulation step of simulating speech synthesis using speech unit candidates included in the speech unit database based on the test data; and
a frequency recording step of recording, for each of the plurality of speech unit candidates included in the speech unit database, the frequency with which that candidate is selected during speech synthesis in the speech synthesis simulation step,
and wherein the selection criterion information is the frequency information recorded in the frequency recording step for each of the plurality of speech unit candidates included in the speech unit database.

The unit connection type speech synthesis method according to claim 11, wherein the loading step includes a step of selecting, based on the frequency information, a predetermined number of speech units with the highest frequency of selection during speech synthesis in the speech synthesis simulation step from among the speech units included in the speech unit database, and loading the waveform data of the selected speech units into the second storage device.

The unit connection type speech synthesis method according to claim 11 or 12, wherein the speech synthesis simulation step includes a step of selecting, from the speech unit database, one speech unit candidate for which a cost satisfies a predetermined condition, the cost being calculated between a target of synthesized speech obtained from the test data and the plurality of speech unit candidates included in the speech unit database, and including both those of the plurality of sub-costs other than the access speed cost and a selection frequency cost calculated based on the frequency, recorded in the frequency recording step, with which each speech unit candidate has been selected.

The unit connection type speech synthesis method according to claim 8, wherein the method is executed on a computer having a cache memory and a cache memory management mechanism provided for the cache memory, the method further comprising:
storing, in a predetermined storage device, a second flag that is provided for each of the plurality of speech unit candidates and records whether or not the corresponding speech unit candidate has been selected in the selection step;
initializing the value of the second flag to a predetermined first value; and
updating, each time one of the plurality of speech unit candidates is selected in the selection step, the second flag corresponding to the selected speech unit candidate to a predetermined second value different from the first value,
wherein the access speed cost is calculated based on the first and second flags, and
the selection step includes a step of selecting, from the plurality of speech unit candidates and based on the acoustic feature amounts stored in the second storage device, one speech unit candidate for which the cost calculated including the plurality of sub-costs, the access speed cost among them, satisfies a predetermined condition with respect to the target of the synthesized speech.