JP4424023B2

JP4424023B2 - Segment-connected speech synthesizer

Info

Publication number: JP4424023B2
Application number: JP2004073724A
Authority: JP
Inventors: 隆志野見; 恒河井; みちよ河野
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2004-03-16
Filing date: 2004-03-16
Publication date: 2010-03-03
Anticipated expiration: 2024-03-16
Also published as: JP2005265874A

Description

The present invention relates to a speech synthesizer and, more particularly, to a segment-connected speech synthesizer that synthesizes speech matching a synthesizer command by selecting and connecting speech segments on the basis of a predetermined cost function.

Speech recognition and speech synthesis are important technologies for realizing interfaces between humans and computer-based systems. Combined with artificial intelligence technology, they allow users to use a variety of services without being aware that the other party is a computer system.

Speech synthesis in particular is highly important as the interface for system output to humans. Humans are sensitive to unnaturalness in synthesized speech. If a user feels that the synthesized speech is unnatural, the user's own utterances are affected, and as a result the dialogue between the human and the system may break down.

In one recent speech synthesis technique, speech segments for each phoneme are stored in a database built in advance from an utterance corpus containing a large number of human utterances. At synthesis time, the segment that appears most appropriate is selected from among the speech segments corresponding to each specified phoneme, and the selected segments are connected. This is referred to as segment-connected speech synthesis in this specification.

In segment-connected speech synthesis, the problem is how to retrieve appropriate speech segments from the database with reference to a given synthesis target.

The data constituting the synthesis target typically include phonemes and speech features such as fundamental frequency (F0), duration, MFCC (Mel-Frequency Cepstrum Coefficient), and power. These are hereinafter referred to as the “synthesizer command”.

In segment-connected speech synthesis, an evaluation function called a “cost” is defined to express both the deviation of a speech segment's F0, duration, MFCC, power, and so on from the synthesizer command and the degradation in naturalness caused by connection. The optimum speech segment sequence is determined by finding the speech segments that minimize this cost.

The applicant of the present application has proposed segment-connected speech synthesis in which the above “cost” is decomposed into “sub-costs”, each corresponding to a particular speech feature, and is defined as a combination of them (for example, a linear sum). See, for example, Patent Document 1.

Sub-costs can be broadly classified into two groups: target costs and connection costs. A target cost represents the error between the synthesizer command and a segment candidate. A connection cost represents the discontinuity between adjacent segments in the synthesized speech.

Patent Document 1: JP 2003-208188 A (paragraphs 0014 to 0047)

With segment-connected speech synthesis as described above, the larger the speech segment database, the more likely it is that low-cost candidates can be found at synthesis time, and the higher the quality of the synthesized speech. However, a large speech segment database increases the amount of computation required to determine the candidates.

One conceivable way to reduce the amount of computation is to preliminarily select segment candidates with a cheaper cost calculation before the cost-based segment selection proper. For example, calculating a connection cost requires not only the segment candidate itself but also its relationship to the preceding and following phonemes, which makes the computation expensive. Calculating a target cost, by contrast, requires only the segment candidate. Segment candidates could therefore be preliminarily selected using only the target cost, without the connection cost.

Even then, however, the amount of computation for preliminary selection still grows as the speech segment database grows. It is desirable to be able to select segments at a high and constant speed regardless of the size of the speech segment database. Even in that case, degradation in quality should be avoided.

An object of the present invention is therefore to provide a speech synthesizer capable of selecting segments at a high and constant speed even when a large-scale speech corpus is used.

Another object of the present invention is to provide a speech synthesizer capable of selecting segments at a high and constant speed, and of producing high-quality synthesized speech, even when a large-scale speech corpus is used.

A segment-connected speech synthesizer according to the present invention connects speech segments in a speech segment database by using the speech segment database and tables that hold the speech segments in the database sorted, for each phoneme, with the value of a predetermined feature as the key. The synthesizer includes: table selection means for receiving a synthesizer command that defines a sequence of phoneme labels to be the target of the synthesized speech and target features associated with each phoneme label, and for selecting the table corresponding to the phoneme specified by a phoneme label of the synthesizer command; preliminary selection means for preliminarily selecting, within the table selected by the table selection means, the speech segments located in a range that is determined by a predetermined criterion and that includes the speech segment whose value of the predetermined feature is the one specified by the target features of the synthesizer command; selection means for selecting a speech segment, on the basis of a predetermined criterion, from among the speech segments preliminarily selected by the preliminary selection means; and connection means for connecting the speech segments selected by the selection means in accordance with the synthesizer command and outputting a synthesized speech waveform.

Preferably, the predetermined feature is phoneme length, and the preliminary selection means includes means for selecting, within the table selected by the table selection means, the speech segments located in a range that is specified by a predetermined criterion and that includes the phoneme length specified by the synthesizer command.

Preferably, the table carries allowable-number information specifying in advance the allowable number of speech segments to be preliminarily selected, and the means for selecting includes means for selecting, within the table selected by the table selection means, the number of speech segments given by the allowable number, centered on the speech segment having the phoneme length specified by the synthesizer command.

The predetermined feature may instead be a quantized fundamental frequency (F0), and the preliminary selection means may include means for selecting, within the table selected by the table selection means, the speech segments located in a range that is specified by a predetermined criterion and that includes the fundamental frequency specified by the synthesizer command.

The table may carry allowable-number information specifying in advance the allowable number of speech segments to be preliminarily selected, and the means for selecting may include means for selecting, within the table selected by the table selection means, the number of speech segments given by the allowable number, centered on the speech segment having the fundamental frequency specified by the synthesizer command.

Preferably, the preliminary selection means includes: search means for searching the table selected by the table selection means, using a predetermined search algorithm, for the speech segment whose value of the predetermined feature is the one specified by the target features of the synthesizer command; and means for selecting, within that table, the speech segments located in a predetermined range centered on the speech segment found by the search means.

Preferably, the search means includes binary tree search means for searching the table selected by the table selection means, using a binary tree search algorithm, for the speech segment whose value of the predetermined feature is the one specified by the target features of the synthesizer command.

Preferably, the selection means includes means for selecting, from among the speech segment candidates preliminarily selected by the preliminary selection means, a speech segment whose cost, calculated by a predetermined cost calculation based on the features of that speech segment and the target features, satisfies a predetermined condition.

[First Embodiment]
−Configuration−
FIG. 1 shows a block diagram of a speech synthesis system 20 according to the first embodiment of the present invention. Referring to FIG. 1, the speech synthesis system 20 includes: a speech segment database (DB) 30 of the conventional kind; a phoneme length table creation unit 32 for classifying the speech segments in the speech segment DB 30 by phoneme, extracting the information needed for segment selection, such as the phoneme length of each segment, and generating sorted phoneme-specific phoneme length tables 34; and a speech synthesizer 38 that receives as input a synthesizer command 36 obtained by analyzing the text to be synthesized, uses the phoneme-specific phoneme length tables 34 to preliminarily select a roughly constant number of speech segments from the speech segment DB 30, and connects appropriate speech segments chosen from the preliminarily selected candidates to output a synthesized speech waveform 40.

FIG. 2 shows the configuration of the phoneme length table creation unit 32 in block diagram form. Referring to FIG. 2, the phoneme length table creation unit 32 includes: a phoneme length extraction unit 80 for extracting from the speech segment DB 30 the phoneme label, address, phoneme length, and other information needed for cost calculation for each segment in the DB, and creating phoneme-specific phoneme length tables 82; a sort processing unit 84 for sorting each of the phoneme-specific phoneme length tables 82 in ascending order of phoneme length to create the sorted phoneme-specific phoneme length tables 34; and an allowable segment candidate number calculation unit 86 for examining the distribution of phoneme lengths in each phoneme-specific phoneme length table 34 and calculating the allowable phoneme length width used when segment candidates are preliminarily selected. The function of the allowable segment candidate number calculation unit 86 is described later with reference to FIG. 4.

Referring again to FIG. 1, the speech synthesizer 38 includes: a preliminary selection unit 60 that receives the synthesizer command 36, preliminarily selects from the phoneme-specific phoneme length tables 34 the segments that correspond to the phoneme specified by the synthesizer command 36 and whose phoneme lengths lie within a predetermined width centered on the phoneme length specified by the synthesizer command 36, and creates a segment candidate table 62; a segment selection unit 64 that receives the synthesizer command 36 and selects the segment with the lowest cost from among the segment candidates in the segment candidate table 62; and a connection unit 66 that reads from the speech segment DB 30 the speech segment data corresponding to the speech segments selected by the segment selection unit 64, connects them to one another, and outputs the synthesized speech waveform 40.

Referring to FIG. 3, the preliminary selection unit 60 includes a table selection unit 100 that receives the synthesizer command 36 and selects, from among the phoneme-specific phoneme length tables 34, the phoneme length table 110 corresponding to the phoneme specified by the synthesizer command 36. Attached to the phoneme length table 110 is the allowable segment candidate number 112 calculated by the allowable segment candidate number calculation unit 86.

The preliminary selection unit 60 further includes: a binary tree search unit 102 for searching the phoneme length table 110, by binary tree search, for a segment candidate whose phoneme length matches the phoneme length given by the synthesizer command 36; and a segment candidate selection unit 104 for selecting, within the phoneme length table 110, the number of segment candidates specified by the allowable segment candidate number 112, centered on the segment candidate found by the binary tree search unit 102, and creating the segment candidate table 62.

The allowable number of segment candidates used by the segment candidate selection unit 104 need not be calculated with any particularly strict criterion; any method will do as long as it narrows the segment candidates down to a reasonable number. If the distribution of segment lengths can be regarded as following a normal distribution, its standard deviation σ can be calculated and the number of segments falling within aσ (where a is a predetermined number) can be used.
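The normal-distribution rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "within aσ" is read here as "within a·σ of the mean phoneme length", the population standard deviation is used, and the function name `allowable_candidate_count` and the sample durations are hypothetical.

```python
import statistics

def allowable_candidate_count(durations, a=1.0):
    """Count the segments whose phoneme length lies within a*sigma of the mean.

    Assumption: "within a-sigma" means |d - mean| <= a * sigma, with sigma the
    population standard deviation of the lengths in one per-phoneme table.
    """
    mean = statistics.fmean(durations)
    sigma = statistics.pstdev(durations)
    return sum(1 for d in durations if abs(d - mean) <= a * sigma)

# Hypothetical phoneme lengths (ms) for one phoneme table.
durations = [50.0, 60.0, 70.0, 80.0, 90.0]
count_1sigma = allowable_candidate_count(durations, a=1.0)  # 3
count_2sigma = allowable_candidate_count(durations, a=2.0)  # 5
```

The resulting count would be stored alongside the table once, at table-building time, so no statistics need to be computed during synthesis.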

In practice, the distribution of segment lengths is often multimodal. In such cases, the allowable segment candidate number calculation unit 86 (see FIG. 2) according to the present embodiment calculates this width according to the criterion shown in FIG. 4.

Referring to FIG. 4, when the distribution is bimodal, for example, a line parallel to the x axis is drawn tangent to the distribution curve at the lowest point A of the valley. Let B and C be the points where this line intersects the distribution curve, and let BA = w₁ and AC = w₂. In the present embodiment, the allowable phoneme length width is w = α(w₁ + w₂)/2, where α is a predetermined number, preferably 0 < α ≤ 1. The same idea can be extended to distribution curves with more than two peaks. The allowable phoneme length width can of course also be determined by various other methods.
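As a worked illustration of the width formula, with purely hypothetical values for w₁, w₂, and α (the millisecond unit is also an assumption):

```latex
w_1 = 30\ \text{ms},\quad w_2 = 50\ \text{ms},\quad \alpha = 0.5
\;\Rightarrow\;
w = \alpha\,\frac{w_1 + w_2}{2} = 0.5 \times \frac{30 + 50}{2} = 20\ \text{ms}
```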

In the cost calculation by the segment selection unit 64 according to the present embodiment, the cost C_0 is calculated from the sub-costs as follows.

Here, C_i1 (i1 = 1 to N_1) are the target sub-costs, C_i2 (i2 = 1 to N_2) are the connection sub-costs, w_i1 (i1 = 1 to N_1) are the weights defined among the target sub-costs, w_i2 (i2 = 1 to N_2) are the weights defined among the connection sub-costs, and p_1 and p_2 are the weights defined between the target cost and the connection cost, respectively.
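The equation referred to as equation (1) is not reproduced in this text. One form consistent with the symbol definitions above, assuming the linear-sum combination of sub-costs mentioned in the background section, would be the following; it is a reconstruction, not a quotation of the original formula.

```latex
C_0 = p_1 \sum_{i1=1}^{N_1} w_{i1}\, C_{i1} \;+\; p_2 \sum_{i2=1}^{N_2} w_{i2}\, C_{i2} \tag{1}
```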

−Operation−
The speech synthesis system 20 operates as follows. Its operation is broadly divided into two phases. The first phase is the construction of the phoneme-specific phoneme length tables 34. The second phase is speech synthesis by the speech synthesizer 38.

In the first phase, the following processing is performed. It is assumed that the speech segment DB 30 has already been created from a speech corpus before this processing. The phoneme length extraction unit 80 (see FIG. 2) of the phoneme length table creation unit 32 extracts from each item of speech segment data in the speech segment DB 30, as described above, the phoneme label, address, phoneme length, and other information needed for cost calculation, and creates the phoneme-specific phoneme length tables 82. The segment data in each of the phoneme-specific phoneme length tables 82 are not sorted.

The sort processing unit 84 sorts each phoneme-specific phoneme length table 82 in ascending order of phoneme length. As a result, the phoneme-specific phoneme length tables 34, sorted by phoneme length, are created. The allowable segment candidate number calculation unit 86 calculates the allowable segment candidate number 112 (see FIG. 3) from the distribution of the phoneme lengths of the speech segments in each phoneme-specific phoneme length table 34, and attaches it to that table.
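The first phase (group by phoneme, then sort each group by phoneme length) can be sketched as follows. This is a minimal Python illustration, not the patented implementation; the record layout `(phoneme, duration_ms, address)`, the `Entry` type, and the function name `build_duration_tables` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple

class Entry(NamedTuple):
    duration_ms: float  # phoneme length, used as the sort key
    address: int        # position of the segment in the speech segment DB

def build_duration_tables(segments):
    """Group (phoneme, duration_ms, address) records by phoneme and sort each group."""
    tables = defaultdict(list)
    for phoneme, duration_ms, address in segments:
        tables[phoneme].append(Entry(duration_ms, address))
    for entries in tables.values():
        entries.sort()  # ascending by duration; ties are ordered by address
    return dict(tables)

# Hypothetical records extracted from a speech segment DB.
segments = [("a", 82.0, 0), ("k", 40.0, 1), ("a", 61.5, 2), ("a", 97.0, 3), ("k", 35.0, 4)]
tables = build_duration_tables(segments)
```

Because each table is a plain sorted list, it can be kept in memory as an array and searched by index at synthesis time.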

Once the above processing is complete, speech synthesis by the speech synthesizer 38 becomes possible. Prior to speech synthesis, the speech synthesizer 38 (which is implemented on a computer) stores the phoneme-specific phoneme length tables 34 in memory as arrays so that preliminary selection can be performed at high speed.

At synthesis time, when the synthesizer command 36 is obtained by analyzing the text to be synthesized, it is given to the table selection unit 100 of the preliminary selection unit 60 (see FIG. 3). Based on the synthesizer command 36, the table selection unit 100 selects, from among the phoneme-specific phoneme length tables 34, the phoneme length table 110 corresponding to the phoneme specified by the synthesizer command 36.

The binary tree search unit 102 searches the phoneme length table 110 by binary tree search for a speech segment whose phoneme length matches the phoneme length given by the synthesizer command 36, and gives the address (array index) of the found speech segment in the phoneme-specific phoneme length table 34 to the segment candidate selection unit 104. The segment candidate selection unit 104 calculates the indices of the speech segments in the range specified by the allowable segment candidate number 112 (a predetermined number before and after the center), centered on the speech segment indicated by the given index, reads out all of those segment data, and stores them in the segment candidate table 62.
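The search-then-window step can be sketched in Python as below. Several points are assumptions made for this illustration: a binary search over the sorted array (via the standard `bisect` module) stands in for the binary tree search, the nearest entry is used when no segment has exactly the target length, and the window is clamped at the ends of the table. The function name `preselect` and the sample data are hypothetical.

```python
from bisect import bisect_left

def preselect(durations, target, count):
    """Return (lo, hi) such that durations[lo:hi] holds up to `count` entries
    centered on the entry whose duration is nearest to `target`.

    `durations` must be sorted in ascending order.
    """
    if not durations:
        return 0, 0
    i = bisect_left(durations, target)
    # Step back when the left neighbor is at least as close as the right one,
    # or when the target is larger than every entry.
    if i == len(durations) or (i > 0 and target - durations[i - 1] <= durations[i] - target):
        i -= 1
    half = count // 2
    lo = max(0, i - half)
    hi = min(len(durations), lo + count)
    lo = max(0, hi - count)  # shift the window left if it hit the upper end
    return lo, hi

durations = [30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0]
lo, hi = preselect(durations, target=62.0, count=3)
candidates = durations[lo:hi]  # [50.0, 60.0, 70.0]
```

The search costs O(log n) in the table size and the window is pure index arithmetic, which is what keeps the preliminary selection time nearly independent of the database size.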

Referring to FIG. 1, the segment selection unit 64 receives the synthesizer command 36, selects from the speech segments in the segment candidate table 62 the one with the lowest cost as calculated by equation (1), and gives it to the connection unit 66. The connection unit 66 reads from the speech segment DB 30 the speech waveform data corresponding to the speech segments given by the segment selection unit 64, modifies them so that the speech is smooth, connects them, and outputs the result as the synthesized speech waveform 40.
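Choosing the lowest-cost candidate can be sketched as follows. The sketch assumes that the total cost is a weighted linear sum of sub-costs, as the background section suggests, and treats the sub-cost values as already computed numbers; in a real system the connection sub-costs depend on the neighboring segments. All names and values here (`total_cost`, the candidate labels, the weights) are hypothetical.

```python
def total_cost(target_subcosts, connection_subcosts, w_target, w_connection, p_target, p_connection):
    """Weighted linear sum of sub-costs (an assumed form of the cost)."""
    target = sum(w * c for w, c in zip(w_target, target_subcosts))
    connection = sum(w * c for w, c in zip(w_connection, connection_subcosts))
    return p_target * target + p_connection * connection

# Hypothetical candidates: (target sub-costs, connection sub-costs).
candidates = {
    "u1": ([0.5, 0.25], [1.0]),
    "u2": ([0.25, 0.25], [2.0]),
    "u3": ([1.0, 0.0], [0.5]),
}
w_target, w_connection = [1.0, 2.0], [1.0]
p_target, p_connection = 0.5, 0.5

costs = {
    name: total_cost(t, c, w_target, w_connection, p_target, p_connection)
    for name, (t, c) in candidates.items()
}
best = min(costs, key=costs.get)  # "u3"
```

Since the candidate table has a bounded size, this evaluation touches at most a fixed number of segments per phoneme.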

Because the preliminary selection unit 60 selects in advance a number of segment candidates determined by the allowable segment candidate number 112 and stores them in the segment candidate table 62, and speech segments are selected from among them, the amount of cost computation when the segment selection unit 64 selects a segment is small and stays below a fixed bound. The binary tree search by the binary tree search unit 102 is known to be fast, and the address calculation for extracting segment candidates by the segment candidate selection unit 104 also requires little computation. The total amount of computation for segment selection by the speech synthesizer 38 is therefore small as well.

Moreover, because the segment candidate table 62 consists of a predetermined number of segment candidates taken from the phoneme-specific phoneme length table 34 corresponding to the phoneme specified by the synthesizer command 36 and centered on the candidate with the specified phoneme length, it contains many speech segments with a small target cost. When the lowest-cost speech segments are selected from among them and connected, the quality loss caused by modification at connection time is therefore negligibly small. As a result, the synthesized speech waveform 40 finally obtained shows only slight quality degradation due to the connection of speech segments.

In this embodiment, the preliminary selection unit 60 preliminarily selects a predetermined number of segments from the phoneme-specific phoneme length table 34 on the basis of phoneme length. However, the present invention is not limited to such an embodiment. For example, the tables may be sorted by a feature other than phoneme length, such as fundamental frequency, and used for preliminary selection.

In the system of the embodiment described above, the number of segment candidates is limited at preliminary selection by using the allowable segment candidate number. However, the present invention is not limited to such an embodiment. For example, if phoneme length is the criterion, the segment candidates lying within a predetermined phoneme length width centered on the found segment may be extracted. In addition, a second preliminary selection using only the target cost may be applied to the segment candidates extracted as described above.
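The width-based variant can be sketched with two binary searches over the sorted table. This Python illustration assumes that the window is centered on the target phoneme length and that `width` is the total width, so the bounds are target ± width/2; the function name `preselect_by_width` and the data are hypothetical.

```python
from bisect import bisect_left, bisect_right

def preselect_by_width(durations, target, width):
    """Return (lo, hi) such that durations[lo:hi] are the entries with
    target - width/2 <= duration <= target + width/2.

    `durations` must be sorted in ascending order.
    """
    lo = bisect_left(durations, target - width / 2)
    hi = bisect_right(durations, target + width / 2)
    return lo, hi

durations = [30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0]
lo, hi = preselect_by_width(durations, target=60.0, width=20.0)
candidates = durations[lo:hi]  # [50.0, 60.0, 70.0]
```

Unlike the count-based window, the number of candidates returned here varies with how densely the table is populated around the target.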

In the embodiment described above, the phoneme-specific phoneme length tables 34 are created by sorting in ascending order of phoneme length, but the same effect is clearly obtained if the sort is in descending order. Furthermore, in the system of the embodiment described above, the phoneme-specific phoneme length tables 34 store not only the phoneme length but also other features, so that the tables are also used for segment selection. Alternatively, the tables may store only the phoneme length, the phoneme label, and the address of the segment in the speech segment DB 30, and be used only for preliminary selection of segments.

Furthermore, in the system of the embodiment described above, the number of segment candidates to be preliminarily selected is calculated from the distribution of phoneme lengths in the phoneme-specific phoneme length table. However, the present invention is not limited to such an embodiment. For example, a fixed number may be determined in advance for each phoneme and used instead.

Furthermore, in the embodiment described above, the segment candidates are sorted by phoneme length and the number of candidates is limited on that basis. However, the criterion used for the limitation is not restricted to phoneme length. For example, a quantized fundamental frequency (F0) may be used. In that case, a segment candidate matching the fundamental frequency specified by the synthesizer command 36 is found by binary tree search, and either the segment candidates whose fundamental frequencies lie within an allowable range centered on the F0 of the found candidate, or a predetermined number of segment candidates, including the found candidate, are extracted.

The embodiments disclosed herein are merely illustrative, and the present invention is not limited to the embodiments described above. The scope of the present invention is indicated by each claim of the claims, in view of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a block diagram of a speech synthesis system 20 according to an embodiment of the present invention.
FIG. 2 is a block diagram of the phoneme length table creation unit 32 shown in FIG. 1.
FIG. 3 is a block diagram of the preliminary selection unit 60 shown in FIG. 1.
FIG. 4 is a diagram for explaining a method of calculating the allowable number of segment candidates.

Explanation of symbols

20 speech synthesis system, 30 speech segment DB, 32 phoneme length table creation unit, 34 phoneme-specific phoneme length table, 36 synthesizer command, 38 speech synthesizer, 60 preliminary selection unit, 62 segment candidate table, 80 phoneme length extraction unit, 82 phoneme-specific phoneme length table (unsorted), 84 sort processing unit, 86 allowable segment candidate number calculation unit, 110 phoneme length table, 112 allowable segment candidate number

Claims

A segment-connected speech synthesizer for connecting speech segments in a speech segment database by using the speech segment database and tables that hold the speech segments in the speech segment database sorted, for each phoneme, with the value of a predetermined feature as the key, the synthesizer comprising:
table selection means for receiving a synthesizer command that defines a sequence of phoneme labels to be the target of synthesized speech and target features associated with each phoneme label, and for selecting the table corresponding to the phoneme specified by a phoneme label of the synthesizer command;
preliminary selection means for preliminarily selecting, within the table selected by the table selection means, the speech segments located in a range that is determined by a predetermined criterion and that includes the speech segment whose value of the predetermined feature is the one specified by the target features of the synthesizer command;
selection means for selecting a speech segment, on the basis of a predetermined criterion, from among the speech segments preliminarily selected by the preliminary selection means; and
connection means for connecting the speech segments selected by the selection means in accordance with the synthesizer command and outputting a synthesized speech waveform.

The unit connection type speech synthesizer according to claim 1, wherein
the predetermined feature amount is a phoneme length, and
the preliminary selection means includes means for selecting, in the table selected by the table selection means, speech units located within a range that is specified by a predetermined criterion and that includes the phoneme length specified by the synthesizer command.

The unit connection type speech synthesizer according to claim 2, wherein
allowable number information specifying the allowable number of speech units to be preliminarily selected is attached to the table in advance, and
the means for selecting includes means for selecting, in the table selected by the table selection means, the number of speech units specified by the allowable number information, centered on the speech unit having the phoneme length specified by the synthesizer command.

The unit connection type speech synthesizer according to claim 1, wherein
the predetermined feature amount is a quantized fundamental frequency (F0), and
the preliminary selection means includes means for selecting, in the table selected by the table selection means, speech units located within a range that is specified by a predetermined criterion and that includes the fundamental frequency specified by the synthesizer command.

The unit connection type speech synthesizer according to claim 4, wherein
allowable number information specifying the allowable number of speech units to be preliminarily selected is attached to the table in advance, and
the means for selecting includes means for selecting, in the table selected by the table selection means, the number of speech units specified by the allowable number information, centered on the speech unit having the fundamental frequency specified by the synthesizer command.

The unit connection type speech synthesizer according to any one of claims 1 to 5, wherein the preliminary selection means includes:
search means for searching, by a predetermined search algorithm, the table selected by the table selection means for a speech unit having the value of the predetermined feature amount specified by the target feature amount of the synthesizer command; and
means for selecting, from the table selected by the table selection means, speech units located within a predetermined range centered on the speech unit found by the search means.

The unit connection type speech synthesizer according to claim 6, wherein the search means includes binary tree search means for searching, by a binary tree search algorithm, the table selected by the table selection means for a speech unit having the value of the predetermined feature amount specified by the target feature amount of the synthesizer command.
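
For illustration only, and not as part of the claims, the preliminary selection recited above (a search in a sorted per-phoneme table followed by selection of an allowable number of entries centered on the hit) might be sketched as follows. The function name `preselect`, the plain list of phoneme lengths, and the tie-breaking and clamping rules at the table edges are assumptions made for this sketch. It uses a binary search over a sorted array via Python's `bisect` module, standing in for the binary tree search named in the claims.

```python
from bisect import bisect_left


def preselect(sorted_keys, target, allowed):
    """Return (start, stop) indices of up to `allowed` entries centered on
    the entry of `sorted_keys` whose key is closest to `target`.

    `sorted_keys` is a hypothetical stand-in for one per-phoneme table:
    feature values (for example phoneme lengths) in ascending order.
    """
    n = len(sorted_keys)
    if n == 0 or allowed <= 0:
        return (0, 0)
    # Binary search for the first entry whose key is >= target.
    pos = bisect_left(sorted_keys, target)
    # Move to the left neighbour when it is at least as close to the target
    # (assumed tie-breaking rule), or when the target lies past the last entry.
    if pos == n or (pos > 0 and target - sorted_keys[pos - 1] <= sorted_keys[pos] - target):
        pos -= 1
    # Window of `allowed` entries around the hit, clamped to the table bounds.
    count = min(allowed, n)
    start = max(0, min(pos - count // 2, n - count))
    return (start, start + count)


# Example: phoneme lengths in milliseconds for one phoneme (made-up values).
table = [40, 55, 60, 62, 70, 85, 90, 120]
start, stop = preselect(table, target=63, allowed=3)
candidates = table[start:stop]  # [60, 62, 70]
```

Because the table is sorted in advance, the search costs on the order of log N comparisons and the window is taken by index arithmetic alone, so no cost needs to be computed for units outside the window.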

The unit connection type speech synthesizer according to any one of claims 1 to 7, wherein the selection means includes means for selecting, from the speech unit candidates preliminarily selected by the preliminary selection means, a speech unit for which a predetermined cost, calculated on the basis of the feature amounts of the speech unit and the target feature amounts, satisfies a predetermined condition.
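
For illustration only, and not as part of the claims, the cost-based selection among preliminarily selected candidates might be sketched as follows. The claims leave both the cost and the condition open. This sketch assumes a weighted sum of absolute feature differences as the cost and "minimum cost" as the condition. The names `target_cost` and `select_unit`, the feature keys, and the weights are all hypothetical.

```python
def target_cost(unit, target, weights):
    """Weighted sum of absolute differences between a candidate unit's
    feature values and the target feature values (assumed cost form)."""
    return sum(w * abs(unit[name] - target[name]) for name, w in weights.items())


def select_unit(candidates, target, weights):
    """Pick the candidate with the lowest cost (assumed selection condition)."""
    return min(candidates, key=lambda unit: target_cost(unit, target, weights))


# Made-up candidates, as if returned by the preliminary selection.
candidates = [
    {"id": "a", "duration_ms": 60, "f0_hz": 140},
    {"id": "b", "duration_ms": 62, "f0_hz": 100},
    {"id": "c", "duration_ms": 70, "f0_hz": 121},
]
target = {"duration_ms": 63, "f0_hz": 120}
weights = {"duration_ms": 1.0, "f0_hz": 0.5}

best = select_unit(candidates, target, weights)  # candidate "c", cost 7.5
```

Only the preliminarily selected candidates are scored, which is where the reduction in computation comes from. A full unit selection synthesizer would typically also account for how well adjacent units join, which this sketch leaves out.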