JP2007052456A

JP2007052456A - Method and system for generating dictionary for speech synthesis

Info

Publication number: JP2007052456A
Application number: JP2006293157A
Authority: JP
Inventors: Masaaki Yamada; 雅章山田
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2006-10-27
Filing date: 2006-10-27
Publication date: 2007-03-01

Abstract

PROBLEM TO BE SOLVED: To reduce the amount of calculation and the memory capacity required for processing that reduces the "blurring" of the speech spectrum caused by the window function used to obtain fine segments, and to realize high-quality speech synthesis with fewer computer resources.

SOLUTION: In a method for generating a dictionary used for speech synthesis processing, a substitute filter is first generated by approximating a correction filter for spectrum correction, obtained from speech waveform data, so that it can be realized with a smaller amount of data or calculation than the correction filter. Modified waveform data is then generated by applying, to the speech waveform data, a filter that compensates for the difference between the correction filter and the substitute filter, and the substitute filter and the modified waveform data are stored as part of the dictionary.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a method and an apparatus for generating a speech synthesis dictionary for a speech synthesizer.

Conventionally, one speech synthesis method for obtaining desired synthesized speech divides prerecorded and stored speech segments into a plurality of fine segments and rearranges the resulting fine segments. In this rearrangement, processing such as interval change, repetition, and thinning is applied to the fine segments, yielding synthesized speech with the desired duration and fundamental frequency.

FIG. 8 schematically shows a method of dividing a speech waveform into fine segments. The speech waveform shown in FIG. 8 is divided into fine segments by a cutout window function (hereinafter, window function). In the voiced part (the latter half of the speech waveform), a window function synchronized with the pitch interval of the original speech is used. In the unvoiced part, window functions placed at suitable intervals are used.

As shown in FIG. 8, the duration of the speech can be shortened by thinning out these fine segments, and it can be extended by using the fine segments repeatedly. Furthermore, as shown in FIG. 8, in the voiced part the fundamental frequency of the synthesized speech can be raised by narrowing the interval between fine segments and lowered by widening it.

The desired synthesized speech is obtained by superimposing the fine segments again after they have been rearranged by the repetition, thinning, and interval change described above. Units such as phonemes, CV/VC, or VCV are used for recording and storing speech segments. CV and VC are units whose segment boundaries are placed within phonemes, and VCV is a unit whose segment boundaries are placed within vowels.
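The rearrangement operations described above (thinning, interval change, and superposition) can be sketched in a few lines. This is a minimal illustration only, not the procedure of any particular implementation: the function names `hann`, `cut_segments`, and `overlap_add`, the constant pitch period, and the test signal are assumptions introduced here.

```python
import math

def hann(length):
    """Periodic Hann window; copies shifted by length/2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / length) for n in range(length)]

def cut_segments(wave, marks, period):
    """Cut one windowed fine segment of length 2*period centred on each pitch mark."""
    win = hann(2 * period)
    segments = []
    for m in marks:
        seg = []
        for n in range(2 * period):
            i = m - period + n
            seg.append(win[n] * wave[i] if 0 <= i < len(wave) else 0.0)
        segments.append(seg)
    return segments

def overlap_add(segments, centres, period, out_len):
    """Superimpose segments so that each one is centred on the given position."""
    out = [0.0] * out_len
    for seg, c in zip(segments, centres):
        for n, v in enumerate(seg):
            i = c - period + n
            if 0 <= i < out_len:
                out[i] += v
    return out

# Hypothetical voiced signal with a constant pitch period of 10 samples.
period = 10
wave = [math.sin(2 * math.pi * n / period) for n in range(80)]
marks = list(range(10, 80, period))
segments = cut_segments(wave, marks, period)

# Original spacing: the interior of the waveform is reproduced.
same = overlap_add(segments, marks, period, len(wave))

# Thinning: keep every other segment, so the duration is roughly halved.
kept = segments[::2]
shorter = overlap_add(kept, [period * (k + 1) for k in range(len(kept))],
                      period, period * (len(kept) + 1))

# Narrower spacing (8 instead of 10 samples): higher fundamental frequency.
higher = overlap_add(segments, [period + 8 * k for k in range(len(segments))],
                     period, 80)
```

Repetition would be handled the same way, by passing a segment more than once to `overlap_add` with successive centre positions.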

However, in the conventional method described above, applying a window function to obtain fine segments from the speech waveform causes so-called "blurring" of the speech spectrum. That is, the formants of the speech broaden and the peaks and valleys of the spectral envelope become indistinct, and the quality of the synthesized speech deteriorates.

The present invention has been made in view of the above problem, and its object is to make it possible to provide a speech synthesis dictionary for realizing high-quality speech synthesis by reducing the "blurring" of the speech spectrum caused by the window function applied to obtain fine segments.

A further object of the present invention is to make it possible to provide a speech synthesis dictionary that reduces the "blurring" of the speech spectrum so that high-quality speech synthesis can be realized with few hardware resources.

To achieve the above object, a speech synthesis dictionary generation method according to the present invention is a method for generating a dictionary used for speech synthesis processing, comprising:
  a first generation step of generating a substitute filter by approximating a correction filter for spectrum correction, obtained from speech waveform data, so that it is realized with a smaller amount of data or calculation than the correction filter;
  a second generation step of generating modified waveform data by applying, to the speech waveform data, a filter that compensates for the difference between the correction filter and the substitute filter; and
  a storage step of storing the substitute filter generated in the first generation step and the modified waveform data generated in the second generation step as part of the dictionary.

Also to achieve the above object, a speech synthesis dictionary generation apparatus according to the present invention is an apparatus for generating a dictionary used for speech synthesis processing, comprising:
  first generation means for generating a substitute filter by approximating a correction filter for spectrum correction, obtained from speech waveform data, so that it is realized with a smaller amount of data or calculation than the correction filter;
  second generation means for generating modified waveform data by applying, to the speech waveform data, a filter that compensates for the difference between the correction filter and the substitute filter; and
  storage means for storing the substitute filter generated by the first generation means and the modified waveform data generated by the second generation means as part of the dictionary.

The present invention also provides a control program for causing a computer to execute the above speech synthesis dictionary generation method.

According to the present invention described above, the amount of calculation and the storage capacity required for processing that reduces the "blurring" of the speech spectrum caused by the window function applied to obtain fine segments can be reduced, and a speech synthesis dictionary for realizing high-quality speech synthesis with few computer resources can be created.

Several preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.

<First Embodiment>
In Japanese Patent Application No. 2002-164624, the present applicant proposed a speech synthesis apparatus and method that mitigate the above-described "blurring" of the speech spectrum by applying a spectrum correction filter to the fine segments shown in FIG. 8 and thereby correcting their spectra. This alleviates phenomena caused by applying the window function to obtain fine segments from the speech waveform, such as the broadening of speech formants and the blurring of the peaks and valleys of the spectral envelope, and so prevents degradation of the synthesized speech quality.

FIG. 9 schematically shows a method of applying the spectrum correction filter. A corresponding spectrum correction filter 907 is applied to each of the fine segments 903 cut out of the speech waveform 901 by the window function 902, which yields spectrum-corrected fine segments 904 (for example, fine segments whose formants have been corrected). Synthesized speech 906 is then generated from the spectrum-corrected fine segments 904.

The spectrum correction filter is obtained by acoustic analysis. The following three filters are specific examples of the spectrum correction filter 907 that can be applied to the above processing.

(1) When p-th order linear prediction analysis is used for the acoustic analysis, a filter having the characteristic expressed by the following [Equation 1] can be used as the spectrum correction filter 907.

(2) When p-th order cepstrum analysis is used for the acoustic analysis, a filter having the characteristic expressed by the following [Equation 2] can be used as the spectrum correction filter.

(3) Alternatively, an FIR filter expressed by the following [Equation 3], constructed by truncating the impulse response of either of the above filters at a suitable order, can be used.

In the above equations, p is the analysis order, μ and γ are suitable coefficients, α is a linear prediction coefficient, and c is a cepstrum coefficient. β is an FIR filter coefficient obtained from the impulse response of the filter expressed by [Equation 1] or [Equation 2].

Computing the above spectrum correction filter requires at least ten to several tens of product-sum operations per waveform sample. This is very large compared with the amount of calculation for the basic speech synthesis processing (the processing shown in FIG. 8). In addition, because the coefficients of the correction filter are normally computed when the speech synthesis dictionary is created, a storage area for holding them is also required. In other words, the speech synthesis dictionary grows in size.

Of course, the amount of calculation and the storage capacity can be reduced by lowering the filter order p or the FIR filter order p'. Alternatively, clustering the spectrum correction filter coefficients reduces the storage capacity needed to hold them. In either case, however, the effect of the spectrum correction is weakened and the sound quality deteriorates. The embodiments described below therefore reduce the amount of calculation and the storage capacity required for spectrum correction filtering, so that the "blurring" of the speech spectrum is reduced and high-quality speech synthesis is realized while the increase in calculation and storage is kept small.

In the first embodiment, an approximate filter with a lower filter order is used to reduce the amount of calculation and the storage capacity, and the waveform data in the speech synthesis dictionary is modified in advance to suit that approximate filter, so that the quality of the synthesized speech is maintained.

FIG. 1 is a block diagram showing the hardware configuration of the first embodiment. In FIG. 1, reference numeral 11 denotes a central processing unit, which performs processing such as numerical calculation and control. In particular, the central processing unit 11 executes speech synthesis processing according to the procedure described below. Reference numeral 12 denotes an output device, which presents various information to the user under the control of the central processing unit 11. Reference numeral 13 denotes an input device including a touch panel, a keyboard, or the like, which the user uses to give operating instructions to the apparatus and to input various information. Reference numeral 14 denotes a speech output device, which outputs the synthesized speech.

Reference numeral 15 denotes a storage device such as a disk device or a nonvolatile memory, which holds the speech synthesis dictionary 501 and other data. The speech synthesis dictionary 501 stores modified waveform data, obtained by modifying the speech waveform by a method described later, and a spectrum correction filter approximated by a method described later. Reference numeral 16 denotes a read-only storage device (ROM), which stores the speech synthesis processing procedure of this embodiment and the necessary fixed data. Reference numeral 17 denotes a storage device that holds temporary information, such as a RAM, in which temporary data, various flags, and the like are held. The components 11 to 17 are connected by a bus 18. In this embodiment the control program for speech synthesis processing is stored in the ROM 16 and executed by the central processing unit 11, but the control program may instead be stored in the external storage device 15 and loaded into the RAM 17 for execution.

The operation of the speech output apparatus of this embodiment, configured as described above, is explained below with reference to FIGS. 2, 3, and 4. FIGS. 2 and 3 are flowcharts explaining the speech output processing according to the first embodiment. FIG. 4 shows how the speech synthesis processing of the first embodiment proceeds.

In this embodiment, the spectrum correction filter is configured before speech synthesis, and the configuration information for the filter (the filter coefficients) is held in a predetermined storage area (the speech synthesis dictionary). The processing therefore consists of two processes: data creation processing for creating the speech synthesis dictionary (FIG. 2) and speech synthesis processing (FIG. 3). The data creation processing approximates the spectrum correction filter to reduce the amount of configuration information, and it also modifies the speech waveforms in the speech synthesis dictionary so that the approximation does not degrade the synthesized speech.

First, in step S1, the waveform data from which synthesized speech is to be generated (speech waveform 301 in FIG. 4) is acquired. In step S2, acoustic analysis such as linear prediction (LPC) analysis, cepstrum analysis, or generalized cepstrum analysis is performed on the waveform data acquired in step S1, and the parameters needed to configure the spectrum correction filter 310 are calculated. The waveform data may be analyzed at fixed time intervals, or pitch-synchronous analysis may be performed.

Next, in the spectrum correction filter configuration step S3, the spectrum correction filter 310 is configured using the parameters calculated in step S2. For example, when p-th order linear prediction analysis is used for the acoustic analysis, a filter having the characteristic expressed by [Equation 1] is used as the spectrum correction filter 310. When p-th order cepstrum analysis is used, a filter having the characteristic expressed by [Equation 2] is used as the spectrum correction filter 310. Alternatively, the FIR filter expressed by [Equation 3], constructed by truncating the impulse response of either filter at a suitable order, may be used as the spectrum correction filter 310. In practice, the system gain must also be taken into account in each of these equations.

Next, in step S4, the spectrum correction filter 310 configured in step S3 is simplified by approximation to configure an approximate spectrum correction filter 306 that can be realized with less calculation and storage. A simple example of the approximate spectrum correction filter 306 is the FIR filter expressed by [Equation 3] with its truncation order restricted to a low order. Alternatively, the difference in frequency characteristics from the spectrum correction filter can be defined as a distance in the spectral domain, and the approximate spectrum correction filter can be configured by finding the filter coefficients that minimize this distance using Newton's method or a similar technique.
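As a rough illustration of the truncation approach in step S4, the sketch below computes the impulse response of a rational filter, truncates it to a low order, and evaluates one possible spectral distance. The filter form B(z)/A(z), its coefficients, the function names, and the squared-difference distance are assumptions made for this example; they are not the filters of [Equation 1] to [Equation 3], and the sketch only evaluates the distance rather than minimizing it.

```python
import cmath
import math

def impulse_response(b, a, n):
    """First n samples of the impulse response of B(z)/A(z) in direct form."""
    h = []
    for i in range(n):
        acc = b[i] if i < len(b) else 0.0
        for k in range(1, len(a)):
            if i - k >= 0:
                acc -= a[k] * h[i - k]
        h.append(acc / a[0])
    return h

def truncated_fir(b, a, order):
    """FIR coefficients obtained by cutting the impulse response at the given order."""
    return impulse_response(b, a, order + 1)

def freq_response(b, a, w):
    """Frequency response of B(z)/A(z) at angular frequency w."""
    z = cmath.exp(-1j * w)
    num = sum(bk * z ** k for k, bk in enumerate(b))
    den = sum(ak * z ** k for k, ak in enumerate(a))
    return num / den

def spectral_distance(b, a, beta, bins=64):
    """Mean squared difference between the full filter and the FIR approximation."""
    total = 0.0
    for k in range(bins):
        w = 2 * math.pi * k / bins
        total += abs(freq_response(b, a, w) - freq_response(beta, [1.0], w)) ** 2
    return total / bins

# Hypothetical full correction filter (not taken from the source).
b, a = [1.0, -0.5], [1.0, -0.8]
beta_low = truncated_fir(b, a, 2)    # low-order approximation
beta_high = truncated_fir(b, a, 8)   # higher-order approximation
d_low = spectral_distance(b, a, beta_low)
d_high = spectral_distance(b, a, beta_high)
```

In this example the lower order gives the larger distance, which is the loss that the waveform modification in steps S6 to S8 is meant to offset.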

Next, in step S5, the approximate spectrum correction filter 306 configured in step S4 is recorded in the speech synthesis dictionary 501 (in practice, the coefficients of the approximate spectrum correction filter are stored).

In the following steps S6 to S8, the speech waveform data is modified and registered in the speech synthesis dictionary 501 so as to reduce the loss of sound quality that occurs when the approximate spectrum correction filter, configured in step S4 and recorded in the speech synthesis dictionary 501 in step S5, is applied to the speech waveform.

First, in step S6, the spectrum correction filter 310 is combined with the inverse filter of the approximate spectrum correction filter 306 to configure an approximate correction filter 302, that is, a filter that compensates for the approximation. For example, when the filter expressed by [Equation 1] is used as the spectrum correction filter and the low-order FIR filter expressed by [Equation 3] is used as the approximate spectrum correction filter, the approximate correction filter is given by the following [Equation 4].
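The composition described in step S6 can be written compactly. The notation below is introduced here for illustration and is not the source's [Equation 4]: F(z) stands for the spectrum correction filter 310, \tilde{F}(z) for the approximate spectrum correction filter 306, and H(z) for the approximate correction filter 302. The second line assumes, as one possible indexing, a low-order FIR form with coefficients β_j and order p'.

```latex
\begin{align}
H(z) &= \frac{F(z)}{\tilde{F}(z)},
\qquad\text{so that}\qquad \tilde{F}(z)\,H(z) = F(z), \\
H(z) &= \frac{F(z)}{\sum_{j=0}^{p'} \beta_j z^{-j}}
\qquad\text{when}\qquad \tilde{F}(z) = \sum_{j=0}^{p'} \beta_j z^{-j}.
\end{align}
```

Applying H(z) once at dictionary creation time and the cheaper \tilde{F}(z) at synthesis time therefore reproduces the effect of F(z), to the extent that the windowing performed between the two filters can be neglected.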

Next, in step S7, the approximate correction filter 302 is applied to the speech waveform data obtained in step S1 to create a modified speech waveform 303. In step S8, the modified speech waveform obtained in step S7 is recorded in the speech synthesis dictionary 501.
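A minimal numerical sketch of steps S6 and S7 follows. It assumes a hypothetical full filter B(z)/A(z) and a hypothetical low-order FIR approximation `beta`; the helper names `iir_filter` and `make_modified_waveform` are likewise introduced here. Windowing is deliberately left out, so the exact agreement checked at the end holds only for this simplified case.

```python
def iir_filter(b, a, x):
    """Direct-form filter: y[n] = (sum b[k] x[n-k] - sum_{k>=1} a[k] y[n-k]) / a[0]."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y

def make_modified_waveform(wave, b, a, beta):
    """Apply the full filter, then the inverse of the FIR approximation (1 / beta)."""
    full = iir_filter(b, a, wave)
    return iir_filter([1.0], beta, full)

# Hypothetical values, not taken from the source.
b, a = [1.0, -0.5], [1.0, -0.8]   # full spectrum correction filter
beta = [1.0, 0.3, 0.24]           # low-order FIR approximation
wave = [0.0, 1.0, 0.5, -0.25, -1.0, 0.0, 0.75, 0.2]

# Dictionary creation: store `modified` and `beta` instead of `wave` and (b, a).
modified = make_modified_waveform(wave, b, a, beta)

# Synthesis: only the cheap FIR filter is applied to the stored waveform.
resynth = iir_filter(beta, [1.0], modified)

# Reference: the full filter applied directly to the original waveform.
reference = iir_filter(b, a, wave)
```

The design point is that the expensive part of the correction is paid once, offline, while the synthesizer needs only the few FIR taps.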

This completes the data creation processing. Next, the speech synthesis processing is described with reference to the flowchart of FIG. 3. The speech synthesis processing uses the approximate spectrum correction filter 306 and the modified speech waveform 303 registered in the speech synthesis dictionary 501 by the data creation processing.

First, in the prosodic target value acquisition step S9, the target prosodic values of the synthesized speech are acquired. The target prosodic values may be given directly by a higher-level module, as in singing voice synthesis, or they may be estimated by some means. In text-to-speech synthesis, for example, they are estimated from the result of linguistic analysis of the text.

Next, in step S10, the modified speech waveform recorded in the speech synthesis dictionary 501 is acquired on the basis of the target prosodic values acquired in step S9. In step S11, the approximate spectrum correction filter recorded in the speech synthesis dictionary 501 in step S5 is read. The filter that is read is the approximate spectrum correction filter corresponding to the modified speech waveform acquired in step S10.

Next, in step S12, the window function 304 is applied to the modified speech waveform acquired in step S10 to cut out fine segments 305. A Hanning window or the like is used as the window function. In step S13, the approximate spectrum correction filter 306 read in step S11 is applied to each of the fine segments 305 cut out in step S12 to correct their spectra. Spectrum-corrected fine segments 307 are thus obtained.
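Steps S12 and S13 might look roughly like the following sketch, which cuts segments with a Hanning window and passes each one through a short FIR filter. The function names, the pitch marks, the flat test waveform, and the two-tap filter are hypothetical values chosen for illustration.

```python
import math

def hann(length):
    """Periodic Hanning window of the given length."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / length) for n in range(length)]

def cut_fine_segments(wave, marks, half):
    """Step S12: window a span of 2*half samples centred on each mark.
    Marks are assumed to lie at least `half` samples from both ends."""
    win = hann(2 * half)
    return [[win[n] * wave[m - half + n] for n in range(2 * half)] for m in marks]

def fir(beta, segment):
    """Full convolution, so the filter tail of each segment is kept."""
    out = [0.0] * (len(segment) + len(beta) - 1)
    for n, v in enumerate(segment):
        for k, bk in enumerate(beta):
            out[n + k] += bk * v
    return out

def corrected_segments(wave, marks, half, beta):
    """Step S13: apply the approximate correction filter to every fine segment."""
    return [fir(beta, seg) for seg in cut_fine_segments(wave, marks, half)]

# Hypothetical modified waveform, pitch marks, and filter coefficients.
wave = [1.0] * 40
marks = [10, 20, 30]
beta = [1.0, 0.5]
segments = corrected_segments(wave, marks, 10, beta)
```

Because the filter here has only `len(beta)` taps, the cost per output sample is a handful of multiply-adds, which is the saving the approximation is intended to deliver.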

Next, in step S14, the prosody is changed by rearranging (308) the fine segments 307 whose spectra were corrected in step S13, using thinning, repetition, and interval change so that they match the prosodic target values acquired in step S9. In step S15, the fine segments rearranged in step S14 are superimposed to obtain synthesized speech 309 (a speech segment). Then, in step S16, the synthesized speech 309 (speech segments) obtained in step S15 is concatenated to obtain the final synthesized speech, which is output as sound.

In the rearrangement of the fine segments, the "thinning" may be performed before the approximate spectrum correction filter 306 is applied, as shown in FIG. 4. This avoids the wasted work of filtering fine segments that will not be used.

<Second Embodiment>
The first embodiment described an example in which the order of the filter coefficients is reduced by approximation to lower the amount of calculation and the storage capacity. The second embodiment describes a case in which the storage capacity is reduced by clustering the spectrum correction filters. The second embodiment consists of three processes: clustering processing (FIG. 5), data creation processing (FIG. 6), and speech synthesis processing (FIG. 7). The apparatus configuration for this processing is the same as in the first embodiment (FIG. 1).

In the flowchart of FIG. 5, steps S1, S2, and S3 configure the spectrum correction filter and are the same as in the first embodiment (FIG. 2). These steps are performed for all waveform data included in the speech synthesis dictionary 501 (step S100).

When spectrum correction filters have been configured for all waveform data, the process proceeds to step S101, in which the spectrum correction filters obtained in step S3 are clustered. A technique such as the LBG algorithm can be used for the clustering. In step S102, the result of the clustering in step S101 (clustering information) is recorded in the external storage device 15. Specifically, a table associating the representative vector (filter coefficients) of each cluster with its cluster number is created and recorded. The representative vector defines the spectrum correction filter (representative filter) of that cluster. In this embodiment, a spectrum correction filter is configured in step S3 for each item of waveform data registered in the speech synthesis dictionary 501, and the coefficients of the spectrum correction filter for each item of waveform data are held in the speech synthesis dictionary 501 in the form of the cluster number. That is, as described later with reference to FIG. 6, the speech synthesis dictionary 501 of the second embodiment holds the waveform data of each speech waveform (more precisely, the modified speech waveform data described later with reference to FIG. 6), the cluster number of its spectrum correction filter, and the cluster numbers together with their representative vectors (the representative value of each coefficient).
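The clustering in step S101 could be sketched as below with a simplified LBG-style procedure: the codebook is grown by splitting each representative vector and refined with nearest-neighbour reassignment. The additive perturbation `eps`, the fixed iteration count, the restriction to power-of-two codebook sizes, and the two-dimensional sample vectors are simplifying assumptions for this example, not details from the source.

```python
def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def nearest(v, codebook):
    """Index of the representative vector closest to v (squared Euclidean distance)."""
    dists = [sum((x - c) ** 2 for x, c in zip(v, cv)) for cv in codebook]
    return dists.index(min(dists))

def lbg(vectors, size, eps=0.01, iters=20):
    """Grow a codebook by repeated splitting; `size` should be a power of two."""
    codebook = [mean(vectors)]
    while len(codebook) < size:
        # Split every representative vector into a perturbed pair.
        codebook = [[c + s * eps for c in cv] for cv in codebook for s in (1, -1)]
        for _ in range(iters):
            clusters = [[] for _ in codebook]
            for v in vectors:
                clusters[nearest(v, codebook)].append(v)
            # Keep the old representative if a cluster happens to be empty.
            codebook = [mean(cl) if cl else cv for cl, cv in zip(clusters, codebook)]
    return codebook

# Hypothetical filter-coefficient vectors forming two obvious groups.
filters = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
           [10.0, 10.0], [10.2, 10.0], [10.0, 10.2]]
codebook = lbg(filters, 2)

# Step S102: table associating cluster numbers with representative vectors.
cluster_table = dict(enumerate(codebook))
```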

Next, the dictionary creation processing (FIG. 6) is described. In the dictionary creation processing, the configuration of the spectrum correction filter in steps S1 to S3 is the same as in the first embodiment. The difference from the first embodiment is that, instead of configuring an approximate spectrum correction filter, the filter coefficients of the spectrum correction filter are vector-quantized and registered as a cluster number. That is, in step S103, the vector closest to the spectrum correction filter obtained in step S3 is selected from among the representative vectors in the clustering information recorded in step S102. Next, in step S104, the number (cluster number) corresponding to the representative vector selected in step S103 is recorded in the speech synthesis dictionary 501.
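Steps S103 and S104 amount to a nearest-neighbour search over the representative vectors. The sketch below assumes a squared Euclidean distance on the coefficient vectors and a small hypothetical table of representative vectors; the source does not specify the distance measure, and the function name `quantize` is introduced here.

```python
def quantize(coeffs, cluster_table):
    """Return the cluster number of the nearest representative vector
    and the squared quantization error."""
    best_number, best_dist = None, float("inf")
    for number, rep in cluster_table.items():
        dist = sum((x - r) ** 2 for x, r in zip(coeffs, rep))
        if dist < best_dist:
            best_number, best_dist = number, dist
    return best_number, best_dist

# Hypothetical clustering information (cluster number -> representative vector).
cluster_table = {0: [1.0, 0.3, 0.24], 1: [1.0, -0.2, 0.05]}

# Hypothetical coefficients of the filter obtained in step S3 for one waveform.
number, error = quantize([1.0, 0.28, 0.2], cluster_table)
```

Only `number` needs to be stored per waveform; the nonzero `error` is the quantization error that the modified speech waveform is intended to absorb.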

Furthermore, to reduce the degradation of the synthesized speech caused by quantizing the filter coefficients of the spectrum correction filter, a modified speech waveform is generated and registered in the speech synthesis dictionary. That is, in step S105, a quantization error correction filter for correcting the quantization error is configured. The quantization error correction filter is configured by combining the inverse filter of the filter built from the representative vector with the spectrum correction filter of the speech waveform in question. For example, when the filter expressed by [Equation 1] is used as the spectrum correction filter, the quantization error correction filter is given by [Equation 5].

In [Equation 5], α' is the vector-quantized linear prediction coefficient. A quantization error correction filter can be configured in the same way when other types of filter are used. The waveform data is modified using the quantization error correction filter configured in this way to create a modified speech waveform (step S7), and the resulting modified speech waveform is registered in the speech synthesis dictionary 501 (step S8). Because the spectrum correction filter is registered as a cluster number plus a correspondence table (cluster information), the storage capacity required for the speech synthesis dictionary can be reduced.

At the time of speech synthesis, as shown in the flowchart of FIG. 7, step S11 of the first embodiment (reading the approximate spectrum correction filter) is no longer needed. Instead, step S106 (reading the spectrum correction filter number, i.e., the cluster number) and step S107 (obtaining the spectrum correction filter from the cluster number that was read) are added.

As in the first embodiment, the prosodic target value is acquired (step S9), and the modified speech waveform data registered in step S8 of FIG. 6 is acquired (step S10). In step S106, the spectrum correction filter number recorded in step S104 is read. Next, in step S107, the spectrum correction filter corresponding to that number is obtained from the correspondence table recorded in step S102. Thereafter, synthesized speech is output through steps S12 to S16 as in the first embodiment. That is, a window function is applied to the modified speech waveform to cut out fine segments (step S12); the spectrum correction filter obtained in step S107 is applied to the cut-out fine segments to obtain spectrum-corrected fine segments (step S13); the spectrum-corrected fine segments are rearranged according to the prosodic target value (step S14); and the rearranged fine segments are superimposed to obtain synthesized speech 309 (a speech segment) (step S15).
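The flow of steps S106, S107, and S12 to S15 can be sketched as follows. This is an illustrative toy only: the Hann window, the fixed segment spacing, the FIR form of the filter, and all names (`filter_table`, `entry`, `cut_segments`, and so on) are assumptions introduced for the example, and prosody control is reduced to choosing new segment positions by hand.

```python
import math
from typing import Dict, List, Sequence


def hann(width: int) -> List[float]:
    """Periodic Hann window; copies shifted by width/2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / width) for i in range(width)]


def cut_segments(wave: Sequence[float], starts: Sequence[int],
                 width: int) -> List[List[float]]:
    """Step S12: cut windowed fine segments out of the waveform."""
    window = hann(width)
    return [[wave[s + i] * window[i] for i in range(width)] for s in starts]


def apply_fir(segment: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Step S13: apply the filter to one segment (FIR form is an assumption)."""
    out = [0.0] * (len(segment) + len(taps) - 1)
    for i, v in enumerate(segment):
        for j, t in enumerate(taps):
            out[i + j] += v * t
    return out


def overlap_add(segments: Sequence[Sequence[float]], positions: Sequence[int],
                length: int) -> List[float]:
    """Steps S14 and S15: place segments at new positions and superimpose."""
    out = [0.0] * length
    for seg, pos in zip(segments, positions):
        for i, v in enumerate(seg):
            if 0 <= pos + i < length:
                out[pos + i] += v
    return out


# Hypothetical correspondence table (cluster number -> filter) and dictionary entry.
filter_table: Dict[int, List[float]] = {0: [1.0], 1: [1.0, -0.3]}
entry = {"cluster": 0, "wave": [math.sin(0.2 * n) for n in range(64)]}

taps = filter_table[entry["cluster"]]        # steps S106 and S107
starts = list(range(0, 49, 8))               # segment start positions
segments = [apply_fir(s, taps) for s in cut_segments(entry["wave"], starts, 16)]

# Unchanged positions: the interior of the waveform is reproduced.
same = overlap_add(segments, starts, 64)

# Narrower spacing shortens the output, as a stand-in for prosody control.
shorter_positions = [6 * k for k in range(len(starts))]
shorter = overlap_add(segments, shorter_positions, 6 * (len(starts) - 1) + 16)
```

With the identity filter selected here and the positions left unchanged, the half-overlapping Hann windows sum to one, so the interior samples of `same` equal the input; this serves only as a sanity check on the windowing and superposition steps.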

As described above, even when the spectrum correction filter is quantized by clustering, the quantization error can be compensated by using a modified speech waveform that has been modified by a filter such as that shown in [Equation 5], so the storage capacity can be reduced without impairing sound quality.

<Other embodiments>
In each of the above embodiments, when the sampling frequency of the waveform is high, the waveform may be divided into bands by a band division filter, and spectrum correction filtering may be performed on each band-limited waveform. In this case, a filter is held for each band, the target speech waveform itself is also divided into bands, and each resulting waveform is processed separately. Band division keeps the order of the spectrum correction filter low, which reduces the amount of calculation. A similar effect can be obtained by warping the frequency axis, as with the mel cepstrum.
An embodiment combining the first and second embodiments is also possible. In this case, the spectrum correction filter before approximation may be vector-quantized and the filter given by the representative vector may then be approximated, or the coefficients of the approximate spectrum correction filter may be vector-quantized.
In the second embodiment, the result of the acoustic analysis may first be converted, and the converted vector may then be vector-quantized. For example, when linear prediction coefficients are used for the acoustic analysis, they are not vector-quantized directly but are converted into LSP coefficients, and the LSP coefficients are quantized. When the spectrum correction filter is constructed, the quantized LSP coefficients can be converted back into linear prediction coefficients. In general, LSP coefficients have better quantization characteristics than linear prediction coefficients, so more appropriate vector quantization becomes possible.

Needless to say, the object of the present invention is also achieved by supplying a storage medium storing the program code of software that implements the functions of the above-described embodiments to a system or apparatus, and having the computer (or CPU or MPU) of that system or apparatus read out and execute the program code stored in the storage medium.

In this case, the program code itself read from the storage medium implements the functions of the above-described embodiments, and the storage medium storing the program code constitutes the present invention.

As a storage medium for supplying the program code, for example, a flexible disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a CD-R, a magnetic tape, a nonvolatile memory card, a ROM, or the like can be used.

Needless to say, the invention covers not only the case where the functions of the above-described embodiments are implemented by the computer executing the program code it has read, but also the case where an OS (operating system) or the like running on the computer performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments are implemented by that processing.

Needless to say, the invention further covers the case where the program code read from the storage medium is written into a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided on the function expansion board or in the function expansion unit performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments are implemented by that processing.

A block diagram showing the hardware configuration in the first embodiment.
A flowchart explaining the approximate spectrum correction filter registration process in the speech output processing according to the first embodiment.
A flowchart explaining the speech synthesis process in the speech output processing according to the first embodiment.
A diagram showing the speech synthesis process of the first embodiment.
A flowchart explaining the clustering process in the speech output processing according to the second embodiment.
A flowchart explaining the spectrum correction filter registration process in the speech output processing according to the second embodiment.
A flowchart explaining the speech synthesis process in the speech output processing according to the second embodiment.
A diagram schematically showing a speech synthesis method based on dividing a speech waveform into fine segments, rearranging them, and combining them.
A diagram schematically showing a method that uses spectrum correction in the speech synthesis method based on dividing a speech waveform into fine segments, rearranging them, and combining them.

Claims

A method for generating a dictionary used for speech synthesis processing, comprising:
a first generation step of generating an alternative filter by approximating a correction filter for spectrum correction, obtained based on speech waveform data, so that it can be realized with a smaller data amount or calculation amount than the correction filter;
a second generation step of generating corrected waveform data by applying, to the speech waveform data, a filter that compensates for the difference between the correction filter and the alternative filter; and
a storage step of storing the alternative filter generated in the first generation step and the corrected waveform data generated in the second generation step as a part of the dictionary.

The method for generating a dictionary for speech synthesis according to claim 1, wherein the alternative filter is a filter obtained by generating an FIR filter from an impulse response of the correction filter and truncating the FIR filter at a low order.

The method for generating a dictionary for speech synthesis according to claim 1, wherein the alternative filter has a lower order than the correction filter.

The speech synthesis dictionary generation method according to claim 1, wherein the alternative filter is a filter obtained by vector quantization of filter coefficients of the correction filter.

An apparatus for generating a dictionary used for speech synthesis processing, comprising:
first generation means for generating an alternative filter by approximating a correction filter for spectrum correction, obtained based on speech waveform data, so that it can be realized with a smaller data amount or calculation amount than the correction filter;
second generation means for generating corrected waveform data by applying, to the speech waveform data, a filter that compensates for the difference between the correction filter and the alternative filter; and
storage means for storing the alternative filter generated by the first generation means and the corrected waveform data generated by the second generation means as a part of the dictionary.

A control program for causing a computer to execute the speech synthesis dictionary generation method according to claim 1.

A storage medium storing the control program according to claim 6.