JP5252452B2 - SPECTRUM ANALYZER AND SPECTRUM OPERATION DEVICE

Info

Publication number: JP5252452B2
Application number: JP2009146502A
Authority: JP
Inventors: 芳則志賀
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2009-06-19
Filing date: 2009-06-19
Publication date: 2013-07-31
Anticipated expiration: 2029-06-19
Also published as: JP2011002703A

Description

The present invention relates to speech technology, and more particularly to a technique for improving parameterization when speech is processed statistically.

The cepstrum is widely used as a representation (parameter set) of the speech spectrum. For example, the acoustic models used for speech recognition are often hidden Markov models (HMMs), and cepstra are commonly used as the acoustic parameters for training them. Cepstrum-based speech parameterization has been studied extensively, and the software needed for it is readily available. In the cepstrum analysis used for speech recognition and similar tasks, frequency is usually converted to an auditory frequency scale (the mel frequency representation), and the cepstrum coefficients obtained from that analysis are called the mel cepstrum.

HMMs are used not only for speech recognition but also for speech synthesis. FIG. 1 shows a schematic configuration of a conventional speech synthesis system using HMMs. Referring to FIG. 1, a conventional HMM-based speech synthesis system 50 includes a storage device that stores context-dependent phoneme HMMs 62, a learning unit 60 for training the phoneme HMMs 62, and a synthesis unit 64 that performs speech synthesis from input text 66 using the trained phoneme HMMs 62.

The learning unit 60 includes: a speech corpus 70 storing a large number of utterances; a fundamental frequency extraction unit 72 that performs fundamental frequency extraction on the speech waveform of each phoneme in the speech corpus 70 and outputs a fundamental frequency parameter F0; and a spectrum analysis unit 74 that performs spectrum analysis on the speech waveform of each phoneme in the speech corpus 70 and outputs spectral parameters (cepstrum coefficients) representing the envelope of the log power spectrum of the speech. The learning unit 60 further includes: a learning data storage unit 76 that stores learning data comprising the F0 parameter from the fundamental frequency extraction unit 72, the spectral parameters from the spectrum analysis unit 74, and a phoneme label that depends on the context of each phoneme in the speech corpus 70 (hereinafter a "context-dependent label"); and an HMM model learning unit 78 that performs statistical processing on the learning data stored in the learning data storage unit 76 to compute parameters, such as the probability density function, of each context-dependent phoneme model of the phoneme HMMs 62. The context includes the accent type of the phrase containing the phoneme, the part of speech of the word containing the phoneme, the length of the sentence, the position of the phoneme within the sentence, and the like.

The synthesis unit 64 includes: a text analysis unit 90 that analyzes the input text 66 and outputs a phoneme label sequence representing the phoneme sequence of the text 66, each label reflecting the context in which the phoneme occurs in the text 66 (a "context-dependent label sequence"); an acoustic parameter generation unit 92 that concatenates phoneme HMMs from the phoneme HMMs 62 according to the context-dependent label sequence from the text analysis unit 90 and estimates, from the resulting HMM sequence, the acoustic parameter sequence (F0 and spectral parameters) with the highest likelihood for the given context-dependent label sequence; a sound source generation unit 94 that generates a sound source according to the F0 output from the acoustic parameter generation unit 92; and a synthesis filter 96 that outputs a synthesized speech signal by modulating the sound source waveform from the sound source generation unit 94 according to the spectral parameters output from the acoustic parameter generation unit 92.

In such a speech synthesis system 50, the phoneme HMMs 62 must be trained on a large amount of speech. What this training ultimately computes is the average, over all samples, of the speech spectra for a specific phoneme context. When this is done with cepstra, however, multiple speech spectra whose spectral peaks (formants) lie at different positions (frequencies) are averaged in the cepstral domain. This causes the following problem.

With reference to FIG. 2, consider two spectra 110 and 112. Each has a peak corresponding to a formant, but the peaks lie at different positions on the frequency axis. Simply averaging the two gives spectrum 116, in which the peaks that are clearly present in spectra 110 and 112 have been blunted. Speech synthesized from such a spectrum would clearly have low quality. Ideally, the average should be computed so that a distinct peak remains, as in spectrum 114.
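The blunting effect described above can be reproduced with a small numerical sketch. The Gaussian peak shape, the normalized frequency grid, and the names `gaussian_peak`, `spec_a`, `spec_b`, and `naive_mean` are illustrative assumptions, not part of the patent.

```python
import math

def gaussian_peak(freqs, center, width=0.05):
    """Toy single-formant spectrum: a Gaussian bump of height 1 at `center`."""
    return [math.exp(-0.5 * ((f - center) / width) ** 2) for f in freqs]

# Normalized frequency axis from 0 to 1 (assumed for illustration).
freqs = [i / 200 for i in range(201)]

# Two spectra with the same shape but peaks at different frequencies.
spec_a = gaussian_peak(freqs, 0.45)
spec_b = gaussian_peak(freqs, 0.55)

# Point-by-point average, analogous to spectrum 116 in FIG. 2.
naive_mean = [(a + b) / 2 for a, b in zip(spec_a, spec_b)]

# Each input peaks at height 1, but the averaged curve only reaches
# roughly 0.61 and is wider than either input: the peak is blunted.
print(max(spec_a), max(spec_b), max(naive_mean))
```

In this toy case the averaged peak happens to sit midway between the inputs, but its height and sharpness are lost, which is the behaviour the text attributes to spectrum 116.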

Non-Patent Document 1: 大室仲 (Omuro) et al., "Examination of Inverse Integral Spectrum Function (IFIS) and its Applications", IEICE Technical Report, SP89-72, pp. 23-30, 1989.

One technique for solving this problem is disclosed in Non-Patent Document 1, which proposes using a parameter called the "inverse function of the integrated spectrum (IFIS)" to interpolate spectra. Since averaging can be regarded as a special case of interpolation, the parameter proposed in Non-Patent Document 1 may be applicable to the processing described above.

With reference to FIG. 3, to compute the average of two spectra 130 and 132 with this method, each spectrum is first integrated over its whole range. On the resulting curves 140 and 142, the frequencies corresponding to the peaks A and B of the original spectra 130 and 132 are found (A′ and B′), and their average C′ on the frequency axis is computed. This frequency C′ is the center frequency of the peak C of the spectrum obtained by averaging spectra 130 and 132. According to Non-Patent Document 1, when spectra 130 and 132 are averaged with this method, the height of the resulting spectrum at each frequency is the harmonic mean of the heights of spectra 130 and 132 at that frequency.

In other words, the method of Non-Patent Document 1 can be described as "integrating each of the two spectra from frequency 0 and taking the harmonic mean of the amplitudes at the two points where the integrated values are equal".
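As a rough illustration of this idea, the sketch below integrates two toy spectra from frequency 0, inverts the cumulative curves by linear interpolation, and averages the frequencies at which both curves reach the same level. Non-Patent Document 1 does not disclose a numerical procedure, so everything here is an assumption: the Gaussian test spectra, the normalization of both integrals to 1, and the helper names `normalized_cumulative` and `frequency_at_level`.

```python
import math

def gaussian_peak(freqs, center, width=0.05):
    """Toy single-formant spectrum (illustrative assumption)."""
    return [math.exp(-0.5 * ((f - center) / width) ** 2) for f in freqs]

def normalized_cumulative(spec):
    """Running integral of the spectrum from frequency 0, scaled to end at 1."""
    running, out = 0.0, []
    for s in spec:
        running += s
        out.append(running)
    return [c / running for c in out]

def frequency_at_level(cum, freqs, level):
    """Inverse of the cumulative curve, found by linear interpolation."""
    for i in range(1, len(cum)):
        if cum[i] >= level:
            span = cum[i] - cum[i - 1]
            t = 0.0 if span == 0.0 else (level - cum[i - 1]) / span
            return freqs[i - 1] + t * (freqs[i] - freqs[i - 1])
    return freqs[-1]

freqs = [i / 200 for i in range(201)]
cum_a = normalized_cumulative(gaussian_peak(freqs, 0.40))
cum_b = normalized_cumulative(gaussian_peak(freqs, 0.60))

# Average the two inverse functions at equal integral levels.
levels = [k / 100 for k in range(1, 100)]
averaged_inverse = [
    0.5 * (frequency_at_level(cum_a, freqs, lv) + frequency_at_level(cum_b, freqs, lv))
    for lv in levels
]

# The level-0.5 point marks the center of the averaged peak; here it
# lands close to 0.5, midway between the two input peaks.
peak_center = averaged_inverse[49]
print(peak_center)
```

The harmonic mean follows from the same construction: the slope of each inverse function is the reciprocal of the spectrum height, so averaging the inverse functions averages the reciprocals of the heights.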

FIG. 4 shows an example of a spectrum obtained by averaging spectra 130 and 132 with this method. In FIG. 4, the peak of spectrum 150 lies midway between the peaks of spectra 130 and 132, and its height is also intermediate between the two. Compared with the example shown in FIG. 2, there is therefore little risk that the spectral peak will be blunted. Note that spectra 130 and 132 in FIG. 4 are test data and so differ from ordinary spectral curves.

Admittedly, with the method of Non-Patent Document 1, "averaging" two spectra does not blunt the peaks, which appears desirable for HMM training. However, Non-Patent Document 1 does not disclose a numerical method for computing the IFIS. Actually integrating the spectrum and taking the inverse function requires a large amount of computation. In addition, the resulting IFIS parameters have about as many dimensions as the spectrum itself (roughly 128 to 1024). Using such high-dimensional acoustic parameters for HMM training requires an enormous amount of processing (and processing time). Furthermore, because the IFIS is based on simple integration of the spectrum, frequency resolution deteriorates in low-power frequency regions when the spectrum has, for example, a steep slope.

What is needed, therefore, is a speech parameterization technique that efficiently computes results such as those shown in FIG. 4, yields parameters suitable for HMM training, and provides sufficient frequency resolution over the entire frequency band.

An object of the present invention is therefore to provide a spectrum analysis device capable of outputting parameters with which shape interpolation between a plurality of spectra can be performed easily without losing the characteristic portions of their shapes.

Another object of the present invention is to provide a spectrum operation device that can easily perform shape interpolation between a plurality of spectra without losing the characteristic portions of their shapes.

A spectrum analysis device according to a first aspect of the present invention includes spectrum analysis means for performing spectrum analysis on a speech signal and outputting a spectrum signal representing the spectral envelope of the speech, and parameter generation means for outputting, as parameters representing the spectral envelope of the speech signal, the frequency corresponding to each pulse position in a pulse density representation of the spectrum signal output by the spectrum analysis means.

The spectrum analysis means performs spectrum analysis on the input speech signal and outputs a spectrum signal, which represents the spectral envelope of the speech. The parameter generation means outputs the frequency corresponding to each pulse position in the pulse density representation of that spectrum signal. This output is used as the parameters representing the spectral envelope of the speech signal.

The spectral envelope of the speech signal is thus expressed as parameters in the form of the frequencies corresponding to the pulse positions in the pulse density representation. Because the features of the spectral envelope are represented by a sequence of frequencies, the correspondence between characteristic portions can be expressed accurately for a plurality of spectra that have similar shapes but whose characteristic portions lie at different frequencies. As a result, it is possible to provide a spectrum analysis device capable of outputting parameters with which shape interpolation between a plurality of spectra can be performed easily without losing the characteristic portions of their shapes.

Preferably, the parameter generation means receives the spectrum signal as input and outputs, as the parameters representing the spectral envelope of the speech signal, the frequency corresponding to each pulse position in a pulse density representation obtained by delta-sigma modulation that quantizes with a predetermined threshold.

Delta-sigma modulation, which is widely used for time-domain signals, can thus be used to convert the spectrum signal into frequency data.

More preferably, the spectrum signal output by the spectrum analysis means is a sequence of cepstrum coefficients representing the spectral envelope of the speech.

Cepstrum analysis is widely used for analyzing speech signals. By processing the spectral information obtained from cepstrum analysis, the features of a speech signal can be expressed with new parameters that represent the spectral envelope as a frequency sequence, while making effective use of existing tools.

Still more preferably, the parameter generation means includes first storage means for storing the zeroth-order cepstrum coefficient among the cepstrum coefficients output by the spectrum analysis means, and second storage means for storing, as a frequency sequence, the frequency corresponding to each pulse position in the pulse density representation of the spectral envelope represented by the cepstrum coefficients from the first order up to a predetermined order. The zeroth-order cepstrum coefficient stored in the first storage means and the frequency sequence stored in the second storage means are output as the parameters.

The zeroth-order cepstrum coefficient represents the mean of the spectrum. Performing pulse density modulation after removing the mean makes the spectrum zero-mean, which reduces the amount of information and the amount of processing required for parameterization.

Alternatively, the parameter generation means may include first storage means for storing the mean of the spectral envelope output by the spectrum analysis means, and second storage means for storing, as a frequency sequence, the frequency corresponding to each pulse position in the pulse density representation of the spectrum obtained by subtracting the mean from that spectral envelope. The spectral envelope mean stored in the first storage means and the frequency sequence stored in the second storage means may then be output as the parameters.

Preferably, the spectrum analysis device further includes parameter compression processing means for compressing the frequency sequence output by the parameter generation means, and outputs the compressed frequency sequence data as all or part of the parameters representing the spectral envelope.

More preferably, the parameter compression processing means compresses the frequency sequence output by the parameter generation means by means of a trigonometric series expansion.

The spectrum analysis device may further include spectrum shaping means for suppressing or removing global features, including the slope, of the spectral envelope of the speech output by the spectrum analysis means, and the spectral envelope whose global features have been suppressed or removed by the spectrum shaping means may be input to the parameter generation means.

The spectrum shaping means may suppress or remove the global features, including the slope, of the spectral envelope by reducing the low-order coefficients of the cepstrum that represents the spectral envelope of the speech output by the spectrum analysis means.

A spectrum operation device according to a second aspect of the present invention includes any of the spectrum analysis devices described above, and interpolation means for receiving first and second parameters output by the spectrum analysis device for first and second spectra, respectively, and performing a predetermined interpolation operation between the first and second parameters.

For each of a plurality of spectra, the spectrum analysis device outputs the frequency corresponding to each pulse position in the pulse density representation as the parameters representing the spectral envelope of the speech signal. These parameters can be interpolated without losing the characteristic portions of the spectral envelope. The interpolation means performs a predetermined interpolation operation between the first and second parameters obtained by the spectrum analysis device from the first and second spectra. The first and second spectra can therefore be interpolated without losing their characteristic portions.

Preferably, the interpolation means includes averaging means for computing the average of corresponding parameters among the first and second parameters.

In this case the interpolation operation is an average. When the average of a plurality of spectra is computed, an average spectrum can be obtained without losing the characteristic portions of those spectra.

As described above, according to the present invention, the features of the spectral envelope are represented by a sequence of frequencies, so the correspondence between characteristic portions can be expressed accurately for a plurality of spectra that have similar shapes but whose characteristic portions lie at different frequencies. As a result, it is possible to provide a spectrum analysis device capable of outputting parameters with which shape interpolation between a plurality of spectra can be performed easily without losing the characteristic portions of their shapes.

FIG. 1 is a block diagram of the conventional speech synthesis system 50.
FIG. 2 is a graph of spectra showing the problem that arises when spectra are averaged by the conventional method.
FIG. 3 is a diagram for explaining the parameterization method proposed in Non-Patent Document 1.
FIG. 4 is a graph for explaining a spectrum averaged by the parameterization method proposed in Non-Patent Document 1.
FIG. 5 is a diagram for explaining the method of computing the average of spectra by pulse density modulation (PDM) adopted in the embodiment of the present invention.
FIG. 6 is a block diagram of the speech synthesis system 200 according to the embodiment of the present invention.
FIG. 7 is a block diagram of the PDM encoder 220 shown in FIG. 6.
FIG. 8 is a block diagram of the delta-sigma modulation unit 246 shown in FIG. 7.
FIG. 9 is a graph for explaining the sine series expansion performed by the compression processing unit 250 of FIG. 6.
FIG. 10 is a diagram comparing a spectrum with the pulse train obtained from that spectrum according to the embodiment of the present invention.
FIG. 11 is a diagram showing the successive spectra of the transition from /r/ to /l/.
FIG. 12 is a graph comparing the true average of the successive spectra shown in FIG. 11 with the spectra obtained by averaging them using the cepstrum and using the PDM parameters according to the embodiment of the present invention.

In this specification and the drawings, identical parts are given identical reference numerals. Their names and functions are also identical, so detailed descriptions of them are not repeated.

[Basic concept]
With reference to FIG. 5, the spectrum parameterization method of the embodiment of the present invention, and the method of computing the average of spectra using those parameters, will be described. Consider the case of obtaining the average of two spectra 160 and 162. As is clear from the figure, their peaks lie at different positions on the frequency axis.

In this embodiment, the envelope of the speech spectrum is first liftered to suppress the global characteristics of the spectrum. Pulse density modulation (PDM) is then performed to convert the amplitude (or power) of spectra 160 and 162 into pulse trains 170 and 172, respectively, in which the amplitude is expressed as pulse density. Each spectrum is normalized in advance so that the number of pulses output per spectrum is constant, and the frequency at which each pulse is output is stored. For this purpose, only the frequencies at which pulses are output need to be stored.

The pulses of pulse trains 170 and 172 are in one-to-one correspondence. Averaging the frequencies of each corresponding pair of pulses, for all pairs, yields a new pulse train 174. PDM decoding of this pulse train 174 gives a new spectrum 180 that is the average of spectra 160 and 162.
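The procedure just described can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the toy spectra are non-negative Gaussian bumps, the encoder `pdm_encode` emits a pulse each time a running sum crosses a fixed share of the total (so every spectrum yields the same number of pulses), and the decoder `pdm_decode` uses Gaussian smoothing as a stand-in for a low-pass filter. All function and variable names are hypothetical.

```python
import math

def gaussian_peak(freqs, center, width=0.05):
    """Toy non-negative spectrum (illustrative assumption)."""
    return [math.exp(-0.5 * ((f - center) / width) ** 2) for f in freqs]

def pdm_encode(spec, freqs, n_pulses):
    """Return the frequencies at which pulses are emitted (always n_pulses of them)."""
    threshold = sum(spec) / n_pulses      # normalization: fixed pulse count
    acc, pulses = 0.0, []
    for f, s in zip(freqs, spec):
        acc += s
        while acc >= threshold and len(pulses) < n_pulses:
            acc -= threshold
            pulses.append(f)
    while len(pulses) < n_pulses:         # guard against rounding at the end
        pulses.append(freqs[-1])
    return pulses

def pdm_decode(pulses, freqs, bandwidth=0.03):
    """Smooth the pulse train to recover a spectrum-like curve."""
    return [sum(math.exp(-0.5 * ((f - p) / bandwidth) ** 2) for p in pulses)
            for f in freqs]

freqs = [i / 200 for i in range(201)]
pulses_a = pdm_encode(gaussian_peak(freqs, 0.40), freqs, 64)
pulses_b = pdm_encode(gaussian_peak(freqs, 0.60), freqs, 64)

# Average corresponding pulses pairwise, then decode the new pulse train.
mean_pulses = [(a + b) / 2 for a, b in zip(pulses_a, pulses_b)]
decoded = pdm_decode(mean_pulses, freqs)
peak_freq = freqs[decoded.index(max(decoded))]
print(peak_freq)   # close to 0.5, midway between the two input peaks
```

Because only pulse frequencies are averaged, the decoded curve keeps a single distinct peak between the two inputs instead of a blunted one.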

[Configuration]
FIG. 6 is a block diagram of the speech synthesis system 200 according to the embodiment of the present invention. Referring to FIG. 6, the HMM-based speech synthesis system 200 according to the embodiment includes: a storage device that stores phoneme HMMs 212, which are context-dependent phoneme HMMs that, unlike the phoneme HMMs 62 shown in FIG. 1, use the pulse train frequencies described above as acoustic parameters; a learning unit 210 for training the phoneme HMMs 212; and a synthesis unit 214 that performs speech synthesis from input text 66 using the trained phoneme HMMs 212.

The learning unit 210 has the same configuration as the conventional learning unit 60 shown in FIG. 1, with two differences. First, it further includes a PDM encoder 220, connected to receive the output of the spectrum analysis unit 74 of FIG. 1, which PDM-encodes the spectral parameters output by the spectrum analysis unit 74, converts them into frequency information of the pulse train representing the spectrum (referred to as "PDM parameters" or "PDM cepstrum"), and outputs the result. Second, in place of the learning data storage unit 76 of FIG. 1, it includes a learning data storage unit 222 that stores together, as learning data, the F0 parameter of a speech waveform output from the fundamental frequency extraction unit 72, the context-dependent label sequence attached to the corresponding speech waveform and supplied from the speech corpus 70, and the PDM parameters supplied from the PDM encoder 220.

The learning unit 210 includes the same HMM model learning unit 78 as the learning unit 60 of FIG. 1. The HMM model learning unit 78 itself has exactly the same function as that shown in FIG. 1, but it trains the phoneme HMMs 212 on learning data whose spectral acoustic parameters are PDM parameters rather than ordinary cepstra. The internal parameters of the phoneme HMMs 212 therefore differ from those of the phoneme HMMs 62 shown in FIG. 1. In particular, the phoneme HMMs 212 differ from the phoneme HMMs 62 of FIG. 1 in that their output is PDM parameters rather than cepstra.

The synthesis unit 214 also has the same configuration as the synthesis unit 64 shown in FIG. 1, but because the spectral acoustic parameters output from the phoneme HMMs 212 are PDM parameters rather than cepstra, it differs from the synthesis unit 64 in the following respect. In addition to the configuration of the synthesis unit 64 shown in FIG. 1, the synthesis unit 214 further includes a PDM decoder 230 that decodes the PDM parameters output from the acoustic parameter generation unit 92, converts them into spectral parameters, and supplies them to the synthesis filter 96. The PDM decoder 230 can be realized with a simple configuration that generates pulses according to the PDM parameters (pulse frequency data) supplied to it and then passes the pulse train through a low-pass filter. Here, the speech spectrum may be treated as if it were a speech signal, with frequency mapped to time, when generating the pulse train.

FIG. 7 is a block diagram of the PDM encoder 220 shown in FIG. 6. Referring to FIG. 7, the PDM encoder 220 includes: an average value calculation circuit 240 that computes the mean of the spectral parameters output from the spectrum analysis unit 74 and outputs a signal indicating that mean; an average storage circuit 242 that stores the mean value signal output from the average value calculation circuit 240; and a subtraction circuit 244 that subtracts the mean stored in the average storage circuit 242 from the spectral parameters output from the spectrum analysis unit 74. In practice, among the cepstrum coefficients output from the spectrum analysis unit 74, the zeroth-order coefficient C_0 corresponds to this mean. The processing can therefore be realized by having the average value calculation circuit 240 extract only the zeroth-order coefficient C_0 and having the subtraction circuit 244 extract only the first-order and higher coefficients. The average storage circuit 242 stores the zeroth-order cepstrum coefficient and outputs it to the learning data storage unit 222 as part of the PDM parameters.
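The equivalence between subtracting the spectral mean and dropping C_0 can be checked numerically. The sketch below assumes one common convention for reconstructing a log spectrum from real cepstrum coefficients, log S(ω) = c_0 + 2 Σ c_n cos(nω), sampled on a midpoint grid over 0 to π; the convention, the coefficient values, and the function name `log_spectrum` are illustrative assumptions rather than details taken from the patent.

```python
import math

def log_spectrum(cepstrum, n_bins):
    """Log spectrum from cepstrum coefficients on a midpoint frequency grid.

    Assumed convention: log S(w) = c0 + 2 * sum_{n>=1} c_n * cos(n * w).
    """
    omegas = [math.pi * (k + 0.5) / n_bins for k in range(n_bins)]
    return [
        cepstrum[0]
        + 2.0 * sum(c * math.cos(n * w) for n, c in enumerate(cepstrum[1:], start=1))
        for w in omegas
    ]

# Hypothetical low-order cepstrum; the first entry plays the role of C_0.
cepstrum = [1.7, 0.9, -0.4, 0.2, 0.05]

spectrum = log_spectrum(cepstrum, 128)
mean_level = sum(spectrum) / len(spectrum)

# Keeping only the first-order and higher coefficients gives the same
# curve shifted to zero mean, which is what the subtraction circuit
# is described as producing.
zero_mean_spectrum = log_spectrum([0.0] + cepstrum[1:], 128)

print(mean_level)                      # matches cepstrum[0] up to rounding
print(sum(zero_mean_spectrum) / 128)   # approximately 0
```

Each cosine term averages to zero over the grid, so the mean of the reconstructed log spectrum is carried entirely by the zeroth-order coefficient.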

The PDM encoder 220 further includes a delta-sigma modulation unit 246 that receives the mean-subtracted acoustic parameters output from the subtraction circuit 244, performs delta-sigma modulation with the frequency axis of the spectrum treated as a time axis, and converts the spectrum into a pulse train representing it; a pulse train storage unit 248 that stores, for each spectrum processed, the spectral frequency at which each pulse was output from the delta-sigma modulation unit 246; and a compression processing unit 250 that compresses the pulse position (that is, frequency) information stored in the pulse train storage unit 248 by sine series expansion and stores the result in the learning data storage unit 222.

FIG. 8 is a block diagram of the delta-sigma modulation unit 246 shown in FIG. 7. Referring to FIG. 8, the delta-sigma modulation unit 246 performs first-order delta-sigma modulation in this embodiment. It includes an integrator 262 that receives the output of the subtraction circuit 244, a quantizer 266 that outputs a +1 pulse when the output of the integrator 262 exceeds a fixed threshold, and a feedback unit 268 that feeds the output of the quantizer 266 back to the input of the integrator 262. The pulse train storage unit 248 receives the spectrum signal and the output of the quantizer 266, and stores the spectral frequency at which the quantizer 266 outputs each pulse. The integrator 262 integrates the spectral waveform, and a pulse is output whenever the integral exceeds the threshold. Therefore, as shown in FIG. 5, the density of output pulses is high around spectral peaks and falls as the spectral value decreases.
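The integrate, quantize, and feed back loop described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the circuit of FIG. 8: the function name `pdm_encode` is hypothetical, the input samples are assumed to be non-negative and no larger than the threshold (so that one pulse per sample is enough), and a pulse is emitted when the accumulator reaches the threshold.

```python
def pdm_encode(spectrum, freqs, threshold):
    """First-order delta-sigma style encoding along the frequency axis.

    spectrum  : spectral values, ordered from low to high frequency
                (assumed to lie in the range 0..threshold)
    freqs     : frequency of each sample in `spectrum`
    threshold : quantizer threshold
    Returns the frequencies at which a pulse was emitted.
    """
    if len(spectrum) != len(freqs):
        raise ValueError("spectrum and freqs must have the same length")
    acc = 0.0          # integrator state
    pulse_freqs = []   # what the pulse train storage would record
    for value, f in zip(spectrum, freqs):
        acc += value                 # integrator
        if acc >= threshold:         # quantizer emits a +1 pulse
            pulse_freqs.append(f)
            acc -= threshold         # feedback subtracts the threshold
    return pulse_freqs
```

With a spectrum that has a peak in the middle, for example `pdm_encode([0.25, 0.25, 1.0, 1.0, 1.0, 0.25, 0.25, 0.25], [0.0, 100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0], 1.0)`, pulses are emitted at every sample inside the peak and only occasionally outside it.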

The compression processing unit 250 shown in FIG. 7 compresses the data representing the pulse train (the frequency sequence) by sine series expansion. Referring to FIG. 9, plot the point at which each pulse is output, with the normalized integral of the spectrum on the horizontal axis and frequency on the vertical axis, and consider the curve 290 connecting these points. Now consider the line segment 292 connecting the first and last points of this curve: the curve 290 is divided into portions lying above and below the line segment 292. The example shown in FIG. 9 corresponds to the spectrum 160 shown in FIG. 5, and the curve 290 is divided into two regions, one above the line segment 292 and one below it. Subtracting the line segment 292 from the curve 290 yields a curve that closely resembles a sine series. The curve 290 can therefore be approximated by the low-order terms of a sine series expansion (roughly ten-odd to several tens of terms), and the data for the points making up the curve 290 can be compressed. The same holds when the line segment 292 divides the curve 290 into more than two portions above and below it. The compression processing unit 250 thus allows the frequency sequence data (roughly one hundred and several tens to about a thousand dimensions) to be represented by low-dimensional data. As a result, the processing load of HMM learning is kept from becoming enormous, and HMM learning can be completed in a reasonable computation time.
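The compression step can be sketched as follows. The sketch assumes that the pulses are equally spaced along the horizontal axis of FIG. 9 (each pulse corresponds to the same increment of the integral), subtracts the straight line through the first and last points, and projects the residual onto a discrete sine basis. The function names `sine_compress` and `sine_expand`, and the particular discrete sine transform used, are illustrative assumptions; the text only states that a sine series expansion is used.

```python
import math

def sine_compress(freqs, n_terms):
    """Keep the end points plus the first `n_terms` sine coefficients."""
    n = len(freqs) - 1                      # number of intervals
    if n < 2 or not 1 <= n_terms <= n - 1:
        raise ValueError("need at least 3 points and 1 <= n_terms <= len(freqs) - 2")
    f_first, f_last = freqs[0], freqs[-1]
    # Residual after subtracting the straight line (segment 292 in FIG. 9).
    resid = [freqs[k] - (f_first + (f_last - f_first) * k / n) for k in range(n + 1)]
    # Discrete sine expansion of the residual, which is zero at both ends.
    coeffs = [
        2.0 / n * sum(resid[k] * math.sin(math.pi * m * k / n) for k in range(1, n))
        for m in range(1, n_terms + 1)
    ]
    return f_first, f_last, coeffs

def sine_expand(f_first, f_last, coeffs, n_points):
    """Approximate the frequency sequence from the compressed form."""
    n = n_points - 1
    return [
        f_first + (f_last - f_first) * k / n
        + sum(b * math.sin(math.pi * (m + 1) * k / n) for m, b in enumerate(coeffs))
        for k in range(n + 1)
    ]
```

Keeping all `len(freqs) - 2` coefficients reproduces the sequence up to rounding error, while keeping fewer gives a smoother approximation with correspondingly fewer parameters per frame.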

[Operation]
The operation of the speech synthesis system 200 has two phases. The first phase is training of the phoneme HMM 212. The second phase is synthesis of speech corresponding to the text 66 using the phoneme HMM 212. The operation of the speech synthesis system 200 in each phase is described below.

-Learning-
When the phoneme HMM 212 is trained, the speech synthesis system 200 operates as follows. Each frame of speech in the utterances in the speech corpus 70 has a context-dependent label attached in advance. The speech waveform data of each frame is supplied to both the fundamental frequency extraction unit 72 and the spectrum analysis unit 74. The fundamental frequency extraction unit 72 calculates the F0 parameter from the given speech and supplies it to the learning data storage unit 222. The spectrum analysis unit 74 calculates the spectral parameters of the speech and supplies them to the PDM encoder 220.

Referring to FIG. 7, the average value calculation circuit 240 calculates the average value of the given log power spectrum (the zeroth-order cepstrum coefficient), and the average storage circuit 242 stores that value. The subtraction circuit 244 subtracts the average value from the log power spectrum and supplies the result to the delta-sigma modulation unit 246, starting from the lowest frequency.
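A minimal Python sketch of this averaging and subtraction step follows. The function name `remove_mean` is hypothetical, and the log power spectrum is assumed to be a plain list of values ordered from low to high frequency.

```python
def remove_mean(log_spectrum):
    """Split a log power spectrum into its mean and a zero-mean residual.

    Returns (mean, residual): `mean` is the value that the average storage
    would keep, and `residual` is what is passed on to the modulator,
    lowest frequency first.
    """
    if not log_spectrum:
        raise ValueError("log_spectrum must not be empty")
    mean = sum(log_spectrum) / len(log_spectrum)
    residual = [value - mean for value in log_spectrum]
    return mean, residual
```

Note that the residual contains negative values; how the modulator treats them is not covered by this sketch.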

Referring to FIG. 8, the integrator 262 integrates the power spectrum it is given. The output of the integrator 262 is supplied to the quantizer 266, which outputs a pulse when that output rises above the threshold. This pulse is fed back to the input of the integrator 262 by the feedback unit 264, so that a value corresponding to the threshold is subtracted from the output of the integrator 262. In this way, the delta-sigma modulation unit 246 integrates the spectral waveform from the low-frequency region toward the high-frequency region and outputs a pulse whenever the integral exceeds the threshold. Because a value corresponding to the threshold is subtracted from the integral each time a pulse is output, a pulse is in effect output at each point where the integral, accumulated from the low-frequency side, reaches the threshold. The pulse train storage unit 248 stores the spectral frequency at that point. The pulse train storage unit 248 thus stores the pulse train output for a spectrum as the sequence of frequencies at which the pulses were output. The compression processing unit 250 converts the frequency sequence data stored in the pulse train storage unit 248 (roughly one hundred and several tens to about a thousand dimensions) into low-dimensional data by sine series expansion. As a result of this compression, the processing load of HMM learning is kept from becoming enormous, and HMM learning can be completed in a reasonable computation time.

The PDM encoder 220 supplies the frequency sequence data, stored in the pulse train storage unit 248 and compressed by the compression processing unit 250, to the learning data storage unit 222 as PDM parameters. For each frame, the learning data storage unit 222 receives the F0 parameter from the fundamental frequency extraction unit 72, the context-dependent label from the speech corpus 70, and the PDM parameters from the PDM encoder 220, and stores them together as one set of learning data.

Once learning data has been created in this way for all frames of the utterance data stored in the speech corpus 70, the HMM model learning unit 78 uses this learning data to train the phoneme HMM 212.

FIG. 10 shows an example of the log power spectrum of a speech sound together with the pulse train obtained by applying PDM to that spectrum. As FIG. 10 shows, the pulse density is high where the log power of the spectrum is large and low where it is small.

-Speech synthesis-
During speech synthesis, the speech synthesis system 200 operates as follows. When the text 66 to be synthesized is given, the text analysis unit 90 of the synthesis unit 214 performs known text analysis processing, generates a context-dependent label sequence containing the phoneme sequence to be synthesized, and supplies it to the acoustic parameter generation unit 92.

The acoustic parameter generation unit 92 concatenates the context-dependent HMMs in the phoneme HMM 212 according to the given context-dependent label sequence. It then generates the acoustic parameter sequence with the highest likelihood (an F0 parameter sequence and a PDM parameter sequence) based on the concatenated context-dependent HMMs. Because the phoneme HMM 212 has been trained with PDM parameters as its acoustic parameters, the acoustic parameter sequence output by the acoustic parameter generation unit 92 is a PDM parameter sequence rather than a cepstrum sequence as in the conventional technique.

The PDM decoder 230 decodes the PDM parameter sequence output from the acoustic parameter generation unit 92, converts it into a spectral parameter sequence, and supplies that sequence to the synthesis filter 96.

The sound source generation unit 94 generates a sound source signal according to the F0 parameter sequence supplied from the acoustic parameter generation unit 92. The synthesis filter 96 filters this sound source signal with characteristics that depend on the spectral parameters supplied from the PDM decoder 230, and outputs a synthesized speech signal. This synthesized speech signal is converted to analog, amplified, and supplied to a speaker, which produces the synthesized speech.

As described above, according to this embodiment, the flattening of the spectrum that occurs in the spectral averaging performed during HMM model training can be mitigated. As a result, the sound quality of the synthesized speech is improved.

For example, consider averaging the successive spectra in the transition from /r/ to /l/, as shown in FIG. 11. Referring to FIG. 12, compare the waveform 322, obtained by averaging the spectra in the cepstrum domain as in the conventional approach, with the waveform 320, obtained by averaging with strict frequency correspondence. In the waveform 320 the second and third formants can be distinguished, whereas in the waveform 322 they are flattened and can no longer be told apart. In the waveform 324, obtained by the method of this embodiment, these formants are clearly distinguished, and the features of the strictly computed waveform are well preserved.
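The effect can be reproduced on a toy example. The sketch below is not the data of FIG. 11 or FIG. 12; it uses two artificial spectra, each with a single rectangular peak at a different position, and compares averaging the spectral values bin by bin with averaging the pulse positions. All names (`encode`, `decode`, `bump`) and the idealized spectra are assumptions made for illustration, and the decoder omits the low-pass filter for brevity.

```python
def encode(spectrum, threshold=1.0):
    """Emit a pulse (bin index) each time the running integral reaches the threshold."""
    acc, pulses = 0.0, []
    for i, value in enumerate(spectrum):
        acc += value
        if acc >= threshold:
            pulses.append(i)
            acc -= threshold
    return pulses

def decode(pulses, n_bins, threshold=1.0):
    """Place one impulse per pulse position (no smoothing in this toy version)."""
    out = [0.0] * n_bins
    for p in pulses:
        out[int(round(p))] += threshold
    return out

def bump(n_bins, start, width):
    """Artificial spectrum: 1.0 inside the peak, 0.0 elsewhere."""
    return [1.0 if start <= i < start + width else 0.0 for i in range(n_bins)]

a = bump(40, 10, 4)   # peak at bins 10..13
b = bump(40, 20, 4)   # peak at bins 20..23

# Averaging the spectral values bin by bin: two peaks of half height.
spectral_mean = [(x + y) / 2 for x, y in zip(a, b)]

# Averaging the pulse positions: one full-height peak halfway between.
pulse_mean = [(x + y) / 2 for x, y in zip(encode(a), encode(b))]
pdm_mean = decode(pulse_mean, 40)

print(max(spectral_mean), max(pdm_mean))
```

Averaging pulse positions only makes sense when both spectra yield the same number of pulses, which is why the two toy spectra have the same total area.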

The embodiment disclosed here is merely illustrative, and the present invention is not limited to it. The scope of the present invention is defined by each of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

50, 200 Speech synthesis system
60, 210 Learning unit
62, 212 Context-dependent phoneme HMM
64, 214 Synthesis unit
66 Text
70 Speech corpus
72 Fundamental frequency extraction unit
74 Spectrum analysis unit
76, 222 Learning data storage unit
78 HMM model learning unit
90 Text analysis unit
92 Acoustic parameter generation unit
94 Sound source generation unit
96 Synthesis filter
170, 172, 174 Pulse train
220 PDM encoder
230 PDM decoder
246 Delta-sigma modulation unit

Claims

A spectrum analyzer using a pulse density representation, which expresses the intensity distribution of a spectrum signal along the frequency axis as the density of pulses generated at variably spaced frequency positions on that axis, the spectrum analyzer comprising:
spectrum analysis means for performing spectrum analysis on a speech signal and outputting a spectrum signal representing the spectrum envelope of the speech; and
parameter generation means for outputting, as parameters representing the spectrum envelope of the speech signal, the frequencies corresponding to the respective pulse positions in the pulse density representation of the spectrum signal output by the spectrum analysis means.

The spectrum analyzer according to claim 1, wherein the parameter generation means outputs, as the parameters representing the spectrum envelope of the speech signal, the frequencies corresponding to the respective pulse positions in the pulse density representation obtained by delta-sigma modulation that takes the spectrum signal as input and quantizes with a predetermined threshold.

The spectrum analysis apparatus according to claim 1 or 2, wherein the spectrum signal output by the spectrum analysis means is a cepstrum coefficient sequence representing a spectrum envelope of speech.

The spectrum analyzer according to claim 3, wherein the parameter generation means includes:
first storage means for storing the zeroth-order cepstrum coefficient among the cepstrum coefficients output by the spectrum analysis means; and
second storage means for storing, as a frequency sequence, the frequencies corresponding to the respective pulse positions in the pulse density representation of the spectrum envelope represented by the cepstrum coefficients from the first order to a predetermined order among the cepstrum coefficients output by the spectrum analysis means,
and wherein the zeroth-order cepstrum coefficient stored in the first storage means and the frequency sequence stored in the second storage means are output as the parameters.

The parameter generation means includes
First storage means for storing an average value of a spectrum envelope output from the spectrum analysis means;
A second storage means for storing a frequency corresponding to each pulse position as a frequency sequence in a pulse density representation of a spectrum obtained by subtracting an average value from a spectrum envelope output from the spectrum analysis means;
3. The spectrum envelope average value stored in the first storage unit and the frequency sequence stored in the second storage unit are output as the parameters. The spectrum analyzer described.

The spectrum analyzer according to any one of claims 1 to 5, further comprising parameter compression processing means for compressing the frequency sequence data of the frequency sequence output from the parameter generation means, wherein the compressed frequency sequence data is used as all or part of the parameters representing the spectrum envelope.

The spectrum analysis apparatus according to claim 6, wherein the parameter compression processing unit compresses the frequency sequence output from the parameter generation unit based on a trigonometric series expansion.

The spectrum analyzer according to claim 1, further comprising spectral shaping means for suppressing or removing global features, including the slope of the spectrum envelope, from the spectrum envelope of the speech output by the spectrum analysis means, wherein the spectrum envelope whose global features have been suppressed or removed by the spectral shaping means is input to the parameter generation means.

The spectrum analyzer according to claim 8, wherein the spectral shaping means suppresses or removes the global features, including the slope of the spectrum envelope, by subtracting low-order cepstrum coefficients from the cepstrum representing the spectrum envelope of the speech output from the spectrum analysis means.

A spectrum calculation device comprising:
the spectrum analyzer according to any one of claims 1 to 9; and
interpolation means for receiving first and second parameters output from the spectrum analyzer for first and second spectra, respectively, and performing a predetermined interpolation operation between the first and second parameters.

The spectrum calculation device according to claim 10, wherein the interpolation means includes an averaging means for calculating an average of corresponding parameters among the first and second parameters.