JP2005055778A

JP2005055778A - Equalizer of frequency characteristic of speech

Info

Publication number: JP2005055778A
Application number: JP2003288293A
Authority: JP
Inventors: Jinfu Ni; ジンフニ; Minoru Tsuzaki; 実津崎; Hisashi Kawai; 恒河井
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2003-08-06
Filing date: 2003-08-06
Publication date: 2005-03-03
Anticipated expiration: 2023-08-06
Also published as: JP3869823B2

Abstract

PROBLEM TO BE SOLVED: To provide an equalizer for the frequency characteristics of speech that can stably correct changes in the transmission characteristics of a recording system while taking human auditory characteristics into consideration.
SOLUTION: The equalizer 20 includes a source PSD forming section 34, a target PSD forming section 36, and a dividing section 38, which together calculate the difference in power spectral density (PSD) between source speech 30 and target speech 32, and a cepstrum filter section 40 which filters the source speech 30 with filter parameters obtained by smoothing the difference through cepstrum transformations 70 and 72, a mel-warping transformation 74, and its inverse transformation 76.
COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a speech correction technique and, more particularly, to a technique for correcting the sound quality of input speech so that it approaches that of target speech, for use in waveform-segment concatenative speech synthesis systems and the like.

With the development of computer and data communication technologies, the interface between humans and machines has become important. Ideally, people should be able to communicate with machines in the same way that they talk to other people, and technology toward that end is under development.

Recognition techniques such as speech recognition and image recognition are mainly used to convey information from humans to machines. Information can be conveyed from machines to humans in various ways, and among them speech synthesis is being used increasingly often. Typical applications include voice response systems, speech translation systems, and computer games. Furthermore, as the development of robots and the like has progressed in recent years, combining speech recognition and image recognition with speech synthesis is expected to make communication between humans and robots resemble communication between humans.

In speech synthesis, the key issue is how to synthesize natural-sounding speech. The mainstream approach today is so-called speech-segment concatenative synthesis, in which speech segments are prepared in advance from a large speech corpus of several tens of hours, and appropriate segments are selected and concatenated according to the input text data. In this technique, it is important to join the speech waveform segments as naturally as possible.

As described above, current waveform-segment speech synthesis systems use a large speech corpus to improve sound quality. In many cases the speech of a single speaker is recorded over a long period, and the recording can take anywhere from several months to several years.

In such cases, the characteristics of the recording system drift over time between recording sessions, so the sound quality of the recorded speech may differ from session to session when it is played back. When waveforms are concatenated, joining speech of differing sound quality makes the synthesized speech sound unnatural.

One speech correction technique for solving this problem is proposed in Non-Patent Document 1. FIG. 4 shows a block diagram of the channel equalizer described in Non-Patent Document 1. Referring to FIG. 4, this apparatus 200 receives source speech 30 and generates speech that has the utterance content of source speech 30 and substantially the same frequency characteristics as target speech 32. In this specification, target speech 32 refers to previously recorded speech that serves as the reference. Source speech 30 is speech recorded at a different time from target speech 32, and its frequency characteristics may differ from those of target speech 32 because the characteristics of the recording system have changed over time.

The apparatus 200 includes a source PSD generation unit 34 for generating the power spectral density (PSD) of source speech 30, a target PSD generation unit 36 for generating the PSD of target speech 32, a division unit 38 for calculating the difference between the PSD of source speech 30 and the PSD of target speech 32 (target PSD / source PSD), and an LPC filter unit 210 for filtering source speech 30 with an IIR (Infinite Impulse Response) filter that uses the result of LPC (linear prediction coefficient) analysis based on the output of division unit 38, and for outputting equalized speech 212.

Source PSD generation unit 34 and target PSD generation unit 36 have the same configuration. Source PSD generation unit 34 includes a speech frame detection unit 50 for detecting each speech frame contained in the data of source speech 30, a windowing unit 52 for applying a predetermined window to each speech frame detected by speech frame detection unit 50, a power spectrum calculation unit 54 for calculating the PSD of each speech frame by fast Fourier transform (FFT) from the frame data windowed by windowing unit 52, and a frame averaging unit 56 for calculating the average of the PSDs, calculated by power spectrum calculation unit 54, of the frames of source speech 30 over a predetermined period.

Similarly, target PSD generation unit 36 includes a speech frame detection unit 60, a windowing unit 62, a power spectrum calculation unit 64, and a frame averaging unit 66.
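The chain of frame detection, windowing, power spectrum calculation, and frame averaging can be sketched as follows. This is only an illustration under stated assumptions: the function name `average_psd`, the frame length, the hop size, the Hann window, and the energy threshold used as a stand-in for speech frame detection are all hypothetical choices, since the specification does not give these details.

```python
import numpy as np

def average_psd(signal, frame_len=512, hop=256, energy_floor=1e-6):
    """Average the per-frame power spectrum over frames judged to contain speech.

    All parameter values are illustrative assumptions, not taken from the patent.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)            # assumed window shape
    psds = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Crude stand-in for the speech frame detection unit:
        # skip frames whose mean energy is below a threshold.
        if np.mean(frame ** 2) < energy_floor:
            continue
        spectrum = np.fft.rfft(frame * window)  # FFT of the windowed frame
        psds.append(np.abs(spectrum) ** 2)      # power spectrum of this frame
    if not psds:
        raise ValueError("no frames above the energy floor")
    return np.mean(psds, axis=0)                # frame average

# Usage: the same function would be applied to the source and target recordings.
n = np.arange(4096)
tone = np.sin(2 * np.pi * 32 * n / 512)
psd = average_psd(tone)
```

The result has `frame_len // 2 + 1` bins covering 0 to half the sampling frequency.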

LPC filter unit 210 includes an IFFT unit 220 for performing inverse FFT (IFFT) processing on the output of division unit 38, an LPC conversion unit 222 for performing LPC conversion on the output of IFFT unit 220, and an IIR filter 224 whose filter parameters are determined by the LPC coefficients output by LPC conversion unit 222 and which filters source speech 30 so as to equalize the frequency characteristics of source speech 30 to those of target speech 32.

Division unit 38 calculates the difference in frequency characteristics between source speech 30 and target speech 32, and LPC conversion is performed on that difference to set the parameters of the IIR filter. With this channel equalizer 200, the frequency characteristics of source speech 30 can be equalized to be substantially the same as those of target speech 32.
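To make the prior-art chain (IFFT, LPC conversion, IIR filter parameters) concrete, here is a minimal sketch. It assumes that the IFFT of the PSD ratio is treated as an autocorrelation sequence and that the LPC coefficients are obtained with the Levinson-Durbin recursion. The function name `lpc_from_psd` is hypothetical, and the exact procedure of Non-Patent Document 1 is not given in this specification.

```python
import numpy as np

def lpc_from_psd(psd_ratio, order):
    """Fit an all-pole model 1 / A(z) to a one-sided PSD ratio.

    psd_ratio: values at the N/2 + 1 rfft bins (0 .. pi).
    Returns (a, err) with a[0] == 1 and err the final prediction error.
    """
    r = np.fft.irfft(psd_ratio)        # inverse FFT of a PSD gives an autocorrelation
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):      # Levinson-Durbin recursion
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                 # reflection coefficient
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

# Usage: an IIR equalizer would then compute
#   y[n] = sqrt(err) * x[n] - sum_{k>=1} a[k] * y[n - k]
w = np.linspace(0.0, np.pi, 257)
ratio = 1.0 / np.abs(1.0 - 0.5 * np.exp(-1j * w)) ** 2
a, err = lpc_from_psd(ratio, order=2)
```

The `order` argument is the quantity whose choice the specification identifies as difficult.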

Non-Patent Document 1: Yu Shi, Eric Chang, Hu Peng, and Min Chu, "Power Spectral Density Based Channel Equalization of Large Speech Database for Concatenative TTS System," Proc. of ICSLP 2002, pp. 2369-2372, USA, 2002.

The conventional equalizer shown in FIG. 4 has been shown to be effective.

However, the conventional method has a difficult problem: how to select the order of the LPC conversion. If the order is small, the correction has almost no effect, whereas if the order is large, the sound quality deteriorates severely. It is therefore difficult to set the order of the LPC conversion to an appropriate value.

An apparatus for equalizing frequency characteristics of speech according to the present invention includes means for calculating the difference in power spectral density (PSD) between speech to be processed and reference speech, means for converting the difference into a feature space of frequency characteristics represented by a cepstrum, and predetermined filtering means for filtering the speech to be processed, whose filter parameters are set using the difference represented by the cepstrum.

Preferably, the filtering means includes means for converting the cepstrum of the difference into a mel-cepstrum, and an MLSA (mel-logarithmic spectral approximation) filter that takes the difference represented by the mel-cepstrum as its filter parameter and is connected so as to receive the speech to be processed as input.

Alternatively, the filtering means may include PSD difference smoothing means for smoothing the PSD difference, and an FIR (finite impulse response) filter whose filter parameters are set by the PSD smoothed by the PSD difference smoothing means and which is connected so as to receive the speech to be processed.

The apparatus for equalizing frequency characteristics of speech may further include means for converting the PSD difference into a cepstrum of the difference having a predetermined first order. The PSD difference smoothing means may include first mel-warping means for performing, on the cepstrum of the difference, frequency-axis warping from the cepstrum of the first order to a mel-cepstrum having a second order smaller than the first order; second mel-warping means for applying, to the output of the first mel-warping means, the inverse of the mel-warping performed by the first mel-warping means and outputting the inversely transformed cepstrum of the difference; and means for outputting a smoothed PSD difference by converting the inversely transformed cepstrum of the difference output by the second mel-warping means into a spectrum, and supplying it to the FIR filter as the filter parameters.

The difference in power spectral density between the speech to be processed and the reference speech is converted into a feature space of frequency characteristics represented by a cepstrum. It is then further converted to the mel scale to set an MLSA filter. Alternatively, after conversion to a cepstrum, mel-warping and its inverse are applied to obtain an inversely transformed cepstrum, which is converted back to a spectrum; this smooths the PSD difference, and the filter is set with the smoothed PSD difference. A filter set in this way has characteristics close to those of human hearing. In addition, unlike a filter based on LPC conversion, the characteristics of a filter set in this way are not sensitive to the parameter order. Even when the parameters are calculated so as to increase the accuracy of the filter, the sound quality does not deteriorate. The frequency characteristics of the speech to be processed can thus be equalized to those of the reference speech with sound quality comparable to that of the conventional channel equalizer.

[First Embodiment]
FIG. 1 shows a block diagram of a channel equalizer 20 according to the first embodiment of the present invention. In FIG. 1, components identical to those in FIG. 4 are denoted by the same reference numerals. Their functions are also the same, so their detailed description is not repeated.

Channel equalizer 20 shown in FIG. 1 differs from channel equalizer 200 of FIG. 4 in that, in place of LPC filter unit 210 of FIG. 4, it includes a cepstrum filter unit 40 that performs equalization using an FIR (Finite Impulse Response) filter set with filter parameters obtained by smoothing the PSD difference.

Cepstrum filter unit 40 includes a logarithm calculation unit 70 for taking the square root of the difference between the averaged PSDs of source speech 30 and target speech 32 output by division unit 38 and then calculating its logarithm, an IFFT processing unit 72 for calculating the cepstrum of the output of division unit 38 by performing m-th order IFFT processing on the output of logarithm calculation unit 70, and a first mel-warping unit 74 for converting the horizontal axis (frequency axis) of the cepstrum output by IFFT processing unit 72 to the mel scale (mel-warping).
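The first two stages (logarithm of the square root of the PSD ratio, then IFFT) can be sketched as below. This is an illustration under assumptions: the name `difference_cepstrum` is hypothetical, "m-th order IFFT processing" is interpreted here as keeping cepstral coefficients 0 through m, and the small floor guarding against division by zero is an added safeguard not mentioned in the text.

```python
import numpy as np

def difference_cepstrum(source_psd, target_psd, m, floor=1e-12):
    """Cepstrum (orders 0..m) of the PSD difference target / source.

    Both PSDs are one-sided (rfft bins 0 .. pi) and have the same length.
    """
    source_psd = np.maximum(np.asarray(source_psd, dtype=float), floor)
    target_psd = np.maximum(np.asarray(target_psd, dtype=float), floor)
    ratio = target_psd / source_psd          # division unit: PSD difference
    log_mag = np.log(np.sqrt(ratio))         # square root, then logarithm
    cep = np.fft.irfft(log_mag)              # inverse FFT of the log magnitude
    return cep[:m + 1]                       # keep orders 0..m (assumed meaning of "m-th order")

# Usage: a target that is uniformly 4x the source power gives a pure gain term.
source = np.ones(257)
cep = difference_cepstrum(source, 4.0 * source, m=24)
```

With this convention the log magnitude is recovered as `cep[0] + 2 * sum(cep[k] * cos(k * w))`, because `irfft` returns the symmetric real cepstrum.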

The conversion in first mel-warping unit 74 is written as mel-warping (m, n, a), where m is the order of the cepstrum before conversion and n is the order after conversion. Here m and n are chosen so that n < m. The constant a is the parameter that controls the stretching of the frequency axis and is determined according to the sampling frequency.

Cepstrum filter unit 40 further includes a second mel-warping unit 76 for performing mel-warping (n, m, -a) on the output of first mel-warping unit 74. The mel-warping in second mel-warping unit 76 and the mel-warping in first mel-warping unit 74 are inverse transformations of each other. That is, executing these two processes in series returns the frequency axis to the original linear axis. However, because m and n are chosen so that n < m, executing the two processes in series removes the cepstral components at the high end of the axis after mel conversion.
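One common way to realize a warping of the form (m, n, a) is the all-pass frequency transformation recursion used in speech signal processing toolkits. The sketch below assumes that recursion. The names `freqt` and `mel_smooth` and the value a = 0.42 (a value often used for 16 kHz speech) are illustrative assumptions, since the specification does not state the algorithm or a numeric value for a.

```python
import numpy as np

def freqt(c, order_out, a):
    """Warp cepstrum c (orders 0..m) onto an all-pass-warped axis.

    Returns order_out + 1 coefficients. Calling it again with -a undoes the warp.
    """
    c = np.asarray(c, dtype=float)
    g = np.zeros(order_out + 1)
    b = 1.0 - a * a
    for ci in c[::-1]:                 # consume c[m], c[m-1], ..., c[0]
        d = g.copy()                   # coefficients from the previous step
        g[0] = ci + a * d[0]
        if order_out >= 1:
            g[1] = b * d[0] + a * d[1]
        for j in range(2, order_out + 1):
            g[j] = d[j - 1] + a * (d[j] - g[j - 1])
    return g

def mel_smooth(c, n, a):
    """Mel-warping (m, n, a) followed by mel-warping (n, m, -a)."""
    m = len(c) - 1
    mel_cep = freqt(c, n, a)           # first warping, truncated to order n
    return freqt(mel_cep, m, -a)       # inverse warping back to order m

# Usage: with n < m the round trip discards detail, which smooths the spectrum.
c = np.array([0.3, -0.2, 0.1, 0.05, 0.02, -0.01])
smoothed = mel_smooth(c, n=3, a=0.42)
```

If n is made very large, the round trip reproduces the input cepstrum, which is one way to check that the two warpings are inverses of each other.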

Cepstrum filter unit 40 further includes an FFT processing unit 78 for performing FFT processing on the output of second mel-warping unit 76, an exponential conversion unit 80 for applying an exponential conversion to the output of FFT processing unit 78, and an IFFT processing unit 82 for performing IFFT processing on the output of exponential conversion unit 80 and outputting the filter parameters. The output of IFFT processing unit 82 corresponds to the PSD output by division unit 38, smoothed by the mel-warping and its inverse performed by first mel-warping unit 74 and second mel-warping unit 76.

Cepstrum filter unit 40 further includes an FIR 84 that is set by the filter parameters output by IFFT processing unit 82 and filters source speech 30, thereby correcting the frequency characteristics of source speech 30 and outputting it as speech 42 whose frequency characteristics are substantially the same as those of target speech 32.
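The last three stages (FFT, exponential, IFFT) followed by FIR filtering can be sketched as follows. The sketch assumes the cepstrum is the symmetric real cepstrum of a log magnitude, so the resulting impulse response is zero-phase. It is then truncated and centered to form a linear-phase FIR. The names `fir_from_cepstrum` and `equalize`, the FFT length, and the number of taps are hypothetical choices.

```python
import numpy as np

def fir_from_cepstrum(cep, fft_len=1024, num_taps=257):
    """Turn a (smoothed) difference cepstrum into symmetric FIR taps."""
    cep = np.asarray(cep, dtype=float)
    m = len(cep) - 1
    sym = np.zeros(fft_len)
    sym[0] = cep[0]
    if m > 0:
        sym[1:m + 1] = cep[1:]
        sym[-m:] = cep[:0:-1]              # mirror so the FFT is real
    log_mag = np.fft.fft(sym).real         # FFT: cepstrum -> log magnitude
    mag = np.exp(log_mag)                  # exponential: log magnitude -> magnitude
    h = np.fft.ifft(mag).real              # IFFT: zero-phase impulse response
    half = num_taps // 2
    return np.concatenate([h[-half:], h[:half + 1]])  # center the response

def equalize(source, taps):
    """Apply the FIR to the source speech without shifting it in time."""
    return np.convolve(source, taps, mode="same")

# Usage: a cepstrum holding only c[0] = log(2) yields a plain gain of 2.
x = np.random.default_rng(0).standard_normal(1000)
y = equalize(x, fir_from_cepstrum(np.array([np.log(2.0)])))
```

Truncating the impulse response to `num_taps` samples is a simplification; a tapered window could be applied to the taps to reduce truncation ripple.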

Channel equalizer 20 operates as follows. Referring to FIG. 2, source PSD 100 is obtained from waveform 90 of source speech 30 by speech frame detection unit 50, windowing unit 52, power spectrum calculation unit 54, and frame averaging unit 56 shown in FIG. 1. Similarly, target PSD 102 is obtained from waveform 92 of target speech 32 by speech frame detection unit 60, windowing unit 62, power spectrum calculation unit 64, and frame averaging unit 66. Division unit 38 divides the latter by the former to obtain PSD difference 110.

Processing PSD difference 110 in logarithm calculation unit 70, IFFT processing unit 72, and first mel-warping unit 74 yields mel-cepstrum 120. Processing mel-cepstrum 120 in second mel-warping unit 76, FFT processing unit 78, and exponential conversion unit 80 yields PSD 130, a smoothed version of PSD difference 110. The filter parameters of FIR 84 are set by processing smoothed PSD 130 in IFFT processing unit 82. The frequency characteristics of speech 42, obtained by filtering source speech 30 with FIR 84 set in this way, are substantially equal to those of target speech 32.

The PSD difference is smoothed by first converting the cepstrum to the mel scale by mel-warping and then removing its high-order components through the inverse conversion. FIR 84 set by the filter parameters obtained in this way therefore has characteristics close to those of human hearing. Furthermore, unlike a filter based on LPC conversion, the characteristics of FIR 84 set in this way are not sensitive to the parameter order. Even when the filter parameters are calculated so as to increase the accuracy of the filter, the sound quality does not deteriorate, and the frequency characteristics of source speech 30 can be equalized with sound quality comparable to that of conventional channel equalizer 200.

[Second Embodiment]
In the apparatus of the first embodiment described above, filtering is performed by an FIR filter. However, the present invention is not limited to the use of an FIR filter. The configuration becomes simpler if a filter is used that can be set directly with the mel-cepstrum coefficients output by first mel-warping unit 74 shown in FIG. 1. FIG. 3 shows a block diagram of a channel equalizer 140 according to the second embodiment of the present invention, which uses an MLSA (mel-logarithmic spectral approximation) filter as such a filter.

In FIG. 3, components identical to those in FIGS. 1 and 4 are denoted by the same reference numerals. Their functions are also the same, so their detailed description is not repeated here.

Referring to FIG. 3, channel equalizer 140 according to the second embodiment differs from channel equalizer 20 shown in FIG. 1 in that, in place of cepstrum filter unit 40 shown in FIG. 1, it includes an MLSA filter unit 150 containing an MLSA filter. MLSA filter unit 150 differs from cepstrum filter unit 40 of FIG. 1 in that, in place of second mel-warping unit 76, FFT processing unit 78, exponential conversion unit 80, and IFFT processing unit 82 of FIG. 1, it includes an MLSA filter 160 whose filter parameters are set directly by the mel-cepstrum output from first mel-warping unit 74.

The operation of channel equalizer 140 according to the second embodiment is the same as that of channel equalizer 20 of the first embodiment, except that MLSA filter 160 is set by the output of first mel-warping unit 74.

Channel equalizer 140 provides the same effects as channel equalizer 20 of the first embodiment. In addition, MLSA filter 160 is a filter parameterized by a mel-cepstrum and can be set directly by the output of first mel-warping unit 74. Compared with channel equalizer 20 of the first embodiment, the various components for deriving FIR filter parameters from the mel-cepstrum are therefore unnecessary, and the circuit configuration is simpler.
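An MLSA filter approximates the transfer function whose logarithm is a mel-cepstral series on an all-pass-warped frequency axis. The sketch below does not implement the filter itself. It only evaluates the log-magnitude response that such a filter is meant to approximate, which can be useful for checking the mel-cepstrum before filtering. The function names, the sign convention for a, and the value 0.42 are assumptions for illustration.

```python
import numpy as np

def warped_frequency(omega, a):
    """Phase response of the first-order all-pass (z^-1 - a) / (1 - a z^-1)."""
    return omega + 2.0 * np.arctan(a * np.sin(omega) / (1.0 - a * np.cos(omega)))

def mel_cepstrum_log_magnitude(mel_cep, a, num_bins=513):
    """log|H(e^{jw})| = sum_k mel_cep[k] * cos(k * warped_frequency(w))."""
    mel_cep = np.asarray(mel_cep, dtype=float)
    omega = np.linspace(0.0, np.pi, num_bins)
    w = warped_frequency(omega, a)
    k = np.arange(len(mel_cep))
    return np.cos(np.outer(w, k)) @ mel_cep

# Usage: plot or inspect the target response for a given mel-cepstrum.
response_db = 20.0 / np.log(10.0) * mel_cepstrum_log_magnitude(
    np.array([0.1, 0.05, -0.02]), a=0.42)
```

For a > 0 the warped axis spends more of its range on low frequencies, which is why the representation is described as close to human hearing.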

The embodiments disclosed herein are merely examples, and the present invention is not limited to the embodiments described above. The scope of the present invention is indicated by each claim, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a block diagram of the channel equalizer according to the first embodiment.
FIG. 2 is a schematic diagram for explaining the operation of the channel equalizer according to the first embodiment.
FIG. 3 is a block diagram of the channel equalizer according to the second embodiment.
FIG. 4 is a block diagram of the channel equalizer according to the prior art.

Explanation of symbols

20, 140, 200: channel equalizer; 30: source speech; 32: target speech; 34: source PSD generation unit; 36: target PSD generation unit; 38: division unit; 40: cepstrum filter unit; 50, 60: speech frame detection unit; 52, 62: windowing unit; 54, 64: power spectrum calculation unit; 56, 66: frame averaging unit; 74: first mel-warping unit; 76: second mel-warping unit; 84: FIR; 150: MLSA filter unit; 160: MLSA filter; 210: LPC filter unit

Claims

An apparatus for equalizing frequency characteristics of speech, comprising:
means for calculating a power spectral density (PSD) difference between speech to be processed and reference speech;
means for converting the difference into a feature space of frequency characteristics represented by a cepstrum; and
predetermined filtering means for filtering the speech to be processed, whose filter parameter is set using the difference represented by the cepstrum.

The apparatus for equalizing frequency characteristics of speech according to claim 1, wherein the filtering means includes:
means for converting the cepstrum of the difference into a mel-cepstrum; and
an MLSA (mel-logarithmic spectral approximation) filter that takes the difference represented by the mel-cepstrum as its filter parameter and is connected so as to receive the speech to be processed as input.

The apparatus for equalizing frequency characteristics of speech, wherein the filtering means includes:
PSD difference smoothing means for performing a smoothing process on the PSD difference; and
an FIR (finite impulse response) filter whose filter parameter is set by the PSD smoothed by the PSD difference smoothing means and which is connected so as to receive the speech to be processed.

The apparatus for equalizing frequency characteristics of speech according to claim 3, further comprising means for converting the PSD difference into a cepstrum of the difference having a predetermined first order, wherein the PSD difference smoothing means includes:
first mel-warping means for performing, on the cepstrum of the difference, frequency-axis warping from the cepstrum of the first order to a mel-cepstrum having a second order smaller than the first order;
second mel-warping means for applying, to the output of the first mel-warping means, the inverse of the mel-warping performed by the first mel-warping means and outputting the inversely transformed cepstrum of the difference; and
means for outputting a smoothed PSD difference by converting the inversely transformed cepstrum of the difference output by the second mel-warping means into a spectrum, and supplying it to the FIR filter as the filter parameter.