JP2014232245A - Sound clarifying device, method, and program

Info

Publication number: JP2014232245A
Application number: JP2013113644A
Authority: JP
Inventors: Hosona Kamiyama (神山 歩相名); Hideyuki Mizuno (水野 秀之)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-05-30
Filing date: 2013-05-30
Publication date: 2014-12-11
Anticipated expiration: 2033-05-30
Also published as: JP6087731B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech clarification technique that generates clear speech even in a noisy environment while preventing degradation of the naturalness of the speech.
SOLUTION: A determination unit 3 uses phoneme information about an input speech s(t) to determine the type of phoneme corresponding to each frame constituting the speech s(t). A filter generation unit 11 generates, for each frame, an enhancement filter that corresponds to the determined phoneme type and enhances the speech spectrum S(i, f) of that frame. A filter unit 7 generates an enhanced spectrum S'(i, f) by enhancing the speech spectrum S(i, f) of each frame with the enhancement filter corresponding to the phoneme type determined for that frame. A spectrum synthesis unit 8 generates an emphasized speech s'(t), which is a time-domain signal, by applying an inverse Fourier transform to the generated enhanced spectrum.

Description

この発明は、雑音環境下において合成音声及び自然音声等の音声を明瞭化する技術に関する。 The present invention relates to a technique for clarifying speech such as synthesized speech and natural speech in a noisy environment.

In recent years, with the development and spread of speech synthesis technology, there are more opportunities to hear synthesized speech messages in various places. Synthesized speech is often heard not only in quiet places but also in noisy environments such as airports, station platforms, and shopping streets. In such noisy environments, synthesized speech becomes hard to hear. Raising the volume makes the speech easier to hear, but there is an upper limit to how far the volume can be raised; if it is raised too far, the speech is distorted and becomes harder to hear instead.

Clarification techniques that make speech easier to hear in noise have existed for both natural human speech and synthesized speech. In general, the frequency spectrum of a vowel has several peaks, which are called formants. Emphasizing the formants is known to clarify speech without raising the volume excessively, and ease of hearing has been improved by using an equalizer that emphasizes the power of the formants (see, for example, Non-Patent Document 1).

天野文雄,“らくらくホンにおける聞え支援機能について”，日本音響学会2012年春季研究発表,1-2-2,pp.1563-1564 Fumio Amano, “Hearing Support Function for Easy Phone”, Acoustical Society of Japan 2012 Spring Research Presentation, 1-2-2, pp.1563-1564

However, conventional speech clarification methods emphasize formants regardless of the type of phoneme, for both synthesized and natural speech. As a result, they also emphasize spectral peaks in sounds that have no formants, such as unvoiced consonants, which may degrade the naturalness of the speech. Moreover, even for vowels and voiced consonants, the power and position of the formants are important features for the naturalness of speech, so designing a filter that emphasizes the formants while keeping their power natural has required fine tuning.

この発明は、細かいチューニングを必要とせずに、音声の自然性の劣化を防ぎつつ、雑音下でも明瞭な音声を生成する音声明瞭化装置、方法及びプログラムを提供することを目的とする。 An object of the present invention is to provide a speech clarification device, method, and program for generating clear speech even under noise while preventing deterioration of speech naturalness without requiring fine tuning.

A speech clarification device according to one aspect of the present invention includes: a speech spectrum analysis unit that generates a speech spectrum S(i,f) of each frame by applying a Fourier transform to a speech s(t) for each frame of a fixed time length; a determination unit that uses phoneme information of the input speech s(t) to determine the type of phoneme corresponding to each frame constituting the speech s(t); a filter generation unit that generates, for each frame, an enhancement filter that corresponds to the determined phoneme type and enhances the speech spectrum S(i,f) of that frame; a filter unit that generates an enhanced spectrum S'(i,f) by enhancing the speech spectrum S(i,f) of each frame with the enhancement filter corresponding to the phoneme type determined for that frame; and a spectrum synthesis unit that generates an emphasized speech s'(t), which is a time-domain signal, by applying an inverse Fourier transform to the generated enhanced spectrum.

By clarifying speech separately for each phoneme type, and by emphasizing the formants while maintaining the power relationships among the formant bands of natural speech, speech can be clarified without fine tuning and without degrading its naturalness.

FIG. 1: Block diagram showing an example of the speech clarification device of the first embodiment.
FIG. 2: Flowchart showing an example of the speech clarification method.
FIG. 3: Diagram showing an example of the speech s(t).
FIG. 4: Diagram showing an example of the phoneme information.
FIG. 5: Flowchart showing an example of the processing of the speech spectrum analysis unit.
FIG. 6: Flowchart showing an example of the processing of the determination unit.
FIG. 7: Block diagram showing an example of the vowel filter generation unit.
FIG. 8: Diagram for explaining an example of formants.
FIG. 9: Diagram showing an example of the correlations between formant bands.
FIG. 10: Diagram showing an example of the vowel enhancement filter E₁(i,f).
FIG. 11: Block diagram showing an example of the voiced consonant filter generation unit.
FIG. 12: Diagram showing an example of the correlations between bands.
FIG. 13: Diagram showing an example of the voiced consonant enhancement filter E₂(i,f).
FIG. 14: Diagram showing an example of the correlations between bands.
FIG. 15: Flowchart showing an example of the processing of the spectrum synthesis unit.
FIG. 16: Block diagram showing an example of the speech clarification device of the second embodiment.

以下、図面を参照して、音声明瞭化装置及び方法の実施形態を説明する。 Hereinafter, embodiments of a speech clarification apparatus and method will be described with reference to the drawings.

[First embodiment]
As shown in FIG. 1, the speech clarification device of the first embodiment includes, for example, a speech synthesis unit 1, a speech spectrum analysis unit 2, a determination unit 3, a filter generation unit 11, a filter unit 7, a spectrum synthesis unit 8, and a noise spectrum analysis unit 9. The filter generation unit 11 includes, for example, a vowel filter generation unit 4, a voiced consonant filter generation unit 5, and an unvoiced consonant filter generation unit 6. The speech clarification method is realized by, for example, the speech clarification device performing the processing of each step in FIG. 2.

<Step S1>
The speech synthesis unit 1 takes the text to be synthesized as input and outputs a speech waveform s(t). It also outputs the phoneme information (ph(k), Ts(k), Te(k)), which is intermediate information produced when generating the speech waveform (step S1). That is, the speech synthesis unit 1 generates phoneme information from the input text, uses that phoneme information to generate synthesized speech corresponding to the input text, and takes it as the speech s(t). The phoneme information is information about the phonemes that make up the speech s(t).

The speech s(t) and the emphasized speech s'(t) described later are the amplitudes at sample time t (t = 0, 1, …, T−1), where f_s [Hz] is the sampling frequency of the speech. An example of the speech s(t) is shown in FIG. 3. The speech s(t) illustrated in FIG. 3 has f_s = 16000 and T = 20000, so it lasts T/f_s = 1.25 seconds.

このように、音声合成部１は、テキストを入力とし、入力したテキストに対応する音声s(t)を合成する。その際の音声合成は、例えば参考文献１のような公知の技術によって実現ができる。 As described above, the speech synthesizer 1 receives text as input and synthesizes speech s (t) corresponding to the input text. The voice synthesis at that time can be realized by a known technique such as Reference 1.

〔参考文献１〕益子貴史, 徳田恵一, 小林隆夫, 今井聖, “動的特徴を用いたHMMに基づく音声合成”, 電子情報通信学会論文誌, D-II, vol. J79-D-II, No.12, pp.2184-2190, 1996 [Reference 1] Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, Kiyoshi Imai, “HMM-based speech synthesis using dynamic features”, IEICE Transactions, D-II, vol. J79-D-II, No.12, pp.2184-2190, 1996

音声合成部１は、音声を合成する際には、音声の音の種類（音素）ph(k)と、音素の開始時間Ts(k)、終了時間Te(k)の情報を生成する。ph(k)は音素情報であり、k番目の音素の種別を予め定めたアルファベット情報で出力する。音素の開始時間、終了時間はサンプル数で例えば表す。図４にこの音素情報の例を示す。音声明瞭化装置は、効果的な明瞭化を実現するために、音素の種類によって音声の明瞭化のアルゴリズムを変える。このために、音声合成部１はこの音素情報ph(k), Ts(k), Te(k)も出力する。 When synthesizing speech, the speech synthesizer 1 generates information on the type of sound (phoneme) ph (k), phoneme start time Ts (k), and end time Te (k). ph (k) is phoneme information, and the type of the kth phoneme is output as predetermined alphabet information. The phoneme start time and end time are represented by the number of samples, for example. FIG. 4 shows an example of this phoneme information. The speech clarification device changes a speech clarification algorithm depending on the type of phoneme in order to achieve effective clarification. For this purpose, the speech synthesizer 1 also outputs this phoneme information ph (k), Ts (k), Te (k).

音声合成部１によって生成された音声s(t)は音声スペクトル分析部２に提供される。また、音素情報(ph(k),Ts(k),Te(k))は、判断部３に提供される。 The speech s (t) generated by the speech synthesizer 1 is provided to the speech spectrum analyzer 2. The phoneme information (ph (k), Ts (k), Te (k)) is provided to the determination unit 3.

<Step S2>
The speech spectrum analysis unit 2 takes the speech s(t) as input and outputs the speech spectrum S(i,f). It analyzes the speech s(t) at intervals of p samples and extracts the speech spectrum S(i,f) [dB]. Here p is the frame shift used for the analysis; for example, p = 0.032 × f_s (analysis every 32 ms).

i (i = 0, 1, …, [(T−1)/p]) is the analysis number (frame number) when analyzing at intervals of p samples, with t = ip + m (m = 0, 1, …, p−1). f (f = 0, 1, …, D−1) is the number (band number) of the frequency band from (f/N) × (f_s/2) [Hz] inclusive to ((f+1)/N) × (f_s/2) [Hz] exclusive. D is the FFT length; for example, D = 2^10. The speech spectrum S(i,f) is a complex number representing the short-time spectrum of the speech: |S(i,f)|² is the power of frequency number f in the i-th frame, and arg{S(i,f)} is its phase. FIG. 5 shows a flowchart of an example of the processing performed by the speech spectrum analysis unit 2, which performs, for example, the following calculation.

(1) フレーム番号i=0とする。 (1) Frame number i = 0.

(2) 全てのフレームについて計算を終えたとき、つまりi>[(T-1)/p]の場合終了する。 (2) When calculation is completed for all frames, that is, when i> [(T-1) / p], the process ends.

(3) 周波数番号f=0とする。 (3) Set frequency number f = 0.

(4) すべての周波数番号fについて計算を終えたとき、つまりf>D-1の場合(7)に進む。 (4) When calculation is completed for all frequency numbers f, that is, when f> D−1, proceed to (7).

(5) Cut out the speech s(t) using the window function w(p,f) to obtain the cut-out speech s₀(f):

s₀(f) = w(p,f) s(ip+f)

w(p,f) is the window function used in the frequency spectrum analysis; it cuts out the speech smoothly. Various window functions have been proposed; for example, a Hamming window can be used.

(6) f←f+1として(4)に戻る (6) Return to (4) as f ← f + 1

(7) 切り出した音声サンプルs₀(f)（f=0,1,…,D-1）に対して長さDの離散フーリエ変換を行い、音声スペクトルS(i,f)を求める。 (7) A discrete Fourier transform of length D is performed on the extracted speech sample s ₀ (f) (f = 0, 1,..., D−1) to obtain a speech spectrum S (i, f).

(8) i←i+1として(2)に戻る。 (8) Return to (2) as i ← i + 1.

このように、音声スペクトル分析部２は、音声s(t)を一定の時間長のフレームごとにフーリエ変換することにより、各フレームの音声スペクトルS(i,f)を生成する（ステップＳ２）。生成された音声スペクトルS(i,f)は、フィルタ生成部１１及びフィルタ部７に提供される。 In this way, the speech spectrum analysis unit 2 generates the speech spectrum S (i, f) of each frame by performing Fourier transform on the speech s (t) for each frame having a certain time length (step S2). The generated speech spectrum S (i, f) is provided to the filter generation unit 11 and the filter unit 7.
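The frame-by-frame analysis of step S2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `hamming`, `dft`, and `analyze` are hypothetical, the Hamming window uses the common 0.54/0.46 form (the source's own window equation is not reproduced in this text), samples past the end of s(t) are assumed to be zero, and a naive DFT stands in for an FFT.

```python
import cmath
import math


def hamming(length):
    """Common Hamming window (assumed form; the source equation is unavailable)."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def dft(x):
    """Naive O(D^2) discrete Fourier transform of a sequence of length D."""
    d = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / d) for n in range(d))
            for k in range(d)]


def analyze(s, p, d):
    """Return S[i][f]: the length-d DFT of the windowed frame starting at sample i*p."""
    window = hamming(d)
    num_frames = (len(s) - 1) // p + 1  # i = 0 .. floor((T-1)/p)
    spectra = []
    for i in range(num_frames):
        # Samples beyond the end of s are treated as zero (an assumption).
        frame = [window[n] * (s[i * p + n] if i * p + n < len(s) else 0.0)
                 for n in range(d)]
        spectra.append(dft(frame))
    return spectra


# Toy usage: a cosine with two cycles per 8 samples, frame shift 4, frame length 8.
s = [math.cos(2.0 * math.pi * 2 * n / 8) for n in range(16)]
spectra = analyze(s, 4, 8)
```

With the values quoted in the text (f_s = 16000, T = 20000, p = 0.032 × f_s = 512), the frame index runs from 0 to [(T−1)/p] = 39, that is, 40 frames.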

The acoustic features that correlate with ease of hearing differ depending on the type of phoneme. For example, Reference 2 shows that the features correlated with ease of hearing differ among vowels, voiced consonants, and unvoiced consonants. The present technique improves ease of hearing while preventing degradation of naturalness by switching the clarification processing for each phoneme type determined by the determination unit 3, for example vowels, voiced consonants, and unvoiced consonants.

〔参考文献２〕神山歩相名, 井島勇祐, 磯貝光昭, 水野秀之, “雑音重畳音声の聴き取りやすさと音響特徴量の関係の分析”, 信学技報, vol.112, no.81, SP2012-46, pp.69-74, 2012 [Reference 2] Ayami Kamiyama, Yusuke Ijima, Mitsuaki Isogai, Hideyuki Mizuno, “Analysis of Relationship between Ease of Listening of Noise Superimposed Speech and Acoustic Features”, IEICE Technical Report, vol.112, no.81, SP2012-46, pp.69-74, 2012

<Step S3>
The noise spectrum analysis unit 9 takes noise n(t) as input and outputs the noise spectrum N(i,f). In other words, it generates the noise spectrum N(i,f) of each frame by applying a Fourier transform to the noise n(t) for each frame of a fixed time length (step S3). The noise n(t) is, for example, the noise in the place where the emphasized speech s'(t) is output.

The processing of the noise spectrum analysis unit 9 is the same as that of the speech spectrum analysis unit 2 described above, except that the input is the noise n(t) instead of the speech s(t), so a duplicate description is omitted here.

生成された雑音スペクトルN(i,f)は、フィルタ生成部１１に提供される。 The generated noise spectrum N (i, f) is provided to the filter generation unit 11.

<Step S4>
The determination unit 3 uses the phoneme information of the input speech s(t) to determine the type of phoneme corresponding to each frame constituting the speech s(t) (step S4). For example, the determination unit 3 takes the phoneme information (ph(k), Ts(k), Te(k)) as input and outputs vowel section information H_vo, voiced consonant section information H_vc, and unvoiced consonant section information H_uc. In other words, the determination unit 3 generates, for example, the vowel section information H_vo, the voiced consonant section information H_vc, and the unvoiced consonant section information H_uc used to switch the clarification processing according to the phoneme information. FIG. 6 shows a flowchart of an example of the processing performed by the determination unit 3.

判断部３は、具体的には次のような処理を行う。まず、前処理として、k=0, 母音区間H_vo=φ、有声子音区間H_vc=φ、無声子音区間H_uc=φとする。 Specifically, the determination unit 3 performs the following processing. First, as preprocessing, k = 0, vowel interval H _vo = φ, voiced consonant interval H _vc = φ, and unvoiced consonant interval H _uc = φ.

(1) k>Kのとき終了する。 (1) Exit when k> K.

(2) i=[(Ts(k)-1)/p]とする。 (2) i = [(Ts (k) -1) / p].

(3) i>[(Te(k)-1)/p]のとき、k←k+1として(2)に戻る。 (3) If i> [(Te (k) -1) / p], return to (2) as k ← k + 1.

(4) ph(k)の種別によって、下記(i)-(iii)の処理を行うことにより明瞭化処理の切り替えを実現する。 (4) The clarification process is switched by performing the following processes (i) to (iii) depending on the type of ph (k).

(i) When ph(k) is a vowel, set H_vo ← H_vo ∪ {i}.

(ii) When ph(k) is a voiced consonant, set H_vc ← H_vc ∪ {i}.

(iii) When ph(k) is an unvoiced consonant, set H_uc ← H_uc ∪ {i}.

(5) i←i+1として(3)に戻る。 (5) Return to (3) as i ← i + 1.

In this way, the determination unit 3 uses the phoneme information to determine, for example, whether the phoneme corresponding to each frame constituting the speech s(t) is a vowel, a voiced consonant, or an unvoiced consonant (step S4). The vowel section information H_vo, the voiced consonant section information H_vc, and the unvoiced consonant section information H_uc, which indicate which of these each frame's phoneme is, are provided to the vowel filter generation unit 4, the voiced consonant filter generation unit 5, and the unvoiced consonant filter generation unit 6, respectively.
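The frame classification of step S4 can be illustrated as follows. This is a sketch under assumptions: the function name `classify_frames` and the three label sets are hypothetical (the text only says that ph(k) uses predetermined alphabet labels), labels outside the three sets, such as pauses, are simply skipped, and a start time of 0 is clamped to frame 0.

```python
# Hypothetical label sets; the actual phoneme alphabet is not given in the text.
VOWELS = {"a", "i", "u", "e", "o"}
VOICED_CONSONANTS = {"b", "d", "g", "m", "n", "r", "z"}
UNVOICED_CONSONANTS = {"k", "s", "t", "h", "p"}


def classify_frames(phonemes, p):
    """Map (ph, Ts, Te) triples, with times in samples, to frame-index sets.

    Returns (h_vo, h_vc, h_uc): frames judged to be vowel, voiced consonant,
    and unvoiced consonant frames for a frame shift of p samples.
    """
    h_vo, h_vc, h_uc = set(), set(), set()
    for label, ts, te in phonemes:
        first = max(ts - 1, 0) // p  # [(Ts(k)-1)/p], clamped at frame 0
        last = max(te - 1, 0) // p   # [(Te(k)-1)/p]
        for i in range(first, last + 1):
            if label in VOWELS:
                h_vo.add(i)
            elif label in VOICED_CONSONANTS:
                h_vc.add(i)
            elif label in UNVOICED_CONSONANTS:
                h_uc.add(i)
    return h_vo, h_vc, h_uc


# Toy usage: /k/, /a/, /n/ at a frame shift of 512 samples.
h_vo, h_vc, h_uc = classify_frames(
    [("k", 0, 800), ("a", 800, 2400), ("n", 2400, 3200)], 512)
```

With the frame bounds computed this way, a frame that straddles a phoneme boundary (frames 1 and 4 in the toy input) is placed in both neighboring sets.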

<Step S5 to Step S7>
The filter generation unit 11 generates, for each frame, an enhancement filter that corresponds to the phoneme type determined by the determination unit 3 and enhances the speech spectrum S(i,f) of that frame (steps S5 to S7). The generated enhancement filter is provided to the filter unit 7.

As described later, the enhancement filter that the filter generation unit 11 generates for each phoneme type emphasizes the spectrum while maintaining the relative power relationships among bands, based on a normal distribution over the band powers.

この例では、判断部３は、音素が母音、有声子音、無声子音の何れの種類であるかを判断している。そこで、この例では、フィルタ生成部１１は母音フィルタ生成部４、有声子音フィルタ生成部５及び無声子音フィルタ生成部６を含んでおり、音素が母音であると判断された場合には母音フィルタ生成部４が以下に述べるステップＳ５の処理を行い、音素が有声子音であると判断された場合には有声子音フィルタ生成部５が以下に述べるステップＳ６の処理を行い、音素が無声子音であると判断された場合には無声子音フィルタ生成部６が以下に述べるステップＳ７の処理を行う。 In this example, the determination unit 3 determines whether the phoneme is a vowel, a voiced consonant, or an unvoiced consonant. Therefore, in this example, the filter generation unit 11 includes a vowel filter generation unit 4, a voiced consonant filter generation unit 5, and an unvoiced consonant filter generation unit 6. If it is determined that the phoneme is a vowel, the vowel filter generation is performed. When the unit 4 performs the process of step S5 described below and it is determined that the phoneme is a voiced consonant, the voiced consonant filter generation unit 5 performs the process of step S6 described below, and the phoneme is an unvoiced consonant. If it is determined, the unvoiced consonant filter generation unit 6 performs the process of step S7 described below.

<Step S5>
The vowel filter generation unit 4 takes the speech s(t), the speech spectrum S(i,f), the noise spectrum N(i,f), and the vowel section information H_vo as input, and outputs the vowel enhancement filter E₁(i,f).

The vowel filter generation unit 4 generates a filter E₁(i,f) that clarifies vowels for the frequency spectrum S(i,f) analyzed by the speech spectrum analysis unit 2. Vowels can be clarified in various ways; one example is the functional configuration shown in FIG. 7.

この例では、母音フィルタ生成部４は、フォルマント抽出部４１と強調フィルタ生成部４２とを備えている。 In this example, the vowel filter generation unit 4 includes a formant extraction unit 41 and an enhancement filter generation unit 42.

フォルマント抽出部４１は、音声s(t)を入力として、フォルマント情報F(i,j)を出力する。フォルマント抽出部４１は、参考文献３等の公知の方法によって実現される。 The formant extraction unit 41 receives the speech s (t) and outputs formant information F (i, j). The formant extraction unit 41 is realized by a known method such as Reference 3.

〔参考文献３〕大塚貴弘，“音源パルス列を考慮した頑健なARX音声分析法”，日本音響学会誌，58巻，7号，pp.386-397, 2002.7 [Reference 3] Takahiro Otsuka, “Robust ARX Speech Analysis Method Considering Source Pulse Train”, Journal of the Acoustical Society of Japan, Vol.58, No.7, pp.386-397, 2002.7

A formant is a peak in the speech spectrum, as shown in FIG. 8; formants are numbered from the lowest frequency as the first formant, the second formant, and so on. The positions of the formants on the frequency axis characterize the phonemic identity of the speech and the identity of the speaker. The formant extraction unit 41 extracts the formant frequencies F(i,j) [Hz] from the speech s(t) at intervals of p samples. i (i = 0, 1, …, [(T−1)/p]) is the analysis number (frame number), the same as in the speech spectrum analysis unit 2. j (j = 1, 2, …, J) is the formant number, and F(i,j) is the position of the j-th formant. J is the number of formants to extract and is about 4 or 5. When the i-th frame is a section without formants, such as an unvoiced or silent section, F(i,j) = 0 for each j (j = 1, 2, …, J).

The enhancement filter generation unit 42 takes the noise spectrum N(i,f), the speech spectrum S(i,f), and the vowel section information H_vo as input, and outputs a filter E₁(i,f) that makes vowels easier to hear in noise.

音声の母音の音韻性はフォルマントによって特徴づけられることが知られており、雑音下でもこのフォルマント周波数の位置の音声スペクトルがマスキングせずに聞き取れることが重要となる。一方、このフォルマントの位置及びパワーが自然な音声を特徴づけるため、自然な音声となるようなフォルマントを強調する強調フィルタを設計する必要がある。 It is known that the phonological properties of speech vowels are characterized by formants, and it is important that the speech spectrum at the position of this formant frequency can be heard without masking even under noise. On the other hand, since the position and power of this formant characterize a natural sound, it is necessary to design an emphasis filter that emphasizes the formant so that the sound becomes natural.

FIG. 9 shows correlation coefficients, computed across 20 speakers, of the average power per band number of the band containing the j-th formant, where the bands between formants were determined from the average power density of the 20 speakers. The power per band number of the band containing the j-th formant indicates the relative power of that formant. The average power density is defined by the expression P_d(i,f), and the relative power of the portion containing the j-th formant by B(j).

B(1)とB(3),B(4)、及びB(3)とB(2)は比較的強い相関があり、自然な音声となるためには、このフォルマントの相対的なパワーの相関を保ちながらフォルマントを強調する必要がある。また、フォルマントを強調することで音声のパワーが上がりすぎないことが望まれる。 B (1) and B (3), B (4), and B (3) and B (2) have a relatively strong correlation. It is necessary to emphasize formants while maintaining correlation. In addition, it is desirable that the sound power is not increased too much by emphasizing the formants.

Therefore, assuming that B(j) follows a normal distribution, the mean μ_B and covariance matrix Σ_B of B = (B(1), B(2), …, B(J−1))^T are obtained in advance and used as the average power distribution. That is, P(B) = N(B; μ_B, Σ_B). Then, because the speech-to-noise power ratio of the third formant correlates with ease of hearing, as also shown in Reference 2, a filter E₁(i,f) that emphasizes the j-th formant is generated as follows, with, for example, j = 3 based on Reference 2.

(1) Using the expressions Ps and Pn, obtain the speech-to-noise power ratio of the j-th formant portion, R = 10 log₁₀(Ps(j)/Pn(j)).

(2) When R is greater than a predetermined power ratio R', that is, when R > R', set E₁(i,f) = 1 for all i ∈ H_vo and f = 0, 1, …, D−1, and end.

(3) (*1)式を用いて、B=(B(1), B(2), …, B(J-1))^Tを求める。 (3) B = (B (1), B (2), ..., B (J-1)) ^T is obtained using the equation (* 1).

(4) (*2)式を用いてパワーPs=(Ps(1), Ps(2), …, Ps(J-1))^Tを求める。 (4) The power Ps = (Ps (1), Ps (2),..., Ps (J-1)) ^T is obtained using the expression (* 2).

(5) Obtain the total power excluding the j-th, M' = Σ_{i=1}^{J} Ps(i) − Ps(j).

(6) 目標平均パワーB’を例えば、下記(i)-(v)によって求める。 (6) The target average power B ′ is obtained by the following (i)-(v), for example.

(i) A=Ps./B=(Ps(1)/ B(1), Ps(2)/ B(2), …, Ps(J-1)/ B(J-1))^Tとする。 (i) A = Ps. / B = (Ps (1) / B (1), Ps (2) / B (2),…, Ps (J-1) / B (J-1)) ^T .

(ii) 第jフォルマントの目標とする平均パワーB’(j)＝10^((R-R’)/20)B(j)とする。 (ii) The target average power B ′ (j) = 10 ^ ((R−R ′) / 20) B (j) for the j-th formant.

(iii) Using the mean μ_B = {μ_n} and covariance matrix Σ_B = {σ_nm} of B, obtain the mean μ'_B and covariance matrix Σ'_B of the conditional normal distribution given B'(j), P(B | B'(j)) = N(B | B'(j); μ_B, Σ_B) = N(B'; μ'_B, Σ'_B). Deriving μ'_B and Σ'_B uses the mean vector μ_B1 obtained by removing the j-th element from the mean μ_B of B, the matrix Σ₁₁ obtained by removing the j-th row and j-th column from the covariance matrix Σ_B of B, the vector Σ₁₂ consisting of the j-th column without its j-th row, the vector Σ₂₁ consisting of the j-th row without its j-th column, and the element Σ₂₂ in the j-th row and j-th column.

上記を用いて、以下のようにμ'_B,Σ'_Bを求める。 Using the above, μ ′ _B and Σ ′ _B are obtained as follows.

(iv) Aからj番目の要素を取り除いたベクトルA’を求める。 (iv) A vector A ′ obtained by removing the j-th element from A is obtained.

(v) 残りの目標平均パワーB’を下記式により求める。 (v) The remaining target average power B 'is obtained by the following equation.

and let B' = (B'(1), …, B'(J−1)).

(7)各フォルマント帯域毎のゲイン幅G=B’./ Bを求める。 (7) Obtain a gain width G = B ′ ./ B for each formant band.

(8) 各フォルマント帯域のパワーをG倍する母音強調フィルタE₁(i,f)を生成する。フィルタは様々なものを構築することができるが、例えば図１０のようにフォルマント部分をg(j)、各フォルマント間の中点を1とする母音強調フィルタE₁(i,f)を構成し、以下の関係を満たすフィルタを構成することができる。 (8) Generate a vowel enhancement filter E ₁ (i, f) that multiplies the power of each formant band by G. Various filters can be constructed. For example, as shown in FIG. 10, a vowel emphasis filter E ₁ (i, f) is formed with g (j) as the formant part and 1 as the midpoint between each formant. A filter satisfying the following relationship can be configured.

The target powers are obtained so that the total of the newly obtained target powers is the same as the total before the control. In addition, the new powers are simply the B' that maximizes the conditional probability distribution P(B | B'(j)) of B, so the new average power B' and Ps(j)G(j) are obtained while the correlations among the elements of B are preserved. This example, following Reference 2, emphasizes the third formant with j = 3, but the second and fourth formants may be controlled in the same way.
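The conditioning in steps (6)(iii) to (v) and the gain in step (7) can be illustrated with the textbook identity for a multivariate normal: given that one component is fixed, the most probable value of the remaining components is the conditional mean μ_B1 + Σ₁₂ Σ₂₂⁻¹ (B'(j) − μ_j). The sketch below assumes this identity is what the omitted equations express; the function names `condition_on_band` and `band_gains` are hypothetical, indices are 0-based, and the additional rescaling that the source performs with A' and M' is not reproduced because its equation is unavailable in this text.

```python
def condition_on_band(mu, sigma, j, target):
    """Conditional mean of the other bands given that band j equals `target`.

    mu: list of band means; sigma: covariance matrix as a list of lists;
    j: 0-based index of the fixed band. Because only one band is fixed,
    Sigma_22 is a scalar and no matrix inversion is needed.
    """
    scale = (target - mu[j]) / sigma[j][j]
    return [mu[n] + sigma[n][j] * scale for n in range(len(mu)) if n != j]


def band_gains(b, b_target):
    """Element-wise gain G = B' ./ B."""
    return [bt / bb for bt, bb in zip(b_target, b)]


# Toy usage with three bands; the middle band is raised from its mean 2.0 to 3.0.
mu = [1.0, 2.0, 3.0]
sigma = [[1.0, 0.5, 0.0],
         [0.5, 2.0, 0.4],
         [0.0, 0.4, 1.0]]
others = condition_on_band(mu, sigma, 1, 3.0)
gains = band_gains([1.0, 2.0], [1.5, 2.0])
```

Bands that are positively correlated with the fixed band are pulled upward along with it, while an uncorrelated band would keep its mean; this is the sense in which the correlations among the elements of B are preserved.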

In this way, when the phoneme corresponding to a frame i is a vowel, the vowel filter generation unit 4 generates a vowel enhancement filter E₁(i,f) for enhancing the vowel in the speech spectrum S(i,f) of that frame i (step S5).

The vowel enhancement filter E₁(i,f) generated in this way can be regarded as a filter that emphasizes vowels while maintaining the relationship among the relative powers of the bands containing each formant, under the assumption that those relative powers follow a normal distribution. The generated vowel enhancement filter E₁(i,f) is provided to the filter unit 7.

<Step S6>
The voiced consonant filter generation unit 5 receives the noise spectrum N(i,f), the speech spectrum S(i,f), and the voiced consonant interval information H_vc as inputs, and outputs the voiced consonant enhancement filter E₂(i,f).

The voiced consonant filter generation unit 5 generates a filter E₂(i,f) that clarifies voiced consonants in the frequency spectrum S(i,f) analyzed by the speech spectrum analysis unit 2. Voiced consonants also have formants, but some of them, such as the moraic nasal, have fewer formants than vowels. For voiced consonants, therefore, one approach is to use specific frequency bands instead of formant bands and to control the other bands based on their correlation, in the same way as the vowel filter generation unit, which improves how easily the speech can be heard. Using the bands M(k) and the average power C, the control may be carried out as in (1) to (8) below. This processing is performed by the voiced consonant enhancement unit 51 provided in the voiced consonant filter generation unit 5, as illustrated in FIG. 11.

FIG. 12 shows an example of the correlation among the band powers of voiced consonants, based on the speech data of 20 speakers. Here the number of bands is L=5 and the band boundaries M are M(0)=0, M(1)=D*(2000/fs), M(2)=D*(4000/fs), M(3)=D*(6000/fs), M(4)=D*(8000/fs), M(5)=D*(16000/fs) (bands bounded by 0 kHz, 1 kHz, 2 kHz, 4 kHz, 6 kHz, and 8 kHz, respectively). The relative power of band k is C(k), defined in the description of (3) below.
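The boundary formula M(k)=D*(value/fs) can be evaluated as in the following minimal sketch. It takes the values exactly as written in the text; the FFT length D=512, the sampling frequency fs=16000 Hz, the function name `band_edges`, and the rounding to integer bin indices are all assumptions made for illustration.

```python
def band_edges(D, fs, edge_values):
    """Return bin indices M(k) = D * (v / fs) for each value v, rounded to integers."""
    return [int(round(D * v / fs)) for v in edge_values]

# Values as they appear in the text: M(0) ... M(5).
EDGE_VALUES = [0, 2000, 4000, 6000, 8000, 16000]

# Hypothetical analysis settings, not taken from the source.
M = band_edges(D=512, fs=16000, edge_values=EDGE_VALUES)

# Band k (1-based, as in the text) covers bins M[k-1] to M[k].
bands = [(M[k - 1], M[k]) for k in range(1, len(M))]
```

With these assumed settings the five bands are contiguous and non-overlapping, which is what the later per-band power sums rely on.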

In the example of FIG. 12, C(1) and C(2), C(1) and C(3), and C(2) and C(3) are relatively strongly correlated. For the speech to sound natural, it must be emphasized while the correlation among the relative powers of these bands is maintained.

(1) Using Ps and Pn given by the following equations, obtain the speech-to-noise power ratio R=10log₁₀(Ps(m)/Pn(m)) for the m-th band (M(m-1) to M(m)).

(2) If R is greater than a predetermined power ratio R', that is, if R>R', set E₂(i,f)=1 for all i∈H_vc and f=0,1,…,D-1, and terminate.

(3) Obtain C=(C(1), C(2), …, C(L-1))^T using the following equation. L is the number of bands.

(4) Obtain the powers Ps=(Ps(1), Ps(2), …, Ps(L-1))^T using equation (*3).

(5) Obtain the total power excluding the m-th band, M'=Σ_{i=1}^{M} Ps(i) − Ps(m).

(6) Obtain the target average powers C', for example by (i) to (v) below.

(i) Let D=Ps./C=(Ps(1)/C(1), Ps(2)/C(2), …, Ps(L-1)/C(L-1))^T.

(ii) Set the target average power C'(m)=10^((R-R')/20) C(m).

(iii) Using the mean μ_C={μ_n} and the covariance matrix Σ_C={σ_nm} of C, obtain the mean μ'_C and the covariance matrix Σ'_C of the conditional normal distribution P(C|C'(m))=N(C|C'(m);μ_C,Σ_C)=N(C';μ'_C,Σ'_C) given C'(m). The derivation of μ'_C and Σ'_C uses the following quantities: the mean vector μ_c1 obtained by removing the m-th element from the mean μ_C of C; the matrix Σ₁₁ obtained by removing the m-th row and m-th column from the covariance matrix Σ_C of C; the vector Σ₁₂ formed by the m-th column without its m-th row; the vector Σ₂₁ formed by the m-th row without its m-th column; and the element Σ₂₂ in the m-th row and m-th column.

With these, μ'_C and Σ'_C are obtained as follows.

(iv) Obtain the vector D' by removing the m-th element from D.

(v) Obtain the remaining target average powers C' by the following equation.

Then set C'=(C'(1),…,C'(L-1)).

(7) Obtain the gain G=C'./C for each band.

(8) Generate the filter E₂(i,f), which multiplies the power of each band by G. Various filters can be constructed. For example, as shown in FIG. 13, a filter E₂(i,f) can be constructed that takes the value g(j) at the center of each band and 1 at the band boundaries, such that Σ_{f=M(m-1)}^{M(m)} |E₂(i,f)S(i,f)|² = Ps(j)G(j).

For example, the power ratio may be R'=0 [dB], the number of bands L=5 with boundaries M(0)=0, M(1)=D*(2000/fs), M(2)=D*(4000/fs), M(3)=D*(6000/fs), M(4)=D*(8000/fs), M(5)=D*(16000/fs) (bands bounded by 0 kHz, 1 kHz, 2 kHz, 4 kHz, 6 kHz, and 8 kHz, respectively), and the band in which the SN ratio is calculated may be m=2, following Reference 2.
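Steps (iii) and (7) above can be sketched as follows. The equations of steps (iii) and (v) are not reproduced in this text, so the sketch substitutes the standard conditioning formulas for a multivariate normal distribution with one fixed component, and it assumes that the remaining target powers are simply the conditional mean (the maximizer of the conditional distribution); any rescaling using D' or M' is omitted. The function names and the numerical values are hypothetical, and the band index `m` is 0-based here, unlike in the text.

```python
def condition_on_band(mu, sigma, m, target):
    """Mean and covariance of the other bands, given that band m equals target.

    mu:    list of band means
    sigma: covariance matrix as nested lists
    m:     0-based index of the band whose value is fixed
    """
    rest = [k for k in range(len(mu)) if k != m]
    s22 = sigma[m][m]  # variance of the fixed band (the scalar block)
    mu_c = [mu[k] + sigma[k][m] / s22 * (target - mu[m]) for k in rest]
    sigma_c = [[sigma[a][b] - sigma[a][m] * sigma[m][b] / s22 for b in rest]
               for a in rest]
    return mu_c, sigma_c


def band_gains(C, C_target):
    """Element-wise ratio G = C' ./ C."""
    return [t / c for t, c in zip(C_target, C)]


# Hypothetical statistics for three bands.
mu = [0.5, 0.3, 0.2]
sigma = [[0.04, 0.02, 0.00],
         [0.02, 0.09, 0.03],
         [0.00, 0.03, 0.16]]
C = [0.5, 0.3, 0.2]   # current relative powers
m, target = 1, 0.6    # raise band m to this target

mu_c, sigma_c = condition_on_band(mu, sigma, m, target)
C_target = mu_c[:m] + [target] + mu_c[m:]  # put the fixed band back in place
G = band_gains(C, C_target)
```

Because the fixed component is a single band, the inverse of Σ₂₂ reduces to a division, so no matrix inversion is needed. Bands that are correlated with band m move with it, and uncorrelated bands keep their mean.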

In this way, when the phoneme corresponding to a frame i is a voiced consonant, the voiced consonant filter generation unit 5 generates a voiced consonant enhancement filter E₂(i,f) for enhancing the voiced consonant in the speech spectrum S(i,f) of that frame i (step S6).

The voiced consonant enhancement filter E₂(i,f) generated in this way can be regarded as a filter that emphasizes voiced consonants while maintaining the relationship among the relative powers of the bands, under the assumption that those relative powers follow a normal distribution. The generated voiced consonant enhancement filter E₂(i,f) is provided to the filter unit 7.
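One possible shape for such a band filter, with a peak value at the center of each band and the value 1 at the band boundaries, is sketched below for a single frame. Linear interpolation between boundary and center is an assumption (FIG. 13 is not available here), and the peak values `g` are taken as given rather than solved from the band-power condition. The function name `triangular_band_filter` is hypothetical.

```python
def triangular_band_filter(M, g, D):
    """Build E[f] for one frame: 1 at every band edge, g[k] at the center of band k.

    M: band edge bin indices, e.g. [0, 4, 8]
    g: peak value for each band (len(M) - 1 entries), assumed known
    D: number of frequency bins
    """
    E = [1.0] * D
    for k in range(len(M) - 1):
        lo, hi = M[k], min(M[k + 1], D - 1)
        if hi <= lo:
            continue  # empty band, leave as pass-through
        center = (lo + hi) / 2.0
        half = (hi - lo) / 2.0
        for f in range(lo, hi + 1):
            w = 1.0 - abs(f - center) / half  # 1 at the center, 0 at the edges
            E[f] = 1.0 + (g[k] - 1.0) * w
    return E


# Two bands over nine bins: boost the first band, attenuate the second.
E_frame = triangular_band_filter(M=[0, 4, 8], g=[2.0, 0.5], D=9)
```

Keeping the filter at 1 on every boundary makes adjacent bands join without a jump, at the cost of the band power not scaling exactly by the square of the peak value.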

<Step S7>
The unvoiced consonant filter generation unit 6 receives the noise spectrum N(i,f), the speech spectrum S(i,f), and the unvoiced consonant interval information H_uc as inputs, and outputs the unvoiced consonant enhancement filter E₃(i,f).

The unvoiced consonant filter generation unit 6 emphasizes a specific band in the same way as the voiced consonant filter generation unit 5. Its functional configuration is the same as that of the voiced consonant filter generation unit 5, and the interval to be clarified is i∈H_uc. Unvoiced consonants contain many high-frequency components, and how easily they are heard is related to the power of the high-frequency part of the speech. It is therefore advisable to use R'=0 [dB], L=5 bands with boundaries M(0)=0, M(1)=D*(2000/fs), M(2)=D*(4000/fs), M(3)=D*(6000/fs), M(4)=D*(8000/fs), M(5)=D*(16000/fs) (bands bounded by 0 kHz, 1 kHz, 2 kHz, 4 kHz, 6 kHz, and 8 kHz, respectively), and m=5 as the band in which the SN ratio is calculated.

FIG. 14 shows an example of the correlation among the band powers of unvoiced consonants, based on the speech data of 20 speakers. The bands M are divided as in the example above with L=5. The relative power of band k is C(k), defined in the description of the voiced consonant enhancement unit 51.

In the example of FIG. 14, C(2) and C(3) are relatively strongly correlated. For the speech to sound natural, it must be emphasized while the correlation among the relative powers of these bands is maintained.

In this way, when the phoneme corresponding to a frame i is an unvoiced consonant, the unvoiced consonant filter generation unit 6 generates an unvoiced consonant enhancement filter E₃(i,f) for enhancing the unvoiced consonant in the speech spectrum S(i,f) of that frame i (step S7).

The unvoiced consonant enhancement filter E₃(i,f) generated in this way can be regarded as a filter that emphasizes unvoiced consonants while maintaining the relationship among the relative powers of the bands, under the assumption that those relative powers follow a normal distribution. The generated unvoiced consonant enhancement filter E₃(i,f) is provided to the filter unit 7.

<Step S8>
The filter unit 7 receives the speech spectrum S(i,f), the vowel enhancement filter E₁(i,f), the voiced consonant enhancement filter E₂(i,f), and the unvoiced consonant enhancement filter E₃(i,f) as inputs, and outputs the enhanced spectrum S'(i,f).

The filter unit 7 generates the enhanced spectrum by applying to the speech spectrum the filters generated by the vowel filter generation unit 4, the voiced consonant filter generation unit 5, and the unvoiced consonant filter generation unit 6. The filters can easily be merged by the following equation, and the enhanced spectrum is then calculated with the merged filter E as S'(i,f)=E(i,f)S(i,f).

To reduce the unnatural impression caused by discontinuities between the filters of different phonemes, the merged filter may be smoothed in the i direction (that is, the time direction).

In this way, the filter unit 7 generates the enhanced spectrum S'(i,f) for each frame i as follows (step S8). When the phoneme corresponding to frame i is a vowel, it enhances the speech spectrum S(i,f) of that frame using the vowel enhancement filter E₁(i,f) of that frame. When the phoneme is a voiced consonant, it uses the voiced consonant enhancement filter E₂(i,f) of that frame. When the phoneme is an unvoiced consonant, it uses the unvoiced consonant enhancement filter E₃(i,f) of that frame. The generated enhanced spectrum S'(i,f) is provided to the spectrum synthesis unit 8.
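Step S8 can be sketched as below. The merging equation is not reproduced in this text, so the sketch assumes that the merged filter E(i,f) equals E₁, E₂, or E₃ according to the phoneme type of frame i and equals 1 for any other frame. The type labels, the moving-average smoothing, and all function names are hypothetical choices made for illustration.

```python
def combine_filters(phoneme_type, E1, E2, E3, D):
    """Pick, per frame, the filter matching the phoneme type; all ones otherwise."""
    table = {"vowel": E1, "voiced": E2, "unvoiced": E3}
    E = []
    for i, t in enumerate(phoneme_type):
        src = table.get(t)
        E.append(list(src[i]) if src is not None else [1.0] * D)
    return E


def smooth_over_time(E, width=3):
    """Moving average along the frame index i (one possible smoothing)."""
    n, half = len(E), width // 2
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append([sum(E[j][f] for j in range(lo, hi)) / (hi - lo)
                    for f in range(len(E[i]))])
    return out


def apply_filter(E, S):
    """Enhanced spectrum S'(i,f) = E(i,f) * S(i,f), bin by bin."""
    return [[e * s for e, s in zip(Ei, Si)] for Ei, Si in zip(E, S)]


# Three frames, two bins each; the 9.0 entries are never selected.
types = ["vowel", "voiced", "silence"]
E1 = [[2.0, 2.0], [9.0, 9.0], [9.0, 9.0]]
E2 = [[9.0, 9.0], [3.0, 3.0], [9.0, 9.0]]
E3 = [[9.0, 9.0], [9.0, 9.0], [9.0, 9.0]]
S = [[1 + 1j, 1.0], [1.0, 2.0], [4.0, 4.0]]

E = combine_filters(types, E1, E2, E3, D=2)
S_enh = apply_filter(E, S)
E_smooth = smooth_over_time(E, width=3)
```

Since the filters are real-valued gains, multiplying them into the complex spectrum changes only the magnitude of each bin and leaves the phase untouched.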

<Step S9>
The spectrum synthesis unit 8 has the inverse input/output relationship to the speech spectrum analysis unit 2: it receives the enhanced spectrum S'(i,f) as input and outputs the enhanced speech s'(t). Specifically, the enhanced speech s'(t) is obtained, for example, as follows. FIG. 15 shows a flowchart of an example of the processing performed by the spectrum synthesis unit 8.

(1) Set the frame number i=0.

(2) Apply an inverse Fourier transform of length D to the formant-enhanced spectrum S'(i,f) to convert it into speech samples s₁(f) (f=0,1,…,D-1).

(3) Set the frequency number f=0.

(5) Add the obtained speech sample s₁(f) to the enhanced speech s'(t) by the following equation.

(6) Set f←f+1 and return to (4).

(7) Set i←i+1 and return to (2).

In this way, the spectrum synthesis unit 8 generates the enhanced speech s'(t), which is a time-domain signal, by applying an inverse Fourier transform to the generated enhanced spectrum (step S9).
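The synthesis loop of step S9 can be sketched as an inverse transform per frame followed by overlap-add. The equation that places each sample in s'(t) is not reproduced in this text, so the sketch assumes that sample n of frame i is added at position i*shift+n, where the frame shift `shift` is a hypothetical parameter, and it applies no synthesis window. The naive inverse DFT is used only to keep the example self-contained.

```python
import cmath


def inverse_dft(spectrum):
    """Naive inverse DFT of length D; returns the real part of each sample."""
    D = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * n / D)
                for f in range(D)).real / D
            for n in range(D)]


def overlap_add(spectra, shift):
    """Inverse-transform every frame and add it into the output at i * shift."""
    D = len(spectra[0])
    out = [0.0] * (shift * (len(spectra) - 1) + D)
    for i, spec in enumerate(spectra):
        frame = inverse_dft(spec)
        for n, x in enumerate(frame):
            out[i * shift + n] += x
    return out


# Spectrum of the time-domain frame [1, 2, 3, 4] (length-4 DFT).
frame_spectrum = [10, -2 + 2j, -2, -2 - 2j]

# Two identical frames, the second shifted by two samples.
s_enh = overlap_add([frame_spectrum, frame_spectrum], shift=2)
```

In practice a fast inverse FFT would replace `inverse_dft`, and the analysis window and frame shift would have to match those used by the speech spectrum analysis unit 2.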

In this way, by clarifying the speech separately for each phoneme type of the synthesized speech, and by emphasizing the formants while maintaining the power relationship among the formant regions of natural speech, the speech can be clarified without fine tuning and without degrading its naturalness.

[Second Embodiment]
The speech clarification device of the second embodiment includes, for example, a speech recognition unit 10 in place of the speech synthesis unit 1 of the first embodiment, as illustrated in FIG. 16. The speech recognition unit 10 receives the speech s(t) as input and outputs the phoneme information (ph(k),Ts(k),Te(k)).

In the first embodiment, clarification used the phoneme information (ph(k),Ts(k),Te(k)) generated by the speech synthesis unit 1 during speech synthesis. In the second embodiment, this phoneme information is instead extracted using the speech recognition unit 10 (step S1), and clarification is performed for each phoneme type. The speech recognition unit 10 can be realized using an existing speech recognition technique, for example the one described in Reference 4.

[Reference 4] P. C. Woodland, J. J. Odell, V. Valtchev & S. J. Young, "Large vocabulary continuous speech recognition using HTK", ICASSP, vol. 2, pp. II/125-II/128, 1994

The other processing of the second embodiment is the same as in the first embodiment, so its description is not repeated.

[Modification]
In the examples above, enhancement filters are generated and applied in all of the vowel intervals, voiced consonant intervals, and unvoiced consonant intervals, but it is sufficient to generate and apply an enhancement filter for the intervals of at least one type of phoneme. For example, an enhancement filter may be generated and applied only for vowel intervals. Likewise, the generation and application of an enhancement filter may be omitted for unvoiced consonant intervals.

In the examples above, phonemes are classified into vowels, voiced consonants, and unvoiced consonants, but this is only one example of a phoneme classification. Phonemes may be divided into other types. For example, for some of the phonemes that make up the vowels, a filter for emphasizing those phonemes may be generated and applied, and for other phonemes that make up the vowels, a separate filter for emphasizing those other phonemes may be generated and applied.

When generating the voiced consonant enhancement filter, the calculation described above may be performed only for bands that are relatively strongly correlated. For example, in the example shown in FIG. 12, C(1) and C(2), C(1) and C(3), and C(2) and C(3) are relatively strongly correlated, so the calculation for generating the voiced consonant enhancement filter may be performed only for the bands from k=0 to k=3, that is, the range from 0 kHz to 4 kHz.

Similarly, when generating the unvoiced consonant enhancement filter, the calculation described above may be performed only for bands that are relatively strongly correlated.

The processing described for the above device and method need not be executed in time series in the order described; it may also be executed in parallel or individually, depending on the processing capacity of the device that executes it or as needed.

When each process in the speech clarification device is realized by a computer, the processing content of the functions that the device should have is described by a program. Executing this program on a computer realizes each process on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

Each processing means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be realized in hardware.

Needless to say, other modifications may be made as appropriate without departing from the spirit of the present invention.

DESCRIPTION OF SYMBOLS
1 Speech synthesis unit
2 Speech spectrum analysis unit
3 Determination unit
4 Vowel filter generation unit
41 Formant extraction unit
42 Enhancement filter generation unit
5 Voiced consonant filter generation unit
51 Voiced consonant enhancement unit
6 Unvoiced consonant filter generation unit
7 Filter unit
8 Spectrum synthesis unit
9 Noise spectrum analysis unit
10 Speech recognition unit
11 Filter generation unit

Claims

A speech clarification device comprising:
a speech spectrum analysis unit that generates a speech spectrum S(i,f) of each frame by applying a Fourier transform to speech s(t) for each frame of a fixed time length;
a determination unit that uses phoneme information of the input speech s(t) to determine the type of phoneme corresponding to each frame constituting the speech s(t);
a filter generation unit that generates, according to the type of phoneme determined for each frame, an enhancement filter for enhancing the speech spectrum S(i,f) of that frame;
a filter unit that generates an enhanced spectrum S'(i,f) by enhancing the speech spectrum S(i,f) of each frame using the enhancement filter corresponding to the type of phoneme determined for that frame; and
a spectrum synthesis unit that generates enhanced speech s'(t), which is a time-domain signal, by applying an inverse Fourier transform to the generated enhanced spectrum.

The speech clarification device according to claim 1,
further comprising a speech synthesis unit that generates phoneme information from input text and uses it as the phoneme information, and that generates synthesized speech corresponding to the input text using the phoneme information and uses it as the speech s(t).

The speech clarification device according to claim 1,
further comprising a speech recognition unit that generates the phoneme information by recognizing the speech s(t).

The speech clarification device according to any one of claims 1 to 3,
wherein the enhancement filter corresponding to the type of phoneme is a filter that enhances while maintaining the relationship among the relative powers of the bands, based on a normal distribution of those powers.

The speech clarification device according to any one of claims 1 to 4,
wherein the determination unit uses the phoneme information of the input speech s(t) to determine whether the phoneme corresponding to each frame constituting the speech s(t) is a vowel, a voiced consonant, or an unvoiced consonant,
the filter generation unit includes: a vowel filter generation unit that, when the phoneme corresponding to a frame is a vowel, generates a vowel enhancement filter E₁(i,f) for enhancing the vowel in the speech spectrum S(i,f) of that frame; a voiced consonant filter generation unit that, when the phoneme corresponding to a frame is a voiced consonant, generates a voiced consonant enhancement filter E₂(i,f) for enhancing the voiced consonant in the speech spectrum S(i,f) of that frame; and an unvoiced consonant filter generation unit that, when the phoneme corresponding to a frame is an unvoiced consonant, generates an unvoiced consonant enhancement filter E₃(i,f) for enhancing the unvoiced consonant in the speech spectrum S(i,f) of that frame, and
the filter unit generates the enhanced spectrum S'(i,f) corresponding to each frame by enhancing the speech spectrum S(i,f) of that frame using the vowel enhancement filter E₁(i,f) when the phoneme corresponding to the frame is a vowel, using the voiced consonant enhancement filter E₂(i,f) when the phoneme is a voiced consonant, and using the unvoiced consonant enhancement filter E₃(i,f) when the phoneme is an unvoiced consonant.

A speech clarification method comprising:
a speech spectrum analysis step in which a speech spectrum analysis unit generates a speech spectrum S(i,f) of each frame by applying a Fourier transform to speech s(t) for each frame of a fixed time length;
a determination step in which a determination unit uses phoneme information of the input speech s(t) to determine the type of phoneme corresponding to each frame constituting the speech s(t);
a filter generation step in which a filter generation unit generates, according to the type of phoneme determined for each frame, an enhancement filter for enhancing the speech spectrum S(i,f) of that frame;
a filter step in which a filter unit generates an enhanced spectrum S'(i,f) by enhancing the speech spectrum S(i,f) of each frame using the enhancement filter corresponding to the type of phoneme determined for that frame; and
a spectrum synthesis step in which a spectrum synthesis unit generates enhanced speech s'(t), which is a time-domain signal, by applying an inverse Fourier transform to the generated enhanced spectrum.

A program for causing a computer to function as each unit of the speech clarification device according to any one of claims 1 to 5.