JP5667963B2

JP5667963B2 - Speech enhancement device, method and program thereof

Info

Publication number: JP5667963B2
Application number: JP2011245547A
Authority: JP
Inventors: 歩相名神山; 水野　秀之
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-11-09
Filing date: 2011-11-09
Publication date: 2015-02-12
Anticipated expiration: 2031-11-09
Also published as: JP2013101255A

Description

The present invention relates to a speech enhancement apparatus, a method thereof, and a program that make speech easier to hear in an environment with background noise.

In recent years, the development and spread of voice communication terminals and speech synthesis technology have increased the opportunities to listen to speech in a variety of places. Such listening often takes place not only in quiet places but also in noisy environments such as airports and station platforms. Ambient noise therefore makes the speech difficult to hear.

The simplest way to make speech easier to hear in a noisy environment is to raise the volume according to the noise. However, if the volume is raised too far, the input to the loudspeaker becomes excessive, the speech is distorted, and the sound quality may actually deteriorate. For this reason, methods that make speech easier to hear by emphasizing only specific bands of the frequency spectrum have been studied.

One such method is the known idea of improving speech intelligibility by emphasizing the formants, which are the peaks of the speech frequency spectrum (Patent Document 1). FIG. 12 illustrates the idea disclosed in Patent Document 1 as the relationship between speech power and frequency before and after enhancement.

The phonemic character of speech is known to be characterized by the positions of these formants, and emphasizing only the formant portions improves intelligibility without raising the volume excessively (see the post-enhancement characteristic in FIG. 12(b)).

Patent Document 1: Japanese Patent No. 4219898

The spectral power of speech (the density distribution of the frequency spectrum) is known to be closely related to voice quality, which listeners use to distinguish speakers by ear. The conventional method, however, emphasizes only the bands in which formants exist, which changes the spectral power and therefore the voice quality.

The present invention has been made in view of this problem, and its object is to provide a speech enhancement apparatus, and a method and program therefor, that improve speech intelligibility without changing the volume or the voice quality.

The speech enhancement apparatus of the present invention comprises a speech analysis unit, an aperiodicity index conversion unit, and a speech synthesis unit. The speech analysis unit receives a speech signal s(t), analyzes it at intervals of p samples, and outputs, for every p samples, a fundamental frequency f_0(i), an aperiodicity index A(i, f), and a spectral power P(i, f). The aperiodicity index conversion unit converts the aperiodicity index values A(i, f) in a predetermined frequency range F_L to F_H into a converted aperiodicity index A′(i, f) that decreases as frequency increases and, at frequencies above F_H, into the minimum value reached by that decreasing converted aperiodicity index A′(i, f), and outputs the result. The speech synthesis unit receives the fundamental frequency f_0(i), the spectral power P(i, f), and the converted aperiodicity index A′(i, f), and synthesizes the synthesized speech s′(t).

The speech enhancement apparatus of the present invention synthesizes speech using a converted aperiodicity index A′(i, f), obtained by decreasing the value of the aperiodicity index A(i, f) in a predetermined frequency range of the speech signal as frequency increases, and leaves the spectral power P(i, f) unchanged. It can therefore improve the intelligibility of the speech signal without changing the volume or the voice quality.

FIG. 1 is a diagram showing the relationship between the aperiodicity index A(i, f) and the speech intelligibility score.
FIG. 2 is a diagram showing an example of the functional configuration of the speech enhancement apparatus 100 of the present invention.
FIG. 3 is a diagram showing the operation flow of the speech enhancement apparatus 100.
FIG. 4 is a diagram showing an example of a speech waveform s(t).
FIG. 5 is a diagram showing an example of the fundamental frequency of the speech waveform s(t).
FIG. 6 is a diagram showing the fundamental frequency f_0(i) obtained by analyzing the speech waveform s(t) shown in FIG. 4.
FIG. 7 is a diagram showing an example of the conversion function E(f).
FIG. 8 is a diagram showing the operation flow of the conversion function defining means 21.
FIG. 9 is a diagram showing the operation flow of the adding means 22.
FIG. 10 is a diagram showing the converted aperiodicity index A′(i, f) obtained by analyzing the speech produced when the speech enhancement apparatus 100 enhances the speech waveform s(t) shown in FIG. 4.
FIG. 11 is a diagram showing the aperiodicity index A(i, f) obtained by analyzing the speech waveform s(t) shown in FIG. 4.
FIG. 12 is a diagram showing the speech enhancement idea disclosed in Patent Document 1.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in multiple drawings are given the same reference numerals, and their description is not repeated. Before the embodiments are described, the concept of the invention is explained.

[Concept of the invention]
The human voice is known to be a mixture of two sounds: one produced by the periodic vibration of the vocal cords, and one without periodic vibration, produced by turbulent airflow of the breath between the vocal cords and the lips and nostrils. The mixing ratio of these two sounds can be expressed by the aperiodicity index A(i, f) (reference: Hideki Kawahara, "High-quality VOCODER: STRAIGHT produced by auditory scene analysis," Journal of the Acoustical Society of Japan, Vol. 54, No. 7, pp. 521-526 (1998.7)).

The aperiodicity index A(i, f) is a feature that represents, for each band, the proportion of the aperiodic component when the frequency spectrum of speech is regarded as the sum of a periodic component (vocal cord vibration) and an aperiodic component (breath turbulence). With the aim of improving speech intelligibility, an experiment was conducted that focused on this aperiodicity index A(i, f) and evaluated how easily speech is heard in noise. White noise, crowd noise, and the sound of a passing train were each used separately as the noise environment, and the results were averaged to obtain the ease of hearing.

FIG. 1 shows the relationship between the aperiodicity index and the speech intelligibility score. The horizontal axis is the ease of hearing obtained by subjective evaluation, expressed as a five-level score, where 1 means the speech cannot be heard at all and 5 means all of it is heard clearly. The vertical axis is the aperiodicity index A(i, f) in [dB].

The ◆ markers, which plot the aperiodicity index in the 0 to 1 kHz range against the intelligibility score, show almost no correlation between the two. The ■, ▲, ×, and * markers, which plot the aperiodicity index at frequencies of 1 kHz and above against the intelligibility score, show a strong negative correlation. Table 1 lists the correlation coefficients.

The correlation coefficient is large in magnitude in the 1 to 8 kHz band. In other words, decreasing the aperiodicity index A(i, f) in the frequency range of 1 kHz and above makes speech easier to hear.

The present invention realizes a speech enhancement method and apparatus that improve speech intelligibility on the basis of this new finding.

FIG. 2 shows an example of the functional configuration of the speech enhancement apparatus 100 of the present invention, and FIG. 3 shows its operation flow. The speech enhancement apparatus 100 comprises a speech analysis unit 10, an aperiodicity index conversion unit 20, and a speech synthesis unit 30. The function of each unit is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute that program.

The speech analysis unit 10 receives the speech signal s(t), analyzes it at intervals of p samples, and outputs, for every p samples, the fundamental frequency f_0(i), the aperiodicity index A(i, f), and the spectral power P(i, f) (step S10). The speech analysis unit 10 consists of fundamental frequency analysis means 11, aperiodicity index analysis means 12, and spectral power analysis means 13. FIG. 4 shows an example of a speech signal s(t) sampled at a sampling frequency of 16 [kHz]. In FIG. 4 the horizontal axis is the sample time t and the vertical axis is the amplitude s(t).

Here i (i = 0, 1, …, [(T−1)/p], where T is the number of samples) is the analysis number (frame number) when the signal is analyzed at intervals of p samples, so that t = i·p. Likewise, f (f = 0, 1, …, N−1) is the number (band number) of the frequency band that extends from (f/N)·(f_s/2) [Hz] inclusive to ((f+1)/N)·(f_s/2) [Hz] exclusive when the range from 0 to the Nyquist frequency is divided into N bands. For example, when the sampling frequency f_s is 16 [kHz], so that the Nyquist frequency is 8 [kHz], and the range is divided into N = 512 bands, band number 0 covers 0 to 15.625 [Hz], band number 1 covers 15.625 to 31.25 [Hz], and the last band, band number 511, covers 7984.375 to 8000 [Hz].
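The indexing described above can be checked with a short Python sketch. The function names `band_edges_hz` and `num_frames` are hypothetical and chosen only for illustration; the formulas are the ones stated in the text, with N = 512 and f_s = 16 kHz as the defaults.

```python
def band_edges_hz(band, n_bands=512, fs_hz=16000.0):
    """Return the (low, high) edges in Hz of a band number f in 0..N-1.

    The band covers low <= frequency < high, where
    low = (f/N)*(fs/2) and high = ((f+1)/N)*(fs/2).
    """
    if not 0 <= band < n_bands:
        raise ValueError("band number must be in 0..n_bands-1")
    nyquist_hz = fs_hz / 2.0
    low = band / n_bands * nyquist_hz
    high = (band + 1) / n_bands * nyquist_hz
    return low, high


def num_frames(num_samples, hop_p):
    """Number of analysis frames i = 0..floor((T-1)/p) for T samples."""
    return (num_samples - 1) // hop_p + 1


print(band_edges_hz(0))    # first band
print(band_edges_hz(511))  # last band when N = 512
```

With these defaults the first band starts at 0 Hz and the last band (number 511) ends at the Nyquist frequency of 8000 Hz.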

The fundamental frequency f_0(i) is a feature representing the pitch of the voice; when the period of the speech waveform is τ_0 [seconds], its reciprocal 1/τ_0 [Hz] is the fundamental frequency. FIG. 5 shows an example of the fundamental frequency f_0(i) as seen when the speech waveform s(t) of FIG. 4 is expanded in the time direction. In FIG. 5 the horizontal axis is time [ms] and the vertical axis is the speech amplitude. FIG. 6 shows the fundamental frequency f_0(i) obtained by analyzing the speech signal s(t) of FIG. 4. In FIG. 6 the horizontal axis is the frame number i and the vertical axis is the fundamental frequency f_0(i) [Hz], which represents the pitch of the voice in each frame. The fundamental frequency f_0(i) in FIG. 6 is distributed over roughly 128 [Hz] to 230 [Hz].

The aperiodicity index A(i, f) represents, for each band, the proportion of the aperiodic component when the frequency spectrum is regarded as the sum of a periodic component and an aperiodic component. The spectral power P(i, f) represents the strength of the frequency spectrum in each band. The speech analysis unit 10 can be built with known techniques.

The aperiodicity index conversion unit 20 converts the aperiodicity index values A(i, f) in the predetermined frequency range F_L to F_H into a converted aperiodicity index A′(i, f) that decreases as frequency increases and, at frequencies above F_H, into the minimum value of that decreasing converted aperiodicity index, and outputs the result (step S20). Here the minimum converted aperiodicity index A′(i, f) is, for example, the minimum of a converted aperiodicity index A′(i, f) that decreases with a constant slope as frequency increases over the range F_L to F_H, that is, the value of the converted aperiodicity index A′(i, f) at the frequency F_H.

The aperiodicity index conversion unit 20 comprises conversion function defining means 21 and adding means 22.

The conversion function defining means 21 defines a conversion function E(f) that reduces the value of the aperiodicity index A(i, f) at the frequency f′ of each band number f. The band numbers f that are not less than N·F_L/(f_s/2) and less than N·F_H/(f_s/2) correspond to the predetermined frequency range F_L to F_H. The function is E(f) = 0 for f′ < F_L, E(f) = −γ{(f′ − F_L)/(F_H − F_L)} for F_L ≤ f′ ≤ F_H, and E(f) = −γ for F_H < f′. Here γ is the attenuation amount (γ > 0) and f′ is the actual frequency represented by band number f (Equation (1)).

f′ = (f/N)·(f_s/2)   (1)

Here f_s is the sampling frequency and N is the number of divisions of the frequency band. With F_L = 1000 [Hz] and F_H = 2000 [Hz], for example, the conversion function is E(f) = 0 when f′ < 1000 [Hz], E(f) = −γ{(f′ − 1000)/1000} when 1000 [Hz] ≤ f′ ≤ 2000 [Hz], and E(f) = −γ when f′ > 2000 [Hz].

FIG. 7 shows an example of the conversion function E(f). In FIG. 7 the horizontal axis is frequency and the vertical axis is the conversion function E(f) [dB].

The example in FIG. 7 defines the decreasing conversion function E(f) with, for example, F_L = 1000 [Hz] and F_H = 2000 [Hz]. The conversion function E(f) is defined, for example, by Equation (2).

E(f) = 0 (f′ < F_L); E(f) = −γ{(f′ − F_L)/(F_H − F_L)} (F_L ≤ f′ ≤ F_H); E(f) = −γ (F_H < f′)   (2)

The conversion function E(f) is negative and grows in absolute value as the frequency rises. Adding the value of this conversion function E(f) to the aperiodicity index A(i, f) therefore decreases the aperiodicity index A(i, f) as frequency increases.

The operation of the conversion function defining means 21 is described in more detail with reference to the operation flow in FIG. 8. The conversion function defining means 21 calculates the value of the conversion function E(f) from the actual frequency f′ indicated by each band number f, and does so for all band numbers f. The actual frequency f′ indicated by band number f is calculated by Equation (1) (step S211). When the frequency f′ is less than F_L [Hz] (Yes at step S212), the value of the conversion function is set to E(f) = 0 (step S217).

When the frequency f′ is not less than F_L [Hz] and not more than F_H [Hz] (step S213), the value of the conversion function E(f) is obtained from Equation (2) (step S215). When the frequency f′ is greater than F_H [Hz], the value of the conversion function is set to E(f) = −γ (step S216). Steps S211 to S216 are performed for all band numbers f (the loop of steps S210 to S217). This operation yields the values of the conversion function E(f) shown in FIG. 7. Based on the correlation shown in FIG. 1, the attenuation amount γ is preferably set to a value from 5 [dB] to 15 [dB].
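The flow of FIG. 8 can be sketched in Python as a loop over all band numbers. This is only an illustration under stated assumptions: the function name `conversion_table_db` is hypothetical, Equation (1) is assumed to map band number f to the lower band edge f′ = (f/N)·(f_s/2), and the defaults (N = 512, f_s = 16 kHz, F_L = 1000 Hz, F_H = 2000 Hz, γ = 15 dB) are the example values used in the text.

```python
def conversion_table_db(n_bands=512, fs_hz=16000.0,
                        f_low_hz=1000.0, f_high_hz=2000.0, gamma_db=15.0):
    """Return E(f) in dB for every band number f = 0..N-1."""
    table = []
    for band in range(n_bands):                    # loop over all band numbers
        # Assumed form of Equation (1): actual frequency of the band.
        freq_hz = band / n_bands * (fs_hz / 2.0)
        if freq_hz < f_low_hz:
            value = 0.0                            # below F_L: no change
        elif freq_hz <= f_high_hz:
            # Linear ramp from 0 dB at F_L down to -gamma at F_H.
            value = -gamma_db * (freq_hz - f_low_hz) / (f_high_hz - f_low_hz)
        else:
            value = -gamma_db                      # above F_H: constant
        table.append(value)
    return table


table = conversion_table_db()
print(table[96], table[128], table[511])
```

With the default values, band number 64 corresponds to 1000 Hz and band number 128 to 2000 Hz, so the ramp occupies band numbers 64 through 128.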

The adding means 22 adds the value of the conversion function E(f) calculated by the conversion function defining means 21 to the aperiodicity index A(i, f) output by the speech analysis unit 10 (step S22). FIG. 9 shows the operation flow of the adding means 22. The addition itself (step S222) is performed for every band number f (the loop over f) of every frame number i (the loop over i).
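The double loop of FIG. 9 amounts to A′(i, f) = A(i, f) + E(f). A minimal Python sketch follows; the function name `add_conversion` and the toy values are hypothetical and are not measurements from the patent.

```python
def add_conversion(aperiodicity_db, conversion_db):
    """Return A'(i, f) = A(i, f) + E(f) for every frame i and band f.

    aperiodicity_db: list of frames, each a list of per-band values in dB.
    conversion_db:   list of per-band E(f) values in dB (same for all frames).
    """
    converted = []
    for frame in aperiodicity_db:                  # loop over frame numbers i
        if len(frame) != len(conversion_db):
            raise ValueError("each frame needs one value per band")
        # loop over band numbers f
        converted.append([a + e for a, e in zip(frame, conversion_db)])
    return converted


# Toy input: 2 frames x 3 bands, values in dB chosen only for illustration.
a_db = [[-20.0, -10.0, -5.0], [-18.0, -8.0, -3.0]]
e_db = [0.0, -7.5, -15.0]
a_converted_db = add_conversion(a_db, e_db)
print(a_converted_db)
```

Because E(f) does not depend on the frame number, it can be computed once and reused for every frame.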

FIG. 10 shows the converted aperiodicity index A′(i, f) produced by the aperiodicity index conversion unit 20. In FIG. 10 the horizontal axis is the frame number i and the vertical axis is the band number f. FIG. 10 is properly a grayscale representation of magnitude, but for drawing convenience values of about −30 [dB] and below are shown in black. FIG. 10 is an example with the predetermined frequency range set to F_L = 1000 [Hz] and F_H = 2000 [Hz] and the attenuation amount set to γ = 15 [dB]. The converted aperiodicity index A′(i, f) can be seen to be small for band numbers f = 64 and above, where f = 64 corresponds to a frequency of 1 [kHz].

FIG. 11 shows the aperiodicity index A(i, f) before processing by the aperiodicity index conversion unit 20. The horizontal and vertical axes are the same as in FIG. 10. As FIG. 11 makes clear, before processing by the aperiodicity index conversion unit 20 the values for band numbers f = 64 and above are large.

The speech synthesis unit 30 synthesizes the synthesized speech from the converted aperiodicity index A′(i, f) together with the fundamental frequency f_0(i) and the spectral power P(i, f) output by the speech analysis unit 10. The new finding shown in FIG. 1 makes clear that decreasing the aperiodicity index A(i, f) with increasing frequency, in a predetermined range of the band at and above 1 [kHz], makes speech easier to hear. Speech enhanced by the speech enhancement apparatus 100 of the present invention is therefore easy to hear even in noise. In addition, because the spectral power is not changed, the voice quality of the speaker does not change.
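The data flow through the three units can be sketched as follows. This is an illustration only: `enhance`, `toy_analyze`, and `toy_synthesize` are hypothetical names, and the analysis and synthesis steps are passed in as callables because the text only says they can be built with known techniques. The toy stand-ins do no signal processing; they exist to show that f_0(i) and P(i, f) reach the synthesizer unchanged while only A(i, f) is modified.

```python
def enhance(signal, analyze, synthesize, conversion_db):
    """Analyze, convert the aperiodicity index, and resynthesize."""
    # Speech analysis unit: f0 per frame, A(i, f) in dB, P(i, f).
    f0, aperiodicity_db, spectral_power = analyze(signal)
    # Aperiodicity index conversion unit: A'(i, f) = A(i, f) + E(f).
    converted_db = [[a + e for a, e in zip(frame, conversion_db)]
                    for frame in aperiodicity_db]
    # Speech synthesis unit: f0 and P are passed through untouched.
    return synthesize(f0, spectral_power, converted_db)


def toy_analyze(signal):
    """Stand-in analyzer returning fixed illustrative values (2 frames, 2 bands)."""
    return ([120.0, 125.0],
            [[-10.0, -5.0], [-12.0, -6.0]],
            [[1.0, 2.0], [3.0, 4.0]])


def toy_synthesize(f0, spectral_power, aperiodicity_db):
    """Stand-in synthesizer that just records what it was given."""
    return {"f0": f0, "spectral_power": spectral_power,
            "aperiodicity_db": aperiodicity_db}


result = enhance([0.0] * 8, toy_analyze, toy_synthesize, [0.0, -15.0])
print(result["aperiodicity_db"])
```

In a real system the two callables would be replaced by an actual vocoder analysis and synthesis pair.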

The predetermined frequency range F_L to F_H has been described as 1000 [Hz] to 2000 [Hz], but a range in the approximate vicinity of these values may also be used. The description has also used an example in which the converted aperiodicity index A′(i, f) in that frequency range decreases at a constant rate, as in Equation (2), but the invention is not limited to this embodiment. For example, the value of the converted aperiodicity index A′(i, f) may be decreased in a staircase manner at discrete frequencies within the range F_L to F_H. The same effect is also obtained when the value of the converted aperiodicity index A′(i, f) is set so that it is inversely proportional to the increase in frequency.
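One possible reading of the staircase variant is sketched below in Python. The text does not specify the number or placement of the steps, so the function name `staircase_conversion_db`, the parameter `n_steps`, and the choice of equal-width steps are all assumptions made for illustration.

```python
import math


def staircase_conversion_db(freq_hz, f_low_hz=1000.0, f_high_hz=2000.0,
                            gamma_db=15.0, n_steps=4):
    """Staircase alternative to the linear ramp of Equation (2).

    Below F_L the value is 0 dB, at or above F_H it is -gamma, and in
    between it drops in n_steps equal-width, equal-height steps.
    """
    if freq_hz < f_low_hz:
        return 0.0
    if freq_hz >= f_high_hz:
        return -gamma_db
    fraction = (freq_hz - f_low_hz) / (f_high_hz - f_low_hz)
    step = math.floor(fraction * n_steps)   # 0 .. n_steps - 1
    return -gamma_db * step / n_steps


for freq in (900.0, 1250.0, 1600.0, 1999.0, 2000.0):
    print(freq, staircase_conversion_db(freq))
```

The resulting values can be added to A(i, f) per band in the same way as the linear conversion function.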

The speech enhancement apparatus 100 may be realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute that program.

In that case, the program describing the processing can be recorded on any computer-readable recording medium. Examples of computer-readable recording media include magnetic recording devices, optical discs, magneto-optical recording media, and semiconductor memories. More specifically, a hard disk device, flexible disk, or magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

Each means may be implemented by executing a predetermined program on a computer, or at least part of the processing may be implemented in hardware.

Claims

A speech enhancement apparatus comprising:
a speech analysis unit that receives a speech signal s(t), analyzes the speech signal at intervals of p samples, and outputs, for every p samples, a fundamental frequency f_0(i), an aperiodicity index A(i, f), and a spectral power P(i, f);
an aperiodicity index conversion unit that converts the aperiodicity index values A(i, f) in a predetermined frequency range F_L to F_H into a converted aperiodicity index A′(i, f) that decreases as frequency increases and, at frequencies above the predetermined frequency F_H, into the minimum converted aperiodicity index A′(i, f) of said decreasing converted aperiodicity index A′(i, f), and outputs the result; and
a speech synthesis unit that receives the fundamental frequency f_0(i), the spectral power P(i, f), and the converted aperiodicity index A′(i, f), and synthesizes synthesized speech s′(t).

The speech enhancement apparatus according to claim 1,
wherein the aperiodicity index conversion unit comprises:
conversion function defining means for defining a conversion function E(f) that reduces the value of the aperiodicity index A(i, f) at the frequency f′ of each band number f that is not less than N·F_L/(f_s/2) and less than N·F_H/(f_s/2) (f = 0, 1, …, N−1; f_s is the sampling frequency), in the predetermined frequency range F_L to F_H, according to the relationship E = −γ{(f′ − F_L)/(F_H − F_L)}, where γ is the attenuation amount; and
adding means for adding the value of the conversion function E(f) to the aperiodicity index A(i, f) output by the speech analysis unit.

The speech enhancement apparatus according to claim 1 or 2,
wherein the predetermined frequency range F_L to F_H is a range in which F_L = 1000 Hz or higher and F_H = 2000 Hz or lower.

A speech enhancement method comprising:
a speech analysis step of receiving a speech signal s(t), analyzing the speech signal at intervals of p samples, and outputting, for every p samples, a fundamental frequency f_0(i), an aperiodicity index A(i, f), and a spectral power P(i, f);
an aperiodicity index conversion step of converting the aperiodicity index values A(i, f) in a predetermined frequency range F_L to F_H into a converted aperiodicity index A′(i, f) that decreases as frequency increases and, at frequencies above the predetermined frequency F_H, into the minimum converted aperiodicity index A′(i, f) of said decreasing converted aperiodicity index A′(i, f), and outputting the result; and
a speech synthesis step of receiving the fundamental frequency f_0(i), the spectral power P(i, f), and the converted aperiodicity index A′(i, f), and synthesizing synthesized speech s′(t).

The speech enhancement method according to claim 4,
wherein the aperiodicity index conversion step comprises:
a conversion function defining step of defining a conversion function E(f) that reduces the value of the aperiodicity index A(i, f) at the frequency f′ of each band number f that is not less than N·F_L/(f_s/2) and less than N·F_H/(f_s/2) (f = 0, 1, …, N−1; f_s is the sampling frequency), in the predetermined frequency range F_L to F_H, according to the relationship E = −γ{(f′ − F_L)/(F_H − F_L)}, where γ is the attenuation amount; and
an adding step of adding the value of the conversion function E(f) to the aperiodicity index A(i, f) output by the speech analysis step.

The speech enhancement method according to claim 4 or 5,
wherein the predetermined frequency range F_L to F_H is a range in which F_L = 1000 Hz or higher and F_H = 2000 Hz or lower.

A program for causing a computer to function as the speech enhancement apparatus according to any one of claims 1 to 3.