JP5450298B2

JP5450298B2 - Voice detection device

Info

Publication number: JP5450298B2
Application number: JP2010163680A
Authority: JP
Inventors: 博秋河崎
Original assignee: Toa Corp
Current assignee: Toa Corp
Priority date: 2010-07-21
Filing date: 2010-07-21
Publication date: 2014-03-26
Anticipated expiration: 2030-07-21
Also published as: JP2012027114A

Description

The present invention relates to a voice detection device, and more particularly to a voice detection device that detects the voice component in an input signal in which a voice component and a noise component are mixed.

A conventional voice detection device of this type is, for example, the one applied to the voice-responsive switch disclosed in Patent Document 1. In this prior art, when the input level of the voice signal is at or above a predetermined value, at least the first formant F1 and the second formant F2 are extracted from the voice signal over a predetermined time. The change in vowels is then obtained from the extracted F1 and F2. When this change satisfies a predetermined condition, which includes starting with one of the two vowels "a" and "o", the voice signal is judged to match a preset control voice and the switch element is turned on.

Patent Document 1: Japanese Unexamined Patent Application Publication No. S61-246800 (JP-A-61-246800)

Thus, in the prior art described above, voice detection is based on the first formant F1 and the second formant F2, and the fact that extraction of the first formant F1 in particular is indispensable causes the following problem. In everyday environments there is a great deal of noise, road traffic noise being one example, that has high power in the frequency band around 1 kHz. This band around 1 kHz overlaps the frequency band of the first formant F1. The prior art, which requires extraction of the first formant F1, is therefore easily affected by everyday noise such as road traffic noise, and the environments in which it can be used are severely limited. Moreover, in the prior art each formant, including the first formant F1, is extracted on the basis of the signal level in each of a plurality of divided frequency bands, so when a noise component at or above a certain level is present in any one band, that noise component is erroneously detected as a formant. The prior art is thus susceptible not only to road traffic noise but to other noise as well.

Accordingly, an object of the present invention is to provide a voice detection device that is less susceptible than conventional devices to noise such as road traffic noise, and that is particularly suitable for detecting human screams, shouts, and the like in crime-prevention applications.

To achieve this object, the present invention is a voice detection device that detects the voice component in an input signal in which a voice component and a noise component are mixed, comprising: peak emphasizing means that emphasizes the peaks of the frequency spectrum of the input signal; noise estimation means that estimates, within the emphasized spectrum obtained after the peaks have been emphasized by the peak emphasizing means, the noise spectrum corresponding to the noise component; and subtraction means that subtracts the noise spectrum from the emphasized spectrum.

That is, the present invention focuses on the following observation: the frequency spectrum of an input signal in which a voice component and a noise component are mixed contains peaks of both the voice component and the noise component, and the peaks of the voice component differ in nature from those of the noise component. Based on this observation, the peaks of the frequency spectrum of the input signal are first emphasized by the peak emphasizing means, so that the peaks, including their nature, become more pronounced. The noise estimation means then estimates the noise spectrum, that is, the part of the emphasized spectrum that corresponds to the noise component, and the subtraction means subtracts this noise spectrum from the emphasized spectrum. As a result, the peaks of the noise component are removed from the emphasized spectrum, and only the peaks of the voice component, that is, the formants, remain. Voice detection is achieved by capturing these formant peaks.

In the present invention, the peak emphasizing means may include prediction means that predicts the current input signal from past input signals, and emphasis execution means that emphasizes the peaks by processing the input signal with the inverse of the operation performed by the prediction means. The prediction means thus predicts the periodic components contained in the input signal, that is, the formants, and the emphasis execution means, by processing the input signal with the inverse of that operation, emphasizes the formants contained in the input signal. The peaks of the noise component are emphasized along with the formants, but they are removed by the subtraction means as described above.

The prediction means can be implemented, for example, as a linear prediction error filter, and the emphasis execution means as the inverse filter of that linear prediction error filter.

In this case, both the linear prediction error filter serving as the prediction means and the inverse filter serving as the emphasis execution means are desirably lattice-type digital filters. Because the linear prediction error filter and the inverse filter are conjugate to each other, if one of them is designed as, for example, an FIR (Finite Impulse Response) filter, the other is necessarily an IIR (Infinite Impulse Response) filter. An IIR filter generally has the drawback that it can be unstable, or in other words that its stability is difficult to determine, but it is known that the lattice form overcomes this drawback. Suppose, for example, that the linear prediction error filter is designed as a lattice FIR filter and the inverse filter as a lattice IIR filter. The lattice FIR filter then offers better linear prediction performance, such as faster convergence, than filters of other structures such as the transversal type. In addition, the inverse filter is designed simply by applying the filter coefficients of the lattice FIR filter, unchanged, to the lattice IIR filter, so the inverse filter is easy to design.

Furthermore, the noise estimation means in the present invention may estimate the noise spectrum by time-averaging the emphasized spectrum. When the noise component is present in a substantially stationary manner, its peaks in the emphasized spectrum are largely unchanging. The peaks of the voice component, by contrast, are sporadic (intermittent), that is, they change over time. When the emphasized spectrum is time-averaged, therefore, only the peaks of the noise component remain and the peaks of the voice component are reduced overall. The noise spectrum is estimated in this way.
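As a rough illustration of this time-averaging idea, the following Python sketch averages a sequence of emphasized power spectra bin by bin and subtracts the result from one frame. The function names, the list-of-lists frame layout, and the clipping of negative results to a floor value are assumptions made for illustration; the source does not specify them.

```python
def estimate_noise_spectrum(frames):
    """Average a sequence of power spectra bin by bin (time average).

    `frames` is a list of spectra, each a list of per-bin power values.
    Stationary peaks survive the average; sporadic peaks are diluted.
    """
    n_frames = len(frames)
    n_bins = len(frames[0])
    return [sum(frame[k] for frame in frames) / n_frames for k in range(n_bins)]


def subtract_noise(spectrum, noise, floor=0.0):
    """Subtract the estimated noise spectrum; clip at `floor` (an assumption)."""
    return [max(s - n, floor) for s, n in zip(spectrum, noise)]


# Hypothetical data: bin 1 holds a stationary noise peak in every frame,
# bin 3 holds a voice peak that appears only in the last frame.
frames = [[1.0, 10.0, 1.0, 1.0]] * 3 + [[1.0, 10.0, 1.0, 12.0]]
noise = estimate_noise_spectrum(frames)
residual = subtract_noise(frames[-1], noise)
```

In this toy case the stationary peak in bin 1 cancels completely, while most of the intermittent peak in bin 3 survives the subtraction.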

If the input signal contains colored noise, its frequency spectrum has an overall tilt in which power is roughly inversely proportional to frequency. If the peaks of this spectrum were emphasized as is by the peak emphasizing means, various problems would arise, such as the tilt of the spectrum becoming steeper. For this reason, the present invention may further include flattening means that flattens the frequency spectrum of the input signal. The flattening means flattens the spectrum only to the extent that the peaks contained in it are not themselves flattened and keep their sharpness. The peak emphasizing means then emphasizes the peaks of the flattened spectrum produced by the flattening means.

Such flattening means can be implemented as a low-resolution filter whose frequency resolution is too low to follow the peaks contained in the frequency spectrum of the input signal, for example a digital filter with a relatively small number of taps (filter order).

As described above, the present invention focuses on the fact that the peaks of the voice component and the peaks of the noise component in the frequency spectrum of the input signal differ in nature: the peaks of the spectrum are emphasized, and the noise component peaks are then removed from the emphasized peaks so that only the voice component peaks are captured. In an environment where noise such as road traffic noise is present, the influence of that noise can thus be eliminated. Voice detection is therefore more accurate than with the prior art described above, which is easily affected by noise. This makes the invention particularly suitable for accurately detecting human screams, shouts, and the like in crime-prevention applications.

FIG. 1 is a block diagram showing the schematic configuration of an embodiment of the present invention.
FIG. 2 is a diagram showing the frequency spectrum of the input signal in the embodiment.
FIG. 3 is a block diagram showing a specific configuration of the linear prediction error filter in the embodiment.
FIG. 4 is a block diagram showing the linear prediction error filter combined with the inverse filter.
FIG. 5 is a block diagram showing a more specific configuration of the linear prediction error filter.
FIG. 6 is a block diagram showing a more specific configuration of the inverse filter in the embodiment.
FIG. 7 is a diagram explaining the need for the flattening circuit in the embodiment.
FIG. 8 is a diagram showing the frequency spectrum of the signal after processing by the flattening circuit.
FIG. 9 is a diagram showing the frequency spectrum of the signal after processing by the inverse filter in the embodiment.
FIG. 10 is a diagram showing a specific configuration of the spectral subtraction in the embodiment.
FIG. 11 is a diagram explaining the operation of the spectral subtraction.
FIG. 12 is a diagram showing an experimental result for the embodiment.
FIG. 13 is a diagram explaining the operation of the peak determination circuit in the embodiment.
FIG. 14 is a diagram showing another aspect of FIG. 13.

An embodiment of the present invention will be described with reference to FIGS. 1 to 14.

The voice detection device 10 according to this embodiment is applied to crime-prevention equipment such as a "super security light" and, specifically, detects a human scream, shout, or the like when it is picked up by a microphone provided in the equipment. To perform this detection, the voice detection device 10 has, as shown in FIG. 1, a flattening circuit 20 serving as the flattening means, and the output signal X(z) of a microphone (not shown) is input to the flattening circuit 20 (z is the variable of the z-transform).

The signal X(z) input to the flattening circuit 20 may contain noise components such as road traffic noise in addition to voice components such as the screams and shouts mentioned above. In that case, peaks corresponding to the voice component and to the noise component appear in the frequency spectrum of the input signal X(z), as indicated by α, β, and γ in FIG. 2. Of these, the peak α at the lowest frequency f is a noise component peak. The other peaks β and γ are voice component peaks, specifically the peaks of the second formant and the third formant in ascending order of frequency f. The dash-dot line in FIG. 2 is the average power of the input signal X(z) including the peaks α, β, and γ. In this embodiment, the frequency band of the discrete Fourier transform (DFT) used to obtain the frequency spectrum is limited to f = 1200 Hz to 3000 Hz. A first formant peak therefore also exists in reality, but it lies outside this band and does not appear in FIG. 2. The input signal X(z) also contains colored noise. As FIG. 2 shows, the frequency spectrum of the input signal X(z) containing this colored noise consequently has an overall tilt in which the power P is roughly inversely proportional to the frequency f.
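To make the band restriction concrete, the following Python sketch evaluates a DFT power spectrum only at the bins that fall within 1200 Hz to 3000 Hz. The function name, the direct (non-FFT) DFT evaluation, the power normalization, and the sampling rate used in the example are all assumptions for illustration; the text above gives only the band limits.

```python
import cmath
import math


def band_power_spectrum(samples, sample_rate, f_lo=1200.0, f_hi=3000.0):
    """Return (frequency, power) pairs for DFT bins inside [f_lo, f_hi].

    Bins outside the band are never computed, so a first-formant peak
    below f_lo cannot appear in the result.
    """
    n = len(samples)
    result = []
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if f_lo <= freq <= f_hi:
            acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                      for i, x in enumerate(samples))
            result.append((freq, abs(acc) ** 2 / n))
    return result


# Hypothetical example: a 2000 Hz tone sampled at 8000 Hz (assumed rate).
fs = 8000.0
tone = [math.sin(2 * math.pi * 2000.0 * i / fs) for i in range(80)]
spectrum = band_power_spectrum(tone, fs)
peak_freq = max(spectrum, key=lambda fp: fp[1])[0]
```

With 80 samples at the assumed 8000 Hz rate the bins are 100 Hz apart, so the band contains the 19 bins from 1200 Hz to 3000 Hz.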

Returning to FIG. 1, the flattening circuit 20 applies a flattening process, described later, to the input signal X(z). The flattened signal X'(z) is then input both to a linear prediction error filter (LPEF; Linear Prediction Error Filter) 30 serving as the prediction means and to the inverse filter (LPEF^-1) 40 of the linear prediction error filter 30.

As described later, the linear prediction error filter 30 predicts the current flattened signal X'(z) from past values of the flattened signal X'(z) and adapts so that the prediction error E(z) is minimized. In step with this adaptation, the inverse filter 40 conjugate to the linear prediction error filter 30 is formed, and the flattened signal X'(z) is processed by the inverse filter 40. This emphasizes the peaks α, β, and γ contained in the flattened signal X'(z). The peak emphasis is also described in detail later.

The emphasized signal W(z) obtained after peak emphasis by the inverse filter 40 is input to the spectral subtraction (SS; Spectrum Subtraction) 50. The spectral subtraction 50 estimates the noise component peak α contained in the emphasized signal W(z) and subtracts this peak α from the emphasized signal W(z). This produces a post-subtraction signal G(z) from which the noise component peak α has been removed and in which only the voice component peaks β and γ remain. The post-subtraction signal G(z) is input to the peak determination circuit 60. The operation by which the spectral subtraction 50 generates the post-subtraction signal G(z) is also described in detail later.

The peak determination circuit 60 determines whether the post-subtraction signal G(z) contains the voice component peaks β and γ. If it does, the circuit, for example, activates an alarm (not shown) provided in the crime-prevention equipment or sends a notification signal to a designated disaster prevention center. The procedure for this peak determination is also described in detail later.

Thus, in the voice detection device 10 of this embodiment, voice detection is achieved by a configuration comprising the flattening circuit 20, the linear prediction error filter 30, the inverse filter 40, the spectral subtraction 50, and the peak determination circuit 60. Each of these is described more specifically below.

First, the need for the linear prediction error filter 30 is explained. As noted above, the linear prediction error filter 30 predicts the current flattened signal X'(z) from past values of X'(z); as a result, the predictable components are cancelled and only the unpredictable components are output as the prediction error E(z). As shown in FIG. 3, for example, such a linear prediction error filter 30 consists of a one-sample delay element 302, an FIR adaptive filter 304, and an adder 306.

In the configuration of FIG. 3, suppose for the moment that the input signal X(z) described above is input directly, rather than the flattened signal X'(z). The input signal X(z) is delayed by the delay element 302 and then processed by the adaptive filter 304, and the resulting signal U(z) is input to the adder 306. The input signal X(z) is also input to the adder 306, which subtracts U(z) from X(z). The result E(z) is output as the prediction error (for the input signal X(z)), and the adaptive filter 304 adapts so that this prediction error E(z) is minimized.
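The FIG. 3 structure (one-sample delay, FIR adaptive filter, subtraction) can be sketched in Python as follows. The text does not say which adaptation algorithm the adaptive filter 304 uses; the normalized LMS (NLMS) update here, the function name, and the parameters `num_taps`, `mu`, and `eps` are assumptions chosen only to make the sketch runnable.

```python
import math


def prediction_error_filter(x, num_taps=4, mu=0.5, eps=1e-8):
    """Transversal linear prediction error filter with an NLMS update.

    Returns (errors, h): the prediction error sequence and the final
    tap weights. NLMS is an illustrative choice, not taken from the source.
    """
    h = [0.0] * num_taps      # adaptive FIR coefficients
    past = [0.0] * num_taps   # x[n-1], x[n-2], ..., x[n-num_taps]
    errors = []
    for sample in x:
        u = sum(hk * pk for hk, pk in zip(h, past))   # prediction from the past
        e = sample - u                                # prediction error
        norm = eps + sum(p * p for p in past)
        h = [hk + mu * e * pk / norm for hk, pk in zip(h, past)]
        past = [sample] + past[:-1]                   # one-sample delay line
        errors.append(e)
    return errors, h


# A pure sinusoid is fully predictable from its two previous samples,
# so the prediction error should decay toward zero as the filter adapts.
signal = [math.sin(0.2 * math.pi * n) for n in range(2000)]
errors, taps = prediction_error_filter(signal, num_taps=2)
```

The predictable (periodic) part of the input is cancelled at the output, which is the behaviour the paragraph above attributes to the linear prediction error filter 30.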

If the transfer function of the adaptive filter 304 is H(z), the signal U(z) processed by the adaptive filter 304 is given by the following Equation 1.

If the transfer function of the entire linear prediction error filter 30, including the transfer function H(z) of the adaptive filter 304, is L(z), then L(z) is given by the following Equation 2.
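The images for Equations 1 and 2 are not reproduced in this text. Based only on the FIG. 3 signal path described above (a one-sample delay, then the adaptive filter H(z), then subtraction from the input), they would presumably take the following form. This is a reconstruction under that assumption, not a transcription of the original equations.

$$U(z) = z^{-1} H(z)\, X(z)$$

$$L(z) = \frac{E(z)}{X(z)} = 1 - z^{-1} H(z)$$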

Further, if the number of taps of the adaptive filter 304 is N, its transfer function H(z) is given by the following Equation 3, where h_n is the filter coefficient of the n-th tap.

Rewriting Equation 3 in a more convenient form and substituting it into Equation 2 gives the transfer function L(z) of the entire linear prediction error filter 30 as the following Equation 4.
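The images for Equations 3 and 4 are likewise missing here. One plausible form, assuming the taps are indexed from n = 1 to N and that the overall filter is L(z) = 1 - z^{-1}H(z) as the FIG. 3 structure implies, is sketched below. The index convention is an assumption; the original may number the taps differently.

$$H(z) = \sum_{n=1}^{N} h_n z^{-(n-1)}$$

$$L(z) = 1 - z^{-1} H(z) = 1 - \sum_{n=1}^{N} h_n z^{-n}$$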

Speech, and voiced sound P(z) in particular, is expressed by the following Equation 5, where A(z) is the transfer function (resonance characteristic) of the entire vocal tract of the speaker producing the voiced sound and B(z) is the characteristic of the speaker's vocal cord vibration.

The characteristics of the voiced sound P(z) in Equation 5, in particular the formant characteristics of vowels, depend on the vocal tract transfer function A(z). If A(z) is represented by, for example, a finite-length all-pole model, it is expressed by the following Equation 6, where M is the number of taps of the all-pole model.
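The images for Equations 5 and 6 are not reproduced here either. Under the usual source-filter description of voiced speech, and writing the all-pole coefficients as a_m (a symbol introduced here for illustration, with a sign convention assumed to mirror that of the prediction error filter), they would presumably read:

$$P(z) = A(z)\, B(z)$$

$$A(z) = \frac{1}{1 - \sum_{m=1}^{M} a_m z^{-m}}$$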

Assuming that only the voiced sound P(z) is input as the input signal X(z), the prediction error E(z) is therefore given by the following Equation 7.

If, in addition, the adaptive filter 304 has a number of taps N sufficient to represent the vocal tract transfer function A(z) and adapts so that the prediction error E(z) of Equation 7 is minimized, then the following Equation 8 holds in Equation 7.
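With the equation images missing, Equations 7 and 8 can be sketched from the surrounding definitions as follows, assuming the voiced sound factors as P(z) = A(z)B(z) and that E(z) is the input multiplied by the overall transfer function L(z). This is a reconstruction; the original may state Equation 8 as an exact equality.

$$E(z) = L(z)\, P(z) = L(z)\, A(z)\, B(z)$$

$$L(z)\, A(z) \approx 1 \quad\Longleftrightarrow\quad L(z) \approx \frac{1}{A(z)}$$

Under this relation the prediction error reduces to roughly E(z) ≈ B(z): the vocal tract resonances are cancelled and only the excitation remains.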

This means that the linear prediction error filter 30, which includes the adaptive filter 304, estimates the reciprocal of the vocal tract transfer function A(z).

Consequently, processing the input signal X(z) with the inverse filter 40 of the linear prediction error filter 30, that is, multiplying the input signal X(z) by the transfer function L^-1(z) of the inverse filter 40, emphasizes the formants contained in X(z). The transfer function L^-1(z) of the inverse filter 40 is given by the following Equation 9.
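The image for Equation 9 is not reproduced in this text. Assuming L(z) = 1 - z^{-1}H(z), as the FIG. 3 structure implies, the inverse filter would presumably be:

$$L^{-1}(z) = \frac{1}{L(z)} = \frac{1}{1 - z^{-1} H(z)}$$

If L(z) approximates 1/A(z), then L^{-1}(z) approximates A(z), so applying it to the input boosts the spectrum at the vocal tract resonances, that is, at the formants.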

As shown in FIG. 4, the inverse filter 40 consists of a slave filter 402, so to speak, into which the transfer function H(z) of the adaptive filter 304 in the linear prediction error filter 30 is copied; an adder 404 that adds the output of the slave filter 402 to the input signal X(z); and a one-sample delay element 406 that delays the sum W(z) from the adder 404 and feeds it to the slave filter 402. The sum W(z) from the adder 404 is output as the signal processed by the inverse filter 40, that is, as the emphasized signal. Because this configuration is of the so-called IIR type, however, there is a concern that its operation may become unstable. To overcome this drawback, a lattice-type filter is adopted as the inverse filter 40, and the linear prediction error filter 30 is made lattice-type to match.

Specifically, as shown in FIG. 5, the linear prediction error filter 30 has an adder 310 on the delay side (backward prediction side), to which the output of the delay element 302 is input, and another adder 312 on the non-delay side (forward prediction side), to which the input signal X(z) is input directly. The output of the delay element 302 is also input to a multiplier 314, whose output is input to the non-delay-side adder 312. The non-delay-side adder 312 subtracts the output of the multiplier 314 from the input signal X(z) and passes the result to the adder 312a of the next stage. The input signal X(z) is likewise input to another multiplier 316, whose output is input to the delay-side adder 310. The delay-side adder 310 subtracts the output of the multiplier 316 from the output of the delay element 302 and passes the result to the delay element 302a of the next stage. The next-stage delay element 302a, together with two adders 310a and 312a and two multipliers 314a and 316a, forms the same configuration as the preceding stage. This configuration is cascaded over M stages, and the non-delay-side adder 312b of the final, M-th stage plays the role of the adder 306 shown in FIGS. 3 and 4. That is, the output of the M-th stage non-delay-side adder 312b is the prediction error E(z). The two multipliers 314 and 316 of the first stage are set to the same filter coefficient (reflection coefficient) δ_1, and the same applies to the other stages. Methods for calculating the filter coefficients δ_1, δ_2, ..., δ_M are well known, so a detailed explanation is omitted here.
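The stage-by-stage wiring of FIG. 5 described above can be sketched in Python as follows, with fixed (non-adaptive) reflection coefficients. The function name and the state layout are assumptions; the recursion itself follows the description: each stage subtracts δ times the delayed backward signal from the forward signal, and δ times the forward signal from the delayed backward signal.

```python
def lattice_prediction_error(x, deltas):
    """Lattice FIR prediction error filter with fixed coefficients.

    Per stage m (1-based), following the FIG. 5 description:
        f_m[n] = f_{m-1}[n]   - delta_m * b_{m-1}[n-1]
        b_m[n] = b_{m-1}[n-1] - delta_m * f_{m-1}[n]
    with f_0[n] = b_0[n] = x[n]. The output is f_M[n].
    """
    delayed = [0.0] * len(deltas)   # delayed[m] holds b_m[n-1]
    out = []
    for sample in x:
        f = sample   # forward (non-delay side) signal
        b = sample   # backward (delay side) signal before the delay element
        for m, delta in enumerate(deltas):
            b_prev = delayed[m]     # output of this stage's delay element
            delayed[m] = b          # store the current value for the next sample
            f_next = f - delta * b_prev
            b = b_prev - delta * f
            f = f_next
        out.append(f)
    return out


# Impulse response of a two-stage lattice with hypothetical coefficients.
impulse_response = lattice_prediction_error([1.0, 0.0, 0.0, 0.0], [0.5, 0.25])
```

For two stages the equivalent direct-form response is 1, -δ_1(1 - δ_2), -δ_2, which for the hypothetical values 0.5 and 0.25 gives 1, -0.375, -0.25.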

On the other hand, as shown in FIG. 6, the inverse filter 40 has an adder 410 on the feedback side (the side corresponding to backward prediction), to which the output of a delay element 406 is input, and another adder 412 on the forward side (the side corresponding to forward prediction), which outputs the post-emphasis signal W(z). The output of the delay element 406 is also input to a multiplier 414, and the output of this multiplier 414 is input to the forward-side adder 412. The forward-side adder 412 adds the output of the multiplier 414 to the signal input via the adder 412a of the preceding stage, and outputs the sum as the post-emphasis signal W(z). The post-emphasis signal W(z) is also input to another multiplier 416, and the output of this multiplier 416 is input to the feedback-side adder 410. The feedback-side adder 410 subtracts the output of the multiplier 416 from the output of the delay element 406 and inputs the difference to the delay element 406a of the next stage. The next-stage delay element 406a forms, together with two adders 410a and 412a and two multipliers 414a and 416a, the same configuration that the delay element 406 forms together with the two adders 410 and 412 and the two multipliers 414 and 416. This configuration is cascaded over M stages, and the M-th stage forward-side adder 412b serves as the adder 404 shown in FIG. 4. That is, the input signal X(z) is input to this M-th stage forward-side adder 412b. The filter coefficient δ_1 of the first-stage multipliers 314 and 316 of the linear prediction error filter 30 shown in FIG. 5 is set in the first-stage multipliers 414 and 416. The same applies to the other stages. The inverse filter 40 of the linear prediction error filter 30 is thus configured.
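The lattice structure described above can be illustrated with a minimal Python sketch. It is one reading of the signal flow in the text, not the circuit of FIG. 5 or FIG. 6 itself: the function names `lattice_analysis` and `lattice_synthesis`, the list `delta` of reflection coefficients, and the zero initial state are all assumptions made for illustration. The sketch only demonstrates that a lattice sharing the same coefficients can undo the prediction error filter; how the coefficients are computed is outside its scope.

```python
def lattice_analysis(x, delta):
    """Lattice prediction-error (FIR) filter; delta[m] is the coefficient of stage m+1."""
    stages = len(delta)
    b_delayed = [0.0] * stages  # b_delayed[m] holds the backward signal b_m[n-1]
    out = []
    for sample in x:
        f = sample  # forward signal f_0[n]
        b = sample  # backward signal b_0[n]
        for m in range(stages):
            b_old = b_delayed[m]
            b_delayed[m] = b                   # keep b_m[n] for the next sample
            f_next = f - delta[m] * b_old      # f_{m+1}[n]
            b = b_old - delta[m] * f           # b_{m+1}[n]
            f = f_next
        out.append(f)  # prediction error
    return out


def lattice_synthesis(x, delta):
    """Lattice inverse (IIR) filter that reuses the same coefficients."""
    stages = len(delta)
    b_delayed = [0.0] * stages  # b_delayed[m] holds b_m[n-1]
    out = []
    for sample in x:
        f = sample                  # enters at the last stage
        b_new = [0.0] * (stages + 1)
        for m in range(stages - 1, -1, -1):
            f = f + delta[m] * b_delayed[m]           # forward-side adder
            b_new[m + 1] = b_delayed[m] - delta[m] * f  # feedback-side adder
        b_new[0] = f                # the output also feeds the first delay element
        b_delayed = b_new[:stages]
        out.append(f)
    return out


# Hypothetical coefficients and samples, chosen only for the demonstration.
delta = [0.5, -0.3, 0.2]
signal = [1.0, 0.5, -0.25, 0.75, 0.0, -1.0]
restored = lattice_synthesis(lattice_analysis(signal, delta), delta)
```

With zero initial state in both filters, `restored` matches `signal` up to rounding, which is the sense in which the second lattice is the inverse of the first.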

Although such a lattice-type inverse filter 40 is of the IIR type, it is known to operate stably. In other words, its stability is easy to determine: specifically, it is known that the inverse filter 40 operates stably if the absolute value of each of the filter coefficients δ_1, δ_2, ..., δ_M is less than 1. The linear prediction error filter 30 is also of the lattice type, and therefore exhibits excellent linear prediction performance, such as a higher convergence speed than filters of other configurations such as the transversal type. Moreover, the inverse filter 40 is realized simply by applying the filter coefficients δ_1, δ_2, ..., δ_M of the linear prediction error filter 30 to the inverse filter 40 as they are.
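The stability condition stated above reduces to a per-coefficient check. The following Python sketch shows it; the function name `is_stable` and the sample coefficient lists are hypothetical and are not taken from the patent.

```python
def is_stable(delta):
    """Return True if every reflection coefficient has magnitude below 1."""
    return all(abs(d) < 1.0 for d in delta)


# Hypothetical coefficient sets for illustration.
stable_example = is_stable([0.5, -0.3, 0.2])
unstable_example = is_stable([0.5, -1.2, 0.2])
```

A check of this kind is cheap because it needs no root finding, which is one practical reason the text gives for preferring the lattice form.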

When the input signal X(z) is processed in this way by the inverse filter 40 of the linear prediction error filter 30, the formants contained in the input signal X(z) are emphasized. In this case, however, that is, when peak emphasis is applied directly to the input signal X(z), the following inconvenience arises.

That is, the input signal X(z) exhibits a frequency spectrum that is tilted as a whole, as shown in FIG. 2 described above. If peak emphasis were applied directly to this input signal X(z), the tilt of the frequency spectrum after peak emphasis would become steep, as shown by the solid curve in FIG. 7. The broken curve in FIG. 7 is the frequency spectrum of the input signal X(z) before peak emphasis, that is, the same as the solid curve shown in FIG. 2. When the tilt of the frequency spectrum becomes steep in this way, the power of portions other than the peaks α, β, and γ may become larger than, in particular, the formant peaks β and γ, which makes it difficult for the peak determination circuit 60 described later to determine the formant peaks β and γ. Furthermore, this peak emphasis emphasizes not only the formant peaks β and γ but also the noise component peak α, so that the noise component peak α in particular becomes excessive and causes, so to speak, a range over (overflow).

The flattening circuit 20 is provided to avoid this inconvenience. That is, when the input signal X(z) is processed by the flattening circuit 20, the entire frequency spectrum of the input signal X(z) is flattened and its tilt is corrected to such an extent that the sharpness of each of the peaks α, β, and γ is maintained, as shown by the solid curve in FIG. 8. Such a flattening circuit 20 is realized by a filter whose frequency resolution is too low to follow the peaks α, β, and γ, for example a filter that has the same configuration as the linear prediction error filter 30 but fewer taps. Of course, the flattening circuit 20 may also be realized by other configurations.

The flattened signal X'(z), which has been, so to speak, lightly flattened by the flattening circuit 20, is then input to both the linear prediction error filter 30 and the inverse filter 40. As a result, unlike the solid curve in FIG. 7, a post-emphasis signal W(z) in which the peaks α, β, and γ are moderately emphasized is obtained, as shown by the solid curve in FIG. 9. The broken curve in FIG. 9 is the frequency spectrum of the flattened signal X'(z) before peak emphasis, that is, the same as the solid curve in FIG. 8.

The post-emphasis signal W(z) is then input to the spectral subtraction unit 50, which is configured as shown in FIG. 10. That is, the spectral subtraction unit 50 has a moving average circuit 502, to which the post-emphasis signal W(z) is input. The moving average circuit 502 takes a moving average (time average) of the input post-emphasis signal W(z) over a predetermined period Ta, for example Ta = 5 seconds. The averaged signal Wa(z) obtained by the moving average circuit 502 is input to a multiplier 504, where it is multiplied by a constant coefficient ε. The value of the coefficient ε is determined as appropriate for the situation, for example ε = 1.5. The multiplied averaged signal Wa'(z) output by the multiplier 504 is input to an adder 506.

The spectral subtraction unit 50 also has a delay circuit 508, to which the post-emphasis signal W(z) is likewise input. The delay circuit 508 delays the input post-emphasis signal W(z) by a fixed period Td. The delay time Td of the delay circuit 508 is 1/2 of the moving average time Ta of the moving average circuit 502, that is, Td = 2.5 seconds. The delayed signal Wd(z) output by the delay circuit 508 is also input to the adder 506.

The adder 506 generates the above-described subtracted signal G(z) by subtracting the multiplied averaged signal Wa'(z) output by the multiplier 504 from the delayed signal Wd(z) output by the delay circuit 508. Here, the delayed signal Wd(z) is the post-emphasis signal W(z) at a time earlier than the current time by the delay time Td, and has a frequency spectrum such as that shown in FIG. 11(a). The averaged signal Wa'(z), on the other hand, is in effect the post-emphasis signal W(z) moving-averaged over the averaging time Ta of 5 seconds in total, consisting of the 2.5 seconds before and the 2.5 seconds after the time that lies the delay time Td in the past, with its level further multiplied by ε. Over this averaging time of Ta = 5 seconds, the peak α of the stationary noise component remains largely unchanged. In contrast, voice components (long vowel components) such as screams and shouts are sporadic, so their peaks β and γ vary. As a result, as shown in FIG. 11(b), the averaged signal Wa'(z) has a frequency spectrum in which only the noise component peak α remains and the voice component peaks β and γ are greatly reduced. Consequently, as shown in FIG. 11(c), the subtracted signal G(z) has a frequency spectrum from which the noise component peak α has been removed and in which only the voice component peaks β and γ remain. In other words, only the voice component peaks β and γ are extracted. Voice detection is therefore achieved by capturing these voice component peaks β and γ.
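The subtraction just described can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the circuit of FIG. 10: spectra are represented as lists of per-bin values, the averaging time Ta is expressed as a number of frames (`window`) rather than seconds, the delay Td is taken as `window // 2`, and the function name `spectral_subtract` and the toy data are hypothetical.

```python
def spectral_subtract(frames, window, eps):
    """Compute G = Wd - eps * Wa for every step with a full averaging window.

    frames: list of spectra, oldest first; each spectrum is a list of bin values.
    window: averaging length Ta, in frames (assumed unit).
    eps:    coefficient applied to the averaged spectrum.
    """
    delay = window // 2  # Td is half of Ta
    n_bins = len(frames[0])
    result = []
    for n in range(window - 1, len(frames)):
        recent = frames[n - window + 1 : n + 1]
        averaged = [sum(f[k] for f in recent) / window for k in range(n_bins)]
        delayed = frames[n - delay]  # frame at the center of the window
        result.append([delayed[k] - eps * averaged[k] for k in range(n_bins)])
    return result


# Toy data: bin 0 is a constant noise peak, bin 1 is a one-frame voice burst.
frames = [[1.0, 0.0] for _ in range(10)]
frames[5][1] = 4.0
g = spectral_subtract(frames, window=5, eps=1.5)
```

In this toy case the constant noise bin is negative in every output frame, while the burst in bin 1 stays clearly positive in the output frame whose delayed input is frame 5. This mirrors the reasoning in the text: averaging barely changes a stationary peak but strongly attenuates a sporadic one.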

Note that even though the noise component peak α is largely unchanged over the averaging time Ta of the moving average circuit 502, as described above, the moving average processing over this averaging time Ta still reduces the noise component peak α to some extent. The multiplier 504 described above is provided to compensate for this reduction. That is, the coefficient ε of the multiplier 504 is set so that the noise component peak α is suitably removed from the subtracted signal G(z).

Actual experimental results are reported here.

An evaluation sound source (not shown) is used to generate pink noise and a 1400 Hz sine wave as noise, and a male voice calling "Tasukete" ("Help me") as the voice. The SNR (signal-to-noise ratio) between this voice and the noise is about -15 dB. The sampling frequency for obtaining the input signal X(z) containing this voice and noise is 32 kHz, and the number of points of the discrete Fourier transform described above is 800. The linear prediction error filter constituting the flattening circuit 20 has a tap length of 40 and a step size of 0.00625. The linear prediction error filter 30 has a tap length of 800 and a step size of 0.25. The inverse filter 40 has the same number of taps and the same step size as the linear prediction error filter 30. The averaging time Ta of the spectral subtraction unit 50 is Ta = 5 seconds, and the coefficient ε is ε = 1.5.

Under these conditions, the input signal X(z) has a frequency spectrum as shown in FIG. 12(a). The flattened signal X'(z) obtained by flattening this input signal X(z) has a frequency spectrum as shown in FIG. 12(b). As a comparison of the two shows, in the frequency spectrum of the input signal X(z) in FIG. 12(a) the voice component peaks β and γ are not particularly noticeable compared with the noise component peak α, whereas in the frequency spectrum of the flattened signal X'(z) in FIG. 12(b) the voice component peaks β and γ stand out to about the same degree as the noise component peak α.

The post-emphasis signal W(z) obtained by applying peak emphasis to the flattened signal X'(z) has a frequency spectrum as shown in FIG. 12(c). The averaged signal Wa(z) obtained by taking the moving average of this post-emphasis signal W(z) over the averaging time Ta has a frequency spectrum as shown in FIG. 12(d). As a comparison of the two shows, the voice component peaks β and γ are sufficiently large in the frequency spectrum of the post-emphasis signal W(z) in FIG. 12(c), but are drastically reduced in the frequency spectrum of the averaged signal Wa(z) in FIG. 12(d). The noise component peak α is also reduced, but only to a small degree.

As described above, the delayed signal Wd(z) is generated by delaying the post-emphasis signal W(z) by the delay time Td, the multiplied averaged signal Wa'(z) is generated by multiplying the averaged signal Wa(z) by the coefficient ε, and the subtracted signal G(z) is generated by subtracting the latter from the former. This subtracted signal G(z) has a frequency spectrum as shown in FIG. 12(e). As is clear from the frequency spectrum of the subtracted signal G(z) in FIG. 12(e), the noise component peak α is removed and only the voice component peaks β and γ remain.

The effectiveness of the present embodiment was thus also confirmed by actual experiment.

The subtracted signal G(z) is input to the peak determination circuit 60, which determines whether the voice component peaks β and γ are present. However, as a result of the processing by the spectral subtraction unit 50, the frequency spectrum of the subtracted signal G(z) also contains negative values, so the presence or absence of the voice component peaks β and γ cannot be determined by an absolute evaluation. That is, it is not possible to set a fixed threshold and determine the presence or absence of the voice component peaks β and γ according to whether peaks larger than that threshold exist. The peak determination circuit 60 therefore performs peak determination by the following relative evaluation.

Suppose that part of the frequency spectrum of the subtracted signal G(z) has the characteristic shown, for example, in FIG. 13. When the component value g[i] in an arbitrary frequency bin i satisfies the conditional expression given by Equation 10, that component value g[i] is determined to be a voice component peak β or γ.

Whether the component value g[i] in the central frequency bin i is a voice component peak β or γ is thus determined on the basis of the conditional expression given by Equation 10, that is, on the basis of the component values g[i-2] to g[i+2] in five consecutive frequency bins [i-2] to [i+2]. This makes it possible to determine the presence or absence of the voice component peaks β and γ accurately even in the frequency spectrum of a subtracted signal G(z) that contains negative values. The presence or absence of the peaks β and γ can also be determined accurately even when fluctuation or the like makes them somewhat indistinct, as shown for example in FIG. 14. The value "4" included in Q5 to Q8 of Equation 10 is an empirical value, and a different value may be set depending on the situation.

As described above, according to the present embodiment, the frequency band of the discrete Fourier transform used to obtain the frequency spectrum of the input signal X(z) is limited to f = 1200 Hz to 3000 Hz, which means that the first formant is intentionally excluded from voice detection. Voice detection can therefore be achieved, with the influence of the noise eliminated, even in an environment containing noise such as road traffic noise that has large power in the frequency band around 1 kHz, which overlaps the frequency band of the first formant. In addition, the present embodiment focuses on the difference in nature between the noise component peak α and the voice component peaks β and γ contained in the frequency spectrum of the input signal X(z): it emphasizes the peaks α, β, and γ, removes only the noise component peak α from the frequency spectrum after peak emphasis, and captures the voice component peaks β and γ, thereby achieving voice detection. Consequently, the influence not only of road traffic noise but also of other noise can be eliminated, and voice detection more accurate than that of the above-described prior art, which is easily affected by such noise, can be achieved. This is particularly well suited to accurately detecting human screams, shouts, and the like in crime prevention applications.
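As a rough worked example of what the band limitation means in terms of frequency bins, the Python sketch below combines the 32 kHz sampling frequency and the 800-point discrete Fourier transform reported in the experiment with the 1200 Hz to 3000 Hz band. It assumes a conventional uniform DFT grid in which bin k lies at k times the sampling frequency divided by the number of points; the patent text does not state how the bins are laid out, and the names used here are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 32_000    # sampling frequency from the experiment
DFT_POINTS = 800           # number of DFT points from the experiment
BAND_HZ = (1200, 3000)     # analysis band stated for the embodiment


def bin_width_hz(sample_rate_hz, dft_points):
    """Spacing between adjacent bins on a uniform DFT grid (assumed layout)."""
    return sample_rate_hz / dft_points


def band_bins(band_hz, sample_rate_hz, dft_points):
    """Lowest and highest bin indices that fall inside the band."""
    width = bin_width_hz(sample_rate_hz, dft_points)
    low = math.ceil(band_hz[0] / width)
    high = math.floor(band_hz[1] / width)
    return low, high


width = bin_width_hz(SAMPLE_RATE_HZ, DFT_POINTS)
low_bin, high_bin = band_bins(BAND_HZ, SAMPLE_RATE_HZ, DFT_POINTS)
bin_count = high_bin - low_bin + 1
```

Under these assumptions the bin spacing is 40 Hz and the band spans bins 30 to 75, so only 46 of the 800 bins would need to be evaluated.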

Such a voice detection device 10 is realized by, for example, a CPU (Central Processing Unit) or a combination of a CPU and a DSP (Digital Signal Processor). Because voice detection in the manner described above requires a relatively small amount of processing, relatively inexpensive CPUs and DSPs can be used, and in particular a fixed-point DSP can be used. Furthermore, the sampling frequency for obtaining the input signal X(z) can be made lower than the 32 kHz mentioned above, for example by downsampling to 12 kHz. This also contributes greatly to reducing the cost of the CPU and DSP.

The present embodiment has been described for the case where the present invention is applied to crime prevention, but the invention is not limited to this. That is, the present invention can be applied to any application in which only the voice component needs to be detected from an input signal X(z) in which a voice component and a noise component are mixed.

In addition, although the first formant is excluded from voice detection here as a result of limiting the frequency band of the discrete Fourier transform, the invention is not limited to this. That is, other means for excluding the first formant from voice detection, for example frequency limiting means such as a low-pass filter, may be employed.

Furthermore, although the lattice-type FIR filter shown in FIG. 5 is employed as the linear prediction error filter 30, a filter of another configuration, such as a transversal type, may be employed. In that case, it is essential that the inverse filter also have a configuration conjugate to that of the linear prediction error filter 30. As described above, however, employing lattice-type filters provides significant advantages.

Although the configuration shown in FIG. 10 is employed as the spectral subtraction unit 50, the invention is not limited to this. In particular, means other than the moving average circuit 502 may be employed to estimate the noise component peak α.

Description of Symbols
10 Voice detection device
20 Flattening circuit
30 Linear prediction error filter
40 Inverse filter
50 Spectral subtraction

Claims

A voice detection device that detects a voice component from an input signal in which the voice component and a noise component are mixed, the device comprising:
peak enhancement means for enhancing a peak of the frequency spectrum of the input signal;
noise estimation means for estimating a noise spectrum corresponding to the noise component in the enhanced spectrum obtained after the peak is enhanced by the peak enhancement means;
subtraction means for subtracting the noise spectrum from the enhanced spectrum; and
peak determination means for determining whether the peak is included in the subtracted signal obtained after the noise spectrum is subtracted from the enhanced spectrum by the subtraction means,
wherein the peak enhancement means includes a linear prediction error filter that predicts the current input signal based on past input signals, and an inverse filter in which an inverse transfer function that is the reciprocal of the transfer function of the linear prediction error filter is set and which enhances the peak by processing the input signal.

The voice detection device according to claim 1, wherein each of the linear prediction error filter and the inverse filter is of a lattice type.

The voice detection device according to claim 1 or 2, wherein the noise estimation means estimates the noise spectrum by time-averaging the enhanced spectrum.

A voice detection device that detects a voice component from an input signal in which the voice component and a noise component are mixed, the device comprising:
peak enhancement means for enhancing a peak of the frequency spectrum of the input signal;
noise estimation means for estimating a noise spectrum corresponding to the noise component in the enhanced spectrum obtained after the peak is enhanced by the peak enhancement means;
subtraction means for subtracting the noise spectrum from the enhanced spectrum; and
flattening means for flattening the frequency spectrum of the input signal to such an extent that the peak is maintained,
wherein the peak enhancement means enhances the peak of the flattened spectrum obtained after flattening by the flattening means.

The voice detection device according to claim 4, wherein the flattening means includes a low-resolution filter whose frequency resolution is too low to follow the peak.