JP2003255971A

JP2003255971A - Speech extracting method and speech extracting device using the method

Info

Publication number: JP2003255971A
Application number: JP2002054497A
Authority: JP
Inventors: Hiroyuki Hoshino; 博之星野
Original assignee: Toyota Central R&D Labs Inc
Current assignee: Toyota Central R&D Labs Inc
Priority date: 2002-02-28
Filing date: 2002-02-28
Publication date: 2003-09-10

Abstract

PROBLEM TO BE SOLVED: To provide a speech extracting method which extracts speech in a noisy environment, and a device using the method.

SOLUTION: The speech extracting method and device focus on the difference in acoustic features between background noise and speech. When an octave band level analysis is made of speech that includes background noise, the level difference between the two becomes marked in specific bands. A filter part 30 whose characteristic makes this level difference large is therefore generated from the analysis and applied to the output of a microphone 10. As a result, the bands in which the power difference between speech and background noise is large are passed; that is, the filter part 30 relatively emphasizes the speech. A threshold determination part 40 determines a threshold from the filter output. Finally, a speech extraction part 60 applies the threshold to the speech signal as a time series and regards the part above the threshold as a speech section. Because the speech level is relatively large in this section, the speech can be extracted accurately even in a noisy environment.

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speech extraction method for improving the speech recognition rate under noise, and to a speech extraction device using that method. In particular, it relates to a speech extraction method and device in which a filter based on band level analysis corresponding to human hearing is applied to a detected sound containing background noise, an effective speech section is determined from the filter output, and the speech is extracted. The present invention can be applied to speech recognition in voice-operated vehicle-mounted devices, with vehicle interior noise as the background noise.

[0002]

[Prior Art] Speech recognition methods and devices that recognize speech under noise are already known. One example is the speech recognition method and speech recognition device disclosed in Japanese Unexamined Patent Publication No. H8-314500, which are intended to accurately recognize spoken digits and the like under noise. The threshold used to determine the speech section is divided into a first threshold and a second threshold. The second threshold is determined dynamically from the maximum value of the speech signal obtained with the first threshold, or from the maximum value of its spectrum, and speech recognition is performed on the speech signal cut out with the second threshold. In other words, the second threshold is varied in response to the changing speech signal, and the speech section is determined accordingly.

[0003] Another example is the speech section detection method disclosed in Japanese Unexamined Patent Publication No. H10-254476. This method determines the speech section by performing recognition processing with speech and non-speech acoustic models. Specifically, two models are prepared: a speech acoustic model trained on all the speech covering the vocabulary (classes) to be recognized by the speech recognition device, and a non-speech acoustic model trained on sections in which no speech is uttered. The likelihood ratio of the speech acoustic model to the non-speech acoustic model is then calculated for each section of suitable length of the input signal. When sections in which the likelihood ratio exceeds a predetermined threshold continue for a certain time, that point is taken as the start of the speech section. When sections in which the likelihood ratio falls below the predetermined threshold subsequently continue for a certain time, that point is taken as the end of the speech section. The publication states that the speech section is thereby determined accurately and the speech recognition rate is improved.
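The start and end rule described for Publication No. H10-254476 can be sketched as a small state machine. This is only an illustration under stated assumptions: the function name `detect_section`, the per-frame likelihood ratios, the threshold and the hold length are all hypothetical, and placing each boundary at the first frame of the qualifying run is a choice made here, since the text does not say where within the run the boundary falls.

```python
def detect_section(ratios, threshold, hold):
    """Return (start, end) frame indices of the first speech section, or None.

    ratios    : per-frame likelihood ratios (speech model / non-speech model)
    threshold : ratio above which a frame counts as speech-like
    hold      : number of consecutive frames required to confirm a boundary
    """
    start = None
    end = None
    run = 0
    for i, ratio in enumerate(ratios):
        if start is None:
            # Looking for `hold` consecutive frames above the threshold.
            run = run + 1 if ratio > threshold else 0
            if run >= hold:
                start = i - hold + 1  # assumption: boundary at start of the run
                run = 0
        else:
            # Looking for `hold` consecutive frames below the threshold.
            run = run + 1 if ratio < threshold else 0
            if run >= hold:
                end = i - hold + 1
                break
    if start is None:
        return None
    return (start, end if end is not None else len(ratios))


# Hypothetical likelihood ratios; single spikes and dips are ignored.
example = [0.5, 2.0, 0.4, 2.1, 2.5, 3.0, 2.2, 0.6, 2.4, 0.3, 0.2, 0.1]
section = detect_section(example, threshold=1.0, hold=3)
```

Requiring a run of frames on both sides is what makes the rule robust to isolated spikes, at the cost of evaluating two acoustic models per frame.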

[0004]

[Problems to Be Solved by the Invention] However, in the speech recognition method and device disclosed in Publication No. H8-314500, if sudden noise is mixed into the speech signal cut out with the first threshold, the second threshold also fluctuates, and an error may arise in the speech section. That is, separating the threshold into a first and a second threshold does not reliably remove the influence of external noise. It is therefore not a method or device that determines the speech section free from the influence of background noise and always recognizes speech accurately. The speech section detection method disclosed in Publication No. H10-254476, on the other hand, performs pattern recognition processing and therefore requires a large amount of computation. It is thus difficult to adopt for speech recognition in devices that require real-time operation, such as vehicle-mounted navigation devices.

[0005] The present invention has been made to solve the problems described above. Its object is to apply a filter based on an analysis method matched to human hearing to a detected sound containing background noise, to set a threshold based on the filter output, and to determine the speech section with high accuracy, and thereby to provide a method and a device that extract speech accurately. A further object is to apply the speech extraction method and device of the present invention to a vehicle-mounted speech recognition device so that the driver's speech is recognized clearly and driving safety is improved.

[0006]

[Means for Solving the Problems] The speech extraction method according to claim 1 is a speech extraction method for use under noise, characterized in that a filter based on the result of band level analysis is applied to a detected sound containing background noise, a threshold is determined based on the filter output, a section in which the filter output level is at or above the threshold is taken as a speech section, and speech is extracted from that speech section.

[0007] The speech extraction method according to claim 2 is the method according to claim 1, characterized in that the band level analysis is octave band level analysis. The speech extraction method according to claim 3 is the method according to claim 1 or 2, characterized in that the action of the filter is a weighted addition in which predetermined weights are applied to predetermined bands obtained by the band level analysis and the results are summed.

[0008] The speech extraction method according to claim 4 is the method according to claim 1 or 2, characterized in that the action of the filter is a weighted addition in which the level of each predetermined band obtained by the band level analysis is scale-converted to perceived loudness, and the converted values are summed with larger weights for higher frequency bands. The speech extraction method according to claim 5 is the method according to any one of claims 1 to 4, characterized in that the speech is extracted by subtracting the frequency components of the background noise from the frequency components of the calculated speech section.

[0009] The speech extraction device according to claim 6 is a speech extraction device for use under noise, characterized by comprising: speech detection means for detecting speech containing background noise; filter means for applying a filter characteristic based on band level analysis to the sound detected by the speech detection means; threshold determination means for determining a threshold based on the output of the filter means; speech section calculation means for taking a section whose level is at or above the threshold determined by the threshold determination means as a speech section; and speech extraction means for extracting speech from that speech section.

[0010] The speech extraction device according to claim 7 is the device according to claim 6, characterized in that the band level analysis used to determine the filter characteristic is octave band level analysis. The speech extraction device according to claim 8 is the device according to claim 6 or 7, characterized in that the filter means performs a weighted addition in which predetermined weights are applied to predetermined bands of the detected sound to which the filter characteristic has been applied and the results are summed.

[0011] The speech extraction device according to claim 9 is the device according to claim 6 or 7, characterized in that the filter means scale-converts the level of predetermined bands of the detected sound to which the filter characteristic (band level analysis result) has been applied to perceived loudness, and performs a weighted addition in which the converted values are summed with larger weights for higher frequency bands. The speech extraction device according to claim 10 is the device according to either claim 6 or claim 9, characterized in that the speech extraction means subtracts the frequency components of the background noise from the detected-sound frequency components of the speech section calculated by the speech section calculation means. The speech extraction device according to claim 11 is the device according to any one of claims 6 to 10, characterized in that the background noise is vehicle interior noise and the speech is the driver's instruction to a vehicle-mounted device.

[0012]

[Operation and Effects of the Invention] According to the speech extraction method of claim 1, a filter based on band level analysis is applied to the detected sound. Band level analysis is an analysis in which frequency bands are selected from the detected sound so as to correspond to the exponentially distributed frequencies of human hearing, and the power of each band is obtained. This is done because, as shown in FIG. 4, the acoustic characteristics of background noise and human speech differ from one frequency band to another. Background noise consists of random noise and periodic noise such as engine sound, whereas the frequencies of human speech are distributed exponentially. Focusing on this property, the detected sound is analyzed in exponentially spaced bands. For example, extracting the 1.25 kHz band from the band level analysis result of FIG. 4 allows the background noise to be detected at a low level and the speech at a high level. In other words, the filter characteristic based on this band level analysis is applied to the detected sound containing background noise, so that bands in which the difference between speech and background noise is large are passed and the speech is emphasized.
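As a rough illustration of band level analysis, the sketch below sums the spectral power that falls inside a 1/3-octave band around a chosen center frequency. Everything specific in it is an assumption made for the example: the 8 kHz sample rate, the 256-sample frame, the naive DFT, the band edges at the center frequency divided and multiplied by 2^(1/6), and the synthetic signal, in which a 250 Hz tone stands in for low-frequency noise and a weaker 1250 Hz tone stands in for speech. The levels are relative, not calibrated sound pressure levels.

```python
import cmath
import math


def power_spectrum(samples, sample_rate):
    """Return (frequency_hz, power) pairs for the non-negative DFT bins."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        spectrum.append((k * sample_rate / n, abs(acc) ** 2 / n))
    return spectrum


def band_level_db(spectrum, center_hz, fraction=3):
    """Relative level (dB) of the 1/fraction-octave band around center_hz."""
    half_width = 2 ** (1 / (2 * fraction))
    low, high = center_hz / half_width, center_hz * half_width
    power = sum(p for f, p in spectrum if low <= f < high)
    return 10 * math.log10(power + 1e-12)  # small offset avoids log(0)


SAMPLE_RATE = 8000  # Hz, assumed for the example
FRAME = 256         # samples, so bins are 31.25 Hz apart

# Synthetic frame: strong 250 Hz "noise" plus weaker 1250 Hz "speech".
signal = [
    1.0 * math.sin(2 * math.pi * 250 * i / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 1250 * i / SAMPLE_RATE)
    for i in range(FRAME)
]
spectrum = power_spectrum(signal, SAMPLE_RATE)
level_250 = band_level_db(spectrum, 250)
level_1000 = band_level_db(spectrum, 1000)
level_1250 = band_level_db(spectrum, 1250)
```

In this toy frame the full-band level is dominated by the 250 Hz component, while a filter that passes only the 1250 Hz band sees the speech-like component alone, which is the effect the paragraph describes.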

[0013] The threshold is then determined based on the filter output. For example, a value based on the maximum power in the frequency spectrum of the filter output, such as 20 dB below the maximum power, is used as the threshold. A section whose level is at or above the threshold is taken as a speech section. A speech section is a section on the time axis during which speech was uttered. Speech is then extracted from that speech section. Because the background noise is small and the speech power is large in this section, speech can be extracted efficiently and reliably. Even speech buried in background noise can thus be extracted accurately.
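The thresholding step can be sketched as follows. This is a minimal illustration, assuming the filter output is available as one level in dB per time frame. The function name `speech_sections`, the frame levels and the half-open `(start, end)` index convention are hypothetical, and the 20 dB margin is the example value given in the text.

```python
def speech_sections(levels_db, margin_db=20.0):
    """Return (threshold, sections) for a sequence of frame levels in dB.

    The threshold is placed margin_db below the maximum level, and every
    run of frames at or above it is reported as a (start, end) pair, with
    `end` exclusive.
    """
    threshold = max(levels_db) - margin_db
    sections = []
    start = None
    for i, level in enumerate(levels_db):
        if level >= threshold and start is None:
            start = i                     # section begins
        elif level < threshold and start is not None:
            sections.append((start, i))   # section ends before frame i
            start = None
    if start is not None:
        sections.append((start, len(levels_db)))
    return threshold, sections


# Hypothetical filter-output levels: two utterances over a ~30 dB noise floor.
levels = [30, 32, 31, 55, 60, 58, 33, 30, 52, 57, 31]
threshold, sections = speech_sections(levels)
```

Because the threshold is tied to the observed maximum rather than fixed in advance, it follows the speaker's level, but it also assumes that the loudest frame in the filter output really is speech.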

[0014] The speech extraction method according to claim 2 is the method according to claim 1 in which the band level analysis is octave band level analysis. Octave band level analysis is a technique that selects and analyzes bands in accordance with the exponential nature of human hearing. For example, in 1/3-octave band analysis with an initial center frequency of 1000 Hz, the ratio to the next band is 2^(1/3), so the center frequencies are successively 1250 Hz, 1600 Hz, 2000 Hz, and so on. In the range at and above 1000 Hz, these bands nearly coincide with the critical bands, a concept that is useful in explaining human auditory phenomena. Background noise does not follow octave bands, whereas human speech and hearing do, so band analysis in these critical bands yields the speech efficiently. That is, the speech section can be determined efficiently.
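The center frequencies quoted above can be checked with a short calculation. The helper name `third_octave_centers` is hypothetical. Note that the exact values from the ratio 2^(1/3) are about 1260 Hz and 1587 Hz; the 1250 Hz and 1600 Hz in the text are the conventional rounded (nominal) band labels.

```python
def third_octave_centers(start_hz=1000.0, count=4):
    """Exact 1/3-octave center frequencies: each is 2**(1/3) times the last."""
    return [start_hz * 2 ** (k / 3) for k in range(count)]


centers = third_octave_centers()
rounded = [round(f) for f in centers]  # 1000, 1260, 1587, 2000
```

Every third band doubles the frequency, which is why the fourth value lands on exactly 2000 Hz, one octave above the starting band.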

[0015] The speech extraction method according to claim 3 is the method according to claim 1 or 2 in which the predetermined bands obtained by band level analysis are given predetermined weights by the filter action and summed. Band level analysis yields a speech spectrum centered on the bands in which the level difference between speech and background noise is pronounced, for example 1 kHz, 1.25 kHz and 4 kHz, and the weighting serves to emphasize them effectively. Because the level difference between speech and background noise is pronounced in these bands, increasing them, for example, in accordance with the maximum level of the bands and then summing them emphasizes the acoustic characteristics of the speech further. For example, the above spectral centers are weighted and summed in a ratio of 3:2:1. In this way the acoustic characteristics that distinguish speech from background noise are emphasized further. After the weighted addition, a section at or above a predetermined threshold is taken as a speech section. Because the background noise is relatively reduced, the speech section is determined more accurately.
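A minimal sketch of the 3:2:1 weighted addition is given below. Two points are assumptions rather than statements from the text: that the weights 3, 2 and 1 apply to the 1 kHz, 1.25 kHz and 4 kHz bands in that order, and that the weighting is applied to linear band power rather than to dB values. The band powers in the example frames are invented for illustration.

```python
# Assumed mapping of the 3:2:1 ratio onto the three band centers (Hz).
WEIGHTS = {1000: 3.0, 1250: 2.0, 4000: 1.0}


def weighted_output(band_power):
    """Weighted sum of the linear power in the selected bands for one frame."""
    return sum(weight * band_power[center] for center, weight in WEIGHTS.items())


# Hypothetical band powers for a noise-only frame and a speech frame.
noise_frame = {1000: 1.0, 1250: 1.0, 4000: 0.5}
speech_frame = {1000: 10.0, 1250: 8.0, 4000: 2.0}

noise_out = weighted_output(noise_frame)    # 3*1 + 2*1 + 1*0.5
speech_out = weighted_output(speech_frame)  # 3*10 + 2*8 + 1*2
```

The per-frame values produced this way form the time series to which the speech-section threshold is then applied.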

[0016] The speech extraction method according to claim 4 is the method according to claim 1 or 2 in which, in the filter action on the detected sound, the level of each predetermined band (critical band) obtained by band level analysis is first scale-converted to perceived loudness. Scale conversion here means converting the frequency spectrum values from dB values to sone values. The dB value is the unit of sound pressure as measured by a condenser (electrostatic) microphone, whereas the sone value is a scale corresponding to the loudness perception of human hearing. With S the sone value and P the sound pressure (dB value), the two are related by log10 S = 0.03 (P - 40). When P is plotted on the horizontal axis and S on the vertical axis, S increases exponentially. Conversion to sone values therefore means matching the loudness perception of human hearing. The filter action then sums the converted values with larger weights for higher frequency bands. This conversion matches the perception of sharpness in human hearing and increases the sharpness of the speech. Speech inherently contains high-frequency components, for example in the articulation of consonants, so emphasizing the high-frequency bands is effective for speech recognition. The acoustic characteristics of the speech can thus be emphasized in the same way as in human auditory processing, and the speech section can be set still more accurately.
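The scale conversion and the high-frequency weighting can be sketched as below. `db_to_sone` simply solves the relation log10 S = 0.03 (P - 40) for S. The weighting scheme in `sone_weighted_sum` is purely hypothetical: the text says only that higher bands receive larger weights, so a weight that grows by a fixed step per band is assumed here.

```python
def db_to_sone(level_db):
    """Solve log10(S) = 0.03 * (P - 40) for S; 40 dB maps to 1 sone."""
    return 10 ** (0.03 * (level_db - 40.0))


def sone_weighted_sum(band_levels_db, step=0.5):
    """Sum band loudness with weights that grow with band frequency.

    band_levels_db maps band center frequency (Hz) to level (dB).
    The weights 1.0, 1.0 + step, 1.0 + 2*step, ... are an assumed scheme.
    """
    total = 0.0
    for index, center_hz in enumerate(sorted(band_levels_db)):
        weight = 1.0 + step * index
        total += weight * db_to_sone(band_levels_db[center_hz])
    return total


# With this relation, every 10 dB increase roughly doubles the sone value.
sone_40 = db_to_sone(40.0)
sone_50 = db_to_sone(50.0)
flat = sone_weighted_sum({1000: 40.0, 1250: 40.0, 4000: 40.0})
```

With equal 40 dB levels in all three bands, each band contributes 1 sone and the result is just the sum of the weights, which makes the effect of the weighting easy to see in isolation.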

[0017] The speech extraction method according to claim 5 is the method according to any one of claims 1 to 4 in which the speech is extracted by an operation that subtracts the frequency components of the background noise from the frequency components of the calculated speech section. Here, the frequency components of the background noise may be those of the detected sound immediately before the calculated speech section, or may be frequency components measured in advance in predetermined bands when no speech is present. At extraction time, the background noise frequency components are subtracted from the detected-sound frequency components of the calculated speech section (spectral subtraction). Speech is thereby extracted effectively.
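Spectral subtraction as described above can be sketched in a few lines. The noise estimate here is the average of the frames immediately preceding the speech section, which is one of the two options the text mentions. Clamping negative results to a floor of zero is a common practice assumed for the example, not something the text specifies, and the function names and spectra are hypothetical.

```python
def mean_spectrum(frames):
    """Average several power spectra bin by bin (noise estimate)."""
    return [sum(column) / len(frames) for column in zip(*frames)]


def spectral_subtract(section_power, noise_power, floor=0.0):
    """Subtract the noise spectrum bin by bin, clamping at `floor`."""
    return [max(s - n, floor) for s, n in zip(section_power, noise_power)]


# Hypothetical power spectra (three bins each).
frames_before_section = [[3.0, 1.0, 2.0], [5.0, 1.0, 4.0]]
noise = mean_spectrum(frames_before_section)
section = [5.0, 9.0, 2.0]
extracted = spectral_subtract(section, noise)
```

Estimating the noise from the frames just before the section lets the estimate track slowly changing conditions such as engine speed, whereas a spectrum measured in advance is simpler but fixed.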

[0018] According to the speech extraction device of claim 6, the speech detection means first detects speech containing background noise. The filter means then applies a filter characteristic based on band level analysis; that is, it processes the detected sound into perceived loudness. Band level analysis is an analysis in which bands (frequency bands), for example exponentially distributed ones, are selected from the detected sound and their power is obtained. Because background noise and speech differ in their acoustic characteristics, analyzing at, for example, exponentially distributed band levels reduces the background noise and, as a result, emphasizes the speech. The threshold determination means then determines a threshold based on the output of the filter means. For example, a value based on the maximum power in the frequency spectrum of the filter output, such as 20 dB below the maximum power, is used as the threshold. The speech section calculation means takes a section whose level is at or above the threshold as a speech section. A speech section is a section on the time axis during which speech was uttered. Finally, the speech extraction means extracts speech from the speech section. The extraction is performed on the frequency axis, for example by subtracting the spectrum of the background noise. In the speech extraction device of the present invention, the filter means based on band level analysis emphasizes the speech before the speech section is determined, so speech can be extracted reliably and efficiently even in the presence of background noise.

[0019] The speech extraction device according to claim 7 is the device according to claim 6 in which octave band level analysis is used to determine the filter characteristic of the filter means. Octave band level analysis is a technique that selects and analyzes bands in accordance with the exponential nature of human hearing. For example, in 1/3-octave band analysis with an initial center frequency of 1000 Hz, the ratio to the center frequency of the next band is 2^(1/3), so the center frequencies are successively 1250 Hz, 1600 Hz, 2000 Hz, and so on. Human speech and hearing follow octave bands whereas background noise does not, so analyzing in bands centered on these frequencies emphasizes human speech efficiently. That is, by using octave band level analysis as the band level analysis, together with filter means based on it, the speech section can be determined more accurately, and speech can therefore be extracted with higher accuracy.

[0020] The speech extraction device according to claim 8 is the device according to claim 6 or 7 in which the filter means applies predetermined weights to predetermined bands of the detected sound to which the filter characteristic (band level analysis) has been applied, and sums them. After band level analysis, bands centered on, for example, 1 kHz, 1.25 kHz and 4 kHz are obtained. Speech is concentrated in these bands, and the difference from background noise is pronounced. To emphasize the speech still more effectively, these bands are increased, for example, in accordance with the maximum level of the bands, and then summed. For example, the above spectral centers are weighted and summed in a ratio of 3:2:1. In this way the acoustic characteristics that distinguish speech from background noise are emphasized further. Determining the speech section on this basis allows speech to be extracted more accurately. That is, the speech extraction device according to claim 6 or 7 can be realized so as to extract speech more accurately.

[0021] The speech extraction device according to claim 9 is the device according to claim 6 or 7 in which the filter means scale-converts the level of each predetermined band of the detected sound to which the filter characteristic (band level analysis) has been applied to perceived loudness. Scale conversion here means converting the dB values of the frequency spectrum to sone values. The sound pressure picked up by a condenser (electrostatic) microphone is detected as a dB value, but human hearing does not perceive that dB value linearly; it perceives sound pressure on the sone scale. With S the sone value and P the sound pressure (dB value), the two are related by log10 S = 0.03 (P - 40). When P is plotted on the horizontal axis and S on the vertical axis, S increases exponentially with P. Conversion to sone values therefore means matching the loudness perception of human hearing. The filter means then performs a weighted addition in which the converted values are summed with larger weights for higher frequency bands. This is a conversion to sharpness. Human speech inherently contains high-frequency components, for example in the articulation of consonants, so emphasizing the high-frequency bands is effective for speech recognition. The acoustic characteristics of the speech can thereby be emphasized further. That is, a speech extraction device can be realized that sets the speech section still more accurately and extracts speech still more reliably.

[0022] The speech extraction device according to claim 10 is the device according to either claim 6 or claim 9 in which the speech extraction means subtracts the frequency components of the background noise from the detected-sound frequency components of the speech section calculated by the speech section calculation means. Here, the frequency components of the background noise may be those of the detected sound immediately before the calculated speech section, or may be frequency components measured in advance, before speech extraction and while no speech is present, in predetermined bands such as octave band levels (critical bands). At extraction time, the speech extraction means subtracts the background noise frequency components from the detected-sound frequency components in the speech section (spectral subtraction). Speech is thereby extracted effectively.

[0023] The speech extraction device according to claim 11 is the device according to any one of claims 6 to 10 in which the background noise is vehicle interior noise and the speech is the driver's instruction to a vehicle-mounted device. Because the speech extraction device according to any one of claims 6 to 10 can extract speech efficiently and reliably even in the presence of background noise, it can be applied to vehicle-mounted devices that require speech recognition. That is, the driver's speech (instructions) can be extracted efficiently and reliably even with vehicle interior noise as background noise. The speech recognition rate of vehicle-mounted devices such as navigation devices is thereby improved, which can contribute to safe driving.

[0024]

[Embodiments of the Invention] The speech extraction method and device of the present invention are described below with reference to the drawings. The method and device are effective under noise in general, for example traffic noise, vehicle noise, and noise in factories and the like. Here, a speech extraction method and device for use under the interior noise of an automobile are described. Automobile interior noise consists of periodic noise originating from the engine sound and random noise originating from travel (road noise, wind noise and the like). Because the power of this interior noise is concentrated in the low-frequency range, preprocessing that removes it with a high-pass filter is effective for speech recognition. At present, however, even with a high-pass filter the speech section still cannot be identified accurately, so no further improvement in the recognition rate has been achieved. The inventor therefore focused on the difference between the acoustic characteristics of speech and noise, namely that the difference between the two becomes pronounced in the perceptual bands (critical bands), and attempted to calculate the speech section by emphasizing these bands.

[0025] FIG. 1 shows an embodiment of the voice extraction device of the present invention. The figure is a system configuration diagram. The voice extraction device comprises a microphone 10 serving as voice detection means, a filter unit 30 serving as filter means whose filter characteristic is determined on the basis of band level analysis, a threshold determination unit 40 serving as threshold determination means, a voice section calculation unit 50 serving as voice section calculation means, and a voice extraction unit 60 serving as voice extraction means. Specifically, as shown in FIG. 2, these elements are implemented by a computer device 100 comprising a CPU 110, a ROM 120, a RAM 130, an A/D conversion device 140, an I/O interface device 150, and the like. For example, the filter unit 30 consists of the A/D conversion device 140, a processing program stored in the ROM 120, the CPU 110 that executes the program, and the RAM 130 that serves as its execution area.

[0026] The operation of the above components is described with reference to the system configuration diagrams of FIGS. 1 and 2 and the flowchart of FIG. 3. Before that, the method by which the filter characteristic of the filter unit 30 is determined in advance through band level analysis is described. Band level analysis is an analysis method in which frequency bands are selected exponentially, so as to correspond to the frequencies of human hearing, and the power of the detected sound from the microphone 10 is obtained in each band. One example is octave level analysis. Background noise in the vehicle interior has a frequency structure in which periodic vehicle-related noise is added to random noise. The acoustic characteristic of voice, on the other hand, is, as is well known, a frequency structure in which a reference frequency increases exponentially. For example, with n an integer, this frequency structure can be expressed as an octave structure in which a reference frequency (for example, 200 Hz) is multiplied by $2^{n}$, or as a 1/3-octave structure in which it is multiplied by, for example, $2^{n/3}$. Therefore, if the detected sound including background noise is subjected to, for example, 1/3-octave level analysis, the background noise is reduced and, as a result, the voice is emphasized. At the same time, the bands in which the difference between the two is pronounced become apparent. FIG. 4 shows a graph of a 1/3-octave band level analysis of vehicle interior noise and utterances by six speakers. The horizontal axis is the 1/3-octave band and the vertical axis is its level. As the figure shows, the difference is pronounced in the 1 kHz band, the 1.25 kHz band, and the bands from 2 kHz to 4 kHz. The filter characteristic of the filter unit 30 is therefore set in advance so as to pass these bands.
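
The band selection described above can be illustrated with a minimal Python sketch. It assumes that 1/3-octave band levels (in dB) for speech and for background noise have already been measured. The function names, the 10 dB margin, and the sample levels are hypothetical illustrations and are not taken from the patent.

```python
def third_octave_centres(f_ref, n_min, n_max):
    """Centre frequencies f_ref * 2**(n/3) for integer n in [n_min, n_max]."""
    return [f_ref * 2.0 ** (n / 3.0) for n in range(n_min, n_max + 1)]


def select_bands(speech_db, noise_db, min_diff_db):
    """Return the band centres where speech exceeds noise by at least min_diff_db."""
    return sorted(
        f for f in speech_db
        if f in noise_db and speech_db[f] - noise_db[f] >= min_diff_db
    )


# Hypothetical band levels in dB, keyed by nominal centre frequency in Hz.
speech_db = {500: 52.0, 1000: 60.0, 1250: 58.0, 2000: 55.0, 4000: 50.0}
noise_db = {500: 58.0, 1000: 45.0, 1250: 44.0, 2000: 42.0, 4000: 36.0}

# Bands kept as the pass bands of the filter (assumed margin: 10 dB).
pass_bands = select_bands(speech_db, noise_db, min_diff_db=10.0)
print(pass_bands)
```

With these made-up levels the 500 Hz band is rejected because the noise is louder than the speech there, while the remaining bands would form the pass bands of the filter.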

[0027] Voice extraction starts at step s100 of the flowchart shown in FIG. 3. In step s100, the microphone 10 first detects voice that includes background noise. The detected sound is converted into a digital signal by the A/D conversion device 140 and, in step s110, passed through the filter unit 30 based on band level analysis. As described above, the filter unit 30 is a band-pass filter set in advance, from the result of the band level analysis, to pass the above frequency bands. As a result, the bands in which the difference between voice and background noise is pronounced, for example the 1 kHz band, the 1.25 kHz band, and the bands from 2 kHz to 4 kHz, are passed. That is, the voice is relatively emphasized. Then, in step s120, the threshold determination unit 40 determines a threshold based on the power of the selected bands. For example, the level 20 dB below the maximum level is used as the threshold.
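
As a rough illustration of step s120, the following Python sketch computes frame levels from an already band-pass-filtered signal and places the threshold 20 dB below the maximum level. The framing scheme, the function names, and the small power floor are assumptions made for this example; the patent only states that the threshold is derived from the power of the selected bands.

```python
import math


def frame_levels_db(samples, frame_len):
    """Mean-square level in dB of each complete frame of the filter output."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # The floor avoids log10(0) for silent frames (an assumption of this sketch).
        levels.append(10.0 * math.log10(max(power, 1e-12)))
    return levels


def threshold_db(levels_db, offset_db=20.0):
    """Threshold placed offset_db below the maximum frame level."""
    return max(levels_db) - offset_db


# Hypothetical filter output: a quiet frame followed by a loud frame.
filtered = [0.01, -0.01, 0.01, -0.01, 1.0, -1.0, 1.0, -1.0]
levels = frame_levels_db(filtered, frame_len=4)
threshold = threshold_db(levels)
```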

[0028] Next, the process moves to step s130, where the voice section calculation unit 50 calculates the voice section. That is, from the detection signal stored in time series in the RAM 130, the sections at or above the calculated threshold are set as voice sections. Next (step s140), the frequency component of the background noise is measured in the section immediately before the voice section or in the section after the voice section ends. This is for use in the spectral subtraction of the next stage (step s150). Finally, in step s150, the voice is extracted. The voice extraction unit 60 obtains the voice by subtracting the frequency component of the background noise obtained in step s140 from the frequency component of the detected sound signal in the voice section (spectral subtraction). Because the difference between the background noise level and the voice level is large in this voice section, the spectral subtraction emphasizes the voice even more effectively. That is, the voice can be extracted accurately even under noise. Applying this to voice recognition can therefore improve the voice recognition rate under noise.
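
Steps s130 to s150 can be sketched in Python as follows. This is only an illustrative outline, assuming per-frame levels and per-frame magnitude spectra are already available. The function names are hypothetical, and clamping negative results to zero is a common convention in spectral subtraction that the patent does not specify.

```python
def voice_sections(levels_db, threshold_db):
    """Return (start, end) frame index pairs, end exclusive, where level >= threshold."""
    sections = []
    start = None
    for i, level in enumerate(levels_db):
        if level >= threshold_db and start is None:
            start = i
        elif level < threshold_db and start is not None:
            sections.append((start, i))
            start = None
    if start is not None:
        sections.append((start, len(levels_db)))
    return sections


def noise_spectrum_for(section, spectra):
    """Noise estimate: the frame just before the section, else the frame just after."""
    start, end = section
    if start > 0:
        return spectra[start - 1]
    if end < len(spectra):
        return spectra[end]
    raise ValueError("no noise-only frame adjacent to the section")


def spectral_subtract(spectrum, noise):
    """Subtract the noise spectrum bin by bin, clamping at zero (assumed convention)."""
    return [max(s - n, 0.0) for s, n in zip(spectrum, noise)]


# Hypothetical frame levels (dB) and two-bin magnitude spectra per frame.
levels = [10.0, 60.0, 65.0, 20.0]
spectra = [[1.0, 1.0], [5.0, 3.0], [6.0, 0.5], [1.0, 1.0]]

sections = voice_sections(levels, threshold_db=45.0)
noise = noise_spectrum_for(sections[0], spectra)
extracted = [spectral_subtract(spectra[i], noise) for i in range(*sections[0])]
```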

[0029] (Second Embodiment) In the voice section determination method of the first embodiment, the voice is relatively emphasized by applying the filter unit 30 based on band level analysis, a threshold is calculated from the maximum power of the emphasized critical bands, and the voice section is determined from that threshold. However, if, for example, sudden background noise intrudes into one critical band and the maximum power spikes, the threshold level also rises, which can introduce an error in the voice section. This embodiment aims to extract the voice section accurately even in such cases by deliberately emphasizing the voice in the critical bands.

[0030] FIG. 5 shows the voice extraction device of the present invention. The voice extraction device of this embodiment is characterized in that the filter unit 30 of the first embodiment includes a weighted addition unit 31 that weights each critical band and adds the results. Accordingly, the weighted addition of step s111 shown in FIG. 6 is inserted between step s110 and step s120 of the flowchart of the first embodiment, that is, after the signal passes through the filter unit 30 based on band level analysis in step s110. The specific configuration is the computer device 100 shown in FIG. 2; the weighted addition unit 31 is likewise made up of the CPU 110, a program stored in the ROM 120, the RAM 130, and the like. After the signal passes through the filter unit 30, a spectrum is obtained in bands centered on, for example, 1 kHz, 1.25 kHz, and 4 kHz, as shown in FIG. 4. In these bands the level difference between voice and background noise is pronounced. To emphasize them effectively, the weighted addition unit 31 of the filter unit 30 is used in step s111 to weight those bands according to their maximum levels and add them. For example, the above spectral centers are weighted in a ratio of 3:2:1 and added. In this way, even if sudden noise intrudes and raises the threshold, the result is less affected because the levels of the other critical bands have also been raised by the weighting. The voice section is thus calculated more accurately, and the voice can be extracted more accurately.
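
The weighted addition of step s111 can be pictured with a short Python sketch. It assumes that the 3:2:1 ratio is assigned to the 1 kHz, 1.25 kHz, and 4 kHz bands in that order and that the band levels are linear power values; the patent states neither explicitly, and the function name and sample values are hypothetical.

```python
def weighted_band_sum(band_levels, weights):
    """Weighted sum of the band levels of one frame."""
    return sum(weights[f] * band_levels[f] for f in weights)


# Assumed assignment of the 3:2:1 ratio to the band centres (Hz).
WEIGHTS = {1000: 3.0, 1250: 2.0, 4000: 1.0}

# Hypothetical linear band levels for a single frame.
frame = {1000: 2.0, 1250: 1.0, 4000: 1.0}
combined = weighted_band_sum(frame, WEIGHTS)  # 3*2.0 + 2*1.0 + 1*1.0
```

The combined value would then take the place of the plain filter output when the threshold is determined in step s120.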

[0031] (Third Embodiment) The voice section determination method of the first embodiment selects the critical bands in which the voice is emphasized by the filter unit 30 based on band level analysis, calculates a threshold from the maximum power of those critical bands, and determines the voice section from that threshold. The voice section determination method of the second embodiment weights each critical band of the output of the filter unit 30 and adds the results to emphasize the voice more effectively, thereby calculating the voice section more accurately. This embodiment focuses on the fact that perceived loudness and sharpness emphasize the voice still more effectively, and uses them to calculate the voice section with even higher accuracy.

[0032] FIG. 7 shows the voice extraction device of the present invention. The voice extraction device of this embodiment is characterized in that the filter unit 30 of the first embodiment includes a scale conversion unit 32 that converts sound pressure level into a perceived auditory level, and a weighted addition unit 33 that weights each critical band after scale conversion in proportion to its frequency. Accordingly, as shown in FIG. 8, the scale conversion of step s112, which converts sound pressure level (dB) into perceived loudness (sone), and the weighted addition of step s113 are inserted between step s110 and step s120 of the flowchart of the first embodiment.

[0033] Scale conversion is the conversion of frequency spectrum values from dB values to sone values. The sound pressure picked up by the microphone 10 is detected as a dB value, but human hearing does not perceive sound pressure linearly; it perceives it as a sone value that grows exponentially. When the sone value is S and the sound pressure level is P (in dB), the two are related by $\log_{10} S = 0.03\,(P - 40)$. That is, the sone value increases exponentially with the dB value, so conversion to sone values amounts to emphasizing the voice band. Therefore, performing the scale conversion in step s112 with the scale conversion unit 32 further emphasizes the difference between voice and background noise.
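
The relation $\log_{10} S = 0.03\,(P - 40)$ can be evaluated directly. The Python sketch below is a minimal illustration; the function name and the two sample levels are hypothetical.

```python
def sone_from_db(level_db):
    """Sone value S from a level P in dB, using log10(S) = 0.03 * (P - 40)."""
    return 10.0 ** (0.03 * (level_db - 40.0))


# Hypothetical levels: a speech-dominated band at 60 dB and a noise band at 45 dB.
speech_sone = sone_from_db(60.0)
noise_sone = sone_from_db(45.0)
ratio = speech_sone / noise_sone
```

Under this relation 40 dB maps to 1 sone and each additional 10 dB roughly doubles the sone value, so a band that is stronger in dB stands out further on the sone scale.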

[0034] Then, in step s113, the weighted addition unit 33 performs weighted addition in which the scale-converted values are added with larger weights applied to higher frequency bands. This emphasizes the sharpness of the sound. Human voice contains high-frequency components, for example in the pronunciation of consonants, whereas background noise contains few high-frequency components. Applying larger weights to higher frequency bands therefore emphasizes sharpness, which further emphasizes the acoustic characteristics of the voice. If the threshold is determined and the voice section is calculated, in the same way as in the first and second embodiments, from the spectrum that has been converted to perceived loudness and further emphasized in sharpness, the voice can be extracted still more accurately.
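
The frequency-proportional weighting of step s113 might look like the following Python sketch. It assumes the inputs are band values already converted to sone and that each weight is the band centre frequency divided by a reference frequency; the reference of 1 kHz, the function name, and the sample values are hypothetical choices for this example.

```python
def sharpness_weighted_sum(band_sones, f_ref=1000.0):
    """Sum of sone values, each weighted in proportion to its band frequency."""
    return sum((f / f_ref) * s for f, s in band_sones.items())


# Hypothetical sone values per band centre (Hz) for a single frame.
frame_sones = {1000: 1.0, 2000: 1.0, 4000: 1.0}
emphasized = sharpness_weighted_sum(frame_sones)  # 1*1.0 + 2*1.0 + 4*1.0
```

Equal sone values in each band contribute in the ratio 1:2:4 here, so high-frequency content such as consonants carries more weight in the value used for threshold determination.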

[0035] FIGS. 9 and 10 show the voice extraction result of the conventional method and that of the method of this embodiment. FIG. 9(a) is the voice signal, and FIG. 9(b) is the signal with noise superimposed. In FIG. 10, A is the voice extraction result of this embodiment and B is that of the conventional method. As FIG. 10 shows, extracting the critical bands in which voice is dominant with a filter based on band level analysis, and then applying scale conversion and weighted addition, extracts the voice more accurately than before. Therefore, if the device of this embodiment is adopted in a vehicle-mounted device that requires voice recognition, such as a navigation device, the voice can be extracted accurately, so that, for example, the driver's voice can be recognized with high accuracy. That is, the method and device of this embodiment can contribute to safe driving of the vehicle.

[0036] (Modifications) The above embodiments are examples, and various other modifications are conceivable. For example, the first to third embodiments describe the band level analysis as 1/3-octave band level analysis, but octave band level analysis can of course also be adopted. The voice extraction method and device of the present invention are effective not only under automobile interior noise but also under other general noise. They are effective as a voice extraction device in noisy factory environments and under interior noise in railway cars, ships, aircraft, and the like.

[Brief description of drawings]

FIG. 1 is a system configuration diagram of a voice extraction device according to a first embodiment of the present invention.

FIG. 2 is a system configuration diagram of the voice extraction device according to the first embodiment of the present invention, implemented with a computer device.

FIG. 3 is a flowchart showing the operation of the voice extraction device of the first embodiment of the present invention.

FIG. 4 is a graph showing the results of a 1/3-octave band level analysis of voice and background noise according to the first embodiment of the present invention.

FIG. 5 is a system configuration diagram of a voice extraction device according to a second embodiment of the present invention.

FIG. 6 is a part of a flowchart showing the operation of the voice extraction device of the second embodiment of the present invention.

FIG. 7 is a system configuration diagram of a voice extraction device according to a third embodiment of the present invention.

FIG. 8 is a part of a flowchart showing the operation of the voice extraction device of the third embodiment of the present invention.

FIG. 9 is an explanatory diagram of a voice signal (a) and a signal (b) in which background noise is superimposed on the voice signal.

FIG. 10 is a diagram comparing the voice extraction results of the voice extraction device according to the third embodiment of the present invention and of a conventional device.

[Explanation of symbols]

10 ... Microphone
30 ... Filter unit
40 ... Threshold determination unit
50 ... Voice section calculation unit
60 ... Voice extraction unit

Claims

[Claims]

1. A voice extraction method for use under noise, characterized in that a filter based on a band level analysis result is applied to a detected sound including background noise, a threshold is determined based on the filter output, a section in which the filter output has a level equal to or higher than the threshold is set as a voice section, and voice is extracted from the voice section.

2. The voice extraction method according to claim 1, wherein the band level analysis is an octave band level analysis.

3. The voice extraction method according to claim 1, wherein the action of the filter is weighted addition in which predetermined weights are applied to predetermined bands obtained by the band level analysis and the results are added.

4. The voice extraction method according to claim 1 or 2, wherein the action of the filter is weighted addition in which the level of each predetermined band obtained by the band level analysis is scale-converted into a perceived auditory level, and the converted values are added with greater weights applied to higher frequency bands.

5. The voice extraction method according to claim 1, wherein the voice is extracted by subtracting the frequency component of the background noise from the frequency component of the calculated voice section.

6. A voice extraction device for use under noise, comprising: voice detection means for detecting voice including background noise; filter means for applying a filter characteristic based on band level analysis to the sound detected by the voice detection means; threshold determination means for determining a threshold based on the output of the filter means; voice section calculation means for setting, as a voice section, a section whose level is equal to or higher than the threshold determined by the threshold determination means; and voice extraction means for extracting voice from the voice section.

7. The voice extraction apparatus according to claim 6, wherein the band level analysis used for determining the filter characteristic is an octave band level analysis.

8. The voice extraction device according to claim 6 or claim 7, wherein the filter means performs weighted addition in which predetermined weights are applied to predetermined bands of the detected sound to which the filter characteristic has been applied, and the results are added.

9. The voice extraction device according to claim 6 or 7, wherein the filter means scale-converts the level of each predetermined band of the detected sound to which the filter characteristic has been applied into a perceived auditory level, and performs weighted addition in which the converted values are added with greater weights applied to higher frequency bands.

10. The voice extraction device according to any one of claims 6 to 9, wherein the voice extraction means subtracts the frequency component of the background noise from the frequency component of the detected sound in the voice section calculated by the voice section calculation means.

11. The voice extraction device according to any one of claims 6 to 10, wherein the background noise is noise inside a vehicle and the voice is a driver's instruction to a vehicle-mounted device.