JP3160228B2

JP3160228B2 - Voice section detection method and apparatus

Info

Publication number: JP3160228B2
Application number: JP11282297A
Authority: JP
Inventors: 徹都木; 信正清山; 篤今井
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1997-04-30
Filing date: 1997-04-30
Publication date: 2001-04-25
Anticipated expiration: 2017-04-30
Also published as: JPH10301593A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a voice section detection method and apparatus that discriminate between voice sections and non-voice sections in an input signal. Such discrimination is needed when speech uttered together with noise or background sound, whether in a broadcast program, on a recorded tape, or in daily life, is processed to change its pitch or speaking rate, is recognized automatically for its meaning, or is encoded for transmission or recording.

[0002] [Summary of the Invention] The present invention computes the power of the input signal data at predetermined time intervals, in frames of a predetermined time width, and holds the maximum and minimum values of the power within a predetermined past period. Using a power threshold that varies with the maximum value and with the difference between the maximum and minimum values, it discriminates between voice sections and non-voice sections frame by frame while continually adapting to changes in the power of both the speech and the background sound in the input signal. Voice sections in the input signal are thereby detected accurately. When speech uttered with noise or background sound, in a broadcast program, on a recorded tape, or in daily life, is processed to change its pitch or speaking rate, recognized automatically for its meaning, or encoded for transmission or recording, this improves the quality of the processed speech, the speech recognition rate, the coding efficiency, and the quality of the decoded speech.

[0003] Furthermore, because only power, a feature that is relatively simple to compute, is used, computation time is shortened, cost is reduced, and speech can be processed in real time.

[0004]

[Description of the Related Art] One known conventional voice section detection scheme computes a noise level, a speech level, and similar quantities from the power of the audio signal, sets a level threshold from the result, and compares the input signal with that threshold. The input is judged to be a voice section when its level is above the threshold and a non-voice section when its level is below it.

[0005] There are three representative ways of setting the level threshold used in this scheme. In the first method, the threshold is the noise level measured at the time of speech input plus a predetermined constant. In the second method, which improves on the first, the threshold is set to a relatively large value when the maximum input speech level minus the noise level is large, and to a relatively small value when that difference is small (see, for example, Japanese Patent Application Laid-Open Nos. 58-130395 and 61-272796).

[0006] The third method adds continuous observation of the input signal to these threshold-setting procedures. When the input level remains stationary for at least a fixed time, that level is regarded as the noise level, and the threshold for voice section detection is set while the noise level is updated continually (Proceedings of the 1995 IEICE General Conference, D-695, p. 301).

[0007]

[Problems to Be Solved by the Invention] However, the conventional voice section detection schemes described above have the following problems.

[0008] The first method has the advantage of simplicity and works well when the average speech level is moderate. When the average speech level is too high, however, noise tends to be falsely detected as speech, and when it is too low, parts of the speech tend to be dropped from the detected section.

[0009] The second method solves this problem of the first method. However, it assumes that the level of the noise and background sound in the input signal is roughly constant. It therefore follows variations in the speech level, but accurate detection of voice sections is not guaranteed when the level of the noise or background sound changes from moment to moment.

[0010] The third method takes such variation of the noise level into account, so erroneous detection does not occur even when the noise level changes over time.

[0011] In broadcast programs and similar material, however, the input contains not only noise but also background sounds used as effects, such as music and imitative sounds, and their levels generally vary from moment to moment. Speech is often uttered continuously at the same time, so the input signal level may almost never remain stationary for the required time. In such cases even the third method cannot set the noise level correctly, and accurate detection of voice sections is difficult.

[0012] In view of the above circumstances, the object of claim 1 of the present invention is to provide a voice section detection method that can discriminate between voice sections and non-voice sections in real time while continually adapting to changes in the levels of both the input speech and the background sound.

[0013] The object of claim 2 is to provide a voice section detection apparatus that uses only power, a feature that is relatively simple to compute, and can therefore shorten computation time and reduce cost while discriminating between voice sections and non-voice sections in real time, continually adapting to changes in the levels of both the input speech and the background sound.

[0014]

[Means for Solving the Problems] To achieve the above objects, the voice section detection method according to the present invention calculates, for the input signal data, the frame power over a predetermined frame width at predetermined time intervals, and holds the maximum and minimum values of the frame power within a predetermined past period. It determines a threshold P_thr for the frame power that is lower than the held maximum value by a predetermined value a. When the difference d between the maximum value and the minimum value falls below a predetermined reference value, the threshold P_thr is adjusted automatically: the value a is reduced according to the difference d, which raises P_thr. The threshold P_thr is then compared with the frame power of the current frame to determine whether the current frame is a voice section or a non-voice section.

[0015] The voice section detection apparatus according to the present invention comprises: a power calculation unit that calculates, for the input signal data, the frame power over a predetermined frame width at predetermined time intervals; an instantaneous power maximum value holding unit that holds the maximum value of the frame power within a predetermined past period; an instantaneous power minimum value holding unit that holds the minimum value of the frame power within a predetermined past period; a power threshold determination unit that determines a threshold P_thr for the frame power that is lower than the held maximum value by a predetermined value a and, when the difference d between the held maximum value and the held minimum value falls below a predetermined reference value, automatically adjusts P_thr by reducing the value a according to the difference d so that P_thr is raised; and a determination unit that compares the threshold P_thr obtained by the power threshold determination unit with the frame power of the current frame to determine whether the frame is a voice section or a non-voice section.

[0016] With the above configuration, the voice section detection method according to the present invention calculates, for the input signal data, the frame power over a predetermined frame width at predetermined time intervals, and holds the maximum and minimum values of the frame power within a predetermined past period. It determines a threshold P_thr for the frame power that is lower than the held maximum value by a predetermined value a and, when the difference d between the maximum value and the minimum value falls below a predetermined reference value, automatically adjusts P_thr by reducing the value a according to the difference d so that P_thr is raised. It then compares P_thr with the frame power of the current frame to determine whether the current frame is a voice section or a non-voice section. In this way, voice sections and non-voice sections are discriminated in real time while the method continually adapts to changes in the levels of both the input speech and the background sound.

[0017] In the voice section detection apparatus according to the present invention, the power calculation unit processes the input signal data at predetermined time intervals, in frames of a predetermined time width, and calculates the frame power. The instantaneous power maximum value holding unit and the instantaneous power minimum value holding unit hold the maximum and minimum values of the frame power within a predetermined past period. The power threshold determination unit determines a threshold P_thr for the frame power that is lower than the held maximum value by a predetermined value a and, when the difference d between the held maximum value and the held minimum value falls below a predetermined reference value, automatically adjusts P_thr by reducing the value a according to the difference d so that P_thr is raised. The determination unit compares the threshold P_thr obtained by the power threshold determination unit with the frame power of the current frame to determine whether the frame is a voice section or a non-voice section. Because only frame power, a feature that is relatively simple to compute, is used, computation time is shortened and cost is reduced, while voice sections and non-voice sections are discriminated in real time with continual adaptation to changes in the levels of both the input speech and the background sound.

[0018]

BEST MODE FOR CARRYING OUT THE INVENTION

<< Basic Principle of the Invention >> The voice section detection method and apparatus according to the present invention use the power of the input signal as the index. They rely on the observation that variation in the speech level of the input signal is reflected in the maximum value of the power received up to the present, whereas variation in the background sound level is reflected in the minimum value of the power received up to the present. The threshold for voice/non-voice discrimination is determined as follows. When almost no noise is present, the base threshold is the maximum value of the recently received power minus a predetermined value. As the maximum value minus the minimum value becomes smaller (that is, as the S/N ratio falls), a correction is applied that raises the threshold.

[0019] The power of the input audio data is then calculated at predetermined time intervals, in frames of a predetermined time width. While the maximum and minimum values of the power within a predetermined past period are held, a power threshold that varies with the maximum value and with the difference between the maximum and minimum values is used to discriminate between voice sections and non-voice sections frame by frame, continually adapting to changes in the power of both the input speech and the background sound.

[0020] << Description of the Embodiment >> FIG. 1 is a block diagram showing an example of a voice section detection apparatus as an embodiment of the present invention.

[0021] The voice section detection apparatus 1 shown in this figure comprises: a power calculation unit 2 that calculates, for digitized input signal data, the power over a predetermined frame width at predetermined time intervals; an instantaneous power maximum value holding unit 3 that holds the maximum value of the frame power within a predetermined past period; an instantaneous power minimum value holding unit 4 that holds the minimum value of the frame power within a predetermined past period; a power threshold determination unit 5 that determines a power threshold that varies with both the maximum value held in units 3 and 4 and the difference between the maximum and minimum values; and a discrimination unit 6 that compares the threshold determined by the power threshold determination unit 5 with the power of the current frame to determine whether the frame is a voice section or a non-voice section.

[0022] The voice section detection apparatus 1 calculates the power of the input signal data at predetermined time intervals, in frames of a predetermined time width. While holding the maximum and minimum values of the power within a predetermined past period, it uses a power threshold that varies with the maximum value and with the difference between the maximum and minimum values to discriminate between voice sections and non-voice sections frame by frame, continually adapting to changes in the power of both the input speech and the background sound.

[0023] The power calculation unit 2 computes the sum of squares or the mean square of the signal over a frame width of, for example, 20 ms at intervals of, for example, 5 ms, and converts the result to a logarithmic (decibel) value. This value is taken as the frame power P at that time and is supplied to the instantaneous power maximum value holding unit 3, the instantaneous power minimum value holding unit 4, and the discrimination unit 6.
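The frame power computation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `frame_powers_db`, the `floor` guard against taking the logarithm of zero, and the choice of the mean square (rather than the sum of squares) with a 0 dB reference at unit amplitude are all assumptions. Only the 5 ms interval and 20 ms frame width come from the text.

```python
import math


def frame_powers_db(samples, sample_rate, hop_ms=5.0, frame_ms=20.0, floor=1e-12):
    """Return the mean-square power of each frame in dB.

    hop_ms and frame_ms default to the example values in the description.
    floor is an assumed guard so that silent frames do not hit log10(0).
    """
    hop = int(round(sample_rate * hop_ms / 1000.0))
    width = int(round(sample_rate * frame_ms / 1000.0))
    powers = []
    # Slide a frame of `width` samples forward by `hop` samples each step.
    for start in range(0, len(samples) - width + 1, hop):
        frame = samples[start:start + width]
        mean_square = sum(x * x for x in frame) / width
        powers.append(10.0 * math.log10(max(mean_square, floor)))
    return powers
```

At a sampling rate of 8 kHz, for example, each frame spans 160 samples and successive frames start 40 samples apart, so adjacent frames overlap by 75 percent.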

[0024] The instantaneous power maximum value holding unit 3 is designed to hold the maximum value of the frame power P within a predetermined past period (for example, 6 seconds) and constantly supplies the held value P_upper to the power threshold determination unit 5. Whenever the power calculation unit 2 supplies a frame power P such that P > P_upper, the held value is updated immediately.

[0025] Similarly, the instantaneous power minimum value holding unit 4 is designed to hold the minimum value of the frame power P within a predetermined past period (for example, 4 seconds) and constantly supplies the held value P_lower to the power threshold determination unit 5. Whenever the power calculation unit 2 supplies a frame power P such that P < P_lower, the held value is updated immediately.
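One way to realize the two holding units is a sliding-window extreme tracker. The sketch below uses a monotonic deque, which is an implementation choice of this example and is not described in the source. The class name `SlidingExtreme` and its parameters are hypothetical. With the 5 ms frame interval given as an example, a 6 second window would be 1200 frames and a 4 second window 800 frames.

```python
from collections import deque


class SlidingExtreme:
    """Track the maximum (or minimum) of the last `window_frames` values."""

    def __init__(self, window_frames, keep_max=True):
        self.window = window_frames
        self.keep_max = keep_max
        self.queue = deque()  # (frame_index, power) pairs, best candidate first
        self.index = -1

    def update(self, power):
        """Feed one frame power and return the current held extreme."""
        self.index += 1
        # Discard older candidates that the new value makes irrelevant,
        # so a new extreme replaces the held value immediately.
        while self.queue and (
            self.queue[-1][1] <= power if self.keep_max else self.queue[-1][1] >= power
        ):
            self.queue.pop()
        self.queue.append((self.index, power))
        # Discard candidates that have fallen out of the time window.
        while self.queue[0][0] <= self.index - self.window:
            self.queue.popleft()
        return self.queue[0][1]
```

A maximum holder would be created as `SlidingExtreme(1200, keep_max=True)` and a minimum holder as `SlidingExtreme(800, keep_max=False)` under the example timings.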

[0026] The power threshold determination unit 5 uses the maximum value P_upper and the minimum value P_lower held in the instantaneous power maximum value holding unit 3 and the instantaneous power minimum value holding unit 4 to determine the power threshold P_thr, for example by the following equations, and supplies it to the discrimination unit 6.

[0027]

[Equation 1]

When $P_{upper} - P_{lower} \ge 60$ [dB]:

$$P_{thr} = P_{upper} - 35 \qquad (1)$$

When $P_{upper} - P_{lower} < 60$ [dB]:

$$P_{thr} = P_{upper} - 35 + 35 \cdot \left\{ 1 - \frac{P_{upper} - P_{lower}}{60} \right\} \qquad (2)$$
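As a reading aid, equation (2) can be simplified by writing d for the difference between P_upper and P_lower, the symbol the description uses elsewhere for this quantity. The rearrangement below is not in the source; it only expands the braces.

```latex
P_{thr} = P_{upper} - 35 + 35 - \frac{35}{60}\,d
        = P_{upper} - \frac{35}{60}\,d , \qquad d = P_{upper} - P_{lower} < 60
```

For instance, d = 30 dB gives a threshold 17.5 dB below P_upper. At d = 60 dB the expression equals equation (1), so the threshold is continuous across the two cases.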

[0028] However, to prevent the apparatus from malfunctioning when the background sound level approaches the speech level, P_thr is preferably limited to an upper bound of P_thr = P_upper - 13. The constant 35 in the above equations corresponds to the base threshold used when almost no noise is present, as described above.
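Equations (1) and (2) together with the recommended upper bound can be sketched as a single function. The function and parameter names are hypothetical; the default values 35, 60 and 13 are the example constants given in the text.

```python
def power_threshold_db(p_upper, p_lower, base_offset=35.0,
                       reference_range=60.0, min_offset=13.0):
    """Return P_thr in dB from the held maximum and minimum frame powers."""
    d = p_upper - p_lower
    if d >= reference_range:
        # Equation (1): little background sound, use the base threshold.
        threshold = p_upper - base_offset
    else:
        # Equation (2): raise the threshold as the max-min difference shrinks.
        threshold = p_upper - base_offset + base_offset * (1.0 - d / reference_range)
    # Recommended upper bound: never closer than min_offset dB to the maximum.
    return min(threshold, p_upper - min_offset)
```

With these constants the bound takes effect when d is below roughly 22.3 dB, because equation (2) then yields an offset smaller than 13 dB.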

[0029] The discrimination unit 6 compares the power P supplied for each frame by the power calculation unit 2 with the threshold P_thr supplied by the power threshold determination unit 5. For each frame, it judges the frame to be a voice section if P > P_thr and a non-voice section if P ≦ P_thr, and it outputs a voice/non-voice discrimination signal based on these results.
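Putting the stages together, a compact frame-by-frame detector might look like the sketch below. It is self-contained and deliberately simple: the window extremes are recomputed with `max()` and `min()` over bounded deques, the threshold uses the algebraically equivalent form of equations (1) and (2) with the 13 dB bound, and the default window lengths assume the 5 ms frame interval (1200 frames for 6 seconds, 800 frames for 4 seconds). The function name and parameters are hypothetical.

```python
from collections import deque


def detect_voice(frame_powers_db, max_window=1200, min_window=800):
    """Return one boolean per frame: True for voice, False for non-voice."""
    recent_for_max = deque(maxlen=max_window)  # history for P_upper
    recent_for_min = deque(maxlen=min_window)  # history for P_lower
    decisions = []
    for p in frame_powers_db:
        # The current frame is included so a new extreme applies immediately.
        recent_for_max.append(p)
        recent_for_min.append(p)
        p_upper = max(recent_for_max)
        p_lower = min(recent_for_min)
        d = p_upper - p_lower
        # Offset below P_upper: 35 dB when d >= 60, else 35 * d / 60.
        offset = 35.0 if d >= 60.0 else 35.0 * d / 60.0
        # Enforce the recommended bound P_thr <= P_upper - 13.
        p_thr = p_upper - max(offset, 13.0)
        decisions.append(p > p_thr)
    return decisions
```

For the input `[-20, -20, -60, -60, -20, -60]` this returns `[True, True, False, False, True, False]`. Note that while the history contains only one level, d is zero and the 13 dB bound governs, so a constant-level input is labelled voice by this sketch; the source does not describe a start-up procedure.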

[0030] As a result, as shown in FIG. 2, while the value of the input signal data varies, the maximum value P_upper and the minimum value P_lower are held in the instantaneous power maximum value holding unit 3 and the instantaneous power minimum value holding unit 4, respectively, on the basis of the power P output by the power calculation unit 2. The threshold P_thr is determined from P_upper and P_lower, and each frame is judged to be a voice section or a non-voice section on the basis of P_thr.

[0031] As described above, this embodiment calculates the frame power of the input signal data at predetermined time intervals, in frames of a predetermined time width, and holds the maximum and minimum values of the frame power within a predetermined past period. It determines a threshold P_thr for the frame power that is lower than the held maximum value by a predetermined value a and, when the difference d between the maximum value and the minimum value falls below a predetermined reference value, automatically adjusts P_thr by reducing the value a according to the difference d so that P_thr is raised. It then compares P_thr with the frame power of the current frame to determine whether the current frame is a voice section or a non-voice section. Consequently, for speech uttered with noise or background sound in a broadcast program, on a recorded tape, or in daily life, each frame can be judged accurately to be a voice section or a non-voice section.

[0032] In this embodiment the background sound level is estimated from the minimum value of the instantaneous power within a predetermined past period. Voice sections and non-voice sections in the input signal can therefore be discriminated even when, as in a broadcast program, the background sound level varies from moment to moment while speech continues to be uttered at the same time.

[0033] As a result, when the speech in the input signal is (a) processed to change its pitch or speaking rate, (b) recognized automatically for its meaning, or (c) encoded for transmission or recording, the quality of the processed speech, the speech recognition rate, the coding efficiency, and the quality of the decoded speech can all be improved.

[0034] In addition, because only power, a feature that is relatively simple to compute, is used, computation time can be shortened, the configuration of the whole apparatus can be simplified to reduce cost, and speech can be processed in real time.

[0035]

[Effects of the Invention] As described above, according to claim 1 of the present invention, voice sections and non-voice sections can be discriminated in real time while continually adapting to changes in the levels of both the input speech and the background sound.

[0036] According to claim 2, because only power, a feature that is relatively simple to compute, is used, voice sections and non-voice sections can be discriminated in real time, with continual adaptation to changes in the levels of both the input speech and the background sound, while computation time is shortened and cost is reduced.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an example of a voice section detection apparatus to which an embodiment of the voice section detection method and apparatus according to the present invention is applied.

FIG. 2 is a schematic diagram showing an example of the operation of the voice section detection apparatus shown in FIG. 1.

[Explanation of symbols]

1 voice section detection apparatus
2 power calculation unit
3 instantaneous power maximum value holding unit
4 instantaneous power minimum value holding unit
5 power threshold determination unit
6 discrimination unit

Continuation of the front page

(56) References cited: JP-A-8-294199 (JP, A)
(58) Fields searched (Int. Cl.7, DB name): G10L 11/02

Claims

(57) [Claims]

1. A voice section detection method comprising: calculating, for input signal data, frame power over a predetermined frame width at predetermined time intervals; holding a maximum value and a minimum value of the frame power within a predetermined past period; determining a threshold P_thr for the frame power that is lower than the held maximum value by a predetermined value a; when the difference d between the maximum value and the minimum value becomes smaller than a predetermined reference value, automatically adjusting the threshold P_thr by reducing the value a according to the difference d so that the threshold P_thr is raised; and comparing the threshold P_thr with the frame power of the current frame to determine whether the current frame is a voice section or a non-voice section.

2. A voice section detection apparatus comprising: a power calculation unit that calculates, for input signal data, frame power over a predetermined frame width at predetermined time intervals; an instantaneous power maximum value holding unit that holds a maximum value of the frame power within a predetermined past period; an instantaneous power minimum value holding unit that holds a minimum value of the frame power within a predetermined past period; a power threshold determination unit that determines a threshold P_thr for the frame power that is lower than the maximum value held in the instantaneous power maximum value holding unit by a predetermined value a and, when the difference d between the maximum value held in the instantaneous power maximum value holding unit and the minimum value held in the instantaneous power minimum value holding unit becomes smaller than a predetermined reference value, automatically adjusts the threshold P_thr by reducing the value a according to the difference d so that the threshold P_thr is raised; and a determination unit that compares the threshold P_thr obtained by the power threshold determination unit with the frame power of the current frame to determine whether the frame is a voice section or a non-voice section.