JP5621786B2 - Voice detection device, voice detection method, and voice detection program

Info

Publication number: JP5621786B2
Application number: JP2011547442A
Authority: JP
Inventors: 田中 大介; 荒川 隆行; 花沢 健; 長田 誠也; 岡部 浩司
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2009-12-24
Filing date: 2010-11-26
Publication date: 2014-11-12
Anticipated expiration: 2030-11-26
Also published as: WO2011077924A1; JPWO2011077924A1

Description

The present invention relates to a voice detection device, a voice detection method, and a voice detection program for detecting speech sections.

Voice detection technology is widely used for several purposes: to improve voice transmission efficiency in mobile communication and the like, either by raising the compression rate of non-speech sections or by not transmitting those sections at all; to estimate or determine noise in non-speech sections in noise cancellers, echo cancellers, and the like; and to improve recognition performance and reduce the amount of processing in speech recognition systems.
FIG. 14 is a block diagram showing a configuration example of a general voice detection device. Patent Document 1 discloses an invention corresponding to the voice detection device illustrated in FIG. 14.
The general voice detection device shown in FIG. 14 includes: a waveform cutout unit 101 that cuts out and acquires the input signal in units of frames; a feature quantity calculation unit 102 that calculates, from the input signal of each cut-out frame, a feature quantity used for voice detection; a speech/non-speech determination unit 104 that compares, frame by frame, the calculated feature quantity with a threshold stored in a threshold storage unit 103 and determines whether the input signal is a signal based on speech or a signal based on non-speech; a determination result holding unit 105 that holds the per-frame determination results over a plurality of frames; and a speech/non-speech section shaping unit 107 that shapes the determination results of the plurality of frames held in the determination result holding unit 105, based on section shaping rules stored in a section shaping rule storage unit 106, and decides whether a section is a speech section or a non-speech section.
Here, cutting out and acquiring the input signal in units of frames means extracting the input signal that is input between a given time and the time at which one unit time has elapsed. A frame is each of the intervals obtained by dividing the time during which the input signal is input into unit times. A section shaping rule is, for example, a rule that, when an input signal based on speech (or on non-speech) is determined to have been input over a plurality of consecutive frames, decides those frames to be one speech section (or one non-speech section).
Patent Document 1 discloses, as an example of the feature quantity calculated by the feature quantity calculation unit 102, a value obtained by smoothing the fluctuation of the spectral power and then smoothing that fluctuation again. Section 4.3.3 of Non-Patent Document 1 discloses the SNR (signal-to-noise ratio) value as an example of the feature quantity, and Section 4.3.5 discloses the averaged SNR value. Section B.3.1.4 of Non-Patent Document 2 discloses the zero-crossing count as an example of the feature quantity, and Non-Patent Document 3 discloses, as an example of the feature quantity, a likelihood ratio using a speech GMM (Gaussian mixture model) and a silence GMM.
The speech/non-speech determination unit 104 compares a threshold determined in advance by experiment with the feature quantity of each frame. It determines that the input signal is based on speech when the feature quantity is at or above the threshold, and that the input signal is based on non-speech when the feature quantity is below the threshold.
Patent Document 2 discloses a method of updating the threshold for each utterance. FIG. 15 is a block diagram showing a voice detection device that changes the voice detection threshold. Patent Document 2 discloses an invention corresponding to the voice detection device illustrated in FIG. 15. The voice detection threshold setting unit 18 calculates a spectral power threshold for determining whether a section is a speech section, based on the maximum spectral power of the speech section and the average spectral power of the background-noise section that is not a speech section, and updates the threshold to the calculated value.

Patent Document 1: JP 2006-209069 A (paragraphs 0018 to 0059, FIG. 1)
Patent Document 2: JP H07-92989 A (paragraphs 0008 to 0014, FIG. 1)

Non-Patent Document 1: "Technical Description of VAD Option 2", European Telecommunications Standards Institute (ETSI), France, December 1999, ETSI EN 301 708 V7.1.1, pp. 17-26
Non-Patent Document 2: "ITU-T Recommendation G.729", [online], January 2007, ITU-T, [retrieved December 9, 2009], Internet <URL: http://www.itu.int/rec/T-REC-G.729-200701-I/en>
Non-Patent Document 3: Akinobu Lee and 4 others, "Noise Robust Real World Spoken Dialog System using GMM Based Rejection of Unintended Inputs", ICSLP (International Conference on Spoken Language Processing), Korea, October 4, 2004, ICSLP-2004, Vol. 1, pp. 173-176

However, in order to set the threshold, the voice detection device shown in FIG. 14 must measure in advance both the average noise power, from a plurality of frames in which only noise is input, and the maximum spectral power in a section composed of frames in which a speech signal is input. It therefore cannot cope with an environment in which the noise or the maximum spectral power changes constantly.
The voice detection device shown in FIG. 15 must perform voice detection and obtain the spectral power of the background noise in order to determine the threshold, but if the detection accuracy is low, it may be unable to estimate the noise. For example, when a speech section continues from the very beginning of the input signal, or when background noise exceeding the threshold continues and is judged to be a speech section, it becomes difficult for the device to obtain the spectral power of the background noise. As a result, the device cannot determine or update the threshold.
To solve the above problems, an object of the present invention is to provide a voice detection device, a voice detection method, and a voice detection program that can detect speech sections even when the noise changes or when noise or a speech section continues from the beginning of the input signal.

A voice detection device according to the present invention comprises: feature quantity calculation means for calculating a feature quantity of the input signal for each frame, a frame being the input signal of one unit time; speech/non-speech determination means for comparing the feature quantity with a threshold and determining whether a section is a speech section, in which a signal based on speech is input over a plurality of frames, or a non-speech section, in which a signal based on non-speech is input over a plurality of frames; long-section feature quantity calculation means for calculating a long-section feature quantity, which is a feature quantity of the speech section or non-speech section, based on a statistical value of the feature quantities, calculated by the feature quantity calculation means, of the plurality of frames constituting that section; and threshold update means for calculating, using the long-section feature quantity, a non-speech probability, which is the probability that the speech section or non-speech section is a section in which a signal based on non-speech was input, and for updating the voice detection threshold based on the calculated non-speech probability.
A voice detection method according to the present invention comprises: calculating a feature quantity of the input signal for each frame, a frame being the input signal within one unit time; comparing the feature quantity with a threshold and determining whether a section is a speech section, in which a signal based on speech is input over a plurality of frames, or a non-speech section, in which a signal based on non-speech is input over a plurality of frames; calculating a long-section feature quantity, which is a feature quantity of the speech section or non-speech section, based on a statistical value of the feature quantities of the plurality of frames constituting that section; calculating, using the long-section feature quantity, a non-speech probability, which is the probability that the speech section or non-speech section is a section in which a signal based on non-speech was input; and updating the voice detection threshold based on the calculated non-speech probability.
A voice detection program according to the present invention, stored on a program recording medium, causes a computer to execute: feature quantity calculation processing for calculating a feature quantity of the input signal for each frame, a frame being the input signal of one unit time; speech/non-speech determination processing for comparing the feature quantity with a threshold and determining whether a section is a speech section, in which a signal based on speech is input over a plurality of frames, or a non-speech section, in which a signal based on non-speech is input over a plurality of frames; long-section feature quantity calculation processing for calculating a long-section feature quantity, which is a feature quantity of the speech section or non-speech section, based on a statistical value of the feature quantities, calculated in the feature quantity calculation processing, of the plurality of frames constituting that section; and threshold update processing for calculating, using the long-section feature quantity, a non-speech probability, which is the probability that the speech section or non-speech section is a section in which a signal based on non-speech was input, and for updating the voice detection threshold based on the calculated non-speech probability.

The present invention provides a voice detection device, a voice detection method, and a voice detection program that can detect speech sections with high accuracy even in a noisy environment, including the case where background noise exceeding the threshold occurs at the beginning of the input.

FIG. 1 is a block diagram showing a configuration example of the first embodiment of the voice detection device according to the present invention.
FIG. 2 is a flowchart showing the operation of the voice detection device of the first embodiment of the present invention.
FIG. 3 is an explanatory diagram showing the function G of the first embodiment.
FIG. 4 is an explanatory diagram showing an example of changing the threshold.
FIG. 5 is an explanatory diagram showing an example in which the threshold before the update was too small.
FIG. 6 is an explanatory diagram showing an example in which the threshold before the update was too large.
FIG. 7 is a block diagram showing a configuration example of the second embodiment of the voice detection device according to the present invention.
FIG. 8 is a block diagram showing a configuration example of the third embodiment of the voice detection device according to the present invention.
FIG. 9 is a block diagram showing another example of the third embodiment of the voice detection device.
FIG. 10 is an explanatory diagram showing the function for obtaining the non-speech probability α in the third embodiment of the present invention.
FIG. 11 is an explanatory diagram showing the function for obtaining the non-speech probability α in the fifth embodiment of the present invention.
FIG. 12 is a block diagram showing a configuration example of the sixth embodiment of the voice detection device according to the present invention.
FIG. 13 is a block diagram showing an outline of the present invention.
FIG. 14 is a block diagram showing a configuration example of a general voice detection device.
FIG. 15 is a block diagram showing a voice detection device that changes the voice detection threshold.

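Embodiment 1.
A first embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing a configuration example of the first embodiment of the voice detection device according to the present invention. As shown in FIG. 1, the voice detection device of the first embodiment includes a waveform cutout unit 101, a feature quantity calculation unit 102, a threshold storage unit 103, a speech/non-speech determination unit 104, a determination result holding unit 105, a section shaping rule storage unit 106, a speech/non-speech section shaping unit 107, a long-section feature quantity calculation unit 108, and a threshold update unit 109.
The waveform cutout unit 101 cuts out and acquires the input signal in units of frames. Specifically, the waveform cutout unit 101 cuts out and acquires, for example, the input signal of each predetermined unit time. The feature quantity calculation unit 102 calculates a feature quantity used for voice detection from the input signal of each frame cut out by the waveform cutout unit 101. The threshold storage unit 103 stores a threshold for determining whether the input signal is based on speech or on non-speech.
The speech/non-speech determination unit 104 compares, frame by frame, the feature quantity calculated by the feature quantity calculation unit 102 with the threshold stored in the threshold storage unit 103, and determines whether the input signal of that frame is based on speech or on non-speech. A frame whose input signal is based on speech is called a speech frame, and a frame whose input signal is based on non-speech is called a non-speech frame. The determination result holding unit 105 holds the per-frame determination results of the speech/non-speech determination unit 104 over a plurality of frames.
The section shaping rule storage unit 106 stores section shaping rules. The speech/non-speech section shaping unit 107 shapes the determination results of the plurality of frames held in the determination result holding unit 105, based on the section shaping rules stored in the section shaping rule storage unit 106, and decides whether a section is a speech section or a non-speech section. Specifically, when a plurality of speech frames occur consecutively, for example, the speech/non-speech section shaping unit 107 decides that those frames form one speech section. Likewise, when a plurality of non-speech frames occur consecutively, it decides that those frames form one non-speech section. The speech/non-speech section shaping unit 107 may also decide that a plurality of consecutive frames form one speech section when the proportion of speech frames among them exceeds a predetermined proportion, or one non-speech section when the proportion of non-speech frames exceeds a certain proportion.
The long-section feature quantity calculation unit 108 calculates, for each speech section and non-speech section decided by the speech/non-speech section shaping unit 107, a long-section feature quantity by statistically processing the per-frame feature quantities calculated by the feature quantity calculation unit 102.
The threshold update unit 109 uses the long-section feature quantity calculated by the long-section feature quantity calculation unit 108 to calculate the non-speech probability of each speech section and non-speech section decided by the speech/non-speech section shaping unit 107, and changes the threshold stored in the threshold storage unit 103. As described later, the non-speech probability is the probability that the input signal of the section is based on non-speech.
The voice detection device is realized, for example, by a computer on which a voice detection program is installed.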
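Next, the operation of the voice detection device of the first embodiment of the present invention will be described with reference to the drawings. FIG. 2 is a flowchart showing the operation of the voice detection device of the first embodiment.
First, the waveform cutout unit 101 cuts the collected time-series input sound data, input from a microphone (not shown), into frames of unit time (step S101). For example, when the input sound data is in 16-bit linear PCM (pulse code modulation) format with a sampling frequency of 8000 Hz, the waveform data consists of 8000 samples of input sound data per second, and this waveform data is stored in the frames.
The waveform cutout unit 101 cuts out this waveform data sequentially in time order with, for example, a frame width of 200 samples (25 milliseconds) and a frame shift of 80 samples (10 milliseconds).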
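The frame cutting of step S101 can be sketched as follows. This is a minimal illustration only: the function name `split_into_frames` and the choice to drop a trailing partial frame are assumptions, not details given in the text; the frame width of 200 samples and the shift of 80 samples are the example values stated above for 8000 Hz input.

```python
def split_into_frames(samples, frame_width=200, frame_shift=80):
    """Return overlapping frames of `frame_width` samples, advancing by `frame_shift`.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    frames = []
    start = 0
    while start + frame_width <= len(samples):
        frames.append(samples[start:start + frame_width])
        start += frame_shift
    return frames


# One second of 8000 Hz audio; zeros stand in for real PCM samples.
one_second = [0] * 8000
frames = split_into_frames(one_second)
```

With these values each frame covers 25 ms and consecutive frames overlap by 120 samples.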
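Next, the feature quantity calculation unit 102 calculates a feature quantity from the waveform cut out for each frame (step S102). The feature quantity calculated by the feature quantity calculation unit 102 is, for example, the spectral power, the SNR, the zero-crossing count, or a likelihood.
The speech/non-speech determination unit 104 compares the threshold stored in the threshold storage unit 103 with the feature quantity calculated by the feature quantity calculation unit 102. It determines that the frame is a speech frame when the feature quantity exceeds the threshold, and a non-speech frame when it does not (step S103). Whether the speech/non-speech determination unit 104 treats a frame as a speech frame or as a non-speech frame when the feature quantity is exactly equal to the threshold may be decided in advance, and the unit then determines the frame accordingly.
The determination result holding unit 105 holds, for a plurality of frames, the results determined by the speech/non-speech determination unit 104 in step S103 (step S104).
Because the speech/non-speech determination unit 104 makes its determination frame by frame, speech sections and non-speech sections of short duration can arise. The speech/non-speech section shaping unit 107 shapes the sections to suppress them (step S105).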
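Steps S103 to S105 can be sketched as a per-frame threshold comparison followed by section shaping. The text does not fix a specific shaping rule, so the rule used here (runs shorter than `min_len` frames are absorbed into the preceding section) is an assumption for illustration, as are the function names. Section ends are exclusive frame indices.

```python
def classify_frames(features, threshold):
    """Step S103: a frame is a speech frame when its feature exceeds the threshold."""
    return [f > threshold for f in features]


def shape_sections(labels, min_len=3):
    """Step S105 (assumed rule): return (is_speech, start, end) sections,
    absorbing runs shorter than `min_len` frames into the preceding section."""
    runs = []
    for i, label in enumerate(labels):
        if runs and runs[-1][0] == label:
            runs[-1][2] = i + 1               # extend the current run
        else:
            runs.append([label, i, i + 1])    # start a new run

    shaped = []
    for label, start, end in runs:
        if shaped and (end - start < min_len or shaped[-1][0] == label):
            shaped[-1][2] = end               # absorb a short or same-label run
        else:
            shaped.append([label, start, end])
    return [tuple(s) for s in shaped]


features = [0.1, 0.2, 0.1, 0.9, 0.8, 0.1, 0.9, 0.9, 0.1, 0.1, 0.1]
sections = shape_sections(classify_frames(features, threshold=0.5), min_len=2)
```

In this example the single low-feature frame at index 5 is absorbed, so frames 3 to 7 form one speech section.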
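For the shaped speech sections and non-speech sections obtained by the speech/non-speech section shaping unit 107 in step S105, the long-section feature quantity calculation unit 108 statistically processes the per-frame feature quantities calculated by the feature quantity calculation unit 102 in step S102 and calculates the long-section feature quantity (step S106). The long-section feature quantity is, for example, one of the spectral power, the SNR, the zero-crossing count, a likelihood, and the like, or a combination of two or more of them.
One example of the statistical processing performed by the long-section feature quantity calculation unit 108 is to calculate the average of the per-frame feature quantities in a shaped speech section. Besides the average, the long-section feature quantity calculation unit 108 may use the mode, the median, or the value located around the top 40% when the per-frame feature quantities are sorted in descending order. The figure of 40% is only an example, and any proportion chosen by the user or the like may be used. If the proportion is set to 50%, this method coincides with the median method.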
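The statistical processing of step S106 can be sketched with Python's standard `statistics` module. The function name, the `method` keywords, and the exact index used for the "top fraction" value are assumptions of this sketch; the text only says that the value lies around the chosen proportion from the top.

```python
import statistics


def long_section_feature(frame_features, method="mean", top_fraction=0.4):
    """Reduce the per-frame features of one section to a single long-section feature."""
    if method == "mean":
        return statistics.mean(frame_features)
    if method == "median":
        return statistics.median(frame_features)
    if method == "mode":
        return statistics.mode(frame_features)
    if method == "top_fraction":
        # Sort in descending order and take the value at the given fraction from the top.
        ordered = sorted(frame_features, reverse=True)
        index = min(int(top_fraction * len(ordered)), len(ordered) - 1)
        return ordered[index]
    raise ValueError(f"unknown method: {method}")


section = list(range(1, 11))  # ten illustrative per-frame feature values
top_value = long_section_feature(section, method="top_fraction", top_fraction=0.4)
```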
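The threshold update unit 109 uses the long-section feature quantity calculated by the long-section feature quantity calculation unit 108 in step S106 to calculate the non-speech probability α for a shaped speech section (step S107). Here, the non-speech probability is the probability that the input signal of the section is based on non-speech such as noise. Accordingly, 1 − α corresponds to the probability that the section is speech. α is calculated with the following equations.
\[ \langle F \rangle = \sum_i \omega_i \langle f_i \rangle \qquad (1) \]
\[ \alpha = G[\langle F \rangle] \qquad (2) \]
Here, \(\langle f_i \rangle\) is the long-section feature quantity obtained by applying the statistical processing described above to the per-frame feature quantity \(f_i\), and \(\omega_i\) is the weight applied to \(\langle f_i \rangle\). \(\langle F \rangle\) in equation (1), obtained by multiplying each of several kinds of long-section feature quantities \(\langle f_i \rangle\) (for example, spectral power, SNR, zero-crossing count, and likelihood) by its weight \(\omega_i\) and summing the results, is the integrated long-section feature quantity. G is a function whose variable is the integrated long-section feature quantity \(\langle F \rangle\) (also simply called the long-section feature quantity). FIG. 3 is an explanatory diagram showing the function G of this embodiment. The horizontal axis of FIG. 3 is the value of the long-section feature quantity, and the vertical axis is the non-speech probability α.
In the example shown in FIG. 3, G is a function for which the non-speech probability α is 1 (that is, 100%) when the long-section feature quantity is 0, α is 0 (0%) when the long-section feature quantity is \(\tau_0\), and α is 1 (100%) when the long-section feature quantity is \(\tau_{max}\).
The function shown in FIG. 3 is only an example. Another function may be used, such as one whose value increases as the long-section feature quantity moves away from a moderate value, or a monotonically decreasing (non-increasing) function. Appropriate values for \(\omega_i\) in equation (1) and for \(\tau_0\) and \(\tau_{max}\) shown in FIG. 3 are determined in advance by experiment. If it is difficult to determine \(\omega_i\) experimentally, \(\omega_i\) may be set to the same value (for example, 1) for every long-section feature quantity.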
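Equations (1) and (2) can be sketched as below. The text fixes only three points of G (α = 1 at 0, α = 0 at τ0, α = 1 at τmax); the linear interpolation between them and the clamping to 1 outside the range 0 to τmax are assumptions of this sketch, and the function names are hypothetical.

```python
def integrated_feature(long_features, weights=None):
    """Equation (1): weighted sum of the long-section features <f_i>.

    Weights default to 1 for every feature, as the text permits.
    """
    if weights is None:
        weights = [1.0] * len(long_features)
    return sum(w * f for w, f in zip(weights, long_features))


def non_speech_probability(feature, tau0, tau_max):
    """Equation (2): alpha = G[<F>], using an assumed piecewise-linear G."""
    if feature <= 0 or feature >= tau_max:
        return 1.0
    if feature <= tau0:
        return 1.0 - feature / tau0                  # falls from 1 to 0 on [0, tau0]
    return (feature - tau0) / (tau_max - tau0)       # rises from 0 to 1 on [tau0, tau_max]


alpha = non_speech_probability(integrated_feature([1.0, 3.0]), tau0=2.0, tau_max=6.0)
```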
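Next, the threshold update unit 109 uses the non-speech probability α calculated in step S107 to update the threshold stored in the threshold storage unit 103 (step S108). Specifically, the threshold update unit 109 updates the threshold as follows. First, it calculates a threshold candidate θ' with the following equation.
\[ \theta' = \alpha \times F_{max} + (1 - \alpha) \times F_{min} \qquad (3) \]
Here, \(F_{max}\) is the maximum per-frame feature quantity in the speech section or non-speech section, \(F_{min}\) is the minimum per-frame feature quantity in that section, and α is the non-speech probability of that section. Next, the threshold update unit 109 uses the threshold candidate θ' to update the threshold θ with the following equation.
\[ \theta \leftarrow \theta + \varepsilon \times (\theta' - \theta) \qquad (4) \]
Here, ε is a step size that adjusts the speed of the threshold update. In other words, the voice detection device according to the present invention can adjust how quickly the threshold is updated. It can therefore handle both the case where the threshold should follow temporary changes in the level of the background noise closely and the case where the threshold should change little in response to temporary background noise.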
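Equations (3) and (4) translate almost directly into code. The sketch below is illustrative only; the function names and the default step size of 0.5 are assumptions, since the text leaves ε to be chosen.

```python
def threshold_candidate(alpha, f_max, f_min):
    """Equation (3): interpolate between the section's feature maximum and minimum."""
    return alpha * f_max + (1.0 - alpha) * f_min


def update_threshold(theta, candidate, step=0.5):
    """Equation (4): move the threshold toward the candidate by step size epsilon."""
    return theta + step * (candidate - theta)


# A section judged likely to be noise (alpha = 0.8) pulls the candidate toward F_max.
candidate = threshold_candidate(alpha=0.8, f_max=10.0, f_min=0.0)
new_theta = update_threshold(theta=2.0, candidate=candidate, step=0.5)
```

A step size near 1 makes the threshold track each new candidate closely, while a step size near 0 keeps it almost unchanged.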
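FIG. 4 is an explanatory diagram showing an example of changing the threshold. In the example shown in FIG. 4, the speech/non-speech section shaping unit 107 has decided the sections, in order, as non-speech section 1, speech section 2, non-speech section 3, speech section 4, and non-speech section 5.
The waveform at the top of FIG. 4 shows the input signal. The up and down arrows near the end of each speech section and non-speech section in FIG. 4 indicate the maximum and minimum feature quantities of that section. The transition of the threshold is shown by the solid line that moves up and down parallel to the vertical axis.
When the speech/non-speech section shaping unit 107 decides a speech section or non-speech section, the threshold update unit 109 calculates the non-speech probability using equations (1) and (2) and determines the threshold candidate using equation (3). The threshold is then changed toward this candidate using equation (4).
The threshold may also be updated using the average of the threshold candidates of the past N utterances, as in equation (5) below.
\[ \theta \leftarrow \frac{1}{N} \sum \theta' \qquad (5) \]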
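The averaging update of equation (5) can be sketched with a fixed-length queue. The class name `CandidateAverager` is hypothetical, and averaging over fewer than N candidates until N have been seen is an assumption; the text does not say how the first utterances are handled.

```python
from collections import deque


class CandidateAverager:
    """Equation (5): threshold = mean of the last N threshold candidates."""

    def __init__(self, n):
        self._candidates = deque(maxlen=n)  # older candidates are discarded automatically

    def update(self, candidate):
        """Record a new candidate and return the updated threshold."""
        self._candidates.append(candidate)
        return sum(self._candidates) / len(self._candidates)


averager = CandidateAverager(n=2)
thresholds = [averager.update(c) for c in (4.0, 6.0, 8.0)]
```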
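The threshold update unit 109 may also update the threshold only when the non-speech probability is at or above, or below, a specific value. In addition, the long-section feature quantity calculation unit 108 may calculate the long-section feature quantity by statistically processing the feature quantities of each group of one or more speech sections or non-speech sections, and the threshold update unit 109 may update the threshold for each such group.
If the initially set threshold is too large or too small, the speech/non-speech section shaping unit 107 may, based on the determination results of the speech/non-speech determination unit 104, judge for example all of the sections under evaluation to be speech sections, or all of them to be non-speech sections, so that the threshold update unit 109 never updates the threshold.
To handle such cases, when the speech/non-speech determination unit 104 yields no speech section, or no non-speech section, for a certain time or longer, the threshold update unit 109 may decrease the threshold by a fixed amount, increase it by a fixed amount, or set it to the average of the feature quantities calculated by the feature quantity calculation unit 102 during that time.
After the threshold update unit 109 has updated the threshold, the voice detection device performs steps S101 to S108 on the next speech section or non-speech section. The voice detection device may also repeat steps S101 to S108 on the same utterance.
FIG. 5 is an explanatory diagram showing an example in which the threshold before the update was too small. In the example shown in FIG. 5, because the threshold before the update was too small, the voice detection device erroneously determines non-speech section 1 to be a speech section.
FIG. 6 is an explanatory diagram showing an example in which the threshold before the update was too large. In the example shown in FIG. 6, because the threshold before the update was too large, the voice detection device erroneously determines speech section 2 to be a non-speech section.
Even when the threshold before the update is too small, as illustrated in FIG. 5, the voice detection device of this embodiment obtains a large non-speech probability α from the long-section feature quantity. As shown in FIG. 5, the non-speech probability α of non-speech section 1 is 0.8. In this case, when the threshold update unit 109 evaluates equation (3), the threshold candidate θ' comes close to the maximum feature quantity of non-speech section 1, so the threshold is updated to a larger value.
Likewise, even when the threshold before the update is too large, as illustrated in FIG. 6, the voice detection device of this embodiment obtains a small non-speech probability α from the long-section feature quantity. As shown in FIG. 6, the non-speech probability α of speech section 2 is 0.2. In this case, when the threshold update unit 109 evaluates equation (3), the threshold candidate θ' comes close to the minimum feature quantity of speech section 2, so the threshold is updated to a smaller value.
Thus, the voice detection device of this embodiment calculates the non-speech probability α from the long-section feature quantity obtained by the long-section feature quantity calculation unit 108 and sets an appropriate threshold in the threshold update unit 109. This allows the preceding speech/non-speech determination unit 104 to detect correctly the speech sections to be recognized, and realizes voice detection that is robust against noise that varies with the speaking environment.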
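Embodiment 2.
A second embodiment of the present invention will be described with reference to the drawings. FIG. 7 is a block diagram showing a configuration example of the second embodiment of the voice detection device according to the present invention.
In addition to the configuration of the voice detection device of the first embodiment shown in FIG. 1, the voice detection device of the second embodiment includes a speech analysis unit 110 that divides the input signal into frames and outputs a feature quantity representing speech likeness. The speech analysis unit 110 has functions corresponding to those of the waveform cutout unit 101 and the feature quantity calculation unit 102 in the configuration of the first embodiment shown in FIG. 1.
In step S102, the speech analysis unit 110 calculates a second feature quantity independently of the feature quantity calculation unit 102. The second feature quantity calculated by the speech analysis unit 110 is, for example, the spectral power, the SNR, the zero-crossing count, or a likelihood.
The speech analysis unit 110 analyzes the input signal in more detail, using parameters different from those used by the feature quantity calculation unit 102, and calculates the second feature quantity. The speech analysis unit 110 may calculate the second feature quantity at a timing different from that at which the feature quantity calculation unit 102 calculates its feature quantity, for example once every several utterances or when instructed by the user.
Then, in step S106, the long-section feature quantity calculation unit 108 calculates the long-section feature quantity based on the feature quantity calculated by the feature quantity calculation unit 102 and the second feature quantity calculated by the speech analysis unit 110. Depending on the environment in which the input signal was generated, each of the feature quantities mentioned above may be easy or difficult to detect. Therefore, when the feature quantity calculation unit 102 could not calculate its feature quantity, for example, the long-section feature quantity calculation unit 108 calculates the long-section feature quantity using the second feature quantity calculated by the speech analysis unit 110. Alternatively, the speech analysis unit 110 may calculate a feature quantity of a different kind from that calculated by the feature quantity calculation unit 102, and the long-section feature quantity calculation unit 108 may use this second feature quantity as an auxiliary input when calculating the long-section feature quantity.
In the voice detection device of this embodiment, the speech analysis unit 110 can calculate various feature quantities independently of the feature quantity calculation unit 102. Feature quantities are thus calculated from various viewpoints, which makes more robust voice detection possible.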
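Embodiment 3.
A third embodiment of the present invention will be described with reference to the drawings. FIG. 8 is a block diagram showing a configuration example of the third embodiment of the voice detection device according to the present invention.
In addition to the configuration of the voice detection device of the first embodiment shown in FIG. 1, the voice detection device of the third embodiment includes a speech recognition unit 111 that uses speech-like feature quantities to output a recognition result corresponding to a speech section.
FIG. 9 is a block diagram showing another example of the third embodiment of the voice detection device. In the example shown in FIG. 9, the speech recognition unit 111 performs speech recognition on the detected speech sections.
The voice detection device of the third embodiment shown in FIG. 8 and FIG. 9 operates as follows. The speech recognition unit 111 extracts feature quantities as appropriate from the input speech signal. It then performs speech recognition by matching the extracted feature quantities against the feature quantities of words stored in a language model / speech recognition dictionary (not shown), obtaining a recognition result that is a word string with time information for the speech section, and outputs this time-stamped recognition result word string.
The long-section feature quantity calculation unit 108 obtains the phoneme duration from the speech recognition result as the long-section feature quantity. The phoneme duration \(T_a\) is calculated with equation (6) below.
\[ T_a = T_b / N_f \qquad (6) \]
Here, \(T_b\) is the number of frames occupied by one word of the recognition result word string output by the speech recognition unit 111, and \(N_f\) is the number of phonemes in that word.
The threshold update unit 109 uses the long-section feature quantity calculated by the long-section feature quantity calculation unit 108 in step S106, that is, the phoneme duration, to calculate the non-speech probability α of each section cut out by the speech/non-speech section shaping unit 107.
Specifically, the threshold update unit 109 obtains the non-speech probability α using, for example, a function of the long-section feature quantity such as that shown in FIG. 10. FIG. 10 is an explanatory diagram showing the function for obtaining the non-speech probability α in the third embodiment of the present invention. In FIG. 10, the horizontal axis is the value of the long-section feature quantity and the vertical axis is the non-speech probability α. As shown in FIG. 10, the non-speech probability α is 1 when the long-section feature quantity is \(\tau_{min}\) or less and when it is \(\tau_{max}\) or more, and α is 0 when the long-section feature quantity is between \(\tau_0\) and \(\tau_1\) inclusive. In the example shown in FIG. 10, α decreases monotonically from \(\tau_{min}\) to \(\tau_0\) and increases monotonically from \(\tau_1\) to \(\tau_{max}\).
\(\tau_{min}\), \(\tau_{max}\), \(\tau_0\), and \(\tau_1\) are appropriate values determined in advance by experiment.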
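Equation (6) and the function of FIG. 10 can be sketched as follows. The text specifies only that α falls monotonically between τmin and τ0 and rises monotonically between τ1 and τmax; the linear ramps, the function names, and the numeric parameter values below are assumptions for illustration.

```python
def phoneme_duration(word_frames, phoneme_count):
    """Equation (6): T_a = T_b / N_f, the average number of frames per phoneme."""
    return word_frames / phoneme_count


def non_speech_probability_from_duration(duration, tau_min, tau0, tau1, tau_max):
    """Assumed trapezoid-shaped well: 1 outside [tau_min, tau_max], 0 on [tau0, tau1]."""
    if duration <= tau_min or duration >= tau_max:
        return 1.0
    if duration < tau0:
        return (tau0 - duration) / (tau0 - tau_min)   # linear fall from 1 to 0
    if duration <= tau1:
        return 0.0
    return (duration - tau1) / (tau_max - tau1)       # linear rise from 0 to 1


# A 40-frame word with 5 phonemes has a plausible duration, so alpha is 0.
t_a = phoneme_duration(word_frames=40, phoneme_count=5)
alpha = non_speech_probability_from_duration(t_a, tau_min=2, tau0=5, tau1=15, tau_max=30)
```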
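In this embodiment, the long-section feature quantity calculation unit 108 uses the phoneme as the unit for calculating the duration, but other units such as the syllable may be used. The function shown in FIG. 10 is only an example, and the function is not limited to it. Any function whose value increases as the long-section feature quantity moves away from a moderate value may be defined.
The effect of this embodiment will now be described. When, for example, background noise exceeding the threshold continues for a long time, durations that are extremely long or extremely short compared with those obtained from normal speech recognition results tend to occur. Specifically, when background noise continues for a long time and produces an extremely long speech section, the sound in that section is background noise and has almost no speech likeness. Even if the speech recognition unit 111 recognizes that sound, only a short word may be output as the recognition result; in other words, proper speech recognition is not performed. Conversely, when an extremely short burst of noise, such as two or three frames, is treated as a speech section, the sound in that section is judged to be non-speech because a word cannot be uttered in such a short time. Sound in a speech section whose duration is extremely long or extremely short compared with the durations obtained from normal speech recognition results therefore tends to be non-speech.
The voice detection device of this embodiment exploits this property when calculating the non-speech probability α, and can therefore calculate α more accurately.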
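Embodiment 4.
A fourth embodiment of the present invention will be described. In the voice detection device of the fourth embodiment, the speech recognition unit 111 of the third embodiment shown in FIG. 8 and FIG. 9 performs continuous phoneme recognition instead of speech recognition. That is, the speech recognition unit 111 performs continuous phoneme recognition and outputs a phoneme string with time information. The long-section feature quantity calculation unit 108 obtains the duration of each phoneme constituting the phoneme string output by the speech recognition unit 111. The operation of the threshold update unit 109 is the same as in the third embodiment described above.
As in the third embodiment, the phoneme is used in this embodiment as the unit for calculating the duration, but a unit such as the syllable may be used instead.
In the voice detection device of this embodiment, the speech recognition unit 111 performs continuous phoneme recognition, so the phoneme durations can be obtained more easily than in the voice detection device of the third embodiment, which performs speech recognition. This reduces the load of calculating the phoneme durations and speeds up the processing of the voice detection device as a whole. With phoneme recognition, recognition is performed phoneme by phoneme, so the phoneme lengths of the utterance section are readily available. With speech recognition, the number of phonemes must be derived from the words of the recognition result and the duration divided by that number to obtain the phoneme duration. Obtaining the phoneme durations easily is therefore important for reducing the processing load.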
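Embodiment 5.
A fifth embodiment of the present invention will be described. The voice detection device of the fifth embodiment has the same configuration as the voice detection device of the third embodiment shown in FIG. 8 or FIG. 9, but the long-section feature quantity calculation unit 108 calculates the long-section feature quantity using the reliability of the speech recognition result.
Specifically, the speech recognition unit 111 extracts, for example, feature quantities as appropriate from the input speech signal. It then matches the extracted feature quantities against the feature quantities of words stored in the language model / speech recognition dictionary and outputs the scores of a plurality of speech recognition result candidates. A score is, for example, a numerical value representing the degree to which the feature quantities of a word stored in the language model / speech recognition dictionary match the extracted feature quantities. The speech recognition unit 111 outputs the scores of a plurality of candidates with high degrees of match.
The long-section feature quantity calculation unit 108 then calculates the difference between the score of the first-ranked candidate and the score of the second-ranked candidate, in descending order of the degree of match, among the speech recognition result scores output by the speech recognition unit 111. When this score difference is small, the reliability of the speech recognition result is considered low; when it is large, the reliability is considered high. Another measure corresponding to the reliability of the speech recognition result may be used in place of the score difference.
The threshold update unit 109 uses the long-section feature quantity calculated by the long-section feature quantity calculation unit 108, that is, the reliability, to calculate the non-speech probability α for the speech section cut out by the speech/non-speech section shaping unit 107. Specifically, the threshold update unit 109 obtains the non-speech probability α using, for example, a function of the long-section feature quantity such as that shown in FIG. 11.
FIG. 11 is an explanatory diagram showing the function for obtaining the non-speech probability α in the fifth embodiment of the present invention. In FIG. 11, the horizontal axis is the value of the long-section feature quantity and the vertical axis is the non-speech probability α. As shown in FIG. 11, the non-speech probability α is 0 when the long-section feature quantity is \(\tau_0\) or more, and α decreases monotonically from 1 to 0 as the long-section feature quantity goes from 0 to \(\tau_0\). \(\tau_0\) is an appropriate value determined in advance by experiment. The function shown in FIG. 11 is only an example, and any monotonically decreasing or monotonically non-increasing function may be used.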
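The reliability measure and the function of FIG. 11 can be sketched as follows. The linear ramp between 0 and τ0, the treatment of negative inputs, the function names, and the score values are assumptions for illustration; the text requires only that α fall monotonically from 1 to 0 and be 0 at or above τ0.

```python
def recognition_confidence(scores):
    """Reliability as the gap between the best and second-best candidate scores."""
    best, second = sorted(scores, reverse=True)[:2]
    return best - second


def non_speech_probability_from_confidence(confidence, tau0):
    """Assumed linear ramp: 1 at confidence 0, falling to 0 at tau0 and beyond."""
    if confidence >= tau0:
        return 0.0
    return 1.0 - max(confidence, 0.0) / tau0


# Candidate scores from a hypothetical recognizer; higher means a better match.
confidence = recognition_confidence([2.0, 9.0, 6.0])
alpha = non_speech_probability_from_confidence(confidence, tau0=4.0)
```

A small gap between the top two candidates gives a large α, reflecting the observation that low-reliability recognition results tend to come from non-speech input.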
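The voice detection device of this embodiment calculates the non-speech probability α by exploiting the property that a section whose speech recognition result has low reliability is likely to be a non-speech section, and can therefore calculate the non-speech probability more accurately.
Embodiment 6.
A sixth embodiment of the present invention will be described with reference to the drawings. FIG. 12 is a block diagram showing a configuration example of the sixth embodiment of the voice detection device according to the present invention.
The voice detection device of the sixth embodiment is a combination of the first to fifth embodiments. The long-section feature quantity calculation unit 108 calculates the long-section feature quantity by combining one or more of the methods of the first to fifth embodiments. The voice detection device calculates a non-speech probability α with each of the non-speech probability calculation methods of the first to fifth embodiments and uses the product of these values as the non-speech probability. The voice detection device may instead weight each non-speech probability α before taking the product. It may also use the average of the non-speech probabilities α, a suitably weighted average, or the like as the non-speech probability.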
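Two of the combination rules named above, the plain product and the weighted average, can be sketched as follows. The function names are hypothetical, and the weighted product is omitted because the text does not say how the weights enter it.

```python
import math


def combine_by_product(alphas):
    """Combine per-method non-speech probabilities by multiplying them."""
    return math.prod(alphas)


def combine_by_weighted_average(alphas, weights=None):
    """Combine per-method non-speech probabilities by a (weighted) average.

    With no weights given, this is the plain average.
    """
    if weights is None:
        weights = [1.0] * len(alphas)
    return sum(w * a for w, a in zip(weights, alphas)) / sum(weights)


alphas = [0.5, 0.8]  # illustrative outputs of two of the methods above
product = combine_by_product(alphas)
average = combine_by_weighted_average(alphas)
```

The product is never larger than the smallest input, so it reports a high non-speech probability only when every method agrees, whereas the average lets one method outvote another.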
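By combining the first to fifth embodiments, the voice detection device of this embodiment can calculate the non-speech probability more accurately.
Embodiment 7.
A seventh embodiment of the present invention is a speech recognition device that includes the voice detection device of any of the first to fifth embodiments. The speech recognition device performs known speech recognition processing on the sections decided to be speech sections by the voice detection device of the first to fifth embodiments and outputs the speech recognition result.
Because the speech recognition device of this embodiment performs speech recognition processing on sections that have been decided with high accuracy to be speech sections, it avoids the wasted processing of running speech recognition on non-speech sections. It also performs speech recognition processing on the speech sections with high accuracy and prevents speech from being missed by the recognition processing.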
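Next, an outline of the present invention will be described. FIG. 13 is a block diagram showing an outline of the present invention. The voice detection device 300 according to the present invention includes a feature quantity calculation unit 301 (corresponding to the feature quantity calculation unit 102 shown in FIG. 1), a speech/non-speech determination unit 302 (corresponding to the speech/non-speech determination unit 104 and the speech/non-speech section shaping unit 107 shown in FIG. 1), a long-section feature quantity calculation unit 303 (corresponding to the long-section feature quantity calculation unit 108 shown in FIG. 1), and a threshold update unit 304 (corresponding to the threshold update unit 109 shown in FIG. 1).
The feature quantity calculation unit 301 calculates a feature quantity of the input signal for each frame, a frame being the input signal of one predetermined unit time. The speech/non-speech determination unit 302 compares the feature quantity calculated by the feature quantity calculation unit 301 with a voice detection threshold for determining whether the input signal is based on speech, and determines whether a section is a speech section, in which a signal based on speech is input over a plurality of frames, or a non-speech section, in which a signal based on non-speech is input over a plurality of frames.
The long-section feature quantity calculation unit 303 calculates a long-section feature quantity, which is a feature quantity of the speech section or non-speech section, based on a statistical value of the feature quantities, calculated by the feature quantity calculation unit 301, of the plurality of frames constituting that section.
The threshold update unit 304 uses the long-section feature quantity calculated by the long-section feature quantity calculation unit 303 to calculate a non-speech probability, which is the probability that the speech section or non-speech section was a section in which a signal based on non-speech was input, and updates the voice detection threshold based on the calculated non-speech probability.
With this configuration, the voice detection device 300 can update the voice detection threshold and detect speech sections with high accuracy even when the beginning of the input signal is based on background noise whose feature quantity exceeds the voice detection threshold.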
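The embodiments described above also disclose the voice detection devices described in (1) to (11) below.
(1) A voice detection device in which the long-section feature quantity calculation unit 303 calculates the long-section feature quantity by statistically processing the feature quantities over one or more speech sections or non-speech sections determined by the speech/non-speech determination unit 302.
(2) A voice detection device in which the long-section feature quantity calculation unit 303, when calculating the long-section feature quantity, uses at least one of the following methods: the average of the per-frame feature quantities, their mode, their median, and the value located at a predetermined proportion counted from the top when they are sorted in descending order.
(3) A voice detection device in which the threshold update unit 304 updates the voice detection threshold using the maximum and minimum feature quantities in the speech section or non-speech section and the non-speech probability.
(4) A voice detection device in which the threshold update unit 304 uses the non-speech probability to obtain a value that internally divides the maximum and minimum feature quantities, and updates the voice detection threshold so that it approaches that value.
(5) A voice detection device that includes a second feature quantity calculation unit (corresponding to the speech analysis unit 110 shown in FIG. 7) that calculates a second feature quantity different from the feature quantity calculated by the feature quantity calculation unit 301, and in which the long-section feature quantity calculation unit 303 calculates the long-section feature quantity using the feature quantity calculated by the feature quantity calculation unit 301 and the second feature quantity calculated by the second feature quantity calculation unit.
(6) A voice detection device in which the second feature quantity calculation unit (corresponding to the speech recognition unit 111 shown in FIG. 8) performs speech recognition on the input signal and outputs a speech recognition result, and the long-section feature quantity calculation unit 303 calculates the long-section feature quantity based on the speech recognition result.
(7) A voice detection device in which the long-section feature quantity calculation unit 303 calculates the reliability of the speech recognition result as the long-section feature quantity.
(8) A voice detection device in which the second feature quantity calculation unit outputs the scores of a plurality of speech recognition result candidates, a score being a value indicating the degree to which the feature quantities of a word stored in advance in storage means match the feature quantities of the input signal to be recognized, and the long-section feature quantity calculation unit calculates, as the reliability, the difference between the score of the first-ranked candidate and the score of the second-ranked candidate in descending order of that degree.
(9) A voice detection device in which the second feature quantity calculation unit performs speech recognition on the input signal and outputs a speech recognition result with time information, and the long-section feature quantity calculation unit 303 calculates the long-section feature quantity from the speech recognition result with time information.
(10) A voice detection device in which the long-section feature quantity calculation unit 303 calculates a duration from the time information as the long-section feature quantity.
(11) A voice detection device in which the long-section feature quantity calculation unit 303 calculates the duration in units of phonemes or syllables.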
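The present invention has been described above with reference to the embodiments, but the present invention is not limited to those embodiments. Various changes that a person skilled in the art can understand may be made to the configuration and details of the present invention within its scope.
This application claims priority based on Japanese Patent Application No. 2009-291976 filed on December 24, 2009, the entire disclosure of which is incorporated herein.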
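(Supplementary Note 1) A voice detection device comprising: a feature quantity calculation unit that calculates a feature quantity of the input signal for each frame, a frame being the input signal of one predetermined unit time; a speech/non-speech determination unit that compares the feature quantity with a voice detection threshold for determining whether the input signal is based on speech, and determines whether a section is a speech section, in which a signal based on speech is input over a plurality of frames, or a non-speech section, in which a signal based on non-speech is input over a plurality of frames; a long-section feature quantity calculation unit that calculates a long-section feature quantity, which is a feature quantity of the speech section or the non-speech section, based on a statistical value of the feature quantities, calculated by the feature quantity calculation unit, of the plurality of frames constituting that section; and a threshold update unit that uses the long-section feature quantity to calculate a non-speech probability, which is the probability that the speech section or the non-speech section was a section in which a signal based on non-speech was input, and updates the voice detection threshold based on the calculated non-speech probability.
(Supplementary Note 2) The voice detection device according to Supplementary Note 1, wherein the long-section feature quantity calculation unit calculates the long-section feature quantity by statistically processing the feature quantities over one or more speech sections or non-speech sections determined by the speech/non-speech determination unit.
(Supplementary Note 3) The voice detection device according to Supplementary Note 1 or 2, wherein the long-section feature quantity calculation unit, when calculating the long-section feature quantity, uses at least one of the following methods: the average of the per-frame feature quantities, their mode, their median, and the value located at a predetermined proportion counted from the top when they are sorted in descending order.
(Supplementary Note 4) The voice detection device according to any one of Supplementary Notes 1 to 3, wherein the threshold update unit updates the voice detection threshold using the maximum and minimum feature quantities in the speech section or non-speech section and the non-speech probability.
(Supplementary Note 5) The voice detection device according to Supplementary Note 4, wherein the threshold update unit uses the non-speech probability to obtain a value that internally divides the maximum and minimum feature quantities, and updates the voice detection threshold so that it approaches that value.
(Supplementary Note 6) The voice detection device according to any one of Supplementary Notes 1 to 5, further comprising a second feature quantity calculation unit that calculates a second feature quantity different from the feature quantity calculated by the feature quantity calculation unit, wherein the long-section feature quantity calculation unit calculates the long-section feature quantity using the feature quantity calculated by the feature quantity calculation unit and the second feature quantity calculated by the second feature quantity calculation unit.
(Supplementary Note 7) The voice detection device according to Supplementary Note 6, wherein the second feature quantity calculation unit performs speech recognition on the input signal and outputs a speech recognition result, and the long-section feature quantity calculation unit calculates the long-section feature quantity based on the speech recognition result.
(Supplementary Note 8) The voice detection device according to Supplementary Note 7, wherein the long-section feature quantity calculation unit calculates the reliability of the speech recognition result as the long-section feature quantity.
(Supplementary Note 9) The voice detection device according to Supplementary Note 8, wherein the second feature quantity calculation unit outputs the scores of a plurality of speech recognition result candidates, a score being a value indicating the degree to which the feature quantities of a word stored in advance in storage means match the feature quantities of the input signal to be recognized, and the long-section feature quantity calculation unit calculates, as the reliability, the difference between the score of the first-ranked candidate and the score of the second-ranked candidate in descending order of that degree.
(Supplementary Note 10) The voice detection device according to Supplementary Note 6, wherein the second feature quantity calculation unit performs speech recognition on the input signal and outputs a speech recognition result with time information, and the long-section feature quantity calculation unit calculates the long-section feature quantity from the speech recognition result with time information.
(Supplementary Note 11) The voice detection device according to Supplementary Note 10, wherein the long-section feature quantity calculation unit calculates a duration from the time information as the long-section feature quantity.
(Supplementary Note 12) The voice detection device according to Supplementary Note 11, wherein the long-section feature quantity calculation unit calculates the duration in units of phonemes or syllables.
(Supplementary Note 13) A speech recognition device comprising the voice detection device according to any one of Supplementary Notes 1 to 12, wherein the speech recognition device performs speech recognition on the speech sections output by the voice detection device and outputs a speech recognition result.
(Supplementary Note 14) A voice detection method comprising: calculating a feature quantity of the input signal for each frame, a frame being the input signal within one predetermined unit time; comparing the feature quantity with a voice detection threshold for determining whether the input signal is based on speech, and determining whether a section is a speech section, in which a signal based on speech is input over a plurality of frames, or a non-speech section, in which a signal based on non-speech is input over a plurality of frames; calculating a long-section feature quantity, which is a feature quantity of the speech section or the non-speech section, based on a statistical value of the feature quantities of the plurality of frames constituting that section; calculating, using the long-section feature quantity, a non-speech probability, which is the probability that the speech section or the non-speech section was a section in which a signal based on non-speech was input; and updating the voice detection threshold based on the calculated non-speech probability.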
（付記１５）１つ以上の音声区間、または非音声区間にわたる特徴量に統計処理を施し、長区間特徴量を算出する付記１４に記載の音声検出方法。Embodiment 1. FIG.
A first embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing a configuration example of a first embodiment of a voice detection device according to the present invention. As shown in FIG. 1, the speech detection apparatus according to the first exemplary embodiment of the present invention includes a waveform cutout unit 101, a feature amount calculation unit 102, a threshold storage unit 103, a speech / non-speech determination unit 104, and a determination result holding unit. 105, a shaping rule storage unit 106, a voice / non-speech segment shaping unit 107, a long segment feature value computing unit 108, and a threshold updating unit 109.
The waveform cutout unit 101 cuts out and acquires an input signal in units of frames. Specifically, the waveform cutout unit 101 cuts out and acquires input signals for each predetermined unit time, for example. The feature amount calculation unit 102 calculates a feature amount used for speech detection from the input signal for each frame cut out by the waveform cutout unit 101. The threshold storage unit 103 stores a threshold for determining whether the input signal is an input signal based on voice or an input signal based on non-voice.
The speech / non-speech determination unit 104 compares, for each frame, the feature amount calculated by the feature amount calculation unit 102 with the threshold stored in the threshold storage unit 103, and determines whether the input signal of the frame is an input signal based on speech or an input signal based on non-speech. A frame whose input signal is based on speech is called a speech frame, and a frame whose input signal is based on non-speech is called a non-speech frame. The determination result holding unit 105 holds the per-frame determination results of the speech / non-speech determination unit 104 over a plurality of frames.
The section shaping rule storage unit 106 stores section shaping rules. The speech / non-speech section shaping unit 107 shapes the determination results of the plurality of frames held in the determination result holding unit 105 based on the section shaping rules stored in the section shaping rule storage unit 106, and determines whether the frames form a speech section or a non-speech section. Specifically, for example, when a plurality of speech frames are consecutive, the speech / non-speech section shaping unit 107 determines that those frames form one speech section. Likewise, when a plurality of non-speech frames are consecutive, it determines that those frames form one non-speech section. Alternatively, the speech / non-speech section shaping unit 107 may determine that a plurality of consecutive frames form one speech section when the ratio of speech frames among them is larger than a predetermined ratio, and that they form one non-speech section when the ratio of non-speech frames is larger than a predetermined ratio.
The long section feature amount calculation unit 108 calculates a long section feature amount by performing statistical processing on the per-frame feature amounts calculated by the feature amount calculation unit 102, for each speech section and non-speech section determined by the speech / non-speech section shaping unit 107.
The threshold update unit 109 calculates the non-speech probability of each speech section and non-speech section determined by the speech / non-speech section shaping unit 107, using the long section feature amount calculated by the long section feature amount calculation unit 108, and changes the threshold stored in the threshold storage unit 103 accordingly. As described later, the non-speech probability is the probability that the input signal in the section is an input signal based on non-speech.
The voice detection device is realized by, for example, a computer equipped with a voice detection program.
Next, the operation of the voice detection device according to the first exemplary embodiment of the present invention will be described with reference to the drawings. FIG. 2 is a flowchart showing the operation of the voice detection device according to the first exemplary embodiment of the present invention.
First, the waveform cutout unit 101 cuts the time-series input sound data collected by a microphone (not shown) into frames of unit time (step S101). For example, when the input sound data is in 16-bit linear PCM (Pulse Code Modulation) format with a sampling frequency of 8000 Hz, the input sound data contains 8000 points of waveform data per second.
For example, the waveform cutout unit 101 sequentially cuts out the waveform data at a frame width of 200 points (25 milliseconds) and a frame shift of 80 points (10 milliseconds) according to a time series.
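The framing described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `cut_frames` is hypothetical, and dropping a trailing partial frame is a choice made here, not something the description specifies.

```python
SAMPLE_RATE_HZ = 8000  # sampling frequency from the example in the text


def cut_frames(samples, frame_width=200, frame_shift=80):
    """Split a sample sequence into overlapping frames.

    frame_width and frame_shift are in samples (200 and 80 points in the text).
    Assumption: a trailing partial frame is dropped.
    """
    frames = []
    start = 0
    while start + frame_width <= len(samples):
        frames.append(samples[start:start + frame_width])
        start += frame_shift
    return frames


# One second of (silent) 16-bit linear PCM samples as placeholder input.
one_second = [0] * SAMPLE_RATE_HZ
frames = cut_frames(one_second)

# Frame width and shift converted from samples to milliseconds.
width_ms = 1000 * 200 / SAMPLE_RATE_HZ
shift_ms = 1000 * 80 / SAMPLE_RATE_HZ
print(len(frames), width_ms, shift_ms)
```

At 8000 Hz, 200 points correspond to 25 milliseconds and 80 points to 10 milliseconds, so consecutive frames overlap by 120 points.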
Next, the feature amount calculation unit 102 calculates a feature amount from the waveform cut out for each frame (step S102). The feature amount calculated by the feature amount calculation unit 102 is, for example, spectrum power, SNR, zero crossing, likelihood, and the like.
The speech / non-speech determination unit 104 compares the threshold stored in the threshold storage unit 103 with the feature amount calculated by the feature amount calculation unit 102, and determines that the frame is a speech frame if the feature amount exceeds the threshold and a non-speech frame otherwise (step S103). Whether a frame whose feature amount equals the threshold is treated as a speech frame or as a non-speech frame may be decided in advance, and the speech / non-speech determination unit 104 then classifies such frames according to that decision.
The determination result holding unit 105 holds the results determined by the speech / non-speech determination unit 104 in step S103 over a plurality of frames (step S104).
The speech / non-speech section shaping unit 107 performs shaping to suppress the short-duration speech sections and short-duration non-speech sections that arise because the speech / non-speech determination unit 104 makes its determination frame by frame (step S105).
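The description does not spell out the section shaping rules themselves, so the following is only one possible sketch of step S105. It assumes a hypothetical rule in which any run of identically labeled frames shorter than `min_len` is absorbed into the preceding section; the names `shape_sections` and `min_len` are illustrative and do not come from the source.

```python
from itertools import groupby


def shape_sections(labels, min_len=3):
    """Group per-frame speech (True) / non-speech (False) labels into sections.

    Assumed rule: a run shorter than min_len frames is absorbed into the
    section before it. Returns (is_speech, start_frame, end_frame_exclusive).
    """
    merged = []  # list of [label, length]
    for label, group in groupby(labels):
        length = len(list(group))
        if merged and length < min_len:
            merged[-1][1] += length      # absorb the short run
        elif merged and merged[-1][0] == label:
            merged[-1][1] += length      # rejoin after an absorbed run
        else:
            merged.append([label, length])

    sections, start = [], 0
    for label, length in merged:
        sections.append((label, start, start + length))
        start += length
    return sections


T, F = True, False
frame_labels = [F, F, F, F, T, F, F, F, T, T, T, T, T, F, T, T, T, F, F, F]
print(shape_sections(frame_labels))
```

In this example the isolated speech frame at index 4 and the isolated non-speech frame at index 13 are both absorbed, leaving three sections. A short run at the very start of the input has no preceding section and is kept as is in this sketch.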
For each shaped speech section and non-speech section obtained by the speech / non-speech section shaping unit 107 in step S105, the long section feature amount calculation unit 108 statistically processes the per-frame feature amounts calculated by the feature amount calculation unit 102 in step S102 to calculate the long section feature amount (step S106). The long section feature amount is, for example, one of, or a combination of two or more of, spectrum power, SNR, zero crossings, likelihood, and the like.
One example of the statistical processing performed by the long section feature amount calculation unit 108 is to calculate the average of the per-frame feature amounts in a shaped speech section. Instead of the average, the long section feature amount calculation unit 108 may use the mode, the median, or the value located around the top 40% when the per-frame feature amounts are sorted in descending order. The value of 40% is merely an example and may be any ratio chosen by the user or the like. When 50% is chosen, this corresponds to the method using the median.
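The statistical processing options listed above might be sketched as follows. The function name `long_section_feature` and the `method` and `top_ratio` parameters are hypothetical, and the way the "top 40%" position is turned into a list index is an assumption.

```python
import statistics


def long_section_feature(frame_features, method="mean", top_ratio=0.4):
    """Reduce the per-frame feature amounts of one section to a single value."""
    values = list(frame_features)
    if method == "mean":
        return statistics.mean(values)
    if method == "median":
        return statistics.median(values)
    if method == "mode":
        return statistics.mode(values)
    if method == "top_ratio":
        # Sort in descending order and take the value at the given ratio
        # from the top (assumed indexing: floor(ratio * n), clamped).
        ordered = sorted(values, reverse=True)
        index = min(int(top_ratio * len(ordered)), len(ordered) - 1)
        return ordered[index]
    raise ValueError(f"unknown method: {method}")


section = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
for name in ("mean", "median", "mode", "top_ratio"):
    print(name, long_section_feature(section, method=name))
```

With an odd number of frames and `top_ratio=0.5`, the `top_ratio` method selects the same value as the median, which matches the remark about 50% in the text.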
The threshold update unit 109 calculates the non-speech probability α for the shaped speech segment using the long segment feature value calculated by the long segment feature value calculation unit 108 in the process of step S106 (step S107). Here, the non-speech probability is a probability that the input signal in the section is an input signal based on non-speech such as noise. Therefore, 1-α corresponds to the probability that the section is speech. α is calculated using the following equation.
<F> = Σωi × <fi> (1)
α = G [<F>] (2)
Here, <fi> is a long-section feature amount obtained by applying the statistical processing described above to the per-frame feature amount fi, and ωi is the weight applied to <fi>. In Equation (1), <F> is the integrated long-section feature amount, obtained by multiplying each of several types of long-section feature amounts <fi> (for example, spectrum power, SNR, zero crossings, likelihood) by its weight ωi and summing the results. G is a function of the integrated long-section feature amount <F> (also simply called the long-section feature amount). FIG. 3 is an explanatory diagram showing the function G of the present embodiment. The horizontal axis in FIG. 3 is the value of the long-section feature amount, and the vertical axis is the non-speech probability α.
In the example illustrated in FIG. 3, the function G gives a non-speech probability α of 1 (100%) when the long-section feature amount is 0, a non-speech probability α of 0 (0%) when the long-section feature amount is τ0, and a non-speech probability α of 1 (100%) when the long-section feature amount is τmax.
The function shown in FIG. 3 is an example. G may be another function whose value increases as the long-section feature amount moves away from a moderate value, or a monotonically decreasing (non-increasing) function. The weights ωi in Equation (1) and the values τ0 and τmax shown in FIG. 3 are assumed to be determined in advance by experiment. If it is difficult to determine ωi experimentally, ωi may be set to the same value (such as 1) for every long-section feature amount.
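Equations (1) and (2) can be illustrated with a short sketch. The text fixes G only at three points (α = 1 at 0, α = 0 at τ0, α = 1 at τmax), so the straight-line interpolation between those points and the clamping outside them are assumptions made here; the function names are hypothetical.

```python
def integrated_feature(long_features, weights=None):
    """Equation (1): <F> = sum over i of (omega_i * <f_i>).

    If no weights are given, every weight is set to 1, as the text allows.
    """
    if weights is None:
        weights = [1.0] * len(long_features)
    return sum(w * f for w, f in zip(weights, long_features))


def non_speech_probability(integrated, tau0, tau_max):
    """Equation (2): alpha = G[<F>], with an assumed piecewise-linear G.

    Requires 0 < tau0 < tau_max.
    """
    if integrated <= 0.0 or integrated >= tau_max:
        return 1.0
    if integrated <= tau0:
        return 1.0 - integrated / tau0                 # falls from 1 to 0
    return (integrated - tau0) / (tau_max - tau0)      # rises from 0 to 1


F = integrated_feature([1.0, 2.0], weights=[0.5, 0.25])
alpha = non_speech_probability(F, tau0=2.0, tau_max=10.0)
print(F, alpha)
```

Here the two long-section feature amounts and their weights are made-up numbers chosen only to exercise the functions.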
Next, the threshold update unit 109 updates the threshold stored in the threshold storage unit 103 using the non-speech probability α calculated in the process of step S107 (step S108). Specifically, the threshold update unit 109 updates the threshold as follows. First, the threshold update unit 109 calculates a threshold candidate θ′ using the following equation.
θ′ = α × Fmax + (1−α) × Fmin (3)
Here, Fmax is the maximum value of the per-frame feature amount in the speech section or non-speech section, and Fmin is its minimum value. α is the non-speech probability of that speech section or non-speech section. Next, the threshold update unit 109 updates the threshold θ from the threshold candidate θ′ using the following equation.
θ ← θ + ε × (θ′−θ) (4)
Here, ε is a step size that adjusts how quickly the threshold is updated. That is, the voice detection device according to the present invention can adjust the speed of the threshold update. It can therefore handle both the case where the threshold should change substantially to follow temporal fluctuations in the background noise and the case where the threshold should not be affected by transient background noise.
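Equations (3) and (4) translate directly into code. The sketch below uses hypothetical function names and made-up numbers; only the two formulas themselves come from the text.

```python
def threshold_candidate(alpha, f_max, f_min):
    """Equation (3): theta' = alpha * Fmax + (1 - alpha) * Fmin."""
    return alpha * f_max + (1.0 - alpha) * f_min


def update_threshold(theta, theta_candidate, epsilon):
    """Equation (4): theta <- theta + epsilon * (theta' - theta)."""
    return theta + epsilon * (theta_candidate - theta)


# A section judged likely to be non-speech (alpha = 0.8) pulls the candidate
# toward the section's maximum feature amount, which raises the threshold.
candidate = threshold_candidate(alpha=0.8, f_max=10.0, f_min=2.0)
theta = update_threshold(theta=3.0, theta_candidate=candidate, epsilon=0.5)
print(candidate, theta)
```

With ε = 1 the threshold jumps straight to the candidate, and with ε = 0 it never moves, so values in between trade responsiveness against stability.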
FIG. 4 is an explanatory diagram illustrating an example of how the threshold changes. In the example shown in FIG. 4, the speech / non-speech section shaping unit 107 has determined the sections, in order, to be non-speech section 1, speech section 2, non-speech section 3, speech section 4, and non-speech section 5.
The input signal is shown by the upper waveform in FIG. 4. The maximum and minimum values of the feature amount of each speech section and non-speech section are indicated by up and down arrows near the end of that section. The transition of the threshold is indicated by the solid line, which shifts up and down along the vertical axis.
Each time the speech / non-speech section shaping unit 107 determines a speech section or a non-speech section, the threshold update unit 109 calculates the non-speech probability using Equations (1) and (2), obtains a threshold candidate using Equation (3), and changes the threshold using Equation (4).
Further, the threshold value can be updated using the average value of the threshold candidates for the past N utterances as shown in Equation (5) below.
θ ← 1 / N × Σθ ′ (5)
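The averaging update of Equation (5) can be sketched with a fixed-length buffer of recent threshold candidates. The class name `AveragingThreshold` is hypothetical, and averaging over fewer than N candidates until N utterances have been seen is an assumption, since the text does not say how the first utterances are handled.

```python
from collections import deque


class AveragingThreshold:
    """Equation (5): theta <- (1 / N) * sum of the last N candidates theta'."""

    def __init__(self, n):
        # deque(maxlen=n) silently discards the oldest candidate when full.
        self.candidates = deque(maxlen=n)

    def update(self, theta_candidate):
        self.candidates.append(theta_candidate)
        # Assumption: before N candidates exist, average those seen so far.
        return sum(self.candidates) / len(self.candidates)


threshold = AveragingThreshold(n=3)
for candidate in (3.0, 5.0, 7.0, 9.0):
    print(threshold.update(candidate))
```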
The threshold update unit 109 may also update the threshold only when the non-speech probability is greater than, or less than, a specific value. In addition, the long section feature amount calculation unit 108 may calculate the long section feature amount by statistically processing the feature amounts over one or more speech sections or non-speech sections, and the threshold update unit 109 may then update the threshold once per one or more speech sections or non-speech sections.
If the initially set threshold is too large or too small, the speech / non-speech section shaping unit 107 may, based on the determination results of the speech / non-speech determination unit 104, determine for example every section to be a speech section, or every section to be a non-speech section, and the threshold update unit 109 may then fail to update the threshold.
To cope with such a case, when no speech section or non-speech section has been determined for a certain time or longer, the threshold update unit 109 may lower the threshold by a fixed amount, raise it by a fixed amount, or use as the threshold the average of the feature amounts calculated by the feature amount calculation unit 102 during that time.
After the threshold value is updated by the threshold update unit 109, the voice detection device performs the processing of steps S101 to S108 for the next voice segment or non-voice segment. In addition, the voice detection device can repeat the processing of steps S101 to S108 again for the same utterance.
FIG. 5 is an explanatory diagram illustrating an example in which the threshold before update is too small. In the example shown in FIG. 5, since the threshold value before update is too small, the voice detection device erroneously determines that the non-voice section 1 is a voice section.
FIG. 6 is an explanatory diagram illustrating an example when the threshold before update is too large. In the example illustrated in FIG. 6, since the threshold value before the update is too large, the voice detection device erroneously determines that the voice section 2 is a non-voice section.
With the speech detection apparatus according to the present embodiment, the non-speech probability α calculated from the long section feature amount is high even when the pre-update threshold is too small, as illustrated in FIG. 5. In FIG. 5, the non-speech probability α of non-speech section 1 is 0.8. When the threshold update unit 109 then evaluates Equation (3), the threshold candidate θ′ lies close to the maximum feature amount Fmax of non-speech section 1, so the threshold is updated to a larger value.
Likewise, the non-speech probability α calculated from the long section feature amount is low even when the pre-update threshold is too large, as illustrated in FIG. 6. In FIG. 6, the non-speech probability α of speech section 2 is 0.2. When the threshold update unit 109 then evaluates Equation (3), the threshold candidate θ′ lies close to the minimum feature amount Fmin of speech section 2, so the threshold is updated to a smaller value.
Therefore, because the speech detection apparatus according to the present embodiment calculates the long section feature amount in the long section feature amount calculation unit 108 and, from the resulting non-speech probability α, sets an appropriate threshold in the threshold update unit 109, the speech / non-speech determination unit 104 in the preceding stage can correctly detect the speech sections to be recognized, and speech detection that is robust against noise that varies with the speech environment can be realized.
Embodiment 2.
A second embodiment of the present invention will be described with reference to the drawings. FIG. 7 is a block diagram showing a configuration example of the second embodiment of the voice detection device according to the present invention.
In addition to the configuration of the voice detection device of the first embodiment shown in FIG. 1, the voice detection device of the second embodiment includes a voice analysis unit 110 that divides the input signal into frames and outputs a feature amount representing speech likeness. The voice analysis unit 110 has functions corresponding to the waveform cutout unit 101 and the feature amount calculation unit 102 of the voice detection device according to the first embodiment shown in FIG. 1.
The voice analysis unit 110 calculates the second feature amount independently of the feature amount calculation unit 102 in the process of step S102. The second feature amount calculated by the speech analysis unit 110 is, for example, spectrum power, SNR, zero crossing, likelihood, and the like.
The voice analysis unit 110 calculates the second feature amount by analyzing the input signal in more detail, using parameters different from those used by the feature amount calculation unit 102. The voice analysis unit 110 may also calculate the second feature amount at a timing different from that of the feature amount calculation unit 102, for example once every several utterances or when instructed by the user.
Then, in step S106, the long section feature amount calculation unit 108 calculates the long section feature amount based on the feature amount calculated by the feature amount calculation unit 102 and the second feature amount calculated by the voice analysis unit 110. Depending on the environment in which the input signal arises, each of the feature amounts described above may be easy or difficult to detect. Therefore, when the feature amount calculation unit 102 cannot calculate its feature amount, for example, the long section feature amount calculation unit 108 calculates the long section feature amount using the second feature amount calculated by the voice analysis unit 110. Alternatively, the voice analysis unit 110 may calculate a feature amount of a different type from the one calculated by the feature amount calculation unit 102, and the long section feature amount calculation unit 108 may use this second feature amount to calculate the long section feature amount.
In the speech detection apparatus according to the present embodiment, the voice analysis unit 110 can calculate various feature amounts independently of the feature amount calculation unit 102, so feature amounts are obtained from multiple viewpoints and more robust speech detection can be realized.
Embodiment 3.
A third embodiment of the present invention will be described with reference to the drawings. FIG. 8 is a block diagram showing a configuration example of the third embodiment of the voice detection device according to the present invention.
In addition to the configuration of the voice detection device of the first embodiment shown in FIG. 1, the voice detection device of the third embodiment includes a voice recognition unit 111 that outputs a recognition result corresponding to a speech section, using speech-like feature amounts.
FIG. 9 is a block diagram illustrating another example of the third embodiment of the voice detection device. In the example illustrated in FIG. 9, the voice recognition unit 111 performs voice recognition on a voice section in which voice is detected.
The voice detection apparatus according to the third embodiment shown in FIGS. 8 and 9 operates as follows. The voice recognition unit 111 extracts appropriate feature amounts from the input speech signal. It then performs speech recognition by matching the feature amounts of the words stored in a language model / speech recognition dictionary (not shown) against the extracted feature amounts, and outputs the recognition result for the speech section as a word string with time information.
The long segment feature value calculation unit 108 obtains the phoneme duration from the speech recognition result as the long segment feature value. The phoneme duration Ta is calculated by the following equation (6).
Ta = Tb / Nf (6)
Here, Tb is the number of frames for one word in the speech recognition result word string output by the speech recognition unit 111, and Nf is the number of phonemes of the word.
The threshold update unit 109 calculates the non-speech probability α for each section cut out by the speech / non-speech section shaping unit 107, using the long section feature amount calculated by the long section feature amount calculation unit 108 in step S106, that is, the phoneme duration.
Specifically, the threshold update unit 109 obtains the non-speech probability α using, for example, a function of the long-section feature amount such as the one shown in FIG. 10. FIG. 10 is an explanatory diagram showing a function for obtaining the non-speech probability α in the third embodiment of the present invention. The horizontal axis represents the value of the long section feature amount, and the vertical axis represents the non-speech probability α. As shown in FIG. 10, the non-speech probability α is 1 when the long-section feature amount is τmin or less and when it is τmax or more, and is 0 when the long section feature amount is τ0 or more and τ1 or less. Between these ranges, the non-speech probability α decreases monotonically as the long-section feature amount goes from τmin to τ0, and increases monotonically as it goes from τ1 to τmax.
It is assumed that τmin, τmax, τ0, and τ1 are appropriate values obtained in advance through experiments.
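Equation (6) and the function of FIG. 10 might be sketched as follows. The text states only that α is monotonic between τmin and τ0 and between τ1 and τmax, so the linear ramps are an assumption, as are the function names and the numeric values of the four constants.

```python
def phoneme_duration(frames_in_word, phonemes_in_word):
    """Equation (6): Ta = Tb / Nf (frames per phoneme of a recognized word)."""
    return frames_in_word / phonemes_in_word


def non_speech_probability_from_duration(ta, tau_min, tau0, tau1, tau_max):
    """Trapezoid-shaped function in the spirit of FIG. 10.

    Requires tau_min < tau0 <= tau1 < tau_max. Linear ramps are assumed.
    """
    if ta <= tau_min or ta >= tau_max:
        return 1.0
    if ta < tau0:
        return (tau0 - ta) / (tau0 - tau_min)    # falls from 1 to 0
    if ta <= tau1:
        return 0.0
    return (ta - tau1) / (tau_max - tau1)        # rises from 0 to 1


# Hypothetical constants; the text obtains them in advance by experiment.
TAUS = dict(tau_min=2.0, tau0=5.0, tau1=15.0, tau_max=30.0)

ta = phoneme_duration(frames_in_word=40, phonemes_in_word=5)
print(ta, non_speech_probability_from_duration(ta, **TAUS))
```

A word of 40 frames with 5 phonemes gives 8 frames per phoneme, which falls in the plateau where this sketch treats the section as speech.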
In the present embodiment, the long segment feature value calculation unit 108 uses phonemes as the unit for calculating the duration length, but other units such as syllables may be used. Further, the function shown in FIG. 10 is merely an example, and the present invention is not limited to this. The function may be defined as an arbitrary function whose function value increases as the distance from the medium value of the long interval feature amount increases.
The effect of this embodiment will be described. When background noise exceeding the threshold continues for a long time, durations that are extremely long or extremely short compared with those obtained from a normal speech recognition result tend to occur. Specifically, when long-lasting background noise produces an extremely long speech section, the sound in that section is background noise and contains almost no speech, so even if the speech recognition unit 111 recognizes it, only a short word may be output as the recognition result. That is, appropriate speech recognition is not performed. Conversely, when an extremely short burst of noise, such as 2 to 3 frames, is taken as a speech section, a word cannot be uttered in so short a time, so the sound in that speech section is judged to be non-speech. Thus the sound in a speech section whose duration is extremely long or extremely short compared with the duration obtained from a normal speech recognition result tends to be non-speech.
Since the speech detection apparatus according to the present embodiment calculates the non-speech probability α using such a property, it is possible to calculate the non-speech probability α with higher accuracy.
Embodiment 4.
A fourth embodiment of the present invention will be described. In the voice detection device of the fourth embodiment, the voice recognition unit 111 of the voice detection device of the third embodiment shown in FIGS. 8 and 9 performs continuous phoneme recognition instead of voice recognition. That is, the speech recognition unit 111 performs continuous phoneme recognition and outputs a phoneme string with time information. The long section feature amount calculation unit 108 obtains the duration time of each phoneme constituting the phoneme string output by the speech recognition unit 111. The operation of the threshold update unit 109 is the same as the operation in the third embodiment described above.
In this embodiment, as in the third embodiment, the unit for calculating the duration is a phoneme. However, a unit such as a syllable may be used.
In the speech detection device according to the present embodiment, the speech recognition unit 111 performs continuous phoneme recognition, so the phoneme duration can be acquired more easily than in the speech detection device according to the third embodiment, which performs speech recognition. This reduces the load of calculating the phoneme duration and increases the processing speed of the speech detection apparatus as a whole. With phoneme recognition, the speech recognition unit 111 recognizes in units of phonemes and can therefore easily obtain the phoneme durations in the utterance section. With speech recognition, by contrast, the number of phonemes must be derived and the duration of each recognized word divided by it to calculate the phoneme duration, as in Equation (6). Obtaining the phoneme duration easily is therefore important for reducing the processing load.
Embodiment 5.
A fifth embodiment of the present invention will be described. The speech detection apparatus according to the fifth embodiment has the same configuration as the speech detection apparatus according to the third embodiment illustrated in FIG. 8 or FIG. 9, but the long section feature amount calculation unit 108 calculates the long section feature amount using the reliability of the speech recognition result.
Specifically, for example, the voice recognition unit 111 extracts appropriate feature amounts from the input speech signal. It then matches the feature amounts of the words stored in the language model / speech recognition dictionary against the extracted feature amounts and outputs the scores of a plurality of candidate speech recognition results. A score is, for example, a numerical value representing the degree of matching between the feature amounts of a word stored in the language model / speech recognition dictionary and the extracted feature amounts. The voice recognition unit 111 outputs the scores of the several candidates with the highest degree of matching.
The long section feature amount calculation unit 108 then sorts the scores of the speech recognition results output by the speech recognition unit 111 in descending order and calculates the difference between the score of the first candidate and the score of the second candidate. When the score difference is small, the reliability of the speech recognition result is considered low; when it is large, the reliability is considered high. A measure other than the score difference may also be used as the measure of reliability.
The threshold update unit 109 calculates the non-speech probability α for each speech section cut out by the speech / non-speech section shaping unit 107, using the long section feature amount calculated by the long section feature amount calculation unit 108, that is, the reliability. Specifically, the threshold update unit 109 obtains the non-speech probability α using, for example, a function of the long-section feature amount such as the one shown in FIG. 11.
FIG. 11 is an explanatory diagram showing a function for obtaining the non-speech probability α in the fifth embodiment of the present invention. As shown in FIG. 11, the horizontal axis represents the value of the long segment feature value, and the vertical axis represents the non-speech probability α. As shown in FIG. 11, the non-speech probability α is 0 when the long-section feature amount is τ0 or more. In addition, when the long-section feature value is 0 to less than τ0, the non-speech probability α monotonously decreases from 1 to 0. It is assumed that τ0 is an appropriate value obtained in advance through experiments. Moreover, the function shown in FIG. 11 is an example, and may be an arbitrary monotone decreasing function or a monotonic non-increasing function.
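The reliability measure and the function of FIG. 11 can be illustrated as below. The linear decrease from 1 to 0 is an assumption (the text requires only a monotonically decreasing or non-increasing function), and the function names, the value of τ0, and the example scores are all hypothetical.

```python
def recognition_reliability(scores):
    """Difference between the best and second-best recognition scores."""
    best, second = sorted(scores, reverse=True)[:2]
    return best - second


def non_speech_probability_from_reliability(score_gap, tau0):
    """Assumed linear version of FIG. 11: 1 at gap 0, 0 at gap >= tau0."""
    if score_gap >= tau0:
        return 0.0
    return 1.0 - max(score_gap, 0.0) / tau0


candidate_scores = [2.0, 9.0, 6.0]   # made-up scores for three candidates
gap = recognition_reliability(candidate_scores)
print(gap, non_speech_probability_from_reliability(gap, tau0=4.0))
```

A small gap between the top two candidates means the recognizer could not clearly prefer one hypothesis, which this sketch maps to a high non-speech probability.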
Because the speech detection apparatus according to the present embodiment calculates the non-speech probability α using the property that a section with a low-reliability speech recognition result is likely to be a non-speech section, it can calculate the non-speech probability more accurately.
Embodiment 6.
A sixth embodiment of the present invention will be described with reference to the drawings. FIG. 12 is a block diagram showing a configuration example of the sixth embodiment of the speech detection device according to the present invention.
The voice detection device according to the sixth embodiment is a combination of the first to fifth embodiments. The long section feature quantity calculation unit 108 calculates a long section feature quantity by combining one or more methods of the first to fifth embodiments. The speech detection apparatus calculates the non-speech probability α using the non-speech probability calculation methods of the first to fifth embodiments, and sets the product of each non-speech probability α as a non-speech probability. Further, the voice detection device may calculate the product after weighting each non-voice probability α and use it as the non-voice probability. Further, the speech detection apparatus may use the average value of each non-speech probability α or an appropriate weighted average value as the non-speech probability.
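Three of the combination rules mentioned above (product, average, and weighted average) can be sketched as follows. The weighted product is left out because the text does not say how the weights enter it. The function names and the example probabilities are hypothetical.

```python
import math


def combine_by_product(alphas):
    """Product of the non-speech probabilities from the individual methods."""
    return math.prod(alphas)


def combine_by_weighted_mean(alphas, weights=None):
    """Weighted average; with no weights this is the plain average."""
    if weights is None:
        weights = [1.0] * len(alphas)
    return sum(w * a for w, a in zip(weights, alphas)) / sum(weights)


# Made-up probabilities from, say, a frame-statistics method, a phoneme
# duration method, and a recognition reliability method.
alphas = [0.5, 0.5, 0.8]
print(combine_by_product(alphas))
print(combine_by_weighted_mean(alphas))
print(combine_by_weighted_mean(alphas, weights=[2.0, 1.0, 1.0]))
```

The product is never larger than the smallest individual probability, so it reports non-speech only when every method agrees, whereas the averages let a single confident method shift the result.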
The voice detection device according to the present embodiment can calculate a more accurate non-voice probability by combining the first to fifth embodiments.
Embodiment 7.
The seventh embodiment of the present invention is a voice recognition device including the voice detection devices of the first to fifth embodiments. The speech recognition apparatus performs a known speech recognition process on a section determined to be a speech section by the speech detection apparatuses of the first to fifth embodiments, and outputs a speech recognition result.
Because the speech recognition apparatus according to the present embodiment performs speech recognition processing on sections that have been determined with high accuracy to be speech sections, it avoids wastefully running speech recognition on non-speech sections. It can also perform speech recognition processing on speech sections accurately and prevent speech sections from being left out of speech recognition processing.
Next, the outline of the present invention will be described. FIG. 13 is a block diagram showing an outline of the present invention. The voice detection apparatus 300 according to the present invention includes a feature quantity calculation unit 301 (corresponding to the feature quantity calculation unit 102 shown in FIG. 1), a voice / non-voice judgment unit 302 (speech / non-voice judgment unit 104 and voice / non-voice judgment unit shown in FIG. (Corresponding to the non-speech segment shaping unit 107), long segment feature value calculating unit 303 (corresponding to the long segment feature value calculating unit 108 shown in FIG. 1), and threshold updating unit 304 (corresponding to the threshold updating unit 109 shown in FIG. 1). including.
The feature amount calculation unit 301 calculates a feature amount of the input signal for each frame, a frame being the input signal for each predetermined unit time. The speech/non-speech determination unit 302 compares the feature amount calculated by the feature amount calculation unit 301 with a voice detection threshold for determining whether or not the input signal is a signal based on speech, and determines whether a section is a speech section, in which a signal based on speech is input over a plurality of frames, or a non-speech section, in which a signal based on non-speech is input over a plurality of frames.
The long-section feature amount calculation unit 303 calculates a long-section feature amount, which is a feature amount of the speech section or the non-speech section, based on statistical values of the feature amounts, calculated by the feature amount calculation unit 301, of the plurality of frames constituting the speech section or the non-speech section.
The threshold update unit 304 uses the long-section feature amount calculated by the long-section feature amount calculation unit 303 to calculate a non-speech probability, which is the probability that the speech section and the non-speech section are sections in which a signal based on non-speech is input, and updates the voice detection threshold based on the calculated non-speech probability.
The voice detection device 300 configured as described above updates the voice detection threshold even when the head of the input signal is a signal based on background noise whose feature amount exceeds the voice detection threshold, so that speech sections can be detected with high accuracy.
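As a non-limiting illustration, the flow from the per-frame determination to the long-section feature amount can be sketched in Python as follows. The function names, the grouping of consecutive identical frame decisions into sections, and the choice of the mean as the statistical value are assumptions made for this sketch only; the actual section shaping rules and feature amounts are those described in the embodiments.

```python
from statistics import fmean


def frame_decisions(features, threshold):
    """Per-frame decision: True (speech) when the feature amount reaches the threshold."""
    return [f >= threshold for f in features]


def shape_sections(decisions):
    """Group consecutive identical decisions into (is_speech, start, end) sections.

    `end` is exclusive. This grouping is a simplified stand-in (an assumption)
    for the section shaping rules, which are not reproduced here.
    """
    sections = []
    start = 0
    for i in range(1, len(decisions) + 1):
        if i == len(decisions) or decisions[i] != decisions[start]:
            sections.append((decisions[start], start, i))
            start = i
    return sections


def long_section_features(features, sections):
    """Long-section feature amount per section, here the mean of its frame features."""
    return [(is_speech, fmean(features[start:end]))
            for is_speech, start, end in sections]


if __name__ == "__main__":
    features = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1]  # hypothetical per-frame feature amounts
    sections = shape_sections(frame_decisions(features, threshold=0.5))
    print(sections)  # [(False, 0, 2), (True, 2, 5), (False, 5, 6)]
    print(long_section_features(features, sections))
```

Each section's long-section feature amount would then be passed to the threshold update unit, which derives the non-speech probability from it.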
In each of the above-described embodiments, voice detection devices as shown in the following (1) to (11) are also disclosed.
(1) A voice detection device in which the long-section feature amount calculation unit 303 calculates the long-section feature amount by performing statistical processing on the feature amounts over one or more speech sections or non-speech sections determined by the speech/non-speech determination unit 302.
(2) A voice detection device in which, when calculating the long-section feature amount, the long-section feature amount calculation unit 303 uses at least one of the average value of the feature amounts for each frame, the mode, the median, and the value at the position reached by counting a predetermined ratio of the results arranged in descending order.
(3) A voice detection device in which the threshold update unit 304 updates the voice detection threshold using the maximum value and the minimum value of the feature amount in the speech section or the non-speech section, together with the non-speech probability.
(4) A voice detection device in which the threshold update unit 304 obtains a value that internally divides the maximum value and the minimum value of the feature amount using the non-speech probability, and updates the voice detection threshold so that it approaches the internally divided value.
(5) A voice detection device that includes a second feature amount calculation unit (corresponding to the speech analysis unit 110 shown in FIG. 7) that calculates a second feature amount different from the feature amount calculated by the feature amount calculation unit 301, and in which the long-section feature amount calculation unit 303 calculates the long-section feature amount using the feature amount calculated by the feature amount calculation unit 301 and the second feature amount calculated by the second feature amount calculation unit.
(6) A voice detection device in which the second feature amount calculation unit (corresponding to the speech recognition unit 111 shown in FIG. 8) performs speech recognition on the input signal and outputs a speech recognition result, and the long-section feature amount calculation unit 303 calculates the long-section feature amount based on the speech recognition result.
(7) A voice detection device in which the long-section feature amount calculation unit 303 calculates the reliability of the speech recognition result as the long-section feature amount.
(8) A voice detection device in which the second feature amount calculation unit outputs the scores of a plurality of candidate speech recognition results, each score being a value indicating the degree to which the feature amount of a word stored in advance in a storage unit matches the feature amount of the input signal to be recognized, and the long-section feature amount calculation unit calculates, as the reliability, the difference between the score of the first candidate and the score of the second candidate in descending order of that degree.
(9) A voice detection device in which the second feature amount calculation unit performs speech recognition on the input signal and outputs a speech recognition result with time information, and the long-section feature amount calculation unit 303 calculates the long-section feature amount from the speech recognition result with time information.
(10) A voice detection device in which the long-section feature amount calculation unit 303 calculates a duration from the time information as the long-section feature amount.
(11) A voice detection device in which the long-section feature amount calculation unit 303 calculates the duration in units of phonemes or syllables.
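As a non-limiting illustration of items (2) and (4), the following Python sketch shows the four statistical values and one possible internal-division update. The function names and the step size `step` are hypothetical. The direction of the internal division (a higher non-speech probability moves the target toward the maximum feature amount) is also an assumption of this sketch, chosen so that a section likely to be non-speech raises the threshold.

```python
from statistics import fmean, median, mode


def top_ratio_value(values, ratio):
    """Value at the position reached by counting `ratio` of the values sorted in descending order."""
    ordered = sorted(values, reverse=True)
    index = min(int(ratio * len(ordered)), len(ordered) - 1)
    return ordered[index]


# Candidate statistics for the long-section feature amount (item (2)).
# Note: `mode` is only meaningful if the feature amounts are quantized.
LONG_SECTION_STATS = {"mean": fmean, "median": median, "mode": mode}


def internal_division_target(section_features, p_non_speech):
    """Internally divide the min and max feature amounts by the non-speech probability.

    Assumption: p_non_speech = 1 gives the maximum, p_non_speech = 0 the minimum.
    """
    lo, hi = min(section_features), max(section_features)
    return (1.0 - p_non_speech) * lo + p_non_speech * hi


def update_threshold(threshold, section_features, p_non_speech, step=0.5):
    """Move the threshold toward the internally divided value (item (4)).

    `step` (0 < step <= 1) is a hypothetical parameter controlling how
    closely the threshold approaches the target in one update.
    """
    target = internal_division_target(section_features, p_non_speech)
    return threshold + step * (target - threshold)


if __name__ == "__main__":
    section = [1.0, 3.0, 2.0, 5.0]  # hypothetical feature amounts of one section
    print(top_ratio_value(section, 0.5))
    print(update_threshold(4.0, section, p_non_speech=0.25))
```

With `step=1.0` the threshold is set directly to the internally divided value; smaller values make the update more gradual.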
While the present invention has been described with reference to the embodiments, the present invention is not limited to the above embodiments. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.
This application claims priority based on Japanese Patent Application No. 2009-291976 filed on December 24, 2009, the entire disclosure of which is incorporated herein.
(Supplementary note 1) A voice detection device comprising: a feature amount calculation unit that calculates a feature amount of an input signal for each frame, a frame being the input signal for each predetermined unit time; a speech/non-speech determination unit that compares the feature amount with a voice detection threshold for determining whether or not the input signal is a signal based on speech, and determines whether a section is a speech section in which a signal based on speech is input over a plurality of frames or a non-speech section in which a signal based on non-speech is input over a plurality of frames; a long-section feature amount calculation unit that calculates a long-section feature amount, which is a feature amount of the speech section or the non-speech section, based on statistical values of the feature amounts, calculated by the feature amount calculation unit, of the plurality of frames constituting the speech section or the non-speech section; and a threshold update unit that uses the long-section feature amount to calculate a non-speech probability, which is the probability that the speech section and the non-speech section are sections in which a signal based on non-speech is input, and updates the voice detection threshold based on the calculated non-speech probability.
(Supplementary note 2) The voice detection device according to supplementary note 1, wherein the long-section feature amount calculation unit calculates the long-section feature amount by performing statistical processing on the feature amounts over one or more speech sections or non-speech sections determined by the speech/non-speech determination unit.
(Supplementary note 3) The voice detection device according to supplementary note 1 or 2, wherein, when calculating the long-section feature amount, the long-section feature amount calculation unit uses at least one of the average value of the feature amounts for each frame, the mode, the median, and the value at the position reached by counting a predetermined ratio of the results arranged in descending order.
(Supplementary note 4) The voice detection device according to any one of supplementary notes 1 to 3, wherein the threshold update unit updates the voice detection threshold using the maximum value and the minimum value of the feature amount in the speech section or the non-speech section, together with the non-speech probability.
(Supplementary note 5) The voice detection device according to supplementary note 4, wherein the threshold update unit obtains a value that internally divides the maximum value and the minimum value of the feature amount using the non-speech probability, and updates the voice detection threshold so that it approaches the internally divided value.
(Supplementary note 6) The voice detection device according to any one of supplementary notes 1 to 5, further comprising a second feature amount calculation unit that calculates a second feature amount different from the feature amount calculated by the feature amount calculation unit, wherein the long-section feature amount calculation unit calculates the long-section feature amount using the feature amount calculated by the feature amount calculation unit and the second feature amount calculated by the second feature amount calculation unit.
(Supplementary note 7) The voice detection device according to supplementary note 6, wherein the second feature amount calculation unit performs speech recognition on the input signal and outputs a speech recognition result, and the long-section feature amount calculation unit calculates the long-section feature amount based on the speech recognition result.
(Supplementary note 8) The voice detection device according to supplementary note 7, wherein the long-section feature amount calculation unit calculates the reliability of the speech recognition result as the long-section feature amount.
(Supplementary note 9) The voice detection device according to supplementary note 8, wherein the second feature amount calculation unit outputs the scores of a plurality of candidate speech recognition results, each score being a value indicating the degree to which the feature amount of a word stored in advance in a storage unit matches the feature amount of the input signal to be recognized, and the long-section feature amount calculation unit calculates, as the reliability, the difference between the score of the first candidate and the score of the second candidate in descending order of that degree.
(Supplementary note 10) The voice detection device according to supplementary note 6, wherein the second feature amount calculation unit performs speech recognition on the input signal and outputs a speech recognition result with time information, and the long-section feature amount calculation unit calculates the long-section feature amount from the speech recognition result with time information.
(Supplementary note 11) The voice detection device according to supplementary note 10, wherein the long-section feature amount calculation unit calculates a duration from the time information as the long-section feature amount.
(Supplementary note 12) The voice detection device according to supplementary note 11, wherein the long-section feature amount calculation unit calculates the duration in units of phonemes or syllables.
(Supplementary note 13) A speech recognition device comprising the voice detection device according to any one of supplementary notes 1 to 12, wherein speech recognition is performed on the speech sections output by the voice detection device and a speech recognition result is output.
(Supplementary note 14) A voice detection method comprising: calculating a feature amount of an input signal for each frame, a frame being the input signal within a predetermined unit time; comparing the feature amount with a voice detection threshold for determining whether or not the input signal is a signal based on speech, and determining whether a section is a speech section in which a signal based on speech is input over a plurality of frames or a non-speech section in which a signal based on non-speech is input over a plurality of frames; calculating a long-section feature amount, which is a feature amount of the speech section or the non-speech section, based on statistical values of the feature amounts of the plurality of frames constituting the speech section or the non-speech section; using the long-section feature amount to calculate a non-speech probability, which is the probability that the speech section and the non-speech section are sections in which a signal based on non-speech is input; and updating the voice detection threshold based on the calculated non-speech probability.
(Supplementary note 15) The voice detection method according to supplementary note 14, wherein the long-section feature amount is calculated by performing statistical processing on the feature amounts over one or more speech sections or non-speech sections.
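As a non-limiting illustration of supplementary notes 9, 11, and 12, the reliability and duration computations can be sketched in Python as follows. The function names and the data layout (a list of candidate scores, and a list of `(label, start_time, end_time)` tuples for phonemes or syllables) are hypothetical; the output format of an actual speech recognition unit will differ.

```python
def recognition_reliability(candidate_scores):
    """Reliability as the score gap between the best and second-best candidates.

    Assumes a higher score means a better match.
    """
    if len(candidate_scores) < 2:
        raise ValueError("at least two candidate scores are required")
    first, second = sorted(candidate_scores, reverse=True)[:2]
    return first - second


def unit_durations(timed_units):
    """Duration of each phoneme or syllable from its start and end times.

    `timed_units` is a list of (label, start_time, end_time) tuples, a
    hypothetical stand-in for a speech recognition result with time information.
    """
    return [(label, end - start) for label, start, end in timed_units]


if __name__ == "__main__":
    # Hypothetical scores of three recognition candidates.
    print(recognition_reliability([-120.0, -100.0, -150.0]))
    # Hypothetical phoneme alignment in seconds.
    print(unit_durations([("k", 0.0, 0.25), ("a", 0.25, 0.75)]))
```

A large score gap suggests that one candidate clearly matches the input, while implausibly long or short unit durations may indicate that the section is not speech; how these values are mapped to a non-speech probability is left open in this sketch.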

DESCRIPTION OF SYMBOLS
101 Waveform cut-out unit
102, 301 Feature amount calculation unit
103 Threshold storage unit
104, 302 Speech/non-speech determination unit
105 Determination result holding unit
106 Shaping rule storage unit
107 Speech/non-speech section shaping unit
108, 303 Long-section feature amount calculation unit
109, 304 Threshold update unit
110 Speech analysis unit
111 Speech recognition unit
300 Voice detection device

Claims

1. A voice detection device comprising:
feature amount calculating means for calculating a feature amount of an input signal for each frame, a frame being the input signal for each unit time;
speech/non-speech determination means for comparing the feature amount with a threshold and determining whether a section is a speech section in which a signal based on speech is input over a plurality of frames or a non-speech section in which a signal based on non-speech is input over a plurality of frames;
long-section feature amount calculating means for calculating a long-section feature amount, which is a feature amount of the speech section or the non-speech section, based on statistical values of the feature amounts, calculated by the feature amount calculating means, of the plurality of frames constituting the speech section or the non-speech section; and
threshold updating means for using the long-section feature amount to calculate a non-speech probability, which is the probability that the speech section and the non-speech section are sections in which a signal based on non-speech is input, and for updating the threshold based on the calculated non-speech probability.

2. The voice detection device according to claim 1, wherein the long-section feature amount calculating means calculates the long-section feature amount by performing statistical processing on the feature amounts over a plurality of the speech sections or the non-speech sections determined by the speech/non-speech determination means.

3. The voice detection device according to claim 1, wherein, when calculating the long-section feature amount, the long-section feature amount calculating means uses at least one of the average value of the feature amounts for each frame, the mode, the median, and the value at the position reached by counting a predetermined ratio of the results arranged in descending order.

4. The voice detection device according to claim 1, wherein the threshold updating means updates the voice detection threshold using the maximum value and the minimum value of the feature amount in the speech section or the non-speech section, together with the non-speech probability.

5. The voice detection device according to claim 1, wherein the threshold updating means obtains a value that internally divides the maximum value and the minimum value of the feature amount using the non-speech probability, and updates the threshold so that it approaches the internally divided value.

6. The voice detection device according to any one of claims 1 to 5, further comprising second feature amount calculating means for calculating a second feature amount different from the feature amount calculated by the feature amount calculating means,
wherein the long-section feature amount calculating means calculates the long-section feature amount using the feature amount calculated by the feature amount calculating means and the second feature amount calculated by the second feature amount calculating means.

7. The voice detection device according to claim 6, wherein the second feature amount calculating means performs speech recognition on the input signal and outputs a speech recognition result, and
the long-section feature amount calculating means calculates the long-section feature amount based on the speech recognition result.

8. The voice detection device according to claim 7, wherein the long-section feature amount calculating means calculates the reliability of the speech recognition result as the long-section feature amount.

9. A voice detection method comprising:
calculating a feature amount of an input signal for each frame, a frame being the input signal within a unit time;
comparing the feature amount with a threshold and determining whether a section is a speech section in which a signal based on speech is input over a plurality of frames or a non-speech section in which a signal based on non-speech is input over a plurality of frames;
calculating a long-section feature amount, which is a feature amount of the speech section or the non-speech section, based on statistical values of the feature amounts of the plurality of frames constituting the speech section or the non-speech section; and
using the long-section feature amount to calculate a non-speech probability, which is the probability that the speech section and the non-speech section are sections in which a signal based on non-speech is input, and updating the threshold based on the calculated non-speech probability.

10. A voice detection program for causing a computer to execute:
a feature amount calculation process for calculating a feature amount of an input signal for each frame, a frame being the input signal for each unit time;
a speech/non-speech determination process for comparing the feature amount with a threshold and determining whether a section is a speech section in which a signal based on speech is input over a plurality of frames or a non-speech section in which a signal based on non-speech is input over a plurality of frames;
a long-section feature amount calculation process for calculating a long-section feature amount, which is a feature amount of the speech section or the non-speech section, based on statistical values of the feature amounts, calculated in the feature amount calculation process, of the plurality of frames constituting the speech section or the non-speech section; and
a threshold update process for using the long-section feature amount to calculate a non-speech probability, which is the probability that the speech section and the non-speech section are sections in which a signal based on non-speech is input, and for updating the threshold based on the calculated non-speech probability.