JPS62211699A

JPS62211699A - Voice section detecting circuit

Info

Publication number: JPS62211699A
Application number: JP61055251A
Authority: JP
Inventors: 杉　伸夫; 洋一竹林
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1986-03-13
Filing date: 1986-03-13
Publication date: 1987-09-17

Abstract

(57) [Abstract] This publication contains application data filed before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention] [Object of the Invention] (Field of Industrial Application) The present invention relates to a voice section detection circuit that can simply and accurately detect the voice section of input speech, as required for speech recognition processing.

(Prior Art) In recent years, various attempts have been made to apply recognition processing to input speech and use the result for information processing. For example, speech recognition processing plays an increasingly important role as a means of information input in a man-machine interface to a computer, or as a means of entering document text in a voice word processor.

This speech recognition processing is generally performed by matching the feature data sequence of the input speech against preset dictionary data. For this reason, detecting the input voice section (the start and end of the speech) within the input speech data with high accuracy is an important issue.

Conventionally, therefore, a voice section detection circuit is generally used that detects the start point when the input speech power value rises above a certain threshold, and detects the end point when the speech power value falls below that threshold. That is, the speech power value is compared with a predetermined threshold to detect the section of the input data that actually contains speech.
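The conventional single-threshold scheme described above can be sketched as follows. This is a minimal illustration, not code from the publication: the function name `detect_single_threshold` and the representation of the input as a list of per-frame power values are assumptions.

```python
def detect_single_threshold(power, threshold):
    """Return (start, end) frame indices, or None if the threshold is never exceeded.

    start: first frame whose power exceeds the threshold.
    end:   first later frame whose power falls below the threshold
           (the last frame if the power never drops again).
    """
    start = None
    for i, x in enumerate(power):
        if start is None:
            if x > threshold:
                start = i
        elif x < threshold:
            return start, i
    if start is None:
        return None
    return start, len(power) - 1
```

With a low threshold, a single noisy frame is enough to open and close a section, which is the weakness the following paragraphs discuss.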

However, detecting the voice section in this way has the following problems.

That is, there is only one threshold for start and end detection. If that threshold is set low, the detected voice section may include noise components. Conversely, if the threshold is set high, speech data with a low input level may be regarded as noise and discarded. These problems are especially likely to occur at the start of the input speech.

It has therefore been proposed to set two thresholds: when the maximum power of the input speech stays within a certain range, the lower threshold is used for voice section detection, and when the maximum speech power exceeds that range, the higher threshold is used.

However, a large maximum input speech power does not necessarily mean that the power of the input speech is high overall. Consequently, if the threshold is raised according to the maximum speech power, it is very likely that speech will be cut off at the start.
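The two-threshold scheme and its weakness can be illustrated with a small sketch. The names `choose_threshold`, `first_frame_above`, and `peak_limit`, and all the numeric values, are hypothetical and chosen only for illustration.

```python
def first_frame_above(power, threshold):
    """Index of the first frame whose power exceeds threshold, or None."""
    for i, x in enumerate(power):
        if x > threshold:
            return i
    return None


def choose_threshold(power, low, high, peak_limit):
    """Use the low threshold for quiet input and the high one for loud input."""
    return low if max(power) <= peak_limit else high


# A quiet onset (frames 1 to 3) followed by a loud peak.
power = [0, 2, 2, 3, 10, 12, 3, 0]
threshold = choose_threshold(power, low=1, high=5, peak_limit=8)
start = first_frame_above(power, threshold)
# The loud peak selects the high threshold, so the quiet onset is lost:
# detection starts at the loud frames instead of at frame 1.
```

In this example the peak alone decides the threshold, so the low-power onset that the lower threshold would have kept is truncated.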

(Problems to be Solved by the Invention) In view of the fact that it has conventionally been very difficult to detect the voice section, and in particular its start point, with high accuracy, an object of the present invention is to provide a voice section detection circuit that can detect the voice section simply and accurately regardless of the magnitude of the input speech power value.

[Constitution of the Invention] (Means for Solving the Problems) The present invention is characterized in that: (1) when the power value of the input speech exceeds a predetermined threshold T1 for at least a preset first number of frames F1, the frame in which the speech power value first exceeded the threshold T1 is detected as the start frame of the input speech; (2) when, after detection of this start frame, the speech power value falls below the threshold T1 for at least a preset second number of frames F2, the frame in which the speech power value first fell below the threshold T1 is detected as the end frame of the input speech; and (3) when, after detection of the start frame, the speech power value becomes larger than the threshold T1 by a predetermined set value D or more, the frame in which the speech power value exceeded the threshold T1 immediately before the value reached its maximum is re-detected as the start frame of the input speech.

(Operation) Thus, according to the present invention, a start point is not detected unless the speech power value exceeds the threshold T1 for at least F1 frames, so noise components are not detected as a voice section even when the power of the input speech is high overall.

Also, once the start point has been detected, an end point is not detected unless the speech power value stays below the threshold T1 for at least F2 frames, so a transient drop in the speech power value is not mistakenly detected as the end point.

Furthermore, if the speech power value rises above a predetermined set value after the start frame has been detected, detection is redone using, as the start frame, the frame in which the speech power value exceeded the threshold T1 immediately before that state was reached. The start frame can therefore be detected while effectively excluding noise components that are easily mistaken for the start point when the speech power level is high.

Consequently, the voice section can be detected simply and accurately regardless of the magnitude of the input speech power value.

(Embodiment) An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a schematic block diagram of a speech recognition apparatus incorporating the voice section detection circuit according to the present invention.

This speech recognition apparatus basically comprises an acoustic analysis circuit 1, a voice section detection circuit 2, a matching processing circuit 3, a dictionary data memory 4, and a post-processing circuit 5.

The acoustic analysis circuit 1 consists of, for example, a bank of band-pass filters, and acoustically analyzes the input speech at every predetermined frame period, by frequency analysis or the like, to obtain its acoustic features.

At the same time, it obtains the full-band speech power of the input speech at every frame period. These analysis data (input speech data) are digitized frame by frame and output from the acoustic analysis circuit 1.
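For readers working in software, the per-frame power that the detection steps operate on can be approximated as below. This is only a stand-in under stated assumptions: the publication obtains the power in hardware from the acoustic analysis circuit, whereas this sketch assumes already-digitized samples, non-overlapping frames, and mean-square energy as the power measure. The name `frame_power` is hypothetical.

```python
def frame_power(samples, frame_len):
    """Mean-square power of each complete, non-overlapping frame.

    Trailing samples that do not fill a whole frame are dropped.
    """
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(s * s for s in frame) / frame_len)
    return powers
```

The resulting list plays the role of the frame-by-frame power values read from the speech data RAM in the description that follows.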

The voice section detection circuit 2 comprises a speech data RAM 2a that stores the speech data obtained by the acoustic analysis circuit 1, a work memory 2b used for the voice section detection processing described later, and so on. The acoustic feature data of the voice section detected by the voice section detection circuit 2 are selectively read from the speech data RAM 2a and supplied to the matching processing circuit 3.

The matching processing circuit 3 recognizes the input speech by matching the feature data sequence of the supplied voice section against the speech feature data of a plurality of recognition target categories registered in advance in the dictionary data memory (ROM) 4, for example by calculating the multiple similarity between them. The recognition candidates obtained for the input speech by the matching processing circuit 3 are supplied to the post-processing circuit 5, which applies a verification of the similarity values obtained for the candidates and a linguistic verification to obtain the recognition result.

The voice section detection processing in the voice section detection circuit 2 is performed as follows.

FIGS. 2(a) and 2(b) illustrate the concept of this processing, and FIG. 3 shows the flow of the processing procedure.

In the voice section detection circuit 2, detection is basically performed by comparing the power value X of the input speech with a preset threshold T1. First, as shown in FIG. 2(a), when the speech power value X exceeds the threshold T1 for at least a preset number of frames F1, the frame in which the speech power value X first exceeded the threshold T1 is detected as the start frame S1 of the input speech.
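The start-frame rule can be sketched as a scan for the first run of at least F1 consecutive frames above T1. The function name `detect_start` and the list-based input are assumptions made for illustration; the publication describes a circuit, not this code.

```python
def detect_start(power, t1, f1):
    """Return the first frame of the first run of at least f1 consecutive
    frames with power > t1, or None if no such run exists."""
    run_start = None
    for i, x in enumerate(power):
        if x > t1:
            if run_start is None:
                run_start = i          # candidate start frame
            if i - run_start + 1 >= f1:
                return run_start       # run is long enough: confirm it
        else:
            run_start = None           # run broken: discard the candidate
    return None
```

For example, with T1 = 3 and F1 = 3, a one-frame spike is discarded and the start is placed at the first frame of the sustained run that follows it.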

After the start frame S1 has been detected in this way, when the speech power value X falls below the threshold T1 for at least a preset number of frames F2, the frame in which the speech power value X first fell below the threshold T1 is detected as the end frame E of the input speech. In this example, there is a period after detection of the start frame S1 during which the speech power value X is below the threshold T1, but because the length G of that period is less than F2 frames, the frame in which X first fell below T1 at that time is not detected as the end frame.
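The end-frame rule mirrors the start-frame rule: a dip below T1 counts as the end only if it lasts at least F2 frames. A minimal sketch, assuming a list of per-frame power values and an already known start frame (the name `detect_end` is hypothetical):

```python
def detect_end(power, t1, f2, start):
    """Return the first frame of the first run of at least f2 consecutive
    frames with power < t1 after `start`, or None if no such run exists."""
    run_start = None
    for i in range(start + 1, len(power)):
        if power[i] < t1:
            if run_start is None:
                run_start = i          # candidate end frame
            if i - run_start + 1 >= f2:
                return run_start       # dip is long enough: confirm it
        else:
            run_start = None           # power recovered: dip was transient
    return None
```

With T1 = 3 and F2 = 3, a two-frame dip (the period G in the text) is skipped and the end is placed at the start of the later three-frame dip.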

The above is the primary detection processing for the input speech, by which the start frame S1 and the end frame E of the input speech are detected.

However, when the power of the input speech is high overall, the power of the noise components may be correspondingly high, and the start frame S1 detected as described above may include a noise component.

The voice section detection circuit 2 therefore monitors the speech power value X sequentially, and when, after detection of the start frame, the speech power value X rises above the threshold T1 by a preset value D or more, it assumes a second threshold T2 as shown in FIG. 2(b). This second threshold T2 is assumed to be lower than the speech power value X by the set value D, that is, T2 = X - D. The frame in which the speech power value X exceeded the threshold T1 immediately before exceeding the threshold T2 is then re-detected as the start frame S3 of the input speech. Specifically, after detection of the start frame, the circuit detects the state in which the speech power value X reaches (T1 + D) or more, and redoes the detection taking as the new start frame the frame in which X exceeded T1 immediately before that state was reached.
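One reading of this re-detection step can be sketched as follows: find the first frame at or above T1 + D, then walk back to the most recent frame at which the power rose above T1. The function name `redetect_start`, the use of `>=` for the T1 + D comparison, and the list-based input are assumptions for illustration.

```python
def redetect_start(power, t1, d, s1):
    """Return the corrected start frame.

    Finds the first frame from s1 onward whose power reaches t1 + d, then
    walks back to the most recent frame at which the power rose above t1.
    Returns s1 unchanged if the power never reaches t1 + d.
    """
    for i in range(s1, len(power)):
        if power[i] >= t1 + d:
            j = i
            # Walk back while the previous frame is also above t1.
            while j > s1 and power[j - 1] > t1:
                j -= 1
            return j
    return s1
```

If the power dipped below T1 between the primary start frame and the loud portion, the walk back stops at the rise after the dip, so the leading noise is excluded; if there was no dip, the walk back reaches `s1` and the start frame is unchanged.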

This is the secondary detection processing. It is performed, when the power value X of the input speech rises above the threshold T1 by the set value D or more, that is, reaches (T1 + D) or more, in consideration of the possibility that the start frame S1 obtained in the primary detection processing includes a noise component. The processing is repeated, updating the assumed threshold T2 as described above, until, for example, the speech power value X reaches its maximum.

As a result, when for example the speech power value X temporarily falls below the threshold T1, the previously detected start frame is corrected. As shown in FIG. 2(b), the noise portion marked in the figure is cut off, and voice section detection is performed from the true start portion of the input speech.

FIG. 3 shows the flow of this voice section detection processing (detection of the start and end frames), which is briefly described below.

This processing begins after the data areas set up in the work memory 2b have each been initialized to "0" (step a).

Here, S1 is the area that stores the start candidate frame obtained in the primary detection processing, and S2 is the area that stores the start candidate frame obtained in the secondary detection processing. S3 is the area that stores the finally determined start frame, and E is the area that stores the end frame. FLAG is the area that indicates that the secondary detection processing has made a change of the start candidate frame necessary.

When the above initialization is complete, speech power values xi are read in sequence from the speech data RAM 2a (step b), and it is determined whether the speech power value xi exceeds the first threshold T1 (step c). When xi exceeds the first threshold T1, the frame number at that time is stored as the start candidate frame S1 (step d).

Subsequently, the next speech power value xi is read from the speech data RAM 2a (step e) and it is determined whether it exceeds the first threshold T1 (step f). If xi is below the first threshold T1 at this point, the processing is repeated from step b described above.

If xi exceeds the first threshold T1 at this point, it is determined whether that state has continued for at least the preset number of frames F1 (step g); if it has lasted fewer than F1 frames, the processing is repeated from step e.

When the state in which xi exceeds the threshold T1 continues for F1 frames or more, the frame number obtained as the start candidate frame S1 is stored as the start frame S3 obtained for the input speech (step h).

The start frame is detected by the primary detection processing as described above.

Thereafter, a speech power value xi is again read from the speech data RAM 2a (step i), and it is determined whether it exceeds the first threshold T1 (step j). If xi is below the first threshold T1, the frame number at that time is stored as the end frame E (step k), and then the next speech power value xi is read (step l) and it is determined whether it exceeds the first threshold T1 (step m).

If xi is below the first threshold T1 here, it is next determined whether that state continues for the set number of frames F2 (step n). When that condition is satisfied, the frame number set in the start frame S3 is confirmed as the start frame of the input speech and the frame number set in the end frame E is confirmed as the end frame (step o), and the voice section detection processing ends.

In this case, therefore, the voice section detection processing ends without the secondary detection processing being performed.

If, after the end frame E has been obtained as described above, xi exceeds the first threshold T1 before the state in which xi is below T1 has continued for the predetermined number of frames F2 or more, this is detected by the determination in step m. In this case it is judged that the previously detected start frame may include a noise component. The frame number at that time is therefore stored as the start candidate frame S2 for the secondary detection processing (step p). FLAG is then set to "1" to indicate that re-detection of the start frame by the secondary detection processing, that is, redoing of the start frame detection, may become necessary (step q).

Next, it is determined whether the speech power value xi is higher than the threshold T1 by the set value D or more, that is, whether xi is higher than (T1 + D) (step r). This processing is also performed when, after detection of the start frame S3, the state in which xi is higher than T1 continues. If this condition is not satisfied, the processing returns to step i and the processing described above is repeated.

If an end frame is detected during this repetition, the temporary drop of the speech power value xi below the threshold T1 is judged to have been due to noise.

In this case, therefore, the frame number detected as the start candidate frame S2 is ignored.

If, on the other hand, the speech power value xi rises above (T1 + D) during this repetition, it is next determined whether the value has reached its maximum (step s). When the maximum of the speech power xi has been detected, it is determined whether FLAG is "1" (step t).

If FLAG is "0", this indicates that the speech power xi has stayed above the threshold T1 continuously. In this case the start frame S3 of the input speech does not need to be re-detected, and the processing is repeated from step i, ignoring the information in the start candidate frame S2.

If, on the other hand, FLAG is "1", this indicates that, as described above, the speech power value xi once fell below the threshold T1 and then rose above (T1 + D). In this case it is judged that, because the input level of the speech power is high overall, the start frame obtained in the primary detection processing points to a noise portion. The start candidate frame S2 detected in step m is therefore written into S3 as the newly obtained start frame (step u), and at the same time FLAG is reset to "0".

By this processing, when the speech power xi rises above (T1 + D) after having once fallen below the threshold T1, the frame in which xi exceeded the threshold T1 immediately before reaching its maximum is obtained as the new start frame S3. In other words, the start candidate frame obtained before xi once fell below T1 is replaced, as the new start candidate frame, by the frame in which xi rose above T1 again after having fallen below it.

The above is the secondary detection processing; after it has been performed, the end frame detection processing described earlier is carried out.
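The flow of steps a to u can be condensed into a single loop, sketched below with the same working areas (S1, S2, S3, E, FLAG). Several points are assumptions rather than statements from the publication: the function name `detect_voice_section`, the list-based input, treating "reached its maximum" as "is the frame holding the largest value in the stored sequence", and using `>=` for the T1 + D comparison.

```python
def detect_voice_section(power, t1, d, f1, f2):
    """Return (start, end) frame indices, or None if no complete section is found."""
    if not power:
        return None
    # Assumption: "maximum" means the largest value in the stored sequence.
    peak = max(range(len(power)), key=lambda k: power[k])
    s1 = s2 = s3 = e = None
    flag = False   # True after a short dip below t1 (FLAG in the text)
    above = 0      # consecutive frames above t1 before the start is fixed
    below = 0      # consecutive frames not above t1 after the start is fixed
    for i, x in enumerate(power):
        if s3 is None:                      # steps b to h
            if x > t1:
                if above == 0:
                    s1 = i                  # step d: start candidate
                above += 1
                if above >= f1:
                    s3 = s1                 # step h: primary start frame
            else:
                above = 0
            continue
        if x > t1:                          # steps j, m: power is above t1
            if below > 0:                   # steps p, q: rose again after a dip
                s2, flag, below = i, True, 0
            if x >= t1 + d and i == peak and flag:   # steps r, s, t
                s3, flag = s2, False        # step u: replace the start frame
        else:
            if below == 0:
                e = i                       # step k: end candidate
            below += 1
            if below >= f2:                 # steps n, o: section confirmed
                return s3, e
    return None
```

For instance, `detect_voice_section([0, 4, 4, 4, 1, 1, 5, 9, 12, 9, 5, 1, 1, 1, 0], t1=3, d=6, f1=3, f2=3)` first takes frame 1 as the start, then moves it to frame 6 once the peak shows that frames 1 to 3 were a separate low-level burst. A dip that occurs after the peak sets FLAG but never moves the start frame, because the peak has already passed.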

In this way, the voice section detection circuit 2 detects as the start frame the first frame to exceed the threshold T1 when that state then continues for the predetermined number of frames F1, and detects as the end frame the first frame to fall below the threshold T1 when that state then continues for the predetermined number of frames F2. Furthermore, if the speech power exceeds a predetermined value after detection of the start frame, the frame that exceeded the threshold T1 immediately before the speech power exceeded that predetermined value is re-detected as the start frame.

Accordingly, the voice section can be detected with high accuracy regardless of the power level of the input speech and with noise components effectively excluded. Moreover, the processing is not particularly complex and can be carried out relatively simply. Speech recognition processing can therefore be performed with high accuracy using the voice section information detected in this way.

The present invention is not limited to the embodiment described above. For example, the threshold T1 and the frame counts F1 and F2 may be set according to the specifications.

The algorithms for detecting the start frame and the end frame can also be modified in various ways. In short, the present invention can be practiced with various modifications without departing from its gist.

[Effects of the Invention] As described above, according to the present invention, the voice section of input speech can be obtained accurately and simply, regardless of its input level and with noise components effectively excluded.

[Brief explanation of drawings]

FIG. 1 is a schematic block diagram of a speech recognition apparatus configured using a voice section detection circuit according to an embodiment of the present invention, FIG. 2 is a diagram showing the processing concept of the voice section detection circuit, and FIG. 3 is a diagram showing the processing procedure for voice section detection. 1: acoustic analysis circuit; 2: voice section detection circuit; 3: matching processing circuit; 4: dictionary data memory; 5: post-processing circuit.

Claims

[Claims]

(1) A voice section detection circuit comprising: means for detecting, when the power value of input speech exceeds a predetermined threshold for at least a preset first number of frames, the frame in which the speech power value first exceeded the threshold as the start frame of the input speech; means for detecting, when the speech power value falls below the threshold for at least a preset second number of frames after detection of the start frame, the frame in which the speech power value first fell below the threshold as the end frame of the input speech; and means for re-detecting, when the speech power value becomes larger than the threshold by a predetermined set value or more after detection of the start frame, the frame in which the speech power value exceeded the threshold immediately before the value reached its maximum as the start frame of the input speech.

(2) The voice section detection circuit according to claim 1, wherein the power value of the input speech is digitized at a predetermined frame period, stored in a speech data memory, and used for the voice section detection processing.