JPH02184915A

JPH02184915A - Speech recognition device

Info

Publication number: JPH02184915A
Application number: JP1005427A
Authority: JP
Inventors: Nobuo Sugi; 杉　伸夫
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1989-01-12
Filing date: 1989-01-12
Publication date: 1990-07-19
Anticipated expiration: 2013-11-25
Also published as: JP2829014B2

Abstract

PURPOSE: To perform speech recognition on speech data with no missing leading part, by sequentially storing in a ring buffer the input speech data received just before a start command is supplied, and by using that data when detecting the speech section and extracting the speech data. CONSTITUTION: The input speech data received just before the command instructing the start of speech recognition processing is sequentially stored in the ring buffer 3 over a fixed number of frames. Therefore, even when the leading part is missing from the input speech data stored in the data buffer 4, the start of the speech section can be detected by supplementing the missing part with the speech data stored in the ring buffer 3, and the data of the speech section can be extracted without loss. In this way, even when an utterance begins before the command to start speech recognition processing is input, the utterance can be reliably recognized.

Description

DETAILED DESCRIPTION OF THE INVENTION [Object of the Invention] (Industrial Field of Application) The present invention relates to a speech recognition device that can reliably recognize input speech without dropping the leading part of the input speech data.

(Prior Art) In recent years, speech recognition has attracted attention as one of the key technologies for realizing a highly natural man-machine interface, and it has been the subject of extensive research and development. This kind of speech recognition processing basically (1) analyzes the input speech data collected from a microphone, (2) detects the input speech section according to the analysis result, (3) matches the feature data sequence of that section (the input speech pattern) against previously obtained standard patterns of the recognition target categories, and (4) judges the matching result to identify the input speech. Accordingly, how accurately the speech section is detected within the input speech data, and how its input speech data (feature data sequence) is extracted and passed to matching against the dictionary patterns, is a major factor in determining recognition performance.

In conventional speech recognition devices, however, the recognition operation starts upon receipt of a command to start recognition processing from a host system (control unit), and recognition is performed on the speech data input after the operation has started. It is generally very difficult to align the timing of that command with the timing at which the speaker starts to speak. The usual practice is therefore for the host system (control unit) to prompt the speaker at regular intervals, by means such as synthesized speech output or a display, and to capture and recognize the speech that the speaker utters in time with the prompt.

More recently, a proximity sensor such as an optical or ultrasonic sensor has been placed near the microphone that captures the speech, so that the command is issued when the sensor detects that the speaker has approached the microphone and is about to speak. A speech recognition device that uses such a proximity sensor to set the start timing of the recognition operation can provide a better, more natural man-machine interface.

However, there is no guarantee that every speaker will begin speaking only after coming close to the microphone. Speakers who are used to the system often begin speaking while still approaching, that is, before they are sufficiently close to the microphone.

In such a case the command to start recognition processing is issued after the utterance has begun, so the leading part of the input speech is dropped. This inevitably causes failures in detecting the start of the input speech section, or becomes a source of misrecognition. Various remedies have been tried, such as raising the sensitivity of the proximity sensor so that the command is issued slightly earlier, or issuing a request to speak a predetermined time after the speaker's approach is detected. Because of individual differences among speakers, however, setting this timing is very difficult, and none of these measures is a fundamental solution.

(Problem to be Solved by the Invention) Thus, in conventional devices, even when the speaker's approach is detected and the start command for recognition processing is issued accordingly, the utterance often begins before the recognition operation starts. The leading part of the speech is dropped at the start of the recognition operation, and misrecognition or recognition rejects are therefore likely to occur.

The present invention has been made in view of these circumstances. Its object is to provide a highly practical speech recognition device of simple configuration that, without requiring cumbersome timing settings, reliably collects input speech from its leading part and performs highly accurate speech recognition.

[Constitution of the Invention] (Means for Solving the Problem) The speech recognition device according to the present invention comprises, in addition to a data buffer that stores the input speech data received after a command to start recognition processing is input, a ring buffer that sequentially stores a fixed number of frames of input speech data received up to just before the command is input. When the start of the speech section of the input speech is not detected from the input speech data stored in the data buffer, the input speech data stored in the ring buffer and the speech data stored in the data buffer are treated as one continuous sequence for speech section detection.

(Operation) In the device configured in this way, the input speech data received up to just before the command instructing the start of speech recognition processing is sequentially stored in the ring buffer over a fixed number of frames. Therefore, even when the leading part is missing from the input speech data stored in the data buffer, the missing part can be supplemented with the speech data stored in the ring buffer to detect the start of the speech section, and the data of that speech section can be extracted without loss.

As a result, even when the utterance begins before the command to start recognition processing is input, the speech can be reliably recognized. Moreover, the input speech can be recognized simply and reliably without the complicated timing adjustment required in the past.

(Embodiment) A speech recognition device according to one embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a schematic block diagram of the device of the embodiment. Reference numeral 1 denotes a feature extraction unit that acoustically analyzes the input speech collected through a microphone and extracts its feature data. The feature extraction unit 1 is configured to obtain the feature parameters of the input speech sequentially, for example by performing band-pass filtering or LPC analysis at a predetermined frame period.

The input speech data (feature data sequence) captured through the feature extraction unit 1 is selectively fed, via an input switch 2, to a ring buffer 3 or a data buffer 4. The input switch 2 switches under the control of a control unit 5. Normally it supplies the input speech data analyzed by the feature extraction unit 1 to the ring buffer 3; from the moment a command to start the recognition operation is input to the control unit 5, it supplies the subsequent input speech data to the data buffer 4.

The ring buffer 3 has the capacity to store the input speech data sequentially over, for example, 50 frames. Once the stored data reaches this limit, the buffer discards the oldest data as it stores the newest, so that it always holds the most recent 50 frames of input speech data. This storing operation continues until the input switch 2 is switched. In other words, up to the moment at which a command input to the control unit 5 instructs the start of the recognition operation, the ring buffer holds the most recent 50 frames of input speech data preceding that moment.
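The routing performed by the input switch 2 and the overwrite behavior of the ring buffer 3 can be sketched roughly as follows. This is a minimal illustration, not the circuit of the embodiment: the class name `FrameRouter`, its methods, and the use of Python's `collections.deque` are assumptions made for the example, and a frame is treated as an opaque value.

```python
from collections import deque

RING_FRAMES = 50  # capacity given as an example in the embodiment


class FrameRouter:
    """Hypothetical stand-in for input switch 2, ring buffer 3 and data buffer 4."""

    def __init__(self, ring_frames=RING_FRAMES):
        # A deque with maxlen drops its oldest item when a new one is
        # appended to a full buffer, which mirrors the described behavior.
        self.ring = deque(maxlen=ring_frames)
        self.data = []
        self.started = False

    def start(self):
        """Model the arrival of the recognition start command."""
        self.started = True

    def push(self, frame):
        """Route one feature frame according to the switch position."""
        if self.started:
            self.data.append(frame)   # frames after the command
        else:
            self.ring.append(frame)   # most recent frames before the command
```

After `start()` is called the ring buffer is no longer written, so its contents remain available for threshold calculation and for extending the search range.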

When the command to start the recognition operation is input and the input switch 2 is switched so that the data buffer 4 begins storing the input speech data, the input speech data stored in the ring buffer 3 up to the time of the command is read out sequentially to a threshold calculation unit 6 while remaining stored in the ring buffer. From the 50 frames of input speech data read out of the ring buffer 3, the threshold calculation unit 6 sets a first threshold A, which serves as the basis of speech section detection, and supplies it to a speech section detection unit 7. Specifically, the threshold calculation unit 6 computes the average power of the 50 frames of input speech data stored in the ring buffer 3 and, using this average power as a base, sets the first threshold A for roughly discriminating the power level of the ambient noise from the power level of the input speech.
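As a rough illustration of how the first threshold A might be derived from the average power, consider the sketch below. The text states only that the average power of the ring buffer frames is used as a base; the multiplicative `margin` and the function name `first_threshold` are assumptions introduced for this example, not details from the embodiment.

```python
def first_threshold(ring_powers, margin=2.0):
    """Return a threshold A placed above the average power of the ring buffer.

    ring_powers: per-frame power values read from the ring buffer.
    margin: hypothetical scale factor; the source does not specify how A
            is derived from the average.
    """
    if not ring_powers:
        raise ValueError("ring buffer is empty")
    mean_power = sum(ring_powers) / len(ring_powers)
    return mean_power * margin
```

An additive offset in decibels would be an equally plausible choice; the description leaves the threshold setting algorithm open.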

Using the first threshold A set in this way, the speech section detection unit 7 detects the speech section in the input speech data, for example according to the procedure shown in FIG. 2, and supplies the input speech data (feature data sequence) of the detected section to a similarity calculation unit 8. The similarity calculation unit 8 computes the similarity between the feature data sequence of the speech section (the input speech pattern) and the standard patterns of the recognition target categories stored in advance in a standard pattern memory 9. The recognition result is obtained by comparing the similarities for the respective categories with one another and evaluating the results.

The speech section detection that characterizes this device works as follows. The speech section detection unit 7 determines whether the start of the speech section can be detected from the input speech data stored in the data buffer 4. If start detection fails, it extends the range subject to processing to include the 50 frames of input speech data stored in the ring buffer 3 and performs speech section detection over that range.

That is, when the command to start speech recognition processing is input, the threshold calculation unit 6 first sets the first threshold A based on the input speech data stored in the ring buffer 3 (step a). The speech section detection unit 7 then detects the start of the speech section. To do so, it first reads out the input speech data stored in the data buffer 4 in order and determines whether the power of that data exceeds the first threshold A (step b). If the power of the input speech data rises above the first threshold A, it then determines whether a peak of the speech power is detected (step c). When a peak of the speech power is detected, a second threshold B for start detection is set with that detection timing as a reference and according to the peak value, and it is determined whether the start can be detected from the input speech data stored in the data buffer 4 (step d).

If no speech power peak can be detected from the input speech data stored in the data buffer 4, the range of peak detection is extended to the input speech data stored in the ring buffer 3 and detection is performed over that range (step e). Further, whether the speech power peak was detected from the input speech data in the ring buffer 3 in this way or from the input speech data in the data buffer 4, if the start of the speech section cannot be detected from the input speech data stored in the data buffer 4, the range of start detection is extended to the input speech data stored in the ring buffer 3 and the start is detected within that data (step f).

After the start has been detected from the input speech data stored in the data buffer 4 or the ring buffer 3 as described above, the input speech data stored in the data buffer 4 that follows the speech power peak is examined to detect the end of the input speech section (step g).

If the input speech data stored in the data buffer 4 never exceeds the first threshold A, it is assumed that no speech has been input, and error processing is started (step h). The input speech section is detected in this way, and the input speech data of that section is selectively cut out and subjected to the speech recognition processing described above (matching against the standard patterns).

This speech section detection is explained in more detail with reference to FIG. 3. FIG. 3(a) shows the detection process when the utterance begins after the command is input, so that the data of the input speech is stored in the data buffer 4. In this case, after the command is input, the power of the input speech data stored in the data buffer 4 is examined in order. This reveals the frame interval in which the power exceeds the first threshold A, and within it the frame P having the maximum power. The input speech data is then traced backward from this maximum-power frame P to find the frame S at which the power of the input speech first rises above the second threshold B. The frame S found in this way is the start of the speech section. Once the start S has been detected, the input speech data is traced forward from the maximum-power frame P to find the frame E at which the power of the input speech first falls below the second threshold B. The frame E found in this way is the end of the speech section, and the frame interval delimited by the start S and the end E is detected as the speech section. Needless to say, the threshold for start detection and the threshold for end detection may be set to different values.
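The search for P, S and E in the case of FIG. 3(a) can be sketched as below, operating on a list of per-frame power values from the data buffer. This is an illustrative reading of the procedure, not the embodiment's implementation. The function names are hypothetical, and deriving the second threshold B as a fixed fraction `b_ratio` of the peak power is an assumption, since the text says only that B is set according to the peak value.

```python
def find_peak(powers, threshold_a):
    """Index of the maximum-power frame among frames above A, or None."""
    above = [i for i, p in enumerate(powers) if p > threshold_a]
    if not above:
        return None  # no frame exceeds A: treated as "no speech input"
    return max(above, key=lambda i: powers[i])


def find_start(powers, peak, threshold_b):
    """Trace backward from the peak; S is the earliest frame of the run above B.

    Returns None when the run reaches the first frame without ever dropping
    to B or below, i.e. the leading part is not in this buffer.
    """
    i = peak
    while i >= 0 and powers[i] > threshold_b:
        i -= 1
    return None if i < 0 else i + 1


def find_end(powers, peak, threshold_b):
    """Trace forward from the peak; E is the first frame at or below B."""
    for i in range(peak, len(powers)):
        if powers[i] <= threshold_b:
            return i
    return None


def detect_section(powers, threshold_a, b_ratio=0.3):
    """Return (S, E) for one buffer, or None if no frame exceeds A."""
    peak = find_peak(powers, threshold_a)
    if peak is None:
        return None
    threshold_b = b_ratio * powers[peak]  # assumed rule for B
    return (find_start(powers, peak, threshold_b),
            find_end(powers, peak, threshold_b))
```

For example, `detect_section([1, 1, 2, 6, 9, 7, 3, 1, 1], 4.0)` finds the peak at frame 4 and returns `(3, 7)`. A start value of `None` signals that the backward trace ran out of data.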

In contrast, when the utterance begins slightly before the command is input, the leading part of the speech is stored in the ring buffer 3, as shown in FIG. 3(b), and the input speech data is stored in the data buffer 4 with that leading part missing. In this case, even if the input speech data is traced backward from the maximum-power frame P detected as described above, no speech power below the second threshold B can be found in the input speech data stored in the data buffer 4. The search range is therefore extended to the input speech data stored in the ring buffer 3, and the frame S at which the power of the input speech first rises above the second threshold B is found within that data. As a result, the start S of the speech section is detected from the input speech data that is stored in the ring buffer 3 and is contiguous with, and precedes, the input speech data stored in the data buffer 4. The end E of the speech section is then detected in the same way as in the preceding example, so that all of the input speech data can be detected without dropping the leading part of the input speech.
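The fallback of FIG. 3(b) amounts to repeating the backward trace over the ring buffer contents joined in front of the data buffer contents. The sketch below is a hypothetical illustration under that reading: the function names are invented for the example, power values are plain numbers, and the returned index refers to the joined sequence (ring buffer frames first).

```python
def backward_start(powers, peak, threshold_b):
    """Earliest frame of the run above B that contains the peak, or None
    if the run reaches index 0 without dropping to B or below."""
    i = peak
    while i >= 0 and powers[i] > threshold_b:
        i -= 1
    return None if i < 0 else i + 1


def start_with_ring_fallback(ring_powers, data_powers, peak, threshold_b):
    """Find the start S, extending the search into the ring buffer if needed.

    peak is an index into data_powers. The result is an index into the
    joined sequence ring_powers + data_powers, or None if the start lies
    even earlier than the ring buffer contents.
    """
    offset = len(ring_powers)
    start = backward_start(data_powers, peak, threshold_b)
    if start is not None:
        return offset + start          # start found in the data buffer alone
    joined = list(ring_powers) + list(data_powers)
    return backward_start(joined, offset + peak, threshold_b)
```

With `ring_powers = [1, 1, 1, 3, 5]`, `data_powers = [7, 9, 6, 2, 1]`, `peak = 1` and `threshold_b = 2.7`, the data buffer alone yields no start, and the joined search returns frame 3, which lies inside the ring buffer.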

When the utterance begins considerably earlier than the command is input, the peak frame of the speech power may not be detectable from the input speech data stored in the data buffer 4, as shown in FIG. 3(c). That is, because the input speech data stored in the data buffer 4 exceeds the first threshold A from its very first frame, it may not be possible to establish, from the data in the data buffer 4 alone, whether the peak found there is the true peak frame of the speech power.

In such a case, the range of peak detection itself is extended to include the input speech data stored in the ring buffer 3, and the peak frame is detected within this continuous input speech data.
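One possible way to express the extended peak search of FIG. 3(c) is sketched below. The criterion used here for "the peak cannot be confirmed from the data buffer alone", namely that the first frame of the data buffer already exceeds A, is an assumption drawn from the explanation above; the function name and return convention are likewise hypothetical.

```python
def peak_frame(ring_powers, data_powers, threshold_a):
    """Return the peak index into ring_powers + data_powers, or None.

    None means the data buffer never exceeds A, which corresponds to the
    error case in which no speech input is assumed.
    """
    offset = len(ring_powers)
    above = [i for i, p in enumerate(data_powers) if p > threshold_a]
    if not above:
        return None
    if data_powers[0] > threshold_a:
        # Speech was already in progress when the command arrived, so the
        # true peak may lie before the command: search both buffers.
        joined = list(ring_powers) + list(data_powers)
        return max(range(len(joined)), key=lambda i: joined[i])
    # Otherwise the peak found in the data buffer is taken as it is.
    return offset + max(above, key=lambda i: data_powers[i])
```

With `ring_powers = [1, 2, 8, 9]` and `data_powers = [7, 5, 2, 1]` and A = 4, the data buffer starts above A, so the search covers both buffers and returns frame 3, the last ring buffer frame.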

Then, with the detected peak frame as a reference, the start S and the end E are detected from the continuous input speech data stored in the ring buffer 3 and the data buffer 4. This makes it possible to detect the speech section accurately, and to extract the input speech data of the detected section, without dropping the leading part of the input speech.

When the command is input far later than the utterance, the power of the input speech data stored in the data buffer 4 never exceeds the first threshold A, as shown in FIG. 3(d), and even if input speech data is retrieved by going back through the ring buffer 3 it is almost impossible to obtain all of it. In this case, a measure such as starting error processing and prompting the speaker to re-enter the speech is taken.

With the device configured in this way, even when the timing at which the start command for the recognition operation is given differs from the timing at which the utterance begins, and the command arrives slightly later than the speech, the leading part of the speech uttered before the command is held in the ring buffer 3. By using this data as needed, the speech section can be detected without dropping the leading part of the input speech, and the input speech data can be extracted accurately. Moreover, merely by additionally storing a fixed number of frames of input speech data up to just before the command is given, data can be extracted with no loss of the leading part of the speech. As a result, input speech data can be detected simply and reliably and passed to speech recognition processing without complicated timing adjustment, and the occurrence of misrecognition and recognition rejects can be suppressed.

The present invention is not limited to the embodiment described above. For example, the number of frames of speech data stored sequentially in the ring buffer 3 may be chosen in view of the frame period and the expected timing offset. The speech recognition method itself may be any of the various techniques proposed to date. The threshold setting algorithm is likewise not particularly limited. In short, the invention can be practiced with various modifications without departing from its gist.

[Effect of the Invention] As described above, according to the present invention, the input speech data received up to just before the start command for the recognition operation is stored sequentially in a ring buffer, and this data is used as needed in detecting the speech section and extracting its speech data. Speech recognition can therefore be performed very simply and effectively on speech data with no missing leading part, which is of great practical benefit.

[Brief Description of the Drawings]

FIG. 1 is a schematic block diagram of the main part of a speech recognition device according to one embodiment of the present invention, FIG. 2 is a diagram showing the flow of the speech section detection procedure in the device of the embodiment, and FIG. 3 is a diagram schematically illustrating the operation of the device of the embodiment.

1: feature extraction unit, 2: input switch, 3: ring buffer, 4: data buffer, 5: control unit, 6: threshold calculation unit, 7: speech section detection unit, 8: similarity calculation unit, 9: standard pattern memory.

Applicant's representative: Patent Attorney Takehiko Suzue

Claims

[Claims]

A speech recognition device comprising: a ring buffer that sequentially stores a fixed number of frames of input speech data received up to just before a command to start recognition processing is input; and a data buffer that stores the input speech data received after the command is input, wherein, when the start of the input speech is not detected from the input speech data stored in the data buffer, the input speech data stored in the ring buffer and the speech data stored in the data buffer are treated as one continuous sequence for speech section detection.