JPH0950288A

JPH0950288A - Device and method for recognizing voice

Info

Publication number: JPH0950288A
Application number: JP7204215A
Authority: JP
Inventors: Tetsuya Muroi; 哲也室井
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1995-08-10
Filing date: 1995-08-10
Publication date: 1997-02-18
Anticipated expiration: 2015-08-10
Also published as: JP3523382B2

Abstract

PROBLEM TO BE SOLVED: To speed up voice recognition by word spotting. SOLUTION: A voice signal continuously input to a voice input means 2 is converted into a voice pattern by a feature extraction means 3. A voiced section expected to contain the voice to be recognized is detected from the voice pattern by a voiced detection means 4, and a voice recognition means 5 executes voice recognition by word spotting in that voiced section. Because word spotting is executed only in the voiced section even though the voice signal is continuous, the processing load is reduced and the recognition processing is speeded up.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a voice recognition device that recognizes continuous voice.

[0002]

[Prior Art] Voice recognition devices that recognize voice uttered by a person have been developed. Because such a device can take in various kinds of information without manual operation of a keyboard or the like, desired information can be entered even in a work environment where both of a person's hands are occupied. In a personal computer that implements such a voice recognition device, when a person utters a specific voice, the computer recognizes it and executes a predetermined processing operation.

[0003] When a person utters only a single word, it is not difficult for a voice recognition device to recognize it. In natural human conversation, however, the voice is continuous and contains hesitations, breaths, and the like in addition to words. Conventionally, when words were extracted from such continuous voice and recognized, the continuous voice was captured from beginning to end and voice recognition was executed on the whole. In that case, unnecessary sounds such as hesitations and throat clearing act as noise, so the time required for voice recognition increases needlessly and recognition accuracy also falls.

[0004] Word spotting, proposed as one technique for solving this problem, recognizes only the portions of the voice signal that match a standard pattern. In other words, it does not matter if the voice signal contains portions that are not recognized, so the adverse effects of unnecessary sounds such as hesitations can be eliminated.

[0005]

[Problems to Be Solved by the Invention] With word spotting as described above, only the necessary voice can be recognized even when unnecessary sounds are present in the continuous voice signal.

[0006] However, even such word spotting takes the entire continuous voice signal as its processing target, so the processing load is large and it is difficult to execute the processing at high speed. In addition, even with word spotting, unexpected misrecognition can occur unless the start and end of the recognition processing are fixed at appropriate timings.

[0007] As noted above, operating a personal computer by means of a voice recognition device has been put to practical use, but it is expected that operating such equipment quickly by voice will sometimes be difficult. For example, even if the equipment is set to stop its various operations when it recognizes the voice "stop", when a person who notices a malfunction says "Huh... that's odd... stop!", the recognition of "stop" is executed only after the recognition of "Huh... that's odd..." has finished. The recognition is therefore delayed and the equipment cannot be stopped quickly.

[0008]

[Means for Solving the Problems] The voice recognition device according to claim 1 has a voice input means to which a voice signal is continuously input, a feature extraction means that converts the continuous voice signal into a voice pattern that is a time series of feature vectors, a voiced detection means that detects, from the voice pattern and according to a predetermined condition, a voiced section expected to contain the voice to be recognized, and a voice recognition means that executes voice recognition by word spotting in the voiced section. Consequently, even though the voice signal is continuous, word spotting is not executed in its silent portions. A voiced section as referred to in the present invention is a section of the continuous voice signal in which voice is actually present, and is detected, for example, as a portion where the voice power is at or above a threshold.

[0009] The voice recognition device according to claim 2 is provided with a section extension means that extends the voiced section in at least one of the forward and backward directions. This eliminates loss of voice caused by detection errors.

[0010] The voice recognition device according to claim 3 is provided with a silence detection means that detects, within the voiced section, a silent section lasting a predetermined time or longer, and a section division means that divides the voiced section into a plurality of voiced sections at the silent section; the voice recognition means executes word spotting in each of the divided voiced sections. Silent sections are therefore excluded even inside a voiced section. A silent section as referred to in the present invention is a section of the continuous voice signal in which no voice is actually present, and is detected, for example, as a portion where the voice power is at or below a threshold.

[0011] The voice recognition device according to claim 4 is provided with a power detection means that detects the power of each of the plurality of voiced sections divided at the silent sections, and the voice recognition means executes word spotting first in the voiced section with the greatest power. Voice that a person utters loudly is therefore recognized first.

[0012] The voice recognition device according to claim 5 is provided with a power detection means that detects the power of each of the plurality of voiced sections divided at the silent sections, and the voice recognition means executes word spotting preferentially in voiced sections whose power exceeds a threshold. Voice that a person utters loudly is therefore recognized preferentially.

[0013] In the voice recognition method according to claim 6, a voice signal continuously input to a voice input means is converted by a feature extraction means into a voice pattern that is a time series of feature vectors, a voiced section expected to contain the voice to be recognized is detected from the voice pattern by a voiced detection means according to a predetermined condition, and voice recognition by word spotting is executed in this voiced section by a voice recognition means. Consequently, even though the voice signal is continuous, word spotting is not executed in its silent portions.

[0014]

[Embodiments of the Invention] An embodiment of the present invention is described below with reference to the drawings. As shown in FIG. 1, the voice recognition device 1 illustrated here has a voice input unit 2 serving as the voice input means. Connected in sequence to the voice input unit 2 are a feature extraction unit 3 serving as the feature extraction means, a section detection unit 4 serving as the voiced detection means, a voice recognition unit 5 serving as the voice recognition means, and a result output unit 6 serving as a result output means.

[0015] The voice input unit 2 has hardware such as a microphone and an A/DC (Analog/Digital Converter), and converts continuous voice into a digital signal. The feature extraction unit 3 has a microcomputer and converts the voice signal into a voice pattern that is a time series of feature vectors.

[0016] Various existing techniques can be used to convert continuous voice into a digital signal and then into a voice pattern. Here, the continuous voice is converted into a 16-bit digital signal sampled at 16 kHz. This signal is converted into a tenth-order LPC (Linear Predictive Coding) cepstrum under analysis conditions of a window length of 256 and a shift width of 160, so the feature vector of the continuous voice is generated as a ten-dimensional vector for each 10 ms frame.
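The framing arithmetic in this paragraph can be checked with a short sketch. The Python below is illustrative only. It assumes the window length and shift width are counted in samples, and it computes a mean-square power per frame, which the source does not specify (the source derives an LPC cepstrum, which is omitted here). The names `frame_shift_ms` and `frame_powers` are hypothetical.

```python
SAMPLE_RATE_HZ = 16_000  # 16 kHz sampling, as stated in the text
WINDOW_LEN = 256         # window length (assumed to be in samples)
SHIFT = 160              # shift width (assumed to be in samples)


def frame_shift_ms(shift=SHIFT, rate=SAMPLE_RATE_HZ):
    """Duration of one frame shift in milliseconds."""
    return 1000.0 * shift / rate


def frame_powers(samples, window=WINDOW_LEN, shift=SHIFT):
    """Mean-square power of each analysis frame.

    Assumption: the source does not say how voice power is obtained;
    mean-square amplitude is used here purely for illustration.
    """
    powers = []
    for start in range(0, len(samples) - window + 1, shift):
        frame = samples[start:start + window]
        powers.append(sum(x * x for x in frame) / window)
    return powers


print(frame_shift_ms())  # 160 samples at 16 kHz
```

Under these assumptions a shift of 160 samples at 16 kHz gives one frame every 10 ms, which matches the frame period stated above.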

[0017] The section detection unit 4 has a microcomputer and detects, from the voice pattern and according to a predetermined condition, a voiced section expected to contain the voice to be recognized. The condition is based on a comparison between the voice power and a threshold. More specifically, the start of a voiced section is detected when the voice power rises to or above the threshold. After the start has been detected, if the voice power remains at or below the threshold continuously for a predetermined time, the point at which the voice power fell to or below the threshold is detected as the end of the voiced section. This processing is executed frame by frame and managed by frame number.

[0018] The voice recognition unit 5 has a microcomputer and executes voice recognition by word spotting in the voiced section. The result output unit 6 has an interface or the like, to which, for example, equipment operated by voice is connected. A control circuit is connected to the units 2 to 6 described above, and the control circuit is provided with a start switch.

[0019] With this configuration, the voice recognition device 1 extracts and recognizes words from voice that a person utters continuously. The voice recognition method of the voice recognition device 1 is described step by step below with reference to FIGS. 2 to 4.

[0020] First, as shown in FIGS. 2 and 4, when the start switch is operated, the control circuit activates the units 2 to 6. The voice input unit 2 converts the voice continuously input from outside into a signal, and the feature extraction unit 3 converts this continuous voice signal into an LPC cepstrum.

[0021] As shown in FIG. 3, the section detection unit 4 compares the voice power with a threshold Pt and detects the first frame in which the voice power is at or above Pt. This frame is detected as the start frame of the voiced section, and its frame number Is is recorded. The comparison of the voice power with Pt then continues; when the voice power falls to or below Pt, the frame number is temporarily stored as a candidate end point of the voiced section. It is then determined whether the state in which the voice power is at or below Pt continues for a predetermined time Lt. When it does, the end of the voiced section is detected and the temporarily stored frame number is confirmed as Ie.
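The detection procedure of the section detection unit 4 can be sketched as a small state machine. This is a minimal illustration, not the patented implementation. It assumes that Lt is expressed in frames, that the section is treated as the half-open range from Is up to (not including) Ie, and that a section still open when the input ends is closed at the last candidate. The function name `detect_voiced_section` is hypothetical.

```python
def detect_voiced_section(powers, pt, lt):
    """Return (Is, Ie) for the first voiced section, or None if none starts.

    powers: per-frame voice power
    pt:     power threshold Pt
    lt:     number of consecutive frames at or below Pt that confirms the end
            (assumption: the predetermined time Lt is counted in frames)
    """
    start = None      # Is: first frame with power at or above Pt
    candidate = None  # tentative Ie: frame where power fell to or below Pt
    for i, p in enumerate(powers):
        if start is None:
            if p >= pt:
                start = i
        elif p <= pt:
            if candidate is None:
                candidate = i          # temporarily store the end candidate
            if i - candidate + 1 >= lt:
                return start, candidate  # low-power state lasted Lt frames
        else:
            candidate = None           # power recovered, discard the candidate
    if start is None:
        return None
    # Assumption: input ended before the end was confirmed; close the section.
    return start, (candidate if candidate is not None else len(powers))


powers = [0, 0, 5, 6, 1, 7, 8, 0, 0, 0, 0]
print(detect_voiced_section(powers, pt=3, lt=3))  # (2, 7)
```

In the usage line, the single low-power frame at index 4 is shorter than Lt, so it does not end the section; the run starting at index 7 does.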

[0022] When the voiced section Is to Ie has been detected in this way, the voice recognition unit 5 executes voice recognition by word spotting only in the voiced section, and the recognition result is output from the result output unit 6.

[0023] In the voice recognition device 1 described above, even after the start of voice recognition has been entered with the start switch, voice recognition by word spotting is executed only in the voiced section expected to contain the voice to be recognized, and the entire continuous voice signal is not taken as the processing target. The processing load can therefore be reduced and the recognition processing speeded up. Moreover, because the start and end of the recognition processing are fixed at appropriate timings, unexpected misrecognition is also prevented.

[0024] The present invention is not limited to the embodiment described above and allows various modifications. For example, the voice recognition device 1 described above detects the voiced section by comparing the voice power with a threshold, but it is also possible to provide a section extension means that extends such a voiced section in at least one of the forward and backward directions.

[0025] In this case, because the voiced section is extended forward and backward, loss of voice caused by detection errors is eliminated and recognition accuracy improves. For example, subtracting a predetermined number of frames L1 from the start frame Is extends the voiced section forward, and adding a predetermined number of frames L2 to the end frame Ie extends it backward. Preferred extensions are about 50 ms (5 frames) forward, corresponding to the length of a sound, and about 100 ms (10 frames) backward, corresponding to the length of a pronunciation.
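The extension of the voiced section amounts to simple frame arithmetic, sketched below. The defaults of 5 and 10 frames follow the preferred values in the text; clamping to the available frame range is an added assumption, and the name `extend_section` is hypothetical.

```python
def extend_section(start, end, l1=5, l2=10, total_frames=None):
    """Extend a voiced section (Is, Ie) forward by L1 and backward by L2 frames.

    Assumption: results are clamped so they never fall outside the recorded
    frames; the source does not discuss boundary handling.
    """
    new_start = max(0, start - l1)       # Is - L1, not before the first frame
    new_end = end + l2                   # Ie + L2
    if total_frames is not None:
        new_end = min(new_end, total_frames)
    return new_start, new_end


print(extend_section(20, 60))                 # (15, 70)
print(extend_section(2, 7, total_frames=11))  # (0, 11)
```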

[0026] In the voice recognition device 1 described above, the entire voiced section detected from the continuous voice signal is taken as the target of word spotting. However, as shown in FIG. 5, it is also possible to provide a silence detection means that detects, within the voiced section Is to Ie, silent sections lasting a predetermined time L3 or longer, and a section division means that divides the voiced section into a plurality of voiced sections at those silent sections, and to execute word spotting in each of the divided voiced sections.

[0027] In this case, silent sections are excluded even inside the voiced section, so the processing load can be reduced further and the recognition processing speeded up, and the start and end of the recognition processing can be made more appropriate, improving recognition accuracy. When voice recognition is executed in units of words or syllables, pauses and breaths are silent sections that should be excluded, whereas a geminate consonant (sokuon) is a silent section that should not be excluded. The predetermined time used to detect a silent section should therefore be set longer than a geminate consonant and shorter than a breath or the like, which is about 300 ms (30 frames).
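Dividing a voiced section at long silent runs can be sketched as follows. This is one possible reading of the text, under stated assumptions: a frame is silent when its power is at or below the threshold, L3 is counted in frames (30 by default, per the preferred value), sections are half-open frame ranges, and leading or trailing silence is trimmed from each sub-section. The name `split_by_silence` is hypothetical.

```python
def split_by_silence(powers, start, end, pt, l3=30):
    """Split the voiced section [start, end) at silent runs of l3+ frames.

    Silent runs shorter than l3 (for example a geminate consonant) stay
    inside a sub-section; longer runs (pauses, breaths) separate sub-sections.
    """
    sections = []
    seg_start = None  # first frame of the sub-section being built
    run_start = None  # first frame of the current silent run
    for i in range(start, end):
        if powers[i] <= pt:                # silent frame
            if run_start is None:
                run_start = i
        else:                              # voiced frame
            if seg_start is None:
                seg_start = i
            elif run_start is not None and i - run_start >= l3:
                sections.append((seg_start, run_start))  # close before the gap
                seg_start = i
            run_start = None
    if seg_start is not None:
        # Assumption: trailing silence is trimmed regardless of its length.
        sections.append((seg_start, run_start if run_start is not None else end))
    return sections


powers = [5] * 3 + [0] * 2 + [5] * 3 + [0] * 4 + [5] * 2
print(split_by_silence(powers, 0, len(powers), pt=1, l3=3))  # [(0, 8), (12, 14)]
```

In the usage line the two-frame gap is kept inside the first sub-section, while the four-frame gap separates the two sub-sections.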

[0028] When one voiced section is divided into a plurality of sections at silent sections in this way, the voiced section may be detected as a single section as described above and then divided, or the divided voiced sections may instead be detected one after another from the beginning.

[0029] Furthermore, an external device may be connected to the result output unit 6 of a voice recognition device 1 that divides the voiced section in this way, and the external device may be operated according to the recognition results of the voice recognition device 1. In such a case, it is preferable to provide a power detection means that detects the power of each of the plurality of voiced sections divided at the silent sections and to execute word spotting first in the voiced section with the greatest power.

[0030] For example, if an electronic filing device is connected to the voice recognition device 1 and its various operations are controlled by voice, the system is set so that the voice "stop" halts those operations. In such a system, when a person who notices a malfunction of the electronic filing device says "Huh... that's odd... stop!", the word "stop" is naturally uttered loudly. The voice recognition device 1 detects the voiced sections in the order "Huh", "that's odd", "stop", but recognizes "stop", which has the greatest power, first, so the electronic filing device can be stopped quickly.
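The "loudest section first" ordering can be sketched as below. The source does not state how the power of a section is measured, so the mean frame power used here is an assumption, as is keeping the remaining sections in time order. The names `section_power` and `loudest_first` are hypothetical, and sections are assumed to be non-empty half-open frame ranges.

```python
def section_power(powers, section):
    """Mean frame power over a non-empty section (start, end).

    Assumption: the source does not define section power; the mean is
    one plausible choice.
    """
    start, end = section
    return sum(powers[start:end]) / (end - start)


def loudest_first(powers, sections):
    """Order sections so the one with maximum power is processed first.

    The remaining sections keep their original (time) order.
    """
    if not sections:
        return []
    k = max(range(len(sections)), key=lambda j: section_power(powers, sections[j]))
    return [sections[k]] + sections[:k] + sections[k + 1:]


# Three sections standing in for "Huh", "that's odd", "stop".
powers = [2, 2, 0, 3, 3, 0, 9, 9]
sections = [(0, 2), (3, 5), (6, 8)]
print(loudest_first(powers, sections))  # [(6, 8), (0, 2), (3, 5)]
```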

[0031] Similarly, it is also possible to detect the power of each of the plurality of voiced sections divided at the silent sections and to execute word spotting preferentially in voiced sections whose power exceeds a threshold. In this case, the voice recognition device 1 recognizes the continuous voice in descending order of how loudly it was uttered, so even if a useless sound is uttered more loudly than the useful voice, the useful voice can still be recognized quickly.
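The threshold-based priority can be sketched in the same style. The details are assumptions: section power is taken as the mean frame power, sections above the threshold are ordered loudest first, and the rest keep their time order. The name `prioritize_loud` and the parameter `loud_threshold` are hypothetical.

```python
def prioritize_loud(powers, sections, loud_threshold):
    """Put sections whose power exceeds loud_threshold ahead of the others.

    Assumptions: section power is the mean frame power; loud sections are
    sorted in descending order of power; the others keep time order.
    """
    def mean_power(section):
        start, end = section
        return sum(powers[start:end]) / (end - start)

    loud = [s for s in sections if mean_power(s) > loud_threshold]
    quiet = [s for s in sections if mean_power(s) <= loud_threshold]
    loud.sort(key=mean_power, reverse=True)
    return loud + quiet


powers = [2, 2, 0, 9, 9, 0, 6, 6, 0, 1, 1]
sections = [(0, 2), (3, 5), (6, 8), (9, 11)]
print(prioritize_loud(powers, sections, loud_threshold=5))
# [(3, 5), (6, 8), (0, 2), (9, 11)]
```

With this ordering, a louder unwanted sound such as a scream is processed first, and the next-loudest section (for example "stop") follows immediately rather than waiting for the quieter sections.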

[0032]

[Effects of the Invention] The voice recognition device according to claim 1 has a voice input means to which a voice signal is continuously input, a feature extraction means that converts the continuous voice signal into a voice pattern that is a time series of feature vectors, a voiced detection means that detects, from the voice pattern and according to a predetermined condition, a voiced section expected to contain the voice to be recognized, and a voice recognition means that executes voice recognition by word spotting in the voiced section. As a result, even when the voice signal is input continuously, voice recognition by word spotting is executed only in the voiced section expected to contain the voice to be recognized, and the entire continuous voice signal is not taken as the processing target, so the processing load can be reduced and the recognition processing speeded up.

[0033] Because the voice recognition device according to claim 2 is provided with a section extension means that extends the voiced section in at least one of the forward and backward directions, loss of voice caused by errors in detecting the voiced section can be prevented and voice recognition accuracy can be improved.

[0034] The voice recognition device according to claim 3 is provided with a silence detection means that detects, within the voiced section, a silent section lasting a predetermined time or longer, and a section division means that divides the voiced section into a plurality of voiced sections at the silent section, and the voice recognition means executes word spotting in each of the divided voiced sections. Silent sections, which are not targets of word spotting, can thus be excluded from the voiced section, so the processing load can be reduced further and the recognition processing speeded up.

[0035] The voice recognition device according to claim 4 is provided with a power detection means that detects the power of each of the plurality of voiced sections divided at the silent sections, and the voice recognition means executes word spotting first in the voiced section with the greatest power. For example, in a system in which the voice "stop" halts various operations, even if a person who notices a malfunction says something like "Huh... that's odd... stop!", the naturally loud "stop" is recognized first, so the system can be stopped quickly.

[0036] The voice recognition device according to claim 5 is provided with a power detection means that detects the power of each of the plurality of voiced sections divided at the silent sections, and the voice recognition means executes word spotting preferentially in voiced sections whose power exceeds a threshold. For example, in a system in which the voice "stop" halts various operations, even if a person who notices a malfunction says something like "Huh... that's odd... stop!", the naturally loud "stop" is recognized first, so the system can be stopped quickly. Even if a louder scream or the like is uttered, the next-loudest "stop" is recognized quickly.

[0037] In the voice recognition method according to claim 6, a voice signal continuously input to a voice input means is converted by a feature extraction means into a voice pattern that is a time series of feature vectors, a voiced section expected to contain the voice to be recognized is detected from the voice pattern by a voiced detection means according to a predetermined condition, and voice recognition by word spotting is executed in this voiced section by a voice recognition means. As a result, even when the voice signal is input continuously, voice recognition by word spotting is executed only in the voiced section expected to contain the voice to be recognized, and the entire continuous voice signal is not taken as the processing target, so the processing load can be reduced and the recognition processing speeded up.

[Brief description of drawings]

FIG. 1 is a block diagram showing a voice recognition device according to an embodiment of the present invention.

FIG. 2 is a flowchart showing the voice recognition method.

FIG. 3 is a flowchart showing the processing operation of section detection.

FIG. 4 is a time chart showing an LPC cepstrum, which is the voice pattern.

FIG. 5 is a time chart showing the LPC cepstrum of a modification.

[Explanation of symbols]

1: voice recognition device; 2: voice input means; 3: feature extraction means; 4: voiced detection means; 5: voice recognition means

Claims

[Claims]

1. A voice recognition device comprising: a voice input means to which a voice signal is continuously input; a feature extraction means that converts the continuous voice signal into a voice pattern that is a time series of feature vectors; a voiced detection means that detects, from the voice pattern and according to a predetermined condition, a voiced section expected to contain the voice to be recognized; and a voice recognition means that executes voice recognition by word spotting in the voiced section.

2. The voice recognition device according to claim 1, further comprising a section extension means that extends the voiced section in at least one of the forward and backward directions.

3. The voice recognition device according to claim 1, further comprising a silence detection means that detects, within the voiced section, a silent section lasting a predetermined time or longer, and a section division means that divides the voiced section into a plurality of voiced sections at the silent section, wherein the voice recognition means executes word spotting in each of the divided voiced sections.

4. The voice recognition device according to claim 3, further comprising a power detection means that detects the power of each of the plurality of voiced sections divided at the silent sections, wherein the voice recognition means executes word spotting first in the voiced section with the greatest power.

5. The voice recognition device according to claim 4, further comprising a power detection means that detects the power of each of the plurality of voiced sections divided at the silent sections, wherein the voice recognition means executes word spotting preferentially in voiced sections whose power exceeds a threshold.

6. A voice recognition method comprising: converting a voice signal continuously input to a voice input means into a voice pattern that is a time series of feature vectors by a feature extraction means; detecting, from the voice pattern and according to a predetermined condition, a voiced section expected to contain the voice to be recognized by a voiced detection means; and executing voice recognition by word spotting in the voiced section by a voice recognition means.