JPH06161488A

JPH06161488A - Speech recognizing device

Info

Publication number: JPH06161488A
Application number: JP4330896A
Authority: JP
Inventors: Tetsuya Muroi; 哲也室井
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1992-11-17
Filing date: 1992-11-17
Publication date: 1994-06-07

Abstract

PURPOSE: To greatly reduce the time and the amount of processing required for spotting, and to perform speech recognition efficiently and reliably. CONSTITUTION: This speech recognition device has a speech input unit 1, which inputs uttered speech, and a spotting unit 6, which extracts one or more keywords from the speech input through the speech input unit 1. The keywords to be extracted are classified into a plurality of groups and held in a keyword holding unit 4, and the order in which the groups are to be recognized is defined as recognition order rules and held in a recognition order rule holding unit 5. The spotting unit 6 spots the input speech in the order given by the recognition order rules, taking each group in turn as the recognition vocabulary, and thereby extracts the keywords from the input speech one after another. Consequently, the keywords can be extracted from the input speech efficiently and reliably with a minimum number of spotting passes.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device that recognizes uttered speech by word spotting.

[0002]

[Prior Art] Rather than recognizing the entire speech section of an utterance without gaps (word for word), recognizing speech by word spotting avoids problems such as added unnecessary words and pauses, and is known to be well suited to spoken dialogue systems and speech understanding systems.

[0003] The document "Proceedings of the Meeting of the Acoustical Society of Japan" (March 1992, pp. 139-140) describes a technique for recognizing speech by this kind of word spotting. That technique performs spotting for all keywords to be extracted, computes a word lattice (which word exists from where (start point) to where (end point) in the speech section, and with what score), and then syntactically analyzes combinations of keywords to recognize the set of keywords in the uttered speech.

[0004]

[Problems to Be Solved by the Invention] However, the conventional speech recognition method described above executes word spotting for all keywords to be extracted, without any linguistic constraints such as syntax or semantics. Spotting therefore requires considerable computation time and a large amount of processing, and, moreover, reliable recognition results cannot be obtained.

[0005] An object of the present invention is to provide a speech recognition device that can significantly reduce the time and amount of processing required for spotting and can perform speech recognition efficiently and reliably.

[0006]

[Means for Solving the Problems and Operation] To achieve the above object, the invention according to claims 1 to 3 is a speech recognition device having speech input means for inputting uttered speech and spotting means for extracting at least one keyword from the speech input through the speech input means, characterized in that the keywords to be extracted are classified into a plurality of groups, the order in which the groups are to be recognized is defined as a recognition order rule, and the spotting means spots the input speech in the order given by the recognition order rule, taking each group as the recognition vocabulary, and extracts keywords from the input speech one after another. As a result, keywords can be extracted from the input speech efficiently and reliably with a minimum number of spotting passes.

[0007] The invention according to claim 4 is characterized in that, when extracting a keyword from the input speech, the spotting means takes as the range to be spotted the speech section that remains after excluding the sections occupied by keywords that have already been spotted. This narrows the region in which spotting is performed, so keywords can be extracted more efficiently and reliably and recognition efficiency is further improved.

[0008] The invention according to claims 5 to 7 is characterized in that message output means is further provided which outputs a predetermined message to the user when, on spotting with a group as the recognition vocabulary in accordance with the recognition order rule, the score of the extracted keyword is equal to or less than a predetermined threshold. This allows the dialogue with the user to proceed and correct recognition to be achieved.

[0009] The invention according to claim 8 is characterized in that at least one of the plurality of groups includes a keyword indicating that the previous recognition result is erroneous. This further improves recognition efficiency.

[0010] The invention according to claim 9 is characterized in that at least one of the plurality of groups includes a keyword indicating that the dialogue is to be interrupted and started over. When the user utters a keyword for interrupting the dialogue, for example "escape", the function-word group is spotted first, and the recognition device is returned to its initial state as soon as "escape" is extracted. No unnecessary spotting is performed, and processing is efficient.

[0011]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram of an embodiment of the speech recognition device according to the present invention. Referring to FIG. 1, the speech recognition device has: a speech input unit 1, such as a microphone or a telephone handset, for inputting speech; a feature extraction unit 2 that converts the input speech into a time series of feature vectors; a basic unit recognition unit 3 that, based on the feature extraction result, recognizes basic units of speech such as phonemes, syllables, or VCV (vowel-consonant-vowel) units; a keyword holding unit 4 in which the keywords for word spotting are stored, classified into a plurality of groups; a recognition order rule holding unit 5 that holds the recognition order rules for the groups; a spotting unit 6 that selects a group according to the recognition order rules held in the recognition order rule holding unit 5 and extracts keywords belonging to that group from the keyword holding unit 4; a message holding unit 7 that holds messages; and a message output unit 8 that retrieves a predetermined message from the message holding unit 7 according to the extraction result of the spotting unit 6 and outputs it.

[0012] FIG. 2 shows a specific configuration example of the keyword holding unit 4. In this example, the keyword holding unit 4 gathers keywords with the same kind of meaning into one group and holds the keywords group by group. More specifically, the keywords representing the purpose, person, place, date, and time each form one group. The keyword holding unit 4 also holds, as one further group, the keywords "escape", "that's wrong", and "you misrecognized", which are function words, that is, a set of words indicating that the dialogue should be restarted or that the previous recognition result is wrong. "Escape" is the keyword indicating that the dialogue is to be interrupted and started over, and "that's wrong" and "you misrecognized" are keywords indicating that the previous recognition result is erroneous.
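The grouping of FIG. 2 can be pictured as a simple table from group name to keywords. The following is a minimal sketch for illustration only: the English group names, the dictionary layout, and the helper `group_of` are assumptions made here, only keywords mentioned in this description are listed, and the "person" group is omitted because no person keywords are given.

```python
# Hypothetical rendering of the keyword holding unit (FIG. 2).
# Only keywords that appear in the description are listed; the English
# names and the dictionary layout are assumptions for illustration.
KEYWORD_GROUPS = {
    "purpose": ["meeting", "annual leave"],
    "place": ["Conference Room 1", "Conference Room 2"],
    "date": ["the day after tomorrow"],
    "time": ["1:30", "2:30", "3:30", "2 o'clock"],
    "function_word": ["escape", "that's wrong", "you misrecognized"],
}


def group_of(keyword):
    """Return the name of the group that holds `keyword`, or None."""
    for group, keywords in KEYWORD_GROUPS.items():
        if keyword in keywords:
            return group
    return None
```

Looking up a keyword returns its group, so a spotting result can be mapped back to the slot it fills.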

[0013] FIG. 3 shows an example of how the keywords belonging to the "time" group are held. In this example, keywords for times such as "1:30", "2:30", and "3:30" are described and held as a single automaton whose arcs are syllables, the syllable being the basic unit of speech.
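As an illustration of an automaton whose arcs are syllables, the sketch below builds a tree-shaped automaton from syllable sequences. The romanized syllable segmentation, the function name `build_automaton`, and the tree shape (which shares only common prefixes) are assumptions made here; the automaton of FIG. 3 may be organized differently.

```python
# Assumed romanized syllable sequences for three "time" keywords.
TIME_KEYWORDS = {
    "1:30": ["i", "chi", "ji", "ha", "n"],
    "2:30": ["ni", "ji", "ha", "n"],
    "3:30": ["sa", "n", "ji", "ha", "n"],
}


def build_automaton(keywords):
    """Build a tree-shaped automaton whose arcs are syllables.

    Returns (arcs, terminals): arcs[j] = (parent node, syllable) for every
    non-root node j, and terminals maps a terminal node to its keyword.
    Node 0 is the root.
    """
    arcs = {}
    children = {}  # (parent node, syllable) -> child node
    terminals = {}
    next_node = 1
    for word, syllables in keywords.items():
        node = 0
        for syl in syllables:
            key = (node, syl)
            if key not in children:
                children[key] = next_node
                arcs[next_node] = key
                next_node += 1
            node = children[key]
        terminals[node] = word
    return arcs, terminals
```

Each non-root node records its parent and the syllable on the incoming arc, which is the information a cumulative-score recurrence over the automaton needs.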

[0014] FIG. 4 shows a specific configuration example of the recognition order rule holding unit 5. In this example, the recognition order rule holding unit 5 describes a recognition order rule for each keyword. That is, it holds a rule that, if the speech to be recognized is the first utterance, the "purpose" group is spotted first, and, if it is any utterance other than the first (for example, the second), the "function word" group is spotted; it also holds, for each keyword indicating a "purpose" (meeting, annual leave), a rule stating which groups are to be spotted next and thereafter. For the keyword "meeting", for example, the rule is that "place" comes first, "date" follows "place", and "time" follows "date". The spotting unit 6 identifies a group according to these order rules, performs spotting with that group as the recognition vocabulary (for example, with the automaton of that group as the recognition target), and extracts a keyword. In doing so, the spotting unit 6 takes as the range to be spotted the speech section that remains after excluding the sections in which already spotted keywords are present. Note that, as the above description shows, "order" here means not the temporal position of the keywords in the input speech but the order in which the speech recognition device attempts spotting on the input speech.
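The order rules described above can be sketched as a small lookup. The table layout, the English group names, and the function `spotting_order` are assumptions for illustration; the rule for "annual leave" (only "date" follows) is taken from the second example in paragraph [0023].

```python
# Hypothetical encoding of the recognition order rules (FIG. 4).
FIRST_UTTERANCE_GROUP = "purpose"
LATER_UTTERANCE_GROUP = "function_word"

# For each "purpose" keyword, the groups to spot next, in order.
ORDER_AFTER_PURPOSE = {
    "meeting": ["place", "date", "time"],
    "annual leave": ["date"],
}


def spotting_order(utterance_index, purpose_keyword=None):
    """Return the groups to try, in order, for one utterance.

    utterance_index is 1 for the first utterance. For the first utterance
    the "purpose" group comes first; the rest of the order is known only
    once a purpose keyword has been extracted.
    """
    if utterance_index != 1:
        return [LATER_UTTERANCE_GROUP]
    order = [FIRST_UTTERANCE_GROUP]
    if purpose_keyword is not None:
        order += ORDER_AFTER_PURPOSE.get(purpose_keyword, [])
    return order
```

The returned list is the order in which spotting is attempted, not the order in which the keywords occur in the speech.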

[0015] FIG. 5 shows a specific configuration example of the message holding unit 7. In this example, a message is held for each group. The message prompts the user to utter a keyword in the group being recognized; for the "purpose" group, for example, the message "Please state your purpose" is provided. When the spotting unit 6 has spotted a group as the recognition vocabulary in accordance with the order rules and the score of the extracted keyword is equal to or less than a predetermined threshold, or no keyword could be extracted, the message output unit 8 retrieves the message corresponding to that group from the message holding unit 7 and outputs it to the user.
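The threshold check performed by the message output unit can be sketched as follows. The message wording is a loose English rendering, and the threshold value, the table `MESSAGES`, and the function `prompt_for` are assumptions made here for illustration.

```python
# Hypothetical message table (FIG. 5). The two messages are loose English
# renderings of those quoted in the description; the threshold value is
# an assumption.
MESSAGES = {
    "purpose": "Please state your purpose.",
    "time": "What time?",
}
SCORE_THRESHOLD = 0.5


def prompt_for(group, result):
    """Return the prompt for `group`, or None if spotting succeeded.

    `result` is None when no keyword was extracted, otherwise a
    (keyword, score) pair.
    """
    if result is None or result[1] <= SCORE_THRESHOLD:
        return MESSAGES.get(group)
    return None
```

A score exactly equal to the threshold still triggers the prompt, matching the "equal to or less than" condition in the text.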

[0016] Next, the processing operation of the speech recognition device configured as above is described. When a speaker's speech is input through the speech input unit 1, the feature extraction unit 2 converts the speech into a time series of feature vectors. More specifically, the feature extraction unit 2 converts it into feature vectors for speech recognition, such as a spectrum obtained with a bank of band-pass filters or an LPC cepstrum.

[0017] Next, the basic unit recognition unit 3 recognizes the basic units of speech, for example syllables. In this case, the recognition result (the scores) is called a syllable lattice and is stored in an array of the form cv_rst(cv, is, dur), where cv is the syllable number, is is the start frame number of the syllable, and dur is the number of frames the syllable lasts. The syllable lattice may be computed in full before the computation by the spotting unit 6 described below, or it may be computed incrementally whenever the spotting unit 6 needs it. Various methods can be used to compute the syllable lattice, such as holding a standard pattern for each syllable and computing by DP matching, or preparing an HMM (Hidden Markov Model) for each syllable.
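A syllable lattice indexed by (cv, is, dur) can be sketched as a dictionary. The function `build_syllable_lattice`, the `max_dur` limit, and the scorer callback are assumptions made here; the scorer merely stands in for the DP-matching or HMM computation, which the text does not specify in detail.

```python
def build_syllable_lattice(num_frames, syllables, max_dur, score_fn):
    """Fill a lattice keyed by (cv, is, dur), as in cv_rst(cv, is, dur).

    score_fn(cv, start, dur) stands in for the DP-matching or HMM scorer,
    which is not specified here. Entries that would run past the last
    frame are not created.
    """
    lattice = {}
    for cv in syllables:
        for start in range(num_frames):
            for dur in range(1, max_dur + 1):
                if start + dur <= num_frames:
                    lattice[(cv, start, dur)] = score_fn(cv, start, dur)
    return lattice
```

This version computes every entry up front; the incremental alternative mentioned in the text would instead evaluate the scorer the first time a given (cv, is, dur) entry is requested.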

[0018] The spotting unit 6 connects entries of the syllable lattice under automaton control, finds the single best path through the automaton, and takes it as the extracted keyword. Methods of control using an automaton are well known; for example, the cumulative score S(i, j) can be computed with the recurrence shown in the following equation, and the matching path with the highest score at a terminal node of the automaton can be taken as the recognition result (spotting result).

[0019]

[Equation 1]
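The equation image is not reproduced in this text. A recurrence consistent with the symbol definitions in paragraph [0020] would be the following; it is a reconstruction under that assumption, not the published formula, and the maximization over the duration dur is inferred from the arguments of rst.

```latex
S(i, j) = \max_{dur} \left\{ S(i - dur,\; j') + rst(cv',\; i - dur,\; dur) \right\}
```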

[0020] Here, i is the frame number of the input speech, j is the node number of the automaton, and S(i, j) is the cumulative score of the matching path along which the i-th frame of the input speech has reached the j-th node of the automaton. Further, j' is the node number of the parent of the j-th node, cv' is the arc (syllable) connecting the j'-th node to the j-th node, and rst(cv', i-dur, dur) is the syllable recognition result described above.
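A cumulative-score computation over these symbols can be sketched in Python. The sketch assumes that S(i, j) is the maximum over durations dur of S(i - dur, j') + rst(cv', i - dur, dur), that a keyword may start at any frame (so the root node scores 0 everywhere), and that scores are not length-normalized; the function name `spot` and the data layout are likewise assumptions, not the patent's implementation.

```python
NEG_INF = float("-inf")


def spot(num_frames, arcs, terminals, lattice, max_dur):
    """Return (keyword, score, start, end) of the best path, or None.

    arcs[j] = (parent node j', syllable cv'); node 0 is the root.
    lattice[(cv, start, dur)] is the syllable score. S[i][j] is the best
    cumulative score of a path that reaches node j after consuming the
    frames before boundary i.
    """
    nodes = [0] + sorted(arcs)
    # Assumption: a keyword may start at any frame, so S(i, root) = 0.
    S = [{j: (0.0 if j == 0 else NEG_INF) for j in nodes}
         for _ in range(num_frames + 1)]
    start_of = [{j: (i if j == 0 else None) for j in nodes}
                for i in range(num_frames + 1)]
    best = None
    for i in range(1, num_frames + 1):
        for j in sorted(arcs):
            parent, cv = arcs[j]
            for dur in range(1, min(max_dur, i) + 1):
                prev = S[i - dur][parent]
                local = lattice.get((cv, i - dur, dur))
                if prev == NEG_INF or local is None:
                    continue
                if prev + local > S[i][j]:
                    S[i][j] = prev + local
                    start_of[i][j] = start_of[i - dur][parent]
            # Keep the highest-scoring path that ends at a terminal node.
            if j in terminals and S[i][j] > NEG_INF:
                if best is None or S[i][j] > best[1]:
                    best = (terminals[j], S[i][j], start_of[i][j], i)
    return best
```

Because every transition consumes at least one frame, row i depends only on earlier rows, so the nodes within a row can be processed in any order. The returned start and end frames delimit the section occupied by the keyword.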

[0021] In this embodiment, the keywords to be extracted are classified into a plurality of groups, and keywords are spotted in accordance with recognition order rules that define the order in which the groups are to be recognized. Keywords can therefore be extracted from the speech with a minimum number of spotting passes, and the meaning of the uttered speech can be understood efficiently. When this is applied to, for example, a schedule management device, schedules can be managed efficiently.

[0022] As a first example, suppose the input speech is as shown in FIG. 6(a) and is the first utterance. The spotting unit 6 takes the "purpose" group as the recognition target in accordance with the recognition order rules of FIG. 4 held in the recognition order rule holding unit 5 and, using the group-to-keyword correspondence table of FIG. 2 held in the keyword holding unit 4, extracts a keyword from the "purpose" group, that is, performs spotting. In other words, recognition is performed over the speech section (region A) of FIG. 6(a) with the "purpose" group as the recognition target; suppose that, as a result, the keyword "meeting" is extracted in the section of region B. The subsequent recognition order is then determined, under the condition of the keyword "meeting" in the recognition order rules of FIG. 4, as the groups "place", "date", and "time", in that order. First, spotting is performed in the region [A - B] of FIG. 6 with the "place" group as the recognition target. If the keyword "Conference Room 1" is extracted in the section of region C, then, second, spotting is performed in the region [A - (B + C)] with the "date" group as the recognition target. If the keyword "the day after tomorrow" is extracted in the section of region D, then, third, spotting is performed in the region [A - (B + C + D)] with the "time" group as the recognition target. In this way, spotting proceeds group by group in accordance with the recognition order rules.
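The shrinking search region, A minus the regions already occupied by extracted keywords, amounts to interval subtraction. The sketch below illustrates it; the function name `remaining_regions` and the frame-interval representation (end exclusive) are assumptions made here.

```python
def remaining_regions(whole, extracted):
    """Subtract already spotted intervals from the whole speech section.

    `whole` is (start, end) in frames, end exclusive; `extracted` is a
    list of (start, end) intervals. Returns the remaining intervals in
    ascending order.
    """
    regions = []
    cursor, end = whole
    for s, e in sorted(extracted):
        # Clamp each interval to the part of `whole` not yet passed.
        s = min(max(s, cursor), end)
        e = min(max(e, cursor), end)
        if s > cursor:
            regions.append((cursor, s))
        cursor = e
    if cursor < end:
        regions.append((cursor, end))
    return regions
```

For instance, with region A as frames 0 to 100 and region B as frames 40 to 60, the next group is spotted only in frames 0 to 40 and 60 to 100.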

[0023] As a second example, suppose the input speech is as shown in FIG. 6(b). For this input, once the purpose has been recognized as "annual leave", the only group that needs to be recognized next, according to the recognition order rules of FIG. 4, is "date"; no spotting is attempted with groups such as "place" or "time" as the recognition target. Because the scope of spotting can be narrowed in this way, no unnecessary spotting is performed, and keywords can be extracted from the speech with a minimum number of spotting passes. Moreover, narrowing the region in which spotting is performed and restricting the recognition targets makes it possible to obtain highly reliable results efficiently.

[0024] If the recognition result obtained in this way is not what the user intended, that is, it is wrong or misrecognized, the user next utters a function word such as "That's wrong. ..." or "You misrecognized. ...". For example, if in the first example the user said "Conference Room 1" in the first utterance but it was misrecognized as "Conference Room 2", the user next says "That's wrong. It's Conference Room 1." or "You misrecognized. It's Conference Room 1."

[0025] Since this input speech is not the first utterance but a subsequent one, the spotting unit 6 attempts to spot "function words" in accordance with the recognition order rules of FIG. 4. That is, it attempts to spot words for restarting the dialogue and words indicating that the previous recognition result is wrong. If the keyword "that's wrong" or "you misrecognized" is extracted, the previous recognition targets are recognized in accordance with the recognition order rules of FIG. 4. In the first example, spotting is attempted for the groups of the previously recognized keywords: "purpose", "place", and "date". As a result, the keyword "Conference Room 1" can be extracted in the "place" group, which follows the "purpose" group. In this case too, the target keyword is extracted in the "place" group alone, without taking the "date" and "time" groups as recognition targets, so the number of spotting passes is reduced and a reliable recognition result is obtained.

[0026] That is, conventionally, allowing for the possibility that the recognition result is wrong and the user corrects it, everything had to be made a recognition target, that is, an extraction target: both the target to be recognized this time ("place") and the previous recognition targets ("purpose", "date", "time"). According to the present invention, only when a keyword indicating a recognition error ("that's wrong", "you misrecognized", ...) has first been extracted are the targets that could not be recognized last time recognized, and only those. Consequently, input speech such as "No, that's wrong, it's Conference Room 1" is recognized efficiently, without taking the "time" group as a recognition target.
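The correction flow can be sketched as follows. The function `handle_correction`, the `spot_group` callback standing in for the spotting unit, and the choice to stop at the first group that yields a keyword are assumptions made here to illustrate how later groups end up never being spotted.

```python
CORRECTION_WORDS = {"that's wrong", "you misrecognized"}


def handle_correction(function_word, previous_groups, spot_group):
    """Re-spot previously recognized groups after a correction word.

    spot_group(group) is a stand-in for the spotting unit; it returns the
    extracted keyword or None. Groups are tried in their earlier order
    and the search stops at the first group that yields a keyword, so
    later groups are never spotted.
    """
    if function_word not in CORRECTION_WORDS:
        return None
    for group in previous_groups:
        keyword = spot_group(group)
        if keyword is not None:
            return group, keyword
    return None
```

With the previous groups given as "purpose", "place", "date" and an utterance containing only a place, the search ends at "place" and "date" is not attempted.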

[0027] In this embodiment, when spotting in accordance with the recognition order rules of FIG. 4 fails for some group (which includes both the case where the keyword could not be recognized and the case where it was not actually present in the input speech), a message prompting the user to utter a keyword of that group is output with a simple structure (without taking the dialogue history into account). Specifically, suppose that in the first example "meeting" was extracted as the "purpose", "Conference Room 1" as the "place", and "the day after tomorrow" as the "date", but the score for "time" = "2 o'clock" was equal to or less than the predetermined threshold. Since the "time" group could not be recognized, the message output unit 8 outputs the message in the "time" entry of the message holding unit 7, "What time?", prompting the user to say a word expressing the time in the next utterance.

[0028] When the user then makes an utterance about the "time", this input speech is not the first utterance but a subsequent one, so the spotting unit 6 attempts to spot function words in accordance with the recognition order rules of FIG. 4. Suppose the user says "Um, it's 2 o'clock." This input speech contains no "function word", so the spotting unit 6 cannot extract one. When no "function word" can be extracted, the spotting unit 6 attempts spotting with, as the recognition target, the group to which the keyword whose score was at or below the threshold in the previous recognition belongs; in this example, the "time" group. "2 o'clock" can thus be extracted as a keyword. In this case too, the previously recognized groups "purpose", "place", and "date" are not made recognition targets, so recognition is efficient and reliable.

[0029] In this way, an ordinary dialogue with the device can be continued and the correct result obtained efficiently.

[0030] When the user wants to interrupt the dialogue, the user need only say "escape". When "escape" is uttered, the function-word group is spotted first, and the recognition device is returned to its initial state as soon as "escape" is extracted. This prevents unnecessary spotting.
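Paragraphs [0028] and [0030] together describe how a second or later utterance is handled: function words are spotted first, "escape" resets the device, and otherwise the group that failed last time is retried. A minimal sketch of that control flow follows; the function `handle_later_utterance`, the `spot_group` callback, the state dictionary, and the returned status strings are all assumptions for illustration.

```python
def handle_later_utterance(spot_group, pending_group, state):
    """Process a second or later utterance.

    spot_group(group) stands in for the spotting unit and returns a
    keyword or None. pending_group is the group whose keyword scored at
    or below the threshold last time (or None). state is a dict of
    group -> keyword recognized so far; it is modified in place.
    """
    word = spot_group("function_word")
    if word == "escape":
        state.clear()  # return to the initial state, spot nothing else
        return "reset"
    if word is not None:
        return "correction"  # a correction word; handled separately
    if pending_group is not None:
        keyword = spot_group(pending_group)
        if keyword is not None:
            state[pending_group] = keyword
            return "filled"
    return "unrecognized"
```

Because the function-word group is always tried first, an "escape" utterance ends processing after a single spotting pass.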

[0031]

[Effects of the Invention] As described above, according to the invention of claims 1 to 3, keywords are classified into a plurality of groups, and keywords are extracted by spotting the input speech in accordance with recognition order rules that define the order in which the groups are to be recognized. Keywords can therefore be extracted from the input speech efficiently and reliably with a minimum number of spotting passes.

[0032] According to the invention of claim 4, the group to be recognized next is spotted with the sections of already extracted keywords excluded from the input speech section. The region in which spotting is performed is thereby narrowed, so keywords can be extracted more efficiently and reliably and recognition efficiency is further improved.

[0033] According to the invention of claims 5 to 7, spotting is performed in order in accordance with the recognition order rules, and a message is output to the user about any item that was spotted but could not be recognized. The dialogue with the user can thus proceed and correct recognition can be achieved.

[0034] According to the invention of claim 8, only the group to which the misrecognized keyword belongs is made the recognition target, so recognition efficiency is further improved.

[0035] According to the invention of claim 9, when the user utters a keyword for interrupting the dialogue, for example "escape", the function-word group is spotted first, and the recognition device is returned to its initial state as soon as "escape" is extracted. No unnecessary spotting is performed, and processing is efficient.

[Brief Description of the Drawings]

FIG. 1 is a block diagram of an embodiment of the speech recognition device according to the present invention.

FIG. 2 is a diagram showing a specific configuration example of the keyword holding unit.

FIG. 3 is a diagram showing an example of how keywords belonging to the "time" group are held.

FIG. 4 is a diagram showing a specific configuration example of the recognition order rule holding unit.

FIG. 5 is a diagram showing a specific configuration example of the message holding unit.

FIG. 6 (a) and (b) are diagrams showing examples of input speech.

[Explanation of Symbols]

1 speech input unit
2 feature extraction unit
3 basic unit recognition unit
4 keyword holding unit
5 recognition order rule holding unit
6 spotting unit
7 message holding unit
8 message output unit

Claims

1. A speech recognition device having speech input means for inputting uttered speech and spotting means for extracting at least one keyword from the speech input through the speech input means, characterized in that keywords to be extracted are classified into a plurality of groups, the recognition order of the groups is defined as a recognition order rule, and the spotting means spots the input speech in the order given by the recognition order rule, taking each group as the recognition vocabulary, and extracts keywords from the input speech one after another.

2. The speech recognition device according to claim 1, further comprising speech basic unit recognition means for recognizing basic units of speech, wherein the keywords in each group are described, group by group, by an automaton whose arcs are basic units of speech, and the spotting means, when spotting the input speech in the order given by the recognition order rule with a group as the recognition vocabulary and extracting a keyword from the input speech, finds the optimum path on the automaton of that group and thereby extracts a keyword within the group.

3. The speech recognition device according to claim 1, wherein the recognition order rule is described for each keyword.

4. The speech recognition device according to claim 1, wherein the spotting means, when extracting a keyword from the input speech, takes as the range of the speech section to be spotted the speech section that excludes the speech sections in which already spotted keywords are present.

5. The speech recognition device according to claim 1, further comprising message output means for outputting a predetermined message to the user when, on spotting with a group as the recognition vocabulary in accordance with the recognition order rule, the score of the extracted keyword is equal to or less than a predetermined threshold.

6. The speech recognition device according to claim 5, wherein the message output to the user is predetermined and held for each group.

7. The speech recognition device according to claim 5, wherein the message output to the user is a message prompting the user to utter again a keyword in the group to which the keyword whose score was equal to or less than the predetermined threshold belongs.

8. The speech recognition device according to claim 1, wherein at least one of the plurality of groups includes a keyword indicating that the previous recognition result is erroneous.

9. The speech recognition device according to claim 1, wherein at least one of the plurality of groups includes a keyword indicating that the dialogue is to be interrupted and started over.