JP3478171B2

JP3478171B2 - Voice recognition device and voice recognition method

Info

Publication number: JP3478171B2
Application number: JP13409499A
Authority: JP
Inventors: 剛加藤
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1999-05-14
Filing date: 1999-05-14
Publication date: 2003-12-15
Anticipated expiration: 2019-05-14
Also published as: JP2000322085A

Description

Detailed Description of the Invention

[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a speech recognition device and a speech recognition method, and more particularly to a speech recognition device and a speech recognition method suited to cases in which re-registration of a dictionary corresponding to a service scenario is to be made unnecessary.

[0002]

2. Description of the Related Art Speech recognition devices designed for use by unspecified speakers have long been developed. One example is described in a modification of the second embodiment of Japanese Patent Laid-Open No. 11-7292. That publication aims to maintain high recognition performance for various forms of additional words. It discloses a speech recognition device comprising: a speech analysis unit that extracts a feature quantity of the input speech for each fixed time interval (frame); a standard pattern in which an additional-word model is connected before, after, or both before and after a recognition target word or word string; a likelihood calculation unit that performs pattern matching between the standard pattern and the feature quantity for each frame, selects the optimal word sequence (optimal sequence) among the words or word strings, and calculates its likelihood; and an output unit that determines and outputs the optimal recognition result from the optimal sequence and the likelihood, wherein the additional-word model is a model that accepts both background noise and arbitrary speech.

[0003] That is, the grammar for continuous syllable recognition consists of previously known additional words (words such as 「えー」 ("er") and 「です」 ("desu") that are attached before or after a recognition target word) and the recognition target word sandwiched between them. Using this grammar, the additional words are removed and the desired recognition is completed. For example, as shown in FIG. 12, this conventional speech recognition device using word spotting performs recognition with a model in which models for recognizing additional words are connected before and after the continuous grammar to be recognized. A conventional word-spotting speech recognition device configured in this way operates as follows: following the specified continuous grammar, it recognizes the prefix unnecessary word, recognizes the best keyword from the specified keyword group, recognizes the suffix unnecessary word, outputs the recognized keyword as the recognition result, and ends recognition.
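The conventional flow described in [0003] can be illustrated with a minimal Python sketch. It matches transcribed text rather than acoustic feature frames, which is a deliberate simplification. The function name `spot_keyword` and the sample word lists are assumptions made for illustration and do not come from the cited publication.

```python
def spot_keyword(utterance, prefixes, keywords, suffixes):
    """Return the keyword if utterance == prefix + keyword + suffix, else None.

    Toy text-level stand-in for word spotting; an empty string in
    `prefixes` or `suffixes` means "no filler word at that position".
    """
    for pre in prefixes:
        if not utterance.startswith(pre):
            continue
        rest = utterance[len(pre):]
        for kw in keywords:
            if not rest.startswith(kw):
                continue
            # Whatever follows the keyword must be an allowed suffix filler.
            if rest[len(kw):] in suffixes:
                return kw
    return None


# Hypothetical sample vocabulary (not from the patent).
PREFIXES = ["えー", ""]
KEYWORDS = ["東京", "大阪"]
SUFFIXES = ["です", ""]

result = spot_keyword("えー東京です", PREFIXES, KEYWORDS, SUFFIXES)
```

Only the keyword is returned; the prefix and suffix fillers are consumed and discarded, which mirrors the "recognize prefix, recognize keyword, recognize suffix, output keyword" order in the text.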

[0004] As another conventional example of such speech recognition, the technique described in Japanese Patent No. 2818362 has been proposed. That patent is directed to speaker-independent continuous speech recognition for multiple users. It discloses an instantaneous context switching system for a speech recognition device comprising: a memory coupled to data processing means, speech input means, and a character-string-using output device; a first context partition in the memory, containing a first plurality of words that include a second plurality of phonemes; a second context partition in the memory, containing a third plurality of words that include a fourth plurality of phonemes; a pattern matching partition in the memory, containing a fifth plurality of phoneme pattern-matching data units; a first pointer map containing a second plurality of pointers that relate each of the second plurality of phonemes to one of the fifth plurality of pattern-matching data units; a second pointer map containing a fourth plurality of pointers that relate each of the fourth plurality of phonemes to one of the fifth plurality of pattern-matching data units; and selection means, coupled to the memory, for selecting either the first context partition and the first pointer map or the second context partition and the second pointer map, and for converting speech input information received from the input means into character-string output information for the character-string-using device. With this system, the context of a speech recognition application can be switched without loading new pattern-matching data units into the memory.

[0005]

[Problems to be Solved by the Invention] However, the conventional examples described above have the following problems.

[0006] The first problem with the conventional example of Japanese Patent Laid-Open No. 11-7292 is that the recognition target words and additional words that are needed or not needed may differ from one service situation to another, so for a service that cannot be completed with a single input, the recognition control grammar must be replaced. The grammar therefore has to be loaded into memory again. (Referring to FIG. 2, which is used later for the embodiment of the present invention, speech recognition involves loading the standard pattern, loading the grammar file, and then performing the recognition processing.)

[0007] The second problem arises in multi-line systems such as one installed at a station. Reloading the dictionary according to the service for each recognition unit of each connected call increases I/O access and hence the system load. For a multi-line speech recognition device that provides a continuously running service, frequently reloading dictionary information for each service therefore increases the load on the system.

[0008] That is, with the conventional means, once spotting of place-name words is finished, the grammar file in memory must be replaced before spotting of personal-name words can be started. In a large system where commands flow in from a host device or the like, this may hinder the smooth progress of the service.

[0009] The conventional example described in Japanese Patent No. 2818362, titled "Context switching system and method for a speech recognition device", claims that services can be switched without newly loading pattern-matching data. However, it holds network information individually for each service, and each network holds pointers to the standard patterns, which is how duplicate loading of the standard patterns is avoided. This differs in essence from the purpose of the present invention, described later, of avoiding duplicate loading of the recognition control grammar. Furthermore, in the present invention, if control information is registered in advance, switching is possible without sending commands from outside, such as from a keyboard selector. The conventional example cannot achieve this.

[0010] An object of the present invention is to provide a speech recognition device and a speech recognition method that, particularly in word-spotting speech recognition that must operate continuously, do not require the system to be stopped in order to change the grammar.

[0011]

SUMMARY OF THE INVENTION The present invention is a speech recognition device that performs speech recognition by removing unnecessary words added before and after a recognition target word, characterized by comprising: standard pattern registration means in which standard patterns of speech information are registered; unnecessary-word registration means in which a continuous grammar holding all of the unnecessary words expected to be used is registered; speech recognition means that performs speech recognition by referring to the standard pattern registration means and the unnecessary-word registration means; and control means that reduces, as appropriate for the service scenario, the search space of the unnecessary-word registration means searched by the speech recognition means.

[0012] Described with reference to FIG. 1, the speech recognition device of the present invention performs speech recognition by removing unnecessary words added before and after a recognition target word, and comprises: standard pattern registration means (131 in FIG. 1) in which standard patterns of speech information are registered; unnecessary-word registration means (132 in FIG. 1) in which a continuous grammar holding all of the unnecessary words expected to be used is registered; speech recognition means (120 in FIG. 1) that performs speech recognition by referring to the standard pattern registration means and the unnecessary-word registration means; and control means (140 in FIG. 1) that reduces, as appropriate for the service scenario, the search space of the unnecessary-word registration means searched by the speech recognition means.

[0013] [Operation] The speech recognition device of the present invention registers all conceivable unnecessary-word candidates at once in advance, and is controlled so that at each stage of the service this set is divided into smaller subsets for use. Consequently, once a dictionary corresponding to the expected service scenario has been registered in the device a single time, the service can proceed without re-registration. Simply registering every conceivable unnecessary word would be expected to cause frequent insertion errors (spurious detections) from other words. By switching the set of unnecessary words as appropriate for the service scene, the device is controlled so that unneeded recognition candidates are excluded, and an improved recognition rate can therefore be expected in the assumed service scenario. In addition, by registering frequently used grammar control information in advance as patterns, word-spotting recognition can be switched efficiently.
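The register-once, switch-per-scene idea in [0013] can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the patented implementation: the class name `FillerGrammar`, its attributes, and the `load_count` counter are hypothetical names introduced only to make the "no reload" property visible.

```python
class FillerGrammar:
    """Holds every expected unnecessary word; only a subset is active per scene."""

    def __init__(self, all_fillers):
        # The full set is registered a single time (the one and only "load").
        self.all_fillers = frozenset(all_fillers)
        self.active = set(self.all_fillers)
        self.load_count = 1

    def switch_scene(self, wanted):
        """Narrow the active set; never touches storage, so load_count is unchanged."""
        wanted = set(wanted)
        unknown = wanted - self.all_fillers
        if unknown:
            raise ValueError(f"not registered: {sorted(unknown)}")
        self.active = wanted


grammar = FillerGrammar(["から", "まで", "です", "さん", "くん"])
grammar.switch_scene(["から", "まで", "です"])   # e.g. place-name input
place_active = set(grammar.active)
grammar.switch_scene(["です", "さん", "くん"])   # e.g. personal-name input
person_active = set(grammar.active)
```

Switching scenes only changes which registered entries are considered, so the number of loads stays at one however many scenes the service passes through.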

[0014]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS [First Embodiment] Next, an embodiment of the present invention is described in detail with reference to the drawings.

[0015] (1) Description of Configuration FIG. 1 is a block diagram showing a configuration example of the speech recognition device according to the first embodiment of the present invention. As shown in FIG. 1, the device comprises a speech input unit 100, a speech analysis unit 110, a distance calculation unit 120, a data unit 130 containing a standard pattern dictionary 131 and a general-purpose word-spotting grammar dictionary 132, and a grammar control unit 140 containing selection means 141 and specialization means 142.

[0016] The components are as follows. The speech input unit 100 receives the speech to be recognized. The speech analysis unit 110 analyzes the speech received from the speech input unit 100 and sends the result to the distance calculation unit 120. The distance calculation unit 120 recognizes the parameterized speech by referring to the standard pattern dictionary 131 and the general-purpose word-spotting grammar dictionary 132 in the data unit 130. The standard pattern dictionary 131 stores standard patterns of speech information in advance. The general-purpose word-spotting grammar dictionary 132 has registered in it, in advance, all unnecessary words expected to be used from the start to the end of the service scenario.

[0017] Here, a service scenario denotes the flow of a speech recognition service: the sequence from when a subscriber connects to the service system, through use of the service by means of recognition processing, until the connection is released. In this embodiment, the service scenario takes the form of a grammar file loaded in advance. Because some services may call for input of a personal name, a place name, or some other noun, and not only personal and place names, the grammar file may also be one that covers every utterance that can be input before the call is disconnected.

[0018] The grammar control unit 140 reduces the search space of the general-purpose word-spotting grammar dictionary 132 as appropriate for the service scenario. The selection means 141 of the grammar control unit 140 sorts the general-purpose word-spotting grammar dictionary 132 of the data unit 130 to suit each service. The specialization means 142 of the grammar control unit 140 carries out the concrete specialization of the general-purpose word-spotting grammar dictionary 132.

[0019] In the speech recognition device configured as above, speech received by the speech input unit 100 is analyzed by the speech analysis unit 110 and sent to the distance calculation unit 120. The distance calculation unit 120 recognizes the parameterized speech using the standard pattern dictionary 131 and the general-purpose word-spotting grammar dictionary 132, both loaded into the data unit 130 in advance. All unnecessary words expected to be used from the start to the end of the service scenario are registered in the general-purpose word-spotting grammar dictionary 132. As the scenario progresses, the distance calculation unit 120 selects only the unnecessary words needed at that point and recognizes the keyword speech. The service can thus be completed without reloading the grammar dictionary 132 containing the unnecessary words.

[0020] To describe the main components in more detail: the standard pattern dictionary 131 stores standard patterns of speech information. The general-purpose word-spotting grammar dictionary 132 is a continuous grammar holding prefix and suffix words that include all unnecessary words needed until the desired service scenario is completed, and it controls the calculation method used in the distance calculation unit 120. As described above, the grammar control unit 140 consists of the selection means 141, which sorts the general-purpose word-spotting grammar dictionary 132 to suit each service, and the specialization means 142, which performs the concrete specialization. Using these, the grammar control unit 140 reduces the search space of the general-purpose word-spotting grammar dictionary 132 as appropriate for the service scenario.

[0021] The selection means 141 of the grammar control unit 140 operates roughly as follows. At the request of the service scenario, the selection means 141 uses the specialization means 142 to extract, from the set of unnecessary words registered in the general-purpose word-spotting grammar dictionary 132, the paths through the unnecessary words that are needed, and removes the remaining unnecessary words from the general-purpose word-spotting grammar dictionary 132 so that the distance calculation unit 120 does not compute them.

[0022] The specialization means 142 of the grammar control unit 140 operates roughly as follows. To reduce the unnecessary-word candidates, the specialization means 142 may, for example, assign to the initial score of an unneeded candidate a penalty score that invalidates any transition from it. Alternatively, to reduce the amount of computation, needed/unneeded flags may be prepared in advance and the unneeded candidates left out of the loop over candidates to be computed.

[0023] Here, the initial score is the initial value of the distance score when the distance calculation unit 120 starts the distance calculation for each word. Setting this initial value to the penalty value described below makes the distance extremely large, which keeps the word from ranking high among the recognition candidates. The penalty is a value large enough that the word is never put forward as a recognition candidate, for example the infinity used in step B2 (when distances are computed as positive values).
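The penalty approach of [0022] and [0023] can be sketched as below. This is a toy illustration under stated assumptions: the per-frame distances are made-up numbers, and the function name `best_candidate` and its arguments are hypothetical. The only point taken from the text is that an infinite initial score keeps a candidate from ever ranking first.

```python
import math


def best_candidate(frame_costs, disabled):
    """Accumulate per-frame distances per word; smaller total is better.

    frame_costs: dict mapping word -> list of per-frame distances (toy values).
    disabled:    words whose initial score is set to the penalty (infinity).
    """
    totals = {}
    for word, costs in frame_costs.items():
        # Initial score: 0 for allowed words, +infinity as the penalty.
        total = math.inf if word in disabled else 0.0
        for cost in costs:
            total += cost
        totals[word] = total
    best = min(totals, key=totals.get)
    return best, totals


# Made-up distances for two suffix candidates.
costs = {"さん": [0.1, 0.2], "です": [0.4, 0.3]}

best_all, _ = best_candidate(costs, disabled=set())
best_restricted, totals_restricted = best_candidate(costs, disabled={"さん"})
```

With no penalty, 「さん」 has the smaller cumulative distance and wins; once its initial score is infinite, its total stays infinite and 「です」 is selected instead. Note that the penalized word is still iterated over here, so this variant restricts the result without saving computation.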

[0024] (2) Description of Operation Next, the operation of the first embodiment of the present invention is described in detail with reference to FIGS. 1 to 9. FIGS. 2 and 3 are flowcharts of the speech recognition processing of the first embodiment, FIG. 4 is a flowchart of the place-name/personal-name input processing of the first embodiment, FIGS. 5 and 6 are flowcharts of the cumulative distance calculation processing of the first embodiment, and FIGS. 7 to 9 are explanatory diagrams showing a concrete example of speech recognition in the first embodiment.

[0025] As described above, the speech recognition device loads the standard pattern dictionary 131 into the data unit 130 (step S21 in FIG. 2), loads the general-purpose word-spotting grammar dictionary 132 (step S22 in FIG. 2), and performs speech recognition in the distance calculation unit 120 using the standard pattern dictionary 131 and the general-purpose word-spotting grammar dictionary 132 (step S23 in FIG. 2).

[0026] In more detail, the standard pattern dictionary 131 is first loaded at a designated address in the speech recognition device (step A1 in FIG. 3). Next, the general-purpose word-spotting grammar dictionary 132 is loaded (step A2 in FIG. 3). The grammar control unit 140 then uses the selection means 141 to select the unnecessary words that the service needs (step A3 in FIG. 3). Based on the information from step A3, the specialization means 142 creates a restricted grammar from the general-purpose word-spotting grammar dictionary 132 (step A4 in FIG. 3). After that, the distance calculation unit 120 performs the recognition processing (step A5 in FIG. 3).

[0027] The selection means 141 of the grammar control unit 140 checks which unnecessary words the service needs and sets the corresponding information (steps B1 and B2 in FIG. 5, or steps B1 and C1 in FIG. 6). The specialization means 142 of the grammar control unit 140 uses the information set by the selection means 141 to delete from the grammar, or to invalidate, the unnecessary-word candidates that the service does not need (step B2 in FIG. 5, or steps C2, C3, and C4 in FIG. 6).

[0028] Next, the operation of the first embodiment of the present invention is described using a concrete example. As shown in FIG. 7, suppose for example that the prefix unnecessary-word candidates 1 to m (here m = 7) of the general-purpose word-spotting (hereinafter WS) grammar dictionary 132 are 「あー」 (aa), 「あのー」 (anoo), 「えっとー」 (ettoo), 「えー」 (ee), 「うー」 (uu), 「そのー」 (sonoo), and the null word φ, and that the suffix unnecessary-word candidates 1 to n (here n = 8) are 「から」 (kara, "from"), 「まで」 (made, "to"), 「です」 (desu), 「さん」 (san), 「部長」 (buchō, "department manager"), 「主任」 (shunin, "chief"), 「くん」 (kun), and the null word φ.

[0029] The null word φ here means a word that makes a state transition without outputting a symbol, that is, a word with empty content. Algorithmically it is treated the same as any other word, but in practice it acts as a path that tunnels between the preceding and following words and joins them. For example, take the two sentences 「私はこの本を読む。」 ("I read this book.") and 「私は本を読む。」 ("I read a book."). With N as the null word, a grammar that accepts both is 「私→は→{Ｎ｜この}→本→を→読む。」. When recognition is performed with this grammar, the second sentence is recognized as 「私は（Ｎ）本を読む。」, but in the actual processing the span from 「は」 to 「本」 is tunneled and connected directly, so the result is 「私は本を読む。」. Likewise, in the case of FIG. 8 described later, the utterance is 「やまだくん」 (assuming 「やまだ」 is among the WS target words 32). Because the prefix is in fact the null word, it is judged that there was no preceding unnecessary word, and the answer is 「やまだくん」.
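The role of the null word in [0029] can be shown with a small Python sketch. The slot-list representation of the grammar and the name `accepted_sentences` are assumptions chosen for illustration; the null word is modeled as an empty string, so it contributes no symbol to the output.

```python
from itertools import product

NULL = ""  # the null word: a transition that outputs no symbol

# Each slot lists the alternatives allowed at that position of the grammar.
GRAMMAR = [["私"], ["は"], [NULL, "この"], ["本"], ["を"], ["読む"]]


def accepted_sentences(grammar):
    """Enumerate every sentence the slot grammar accepts."""
    # Joining a path drops the null word automatically, which is the
    # "tunnel" between the neighbouring words described in the text.
    return {"".join(path) for path in product(*grammar)}


sentences = accepted_sentences(GRAMMAR)
```

Because the null word is just another alternative in its slot, the enumeration code needs no special case for it, which matches the remark that it is treated algorithmically like an ordinary word.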

[0030] In service scenario S01, input of a place-name word is required first (step S41 in FIG. 4), followed by input of a personal-name word (step S42 in FIG. 4) (sequence S01 in FIG. 4). The grammar control unit 140 passes grammar control information to the general-purpose WS grammar dictionary 132 based on information from the service application. Based on this grammar control information, the selection means 141 changes which paths are permitted from the prefix unnecessary-word set 31, through the WS target words 32, to the suffix unnecessary-word set 33 in the general-purpose WS grammar dictionary 132.

[0031] The specialization means 142 is used to make this change. For example, when the specialization means 142 confirms that the service pattern is place-name input, it does one of the following. It may assign infinity to the initial values in the cumulative distance calculation table for the prefix unnecessary word 「あー」 and the suffix unnecessary words 「さん」, 「部長」, 「主任」, and 「くん」, effectively excluding them from the calculation (steps B1 and B2 in FIG. 5). Alternatively, it may check the flags in the table that manages execution of the cumulative distance calculation, which were set by the selection means 141, and remove unflagged entries from the calculation loop as if they were absent from the grammar (step B2 in FIG. 5; steps C2, C3, and C4 in FIG. 6).
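The flag-based alternative in [0031] can be sketched in Python using the FIG. 7 vocabulary and the place-name restrictions named in the text. The helper names `build_flags` and `active_candidates` and the table layout are hypothetical, and the null word φ is modeled as an empty string.

```python
# Vocabulary from the FIG. 7 example; "" stands for the null word φ.
PREFIX_FILLERS = ["あー", "あのー", "えっとー", "えー", "うー", "そのー", ""]
SUFFIX_FILLERS = ["から", "まで", "です", "さん", "部長", "主任", "くん", ""]

# Words the text excludes when the service pattern is place-name input.
PLACE_NAME_DISABLED = {
    "prefix": {"あー"},
    "suffix": {"さん", "部長", "主任", "くん"},
}


def build_flags(candidates, disabled):
    """Selection step: flag each candidate as needed (True) or not (False)."""
    return {word: word not in disabled for word in candidates}


def active_candidates(flags):
    """Specialization step: keep only flagged entries in the calculation loop."""
    return [word for word, needed in flags.items() if needed]


prefix_flags = build_flags(PREFIX_FILLERS, PLACE_NAME_DISABLED["prefix"])
suffix_flags = build_flags(SUFFIX_FILLERS, PLACE_NAME_DISABLED["suffix"])
active_prefixes = active_candidates(prefix_flags)
active_suffixes = active_candidates(suffix_flags)
```

Unlike the infinite-penalty variant, unflagged words never enter the loop at all, so this form also reduces the amount of distance computation, which is the motivation the text gives for it.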

[0032] As a result, the continuous grammar takes the form shown in FIG. 8 for place-name word input and the form shown in FIG. 9 for personal-name word input. For each speech input, recognition processing is performed using the continuous grammar shown in FIG. 8 or FIG. 9.

[0033] According to the first embodiment of the present invention, once a dictionary corresponding to the expected service scenario has been registered in the speech recognition device a single time, the service can proceed without re-registration; an improved recognition rate can be expected in the assumed service scenario; and recognition using the general-purpose word-spotting grammar dictionary can be switched efficiently.

[0034] [Second Embodiment] Next, a second embodiment of the present invention is described in detail with reference to the drawings.

[0035] (1) Description of Configuration As shown in FIG. 1, the speech recognition device according to the second embodiment of the present invention comprises a speech input unit 100, a speech analysis unit 110, a distance calculation unit 120, a data unit 130 containing a standard pattern dictionary 131 and a general-purpose word-spotting grammar dictionary 132, and a grammar control unit 140 containing selection means 141 and specialization means 142. Each unit was described in the first embodiment, so the description is omitted here.

[0036] In the second embodiment of the present invention, the keyword words are likewise all registered at the outset and can be excluded later. Referring to FIG. 10, the second embodiment differs from the general-purpose WS grammar dictionary of the first embodiment shown in FIG. 7 in that the keyword target word group is further divided into a plurality of groups. The divided keyword target word groups can be controlled by the grammar control unit 140 of FIG. 1 and are classified into the keyword groups best suited to the service scenario.

[0037] (2) Description of Operation Next, the operation of the second embodiment of the present invention is described in detail with reference to FIGS. 1, 7, and 9 and FIGS. 10 to 12.

[0038] As shown in FIG. 10, suppose for example that the prefix unnecessary-word candidates 1 to m of the general-purpose word-spotting (hereinafter WS) grammar dictionary 132 are 「あー」, 「あのー」, 「えっとー」, 「えー」, 「うー」, 「そのー」, and the null word φ, and that the suffix unnecessary-word candidates 1 to n are 「から」, 「まで」, 「です」, 「さん」, 「部長」, 「主任」, 「くん」, and the null word φ. As before, the null word means a word that makes a state transition without outputting a symbol.

[0039]

The keyword target word groups 62 and 63 store keyword word groups A and B (two groups in this example), each classified into the optimal group for a service. In service scenario S01, input of mainly a place name word is required first, and input of a person name word is required next.

[0040]

The grammar control unit 140 passes grammar control information to the general-purpose WS grammar dictionary 132 based on information from the service application. Based on this grammar control information, the paths along which transitions are permitted from the prefix unnecessary word set 31, through the WS target words 32, to the postfix unnecessary word set 33 in the general-purpose WS grammar dictionary 132 are changed. Because the WS target words 32 are classified in advance into a plurality of word groups, such as 62 and 63 in FIG. 10, the permitted paths from the prefix unnecessary word set 31 to the postfix unnecessary word set 33 are further classified and restricted within the keyword spotting target word groups 62 and 63.
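
The path restriction described in paragraph [0040] can be illustrated with a minimal Python sketch. It is not the implementation disclosed in this publication: the class name `WSGrammar`, its method names, and the romanized sample keywords are hypothetical, and text tokens stand in for the acoustic pattern matching that an actual recognizer would perform.

```python
from dataclasses import dataclass, field

NULL = ""  # stands in for the null word (phi): a transition with no symbol


@dataclass
class WSGrammar:
    """Toy continuous grammar: prefix set -> keyword group -> postfix set."""
    prefixes: set
    postfixes: set
    keyword_groups: dict  # group name -> set of keywords
    active_groups: set = field(default_factory=set)

    def __post_init__(self):
        # Everything is registered once; initially every group is reachable.
        if not self.active_groups:
            self.active_groups = set(self.keyword_groups)

    def apply_control(self, groups):
        """Grammar control: permit only paths through the named groups."""
        unknown = set(groups) - set(self.keyword_groups)
        if unknown:
            raise KeyError(f"unregistered groups: {sorted(unknown)}")
        self.active_groups = set(groups)

    def active_keywords(self):
        return set().union(*(self.keyword_groups[g] for g in self.active_groups))

    def spot(self, tokens):
        """Return the keyword if tokens follow a permitted path, else None."""
        tokens = tuple(tokens)
        for pre in self.prefixes | {NULL}:
            for post in self.postfixes | {NULL}:
                for kw in self.active_keywords():
                    path = tuple(w for w in (pre, kw, post) if w != NULL)
                    if path == tokens:
                        return kw
        return None


grammar = WSGrammar(
    prefixes={"aa", "anoo", "ettoo"},
    postfixes={"kara", "made", "desu", "san"},
    keyword_groups={"A": {"tokyo", "osaka"}, "B": {"kato", "suzuki"}},
)
grammar.apply_control(["A"])                      # place name input step
place = grammar.spot(["ettoo", "tokyo", "kara"])  # "tokyo"
blocked = grammar.spot(["kato", "san"])           # None: group B is not active
grammar.apply_control(["B"])                      # person name input step
person = grammar.spot(["kato", "san"])            # "kato"
```

In this sketch the dictionary contents never change after construction; only the set of active groups does, which mirrors the idea of registering once and narrowing the permitted paths per scenario step.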

[0041]

As a result, place name word input takes the form of the continuous grammar shown in FIG. 11, and person name word input takes the form of the continuous grammar shown in FIG. 9. For each voice input, optimal recognition processing is performed using the continuous grammars shown in FIGS. 11 and 12.

[0042]

According to the second embodiment of the present invention, as shown in FIG. 10, organizing the keyword spotting target words into a plurality of groups allows keyword word candidates to be extracted from a smaller candidate set, which yields the additional effect of further improving the recognition rate.
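
The reduction in candidates can be made concrete by counting permitted paths. The sketch below assumes, purely for illustration, that each path consists of one prefix choice, one keyword, and one postfix choice. The prefix and postfix counts follow the example words listed for FIG. 10 with the null word included; the keyword group sizes are invented.

```python
def count_paths(num_prefix, num_keywords, num_postfix):
    """Number of prefix -> keyword -> postfix paths in the toy grammar."""
    return num_prefix * num_keywords * num_postfix


NUM_PREFIX = 7    # six filler words plus the null word
NUM_POSTFIX = 8   # seven postfix words plus the null word
GROUP_SIZES = {"A": 100, "B": 200}  # hypothetical keyword counts per group

all_paths = count_paths(NUM_PREFIX, sum(GROUP_SIZES.values()), NUM_POSTFIX)
place_paths = count_paths(NUM_PREFIX, GROUP_SIZES["A"], NUM_POSTFIX)

print(all_paths, place_paths)  # 16800 5600
```

Under these assumed sizes, activating only group A leaves one third of the paths that would be searched with both groups active.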

[0043]

[Effects of the Invention] As described above, the present invention provides the following effects.

[0044]

The first effect is that the service can proceed without re-registering dictionaries. All conceivable unnecessary word candidates are registered at once in advance and are divided into subsets for use in each scene of the service, so a dictionary covering the expected service scenarios needs to be registered in the voice recognition device only once.

[0045]

The second effect is an expected improvement in the recognition rate for the assumed service scenarios. Simply registering every conceivable unnecessary word would be expected to cause frequent insertion errors (spurious detections) triggered by other words, whereas switching the set of unnecessary words as appropriate for the service scene, as in the present invention, eliminates unnecessary recognition candidates.

[0046]

The third effect is that registering frequently used grammar control information as patterns in advance allows recognition using word spotting to be switched efficiently.
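
One way to picture pre-registered grammar control patterns is as a lookup table keyed by scenario step. The following Python sketch is an assumption made for illustration only: the table layout, the names `CONTROL_PATTERNS`, `SCENARIO_S01`, and `controls_for`, and the assignment of postfix words to each step are hypothetical and are not taken from this publication.

```python
# Hypothetical table of pre-registered grammar control patterns.
CONTROL_PATTERNS = {
    "place_name": {
        "groups": ("A",),
        "postfixes": ("kara", "made", "desu"),
    },
    "person_name": {
        "groups": ("B",),
        "postfixes": ("san", "kun", "buchou", "shunin", "desu"),
    },
}

# Service scenario S01: place name input first, then person name input.
SCENARIO_S01 = ("place_name", "person_name")


def controls_for(scenario):
    """Yield (step, control pattern) pairs in scenario order."""
    for step in scenario:
        yield step, CONTROL_PATTERNS[step]


steps = list(controls_for(SCENARIO_S01))
for step, control in steps:
    print(step, control["groups"], control["postfixes"])
```

Because each pattern is looked up by name, switching from one step to the next requires no new dictionary registration, only a different table entry.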

[Brief description of drawings]

FIG. 1 is a block diagram showing a configuration example of the voice recognition device according to the first and second embodiments of the present invention.

FIG. 2 is a flowchart showing voice recognition processing according to the first embodiment of the present invention.

FIG. 3 is a flowchart showing voice recognition processing according to the first embodiment of the present invention.

FIG. 4 is a flowchart showing place name / person name input processing according to the first embodiment of the present invention.

FIG. 5 is a flowchart showing cumulative distance calculation processing according to the first embodiment of the present invention.

FIG. 6 is a flowchart showing cumulative distance calculation processing according to the first embodiment of the present invention.

FIG. 7 is an explanatory diagram showing a specific example of voice recognition according to the first embodiment of the present invention.

FIG. 8 is an explanatory diagram showing a specific example of voice recognition according to the first embodiment of the present invention.

FIG. 9 is an explanatory diagram showing a specific example of voice recognition according to the first embodiment of the present invention.

FIG. 10 is an explanatory diagram showing a specific example of voice recognition according to the second embodiment of the present invention.

FIG. 11 is an explanatory diagram showing a specific example of voice recognition according to the second embodiment of the present invention.

FIG. 12 is an explanatory diagram showing a specific example of voice recognition according to the second embodiment of the present invention.

[Explanation of symbols]

100 Voice input unit
110 Voice analysis unit
120 Distance calculation unit
130 Data unit
131 Standard pattern dictionary
132 General-purpose word spotting grammar dictionary
140 Grammar control unit
141 Selection means
142 Specialization means

Claims

(57) [Claims]

1. A voice recognition device that performs voice recognition by removing unnecessary words added before and after a voice recognition target word, comprising: standard pattern registration means in which standard patterns of voice information are registered; unnecessary word registration means in which a continuous grammar holding all prefix unnecessary words and postfix unnecessary words whose use is anticipated is registered; voice recognition means for performing voice recognition by referring to the standard pattern registration means and the unnecessary word registration means; and control means for appropriately reducing, in accordance with a service scenario, the search space of the unnecessary word registration means searched by the voice recognition means, wherein the control means changes, based on grammar control information registered in advance, the paths along which transitions are permitted from the prefix unnecessary word set, through the voice recognition target word group, to the postfix unnecessary word set, thereby reducing the candidates for the registered prefix unnecessary words and postfix unnecessary words when the voice recognition means performs voice recognition.

2. The voice recognition device according to claim 1, wherein the voice recognition target word group comprises a plurality of voice recognition target word groups classified into optimal groups for each voice recognition request.

3. The voice recognition device according to claim 1, wherein the control means appropriately reduces the search space of the unnecessary word registration means searched by the voice recognition means in accordance with a service scenario based on noun word input such as place name word input and person name word input.

4. A voice recognition method for performing voice recognition by removing unnecessary words added before and after a voice recognition target word, wherein, when voice recognition is performed based on standard patterns of voice information registered in advance and on a continuous grammar, registered in advance, holding all prefix unnecessary words and postfix unnecessary words whose use is anticipated, the paths along which transitions are permitted from the prefix unnecessary word set, through the voice recognition target word group, to the postfix unnecessary word set are changed in order to appropriately reduce, in accordance with a service scenario, the search space of the registered prefix unnecessary words and postfix unnecessary words whose use is anticipated, thereby reducing the candidates for the registered prefix unnecessary words and postfix unnecessary words at the time of voice recognition.