JP2001255888A

JP2001255888A - Speech recognition device, speech recognition method and storage medium stored with program to execute the method

Info

Publication number: JP2001255888A
Application number: JP2000064189A
Authority: JP
Inventors: Junichi Takami; 淳一鷹見
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2000-03-08
Filing date: 2000-03-08
Publication date: 2001-09-21

Abstract

PROBLEM TO BE SOLVED: To realize efficient candidate search while switching grammars for speech recognition. SOLUTION: When speech is input (S2), the length of the input speech is measured (S3). When the application program specifies the grammar to be used to recognize the input speech, the cumulative input speech length and the number of speech inputs stored in association with that grammar's identification information are obtained and updated, and the average speech length is then updated from the updated cumulative input speech length and number of speech inputs (S4). In parallel with this update, an appropriate search method is selected using the size of the specified grammar and the average speech length observed in past speech recognition processing performed with that grammar (S5). A candidate search is then performed on the input speech as the recognition process using the selected search method (S6), and the recognition result is output (S7).

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition apparatus provided with a plurality of grammars and a plurality of candidate search methods, to a speech recognition method, and to a storage medium storing a program for carrying out the method.

[0002]

[Description of the Related Art] In speech recognition, a grammar network based on, for example, a finite state automaton is used to extract candidate words that are thought to match the input speech. Individual phonemes are assigned to states, and the possibility of a transition from phoneme x to phoneme y is represented as a transition from state x to state y. Using the grammar of the vocabulary to be recognized as a constraint, a state transition model (finite state automaton) is prepared that covers the possible phoneme strings (phoneme sequences) from the beginning to the end of an utterance, and phoneme strings similar to the phoneme string of the input speech are extracted from it. For example, feature values representing the individual phonemes, as determined by an acoustic model, are compared one after another with feature values representing the individual phonemes cut out successively from the input speech. A higher score is given the higher the similarity, a cumulative score is obtained for each possible phoneme string indicated by the finite state automaton, and phoneme strings with high cumulative scores are extracted as recognition result candidates.

Typical search methods for extracting candidate phoneme strings from a finite state automaton covering the possible state transitions include breadth-first search, depth-first search, and A* search. Each is outlined below.

Breadth-first search explores widely and shallowly. As shown in FIG. 3, the search depth, that is, the hypothesis time length, is kept uniform so that the searches of the multiple possible hypotheses finish at almost the same time. The numbers in FIG. 3 indicate the search order (the same applies to FIG. 4). In principle, therefore, real-time processing is possible regardless of the input speech length (from the beginning to the end of the utterance), so the method is effective when the grammar scale (for example, the total number of states of the finite state automaton) is small, as with continuous digits of arbitrary length or short sentences made up of a small vocabulary. However, because the processing time increases in proportion to the grammar scale, it is not suited to large-vocabulary speech recognition.

Depth-first search, as shown in FIG. 4, proceeds in the depth direction (the direction in which the hypothesis time length grows) and, on reaching the end, returns to the most recent branch point and searches another hypothesis toward the end, repeating this process. This search method is therefore advantageous when the depth of the hypotheses must be controlled, and is suited to tasks such as recognition of continuous digits with a fixed number of digits. Its drawbacks are that real-time processing is not possible and that the processing time increases in proportion to the grammar scale.

A* search extracts as candidates the phoneme strings for which the sum of the partial score actually computed up to the current point and the predicted score that would be computed if expansion continued from that point is large. Because its computation time is almost unaffected by the grammar scale, this search method is suited to large-vocabulary speech recognition. However, the computation time can increase sharply when the input speech is long, so it is not suited to tasks such as continuous digit recognition.

Accordingly, in a speech recognition method that performs recognition while dynamically swapping grammars according to the scene or purpose, it is desirable to always achieve an efficient search by using a search method suited to the scale and nature of the grammar used in each scene.
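The difference in visiting order between the three search methods can be illustrated with a small sketch. This is only an illustration under stated assumptions: the toy network `NETWORK`, the state names, and the helper functions are hypothetical and do not appear in the patent, and real recognizers score hypotheses against acoustic features rather than merely listing states.

```python
from collections import deque

# Hypothetical toy grammar network: each state maps to its successor states.
NETWORK = {
    "start": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [],
    "a2": [],
    "b1": [],
}


def breadth_first_order(network, root):
    """Visit states level by level, so all hypotheses grow at the same depth."""
    order, queue = [], deque([root])
    while queue:
        state = queue.popleft()
        order.append(state)
        queue.extend(network[state])
    return order


def depth_first_order(network, root):
    """Follow one hypothesis to its end, then return to the last branch point."""
    order, stack = [], [root]
    while stack:
        state = stack.pop()
        order.append(state)
        # Push successors in reverse so the first successor is explored first.
        stack.extend(reversed(network[state]))
    return order


def a_star_pick(hypotheses):
    """Pick the hypothesis whose partial score plus predicted score is largest."""
    return max(hypotheses, key=lambda h: h["partial"] + h["predicted"])


print(breadth_first_order(NETWORK, "start"))
print(depth_first_order(NETWORK, "start"))
```

Breadth-first order keeps every hypothesis at the same depth, which matches the frame-synchronous behavior described above, while depth-first order finishes one hypothesis before backing up to a branch point.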

[0003]

[Problems to be Solved by the Invention] In the conventional technology, however, when candidates are searched for while the grammar for speech recognition is switched, a plurality of search methods cannot be used selectively according to the scale or nature of the grammar in use. An object of the present invention is to solve this problem of the conventional technology. That is, an object of the present invention is to provide a speech recognition method and the like that can realize an efficient candidate search when candidates are searched for while the grammar for speech recognition is switched.

[0004]

[Means for Solving the Problems] To solve the above problem, the invention according to claim 1 is a speech recognition method capable of searching for candidates while switching grammars for speech recognition, characterized in that a plurality of search methods are used selectively according to the scale of the grammar in use. The invention according to claim 2 is the speech recognition method according to claim 1, characterized in that a plurality of selectable grammars or search methods are displayed and the grammar or search method to be used is selected from among them. The invention according to claim 3 is the speech recognition method according to claim 1, characterized in that the search method is selected automatically according to the scale of the grammar. The invention according to claim 4 is the speech recognition method according to one of claims 1 and 3, characterized in that the input speech length is stored and, when the search method is automatically selected on subsequent occasions, the speech length is added to the selection criteria. The invention according to claim 5 is the speech recognition method according to any one of claims 1 to 4, characterized in that at least one of acoustic model information, phoneme duration model information, and grammar network information is used in common by the plurality of search methods. The invention according to claim 6 is a speech recognition method capable of searching for candidates by a plurality of search methods, characterized in that searches by each of the plurality of search methods are executed sequentially or in parallel and the final candidate is determined according to the average of the scores of each candidate obtained by the respective search methods. The invention according to claim 7 is a storage medium characterized by storing a program for carrying out the speech recognition method according to any one of claims 1 to 6. The invention according to claim 8 is a speech recognition apparatus using the speech recognition method according to any one of claims 1 to 6, characterized in that likelihood table calculation means or phoneme duration control penalty table calculation means is used in common by the plurality of search methods.

[0005]

[Embodiments of the Invention] Embodiments of the present invention will now be described in detail with reference to the drawings. The embodiment always enables an efficient search by appropriately switching among a plurality of search methods according to, for example, the scale of the grammar in use. The main criterion for judging which search method is appropriate is a value representing the scale of the grammar (for example, the total number of states of the finite state automaton), but the input speech length is also considered as another factor that affects search efficiency. When real-time processing is performed frame-synchronously, however, processing starts before the utterance ends, so the search method must be decided while the input speech length is still unknown. The length of the current input therefore cannot be used directly as a criterion for choosing the search method. For this reason, a function is provided that stores the length (utterance time) of each speech input and calculates the average input speech length for each grammar (that is, each scene), so that this information can be used as a criterion for selecting the search method from the next input onward. In this speech recognition apparatus, information and processing means that can be used in common are shared among the search methods, which suppresses the increase in memory caused by implementing a plurality of search methods. Furthermore, in situations where real-time performance is not required, such as batch processing, or where the grammar is small and there is processing time to spare, a plurality of search methods are executed in parallel and the final candidate is calculated according to the weighted average of the candidate scores obtained by the respective search methods, which enables robust speech recognition.

[0006] One embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1 is a block diagram of a speech recognition apparatus according to one embodiment of the present invention. As illustrated, the speech recognition apparatus of this embodiment includes: a speech input unit 1 composed of a microphone, an amplifier circuit, an A/D converter, and the like; a work memory 2 that temporarily stores acoustic model data and various grammar information and is used for the computations of the search processing; an average speech length calculation unit 3 that obtains the input speech length from the beginning to the end of the input speech information and calculates the average speech length for each grammar; a search method selection unit 4 that selects an appropriate search method based on the grammar scale and the average speech length; a candidate search unit 5 that searches for candidates by the designated search method; a score aggregation unit 6 that calculates the weighted average of the scores of each candidate obtained from a plurality of search methods and determines the final candidate; a recognition result output unit 7 that displays the recognition result or prints it on paper; a search method designation unit 8 that has a keyboard, a mouse, and the like and is used to designate a search method; an acoustic model storage unit 9 that stores acoustic model data; and a grammar information storage unit 10 that stores various grammar information. As also illustrated, the candidate search unit 5 consists of dedicated search units corresponding to the individual search methods, such as search method A, search method B, and search method C, and a common processing unit 11 shared by those search methods. Search methods A, B, and C are, for example, breadth-first search (the One-Pass DP method), depth-first search, and A* search. The average speech length calculation unit 3, the search method selection unit 4, the candidate search unit 5, and the score aggregation unit 6 have, for example, a shared memory and a CPU that operates according to a program stored in that memory. The acoustic model storage unit 9 and the grammar information storage unit 10 are realized by, for example, a hard disk drive.

[0007] FIG. 2 shows the operation flow of this embodiment. The operation is described below with reference to FIG. 2 and other figures. Prior to this operation, the acoustic model storage unit 9 holds the acoustic model data used to obtain a score for each phoneme, and the grammar information storage unit 10 holds various grammar information. It is also assumed that the average speech length of the speech input so far has been calculated for each grammar by the average speech length calculation unit 3 and is stored in the grammar information storage unit 10 in association with the grammar. First, as initialization prior to recognition, acoustic model data is read from the acoustic model storage unit 9 into the work memory 2, and grammar information (for example, finite state automata) is read from the grammar information storage unit 10 into the work memory 2 (step S1). At this point, the scale of each grammar, for example the total number of states of its finite state automaton, becomes known. In this state, in an actual speech recognition scene, speech is input from the microphone of the speech input unit 1 (step S2). Next, the average speech length calculation unit 3 measures the length of the input speech via the work memory 2 (step S3). Around the same time, when the application program specifies the grammar to be used to recognize this input speech, the average speech length calculation unit 3 obtains grammar identification information indicating that grammar, and obtains the cumulative input speech length and the number of speech inputs that are stored in association with that identification information for speech recognition using that grammar (step S4). It then adds the input speech length just measured to the cumulative input speech length, increments the number of speech inputs by one, and updates the average speech length from the updated cumulative input speech length and number of speech inputs (step S4). In parallel with the average speech length update, the search method selection unit 4 selects an appropriate search method using the scale α of the specified grammar and the average speech length β observed in past speech recognition processing using this grammar (step S5). As an example, consider an evaluation measure for the case where two search methods are implemented: A* search, which is suited to recognizing short utterances with a large vocabulary, and the One-Pass DP method, which is suited to recognizing long utterances with a small vocabulary. In this case, the evaluation value S calculated by the following equation (1), for example, can be used.

S = log α - k · log β    (1)

If the evaluation value S is greater than or equal to a certain threshold S0, A* search is used; otherwise the One-Pass DP method is used. S0 and k are constants determined experimentally. When k is 0, the search method is selected with the grammar scale as the only criterion, and the larger k is made, the more weight the average utterance length carries.
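Steps S4 and S5 can be sketched as follows. This is a minimal sketch, not the patent's implementation: the class `GrammarStats`, the function `select_search_method`, the concrete values of `k` and `s0`, and the fallback to the grammar scale alone when no past utterance has been recorded are all assumptions made for illustration. Only the formula S = log α - k · log β and the threshold comparison against S0 come from the text.

```python
import math
from dataclasses import dataclass


@dataclass
class GrammarStats:
    """Per-grammar bookkeeping (hypothetical structure)."""
    total_states: int                 # grammar scale alpha
    cumulative_length: float = 0.0    # sum of past input speech lengths
    input_count: int = 0              # number of past speech inputs

    def record(self, speech_length):
        """Step S4: add the measured length and increment the input count."""
        self.cumulative_length += speech_length
        self.input_count += 1

    @property
    def average_length(self):
        """Average speech length beta, or None if nothing has been recorded."""
        if self.input_count == 0:
            return None
        return self.cumulative_length / self.input_count


def select_search_method(alpha, beta, k, s0):
    """Step S5: evaluate S = log(alpha) - k * log(beta) and compare with S0."""
    # Assumption: with no recorded average, fall back to the grammar scale only,
    # which is what k = 0 would give.
    length_term = 0.0 if beta is None else k * math.log(beta)
    s = math.log(alpha) - length_term
    return "A*" if s >= s0 else "One-Pass DP"


# Example with made-up constants k = 1.0 and S0 = 5.0.
digits = GrammarStats(total_states=50)
digits.record(6.0)
digits.record(10.0)
print(select_search_method(digits.total_states, digits.average_length, 1.0, 5.0))
```

A large grammar with short utterances yields a large S and therefore A* search, while a small grammar with long utterances yields a small S and therefore the One-Pass DP method, which matches the suitability of each method described in the text.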

[0008] In the above embodiment, when the search method to be used is specified by the application program together with the grammar, or is designated through the search method designation unit 8 shown in FIG. 1, this evaluation is not performed and the instruction is followed. When the search method designation unit 8 is used, a plurality of search methods are displayed and the user selects one of them with a mouse or the like. When n search methods are run sequentially or in parallel, the n most appropriate search methods are selected in order of suitability, and the score aggregation unit 6 calculates the weighted average of the scores of the recognition results obtained with them and takes the candidate with the largest weighted average as the final search solution (recognition result). When, on the other hand, the search is performed with a single selected search method, the candidate search unit 5 next performs a candidate search, as the recognition processing of the input speech, using the selected search method (step S6). Using the grammar network based on the finite state automaton, it extracts candidate phoneme strings that are thought to match the input speech. That is, data determined by the acoustic model and data extracted from the individual phoneme data cut out successively from the input speech are compared one after another using the work memory 2. A higher score is given the higher the similarity, a cumulative score is obtained for each possible phoneme string indicated by the finite state automaton, and the phoneme string or strings with the highest cumulative score are extracted as recognition result candidates. If, for example, the One-Pass DP method has been selected as the search method, the search proceeds with the search depth, that is, the phoneme string length, kept uniform so that the searches of the possible phoneme strings finish at almost the same time. Finally, the search result, that is, the recognition result, is output to the recognition result output unit 7 (step S7).
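The score aggregation performed when several search methods are run can be sketched as below. The function name `combine_scores`, the dictionary layout, the example weights, and the decision to compare only candidates returned by every search method are assumptions for illustration; the text specifies only that a weighted average of the scores is computed and the candidate with the largest weighted average is chosen.

```python
def combine_scores(results, weights):
    """Return the best candidate and the weighted-average score per candidate.

    results: {method_name: {candidate: score}}
    weights: {method_name: weight}
    """
    methods = list(results)
    if not methods:
        raise ValueError("no search results to combine")

    # Assumption: only candidates found by every search method are compared.
    common = set(results[methods[0]])
    for method in methods[1:]:
        common &= set(results[method])
    if not common:
        raise ValueError("the search methods share no candidate")

    total_weight = sum(weights[m] for m in methods)
    averaged = {
        cand: sum(weights[m] * results[m][cand] for m in methods) / total_weight
        for cand in common
    }
    best = max(averaged, key=averaged.get)
    return best, averaged


# Example with made-up log scores from two search methods.
results = {
    "A*": {"tokyo": -10.0, "kyoto": -12.0},
    "One-Pass DP": {"tokyo": -11.0, "kyoto": -9.0},
}
weights = {"A*": 1.0, "One-Pass DP": 3.0}
best, averaged = combine_scores(results, weights)
print(best, averaged[best])
```

How the weights are chosen is not stated in the text; one plausible choice, again an assumption, is to weight each method by how suitable it is for the current grammar.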

[0009] In the above description the search method is selected from the grammar scale and the input speech length, but it may instead be selected from the grammar scale alone. When deciding which grammar to use, a plurality of available grammars may be displayed and one of them selected with a mouse or the like. In the above description, the acoustic model data, that is, the acoustic model information, and the grammar network information based on the finite state automaton are used in common by the plurality of search methods. Alternatively, only one of the acoustic model information and the grammar network information may be used in common by the plurality of search methods. The phoneme duration model information may also be used in common by the plurality of search methods. In A* search, which is performed frame-asynchronously, the connection range is determined using the position (time t) of the connection representative point, which is the connection point between the partial score (cumulative score) region and the predicted score region, that is, the point at which the evaluation value is calculated, together with the mean duration and variance of the phonemes in the partial score region. The phoneme duration model information is the information used for this determination. The common processing unit 11 shared by the plurality of search methods includes likelihood table calculation means, phoneme duration control penalty table calculation means, or both. The likelihood table calculation means is calculation means that uses a likelihood table to obtain the likelihood (plausibility, similarity) that each phoneme of a possible phoneme string indicated by the finite state automaton is the corresponding phoneme of the input speech, and that creates the likelihood table using acoustic model parameters. The phoneme duration control penalty table calculation means is calculation means that uses a phoneme duration control penalty table to obtain the cumulative score f(t) at elapsed time τ from the cumulative score f'(t) up to the connection representative point and the elapsed time τ since the connection representative point, and that creates the penalty table. The penalty table gives the penalty αd(τ) appearing in equation (2) below in table form, indexed by the elapsed time τ since the connection representative point. The cumulative score f(t) is given by equation (2).

log f(t) = log f'(t) + αd(τ)    (2)
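A penalty table lookup in the sense of equation (2) might look like the sketch below. The patent does not define the function d(τ), so the Gaussian-style log penalty built from a mean duration and a variance is purely an assumption, as are the function names, the `weight` parameter standing in for the coefficient α of equation (2), and the clamping of τ to the table length.

```python
def build_penalty_table(mean, variance, max_frames, weight=1.0):
    """Precompute weight * d(tau) for tau = 0 .. max_frames.

    Assumption: d(tau) = -(tau - mean)^2 / (2 * variance), i.e. the penalty
    is zero at the mean duration and grows as tau departs from it.
    """
    return [
        weight * (-((tau - mean) ** 2) / (2.0 * variance))
        for tau in range(max_frames + 1)
    ]


def apply_penalty(log_score_at_connection, tau, table):
    """Equation (2): log f(t) = log f'(t) + alpha * d(tau)."""
    # Assumption: elapsed times beyond the table reuse the last entry.
    tau = min(tau, len(table) - 1)
    return log_score_at_connection + table[tau]


# Example with made-up duration statistics (in frames).
table = build_penalty_table(mean=5, variance=4, max_frames=10)
print(apply_penalty(-20.0, 7, table))
```

Because the table depends only on the duration statistics and not on the search strategy, the same table can be handed to every search method, which is the sharing that the common processing unit 11 is described as providing.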

[0010] In A* search, a two-stage search is performed: in the first stage, predicted scores are obtained from the front of the phoneme string toward the back, and in the second stage candidates are searched for in the opposite direction. In the One-Pass DP method, as the name indicates, only a single search from the front of the phoneme string toward the back is performed. Looking at an individual hypothesis in that search, it enters the acoustic model state corresponding to phoneme x at some time, stays within that phoneme for a certain period, and eventually moves to the next phoneme y. Therefore, by storing for each hypothesis the time at which it moved into x from the phoneme preceding x, it is possible to calculate, at the moment a hypothesis moves from phoneme x to phoneme y, how long that hypothesis stayed in phoneme x. Speech recognition accuracy can then be improved by using for candidate selection a cumulative score obtained by adding, to the hypothesis's cumulative score up to that point, a penalty derived from the dwell time in phoneme x. Because this penalty calculation is the same as that of A* search, the phoneme duration control penalty table calculation means can be shared by the plurality of search methods. The speech recognition apparatus shown in FIG. 1 has been described above, but a configuration without the score aggregation unit 6 and the search method designation unit 8 is also possible. A program for carrying out the speech recognition method of the present invention as described may also be stored on a machine-readable storage medium. In that case, a machine-readable storage medium that stores the program and is attachable to and detachable from an information processing apparatus is obtained, so by mounting that storage medium on an information processing apparatus that could not previously perform speech recognition by the method of the present invention, that apparatus also becomes able to do so.
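The dwell-time bookkeeping described for the One-Pass DP method can be sketched as follows. The `Hypothesis` class, the function `move_to_next_phoneme`, and the example penalty function are hypothetical; the text states only that each hypothesis remembers when it entered its current phoneme and that a penalty derived from the dwell time is added to its cumulative score on the transition.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    phoneme: str       # phoneme the hypothesis is currently in
    entered_at: int    # frame index at which it entered that phoneme
    log_score: float   # cumulative log score so far


def move_to_next_phoneme(hyp, next_phoneme, now, penalty_for):
    """Transition x -> y: penalize the time spent in x, then restart the clock."""
    dwell = now - hyp.entered_at
    return Hypothesis(
        phoneme=next_phoneme,
        entered_at=now,
        log_score=hyp.log_score + penalty_for(hyp.phoneme, dwell),
    )


def example_penalty(phoneme, dwell):
    """Made-up penalty: 0.5 per frame of deviation from a 5-frame duration."""
    return -0.5 * abs(dwell - 5)


hyp = Hypothesis(phoneme="x", entered_at=10, log_score=-20.0)
print(move_to_next_phoneme(hyp, "y", 17, example_penalty))
```

Passing the penalty in as a function keeps the hypothesis bookkeeping independent of how the penalty is computed, so a table shared with A* search could be plugged in without changing this code.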

【００１１】[0011]

【発明の効果】以上説明したように、請求項１に記載の
発明によれば、音声認識のための文法を切り替えながら
候補の探索を行う際、用いる文法の規模に応じて複数の
探索法が使い分けられるので、効率の良い候補探索を実
現することができる。また、請求項２に記載の発明によ
れば、請求項１に記載の発明において、選択対象の複数
の文法または探索法が示され、そのなかから用いられる
文法または探索法が選択されるので、文法または探索法
の選択を容易に行うことができる。また、請求項３に記
載の発明によれば、請求項１記載の発明において、文法
の規模に応じて探索法が自動的に選択されるので、適切
な探索法をさらに容易に選択することができる。また、
請求項４に記載の発明によれば、請求項１および請求項
３の１つに記載の発明において、入力された音声長が記
憶しておかれ、次回以降に探索法を自動選択する際、前
記音声長が選択基準に加えられて探索法が選択されるの
で、さらに適切な探索法選択が可能になる。また、請求
項５に記載の発明によれば、請求項１乃至請求項４の１
つに記載の発明において、音響モデル情報、音素継続時
間長モデル情報、文法ネットワーク情報のうち少なくと
も一つが複数の探索法で共通に用いられるので、複数の
探索法の実現が容易になるし、使用するメモリ量も少な
くて済む。また、請求項６に記載の発明によれば、候補
を探索する際、複数の探索法のそれぞれによる探索が順
次または並列に実行され、各探索法で得られた各候補の
スコアの平均値に従って最終的な候補が求められるの
で、状況によっては、より正確な音声認識を実現でき
る。また、請求項７に記載の発明によれば、請求項１乃
至請求項６の１つに記載の音声認識方法を実施するため
のプログラムを記憶した機械読み取り可能な記憶媒体で
あって、情報処理装置に着脱可能な記憶媒体を得ること
ができるから、その記憶媒体をそれまで請求項１乃至請
求項６の１つに記載の音声認識方法による音声認識を行
えなかった情報処理装置に装着することにより、その情
報処理装置においても請求項１乃至請求項６の１つに記
載の音声認識方法による音声認識を行うことができる。
また、請求項８に記載の発明によれば、請求項１乃至請
求項６の１つに記載の発明において、尤度テーブル計算
手段または音素継続時間制御用ペナルティテーブル計算
手段が複数の探索法で共通に用いられるので、複数の探
索法の実現がさらに容易になる。As described above, according to the invention of claim 1, when candidates are searched for while the grammar for speech recognition is switched, a plurality of search methods are used selectively according to the scale of the grammar in use, so an efficient candidate search can be realized. According to the invention of claim 2, in the invention of claim 1, a plurality of grammars or search methods available for selection are presented and the grammar or search method to be used is selected from among them, so the grammar or search method can be selected easily. According to the invention of claim 3, in the invention of claim 1, a search method is selected automatically according to the scale of the grammar, so an appropriate search method can be selected even more easily. According to the invention of claim 4, in the invention of claim 1 or claim 3, the input speech length is stored and, when a search method is automatically selected on subsequent occasions, that speech length is added to the selection criteria, so a still more appropriate search method can be selected. According to the invention of claim 5, in the invention of any one of claims 1 to 4, at least one of acoustic model information, phoneme duration model information, and grammar network information is used in common by the plurality of search methods, so the plurality of search methods are easier to implement and less memory is used. According to the invention of claim 6, when candidates are searched for, the searches by the respective search methods are executed sequentially or in parallel, and the final candidate is determined according to the average of the scores that the search methods give each candidate, so more accurate speech recognition can be realized in some situations. According to the invention of claim 7, a machine-readable storage medium that stores a program for performing the speech recognition method of any one of claims 1 to 6 and is detachable from an information processing apparatus can be obtained; mounting that medium on an information processing apparatus that previously could not perform speech recognition by that method enables the apparatus to do so. According to the invention of claim 8, in the invention of any one of claims 1 to 6, the likelihood table calculation means or the phoneme duration control penalty table calculation means is used in common by the plurality of search methods, so the plurality of search methods become even easier to implement.
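The automatic selection described for claims 3 and 4 can be illustrated with a minimal Python sketch. It keeps, per grammar, the cumulative input speech length and the number of speech inputs, derives the average speech length from them, and chooses a search method from the grammar scale and that average. The names `GrammarStats` and `select_search_method`, the labels `"exhaustive"` and `"pruned"`, both threshold values, and the direction of the rule (small grammar and short speech favor the exhaustive search) are assumptions made for illustration; this section does not specify them.

```python
from dataclasses import dataclass


@dataclass
class GrammarStats:
    """History kept for one grammar (hypothetical structure)."""
    total_speech_length: float = 0.0  # cumulative input speech length, seconds
    num_inputs: int = 0               # number of speech inputs so far

    def update(self, speech_length: float) -> None:
        """Add one measured input speech length to the history."""
        self.total_speech_length += speech_length
        self.num_inputs += 1

    @property
    def average_speech_length(self) -> float:
        """Average speech length; 0.0 when there is no history yet."""
        if self.num_inputs == 0:
            return 0.0
        return self.total_speech_length / self.num_inputs


# Hypothetical thresholds; the source gives no concrete values here.
MAX_SMALL_GRAMMAR_SIZE = 1000   # e.g. number of states in the grammar network
MAX_SHORT_SPEECH_SEC = 2.0


def select_search_method(grammar_size: int, stats: GrammarStats) -> str:
    """Choose a search method from grammar scale and average speech length."""
    is_small = grammar_size <= MAX_SMALL_GRAMMAR_SIZE
    is_short = stats.average_speech_length <= MAX_SHORT_SPEECH_SEC
    # Assumed rule: exhaustive search only when both factors keep it cheap.
    return "exhaustive" if is_small and is_short else "pruned"


stats = GrammarStats()
stats.update(1.5)
stats.update(2.5)
chosen = select_search_method(200, stats)
```

In a real recognizer the thresholds would presumably be tuned against the cost of each search method; the sketch only shows how the two selection criteria combine.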

[Brief description of the drawings]

【図１】本発明の１つの実施の形態に係る音声認識装置
を示すブロック図である。FIG. 1 is a block diagram showing a speech recognition device according to one embodiment of the present invention.

【図２】本発明の１つの実施の形態に係る音声認識方法
を説明するためのフロー図である。FIG. 2 is a flowchart for explaining a speech recognition method according to one embodiment of the present invention.

【図３】従来技術および本発明の実施の形態に係わる音
声認識方法を説明するための説明図である。FIG. 3 is an explanatory diagram for explaining a speech recognition method according to a conventional technique and an embodiment of the present invention.

【図４】従来技術および本発明の実施の形態に係わる他
の音声認識方法を説明するための説明図である。FIG. 4 is an explanatory diagram for explaining another speech recognition method according to a conventional technique and an embodiment of the present invention.

[Explanation of symbols]

１音声入力部、２ワークメモリ、３平均音声長算
出部、４探索法選択部、５候補探索部、６スコア
集計部、７認識結果出力部、８探索法指定部、９
音響モデル記憶部、１０文法情報記憶部、１１共通
処理部。Reference signs: 1 speech input unit, 2 work memory, 3 average speech length calculation unit, 4 search method selection unit, 5 candidate search unit, 6 score totaling unit, 7 recognition result output unit, 8 search method designation unit, 9 acoustic model storage unit, 10 grammar information storage unit, 11 common processing unit.

Claims

[Claims]

1. A speech recognition method capable of searching for candidates while switching the grammar for speech recognition, characterized in that a plurality of search methods are used selectively according to the scale of the grammar to be used.

2. The speech recognition method according to claim 1, wherein a plurality of grammars or search methods available for selection are presented, and the grammar or search method to be used is selected from among them.

3. The speech recognition method according to claim 1, wherein a search method is automatically selected according to the scale of the grammar.

4. The speech recognition method according to claim 1 or claim 3, wherein the input speech length is stored, and when a search method is automatically selected on subsequent occasions, said speech length is added to the selection criteria for selecting the search method.

5. The speech recognition method according to any one of claims 1 to 4, wherein at least one of acoustic model information, phoneme duration model information, and grammar network information is used in common by the plurality of search methods.

6. A speech recognition method capable of searching for candidates by a plurality of search methods, characterized in that the searches by the respective search methods are executed sequentially or in parallel, and a final candidate is determined according to the average value of the scores of each candidate obtained by the respective search methods.

7. A storage medium storing a program for performing the speech recognition method according to any one of claims 1 to 6.

8. A speech recognition device using the speech recognition method according to any one of claims 1 to 6, characterized in that the likelihood table calculation means or the phoneme duration control penalty table calculation means is configured to be used in common by the plurality of search methods.
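
The combination step recited in claim 6 can be sketched in Python as follows. Each search method is assumed to return a mapping from candidate to score, with a higher score meaning a better match (for example a log-likelihood). The function name `pick_final_candidate`, the score convention, and the choice to average a candidate only over the methods that actually returned it are illustrative assumptions, not details given in the claims.

```python
from collections import defaultdict
from typing import Dict, List


def pick_final_candidate(results: List[Dict[str, float]]) -> str:
    """Return the candidate with the highest average score.

    `results` holds one {candidate: score} mapping per search method.
    Assumption: higher scores are better, and a candidate is averaged
    over the methods that produced it.
    """
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for method_scores in results:
        for candidate, score in method_scores.items():
            totals[candidate] += score
            counts[candidate] += 1
    if not totals:
        raise ValueError("no candidates were produced by any search method")
    # Compare candidates by their mean score across search methods.
    return max(totals, key=lambda c: totals[c] / counts[c])


# Two hypothetical search methods scoring the same two candidates.
method_a = {"tokyo": -10.0, "kyoto": -12.0}
method_b = {"tokyo": -14.0, "kyoto": -11.0}
final = pick_final_candidate([method_a, method_b])
```

Whether the searches run sequentially or in parallel does not affect this step, since only the per-method score tables are combined.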