JPH0950290A

JPH0950290A - Voice recognition device and communication device using it

Info

Publication number: JPH0950290A
Application number: JP7221140A
Authority: JP
Inventors: Masami Suzuki; 雅実鈴木; Naoki Inoue; 直己井ノ上; Fumihiro Tanido; 文廣谷戸
Original assignee: Kokusai Denshin Denwa KK
Current assignee: KDDI Corp
Priority date: 1995-08-07
Filing date: 1995-08-07
Publication date: 1997-02-18

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device that is extensible and improves the recognition rate without an increase in processing amount, and to provide a communication device using it. SOLUTION: The speech recognition device includes utterance state detection means 7, LR table means 4 constituting a plurality of speech recognition grammars corresponding to the respective utterance states, and speech recognition means 5, 6 that recognize the next utterance using the LR table means 4. Because recognition uses the speech recognition grammar corresponding to the utterance state predicted to come next, the processing amount is smaller than when a general grammar is used, and because unnecessary grammar is not included, the recognition rate improves. Further, depending on how the utterance states are defined, the scope of application can be extended to dialogue for any task.

Description

Detailed Description of the Invention

[0001]

[Technical field of the invention] The present invention relates to a speech recognition device and a communication device using the device, and more particularly to a speech recognition device capable of improving speech recognition accuracy and a communication device using the device.

[0002]

[Prior art] In the field of speech recognition, the HMM-LR method has been proposed, as disclosed for example in Japanese Unexamined Patent Publication No. H2-113297 or No. H4-182000. The HMM-LR method predicts the phonemes in the input speech data by a parsing method called the LR (Left to Right) method and examines the likelihood of each predicted phoneme by a phoneme recognition method called the HMM (Hidden Markov Model) method, so that speech recognition and language processing proceed simultaneously.

[0003]

[Problem to be solved by the invention] When the conventional HMM-LR method described above is used, recognizing general dialogue requires generating an LR table from a large general grammar, so the processing time for recognition increases and the recognition rate falls. An object of the present invention is to solve these problems of the prior art and to provide a speech recognition device that is extensible and improves the recognition rate without an increase in processing amount, and a communication device using the device.

[0004]

[Means for solving the problem] The present invention is a speech recognition device characterized by including: utterance state detection means for detecting which of a set of predefined utterance states a recognized dialogue belongs to; a plurality of speech recognition grammars, each corresponding to one utterance state and generated from the recognition rules that can be used in the next utterance; and speech recognition means for recognizing the next utterance using the speech recognition grammar that corresponds to the utterance state output by the utterance state detection means. A communication device using this device is also a feature of the invention.

[0005] With this configuration, the invention performs phoneme recognition using the speech recognition grammar that corresponds to the utterance state predicted to come next. The processing amount is therefore smaller than when a general grammar is used, and the recognition rate improves because unnecessary grammar rules are not included. Depending on how the utterance states are defined, the scope of application can also be extended to dialogue for any task. Furthermore, because a communication device using the device has a small processing load, dialogue closer to real time becomes possible and communication costs fall.

[0006]

[Embodiments of the invention] The present invention is described in detail below with reference to the drawings. FIG. 1 is a block diagram showing the functions of a speech recognition device to which the present invention is applied. Reference numeral 1 denotes grammar rule data for speech recognition, and 2 denotes utterance state transition matrix data; both are described later. Based on the recognition grammar 1 and the utterance state transition matrix 2, the LR table compiler 3 extracts the speech recognition rules corresponding to each utterance state and generates an LR table 4 for each. These processes are performed in advance, before speech recognition.

[0007] The speech recognition unit 6 predicts the phonemes in the input speech data by the parsing method called the LR method and examines the likelihood of each predicted phoneme by the phoneme recognition method called the HMM method, so that speech recognition and language processing proceed simultaneously; it performs speech recognition using the LR table selected by the table selection unit 5. Specifically, it adopts a frame-synchronous HMM method, which differs from the conventional phoneme-synchronous method. In the frame-synchronous method, the utterance data is divided into frames at predetermined time intervals, and the phoneme of each frame is predicted and matched. The utterance state detection unit 7 determines the utterance state from the output of the speech recognition unit 6, referring to the utterance state information attached to the recognition grammar rules applied in recognition, to utterance state information about the conversation partner input from outside, and so on. The table selection unit 5 selects the corresponding LR table based on the utterance state information output by the utterance state detection unit 7.
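As a rough illustration of the frame division mentioned in this paragraph, the following Python sketch cuts a sample sequence into fixed-length frames. The function name `split_frames`, the use of non-overlapping frames and the frame length are assumptions made for this example; the source does not specify them.

```python
def split_frames(samples, frame_len):
    """Cut a sequence of samples into consecutive frames of frame_len samples.

    Assumption: frames do not overlap, and a shorter final frame is kept.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]


# Seven samples with a frame length of three give two full frames and one remainder.
frames = split_frames([0, 1, 2, 3, 4, 5, 6], 3)
```

In a frame-synchronous recognizer each of these frames would then be matched against the phonemes that the grammar predicts; that matching step is not shown here.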

[0008] FIG. 3 is an explanatory diagram showing an example of the contents of the utterance state transition matrix of FIG. 1. The following description takes hotel room reservation as an example of a specific task. For the utterance states in such a task, speaker information is useful first of all; for example, the hotel side is denoted "H" and the user side "C". Next, the following utterance types are conceivable: "GO" (greeting at the start of a dialogue), "GC" (greeting at the end of a dialogue), "QP" (yes/no question), "QW" (question using an interrogative such as What or When), "RP" (affirmative response), "RN" (negative response), "CF" (confirmation), and "EX" (expression of thanks). Utterance types depend little on the task and are general-purpose. In this embodiment, an utterance state is represented by combining the speaker and the utterance type.

[0009] For example, if the current utterance state is "H/GO", that is, the hotel side's greeting at the start of the dialogue, the first row of the matrix in FIG. 3 shows that the utterance states to which a transition may occur next are those marked 1 in the matrix, namely "C/GO" (the user's greeting at the start of the dialogue) and "C/QP" (the user's yes/no question). Therefore, when the current utterance state is "H/GO", the next speech recognition need only use an LR table generated solely from the grammar rules that may be used in the "C/GO" and "C/QP" states.
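The lookup described in this paragraph can be sketched in Python as follows. Only the "H/GO" row (transitions to "C/GO" and "C/QP") comes from the description; the remaining rows, the reduced state list and the function name `next_states` are hypothetical and serve only to make the example runnable.

```python
# Utterance states written as "speaker/type", as in the description.
STATES = ["H/GO", "C/GO", "C/QP", "H/RP"]

# 0/1 transition matrix: row = current state, column = candidate next state.
# Only the first row is taken from the text; the others are invented placeholders.
MATRIX = [
    [0, 1, 1, 0],  # H/GO -> C/GO, C/QP (from the description)
    [0, 0, 0, 0],  # C/GO (placeholder)
    [0, 0, 0, 1],  # C/QP -> H/RP (hypothetical)
    [0, 0, 0, 0],  # H/RP (placeholder)
]


def next_states(current):
    """Return the states marked 1 in the row of the current state."""
    row = MATRIX[STATES.index(current)]
    return [state for state, flag in zip(STATES, row) if flag == 1]


allowed = next_states("H/GO")
```

With the "H/GO" row above, `allowed` contains only "C/GO" and "C/QP", which are the two states whose grammar rules go into the LR table used for the next utterance.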

[0010] FIG. 4 is an explanatory diagram showing an example of the contents of the recognition grammar data 1 of FIG. 1. The recognition grammar is written in the form of a context-free grammar. Each line represents one grammar rule, and each rule is given a header. The header consists of four characters: the first character represents speaker information, and the second and third characters represent utterance type information. The character "-" means unrestricted, that is, the rule applies to any speaker or any utterance type. Because headers are attached to the grammar rules, the recognition grammar 1 can be managed as a single database. For example, when a general-purpose grammar rule is added, giving it an unrestricted header causes it to be reflected in the grammar of every utterance state, which makes management and updating more efficient than maintaining a separate grammar for each utterance state.
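A minimal Python sketch of header matching under the scheme described in this paragraph is given below. The function name `header_matches` and the example headers are hypothetical. Treating "--" as the wildcard for the two-character type field and ignoring the fourth character are assumptions, since the source does not say what the fourth character holds.

```python
def header_matches(header, speaker, utt_type):
    """Check whether a four-character rule header applies to a speaker and utterance type.

    Character 1 is the speaker, characters 2-3 the utterance type, "-" means unrestricted.
    Assumption: the fourth character is ignored because the source does not define it.
    """
    if len(header) != 4:
        raise ValueError("header must have exactly four characters")
    speaker_ok = header[0] in ("-", speaker)
    type_ok = header[1:3] in ("--", utt_type)
    return speaker_ok and type_ok


# A fully unrestricted header applies to every utterance state.
applies_everywhere = all(
    header_matches("----", spk, typ) for spk in "HC" for typ in ("GO", "QP", "RP")
)
```

This is why a single rule with an unrestricted header is picked up by every per-state grammar without being copied into each one.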

[0011] FIG. 5(a) is a flowchart showing the processing in the table selection unit 5, the speech recognition unit 6, and the utterance state detection unit 7 of FIG. 1. In step S1, the initial value of the utterance state at the start of the dialogue is set in the utterance state detection unit 7. In step S2, the table selection unit 5 selects the corresponding LR table based on the utterance state information input from the utterance state detection unit. In step S3, the speech recognition unit 6 receives the preprocessed speech data, and in step S4 it performs speech recognition using the LR table selected in step S2.

[0012] In step S5, a predetermined number of the top-ranked candidate sentences obtained by the recognition process are displayed on, for example, a display device, and the speaker selects one with a pointing device such as a mouse, which yields a confirmed recognition result. In step S6, whether the dialogue has ended is determined, for example by checking whether the speaker has performed a dialogue-ending operation; if the result is negative, processing proceeds to step S7.

[0013] In step S7, the utterance state is detected based on the recognition result. For example, once the recognition result is confirmed, the recognition grammar rules that were applied to obtain it are known. The recognition grammar includes top-level rules concerning sentence structure, and each of these rules has utterance state information attached as a header. The utterance state can therefore be detected from the header attached to the sentence-structure rule used during recognition. The utterance state can also be detected from the header of a specific word or phrase contained in the sentence, and both kinds of data may be consulted.

[0014] Step S8 applies where, as in the automatic translation telephone system described later, this speech recognition system handles speech recognition for only one speaker and, for the other speaker, receives the text data produced by another speech recognition system. In that case, the partner's utterance state is detected from specific words or phrases in the text received from the conversation partner by referring to the recognition dictionary. In step S9, the current utterance state is determined from the utterance state information obtained in steps S7 and S8, and processing returns to step S2 to recognize the next utterance.
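The loop of steps S1 to S9 can be sketched in Python as follows. This is a toy under several assumptions: the "audio" is already text, the recognizer is a dictionary lookup rather than HMM-LR, the first candidate stands in for the speaker's selection in step S5, one recognizer handles both speakers so step S8 is omitted, and the end of the input list stands in for the dialogue-end check of step S6. The names `TABLES`, `recognize` and `run_dialogue` and all table contents are hypothetical.

```python
# One toy "LR table" per current utterance state. Each entry maps an input to
# (recognized text, utterance state taken from the header of the applied rule).
TABLES = {
    "H/GO": {
        "hello": ("hello", "C/GO"),
        "is a room free": ("is a room free", "C/QP"),
    },
    "C/QP": {"yes": ("yes", "H/RP")},
}


def recognize(audio, table):
    """Toy stand-in for step S4: return candidates the selected table allows."""
    return [table[audio]] if audio in table else []


def run_dialogue(inputs, tables, initial_state):
    state = initial_state                      # S1: set the initial utterance state
    transcript = []
    for audio in inputs:                       # S3: next input; loop end acts as S6
        table = tables.get(state, {})          # S2: select the table for the state
        candidates = recognize(audio, table)   # S4: recognize with that table only
        if not candidates:
            continue                           # nothing recognized: keep the state
        text, detected = candidates[0]         # S5: stand-in for speaker selection
        transcript.append(text)
        state = detected                       # S7/S9: update the utterance state
    return transcript, state


result = run_dialogue(["is a room free", "yes"], TABLES, "H/GO")
```

Note that "yes" is only recognizable after the state has moved to "C/QP"; with the "H/GO" table selected it is simply not a candidate, which is the narrowing effect the description relies on.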

[0015] FIG. 5(b) is a flowchart showing the LR table generation process. Steps S20 to S22 are performed by a person. In step S20, recognition grammar data is created from general grammar rules used in any task and from grammar rules, such as words and phrases, used in a specific task. The grammar rules exist at several levels, from rules on sentence structure down to rules on the phoneme sequence of each word (pronunciation information for each word). In step S21, header information is attached to each rule of the grammar as shown in FIG. 4. In step S22, the utterance states to be used are decided, and the transition matrix between the states, as shown in FIG. 3, is determined.

[0016] In step S23, the grammar subset corresponding to each utterance state is extracted on the basis of the transition matrix by referring to the header information attached to the grammar rules. For example, if the current utterance state is "H/GO", the transition matrix shows that the next utterance state is either "C/GO" or "C/QP", so the header information is consulted and only the grammar rules that may be used in those utterance states are extracted.

[0017] In step S24, an LR table corresponding to each utterance state is generated from each extracted subset. An LR table is, for example, data with a multistage tree structure in which every phoneme that may follow a given phoneme is connected to it. Because an LR table generated in this way consists only of the grammar needed in the corresponding utterance state, it contains less data than a general-purpose LR table, so the recognition processing time is shortened; and because it contains no unnecessary data, the risk of misrecognition is reduced and the recognition rate improves.
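Steps S23 and S24 can be sketched together in Python. The sketch follows the description of the table as a tree of possible phoneme continuations and builds a plain prefix tree; it is not an LR parser generator. The rules, the romanized pronunciations, the treatment of each letter as one phoneme, and the names `applies`, `extract_subset`, `build_tree` and `next_phonemes` are all assumptions for illustration. Only the "H/GO" transition comes from the text.

```python
# (header, phoneme sequence) pairs; every entry is invented for illustration.
RULES = [
    ("CGO-", "moshimoshi"),
    ("CQP-", "arimasuka"),
    ("HRP-", "hai"),
    ("----", "eeto"),       # unrestricted header: used in every state
]

# Next states per current state; this single row is the example from the description.
TRANSITIONS = {"H/GO": ["C/GO", "C/QP"]}


def applies(header, state):
    """True if the header covers the given "speaker/type" state ("-" is a wildcard)."""
    speaker, utt_type = state.split("/")
    return header[0] in ("-", speaker) and header[1:3] in ("--", utt_type)


def extract_subset(current_state):
    """S23: keep only rules usable in some state that may follow current_state."""
    following = TRANSITIONS[current_state]
    return [seq for hdr, seq in RULES if any(applies(hdr, s) for s in following)]


def build_tree(sequences):
    """S24 stand-in: prefix tree where each node lists the phonemes that may follow."""
    root = {}
    for seq in sequences:
        node = root
        for phoneme in seq:
            node = node.setdefault(phoneme, {})
        node["$"] = True    # end-of-sequence marker
    return root


def next_phonemes(tree, prefix):
    """Phonemes the tree allows after the given prefix."""
    node = tree
    for phoneme in prefix:
        node = node[phoneme]
    return sorted(key for key in node if key != "$")


subset = extract_subset("H/GO")
tree = build_tree(subset)
```

Because the hotel-side rule "hai" is left out of the subset, the tree for state "H/GO" never predicts "h" as a first phoneme, so the recognizer has fewer hypotheses to score.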

[0018] Next, an automatic translation telephone device that uses the speech recognition device of the first embodiment is described. FIG. 2 is a block diagram of an automatic translation telephone device to which the present invention is applied. The automatic translation telephone devices 10 and 12, each implemented for example on a workstation, are connected via an arbitrary communication network 11 such as a telephone network, a data network, a leased line, or a LAN. The speech signal input from the microphone 20 undergoes acoustic analysis (VQ) in the preprocessing unit 21 and is input to the speech recognition unit 22, which has the same function as the speech recognition device of FIG. 1 (the first embodiment). The speech recognition unit 22 performs speech recognition by the same processing as in the first embodiment and, where necessary, has the speaker select the correct result from several candidates using a display device and an input device such as a mouse (not shown).

[0019] The speech recognition unit 22 outputs to the translation unit 23 the text of the recognition result together with the information on sentence structure and word boundaries found during recognition. Based on this input, the translation unit 23 translates the text into another language while referring to a translation dictionary, and outputs the translated text. The communication management unit 24 provides the interface to the communication network 11 and the partner terminal 12 and transfers the text data. Any communication network can be used as long as it allows text data to be exchanged, and communication is possible at a data rate far lower than that needed for speech data.

[0020] Translated text data transferred from the partner terminal is received by the communication management unit 24 and output to the speech synthesis unit 26 and the utterance state detection unit 25. The speech synthesis unit 26 converts the text into a speech signal, which is output from the speaker 27. The utterance state detection unit 25 extracts words from the text, matches them against the grammar rules (word dictionary) in the recognition grammar, and detects the partner's utterance state from the header information of the matching rules. This information is input to the utterance state detection unit 7 in the speech recognition unit 22, where the utterance state is determined. Alternatively, the utterance state information detected in the speech recognition unit 22 may be transmitted to the partner terminal together with the text, and the utterance state detection unit 25 may extract it. With this configuration, a speaker can converse with a partner in another country in nearly real time.
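The word-dictionary lookup performed by the utterance state detection unit 25 might look like the Python sketch below. The dictionary contents, the whitespace tokenization, the rule of returning the first matching word, and the name `detect_partner_state` are assumptions; the source only says that words are extracted from the text and matched against headers in the recognition grammar.

```python
# Hypothetical word dictionary: word -> four-character header ("-" = unrestricted).
WORD_HEADERS = {
    "hello": "-GO-",
    "goodbye": "-GC-",
    "thank": "-EX-",
}


def detect_partner_state(text, partner_speaker):
    """Return "speaker/type" from the first word whose header names an utterance type.

    Assumption: the partner's speaker code ("H" or "C") is known to the receiving side.
    """
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        header = WORD_HEADERS.get(word)
        if header is not None and header[1:3] != "--":
            return f"{partner_speaker}/{header[1:3]}"
    return None  # no cue word found; other information would have to decide


state = detect_partner_state("Hello, this is the hotel.", "H")
```

When no cue word is found the function returns `None`, leaving the decision to the other sources of state information mentioned in the description.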

[0021] Embodiments have been disclosed above, but modifications such as the following are also conceivable. An example in which the LR tables are prepared in advance has been disclosed, but the LR table corresponding to the utterance state may instead be generated with the LR table compiler each time the utterance state changes. An example in which the entries of the utterance state transition matrix are 0 or 1 has been disclosed, but the transition probability may instead be expressed by a weighting coefficient taking three or more values, and this coefficient may be used in calculating the likelihood during phoneme determination.
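One way such weighting coefficients could enter the likelihood calculation is sketched below in Python. The source does not say how the coefficient is combined with the likelihood; adding the logarithm of the weight to an acoustic log-likelihood is an assumption of this example, as are the weight values and the name `rescore`.

```python
import math

# Hypothetical weighted transitions replacing the 0/1 matrix entries.
WEIGHTS = {"H/GO": {"C/GO": 0.7, "C/QP": 0.3}}


def rescore(current_state, candidates):
    """Rank candidates by acoustic log-likelihood plus log transition weight.

    candidates: list of (text, next_state, acoustic_log_likelihood).
    Candidates whose next state has zero or missing weight are dropped.
    """
    scored = []
    for text, next_state, log_likelihood in candidates:
        weight = WEIGHTS.get(current_state, {}).get(next_state, 0.0)
        if weight > 0.0:
            scored.append((log_likelihood + math.log(weight), text))
    return [text for _, text in sorted(scored, reverse=True)]


ranking = rescore(
    "H/GO",
    [("a", "C/GO", -10.0), ("b", "C/QP", -9.5), ("c", "H/RP", -1.0)],
)
```

Here candidate "b" has the better acoustic score of the two permitted candidates, but the larger weight on "C/GO" puts "a" first, while "c" is excluded because its state cannot follow "H/GO".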

[0022] The HMM method has been disclosed as an example of the phoneme matching method, but any phoneme matching method can be used, not only that of the embodiment. Likewise, the LR method has been disclosed as an example of the language analysis technique, but any language analysis technique can be used as long as it draws on a knowledge source of linguistic constraints such as a speech recognition grammar. For the utterance state, an example using the combination of speaker and utterance type has been disclosed, but other information, such as the kind of topic or its attributes, may also be adopted or combined.

[0023] An automatic translation telephone device has been disclosed as an example of a communication device using the speech recognition device, but the same configuration can also be applied to a videophone device. The translation means may also be omitted to give a device that conducts dialogue by text data, which makes dialogue possible over a very narrow-band transmission channel. The device may further be one that displays telephone speech as text for hearing-impaired users. In the automatic translation telephone device, the grammar (dictionary) data for speech recognition and for translation can also be shared. For example, for each Japanese headword, its reading, syntactic and semantic attributes, foreign-language equivalents and so on may be recorded in table form, and the reading and syntactic information in the table may be used to generate the vocabulary-level grammar rules for speech recognition. In this way, the dictionary can be created and updated efficiently.
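The shared dictionary idea in this paragraph can be illustrated with a short Python sketch. The table columns follow the paragraph (headword, reading, syntactic attribute, foreign-language equivalent), but the entries, the rule notation "category -> phonemes", the treatment of each romanized letter as one phoneme, and the names `lexical_rules` and `translate_word` are assumptions.

```python
# One shared table serves both recognition and translation (entries are illustrative).
LEXICON = [
    {"headword": "部屋", "reading": "heya", "category": "noun", "translation": "room"},
    {"headword": "予約", "reading": "yoyaku", "category": "noun", "translation": "reservation"},
]


def lexical_rules(lexicon):
    """Derive vocabulary-level recognition rules from reading and syntactic category."""
    return [f'{entry["category"]} -> {" ".join(entry["reading"])}' for entry in lexicon]


def translate_word(lexicon, headword):
    """Look up the foreign-language equivalent in the same table."""
    for entry in lexicon:
        if entry["headword"] == headword:
            return entry["translation"]
    return None


rules = lexical_rules(LEXICON)
```

Adding a row to `LEXICON` then updates both the recognition rules and the translation lookup in one place, which is the maintenance benefit the paragraph describes.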

[0024]

[Effects of the invention] As described above, the speech recognition device according to the present invention includes utterance state detection means, a plurality of speech recognition grammars corresponding to the respective utterance states, and speech recognition means that recognizes the next utterance using those grammars. Because phoneme recognition uses the speech recognition grammar corresponding to the utterance state predicted to come next, the processing amount is smaller than when a general grammar is used, and because unnecessary grammar is not included, the recognition rate improves. Depending on how the utterance states are defined, the scope of application can also be extended to dialogue for any task. Furthermore, because a communication device such as an automatic translation telephone device using the device has a small processing load, dialogue closer to real time becomes possible and communication costs fall.

[Brief description of drawings]

FIG. 1 is a block diagram showing the functions of a speech recognition device to which the present invention is applied.

FIG. 2 is a block diagram of an automatic translation telephone device to which the present invention is applied.

FIG. 3 is an explanatory diagram showing an example of the contents of the utterance state transition matrix 2.

FIG. 4 is an explanatory diagram showing an example of the contents of the recognition grammar data 1.

FIG. 5 is a flowchart showing the speech recognition and LR table generation processes.

[Explanation of symbols]

1: recognition grammar, 2: utterance state transition matrix, 3: LR table compiler, 4: per-state LR tables, 5: table selection unit, 6: speech recognition unit, 7: utterance state detection unit

Claims

[Claims]

1. A speech recognition device comprising: utterance state detection means for detecting which of a set of predefined utterance states a recognized utterance belongs to; a plurality of speech recognition grammars, each corresponding to one utterance state and generated from the set of speech recognition rules that can be used in the next utterance; and speech recognition means for performing speech recognition of the next utterance using the speech recognition grammar corresponding to the utterance state output by the utterance state detection means.

2. The speech recognition device according to claim 1, wherein utterance state information is attached to at least some of the speech recognition rules, and the speech recognition grammar corresponding to each utterance state is created by extracting, based on utterance state transition information, the grammar that can be used in the next utterance from the entire speech recognition grammar.

3. The speech recognition device according to claim 2, wherein the utterance state detection means detects the current utterance state based on the utterance state information attached to the speech recognition grammar applied in the speech recognition.

4. A communication device including the speech recognition device according to claim 1.

5. The communication device according to claim 4, further comprising a translation unit that translates the text information recognized by the speech recognition device into another language, and a speech synthesis unit that converts received text information into a speech signal, wherein the translation unit performs the translation based on the syntax information found at the time of recognition.