JP4951422B2 - Speech recognition apparatus and speech recognition method

Info

Publication number: JP4951422B2
Application number: JP2007164538A
Authority: JP
Inventors: 健大野; 実冨樫; 大介斎藤; 景子桂川; 久高橋; 修山下; 佳幸水野; 健本間; 信夫畑岡; 浩明小窪
Original assignee: Clarion Co Ltd; Nissan Motor Co Ltd
Current assignee: Nissan Motor Co Ltd; Faurecia Clarion Electronics Co Ltd
Priority date: 2007-06-22
Filing date: 2007-06-22
Publication date: 2012-06-13
Anticipated expiration: 2027-06-22
Also published as: JP2009003205A

Abstract

PROBLEM TO BE SOLVED: To recognize a user's speech even when the user utters a prescribed command with altered wording.

SOLUTION: A CPU 1034a stores, as recognition target words used in speech recognition, vocabulary represented by a plurality of language models that differ in the level at which they constrain the user's utterance content. The CPU receives the user's speech, computes the degree of matching between the input speech and the stored recognition target words, extracts recognition result candidates from the recognition target words based on the computed degrees of matching, and specifies a recognition result from among the candidates based on at least one of the degree of matching of each extracted candidate and the constraint level of the language model that contains the candidate.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a speech recognition apparatus and a speech recognition method for recognizing speech.

The following speech recognition apparatus is known. It includes a speech recognition engine and a prescribed-command dictionary, and the engine outputs a recognition result by matching the user's utterance against the prescribed-command dictionary (see, for example, Patent Document 1).

Japanese Patent Laid-Open No. 06-095687

However, with the conventional speech recognition apparatus, misrecognition may occur when the user utters a prescribed command with altered wording.

The present invention is a speech recognition apparatus, or method, that receives a user's speech and recognizes it. A vocabulary that includes object words for a voice-operated device and operation words for operating the device is stored as standby vocabulary represented by a plurality of language models that differ in the level at which they constrain the user's utterance content. The degree of matching between the standby vocabulary stored in the storage means and the speech received by the speech input means is computed for each of the language models. Based on the computed degrees of matching, a plurality of standby vocabulary entries are selected from the language models in descending order of matching degree and extracted as recognition result candidates. The difference is then computed between the highest matching degree among the extracted candidates that belong to the language model with the highest constraint level (the first-level language model) and the highest matching degree among all extracted candidates. If the difference is smaller than a predetermined value, the candidate of the first-level language model is specified as the recognition result. Otherwise, the difference is computed between the highest matching degree among the extracted candidates that belong to the language model with the next-highest constraint level (the second-level language model) and the highest matching degree among all extracted candidates, and if that difference is smaller than the predetermined value, the candidate of the second-level language model is specified as the recognition result.

According to the present invention, even when the recognition result candidates include standby vocabulary from a language model with a higher constraint level, that vocabulary is not adopted preferentially if its degree of matching is low, so misrecognition can be prevented.

FIG. 1 is a block diagram showing the configuration of one embodiment of the speech recognition apparatus. The speech recognition apparatus 100 includes a microphone 101, a speaker 102, a signal processing unit 103, an input device 104, and a display 105.

The signal processing unit 103 includes an A/D converter 1031, a D/A converter 1032, an output amplifier 1033, a signal processing device 1034, and an external storage device 1035. The signal processing device 1034 consists of a CPU 1034a, a memory 1034b, and other peripheral circuits. The input device 104 includes an utterance switch 104a and a correction switch 104b.

In the speech recognition apparatus 100, the user instructs the start of speech recognition by pressing the utterance switch 104a. Once the start of speech recognition has been instructed, the user's speech is input to the signal processing unit 103 through the microphone 101. The audio signal input to the signal processing unit 103 (the input audio signal) is converted to a digital signal by the A/D converter 1031 and then input to the signal processing device 1034.

In the signal processing device 1034, the CPU 1034a executes the processing described later with reference to FIG. 2 to recognize the user's speech, and generates a response sentence for the user based on the recognition result. The generated response sentence is converted to an analog signal by the D/A converter 1032, amplified by the output amplifier 1033, and output through the speaker 102. If the user judges from the response sentence that the recognition result is wrong, the user can press the correction switch 104b to instruct a correction. The user can also interrupt speech recognition partway through by holding down the correction switch 104b for a certain time (a long press).

FIG. 2 is a flowchart showing the processing of the speech recognition apparatus 100 in this embodiment. The processing shown in FIG. 2 is executed by the CPU 1034a as a program that starts when the user presses the utterance switch 104a.

In step S10, the CPU 1034a reads the recognition target vocabulary used for speech recognition from the external storage device 1035 into the memory 1034b and performs the standby setup for speech recognition. The recognition target vocabulary read here is represented by a plurality of language models that differ in the level at which they constrain the user's utterance content. Specific examples of the language models are described below with reference to FIGS. 3 to 8.

First, the CPU 1034a reads from the external storage device 1035 the recognition target vocabulary represented by the high-constraint language model. The high-constraint language model is a language model for prescribed commands that does not allow an arbitrary phoneme string to be inserted into the word string making up a recognition target vocabulary entry. As shown in FIG. 3, for example, it has a hierarchical structure consisting of a first hierarchy A, a second hierarchy B, and a third hierarchy C. FIG. 3 shows a language model for awaiting commands that operate a vehicle-mounted navigation device by voice, and the following description covers the speech recognition processing performed when the user operates the navigation device by voice.

In FIG. 3, the first hierarchy A holds commands for operating the navigation device, such as “destination setting” and “route setting”. The second hierarchy B holds subcommands of the commands in the first hierarchy A; for example, it holds “home setting” and “registered place display” as subcommands of “destination setting”. The third hierarchy C holds subcommands of the second hierarchy B; for example, it holds the names of specific registered places, such as “Mr. ○○” and “○○ Company”, as subcommands of “registered place display”.

Here, the CPU 1034a extracts and reads, from the high-constraint language model shown in FIG. 3, all commands in the first hierarchy A, some commands in the second hierarchy B, and some commands in the third hierarchy C. For example, it extracts and reads the commands inside the frame 3a shown in FIG. 3. With this high-constraint language model, vocabulary such as that shown in FIG. 4 can be awaited as recognition target vocabulary. That is, when the user utters “destination setting” to set a destination, the utterance can be recognized through the recognition target vocabulary 4a.

Next, the CPU 1034a reads from the external storage device 1035 the recognition target vocabulary represented by the medium-constraint language model. The medium-constraint language model allows an arbitrary phoneme string to be inserted into the word string making up a recognition target vocabulary entry, while the recognizable word string itself is fixed. For example, as shown in FIG. 5, an object word 5a used in operating the navigation device, such as “destination” (行き先) or “target location” (目的地), is connected through a garbage 5c to an operation word 5b for operating the navigation device, such as “setting” or “decision”. In addition, an operation word 5e is connected in front of the object word 5a through a garbage 5d, so the model also permits the object word and the operation word to be inverted.

A garbage absorbs the parts of an utterance other than keywords such as operation words and object words. Reading the recognition target vocabulary represented by this medium-constraint language model makes it possible to await vocabulary such as that shown in FIG. 6. That is, even when the user inserts a particle between the keywords and utters “set the destination” (行き先を設定), the utterance can be recognized through the recognition target vocabulary 6a.

Finally, the CPU 1034a reads from the external storage device 1035 the recognition target vocabulary represented by the low-constraint language model. The low-constraint language model allows an arbitrary phoneme string to be inserted into the word string making up a recognition target vocabulary entry, and the recognizable word string is not fixed. For example, as shown in FIG. 7, it permits a vocabulary 7a, consisting of object words such as “destination” and operation words such as “setting” and “decision”, to be connected in any order through garbages 7b and 7c.

Reading the recognition target vocabulary represented by this low-constraint language model makes it possible to await vocabulary such as that shown in FIG. 8. That is, even when the user misspeaks while trying to set a destination, for example “destination, destination setting”, the utterance can be recognized through the recognition target vocabulary 8a.
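The three constraint levels described above can be illustrated with a small sketch. It is not the patented implementation: it works on text tokens rather than phoneme strings, the keyword sets and the function name `constraint_level` are hypothetical, and it assumes that garbage may also appear before and after the keywords in the medium-constraint model.

```python
# Toy keyword sets (assumptions); a real recognizer matches phoneme strings.
OBJECTS = {"destination", "route"}
OPERATIONS = {"setting", "search"}
PRESCRIBED_COMMANDS = {("destination", "setting"), ("route", "setting")}


def constraint_level(tokens):
    """Return the most constraining model that accepts the token sequence."""
    tokens = tuple(tokens)
    # High constraint: the utterance is exactly a prescribed command.
    if tokens in PRESCRIBED_COMMANDS:
        return "high"
    # Anything that is not a keyword is treated as garbage and dropped.
    keywords = [t for t in tokens if t in OBJECTS or t in OPERATIONS]
    kinds = ["object" if k in OBJECTS else "operation" for k in keywords]
    # Medium constraint: one object word and one operation word, either order.
    if kinds in (["object", "operation"], ["operation", "object"]):
        return "medium"
    # Low constraint: any mix of keywords and garbage.
    return "low"


print(constraint_level(["destination", "setting"]))                 # high
print(constraint_level(["destination", "um", "search"]))            # medium
print(constraint_level(["destination", "destination", "setting"]))  # low
```

In the apparatus itself all three models are matched in parallel and each produces scored candidates; this sketch only shows which kinds of utterance each level admits.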

Next, the process proceeds to step S20, where the CPU 1034a outputs a menu screen for voice input, such as the one shown in FIG. 9, to the display 105. FIG. 9(a) is an example menu screen that displays commands in the first hierarchy A of the high-constraint language model read in step S10 and prompts the user to speak. FIG. 9(b) is an example menu screen that displays commands in the second hierarchy B, and FIG. 9(c) is an example menu screen that displays commands in the third hierarchy C. The CPU 1034a first displays the menu screen shown in FIG. 9(a) to prompt the user to utter a command in the first hierarchy A.

Displaying such a menu screen and presenting the commands that can be uttered lets the user know what to say. The menu screen shows only some of the commands in the first hierarchy A, but every command shown is a combination of an object word and an operation word. Therefore, even when the user wants to utter a command that is not shown, the menu screen makes it clear that a command consisting of an object word and an operation word should be uttered.

Next, to notify the user that processing has started, the CPU 1034a outputs a voice message stored in the external storage device 1035, such as “Processing has started” or “Please start speaking”. Specifically, the CPU 1034a reads the audio data of the voice message from the external storage device 1035 and outputs it to the D/A converter 1032. The audio data is converted to an analog signal by the D/A converter 1032, amplified by the output amplifier 1033, and output through the speaker 102. The user speaks in response to the voice message.

The CPU 1034a monitors the audio input from the microphone 101 and detects the start of the user's utterance as follows. Until the user presses the utterance switch 104a, the CPU 1034a computes the average power of the digital signal input through the microphone 101 and the A/D converter 1031.

After the user presses the utterance switch 104a, the CPU 1034a detects that the user's utterance has started when the instantaneous power of the digital signal input through the microphone 101 and the A/D converter 1031 exceeds the above average power by a predetermined value or more. On detecting the start of the utterance, the CPU 1034a starts capturing the speech.

The process then proceeds to step S30, where the CPU 1034a computes the degree of matching between the captured speech and the recognition target vocabulary (standby words) read into the memory 1034b in step S10. The degree of matching is an index of how similar a recognition target vocabulary entry is to the captured speech, and in this embodiment it is calculated as a score. The score is a numerical value, and a larger value means the entry and the captured speech are more similar. The capture of the utterance continues while the CPU 1034a computes the degree of matching.

In step S40, when the instantaneous power of the input digital signal remains at or below a predetermined value for a predetermined time or longer, the CPU 1034a determines that the utterance has ended and stops capturing the speech.
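The start and end detection described above can be sketched as follows. This is a minimal illustration that assumes the signal has already been reduced to a list of per-frame power values; the function name `detect_utterance` and all parameter names are hypothetical, and the source gives no concrete threshold values.

```python
def detect_utterance(frame_powers, noise_avg, start_margin, end_level, end_frames):
    """Return (start, end) frame indices of the utterance; end is exclusive.

    noise_avg    -- average power measured before the switch was pressed
    start_margin -- how far above noise_avg the power must rise to start
    end_level    -- power at or below this counts as silence
    end_frames   -- number of consecutive silent frames that ends the utterance
    """
    start = None
    quiet = 0
    for i, power in enumerate(frame_powers):
        if start is None:
            # Start: instantaneous power exceeds the average by the margin.
            if power - noise_avg >= start_margin:
                start = i
        elif power <= end_level:
            quiet += 1
            if quiet >= end_frames:
                # End: silence has lasted long enough; report its first frame.
                return start, i - end_frames + 1
        else:
            quiet = 0
    return start, None  # no start found, or the utterance has not ended yet


print(detect_utterance([1, 1, 9, 8, 7, 1, 1, 1, 9], 1, 5, 2, 3))  # (2, 5)
```

Measuring the noise average before the switch is pressed lets the start threshold follow the ambient noise, which in a vehicle can vary considerably.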

The process then proceeds to step S50. When the matching-degree computation started in step S30 finishes, the CPU 1034a outputs the N recognition target vocabulary entries with the highest degrees of matching, in descending order, as the recognition result N-best. FIG. 10 shows the recognition result N-best when the user utters “destination setting”, a prescribed command for operating the navigation device. FIG. 10 shows a specific example with N = 5, that is, the five entries with the highest degrees of matching are output as the recognition result N-best.
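Extracting the N-best list amounts to taking the N highest-scoring entries in descending order. A minimal sketch using Python's standard `heapq.nlargest` follows; the function name `n_best` is hypothetical, and the scores other than 0.25 and 0.18 are invented for illustration because the text does not give them.

```python
import heapq


def n_best(scores, n=5):
    """Return the n (entry, score) pairs with the highest scores, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


# Hypothetical scores for six standby entries (only 0.25 and 0.18 appear in the text).
scores = {
    "(garbage) (garbage)": 0.25,
    "station (garbage) search": 0.22,
    "map show": 0.20,
    "destination setting": 0.18,
    "(garbage) setting": 0.15,
    "telephone": 0.05,
}
for rank, (entry, score) in enumerate(n_best(scores), start=1):
    print(rank, entry, score)
```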

In the example shown in FIG. 10, the recognition target vocabulary entry “destination setting”, which matches what the user actually said, is ranked only fourth by degree of matching. If the top-ranked entry were adopted as the final recognition result, as in conventional speech recognition methods, a misrecognition would occur. This embodiment therefore prevents misrecognition as follows.

From the recognition result N-best, the CPU 1034a selects the recognition target vocabulary entry output from the language model with the highest constraint level. In the example shown in FIG. 10, the first-ranked result “(garbage)·(garbage)” was output from the low-constraint language model. The second-ranked result “station·(garbage)·search” and the third-ranked result “map·show” were output from the medium-constraint language model. The fourth-ranked result “destination setting” was output from the high-constraint language model. The fifth-ranked result “(garbage)·setting” was output from the low-constraint language model.

Therefore, in the example shown in FIG. 10, the CPU 1034a selects the fourth-ranked result “destination setting”, which comes from the language model with the highest constraint level, and determines whether to adopt it preferentially as the final recognition result. In this embodiment, the CPU 1034a adopts the recognition result under evaluation in preference to the higher-ranked results when (A) its rank is higher than a predetermined rank Nth and (B) the score difference between the first-ranked recognition result and the result under evaluation is smaller than a predetermined value Lth. The thresholds Nth and Lth are determined experimentally; here, Nth = 5 and Lth = 0.10.

Here, the recognition result under evaluation is ranked fourth, which is higher than the threshold rank Nth, so condition (A) is satisfied. The difference between the score of the first-ranked result (0.25) and the score of the fourth-ranked result (0.18) is 0.07, which is smaller than the threshold Lth, so condition (B) is also satisfied. In the example shown in FIG. 10, the CPU 1034a therefore determines that the fourth-ranked result is to be adopted in preference to the higher-ranked results and takes “destination setting” as the final recognition result. The CPU 1034a then performs speech synthesis to convert the recognition result “destination setting” into an audio signal and outputs it from the speaker 102 via the D/A converter 1032 and the output amplifier 1033.

As another example, suppose the user utters “destination, um, search” (行き先をえーと探す), which differs from the prescribed commands for operating the navigation device, and the recognition result N-best is output as shown in FIG. 11. In this case too, the CPU 1034a selects from the N-best the entry output from the language model with the highest constraint level, namely the fourth-ranked “route setting”. It then determines whether this fourth-ranked result satisfies conditions (A) and (B) above, and thus whether to adopt it preferentially as the final recognition result.

In this case, the result is ranked fourth, which is higher than the threshold rank Nth, so condition (A) is satisfied. However, the difference between the score of the first-ranked result (0.25) and the score of the fourth-ranked result (0.02) is 0.23, which is larger than the threshold Lth, so condition (B) is not satisfied. The CPU 1034a therefore does not adopt this fourth-ranked result preferentially.

Next, the CPU 1034a selects from the recognition result N-best the entry output from the language model with the second-highest constraint level. In the example shown in FIG. 11, it selects the second-ranked result “destination·(garbage)·search”, which was output from the medium-constraint language model. It then determines whether this second-ranked result satisfies conditions (A) and (B) above, and thus whether to adopt it preferentially as the final recognition result.

In this case, the result is ranked second, which is higher than the threshold rank Nth, so condition (A) is satisfied. The difference between the score of the first-ranked result (0.25) and the score of the second-ranked result (0.22) is 0.03, which is smaller than the threshold Lth, so condition (B) is also satisfied. The CPU 1034a therefore adopts this second-ranked result preferentially.

Thus, in the example shown in FIG. 11, the CPU 1034a determines that the second-ranked result is to be adopted in preference to the higher-ranked result and takes “destination·(garbage)·search” as the final recognition result. Because this result must be converted into a prescribed command for the navigation device, the CPU 1034a converts “destination·(garbage)·search” into the corresponding prescribed command “destination setting”. The CPU 1034a then performs speech synthesis to convert the recognition result “destination setting” into an audio signal and outputs it from the speaker 102 via the D/A converter 1032 and the output amplifier 1033.
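The selection procedure walked through for FIG. 10 and FIG. 11 can be sketched as below. This is an illustrative reading, not the patent's code. It assumes “rank higher than Nth” means a strictly better rank (rank < Nth), and it falls back to the top-ranked candidate if no level qualifies. Only the scores 0.25, 0.18, 0.22, and 0.02 come from the text. The remaining scores, the placeholder entries, and all identifiers are hypothetical.

```python
from dataclasses import dataclass

HIGH, MEDIUM, LOW = 3, 2, 1  # larger value = more constraining model


@dataclass
class Candidate:
    text: str
    score: float
    level: int


def select_result(n_best, nth=5, lth=0.10):
    """Prefer the most constrained candidate whose score is close to the best."""
    ranked = sorted(n_best, key=lambda c: c.score, reverse=True)
    top = ranked[0]
    for level in sorted({c.level for c in ranked}, reverse=True):
        # Best-ranked candidate of this constraint level and its 1-based rank.
        rank, cand = next((i, c) for i, c in enumerate(ranked, start=1)
                          if c.level == level)
        if rank < nth and top.score - cand.score < lth:  # conditions (A), (B)
            return cand
    return top  # assumed fallback; the source does not describe this case


FIG10 = [
    Candidate("(garbage) (garbage)", 0.25, LOW),
    Candidate("station (garbage) search", 0.22, MEDIUM),  # score assumed
    Candidate("map show", 0.20, MEDIUM),                  # score assumed
    Candidate("destination setting", 0.18, HIGH),
    Candidate("(garbage) setting", 0.15, LOW),            # score assumed
]
FIG11 = [
    Candidate("placeholder rank 1", 0.25, LOW),           # entry assumed
    Candidate("destination (garbage) search", 0.22, MEDIUM),
    Candidate("placeholder rank 3", 0.10, LOW),           # entry assumed
    Candidate("route setting", 0.02, HIGH),
    Candidate("placeholder rank 5", 0.01, LOW),           # entry assumed
]

print(select_result(FIG10).text)  # destination setting
print(select_result(FIG11).text)  # destination (garbage) search
```

In the FIG. 11 case the high-constraint candidate is rejected because its score gap of 0.23 exceeds Lth, so the loop moves on to the medium-constraint candidate, whose gap is 0.03.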

In this case, another conceivable approach is to output by voice “search for a destination”, which is closer to what the user said, based on the recognition result “destination·(garbage)·search”. In this embodiment, however, the prescribed command “destination setting” obtained by the conversion is output by voice in order to encourage the user to learn the prescribed commands.
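The conversion from a recognized keyword sequence to a prescribed command could be implemented as a simple lookup once the garbage slots are dropped. The sketch below is an assumption about one way to do it. The text gives only the mapping from “destination·(garbage)·search” to “destination setting”. The second table entry and all identifiers are hypothetical.

```python
GARBAGE = "(garbage)"

# Keyword sequences mapped to prescribed commands.
KEYWORDS_TO_COMMAND = {
    ("destination", "search"): "destination setting",   # example from the text
    ("destination", "setting"): "destination setting",  # hypothetical entry
}


def to_prescribed_command(recognized):
    """Drop garbage slots and look up the prescribed command, or None."""
    keywords = tuple(word for word in recognized if word != GARBAGE)
    return KEYWORDS_TO_COMMAND.get(keywords)


print(to_prescribed_command(["destination", GARBAGE, "search"]))  # destination setting
```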

In step S60, the CPU 1034a determines, based on the output from the input device 104, whether the user has operated the correction switch 104b. For example, if the user uttered “destination setting” but a different recognition target vocabulary entry, such as “telephone”, is output by voice as the recognition result, the user judges that a misrecognition has occurred and presses the correction switch 104b. After outputting the recognition result by voice, the CPU 1034a accepts operation of the correction switch 104b for a predetermined time.

If the CPU 1034a determines that the user operated the correction switch 104b within the predetermined time, it cancels the recognition result, returns to step S10, and accepts a new utterance from the user. If the CPU 1034a determines that the correction switch 104b was not operated within the predetermined time, it regards the user as having accepted the recognition result, confirms the result, and proceeds to step S70.

In step S70, the CPU 1034a determines whether the recognition target vocabulary entry confirmed as the recognition result has a lower hierarchy. If it does, the process returns to step S10 and performs the standby setup for the lower hierarchy. For example, when the confirmed recognition result is “destination setting”, the result is an entry in the first hierarchy A as shown in FIG. 3, so the CPU 1034a determines that the second and third hierarchies exist as lower hierarchies. In this case, the CPU 1034a outputs the voice-input menu screens shown in FIG. 9(b) and FIG. 9(c) to the display 105 to prompt the user to utter a command in the lower hierarchies.

If, on the other hand, the CPU 1034a determines that there is no lower hierarchy, that is, that speech recognition has been completed down to the lowest hierarchy, the process proceeds to step S80. In step S80, the CPU 1034a executes processing based on the user's operation instruction, which has been identified by performing speech recognition down to the lower hierarchy. For example, it performs destination setting or route search on the navigation device.
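
The control flow of steps S60 to S80 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Command` class, the `run_dialog` function, and the two callbacks are hypothetical names introduced here, and the recognition steps that precede S60 are collapsed into a single `recognize` callback.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Command:
    """One recognition target vocabulary item and its lower hierarchy (hypothetical type)."""
    name: str
    children: Dict[str, "Command"] = field(default_factory=dict)


def run_dialog(
    top_level: Dict[str, Command],
    recognize: Callable[[List[str]], str],
    correction_pressed: Callable[[], bool],
) -> List[str]:
    """Walk the command hierarchy until a command with no lower hierarchy is confirmed."""
    active = top_level          # vocabulary currently awaited
    path: List[str] = []        # confirmed commands, top hierarchy first
    while True:
        # Stand-in for the recognition steps before S60.
        result = recognize(list(active))
        # S60: the user pressed the correction switch within the time window,
        # so the result is cancelled and the same hierarchy is awaited again.
        if correction_pressed():
            continue
        chosen = active[result]
        path.append(chosen.name)
        # S70: descend if a lower hierarchy exists, otherwise finish.
        if chosen.children:
            active = chosen.children
        else:
            # S80: the caller executes the operation identified by `path`.
            return path
```

In this sketch a cancelled result simply repeats the loop for the same hierarchy, which mirrors the return to step S10 described above.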

FIG. 12 shows a specific example of utterances by the user and responses by the speech recognition apparatus 100 when the navigation device is operated using the speech recognition apparatus 100 of the present embodiment. FIG. 12 shows the case in which the user utters prescribed commands to operate the navigation device.

The CPU 1034a outputs “Command, please” from the speaker 102 as system message A, which prompts the user to utter a command, and waits for an utterance from the user. At the same time, the CPU 1034a displays on the display 105 a menu screen showing the commands included in the first hierarchy A of the high-restriction language model, as shown in FIG. 9(a). In response, the user utters the prescribed command “destination setting” as user utterance B in order to set a destination on the navigation device.

The CPU 1034a accepts the user's utterance, executes the speech recognition processing described above, and, as described above with reference to FIG. 10, specifies “destination setting” from the recognition result N-best as the recognition result. The CPU 1034a then outputs “Destination setting command, please” from the speaker 102 as system message C, which prompts the user to utter a command in the lower hierarchy, and waits for an utterance from the user. At the same time, the CPU 1034a displays on the display 105 a menu screen showing the commands included in the second hierarchy B of the high-restriction language model, as shown in FIG. 9(b). In response, the user utters the prescribed command “registered place display” as user utterance D in order to select a destination from among the registered places.

The CPU 1034a accepts the user's utterance, executes the speech recognition processing described above, and specifies “registered place display” from the recognition result N-best as the recognition result. The CPU 1034a then outputs “Registered place display number, please” from the speaker 102 as system message E, which prompts the user to utter a command in the next lower hierarchy, and waits for an utterance from the user. At the same time, the CPU 1034a displays on the display 105 a menu screen showing the commands included in the third hierarchy C of the high-restriction language model, as shown in FIG. 9(c). In response, the user utters “No. 3” as user utterance F in order to select the number of a registered place from among the registered places.

Through the above processing, the CPU 1034a controls the navigation device so that “△△ Company” is set as the destination. The user can thus operate the navigation device by voice.

Next, a specific example in which the user utters content different from the prescribed command is described with reference to FIG. 13. The CPU 1034a outputs “Command, please” from the speaker 102 as system message A, which prompts the user to utter a command, and waits for an utterance from the user. At the same time, the CPU 1034a displays on the display 105 a menu screen showing the commands included in the first hierarchy A of the high-restriction language model, as shown in FIG. 9(a). In response, the user utters “destination, um, search”, which differs from the prescribed command, as user utterance B in order to set a destination on the navigation device.

The CPU 1034a accepts the user's utterance, executes the speech recognition processing described above, and, as described above with reference to FIG. 11, specifies “destination / (garbage) / search” from the recognition result N-best as the recognition result and converts it into the corresponding prescribed command “destination setting”. The CPU 1034a then outputs “Destination setting command, please” from the speaker 102 as system message C, which prompts the user to utter a command in the lower hierarchy, and waits for an utterance from the user. At the same time, the CPU 1034a displays on the display 105 a menu screen showing the commands included in the second hierarchy B of the high-restriction language model, as shown in FIG. 9(b). In response, the user utters the prescribed command “registered place display” as user utterance D in order to select a destination from among the registered places.
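
The conversion from a recognized word string containing a garbage segment to a prescribed command could be sketched as a table lookup. This section does not describe how the conversion is implemented, so the table `PRESCRIBED_COMMANDS`, the marker `GARBAGE`, and the function `to_prescribed_command` below are assumptions made purely for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Marker used in this sketch for a segment absorbed by the garbage model.
GARBAGE = "(garbage)"

# Hypothetical mapping from (object word, operation word) to a prescribed command.
PRESCRIBED_COMMANDS: Dict[Tuple[str, ...], str] = {
    ("destination", "setting"): "destination setting",
    ("destination", "search"): "destination setting",
}


def to_prescribed_command(words: List[str]) -> Optional[str]:
    """Drop garbage segments and look up the remaining keywords."""
    keywords = tuple(w for w in words if w != GARBAGE)
    # None signals that no prescribed command corresponds to the keywords.
    return PRESCRIBED_COMMANDS.get(keywords)
```

For the example of FIG. 13, `to_prescribed_command(["destination", "(garbage)", "search"])` yields “destination setting” under this assumed table.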

According to the present embodiment described above, the following effects can be obtained.
(1) Vocabulary represented by a plurality of language models that differ in the level of restriction placed on the user's utterance content is awaited as the recognition target vocabulary when speech recognition is executed, the matching degree between the user's uttered speech and the recognition target vocabulary is calculated, and the recognition result N-best is extracted as the recognition result candidates. The recognition result is then specified from the recognition result N-best based on the matching degree of each recognition target vocabulary item included in the N-best and on the level of restriction of the language model that includes that candidate. As a result, even when the user rephrases a prescribed command and utters low-restriction vocabulary, the possibility of misrecognition can be reduced.

(2) The plurality of language models with different levels of restriction include: a high-restriction language model that does not allow an arbitrary phoneme string to be inserted into the word string constituting a recognition target vocabulary item; a medium-restriction language model that allows an arbitrary phoneme string to be inserted into the word string and in which the recognizable word strings are fixed; and a low-restriction language model that allows an arbitrary phoneme string to be inserted into the word string and in which the recognizable word strings are not fixed. This makes it possible to await utterances of various forms from the user and to recognize them with high accuracy.

(3) From the recognition result N-best, which holds the recognition result candidates, the recognition target vocabulary of a language model with a higher level of restriction is preferentially specified as the recognition result. This prevents recognition target vocabulary of a low-restriction language model from being recognized by mistake when the user makes an utterance that belongs to a high-restriction language model, such as a prescribed command.

(4) From the recognition result N-best, which holds the recognition result candidates, a recognition target vocabulary item that belongs to a language model with a higher level of restriction is preferentially specified as the recognition result when the difference (score difference) between its matching degree and the matching degree of the item with the highest matching degree is smaller than a predetermined threshold. As a result, even when the recognition result N-best contains recognition target vocabulary of a language model with a higher level of restriction, that vocabulary is not preferentially adopted if its matching degree is low, which prevents misrecognition.

(5) The recognition target vocabulary that the user can utter is displayed on the display 105. This allows the user to know in advance the vocabulary to be uttered.

(6) A response sentence for the user is generated and output based on the recognition result. This allows the user to know whether the immediately preceding utterance was recognized correctly.

(7) A predetermined number of recognition target vocabulary items whose matching degree, obtained by the matching degree calculation, is equal to or greater than a predetermined value are extracted as the recognition result N-best. This makes vocabulary that is likely to match the user's utterance the candidates for the recognition result.
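
Effects (4) and (7) can be illustrated together with a short Python sketch: candidates are first filtered and truncated into an N-best list, and the result is then chosen by checking each restriction level in turn against the score-difference threshold. The `Candidate` type, the function names, the numeric level encoding, and the fallback to the top-scoring candidate when no level passes the threshold are assumptions of this sketch, not details given in the text. Scores are assumed to be numbers where higher means a better match.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    text: str
    score: float  # matching degree; higher means a better match (assumption)
    level: int    # 0 = high restriction, 1 = medium, 2 = low (assumed encoding)


def extract_n_best(scored: List[Candidate], n: int, min_score: float) -> List[Candidate]:
    """Effect (7): keep at most n candidates whose score is at least min_score."""
    kept = [c for c in scored if c.score >= min_score]
    return sorted(kept, key=lambda c: c.score, reverse=True)[:n]


def specify_result(n_best: List[Candidate], threshold: float) -> Optional[Candidate]:
    """Effect (4): prefer a more restrictive model only if its score is close to the top."""
    if not n_best:
        return None
    top = max(n_best, key=lambda c: c.score)
    # Visit restriction levels from most to least restrictive.
    for level in sorted({c.level for c in n_best}):
        best_at_level = max(
            (c for c in n_best if c.level == level), key=lambda c: c.score
        )
        if top.score - best_at_level.score < threshold:
            return best_at_level
    # Assumed fallback: no level passed the threshold test.
    return top
```

With a larger threshold the high-restriction candidate wins more often; with a smaller one the choice approaches plain highest-score selection.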

-Modification-
Note that the speech recognition apparatus of the embodiment described above can also be modified as follows.
(1) In the embodiment described above, the CPU 1034a preferentially specifies, from the recognition result N-best, a recognition target vocabulary item of a language model with a higher level of restriction as the recognition result when the score difference between its matching degree and that of the item with the highest matching degree is smaller than a predetermined threshold. However, the CPU 1034a may specify the recognition result based on at least one of the matching degree of the recognition target vocabulary included in the recognition result N-best and the level of restriction of the language model that includes that vocabulary. For example, the recognition target vocabulary of a language model with a higher level of restriction may be preferentially specified as the recognition result from the recognition result N-best.

(2) In the embodiment described above, the CPU 1034a preferentially specifies, from the recognition result N-best, a recognition target vocabulary item of a language model with a higher level of restriction as the recognition result when the score difference between its matching degree and that of the item with the highest matching degree is smaller than a predetermined threshold. However, in the matching degree calculation, the CPU 1034a may instead add a predetermined score to the score of recognition target vocabulary of a high-restriction language model, or multiply that score by a predetermined weighting coefficient, so that this vocabulary is preferentially specified as the recognition result.

(3) In the embodiment described above, an example was described in which a voice-operable navigation device is operated using the speech recognition apparatus 100. However, the speech recognition apparatus 100 can also be applied to other voice-operable devices.
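
Modification (2) can be sketched in Python as a score adjustment applied before the best candidate is picked. The function names, the tuple layout `(text, score, level)`, and the level labels are hypothetical. The sketch also assumes positive scores where higher means a better match; with negative scores such as log-likelihoods, a multiplicative weight greater than 1 would lower the score rather than raise it.

```python
from typing import Dict, List, Optional, Tuple

# (text, matching degree, restriction level label); layout assumed for this sketch.
Scored = Tuple[str, float, str]


def adjust_score(
    score: float,
    level: str,
    bonus: Optional[Dict[str, float]] = None,
    weight: Optional[Dict[str, float]] = None,
) -> float:
    """Add a per-level bonus and/or multiply by a per-level weighting coefficient."""
    if bonus is not None:
        score += bonus.get(level, 0.0)
    if weight is not None:
        score *= weight.get(level, 1.0)
    return score


def pick_with_adjustment(
    candidates: List[Scored],
    bonus: Optional[Dict[str, float]] = None,
    weight: Optional[Dict[str, float]] = None,
) -> str:
    """Return the text of the candidate with the highest adjusted score."""
    best = max(candidates, key=lambda c: adjust_score(c[1], c[2], bonus, weight))
    return best[0]
```

Unlike the threshold test of the embodiment, this variant folds the preference for high-restriction vocabulary directly into the score, so a single maximum selection is enough.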

Note that the present invention is in no way limited to the configurations of the embodiment described above, as long as the characteristic functions of the present invention are not impaired.

FIG. 1 is a block diagram showing the configuration of an embodiment of the speech recognition apparatus.
FIG. 2 is a flowchart showing the processing of the speech recognition apparatus 100.
FIG. 3 is a diagram showing a specific example of the high-restriction language model.
FIG. 4 is a diagram showing a specific example of the recognition target vocabulary that can be awaited by loading the high-restriction language model.
FIG. 5 is a diagram showing a specific example of the medium-restriction language model.
FIG. 6 is a diagram showing a specific example of the recognition target vocabulary that can be awaited by loading the medium-restriction language model.
FIG. 7 is a diagram showing a specific example of the low-restriction language model.
FIG. 8 is a diagram showing a specific example of the recognition target vocabulary that can be awaited by loading the low-restriction language model.
FIG. 9 is a diagram showing a specific example of the menu screen for voice input.
FIG. 10 is a first diagram showing a specific example of the recognition result N-best.
FIG. 11 is a second diagram showing a specific example of the recognition result N-best.
FIG. 12 is a first diagram showing a specific example of utterances by the user and responses by the speech recognition apparatus 100.
FIG. 13 is a second diagram showing a specific example of utterances by the user and responses by the speech recognition apparatus 100.

Explanation of symbols

100 speech recognition apparatus, 101 microphone, 102 speaker, 103 signal processing unit, 1031 A/D converter, 1032 D/A converter, 1033 output amplifier, 1034 signal processing device, 1034a CPU, 1034b memory, 1035 external storage device, 104 input device, 104a utterance switch, 104b correction switch, 105 display

Claims

A speech recognition apparatus comprising:
voice input means for inputting speech uttered by a user;
storage means for storing vocabulary, which is provided for recognizing the uttered speech and includes object words for a device operated by voice and operation words for operating the device, as standby vocabulary represented by a plurality of language models that differ in the level of restriction placed on the user's utterance content;
matching degree calculation means for calculating, for each of the plurality of language models, the matching degree between the standby vocabulary stored in the storage means and the uttered speech input by the voice input means;
candidate extraction means for selecting, based on the matching degree calculated by the matching degree calculation means, a plurality of standby vocabulary items in descending order of matching degree from the standby vocabulary of the plurality of language models and extracting them as recognition result candidates; and
recognition result specifying means for calculating the difference between the highest matching degree among the recognition result candidates of the language model with the highest level of restriction on the user's utterance content (first level language model) and the highest matching degree among the plurality of recognition result candidates extracted by the candidate extraction means, specifying the recognition result candidate of the first level language model as the recognition result when the difference is smaller than a predetermined value, and, when it is not, calculating the difference between the highest matching degree among the recognition result candidates of the language model with the next highest level of restriction (second level language model) and the highest matching degree among the plurality of recognition result candidates extracted by the candidate extraction means, and specifying the recognition result candidate of the second level language model as the recognition result when that difference is smaller than a predetermined value.

The speech recognition apparatus according to claim 1,
wherein the plurality of language models are:
(1) a language model with a high level of restriction on the user's utterance content, in which the standby vocabulary includes the object word and the operation word and consists of word strings that do not allow an arbitrary phoneme string to be inserted between the two words;
(2) a language model with a medium level of restriction on the user's utterance content, in which the standby vocabulary includes the object word and the operation word and is modeled so that an arbitrary phoneme string may be inserted between the two words, and in which the number of recognizable word strings is finite; and
(3) a language model with a low level of restriction on the user's utterance content, in which the standby vocabulary includes the object word and the operation word and is modeled so that an arbitrary phoneme string may be inserted between the two words, and in which the number of recognizable word strings is infinite.

The speech recognition apparatus according to claim 1 or 2,
A speech recognition apparatus, further comprising display control means for displaying the standby vocabulary that a user can utter on a display device.

The speech recognition apparatus according to any one of claims 1 to 3,
A speech recognition apparatus, further comprising: response sentence output means for generating and outputting a response sentence for a user based on the recognition result specified by the recognition result specifying means.

The speech recognition apparatus according to any one of claims 1 to 4,
wherein the candidate extraction means extracts, as the recognition result candidates, a predetermined number of standby vocabulary items whose matching degree, as calculated by the matching degree calculation means, is equal to or greater than a predetermined value.

The speech recognition apparatus according to any one of claims 1 to 5,
The speech recognition apparatus characterized in that the object word includes a destination and a facility, and the operation word includes setting, search, and display.

A speech recognition method comprising:
inputting speech uttered by a user;
storing vocabulary, which is provided for recognizing the uttered speech and includes object words for a device operated by voice and operation words for operating the device, as standby vocabulary represented by a plurality of language models that differ in the level of restriction placed on the user's utterance content;
calculating, for each of the plurality of language models, the matching degree between the stored standby vocabulary and the input uttered speech;
selecting, based on the calculated matching degree, a plurality of standby vocabulary items in descending order of matching degree from the standby vocabulary of the plurality of language models and extracting them as recognition result candidates; and
calculating the difference between the highest matching degree among the recognition result candidates of the language model with the highest level of restriction on the user's utterance content (first level language model) and the highest matching degree among the plurality of extracted recognition result candidates, specifying the recognition result candidate of the first level language model as the recognition result when the difference is smaller than a predetermined value, and, when it is not, calculating the difference between the highest matching degree among the recognition result candidates of the language model with the next highest level of restriction (second level language model) and the highest matching degree among the plurality of extracted recognition result candidates, and specifying the recognition result candidate of the second level language model as the recognition result when that difference is smaller than a predetermined value.

The speech recognition method according to claim 7,
wherein the plurality of language models are:
(1) a language model with a high level of restriction on the user's utterance content, in which the standby vocabulary includes the object word and the operation word and consists of word strings that do not allow an arbitrary phoneme string to be inserted between the two words;
(2) a language model with a medium level of restriction on the user's utterance content, in which the standby vocabulary includes the object word and the operation word and is modeled so that an arbitrary phoneme string may be inserted between the two words, and in which the number of recognizable word strings is finite; and
(3) a language model with a low level of restriction on the user's utterance content, in which the standby vocabulary includes the object word and the operation word and is modeled so that an arbitrary phoneme string may be inserted between the two words, and in which the number of recognizable word strings is infinite.