JP2008076811A - Voice recognition device, voice recognition method and voice recognition program

Info

Publication number: JP2008076811A
Application number: JP2006256907A
Authority: JP
Inventors: Hisayuki Nagashima; 久幸長島
Original assignee: Honda Motor Co Ltd
Current assignee: Honda Motor Co Ltd
Priority date: 2006-09-22
Filing date: 2006-09-22
Publication date: 2008-04-03

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition device, a voice recognition method and a voice recognition program capable of accurately recognizing a user's utterance even when the utterance is ambiguous.

SOLUTION: The voice recognition device 1 comprises: an operation specifying means 31 which, in order to determine the operation content of an operation object based on the recognition result for input voice, performs specifying processing that specifies the operation object and the operation content by classifying the recognition result into preset categories of operation object and operation content; an ambiguous word detecting means 32 for detecting, in the recognition result, an ambiguous word that the operation specifying means 31 cannot classify; and a candidate extracting means 33 for extracting, as operation object candidates, operation objects that may be specified by the detected ambiguous word. When the operation specifying means 31 can specify the operation content but cannot specify the operation object, it performs the specifying processing after replacing the ambiguous word detected by the ambiguous word detecting means 32 with a candidate extracted by the candidate extracting means 33.

Description

The present invention relates to a voice recognition device, a voice recognition method, and a voice recognition program that recognize speech input by a user and, based on the recognition result, acquire information for controlling a target.

In recent years, systems that perform operations such as device control have used voice recognition devices that recognize speech input by a user and acquire the information needed to operate a device. Such a device recognizes the speech (utterance) input by the user and, based on the recognition result, responds to the user and prompts the next utterance, thereby carrying on a dialogue with the user. The information needed to operate the device is then acquired from the result of recognizing that dialogue. The utterance is recognized, for example, by using a voice recognition dictionary in which the commands to be recognized are registered in advance and comparing the features of the input utterance with the features of the registered commands.

Such a voice recognition device is installed, for example, in a vehicle and is used by the user to operate multiple devices mounted on the vehicle, such as an audio device, a navigation device, and an air conditioner. For example, when operating the function that raises or lowers the volume of the audio device, the user specifies by voice how the function should change (the operation content) by saying the word (instruction word) "up" or "down". The same applies when operating the function that changes the radio station being received: the user specifies the operation content by uttering the instruction word "up" or "down" to raise or lower the reception frequency.

As described above, when multiple functions are operated by the same instruction word, uttering the instruction word alone does not identify which function is meant, so it must be uttered together with a word that denotes the target function (a function word). In the example above, the instruction words for increasing and decreasing apply both to the volume of the audio device and to the reception frequency of the radio, so the user must state explicitly which of the two is to be changed. That is, the user must say "volume up" to raise the volume and "frequency up" to raise the frequency. Likewise, to adjust the air conditioner, the user must say "temperature up" or "temperature down" to raise or lower the set temperature.

However, when the word (instruction word) "up" is used with vehicle-mounted devices, it generally refers to a function of a device that is currently operating: raising the volume if the audio device is operating, or raising the temperature if the air conditioner is operating. In particular, when no other device is operating, it can only mean the volume. Also, when "up" is uttered repeatedly while both the audio device and the air conditioner are operating, it is highly probable that the instruction concerns the volume of the audio device. Despite these circumstances, always having to utter the function word together with the instruction word is tedious and inconvenient for the user.

As a countermeasure, it is conceivable to assign a different instruction word to each function in advance: for example, the loanword "appu" (up) for raising the volume, the native word "ue" (up) for raising the frequency, and the word "cold" for raising the set temperature of the air conditioner. However, when the functions that raise or lower a setting are in practice operated by nearly identical instructions, prescribing a different word (instruction word) for each function makes the system harder for the user to understand and makes device operation cumbersome.

Patent Document 1 below therefore proposes a voice-recognition device operating apparatus in which, even when a specific instruction word indicating the operation content of a device function is shared by multiple functions and all of the devices concerned are operating, the function the user wants is estimated from the instruction word alone so that the desired operation can be performed.

This apparatus comprises: an identification unit that distinguishes, in the speech recognized by a voice recognition unit, function words that select a device function from instruction words that direct the operation of that function, and detects that an instruction word has been input alone; an operation instruction database that records device functions and the instruction words that operate them; an operable function detection unit that determines from the operation instruction database whether the instruction word identified by the identification unit is common to multiple functions; a usage frequency database that records usage frequencies for each instruction word; a usage frequency search unit that, when the operable function detection unit detects that the instruction word is common to multiple functions, searches the usage frequency database for the most frequently used of those functions; an operation word forming unit that appends the function word of the most frequently used function found by the search to the input instruction word and outputs the result as an operation word; and a usage frequency correction unit that corrects the data in the usage frequency database when an operation instruction signal is output to a device.
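
The frequency-based estimation of this prior-art apparatus can be illustrated with a minimal sketch. The table contents, the class name `FrequencyResolver`, and the tie-breaking rule are assumptions made for illustration; Patent Document 1 is not quoted here and may differ in detail.

```python
from collections import Counter

# Hypothetical operation instruction database:
# instruction word -> functions that the word can operate.
OPERATION_INSTRUCTIONS = {
    "up": ["volume", "frequency", "temperature"],
    "down": ["volume", "frequency", "temperature"],
}


class FrequencyResolver:
    """Resolves a bare instruction word to its most frequently used function."""

    def __init__(self):
        # Usage frequency database: (function word, instruction word) -> count.
        self.usage = Counter()

    def form_operation_word(self, instruction, function=None):
        """Return 'function instruction', estimating the function if omitted."""
        functions = OPERATION_INSTRUCTIONS[instruction]
        if function is None:
            # Instruction word uttered alone: pick the most used function.
            # Ties fall back to the first function listed (an assumption).
            function = max(functions, key=lambda f: self.usage[(f, instruction)])
        # Correct the usage data each time an operation is issued.
        self.usage[(function, instruction)] += 1
        return f"{function} {instruction}"
```

In this sketch the estimate depends only on how often each function was used with the instruction word, which is the limitation the present invention addresses.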

According to this apparatus, the function the user wants is estimated from the usage frequency data, so that when the user merely utters a specific instruction word, the function word for that function is combined with the instruction word into an operation word and the device is operated accordingly.

JP 2001-34292 A

However, the apparatus described above in essence associates the most frequently used function with a specific word that indicates an operation content shared by multiple functions. It therefore cannot cope with cases where the user indicates a different operation through a different phrasing that uses the same word. Nor does it take differences in phrasing into account to estimate what is meant when the word indicating the operation target in the user's input speech is an ambiguous word such as "this", "it", or "that". As a result, the function is not estimated appropriately, and the device is inconvenient to operate.

In view of the above circumstances, an object of the present invention is to provide a voice recognition device that can handle a user's utterance containing an ambiguous word from which the operation target cannot be identified, and that can estimate the appropriate operation for the utterance.

The present invention is a voice recognition device that determines the operation content for an operation target based on the recognition result for input speech, comprising: operation specifying means that performs specifying processing, which specifies the operation target and the operation content by classifying the recognition result for the speech into predetermined types of operation target and operation content; ambiguous word detecting means that detects, in the recognition result for the speech, an ambiguous word that the operation specifying means cannot classify; and candidate extracting means that extracts, as operation target candidates, operation targets that may be indicated by the ambiguous word detected by the ambiguous word detecting means. When the operation content can be specified but the operation target cannot, the operation specifying means replaces the ambiguous word detected by the ambiguous word detecting means with a candidate extracted by the candidate extracting means and performs the specifying processing on the result.

According to the present invention, for example, an utterance for operating a target is input by voice from the user and is recognized as text expressed as a word string. Here, "text" means a meaningful syntactic unit, expressed as a sequence of words, that has a given meaning. The operation specifying means then specifies the operation target and the operation content by, for example, analyzing the relationships between the words in the recognition result and classifying the result into types of operation target and operation content. If the input speech is ambiguous, the operation content may be specified while the operation target is not. This occurs, for example, when the word or word string indicating the operation target is a general demonstrative such as "this", "it", or "that", or when it is not recognized correctly because the speech is unclear, so that it is an ambiguous word that the operation specifying means cannot classify. In such a case, the present invention detects the ambiguous word in the recognition result, replaces it with an operation target candidate, and performs the specifying processing with the operation specifying means, so that what the ambiguous word in the utterance refers to is estimated appropriately and the operation target and operation content can be specified.
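
The flow described above (specify, detect the ambiguous word, substitute candidates, specify again) can be sketched as follows. This is only an illustration under stated assumptions: the vocabularies, the word-lookup style of `specify`, and the names `VALID_OPERATIONS` and `AMBIGUOUS_CANDIDATES` are hypothetical stand-ins, not the syntax model or conversion candidate database of the embodiment.

```python
# Hypothetical table: operation target -> operation contents it accepts.
VALID_OPERATIONS = {
    "volume": {"raise", "lower"},
    "temperature": {"raise", "lower"},
    "map": {"enlarge", "reduce"},
}
CONTENTS = set().union(*VALID_OPERATIONS.values())
# Hypothetical candidate table: ambiguous word -> targets it may refer to.
AMBIGUOUS_CANDIDATES = {w: list(VALID_OPERATIONS) for w in ("this", "it", "that")}


def specify(words):
    """Specifying processing: classify words into (target, content)."""
    target = next((w for w in words if w in VALID_OPERATIONS), None)
    content = next((w for w in words if w in CONTENTS), None)
    return target, content


def detect_ambiguous(words):
    """Return unclassified words that the candidate table knows about."""
    return [w for w in words
            if w not in VALID_OPERATIONS and w not in CONTENTS
            and w in AMBIGUOUS_CANDIDATES]


def recognize(words):
    """Return every (target, content) pair the utterance may mean."""
    target, content = specify(words)
    if content is None:
        return []                      # no operation content was specified
    if target is not None:
        return [(target, content)]     # unambiguous utterance
    results = []
    for ambiguous in detect_ambiguous(words):
        for candidate in AMBIGUOUS_CANDIDATES[ambiguous]:
            # Replace the ambiguous word and run the specifying processing again.
            replaced = [candidate if w == ambiguous else w for w in words]
            new_target, new_content = specify(replaced)
            pair = (new_target, new_content)
            if new_content in VALID_OPERATIONS[new_target] and pair not in results:
                results.append(pair)
    return results
```

For the word list `["raise", "it"]` the sketch yields two possible operations (volume and temperature), while `["enlarge", "this"]` yields only the map, because the second pass rejects targets that do not accept the specified content.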

Therefore, the present invention can handle a user's utterance that contains an ambiguous word from which the operation target cannot be identified, and can estimate the appropriate operation by extracting candidates for specifying the operation target.

In a preferred embodiment of the present invention, the candidate extracting means extracts, as operation target candidates, the operation targets that are possible for the operation content specified by the operation specifying means.

As a result, operation target candidates that match the specified operation content can be extracted efficiently for the ambiguous word, and the operation target can be specified accurately.
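
Extracting candidates according to the specified operation content can be sketched as a lookup that keeps only the targets accepting that content. The table `TARGET_ACTIONS` is a hypothetical example built from the kinds of functions and actions mentioned in this description; the real conversion candidate database is not disclosed at this level of detail.

```python
# Hypothetical table: operation target -> actions that can be applied to it.
TARGET_ACTIONS = {
    "volume": ["raise", "lower"],
    "air volume": ["change", "raise", "lower"],
    "set temperature": ["change", "raise", "lower"],
    "screen display": ["change", "enlarge", "reduce"],
    "CD": ["play", "stop"],
    "MP3": ["play", "stop"],
}


def extract_candidates(content):
    """Return the operation targets that are possible for the given content."""
    return [target for target, actions in TARGET_ACTIONS.items()
            if content in actions]
```

With this table, the content "enlarge" leaves a single candidate, while "raise" leaves three, so further narrowing is still needed in the latter case.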

Another embodiment comprises notification means that, when the operation specifying means has specified multiple operation targets and operation contents, notifies the user of them, and selection means that selects one of each from among them. The operation specifying means takes the operation target and operation content selected by the selection means as the finally specified operation target and operation content.

The user can thus be informed of the multiple operation targets and asked to select the desired one, so the operation target can be specified accurately.
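
A minimal sketch of notification followed by selection is given below. The callbacks `notify` and `select` are hypothetical placeholders for whatever display, voice guidance, or touch input the system provides.

```python
def choose_by_user(candidates, notify, select):
    """Ask the user to pick one (target, content) pair when several remain.

    notify: callable that presents the list of candidates to the user.
    select: callable that receives the number of candidates and returns
            the index chosen by the user.
    """
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]           # nothing to ask
    notify(candidates)                 # notification means
    index = select(len(candidates))    # selection means
    if 0 <= index < len(candidates):
        return candidates[index]
    return None                        # invalid choice: nothing is specified
```

Passing the callbacks in as arguments keeps the sketch independent of any particular user interface.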

Yet another embodiment comprises device state detecting means that detects the state of the devices to be operated. When the operation specifying means has specified multiple operation targets and operation contents, it specifies one of each based on the detection result of the device state detecting means.

Thus, even when multiple operation targets and operation contents have been specified, the most appropriate ones can be specified from the state of the devices to be operated.
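
One way to use the device state is to keep only the candidates whose device is currently operating. The rule below, the mapping `device_of`, and the function name are assumptions for illustration; the description does not fix how the detection result is applied.

```python
def choose_by_device_state(candidates, device_of, active_devices):
    """Pick the single candidate whose device is currently operating.

    candidates: list of (target, content) pairs.
    device_of: hypothetical mapping from operation target to its device.
    active_devices: set of devices reported as operating.
    Returns None when the state does not single out exactly one candidate.
    """
    active = [pair for pair in candidates
              if device_of.get(pair[0]) in active_devices]
    return active[0] if len(active) == 1 else None
```

When the state leaves more than one candidate, the sketch returns `None` so that another criterion, such as asking the user, can take over.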

Yet another embodiment comprises operation history storage means that stores a history of operation targets and operation contents. When the operation specifying means has specified multiple operation targets and operation contents, it specifies one of each based on the history stored in the operation history storage means.

Thus, even when multiple operation targets and operation contents have been specified, the most appropriate ones can be specified from the operation history to date.
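
A history-based choice can be sketched by preferring the candidate that was performed most recently. Recency is an assumption made here; the description only states that the stored history is used, and a frequency-based rule would be equally consistent with it.

```python
def choose_by_history(candidates, history):
    """Return the candidate that appears most recently in the history.

    candidates: list of (target, content) pairs still possible.
    history: list of past (target, content) pairs, oldest first.
    Returns None if no candidate has been performed before.
    """
    for past in reversed(history):
        if past in candidates:
            return past
    return None
```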

The present invention also provides a voice recognition method for determining the operation content for an operation target based on the recognition result for input speech, comprising: a specifying step of specifying the operation target and the operation content by classifying the recognition result for the speech into predetermined types of operation target and operation content; an ambiguous word detecting step of detecting, when the specifying step has specified the operation content but not the operation target, an ambiguous word in the recognition result that could not be classified in the specifying step; a candidate extracting step of extracting, as operation target candidates, operation targets that may be indicated by the ambiguous word detected in the ambiguous word detecting step; and a second specifying step of replacing the ambiguous word detected in the detecting step with a candidate extracted in the candidate extracting step and classifying the recognition result after replacement into the predetermined types of operation target and operation content, thereby specifying the operation target and the operation content.

As explained for the voice recognition device of the present invention, this voice recognition method can handle a user's utterance containing an ambiguous word from which the operation target cannot be identified, and can estimate the appropriate operation by extracting candidates for specifying the operation target.

The present invention further provides a voice recognition program that causes a computer to execute processing for determining the operation content for an operation target based on the recognition result for input speech, the program comprising: specifying processing that specifies the operation target and the operation content by classifying the recognition result for the speech into predetermined types of operation target and operation content; detection processing that detects, in the recognition result, an ambiguous word that the specifying processing cannot classify; and candidate extraction processing that extracts, as operation target candidates, operation targets that may be indicated by the ambiguous word detected by the detection processing. When the specifying processing can specify the operation content but not the operation target, the program causes the computer to replace the ambiguous word detected by the detection processing with a candidate extracted by the candidate extraction processing and to perform the specifying processing.

In this case, the computer can be made to execute processing that achieves the effects described for the voice recognition device of the present invention.

As shown in FIG. 1, the voice recognition device of the present invention consists of a voice interaction unit 1 and is mounted on a vehicle 10. Connected to the voice interaction unit 1 are a microphone 2, into which utterances from the driver of the vehicle 10 are input, and a vehicle state detection unit 3, which detects the state of the vehicle 10. Also connected are a speaker 4 that outputs responses to the driver and a display 5 that presents information to the driver, as well as multiple devices 6a to 6c that the driver can operate by voice or other means.

The microphone 2 receives the voice of the driver of the vehicle 10 and is installed at a predetermined position in the cabin. When the start of voice input is commanded, for example by a talk switch, the microphone 2 acquires the input speech as the driver's utterance. The talk switch is an ON/OFF switch operated by the driver of the vehicle 10; pressing it to ON commands the start of voice input.

The vehicle state detection unit 3 consists of sensors and the like that detect the state of the vehicle 10. The state of the vehicle 10 means, for example, its travel state such as speed and acceleration or deceleration, travel environment information such as its position and the road being traveled, the operating state of equipment installed in the vehicle 10 (wipers, turn signals, audio 6a, navigation system 6b, and so on), or the state of the cabin such as the interior temperature. Specifically, sensors that detect the travel state of the vehicle 10 include a vehicle speed sensor that detects the travel speed, a yaw rate sensor that detects the yaw rate, and a brake sensor that detects brake operation (whether the brake pedal is being operated). The state of the driver of the vehicle 10 (sweating of the driver's palms, driving load, and so on) may also be detected as part of the state of the vehicle 10.

The speaker 4 outputs responses (voice guidance) to the driver of the vehicle 10. The speaker of the audio 6a described later can be used as the speaker 4.

The display 5 is, for example, a HUD (head-up display) that displays information such as images on the front window of the vehicle 10, a display built into the meter that shows travel conditions such as the vehicle speed, or the display of the navigation system 6b described later. The display of the navigation system 6b is a touch panel 24 with built-in touch switches.

Specifically, the devices 6a to 6c are an audio 6a, a navigation system 6b, and an air conditioner 6c installed in the vehicle 10. For each of the devices 6a to 6c, the controllable components (devices, contents, and so on), functions, and actions are defined in advance.

For example, the audio 6a has "CD", "MP3", "radio", "speaker", and so on as devices, and "volume" and so on as functions. The actions of the audio 6a include "change", "on", and "off". The actions for "CD" and "MP3" include "play" and "stop", the functions of "radio" include "channel selection", and the actions for "volume" include "raise" and "lower".

The navigation system 6b has, for example, "screen display", "route guidance", and "POI search" as contents. The actions for "screen display" include "change", "enlarge", and "reduce". "Route guidance" is a function that guides the driver to a destination by voice guidance and the like, and "POI search" is a function that searches for destinations such as restaurants and hotels.

The air conditioner 6c has, for example, "air volume" and "set temperature" as functions. The actions of the air conditioner 6c include "on" and "off", and the actions for "air volume" and "set temperature" include "change", "raise", and "lower".

These devices 6a to 6c are controlled by designating information for controlling the target (the type of device or function, the content of the action, and so on). The devices, contents, and functions of the devices 6a to 6c to be operated are classified into multiple domains. A "domain" is a classification according to the category of what is to be recognized; specifically, it represents a device or function that is an operation target. Domains can be designated hierarchically: for example, the "audio" domain is divided at the next level down into the "CD" and "radio" domains.
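
The hierarchical designation of domains can be pictured as a nested mapping. Only the "audio" to "CD" and "radio" relationship comes from the text; the other entries in `DOMAIN_TREE` and the helper `domain_path` are assumptions added to make the sketch self-contained.

```python
# Nested mapping of domains; sub-domains sit under their parent.
# Entries other than audio -> CD / radio are illustrative assumptions.
DOMAIN_TREE = {
    "audio": {"CD": {}, "radio": {}},
    "navigation": {"screen display": {}, "route guidance": {}, "POI search": {}},
    "air conditioner": {"air volume": {}, "set temperature": {}},
}


def domain_path(tree, name, prefix=()):
    """Return the path from the top-level domain down to `name`, or None."""
    for key, children in tree.items():
        path = prefix + (key,)
        if key == name:
            return path
        found = domain_path(children, name, path)
        if found is not None:
            return found
    return None
```

A path such as `("audio", "radio")` then identifies both the device and the sub-domain to which a recognized operation target belongs.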

Although not illustrated in detail, the voice interaction unit 1 is an electronic circuit that includes an A/D conversion circuit, a microcomputer (CPU, RAM, ROM), and so on. The output of the microphone 2 (an analog signal) is converted into a digital signal by the A/D conversion circuit and input to the unit. Based on the input data, the voice interaction unit 1 executes processing that recognizes the utterance input by the driver, processing that, based on the recognition result, carries on a dialogue with the driver or presents information to the driver through the speaker 4 and the display 5, processing that controls the devices 6a to 6c, and so on. These processes are realized by the voice interaction unit 1 executing a program installed in advance in its memory. This program includes the voice recognition program of the present invention. The program may be stored in the memory from a recording medium such as a CD-ROM, or it may be distributed or broadcast from an external server over a network or via satellite, received by communication equipment mounted on the vehicle 10, and then stored in the memory.

More specifically, as functions realized by the program, the voice interaction unit 1 includes a voice recognition unit 11 that recognizes the input speech using an acoustic model 15 and a language model 16 and outputs it as text, and a syntax analysis unit 12 that uses a syntax model 17 to understand the meaning of the utterance from the recognized text. It also includes a scenario control unit 13 that uses a scenario database 18 to determine a scenario based on the operation candidates specified from the recognition result of the utterance and that responds to the driver, controls the devices, and so on, and a voice synthesis unit 14 that uses a phoneme model 21 to synthesize the spoken response output to the driver. An "operation candidate" is a candidate for the operation target or operation content specified from the recognition result of the utterance.

In further detail, the syntax analysis unit 12 includes, as its functions: operation specifying means 31, which performs the specifying process (the "syntax analysis process" described later) that specifies the operation target and operation content from the recognized text; ambiguous word detection means 32, which detects ambiguous words that the operation specifying means 31 cannot classify; and candidate extraction means 33, which uses the conversion candidate database 34 to extract, as operation target candidates, the operation targets that may be specified by an ambiguous word detected by the ambiguous word detection means 32.

In addition, the operation specifying means 31 displays, via the touch panel 24, the plurality of operation candidates included in the operation candidate group specified by the specifying process so as to prompt the driver to choose among them, and outputs a voice guide via the speaker 4 that prompts the driver to make a selection. The operation specifying means 31 then specifies the final operation candidate based on the driver's touch operation on the touch panel 24. The scenario control unit 13 and the speech synthesis unit 14 constitute the notification means of the present invention, and the scenario control unit 13 constitutes the selection means of the present invention.

The acoustic model 15, the language model 16, the syntax model 17, the scenario database 18, the phoneme model 19, and the conversion candidate database 34 are each a recording medium (database), such as a CD-ROM, DVD, or HDD, on which the respective data is recorded.

The operation history storage unit 35, indicated by a broken line in FIG. 1, is provided only in the third embodiment described later, so its description is omitted here.

The speech recognition unit 11 performs frequency analysis on the waveform data representing the utterance input to the microphone 2 and extracts feature vectors. Based on the extracted feature vectors, it then executes "speech recognition processing", which recognizes the input speech and outputs it as text expressed as a word string. This processing evaluates the acoustic and linguistic features of the input speech together, using the probabilistic and statistical method described next.

Specifically, the speech recognition unit 11 first uses the acoustic model 15 to evaluate the likelihood of pronunciation data corresponding to the extracted feature vectors (hereinafter called the "acoustic score" as appropriate) and determines the pronunciation data based on that acoustic score. It then uses the language model 16 to evaluate the likelihood of text, expressed as a word string, corresponding to the determined pronunciation data (hereinafter the "language score") and determines the text based on that language score. For every text so determined, the speech recognition unit 11 calculates a confidence of speech recognition (hereinafter the "speech recognition score") from the acoustic score and the language score of that text. Finally, it outputs the text expressed by a word string whose speech recognition score satisfies a predetermined condition as the recognized text (Recognized Text).

The syntax analysis unit 12 executes the "syntax analysis process", which uses the syntax model 17 to understand the meaning of the input utterance from the text recognized by the speech recognition unit 11. This process analyzes the relationships between words (the syntax) in the recognized text using the probabilistic and statistical method described next.

Specifically, the syntax analysis unit 12 evaluates the likelihood of the recognized text (hereinafter called the "parsing score" as appropriate) and, based on that parsing score, determines the text classified into the class corresponding to the meaning of the recognized text. The syntax analysis unit 12 then outputs the classified text (Categorized Text) whose parsing score satisfies a predetermined condition, together with the parsing score, as the operation candidate group specified based on the recognition result of the input utterance. A "class" is a classification by category representing an operation target or operation content, such as the domains described above. For example, when the recognized text is 設定変更 ("setting change"), 設定変更する ("change the setting"), 設定を変える ("alter the setting"), or セッティング変更 ("settings change"), the classified text is {Setup} in every case.

When the syntax analysis process can specify the operation content from the text recognized by the speech recognition unit 11 but cannot specify the operation target, the syntax analysis unit 12 detects the ambiguous word contained in the recognized text, replaces it with an operation target candidate using the conversion candidate database 34, and performs the syntax analysis process again on the text after replacement.

The scenario control unit 13 determines a scenario for response output to the driver and for device control, using the data recorded in the scenario database 18, based on the specified operation candidate and the state of the vehicle 10 acquired from the vehicle state detection unit 3. The scenario database 18 stores in advance a plurality of scenarios for response output and device control, together with the operation candidate and vehicle state conditions under which each applies. The scenario control unit 13 then executes, according to the determined scenario, processing to control responses by voice or image display and processing to control devices. For a voice response, for example, the scenario control unit 13 determines the content of the response to be output (a response sentence that prompts the driver's next utterance, or one that notifies the user that an operation has been completed, and so on) as well as the speed and volume at which the response is output.

The speech synthesis unit 14 synthesizes speech using the phoneme model 19 according to the response sentence determined by the scenario control unit 13 and outputs it as waveform data representing the speech. The speech is synthesized using processing such as TTS (Text to Speech). Specifically, the speech synthesis unit 14 normalizes the text of the response sentence into an expression suitable for speech output and converts each word of the normalized text into pronunciation data. It then determines feature vectors from the pronunciation data using the phoneme model 19 and applies filter processing to those feature vectors to convert them into waveform data. This waveform data is output as sound from the speaker 4.

The acoustic model (Acoustic Model) 15 records data indicating the probabilistic correspondence between feature vectors and pronunciation data. In detail, the acoustic model 15 records, as data, a plurality of HMMs (Hidden Markov Models) prepared for each recognition unit (phoneme, morpheme, word, and so on). An HMM is a statistical signal source model that represents speech as a concatenation of stationary signal sources (states) and represents the time series by the transition probabilities from one state to the next. With HMMs, the acoustic features of speech, which vary over time, can be expressed by a simple probabilistic model. Parameters such as the transition probabilities of each HMM are determined in advance by training on corresponding speech data. The phoneme model 19 likewise records HMMs similar to those of the acoustic model 15, used to determine feature vectors from pronunciation data.

The language model (Language Model) 16 records data indicating the appearance probability and connection probability of each word to be recognized, together with the pronunciation data and text of that word. The words to be recognized are determined in advance as words that may be used in utterances for controlling a target. Data such as appearance probabilities and connection probabilities is created statistically by analyzing a large training text corpus. The appearance probability of a word is calculated, for example, from the frequency with which that word appears in the training text corpus.

The language model 16 is, for example, an N-gram language model, expressed by the probability that a specific sequence of N words appears consecutively. In this embodiment, the language model 16 uses N-grams according to the number of words contained in the input utterance. Specifically, it uses N-grams whose N is no greater than the number of words contained in the pronunciation data. For example, when the pronunciation data contains two words, the model uses the unigram (Uni-gram, N = 1), expressed by the appearance probability of a single word, and the bigram (Bi-gram, N = 2), expressed by the occurrence probability of a two-word sequence (the conditional appearance probability of a word given the one preceding word).

The language model 16 can also use N-grams with N restricted to a predetermined upper limit. The upper limit may be, for example, a fixed value set in advance (for example, N = 2) or a value set successively so that the speech recognition processing for the input utterance finishes within a predetermined time. For example, when N-grams are used with N = 2 as the upper limit, only unigrams and bigrams are used even if the pronunciation data contains more than two words. This prevents the computational cost of speech recognition processing from becoming excessive, so that a response to the driver's utterance can be output within an appropriate response time.

The syntax model (Parser Model) 17 records data indicating the appearance probability and connection probability of each word to be recognized, together with the text and class of that word. Like the language model 16, the syntax model 17 is, for example, an N-gram language model. In this embodiment, the syntax model 17 uses N-grams whose N is no greater than the number of words contained in the recognized text, with N = 3 as the upper limit. That is, the syntax model 17 uses unigrams, bigrams, and trigrams (Tri-gram, N = 3), the last being expressed by the occurrence probability of a three-word sequence (the conditional appearance probability of a word given the two preceding words). The upper limit need not be 3 and can be set arbitrarily. Alternatively, no upper limit may be imposed, so that N-grams with N up to the number of words contained in the recognized text are used.

As shown in FIG. 2, the language model 16 and the syntax model 17 are each created in a form classified by domain type. In the example of FIG. 2, there are eight domain types: {Audio, Climate, Passenger Climate, POI, Ambiguous, Navigation, Clock, Help}. {Audio} indicates that the operation target is the audio system 6a. {Climate} indicates that the operation target is the air conditioner 6c. {Passenger Climate} indicates that the operation target is the passenger-seat side of the air conditioner 6c. {POI} indicates that the operation target is the POI search function of the navigation system 6b. {Navigation} indicates that the operation target is a function of the navigation system 6b such as route guidance or map operation. {Clock} indicates that the operation target is the clock function. {Help} indicates that the operation target is the help function for learning how to operate the devices 6a to 6c and the speech recognition device. {Ambiguous} indicates that the operation target is unknown.

The conversion candidate database 34 records in advance, as data, one or more operation target candidates for each combination of ambiguous word and classified text. Here, an ambiguous word is a general demonstrative such as これ ("this"), それ ("it"), or あれ ("that"). For example, as shown in FIG. 3, when the ambiguous word is それ and the classified text is {Ambiguous_Setting_Up}, the operation target candidates are the eight entries 曲 ("song"), 局 ("station"), ディスク ("disc"), ボリューム ("volume"), 温度 ("temperature"), ファン ("fan"), スクロール ("scroll"), and スケール ("scale"). Likewise, when the ambiguous word is それ and the classified text is {Ambiguous_Setting_Max}, the operation target candidates are the four entries ボリューム ("volume"), 温度 ("temperature"), ファン ("fan"), and 縮尺 ("map scale"). Here, "volume" indicates that the operation target is the sound volume of the audio system 6a, and "fan" indicates that the operation target is the fan speed (air flow) of the air conditioner 6c.
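The lookup and substitution described above can be sketched in Python as follows. This is only an illustrative sketch: the dictionary layout, the name `substitute_candidates`, and the tokenized sample utterance are assumptions made for the example, not the data format of the conversion candidate database 34. Only the two table entries quoted from FIG. 3 are included.

```python
# Hypothetical in-memory stand-in for the conversion candidate database 34.
# Keys: (ambiguous word, classified text); values: operation target candidates.
CONVERSION_CANDIDATES = {
    ("それ", "Ambiguous_Setting_Up"): [
        "曲", "局", "ディスク", "ボリューム", "温度", "ファン", "スクロール", "スケール",
    ],
    ("それ", "Ambiguous_Setting_Max"): ["ボリューム", "温度", "ファン", "縮尺"],
}


def substitute_candidates(words, classified_text):
    """Return one rewritten word list per operation target candidate.

    `words` is the recognized text as a list of words. The first word that
    has an entry in the table for `classified_text` is replaced by each
    candidate in turn. An empty list means nothing could be substituted.
    """
    for i, word in enumerate(words):
        candidates = CONVERSION_CANDIDATES.get((word, classified_text))
        if candidates:
            return [words[:i] + [c] + words[i + 1:] for c in candidates]
    return []


# Assumed sample utterance ("turn it up"), already split into words.
rewritten = substitute_candidates(["それ", "を", "上げて"], "Ambiguous_Setting_Up")
```

Each rewritten word list would then be passed through the syntax analysis process again, so that a candidate such as ボリューム can be classified under a concrete domain instead of {Ambiguous}.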

Next, the operation (voice interaction processing) of the speech recognition device of this embodiment is described. As shown in FIG. 4, first, in STEP 1, an utterance for controlling a target is input to the microphone 2 by the driver of the vehicle 10. Specifically, the driver turns the talk switch ON to command the start of utterance input and then speaks into the microphone 2.

Next, in STEP 2, the voice interaction unit 1 executes the speech recognition processing, which recognizes the input speech and outputs it as text.

First, the voice interaction unit 1 A/D-converts the speech input to the microphone 2 to obtain waveform data representing the speech. Next, it performs frequency analysis on that waveform data and extracts feature vectors. The waveform data is filtered by, for example, short-time spectrum analysis and converted into a time series of feature vectors. Each feature vector is an extraction of the features of the speech spectrum at a given time and generally has 10 to 100 dimensions (for example, 39 dimensions). LPC mel cepstrum (Linear Predictive Coding Mel Cepstrum) coefficients or the like are used.

Next, for the extracted feature vectors, the voice interaction unit 1 evaluates the likelihood (acoustic score) of those feature vectors under each of the plurality of HMMs recorded in the acoustic model 15. It then determines the pronunciation data corresponding to the HMMs with high acoustic scores. For example, when the utterance 千歳 ("Chitose") is input, the pronunciation data "ti-to-se" is obtained from the waveform data of the speech together with its acoustic score. When, for example, the utterance マークセット ("mark set") is input, the pronunciation data "ma-a-ku-se-t-to" is obtained together with acoustically similar pronunciation data such as "ma-a-ku-ri-su-to", each with its own acoustic score.

Next, using the data of the entire language model 16, the voice interaction unit 1 determines, from the determined pronunciation data, text expressed as a word string based on the language score of that text. When a plurality of pronunciation data items have been determined, text is determined for each of them.

Specifically, the voice interaction unit 1 first compares the determined pronunciation data with the pronunciation data recorded in the language model 16 and extracts words with a high degree of similarity. Next, it calculates the language score of each extracted word using N-grams according to the number of words contained in the pronunciation data. Then, for each word in the pronunciation data, it determines the text whose calculated language score satisfies a predetermined condition (for example, being at or above a predetermined value). For example, as shown in FIG. 5, when the input utterance is "Set the station ninety nine point three FM.", the text "set the station ninety nine point three FM" is determined as the text corresponding to the pronunciation data determined from this utterance.

In this case, the unigram gives the appearance probabilities a1 to a8 of "set", "the", ..., "FM", respectively. The bigram gives the occurrence probabilities b1 to b7 of the two-word sequences "set the", "the station", ..., "three FM". Similarly, for N = 3 to 8, the occurrence probabilities of the N-word sequences, c1 to c6, d1 to d5, e1 to e4, f1 to f3, g1 to g2, and h1, are given. The language score of the text "ninety", for example, is then calculated from a4, b3, c2, and d1, which are obtained from the N-grams with N = 1 to 4, where 4 is the number of words formed by "ninety" together with the words that precede it in the pronunciation data.
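As a rough illustration of which N-grams contribute to a given word's language score, the following Python sketch lists the N-grams that end at a chosen word, optionally capped at an upper limit on N as mentioned earlier for the language model 16. The function name `ngrams_ending_at` and the parameter `max_n` are hypothetical. How the individual probabilities are combined into the language score is not stated in this passage, so that step is deliberately left out.

```python
def ngrams_ending_at(words, index, max_n=None):
    """List the N-grams (as tuples) that end at words[index].

    N runs from 1 up to the number of words formed by the target word and
    everything before it, or up to max_n if an upper limit is given.
    """
    limit = index + 1  # the word itself plus all preceding words
    if max_n is not None:
        limit = min(limit, max_n)
    return [tuple(words[index - n + 1:index + 1]) for n in range(1, limit + 1)]


words = "set the station ninety nine point three FM".split()

# "ninety" is the 4th word, so N = 1..4 apply (a4, b3, c2, d1 in the text).
for_ninety = ngrams_ending_at(words, words.index("ninety"))

# With an upper limit of N = 2, only the unigram and bigram remain.
for_ninety_capped = ngrams_ending_at(words, words.index("ninety"), max_n=2)
```

Capping `max_n` bounds the number of lookups per word regardless of utterance length, which is the cost argument the text gives for limiting N.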

By using this technique (dictation), in which the input utterance is transcribed as text using a probabilistic, statistical language model defined per word, the device can recognize the driver's natural utterances without being limited to utterances that follow predetermined phrasings.

Next, for every determined text, the voice interaction unit 1 calculates the weighted sum of the acoustic score and the language score as the confidence of speech recognition (the speech recognition score). The weighting coefficients are, for example, values determined in advance by experiment.

Next, the voice interaction unit 1 determines and outputs, as the recognized text, the text expressed by a word string whose calculated speech recognition score satisfies a predetermined condition. The predetermined condition is set in advance as, for example, the text with the highest speech recognition score, the texts ranked from the top down to a predetermined rank by speech recognition score, or the texts whose speech recognition score is at or above a predetermined value.
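The weighted sum and the selection conditions described above can be sketched as follows. This is a minimal sketch under stated assumptions: the class `Hypothesis`, the function names, the equal default weights, and the sample scores are all made up for illustration. The patent says only that the weights are determined experimentally.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    acoustic_score: float
    language_score: float


def recognition_score(h, w_acoustic=0.5, w_language=0.5):
    """Weighted sum of acoustic and language scores (weights are placeholders)."""
    return w_acoustic * h.acoustic_score + w_language * h.language_score


def select_recognized(hypotheses, top_n=None, threshold=None):
    """Rank hypotheses by recognition score and apply the chosen condition.

    top_n=1 keeps only the best text, top_n=k keeps the k best, and
    threshold keeps every text scoring at or above the given value.
    """
    ranked = sorted(hypotheses, key=recognition_score, reverse=True)
    if threshold is not None:
        ranked = [h for h in ranked if recognition_score(h) >= threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked


# Made-up scores for two acoustically similar candidates.
candidates = [
    Hypothesis("mark list", acoustic_score=0.7, language_score=0.2),
    Hypothesis("mark set", acoustic_score=0.8, language_score=0.6),
]
best = select_recognized(candidates, top_n=1)
above = select_recognized(candidates, threshold=0.5)
```

Returning a ranked list rather than a single text keeps all three conditions (best only, top k, or threshold) behind one function.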

Next, in STEP 3, the voice interaction unit 1 executes the syntax analysis process, which understands the meaning of the utterance from the recognized text. Specifically, the voice interaction unit 1 uses the syntax model 17 to determine the classified text from the recognized text.

First, using the data of the entire syntax model 17, the voice interaction unit 1 calculates, for each word contained in the recognized text, the likelihood of each domain for that single word. Next, it determines the domain for each single word based on that likelihood. It then uses the portion of the syntax model 17 classified under the determined domain type to calculate the likelihood (word score) of each class set (classified text) for the single word. Finally, it determines the classified text for the single word based on that word score.

Similarly, for each two-word sequence contained in the recognized text, the voice interaction unit 1 calculates the likelihood of each domain for the two words and determines the domain for the two words based on that likelihood. It then calculates the likelihood (two-word score) of each class set for the two words and determines the class set (classified text) for the two words based on that two-word score. Likewise, for each three-word sequence contained in the recognized text, it calculates the likelihood of each domain for the three words and determines the domain for the three words based on that likelihood. It then calculates the likelihood (three-word score) of each class set for the three words and determines the class set (classified text) for the three words based on that three-word score.

Next, based on the class sets determined for single words, two-word sequences, and three-word sequences and on the scores of those class sets (word scores, two-word scores, and three-word scores), the voice interaction unit 1 calculates the likelihood (parsing score) of each class set for the recognized text as a whole. It then determines the class set (classified text) for the whole recognized text based on that parsing score.

The process of determining the classified text using the syntax model 17 is now described using the example shown in FIG. 6. In the example of FIG. 6, the recognized text is "AC on floor to defrost".

In this case, using the entire syntax model 17, the likelihood of each domain for a single word is calculated with the unigram for each of "AC", "on", ..., "defrost". The domain for each single word is then determined based on that likelihood. For example, the first-ranked (highest-likelihood) domain is determined to be {Climate} for "AC", {Ambiguous} for "on", and {Climate} for "defrost".

Further, using the portion of the syntax model 17 classified under the determined domain type, the likelihood of each class set for a single word is calculated with the unigram for each of "AC", "on", ..., "defrost". The class set for each single word is then determined based on that likelihood. For example, for "AC", the first-ranked (highest-likelihood) class set is determined to be {Climate_ACOnOff_On}, and the likelihood (word score) i1 for this class set is obtained. Similarly, class sets are determined for "on", ..., "defrost", and the likelihoods (word scores) i2 to i5 for those class sets are obtained.

Similarly, with the bigram, the likelihood of each domain for two words is calculated for each of "AC on", "on floor", ..., "to defrost", and the domain for the two words is determined based on that likelihood. The class sets for the two-word sequences and their likelihoods (two-word scores) j1 to j4 are then determined. Likewise, with the trigram, the likelihood of each domain for three words is calculated for each of "AC on floor", "on floor to", and "floor to defrost", and the domain for the three words is determined based on that likelihood. The class sets for the three-word sequences and their likelihoods (three-word scores) k1 to k3 are then determined.

Next, for each class set determined for single words, two-word sequences, and three-word sequences, the likelihood (parsing score) of that class set for the text as a whole is calculated as, for example, the sum of the word scores i1 to i5, two-word scores j1 to j4, and three-word scores k1 to k3 that belong to that class set. For example, the parsing score for {Climate_Fan-Vent_Floor} is i3 + j2 + j3 + k1 + k2, the parsing score for {Climate_ACOnOff_On} is i1 + j1, and the parsing score for {Climate_Defrost_Front} is i5 + j4. The class sets (classified texts) for the text as a whole are then determined based on the calculated parsing scores. In this way, classified texts such as {Climate_Defrost_Front}, {Climate_Fan-Vent_Floor}, and {Climate_ACOnOff_On} are determined from the recognized text.
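The per-class summation in this example can be sketched in Python as below. The numeric values are invented placeholders for the symbolic scores i1, j1, and so on, chosen only so that the arithmetic is easy to follow. The function name `parsing_scores` and the tuple layout are likewise assumptions. Scores that the text does not attribute to one of the three class sets (i2, i4, k3) are omitted.

```python
from collections import defaultdict


def parsing_scores(scored_ngrams):
    """Sum the N-gram scores belonging to each class set.

    scored_ngrams: iterable of (ngram text, class set, score) tuples.
    """
    totals = defaultdict(float)
    for _ngram, class_set, score in scored_ngrams:
        totals[class_set] += score
    return dict(totals)


# Placeholder values standing in for the symbolic scores in the text.
SCORED = [
    ("AC", "Climate_ACOnOff_On", 0.25),               # i1
    ("AC on", "Climate_ACOnOff_On", 0.25),            # j1
    ("floor", "Climate_Fan-Vent_Floor", 0.125),       # i3
    ("on floor", "Climate_Fan-Vent_Floor", 0.125),    # j2
    ("floor to", "Climate_Fan-Vent_Floor", 0.125),    # j3
    ("AC on floor", "Climate_Fan-Vent_Floor", 0.125), # k1
    ("on floor to", "Climate_Fan-Vent_Floor", 0.125), # k2
    ("defrost", "Climate_Defrost_Front", 0.5),        # i5
    ("to defrost", "Climate_Defrost_Front", 0.5),     # j4
]

totals = parsing_scores(SCORED)
best_class_set = max(totals, key=totals.get)
```

With these placeholder values the highest-scoring class set is `Climate_Defrost_Front`. A real ranking depends entirely on the trained syntax model 17.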

Next, the voice interaction unit 1 outputs the classified text (Categorized Text) whose calculated parsing score satisfies a predetermined condition, together with the confidence (parsing score) of that operation candidate, as the operation candidate specified based on the recognition result of the input utterance. The predetermined condition is set in advance as, for example, the text with the highest parsing score, the texts ranked from the top down to a predetermined rank by parsing score, or the texts whose parsing score is at or above a predetermined value. For example, when the utterance "AC on floor to defrost" is input as described above, {Climate_Defrost_Front} is output as an operation candidate together with its parsing score.

Next, in STEP 4, the voice interaction unit 1 determines whether the operation content cannot be specified from the operation candidate. Specifically, the voice interaction unit 1 checks whether the classified text obtained from the parsing process in STEP 3 contains any class other than the domain type at its head; if it contains none, the operation content is not specified. If the determination result in STEP 4 is YES (the operation content is not specified), the process proceeds to STEP 10, and the result of the parsing process in STEP 3 is output as it is as the operation candidate group.

If the determination result in STEP 4 is NO (the operation content is specified), the process proceeds to STEP 5, and the voice interaction unit 1 determines whether the operation target cannot be specified from the operation candidate. Specifically, the voice interaction unit 1 determines whether the domain type of the classified text obtained from the parsing process in STEP 3 is {Ambiguous}. If the determination result in STEP 5 is NO (the domain type is not {Ambiguous}, that is, the operation target is specified), the process proceeds to STEP 10, and the result of the parsing process in STEP 3 is output as it is as the operation candidate group.

If the determination result in STEP 5 is YES (the domain type is {Ambiguous}, that is, the operation target is not specified), the process proceeds to STEP 6, and the voice interaction unit 1 determines whether the text recognized in STEP 2 contains an ambiguous word. Specifically, it determines whether the text contains a general demonstrative such as “this”, “it”, or “that”. If the determination result in STEP 6 is NO (no ambiguous word is contained), the process proceeds to STEP 10, and the result of the parsing process in STEP 3 is output as it is as the operation candidate group. Through the determinations in STEPs 4 to 6, an ambiguous word contained in the recognized text is detected.
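As a rough sketch, the three determinations can be written as one function that returns the detected ambiguous word, or `None` when the process would go straight to STEP 10. The function name, the English word list, and the underscore-separated encoding of the classified text are assumptions for illustration.

```python
AMBIGUOUS_WORDS = {"this", "it", "that"}  # assumed English stand-ins


def find_ambiguous_word(classified_text, recognized_words):
    """Return the ambiguous word to replace, or None to skip to STEP 10."""
    domain, *other_classes = classified_text.split("_")
    if not other_classes:       # STEP 4 is YES: operation content not specified
        return None
    if domain != "Ambiguous":   # STEP 5 is NO: operation target already specified
        return None
    for word in recognized_words:   # STEP 6: look for a demonstrative
        if word in AMBIGUOUS_WORDS:
            return word
    return None                 # STEP 6 is NO: no ambiguous word


detected = find_ambiguous_word("Ambiguous_Setting_Up", ["make", "it", "stronger"])
```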

If the determination result in STEP 6 is YES (an ambiguous word is contained), the process proceeds to STEP 7, and the voice interaction unit 1 replaces the ambiguous word contained in the recognized text with direct words that indicate operation target candidates. First, the voice interaction unit 1 reads the operation target candidates from the conversion candidate database 34 based on the ambiguous word and the classified text. Then, the voice interaction unit 1 replaces the ambiguous word portion of the recognized text with each of the operation target candidates that were read.
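The lookup and replacement in STEP 7 could look like the sketch below. The dictionary layout of the conversion candidate database, the function name, and the English tokens are assumptions; the actual structure of database 34 is the one shown in FIG. 3.

```python
# Assumed layout: (ambiguous word, classified text) -> operation target candidates.
CONVERSION_CANDIDATES = {
    ("it", "Ambiguous_Setting_Up"): [
        "song", "station", "disc", "volume",
        "temperature", "fan", "scroll", "scale",
    ],
}


def replace_ambiguous_word(words, ambiguous_word, classified_text,
                           database=CONVERSION_CANDIDATES):
    """Build one replaced text per candidate read from the database."""
    candidates = database.get((ambiguous_word, classified_text), [])
    return [
        [candidate if word == ambiguous_word else word for word in words]
        for candidate in candidates
    ]


replaced_texts = replace_ambiguous_word(
    ["make", "it", "stronger"], "it", "Ambiguous_Setting_Up")
```

Each replaced text is then passed back through the parsing process, so the database only needs to propose candidates; it does not have to rank them.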

Next, in STEP 8, the voice interaction unit 1 executes the parsing process on each replaced text in the same manner as in STEP 3. As a result, a classified text is determined for each replaced text, together with its parsing score. This parsing score indicates the validity of the replaced text.

Next, in STEP 9, from the classified texts obtained in STEP 8, the voice interaction unit 1 outputs each classified text (Categorized Text) whose parsing score satisfies a predetermined condition as an operation candidate specified from the recognition result of the input utterance, together with the confidence (parsing score) of that operation candidate. The predetermined condition is set in advance, for example, as the text with the highest parsing score, the texts ranked from the top down to a predetermined rank, or the texts whose parsing score is at or above a predetermined value. This makes it highly probable that the operation candidate group includes the operation candidate that matches the user's utterance.

The processing of STEPs 4 to 9 is now described using the example shown in FIG. 7. In this example, as shown in FIG. 7(a), the utterance “make it stronger” is input from the driver in STEP 1. In this case, as shown in FIG. 7(b), the text recognized by the speech recognition process in STEP 2 is “make it stronger”, and, as shown in FIG. 7(c), the classified text {Ambiguous_Setting_Up} is obtained as the result of the parsing process in STEP 3.

Here, {Ambiguous_Setting_Up} contains the class {Setting_Up} in addition to the domain type at its head, so the operation content is specified; the determination result in STEP 4 is therefore NO, and the process proceeds to STEP 5. Next, the domain type of {Ambiguous_Setting_Up} is {Ambiguous} and the operation target is not specified, so the determination result in STEP 5 is YES, and the process proceeds to STEP 6. The text “make it stronger” contains the ambiguous word “it”, so the determination result in STEP 6 is YES. As a result, the word “it” contained in the recognized text is detected as an ambiguous word.

Next, in STEP 7, based on the ambiguous word “it” and the classified text {Ambiguous_Setting_Up}, “song”, “station”, “disc”, “volume”, “temperature”, “fan”, “scroll”, and “scale” are read as conversion candidates from the conversion candidate database 34 shown in FIG. 3. The ambiguous word in the text “make it stronger” is then replaced with each of these conversion candidates. As a result of this replacement process, as shown in FIG. 7(d), “make the song stronger”, “make the volume stronger”, “make the fan stronger”, “make the scale stronger”, and so on are obtained as replaced texts A_1 to A_7.

Next, in STEP 8, the parsing process is executed on the replaced texts A_1 to A_7, and, as shown in the table of FIG. 7(d), classified texts B_1 to B_7 such as {Audio_CD_NextTrack} are obtained together with their parsing scores Sc(B_1) to Sc(B_7).

Next, in STEP 9, the validity of the replaced texts A_1 to A_7 is evaluated based on the parsing scores Sc(B_1) to Sc(B_7), and the operation candidates are specified and output. For example, texts such as “make the song stronger” and “make the scale stronger” are unnatural as word sequences, so their parsing scores are low, as shown in FIG. 7(d). In contrast, texts such as “make the volume stronger” and “make the fan stronger” are relatively plausible word sequences, so their parsing scores are high, as shown in FIG. 7(d). Therefore, if the predetermined threshold is set to 0.2, for example, the parsing results with low parsing scores are excluded from the operation candidates. As a result, the first operation candidate L_1 {Audio_Volume_Up} and the second operation candidate L_2 {Climate_Fan_Up} are output as the operation candidate group, as shown in FIG. 7(e).
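STEPs 8 and 9 for this example can be sketched as re-parsing each replaced text and discarding results below the 0.2 threshold. The parser here is a stub table: its scores and the class name `Navigation_Scale_Up` are invented for illustration, and the real values are those in FIG. 7(d).

```python
def reparse_and_filter(replaced_texts, parse, threshold=0.2):
    """Parse each replaced text and keep results scoring at or above threshold."""
    results = [parse(text) for text in replaced_texts]
    kept = [(classified, score) for classified, score in results
            if score >= threshold]
    return sorted(kept, key=lambda result: result[1], reverse=True)


# Stub parser with hypothetical (classified text, parsing score) results.
STUB_PARSER = {
    "make the song stronger": ("Audio_CD_NextTrack", 0.1),
    "make the volume stronger": ("Audio_Volume_Up", 0.6),
    "make the fan stronger": ("Climate_Fan_Up", 0.5),
    "make the scale stronger": ("Navigation_Scale_Up", 0.05),
}

operation_candidates = reparse_and_filter(
    list(STUB_PARSER), STUB_PARSER.__getitem__, threshold=0.2)
```

Under these assumed scores, only {Audio_Volume_Up} and {Climate_Fan_Up} survive, in that order, matching the candidates L_1 and L_2 described above.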

Returning to FIG. 4, next, in STEP 11, the voice interaction unit 1 acquires the detected values of the state of the vehicle 10 (the traveling state of the vehicle 10, the state of the devices mounted on the vehicle 10, the state of the driver of the vehicle 10, and so on) detected by the vehicle state detection unit 3.

Next, in STEP 12, the voice interaction unit 1 executes a process of specifying the final operation candidate from the operation candidate group output in STEP 9 or STEP 10. Specifically, the voice interaction unit 1 displays the one or more operation candidates obtained in STEP 9 or STEP 10 on the touch panel 24 so as to prompt the driver to choose among them. At the same time, the voice interaction unit 1 synthesizes, in the voice synthesis unit 14, a voice guide prompting the driver to select an operation candidate, and outputs it from the speaker 4.

For example, in the example of FIG. 7 described above, as shown in FIG. 8, the display “Increase volume” corresponding to the first operation candidate L_1 and the display “Increase the air volume of the air conditioner” corresponding to the second operation candidate L_2 are shown on the touch panel 24, and a voice prompting the driver to confirm whether the first and second operation candidates L_1 and L_2 are correct is output from the speaker. The first operation candidate L_1 and the second operation candidate L_2 are displayed on the touch panel 24 either simultaneously or sequentially. For example, as shown in FIG. 8(a), the first operation candidate L_1, “Increase volume”, and the second operation candidate L_2, “Increase the air volume of the air conditioner”, are displayed on the touch panel 24 at the same time. Alternatively, as shown in FIG. 8(b), the first operation candidate L_1, “Increase volume”, is displayed on the touch panel 24 first, and, in response to a touch on the Next button displayed on the touch panel 24, the second operation candidate L_2, “Increase the air volume of the air conditioner”, is displayed on the touch panel 24 as shown in FIG. 8(c).

Because the parsing process in STEPs 4 to 9 is performed on the texts in which the ambiguous word has been replaced, the meaning of the utterance is grasped more clearly when the operation candidates are specified. The operation candidates displayed on the touch panel 24 are therefore highly likely to include the one that matches the driver's utterance, and the driver can select an operation candidate that matches the intention of the utterance.

Next, the voice interaction unit 1 specifies the final operation candidate based on the driver's touch operation on the touch panel 24. For example, the driver performs a touch input that selects one of the operation candidates L_1 and L_2 displayed on the touch panel 24. The voice interaction unit 1 then specifies the operation candidate selected by this touch input as the final operation candidate. In this way, the operation candidate that matches the driver's utterance is specified as the final operation candidate.

Next, in STEP 13, the voice interaction unit 1 uses the scenario database 18 to determine a scenario for response output to the driver and for device control, based on the operation candidate specified in STEP 12 and the state of the vehicle 10 detected in STEP 11.

First, the voice interaction unit 1 acquires information for controlling the target from the specified operation candidate. As shown in FIG. 9, the voice interaction unit 1 has a plurality of forms that store information for controlling a target. Each form has a predetermined number of slots corresponding to the classes of information it requires. For example, “Plot a route”, “Traffic info.”, and other forms are provided to store information for controlling the navigation system 6b, and “Climate control” and other forms are provided to store information for controlling the air conditioner 6c. The form “Plot a route” has four slots: “From”, “To”, “Request”, and “via”.

The voice interaction unit 1 enters values into the slots of the corresponding form based on the operation candidate specified from the recognition result of each utterance in the dialogue with the driver. It also calculates a confidence for each form (the degree of trust in the values entered in the form) and records it in the form. The form confidence is calculated based on, for example, the confidence of the operation candidate specified from the recognition result of each utterance and how many of the form's slots are filled. For example, as shown in FIG. 10, when the driver inputs the utterance “Guide me to Chitose Airport by the shortest route”, the values “here”, “Chitose Airport”, and “shortest” are entered in the three slots “From”, “To”, and “Request” of the form “Plot a route”. The calculated form confidence of 80 is recorded in “Score” of the form “Plot a route”.
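A form with slots and a recorded score might be modeled as in the sketch below. The class and method names are hypothetical, and the scoring formula (an equal-weight mix of candidate confidence and slot fill ratio, scaled to 0 to 100) is purely an assumption; the text only says that both quantities contribute.

```python
from dataclasses import dataclass


@dataclass
class Form:
    name: str
    slots: dict      # slot name -> value, or None while the slot is empty
    score: int = 0   # form confidence on an assumed 0-100 scale

    def fill(self, values, candidate_confidence):
        """Enter slot values and recompute the form confidence."""
        for slot, value in values.items():
            if slot not in self.slots:
                raise KeyError(f"{self.name} has no slot {slot!r}")
            self.slots[slot] = value
        filled = sum(v is not None for v in self.slots.values())
        fill_ratio = filled / len(self.slots)
        # Assumed formula: equal weighting of the two factors.
        self.score = round(100 * (0.5 * candidate_confidence + 0.5 * fill_ratio))

    def empty_slots(self):
        return [slot for slot, v in self.slots.items() if v is None]


route = Form("Plot a route",
             {"From": None, "To": None, "Request": None, "via": None})
route.fill({"From": "here", "To": "Chitose Airport", "Request": "shortest"},
           candidate_confidence=0.75)
```

Keeping the empty slots queryable is what lets a later step decide which slot to ask the driver about next.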

Next, the voice interaction unit 1 selects the form to be used for the actual control process based on the form confidences. Then, based on the selected form, it determines the scenario using the data stored in the scenario database 18. As shown in FIG. 10, the scenario database 18 stores, for example, response sentences to be output to the driver, classified by slot filling state and by level. The level is a value set based on, for example, the form confidence and the state of the vehicle 10 (the traveling state of the vehicle 10, the state of the driver, and so on).

For example, when the selected form has an empty slot (a slot with no value entered), a scenario is determined that outputs a response sentence prompting the driver to provide input for the empty slot. The response sentence that prompts the driver's next utterance is chosen according to the level, that is, taking into account the form confidence and the state of the vehicle 10. For example, when the driver's driving load is considered high, a response sentence that asks for fewer slots is chosen. Prompting the user's next utterance with a response sentence determined in this way makes the dialogue efficient.

In the example shown in FIG. 10, values have been entered in the first to third slots “From”, “To”, and “Request” of the form “Plot a route”, and no value has been entered in the fourth slot “via”. The level is set to 2. In this case, the response template “Set <To> with <Request>” is selected from the scenario database 18, and the content of the response sentence is determined to be “Set Chitose Airport with highway priority”.
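One way to picture the lookup is a table keyed by the set of filled slots and the level, whose templates have their `<Slot>` placeholders substituted. The table layout, the function name, and the English template wording are assumptions for illustration; the actual organization of scenario database 18 is the one in FIG. 10.

```python
import re

# Assumed layout: (filled slot names, level) -> response template.
SCENARIOS = {
    (frozenset({"From", "To", "Request"}), 2): "Set <To> with <Request>",
}


def choose_response(slots, level, scenarios=SCENARIOS):
    """Pick the template for this slot state and level, then fill it in."""
    filled = frozenset(name for name, value in slots.items()
                       if value is not None)
    template = scenarios[(filled, level)]
    # Replace each <Name> placeholder with the value of the slot Name.
    return re.sub(r"<(\w+)>", lambda match: slots[match.group(1)], template)


slots = {"From": "here", "To": "Chitose Airport",
         "Request": "highway priority", "via": None}
response = choose_response(slots, level=2)
```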

Also, for example, when every slot in the selected form is filled (has a value entered), a scenario is determined that outputs a response sentence confirming the content (for example, a response sentence that reports the value entered in each slot to the driver).

Next, in STEP 14, the voice interaction unit 1 determines, based on the determined scenario, whether the dialogue with the driver has ended. If the determination result in STEP 14 is NO, the process proceeds to STEP 15, and the voice interaction unit 1 synthesizes speech according to the content of the determined response sentence and the conditions for outputting it. Then, in STEP 16, the generated response sentence (for example, a response sentence prompting the driver's next utterance) is output from the speaker 4.

Next, the process returns to STEP 1, and a second utterance is input from the driver. Then, in STEPs 2 to 13, the voice interaction unit 1 processes the second utterance in the same way as the first. In STEP 2, for example, when the domain type has been determined based on the recognition result of the driver's previous utterance, the voice interaction unit 1 may determine the text using only the portion of the language model 16 that is classified under that domain type. Next, in STEP 14, the voice interaction unit 1 determines whether the dialogue with the driver has ended. If the determination result in STEP 14 is NO, the voice interaction unit 1 executes the processing of STEPs 15 and 16 as it did for the first utterance.

Thereafter, the same processing as STEPs 1 to 16 for the second utterance described above is repeated until the determination result in STEP 14 becomes YES.

If the determination result in STEP 14 is YES, the process proceeds to STEP 17, and the voice interaction unit 1 synthesizes the speech of the determined response sentence (for example, a response sentence reporting the content of the device control). Next, in STEP 18, the response sentence is output from the speaker 4. Next, in STEP 19, the voice interaction unit 1 controls the device based on the determined scenario and ends the voice interaction process.

Through the above processing, the content indicated by an ambiguous word in the utterance is appropriately estimated and the driver's intention is grasped more clearly, so the device is controlled through an efficient dialogue.

In the present embodiment, in the process of specifying the final operation candidate in STEP 12, the selection of an operation candidate and the changing of the displayed operation candidate are performed by the driver's touch input on the touch panel 24. However, these may instead be performed by the driver's voice input (for example, an utterance selecting one of the operation candidates, or the utterance “next”). Furthermore, an input interface (keyboard, buttons, switches, multi-jog dial, or the like) may be provided so that the selection of an operation candidate and the like are performed by input to that interface. In these cases, the operation candidates may be displayed on a part of the display 5 other than the touch panel 24.

[Second Embodiment]

Next, the speech recognition apparatus according to the second embodiment of the present invention will be described. The present embodiment differs from the first embodiment only in the process of specifying the final operation candidate in STEP 12 of the voice interaction process. Since the configuration of the present embodiment is otherwise the same as that of the first embodiment, the same components are given the same reference numerals, and their description is omitted below.

In the speech recognition apparatus of the present embodiment, the operation specifying means 31 specifies the final operation candidate from the specified operation candidate group based on the states of the devices 6a to 6c mounted on the vehicle 10, as detected by the vehicle state detection unit 3. The rest of the configuration is the same as in the first embodiment. The vehicle state detection unit 3 corresponds to the device state detection unit of the present invention.

In the operation of the speech recognition apparatus of the present embodiment (the voice interaction process), in STEP 12, the voice interaction unit 1 first acquires, from the detection results of the vehicle state detection unit 3, the states of those devices among the devices 6a to 6c that are related to the operation candidate group obtained in STEP 9 or STEP 10. Then, based on the states of those devices, it determines from the operation candidate group the operation candidate that is more likely to match the driver's intention.

The case in which the first and second operation candidates L_1 and L_2 are obtained as shown in FIG. 7 above is described as an example. In this case, the devices related to the operation candidate group are the audio 6a and the air conditioner 6c, so the volume of the audio 6a and the air volume of the air conditioner 6c are acquired. Suppose that the volume of the audio 6a is at its maximum value and that the air volume of the air conditioner 6c is 3 against a maximum of 5. Since the volume cannot be raised above its current state, the utterance is estimated to be an operation of the air conditioner 6c. Therefore, {Climate_Fan_Up} is determined as the final operation candidate. The other operations are the same as in the first embodiment.
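The reasoning in this example amounts to dropping candidates that the current device state makes impossible. A minimal sketch follows; the function name, the `(current, minimum, maximum)` state layout, the volume range of 0 to 40, and the fan minimum of 1 are all assumptions, while the fan level of 3 out of 5 comes from the example.

```python
def feasible_candidates(candidates, device_state):
    """Keep only candidates that can still be carried out on the device."""
    kept = []
    for candidate in candidates:
        domain, setting, direction = candidate.split("_")
        current, minimum, maximum = device_state[(domain, setting)]
        if direction == "Up" and current >= maximum:
            continue   # already at the maximum, cannot go up
        if direction == "Down" and current <= minimum:
            continue   # already at the minimum, cannot go down
        kept.append(candidate)
    return kept


# (current, minimum, maximum); volume figures are hypothetical.
device_state = {
    ("Audio", "Volume"): (40, 0, 40),
    ("Climate", "Fan"): (3, 1, 5),
}
final = feasible_candidates(["Audio_Volume_Up", "Climate_Fan_Up"], device_state)
```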

According to the speech recognition apparatus of the present embodiment, as in the first embodiment, the content indicated by an ambiguous word in the utterance is appropriately estimated and the driver's intention is grasped more clearly, so the device is controlled through an efficient dialogue.

[Third Embodiment]

Next, the speech recognition apparatus according to the third embodiment of the present invention will be described. The present embodiment differs from the first embodiment only in the process of specifying the final operation candidate in STEP 12 of the voice interaction process. Since the configuration of the present embodiment is otherwise the same as that of the first embodiment, the same components are given the same reference numerals, and their description is omitted below.

The speech recognition apparatus of the present embodiment includes an operation history storage unit (operation history storage means) 35 that stores a history of operation targets and operation contents (the operation history). The operation history storage unit 35 stores, as data, the contents of the driver's operations of the devices 6a to 6c together with the date and time of each operation. From these data, the driver's frequency of use, number of uses, and so on of the devices 6a to 6c can also be determined.

The operation specifying means 31 specifies the final operation candidate from the specified operation candidate group based on the driver's operation history of the devices 6a to 6c stored in the operation history storage unit 35. Specifically, for example, it specifies the operation candidate based on the operation the driver performed most recently (the previous operation), the operation performed immediately before that (the operation two steps back), and so on. The rest of the configuration is the same as in the first embodiment.

In the operation of the speech recognition apparatus of the present embodiment (the voice interaction process), in STEP 12, the voice interaction unit 1 first reads the driver's operation history from the operation history storage unit 35. Then, based on that operation history, it specifies, from the one or more operation candidates obtained in STEP 9 or STEP 10, the operation candidate that is more likely to match the driver's intention.

The case in which the first and second operation candidates L_1 and L_2 are obtained as shown in FIG. 7 above is described as an example. In this case, it is read from the operation history storage unit 35 that the previous operation was an operation to lower the temperature setting of the air conditioner 6c and that the operation two steps back was an operation to turn the air conditioner 6c ON. Since operations on the air conditioner 6c have been performed in succession, the operation intended by the current utterance is also estimated to be, with high probability, an operation on the air conditioner 6c. Therefore, {Climate_Fan_Up} is specified as the final operation candidate. The other operations are the same as in the first embodiment.
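The history-based estimate in this example can be sketched as preferring the candidate whose domain appears most often among the most recent operations. The function name, the `lookback` parameter, and the class name `Climate_Temperature_Down` are hypothetical; the text does not prescribe a particular scoring rule.

```python
def pick_by_history(candidates, history, lookback=2):
    """Prefer the candidate whose domain occurs most in recent operations."""
    recent_domains = [operation.split("_")[0]
                      for operation in history[-lookback:]]
    # max() returns the first candidate on a tie, preserving the input order.
    return max(candidates,
               key=lambda c: recent_domains.count(c.split("_")[0]))


# Oldest first: A/C switched on, then temperature lowered (names assumed).
history = ["Climate_ACOnOff_On", "Climate_Temperature_Down"]
chosen = pick_by_history(["Audio_Volume_Up", "Climate_Fan_Up"], history)
```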

According to the speech recognition apparatus of the present embodiment, as in the first embodiment, the content indicated by an ambiguous word in the utterance is appropriately estimated and the driver's intention is grasped more clearly, so the device is controlled through an efficient dialogue.

For the process of specifying the final operation candidate in STEP 12 of the voice interaction process, two or all three of the processes of the first to third embodiments may be used in combination. For example, one or more highly probable operation candidates are determined from the operation candidate group obtained in STEP 9 or STEP 10 based on the device states and the operation history. The determined operation candidates are then displayed on the touch panel 24, and the final operation candidate is specified based on the driver's touch input.

In the first to third embodiments, the scenario control unit 13 determines the scenario based on the operation candidate specified from the utterance recognition result and the vehicle state detected by the vehicle state detection unit 3. However, the scenario control unit 13 may determine the control process from the operation candidate specified from the utterance recognition result alone.

In the first to third embodiments, the user who inputs speech is the driver of the vehicle 10, but the user may be an occupant other than the driver.

In the first to third embodiments, the speech recognition apparatus is mounted on the vehicle 10, but it may be mounted on a moving body other than a vehicle. Furthermore, the invention is not limited to moving bodies and is applicable to any system in which a user controls an object by utterance.

FIG. 1 is a functional block diagram of a speech recognition apparatus according to a first embodiment of the present invention.
FIG. 2 is an explanatory diagram showing the configuration of the language model and the syntax model of the speech recognition apparatus of FIG. 1.
FIG. 3 is an explanatory diagram showing the configuration of the conversion candidate database in the speech recognition apparatus of FIG. 1.
FIG. 4 is a flowchart showing the overall operation (voice dialogue process) of the speech recognition apparatus of FIG. 1.
FIG. 5 is an explanatory diagram showing the speech recognition process using the language model in the voice dialogue process of FIG. 4.
FIG. 6 is an explanatory diagram showing the syntax analysis process using the syntax model in the voice dialogue process of FIG. 4.
FIG. 7 is an explanatory diagram showing the process of replacing an ambiguous word and specifying operation candidates in the voice dialogue process of FIG. 4.
FIG. 8 is an explanatory diagram showing the process of specifying the final operation candidate in the voice dialogue process of FIG. 4.
FIG. 9 is an explanatory diagram showing the form used in the process of determining a scenario in the voice dialogue process of FIG. 4.
FIG. 10 is an explanatory diagram showing the process of determining a scenario in the voice dialogue process of FIG. 4.

Explanation of symbols

1: voice dialogue unit; 2: microphone; 3: vehicle state detection unit; 4: speaker; 5: display; 6a to 6c: devices; 10: vehicle; 11: speech recognition unit; 12: syntax analysis unit; 13: scenario control unit; 14: speech synthesis unit; 15: acoustic model; 16: language model; 17: syntax model; 18: scenario database; 19: phoneme model; 24: touch panel; 31: operation specifying means; 32: ambiguous word detecting means; 33: candidate extracting means; 34: conversion candidate database; 35: operation history storage unit.

Claims

A speech recognition apparatus that determines the operation content of an operation target based on a recognition result for input speech, comprising:
operation specifying means for performing a specifying process that specifies the operation target and the operation content by classifying the recognition result for the speech into predetermined types of operation target and operation content;
ambiguous word detecting means for detecting, from the recognition result for the speech, an ambiguous word that cannot be classified by the operation specifying means; and
candidate extracting means for extracting, as an operation target candidate, an operation target that may be specified by the ambiguous word detected by the ambiguous word detecting means,
wherein, when the operation specifying means can specify the operation content but cannot specify the operation target, the specifying process by the operation specifying means is performed with the ambiguous word detected by the ambiguous word detecting means replaced by the candidate extracted by the candidate extracting means.

The speech recognition apparatus according to claim 1,
wherein the candidate extracting means extracts, as an operation target candidate, an operation target for which the operation content specified by the operation specifying means is possible.

The speech recognition apparatus according to claim 1 or 2, further comprising:
notifying means for notifying, when there are a plurality of operation targets and operation contents specified by the operation specifying means, the plurality of operation targets and operation contents; and selecting means for selecting one each from among the plurality of operation targets and operation contents,
wherein the operation specifying means uses the operation target and the operation content selected by the selecting means as the final operation target and operation content.

The speech recognition apparatus according to claim 1 or 2, further comprising:
device state detecting means for detecting the state of a device that is the operation target,
wherein, when there are a plurality of operation targets and operation contents specified by the operation specifying means, the operation specifying means specifies one each from among the plurality of operation targets and operation contents based on the detection result of the device state detecting means.

The speech recognition apparatus according to claim 1 or 2, further comprising:
operation history storage means for storing a history of operation targets and operation contents,
wherein, when there are a plurality of operation targets and operation contents specified by the operation specifying means, the operation specifying means specifies one each from among the plurality of operation targets and operation contents based on the history stored in the operation history storage means.

A speech recognition method for determining the operation content of an operation target based on a recognition result for input speech, comprising:
a specifying step of specifying the operation target and the operation content by classifying the recognition result for the speech into predetermined types of operation target and operation content;
an ambiguous word detecting step of detecting, from the recognition result for the speech, an ambiguous word that could not be classified in the specifying step, when the operation content can be specified in the specifying step but the operation target cannot;
a candidate extraction step of extracting, as an operation target candidate, an operation target that may be specified by the ambiguous word detected in the ambiguous word detecting step; and
a second specifying step of specifying the operation content by replacing the ambiguous word detected in the ambiguous word detecting step with the candidate extracted in the candidate extraction step and classifying the recognition result after the replacement into the predetermined types of operation target and operation content.

A speech recognition program that causes a computer to execute processing for determining the operation content of an operation target based on a recognition result for input speech, the program having a function of causing the computer to execute:
a specifying process of specifying the operation target and the operation content by classifying the recognition result for the speech into predetermined types of operation target and operation content;
a detection process of detecting, from the recognition result for the speech, an ambiguous word that cannot be classified by the specifying process;
a candidate extraction process of extracting, as an operation target candidate, an operation target that may be specified by the ambiguous word detected in the detection process; and
when the operation content can be specified by the specifying process but the operation target cannot, the specifying process with the ambiguous word detected by the detection process replaced by the candidate extracted by the candidate extraction process.