JP2008191581A - Voice input support method and device, and navigation system

Info

Publication number: JP2008191581A
Application number: JP2007028326A
Authority: JP
Inventors: Hidekazu Arita; 英一有田
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2007-02-07
Filing date: 2007-02-07
Publication date: 2008-08-21

Abstract

PROBLEM TO BE SOLVED: To further improve operability and convenience by providing a mechanism that supports identification of the utterance content.

SOLUTION: A voice input support method comprises: a first step (ST23) of acquiring utterance content and performing speech recognition by searching a second dictionary (acoustic model DB 14) in which acoustic models corresponding to words are stored; and a second step (ST31) of searching a first dictionary (word dictionary DB 13) using a character string of the word, which is input and acquired when the speech recognition fails, and outputting a speech-recognizable word.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a voice input support method and device, and to a navigation system, that are particularly suitable for use in an in-vehicle device that performs facility search and destination setting.

Speech recognition devices that recognize speech uttered by a user through dictionary search and output synthesized speech have come into wide use. In the field of car navigation as well, such speech recognition devices are commonly used to improve operability and convenience. Speech recognition is usually realized by converting the uttered speech into feature values and then converting those feature values into a character string by a probabilistic, statistical method.

FIG. 6 shows the basic flow of speech recognition processing in a conventional speech recognition system. In FIG. 6, reference numeral 60 denotes a speech recognition device. Reference numeral 61 denotes a database of the words subject to speech recognition (hereinafter simply referred to as word dictionary DB 61), which stores character string data. Reference numeral 62 denotes a database of acoustic models corresponding to the word dictionary DB 61 (hereinafter simply referred to as acoustic model DB 62), which stores waveform feature data.

The speech recognition device 60 can recognize words, phrases, and sentences by searching the word dictionary DB 61 and the acoustic model DB 62 in combination. Specifically, the speech recognition device 60 performs waveform feature matching for speech recognition. This process determines the similarity between the waveform feature data of the user's utterance and the waveform feature data of each recognizable word, using a technique such as DP (Dynamic Programming) or HMM (Hidden Markov Model). As a result of the similarity determination, the waveform feature data with the maximum likelihood is output as the speech recognition result, and that result is passed to an application. Examples of applications include air conditioner control (temperature setting, air volume, air direction, and so on), operation of recording media such as CDs (Compact Discs) and DVDs (Digital Versatile Discs) (next track, previous track, and so on), and facility search and destination setting in a car navigation system.
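The waveform feature matching described above can be illustrated with a small sketch. It assumes one-dimensional feature sequences and uses the minimum DP (dynamic time warping) distance as a stand-in for the maximum-likelihood decision; the names `ACOUSTIC_MODEL_DB`, `dtw_distance`, and `recognize`, and all feature values, are hypothetical and not taken from the publication.

```python
import math

def dtw_distance(a, b):
    """DP (dynamic time warping) distance between two 1-D feature sequences."""
    n, m = len(a), len(b)
    # d[i][j] = cost of the best alignment of a[:i] with b[:j]
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

# Hypothetical acoustic model DB: recognizable word -> reference features.
ACOUSTIC_MODEL_DB = {
    "next track": [1.0, 3.0, 4.0, 2.0],
    "previous track": [5.0, 5.0, 1.0, 0.0],
}

def recognize(features):
    """Return the word whose reference features are closest to the input."""
    return min(
        ACOUSTIC_MODEL_DB,
        key=lambda word: dtw_distance(features, ACOUSTIC_MODEL_DB[word]),
    )

# The utterance is a time-stretched version of the "next track" reference.
result = recognize([1.0, 1.0, 3.0, 4.0, 2.0])
```

Because DP matching tolerates differences in speaking rate, the stretched input still aligns with its reference at zero cost. A practical recognizer would use multi-dimensional features and probabilistic models such as HMMs instead.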

Incidentally, to compensate for imperfect speech recognition when setting a destination in an in-vehicle device, a technique is known in which speech recognition is performed phoneme by phoneme, the recognition result is input as a character string, a database is searched (collated according to a predetermined distance measure), the top M closest matches are output, and the speaker selects one of them as the input (see, for example, Patent Document 1).

Patent Document 1: JP 2006-39954 A

However, the technique disclosed in Patent Document 1 performs speech recognition in units of phonemes and searches the database using kana as the key, so it is not effective for destinations with unusual readings, such as certain place names. For example, a user who wants to find 枚方駅 (read “hirakata-eki”, Hirakata Station) but does not live in the area may well read the name as “makikata-eki”. If the user utters that reading, 枚方駅 never appears among the top M matches of the database search, and as a result the user, as the speaker, cannot enter the destination by speech recognition.
Similarly, 三田 is read “Mita” in some regions and “Sanda” in others, so the expected destination cannot be found when the wrong reading is used.
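The limitation described above can be made concrete with a short sketch. It assumes a database keyed by kana reading, as in the prior-art approach, and contrasts it with a lookup on the written form; `FACILITIES_BY_READING` and both function names are hypothetical.

```python
# Hypothetical facility database keyed by kana reading (prior-art style).
FACILITIES_BY_READING = {
    "ひらかたえき": "枚方駅",  # correct reading: hirakata-eki
    "きょうとえき": "京都駅",
}

def lookup_by_reading(reading):
    """Exact lookup on the reading; a misreading finds nothing."""
    return FACILITIES_BY_READING.get(reading)

def lookup_by_written_form(text):
    """Search on the written name, which does not depend on the reading."""
    return sorted(
        name for name in FACILITIES_BY_READING.values() if text in name
    )

miss = lookup_by_reading("まきかたえき")  # plausible misreading of 枚方駅
hit = lookup_by_written_form("枚方")       # typed kanji still finds the entry
```

The misreading returns nothing, whereas the typed characters locate the entry regardless of how the user believes the name is pronounced.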

Moreover, with the technique disclosed in Patent Document 1, considering the effort needed to correct an imperfect kana input (locating the incorrect part, replacing it, and so on), entering the reading by hand from the start is likely to be faster in many cases. Furthermore, when a full-text search is performed, the complete character string (full spelling) of the destination or the like need not be entered. In that case reliable text input is performed from the start, so the “error-proneness database” that is essential to that configuration serves no purpose.
In addition, with the technique disclosed in Patent Document 1, speech recognition basically collates the input against acoustic models using a distance measure, so there are always multiple recognition candidates, and the user must select the most suitable destination from among the top M. In contrast, when the search is performed by character string input from the start, correct input is possible from the start, so the user need not make a selection when the search result is uniquely determined.

The present invention has been made to solve the problems described above, and its object is to obtain a voice input support method and device, and a navigation system, that further improve operability and convenience by providing a mechanism that supports identification of the utterance content.

To solve the problems described above, the voice input support method according to the present invention has a first step of acquiring utterance content and performing speech recognition by searching a second dictionary in which acoustic models corresponding to words are stored, and a second step of searching a first dictionary using a character string of the word, which is input and acquired when the speech recognition fails, and outputting a speech-recognizable word.

The voice input support device according to the present invention comprises: a storage unit that stores a first dictionary consisting of speech-recognizable words and a second dictionary consisting of acoustic models corresponding to those words; a speech recognition unit that acquires utterance content and performs speech recognition by searching the second dictionary stored in the storage unit; and a full-text search processing unit that searches the first dictionary stored in the storage unit using a character string of the word, which is input and acquired when the speech recognition fails, and outputs a speech-recognizable word.

The navigation system according to the present invention comprises: a storage unit that stores a first dictionary consisting of speech-recognizable facility names or destination names and a second dictionary consisting of acoustic models corresponding to those facility names or destination names; and a control unit that acquires utterance content consisting of a facility name or destination name, performs speech recognition by searching the second dictionary stored in the storage unit, searches the first dictionary stored in the storage unit using a character string of the facility name or destination name that is input and acquired when the speech recognition fails, outputs a speech-recognizable facility name or destination name, and performs navigation by facility name search or destination setting based on the output facility name or destination name.

According to the present invention, operability and convenience can be further improved by providing a mechanism that supports identification of the utterance content.

Embodiment 1.
FIG. 1 is a block diagram showing the internal configuration of a voice input support device according to Embodiment 1 of the present invention.
The voice input support device 10 according to Embodiment 1 of the present invention comprises a voice input acquisition unit 11, a speech recognition engine unit 12, a word dictionary DB (database) 13, an acoustic model DB 14, a grammar dictionary DB 15, an application program execution control unit 16 (hereinafter referred to as the application execution control unit 16), a key input acquisition unit 17, a full-text search processing unit 18, and a search result output unit 19.

The voice input acquisition unit 11 acquires the voice input of the user's utterance collected via a microphone (not shown), converts it into waveform feature data, and supplies the data to the speech recognition engine unit 12. The speech recognition engine unit 12 recognizes phrases and sentences by searching, in combination, the word dictionary DB 13, which stores text data, and the acoustic model DB 14, which stores the waveform feature data needed to compare similarity with the features extracted from the speech waveform. The speech recognition engine unit 12 may additionally use the grammar data stored in the grammar dictionary DB 15 to check the set of words, phrases, and sentences generated by combining the word dictionary DB 13 and the acoustic model DB 14.
That is, the speech recognition engine unit 12 uses the well-known DP or HMM technique to match the waveform feature data of the user's utterance against the waveform feature data of the recognizable words, and passes the word, phrase, or sentence corresponding to the waveform feature data with the maximum likelihood to the application execution control unit 16 as the speech recognition result. The application execution control unit 16 performs air conditioner control, operation of media such as CDs and DVDs, or facility search and destination setting in a car navigation system.

Meanwhile, the key input acquisition unit 17 acquires a character string entered on a soft keyboard (not shown) and supplies it to the full-text search processing unit 18. The full-text search processing unit 18 performs a full-text search of the word dictionary DB 13 using that character string as the key and outputs the search result to the search result output unit 19, which displays it on a display (not shown) or plays its pronunciation to the user through a speaker or the like (not shown). In this way, the user can learn which words, phrases, or sentences the speech recognition device can recognize and how they are pronounced.
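A minimal sketch of this key-input path follows. It assumes that each word dictionary entry carries its written form together with a reading, and it models the full-text search as a simple substring match; the publication does not specify the search algorithm, and `Entry`, `WORD_DICTIONARY`, `full_text_search`, and `format_results` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    text: str     # written form stored in the word dictionary
    reading: str  # pronunciation to show or speak to the user

# Hypothetical contents of the word dictionary.
WORD_DICTIONARY = [
    Entry("国立京都国際会館", "こくりつきょうとこくさいかいかん"),
    Entry("京都会館", "きょうとかいかん"),
    Entry("枚方駅", "ひらかたえき"),
]

def full_text_search(query, dictionary=WORD_DICTIONARY):
    """Return every entry whose written form contains the typed string."""
    return [entry for entry in dictionary if query in entry.text]

def format_results(entries):
    """Pair each recognizable word with its reading for presentation."""
    return [f"{entry.text} ({entry.reading})" for entry in entries]

lines = format_results(full_text_search("京都"))
```

Presenting the reading next to each hit is what lets the user learn how a recognizable word should be spoken on the next attempt.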

FIG. 2 shows the flow of operation of the voice input support device according to Embodiment 1 of the present invention. FIG. 2 also shows the individual steps of the voice input support method according to Embodiment 1 of the present invention.
In FIG. 2, the user first utters the reading (speech) needed to run an application, for example the reading needed for facility name search or destination setting (step ST20). In response, the voice input support device 10 acquires the user's utterance as waveform feature data through the voice input acquisition unit 11 and supplies it to the speech recognition engine unit 12 (step ST21). The speech recognition engine unit 12 refers to the acoustic model DB 14 (step ST22), matches the acquired waveform feature data of the user's utterance against it using the DP or HMM technique, and passes the waveform feature data with the maximum likelihood to the application execution control unit 16 as the speech recognition result (synthesized speech) (steps ST23 and ST24). At this time, the speech recognition engine unit 12 may use the grammar data stored in the grammar dictionary DB 15 to check the set of words, phrases, and sentences generated by combining the word dictionary DB 13 and the acoustic model DB 14 (step ST25).

Based on the recognition result passed to it, the application execution control unit 16 runs an application such as facility name search or destination setting and outputs the result to a display, speaker, or the like (not shown). When the user sees or hears this output and confirms that speech recognition has failed, the user operates a soft keyboard (not shown) to enter characters relating to the utterance content (the name of the facility, destination, or the like) (step ST26).
In response, the voice input support device 10 acquires the entered characters through the key input acquisition unit 17 and supplies them to the full-text search processing unit 18 (step ST27). The full-text search processing unit 18 then performs a full-text search of the word dictionary DB 13 using the acquired character string as the key and outputs the search result to the search result output unit 19 (step ST29). At this time, the full-text search processing unit 18 may also refer to the grammar dictionary DB 15 and combine a grammar check with the search before outputting the result (step ST30). The search result output unit 19 displays the search result on the display or outputs its pronunciation as audio through a speaker or the like (step ST31). In this way, the user can learn which words, phrases, or sentences the speech recognition engine unit 12 can recognize and how they are pronounced (step ST32).
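The overall control flow of steps ST20 to ST32 can be sketched as follows. This is a simplification under stated assumptions: the recognizer is replaced by a lookup on the uttered reading, and recognition failure is modeled as a `None` result, whereas in the description it is the user who judges failure from the application output. All names and data are hypothetical.

```python
# Hypothetical stand-in for the word dictionary (written forms).
WORD_DICTIONARY = ["国立京都国際会館", "京都会館"]

# Hypothetical stand-in for the acoustic models: reading -> word.
ACOUSTIC_MODELS = {
    "こくりつきょうとこくさいかいかん": "国立京都国際会館",
    "きょうとかいかん": "京都会館",
}

def recognize(utterance):
    """Stand-in for ST21 to ST24; None means no word was recognized."""
    return ACOUSTIC_MODELS.get(utterance)

def full_text_search(typed):
    """Stand-in for ST27 and ST29: substring search of the word dictionary."""
    return [word for word in WORD_DICTIONARY if typed in word]

def voice_input_with_support(utterance, typed_fallback):
    """Try speech first; on failure, suggest recognizable words from text."""
    word = recognize(utterance)
    if word is not None:
        return ("recognized", [word])
    # ST26 onward: the user types characters and the device lists
    # the speech-recognizable words that match them.
    return ("suggested", full_text_search(typed_fallback))

ok = voice_input_with_support("きょうとかいかん", "")
fallback = voice_input_with_support("きょうとこくさいかいぎじょう", "国際")
```

The second call models a user who utters a name that is not in the dictionary and then types part of it; the suggestion returned is the word the recognizer would have accepted.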

As described above, the voice input support device according to Embodiment 1 of the present invention searches the word dictionary DB 13 using the character string of a word that is input and acquired when speech recognition fails, and outputs a speech-recognizable word. The user can thus learn the set of recognizable words, phrases, or sentences that the speech recognition system expects by means of a character string search through key input on a soft keyboard or the like, which is more reliable than speech.
As a result, the user becomes able to utter words, phrases, or sentences that the speech recognition system in use can recognize, which reduces the obstacles to using the system. In addition, input targets that the speech recognition system in use can recognize, such as sets of words, phrases, and sentences, are reproduced through an output device such as a display or speaker, or presented to the user, as the speaker, by TTS (Text To Speech) or the like. The user can thereby learn a way of uttering each speech-recognizable input target that yields a high recognition rate, and can improve the recognition rate from the next time onward by imitating that utterance.

In Embodiment 1 described above, when speech recognition fails, the character string consisting of the reading of the word is acquired through key input on a soft keyboard or the like; alternatively, the character string may be acquired through phoneme-by-phoneme voice input. Also, in Embodiment 1 described above, the grammar dictionary DB 15 is searched in addition to the word dictionary DB 13 and the acoustic model DB 14 to recognize input targets such as words; however, the grammar dictionary DB 15 is not an essential database, and only the word dictionary DB 13 and the acoustic model DB 14 are essential.

Embodiment 2.
FIG. 3 is a block diagram showing the configuration of a navigation system incorporating a voice input support device according to Embodiment 2 of the present invention.
As shown in FIG. 3, the navigation system 30 according to Embodiment 2 of the present invention comprises a control unit 31, an input unit 32, a microphone 33, a position information acquisition unit 34, a map data storage unit 35, a voice output unit 36, and a display unit 37.

The control unit 31 is realized by a microcomputer, a memory, and an application program that runs on them. Specifically, the microcomputer acquires utterance content consisting of a facility name or destination name, performs speech recognition by searching the second dictionary stored in the storage unit, searches the first dictionary stored in the storage unit using a character string of the facility name or destination name that is input and acquired when the speech recognition fails, outputs a speech-recognizable facility name or destination name, and performs navigation by facility name search or destination setting based on the output facility name or destination name.
That is, the control unit 31 has the functional blocks of the voice input support device shown in FIG. 1. In addition, the built-in memory serving as the storage unit stores not only the first dictionary consisting of speech-recognizable facility names or destination names and the second dictionary consisting of acoustic models corresponding to those facility names or destination names, but also a facility name or destination name database that is essential for navigation (hereinafter these are collectively referred to as the database).

The input unit 32 is realized as a soft keyboard operated through a touch panel, touch pad, remote control, commander, or the like, and inputs the user's operations to the control unit 31. The microphone 33 collects the user's utterance and inputs it to the control unit 31. The position information acquisition unit 34 is realized by GPS (Global Positioning System) or a gyro; it acquires the current position data of the vehicle and inputs it to the control unit 31. The map data storage unit 35 stores road data, facility data, and other data needed for navigation. The voice output unit 36 outputs audio such as navigation guidance, read-back of speech recognition results, the status of the navigation system, status such as completion of an operation, and events. The display unit 37 is realized by a liquid crystal display or the like, and displays maps for navigation, menu information, and so on.

FIGS. 4 and 5 are flowcharts showing the operation of the navigation system according to Embodiment 2 of the present invention, covering facility search and destination setting by speech recognition. The description here assumes a scenario in which the destination the user intends is “National Kyoto International House” (国立京都国際会館), but the user believes that the official name of the facility is “Kyoto International Conference Hall” (京都国際会議場).
The operation of the navigation system according to Embodiment 2 of the present invention is described in detail below with reference to the flowcharts of FIGS. 4 and 5.

First, assume that the user utters “Kyoto International Conference Hall” (step ST400 “YES”); the speech collected by the microphone 33 is then input to the control unit 31 (step ST401). The control unit 31 uses the speech recognition engine to recognize the facility name uttered by the user by searching the database stored in the built-in memory. The speech recognition method is the same as in Embodiment 1 described above, and its description is omitted here to avoid repetition.
If speech recognition fails (step ST402 “NO”), the control unit 31 outputs a message such as “No matching facility found” to the voice output unit 36 or the display unit 37 to notify the user that speech recognition has failed (step ST403). The user, having checked the speech recognition result on the display or the like, then decides whether to retry speech recognition. If speech recognition is to be performed again, the control unit 31 detects, for example, that the utterance key on the input unit 32 has been pressed (step ST404 “YES”), and the process returns to step ST400.

If the user believes that the facility name “Kyoto International Conference Hall” is correct, the user attributes the recognition failure to poor enunciation and repeats the processing of steps ST400 to ST404. Suppose that, after several retries, the user utters a name with a different reading and the control unit 31 acquires it. For example, if the user utters “Kyoto Kaikan” (京都会館), the facility name “Kyoto Kaikan” exists in the database, so speech recognition succeeds and the control unit 31 can proceed to step ST406 and the subsequent steps.
If speech recognition succeeds (step ST402 “YES”), the control unit 31 notifies the user of the recognized facility name through the voice output unit 36 or the display unit 37 (step ST406). Here the facility name is displayed on the display unit 37. The user then instructs the control unit 31 to display a map of the facility, for example by operating the input unit 32 or by voice input through the microphone 33, and checks the facility that was found and displayed.

Next, the user checks whether the result matches the intended facility (step ST407). For example, the user believes that the intended facility is in the northern part of Kyoto City (Kita Ward), whereas the map shown on the display unit 37 indicates that “Kyoto Kaikan” is in Sakyo Ward, Kyoto City; seeing this, the user can confirm that it is not the intended facility. If the result is the intended facility (step ST407 “YES”), the user notifies the control unit 31 that it is to be set as the destination (step ST408). This can be done by giving the instruction “Go there” through speech recognition or by selecting a menu item such as “Set as destination” displayed on the display unit 37. The destination setting or facility name search processing thus ends, and the control unit 31 then executes the regular processing of the navigation system, such as route search.

On the other hand, if the user determines in step ST407 that the result is not the intended facility (step ST407 “NO”) and wishes to continue the facility search by speech recognition, the process returns to step ST400. If the user gives up on facility name search by speech recognition and switches to full-text search, the process proceeds to step ST404. If, in step ST404, the user gives up on speech recognition and decides to search using a character string entered through the input unit 32 (step ST404 “NO”), the user notifies the control unit 31 that a character string relating to the destination will be entered (step ST405). This is done, for example, by selecting a menu button such as “Search by key input” displayed on the display unit 37.

Next, the user operates the input unit 32 to enter the character string of the intended facility name, and the control unit 31 acquires it (step ST409 in FIG. 5). Suppose here that the user enters “Kyoto International Conference Hall” (京都国際会議場). The control unit 31 then performs a full-text search of the database based on the entered character string and displays the results on the display unit 37 (step ST410). When multiple facilities are found, they are shown on the display unit 37 as a list or the like, prompting the user to choose. The user then selects the desired facility name; when multiple candidate facilities are listed, the user selects the target facility by moving a cursor or similar means. Here, the desired facility name is assumed to be “National Kyoto International House” (国立京都国際会館).
When the user selects a specific facility (step ST411 “YES”), the user checks that facility on a map display or the like (step ST412) and determines whether it is the intended facility (step ST413). If it is (step ST413 “YES”), the process proceeds to step ST414; if it is not (step ST413 “NO”), the process returns to step ST411. At this point, the user learns that the facility thought of as “Kyoto International Conference Hall” is officially named “National Kyoto International House”.
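The full-text search of step ST410 is not specified in detail in this description. As a rough illustration only, the following sketch assumes a character-bigram overlap match, which is one simple way a typed name such as 京都国際会議場 could surface the official name 国立京都国際会館. The function names, the threshold `min_shared`, and the sample facility list are all hypothetical and are not taken from the patent.

```python
from typing import List, Set

# Hypothetical sample data standing in for the facility-name database.
FACILITY_NAMES: List[str] = [
    "京都駅",                      # Kyoto Station
    "京都国際マンガミュージアム",  # Kyoto International Manga Museum
    "国立京都国際会館",            # National Kyoto International House
    "京都タワー",                  # Kyoto Tower
]


def bigrams(text: str) -> Set[str]:
    """Return the set of adjacent character pairs in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def search_facilities(query: str, names: List[str], min_shared: int = 2) -> List[str]:
    """Return names sharing at least min_shared bigrams with query, best match first.

    This is an assumed matching rule for illustration, not the patent's method.
    """
    query_bigrams = bigrams(query)
    hits = []
    for name in names:
        shared = len(query_bigrams & bigrams(name))
        if shared >= min_shared:
            hits.append((shared, name))
    # sorted() is stable, so equal scores keep their database order.
    hits.sort(key=lambda hit: -hit[0])
    return [name for _, name in hits]


if __name__ == "__main__":
    # The user types the name they have in mind (step ST409).
    for candidate in search_facilities("京都国際会議場", FACILITY_NAMES):
        print(candidate)
```

With this sample data the typed name shares four bigrams with 国立京都国際会館 and three with 京都国際マンガミュージアム, so both appear as list candidates in that order, while 京都駅 and 京都タワー fall below the assumed threshold.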

The user then notifies the control unit 31 that “National Kyoto International House” is to be set as the destination (step ST414). The destination is set, for example, by selecting a menu button labeled “Destination setting” displayed on the display unit 37. This completes the destination setting or facility name search, after which the navigation system carries out its primary processing, such as route search.
In step ST414 described above, the user learns that the facility name is “National Kyoto International House” rather than “Kyoto International Conference Hall”. The next time the same facility is to be set as the destination, the user can say “National Kyoto International House” from the start and have it recognized, so the destination can be set easily and quickly via the path of steps ST400 to ST402 and ST406 to ST408.

As described above, in the voice input support device according to Embodiment 2 of the present invention, the navigation system presents the voice input targets to the user through an output device such as a display or a speaker. The user can therefore perform a facility search or destination setting without knowing in advance which input targets can be recognized, and can also learn a way of speaking that the speech recognition system recognizes easily. This improves the operability and convenience of speech recognition.
Conventionally, the input function based on speech recognition and the input function based on a soft keyboard or the like were designed independently, so the set of words, phrases, and sentences accepted by the speech recognition function did not always match the set searched by the search function. According to Embodiment 2 described above, entering the facility name as a character string brings the two into agreement, so the user can find out, through a key-input character string search, the set of recognizable words, phrases, or sentences that the speech recognition system assumes. The user can thus utter a word, phrase, or sentence that the speech recognition system in use can recognize, which reduces the obstacles to using the system.

The voice input support method according to the present invention is a voice input support method for a speech recognition system that includes a first dictionary (word dictionary DB 13) storing speech-recognizable words and a second dictionary (acoustic model DB 14) storing acoustic models corresponding to those words. In the flowchart of FIG. 2, for example, the method has a first step (steps ST20 to ST24) of acquiring the utterance content and performing speech recognition by searching the second dictionary (acoustic model DB 14), and a second step (steps ST26 to ST32) of searching the first dictionary (word dictionary DB 13) using the character string of the word that is input and acquired when the speech recognition fails, and outputting a speech-recognizable word.
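The two-step control flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the recognizer is a toy stand-in that succeeds only on an exact dictionary word (a real engine would match acoustic features against the acoustic model DB 14), the dictionary search is a plain case-insensitive substring match, and every name in the code is hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for the word dictionary DB 13.
WORD_DICTIONARY: List[str] = [
    "National Kyoto International House",
    "Kyoto Station",
]


def recognize(utterance: str) -> Optional[str]:
    """Toy first step: succeed only if the utterance is exactly a dictionary word."""
    return utterance if utterance in WORD_DICTIONARY else None


def search_dictionary(text: str) -> List[str]:
    """Toy second step: case-insensitive substring search of the word dictionary."""
    needle = text.lower()
    return [word for word in WORD_DICTIONARY if needle in word.lower()]


def voice_input_support(
    utterance: str,
    recognizer: Callable[[str], Optional[str]],
    read_typed_text: Callable[[], str],
    searcher: Callable[[str], List[str]],
) -> Tuple[Optional[str], List[str]]:
    """Return (recognized word, suggestions).

    The typed text is requested only when recognition fails, mirroring the
    order of the first and second steps.
    """
    recognized = recognizer(utterance)
    if recognized is not None:
        return recognized, []
    # Recognition failed: fall back to a typed character string and
    # report which dictionary words the recognizer could accept.
    return None, searcher(read_typed_text())


if __name__ == "__main__":
    result = voice_input_support(
        "Kyoto International Conference Hall",  # not in the dictionary
        recognize,
        lambda: "Kyoto International",          # what the user types instead
        search_dictionary,
    )
    print(result)
```

Passing the recognizer, the text source, and the search as callables keeps the sketch independent of any particular speech engine or input device; the suggestions returned on failure are the words the user could say next time.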

According to the voice input support method of the present invention, the first dictionary (word dictionary DB 13) is searched using the character string of the word that is input and acquired when speech recognition fails, and a speech-recognizable word is output. The user can thereby find out the set of recognizable words, phrases, or sentences assumed by the speech recognition system through a character string search using key input, such as a soft keyboard, which is more reliable than voice. The user can then utter a word, phrase, or sentence that the speech recognition system in use recognizes easily, which reduces the obstacles to using the system.
Furthermore, input targets such as the set of words, phrases, and sentences that the speech recognition system in use can recognize are presented to the user, who is the speaker, through an output device such as a display or a speaker, or by TTS (Text To Speech) or the like. The user can thus learn a way of speaking that yields a high recognition rate for a recognizable input target, and can improve the recognition rate from the next time onward by imitating it.

Specifically, the functions of the voice input acquisition unit 11, the speech recognition engine unit 12, the application execution control unit 16, the key input acquisition unit 17, the full-text search processing unit 18, and the search result output unit 19 shown in FIG. 1 are realized by the CPU of the voice input support device 10 sequentially reading and executing a program recorded in a built-in memory. The word dictionary DB 13, the acoustic model DB 14, and the grammar dictionary DB 15 are each allocated to and stored in the built-in memory or an external memory.
The functions of the constituent blocks of the voice input support device 10 may be realized entirely in software as described above, or at least partly in hardware. For example, the processing in the speech recognition engine unit 12 and the full-text search processing unit 18 may be realized on a computer by one or more programs, or at least partly in hardware.

FIG. 1 is a block diagram showing the internal configuration of the voice input support device according to Embodiment 1 of the present invention.
FIG. 2 is a diagram showing the processing flow of the voice input support method according to Embodiment 1 of the present invention.
FIG. 3 is a block diagram showing the internal configuration of a navigation system incorporating the voice input support device according to Embodiment 2 of the present invention.
FIG. 4 is a flowchart showing the operation of the navigation system including the voice input support device according to Embodiment 2 of the present invention.
FIG. 5 is a flowchart showing the operation of the navigation system including the voice input support device according to Embodiment 2 of the present invention.
FIG. 6 is a diagram showing the flow of speech recognition processing in a conventional speech recognition device.

Explanation of symbols

10 voice input support device, 11 voice input acquisition unit, 12 speech recognition engine unit, 13 word dictionary DB, 14 acoustic model DB, 15 grammar dictionary DB, 16 application execution control unit, 17 key input acquisition unit, 18 full-text search processing unit, 19 search result output unit, 30 navigation system, 31 control unit, 32 input unit, 33 microphone, 34 position information acquisition unit, 35 map data storage unit, 36 audio output unit, 37 display unit.

Claims

1. A voice input support method for a speech recognition system having a first dictionary in which speech-recognizable words are stored and a second dictionary in which acoustic models corresponding to the words are stored, the method comprising:
a first step of acquiring utterance content and performing speech recognition by searching the second dictionary; and
a second step of searching the first dictionary using a character string of the word that is input and acquired when the speech recognition fails, and outputting a speech-recognizable word.

2. The voice input support method according to claim 1, wherein the first step further searches an additionally provided grammar dictionary to recognize the word.

3. The voice input support method according to claim 1, wherein the second step acquires a character string of the word input by a key input operation.

4. The voice input support method according to claim 1, wherein the second step acquires a character string of the word input by a voice input operation.

5. A voice input support device comprising:
a storage unit that stores a first dictionary composed of speech-recognizable words and a second dictionary composed of acoustic models corresponding to the words;
a speech recognition unit that acquires utterance content and performs speech recognition by searching the second dictionary stored in the storage unit; and
a full-text search processing unit that searches the first dictionary stored in the storage unit using a character string of the word that is input and acquired when the speech recognition fails, and outputs a speech-recognizable word.

6. A navigation system comprising:
a storage unit that stores a first dictionary composed of speech-recognizable facility names or destination names and a second dictionary composed of acoustic models corresponding to the facility names or destination names; and
a control unit that acquires utterance content consisting of a facility name or destination name, performs speech recognition by searching the second dictionary stored in the storage unit, searches the first dictionary stored in the storage unit using a character string of the facility name or destination name that is input and acquired when the speech recognition fails, outputs a speech-recognizable facility name or destination name, and performs navigation by carrying out a facility name search or destination setting based on the output facility name or destination name.