JP6820086B2 - Voice recognition device and voice recognition method

Info

Publication number: JP6820086B2
Application number: JP2017113062A
Authority: JP
Inventors: 信範工藤
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2017-06-08
Filing date: 2017-06-08
Publication date: 2021-01-27
Anticipated expiration: 2037-06-08
Also published as: JP2018205612A

Description

The present invention relates to a voice recognition device and a voice recognition method, and is particularly suitable for a voice recognition device that needs no trigger, such as pressing an utterance button or performing a specific action, to start voice recognition.

Systems have conventionally been provided that allow the electronic devices mounted in a vehicle, such as audio devices, air conditioners, and navigation devices, to be operated by voice recognition, so as to avoid one-handed driving while operating them. With this voice recognition technology, the driver can operate the various electronic devices without taking a hand off the steering wheel (that is, without manually operating a remote controller, operation panel, or other operation unit).

A voice recognition device normally performs recognition based on the similarity between the user's utterance input from a microphone and the voice patterns of specific words, phrases, simple command sentences, and the like registered in a voice recognition dictionary (hereinafter collectively called "recognition target words"). For example, a distance value is calculated as an index of the similarity between the user's utterance and a voice pattern registered in the voice recognition dictionary, and when the distance value falls below a threshold, the device recognizes that the recognition target word corresponding to that voice pattern has been uttered.
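The threshold test described above can be illustrated with a minimal Python sketch. The function name `recognize`, the dictionary of precomputed distances, and the rule of picking the smallest distance when several words qualify are assumptions made for illustration; the description only states that a word is recognized when its distance value falls below the threshold.

```python
from typing import Dict, Optional


def recognize(distances: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the recognition target word whose distance is below the threshold.

    `distances` maps each recognition target word to a precomputed distance
    value (smaller means more similar). Choosing the smallest distance when
    several words qualify is an assumption of this sketch.
    """
    if not distances:
        return None
    best_word = min(distances, key=distances.get)
    return best_word if distances[best_word] < threshold else None


# Hypothetical distance values for one utterance.
example = {"nearby convenience store": 120.0, "front camera": 640.0}
result = recognize(example, threshold=200.0)
```

With these hypothetical values only the first word falls below the threshold, so it is the one returned; if no word qualifies the function returns `None`.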

A conventional voice recognition device enters a voice recognition mode when the user presses a built-in utterance button, and then recognizes the user's utterance input from the microphone. Devices are also known that enter the voice recognition mode when triggered by a specific action, such as clapping the hands, instead of an utterance button. Recently, by contrast, voice recognition that needs no trigger such as an utterance button operation or a specific action (hereinafter called "triggerless voice recognition") has also been provided.

A triggerless voice recognition device keeps the microphone on at all times and continuously determines whether the input sound corresponds to a recognition target word. In a vehicle cabin, the sound input from the microphone contains, in addition to utterances intended for voice recognition, various kinds of noise such as engine sound, road noise, audio playback, and conversation among occupants. A triggerless voice recognition device therefore needs some means of recognizing speech correctly even in such a noisy environment.

A technique for suppressing erroneous recognition caused by audio playback has been provided (see, for example, Patent Document 1). The technique described in Patent Document 1 uses two voice recognition engines. The first engine recognizes the sound input from the microphone while, at the same time, the audio signal that would normally go only to the speaker is branched internally and fed to the second engine, which recognizes the audio playback. When both engines recognize the same word, the result is rejected as an erroneous recognition caused by the audio playback.

Patent Document 1: Japanese Unexamined Utility Model Application Publication No. H7-23400 (実開平7-23400号公報)

In triggerless voice recognition, the device continuously determines whether the sound input from the microphone corresponds to a recognition target word, so the CPU of the in-vehicle device is under constant processing load. When one voice recognition engine handles 12 recognition target words, the CPU load increases by about 5%. When two voice recognition engines run simultaneously as in Patent Document 1, handling 12 recognition target words amounts to 24 words in total, which increases the CPU load by about 10%.

In such a situation, the response performance of processing other than voice recognition deteriorates. For example, the map drawing performance of the navigation device (measured in fps, frames per second, the number of frames that can be drawn per second) may drop. To limit this deterioration, the number of recognition target words that are constantly awaited must be kept small. Doing so, however, means that only a very limited number of recognition target words can be recognized, which impairs the convenience of triggerless voice recognition.

The present invention has been made to solve these problems. Its object is to suppress erroneous recognition caused by sound generated by an in-vehicle device, without limiting the number of recognition target words constantly awaited in triggerless voice recognition to a small number, and while minimizing the deterioration of the response performance of processing other than voice recognition.

To solve the above problems, the voice recognition device of the present invention has a first voice recognition dictionary in which the whole of each recognition target word is registered, and a second voice recognition dictionary in which only the latter half of each recognition target word is registered. The device also includes a first voice recognition unit that uses the first voice recognition dictionary to recognize externally input sound received from a microphone, and a second voice recognition unit that uses the second voice recognition dictionary to recognize internally generated sound, that is, sound generated by an in-vehicle device before it is output from a speaker. The first voice recognition unit calculates the similarity successively, in parallel with the sequential input of the externally input sound. When the calculated similarity exceeds a first level, it recognizes that the externally input sound corresponds to the first half of a recognition target word, and when the similarity subsequently exceeds a second level, it recognizes that the externally input sound corresponds to the whole of the recognition target word. The second voice recognition unit starts its recognition processing when the similarity calculated by the first voice recognition unit exceeds the first level, and when its own calculated similarity exceeds a predetermined level, it recognizes that the internally generated sound corresponds to the latter half of the recognition target word. When the first voice recognition unit recognizes that the externally input sound corresponds to the whole of the recognition target word and the second voice recognition unit recognizes that the internally generated sound corresponds to the latter half of the recognition target word, the recognition result of the first voice recognition unit is discarded.
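The gating and discard logic summarized above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patented implementation: similarity is expressed as a per-frame distance value (smaller means more similar), the distance sequences are assumed to be precomputed, and the names `run_recognition`, `t1`, `t2`, and `t3` (the second unit's "predetermined level") are hypothetical.

```python
from typing import Optional, Sequence


def run_recognition(
    mic_distances: Sequence[float],
    internal_distances: Sequence[float],
    t1: float,
    t2: float,
    t3: float,
) -> Optional[str]:
    """Simulate the two recognition units over aligned frames.

    mic_distances: distance of the microphone input to the whole-word pattern.
    internal_distances: distance of the internally generated sound to the
        latter-half pattern. It is only examined after the first unit has
        detected the first half (distance below t1).
    Returns "accepted", "discarded", or None if no whole word was detected.
    """
    second_unit_active = False
    latter_half_matched = False
    for mic_d, internal_d in zip(mic_distances, internal_distances):
        if not second_unit_active:
            if mic_d < t1:
                # First half detected: start the second recognition unit.
                second_unit_active = True
            continue
        if internal_d < t3:
            latter_half_matched = True
        if mic_d < t2:
            # Whole word detected by the first unit. Discard the result if
            # the internally generated sound contained the latter half.
            return "discarded" if latter_half_matched else "accepted"
    return None


# Hypothetical frames: an occupant speaks while the audio plays something else.
spoken = run_recognition([900, 450, 300, 150], [900, 900, 900, 900], 500, 200, 200)
# Hypothetical frames: the word comes from the audio device itself.
echoed = run_recognition([900, 450, 300, 150], [900, 800, 400, 150], 500, 200, 200)
```

Because the second unit's input is ignored until the first-half detection, the sketch mirrors the point that the second unit does not need to run all the time.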

According to the present invention configured as described above, the second voice recognition unit is activated only when the first voice recognition unit recognizes that the externally input sound corresponds to the first half of a recognition target word, so the processing load is smaller than when the second voice recognition unit runs constantly. Because the processing load is small, there is no need to limit the number of recognition target words constantly awaited in triggerless voice recognition. When the first voice recognition unit recognizes that the externally input sound corresponds to the whole of a recognition target word and the second voice recognition unit recognizes that the internally generated sound corresponds to the latter half of that word, the recognition result of the first voice recognition unit is discarded as an erroneous recognition caused by internally generated sound that was output from the speaker and picked up by the microphone. The present invention can thus suppress erroneous recognition caused by sound generated by an in-vehicle device, without limiting the number of recognition target words constantly awaited in triggerless voice recognition to a small number, and while minimizing the deterioration of the response performance of processing other than voice recognition.

FIG. 1 is a block diagram showing an example of the functional configuration of the voice recognition device according to the present embodiment.
FIG. 2 is a diagram showing an example of recognition target words.
FIG. 3 is a diagram showing an example of the information stored in the first voice recognition dictionary and the second voice recognition dictionary according to the present embodiment.
FIG. 4 is a graph showing an example of the transition of the distance values calculated by the first voice recognition unit and the second voice recognition unit according to the present embodiment.
FIG. 5 is a flowchart showing an operation example of the first voice recognition unit.
FIG. 6 is a flowchart showing an operation example of the second voice recognition unit.
FIG. 7 is a flowchart showing an operation example of the recognition result discarding unit.

An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing an example of the functional configuration of the voice recognition device 100 according to the present embodiment, together with a microphone 200, a navigation device 300, an audio device 400, a camera system 500, a display device 600, and a voice output device 700 provided in the vehicle. The voice recognition device 100 recognizes an occupant's utterance (a specific word, phrase, simple command sentence, or the like) input from the microphone 200 installed in the vehicle cabin as an utterance command, and controls the navigation device 300 based on the recognition result. Although the electronic device controlled by the voice recognition device 100 is the navigation device 300 here, it may instead be the audio device 400, an air conditioner, or another electronic device.

The microphone 200 is a sound collecting device placed where it can pick up the utterances of occupants in the vehicle. The microphone 200 outputs a voice signal based on the collected sound to the first voice recognition unit 10 described later. Hereinafter, the sound collected by the microphone 200 is called "externally input sound", and the voice signal that the microphone 200 outputs to the first voice recognition unit 10 is called the "externally input voice signal".

The display device 600 and the camera system 500 are connected to the navigation device 300. The display device 600 is a device capable of displaying images, such as a liquid crystal display panel, and is installed, for example, in the center of the dashboard. The camera system 500 includes a front camera that captures the area ahead of the vehicle and a rear camera that captures the area behind it, and outputs the image captured by one of the cameras to the navigation device 300 in response to a request from the navigation device 300.

The navigation device 300 has a function of detecting the position of the vehicle, a function of displaying a map on the display device 600 and showing the vehicle position on it, a function of searching for a route to a destination, and a function of drawing the route to the destination on the displayed map and providing guidance along it. The navigation device 300 also has a function of displaying, on the display device 600, an image captured by the front camera or an image captured by the rear camera in accordance with a user instruction.

The voice output device 700 includes a D/A converter, a volume control, an amplifier, a speaker, and the like. It D/A-converts and amplifies the input voice signal and then outputs it as sound from the speaker. The audio device 400 generates a voice signal based on audio data recorded on media (CD, DVD, MD, etc.) or stored in memory (either internal memory built into the audio device 400 or external memory connected to it), and outputs the signal to the voice output device 700. The audio device 400 may be any device that outputs a voice signal to the voice output device 700 so that sound is emitted into the cabin; it may be, for example, a radio receiver. As shown in FIG. 1, the voice signal output by the audio device 400 is branched and also supplied to the second voice recognition unit 12 described later. Hereinafter, the voice signal input to the second voice recognition unit 12 is called the "internally generated voice signal", and the sound based on this signal is called "internally generated sound". The internally generated sound is sound generated by the audio device 400, which is one of the in-vehicle devices, before it is output from the speaker.

As shown in FIG. 1, the voice recognition device 100 according to the present embodiment includes, as its functional configuration, a first voice recognition unit 10, an electronic device control unit 11, a second voice recognition unit 12, and a recognition result discarding unit 13. The voice recognition device 100 also includes a dictionary storage unit 20 as a storage medium.

Each of the functional blocks 10 to 13 can be implemented in hardware, in a DSP (Digital Signal Processor), or in software. When implemented in software, for example, each of the functional blocks 10 to 13 is in practice built around a computer's CPU, RAM, ROM, and the like, and is realized by running a program stored in a recording medium such as RAM, ROM, a hard disk, or semiconductor memory.

The dictionary storage unit 20 stores a first voice recognition dictionary 20A and a second voice recognition dictionary 20B. The recognition target words are described first, followed by the first voice recognition dictionary 20A and the second voice recognition dictionary 20B.

FIG. 2 shows an example list of the recognition target words in the present embodiment. The words shown in FIG. 2 are only examples, and other recognition target words may of course exist. As shown in FIG. 2, nine recognition target words are prepared in the present embodiment. When an occupant utters the wording corresponding to any one of the nine recognition target words, the voice recognition device 100 recognizes the utterance as that recognition target word and causes the navigation device 300 to execute the corresponding process. The nine recognition target words are assigned the identifiers A1 to A9.

The process that the voice recognition device 100 causes the navigation device 300 to execute when the wording of each recognition target word is uttered is briefly as follows.
A1: search for a convenience store near the current position of the vehicle.
A2: search for a gas station near the current position of the vehicle.
A3: provide guidance along the route to the destination.
A4: provide guidance along the route to the home location registered in advance.
A5: display on the display device 600 a map of a predetermined scale centered on the current position of the vehicle.
A6: enlarge the scale of the displayed map.
A7: reduce the scale of the displayed map.
A8: display the image captured by the front camera on the display device 600.
A9: display the image captured by the rear camera on the display device 600.

In the present embodiment, each recognition target word is divided in advance into a first half and a latter half. In FIG. 2, the symbol "/" marks the boundary between the first half and the latter half of each word. For example, recognition target word A1 consists of the wording "近くのコンビニ" ("nearby convenience store") and is divided into "近くの" ("nearby") and "コンビニ" ("convenience store"). Likewise, recognition target word A8 consists of the wording "フロントカメラ" ("front camera") and is divided into "フロント" ("front") and "カメラ" ("camera").

FIG. 3(A) schematically shows the information stored in the first voice recognition dictionary 20A, and FIG. 3(B) schematically shows the information stored in the second voice recognition dictionary 20B, both in a form suited to explanation. As shown in FIG. 3(A), the first voice recognition dictionary 20A holds the voice pattern of the "whole" of every recognition target word (in the present embodiment, each of the nine recognition target words described above).

As shown in FIG. 3(B), the second voice recognition dictionary 20B holds the voice pattern of the "latter half" of every recognition target word. For recognition target word A1, for example, the second voice recognition dictionary 20B holds the voice pattern of the latter-half wording "コンビニ" ("convenience store"). As shown in FIG. 2, recognition target words A5, A6, and A7 share the latter half "表示" ("display"). Accordingly, as shown in FIG. 3(B), only one voice pattern, that of the wording "display", is registered in the second voice recognition dictionary 20B as the latter-half pattern for A5, A6, and A7. The same applies to recognition target words A8 and A9, which share the latter half "カメラ" ("camera"). As a result, the second voice recognition dictionary 20B according to the present embodiment holds six latter-half voice patterns: "コンビニ" ("convenience store"), "ガソリンスタンド" ("gas station"), "案内" ("guidance"), "帰る" ("return"), "表示" ("display"), and "カメラ" ("camera").
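The deduplication of latter halves can be shown with a short Python sketch. The mapping below uses English glosses in place of voice patterns; the latter halves for A1 and A5 to A9 follow the description, while the assignments for A2 to A4 are inferred from the order of the six listed wordings and should be read as assumptions. The function name `build_second_dictionary` is hypothetical.

```python
from typing import Dict, List

# Latter half of each recognition target word (text stands in for a voice pattern).
# A2-A4 are inferred assignments, not stated explicitly in the description.
LATTER_HALF_BY_ID: Dict[str, str] = {
    "A1": "convenience store",
    "A2": "gas station",
    "A3": "guidance",
    "A4": "return",
    "A5": "display",
    "A6": "display",
    "A7": "display",
    "A8": "camera",
    "A9": "camera",
}


def build_second_dictionary(latter_half_by_id: Dict[str, str]) -> Dict[str, List[str]]:
    """Register each distinct latter half once, with the word IDs that share it."""
    second_dictionary: Dict[str, List[str]] = {}
    for word_id, latter_half in latter_half_by_id.items():
        second_dictionary.setdefault(latter_half, []).append(word_id)
    return second_dictionary


second_dictionary = build_second_dictionary(LATTER_HALF_BY_ID)
pattern_count = len(second_dictionary)  # 6 patterns for 9 words
```

Storing each shared latter half once means the second recognition unit compares against six patterns rather than nine.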

The first voice recognition unit 10 calculates the similarity between the whole-word voice patterns registered in the first voice recognition dictionary 20A and the externally input sound received from the microphone 200, and when the calculated similarity exceeds a predetermined level, it recognizes that the externally input sound corresponds to the whole of a recognition target word. More specifically, the first voice recognition unit 10 calculates the similarity successively, in parallel with the sequential input of the externally input sound. When the calculated similarity exceeds a first level, it recognizes that the externally input sound corresponds to the first half of the recognition target word, and when the similarity subsequently exceeds a second level, it recognizes that the externally input sound corresponds to the whole of the recognition target word.

In the present embodiment, the first voice recognition unit 10 calculates a distance value as the index of similarity. The distance value ranges from 0 to 1000. The distance value that the first voice recognition unit 10 calculates for a given recognition target word becomes smaller the more closely the externally input sound resembles the "whole" voice pattern of that word. The first voice recognition unit 10 detects that the similarity has exceeded the first level by detecting that the calculated distance value has fallen below a first threshold T1 (which is greater than a second threshold T2, described later), and at that point recognizes that the externally input sound corresponds to the first half of the recognition target word. After the distance value has fallen below the first threshold T1, the first voice recognition unit 10 detects that the similarity has exceeded the second level by detecting that the subsequently calculated distance value has fallen below the second threshold T2, and at that point recognizes that the externally input sound corresponds to the whole of the recognition target word.
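The two-threshold detection can be sketched as a small state machine in Python. The class name `TwoStageDetector`, the event labels, and the threshold values used in the example are hypothetical. The sketch also assumes the detector simply keeps waiting for T2 once T1 has been crossed; handling of a distance value that rises again without reaching T2 is not covered by this passage and is not modeled here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TwoStageDetector:
    """Tracks one recognition target word using thresholds T1 > T2."""

    t1: float
    t2: float
    first_half_seen: bool = False

    def __post_init__(self) -> None:
        if not self.t1 > self.t2:
            raise ValueError("T1 must be greater than T2")

    def update(self, distance: float) -> Optional[str]:
        """Feed the next distance value (0 to 1000, smaller means more similar)."""
        if not self.first_half_seen:
            if distance < self.t1:
                self.first_half_seen = True
                return "first_half"
            return None
        if distance < self.t2:
            self.first_half_seen = False  # ready for the next utterance
            return "whole"
        return None


# Hypothetical distance sequence that decreases as the word is spoken.
detector = TwoStageDetector(t1=500.0, t2=200.0)
events: List[Optional[str]] = [detector.update(d) for d in (900, 700, 480, 350, 180)]
```

In this example the first-half event fires at the third value (480 is below T1) and the whole-word event at the fifth (180 is below T2).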

The processing of the first voice recognition unit 10 is described in detail below. FIG. 4(A) is a graph showing an example of the transition of the distance value calculated by the first voice recognition unit 10 for one recognition target word. Specifically, FIG. 4(A) shows an example in which the distance value falls below the first threshold T1 and then further falls below the second threshold T2. The vertical axis of the graph in FIG. 4(A) represents the distance value and the horizontal axis represents elapsed time. The first voice recognition unit 10 performs so-called triggerless voice recognition and calculates the distance value at all times.

As described above, the microphone 200 outputs an externally input voice signal based on the collected sound to the first voice recognition unit 10. For each of the nine recognition target words, the first voice recognition unit 10 successively calculates, in parallel with the sequential input of the externally input sound, a distance value based on a comparison between the waveform of the externally input voice signal and the voice pattern registered in the first voice recognition dictionary 20A (the "whole" voice pattern of the recognition target word). As a result, the distance value changes over time, as shown in FIG. 4(A).

While the distance value is above the first threshold T1, the first voice recognition unit 10 monitors whether the distance value has moved to a state below the first threshold T1. In FIG. 4(A), the distance value moves from above the first threshold T1 to below it at timing TM1. When the first voice recognition unit 10 detects that the distance value has fallen below the first threshold T1, it recognizes at that point (timing TM1 in FIG. 4(A)) that the externally input sound corresponds to the first half of the recognition target word.

When the microphone 200 collects speech whose wording corresponds to a given recognition target word, the distance value between the external input voice signal based on that speech and the voice pattern of that recognition target word gradually decreases. More specifically, as more of the external input voice signal is compared against the voice pattern of the recognition target word, the match rate between the waveform of the external input voice signal and the voice pattern gradually rises and the distance value gradually falls accordingly, dropping below the first threshold T1 at some point and below the second threshold T2 at a later point.

The first threshold T1 for a given recognition target word is set so that, when the microphone 200 collects speech corresponding to that word, the distance value reaches T1 at the moment when the speech corresponding to the "first half" of the word has been collected and the distance calculation based on the external input voice for that "first half" has completed. The value of the first threshold T1 is set appropriately for each recognition target word on the basis of prior testing and the like. Accordingly, when the first voice recognition unit 10 detects that the distance value has transitioned from above to below the first threshold T1, it recognizes at that moment that the external input voice corresponds to the first half of the recognition target word.

When the first voice recognition unit 10 detects that the distance value has transitioned from above to below the first threshold T1, it outputs a processing start notification to the second voice recognition unit 12. In the graph of FIG. 4(A), the first voice recognition unit 10 outputs the processing start notification at timing TM1. The processing start notification is described later together with the processing of the second voice recognition unit 12.

When the distance value falls below the first threshold T1, the first voice recognition unit 10 starts measuring the time elapsed from that moment. It then monitors whether the distance value has fallen below the second threshold T2 while also monitoring whether time J1 has elapsed since the distance value fell below the first threshold T1.

The second threshold T2 and time J1 are set so that, if the distance value falls below the second threshold T2 before time J1 elapses, the external input voice can be regarded as corresponding to the whole recognition target word, and conversely, if time J1 elapses without the distance value falling below the second threshold T2, the external input voice can be regarded as not corresponding to the whole recognition target word. The second threshold T2 and time J1 are set to appropriate values for each recognition target word registered in the first voice recognition dictionary 20A on the basis of prior testing and the like. Note that if time J1 elapses without the distance value falling below the second threshold T2, the distance value gradually increases after time J1 elapses and eventually rises above the first threshold T1.

Accordingly, when time J1 elapses without the distance value falling below the second threshold T2, the first voice recognition unit 10 stops determining whether the distance value has fallen below the second threshold T2 and instead monitors whether the distance value has transitioned from above to below the first threshold T1. As described above, the distance value eventually rises above the first threshold T1 after time J1 elapses, so the first voice recognition unit 10 monitors whether the distance value, having once risen above the first threshold T1, then transitions from above to below it. In this embodiment, the external input voice is determined not to correspond to the whole recognition target word when time J1 elapses without the distance value falling below the second threshold T2. Alternatively, that determination may be made when the distance value, after falling below the first threshold T1, rises above the first threshold T1 again without having fallen below the second threshold T2.

On the other hand, when the distance value falls below the second threshold T2 before time J1 elapses, the first voice recognition unit 10 recognizes that the external input voice corresponds to the whole recognition target word. In the graph of FIG. 4(A), the first voice recognition unit 10 makes this recognition at timing TM2. Note that after falling below the second threshold T2, the distance value gradually increases and eventually rises above the first threshold T1.
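The T1/T2/J1 behavior described above can be sketched as a small state machine. This is only an illustrative model for a single recognition target word, not the patent's implementation: the class name `FirstRecognizer`, the `State` names, the event strings, and the sample values are all hypothetical, and the distance values are assumed to be supplied by some external calculation.

```python
from enum import Enum, auto


class State(Enum):
    ABOVE_T1 = auto()  # distance above T1; waiting for a downward crossing
    WAIT_T2 = auto()   # below T1; J1 timer running, waiting for T2
    REARM = auto()     # decision made; waiting for distance to rise above T1


class FirstRecognizer:
    """Hypothetical model of the T1/T2/J1 logic for one recognition target word."""

    def __init__(self, t1, t2, j1):
        self.t1, self.t2, self.j1 = t1, t2, j1
        self.state = State.ABOVE_T1  # assumption: distance starts above T1
        self.crossed_at = None

    def step(self, now, distance):
        """Feed one (time, distance) sample; return an event name or None."""
        if self.state is State.ABOVE_T1:
            if distance < self.t1:
                self.state = State.WAIT_T2
                self.crossed_at = now
                # First half recognized; the processing start
                # notification would be sent at this moment.
                return "first_half"
            return None
        if self.state is State.WAIT_T2:
            elapsed = now - self.crossed_at
            if elapsed < self.j1 and distance < self.t2:
                self.state = State.REARM
                return "whole"    # whole word recognized before J1 elapsed
            if elapsed >= self.j1:
                self.state = State.REARM
                return "timeout"  # J1 elapsed without reaching T2
            return None
        # REARM: ignore samples until the distance has risen above T1 again.
        if distance > self.t1:
            self.state = State.ABOVE_T1
        return None


# Made-up samples: one successful recognition, then one timeout.
samples = [(0.0, 10), (0.1, 7), (0.2, 5), (0.3, 3),
           (0.4, 6), (0.5, 9), (0.6, 7), (1.2, 6)]
rec = FirstRecognizer(t1=8, t2=4, j1=0.5)
events = [rec.step(t, d) for t, d in samples]
```

The `REARM` state reflects the description that, after either outcome, the unit waits for the distance value to rise above T1 before watching for the next downward crossing.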

After this recognition, the first voice recognition unit 10 outputs to the recognition result discarding unit 13 a first voice recognition notification indicating that, for one of the recognition target words, the external input voice has been recognized as corresponding to the whole of that word. The processing of the recognition result discarding unit 13 is described later.

Next, the first voice recognition unit 10 monitors whether it has received either a recognition discard notification or a recognition confirmation notification from the recognition result discarding unit 13. A recognition discard notification instructs the first voice recognition unit 10 to discard its recognition that the external input voice corresponded to the whole recognition target word (hereinafter sometimes simply called the "recognition result of the first voice recognition unit 10"). As detailed later, a recognition discard notification is issued when the sound emitted by the audio device 400 happens to contain wording corresponding to a recognition target word and, as a result of voice processing of the external input signal based on that sound, the first voice recognition unit 10 has recognized that the external input voice corresponded to the whole recognition target word. In this case the passenger did not utter the wording corresponding to the recognition target word, so the recognition result of the first voice recognition unit 10 must be discarded to ensure that the navigation device 300 is not controlled on the basis of it. A recognition confirmation notification, also detailed later, is issued when the recognition result of the first voice recognition unit 10 derives not from voice processing of sound emitted by the audio device 400 but from voice processing of speech uttered by the passenger. In this case the navigation device 300 must be made to execute the processing corresponding to the recognition target word.

On receiving a recognition discard notification, the first voice recognition unit 10 discards its recognition that the external input voice corresponded to the whole recognition target word. In this case, the navigation device 300 is not controlled on the basis of the sound collected by the microphone 200. The first voice recognition unit 10 then monitors whether the distance value has transitioned from above to below the first threshold T1. As described above, after falling below the second threshold T2 the distance value eventually rises above the first threshold T1, so the first voice recognition unit 10 monitors whether the distance value, having once risen above the first threshold T1, then transitions from above to below it.

On the other hand, on receiving a recognition confirmation notification, the first voice recognition unit 10 confirms its recognition that the external input voice corresponded to the whole recognition target word. It then notifies the electronic device control unit 11 of the recognition target word that the external input voice was recognized as corresponding to (hereinafter the "confirmed recognition target word"). The confirmed recognition target word is the recognition target word corresponding to the spoken command that the passenger uttered to make the navigation device 300 execute specific processing. The first voice recognition unit 10 then monitors whether the distance value has transitioned from above to below the first threshold T1.

When notified of the confirmed recognition target word by the first voice recognition unit 10, the electronic device control unit 11 outputs to the navigation device 300 a control signal that causes the navigation device 300 to execute the processing corresponding to that word. The navigation device 300 executes the processing on the basis of the received control signal.

The second voice recognition unit 12 calculates the similarity between the voice pattern of the second half of a recognition target word registered in the second voice recognition dictionary 20B and the internally generated voice described above, and recognizes that the internally generated voice corresponds to the second half of the recognition target word when the calculated similarity exceeds a predetermined level. More specifically, like the first voice recognition unit 10, the second voice recognition unit 12 calculates a distance value as an index of similarity and detects that the similarity has exceeded the predetermined level by detecting that the calculated distance value has fallen below a third threshold T3. When the calculated distance value falls below the third threshold T3, the second voice recognition unit 12 recognizes that the internally generated voice corresponds to the second half of the recognition target word.

Furthermore, the second voice recognition unit 12 starts recognition processing at the moment the distance value calculated by the first voice recognition unit 10 falls below the first threshold T1 (that is, the moment the similarity calculated by the first voice recognition unit 10 exceeds the first level). Recognition processing here means calculating distance values using the second voice recognition dictionary 20B and the various processing that uses those distance values. The second voice recognition unit 12 stops recognition processing either when it recognizes that the internally generated voice corresponds to the second half of a recognition target word before time J2 (described later) has elapsed since the start of recognition processing, or when time J2 elapses without such recognition.

The processing of the second voice recognition unit 12 is described in detail below. FIG. 4(B) is a graph showing an example of how the distance value calculated by the second voice recognition unit 12 changes over time for one recognition target word. Specifically, the graph of FIG. 4(B) shows how the distance value changes when it falls below the third threshold T3 within time J2 during the recognition processing executed by the second voice recognition unit 12. The vertical axis of the graph represents the distance value, and the horizontal axis represents elapsed time. Each timing on the horizontal axis of FIG. 4(B) corresponds to the same timing on the horizontal axis of FIG. 4(A).

In the following description, the second half of a recognition target word whose voice pattern is registered in the second voice recognition dictionary 20B is called a "second-half word". As described with reference to FIG. 3(B), six second-half words are provided in this embodiment.

As described above, the audio signal that the audio device 400 outputs to the audio output device 700 is branched and also output to the second voice recognition unit 12. Also as described above, when the first voice recognition unit 10 detects that the distance value for a recognition target word has fallen below the first threshold T1, it outputs a processing start notification to the second voice recognition unit 12 at that moment (timing TM1 in FIG. 4). Until it receives this processing start notification, the second voice recognition unit 12 does not execute recognition processing (as described above, the calculation of distance values using the second voice recognition dictionary 20B and the associated processing) and only monitors whether a processing start notification has been received. On receiving the processing start notification, the second voice recognition unit 12 starts recognition processing. As a result, the second voice recognition unit 12 starts recognition processing at the moment the distance value calculated by the first voice recognition unit 10 falls below the first threshold T1. In the graph of FIG. 4(B), the second voice recognition unit 12 starts recognition processing at timing TM1 (see also FIG. 4(A)).

In recognition processing, for each of the six second-half words, the second voice recognition unit 12 successively calculates, in parallel with the sequential input of the internally generated voice, a distance value based on a comparison between the waveform of the internally generated voice signal and the voice pattern registered in the second voice recognition dictionary 20B (the voice pattern of the "second half" of the recognition target word). As a result, the distance value changes successively over time, as shown in FIG. 4(B).

The second voice recognition unit 12 starts measuring elapsed time when it starts recognition processing. It then monitors whether the distance value has fallen below the third threshold T3 while also monitoring whether time J2 has elapsed since recognition processing started.

The third threshold T3 and time J2 are set so that, if the distance value falls below the third threshold T3 before time J2 elapses, the internally generated voice can be regarded as corresponding to a second-half word (the second half of a recognition target word), and conversely, if time J2 elapses without the distance value falling below the third threshold T3, the internally generated voice can be regarded as not corresponding to a second-half word. The third threshold T3 and time J2 are set to appropriate values for each second-half word registered in the second voice recognition dictionary 20B on the basis of prior testing and the like.

When time J2 elapses without the distance value falling below the third threshold T3, the second voice recognition unit 12 outputs an unrecognizable notification to the recognition result discarding unit 13. The unrecognizable notification indicates that the internally generated voice was not recognized as a second-half word during recognition processing. The processing of the recognition result discarding unit 13 is described later. The second voice recognition unit 12 then stops recognition processing and does not start it again until it receives the processing start notification described above.

On the other hand, when the distance value falls below the third threshold T3 before time J2 elapses, the second voice recognition unit 12 recognizes that the internally generated voice corresponds to a second-half word. After this recognition, the second voice recognition unit 12 outputs a second voice recognition notification to the recognition result discarding unit 13. The second voice recognition notification indicates that the internally generated voice has been recognized as corresponding to one of the second-half words registered in the second voice recognition dictionary 20B. The processing of the recognition result discarding unit 13 is described later. The second voice recognition unit 12 then stops recognition processing and does not start it again until it receives the processing start notification described above. In the graph of FIG. 4(B), the second voice recognition unit 12 outputs the second voice recognition notification and stops recognition processing at timing TM3.
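The gated behavior of the second voice recognition unit 12 can be sketched in the same style. This is a simplified, hypothetical model: it tracks a single second-half word rather than six, the class name `SecondRecognizer`, the method names, and the event strings are invented for illustration, and the distance values are assumed to be computed elsewhere. In a real device the distance would not be computed at all while the unit is idle; here the caller passes it in and the idle unit simply ignores it.

```python
class SecondRecognizer:
    """Hypothetical model of the gated T3/J2 logic for one second-half word."""

    def __init__(self, t3, j2):
        self.t3, self.j2 = t3, j2
        self.started_at = None  # None means recognition processing is stopped

    def start(self, now):
        """Called when the processing start notification arrives."""
        self.started_at = now

    def step(self, now, distance):
        """Feed one (time, distance) sample; return a notification name or None."""
        if self.started_at is None:
            return None  # idle: no recognition processing until started
        elapsed = now - self.started_at
        if elapsed < self.j2 and distance < self.t3:
            self.started_at = None  # stop recognition processing
            return "second_voice_recognition"
        if elapsed >= self.j2:
            self.started_at = None  # stop recognition processing
            return "unrecognizable"
        return None


sr = SecondRecognizer(t3=5, j2=1.0)

# Before any processing start notification, samples are ignored.
idle = sr.step(0.0, 1)

# First window: distance falls below T3 within J2.
sr.start(1.0)
hit = [sr.step(1.2, 7), sr.step(1.5, 4)]
after_hit = sr.step(1.6, 3)  # processing already stopped

# Second window: J2 elapses without reaching T3.
sr.start(2.0)
miss = [sr.step(2.5, 6), sr.step(3.0, 6)]
```

Exactly one notification is produced per window, which matches the statement that recognition processing stops after either outcome and restarts only on the next processing start notification.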

Note that the timing at which the first voice recognition unit 10 outputs the first voice recognition notification and the timing at which the second voice recognition unit 12 outputs the second voice recognition notification or the unrecognizable notification are very close in time. In addition, whenever the first voice recognition unit 10 outputs a first voice recognition notification to the recognition result discarding unit 13, the second voice recognition unit 12 always outputs either a second voice recognition notification or an unrecognizable notification to the recognition result discarding unit 13.

The recognition result discarding unit 13 discards the recognition result of the first voice recognition unit 10 when the first voice recognition unit 10 has recognized that the external input voice corresponds to the whole recognition target word and the second voice recognition unit 12 has recognized that the internally generated voice corresponds to a second-half word (the second half of a recognition target word).

The processing of the recognition result discarding unit 13 is described in detail below. As described above, when the first voice recognition unit 10 recognizes that the external input voice corresponds to the whole recognition target word, it outputs a first voice recognition notification to the recognition result discarding unit 13. The recognition result discarding unit 13 monitors whether it has received this first voice recognition notification from the first voice recognition unit 10. When it receives a first voice recognition notification, it also receives either a second voice recognition notification or an unrecognizable notification at a timing close to that of the first voice recognition notification.

When the notification received from the second voice recognition unit 12 is a second voice recognition notification, the recognition result discarding unit 13 outputs a recognition discard notification to the first voice recognition unit 10. As described above, on receiving a recognition discard notification, the first voice recognition unit 10 discards its recognition that the external input voice corresponded to the whole recognition target word. In other words, when the recognition result discarding unit 13 receives a second voice recognition notification, it causes the first voice recognition unit 10 to discard its recognition result.

On the other hand, when the notification received from the second voice recognition unit 12 is an unrecognizable notification, the recognition result discarding unit 13 outputs a recognition confirmation notification to the first voice recognition unit 10. As described above, on receiving a recognition confirmation notification, the first voice recognition unit 10 notifies the electronic device control unit 11 of the confirmed recognition target word so that the navigation device 300 executes the processing corresponding to that word. In other words, when the recognition result discarding unit 13 receives an unrecognizable notification, it confirms the recognition result of the first voice recognition unit 10 and causes the navigation device 300 to execute the corresponding processing.
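The decision made by the recognition result discarding unit 13 reduces to a two-way mapping from the second unit's notification to the notification sent back to the first unit. The sketch below is a hypothetical illustration only: the function names, the notification strings, the placeholder word `"WORD_A"`, and the `send_to_navigation` callback (standing in for the electronic device control unit 11) are all assumptions.

```python
def arbitrate(second_notice):
    """Map the second unit's notification to the reply sent to the first unit."""
    if second_notice == "second_voice_recognition":
        # The audio device itself played the second half: discard.
        return "recognition_discard"
    if second_notice == "unrecognizable":
        # No match in the internally generated voice: confirm.
        return "recognition_confirm"
    raise ValueError(f"unexpected notification: {second_notice!r}")


def handle_first_result(word, second_notice, send_to_navigation):
    """Apply the arbitration result to a word recognized by the first unit."""
    decision = arbitrate(second_notice)
    if decision == "recognition_confirm":
        # Only confirmed words reach the navigation device.
        send_to_navigation(word)
    return decision


# "WORD_A" is a placeholder, not a word from the patent's dictionaries.
sent = []
confirmed = handle_first_result("WORD_A", "unrecognizable", sent.append)
discarded = handle_first_result("WORD_A", "second_voice_recognition", sent.append)
```

Only the confirmed case forwards the word, so a phrase that merely came out of the audio device never reaches the navigation device in this model.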

The sound emitted by the audio device 400 may happen to contain wording corresponding to a recognition target word. In that case, the first voice recognition unit 10 recognizes the external input voice collected by the microphone 200 (sound containing the wording corresponding to the recognition target word) as corresponding to the whole recognition target word, but this recognition must be discarded so that the navigation device 300 is not controlled, because the external input voice was not uttered by the passenger. As described above, in this embodiment, even when the first voice recognition unit 10 recognizes that the external input voice corresponds to the whole recognition target word, the recognition of the first voice recognition unit 10 is discarded and the navigation device 300 is not controlled if the second voice recognition unit 12 recognizes that the internally generated voice corresponds to a second-half word. This prevents the navigation device 300 from being controlled even when the sound emitted by the audio device 400 happens to contain wording corresponding to a recognition target word. The details are as follows.

As described above, the period during which the second voice recognition unit 12 performs recognition processing is a predetermined period following the timing at which the first voice recognition unit 10 recognized that the external input voice corresponds to the first half of the recognition target word. If, within this predetermined period, both the first voice recognition unit 10 and the second voice recognition unit 12 recognize that the voice corresponds to the second half of the recognition target word, the situation can be characterized as follows: the sound emitted by the audio device 400 contains wording corresponding to the recognition target word, and it is highly likely that the first voice recognition unit 10 recognized the sound emitted by the audio device 400 as corresponding to the second half of the recognition target word.

In other words, even when the first voice recognition unit 10 recognizes that the external input voice corresponds to the whole recognition target word, if the second voice recognition unit 12 recognizes that the internally generated voice corresponds to a second-half word, the recognition of the first voice recognition unit 10 is highly likely to derive from sound emitted by the audio device 400 rather than from speech produced by the passenger. For this reason, the recognition of the first voice recognition unit 10 is discarded in that case. Even when the sound emitted by the audio device 400 contains wording corresponding to a recognition target word, this prevents the navigation device 300 from being controlled to execute processing on the basis of that sound.

Furthermore, in the present embodiment, the second voice recognition unit 12 executes the recognition process only when the first voice recognition unit 10 has recognized that the externally input voice corresponds to the first half of a recognition target word. The second voice recognition unit 12 therefore performs the recognition process only in situations where the first voice recognition unit 10, having recognized the first half, may subsequently recognize that the externally input voice corresponds to the "entire" recognition target word, so that the recognition result of the first voice recognition unit 10 can be discarded precisely when necessary.

In this way, the second voice recognition unit 12 executes the recognition process only when the first voice recognition unit 10 has recognized that the externally input voice corresponds to the first half of a recognition target word, and the period during which the second voice recognition unit 12 executes the recognition process is at most time J2. With this configuration, the period during which the recognition process is executed is limited, and the processing load can be reduced compared with a case where the second voice recognition unit 12 executes the recognition process at all times. Because the processing load is small, degradation of the response performance of processes other than the voice recognition processing by the first voice recognition unit 10 and the second voice recognition unit 12 is suppressed when such processes are executed. Consequently, there is no need to restrict to a small number the recognition target words constantly awaited in triggerless voice recognition by the first voice recognition unit 10 (the recognition target words registered in the first voice recognition dictionary 20A) out of concern for adverse effects on other processes (for example, a marked decrease in the CPU allocation to them) caused by an increased processing load of the recognition process of the second voice recognition unit 12.

Furthermore, in the present embodiment, a plurality of recognition target words may share the same latter-half wording. In such a case, the shared latter-half word is registered only once in the second voice recognition dictionary 20B. This reduces the number of latter-half words registered in the second voice recognition dictionary 20B and thereby reduces the processing load of the second voice recognition unit 12 more effectively.
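The single registration of shared latter-half words can be illustrated with a minimal sketch. It assumes, purely for illustration, that dictionary entries are text pairs rather than voice patterns; the function name `build_second_dictionary` and the sample words are hypothetical and do not appear in the embodiment.

```python
def build_second_dictionary(first_dictionary):
    """Collect each distinct latter-half word exactly once, keeping first-seen order."""
    seen = set()
    second_dictionary = []
    for _first_half, latter_half in first_dictionary:
        if latter_half not in seen:
            seen.add(latter_half)
            second_dictionary.append(latter_half)
    return second_dictionary


# Hypothetical recognition target words, split as (first half, latter half).
FIRST_DICTIONARY = [
    ("volume", "up"),
    ("temperature", "up"),
    ("volume", "down"),
]

SECOND_DICTIONARY = build_second_dictionary(FIRST_DICTIONARY)
```

Here three recognition target words yield only two latter-half entries, which is the reduction in matching work that the paragraph above describes.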

FIG. 5 is a flowchart showing an example of the operation of the first voice recognition unit 10 according to the present embodiment. FIG. 6 is a flowchart showing an example of the operation of the second voice recognition unit 12 according to the present embodiment. FIG. 7 is a flowchart showing an example of the operation of the recognition result discarding unit 13 according to the present embodiment. The processing of each of the flowcharts of FIGS. 5, 6, and 7 is executed as appropriate after the voice recognition device 100 is powered on and the start of triggerless voice recognition is instructed.

In the following description, it is assumed that, at the start of the flowchart of FIG. 5, the distance value calculated by the first voice recognition unit 10 exceeds the first threshold T1. Although not described specifically, the first voice recognition unit 10 continues to calculate the distance value while the processing of the flowchart of FIG. 5 is performed.

As shown in the flowchart of FIG. 5, the first voice recognition unit 10 determines whether the currently calculated distance value has transitioned from a state of exceeding the first threshold T1 to a state of falling below it (step SA1). In step SA1, the first voice recognition unit 10 does not merely determine whether the distance value is below the first threshold T1; it determines whether a change of state from "above the first threshold T1" to "below the first threshold T1" has occurred. The first voice recognition unit 10 repeats the processing of step SA1 until it detects that the distance value has transitioned from exceeding the first threshold T1 to falling below it.
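The distinction drawn in step SA1 between "is below T1" and "has just moved from above T1 to below T1" can be sketched as a small state tracker. This is an illustrative sketch only: the class name `FallingEdgeDetector`, the threshold value 500, and the treatment of a value exactly equal to the threshold (state left unchanged) are assumptions, not details taken from the embodiment.

```python
class FallingEdgeDetector:
    """Report True only on the sample where the value moves from above to below a threshold."""

    def __init__(self, threshold, initially_above=True):
        self.threshold = threshold
        # FIG. 5 is described as starting with the distance value above T1.
        self.above = initially_above

    def update(self, distance):
        if distance > self.threshold:
            self.above = True
            return False
        if distance < self.threshold and self.above:
            self.above = False
            return True  # transition "above" -> "below" detected
        # Already below, or exactly equal to the threshold (assumption: no state change).
        return False


T1 = 500  # hypothetical first threshold
detector = FallingEdgeDetector(T1)
EDGES = [detector.update(d) for d in [800, 600, 400, 300, 700, 450]]
```

A plain comparison `distance < T1` would also be true for the fourth sample (300); the state tracker reports only the two samples where the crossing actually happens.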

When it detects that the distance value has transitioned from exceeding the first threshold T1 to falling below it, the first voice recognition unit 10 recognizes that the externally input voice corresponds to the first half of a recognition target word (step SA2). Next, the first voice recognition unit 10 outputs a processing start notification to the second voice recognition unit 12 (step SA3). The first voice recognition unit 10 then starts measuring the elapsed time (step SA4).

Next, the first voice recognition unit 10 determines whether the distance value has fallen below the second threshold T2 (step SA6) while also determining whether time J1 has elapsed (step SA5).

If time J1 elapses without the distance value falling below the second threshold T2 (step SA5: YES), the first voice recognition unit 10 returns the processing procedure to step SA1. If, on the other hand, the distance value falls below the second threshold T2 before time J1 elapses (step SA6: YES), the first voice recognition unit 10 recognizes that the externally input voice corresponds to the entire recognition target word (step SA7). Next, the first voice recognition unit 10 outputs a first voice recognition notification to the recognition result discarding unit 13 (step SA8).

After outputting the first voice recognition notification, the first voice recognition unit 10 determines whether it has received either a recognition discard notification or a recognition confirmation notification from the recognition result discarding unit 13 (step SA9). The first voice recognition unit 10 repeats the processing of step SA9 until one of these notifications is received. When one of the notifications has been received (step SA9: YES), the first voice recognition unit 10 determines whether the received notification is a recognition discard notification (step SA10).

If the received notification is a recognition discard notification (step SA10: YES), the first voice recognition unit 10 discards the recognition result (step SA11). The first voice recognition unit 10 then moves the processing procedure to step SA14.

If, on the other hand, the received notification is not a recognition discard notification (step SA10: NO), the first voice recognition unit 10 confirms its recognition that the externally input voice corresponds to the entire recognition target word (step SA12). Next, the first voice recognition unit 10 notifies the electronic device control unit 11 of the confirmed recognition target word (step SA13). The first voice recognition unit 10 then moves the processing procedure to step SA14.

In step SA14, the first voice recognition unit 10 determines whether the end of triggerless voice recognition has been instructed. If the end of triggerless voice recognition has not been instructed (step SA14: NO), the first voice recognition unit 10 returns the processing procedure to step SA1. If the end of triggerless voice recognition has been instructed (step SA14: YES), the first voice recognition unit 10 ends the process. For convenience of explanation, the flowchart of FIG. 5 shows the first voice recognition unit 10 determining in step SA14 whether the end of triggerless voice recognition has been instructed; however, the first voice recognition unit 10 continuously monitors whether the end of triggerless voice recognition has been instructed while the processing of the flowchart is being executed and, when it has been instructed, performs the necessary termination processing and then ends the process.
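The core of the FIG. 5 flow (steps SA1 to SA8) can be sketched as a generator over timestamped distance values. This is a simplified illustration under stated assumptions, not the implementation of the embodiment: the names `Event` and `first_recognizer`, the discrete `(time, distance)` samples, and the numeric values of T1, T2, and J1 are hypothetical, and the confirm/discard handshake of steps SA9 to SA13 and the termination check of step SA14 are omitted.

```python
from enum import Enum, auto


class Event(Enum):
    FIRST_HALF = auto()  # SA2/SA3: first half recognized, processing start notification sent
    WHOLE_WORD = auto()  # SA7/SA8: entire word recognized, first voice recognition notification sent
    TIMEOUT = auto()     # SA5: YES, time J1 elapsed without reaching T2


def first_recognizer(samples, t1, t2, j1):
    """Yield (time, Event) pairs for an iterable of (time, distance) samples."""
    above_t1 = True    # assumed initial state, as in the description of FIG. 5
    started_at = None  # time at which the first half was recognized, if any
    for time, distance in samples:
        fell_below_t1 = above_t1 and distance < t1
        if distance > t1:
            above_t1 = True
        elif distance < t1:
            above_t1 = False

        if started_at is None:
            if fell_below_t1:                # SA1: YES
                started_at = time            # SA4: start measuring elapsed time
                yield time, Event.FIRST_HALF
        elif time - started_at >= j1:        # SA5: YES, back to SA1
            started_at = None
            yield time, Event.TIMEOUT
        elif distance < t2:                  # SA6: YES
            started_at = None
            yield time, Event.WHOLE_WORD


SAMPLES = [(0, 800), (1, 450), (2, 300), (3, 150), (4, 700),
           (5, 400), (6, 350), (7, 340), (8, 330)]
EVENTS = list(first_recognizer(SAMPLES, t1=500, t2=200, j1=3))
```

In this run the first utterance reaches T2 within the window and is recognized as a whole word, while the second crosses T1 but never reaches T2, so the window expires and the flow returns to waiting for the next crossing.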

As shown in the flowchart of FIG. 6, the second voice recognition unit 12 determines whether it has received a processing start notification (step SB1). When a processing start notification has been received (step SB1: YES), the second voice recognition unit 12 starts the recognition process (step SB2). Next, the second voice recognition unit 12 starts measuring the elapsed time from the start of the recognition process (step SB3). The second voice recognition unit 12 then determines whether the distance value has fallen below the third threshold T3 (step SB5) while also determining whether time J2 has elapsed since the start of the recognition process (step SB4).

If time J2 elapses without the distance value falling below the third threshold T3 (step SB4: YES), the second voice recognition unit 12 outputs an unrecognizable notification to the recognition result discarding unit 13 (step SB6). Next, the second voice recognition unit 12 stops the recognition process (step SB7). The second voice recognition unit 12 then moves the processing procedure to step SB11.

If, on the other hand, the distance value falls below the third threshold T3 before time J2 elapses (step SB5: YES), the second voice recognition unit 12 recognizes that the internally generated voice corresponds to a latter-half word (step SB8). Next, the second voice recognition unit 12 outputs a second voice recognition notification to the recognition result discarding unit 13 (step SB9). The second voice recognition unit 12 then stops the recognition process (step SB10) and moves the processing procedure to step SB11.

In step SB11, the second voice recognition unit 12 determines whether the end of triggerless voice recognition has been instructed. If the end of triggerless voice recognition has not been instructed (step SB11: NO), the second voice recognition unit 12 returns the processing procedure to step SB1. If the end of triggerless voice recognition has been instructed (step SB11: YES), the second voice recognition unit 12 ends the process. For convenience of explanation, the flowchart of FIG. 6 shows the second voice recognition unit 12 determining in step SB11 whether the end of triggerless voice recognition has been instructed; however, the second voice recognition unit 12 continuously monitors whether the end of triggerless voice recognition has been instructed while the processing of the flowchart is being executed and, when it has been instructed, performs the necessary termination processing and then ends the process.
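One pass through the time-limited window of FIG. 6 (steps SB2 to SB10) might look like the following sketch. The function name `second_recognizer`, the string labels, the discrete `(time, distance)` samples of the internally generated voice, and the numeric values of T3 and J2 are all assumptions made for illustration.

```python
RECOGNIZED = "second voice recognition notification"  # SB9
UNRECOGNIZABLE = "unrecognizable notification"        # SB6


def second_recognizer(samples, start_time, t3, j2):
    """Evaluate one recognition window that opens at start_time.

    Returns (notification, time), or None if the samples end before the
    window produces a result.
    """
    for time, distance in samples:
        if time < start_time:
            continue  # recognition process not yet started; sample is not examined
        if time - start_time >= j2:  # SB4: YES
            return UNRECOGNIZABLE, time
        if distance < t3:            # SB5: YES
            return RECOGNIZED, time
    return None


HIT = second_recognizer(
    [(0, 900), (1, 100), (2, 700), (3, 250), (4, 600)],
    start_time=2, t3=300, j2=4,
)
MISS = second_recognizer(
    [(2, 700), (3, 600), (4, 500)],
    start_time=2, t3=300, j2=2,
)
```

Note that the low distance value at time 1 in the first call is ignored because the window had not yet opened, which mirrors the point that the second voice recognition unit 12 performs recognition only after the processing start notification.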

As shown in the flowchart of FIG. 7, the recognition result discarding unit 13 determines whether it has received a first voice recognition notification from the first voice recognition unit 10 (step SC1). The recognition result discarding unit 13 repeats the processing of step SC1 until a first voice recognition notification is received. When a first voice recognition notification has been received (step SC1: YES), the recognition result discarding unit 13 receives either a second voice recognition notification or an unrecognizable notification at a timing close in time to the first voice recognition notification received in step SC1 (step SC2).

Next, the recognition result discarding unit 13 determines whether the notification received in step SC2 is a second voice recognition notification (step SC3). If the notification received in step SC2 is a second voice recognition notification (step SC3: YES), the recognition result discarding unit 13 outputs a recognition discard notification to the first voice recognition unit 10 (step SC4). The recognition result discarding unit 13 then moves the processing procedure to step SC6.

If the notification received in step SC2 is not a second voice recognition notification (that is, it is an unrecognizable notification) (step SC3: NO), the recognition result discarding unit 13 outputs a recognition confirmation notification to the first voice recognition unit 10 (step SC5). The recognition result discarding unit 13 then moves the processing procedure to step SC6.

In step SC6, the recognition result discarding unit 13 determines whether the end of triggerless voice recognition has been instructed. If the end of triggerless voice recognition has not been instructed (step SC6: NO), the recognition result discarding unit 13 returns the processing procedure to step SC1. If the end of triggerless voice recognition has been instructed (step SC6: YES), the recognition result discarding unit 13 ends the process. For convenience of explanation, the flowchart of FIG. 7 shows the recognition result discarding unit 13 determining in step SC6 whether the end of triggerless voice recognition has been instructed; however, the recognition result discarding unit 13 continuously monitors whether the end of triggerless voice recognition has been instructed while the processing of the flowchart is being executed and, when it has been instructed, performs the necessary termination processing and then ends the process.
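The decision made in steps SC3 to SC5 reduces to a two-way mapping from the notification received from the second voice recognition unit 12 to the notification sent to the first voice recognition unit 10. The sketch below shows that mapping only; the enum and function names are hypothetical, and the waiting loop of step SC1 and the termination check of step SC6 are left out.

```python
from enum import Enum


class FromSecondUnit(Enum):
    SECOND_VOICE_RECOGNITION = "second voice recognition notification"
    UNRECOGNIZABLE = "unrecognizable notification"


class ToFirstUnit(Enum):
    RECOGNITION_DISCARD = "recognition discard notification"
    RECOGNITION_CONFIRMATION = "recognition confirmation notification"


def decide(notification):
    """SC3: choose the reply to the first voice recognition unit."""
    if notification is FromSecondUnit.SECOND_VOICE_RECOGNITION:
        return ToFirstUnit.RECOGNITION_DISCARD      # SC4
    return ToFirstUnit.RECOGNITION_CONFIRMATION     # SC5
```

For example, `decide(FromSecondUnit.UNRECOGNIZABLE)` yields the recognition confirmation notification, which is the path on which the recognized word is passed on for device control.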

As described in detail above, the voice recognition device 100 according to the present embodiment has a first voice recognition dictionary 20A in which recognition target words are registered in their entirety, and a second voice recognition dictionary 20B in which only the latter halves of the recognition target words are registered. The voice recognition device 100 also includes a first voice recognition unit 10 that uses the first voice recognition dictionary 20A to perform voice recognition on the externally input voice input from the microphone 200, and a second voice recognition unit 12 that uses the second voice recognition dictionary 20B to perform voice recognition on the internally generated voice that is generated by the audio device 400, which is an in-vehicle device, before that voice is output from the speaker. The first voice recognition unit 10 calculates the similarity (the distance value) successively, in parallel with the sequential input of the externally input voice, and recognizes that the externally input voice corresponds to the first half of a recognition target word at the point when the calculated similarity exceeds a first level (when the calculated distance value falls below the first threshold T1). The first voice recognition unit 10 recognizes that the externally input voice corresponds to the entire recognition target word at the point when the subsequently calculated similarity exceeds a second level (when the distance value falls below the second threshold T2). The second voice recognition unit 12 starts its recognition process at the point when the similarity calculated by the first voice recognition unit 10 exceeds the first level (when the distance value calculated by the first voice recognition unit 10 falls below the first threshold T1), and recognizes that the internally generated voice corresponds to the latter half of a recognition target word when the similarity it calculates exceeds a predetermined level (when the calculated distance value falls below the third threshold T3). When the first voice recognition unit 10 recognizes that the externally input voice corresponds to the entire recognition target word and the second voice recognition unit 12 recognizes that the internally generated voice corresponds to the latter half of a recognition target word, the recognition result of the first voice recognition unit 10 is discarded.

With the above configuration, the second voice recognition unit 12 is activated only when the first voice recognition unit 10 recognizes that the externally input voice corresponds to the first half of a recognition target word, so the processing load can be reduced compared with a case where the second voice recognition unit 12 operates at all times. Because the processing load is small, there is no need to restrict to a small number the recognition target words constantly awaited in triggerless voice recognition. When the first voice recognition unit 10 recognizes that the externally input voice corresponds to the entire recognition target word and the second voice recognition unit 12 recognizes that the internally generated voice corresponds to the latter half of a recognition target word, the recognition result of the first voice recognition unit 10 is discarded as a misrecognition caused by the internally generated voice output from the speaker being picked up by the microphone 200. Thus, according to the present invention, misrecognition caused by voice generated by the in-vehicle device can be suppressed without restricting to a small number the recognition target words constantly awaited in triggerless voice recognition, and while minimizing degradation of the response performance of processes other than the voice recognition processing.

In the embodiment described above, when the first voice recognition unit 10 recognizes that the externally input voice corresponds to the entire recognition target word and the second voice recognition unit 12 recognizes that the internally generated voice corresponds to the latter half of a recognition target word, the recognition result discarding unit 13 outputs a recognition discard notification to the first voice recognition unit 10 and the recognition result of the first voice recognition unit 10 is discarded. In this regard, the following configuration may be used instead. When the first voice recognition unit 10 recognizes that the externally input voice corresponds to the entire recognition target word and the second voice recognition unit 12 recognizes that the internally generated voice corresponds to the latter half of a recognition target word, the recognition result discarding unit 13 additionally determines whether the latter half of the recognition target word recognized by the first voice recognition unit 10 is the same as the latter half of the recognition target word recognized by the second voice recognition unit 12. If the recognized latter halves are the same, the recognition result discarding unit 13 outputs a recognition discard notification to the first voice recognition unit 10, and the recognition result of the first voice recognition unit 10 is discarded. This configuration provides the following effect.

It is not entirely impossible that, during the recognition process performed after the first voice recognition unit 10 has recognized that the externally input voice corresponds to the first half of one recognition target word, the internally generated voice happens to contain the latter half of another recognition target word different from that one. In this case, the externally input voice that the first voice recognition unit 10 recognized as corresponding to the first half of the recognition target word is not the voice emitted by the audio device 400, so the recognition result of the first voice recognition unit 10 must not be discarded. In the embodiment described above, however, no determination is made as to whether the latter half of the recognition target word recognized by the first voice recognition unit 10 is the same as the latter half of the recognition target word recognized by the second voice recognition unit 12, so the recognition result of the first voice recognition unit 10 is discarded. With the above configuration, by contrast, the recognition result discarding unit 13 discards the recognition result of the first voice recognition unit 10 only when the latter half of the recognition target word recognized by the first voice recognition unit 10 is the same as the latter half of the recognition target word recognized by the second voice recognition unit 12. This prevents the recognition result of the first voice recognition unit 10 from being discarded when, during the recognition process performed after the first voice recognition unit 10 has recognized that the externally input voice corresponds to the first half of a recognition target word, the internally generated voice happens to contain the latter half of a different recognition target word.
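The additional comparison in this variant can be sketched as follows. The sketch assumes, for illustration only, that recognized words are available as text and that a lookup table from each whole recognition target word to its latter half exists; `LATTER_HALF_OF`, `should_discard`, and the sample words are hypothetical names and data.

```python
# Hypothetical lookup: whole recognition target word -> its latter-half word.
LATTER_HALF_OF = {
    "volume up": "up",
    "volume down": "down",
}


def should_discard(first_word, second_latter_half, latter_half_of=LATTER_HALF_OF):
    """Decide whether the result of the first voice recognition unit is discarded.

    first_word: whole word recognized from the externally input voice.
    second_latter_half: latter half recognized from the internally generated
        voice, or None if the second unit recognized nothing in its window.
    """
    if second_latter_half is None:
        return False  # nothing recognized internally: confirm the result
    # Variant: discard only when both units recognized the same latter half.
    return latter_half_of[first_word] == second_latter_half
```

Under these assumptions, `should_discard("volume up", "down")` is false, so an occupant's "volume up" is not lost merely because the audio source happened to contain "down" during the window, whereas the base embodiment would have discarded it.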

In the embodiment described above, a distance value ranging from 0 to 1000 is used as the similarity between a registered voice pattern and the input voice. However, an index other than a distance value may be used as the similarity. That is, a wide range of existing techniques can be used to determine the similarity.

In addition, each of the above embodiments is merely one concrete example of carrying out the present invention, and the technical scope of the present invention should not be interpreted restrictively on that basis. That is, the present invention can be implemented in various forms without departing from its gist or its main features.

10 First voice recognition unit
12 Second voice recognition unit
13 Recognition result discarding unit
20 Dictionary storage unit
20A First voice recognition dictionary
20B Second voice recognition dictionary
100 Voice recognition device
200 Microphone

Claims

A voice recognition device that constantly determines whether a voice input from a microphone corresponds to a recognition target word, the device comprising:
a dictionary storage unit that stores a first voice recognition dictionary in which voice patterns of entire recognition target words are registered, and a second voice recognition dictionary in which voice patterns of only the latter halves of the recognition target words are registered;
a first voice recognition unit that calculates a similarity between the voice pattern of an entire recognition target word registered in the first voice recognition dictionary and an externally input voice input from the microphone, and recognizes that the externally input voice corresponds to the entire recognition target word when the calculated similarity is greater than a predetermined level;
a second voice recognition unit that calculates a similarity between the voice pattern of the latter half of a recognition target word registered in the second voice recognition dictionary and an internally generated voice that is generated by an in-vehicle device and has not yet been output from a speaker, and recognizes that the internally generated voice corresponds to the latter half of the recognition target word when the calculated similarity is greater than a predetermined level; and
a recognition result discarding unit that discards the recognition result of the first voice recognition unit when the first voice recognition unit recognizes that the externally input voice corresponds to the entire recognition target word and the second voice recognition unit recognizes that the internally generated voice corresponds to the latter half of the recognition target word,
wherein the first voice recognition unit successively calculates the similarity in parallel with the sequential input of the externally input voice, recognizes that the externally input voice corresponds to the first half of the recognition target word at the point when the calculated similarity exceeds a first level, and recognizes that the externally input voice corresponds to the entire recognition target word at the point when the subsequently calculated similarity exceeds a second level, and
wherein the second voice recognition unit starts its recognition process at the point when the similarity calculated by the first voice recognition unit exceeds the first level.

The voice recognition device according to claim 1, wherein the second voice recognition unit stops the recognition process when it recognizes that the internally generated voice corresponds to the latter half of the recognition target word, or when a predetermined time has elapsed from the start of the recognition process.

The voice recognition device according to claim 1 or 2, wherein the first voice recognition unit calculates a distance value as an index representing the similarity, recognizes that the external input voice corresponds to the first half of the recognition target word when the calculated distance value becomes smaller than a first threshold value, and recognizes that the external input voice corresponds to the entire recognition target word when the calculated distance value becomes smaller than a second threshold value, and
the second voice recognition unit starts the recognition process when the distance value calculated by the first voice recognition unit becomes smaller than the first threshold value, then calculates a distance value as an index representing the similarity, and recognizes that the internally generated voice corresponds to the latter half of the recognition target word when the calculated distance value becomes smaller than a predetermined threshold value.
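
When similarity is expressed as a distance value, the comparisons invert: a smaller distance means a closer match. The following sketch shows the gating of the second unit on the first unit's distance. It assumes one distance value per frame for each unit; the function and parameter names are hypothetical.

```python
from typing import Sequence


def latter_half_detected(first_distances: Sequence[float],
                         second_distances: Sequence[float],
                         first_threshold: float,
                         second_unit_threshold: float) -> bool:
    """Return True if the second unit recognizes the latter half.

    The second unit's distances are ignored until the first unit's distance
    has dropped below first_threshold (first half recognized). From then on,
    a second-unit distance below second_unit_threshold counts as recognition.
    """
    started = False
    for d_first, d_second in zip(first_distances, second_distances):
        if not started and d_first < first_threshold:
            started = True  # second unit starts its recognition process
        if started and d_second < second_unit_threshold:
            return True
    return False
```

In this sketch a low second-unit distance that occurs before the first unit has crossed its first threshold is deliberately ignored, which reflects the start condition stated in the claim.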

The voice recognition device according to any one of claims 1 to 3, wherein, when the first voice recognition unit recognizes that the external input voice corresponds to the entire recognition target word and the second voice recognition unit recognizes that the internally generated voice corresponds to the latter half of the recognition target word, the recognition result discarding unit determines whether the latter half of the recognition target word recognized by the first voice recognition unit and the latter half of the recognition target word recognized by the second voice recognition unit are the same, and discards the recognition result of the first voice recognition unit when they are determined to be the same.
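
The additional check in this claim, comparing the two recognized latter halves before discarding, can be sketched as below. The function name, the use of strings to represent recognized words, and the example words in the usage note are all hypothetical.

```python
from typing import Optional


def accept_recognition(first_word: Optional[str],
                       first_latter_half: Optional[str],
                       second_latter_half: Optional[str]) -> Optional[str]:
    """Return the accepted word, or None if nothing is accepted.

    first_word / first_latter_half: result of the first unit (None if the
    entire word was not recognized). second_latter_half: latter half
    recognized by the second unit in the internally generated voice, if any.
    """
    if first_word is None:
        return None
    if second_latter_half is not None and second_latter_half == first_latter_half:
        return None  # same latter half came from the device itself: discard
    return first_word
```

For instance, `accept_recognition("hello alpine", "alpine", "alpine")` yields `None`, while a differing or absent second-unit result leaves the first unit's result in place.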

A voice recognition method that constantly determines whether the voice input from the microphone corresponds to a recognition target word, the method comprising:
a first step in which the first voice recognition unit of the voice recognition device sequentially calculates the similarity between the entire voice pattern of the recognition target word registered in the first voice recognition dictionary and the external input voice sequentially input from the microphone;
a second step in which the first voice recognition unit determines whether the similarity calculated in the first step is higher than a first level and, when it is, recognizes that the external input voice corresponds to the first half of the recognition target word;
a third step of activating the second voice recognition unit of the voice recognition device when the similarity calculated in the first step is determined to be higher than the first level;
a fourth step in which the first voice recognition unit continues to sequentially calculate the similarity between the entire voice pattern of the recognition target word registered in the first voice recognition dictionary and the external input voice continuously input from the microphone while, at the same time, the second voice recognition unit sequentially calculates the similarity between the voice pattern of the latter half of the recognition target word registered in the second voice recognition dictionary and the internally generated voice, which is generated by the in-vehicle device and has not yet been output from the speaker;
a fifth step in which the first voice recognition unit determines whether the similarity it calculated in the fourth step is higher than a second level and, when it is, recognizes that the external input voice corresponds to the entire recognition target word;
a sixth step in which the second voice recognition unit determines whether the similarity it calculated in the fourth step is higher than a predetermined level and, when it is, recognizes that the internally generated voice corresponds to the latter half of the recognition target word; and
a seventh step in which the recognition result discarding unit of the voice recognition device discards the recognition result of the first voice recognition unit when the first voice recognition unit has recognized in the fifth step that the external input voice corresponds to the entire recognition target word and the second voice recognition unit has recognized in the sixth step that the internally generated voice corresponds to the latter half of the recognition target word.
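
The seven steps of the method can be traced end to end in a short simulation. This is a sketch under simplifying assumptions, not the claimed method itself: the two similarity streams are taken as precomputed and frame-aligned, processing stops as soon as the entire word is recognized, and the function name, parameter names, and return labels are hypothetical.

```python
from typing import Sequence


def recognize(external_sims: Sequence[float],
              internal_sims: Sequence[float],
              first_level: float,
              second_level: float,
              internal_level: float) -> str:
    """Return "accepted", "discarded", or "no_match" for one utterance."""
    second_active = False
    external_entire = False
    internal_latter = False
    # Steps 1 and 4: similarities arrive sequentially, frame by frame.
    for ext, internal in zip(external_sims, internal_sims):
        if not second_active and ext > first_level:
            # Step 2: first half recognized. Step 3: activate the second unit.
            second_active = True
        if second_active:
            if ext > second_level:
                external_entire = True      # Step 5: entire word recognized
            if internal > internal_level:
                internal_latter = True      # Step 6: latter half recognized
        if external_entire:
            break
    if external_entire and internal_latter:
        return "discarded"                  # Step 7: discard the first unit's result
    return "accepted" if external_entire else "no_match"
```

The "discarded" outcome corresponds to the case where the microphone picked up the device's own output, so the apparent match is not treated as a user utterance.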