JP2009098217A

JP2009098217A - Speech recognition device, navigation device with speech recognition device, speech recognition method, speech recognition program and recording medium

Info

Publication number: JP2009098217A
Application number: JP2007267128A
Authority: JP
Inventors: Kenji Takeda; 賢司武田; Yoshiko Kato; 淑子加藤; Ryo Oda; 亮小田; Keiichiro Koyama; 馨一郎小山; Koji Shinto; 浩司新戸; Kunihiko Mori; 邦彦森
Original assignee: Pioneer Electronic Corp
Current assignee: Pioneer Corp
Priority date: 2007-10-12
Filing date: 2007-10-12
Publication date: 2009-05-07

Abstract

PROBLEM TO BE SOLVED: To reduce the user's effort when starting speech recognition. SOLUTION: The speech recognition device 100 includes an input section 101, a detection section 102, an image recognition section 103, and a speech recognition section 104. The user's speech is input to the input section 101. The detection section 102 detects a part of the user's body that moves during utterance. The image recognition section 103 performs image recognition of an action state related to the user's utterance, based on the detection result of the detection section 102. The speech recognition section 104 starts speech recognition on the speech input to the input section 101 after the image recognition section 103 has image-recognized the action state related to the user's utterance. COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a speech recognition device, a navigation device including the speech recognition device, a speech recognition method, a speech recognition program, and a recording medium.

In recent years, vehicles such as automobiles have been equipped with navigation devices that search for a route to a destination and guide the vehicle to that destination. In such navigation devices, settings and inputs such as destination setting are known to be made through operation input, for example on a touch panel. Devices with a speech recognition function are also known, in which settings and inputs are made through the user's utterances.

As a technique with a speech recognition function, one has been proposed in which, in order to reduce misrecognition of speech, a vocabulary genre is designated based on the user's utterance of that genre, and speech recognition is then performed within the designated genre (see, for example, Patent Document 1).

JP-A-10-97281

However, one example of a problem with the technique of Patent Document 1 described above is that the user must turn on a talk switch in order to start speech recognition, which is troublesome for the user.

In order to solve the above-described problem and achieve the object, a speech recognition device according to the invention of claim 1 includes: input means to which a voice from a user is input; detection means for detecting a part of the user's body that moves during utterance; image recognition means for performing image recognition of an action state related to the user's utterance based on a detection result of the detection means; and speech recognition means for starting speech recognition on the voice input to the input means after the image recognition means has image-recognized the action state related to the user's utterance.

A navigation device according to claim 8 includes the above speech recognition device.

A speech recognition method according to the invention of claim 10 includes: an input step in which a voice from a user is input; a detection step of detecting an action state related to the user's utterance; an image recognition step of performing image recognition of the action state related to the user's utterance based on a detection result of the detection step; and a speech recognition step of starting speech recognition on the voice input in the input step after the action state related to the user's utterance has been image-recognized in the image recognition step.

A speech recognition program according to the invention of claim 11 causes a computer to execute the speech recognition method according to claim 10.

A recording medium according to the invention of claim 12 has the speech recognition program according to claim 11 recorded on it in a computer-readable manner.

Preferred embodiments of a speech recognition device, a navigation device including the speech recognition device, a speech recognition method, a speech recognition program, and a recording medium according to the present invention are described in detail below with reference to the accompanying drawings.

(Embodiment)
(Functional configuration of speech recognition device)
A functional configuration of the speech recognition device 100 according to the embodiment of the present invention will be described. FIG. 1 is a block diagram illustrating an example of the functional configuration of the speech recognition device 100 according to the present embodiment. In FIG. 1, the speech recognition device 100 includes an input unit 101, a detection unit 102, an image recognition unit 103, a speech recognition unit 104, an output unit 105, a power control unit 106, and a recording unit 107.

Voice from the user is input to the input unit 101. Specifically, the input unit 101 is a microphone. The microphone is, for example, a hands-free microphone; examples include a small microphone attached to a headset or a microphone installed inside a moving body such as a vehicle.

The detection unit 102 detects a part of the user's body that moves during utterance. The detection unit 102 detects, for example, an imaging signal from a camera that captures images. Parts that change during utterance include, for example, the eyes, eyebrows, nose, and cheeks, and for some people the hands, but the mouth is the typical example.

The image recognition unit 103 performs image recognition of an action state related to the user's utterance, based on the detection result of the detection unit 102. The action state related to utterance is, specifically, a state in which the user is speaking. It may be a state in which the eyes, eyebrows, nose, or cheeks are moving, but the typical example is a state in which the mouth is moving.
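The source does not say how the image recognition unit 103 decides that the mouth is moving. As a minimal sketch under that caveat, one simple possibility is frame differencing over a cropped mouth region. The function names, the grayscale list-of-rows frame format, and the threshold value are all assumptions made for illustration.

```python
from typing import Sequence

# Assumed frame format: a grayscale crop of the mouth region,
# given as rows of pixel values in the range 0-255.
Frame = Sequence[Sequence[int]]


def mean_abs_diff(prev: Frame, curr: Frame) -> float:
    """Average absolute per-pixel difference between two same-sized frames."""
    total = 0
    count = 0
    for prev_row, curr_row in zip(prev, curr):
        for p, c in zip(prev_row, curr_row):
            total += abs(p - c)
            count += 1
    return total / count if count else 0.0


def mouth_is_moving(prev: Frame, curr: Frame, threshold: float = 8.0) -> bool:
    """Hypothetical test: a large enough frame-to-frame change counts as movement.

    The threshold is an illustrative value, not one taken from the source.
    """
    return mean_abs_diff(prev, curr) > threshold
```

A real system would likely need face tracking and lighting compensation before such a comparison is reliable; the sketch only shows the shape of the decision that feeds the speech recognition unit 104.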

The speech recognition unit 104 starts speech recognition on the voice input to the input unit 101 after the image recognition unit 103 has image-recognized the action state related to the user's utterance. Typically, the speech recognition unit 104 starts speech recognition on the voice input to the input unit 101 after the image recognition unit 103 has recognized that the user's mouth is moving. The speech recognition unit 104 analyzes the voice input to the input unit 101 and outputs the analyzed voice data to the output unit 105. Specifically, the analysis is performed by comparing language data recorded in advance in the recording unit 107 with the features of the input voice and estimating the most plausible language.

The output unit 105 outputs the voice data analyzed by the speech recognition unit 104. Based on the voice data output from the output unit 105, for example, various programs are executed in the navigation device, and various settings and processes are carried out.

In the present embodiment, a power control unit 106 may also be provided. The power control unit 106 turns on the power of the input unit 101 when the image recognition unit 103 has image-recognized the action state related to the user's utterance. In this case, the speech recognition unit 104 may start speech recognition processing on the voice input to the input unit 101 after the input unit 101 has been powered on. This configuration reduces power consumption by powering on the input unit 101 only when speech recognition needs to be performed.

In the present embodiment, the image recognition unit 103 may also recognize, based on the detection result of the detection unit 102, that the user's mouth has not moved for a predetermined time. In this case, the speech recognition unit 104 stops speech recognition on the voice input to the input unit 101 when the image recognition unit 103 has recognized that the user's mouth has not moved for the predetermined time. This configuration rests on the assumption that a user whose mouth has not moved for the predetermined time is not about to speak; stopping speech recognition in that case prevents misrecognition and the malfunctions that would follow from it.

Under such a condition, in which the speech recognition unit 104 stops speech recognition on the voice input to the input unit 101, the power control unit 106 may also turn off the power of the input unit 101. This configuration reduces power consumption by powering off the input unit 101 when speech recognition does not need to be performed.

In the present embodiment, the speech recognition unit 104 may also stop speech recognition on the voice input to the input unit 101 when it determines that no voice has been input to the input unit 101 for a predetermined time or longer. This configuration stops speech recognition on the assumption that a user who has produced no voice input for the predetermined time or longer is not about to speak. Under such a condition, in which the speech recognition unit 104 stops speech recognition, the power control unit 106 may turn off the power of the input unit 101.

In the present embodiment, the speech recognition unit 104 may also stop speech recognition on the voice input to the input unit 101 when a non-language sound is input to the input unit 101. Non-language sounds are, specifically, sounds such as throat clearing, yawning, and sneezing. This configuration stops speech recognition because a non-language sound input to the input unit 101 can be recognized as not being an utterance from the user. Under such a condition, in which the speech recognition unit 104 stops speech recognition, the power control unit 106 may turn off the power of the input unit 101.

In the present embodiment, the speech recognition unit 104 may also stop speech recognition on the voice input to the input unit 101 when a sound of a constant frequency is input to the input unit 101 for a predetermined time or longer. A sound of a constant frequency is, specifically, a sound such as that produced while chewing gum. This configuration stops speech recognition because a constant-frequency sound input for the predetermined time or longer can be recognized as not being an utterance from the user. Under such a condition, in which the speech recognition unit 104 stops speech recognition, the power control unit 106 may turn off the power of the input unit 101.
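The four stop conditions described above (no mouth movement, no voice input, a non-language sound, a constant-frequency sound) can be pictured as a small monitor that is updated with each new observation. The following is only a sketch: the class name, the reason strings, and the time limits are hypothetical, and how a non-language or constant-frequency sound is actually classified is left to the caller because the source does not describe it. The conditions are checked in the order S405 to S408 of the FIG. 4 flowchart.

```python
from typing import Optional


class StopMonitor:
    """Tracks the stop conditions; all limits are illustrative values in seconds."""

    def __init__(self, started_at: float, mouth_idle_limit: float = 5.0,
                 silence_limit: float = 5.0, steady_tone_limit: float = 3.0):
        self.mouth_idle_limit = mouth_idle_limit
        self.silence_limit = silence_limit
        self.steady_tone_limit = steady_tone_limit
        self.last_mouth_motion = started_at
        self.last_audio = started_at
        self.steady_tone_since: Optional[float] = None

    def update(self, now: float, mouth_moved: bool, audio_present: bool,
               non_language: bool, steady_tone: bool) -> Optional[str]:
        """Record one observation; return a stop reason, or None to keep recognizing."""
        if mouth_moved:
            self.last_mouth_motion = now
        if audio_present:
            self.last_audio = now
        if steady_tone:
            if self.steady_tone_since is None:
                self.steady_tone_since = now  # constant-frequency sound just began
        else:
            self.steady_tone_since = None

        if now - self.last_mouth_motion >= self.mouth_idle_limit:
            return "no_mouth_movement"
        if now - self.last_audio >= self.silence_limit:
            return "no_audio_input"
        if non_language:
            return "non_language_sound"
        if (self.steady_tone_since is not None
                and now - self.steady_tone_since >= self.steady_tone_limit):
            return "constant_frequency_sound"
        return None
```

When `update` returns a reason, the caller would stop speech recognition and, if a power control unit is present, could also power off the input unit, as the text allows.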

In the present embodiment, the speech recognition device 100 may also be used in a navigation device mounted on a moving body. In this case, the detection unit 102 may detect a part that moves during utterance on the body of at least one of a plurality of users on board the moving body. The at least one user is a user who speaks to the navigation device, for example the driver or the front passenger, but may also be a rear-seat passenger.

The image recognition unit 103 then performs image recognition of the action state related to the utterance of the at least one user, based on the detection result of the detection unit 102. The speech recognition unit 104 starts speech recognition on the voice input to the input unit 101 after the image recognition unit 103 has image-recognized the action state related to the utterance of the at least one user. By detecting a part that moves during utterance on the body of at least one of the users on board the moving body, this configuration allows passengers to give voice input to the navigation device.

(Speech recognition processing procedure of speech recognition device)
Next, the speech recognition processing procedure of the speech recognition device 100 will be described with reference to FIG. 2. FIG. 2 is a flowchart showing an example of the speech recognition processing procedure of the speech recognition device 100 according to the present embodiment.

In the flowchart of FIG. 2, the speech recognition device 100 uses the detection unit 102 to detect a part of the user's body that moves during utterance (step S201). It then waits until the image recognition unit 103 image-recognizes the action state related to the user's utterance, based on the detection result of the detection unit 102 (step S202: No loop).

When the action state related to the user's utterance is image-recognized (step S202: Yes), the power control unit 106 turns on the power of the input unit 101 (step S203). The speech recognition unit 104 then starts speech recognition on the voice input to the input unit 101 (step S204), and the series of processing ends.
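The procedure of steps S201 to S204 can be sketched as a short gate function. This is an illustrative sketch rather than the claimed implementation: the function name and the two callbacks standing in for the power control unit 106 and the speech recognition unit 104 are hypothetical, and the image recognition result is assumed to arrive as one boolean per detection result.

```python
from typing import Callable, Iterable


def run_recognition_gate(
    recognition_results: Iterable[bool],
    power_on_input: Callable[[], None],
    start_recognition: Callable[[], None],
) -> bool:
    """Wait for an utterance-related action, then power on and start recognition.

    Returns True once recognition has been started, or False if the stream of
    detection results ends first.
    """
    for recognized in recognition_results:  # S201: one result per detection
        if not recognized:                  # S202: No, keep waiting
            continue
        power_on_input()                    # S203: power on the input unit
        start_recognition()                 # S204: start speech recognition
        return True
    return False
```

Passing the hardware actions in as callbacks keeps the ordering (power on first, then start recognition) easy to check without a camera or microphone attached.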

As described above, the speech recognition device 100 according to the present embodiment starts speech recognition on the input voice after the action state related to the user's utterance has been image-recognized based on the detection result for a part of the user's body that moves during utterance. Speech recognition can therefore be started without the user operating a talk switch, which reduces the user's effort.

In the present embodiment, if the movement of the user's mouth is image-recognized based on the detection result for the user's mouth, the action state related to the user's utterance can be recognized easily.

In the present embodiment, if the input unit 101 is powered on and speech recognition processing is started when the action state related to the user's utterance has been image-recognized, the power is on only when speech recognition needs to be performed, and power consumption can be reduced.

In the present embodiment, if speech recognition is stopped when the user can be assumed not to be about to speak, for example when the user's mouth has not moved for a predetermined time, or when the input can be recognized as not being an utterance from the user, for example a non-language sound, then unnecessary speech recognition is avoided, and misrecognition and the resulting malfunctions can be prevented. In particular, if the input unit 101 is powered off under the conditions in which speech recognition on the input voice is stopped, power consumption can be reduced.

In the present embodiment, a navigation device provided with the speech recognition device 100 spares the user the trouble of operating a talk switch, so the user can begin driving sooner and can concentrate on driving.

If a part that moves during utterance is detected on the body of at least one of the plurality of users on board the moving body, it also becomes possible, for example, to accept utterances from passengers other than the driver.

An example of the present invention is described below. This example describes a case in which the speech recognition device 100 of the present invention is implemented by a navigation device mounted on a vehicle.

(Hardware configuration of navigation device)
The hardware configuration of the navigation device 300 according to this example will be described with reference to FIG. 3. FIG. 3 is a block diagram illustrating an example of the hardware configuration of the navigation device 300 according to this example. In FIG. 3, the navigation device 300 is mounted on a moving body such as a vehicle and includes a CPU 301, a ROM 302, a RAM 303, a magnetic disk drive 304, a magnetic disk 305, an optical disk drive 306, an optical disk 307, an audio I/F (interface) 308, a microphone 309, a speaker 310, an input device 311, a video I/F 312, a display 313, a communication I/F 314, a GPS unit 315, various sensors 316, and a camera 317. The components 301 to 317 are connected to one another by a bus 320.

The CPU 301 governs overall control of the navigation device 300. The ROM 302 stores various programs such as a boot program, a current position calculation program, a route search program, a route guidance program, and a speech recognition program. The RAM 303 is used as a work area for the CPU 301.

The current position calculation program calculates the current position of the vehicle (the current position of the navigation device 300) based on, for example, output information from the GPS unit 315 and the various sensors 316 described later.

The route search program searches for an optimal route from the departure point to the destination point using map data and the like recorded on the magnetic disk 305 described later. Here, the optimal route is, for example, the shortest (or fastest) route to the destination point or the route that best matches conditions specified by the user. Routes to stopover points and rest points, not only to the destination point, may also be searched. The guidance route found is output to the audio I/F 308 and the video I/F 312 via the CPU 301.
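The source does not name the algorithm used by the route search program. As one common way to find a shortest (or, with travel times as costs, fastest) route, the following minimal sketch uses Dijkstra's algorithm over a small road graph. The graph format, node names, and function name are assumptions made for illustration only.

```python
import heapq
from typing import Dict, List, Optional, Tuple

# Assumed map format: node -> list of (neighbor, cost), where cost may be
# a distance (shortest route) or a travel time (fastest route).
Graph = Dict[str, List[Tuple[str, float]]]


def shortest_route(graph: Graph, start: str, goal: str) -> Optional[Tuple[float, List[str]]]:
    """Return (total cost, list of nodes) for the cheapest route, or None if unreachable."""
    queue: List[Tuple[float, str, List[str]]] = [(0.0, start, [start])]
    settled: Dict[str, float] = {}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in settled and settled[node] <= cost:
            continue  # already reached this node at no greater cost
        settled[node] = cost
        for neighbor, edge_cost in graph.get(node, []):
            heapq.heappush(queue, (cost + edge_cost, neighbor, path + [neighbor]))
    return None
```

A stopover point could be handled with the same function by searching from the departure point to the stopover and then from the stopover to the destination.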

The route guidance program generates real-time route guidance information based on the guidance route information found by executing the route search program, the current position information of the vehicle calculated by executing the current position calculation program, and the map data read from the magnetic disk 305. The generated route guidance information is output to the audio I/F 308 and the video I/F 312 via the CPU 301.

The speech recognition program starts speech recognition on the voice input from the audio I/F 308 after the movement of the user's mouth has been image-recognized based on the images of the user's mouth captured by the camera 317.

The magnetic disk drive 304 controls the reading and writing of data on the magnetic disk 305 under the control of the CPU 301. The magnetic disk 305 records data written under the control of the magnetic disk drive 304. As the magnetic disk 305, for example, an HD (hard disk) or an FD (flexible disk) can be used.

The optical disk drive 306 controls the reading and writing of data on the optical disk 307 under the control of the CPU 301. The optical disk 307 is a removable recording medium from which data is read under the control of the optical disk drive 306. A writable recording medium can also be used as the optical disk 307. Besides the optical disk 307, the removable recording medium may be an MO, a memory card, or the like.

The audio I/F 308 is connected to the microphone 309 for audio input and the speaker 310 for audio output. The microphone 309 is a hands-free microphone that picks up sound in the vehicle interior. The microphone 309 is installed, for example, near the sun visor of the vehicle, and there may be one or more of them. The sound received by the microphone 309 is A/D converted in the audio I/F 308. Sound is output from the speaker 310.

Examples of the input device 311 include a remote controller with a plurality of keys for inputting characters, numerical values, and various instructions, as well as a keyboard, a mouse, and a touch panel.

The video I/F 312 is connected to the display 313. Specifically, the video I/F 312 includes, for example, a graphic controller that controls the display 313 as a whole, a buffer memory such as a VRAM (Video RAM) that temporarily records image information ready for immediate display, and a control IC that controls the display 313 based on the image data output from the graphic controller.

The display 313 shows icons, cursors, menus, windows, and various data such as characters and images. As the display 313, for example, a CRT, a TFT liquid crystal display, or a plasma display can be used.

The communication I/F 314 is connected to a network wirelessly and functions as an interface between the navigation device 300 and the CPU 301. The communication I/F 314 is further connected wirelessly to a communication network such as the Internet and also functions as an interface between this communication network and the CPU 301.

Communication networks include LANs, WANs, public line networks, and mobile phone networks. Specifically, the communication I/F 314 is made up of, for example, an FM tuner, a VICS (Vehicle Information and Communication System)/beacon receiver, a wireless navigation device, and other navigation devices, and acquires road traffic information such as congestion and traffic regulations distributed from a VICS center. VICS is a registered trademark.

When, for example, DSRC (Dedicated Short Range Communication) is used, the communication I/F 314 is an in-vehicle wireless device that performs two-way wireless communication with wireless devices installed at the roadside, and it acquires various information such as traffic information and map information. A specific example of DSRC is ETC (a non-stop automatic toll payment system).

The GPS unit 315 receives radio waves from GPS satellites and outputs information indicating the current position of the vehicle. The output information of the GPS unit 315 is used, together with the output values of the various sensors 316 described later, when the CPU 301 calculates the current position of the vehicle. The information indicating the current position is information that specifies one point on the map information, such as latitude, longitude, and altitude.

The various sensors 316 include a vehicle speed sensor, an acceleration sensor, an angular velocity sensor, and the like, and output information from which the position and behavior of the vehicle can be determined. The output values of the various sensors 316 are used by the CPU 301 to calculate the current position of the vehicle and to measure changes in speed and heading.

The camera 317 captures, for example, video of the driver's mouth. The camera 317 may also capture video of the mouth of a passenger in the front passenger seat or a rear seat. Moving images are used as the video.

The functions of the input unit 101, the detection unit 102, the image recognition unit 103, the speech recognition unit 104, the output unit 105, and the power control unit 106 of the speech recognition device 100 shown in FIG. 1 are realized by the CPU 301 executing a predetermined program and controlling each part of the navigation device 300, using the programs and data recorded in the ROM 302, the RAM 303, the magnetic disk 305, the optical disk 307, and the like of the navigation device 300 shown in FIG. 3.

That is, by executing the speech recognition program recorded in the ROM 302, which serves as a recording medium in the navigation device 300, the navigation device 300 of this example can carry out the functions of the speech recognition device 100 shown in FIG. 1 according to the speech recognition processing procedure shown in FIG. 2.

(Example of speech recognition processing of navigation device)
Next, an example of the speech recognition processing performed by the navigation device 300 according to this example will be described with reference to FIG. 4. FIG. 4 is a flowchart illustrating an example of the speech recognition processing of the navigation device 300 according to this example.

In the flowchart of FIG. 4, the navigation device 300 images the user's mouth with the camera 317 (step S401). Until movement of the user's mouth is recognized in the image (step S402: No loop), the process returns to step S401. When movement of the user's mouth is recognized (step S402: Yes), the microphone 309 is powered on (step S403).
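
Steps S401 to S403 can be illustrated with the following minimal sketch. The specification does not say how mouth movement is recognized, so frame differencing with a fixed threshold is used here purely as an assumed stand-in; `mean_abs_diff`, `wait_for_mouth_movement`, the `Microphone` class, and the threshold value are all hypothetical.

```python
def mean_abs_diff(prev_frame, frame):
    """Mean absolute pixel difference between two equally sized grayscale frames."""
    total = 0
    count = 0
    for prev_row, row in zip(prev_frame, frame):
        for p, q in zip(prev_row, row):
            total += abs(p - q)
            count += 1
    return total / count if count else 0.0


def wait_for_mouth_movement(frames, threshold=10.0):
    """Return the index of the first frame that shows movement, or None.

    Corresponds to the S401/S402 loop: keep imaging until movement is recognized.
    """
    prev = None
    for index, frame in enumerate(frames):
        if prev is not None and mean_abs_diff(prev, frame) >= threshold:
            return index
        prev = frame
    return None


class Microphone:
    """Hypothetical stand-in for the microphone 309 and its power control."""

    def __init__(self):
        self.powered = False

    def power_on(self):
        self.powered = True


# Tiny 2x2 "mouth region" frames: two still frames, then one that changes.
still = [[100, 100], [100, 100]]
moved = [[100, 100], [40, 40]]

mic = Microphone()
first_movement = wait_for_mouth_movement([still, still, moved])
if first_movement is not None:
    mic.power_on()  # step S403
```

The microphone stays off while the frames are unchanged and is powered on only once a frame differs enough from its predecessor, which mirrors the order of steps in the flowchart.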

Thereafter, speech recognition on the speech input to the microphone 309 is started (step S404). It is then determined whether the mouth has been without movement for a predetermined time or longer (step S405). If it is determined in step S405 that the mouth has moved within the predetermined time (step S405: No), it is determined whether there has been no speech input for a predetermined time or longer (step S406).

If it is determined in step S406 that speech has been input within the predetermined time (step S406: No), it is determined whether the input speech is a non-verbal sound (step S407). Non-verbal sounds are sounds such as coughing, sneezing, and yawning. If it is determined in step S407 that the input speech is not a non-verbal sound (step S407: No), it is determined whether sound of a constant frequency has been input for a predetermined time or longer (step S408). Sound of a constant frequency is input for a predetermined time or longer when, for example, the user is chewing gum.
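
The check in step S408 might be sketched as follows. The specification does not define how the frequency is measured or how close two values must be to count as constant, so the per-frame dominant frequency input, the tolerance, and the name `has_constant_frequency` are assumptions for illustration.

```python
def has_constant_frequency(dominant_freqs_hz, frame_s, min_duration_s, tolerance_hz=20.0):
    """Return True if the dominant frequency stays near one value long enough.

    dominant_freqs_hz: one dominant frequency per audio frame, or None for a
    silent frame (assumed to be supplied by some earlier analysis stage).
    frame_s: duration of one frame in seconds.
    min_duration_s: the "predetermined time" of step S408.
    """
    run_start_hz = None
    run_frames = 0
    for freq in dominant_freqs_hz:
        if freq is None:
            # Silence breaks the run.
            run_start_hz = None
            run_frames = 0
        elif run_start_hz is not None and abs(freq - run_start_hz) <= tolerance_hz:
            # Still close to the frequency that started the run.
            run_frames += 1
        else:
            # A new frequency starts a new run.
            run_start_hz = freq
            run_frames = 1
        if run_frames * frame_s >= min_duration_s:
            return True
    return False


steady = has_constant_frequency([200, 205, 198, 202, 201], frame_s=0.5, min_duration_s=2.0)
varied = has_constant_frequency([200, 400, 150, 600, 300], frame_s=0.5, min_duration_s=2.0)
```

Here `steady` is True because four consecutive frames stay within the tolerance for 2.0 seconds, while `varied` is False because no run lasts longer than one frame.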

If it is determined in step S408 that sound of a constant frequency has not been input for the predetermined time or longer (step S408: No), the process returns to step S404. If, on the other hand, it is determined in step S408 that sound of a constant frequency has been input for the predetermined time or longer (step S408: Yes), the microphone 309 is powered off (step S409), and the series of processing ends.

If it is determined in step S405 that the mouth has not moved for the predetermined time or longer (step S405: Yes), the process proceeds to step S409. Likewise, if it is determined in step S406 that there has been no speech input for the predetermined time or longer (step S406: Yes), the process proceeds to step S409. If it is determined in step S407 that the input speech is a non-verbal sound (step S407: Yes), the process also proceeds to step S409.
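
The branching of steps S404 to S409 described above can be summarized in a short sketch. It assumes that the four determinations have already been made elsewhere and are supplied as boolean flags; the `Observation` and `StopReason` types and the function names are hypothetical, and powering the microphone on and off is represented only by comments.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional, Tuple


class StopReason(Enum):
    NO_MOUTH_MOVEMENT = "S405"
    NO_SPEECH_INPUT = "S406"
    NON_VERBAL_SOUND = "S407"
    CONSTANT_FREQUENCY = "S408"


@dataclass
class Observation:
    """Results of the four determinations for one pass through the loop."""
    mouth_idle: bool          # S405: no mouth movement for the predetermined time
    audio_idle: bool          # S406: no speech input for the predetermined time
    non_verbal: bool          # S407: input is a non-verbal sound
    constant_frequency: bool  # S408: constant-frequency sound for the predetermined time


def stop_reason(obs: Observation) -> Optional[StopReason]:
    """Evaluate S405 to S408 in flowchart order; None means return to S404."""
    if obs.mouth_idle:
        return StopReason.NO_MOUTH_MOVEMENT
    if obs.audio_idle:
        return StopReason.NO_SPEECH_INPUT
    if obs.non_verbal:
        return StopReason.NON_VERBAL_SOUND
    if obs.constant_frequency:
        return StopReason.CONSTANT_FREQUENCY
    return None


def run_session(observations: Iterable[Observation]) -> Tuple[int, Optional[StopReason]]:
    """Return (number of recognition passes, reason the session stopped)."""
    passes = 0
    for obs in observations:
        passes += 1  # S404: recognition runs on the current input
        reason = stop_reason(obs)
        if reason is not None:
            return passes, reason  # S409: the microphone would be powered off here
    return passes, None


session = [
    Observation(False, False, False, False),  # ordinary speech: loop back to S404
    Observation(False, False, True, False),   # a cough: stop via S407
]
result = run_session(session)
```

In this example the first pass loops back to step S404 and the second pass ends the session at step S407, so `result` holds two passes and the non-verbal stop reason.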

As described above, the navigation device 300 according to the present example powers on the microphone and starts speech recognition on the input speech after movement of the user's mouth has been recognized from the captured image of the user's mouth. Speech recognition can therefore be started without the user operating a talk switch, which reduces the user's effort.

Further, in the present example, if the microphone is powered on and speech recognition processing is started when an action state related to the user's utterance is recognized from the image, the power can be turned on only when speech recognition needs to be performed, and power consumption can be reduced.

Further, in the present example, speech recognition is stopped when it can be assumed that the user is not about to speak, such as when the user's mouth has not moved for a predetermined time, or when the input can be recognized as not being an utterance from the user, such as a non-verbal sound. This avoids unnecessary speech recognition and prevents misrecognition and the malfunctions that result from it. In particular, because the microphone is powered off under the conditions in which speech recognition on the input speech is stopped, power consumption can be reduced.

Further, according to the navigation device 300 of the present example, the user is spared the trouble of operating the talk switch, so the user can begin driving sooner and can concentrate on driving. In addition, if the mouth of at least one of the users aboard the moving body who is seated in the front passenger seat or a rear seat is detected, utterances from occupants other than the driver can also be accepted.

As described above, the speech recognition device, the navigation device including the speech recognition device, the speech recognition method, the speech recognition program, and the recording medium of the present invention start speech recognition on the input speech after an action state related to the user's utterance has been recognized from the image, based on the result of detecting a part of the user's body that moves during speech. Speech recognition can therefore be started without the user operating a talk switch, which reduces the user's effort.

The speech recognition method described in the present example can be realized by executing a program prepared in advance on a computer such as a personal computer or a workstation. The program is recorded on a computer-readable recording medium such as a hard disk, a flexible disk, a CD-ROM, an MO, or a DVD, and is executed by being read from the recording medium by the computer. The program may also be a transmission medium that can be distributed via a network such as the Internet.

FIG. 1 is a block diagram showing an example of the functional configuration of the speech recognition device according to the present embodiment.
FIG. 2 is a flowchart showing an example of the speech recognition processing procedure of the speech recognition device according to the present embodiment.
FIG. 3 is a block diagram showing an example of the hardware configuration of the navigation device according to the present example.
FIG. 4 is a flowchart showing an example of the speech recognition processing of the navigation device according to the present example.

Explanation of symbols

100 Speech recognition device
101 Input unit
102 Detection unit
103 Image recognition unit
104 Speech recognition unit
105 Output unit
106 Power control unit
300 Navigation device

Claims

1. A speech recognition device comprising:
input means to which speech from a user is input;
detection means for detecting a part of the user's body that moves during speech;
image recognition means for recognizing, based on a detection result of the detection means, an action state related to the user's utterance; and
speech recognition means for starting speech recognition processing on the speech input to the input means after the action state related to the user's utterance has been recognized by the image recognition means.

2. The speech recognition device according to claim 1, wherein
the detection means detects the user's mouth,
the image recognition means recognizes, based on the detection result of the detection means, that the user's mouth is moving, and
the speech recognition means starts speech recognition processing on the speech input to the input means after the image recognition means has recognized that the user's mouth is moving.

3. The speech recognition device according to claim 1, wherein
the image recognition means recognizes, based on the detection result of the detection means, that the user's mouth has not moved for a predetermined time, and
the speech recognition means stops the speech recognition processing on the speech input to the input means when the image recognition means recognizes that the user's mouth has not moved for the predetermined time.

4. The speech recognition device according to any one of the preceding claims, wherein the speech recognition means stops the speech recognition processing on the speech input to the input means when it is determined that no speech has been input to the input means for a predetermined time or longer.

5. The speech recognition device according to claim 1, wherein the speech recognition means stops the speech recognition processing on the speech input to the input means when a non-verbal sound is input to the input means.

6. The speech recognition device according to claim 5, wherein the speech recognition means stops the speech recognition processing on the speech input to the input means when sound of a constant frequency is input to the input means for a predetermined time or longer.

7. The speech recognition device, further comprising power control means for turning on the power of the input means when the action state related to the user's utterance is recognized by the image recognition means, wherein
the speech recognition means starts speech recognition processing on the speech input to the input means after the input means has been powered on.

8. The speech recognition device according to claim 7, wherein the power control means turns off the power of the input means when the speech recognition means stops the speech recognition processing.

9. A navigation device mounted on a moving body and comprising the speech recognition device according to any one of claims 1 to 8, wherein
the detection means detects a part that moves during speech in the body of at least one of a plurality of users aboard the moving body,
the image recognition means recognizes, based on the detection result of the detection means, an action state related to the utterance of the at least one user, and
the speech recognition means starts speech recognition on the speech input to the input means after the image recognition means has recognized the action state related to the utterance of the at least one user.

10. A speech recognition method comprising:
an input step in which speech from a user is input;
a detection step of detecting an action state related to the user's utterance;
an image recognition step of recognizing, based on a detection result of the detection step, the action state related to the user's utterance; and
a speech recognition step of starting speech recognition processing on the speech input in the input step after the action state related to the user's utterance has been recognized in the image recognition step.

11. A speech recognition program for causing a computer to execute the speech recognition method according to claim 10.

12. A computer-readable recording medium on which the speech recognition program according to claim 11 is recorded.