JP2014240856A

JP2014240856A - Voice input system and computer program

Info

Publication number: JP2014240856A
Application number: JP2013122361A
Authority: JP
Inventors: Takashi Miyake (三宅 隆)
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2013-06-11
Filing date: 2013-06-11
Publication date: 2014-12-25

Abstract

PROBLEM TO BE SOLVED: To provide a voice input system and a computer program that accept voice input correctly and promptly.

SOLUTION: A lip recognition unit 7 detects, from video of the user captured by a camera 2, that the user's lips have stopped moving, and outputs a lip movement stop notification. When audio input occurs, a voice recognition device 6 captures it, performs voice recognition on the audio captured up to the point at which either the voice interval of the input ends or the lip movement stop notification is output, and outputs a voice recognition result. A control device 8 accepts as voice input only those voice recognition results that are output within a predetermined period after the lip movement stop notification is output.

Description

The present invention relates to a technique for accepting voice input uttered by a user.

One known technique for accepting voice input from a user detects lip movement in video of the speaking user's lips, identifies from that movement the utterance interval (the time interval during which the user is speaking), performs voice recognition on the audio input during the identified interval, and accepts a voice input whose content is the recognition result (see, for example, Patent Documents 1 and 2).

Patent Document 1: Japanese Patent Laid-Open No. H11-338490
Patent Document 2: Japanese Patent Laid-Open No. 2008-152125

When voice recognition is performed on audio input during an utterance interval identified from lip movement as described above, the recognition process starts only after the start of the user's utterance has been detected from the lip movement in the captured video. As a result, acceptance of the voice input may not complete promptly after the user finishes speaking.

For example, suppose that recognition preprocessing, such as launching the recognition process and capturing the audio to be recognized, is performed only after the start of the utterance has been detected from lip movement, and that recognition is performed after that. Compared with starting recognition immediately after the user finishes speaking, acceptance of the voice input may then be delayed by the time needed to detect the start of the utterance from lip movement and to carry out the preprocessing.

If, on the other hand, voice recognition is performed on the input audio at all times and voice input is accepted according to the recognition results, the delay caused by detecting the start of the utterance and by preprocessing is avoided. However, recognition is then also performed on noise, and the noise is accepted as voice input, so erroneous voice input may be accepted. Moreover, when noise follows the user's speech without a break, recognition may treat the moment the noise ceases as the end of the audio. In that case, completion of voice input acceptance is delayed by the duration of the noise that follows the end of the user's utterance.

An object of the present invention is therefore to provide a voice input system that can accept voice input from a user's speech correctly and promptly.

To achieve this object, the present invention provides a voice input system that accepts voice input from a user, comprising: a camera that captures at least a portion of the user including the lips; lip recognition means that detects, from the images captured by the camera, a change in the motion state of the user's lips from a moving state to a stopped state; a microphone; voice recognition means that, in response to the occurrence of audio input from the microphone, performs voice recognition on the input audio and outputs a voice recognition result; and voice input acceptance means that accepts a voice input whose content is a voice recognition result only for those results output by the voice recognition means within a predetermined period after the lip recognition means detects the change to the stopped state.

In such a system, the voice recognition means always performs voice recognition in response to audio input from the microphone, so recognition is not delayed by detection of the start of an utterance from lip movement or by recognition preprocessing. In addition, only recognition results output within the predetermined period after the lips change from the moving state to the stopped state are accepted as the user's voice input. These are the results for audio input from the microphone while the user was moving the lips before stopping them, that is, during the user's utterance period. Voice input is therefore accepted correctly, for the user's speech only.

In this voice input system, the voice recognition means is preferably configured so that, when the lip recognition means detects the change to the stopped state, it immediately performs voice recognition on the continuous portion of audio input from the microphone up to that point and outputs a voice recognition result.

With this configuration, when noise follows the user's speech without a break, the voice input can be accepted more promptly after the user finishes speaking.

As described above, the present invention provides a voice input system that can accept voice input from a user's speech correctly and promptly.

FIG. 1 is a block diagram showing the configuration of an information system according to an embodiment of the present invention.
FIG. 2 is a diagram showing the arrangement of the information system according to the embodiment.
FIG. 3 is a flowchart showing the processing performed by the information system according to the embodiment.
FIG. 4 is a diagram showing an operation example of the information system according to the embodiment.

An embodiment of the present invention is described below, taking as an example its application to an information system mounted in an automobile.
FIG. 1 shows the configuration of the information system according to this embodiment.
The information system constitutes, for example, an AV system or a navigation system. As illustrated, it includes a microphone 1, a camera 2, a sub-display 3, a main display 4, an audio output device 5, a voice recognition device 6, a lip recognition unit 7, a control device 8, and one or more peripheral devices 9.

The information system may be built on a computer having a CPU and memory. In that case, the voice recognition device 6, the lip recognition unit 7, and the control device 8 may be implemented by the computer executing a predetermined computer program.

In this information system, as shown in FIG. 2, the camera 2 is placed in the instrument cluster of the automobile and captures the part of the face of the user (the driver) that includes the lips.
The microphone 1 is also placed in the instrument cluster and picks up the user's speech and other sound.
The sub-display 3 is a liquid crystal display placed in the instrument cluster and is used, among other things, to display a virtual assistant character 31. Displaying the virtual assistant character 31 in the instrument cluster encourages the user to speak toward it when giving voice input. Because the user speaks toward the virtual assistant character 31, the camera 2 in the instrument cluster can capture the part of the user's face including the lips well, and the microphone 1 in the instrument cluster can pick up the user's speech well.

The main display 4 is, for example, a liquid crystal display. As shown in FIG. 2, it is placed on the dashboard midway between the driver's seat and the passenger seat in the left-right direction, and is used, among other things, to display the GUI screens of the AV system or navigation system that the information system constitutes.

The voice input operation of this information system is described below.
First, the voice recognition control process performed by the voice recognition device 6 is described.
FIG. 3a shows the procedure of this process.
As illustrated, the voice recognition device 6 monitors whether the level of the input audio from the microphone 1 exceeds a predetermined threshold Tha (step 302). When the level exceeds Tha, the device starts capturing the input audio as recognition-target audio (step 304). The threshold Tha is set to the input level at or above which audio can be regarded as present.

The device then monitors for two events: the input level remaining at or below the threshold Tha for at least a predetermined period (for example, 300 ms) (step 306), and receipt of a lip movement stop notification from the lip recognition unit 7 (step 308).
If the input level has remained at or below Tha for at least the predetermined period (step 306), capture of the input audio as recognition-target audio stops (step 310).
Voice recognition is then performed on the recognition-target audio captured so far, that is, on the audio of the voice interval (a time interval in which audio is continuously present) that ends at the current time (step 312). The recognized content is sent to the control device 8 as a voice recognition result (step 314), and processing returns to step 302.

Capture of the input audio as recognition-target audio also stops when a lip movement stop notification is received from the lip recognition unit 7 (step 310).
Voice recognition is then performed on the recognition-target audio captured so far, that is, on the part of the voice interval containing the current time that precedes the current time (step 312). The recognized content is sent to the control device 8 as a voice recognition result (step 314), and processing returns to step 302.
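The control flow of steps 302 to 314 can be sketched as a small state machine. This is only an illustrative sketch under stated assumptions: the class name `CaptureController`, the per-frame `feed` interface, and the numeric value of `THA` are hypothetical and do not come from the patent; only the 300 ms example duration does. The recognition itself (step 312) is left out, and the sketch returns only the boundaries of the captured audio and the reason capture ended.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

THA = 0.1          # assumed value for the level threshold Tha (arbitrary units)
SILENCE_MS = 300   # example silence duration given in the description


@dataclass
class CaptureController:
    """Hypothetical model of steps 302-310; recognition (step 312) is omitted."""
    capturing: bool = False
    start_ms: int = 0
    quiet_since: Optional[int] = None

    def feed(self, t_ms: int, level: float,
             lip_stop: bool = False) -> Optional[Tuple[int, int, str]]:
        """Process one audio frame.

        Returns (start_ms, end_ms, reason) when capture ends, else None.
        """
        if not self.capturing:
            if level > THA:                       # step 302 -> step 304
                self.capturing = True
                self.start_ms = t_ms
                self.quiet_since = None
            return None
        if lip_stop:                              # step 308 -> step 310
            self.capturing = False
            return (self.start_ms, t_ms, "lip_stop")
        if level <= THA:
            if self.quiet_since is None:
                self.quiet_since = t_ms
            if t_ms - self.quiet_since >= SILENCE_MS:   # step 306 -> step 310
                self.capturing = False
                return (self.start_ms, t_ms, "silence")
        else:
            self.quiet_since = None               # audio resumed; reset the silence timer
        return None
```

In this sketch a lip movement stop notification that arrives while nothing is being captured is ignored, which matches the flowchart order in which the notification is monitored only after capture has started.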

This completes the description of the voice recognition control process performed by the voice recognition device 6.
Next, the operation of the lip recognition unit 7 is described.
The lip recognition unit 7 repeatedly detects, from the video captured by the camera 2, the orientation of the lips of the user (the driver) and whether they are moving.
The lip recognition unit 7 also performs the lip movement detection process shown in FIG. 3b.
That is, the lip recognition unit 7 waits for the user's lips to start moving (step 324) while they are facing forward (step 322). Once the lips have started moving while facing forward, the unit monitors for either the lips no longer facing forward (step 326) or the lip movement stopping for at least a predetermined period (for example, 100 ms) (step 328).
If the lips are no longer facing forward (step 326), processing simply returns to step 322. If the lip movement has stopped for at least the predetermined period (step 328), the unit sends a lip movement stop notification to the voice recognition device 6 and the control device 8 (step 330), and processing returns to step 322.
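The lip movement detection process of FIG. 3b can be sketched as a generator over per-frame observations. This is a minimal sketch under assumptions: the function name `lip_stop_notifications` and the frame tuple `(t_ms, facing_front, moving)` are hypothetical, and the image analysis that would produce the two flags is not shown. Only the 100 ms example duration comes from the description.

```python
from typing import Iterable, Iterator, Optional, Tuple

LIP_STOP_MS = 100  # example stop duration given in the description


def lip_stop_notifications(
    frames: Iterable[Tuple[int, bool, bool]]
) -> Iterator[int]:
    """Yield the time (ms) of each lip movement stop notification.

    Each frame is (t_ms, facing_front, moving), assumed to be the output of
    a per-frame lip analysis that is outside the scope of this sketch.
    """
    speaking = False
    still_since: Optional[int] = None
    for t_ms, facing_front, moving in frames:
        if not speaking:
            if facing_front and moving:          # steps 322 and 324
                speaking = True
                still_since = None
            continue
        if not facing_front:                     # step 326: give up, no notification
            speaking = False
            continue
        if moving:
            still_since = None                   # movement resumed; reset the stop timer
        else:
            if still_since is None:
                still_since = t_ms
            if t_ms - still_since >= LIP_STOP_MS:    # step 328 -> step 330
                yield t_ms
                speaking = False                 # back to step 322
```

A brief pause shorter than the stop duration does not produce a notification, and turning away from the camera cancels the pending detection without one.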

This completes the description of the operation of the lip recognition unit 7.
Next, the voice input acceptance process performed by the control device 8 is described.
FIG. 3c shows the procedure of this process.
As illustrated, the process monitors for receipt of a lip movement stop notification from the lip recognition unit 7 (step 342). When the notification is received, a timer set to a predetermined timeout (for example, 1 second) is started (step 344).
The process then monitors for timer expiry (step 346) and for receipt of a voice recognition result from the voice recognition device 6 (step 348). If the timer expires (step 346), processing simply returns to step 342.
If a voice recognition result is received from the voice recognition device 6 (step 348) before the timer expires (step 346), the control device accepts a voice input whose content is what the voice recognition device 6 recognized, as given by the result (step 350), performs processing according to the content of the accepted voice input (step 352), and returns to step 342.
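The acceptance logic of steps 342 to 350 amounts to a time-limited gate that opens on each lip movement stop notification. The sketch below is an assumption-laden illustration: the class name `AcceptanceGate` and its two callbacks are hypothetical, timestamps are passed in explicitly instead of using a real timer, and a result arriving exactly at the timeout is treated as too late. Only the 1 second example timeout comes from the description.

```python
from typing import Optional

TIMEOUT_MS = 1000  # example timeout given in the description


class AcceptanceGate:
    """Hypothetical model of steps 342-350 in the control device."""

    def __init__(self, timeout_ms: int = TIMEOUT_MS) -> None:
        self.timeout_ms = timeout_ms
        self.deadline: Optional[int] = None   # None means the gate is closed

    def on_lip_stop(self, t_ms: int) -> None:
        """Steps 342 -> 344: start the timer when the notification arrives."""
        self.deadline = t_ms + self.timeout_ms

    def on_result(self, t_ms: int, text: str) -> Optional[str]:
        """Steps 346/348: return the text if accepted, otherwise None."""
        if self.deadline is None or t_ms >= self.deadline:
            self.deadline = None              # closed or timed out: back to step 342
            return None
        self.deadline = None                  # step 350, then back to step 342
        return text
```

For example, a result delivered 200 ms after a notification is returned for processing (step 352), while a result delivered with no preceding notification, or after the timeout, is dropped. The gate closes after accepting one result, mirroring the return to step 342 in the flowchart.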

Processing according to the content of the accepted voice input includes outputting, from the audio output device 5, feedback speech that confirms the content of the accepted input or the control to be performed in response; displaying an animation in which the virtual assistant character 31 on the sub-display 3 appears to speak the feedback; and controlling the AV system or navigation system that the information system constitutes according to the accepted input. For example, if the accepted voice input is "Play song X", the control device 8 outputs feedback speech such as "Playing song X" from the audio output device 5 while animating the mouth of the virtual assistant character 31 on the sub-display 3 opening and closing, and at the same time directs the AV system to start playing song X.

This completes the description of the voice input acceptance process performed by the control device 8.
In the information system described above, the voice recognition device 6 performs voice recognition on the input audio from the microphone 1 at all times, but a recognition result is accepted as voice input only within a predetermined period (the timer timeout) after the lip recognition unit 7 detects that the user's lips have stopped moving. Consequently, voice input is accepted only for recognition results of speech the user uttered before the lips stopped moving.

An operation example of the voice input operation of this information system is given below.
As shown in FIG. 4, suppose that input audio S1 is input from the microphone from time t0 to time t1, during a period in which the user's lips are facing forward but not moving. The input audio S1 is captured by the voice recognition device 6 as recognition-target audio D1. When the input audio S1 ends at time t1, the voice recognition device 6 performs voice recognition process P1 on the recognition-target audio D1 captured so far, and sends the resulting voice recognition result R1 to the control device 8.

However, no stop of the user's lip movement has been detected within the preceding predetermined period (the timer timeout Tout), so the control device 8 is not in a state in which it can accept voice input, and it does not accept the voice recognition result R1 obtained by voice recognition process P1 as voice input.

Suppose next that the user speaks, moving the lips while facing forward, from time t2 to time t4, and that input audio S2 is input from the microphone from time t3, immediately after t2, until time t5, which is later than t4.
In this case, the input audio S2 is captured by the voice recognition device 6 as recognition-target audio D2 from time t3. When the stop of lip movement is detected at time t4, the voice recognition device 6 stops capturing the input audio S2 as recognition-target audio D2. The portion of the input audio S2 after time t4 was input while the lips were stopped, and is therefore noise and not the user's speech, so it is excluded from voice recognition and ignored. This allows the voice recognition of the recognition-target audio D2 and the acceptance of its voice recognition result R2 as voice input, both described next, to be carried out promptly.

Once capture of the input audio S2 as recognition-target audio D2 has ended, the voice recognition device 6 performs voice recognition process P2 on the recognition-target audio D2 captured so far, and sends the resulting voice recognition result R2 to the control device 8.
At this point, the stop of the user's lip movement was detected at time t4, which lies within the preceding predetermined period (the timer timeout Tout), so the control device 8 is in a state in which it can accept voice input. The control device 8 therefore accepts the voice recognition result R2 obtained by voice recognition process P2 as voice input and executes processing according to R2.
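The outcome of this operation example can be checked with a few lines of code. The sketch below is a simplification, not the flowchart itself: the function name `accepted_results` is hypothetical, the millisecond values assigned to the events of FIG. 4 are invented for illustration (the description gives no concrete times), and the function accepts every result inside the window, whereas the process of FIG. 3c returns to waiting after the first accepted result.

```python
from typing import List, Tuple


def accepted_results(results: List[Tuple[int, str]],
                     lip_stops: List[int],
                     timeout_ms: int = 1000) -> List[str]:
    """Return the results that arrive within timeout_ms after a lip stop.

    results: (arrival_time_ms, label) pairs sent by the recognizer.
    lip_stops: times (ms) of lip movement stop notifications.
    """
    return [
        label
        for t_res, label in results
        if any(0 <= t_res - t_stop < timeout_ms for t_stop in lip_stops)
    ]


# Hypothetical timeline: R1 arrives at 1100 ms (shortly after t1), the lip
# stop is detected at t4 = 5000 ms, and R2 arrives at 5200 ms.
timeline_results = [(1100, "R1"), (5200, "R2")]
timeline_lip_stops = [5000]
accepted = accepted_results(timeline_results, timeline_lip_stops)
```

With these assumed times, `accepted` contains only `"R2"`: R1 has no preceding lip stop within the timeout, while R2 arrives 200 ms after the stop detected at t4.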

This completes the description of the embodiment of the present invention.
The voice input technique of this embodiment is not limited to information systems mounted in automobiles and can be applied in the same way to any information system.

Description of reference signs: 1: microphone; 2: camera; 3: sub-display; 4: main display; 5: audio output device; 6: voice recognition device; 7: lip recognition unit; 8: control device; 9: peripheral device.

Claims

A voice input system that accepts voice input from a user, comprising:
a camera that captures at least a portion of the user including the lips;
lip recognition means for detecting, from an image captured by the camera, a change in the motion state of the user's lips from a moving state to a stopped state;
a microphone;
voice recognition means for performing, in response to the occurrence of audio input from the microphone, voice recognition processing on the input audio and outputting a voice recognition result; and
voice input acceptance means for accepting a voice input whose content is a voice recognition result, only for those voice recognition results, among the results output by the voice recognition means, that are output from the voice recognition means within a predetermined period after the lip recognition means detects the change to the stopped state.

The voice input system according to claim 1,
wherein, when the lip recognition means detects the change to the stopped state, the voice recognition means immediately performs voice recognition processing on the continuous portion of audio input from the microphone up to that point and outputs a voice recognition result.

A computer program that is read and executed by a computer provided with a camera and a microphone, the program causing the computer to function as:
lip recognition means for detecting, from an image captured by the camera of a portion of a user including at least the lips, a change in the motion state of the user's lips from a moving state to a stopped state;
voice recognition means for performing, in response to the occurrence of audio input from the microphone, voice recognition processing on the input audio and outputting a voice recognition result; and
voice input acceptance means for accepting a voice input whose content is a voice recognition result, only for those voice recognition results, among the results output by the voice recognition means, that are output from the voice recognition means within a predetermined period after the lip recognition means detects the change to the stopped state.

The computer program according to claim 3,
wherein, when the lip recognition means detects the change to the stopped state, the voice recognition means performs voice recognition processing on the continuous portion of audio input from the microphone up to that point and outputs a voice recognition result.