JP2011118822A

JP2011118822A - Electronic apparatus, speech detecting device, voice recognition operation system, and voice recognition operation method and program

Info

Publication number: JP2011118822A
Application number: JP2009277686A
Authority: JP
Inventors: Shintaro Takada; 晋太郎高田; Yuji Yamamoto; 裕二山本
Original assignee: NEC Casio Mobile Communications Ltd
Current assignee: NEC Casio Mobile Communications Ltd
Priority date: 2009-12-07
Filing date: 2009-12-07
Publication date: 2011-06-16

Abstract

PROBLEM TO BE SOLVED: To remotely operate an electronic apparatus body by voice recognition or the like with high accuracy and low power consumption, without requiring cumbersome user operations.

SOLUTION: A television set 100 and a cellular phone 200 each collect ambient sound. The cellular phone 200 detects the start of a user's speech from the collected sound and transmits a speech start signal, together with a voice signal corresponding to the sound, to the television set 100. When it detects the end of the user's speech from the collected sound, it transmits a speech end signal to the television set 100 and stops transmitting the voice signal. On receiving the speech start signal, the television set 100 starts an instruction extracting part that extracts an operation instruction by voice recognition, and extracts the operation instruction on the basis of the voice signal derived from its own collected sound and the received voice signal. On receiving the speech end signal, the television set 100 stops the instruction extracting part.

COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to an electronic device that can be remotely operated by speech, an utterance detection device that detects utterances, a voice recognition operation system and a voice recognition operation method that operate an electronic device main body by voice recognition using the utterance detection device, and a program executed by a computer that controls the electronic device or the utterance detection device.

Remote controls for AV equipment such as televisions have traditionally been used for simple operations such as turning the power on and off and switching channels. In recent years, however, as additional functions such as recording and playback, playback of content over an Internet connection, and home networking have multiplied, the number of buttons on the remote control has grown and menu operation has become complicated. As a result, it is becoming difficult for people unfamiliar with AV equipment to make full use of the remote control.

Accordingly, there has been active research and development into operation techniques that recognize a viewer's facial expressions, gestures, and voice and operate home appliances on the basis of the recognition results, without requiring a remote control. Technology in which a home appliance automatically determines, from the viewer's actions, what operation the viewer wants and then performs that operation is the ultimate goal, and it is expected to contribute greatly to user convenience.

One way to operate home appliances without a remote control is operation by voice recognition. Anyone with healthy vocal cords can speak without special training, and speech is a means of expressing one's thoughts intuitively, so being able to operate AV equipment by voice recognition would greatly improve user convenience. Voice also makes it possible to enter keywords (for example, search keywords) directly, without the inconvenience of keyboard or remote control input.

Devices that operate equipment by voice have been proposed in the past. A typical example is a device in which a microphone is installed in the television body, the user's operation command input to the microphone is recognized by voice recognition, and the television is operated accordingly. However, sound emitted by the speaker is mixed into the sound collected by the microphone, which degrades voice recognition performance. A system has therefore been disclosed that improves voice recognition performance by exploiting the fact that the sound output from the speaker is known and applying adaptive noise removal such as an echo canceller (see, for example, Patent Document 1).

However, achieving more accurate voice recognition requires a variety of processes beyond echo cancellation, which removes the speaker sound from the sound collected by the microphone. These include spatial noise removal, which removes ambient noise in order to extract speech from a distant speaker; utterance determination, which determines whether the speech collected by the microphone is an operation command or non-command speech such as casual conversation; and finally voice recognition, which converts the speech into feature values and performs pattern matching against a database.

In addition, a device that detects commands by voice recognition must keep the voice recognition function running at all times. As a result, redundant noise removal and voice recognition processing are executed even during periods with no speech. In particular, if turning the television on and off is also done by voice recognition, the voice recognition processing must keep running even while the television is off, so power is consumed even though the television is off.

Meanwhile, a system has been disclosed in which a microphone is installed in the television's remote control, voice recognition is performed inside the remote control, and the recognition result is transmitted to the television (see, for example, Patent Document 2). When the microphone is installed in the remote control, the distance between the user's mouth and the microphone is very short, which reduces the influence of ambient noise and so improves voice recognition accuracy.

However, this system requires the user to bring the remote control to the mouth when speaking. In addition, because the voice recognition processing unit must be built into the remote control, the increased power consumption drains the battery quickly. Furthermore, depending on the level of the television sound and ambient noise, the remote control may also need a function for removing them, which could increase power consumption further.

A system has also been disclosed in which the remote control is equipped with a wireless communication unit, the audio signal input to the remote control's microphone is sent over wireless communication to a server on the Internet, and the recognition result is received and then transmitted from the remote control to the television (see, for example, Patent Document 3). This system removes the need for a voice recognition processing unit in the remote control, so reductions in cost and power consumption can be expected, but the user must still hold the remote control while speaking. In addition, because the audio passes through an Internet server, real-time performance may be impaired depending on the communication speed and environment.

Furthermore, a system has been disclosed in which a mobile phone is used as the voice recognition device: a command spoken toward the mobile phone is recognized by the mobile phone and transmitted to the television (see, for example, Patent Document 4). The advantage of this system is its low additional cost, since a mobile phone tends to be kept near the user and already has a microphone. However, this system also requires the user to hold the mobile phone while speaking, so it is no less cumbersome than pressing buttons on a conventional remote control. As with the other conventional systems, voice recognition may also fail depending on the surrounding noise environment.

Patent Document 1: Japanese Patent Laid-Open No. 5-22779
Patent Document 2: JP 2001-318689 A
Patent Document 3: JP 2003-115939 A
Patent Document 4: JP 2005-65156 A

As described above, with the systems disclosed in Patent Documents 1 to 4, it is difficult to achieve high voice recognition accuracy unless the user performs a cumbersome operation each time he or she speaks, such as pressing a voice recognition switch or bringing the microphone close to the mouth. In addition, the series of command extraction processes must run even when no remote operation by voice recognition is taking place, which increases power consumption.

The present invention has been made in view of the above circumstances, and its object is to provide an electronic device, an utterance detection device, a voice recognition operation system, a voice recognition operation method, and a program that allow an electronic device main body to be remotely operated by voice recognition or the like with high accuracy and low power consumption, without requiring cumbersome operations by the user.

To achieve the above object, an electronic device according to a first aspect of the present invention comprises:
at least one sound collection unit that collects ambient sound and outputs a first audio signal corresponding to that sound;
a wireless communication unit that communicates wirelessly with an external device;
a command extraction unit that extracts an operation command uttered by a user by performing signal processing, including voice recognition processing, on at least one of the first audio signal and a second audio signal received by the wireless communication unit;
an operation unit that operates the main body in accordance with the extracted operation command; and
a control unit that controls starting and stopping of the command extraction unit on the basis of a timing signal received by the wireless communication unit.

In this case, the electronic device may further comprise a human presence sensor that detects the presence of the user in the vicinity,
the wireless communication unit may transmit the sensor output of the human presence sensor to the external device, and
the control unit may control starting and stopping of the command extraction unit on the basis of the sensor output of the human presence sensor.

The timing signal received by the wireless communication unit may include an utterance start signal and an utterance end signal, and
the control unit may start the command extraction unit when the wireless communication unit receives the utterance start signal, and
stop the command extraction unit when the wireless communication unit receives the utterance end signal.
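The start and stop control described above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `TimingSignal`, `CommandExtractor`, and `Controller` are hypothetical and do not appear in the publication, and the extractor here merely tracks whether it is running rather than performing any recognition.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class TimingSignal(Enum):
    """Hypothetical encoding of the two timing signals named in the text."""
    UTTERANCE_START = auto()
    UTTERANCE_END = auto()


@dataclass
class CommandExtractor:
    """Stand-in for the command extraction unit; it only tracks its run state."""
    running: bool = False

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False


@dataclass
class Controller:
    """Stand-in for the control unit of the electronic device."""
    extractor: CommandExtractor = field(default_factory=CommandExtractor)

    def on_timing_signal(self, signal: TimingSignal) -> None:
        # Start the extractor on an utterance start signal and
        # stop it on an utterance end signal.
        if signal is TimingSignal.UTTERANCE_START:
            self.extractor.start()
        elif signal is TimingSignal.UTTERANCE_END:
            self.extractor.stop()
```

Because the extractor stays stopped except between a start signal and the following end signal, the recognition pipeline consumes no processing time while nobody is speaking.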

In this case, the electronic device may further comprise a recording unit that stores the first audio signal,
the timing signal received by the wireless communication unit may include a save command and a discard command for the recording unit, and
the control unit may:
start storing the first audio signal in the recording unit when the wireless communication unit receives the save command;
discard the first audio signal stored in the recording unit when the wireless communication unit receives the discard command; and
control the command extraction unit so that it extracts the operation command using the first audio signal stored in the recording unit.
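As a rough illustration of the save and discard behaviour described above, the following Python sketch models the recording unit as a simple frame buffer. The class and method names are hypothetical, and the choice to stop storing after a discard command is an assumption made for the sketch; the text only states that the stored signal is discarded.

```python
class RecordingUnit:
    """Hypothetical stand-in for the recording unit that buffers the first audio signal."""

    def __init__(self):
        self._frames = []
        self._saving = False

    def on_save_command(self):
        # Save command received: start storing incoming frames.
        self._saving = True

    def on_discard_command(self):
        # Discard command received: drop everything stored so far.
        # Stopping further storage here is an assumption of this sketch.
        self._saving = False
        self._frames.clear()

    def feed(self, frame):
        # Frames that arrive while not saving are ignored.
        if self._saving:
            self._frames.append(frame)

    def stored(self):
        # Return a copy so callers cannot mutate the internal buffer.
        return list(self._frames)
```

Buffering from the save command onward means the beginning of a command is still available to the command extraction unit even though the utterance is confirmed only later.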

In this case, the electronic device may further comprise an output unit that outputs sound, and
the control unit may lower the volume of the sound output from the output unit when the wireless communication unit receives the save command.

The control unit may also keep the command extraction unit stopped in operation modes other than a voice recognition mode, and,
when an operation for switching to the voice recognition mode is input to the operation unit,
start the command extraction unit and transmit a signal for switching to the voice recognition mode to the external device via the wireless communication unit.

In this case, the electronic device may further comprise a display unit that displays information, and
the control unit may, after starting the command extraction unit, cause the display unit to display a message asking the user to refrain from speaking.

The control unit may also end the voice recognition mode and stop the command extraction unit when a wireless connection with the external device cannot be established.

The control unit may also, when the wireless communication unit receives no signal indicating that the start of an utterance has been detected for a predetermined period,
end the voice recognition mode, stop the command extraction unit, and transmit a notification of the end of the voice recognition mode to the external device via the wireless communication unit.
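The timeout condition above might be tracked with a small helper such as the one below. This is a sketch under stated assumptions: the class name `VoiceModeWatchdog`, the use of seconds, and passing the current time in explicitly are all hypothetical choices, and the publication does not give a value for the predetermined period.

```python
class VoiceModeWatchdog:
    """Hypothetical helper that tracks time since the last utterance start signal."""

    def __init__(self, timeout_s, now):
        self.timeout_s = timeout_s  # the "predetermined period" (value assumed by the caller)
        self.last_start = now       # time the voice recognition mode was entered

    def on_utterance_start(self, now):
        # Any utterance start signal restarts the waiting period.
        self.last_start = now

    def expired(self, now):
        # True once the period has passed with no utterance start signal;
        # the caller would then end the mode and notify the external device.
        return now - self.last_start >= self.timeout_s
```

Passing the time in as an argument keeps the helper independent of any particular clock, which also makes it easy to exercise in isolation.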

The control unit may also, when the extracted operation command is a command to end the voice recognition mode,
end the voice recognition mode, stop the command extraction unit, and transmit a notification of the end of the voice recognition mode to the external device via the wireless communication unit.

The control unit may also cause the command extraction unit to
perform voice recognition processing on both the first audio signal and the second audio signal, and
extract the operation command using whichever recognition result has the higher likelihood.
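The likelihood-based selection described above reduces to picking the better-scoring of two recognition results. The sketch below assumes, purely for illustration, that each recognition pass returns a command string and a numeric likelihood; the `Recognition` type and `select_command` function are hypothetical names.

```python
from typing import NamedTuple


class Recognition(NamedTuple):
    """Hypothetical result of one voice recognition pass."""
    command: str
    likelihood: float


def select_command(first: Recognition, second: Recognition) -> str:
    # Recognition is run on both audio signals; the result with the
    # higher likelihood supplies the operation command.
    # On a tie, max() keeps the first argument.
    best = max(first, second, key=lambda r: r.likelihood)
    return best.command
```

For example, if the result from the device's own microphone scores 0.62 and the result from the received signal scores 0.91, the command from the received signal is used.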

An utterance detection device according to a second aspect of the present invention comprises:
a sound collection unit that collects ambient sound and outputs an audio signal corresponding to that sound;
a wireless communication unit that communicates wirelessly with an electronic device;
an utterance detection processing unit that detects the start and end of a user's utterance on the basis of the sound pressure level of the audio signal; and
a control unit that transmits a timing signal to the electronic device via the wireless communication unit each time the utterance detection processing unit detects the start or end of the user's utterance.

In this case, the control unit may control starting and stopping of the utterance detection processing unit on the basis of the sensor output of the human presence sensor.

The control unit may also:
when the utterance detection processing unit detects the start of the user's utterance, transmit the audio signal to the electronic device via the wireless communication unit together with an utterance start signal as the timing signal; and
when the utterance detection processing unit detects the end of the user's utterance, transmit an utterance end signal as the timing signal to the electronic device via the wireless communication unit and stop transmitting the audio signal.
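The transmit behaviour described above can be sketched as a loop over per-frame sound pressure levels. This is a simplified illustration: the function name, the message tuples, and the use of a single fixed threshold to decide both start and end are assumptions, not details taken from the publication.

```python
def detect_and_transmit(levels, threshold):
    """Return the messages a detector would send for a sequence of frame levels.

    Each message is a (kind, payload) tuple; the message format is hypothetical.
    """
    sent = []
    speaking = False
    for level in levels:
        if not speaking and level > threshold:
            # Start of utterance: send the start signal, then begin streaming audio.
            speaking = True
            sent.append(("UTTERANCE_START", None))
        elif speaking and level <= threshold:
            # End of utterance: send the end signal and stop streaming audio.
            speaking = False
            sent.append(("UTTERANCE_END", None))
        if speaking:
            sent.append(("AUDIO", level))
    return sent
```

With `levels = [1, 5, 6, 1, 1]` and `threshold = 3`, the detector sends a start signal, two audio frames, and an end signal, and transmits nothing for the quiet frames.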

In this case, the utterance detection device may further comprise a recording unit that stores the audio signal,
the utterance detection processing unit may further comprise a sound pressure level determination processing unit that determines whether the sound pressure level of the audio signal exceeds a threshold, and
the control unit may:
when the sound pressure level determination processing unit determines that the sound pressure level of the audio signal has exceeded the threshold, start storing the audio signal in the recording unit and transmit a save command to the electronic device;
when, after a predetermined period has elapsed, the sound pressure level determination processing unit determines that the sound pressure level of the audio signal is at or below the threshold, discard the audio signal stored in the recording unit and transmit a discard command to the electronic device; and
when, after the predetermined period has elapsed, the sound pressure level determination processing unit determines that the sound pressure level of the audio signal exceeds the threshold, transmit an utterance start signal and the audio signal stored in the recording unit to the electronic device via the wireless communication unit.
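The save, discard, and confirm sequence described above behaves like a small state machine. The following Python sketch is one possible reading, with assumptions stated plainly: the predetermined period is expressed as a number of frames (`confirm_frames`), the level is checked once when that period expires, and the function and message names are hypothetical.

```python
def confirm_utterance(levels, threshold, confirm_frames):
    """Simulate the detector's save / discard / start decisions.

    Returns (messages, buffer): messages is a list of (frame_index, name)
    pairs, and buffer holds the frames stored when the function returns.
    """
    messages = []
    buffer = []
    pending_since = None  # frame index at which the threshold was first exceeded

    for i, level in enumerate(levels):
        if pending_since is None:
            if level > threshold:
                # Threshold exceeded: start storing and tell the device to save too.
                pending_since = i
                buffer = [level]
                messages.append((i, "SAVE"))
            continue

        buffer.append(level)
        if i - pending_since >= confirm_frames:
            if level > threshold:
                # Still loud after the waiting period: treat as a real utterance
                # and hand over the buffered audio with the start signal.
                messages.append((i, "UTTERANCE_START"))
                return messages, buffer
            # Level fell back: treat as a transient noise and drop the buffer.
            messages.append((i, "DISCARD"))
            buffer = []
            pending_since = None

    return messages, buffer
```

Waiting for the predetermined period before declaring an utterance filters out short noises such as a door closing, while the buffer ensures that the first part of a genuine command is not lost.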

The control unit may also start the utterance detection processing unit when it receives a signal for switching to the voice recognition mode from the electronic device via the wireless communication unit.

In this case, the control unit may, after starting the utterance detection processing unit,
cause the sound pressure level determination processing unit to determine, over a predetermined period, whether the sound pressure level of the audio signal exceeds the threshold, and
adjust the threshold so that the proportion of the predetermined period during which the sound pressure level of the audio signal exceeded the threshold becomes smaller than a predetermined ratio.
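One simple way to realise the threshold adjustment described above is to raise the threshold in fixed steps until the fraction of frames above it drops below the target ratio. The sketch below is an assumption-laden illustration: the function name, the fixed `step`, the iteration cap, and the choice to only raise (never lower) the threshold are not taken from the publication.

```python
def calibrate_threshold(levels, threshold, max_ratio, step=1.0, max_iters=1000):
    """Raise `threshold` until the share of frames above it is below `max_ratio`.

    `levels` holds the sound pressure levels observed during the calibration
    period and must not be empty.
    """
    if not levels:
        raise ValueError("levels must not be empty")
    for _ in range(max_iters):
        above = sum(1 for v in levels if v > threshold)
        if above / len(levels) < max_ratio:
            return threshold
        threshold += step
    raise ValueError("threshold did not converge within max_iters")
```

With `levels = [10, 20, 30, 40]`, a starting threshold of 5, a step of 10, and `max_ratio = 0.5`, the threshold is raised through 15 and 25 and settles at 35, where only one frame in four exceeds it. Calibrating while the user is asked not to speak lets the threshold sit just above the background noise, including the television's own sound.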

The control unit may also stop the utterance detection processing unit when a wireless connection with the electronic device cannot be established.

The control unit may also stop the utterance detection processing unit when it receives a notification of the end of the voice recognition mode from the electronic device via the wireless communication unit.

A voice recognition operation system according to a third aspect of the present invention comprises:
the electronic device of the present invention; and
the utterance detection device of the present invention.

In this case, the electronic device may be a television device.

The utterance detection device may also be a mobile phone.

A voice recognition operation method according to a fourth aspect of the present invention includes:
a sound collection step of collecting ambient sound simultaneously with an utterance detection device placed near a user and with an electronic device;
an utterance start detection step of detecting, with the utterance detection device, the start of the user's utterance on the basis of an audio signal corresponding to the collected sound;
a first transmission step of transmitting an utterance start signal from the utterance detection device to the electronic device by wireless communication when the start of the user's utterance is detected in the utterance start detection step;
a start step of starting, in accordance with the received utterance start signal, a command extraction unit in the electronic device that extracts an operation command uttered by the user from an input audio signal;
an extraction step of extracting, with the command extraction unit, the operation command from the audio signal corresponding to the collected sound;
an utterance end detection step of detecting, with the utterance detection device, the end of the user's utterance on the basis of the audio signal corresponding to the collected sound;
a second transmission step of transmitting an utterance end signal from the utterance detection device to the electronic device by wireless communication when the end of the user's utterance is detected in the utterance end detection step;
a stop step of stopping the command extraction unit when the electronic device receives the utterance end signal; and
an operation step of operating the main body in accordance with the extracted operation command.

A program according to a fifth aspect of the present invention causes a computer that controls an electronic device to function as:
at least one sound collection means that collects ambient sound and outputs a first audio signal corresponding to that sound;
wireless communication means that communicates wirelessly with an external device;
command extraction means that extracts an operation command uttered by a user by performing signal processing, including voice recognition processing, on at least one of the first audio signal and a second audio signal transmitted from the external device;
operation means that operates the main body in accordance with the extracted operation command; and
control means that controls starting and stopping of the command extraction means on the basis of a timing signal transmitted from the external device.

A program according to a sixth aspect of the present invention causes a computer that controls an utterance detection device for detecting a user's utterance to function as:
sound collection means that collects ambient sound and outputs an audio signal corresponding to that sound;
wireless communication means that communicates wirelessly with an electronic device;
utterance detection processing means that detects the start and end of the user's utterance on the basis of the sound pressure level of the audio signal; and
control means that transmits a timing signal to the electronic device each time the utterance detection processing means detects the start or end of the user's utterance.

The present invention provides the following effects.
(1) The electronic device according to the first aspect of the present invention acquires both an audio signal based on the sound it collects and an audio signal received from an external device. It can therefore extract the operation command uttered by the user using the better of the two audio signals, or both. With this configuration, the electronic device can also keep the command extraction unit stopped while no utterance is detected, on the basis of the timing signal received from the external device. As a result, the electronic device main body can be remotely operated by voice recognition or the like with high accuracy and low power consumption, without cumbersome operations by the user.

(2) The utterance detection device according to the second aspect of the present invention transmits a timing signal to the electronic device each time it detects the start or end of the user's utterance on the basis of the collected audio signal. The electronic device that receives the timing signal can therefore extract the operation command uttered by the user at the correct timing, and can keep the command extraction unit that extracts the operation command stopped when it is not needed. As a result, the electronic device main body can be remotely operated by voice recognition or the like with high accuracy and low power consumption, without cumbersome operations by the user.

(3) The voice recognition operation system according to the third aspect of the present invention comprises the electronic device and the utterance detection device of the present invention, so the electronic device main body can be remotely operated by voice recognition or the like with high accuracy and low power consumption, without cumbersome operations by the user.

(4) In the voice recognition operation method according to the fourth aspect of the present invention, utterance detection is performed by the utterance detection device placed near the user. Extraction of the operation command by voice recognition or the like uses at least one of an audio signal based on the sound collected by the utterance detection device and an audio signal based on the sound collected by the electronic device, for example the better of the two or a combination of both. In addition, the command extraction unit that extracts the operation command can be kept running only from transmission of the utterance start signal until transmission of the utterance end signal. As a result, the electronic device main body can be remotely operated by voice recognition or the like with high accuracy and low power consumption, without cumbersome operations by the user.

(5) With the program according to the fifth aspect of the present invention, the computer acquires both an audio signal based on the collected sound and an audio signal transmitted from an external device. It can therefore extract the operation command uttered by the user using the better of the two audio signals, or both. In addition, on the basis of the timing signal transmitted from the external device, the electronic device can keep the command extraction unit stopped while no utterance is detected. As a result, the electronic device main body can be remotely operated by voice recognition or the like with high accuracy and low power consumption, without cumbersome operations by the user.

（６）本発明の第６の観点に係るプログラムによれば、コンピュータは、収集された音声信号に基づいてユーザの発話の開始及び終了を検出する度にタイミング信号を電子機器に送信する。これにより、タイミング信号を受信した電子機器は、操作命令を正確なタイミングで抽出することができるうえ、命令抽出部を不要な時に停止させておくことができる。この結果、ユーザが煩わしい操作を行うことなく、音声認識等による電子機器本体の遠隔操作を、高精度かつ低消費電力で行うことができる。 (6) According to the program according to the sixth aspect of the present invention, the computer transmits a timing signal to the electronic device every time it detects the start and end of the user's utterance based on the collected audio signal. Accordingly, the electronic device that has received the timing signal can extract the operation command at an accurate timing, and can stop the command extraction unit when it is unnecessary. As a result, the remote operation of the electronic device main body by voice recognition or the like can be performed with high accuracy and low power consumption without performing troublesome operations by the user.

本発明の実施の形態に係る音声認識操作システムの構成を示す模式図である。 FIG. 1 is a schematic diagram showing the configuration of the voice recognition operation system according to the embodiment of the present invention.
図１の音声認識操作システムを構成するテレビジョン装置の構成を示すブロック図である。 FIG. 2 is a block diagram showing the configuration of the television device that constitutes the voice recognition operation system of FIG. 1.
図１の音声認識操作システムを構成する携帯電話の構成を示すブロック図である。 FIG. 3 is a block diagram showing the configuration of the mobile phone that constitutes the voice recognition operation system of FIG. 1.
図２のテレビジョン装置の操作モードによる状態遷移図である。 FIG. 4 is a state transition diagram for the operation modes of the television device of FIG. 2.
図１の音声認識操作システムにおける音声認識モードにおける一連の全体動作のフローチャートである。 FIG. 5 is a flowchart of the series of overall operations in the voice recognition mode of the voice recognition operation system of FIG. 1.
図５の音圧検知閾値キャリブレーションのサブルーチンである。 FIG. 6 is the sound pressure detection threshold calibration subroutine of FIG. 5.
図７（Ａ）乃至図７（Ｃ）は、音圧検知閾値キャリブレーションを説明するためのタイミングチャート（その１）である。 FIGS. 7A to 7C are timing charts (part 1) for explaining the sound pressure detection threshold calibration.
図８（Ａ）乃至図８（Ｃ）は、音圧検知閾値キャリブレーションを説明するためのタイミングチャート（その２）である。 FIGS. 8A to 8C are timing charts (part 2) for explaining the sound pressure detection threshold calibration.
図５の発話検出処理のサブルーチンである。 FIG. 9 is the utterance detection process subroutine of FIG. 5.
図１０（Ａ）乃至図１０（Ｆ）は、発話検出処理を説明するためのタイミングチャートである。 FIGS. 10A to 10F are timing charts for explaining the utterance detection process.
図５の継続判定処理のサブルーチンである。 FIG. 11 is the continuation determination process subroutine of FIG. 5.

以下、本発明の実施の形態について図面を参照して詳細に説明する。 Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

図１に示すように、本実施の形態に係る音声認識操作システム３００は、テレビジョン装置１００と、携帯電話２００とを備える。テレビジョン装置１００は、本実施の形態に係る音声認識操作システム３００の電子機器に対応する。携帯電話２００は、ユーザの近くに置かれている。本実施の形態では、携帯電話２００が、音声認識操作システム３００を構成する発話検出機器に対応する。 As shown in FIG. 1, a voice recognition operation system 300 according to the present embodiment includes a television device 100 and a mobile phone 200. Television apparatus 100 corresponds to the electronic device of voice recognition operation system 300 according to the present embodiment. The mobile phone 200 is placed near the user. In the present embodiment, the mobile phone 200 corresponds to an utterance detection device that constitutes the voice recognition operation system 300.

（テレビジョン装置） (Television device)
テレビジョン装置１００は、図２に示すように、テレビ基幹部１、スピーカ２、表示部３、マイクロホン４、音響処理部５及び音響制御部６を備える。 As shown in FIG. 2, the television apparatus 100 includes a TV backbone unit 1, a speaker 2, a display unit 3, a microphone 4, an acoustic processing unit 5, and an acoustic control unit 6.

テレビ基幹部１には、放送電波を受信して再生するためのテレビが有する一般的な各種機能がまとめられている。スピーカ２は、テレビ基幹部１から出力された音声信号に対応する音声を、音響処理部５を介して出力する。表示部３には、テレビ基幹部１から出力された映像信号に対応する映像が表示される。例えば、表示部３は、後述する音圧検知閾値キャリブレーションを精度良く行うためのメッセージ等を表示する。 The TV backbone 1 collects various general functions of a TV for receiving and playing back broadcast radio waves. The speaker 2 outputs audio corresponding to the audio signal output from the TV backbone unit 1 via the acoustic processing unit 5. The display unit 3 displays a video corresponding to the video signal output from the TV backbone unit 1. For example, the display unit 3 displays a message or the like for performing sound pressure detection threshold calibration described later with high accuracy.

マイクロホン４は１つ以上設けられている。マイクロホン４は、テレビジョン装置１００の周囲の音声を収集し、アナログの音声信号に変換して出力する。 One or more microphones 4 are provided. The microphone 4 collects sound around the television apparatus 100, converts it into an analog sound signal, and outputs it.

音響処理部５は、音声の入出力処理を行う。マイクロホン４から送られた音声信号は、音響処理部５に入力され、そこで音響処理された後、必要に応じてテレビ基幹部１に入力される。テレビ基幹部１から出力された音声信号は、音響処理部５に入力され、そこで音響処理された後、必要に応じてスピーカ２から出力される。 The acoustic processing unit 5 performs voice input / output processing. The audio signal sent from the microphone 4 is input to the acoustic processing unit 5, subjected to acoustic processing there, and then input to the television backbone unit 1 as necessary. The audio signal output from the TV backbone unit 1 is input to the acoustic processing unit 5, subjected to acoustic processing there, and then output from the speaker 2 as necessary.

音響制御部６は、主として、音響処理部５を制御する。 The acoustic control unit 6 mainly controls the acoustic processing unit 5.

テレビ基幹部１について、さらに詳細に説明する。図２では、テレビ基幹部１を構成する構成要素として、操作制御部１０、人感センサ１１及び無線通信部１２が示されている。 The TV backbone 1 will be described in more detail. In FIG. 2, an operation control unit 10, a human sensor 11, and a wireless communication unit 12 are shown as components constituting the TV backbone 1.

操作制御部１０は、ユーザの操作入力（例えば、リモコンの操作入力）や後述の操作命令に従ってテレビジョン装置１００を制御する。 The operation control unit 10 controls the television apparatus 100 according to a user operation input (for example, an operation input of a remote controller) or an operation command described later.

人感センサ１１は、超音波センサや赤外線センサ等を備える。これらのセンサ出力は、テレビジョン装置１００の周囲にユーザが存在するか否かを判定するために用いられる。 The human sensor 11 includes an ultrasonic sensor, an infrared sensor, and the like. These sensor outputs are used to determine whether there is a user around the television apparatus 100.

無線通信部１２は、ＷＬＡＮ等の一般的な無線通信や、Ｂｌｕｅｔｏｏｔｈ（登録商標）、低消費電力版Ｂｌｕｅｔｏｏｔｈ（登録商標）、ＺｉｇＢｅｅ（登録商標）等の近距離無線通信機能を有する。人感センサ１１の出力は、無線通信部１２を介して、外部に送信可能である。 The wireless communication unit 12 supports general wireless communication such as WLAN as well as short-range wireless communication such as Bluetooth (registered trademark), the low power consumption version of Bluetooth (registered trademark), and ZigBee (registered trademark). The output of the human sensor 11 can be transmitted to the outside via the wireless communication unit 12.

続いて、音響処理部５について、さらに詳細に説明する。音響処理部５は、ＡＤコンバータ（ＡＤＣ）２０、記録領域部２１、エコーキャンセラ２２、雑音除去処理部２３、発話判定処理部２４、音声認識処理部２５、音声認識制御部２６、ＤＡコンバータ（ＤＡＣ）２７を備える。ＡＤＣ２０、記録領域部２１、エコーキャンセラ２２、雑音除去処理部２３、発話判定処理部２４、音声認識処理部２５、音声認識制御部２６で、命令抽出部７が形成される。 Next, the acoustic processing unit 5 will be described in more detail. The acoustic processing unit 5 includes an AD converter (ADC) 20, a recording area unit 21, an echo canceller 22, a noise removal processing unit 23, an utterance determination processing unit 24, a voice recognition processing unit 25, a voice recognition control unit 26, and a DA converter (DAC) 27. The ADC 20, the recording area unit 21, the echo canceller 22, the noise removal processing unit 23, the utterance determination processing unit 24, the voice recognition processing unit 25, and the voice recognition control unit 26 together form the command extraction unit 7.

ＡＤＣ２０は、アンプ付きのＡＤコンバータである。ＡＤＣ２０は、マイクロホン４から入力されたアナログの音声信号の増幅、デジタル音声データへの変換を行う。 The ADC 20 is an AD converter with an amplifier. The ADC 20 amplifies an analog audio signal input from the microphone 4 and converts it into digital audio data.

記録領域部２１には、ＡＤＣ２０から出力されるデジタル音声データが保存される。保存されたデジタル音声データは、エコーキャンセラ２２や、雑音除去処理部２３のデジタルフィルタ等、操作命令の抽出に用いられる。 The recording area unit 21 stores the digital audio data output from the ADC 20. The stored digital audio data is used for extracting operation commands, for example by the echo canceller 22 and by the digital filter of the noise removal processing unit 23.

エコーキャンセラ２２は、ＡＤＣ２０から出力されたデジタル音声データから、スピーカ２から発せられる音声の成分（エコー）をキャンセルする。より具体的には、エコーキャンセラ２２は、テレビ基幹部１から出力されスピーカ２から出力される音声に対応するデジタル音声データを参照信号として入力する。エコーキャンセラ２２は、ＡＤＣ２０から出力されたデジタル音声データから、適応フィルタなどを用いて、この参照信号の成分を抑圧したデータを出力する。 The echo canceller 22 cancels the component (echo) of the sound emitted from the speaker 2 from the digital audio data output from the ADC 20. More specifically, the echo canceller 22 takes as a reference signal the digital audio data corresponding to the sound that is output from the TV backbone unit 1 and emitted from the speaker 2. Using an adaptive filter or the like, the echo canceller 22 outputs data in which the component of this reference signal has been suppressed from the digital audio data output from the ADC 20.
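The reference-signal suppression described above can be illustrated with a minimal sketch. The description only says "an adaptive filter or the like", so the choice of a normalized LMS (NLMS) filter here is an assumption, as are the function name `nlms_echo_cancel` and the parameters `taps`, `mu`, and `eps`.

```python
from typing import List, Sequence


def nlms_echo_cancel(mic: Sequence[float], ref: Sequence[float],
                     taps: int = 4, mu: float = 0.5,
                     eps: float = 1e-8) -> List[float]:
    """Suppress the reference (speaker) component in the microphone signal.

    Assumption: NLMS is used as the adaptive filter; the patent text does
    not name a specific algorithm.
    """
    weights = [0.0] * taps          # estimated echo path, newest tap first
    residual = []
    for n in range(len(mic)):
        # Most recent `taps` reference samples (zero before the start).
        x = [ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_estimate = sum(w * xi for w, xi in zip(weights, x))
        error = mic[n] - echo_estimate      # signal with echo suppressed
        power = sum(xi * xi for xi in x) + eps
        # Normalized LMS weight update.
        weights = [w + mu * error * xi / power for w, xi in zip(weights, x)]
        residual.append(error)
    return residual
```

In this sketch `mic` would correspond to the data from the ADC 20 and `ref` to the audio sent to the speaker 2; the returned residual is what would be passed on to the later stages.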

雑音除去処理部２３は、ユーザの音声がユーザからマイクロホン４に到達するまでに混入したその他の雑音を、デジタルフィルタ処理を行って除去して出力する。マイクロホン４が複数設置されている場合には、このデジタルフィルタ処理には、マイクロホンアレイ技術が適用される。一方、設置されているマイクロホン４が１つである場合には、このデジタルフィルタ処理には、単一のマイク向けの雑音除去技術が適用される。 The noise removal processing unit 23 applies digital filter processing to remove other noise that was mixed in before the user's voice reached the microphone 4, and outputs the result. When a plurality of microphones 4 are installed, microphone array technology is applied to this digital filter processing. When only one microphone 4 is installed, a noise removal technique for a single microphone is applied instead.

発話判定処理部２４は、雑音除去処理部２３又は無線通信部１２から出力されたデジタル音声データが操作命令であるのか否か（その他の日常的な会話等であるのか）を判定する。発話判定処理部２４は、入力したデジタル音声データを操作命令であると判定すると、そのデジタル音声データを出力する。 The speech determination processing unit 24 determines whether the digital voice data output from the noise removal processing unit 23 or the wireless communication unit 12 is an operation command (whether it is other everyday conversation or the like). If the speech determination processing unit 24 determines that the input digital sound data is an operation command, the speech determination processing unit 24 outputs the digital sound data.

音声認識処理部２５は、発話判定処理部２４から出力されたデジタル音声データに対して音声認識処理を行う。より具体的には、音声認識処理部２５は、図示しない命令データベース（ＤＢ）を有する。音声認識処理部２５は、命令ＤＢを参照して、入力したデジタル音声データからその特徴量を抽出したり、パターンマッチングを行ったりして、デジタル音声データに含まれる操作命令の内容を特定する。音声認識処理部２５は、特定された操作命令の内容を、テレビ基幹部１の操作制御部１０に入力する。操作制御部１０は、ユーザの操作入力や後述の操作命令に従ってテレビジョン装置１００の本体を操作制御する。 The voice recognition processing unit 25 performs voice recognition processing on the digital voice data output from the utterance determination processing unit 24. More specifically, the voice recognition processing unit 25 has a command database (DB) (not shown). The voice recognition processing unit 25 refers to the command DB and extracts the feature amount from the input digital voice data or performs pattern matching to specify the content of the operation command included in the digital voice data. The voice recognition processing unit 25 inputs the content of the specified operation command to the operation control unit 10 of the television backbone unit 1. The operation control unit 10 controls the operation of the main body of the television device 100 in accordance with a user operation input or an operation command described later.

音声認識制御部２６は、エコーキャンセラ２２、雑音除去処理部２３、発話判定処理部２４及び音声認識処理部２５を制御する。 The voice recognition control unit 26 controls the echo canceller 22, the noise removal processing unit 23, the speech determination processing unit 24, and the voice recognition processing unit 25.

ＤＡＣ２７は、テレビ基幹部１から出力されたデジタル音声データをアナログの音声信号に変換してスピーカ２に出力する。 The DAC 27 converts the digital audio data output from the TV backbone unit 1 into an analog audio signal and outputs the analog audio signal to the speaker 2.

続いて、音響制御部６について、さらに詳細に説明する。音響制御部６は、ＡＤＣ２０、記録領域部２１、音声認識制御部２６及びＤＡＣ２７を制御する。 Next, the acoustic control unit 6 will be described in more detail. The acoustic control unit 6 controls the ADC 20, the recording area unit 21, the voice recognition control unit 26, and the DAC 27.

音響制御部６は、例えば、ＡＤＣ２０から出力されたデジタル音声データの記録領域部２１へのデジタル音声データの保存を制御する。また、音響制御部６は、上位コントローラとしての音声認識制御部２６を制御する。例えば、音響制御部６は、音声認識制御部２６を介して、テレビ基幹部１から出力されたデジタル音声データのエコーキャンセラ２２への入力制御、音声認識処理部２５から出力された音声認識結果の操作制御部１０への伝送制御等を行う。 For example, the acoustic control unit 6 controls the storage, in the recording area unit 21, of the digital audio data output from the ADC 20. The acoustic control unit 6 also controls the voice recognition control unit 26, which acts as a host controller. For example, via the voice recognition control unit 26, the acoustic control unit 6 controls the input of the digital audio data output from the TV backbone unit 1 to the echo canceller 22 and the transmission of the voice recognition result output from the voice recognition processing unit 25 to the operation control unit 10.

音響制御部６は、人感センサ１１のセンサ出力を入力している。音響制御部６は、人感センサ１１のセンサ出力に基づいて、命令抽出部７のオン／オフを制御する。このオン／オフ制御には、ＡＤＣ２０のアンプの電源のオン／オフ制御も含まれる。 The acoustic control unit 6 inputs the sensor output of the human sensor 11. The sound control unit 6 controls on / off of the command extraction unit 7 based on the sensor output of the human sensor 11. This on / off control includes power on / off control of the amplifier of the ADC 20.

また、音響制御部６は、テレビ基幹部１の無線通信部１２を介して、携帯電話２００との間でデータを送受信する。例えば、音響制御部６は、携帯電話２００から、デジタル音声データを受信する。 The acoustic control unit 6 transmits and receives data to and from the mobile phone 200 via the wireless communication unit 12 of the television backbone unit 1. For example, the acoustic control unit 6 receives digital audio data from the mobile phone 200.

（携帯電話） (Mobile phone)
続いて、図３を参照して、携帯電話２００について説明する。 Next, the mobile phone 200 will be described with reference to FIG. 3.

図３に示すように、携帯電話２００は、携帯電話基幹部３０、マイクロホン３１、発話検出処理部３２及び発話検出制御部３３を備える。 As shown in FIG. 3, the mobile phone 200 includes a mobile phone backbone unit 30, a microphone 31, an utterance detection processing unit 32, and an utterance detection control unit 33.

携帯電話基幹部３０には、音声通話機能等、一般的な携帯電話に必要とされる機能がまとめられている。例えば、携帯電話基幹部３０は、無線通信部４０を備える。無線通信部４０は、ＷＬＡＮ等の一般的な無線通信や、Ｂｌｕｅｔｏｏｔｈ（登録商標）、低消費電力版Ｂｌｕｅｔｏｏｔｈ（登録商標）、ＺｉｇＢｅｅ（登録商標）等の近距離無線通信機能を有する。 The mobile phone backbone unit 30 groups together the functions required of a general mobile phone, such as a voice call function. For example, the mobile phone backbone unit 30 includes a wireless communication unit 40. The wireless communication unit 40 supports general wireless communication such as WLAN as well as short-range wireless communication such as Bluetooth (registered trademark), the low power consumption version of Bluetooth (registered trademark), and ZigBee (registered trademark).

マイクロホン３１は、携帯電話２００の周囲の音声を収集し、アナログの音声信号に変換して出力する。 The microphone 31 collects sound around the mobile phone 200, converts it into an analog sound signal, and outputs it.

発話検出処理部３２は、マイクロホン３１から出力されたアナログの音声信号に基づいて、ユーザの発話の検出処理を行う。発話検出処理部３２は、ＡＤＣ４１、音圧レベル判定処理部４２、記録領域部４３を備える。 The utterance detection processing unit 32 performs user utterance detection processing based on the analog audio signal output from the microphone 31. The utterance detection processing unit 32 includes an ADC 41, a sound pressure level determination processing unit 42, and a recording area unit 43.

ＡＤＣ４１は、アンプ付きである。ＡＤＣ４１は、マイクロホン３１から入力されたアナログ音声信号の増幅、デジタル音声データへの変換を行う。 The ADC 41 has an amplifier. The ADC 41 amplifies the analog audio signal input from the microphone 31 and converts it into digital audio data.

音圧レベル判定処理部４２は、ＡＤＣ４１から出力されるデジタル音声データに基づいて、周囲の音圧を監視する。より具体的には、音圧レベル判定処理部４２は、ＡＤＣ４１から出力されるデジタル音声データの音圧レベルが、一定の閾値を超えるか否かを判定する。 The sound pressure level determination processing unit 42 monitors ambient sound pressure based on the digital audio data output from the ADC 41. More specifically, the sound pressure level determination processing unit 42 determines whether the sound pressure level of the digital audio data output from the ADC 41 exceeds a certain threshold value.

記録領域部４３には、ＡＤＣ４１から出力されるデジタル音声データが保存される。 In the recording area unit 43, digital audio data output from the ADC 41 is stored.

発話検出制御部３３は、タイマ４４を備える。発話検出制御部３３は、このタイマ４４等を用いて、発話検出処理部３２の動作制御を行う。 The utterance detection control unit 33 includes a timer 44. The utterance detection control unit 33 controls the operation of the utterance detection processing unit 32 using the timer 44 and the like.

より具体的には、音圧レベル判定処理部４２により、ＡＤＣ４１から出力されるデジタル音声データの音圧レベルが閾値を超えていると判定されると、発話検出制御部３３は、記録領域部４３にデジタル音声データの保存を開始させるとともに、タイマ４４に計時を開始させる。また、これと同時に、発話検出制御部３３は、デジタル音声データの保存命令を、テレビジョン装置１００に送信する。 More specifically, when the sound pressure level determination processing unit 42 determines that the sound pressure level of the digital audio data output from the ADC 41 exceeds the threshold, the utterance detection control unit 33 causes the recording area unit 43 to start storing the digital audio data and causes the timer 44 to start counting. At the same time, the utterance detection control unit 33 transmits a command to store digital audio data to the television apparatus 100.

タイマ４４によって一定時間計時されると、発話検出制御部３３は、音圧レベル判定処理部４２により、再び、ＡＤＣ４１から出力されるデジタル音声データの音圧レベルが閾値を超えていたと判定されると、無線通信部４０を介して、テレビジョン装置１００に、発話が開始された旨の発話開始信号とデジタル音声データとを送信する。 When the timer 44 has counted a fixed time and the sound pressure level determination processing unit 42 again determines that the sound pressure level of the digital audio data output from the ADC 41 exceeds the threshold, the utterance detection control unit 33 transmits, via the wireless communication unit 40, an utterance start signal indicating that an utterance has started, together with the digital audio data, to the television apparatus 100.
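The two-stage check described above (a first threshold crossing starts the timer, and a second crossing after the fixed time confirms the utterance start) might be sketched as follows. Frame-based levels, the name `detect_utterance_start`, the parameter `confirm_frames`, and the behavior after a failed re-check are all assumptions made for illustration; the passage above does not specify what happens when the second check fails.

```python
from typing import Optional, Sequence, Tuple


def detect_utterance_start(levels: Sequence[float], threshold: float,
                           confirm_frames: int) -> Optional[Tuple[int, int]]:
    """Return (trigger_index, confirm_index) for the first confirmed start.

    trigger_index: frame where the level first exceeds the threshold
                   (recording and the timer would start here).
    confirm_index: frame, `confirm_frames` later, where the level exceeds
                   the threshold again (utterance start signal is sent).
    Returns None if no start is confirmed within `levels`.
    """
    i = 0
    while i < len(levels):
        if levels[i] > threshold:
            j = i + confirm_frames          # timer expiry
            if j >= len(levels):
                return None                 # not enough data to confirm
            if levels[j] > threshold:
                return (i, j)
            # Assumption: after a failed re-check, monitoring resumes
            # from the frame following the timer expiry.
            i = j + 1
        else:
            i += 1
    return None
```

For example, a single loud frame followed by silence is rejected, while a sustained rise in level is reported together with the frame from which buffered audio would be available.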

一方、発話検出制御部３３は、テレビジョン装置１００から、人感センサ１１のセンサ出力を、無線通信部４０を介して受信する。発話検出制御部３３は、人感センサ１１のセンサ出力に基づいて、発話検出処理部３２の起動及び停止を制御する。また、これと同時に、発話検出制御部３３は、無線通信部４０を介して、テレビジョン装置１００から音声認識モードの開始、終了の切り替え等の制御命令を受信し、その命令に従って、発話検出処理部３２の起動及び停止を制御する。 Meanwhile, the utterance detection control unit 33 receives the sensor output of the human sensor 11 from the television device 100 via the wireless communication unit 40, and controls the start and stop of the utterance detection processing unit 32 based on that sensor output. At the same time, the utterance detection control unit 33 receives control commands, such as commands to start or end the voice recognition mode, from the television device 100 via the wireless communication unit 40, and controls the start and stop of the utterance detection processing unit 32 according to those commands.

ところで、テレビジョン装置１００では、操作モードの切り替えによって４つの状態を遷移する。図４を参照して、操作モードによる状態遷移について説明する。 The television apparatus 100 transitions among four states as its operation mode is switched. The state transitions by operation mode will be described with reference to FIG. 4.

図４に示すように、４つの操作モードは、リモコンによって操作できる通常モード（状態３０１、状態３０２）と、音声認識によっても操作できる音声認識モード（状態３０３、状態３０４）の２つに大別できる。 As shown in FIG. 4, the four operation modes can be roughly divided into two: a normal mode (states 301 and 302), in which the device can be operated by remote control, and a voice recognition mode (states 303 and 304), in which it can also be operated by voice recognition.

状態３０１は、テレビジョン装置１００の電源がオンされており、映像を受信、表示し、リモコンによって電源のオン／オフ、チャンネル変更等の各種操作ができる状態である。状態３０２は、テレビジョン装置１００の電源がオフとなっており、リモコンからの起動要求のみを待ち受けている状態である。 A state 301 is a state in which the power of the television apparatus 100 is turned on, an image is received and displayed, and various operations such as power on / off and channel change can be performed by the remote controller. A state 302 is a state in which the power of the television apparatus 100 is turned off and only a startup request from the remote controller is awaited.

状態３０３は、通常のリモコン操作に加えて、命令抽出部７が起動している状態で、音声認識によりテレビジョン装置１００の操作が可能な状態である。状態３０４は、テレビジョン装置１００の電源がオフされているが、リモコンからの起動要求と命令抽出部７が起動している状態での音声認識による起動要求とのいずれかを待ち受けている状態である。 State 303 is a state in which, in addition to normal remote control operation, the television apparatus 100 can be operated by voice recognition while the command extraction unit 7 is active. State 304 is a state in which the television apparatus 100 is powered off but is waiting for either a start request from the remote control or a start request by voice recognition while the command extraction unit 7 is active.
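The four states described above can be summarized as a small table keyed by state number. This is only an illustrative sketch: the names `StateInfo`, `STATES`, and `accepted_inputs` are hypothetical, and the transitions drawn in FIG. 4 are not modeled.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class StateInfo:
    powered: bool      # television power on or off
    voice_mode: bool   # True in the voice recognition mode


# State numbers follow the description (301 to 304).
STATES: Dict[int, StateInfo] = {
    301: StateInfo(powered=True, voice_mode=False),   # normal mode, on
    302: StateInfo(powered=False, voice_mode=False),  # normal mode, off
    303: StateInfo(powered=True, voice_mode=True),    # voice mode, on
    304: StateInfo(powered=False, voice_mode=True),   # voice mode, off
}


def accepted_inputs(state_id: int) -> Set[str]:
    """Return which input sources the television responds to in a state."""
    info = STATES[state_id]
    inputs = {"remote"}            # the remote control works in every state
    if info.voice_mode:
        inputs.add("voice")        # voice only in states 303 and 304
    return inputs
```

Calling `accepted_inputs(304)` returns both `"remote"` and `"voice"`, matching the description of a powered-off set that still listens for a spoken start request.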

通常モードから音声認識モードへの切り替えは、リモコン操作により切り替ることができる。この他、ユーザがあらかじめ音声認識モードでの操作を望む時間帯（例えば、９：００〜２４：００等）を設定しておき、自動的にモードを切り替えるようにすることができる。この時間帯の設定を以下では、スケジューリングともいう。 Switching from the normal mode to the voice recognition mode can be performed by a remote control operation. In addition, it is possible to set a time zone (for example, 9:00 to 24:00, etc.) that the user desires to operate in the voice recognition mode in advance, and automatically switch the mode. Hereinafter, this time zone setting is also referred to as scheduling.

図４において、２つの状態を結ぶ実線の矢印ａは、リモコン操作やスケジューリングで、その矢印の方向の状態の切り替えが可能であることを示している。また、２つの状態を結ぶ一点鎖線の矢印ｂは、リモコン操作、スケジューリングに加え、音声操作でも、その矢印の方向の状態の切り替え可能であることを示している。 In FIG. 4, a solid arrow a connecting the two states indicates that the state in the direction of the arrow can be switched by remote control operation or scheduling. A dash-dot line arrow b connecting the two states indicates that the state in the direction of the arrow can be switched by voice operation in addition to remote control operation and scheduling.

本実施の形態では、音声認識モード切り替えのためのスケジュール管理等は、テレビジョン装置１００で行われる。携帯電話２００の操作モードは、音声認識モードのテレビジョン装置１００からの切り替え信号に従って切り替わるようになる。 In the present embodiment, schedule management or the like for switching the voice recognition mode is performed by the television device 100. The operation mode of the mobile phone 200 is switched according to a switching signal from the television device 100 in the voice recognition mode.

この場合、携帯電話２００とテレビジョン装置１００との無線通信方式に、Ｂｌｕｅｔｏｏｔｈ（登録商標）や低消費電力Ｂｌｕｅｔｏｏｔｈ（登録商標）を用いるようにすれば、信号待機の待ち受け消費電力は非常に小さくなる。音声認識モードから通常のモードへの切り替えは、上記と同様にスケジューリングや、リモコン操作での切り替えに加えて、音声操作でも切り替えることができる。 In this case, if Bluetooth (registered trademark) or low power consumption Bluetooth (registered trademark) is used for the wireless communication system between the mobile phone 200 and the television apparatus 100, the standby power consumption for signal standby becomes very small. . Switching from the voice recognition mode to the normal mode can be performed by voice operation in addition to scheduling and remote control operation as described above.

次に、上述の構成を有する音声認識操作システム３００の動作について説明する。 Next, the operation of the voice recognition operation system 300 having the above configuration will be described.

（音声認識モード中の動作） (Operation in voice recognition mode)
まず、テレビジョン装置１００において、音声認識モードに切り替わった後の一連の全体動作について、図５のフローチャート等を参照して説明する。なお、図面では、テレビジョン装置１００をＴＶとも略述している。 First, the series of overall operations after the television device 100 switches to the voice recognition mode will be described with reference to the flowchart of FIG. 5 and other figures. In the drawings, the television device 100 is also abbreviated as TV.

上述のリモコン操作やスケジューリング等により、音声認識モードへの切り替え操作がなされると、操作制御部１０から、音声認識モードに切り替える操作信号が、音響制御部６に入力される。これにより、図５に示す処理が開始される。 When a switching operation to the voice recognition mode is performed by the above-described remote control operation or scheduling, an operation signal for switching to the voice recognition mode is input from the operation control unit 10 to the acoustic control unit 6. Thereby, the process shown in FIG. 5 is started.

まず、音響制御部６は、命令抽出部７を起動すると同時に、無線通信部１２を介して、携帯電話２００の発話検出制御部３３に音声認識モードに切り替える命令を送信することにより、発話検出制御部３３に発話検出処理部３２を起動させる（ステップＳ１）。 First, the acoustic control unit 6 activates the command extraction unit 7 and, at the same time, transmits a command to switch to the voice recognition mode to the utterance detection control unit 33 of the mobile phone 200 via the wireless communication unit 12, thereby causing the utterance detection control unit 33 to activate the utterance detection processing unit 32 (step S1).

続いて、音響制御部６及び発話検出制御部３３は、発話検出処理部３２を起動して、音圧検知のための閾値を校正する音圧検知閾値キャリブレーションのサブルーチンを行う（ステップＳ２）。音圧検知閾値キャリブレーションの詳細については、後述する。 Subsequently, the acoustic control unit 6 and the utterance detection control unit 33 activate the utterance detection processing unit 32 and perform a sound pressure detection threshold calibration subroutine for calibrating a threshold for sound pressure detection (step S2). Details of the sound pressure detection threshold calibration will be described later.

音圧検知閾値キャリブレーションが終了すると、音響制御部６及び発話検出制御部３３は、人感センサ１１によって、ユーザが検出されたか否かを判定する（ステップＳ３）。 When the sound pressure detection threshold calibration is completed, the sound control unit 6 and the speech detection control unit 33 determine whether or not the user is detected by the human sensor 11 (step S3).

ユーザが検出されなかった場合（ステップＳ３；Ｎｏ）、音響制御部６は、音声認識をする必要は無いものとみなし、命令抽出部７を停止する一方、発話検出制御部３３は、発話検出処理部３２を停止する（ステップＳ４）。 When the user is not detected (step S3; No), the acoustic control unit 6 considers that it is not necessary to perform voice recognition, stops the command extraction unit 7, while the utterance detection control unit 33 performs utterance detection processing. The unit 32 is stopped (step S4).

ユーザが検出された場合（ステップＳ３；Ｙｅｓ）、携帯電話２００の発話検出制御部３３は、発話検出処理を行う（ステップＳ５）。これにより、発話検出処理が開始される。この発話検出処理の詳細については、後述するが、この発話検出処理で、発話が検出されると、記録領域部４３には、マイクロホン３１から入力された音声に対応するデジタル音声データが保存されている。 When a user is detected (step S3; Yes), the utterance detection control unit 33 of the mobile phone 200 performs the utterance detection process (step S5). The utterance detection process is thereby started. The details of this process will be described later; by the time an utterance has been detected in it, digital audio data corresponding to the sound input from the microphone 31 has been stored in the recording area unit 43.

続いて、発話検出処理の結果、発話検出制御部３３は、発話の開始を検出したか否かを判定する（ステップＳ６）。発話の開始が検出されない場合（ステップＳ６；Ｎｏ）、後述する音声認識モードの継続判定処理が行われる（ステップＳ１３）。継続判定処理の詳細については後述する。 Subsequently, as a result of the utterance detection process, the utterance detection control unit 33 determines whether or not the start of the utterance has been detected (step S6). When the start of the utterance is not detected (step S6; No), a voice recognition mode continuation determination process described later is performed (step S13). Details of the continuation determination process will be described later.

発話の開始が検出されると（ステップＳ６；Ｙｅｓ）、発話検出制御部３３は、無線通信部４０を介して、テレビジョン装置１００へ、発話の開始が検出されたことを伝える信号（発話開始信号）と、記録領域部４３に保存されたデジタル音声データとの送信を開始する（ステップＳ７）。 When the start of the utterance is detected (step S6; Yes), the utterance detection control unit 33 transmits a signal (speech start) to the television device 100 via the wireless communication unit 40 that the start of the utterance is detected. Signal) and digital audio data stored in the recording area 43 are started (step S7).

音響制御部６は、起動した命令抽出部７の音声認識制御部２６を制御し、携帯電話２００から受信したデジタル音声データと、記録領域部２１に格納されたデジタル音声データとに基づいて、エコーキャンセラ２２、雑音除去処理部２３、発話判定処理部２４、音声認識処理部２５を動作させ、一連の命令抽出処理の実行を開始させる（ステップＳ８）。 The acoustic control unit 6 controls the voice recognition control unit 26 of the activated command extraction unit 7 so that the echo canceller 22, the noise removal processing unit 23, the utterance determination processing unit 24, and the voice recognition processing unit 25 operate on the digital audio data received from the mobile phone 200 and the digital audio data stored in the recording area unit 21, and start executing the series of command extraction processes (step S8).

続いて、発話検出制御部３３は、音圧レベル判定処理部４２の判定結果を参照して、発話が終了するまで待つ（ステップＳ９；Ｎｏ）。この間にも、携帯電話２００からテレビジョン装置１００へのデジタル音声データの転送、一連の命令抽出処理が継続されている。 Subsequently, the utterance detection control unit 33 refers to the determination result of the sound pressure level determination processing unit 42 and waits until the utterance is completed (step S9; No). In the meantime, the transfer of digital audio data from the mobile phone 200 to the television device 100 and a series of instruction extraction processes are continued.

発話の終了が検出されると（ステップＳ９；Ｙｅｓ）、発話検出制御部３３は、発話終了信号を、テレビジョン装置１００に送信するとともに、デジタル音声データの送信を停止する（ステップＳ１０）。発話終了信号を受けて、音響制御部６は、命令抽出部７における命令抽出処理を停止させる。 When the end of the utterance is detected (step S9; Yes), the utterance detection control unit 33 transmits the utterance end signal to the television device 100 and stops the transmission of the digital audio data (step S10). Upon receiving the utterance end signal, the sound control unit 6 stops the command extraction process in the command extraction unit 7.
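Steps S7 to S10 amount to gating the television-side command extraction with the utterance start and end signals sent by the mobile phone. The sketch below illustrates that gating only; the class name `ExtractionGate`, the message kinds `"start"`, `"audio"`, and `"end"`, and the list-based buffering are assumptions, not the wire format or structure used in the embodiment.

```python
from typing import List, Optional


class ExtractionGate:
    """Collects received audio only between a start and an end signal."""

    def __init__(self) -> None:
        self.active = False                       # extraction running?
        self._buffer: List[bytes] = []
        self.utterances: List[List[bytes]] = []   # completed utterances

    def on_message(self, kind: str, payload: Optional[bytes] = None) -> None:
        if kind == "start":        # utterance start signal (step S7)
            self.active = True
            self._buffer = []
        elif kind == "audio":      # digital audio data from the phone
            if self.active and payload is not None:
                self._buffer.append(payload)
        elif kind == "end":        # utterance end signal (step S10)
            if self.active:
                self.utterances.append(list(self._buffer))
            self.active = False
        else:
            raise ValueError(f"unknown message kind: {kind}")
```

Audio that arrives outside a start/end pair is ignored in this sketch, which mirrors the idea that the extraction process only needs to run while an utterance is in progress.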

続いて、音響制御部６は、音声認識の結果得られた発話内容が、音声認識処理部２５の命令ＤＢに存在するか否か判定する（ステップＳ１１）。発話内容が命令ＤＢに存在しない場合（ステップＳ１１；Ｎｏ）、後述する音声認識モードの継続判定処理が行われる（ステップＳ１３）。継続判定処理の詳細については後述する。 Subsequently, the acoustic control unit 6 determines whether the utterance content obtained as a result of voice recognition exists in the command DB of the voice recognition processing unit 25 (step S11). If the utterance content does not exist in the command DB (step S11; No), the voice recognition mode continuation determination process described later is performed (step S13). Details of the continuation determination process will be described later.

発話内容が命令ＤＢに存在する場合（ステップＳ１１；Ｙｅｓ）、操作制御部１０は、その発話内容（操作命令）に従って、テレビジョン装置１００の操作制御を行う（ステップＳ１２）。 When the utterance content exists in the command DB (step S11; Yes), the operation control unit 10 controls the operation of the television device 100 according to the utterance content (operation command) (step S12).

続いて、音響制御部６は、音声認識モードの継続判定処理を行う（ステップＳ１３）。継続判定処理の詳細については後述する。 Subsequently, the acoustic control unit 6 performs a voice recognition mode continuation determination process (step S13). Details of the continuation determination process will be described later.

継続判定処理の結果、音声認識モードを継続する場合（ステップＳ１４；Ｙｅｓ）、音響制御部６は、人感センサ１１によって、ユーザが検出されたか否かを判定する（ステップＳ３）。一方、音声認識モードを継続しない場合（ステップＳ１４；Ｎｏ）、音響制御部６及び発話検出制御部３３は、音声認識モード中の動作を終了する。 As a result of the continuation determination process, when the speech recognition mode is continued (step S14; Yes), the acoustic control unit 6 determines whether or not the user is detected by the human sensor 11 (step S3). On the other hand, when the voice recognition mode is not continued (step S14; No), the acoustic control unit 6 and the speech detection control unit 33 end the operation during the voice recognition mode.

以上、音声認識モードにおける一連の全体動作について説明した。 The series of overall operations in the voice recognition mode has been described above.

続いて、音圧検知閾値キャリブレーション（ステップＳ２）、発話検出処理（ステップＳ５）、音声認識モードの継続判定処理（ステップＳ１３）のそれぞれの詳細について説明する。 Next, the details of the sound pressure detection threshold calibration (step S2), the speech detection process (step S5), and the speech recognition mode continuation determination process (step S13) will be described.

（音圧検知閾値キャリブレーション） (Sound pressure detection threshold calibration)
まず、図６を参照して、ステップＳ２（図５参照）の音圧検知閾値キャリブレーションについて説明する。 First, the sound pressure detection threshold calibration in step S2 (see FIG. 5) will be described with reference to FIG. 6.

図６に示すように、音圧検知閾値キャリブレーションのサブルーチン（ステップＳ２）が開始されると、まず、音響制御部６は、表示部３に、音圧検知閾値キャリブレーション中のため、ユーザに発話を控えてもらう旨のメッセージを表示させる（ステップＳ２１）。 As shown in FIG. 6, when the sound pressure detection threshold calibration subroutine (step S2) starts, the acoustic control unit 6 first causes the display unit 3 to display a message asking the user to refrain from speaking because sound pressure detection threshold calibration is in progress (step S21).

続いて、一定時間（例えば５秒間）、発話検出制御部３３は、音圧レベル判定処理部４２に、ＡＤＣ４１から出力されたデジタル音声データの音圧レベルを監視させる（ステップＳ２２）。デジタル音声データの音圧は、音声認識モード開始時のテレビジョン装置１００と携帯電話２００の位置関係や、テレビジョン装置１００の音量に依存する。本実施の形態では、このデジタル音声データの音圧レベルが閾値を超えたか否かにより、発話検出を行うため、ユーザが発話していない時のデジタル音声データの音圧レベルをあらかじめ調べておき、そのレベルに応じて必要であれば、閾値を調整するのである。 Subsequently, for a certain time (for example, 5 seconds), the speech detection control unit 33 causes the sound pressure level determination processing unit 42 to monitor the sound pressure level of the digital audio data output from the ADC 41 (step S22). The sound pressure of the digital sound data depends on the positional relationship between the television device 100 and the mobile phone 200 at the start of the sound recognition mode and the volume of the television device 100. In this embodiment, in order to detect the utterance depending on whether or not the sound pressure level of the digital sound data exceeds a threshold, the sound pressure level of the digital sound data when the user is not speaking is checked in advance, If necessary, the threshold value is adjusted according to the level.

このとき、テレビジョン装置１００のスピーカ２から発する音は、選局中のテレビ放送や映像の音、もしくはキャリブレーション用の音（例えば、ピンクノイズ）が採用される。さらに、この際に、デジタル音声データの音圧レベルに基づいて、テレビジョン装置１００のエコーキャンセラ２２における適応フィルタの更新を行うようにしてもよい。 At this time, the sound emitted from the speaker 2 of the television apparatus 100 is a sound of a television broadcast or video being selected or a sound for calibration (for example, pink noise). Further, at this time, the adaptive filter in the echo canceller 22 of the television apparatus 100 may be updated based on the sound pressure level of the digital audio data.

続いて、監視期間（例えば５秒間）が経過したら、発話検出制御部３３は、監視期間中、音圧レベル判定処理部４２の判定により、初期閾値を上回り、音圧検知された時間の割合（音圧検知時間率）を算出する（ステップＳ２３）。音圧検知時間率が所定の割合（本実施の形態では５％）より少なければ（ステップＳ２４；Ｎｏ）、発話検出制御部３３は、音圧検知閾値キャリブレーションを終了する。すなわち、この場合には、音圧検知のための閾値として初期閾値がそのまま用いられる。 Subsequently, when the monitoring period (for example, 5 seconds) has elapsed, the utterance detection control unit 33 calculates the proportion of the monitoring period during which the sound pressure level determination processing unit 42 judged the level to exceed the initial threshold, that is, during which sound pressure was detected (the sound pressure detection time rate) (step S23). If the sound pressure detection time rate is less than a predetermined rate (5% in the present embodiment) (step S24; No), the utterance detection control unit 33 ends the sound pressure detection threshold calibration. In this case, the initial threshold is used unchanged as the threshold for sound pressure detection.

一方、音圧検知時間率が所定の割合（本実施の形態では５％）以上である場合（ステップＳ２４；Ｙｅｓ）、発話検出制御部３３は、音圧検知のための閾値を、監視期間中の音圧検知時間率が５％より少なくなるような値に調整する（ステップＳ２５）。 On the other hand, when the sound pressure detection time rate is equal to or greater than the predetermined rate (5% in the present embodiment) (step S24; Yes), the utterance detection control unit 33 adjusts the threshold for sound pressure detection to a value at which the sound pressure detection time rate during the monitoring period would be less than 5% (step S25).
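
The calibration in steps S23 to S25 can be illustrated with a minimal Python sketch. The function names, the representation of the monitored levels as a list of numbers, and the rule for choosing the adjusted value (the smallest observed level that brings the rate under 5%) are assumptions made for illustration; the description only requires that the adjusted threshold yield a detection time rate below 5%.

```python
from typing import Sequence


def detection_time_rate(levels: Sequence[float], threshold: float) -> float:
    """Fraction of monitored samples whose level exceeds the threshold (step S23)."""
    if not levels:
        return 0.0
    return sum(1 for v in levels if v > threshold) / len(levels)


def calibrate_threshold(levels: Sequence[float],
                        initial_threshold: float,
                        max_rate: float = 0.05) -> float:
    """Return the threshold to use after calibration (steps S24 and S25).

    Assumption: the adjusted value is the smallest observed level above the
    initial threshold for which the detection time rate falls below max_rate.
    """
    # Step S24; No: the initial threshold is already adequate.
    if detection_time_rate(levels, initial_threshold) < max_rate:
        return initial_threshold
    # Step S25: raise the threshold until the rate drops below max_rate.
    for candidate in sorted(set(levels)):
        if candidate > initial_threshold and \
                detection_time_rate(levels, candidate) < max_rate:
            return candidate
    return max(levels)  # Not reached in practice; kept as a safe fallback.
```

For example, with monitored levels 0 to 99 and an initial threshold of 50, the sketch raises the threshold to 95, at which only 4% of the samples exceed it.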

続いて、図７（Ａ）乃至図７（Ｃ）、図８（Ａ）乃至図８（Ｃ）のタイミングチャートを参照して、音圧検知閾値キャリブレーションにおける閾値調整のタイミングについて説明する。 Next, the timing of threshold adjustment in sound pressure detection threshold calibration will be described with reference to the timing charts of FIGS. 7 (A) to 7 (C) and FIGS. 8 (A) to 8 (C).

図７（Ａ）乃至図７（Ｃ）には、音声認識モード開始時から、取得されるデジタル音声データの音圧レベルが、ある程度大きくなっている場合が示されている。図７（Ａ）に示すように、非発話時でも、スピーカ２の音や周囲の音がある程度大きく、デジタル音声データの音圧レベルが大きい場合、監視時間中の初期閾値による音圧検知、すなわち音圧監視を行うと、図７（Ｂ）に示すように、監視期間中、すべての時間において、発話が検出されたことになり、音圧検知時間率はほぼ１００％となった。そこで、ここでは、図７（Ｃ）に示すように、閾値がより大きな値（調整後の閾値）に調整され、非発話時に発話が誤検出されないように校正される。 FIGS. 7A to 7C show a case where the sound pressure level of the acquired digital voice data has increased to some extent from the start of the voice recognition mode. As shown in FIG. 7A, even when not speaking, if the sound of the speaker 2 and surrounding sounds are large to some extent and the sound pressure level of the digital audio data is large, sound pressure detection based on the initial threshold during the monitoring time, that is, When the sound pressure was monitored, as shown in FIG. 7B, utterances were detected at all times during the monitoring period, and the sound pressure detection time rate was almost 100%. Therefore, as shown in FIG. 7C, the threshold value is adjusted to a larger value (adjusted threshold value) and calibrated so that the utterance is not erroneously detected during non-utterance.

一方、図８（Ａ）乃至図８（Ｃ）には、音声認識モード開始時から、携帯電話２００のデジタル音声データの音圧レベルが低かった場合が示されている。図８（Ａ）に示すように、非発話時に、デジタル音声データの音圧レベルが初期閾値を上回らない場合、図８（Ｂ）に示すように、監視期間中、すべての時間において、発話が検出されていなかったことになり、音圧検知閾値をこのままとしても誤検出の恐れが無いので、図８（Ｃ）に示すように、閾値は初期閾値のままとなる。 On the other hand, FIGS. 8A to 8C show a case where the sound pressure level of the digital voice data of the mobile phone 200 has been low since the start of the voice recognition mode. As shown in FIG. 8A, when the sound pressure level of the digital voice data does not exceed the initial threshold during non-speaking, as shown in FIG. 8B, the utterance is not generated at all times during the monitoring period. Since it has not been detected and there is no fear of erroneous detection even if the sound pressure detection threshold value is left as it is, the threshold value remains the initial threshold value as shown in FIG. 8C.

なお、スケジューリングにより、自動的に音声認識モードの電源オフの状態３０４に切り替わった場合、携帯電話２００は、必ずしもテレビジョン装置１００の前にあるとは限らないので、この場合の閾値として初期閾値を設定しておき、テレビジョン装置１００の電源をオンした後に、この音圧検知閾値キャリブレーションを実施すればよい。 Note that when the system has been switched automatically, by scheduling, to the power-off state 304 of the voice recognition mode, the mobile phone 200 is not necessarily in front of the television device 100. In this case, the initial threshold may be set as the threshold, and the sound pressure detection threshold calibration may be performed after the television device 100 is powered on.

初期閾値としては、工場出荷前に、一般的なテレビジョン装置１００の音量と、２ｍ〜３ｍ程離れた場所に携帯電話２００を置いた場合とを想定して、発話が誤検出されないような値を設定しておくのが望ましい。また、本実施の形態に係る音声認識操作システム３００の運用開始に先立って、テレビジョン装置１００の視聴環境や使用状況に基づいて、初期閾値をユーザが調整できるようにしてもよい。 As the initial threshold, it is desirable to set, before factory shipment, a value at which utterances are not erroneously detected, assuming a typical volume of the television device 100 and the mobile phone 200 placed about 2 m to 3 m away. Further, before the voice recognition operation system 300 according to the present embodiment is put into operation, the user may be allowed to adjust the initial threshold based on the viewing environment and usage conditions of the television device 100.

また、調整後の閾値は、テレビジョン装置１００の記録領域部２１又は携帯電話２００の記録領域部４３に保存しておき、次回起動時の初期閾値とするようにしてもよい。調整後の閾値が高くなり過ぎて、発話検出の精度が悪い場合は、音声認識モード切り替え時に限らず、ユーザがいつでも音圧検知閾値キャリブレーションを実行できるようにしてもよい。 Further, the adjusted threshold value may be stored in the recording area unit 21 of the television device 100 or the recording area unit 43 of the mobile phone 200 so as to be an initial threshold value at the next activation. If the threshold value after adjustment becomes too high and the accuracy of speech detection is poor, the sound pressure detection threshold calibration may be performed by the user at any time, not only when the speech recognition mode is switched.

さらに、音圧検知閾値キャリブレーション後に、テレビジョン装置１００の音量調整によってスピーカ２の音量が変化した場合は、その音量の変化量に合わせて自動的に音圧検知閾値キャリブレーションを実施して、閾値を調整できるようにしてもよい。 Further, if the volume of the speaker 2 changes because of a volume adjustment on the television device 100 after the sound pressure detection threshold calibration, the sound pressure detection threshold calibration may be performed automatically in accordance with the amount of change in volume so that the threshold is adjusted.

（発話検出処理） (Utterance detection process)
続いて、図９を参照して、図５のステップＳ５の発話検出処理について説明する。 Next, the utterance detection process in step S5 of FIG. 5 will be described with reference to FIG. 9.

発話検出処理では、まず、発話検出制御部３３は、音圧レベル判定処理部４２を用いて、ＡＤＣ４１から出力されるデジタル音声データの音圧レベルを監視する（ステップＳ３１）。 In the utterance detection process, first, the utterance detection control unit 33 uses the sound pressure level determination processing unit 42 to monitor the sound pressure level of the digital audio data output from the ADC 41 (step S31).

デジタル音声データの音圧レベルが、閾値より以下である場合（ステップＳ３２；Ｎｏ）、発話検出制御部３３は、発話が検出されなかったことを設定し（ステップＳ４２）、発話検出処理を終了する。 If the sound pressure level of the digital audio data is at or below the threshold (step S32; No), the utterance detection control unit 33 records that no utterance was detected (step S42) and ends the utterance detection process.

デジタル音声データの音圧レベルが、閾値を超えた場合（ステップＳ３２；Ｙｅｓ）、発話検出制御部３３は、記録領域部４３に、デジタル音声データの保存を開始させる（ステップＳ３３）。これと同時に、発話検出制御部３３は、無線通信部４０を介して、テレビジョン装置１００（より具体的には、音響制御部６）に、ＡＤＣ４１から出力されたデジタル音声データの記録領域部２１への保存を開始する命令（保存命令）を送信する（ステップＳ３４）。音響制御部６は、この保存命令を受け、記録領域部２１に、ＡＤＣ２０から出力されたデジタル音声データの保存を開始させる（ステップＳ３５）。 When the sound pressure level of the digital audio data exceeds the threshold (step S32; Yes), the utterance detection control unit 33 causes the recording area unit 43 to start storing the digital audio data (step S33). At the same time, the utterance detection control unit 33 transmits, via the wireless communication unit 40, a command (save command) to the television device 100 (more specifically, to the acoustic control unit 6) to start saving the digital audio data output from the ADC 41 in the recording area unit 21 (step S34). On receiving this save command, the acoustic control unit 6 causes the recording area unit 21 to start storing the digital audio data output from the ADC 20 (step S35).

その後、発話検出制御部３３は、タイマ４４を用いて、一定期間（例えば、０．５秒）が経過するまで、動作を保留する（ステップＳ３６）。 Thereafter, the utterance detection control unit 33 uses the timer 44 to suspend the operation until a certain period (for example, 0.5 seconds) elapses (step S36).

一定期間が経過した後、発話検出制御部３３は、音圧レベル判定処理部４２に、デジタル音声データの音圧レベルが閾値を超えているか否かを再び判定させる（ステップＳ３７）。デジタル信号データの音圧レベルが閾値を超えている場合（ステップＳ３７；Ｙｅｓ）、発話検出制御部３３は、０．５秒前に検知した音圧は発話であるとみなし、発話を検出したことを設定し（ステップＳ３８）、発話検出処理を終了する。 After the certain period has elapsed, the utterance detection control unit 33 causes the sound pressure level determination processing unit 42 to determine again whether the sound pressure level of the digital audio data exceeds the threshold (step S37). If the sound pressure level of the digital signal data exceeds the threshold (step S37; Yes), the utterance detection control unit 33 regards the sound pressure detected 0.5 seconds earlier as an utterance, records that an utterance was detected (step S38), and ends the utterance detection process.

一方、デジタル音声データの音圧レベルが閾値を超えていない場合（ステップＳ３７；Ｎｏ）、０．５秒前に検知した音圧は発話では無く、ノイズであったとみなし、発話検出制御部３３は、記録領域部４３に保存されているデジタル音声データ（保存データ）を破棄する（ステップＳ３９）。また、これと同時に、発話検出制御部３３は、無線通信部４０を介して、テレビジョン装置１００に保存されていたデジタル音声データを破棄する破棄命令を送信する（ステップＳ４０）。テレビジョン装置１００の音響制御部６は、この破棄命令を受け、記録領域部２１に保存されていたデジタル音声データ（保存データ）を破棄する（ステップＳ４１）。そして、発話検出制御部３３は、発話が検出されなかったことを設定する（ステップＳ４２）。 On the other hand, if the sound pressure level of the digital audio data does not exceed the threshold (step S37; No), the sound pressure detected 0.5 seconds earlier is regarded as noise rather than an utterance, and the utterance detection control unit 33 discards the digital audio data (stored data) held in the recording area unit 43 (step S39). At the same time, the utterance detection control unit 33 transmits, via the wireless communication unit 40, a discard command for discarding the digital audio data stored in the television device 100 (step S40). On receiving this discard command, the acoustic control unit 6 of the television device 100 discards the digital audio data (stored data) held in the recording area unit 21 (step S41). The utterance detection control unit 33 then records that no utterance was detected (step S42).
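
The flow of steps S31 to S42 can be sketched as follows. This is an illustrative Python model, not the implementation of the embodiment: the `AudioBuffer` class stands in for the recording area units 43 and 21, the wireless save and discard commands are reduced to direct method calls, and `read_level` and `wait` are hypothetical callables supplied by the caller.

```python
import time
from typing import Callable, Iterable, List


class AudioBuffer:
    """Hypothetical stand-in for a recording area unit (43 or 21)."""

    def __init__(self) -> None:
        self.recording = False
        self.samples: List[float] = []

    def start(self) -> None:
        self.recording = True

    def discard(self) -> None:
        self.recording = False
        self.samples.clear()


def detect_utterance(read_level: Callable[[], float],
                     threshold: float,
                     buffers: Iterable[AudioBuffer],
                     recheck_delay_s: float = 0.5,
                     wait: Callable[[float], None] = time.sleep) -> bool:
    """Return True only if the level exceeds the threshold now and after the delay."""
    buffers = list(buffers)
    if read_level() <= threshold:      # Step S32; No
        return False                   # Step S42
    for buf in buffers:                # Steps S33 to S35: start saving on both sides
        buf.start()
    wait(recheck_delay_s)              # Step S36
    if read_level() > threshold:       # Step S37; Yes
        return True                    # Step S38
    for buf in buffers:                # Steps S39 to S41: treat as noise, discard
        buf.discard()
    return False                       # Step S42
```

Passing `wait=lambda s: None` lets the function be exercised without a real 0.5-second delay.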

このように、ＡＤＣ４１から出力されるデジタル音声データの音圧レベルに基づいて、発話を検出する。このため、操作命令では無い音（例えば、携帯電話２００の本体を移動した時、近くにコップ等を置いた時…）が混入されることが予想される。このため、この発話検出処理では、突発的な音圧検知のみで発話を検知するのでは無く、一定時間後（例えば０．５秒）にも、継続して音圧が閾値を超えている場合に発話を検出したものとみなす。その音声が操作命令であるならば、ある程度の時間（少なくとも１秒以上）、音圧レベルは継続して高いままになると考えられるからである。 In this way, an utterance is detected on the basis of the sound pressure level of the digital audio data output from the ADC 41. Consequently, sounds that are not operation commands (for example, when the body of the mobile phone 200 is moved, when a cup or the like is set down nearby, …) can be expected to be picked up. For this reason, the utterance detection process does not treat a sudden sound pressure detection alone as an utterance; it regards an utterance as detected only when the sound pressure still exceeds the threshold after a certain time (for example, 0.5 seconds). This is because, if the sound is an operation command, the sound pressure level is expected to stay high for some time (at least 1 second).

したがって、本実施の形態では、最初の音圧検知時点では、テレビジョン装置１００の命令抽出部７が起動していないため、最初の音圧検知がノイズであったときに、命令抽出処理が無駄に実行されることを防止することができる。 Therefore, in the present embodiment, the instruction extraction unit 7 of the television device 100 has not been activated at the time of the first sound pressure detection, which prevents the instruction extraction process from being executed needlessly when the first sound pressure detection turns out to be noise.

なお、本実施の形態では、音声が操作命令であった場合に、理論的には、発話開始から０．５秒後に音声認識が開始されることになるが、操作命令は、発話が完全に終了しなくてはその内容を特定することができないため、発話から０．５秒後に起動することは音声認識の性能に悪影響を与えるものではない。 In this embodiment, when the voice is an operation command, theoretically, the speech recognition starts 0.5 seconds after the start of the utterance. Since the contents cannot be specified without completion, starting up 0.5 seconds after the utterance does not adversely affect the performance of speech recognition.

また、本実施の形態では、最初の音圧検知から一定時間後に、音圧を再度検知した場合を発話とみなしているが、時間間隔をより短くしてもよいし、数回（３回以上）音圧を検知した場合を発話とみなすようにしてもよい。 In the present embodiment, an utterance is recognized when sound pressure is detected again a certain time after the first sound pressure detection; however, a shorter time interval may be used, or an utterance may be recognized only when sound pressure has been detected several times (three or more).

続いて、上述の発話検出処理の動作タイミングについて、図１０（Ａ）乃至図１０（Ｆ）のタイミングチャートを参照して説明する。 Next, the operation timing of the above-described speech detection process will be described with reference to the timing charts of FIGS. 10 (A) to 10 (F).

図１０（Ａ）に示すように、時点ｔ１において、突発的な雑音が混入し、音圧レベルが高くなって閾値を超えると、図１０（Ｂ）に示すように、音圧レベル判定処理部４２の音圧検知結果が検知となり、図１０（Ｃ）に示すように、携帯電話２００においてデジタル音声データの保存が開始される。そして、図１０（Ｄ）に示すように、携帯電話２００からテレビジョン装置１００に保存命令が送信され、図１０（Ｅ）に示すように、テレビジョン装置１００においてデジタル音声データの保存が開始される。 As shown in FIG. 10(A), when sudden noise is picked up at time t1 and the sound pressure level rises above the threshold, the sound pressure detection result of the sound pressure level determination processing unit 42 becomes "detected" as shown in FIG. 10(B), and storage of the digital audio data starts in the mobile phone 200 as shown in FIG. 10(C). Then, as shown in FIG. 10(D), a save command is transmitted from the mobile phone 200 to the television device 100, and as shown in FIG. 10(E), storage of the digital audio data starts in the television device 100.

しかし、その０．５秒後の時点ｔ２では、図１０（Ａ）に示すように、音圧レベルが閾値より小さくなり、図１０（Ｂ）に示すように、発声が検知されなくなるので、図１０（Ｃ）に示すように、記録領域部４３へのデジタル音声データの保存は停止され、保存されたデジタル音声データは破棄される。そして、図１０（Ｄ）に示すように携帯電話２００からテレビジョン装置１００に破棄命令が送信され、図１０（Ｅ）に示すように、記録領域部２１へのデジタル音声データの保存が停止され、保存されたデジタル音声データは破棄される。このとき、図１０（Ｅ）に示すように、テレビジョン装置１００の命令抽出部７が起動することはない。 However, at time t2, 0.5 seconds later, the sound pressure level falls below the threshold as shown in FIG. 10(A), and the utterance is no longer detected as shown in FIG. 10(B). Accordingly, as shown in FIG. 10(C), storage of the digital audio data in the recording area unit 43 is stopped and the stored digital audio data is discarded. Then, as shown in FIG. 10(D), a discard command is transmitted from the mobile phone 200 to the television device 100, and as shown in FIG. 10(E), storage of the digital audio data in the recording area unit 21 is stopped and the stored digital audio data is discarded. At this time, as shown in FIG. 10(E), the instruction extraction unit 7 of the television device 100 is not activated.

さらに、時点ｔ３において、実際にユーザが発話を行った場合には、図１０（Ａ）に示すように、０．５秒後の時点ｔ４においてもその音圧レベルが高く維持されている。このため、図１０（Ｂ）に示すように、時点ｔ４でも、発声が検知されたままとなる。この場合、図１０（Ｃ）、図１０（Ｅ）に示すように、記録領域部４３、２１への発話音声が含まれるデジタル音声データの保存が継続されたままとなる。さらに、図１０（Ｄ）に示すように、時点ｔ４において、携帯電話２００からテレビジョン装置１００へ発話開始信号及びデジタル音声データが送信される。これにより、図１０（Ｆ）に示すように、命令抽出部７が起動され、記録領域部２１に保存されたデジタル音声データ及び送信されたデジタル音声データを用いて、一連の命令抽出処理が行われる。 Further, when the user actually speaks at time t3, the sound pressure level remains high at time t4, 0.5 seconds later, as shown in FIG. 10(A). The utterance therefore remains detected at time t4, as shown in FIG. 10(B). In this case, as shown in FIGS. 10(C) and 10(E), storage of the digital audio data containing the uttered speech in the recording area units 43 and 21 continues. In addition, as shown in FIG. 10(D), the utterance start signal and the digital audio data are transmitted from the mobile phone 200 to the television device 100 at time t4. As a result, as shown in FIG. 10(F), the instruction extraction unit 7 is activated, and a series of instruction extraction processes is performed using the digital audio data stored in the recording area unit 21 and the transmitted digital audio data.

このようにして発話検出処理を行うことにより、突発的な雑音には反応せず、意味を持つ発話のみ検出し、その検出結果に基づいて、テレビジョン装置１００の命令抽出部７の起動及び停止を効率的に制御することができる。 By performing the utterance detection process in this way, the system does not react to sudden noise and detects only meaningful utterances, and the activation and stopping of the instruction extraction unit 7 of the television device 100 can be controlled efficiently on the basis of the detection result.

（継続判定処理） (Continuation determination process)
次に、本実施の形態に係る音声認識モードの継続判定処理（ステップＳ１３）について説明する。携帯電話２００では、通常の待ち受け動作時に比べ、音声認識モードにおける消費電力は大きくなる。そこで、ユーザが携帯電話２００を音声認識モードのまま外に持ち出したり、室内にユーザがいないのに発話検出を行い続けたりして消費電力が増大してしまう状態が極力生じないようにするために、音声認識モードの継続判定処理が行われる。 Next, the voice recognition mode continuation determination process (step S13) according to the present embodiment will be described. In the mobile phone 200, power consumption in the voice recognition mode is higher than during normal standby operation. The continuation determination process is therefore performed to avoid, as far as possible, situations in which power consumption increases, such as when the user takes the mobile phone 200 outside while it is still in the voice recognition mode, or when utterance detection keeps running although no user is in the room.

図１１を参照して、音声認識モードの継続判定処理について説明する。 The voice recognition mode continuation determination process will be described with reference to FIG. 11.

まず、音響制御部６は、人感センサ１１により、周囲にユーザが存在するか否かを判定する（ステップＳ５１）。 First, the acoustic control unit 6 determines whether or not there is a user around by the human sensor 11 (step S51).

ユーザが存在していないと判定した場合（ステップＳ５１；Ｎｏ）、ユーザがいない状態が一定期間（例えば１時間）継続しているか否かを判定する（ステップＳ５２）。ユーザがいない状態が、一定期間継続されていた場合（ステップＳ５２；Ｙｅｓ）、音響制御部６は、音声認識モードを終了する（ステップＳ５７）。続いて、音響制御部６は、命令抽出部７を停止し、発話検出制御部３３に停止命令を送信し、発話検出制御部３３に、発話検出処理部３２を停止させる（ステップＳ５８）。 If it is determined that there is no user (step S51; No), it is determined whether or not the state where there is no user continues for a certain period (for example, 1 hour) (step S52). When the state where there is no user has been continued for a certain period (step S52; Yes), the acoustic control unit 6 ends the voice recognition mode (step S57). Subsequently, the acoustic control unit 6 stops the command extraction unit 7, transmits a stop command to the utterance detection control unit 33, and causes the utterance detection control unit 33 to stop the utterance detection processing unit 32 (step S58).

一方、ユーザが検出された場合（ステップＳ５１；Ｙｅｓ）又はユーザがいない状態が、一定期間継続されていない場合（ステップＳ５２；Ｎｏ）、音響制御部６は、携帯電話２００とテレビジョン装置１００の無線接続が確立されているか否かを判定する（ステップＳ５３）。例えば、音声認識モードのままで携帯電話２００を持って外出した場合、テレビジョン装置１００と携帯電話２００の距離が離れることで、無線接続が解除され、通信不能となる。この場合、無線接続は確立されていないので（ステップＳ５３；Ｎｏ）、音響制御部６は、音声認識モードを終了する（ステップＳ５７）。続いて、音響制御部６は、命令抽出部７を停止し、発話検出制御部３３に停止命令を送信し、発話検出制御部３３に、発話検出処理部３２を停止させる（ステップＳ５８）。 On the other hand, when a user is detected (step S51; Yes), or when the absence of a user has not continued for the certain period (step S52; No), the acoustic control unit 6 determines whether a wireless connection between the mobile phone 200 and the television device 100 is established (step S53). For example, when the user goes out carrying the mobile phone 200 while it is still in the voice recognition mode, the distance between the television device 100 and the mobile phone 200 increases, the wireless connection is dropped, and communication becomes impossible. In this case, since no wireless connection is established (step S53; No), the acoustic control unit 6 ends the voice recognition mode (step S57). Subsequently, the acoustic control unit 6 stops the instruction extraction unit 7 and transmits a stop command to the utterance detection control unit 33, causing the utterance detection control unit 33 to stop the utterance detection processing unit 32 (step S58).

無線接続が確立されている場合（ステップＳ５３；Ｙｅｓ）、音響制御部６は、一定時間（例えば１時間）、操作命令が有るか否かを判定する（ステップＳ５４）。一定時間継続して、命令が発せられていない場合（ステップＳ５４；Ｎｏ）、音響制御部６は、音声認識モードを終了する（ステップＳ５７）。続いて、音響制御部６は、命令抽出部７を停止し、発話検出制御部３３に停止信号を送信し、発話検出制御部３３に、発話検出処理部３２を停止させる（ステップＳ５８）。 When the wireless connection is established (step S53; Yes), the sound control unit 6 determines whether or not there is an operation command for a certain time (for example, 1 hour) (step S54). If no command is issued for a certain period of time (step S54; No), the sound control unit 6 ends the voice recognition mode (step S57). Subsequently, the acoustic control unit 6 stops the command extraction unit 7, transmits a stop signal to the utterance detection control unit 33, and causes the utterance detection control unit 33 to stop the utterance detection processing unit 32 (step S58).

例えば、ユーザが音声認識モードにしていることを認識していなかったり、その場で寝てしまったりしていた場合に、音声認識モードを設定したままであると電力を無駄に消費してしまうことになるので、音声認識モードを終了し、命令抽出部７及び発話検出処理部３２を停止させるのである。 For example, if the user has not recognized that the voice recognition mode is set, or if the user has fallen asleep on the spot, the power is wasted if the voice recognition mode remains set. Therefore, the voice recognition mode is terminated, and the command extraction unit 7 and the speech detection processing unit 32 are stopped.

一方、一定時間中に命令が発せられた場合（ステップＳ５４；Ｙｅｓ）、音響制御部６は、リモコン操作又は操作命令により音声認識モードの終了命令が発せられているか否かを判定する（ステップＳ５５）。音声認識モードの終了命令が発せられていれば（ステップＳ５５；Ｙｅｓ）、音響制御部６は、音声認識モードを終了する（ステップＳ５７）。続いて、音響制御部６は、命令抽出部７を停止し、発話検出制御部３３に停止信号を送信し、発話検出制御部３３に、発話検出処理部３２を停止させる（ステップＳ５８）。 On the other hand, when a command has been issued within the certain time (step S54; Yes), the acoustic control unit 6 determines whether a command to end the voice recognition mode has been issued by remote control operation or by a spoken operation command (step S55). If a command to end the voice recognition mode has been issued (step S55; Yes), the acoustic control unit 6 ends the voice recognition mode (step S57). Subsequently, the acoustic control unit 6 stops the instruction extraction unit 7 and transmits a stop signal to the utterance detection control unit 33, causing the utterance detection control unit 33 to stop the utterance detection processing unit 32 (step S58).

一方、音声認識モードの終了命令が発せられていなければ（ステップＳ５５：Ｎｏ）、音響制御部６は、音声認識モードの継続設定を行う（ステップＳ５６）。 On the other hand, if the voice recognition mode end command has not been issued (step S55: No), the acoustic control unit 6 performs continuous setting of the voice recognition mode (step S56).

ステップＳ５８、ステップＳ５６終了後は、継続判定処理を終了する。 After step S58 and step S56 are finished, the continuation determination process is finished.
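
The decision sequence of steps S51 to S57 can be summarized in a short Python sketch. The function and parameter names are hypothetical, and the use of a single one-hour timeout for both the absence check and the idle check simply follows the example values given in the description.

```python
def should_continue_voice_mode(user_present: bool,
                               absent_s: float,
                               link_established: bool,
                               idle_s: float,
                               end_requested: bool,
                               timeout_s: float = 3600.0) -> bool:
    """Return True to keep the voice recognition mode (step S56), False to end it (step S57).

    absent_s: how long no user has been sensed (used in step S52).
    idle_s:   how long no operation command has been issued (used in step S54).
    """
    # Steps S51 and S52: no user sensed for the whole timeout period.
    if not user_present and absent_s >= timeout_s:
        return False
    # Step S53: the wireless connection to the mobile phone has been lost.
    if not link_established:
        return False
    # Step S54: no operation command during the timeout period.
    if idle_s >= timeout_s:
        return False
    # Step S55: the user explicitly asked to end the voice recognition mode.
    if end_requested:
        return False
    return True
```

When the function returns False, the caller would go on to stop the instruction extraction unit and the utterance detection processing unit, as in step S58.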

以上詳細に説明したように、本実施の形態によれば、テレビジョン装置１００は、収集された音声に基づく音声信号と、携帯電話２００から受信した音声信号を取得する。これにより、２つの音声信号のうち、良好な方又は両方を用いて、ユーザが発声した操作命令を抽出することができる。このようにすれば、このテレビジョン装置１００は、携帯電話２００から受信したタイミング信号（発話開始信号、発話終了信号）に基づいて、発話が検出されていない時は命令抽出部７を停止させておくことができる。この結果、ユーザが煩わしい操作を行うことなく、音声認識等によるテレビジョン装置１００本体の遠隔操作を、高精度かつ低消費電力で行うことができる。 As described above in detail, according to the present embodiment, the television device 100 acquires an audio signal based on the sound it has collected and an audio signal received from the mobile phone 200. The operation command uttered by the user can thus be extracted using the better of the two audio signals, or both. In addition, based on the timing signals (utterance start signal and utterance end signal) received from the mobile phone 200, the television device 100 can keep the instruction extraction unit 7 stopped while no utterance is detected. As a result, the main body of the television device 100 can be remotely operated by voice recognition or the like with high accuracy and low power consumption, without the user performing troublesome operations.

また、本実施の形態によれば、携帯電話２００は、収集された音声信号に基づいてユーザの発話の開始及び終了を検出する度にタイミング信号をテレビジョン装置１００に送信する。このようにすれば、タイミング信号を受信したテレビジョン装置１００は、ユーザが発声した操作命令を正確なタイミングで抽出することができるうえ、ユーザが発声した操作命令を抽出する命令抽出部７を不要な時に停止させておくことができる。この結果、ユーザが煩わしい操作を行うことなく、音声認識等によるテレビジョン装置１００本体の遠隔操作を、高精度かつ低消費電力で行うことができる。 Further, according to the present embodiment, the mobile phone 200 transmits a timing signal to the television device 100 each time it detects the start or end of the user's utterance based on the collected audio signal. The television device 100 that receives the timing signal can thus extract the operation command uttered by the user at the correct timing, and can also keep the instruction extraction unit 7, which extracts that operation command, stopped whenever it is not needed. As a result, the main body of the television device 100 can be remotely operated by voice recognition or the like with high accuracy and low power consumption, without the user performing troublesome operations.

すなわち、本実施の形態によれば、発話検出はユーザの近くに置かれた携帯電話２００で行われる。また、音声認識等による操作命令の抽出は、携帯電話２００で収集された音声に基づく音声信号と、テレビジョン装置１００で収集された音声に基づく音声信号との少なくとも一方を用いて、良好な方又は両方を組み合わせて行われる。また、発話開始信号が送信されてから発話停止信号が送信されるまでの間だけ、操作命令を抽出する命令抽出部７を起動させておくことができる。これにより、ユーザが煩わしい操作を行うことなく、音声認識等によるテレビジョン装置１００本体の遠隔操作を、高精度かつ低消費電力で行うことができる。 That is, according to the present embodiment, speech detection is performed by the mobile phone 200 placed near the user. Further, the extraction of the operation command by voice recognition or the like is preferable by using at least one of the voice signal based on the voice collected by the mobile phone 200 and the voice signal based on the voice collected by the television device 100. Or a combination of both. In addition, the command extraction unit 7 that extracts the operation command can be activated only during the period from when the utterance start signal is transmitted until when the utterance stop signal is transmitted. Thereby, the remote operation of the main body of the television apparatus 100 by voice recognition or the like can be performed with high accuracy and low power consumption without performing troublesome operations by the user.

より詳細には、本実施の形態では、異なる位置で収集された２つの音声信号を用いて音声認識が行われる。これにより、より頑健な（ノイズに対して強い）音声認識が可能となる。 More specifically, in the present embodiment, speech recognition is performed using two speech signals collected at different positions. Thereby, more robust (strong against noise) speech recognition is possible.

また、比較的に電力に余裕のあるテレビジョン装置１００で大きな電力を必要とする雑音消去、音声認識が行われ、発話検出は、ユーザの近くに置かれる携帯電話２００で行われる。このため、本実施の形態に係る音声認識操作システム３００は、精度及び電力の観点からすれば、最適なシステム構成となる。 Moreover, noise elimination and voice recognition that require a large amount of power are performed by the television device 100 having relatively large power, and speech detection is performed by the mobile phone 200 placed near the user. For this reason, the speech recognition operation system 300 according to the present embodiment has an optimal system configuration from the viewpoint of accuracy and power.

また、携帯電話２００を発話検出装置とすることで、ハードウエアの追加を最小限に留めることができる。 Further, by using the mobile phone 200 as an utterance detection device, the addition of hardware can be minimized.

また、本実施の形態によれば、人感センサによってユーザの存在が検出される。ユーザが存在しなければ、テレビジョン装置１００の命令抽出部７と、携帯電話２００の発話検出処理部３２を停止する。これにより双方の消費電力を低減することができる。 Further, according to the present embodiment, the presence of the user is detected by the human sensor. If there is no user, the command extraction unit 7 of the television device 100 and the speech detection processing unit 32 of the mobile phone 200 are stopped. Thereby, both power consumption can be reduced.

また、本実施の形態によれば、携帯電話２００において、発話が検出されている間だけ、テレビジョン装置１００における命令抽出部７を起動させておくことができるので、消費電力を低減することができる。 Further, according to the present embodiment, in mobile phone 200, instruction extraction unit 7 in television apparatus 100 can be activated only while speech is detected, so that power consumption can be reduced. it can.

また、本実施の形態によれば、収集された音声に対応する音圧レベルが、一定期間高くなければ、命令抽出部７を起動しないので、無駄な消費電力を費やさないようにすることができる。 In addition, according to the present embodiment, if the sound pressure level corresponding to the collected voice is not high for a certain period, the command extraction unit 7 is not activated, so that useless power consumption can be avoided. .

また、本実施の形態では、音声認識モード以外の動作モードでは、音響処理部５及び発話検出処理部３２は、その動作を停止している。これにより、消費電力をさらに低減することができる。 In the present embodiment, the acoustic processing unit 5 and the utterance detection processing unit 32 stop operating in an operation mode other than the voice recognition mode. Thereby, power consumption can be further reduced.

また、本実施の形態では、音声認識モードに切り替わると、発話を検出するための音圧レベルの閾値の校正を行う。これにより、周囲の状況に応じた高精度な音声認識が可能となる。また、この閾値の校正をする行う際には、表示部３で、発話を控える旨のメッセージを表示する。これにより、周囲の状況に応じた閾値の校正をより適切に行うことができる。 In this embodiment, when the mode is switched to the voice recognition mode, the sound pressure level threshold value for detecting the utterance is calibrated. Thereby, highly accurate voice recognition according to the surrounding situation becomes possible. Further, when the threshold value is calibrated, the display unit 3 displays a message to refrain from speaking. Thereby, calibration of the threshold value according to the surrounding situation can be performed more appropriately.

また、本実施の形態では、テレビジョン装置１００と携帯電話２００との間の無線接続が確立されない場合には、音声認識モードを終了する。これにより、命令抽出部７及び発話検出処理部３２が停止されるので、消費電力をさらに低減することができる。 In the present embodiment, when the wireless connection between the television device 100 and the mobile phone 200 is not established, the voice recognition mode is ended. Since the instruction extraction unit 7 and the utterance detection processing unit 32 are thereby stopped, power consumption can be further reduced.

また、本実施の形態では、一定期間、発話が行われない場合に、音声認識モードを終了する。これにより、命令抽出部７及び発話検出処理部３２が停止されるので、消費電力をさらに低減することができる。 In the present embodiment, the speech recognition mode is terminated when no utterance is made for a certain period. Thereby, since the instruction extraction unit 7 and the speech detection processing unit 32 are stopped, the power consumption can be further reduced.

また、本実施の形態では、操作命令の内容が、音声認識モードの終了命令である場合に、音声認識モードを終了する。これにより、命令抽出部７及び発話検出処理部３２が停止されるので、消費電力をさらに低減することができる。 In the present embodiment, when the content of the operation command is a voice recognition mode end command, the voice recognition mode is ended. Thereby, since the instruction extraction unit 7 and the speech detection processing unit 32 are stopped, the power consumption can be further reduced.

また、本実施の形態では、音声から操作命令を抽出する一連の命令抽出処理において、テレビジョン装置１００に複数のマイクロホン４、エコーキャンセラ２２、雑音除去処理部２３等の機能を備えることで、高精度な音声認識が可能となる。 Further, in this embodiment, in a series of instruction extraction processes for extracting operation instructions from sound, the television apparatus 100 is provided with functions such as a plurality of microphones 4, an echo canceller 22, a noise removal processing unit 23, and the like. Accurate speech recognition is possible.

しかしながら、周囲の環境や雑音状況によっては、雑音をうまく除去できない場合があり得る。例えば、ユーザとテレビジョン装置１００の位置が非常に遠かったり、スピーカ２の音が非常に大きくて、テレビジョン装置１００のマイクロホン４に到達するユーザの音声のＳＮ比が極端に小さかったり場合には、そのような状況が起こり得る。 However, depending on the surrounding environment and noise conditions, noise may not be removed successfully. For example, when the position of the user and the television device 100 is very far away, the sound of the speaker 2 is very loud, and the SN ratio of the user's voice reaching the microphone 4 of the television device 100 is extremely small. Such a situation can happen.

そこで、本実施の形態では、ユーザに近い場所にある携帯電話２００を、発話検出装置として使用するとともに、マイクロホン３１から入力された音声データをテレビジョン装置１００に送信し、操作命令の抽出に用いる。このため、高いＳＮ比で、操作命令を取得することができる。 Therefore, in the present embodiment, the mobile phone 200 located near the user is used as an utterance detection device, and the audio data input from the microphone 31 is transmitted to the television device 100 and used to extract an operation command. . For this reason, an operation command can be acquired with a high S / N ratio.

テレビジョン装置１００は、リモコンによる操作が複雑であるため、本実施の形態のように、音声認識による操作が可能となれば、ユーザの作業負担が著しく軽減される。しかしながら、本発明は、チューナ、オーディオ、レコーダなど、あらゆるＡＶ機器に適用可能であり、ＡＶ機器以外の家電製品にも適用可能である。 Since the television device 100 is complicated to operate with a remote controller, the user's work burden is significantly reduced if the operation by voice recognition is possible as in the present embodiment. However, the present invention can be applied to any AV device such as a tuner, an audio, and a recorder, and can be applied to home appliances other than the AV device.

本実施の形態では、エコーキャンセルを施し、雑音除去処理を施したテレビジョン装置１００で取得されたデジタル音声データと、比較的ＳＮ比が高い携帯電話２００のデジタル音声データの２系統の音声データを取得することができる。そこで、それぞれを独立して音声認識を行い、より音声認識精度が高い方を用いて、音声認識結果とするようにしてもよい。 In the present embodiment, two types of audio data, that is, digital audio data acquired by the television apparatus 100 that has been subjected to echo cancellation and noise removal processing, and digital audio data of the mobile phone 200 having a relatively high S / N ratio are obtained. Can be acquired. Therefore, the voice recognition may be performed independently, and the voice recognition result may be obtained using the one with higher voice recognition accuracy.

Which audio data to adopt can be decided on the basis of likelihood, a measure of how certain each stream of audio data is. The likelihoods are compared, and the result with the higher likelihood is adopted as the voice recognition result. For example, when the user is far from the television device 100 and speaks toward the mobile phone 200 placed on a desk or the like, the audio data acquired by the mobile phone 200 yields the higher likelihood, so highly accurate voice recognition can be expected.
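The likelihood comparison described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the `RecognitionResult` type, the `select_result` function, the use of a log-likelihood score, and the sample values are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class RecognitionResult:
    source: str            # which audio stream produced this result
    command: str           # recognized operation command
    log_likelihood: float  # recognizer score (assumed); higher means more certain


def select_result(tv: RecognitionResult, phone: RecognitionResult) -> RecognitionResult:
    """Return the recognition result with the higher likelihood.

    Ties go to the mobile phone stream. That tie-break is an arbitrary
    choice made for this sketch; the source does not specify one.
    """
    return tv if tv.log_likelihood > phone.log_likelihood else phone


# Hypothetical scores for one utterance recognized on both streams.
tv_result = RecognitionResult("television device 100", "volume up", -42.7)
phone_result = RecognitionResult("mobile phone 200", "channel up", -18.3)
chosen = select_result(tv_result, phone_result)
print(chosen.source, "->", chosen.command)
```

Running recognition on both streams costs more computation than recognizing one, so this variant trades processing load for robustness when either stream may be degraded.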

Alternatively, the microphone 31 of the mobile phone 200 and the microphone 4 of the television device 100 may be treated together as a single microphone array for noise removal. Among noise removal methods that use multiple microphones, a method based on ICA (independent component analysis) in particular makes noise removal possible even when the microphone characteristics and positions are not known in advance. Performing voice recognition on audio data from which noise has been removed accurately in this way can be expected to give a highly accurate recognition result.

Furthermore, in the present embodiment, executing the utterance detection process of FIG. 9 makes it possible to detect the period during which the user is speaking. The volume of the television device 100 can therefore be automatically lowered or set to zero during an utterance, which reduces the influence of the sound emitted by the television device 100 and allows the user's voice to reach the microphone 31 at a high SN ratio.
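The volume reduction during an utterance can be sketched as below. The `VolumeDucker` class, its method names, and the restoration of the previous volume when the utterance ends are assumptions made for illustration; the text above states only that the volume is lowered or set to zero while the user is speaking.

```python
class VolumeDucker:
    """Lowers the output volume while an utterance is in progress."""

    def __init__(self, volume: int, ducked_volume: int = 0):
        self.volume = volume                # current output volume
        self.ducked_volume = ducked_volume  # volume used during an utterance
        self._saved = None                  # volume to restore afterwards

    def on_utterance_start(self) -> None:
        # Ignore repeated start signals so the saved volume is not overwritten.
        if self._saved is None:
            self._saved = self.volume
            # Never raise the volume when ducking.
            self.volume = min(self.volume, self.ducked_volume)

    def on_utterance_end(self) -> None:
        # Restore the volume only if a start signal was seen (assumed behavior).
        if self._saved is not None:
            self.volume = self._saved
            self._saved = None


ducker = VolumeDucker(volume=30, ducked_volume=5)
ducker.on_utterance_start()
volume_during_utterance = ducker.volume
ducker.on_utterance_end()
volume_after_utterance = ducker.volume
```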

Thus, according to the present embodiment, the user does not need to press a voice recognition switch or bring the microphone 31 close to the mouth for each utterance. Simply by speaking a command, the user can operate home appliances through highly accurate voice recognition that is unaffected by ambient noise, and this function can be realized with low power consumption.

The various time periods specified in the present embodiment (the sound pressure detection interval for utterance detection, the interval after which the human sensor causes the voice recognition mode to end, and so on) are examples, and are not limited to the values illustrated. Also, although the digital audio data of the mobile phone 200 is transmitted to the television device 100 in the present embodiment, the audio itself need not be transmitted: to further reduce the power consumption of the mobile phone 200, transmission may be limited to the utterance detection signals. Furthermore, although the mobile phone 200 is used as the utterance detection device in the present embodiment, a dedicated utterance detection module that provides equivalent functions may be used instead.

In the above embodiment, the programs executed by the television device 100 and the mobile phone 200 may be stored and distributed on a computer-readable recording medium such as a flexible disk, a CD-ROM (Compact Disk Read-Only Memory), a DVD (Digital Versatile Disk), or an MO (Magneto-Optical Disk), and a system that executes the above-described processing may be configured by installing the programs.

The programs may also be stored in a disk device or the like of a predetermined server device on a communication network such as the Internet, and downloaded, for example, superimposed on a carrier wave.

When the above functions are realized in part by an OS (Operating System), or by cooperation between the OS and an application, only the portion other than the OS may be stored on a medium and distributed, or downloaded.

The present invention is not limited by the above embodiment and drawings. It goes without saying that the embodiment and drawings can be modified without departing from the gist of the present invention.

The present invention is suitable for remote operation of electronic devices such as AV equipment.

DESCRIPTION OF SYMBOLS: 1 ... television core unit, 2 ... speaker, 3 ... display unit, 4 ... microphone, 5 ... sound processing unit, 6 ... sound control unit, 7 ... command extraction unit, 10 ... operation control unit, 11 ... human sensor, 12 ... wireless communication unit, 20 ... AD converter (ADC), 21 ... recording area unit, 22 ... echo canceller, 23 ... noise removal processing unit, 24 ... utterance determination processing unit, 25 ... voice recognition processing unit, 26 ... voice recognition control unit, 27 ... DA converter (DAC), 30 ... mobile phone core unit, 31 ... microphone, 32 ... utterance detection processing unit, 33 ... utterance detection control unit, 40 ... wireless communication unit, 41 ... AD converter (ADC), 42 ... sound pressure level determination processing unit, 43 ... recording area unit, 44 ... timer, 100 ... television device, 200 ... mobile phone, 300 ... voice recognition operation system, 301, 302, 303, 304 ... states

Claims

An electronic device comprising:
at least one sound collection unit that collects ambient sound and outputs a first audio signal corresponding to the sound;
a wireless communication unit that performs wireless communication with an external device;
a command extraction unit that extracts an operation command uttered by a user by performing signal processing, including voice recognition processing, on at least one of the first audio signal and a second audio signal received by the wireless communication unit;
an operation unit that operates the main body according to the extracted operation command; and
a control unit that controls the start and stop of the command extraction unit based on a timing signal received by the wireless communication unit.

The electronic device according to claim 1, further comprising a human sensor that detects the presence of the user in the surroundings, wherein
the wireless communication unit transmits the sensor output of the human sensor to the external device, and
the control unit controls the start and stop of the command extraction unit based on the sensor output of the human sensor.

The electronic device according to claim 1, wherein
the timing signal received by the wireless communication unit includes an utterance start signal and an utterance end signal, and
the control unit starts the command extraction unit when the utterance start signal is received by the wireless communication unit, and stops the command extraction unit when the utterance end signal is received by the wireless communication unit.

The electronic device according to claim 3, further comprising a recording unit that stores the first audio signal, wherein
the timing signal received by the wireless communication unit includes a storage command and a discard command for the recording unit, and
the control unit:
starts storing the first audio signal in the recording unit when the storage command is received by the wireless communication unit;
discards the first audio signal stored in the recording unit when the discard command is received by the wireless communication unit; and
controls the command extraction unit so as to extract the operation command using the first audio signal stored in the recording unit.

The electronic device according to claim 4, further comprising an output unit that outputs sound, wherein
the control unit reduces the volume of the sound output from the output unit when the wireless communication unit receives the storage command.

The electronic device according to claim 1, wherein the control unit:
keeps the command extraction unit stopped in operation modes other than a voice recognition mode; and
when an operation for switching to the voice recognition mode is input to the operation unit, starts the command extraction unit and transmits a signal for switching to the voice recognition mode to the external device via the wireless communication unit.

The electronic device according to claim 6, further comprising a display unit that displays information, wherein
the control unit, after starting the command extraction unit, displays on the display unit a message asking the user to refrain from speaking.

The electronic device according to claim 6 or 7, wherein
the control unit ends the voice recognition mode and stops the command extraction unit when a wireless connection with the external device is not established.

The electronic device according to claim 6, wherein
the control unit, when the wireless communication unit does not receive a signal indicating that the start of an utterance has been detected for a predetermined period, ends the voice recognition mode, stops the command extraction unit, and transmits an end notification of the voice recognition mode to the external device via the wireless communication unit.

The electronic device according to claim 6, wherein
the control unit, when the operation command is a command to end the voice recognition mode, ends the voice recognition mode, stops the command extraction unit, and transmits an end notification of the voice recognition mode to the external device via the wireless communication unit.

The electronic device according to claim 1, wherein
the control unit causes the command extraction unit to perform voice recognition processing on both the first audio signal and the second audio signal, and to extract the operation command using whichever processing result of the voice recognition processing has the higher likelihood.

An utterance detection device comprising:
a sound collection unit that collects ambient sound and outputs an audio signal corresponding to the sound;
a wireless communication unit that performs wireless communication with an electronic device;
an utterance detection processing unit that detects the start and end of a user's utterance based on the sound pressure level of the audio signal; and
a control unit that transmits a timing signal to the electronic device via the wireless communication unit each time the utterance detection processing unit detects the start or end of the user's utterance.

The utterance detection device according to claim 12, wherein
the control unit controls the start and stop of the utterance detection processing unit based on the sensor output of a human sensor.

The utterance detection device according to claim 12 or 13, wherein the control unit:
when the utterance detection processing unit detects the start of the user's utterance, transmits an utterance start signal as the timing signal, together with the audio signal, to the electronic device via the wireless communication unit; and
when the utterance detection processing unit detects the end of the user's utterance, transmits an utterance end signal as the timing signal to the electronic device via the wireless communication unit and stops transmitting the audio signal.

The utterance detection device according to claim 14, further comprising a recording unit that stores the audio signal, wherein
the utterance detection processing unit includes a sound pressure level determination processing unit that determines whether the sound pressure level of the audio signal exceeds a threshold, and
the control unit:
when the sound pressure level determination processing unit determines that the sound pressure level of the audio signal has exceeded the threshold, starts storing the audio signal in the recording unit and transmits a storage command to the electronic device;
when the sound pressure level determination processing unit determines, after a predetermined period has elapsed, that the sound pressure level of the audio signal is equal to or lower than the threshold, discards the audio signal stored in the recording unit and transmits a discard command to the electronic device; and
when the sound pressure level determination processing unit determines, after the predetermined period has elapsed, that the sound pressure level of the audio signal exceeds the threshold, transmits the utterance start signal and the audio signal stored in the recording unit to the electronic device via the wireless communication unit.
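
By way of illustration only, the store, discard, and utterance-start sequence recited above can be sketched as a small state machine. The class and state names, the frame-based timing (`confirm_frames` standing in for the predetermined period), and the `sent` list standing in for the wireless communication unit are assumptions of this sketch and are not taken from the specification. Detection of the end of the utterance is outside its scope.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()      # waiting for the level to exceed the threshold
    PENDING = auto()   # storing audio until the predetermined period elapses
    SPEAKING = auto()  # utterance start has been reported


class UtteranceStartDetector:
    def __init__(self, threshold: float, confirm_frames: int):
        self.threshold = threshold
        self.confirm_frames = confirm_frames  # predetermined period, in frames
        self.state = State.IDLE
        self.buffer = []  # stands in for the recording unit
        self.sent = []    # stands in for the wireless communication unit

    def feed(self, level: float, frame: bytes = b"") -> None:
        """Process one audio frame and its measured sound pressure level."""
        if self.state is State.IDLE:
            if level > self.threshold:
                self.buffer = [frame]
                self.sent.append(("STORE", None))
                self.state = State.PENDING
        elif self.state is State.PENDING:
            self.buffer.append(frame)
            if len(self.buffer) > self.confirm_frames:
                if level > self.threshold:
                    # Still loud after the period: report the utterance start
                    # together with the stored audio.
                    self.sent.append(("UTTERANCE_START", b"".join(self.buffer)))
                    self.state = State.SPEAKING
                else:
                    # Level fell back: treat the burst as noise and discard it.
                    self.buffer = []
                    self.sent.append(("DISCARD", None))
                    self.state = State.IDLE
        # State.SPEAKING: end-of-utterance handling is not covered here.


detector = UtteranceStartDetector(threshold=0.5, confirm_frames=2)
for level, frame in [(0.9, b"a"), (0.8, b"b"), (0.7, b"c")]:
    detector.feed(level, frame)
```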

The utterance detection device according to any one of claims 12 to 15, wherein
the control unit starts the utterance detection processing unit when a signal for switching to the voice recognition mode is received from the electronic device via the wireless communication unit.

The utterance detection device according to claim 16, wherein the control unit, after starting the utterance detection processing unit:
causes the sound pressure level determination processing unit to determine, over a predetermined period, whether the sound pressure level of the audio signal exceeds the threshold; and
adjusts the threshold so that the ratio of the time during which the sound pressure level of the audio signal exceeds the threshold to the predetermined period is smaller than a predetermined ratio.
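
By way of illustration only, the threshold adjustment recited above can be sketched as follows. The function name `calibrate_threshold`, the fixed `step` by which the threshold is raised, and the sample levels are assumptions of this sketch; the claim requires only that the resulting ratio be smaller than the predetermined ratio, not any particular adjustment procedure.

```python
def calibrate_threshold(levels, threshold, max_ratio, step=1.0):
    """Raise the threshold until the fraction of samples above it is below max_ratio.

    levels:    sound pressure levels sampled over the predetermined period
    threshold: initial threshold
    max_ratio: predetermined ratio, in (0, 1]
    step:      increment applied to the threshold (assumed; must be positive)
    """
    if not 0 < max_ratio <= 1:
        raise ValueError("max_ratio must be in (0, 1]")
    if step <= 0:
        raise ValueError("step must be positive")
    if not levels:
        return threshold
    # The loop ends at the latest once the threshold reaches max(levels),
    # because the ratio is then 0, which is below any positive max_ratio.
    while sum(level > threshold for level in levels) / len(levels) >= max_ratio:
        threshold += step
    return threshold


# Hypothetical levels measured while the user refrains from speaking.
ambient = [40, 42, 41, 55, 43, 60, 41, 42, 40, 44]
adjusted = calibrate_threshold(ambient, threshold=41, max_ratio=0.2)
```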

The utterance detection device according to claim 16 or 17, wherein
the control unit stops the utterance detection processing unit when a wireless connection with the electronic device is not established.

The utterance detection device according to any one of claims 16 to 18, wherein
the control unit stops the utterance detection processing unit when an end notification of the voice recognition mode is received from the electronic device via the wireless communication unit.

A voice recognition operation system comprising:
the electronic device according to any one of claims 1 to 11; and
the utterance detection device according to any one of claims 12 to 19.

The voice recognition operation system according to claim 20, wherein
the electronic device is a television device.

The voice recognition operation system according to claim 20 or 21, wherein
the utterance detection device is a mobile phone.

A voice recognition operation method comprising:
a sound collection step of simultaneously collecting ambient sound with an utterance detection device placed near a user and with an electronic device;
an utterance start detection step of detecting, in the utterance detection device, the start of the user's utterance based on an audio signal corresponding to the collected sound;
a first transmission step of transmitting an utterance start signal from the utterance detection device to the electronic device by wireless communication when the start of the user's utterance is detected in the utterance start detection step;
an activation step of starting, in the electronic device and in response to the received utterance start signal, a command extraction unit that extracts an operation command uttered by the user from an input audio signal;
an extraction step of extracting the operation command with the command extraction unit from an audio signal corresponding to the collected sound;
an utterance end detection step of detecting, in the utterance detection device, the end of the user's utterance based on an audio signal corresponding to the collected sound;
a second transmission step of transmitting an utterance end signal from the utterance detection device to the electronic device by wireless communication when the end of the user's utterance is detected in the utterance end detection step;
a stop step of stopping the command extraction unit when the electronic device receives the utterance end signal; and
an operation step of operating the main body according to the extracted operation command.

A program that causes a computer that controls an electronic device to function as:
at least one sound collection means for collecting ambient sound and outputting a first audio signal corresponding to the sound;
wireless communication means for performing wireless communication with an external device;
command extraction means for extracting an operation command uttered by a user by performing signal processing, including voice recognition processing, on at least one of the first audio signal and a second audio signal transmitted from the external device;
operation means for operating the main body according to the extracted operation command; and
control means for controlling the start and stop of the command extraction means based on a timing signal transmitted from the external device.

A program that causes a computer that controls an utterance detection device for detecting a user's utterance to function as:
sound collection means for collecting ambient sound and outputting an audio signal corresponding to the sound;
wireless communication means for performing wireless communication with an electronic device;
utterance detection processing means for detecting the start and end of the user's utterance based on the sound pressure level of the audio signal; and
control means for transmitting a timing signal to the electronic device each time the utterance detection processing means detects the start or end of the user's utterance.