JP2022030416A - Imaging apparatus, method for controlling imaging apparatus, and program

Info

Publication number: JP2022030416A
Application number: JP2020134450A
Authority: JP
Inventors: 祐介鳥海; Yusuke Chokai
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2020-08-07
Filing date: 2020-08-07
Publication date: 2022-02-18

Abstract

To enable an imaging apparatus to more reliably pick up an image of a subject. SOLUTION: An imaging apparatus has: imaging means; driving means that can drive an imaging direction of the imaging means; sound collecting means for collecting the sound of a voice; detection means that detects the direction of the sound source of the voice collected by the sound collecting means; communication means; and control means. The control means controls the communication means to communicate with an external device that can perform voice recognition. Upon recognition of an identification command of the external device, the control means determines a direction in which the imaging means picks up an image. When the control means is instructed by the external device, through the communication means, to pick up an image, it controls the imaging means and the driving means to pick up an image in the determined direction. SELECTED DRAWING: Figure 13

Description

The present invention relates to an image pickup apparatus, a control method for the image pickup apparatus, and a program.

Image pickup apparatuses capable of voice recognition are conventionally known. Patent Document 1 discloses a camera that can search for a subject and photograph it in response to a voice command.

Japanese Unexamined Patent Publication No. 2019-117375

A camera such as the one disclosed in Patent Document 1 could be linked with an external device, such as a smart speaker, that can recognize a wider range of user speech, so that the user can input voice commands more easily. In such a system, the external device recognizes the voice command, and the camera operates based on the result of that voice recognition. However, when the camera shoots based on the external device's recognition result, more time elapses between the user uttering the voice command and the shooting process being executed than when the camera recognizes the voice command by itself and shoots. As a result, the angle of view may change due to other shooting conditions or the like before the shooting process is executed, and the desired subject may not be captured.

Therefore, an object of the present invention is to enable an image pickup apparatus to capture a subject more reliably.

The image pickup apparatus of the present invention comprises: image pickup means; drive means capable of driving the image pickup direction of the image pickup means; sound collection means for collecting sound; detection means for detecting the direction of the sound source of the sound collected by the sound collection means; communication means; and control means. The control means controls the communication means so as to communicate with an external device capable of voice recognition. When an identification command of the external device is recognized, the control means determines the direction in which the image pickup means is to capture an image. When image capture is instructed by the external device via the communication means, the control means controls the image pickup means and the drive means so as to capture an image in the determined direction.

According to the present invention, the image pickup apparatus can capture a subject more reliably.

FIG. 1 is a block diagram showing a configuration example of the image pickup apparatus.
FIG. 2 is a block diagram showing a configuration example of the audio signal processing unit.
FIG. 3 is a diagram showing examples of commands.
FIG. 4 is a diagram showing the appearance and usage examples of the image pickup apparatus.
FIG. 5 is a diagram showing the drive directions of the image pickup apparatus.
FIG. 6 is a flowchart showing a control method of the image pickup apparatus.
FIG. 7 is a flowchart showing a control method of the image pickup apparatus.
FIG. 8 is a flowchart showing a control method of the image pickup apparatus.
FIG. 9 is a timing chart showing a control method of the image pickup apparatus.
FIG. 10 is a diagram showing sound direction detection.
FIG. 11 is a diagram showing sound direction detection.
FIG. 12 is a block diagram showing a configuration example of a smart speaker.
FIG. 13 is a diagram showing a control method of the smart speaker and the image pickup apparatus.
FIG. 14 is a diagram showing a method of detecting the sound source direction.
FIG. 15 is a diagram showing a control method of the smart speaker and a plurality of image pickup apparatuses.

[First Embodiment]
FIG. 1 is a block diagram of the image pickup apparatus 130 according to the first embodiment. The image pickup apparatus 130 has a movable image pickup unit 100, which includes an optical lens unit and whose image pickup direction (optical axis direction) is variable, and a support unit 110, which includes a central control unit (CPU) 111 that controls the driving of the movable image pickup unit 100 and controls the image pickup apparatus 130 as a whole.

The support unit 110 is provided with a plurality of vibrating bodies 125 to 127, each including a piezoelectric element, arranged so as to contact the surface of the movable image pickup unit 100. By controlling the vibration of these vibrating bodies 125 to 127, the movable image pickup unit 100 performs pan and tilt operations. The pan and tilt operations may instead be realized by servomotors or the like.

The movable image pickup unit 100 has a lens unit 101, an image pickup unit 102, a lens actuator control unit 103, and an audio input unit 104.

The lens unit 101 has a photographing optical system including a zoom lens, a diaphragm/shutter, and a focus lens. The image pickup unit 102 includes an image pickup element such as a CMOS sensor or a CCD sensor, photoelectrically converts the optical image formed by the lens unit 101, and outputs an electric signal. The lens actuator control unit 103 includes a motor driver IC and drives the various actuators of the lens unit 101, such as those for the zoom lens, the diaphragm/shutter, and the focus lens. These actuators are driven based on actuator drive instruction data received from the central control unit 111 in the support unit 110, described later. The audio input unit 104 has a plurality of microphones (four in the present embodiment; hereinafter "mics"), converts sound into an electric signal, and further converts the electric signal into a digital signal (audio data) for output.

The support unit 110, for its part, has the central control unit 111 for controlling the image pickup apparatus 130 as a whole. The central control unit 111 has a CPU, a ROM storing the program executed by the CPU, and a RAM used as the CPU's work area. The support unit 110 also has an image pickup signal processing unit 112, a video signal processing unit 113, an audio signal processing unit 114, an operation unit 115, a storage unit 116, and a display unit 117. The support unit 110 further has an external input/output terminal unit 118, an audio reproduction unit 119, a power supply unit 120, a power supply control unit 121, a position detection unit 122, a rotation control unit 123, a wireless communication unit 124, and the vibrating bodies 125 to 127 described above.

The image pickup signal processing unit 112 converts the electric signal output from the image pickup unit 102 of the movable image pickup unit 100 into a video signal. The video signal processing unit 113 processes the video signal output from the image pickup signal processing unit 112 according to its intended use. This processing includes image cropping, electronic image stabilization by rotation processing, and subject detection processing for detecting a subject (face).

The audio signal processing unit 114 performs audio processing on the digital signal output from the audio input unit 104. If the audio input unit 104 uses microphones that output analog signals, the audio signal processing unit 114 may include a configuration for converting the analog signals into digital signals. Details of the audio signal processing unit 114, together with the audio input unit 104, will be described later with reference to FIG. 2.

The operation unit 115 functions as a user interface between the image pickup apparatus 130 and the user, and has various switches, buttons, and the like. The storage unit 116 stores various data such as video information obtained by shooting. The display unit 117 includes a display such as an LCD, and displays images as needed based on the signal output from the video signal processing unit 113. The display unit 117 also functions as part of the user interface by displaying various menus and the like. The external input/output terminal unit 118 inputs and outputs communication signals and video signals to and from external devices. The audio reproduction unit 119 includes a speaker, converts audio data into an electric signal, and reproduces sound. The power supply unit 120 is the source of the power needed to drive the whole image pickup apparatus 130 (each of its elements), and in the present embodiment is a rechargeable battery.

The power supply control unit 121 controls the supply and cutoff of power from the power supply unit 120 to each of the above components according to the state of the image pickup apparatus 130. Depending on that state, some components are not in use. Under the control of the central control unit 111, the power supply control unit 121 cuts off power to the components that are unused in the current state, thereby suppressing power consumption. The supply and cutoff of power will become clear from the description given later.

The position detection unit 122 has a gyro, an acceleration sensor, GPS, and the like, and detects movement of the image pickup apparatus 130. The position detection unit 122 is provided so that the apparatus can also handle the case where the user wears the image pickup apparatus 130. The rotation control unit 123 generates and outputs drive signals for driving the vibrating bodies 125 to 127 in accordance with instructions from the central control unit 111. The vibrating bodies 125 to 127 have piezoelectric elements and vibrate in response to the drive signals applied from the rotation control unit 123. The vibrating bodies 125 to 127 are a drive unit that includes a rotation drive unit (pan/tilt drive unit) and can drive the image pickup direction of the movable image pickup unit 100. As a result, the movable image pickup unit 100 pans and tilts in the direction instructed by the central control unit 111.

The wireless communication unit 124 transmits data such as image data in accordance with wireless standards such as WiFi (registered trademark) and BLE (Bluetooth (registered trademark) Low Energy). In the present embodiment, communication by the wireless communication unit 124 conforms to Bluetooth (registered trademark), a communication standard established by the Bluetooth Special Interest Group (Bluetooth SIG). In the present embodiment, Bluetooth communication uses the Bluetooth Low Energy (hereinafter, BLE) specification defined in Bluetooth version 5.0 and later. Compared with wireless LAN communication, BLE communication has a narrower communicable range (a shorter communicable distance) and a lower communication speed. On the other hand, BLE communication consumes less power than wireless LAN communication.

Two communication devices connected by BLE communication take the roles of central and peripheral, respectively. The connection topology of the BLE communication standard is a master-slave star network. The communication device operating as the central (hereinafter, the central device) is the master, and the communication device operating as the peripheral (hereinafter, the peripheral device) is the slave. The central device manages the participation of peripheral devices in the network and sets the various parameters of the wireless connection with each peripheral device. A central device can be connected to a plurality of peripheral devices simultaneously, but a peripheral device can establish a wireless connection with only one central device at a time. In addition, two central devices cannot establish a wireless connection with each other; to establish a wireless connection, one device must be the central and the other the peripheral.
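The connection rules described above can be illustrated with a minimal Python sketch. The class and function names (`BleDevice`, `can_connect`, `connect`) and the device names are hypothetical and are not taken from this publication or from any Bluetooth library; which role the image pickup apparatus or the external device takes is not stated in this passage and is only assumed for the example.

```python
class BleDevice:
    """Hypothetical model of a BLE device with a fixed role."""

    def __init__(self, name, role):
        if role not in ("central", "peripheral"):
            raise ValueError(f"unknown role: {role}")
        self.name = name
        self.role = role
        self.peers = set()  # names of currently connected devices


def can_connect(a, b):
    """Apply the role rules from the description above."""
    # One side must be central and the other peripheral.
    if a.role == b.role:
        return False
    peripheral = a if a.role == "peripheral" else b
    # A peripheral may be connected to only one central at a time.
    return len(peripheral.peers) == 0


def connect(a, b):
    """Connect two devices if the rules allow it; return success."""
    if not can_connect(a, b):
        return False
    a.peers.add(b.name)
    b.peers.add(a.name)
    return True


# Assumed example: a speaker acting as central, two cameras as peripherals.
speaker = BleDevice("speaker", "central")
camera1 = BleDevice("camera1", "peripheral")
camera2 = BleDevice("camera2", "peripheral")
phone = BleDevice("phone", "central")
```

In this model a central accumulates several peers, while a second central is refused by a peripheral that is already connected.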

In BLE communication, a signal called an advertisement can also be sent to devices with which pairing has not been established. Pairing means that the communicating devices each register (record in a predetermined area) the other's identification information as the identification information of a communication partner.

In addition to serving as a beacon by which devices announce their presence to each other, this signal can be used to measure radio field strength and thereby roughly estimate the distance to another device equipped with a BLE communication function. Based on the BLE radio field strength, the distance can be classified into the following three levels.
・ "Immediate": very close (about a few centimeters)
・ "Near": close (about 1 to 3 m)
・ "Far": far (3 m or more)
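As a rough illustration of how radio field strength might be mapped to the three levels above, the following Python sketch estimates distance from RSSI with a log-distance path loss model and then buckets the result. The model itself, the calibration constants, and the 0.5 m boundary between "Immediate" and "Near" are assumptions made for this example; the publication gives only the approximate distance ranges and does not describe how they are derived.

```python
# Assumed calibration values (not from the source document).
TX_POWER_AT_1M_DBM = -59.0  # hypothetical RSSI observed at a distance of 1 m
PATH_LOSS_EXPONENT = 2.0    # free-space propagation assumed

# The source lists "a few cm", "about 1 to 3 m" and "3 m or more".
# The 0.5 m Immediate/Near boundary is an assumption that fills the
# unspecified gap between those ranges.
IMMEDIATE_MAX_M = 0.5
NEAR_MAX_M = 3.0


def estimate_distance_m(rssi_dbm):
    """Estimate distance in metres from RSSI (log-distance model)."""
    exponent = (TX_POWER_AT_1M_DBM - rssi_dbm) / (10.0 * PATH_LOSS_EXPONENT)
    return 10.0 ** exponent


def classify_proximity(rssi_dbm):
    """Bucket an RSSI reading into Immediate / Near / Far."""
    distance = estimate_distance_m(rssi_dbm)
    if distance < IMMEDIATE_MAX_M:
        return "Immediate"
    if distance < NEAR_MAX_M:
        return "Near"
    return "Far"
```

Real RSSI readings fluctuate considerably, so a practical implementation would likely smooth several readings before classifying.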

Next, the configuration of the audio input unit 104 and the audio signal processing unit 114 in the present embodiment, and the sound direction detection processing, will be described with reference to FIG. 2. FIG. 2 shows the configuration of the audio input unit 104 and the audio signal processing unit 114, and the connections among the audio signal processing unit 114, the central control unit 111, and the power supply control unit 121.

The audio input unit 104 is a sound collection unit that has four omnidirectional microphones 104a, 104b, 104c, and 104d and collects sound. Each of the microphones 104a to 104d has a built-in A/D converter. Each microphone collects sound at a preset sampling rate (16 kHz for voice command detection and sound direction detection processing, 48 kHz for video recording), and its built-in A/D converter outputs the collected audio signal as digital audio data. In the present embodiment the audio input unit 104 has four digital-output microphones 104a to 104d, but analog-output microphones may be used instead. In that case, corresponding A/D converters may be provided in the audio signal processing unit 114. Also, although four microphones are used in the present embodiment, any number of three or more is acceptable.

When the image pickup apparatus 130 is powered on, the microphone 104a is unconditionally supplied with power and becomes able to collect sound. The other microphones 104b, 104c, and 104d, on the other hand, are subject to power supply and cutoff by the power supply control unit 121 under the control of the central control unit 111, and their power is cut off in the initial state immediately after the image pickup apparatus 130 is powered on.

The audio signal processing unit 114 has a sound pressure level detection unit 201, an audio memory 202, a voice command recognition unit 203, a sound direction detection unit 204, a video audio processing unit 205, and a command memory 206.

When the sound pressure level of the audio data output from the microphone 104a exceeds a preset threshold, the sound pressure level detection unit 201 supplies a signal indicating sound detection to the power supply control unit 121 and the audio memory 202.

When the power supply control unit 121 receives the signal indicating sound detection from the sound pressure level detection unit 201, it supplies power to the voice command recognition unit 203.
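The threshold-triggered power-up described above might be sketched as follows. This is only an illustration: measuring the level as an RMS value in dB relative to 16-bit full scale, the class and function names, and the threshold value used in the example are all assumptions, since the publication does not specify how the sound pressure level is computed.

```python
import math

FULL_SCALE = 32768.0  # magnitude of a signed 16-bit sample


def sound_level_dbfs(samples):
    """RMS level of signed 16-bit samples, in dB relative to full scale."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")
    return 20.0 * math.log10(rms / FULL_SCALE)


class PowerController:
    """Hypothetical stand-in for the power supply control unit."""

    def __init__(self):
        # The recognizer is assumed to be unpowered until sound is detected.
        self.command_recognizer_powered = False

    def on_sound_detected(self):
        self.command_recognizer_powered = True


def process_frame(samples, threshold_dbfs, power):
    """Signal the power controller when the frame exceeds the threshold."""
    detected = sound_level_dbfs(samples) > threshold_dbfs
    if detected:
        power.on_sound_detected()
    return detected
```

For example, with an assumed threshold of -40 dB, a near-silent frame leaves the recognizer unpowered, while a loud frame powers it on.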

The audio memory 202 is one of the targets of power supply and cutoff by the power supply control unit 121 under the control of the central control unit 111. The audio memory 202 is a buffer memory that temporarily stores the audio data output from the microphone 104a. The microphone 104a has a sampling rate of 16 kHz and outputs 2 bytes (16 bits) of audio data per sample. If the longest voice command is assumed to be 5 seconds, the audio memory 202 has a capacity of about 160 kilobytes (≈ 5 × 16 × 1000 × 2). Once the audio memory 202 is filled with audio data from the microphone 104a, old audio data is overwritten with new audio data. As a result, the audio memory 202 holds the audio data of the most recent predetermined period (about 5 seconds in the above example). The audio memory 202 starts storing the audio data from the microphone 104a in its sampling data area when triggered by receipt of the signal indicating sound detection from the sound pressure level detection unit 201.
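The capacity calculation and the overwrite behaviour described above correspond to a ring buffer, which can be sketched in Python as follows. The class name `AudioRingBuffer` and its methods are hypothetical; the publication describes the behaviour of the memory, not its implementation.

```python
SAMPLE_RATE_HZ = 16_000      # sampling rate of microphone 104a
BYTES_PER_SAMPLE = 2         # 16-bit samples
MAX_COMMAND_SECONDS = 5      # assumed longest voice command

# 5 x 16 x 1000 x 2 = 160,000 bytes, i.e. about 160 kilobytes.
CAPACITY_BYTES = MAX_COMMAND_SECONDS * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE


class AudioRingBuffer:
    """Fixed-size buffer that keeps only the most recent bytes."""

    def __init__(self, capacity):
        self._buf = bytearray(capacity)
        self._capacity = capacity
        self._write_pos = 0  # index where the next byte is written
        self._filled = 0     # number of valid bytes currently stored

    def write(self, data):
        """Append bytes, overwriting the oldest data once full."""
        for byte in data:
            self._buf[self._write_pos] = byte
            self._write_pos = (self._write_pos + 1) % self._capacity
        self._filled = min(self._capacity, self._filled + len(data))

    def snapshot(self):
        """Return the stored bytes, oldest first."""
        if self._filled < self._capacity:
            return bytes(self._buf[:self._filled])
        return bytes(self._buf[self._write_pos:] + self._buf[:self._write_pos])
```

A buffer created with `AudioRingBuffer(CAPACITY_BYTES)` therefore always holds at most the latest five seconds of 16 kHz, 16-bit audio.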

The command memory 206 is a non-volatile memory in which information on the voice commands recognized by the image pickup apparatus 130 is stored (registered) in advance. Although details will be described later, the types of voice commands stored in the command memory 206 are, for example, as shown in FIG. 3; information on a plurality of types of commands, starting with the "start command", is stored in the command memory 206.

The voice command recognition unit 203 is one of the targets of power supply and cutoff by the power supply control unit 121 under the control of the central control unit 111. Since voice recognition itself is a well-known technique, its description is omitted here. The voice command recognition unit 203 refers to the command memory 206 and performs recognition processing on the audio data stored in the audio memory 202. It determines whether the audio data collected by the microphone 104a is a voice command, and whether it matches a registered voice command stored in the command memory 206. When the voice command recognition unit 203 detects audio data that matches one of the voice commands stored in the command memory 206, it supplies the central control unit 111 with information indicating which voice command was detected, together with the addresses in the audio memory 202 of the first and last audio data that determined the voice command (or the timing at which the voice command was received).

The sound direction detection unit 204 is one of the targets of power supply and cutoff by the power supply control unit 121 under the control of the central control unit 111. Based on the audio data collected by the four microphones 104a to 104d, the sound direction detection unit 204 periodically detects the direction of the sound source of that audio data. The sound direction detection unit 204 has an internal buffer memory 204a and stores information representing the detected sound source direction in the buffer memory 204a. The period at which the sound direction detection unit 204 performs the sound direction detection processing may be sufficiently longer than the sampling period of the microphone 104a (sampling rate: 16 kHz). However, the buffer memory 204a is assumed to have enough capacity to store sound direction information covering the same period as the audio data that can be stored in the audio memory 202.

The video audio processing unit 205 is one of the targets of power supply and cutoff by the power supply control unit 121 under the control of the central control unit 111. Of the four microphones 104a to 104d, the video audio processing unit 205 receives the audio data of the two microphones 104a and 104b as stereo audio data. It then performs audio processing for video sound, such as various kinds of filtering, wind noise reduction, stereo enhancement, drive noise removal, ALC (Auto Level Control), and compression. As will become clear from the description given later, in the present embodiment the microphone 104a functions as the L-channel microphone of a stereo microphone, and the microphone 104b functions as the R-channel microphone.

In FIG. 2, in consideration of power consumption and circuit configuration, only the minimum necessary connections between the four microphones 104a to 104d of the audio input unit 104 and the blocks included in the audio signal processing unit 114 are shown. However, as far as power and circuit configuration allow, the microphones 104a to 104d may be shared among the blocks included in the audio signal processing unit 114. Also, although the microphone 104a is connected as the reference microphone in the present embodiment, any of the microphones may serve as the reference.

The external appearance and usage examples of the image pickup apparatus 130 will be described with reference to FIGS. 4(a) to 4(e). FIG. 4(a) shows a top view and a front view of the exterior of the image pickup apparatus 130 according to the present embodiment. The movable image pickup unit 100 of the image pickup apparatus 130 has a first housing 400 that is substantially hemispherical. Taking the plane parallel to the bottom surface as the horizontal plane and defining this plane as 0 degrees, the first housing 400 has a cutout window spanning the range from -20 degrees to 90 degrees (the vertical direction), and can rotate through 360 degrees in the horizontal plane as indicated by arrow A in the figure. The movable image pickup unit 100 also has a second housing 401 that can rotate together with the lens unit 101 and the image pickup unit 102 along the cutout window, within the horizontal-to-vertical range indicated by arrow B in the figure. The rotation of the first housing 400 along arrow A corresponds to the pan operation, and the rotation of the second housing 401 along arrow B corresponds to the tilt operation; both are realized by driving the vibrating bodies 125 to 127. As stated above, the tiltable range of the image pickup apparatus 130 in the present embodiment is from -20 degrees to +90 degrees.
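The pan and tilt ranges described above can be expressed as a small normalization step applied before a drive command is issued. The Python sketch below is illustrative only: the function names are hypothetical, and clamping an out-of-range tilt request to the nearest limit is an assumed policy that the publication does not describe.

```python
TILT_MIN_DEG = -20.0  # lower limit of the tiltable range
TILT_MAX_DEG = 90.0   # upper limit (vertical direction)


def normalize_pan(pan_deg):
    """Wrap a pan angle into [0, 360), since panning covers 360 degrees."""
    return pan_deg % 360.0


def clamp_tilt(tilt_deg):
    """Limit a tilt angle to the -20 to +90 degree range."""
    return max(TILT_MIN_DEG, min(TILT_MAX_DEG, tilt_deg))


def target_direction(pan_deg, tilt_deg):
    """Return a (pan, tilt) pair that the mechanism can actually reach."""
    return (normalize_pan(pan_deg), clamp_tilt(tilt_deg))
```

With this policy, a request for a pan of 370 degrees and a tilt of 120 degrees becomes a pan of 10 degrees at the maximum tilt of 90 degrees.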

The microphones 104a and 104b are arranged on the front side of the first housing 400, on either side of the cutout window. The microphones 104c and 104d are provided on the rear side of the first housing 400. As can be seen from FIG. 4(a), with the second housing 401 held fixed, the positions of the microphones 104a and 104b relative to the lens unit 101 and the image pickup unit 102 do not change no matter in which direction the first housing 400 is panned along arrow A. That is, the microphone 104a is always located on the left side, and the microphone 104b on the right side, with respect to the image pickup direction of the image pickup unit 102. Because the microphones 104a and 104b are arranged symmetrically with respect to the image pickup direction of the image pickup unit 102, the microphone 104a provides the input to the L channel of the stereo microphone and the microphone 104b provides the input to the R channel. Consequently, a constant relationship is maintained between the space represented by the image captured by the image pickup unit 102 and the sound field acquired by the microphones 104a and 104b.

The four microphones 104a, 104b, 104c, and 104d in the present embodiment are arranged at the vertices of a rectangle when viewed from the top of the image pickup apparatus 130, as shown in FIG. 4(a). These four microphones 104a to 104d are assumed to lie in a single horizontal plane in FIG. 4(a), although some deviation is acceptable.

The distance between the microphone 104a and the microphone 104b is larger than the distance between the microphone 104a and the microphone 104c. The distance between adjacent microphones is preferably about 10 mm to 30 mm. In the present embodiment the number of microphones is four, but any number of three or more is acceptable as long as the microphones are not all arranged on a single straight line. The arrangement of the microphones 104a to 104d in FIG. 4(a) is only an example, and it may be changed as appropriate according to mechanical constraints, design constraints, and the like.

FIGS. 4(b) to 4(e) show usage modes of the image pickup apparatus 130 in the present embodiment. FIG. 4(b) illustrates the case in which the image pickup apparatus 130 is placed on a desk or the like, a usage mode intended for photographing the photographer and surrounding subjects. FIG. 4(c) illustrates an example in which the image pickup apparatus 130 is hung from the photographer's neck, a usage mode intended mainly for photographing what lies ahead of the photographer as he or she moves. FIG. 4(d) illustrates an example in which the image pickup apparatus 130 is fixed to the photographer's shoulder, a usage mode intended for photographing the front, the rear, and the right side of the photographer's surroundings. FIG. 4(e) illustrates an example in which the image pickup apparatus 130 is fixed to the end of a rod held by the user, a usage mode intended for shooting after moving the image pickup apparatus 130 to a shooting position desired by the user (a high place or a position out of reach).

The pan operation and the tilt operation of the image pickup apparatus 130 of the present embodiment will be described in more detail with reference to FIGS. 5(a) to 5(c). The description here assumes the stationary usage example of FIG. 4(b), but the same applies to the other usage examples.

FIG. 5(a) shows a state in which the lens unit 101 faces the horizontal direction. Taking FIG. 5(a) as the initial state, panning the first housing 400 by 90 degrees counterclockwise as viewed from above results in the state of FIG. 5(b). On the other hand, tilting the second housing 401 by 90 degrees from the initial state of FIG. 5(a) results in the state of FIG. 5(c). As described earlier, the rotation of the first housing 400 and the second housing 401 is realized by the vibration of the vibrating bodies 125 to 127 driven by the rotation control unit 123.

Next, the processing procedure of the central control unit 111 of the image pickup apparatus 130 in the present embodiment will be described with reference to the flowcharts of FIGS. 6 and 7. The processing of FIGS. 6 and 7 is the processing performed by the central control unit 111 when the main power supply of the image pickup apparatus 130 is turned on.

In step S601, the central control unit 111 performs initialization processing of the image pickup apparatus 130. In this initialization processing, the central control unit 111 takes the horizontal-plane component of the current imaging direction of the image pickup unit 102 of the movable image pickup unit 100 as the reference angle (0 degrees) for the pan operation.

From this point on, the horizontal-plane component of the imaging direction after a pan operation of the movable image pickup unit 100 is expressed as an angle relative to this reference angle. The horizontal-plane component of the sound source direction detected by the sound direction detection unit 204 is likewise expressed as an angle relative to the reference angle. In addition, as described in detail later, the sound direction detection unit 204 also determines whether a sound source is located directly above the image pickup apparatus 130 (in the axial direction of the rotation axis of the pan operation).
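The relative-angle convention above can be illustrated with a short sketch. This is not taken from the embodiment: the function names, the wrap range of [0, 360), and the idea of choosing the shorter pan direction are all assumptions made for illustration.

```python
def relative_angle(absolute_deg: float, reference_deg: float) -> float:
    """Express a horizontal direction relative to the pan reference angle.

    The result is wrapped to [0, 360); the wrap range is an assumption,
    the source only states that angles are relative to the reference.
    """
    return (absolute_deg - reference_deg) % 360.0


def shortest_pan(current_deg: float, target_deg: float) -> float:
    """Signed rotation in (-180, 180] that moves current_deg to target_deg.

    Choosing the shorter direction is a hypothetical design choice; the
    first housing can rotate through 360 degrees either way.
    """
    delta = (target_deg - current_deg) % 360.0
    return delta - 360.0 if delta > 180.0 else delta
```

For example, with the reference fixed at power-on, a sound source at 350 degrees and a camera currently at 10 degrees would call for a pan of -20 degrees under this convention.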

At this stage, power to the voice memory 202, the sound direction detection unit 204, the moving image audio processing unit 205, and the microphones 104b to 104d is cut off.

When the initialization processing is completed, in step S602 the central control unit 111 controls the power supply control unit 121 to start supplying power to the sound pressure level detection unit 201 and the microphone 104a. As a result, the sound pressure level detection unit 201 detects, based on the voice data output from the microphone 104a, the sound pressure level of the sound before its conversion into that voice data. When the sound pressure level detection unit 201 determines that the sound pressure level of the sound exceeds a preset threshold value, it notifies the central control unit 111 to that effect. The threshold value is, for example, 60 dB SPL (Sound Pressure Level), but the image pickup apparatus 130 may change it according to the environment or the like, or may restrict the detection to only the necessary frequency band.
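As a rough illustration of the threshold check in step S602, the following sketch computes a sound pressure level from a block of samples. It assumes the samples are already calibrated in pascals, which the source does not state; the function names are hypothetical. The 20 micropascal reference is the standard reference pressure for dB SPL in air.

```python
import math

P_REF_PA = 20e-6  # standard reference pressure for dB SPL in air


def spl_db(samples_pa):
    """Return the sound pressure level in dB SPL of calibrated samples (Pa)."""
    rms = math.sqrt(sum(p * p for p in samples_pa) / len(samples_pa))
    if rms == 0.0:
        return float("-inf")  # silence has no finite level
    return 20.0 * math.log10(rms / P_REF_PA)


def exceeds_threshold(samples_pa, threshold_db=60.0):
    """True when the block is louder than the threshold (60 dB SPL example)."""
    return spl_db(samples_pa) > threshold_db
```

An RMS pressure of 0.02 Pa corresponds to about 60 dB SPL, so a block at 0.2 Pa would trigger the notification while a block at 0.002 Pa would not.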

In step S603, the central control unit 111 waits for the sound pressure level detection unit 201 to detect a sound whose sound pressure level exceeds the threshold value. When such a sound is detected, in step S604 the voice memory 202 starts receiving and storing the voice data from the microphone 104a.

In step S605, the central control unit 111 controls the power supply control unit 121 to start supplying power to the voice command recognition unit 203. As a result, the voice command recognition unit 203 starts recognition processing, with reference to the command memory 206, of the voice data being stored in the voice memory 202. When the voice command recognition unit 203 recognizes a voice command that matches one of the voice commands in the command memory 206, it notifies the central control unit 111 of information that includes information identifying the recognized voice command and the address information, within the voice memory 202, of the first and last voice data that determined the recognized voice command. The address information of the voice data may instead be timing information indicating when the voice command was received.

In step S606, the central control unit 111 determines whether it has received, from the voice command recognition unit 203, information indicating that a voice command has been recognized. If the central control unit 111 determines that it has received such information, the process proceeds to step S607. If it determines that it has not received such information, the process proceeds to step S608.

In step S607, the central control unit 111 determines whether the recognized voice command is the activation command shown in FIG. 3. If the central control unit 111 determines that the recognized voice command is the activation command, the process proceeds to step S610. If it determines that the recognized voice command is not the activation command, the process proceeds to step S608.

In step S608, the central control unit 111 determines whether the time elapsed since the voice command recognition unit 203 was activated has exceeded a preset threshold value. If the central control unit 111 determines that the elapsed time has not exceeded the threshold value, the process returns to step S606, and the processing is repeated until the activation command is detected. If it determines that the elapsed time has exceeded the threshold value, the process proceeds to step S609.

In step S609, the central control unit 111 controls the power supply control unit 121 to cut off the power to the voice command recognition unit 203. The process then returns to step S603.
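The loop formed by steps S606 to S609 can be sketched as follows. This is a simplified model, not the embodiment's implementation: recognizer output is represented as a sequence of (elapsed seconds, recognized command or None) pairs, the returned strings merely name the next step, and treating "Hi, Camera" as the activation phrase is an assumption based on the example command mentioned for FIG. 3.

```python
from typing import Iterable, Optional, Tuple


def wait_for_activation(
    events: Iterable[Tuple[float, Optional[str]]],
    timeout_s: float,
    activation: str = "Hi, Camera",  # assumed activation phrase
) -> str:
    """Return "S610" if the activation command arrives, else "S609".

    Each event is (seconds since the recognizer was powered on,
    recognized command or None when nothing was recognized).
    """
    for elapsed_s, command in events:
        # S606 and S607: was a command recognized, and is it the activation command?
        if command == activation:
            return "S610"
        # S608: any other outcome falls through to the timeout check.
        if elapsed_s > timeout_s:
            return "S609"  # recognizer is powered down, back to S603
    return "S609"  # hypothetical fallback when the event stream ends
```

Note that a non-activation command does not reset the timer in this sketch, which mirrors the flow where steps S606 and S607 both fall through to step S608.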

In step S610, the central control unit 111 controls the power supply control unit 121 to start supplying power to the sound direction detection unit 204 and the microphones 104b to 104d. As a result, the sound direction detection unit 204 starts detecting the direction of the sound source based on voice data captured at the same time by the four microphones 104a to 104d. The sound source direction detection is performed at a predetermined cycle. The sound direction detection unit 204 stores sound direction information indicating the detected sound source direction in its internal buffer memory 204a. At this time, the sound direction detection unit 204 stores the sound direction information in the buffer memory 204a so that it is associated with the timing, within the voice data stored in the voice memory 202, of the voice data used to determine that sound direction information. Typically, what is stored in the buffer memory 204a is the direction of the sound source together with the address of the corresponding voice data in the voice memory 202. The sound direction information is an angle representing, in the horizontal plane, the difference between the direction of the sound source and the reference angle described above. As described in detail later, when the sound source is located directly above the image pickup apparatus 130, information indicating that the sound source is directly above is set in the sound direction information.

In step S611, the central control unit 111 controls the power supply control unit 121 to start supplying power to the image pickup unit 102 and the lens actuator control unit 103. As a result, the movable image pickup unit 100 begins to function as an image pickup apparatus.

Next, in step S701 of FIG. 7, the central control unit 111 determines whether it has received, from the voice command recognition unit 203, information indicating that a voice command has been recognized. If the central control unit 111 determines that it has received such information, the process proceeds to step S706. If it determines that it has not received such information, the process proceeds to step S702.

In step S702, the central control unit 111 determines whether a job is currently being executed in accordance with an instruction from the user. As will become clear from the description of the flowchart of FIG. 8, moving image shooting and recording, tracking processing, and the like correspond to jobs. If the central control unit 111 determines that such a job is currently being executed, the process returns to step S701. If it determines that no such job is being executed, the process proceeds to step S703.

In step S703, the central control unit 111 determines whether the time elapsed since the previous voice command was recognized has exceeded a preset threshold value. If the central control unit 111 determines that the elapsed time has exceeded the threshold value, the process proceeds to step S704. If it determines that the elapsed time has not exceeded the threshold value, the process returns to step S701.

In step S704, the central control unit 111 controls the power supply control unit 121 to cut off the power to the image pickup unit 102 and the lens actuator control unit 103. Then, in step S705, the central control unit 111 controls the power supply control unit 121 to also cut off the power to the sound direction detection unit 204. The process then returns to step S606 of FIG. 6.

Now suppose that the central control unit 111 has received, from the voice command recognition unit 203, information indicating that a voice command has been recognized. In this case, the process proceeds from step S701 to step S706.

Before executing a job corresponding to the recognized voice command, the central control unit 111 in the present embodiment performs processing to bring the person who uttered the voice command into the field of view of the image pickup unit 102 of the movable image pickup unit 100. The central control unit 111 then executes the job based on the recognized voice command while the person is within the field of view of the image pickup unit 102.

To realize the above, in step S706 the central control unit 111 acquires, from the buffer memory 204a of the sound direction detection unit 204, the sound direction information synchronized with the voice command recognized by the voice command recognition unit 203. When the voice command recognition unit 203 recognizes a voice command, it notifies the central control unit 111 of two addresses representing the start and the end of the voice command in the voice memory 202. The central control unit 111 therefore acquires from the buffer memory 204a the sound direction information detected within the period indicated by these two addresses. A plurality of pieces of sound direction information may exist within that period. In that case, the central control unit 111 acquires the temporally latest of them from the buffer memory 204a, because later sound direction information is more likely to represent the current position of the person who uttered the voice command.
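The lookup in step S706 can be sketched as below. The record layout and names are hypothetical, and the sketch assumes that voice memory addresses increase monotonically with time (no wraparound), which the source does not state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DirectionEntry:
    """One hypothetical record in the buffer memory 204a."""
    address: int                  # voice-memory address of the audio used for detection
    angle_deg: Optional[float]    # horizontal angle relative to the reference angle
    directly_above: bool = False  # set when the source is directly above the apparatus


def direction_for_command(
    buffer: List[DirectionEntry], start_addr: int, end_addr: int
) -> Optional[DirectionEntry]:
    """Return the latest entry detected between the command's two addresses.

    Assumes a larger address means a later time; returns None when no
    direction was detected during the command.
    """
    in_range = [e for e in buffer if start_addr <= e.address <= end_addr]
    if not in_range:
        return None
    return max(in_range, key=lambda e: e.address)
```

Picking the entry with the largest address mirrors the stated rationale that the latest detection most likely reflects where the speaker is now.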

In step S707, the central control unit 111 determines whether the sound source direction represented by the acquired sound direction information is directly above the image pickup apparatus 130. The details of how it is determined whether the sound source is directly above the image pickup apparatus 130 will be described later. If the central control unit 111 determines that the sound source direction is directly above the image pickup apparatus 130, the process proceeds to step S708. If it determines that the sound source direction is not directly above, the process proceeds to step S710.

In step S708, the central control unit 111 controls the rotation control unit 123 to rotate the second housing 401 of the movable image pickup unit 100 so that the imaging direction of the lens unit 101 and the image pickup unit 102 points directly upward, as shown in FIG. 5(c). The imaging direction of the image pickup unit 102 thus becomes the directly upward direction.

In step S709, the central control unit 111 receives a captured image from the video signal processing unit 113 and determines whether an object that is the source of the voice (a person's face) is present in the captured image. If the central control unit 111 determines that such an object is present in the captured image, the process proceeds to step S714. If it determines that no such object is present, the process returns to step S701.

In step S710, the central control unit 111 controls the rotation control unit 123 to pan the movable image pickup unit 100 so that the current horizontal-plane angle of the image pickup unit 102 matches the horizontal-plane angle indicated by the sound direction information.

In step S711, the central control unit 111 receives a captured image from the video signal processing unit 113 and determines whether an object that is the source of the voice (a face) is present in the captured image. If the central control unit 111 determines that such an object is present in the captured image, the process proceeds to step S714. If it determines that no such object is present, the process proceeds to step S712.

In step S714, the central control unit 111 executes the job corresponding to the voice command that has already been recognized. The details of step S714 will be described later with reference to FIG. 8.

In step S712, the central control unit 111 controls the rotation control unit 123 to tilt the movable image pickup unit 100 toward the target object.

In step S713, the central control unit 111 determines whether the tilt angle of the imaging direction of the image pickup unit 102 has reached the upper limit of the tilt operation (90 degrees with respect to the horizontal direction in the present embodiment). If the central control unit 111 determines that the tilt angle has reached the upper limit, the process returns to step S701. If it determines that the tilt angle has not reached the upper limit, the process returns to step S711.
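The search performed in steps S707 to S713 can be modeled roughly as follows. This is a simulation sketch under several assumptions: face detection is replaced by a caller-supplied predicate `face_visible(pan, tilt)`, tilting proceeds upward in fixed steps (the source only says the unit tilts toward the target object), and the step size and starting tilt are hypothetical parameters.

```python
from typing import Callable, Optional, Tuple

TILT_MAX_DEG = 90.0  # upper limit of the tilt operation in the embodiment


def point_at_speaker(
    directly_above: bool,
    pan_deg: float,
    face_visible: Callable[[float, float], bool],
    tilt_step_deg: float = 10.0,   # assumed step size
    start_tilt_deg: float = 0.0,   # assumed starting tilt
) -> Optional[Tuple[float, float]]:
    """Return the (pan, tilt) at which a face was found, or None.

    None corresponds to returning to step S701 without running a job.
    """
    if directly_above:
        # S708 and S709: point straight up and check once.
        if face_visible(pan_deg, TILT_MAX_DEG):
            return (pan_deg, TILT_MAX_DEG)
        return None
    tilt = start_tilt_deg  # S710: pan to the detected horizontal angle
    while True:
        if face_visible(pan_deg, tilt):                 # S711
            return (pan_deg, tilt)                      # proceed to S714
        tilt = min(tilt + tilt_step_deg, TILT_MAX_DEG)  # S712
        if tilt >= TILT_MAX_DEG:                        # S713: limit reached
            return None
```

As in the flowchart, reaching the tilt limit ends the search without a further face check, so a speaker who can only be seen at exactly 90 degrees is found through the directly-above branch rather than this loop.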

FIG. 8 is a flowchart showing the details of the processing of step S714 of FIG. 7. The voice pattern data corresponding to voice commands such as "Hi, Camera" shown in the voice command table of FIG. 3 is stored in the command memory 206. FIG. 3 shows representative voice commands; the voice commands are not limited to these. Note also that the voice command referred to in the following description is the voice command detected at the timing of step S701 of FIG. 7.

First, in step S801, the central control unit 111 determines whether the recognized voice command is the activation command. The activation command is a voice command for causing the image pickup apparatus 130 to transition to a state in which imaging is possible. It is the command evaluated in step S607 of FIG. 6, and it is not a command for executing a job related to imaging. Therefore, if the central control unit 111 determines that the recognized voice command is the activation command, it ignores the command, and the process returns to step S701 of FIG. 7. If the central control unit 111 determines that the recognized voice command is not the activation command, the process proceeds to step S802.

In step S802, the central control unit 111 determines whether the recognized voice command is the stop command. The stop command is a command for transitioning from the state in which imaging is possible to the state of waiting for input of the activation command. Accordingly, if the central control unit 111 determines that the recognized voice command is the stop command, the process proceeds to step S811. If it determines that the recognized voice command is not the stop command, the process proceeds to step S803.

In step S811, the central control unit 111 controls the power supply control unit 121 to cut off the power to the units that are already active, namely the image pickup unit 102, the sound direction detection unit 204, the voice command recognition unit 203, the moving image audio processing unit 205, the microphones 104b to 104d, and the like, thereby stopping them. The process then returns to step S603 of FIG. 6.

In step S803, the central control unit 111 determines whether the recognized voice command is the still image shooting command. The still image shooting command requests the image pickup apparatus 130 to execute a job of shooting and recording one still image. Accordingly, if the central control unit 111 determines that the recognized voice command is the still image shooting command, the process proceeds to step S812. If it determines that the recognized voice command is not the still image shooting command, the process proceeds to step S804.

In step S812, the central control unit 111 records one piece of still image data captured by the image pickup unit 102 in the storage unit 116, for example as a JPEG file. Here, the central control unit 111 also records the imaging direction in the storage unit 116 as a history of imaging directions. Because the job of the still image shooting command is completed by shooting and recording one still image, it is not a job subject to the determination in step S702 of FIG. 7 described above. The process then returns to step S701 of FIG. 7.

In step S804, the central control unit 111 determines whether the recognized voice command is the moving image shooting command. The moving image shooting command requests the image pickup apparatus 130 to capture and record a moving image. If the central control unit 111 determines that the recognized voice command is the moving image shooting command, the process proceeds to step S813. If it determines that the recognized voice command is not the moving image shooting command, the process proceeds to step S805.

In step S813, the central control unit 111 starts capturing and recording a moving image using the image pickup unit 102. The process then returns to step S701 of FIG. 7. Here, the central control unit 111 also records the imaging direction in the storage unit 116 as a history of imaging directions. In the present embodiment the captured moving image is stored in the storage unit 116, but it may instead be transmitted to a file server on a network via the external input/output terminal unit 118. Because the moving image shooting command causes the capture and recording of the moving image to continue, the job started by this command is a job subject to the determination in step S702 of FIG. 7 described above.

In step S805, the central control unit 111 determines whether the recognized voice command is the moving image shooting end command. If the central control unit 111 determines that the recognized voice command is the moving image shooting end command and that a moving image is currently being captured and recorded, the process proceeds to step S814. If it determines that the recognized voice command is not the moving image shooting end command, or that no moving image is currently being captured and recorded, the process proceeds to step S806.

In step S814, the central control unit 111 ends the capture and recording of the moving image (the job). The process then returns to step S701 of FIG. 7.

ステップＳ８０６では、中央制御部１１１は、認識した音声コマンドが追尾コマンドであるか否かを判定する。追尾コマンドは、撮像装置１３０に対して、撮像部１０２の撮像方向に、ユーザを継続して位置させることを要求するコマンドである。中央制御部１１１が、認識した音声コマンドが追尾コマンドであると判定した場合、処理はステップＳ８１５に進む。また、中央制御部１１１が、認識した音声コマンドが追尾コマンドでないと判定した場合、処理はステップＳ８０７に進む。 In step S806, the central control unit 111 determines whether or not the recognized voice command is a tracking command. The tracking command is a command that requests the image pickup apparatus 130 to continuously position the user in the image pickup direction of the image pickup unit 102. If the central control unit 111 determines that the recognized voice command is a tracking command, the process proceeds to step S815. If the central control unit 111 determines that the recognized voice command is not a tracking command, the process proceeds to step S807.

In step S815, the central control unit 111 starts controlling the rotation control unit 123 so that the object stays at the center of the image obtained by the video signal processing unit 113. The process then returns to step S701 in FIG. 7. As a result, the movable image pickup unit 100 pans or tilts to track the moving user. Although the image pickup apparatus 130 tracks the user, it does not record the captured images. While tracking is in progress, the tracking command job is one of the jobs checked in step S702 of FIG. 7 described above. The central control unit 111 ends the shooting and recording of this moving image only upon receiving the tracking end command. The central control unit 111 may also execute other jobs during tracking, for example a still image shooting command or a moving image shooting command.

ステップＳ８０７では、中央制御部１１１は、認識した音声コマンドが追尾終了コマンドであるか否かを判定する。中央制御部１１１が、認識した音声コマンドが追尾終了コマンドであり、且つ、現に追尾中であると判定した場合、処理はステップＳ８１６に進む。また、中央制御部１１１が、認識した音声コマンドが追尾終了コマンドでない、または、現に追尾中でないと判定した場合、処理はステップＳ８０８に進む。 In step S807, the central control unit 111 determines whether or not the recognized voice command is a tracking end command. If the central control unit 111 determines that the recognized voice command is a tracking end command and is actually tracking, the process proceeds to step S816. If the central control unit 111 determines that the recognized voice command is not a tracking end command or is not actually tracking, the process proceeds to step S808.

In step S816, the central control unit 111 controls the rotation control unit 123 to end the tracking (the job). The process then returns to step S701 in FIG. 7.

ステップＳ８０８では、中央制御部１１１は、認識した音声コマンドが自動動画撮影コマンドであるか否かを判定する。中央制御部１１１が、認識した音声コマンドが自動動画撮影コマンドであると判定した場合、処理はステップＳ８１７に進む。 In step S808, the central control unit 111 determines whether or not the recognized voice command is an automatic moving image shooting command. If the central control unit 111 determines that the recognized voice command is an automatic moving image shooting command, the process proceeds to step S817.

In step S817, the central control unit 111 starts shooting and recording a moving image with the imaging unit 102. The process then returns to step S701 in FIG. 7. The job executed by this automatic moving image shooting command differs from the job executed by the moving image shooting command described above in that, each time an utterance occurs, the moving image is shot and recorded while the imaging direction of the lens unit 101 is turned toward the sound source of that utterance. For example, in a meeting with a plurality of speakers, each time someone speaks, the central control unit 111 records the moving image while performing pan and tilt operations to bring that speaker within the angle of view of the lens unit 101. While this job is running, no voice command for ending the job is accepted; the job is ended by operating a predetermined switch provided on the operation unit 115. In addition, while this job is running, the central control unit 111 stops the voice command recognition unit 203. The central control unit 111 then pans and tilts the movable image pickup unit 100 by referring to the sound direction information detected by the sound direction detection unit 204 at the timing when the sound pressure level detection unit 201 detects a sound pressure level exceeding the threshold value.

Although not shown in FIG. 8, when the recognized voice command is an enlargement command, the central control unit 111 controls the lens actuator control unit 103 to increase the current zoom magnification by a preset value. When the recognized voice command is a reduction command, the central control unit 111 controls the lens actuator control unit 103 to decrease the current zoom magnification by a preset value. When the lens unit 101 is already at the telephoto end or the wide end, no enlargement or reduction beyond that end can be set, so the central control unit 111 ignores such a voice command.
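The zoom handling described above can be sketched as follows. This is only an illustration: the function name `apply_zoom` and its parameters are hypothetical, and clamping a partial step to the end of the range is an assumption, since the text only states that a command is ignored when the lens is already at the telephoto or wide end.

```python
def apply_zoom(current: float, step: float, wide_end: float, tele_end: float) -> float:
    """Return the new zoom magnification after an enlargement (step > 0)
    or reduction (step < 0) command."""
    # Already at the telephoto end: an enlargement command is ignored.
    if step > 0 and current >= tele_end:
        return current
    # Already at the wide end: a reduction command is ignored.
    if step < 0 and current <= wide_end:
        return current
    # Assumption: a step that would overshoot is clamped to the end of the range.
    return min(max(current + step, wide_end), tele_end)
```

For example, with a hypothetical range of 1x to 10x and a preset step of 1, a command issued at 10x leaves the magnification unchanged.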

This concludes the description of the commands above. Voice commands other than those described are handled in the steps following step S808, but their description is omitted here.
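The branching of steps S804 to S808 can be summarized in a short sketch. The command strings, the `CameraState` class, and the `dispatch` function are hypothetical names chosen for illustration; the sketch only mirrors the order of the checks and the state conditions described in the text, not the actual firmware.

```python
from dataclasses import dataclass


@dataclass
class CameraState:
    recording: bool = False   # a moving image is being captured and recorded
    tracking: bool = False    # a tracking job is in progress
    auto_movie: bool = False  # an automatic moving image job is in progress


def dispatch(command: str, state: CameraState) -> str:
    """Return the step that handles `command`, updating `state` accordingly."""
    if command == "movie_start":                      # S804 -> S813
        state.recording = True
        return "S813"
    if command == "movie_stop" and state.recording:   # S805 -> S814
        state.recording = False
        return "S814"
    if command == "track_start":                      # S806 -> S815
        state.tracking = True
        return "S815"
    if command == "track_stop" and state.tracking:    # S807 -> S816
        state.tracking = False
        return "S816"
    if command == "auto_movie":                       # S808 -> S817
        state.recording = True
        state.auto_movie = True
        return "S817"
    return "other"  # handled in later steps that the text does not describe
```

Note that an end command received while the corresponding job is not running falls through to the later checks, as in steps S805 and S807.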

図９は、本実施形態の撮像装置１３０におけるメイン電源オンからの処理のシーケンスの一例を示すタイミングチャートである。撮像装置１３０のメイン電源がオンになると、音圧レベル検出部２０１は、マイク１０４ａからの音声データの音圧レベルの検出処理を開始する。 FIG. 9 is a timing chart showing an example of the processing sequence from the main power on in the image pickup apparatus 130 of the present embodiment. When the main power of the image pickup apparatus 130 is turned on, the sound pressure level detection unit 201 starts the sound pressure level detection process of the audio data from the microphone 104a.

タイミングＴ９０１では、ユーザは、起動コマンド“Hi,Camera”の発声を開始したとする。この結果、音圧レベル検出部２０１が閾値を超える音圧を検出する。そして、これがトリガになって、タイミングＴ９０２では、音声用メモリ２０２は、マイク１０４ａからの音声データの格納を開始し、音声コマンド認識部２０３は、音声コマンドの認識を開始する。 At the timing T901, it is assumed that the user has started to utter the activation command "Hi, Camera". As a result, the sound pressure level detection unit 201 detects the sound pressure exceeding the threshold value. Then, with this as a trigger, at the timing T902, the voice memory 202 starts storing the voice data from the microphone 104a, and the voice command recognition unit 203 starts recognizing the voice command.

ユーザが起動コマンド“Hi,Camera”の発声を終えると、タイミングＴ９０３では、音声コマンド認識部２０３は、その音声コマンドを認識し、且つ、認識した音声コマンドが起動コマンドであることを特定する。 When the user finishes uttering the activation command "Hi, Camera", in the timing T903, the voice command recognition unit 203 recognizes the voice command and specifies that the recognized voice command is the activation command.

中央制御部１１１は、この起動コマンドが認識されたことをトリガにして、タイミングＴ９０４では、音方向検出部２０４に電力供給を開始する。また、タイミングＴ９０５では、中央制御部１１１は、撮像部１０２への電力供給も開始する。 The central control unit 111 starts supplying power to the sound direction detection unit 204 at the timing T904, triggered by the recognition of this activation command. Further, at the timing T905, the central control unit 111 also starts supplying power to the image pickup unit 102.

At the timing T906, it is assumed that the user starts to utter, for example, "Movie start". In this case, from the timing T907 onward, the voice memory 202 sequentially stores the voice data, beginning with the data at the start of the utterance.

タイミングＴ９０８では、音声コマンド認識部２０３は、音声データを“Movie start”を表す音声コマンドとして認識する。音声コマンド認識部２０３は、音声用メモリ２０２内の“Movie start”を表す音声データの先頭と終端のアドレスと、認識結果を中央制御部１１１に通知する。中央制御部１１１は、受信した先頭と終端のアドレスが表す範囲を有効範囲として決定する。そして、中央制御部１１１は、音方向検出部２０４のバッファメモリ２０４ａ内の、有効範囲内から、最新の音方向情報を抽出する。 At the timing T908, the voice command recognition unit 203 recognizes the voice data as a voice command representing “Movie start”. The voice command recognition unit 203 notifies the central control unit 111 of the start and end addresses of the voice data representing “Movie start” in the voice memory 202 and the recognition result. The central control unit 111 determines the range represented by the received start and end addresses as the effective range. Then, the central control unit 111 extracts the latest sound direction information from the effective range in the buffer memory 204a of the sound direction detection unit 204.
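The extraction of the latest sound direction within the effective range can be sketched as below. The representation is an assumption made for illustration: each entry of the buffer memory 204a is modeled as a pair of a voice-memory address and a direction in degrees, and `latest_direction_in_range` is a hypothetical helper name.

```python
from typing import Optional, Sequence, Tuple


def latest_direction_in_range(
    entries: Sequence[Tuple[int, float]], start: int, end: int
) -> Optional[float]:
    """Return the direction of the newest entry whose address lies in [start, end].

    `entries` holds (address, direction_deg) pairs; None is returned when no
    entry falls inside the effective range.
    """
    in_range = [(addr, direction) for addr, direction in entries if start <= addr <= end]
    if not in_range:
        return None
    # The newest entry is the one with the largest address inside the range.
    return max(in_range, key=lambda entry: entry[0])[1]
```

Restricting the lookup to the effective range means that sounds arriving before or after the command utterance do not influence the pan and tilt target.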

タイミングＴ９０９では、中央制御部１１１は、その抽出した音方向情報に基づいて、回動制御部１２３を制御して、可動撮像部１００のパン動作およびチルト動作を開始する。 At the timing T909, the central control unit 111 controls the rotation control unit 123 based on the extracted sound direction information to start the pan operation and the tilt operation of the movable image pickup unit 100.

タイミングＴ９１２では、撮像信号処理部１１２は、可動撮像部１００のパン動作およびチルト動作中に、撮像部１０２を用いて生成された画像に被写体（オブジェクト；顔）を検出する。すると、タイミングＴ９１３では、中央制御部１１１は、パン動作およびチルト動作を停止する。 At the timing T912, the image pickup signal processing unit 112 detects a subject (object; face) in the image generated by the image pickup unit 102 during the pan operation and the tilt operation of the movable image pickup unit 100. Then, at the timing T913, the central control unit 111 stops the pan operation and the tilt operation.

At the timing T914, the central control unit 111 supplies power to the moving-image audio processing unit 205 so that stereo sound can be picked up by the microphones 104a and 104b. At the timing T915, the central control unit 111 starts capturing and recording a moving image with audio.

Next, the sound source direction detection process performed by the sound direction detection unit 204 in the present embodiment will be described. This process is performed periodically and continuously from step S610 in FIG. 6 onward.

First, simple sound direction detection using the two microphones 104a and 104b will be described with reference to FIG. 10(a). In FIG. 10(a), the microphone 104a and the microphone 104b are assumed to be arranged on a plane (a plane perpendicular to the rotation axis of the pan operation). The distance between the microphone 104a and the microphone 104b is denoted d[a-b]. The distance between the image pickup apparatus 130 and the sound source is assumed to be sufficiently large relative to d[a-b]. In this case, the arrival delay time between the two microphones can be determined by comparing the sound picked up by the microphone 104a with that picked up by the microphone 104b.

The sound direction detection unit 204 can determine the distance I[a-b] by multiplying the arrival delay time by the speed of sound (about 340 m/s in air). As a result, the sound direction detection unit 204 can determine the sound source direction angle θ[a-b] by the following equation. In other words, at the timing when a sound is generated, the sound direction detection unit 204 can detect the angle θ[a-b] of the direction of the sound source from the time difference between the sounds input to the microphones 104a and 104b.

θ[a-b] = acos(I[a-b] / d[a-b])
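The equation above can be evaluated with a few lines of code. This is a minimal sketch: the function name `source_angle_deg` and the 5 cm spacing in the usage note are assumptions, and the clamp on the ratio is added only to guard against measurement noise pushing the value slightly outside the domain of acos.

```python
import math

SPEED_OF_SOUND_M_S = 340.0  # approximate speed of sound in air, as stated in the text


def source_angle_deg(delay_s: float, mic_distance_m: float) -> float:
    """Return theta[a-b] in degrees from the arrival delay and the spacing d[a-b]."""
    path_difference_m = delay_s * SPEED_OF_SOUND_M_S  # I[a-b]
    ratio = path_difference_m / mic_distance_m        # I[a-b] / d[a-b]
    ratio = max(-1.0, min(1.0, ratio))                # keep acos in its valid domain
    return math.degrees(math.acos(ratio))
```

With a hypothetical spacing of 0.05 m, a delay of zero gives 90 degrees (the source is broadside to the pair), and a delay equal to 0.05 / 340 s gives 0 degrees (the source lies on the line through the two microphones).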

However, a sound direction obtained with only the two microphones 104a and 104b cannot distinguish between θ[a-b] and θ[a-b]' (FIG. 10(a)). That is, the sound direction detection unit 204 cannot determine which of the two directions is the actual sound source direction.

そこで、本実施形態における音源の方向の検出方法を以下、図１０（ｂ）と図１０（ｃ）を用いて説明する。具体的には、２つのマイクで推定できる音源方向は２つあるので、それらの２つの方向を仮方向として扱う。そして、更なる２つのマイクで音源の方向を求め、仮方向を２つ求める。そして、これらに共通している方向が、求める音源の方向として決定する。なお、図１０（ｂ）および図１０（ｃ）の上方向を可動撮像部１００の撮像方向とする。可動撮像部１００の撮像方向は、レンズ部１０１の光軸方向（主軸方向）とも言い換えられる。 Therefore, the method of detecting the direction of the sound source in the present embodiment will be described below with reference to FIGS. 10 (b) and 10 (c). Specifically, since there are two sound source directions that can be estimated by the two microphones, those two directions are treated as temporary directions. Then, the direction of the sound source is obtained with two more microphones, and two temporary directions are obtained. Then, the direction common to these is determined as the direction of the desired sound source. The upward direction of FIGS. 10 (b) and 10 (c) is taken as the image pickup direction of the movable image pickup unit 100. The imaging direction of the movable imaging unit 100 can also be rephrased as the optical axis direction (main axis direction) of the lens unit 101.

図１０（ｂ）は、３つのマイク１０４ａ～１０４ｃで行う音方向検出方法を示す。３つのマイク１０４ａ、マイク１０４ｂ、およびマイク１０４ｃを用いて説明する。図４（ａ）で示したような配置図であると、マイク１０４ａとマイク１０４ｂの並ぶ方向に直交する方向がレンズ部１０１の撮像方向となる。 FIG. 10B shows a sound direction detection method performed by the three microphones 104a to 104c. Three microphones 104a, 104b, and 104c will be described. In the layout diagram as shown in FIG. 4A, the direction orthogonal to the line-up direction of the microphone 104a and the microphone 104b is the imaging direction of the lens unit 101.

As described with reference to FIG. 10(a), the distance d[a-b] between the microphone 104a and the microphone 104b is known. If the sound direction detection unit 204 can determine the distance I[a-b] from the voice data, it can determine θ[a-b]. Since the distance d[a-c] between the microphone 104a and the microphone 104c is also known, the sound direction detection unit 204 can likewise determine the distance I[a-c] from the voice data and thus θ[a-c]. Once θ[a-b] and θ[a-c] have been calculated, the direction common to both, on the same two-dimensional plane as the arrangement of the microphones 104a, 104b, and 104c (a plane perpendicular to the rotation axis of the pan operation), can be determined as the accurate sound source direction.

A method of determining the sound source direction with the four microphones 104a to 104d will be described with reference to FIG. 10(c). With the arrangement of the microphones 104a, 104b, 104c, and 104d shown in FIG. 4(a), the direction orthogonal to the direction in which the microphone 104a and the microphone 104b are lined up is the imaging direction (optical axis direction) of the lens unit 101. When the four microphones 104a to 104d are used, the sound direction detection unit 204 can calculate the sound source direction accurately by using two diagonal pairs: the pair of the microphones 104a and 104d, and the pair of the microphones 104b and 104c.

Since the distance d[a-d] between the microphone 104a and the microphone 104d is known, the sound direction detection unit 204 can determine the distance I[a-d] from the voice data and can therefore determine θ[a-d].

Furthermore, since the distance d[b-c] between the microphone 104b and the microphone 104c is also known, the sound direction detection unit 204 can determine the distance I[b-c] from the voice data and can therefore determine θ[b-c].

Therefore, once θ[a-d] and θ[b-c] are known, the sound direction detection unit 204 can detect the accurate sound source direction on the same two-dimensional plane as the arrangement of the microphones 104a to 104d.
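One way to picture how two microphone pairs remove the two-way ambiguity is sketched below. This is an illustrative reconstruction, not the method claimed in the text: each pair is modeled by the in-plane angle of its axis and the measured angle theta relative to that axis, every pair yields two temporary directions (axis plus theta and axis minus theta), and the direction on which the two pairs agree most closely is returned. All function names are hypothetical.

```python
from typing import List


def candidates(axis_deg: float, theta_deg: float) -> List[float]:
    """Two temporary directions, mirror images of each other about the pair axis."""
    return [(axis_deg + theta_deg) % 360.0, (axis_deg - theta_deg) % 360.0]


def angular_gap(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two directions, in degrees."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def resolve_direction(axis1_deg: float, theta1_deg: float,
                      axis2_deg: float, theta2_deg: float) -> float:
    """Return the temporary direction of pair 1 that best matches one of pair 2."""
    best = min(
        ((c1, c2)
         for c1 in candidates(axis1_deg, theta1_deg)
         for c2 in candidates(axis2_deg, theta2_deg)),
        key=lambda pair: angular_gap(pair[0], pair[1]),
    )
    return best[0]
```

The sketch relies on the two pair axes being non-parallel; with parallel axes both temporary directions would match equally well and the ambiguity would remain.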

Furthermore, the sound direction detection unit 204 can improve the angular accuracy of the direction detection by adding further detection angles such as θ[a-b] and θ[c-d].

以上のような処理を行うため、マイク１０４ａとマイク１０４ｂとマイク１０４ｃおよびマイク１０４ｄは、図４（ａ）のように、長方形の４つの頂点に配置した。なお、マイクの数が３つであっても、それらが直線状に並ばないのであれば、必ずしも４つである必要はない。 In order to perform the above processing, the microphone 104a, the microphone 104b, the microphone 104c, and the microphone 104d are arranged at the four vertices of the rectangle as shown in FIG. 4A. Even if the number of microphones is three, it does not necessarily have to be four if they do not line up in a straight line.

上記の方法のデメリットとして、同一２次元平面上の音方向しか検知しかできない。そのため、音源が撮像装置１３０の真上に位置する場合には、その方向を検出できない。そこで、次に、音方向検出部２０４における、音源の存在する方向として真上であるか否かの判定原理を図１１（ａ）および図１１（ｂ）を参照して説明する。 As a demerit of the above method, only the sound direction on the same two-dimensional plane can be detected. Therefore, when the sound source is located directly above the image pickup apparatus 130, the direction cannot be detected. Therefore, next, the principle of determining whether or not the sound source is directly above the direction in which the sound source exists in the sound direction detection unit 204 will be described with reference to FIGS. 11 (a) and 11 (b).

図１１（ａ）は、３つのマイクで行う音方向検出方法を説明するための図である。マイク１０４ａ、マイク１０４ｂ、マイク１０４ｃを用いて説明する。図４（ａ）で示したような配置図であると、マイク１０４ａとマイク１０４ｂの並び方向に直交する方向がレンズ部１０１の撮像方向（光軸方向）である。マイク１０４ａとマイク１０４ｂの並び方向とは、マイク１０４ａの中心点とマイク１０４ｂの中心点とを結ぶ直線の方向である。 FIG. 11A is a diagram for explaining a sound direction detection method performed by three microphones. A microphone 104a, a microphone 104b, and a microphone 104c will be described. In the layout diagram as shown in FIG. 4A, the direction orthogonal to the arrangement direction of the microphone 104a and the microphone 104b is the imaging direction (optical axis direction) of the lens unit 101. The arrangement direction of the microphone 104a and the microphone 104b is the direction of a straight line connecting the center point of the microphone 104a and the center point of the microphone 104b.

The following describes the case where sound arrives at the microphones 104a, 104b, and 104c along a straight line perpendicular to the plane on which the voice input unit 104 is arranged, that is, from directly above.

ここで、撮像装置１３０の真上に音源が位置する場合、その音源からマイク１０４ａとマイク１０４ｂは等距離にあるとみなせる。つまり、音源からこれら２つのマイク１０４ａと１０４ｂに到達する音の時間差は無い。そのため、マイク１０４ａとマイク１０４ｂを結ぶ直線に対して、垂直に交わる方向に音源があると認識される。 Here, when the sound source is located directly above the image pickup apparatus 130, it can be considered that the microphone 104a and the microphone 104b are equidistant from the sound source. That is, there is no time difference between the sounds reaching these two microphones 104a and 104b from the sound source. Therefore, it is recognized that the sound source is in the direction perpendicular to the straight line connecting the microphone 104a and the microphone 104b.

さらに、マイク１０４ａとマイク１０４ｃも同様に音源からは等距離にあるとみなせるので、やはり音源からこれら２つのマイク１０４ａと１０４ｃに到達する音の時間差は無い。そのため、マイク１０４ａとマイク１０４ｃを結ぶ直線に対して、垂直に交わる方向に音源があると認識される。 Further, since the microphones 104a and 104c can also be regarded as equidistant from the sound source, there is no time difference between the sounds reaching these two microphones 104a and 104c from the sound source. Therefore, it is recognized that the sound source is in the direction perpendicular to the straight line connecting the microphone 104a and the microphone 104c.

Here, the absolute value of the time difference between the sounds detected by the microphone 104a and the microphone 104b is denoted ΔT1, and the absolute value of the time difference between the sounds detected by the microphone 104a and the microphone 104c is denoted ΔT2. When these values satisfy the following condition with respect to a preset, sufficiently small threshold value ε, the sound direction detection unit 204 can determine that the sound source is located directly above the image pickup apparatus 130.

Condition: ΔT1 < ε and ΔT2 < ε

図１１（ｂ）を参照し、４つのマイク１０４ａ、マイク１０４ｂ、マイク１０４ｃ、マイク１０４ｄを用いた、撮像装置１３０の真上に位置する音源の検出法を説明する。図４（ａ）に示すように、マイク１０４ａとマイク１０４ｄのペアと、マイク１０４ｂとマイク１０４ｃのペアについて考察する。 A method of detecting a sound source located directly above the image pickup apparatus 130 using four microphones 104a, a microphone 104b, a microphone 104c, and a microphone 104d will be described with reference to FIG. 11B. As shown in FIG. 4A, a pair of microphone 104a and microphone 104d and a pair of microphone 104b and microphone 104c will be considered.

When a sound source exists directly above the image pickup apparatus 130, the microphone 104a and the microphone 104d are equidistant from the sound source, so the absolute value ΔT3 of the time difference between the sounds detected by the microphone 104a and the microphone 104d is zero or a very small value. That is, the sound source is recognized as lying in a direction perpendicular to the straight line connecting the microphone 104a and the microphone 104d.

Furthermore, since the microphone 104b and the microphone 104c are also equidistant from the sound source, the absolute value ΔT4 of the time difference between the sounds detected by the microphone 104b and the microphone 104c is also zero or a very small value. That is, the sound source is recognized as lying in a direction perpendicular to the straight line connecting the microphone 104b and the microphone 104c. Therefore, when the following condition is satisfied, the sound direction detection unit 204 can determine that the sound source is located directly above the image pickup apparatus 130.

Condition: ΔT3 < ε and ΔT4 < ε

以上のように、音方向検出部２０４は、３つ以上のマイクのうちの２つのペアについて、音の到達時間差の絶対値を求め、それらの２つの絶対値が共に十分に小さい閾値未満となった場合に、音源の存在方向が撮像装置１３０の真上であると決定できる。なお、２つのペアを決めるとき、それらの２つのペアの向きが互いに非平行となるように決定すれば、２つのペアはどのような組み合わせでもよい。 As described above, the sound direction detection unit 204 obtains the absolute value of the sound arrival time difference for two pairs of three or more microphones, and both of these two absolute values are less than a sufficiently small threshold value. In this case, it can be determined that the direction in which the sound source exists is directly above the image pickup apparatus 130. When determining two pairs, the two pairs may be in any combination as long as the orientations of the two pairs are determined to be non-parallel to each other.
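The directly-above test reduces to two threshold comparisons, which can be sketched as follows. The function name `is_directly_above` and the threshold value used in the usage note are assumptions; the text does not give a concrete value for ε.

```python
def is_directly_above(time_diff_pair1_s: float, time_diff_pair2_s: float,
                      epsilon_s: float) -> bool:
    """Return True when both non-parallel microphone pairs see (almost) no
    arrival time difference, i.e. both absolute values are below epsilon."""
    delta_1 = abs(time_diff_pair1_s)  # e.g. delta-T1 or delta-T3 in the text
    delta_2 = abs(time_diff_pair2_s)  # e.g. delta-T2 or delta-T4 in the text
    return delta_1 < epsilon_s and delta_2 < epsilon_s
```

With a hypothetical threshold of 20 microseconds, signed time differences of 5 and -8 microseconds satisfy the condition, whereas 5 and 120 microseconds do not, because the second pair still sees a clear arrival delay.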

ここで、本実施形態では、撮像装置１３０は、より音声コマンドを認識しやすくするためにスマートスピーカーと連携する。 Here, in the present embodiment, the image pickup apparatus 130 cooperates with the smart speaker in order to make it easier to recognize the voice command.

図１２は、本実施形態に係るスマートスピーカー１２００のブロック構成図である。スマートスピーカー１２００は、中央制御部１２０１、音声入力部１２０２、音声信号処理部１２０３、無線通信部１２０４、操作部１２０５、音声再生部１２０６、電源部１２０７、および、電源制御部１２０８を有する。スマートスピーカー１２００は、外部装置の一例である。 FIG. 12 is a block configuration diagram of the smart speaker 1200 according to the present embodiment. The smart speaker 1200 has a central control unit 1201, a voice input unit 1202, a voice signal processing unit 1203, a wireless communication unit 1204, an operation unit 1205, a voice reproduction unit 1206, a power supply unit 1207, and a power supply control unit 1208. The smart speaker 1200 is an example of an external device.

Like the voice input unit 104 of FIG. 2, the voice input unit 1202 has four omnidirectional microphones 1202a, 1202b, 1202c, and 1202d. Each of the microphones 1202a to 1202d has a built-in A/D converter. Each microphone picks up sound at a preset sampling rate (voice command detection and sound direction detection processing: 16 kHz; moving image recording: 48 kHz), and the built-in A/D converter outputs the picked-up sound signal as digital voice data. Although the voice input unit 1202 is described as having four digital-output microphones 1202a to 1202d, it may instead have analog-output microphones. In the case of analog microphones, corresponding A/D converters may be provided in the voice signal processing unit 1203.

音声信号処理部１２０３は、図２の音声信号処理部１１４と同様に、音圧レベル検出部１２１１、音声用メモリ１２１２、音声コマンド認識部１２１３、および、音方向検出部１２１４を有する。 The voice signal processing unit 1203 has a sound pressure level detection unit 1211, a voice memory 1212, a voice command recognition unit 1213, and a sound direction detection unit 1214, similarly to the voice signal processing unit 114 of FIG.

音圧レベル検出部１２１１は、マイク１２０２ａから出力された音声データの音圧レベルが予め設定された閾値を超えるとき、音声検出を表す信号を電源制御部１２０８および音声用メモリ１２１２に供給する。 When the sound pressure level of the sound data output from the microphone 1202a exceeds a preset threshold value, the sound pressure level detection unit 1211 supplies a signal indicating voice detection to the power supply control unit 1208 and the voice memory 1212.

電源制御部１２０８は、図２の電源制御部１２１と同様に、音圧レベル検出部１２１１から音声検出を表す信号を受信した場合、音声コマンド認識部１２１３への電力供給を行う。 Similar to the power supply control unit 121 of FIG. 2, the power supply control unit 1208 supplies power to the voice command recognition unit 1213 when it receives a signal indicating voice detection from the sound pressure level detection unit 1211.

音声用メモリ１２１２は、中央制御部１２０１の制御下での電源制御部１２０８による電力供給／遮断の対象の１つである。音声用メモリ１２１２は、あらかじめ起動コマンドの音声データを記憶している。また、音声用メモリ１２１２は、マイク１２０２ａから出力された音声データを一時的に記憶するバッファメモリである。マイク１２０２ａは、サンプリングレートが１６ｋＨｚであり、１サンプリングにつき２バイト（１６ビット）の音声データを出力する。最長の音声コマンドが仮に５秒であった場合、音声用メモリ１２１２は、約１６０キロバイト（≒５×１６×１０００×２）の容量を有する。また、音声用メモリ１２１２は、マイク１２０２ａからの音声データで満たされた場合、古い音声データが新たな音声データで上書きされる。この結果、音声用メモリ１２１２には、直近の所定期間（上記例では約５秒）の音声データが保持される。また、音声用メモリ１２１２は、音圧レベル検出部１２１１から音声検出を示す信号を受信したことをトリガにして、マイク１２０２ａからの音声データをサンプリングデータ領域に格納していく。 The voice memory 1212 is one of the targets of power supply / cutoff by the power supply control unit 1208 under the control of the central control unit 1201. The voice memory 1212 stores the voice data of the activation command in advance. Further, the voice memory 1212 is a buffer memory for temporarily storing the voice data output from the microphone 1202a. The microphone 1202a has a sampling rate of 16 kHz and outputs 2 bytes (16 bits) of audio data per sampling. If the longest voice command is 5 seconds, the voice memory 1212 has a capacity of about 160 kilobytes (≈5 × 16 × 1000 × 2). Further, when the voice memory 1212 is filled with the voice data from the microphone 1202a, the old voice data is overwritten with the new voice data. As a result, the voice memory 1212 holds the voice data for the latest predetermined period (about 5 seconds in the above example). Further, the voice memory 1212 stores the voice data from the microphone 1202a in the sampling data area, triggered by receiving a signal indicating voice detection from the sound pressure level detection unit 1211.
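The capacity figure and the overwrite behavior of the voice memory can be illustrated with a small ring buffer. This is a sketch under stated assumptions: the class `RingBuffer` and its methods are hypothetical, and the actual memory layout of the voice memory 1212 is not described in the text.

```python
SAMPLE_RATE_HZ = 16_000  # sampling rate of the microphone 1202a
BYTES_PER_SAMPLE = 2     # 16-bit samples
LONGEST_COMMAND_S = 5    # assumed longest voice command, as in the text


def buffer_capacity_bytes(seconds: int, rate_hz: int, bytes_per_sample: int) -> int:
    """Bytes needed to hold `seconds` of audio."""
    return seconds * rate_hz * bytes_per_sample


class RingBuffer:
    """Fixed-size buffer in which new data overwrites the oldest data."""

    def __init__(self, capacity: int) -> None:
        self._data = bytearray(capacity)
        self._write = 0   # index of the next byte to be written
        self._filled = 0  # number of valid bytes currently stored

    def write(self, chunk: bytes) -> None:
        for byte in chunk:
            self._data[self._write] = byte
            self._write = (self._write + 1) % len(self._data)
        self._filled = min(self._filled + len(chunk), len(self._data))

    def snapshot(self) -> bytes:
        """Return the stored bytes ordered from oldest to newest."""
        if self._filled < len(self._data):
            return bytes(self._data[:self._filled])
        # Once full, the oldest byte sits at the write index.
        return bytes(self._data[self._write:] + self._data[:self._write])
```

Calling `buffer_capacity_bytes(LONGEST_COMMAND_S, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE)` reproduces the 160,000 bytes (about 160 kilobytes) stated in the text, and writing more data than the capacity keeps only the most recent portion.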

The voice command recognition unit 1213 is one of the targets of power supply/cutoff by the power supply control unit 1208 under the control of the central control unit 1201. The voice command recognition unit 1213 compares the activation command stored in the voice memory 1212 with the voice data stored through the microphone 1202a, and determines whether or not the voice data contains the words of the activation command. When the voice command recognition unit 1213 determines that the voice data contains the words of the activation command, it shifts to a state of recognizing operation commands. The voice command recognition unit 1213 transmits the voice data that is sequentially stored in the voice memory 1212 through the microphone 1202a to the cloud 1220 via the wireless communication unit 1204. The cloud 1220 recognizes whether or not the user's voice included in the voice data is an instruction to the image pickup apparatus 130. When the user's voice included in the voice data is an instruction to the image pickup apparatus 130, the cloud 1220 determines whether or not the voice data matches any of the operation commands of FIG. 3 stored in a memory (not shown). The voice command recognition unit 1213 then receives the determination result from the cloud 1220 and supplies it to the central control unit 1201. The determination result consists of information indicating which voice command was recognized, together with the addresses of the first and last voice data that determined that voice command (or the timing at which the voice command was received). In accordance with the determination result, the central control unit 1201 instructs the image pickup apparatus 130, via the wireless communication unit 1204, to perform the operation corresponding to the input operation command.

音方向検出部１２１４は、中央制御部１２０１の制御下での電源制御部１２０８による電力供給／遮断の対象の１つである。また、音方向検出部１２１４は、４つのマイク１２０２ａ～１２０２ｄからの音声データに基づき、周期的に音源が存在する方向の検出処理を行う。音方向検出部１２１４は、内部にバッファメモリを有し、検出した音源方向を表す情報をバッファメモリに格納する。なお、音方向検出部１２１４による音方向検出処理を行う周期（例えば１６ｋＨｚ）は、マイク１２０２ａのサンプリング周期に対して十分に長くて構わない。 The sound direction detection unit 1214 is one of the targets of power supply / cutoff by the power supply control unit 1208 under the control of the central control unit 1201. Further, the sound direction detection unit 1214 periodically performs detection processing in the direction in which the sound source exists, based on the voice data from the four microphones 1202a to 1202d. The sound direction detection unit 1214 has a buffer memory inside, and stores information indicating the detected sound source direction in the buffer memory. The cycle (for example, 16 kHz) for performing the sound direction detection process by the sound direction detection unit 1214 may be sufficiently longer than the sampling cycle of the microphone 1202a.

なお、図１２では、消費電力や回路構成を考慮し、音声入力部１２０２の各マイク１２０２ａ～１２０２ｄと音声信号処理部１２０３に含まれる各ブロックとの接続は、４つのマイク１２０２ａ～１２０２ｄにおける必要最低限の接続を示す。しかし、電力および回路構成の許す限り、複数のマイク１２０２ａ～１２０２ｄを音声信号処理部１２０３に含まれる各ブロックで共有して使用しても構わない。また、本実施形態では、マイク１２０２ａを基準のマイクとして接続しているが、どのマイクを基準としても構わない。 Note that, in consideration of power consumption and circuit configuration, FIG. 12 shows only the minimum necessary connections between the microphones 1202a to 1202d of the audio input unit 1202 and the blocks included in the audio signal processing unit 1203. However, as long as the power and the circuit configuration allow, the microphones 1202a to 1202d may be shared among the blocks included in the audio signal processing unit 1203. Further, in the present embodiment, the microphone 1202a is connected as the reference microphone, but any microphone may be used as the reference.

中央制御部１２０１は、ＣＰＵと、ＣＰＵが実行するプログラムを格納したＲＯＭ、および、ＣＰＵのワークエリアとして使用されるＲＡＭを有する。この中央制御部１２０１は、スマートスピーカー１２００の全体の制御を行う。 The central control unit 1201 has a CPU, a ROM in which a program executed by the CPU is stored, and a RAM used as a work area of the CPU. The central control unit 1201 controls the entire smart speaker 1200.

操作部１２０５は、スマートスピーカー１２００とユーザとの間のユーザインターフェースとして機能するものであり、各種スイッチ、ボタン等を有する。音声再生部１２０６は、スピーカーを含み、音声データまたは音楽データを電気信号に変換し、音声を再生する。電源部１２０７は、スマートスピーカー１２００の全体（各要素）の駆動に必要な電力供給源であり、本実施形態では充電可能なバッテリである。 The operation unit 1205 functions as a user interface between the smart speaker 1200 and the user, and has various switches, buttons, and the like. The audio reproduction unit 1206 includes a speaker, converts audio data or music data into an electric signal, and reproduces audio. The power supply unit 1207 is a power supply source necessary for driving the entire smart speaker 1200 (each element), and is a rechargeable battery in the present embodiment.

電源制御部１２０８は、スマートスピーカー１２００の状態に応じて、上記の各構成要素への電源部１２０７からの電力の供給／遮断を制御する。スマートスピーカー１２００の状態によっては、不使用の構成要素が存在する。電源制御部１２０８は、中央制御部１２０１の制御下で、スマートスピーカー１２００の状態によって不使用の構成要素への電力を遮断して、電力消費量を抑制する機能を果たす。 The power supply control unit 1208 controls the supply / cutoff of power from the power supply unit 1207 to each of the above-mentioned components according to the state of the smart speaker 1200. Depending on the state of the smart speaker 1200, there are unused components. Under the control of the central control unit 1201, the power supply control unit 1208 functions to cut off power to unused components depending on the state of the smart speaker 1200 to suppress power consumption.

無線通信部１２０４は、図１の無線通信部１２４と同様に、ＷｉＦｉやＢＬＥなどの無線規格に準拠してデータ送受信を行う。無線通信部１２０４は、ストリーミング再生のために音楽データを受信したりする他、撮像装置１３０に対して電源オン／オフ制御や動作開始／停止などの各種制御を行う。 Similar to the wireless communication unit 124 in FIG. 1, the wireless communication unit 1204 transmits / receives data in accordance with wireless standards such as WiFi and BLE. The wireless communication unit 1204 receives music data for streaming reproduction, and also performs various controls such as power on / off control and operation start / stop for the image pickup apparatus 130.

図１３を用いて、本実施形態における、スマートスピーカー１２００と撮像装置１３０を用いた音源方向の検出方法について説明する。図１３は、撮像装置１３０とスマートスピーカー１２００の制御方法を示すシーケンス図である。ステップＳ１３０１～Ｓ１３０４は、スマートスピーカー１２００の処理である。ステップＳ１３１１～Ｓ１３１７は、撮像装置１３０の処理である。なお、本シーケンスの処理が開始される前に、撮像装置１３０の中央制御部１１１は、スマートスピーカー１２００と通信するよう無線通信部１２４を制御し、撮像装置１３０とスマートスピーカー１２００は無線通信の接続を確立する。ここで、撮像装置１３０はスマートスピーカー１２００に接続されている間、音声認識の処理を行わない。まず、ユーザが、撮像装置１３０とスマートスピーカー１２００に対して、「スマートスピーカー、撮像装置を使って撮影して」という音声を発したとする。 A method of detecting the sound source direction using the smart speaker 1200 and the image pickup apparatus 130 in the present embodiment will be described with reference to FIG. 13. FIG. 13 is a sequence diagram showing a control method of the image pickup apparatus 130 and the smart speaker 1200. Steps S1301 to S1304 are processes of the smart speaker 1200. Steps S1311 to S1317 are processes of the image pickup apparatus 130. Before the processing of this sequence is started, the central control unit 111 of the image pickup apparatus 130 controls the wireless communication unit 124 so as to communicate with the smart speaker 1200, and the image pickup apparatus 130 and the smart speaker 1200 establish a wireless communication connection. Here, the image pickup apparatus 130 does not perform voice recognition processing while it is connected to the smart speaker 1200. First, assume that the user says "Smart speaker, take a picture using the image pickup device" to the image pickup apparatus 130 and the smart speaker 1200.

ステップＳ１３０１では、スマートスピーカー１２００の音声用メモリ１２１２が、ユーザが発した音声を含む音声データを格納したことに応じて、音声コマンド認識部１２１３は、音声用メモリ１２１２に格納された音声データの認識処理を行う。音声コマンド認識部１２１３が起動コマンドと一致する音声コマンドを認識した場合、本ステップにおいて以下の処理が行われる。この場合、音声コマンド認識部１２１３は、その認識された音声コマンドを特定する情報と、音声用メモリ１２１２内の、認識した音声コマンドを決定づけた最初と最後の音声データのアドレス情報とを含む情報を中央制御部１２０１に通知する。なお、上記の音声データのアドレス情報は、音声コマンドを受け付けたタイミング情報でもよい。中央制御部１２０１は、音声コマンド認識部１２１３から起動コマンドが認識されたことが通知された場合、起動コマンドが認識されたと判定する。中央制御部１２０１は、音声コマンドが認識されたと判定された場合、ステップＳ１３０２の処理を行う。また、中央制御部１２０１が、音声コマンドが認識されていないと判定した場合、処理はステップＳ１３０１に戻る。起動コマンドは、スマートスピーカー１２００の識別コマンドであり、例えば、「スマートスピーカー」等である。 In step S1301, in response to the voice memory 1212 of the smart speaker 1200 storing voice data including the voice uttered by the user, the voice command recognition unit 1213 performs recognition processing on the voice data stored in the voice memory 1212. When the voice command recognition unit 1213 recognizes a voice command that matches the activation command, the following processing is performed in this step. In this case, the voice command recognition unit 1213 notifies the central control unit 1201 of information that includes information identifying the recognized voice command and the address information, in the voice memory 1212, of the first and last voice data that determined the recognized voice command. The address information of the voice data may instead be timing information indicating when the voice command was received. When notified by the voice command recognition unit 1213 that the activation command has been recognized, the central control unit 1201 determines that the activation command has been recognized. When the central control unit 1201 determines that the voice command has been recognized, it performs the process of step S1302. If the central control unit 1201 determines that the voice command has not been recognized, the process returns to step S1301. The activation command is an identification command for the smart speaker 1200, for example, "smart speaker".

並行して、ステップＳ１３１１では、撮像装置１３０の音方向検出部２０４は、４つのマイク１０４ａ～１０４ｄによって収音された同時刻の音声データに基づき、その音声データの音源の方向（音方向）の検出処理を行う。音源の方向の検出処理は、所定周期で行われる。本シーケンスでは、音方向検出部２０４は、ユーザから発せられた「撮像装置を使って撮影して」の音声の音源の方向を検出する。 In parallel, in step S1311, the sound direction detection unit 204 of the image pickup apparatus 130 detects the direction of the sound source (sound direction) of the voice data, based on the voice data picked up at the same time by the four microphones 104a to 104d. The sound source direction detection process is performed at a predetermined cycle. In this sequence, the sound direction detection unit 204 detects the direction of the sound source of the voice "take a picture using the image pickup device" uttered by the user.

ステップＳ１３０２では、スマートスピーカー１２００の中央制御部１２０１は、無線通信部１２０４を介して、撮像装置１３０に対して音方向を記録することを指示するための信号を送信し、処理はステップＳ１３０３に進む。 In step S1302, the central control unit 1201 of the smart speaker 1200 transmits a signal for instructing the image pickup apparatus 130 to record the sound direction via the wireless communication unit 1204, and the process proceeds to step S1303.

ステップＳ１３１２では、撮像装置１３０の中央制御部１１１は、無線通信部１２４を介して、スマートスピーカー１２００から音方向を記録することを指示するための信号を受信すると、ステップＳ１３１１で検出された音源の方向を記憶部１１６に記録する。本ステップの後、中央制御部１１１は、待機状態に遷移する。音方向を記録することを指示するための信号は、スマートスピーカー１２００の起動コマンドが認識されたことを示す信号でもある。この信号を受信したことに応じて、中央制御部１１１は、スマートスピーカー１２００の起動コマンドが認識されたときに収音された音声の音源の方向を、撮像部１０２によって撮像する方向として決定する。 In step S1312, when the central control unit 111 of the image pickup apparatus 130 receives, via the wireless communication unit 124, a signal from the smart speaker 1200 instructing it to record the sound direction, it records the direction of the sound source detected in step S1311 in the storage unit 116. After this step, the central control unit 111 transitions to the standby state. The signal instructing the recording of the sound direction is also a signal indicating that the activation command of the smart speaker 1200 has been recognized. In response to receiving this signal, the central control unit 111 determines the direction of the sound source of the voice picked up when the activation command of the smart speaker 1200 was recognized as the direction to be imaged by the image pickup unit 102.

なお、中央制御部１１１は、待機状態時に、音声または音を収音した場合、音方向検出部２０４により検出された音源の方向を記憶部１１６に記録する。また、中央制御部１１１は、撮像装置１３０が無線通信部１２４によってスマートスピーカー１２００に接続されている場合には、音声コマンド認識部２０３により音声認識しない。 When voice or other sound is picked up in the standby state, the central control unit 111 records the direction of the sound source detected by the sound direction detection unit 204 in the storage unit 116. Further, when the image pickup apparatus 130 is connected to the smart speaker 1200 by the wireless communication unit 124, the central control unit 111 does not perform voice recognition with the voice command recognition unit 203.

ステップＳ１３０３では、中央制御部１２０１は、マイク１２０２ａを通して音声用メモリ１２１２に逐次的に格納されてくる音声データを、無線通信部１２０４を通じてクラウド１２２０に送信する。本実施形態では、例えば、音声データは、「撮像装置１３０を使って静止画撮影して」というユーザの音声を含んでいる。この音声データには、撮像装置１３０に対する指示であること、および静止画撮影を指示する操作コマンドが含まれている。この場合、クラウド１２２０は音声データに含まれるユーザの音声から、撮像装置１３０に対する静止画撮影の指示であると判定する。そして、無線通信部１２０４は、クラウド１２２０からその判定結果を受信し、その判定結果を中央制御部１２０１に供給する。中央制御部１２０１は、受信した判定結果に基づいて、撮像装置１３０に対する操作コマンドを認識したか否かを判定する。中央制御部１２０１が、音声データに操作コマンドが含まれると判定した場合、処理はステップＳ１３０４に進む。また、中央制御部１２０１が、音声データに操作コマンドが含まれないと判定した場合、処理はステップＳ１３０１に戻る。 In step S1303, the central control unit 1201 transmits voice data sequentially stored in the voice memory 1212 through the microphone 1202a to the cloud 1220 through the wireless communication unit 1204. In the present embodiment, for example, the voice data includes a user's voice saying "take a still image using the image pickup device 130". This audio data includes an instruction to the image pickup apparatus 130 and an operation command instructing to shoot a still image. In this case, the cloud 1220 determines from the user's voice included in the voice data that it is an instruction to shoot a still image to the image pickup apparatus 130. Then, the wireless communication unit 1204 receives the determination result from the cloud 1220 and supplies the determination result to the central control unit 1201. The central control unit 1201 determines whether or not the operation command for the image pickup apparatus 130 is recognized based on the received determination result. If the central control unit 1201 determines that the voice data includes an operation command, the process proceeds to step S1304. If the central control unit 1201 determines that the voice data does not include an operation command, the process returns to step S1301.

ステップＳ１３０４では、中央制御部１２０１は、無線通信部１２０４を介して、撮像装置１３０に対して、ステップＳ１３０３で一致した操作コマンドに応じた操作コマンドを示す信号を送信する。本実施形態では、例えば、中央制御部１２０１は、静止画撮影の操作コマンドを示す信号を送信する。 In step S1304, the central control unit 1201 transmits a signal indicating an operation command corresponding to the operation command matched in step S1303 to the image pickup apparatus 130 via the wireless communication unit 1204. In the present embodiment, for example, the central control unit 1201 transmits a signal indicating an operation command for still image shooting.

ステップＳ１３１３では、中央制御部１１１は、無線通信部１２４を介して、スマートスピーカー１２００から操作コマンドを示す信号を受信する。本実施形態では、中央制御部１１１は、静止画撮影コマンドを示す信号を受信する。 In step S1313, the central control unit 111 receives a signal indicating an operation command from the smart speaker 1200 via the wireless communication unit 124. In the present embodiment, the central control unit 111 receives a signal indicating a still image shooting command.

ステップＳ１３１４では、中央制御部１１１は、ステップＳ１３１２の処理が実行されてからステップＳ１３１３の処理が実行されるまでに経過した時間が閾値Ｔ秒以上であるか否かを判定する。このような処理をする理由は図１４を用いて後述する。中央制御部１１１が、ステップＳ１３１２の処理が実行されてからステップＳ１３１３の処理が実行されるまでの経過時間が閾値Ｔ秒未満であると判定した場合、処理はステップＳ１３１６に進む。すなわち、中央制御部１１１が、ステップＳ１３１１で検出された音源の方向を記憶部１１６に記録してから所定時間以上経過せずに、スマートスピーカー１２００から操作コマンドを示す信号を受信した場合、処理はステップＳ１３１６に進む。この場合、ステップＳ１３１６では、中央制御部１１１は、ステップＳ１３１３において受信した操作コマンドに対応する処理を開始する。本実施形態では、本ステップにおいて、中央制御部１１１は、回動制御部１２３を制御し、レンズ部１０１および撮像部１０２の撮像方向がステップＳ１３１２で記録された音源の方向になるように、パン動作とチルト動作を制御する。その後、処理はステップＳ１３１７に進む。 In step S1314, the central control unit 111 determines whether or not the time elapsed from the execution of the process of step S1312 to the execution of the process of step S1313 is equal to or longer than the threshold value T seconds. The reason for this processing will be described later with reference to FIG. 14. When the central control unit 111 determines that the elapsed time from the execution of the process of step S1312 to the execution of the process of step S1313 is less than the threshold value T seconds, the process proceeds to step S1316. That is, when the central control unit 111 receives a signal indicating an operation command from the smart speaker 1200 before the predetermined time has elapsed since the direction of the sound source detected in step S1311 was recorded in the storage unit 116, the process proceeds to step S1316. In this case, in step S1316, the central control unit 111 starts the process corresponding to the operation command received in step S1313. In the present embodiment, in this step, the central control unit 111 controls the rotation control unit 123 to perform the pan operation and the tilt operation so that the image pickup direction of the lens unit 101 and the image pickup unit 102 becomes the direction of the sound source recorded in step S1312. After that, the process proceeds to step S1317.

他方、中央制御部１１１が、ステップＳ１３１２の処理からステップＳ１３１３の処理までの経過時間が閾値Ｔ秒以上であると判定した場合、処理はステップＳ１３１５に進む。すなわち、中央制御部１１１は、ステップＳ１３１１で検出された音源の方向を記憶部１１６に記録してから所定時間以上経過した後に、スマートスピーカー１２００から操作コマンドを示す信号を受信した場合、処理はステップＳ１３１５に進む。 On the other hand, when the central control unit 111 determines that the elapsed time from the process of step S1312 to the process of step S1313 is equal to or longer than the threshold value T seconds, the process proceeds to step S1315. That is, when the central control unit 111 receives a signal indicating an operation command from the smart speaker 1200 after the predetermined time or more has elapsed since the direction of the sound source detected in step S1311 was recorded in the storage unit 116, the process proceeds to step S1315.

ステップＳ１３１５では、中央制御部１１１は、記憶部１１６に記録された音源の方向の履歴を基に、過去の撮影コマンドに対応する音源の方向のうち、最も多い音源の方向を音方向として決定し、処理はステップＳ１３１６に進む。この場合、ステップＳ１３１６では、中央制御部１１１は、回動制御部１２３を制御し、レンズ部１０１および撮像部１０２の撮像方向がステップＳ１３１５で決定された音方向になるように、パン動作とチルト動作する。その後、処理はステップＳ１３１７に進む。 In step S1315, based on the history of sound source directions recorded in the storage unit 116, the central control unit 111 determines, as the sound direction, the most frequent sound source direction among the sound source directions corresponding to past shooting commands, and the process proceeds to step S1316. In this case, in step S1316, the central control unit 111 controls the rotation control unit 123 to perform the pan operation and the tilt operation so that the image pickup direction of the lens unit 101 and the image pickup unit 102 becomes the sound direction determined in step S1315. After that, the process proceeds to step S1317.

ステップＳ１３１７では、中央制御部１１１は、ステップＳ１３１３において受信した操作コマンドを示す信号に基づいて、ステップＳ１３１６の撮像方向を撮像するよう撮像部１０２を制御する。操作コマンドを示す信号が静止画撮影コマンド指示信号である場合、中央制御部１１１は、撮像部１０２で撮像した１枚の静止画像データを例えばＪＰＥＧファイルとして、記憶部１１６に記録する。操作コマンドを示す信号が動画撮影コマンド指示信号である場合、中央制御部１１１は、撮像部１０２を用いて動画像の撮影を開始し、記憶部１１６に対して動画像の記録を開始する。なお、中央制御部１１１は、映像信号処理部１１３にて顔認識ができない場合は、ステップＳ１３１７の処理を実施しないようにしてもよい。 In step S1317, the central control unit 111 controls the image pickup unit 102 to image the image pickup direction in step S1316 based on the signal indicating the operation command received in step S1313. When the signal indicating the operation command is a still image shooting command instruction signal, the central control unit 111 records one still image data imaged by the image pickup unit 102 in the storage unit 116 as, for example, a JPEG file. When the signal indicating the operation command is a moving image shooting command instruction signal, the central control unit 111 starts shooting a moving image using the imaging unit 102, and starts recording the moving image in the storage unit 116. If the video signal processing unit 113 cannot recognize the face, the central control unit 111 may not perform the process of step S1317.

ステップＳ１３１７の後、中央制御部１１１は、ステップＳ１３１２またはＳ１３１５で設定された音源の方向の解除を行い、待機状態に戻る。 After step S1317, the central control unit 111 releases the direction of the sound source set in step S1312 or S1315, and returns to the standby state.
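The camera-side flow of steps S1311 to S1317 can be summarized with a small sketch. It deliberately leaves out the elapsed-time check of steps S1314 and S1315, and the class and method names are hypothetical; the publication does not define any such interface.

```python
from typing import List, Optional, Tuple


class CameraController:
    """Hypothetical model of the central control unit 111's handling of
    the signals received from the smart speaker."""

    def __init__(self) -> None:
        self.latest_direction: Optional[float] = None  # updated periodically (S1311)
        self.target_direction: Optional[float] = None  # direction recorded in S1312
        self.shots: List[Tuple[str, float]] = []       # (command, direction) pairs

    def on_sound_direction(self, angle: float) -> None:
        # S1311: periodic sound direction detection result.
        self.latest_direction = angle

    def on_record_signal(self) -> None:
        # S1312: the activation command was recognized; fix the direction now.
        self.target_direction = self.latest_direction

    def on_operation_command(self, command: str) -> bool:
        # S1313, S1316, S1317: turn to the recorded direction and shoot.
        if self.target_direction is None:
            return False
        self.shots.append((command, self.target_direction))
        # After S1317 the recorded direction is released.
        self.target_direction = None
        return True
```

The point of fixing the direction at S1312 is that later sounds, picked up while the cloud is still matching the operation command, do not change where the camera shoots.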

このように、ステップＳ１３１２の処理からステップＳ１３１３の処理までの経過時間が閾値Ｔ秒以上である場合、中央制御部１１１は、ステップＳ１３１１で記録した音源の方向の位置からユーザが移動したと想定して音方向を撮影する。また、ステップＳ１３１２の処理からステップＳ１３１３の処理までの経過時間が閾値Ｔ秒未満である場合、中央制御部１１１は、ステップＳ１３１２において記憶した音方向を撮影する。これにより、中央制御部１１１は、音源の方向を記憶してからの経過時間に基づいて音方向を決定することで、被写体がいない方向を撮影するおそれを低減することができる。 As described above, when the elapsed time from the process of step S1312 to the process of step S1313 is equal to or longer than the threshold value T seconds, the central control unit 111 assumes that the user has moved from the position in the sound source direction recorded in step S1311, and shoots in the sound direction determined in step S1315. When the elapsed time from the process of step S1312 to the process of step S1313 is less than the threshold value T seconds, the central control unit 111 shoots in the sound direction stored in step S1312. By determining the sound direction based on the time elapsed since the sound source direction was stored, the central control unit 111 can reduce the possibility of shooting in a direction in which there is no subject.

なお、例えば、ユーザがスマートスピーカー１２００に対して、「スマートスピーカー、撮像装置を使って撮影して」という音声を発したとする。スマートスピーカー１２００は、「スマートスピーカー」という音声データを音声用メモリ１２１２に格納された起動コマンドと比較して判定を行うため、他の音声コマンドを認識するよりも認識速度が速い。本実施形態では、スマートスピーカー１２００は、音声用メモリ１２１２内の起動コマンドと比較して判定するため、マイクロ秒単位で認識を完了する。一方、スマートスピーカー１２００は、操作コマンド「撮像装置を使って撮影して」を認識する場合、クラウド１２２０のメモリ内の操作コマンドと比較するため、数秒かかる。そのため、本実施形態では、例えば、撮像装置１３０は、閾値Ｔを１０秒と設定することで、スマートスピーカー１２００が音声コマンドを認識する時間を確保する。 For example, assume that the user says "Smart speaker, take a picture using the image pickup device" to the smart speaker 1200. Since the smart speaker 1200 makes its determination by comparing the voice data "smart speaker" with the activation command stored in the voice memory 1212, recognition is faster than for other voice commands. In the present embodiment, because the smart speaker 1200 compares the voice data with the activation command in the voice memory 1212, it completes recognition on the order of microseconds. On the other hand, when the smart speaker 1200 recognizes the operation command "take a picture using the image pickup device", recognition takes several seconds because the voice data is compared with the operation commands in the memory of the cloud 1220. Therefore, in the present embodiment, the image pickup apparatus 130 sets the threshold value T to, for example, 10 seconds, which secures the time the smart speaker 1200 needs to recognize the voice command.

なお、ステップＳ１３１２およびＳ１３１５では、撮像装置１３０またはスマートスピーカー１２００は、撮像装置１３０またはスマートスピーカー１２００の表示部を点灯または点滅させるなどして、ユーザに撮影指示の処理中であることを知らせてもよい。 In steps S1312 and S1315, the image pickup apparatus 130 or the smart speaker 1200 may notify the user that the shooting instruction is being processed, for example by turning on or blinking the display unit of the image pickup apparatus 130 or the smart speaker 1200.

なお、ステップＳ１３１５において、中央制御部１１１は、記憶部１１６に記録された音源の方向の履歴を基に、前回の撮影コマンドに対応する音源の方向を音方向として決定してもよい。 In step S1315, the central control unit 111 may determine the direction of the sound source corresponding to the previous shooting command as the sound direction based on the history of the direction of the sound source recorded in the storage unit 116.
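The direction decision of steps S1314 and S1315, including the variation that uses the previous shooting command's direction, might look like the sketch below. It assumes that directions are stored as quantized angles in degrees so that identical directions can be counted; `choose_direction` and its parameters are hypothetical names, and the 10-second default is the example threshold given in the text.

```python
from collections import Counter
from typing import List, Optional

THRESHOLD_T = 10.0  # seconds; example value of the threshold T from the text


def choose_direction(recorded: float,
                     recorded_at: float,
                     command_at: float,
                     history: List[float],
                     use_previous: bool = False,
                     threshold: float = THRESHOLD_T) -> Optional[float]:
    """Return the direction to shoot in, or None if nothing can be decided.

    recorded     direction stored in S1312
    recorded_at  time (s) at which S1312 ran
    command_at   time (s) at which the operation command arrived (S1313)
    history      directions of past shooting commands, oldest first
    """
    elapsed = command_at - recorded_at
    if elapsed < threshold:
        # S1314 "less than T": the user is assumed not to have moved.
        return recorded
    if not history:
        # No usable history; the caller would have to search for a subject.
        return None
    if use_previous:
        # Variation: direction of the previous shooting command.
        return history[-1]
    # S1315: most frequent direction among past shooting commands.
    return Counter(history).most_common(1)[0][0]
```

For example, with a history of `[90.0, 90.0, 120.0]`, a command arriving 12 seconds after the direction was recorded yields 90.0 by default and 120.0 with `use_previous=True`.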

以上のように、撮像装置１３０は、スマートスピーカー１２００と連携し、撮影することで、被写体をより確実に撮影することができる。 As described above, the image pickup apparatus 130 can take a picture of the subject more reliably by taking a picture in cooperation with the smart speaker 1200.

ここで、図１４（ａ）および（ｂ）を用いて、図１３のステップＳ１３１４およびＳ１３１５の内容と効果を詳細に説明する。図１４は、本実施形態におけるスマートスピーカー１２００と撮像装置１３０を用いて、音源の方向の履歴を基に音源の方向を決定する効果を示す図である。 Here, the contents and effects of steps S1314 and S1315 of FIG. 13 will be described in detail with reference to FIGS. 14 (a) and 14 (b). FIG. 14 is a diagram showing an effect of determining the direction of a sound source based on the history of the direction of the sound source by using the smart speaker 1200 and the image pickup apparatus 130 in the present embodiment.

撮像装置１３０とスマートスピーカー１２００の周りをユーザ１４０３、１４０４、および１４０５が囲んでいる。ユーザ１４０３が撮影指示者である。ユーザ１４０３は、ユーザ１４０３が含まれる写真を撮影したい。図１４（ａ）は、図１３のステップＳ１３０１からステップＳ１３１２までの状態を示す図である。図１４（ｂ）は、図１３のステップＳ１３１３からステップＳ１３１７までの状態を示す図である。 Users 1403, 1404, and 1405 surround the image pickup apparatus 130 and the smart speaker 1200. User 1403 is the shooting instructor. User 1403 wants to take a picture that includes user 1403. FIG. 14(a) is a diagram showing the state from step S1301 to step S1312 in FIG. 13. FIG. 14(b) is a diagram showing the state from step S1313 to step S1317 in FIG. 13.

図１４（ａ）では、ユーザ１４０３は、起動コマンド「スマートスピーカー」のみを発声する。その後、図１４（ｂ）では、ユーザ１４０３は、ユーザ１４０４および１４０５の近くへ移動し、操作コマンド「撮像装置を使って撮影して」を発声する。この場合、図１４（ａ）の起動コマンドの発声と図１４（ｂ）の操作コマンドの発声との間隔が長くなる。言い換えると、ステップＳ１３１２の処理からステップＳ１３１３の処理までの経過時間が閾値Ｔ秒以上になる。 In FIG. 14A, the user 1403 utters only the activation command “smart speaker”. Then, in FIG. 14B, the user 1403 moves closer to the users 1404 and 1405 and utters the operation command "take a picture using the image pickup device". In this case, the interval between the utterance of the activation command in FIG. 14A and the utterance of the operation command in FIG. 14B becomes long. In other words, the elapsed time from the process of step S1312 to the process of step S1313 is the threshold value T seconds or more.

ステップＳ１３１４では、ステップＳ１３１２の処理からステップＳ１３１３の処理までの経過時間が閾値Ｔ秒以上になる場合、処理はステップＳ１３１５に進む。ステップＳ１３１５では、中央制御部１１１は、記憶部１１６に記録されている過去の音源の方向の履歴のうち、最も多い音源の方向１４０６を音方向として決定する。撮像装置１３０は、ステップＳ１３１６では、撮像部１０２の撮影方向が音源の方向１４０６になるように制御し、ステップＳ１３１７では、撮影を行う。 In step S1314, if the elapsed time from the process of step S1312 to the process of step S1313 is equal to or longer than the threshold value T seconds, the process proceeds to step S1315. In step S1315, the central control unit 111 determines, as the sound direction, the most frequent sound source direction 1406 in the history of past sound source directions recorded in the storage unit 116. In step S1316, the image pickup apparatus 130 controls the shooting direction of the image pickup unit 102 so that it becomes the sound source direction 1406, and in step S1317 it performs shooting.

一方、起動コマンドの発声と操作コマンドの発声との間隔が短い場合、ユーザは、起動コマンドおよび操作コマンドを連続して発声していると想定される。この場合、起動コマンドおよび操作コマンドを発声したユーザは、移動していないと考えられるため、中央制御部１１１は、撮像部１０２の撮影方向がステップＳ１３１２で記録された音源方向になるように制御し、撮影を行う。 On the other hand, when the interval between the utterance of the activation command and the utterance of the operation command is short, it is assumed that the user is uttering the activation command and the operation command in succession. In this case, the user who uttered the activation command and the operation command is considered not to have moved, so the central control unit 111 controls the shooting direction of the image pickup unit 102 so that it becomes the sound source direction recorded in step S1312, and performs shooting.

以上のように、中央制御部１１１は、記憶部１１６に記録されている過去の音源の方向の履歴を使うことで、撮影指示者が移動した場合でも、撮影指示者の方向へ撮像部１０２を向け、撮影することが可能となる。 As described above, by using the history of past sound source directions recorded in the storage unit 116, the central control unit 111 can point the image pickup unit 102 toward the shooting instructor and shoot, even when the shooting instructor has moved.

なお、中央制御部１１１が、記憶部１１６に記録されている音源の方向の履歴を基に撮影指示者の方向を検出できない場合がある。その場合、中央制御部１１１は、レンズ部１０１および撮像部１０２を回転させて、被写体を探すことで、可動撮像部１００のパン動作にかかる時間を短くすることができる。 The central control unit 111 may not be able to detect the direction of the shooting instructor based on the history of the direction of the sound source recorded in the storage unit 116. In that case, the central control unit 111 can shorten the time required for the pan operation of the movable image pickup unit 100 by rotating the lens unit 101 and the image pickup unit 102 to search for a subject.

［第２の実施形態］ [Second Embodiment]
図１５は、第２の実施形態に係るスマートスピーカー１２００と３台の撮像装置１３０ａ～１３０ｃの制御方法を示すシーケンス図である。図１５のスマートスピーカー１２００は、図１３のスマートスピーカー１２００と同様である。図１５の撮像装置１３０ａ～１３０ｃは、それぞれ、図１３の撮像装置１３０と同様である。 FIG. 15 is a sequence diagram showing a control method of the smart speaker 1200 and the three image pickup devices 130a to 130c according to the second embodiment. The smart speaker 1200 of FIG. 15 is the same as the smart speaker 1200 of FIG. 13. Each of the image pickup devices 130a to 130c of FIG. 15 is the same as the image pickup device 130 of FIG. 13.

スマートスピーカー１２００は、ステップＳ１５０１～Ｓ１５０４の処理を行う。ステップＳ１５０１～Ｓ１５０４は、図１３のステップＳ１３０１～Ｓ１３０４と同様である。 The smart speaker 1200 performs the processes of steps S1501 to S1504. Steps S1501 to S1504 are the same as steps S1301 to S1304 in FIG. 13.

撮像装置１３０ａは、ステップＳ１５１１～Ｓ１５１７の処理を行う。ステップＳ１５１１～Ｓ１５１７は、図１３のステップＳ１３１１～Ｓ１３１７と同様である。撮像装置１３０ｃは、ステップＳ１５３１～Ｓ１５３７の処理を行う。ステップＳ１５３１～Ｓ１５３７は、図１３のステップＳ１３１１～Ｓ１３１７と同様である。撮像装置１３０ｂも、撮像装置１３０ａおよび１３０ｃと同様の処理を行う。 The image pickup apparatus 130a performs the processes of steps S1511 to S1517. Steps S1511 to S1517 are the same as steps S1311 to S1317 in FIG. 13. The image pickup apparatus 130c performs the processes of steps S1531 to S1537. Steps S1531 to S1537 are the same as steps S1311 to S1317 in FIG. 13. The image pickup device 130b also performs the same processing as the image pickup devices 130a and 130c.

ステップＳ１５０２では、スマートスピーカー１２００は、ＢＬＥのアドバタイズ機能を用いて、スマートスピーカー１２００の近傍にいる、無線接続の確立していない撮像装置１３０ａ～１３０ｃに対し、音方向を記録することを指示するための信号を送信する。このため、１台のスマートスピーカー１２００に対して、複数の撮像装置１３０ａ～１３０ｃを連携させたい場合でも、１台の撮像装置１３０の場合と同様に、音方向を記録することを指示するための信号を送信することが可能となる。 In step S1502, the smart speaker 1200 uses the BLE advertising function to transmit a signal instructing the image pickup devices 130a to 130c, which are in the vicinity of the smart speaker 1200 and have not established a wireless connection, to record the sound direction. Therefore, even when a plurality of image pickup devices 130a to 130c are to be linked to one smart speaker 1200, the signal instructing the recording of the sound direction can be transmitted in the same way as in the case of a single image pickup device 130.
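The effect of using an advertisement rather than per-device connections can be illustrated with an in-memory model. This sketch does not use any real BLE API; `Camera`, `advertise`, and the `RECORD_DIRECTION` payload are hypothetical and only show that a single broadcast reaches every camera in range, each of which records its own detected direction.

```python
from typing import Iterable, Optional

RECORD_DIRECTION = "record_direction"  # hypothetical advertisement payload


class Camera:
    """Hypothetical stand-in for one of the image pickup devices 130a to 130c."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.latest_direction: Optional[float] = None    # from periodic detection
        self.recorded_direction: Optional[float] = None  # fixed on the broadcast

    def on_advertisement(self, payload: str) -> None:
        # Each camera reacts independently; no connection is needed.
        if payload == RECORD_DIRECTION:
            self.recorded_direction = self.latest_direction


def advertise(payload: str, cameras_in_range: Iterable[Camera]) -> None:
    """Model of a single broadcast heard by every camera in range."""
    for camera in cameras_in_range:
        camera.on_advertisement(payload)
```

Because each camera sits at a different position, the same utterance would generally give each one a different recorded direction.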

第１および第２の実施形態では、ＢＬＥ通信を例に説明したが、それに限定されない。ＢＬＥ通信によってパケットを受信した場合、中央制御部１１１は、撮像部１０２によって撮像する方向を決定し、無線ＬＡＮ通信によってパケットを受信した場合、中央制御部１１１は、撮像部１０２によって撮像するよう制御してもよい。 In the first and second embodiments, BLE communication has been described as an example, but the present invention is not limited thereto. When the packet is received by the BLE communication, the central control unit 111 determines the direction to be imaged by the image pickup unit 102, and when the packet is received by the wireless LAN communication, the central control unit 111 is controlled to be imaged by the image pickup unit 102. You may.

［その他の実施形態］ [Other embodiments]
本発明は、上述の実施形態の１以上の機能を実現するプログラムを、ネットワーク又は記憶媒体を介してシステム又は装置に供給し、そのシステム又は装置のコンピュータにおける１つ以上のプロセッサがプログラムを読出し実行する処理でも実現可能である。また、１以上の機能を実現する回路（例えば、ＡＳＩＣ）によっても実現可能である。 The present invention can also be realized by a process in which a program that realizes one or more functions of the above-described embodiments is supplied to a system or device via a network or storage medium, and one or more processors in the computer of the system or device read and execute the program. It can also be realized by a circuit (for example, an ASIC) that realizes one or more functions.

なお、本発明は上記実施形態そのままに限定されるものではなく、実施段階ではその要旨を逸脱しない範囲で構成要素を変形して具体化できる。また、上記実施形態に開示されている複数の構成要素の適宜な組み合わせにより、種々の発明を形成できる。例えば、実施形態に示される全構成要素から幾つかの構成要素を削除してもよい。さらに、異なる実施形態にわたる構成要素を適宜組み合わせてもよい。 It should be noted that the present invention is not limited to the above embodiment as it is, and at the implementation stage, the components can be modified and embodied within a range that does not deviate from the gist thereof. In addition, various inventions can be formed by an appropriate combination of the plurality of components disclosed in the above-described embodiment. For example, some components may be removed from all the components shown in the embodiments. In addition, components across different embodiments may be combined as appropriate.

１０２撮像部、１０４音声入力部、１１１中央制御部、１２３回動制御部、１２４無線通信部、１２５～１２７振動体、２０４音方向検出部 102 Imaging unit, 104 Voice input unit, 111 Central control unit, 123 Rotation control unit, 124 Wireless communication unit, 125-127 vibrating body, 204 Sound direction detection unit

Claims

An image pickup apparatus comprising:
image pickup means;
drive means capable of driving the image pickup direction of the image pickup means;
sound collecting means for collecting sound;
detection means for detecting the direction of a sound source of the sound picked up by the sound collecting means;
communication means; and
control means,
wherein the control means controls the communication means so as to communicate with an external device capable of voice recognition,
the control means determines the direction to be imaged by the image pickup means when the identification command of the external device is recognized, and
the control means controls the image pickup means and the drive means so as to take an image in the determined direction when image pickup is instructed by the external device via the communication means.

The imaging device according to claim 1, wherein the control means controls to image the direction of a sound source of the sound picked up when the identification command of the external device is recognized.

The imaging device according to claim 1 or 2, wherein the control means receives a signal indicating that the identification command of the external device has been recognized via the communication means.

The image pickup device according to any one of claims 1 to 3, wherein the control means does not recognize voice when the image pickup device is connected to the external device by the communication means.

The image pickup apparatus according to any one of claims 1 to 4, wherein, when image pickup is instructed by the external device via the communication means after a predetermined time or more has elapsed from the recognition of the identification command of the external device, the control means determines an image pickup direction based on a history of image pickup directions of the image pickup means, instead of using the determined direction, and controls to image in the direction so determined.

The image pickup apparatus according to any one of claims 1 to 5, wherein the control means determines the direction to be imaged by the image pickup means when a packet is received by BLE communication, and controls the image pickup means to capture an image when a packet is received by wireless LAN communication.

The image pickup apparatus according to any one of claims 1 to 6, wherein the sound collecting means has a plurality of microphones, and the detection means detects the angle of the direction of the sound source of a generated voice from the time difference of the voice input to the plurality of microphones at the timing when the voice is generated.

The imaging device according to claim 7, wherein the plurality of microphones are arranged symmetrically with respect to the imaging direction of the imaging means.

A control method for an image pickup apparatus having:
image pickup means;
drive means capable of driving the image pickup direction of the image pickup means;
sound collecting means for collecting sound;
detection means for detecting the direction of a sound source of the sound picked up by the sound collecting means; and
communication means,
the control method comprising:
a step of controlling the communication means to communicate with an external device capable of voice recognition;
a step of determining the image pickup direction of the image pickup means when the identification command of the external device is recognized; and
a step of controlling the image pickup means and the drive means so as to capture an image in the determined direction when image pickup is instructed by the external device via the communication means.

A computer-readable program for operating a computer as each means of the image pickup apparatus according to any one of claims 1 to 8.