JP2009284473A

JP2009284473A - Camera control apparatus and method

Info

Publication number: JP2009284473A
Application number: JP2009103361A
Authority: JP
Inventors: Hideo Kuboyama; 英生久保山; Michio Aizawa; 道雄相澤
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2008-04-23
Filing date: 2009-04-21
Publication date: 2009-12-03
Anticipated expiration: 2029-04-21
Also published as: JP5495612B2

Abstract

PROBLEM TO BE SOLVED: To solve the problem that, when an apparatus has both a function for capturing the sound source direction and making an imaging apparatus follow it and a function for controlling the imaging apparatus by voice command, the two controls interfere with each other and neither function operates as intended. SOLUTION: A voice acquisition section 201 acquires a voice, and a voice direction detection section 202 detects the direction in which that voice was generated. A voice recognition section 205 recognizes the voice acquired by the voice acquisition section 201. An imaging direction control section 203 controls the imaging direction of a camera 204 toward the voice generation direction detected by the voice direction detection section 202. When the voice recognition section 205 recognizes an input voice as a voice command, the imaging direction control section 203 suppresses control of the imaging direction of the camera 204 toward the detected voice generation direction. COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to a camera control apparatus and method for controlling the operation of a camera.

Various techniques for controlling a device by voice have been developed. However, when the device also processes the audio it receives, the control voice can have an adverse effect. For example, recording sound together with images is an essential function of a video camera, but the control voice ends up being recorded along with everything else during recording. Methods have therefore been proposed for suppressing the device-control voice in the input audio (for example, Patent Documents 1 to 3).

Patent Document 1: JP-A-5-61497
Patent Document 2: JP 2001-203974 A
Patent Document 3: JP 2003-298916 A

Some cameras used in video conferencing systems and the like automatically point their imaging direction toward the speaker. Because such a camera detects the speaker's direction from the input voice, a camera-control voice can have an adverse effect. For example, consider a camera configured to turn toward the whiteboard when the voice command “whiteboard” is recognized. Even though “whiteboard” was uttered as a voice command, the direction of that voice is detected and the camera turns toward the person who uttered it. This is not the intended camera behavior.

As with the video camera described above, one conceivable approach is to suppress the camera-control voice in the audio input to the camera. However, if the voice is also suppressed when “here” is uttered as a camera-control voice command, the direction of the speaker can no longer be detected, so the intended operation cannot be performed.

Thus, when an apparatus has both a function for capturing the sound source direction and making the imaging device follow it and a function for controlling the imaging device by voice commands, the two controls interfere with each other and neither function works as intended.

An object of the present invention is to solve the problems described above.
According to one aspect of the present invention, there is provided a camera control apparatus for controlling the operation of a camera, comprising: acquisition means for acquiring a voice; detection means for detecting the direction in which the voice acquired by the acquisition means was generated; voice recognition means for recognizing the voice acquired by the acquisition means; and control means for controlling the imaging direction of the camera toward the voice generation direction detected by the detection means, wherein, when the voice recognition means recognizes the voice as a voice command, the control means suppresses controlling the imaging direction of the camera toward the voice generation direction detected by the detection means.

According to the present invention, the function of capturing the sound source direction and making the imaging device follow it, and the function of controlling the imaging device by voice commands, operate successfully without interfering with or adversely affecting each other.

Block diagram showing the hardware configuration of the camera control apparatus according to an embodiment.
Block diagram showing the functional configuration of the camera control apparatus according to an embodiment.
Flowchart showing the processing procedure of the camera control apparatus according to an embodiment.
Diagram showing an example of the voice command table in an embodiment.
Block diagram showing the functional configuration of the camera control apparatus according to an embodiment.
Flowchart showing the processing procedure of the camera control apparatus according to an embodiment.
Diagram showing an example of the voice command table in an embodiment.
Diagram illustrating the camera control apparatus according to an embodiment applied to a video conference.
Diagram showing an example in which the camera control apparatus according to an embodiment is implemented as a voice remote control.
Flowchart showing the processing procedure of the camera control apparatus according to an embodiment.
Diagram illustrating the comparison of intermediate voice recognition scores according to an embodiment.
Diagram illustrating the timing of imaging direction control relative to an utterance according to an embodiment.
Diagram showing the relationship between the voice recognition score and imaging direction control in an embodiment.
Diagram showing the relationship between the voice recognition score, imaging direction control, and control speed in an embodiment.
Diagram showing the relationship between the voice recognition score and control speed in an embodiment.
Flowchart showing the processing procedure of the camera control apparatus according to an embodiment.

Preferred embodiments of the present invention are described in detail below with reference to the drawings. The present invention is not limited to the following embodiments, which merely show specific examples advantageous for carrying out the invention. Moreover, not all combinations of the features described in the following embodiments are essential to the means by which the invention solves the problems.

(Embodiment 1)
FIG. 1 is a block diagram showing the hardware configuration of a camera control apparatus according to an embodiment of the present invention. A CPU (central processing unit) 101 controls the operation of the entire apparatus as a system control unit. A ROM 102 stores control programs, specifically the program for voice-based camera control described later. A RAM 103 provides a work area for the CPU 101 and holds various data. A storage device 104 consists of a hard disk or the like and stores, for example, the voice command table described later. The program for voice-based camera control may also be stored in the storage device 104. A camera (imaging device) 105 is the target of voice control; as the controlled device, the camera 105 is configured so that its imaging direction can be controlled. A microphone 106 collects sound.

FIG. 2 is a block diagram showing the functional configuration of the camera control apparatus according to an embodiment of the present invention. A voice acquisition unit 201 acquires externally generated voice and sends it to a voice direction detection unit 202 and a voice recognition unit 205. The voice acquisition unit 201 is implemented by the microphone 106. So that the voice direction detection unit 202 can easily detect the voice generation direction, the voice acquisition unit 201 preferably comprises two or more microphones.

The voice direction detection unit 202 detects the generation direction of the voice sent from the voice acquisition unit 201. It then generates the pan/tilt/zoom information needed to control the imaging direction of the camera 204 toward the detected direction. The generated pan/tilt/zoom information is sent to the imaging direction control unit 203.
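The text does not say how the voice direction detection unit 202 estimates the direction. As a rough illustration only, the sketch below assumes a time-difference-of-arrival approach with two microphones, which is a common technique and not necessarily the one used here. The function name, the speed-of-sound constant, and the far-field assumption are all hypothetical.

```python
import math

# Assumption: approximate speed of sound in air at room temperature.
SPEED_OF_SOUND_M_S = 343.0


def pan_angle_from_delay(delay_s: float, mic_spacing_m: float) -> float:
    """Estimate a pan angle in degrees from the arrival-time difference
    between two microphones (far-field model, hypothetical helper).

    0 degrees means the source is straight ahead of the microphone pair;
    the sign follows the sign of the delay.
    """
    if mic_spacing_m <= 0:
        raise ValueError("microphone spacing must be positive")
    # sin(theta) = c * delay / spacing; clamp to guard against noisy delays.
    ratio = SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

A value computed this way could serve as the pan component of the pan/tilt/zoom information; tilt and zoom would need additional microphones or other inputs.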

The voice recognition unit 205 recognizes the voice sent from the voice acquisition unit 201 and outputs a voice command. If the input voice is not recognized as a voice command, it outputs an indication that the voice is not a voice command. When the camera control apparatus is used for a video conference, ordinary conversational speech is not recognized as a voice command.

The voice commands that the voice recognition unit 205 can recognize are defined in a voice command table 206. A voice command table 206 appropriate to the scene in which the camera control apparatus is used is selected. Here, the scene in FIG. 8 is assumed as an example.

FIG. 8 shows a video conference connecting the local site and the remote site. A camera control apparatus is installed at each site, and the apparatuses are connected to each other, for example, through a network. The local monitor displays the video captured by the camera 204 of the remote camera control apparatus, and the remote monitor displays the video captured by the camera 204 of the local camera control apparatus. The local attendees are Mr. A, Mr. B, and Mr. C; the remote attendees are Mr. X and Mr. Y.

FIG. 4(a) shows an example of the voice command table 206 used by the local camera control apparatus in the scene of FIG. 8. As illustrated, each voice command is associated with the attributes “voice”, “control information”, and “control in detected voice direction”. “Voice” represents the speech corresponding to the voice command as phonemes or the like. “Control information” specifies how the camera 204 is to be controlled. For example, when the voice command “whiteboard” is recognized, the imaging direction of the camera 204 is turned toward the whiteboard. When the voice command “Mr. X” is recognized, the imaging direction of the remote camera 204 is turned toward Mr. X. “Control in detected voice direction” specifies whether the imaging direction of the camera 204 should follow the voice generation direction when the voice acquired by the voice acquisition unit 201 is recognized as a voice command. For a voice command whose value is “◯”, the imaging direction of the camera 204 follows the detected voice generation direction; that is, follow-up control toward the voice generation direction is not suppressed. For a voice command whose value is “×”, the imaging direction of the camera 204 does not follow the detected voice generation direction; that is, follow-up control toward the voice generation direction is suppressed.
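The table described above can be pictured as a small lookup structure. The following is a minimal sketch, assuming a Python dictionary; the class name, field names, and control-information strings are hypothetical, and only the three commands named in the text are included.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VoiceCommand:
    voice: str                    # the utterance (phonemes in a real recognizer)
    control_info: Optional[str]   # preset camera action, or None if there is none
    follow_voice_direction: bool  # True corresponds to "◯", False to "×"


# Hypothetical contents modeled on the commands mentioned in the description.
VOICE_COMMAND_TABLE = {
    "whiteboard": VoiceCommand("whiteboard", "point_local_camera_at_whiteboard", False),
    "here": VoiceCommand("here", None, True),
    "Mr. X": VoiceCommand("Mr. X", "point_remote_camera_at_mr_x", True),
}


def suppress_follow_up(command_name: str) -> bool:
    """Return True when follow-up control toward the speaker is suppressed."""
    return not VOICE_COMMAND_TABLE[command_name].follow_voice_direction
```

Storing the flag per command keeps the suppression decision a plain table lookup, so a different table can be swapped in for a different scene.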

Voice commands whose “control in detected voice direction” value is “◯” are those that turn the imaging direction of the camera 204 toward the person who uttered the command. The speaker's direction is the direction detected by the voice direction detection unit 202. The voice command “here” is one example.

As another example, the voice command “Mr. X” turns the imaging direction of the remote camera 204 toward Mr. X. In this case it is desirable for Mr. X, who has been called, to know who called him. The imaging direction of the local camera 204 is therefore turned toward the speaker, Mr. A.

Voice commands whose “control in detected voice direction” value is “×” are those that do not turn the imaging direction of the camera 204 toward the person who uttered the command. For example, “whiteboard” turns the imaging direction of the camera 204 toward the whiteboard rather than toward the speaker.

The imaging direction control unit 203 controls the imaging direction of the camera 204 using the pan/tilt/zoom information generated by the voice direction detection unit 202 and the voice command recognized by the voice recognition unit 205. Several cases are described below using the scene of FIG. 8. Unless otherwise noted, the description concerns the operation of the camera control apparatus at the local site.

[When Mr. A makes an ordinary remark (not a voice command):]
The voice acquisition unit 201 acquires Mr. A's remark as voice. The voice direction detection unit 202 detects the generation direction of the acquired voice, which here is the direction of Mr. A. It then generates the pan/tilt/zoom information needed to control the imaging direction of the camera 204 toward Mr. A. Meanwhile, the voice recognition unit 205 uses a rejection function to recognize that the acquired voice is not a voice command.

In voice recognition, the likelihood of each voice command is computed and the command with the highest likelihood is output as the result. When a voice other than a voice command is input, this likelihood is low. For example, a threshold may be set, and the voice is judged not to be a voice command when the likelihood is at or below the threshold. Alternatively, a GBG (garbage) model that models all kinds of utterances may be used. When a voice close to a voice command is input, the score of the voice command model exceeds the score of the GBG model. Conversely, when a voice different from any voice command is input, the score of the GBG model exceeds the score of the voice command model.
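The two rejection strategies described above can be sketched as follows. This is an illustration under stated assumptions: the scores are taken to be already computed by some recognizer, the function names are hypothetical, and the handling of an exact tie with the garbage score (treated as rejection) is an assumption the text does not address.

```python
from typing import Mapping, Optional


def reject_by_threshold(scores: Mapping[str, float], threshold: float) -> Optional[str]:
    """Return the best-scoring command, or None when its score is at or
    below the threshold (the utterance is judged not to be a command)."""
    if not scores:
        return None
    best = max(scores, key=lambda name: scores[name])
    return best if scores[best] > threshold else None


def reject_by_garbage_model(scores: Mapping[str, float], garbage_score: float) -> Optional[str]:
    """Return the best-scoring command only when it beats the garbage
    (GBG) model score; ties are treated as rejection (an assumption)."""
    if not scores:
        return None
    best = max(scores, key=lambda name: scores[name])
    return best if scores[best] > garbage_score else None
```

The threshold variant needs a tuned constant, whereas the garbage-model variant compares two scores computed on the same input.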

Using these results, the imaging direction control unit 203 controls the imaging direction of the camera 204. The acquired voice is not a voice command, so the camera 204 is controlled using the pan/tilt/zoom information generated by the voice direction detection unit 202. The imaging direction of the camera 204 is turned toward Mr. A, and Mr. A appears on the remote monitor.

[When Mr. A utters the voice command “whiteboard”:]
The voice acquisition unit 201 acquires Mr. A's utterance as voice. The voice direction detection unit 202 detects the generation direction of the acquired voice, which here is the direction of Mr. A. It then generates the pan/tilt/zoom information needed to control the imaging direction of the camera 204 toward Mr. A. Meanwhile, the voice recognition unit 205 performs voice recognition on the acquired voice and recognizes the voice command “whiteboard”.

Using these results, the imaging direction control unit 203 controls the imaging direction of the camera 204. According to the voice command table 206 in FIG. 4(a), the “control in detected voice direction” value of the voice command “whiteboard” is “×”. The pan/tilt/zoom information generated by the voice direction detection unit 202 is therefore not used. Instead, the camera 204 is controlled using the control information of the voice command “whiteboard”: the imaging direction of the camera 204 is turned toward the whiteboard, and the whiteboard appears on the remote monitor. The pan/tilt/zoom information needed to turn the camera 204 toward the whiteboard is assumed to be set in advance.

[When Mr. A utters the voice command “here”:]
The voice acquisition unit 201 acquires Mr. A's utterance as voice. The voice direction detection unit 202 detects the generation direction of the acquired voice, which here is the direction of Mr. A. It then generates the pan/tilt/zoom information needed to control the imaging direction of the camera 204 toward Mr. A. Meanwhile, the voice recognition unit 205 performs voice recognition on the acquired voice and recognizes the voice command “here”.

Using these results, the imaging direction control unit 203 controls the imaging direction of the camera 204. The “control in detected voice direction” value of the voice command “here” is “◯”. The camera 204 is therefore controlled using the pan/tilt/zoom information generated by the voice direction detection unit 202. This turns the imaging direction of the camera 204 toward Mr. A, and Mr. A appears on the remote monitor. Since the voice command “here” has no control information, no control based on the voice command itself is performed.

In this way, it becomes possible to handle a voice command such as “here”, for which control information (pan/tilt/zoom) cannot be set in advance because the direction meant by “here” differs from speaker to speaker.

[When Mr. A utters the voice command “Mr. X”:]
The voice acquisition unit 201 acquires Mr. A's utterance as voice. The voice direction detection unit 202 detects the generation direction of the acquired voice, which here is the direction of Mr. A. It then generates the pan/tilt/zoom information needed to control the imaging direction of the camera 204 toward Mr. A. Meanwhile, the voice recognition unit 205 performs voice recognition on the acquired voice and recognizes the voice command “Mr. X”.

Using these results, the imaging direction control unit 203 controls the imaging direction of the camera 204. The “control in detected voice direction” value of the voice command “Mr. X” is “◯”. The camera 204 is therefore controlled using the pan/tilt/zoom information generated by the voice direction detection unit 202. The imaging direction of the camera 204 is turned toward Mr. A, and Mr. A appears on the remote monitor.

Meanwhile, the control information of the voice command “Mr. X” is sent to the remote camera control apparatus. Using this control information, the remote imaging direction control unit 203 turns the imaging direction of the remote camera 204 toward Mr. X, and Mr. X appears on the local monitor. The pan/tilt/zoom information needed to turn the remote camera 204 toward Mr. X is assumed to be set in advance in the remote camera control apparatus.

Mr. A utters the voice command “Mr. X” when starting a conversation with Mr. X at the remote site. With the single voice command “Mr. X”, the local camera 204 can be turned toward Mr. A and the remote camera 204 toward Mr. X. Mr. A and Mr. X appear on each other's monitors, which supports their conversation. In this way, the pan/tilt/zoom information generated by the voice direction detection unit 202 and the voice command recognized by the voice recognition unit 205 are coordinated so that the equipment can be controlled flexibly.

FIG. 3 is a flowchart showing the processing procedure of the camera control apparatus according to an embodiment of the present invention. In S301, it is determined whether the voice acquisition unit 201 has acquired an externally uttered voice. If a voice has been acquired, the process proceeds to S302; otherwise it returns to S301.

In S302, the voice direction detection unit 202 detects the generation direction of the acquired voice. It then generates the pan/tilt/zoom information needed to control the imaging direction of the camera 204 toward the detected direction and sends that information to the imaging direction control unit 203.

In S303, the voice recognition unit 205 recognizes the acquired voice. The recognition vocabulary is, for example, the voice commands shown in FIG. 4(a). A GBG model may also be added as one way of recognizing utterances other than voice commands.

In S304, the voice recognition unit 205 uses the voice recognition result to determine whether the acquired voice is a voice command. If it is a voice command, the process proceeds to S305; otherwise it proceeds to S307.

In S305, the imaging direction control unit 203 controls the camera 204 according to the control information of the voice command. In S306, the imaging direction control unit 203 determines whether the voice command is a predetermined voice command, that is, one whose “control in detected voice direction” value is “◯”. If it is, the process proceeds to S307; otherwise it returns to S301.

In S307, the imaging direction control unit 203 controls the imaging direction of the camera 204 using the pan/tilt/zoom information it was sent.
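The branching in S304 to S307 can be summarized in a few lines. The following is a minimal sketch, assuming recognition and direction detection have already produced their results; the class, the function name, and the string labels for camera actions are hypothetical stand-ins, not part of the described apparatus.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CommandEntry:
    control_info: Optional[str]   # preset action for the command, if any
    follow_voice_direction: bool  # True corresponds to "◯", False to "×"


def plan_camera_actions(command: Optional[CommandEntry],
                        voice_direction_ptz: str) -> List[str]:
    """Return the camera actions for one utterance, in order.

    `command` is None when the utterance was rejected as a non-command.
    """
    actions: List[str] = []
    if command is not None:                     # S304: it is a voice command
        if command.control_info is not None:    # S305: apply its control info
            actions.append(command.control_info)
        if not command.follow_voice_direction:  # S306: not a "◯" command
            return actions                      # back to S301, no follow-up
    actions.append(voice_direction_ptz)         # S307: follow the speaker
    return actions
```

For example, an ordinary remark yields only the follow-up action, “whiteboard” yields only its preset action, and “Mr. X” yields both.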

The order of S301 and S302 may be reversed: the sound direction is detected first, and it is then determined whether the sound contains voice.
Although this embodiment has been described for the use case of FIG. 8, in which a video conference connects the local and remote sites, the present invention also applies to a use case involving only a single local site. It is effective, for example, when recording the minutes of a meeting as video.

In this embodiment, as shown in FIG. 4(a), the voice commands recognized by the voice recognition unit 205 are of two kinds: those for which control toward the voice direction is performed and those for which it is not. However, the present invention is not limited to this, and control toward the voice direction may be suppressed for all voice commands. FIG. 4(b) shows an example of the voice command table in that case. In this case there is no need for a flag representing “control in detected voice direction”, and S306 always proceeds to NO.

(Embodiment 2)
In Embodiment 1 described above, the voice acquired by the voice acquisition unit 201 is used by both the voice direction detection unit 202 and the voice recognition unit 205. This section describes an embodiment in which the voice acquisition unit 201 is split into a first microphone 501 and a second microphone 507.

FIG. 5 is a block diagram showing the functional configuration of the camera control apparatus according to this embodiment. The first microphone 501 is used by the voice direction detection unit 502. To point the camera 504 correctly toward the detected voice direction, the positional relationship between the first microphone 501 and the camera 504 should be fixed. For example, the first microphone 501 and the camera 504 are built into the main body of the camera control apparatus, so that their positional relationship stays fixed even if the camera 504 is moved. Because the first microphone 501 is used to detect the direction of the voice, it preferably consists of two or more microphones.

The second microphone 507 is used by the voice recognition unit 508. To improve the voice recognition rate, the second microphone 507 should be close to the speaker. When there are multiple participants, it is also desirable that the microphone can be moved at any time to the participant who is speaking. For example, the second microphone 507 is made detachable from the main body and communicates with it by wire or wirelessly. The speaker can then use the second microphone 507 by placing it nearby or by holding it in hand.

The voice direction detection unit 502 detects the direction from which the voice input to the first microphone 501 originates. It then generates the pan/tilt/zoom information needed to steer the imaging direction of the camera 504 toward the detected direction, and sends this information to the imaging direction control unit 503.

The voice input button 505 serves as the trigger for starting voice recognition. When a press of the voice input button 505 is detected, a press signal is sent to the voice recognition unit 508 and the suppression unit 506. The voice input button 505 need not be a hardware button; it may be a GUI button implemented in software.

On receiving the press signal from the voice input button 505, the voice recognition unit 508 recognizes the voice input from the second microphone 507 and outputs a voice command. The voice commands that the voice recognition unit 508 can recognize are defined in the voice command table 509. FIG. 7 shows an example of the voice command table 509 used by the camera control apparatus at the local site in the scene of FIG. 8. Each voice command is associated with the attributes “voice”, “control information”, and “voice direction control suppression signal”. The “voice direction control suppression signal” is an instruction not to steer the camera 504 toward the direction of the voice.

The voice recognition unit 508 sends the control information of the recognized voice command to the imaging direction control unit 503. If the value of the “voice direction control suppression signal” is “present”, it also sends a voice direction control suppression signal to the suppression unit 506.

The suppression unit 506 determines whether to suppress steering the imaging direction of the camera 504 toward the voice direction detected by the voice direction detection unit 502. The determination uses two inputs: the press signal from the voice input button 505 and the voice direction control suppression signal from the voice recognition unit 508. If either the press signal or the voice direction control suppression signal is absent, the result is “do not suppress”. If both are present, the result is “suppress”. The result is sent to the imaging direction control unit 503.
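The determination made by the suppression unit 506 amounts to a logical AND of the two signals. The following Python sketch illustrates this under stated assumptions: the VoiceCommand class, the COMMAND_TABLE entries, and the control-information strings are hypothetical stand-ins for the voice command table 509 of FIG. 7, whose actual contents are not reproduced in this text. Only the “present”/“none” values for “whiteboard”, “here”, and “Mr. X” follow the cases described in this embodiment.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class VoiceCommand:
    voice: str          # "voice" attribute: the spoken phrase
    control_info: str   # "control information" attribute (placeholder text, assumed)
    suppress: bool      # "voice direction control suppression signal": True = "present"


# Hypothetical table; only the suppress values follow the cases in the text.
COMMAND_TABLE: Dict[str, VoiceCommand] = {
    "whiteboard": VoiceCommand("whiteboard", "point camera at whiteboard", True),
    "here": VoiceCommand("here", "placeholder control information", False),
    "Mr. X": VoiceCommand("Mr. X", "remote camera: point at Mr. X", False),
}


def should_suppress(press_signal: bool, command: Optional[VoiceCommand]) -> bool:
    """Return True ("suppress") only when both signals are present.

    command is None when no voice command was recognized, in which case
    no suppression signal is sent.
    """
    suppression_signal = command is not None and command.suppress
    return press_signal and suppression_signal


if __name__ == "__main__":
    for phrase in ("whiteboard", "here", "Mr. X"):
        print(phrase, should_suppress(True, COMMAND_TABLE[phrase]))
```

Representing the suppression signal as a boolean keeps the decision a single expression; an implementation that carries more state per command would presumably extend the table rather than the decision function.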

The imaging direction control unit 503 controls the imaging direction of the camera 504 using the pan/tilt/zoom information sent from the voice direction detection unit 502, the control information sent from the voice recognition unit 508, and the determination result of the suppression unit 506. Several cases are described below using the scene of FIG. 8. Unless otherwise noted, the description concerns the operation of the camera control apparatus at the local site.

[When Mr. A says “whiteboard” as part of ordinary speech:]
The first microphone 501 and the second microphone 507 pick up Mr. A's utterance. The voice direction detection unit 502 detects Mr. A's direction as the voice direction. Because the voice input button 505 has not been pressed, neither the suppression unit 506 nor the voice recognition unit 508 sends any information to the imaging direction control unit 503. The imaging direction control unit 503 therefore controls the imaging direction of the camera 504 based on the information from the voice direction detection unit 502, and the camera 504 turns toward Mr. A.

[When Mr. A presses the voice input button 505 and says “whiteboard”:]
The first microphone 501 and the second microphone 507 pick up Mr. A's utterance. The voice direction detection unit 502 detects Mr. A's direction as the voice direction. The voice recognition unit 508 recognizes the voice command “whiteboard”, whose “voice direction control suppression signal” value is “present”. The suppression unit 506 therefore determines that steering the imaging direction of the camera 504 toward the voice direction detected by the voice direction detection unit 502 is to be suppressed. The imaging direction control unit 503 controls the imaging direction of the camera 504 based on the control information sent from the voice recognition unit 508, and the camera 504 turns toward the whiteboard.

[When Mr. A presses the voice input button and says “here”:]
The first microphone 501 and the second microphone 507 pick up Mr. A's utterance. The voice direction detection unit 502 detects Mr. A's direction as the voice direction. The voice recognition unit 508 recognizes the voice command “here”, whose “voice direction control suppression signal” value is “none”. The suppression unit 506 therefore determines that steering the imaging direction of the camera 504 toward the voice direction detected by the voice direction detection unit 502 is not to be suppressed. The imaging direction control unit 503 controls the imaging direction of the camera 504 based on the information from the voice direction detection unit 502, and the camera 504 turns toward Mr. A.

[When Mr. A presses the voice input button and says “Mr. X”:]
The first microphone 501 and the second microphone 507 pick up Mr. A's utterance. The voice direction detection unit 502 detects Mr. A's direction as the voice direction. The voice recognition unit 508 recognizes the voice command “Mr. X”, whose “voice direction control suppression signal” value is “none”. The suppression unit 506 therefore determines that steering the camera 504 toward the voice direction detected by the voice direction detection unit 502 is not to be suppressed. The imaging direction control unit 503 controls the imaging direction of the camera 504 based on the information from the voice direction detection unit 502, and the camera 504 turns toward Mr. A.

Meanwhile, the control information of the voice command “Mr. X” is sent to the camera control apparatus at the remote site, and the camera 504 at the remote site turns toward Mr. X.

FIG. 6 is a flowchart showing the processing procedure of the camera control apparatus according to one embodiment of the present invention.

In step S601, the voice direction detection unit 502 determines whether there is voice input to the first microphone 501. If there is, the process proceeds to S602; if not, it returns to S601.

In step S602, the voice direction detection unit 502 detects the direction of the voice input to the first microphone 501.

In step S603, the suppression unit 506 determines whether there is a press signal from the voice input button 505. If there is, the process proceeds to S604; if not, it proceeds to S608.

In step S604, the voice recognition unit 508 recognizes the voice input from the second microphone 507.

In step S605, the voice recognition unit 508 determines whether the input voice was recognized as a voice command, that is, whether voice recognition succeeded. If it succeeded, the process proceeds to S606. If the input voice was not recognized as a voice command, that is, if voice recognition failed, the process returns to S601.

In step S606, the imaging direction control unit 503 controls the camera 504 based on the control information of the voice command.

In step S607, the suppression unit 506 determines whether there is a voice direction control suppression signal from the voice recognition unit 508. If there is, the process returns to S601; if not, it proceeds to S608.

In step S608, the imaging direction control unit 503 steers the imaging direction of the camera 504 toward the voice direction detected by the voice direction detection unit 502.
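One pass through steps S601 to S608 can be sketched in Python as follows. This is an illustrative outline only, not the implementation of the apparatus: the function name process_voice_input, its parameters, and the action strings are all assumptions introduced here. Direction detection and voice recognition are treated as already computed inputs, and the function returns the list of camera actions that would be carried out before the flow returns to S601.

```python
from typing import List, Optional, Tuple


def process_voice_input(
    direction: Optional[float],
    press_signal: bool,
    recognized: Optional[Tuple[str, bool]],
) -> List[str]:
    """Run one pass of the S601-S608 flow.

    direction:    detected voice direction, or None if the first microphone
                  has no voice input (hypothetical representation)
    press_signal: True if the voice input button was pressed
    recognized:   (control_info, suppression_signal) for a recognized voice
                  command, or None if recognition failed
    """
    actions: List[str] = []
    if direction is None:              # S601: no voice input
        return actions
    # S602: the direction is assumed to have been detected by the caller.
    if press_signal:                   # S603: press signal present
        if recognized is None:         # S604/S605: recognition failed
            return actions             # back to S601 without moving the camera
        control_info, suppression_signal = recognized
        actions.append(f"command:{control_info}")    # S606
        if suppression_signal:         # S607: suppression signal present
            return actions             # skip S608
    actions.append(f"voice_direction:{direction}")   # S608
    return actions


if __name__ == "__main__":
    print(process_voice_input(30.0, False, None))
    print(process_voice_input(30.0, True, ("whiteboard", True)))
    print(process_voice_input(30.0, True, ("here", False)))
```

Note that when the button is pressed and the recognized command carries no suppression signal, both S606 and S608 run, which matches the “here” and “Mr. X” cases described earlier.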

(Embodiment 3)
A configuration in which the voice input button 505 is removed from the configuration of FIG. 5 is also possible.

In this case, when the second microphone 507 detects voice input, it sends “voice input present” information to the voice recognition unit 508 and the suppression unit 506. The voice recognition unit 508 and the suppression unit 506 each perform their processing using this “voice input present” information from the second microphone 507 in place of the press signal from the voice input button 505.

(Embodiment 4)
FIG. 9 shows a configuration example in which the second microphone 507 and the voice input button 505 are separated from the main body as a voice remote controller.

The first microphone 501, the voice direction detection unit 502, the imaging direction control unit 503, and the camera 504 are built into the main body. The suppression unit 506, the voice recognition unit 508, and the voice command table 509 may be built into either the main body or the voice remote controller. The main body and the voice remote controller communicate using, for example, infrared.

(Embodiment 5)
In the first embodiment, the input voice is recognized, and whether to steer the camera's imaging direction toward the voice direction is decided according to whether the voice is a voice command, or whether it is a predetermined voice command. There, the voice recognition unit 205 obtains a recognition result only once the voice command or non-command speech acquired by the voice acquisition unit 201 has ended, so camera control starts only after the utterance ends. The present invention is not limited to this. In this embodiment, interim recognition results are obtained successively at predetermined time intervals while the speaker is still speaking, and camera control is decided successively from those results. This embodiment describes an example that recognizes the voice commands in the voice command table shown in FIG. 4(b), without varying the voice direction control according to the type of voice command.

FIG. 10 is a flowchart showing the processing procedure of the camera control apparatus according to this embodiment. In step S1001, the voice acquisition unit 201 determines whether it has acquired a voice uttered externally. If it has, the process proceeds to S1002; if not, it returns to S1001.

In step S1002, the voice direction detection unit 202 detects the direction of the acquired voice. It then generates the pan/tilt/zoom information needed to steer the imaging direction of the camera 204 toward the detected direction and sends it to the imaging direction control unit 203.

In step S1003, the voice recognition unit 205 performs recognition on the voice acquired from the time voice input was detected up to the present. Even if the utterance is still in progress, recognition is performed on the voice received so far. A recognition result is output successively at every interval of a predetermined length T.

In step S1004, the voice recognition unit 205 compares the current recognition result with the previous one (from the predetermined time T earlier). If the previous result is a voice command and differs from the current result, the process proceeds to S1005. If this is the first recognition result since voice input began, if the previous result is not a voice command, or if the previous and current voice commands are the same, the process proceeds to S1006.

In step S1005, the imaging direction control unit 203 cancels the processing performed for the previous voice command (the processing performed in S1008, described below).

In step S1006, the voice recognition unit 205 uses the recognition result to determine whether the acquired voice is a voice command. If it is not, the process proceeds to S1007; if it is, the process proceeds to S1008.
In step S1007, the imaging direction control unit 203 controls the imaging direction of the camera 204 using the pan/tilt/zoom information sent from the voice direction detection unit 202.

In step S1008, the imaging direction control unit 203 controls the camera 204 according to the control information of the voice command (for example, zooming, or turning toward the registered whiteboard).
In step S1009, the imaging direction control unit 203 suppresses steering the camera 204 toward the voice direction. If the camera's pan, tilt, or zoom has already moved toward the voice direction by then, the camera is returned from the voice direction to the direction it faced before voice input, or to the direction indicated by the control information of the voice command.

The control information from S1008 and S1009 is combined into a single set of control information that is used to control the camera 204. Depending on the content of the voice command, the camera therefore ends up facing either the direction it faced before voice input or the direction indicated by the voice command.

For example, suppose the recognition result becomes the voice command “zoom” while the camera is being steered toward the voice direction. S1008 generates control information that zooms in according to the voice command, and S1009 returns the camera, whose pan and tilt had been steered toward the voice direction, to its original direction. Combined, the zoom is increased and the pan and tilt are returned to the direction before voice input. Now suppose instead that the recognition result becomes the voice command “whiteboard”. S1008 generates control information that points the camera at the whiteboard, while S1009 attempts to return the pan and tilt to the original direction. Because the pan and tilt control information generated in S1008 points at the whiteboard, the combined control information points the camera at the whiteboard rather than in the direction it faced before voice input.
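The two examples above can be read as a per-axis merge: any axis that the voice command specifies takes the command's value, and every other axis reverts to its value before voice input. The Python sketch below illustrates that reading. The PTZ class, the combine function, and all numeric values are hypothetical and chosen only for illustration; the text does not specify how control information is represented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PTZ:
    """Pan/tilt/zoom values; None means "not specified by this control information"."""
    pan: Optional[float] = None
    tilt: Optional[float] = None
    zoom: Optional[float] = None


def combine(command: PTZ, before_input: PTZ) -> PTZ:
    """Merge S1008 (command) and S1009 (return to pre-input pose) per axis."""
    def pick(command_value: Optional[float], old_value: Optional[float]) -> Optional[float]:
        # The command wins on any axis it specifies; otherwise revert.
        return old_value if command_value is None else command_value

    return PTZ(
        pan=pick(command.pan, before_input.pan),
        tilt=pick(command.tilt, before_input.tilt),
        zoom=pick(command.zoom, before_input.zoom),
    )


if __name__ == "__main__":
    before = PTZ(pan=10.0, tilt=0.0, zoom=1.0)       # pose before voice input (assumed)
    zoom_command = PTZ(zoom=2.0)                     # "zoom": only zoom specified
    whiteboard_command = PTZ(pan=-45.0, tilt=5.0)    # "whiteboard": pan and tilt specified
    print(combine(zoom_command, before))
    print(combine(whiteboard_command, before))
```

In this sketch the “whiteboard” command leaves the zoom at its pre-input value; whether the registered whiteboard position also carries a zoom setting is not stated in the text and is left open here.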

In step S1010, the voice recognition unit 205 determines whether the section of human voice contained in the acquired voice has ended. If it has not ended, the process returns to S1003 and the interim recognition result after a further predetermined time is obtained. If it has ended, the process proceeds to S1011.

In step S1011, the voice recognition unit 205 finalizes the recognition result as the result obtained at that time, and the imaging direction control unit 203 finalizes the imaging direction control of the camera. That is, the camera is held in the state in which the imaging direction control based on the finalized recognition result has been carried out.

The order of S1001 and S1002 may be reversed; in that case, the voice direction is detected first, and it is then determined whether the sound contains voice input. This embodiment may be applied to a video conference connecting the local and remote sites as in FIG. 8, or to recording meeting minutes at a single local site.
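The loop formed by steps S1003 to S1011 can be outlined in Python as below. This is a simplified sketch under several assumptions: the function incremental_control and its action strings are invented for illustration, the interim recognition results are supplied as a list (one entry per interval T, with None standing for non-command speech), and the end of the list stands in for the end of the voice section detected in S1010.

```python
from typing import List, Optional


def incremental_control(interim_results: List[Optional[str]]) -> List[str]:
    """Return the sequence of control actions for a list of interim results.

    Each entry is a voice command name, or None when the interim result is
    "not a voice command".
    """
    log: List[str] = []
    previous: Optional[str] = None     # no previous result at the start
    for current in interim_results:    # S1003: one interim result per interval T
        # S1004: was the previous result a voice command that differs from this one?
        if previous is not None and previous != current:
            log.append(f"cancel:{previous}")         # S1005
        if current is None:                          # S1006: not a voice command
            log.append("follow_voice_direction")     # S1007
        else:
            log.append(f"command:{current}")         # S1008
            log.append("suppress_voice_direction")   # S1009
        previous = current
    log.append("finalize")             # S1010 -> S1011: voice section ended
    return log


if __name__ == "__main__":
    # A command word followed by further speech, then non-command speech.
    print(incremental_control(["whiteboard", "whiteboard", None]))
    # Non-command input first, then a command.
    print(incremental_control([None, "zoom"]))
```

A single variable holding the previous result covers all three conditions of S1004 that lead directly to S1006: no previous result, a previous result that was not a voice command, and an unchanged voice command.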

FIG. 11 is a diagram explaining how the voice recognition unit 205 outputs recognition results partway through voice input. In the figure, 1101 denotes a model, such as an HMM (hidden Markov model), that represents a word such as a voice command. 1102 denotes a state constituting the model 1101. Each model 1101 consists of one or more states and the transitions between them (including self-loops). 1103 denotes a GBG (garbage) model that models all utterances other than voice commands. When speech close to a voice command is input, the score of the voice command model exceeds the score of the GBG model. When speech unlike any voice command is input, the score of the GBG model exceeds the score of the voice command model. The first embodiment was described as calculating the likelihood of each voice command in order to recognize speech as non-command speech; this embodiment instead uses the GBG model to determine that speech is not a voice command.

In ordinary voice recognition, the most likely recognition result is found once the utterance has ended by comparing the scores of the final state of each model (the rightmost state of each model in FIG. 11). In the interim recognition performed in S1003, by contrast, the scores of all states 1102 of each word are compared, and the maximum is taken as that word's score. These word scores are then compared across all voice command models and the GBG model, and the model with the highest score is taken as the recognition result at that point.
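The difference between the two scoring rules can be shown with a small Python sketch. It assumes, purely for illustration, that the recognizer exposes the current score of every state of every model as a list ordered from the first state to the final state; the function names and the score values are hypothetical and do not come from any particular recognition library.

```python
from typing import Dict, List


def interim_result(state_scores: Dict[str, List[float]]) -> str:
    """Interim rule (S1003): a word's score is the maximum over all its states."""
    word_scores = {name: max(scores) for name, scores in state_scores.items()}
    return max(word_scores, key=lambda name: word_scores[name])


def final_result(state_scores: Dict[str, List[float]]) -> str:
    """Ordinary rule: compare only the final (rightmost) state of each model."""
    return max(state_scores, key=lambda name: state_scores[name][-1])


if __name__ == "__main__":
    # Hypothetical log-likelihood-style scores partway through an utterance.
    scores = {
        "whiteboard": [-5.0, -2.0, -9.0],   # best path is still in a middle state
        "zoom": [-6.0, -8.0, -12.0],
        "GBG": [-4.0],                      # garbage model, single state assumed
    }
    print(interim_result(scores))   # uses the max over states of each model
    print(final_result(scores))     # uses only the last state of each model
```

With these example values the interim rule selects "whiteboard" because its best state scores -2.0, while the final-state rule selects "GBG" because the final state of "whiteboard" has not yet been reached with a good score. This illustrates why the final-state rule alone is unsuitable partway through an utterance.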

FIG. 12 shows examples of voice recognition and imaging direction control in this embodiment. Part (a) of the figure is an example in which the voice command “whiteboard” is input through the voice acquisition unit 201. The horizontal axis of the figure is time. 1201 indicates that the utterance “whiteboard” is input during that time interval, and each character within 1201 indicates that the sound of that character is input at that time.

In the first embodiment, the voice recognition unit 205 outputs a recognition result only after the entire utterance has been input, so no camera control takes place until t_end in the figure. In this embodiment, by contrast, once voice input is detected in S1001, the voice recognition unit 205 outputs an interim recognition result in S1003 every predetermined time T. T is, for example, 100 milliseconds. If the voice recognition unit 205 analyzes voice features in 10-millisecond steps, an interim result is output each time the analysis advances by 10 samples. In FIG. 12(a), the voice command “whiteboard” is recognized at time t_1, so S1009 suppresses imaging direction control toward the voice direction, while imaging direction control toward the whiteboard based on the voice command is carried out. The recognition result does not change as time advances through t_2 and t_3, and the control is finalized at t_end.

FIG. 12(b) is an example in which speech other than a voice command is input. Here the imaging direction should be steered toward the voice direction, but in the first embodiment no camera control takes place until t_end in the figure. In this embodiment, the GBG model already scores higher than the voice command models at t_1, giving the interim result that the speech is not a voice command. Accordingly, in S1007 the imaging direction control unit 203 steers the imaging direction toward the voice direction. The recognition result does not change as time advances through t_2 and t_3, and the control is finalized at t_end. The camera can thus be pointed at the user at time t_1, earlier than in the first embodiment.

FIG. 12(c) shows the control when non-command speech begins with a word that matches the voice command “whiteboard”. In this case, from time t_1 to t_5 the voice recognition unit 205 obtains “whiteboard” as the interim recognition result, so the imaging direction control unit 203 performs the imaging direction control corresponding to the voice command in S1008 and suppresses imaging direction control toward the voice direction in S1009. At t_6, however, the input extends past the command word (in the Japanese example, up to 「ホワイトボードに書」, the beginning of a phrase about writing on the whiteboard), so the GBG model scores higher than the “whiteboard” model and the voice recognition unit 205 obtains the result that the speech is not a voice command. At this point, S1005 cancels the control toward the whiteboard corresponding to the voice command “whiteboard”, and S1007 starts camera control toward the voice direction. This control is finalized at t_end. Although control related to “whiteboard”, which is not the desired control, is performed once, it is corrected as recognition proceeds, and the camera can be pointed in the voice direction at time t_6, whereas the first embodiment would wait until time t_end.

FIG. 12(d) shows the control when non-command speech contains, partway through, a word that matches the voice command “zoom”. From time t_1 to t_4 the speech does not match any voice command, so in S1007 the imaging direction is steered toward the voice direction. At t_6 the speech includes a sound matching the voice command “zoom”, but the speech recognized at t_6 is “this camera's zoom”, so the non-command result still scores higher than the voice command “zoom” and the control continues unchanged. The control toward the voice direction that started at t_1 is finalized at t_end, so the camera can be pointed in the voice direction at time t_1, whereas the first embodiment would wait until time t_end.

FIG. 12(e) is an example in which a short noise precedes the voice command “zoom”. At times t_1 and t_2 the GBG model scores higher, the input is judged to be non-command speech, and the imaging direction is steered toward the voice direction. At t_3, part of the voice command “zoom” makes up most of the input so far, so the voice command “zoom” is obtained as the recognition result. The imaging direction control unit 203 then changes the zoom according to the voice command “zoom” in S1008 and suppresses imaging direction control toward the voice direction in S1009. Because control toward the voice direction already started at t_1, the imaging direction control unit 203 returns the imaging direction to its original direction. The control is finalized at time t_end. Although the camera is once steered toward the voice direction, which is not the desired control, this is corrected as recognition proceeds, and the control according to the voice command can begin at time t_3, whereas the first embodiment would wait until time t_end.

(Embodiment 6)
In the fifth embodiment, whether to perform or suppress imaging direction control toward the voice direction is switched successively according to the interim recognition results. This embodiment additionally applies a threshold to the recognition score so that the control does not switch back and forth within a short time. FIG. 13 illustrates this embodiment.

図１３は、横軸に時間軸をとり、音声コマンドのモデルのスコアとＧＢＧモデルのスコアの差をグラフにした図である。同図の（ａ）は、実施形態５のとおり、音声コマンドのモデルのスコアとＧＢＧモデルのスコアとを比較して、スコアの大きいほうの認識結果に基づいて逐次制御を切り替える場合のグラフである。同図において、１３０１は音声コマンドのモデルのスコアとＧＢＧモデルのスコアの差を表すプロットである。プロット１３０１は、音声認識部２０５がスコアを出力する時間Ｔごとに得る。１３０２は、各時刻の制御を表す。スコアの差が負になると音声方向へ制御し、正になると音声方向への制御を抑制することになる。１３０２を見ると、音声の途中で制御が何度も切り換っていることがわかる。 FIG. 13 plots, against time on the horizontal axis, the difference between the score of the voice command model and the score of the GBG model. FIG. 13(a) is the graph for the case in which, as in Embodiment 5, the score of the voice command model is compared with the score of the GBG model and the control is switched sequentially based on the recognition result with the larger score. In the figure, 1301 is a plot of the difference between the score of the voice command model and the score of the GBG model. A point of plot 1301 is obtained at every interval T at which the voice recognition unit 205 outputs a score. 1302 represents the control at each time. When the score difference becomes negative, the camera is controlled toward the voice direction; when it becomes positive, control toward the voice direction is suppressed. As 1302 shows, the control switches many times in the middle of the utterance.

これに対し、図１３（ｂ）は、スコアの差に閾値αを用意し、音声入力の途中ではスコアの差がαを超えた場合にのみ制御を切り換えることを表す。１３０３はこの場合の各時刻の制御を表す。スコアの差が閾値αを超えるまでは、スコアが逆転する可能性が高いとして制御を切り換えない。同図では（音声コマンドのモデルのスコア−ＧＢＧモデルのスコア）が−αを下回った時点で音声方向への制御を開始し、さらにその後αを上回った時点で音声方向への制御を抑制するよう制御している。１３０２と１３０３を比較すると、１３０３が安定して制御を切り換えていることがわかる。 In contrast, FIG. 13(b) shows the case in which a threshold α is set for the score difference and, during voice input, the control is switched only when the score difference exceeds α. 1303 represents the control at each time in this case. Until the score difference exceeds the threshold α, the scores are considered likely to reverse, so the control is not switched. In the figure, control toward the voice direction starts when (score of the voice command model − score of the GBG model) falls below −α, and control toward the voice direction is suppressed when the difference later rises above α. Comparing 1302 and 1303 shows that the control switches more stably in 1303.

また、このとき望ましくは、音声区間が終了した時点では、スコアの差がα以上であってもα未満であっても、終了した時点でのスコアが高いほうに基づいて制御するか抑制するかを決定する。なお、ここで図１３（ｂ）においてスコアの差の閾値αと−αは、α１と−α２のように絶対値が異なっていても構わない。 Preferably, when the voice section ends, whether to control or suppress is decided based on whichever score is higher at that point, regardless of whether the score difference is α or more or less than α. The thresholds α and −α for the score difference in FIG. 13(b) may have different absolute values, such as α1 and −α2.
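The threshold-based switching of FIG. 13(b) can be sketched as a small hysteresis routine. This is only an illustrative sketch: the names `Mode`, `hysteresis_modes`, `final_mode`, `alpha_cmd`, and `alpha_gbg` are hypothetical and do not appear in the specification, the state before the first threshold crossing is assumed to be "no decision yet", and a tie at the end of the voice section is assumed to fall back to following the voice.

```python
from enum import Enum


class Mode(Enum):
    FOLLOW_VOICE = "follow"    # steer the camera toward the voice direction
    SUPPRESS = "suppress"      # suppress steering toward the voice direction


def hysteresis_modes(score_diffs, alpha_cmd, alpha_gbg, initial=None):
    """Return the control mode after each interim recognition result.

    score_diffs: (voice command score - GBG score), one value per interval T.
    alpha_cmd:   positive threshold (alpha1); above it, suppression is selected.
    alpha_gbg:   positive threshold (alpha2); below -alpha_gbg, following is selected.
    initial:     assumed mode before any threshold has been crossed.
    """
    mode = initial
    modes = []
    for diff in score_diffs:
        if diff > alpha_cmd:
            mode = Mode.SUPPRESS
        elif diff < -alpha_gbg:
            mode = Mode.FOLLOW_VOICE
        # Otherwise the scores may still reverse, so the previous mode is kept.
        modes.append(mode)
    return modes


def final_mode(last_diff):
    """At the end of the voice section, decide by the higher score, ignoring the thresholds.

    A tie (difference of exactly zero) is assumed here to mean FOLLOW_VOICE.
    """
    return Mode.SUPPRESS if last_diff > 0 else Mode.FOLLOW_VOICE
```

With equal thresholds of 1.0, a difference sequence that hovers around zero no longer toggles the mode at every sign change; the mode changes only at the samples that leave the band between −α2 and α1.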

（実施形態７） (Embodiment 7)
実施形態５および実施形態６では音声コマンドのモデルのスコアとＧＢＧモデルのスコアとを比較して音声方向への撮像方向制御を行うか抑制するかを切り換えている。この際に、撮像方向制御部２０３は、スコアの差の大きさに応じてカメラの制御速度を変化させても良い。これによって、撮像方向が切り換るときに滑らかな制御を行うことができる。 In Embodiments 5 and 6, the score of the voice command model is compared with the score of the GBG model to switch between performing and suppressing imaging direction control toward the voice direction. In doing so, the imaging direction control unit 203 may vary the control speed of the camera according to the magnitude of the score difference. This allows smooth control when the imaging direction is switched.

図１４は、図１３（ｂ）のように閾値αを超えたときのみ撮像制御の切り換えを行う場合において、さらにカメラの制御速度を閾値αによって切り換えた例を示す。同図において、スコアの差がα以上の場合には、制御が切り換る可能性が低いので通常速度で撮像方向を制御する。一方、スコアの差がα未満の場合には、制御が切り換る可能性が高いので、通常速度よりも遅い速度で制御する。なお、速度を決定する閾値は、制御を決定する閾値αと異なっていても良い。また、スコアの差の閾値αと−αは、α１と−α２のように絶対値が異なっていても構わない。 FIG. 14 shows an example in which, when the imaging control is switched only after the threshold α is exceeded as in FIG. 13(b), the control speed of the camera is also switched according to the threshold α. In the figure, when the score difference is α or more, the control is unlikely to switch again, so the imaging direction is controlled at the normal speed. When the score difference is less than α, the control is likely to switch again, so the imaging direction is controlled at a speed lower than the normal speed. The threshold that determines the speed may differ from the threshold α that determines the control. The thresholds α and −α for the score difference may also have different absolute values, such as α1 and −α2.
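The two-level speed selection of FIG. 14 can be sketched as follows. The function name `two_level_speed` and its parameters are hypothetical, the speed units are left unspecified, and the magnitude of the score difference is compared against a single speed threshold for simplicity.

```python
def two_level_speed(score_diff, speed_threshold, normal_speed, slow_speed):
    """Pick the camera control speed from the magnitude of the score difference.

    score_diff:      (voice command score - GBG score) at the current interval.
    speed_threshold: threshold on |score_diff|; it may differ from the threshold
                     alpha that decides whether the control is switched.
    """
    if slow_speed > normal_speed:
        raise ValueError("slow_speed must not exceed normal_speed")
    # A large difference means the decision is unlikely to flip: use normal speed.
    if abs(score_diff) >= speed_threshold:
        return normal_speed
    # A small difference means the decision may still flip: move slowly.
    return slow_speed
```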

また、スコアの差に応じて速度をより細かく制御しても良い。図１５は、スコアの差に対するカメラの制御速度を表す図である。スコアの差が大きくなるほどカメラの制御速度を速くし、最大値V_maxに達した後はスコアの差が広がっても一定の速度とする。このようにスコアの差に応じて細かくカメラの速度を制御することで、音声認識部２０５の途中の認識結果が切り換っても、滑らかな撮像方向制御を行うことができる。 Further, the speed may be more finely controlled according to the difference in scores. FIG. 15 is a diagram illustrating the control speed of the camera with respect to the difference in score. As the difference in scores increases, the camera control speed is increased, and after reaching the maximum value V_max, the speed is constant even if the difference in scores increases. By controlling the camera speed finely according to the difference in score in this way, smooth imaging direction control can be performed even if the recognition result in the middle of the voice recognition unit 205 is switched.

なお、ここで図１５のグラフは、音声コマンドのスコアがＧＢＧモデルのスコアよりも高い場合と低い場合とで形状が異なっていても構わない。この場合、音声方向へ制御する場合と元の方向へ戻す場合とでスコアの差に対する速度が異なることになる。 Here, the shape of the graph of FIG. 15 may differ depending on whether the score of the voice command is higher or lower than the score of the GBG model. In this case, the speed with respect to the difference in score is different between the case of controlling in the voice direction and the case of returning to the original direction.
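The finer speed control of FIG. 15 can be sketched as a ramp that saturates at V_max. The specification only states that the speed increases with the score difference and stays constant after reaching V_max; the linear shape, the function name `control_speed`, and the two gain parameters are assumptions made for illustration.

```python
def control_speed(score_diff, v_max, gain_follow, gain_return):
    """Map the score difference to a camera control speed, capped at v_max.

    score_diff:  (voice command score - GBG score).
    gain_follow: assumed slope used when the GBG score is higher
                 (camera is being steered toward the voice direction).
    gain_return: assumed slope used when the voice command score is higher
                 (camera is being returned to its original direction).
    """
    gain = gain_return if score_diff > 0 else gain_follow
    # Speed grows with the magnitude of the difference and saturates at v_max.
    return min(v_max, gain * abs(score_diff))
```

Using two gains reflects the remark that the curve may have different shapes depending on which score is higher, so steering toward the voice and returning to the original direction can ramp up at different rates.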

（実施形態８） (Embodiment 8)
実施形態５では音声コマンドか音声コマンド以外の発言かに基づいて音声方向へ制御するか抑制するかを逐次決定して制御している。しかし本発明はこれに限るものではなく、実施形態１と同様に音声コマンドを抑制する音声コマンドと抑制しない音声コマンドに分類し、それによって処理を変えても良い。この場合、例えば、音声認識のための音声コマンド表として図４（ａ）を用いる。 In Embodiment 5, whether to control toward the voice direction or to suppress that control is decided sequentially based on whether the input is a voice command or an utterance other than a voice command. However, the present invention is not limited to this. As in Embodiment 1, voice commands may be classified into those that suppress control toward the voice direction and those that do not, and the processing may be changed accordingly. In this case, for example, FIG. 4(a) is used as the voice command table for voice recognition.

この実施形態によるフローチャートを図１６に示す。同図において、Ｓ１００１からＳ１０１１は図１０と同様の処理である。Ｓ１６０１において、撮像方向制御部２０３が、音声コマンドが所定の音声コマンドか否かを判定する。 FIG. 16 shows a flowchart according to this embodiment. In the figure, S1001 to S1011 are the same processes as in FIG. 10. In S1601, the imaging direction control unit 203 determines whether the voice command is a predetermined voice command.

所定の音声コマンドとは、図４（ａ）において「検知した音声方向への制御」の値が「○」のものである。これによって、音声方向への制御を抑制する音声コマンド「ズーム」、「ホワイトボード」が入力された場合は音方向への制御が行われない。一方、音声方向へ制御する音声コマンド「こっち」、「Ｘさん」が入力された場合は音方向への制御が、途中の認識結果が出た時点ですぐに制御される。 The predetermined voice command is a command whose value of “control in detected voice direction” in FIG. 4(a) is “○”. As a result, when the voice commands “zoom” and “whiteboard”, which suppress control toward the voice direction, are input, control toward the sound direction is not performed. When the voice commands “here” and “Mr. X”, which permit control toward the voice direction, are input, control toward the sound direction is carried out as soon as an interim recognition result is output.

例えば、図８のように自分側と相手側の拠点を結んでテレビ会議を行う場面を想定する。ここで自分側の拠点で、Ａさんが「Ｘさんの意見に賛成です。」と発言したとする。このとき、途中までの発言「Ｘさん」の時点で、音声認識部２０５はＳ１００３において音声コマンド「Ｘさん」と認識する。すると撮像方向制御部２０３は、Ｓ１００８で音声コマンド「Ｘさん」に基づいて相手側の拠点のカメラをＸさんに向けるよう制御する。さらに音声コマンド「Ｘさん」はＳ１６０１で、音声方向への撮像方向制御を抑制しない所定のコマンドに該当するので、Ｓ１００７に進み、撮像方向制御部２０３は、Ａさんの方向へ自分側の拠点のカメラを制御する。しかし発言が進み、「Ｘさんの意見に」まで入力されると、音声認識部２０５はＳ１００３において音声コマンド以外と認識する。すると撮像方向制御部２０３は、Ｓ１００５で、前の時刻で行った音声コマンド「Ｘさん」に対する制御をキャンセルする。すなわち、相手側の拠点のカメラをＸさんの方向から制御前の方向に戻す。一方、音声コマンド以外という認識結果によりＳ１００６からＳ１００７に進み、Ａさんの方向への自分側の拠点のカメラ制御は継続する。 For example, assume a video conference connecting the local site and the remote site as shown in FIG. 8. Suppose that Mr. A at the local site says, “I agree with Mr. X's opinion.” When the utterance has reached “Mr. X”, the voice recognition unit 205 recognizes the voice command “Mr. X” in S1003. The imaging direction control unit 203 then, in S1008, controls the camera at the remote site to point at Mr. X based on the voice command “Mr. X”. In S1601, the voice command “Mr. X” is found to be a predetermined command that does not suppress imaging direction control toward the voice direction, so the process proceeds to S1007 and the imaging direction control unit 203 controls the camera at the local site toward Mr. A. As the utterance continues and the input reaches “Mr. X's opinion”, the voice recognition unit 205 recognizes the input in S1003 as something other than a voice command. The imaging direction control unit 203 then, in S1005, cancels the control performed for the voice command “Mr. X” at the earlier time. That is, the camera at the remote site is returned from the direction of Mr. X to the direction it had before the control. Meanwhile, because the recognition result is other than a voice command, the process proceeds from S1006 to S1007, and the control of the local camera toward Mr. A continues.
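The sequence in this example can be traced with a short sketch. Everything here is an illustrative assumption layered on the description: the table contents merely mirror the four commands mentioned in the text (the full table of FIG. 4(a) is not reproduced), the function name `trace_interim_results` and the log strings are hypothetical, and the step numbers in the comments follow the flow as described above.

```python
# Hypothetical excerpt of the voice command table:
# True  = control toward the detected voice direction is permitted ("○"),
# False = control toward the detected voice direction is suppressed.
COMMAND_TABLE = {
    "zoom": False,
    "whiteboard": False,
    "here": True,
    "Mr. X": True,
}


def trace_interim_results(results, command_table=COMMAND_TABLE):
    """Log the actions taken for a sequence of interim recognition results.

    results: one entry per interval; a command name, or None when the
             interim result is an utterance other than a voice command.
    """
    log = []
    active_command = None
    for result in results:
        if result is None:
            if active_command is not None:
                log.append(f"cancel:{active_command}")   # S1005
                active_command = None
            log.append("follow_voice")                   # S1006 -> S1007
        else:
            if result != active_command:
                log.append(f"execute:{result}")          # S1008
                active_command = result
            if command_table[result]:                    # S1601
                log.append("follow_voice")               # S1007
            else:
                log.append("suppress_follow")            # S1009
    return log
```

For the utterance in the example, the interim results would be "Mr. X" followed by a non-command result, giving: execute the command, follow the voice, cancel the command, keep following the voice.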

（実施形態９） (Embodiment 9)
実施形態５では、音声コマンドに対する制御（ズーム、ホワイトボードに向けるなど）もＳ１００８において途中の音声認識結果に基づいて処理している。しかし本発明はこれに限るものではなく、音声コマンドに対する制御は音声認識結果が確定してから制御しても良い。この場合、音声方向への撮像方向制御のみ、早期に制御を実行し、音声コマンドに対しては音声が最後まで入力された時点で確実に実行することができる。 In Embodiment 5, the control for a voice command (zooming, pointing at the whiteboard, and so on) is also processed in S1008 based on the interim voice recognition result. However, the present invention is not limited to this, and the control for a voice command may be performed after the voice recognition result has been finalized. In this case, only the imaging direction control toward the voice direction is executed early, and the voice command is executed reliably once the voice has been input to the end.

図１７は本実施形態に係るカメラ制御装置の処理手順を示すフローチャートである。Ｓ１７０１で、音声取得部２０１が、外部で発声した音声を取得したか否かを判定する。音声を取得した場合はＳ１７０２へ進む。音声を取得しなかった場合はＳ１７０１へ戻る。 FIG. 17 is a flowchart showing a processing procedure of the camera control apparatus according to the present embodiment. In step S 1701, it is determined whether the voice acquisition unit 201 has acquired voice uttered externally. If the voice has been acquired, the process proceeds to S1702. If no voice is acquired, the process returns to S1701.

Ｓ１７０２で、音声方向検知部２０２が、取得した音声の発生方向を検知する。次に、検知した音声の発生方向へ、カメラ２０４の撮像方向を制御するために必要な、パン・チルト・ズーム情報を生成する。生成したパン・チルト・ズーム情報を撮像方向制御部２０３へ送る。 In S1702, the voice direction detection unit 202 detects the direction in which the acquired voice is generated. Next, pan / tilt / zoom information necessary for controlling the imaging direction of the camera 204 is generated in the direction in which the detected sound is generated. The generated pan / tilt / zoom information is sent to the imaging direction control unit 203.

Ｓ１７０３で、音声認識部２０５が、音声入力有りと判定された時点から現時点までに取得した音声を使って音声認識する。音声が発言の途中である場合も、途中までの音声で認識する。所定の長さの時間ごとに逐次、認識結果を出力する。 In step S 1703, the voice recognition unit 205 performs voice recognition using voices acquired from the time when it is determined that there is voice input to the present time. Even when the voice is in the middle of speaking, it is recognized by the voice up to the middle. The recognition result is output sequentially for each predetermined length of time.

Ｓ１７０４で、音声認識部２０５が、音声認識の結果を用いて、取得した音声が音声コマンドか否かを判定する。音声コマンドでない場合はＳ１７０５へ進む。音声コマンドの場合はＳ１７０６へ進む。 In step S1704, the voice recognition unit 205 determines whether the acquired voice is a voice command using the result of voice recognition. If it is not a voice command, the process proceeds to S1705. In the case of a voice command, the process proceeds to S1706.

Ｓ１７０５では、撮像方向制御部２０３が、音声方向検知部２０２から送られたパン・チルト・ズーム情報を用いて、カメラ２０４の撮像方向を制御する。Ｓ１７０６では、撮像方向制御部２０３が、カメラ２０４を音声方向に制御することを抑制する。既にそれまでにカメラのパン・チルト・ズームが、音声入力前にカメラが向いていた元の方向から動いている場合には、元の方向に戻すよう制御する。 In step S 1705, the imaging direction control unit 203 controls the imaging direction of the camera 204 using the pan / tilt / zoom information sent from the audio direction detection unit 202. In step S1706, the imaging direction control unit 203 suppresses the camera 204 from being controlled in the audio direction. If the pan / tilt / zoom of the camera has already moved from the original direction that the camera was facing before inputting the sound, control is performed so that the camera returns to the original direction.

Ｓ１７０７で、音声認識部２０５が、取得した音声に含まれる人の声の区間が終了したか否かを判定する。人の声の区間が終了していないと判定した場合には、Ｓ１７０３に戻り、さらに所定時間後の途中音声認識結果を取得する。人の声の区間が終了したと判定した場合には、Ｓ１７０８に進む。 In step S 1707, the voice recognition unit 205 determines whether the section of the human voice included in the acquired voice has ended. If it is determined that the section of the human voice has not ended, the process returns to S1703, and a midway speech recognition result after a predetermined time is acquired. If it is determined that the human voice section has ended, the process advances to step S1708.

Ｓ１７０８で、音声認識部２０５が、その時間での認識結果に認識結果を確定し、撮像方向制御部２０３が、カメラの撮像方向制御を確定する。Ｓ１７０９で、認識結果が音声コマンドであった場合は、Ｓ１７１０で、撮像方向制御部２０３が、音声コマンドの制御情報に従ってさらにカメラ２０４を制御する。 In S1708, the voice recognition unit 205 finalizes the recognition result as the result obtained at that time, and the imaging direction control unit 203 finalizes the imaging direction control of the camera. If the recognition result is determined in S1709 to be a voice command, then in S1710 the imaging direction control unit 203 further controls the camera 204 according to the control information of the voice command.
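The flow of FIG. 17 (S1703 to S1710) can be sketched as a loop over interim results in which only the direction control reacts early and the command itself is deferred until the result is finalized. This is a simplified sketch: the function name `process_utterance`, the tuple-based action log, and the use of a single number for the direction are assumptions, and voice acquisition and direction detection (S1701, S1702) are taken as already done.

```python
def process_utterance(interim_results, voice_direction, original_direction):
    """Return the list of camera actions for one voice section.

    interim_results:    one entry per interval; a command name, or None when the
                        interim result is not a voice command. The last entry is
                        treated as the finalized result.
    voice_direction:    direction detected for the voice (S1702), assumed given.
    original_direction: direction the camera faced before the voice input.
    """
    actions = []
    for result in interim_results:          # S1703, repeated until S1707 ends the section
        if result is None:                  # S1704: not a voice command
            actions.append(("point", voice_direction))       # S1705
        else:
            # S1706: suppress following; return to the original direction if moved.
            actions.append(("point", original_direction))
    final = interim_results[-1] if interim_results else None  # S1708
    if final is not None:                                     # S1709
        actions.append(("execute", final))                    # S1710
    return actions
```

With a short noise followed by the command "zoom", the camera first turns toward the sound, returns to its original direction once the command is recognized, and the zoom is executed only after the voice section ends.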

Ｓ１７０１とＳ１７０２の順番は逆でも良い。音声方向検知をした上で、音の中に音声が入力されているか否かを判定する。 The order of S1701 and S1702 may be reversed. In that case, the voice direction is detected first, and it is then determined whether the sound contains voice input.
なお、本実施形態は図８のような自分側と相手側の拠点を結んでテレビ会議を行う場面でも良いし、自分側の拠点一箇所において議事録を記録する場面でも良い。 This embodiment may be applied to a video conference connecting the local site and the remote site as shown in FIG. 8, or to recording the minutes at a single local site.

（他の実施形態） (Other embodiments)
以上、本発明の実施形態を詳述したが、本発明は、複数の機器から構成されるシステムに適用してもよいし、また、一つの機器からなる装置に適用してもよい。 Embodiments of the present invention have been described in detail above. The present invention may be applied to a system composed of a plurality of devices or to an apparatus consisting of a single device.

なお、本発明は、前述した実施形態の各機能を実現するプログラムを、システム又は装置に直接又は遠隔から供給し、そのシステム又は装置に含まれるコンピュータがその供給されたプログラムコードを読み出して実行することによっても達成される。 The present invention is also achieved by supplying a program that implements each function of the above-described embodiments to a system or apparatus, directly or remotely, and having a computer included in that system or apparatus read and execute the supplied program code.

したがって、本発明の機能・処理をコンピュータで実現するために、そのコンピュータにインストールされるプログラムコード自体も本発明を実現するものである。つまり、上記機能・処理を実現するためのコンピュータプログラム自体も本発明の一つである。 Accordingly, since the functions and processes of the present invention are implemented by a computer, the program code itself installed in the computer also implements the present invention. That is, the computer program itself for realizing the functions and processes is also one aspect of the present invention.

その場合、プログラムの機能を有していれば、オブジェクトコード、インタプリタにより実行されるプログラム、ＯＳに供給するスクリプトデータ等、プログラムの形態を問わない。 In this case, the program may be in any form as long as it has a program function, such as an object code, a program executed by an interpreter, or script data supplied to the OS.

プログラムを供給するためのコンピュータ読み取り可能な記録媒体としては、例えば、フレキシブルディスク、ハードディスク、光ディスク、光磁気ディスク、ＭＯ、ＣＤ−ＲＯＭ、ＣＤ−Ｒ、ＣＤ−ＲＷなどがある。また、記録媒体としては、磁気テープ、不揮発性のメモリカード、ＲＯＭ、ＤＶＤ（ＤＶＤ−ＲＯＭ，ＤＶＤ−Ｒ）などもある。 Examples of the computer-readable recording medium for supplying the program include a flexible disk, a hard disk, an optical disk, a magneto-optical disk, an MO, a CD-ROM, a CD-R, and a CD-RW. Examples of the recording medium include a magnetic tape, a non-volatile memory card, a ROM, a DVD (DVD-ROM, DVD-R), and the like.

また、プログラムは、クライアントコンピュータのブラウザを用いてインターネットのホームページからダウンロードしてもよい。すなわち、ホームページから本発明のコンピュータプログラムそのもの、もしくは圧縮され自動インストール機能を含むファイルをハードディスク等の記録媒体にダウンロードしてもよい。また、本発明のプログラムを構成するプログラムコードを複数のファイルに分割し、それぞれのファイルを異なるホームページからダウンロードする形態も考えられる。つまり、本発明の機能・処理をコンピュータで実現するためのプログラムファイルを複数のユーザに対してダウンロードさせるＷＷＷサーバも、本発明の構成要件となる場合がある。 The program may be downloaded from a homepage on the Internet using a browser on a client computer. That is, the computer program itself of the present invention or a compressed file including an automatic installation function may be downloaded from a home page to a recording medium such as a hard disk. Further, it is also possible to divide the program code constituting the program of the present invention into a plurality of files and download each file from a different home page. That is, a WWW server that allows a plurality of users to download a program file for realizing the functions and processing of the present invention on a computer may be a constituent requirement of the present invention.

また、本発明のプログラムを暗号化してコンピュータ読み取り可能なＣＤ−ＲＯＭ等のコンピュータ読み取り可能な記憶媒体に格納してユーザに配布してもよい。この場合、所定条件をクリアしたユーザにのみ、インターネットを介してホームページから暗号化を解く鍵情報をダウンロードさせ、その鍵情報で暗号化されたプログラムを復号して実行し、プログラムをコンピュータにインストールしてもよい。 The program of the present invention may also be encrypted, stored in a computer-readable storage medium such as a CD-ROM, and distributed to users. In this case, only users who satisfy a predetermined condition may be allowed to download the decryption key information from a homepage via the Internet, and the encrypted program may be decrypted with that key information, executed, and installed on the computer.

また、コンピュータが、読み出したプログラムを実行することによって、前述した実施形態の機能が実現されてもよい。なお、そのプログラムの指示に基づき、コンピュータ上で稼動しているＯＳなどが、実際の処理の一部又は全部を行ってもよい。もちろん、この場合も、前述した実施形態の機能が実現され得る。 Further, the functions of the above-described embodiments may be realized by the computer executing the read program. Note that an OS or the like running on the computer may perform part or all of the actual processing based on the instructions of the program. Of course, also in this case, the functions of the above-described embodiments can be realized.

さらに、記録媒体から読み出されたプログラムが、コンピュータに挿入された機能拡張ボードやコンピュータに接続された機能拡張ユニットに備わるメモリに書き込まれてもよい。そのプログラムの指示に基づき、その機能拡張ボードや機能拡張ユニットに備わるＣＰＵなどが実際の処理の一部又は全部を行ってもよい。このようにして、前述した実施形態の機能が実現されることもある。 Furthermore, the program read from the recording medium may be written in a memory provided in a function expansion board inserted into the computer or a function expansion unit connected to the computer. Based on the instructions of the program, a CPU or the like provided in the function expansion board or function expansion unit may perform part or all of the actual processing. In this way, the functions of the above-described embodiments may be realized.

Claims

A camera control device for controlling the operation of a camera, comprising:
Acquisition means for acquiring voice;
Detection means for detecting the direction of generation of the voice acquired by the acquisition means;
Voice recognition means for recognizing the voice acquired by the acquisition means; and
Control means for controlling the imaging direction of the camera to the direction of generation of the voice detected by the detection means,
Wherein, when the voice recognition means recognizes the voice as a voice command, the control means suppresses controlling the imaging direction of the camera to the direction of generation of the voice detected by the detection means.

A camera control device for controlling the operation of a camera, comprising:
Acquisition means for acquiring voice;
Detection means for detecting the direction of generation of the voice acquired by the acquisition means;
Voice recognition means for recognizing the voice acquired by the acquisition means; and
Control means for controlling the imaging direction of the camera to the direction of generation of the voice detected by the detection means,
Wherein the control means:
When the voice recognition means recognizes the voice as a voice command and an instruction to control the camera to the direction of generation of the voice is associated with the voice command, controls the imaging direction of the camera to the direction of generation of the voice detected by the detection means; and
When the voice recognition means recognizes the voice as a voice command and an instruction not to control the camera to the direction of generation of the voice is associated with the voice command, suppresses controlling the imaging direction of the camera to the direction of generation of the voice detected by the detection means.

The acquisition means includes a first microphone and a second microphone,
The detecting means detects a direction of generation of the sound input to the first microphone;
The camera control apparatus according to claim 1, wherein the voice recognition unit recognizes a voice input to the second microphone.

The acquisition means includes a first microphone and a second microphone,
The detecting means detects a direction of generation of the sound input to the first microphone;
The voice recognition means recognizes a voice input to the second microphone;
The control means includes
When the voice recognition unit does not detect a voice input to the second microphone, or when the voice recognition unit recognizes the voice input to the second microphone as a voice command, the camera generates the voice. When an instruction not to control the direction is not associated with the voice command, the imaging direction of the camera is controlled in the direction of generation of the voice detected by the detection unit;
When the voice recognition unit does not recognize the voice input to the second microphone as a voice command, or when the voice is recognized as a voice command, an instruction not to control the camera in the voice generation direction is given. 3. The camera control device according to claim 2, wherein when associated with the voice command, control of the imaging direction of the camera in the direction of generation of the voice detected by the detection unit is suppressed.

A voice input button for the user to input a voice command;
The camera control apparatus according to claim 3, wherein the voice recognition unit starts voice recognition triggered by pressing of the voice input button.

A voice input button for the user to input a voice command;
The acquisition means includes a first microphone and a second microphone,
The detecting means detects a direction of generation of the sound input to the first microphone;
The voice recognition means recognizes the voice input to the second microphone by starting voice recognition triggered by pressing of the voice input button;
The control means includes
When the voice input button is not pressed, or when the voice input button is pressed and the voice recognition means recognizes the voice input to the second microphone as a voice command, the camera generates the voice. When an instruction not to control the direction is not associated with the voice command, the imaging direction of the camera is controlled in the direction of generation of the voice detected by the detection unit;
When the voice input button is pressed and the voice recognition unit does not recognize the voice input to the second microphone as a voice command, or when the voice is recognized as a voice command, the camera recognizes the voice. The control of the imaging direction of the camera in the direction of generation of the voice detected by the detection unit is suppressed when an instruction not to control in the generation direction is associated with the voice command. The camera control device described in 1.

The voice recognition means sequentially acquires voice recognition results in the middle by performing voice recognition at a predetermined time interval after the acquisition means starts acquiring voice,
The control means executes control of the imaging direction of the camera when the midway voice recognition result is obtained, and when the midway voice recognition result is a voice command, the control means detects the voice detected by the detection means. The camera control device according to claim 1, wherein control of the imaging direction of the camera in the generation direction is suppressed.

The camera control device according to claim 7, wherein, while the interim voice recognition results are being acquired, the control means switches between controlling the imaging direction of the camera to the direction of generation of the voice detected by the detection means and suppressing that control only when the difference between the voice recognition score of the voice command and the voice recognition score for other than a voice command is larger than a threshold value.

The control means reduces the control speed of the camera as the difference between the voice recognition score of the voice command and the voice recognition score other than the voice command is smaller while acquiring the voice recognition result on the way. The camera control device according to claim 7, characterized in that:

The voice recognition means sequentially acquires voice recognition results in the middle by performing voice recognition at a predetermined time interval after the acquisition means starts acquiring voice,
The control means executes the control of the imaging direction of the camera at the time of obtaining the voice recognition result on the way,
When the voice recognition result on the way is a voice command associated with an instruction to control the camera in the voice generation direction, the imaging direction of the camera is set to the voice generation direction detected by the detection means. Control
When the voice recognition result in the middle is a voice command associated with an instruction not to control the camera in the voice generation direction, the imaging direction of the camera is set to the voice generation direction detected by the detection means. The camera control device according to claim 2, wherein control is suppressed.

A camera control method for controlling the operation of a camera, comprising:
An acquisition step in which acquisition means acquires voice;
A detection step in which detection means detects the direction of generation of the voice acquired in the acquisition step;
A voice recognition step in which voice recognition means recognizes the voice acquired in the acquisition step; and
A control step in which control means controls the imaging direction of the camera to the direction of generation of the voice detected in the detection step,
Wherein, in the control step, when the voice is recognized as a voice command in the voice recognition step, controlling the imaging direction of the camera to the direction of generation of the voice detected in the detection step is suppressed.

A camera control method for controlling the operation of a camera, comprising:
An acquisition step in which acquisition means acquires voice;
A detection step in which detection means detects the direction of generation of the voice acquired in the acquisition step;
A voice recognition step in which voice recognition means recognizes the voice acquired in the acquisition step; and
A control step in which control means controls the imaging direction of the camera to the direction of generation of the voice detected in the detection step,
Wherein, in the control step:
When the voice is recognized as a voice command in the voice recognition step and an instruction to control the camera to the direction of generation of the voice is associated with the voice command, the imaging direction of the camera is controlled to the direction of generation of the voice detected in the detection step; and
When the voice is recognized as a voice command in the voice recognition step and an instruction not to control the camera to the direction of generation of the voice is associated with the voice command, controlling the imaging direction of the camera to the direction of generation of the voice detected in the detection step is suppressed.

A program for causing a computer to function as each unit of the camera control device according to any one of claims 1 to 10.