JP2013016929A

JP2013016929A - Imaging apparatus, imaging method, and program

Info

Publication number: JP2013016929A
Application number: JP2011146768A
Authority: JP
Inventors: Ai Hata; 愛秦
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2011-06-30
Filing date: 2011-06-30
Publication date: 2013-01-24

Abstract

PROBLEM TO BE SOLVED: To provide an imaging apparatus, an imaging method, and a program capable of collecting the voice uttered by a person even when the person is not included in the imaging range. SOLUTION: When a conference terminal 1 that is imaging participants 53-55 is panned so that the imaging direction becomes A3 and the participants fall outside the camera imaging range B1, the participants 53-55 do not appear in the image P3 and no face is detected. In this case, the directivity direction of the array microphone is set to C3, and its sound collection range is set to D3, which is the full 360 degrees around the conference terminal 1 minus the area specified by the imaging direction A3 and the imaging range B1. The area where the participants 53-55 are located thus reliably becomes a sound collection target, while sound collection from the area where the participants 53-55 have been determined to be absent is avoided. The voice uttered by the participants 53-55 can therefore be collected reliably and clearly.

Description

The present invention relates to an imaging apparatus in which imaging means and sound collecting means are configured integrally, and to an imaging method and a program.

Imaging apparatuses are known in which a camera that captures images and a microphone that collects sound are integrated in a single housing. For example, a conference terminal device used for remote conferencing uses such an imaging apparatus to capture images and collect sound at its own site, and exchanges image and audio data with terminal devices at other sites via a network.

Among such imaging apparatuses, one that uses a unidirectional microphone for sound collection is known, with the aim of picking up the voice of a speaker in a conference reliably and clearly (see, for example, Patent Document 1). The imaging apparatus (microphone with camera) described in Patent Document 1 is configured so that the angle of view of the camera is approximately equal to the unidirectional range of the microphone. When no face can be recognized in the image captured by the camera, the apparatus does not capture audio through the microphone, so that unnecessary sound is not captured unless a speaker appears in the image.

Imaging apparatuses that use a known array microphone for sound collection are also known (see, for example, Patent Document 2). An array microphone consists of a plurality of omnidirectional microphones arranged in an array, and can be given directivity in an arbitrary direction by electrical control. The imaging apparatus (video camera with built-in microphone) described in Patent Document 2 links the directivity characteristics of the array microphone to the swing angle and zoom angle of the camera. As a result, when the camera is turned toward a speaker, the array microphone is directed toward that speaker, and when the camera zooms in on the speaker, the directivity is narrowed sharply onto that speaker, so that the voice of the speaker shown by the camera can be picked up effectively.

Patent Document 1: JP 2009-49734 A
Patent Document 2: JP-A-10-155107

However, the inventions described in Patent Documents 1 and 2 assume that the speaker appears in the camera image, and determine the directivity of the microphone from the camera image or the camera orientation. Consequently, when an object other than the speaker, or another participant, is shown by the camera (for example, when the speaker gives an explanation while the camera shows a whiteboard), Patent Document 1 has the problem that audio capture by the microphone is cut off. Patent Document 2 has the problem that the directivity of the array microphone, which is linked to the camera, is directed at the whiteboard the camera is facing, so the voice of the speaker cannot be captured clearly.

The present invention has been made to solve the above problems, and its object is to provide an imaging apparatus, an imaging method, and a program capable of collecting the voice uttered by a person even when that person is not included in the imaging range.

According to a first aspect of the present invention, there is provided an imaging apparatus comprising: imaging means for capturing an image; a plurality of sound collecting means, configured integrally with the imaging means, for collecting sound; control means for controlling, based on the imaging range of the imaging means, the directivity direction and sound collection range in which the plurality of sound collecting means collect sound; and first determination means for determining, based on the image captured by the imaging means, whether a person is included in the imaging range. When the first determination means determines that no person is included in the imaging range, the control means controls the directivity direction and sound collection range of the sound collecting means so that, of the region from which the sound collecting means can collect sound, at least part of the region outside the imaging range of the imaging means becomes the region targeted for sound collection by the sound collecting means.

According to the first aspect, if no person is included in the imaging range of the imaging means, at least part of the region outside the imaging range can be made the sound collection target region. The region where the person is located can therefore be included in the sound collection range, and the voice uttered by the person can be collected reliably. In addition, the region inside the imaging range, which contains no person, is excluded from the sound collection target region, so noise originating in that region is not collected and the voice of the person can be collected more clearly.
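The region selection described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation disclosed in the patent: the `Sector` type, the `complement` and `contains` helpers, and the convention of measuring angles in degrees around the terminal are all hypothetical. The sketch takes the whole region outside the imaging range, which is one way of satisfying "at least part" of that region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sector:
    """Angular region around the terminal (hypothetical representation)."""
    center_deg: float  # direction of the sector center, in [0, 360)
    width_deg: float   # total angular width of the sector


def complement(imaging: Sector) -> Sector:
    """Region outside the imaging range: opposite center, remaining width."""
    return Sector((imaging.center_deg + 180.0) % 360.0,
                  360.0 - imaging.width_deg)


def contains(sector: Sector, angle_deg: float) -> bool:
    """True if angle_deg lies inside the sector (boundaries included)."""
    # Signed angular difference folded into [-180, 180).
    diff = (angle_deg - sector.center_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= sector.width_deg / 2.0


# Example: camera faces 90 degrees with a 60-degree imaging range and
# no person is detected, so sound is collected from everywhere else.
imaging_range = Sector(center_deg=90.0, width_deg=60.0)
collection_range = complement(imaging_range)
```

With these example values the collection range is centered on 270 degrees and is 300 degrees wide, so a talker at 200 degrees is covered while the empty area in front of the camera is not.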

The imaging apparatus according to the first aspect may further comprise second determination means for determining whether the imaging range of the imaging means has changed. In this case, when the second determination means determines that the imaging range has changed, the control means may control the directivity direction and sound collection range of the sound collecting means based on the content of the change in the imaging range.

When the imaging range changes, the person is expected to be in the imaging range as it was before the change. Therefore, if the control means controls the directivity direction and sound collection range of the sound collecting means based on the content of the change in the imaging range, the region where the person is located can be reliably included in the sound collection target region. The voice uttered by the person can thus be collected reliably and more clearly.

In the first aspect, when the second determination means determines that the imaging range has changed, that the change in the imaging range is caused by a change in the imaging direction of the imaging means, and that the imaging direction after the change cannot be specified from the imaging direction before the change, the control means may control the directivity direction and sound collection range of the sound collecting means so that, of the region from which the sound collecting means can collect sound, the entire region outside the imaging range of the imaging means becomes the region targeted for sound collection by the sound collecting means.

When the imaging range changes, the person is expected to be in the imaging range as it was before the change. However, if the imaging direction after the change cannot be specified from the imaging direction before the change, the directivity direction and sound collection range used before the change cannot be specified relative to the imaging direction after the change. The control means therefore controls the directivity direction and sound collection range of the sound collecting means so that sound is collected from the entire region outside the imaging range of the imaging means, within the region from which sound can be collected. This reliably includes the region where the person is located in the sound collection target region while avoiding sound collection from the region known to contain no person, so the voice uttered by the person can be collected reliably and more clearly.

In the first aspect, when the second determination means determines that the imaging range has changed, that the change in the imaging range is caused by a change in the imaging direction of the imaging means, and that the imaging direction after the change can be specified from the imaging direction before the change, the control means may control the directivity direction and sound collection range of the sound collecting means so that sound is collected from the region that was the sound collection target of the sound collecting means before the imaging range changed.

When the imaging range changes, the person is expected to be in the imaging range as it was before the change. If, in addition, the imaging direction after the change can be specified from the imaging direction before the change, the directivity direction and sound collection range used before the change can be specified relative to the imaging direction after the change. The control means therefore controls the directivity direction and sound collection range of the sound collecting means so that sound is collected from the region that was the sound collection target before the imaging range changed. This reliably makes the region where the person is located the sound collection target region while avoiding sound collection from regions containing no person, so the voice uttered by the person can be collected reliably and more clearly.
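One way to picture this case: because the microphones are built into the same housing as the camera, panning the housing also rotates the microphone array, so the directivity direction has to be turned back by the pan amount to keep pointing at the previous target region. The sketch below assumes that directions are expressed in degrees relative to the housing, that positive angles are counter-clockwise, and that the pan amount is already known. The function name `compensate_pan` and these conventions are illustrative assumptions, not taken from the patent.

```python
def compensate_pan(mic_dir_rel_deg: float, pan_delta_deg: float) -> float:
    """Return the housing-relative directivity direction that keeps the
    microphone array aimed at the same place after the housing has been
    panned by pan_delta_deg (assumed known, counter-clockwise positive).
    """
    # The housing rotated by +pan_delta, so the target appears rotated
    # by -pan_delta in the housing's own frame of reference.
    return (mic_dir_rel_deg - pan_delta_deg) % 360.0


# Example: the array pointed straight ahead (0 degrees) before a
# 30-degree pan; afterwards it must point 30 degrees the other way.
new_direction = compensate_pan(0.0, 30.0)
```

The sound collection range (the angular width) is left unchanged in this sketch; only its direction is re-expressed in the rotated frame.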

In the first aspect, when the second determination means determines that the imaging range has changed and that the change in the imaging range is caused by a change in the angle of view of the imaging means, the control means may control the directivity direction and sound collection range of the sound collecting means so that the region targeted for sound collection is the region obtained by excluding, from the sound collection target region before the change in the angle of view, the region that overlaps the imaging range after the change in the angle of view.

When the imaging range changes and the change is caused by a change in the angle of view, the person is expected to be in the region obtained by removing the imaging range after the change from the imaging range before the change. The control means therefore controls the directivity direction and sound collection range of the sound collecting means so that sound is collected from the region obtained by excluding, from the sound collection target region before the change in the angle of view, the region that overlaps the imaging range after the change. This reliably makes the region where the person is located the sound collection target region while avoiding sound collection from regions containing no person, so the voice uttered by the person can be collected reliably and more clearly.
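The exclusion described above amounts to subtracting one angular range from another. The following sketch assumes that the previous sound collection region and the new, narrower imaging range share the same center direction (as would be the case for a zoom centered on the image) and returns the two side bands that remain. The function name and the `(start, end)` interval representation in degrees are hypothetical choices made for illustration.

```python
def subtract_centered(old_width_deg: float,
                      new_width_deg: float,
                      center_deg: float) -> list[tuple[float, float]]:
    """Remove a centered range of new_width_deg from a centered range of
    old_width_deg and return the remaining (start, end) intervals.

    Assumes both ranges share center_deg; angles wrap at 360 degrees.
    """
    if new_width_deg >= old_width_deg:
        # The new imaging range covers the whole previous region.
        return []
    half_old = old_width_deg / 2.0
    half_new = new_width_deg / 2.0
    left = ((center_deg - half_old) % 360.0, (center_deg - half_new) % 360.0)
    right = ((center_deg + half_new) % 360.0, (center_deg + half_old) % 360.0)
    return [left, right]


# Example: a 60-degree region centered on 90 degrees, from which a
# zoomed-in 20-degree imaging range is excluded.
remaining = subtract_centered(60.0, 20.0, 90.0)
```

In this example the remaining sound collection region consists of the bands from 60 to 80 degrees and from 100 to 120 degrees.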

In the first aspect, the first determination means may recognize, in the image captured by the imaging means, a part having the features of a human face, and may determine that a person is included in the imaging range when the size of the recognized part is larger than a predetermined size.

If a part of the captured image that has the features of a human face but is no larger than the predetermined size is not detected as a person, then a person who is not an intended imaging subject but happens to be within the imaging range does not become a condition for controlling the sound collecting means. This prevents the control means from applying a wrong directivity direction and sound collection range, so the voice uttered by the person targeted for sound collection can be collected reliably.
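The size check can be illustrated as a filter over face detections. The sketch below assumes that some face detector (not shown, and not specified here) has already produced bounding boxes as `(x, y, width, height)` tuples in pixels. The function name `person_in_range` and the 40-pixel default threshold are hypothetical values chosen only for the example.

```python
def person_in_range(face_boxes: list[tuple[int, int, int, int]],
                    min_face_px: int = 40) -> bool:
    """Return True if at least one detected face is larger than the
    threshold in both width and height.

    face_boxes: (x, y, width, height) tuples from an assumed detector.
    min_face_px: illustrative stand-in for the predetermined size.
    """
    return any(w > min_face_px and h > min_face_px
               for (_x, _y, w, h) in face_boxes)


# A small face in the background is ignored; a large one counts.
background_only = person_in_range([(300, 40, 20, 22)])
with_participant = person_in_range([(300, 40, 20, 22), (120, 80, 90, 100)])
```

Using a strict greater-than comparison follows the wording above, in which faces at or below the predetermined size are not treated as a person.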

According to a second aspect of the present invention, there is provided an imaging method executed by a computer in order to operate an imaging apparatus in which imaging means for capturing an image and a plurality of sound collecting means for collecting sound are configured integrally. The method includes: a control step of controlling, based on the imaging range of the imaging means, the directivity direction and sound collection range in which the plurality of sound collecting means collect sound; and a first determination step of determining, based on the image captured by the imaging means, whether a person is included in the imaging range. When it is determined in the first determination step that no person is included in the imaging range, the directivity direction and sound collection range of the sound collecting means are controlled in the control step so that, of the region from which the sound collecting means can collect sound, at least part of the region outside the imaging range of the imaging means becomes the region targeted for sound collection by the sound collecting means.

According to a third aspect of the present invention, there is provided a program for operating an imaging apparatus in which imaging means for capturing an image and a plurality of sound collecting means for collecting sound are configured integrally. The program causes a computer to execute: a control step of controlling, based on the imaging range of the imaging means, the directivity direction and sound collection range in which the plurality of sound collecting means collect sound; and a first determination step of determining, based on the image captured by the imaging means, whether a person is included in the imaging range. When it is determined in the first determination step that no person is included in the imaging range, the directivity direction and sound collection range of the sound collecting means are controlled in the control step so that, of the region from which the sound collecting means can collect sound, at least part of the region outside the imaging range of the imaging means becomes the region targeted for sound collection by the sound collecting means.

The same effects as in the first aspect can be obtained by executing processing in accordance with the imaging method of the second aspect on the computer of the imaging apparatus, or by executing the program of the third aspect so that the computer functions as the imaging apparatus.

FIG. 1 is a perspective view of the conference terminal 1 and the PC 9.
FIG. 2 is a block diagram showing the electrical configuration of the conference terminal 1.
FIG. 3 is a flowchart of the program executed by the conference terminal 1.
FIG. 4 is a diagram showing the directivity direction C1 and the sound collection range D1 set in accordance with the imaging direction A1, the imaging range B1, and so on of the conference terminal 1.
FIG. 5 is a diagram showing the directivity direction C1 and the sound collection range D1 set in accordance with the imaging direction A2, the imaging range B1, and so on of the conference terminal 1.
FIG. 6 is a diagram showing the directivity direction C3 and the sound collection range D3 set in accordance with the imaging direction A3, the imaging range B1, and so on of the conference terminal 1.
FIG. 7 is a diagram showing the directivity direction C1 and the sound collection range D4 set in accordance with the imaging direction A1, the imaging range B4, and so on of the conference terminal 1.
FIG. 8 is a diagram showing the directivity direction C1 and the sound collection range D1 set in accordance with the imaging direction A1, the imaging range B5, and so on of the conference terminal 1.
FIG. 9 is a diagram showing the directivity direction C6 and the sound collection range D6 set in accordance with the imaging direction A6, the imaging range B1, and so on of the conference terminal 1.
FIG. 10 is a diagram showing the sound collection range D7 set in accordance with the imaging direction A7, the imaging range B1, and so on of the conference terminal 1.

A conference terminal 1, which is one embodiment of the imaging apparatus according to the present invention, is described below with reference to the drawings. The drawings are used to explain technical features that the present invention may adopt; the apparatus configurations, flowcharts of the various processes, and so on shown in them are merely illustrative examples.

First, the general configuration of the conference terminal 1 is described with reference to FIG. 1. The conference terminal 1 shown in FIG. 1 includes an array microphone 25, a speaker 27, a camera 28, and an operation unit 29. The conference terminal 1 is a device that can capture images with the camera 28, collect sound with the array microphone 25, and output sound from the speaker 27. The conference terminal 1 has a rotation shaft 3 at the upper end of its housing 4, and is configured so that part of the housing 4 rotates about the rotation shaft 3, allowing the lower end side to be opened and closed. By opening the lower end side of the housing 4, the user can put the housing 4 into a posture in which it stands on its own, that is, the posture for use (see FIG. 1). By closing the lower end side of the housing 4, the user can put the housing 4 into a folded posture, that is, the posture when not in use (not shown).

The conference terminal 1 collects (inputs) the sound of the site where it is installed through the array microphone 25, and captures (inputs) images through the camera 28. The array microphone 25 consists of two or more omnidirectional microphones arranged side by side. As described in detail later, the directivity direction and sound collection range of the array microphone 25 can be set by electrical control. In the present embodiment, the array microphone 25 uses three microphones.

The camera 28 is, for example, a fixed-focal-length digital camera equipped with an image sensor such as a CMOS or CCD sensor. The conference terminal 1 of the present embodiment is intended to be used placed on a table, for example, and operations such as panning and tilting that adjust the imaging direction of the camera 28 are performed by manually moving the housing 4 of the conference terminal 1. Zooming in the conference terminal 1 is performed electrically by so-called digital zoom. More specifically, because the camera 28 of the present embodiment is a fixed-focal-length digital camera, its angle of view is fixed, and zooming is a pseudo zoom realized by cropping and enlarging the captured image. In the following, the range that the camera 28 can capture (the range that fits within the captured image) is called the imaging range. The imaging range is expressed as an angular range referenced to the direction in which the camera 28 faces (the imaging direction), and is treated as synonymous with the angle of view in optical zoom (the capturable angular range, which changes as the zoom lens moves and the focal length changes). For convenience, therefore, enlargement and reduction of the imaging range by digital zoom are sometimes described as changes in the angle of view.
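To make the equivalence between digital zoom and a change in the angle of view concrete, the sketch below converts a crop-based zoom factor into an equivalent angle of view. It assumes an ideal pinhole camera and a crop centered on the image, neither of which is stated in the text; the function name `equivalent_view_angle_deg` is hypothetical. Under those assumptions, cropping the image by a factor z reduces the half-angle tangent by the same factor.

```python
import math


def equivalent_view_angle_deg(native_angle_deg: float,
                              zoom_factor: float) -> float:
    """Angle of view equivalent to a centered digital zoom.

    Assumes a pinhole model: tan(theta' / 2) = tan(theta / 2) / z,
    where z >= 1 is the crop-and-enlarge (digital zoom) factor.
    """
    if zoom_factor < 1.0:
        raise ValueError("zoom_factor must be at least 1.0")
    half = math.radians(native_angle_deg) / 2.0
    return math.degrees(2.0 * math.atan(math.tan(half) / zoom_factor))


# Example: a 90-degree fixed angle of view with 2x digital zoom
# behaves roughly like a 53-degree angle of view.
zoomed_angle = equivalent_view_angle_deg(90.0, 2.0)
```

Note that the equivalent angle is not simply the native angle divided by the zoom factor, except as an approximation for narrow angles of view.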

The operation unit 29 of the conference terminal 1 is provided with a power button, volume control buttons, a microphone mute button, and the like. The conference terminal 1 also has a USB interface 21 (see FIG. 2) and can be electrically connected to external devices. In the present embodiment, the conference terminal 1 is connected to, for example, a personal computer (hereinafter abbreviated as "PC") 9. The PC 9 is a general-purpose computer that performs various kinds of information processing such as data communication and image display.

The PC 9 shown in FIG. 1 is a notebook PC and includes a display device 6, an operation unit 7, and so on, but a desktop PC that does not itself include devices such as a display device and an operation unit may of course be used instead. The PC 9 and the conference terminal 1 are electrically connected by a USB cable 2. The connection between the PC 9 and the conference terminal 1 is not limited to the USB cable 2; various connection methods can be used, such as wireless communication (for example, WiFi (registered trademark)), optical communication (for example, infrared), and IEEE 1394.

The audio data collected by the array microphone 25 and the image data captured by the camera 28 are transmitted to the PC 9 via the USB cable 2. The conference terminal 1 also outputs sound from the speaker 27 based on audio data received from the PC 9.

By using the PC 9 and the conference terminal 1, the user can hold a remote conference with images (a video conference). Specifically, the PC 9 transmits the audio and image data input from the conference terminal 1 to a communication device, such as a PC, located at another site via a network 8 such as the Internet (see FIG. 2). At the same time, the PC 9 receives audio and image data of the other site from the communication device located there. The PC 9 displays the image of the other site on the display device 6 based on the received image data. The PC 9 also causes the speaker 27 of the connected conference terminal 1 to output the sound of the other site based on the received audio data. As a result, the audio and images of the multiple sites are shared, and the conference proceeds smoothly even when not all users are at the same site.

The configurations of the PC 9 and the conference terminal 1 can be changed as appropriate. For example, the sound received from another site may be output from a speaker built into the PC 9, without using the speaker 27 of the conference terminal 1. Alternatively, a small camera may be connected to a PC that has an array microphone, a speaker, and a display device, and the video conference may be held using that PC as the conference terminal. The PC may of course have a built-in camera. As another alternative, the conference terminal 1 may additionally have the function of transmitting audio and image data, and the PC 9 may be used as a device for outputting the sound and displaying the images received from the conference terminal 1 at another site. The conference terminal 1 does not necessarily have to be used for video conferencing; it is sufficient for it to function simply as a device that captures images and collects sound, with the PC 9 displaying images and outputting sound based on the image and audio data received from the conference terminal 1.

Next, the electrical configuration of the conference terminal 1 is described with reference to FIG. 2. The conference terminal 1 includes a CPU 11 that controls the conference terminal 1. A ROM 12, a RAM 13, a flash memory 14, and an input/output interface (I/F) 16 are connected to the CPU 11 via a bus 15.

The ROM 12 stores the program for operating the conference terminal 1, initial values, and the like. The RAM 13 temporarily stores various kinds of information. The flash memory 14 is a nonvolatile storage device. The USB interface (I/F) 21, an audio input processing unit 22, a directivity control unit 26, an audio output processing unit 23, a video input processing unit 24, and the operation unit 29 are connected to the input/output interface 16. The USB interface 21 connects the conference terminal 1 to the PC 9. The audio input processing unit 22 processes the audio signals from the array microphone 25, which are input via the directivity control unit 26, and generates audio data. The directivity control unit 26 performs processing to control the directivity direction and sound collection range of the array microphone 25. The audio output processing unit 23 handles the operation of the speaker 27. The video input processing unit 24 processes the image signal from the camera 28 and generates image data.

Here, the operating principle of the processing performed in the directivity control unit 26 to control the directivity direction and the sound collection range of the array microphone 25 is briefly described. Sound reaching the individual microphones arranged in an array arrives at different times depending on the direction from which it arrives relative to the direction in which the microphones are lined up. For example, when sound arrives from the direction orthogonal to the line of microphones (for convenience, the "front direction"), it reaches every microphone at the same time. The individual microphones of the array microphone 25 therefore output audio signals that, when summed electrically in the audio input processing unit 22, yield an output amplified by a factor corresponding to the number of microphones. In contrast, when sound arrives from a direction oblique to the line of microphones (for convenience, the "oblique direction", which also includes the side), it reaches the microphones closer to the sound source earlier, so a time difference (phase shift) arises among the signals acquired by the individual microphones. The gain obtained when the audio signals from the array microphone 25 are summed electrically in the audio input processing unit 22 then depends on the arrival angle of the sound at each microphone and on the microphone spacing (or positions), and it is smaller than for sound arriving from the front direction. Because the spacing of the individual microphones is known in advance, the direction of the sound source can be determined by having the directivity control unit 26 acquire the time differences of the sound picked up by each microphone and having the CPU 11 analyze them.
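The principle above can be illustrated with a small numerical sketch. It assumes a uniform linear array, a far-field single-frequency source, and a speed of sound of 343 m/s; none of these values, nor the function names `summed_gain` and `arrival_angle_deg`, appear in the specification.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not given in the text


def summed_gain(num_mics, spacing_m, angle_deg, freq_hz):
    """Magnitude of the plain (undelayed) sum of all microphone outputs for a
    unit-amplitude tone arriving angle_deg away from the front direction."""
    # Far-field arrival-time step between neighbouring microphones.
    step = spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    total = sum(cmath.exp(-2j * math.pi * freq_hz * n * step)
                for n in range(num_mics))
    return abs(total)


def arrival_angle_deg(time_diff_s, spacing_m):
    """Arrival angle implied by the time difference between two adjacent mics."""
    ratio = time_diff_s * SPEED_OF_SOUND / spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding outside [-1, 1]
    return math.degrees(math.asin(ratio))
```

For sound from the front direction the sum equals the number of microphones, and it falls below that value for oblique arrival, which matches the behaviour described in the paragraph above.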

また、指向性制御部２６では、アレイマイク２５の個々のマイクで集音した音声をそれぞれ遅延させた上で音声入力処理部２２に出力することができる。このことは、個々のマイクの出力に対する遅延時間を制御することにより、所定の斜め方向から到達する音声を足し合わせた場合のゲインを最大とすることができることを意味する。言い換えると、個々のマイクからの出力を指向性制御部２６において電気的に制御して遅延させることにより、所望する方向に対し、アレイマイク２５が指向性を得ることができる。 Further, the directivity control unit 26 can delay the sound collected by the individual microphones of the array microphone 25 and output the delayed sound to the sound input processing unit 22. This means that by controlling the delay time with respect to the output of each microphone, the gain can be maximized when sounds arriving from a predetermined diagonal direction are added. In other words, the array microphone 25 can obtain directivity with respect to a desired direction by electrically controlling and delaying the output from each microphone in the directivity control unit 26.
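As a hedged sketch of the delay control just described, the following computes per-microphone delays that align sound from a chosen oblique direction, then evaluates the summed gain with those delays applied. The uniform linear array, the 343 m/s speed of sound, and the names `steering_delays` and `steered_gain` are illustrative assumptions, not part of the specification.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value


def steering_delays(num_mics, spacing_m, steer_deg):
    """Non-negative delay (seconds) for each microphone so that sound from
    steer_deg lines up in time across the array."""
    step = spacing_m * math.sin(math.radians(steer_deg)) / SPEED_OF_SOUND
    arrivals = [n * step for n in range(num_mics)]
    latest = max(arrivals)
    # Microphones reached earlier wait until the last one has been reached.
    return [latest - t for t in arrivals]


def steered_gain(delays, spacing_m, angle_deg, freq_hz):
    """Magnitude of the delayed sum for a unit tone arriving from angle_deg."""
    step = spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    total = sum(cmath.exp(-2j * math.pi * freq_hz * (n * step + d))
                for n, d in enumerate(delays))
    return abs(total)
```

With delays computed for 30 degrees, a tone from 30 degrees sums to the full microphone count, while a tone from the front direction sums to less.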

As described above, the output gain of the array microphone 25, whose directivity is obtained by delay control, is greatest for sound arriving from one particular direction and drops when sound arrives from a direction even slightly off that direction. That is, if delay control is performed so that the sound collected by the individual microphones is shifted uniformly according to the microphone spacing, the array microphone 25 can be controlled to have narrow directivity, and the sound collection range (the angular range over which sound can be collected, centered on the directivity direction) can be made narrow. If, instead of uniform delays, a combination of delay times obtained in advance by calculation or the like is applied to the individual microphones, the array microphone 25 can be controlled to have wide directivity, widening the sound collection range. Furthermore, if the microphones are divided into several groups and the delay control is varied from group to group, the array microphone 25 can be given a plurality of directivity directions. In the present embodiment, based on this operating principle, the directivity control unit 26 delays the sound collected by the individual microphones in accordance with calculations by the CPU 11, thereby controlling the directivity direction and the sound collection range of the array microphone 25. In the present embodiment, as stated above, the sound collection range refers to the angular range of directions, centered on the directivity direction, from which the array microphone 25 can collect sound.

また、本実施の形態の会議端末１では、カメラ２８で撮像した画像に映される人物が発する音声を確実に拾うことができるように、アレイマイク２５の指向方向と集音範囲の制御が、画像の解析結果に応じて行われる。具体的には、カメラ２８によって撮像した画像に人物の顔が含まれるか否かを判断するための画像解析と、カメラ２８の水平方向における回転（パン）によって向きが変更されたか否かを判断するための画像解析とが行われる。人物の顔を検出する画像解析は、例えば目、鼻、口など顔の特徴を有する部分を画像から抽出し、相対位置や大きさなどをテンプレートと比較したり、あるいは幾何学的に解析したりする公知の方法により行われる。 Further, in the conference terminal 1 of the present embodiment, the control of the directivity direction and the sound collection range of the array microphone 25 is performed so that the sound emitted by the person reflected in the image captured by the camera 28 can be reliably picked up. This is performed according to the analysis result of the image. Specifically, it is determined whether or not the orientation has been changed by image analysis for determining whether or not a human face is included in the image captured by the camera 28 and rotation (panning) of the camera 28 in the horizontal direction. Image analysis is performed. Image analysis that detects the face of a person, for example, extracts parts with facial features such as eyes, nose and mouth from the image, compares the relative position and size with the template, or analyzes geometrically. This is performed by a known method.

In the present embodiment, even if the relative positions of the parts having facial features match the template, the parts are not detected as a person's face when their size is less than a predetermined size. In other words, even if image analysis finds a portion of the image with the features of a person's face, it is not judged to be a person's face when it is smaller than the predetermined size. As a result, even if a person located far from the conference terminal 1 falls within the imaging range of the camera 28 and appears in the image, that person is excluded from detection.
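The size rule can be sketched as a filter over face candidates. The `FaceCandidate` type, the use of bounding-box height as the size measure, and the 40-pixel threshold are hypothetical; the specification only states that a predetermined size is used.

```python
from dataclasses import dataclass
from typing import Iterable

MIN_FACE_HEIGHT_PX = 40  # hypothetical threshold; the text gives no value


@dataclass
class FaceCandidate:
    """Region whose features matched the face template (illustrative type)."""
    x: int
    y: int
    width: int
    height: int


def count_faces(candidates: Iterable[FaceCandidate],
                min_height_px: int = MIN_FACE_HEIGHT_PX) -> int:
    """Count only candidates large enough to be treated as a person's face."""
    return sum(1 for c in candidates if c.height >= min_height_px)
```

A distant person produces a small candidate region and is therefore not counted.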

The image analysis for detecting the orientation of the camera 28 is performed by a known method that, for example, detects whether a feature object appearing in both the latest image and the previously captured image has shifted position within the image. As described above, the conference terminal 1 is used, for example, placed on a table, and the camera 28 is panned or tilted by a conference participant or the like moving the conference terminal 1 by hand. In other words, the conference terminal 1 has no drive mechanism for panning or tilting, and no panning or tilting is performed under control from the PC 9. The conference terminal 1 therefore cannot detect the imaging direction of the camera 28 from a pan/tilt control mechanism. Instead, the conference terminal 1 detects a change in the orientation of the camera 28 from the analysis of the images captured by the camera 28. A feature object is, for example, an object for which a closed contour line can be detected. When the position of the feature object has shifted, the conference terminal 1 obtains the magnitude of the shift within the image (for example, the number of dots in the horizontal direction) and, using a table or calculation formula prepared in advance, determines in which direction and by how many degrees the camera 28 has rotated. These image analysis methods are merely examples, and various known image analysis methods can be applied.

Next, following the flowchart of FIG. 3 and referring to FIGS. 4 to 10, the specific flow of processing by which the directivity direction and the sound collection range of the array microphone 25 are controlled in the conference terminal 1 is described. The program for executing the processing shown in FIG. 3 is stored in the ROM 12, and the CPU 11 executes the processing according to the program.

会議端末１は、例えば会議室などに、使用時の姿勢で会議の参加者の方に向けられて設置され、ＰＣ９に接続される。ユーザ（参加者の一人であってもよい）によって操作部２９の電源ボタンがＯＮにされると、ＰＣ９との通信が開始されて、会議端末１は待機状態となる（Ｓ１１：ＮＯ）。さらにユーザがＰＣ９を操作することによって、ＰＣ９から撮像開始の指示信号を受信すると（Ｓ１１：ＹＥＳ）、ＣＰＵ１１は、カメラ２８による撮像と、アレイマイク２５による集音とを開始する。また、ＣＰＵ１１は、他の拠点に配置された通信装置からＰＣ９が受信した音声のデータに基づいて、スピーカ２７から他の拠点の音声の発生（出力）を開始する。 The conference terminal 1 is installed, for example, in a conference room so as to be directed toward the conference participants in a posture during use, and is connected to the PC 9. When the power button of the operation unit 29 is turned on by a user (which may be one of the participants), communication with the PC 9 is started and the conference terminal 1 enters a standby state (S11: NO). When the user further operates the PC 9 to receive an imaging start instruction signal from the PC 9 (S11: YES), the CPU 11 starts imaging by the camera 28 and sound collection by the array microphone 25. Further, the CPU 11 starts generating (outputting) the voice of the other base from the speaker 27 based on the voice data received by the PC 9 from the communication device arranged at the other base.

In the present embodiment, as shown in FIG. 4, the conference is assumed to be imaged by the conference terminal 1 installed on the near side of a table 52 placed in the center of a conference room 50. In the conference room 50, three participants 53, 54, and 55 are seated around the table 52, on which documents 51 are placed. A whiteboard 56 is set up on the near right side of the table 52, and a flower 57 is displayed at the far right.

At the start of imaging, the camera 28 is set so that no zoom is applied, and the image captured by the camera 28 shows the objects within the maximum angular range that the camera 28 can capture. The front direction of the conference terminal 1 is directed to the center of the table 52. In the following description, this direction is referred to as the imaging direction A1 for convenience. The signal from the camera 28 capturing the conference room 50 is input to the video input processing unit 24, and the data of an image P1 is generated. The image P1 shows the persons (participants 53, 54, 55) and objects (documents 51, table 52, flower 57) included in the imaging range B1 (shown by thick solid lines) that the camera 28 can capture.

At the start of imaging by the conference terminal 1, the directivity direction C1 of the array microphone 25 is set to the front direction of the camera 28, that is, the same direction as the imaging direction A1 (the front direction of the conference terminal 1). In order to match the sound collection range D1 of the array microphone 25 to the initial angle of view of the camera 28, the CPU 11 further performs a calculation based on the imaging direction A1 and the imaging range B1, using a formula or table set in advance according to the operating principle described above. The directivity control unit 26 sets the delay time of each microphone of the array microphone 25 according to the result of the calculation performed by the CPU 11. The audio signals of the conference room 50 collected by the array microphone 25, with its directivity direction C1 and sound collection range D1 thus controlled, are input to the audio input processing unit 22 and summed to generate audio data. The image data generated in the video input processing unit 24 and the audio data generated in the audio input processing unit 22 are streamed to the PC 9 via the USB cable 2.

次に図３に示すように、ＣＰＵ１１は、画像Ｐ１の画像解析を行い、画像Ｐ１に映る人物（つまり参加者５３〜５５）の顔の検出を行い、検出された参加者の人数をカウントする（Ｓ１２）。画像Ｐ１からは３人の参加者５３〜５５の顔（人の顔の特徴を有する部位）が認識される。ＣＰＵ１１は、会議の参加者の人数が３であるとして（Ｓ１２）、ＲＡＭ１３（フラッシュメモリ１４でもよい。）に一時的に記憶する。 Next, as shown in FIG. 3, the CPU 11 performs image analysis of the image P 1, detects the face of the person (that is, the participants 53 to 55) shown in the image P 1, and counts the number of detected participants. (S12). From the image P1, the faces of the three participants 53 to 55 (parts having human face characteristics) are recognized. The CPU 11 temporarily stores in the RAM 13 (may be the flash memory 14) assuming that the number of participants in the conference is 3 (S12).

Image capture by the camera 28 and sound collection by the array microphone 25 continue, and the generated image data and audio data are streamed to the PC 9. Even if the conference terminal 1 is rotated horizontally (panned) during this time, the video input processing unit 24 generates image data of the image captured in the direction in which the camera 28 is pointed. When the CPU 11 receives a zoom instruction signal from the PC 9 as a result of a user operation on the PC 9, the video input processing unit 24 trims and enlarges the image according to the instructed zoom magnification to generate image data. In this case, the angle of view corresponding to the zoom magnification is calculated using a predetermined formula or table and is temporarily stored in the RAM 13 (or the flash memory 14) as the current imaging range.
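The specification does not give the formula that maps zoom magnification to angle of view. One plausible choice, assuming a pinhole camera and a centered crop, is sketched below; the function name `zoomed_view_angle` is hypothetical.

```python
import math


def zoomed_view_angle(full_angle_deg, zoom):
    """Angle of view after a centered digital crop by the given zoom factor,
    under an assumed pinhole camera model."""
    if zoom < 1.0:
        raise ValueError("zoom magnification must be at least 1")
    half = math.radians(full_angle_deg) / 2.0
    # Cropping to 1/zoom of the width shrinks the tangent of the half-angle.
    return math.degrees(2.0 * math.atan(math.tan(half) / zoom))
```

A lookup table indexed by zoom step would serve equally well and matches the "formula or table" wording of the text.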

Image capture by the camera 28 and sound collection by the array microphone 25 continue for a predetermined time (S13: NO, S13), and when the predetermined time has elapsed (S13: YES), the processing of S15 to S30 is executed. In the processing of S15 to S30, the directivity direction and the sound collection range of the array microphone 25 are controlled. When the processing of S15 to S30 is performed, the latest image captured by the camera 28 is stored in the RAM 13 and used for image analysis by the CPU 11. Each time the processing of S15 to S30 is performed, the RAM 13 holds two images, the latest image and the previously captured image, and any image stored before those is overwritten.

First, the CPU 11 performs image analysis on the latest image newly captured and stored in the RAM 13 and determines whether a person's face can be detected in the image (S15). When the conference terminal 1 has been neither panned nor zoomed and the latest image is substantially the same as, for example, the previously captured image P1 of FIG. 4, the CPU 11 recognizes the faces of the three participants 53, 54, and 55, that is, detects persons (S15: YES). When the number of detected faces is three and has not decreased from the number of conference participants stored in S12 (S22: NO), the CPU 11 assumes that all participants are within the imaging range of the camera 28 and matches the sound collection range of the array microphone 25 to the current angle of view of the camera 28 (S23). That is, as shown in FIG. 4, the CPU 11 sets the directivity direction of the array microphone 25 to C1, the same as the imaging direction A1. Then, as before, it performs a calculation based on the imaging direction A1 and the imaging range B1 so that the sound collection range becomes D1, and sends the directivity control unit 26 an instruction for setting the delay time of each microphone of the array microphone 25.

このように、最新の画像Ｐ１内に参加者５３〜５５の全員が映っていれば、撮像範囲Ｂ１から集音を行えば参加者５３〜５５全員の発する音声を集音できると判断できる。ゆえに、ＣＰＵ１１は、撮像方向Ａ１を指向方向Ｃ１とし、演算により、撮像範囲Ｂ１と同じ大きさの集音範囲Ｄ１を求め、設定する。これにより、参加者５３〜５５の発する音声を確実に集音することができるのである。その後処理はＳ１３に戻る。 As described above, if all the participants 53 to 55 are reflected in the latest image P1, it can be determined that the sound emitted from all the participants 53 to 55 can be collected by collecting sound from the imaging range B1. Therefore, the CPU 11 sets the imaging direction A1 as the directivity direction C1, and obtains and sets a sound collection range D1 having the same size as the imaging range B1 by calculation. Thereby, the sound which participant 53-55 emits can be collected reliably. Thereafter, the process returns to S13.

Next, when the conference terminal 1 is panned to show the whiteboard 56 and the imaging direction of the camera 28 is turned to A2, for example as shown in FIG. 5, the participants 53 to 55 may no longer fall within the imaging range B1. In this case the participants 53 to 55 do not appear in the latest image P2, and in S15 the CPU 11 cannot detect a person's face even after analyzing the image P2 (S15: NO). If no zoom instruction signal has been received from the PC 9 (S16: NO), the rotation angle is estimated (the imaging direction is detected) by comparing the latest image P2 stored in the RAM 13 with the previously captured image P1 stored in the RAM 13 (S17).

As described above, the image analysis for detecting the imaging direction (the orientation of the camera 28) is performed by a known method in which the CPU 11 locates, in the latest image P2, a feature object (for example, the flower 57) that appears in the previous image P1 and detects whether its position has shifted. As shown in FIG. 5, the flower 57 is near the left edge in the image P2 but was near the right edge in the previous image P1. Because this positional shift, indicated by the arrow E1, has occurred, it is detected that the conference terminal 1 has been panned. Furthermore, because the imaging range B1 (the angle of view) is known, the amount of pan applied to the conference terminal 1, that is, the rotation angle of the conference terminal 1, is calculated from the size of the positional shift of the flower 57 within the images P1 and P2 relative to the width of the images P1 and P2.
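The rotation-angle calculation can be sketched as follows, assuming a pinhole camera so that a horizontal pixel position maps to a bearing through the arctangent. The text only says that a table or formula prepared in advance is used, so this particular formula and the names `pixel_bearing_deg` and `pan_angle_deg` are assumptions.

```python
import math
from typing import Optional


def pixel_bearing_deg(x_px, image_width_px, view_angle_deg):
    """Bearing of a horizontal pixel position relative to the optical axis."""
    focal_px = (image_width_px / 2.0) / math.tan(math.radians(view_angle_deg) / 2.0)
    return math.degrees(math.atan((x_px - image_width_px / 2.0) / focal_px))


def pan_angle_deg(prev_x_px: Optional[float], new_x_px: Optional[float],
                  image_width_px: float, view_angle_deg: float) -> Optional[float]:
    """Estimated pan between two frames from one feature's horizontal shift.

    Positive means the camera turned right (the feature moved left in the
    image). Returns None when the feature is missing from either frame,
    which corresponds to the case where the angle cannot be estimated.
    """
    if prev_x_px is None or new_x_px is None:
        return None
    return (pixel_bearing_deg(prev_x_px, image_width_px, view_angle_deg)
            - pixel_bearing_deg(new_x_px, image_width_px, view_angle_deg))
```

In the FIG. 5 situation the flower moves from near the right edge to near the left edge, so the estimate is a positive (rightward) pan of a little less than the full angle of view.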

When the rotation angle of the conference terminal 1 can be estimated (calculated) (S17: YES), the CPU 11 turns the sound collection direction back from the current imaging direction A2 toward the pre-rotation direction by the obtained rotation angle and sets that direction as the directivity direction. In the example shown in FIG. 5, the image analysis above shows that the imaging direction was turned from A1 (see FIG. 4) to A2 (see FIG. 5), so the directivity direction is set to C1, the direction before rotation. The CPU 11 also sets the sound collection range of the array microphone 25 to D1, the sound collection range before rotation (S20).

In this way, if the participants 53 to 55 who appeared in the previous image P1 do not appear in the latest image P2, it can be judged that only the conference terminal 1 was panned. Therefore, when the rotation angle of the pan is known from the image analysis, the CPU 11 matches the directivity direction and the sound collection range of the array microphone 25 to the directivity direction and the sound collection range before rotation. That is, even if the participants 53 to 55 no longer appear in the image P2 because the whiteboard 56 is being shown, the sound they utter can be reliably collected. Thereafter, the process returns to S13.

Furthermore, when the conference terminal 1 is panned to show the whiteboard 56, the imaging direction of the camera 28 may be turned to A3, at which neither the participants 53 to 55 nor the feature object (flower 57) is included in the imaging range B1, for example as shown in FIG. 6. In this case, the CPU 11 cannot detect a person's face in the latest image P3 (S15: NO). If no zoom instruction signal has been received (S16: NO), the CPU 11 estimates the rotation angle (detects the imaging direction) as above (S17). Because the image analysis cannot find, in the latest image P3, the feature object (flower 57) that appears in the previous image P1 (see FIG. 4), the CPU 11 determines that the rotation angle cannot be estimated (S17: NO).

In this case, the CPU 11 sets the directivity direction to C3, the direction opposite to the current imaging direction A3, and sets the sound collection range of the array microphone 25 to D3, the range that remains when the imaging range B1, which is the range of the current angle of view of the camera 28, is removed from the full 360° range (S18). In other words, the directivity of the array microphone 25 is set outside the angle of view of the camera 28.
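The S18 setting amounts to simple angle arithmetic, sketched below with bearings in degrees measured around the terminal. The function names and the degree-based representation are illustrative assumptions.

```python
def outside_view_settings(imaging_dir_deg, view_angle_deg):
    """Directivity direction and sound collection width for the S18 case:
    point opposite to the camera and cover everything outside its view."""
    direction = (imaging_dir_deg + 180.0) % 360.0
    width = 360.0 - view_angle_deg
    return direction, width


def is_collected(bearing_deg, direction_deg, width_deg):
    """True if a source bearing lies inside the sound collection range."""
    # Signed smallest difference between the two bearings, in [-180, 180).
    diff = ((bearing_deg - direction_deg + 180.0) % 360.0) - 180.0
    return abs(diff) <= width_deg / 2.0
```

For a camera pointed at 30 degrees with a 60-degree angle of view, the range is 300 degrees wide and centered on 210 degrees, so a source straight ahead of the camera is excluded while one behind it is collected.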

このように、最新の画像Ｐ３内に前回の画像Ｐ１に映る参加者５３〜５５が映っていなければ、上記同様、会議端末１のみがパンされたものと判断することができる。このとき、パンによる回転角度が画像解析から判らない場合、ＣＰＵ１１は、アレイマイク２５の指向方向と集音範囲を現在のカメラ２８の画角外に設定する。これにより、参加者５３〜５５がいないと判っている範囲からは集音せず、それ以外の範囲から集音することができる。つまり、ホワイトボード５６を映すために参加者５３〜５５が画像Ｐ３に映らなくなっても、参加者５３〜５５の発する音声を確実に集音することができる。その後処理はＳ１３に戻る。 Thus, if the participants 53 to 55 shown in the previous image P1 are not shown in the latest image P3, it can be determined that only the conference terminal 1 is panned as described above. At this time, if the rotation angle due to pan is not known from the image analysis, the CPU 11 sets the directivity direction and the sound collection range of the array microphone 25 outside the current angle of view of the camera 28. As a result, sound is not collected from a range where it is known that there are no participants 53 to 55, and sound can be collected from other ranges. That is, even if the participants 53 to 55 are not shown in the image P3 to show the whiteboard 56, it is possible to reliably collect the sounds emitted by the participants 53 to 55. Thereafter, the process returns to S13.

Next, for example as shown in FIG. 7, when the CPU 11 receives a zoom instruction signal from the PC 9 and the captured image P1 is trimmed and enlarged, the participants 53 to 55 may no longer be included in the imaging range B4, which has been made smaller by the zoom. In this case, the CPU 11 cannot detect a person's face in the latest image P4 (S15: NO). Because a zoom instruction signal has been received (S16: YES), the process proceeds to S21, and the directivity direction of the array microphone 25 is set to the directivity direction C1 used before zooming. The sound collection range of the array microphone 25 is set to D4, the range that remains when the imaging range B4, which is the angle of view reduced by the zoom, is removed from the imaging range B1, which is the angle of view of the camera 28 before zooming (S21).
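The range D4 is the pre-zoom range with the zoomed range cut out, which generally leaves two side sectors. A minimal sketch, representing each range as a (low, high) pair of angles in degrees relative to the front direction (a representation assumed here for illustration):

```python
def subtract_range(outer, inner):
    """Parts of the angular range `outer` not covered by `inner`.
    Both are (low_deg, high_deg) pairs relative to the front direction."""
    o_lo, o_hi = outer
    # Clip the inner range to the outer one before cutting it out.
    i_lo, i_hi = max(inner[0], o_lo), min(inner[1], o_hi)
    if i_lo >= i_hi:
        return [outer]  # nothing to remove
    parts = []
    if o_lo < i_lo:
        parts.append((o_lo, i_lo))
    if i_hi < o_hi:
        parts.append((i_hi, o_hi))
    return parts
```

How the two remaining sectors are realized by the array is not spelled out at this point in the text. The grouping of microphones into sets with different delay control, mentioned in the operating principle, is one possibility.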

このように、最新の画像Ｐ４内に前回の画像Ｐ１に映る参加者５３〜５５が映っておらず、その際にＣＰＵ１１がズームの信号を受けていれば、ズームによって画角が狭くなったことから、参加者５３〜５５が画像Ｐ４内に映らなくなったと判断できる。ゆえにＣＰＵ１１は、アレイマイク２５の集音範囲をズーム前の撮像範囲Ｂ１から、参加者５３〜５５がいないと判っている現在の撮像範囲Ｂ４を除いた範囲である集音範囲Ｄ４に設定する。これにより、ズームして画像Ｐ４に参加者５３〜５５が映らなくなった場合でも、参加者５３〜５５の発する音声を確実に集音することができる。その後処理はＳ１３に戻る。 Thus, if the participants 53 to 55 shown in the previous image P1 are not shown in the latest image P4, and the CPU 11 receives a zoom signal at that time, the angle of view is reduced by zooming. Therefore, it can be determined that the participants 53 to 55 are no longer shown in the image P4. Therefore, the CPU 11 sets the sound collection range of the array microphone 25 to the sound collection range D4 that is a range obtained by removing the current image pickup range B4 that is known to have no participants 53 to 55 from the image pickup range B1 before zooming. Thereby, even when the participants 53 to 55 no longer appear in the image P4 after zooming, it is possible to reliably collect the sounds emitted by the participants 53 to 55. Thereafter, the process returns to S13.

次に、会議端末１においてパンやズームがなされ、撮像された画像に映る参加者の人数が減ってしまった場合における処理について説明する。Ｓ１５においてＣＰＵ１１が画像内に人物の顔を検出しても（Ｓ１５：ＹＥＳ）、その数が、Ｓ１２で記憶した会議の参加者の人数よりも少なかった場合（Ｓ２２：ＹＥＳ）、Ｓ２５〜Ｓ３０の処理が行われる。 Next, processing when panning or zooming is performed in the conference terminal 1 and the number of participants appearing in the captured image is reduced will be described. Even if the CPU 11 detects a human face in the image in S15 (S15: YES), if the number is smaller than the number of conference participants stored in S12 (S22: YES), S25 to S30. Processing is performed.

For example, as shown in FIG. 8, when the CPU 11 receives a zoom instruction signal and the image P1 is trimmed and enlarged, only some of the participants, 53 and 54, may be included in the imaging range B5, which has been made smaller by the zoom. In this case, the CPU 11 can detect persons' faces in the latest image P5 (S15: YES), the number detected is smaller than the number of participants (S22: YES), and a zoom instruction signal has been received (S25: YES), so the process proceeds to S30. The CPU 11 sets the directivity direction of the array microphone 25 to the directivity direction C1 used before zooming and likewise sets the sound collection range of the array microphone 25 to D1, the same as the imaging range B1, which is the angle of view of the camera 28 before zooming (S30).

In this way, if some of the participants, 53 and 54, appear in the latest image P5 and the CPU 11 has received a zoom signal, it can be judged that the participant 55 no longer appears in the image P5 because the zoom narrowed the angle of view. The CPU 11 therefore sets the sound collection range of the array microphone 25 to the sound collection range D1, the same as the imaging range B1 before zooming. As a result, the sound uttered by the participants 53 and 54 who appear in the zoomed image P5 and by the participant 55 who does not can be reliably collected. Thereafter, the process returns to S13.

ところで、会議端末１がパンされ、その結果、撮像範囲内に一部の参加者だけが含まれることとなる場合がある。例えば図９に示すように、カメラ２８の撮像方向が、撮像範囲Ｂ１内に参加者５４が含まれ、且つ、特徴物（花５７）が含まれるＡ６に向けられた場合、ＣＰＵ１１は、画像解析により、画像Ｐ６から人物（参加者５４）の顔を検出する（Ｓ１５：ＹＥＳ）。画像Ｐ６に映らない他の参加者５３，５５の顔は検出できないので、検出する人物の数は、Ｓ１２で記憶した参加人数より少ない（Ｓ２２：ＹＥＳ）。 By the way, the conference terminal 1 is panned, and as a result, only some participants may be included in the imaging range. For example, as illustrated in FIG. 9, when the imaging direction of the camera 28 is directed to A6 where the participant 54 is included in the imaging range B1 and the characteristic object (flower 57) is included, the CPU 11 performs image analysis. Thus, the face of the person (participant 54) is detected from the image P6 (S15: YES). Since the faces of the other participants 53 and 55 that are not shown in the image P6 cannot be detected, the number of persons to be detected is smaller than the number of participants stored in S12 (S22: YES).

If no zoom instruction signal has been received (S25: NO), the CPU 11 estimates the rotation angle, that is, detects the imaging direction (S26). The feature (flower 57), which appeared near the right edge of the previous image P1 (see FIG. 4), appears slightly left of center in image P6. This displacement, indicated by arrow E2, shows that the conference terminal 1 has been panned. Furthermore, based on imaging range B1, the amount of panning applied to the conference terminal 1, that is, its rotation angle, is calculated from the magnitude of the displacement of the flower 57 between images P1 and P6.
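The description states only that the rotation angle is calculated from the displacement of the feature and imaging range B1; it does not give a formula. The following is a minimal sketch of one way this could be done, assuming an ideal pinhole camera with a known horizontal angle of view. The function name and parameters are hypothetical and are not taken from the specification.

```python
import math


def pan_angle_from_feature_shift(x_before, x_after, image_width, horizontal_fov_deg):
    """Estimate the pan angle in degrees from the horizontal shift of one feature.

    Assumption (not from the specification): ideal pinhole camera model.
    A positive result means the camera turned to the right, so the
    feature moved to the left in the image.
    """
    half_width = image_width / 2.0
    # Focal length expressed in pixels, derived from the horizontal angle of view.
    focal_px = half_width / math.tan(math.radians(horizontal_fov_deg) / 2.0)
    # Bearing of the feature relative to the optical axis, before and after the pan.
    bearing_before = math.atan((x_before - half_width) / focal_px)
    bearing_after = math.atan((x_after - half_width) / focal_px)
    return math.degrees(bearing_before - bearing_after)
```

With this convention, a feature that moves from the right edge of the image toward the center, as the flower 57 does between images P1 and P6, yields a positive (rightward) pan angle.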

If the rotation angle of the conference terminal 1 can be estimated (calculated) (S26: YES), the CPU 11 sets the directivity direction of the array microphone 25 to C6, the direction midway between imaging direction A6 and the pre-rotation directivity direction C1. It then sets the sound collection range of the array microphone 25 to D6, the sum of the angle-of-view range of the current imaging range B1 of the camera 28 about imaging direction A6 and the sound collection range D1 about the previous directivity direction C1 (S28).
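The combination performed in S28 can be sketched as follows. This is an illustration under stated assumptions, not the specification's implementation: angles are in degrees, each range is represented as a (direction, width) pair, and the combined range is taken as the single sector spanning both inputs. The names `wrap180` and `merged_sound_range` are hypothetical.

```python
def wrap180(angle_deg):
    """Wrap an angle to the interval [-180, 180)."""
    return (angle_deg + 180.0) % 360.0 - 180.0


def merged_sound_range(prev_dir, prev_width, imaging_dir, fov):
    """Return (direction, width) of a sector covering both the previous
    sound collection range and the current imaging range."""
    # Work relative to the previous directivity direction to avoid wrap-around issues.
    delta = wrap180(imaging_dir - prev_dir)
    lo = min(-prev_width / 2.0, delta - fov / 2.0)
    hi = max(prev_width / 2.0, delta + fov / 2.0)
    width = min(hi - lo, 360.0)
    direction = (prev_dir + (lo + hi) / 2.0) % 360.0
    return direction, width
```

When the previous sound collection range and the imaging range have the same width, as with D1 and B1 here, the returned direction is exactly midway between the previous directivity direction and the new imaging direction, which matches the description of C6.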

Thus, when panning leaves only some of the participants (participant 54) in the latest image P6, it can be concluded that the conference terminal 1 was panned in order to show participant 54. Therefore, when the rotation angle of the pan can be determined by image analysis, the CPU 11 sets the directivity direction of the array microphone 25 to the direction midway between the pre-rotation directivity direction and the post-rotation imaging direction of the camera 28. It then sets the sound collection range of the array microphone 25 to the range obtained by adding the imaging range about the post-rotation imaging direction to the sound collection range used before the camera 28 was rotated. As a result, the voices of participant 54, who became the focus of the pan, and of participants 53 and 55, who no longer appear in image P6, can be reliably collected. Processing then returns to S13.

In addition, participant 54 may move from the original position, for example to give an explanation using the whiteboard 56, and the conference terminal 1 may be panned accordingly so as to show participant 54. For example, as shown in FIG. 10, the imaging direction of the camera 28 may be turned to A7, at which imaging range B1 includes participant 54 but not the feature (flower 57). As above, the CPU 11 detects the face of a person (participant 54) in image P7 by image analysis (S15: YES), but the number of detected persons is smaller than the number of participants (S22: YES).

If the CPU 11 cannot detect, in the latest image P7, the feature (flower 57) that appeared in the previous image P1 (see FIG. 4), it determines that the rotation angle cannot be estimated (S26: NO). In this case, the CPU 11 sets the directivity direction of the array microphone 25 to all 360° directions (omnidirectional) and sets the sound collection range of the array microphone 25 to D7, the full 360° range.

Thus, when the rotation angle of the pan cannot be determined by image analysis, the CPU 11 sets sound collection range D7 and collects sound from the full 360° range. This covers not only the voice of participant 54, who appears in image P7, but also the voices of participants 53 and 55, who do not. In other words, the voices of participant 54, who became the focus of the pan, and of participants 53 and 55, who no longer appear in image P7, can be reliably collected. Processing then returns to S13.
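The three outcomes described for the case in which only some participants are detected (S15: YES, S22: YES) can be summarized as a small decision function. This is a sketch of the branching only; the enum, the function name, and the use of `None` to signal that the rotation angle could not be estimated are illustrative assumptions.

```python
from enum import Enum
from typing import Optional


class RangeMode(Enum):
    KEEP_PRE_ZOOM = "keep the pre-zoom direction and range (S30)"
    MERGE_WITH_PREVIOUS = "midway direction and combined range (S28)"
    OMNIDIRECTIONAL = "full 360-degree range"


def mode_for_partial_view(zoom_requested: bool, rotation_deg: Optional[float]) -> RangeMode:
    """Choose the sound collection mode when fewer faces than participants are detected.

    rotation_deg is None when the feature could not be found in the
    latest image, so the rotation angle could not be estimated.
    """
    if zoom_requested:            # S25: YES
        return RangeMode.KEEP_PRE_ZOOM
    if rotation_deg is not None:  # S26: YES
        return RangeMode.MERGE_WITH_PREVIOUS
    return RangeMode.OMNIDIRECTIONAL  # S26: NO
```

In this sketch the zoom check takes priority, mirroring the order in which S25 and S26 are evaluated in the description.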

As described above, in the conference terminal 1 of the present embodiment, when no person is included in the imaging range of the conference terminal 1, at least a part of the region outside the imaging range can be made the sound collection target region. The region where the persons are located can therefore be included in the sound collection range, and the voices they utter can be reliably collected. In addition, the region within the imaging range, which contains no person, is excluded from the sound collection target region, so noise originating in that region is not collected, and the persons' voices can be collected more clearly.

When the imaging range changes, the persons are expected to be in the imaging range as it was before the change. Therefore, if the directivity direction and sound collection range of the array microphone 25 are controlled based on the nature of the change in the imaging range, the region where the persons are located can be reliably included in the sound collection target region. The voices uttered by the persons can thus be collected reliably and more clearly.

When the imaging range changes, the persons are expected to be in the imaging range as it was before the change. However, if the post-change imaging direction cannot be determined relative to the pre-change imaging direction, the pre-change directivity direction and sound collection range cannot be located with reference to the post-change imaging direction. Accordingly, the directivity direction and sound collection range of the array microphone 25 are controlled so that sound is collected from all regions outside the imaging range of the conference terminal 1 among the regions from which sound can be collected. This ensures that the region where the persons are located is included in the sound collection target region while no sound is collected from the region known to contain no person, so the voices uttered by the persons can be collected reliably and more clearly.
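Collecting sound from every region outside the imaging range amounts to taking the angular complement of the imaging range. A minimal sketch follows, assuming that the array microphone can collect sound over the full 360° and that ranges are (direction, width) pairs in degrees. The function names are hypothetical.

```python
def range_outside_view(imaging_dir, fov):
    """Return (direction, width) of the sector complementary to the imaging range."""
    direction = (imaging_dir + 180.0) % 360.0  # directly opposite the camera
    width = 360.0 - fov
    return direction, width


def in_range(bearing, direction, width):
    """Return True if a bearing lies inside the sector (direction, width)."""
    offset = (bearing - direction + 180.0) % 360.0 - 180.0
    return abs(offset) <= width / 2.0
```

For example, with the camera facing 90° and a 60° angle of view, the complementary sector is centered at 270° and is 300° wide, so a talker at a bearing of 130° is covered while a source at 90° is not.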

When the imaging range changes, the persons are expected to be in the imaging range as it was before the change. If, in addition, the post-change imaging direction can be determined relative to the pre-change imaging direction, the pre-change directivity direction and sound collection range can be located with reference to the post-change imaging direction. Accordingly, the directivity direction and sound collection range of the array microphone 25 are controlled so that sound is collected from the region that was the sound collection target of the array microphone 25 before the imaging range changed. This reliably makes the region where the persons are located the sound collection target region while avoiding collection from regions where no person is present, so the voices uttered by the persons can be collected reliably and more clearly.

When the imaging range changes because of a change in the angle of view, the persons are expected to be in the region obtained by removing the post-change imaging range from the pre-change imaging range. Accordingly, the directivity direction and sound collection range of the array microphone 25 are controlled so that sound is collected from the region obtained by excluding, from the sound collection target region of the array microphone 25 before the change in the angle of view, the region that overlaps the imaging range after the change in the angle of view. This reliably makes the region where the persons are located the sound collection target region while avoiding collection from regions where no person is present, so the voices uttered by the persons can be collected reliably and more clearly.

If a part of the captured image that has the features of a human face but is no larger than a predetermined size is not detected as a person, then even when a person who is not an intended imaging subject happens to fall within the imaging range, that person does not affect the control conditions of the array microphone 25. This prevents an incorrect directivity direction and sound collection range from being set, so the voices of the persons whose sound is to be collected can be reliably collected.
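The size filter can be sketched as follows. The specification does not say how the size of a face region is measured, so this example assumes that each detected face is given as a bounding box and that its shorter side is compared with a threshold. The type alias, function name, and threshold parameter are hypothetical.

```python
from typing import List, Tuple

# (x, y, width, height) of a detected face region, in pixels.
Box = Tuple[int, int, int, int]


def count_participants(face_boxes: List[Box], min_side_px: int) -> int:
    """Count only the faces whose shorter side is larger than min_side_px.

    Faces at or below the threshold are treated as people who are not
    imaging subjects (for example, someone far in the background).
    """
    return sum(1 for (_, _, w, h) in face_boxes if min(w, h) > min_side_px)
```

The count returned here is the value that would be compared with the stored number of participants when deciding how to control the array microphone 25.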

The present invention is not limited to the above embodiment, and various modifications are possible. In the embodiment, a fixed-focal-length digital camera is used as the camera 28, and zooming is performed by a pseudo digital zoom that trims and enlarges the captured image. Alternatively, the camera 28 may be provided with a zoom lens that mechanically changes the focal length so as to realize optical zoom.

The array microphone 25 is described as having three microphones as an example, but it need only have two or more, and preferably three or more. The more microphones there are, the more precisely the sound collection range can be set. In the present embodiment, omnidirectional microphones are used as the individual microphones constituting the array microphone 25, but directional microphones may be used instead. Alternatively, the array microphone 25 may be configured by combining omnidirectional and directional microphones.

In the embodiment, the pan rotation angle of the conference terminal 1 is calculated by using image analysis to detect the position of the feature in the images captured before and after rotation. Alternatively, the conference terminal 1 may be provided with an acceleration sensor so that its orientation can be tracked at all times. As a further alternative, markers may be placed at several locations in the conference room 50 to serve as features, so that the orientation of the conference terminal 1 can be determined from the markers appearing in the image. Considering the cost of an acceleration sensor and the effort of preparing markers, determining the orientation of the conference terminal 1 by image analysis, as in the present embodiment, is preferable because it can be done in software alone.

The conference terminal 1 may be installed in any orientation. For example, the conference terminal 1 may be tilted by 90 degrees and attached to a wall or the like, so that the pan of the present embodiment corresponds to a tilt operation. In this case, the directivity direction and sound collection range of the array microphone 25 can be controlled by using image analysis to detect the vertical movement of the feature in the image and obtaining the rotation angle from it.

In the present embodiment, the conference terminal 1 corresponds to the "imaging apparatus" of the present invention. The camera 28 corresponds to the "imaging means". The array microphone 25 corresponds to the "sound collecting means". The "control means" corresponds to the CPU 11, which performs calculations to determine the directivity direction and sound collection range of the array microphone 25 according to the various conditions, together with the directivity control unit 26, which controls the directivity direction and sound collection range of the array microphone 25 by controlling the delay time of each of its microphones based on the calculation results of the CPU 11. The CPU 11 that detects persons in S15 and determines whether a person is included in the image corresponds to the "first determination means". The CPU 11 that determines in S15 that no person can be detected, or that detects persons but determines in S22 that their number has decreased, corresponds to the "second determination means".
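The specification states that the directivity control unit 26 steers the array microphone 25 by controlling per-microphone delay times, but it does not give the calculation. The sketch below shows the standard far-field delay-and-sum steering calculation as one possible illustration. The microphone coordinates, the angle convention (degrees counter-clockwise from the +x axis), and the function name are assumptions, not details from the specification.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def steering_delays(mic_positions, direction_deg):
    """Return the delay in seconds to apply to each microphone so that a
    far-field source in direction_deg adds in phase across the array.

    mic_positions is a list of (x, y) coordinates in meters.
    """
    ux = math.cos(math.radians(direction_deg))
    uy = math.sin(math.radians(direction_deg))
    # Arrival time at each microphone relative to the origin:
    # microphones closer to the source receive the wavefront earlier.
    arrival = [-(x * ux + y * uy) / SPEED_OF_SOUND for (x, y) in mic_positions]
    # Delay the early channels so that every channel lines up with the latest arrival.
    latest = max(arrival)
    return [latest - t for t in arrival]
```

Summing the delayed channels then emphasizes sound from the chosen direction. How the width of the sound collection range is shaped is outside the scope of this sketch.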

1 Conference terminal
11 CPU
13 RAM
25 Array microphone
26 Directivity control unit
28 Camera

Claims

An imaging apparatus comprising:
imaging means for capturing an image;
a plurality of sound collecting means configured integrally with the imaging means and collecting sound;
control means for controlling, based on an imaging range of the image captured by the imaging means, a directivity direction and a sound collection range in which the plurality of sound collecting means collect sound; and
first determination means for determining, based on the image captured by the imaging means, whether or not a person is included in the imaging range,
wherein, when the first determination means determines that no person is included in the imaging range, the control means controls the directivity direction and the sound collection range of the sound collecting means so that at least a part of a region outside the imaging range of the imaging means, among regions from which the sound collecting means can collect sound, becomes a sound collection target region of the sound collecting means.

The imaging apparatus according to claim 1, further comprising second determination means for determining whether or not the imaging range of the imaging means has changed,
wherein, when the second determination means determines that the imaging range has changed, the control means controls the directivity direction and the sound collection range of the sound collecting means based on the content of the change in the imaging range.

The imaging apparatus according to claim 2, wherein, when the second determination means determines that the imaging range has changed, and it is determined that the change in the imaging range is caused by a change in the imaging direction of the imaging means and that the imaging direction after the change cannot be specified relative to the imaging direction before the change, the control means controls the directivity direction and the sound collection range of the sound collecting means so that all regions outside the imaging range of the imaging means, among the regions from which the sound collecting means can collect sound, become regions from which the sound collecting means collects sound.

The imaging apparatus according to claim 2 or 3, wherein, when the second determination means determines that the imaging range has changed, and it is determined that the change in the imaging range is caused by a change in the imaging direction of the imaging means and that the imaging direction after the change can be specified relative to the imaging direction before the change, the control means controls the directivity direction and the sound collection range of the sound collecting means so that sound is collected from the sound collection target region of the sound collecting means before the change in the imaging range.

The imaging apparatus according to claim 2, wherein, when the second determination means determines that the imaging range has changed and it is determined that the change in the imaging range is caused by a change in the angle of view of the imaging means, the control means controls the directivity direction and the sound collection range of the sound collecting means so that a region obtained by excluding, from the sound collection target region of the sound collecting means before the change in the angle of view, a region that overlaps the imaging range after the change in the angle of view becomes the sound collection target region of the sound collecting means.

The imaging apparatus according to claim 1, wherein the first determination means recognizes, in the image captured by the imaging means, a part having the features of a human face, and determines that a person is included in the imaging range when the size of the recognized part is larger than a predetermined size.

An imaging method executed by a computer to operate an imaging apparatus in which imaging means for capturing an image and a plurality of sound collecting means for collecting sound are configured integrally, the method comprising:
a control step of controlling, based on an imaging range of the image captured by the imaging means, a directivity direction and a sound collection range in which the plurality of sound collecting means collect sound; and
a first determination step of determining, based on the image captured by the imaging means, whether or not a person is included in the imaging range,
wherein, when it is determined in the first determination step that no person is included in the imaging range, the directivity direction and the sound collection range of the sound collecting means are controlled in the control step so that at least a part of a region outside the imaging range of the imaging means, among regions from which the sound collecting means can collect sound, becomes a sound collection target region of the sound collecting means.

A program for operating an imaging apparatus in which imaging means for capturing an image and a plurality of sound collecting means for collecting sound are configured integrally, the program causing a computer to execute:
a control step of controlling, based on an imaging range of the image captured by the imaging means, a directivity direction and a sound collection range in which the plurality of sound collecting means collect sound; and
a first determination step of determining, based on the image captured by the imaging means, whether or not a person is included in the imaging range,
wherein, when it is determined in the first determination step that no person is included in the imaging range, the directivity direction and the sound collection range of the sound collecting means are controlled in the control step so that at least a part of a region outside the imaging range of the imaging means, among regions from which the sound collecting means can collect sound, becomes a sound collection target region of the sound collecting means.