JP2016010010A

JP2016010010A - Imaging apparatus with voice input and output function and video conference system

Info

Publication number: JP2016010010A
Application number: JP2014129638A
Authority: JP
Inventors: Hiroyasu Otsubo (大坪 宏安)
Original assignee: Hitachi Maxell Ltd
Current assignee: Maxell Holdings Ltd
Priority date: 2014-06-24
Filing date: 2014-06-24
Publication date: 2016-01-18
Also published as: WO2015198964A1

Abstract

PROBLEM TO BE SOLVED: To provide an imaging apparatus with a voice input/output function for video conferencing that can be manufactured at low cost, and to provide a video conference system. SOLUTION: An imaging apparatus 1 with a voice input/output function for use in video conferencing includes a microphone 5, a speaker 6, and an omnidirectional camera 7 arranged close to one another. The imaging apparatus 1 is placed on a table surrounded by the participants in the video conference. A participant who is speaking is likely to talk toward the microphone 5 and to face the speaker 6 in order to hear its sound, so the omnidirectional camera 7 tends to capture that participant from the front.

Description

The present invention relates to an imaging apparatus with a voice input/output function and to a video conference system.

In recent years, some conference rooms have been equipped with a video conference system that comprises a voice input/output device having a microphone and a speaker placed on a table, a display placed near the table, and a video camera (for example, a television camera without a recording function) placed near the display, so that a so-called video conference using images and sound can be held with another conference room at a remote location.

In such a video conference system, the angle of view of the television camera is often adjusted so that all conference participants fall within the shooting range. In this case, the seating positions of the participants may be restricted, or it may be difficult to fit all participants within the shooting range. In addition, adjusting the angle of view, zoom, and so on of the television camera before the conference can take some time, which causes a delay between the time all participants have gathered and the time the conference starts.

Furthermore, when the main speakers are determined in advance, measures can be taken such as asking them to sit as close as possible to the center of the television camera's shooting range. When it is not known which participant will speak, however, the speaker may be near the edge of the shooting range and hard to see.

To address this, it has been proposed to provide a plurality of microphones for voice input or a plurality of wide-angle cameras, to identify the position of the speaker from the audio signals of the microphones or the image data of the wide-angle cameras, and, based on that position, to control the microphones so that mainly the speaker's voice is picked up and to control the camera so that mainly the speaker is photographed (see Patent Document 1).

In addition, recent conference systems use PTZ cameras. A PTZ camera is a camera capable of pan (P), which swings the camera left and right, tilt (T), which swings it up and down, and zoom (Z), which enlarges the image. For example, the orientation and zoom of the camera can be controlled so that the speaker in the conference is centered in the image. In a system that can identify the position of the speaker as described above, the PTZ camera can be pointed at the speaker automatically.

Patent Document 1: JP-A-10-145763

In the invention of Patent Document 1, however, the position of the speaker is identified using a plurality of microphones and cameras, and, based on the identified position, the camera is controlled so that mainly the speaker is photographed and the microphones are controlled so that mainly the speaker's voice is picked up. Patent Document 1 therefore requires a plurality of microphones and cameras as well as a control device for controlling them, which increases the cost of the conference system.

For example, when one conference room has more than several dozen participants, it may be necessary to identify the position of the speaker, control a camera to capture the identified speaker, and control microphones to extract the speaker's voice. When one conference room has a dozen or so participants or fewer, however, such a system is poor in terms of cost performance.

The present invention has been made in view of the above circumstances, and an object thereof is to provide an imaging apparatus with a voice input/output function for video conferencing that can be manufactured at low cost, and a video conference system having this imaging apparatus.

In order to solve the above problem, an imaging apparatus with a voice input/output function according to the present invention comprises:
an omnidirectional camera that captures images of its surroundings;
audio output means that is provided near the omnidirectional camera and outputs an externally input audio signal to the surroundings as sound; and
audio input means that is provided near the omnidirectional camera and inputs surrounding sound as an audio signal,
wherein the apparatus outputs the image data captured by the omnidirectional camera and the audio signal input by the audio input means.

With this configuration, when the imaging apparatus with a voice input/output function is used as the imaging device, the speaker (audio output means), and the microphone (audio input means) of a conference system, all participants can be captured by the omnidirectional camera by placing the apparatus on a table and having the participants sit around the table. In this case, each participant seated around the table looks either at the imaging apparatus on the table or at a display showing the other venue of the video conference.

However, a person who is speaking usually talks toward the microphone serving as the audio input means, and is also likely to face the loudspeaker from which the voices of the other participants are output. In general, sound is easier to hear when its source is directly in front of the face, so people often look toward a sound source in order to hear it better. That is, when the conference participants sit around the omnidirectional camera on the table, at least the person speaking faces the microphone and the loudspeaker and therefore faces the omnidirectional camera located near them. The omnidirectional camera thus captures that participant from the front, and in the resulting image data the speaker appears to be talking toward the participants at the other venue who are viewing the image.

In other words, placing the microphone, the loudspeaker, and the camera at substantially the same position encourages conference participants to face the camera at least while speaking, so that a clear image of the speaker can be obtained.

In addition, because the omnidirectional camera is placed on the table and captures the participants sitting around the table, the distance to the participants is short and differs little from one participant to another. The participants can therefore be captured adequately even without a high-resolution omnidirectional camera, and the cost can be reduced compared with using a high-resolution omnidirectional camera.

Note that omnidirectional image data captured by the omnidirectional camera appears distorted if it is output as projected directly onto a plane. It is therefore necessary to convert it, for example, into a panoramic image or into an image of each conference participant, and to perform image processing that removes the distortion. Examples of the omnidirectional camera include a fisheye camera using a fisheye lens, a camera using a mirror of nearly conical shape, and a full-spherical camera. The audio output means is, for example, a speaker, and the audio input means is, for example, a microphone.
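The conversion to a panoramic image mentioned above can be illustrated with a simple polar unwrapping. The following is only a minimal sketch and not the method of the specification: it assumes a circular fisheye image whose optical axis points straight up, and the names `panorama_to_source`, `unwrap`, `r_inner`, `r_outer`, and `seam_deg` are hypothetical. It also assumes that the top row of the panorama maps to the inner radius, and it does not correct lens-specific distortion, which would require the projection model of the actual lens.

```python
import math


def panorama_to_source(u, v, pano_w, pano_h, cx, cy, r_inner, r_outer, seam_deg=0.0):
    """Map panorama pixel (u, v) to a coordinate (x, y) in the circular source image.

    Assumption: columns map linearly to azimuth, rows map linearly to radius.
    """
    theta = math.radians(seam_deg) + 2.0 * math.pi * u / pano_w
    t = v / (pano_h - 1) if pano_h > 1 else 0.0
    r = r_inner + (r_outer - r_inner) * t
    return cx + r * math.cos(theta), cy + r * math.sin(theta)


def unwrap(image, pano_w, pano_h, cx, cy, r_inner, r_outer, seam_deg=0.0):
    """Build a panorama (list of rows) from `image` (list of rows) by nearest-neighbour sampling."""
    src_h, src_w = len(image), len(image[0])
    panorama = []
    for v in range(pano_h):
        row = []
        for u in range(pano_w):
            x, y = panorama_to_source(
                u, v, pano_w, pano_h, cx, cy, r_inner, r_outer, seam_deg
            )
            # Clamp to the source bounds so that rounding never indexes outside the image.
            xi = min(max(int(round(x)), 0), src_w - 1)
            yi = min(max(int(round(y)), 0), src_h - 1)
            row.append(image[yi][xi])
        panorama.append(row)
    return panorama
```

In this sketch, `seam_deg` sets the azimuth at which the circular image is cut open, so the left and right ends of the panorama can be moved without changing the rest of the mapping.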

In the above configuration of the present invention, it is preferable that a plurality of displays that show externally input image data so as to be visible from a plurality of surrounding directions be provided near the omnidirectional camera, at positions that do not obstruct its capture of the surroundings.

With this configuration, conference participants are likely to face the display showing the participants at the other venue, the loudspeaker outputting their remarks as sound, and the microphone used to talk to them. Because the display, microphone, and loudspeaker are grouped close together, most participants naturally face the omnidirectional camera, and the display at the other venue shows participants who appear to be facing the viewers there.

In addition, when the omnidirectional camera is placed on the table, the distance between each participant and the display becomes short, so participants at another venue can be identified even on a relatively small display. Even if a plurality of displays are used, the cost can therefore be lower than when one large display is used. When participants sit facing each other in two rows at a rectangular table, two displays suffice. When participants sit in a circle around a round table, or sit along three or more of the four sides of a rectangular table, three or more displays are preferable.

In the above configuration of the present invention, it is preferable that the audio input means comprise at least three microphones each facing a different surrounding direction, and that the apparatus further comprise:
sound source direction recognition means that identifies the direction of a sound source from the volume of the sound input to each microphone; and
image processing means that converts the omnidirectional image data captured by the omnidirectional camera into image data centered on the direction of the sound source identified by the sound source direction recognition means.

With this configuration, the participant who is speaking can be identified, and a panoramic image in which the speaker is located at approximately the horizontal center, or an image in which the speaker is cropped out, can be shown on the display at the other venue. In the present invention, the participants are seated around the omnidirectional camera and the microphones near it, so the direction of the speaker as a sound source can be identified relatively easily by comparing the volume at each microphone, even without highly directional microphones. Moreover, locating the speaker in the omnidirectional image requires only the direction of the sound source, not its position, so there is no need for a microphone array or highly directional microphones to determine the position, and the cost can be reduced. When creating image data centered on the speaker identified by the microphones, image data featuring the speaker can be created easily by specifying a direction in the omnidirectional image data.

In the above configuration of the present invention, it is preferable that the apparatus comprise:
image recognition means that recognizes the faces of the persons captured in the image data captured by the omnidirectional camera and identifies, from the mouth movement of the recognized faces, which of the captured persons is speaking; and
image processing means that converts the omnidirectional image data captured by the omnidirectional camera into image data centered on the captured person identified as speaking by the image recognition means.

With this configuration, as in the audio-based case, image data featuring the speaker can be created once the direction of the speaker is identified. Because the position need not be determined, a plurality of cameras is not required, and the cost can be reduced. As when the direction is identified by sound, image data featuring the identified speaker can be created easily by specifying a direction in the omnidirectional image data.

A video conference system according to the present invention comprises a plurality of the imaging apparatuses with a voice input/output function according to the present invention, wherein each imaging apparatus comprises communication means for outputting the image data and the audio signal to the other imaging apparatuses and for receiving the image data and the audio signal output from the other imaging apparatuses.

With this configuration, the video conference system of the present invention provides the effects of each imaging apparatus described above. The imaging apparatus may be configured without a display; in that case, it can receive image data captured by another imaging apparatus and output that image data to an external display.

According to the imaging apparatus with a voice input/output function and the video conference system of the present invention, the apparatus can be manufactured at low cost, and a speaker shown on a display is more likely to appear facing the viewers of the display.

A view of the imaging apparatus with a voice input/output function of the first embodiment with its cover rendered translucent, in which (a) is a plan view and (b) is a side view.

A diagram for explaining how the imaging apparatus with a voice input/output function of the first embodiment is used.

A diagram for explaining an image captured by the omnidirectional camera of the imaging apparatus of the first embodiment.

Diagrams for explaining the images output from the imaging apparatus of the first embodiment, in which (a) schematically shows a panoramic image converted from an omnidirectional image, (b) shows the panoramic image divided and arranged in two rows, (c) shows the same with an image of the speaker added, and (d) shows panoramic images each converted from an omnidirectional image captured at one of three different locations.

A view of the imaging apparatus with a voice input/output function of the second embodiment with its cover rendered translucent, in which (a) is a plan view and (b) is a side view.

A view of the imaging apparatus with a voice input/output function of the third embodiment with its cover rendered translucent, in which (a) is a plan view and (b) is a side view.

A view of the imaging apparatus with a voice input/output function of the fourth embodiment, in which (a) is a front view and (b) is a rear view.

A first embodiment of the present invention will be described below with reference to the drawings.
The video conference system of this embodiment uses a plurality of the imaging apparatuses 1 with a voice input/output function shown in FIGS. 1(a) and 1(b); the system is constructed by placing an imaging apparatus 1 in each of a plurality of conference rooms at separate locations.

The imaging apparatus 1 with a voice input/output function shown in FIG. 1 comprises a substantially disc-shaped base plate 2; a substantially dome-shaped cover 3 that covers the top of the base plate 2; microphones (audio input means) 5 that are arranged at equal intervals along the circumference of the outer peripheral portion of the base plate 2 and are connected to a control board 4 described later; a speaker (audio output means) 6 arranged between the base plate 2 and the cover 3 so as to be covered by the cover 3; and an omnidirectional camera 7 fixed on the cover 3. The microphones 5, the speaker 6, and the omnidirectional camera 7 are located close to one another. The speaker 6 and the omnidirectional camera 7 are arranged so that their central axes substantially coincide, and the microphones 5 are arranged at positions substantially equidistant from that central axis.

The base plate 2 has, on its upper surface, mounting structures for attaching the microphones 5, the speaker 6, and the control board 4. The outer peripheral portion of the disc-shaped base plate 2 has a mounting structure for attaching the circular lower edge (outer peripheral edge) of the cover 3, which has substantially the same diameter as the base plate 2.

The cover 3 has one or more holes (not shown) at positions corresponding to the microphones 5 so as not to obstruct sound input to the microphones 5. An opening 3a for sound output from the speaker 6 is provided in the upper (central) part of the dome-shaped cover 3. The opening 3a of the cover 3 is provided with a bridge-like camera fixing portion 3b for fixing the omnidirectional camera 7 at the center of the top of the cover 3.

The microphones 5 are, for example, directional, and the direction of highest sensitivity of each is aligned with a radial direction orthogonal to the central axis of, for example, the hemispherical or cylindrical surface that forms the shooting range of the omnidirectional camera 7. The microphones 5 are placed at the same radial distance from the central axis of the shooting range, at positions offset from one another by 90 degrees (that is, at equal circumferential intervals). Non-directional microphones may also be used as the microphones 5. Each microphone 5 is connected to the control board 4, converts sound into an audio signal, and inputs the signal to the control board 4. The audio signal may be analog or digital.

The speaker 6 is of an omnidirectional type, so that a single speaker 6 outputs sound substantially equally in all directions. Alternatively, a plurality of non-omnidirectional speakers, for example three or four, may be used. The speaker 6 is connected to the control board 4, converts the audio signal output from the control board 4 into sound, and outputs the sound to the surroundings.

The omnidirectional camera 7 is, for example, a fisheye camera with a hemispherical imaging range that captures its surroundings. It may instead be, for example, a camera that obtains omnidirectional image data F (shown in FIG. 2) from images taken by a plurality of cameras, a camera that captures the surroundings via a substantially conical mirror, or a full-spherical camera. The omnidirectional camera 7 only needs to capture the participants sitting around the table T, as subjects, from the imaging apparatus 1 placed on the table T; image data of the upward direction, for example, is not required.

When the omnidirectional camera 7 is mounted at a high position, for example at or above the head height of a seated participant, a hemispherical shooting range can no longer capture a bust (head-and-shoulders) view of the participants. A full-spherical camera can therefore be used to advantage when the omnidirectional camera 7 is mounted high.

The control board 4 serves as the sound source direction recognition means and identifies the direction of a sound source from the volume levels (loudness) of the audio signals input from the four microphones 5. This embodiment does not determine the position of the sound source, which would require both its direction and its distance; it determines only the direction of the sound source from the volume levels of the four microphones 5. For example, the two adjacent microphones 5 with the highest volume levels are identified, and a direction lying between these two microphones 5 is determined from the difference between their volumes.

For example, if the two microphones 5 show no difference in volume, the sound source is determined to lie in the direction approximately midway between them. If one of the microphones 5 has a higher volume, the sound source lies between the midway direction and the direction of the louder microphone 5. If the microphone 5 with the second-highest volume and the microphone 5 with the third-highest volume have substantially the same volume, the sound source lies in the direction in which the loudest microphone 5 faces.
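The volume comparison described above can be sketched as follows. The specification gives no formula, so the linear interpolation used here is only one possible assumption that reproduces the three stated cases (equal volumes give the midway direction, a louder microphone pulls the estimate toward itself, and equal second and third volumes give the direction of the loudest microphone). The names `MIC_ANGLES_DEG` and `estimate_source_direction`, and the assignment of the four microphones to 0, 90, 180, and 270 degrees, are hypothetical.

```python
# Assumed headings of the four microphones, 90 degrees apart.
MIC_ANGLES_DEG = (0.0, 90.0, 180.0, 270.0)


def estimate_source_direction(levels):
    """Estimate the azimuth of the sound source in degrees, in the range [0, 360).

    `levels` holds one non-negative volume level per microphone,
    in the same order as MIC_ANGLES_DEG.
    """
    n = len(MIC_ANGLES_DEG)
    if len(levels) != n:
        raise ValueError("expected one level per microphone")

    loudest = max(range(n), key=lambda i: levels[i])
    left, right = (loudest - 1) % n, (loudest + 1) % n

    # Pick the louder of the two neighbours; the other neighbour acts as a baseline.
    if levels[right] >= levels[left]:
        near, far, sign = right, left, 1.0
    else:
        near, far, sign = left, right, -1.0

    w_loudest = levels[loudest] - levels[far]
    w_near = levels[near] - levels[far]
    total = w_loudest + w_near
    if total <= 0.0:
        # No usable difference between microphones: fall back to the loudest one.
        return MIC_ANGLES_DEG[loudest]

    step = 360.0 / n  # angular spacing between adjacent microphones
    offset = sign * step * (w_near / total)
    return (MIC_ANGLES_DEG[loudest] + offset) % 360.0
```

With these assumptions, levels of `[1.0, 1.0, 0.2, 0.2]` place the source midway between the first two microphones, and levels of `[1.0, 0.3, 0.1, 0.3]` place it in the direction of the first microphone.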

The sound source may instead be identified from the phase shift of the sound among the microphones 5. That is, a well-known method may be used that determines the direction of the sound source from the differences in arrival time of the sound at the microphones 5, which result from their different distances from the sound source.
The control board 4 also serves as the image recognition means and identifies the direction of the speaker from the omnidirectional image data F input from the omnidirectional camera 7. Basically, the direction of each participant is identified by recognizing the face of each participant (captured person) in the omnidirectional image data F using well-known face recognition. The mouth of each participant is also recognized in the image, whether the mouth (lips) is moving is determined, and the direction of the face whose mouth is determined to be moving is taken as the direction of the speaker.

Image processing and image recognition can be implemented easily using Intel (registered trademark) OpenCV (Intel Open Source Computer Vision Library). For example, a face recognition program can be built on the object detection program provided in OpenCV. Image recognition in principle consists of a learning phase and a recognition phase: features are extracted from images, and a learning algorithm learns the features of the object, which makes image recognition such as face recognition possible. OpenCV uses Haar-like features as the image features and an algorithm called AdaBoost as the learning algorithm. By having the object detection program learn, through machine learning based on feature points, whether an image is a face, the program can recognize a face image as a face. OpenCV need not necessarily be used for the image recognition program; an existing program or a chip with an existing image recognition circuit may be used instead. The speaker's mouth movement can likewise be recognized by training the OpenCV object detection program through machine learning so that it can distinguish, for example, a talking mouth from a silent one.

In this embodiment, face recognition is performed to recognize the direction of each participant, and mouth movement is detected to recognize the direction of the speaker. As described above, the control board 4 also identifies the direction of the speaker as a sound source from the audio. In this embodiment, when the speaker directions obtained by sound source direction recognition and by image recognition agree within a predetermined angular range (for example, within 0 to 10 degrees), one of the two directions, for example the one obtained by image recognition, is taken as the direction of the speaker.

When the sound source direction obtained by sound source direction recognition and the speaker direction obtained by image recognition do not fall within the predetermined angular range, it is determined that there is no speaker. This prevents a participant who is whispering privately, yawning, or making a loud noise while moving a chair from being recognized as the speaker, even temporarily, and displayed large on, for example, the display 8 at another venue. Note that the direction of the speaker may be determined by sound source direction recognition alone or by image recognition alone.
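The agreement check between the two direction estimates can be sketched as follows. This is a minimal illustration, assuming both directions are expressed as bearings in degrees around the apparatus; the function names and the 10-degree default are illustrative choices, not taken from the apparatus itself.

```python
def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two bearings, in degrees (0 to 180)."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def speaker_direction(audio_deg, image_deg, tolerance_deg=10.0):
    """Return the image-based bearing when both estimates agree, else None.

    None means "no speaker", e.g. a loud chair with no moving mouth nearby,
    or a moving mouth (a yawn) with no sound from that direction.
    """
    if audio_deg is None or image_deg is None:
        return None
    if angular_difference(audio_deg, image_deg) <= tolerance_deg:
        return image_deg % 360.0
    return None
```

For example, `speaker_direction(358, 3)` accepts the pair because the bearings differ by only 5 degrees across the 0/360 wrap, whereas `speaker_direction(90, 120)` yields `None`.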

The control board 4 also functions as image processing means that converts the omnidirectional image data F input from the omnidirectional camera 7 into a panoramic image by known image processing. In doing so, it determines the position in the omnidirectional image data F that will become the right and left ends of the panoramic image, and creates the panoramic image data from the omnidirectional image data F. When the direction of the speaker has been identified as described above, the omnidirectional image data F is cut open at the position 180 degrees from the speaker's direction, that is, in the direction opposite to the speaker, and this position becomes the right and left ends of the panoramic image. When there is no speaker, for example, the intervals between the participants whose faces were recognized as described above are evaluated, and the center of the widest interval becomes the left and right ends of the panoramic image.
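The choice of cut position can be sketched as below. This is an illustrative sketch only, assuming directions are bearings in degrees; the function names, and the fallback of 0 degrees when no face is detected, are assumptions added for the example.

```python
def widest_gap_center(face_degs):
    """Bearing at the center of the widest angular gap between neighbouring faces."""
    angles = sorted(a % 360.0 for a in face_degs)
    best_gap, best_center = -1.0, 0.0
    for i, a in enumerate(angles):
        nxt = angles[(i + 1) % len(angles)]
        # A single face leaves one gap spanning the full circle.
        gap = 360.0 if len(angles) == 1 else (nxt - a) % 360.0
        if gap > best_gap:
            best_gap, best_center = gap, (a + gap / 2.0) % 360.0
    return best_center


def panorama_cut_angle(speaker_deg, face_degs):
    """Bearing at which the omnidirectional image is cut open."""
    if speaker_deg is not None:
        # Opposite the speaker, so the speaker lands in the middle of the panorama.
        return (speaker_deg + 180.0) % 360.0
    if not face_degs:
        return 0.0  # assumption: arbitrary default when nobody is detected
    return widest_gap_center(face_degs)
```

With faces at 10, 50, 90, and 200 degrees and no speaker, the widest gap runs from 200 degrees round to 10 degrees, so the cut falls at 285 degrees and no participant is split across the panorama edges.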

When the control board 4 has identified the direction of the speaker, it also creates image data of the speaker in which the participant whose face was recognized in that direction is the main subject. This image data may be created by extracting the image portion of the face-recognized participant, or by taking the image portion within a predetermined angular range around the identified speaker direction.

The control board 4, acting as communication means, uses a local area network (LAN), the Internet, a public telephone network, a mobile phone network, a dedicated communication line, or the like to perform data communication with another imaging apparatus 1 with a voice input/output function at a remote location. It transmits to that other apparatus the audio signal input through the microphone 5, together with the panoramic image data and the speaker image data obtained by processing the omnidirectional image data F captured by the omnidirectional camera 7 as described above.

It also receives the audio signal, panoramic image data, speaker image data, and the like transmitted from the other imaging apparatus 1 with a voice input/output function. The speaker image data is transmitted and received only when that image data has been created. In this embodiment, the imaging apparatus 1 with a voice input/output function does not include the display 8, so the received image data is output to a connection terminal for the display 8 and displayed on the display 8 connected to that terminal. As described later, the imaging apparatus 1 may instead include the display 8 and output the received image data to its own display 8.

Although the control board 4 has been described as performing sound source direction recognition, image recognition, image processing, and so on, the control board 4 may instead mainly control only the input and output of audio signals and image data, with sound source direction recognition, image recognition, and image processing performed by a personal computer (personal computer PC, shown in FIG. 2) connected to the control board 4 by wired LAN, wireless LAN, USB, or the like. Likewise, although the various kinds of image processing have been described as being performed in the imaging apparatus 1 whose omnidirectional camera 7 captured the omnidirectional image, the image processing may be performed by the imaging apparatus 1 on the receiving side or by a personal computer PC connected to it. That is, the omnidirectional image data F captured by the omnidirectional camera 7 may be transmitted as is, and the receiving imaging apparatus 1 may process it and display it on the display 8.

The imaging apparatus 1 with a voice input/output function of such a video conference system is used, for example, placed on a table T in a conference room as shown in FIG. 2. The conference participants P sit around the table T. Here, the participants P sit in two rows, one along each of the two long sides of the rectangular table T. In FIG. 2, the personal computer PC is used as described above; the display 8 is connected via the personal computer PC, and image data processed by the personal computer PC is displayed on the display 8.

In the state shown in FIG. 2, the omnidirectional image data F captured by the omnidirectional camera 7 appears as shown in FIG. 3. FIG. 3 shows the three-dimensional omnidirectional image data F in simplified form, projected onto a plane. The control board 4 processes this omnidirectional image data F to produce the panoramic image G1, or the panoramic images G2 and G3 obtained by dividing the panoramic image G1 in two, which are displayed on the display 8 as shown in FIG. 4(a) or FIG. 4(b).
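The conversion itself is described only as known image processing. One common approach is a polar unwrap, sketched below under stated assumptions: the omnidirectional image is a circular image centered at (cx, cy), the useful ring lies between radii r_inner and r_outer, and the top row of the panorama samples the outer ring. All names and the row orientation are assumptions for illustration; the real mapping depends on the lens and mounting.

```python
import math


def panorama_to_source(u, v, pano_w, pano_h, cx, cy, r_inner, r_outer, cut_deg=0.0):
    """Map panorama pixel (u, v) to coordinates (x, y) in the circular source image.

    Column u selects the bearing, starting at cut_deg at the left edge.
    Row v selects the radius, from r_outer (top row) to r_inner (bottom row).
    """
    theta = math.radians(cut_deg + 360.0 * u / pano_w)
    r = r_outer - (r_outer - r_inner) * v / (pano_h - 1)
    return cx + r * math.cos(theta), cy + r * math.sin(theta)


# Example: a 360 x 100 panorama taken from a 1000 x 1000 source image.
x0, y0 = panorama_to_source(0, 0, 360, 100, 500, 500, 50, 450)
x1, y1 = panorama_to_source(90, 99, 360, 100, 500, 500, 50, 450)
```

A full converter would evaluate this mapping for every panorama pixel and sample the source image at the returned coordinates, typically with interpolation.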

In this embodiment, as shown in FIG. 4(b), the intervals between the participants P in the omnidirectional image data F are evaluated, and where an interval is equal to or greater than a predetermined interval (angle), the panoramic image G1 is split there and the gap between the split parts is cut out, which compresses the horizontal width of the panoramic images G2 and G3. When creating the panoramic images G1, G2, and G3, all the gaps between participants P may be cut out. Alternatively, image data of each participant P may be created with a predetermined width (predetermined angular range) and arranged side by side to form the panoramic image; in this case too, the gaps between participants P are not displayed. In FIG. 4(b), the two separated pieces of image data are displayed in two rows, one above the other, so that each of the panoramic images G2 and G3 is displayed large.
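The splitting step can be sketched as follows. This is a hedged illustration, assuming face directions are bearings in degrees and the panorama starts at a known cut bearing; the threshold of 60 degrees and the 15-degree margin kept around each group are hypothetical values chosen for the example.

```python
def split_segments(face_degs, cut_deg, min_gap_deg=60.0, margin_deg=15.0):
    """Group faces into panorama segments, dropping gaps of min_gap_deg or more.

    Returns a list of (start, end) angles in degrees, measured from the cut
    position, one pair per segment (e.g. G2 and G3 for two groups).
    """
    rel = sorted((a - cut_deg) % 360.0 for a in face_degs)
    if not rel:
        return []
    groups = [[rel[0]]]
    for a in rel[1:]:
        if a - groups[-1][-1] >= min_gap_deg:
            groups.append([a])      # large gap: start a new segment
        else:
            groups[-1].append(a)    # close neighbour: same segment
    return [(max(0.0, g[0] - margin_deg), min(360.0, g[-1] + margin_deg))
            for g in groups]


# Two rows of three participants on opposite long sides of the table.
segments = split_segments([40, 60, 80, 220, 240, 260], cut_deg=330)
```

Here the result is two segments, (55, 125) and (235, 305) degrees from the cut, which could be stacked in two rows as in FIG. 4(b). The margin should stay below half the gap threshold so that neighbouring segments do not overlap.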

When a speaker has been identified, an image G10 mainly showing the speaker is displayed separately, as shown in FIG. 4(c), in addition to the panoramic image G1 shown in FIG. 4(a). A video conference is not necessarily held between only two sites and may involve three or more; in that case, for example, the screen of the display 8 is divided and the panoramic images G1, G4, and G5 are displayed in the respective divisions, as shown in FIG. 4(d). FIG. 4(d) shows a video conference connecting four sites, with the images of the three conference rooms other than the one containing the display 8 being displayed.

In a video conference system using this imaging apparatus 1 with a voice input/output function, the control board 4 of the imaging apparatus 1 installed in each conference room, acting as communication means as described above, transmits and receives the image data captured and the audio signal input in each conference room. As a result, images of the participants in the other conference rooms are displayed on the display 8 as described above, and the audio signals input in the other conference rooms are output from the speaker 6.

In the imaging apparatus 1 with a voice input/output function and the video conference system described here, the omnidirectional camera 7, the microphone 5, and the speaker 6 are configured substantially as one unit as described above, and a participant who speaks (the speaker) will basically try to speak toward the microphone 5. Because the omnidirectional camera 7 is near the microphone 5, the speaker ends up speaking toward the omnidirectional camera 7 and is captured from the front. Consequently, when the speaker's image G10 is displayed on the display 8, the speaker is likely to appear to be talking to the participants in the other conference room who are watching the display 8.

When participants are in discussion with those in another conference room, the voice of the speaker in the other room comes from the speaker 6 near the omnidirectional camera 7, so participants turn toward the speaker 6 to hear it better. This makes it likely that a person speaking will be facing the omnidirectional camera 7. It therefore becomes easy, as described above, to obtain an image in which the speaker is talking while facing the participants in the other conference room. For these reasons, the sense of incongruity peculiar to video conferences, which arises when a speaker shown on the display 8 is talking while facing away from the omnidirectional camera 7, can be suppressed. In other words, the speaker is naturally prompted to face the omnidirectional camera 7 without making a conscious effort to do so.

In addition, because the omnidirectional camera 7 basically captures all participants sitting around the table T at substantially the same size, an image of the speaker can be obtained easily, without any particular control of the omnidirectional camera 7, simply by identifying the participant who is speaking as described above.

Next, a second embodiment of the present invention will be described.
As shown in FIGS. 5(a) and 5(b), the imaging apparatus 1a with a voice input/output function of the second embodiment includes, like the imaging apparatus 1 with a voice input/output function of the first embodiment, a base plate 11, a cover 12, a control board (not shown; the control board 4 of FIG. 1), a microphone 5, a speaker 6, and an omnidirectional camera 7. The imaging apparatus 1a of the second embodiment further includes a display 8. That is, the difference between the imaging apparatus 1 of the first embodiment and the imaging apparatus 1a of the second embodiment is whether the display 8 is separate from the apparatus or built into it.

In this embodiment, the base plate 11 is formed as a rectangular plate, with a microphone 5 at each of its four corners. Displays 8 (for example, liquid crystal displays) are attached to a pair of mutually separated side edges of the base plate 11, with their display screens facing in opposite directions (outward). The control board (not shown) and the speaker 6 are disposed between the two displays 8 on the base plate 11.

The cover 12 is formed as a rectangular parallelepiped corresponding to the rectangular base plate 11 and is attached so as to cover the base plate 11. The two mutually parallel side surfaces of the cover 12 that correspond to the two displays 8 each have a window 12a through which the display screen of the display 8 is visible from outside. The top plate of the cover 12 has an opening 12b corresponding to the speaker 6. A bridge-shaped camera fixing portion 12c spans the opening 12b, and the omnidirectional camera 7 is attached to this camera fixing portion 12c. One or more holes may be provided in the cover 12 at positions corresponding to the microphones 5.

The imaging apparatus 1a with a voice input/output function has its two displays 8 facing in opposite directions so that it is well suited to the case where, as shown in FIG. 2, participants P sit side by side along the two mutually parallel side edges of the table T. The displays 8 have relatively small screens, for example about 7 to 15 inches, so that when the apparatus is placed on the table T it does not block the line of sight between participants sitting opposite each other. This also reduces the cost of the displays 8.

The imaging apparatus 1a with a voice input/output function of the second embodiment provides substantially the same effects as the imaging apparatus 1 of the first embodiment. In addition, because the display 8 is provided near the omnidirectional camera 7, participants seated around the table T as described above can view the display 8 comfortably while facing forward, without turning their heads at an angle.

Furthermore, when a participant P turns toward the display 8, the participant is also looking toward the omnidirectional camera 7, which is located near and roughly above the display 8, so image data can be obtained in which nearly all participants appear to be looking at the participants at the other venue. In the first embodiment, the structure mainly prompts the speaker to look at the omnidirectional camera 7 while speaking. The other participants, however, may be looking at the display 8 located somewhere other than the omnidirectional camera 7, so it was difficult to avoid capturing images in which participants other than the speaker are looking away.

In the second embodiment, by contrast, the display 8 is placed near the omnidirectional camera 7, so a participant who looks at the display 8 has his or her face turned toward the omnidirectional camera 7. The speaker, too, no longer needs to turn away from the microphone 5, the speaker 6, and the omnidirectional camera 7 in order to look at the display 8, and so keeps facing the omnidirectional camera 7 while speaking.

Next, a third embodiment of the present invention will be described.
As shown in FIGS. 6(a) and 6(b), the imaging apparatus 1b with a voice input/output function of the third embodiment includes, like the imaging apparatus 1 with a voice input/output function of the first embodiment, a base plate 21, a cover 22, a control board (the control board 4 of FIG. 1), a microphone 5, a speaker 6, and an omnidirectional camera 7. Like the second embodiment, the imaging apparatus 1b of the third embodiment also includes a display 8.

In this embodiment, the base plate 21 is formed as a triangular plate, with a microphone 5 at each of its three corners. A display 8 is attached to each of the three side edges of the base plate 21 with its display screen facing outward. The control board (not shown) and the speaker 6 are disposed inside the three displays 8 on the base plate 21.

The cover 22 is formed as a triangular prism corresponding to the triangular base plate 21 and is attached so as to cover the base plate 21. Each of the three side surfaces of the cover 22 has, at the position corresponding to the display 8, a window 22a through which the display screen of the display 8 is visible from outside. The top plate of the cover 22 has an opening 22b corresponding to the speaker 6. A Y-shaped bridge forming a camera fixing portion 22c spans the opening 22b, and the omnidirectional camera 7 is attached to this camera fixing portion 22c. One or more holes may be provided in the cover 22 at positions corresponding to the microphones 5.

The imaging apparatus 1b with a voice input/output function of the third embodiment has substantially the same structure as the imaging apparatus 1a of the second embodiment, apart from the number of displays 8 and microphones 5 and the planar shape being triangular rather than rectangular, and provides the same effects. In the third embodiment, the displays 8 face three directions 120 degrees apart, which reduces the blind-spot directions around the table T from which no screen of a display 8 can be seen. Alternatively, with the shape of the second embodiment, displays 8 may be provided on all side surfaces of the cover 12 so that the imaging apparatus 1a has four displays 8.

Next, a fourth embodiment of the present invention will be described.
As shown in FIGS. 7(a) and 7(b), the imaging apparatus 1c with a voice input/output function of the fourth embodiment includes, like the imaging apparatus 1 with a voice input/output function of the first embodiment, a control board (the control board 4 of FIG. 1), a microphone 5a, a speaker 6a, and an omnidirectional camera 7a. Like the second and third embodiments, the imaging apparatus 1c of the fourth embodiment also includes a display 8a.

In this embodiment, a display 8a larger than 15 inches, for example about 20 to 32 inches (or larger), incorporates the control board, the microphone 5a, and the speaker 6a of the imaging apparatus 1c with a voice input/output function, and the omnidirectional camera 7a is attached to the center of the top surface of the display 8a. In other words, the configuration is that of a display with a built-in speaker and microphone, such as a personal computer display, to which the control board and the omnidirectional camera 7a have been added. The display 8a may, however, be connected to a personal computer PC, with the personal computer PC taking over the functions of the control board other than data input and output. In that case, the display 8a can be connected to the personal computer PC in the same way as a display of the type equipped with a speaker, microphone, and camera.

As shown in FIGS. 7(a) and 7(b), in the fourth embodiment the display 8a has display screens 14a and 14b on its front and back surfaces, so that participants sitting opposite each other as shown in FIG. 2 each view a different display screen 14a or 14b. Alternatively, the imaging apparatus 1c may omit the display screen 14b on the back side, and a plurality of imaging apparatuses 1c may be used. This would result in a plurality of omnidirectional cameras 7a, but since more than one is not necessarily needed, a type with the omnidirectional camera 7a may be combined with a type without it. Depending on the size of the display 8a, the omnidirectional camera 7a may sit too high relative to the participants, so that only the upper part of each participant's bust falls within a hemispherical imaging range and, in the worst case, only the upper part of the face is captured. It is therefore preferable to use, as the omnidirectional camera 7a, a full-spherical camera whose imaging range is wider than a hemisphere and close to a full sphere.

The imaging apparatus 1c with a voice input/output function of the fourth embodiment provides substantially the same effects as the first and second embodiments.

1, 1a, 1b, 1c: Imaging apparatus with voice input/output function
4: Control board (image recognition means, image processing means, sound source direction recognition means, communication means)
5, 5a: Microphone (voice input means)
6, 6a: Speaker (voice output means)
7, 7a: Omnidirectional camera
8, 8a: Display

Claims

1. An imaging apparatus with a voice input/output function, comprising:
an omnidirectional camera that captures images of its surroundings;
audio output means provided in the vicinity of the omnidirectional camera, for outputting to the surroundings, as sound, an audio signal input from outside; and
voice input means provided in the vicinity of the omnidirectional camera, for inputting surrounding sound as an audio signal,
wherein the apparatus outputs image data captured by the omnidirectional camera and the audio signal input by the voice input means.

2. The imaging apparatus with a voice input/output function according to claim 1, wherein a plurality of displays that display image data input from outside so as to be visible from a plurality of surrounding directions are provided in the vicinity of the omnidirectional camera, at positions that do not obstruct imaging of the surroundings by the omnidirectional camera.

3. The imaging apparatus with a voice input/output function according to claim 2, wherein the voice input means includes at least three microphones each facing a different surrounding direction, the apparatus further comprising:
sound source direction recognition means for identifying the direction of a sound source from the volume of the sound input to each microphone; and
image processing means for converting omnidirectional image data captured by the omnidirectional camera into image data centered on the direction of the sound source identified by the sound source direction recognition means.

4. The imaging apparatus with a voice input/output function according to any one of claims 1 to 3, further comprising:
image recognition means for recognizing the faces of imaged persons in the image data captured by the omnidirectional camera and identifying, based on the mouth movement of the recognized faces, the imaged person who is speaking; and
image processing means for converting omnidirectional image data captured by the omnidirectional camera into image data centered on the imaged person identified as speaking by the image recognition means.

5. A video conference system comprising a plurality of the imaging apparatuses with a voice input/output function according to claim 1, wherein each of the imaging apparatuses comprises communication means for outputting the image data and the audio signal to another of the imaging apparatuses and for inputting the image data and the audio signal output from the other imaging apparatus.