JP5316248B2 - Video conference device, video conference method, and program thereof

Info

Publication number: JP5316248B2
Application number: JP2009143626A
Authority: JP
Inventors: 禎史荒木; 隆子橋本; 慶二大村; 勇児糟谷
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2009-06-16
Filing date: 2009-06-16
Publication date: 2013-10-16
Anticipated expiration: 2029-06-16
Also published as: JP2011004007A

Description

The present invention relates to a video conference apparatus, a video conference method, and a program therefor, used, for example, in a video conference held between a plurality of sites remote from one another.

A video conference apparatus used in conventional video conferences is described first. In advance, a generation unit generates the feature amount (for example, a face feature amount) needed for person image identification (face image identification) from a person image (face image) of each conference participant captured by a TV camera. The generated feature amount is associated with the participant's profile and registered in a database storage unit. When a profile is to be displayed for a conference participant who is captured by the TV camera of the local terminal and shown on the display of the partner terminal, the generation unit generates the feature amount anew. This feature amount is collated against the registered feature amounts, and if a match is found, the profile corresponding to the matching feature amount is sent to the requesting partner terminal and shown on its display. Video conference apparatuses of this kind are already known (see, for example, Patent Documents 1 and 2).

However, a conventional video conference apparatus must perform face image identification each time in order to generate the feature amount anew. In general, the identification rate of face image identification is highest when the displayed person directly faces the camera, and it falls as the face turns from an oblique angle toward a side view. In a video conference with multiple participants, it is difficult for every participant to face the camera directly, so some participants may be captured obliquely or in profile. Even when a face directly faces the camera, the identification rate falls in the same way if the face is displayed at a small size or the room is dark. In these cases, face image identification alone cannot correctly identify all participants, and an appropriate profile cannot be displayed.
In view of the above problems, an object of the present invention is to provide a video conference apparatus, a video conference method, and a program therefor that reduce how often a participant's feature amount is used and that appropriately transmit a profile to the partner's communication terminal.

To solve the above problem, the video conference apparatus of the present embodiment has: a database storage unit in which a feature amount for identifying each prospective participant and attribute information indicating the attributes of that prospective participant are registered in association with each other; a feature amount generation unit that generates the feature amount of a participant captured by an imaging device; a position information detection unit that detects position information of the captured participant and registers it in the database storage unit in association with the attribute information of the captured participant; an identification unit that identifies the participant based on the feature amount of the captured participant and the feature amounts in the database storage unit, and that identifies a participant captured for the second or a subsequent time based on the position information of that participant and the position information in the database storage unit; and a transmission unit that transmits the attribute information of the participant identified by feature amount in association with the video of the captured participant, and transmits the attribute information of the participant identified by position information in association with the video of the participant captured for the second or a subsequent time.

With the video conference apparatus, video conference method, and program therefor of the present invention, reducing how often the feature amount is used improves the accuracy of participant identification and allows a profile to be appropriately transmitted to the partner's communication terminal.

FIG. 1 is a diagram showing a functional configuration example of the video conference system of the present embodiment.
FIG. 2 is a perspective view of an example of the imaging device used in the video conference system of the present embodiment.
FIG. 3 is a diagram showing an example of a conference layout suited to the video conference apparatus of the present embodiment.
FIG. 4 is a diagram showing a functional configuration example of the video conference apparatus of the present embodiment.
FIG. 5 is a flowchart showing the main processing flow of the video conference apparatus of the present embodiment.
FIG. 6 is a diagram showing an example of the data table of the present embodiment.
FIG. 7 is a diagram showing an example of an image after a participant has been cut out.
FIG. 8 is a diagram showing an example of an image in which two participants appear in the cut-out.
FIG. 9 is a diagram showing an example of an image on which attribute information is superimposed.
FIG. 10 is a diagram showing an example of a data table with which position information is associated.
FIG. 11 is a diagram showing an example of an image as displayed on the display unit.
FIG. 12 is a flowchart showing part of the main processing flow of a video conference apparatus of another embodiment.
FIG. 13 is a diagram showing an example of an image with which error information is associated.

Embodiments for carrying out the present invention are described below with reference to the drawings. Components having the same function and steps performing the same processing are given the same reference numerals, and duplicate description is omitted.

The video conference apparatus and related elements of Embodiment 1 are described below. In a video conference held between a plurality of mutually remote sites, this video conference apparatus is used so that the video from the imaging device (for example, a TV camera) at each site is transmitted to the other sites over a network.
FIG. 1 shows a functional configuration example of a video conference system 1000 that includes the video conference apparatus 100 of Embodiment 1. In the following description, a video conference is held between point L and point M, which are remote from each other, and a video conference system 1000 is installed at each point. Point L is the local site and point M is the partner site; the video conference apparatus at point L is referred to as the first video conference apparatus 100, and the one at point M as the second video conference apparatus 200. The video conference apparatus of the present embodiment can also be used for a video conference among three or more sites.

As shown in FIG. 1, the video conference system 1000 consists of an operation unit 102, a video output unit 103 (for example, a display), an audio output unit 104 (for example, a speaker), an imaging device 106 described later, a hard disk 108, a memory 110, the video conference apparatus 100, and a communication control unit 112.

Of the components of the video conference system 1000, all except the imaging device 106 and the video conference apparatus 100 are components commonly found in a PC or similar equipment. The operation unit 102 is, for example, a keyboard or a mouse. The video output unit 103 outputs the video conference scene and the participants at the partner site (that is, at point L, the scene and participants at point M), and the audio output unit 104 outputs the voices of the video conference participants at the partner site.

FIG. 2 shows a perspective view of an example of the imaging device 106. The imaging device 106 of the present embodiment consists of a first cylindrical portion 106a and a second cylindrical portion 106b of different diameters. Imaging means 106c are arranged at equal intervals on the outer peripheral surface of the first cylindrical portion 106a. Each imaging means 106c is, for example, a TV camera and captures the conference participants. Sound collecting means 106d are arranged at equal intervals in the circumferential direction on the outer peripheral surface of the second cylindrical portion 106b. Each sound collecting means 106d is, for example, a microphone and picks up the voices of the conference participants. In the present embodiment it is thus preferable to use an imaging device that can capture images and collect sound in all directions, as shown in FIG. 2. The imaging device 106 is described in "System that automatically creates conference minutes content from conference video and audio data" [online], October 14, 2008, National Institute of Advanced Industrial Science and Technology [retrieved May 21, 2009], Internet <URL: http://www.aist.go.jp/aist_j/press_release/pr2008/pr20081014_2/pr20081014_2.html>.
The imaging device 106 need not be capable of omnidirectional imaging as described above; it may instead be an imaging device that automatically turns toward the speaker and zooms and focuses.

FIG. 3 shows the form of video conference best suited to the video conference apparatus 100 of Embodiment 1. As shown in FIG. 3, the participants are preferably seated in a circle with the imaging device 106 at the center, because the imaging device shown in FIG. 2 can then always capture every participant's face roughly head-on.

FIG. 4 shows a functional configuration example of the video conference apparatus 100 of Embodiment 1, and FIG. 5 shows its main processing flow. The video conference apparatus 100 of Embodiment 1 consists of a cutout unit 2, a feature amount generation unit 6, a position information detection unit 8, an identification unit 10, a synthesis unit 12, an encoding unit 14, and a database storage unit 16.
First, before the video conference starts, a data table such as the one shown in FIG. 6 is registered in the database storage unit 16. As shown in FIG. 6, the data table associates the attribute information of each prospective participant in the video conference with a feature amount. The example of FIG. 6 shows a data table for nine prospective participants named A, B, C, D, E, F, G, H, and I. A prospective participant is a person scheduled to participate, and the term covers, for example, both actual participants and persons who were scheduled to participate but became unable to.
Here, attribute information is information indicating the attributes of a prospective participant in the video conference. In the example of FIG. 6, the attribute information consists of name information indicating the person's name and profile information. In the example of FIG. 6, the profile information is, for instance, the prospective participant's title. For example, the profile information (title) of the prospective participant named A is "Development Department Manager". As shown in FIG. 6, past achievements may also be registered in addition to the title, as is done for prospective participants B and C. For example, for the prospective participant named B, the past achievement "Brought the XX project to success" is registered as profile information in addition to "Planning Department Manager".
The feature amount is information for identifying a prospective participant. Feature amounts include, for example, face feature amounts, voice feature amounts, fingerprint feature amounts, and retina feature amounts. For ease of processing, a face feature amount or a voice feature amount is preferred. A face feature amount is, for example, a numerical representation of the features of a user's face that allows the person to be distinguished from others, such as numerical values expressing the shapes of the facial components (eyes, nose, mouth, eyebrows, and so on) and their positions relative to one another. Alternatively, a digitized form of the registrant's previously acquired face image itself may be used as the face data, for example JPEG image data of a prescribed size.
A voice feature amount expresses the features of a voice with a small amount of information; it is, for example, a feature vector composed of the cepstrum and physical quantities of the dynamic features of the cepstrum.
When a face feature amount is used as the feature amount, the data table is created as follows. First, the imaging device 106 captures the face of each prospective participant in the video conference head-on. Processing then proceeds in the following sequence, using the cutout unit 2 described below.
(1) Detection of the face region in the input video and determination of the position of the face region
(2) Cutting out of the face region (see FIG. 7) by the cutout unit 2
(3) Normalization of variations in the size, brightness, and so on of the cut-out face region
(4) Extraction of the face feature amount from the normalized face region
When extraction of the face feature amount is complete, the attribute information (name information and profile information) of the prospective participant is entered, for example with the keyboard of the operation unit 102, and is registered in the database storage unit 16 in association with the face feature amount. In the following description, a feature amount registered in the database storage unit 16 is called a registered feature amount.
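The registration procedure above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the class and function names (`ParticipantRecord`, `DatabaseStorage`, `extract_face_feature`) are hypothetical, face detection and cutting out are taken as already done, and min-max scaling of brightness values merely stands in for the normalization and feature extraction of steps (3) and (4).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class ParticipantRecord:
    """One row of the data table (hypothetical layout modeled on FIG. 6)."""
    name: str                          # name information
    profile: str                       # profile information (title, achievements)
    registered_feature: List[float]    # registered feature amount
    position: Optional[float] = None   # position information, added during the conference


class DatabaseStorage:
    """Hypothetical stand-in for the database storage unit 16."""

    def __init__(self) -> None:
        self.records: Dict[str, ParticipantRecord] = {}

    def register(self, name: str, profile: str, feature: Sequence[float]) -> None:
        # Associate the attribute information with the feature amount.
        self.records[name] = ParticipantRecord(name, profile, list(feature))


def extract_face_feature(face_pixels: Sequence[float]) -> List[float]:
    """Assumed placeholder for steps (3) and (4): scale brightness to [0, 1]
    and use the scaled values themselves as the feature vector."""
    lo, hi = min(face_pixels), max(face_pixels)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat image
    return [(p - lo) / span for p in face_pixels]


db = DatabaseStorage()
db.register("A", "Development Department Manager",
            extract_face_feature([10.0, 20.0, 30.0]))
```

The `position` field starts empty because, in the description, position information is only added once a participant has spoken and been identified during the conference.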

The processing of the video conference apparatus 100 after the data table for all prospective participants has been registered in the database storage unit 16 is described next. The video conference takes the form shown in FIG. 3; the participants are A through H, and I is absent.

Suppose participant B speaks during the conference. The imaging device 106 captures participant B and picks up the speech, and the position information detection unit 8 detects position information (described later) (step S1). In the following description, the first capture of a participant is called the "first imaging", and any second or later capture is called the "second imaging".
The audio signal a of participant B is input to the position information detection unit 8, and the video signal b of participant B is input to the cutout unit 2. The imaging device captures all directions (360 degrees) as described above, and the cutout unit 2 cuts out the image so as to obtain a roughly head-on image of the speaker B. The cutout processing follows steps (1) to (4) above. The cutout unit 2 outputs the cut-out face image information c and the location information d of the face region (hereinafter, "face location information"). The face image information c is input to the feature amount generation unit 6, and the face location information d is input to the position information detection unit 8. FIG. 7 shows an example of the face image information c.
Next, the control unit 18 determines whether this is a second or later imaging (step S2). Here it is the first imaging (No in step S2), so processing proceeds to step S3. How the control unit 18 determines whether an imaging is the second or later is described below.

The feature amount generation unit 6 uses the face image information c to generate the feature amount of participant B captured in the first imaging (step S3). As noted above, the feature amount is, for example, a face feature amount or a voice feature amount. When a voice feature amount is used, the audio signal a from the position information detection unit 8 is used.

The identification unit 10 performs identification based on the registered feature amounts in the database storage unit 16 and the feature amount from the feature amount generation unit 6 (step S4). In this example, the identification unit 10 consists of a face identification unit 152, a voice identification unit 154, and a position information identification unit 156. Only one of the face identification unit 152 and the voice identification unit 154 is required.
The identification unit 10 identifies the participant based on the registered feature amounts in the database storage unit 16 and the feature amount generated by the feature amount generation unit 6. In the following description, the first identification (identification using the feature amount) is called the "first identification", and the second and subsequent identifications (identification using position information, described later) are called the "second identification". Specifically, the similarity between the feature amount and each registered feature amount is calculated. When the feature amount is a numerical value, the similarity may be calculated, for example, as the reciprocal of the absolute value of the difference between the feature amount and the registered feature amount. Alternatively, the negated absolute value of that difference may be used. Any other measure may be used as long as it indicates the degree to which the feature amount and the registered feature amount resemble each other.
The participant whose registered feature amount yields a calculated similarity greater than a predetermined first threshold is then identified (hereinafter, the "similarity-based method").
Another option is, for example, the mutual subspace method. Given two subspaces to compare, the mutual subspace method calculates the angles between them (called canonical angles; there are N of them for N-dimensional subspaces) and uses the smallest of the canonical angles obtained as the similarity.
As methods by which the face identification unit 152 performs identification using the face feature amount, there is a method that divides the participant's face image into mosaic blocks, extracts feature points for each block, and collates them, and a method that performs collation based on an isodensity line distribution extracted from the participant's face image.
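The similarity-based method described above can be sketched in Python as follows. This is a minimal sketch for scalar feature amounts only. The function names are hypothetical, and returning the single best match above the first threshold is an assumption made for illustration; the description only requires that the similarity exceed the threshold.

```python
from typing import Callable, Dict, Optional


def similarity_reciprocal(feature: float, registered: float) -> float:
    """Reciprocal of the absolute difference; identical values give infinity."""
    diff = abs(feature - registered)
    return float("inf") if diff == 0 else 1.0 / diff


def similarity_negated(feature: float, registered: float) -> float:
    """Negated absolute difference; values closer to 0 mean more similar."""
    return -abs(feature - registered)


def identify(
    feature: float,
    registered_features: Dict[str, float],
    first_threshold: float,
    similarity: Callable[[float, float], float] = similarity_reciprocal,
) -> Optional[str]:
    """Return the name whose registered feature amount is most similar to
    `feature`, provided the similarity exceeds `first_threshold`; else None."""
    best_name: Optional[str] = None
    best_score = first_threshold
    for name, registered in registered_features.items():
        score = similarity(feature, registered)
        if score > best_score:  # must be strictly greater than the threshold
            best_name, best_score = name, score
    return best_name


# Made-up scalar feature amounts for three registered participants.
registered = {"A": 0.20, "B": 0.50, "C": 0.90}
match = identify(0.52, registered, first_threshold=10.0)
no_match = identify(0.70, registered, first_threshold=10.0)
```

With the made-up values shown, `match` is `"B"` and `no_match` is `None`, because no registered value lies close enough to 0.70. The threshold scale depends on which similarity function is chosen: a threshold of 10.0 suits the reciprocal form, whereas the negated form needs a small negative threshold such as -0.05.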

As a method by which the voice identification unit 154 performs identification using the voice feature amount, for example, the words to be used for authentication are input by voice, and the voiceprint data obtained by analyzing that input voice is stored in the database storage unit 16 as the feature amount.

As shown in FIG. 8, the direction-finding accuracy of the position information detection unit 8 may be insufficient, so that a point between two persons (the center line W in FIG. 8) is taken to be the speaker. In that case, the identification unit 10 performs identification processing on each of the two or more participants.

Increasing the number of kinds of feature amount used for identification (face and voice feature amounts in the example above) can raise the identification accuracy of the identification unit 10.

Next, the synthesis unit 12 extracts from the database storage unit 16 the attribute information corresponding to the feature amount of the participant identified by the identification unit 10. The synthesis unit 12 then associates the extracted attribute information with the video of the identified participant being captured by the imaging device 106. Here, associating means, for example, superimposing the attribute information on the video of participant B as shown in FIG. 9. Besides superimposition, any other method may be used as long as the conference participants at point M can understand that the video of participant B and the attribute information correspond. The associated participant video and attribute information are input to the encoding unit 14.

Meanwhile, the audio signal from which sounds other than the participants' have been removed by the position information detection unit 8 is also input to the encoding unit 14. The encoding unit 14 encodes the associated participant video and attribute information together with the audio from the sound source localization unit 4, and transmits the result to the second video conference apparatus 200 at point M (step S6).
Position information is described next. In step S1, as noted above, the position information detection unit 8 obtains position information e for the participant in the face image cut out by the cutout unit 2 at the time of the first imaging. The position information e is, for example, the angle θ (in radians) at which the speaking participant B is located, taking participant A as the reference. Another location may be used as the reference. The cutout unit 2 preferably cuts out the image so that the speaker direction θ coincides with the horizontal center of the cut-out image.
However, as shown in FIG. 8, the direction-finding accuracy of the position information detection unit 8 may be insufficient, so that a point between two persons (the center line W) is taken to be the direction of the speaker. In that case the position information e is preferably expressed in pixels. In the example of FIG. 8, the angle of the center line W with participant A as the reference is θ. The position information in this case is uniquely determined from the angle θ, the direction (angle from the reference direction) corresponding to the horizontal relative coordinate x on the cut-out image, and the resolution (the number of pixels used to display the full 360 degrees horizontally). For example, when 360 degrees is displayed with N pixels, the position information can be expressed as (θ·N/2π) + x (pixels).
Here, the relative coordinate x corresponds to the face location information d obtained by the cutout unit 2. Thus, even when the position information detection unit 8 takes a point between two persons (the center line) as the sound source direction of the speaker, using pixels for the position information yields position information that uniquely identifies the speaker. The position information e may otherwise take any form that uniquely determines the participant. Even when the image is cut out so that the speaker direction θ coincides with the horizontal center of the cut-out image, as shown in FIG. 7, the position information may be expressed in pixels, that is, as θ·N/2π.
After the first identification of the participant (after step S4 completes), the position information detection unit 8 registers the position information e in the database storage unit 16 in association with the attribute information of the captured participant (participant B in this example) (step S10). That is, as shown in FIG. 10, position information β_2 is added for participant B. In this way, after a participant speaks and is captured for the first time, the feature amount generation unit 6 generates the feature amount, and the position information detection unit 8 detects the position information e of the speaking participant; the database storage unit 16 is then updated as needed with this position information in association with the speaking participant's attribute information. Hereinafter, the position information e held in the database storage unit 16 is called the registered position information e. When steps S6 and S10 are complete, all processing for the first imaging by the video conference apparatus is finished.
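The pixel form of the position information, (θ·N/2π) + x, can be checked with a short Python sketch. The function name and the sample numbers are hypothetical, and treating x as a signed horizontal offset measured from the center of the cut-out image is an assumption made for illustration.

```python
import math


def position_in_pixels(theta: float, total_width_px: int, x_offset_px: float = 0.0) -> float:
    """Convert a speaker direction to a pixel position on the 360-degree panorama.

    theta          -- angle from the reference direction, in radians
    total_width_px -- N, the number of pixels spanning the full 360 degrees
    x_offset_px    -- x, the horizontal offset of the face within the cut-out
                      image (assumed here to be measured from its center)
    """
    return theta * total_width_px / (2 * math.pi) + x_offset_px


# A speaker 90 degrees from the reference on a 3600-pixel-wide panorama.
centered = position_in_pixels(math.pi / 2, 3600)
# Same sound-source direction, but the face sits 40 pixels right of center,
# as in the two-person case of FIG. 8.
offset = position_in_pixels(math.pi / 2, 3600, x_offset_px=40.0)
```

Here `centered` comes out at about 900 pixels (a quarter of the panorama) and `offset` at about 940 pixels, so two people who share one estimated sound-source direction still receive distinct position values.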

Next, suppose that participant B speaks again after finishing the first utterance. In this case, processing restarts from the start of the flowchart of FIG. 3. The photographing device 106 photographs participant B, who is the speaker, and the position information e′ is detected (step S1). The control unit 18 then determines whether this shooting is the second or a subsequent shooting (step S2). The control unit 18 can make this determination by checking whether position information for the direction estimated as the sound source by the position information detection unit 8 is registered in the database storage unit 16. That is, if a participant has been photographed before, that participant's position information is already registered in the database storage unit 16.
Since this is the second or a subsequent shooting of participant B (Yes in step S2), the position information identification unit 156 performs identification (the second identification) based on the position information e in the database storage unit 16 and the detected position information e′. The identification method using the position information is the same as the identification method using the feature amount; for example, a similarity may be used.
Once the second identification is made, the synthesizing unit 12 refers to the database storage unit 16 and associates the attribute information corresponding to the registered position information e that has, for example, a high similarity to the detected position information e′ with the video of the participant photographed for the second or subsequent time (participant B in this example), and the transmission unit transmits them (step S14). When the process of step S14 ends, the processing for the second and subsequent shootings by the video conference apparatus ends.
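The second identification can be sketched as a lookup of the detected position e′ against the registered positions. Everything named here is hypothetical: the `Record` class, the circular-distance similarity, and the 0.95 threshold are illustrative assumptions, since the description only says that a similarity may be used.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    name: str  # attribute information (name only, for brevity)
    positions: list[float] = field(default_factory=list)  # registered position info e, in pixels


def position_similarity(a: float, b: float, n_pixels: int) -> float:
    """Assumed similarity: 1.0 for identical positions, 0.0 for opposite sides."""
    d = abs(a - b) % n_pixels
    d = min(d, n_pixels - d)  # shortest way around the 360-degree panorama
    return 1.0 - d / (n_pixels / 2)


def identify_by_position(db: list[Record], detected: float, n_pixels: int,
                         threshold: float = 0.95) -> Optional[Record]:
    """Return the record whose registered position best matches e'.

    Returning None corresponds to "No" in step S2: nothing is registered
    for this direction, so this is treated as a first shooting.
    """
    best, best_sim = None, threshold
    for record in db:
        for e in record.positions:
            sim = position_similarity(e, detected, n_pixels)
            if sim >= best_sim:
                best, best_sim = record, sim
    return best


db = [Record("A", [0.0]), Record("B", [900.0])]
match = identify_by_position(db, 905.0, 3600)
print(match.name if match else "first shooting")
```

Because no feature amount is generated on this path, the cost of a repeat utterance is a scan over the registered positions.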

When a participant who has never been photographed (for example, participant C) is photographed, the result of step S2 is No, and in step S3 the feature amount generation unit 6 generates the feature amount of participant C. The video conference apparatus then performs steps S4, S6, and S10.

As described above, the video conference apparatus 100 according to the first embodiment detects the position information of a participant who is photographed for the first time upon speaking, and registers it in the database storage unit 16 in association with the attribute information. For a participant photographed for the second or subsequent time, the apparatus identifies the participant using the position information, without generating a feature amount, and extracts and transmits the attribute information. Therefore, the frequency of feature amount generation can be reduced, and appropriate attribute information can be transmitted without lowering the identification rate of participants even when the face is not directly facing the camera, the displayed face is small, or the room is dark.
The photographing device 106 may be a camera array capable of omnidirectional photographing, in which a plurality of cameras are arranged with all photographing surfaces facing outward. Alternatively, it may be a photographing device that automatically turns toward the speaker to photograph.

The video conference apparatus 200 according to the second embodiment differs from the video conference apparatus 100 according to the first embodiment in that it includes a display unit 22. When the identification unit 10 performs the first identification using the similarity-based method described in the first embodiment, there may be a plurality of similarities greater than the first threshold. In that case, the attribute information of the plurality of candidates for the participant being identified, namely those whose feature amounts yielded these similarities, is displayed on the display unit 22.

FIG. 11 shows an example of what is displayed on the display unit 22. In this example, two participants are displayed. For the participant on the left, the display prompts the user (usually a conference participant at point L who knows the participants) to select whether the name is A, B, or C. The user makes the selection with the operation unit 102 (for example, by clicking with a mouse). If none of the displayed candidates is the correct participant, the user enters the correct name in the input space Y in the lower row with the operation unit 102 (for example, a keyboard). For the participant on the right side of FIG. 11, the candidate with the name D and the input space Y are displayed. In the state shown, the person on the left is about to be entered as participant B and the person on the right as participant D.

User input is handled in the same way for the second identification (identification using position information).

FIG. 12 shows part of the main processing flow of the video conference apparatus 200. The processing flow of the video conference apparatus 200 according to the second embodiment is that of FIG. 3 with the flowchart of FIG. 12 inserted between step S4 and step S6, and with step S10 moved to the position shown in FIG. 12.

In step S102, when the first identification has produced candidate persons (Yes in step S102), the control unit 18 displays the candidate names (participants A, B, and C in the above example) on the display unit 22 (step S106). The user then either selects a name from the candidates or enters a name in the input space Y (step S108).

On the other hand, when the first identification has produced no candidate person (No in step S102), the control unit 18 displays an input space on the display unit 22 and has the user enter a name (step S104). The attribute information for the selected or entered participant name is then transmitted, together with the video and audio of that participant, to the second video conference apparatus 200 (step S6). In addition, the position information and the selected or entered name are stored in the database storage unit 16 in association with each other (step S10). When step S6 and step S10 are completed, the processing for the first shooting by the video conference apparatus according to the second embodiment ends.
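The branch at step S102 can be sketched as follows. The function names and the `ask_user` callback are hypothetical stand-ins for the display unit 22 and the operation unit 102, and ordering the candidates by similarity is an assumption made only for presentation.

```python
from typing import Callable


def candidates_above_threshold(similarities: dict[str, float],
                               first_threshold: float) -> list[str]:
    """Names whose similarity exceeds the first threshold, best match first."""
    names = [name for name, sim in similarities.items() if sim > first_threshold]
    return sorted(names, key=lambda name: similarities[name], reverse=True)


def resolve_participant(similarities: dict[str, float], first_threshold: float,
                        ask_user: Callable[[list[str]], str]) -> str:
    """Steps S102 to S108: let the user confirm or type the participant name.

    ask_user receives the candidate list (empty when step S102 is "No",
    in which case only the input space is offered) and returns the name
    that the user selected or entered.
    """
    candidates = candidates_above_threshold(similarities, first_threshold)  # S102
    return ask_user(candidates)  # S106 and S108, or S104 when the list is empty


similarities = {"A": 0.82, "B": 0.91, "C": 0.78, "D": 0.40}
# Stand-in for the user clicking the first candidate shown on the display.
print(resolve_participant(similarities, 0.7, lambda names: names[0]))
```

The returned name is what would then be registered with the position information in step S10 and sent with the video in step S6.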

The screen for selecting a participant (for example, FIG. 11) may be output to the video output unit 103. In this case, the meeting scene at point B and the participant selection screen may be displayed side by side as two screens, or may be switched automatically. A separate video output unit dedicated to the participant selection screen may also be installed.

With the video conference apparatus 200 according to the second embodiment, even when the first identification or the second identification yields a plurality of candidate participants, a conference participant can select or enter the correct name and other details. As a result, even when the identification accuracy of the identification unit 10 is low or the identification is wrong, the attribute information can be transmitted appropriately while placing as little burden as possible on the user.

During a video conference, the correspondence between the attribute information and the position information may change because participants change seats or are replaced as people enter and leave the room. In such a case, if the database storage unit 16 is used as it is, incorrect attribute information is transmitted. The third embodiment describes a video conference apparatus that can transmit appropriate attribute information even when participants change seats or are replaced during a video conference. The identification unit 10 of the video conference apparatus 300 according to the third embodiment includes a determination unit 20. The following description covers the case where participant B, who has already spoken once, is replaced by a new participant I.

When participant I, sitting in participant B's seat, speaks, the photographing device 106 photographs participant I, and the position information detection unit 8 detects the position information of participant I (that is, of participant B's seat). The feature amount generation unit 6 then generates the feature amount of participant I. Hereinafter, the shooting of participant I, the detection of the position information, and the generation of the feature amount are referred to as the current shooting, the current position information detection, and the current feature amount generation, respectively.

Here, the determination unit 20 obtains the similarity between the registered feature amount corresponding to the position information detected this time and the feature amount generated this time. The determination unit 20 determines whether this similarity is smaller than a predetermined value g (second threshold). A similarity smaller than the predetermined value g means that the feature amount at the time the position information was detected and registered (that is, the feature amount generated at the first shooting) differs greatly from the feature amount generated this time, and therefore that the participant has been replaced. In that case, the determination unit 20 extracts the attribute information corresponding to a feature amount close to the one generated this time (a feature amount having a high similarity to it). The video captured this time (that is, the video of participant I) is then transmitted in association with not only the attribute information but also error information f.

Here, the error information f is information transmitted when the similarity between the feature amount of the participant who spoke and the registered feature amount corresponding to that participant's position information is smaller than the predetermined value g. FIG. 13 shows an example of video with which the error information f is associated. The error information in the example of FIG. 13 is "New participant". This is the error information for the case where participant B has been replaced by a new participant I; if instead participant B has changed seats with participant F, who was already taking part in the video conference, "Participant F has changed seats with participant B" may be transmitted as the error information f. The error information is not limited to these.
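The check performed by the determination unit 20 can be sketched as below. Cosine similarity and the function names are assumptions, since the description does not fix a similarity measure, and treating a speaker who matches no registered feature as a new participant is one possible reading of the two error messages given as examples.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed similarity measure between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def judge_speaker(registered_name: str, registered_feature: Sequence[float],
                  current_feature: Sequence[float],
                  database: dict[str, Sequence[float]],
                  g: float) -> tuple[Optional[str], Optional[str]]:
    """Return (name to attach, error information f or None).

    registered_name and registered_feature belong to the position
    detected this time; g is the predetermined value (second threshold).
    """
    if cosine_similarity(registered_feature, current_feature) >= g:
        return registered_name, None  # same person as registered for this seat
    # Similarity below g: the seat is now occupied by someone else.
    best = max(database, default=None,
               key=lambda name: cosine_similarity(database[name], current_feature))
    if best is not None and cosine_similarity(database[best], current_feature) >= g:
        return best, f"Participant {best} has changed seats with participant {registered_name}"
    return None, "New participant"


database = {"B": [1.0, 0.0], "F": [0.0, 1.0]}
print(judge_speaker("B", database["B"], [0.05, 0.99], database, g=0.9))
```

In this sketch the attribute information is left empty for a new participant; in the second embodiment the name could instead be requested from the user.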

When both a face feature amount and a voice feature amount are used as the feature amount, the error information may be transmitted when the difference between the speaking participant's face or voice feature amount and the corresponding face or voice feature amount registered in the database storage unit 16 is larger than the predetermined value g for either one of them, or only when it is larger than the predetermined value g for both.
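The two policies for combining face and voice can be written as a small sketch. The function name `should_send_error` and the `require_both` flag are hypothetical, and the differences are assumed to have been computed already against the registered feature amounts.

```python
def should_send_error(face_diff: float, voice_diff: float, g: float,
                      require_both: bool = False) -> bool:
    """Decide whether to transmit error information.

    face_diff and voice_diff are the differences between the current and
    registered feature amounts; g is the predetermined value. With
    require_both=False either difference exceeding g is enough; with
    require_both=True both differences must exceed g.
    """
    exceeded = (face_diff > g, voice_diff > g)
    return all(exceeded) if require_both else any(exceeded)


# Face looks different, voice still matches.
print(should_send_error(0.6, 0.1, g=0.5))
print(should_send_error(0.6, 0.1, g=0.5, require_both=True))
```

Requiring both differences to exceed g is the more conservative choice, for example when poor lighting makes the face feature amount unreliable.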
With the video conference apparatus 300 according to the third embodiment, even if a participant is replaced or a new participant joins during a video conference, incorrect attribute information is not transmitted. Instead, error information indicating the replacement or the new arrival is transmitted, so that the other party can be properly informed, together with the attribute information, that a participant has been replaced or has newly joined.
The video conference apparatus 300 according to the third embodiment is preferably used in video conferences where it is known in advance that participants will be replaced or will newly join frequently.
The feature amount generation unit 6 of the video conference apparatus 300 generates a feature amount every time a participant speaks. Accordingly, to reduce the frequency of feature amount generation, it is preferable that another participant at point L switches from the mode of the video conference apparatus 100 (or 200) to the mode of the video conference apparatus 300 at the time a participant is replaced or a new participant is about to join. This switch may be entered from an input unit (not shown).
In the above example, one video conference apparatus is provided at each of point L and point M. As another example, however, a single video conference apparatus may be provided at only one of point L and point M (or at another point connected to point L and point M via the network), with the face feature data and the title and name data of the prospective participants at both points registered in its database storage unit, and that apparatus may run the video conference for the participants at both points.
In the above example, the database storage unit 16 is held in the video conference apparatus. However, the database storage unit 16 may be integrated with the hard disk 108 or the memory 110.

As attribute information, not only the title and name but also data summarizing the positions a prospective participant has taken in past meetings (for example, whether the participant supports or opposes a certain project) may be registered in the database storage unit and displayed on the video output unit 103.
In the above example, the present invention is applied to a video conference system that connects two points, point L and point M. However, the present invention is not limited to this and may also be applied to a video conference system that connects three or more points, or to any suitable two-way communication system other than a video conference system.

When the present invention is applied to an entertainment-oriented two-way communication system, image data of, for example, a prospective participant's favorite animation may be registered in the database storage unit 16 as attribute information, so that the animation image is displayed near the participant's face on the video output unit 103, or is displayed over the participant's face on the video output unit 103.
When the present invention is applied to a two-way communication system in which some prospective participants should not have their faces displayed on the video output unit 103, information instructing that a mosaic be applied may be registered in the database storage unit 16 as attribute information for such a person, so that a mosaic is applied to that person's face as displayed on the video output unit 103.
The present embodiments are not limited to the above examples, and various other configurations can of course be adopted without departing from the gist of the present invention.

The video conference apparatus of the embodiments described above can be realized by having a computer interpret a video conference program. The video conference program proposed in these embodiments is written in a computer-readable programming language, recorded on a recording medium such as a magnetic disk or a CD-ROM, and installed on a computer either from the recording medium or through a communication line. Once installed, it is interpreted by the CPU of the computer, which then functions as a video conference apparatus. Specifically, the functions of the video conference apparatus 100 shown in FIG. 1 may be performed by a CPU that interprets the video conference program.

1000 Video conference system
100 Video conference apparatus
102 Operation unit
103 Video output unit
104 Audio output unit
106 Photographing device
108 Hard disk
110 Memory
112 Communication control unit
2 Clipping unit
6 Feature amount generation unit
8 Position information detection unit
10 Identification unit
12 Synthesizing unit
14 Encoding unit
16 Database storage unit
18 Control unit
20 Determination unit
22 Display unit
152 Face identification unit
154 Voice identification unit
156 Position information identification unit

JP-A-6-121310
Japanese Patent No. 4055539

Claims

A video conference apparatus comprising:
a database storage unit in which a feature amount for identifying a prospective participant and attribute information indicating an attribute of the prospective participant are registered in association with each other;
a feature amount generation unit that generates a feature amount of a participant photographed by a photographing device;
a position information detection unit that detects position information of the photographed participant and registers the position information in the database storage unit in association with the attribute information of the photographed participant;
an identification unit that identifies the participant based on the feature amount of the photographed participant and the feature amount in the database storage unit, and that identifies a participant photographed for the second or subsequent time based on the position information of that participant and the position information in the database storage unit; and
a transmission unit that transmits, in association with each other, the attribute information of the participant identified by the feature amount and the captured video of the participant, and that transmits, in association with each other, the attribute information of the participant identified by the position information and the video of the participant captured for the second or subsequent time.

The video conference apparatus according to claim 1, further comprising a determination unit that causes the transmission unit to transmit the captured video of the participant and error information when the similarity between the feature amount in the database storage unit corresponding to the position information of the participant photographed by the photographing device and the feature amount of the photographed participant is smaller than a predetermined value.

The video conference apparatus according to claim 1, further comprising a display unit that displays all of the candidates when the identification by the identification unit yields a plurality of candidates for the participant to be identified.

The video conference apparatus according to claim 1, wherein the feature amount is at least one of a face feature amount and an audio feature amount.

The video conference apparatus according to claim 1, wherein the photographing device is capable of omnidirectional photographing, and the participants are located around the photographing device.

A video conference method comprising:
a generation step of generating a feature amount of a participant photographed by a photographing device;
a first transmission step of identifying, based on the generated feature amount, a feature amount in a database storage unit in which a feature amount for identifying a prospective participant and attribute information indicating an attribute of the prospective participant are registered in association with each other, and transmitting, in association with each other, the captured video of the participant and the attribute information corresponding to the identified feature amount in the database storage unit;
a registration step of registering, in the database storage unit, the position information of the photographed participant in association with the attribute information of the photographed participant;
a detection step of detecting the position information of a participant photographed for the second or subsequent time; and
a second transmission step of, after the detection step, transmitting in association with each other the video, captured for the second or subsequent time, of the participant identified based on the detected position information and the position information in the database storage unit, and the attribute information corresponding to that position information in the database storage unit.

A program for causing a computer to function as the video conference apparatus according to any one of claims 1 to 5.