JP2015043507A - Information processing unit, communication system and program

Info

Publication number: JP2015043507A
Application number: JP2013174578A
Authority: JP
Inventors: 渉畠中; Wataru Hatanaka
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2013-08-26
Filing date: 2013-08-26
Publication date: 2015-03-05
Anticipated expiration: 2033-08-26
Also published as: JP6191333B2

Abstract

PROBLEM TO BE SOLVED: To provide a device and the like that can display an image corresponding to the parties to a conversation and can improve the sense of presence in a conference. SOLUTION: The imaging direction of a camera 20 is changed and images of one or more users at the same base are recorded. A sound source direction detection section 32 detects the sound source direction from the voices of the users at the same base. A face direction determination section 33 determines the direction of the face of a user imaged after the imaging direction has been changed to the sound source direction. A conversation person determination section 34 determines whether the conversation is within the same base or with another base according to whether the same direction as the direction of the face is detected as a sound source direction. When the conversation is determined to be with another base, an image processing section 35 performs image processing on the captured image. When the conversation is determined to be within the same base, the image processing section 35 performs image processing on a captured image containing the user and the other user, or on the captured or recorded image of the user together with the recorded image of the other user. The processed image is transmitted to the other base, where the image of the user, or the images of the user and the other user side by side, are displayed.

Description

The present invention relates to an information processing apparatus that includes an imaging means and a plurality of sound input means, detects a sound source direction based on audio information input to the plurality of sound input means, changes the imaging direction to that sound source direction to perform imaging, and outputs the obtained image information together with the audio information; to a communication system including two or more such information processing apparatuses; and to a program for causing an information processing apparatus to execute this processing.

Video conference systems are used as one kind of communication system for communicating by image and voice between geographically separated bases. Some video conference systems photograph a plurality of conference participants with a single camera. In such a system, the camera is rotated left and right so that it points at the participant who is speaking.

One known system of this kind detects the speaking participant from the direction of the voice, points the camera in that direction, and detects and displays that participant's face (see, for example, Patent Document 1). In this system, to ensure that the face image of the speaking participant is conveyed in step with the voice, the face image of the participant whose voice was detected is displayed while the camera is rotating, and the mouth is made to move in time with the voice by compositing images in which the shape of the mouth is changed.

However, this processing is image processing aimed at natural communication with other bases, and it does not take into account conversations held within the same base. That is, even when a plurality of participants at the same base are talking to each other, the system only displays the face image of the one participant who is speaking at that moment and moves the mouth in time with the voice. As a result, participants at other bases cannot tell whom the displayed participant is talking to.

In addition, when participants at the same base speak in turn, the camera keeps swinging left and right and the face image switches with every swing, so participants at other bases who are watching cannot concentrate on the conference.

It is therefore desirable to provide a device and the like that displays an image corresponding to the parties to the conversation, so that viewers can tell who is talking to whom within the same base and can concentrate on the conference, thereby enhancing the sense of presence in the conference.

In view of the above problems, the present invention provides an information processing apparatus including an imaging means and a plurality of sound input means, the apparatus comprising: an imaging direction control means that controls the imaging direction of the imaging means; an image recording means that records images of one or more users at the same base, captured by the imaging means while the imaging direction is changed by the imaging direction control means; a sound source direction detection means that detects a sound source direction based on the voice of a user at the same base input to the plurality of sound input means; a face direction determination means that determines the direction of a user's face based on an image of that user captured by the imaging means after the imaging direction control means has changed the imaging direction to the sound source direction detected by the sound source direction detection means; a talker determination means that determines whether the conversation is with another user at the same base as the user or with another base, according to whether the sound source direction detection means has detected, as a sound source direction, the same direction as the face direction determined by the face direction determination means; an image processing means that, when the talker determination means determines that the conversation is with another base, performs image processing on the image captured by the imaging means, and, when it determines that the conversation is with another user at the same base, performs image processing on an image captured by the imaging means that includes the user and the other user, or on the image of the user captured by the imaging means or recorded in the image recording means together with the image of the other user recorded in the image recording means; and a communication control means that transmits the image obtained through the image processing to the other base and causes the image of the user, or the images of the user and the other user side by side, to be displayed.

According to the present invention, an image corresponding to the parties to a conversation can be displayed, and the sense of presence in the conference can be enhanced.

A diagram showing a configuration example of the communication system of the present embodiment.
A hardware configuration diagram of the information processing apparatus used in the communication system shown in FIG. 1.
A functional block diagram of the information processing apparatus used in the communication system shown in FIG. 1.
A flowchart showing the flow of processing executed by the information processing apparatus.
A flowchart showing the detailed flow of processing executed by the face direction determination unit and the image processing unit of the information processing apparatus.
A diagram showing an example of the orientation of the camera of the information processing apparatus located at a given base, the image of a user captured by that camera, and the image of the user displayed at another base.
A diagram showing another example of the orientation of the camera of the information processing apparatus located at a given base, the image of a user captured by that camera, and the image of the user displayed at another base.
A diagram showing yet another example of the orientation of the camera of the information processing apparatus located at a given base, the image of a user captured by that camera, and the image of the user displayed at another base.

FIG. 1 is a diagram showing a configuration example of the communication system of the present embodiment. Although the communication system is not limited to this configuration, it has two or more information processing apparatuses 10 and 11, each located at a geographically separate base, connected to a network 12. Examples of geographically separate bases include the Tokyo head office and a Hokkaido office, or the Tokyo head office and a New York branch.

The information processing apparatuses 10 and 11 can each be connected to the network 12 by wire or wirelessly; when connected wirelessly, they can connect to the network 12 via a base station called an access point. Only two information processing apparatuses 10 and 11 are shown here, but three or more information processing apparatuses may be connected to the network 12. The network 12 may be either a wired network or a wireless network, and a WAN (Wide Area Network), the Internet, or the like can be used.

The information processing apparatuses 10 and 11 have the same hardware configuration and the same functions, the details of which are described later. The information processing apparatus 10, located at a given base, has an imaging function: it captures one or more users who are conference participants at that base and transmits the resulting image to the information processing apparatus 11 located at another base, where the image is displayed. Strictly speaking, what is transmitted is image data or image information obtained by imaging, and the image is displayed based on that data or information, but for ease of explanation the term "image" is used in this application.

The information processing apparatus 10 also has a sound input/output function: it accepts input such as a voice uttered by a user and transmits it as voice or the like, together with the image described above, to the information processing apparatus 11. Strictly speaking, this voice is also voice data or voice information, but for ease of explanation the term "voice" is used in this application. In FIG. 1, imaging is performed by the camera of the information processing apparatus 10, and sound input is accepted by a plurality of microphones.

The information processing apparatus 10 receives from the information processing apparatus 11 the image captured by, and the voice input to, the information processing apparatus 11, displays the image, and outputs the voice. In FIG. 1, the voice is output from a speaker of the display device of the information processing apparatus 10, and the image is displayed on the display of that display device.

The information processing apparatus 10 is configured to detect the direction in which an object generating sound, that is, a sound source, is located (the sound source direction), change the imaging direction to that sound source direction, and perform imaging. The information processing apparatus 10 is also configured to record images of one or more users captured while changing the imaging direction, and to transmit the image of any given user so that it is displayed. Consequently, while the imaging direction is being changed in order to capture the user who is speaking, the information processing apparatus 10 can read out the stored image information of that user and transmit it, causing the information processing apparatus 11 to display that user's image.

The information processing apparatus 10 is configured to generate and transmit an image that composites the face image within the user's image with an image in which the shape of the mouth is changed. By transmitting this image, the information processing apparatus 10 can cause the information processing apparatus 11 to display a pseudo image in which the mouth moves in time with the voice. Displaying this pseudo image makes the user appear to be actually speaking.

The information processing apparatus 10 is also configured to determine the direction of the captured user's face and, according to whether the same direction as that face direction has been detected as a sound source direction, to determine whether the user is talking to another user at the same base or to a user at another base. When the user is talking to another user at the same base, the information processing apparatuses 10 and 11 are configured to generate an image for displaying the images of the two users side by side and to transmit it. Users at the other base can therefore see that those two users are talking to each other.

When two users at the same base converse, the imaging direction changes each time the speaker changes, but while the imaging direction is being changed, the image for displaying the two users' images side by side is generated and transmitted. The displayed image therefore does not swing, and users at the other base who are watching it can concentrate on the conference.

FIG. 2 illustrates a hardware configuration for realizing the processing and functions described above. Because the information processing apparatuses 10 and 11 have the same configuration, only the information processing apparatus 10 is described. To realize the imaging function and the sound input/output function, the information processing apparatus 10 includes an imaging means such as a camera 20, a plurality of sound input means such as a microphone array 21, and a sound output means such as a speaker 22.

The camera 20 includes a lens that focuses incoming light and an image sensor that converts the light focused by the lens into an electrical signal. It outputs either a still image or a moving image formed by capturing continuously and arranging the resulting still images in time series. The camera 20 has a mode setting button or the like for selecting still images or moving images, and outputs one or the other according to the setting. Examples of the camera 20 include a digital camera and a video camera.

The camera 20 has, for example, a support portion that supports it, and can rotate left and right about the support portion. The rotation range may be any angle, such as 120 degrees, 180 degrees, or 360 degrees, as long as the camera can be pointed at and capture each of the users participating in the conference. The configuration is not limited to this: the camera 20 may instead be placed on a turntable and rotated left and right by rotating the turntable.

The microphone array 21 consists of a plurality of microphones. The microphones may be arranged in a horizontal row inside a housing, or they may be freely positionable and placed in a horizontal row for use. Alternatively, instead of a row, one microphone may be placed in front of each user. To detect the speaking user more accurately, it is desirable to place a microphone in front of each user.

Each microphone used in the microphone array 21 includes a magnet, a coil, and a diaphragm. The diaphragm vibrates on receiving sound waves; a coil placed near the magnet receives this vibration, the magnetic flux in the coil changes, and an electromotive force is generated, which is output as a sound signal.

The speaker 22 reproduces and outputs sound information transmitted from another base. Like the microphones, the speaker 22 can include a magnet, a coil, and a diaphragm. In this case, the input audio signal changes the magnetic flux in the coil placed near the magnet, which causes the coil, and in turn the diaphragm, to vibrate; the air in contact with the diaphragm then vibrates and produces sound waves, and sound is output.

The information processing apparatus 10 further includes a display device 23 and a controller 24. The display device 23 displays images sent to the information processing apparatus 10. In the communication system, the image of the user who is speaking at another base is sent, so a still image or moving image of that user's face is displayed as the user's image. A display can be used as the display device 23, and a screen and projector can also be used.

The controller 24 includes a CPU 25, a ROM 26, a RAM 27, an HDD 28, and a network I/F 29 for connecting to the network 12.

The ROM 26 stores programs such as the BIOS (Basic Input/Output System) that are executed when the information processing apparatus 10 starts up. The RAM 27 provides the storage area the CPU 25 needs for its work. The HDD 28 stores programs such as applications and the OS, together with their associated data. Although the HDD 28 is used here, another storage device such as an SSD (Solid State Drive) may be used.

This program is provided via the network 12 or via a recording medium (not shown) and is stored in the HDD 28. The information processing apparatus 10 can include an external storage device I/F so that the recording medium can be connected. Examples of the recording medium include a CD-ROM, a DVD, and an SD card, and examples of the external storage device I/F include a CD drive, a DVD drive, and an SD card slot that allow these media to be read and written.

The CPU 25 controls each means in the information processing apparatus 10 and performs computation and processing of data. These means include the imaging means, sound input means, and sound output means described above, and storage means such as the HDD 28. The CPU 25 receives data from the camera 20, the microphone array 21, and so on, reads data from the HDD 28 and the like, performs computation and processing, and then outputs the result to the network 12 or the display device 23 or stores it in the HDD 28 or the like.

When powered on, the information processing apparatus 10 reads the BIOS from the ROM 26 and executes it, checking that the camera 20, the microphone array 21, the speaker 22, the display device 23, the HDD 28, and so on are usable. The information processing apparatus 10 then starts up by loading the OS from the HDD 28 into the RAM 27 and executing it. After that, under the control of the OS, the information processing apparatus 10 executes programs such as applications to carry out the desired processing.

The functions of the information processing apparatus 10 are described in detail with reference to FIG. 3. The information processing apparatus 11 has the same functions as the information processing apparatus 10, so its description is omitted here. To make the explanation easier to follow, FIG. 3 also shows the camera 20 as the imaging means and the microphone array 21 as the sound input means.

As functional units, the information processing apparatus 10 includes an imaging direction control unit 30, an image recording unit 31, a sound source direction detection unit 32, a face direction determination unit 33, a talker determination unit 34, an image processing unit 35, and a communication control unit 36. The information processing apparatus 10 may consist of these functional units alone, but it can additionally include a voice input unit 37 and a person determination unit 38. These functional units are realized by the HDD 28 and the network I/F 29, and by the CPU 25 reading a program from the HDD 28 and executing it.

The imaging direction control unit 30 controls the imaging direction by changing the orientation of the lens of the camera 20. The lens orientation can be changed by rotating the camera 20 left or right about the support portion that supports it. In a configuration where the camera 20 is placed on a turntable, the imaging direction control unit 30 can change the imaging direction by rotating the turntable in either direction. The imaging direction control unit 30 receives the sound source direction information detected by the sound source direction detection unit 32 and the determination result of the person determination unit 38, and controls the imaging direction accordingly, including which way to rotate and by how much.

The person determination unit 38 determines whether an image captured by the camera 20 contains a person, either when the information processing apparatus 10 starts operating or at a preset timing. Whether a person is present can be determined using a known face recognition algorithm. Such an algorithm performs face recognition by extracting features such as the relative positions, sizes, and shapes of facial parts (eyes, nose, mouth, chin, and so on) and by detecting skin color. A face recognition algorithm is used here, but any other method may be adopted as long as it can determine whether a person is present.

The image recording unit 31 is realized by the HDD 28 or the like. When the person determination unit 38 determines that a person is present, the imaging direction control unit 30 turns the camera 20 to the imaging direction decided from that determination result, and the image recording unit 31 records the image of the user, that is, the person, captured by the camera 20. In this way the image recording unit 31 records images of all users at the base. The recorded images are still images. Each user's image is recorded in association with that user's position information.

The user's position information can be, for example, the angle through which the camera 20 has been rotated to the left or right, taking the angle at which the camera 20 faces the front as the 0-degree reference. Because angles are subject to error, it is desirable to allow an error range of about 5° to the left and right. The position information is not limited to an angle; a bearing such as north, south, east, or west may be used instead. For this purpose, the information processing apparatus 10 can include a sensor for measuring the angle, a rotary encoder, a compass, or the like.
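As an illustration only, the association between a recorded user image and an angle, together with the error range of about 5°, could be sketched as follows. The names `UserRecord`, `find_user_at`, the `label` and `image_path` fields, and the sign convention for left and right are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass

# Assumed tolerance, taken from the "about 5 degrees" error range in the text.
ANGLE_TOLERANCE_DEG = 5.0


@dataclass
class UserRecord:
    label: str        # hypothetical identifier for the user
    angle_deg: float  # camera pan angle; 0 = camera facing the front
    image_path: str   # hypothetical location of the recorded still image


def angular_difference(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def find_user_at(records, angle_deg, tolerance=ANGLE_TOLERANCE_DEG):
    """Return the record closest to angle_deg within the tolerance, or None."""
    best, best_diff = None, None
    for record in records:
        diff = angular_difference(record.angle_deg, angle_deg)
        if diff <= tolerance and (best_diff is None or diff < best_diff):
            best, best_diff = record, diff
    return best


records = [
    UserRecord("A", -30.0, "a.png"),
    UserRecord("B", 40.0, "b.png"),
]
match = find_user_at(records, -27.0)  # within 5 degrees of user A
```

Comparing angles modulo 360 keeps the lookup valid for a camera that can rotate through a full turn, which the text allows.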

The voice input unit 37 detects whether sound input from the microphone array 21 is voice input from a user at the base where the information processing apparatus 10 is located and, when it detects voice input, sends that voice to the sound source direction detection unit 32. The input sound also contains noise, but because voice is generally louder than noise, volume can be used to detect whether the input is voice. This is only one example; any known method of detecting voice can be used.
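A minimal sketch of the volume-based check mentioned above is shown below. It assumes samples normalized to the range -1.0 to 1.0 and uses the root-mean-square level as the volume measure; the function names and the threshold value of 0.1 are hypothetical choices, not values given in the specification.

```python
import math


def rms(samples):
    """Root-mean-square level of a block of audio samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_voice(samples, threshold=0.1):
    """Treat a block as voice when its level reaches the (assumed) threshold."""
    return rms(samples) >= threshold


loud_block = [0.5, -0.5, 0.5, -0.5]       # stands in for speech
quiet_block = [0.01, -0.01, 0.01, -0.01]  # stands in for background noise
```

In practice the threshold would need to be tuned to the room and the microphones, which is consistent with the text allowing any known voice detection method.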

The sound source direction detection unit 32 detects the sound source direction from the voice information supplied by the voice input unit 37. The microphone array 21 consists of a plurality of microphones placed at different positions, so the input voice reaches each microphone at a slightly different time. The microphone that receives the voice first is the one closest to the sound source, so the sound source direction can be detected by determining which microphone received the voice first. The sound source direction detection unit 32 passes the detected direction to the imaging direction control unit 30 as sound source direction information. The sound source direction can be expressed in the same form as the user position information described above.
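The earliest-arrival rule described above can be sketched as follows. The sketch assumes that an onset time has already been measured for each microphone and that each microphone is associated with a known angle; the function name, the dictionary layout, and the example values are hypothetical.

```python
def detect_source_direction(arrival_times, mic_angles):
    """Return the angle of the microphone that received the voice first.

    arrival_times: mapping of microphone id -> onset time in seconds
    mic_angles:    mapping of microphone id -> angle in degrees (0 = front)
    """
    first_mic = min(arrival_times, key=arrival_times.get)
    return mic_angles[first_mic]


# Hypothetical layout: three microphones at -40, 0 and +40 degrees.
mic_angles = {"m0": -40.0, "m1": 0.0, "m2": 40.0}
arrival_times = {"m0": 0.0123, "m1": 0.0119, "m2": 0.0131}
direction = detect_source_direction(arrival_times, mic_angles)
```

This resolves the direction only to the nearest microphone, which matches the description in the text; finer estimates from the time differences themselves are outside what the paragraph describes.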

The face direction determination unit 33 determines the direction of a user's face based on an image of that user captured by the camera 20 after the imaging direction control unit 30 has changed the imaging direction toward the sound source direction detected by the sound source direction detection unit 32. For example, because the user's image recorded in the image recording unit 31 shows the face directed to the front, the face direction determination unit 33 can take that orientation as a reference and determine the face direction by estimating the angle of the face in the captured image.

Because this angle estimate is also subject to error, it is desirable to allow an error margin of, for example, about 5° to the left and right. The face direction determination unit 33 may also express the face orientation not as an angle but as a compass bearing such as north, south, east, or west.
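A comparison with the error margin of about 5° could look like the following sketch. The function names and the handling of wrap-around at 360° are assumptions added for illustration.

```python
def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two directions, in degrees (0 to 180)."""
    diff = (a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)

def same_direction(face_deg, source_deg, tolerance_deg=5.0):
    """True when the two directions agree within the tolerance (about 5 degrees by default)."""
    return angular_difference(face_deg, source_deg) <= tolerance_deg
```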

The talker determination unit 34 determines which kind of conversation is taking place according to whether the sound source direction detection unit 32 has detected, as a sound source direction, the same direction as the face orientation determined by the face direction determination unit 33. If that direction has been detected as a sound source direction, the talker determination unit 34 determines that the user is conversing with another user at the same base. If it has not, the talker determination unit 34 determines that the user is conversing with the other base.

When the talker determination unit 34 determines that the conversation is with another user at the same base, it determines from the voice input to the microphone array 21 whether there is more than one such other user. That is, if the voices of several people are being input, it determines that there are several other users; if only the voice of one particular user is being input, it determines that there is just that one user. This information is used when the image processing unit 35 performs image processing.
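The two determinations made by the talker determination unit 34 can be illustrated with a small sketch. The `Conversation` enum, the `classify` function, and the representation of other voices as a list of detected directions are assumptions of this sketch, not part of the specification.

```python
from enum import Enum

class Conversation(Enum):
    SAME_BASE = "same base"
    OTHER_BASE = "other base"

def _angle_gap(a_deg, b_deg):
    """Smallest absolute difference between two directions, in degrees."""
    d = (a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)

def classify(face_deg, other_voice_degs, tolerance_deg=5.0):
    """Return (conversation kind, whether there are several other users).

    other_voice_degs: directions of voices detected from users other than the speaker
    (hypothetical input; the specification only says the decision is based on input voice).
    """
    facing_a_voice = any(_angle_gap(face_deg, d) <= tolerance_deg for d in other_voice_degs)
    if not facing_a_voice:
        return Conversation.OTHER_BASE, False
    return Conversation.SAME_BASE, len(other_voice_degs) > 1
```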

The other user taking part in the conversation at this time is detected by the sound source direction detection unit 32 as a sound source direction, and that user's position is identified from the resulting sound source direction information.

The image processing unit 35 performs image processing on the image captured by the camera 20 or on the images recorded in the image recording unit 31, and generates the image to be displayed at the other base. For example, the image displayed at the other base while the camera 20 is rotating is generated by compositing the image of the user recorded in the image recording unit 31 with images in which the shape of the mouth is changed.

The image is also generated by, for example, cutting out the face images of two or more users from the image captured by the camera 20 and arranging them side by side, or by changing the display position and size of the images of two or more users recorded in the image recording unit 31. Any conventionally known method may be used for the compositing, the cutting out and arranging, and the changing of display position and size, so the details are omitted here.

When the talker determination unit 34 has determined that there is one other user and the image captured by the camera 20 contains both the user and that other user, the image processing unit 35 cuts out from that image the face images of the two users, or predetermined regions containing their faces. This is because, even if the camera 20 rotates and the picture shakes, the shake is small and not enough to prevent users at the other base from concentrating on the conference.

The image processing unit 35 generates an image for displaying the cut-out face images or predetermined regions side by side. Because the source image here is a moving image, the generated image displays the faces or predetermined regions of the moving image side by side. The predetermined region can be, for example, the smallest rectangle that contains the face. The cut-outs can be arranged vertically or horizontally; because display screens are generally wider than they are tall, a horizontal arrangement is desirable.
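Cutting out rectangular face regions and placing them left to right could be sketched as below. Images are modeled as plain lists of pixel rows, and the names `crop` and `side_by_side`, the box format, and the padding value are assumptions made for this sketch; the specification leaves the actual method open.

```python
def crop(image, box):
    """Cut a rectangle out of an image given as a list of pixel rows.

    box is (top, left, bottom, right), with bottom and right exclusive (assumed format).
    """
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]

def side_by_side(tiles, fill=0):
    """Arrange tiles horizontally, padding shorter tiles at the bottom with `fill`."""
    height = max(len(t) for t in tiles)
    out = []
    for y in range(height):
        row = []
        for t in tiles:
            width = len(t[0]) if t else 0
            row.extend(t[y] if y < len(t) else [fill] * width)
        out.append(row)
    return out
```

For a moving image, the same two steps would be applied to each frame.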

If, on the other hand, the other user is not in the captured image, the camera 20 would have to swing widely, so the image of the user and the image of the other user recorded in the image recording unit 31 are used to generate an image that displays them side by side. Because the images recorded in the image recording unit 31 are still images, the generated image arranges those still images. They are not simply placed side by side, however: the generated image is composited with images in which the shape of the mouth is changed.

When it is determined that there are several other users, voice is input within a fixed time, and the image captured by the camera 20 contains the user and all of the other users, the image processing unit 35 cuts out from that image the face images of all the users, or predetermined regions containing their faces. The image processing unit 35 performs the cut-out for each user and then generates an image for displaying the cut-out face images or regions side by side. Because several images are displayed together in this case, the image can be generated with the cut-outs reduced (zoomed out) and arranged vertically and horizontally on the display screen.
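One way to shrink several cut-outs and arrange them vertically and horizontally is a simple grid, sketched below. The near-square grid rule and the function name `grid_layout` are assumptions; the specification does not prescribe a particular arrangement.

```python
import math

def grid_layout(n_tiles, screen_w, screen_h):
    """Return one (x, y, width, height) slot per tile on a near-square grid."""
    if n_tiles <= 0:
        return []
    cols = math.ceil(math.sqrt(n_tiles))
    rows = math.ceil(n_tiles / cols)
    tile_w, tile_h = screen_w // cols, screen_h // rows
    # Fill the grid row by row, left to right.
    return [((i % cols) * tile_w, (i // cols) * tile_h, tile_w, tile_h)
            for i in range(n_tiles)]
```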

If the other users are not in the captured image, the image of the user and the images of all of the other users recorded in the image recording unit 31 are used to generate an image that displays those users side by side. Because the images recorded in the image recording unit 31 are still images, image information for displaying the still images side by side is generated. Here too, the generated image is composited with images in which the shape of the mouth is changed.

If no voice is input within the fixed time, the user is regarded as not conversing with the other users, and an image is generated that displays an enlarged (zoomed-in) view of the user captured by the camera 20. Although a zoomed-in image is generated here, zooming is not required; only the face image, or a predetermined region containing the face, may be cut out and displayed instead.

The communication control unit 36 performs communication control for transmitting the image produced by the image processing unit 35, together with the voice input to the microphone array 21, to the other base, and for receiving images and voice from the other base. Because an image matched to the parties to the conversation can be displayed in this way, users at the other base can tell whom the speaker is addressing and can concentrate on the conference, and the sense of presence in the conference is heightened.

The processing performed by the information processing apparatus 10 is described in detail with reference to the flowchart shown in FIG. 4. The information processing apparatus 10 starts the processing at step 400 when, at the base where it is installed, power is turned on and the program is read out and executed. The program is read and executed automatically here; if it is not, the processing can start when the program is launched.

In step 405, the camera 20 captures all the users participating in the conference at that base. The imaging direction control unit 30 changes the orientation of the camera 20 so that every user's face is captured. The mouth is then detected in each captured face image, so that later processing can composite images with a changed mouth shape and make the user appear to be actually speaking.

In step 410, once every face has been captured, the conference begins and the microphone array 21 accepts sound input. Besides voice, the sound includes noise such as air conditioners and cars passing outside. In step 415, conversation begins and the voice input unit 37 detects whether the sound is voice input from a user. When it detects voice input, it sends the voice to the sound source direction detection unit 32.

In step 420, the sound source direction detection unit 32 detects the sound source direction based on the voice from the voice input unit 37 and passes the detected sound source direction information to the imaging direction control unit 30. In step 425, the imaging direction control unit 30 determines from that information whether the imaging direction needs to be changed. If a change is needed, the processing proceeds to step 430; if not, it proceeds to step 455.

In step 430, the imaging direction control unit 30 changes the imaging direction based on the sound source direction information. The image of the user in the sound source direction is read from the image recording unit 31, and the image processing unit 35 composites the read image with images in which the mouth shape is changed. The composite image is transmitted and displayed.

In step 435, it is determined whether the camera 20 now points in the sound source direction. This determination is repeated until it does. Once the camera 20 points in the sound source direction, the processing proceeds to step 440, where the face direction determination unit 33 determines the face orientation of the user who is speaking: whether the user is facing the camera 20 or some other direction and, if another direction, which user the speaker is facing.

In step 445, it is determined from the result of step 440 whether the conversation is taking place within the same base. If the result is that the user is facing a direction other than the camera 20, the conversation is determined to be within the same base. If the result is that the user is facing the camera 20, the conversation is determined to be with the other base. If the conversation is within the same base, the processing proceeds to step 450; if it is with the other base, the processing proceeds to step 455.

In step 450, the face images of the user in the image captured by the camera 20 and of the other user with whom that user is conversing are cut out, an image for displaying the two cut-out face images side by side is generated, and that image is transmitted. In step 455, on the other hand, a zoomed-in image of the user facing the camera 20 is generated and transmitted. When the transmission is complete, the processing proceeds to step 460 and ends.
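The branching of the FIG. 4 flow, as described in steps 400 to 460, can be traced with the short sketch below. The function name `fig4_path` and its two boolean inputs are hypothetical; the step numbers are those given in the description.

```python
def fig4_path(needs_turn, facing_camera):
    """Return the sequence of FIG. 4 step numbers visited for one detected utterance.

    needs_turn:    result of step 425 (does the imaging direction need to change?)
    facing_camera: result of steps 440/445 (is the speaker facing the camera?)
    """
    steps = [400, 405, 410, 415, 420, 425]
    if needs_turn:
        # Turn the camera (a still composite is shown meanwhile), then judge the face.
        steps += [430, 435, 440, 445]
        steps.append(455 if facing_camera else 450)
    else:
        steps.append(455)
    steps.append(460)
    return steps
```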

Next, the processing performed by the face direction determination unit 33 and the image processing unit 35 in steps 440 to 455 is described in detail with reference to the flowchart shown in FIG. 5. The face direction determination unit 33 starts this processing at step 500 when the camera 20 comes to point in the sound source direction in step 435. In step 505, it determines the face orientation of the user who is speaking, using the method exemplified above.

In step 510, it is determined whether another user is present in the direction of the face orientation determined in step 505. This can be determined, for example, by using the position information associated with the images recorded in the image recording unit 31. If another user is present, it is determined in step 515 that the conversation is within the same base, and in step 520 it is determined whether there is more than one other user in that direction. If there are several, the processing proceeds to step 525; if there is only one, it proceeds directly to step 530.
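The lookup in step 510 could be sketched as a search over recorded positions. The sketch assumes that each user's position is stored as a direction in degrees in the same frame of reference as the face orientation; the function and parameter names are hypothetical.

```python
def users_in_direction(face_deg, recorded_positions, speaker, tolerance_deg=5.0):
    """Return the users, other than the speaker, located in the direction the speaker faces.

    recorded_positions: dict mapping user id -> direction in degrees (assumed format of the
    position information associated with the recorded images).
    """
    found = []
    for user, pos_deg in recorded_positions.items():
        if user == speaker:
            continue
        gap = (face_deg - pos_deg) % 360.0
        if min(gap, 360.0 - gap) <= tolerance_deg:
            found.append(user)
    return found
```

An empty result corresponds to the branch in which the user is judged to be conversing with the other base.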

In step 525, it is determined whether another user's voice has been detected within a fixed period, that is, within a preset period. If it has, the processing proceeds to step 530; if not, it proceeds to step 560. In step 530, it is determined whether the other user is within the imaging range of the camera 20, that is, whether both the user who is speaking and the other user are in the frame.

If the other user is in the frame, the processing proceeds to step 535, where the image processing unit 35 cuts out the face images of the user and the other user from the image captured by the camera 20 and generates an image for displaying them side by side. The processing then proceeds to step 565 and ends.

If it is determined in step 530 that the other user is not in the frame, the processing proceeds to step 540, where the image processing unit 35 obtains the images of the speaking user and the other user from the image recording unit 31. In step 545, the obtained face images are composited with mouth-shape images, and an image for displaying the composited face images side by side is generated. The processing then proceeds to step 560 and ends.

If it is determined in step 510 that no other user is present in that direction, the processing proceeds to step 550, where it is determined that the user is conversing with a user at the other base. In step 555, because there is no other user to show, only that user is displayed: the image of the user captured by the camera 20 is processed to generate a zoomed-in image. The processing ends at step 560.
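The decisions of the FIG. 5 flow can be condensed into one function, sketched below. The function name, parameter names, and result labels are hypothetical. The branch with several other users and no voice within the period is mapped here to the zoomed-in view, following the earlier description of the image processing unit 35; that mapping is an assumption of this sketch.

```python
def fig5_action(partner_in_face_direction, multiple_partners,
                voice_within_period, partners_in_frame):
    """Choose which image to generate, following the decisions of steps 510 to 555."""
    if not partner_in_face_direction:
        # Steps 550 and 555: conversation with the other base.
        return "zoom_live"
    if multiple_partners and not voice_within_period:
        # Step 525, no voice detected: treated as not conversing locally (assumption).
        return "zoom_live"
    if partners_in_frame:
        # Step 535: cut faces out of the live image and place them side by side.
        return "crop_live_side_by_side"
    # Steps 540 and 545: recorded stills composited with mouth-shape images.
    return "stills_with_mouth_side_by_side"
```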

In practice, the information processing apparatus 10 is installed at a predetermined position in a conference room at a base such as a head office or a branch office. In FIG. 6(a), the camera 20 is placed at the center of one edge of a table 40 in the conference room, with the microphone arrays 21 to its left and right. Three users A, B, and C are taking part in the conference, seated on the left and right of the table 40 and directly in front of the camera 20.

The lens of the camera 20 faces user B, who is directly in front. The image captured by the camera 20 is therefore as shown in FIG. 6(b): user B in the center, with users A and C seated to the left and right across the table 40.

When this image is transmitted to the information processing apparatus 11 at the other base and displayed, it appears as shown in FIG. 6(c). As in FIG. 6(b), the image shows user B in the center with users A and C seated to the left and right across the table 40.

When user C starts to speak, the sound source direction detection unit 32 detects the direction of user C as the sound source direction, and the camera 20 is turned in that direction. FIG. 7 illustrates this situation. As shown in FIG. 7(a), the camera 20 is turned toward user C.

The image captured by the camera 20 is centered on user C, as shown in FIG. 7(b). It contains only users B and C; user A is not in the frame. In the embodiment shown in FIG. 7, user C is facing forward and conversing with a user at the other base, so a zoomed-in image of user C is transmitted to the information processing apparatus 11 at the other base and displayed there. The displayed image is therefore the zoomed-in view of user C shown in FIG. 7(c). Identifying the parties to the conversation and displaying them in this way heightens the sense of presence.

FIG. 7 illustrates a conversation with the other base; a conversation within the same base is described with reference to FIG. 8. Assume that user A is conversing with user C.

As shown in FIG. 8(a), the camera 20 is pointed at user A, who is currently speaking. The image captured at this time is as shown in FIG. 8(b). In the embodiment shown in FIG. 8, user A is not facing forward but is facing user C, who is at the same base, and is conversing with user C. The information processing apparatus 11 at the other base therefore displays an image in which the images of users A and C are placed side by side, as shown in FIG. 8(c). This image places the moving image of user A captured by the camera 20 next to the image of user C recorded in the image recording unit 31; because the image of user C is composited with mouth-shape images, the mouth moves in time with the voice. Here too, identifying the parties to the conversation and displaying them heightens the sense of presence.

When users A and C exchange short utterances that each fall within a fixed time, the display can continue to show a moving image for user A and, for user C, a still image composited with mouth-shape images. If user C starts to speak and continues for longer than that fixed time, the camera 20 turns toward user C and starts capturing. While the camera 20 is moving, user A can also be shown as the still image recorded in the image recording unit 31 composited with mouth-shape images. Once the camera 20 faces user C and starts capturing, the image of user C is switched from the still image to a moving image, while user A is shown as the recorded still image composited with mouth-shape images. If user A then starts speaking again and continues for a long time, the camera 20 is turned once more, so the image of user C can revert to a still image and the image of user A to a moving image.
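The switching between moving images and stills described above can be summarized in a small sketch. The function name `display_modes` and the mode labels are hypothetical; only the switching rule is taken from the description.

```python
def display_modes(users, camera_target, camera_moving):
    """Return, for each displayed user, whether a live or a recorded still image is shown.

    A user is shown live only when the camera has settled on that user; everyone else,
    and everyone while the camera is moving, is shown as a still with mouth-shape images.
    """
    modes = {}
    for user in users:
        if not camera_moving and user == camera_target:
            modes[user] = "live"
        else:
            modes[user] = "still+mouth"
    return modes
```

For example, with the camera settled on user A the result marks A as live and C as a still; while the camera is turning toward C, both are stills; once it settles on C, the roles are reversed.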

The processing executed by the information processing apparatus, communication system, and program of the present invention has been described in detail above with reference to the embodiments shown in the drawings, but the present invention is not limited to those embodiments. Other embodiments, additions, modifications, and deletions are possible within the scope that a person skilled in the art could conceive, and any such form falls within the scope of the present invention as long as it provides the operation and effects of the present invention. The present invention can therefore also provide a method executed by the information processing apparatus or the communication system, and a recording medium on which the program is recorded.

Description of reference signs: 10, 11 ... information processing apparatus; 12 ... network; 20 ... camera; 21 ... microphone array; 22 ... speaker; 23 ... display device; 24 ... controller; 25 ... CPU; 26 ... ROM; 27 ... RAM; 28 ... HDD; 29 ... network I/F; 30 ... imaging direction control unit; 31 ... image recording unit; 32 ... sound source direction detection unit; 33 ... face direction determination unit; 34 ... talker determination unit; 35 ... image processing unit; 36 ... communication control unit; 37 ... voice input unit; 38 ... person determination unit; 40 ... table

Japanese Patent Laid-Open No. 5-122689

Claims

An information processing apparatus comprising imaging means and a plurality of sound input means, the apparatus further comprising:
imaging direction control means for controlling an imaging direction of the imaging means;
image recording means for recording images of one or more users at the same base, captured by the imaging means while the imaging direction control means changes the imaging direction;
sound source direction detection means for detecting a sound source direction based on the voice of a user at the same base input to the plurality of sound input means;
face direction determination means for determining the orientation of the user's face based on an image of the user captured by the imaging means after the imaging direction control means has changed the imaging direction to the sound source direction detected by the sound source direction detection means;
conversation determination means for determining, according to whether the sound source direction detection means has detected as a sound source direction the same direction as the face orientation determined by the face direction determination means, whether the conversation is with another user at the same base as the user or with another base;
image processing means for performing image processing on the image captured by the imaging means when the conversation determination means determines that the conversation is with another base, and, when it determines that the conversation is with another user at the same base, performing image processing on an image captured by the imaging means that contains the user and the other user, or on the image of the user captured by the imaging means or the image of the user recorded in the image recording means together with the image of the other user recorded in the image recording means; and
communication control means for transmitting the image obtained through image processing by the image processing means to the other base so that the image of the user, or the images of the user and the other user side by side, are displayed there.

The information processing apparatus according to claim 1, wherein, when the conversation determination means determines that the conversation is with another user at the same base, the conversation determination means determines, based on the voice input to the plurality of sound input means, whether there is more than one other user, and
wherein, when the conversation determination means determines that there is one other user, the image processing means, if the image captured by the imaging means contains the user and the other user, cuts out a predetermined region containing the user's face and a predetermined region containing the other user's face and generates an image for displaying the two cut-out regions side by side, and, if the image captured by the imaging means contains only the user, generates an image for displaying side by side the image of the user captured by the imaging means or the image of the user recorded in the image recording means and the image of the other user recorded in the image recording means.

The information processing apparatus according to claim 1, wherein, when the conversation determination means determines that the conversation is with another user at the same base, the conversation determination means determines, based on the voice input to the plurality of sound input means, whether there is more than one other user, and
wherein, when the conversation determination means determines that there are a plurality of other users, the image processing means: if voice is input to the plurality of sound input means within a predetermined time and the image captured by the imaging means contains the user and all of the other users, cuts out a predetermined region containing the user's face and predetermined regions containing the faces of the other users and generates an image for displaying the cut-out regions side by side; if the image captured by the imaging means does not contain at least one of the other users, generates an image for displaying side by side the image of the user captured by the imaging means or the image of the user recorded in the image recording means and the images of all of the other users recorded in the image recording means; and, if no voice is input within the predetermined time, generates an image for displaying an enlarged view of the user captured by the imaging means.

The information processing apparatus according to claim 1, wherein, when the conversation determination means determines that the conversation is with another user at the same base, the conversation determination means determines, based on the voice input to the plurality of sound input means, whether there is more than one other user, and
wherein, when the conversation determination means determines that there are a plurality of other users, the image processing means: if voice is input to the plurality of sound input means within a predetermined time and the image captured by the imaging means contains the user and all of the other users, cuts out the face image of the user and the face images of the other users and generates an image for displaying all of the cut-out face images side by side; if the image captured by the imaging means does not contain at least one of the other users, generates an image for displaying side by side the image of the user captured by the imaging means or the image of the user recorded in the image recording means and the images of all of the other users recorded in the image recording means; and, if no voice is input within the predetermined time, generates an image for displaying an enlarged view of the user captured by the imaging means.

The information processing apparatus according to claim 1, wherein the images of the one or more users recorded in the image recording unit are still images of the one or more users, and the image captured by the imaging unit after the imaging direction is changed to the detected sound source direction is a moving image.

A communication system comprising two or more information processing apparatuses according to any one of claims 1 to 5, the apparatuses being arranged at two or more geographically distant bases and connected to a network.

A program for causing an information processing apparatus including an imaging means and a plurality of sound input means to execute the steps of:
Changing the imaging direction of the imaging means, and recording images of one or more users at the same base, captured by the imaging means, in an image recording means;
Detecting a sound source direction based on the voice of a user at the same base input to the plurality of sound input means;
Changing the imaging direction to the detected sound source direction, and determining the orientation of the user's face based on the image of the user captured by the imaging means;
Determining whether the conversation is with another user at the same base as the user or with another base, depending on whether the same direction as the determined face orientation is detected as a sound source direction;
When it is determined that the conversation is with the other base, performing image processing on the image captured by the imaging means, and when it is determined that the conversation is with another user at the same base, performing image processing on the image captured by the imaging means that includes the user and the other user, or on the image of the user captured by the imaging means or the image of the user recorded in the image recording means together with the image of the other user recorded in the image recording means; and
Transmitting the image obtained by the image processing to the other base, and causing the image of the user, or the image of the user and the image of the other user side by side, to be displayed.
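
For illustration only, and not as part of the claims, the conversation determination step can be sketched as follows. The sketch assumes that the face orientation and the detected sound source directions are both expressed as horizontal angles in degrees around the apparatus, and that "the same direction" means agreement within a tolerance. The function names and the 15-degree default are hypothetical and do not come from the specification.

```python
from typing import Iterable


def angular_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def is_same_base_conversation(
    face_direction_deg: float,
    sound_source_directions_deg: Iterable[float],
    tolerance_deg: float = 15.0,  # assumed tolerance, not from the source
) -> bool:
    """Return True when a sound source is detected in the direction the
    speaker's face is turned (conversation within the same base), and
    False otherwise (conversation with the other base)."""
    return any(
        angular_difference(face_direction_deg, s) <= tolerance_deg
        for s in sound_source_directions_deg
    )
```

For example, a user facing 90 degrees while a second voice is detected at 95 degrees would be treated as talking to someone at the same base, whereas a user facing the display with no sound source in that direction would be treated as talking to the other base.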

The program further includes a step of determining, when it is determined that the conversation is with another user at the same base, whether there are a plurality of other users based on the voices input to the plurality of sound input means.
When it is determined that there is only one other user, the image processing step: when the image captured by the imaging means includes the user and the other user, cuts out a predetermined area including the face of the user and a predetermined area including the face of the other user, and generates an image for displaying the images of the two cut-out predetermined areas side by side; and when the image captured by the imaging means includes only the user, generates an image for displaying, side by side, the image of the user captured by the imaging means or the image of the user recorded in the image recording means, and the image of the other user recorded in the image recording means. The program according to claim 7.
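
As a non-authoritative sketch of the single-partner case, the choice of image sources can be written as a small planning function. The tile labels and the `prefer_captured_user` flag are hypothetical: the claim allows either the captured or the recorded image of the user when the other user is out of frame, and does not say how that choice is made.

```python
from typing import List, Tuple

Tile = Tuple[str, str]  # (person, image source), labels are illustrative


def plan_single_partner_layout(
    partner_in_frame: bool,
    prefer_captured_user: bool = True,  # assumed selection rule
) -> List[Tile]:
    """Plan the side-by-side display for a conversation with one other
    user at the same base."""
    if partner_in_frame:
        # Both faces are visible: cut both regions out of the captured image.
        return [
            ("user", "crop_from_captured"),
            ("other_user", "crop_from_captured"),
        ]
    # Only the user is visible: pair the user with the recorded partner image.
    user_source = "captured" if prefer_captured_user else "recorded"
    return [("user", user_source), ("other_user", "recorded")]
```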

The program further includes a step of determining, when it is determined that the conversation is with another user at the same base, whether there are a plurality of other users based on the voices input to the plurality of sound input means.
When it is determined that there are a plurality of other users, the image processing step: when voice is input to the plurality of sound input means within a predetermined time and the image captured by the imaging means includes the user and all of the other users, cuts out a predetermined area including the face of the user and predetermined areas including the faces of the other users, and generates an image for displaying the images of all the cut-out predetermined areas side by side; when at least one of the other users is not included in the image captured by the imaging means, generates an image for displaying, side by side, the image of the user captured by the imaging means or the image of the user recorded in the image recording means, and the images of all the other users recorded in the image recording means; and when no voice is input within the predetermined time, generates an image for displaying an enlarged image of the user captured by the imaging means. The program according to claim 7.

The program further includes a step of determining, when it is determined that the conversation is with another user at the same base, whether there are a plurality of other users based on the voices input to the plurality of sound input means.
When it is determined that there are a plurality of other users, the image processing step: when voice is input to the plurality of sound input means within a predetermined time and the image captured by the imaging means includes the user and all of the other users, cuts out the face image of the user and the face image of each of the other users, and generates an image for displaying all the cut-out face images side by side; when at least one of the other users is not included in the image captured by the imaging means, generates an image for displaying, side by side, the image of the user captured by the imaging means or the image of the user recorded in the image recording means, and the images of all the other users recorded in the image recording means; and when no voice is input within the predetermined time, generates an image for displaying an enlarged image of the user captured by the imaging means. The program according to claim 7.
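
The three-way branching recited for the case of a plurality of other users can be sketched as below. This is an illustration under stated assumptions, not the claimed implementation: users are identified by hypothetical string IDs, the check for "no voice within the predetermined time" is assumed to take precedence over the other two branches, and the `prefer_captured_user` flag stands in for the unspecified choice between the captured and the recorded image of the user.

```python
from typing import List, Set, Tuple

Tile = Tuple[str, str]  # (person, image source), labels are illustrative


def plan_group_layout(
    voice_within_time: bool,
    other_users: Set[str],
    others_in_frame: Set[str],
    prefer_captured_user: bool = True,  # assumed selection rule
) -> List[Tile]:
    """Plan the display for a conversation with several users at the same base."""
    if not voice_within_time:
        # Nobody spoke within the predetermined time: show the user enlarged.
        return [("user", "enlarged_captured")]
    if other_users <= others_in_frame:
        # Everyone is visible: cut every face region out of the captured image.
        return [("user", "crop_from_captured")] + [
            (o, "crop_from_captured") for o in sorted(other_users)
        ]
    # At least one other user is out of frame: use recorded images for all others.
    user_source = "captured" if prefer_captured_user else "recorded"
    return [("user", user_source)] + [(o, "recorded") for o in sorted(other_users)]
```

Using recorded images for all of the other users, rather than only for the missing ones, follows the wording of the claims, which recite "the images of all the other users recorded in the image recording means" for that branch.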