JP5894055B2

JP5894055B2 - VIDEO DATA CONTROL DEVICE, VIDEO DATA CONTROL METHOD, AND VIDEO DATA CONTROL PROGRAM

Info

Publication number: JP5894055B2
Application number: JP2012230774A
Authority: JP
Inventors: 美佐平尾; 陽子石井; 宮崎　泰彦; 東野　豪
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-10-18
Filing date: 2012-10-18
Publication date: 2016-03-23
Anticipated expiration: 2032-10-18
Also published as: JP2014082704A

Description

The present invention relates to a technique for controlling video data for users who communicate by sign language or oral communication (speech with lip-reading), such as hearing-impaired people and their families.

Hearing people can naturally hold a conversation with family while watching television. For hearing people, voice-only communication is the ordinary way to communicate, and it can be carried out without interrupting the viewing of video content.

For hearing-impaired people and their families who communicate by sign language or oral communication, however, doing so is nearly impossible. Sign language and oral communication depend on looking at the other person, so to converse while viewing video content one must interrupt viewing and turn to face the other person.

One conceivable way to solve this problem is to composite camera video of the users themselves onto the video content being viewed, so that the users can see each other while viewing the video content and communicate by sign language or oral communication through the video (see Patent Document 1).

A simpler conceivable method, shown in FIG. 11, is to place a mirror on top of the television and reflect the users in it, so that they can communicate by sign language or oral communication while watching the video on the television screen.

Patent Document 1: JP 2008-217536 A

However, video content is not always viewed at home by sitting near the television screen as shown in FIG. 11. In a room laid out as in FIG. 11, users may also watch while lying on the sofa, while seated at the dining table, while cooking in the kitchen, or while moving around the whole room to clean.

Consequently, depending on how the user appears in the camera or mirror, the camera video or mirror image may include the user's lower body, unnecessary background, and so on. Such video or mirror images are not necessarily suited to communication by sign language or oral communication, and the user has to read fine hand and fingertip expressions, facial expressions, and subtle lip movements from them with great care.

The present invention has been made in view of the above circumstances, and its object is to enable communication by sign language or oral communication while viewing video content, even when the content is viewed in various viewing styles.

The video data control device according to claim 1 comprises: camera video receiving means for receiving, from a video camera, camera video that is captured from the screen side of a video display device capable of displaying video content and that shows a user viewing the video content in an arbitrary viewing style; user video generating means for extracting image regions of the user's face and upper body from the camera video and, according to a mode selected and confirmed by the user, generating user video consisting only of the face and upper-body image regions in the mode for communicating by sign language, and generating user video consisting only of the face image region in the mode for oral communication; video compositing means for superimposing the user video on the video content so that the user video is displayed; and video output means for outputting the superimposed user video and video content to the video display device, wherein the user video generating means generates a plurality of user videos in which a plurality of users are separated when the distance between users captured in the camera video is larger than a threshold.

According to the present invention, the image regions of the face and upper body are extracted from camera video of a user viewing video content on the video display device. According to the mode selected and confirmed by the user, user video consisting only of the face and upper-body image regions is generated in the mode for communicating by sign language, and user video consisting only of the face image region is generated in the mode for oral communication. The user video is then superimposed on the video content and output to the video display device. Communication by sign language or oral communication can therefore be achieved while viewing the video content, regardless of the viewing style. Furthermore, because a plurality of user videos in which a plurality of users are separated are generated when the distance between users captured in the camera video is larger than a threshold, it is possible to prevent the user's lower body and unnecessary background, such as the area between users, from being included in the camera video.

The video data control device according to claim 2 comprises: camera video receiving means for receiving, from a video camera, camera video that is captured from the screen side of a video display device capable of displaying video content and that shows a user viewing the video content in an arbitrary viewing style; user video generating means for extracting image regions of the user's face and upper body from the camera video and, according to a mode selected and confirmed by the user, generating user video consisting only of the face and upper-body image regions in the mode for communicating by sign language, and generating user video consisting only of the face image region in the mode for oral communication; and video output means for outputting the user video to another video display device located within the user's field of view, wherein the user video generating means generates a plurality of user videos in which a plurality of users are separated when the distance between users captured in the camera video is larger than a threshold.

According to the present invention, the image regions of the face and upper body are extracted from camera video of a user viewing video content on the video display device. According to the mode selected and confirmed by the user, user video consisting only of the face and upper-body image regions is generated in the mode for communicating by sign language, and user video consisting only of the face image region is generated in the mode for oral communication. The user video is then output to another video display device located within the user's field of view. Communication by sign language or oral communication can therefore be achieved while viewing the video content, regardless of the viewing style. Furthermore, because a plurality of user videos in which a plurality of users are separated are generated when the distance between users captured in the camera video is larger than a threshold, it is possible to prevent the user's lower body and unnecessary background, such as the area between users, from being included in the camera video.

In the video data control method according to claim 3, a computer performs: a camera video receiving step of receiving, from a video camera, camera video that is captured from the screen side of a video display device capable of displaying video content and that shows a user viewing the video content in an arbitrary viewing style; a user video generating step of extracting image regions of the user's face and upper body from the camera video and, according to a mode selected and confirmed by the user, generating user video consisting only of the face and upper-body image regions in the mode for communicating by sign language, and generating user video consisting only of the face image region in the mode for oral communication; a video compositing step of superimposing the user video on the video content so that the user video is displayed; and a video output step of outputting the superimposed user video and video content to the video display device, wherein, in the user video generating step, a plurality of user videos in which a plurality of users are separated are generated when the distance between users captured in the camera video is larger than a threshold.

In the video data control method according to claim 4, a computer performs: a camera video receiving step of receiving, from a video camera, camera video that is captured from the screen side of a video display device capable of displaying video content and that shows a user viewing the video content in an arbitrary viewing style; a user video generating step of extracting image regions of the user's face and upper body from the camera video and, according to a mode selected and confirmed by the user, generating user video consisting only of the face and upper-body image regions in the mode for communicating by sign language, and generating user video consisting only of the face image region in the mode for oral communication; and a video output step of outputting the user video to another video display device located within the user's field of view, wherein, in the user video generating step, a plurality of user videos in which a plurality of users are separated are generated when the distance between users captured in the camera video is larger than a threshold.

The video data control program according to claim 5 causes a computer to function as the video data control device according to claim 1 or 2.

According to the present invention, users can communicate by sign language or oral communication while viewing video content, even when the content is viewed in various viewing styles.

FIG. 1 is a diagram showing the overall configuration of the viewer video display system according to the first embodiment.
FIG. 2 is a sequence diagram showing the operation of the viewer video display system.
FIG. 3 is a diagram showing the functional block configuration of the user video generation unit.
FIG. 4 is a diagram showing an example of user video generation (one person).
FIG. 5 is a diagram showing an example of user video generation (two people).
FIG. 6 is a diagram showing an example of user video generation (three people).
FIG. 7 is a diagram showing an example of user video before effect processing.
FIG. 8 is a diagram showing an example of user video after effect processing (mirroring).
FIG. 9 is a diagram showing an example of user video after effect processing (translucency).
FIG. 10 is a diagram showing the overall configuration of the viewer video display system according to the second embodiment.
FIG. 11 is a diagram showing an example of viewing styles for video content.

The present invention focuses on the fact that communication by sign language or oral communication is carried out by watching closely for changes in the face and movements of the fingers. From camera video of a user viewing video content, it extracts the image regions of the face and of the upper body, which is the range of motion of the fingertips, and either superimposes them on the video content or displays them on their own within the user's field of view. This enables communication by sign language or oral communication while viewing video content, without depending on a particular viewing style.

The invention also focuses on the viewing style in which users who are far apart from each other communicate in this way. Depending on the distance between the captured users, the user video is displayed in divided form, which prevents the user's lower body and unnecessary background, such as the area between users, from being included in the camera video.

An embodiment of the present invention is described below with reference to the drawings. The present invention can, however, be implemented in many different forms and should not be construed as limited to the description of this embodiment.

[First Embodiment]
First, the overall configuration of the viewer video display system according to the first embodiment is described with reference to FIG. 1.

This viewer video display system mainly consists of a video display device 5 that displays video content on its screen, a video camera 3 that captures a viewer (hereinafter, user) 50 from the screen side of the video display device 5, a video data control device 1 that performs image processing on the camera video of the user 50 viewing the video content, a remote control terminal 7 that remotely operates the video data control device 1, and a video content output device 9 that outputs the video content.

The video data control device 1 plays the central role in the viewer video display system, for example extracting the image regions of the user 50 from the camera video. It is connected to all of the other devices (the video display device 5, the video camera 3, the remote control terminal 7, and the video content output device 9) through wired or wireless communication paths so that data communication is possible.

Examples of the video content output device 9 include a terrestrial digital television broadcast receiver that outputs terrestrial digital signals received from a broadcasting station as video content, and a Blu-ray/DVD recorder that outputs video content recorded on a hard disk, DVD, or the like.

Next, the functions of the video data control device 1 are described. The video data control device 1 includes a camera video reception unit 11, a camera video storage unit 12, a camera video analysis unit 13, a user video generation unit 14, a video composition unit 15, a video output unit 16, an adjustment data reception unit 17, and an adjustment data storage unit 18.

The camera video reception unit 11 receives the camera video of the user 50 viewing the video content from the video camera 3 through the camera video input interface 31 and temporarily stores it in the camera video storage unit 12.

The camera video storage unit 12 stores the camera video received from the video camera 3 so that it can be read out. For example, when the camera video is received frame by frame, the frame images are stored in time series.

The camera video analysis unit 13 receives the camera video from the camera video reception unit 11 (it may instead obtain it from the camera video storage unit 12), detects the image regions containing the face and upper body of the user 50 in the camera video, and outputs the detection result to the downstream user video generation unit 14. It also corrects the camera video, for example by rotation, as necessary.

The user video generation unit 14 extracts the detected image regions from the camera video on the basis of the detection result, generates user video consisting only of the face and upper body of the user 50, performs processing according to the adjustment data in the adjustment data storage unit 18 (described later), and then outputs the result to the downstream video composition unit 15.

The video composition unit 15 superimposes the user video output from the user video generation unit 14 on the video content output from the video content output device 9 through the video content input interface 33, in such a way that the user video is displayed.

The video output unit 16 outputs the user video and video content superimposed by the video composition unit 15 from the video output interface 32 and causes the video display device 5 to display them.

The adjustment data reception unit 17 receives, through the adjustment data input interface 34, adjustment data such as the display position of the user video that the user has selected and confirmed by operating the remote control terminal 7, and stores it in the adjustment data storage unit 18.

The adjustment data storage unit 18 stores the adjustment data received by the adjustment data reception unit 17 so that it can be read out. The adjustment data includes, for example, the selected communication mode (sign language mode or oral mode), whether the user video is divided, the selected effect pattern, the size of the user video, and the display position of the user video.

These are the functions of the video data control device 1 according to this embodiment. The video data control device 1 can be implemented on a computer having memory and a CPU, for example a personal computer connected to the video display device 5 or a browser built into a television. The processing of each functional unit is executed by a program.

Next, the operation of the viewer video display system is described with reference to FIG. 2. First, the video camera 3 installed near the screen of the video display device 5 captures the user 50, who is viewing the video content in an arbitrary viewing style, and the camera video is input to the video data control device 1 (step S101).

Next, in the video data control device 1, the camera video reception unit 11 receives the camera video output from the video camera 3 through the camera video input interface 31 (step S102).

Next, the camera video analysis unit 13 detects the image regions of the face and upper body of the user 50 in the received camera video (step S103). If a tilt of the face is detected at this point, the camera video, including the upper body, is corrected so that the face becomes upright, and the corrected camera video is output to the user video generation unit 14 together with the detection result (which includes the coordinates, size, and so on of the detected image regions).

Next, on the basis of the detection result, the user video generation unit 14 extracts the image regions of the face and upper body of the user 50 from the camera video (the corrected camera video, if it was corrected), generates user video consisting only of those image regions, and outputs it to the video composition unit 15 (step S104). The details of this processing are described later.

Next, the video composition unit 15 superimposes the generated user video on the video content from the video content output device 9 at the size and position specified by the user 50 (step S105).

Next, the video output unit 16 outputs the superimposed user video and video content as new video data to the video display device 5 through the video output interface 32 (step S106).

Finally, the video display device 5 receives the new video data and displays it on the screen with the user video in front and the video content behind (step S107).
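The superimposition in steps S105 to S107 can be pictured with a minimal sketch. It is not the patent's implementation: the `Frame` type (rows of grayscale pixel values), the function name `superimpose`, and the clipping behavior at the frame edges are all assumptions made for illustration. The only point it takes from the text is that the user video is drawn in front of the video content at a specified position.

```python
from typing import List

# Hypothetical stand-in for a video frame: rows of grayscale pixel values.
Frame = List[List[int]]


def superimpose(content: Frame, user: Frame, left: int, top: int) -> Frame:
    """Return a copy of `content` with `user` drawn in front at (left, top)."""
    out = [row[:] for row in content]  # leave the input frame untouched
    for dy, user_row in enumerate(user):
        y = top + dy
        if not 0 <= y < len(out):
            continue  # assumption: rows outside the content frame are clipped
        for dx, pixel in enumerate(user_row):
            x = left + dx
            if 0 <= x < len(out[y]):
                out[y][x] = pixel  # user video in front, video content behind
    return out
```

Resizing the user video to the specified size is left out here; in this sketch the caller is assumed to pass a user frame that is already scaled.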

The detection of the face and upper body of the user 50 described in step S103 can be implemented using known techniques. For example, a method can be used that extracts feature values according to the orientation of the face, computes a similarity using those feature values, and recognizes the face on the basis of the computed similarity (JP 2009-157766 A).

The detection of face tilt and the correction of the video can also be implemented using known techniques. For example, a method can be used that extracts the contour of the mouth, which is a facial part, estimates the tilt of the mouth by applying a Hough transform to the contour, and from this estimates the tilt of the face (李周文, "Detection of face images using the Hough transform of mouth edge images and the difference in length between the eyes," IEICE Technical Report, IEICE, March 2006, HIP2005-161, pp. 43-48).
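The idea of the tilt correction in step S103 can be illustrated with a much simpler stand-in than the Hough-transform method cited above. The sketch below assumes that the two mouth corners are already known as points, takes the angle of the line through them as the tilt, and rotates a point by the opposite angle so that the mouth line becomes horizontal. The function names `mouth_tilt_degrees` and `rotate_point` are hypothetical, and this is not the cited paper's algorithm.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def mouth_tilt_degrees(left_corner: Point, right_corner: Point) -> float:
    """Angle of the line through the mouth corners, in degrees from the x axis."""
    dx = right_corner[0] - left_corner[0]
    dy = right_corner[1] - left_corner[1]
    return math.degrees(math.atan2(dy, dx))


def rotate_point(point: Point, center: Point, degrees: float) -> Point:
    """Rotate `point` about `center` by `degrees` using the standard rotation matrix."""
    a = math.radians(degrees)
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    return (
        center[0] + dx * math.cos(a) - dy * math.sin(a),
        center[1] + dx * math.sin(a) + dy * math.cos(a),
    )
```

Rotating every pixel coordinate of the face and upper-body region by `-mouth_tilt_degrees(...)` about a chosen center levels the mouth line, which corresponds to making the face upright in this simplified model.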

Next, the functions and operation of the user video generation unit 14 (the details of step S104 above) are described. As shown in FIG. 3, the user video generation unit 14 consists of a user video extraction unit 14a, a user video division processing unit 14b, a user video effect processing unit 14c, a user video size determination unit 14d, and a user video position coordinate determination unit 14e.

Using the remote control terminal 7, the user 50 can select and confirm each of the following items: (1) the communication mode (sign language mode or oral mode); (2) the display method when a plurality of users appear in the user video (display as-is, or divide and display one person at a time); (3) the pattern of effect processing applied to the user video; (4) the size of the user video; and (5) the display position of the user video.

When the user 50 makes selections for these items and the selected and confirmed adjustment data is transmitted, for example by pressing a confirm button, the adjustment data reception unit 17 receives it through the adjustment data input interface 34 and stores it in the adjustment data storage unit 18. When adjustment data is received more than once, the previous adjustment data values are overwritten and updated on each reception.

Each time the adjustment data is updated, the user video generation unit 14 obtains it and performs the processing described below. Preset values stored in advance in the adjustment data storage unit 18 may be used instead.
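The five adjustment items and the overwrite-on-receive behavior can be sketched as a small data structure. Everything named here is hypothetical: the class names, the field names, the string values for the mode and effect, and the default preset values are assumptions for illustration. The sketch also assumes that each reception overwrites only the fields it carries, which the text does not specify.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class AdjustmentData:
    """Hypothetical record of the five user-selectable items."""
    mode: str = "sign"                  # "sign" or "oral" (assumed encoding)
    split_users: bool = False           # divide multi-user video per person
    effect: str = "none"                # e.g. "mirror" or "translucent"
    size: Tuple[int, int] = (320, 240)  # assumed preset, in pixels
    position: Tuple[int, int] = (0, 0)  # assumed preset, top-left corner


class AdjustmentDataStore:
    """Keeps the latest adjustment data, starting from a preset."""

    def __init__(self, preset: AdjustmentData = AdjustmentData()) -> None:
        self._current = preset

    def receive(self, **changes) -> AdjustmentData:
        # Each reception overwrites the previous values of the given fields.
        self._current = replace(self._current, **changes)
        return self._current

    def read(self) -> AdjustmentData:
        return self._current
```

A caller would invoke `store.receive(mode="oral")` when the remote control sends a new selection, and the generation stage would call `store.read()` to pick up the latest values.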

First, the user video extraction unit 14a extracts from the camera video the image region corresponding to the selected and confirmed communication mode and outputs it to the user video division processing unit 14b (step S104a). Specifically, in the sign language mode (when communicating by sign language), viewers watch not only the facial expression but also the expressions of the hands and fingertips, so the image region combining both the face and the upper body of the user 50 is extracted. In the oral mode (when communicating by oral communication), viewers watch the movement of the lips, so only the image region of the face of the user 50 is extracted.
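The mode-dependent choice in step S104a can be sketched as follows. The `Box` type, the function names, and the mode strings `"sign"` and `"oral"` are assumptions made for illustration; the sketch also assumes that the face and upper-body regions arrive as axis-aligned rectangles and that "combining" them means taking their bounding rectangle.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical axis-aligned image region in pixel coordinates."""
    left: int
    top: int
    right: int
    bottom: int


def union(a: Box, b: Box) -> Box:
    """Smallest rectangle containing both boxes."""
    return Box(
        min(a.left, b.left),
        min(a.top, b.top),
        max(a.right, b.right),
        max(a.bottom, b.bottom),
    )


def region_for_mode(mode: str, face: Box, upper_body: Box) -> Box:
    """Pick the region to extract for the selected communication mode."""
    if mode == "sign":
        return union(face, upper_body)  # face plus hands and fingertips
    if mode == "oral":
        return face                     # lip movement only
    raise ValueError(f"unknown communication mode: {mode!r}")
```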

FIGS. 4 to 6 show the extraction patterns for the user video, taking the sign language mode as an example. First, when one user appears in the camera video (the input detection result for the user's face and upper body is one region), the detected image region is extracted as-is as the user video, as shown in FIG. 4.

When two users appear in the camera video (the detection result is two regions) and the inter-user distance d obtained from the detection result (for example, the distance between the center points of the detected regions) is smaller than a preset value (hereinafter, the threshold), the minimum rectangle containing the detected regions of both users is extracted as the user video, as shown in FIG. 5(a).

On the other hand, when two users appear in the camera video (the detection result contains two regions) and the inter-user distance d obtained from the detection result is larger than the threshold, a separate user video is extracted for each user from the respective detected image area, as shown in FIG. 5(b).

When three or more users appear in the camera video (the detection result contains three or more regions), users whose inter-user distance d is smaller than the threshold are extracted together as one user video, namely the minimum rectangle containing their detected image areas, and for each remaining user the detected image area is extracted as the user video as is, as shown in FIG. 6.
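The extraction patterns of FIGS. 4 to 6 can be sketched as a single grouping routine. The sketch below rests on several assumptions not stated in the text: regions are `(left, top, right, bottom)` tuples, d is the Euclidean distance between region centers (the text gives this only as one example), grouping is transitive (if A is near B and B is near C, all three share one rectangle), and a distance exactly equal to the threshold is treated as "not near". The function names are hypothetical.

```python
from math import hypot

Rect = tuple[int, int, int, int]  # (left, top, right, bottom), assumed layout


def center(r: Rect) -> tuple[float, float]:
    return ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2)


def distance(a: Rect, b: Rect) -> float:
    """Inter-user distance d, taken here as the distance between region centers."""
    (ax, ay), (bx, by) = center(a), center(b)
    return hypot(ax - bx, ay - by)


def group_regions(regions: list[Rect], threshold: float) -> list[Rect]:
    """Merge regions closer than the threshold; return one rectangle per group."""
    parent = list(range(len(regions)))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(regions)):
        for j in range(i + 1, len(regions)):
            if distance(regions[i], regions[j]) < threshold:
                parent[find(j)] = find(i)

    groups: dict[int, list[Rect]] = {}
    for i, r in enumerate(regions):
        groups.setdefault(find(i), []).append(r)

    # Minimum rectangle containing every region of each group.
    return [(min(r[0] for r in g), min(r[1] for r in g),
             max(r[2] for r in g), max(r[3] for r in g))
            for g in groups.values()]
```

With one detected region the input is returned unchanged (FIG. 4). With two regions the result is either one enclosing rectangle (FIG. 5(a)) or two separate rectangles (FIG. 5(b)), depending on d.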

Next, when a user video containing multiple users has been extracted, the user video division processing unit 14b decides, based on the display method selected by the user 50, whether to display it as is as a single user video containing multiple users or to divide it into one user video per user. If no division is needed, the extracted user video is output unchanged to the user video effect processing unit 14c; if division is needed, the multiple user videos resulting from the division are output (step S104b).

Next, the user video effect processing unit 14c applies effect processing of the pattern selected by the user 50 to the user video input from the user video division processing unit 14b, and outputs the result to the user video size determination unit 14d (step S104c).

Known video processing techniques can be applied for the effect processing performed on the user video. FIG. 7 shows a user video without effect processing, and FIGS. 8 and 9 show examples of user videos with an effect applied. FIG. 8 shows the mirroring effect, and FIG. 9 shows the translucency effect.
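The two effects shown in FIGS. 8 and 9 can be illustrated with a small sketch. The text leaves the technique open ("known video processing techniques"), so the choices here are assumptions: a frame is a list of rows of grayscale pixel values, mirroring is a horizontal flip, and translucency is modeled as per-pixel alpha blending with the underlying content using a hypothetical `alpha` parameter.

```python
Frame = list[list[int]]  # rows of grayscale pixel values (assumed representation)


def mirror(frame: Frame) -> Frame:
    """Mirror-image effect: flip every row left to right."""
    return [row[::-1] for row in frame]


def blend(user: Frame, content: Frame, alpha: float) -> Frame:
    """Translucency effect: alpha-blend the user video over same-sized content.

    alpha = 1.0 shows only the user video, alpha = 0.0 only the content.
    """
    return [[round(alpha * u + (1.0 - alpha) * c)
             for u, c in zip(user_row, content_row)]
            for user_row, content_row in zip(user, content)]
```

Mirroring makes the on-screen image behave like a mirror for the user, and a translucent user video leaves the video content underneath partly visible.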

Next, the user video size determination unit 14d adjusts the size of the user video input from the user video effect processing unit 14c based on the video size selected by the user 50, and outputs the result to the user video position coordinate determination unit 14e (step S104d).

Because the size of the user's face and upper body in the detection result varies with the distance between the user and the camera, the user videos input to the user video size determination unit 14d are also likely to vary in size. The user video size determination unit 14d may therefore adjust the sizes so that the user videos are displayed at a uniform scale.
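One way to equalize user videos of differing sizes is to scale each to a common height while preserving its aspect ratio. This is only a sketch of that idea: the text does not say which dimension is equalized, so the `target_height` parameter and the function name are assumptions.

```python
def normalized_sizes(sizes: list[tuple[int, int]],
                     target_height: int) -> list[tuple[int, int]]:
    """Scale each (width, height) to target_height, keeping the aspect ratio."""
    result = []
    for width, height in sizes:
        scale = target_height / height
        result.append((round(width * scale), target_height))
    return result
```

A user far from the camera (small detected region) is scaled up and a nearby user is scaled down, so both appear at a comparable size on screen.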

Finally, the user video position coordinate determination unit 14e rewrites the position coordinates of the user video input from the user video size determination unit 14d based on the video position coordinates selected by the user 50, and outputs the result to the video composition unit 15 (step S104e).
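How the position coordinates are used once the user video reaches the video composition unit 15 can be sketched as a simple overlay. The representation (frames as lists of pixel rows), the function name, and the clipping of pixels that fall outside the content frame are all assumptions made for illustration.

```python
Frame = list[list[int]]  # rows of pixel values (assumed representation)


def overlay(content: Frame, user: Frame, x: int, y: int) -> Frame:
    """Paste the user video onto a copy of the content frame at (x, y).

    (x, y) is the top-left corner of the user video in content coordinates.
    Pixels that would land outside the content frame are dropped.
    """
    out = [row[:] for row in content]
    for dy, row in enumerate(user):
        for dx, pixel in enumerate(row):
            ty, tx = y + dy, x + dx
            if 0 <= ty < len(out) and 0 <= tx < len(out[ty]):
                out[ty][tx] = pixel
    return out
```

The content frame itself is not modified; a new frame is returned, which would then be passed on for output.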

As described above, according to the present embodiment, the image areas of the face and upper body are extracted from the camera video of the user 50 viewing the video content on the video display device 5. When communication is carried out in sign language, a user video consisting only of the face and upper-body image areas is generated; when communication is carried out by oral speech, a user video consisting only of the face image area is generated. The user video is superimposed on the video content and output to the video display device 5. Communication by sign language or oral speech can therefore be carried out while viewing the video content, regardless of the viewing style.

Furthermore, according to the present embodiment, when the distance between users captured in the camera video is larger than the threshold, a plurality of user videos are generated by dividing the plurality of users, which prevents unneeded content, such as the users' lower bodies or the background between users, from being included.

[Second Embodiment]
Next, a viewer video display system according to the second embodiment will be described with reference to FIG. 10.

The first embodiment described a configuration in which the user video is displayed superimposed on the video content. However, the effect described above, namely "enabling communication by sign language or oral speech while viewing video content, without depending on a specific viewing style", can be obtained whenever the user video showing the user 50 is visible together with the video content and the face and upper-body areas needed for communication by sign language or oral speech are extracted and displayed.

Therefore, in the second embodiment, instead of superimposing the user video on the video content, a user video display device 5b dedicated to the user video is installed near the video display device 5a that displays the video content, and the user video is displayed on the screen of the user video display device 5b.

The video data control device 1 according to the present embodiment has basically the same functions as those described in the first embodiment. However, because the process of superimposing the user video on the video content is unnecessary, it does not need to include the video content input interface 33 or the video composition unit 15, as shown in FIG. 10.

With the removal of the video composition unit 15 from the components of the viewer video display system, part of the function of the video output unit 16 is modified. In the first embodiment, the superimposed user video and video content were output and displayed on the video display device 5a. In the second embodiment, only the user video generated by the user video generation unit 14 is output, and it is displayed on the user video display device 5b, which is separate from the video display device 5a displaying the video content.

The video content output device 9 outputs the video content directly to the video display device 5a for display. An example of such a video content output device 9 and video display device 5a is a household digital terrestrial television.

The user video display device 5b may be installed anywhere within the field of view of the user viewing the video content, for example on top of, beside, or in front of the video display device 5a.

As described above, according to the present embodiment, the image areas of the face and upper body are extracted from the camera video of the user 50 viewing the video content on the video display device 5a. When communication is carried out in sign language, a user video consisting only of the face and upper-body image areas is generated; when communication is carried out by oral speech, a user video consisting only of the face image area is generated. The user video is output to the user video display device 5b located within the field of view of the user 50. Communication by sign language or oral speech can therefore be carried out while viewing the video content, regardless of the viewing style.

Description of Symbols
1 ... Video data control device
3 ... Video camera
5, 5a ... Video display device
5b ... User video display device
7 ... Remote control terminal
9 ... Video content output device
11 ... Camera video reception unit
12 ... Camera video storage unit
13 ... Camera video analysis unit
14 ... User video generation unit
15 ... Video composition unit
16 ... Video output unit
17 ... Adjustment data reception unit
18 ... Adjustment data storage unit
31 ... Camera video input interface
32 ... Video output interface
33 ... Video content input interface
34 ... Adjustment data input interface
50 ... User
S101 to S107, S104a to S104e ... Steps

Claims

A video data control device comprising:
camera video reception means for receiving, from a video camera, a camera video of a user viewing video content in an arbitrary viewing style, the camera video being shot from the screen side of a video display device capable of displaying the video content;
user video generation means for extracting the image areas of the user's face and upper body from the camera video and, according to the mode selected by the user, generating a user video consisting only of the face and upper-body image areas in a mode for communicating in sign language, and generating a user video consisting only of the face image area in a mode for communicating by oral speech;
video composition means for superimposing the user video on the video content so that the user video is displayed; and
video output means for outputting the superimposed user video and video content to the video display device,
wherein the user video generation means generates a plurality of user videos by dividing a plurality of users when the distance between users captured in the camera video is larger than a threshold.

A video data control device comprising:
camera video reception means for receiving, from a video camera, a camera video of a user viewing video content in an arbitrary viewing style, the camera video being shot from the screen side of a video display device capable of displaying the video content;
user video generation means for extracting the image areas of the user's face and upper body from the camera video and, according to the mode selected by the user, generating a user video consisting only of the face and upper-body image areas in a mode for communicating in sign language, and generating a user video consisting only of the face image area in a mode for communicating by oral speech; and
video output means for outputting the user video to another video display device located within the field of view of the user,
wherein the user video generation means generates a plurality of user videos by dividing a plurality of users when the distance between users captured in the camera video is larger than a threshold.

A video data control method in which a computer performs:
a camera video reception step of receiving, from a video camera, a camera video of a user viewing video content in an arbitrary viewing style, the camera video being shot from the screen side of a video display device capable of displaying the video content;
a user video generation step of extracting the image areas of the user's face and upper body from the camera video and, according to the mode selected by the user, generating a user video consisting only of the face and upper-body image areas in a mode for communicating in sign language, and generating a user video consisting only of the face image area in a mode for communicating by oral speech;
a video composition step of superimposing the user video on the video content so that the user video is displayed; and
a video output step of outputting the superimposed user video and video content to the video display device,
wherein, in the user video generation step, a plurality of user videos are generated by dividing a plurality of users when the distance between users captured in the camera video is larger than a threshold.

A video data control method in which a computer performs:
a camera video reception step of receiving, from a video camera, a camera video of a user viewing video content in an arbitrary viewing style, the camera video being shot from the screen side of a video display device capable of displaying the video content;
a user video generation step of extracting the image areas of the user's face and upper body from the camera video and, according to the mode selected by the user, generating a user video consisting only of the face and upper-body image areas in a mode for communicating in sign language, and generating a user video consisting only of the face image area in a mode for communicating by oral speech; and
a video output step of outputting the user video to another video display device located within the field of view of the user,
wherein, in the user video generation step, a plurality of user videos are generated by dividing a plurality of users when the distance between users captured in the camera video is larger than a threshold.

A video data control program for causing a computer to function as the video data control device according to claim 1.