JP6030945B2

JP6030945B2 - Viewer video display control device, viewer video display control method, and viewer video display control program

Info

Publication number: JP6030945B2
Application number: JP2012277959A
Authority: JP
Inventors: 美佐平尾; 陽子石井; 泰彦宮崎; 透小林
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-12-20
Filing date: 2012-12-20
Publication date: 2016-11-24
Anticipated expiration: 2032-12-20
Also published as: JP2014123818A

Description

The present invention relates to a viewer video display control device, a viewer video display control method, and a viewer video display control program that make communication smoother when users who communicate by sign language or oral communication (lip reading), such as users with hearing impairments and their family members, view content video such as television broadcasts.

Talking with family while watching television is an entirely natural act for hearing users (people without a hearing impairment). Hearing people can communicate by voice alone, so they can converse without interrupting their viewing of the content video. In contrast, users with hearing impairments and their family members, who communicate by sign language or oral communication, face various constraints when they try to communicate while viewing content video, such as "chatting only during commercials or after the program ends so as not to miss the main program" and "arranging seating positions so that each other's appearance, signing, and facial expressions are easy to see." Sign language and oral communication depend on the participants looking at each other, so holding a conversation while viewing content video requires interrupting the viewing and turning to face the other person.

To address this problem, one conceivable approach is to apply a system used for remote communication, such as that of Non-Patent Document 1.

Japanese Unexamined Patent Application Publication No. 2008-217536

As described above, when users with hearing impairments or their family members try to communicate while viewing content video such as television, various constraints apply, which makes smooth communication difficult to achieve.

The system of Patent Document 1 is intended for remote communication and does not consider facilitating smooth communication among people with hearing impairments and others while they view content video.

For example, the self-image displayed on the screen is always shown; its display is not switched on or off according to the user's state. However, while users are viewing content video, the video of the users captured by a video camera (hereinafter, the user video) needs to be shown or hidden according to the situation, for example whether a user is conversing in sign language or by oral communication. The user video does not need to be displayed while the users are not conversing, and, considering the legibility of the content video and of text information such as captions and subtitles, it is preferable not to display it.

Even while the users are not conversing, it is desirable to display the user video when a user's facial expression changes. Noticing changes in each other's expressions through the user video can prompt a conversation or, conversely, lead a user to judge that it is better not to speak, which makes communication smoother.

The present invention has been made in view of the above circumstances, and its object is to provide a viewer video display control device, a viewer video display control method, and a viewer video display control program that make communication smoother when users with hearing impairments view content video.

To achieve the above object, the present invention is a viewer video display control device that controls the display of video of a viewer, comprising: a video analysis unit that analyzes camera video input from a camera that captures a viewer who is viewing content video and detects the video of the viewer; a TV content determination unit that determines whether the content video is in a commercial (CM); a sign language detection unit that uses the camera video to detect whether the viewer is signing; and a video synthesis unit that, when the TV content determination unit determines that the content video is not in a commercial, superimposes a sign language video using the video of the viewer on the content video, starting at the timing when the sign language detection unit detects signing and only while the viewer is signing, wherein the sign language video is video in which the signing motion is visible when the viewer is signing.

In the above viewer video display control device, when the TV content determination unit determines that the content video is in a commercial, the video synthesis unit may superimpose the sign language video using the video of the viewer on the content video.

The above viewer video display control device may further comprise a facial expression change detection unit that uses the camera video to detect a change in the viewer's facial expression, and, when the TV content determination unit determines that the content video is not in a commercial and the viewer is not signing, the video synthesis unit may superimpose a facial expression video using the video of the viewer on the content video at the timing when the facial expression change detection unit detects the change in facial expression.

The above viewer video display control device may further comprise: a facial expression change detection unit that uses the camera video to detect a change in the viewer's facial expression; and an emphasis unit that performs emphasis processing, for emphasizing the sign language video or the facial expression video using the video of the viewer, for a predetermined time after detection by the sign language detection unit or the facial expression change detection unit, and the video synthesis unit may superimpose the emphasized sign language video or facial expression video on the content video.

The present invention is also a viewer video display control method, performed by a computer, for controlling the display of video of a viewer, the method comprising: a video analysis step of analyzing camera video input from a camera that captures a viewer who is viewing content video and detecting the video of the viewer; a TV content determination step of determining whether the content video is in a commercial; and a video synthesis step of, when the TV content determination step determines that the content video is not in a commercial, superimposing a sign language video using the video of the viewer on the content video, starting at the timing when the viewer starts signing and only while the viewer is signing, wherein the sign language video is video in which the signing motion is visible when the viewer is signing.

In the above viewer video display control method, when the TV content determination step determines that the content video is in a commercial, the video synthesis step may superimpose the sign language video using the video of the viewer on the content video.

In the above viewer video display control method, when the TV content determination step determines that the content video is not in a commercial and the viewer is not signing, the video synthesis step may superimpose a facial expression video using the video of the viewer on the content video at the timing when the viewer's facial expression changes.

The above viewer video display control method may further comprise an emphasis step of emphasizing the sign language video using the detected video of the viewer for a predetermined time from the start of signing, or emphasizing the facial expression video using the detected video of the viewer for a predetermined time from the change in facial expression, and the video synthesis step may superimpose the sign language video or facial expression video processed in the emphasis step on the content video.

The present invention is also a viewer video display control program that causes a computer to function as each unit of the above viewer video display control device.

According to the present invention, it is possible to provide a viewer video display control device, a viewer video display control method, and a viewer video display control program that make communication smoother when users with hearing impairments view content video.

FIG. 1 is a configuration diagram showing the overall configuration of a system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the control device.
FIG. 3 is a block diagram showing the configuration of the sign language determination unit and the facial expression change determination unit.
FIG. 4 is a block diagram showing the configuration of the user video generation unit.
FIG. 5 is a flowchart showing the processing of the TV content determination unit and the sign language determination unit.
FIG. 6 is a flowchart showing the processing of the facial expression change determination unit.
FIG. 7 is a schematic diagram showing the processing of the user video generation unit.
FIG. 8 is an illustration of an example screen on which a sign language video or a facial expression video is displayed.
FIG. 9 is an illustration of an example screen on which a sign language video or a facial expression video is displayed.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is an overall configuration diagram of a system according to an embodiment of the present invention. The illustrated system comprises a screen 1 that displays content video; a video camera 2 that is installed near the screen 1 and captures users (viewers) who are viewing the content video; a camera input interface 3; a screen output interface 4; a remote control 5 used by the users; a user input interface 6; a control device 7 (the viewer video display control device); and a video content output device 8, such as a terrestrial digital television receiver, that outputs the content video.

Camera video captured by the video camera 2 is input to the control device 7 via the camera input interface 3. Content video output from the video content output device 8 is also input to the control device 7. In addition, user input data (instruction information, setting information, and the like) entered by a user with the remote control 5 is input to the control device 7 via the user input interface 6.

The control device 7 detects the users' faces and upper bodies, signing motions, and changes in facial expression from the input camera video, and detects television commercials in the content video, among other processing. While a user is signing, and during television commercials, the control device 7 superimposes the sign language video on the content video that the users are viewing and outputs the result to the screen 1 via the screen output interface 4. When no user is signing but a facial expression changes, the control device 7 superimposes the facial expression video on the content video that the users are viewing and outputs the result to the screen 1 via the screen output interface 4.

The control device 7 may be, for example, a PC connected to the screen 1 or a browser installed on a television.

Next, the control device 7 is described in detail with reference to FIG. 2. FIG. 2 is a block diagram showing the configuration of the control device 7. The illustrated control device 7 comprises a camera video analysis unit 71, a TV content determination unit 72, a sign language determination unit 73, a facial expression change determination unit 74, a user video generation unit 75, and a video synthesis unit 76.

The video camera 2 captures the users who are viewing the content video displayed on the screen 1 and inputs the captured camera video to the camera video analysis unit 71 frame by frame. The camera video analysis unit 71 performs image analysis on each input camera video frame and detects the video of each user (face, upper body, and the like). It then outputs coordinate information representing the image regions of each user's face and upper body, together with the camera video frame, to the TV content determination unit 72. The TV content determination unit 72 detects whether the content video output from the video content output device 8 is in a commercial, that is, whether it has switched to a commercial.

The sign language determination unit 73 uses the camera video frames to detect whether a user is signing. The facial expression change determination unit 74 uses the camera video frames to detect changes in a user's facial expression. The user video generation unit 75 generates the user's sign language video or facial expression video from the user's face and upper body detected by the camera video analysis unit 71. The video synthesis unit 76 superimposes the user's sign language video and facial expression video on the content video and outputs the result to the screen 1 via the screen output interface 4 for display.

FIG. 3 is a block diagram showing the configuration of the sign language determination unit 73 and the facial expression change determination unit 74 of the control device 7. The illustrated sign language determination unit 73 comprises a sign language detection unit 731 and a sign language non-detection time reference unit 732. The facial expression change determination unit 74 comprises a facial expression change detection unit 741, a facial expression change detection time reference unit 742, and an icon determination unit 743. Their processing is described later with reference to FIGS. 5 and 6.

FIG. 4 is a block diagram showing the configuration of the user video generation unit 75 of the control device 7. The illustrated user video generation unit 75 comprises a user video extraction unit 751, a user video effect processing unit 752, a user video size determination unit 753, a user video position coordinate determination unit 754, and a text information detection unit 755. Their processing is described later with reference to FIG. 7.

The control device 7 described above can be implemented with, for example, a general-purpose computer system comprising a CPU, memory, an external storage device such as an HDD, an input device, and an output device. In this computer system, each function of the control device 7 is realized when the CPU executes a program for the control device 7 loaded into memory. The program for the control device 7 can be stored on a computer-readable recording medium such as a hard disk, flexible disk, CD-ROM, MO, or DVD-ROM, or distributed over a network.

Next, the processing of the control device 7 of this embodiment is described.

First, the video camera 2 captures the users who are viewing the content. The camera video is input frame by frame, via the camera input interface 3, to the camera video analysis unit 71 of the control device 7.

The camera video analysis unit 71 detects the video of each user (here, the face and upper body) from the input camera video frame. It then outputs coordinate information representing the image regions of the user's face and upper body, together with the camera video frame, to the TV content determination unit 72. The user's face and upper body can be detected using, for example, a technique such as that of Reference 1 below. Specifically, feature values are extracted according to the orientation of the face, a similarity is calculated from those feature values, and the user's video is recognized and detected based on the calculated similarity.

[Reference 1]: Japanese Unexamined Patent Application Publication No. 2009-157766

FIG. 5 is a flowchart showing the processing performed by the TV content determination unit 72 and the sign language determination unit 73. First, the content video output from the video content output device 8 is input to the TV content determination unit 72 frame by frame, and the coordinate information representing the image regions of the user's face and upper body, together with the camera video frame, is input from the camera video analysis unit 71 frame by frame (S11).

The TV content determination unit 72 determines whether the input content video frame belongs to a television commercial (S12). This can be determined using, for example, a technique such as that of Reference 2 below.

[Reference 2]: 武小萌, 佐藤真一, "Research on Ultra-High-Speed Commercial Detection and Its Application to Knowledge Discovery," IEICE Technical Report, PRMU2011-53, pp. 119-124, June 2011

In this embodiment, when the content video is in a commercial, the sign language video is superimposed on the content video and displayed on the screen 1 regardless of whether a user is signing. The sign language video is video of the user in which the signing motion is visible when the user is signing; in this embodiment it is video that includes the user's face and upper body. When the content video is not in a commercial, the sign language video is superimposed on the content video and displayed on the screen 1 only while a user is signing.
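The display policy described here, together with the hand-off to the facial expression check that the flowchart of FIG. 5 makes for non-commercial frames without signing, might be sketched as follows. This is only an illustrative sketch: the names `Route` and `route_frame` are hypothetical and do not appear in the embodiment, and the two boolean inputs are assumed to come from the TV content determination unit 72 and the sign language detection unit 731.

```python
from enum import Enum


class Route(Enum):
    """Where a camera frame goes next (hypothetical names)."""
    SIGN_VIDEO = "sign_video"              # overlay the sign language video
    EXPRESSION_CHECK = "expression_check"  # pass to the expression change check


def route_frame(is_commercial: bool, is_signing: bool) -> Route:
    """Decide how one camera frame is handled.

    During a commercial the sign language video is overlaid regardless of
    signing; outside a commercial it is overlaid only while someone signs.
    Otherwise the frame is handed to the facial expression change check.
    """
    if is_commercial or is_signing:
        return Route.SIGN_VIDEO
    return Route.EXPRESSION_CHECK
```

Under these assumptions, only a non-commercial frame with no signing reaches the facial expression check; every other combination results in the sign language overlay.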

In this embodiment, when the state changes from not signing to signing (that is, when signing starts), a marker is set on the camera video frame for a predetermined time (t1 seconds) from the start of signing, so that effect processing (emphasis processing) is applied to make the users notice that signing has started.

Specifically, when the input content video frame belongs to a commercial (S12: YES), the TV content determination unit 72 outputs the camera video frame input in S11 and the coordinate information representing the image regions of the user's face and upper body to the sign language determination unit 73. The sign language detection unit 731 of the sign language determination unit 73 uses the input camera video frame to determine whether a user is signing (S13). This can be determined using, for example, a technique such as that of Reference 3 below.

[Reference 3]: 山田寛, 松尾直志, 島田伸敬, 白井良明, "Hand Region Detection and Shape Identification by Appearance Learning for Sign Language Recognition," Meeting on Image Recognition and Understanding (MIRU2009), pp. 635-642, July 2009

When no signing is taking place in the input camera video frame (S13: NO), the sign language non-detection time reference unit 732 of the sign language determination unit 73 stores the current time as n1 in a storage unit such as memory (S14) and outputs the camera video frame and the coordinate information representing the image regions of the user's face and upper body to the user video generation unit 75 (S15).

When signing is taking place in the input camera video frame (with multiple users, when at least one of them is signing) (S13: YES), the sign language non-detection time reference unit 732 compares the most recently stored n1 with the current time and calculates the difference. When the calculated difference is within the predetermined t1 seconds (S16: YES), it determines that less than t1 seconds have passed since signing started. In this case, so that effect processing is applied to make the other users viewing the content video notice (draw their attention to the fact) that signing has started, the sign language non-detection time reference unit 732 attaches an arbitrary marker to the camera video frame input in S11 and outputs it, together with the coordinate information representing the image regions of the user's face and upper body, to the user video generation unit 75 (S17).

When the difference between the most recently stored n1 and the current time exceeds t1 seconds (S16: NO), the sign language non-detection time reference unit 732 determines that t1 seconds have passed since signing started and, without attaching a marker, outputs the camera video frame and the coordinate information representing the image regions of the user's face and upper body to the user video generation unit 75 (S15).

On the other hand, when the input content video frame does not belong to a commercial (S12: NO), the TV content determination unit 72 outputs the camera video frame and the coordinate information representing the image regions of the user's face and upper body to the sign language determination unit 73. The sign language detection unit 731 of the sign language determination unit 73 uses the input camera video frame to determine whether a user is signing (S18).

When no signing is taking place in the input camera video frame (S18: NO), the sign language non-detection time reference unit 732 of the sign language determination unit 73 stores the current time as n2 in a storage unit such as memory (S22) and outputs the camera video frame and the coordinate information representing the image region of the user's face to the facial expression change determination unit 74 (S23).

When signing is taking place in the input camera video frame (with multiple users, when at least one of them is signing) (S18: YES), the sign language non-detection time reference unit 732 compares the most recently stored n2 with the current time. When the difference is within the predetermined t1 seconds (S19: YES), it determines that less than t1 seconds have passed since signing started and, to make the other users viewing the content video notice that signing has started, attaches an arbitrary marker to the camera video frame input in S11 and outputs it, together with the coordinate information representing the image regions of the user's face and upper body, to the user video generation unit 75 (S20).

When the difference between the most recently stored n2 and the current time exceeds t1 seconds (S19: NO), the sign language non-detection time reference unit 732 determines that t1 seconds have passed since signing started and, without attaching a marker, outputs the camera video frame and the coordinate information representing the image regions of the user's face and upper body to the user video generation unit 75 (S21).

The processing of FIG. 5 is repeated for every frame of the input camera video and content video.
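The marker decision in steps S13 to S21 can be read as a small per-frame state machine: while no one is signing, the current time is remembered (n1 for commercial frames, n2 otherwise), and once signing is seen, the frame is marked if the remembered time is at most t1 seconds old. The following is a minimal sketch of that reading. The class name `SignStartMarker` and its `update` method are hypothetical, times are assumed to be seconds as floats, and returning no marker when no reference time has been stored yet is an assumption, since the embodiment does not describe that case.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SignStartMarker:
    """Tracks when signing was last absent and decides whether to mark a frame."""
    t1: float  # emphasis window in seconds after signing starts
    # Last time no signing was seen, kept separately for commercial frames
    # (n1, key True) and non-commercial frames (n2, key False).
    last_no_sign: Dict[bool, Optional[float]] = field(
        default_factory=lambda: {True: None, False: None})

    def update(self, now: float, is_commercial: bool, is_signing: bool) -> bool:
        """Process one frame; return True if it should carry the marker."""
        if not is_signing:
            # S14 / S22: remember the current time as n1 or n2.
            self.last_no_sign[is_commercial] = now
            return False
        last = self.last_no_sign[is_commercial]
        if last is None:
            # Assumption: no reference time stored yet, so no marker.
            return False
        # S16 / S19: mark the frame while within t1 seconds of the start.
        return (now - last) <= self.t1
```

With `t1 = 2.0`, a frame without signing at time 0.0 followed by signing frames at 0.5 and 2.0 yields marked frames, while a signing frame at 2.5 is no longer marked.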

FIG. 6 is a flowchart showing the processing of the facial expression change determination unit 74.

In this embodiment, when the content video is not in a commercial, no user is signing, and a user's facial expression changes, the facial expression video after the change is superimposed on the content video and displayed on the screen 1 for a predetermined time (t2 seconds). In addition, a marker is set on the camera video frame so that effect processing (emphasis processing) is applied to make the users notice that a change in facial expression has occurred.

The camera video frame and the coordinate information representing the image region of the user's face, output by the sign language determination unit 73 in S23 of FIG. 5, are input to the facial expression change detection unit 741 of the facial expression change determination unit 74 (S31).

Using the input camera video frame, the facial expression change detection unit 741 detects whether a change in the user's facial expression has occurred (S32). A technique such as that of Reference 4 can be used for this detection.

[Reference 4]: Hiroshi Ota, Hitoshi Saji, Hiromasa Nakatani (太田寛志、佐治斉、中谷広正), "Recognition of facial expression changes by facial component model based on facial muscles," IEICE Transactions D-II (Information and Systems, II: Pattern Processing), Vol. J82-D-II(7), pp. 1129-1139, July 1999

When an expression change is detected (S32: YES), the facial expression change detection unit 741 outputs the camera video frame, the coordinate information representing the image region of the user's face, and a tag A indicating what kind of expression change occurred (the expression after the change, for example a smile or a surprised face) to the facial expression change detection time reference unit 742. The facial expression change detection time reference unit 742 saves the current time as n3 in a storage unit such as a memory, also saves the input camera video frame and the face-region coordinate information in the storage unit, and outputs tag A, the camera video frame, and the face-region coordinate information to the icon determination unit 743 (S33).

When an expression change has occurred, the icon determination unit 743 determines whether the facial expression video displayed on the screen 1 should be the video of the user's face taken from the camera video frame or an arbitrary icon representing the user's expression (S34). The icon determination unit 743 makes this determination based on setting information set by the user. Using the remote controller 5 or the like, the user sets in the icon determination unit 743, in advance or while viewing the content video, whether to use the video from the camera video frame or an icon.

The facial expression video is a video in which the user's expression can be visually recognized. In this embodiment, a facial expression video taken from the camera video frame is a video that includes the user's face, and an icon used as a facial expression video is one from which the facial expression can be recognized.

If the user has selected the facial expression video from the camera video frame (S34: NO), the icon determination unit 743 outputs the camera video frame and the coordinate information representing the image region of the user's face to the user video generation unit 75 (S35). In doing so, to make other users viewing the content video notice that an expression change has occurred, the icon determination unit 743 attaches an arbitrary marker to the camera video frame input in S31 and outputs it to the user video generation unit 75 together with the face-region coordinate information.

If the user has selected an icon as the facial expression video (S34: YES), the icon determination unit 743 selects the icon corresponding to the expression represented by tag A (S36) and outputs information on the selected icon to the user video generation unit 75 (S37). Tag A is saved, in association with n3, in a storage unit such as a memory in the icon determination unit 743.

When no expression change is found in the input camera video frame (S32: NO), the facial expression change detection unit 741 outputs the camera video frame and the coordinate information representing the image region of the user's face to the facial expression change detection time reference unit 742. The facial expression change detection time reference unit 742 compares the current time with the n3 most recently saved in the storage unit. If the difference is within the preset predetermined time (t2 seconds) (S38: YES), it determines that no more than t2 seconds have passed since the expression change occurred, that is, that the display period of the facial expression video on the screen 1 is still running. The unit then uses the information saved in the most recent S33 to determine whether the difference between the face position at time n3 and the current face position is within α pixels (S39). This determines whether the user's face in the camera video frame in which the expression change last occurred and the user's face in the current camera video frame belong to the same person.

If the difference is within α pixels (S39: YES), the facial expression change detection time reference unit 742 determines that the face belongs to the same person as the earlier face and outputs the camera video frame and the coordinate information representing the image region of the user's face to the icon determination unit 743. If the user has selected the facial expression video from the camera video frame (S40: NO), the icon determination unit 743 outputs the camera video frame and the face-region coordinate information to the user video generation unit 75 (S41). If the user has selected an icon as the facial expression video (S40: YES), the icon determination unit 743 selects the icon corresponding to the expression represented by the tag A saved in the storage unit in the most recent S36 (S42) and outputs the icon's information to the user video generation unit 75 (S43).

On the other hand, if the difference between the current time and n3 exceeds the predesignated t2 seconds (S38: NO), the facial expression change detection time reference unit 742 determines that t2 seconds (the display period of the facial expression video on the screen 1) have elapsed since the expression change occurred, and produces no output to the icon determination unit 743 (S43). As a result, the facial expression video that had been displayed on the screen 1 disappears.

Likewise, if the difference exceeds α pixels (S39: NO), the facial expression change detection time reference unit 742 determines that the face belongs to a different user from the earlier face and produces no output to the icon determination unit 743 (S43). As a result, the facial expression video that had been displayed on the screen 1 disappears.
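The per-frame decision of S32, S38, and S39 can be sketched as a small state update. This is a hedged illustration, not the patented implementation: `ExpressionState` and `update_expression_display` are hypothetical names, the face position is reduced to a single (x, y) point, and the "difference within α pixels" is assumed here to mean the larger of the horizontal and vertical offsets, since the description does not define the distance measure.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ExpressionState:
    n3: Optional[float] = None                   # time of the last detected change (S33)
    face_pos: Optional[Tuple[int, int]] = None   # face position saved in S33
    tag: Optional[str] = None                    # tag A, e.g. "smile"


def update_expression_display(state, now, face_pos, changed_tag, t2, alpha):
    """Return the expression tag to display for this frame, or None for no output."""
    if changed_tag is not None:                  # S32: YES -> save n3, position, tag A
        state.n3, state.face_pos, state.tag = now, face_pos, changed_tag
        return changed_tag
    if state.n3 is None:                         # no change has been seen yet
        return None
    if now - state.n3 > t2:                      # S38: NO -> display period is over
        return None
    dx = abs(face_pos[0] - state.face_pos[0])
    dy = abs(face_pos[1] - state.face_pos[1])
    if max(dx, dy) > alpha:                      # S39: NO -> treated as a different user
        return None
    return state.tag                             # S39: YES -> keep showing the expression
```

Returning `None` corresponds to producing no output to the icon determination unit 743, which is what makes the facial expression video disappear from the screen.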

Note that the processing of FIG. 6 is repeated for each frame of the input camera video.

In the embodiment shown in FIG. 6, the only camera video frame to which a marker is attached is the one corresponding to S35. However, when the value of t2 is small, a marker may also be attached to the camera video frames of S41 so that effect processing is applied to them as well. In other words, effect processing may be applied for as long as the facial expression video is displayed on the screen 1.

FIG. 7 shows the processing of the user video generation unit 75 for each type of information input as a result of the processing of FIGS. 5 and 6.

(a) When a camera video frame with a marker and coordinate information representing the image region of the user's face and upper body are input to the user video generation unit 75 (S17 and S20 in FIG. 5)
First, the user video extraction unit 751 extracts the user's face and upper body from the camera video frame based on the coordinate information representing their image region, and generates a sign language video (S51). The user video effect processing unit 752 then applies to the generated sign language video effect processing (emphasis processing) that makes its display stand out (S52). Possible effects include adding a conspicuously colored frame around the sign language video, making the frame blink, and enlarging the sign language video beyond its preset normal size. If the user has designated an effect in advance using the remote controller 5 or the like, that effect may be applied as well.

Next, the user video size determination unit 753 adjusts the sign language video to the size designated in advance by the user using the remote controller 5 or the like (S53). The user video position coordinate determination unit 754 then assigns position coordinates based on a position designated in advance by the user or on the example described later (S53), and outputs the sign language video and the position coordinates to the video synthesis unit 76 (S54).

(b) When a camera video frame and coordinate information representing the image region of the user's face and upper body are input to the user video generation unit 75 (FIG. 5: S15, S21)
First, the user video extraction unit 751 extracts the user's face and upper body from the camera video frame based on the coordinate information representing their image region, and generates a sign language video (S61). If the user has specified in advance, using the remote controller 5 or the like, that an effect is to be applied to the sign language video, the user video effect processing unit 752 applies the designated effect (S62). If the user has not designated an effect, no effect processing is performed.

Next, the user video size determination unit 753 adjusts the sign language video to the size designated in advance by the user (S63). The user video position coordinate determination unit 754 then assigns position coordinates based on a position designated in advance by the user or on the example described later (S63), and outputs the sign language video and the position coordinates to the video synthesis unit 76 (S64).

(c) When a camera video frame with a marker and coordinate information representing the image region of the user's face are input to the user video generation unit 75 (FIG. 6: S35)
First, the user video extraction unit 751 extracts the user's face from the camera video frame based on the coordinate information representing its image region, and generates a facial expression video (S71). The user video effect processing unit 752 then applies to the generated facial expression video an effect that makes its display stand out. The effect is the same as in S52 of (a).

Next, the user video size determination unit 753 adjusts the facial expression video to the size designated in advance by the user (S73). The user video position coordinate determination unit 754 then assigns position coordinates based on a position designated in advance by the user or on the example described later (S73), and outputs the facial expression video and the position coordinates to the video synthesis unit 76 (S74).

(d) When a camera video frame and coordinate information representing the image region of the user's face are input to the user video generation unit 75 (FIG. 6: S41)
First, the user video extraction unit 751 extracts the user's face from the camera video frame based on the coordinate information representing its image region, and generates a facial expression video (S81). If the user has specified in advance that an effect is to be applied to the facial expression video, the user video effect processing unit 752 applies the designated effect (S82). If the user has not designated an effect, no effect processing is performed.

Next, the user video size determination unit 753 adjusts the facial expression video to the size designated in advance by the user (S83). The user video position coordinate determination unit 754 then assigns position coordinates based on a position designated in advance by the user or on the example described later (S83), and outputs the facial expression video and the position coordinates to the video synthesis unit 76 (S84).

(e) When icon information is input to the user video generation unit 75 as the facial expression video (FIG. 6: S37, S43)
First, if the user has specified in advance that an effect is to be applied to the icon, the user video effect processing unit 752 applies the designated effect (S91). Next, the user video size determination unit 753 adjusts the icon video to the size designated in advance by the user (S92). The user video position coordinate determination unit 754 then determines position coordinates based on a position designated in advance by the user or on the example described later (S92), and outputs the icon information and the position coordinates to the video synthesis unit 76 (S93).

Note that the processing of FIG. 7 is repeated according to the information that is input.
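The five cases (a) through (e) differ only in whether a region is extracted, whether emphasis is forced by a marker, and whether a user-designated effect exists. A minimal sketch of that dispatch is given below; the function name `plan_user_video`, the `kind` values, and the step labels are hypothetical and serve only to make the ordering of the steps explicit.

```python
def plan_user_video(kind, has_marker, user_effect=None):
    """Return the ordered processing steps for one input to the user video generation unit.

    kind        -- "sign" (face and upper body), "face" (face only), or "icon"
    has_marker  -- True for cases (a) and (c), where emphasis processing is applied
    user_effect -- name of an effect the user designated in advance, or None
    """
    steps = []
    if kind in ("sign", "face"):
        steps.append("extract")                  # S51 / S61 / S71 / S81
    if has_marker:
        steps.append("emphasize")                # S52 and the equivalent step in (c)
    if user_effect is not None:
        steps.append("effect:" + user_effect)    # S62 / S82 / S91, or in addition in (a)
    steps += ["resize", "position", "output"]    # size, position coordinates, hand-off
    return steps
```

For example, case (a) with no user-designated effect would yield extract, emphasize, resize, position, output, while case (e) with a hypothetical "blink" effect would skip extraction and emphasis.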

In the processing of (a) through (e) described above, when the user video position coordinate determination unit 754 determines the position coordinates of the sign language video and the facial expression video based on the position of character information contained in the content video, the character information detection unit 755 detects the display position of the character information in the content video. A technique such as that of Reference 5 can be used to detect the character information. Here, character information means captions, telops, and similar information that render speech in the content video, such as dialogue and narration, as text; it excludes items such as time displays and program logos.

[Reference 5]: Takao Kadoma, Eiji Sawamura, Toru Tsuki, Katsuhiko Shirai (門馬孝雄、沢村英治、都木徹、白井克彦), "Development of Automatic Formatting Technology in Practical System for Off-line Caption Production," 2003 Video Media Society Winter Conference (2003年映像メディア学会冬季大会)

An example of how the user video position coordinate determination unit 754 determines the position coordinates of the sign language video and the facial expression video is described below.

In this example, considering ease of viewing for users with hearing impairments, it is considered desirable that the sign language video and the facial expression video avoid, as far as possible, overlapping character information such as captions and telops that is superimposed on the content video intermittently, while also being as close as possible to the character information so that little eye movement is needed. Accordingly, the sign language video is displayed in whichever of the four corners of the screen is closest to the character information, and the facial expression video is displayed at the end of the character information. The four corners are used for the sign language video because it often needs a large display size to keep the signing visible and therefore may not fit at the end of the character information.

When character information is detected at multiple locations on the screen, the sign language video is displayed in the corner closest to the character information detected lowest on the screen, and the facial expression video is displayed at the end of that lowest character information.
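The placement rule can be sketched as below. Several points are assumptions made for illustration: character information is represented as a box (x, y, width, height) with y increasing downward, "closest corner" is measured between box centers, ties are resolved in favor of the lower corners, and "end of the character information" is taken to be the right edge of the box. All function names are hypothetical.

```python
def lowest_text_box(boxes):
    """Pick the character-information box whose bottom edge is lowest on the screen."""
    return max(boxes, key=lambda b: b[1] + b[3])


def corners_by_proximity(screen_w, screen_h, text_box, video_w, video_h):
    """Return top-left positions for a video at each screen corner, nearest to the text first."""
    cx = text_box[0] + text_box[2] / 2
    cy = text_box[1] + text_box[3] / 2
    # Lower corners are listed first so that ties resolve to the bottom of the screen.
    corners = [
        (0, screen_h - video_h),
        (screen_w - video_w, screen_h - video_h),
        (0, 0),
        (screen_w - video_w, 0),
    ]

    def squared_distance(corner):
        vx = corner[0] + video_w / 2
        vy = corner[1] + video_h / 2
        return (vx - cx) ** 2 + (vy - cy) ** 2

    return sorted(corners, key=squared_distance)   # stable sort keeps the tie order


def end_of_text(text_box):
    """Top-left position for a facial expression video placed right after the text."""
    return (text_box[0] + text_box[2], text_box[1])
```

With two users, the first two entries returned by `corners_by_proximity` could be used, one per sign language video.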

FIGS. 8 and 9 show example screens in which the position coordinates of the sign language video and the facial expression video have been determined according to this example.

Screen 81 in FIG. 8 is an example that displays sign language videos when character information is detected at the center of the screen: the sign language videos 811 and 812 of the users viewing the content video are displayed in the lower left and lower right corners, which are closest to the character information 810. Screen 82 in FIG. 8 is an example that displays facial expression videos when character information is detected at the center of the screen: the facial expression videos 821 and 822 of the users viewing the content video are displayed at the end of the character information 820.

Screen 83 in FIG. 8 is an example that displays sign language videos when character information is detected at the bottom of the screen: the sign language videos 831 and 832 are displayed in the lower left and lower right corners, which are closest to the character information 830. Screen 84 in FIG. 8 is an example that displays facial expression videos when character information is detected at the bottom of the screen: the facial expression videos 841 and 842 are displayed at the end of the character information 840.

Screens 91 and 92 in FIG. 9 are examples that display a sign language video and a facial expression video, respectively, when character information is detected at the top of the screen. Screen 93 in FIG. 9 is an example in which multiple pieces of character information are detected and the sign language video is displayed in the corner closest to the lowest one. Screen 94 in FIG. 9 is an example in which multiple pieces of character information are detected and the facial expression video is displayed at the end of the lowest one.

In this way, the display positions of the sign language video and the facial expression video within the screen are determined so that they do not overlap the character information. Besides the example above, the sign language video and the facial expression video may also be displayed at any position the user chooses.

The video synthesis unit 76 then generates a composite video in which the sign language video or facial expression video output from the user video generation unit 75 is superimposed, at the designated position coordinates, on the content video output from the video content output device 8, and sends it to the screen 1 via the screen output interface 4. As a result, the screen 1 displays a composite video in which the sign language video or facial expression video of the users viewing the content video is superimposed.
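The superimposition itself can be illustrated with a minimal sketch. It assumes, purely for illustration, that a frame is a list of rows of pixel values and that the position coordinates give the top-left corner of the overlay; `superimpose` is a hypothetical name, and a real implementation would operate on decoded video frames.

```python
def superimpose(content, overlay, x, y):
    """Return a copy of the content frame with the overlay pasted at (x, y).

    Pixels of the overlay that would fall outside the content frame are dropped.
    The content frame passed in is left unchanged.
    """
    out = [row[:] for row in content]          # copy so the content video is not modified
    height = len(out)
    width = len(out[0]) if out else 0
    for dy, overlay_row in enumerate(overlay):
        for dx, pixel in enumerate(overlay_row):
            ty, tx = y + dy, x + dx
            if 0 <= ty < height and 0 <= tx < width:
                out[ty][tx] = pixel
    return out
```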

In the embodiment described above, the user's sign language video is superimposed on the content video only while a user with a hearing impairment is performing sign language and while the content video being viewed has switched to a TV commercial. This makes it possible to communicate via the sign language video superimposed on the content video while keeping the content video easy to watch, so that communication when users with hearing impairments view content video becomes smoother.

Specifically, the user's sign language motion is detected and, while the user is signing, the user's sign language video is superimposed on the content video. In addition, the period during which the content video being viewed has switched to a TV commercial is regarded as a time when conversation between users is likely to occur, and the sign language video is superimposed on the content video during that period as well. This realizes smooth communication and helps promote communication.

Also, in this embodiment, when a change in a user's facial expression is detected, a facial expression video of the user's face is superimposed on the content video. This informs each user of changes in the other's expression, which cannot be seen while both are facing the screen, and makes communication smoother.
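The conditions summarized above can be condensed into one selection rule. The sketch below is an interpretation under assumptions: `select_overlay` and its return values are hypothetical, and `expression_active` stands for "an expression change was detected and its t2-second display period has not ended".

```python
def select_overlay(in_cm, signing, expression_active):
    """Decide which viewer video, if any, is superimposed on the content video."""
    if in_cm or signing:
        return "sign"          # sign language video: while signing, or during a commercial
    if expression_active:
        return "expression"    # facial expression video: not in a CM and not signing
    return None                # nothing is superimposed
```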

Also, in this embodiment, when display of a sign language video or facial expression video begins, an effect that makes the video stand out is applied. Even when a user is concentrating on the content video, this makes it easier to notice that the display has started, that is, that the other person has begun signing or that the other person's expression has changed. Users can therefore make effective use of the sign language video and facial expression video, and smoother communication can be realized.

Also, in this embodiment, the on-screen display positions of the sign language video and facial expression video are determined so that they do not overlap character information such as telops and captions. This allows smooth communication via the superimposed video while keeping the character information in the content video easy to read.

The present invention is not limited to the embodiment described above, and various modifications are possible within the scope of its gist.

1: Screen
2: Video camera
3: Camera input interface
4: Screen output interface
5: Remote controller
6: User input interface
7: Control device
71: Camera video analysis unit
72: TV content determination unit
73: Sign language determination unit
74: Facial expression change determination unit
75: User video generation unit
76: Video synthesis unit
8: Video content output device

Claims

A viewer video display control device for controlling display of a viewer's video, comprising:
a video analysis unit that analyzes a camera video input from a camera shooting a viewer who is viewing a content video, and detects the viewer's video;
a TV content determination unit that determines whether the content video is in a CM;
a sign language detection unit that detects, using the camera video, whether the viewer is performing sign language; and
a video synthesis unit that, when the TV content determination unit determines that the content video is not in a CM, superimposes a sign language video using the viewer's video on the content video and synthesizes them, at the timing when the sign language detection unit detects sign language and only while the viewer is performing sign language,
wherein the sign language video is a video in which the sign language motion can be visually recognized when the viewer is performing sign language.

The viewer video display control device according to claim 1,
wherein, when the TV content determination unit determines that the content video is in a CM, the video synthesis unit superimposes the sign language video using the viewer's video on the content video and synthesizes them.

The viewer video display control device according to claim 1,
further comprising a facial expression change detection unit that detects a change in the facial expression of the viewer's face using the camera video,
wherein, when the TV content determination unit determines that the content video is not in a CM and the viewer is not performing sign language, the video synthesis unit superimposes a facial expression video using the viewer's video on the content video and synthesizes them at the timing when the facial expression change detection unit detects the expression change.

The viewer video display control device according to claim 1, further comprising:
a facial expression change detection unit that detects a change in the facial expression of the viewer's face using the camera video; and
an emphasis unit that performs, for a predetermined time after detection by the sign language detection unit or the facial expression change detection unit, emphasis processing for emphasizing the sign language video using the viewer's video or a facial expression video using the viewer's video,
wherein the video synthesis unit superimposes the sign language video or facial expression video on which the emphasis processing has been performed on the content video and synthesizes them.

A viewer video display control method, performed by a computer, for controlling display of a viewer's video, the method comprising:
a video analysis step of analyzing a camera video input from a camera shooting a viewer who is viewing a content video, and detecting the viewer's video;
a TV content determination step of determining whether the content video is in a CM; and
a video synthesis step of, when it is determined in the TV content determination step that the content video is not in a CM, superimposing a sign language video using the viewer's video on the content video and synthesizing them, at the timing when the viewer starts sign language and only while the viewer is performing sign language,
wherein the sign language video is a video in which the sign language motion can be visually recognized when the viewer is performing sign language.

The viewer video display control method according to claim 5,
wherein, when it is determined in the TV content determination step that the content video is in a CM, the video synthesis step superimposes the sign language video using the viewer's video on the content video and synthesizes them.

The viewer video display control method according to claim 5,
wherein, when it is determined in the TV content determination step that the content video is not in a CM and the viewer is not performing sign language, the video synthesis step superimposes a facial expression video using the viewer's video on the content video and synthesizes them at the timing when the viewer's facial expression changes.

The viewer video display control method according to claim 5,
further comprising an emphasis step of performing emphasis processing on the sign language video using the detected viewer's video for a predetermined time from the start of sign language, or on a facial expression video using the detected viewer's video for a predetermined time from the facial expression change,
wherein the video synthesis step superimposes the sign language video or facial expression video on which the processing of the emphasis step has been performed on the content video and synthesizes them.

A viewer video display control program for causing a computer to function as each unit included in the viewer video display control device according to any one of claims 1 to 4.