JP2013073324A - Image display system

Info

Publication number: JP2013073324A
Application number: JP2011210536A
Authority: JP
Inventors: Satoshi Tabata; 聡田端; Yasuhisa Matsuba; 靖寿松葉; Daishi Sei; 大志瀬井; Kazumasa Koizumi; 和真小泉; Daisuke Fukutomi; 大介福富
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 2011-09-27
Filing date: 2011-09-27
Publication date: 2013-04-22
Anticipated expiration: 2031-09-27
Also published as: JP5776471B2

Abstract

PROBLEM TO BE SOLVED: To provide an image display system that makes the contour of a photographed viewer distinct, combines the viewer with a background image, and displays the composite image in real time. SOLUTION: An image processor 2 includes: scenario data storage means for storing scenario data that defines the timing at which one or more persons in a video are combined with content; content storage means for storing, as content, a background image to be used for composition; face detection means 21 for outputting the position and rectangle size of a face detection frame as face detection frame data; tracking means 23 for associating the acquired face detection frame data with the face detection frame data of another frame; scenario data association means 83 for associating a face object that includes the face detection frame data with a person included in the scenario data; and composite image generation means 84 for masking the background image using a segment result image, obtained by using a segment area to segment the frame into the person and the part other than the person, and for combining the person in the frame with the background image.

Description

The present invention relates to a technique for processing and displaying captured video, and more particularly to a technique for displaying video that has been processed according to the state of the viewer being photographed.

Digital signage, an advertising medium that shows advertisements on a display device such as a display or a projector, has begun to be installed in a variety of places. Digital signage not only makes it possible to provide rich content using video and audio, but also enables efficient advertisement delivery suited to the place where the signage is installed, so the digital signage market is expected to expand.

Recently, various improvements have been made to digital signage, and a technique has been proposed for changing the displayed image according to the movement of a viewer in front of the digital signage (see Patent Document 1).

Patent Document 1: Japanese Patent No. 4238371

The technique described in Patent Document 1 generates a composite image based on person recognition information and motion information, but because it performs no tracking processing, it cannot combine a background image according to each person's movement. In particular, it cannot combine a person and a background so that the result looks natural. Chroma key composition exists as a technique for combining a person with a background image, but it has the problem that the background must be kept in a color clearly distinct from the person.

An object of the present invention is therefore to provide an image display system that can combine a photographed viewer with a background image so that the result looks natural, and display it in real time.

To solve the above problem, a first aspect of the present invention provides an image display system comprising a camera that photographs a person, an image processing device that performs composition processing on the captured video sent from the camera, and a display that shows the resulting composite video, wherein the image processing device comprises: scenario data storage means storing scenario data that defines the timing at which one or more persons in the video are combined with content; content storage means storing, as content, a background image used for composition; face detection means that detects the face images captured in one frame of the video sent from the camera and, for each detected face image, outputs the position and rectangle size of a face detection frame as face detection frame data; tracking means that associates the face detection frame data acquired from the face detection means with the face detection frame data of another frame; scenario data association means that associates a face object, which includes the face detection frame data detected by the face detection means, with a person included in the scenario data; and composite image generation means that, in each frame, sets a segment area in a range containing the face detection frame associated with a person included in the scenario data, masks the background image using a segment result image obtained by using this segment area to segment the frame into the person and the part other than the person, and generates a composite image in which the person in the frame is combined with the background image.

According to the first aspect of the present invention, scenario data defining the timing at which a person in the captured video is combined with content is prepared in advance; a face detection frame is detected from a frame of the captured video; the face detection frame is tracked; the resulting face object is assigned to a person in the scenario data; a segment area is set in a range containing the face detection frame of the face object; the background image is masked using a segment result image obtained by using this segment area to segment the frame into the person and the part other than the person; and the person in the frame is combined with the background image. The photographed viewer and the background image can therefore be combined naturally and displayed in real time. In particular, making the contour of the photographed viewer distinct allows the viewer to be combined with the background image without looking out of place.
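The masking step described above can be illustrated with a minimal sketch. It assumes that the segment result image is already available as a binary mask (1 for the person, 0 for everything else) and that all images are plain row-major lists of pixel values of the same size; the function name `composite_with_mask` is hypothetical and does not appear in the source, and how the segment result image itself is computed is not shown here.

```python
def composite_with_mask(frame, background, person_mask):
    """Combine a captured frame with a background image.

    Where person_mask is truthy the pixel is taken from the captured
    frame (the person); elsewhere it is taken from the background image.
    All three arguments are lists of rows with identical dimensions.
    This data layout is an assumption made for illustration only.
    """
    if not (len(frame) == len(background) == len(person_mask)):
        raise ValueError("images and mask must have the same number of rows")

    result = []
    for frame_row, bg_row, mask_row in zip(frame, background, person_mask):
        if not (len(frame_row) == len(bg_row) == len(mask_row)):
            raise ValueError("images and mask must have the same row length")
        # Keep the person pixel where the mask is set, the background elsewhere.
        result.append(
            [f if m else b for f, b, m in zip(frame_row, bg_row, mask_row)]
        )
    return result
```

In this sketch the mask effectively "cuts a hole" in the background image through which the person from the captured frame remains visible.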

In a second aspect of the present invention, in the image display system according to the first aspect, the content storage means further stores, as content, a superimposed image, which is an image to be laid over the person or the background, and the composite image generation means assigns the face object to a person in the scenario data in accordance with the association made by the scenario data association means, changes the position and size of the superimposed image to match the position and size given by the face detection frame data of the face object, and generates a composite image in which the superimposed image is combined onto the frame.

According to the second aspect of the present invention, the position and size of the superimposed image are changed to match the position and size given by the face detection frame data of the face object, and the superimposed image is combined onto the frame, so that video in which a superimposed image is laid over a person appearing in the captured video can be displayed in real time. Here, a superimposed image is an image that is laid over the person portion or the background image in a frame of the captured video. It may overwrite the underlying pixels completely, or it may be overlaid by so-called α blending so that the pixel values of the person portion or the background image are reflected in the result.
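As a rough illustration of these two operations, the following sketch scales a superimposed image's rectangle to a detected face detection frame and blends a single pixel value. The source does not specify the mapping rule; the sketch assumes the superimposed image was authored relative to a reference face rectangle, which is an assumption for illustration. The names `Rect`, `fit_overlay`, and `alpha_blend` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle: top-left corner plus width and height."""
    x: float
    y: float
    width: float
    height: float


def fit_overlay(overlay: Rect, reference_face: Rect, detected_face: Rect) -> Rect:
    """Move and scale the overlay rectangle so that it keeps the same
    relation to the detected face as it had to the reference face.
    The existence of a reference face rectangle is an assumption."""
    sx = detected_face.width / reference_face.width
    sy = detected_face.height / reference_face.height
    return Rect(
        x=detected_face.x + (overlay.x - reference_face.x) * sx,
        y=detected_face.y + (overlay.y - reference_face.y) * sy,
        width=overlay.width * sx,
        height=overlay.height * sy,
    )


def alpha_blend(overlay_value: float, frame_value: float, alpha: float) -> float:
    """Blend one pixel value. alpha = 1.0 overwrites the frame pixel
    completely; smaller values let the frame pixel show through."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be within [0.0, 1.0]")
    return alpha * overlay_value + (1.0 - alpha) * frame_value
```

For example, an overlay authored slightly above a 50 x 50 reference face doubles in size and moves with the face when a 100 x 100 face detection frame is detected elsewhere in the frame.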

In a third aspect of the present invention, in the image display system according to the first or second aspect, the composite image generation means sets the segment area, based on the position of the face detection frame, so that its horizontal center coincides with that of the face detection frame and, in the vertical direction, the center of the face detection frame lies above the center of the segment area, and performs segmentation using pixels designated in a region where the person is present and in a region where the person is not present, each set at a predetermined position within the segment area.

According to the third aspect of the present invention, the segment area is set, based on the position of the face detection frame, so that its horizontal center coincides with that of the face detection frame and, in the vertical direction, the center of the face detection frame lies above the center of the segment area, and segmentation is performed using pixels designated in the region where the person is present and in the region where the person is not present, both set within the segment area. Based on the characteristics of the pixel values designated for the region where the person is present and of those designated for the region where the person is not present, the pixels on the boundary between the person and everything else can be identified, and the contour of the person can be made distinct.
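The placement rule for the segment area can be sketched as follows. The source states only the two constraints (shared horizontal center, face center above the segment area's center); the scale factors `width_scale`, `height_scale`, and `top_margin` below are hypothetical values chosen for illustration, as are the names `Rect` and `segment_area`. Image coordinates are assumed to have y increasing downward, so "above" means a smaller y value.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle in image coordinates (y grows downward)."""
    x: float
    y: float
    width: float
    height: float


def segment_area(face: Rect,
                 width_scale: float = 3.0,
                 height_scale: float = 4.0,
                 top_margin: float = 0.5) -> Rect:
    """Return a segment area around a face detection frame.

    width_scale / height_scale give the segment size in multiples of the
    face frame size; top_margin is how far (in face heights) the segment
    area extends above the face frame. All three are assumed values.
    """
    # The face center lies above the segment center only if this holds.
    if height_scale / 2.0 - top_margin <= 0.5:
        raise ValueError("face center would not lie above the segment center")

    seg_width = face.width * width_scale
    seg_height = face.height * height_scale
    face_center_x = face.x + face.width / 2.0
    return Rect(
        x=face_center_x - seg_width / 2.0,   # horizontal centers coincide
        y=face.y - top_margin * face.height,  # small margin above the face
        width=seg_width,
        height=seg_height,
    )
```

The sketch does not clamp the result to the frame boundaries; an implementation would need to decide how to handle faces near the edge of the image.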

A fourth aspect of the present invention provides an image display system comprising a camera that photographs a person, an image processing device that performs composition processing on the captured video sent from the camera, and a display that shows the resulting composite video, wherein the image processing device comprises: scenario data storage means storing scenario data that defines the timing at which one or more persons in the video are combined with content; content storage means storing, as content, a background image and a superimposed image whose periphery is enclosed and part of which is open; face detection means that detects the face images captured in one frame of the video sent from the camera and, for each detected face image, outputs the position and rectangle size of a face detection frame as face detection frame data; tracking means that associates the face detection frame data acquired from the face detection means with the face detection frame data of another frame; scenario data association means that associates a face object, which includes the face detection frame data detected by the face detection means, with a person included in the scenario data; and composite image generation means that assigns the face object to a person in the scenario data in accordance with the association, changes the position and size of the superimposed image to match the position and size given by the face detection frame data of the face object and combines it, and generates a composite image that is further combined with the background image masked by a mask image corresponding to the superimposed image.

According to the fourth aspect of the present invention, scenario data defining the timing at which a person in the captured video is combined with content is prepared in advance; a face detection frame is detected from a frame of the captured video; the face detection frame is tracked; the resulting face object is assigned to a person in the scenario data; the position and size of the superimposed image, whose periphery is enclosed and part of which is open, are changed to match the position and size given by the face detection frame data of the face object before the image is combined; and the result is combined with the background image masked by a mask image corresponding to the superimposed image. The photographed viewer and the background image can therefore be combined naturally and displayed in real time.

The present invention has the effect that a photographed viewer and a background image can be combined naturally and displayed in real time.

FIG. 1 is a diagram illustrating the configuration of the image display system 1 in the present embodiment.
FIG. 2 is a hardware block diagram of the image processing apparatus 2 that constitutes the image display system 1.
FIG. 3 is a functional block diagram realized by a computer program installed in the image processing apparatus 2.
FIG. 4 is a flowchart illustrating the process in which the image processing apparatus 2 analyzes a frame.
FIG. 5 is a flowchart illustrating the tracking process.
FIG. 6 is a flowchart illustrating the face detection frame data association process.
FIG. 7 is a diagram illustrating the state transition table in the present embodiment.
FIG. 8 is a diagram illustrating human body and face detection results.
FIG. 9 is a flowchart illustrating the process in which the image processing apparatus 2 creates a composite image.
FIG. 10 is a diagram showing an example of scenario data in XML format when there is one target.
FIG. 11 is a diagram showing an example of scenario data in XML format when there are two targets.
FIG. 12 is a diagram showing how content is combined using a face detection frame.
FIG. 13 is a diagram showing the display state of a composite image when there is one target.
FIG. 14 is a diagram showing the display state of a composite image when there are two targets.
FIG. 15 is a diagram showing an example of scenario data in XML format with touch button support.
FIG. 16 is a diagram showing the display state of a composite image with touch button support.
FIG. 17 is a diagram showing the image processing performed in the instruction determination process.
FIG. 18 is a functional block diagram realized by a computer program installed in the image processing apparatus 2'.
FIG. 19 is a flowchart illustrating the face detection process and the tracking process.
FIG. 20 is a diagram showing an example of scenario data in XML format for scene composition.
FIG. 21 is a diagram showing the relationship between the face detection frame and the segment area in a captured image.
FIG. 22 is a diagram showing the shape of the initial region.
FIG. 23 is a diagram showing the initial region set according to the example scenario data of FIG. 20.
FIG. 24 is a diagram showing how a composition mask image is created.
FIG. 25 is a diagram showing how a background mask image is created.
FIG. 26 is a diagram showing how scene composition is performed on a captured image to create a composite image.

≪1. System configuration≫
Preferred embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a diagram illustrating the configuration of the image display system 1 in the present embodiment, FIG. 2 is a hardware block diagram of the image processing apparatus 2 that constitutes the image display system 1, and FIG. 3 is a functional block diagram realized by a computer program installed in the image processing apparatus 2.

As illustrated in FIG. 1, the image display system 1 includes a display 3, which is a display device such as a liquid crystal display. The display 3 may show not only the captured images but also advertisements in a separate part of the display area.

In that case, installing the display 3 on a street, in a store, or in a similar place allows the image display system 1 to function as digital signage as well. When the image display system 1 is used as digital signage, a server that controls the advertisement video shown on the display 3 is required.

A video camera 4 is installed on the display 3. Its angle is set so that it captures the face of a person watching the video played on the display 3, and it photographs persons viewing the advertisement video played on the display 3.

The video captured by the video camera 4 is input to the image processing apparatus 2 through a USB port or the like. The image processing apparatus 2 analyzes the frames contained in the video sent from the video camera 4, detects persons in front of the display 3 and the faces of persons who have viewed the video played on the display 3, and stores a log about the viewers (for example, how long they viewed the display 3).

Of the devices constituting the image display system 1 illustrated in FIG. 1, commercially available devices can be used for the display 3 and the video camera 4, whereas the image processing apparatus 2 has features not found in the prior art. The image processing apparatus 2 is therefore described in detail below.

The image processing apparatus 2 can be realized with a general-purpose computer and has the same hardware as one. In the example of FIG. 2, the image processing apparatus 2 has, as this hardware, a CPU 2a (Central Processing Unit), a ROM 2b (Read-Only Memory) in which the BIOS is installed, a RAM 2c (Random Access Memory) that serves as the computer's main memory, a large-capacity data storage device 2d (for example, a hard disk) as external storage, an input/output interface 2e for data communication with an external device (the video camera 4), a network interface 2f for network communication, a display output interface 2g for sending information to a display device (the display 3), a character input device 2h (for example, a keyboard), and a pointing device 2i (for example, a mouse).

A computer program for operating the CPU 2a is installed in the data storage device 2d of the image processing apparatus 2, and this computer program provides the image processing apparatus 2 with the means illustrated in FIG. 3. The data storage device 2d can also store the various data required by the image display system, and it serves both as the scenario data storage means, which stores scenario data defining the timing at which one or more persons in the video are combined with content, and as the content storage means, which stores the content used for composition.

The content stored in the content storage means is described here. Content is the material image that is combined with a frame of the captured video to obtain a composite image. There are two types of content: a background image that serves as the background, and a superimposed image that is laid over a person in the captured image or over the background image. FIG. 12(a) shows an example of content used as a superimposed image. The superimposed image is not particularly limited and can take various forms; the example of FIG. 12(a) shows a case in which a wig has been prepared as the superimposed image. The superimposed image has rectangle information (position in the x and y directions, width, and height), and this rectangle information makes it possible to align the image with the captured frame.

As illustrated in FIG. 3, the input to the image processing apparatus 2 is the video captured by the video camera 4, and its output is processed video obtained by processing the captured video.

As means for analyzing the frames of the video captured by the video camera 4, the image processing apparatus 2 has: background removal means 20, which removes the background image from a frame of the captured video; face detection means 21, which detects a person's face in the frame from which the background removal means 20 has removed the background; human body detection means 22, which detects a human body in the frame from which the background removal means 20 has removed the background image; tracking means 23, which associates the faces detected by the face detection means 21 between preceding and following frames; video analysis means 24, which uses a video analysis technique such as a particle filter to detect a designated face image in a frame; state transition management means 25, which generates a face object for each face image newly detected by the face detection means 21, refers to the result, obtained from the tracking means 23, of associating the previous face detection frame data with the current face detection frame data, transitions the state of the face object according to a predefined state transition table, and stores a log corresponding to the state transition of the face object; scenario data association means 83, which associates the face objects detected by the face detection means 21 and transitioned by the state transition management means 25 with the prepared scenario data; composite image creation means 84, which creates a composite image by processing each frame of the video captured by the video camera 4 according to the scenario data; instruction determination means 85, which determines whether a viewer is giving an instruction to a touchless button defined at a predetermined position; and command switching means 86, which switches commands in the scenario data according to the determination made by the instruction determination means 85. In addition, so that the attributes (such as age and gender) of a person who viewed the display 3 can be included in the log data, the present embodiment has person attribute estimation means 26, which estimates a person's attributes (such as age and gender) from the face image detected by the face detection means 21. It also has log file output means 27, which outputs the log stored by the state transition management means 25 in file form; composition target definition means 80, which defines the target to be processed (a person or a place) in the scenario data; composition content definition means 81, which defines the content used for processing (images, audio, CG, and so on) in the scenario data; and animation scenario definition means 82, which defines the processing to be applied in the scenario data.

The scenario data can be created in advance on a separate system and stored in the data storage device 2d serving as the scenario data storage means, but it can also be created with the composition target definition means 80, the composition content definition means 81, and the animation scenario definition means 82. These three means are used to create scenario data that describes how each frame of the captured video is to be processed. The format of the scenario data is not particularly limited; the present embodiment uses XML (eXtensible Markup Language). Because the present embodiment uses XML for the scenario data, the composition target definition means 80, the composition content definition means 81, and the animation scenario definition means 82 can be realized with a text editor. The scenario data is therefore created when an administrator starts the text editor and enters text with the character input device.

FIG. 10 is a diagram showing an example of scenario data in XML format when there is one target. The composition target definition means 80, the composition content definition means 81, and the animation scenario definition means 82 are described in detail below with reference to the scenario data of FIG. 10. The composition target definition means 80 defines a target by setting four attributes: target ID, type, absolute coordinates, and transition setting. In the example of FIG. 10, this corresponds to the range enclosed by the two tags <Simulation Targets> on line 3 and </Simulation Targets> on line 9. Two types, person and place, can be set, but in the example of FIG. 10 only a person is set, using the <Human> tag. The transition setting can be set only for a person, and specifies whether the target should be associated with a new person when the person already associated with it disappears. In the example of FIG. 10, the tag on line 5 sets three attributes, namely target ID, type, and transition setting: the target ID is "0", the type is "human" (person), and the transition setting (IsTransfer) is "false" (not enabled).

The composition content definition means 81 defines content by setting three attributes: content ID, content path, and overlap setting. In the example of FIG. 10, this corresponds to the range enclosed by the two tags <Simulation Contents> on line 10 and </Simulation Contents> on line 18. In the example of FIG. 10, seven pieces of content are defined, with content IDs (ContentsID) from "0" to "6". As shown in FIG. 10, the content path (ContentsPath) and the overlap setting (OverlapOrder) are set on one line per piece of content.

アニメーションシナリオ定義手段８２は、コマンドＩＤ、コマンドタイプ、開始キー、終了キー、キータイプ、ターゲットＩＤ、コンテンツＩＤの７つの属性を設定することによりアニメーションシナリオを定義する。図１０の例では、１９行目の<Animation Commands>と、３４行目の</Animation Commands>の２つのタグで囲まれた範囲に対応する。図１０の例では、コマンドＩＤ（CommandID）が“０”から“６”までの７つのコマンドについて定義されている。図１０に示すように、各コマンドについて２行単位で、コマンドタイプ、開始キー、終了キー、キータイプ、ターゲットＩＤ、コンテンツＩＤが設定される。コマンドタイプとは、どのようなフレームを基にどのようなタイプの効果を生じさせるかを示すものであり、レイヤ合成、αブレンド合成、音声再生開始、シーン合成が用意されている。このうち、レイヤ合成、αブレンド合成、シーン合成は、画像合成のタイプを示すものであり、レイヤ合成は、コンテンツを上書き合成するものであり、αブレンド合成は、設定されたα比率に応じてコンテンツとフレームを透過させて合成するものであり、シーン合成は、人体部分を切り抜き、背景画像と合成するものである。図１０の例では、コマンドタイプ（CommandType）として、レイヤ合成（LayerMontage）が設定されている。開始キー、終了キーは各コマンドの開始時点、終了時点を設定するものである。本実施形態では、シナリオデータの時間を、シナリオ開始時を“０．０”、シナリオ終了時を“１．０”として管理している。したがって、最初に開始するコマンドの開始キー（StartKey）は“０．０”、最後に終了するコマンドの終了キー（EndKey）は“１．０”となる。キータイプとは、開始キー、終了キーの基準とする対象を設定するものであり、own、base、globalの３つが用意されている。ownは各ターゲットＩＤに対応する顔オブジェクトの閲覧時間を基準とし、baseはターゲットＩＤ＝０に対応する顔オブジェクトの閲覧時間を基準とし、globalは撮影映像の最初のフレームを取得した時間を基準とする。図１０の例では、キータイプ（KeyType）として、ownが設定されているので、顔オブジェクトがフレームに登場した時点（顔オブジェクトが“閲覧開始”と判断された時点）を“０．０”として、開始キー、終了キーが認識されることになる。 The animation scenario definition means 82 defines an animation scenario by setting seven attributes of command ID, command type, start key, end key, key type, target ID, and content ID. In the example of FIG. 10, it corresponds to a range surrounded by two tags of <Animation Commands> on the 19th line and </ Animation Commands> on the 34th line. In the example of FIG. 10, the command ID (CommandID) is defined for seven commands from “0” to “6”. As shown in FIG. 10, for each command, a command type, a start key, an end key, a key type, a target ID, and a content ID are set in units of two lines. The command type indicates what type of effect is generated based on what frame, and layer synthesis, α blend synthesis, audio reproduction start, and scene synthesis are prepared. 
Of these, layer composition, α blend composition, and scene composition are types of image composition: layer composition overwrites the frame with the content, α blend composition blends the content and the frame with transparency according to the set α ratio, and scene composition cuts out the human body part and combines it with a background image. In the example of FIG. 10, layer composition (LayerMontage) is set as the command type (CommandType). The start key and end key set the start time and end time of each command. In the present embodiment, scenario time is managed with the scenario start as “0.0” and the scenario end as “1.0”. Therefore, the start key (StartKey) of the command that starts first is “0.0”, and the end key (EndKey) of the command that ends last is “1.0”. The key type sets the reference for the start key and the end key, and three types are provided: own, base, and global. own is based on the browsing time of the face object corresponding to each target ID, base is based on the browsing time of the face object corresponding to target ID = 0, and global is based on the time at which the first frame of the captured video was acquired. In the example of FIG. 10, own is set as the key type (KeyType), so the start key and end key are interpreted with the time at which the face object appeared in the frame (the time at which the face object was determined to have started viewing) as “0.0”.

シナリオデータの<CycleInterval>タグは、シナリオの開始から終了までの時間を秒単位で設定するものであり、図１０の例では、１行目の<CycleInterval>タグにおいて“１０”が設定されているので、シナリオの開始から終了まで１０秒であることを示している。開始キー、終了キーの値を１０倍した実時間でシナリオは管理されることになる。シナリオデータの<IsAutoLoop>タグは、ループ処理（繰り返し処理）を行うかどうかを設定するものであり、図１０の例では、２行目の<IsAutoLoop>タグにおいて“true”が設定されているので、ループ処理を行うことを示している。<CycleInterval>タグおよび<IsAutoLoop>タグについても、テキストエディタにより設定が可能である。このようにして、合成ターゲット定義手段８０、合成コンテンツ定義手段８１、アニメーションシナリオ定義手段８２により作成されたシナリオデータは、シナリオデータ記憶手段としてのデータ記憶装置２ｄに格納される。 The <CycleInterval> tag of the scenario data sets the time from the start to the end of the scenario in seconds. In the example of FIG. 10, “10” is set in the <CycleInterval> tag on the first line, so the scenario lasts 10 seconds from start to end. The scenario is therefore managed in real time obtained by multiplying the start key and end key values by 10. The <IsAutoLoop> tag of the scenario data sets whether to perform loop processing (repeated processing). In the example of FIG. 10, “true” is set in the <IsAutoLoop> tag on the second line, so loop processing is performed. The <CycleInterval> tag and <IsAutoLoop> tag can also be set using a text editor. The scenario data created in this way by the composite target definition means 80, the composite content definition means 81, and the animation scenario definition means 82 is stored in the data storage device 2d, which serves as the scenario data storage means.
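The relationship between the normalized keys and real time can be illustrated with a short sketch. FIG. 10 is not reproduced in this text, so the XML fragment below is hypothetical: the root element, the `Command` element, and the use of attributes are assumptions modelled on the names quoted above (CycleInterval, IsAutoLoop, CommandType, StartKey, EndKey, KeyType), and the helper names `load_commands` and `active_commands` are likewise illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical scenario fragment; the layout is an assumption, not FIG. 10 itself.
SCENARIO_XML = """
<Scenario>
  <CycleInterval>10</CycleInterval>
  <IsAutoLoop>true</IsAutoLoop>
  <AnimationCommands>
    <Command CommandID="0" CommandType="LayerMontage" StartKey="0.0"
             EndKey="0.5" KeyType="own" TargetID="0" ContentsID="0"/>
    <Command CommandID="1" CommandType="LayerMontage" StartKey="0.5"
             EndKey="1.0" KeyType="own" TargetID="0" ContentsID="1"/>
  </AnimationCommands>
</Scenario>
"""


def load_commands(xml_text):
    """Parse the scenario and convert normalized keys (0.0-1.0) to seconds."""
    root = ET.fromstring(xml_text)
    cycle = float(root.findtext("CycleInterval"))
    loop = root.findtext("IsAutoLoop").strip().lower() == "true"
    commands = []
    for cmd in root.iter("Command"):
        commands.append({
            "id": int(cmd.get("CommandID")),
            "type": cmd.get("CommandType"),
            "key_type": cmd.get("KeyType"),
            # Real time = normalized key multiplied by CycleInterval.
            "start_sec": float(cmd.get("StartKey")) * cycle,
            "end_sec": float(cmd.get("EndKey")) * cycle,
        })
    return cycle, loop, commands


def active_commands(commands, elapsed_sec, cycle, loop):
    """Return IDs of commands active at elapsed_sec since the key reference."""
    t = elapsed_sec % cycle if loop else elapsed_sec
    return [c["id"] for c in commands if c["start_sec"] <= t < c["end_sec"]]
```

With KeyType "own", `elapsed_sec` would be measured from the moment the corresponding face object was judged to have started viewing; with looping enabled, 12 seconds after that moment maps back to 2 seconds into the cycle.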

画像処理装置２が、ビデオカメラ４によって撮影された映像のフレームを時系列で解析することで、画像処理装置２のデータ記憶装置２ｄには、閲覧測定に利用可能なログファイルとして、ディスプレイの閲覧時間が記憶される閲覧時間ログファイルと、ディスプレイを閲覧した人物の位置が記憶される位置ログファイルと、ディスプレイを閲覧した人物の人物属性（例えば，年齢・性別）が記憶される人物属性ログファイルと、ディスプレイの前にいる人物の総人数、ディスプレイを閲覧していない人物の人数、ディスプレイを閲覧した人物の人数が記憶される人数ログファイルが記憶され、これらのログファイルを出力するログファイル出力手段２７が画像処理装置２には備えられている。本発明では、ログファイルを作成することは必須ではないが、ログファイルを作成する過程における顔オブジェクト、閲覧開始時刻が、合成画像の生成に利用される。 As the image processing apparatus 2 analyzes the frames of the video captured by the video camera 4 in time series, the data storage device 2d of the image processing apparatus 2 stores, as log files usable for viewing measurement, a viewing time log file recording the time the display was viewed, a position log file recording the positions of persons who viewed the display, a person attribute log file recording the attributes (for example, age and gender) of persons who viewed the display, and a headcount log file recording the total number of persons in front of the display, the number of persons not viewing the display, and the number of persons who viewed the display. The image processing apparatus 2 is provided with log file output means 27 for outputting these log files. In the present invention, creating the log files is not essential, but the face objects and viewing start times obtained in the process of creating them are used to generate the composite image.

まず、ビデオカメラ４から送信された映像のフレームを画像処理装置２が解析する処理を説明しながら、ビデオカメラ４によって撮影された映像のフレームを解析、加工するために備えられた各手段について説明する。 First, each means provided for analyzing and processing the frames of the video captured by the video camera 4 will be described, while explaining the process in which the image processing apparatus 2 analyzes the frames transmitted from the video camera 4.

≪２．処理動作≫ ≪2. Processing operation≫
図４は、ビデオカメラ４から送信された映像のフレームを画像処理装置２が解析する処理を説明するフロー図である。それぞれの処理の詳細は後述するが、画像処理装置２に映像の一つのフレームが入力されると、画像処理装置２は該フレームについて背景除去処理Ｓ１を行い、背景除去処理Ｓ１した後のフレームについて、顔検出処理Ｓ２及び人体検出処理Ｓ３を行う。 FIG. 4 is a flowchart explaining the process in which the image processing apparatus 2 analyzes a frame of the video transmitted from the video camera 4. Details of each process are given later; when one frame of the video is input to the image processing apparatus 2, the image processing apparatus 2 performs background removal process S1 on the frame and then performs face detection process S2 and human body detection process S3 on the frame after background removal.

画像処理装置２は、背景除去処理Ｓ１した後のフレームについて、顔検出処理Ｓ２及び人体検出処理Ｓ３を行った後、顔検出処理Ｓ２の結果を利用して、今回の処理対象となるフレームであるＮフレームから検出された顔と、一つ前のフレームであるＮ−１フレームから検出された顔を対応付けるトラッキング処理Ｓ４を行い、トラッキング処理Ｓ４の結果を踏まえて顔オブジェクトの状態を遷移させる状態遷移管理処理Ｓ５を実行する。 After performing face detection process S2 and human body detection process S3 on the frame after background removal process S1, the image processing apparatus 2 uses the result of face detection process S2 to perform tracking process S4, which associates the faces detected in the N frame (the frame currently being processed) with the faces detected in the N-1 frame (the immediately preceding frame). It then executes state transition management process S5, which changes the state of each face object based on the result of tracking process S4.

まず、背景除去処理Ｓ１について説明する。背景除去処理Ｓ１を担う手段は、画像処理装置２の背景除去手段２０である。画像処理装置２が背景除去処理Ｓ１を実行するのは、図１に図示しているように、ディスプレイ３の上部に設けられたビデオカメラ４の位置・アングルは固定であるため、ビデオカメラ４が撮影した映像には変化しない背景が含まれることになり、この背景を除去することで、精度よく人体・顔を検出できるようにするためである。 First, the background removal process S1 will be described. The means responsible for the background removal processing S1 is the background removal means 20 of the image processing apparatus 2. The image processing apparatus 2 executes the background removal process S1 because the position and angle of the video camera 4 provided at the upper part of the display 3 is fixed as shown in FIG. This is because the photographed video includes a background that does not change, and by removing this background, the human body / face can be detected with high accuracy.

画像処理装置２の背景除去手段２０が実行する背景除去処理としては既存技術を利用でき、ビデオカメラ４が撮影する映像は、例えば、朝、昼、夜で光が変化する場合があるので、背景の時間的な変化を考慮した動的背景更新法を用いることが好適である。 Existing techniques can be used for the background removal process executed by the background removal unit 20 of the image processing apparatus 2. Because the lighting in the video captured by the video camera 4 may change, for example between morning, noon, and night, it is preferable to use a dynamic background update method that takes temporal changes in the background into account.
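As a rough illustration of what a dynamic background update involves, the following sketch keeps a background model as an exponential running average and marks pixels that differ strongly from it as foreground. This is a generic textbook scheme chosen for brevity, not the method of any particular cited work; the function names, the `rate` and `threshold` parameters, and the grayscale list-of-rows frame representation are all assumptions.

```python
def update_background(background, frame, rate=0.05):
    """Blend the new frame into the background model (running average).

    background and frame are lists of rows of grayscale values.
    A larger rate lets the model follow lighting changes faster.
    """
    return [[(1.0 - rate) * b + rate * f for b, f in zip(brow, frow)]
            for brow, frow in zip(background, frame)]


def foreground_mask(background, frame, threshold=30.0):
    """Mark pixels whose difference from the background exceeds threshold."""
    return [[abs(f - b) > threshold for b, f in zip(brow, frow)]
            for brow, frow in zip(background, frame)]
```

In such a scheme the mask would be computed first and the model updated afterwards, so that slow illumination changes are absorbed into the background while a person standing in front of the display remains foreground for some time.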

背景の時間的な変化を考慮した動的背景更新法としては、例えば、「森田真司, 山澤一誠, 寺沢征彦, 横矢直和: "全方位画像センサを用いたネットワーク対応型遠隔監視システム", 電子情報通信学会論文誌（D-II), Vol. J88-D-II, No. 5, pp. 864-875, (2005.5)」に記載されている手法を用いることができる。 As a dynamic background update method that takes temporal changes in the background into account, for example, the method described in "Shinji Morita, Kazumasa Yamazawa, Masahiko Terasawa, Naokazu Yokoya: 'Network-enabled remote monitoring system using omnidirectional image sensors', Transactions of the IEICE (D-II), Vol. J88-D-II, No. 5, pp. 864-875, (2005.5)" can be used.

次に、画像処理装置２の顔検出手段２１によって実行される顔検出処理Ｓ２について説明する。顔検出処理Ｓ２で実施する顔検出方法としては、特許文献１に記載されているような顔検出方法も含め、様々な顔検出方法が開示されているが、本実施形態では、弱い識別器として白黒のHaar-Like特徴を用いたAdaboostアルゴリズムによる顔検出法を採用している。なお、弱い識別器として白黒のHaar-Like特徴を用いたAdaboostアルゴリズムによる顔検出法については、「Paul Viola and Michael J. Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features", IEEE CVPR, 2001.」、「Rainer Lienhart and Jochen Maydt, "An Extended Set of Haar-like Features for Rapid Object Detection", IEEE ICIP 2002, Vol. 1, pp. 900-903, Sep. 2002.」で述べられている。 Next, the face detection process S2 executed by the face detection unit 21 of the image processing apparatus 2 will be described. Various face detection methods have been disclosed, including the one described in Patent Document 1, but the present embodiment adopts a face detection method based on the Adaboost algorithm using black-and-white Haar-like features as weak classifiers. This method is described in Paul Viola and Michael J. Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features", IEEE CVPR, 2001, and in Rainer Lienhart and Jochen Maydt, "An Extended Set of Haar-like Features for Rapid Object Detection", IEEE ICIP 2002, Vol. 1, pp. 900-903, Sep. 2002.

弱い識別器として白黒のHaar-Like特徴を用いたAdaboostアルゴリズムによる顔検出法を実行することで、フレームに含まれる顔画像毎に顔検出枠データが得られ、この顔検出枠データには、顔画像を検出したときに利用した顔検出枠の位置（例えば、左上隅の座標）・矩形サイズ（幅及び高さ）が含まれる。 By executing the face detection method based on the Adaboost algorithm using black-and-white Haar-like features as weak classifiers, face detection frame data is obtained for each face image in the frame. This data includes the position (for example, the coordinates of the upper left corner) and the rectangular size (width and height) of the face detection frame used when the face image was detected.

次に、画像処理装置２の人体検出手段２２によって実行される人体検出処理Ｓ３について説明する。人体を検出する手法としては赤外線センサを用い、人物の体温を利用して人体を検出する手法が良く知られているが、本実施形態では、人体検出処理Ｓ３で実施する人体検出方法として、弱い識別器としてＨＯＧ（Histogram of Oriented Gradients）特徴を用いたAdaboostアルゴリズムによる人体検出法を採用している。なお、弱い識別器としてＨＯＧ（Histogram of Oriented Gradients）特徴を用いたAdaboostアルゴリズムによる人体検出法については、「N. Dalal and B. Triggs，"Histograms of Oriented Gradientstional Conference on Computer Vision，pp. 734-741，2003．」で述べられている。 Next, the human body detection process S3 executed by the human body detection unit 22 of the image processing apparatus 2 will be described. A well-known way to detect a human body is to use an infrared sensor that responds to body temperature, but in the present embodiment the human body detection process S3 adopts a human body detection method based on the Adaboost algorithm using HOG (Histogram of Oriented Gradients) features as weak classifiers. This method is described in “N. Dalal and B. Triggs，"Histograms of Oriented Gradientstional Conference on Computer Vision，pp. 734-741，2003．”

弱い識別器としてＨＯＧ特徴を用いたAdaboostアルゴリズムによる人体検出法を実行することで、フレームに含まれる人体毎に人体検出枠データが得られ、この人体検出枠データには、人体画像を検出したときに利用した人体検出枠の位置（例えば、左上隅の座標）・矩形サイズ（幅及び高さ）が得られる。 By executing the human body detection method based on the Adaboost algorithm using HOG features as weak classifiers, human body detection frame data is obtained for each human body in the frame. This data includes the position (for example, the coordinates of the upper left corner) and the rectangular size (width and height) of the human body detection frame used when the human body image was detected.

図８は、人体及び顔検出結果を説明するための図である。図８のフレーム７で撮影されている人物は、人物７ａ〜７ｆの合計６人が含まれ，画像処理装置２の人体検出手段２２はそれぞれの人物７ａ〜７ｆを検出し、それぞれの人物７ａ〜７ｆに対応する人体検出枠データ７０ａ〜７０ｆを出力する。また、画像処理装置２の顔検出手段２１は、両眼が撮影されている人物７ａ〜７ｃの顔を検出し、それぞれの顔に対応する顔検出枠データ７１ａ〜７１ｃを出力する。 FIG. 8 is a diagram for explaining the human body and face detection results. Frame 7 in FIG. 8 shows six persons in total, persons 7a to 7f. The human body detection means 22 of the image processing apparatus 2 detects each of the persons 7a to 7f and outputs the corresponding human body detection frame data 70a to 70f. The face detection means 21 of the image processing apparatus 2 detects the faces of persons 7a to 7c, whose eyes are both captured, and outputs the corresponding face detection frame data 71a to 71c.

次に、画像処理装置２のトラッキング手段２３によって実行されるトラッキング処理Ｓ４について説明する。トラッキング処理Ｓ４では、画像処理装置２のトラッキング手段２３によって、顔検出手段２１がＮ−１フレームから検出した顔検出枠データと、顔検出手段２１がＮフレームから検出した顔検出枠データを対応付ける処理が実行される。 Next, the tracking process S4 executed by the tracking unit 23 of the image processing apparatus 2 will be described. In the tracking process S4, the tracking unit 23 of the image processing apparatus 2 associates the face detection frame data that the face detection unit 21 detected in the N-1 frame with the face detection frame data that the face detection unit 21 detected in the N frame.

ここから，画像処理装置２のトラッキング手段２３によって実行されるトラッキング処理Ｓ４について詳細に説明する。図５は、画像処理装置２のトラッキング手段２３によって実行されるトラッキング処理Ｓ４を説明するためのフロー図である。 From here, the tracking process S4 performed by the tracking means 23 of the image processing apparatus 2 will be described in detail. FIG. 5 is a flowchart for explaining the tracking process S4 executed by the tracking unit 23 of the image processing apparatus 2.

画像処理装置２のトラッキング手段２３は、Ｎフレームをトラッキング処理Ｓ４するために、まず、Ｎフレームから得られた顔検出枠データ及び人体検出枠データをそれぞれ顔検出手段２１及び人体検出手段２２から取得する（Ｓ１０）。 The tracking unit 23 of the image processing apparatus 2 first acquires the face detection frame data and the human body detection frame data obtained from the N frame from the face detection unit 21 and the human body detection unit 22, respectively, in order to perform the tracking process S4 for the N frame. (S10).

なお、次回のトラッキング処理Ｓ４において、Ｎフレームから得られた顔検出枠データは、Ｎ−１フレームの顔検出枠データとして利用されるため、画像処理装置２のトラッキング手段２３は、Ｎフレームから得られた顔検出枠データをＲＡＭ２ｃまたはデータ記憶装置２ｄに記憶する。 Because the face detection frame data obtained from the N frame is used as the N-1 frame face detection frame data in the next tracking process S4, the tracking unit 23 of the image processing apparatus 2 stores the face detection frame data obtained from the N frame in the RAM 2c or the data storage device 2d.

画像処理装置２のトラッキング手段２３は、Ｎフレームの顔検出枠データ及び人体検出枠データを取得すると、Ｎフレームの人体検出枠データ毎に、ディスプレイの閲覧判定を行う（Ｓ１１）。 On acquiring the face detection frame data and human body detection frame data of the N frame, the tracking unit 23 of the image processing apparatus 2 determines, for each human body detection frame data item of the N frame, whether the display is being viewed (S11).

上述しているように、人体検出枠データには人体検出枠の位置及び矩形サイズが含まれ、顔検出枠データには顔検出枠の位置及び矩形サイズが含まれるため、顔検出枠が含まれる人体検出枠データは、ディスプレイ３を閲覧している人物の人体検出枠データと判定でき、また、顔検出枠が含まれない人体検出枠データは、ディスプレイ３を閲覧していない人物の人体検出枠データと判定できる。 As described above, the human body detection frame data includes the position and rectangular size of the human body detection frame, and the face detection frame data includes the position and rectangular size of the face detection frame. Human body detection frame data that contains a face detection frame can therefore be judged to belong to a person who is viewing the display 3, and human body detection frame data that contains no face detection frame can be judged to belong to a person who is not viewing the display 3.

画像処理装置２のトラッキング手段２３は、このようにして、Ｎフレームの人体検出枠データ毎にディスプレイの閲覧判定を行うと、Ｎフレームが撮影されたときの人数ログファイルとして、ディスプレイ３の前にいる人物の総人数、すなわち、人体検出手段２２によって検出された人体検出枠データの数と、ディスプレイ３を閲覧していない人物の人数、すなわち、顔検出枠が含まれていない人体検出枠データの数と、ディスプレイ３を閲覧している人物の人数、すなわち、顔検出枠が含まれる人体検出枠データの数を記載した人数ログファイルを生成し、Nフレームのフレーム番号などを付与してデータ記憶装置２ｄに記憶する。 Having made the viewing determination for each human body detection frame data item of the N frame in this way, the tracking means 23 of the image processing apparatus 2 generates a headcount log file for the time the N frame was captured. The file records the total number of persons in front of the display 3 (the number of human body detection frame data items detected by the human body detection means 22), the number of persons not viewing the display 3 (the number of human body detection frame data items containing no face detection frame), and the number of persons viewing the display 3 (the number of human body detection frame data items containing a face detection frame). The file is tagged with the frame number of the N frame and other information and stored in the data storage device 2d.
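The viewing determination and the three counts written to the headcount log can be sketched as a rectangle containment test. The `Box` type, the function names, and the requirement that the face frame lie entirely inside the body frame are assumptions made for illustration; the text only states that a body frame "contains" a face frame.

```python
from collections import namedtuple

# Top-left corner plus width and height, as in the detection frame data.
Box = namedtuple("Box", "x y w h")


def contains(outer, inner):
    """True if inner lies entirely within outer (assumed containment rule)."""
    return (outer.x <= inner.x and outer.y <= inner.y
            and inner.x + inner.w <= outer.x + outer.w
            and inner.y + inner.h <= outer.y + outer.h)


def count_viewers(body_boxes, face_boxes):
    """Count total persons, viewers, and non-viewers for one frame."""
    viewing = sum(1 for body in body_boxes
                  if any(contains(body, face) for face in face_boxes))
    total = len(body_boxes)
    return {"total": total, "viewing": viewing, "not_viewing": total - viewing}
```

For the situation of FIG. 8, six body frames of which three contain a face frame would yield a total of 6, with 3 viewing and 3 not viewing.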

画像処理装置２のトラッキング手段２３は、Ｎフレームの人体検出枠データ毎に、ディスプレイの閲覧判定を行うと、顔検出手段２１がＮ−１フレームから検出した顔検出枠データと、顔検出手段２１がＮフレームから検出した顔検出枠データを対応付ける顔検出枠データ対応付け処理Ｓ１２を実行する。 After making the viewing determination for each human body detection frame data item of the N frame, the tracking unit 23 of the image processing apparatus 2 executes face detection frame data association process S12, which associates the face detection frame data that the face detection unit 21 detected in the N-1 frame with the face detection frame data that the face detection unit 21 detected in the N frame.

図６は、顔検出枠データ対応付け処理Ｓ１２を説明するためのフロー図で、本実施形態では、図６で図示したフローにおいて、以下に記述する数式１の評価関数を用いて得られる評価値を利用して、顔検出枠データの対応付けがなされる。 FIG. 6 is a flowchart for explaining the face detection frame data association process S12. In the present embodiment, in the flow illustrated in FIG. 6, the face detection frame data is associated using evaluation values obtained with the evaluation functions of Equation 1 described below.

なお、数式１の評価関数ｆ１（）は、ニアレストネイバー法を用いた評価関数で、評価関数ｆ１（）で得られる評価値は、顔検出枠データの位置・矩形サイズの差を示した評価値になる。また、数式１の評価関数ｆ２（）で得られる評価値は、評価関数ｆ１（）から求められる評価値に、顔検出枠データで特定される顔検出枠に含まれる顔画像から得られ、顔画像の特徴を示すＳＵＲＦ特徴量の差が重み付けして加算された評価値になる。
The evaluation function f1() in Equation 1 is based on the nearest neighbor method, and the evaluation value it yields indicates the difference in position and rectangular size between face detection frame data. The evaluation value yielded by the evaluation function f2() in Equation 1 is the value from f1() plus a weighted difference between SURF feature amounts, which represent the features of a face image and are obtained from the face images inside the face detection frames specified by the face detection frame data.
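Equation 1 itself is not reproduced in this text, so the following is only a sketch of evaluation functions with the properties described above. The Euclidean form of both distances, the `weight` parameter, and the representation of a SURF feature amount as a plain numeric vector are assumptions; the patent's actual expression may differ.

```python
import math


def f1(box_n, box_prev):
    """Position/size difference between two (x, y, w, h) detection frames.

    Assumed form: Euclidean distance over the four values.
    """
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(box_n, box_prev)))


def f2(box_n, box_prev, desc_n, desc_prev, weight=1.0):
    """f1 plus a weighted difference between two feature vectors.

    desc_n and desc_prev stand in for SURF feature amounts of the face
    images inside the two frames (assumed to be equal-length vectors).
    """
    feature_diff = math.sqrt(
        sum((a - b) ** 2 for a, b in zip(desc_n, desc_prev)))
    return f1(box_n, box_prev) + weight * feature_diff
```

Under these assumptions a smaller value means a better match, which agrees with the later steps that search for the minimum evaluation value.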

Ｎ−１フレームから検出した顔検出枠データとＮフレームから検出した顔検出枠データを対応付けるために、画像処理装置２のトラッキング手段２３は、まず、Ｎフレームから得られた顔検出枠データの数だけループ処理Ｌ１を実行する。 To associate the face detection frame data detected in the N-1 frame with the face detection frame data detected in the N frame, the tracking unit 23 of the image processing apparatus 2 first executes loop process L1 once for each face detection frame data item obtained from the N frame.

このループ処理Ｌ１において、画像処理装置２のトラッキング手段２３は、まず、Ｎ−１フレームから検出された顔検出枠データの数だけループ処理Ｌ２を実行し、このループ処理Ｌ２では、ループ処理Ｌ１の処理対象となる顔検出枠データの位置・矩形サイズと、ループ処理Ｌ２の処理対象となる顔検出枠データの位置・矩形サイズが、数式１の評価関数ｆ１（）に代入して評価値を算出し（Ｓ１２０）、ループ処理Ｌ１の対象となる顔検出枠データとの位置・サイズの差を示す評価値が、Ｎ−１フレームから検出された顔検出枠データ毎に算出される。 In loop process L1, the tracking means 23 of the image processing apparatus 2 first executes loop process L2 once for each face detection frame data item detected in the N-1 frame. In loop process L2, the position and rectangular size of the face detection frame data being processed by loop process L1 and those of the face detection frame data being processed by loop process L2 are substituted into the evaluation function f1() of Equation 1 to calculate an evaluation value (S120). As a result, an evaluation value indicating the position and size difference from the face detection frame data being processed by loop process L1 is calculated for each face detection frame data item detected in the N-1 frame.

画像処理装置２のトラッキング手段２３は、ループ処理Ｌ１の処理対象となる顔検出枠データとの位置・サイズの差を示す評価値を、Ｎ−１フレームから検出された顔検出枠データ毎に算出すると、該評価値の最小値を検索し（Ｓ１２１）、該評価値の最小値と他の評価値との差分を算出した後（Ｓ１２２）、閾値以下の該差分値があるか判定する（Ｓ１２３）。 Once the evaluation value indicating the position and size difference from the face detection frame data being processed by loop process L1 has been calculated for each face detection frame data item detected in the N-1 frame, the tracking unit 23 of the image processing apparatus 2 searches for the minimum evaluation value (S121), calculates the differences between that minimum and the other evaluation values (S122), and determines whether any of those differences is equal to or less than a threshold (S123).

そして、画像処理装置２のトラッキング手段２３は、ループ処理Ｌ１の処理対象となる顔検出枠データとの位置・サイズの差を示す評価値の最小値と他の評価値との差分の中に、閾値以下の差分がある場合，画像処理装置２のトラッキング手段２３は、評価値が閾値以内である顔検出枠データ数だけループ処理Ｌ３を実行する。 Then, the tracking unit 23 of the image processing apparatus 2 includes the difference between the minimum value of the evaluation value indicating the position / size difference from the face detection frame data to be processed by the loop processing L1 and the other evaluation values. When there is a difference equal to or less than the threshold, the tracking unit 23 of the image processing apparatus 2 executes the loop process L3 for the number of face detection frame data whose evaluation value is within the threshold.

このループ処理Ｌ３では、ループ処理Ｌ１の処理対象となる顔検出枠データで特定される顔検出枠内の顔画像と、ループ処理Ｌ３の処理対象となるＮ−１フレームの顔検出枠データで特定される顔検出枠内の顔画像とのＳＵＲＦ特徴量の差が求められ、ＳＵＲＦ特徴量の差が数式１の評価関数ｆ２（）に代入され、ＳＵＲＦ特徴量の差を加算した評価値が、Ｎ−１フレームから検出された顔検出枠データ毎に算出される（Ｓ１２４）。 In loop process L3, the difference in SURF feature amount is obtained between the face image inside the face detection frame specified by the face detection frame data being processed by loop process L1 and the face image inside the face detection frame specified by the N-1 frame face detection frame data being processed by loop process L3. This difference is substituted into the evaluation function f2() of Equation 1, and an evaluation value with the SURF feature difference added is calculated for each face detection frame data item detected in the N-1 frame (S124).

数式１で示した評価関数ｆ２（）を用い、ＳＵＲＦ特徴量の差を加算した評価値を算出するのは、ニアレストネイバー法のみを利用した評価関数ｆ１（）を用いて求められた評価値の最小値と他の評価値との差分値に閾値以下がある場合、サイズの似た顔検出枠が近接していると考えられ（例えば，図８の人物７ａ，ｂ），ニアレストネイバー法の評価値からでは、Ｎフレームの顔検出枠データに対応付けるＮ−１フレームの顔検出枠データが判定できないからである。 The evaluation value with the SURF feature difference added is calculated using the evaluation function f2() of Equation 1 for the following reason. When a difference between the minimum evaluation value obtained with the evaluation function f1(), which uses only the nearest neighbor method, and another evaluation value is equal to or less than the threshold, face detection frames of similar size are considered to be close to each other (for example, persons 7a and 7b in FIG. 8), and the N-1 frame face detection frame data to associate with the N frame face detection frame data cannot be determined from the nearest neighbor evaluation values alone.

数式１で示した評価関数ｆ２（）を用い、ＳＵＲＦ特徴量の差を加算した評価値を算出することで、顔の特徴が加味された評価値が算出されるので、該評価値を用いることで、サイズの似た顔検出枠が近接している場合は、顔が似ているＮ−１フレームの顔検出枠データがＮフレームの顔検出枠データに対応付けられることになる。 By using the evaluation function f2 () expressed by Equation 1 and calculating an evaluation value obtained by adding the difference between the SURF feature amounts, an evaluation value in which the facial features are added is calculated. When face detection frames having similar sizes are close to each other, the N-1 frame face detection frame data having a similar face is associated with the N frame face detection frame data.

そして、画像処理装置２のトラッキング手段２３は、数式１の評価関数から得られた評価値が最小値であるＮ−１フレームの顔検出枠データを、ループ処理Ｌ１の対象となるＮフレームの顔検出枠データに対応付ける処理を実行する（Ｓ１２５）。なお、数式１で示した評価関数ｆ２（）を用いた評価値を算出していない場合、この処理で利用される評価値は、数式１で示した評価関数ｆ１（）から求められた値になり、数式１で示した評価関数ｆ２（）を用いた評価値を算出している場合、この処理で利用される評価値は、数式１で示した評価関数ｆ２（）から求められた値になる。 The tracking unit 23 of the image processing apparatus 2 then associates the N-1 frame face detection frame data whose evaluation value from the evaluation function of Equation 1 is the minimum with the N frame face detection frame data being processed by loop process L1 (S125). If no evaluation value has been calculated with the evaluation function f2() of Equation 1, the evaluation value used in this step is the value obtained from the evaluation function f1(); if an evaluation value has been calculated with f2(), the value obtained from f2() is used.

ループ処理Ｌ１が終了し、画像処理装置２のトラッキング手段２３は、Ｎフレームの顔検出枠データとＮ−１フレームの顔検出枠データを対応付けすると、Ｎ−１フレームの顔検出枠データが重複して、Ｎフレームの顔検出枠データに対応付けられていないか確認する（Ｓ１２６）。 When the loop processing L1 ends and the tracking means 23 of the image processing apparatus 2 associates the N frame face detection frame data with the N-1 frame face detection frame data, the N-1 frame face detection frame data overlaps. Then, it is confirmed whether it is associated with the face detection frame data of N frames (S126).

Ｎ−１フレームの顔検出枠データが重複して、Ｎフレームの顔検出枠データに対応付けられている場合、画像処理装置２のトラッキング手段２３は、重複して対応付けられているＮ−１フレームの顔検出枠データの評価値を参照し、評価値が小さい方を該Ｎフレームの顔検出枠データに対応付ける処理を再帰的に実行することで、最終的に、Ｎフレームの顔検出枠データに対応付けるＮ−１フレームの顔検出枠データを決定する（Ｓ１２７）。 When N-1 frame face detection frame data has been associated with more than one N frame face detection frame data item, the tracking unit 23 of the image processing apparatus 2 refers to the evaluation values of the duplicated N-1 frame face detection frame data and recursively executes a process that keeps the association with the smaller evaluation value. In this way it finally determines the N-1 frame face detection frame data to associate with each N frame face detection frame data item (S127).
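Steps S120 to S127 can be sketched as follows. Several details are assumptions: `f1` is taken to be a Euclidean distance, the SURF feature amount is represented by a plain vector under the key `"desc"`, and a frame-N face that loses a duplicate is retried against its next-best candidate and left unmatched (`None`) if none remains. The text only says that duplicates are resolved recursively in favour of the smaller evaluation value.

```python
import math


def f1(a, b):
    """Assumed distance between two equal-length numeric tuples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def score_candidates(cur, prev_faces, threshold, weight):
    """Evaluation values of one frame-N face against frame N-1 (S120-S124)."""
    scores = {j: f1(cur["box"], p["box"]) for j, p in enumerate(prev_faces)}
    if len(scores) < 2:
        return scores
    best = min(scores.values())                                # S121
    close = [j for j, s in scores.items() if s - best <= threshold]  # S122, S123
    if len(close) > 1:
        # Ambiguous: add the weighted feature difference (loop L3, S124).
        return {j: scores[j] + weight * f1(cur["desc"], prev_faces[j]["desc"])
                for j in close}
    return scores


def associate(cur_faces, prev_faces, threshold=5.0, weight=10.0):
    """Map each frame-N index to a frame-(N-1) index, or None if unmatched."""
    ranked = []
    for cur in cur_faces:                                      # loop L1
        scores = score_candidates(cur, prev_faces, threshold, weight)
        ranked.append(sorted(scores.items(), key=lambda kv: kv[1]))
    owner = {}                      # prev index -> (score, cur index)
    queue = list(range(len(cur_faces)))
    while queue:
        i = queue.pop(0)
        while ranked[i]:
            j, score = ranked[i].pop(0)                        # S125: best first
            if j not in owner:
                owner[j] = (score, i)
                break
            if score < owner[j][0]:                            # S126, S127
                queue.append(owner[j][1])   # the loser retries later
                owner[j] = (score, i)
                break
    result = {i: None for i in range(len(cur_faces))}
    for j, (_, i) in owner.items():
        result[i] = j
    return result
```

A frame-N face that ends up with `None` would correspond to the case in which no N-1 frame face detection frame data is associated, that is, a newly appearing face.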

ここから、図４で図示したフローの説明に戻る。トラッキング処理Ｓ４が終了すると、画像処理装置２の状態遷移管理手段２５によって、トラッキング処理Ｓ４から得られ、一つ前と今回の顔検出枠データの対応付け結果を参照し、事前に定めた状態遷移表に従い顔オブジェクトの状態を遷移させ、顔オブジェクトの状態遷移に応じたログを記憶する状態遷移管理処理Ｓ５が実行され、この状態遷移管理処理Ｓ５で所定の状態遷移があると、該状態遷移に対応した所定のログファイルがデータ記憶装置２ｄに記憶される。 We now return to the flow illustrated in FIG. 4. When the tracking process S4 ends, the state transition management unit 25 of the image processing apparatus 2 executes state transition management process S5, which refers to the result of associating the previous and current face detection frame data obtained from the tracking process S4, changes the state of each face object according to a predetermined state transition table, and stores a log corresponding to the state transition of the face object. When a predetermined state transition occurs in the state transition management process S5, a predetermined log file corresponding to that state transition is stored in the data storage device 2d.

画像処理装置２の状態遷移管理手段２５には、顔オブジェクトの状態遷移を管理するために、予め、顔オブジェクトの状態と該状態を状態遷移させるルールが定義された状態遷移表が定められており、画像処理装置２のトラッキング手段２３は、この状態遷移表を参照し、顔検出枠データ対応付け処理Ｓ１２の結果に基づき顔オブジェクトの状態を遷移させる。 In the state transition management unit 25 of the image processing apparatus 2, in order to manage the state transition of the face object, a state transition table in which a state of the face object and a rule for state transition are defined in advance is defined. The tracking unit 23 of the image processing device 2 refers to this state transition table and changes the state of the face object based on the result of the face detection frame data association processing S12.

ここから、状態遷移表の一例を例示し、該状態遷移表の説明をしながら、画像処理装置２の状態遷移管理手段２５によって実行される状態遷移管理処理Ｓ５について説明する。 From here, an example of the state transition table is illustrated, and the state transition management process S5 executed by the state transition management unit 25 of the image processing apparatus 2 will be described while explaining the state transition table.

図７は、本実施形態における状態遷移表６を説明する図である。図７で図示した状態遷移表６によって、顔オブジェクトの状態と、Ｎ−１フレームの状態からＮフレームの状態への遷移が定義され、状態遷移表６の縦軸はＮ−１フレームの状態で、横軸はＮフレームの状態で，縦軸と横軸が交差する箇所に状態遷移する条件が記述されている。なお、状態遷移表に「―」は不正な状態遷移を示している。 FIG. 7 is a diagram illustrating the state transition table 6 in the present embodiment. The state transition table 6 illustrated in FIG. 7 defines the state of the face object and the transition from the state of the N-1 frame to the state of the N frame. The vertical axis of the state transition table 6 indicates the state of the N-1 frame. The horizontal axis indicates the state of N frames, and the condition for state transition is described at a location where the vertical axis and the horizontal axis intersect. In the state transition table, “-” indicates an illegal state transition.

図７で図示した状態遷移表６には、顔オブジェクトの状態として、Ｎｏｎｅ、候補Ｆａｃｅ、現在Ｆａｃｅ、待機Ｆａｃｅ、ノイズＦａｃｅ及び終了Ｆａｃｅが定義され、状態遷移表で定義された状態遷移を説明しながら、それぞれの状態について説明する。 In the state transition table 6 illustrated in FIG. 7, None, candidate face, current face, standby face, noise face, and end face are defined as face object states, and the state transition defined in the state transition table is described. However, each state will be described.

顔オブジェクトの状態の一つであるＮｏｎｅとは、顔オブジェクトが存在しない状態を意味し、Ｎフレームの顔検出枠データに対応付けるＮ−１フレームの顔検出枠データが無い場合（図７の条件１）、画像処理装置２の状態遷移管理手段２５は、顔オブジェクトを識別するためのＩＤ、該Ｎフレームの顔検出データ、顔オブジェクトに付与された状態に係わるデータなどを属性値と有する顔オブジェクトを新規に生成し、該顔オブジェクトの状態を候補Ｆａｃｅに設定する。 None, one of the face object states, means that no face object exists. When there is no N-1 frame face detection frame data to associate with the N frame face detection frame data (condition 1 in FIG. 7), the state transition management unit 25 of the image processing apparatus 2 newly generates a face object having, as attribute values, an ID for identifying the face object, the face detection data of the N frame, data relating to the state assigned to the face object, and so on, and sets the state of that face object to candidate Face.

顔オブジェクトの状態の一つである候補Ｆａｃｅとは、新規に検出した顔画像がノイズである可能性がある状態を意味し、顔オブジェクトの状態の一つに候補Ｆａｃｅを設けているのは、複雑な背景の場合、背景除去処理を行っても顔画像の誤検出が発生し易く、新規に検出できた顔画像がノイズの可能性があるからである。 The candidate face that is one of the face object states means a state in which the newly detected face image may be noise, and the candidate face is provided as one of the face object states. This is because in the case of a complex background, erroneous detection of a face image is likely to occur even if background removal processing is performed, and the newly detected face image may be noise.

A face object in the candidate Face state is given, as data on that state, a state ID indicating the candidate Face state, the date and time of its transition to candidate Face, and a counter.

From candidate Face, a face object can transition to candidate Face, current Face, or noise Face. If the face detection frame corresponding to a face object in the candidate Face state is tracked a predetermined number of times in succession within a preset time (condition 2-2 in FIG. 7), the state of the face object transitions from candidate Face to current Face.

The counter in the attributes of a candidate Face object counts how many times in succession the face detection frame corresponding to that face object has been tracked within the set time. When the face object that contains the frame N-1 face detection data associated with face detection frame data of frame N is in the candidate Face state, the state transition management unit 25 of the image processing apparatus 2 updates the face detection frame data held by the face object to that of frame N and increments the face object's counter.

When executing the state transition management process S5, the state transition management unit 25 of the image processing apparatus 2 refers, for each face object in the candidate Face state, to the date and time of its transition to candidate Face. If the counter has reached the preset value within the set time, the unit transitions the face object to current Face. If the set time has elapsed at this point without the counter reaching the set value, the unit transitions the face object to noise Face (condition 2-3 in FIG. 7). A face object whose set time has not yet elapsed keeps its state (condition 2-1 in FIG. 7).
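The candidate Face check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `CandidateFace`, `resolve_candidate`, `set_time`, and `set_count` are hypothetical, times are plain seconds, and the sketch assumes the check runs every frame so that a counter that has reached the set value did so within the set time.

```python
from dataclasses import dataclass
from enum import Enum


class FaceState(Enum):
    CANDIDATE = "candidate Face"
    CURRENT = "current Face"
    NOISE = "noise Face"


@dataclass
class CandidateFace:
    entered_at: float  # time of the transition to candidate Face (seconds)
    counter: int = 0   # consecutive frames in which the face frame was tracked


def resolve_candidate(face: CandidateFace, now: float,
                      set_time: float, set_count: int) -> FaceState:
    """Apply conditions 2-1 to 2-3 of FIG. 7 to one candidate Face object."""
    if face.counter >= set_count:
        return FaceState.CURRENT    # condition 2-2: tracked often enough
    if now - face.entered_at >= set_time:
        return FaceState.NOISE      # condition 2-3: set time elapsed first
    return FaceState.CANDIDATE      # condition 2-1: keep waiting
```

For example, a face object tracked in only three frames is treated as noise once the set time has passed, which keeps it out of the viewing time log.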

Noise Face, one of the face object states, is the state in which a face image detected by the face detection unit 21 of the image processing apparatus 2 has been judged to be noise. A face object that has transitioned to noise Face is regarded as having disappeared and is not used in any subsequent state transition management process S5.

Current Face, one of the face object states, is the state in which the person corresponding to the face object is judged to be viewing the display 3. The time spent in the current Face state is the time that person has been viewing the display 3.

When the state transition management unit 25 of the image processing apparatus 2 transitions a face object from candidate Face to current Face, it updates the face object's face detection frame data to that of frame N and gives the face object, as data on current Face, a state ID indicating the current Face state and the date and time of the transition to current Face.

In addition, so that the person attributes (for example, age and gender) of a person viewing the display can be stored as a log, the state transition management unit 25 of the image processing apparatus 2 activates the person attribute estimation unit 26 whenever it transitions a face object to current Face. The person attributes are obtained from the face detection frame specified by the face detection frame data of that face object, and an attribute log file describing the object ID and person attributes of the face object is stored in the data storage device 2d.

The person attribute estimation unit 26 of the image processing apparatus 2 is not described in detail here. Automatically identifying a person's attributes (age and gender) from a face image is widely used, for example in cigarette vending machines, and the technique of Japanese Unexamined Patent Application Publication No. 2007-080057 can be used, for example.

Furthermore, when the state transition management unit 25 of the image processing apparatus 2 transitions a face object to current Face, it newly creates, in the data storage device 2d, a position log file for storing in time series the position of the person viewing the display 3. On creation, the position log file is given the object ID of the face object that transitioned to current Face and the face detection frame data contained in that face object.

From current Face, a face object can transition to current Face or standby Face. When the face object that contains the frame N-1 face detection data associated with face detection frame data of frame N is in the current Face state (condition 3-1), the state transition management unit 25 of the image processing apparatus 2 updates the face detection frame data held by the face object to that of frame N and appends that face detection frame data to the position log file specified by the face object's object ID.

Also, during the state transition management process S5, when a face object in the current Face state holds frame N-1 face detection frame data with which no face detection frame data of frame N was associated, the state transition management unit 25 of the image processing apparatus 2 activates the moving image analysis unit 24, which uses a video analysis technique to detect, in frame N, the face image corresponding to that frame N-1 face detection frame data.

In the present embodiment, the moving image analysis unit 24 of the image processing apparatus 2 first determines the occlusion state between the frame N-1 face detection frame data left without a counterpart in frame N and the frame N face detection frame data that has already been associated, and checks whether the target person's face is completely hidden.

The moving image analysis unit 24 of the image processing apparatus 2 executes the process of determining the occlusion state of a face object in accordance with Equation 2 for every face object that exists at this point and is in the current Face, candidate Face, or standby Face state.

After executing the process of determining the occlusion state of a face object in accordance with Equation 2, the moving image analysis unit 24 of the image processing apparatus 2 branches according to the result.

If the person being tracked is judged likely to be completely occluded (criterion 1 of Equation 2), tracking with a particle filter is performed to detect the position and rectangle size of the target face object. Particle filters are described in many publications, such as Takekazu Kato, "Particle Filter and Its Implementation" (in Japanese), IPSJ SIG Technical Report, CVIM-157, pp. 161-168 (2007).

If the person being tracked is judged likely to be half occluded (criterion 2 of Equation 2), tracking with the LK method (Lucas-Kanade algorithm) is performed to detect the position and rectangle size of the target face object. The LK method is described in Lucas, B.D. and Kanade, T., "An Iterative Image Registration Technique with an Application to Stereo Vision", Proc. DARPA Image Understanding Workshop, pp. 121-130, 1981.

If the person being tracked is judged likely not to be occluded (criterion 3 of Equation 3), the moving image analysis unit 24 of the image processing apparatus 2 performs tracking with the CamShift method to detect the position and rectangle size of the target face object. The CamShift method is described in G. R. Bradski, "Computer vision face tracking for use in a perceptual user interface", Intel Technology Journal, Q2, 1998.

If the target face image is detected in frame N by any of these methods, the state transition management unit 25 of the image processing apparatus 2 updates the face detection data of the face object in the current Face state to the detected position and rectangle size. If the target face image cannot be tracked by any of these methods, the unit transitions the face object from current Face to standby Face (condition 3-2 in FIG. 7).
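The branch on the occlusion result and the resulting state can be sketched as a small dispatcher. This is only an illustration under stated assumptions: the trackers are passed in as hypothetical callables that return a box `(x, y, width, height)` or `None`, and the names `Occlusion`, `TRACKER_FOR`, and `next_state` do not come from the patent. Equation 2 itself is not reproduced here.

```python
from enum import Enum


class Occlusion(Enum):
    FULL = 1     # criterion 1: face likely completely hidden
    PARTIAL = 2  # criterion 2: face likely half hidden
    NONE = 3     # criterion 3: face likely not occluded


# Tracking method chosen for each occlusion result, as in the description.
TRACKER_FOR = {
    Occlusion.FULL: "particle_filter",
    Occlusion.PARTIAL: "lucas_kanade",
    Occlusion.NONE: "camshift",
}


def next_state(occlusion, trackers, frame, last_box):
    """Return (state, box) for a current Face object missed by face detection."""
    tracker = trackers[TRACKER_FOR[occlusion]]
    box = tracker(frame, last_box)   # hypothetical callable: box or None
    if box is None:
        return "standby Face", last_box  # condition 3-2: box is not updated
    return "current Face", box           # stays current with the new box
```

Keeping the trackers behind a common callable interface lets each method be swapped or tested with a stub without touching the state logic.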

Standby Face, one of the face object states, is the state in which the face image corresponding to the face object can no longer be detected even with the moving image analysis unit 24 of the image processing apparatus 2.

When transitioning a face object to standby Face, the state transition management unit 25 of the image processing apparatus 2 does not update the face object's face detection frame data. As data on standby Face, it gives the face object a state ID indicating the standby Face state, the date and time at which the face object transitioned to current Face, and the date and time at which it transitioned to standby Face.

From standby Face, a face object can transition to current Face or end Face. The state transition management unit 25 of the image processing apparatus 2 searches for the face object that contains the face detection frame data of frame N; if that face object is in the standby Face state and the predetermined time since its transition to standby Face has not yet elapsed, the unit transitions it from standby Face to current Face (condition 4-1 in FIG. 7).

When transitioning a face object from standby Face to current Face, the state transition management unit 25 of the image processing apparatus 2 reuses, as the date and time at which the face object transitioned to current Face, the value that the face object held while in the standby Face state.

Also, when executing the process that manages face object state transitions, the tracking unit 23 of the image processing apparatus 2 transitions to end Face any face object for which the predetermined time has elapsed since its transition to standby Face (condition 4-3 in FIG. 7). A face object for which that time has not yet elapsed keeps its state (condition 4-2 in FIG. 7).
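Conditions 4-1 to 4-3 can be sketched as a single decision function. The function name `resolve_standby` and its parameters are hypothetical, times are plain seconds, and the sketch assumes the timeout is checked before re-detection so that an expired face object is never revived.

```python
def resolve_standby(standby_since: float, now: float,
                    timeout: float, redetected: bool) -> str:
    """Decide the next state of one standby Face object."""
    if now - standby_since >= timeout:
        return "end Face"      # condition 4-3: predetermined time elapsed
    if redetected:
        return "current Face"  # condition 4-1: same face found again in time
    return "standby Face"      # condition 4-2: keep waiting
```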

End Face, one of the face object states, corresponds to a person whom the image processing apparatus 2 can no longer detect. A face object whose state has become end Face is regarded as having disappeared and is not used in any subsequent state transition management process S5.

Before transitioning a face object to end Face, the state transition management unit 25 of the image processing apparatus 2 generates a viewing time log file and stores it in the data storage device 2d. The file describes the object ID of the face object, the viewing start time (the date and time at which the face object transitioned to current Face), and the viewing end time (the date and time at which it transitioned to standby Face).
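As an illustration of what one viewing time log entry could hold, the sketch below builds a record from the two timestamps kept on the face object. The dictionary layout, the field names, and the derived `viewing_seconds` value are assumptions for illustration; the patent does not specify the log file format.

```python
from datetime import datetime


def viewing_time_record(object_id: str,
                        became_current_at: datetime,
                        became_standby_at: datetime) -> dict:
    """Build one viewing time log entry for a face object about to end."""
    duration = (became_standby_at - became_current_at).total_seconds()
    return {
        "object_id": object_id,
        "viewing_start": became_current_at.isoformat(),  # entered current Face
        "viewing_end": became_standby_at.isoformat(),    # entered standby Face
        "viewing_seconds": duration,  # derived value, not named in the source
    }
```

Because the current Face timestamp is carried over when a face returns from standby Face, a brief loss of tracking does not split one viewing session into two records.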

As described in detail above, the image processing apparatus 2 defines, in the state transition table 6, five states for the face object generated for each face detected by the face detection unit 21: None, candidate Face, current Face, standby Face, noise Face, and end Face. By transitioning the state of each face object according to the tracking result of the corresponding face, the apparatus can store the viewing time of the display 3 as a log based on the face object's state transitions.

According to the above, while a face object is in the current Face state the corresponding face has been detected continuously, so the time spent in the current Face state is the viewing time of the display 3.

Defining candidate Face as a face object state means that a face erroneously detected because of noise no longer affects the viewing time of the display 3. Defining standby Face as a face object state means that a face that is lost and then detected again can be handled as the same face.

≪3. Compositing using scenario data≫
≪3.1. One target≫
FIG. 9 is a flowchart of the process by which the image processing apparatus 2 creates a composite image from the frames of the video sent from the video camera 4. When the image processing apparatus 2 is started and the scenario data to use is specified, the scenario data association unit 83 first reads the specified scenario data from the data storage device 2d (S21). It then interprets the scenario data and starts creating images in accordance with it (S22).

Next, the scenario data association unit 83 obtains the face object data generated by the state transition management unit 25 (S23). The face object data consists of an object ID, face detection frame data (position and rectangle size), and viewing time.

The scenario data association unit 83 then associates the face object data obtained from the state transition management unit 25 with the scenario data (S24). Specifically, it associates the object ID of the face detection frame data contained in the face object data with a target ID. When multiple pieces of face detection frame data are obtained from the state transition management unit 25, the one with the earliest date and time of transition to candidate Face is set to "0", and the rest are set to "1", "2", "3", and so on, increasing by one in order of that date and time. In the example of FIG. 10, only target ID "0" is set in the scenario data, so the scenario data association unit 83 targets the face detection frame data specified by the object ID associated with target ID "0".
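The numbering rule in S24 amounts to sorting the face objects by the time they entered candidate Face. A minimal sketch, assuming each face object is reduced to a hypothetical `(object_id, candidate_since)` pair:

```python
def assign_target_ids(face_objects):
    """Map target IDs 0, 1, 2, ... to object IDs, earliest candidate Face first.

    face_objects: iterable of (object_id, candidate_since) pairs, where
    candidate_since is the time of the transition to candidate Face.
    """
    ordered = sorted(face_objects, key=lambda pair: pair[1])
    return {target_id: object_id
            for target_id, (object_id, _) in enumerate(ordered)}
```

With this mapping, a scenario that only defines target ID 0 always follows the face that was detected first among those currently tracked.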

Next, the composite image creating unit 84 creates a composite image for display, frame by frame (S25). Specifically, it first sets the start point to time "0.0" and refers to <Animation Commands> in the scenario data at that time. As shown in FIG. 10, from start key "0.0" to end key "0.1" the key type is "own", the command type is "LayerMontage" (layer compositing), the target ID is "0", and the content ID is "0". The composite image creating unit 84 therefore creates a composite image by layer-compositing content ID "0" onto the face detection frame data of target ID "0".

The path of the content (compositing material) with content ID "0" can be identified by referring to <Simulation Contents> in the scenario data, so the content is obtained from the location in the data storage device 2d specified by that path. As described above, rectangle information is set in the content. The composite image creating unit 84 resizes the rectangle information and the content so that the rectangle information matches the rectangle size of the face detection frame data, and then layer-composites the resized content at the position where the resized rectangle coincides with the position of the face detection frame data. Specifically, the rectangle set in the content is resized to fit the rectangle of the face detection frame data set on the face image as shown in FIG. 12(b), the content is resized by the same ratio, and the two are composited so that the two rectangles coincide. As a result, when the content is a wig as in FIG. 12(a), for example, a composite image is obtained in which the wig appears to be fitted to the person's face (FIG. 12(c)). The composite image creating unit 84 displays the resulting composite image on the display 3, so the display 3 shows a composite image in which the frame of the captured video has been processed.
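The resizing and placement step is plain rectangle arithmetic, sketched below. The function name `fit_content` and the `(x, y, width, height)` tuple layout are assumptions for illustration; the actual pixel compositing is left out.

```python
def fit_content(content_size, content_rect, face_rect):
    """Scale content so its reference rectangle matches the face frame.

    content_size: (width, height) of the content image.
    content_rect: (x, y, width, height) of the rectangle set in the content.
    face_rect:    (x, y, width, height) of the face detection frame.
    Returns the resized content size and the top-left paste position
    in the video frame.
    """
    cw, ch = content_size
    rx, ry, rw, rh = content_rect
    fx, fy, fw, fh = face_rect
    sx, sy = fw / rw, fh / rh                  # scale ratio of the rectangle
    new_size = (round(cw * sx), round(ch * sy))
    # Shift the content so the scaled rectangle lands on the face frame.
    origin = (round(fx - rx * sx), round(fy - ry * sy))
    return new_size, origin
```

For a 200 x 300 wig image whose rectangle is `(50, 100, 100, 100)` and a face frame of `(320, 240, 50, 50)`, the content is halved to 100 x 150 and pasted at `(295, 190)`.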

After the composite image for one frame has been created, the scenario data association unit 83 determines whether the scenario is still in progress (S26). Specifically, it compares the time elapsed since image creation under the scenario data began with the <CycleInterval> tag in the scenario data. If the elapsed time is less than the scenario time, the scenario is judged to be in progress; if it is equal to or greater than the scenario time, the scenario is judged to have ended. If the scenario is in progress, the scenario data association unit 83 returns to S23 and obtains face object data.

In S24, the scenario data association unit 83 associates the next face object data obtained from the state transition management unit 25 with the scenario data. As in the first pass through the loop, the face object with the earliest date and time of transition to candidate Face is set to "0", and the rest are set to "1", "2", "3", and so on, increasing by one in order of that date and time. In accordance with the scenario data, the scenario data association unit 83 then targets the face detection frame data specified by the object ID associated with target ID "0".

Next, in S25, the composite image creating unit 84 creates a composite image frame by frame. Specifically, it obtains the elapsed time and refers to <Animation Commands> in the scenario data at that time. If the elapsed time is less than "0.1", it falls between start key "0.0" and end key "0.1" as shown in the scenario data of FIG. 10, so, as before, the key type is "own", the command type is "LayerMontage" (layer compositing), the target ID is "0", and the content ID is "0", and the composite image creating unit 84 creates a composite image by layer-compositing content ID "0" onto the face detection frame data of target ID "0". If the elapsed time exceeds "0.1", then, as shown in FIG. 10, from start key "0.1" to end key "0.3" the key type is "own", the command type is "LayerMontage" (layer compositing), the target ID is "0", and the content ID is "1", so the composite image creating unit 84 creates a composite image by layer-compositing content ID "1" onto the face detection frame data of target ID "0". In this way, the scenario data is executed repeatedly according to the elapsed time until the scenario is judged to have ended in S26.
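The lookup of the active command by elapsed time can be sketched as follows. Two assumptions are made loudly here: keys are treated as fractions of the cycle interval (the walkthrough of FIG. 13 pairs key "0.1" with 1 second, which would fit a 10-second cycle), and each command covers the half-open range from its start key up to but not including its end key. The `Command` class and `active_commands` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Command:
    start_key: float
    end_key: float
    target_id: int
    content_id: int


def active_commands(commands, elapsed, cycle_interval):
    """Return the commands whose key range contains the current time."""
    key = elapsed / cycle_interval  # assumption: keys are cycle fractions
    return [c for c in commands if c.start_key <= key < c.end_key]
```

Returning a list rather than a single command allows several contents, such as a balloon and a wig, to be composited in the same frame.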

If the scenario is judged to have ended in S26, the scenario data association unit 83 determines whether to repeat the process (S27). Specifically, it refers to the <IsAutoLoop> tag in the scenario data and, if "true" is set, decides to repeat. In that case, the scenario data association unit 83 resets the elapsed time to "0", starts measuring the elapsed time again, and returns to S22 to start creating images in accordance with the scenario data. By displaying the composite images obtained from successive video frames on the display in order, the result is shown as processed video.
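The control decisions in S26 and S27 can be summarized in a small function. The function name `next_step` and the returned labels are hypothetical, and the "stop" outcome for a scenario without auto-loop is an assumption, since the description only covers the case where `<IsAutoLoop>` is "true".

```python
def next_step(elapsed: float, cycle_interval: float, is_auto_loop: bool) -> str:
    """Decide what the frame loop does after one composite image."""
    if elapsed < cycle_interval:
        return "continue"  # S26: scenario in progress, go back to S23
    if is_auto_loop:
        return "restart"   # S27: reset elapsed time, go back to S22
    return "stop"          # assumption: no repeat requested
```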

FIG. 13 shows how the composite images obtained in accordance with the scenario data of FIG. 10 are displayed. First, a composite image in which the balloon "Hello" (「こんにちは」) is composited onto the frame is displayed (a). For every frame from the start of the scenario (StartKey="0.0") to an elapsed time of 1 second (EndKey="0.1"), a composite image containing the balloon "Hello" (ContentsID="0") is displayed. The location of the content in the content storage unit, "fukidashi001.jpg", is identified by referring to <SimulationContents> with ContentsID="0", and the content of the balloon "Hello" is obtained.

Next, at an elapsed time of 1 second (StartKey="0.1"), a composite image containing the balloon "Doesn't my hair look great today?" (「今日の髪素敵でしょ？」, ContentsID="1") is displayed (b). For every frame from an elapsed time of 1 second (StartKey="0.1") to an elapsed time of 3 seconds (EndKey="0.3"), a composite image containing that balloon is displayed. The location of the content in the content storage unit, "fukidashi002.jpg", is identified by referring to <SimulationContents> with ContentsID="1", and the content of the balloon is obtained.

At an elapsed time of 3 seconds (StartKey="0.3"), a composite image containing the balloon "Maybe I'll go a bit more casual" (「もう少しカジュアルにしようかな」, ContentsID="2") is displayed (c). The location of the content in the content storage unit, "fukidashi003.jpg", is identified by referring to <SimulationContents> with ContentsID="2", and the content of the balloon is obtained.

For every frame from an elapsed time of 3 seconds (StartKey="0.3") to an elapsed time of 6 seconds (EndKey="0.6"), a composite image containing the balloon "Maybe I'll go a bit more casual" is displayed. At an elapsed time of 6 seconds (StartKey="0.6"), a composite image containing a black wig (ContentsID="5") is displayed (d). The location of the content in the content storage unit, "afro.jpg", is identified by referring to <SimulationContents> with ContentsID="5", and the content of the black wig is obtained. For every frame from an elapsed time of 6 seconds (StartKey="0.6") to an elapsed time of 7 seconds (EndKey="0.7"), a composite image containing the black wig is displayed.

At an elapsed time of 7 seconds (StartKey="0.7"), a composite image containing the balloon "Maybe this one is better" (「こっちがいいかな」, ContentsID="3") and a blond wig (ContentsID="6") is displayed (e). The location of the content in the content storage unit, "blond.jpg", is identified by referring to <SimulationContents> with ContentsID="6", and the content of the blond wig is obtained. For every frame from an elapsed time of 7 seconds (StartKey="0.7") to an elapsed time of 9 seconds (EndKey="0.9"), a composite image containing the balloon "Maybe this one is better" and the blond wig is displayed. At an elapsed time of 9 seconds (StartKey="0.9"), a composite image containing the balloon "Well, what do you think?" (「ねっどうでしょう」, ContentsID="4") and the blond wig is displayed (f). The location of the content in the content storage unit, "fukidashi005.jpg", is identified by referring to <SimulationContents> with ContentsID="4", and the content of the balloon is obtained. For every frame from an elapsed time of 9 seconds (StartKey="0.9") to an elapsed time of 10 seconds (EndKey="1.0"), a composite image containing the black wig is displayed.

≪３．２．ターゲットが２人の場合≫
次に、ターゲットが２人の場合について説明する。図１１は、ターゲットが２人の場合のＸＭＬ形式のシナリオデータの一例を示す図である。図１０の例と大きく異なるのは、<Simulation Targets>タグで挟まれた<Human>タグ内に、ターゲットＩＤが２つ設定されている点である。図１１の例では、ターゲットＩＤ“０”と“１”が設定されている。<Animation Commands>タグ内では、各コマンドＩＤについて、１つのターゲットＩＤが設定される。図１１の例では、コマンドＩＤ“０”“２”“３”“５”については、ターゲットＩＤ“０”が設定され、コマンドＩＤ“１”“４”“６”については、ターゲットＩＤ“１”が設定されている。 << 3.2. When there are two targets >>
Next, the case of two targets is described. FIG. 11 shows an example of XML-format scenario data for two targets. The major difference from the example of FIG. 10 is that two target IDs are set in the <Human> tags enclosed by the <Simulation Targets> tags. In the example of FIG. 11, target IDs “0” and “1” are set. Within the <Animation Commands> tag, one target ID is set for each command ID. In the example of FIG. 11, target ID “0” is set for command IDs “0”, “2”, “3”, and “5”, and target ID “1” is set for command IDs “1”, “4”, and “6”.

ターゲットが２人の場合も、ターゲットが１人の場合と同様に、図９のフロー図に従って実行される。ターゲットが２人の場合は、Ｓ２３において、シナリオデータ対応付け手段８３が、状態遷移管理手段２５により生成された顔オブジェクトデータを複数取得するので、Ｓ２４において、取得したそれぞれの顔オブジェクトデータをシナリオデータに対応付ける。図１１の例では、シナリオデータには、ターゲットＩＤ“０”“１”の２つが設定されているので、シナリオデータ対応付け手段８３は、ターゲットＩＤ“０”“１”が対応付けられたオブジェクトＩＤで特定される顔検出枠データをターゲットとすることになる。これにより、ターゲットとする顔オブジェクトの顔検出枠がフレームで入れ替わるように移動したとしても、合成画像作成手段８４は、それぞれの顔検出枠に合わせてコンテンツを合成することができる。 The two-target case is also executed according to the flowchart of FIG. 9, in the same way as the one-target case. With two targets, the scenario data association unit 83 acquires a plurality of face object data items generated by the state transition management unit 25 in S23, and in S24 it associates each acquired face object data item with the scenario data. In the example of FIG. 11, two target IDs, “0” and “1”, are set in the scenario data, so the scenario data association unit 83 targets the face detection frame data identified by the object IDs associated with target IDs “0” and “1”. As a result, even if the face detection frames of the target face objects move so that they swap positions within the frame, the composite image creation unit 84 can composite the content in accordance with each face detection frame.

図１４は、図１１のシナリオデータに従って得られた合成画像の表示状態を示す図である。まず、フレームの左側にフキダシ「こんにちは」が合成された合成画像が表示され、経過時間１秒になると、フレームの右側にフキダシ「こんにちは」が合成された合成画像が表示される（ア）。経過時間３秒になると、金髪のカツラと、黒いカツラが合成された合成画像が表示される（図示省略）。経過時間４秒になると、フキダシ「今日のあなたの髪」と、金髪のカツラと、黒いカツラが合成された合成画像が表示される（イ）。経過時間６秒になると、金髪のカツラと、黒いカツラが合成された合成画像が表示される（図示省略）。経過時間７秒になると、フキダシ「今日のあなたの髪」と、金髪のカツラと、黒いカツラが合成された合成画像が表示される（ウ）。経過時間９秒になると、フキダシ「ありがとうあなたもね」と、金髪のカツラと、黒いカツラが合成された合成画像が表示される（エ）。 FIG. 14 shows the display states of the composite images obtained according to the scenario data of FIG. 11. First, a composite image with the balloon “Hello” combined on the left side of the frame is displayed, and at an elapsed time of 1 second, a composite image with the balloon “Hello” combined on the right side of the frame is displayed (A). At an elapsed time of 3 seconds, a composite image combined with the blond wig and the black wig is displayed (not shown). At an elapsed time of 4 seconds, a composite image combined with the balloon “Your hair today”, the blond wig, and the black wig is displayed (B). At an elapsed time of 6 seconds, a composite image combined with the blond wig and the black wig is displayed (not shown). At an elapsed time of 7 seconds, a composite image combined with the balloon “Your hair today”, the blond wig, and the black wig is displayed (C). At an elapsed time of 9 seconds, a composite image combined with the balloon “Thank you, you too”, the blond wig, and the black wig is displayed (D).

≪３．３．対話的表示を行う例≫
次に、仮想のタッチレスボタンを用いて対話的表示を行う場合について説明する。対話的表示を行うためには、対話的表示に対応するシナリオデータが必要となる。図１５は、タッチレスボタンに対応（ターゲット１人）の場合のＸＭＬ形式のシナリオデータの一例を示す図である。ここでは、図１５のシナリオデータを参照しながら、合成ターゲット定義手段８０、合成コンテンツ定義手段８１、アニメーションシナリオ定義手段８２の機能について説明する。合成ターゲット定義手段８０は、ターゲットＩＤ、タイプ、絶対座標、移り変わり設定の４つの属性を設定することによりターゲットを定義する。図１５の例では、１行目の<Simulation Targets>と、７行目の</Simulation Targets>の２つのタグで囲まれた範囲に対応する。タイプについては、人と場所の２種を設定可能であるが、図１５の例では、<Human>タグを用いて人についてのみ設定している。移り変わり設定は、人に対してのみ設定可能となっており、対応付け済みの人が消失した場合、新たな人に対応付けるかどうかを設定するものである。図１５の例では、３行目のタグで、ターゲットＩＤ、タイプ、移り変わり設定の３属性を設定しており、ターゲットＩＤは“０”、タイプは“human(人)”、移り変わり設定（IsTransfer）は“false(設定しない)”となっている。 << 3.3. Example of interactive display >>
Next, the case of interactive display using virtual touchless buttons is described. Interactive display requires scenario data that supports it. FIG. 15 shows an example of XML-format scenario data that supports touchless buttons (one target). The functions of the composite target definition unit 80, the composite content definition unit 81, and the animation scenario definition unit 82 are described here with reference to the scenario data of FIG. 15. The composite target definition unit 80 defines a target by setting four attributes: target ID, type, absolute coordinates, and transition setting. In the example of FIG. 15, this corresponds to the range enclosed by the <Simulation Targets> tag on the first line and the </Simulation Targets> tag on the seventh line. Two types, person and place, can be set, but in the example of FIG. 15 only a person is set, using the <Human> tag. The transition setting can be set only for a person and determines whether, when an associated person disappears, the target is re-associated with a new person. In the example of FIG. 15, the tag on the third line sets three attributes (target ID, type, and transition setting): the target ID is “0”, the type is “human”, and the transition setting (IsTransfer) is “false” (no re-association).

合成コンテンツ定義手段８１は、コンテンツＩＤ、コンテンツのパス、重なり設定の３つの属性を設定することによりコンテンツを定義する。図１５の例では、８行目の<Simulation Contents>と、１１行目の</Simulation Contents >の２つのタグで囲まれた範囲に対応する。図１５の例では、コンテンツＩＤ（ContentsID）が“０”と“１”の２つのコンテンツについて定義されている。図１５に示すように、各コンテンツについて１行単位で、コンテンツのパス（ContentsPath）、重なり設定（OverlapOrder）が設定される。 The composite content definition unit 81 defines the content by setting three attributes of a content ID, a content path, and an overlap setting. In the example of FIG. 15, it corresponds to a range surrounded by two tags, <Simulation Contents> on the eighth line and </ Simulation Contents> on the eleventh line. In the example of FIG. 15, two contents whose content ID (ContentsID) is “0” and “1” are defined. As shown in FIG. 15, a content path (ContentsPath) and an overlap setting (OverlapOrder) are set for each content in units of one line.

アニメーションシナリオ定義手段８２は、コマンドＩＤ、コマンドタイプ、セレクタＩＤ、開始キー、終了キー、キータイプ、ターゲットＩＤ、コンテンツＩＤの８つの属性を設定することによりアニメーションシナリオを定義する。図１５の例では、１２行目の<Animation Commands>と、１７行目の</Animation Commands>の２つのタグで囲まれた範囲に対応する。図１５の例では、コマンドＩＤ（CommandID）が“０” と“１”の２つのコマンドについて定義されている。図１５に示すように、各コマンドについて２行単位で、コマンドタイプ、セレクタＩＤ（SelectorID）、開始キー、終了キー、キータイプ、ターゲットＩＤ、コンテンツＩＤが設定される。コマンドタイプとは、どのようなフレームを基にどのようなタイプの効果を生じさせるかを示すものであり、レイヤ合成、αブレンド合成、音声再生開始、シーン合成が用意されている。このうち、レイヤ合成、αブレンド合成、シーン合成は、画像合成のタイプを示すものであり、レイヤ合成は、コンテンツを上書き合成するものであり、αブレンド合成は、設定されたα比率に応じてコンテンツとフレームを透過させて合成するものであり、シーン合成は、人体部分を切り抜き、背景の上に合成するものである。図１５の例では、コマンドタイプ（CommandType）として、レイヤ合成（LayerMontage）が設定されている。 The animation scenario definition means 82 defines an animation scenario by setting eight attributes of command ID, command type, selector ID, start key, end key, key type, target ID, and content ID. In the example of FIG. 15, it corresponds to a range surrounded by two tags, <Animation Commands> on the 12th line and </ Animation Commands> on the 17th line. In the example of FIG. 15, two commands with command IDs (CommandID) “0” and “1” are defined. As shown in FIG. 15, for each command, a command type, a selector ID (SelectorID), a start key, an end key, a key type, a target ID, and a content ID are set in units of two lines. The command type indicates what type of effect is generated based on what frame, and layer synthesis, α blend synthesis, audio reproduction start, and scene synthesis are prepared. Of these, layer composition, α blend composition, and scene composition indicate the type of image composition, layer composition overwrites content, and α blend composition is performed according to the set α ratio. The content and the frame are transmitted and combined, and the scene combination is a method in which a human body part is cut out and combined on the background. In the example of FIG. 15, layer composition (LayerMontage) is set as the command type (CommandType).
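The difference between layer composition and α-blend composition can be illustrated per pixel. The following is only a minimal sketch: the function names are hypothetical, and the weighting formula is the conventional alpha-blend equation, which the description does not spell out.

```python
def layer_montage(frame_px, content_px):
    """Layer composition: the content pixel overwrites the frame pixel."""
    return content_px


def alpha_blend(frame_px, content_px, alpha):
    """Alpha-blend one RGB pixel.

    alpha is the content's weight (assumed to lie in [0, 1]):
    1.0 shows only the content, 0.0 shows only the camera frame.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return tuple(
        round(alpha * c + (1.0 - alpha) * f)
        for f, c in zip(frame_px, content_px)
    )


# A half-transparent content pixel over a black frame pixel.
blended = alpha_blend((0, 0, 0), (200, 100, 50), 0.5)
```

Scene composition is not covered by this sketch because it additionally needs a person/background segmentation mask.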

セレクタＩＤは、実行するコマンドを特定するための条件であり、そのセレクタＩＤに対応するコマンドＩＤのコマンドが実行される。セレクタにより特定された１以上のコマンドの集合が１つのシナリオを構成することになる。したがって、セレクタとシナリオは１対１で対応している。図１５の例では、１つのシナリオが１つのコマンドで構成されており、セレクタＩＤ＝０の場合に、コマンドＩＤ＝０のコマンドが実行され、セレクタＩＤ＝１の場合に、コマンドＩＤ＝１のコマンドが実行されることを示している。開始キー、終了キーは、図１０の例と同様、各コマンドの開始時点、終了時点を設定するものである。本実施形態では、シナリオデータの時間を、シナリオ開始時を“０．０”、シナリオ終了時を“１．０”として管理している。図１５の例では、図１０の例と異なり、全てのコマンドの開始キー（StartKey）は“０．０”、最後に終了するコマンドの終了キー（EndKey）は“１．０”となる。キータイプとは、開始キー、終了キーの基準とする対象を設定するものであり、own、base、globalの３つが用意されている。ownは各ターゲットＩＤに対応する顔オブジェクトの閲覧時間を基準とし、baseはターゲットＩＤ＝０に対応する顔オブジェクトの閲覧時間を基準とし、globalは撮影映像の最初のフレームを取得した時間を基準とする。図１５の例では、キータイプ（KeyType）として、ownが設定されているので、顔オブジェクトがフレームに登場した時点（顔オブジェクトが“閲覧開始”と判断された時点）を“０．０”として、開始キー、終了キーが認識されることになる。 The selector ID is a condition for specifying a command to be executed, and a command with a command ID corresponding to the selector ID is executed. A set of one or more commands specified by the selector constitutes one scenario. Therefore, there is a one-to-one correspondence between selectors and scenarios. In the example of FIG. 15, one scenario is composed of one command. When the selector ID = 0, the command with the command ID = 0 is executed, and when the selector ID = 1, the command ID = 1. Indicates that the command will be executed. The start key and end key are used to set the start time and end time of each command, as in the example of FIG. In the present embodiment, the scenario data time is managed as “0.0” at the start of the scenario and “1.0” at the end of the scenario. In the example of FIG. 15, unlike the example of FIG. 10, the start key (StartKey) of all commands is “0.0”, and the end key (EndKey) of the last command is “1.0”. The key type is to set a target as a reference for the start key and the end key, and three types of own, base, and global are prepared. 
own uses the viewing time of the face object corresponding to each target ID as the reference, base uses the viewing time of the face object corresponding to target ID = 0, and global uses the time at which the first frame of the captured video was acquired. In the example of FIG. 15, the key type (KeyType) is set to own, so the start key and end key are interpreted with “0.0” being the point at which the face object appeared in the frame (the point at which the face object was judged to have started viewing).
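The three key types differ only in which moment is treated as key “0.0”. As a rough sketch, with a hypothetical function name and a hypothetical dictionary holding the viewing-start times, the choice could be expressed as:

```python
def reference_time(key_type, target_id, viewing_start, first_frame_time):
    """Return the time that corresponds to key 0.0 for a command.

    viewing_start: dict mapping target ID -> time (seconds) at which that
        face object was judged to have started viewing (assumed structure).
    first_frame_time: time at which the first captured frame was acquired.
    """
    if key_type == "own":
        return viewing_start[target_id]
    if key_type == "base":
        return viewing_start[0]  # always follows target ID 0
    if key_type == "global":
        return first_frame_time
    raise ValueError(f"unknown KeyType: {key_type}")


starts = {0: 12.0, 1: 15.5}
own_ref = reference_time("own", 1, starts, 10.0)
```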

図１５の例では、図１０に示した<CycleInterval>タグ、<IsAutoLoop>タグが、２行目の< Human >と、４行目の</ Human >の２つのタグで囲まれた３行目のターゲットＩＤタグ内に含まれている。<CycleInterval>タグ、<IsAutoLoop>タグがターゲットＩＤタグ内に含まれている場合でも、図１０の例と同様、シナリオデータの<CycleInterval>タグは、シナリオの開始から終了までの時間を秒単位で設定するものであり、シナリオデータの<IsAutoLoop>タグは、ループ処理（繰り返し処理）を行うかどうかを設定するものである。画像表示システム１は、シナリオデータに複数通りの記述形式を許容しており、図１０、図１５に示したように<CycleInterval>タグ、<IsAutoLoop>タグの記述位置が異なる場合であっても、シナリオデータ対応付け手段８３は、<CycleInterval>タグ、<IsAutoLoop>タグを認識し、記述内容に従った処理を実行する。図１５の例では、ターゲットＩＤタグ内でCycleInterval＝“１０”に設定されているので、シナリオの開始から終了まで１０秒であることを示している。開始キー、終了キーの値を１０倍した実時間でシナリオは管理されることになる。図１５の例では、ターゲットＩＤタグ内でIsAutoLoop＝“true”に設定されているので、ループ処理を行うことを示している。このようにして、合成ターゲット定義手段８０、合成コンテンツ定義手段８１、アニメーションシナリオ定義手段８２により作成されたシナリオデータは、シナリオデータ記憶手段としてのデータ記憶装置２ｄに格納される。 In the example of FIG. 15, the <CycleInterval> and <IsAutoLoop> tags shown in FIG. 10 are included in the target ID tag on the third line, which is enclosed by the <Human> tag on the second line and the </Human> tag on the fourth line. Even when they are placed inside the target ID tag, as in the example of FIG. 10, the <CycleInterval> tag sets the time from the start to the end of the scenario in seconds, and the <IsAutoLoop> tag sets whether loop processing (repetition) is performed. The image display system 1 allows several description formats for scenario data; even when the <CycleInterval> and <IsAutoLoop> tags are written in different positions, as in FIGS. 10 and 15, the scenario data association unit 83 recognizes them and executes processing according to their content. In the example of FIG. 15, CycleInterval = “10” is set in the target ID tag, so the scenario lasts 10 seconds from start to end, and the scenario is managed in real time obtained by multiplying the start key and end key values by 10. In the example of FIG. 15, IsAutoLoop = “true” is set in the target ID tag, so loop processing is performed. The scenario data created in this way by the composite target definition unit 80, the composite content definition unit 81, and the animation scenario definition unit 82 is stored in the data storage device 2d, which serves as the scenario data storage means.
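The relation between normalized keys and real time can be sketched as follows. This is an illustration under assumptions: the `Command` class and function names are hypothetical, the sample commands are made up, and wrapping the elapsed time with a modulo is only one plausible way to realize IsAutoLoop.

```python
from dataclasses import dataclass


@dataclass
class Command:
    command_id: int
    start_key: float  # normalized: 0.0 = scenario start
    end_key: float    # normalized: 1.0 = scenario end
    contents_id: int


def key_to_seconds(key, cycle_interval):
    """Scale a normalized key by CycleInterval (seconds)."""
    return key * cycle_interval


def active_commands(commands, elapsed, cycle_interval, auto_loop):
    """Return the commands whose [StartKey, EndKey) span covers `elapsed`."""
    if auto_loop:
        elapsed = elapsed % cycle_interval  # assumed loop behavior
    elif elapsed >= cycle_interval:
        return []  # scenario has ended
    return [
        c for c in commands
        if key_to_seconds(c.start_key, cycle_interval)
        <= elapsed
        < key_to_seconds(c.end_key, cycle_interval)
    ]


# Illustrative commands only, not the contents of FIG. 15.
commands = [Command(0, 0.6, 0.7, 5), Command(1, 0.7, 0.9, 3)]
at_6_5 = [c.contents_id for c in active_commands(commands, 6.5, 10, False)]
```

With CycleInterval = 10, a key of 0.6 corresponds to 6 seconds, so the first command is the one active at 6.5 seconds.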

さらに、アニメーションシナリオ定義手段８２は、セレクタＩＤ、セレクトタイプ、セレクトナンバー、トリガＩＤの４つの属性を設定することによりアニメーションセレクタを定義する。図１５の例では、１８行目の<AnimationSelectors SelectorsNum = "2" InitialSerectorID="0">と、２１行目の</AnimationSelectors>の２つのタグで囲まれた範囲に対応する。図１５の例では、セレクタＩＤ（SelectorID）が“０” と“１”の２つのセレクタについて定義されている。図１５に示すように、各セレクタについて１行単位で、セレクタＩＤ（SelectorID）、セレクトタイプ（SelectType）、セレクトナンバー（SelectNum）、トリガＩＤ（TriggerID）が設定される。セレクタＩＤはセレクタを特定するＩＤである。セレクトタイプ（SelectType）は、セレクタＩＤの特定方法を定めるものであり、selectnum、incrementnum、decrementnum、returnnumの４つのタイプが存在する。selectnumは、直接指定されたセレクタＩＤを選択する。incrementnumは、選択中のセレクタＩＤにセレクトナンバー（SelectNum）を加算して得られるセレクタＩＤを選択する。decrementnumは、選択中のセレクタＩＤにセレクトナンバー（SelectNum）を減算して得られるセレクタＩＤを選択する。returnnumは、トリガＩＤに対応するセレクタＩＤを選択する。セレクトナンバー（SelectNum）は、incrementnum、decrementnumに用いられる番号である。トリガＩＤは、所定のイベントが発生した場合に、そのイベントに対応して発生するＩＤであり、図１５の例では、タッチレスボタンが指示された場合に発生する。 Furthermore, the animation scenario definition means 82 defines an animation selector by setting four attributes: selector ID, select type, select number, and trigger ID. In the example of FIG. 15, it corresponds to a range surrounded by two tags of <AnimationSelectors SelectorsNum = “2” InitialSerectorID = “0”> on the 18th line and </ AnimationSelectors> on the 21st line. In the example of FIG. 15, two selectors having selector IDs (Selector ID) “0” and “1” are defined. As shown in FIG. 15, a selector ID (SelectorID), a select type (SelectType), a select number (SelectNum), and a trigger ID (TriggerID) are set for each selector in units of one line. The selector ID is an ID that identifies the selector. The select type (SelectType) defines a method for specifying the selector ID, and there are four types: selectnum, incrementnum, decrementnum, and returnnum. selectnum selects the directly specified selector ID. The incrementnum selects a selector ID obtained by adding a select number (SelectNum) to the currently selected selector ID. Decrementnum selects the selector ID obtained by subtracting the select number (SelectNum) from the currently selected selector ID. 
returnnum selects the selector ID corresponding to the trigger ID. The select number (SelectNum) is a number used for incrementnum and decrementnum. The trigger ID is an ID that is generated in response to a predetermined event. In the example of FIG. 15, the trigger ID is generated when a touchless button is designated.
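The four select types can be read as four rules for computing the next selector ID. The sketch below is hypothetical in several respects, each marked in the comments: the function name and parameters, the treatment of selectnum as choosing the selector's own ID, the mapping of a trigger ID to the selector with the same number, and the wrap-around by SelectorsNum.

```python
def next_selector_id(select_type, current_id, own_id, select_num,
                     trigger_id, selectors_num):
    """Compute the selector ID chosen when a selector entry fires.

    current_id: selector ID currently selected
    own_id: SelectorID of the entry being evaluated
    """
    if select_type == "selectnum":
        chosen = own_id                    # assumption: "directly specified" ID
    elif select_type == "incrementnum":
        chosen = current_id + select_num
    elif select_type == "decrementnum":
        chosen = current_id - select_num
    elif select_type == "returnnum":
        chosen = trigger_id                # assumption: trigger n -> selector n
    else:
        raise ValueError(f"unknown SelectType: {select_type}")
    return chosen % selectors_num          # assumption: wrap within SelectorsNum


# With two selectors, incrementing by 1 toggles between 0 and 1.
toggled = next_selector_id("incrementnum", 1, 0, 1, 0, 2)
```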

さらに、アニメーションシナリオ定義手段８２は、タッチレスボタンＩＤ、ボタンイメージパス、マスクファイルパス、ボタン位置、ボタンサイズ、連続時間しきい値、マスク割合しきい値の７つの属性を設定することによりアニメーショントリガを定義する。図１５の例では、２２行目の<AnimationTriggers >と、３１行目の</AnimationTriggers >の２つのタグで囲まれた範囲に対応する。図１５の例では、タッチレスボタンＩＤ（TouchlessButton ID）が“０” と“１”の２つのボタンについて定義されている。図１５に示すように、各ボタンについて４行単位で、タッチレスボタンＩＤ（TouchlessButton ID）、ボタンイメージパス（ButtonImagePath）、マスクファイルパス（MaskFilePath）、ボタン位置（PositionX, PositionY）、ボタンサイズ（ButtonWidth,ButtonHeight）、連続時間しきい値（Thtime）、マスク割合しきい値（Thmask）が設定される。タッチレスボタンＩＤはタッチレスボタンを特定するＩＤである。このタッチレスボタンＩＤはトリガＩＤと一対一で対応しており、タッチレスボタンＩＤが決定すると、トリガＩＤも同じ値に設定される。ボタンイメージパスは、ディスプレイ３に表示させるタッチレスボタンの画像を記録した位置を特定するパスである。マスクファイルパスは、ボタン形状の角を丸めるためにマスクするためのマスク画像を記録した位置を特定するパスである。ボタン位置は、タッチレスボタンを表示させる画面上のｘｙ座標である。なお、後述するボタンの指示判定において、ボタン領域を設定する場合は、画像の左右を反転させる必要があるため、それに応じてｘ座標を変換して用いる。ボタンサイズは、タッチレスボタンの幅と高さである。連続時間しきい値は、タッチレスボタンが押され続けたと判断するための時間のしきい値である。マスク割合しきい値は、タッチレスボタンが押されていると判断するための面積比率のしきい値である。 Furthermore, the animation scenario definition unit 82 defines animation triggers by setting seven attributes: touchless button ID, button image path, mask file path, button position, button size, continuous-time threshold, and mask-ratio threshold. In the example of FIG. 15, this corresponds to the range enclosed by the <AnimationTriggers> tag on the 22nd line and the </AnimationTriggers> tag on the 31st line. In the example of FIG. 15, two buttons with touchless button IDs (TouchlessButton ID) “0” and “1” are defined. As shown in FIG. 15, for each button the touchless button ID (TouchlessButton ID), button image path (ButtonImagePath), mask file path (MaskFilePath), button position (PositionX, PositionY), button size (ButtonWidth, ButtonHeight), continuous-time threshold (Thtime), and mask-ratio threshold (Thmask) are set over four lines. The touchless button ID identifies a touchless button. It corresponds one-to-one with the trigger ID. 
When the touchless button ID is determined, the trigger ID is also set to the same value. The button image path is a path for specifying a position where an image of a touchless button to be displayed on the display 3 is recorded. The mask file path is a path for specifying a position where a mask image for masking to round the corners of the button shape is recorded. The button position is an xy coordinate on the screen on which the touchless button is displayed. In the button instruction determination described later, when a button area is set, since it is necessary to invert the left and right of the image, the x coordinate is converted and used accordingly. The button size is the width and height of the touchless button. The continuous time threshold value is a threshold value of time for determining that the touchless button is continuously pressed. The mask ratio threshold is an area ratio threshold for determining that the touchless button is pressed.
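The description states that the x coordinate is converted because the image is flipped horizontally, but it does not give the conversion itself. A common way to mirror a rectangle, shown here purely as an assumed sketch with a hypothetical function name, is to reflect its left edge about the frame width:

```python
def mirrored_button_rect(position_x, position_y, button_width,
                         button_height, frame_width):
    """Map a button rectangle from display coordinates to the coordinates
    of the horizontally flipped image (assumed conversion).

    Returns (x, y, width, height) with only x changed.
    """
    mirrored_x = frame_width - position_x - button_width
    return (mirrored_x, position_y, button_width, button_height)


# An 80x40 button at x=100 in a 640-pixel-wide frame.
rect = mirrored_button_rect(100, 50, 80, 40, 640)
```

Applying the same conversion twice returns the original rectangle, which is a quick sanity check for a mirror transform.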

図９のフロー図を用いて、図１５に示したシナリオデータを指定した場合について説明する。画像処理装置２を起動し、図１５に示したシナリオデータを使用するシナリオデータとして指定すると、まず、シナリオデータ対応付け手段８３が、指定されたシナリオデータをデータ記憶装置２ｄから読み込む（Ｓ２１）。そして、シナリオデータ対応付け手段８３は、シナリオデータを解釈し、シナリオデータに従った画像の作成を開始する（Ｓ２２）。 The case where the scenario data shown in FIG. 15 is designated is described using the flowchart of FIG. 9. When the image processing device 2 is started and the scenario data shown in FIG. 15 is designated as the scenario data to be used, the scenario data association unit 83 first reads the designated scenario data from the data storage device 2d (S21). The scenario data association unit 83 then interprets the scenario data and starts creating images according to it (S22).

画像の作成を開始と同時に、画像処理装置２は、ディスプレイ３にタッチレスボタンのボタン画像を表示する。閲覧者はビデオカメラ４により撮影され、常に撮影映像が取得されており、閲覧者がボタン画像に触れようとした際にも、その映像は撮影されることになる。 Simultaneously with the start of image creation, the image processing apparatus 2 displays the button image of the touchless button on the display 3. The viewer is photographed by the video camera 4, and the photographed image is always acquired. When the viewer tries to touch the button image, the image is photographed.

画像の作成を開始したら、シナリオデータ対応付け手段８３は、状態遷移管理手段２５により生成された顔オブジェクトデータを取得する（Ｓ２３）。顔オブジェクトデータは、オブジェクトＩＤ、顔検出枠データ（位置、矩形サイズ）、閲覧時間で構成される。 When the creation of the image is started, the scenario data association unit 83 acquires the face object data generated by the state transition management unit 25 (S23). The face object data includes an object ID, face detection frame data (position, rectangular size), and browsing time.

続いて、シナリオデータ対応付け手段８３は、状態遷移管理手段２５から取得した顔オブジェクトデータをシナリオデータに対応付ける処理を行う（Ｓ２４）。具体的には、顔オブジェクトデータに含まれる顔検出枠データのオブジェクトＩＤとターゲットＩＤを対応付ける。状態遷移管理手段２５から複数の顔検出枠データを取得した場合は、候補Ｆａｃｅへ状態遷移したときの日時が最も早いものを“０”に設定し、以降、候補Ｆａｃｅへ状態遷移したときの日時が早い順に“１””２” ”３”と数を１ずつ増加させながら設定していく。図１５の例では、シナリオデータには、ターゲットＩＤ“０”の１つだけ設定されているので、シナリオデータ対応付け手段８３は、ターゲットＩＤ“０”が対応付けられたオブジェクトＩＤで特定される顔検出枠データをターゲットとすることになる。 Subsequently, the scenario data association unit 83 associates the face object data acquired from the state transition management unit 25 with the scenario data (S24). Specifically, it associates the object ID of the face detection frame data included in the face object data with a target ID. When a plurality of face detection frame data items are acquired from the state transition management unit 25, the one with the earliest date and time of transition to the candidate Face state is assigned “0”, and the rest are assigned “1”, “2”, “3”, and so on, incrementing by one in order of the date and time of transition to the candidate Face state. In the example of FIG. 15, only one target ID, “0”, is set in the scenario data, so the scenario data association unit 83 targets the face detection frame data identified by the object ID associated with target ID “0”.
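The assignment rule in S24 (earliest transition to the candidate Face state receives target ID 0, the next receives 1, and so on) amounts to a sort followed by enumeration. A minimal sketch, assuming a hypothetical representation of each face object as an `(object_id, transition_time)` pair:

```python
def assign_target_ids(face_objects, target_count):
    """Map target IDs to object IDs in order of transition time.

    face_objects: list of (object_id, transition_time) pairs, where
        transition_time is when the object entered the candidate Face state.
    target_count: number of target IDs defined in the scenario data.
    Returns a dict {target_id: object_id}.
    """
    ordered = sorted(face_objects, key=lambda obj: obj[1])
    return {
        target_id: object_id
        for target_id, (object_id, _) in enumerate(ordered[:target_count])
    }


# Three detected faces, scenario with two targets.
faces = [(17, 5.2), (9, 3.1), (23, 4.0)]
mapping = assign_target_ids(faces, 2)
```

Because the mapping is keyed on object IDs rather than positions, the association survives the faces swapping places in the frame.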

次に、合成画像作成手段８４が、フレーム単位で表示用の合成画像を作成する処理を行う（Ｓ２５）。具体的には、まず、開始時点を時刻“０．０”と設定し、この時刻“０．０”で、シナリオデータの<Animation Commands>を参照する。図１５に示すように、<Animation Commands>においては、コマンドＩＤが“０”と“１”の２つのコマンドが存在し、いずれも、開始キー“０．０”から終了キー“１．０”までは、キータイプ“own”、コマンドタイプ“LayerMontage(レイヤ合成)”、ターゲットＩＤ“０”となっており、コンテンツＩＤのみが“０” と“１”で異なっている。上述のように、コマンドＩＤは、セレクタＩＤにより決定されるが、図１５の１８行目に示すように、初期セレクタＩＤ（InitialSerectorID）は“０”に設定されているため、最初は、合成画像作成手段８４は、ターゲットＩＤ“０”の顔検出枠データに、コンテンツＩＤ“０”のコンテンツをレイヤ合成することにより、合成画像を作成することになる。 Next, the composite image creation unit 84 creates a composite image for display frame by frame (S25). Specifically, it first sets the start point to time “0.0” and refers to <Animation Commands> in the scenario data at this time “0.0”. As shown in FIG. 15, <Animation Commands> contains two commands with command IDs “0” and “1”. Both run from start key “0.0” to end key “1.0” with key type “own”, command type “LayerMontage” (layer composition), and target ID “0”; they differ only in the content ID, which is “0” and “1” respectively. As described above, the command ID is determined by the selector ID. Because the initial selector ID (InitialSerectorID) is set to “0”, as shown on the 18th line of FIG. 15, the composite image creation unit 84 initially creates the composite image by layer-compositing the content with content ID “0” onto the face detection frame data of target ID “0”.

１つのフレームについて合成画像の作成を終えたら、シナリオデータ対応付け手段８３は、シナリオ中であるかどうかを判断する（Ｓ２６）。具体的には、シナリオデータに従った画像作成開始からの経過時間でシナリオデータ内の<Target ID>タグ内の“CycleInterval”を参照し、経過時間がシナリオ時間未満である場合は、シナリオ中であると判断し、経過時間がシナリオ時間以上である場合は、シナリオ終了であると判断する。シナリオ中であると判断した場合には、シナリオデータ対応付け手段８３は、Ｓ２３に戻って、顔オブジェクトデータを取得する。 When the composite image for one frame has been created, the scenario data association unit 83 determines whether the scenario is still in progress (S26). Specifically, it compares the elapsed time since image creation started under the scenario data with “CycleInterval” in the <Target ID> tag of the scenario data. If the elapsed time is less than the scenario time, it determines that the scenario is in progress; if the elapsed time is equal to or greater than the scenario time, it determines that the scenario has ended. When it determines that the scenario is in progress, the scenario data association unit 83 returns to S23 and acquires face object data.

次に、Ｓ２５において、合成画像作成手段８４が、フレーム単位で合成画像を作成する処理を行う。具体的には、経過時間を取得し、取得した経過時間で、シナリオデータの<Animation Commands>を参照する。このとき、セレクタＩＤが“０”であれば、コンテンツＩＤ“０”のコンテンツを、セレクタＩＤが“１”であれば、コンテンツＩＤ“１” のコンテンツを、ターゲットＩＤ“０”の顔検出枠データにレイヤ合成することにより合成画像を作成する。このようにして、Ｓ２６においてシナリオ終了であると判断されるまでは、経過時間に従い、シナリオデータを実行する処理を繰り返し行う。 Next, in S25, the composite image creation unit 84 creates a composite image frame by frame. Specifically, it acquires the elapsed time and refers to <Animation Commands> in the scenario data using that elapsed time. If the selector ID is “0”, the content with content ID “0” is layer-composited onto the face detection frame data of target ID “0”; if the selector ID is “1”, the content with content ID “1” is layer-composited instead. In this way, the scenario data is executed repeatedly according to the elapsed time until S26 determines that the scenario has ended.

一方、Ｓ２２〜Ｓ２５の処理と並行して、指示判定手段８５は、タッチボタンが指示されたかどうかを常に判断する（Ｓ２８）。具体的には、撮影映像のフレームに対して画像処理を行うことにより、判断を行う。図１７は、指示判定手段８５の画像処理により得られる画像を示す図である。図１７においては、白い部分が画素値“１”の画素、黒い部分が画素値“０”の画素を示している。まず、指示判定手段８５は、図４のＳ１において、背景除去手段２０により処理された背景除去画像を取得する。そして、多値である背景除去画像の各画素値を所定のしきい値で二値化する。図１７（ａ）に、二値化された背景除去画像を示す。続いて、指示判定手段８５は、入力フレーム画像に対してＨＳＶ変換を行い、所定のしきい値で二値化し、二値化されたＨＳＶ画像を得る。図１７（ｂ）に、二値化されたＨＳＶ画像を示す。このＨＳＶ変換により肌色等の色味部分を抽出する。そして、二値化された背景除去画像と二値化されたＨＳＶ画像のＡｎｄ演算処理を行う。そして、人物を抽出した人物画像を得る。図１７（ｃ）に、人物画像を示す。 Meanwhile, in parallel with the processing of S22 to S25, the instruction determination unit 85 continuously determines whether a touchless button has been instructed (S28). Specifically, it makes this determination by performing image processing on the frames of the captured video. FIG. 17 shows the images obtained by the image processing of the instruction determination unit 85. In FIG. 17, white portions are pixels with pixel value “1” and black portions are pixels with pixel value “0”. First, the instruction determination unit 85 acquires the background-removed image produced by the background removal unit 20 in S1 of FIG. 4, and binarizes each pixel value of this multi-valued image with a predetermined threshold. FIG. 17(a) shows the binarized background-removed image. Next, the instruction determination unit 85 applies HSV conversion to the input frame image and binarizes the result with a predetermined threshold to obtain a binarized HSV image. FIG. 17(b) shows the binarized HSV image. The HSV conversion extracts colored portions such as skin tones. The unit then takes the logical AND of the binarized background-removed image and the binarized HSV image to obtain a person image in which the person is extracted. FIG. 17(c) shows the person image.

次に、指示判定手段８５は、図４のＳ２において、顔検出手段２１により取得された顔検出枠を取得する。指示判定手段８５は、取得した顔検出枠をマスクした二値の顔マスク画像を生成する。図１７（ｄ）に、顔マスク画像を示す。そして、指示判定手段８５は、顔マスク画像と人物画像のＡｎｄ演算処理を行う。この結果、人物から顔部分を除外した特徴画像が得られる。図１７（ｅ）に、特徴画像を示す。 Next, the instruction determination unit 85 acquires the face detection frame obtained by the face detection unit 21 in S2 of FIG. 4, and generates a binary face mask image in which the face detection frame is masked out. FIG. 17(d) shows the face mask image. The instruction determination unit 85 then takes the logical AND of the face mask image and the person image. The result is a feature image in which the face portion is excluded from the person. FIG. 17(e) shows the feature image.
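The mask operations leading to the feature image can be sketched with plain nested lists. This is an illustration only: the function names are hypothetical, the HSV step is assumed to have already produced a 0/1 skin image, and the face mask polarity (0 inside the face detection frame, 1 elsewhere) is inferred from the stated result that the face is excluded.

```python
def binarize(image, threshold):
    """Turn a multi-valued image into 0/1 using a fixed threshold."""
    return [[1 if v >= threshold else 0 for v in row] for row in image]


def and_images(a, b):
    """Pixel-wise logical AND of two 0/1 images of equal size."""
    return [[x & y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def face_mask(height, width, face_box):
    """0 inside the face detection frame (x, y, w, h), 1 elsewhere."""
    x0, y0, w, h = face_box
    return [
        [0 if (x0 <= x < x0 + w and y0 <= y < y0 + h) else 1
         for x in range(width)]
        for y in range(height)
    ]


def feature_image(bg_removed, skin_binary, face_box, threshold):
    """Person image (background AND skin) with the face region removed."""
    person = and_images(binarize(bg_removed, threshold), skin_binary)
    mask = face_mask(len(person), len(person[0]), face_box)
    return and_images(person, mask)


bg = [[0, 200, 200], [0, 200, 200]]
skin = [[1, 1, 1], [1, 0, 1]]
feature = feature_image(bg, skin, face_box=(1, 0, 1, 1), threshold=128)
```

What remains set in `feature` is skin-colored foreground outside the face, which in practice is mostly hands and arms.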

次に、指示判定手段８５は、シナリオデータ中の<AnimationTriggers >と、</AnimationTriggers >の２つのタグで囲まれた範囲におけるタッチレスボタンのボタン位置（PositionX, PositionY）、ボタンサイズ（ButtonWidth,ButtonHeight）の情報を取得する。そして、このボタン位置とボタンサイズから、ボタンの領域、およびボタンに含まれる総画素数を求める。さらに、特徴画像のボタンの領域における画素値“１”（図面では白で表現）の画素数とボタンに含まれる総画素数の比率Ｒｍａｓｋを算出する。図１７（ｆ）に、特徴画像とボタン領域の関係を示す。図１７（ｆ）において、白線で囲まれた矩形領域がボタン領域である。なお、図１５に示すように、タッチレスボタンが複数設定されている場合、指示判定手段８５は、各ボタンについて、比率Ｒｍａｓｋを算出する処理を行う。 Next, the instruction determination unit 85 acquires the button position (PositionX, PositionY) and button size (ButtonWidth, ButtonHeight) of each touchless button from the range enclosed by the <AnimationTriggers> and </AnimationTriggers> tags in the scenario data. From the button position and button size, it obtains the button area and the total number of pixels contained in the button. It then calculates the ratio Rmask of the number of pixels with pixel value “1” (shown in white in the drawing) inside the button area of the feature image to the total number of pixels contained in the button. FIG. 17(f) shows the relationship between the feature image and the button area; the rectangular area enclosed by the white line is the button area. When a plurality of touchless buttons are set, as shown in FIG. 15, the instruction determination unit 85 calculates the ratio Rmask for each button.
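The ratio Rmask is the fraction of the button rectangle covered by white (value 1) pixels of the feature image. A minimal sketch, with a hypothetical function name and assuming the rectangle lies entirely inside the image:

```python
def mask_ratio(feature, x, y, width, height):
    """Compute Rmask for one button.

    feature: 0/1 feature image as a list of rows.
    (x, y): top-left corner of the button area; width/height: its size.
    Returns (pixels with value 1 in the area) / (total pixels in the area).
    """
    total = width * height
    white = sum(
        feature[row][col]
        for row in range(y, y + height)
        for col in range(x, x + width)
    )
    return white / total


feature = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
r_mask = mask_ratio(feature, 1, 0, 2, 2)
```

With several buttons defined, the same function would simply be called once per button rectangle on each frame.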

指示判定手段８５は、各フレームに対して、上述のような処理を実行し、各タッチレスボタンの比率Ｒｍａｓｋを算出する。そして、シナリオデータ中に設定されているマスクしきい値Ｔｈｍａｓｋと各タッチレスボタンの比率Ｒｍａｓｋの比較を行う。そして、比較の結果に基づいて、以下の１）〜３）のいずれの状態であるかの判断を行う。 The instruction determination unit 85 performs the above processing on each frame and calculates the ratio Rmask for each touchless button. It then compares the ratio Rmask of each touchless button with the mask threshold Thmask set in the scenario data and, based on the result, determines which of the following states 1) to 3) applies.

（タッチレスボタン上の特徴部分の比率および継続時間による状態の判断）
１）Ｒｍａｓｋ＜Ｔｈｍａｓｋの場合通常状態
２）Ｒｍａｓｋ＞ＴｈｍａｓｋかつＴｉｍｅ＜Ｔｈｔｉｍｅの場合手が翳された状態
３）Ｒｍａｓｋ＞ＴｈｍａｓｋかつＴｉｍｅ＞Ｔｈｔｉｍｅの場合指示された状態 (Judgment of the state by the ratio and duration of the feature on the touchless button)
1) Rmask < Thmask: normal state
2) Rmask > Thmask and Time < Thtime: hand held over the button
3) Rmask > Thmask and Time > Thtime: instructed state

上記２）、３）において、Ｔｉｍｅは、フレーム数にフレームレートの逆数を乗じて算出した時間であり、Ｔｈｔｉｍｅはシナリオデータ中に設定されている時間のしきい値である。上記１）の場合、タッチレスボタン上における手の面積が小さいため、タッチレスボタンに手が翳された状態でなく、通常状態であると判断する。上記２）の場合、タッチレスボタン上における手の面積が大きいため、手が翳された状態ではあるが、その時間が短いため、タッチレスボタンに対する指示とまでは判断できない。上記３）の場合、タッチレスボタン上における手の面積が大きく、その時間も長いため、タッチレスボタンに対する指示と判断する。なお、図１５に示したシナリオデータ中では、２つのタッチレスボタンのいずれにおいても、Ｔｈｔｉｍｅは１．０秒、Ｔｈｍａｓｋは０．１（１０％）が設定されている。 In 2) and 3) above, Time is the time calculated by multiplying the number of frames by the reciprocal of the frame rate, and Thtime is the time threshold set in the scenario data. In case 1), the area of the hand over the touchless button is small, so the state is judged to be normal rather than one in which a hand is held over the button. In case 2), the area of the hand over the touchless button is large, so a hand is being held over it, but the duration is too short to be judged an instruction to the button. In case 3), the area of the hand over the touchless button is large and the duration is long, so it is judged to be an instruction to the button. In the scenario data shown in FIG. 15, Thtime is set to 1.0 second and Thmask to 0.1 (10%) for both touchless buttons.

すなわち、Ｓ２８において、指示判定手段８５は、各タッチレスボタンに対する手の面積比率がしきい値以上であるかどうかを判断する処理を、各フレームについて行い、手の面積比率がしきい値以上であるタッチレスボタンが所定フレーム数以上連続した場合に、タッチレスボタンが指示されたと判断する。 That is, in S28, the instruction determination unit 85 determines for each frame whether the hand area ratio over each touchless button is equal to or greater than the threshold, and when the ratio for a touchless button stays at or above the threshold for a predetermined number of consecutive frames or more, it determines that the touchless button has been instructed.
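The three-state judgment can be sketched as a small per-button counter of consecutive frames. The class name, state labels, and the frame rate used in the example are hypothetical; the sketch treats values equal to a threshold as exceeding it, following the "equal to or greater than" wording of the paragraph above.

```python
class TouchlessButtonJudge:
    """Tracks how long Rmask has stayed at or above Thmask for one button."""

    def __init__(self, th_mask, th_time, frame_rate):
        self.th_mask = th_mask        # area-ratio threshold (Thmask)
        self.th_time = th_time        # duration threshold in seconds (Thtime)
        self.frame_rate = frame_rate  # frames per second (assumed known)
        self.frames_over = 0          # consecutive frames at/above Thmask

    def update(self, r_mask):
        """Feed the Rmask of one frame and return the judged state."""
        if r_mask >= self.th_mask:
            self.frames_over += 1
        else:
            self.frames_over = 0
            return "normal"
        # Time = number of frames x (1 / frame rate)
        elapsed = self.frames_over / self.frame_rate
        return "instructed" if elapsed >= self.th_time else "hand_over"


# Thmask = 0.1 and Thtime = 1.0 s as in FIG. 15; 10 fps is illustrative.
judge = TouchlessButtonJudge(th_mask=0.1, th_time=1.0, frame_rate=10)
states = [judge.update(0.3) for _ in range(10)]
```

Resetting the counter on any frame below the threshold is what makes the duration "continuous": a brief pass of the hand never accumulates into an instruction.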

Ｓ２８において、指示判定手段８５が、タッチレスボタンが指示されたと判断した場合、コマンド切替手段８６は、コマンド切替を行う（Ｓ２９）。具体的には、セレクタＩＤ（Selector ID）を変更する。図１５の例では、“０”である場合は“１”に、“１”である場合は“０”に変更されることになる。セレクタＩＤの変更により、コマンドＩＤも変更され、コマンドが切り替えられることになる。 If the instruction determination unit 85 determines in S28 that the touchless button has been instructed, the command switching unit 86 performs command switching (S29). Specifically, the selector ID is changed. In the example of FIG. 15, “0” is changed to “1”, and “1” is changed to “0”. By changing the selector ID, the command ID is also changed, and the command is switched.

図１６は、図１５のシナリオデータに従って得られた合成画像の表示状態を示す図である。図１６（ａ）（ｂ）において、角の丸い矩形は、タッチレスボタンを示している。図１６の例では、“金髪”、“黒髪”と記されたタッチレスボタンが表示されている。閲覧者を認識すると、まず、図１５のシナリオデータ１８行目に示すように、初期のセレクタＩＤを定めるInitialSelectorIDが“０”であるので、セレクタＩＤは“０”に設定され、１３行目によりコマンドＩＤが“０”に設定される。そして、１３行目、１４行目の<Command ID>タグに従って、シナリオ開始時（StartKey="0.0"）から経過時間１０秒（EndKey="1.0"）までのフレームについては、コンテンツＩＤ“０”のコンテンツがレイヤ合成される。この際、ContentsID="0"で<SimulationContents>を参照することにより、コンテンツのコンテンツ記憶手段内における所在“afro.jpg”を特定し、黒いカツラのコンテンツを取得する。そして、図１６（ａ）に示すように、黒いカツラが閲覧者の画像に合成される。タッチレスボタンの指示によるコマンドの切替がない場合、シナリオ開始時（StartKey="0.0"）から経過時間１０秒（EndKey="1.0"）までのフレームについては、全て黒いカツラが合成された合成画像が表示されることになる。 FIG. 16 shows the display states of the composite images obtained according to the scenario data of FIG. 15. In FIGS. 16(a) and 16(b), the rectangles with rounded corners are touchless buttons. In the example of FIG. 16, touchless buttons labeled “Blond” and “Black hair” are displayed. When a viewer is recognized, the InitialSelectorID that defines the initial selector ID is “0”, as shown on the 18th line of the scenario data of FIG. 15, so the selector ID is set to “0” and the command ID is set to “0” by the 13th line. Then, according to the <Command ID> tag on the 13th and 14th lines, the content with content ID “0” is layer-composited for the frames from the start of the scenario (StartKey = "0.0") to an elapsed time of 10 seconds (EndKey = "1.0"). At this time, <SimulationContents> is looked up with ContentsID = "0" to identify the location “afro.jpg” of the content in the content storage means, and the black wig content is acquired. As shown in FIG. 16(a), the black wig is composited onto the image of the viewer. If no command switch is triggered by a touchless button, a composite image combined with the black wig is displayed for all frames from the start of the scenario (StartKey = "0.0") to an elapsed time of 10 seconds (EndKey = "1.0").

When the viewer holds a hand over the "Blonde" touchless button and the instruction determination means 85 determines, according to the conditions in <AnimationTriggers>, that the touchless button has been indicated, the instruction determination means 85 sets the touchless button ID (TouchlessButton ID) to "1". The command switching means 86 then sets the trigger ID, which corresponds one to one with the touchless button ID, to "1", and switches the selector ID to "1" and the command ID to "1". Then, according to the <Command ID> tags on lines 13 and 14, the content with content ID "1" is layer-composited onto the frames from the scenario start (StartKey="0.0") until 10 seconds have elapsed (EndKey="1.0"). At this point, <SimulationContents> is looked up with ContentsID="1" to identify the location of the content in the content storage means, "blond.jpg", and the blond wig content is acquired. As shown in FIG. 16(b), the blond wig is then composited onto the viewer's image in place of the black wig. After this, unless the command is switched again by a touchless button, every frame from the scenario start (StartKey="0.0") until 10 seconds have elapsed (EndKey="1.0") is displayed as a composite image with the blond wig.
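The chain from touchless button to displayed content can be sketched as a few table lookups. This is only an illustrative sketch: the dictionary layout and function names are hypothetical, and the identity mapping from trigger ID to selector ID is an assumption drawn from the single example above. Only the IDs and the file names "afro.jpg" and "blond.jpg" come from the description of FIG. 15.

```python
# Hypothetical in-memory view of the FIG. 15 scenario data described above.
CONTENTS = {0: "afro.jpg", 1: "blond.jpg"}   # ContentsID -> location in content storage
COMMAND_FOR_SELECTOR = {0: 0, 1: 1}          # SelectorID -> CommandID
CONTENT_FOR_COMMAND = {0: 0, 1: 1}           # CommandID -> ContentsID
INITIAL_SELECTOR_ID = 0                      # InitialSelectorID


def selector_for_button(touchless_button_id: int) -> int:
    """Button ID -> trigger ID (one to one) -> selector ID (assumed identical)."""
    trigger_id = touchless_button_id
    return trigger_id


def content_path(selector_id: int) -> str:
    """Resolve the content that the currently selected command composites."""
    command_id = COMMAND_FOR_SELECTOR[selector_id]
    return CONTENTS[CONTENT_FOR_COMMAND[command_id]]


if __name__ == "__main__":
    print(content_path(INITIAL_SELECTOR_ID))     # content shown before any button is indicated
    print(content_path(selector_for_button(1)))  # content shown after the "Blonde" button
```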

≪3.4. Scene composition processing (1)≫
Next, the scene composition function, which is a feature of the present invention, will be described. The scene composition function has two embodiments: one that composites after clearly delineating the contour between the person and everything other than the person, and one that composites without particularly delineating that contour. The former is described first, as the first embodiment. Scene composition requires scenario data that supports the scene composition function. FIG. 20 shows an example of XML-format scenario data supporting the scene composition function (two targets). The functions of the composition target definition means 80, the composite content definition means 81, and the animation scenario definition means 82 are described below with reference to the scenario data of FIG. 20. The composition target definition means 80 defines a target by setting four attributes: target ID, type, absolute coordinates, and transition setting. In the example of FIG. 20, this corresponds to the range enclosed by the <Simulation Targets> tag on line 1 and the </Simulation Targets> tag on line 7. Three target types can be set: Human, Place, and Scene. Correspondingly, three ID types can be set: person, place, and scene. In the example of FIG. 20, two Human tags beginning with HumanID define two people, a Place tag beginning with Place ID defines a place, and a Scene tag beginning with SceneID defines a scene. The transition setting specifies whether, when the associated target disappears, the target ID is associated with a new target. In the example of FIG. 20, the tags on lines 2 to 4 set three attributes, namely target ID (HumanID, Place ID, SceneID), type, and transition setting, and the transition setting (IsTransfer) is "false" (not set) in every case.

The composite content definition means 81 defines content by setting three attributes: content ID, content path, and overlap setting. In the example of FIG. 20, this corresponds to the range enclosed by the <Simulation Contents> tag on line 8 and the </Simulation Contents> tag on line 24. In the example of FIG. 20, three contents with content IDs (ContentsID) "0", "1", and "2" are defined. There are two kinds of content: superimposed images, which are overlaid on a person or on the background, and background images. For superimposed images and background images, as shown on lines 9 and 10 of FIG. 20, the content path (ContentsPath) and the overlap setting (OverlapOrder) are set on one line per content.

The animation scenario definition means 82 defines an animation scenario by setting eight attributes: command ID, command type, selector ID, start key, end key, key type, target ID, and content ID. In the example of FIG. 20, this corresponds to the range enclosed by the <Animation Commands> tag on line 25 and the </Animation Commands> tag on line 28, and a command with command ID (CommandID) "0" is defined. As shown in FIG. 20, the command type, selector ID (SelectorID), start key, end key, key type, target ID, and content ID are set over two lines per command. The command type indicates what type of effect is produced and on which frames; layer composition, α-blend composition, audio playback start, and scene composition are provided. Of these, layer composition, α-blend composition, and scene composition are types of image composition: layer composition overwrites the frame with the content, α-blend composition blends the content and the frame with transparency according to a set α ratio, and scene composition cuts out the human body portion and composites it onto a background. In the example of FIG. 20, scene composition (SceneMontage) is set as the command type (CommandType).
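The difference between layer composition and α-blend composition can be illustrated per pixel. The following is a minimal sketch, not the implementation of the embodiment: the function names are hypothetical, and the convention that an α ratio of 1 shows only the content is an assumption, since the description says only that the content and the frame are blended according to the set α ratio.

```python
def layer_composite(frame_px, content_px):
    """Layer composition: the content pixel overwrites the frame pixel."""
    return content_px


def alpha_blend(frame_px, content_px, alpha):
    """Alpha-blend composition of two RGB pixels.

    Assumed convention: alpha = 1.0 shows only the content,
    alpha = 0.0 shows only the frame.
    """
    return tuple(
        round(alpha * c + (1.0 - alpha) * f)
        for f, c in zip(frame_px, content_px)
    )


if __name__ == "__main__":
    frame = (0, 0, 0)
    content = (200, 100, 50)
    print(layer_composite(frame, content))
    print(alpha_blend(frame, content, 0.5))
```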

The selector ID is a condition for identifying the command to execute: the command whose command ID corresponds to the selector ID is executed. The set of one or more commands identified by a selector constitutes one scenario, so selectors and scenarios correspond one to one. In the example of FIG. 20, one scenario consists of one command, and the command with command ID 0 is executed when the selector ID is 0. As in the example of FIG. 10, the start key and end key set the start and end times of each command. In this embodiment, scenario time is managed with the scenario start as "0.0" and the scenario end as "1.0". In the example of FIG. 20, as in FIG. 15, the start key (StartKey) of every command is "0.0" and the end key (EndKey) of the last command to finish is "1.0". The key type sets the reference for the start key and end key; three are provided: own, base, and global. own uses the viewing time of the face object corresponding to each target ID as the reference, base uses the viewing time of the face object corresponding to target ID 0, and global uses the time at which the first frame of the captured video was acquired. In the example of FIG. 20, own is set as the key type (KeyType), so the start key and end key are interpreted with the moment the first face object appears in the frame (the moment the face object is judged to be in the "viewing start" state) taken as "0.0".

In the example of FIG. 20, as in FIG. 15, the <CycleInterval> and <IsAutoLoop> tags are contained in each target ID tag, expressed by HumanID and PlaceID on lines 2 to 4. As described above, even when the <CycleInterval> and <IsAutoLoop> tags are contained in a target ID tag, they have the same meaning as in the example of FIG. 10: the <CycleInterval> tag sets the time from the start to the end of the scenario in seconds, and the <IsAutoLoop> tag sets whether loop (repeat) processing is performed. The image display system 1 allows several description formats for scenario data; even when the <CycleInterval> and <IsAutoLoop> tags are written in different positions, as in FIGS. 10 and 15, the scenario data association means 83 recognizes them and executes processing according to their contents. In the example of FIG. 20, CycleInterval="10" is set in the target ID tags, so the scenario lasts 10 seconds from start to end, and the scenario is managed in real time obtained by multiplying the start key and end key values by 10. IsAutoLoop="true" is also set in the target ID tags, so loop processing is performed. The scenario data created in this way by the composition target definition means 80, the composite content definition means 81, and the animation scenario definition means 82 is stored in the data storage device 2d, which serves as the scenario data storage means.
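The relationship between the normalized keys (0.0 to 1.0), CycleInterval, and IsAutoLoop can be sketched as follows. The function names are hypothetical, and restarting the scenario by taking the elapsed time modulo CycleInterval is an assumption; the description states only that loop processing is performed.

```python
def key_to_seconds(key, cycle_interval):
    """Convert a StartKey/EndKey value (0.0 to 1.0) to seconds of real time."""
    return key * cycle_interval


def scenario_position(elapsed, cycle_interval, auto_loop):
    """Return the normalized scenario time, or None once the scenario has ended.

    With auto_loop enabled the scenario is assumed to restart every
    cycle_interval seconds (modulo arithmetic is an illustrative choice).
    """
    if elapsed < cycle_interval:
        return elapsed / cycle_interval
    if auto_loop:
        return (elapsed % cycle_interval) / cycle_interval
    return None


def is_command_active(position, start_key, end_key):
    """A command applies while the normalized time lies between its keys."""
    return position is not None and start_key <= position <= end_key


if __name__ == "__main__":
    # CycleInterval="10": EndKey "1.0" corresponds to 10 seconds.
    print(key_to_seconds(1.0, 10))
    print(is_command_active(scenario_position(5, 10, False), 0.0, 1.0))
```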

The case where the scenario data shown in FIG. 20 is specified will be described with reference to the flowchart of FIG. 9. When the image processing apparatus 2 is started and the scenario data shown in FIG. 20 is specified as the scenario data to use, the scenario data association means 83 first reads the specified scenario data from the data storage device 2d (S21). The scenario data association means 83 then interprets the scenario data and starts creating images according to it (S22).

Next, the scenario data association means 83 associates the face object data acquired from the state transition management means 25 with the scenario data (S24). Specifically, it associates the object ID of the face detection frame data contained in the face object data with a target ID. When multiple sets of face detection frame data are acquired from the state transition management means 25, the one with the earliest date and time of transition to the candidate Face state is set to "0", and the rest are set to "1", "2", "3", and so on, incrementing by 1 in order of that date and time. In the example of FIG. 20, four target IDs are set in the scenario data, HumanID="0", HumanID="1", PlaceID="2", and SceneID="3", so the scenario data association means 83 targets the data identified by the object ID associated with target ID "0".

Next, the composite image creation means 84 creates a composite image for display frame by frame (S25). Specifically, it first sets the start time to "0.0" and, at this time "0.0", refers to <Animation Commands> in the scenario data. As shown in FIG. 20, <Animation Commands> contains only the command with command ID "0", which, from start key "0.0" to end key "1.0", has key type "own", command type "SceneMontage" (scene composition), target ID "3", and content ID "2". As described above, the command ID is determined by the selector ID, but FIG. 20 has only one command ID, so the composite image creation means 84 creates the composite image by scene-compositing the content with content ID "2" onto the target identified by target ID "3".

The path of the content with content ID "2" (the material image for composition) can be identified by referring to <Simulation Contents> in the scenario data. Specifically, content ID "2" refers to "SceneContentsID=2" on line 11, and the SceneContentsID tag shows that it consists of two contents, "Human ContentsID=0" and "Place ContentsID=1". Further, "Human ContentsID=0" refers to "ContentsID=0" on line 9 and "Place ContentsID=1" refers to "ContentsID=1" on line 10, so each content is acquired from the storage location in the data storage device 2d identified by its path. Then, in the content (background image) identified by "Place ContentsID=1", the person portion cut out along the contour is masked. Specifically, a mask image is generated that masks the inside of the contour obtained according to the contents of SegmentationArea, SourceArea, and SinkArea on lines 12 to 22, and it is combined with the background image identified by "Place ContentsID=1". The content (superimposed image) identified by "Human ContentsID=0", on the other hand, has already been masked and is used as the superimposed image as is.

Next, the composite image creation means 84 cuts the inside of the contour, obtained according to the contents of SegmentationArea, SourceArea, and SinkArea on lines 12 to 22, out of the captured image and composites it with the masked background image. It further composites the superimposed image using the portion of the captured image corresponding to the face detection frame data. Rectangle information is set for this superimposed image as well, as for the content described above, so the composite image creation means 84 resizes the rectangle information and the superimposed image so that the rectangle information matches the rectangle size of the face detection frame data, and layer-composites the resized superimposed image at the position where the resized rectangle coincides with the position of the face detection frame data. Specifically, the size of the rectangle set for the content is changed to match the rectangle of the face detection frame data set on the face image, as shown in FIG. 12(b); the content is resized by the same ratio as the rectangle; and composition is performed so that the two rectangles coincide. As a result, when the content is a wig as in FIG. 12(a), for example, a composite image in which the wig is fitted to the person's face as if it were being worn (FIG. 12(c)) is obtained. Where the background image and the superimposed image overlap, the superimposed image takes priority according to the OverlapOrder values on lines 9 and 10. The composite image creation means 84 displays the resulting composite image on the display 3. As a result, the display 3 shows a composite image in which the frame of the captured video has been processed.
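The rectangle matching described above amounts to computing a scale factor and a paste position. A minimal sketch follows; the function name, the (x, y, width, height) tuple layout with a top-left origin, and the rounding to whole pixels are assumptions made for illustration.

```python
def fit_content_to_face(content_size, content_rect, face_rect):
    """Scale a content image so that its rectangle coincides with the face frame.

    content_size: (width, height) of the content image
    content_rect: (x, y, w, h) rectangle set inside the content image
    face_rect:    (x, y, w, h) face detection frame in the captured frame
    Returns (new_width, new_height, paste_x, paste_y), where (paste_x, paste_y)
    is the top-left position of the resized content in the captured frame.
    """
    content_w, content_h = content_size
    rect_x, rect_y, rect_w, rect_h = content_rect
    face_x, face_y, face_w, face_h = face_rect

    # Change ratio of the content rectangle needed to match the face frame size.
    scale_x = face_w / rect_w
    scale_y = face_h / rect_h

    # The content image is resized by the same ratio as its rectangle.
    new_w = round(content_w * scale_x)
    new_h = round(content_h * scale_y)

    # Place the content so that the scaled rectangle lands on the face frame.
    paste_x = round(face_x - rect_x * scale_x)
    paste_y = round(face_y - rect_y * scale_y)
    return new_w, new_h, paste_x, paste_y


if __name__ == "__main__":
    print(fit_content_to_face((200, 300), (50, 100, 100, 100), (320, 240, 50, 50)))
```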

When a composite image has been created for one frame, the scenario data association means 83 determines whether the scenario is still in progress (S26). Specifically, it compares the elapsed time since image creation according to the scenario data started with "CycleInterval" in the target ID tags (HumanID, PlaceID) of the scenario data. If the elapsed time is less than the scenario time, it determines that the scenario is in progress; if the elapsed time is equal to or greater than the scenario time, it determines that the scenario has ended. If the scenario is in progress, the scenario data association means 83 returns to S23 and acquires face object data.

Then, in S24, the scenario data association means 83 associates the next face object data acquired from the state transition management means 25 with the scenario data. As in the first pass of the loop, the face object with the earliest date and time of transition to the candidate Face state is set to "0", and the rest are set to "1", "2", "3", and so on, incrementing by 1 in order of that date and time. However, because PlaceID="2" is set in FIG. 20, HumanID can only be set to "0" or "1"; that is, the scenario data of FIG. 20 supports at most two people. More people can of course be supported by rewriting the scenario data. Then, according to the scenario data, the scenario data association means 83 targets the face detection frame data identified by the object ID associated with target ID "0".
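The association step in S24 can be sketched as a sort by transition time followed by sequential numbering. The function name and the (object ID, timestamp) tuple layout are hypothetical; the limit of two people reflects the FIG. 20 scenario data described above.

```python
from datetime import datetime


def assign_target_ids(face_objects, max_humans=2):
    """Map target IDs to object IDs in order of transition to candidate Face.

    face_objects: list of (object_id, candidate_time) tuples, where
    candidate_time is when the object transitioned to the candidate Face state.
    Returns {target_id: object_id}; the earliest object gets target ID 0.
    """
    ordered = sorted(face_objects, key=lambda face: face[1])
    return {
        target_id: object_id
        for target_id, (object_id, _) in enumerate(ordered[:max_humans])
    }


if __name__ == "__main__":
    faces = [
        (7, datetime(2011, 9, 27, 10, 0, 5)),
        (3, datetime(2011, 9, 27, 10, 0, 1)),
        (9, datetime(2011, 9, 27, 10, 0, 9)),
    ]
    print(assign_target_ids(faces))
```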

Next, in S25, the composite image creation means 84 creates a composite image frame by frame. Specifically, it acquires the elapsed time and refers to <Animation Commands> in the scenario data at that elapsed time. Then, according to <Animation Commands>, it creates a composite image by scene-compositing the content identified by ContentsID="2" onto the target identified by TargetID="3". The process of executing the scenario data according to the elapsed time is repeated in this way until the scenario is determined to have ended in S26.

Next, the cut-out processing of the captured image specified on lines 12 to 22 of the scenario data of FIG. 20 will be described in detail. Lines 12 to 22 consist of the SegmentationArea tag on lines 12 and 13, the SourceArea tag on lines 14 to 18, and the SinkArea tag on lines 19 to 22. The SegmentationArea, SourceArea, and SinkArea tags set the segment area, the initial region of the Source data sequence, and the initial region of the Sink data sequence, respectively.

First, the segment area will be described. The segment area is the area needed for setting the initial regions, and it covers a range that includes the face detection frame. As shown on lines 12 and 13 of FIG. 20, the segment area is set by the face detection frame upper magnification (TopRate), the face detection frame lower magnification (BottomRate), and the aspect ratio (AspectWidth, AspectHeight).

FIG. 21 shows the relationship between the face detection frame and the segment area in a captured image. Of the two rectangles in FIG. 21, the inner one is the face detection frame and the outer one is the frame of the segment area. The segment area shown in FIG. 21 is set according to the scenario data of FIG. 20 (TopRate="1.2", BottomRate="2.0", AspectWidth="5", AspectHeight="3"). The face detection frame upper magnification is the ratio between the length from the center of the face detection frame to its upper side and the length from the center of the face detection frame to the upper side of the segment area. The face detection frame lower magnification is the ratio between the length from the center of the face detection frame to its lower side and the length from the center of the face detection frame to the lower side of the segment area. The segment area is set, according to the upper magnification, the lower magnification, and the aspect ratio, to a range that contains the entire face detection frame. As shown in FIGS. 20 and 21, the segment area is set by numerical values specified relative to the position and size of the face detection frame. The values can be changed as appropriate, but it is desirable to set them so that the segment area and the face detection frame share the same horizontal center and, vertically, the center of the face detection frame lies above the center of the segment area. Such values may be set automatically by the composite content definition means 81.
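Under the definitions above, the segment area can be derived from the face detection frame with a few lines of arithmetic. This sketch makes two assumptions that the description does not state explicitly: the aspect ratio is read as width to height and is applied to the height obtained from the two magnifications, and the segment area is centered horizontally on the face detection frame (the arrangement described as desirable). The function name and tuple layout are hypothetical.

```python
def segment_area(face_rect, top_rate, bottom_rate, aspect_width, aspect_height):
    """Derive the segment area (left, top, width, height) from a face frame.

    face_rect is (x, y, w, h) with a top-left origin and y increasing downward.
    """
    face_x, face_y, face_w, face_h = face_rect
    center_x = face_x + face_w / 2
    center_y = face_y + face_h / 2
    half_height = face_h / 2

    # TopRate / BottomRate scale the distance from the face center to the
    # upper / lower side of the face frame.
    top = center_y - top_rate * half_height
    bottom = center_y + bottom_rate * half_height
    height = bottom - top

    # Assumed: width follows from the height and the aspect ratio.
    width = height * aspect_width / aspect_height
    left = center_x - width / 2
    return left, top, width, height


if __name__ == "__main__":
    # Values from the FIG. 20 example: TopRate 1.2, BottomRate 2.0, aspect 5:3.
    print(segment_area((100, 100, 60, 60), 1.2, 2.0, 5, 3))
```

With TopRate smaller than BottomRate, as in the example values, the center of the face detection frame lies above the center of the segment area.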

FIG. 22 shows the shapes of the initial regions. In this embodiment, an initial region can be set in one of three shapes: circle, ellipse, or rectangle (including square). Each initial region is expressed by a relative position, relative width, and relative height, with the segment area taken as 0 to 1. Specifically, a circle is set by its center position (CenterX, CenterY) and radius (Radius); an ellipse by its center position (CenterX, CenterY), X-axis radius (XRadius), and Y-axis radius (YRadius); and a rectangle by its upper-left vertex position (X, Y), width (Width), and height (Height). Shapes other than the three of this embodiment can of course also be used.
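The three initial-region shapes and the random pixel selection inside them can be sketched as follows. The dictionary representation, the function names, and the example values are hypothetical; the attribute names (CenterX, Radius, XRadius, Width, and so on) follow the description. The containment test is evaluated in relative coordinates, which is an assumption about how the circle radius relates to a non-square segment area.

```python
import random


def contains(shape, u, v):
    """Test whether relative point (u, v), each in 0..1, lies in the region."""
    kind = shape["type"]
    if kind == "circle":
        du, dv = u - shape["CenterX"], v - shape["CenterY"]
        return du * du + dv * dv <= shape["Radius"] ** 2
    if kind == "ellipse":
        du = (u - shape["CenterX"]) / shape["XRadius"]
        dv = (v - shape["CenterY"]) / shape["YRadius"]
        return du * du + dv * dv <= 1.0
    if kind == "rectangle":
        return (shape["X"] <= u <= shape["X"] + shape["Width"]
                and shape["Y"] <= v <= shape["Y"] + shape["Height"])
    raise ValueError(f"unknown shape type: {kind}")


def sample_pixels(shape, seg_width, seg_height, sampling_num, rng):
    """Randomly pick up to sampling_num pixels of the segment area in the region.

    Pixel centers are mapped to relative coordinates before testing.
    """
    candidates = [
        (x, y)
        for y in range(seg_height)
        for x in range(seg_width)
        if contains(shape, (x + 0.5) / seg_width, (y + 0.5) / seg_height)
    ]
    return rng.sample(candidates, min(sampling_num, len(candidates)))


if __name__ == "__main__":
    source = {"type": "ellipse", "CenterX": 0.5, "CenterY": 0.6,
              "XRadius": 0.15, "YRadius": 0.3}  # hypothetical values
    print(len(sample_pixels(source, 160, 96, 50, random.Random(0))))
```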

FIG. 23 shows the initial regions set according to the example scenario data of FIG. 20. FIG. 23(a) shows the initial regions in a captured frame, and FIG. 23(b) is the script of the SourceArea and SinkArea tag portions of the scenario data that sets the initial regions of FIG. 23(a).

To cut out the contour of the person by the processing described later using the initial regions, the Source initial region must be set where the person is present, and the Sink initial region where the person is not. In the scenario data of FIG. 20 the initial regions are specified numerically; these values were calculated from the relationship between the face detection frame and the segment area. The SourceArea and SinkArea tags also specify, as SamplingNum="50", the number of pixels to be selected at random in the initial region.

Once the initial regions are set, the composite image creation means 84 segments the image in the segment area. This segmentation separates the object from everything else, and is performed so that the Source initial region falls within the object and the Sink initial region falls outside it. Various known techniques can be used for segmentation; this embodiment uses the known graph cut segmentation technique. The composite image creation means 84 therefore selects 50 pixels at random in each of the Source and Sink initial regions and runs graph cut segmentation using the values of those pixels, segmenting the image in the segment area into object and non-object. The number of pixels selected at random in each of the Source and Sink initial regions can be set freely with "SamplingNum" in the SourceArea tag. FIG. 24(b) shows the segment result image obtained by segmentation. In FIG. 24(b), the white portion is the object (person) and the black portions on both sides are everything other than the object (person). The segment result image is created as a binary image.
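The description names graph cut segmentation without giving its details, so the following is only a toy sketch of the general idea on a small grayscale image, not the method of the embodiment. It assumes terminal weights based on the distance of each pixel from the mean of the Source and Sink samples, neighbor weights that shrink as intensities differ, and an Edmonds-Karp maximum flow to find the minimum cut. All names and weight formulas are illustrative.

```python
from collections import deque

INF = 10 ** 9


def source_side_of_min_cut(capacity, source, sink):
    """Run Edmonds-Karp and return the nodes still reachable from the source."""
    residual = {u: dict(edges) for u, edges in capacity.items()}
    for u, edges in capacity.items():
        for v in edges:
            residual.setdefault(v, {}).setdefault(u, 0)  # reverse edges
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return set(parent)  # no augmenting path left: this is the cut
        path, node = [], sink
        while parent[node] is not None:
            path.append((parent[node], node))
            node = parent[node]
        flow = min(residual[a][b] for a, b in path)
        for a, b in path:
            residual[a][b] -= flow
            residual[b][a] += flow


def graph_cut(image, source_seeds, sink_seeds):
    """Return a binary image: 1 for object (source side), 0 for everything else.

    image is a list of rows of integer intensities; seeds are sets of (y, x).
    """
    height, width = len(image), len(image[0])
    src_mean = sum(image[y][x] for y, x in source_seeds) / len(source_seeds)
    snk_mean = sum(image[y][x] for y, x in sink_seeds) / len(sink_seeds)
    capacity = {"S": {}, "T": {}}
    for y in range(height):
        for x in range(width):
            node, value = (y, x), image[y][x]
            edges = capacity.setdefault(node, {})
            # Terminal links: cutting a pixel off from the terminal it resembles is costly.
            capacity["S"][node] = INF if node in source_seeds else round(abs(value - snk_mean))
            edges["T"] = INF if node in sink_seeds else round(abs(value - src_mean))
            # Neighbor links: similar neighbors are expensive to separate.
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny < height and nx < width:
                    weight = 50 // (1 + abs(value - image[ny][nx]))
                    edges[(ny, nx)] = weight
                    capacity.setdefault((ny, nx), {})[node] = weight
    reachable = source_side_of_min_cut(capacity, "S", "T")
    return [[1 if (y, x) in reachable else 0 for x in range(width)]
            for y in range(height)]


if __name__ == "__main__":
    image = [[200, 200, 10, 10],
             [200, 190, 20, 10],
             [210, 200, 10, 10]]
    for row in graph_cut(image, {(0, 0)}, {(0, 3)}):
        print(row)
```

The returned rows play the role of the binary segment result image, with 1 corresponding to the white (object) portion.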

Once the segment result image is obtained, the composite image creation means 84 combines the content mask image with the segment result image to create a composite mask image. The content mask image is the mask image corresponding to the content image specified by Human ContentsID; everything other than the portion to be extracted from that content image is masked. A content mask image is a binary image prepared in advance for each content image and stored in the data storage device 2d. As shown in FIG. 24, the composite mask image of FIG. 24(c) is obtained by OR-ing the unmasked portions of a content mask image such as that of FIG. 24(a) and a segment result image such as that of FIG. 24(b), giving a binary image in which the unmasked portions of both are added together. Therefore, when the content image is a helmet and clothes, the composite mask image is an image in which the helmet, clothes, and person portions are not masked.

次に、合成画像作成手段８４は、合成マスク画像を反転して背景マスク画像を作成する。具体的には、二値画像である合成マスク画像の各画素の値を他方の値に変更する。この結果、マスクする画素とマスクされない画素が逆になった二値画像である背景マスク画像が得られる。図２５に合成マスク画像と背景マスク画像の一例を示す。図２５（ａ）は、合成マスク画像、図２５（ｂ）は、背景マスク画像である。背景マスク画像では、ヘルメット、服、人物の部分がマスクされる画像となる。 Next, the composite image creating means 84 creates a background mask image by inverting the composite mask image. Specifically, the value of each pixel of the composite mask image that is a binary image is changed to the other value. As a result, a background mask image which is a binary image in which the masked pixels and the unmasked pixels are reversed is obtained. FIG. 25 shows an example of the composite mask image and the background mask image. FIG. 25A is a composite mask image, and FIG. 25B is a background mask image. In the background mask image, the helmet, clothes, and person portions are masked.
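Inverting a binary mask amounts to swapping the two pixel values. A minimal sketch, assuming the mask is a 2-D list of 0/1 values (the encoding and the name `invert_mask` are illustrative assumptions):

```python
def invert_mask(mask):
    """Return a binary mask with every pixel flipped (0 becomes 1, 1 becomes 0)."""
    return [[1 - value for value in row] for row in mask]

composite_mask = [[1, 1, 1, 0],
                  [0, 1, 1, 0]]
background_mask = invert_mask(composite_mask)
```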

Subsequently, the composite image creating means 84 masks the background image with the background mask image, masks the superimposed image with the content mask image, and overlays the resulting images on the captured image to create a composite image. The result is a composite image such as that shown in FIG. 26(f). FIG. 26 illustrates how the composite image is created: FIG. 26(a) is the captured image, FIG. 26(b) the background image, FIG. 26(c) the background mask image, FIG. 26(d) the superimposed image, FIG. 26(e) the superimposed mask image, and FIG. 26(f) the composite image. In FIG. 26(f), the background is drawn narrower for convenience of illustration; in practice, the composition is performed at an aspect ratio matching the display range of the display 3. In this way, content such as clothes and a background can be combined with the person viewing the display 3 and shown on the display 3.
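The layering described above can be sketched per pixel as follows. The sketch assumes all images and masks have the same dimensions and that a mask value of 1 means the corresponding layer is shown at that pixel; the function name `compose` and this encoding are assumptions, not part of the specification.

```python
def compose(captured, background, background_mask, overlay, overlay_mask):
    """Overlay the masked background and masked superimposed image on the captured image.

    A mask value of 1 means that layer is shown at the pixel. Where neither
    mask is set, the captured pixel (the person) shows through.
    """
    result = []
    for y, row in enumerate(captured):
        out_row = []
        for x, pixel in enumerate(row):
            if overlay_mask[y][x]:
                pixel = overlay[y][x]        # superimposed content, e.g. helmet or clothes
            elif background_mask[y][x]:
                pixel = background[y][x]     # replacement background
            out_row.append(pixel)
        result.append(out_row)
    return result

# One-row example with symbolic pixels: P = person, B = background, H = helmet.
composite = compose(
    captured=[["P", "P", "P"]],
    background=[["B", "B", "B"]],
    background_mask=[[0, 0, 1]],
    overlay=[["H", "H", "H"]],
    overlay_mask=[[1, 0, 0]],
)
```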

<< 3.5. Scene composition processing (2) >>
In the scene composition processing described above, the segment area is segmented so that the boundary between the person and everything other than the person is made clear before combining with the background image. The second embodiment combines the person with the background image without using a segment area and without clarifying that boundary. As in the first embodiment, the second embodiment uses the apparatus configuration shown in FIGS. 1 to 3 and executes processing according to the flowcharts shown in FIGS. 4 to 6 and 9. Although not illustrated, scenario data defining the timing at which one or more persons in the video are combined with content is stored in the data storage device 2d, which serves as the scenario data storage means.

第２の実施形態について、図９のフロー図を用いて説明する。画像処理装置２を起動し、シナリオデータを指定すると、シナリオデータ対応付け手段８３が、指定されたシナリオデータをデータ記憶装置２ｄから読み込む（Ｓ２１）。そして、シナリオデータ対応付け手段８３は、シナリオデータを解釈し、シナリオデータに従った画像の作成を開始する（Ｓ２２）。 The second embodiment will be described with reference to the flowchart of FIG. When the image processing device 2 is activated and scenario data is designated, the scenario data association unit 83 reads the designated scenario data from the data storage device 2d (S21). Then, the scenario data association unit 83 interprets the scenario data and starts creating an image according to the scenario data (S22).

Subsequently, the scenario data association means 83 associates the face object data acquired from the state transition management means 25 with the scenario data (S24). Specifically, it associates the object ID of the face detection frame data contained in the face object data with a target ID. When multiple sets of face detection frame data are acquired from the state transition management means 25, the one whose transition to the candidate Face state has the earliest date and time is set to “0”, and the rest are set to “1”, “2”, “3”, and so on, incrementing by one in order of earliest transition to the candidate Face state.
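The ordering rule for target IDs can be sketched as a sort on the transition time. The dictionary keys `object_id` and `candidate_since` are hypothetical names chosen for this example; the specification does not define a concrete data layout.

```python
from datetime import datetime

def assign_target_ids(face_objects):
    """Map target IDs 0, 1, 2, ... to object IDs, earliest candidate-Face transition first."""
    ordered = sorted(face_objects, key=lambda face: face["candidate_since"])
    return {target_id: face["object_id"] for target_id, face in enumerate(ordered)}

faces = [
    {"object_id": 7, "candidate_since": datetime(2011, 9, 27, 10, 0, 5)},
    {"object_id": 3, "candidate_since": datetime(2011, 9, 27, 10, 0, 1)},
    {"object_id": 9, "candidate_since": datetime(2011, 9, 27, 10, 0, 9)},
]
targets = assign_target_ids(faces)
```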

Next, the composite image creating means 84 creates a composite image for display on a frame-by-frame basis (S25). Specifically, it first sets the start time to time “0.0” and, at this time “0.0”, refers to <Animation Commands> in the scenario data and executes the command with the specified command ID.

コマンドＩＤタグで指定されるコンテンツＩＤで、シナリオデータの<Simulation Contents>を参照することにより、コンテンツのパスが特定できる。第２の実施形態においては、第１の実施形態と異なり、必ず重畳画像と背景画像の２つのコンテンツが特定されている必要がある。また、重畳画像としては、人物の顔付近だけ開いていて、その周囲を全て囲うような画像を用意する。例えば、顔出し看板、ウェディングドレスが挙げられる。合成画像作成手段８４は、そのパスで特定されるデータ記憶装置２ｄ内の記憶位置から各コンテンツを取得する。そして、Place ContentsIDで特定されるコンテンツ（背景画像）を、重畳画像のコンテンツマスク画像を反転して得られた背景マスク画像を用いてマスクする。一方、“Human ContentsID”で特定されるコンテンツ（重畳画像）は、既にマスク加工が行われているので、そのまま重畳画像として用いる。 By referring to the <Simulation Contents> of the scenario data with the content ID specified by the command ID tag, the content path can be specified. In the second embodiment, unlike the first embodiment, two contents of a superimposed image and a background image must be specified. In addition, as the superimposed image, an image that is open only near the face of a person and surrounds the entire periphery is prepared. For example, a face signboard and a wedding dress can be mentioned. The composite image creation unit 84 acquires each content from the storage position in the data storage device 2d specified by the path. Then, the content (background image) specified by the Place Contents ID is masked using the background mask image obtained by inverting the content mask image of the superimposed image. On the other hand, since the content (superimposed image) specified by “Human ContentsID” has already been masked, it is used as it is as a superimposed image.

Subsequently, the composite image creating means 84 combines the portion of the captured image corresponding to the face detection frame data with the superimposed image. Rectangle information is set for this superimposed image as well, as for the content described above, so the composite image creating means 84 resizes the rectangle information and the superimposed image so that the rectangle information matches the rectangle size of the face detection frame data, and layer-composites the resized superimposed image at the position where the resized rectangle information coincides with the position of the face detection frame data. Specifically, the rectangle set for the content is resized to fit the rectangle of the face detection frame data set on the face image as shown in FIG. 12(b), the content is resized by the same ratio, and the composition is performed so that the two rectangles coincide. As a result, when the content is a wig as in FIG. 12(a), for example, a composite image in which the wig appears to be placed on the person's head (FIG. 12(c)) is obtained.
In this embodiment, the superimposed image content is an image, such as a face-in-hole board or a wedding dress, that is open only around the person's face and encloses everything around it, so the composition makes the person in the captured image appear to be putting their face through the hole of the board or wearing the wedding dress. Subsequently, the composite image creating means 84 combines the background image masked with the background mask image with the composite of the captured image and the superimposed image. Where the background image and the superimposed image overlap, the superimposed image takes priority according to the OverlapOrder values in the 8th and 9th lines. The composite image creating means 84 then displays the resulting composite image on the display 3. As a result, the display 3 shows a composite image in which the frame of the captured video has been processed.
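The rectangle matching used to place the superimposed image can be sketched as a scale-and-offset computation. The sketch assumes rectangles are given as `(x, y, width, height)`, with the content rectangle expressed in the superimposed image's own coordinates; the function name and tuple layout are assumptions for illustration.

```python
def fit_overlay_to_face(overlay_size, overlay_rect, face_rect):
    """Compute the resized overlay size and paste position.

    overlay_size: (width, height) of the superimposed image.
    overlay_rect: rectangle set in the superimposed image, in its own coordinates.
    face_rect:    face detection frame in the captured frame's coordinates.
    Returns (new_width, new_height, paste_x, paste_y).
    """
    overlay_w, overlay_h = overlay_size
    rect_x, rect_y, rect_w, rect_h = overlay_rect
    face_x, face_y, face_w, face_h = face_rect
    # Scale factors that make the content rectangle the same size as the face frame.
    scale_x = face_w / rect_w
    scale_y = face_h / rect_h
    new_w = round(overlay_w * scale_x)
    new_h = round(overlay_h * scale_y)
    # Shift so the scaled content rectangle lands exactly on the face frame.
    paste_x = round(face_x - rect_x * scale_x)
    paste_y = round(face_y - rect_y * scale_y)
    return new_w, new_h, paste_x, paste_y

placement = fit_overlay_to_face((200, 300), (50, 100, 100, 100), (320, 240, 50, 50))
```

The paste position may be negative or extend past the frame edge when the superimposed image is larger than the area around the face, so a real implementation would also need to clip to the frame.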

次に、Ｓ２５において、合成画像作成手段８４が、フレーム単位で合成画像を作成する処理を行う。具体的には、経過時間を取得し、取得した経過時間で、シナリオデータの<Animation Commands>を参照する。そして、<Animation Commands>に従い、TargetIDで特定されるターゲットに、ContentsIDで特定されるコンテンツをシーン合成することにより合成画像を作成する。このようにして、Ｓ２６においてシナリオ終了であると判断されるまでは、経過時間に従い、シナリオデータを実行する処理を繰り返し行う。 Next, in S25, the composite image creating unit 84 performs a process of creating a composite image in units of frames. Specifically, the elapsed time is acquired, and the <Animation Commands> of the scenario data is referenced with the acquired elapsed time. Then, in accordance with <Animation Commands>, a composite image is created by synthesizing the content specified by ContentsID with the target specified by TargetID. In this way, the process of executing the scenario data is repeated according to the elapsed time until it is determined in S26 that the scenario is ended.

<4. Configuration not using state transition management means>
The image display system of the above embodiments uses the state transition management means 25 so that a detected face image judged to be noise is not treated as being in the viewing state. It is also possible, however, to omit the state transition management means 25 and treat every detected face image as being in the viewing state. A configuration without the state transition management means 25 is described next.

FIG. 18 is a functional block diagram of the functions realized by a computer program installed in the image processing apparatus 2' when the state transition management means 25 is not used. In FIG. 18, components having the same functions as those in FIG. 3 are given the same reference signs, and their detailed description is omitted.

図１８に示す画像処理装置２´は、図３に示したトラッキング手段２３に代えて、トラッキング手段２３´を有している。このトラッキング手段２３´は、図３に示した動画解析手段２４に相当する機能も備えている。 An image processing apparatus 2 ′ illustrated in FIG. 18 includes a tracking unit 23 ′ instead of the tracking unit 23 illustrated in FIG. This tracking means 23 'also has a function corresponding to the moving picture analysis means 24 shown in FIG.

図１８に示す画像処理装置２´は、フレームを解析するにあたり、図４に示したＳ１〜Ｓ５の処理のうち、Ｓ１、Ｓ３の処理は、画像処理装置２と同様にして行う。また、顔検出処理とトラッキング処理は、連携させて実行する。上述のように、Ｓ５の状態遷移管理処理は行わない。 The image processing apparatus 2 ′ illustrated in FIG. 18 performs the processes of S1 and S3 in the same manner as the image processing apparatus 2 among the processes of S1 to S5 illustrated in FIG. Further, the face detection process and the tracking process are executed in cooperation. As described above, the state transition management process in S5 is not performed.

FIG. 19 is a flowchart showing the face detection processing and the tracking processing. First, after the background removal processing S1, when processing frame N, it is determined whether the number of face detection frames in frame N-1 is greater than 0 (S31). If the number of face detection frames in frame N-1 is greater than 0, the tracking means 23' executes the tracking processing (S32).

トラッキング手段２３´は、Ｎ−１フレームにおける各顔検出枠を追跡してＮフレームにおける対応する顔検出枠を特定するものである。トラッキング手段２３´としては、上述の動画解析手段２４が実行する“パーティクルフィルタ”、“ＬＫ法”、“ＣａｍＳｈｉｆｔ手法”等の公知のトラッキング手法を採用することができる。 The tracking means 23 'tracks each face detection frame in the N-1 frame and specifies a corresponding face detection frame in the N frame. As the tracking unit 23 ′, a known tracking method such as “particle filter”, “LK method”, or “CamShift method” executed by the moving image analysis unit 24 can be employed.

Once the tracking of face detection frames from frame N-1 to frame N is finished, the face detection means 21 performs face detection on frame N (S33). The face detection in S33 is identical to the face detection in S2 shown in FIG. 4. If it is determined in S31 that the number of face detection frames in frame N-1 is not greater than 0, the face detection means 21 performs face detection on frame N without tracking from frame N-1 to frame N.

続いて、顔検出処理Ｓ３３において新規に検出されたＮフレームの顔検出枠の数が０より大であるかどうかを判断する（Ｓ３４）。新規に検出されたＮフレームの顔検出枠とは、Ｎフレームで検出された顔検出枠のうち、Ｎ−１フレームからＮフレームへトラッキングされた顔検出枠を除外したものである。 Subsequently, it is determined whether or not the number of N frame face detection frames newly detected in the face detection process S33 is greater than 0 (S34). The newly detected N frame face detection frame is obtained by excluding the face detection frame tracked from the N-1 frame to the N frame from the face detection frames detected in the N frame.

次に、顔検出手段２１が、Ｎフレームにおいて新規に検出された各顔検出枠データに、オブジェクトＩＤを付与し、顔検出枠データ、オブジェクトＩＤ、トラッキング時間で構成される顔オブジェクトを設定する（Ｓ３５）。顔オブジェクトは、オブジェクトＩＤにより特定され、トラッキングにより対応付けられた顔検出枠は、同一のオブジェクトＩＤで特定されることになる。また、トラッキング時間の初期値は０に設定される。 Next, the face detection means 21 assigns an object ID to each face detection frame data newly detected in the N frame, and sets a face object composed of the face detection frame data, the object ID, and the tracking time ( S35). The face object is specified by the object ID, and the face detection frames associated by tracking are specified by the same object ID. The initial value of the tracking time is set to zero.

続いて、Ｎフレームにおける顔検出枠の数が０より大であるかどうかの判断を行う（Ｓ３６）。Ｓ３６においては、Ｎフレームにおいて新規に検出されたかどうかを問わず、既にオブジェクトＩＤが発行された顔検出枠がＮフレームに存在するかどうかを判断する。 Subsequently, it is determined whether or not the number of face detection frames in the N frame is greater than 0 (S36). In S36, it is determined whether or not a face detection frame in which an object ID has already been issued exists in the N frame regardless of whether or not a new detection has been performed in the N frame.

顔検出枠が存在した場合には、各顔検出枠の顔オブジェクトについて、トラッキング時間を算出する（Ｓ３７）。具体的には、直前のＮ−１フレームまでに算出されているトラッキング時間に１フレームに相当する時間を加算することによりＮフレームまでの各顔オブジェクトのトラッキング時間を算出する。トラッキング時間を算出し終えたら、Ｎをインクリメントして（Ｓ３８）、次のＮフレームについての処理に移行する。Ｓ３６における判断の結果、顔検出枠が存在しなかった場合には、Ｎフレームには、追跡すべき対象が存在しないことになるので、トラッキング時間の算出は行わず、Ｎをインクリメントして（Ｓ３８）、次のＮフレームについての処理に移行する。 When the face detection frame exists, the tracking time is calculated for the face object of each face detection frame (S37). Specifically, the tracking time of each face object up to N frames is calculated by adding the time corresponding to one frame to the tracking time calculated up to the immediately preceding N-1 frame. When the tracking time is calculated, N is incremented (S38), and the process proceeds to the next N frame. If the result of determination in S36 is that there is no face detection frame, there is no target to be tracked in N frames, so tracking time is not calculated and N is incremented (S38). ), And shifts to processing for the next N frame.
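The per-frame flow of S31 to S37 can be sketched as below. This is only an outline under several assumptions: `track` and `detect` are caller-supplied stand-ins for the tracking means 23' and the face detection means 21, rectangles are hashable tuples, a detected rectangle counts as “new” when it is not equal to any tracked rectangle (the specification does not state the matching criterion), and S37 is read as adding one frame's time to every face object present in frame N, including newly created ones.

```python
import itertools

def process_frame(prev_faces, frame, track, detect, frame_time, id_source):
    """Run one pass of S31-S37 and return the face objects for frame N.

    prev_faces: {object_id: {"rect": tuple, "tracking_time": float}} from frame N-1.
    track(rect, frame) returns the tracked rect in frame N, or None if lost.
    detect(frame) returns the list of face rects found in frame N.
    id_source: iterator that yields fresh object IDs (0, 1, 2, ...).
    """
    faces = {}
    # S31/S32: track only when frame N-1 had at least one face detection frame.
    if len(prev_faces) > 0:
        for object_id, face in prev_faces.items():
            rect = track(face["rect"], frame)
            if rect is not None:
                faces[object_id] = {"rect": rect, "tracking_time": face["tracking_time"]}
    # S33: face detection on frame N.
    detected = detect(frame)
    # S34: newly detected frames are those not already covered by tracking.
    tracked = {face["rect"] for face in faces.values()}
    new_rects = [rect for rect in detected if rect not in tracked]
    # S35: issue object IDs for new faces with a tracking time of 0.
    for rect in new_rects:
        faces[next(id_source)] = {"rect": rect, "tracking_time": 0.0}
    # S36/S37: if any face exists in frame N, add one frame's worth of time.
    for face in faces.values():
        face["tracking_time"] += frame_time
    return faces

# Stub tracker/detector: a "frame" here is simply the list of face rects it contains.
ids = itertools.count()
track = lambda rect, frame: rect if rect in frame else None
detect = lambda frame: list(frame)
faces_1 = process_frame({}, [(0, 0, 10, 10)], track, detect, 1.0, ids)
faces_2 = process_frame(faces_1, [(0, 0, 10, 10), (50, 50, 10, 10)], track, detect, 1.0, ids)
```

Because IDs are drawn from a counter in detection order, the object IDs come out as 0, 1, 2, and so on in the order the face detection frames are first detected.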

画像処理装置２´の顔検出手段２１、トラッキング手段２３´は、背景除去手段２０により背景処理が行われた各フレームについて、図１９に示した処理を繰り返し実行する。 The face detection unit 21 and the tracking unit 23 ′ of the image processing apparatus 2 ′ repeatedly execute the process shown in FIG. 19 for each frame on which the background process has been performed by the background removal unit 20.

図１９に示した処理において付与された顔オブジェクトは、図９に示したＳ２４において、シナリオデータ対応付け手段８３によりシナリオデータと対応付けられる。図１９に示した処理においては、顔オブジェクトのオブジェクトＩＤは、顔検出枠が検出された順に、“０”“１” “２”“３”と数を１ずつ増加させながら設定される。 The face object given in the process shown in FIG. 19 is associated with the scenario data by the scenario data association unit 83 in S24 shown in FIG. In the processing shown in FIG. 19, the object ID of the face object is set in increments of “0”, “1”, “2”, and “3” in the order in which the face detection frames are detected.

以上、本発明の好適な実施形態について説明したが、本発明は上記実施形態に限定されず、種々の変形が可能である。例えば、上記第１の実施形態では、撮影画像から切り抜いた人物部分と背景画像を合成するとともに、人物部分または背景画像に重畳画像を重ねて合成するようにしたが、重畳画像を重ねず、背景画像のみと合成した場合であっても、十分にリアルタイムで演出効果のある合成画像を作成することが可能である。重畳画像を重ねた場合には、さらに演出効果が高まる。 The preferred embodiments of the present invention have been described above. However, the present invention is not limited to the above embodiments, and various modifications can be made. For example, in the first embodiment, the person portion cut out from the photographed image and the background image are combined and the superimposed image is superimposed on the person portion or the background image. Even when it is synthesized with only the image, it is possible to create a synthesized image having a production effect sufficiently in real time. When the superimposed images are superimposed, the effect of production is further enhanced.

また、上記第１の実施形態では、合成画像作成手段８４によるセグメント領域における画像のセグメント化の具体的な手法として、グラフカットセグメンテーションを用いたが、Ｗａｔｅｒｓｈｅｄアルゴリズムや平均値アルゴリズム等の公知の他の手法を用いても良い。 In the first embodiment, the graph cut segmentation is used as a specific method of segmenting the image in the segment area by the composite image creating unit 84. However, other known methods such as the Watershed algorithm and the average value algorithm are used. A technique may be used.

INDUSTRIAL APPLICABILITY
The present invention is applicable to industries that display images on a display using a computer, and to the digital signage industry, which displays advertisements as video.

DESCRIPTION OF SYMBOLS
1 Image display system
2, 2' Image processing apparatus
20 Background removal means
21 Face detection means
22 Human body detection means
23, 23' Tracking means
24 Moving image analysis means
25 State transition management means
26 Person attribute estimation means
27 Log file output means
3 Display
4 Video camera
6 State transition table
80 Composite target definition means
81 Composite content definition means
82 Animation scenario definition means
83 Scenario data association means
84 Composite image creating means
85 Instruction determination means
86 Command switching means

Claims

An image display system comprising a camera for photographing a person, an image processing device for compositing captured video sent from the camera, and a display for displaying the composited video,
The image processing apparatus includes:
Scenario data storage means for storing scenario data defining the timing of composition of one or more persons on the video and the content;
Content storage means for storing a background image used for composition as content;
Face detection means for detecting a face image captured in one frame of a video transmitted from the camera and outputting the position and rectangular size of the face detection frame as face detection frame data for each detected face image;
Tracking means for associating the face detection frame data acquired from the face detection means with face detection frame data of another frame;
Scenario data associating means for associating a face object including face detection frame data detected by the face detecting means with a person included in the scenario data;
Composite image generating means for setting a segment area in a range including the face detection frame associated with a person included in the scenario data in each frame, masking the background image using a segment result image segmented into the person and portions other than the person using the segment area, and generating a composite image in which the person in the frame and the background image are combined;
An image display system comprising:

The image display system according to claim 1, wherein the content storage means further stores, as content, a superimposed image that is an image to be superimposed on the person or the background, and the composite image generating means assigns the face object to the person in the scenario data according to the association made by the scenario data associating means, changes the position and size of the superimposed image to match the position and size of the face detection frame data of the face object, and generates a composite image by combining the superimposed image onto the frame.

The image display system according to claim 2, wherein the composite image generating means sets the segment area, based on the position of the face detection frame, so that its center in the left-right direction coincides with that of the face detection frame and, in the up-down direction, the center of the face detection frame is located above the center of the segment area, and the segmentation is performed using pixels designated in an area where a person exists and in an area where no person exists, both areas being set at predetermined positions within the segment area.

An image display system comprising a camera for photographing a person, an image processing device for compositing captured video sent from the camera, and a display for displaying the composited video,
The image processing apparatus includes:
Scenario data storage means for storing scenario data defining the timing of composition of one or more persons on the video and the content;
Content storage means for storing, as content, a background image and a superimposed image having an opening in a part surrounded by its periphery;
Face detection means for detecting a face image captured in one frame of a video transmitted from the camera and outputting the position and rectangular size of the face detection frame as face detection frame data for each detected face image;
Tracking means for associating the face detection frame data acquired from the face detection means with face detection frame data of another frame;
Scenario data associating means for associating a face object including face detection frame data detected by the face detecting means with a person included in the scenario data;
Composite image generating means for assigning the face object to the person in the scenario data according to the association, changing the position and size of the superimposed image to match the position and size of the face detection frame data of the face object and combining it, and generating a composite image further combined with the background image masked by a mask image corresponding to the superimposed image;
An image display system comprising: