JP2014187687A

JP2014187687A - Device and method for extracting highlight scene of moving image

Info

Publication number: JP2014187687A
Application number: JP2014017221A
Authority: JP
Inventors: Shotaro Miwa; 祥太郎三輪; Makito Seki; 真規人関; Takashi Hirai; 隆史平位
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2013-02-21
Filing date: 2014-01-31
Publication date: 2014-10-02

Abstract

PROBLEM TO BE SOLVED: To extract a specified highlight scene including a face of a person from a moving image. SOLUTION: A highlight scene extracting device includes: a frame extracting circuit 2 which detects a face of a person from a frame image; a scene composition determining circuit 4 for generating a scene composition feature C including a plurality of feature values determined by a position and a size of the detected face in the frame image; a scene storage device 5 which previously stores, as a scene composition condition D, ranges of the plurality of feature values determined by the position and the size of the face of the person in at least one specified highlight scene; a highlight scene determination circuit 6 for determining whether or not the feature values of the scene composition feature C are within the ranges of the feature values of the scene composition condition D; and an index generation circuit 7 for giving an index of the highlight scene to the frame image when the feature values of the scene composition feature C are within the ranges of the feature values of the scene composition condition D.

Description

The present invention relates to an apparatus and method for extracting highlight scenes from a moving image, and more particularly to an apparatus and method for extracting a specific highlight scene including a human face from a moving image.

As a technique for extracting a human face from a moving image and creating a face index, the invention of Patent Document 1, for example, is known.

The invention of Patent Document 1 includes a moving image storage unit, a face image tracking unit, a representative face determination unit, and a face index construction unit. For each frame image retrieved from the moving image storage unit, the face image tracking unit detects the faces in the frame image; when a detected face remains at the same position across consecutive frames, the unit tracks it as an image of the same person and generates one consolidated set of frame group information. Next, the representative face determination unit selects, from the consolidated frame group, the image closest to a frontal face as the representative face image. The face index construction unit builds a face index using the selected representative face image as a face image key. By choosing a face image key displayed, for example, as a thumbnail image, the user can search for and play back the scenes in which the person with that face appears.

Patent Document 1: JP 2008-252296 A

Non-Patent Document 1: P. Viola, M. Jones, "Rapid Object Detection Using a Boosted Cascade of Simple Features", IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 511-51, 2001

A moving image contains various highlight scenes depending on its content. For example, in a climactic scene of a drama, the faces of two people may be displayed large, and in an important scene or an interview in sports or the like, one person may be displayed large.

In the invention of Patent Document 1, the scenes that a user can select from a moving image are limited to those specified by face image keys. The user can therefore search for and play back scenes in which the person of a face image key appears, but it is difficult to search for and play back a specific highlight scene.

An object of the present invention is to solve the above problem and to provide a highlight scene extraction apparatus and method that extract a specific highlight scene including a human face from a moving image.

A highlight scene extraction apparatus for moving images according to an aspect of the present invention includes:
means for detecting a human face from a frame of a moving image;
means for generating a scene composition feature including a plurality of feature values determined by the position and size of the detected face in the frame;
storage means for storing in advance, as a scene composition condition, ranges of a plurality of feature values determined by the position and size of a person's face in at least one specific highlight scene;
first determination means for determining whether or not the feature values of the scene composition feature are within the ranges of the feature values of the scene composition condition; and
indexing means for assigning a highlight scene index to the frame when the feature values of the scene composition feature are within the ranges of the feature values of the scene composition condition.

According to the present invention, a specific highlight scene including a human face can be extracted from a moving image.

FIG. 1 is a block diagram showing the configuration of a highlight scene extraction apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing the index generation process executed by the highlight scene extraction apparatus of FIG. 1.
FIG. 3 is a diagram showing an example in which the face detection circuit 3 of FIG. 1 detects the face of a person P0 from a frame image 100.
FIG. 4 is a diagram showing an example in which the face detection circuit 3 of FIG. 1 detects the faces of a plurality of persons P0 and P1 from a frame image 101.
FIG. 5 is a table showing the scene composition feature C of a frame image including a plurality of persons P0 to PN, as determined by the scene composition determination circuit 4 of FIG. 1.
FIG. 6 is a diagram showing an example of a frame image 110 containing a drama highlight scene.
FIG. 7 is a diagram showing the attention area 111 and the face Fi of the attention person Pi in the frame image 110 of FIG. 6.
FIG. 8 is a diagram showing the scene composition feature C in the frame image 110 of FIG. 6.
FIG. 9 is a block diagram showing the configuration of a highlight scene extraction apparatus according to Embodiment 2 of the present invention.
FIG. 10 is a flowchart showing the scene selection process S5A executed by the moving image genre determination circuit 11, the scene composition selection circuit 12, and the face tracking circuit 13 of FIG. 9.
FIG. 11 is a diagram showing an example of a frame image 120 containing an interview scene.
FIG. 12 is a diagram showing an example of the scene composition feature C of a moving image containing an interview scene.
FIG. 13 is a block diagram showing the configuration of a highlight scene extraction apparatus according to Embodiment 3 of the present invention.
FIG. 14 is a flowchart showing the index generation process executed by the highlight scene extraction apparatus of FIG. 13.

Embodiments of the present invention will be described below with reference to the drawings. Throughout the drawings, like components are denoted by the same reference numerals.

Embodiment 1.
FIG. 1 is a block diagram showing the configuration of a highlight scene extraction apparatus according to Embodiment 1 of the present invention. The highlight scene extraction apparatus of FIG. 1 includes a moving image input circuit 1, a frame extraction circuit 2, a face detection circuit 3, a scene composition determination circuit 4, a scene storage device 5, a highlight scene determination circuit 6, an index generation circuit 7, and a moving image storage device 8.

The moving image input circuit 1 acquires a moving image from a tuner, from a recording medium such as a DVD or Blu-ray disc, or from a network such as a LAN or the Internet. The moving image input circuit 1 may also acquire a moving image from the moving image storage device 8 described later. The frame extraction circuit 2 generates a series of frame images from the input moving image. The face detection circuit 3 detects human faces in each generated frame image. To detect faces in a frame image, the method of Non-Patent Document 1 can be used, for example.

FIG. 3 is a diagram showing an example in which the face detection circuit 3 of FIG. 1 detects the face of a person P0 from a frame image 100. FIG. 4 is a diagram showing an example in which the face detection circuit 3 of FIG. 1 detects the faces of a plurality of persons P0 and P1 from a frame image 101. The frame image 100 of FIG. 3 has a size of h × w. It includes the person P0, whose face is detected as a rectangular region F0 having vertices A0 and B0. The face F0 of the person P0 is, for example, a square region; its position is specified by the centroid G0 = (x_0, y_0), and its size is specified by the side length l_0. Likewise, the frame image 101 of FIG. 4 includes the persons P0 and P1; the face F1 of the person P1 is, for example, a square region whose position is specified by the centroid G1 = (x_1, y_1) and whose size is specified by the side length l_1.
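The face geometry described above can be sketched in Python as follows. This is only an illustration: it assumes that A0 and B0 are diagonally opposite corners of the detected rectangle in pixel coordinates, and the names `Face` and `face_from_corners` are hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Face:
    """A detected face: centroid (x, y) and side length l, all in pixels."""
    x: float
    y: float
    l: float


def face_from_corners(a: Tuple[float, float], b: Tuple[float, float]) -> Face:
    """Build a Face from two diagonally opposite corners A and B (assumed layout)."""
    (ax, ay), (bx, by) = a, b
    width, height = abs(bx - ax), abs(by - ay)
    # The text treats the face as a square region. For a non-square box this
    # sketch takes the longer side as l, which is an assumption of the sketch.
    return Face(x=(ax + bx) / 2, y=(ay + by) / 2, l=max(width, height))
```

For a square box with corners (100, 50) and (300, 250), the centroid is (200, 150) and the side length is 200.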

The scene composition determination circuit 4 generates a scene composition feature C including a plurality of feature values determined by the position and size of the faces detected in the frame image. A climactic or important scene in a moving image can be characterized by the spatial arrangement of the persons appearing in it. The feature values of the scene composition feature C therefore describe this spatial arrangement.

FIG. 5 is a table showing the scene composition feature C of a frame image including a plurality of persons P0 to PN, as determined by the scene composition determination circuit 4 of FIG. 1. Each element C_{ij} (1 ≦ i, j ≦ N) of the scene composition feature C is defined for an arbitrary pair of persons Pi and Pj as follows.

[Equation 1]
C_{ij} = (x_i, y_i, h_i, s_{ij}, d_{ij})
[Equation 2]
h_i = l_i / h
[Equation 3]
s_{ij} = l_j / l_i
[Equation 4]
d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} / h

Here, x_i and y_i indicate the position (centroid) of the face of the person Pi. h_i is the size of the face of the person Pi normalized by the frame image size h. s_{ij} is the ratio of the face sizes of the persons Pi and Pj. d_{ij} is the distance between the faces of the persons Pi and Pj normalized by the frame image size h. In other words, each element C_{ij} of the scene composition feature C contains feature values that identify the face of the person Pi itself (position x_i, y_i and size h_i) and feature values that express its relationship to the face of another person Pj (size ratio s_{ij} and distance d_{ij}). Because a typical frame image is landscape (w > h), Equations 2 and 4 use the length h of the short side.
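Equations 1 to 4 can be sketched in Python as below. This is a minimal illustration, not the patented implementation: the names `Face`, `Element`, `composition_element`, and `scene_composition_feature` are hypothetical, faces are indexed from 0 rather than 1, and positions are kept in pixels as in Equation 1.

```python
import math
from typing import Dict, NamedTuple, Sequence, Tuple


class Face(NamedTuple):
    x: float  # centroid x in pixels
    y: float  # centroid y in pixels
    l: float  # side length in pixels


class Element(NamedTuple):
    x: float  # x_i
    y: float  # y_i
    h: float  # h_i, face size normalized by frame height
    s: float  # s_ij, size ratio of face j to face i
    d: float  # d_ij, centroid distance normalized by frame height


def composition_element(fi: Face, fj: Face, frame_h: float) -> Element:
    h_i = fi.l / frame_h                                   # Equation 2
    s_ij = fj.l / fi.l                                     # Equation 3
    d_ij = math.hypot(fi.x - fj.x, fi.y - fj.y) / frame_h  # Equation 4
    return Element(fi.x, fi.y, h_i, s_ij, d_ij)            # Equation 1


def scene_composition_feature(
    faces: Sequence[Face], frame_h: float
) -> Dict[Tuple[int, int], Element]:
    """Return {(i, j): C_ij} for every ordered pair of faces with i != j."""
    return {
        (i, j): composition_element(fi, fj, frame_h)
        for i, fi in enumerate(faces)
        for j, fj in enumerate(faces)
        if i != j
    }
```

With two faces of side 360 px whose centroids are 600 px apart in a frame 720 px high, each element has h_i = 0.5, s_ij = 1.0, and d_ij = 600/720.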

The scene storage device 5 stores in advance, as a scene composition condition D, ranges of a plurality of feature values determined by the position and size of a person's face in at least one specific highlight scene. In other words, the scene composition condition D gives the predetermined ranges of the feature values of the scene composition feature C that a frame image is expected to have when it is a specific highlight scene.

An exemplary scene composition condition D is shown below.

[Equation 5]
r_0 / 3 \le x_i / h \le r_0 \times 2 / 3
[Equation 6]
1/3 \le y_i / h \le 2/3
[Equation 7]
1/2 \le h_i \le 1
[Equation 8]
0.8 \le s_{ij} \le 1.2
[Equation 9]
h_i \times 1.5 \le d_{ij} \le h_i \times 2.5

Here, the coefficient r_0 = w / h. The scene composition condition D of Equations 5 to 9 applies to the case where the moving image is a drama. A drama has a main character and important persons mainly associated with that character, and in scenes that matter to the development of the story, the main character and such an important person are often displayed large. The scene composition feature C and the scene composition condition D for such a scene are described below with reference to FIGS. 6 to 8.

FIG. 6 is a diagram showing an example of a frame image 110 containing a drama highlight scene. FIG. 7 is a diagram showing the attention area 111 and the face Fi of the attention person Pi in the frame image 110 of FIG. 6. FIG. 8 is a diagram showing the scene composition feature C in the frame image 110 of FIG. 6. The frame image 110 of FIG. 6 includes a person Pi, who is the attention person (main character), and a person Pj, who is a peripheral person. First, the person Pi whose face Fi is displayed large in the attention area 111 at the center of the frame image 110 is determined to be the attention person (FIG. 7). Then the person Pj whose face Fj is displayed similarly large near the attention person Pi is determined to be the peripheral person (FIG. 8). The attention person and the peripheral person are persons whose face positions (x_i, y_i; x_j, y_j) and sizes (h_i; h_j) satisfy Equations 5 to 7. When the frame image 110 additionally satisfies Equations 8 and 9, it is judged to be a drama highlight scene. The ranges of the feature values given in Equations 5 to 9 are not fixed and can be set freely at design time.
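The range test of Equations 5 to 9 can be sketched in Python as follows. This is a sketch under stated assumptions: the class `DramaCondition` and the function `satisfies` are hypothetical names, the default bounds are the example values from the text, and the function checks a single element C_ij exactly as Equations 5 to 9 are written (it does not separately re-check the position and size of the peripheral person Pj).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DramaCondition:
    """Example ranges from Equations 5 to 9; all bounds are design parameters."""
    x_lo: float = 1 / 3  # multiplied by r_0 = w / h (Equation 5)
    x_hi: float = 2 / 3
    y_lo: float = 1 / 3  # Equation 6
    y_hi: float = 2 / 3
    h_lo: float = 0.5    # Equation 7
    h_hi: float = 1.0
    s_lo: float = 0.8    # Equation 8
    s_hi: float = 1.2
    d_lo: float = 1.5    # multiplied by h_i (Equation 9)
    d_hi: float = 2.5


def satisfies(
    c_ij: Tuple[float, float, float, float, float],
    frame_w: float,
    frame_h: float,
    cond: DramaCondition = DramaCondition(),
) -> bool:
    """Return True when C_ij = (x_i, y_i, h_i, s_ij, d_ij) lies inside condition D."""
    x_i, y_i, h_i, s_ij, d_ij = c_ij
    r0 = frame_w / frame_h
    return (
        r0 * cond.x_lo <= x_i / frame_h <= r0 * cond.x_hi    # Equation 5
        and cond.y_lo <= y_i / frame_h <= cond.y_hi          # Equation 6
        and cond.h_lo <= h_i <= cond.h_hi                    # Equation 7
        and cond.s_lo <= s_ij <= cond.s_hi                   # Equation 8
        and h_i * cond.d_lo <= d_ij <= h_i * cond.d_hi       # Equation 9
    )
```

Keeping the bounds in a separate object reflects the remark that the ranges are not fixed: a different genre would simply supply a different set of bounds.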

The scene composition condition D stored in the scene storage device 5 is not limited to the ranges of feature values exemplified in Equations 5 to 9. The scene storage device 5 may also store ranges of feature values of the scene composition feature C that highlight scenes other than drama are expected to have.

The highlight scene determination circuit 6 determines whether or not the feature values of the scene composition feature C generated by the scene composition determination circuit 4 are within the ranges of the feature values of the scene composition condition D stored in the scene storage device 5. Then, when the feature values of the scene composition feature C are within the ranges of the feature values of the scene composition condition D, the index generation circuit 7 assigns a highlight scene index to the frame image.

The determination by the highlight scene determination circuit 6 is described by the following code.

[Equation 10]
for i := 1 to N do
  for j := 1 to N do
    if i ≠ j then
      val := calc_corr(C_{ij}, D)
      if val == true then
        make_index(i, j, D)
      end
    end
  end
end

According to Equation 10, the function calc_corr(C_{ij}, D) is executed for every element C_{ij} (1 ≦ i, j ≦ N) with i ≠ j of the scene composition feature C of a frame image, and the function make_index(i, j, D) is executed whenever calc_corr(C_{ij}, D) returns true. The function calc_corr(C_{ij}, D) determines whether or not the feature values of the scene composition feature C of the frame image are within the ranges of the feature values of the scene composition condition D of Equations 5 to 9; it returns true if they are and false otherwise. When calc_corr(C_{ij}, D) returns true, the index generation circuit 7 executes make_index(i, j, D) and thereby assigns a highlight scene index to the frame image. Specifically, the index generation circuit 7 stores the time stamp information of the frame image in the moving image storage device 8, which indicates that the frame image contains a highlight scene.
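The loop of Equation 10 might look like this in Python. It is a sketch only: `calc_corr` is passed in as a predicate on one element, the list `index` stands in for the moving image storage device 8, and the body of `make_index` is reduced to appending the time stamp, as the text describes. The name `index_frame` and the tuple layout are assumptions of this sketch.

```python
from typing import Callable, Dict, List, Tuple

Element = Tuple[float, float, float, float, float]  # (x_i, y_i, h_i, s_ij, d_ij)


def index_frame(
    timestamp: float,
    feature: Dict[Tuple[int, int], Element],
    calc_corr: Callable[[Element], bool],
    index: List[Tuple[float, int, int]],
) -> bool:
    """Test every C_ij with i != j and record each match in `index`."""
    matched = False
    for (i, j), c_ij in sorted(feature.items()):
        if i != j and calc_corr(c_ij):
            # Stand-in for make_index(i, j, D): store the frame's time stamp
            # together with the pair of persons that matched.
            index.append((timestamp, i, j))
            matched = True
    return matched
```

Returning a flag in addition to filling `index` lets a caller decide per frame whether anything was indexed without inspecting the list.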

The moving image storage device 8 stores the highlight scene indexes as well as the moving image acquired by the moving image input circuit 1. When playing back the moving image, the user can use the highlight scene indexes to search for highlight scenes in it. The storage device that stores the highlight scene indexes and the storage device that stores the moving image acquired by the moving image input circuit 1 may be provided separately.

FIG. 2 is a flowchart showing the index generation process executed by the highlight scene extraction apparatus of FIG. 1. In step S1 of FIG. 2, the moving image input circuit 1 acquires a moving image. In step S2, the frame extraction circuit 2 generates a frame image from the acquired moving image. In step S3, the face detection circuit 3 detects faces in the generated frame image. In step S4, the scene composition determination circuit 4 generates a scene composition feature from the detected face information. In the scene selection process of step S5, the highlight scene determination circuit 6 reads the scene composition condition from the scene storage device 5. In step S6, the highlight scene determination circuit 6 determines whether or not the feature values of the scene composition feature are within the predetermined ranges of the feature values of the scene composition condition; if YES, the process proceeds to step S7, and if NO, to step S8. In step S7, the index generation circuit 7 assigns a highlight scene index to the frame and stores the index in the moving image storage device 8. In step S8, the scene composition determination circuit 4 generates the next frame image from the acquired moving image, and the process returns to step S3.
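The control flow of FIG. 2 can be summarized in a short Python sketch. The individual circuits are represented by caller-supplied functions, so nothing here claims to be the actual detection or matching logic; `generate_index` and its parameter names are hypothetical.

```python
from typing import Callable, Iterable, List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")
Face = Tuple[float, float, float]  # (x, y, l) of one detected face


def generate_index(
    frames: Iterable[Tuple[float, Frame]],              # S1/S2: (time stamp, frame)
    detect_faces: Callable[[Frame], Sequence[Face]],    # S3
    build_feature: Callable[[Sequence[Face]], object],  # S4
    condition: object,                                  # S5: read from scene storage
    within_range: Callable[[object, object], bool],     # S6
) -> List[float]:
    """Return the time stamps of all frames judged to be highlight scenes."""
    index: List[float] = []
    for timestamp, frame in frames:           # S8: move on to the next frame
        faces = detect_faces(frame)           # S3
        feature = build_feature(faces)        # S4
        if within_range(feature, condition):  # S6
            index.append(timestamp)           # S7
    return index
```

Because each stage is injected, the same loop can be exercised with trivial stand-ins (for example, counting faces instead of computing C) before real detectors are plugged in.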

As described above, the highlight scene extraction apparatus according to Embodiment 1 of the present invention can extract a specific highlight scene including a human face from a moving image.

The present invention focuses on the fact that the intention of the person who shot the moving image (whether or not a scene is a highlight scene) appears as the arrangement of persons in the frame image. To capture this arrangement, the invention generates a scene composition feature including a plurality of feature values determined by the position and size of the faces detected in the frame image. According to the present invention, it is determined whether or not the feature values of the scene composition feature generated from a frame image are within the ranges of the feature values of a scene composition condition stored in advance, so that only specific highlight scenes can be searched for and played back from the moving image.

The conventional technique considers only whether or not a face appears in the moving image: it detects faces in each frame image and creates a face index for searching the portions in which a face appears. In contrast, Embodiment 1 of the present invention uses the scene composition feature and can therefore extract a specific highlight scene in which a person appears.

Embodiment 2.
In general, a moving image is associated with a specific genre and may contain different highlight scenes depending on that genre. In Embodiment 2 of the present invention, a different scene composition condition is selected according to the genre of the moving image.

FIG. 9 is a block diagram showing the configuration of a highlight scene extraction apparatus according to Embodiment 2 of the present invention. In addition to the components of the highlight scene extraction apparatus of FIG. 1, the apparatus of FIG. 9 includes a moving image genre determination circuit 11, a scene composition selection circuit 12, and a face tracking circuit 13. In place of the index generation circuit 7 of FIG. 1, the apparatus of FIG. 9 includes an index generation circuit 7A that operates according to a signal output from the face tracking circuit 13. The scene storage device 5 stores in advance, as scene composition conditions, ranges of a plurality of feature values determined by the position and size of a person's face in highlight scenes of at least one specific genre.

The moving image genre determination circuit 11 acquires genre information indicating the genre of the moving image (drama, news, sports, etc.), for example from EPG (electronic program guide) information in the broadcast wave. Next, the moving image genre determination circuit 11 determines whether or not the moving image belongs to a genre that includes interview scenes. When the moving image belongs to a genre that includes interview scenes (such as news or sports), the scene composition selection circuit 12 selects from the scene storage device 5 the scene composition condition corresponding to highlight scenes of that genre and sends it to the highlight scene determination circuit 6, and the face tracking circuit 13 tracks the face detected in the frame images over a certain period of time. On the other hand, when the moving image belongs to a genre that does not include interview scenes (such as drama), the scene composition selection circuit 12 selects from the scene storage device 5 the scene composition condition corresponding to highlight scenes of that genre and sends it to the highlight scene determination circuit 6, and the face tracking circuit 13 does not perform face tracking.

Face tracking by the face tracking circuit 13 is now described with reference to FIGS. 11 and 12.

FIG. 11 is a diagram showing an example of a frame image 120 containing an interview scene. FIG. 12 is a diagram showing an example of the scene composition feature of a moving image containing an interview scene. The moving image input circuit 1, the frame extraction circuit 2, and the face detection circuit 3 operate in the same manner as in Embodiment 1. When the face F10 of the person P10 included in the frame image 120 is detected at time t_0 in FIG. 12, the face tracking circuit 13 determines, for each pair of adjacent frame images among the temporally consecutive frame images starting from the frame image at time t_0, whether or not the amount of change in the position or size of the detected face is smaller than a predetermined threshold. In other words, the face tracking circuit 13 determines whether or not either of the following expressions is satisfied for the threshold th_d on the change in face position or the threshold th_h on the change in face size.

[Equation 11]
th_d \le \sqrt{(x(n+1) - x(n))^2 + (y(n+1) - y(n))^2}
[Equation 12]
th_h \le |h(n+1) - h(n)|

Here, the integer n denotes the discretized time t.

In the example of FIG. 12, the frame image 120 contains a substantially constant face F10 throughout the period from time t_0 to time t_1, so neither Equation 11 nor Equation 12 holds. During the period from time t_0 to time t_1, frame images in which the change in face position or size is smaller than the threshold th_d or th_h follow one another; this period is called the "tracking time length ta".
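A possible way to measure the tracking time length ta from Equations 11 and 12 is sketched below in Python. The sketch assumes one tracked face per frame, given as (time, x, y, h) samples in whatever units the thresholds th_d and th_h are expressed in (the source does not fix the units); the function name `tracking_time_length` is hypothetical.

```python
import math
from typing import List, Tuple

Sample = Tuple[float, float, float, float]  # (time, x, y, h) of the tracked face


def tracking_time_length(samples: List[Sample], th_d: float, th_h: float) -> float:
    """Return ta: how long the face stays stable, starting at samples[0]."""
    if not samples:
        return 0.0
    t_0 = samples[0][0]
    t_end = t_0
    for (_, x0, y0, h0), (t1, x1, y1, h1) in zip(samples, samples[1:]):
        moved = th_d <= math.hypot(x1 - x0, y1 - y0)  # Equation 11
        resized = th_h <= abs(h1 - h0)                # Equation 12
        if moved or resized:
            break  # the stable period ends at the first large change
        t_end = t1
    return t_end - t_0
```

In this sketch the period ends at the last frame before either expression becomes true, which matches the description that neither expression holds between t_0 and t_1.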

顔追跡回路１３は、数１１及び数１２のいずれかが満たされたとき、追跡時間長ｔａが所定のしきい値ｔｈ_１を超えているか否かを判定する。追跡時間長ｔａがしきい値ｔｈ_１を超えているとき、フレーム画像１２０はインタビューシーン（すなわち、フレーム画像の中央に同一人物の顔が長時間にわたって大きく表示されているシーン）を含むと考えられる。このとき、顔追跡回路１３は、インタビューシーンに対応するシーン構図条件をシーン記憶装置５から選択してハイライトシーン判定回路６に送るようにシーン構図選択回路１２に指示し、さらに、顔追跡回路１３は、追跡時間長ｔａの開始時間ｔ_０をインデックス生成回路７Ａに通知する。一方、追跡時間長ｔａがしきい値ｔｈ_１を超えていないとき、フレーム画像１２０はインタビューシーンを含まないと考えられる。このとき、顔追跡回路１３は、動画像ジャンル決定回路１１によって判定された動画像のジャンルに対応するシーン構図条件をシーン記憶装置５から選択してハイライトシーン判定回路６に送るようにシーン構図選択回路１２に指示する。 The face tracking circuit 13 determines whether or not the tracking time length ta exceeds a predetermined threshold th ₁ when either of the expressions 11 and 12 is satisfied. When tracking the time length ta exceeds the threshold th _1, considered frame image 120 includes interview scene (i.e., scene same person's face in the center of the frame image is displayed larger over time) . At this time, the face tracking circuit 13 instructs the scene composition selection circuit 12 to select a scene composition condition corresponding to the interview scene from the scene storage device 5 and send it to the highlight scene determination circuit 6, and further, the face tracking circuit 13 notifies the start time _{t 0} of the tracking time length ta in the index generating circuit 7A. Meanwhile, when the track length of time ta has not exceeded the threshold th _1, the frame image 120 is considered to contain no interview scene. At this time, the face tracking circuit 13 selects a scene composition condition corresponding to the genre of the moving image determined by the moving image genre determination circuit 11 from the scene storage device 5 and sends it to the highlight scene determination circuit 6. The selection circuit 12 is instructed.

Based on the scene composition condition sent from the scene composition selection circuit 12, the highlight scene determination circuit 6 determines, as in Embodiment 1, whether the feature values of the scene composition feature are within the ranges of the feature values of the scene composition condition.

When the tracking time length ta exceeds the threshold th_1 (that is, when the tracking time length ta is input from the face tracking circuit 13) and the feature values of the scene composition feature are within the ranges of the feature values of the scene composition condition, the index generation circuit 7A assigns a highlight scene index to the frame image at the start time t_0 of the tracking time length ta. When the tracking time length ta is equal to or less than the threshold th_1 (that is, when the tracking time length ta is not input from the face tracking circuit 13) and the feature values of the scene composition feature are within the ranges of the feature values of the scene composition condition, the index generation circuit 7A assigns a highlight scene index to the current frame image.
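The choice of which frame receives the index can be illustrated with a small sketch. The function name and arguments are hypothetical; `tracked_start` stands for the start time t_0 that the face tracking circuit 13 supplies only when ta exceeds th_1, and `composition_ok` stands for the result from the highlight scene determination circuit 6.

```python
def index_time(composition_ok, current_time, tracked_start=None):
    """Return the time of the frame that receives the highlight scene
    index, or None when no index is assigned.

    tracked_start is None when the tracking time length ta did not
    exceed th_1 (nothing was input from the face tracking circuit).
    """
    if not composition_ok:
        return None  # the scene composition condition is not met
    if tracked_start is not None:
        return tracked_start  # index the frame at start time t_0
    return current_time  # index the current frame
```

Under these assumptions, an interview scene is indexed at its beginning rather than at the frame where the tracking ended.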

FIG. 10 is a flowchart showing the scene selection process S5A executed by the moving image genre determination circuit 11, the scene composition selection circuit 12, and the face tracking circuit 13 of FIG. 9. The scene selection process S5A of FIG. 10 is executed in place of the scene selection process S5 of FIG. 2. In step S11 of FIG. 10, the moving image genre determination circuit 11 acquires the genre information of the moving image. In step S12, the moving image genre determination circuit 11 determines whether the moving image belongs to a genre that includes interview scenes; if YES, the process proceeds to step S13, and if NO, the process proceeds to step S16. In step S13, the face tracking circuit 13 tracks the face detected in the frame image over a certain period of time. In step S14, the face tracking circuit 13 determines whether the tracking time length ta of the face image exceeds the threshold th_1; if YES, the process proceeds to step S15, and if NO, the process proceeds to step S16. In step S15, the scene composition selection circuit 12 sends the scene composition condition corresponding to the interview scene to the highlight scene determination circuit 6, and the process proceeds to step S6 of FIG. 2. In step S16, the scene composition selection circuit 12 sends the scene composition condition corresponding to the genre of the moving image determined by the moving image genre determination circuit 11 to the highlight scene determination circuit 6, and the process proceeds to step S6 of FIG. 2.
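The branching of steps S12 to S16 can be sketched as below. This is a minimal sketch under stated assumptions: the dictionary `conditions` stands in for the scene storage device 5, the key "interview" and the genre names are hypothetical, and `measure_tracking_time` is a hypothetical callable that performs the face tracking of step S13 and returns ta.

```python
def select_scene_condition(genre, measure_tracking_time, th_1,
                           conditions, interview_genres):
    """Return the scene composition condition to send to the highlight
    scene determination circuit (steps S12 to S16 of FIG. 10)."""
    if genre in interview_genres:          # S12: genre includes interviews?
        ta = measure_tracking_time()       # S13: track the detected face
        if ta > th_1:                      # S14: tracked long enough?
            return conditions["interview"]  # S15: interview condition
    return conditions[genre]               # S16: genre-specific condition
```

Passing the tracking step as a callable mirrors the flowchart, in which face tracking is performed only for genres that can contain interview scenes.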

As described above, the highlight scene extraction device according to Embodiment 2 of the present invention can select different scene composition conditions according to the genre of the moving image.

The scene composition of a highlight scene differs depending on the genre of the moving image. Therefore, by acquiring genre information indicating the genre of the moving image and using the scene composition condition corresponding to a highlight scene of the same genre as the moving image, a specific highlight scene can be extracted from a moving image of a specific genre.

Embodiment 3.
In general, video information and audio information in a moving image are closely related to each other. Therefore, when determining a highlight scene, using both a scene composition feature and an audio feature that is highly relevant to that scene composition feature makes it possible to extract the highlight scene more accurately.

FIG. 13 is a block diagram showing the configuration of a highlight scene extraction device according to Embodiment 3 of the present invention. The highlight scene extraction device of FIG. 13 includes a scene storage device 5B and a highlight scene determination circuit 6B in place of the scene storage device 5 and the highlight scene determination circuit 6 of FIG. 1, and further includes an audio feature generation circuit 21. The audio feature generation circuit 21 separates the audio information from the moving image, generates an audio feature, and sends the audio feature to the highlight scene determination circuit 6B. In addition to the ranges of the feature values of the scene composition condition, the scene storage device 5B stores in advance, as an audio condition, the ranges of a plurality of feature values related to the audio information before and after at least one specific highlight scene. In other words, the audio condition indicates the ranges of the feature values of the audio feature that the audio information before and after a specific highlight scene is expected to have when the frame image is that highlight scene.

FIG. 14 is a flowchart showing the index generation process executed by the highlight scene extraction device of FIG. 13. Steps S1 to S4 of FIG. 14 are the same as steps S1 to S4 of FIG. 2. In step S21, the audio feature generation circuit 21 separates the audio information from the moving image acquired by the moving image input circuit 1 (step S1), and in step S22, it generates an audio feature from the audio information and sends it to the highlight scene determination circuit 6B. In the scene selection process of step S5B, the highlight scene determination circuit 6B reads the scene composition condition and the audio condition from the scene storage device 5B. In step S6B, the highlight scene determination circuit 6B determines whether the feature values of the scene composition feature and the audio feature are within the predetermined ranges of the feature values of the scene composition condition and the audio condition; if YES, the process proceeds to step S7, and if NO, the process proceeds to step S8. Steps S7 and S8 of FIG. 14 are the same as steps S7 and S8 of FIG. 2.
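The combined determination of step S6B can be sketched as a loop over frames. This is an illustrative sketch only: the feature names, the representation of each condition as a dictionary of (low, high) ranges, and the assumption that the YES branch (step S7, defined with FIG. 2) is the step that assigns the index are not taken from this section.

```python
def within(values, ranges):
    """True when every feature value lies inside its stored range."""
    return all(low <= values[name] <= high
               for name, (low, high) in ranges.items())


def generate_indexes(frames, composition_condition, audio_condition):
    """Return the times of the frames that satisfy both conditions.

    frames: iterable of (time, composition_features, audio_features),
    where both feature arguments are dicts of hypothetical feature names.
    """
    indexes = []
    for time, composition, audio in frames:
        # S6B: both the scene composition feature and the audio feature
        # must fall inside the ranges read from the scene storage device.
        if within(composition, composition_condition) and within(audio, audio_condition):
            indexes.append(time)  # assumed to correspond to step S7
    return indexes
```

A one-sided condition such as "volume above a threshold" can be expressed in this sketch with a range whose upper bound is `float("inf")`.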

As the audio feature, one that is highly relevant to the scene composition feature to be extracted can be used. For example, since the volume (loudness of the sound) |V| often becomes large in a highlight scene, the volume |V| may be used as the audio feature and a volume threshold |Vth| may be used as the audio condition. In this case, in step S6B, the highlight scene determination circuit 6B determines whether the volume |V| is larger than the threshold |Vth|. In addition, since the frequency of the sound becomes high when the audience is excited in a highlight scene, the sound frequency |Fv| may be used as the audio feature and a frequency threshold |Fth| may be used as the audio condition. In that case, the highlight scene determination circuit 6B determines whether the frequency |Fv| is higher than the threshold |Fth|.
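The two audio features mentioned above could be computed, for example, as in the following sketch. The source does not say how |V| and |Fv| are measured, so the root-mean-square amplitude and the zero-crossing frequency estimate used here are assumptions chosen for simplicity, and all function and parameter names are hypothetical.

```python
import math


def volume(samples):
    """Root-mean-square amplitude, used here as the volume |V|."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_frequency(samples, sample_rate):
    """Rough estimate of the sound frequency |Fv| in Hz.

    A periodic signal crosses zero about twice per cycle, so the
    frequency is approximated as crossings / (2 * duration).
    """
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a < 0) != (b < 0))
    duration = len(samples) / sample_rate
    return crossings / (2 * duration)


def audio_condition_met(samples, sample_rate, v_th=None, f_th=None):
    """Check whichever thresholds are given: |V| > |Vth| and/or |Fv| > |Fth|."""
    if v_th is not None and not volume(samples) > v_th:
        return False
    if f_th is not None and not zero_crossing_frequency(samples, sample_rate) > f_th:
        return False
    return True
```

Leaving either threshold unset reflects that the description presents the volume and the frequency as alternative audio features rather than requiring both.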

As described above, the highlight scene extraction device according to Embodiment 3 of the present invention combines the scene composition feature obtained from the video information with the audio feature obtained from the audio information, and can thereby extract a specific highlight scene including a human face from a moving image more accurately.

The moving image highlight scene extraction device and method of the present invention are applicable to hard disk recorders and other devices that record moving images.

DESCRIPTION OF SYMBOLS: 1 moving image input circuit, 2 frame extraction circuit, 3 face detection circuit, 4 scene composition determination circuit, 5, 5B scene storage device, 6, 6B highlight scene determination circuit, 7, 7A index generation circuit, 8 moving image storage device, 11 moving image genre determination circuit, 12 scene composition selection circuit, 13 face tracking circuit, 100, 101, 110, 120 frame image, 111 region of interest, A0, B0 vertex, F0, F1, Fi, Fj, F10 face, G0, G1, Gi, Gj, G10 center of gravity, P0, P1, Pi, Pj, P10 person.

Claims

1. A moving image highlight scene extraction device comprising:
means for detecting a human face from a frame of a moving image;
means for generating a scene composition feature including a plurality of feature values determined by the position and size of the detected face in the frame;
storage means for storing in advance, as a scene composition condition, ranges of a plurality of feature values determined by the position and size of a person's face in at least one specific highlight scene;
first determination means for determining whether the feature values of the scene composition feature are within the ranges of the feature values of the scene composition condition; and
index means for assigning a highlight scene index to the frame when the feature values of the scene composition feature are within the ranges of the feature values of the scene composition condition.

2. The moving image highlight scene extraction device according to claim 1, wherein
the moving image is associated with a specific genre,
the storage means stores in advance, as scene composition conditions, ranges of a plurality of feature values determined by the position and size of a person's face in a highlight scene of at least one specific genre,
the moving image highlight scene extraction device further comprises:
means for acquiring genre information indicating the genre of the moving image; and
means for selecting, from the storage means, a scene composition condition corresponding to a highlight scene of the same genre as the genre of the moving image, and
the first determination means determines whether the feature values of the scene composition feature are within the ranges of the feature values of the selected scene composition condition.

3. The moving image highlight scene extraction device according to claim 1, wherein
the feature values of the scene composition feature include, when the frame includes a plurality of persons, a ratio of the face sizes of the plurality of persons and a distance between the faces of the plurality of persons, and
the scene composition condition includes a range of the ratio of the face sizes of a plurality of persons and a range of the distance between the faces of the plurality of persons.

4. The moving image highlight scene extraction device according to any one of claims 1 to 3, further comprising
second determination means for determining whether the amount of change in the position or size of the face detected in each pair of adjacent frames, among a plurality of frames temporally continuous from a first frame, is smaller than a first threshold, and for determining whether the time length of the continuous frames in which the amount of change is smaller than the first threshold exceeds a second threshold, wherein
the index means assigns a highlight scene index to the first frame when the time length during which the amount of change is smaller than the first threshold exceeds the second threshold and the feature values of the scene composition feature are within the ranges of the feature values of the scene composition condition.

5. The moving image highlight scene extraction device according to any one of claims 1 to 4, wherein
the moving image includes video information and audio information,
the highlight scene extraction device further comprises means for generating an audio feature including a plurality of feature values related to the audio information,
the storage means stores in advance, as an audio condition, ranges of a plurality of feature values related to the audio information before and after at least one specific highlight scene,
the first determination means determines whether the feature values of the scene composition feature and the audio feature are within the ranges of the feature values of the scene composition condition and the audio condition, and
the index means assigns a highlight scene index to the frame when the feature values of the scene composition feature and the audio feature are within the ranges of the feature values of the scene composition condition and the audio condition.

6. A moving image highlight scene extraction method comprising:
a step of detecting a human face from a frame of a moving image;
a step of generating a scene composition feature including a plurality of feature values determined by the position and size of the detected face in the frame;
a step of storing in advance in storage means, as a scene composition condition, ranges of a plurality of feature values determined by the position and size of a person's face in at least one specific highlight scene;
a step of determining whether the feature values of the scene composition feature are within the ranges of the feature values of the scene composition condition; and
a step of assigning a highlight scene index to the frame when the feature values of the scene composition feature are within the ranges of the feature values of the scene composition condition.

7. The moving image highlight scene extraction method according to claim 6, wherein
the moving image is associated with a specific genre,
the storage means stores in advance, as scene composition conditions, ranges of a plurality of feature values determined by the position and size of a person's face in a highlight scene of at least one specific genre, and
the moving image highlight scene extraction method further comprises:
a step of acquiring genre information indicating the genre of the moving image;
a step of selecting, from the storage means, a scene composition condition corresponding to a highlight scene of the same genre as the genre of the moving image; and
a step of determining whether the feature values of the scene composition feature are within the ranges of the feature values of the selected scene composition condition.

8. The moving image highlight scene extraction method according to claim 6, wherein
the feature values of the scene composition feature include, when the frame includes a plurality of persons, a ratio of the face sizes of the plurality of persons and a distance between the faces of the plurality of persons, and
the scene composition condition includes a range of the ratio of the face sizes of a plurality of persons and a range of the distance between the faces of the plurality of persons.

9. The moving image highlight scene extraction method according to claim 6, further comprising:
a step of determining whether the amount of change in the position or size of the face detected in each pair of adjacent frames, among a plurality of frames temporally continuous from a first frame, is smaller than a first threshold, and determining whether the time length of the continuous frames in which the amount of change is smaller than the first threshold exceeds a second threshold; and
a step of assigning a highlight scene index to the first frame when the time length during which the amount of change is smaller than the first threshold exceeds the second threshold and the feature values of the scene composition feature are within the ranges of the feature values of the scene composition condition.

10. The moving image highlight scene extraction method according to any one of claims 6 to 9, wherein
the moving image includes video information and audio information, and
the moving image highlight scene extraction method further comprises:
a step of generating an audio feature including a plurality of feature values related to the audio information;
a step of storing in advance, as an audio condition, ranges of feature values related to the audio information before and after at least one specific highlight scene;
a step of determining whether the feature values of the scene composition feature and the audio feature are within the ranges of the feature values of the scene composition condition and the audio condition; and
a step of assigning a highlight scene index to the frame when the feature values of the scene composition feature and the audio feature are within the ranges of the feature values of the scene composition condition and the audio condition.