JP5538781B2

JP5538781B2 - Image search apparatus and image search method

Info

Publication number: JP5538781B2
Application number: JP2009202800A
Authority: JP
Inventors: 靖浩伊藤; 哲八代; 光太郎矢野
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2009-09-02
Filing date: 2009-09-02
Publication date: 2014-07-02
Anticipated expiration: 2029-09-02
Also published as: JP2011053952A

Description

The present invention relates to an image search apparatus and an image search method.

Conventionally, image search apparatuses are known that search a moving image for images of a subject desired by the user. Such an apparatus retrieves images frame by frame and displays the retrieved frames as they are. The user can thereby identify a desired person who appears in the moving image.

In recent years, a technique has been disclosed that records the video signal from multiple frames for a fixed period before and after the point at which a subject is detected (see, for example, Patent Document 1).

Patent Document 1: JP 2001-285787 A

However, the technique described above does not accurately identify the video sections in which the subject appears. That is, a search for video sections using this technique returns many sections that have little to do with the subject.

The present invention has been made in view of this problem, and its object is to retrieve video sections that are closely related to a subject.

Accordingly, an image search apparatus of the present invention comprises: detection means for detecting a plurality of subject patterns from a moving image having a plurality of frames; generation means for generating, for each subject pattern, a sequence that collects the subject images in the moving image corresponding to that subject pattern over a series of frames; determination means for determining, based on the state of each of the subject images constituting the sequence generated by the generation means, whether that subject image is effective for subject similarity determination; feature amount calculation means for calculating a feature amount of each subject image, among the plurality of subject images, that the determination means has determined to be effective; storage means for storing, in a storage unit, the feature amount calculated for the sequence by the feature amount calculation means; similarity calculation means for calculating the similarity between the feature amount of a sequence given as a search condition and the feature amount of each sequence stored in the storage unit; extraction means for extracting sequences from those stored in the storage unit based on the similarity calculated by the similarity calculation means; and output means for outputting, as a search result, video section information corresponding to the sequences extracted by the extraction means.

According to the present invention, video sections that are closely related to a subject can be retrieved.

A diagram showing the configuration of the image search apparatus.
A diagram showing an example of a flowchart of the processing in the image search apparatus.
A diagram showing an example in which face detection is performed on frames at a predetermined interval.
A diagram showing an example of detecting face patterns across the entire image.
A diagram showing an example of a flowchart of the processing that generates face sequences.
A diagram showing an example of a flowchart of the feature extraction processing.
A diagram showing the progress of feature extraction.
A diagram illustrating how the representative face feature amount is determined.
A diagram showing a display example of the user interface for entering a query.
A diagram outlining the processing in the similarity determination unit.
A diagram showing an example of the display of retrieved face sequences.
A diagram showing an example of the display of retrieved face sequences.

Embodiments of the present invention are described below with reference to the drawings. As an example, the image search apparatus according to the present embodiment takes a person, in particular a face, as the subject the user searches for. For each video section of the moving image in which the person appears, it extracts and outputs the video section together with a face image that represents it. With this configuration, the user can easily identify the video sections in which the searched subject appears.

FIG. 1(a) is a diagram showing the hardware configuration of the image search apparatus. The image search apparatus includes a CPU (Central Processing Unit) 1, a storage device 2, an input device 3, a display device 4, and an imaging device 5. These devices are connected by a bus or the like so that they can communicate with one another.
The CPU 1 controls the operation of the image search apparatus and, among other things, executes programs stored in the storage device 2.
The storage device 2 is a storage device such as a magnetic storage device or semiconductor memory, and stores programs read in under the control of the CPU 1, data that must be retained for a long time, and the like.
In the present embodiment, the CPU 1 performs processing according to the procedures of the programs stored in the storage device 2, thereby realizing the functions of the image search apparatus and the processing of the flowcharts described later.
The input device 3 is a mouse, keyboard, touch panel device, buttons, or the like, and is used to input various instructions.
The display device 4 is a liquid crystal panel, external monitor, or the like, and displays various kinds of information.
The imaging device 5 is a camcorder or the like, and includes an image sensor such as a CCD (Charge Coupled Device) or CMOS (Complementary Metal Oxide Semiconductor) sensor. Moving image data captured by the imaging device 5 is stored in the storage device 2 or the like. A moving image basically consists of a series of frames, each with a corresponding still image.
The hardware configuration of the image search apparatus is not limited to the one described above. For example, the image search apparatus may include an I/O device for communicating with various other devices. Examples of the I/O device are an input/output unit for a memory card, USB cable, or the like, and a wired or wireless transmission/reception unit.

FIG. 1(b) is a diagram showing the functional configuration of the image search apparatus according to the present embodiment. The processing and functions of the image search apparatus are realized by a video input unit 100, a face sequence generation unit 200, a face sequence feature extraction unit 300, a face sequence storage unit 400, and a face sequence search unit 500.

The video input unit 100 inputs the moving image data of video captured by the imaging device 5 to the image memory unit 210. The video input unit 100 may instead read moving image data from a storage medium that stores it, or read moving image data stored on a server or the like via the Internet or a similar network.

The face sequence generation unit 200 includes an image memory unit 210, a face detection unit 220, a face tracking unit 230, and a representative pattern extraction unit 240.
The face sequence generation unit 200 analyzes the moving image input to the image memory unit 210, extracts a face image (more broadly, a subject image) from each frame of a video section in which a face appears, and generates a face sequence (more broadly, a sequence).
Here, a face sequence consists of the face images extracted from a continuous video section together with incidental information about those face images. The incidental information includes the start time and end time of the face sequence, the numbers of the frames from which face images were extracted, information on the region cut out as the face image in each such frame, and the like.
The image memory unit 210 is a storage area provided in the storage device 2 and temporarily stores, frame by frame, the moving image data output from the video input unit 100.
The face detection unit 220 detects face patterns in predetermined frames of the moving image data and outputs the detection results to the face tracking unit 230.
The face tracking unit 230 tracks each face pattern detected by the face detection unit 220 through the subsequent frames, generates a face sequence based on the tracking results, and outputs the face region information, the span of the face sequence, and the like to the representative pattern extraction unit 240 and other units.
The representative pattern extraction unit 240 extracts a face image representing a face sequence (representative face image) based on the output from the face tracking unit 230.

The face sequence feature extraction unit 300 includes a face state analysis unit 310 and a face feature amount calculation unit 320.
The face state analysis unit 310 analyzes the state of the face in each face image of a face sequence, so that face images effective for face similarity determination can be selected from the sequence, and outputs the analysis results to the face feature amount calculation unit 320.
The face feature amount calculation unit 320 uses the analysis results of the face state analysis unit 310 to calculate face feature amounts for the face images that are effective for face similarity determination. Furthermore, based on the calculated face feature amounts, it extracts for each face sequence the face feature amount that best represents the features of the face in that sequence as the representative face feature amount.

The face sequence storage unit 400 is a storage area provided in the storage device 2. It stores the incidental information, the frames for which face feature amounts were calculated, the face feature amounts, the face image representing each face sequence, and the like.

The face sequence search unit 500 includes a query input unit 510, a similarity determination unit 520, and a display unit 530.
The query input unit 510 accepts input of a search condition (query) for searching for face sequences.
The similarity determination unit 520 calculates the similarity between the face sequence specified by the query and each face sequence stored in the face sequence storage unit 400. It then determines whether each calculated similarity is higher than a predetermined threshold and outputs the face sequences whose similarity is determined to be higher to the display unit 530.
The display unit 530 organizes the face sequences output by the similarity determination unit 520 and displays them on the display device 4 together with the representative face images stored in the face sequence storage unit 400.
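The threshold test performed by the similarity determination unit 520 can be sketched as follows. This is only an illustration under stated assumptions: the text above does not specify the similarity measure, so cosine similarity is used as a stand-in, and the names `FaceSequence`, `cosine_similarity`, and `search_sequences`, as well as sorting the hits by similarity, are hypothetical choices rather than part of the described apparatus.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class FaceSequence:
    sequence_id: int
    start_frame: int
    end_frame: int
    feature: List[float]  # representative face feature amount (assumed to be a vector)


def cosine_similarity(a, b):
    """Stand-in similarity measure; the source does not name one."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_sequences(query, stored, threshold):
    """Return (similarity, sequence) pairs whose similarity is higher than the threshold."""
    hits = []
    for sequence in stored:
        similarity = cosine_similarity(query.feature, sequence.feature)
        if similarity > threshold:  # "higher than a predetermined threshold"
            hits.append((similarity, sequence))
    # Ordering by similarity is an assumption made for display purposes.
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits
```

Each returned sequence carries its start and end frames, which corresponds to the video section information that is output as the search result.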

Next, the operation of the image search apparatus is described with reference to FIG. 2. FIG. 2 shows an example of a flowchart of the processing in the image search apparatus.

First, the video input unit 100 reads the moving image data that was input to the image memory unit 210 from the image memory unit 210 frame by frame (step S100). The data read here is, for example, two-dimensional array data made up of 8-bit pixels, with three planes: R, G, and B. The image data may have been compressed by a method such as MPEG (Moving Picture Experts Group) or JPEG (Joint Photographic Experts Group). In that case, the image data is decompressed by the corresponding decompression method into image data made up of RGB pixels.

Next, the face detection unit 220, which is an example of the detection means, detects a plurality of face patterns (more broadly, subject patterns) in frames of the moving image data and outputs the detection results to the face tracking unit 230 (step S200).
For example, the face detection unit 220 performs face detection on frames of the moving image data at a predetermined interval. An example of face detection at a predetermined frame interval (in other words, an example of how analysis of the video data progresses) is described with reference to FIG. 3.
As shown in FIG. 3, moving image data A of a video consists of a plurality of frames, and in this example face pattern detection is performed every seven frames. A rectangular image pattern B cut out from a frame is detected as a face image.

In the present embodiment, a configuration is adopted that applies a method of detecting face patterns in an image with a neural network (see, for example, Reference 1).
This method is described below.
First, the face detection unit 220 reads the image data to be examined into memory and cuts out from it a predetermined region to be matched against a face. The face detection unit 220 then feeds the distribution of pixel values in the cut-out region to the neural network and obtains a single output. The weights and thresholds of the neural network have been learned in advance from a very large number of face patterns and non-face patterns. For example, the face detection unit 220 judges the region to be a face pattern if the output of the neural network is 0 or greater, and a non-face pattern otherwise.
The face detection unit 220 then detects face patterns in the image by scanning the cut-out position of the image pattern that is input to the neural network across the entire image, row by row and column by column, as shown in FIG. 4, for example. In addition, to handle face patterns of various sizes, the face detection unit 220 in the present embodiment successively reduces the read image by a predetermined ratio, as shown in FIG. 4, and performs the face detection scan described above on each reduced image.
Reference 1: Rowley et al, "Neural network-based face detection", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL.20, NO.1, JANUARY 1998
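The scan over positions and scales described above can be sketched as below. The sketch is not the detector of Reference 1: the trained neural network is replaced by a caller-supplied function `score_window` (a hypothetical name), and the window size, step, reduction ratio, and nearest-neighbour reduction in `shrink` are illustrative assumptions. Only the decision rule (an output of 0 or greater means a face pattern) and the scan-then-reduce structure come from the text.

```python
def shrink(image, scale):
    """Nearest-neighbour resample of a 2-D list of pixel values to `scale` (0 < scale <= 1)."""
    height, width = len(image), len(image[0])
    new_h, new_w = int(height * scale), int(width * scale)
    return [
        [image[min(int(y / scale), height - 1)][min(int(x / scale), width - 1)]
         for x in range(new_w)]
        for y in range(new_h)
    ]


def detect_faces(image, score_window, window=20, step=4, ratio=0.8):
    """Scan a fixed-size window over successively reduced copies of the image.

    Returns (left, top, size) tuples in the coordinates of the original image.
    """
    detections = []
    scale = 1.0
    current = image
    while len(current) >= window and len(current[0]) >= window:
        for top in range(0, len(current) - window + 1, step):
            for left in range(0, len(current[0]) - window + 1, step):
                patch = [row[left:left + window] for row in current[top:top + window]]
                # Output of 0 or greater is treated as a face pattern.
                if score_window(patch) >= 0:
                    detections.append(
                        (int(left / scale), int(top / scale), int(window / scale))
                    )
        scale *= ratio  # reduce the image by a fixed ratio and scan again
        current = shrink(image, scale)
    return detections
```

Keeping the window size fixed and shrinking the image means the same classifier can respond to faces of different sizes, which is the reason the text gives for the successive reduction.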

The method of detecting face patterns in an image is not limited to the neural network method; various other methods are applicable and may be adopted instead (see, for example, Reference 2).
Reference 2: Yang et al, "Detecting Faces in Images: A Survey", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL.24, NO.1, JANUARY 2002

Next, the face tracking unit 230, which is an example of the generation means, tracks each face pattern detected by the face detection unit 220 through the subsequent frames and outputs the tracking results to the representative pattern extraction unit 240 and other units (step S300).
That is, the face tracking unit 230 tracks, in the subsequent frames, each of the face patterns (for example, image pattern B shown in FIG. 3) detected in the frames at the predetermined interval. From the tracking, the face tracking unit 230 acquires incidental information such as the numbers of the frames from which face images were extracted and information on the region cut out as the face image in each of those frames. Furthermore, the face tracking unit 230 generates a temporally continuous collection of face images as a face sequence (for example, sequence C shown in FIG. 3). In other words, the face tracking unit 230 generates, for each subject pattern, a sequence that gathers the subject images over a series of frames.

The way a face sequence is generated is described with reference to FIG. 5. FIG. 5 shows an example of a flowchart of the processing that generates a face sequence.

First, the face tracking unit 230 sets a search region in which to search for the face pattern in the subsequent frame, based on information on the region from which the face image corresponding to the face pattern was cut out (step S301). When searching the frame immediately after the frame in which the face detection unit 220 detected the face pattern, the search region is set to a neighboring rectangular region whose center position is shifted by predetermined amounts horizontally and vertically from the region in which the face pattern was detected. When searching further frames, the face tracking unit 230 sets a new search region in the same way, using the tracking result, that is, the region cut out as the face image of the face pattern tracked in the processing described below.

Next, the face tracking unit 230 tracks the face pattern based on the correlation between images cut out within the search region and the face image corresponding to the face pattern being searched for (step S302). That is, the face tracking unit 230 successively cuts out rectangular regions of the same size as the face pattern, centered on the center positions set as the search region, and calculates a correlation value between each cut-out image and the face image corresponding to the face pattern, using the luminance distribution as a template. The correlation value is a value that expresses quantitatively how strongly two patterns are related. The face tracking unit 230 then outputs the region with the highest correlation value within the search region as the tracking result for the face pattern, together with that correlation value.
Although the face tracking unit 230 has been described as using the correlation of luminance distributions to track the face pattern, it may instead use the correlation of the pixel value distributions of the individual R, G, and B channels. The face tracking unit 230 may also track the face pattern using the correlation of image feature amounts, such as the luminance distribution within the region or a histogram of RGB values.
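A minimal sketch of the correlation search in step S302 follows. The text only speaks of a "correlation value", so the use of normalized cross-correlation here is an assumption, as are the function names `ncc` and `track_in_frame`, the representation of a frame as a 2-D list of luminance values, and the parameter `radius` that stands in for the predetermined shift amounts.

```python
import math


def ncc(patch, template):
    """Normalized cross-correlation between two equal-size 2-D luminance patches."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    numerator = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    denominator = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return numerator / denominator if denominator else 0.0


def track_in_frame(frame, template, previous_top_left, radius):
    """Find the best match for `template` near its previous position.

    Returns (correlation, (top, left)) for the region with the highest correlation.
    """
    t_height, t_width = len(template), len(template[0])
    prev_top, prev_left = previous_top_left
    best_score, best_position = -1.0, None
    # Candidate positions: shifts of up to `radius` pixels, clipped to the frame.
    for top in range(max(0, prev_top - radius),
                     min(len(frame) - t_height, prev_top + radius) + 1):
        for left in range(max(0, prev_left - radius),
                          min(len(frame[0]) - t_width, prev_left + radius) + 1):
            patch = [row[left:left + t_width] for row in frame[top:top + t_height]]
            score = ncc(patch, template)
            if score > best_score:
                best_score, best_position = score, (top, left)
    return best_score, best_position
```

Returning the correlation along with the position matters because the next step compares that value with a threshold to decide whether tracking succeeded.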

Next, the face tracking unit 230 determines whether the correlation value output by the tracking process is equal to or greater than a predetermined value (step S303). If it is, the face image corresponding to the face pattern detected by the face detection unit 220 and the face image of the tracking result are highly likely to be similar, so the face tracking unit 230 judges that tracking succeeded and moves the processing to step S304. If the correlation value is below the predetermined value, the likelihood of similarity is low, so the face tracking unit 230 judges that no face image corresponding to the tracked face pattern was found and ends the tracking process.

The face tracking unit 230 then moves the target of face pattern tracking to the next subsequent frame and returns the processing to step S301 (step S304).
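Steps S301 to S304 form a loop that can be sketched as follows. The matching step is passed in as a callable `match` (a hypothetical interface returning a correlation value and the matched region), and the names `track_face` and `min_correlation` are likewise illustrative. The sketch simply runs until tracking fails or the frames run out.

```python
def track_face(frames, start_index, start_region, match, min_correlation):
    """Follow one detected face pattern through the subsequent frames.

    `match(frame, region)` must return (correlation, new_region).
    Returns the face sequence as a list of (frame_index, region) pairs.
    """
    sequence = [(start_index, start_region)]
    region = start_region
    for index in range(start_index + 1, len(frames)):
        # S301 and S302: search around the previous region in this frame.
        correlation, candidate = match(frames[index], region)
        # S303: stop when the correlation falls below the predetermined value.
        if correlation < min_correlation:
            break
        region = candidate
        sequence.append((index, region))
        # S304: the loop moves on to the next frame.
    return sequence
```

The frame indices and regions collected here correspond to the incidental information of the face sequence, such as the frame numbers and the regions from which the face images were cut out.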

Accordingly, the processing is basically repeated up to the frame immediately before the frame next to the frame in which the face detection unit 220 detected the face pattern, and a face sequence is obtained for each detected face pattern.
As described above, the face tracking unit 230 of the present embodiment searches for and tracks each face pattern detected by the face detection unit 220 in the subsequent frames. In addition to this configuration, however, the face tracking unit 230 may also search for and track the face pattern in the frames preceding the frame in which it was detected. Also, in addition to or instead of the configuration described above, the face tracking unit 230 may calculate motion vectors from the moving image and track the face pattern using the motion vectors as clues.

The face tracking unit 230 may also track the face pattern using a frame that follows after a predetermined interval. This configuration prevents a face sequence from being split excessively when something passes in front of the face, a flash fires, or the like.

The face tracking unit 230 may also calculate the correlation between the facial features of two face sequences that are adjacent in time (or in frame position) and, if the correlation is high, integrate the sequences (combine them into one). That is, the face tracking unit 230 treats the span from the start of the earlier sequence to the end of the later sequence as a new sequence (video section) and integrates the incidental information as well. As the representative face image, the face tracking unit 230 can use that of either face sequence.
This integration is repeated in turn for all pairs of preceding and following face sequences, so that the face sequences are integrated.
However, the face tracking unit 230 does not treat as integration candidates a pair of face sequences whose video sections are separated by a predetermined time or more. Also, when several people appear in the video, the video sections of multiple face sequences may overlap. In that case the persons corresponding to the respective face sequences can be regarded as different people, so the face tracking unit 230 does not treat those face sequences as integration candidates.
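The integration rules above can be sketched as follows. The sketch assumes that each sequence carries a single feature vector, that "high correlation" means reaching a threshold `min_correlation`, and that the merged sequence keeps the earlier sequence's feature; these choices, and all names in the code, are illustrative and not taken from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FaceSequence:
    start: int            # first frame of the video section
    end: int              # last frame of the video section
    feature: List[float]  # facial feature used for the correlation test


def can_merge(earlier, later, max_gap, min_correlation, correlate):
    if later.start <= earlier.end:
        return False  # overlapping sections are regarded as different people
    if later.start - earlier.end >= max_gap:
        return False  # separated by the predetermined time or more
    return correlate(earlier.feature, later.feature) >= min_correlation


def merge_sequences(sequences, max_gap, min_correlation, correlate):
    """Greedily integrate sequences in order of their start frames."""
    merged = []
    for sequence in sorted(sequences, key=lambda s: s.start):
        for i, previous in enumerate(merged):
            if can_merge(previous, sequence, max_gap, min_correlation, correlate):
                # New section runs from the earlier start to the later end.
                merged[i] = FaceSequence(previous.start, sequence.end, previous.feature)
                break
        else:
            merged.append(sequence)
    return merged
```

Checking every already merged sequence, rather than only the most recent one, lets the sketch handle videos in which the sequences of several people are interleaved.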

Next, the representative pattern extraction unit 240 extracts a face image representing each face sequence, based on the results from the face tracking unit 230 (step S400). That is, the representative pattern extraction unit 240 cuts out a rectangular region containing the face region from one frame in the face sequence and stores it in the face sequence storage unit 400. The frame to cut from may be, for example, a frame at a predetermined offset from the start of the sequence, determined relative to the sequence length (in other words, a frame located at a predetermined time), or the frame in which the face is largest. Alternatively, the representative pattern extraction unit 240 may produce, as the representative, a moving image made by cutting out rectangular regions containing the face region from part of each face sequence or from a plurality of frames at predetermined intervals.
In the present embodiment, an index is created for the generated face sequences and their representative face images so that they can be accessed easily at search time.
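The two frame selection rules mentioned for step S400 can be sketched as below. The tuple layout `(frame_number, left, top, width, height)`, the function name `pick_representative`, and the interpretation of the position rule as a fraction of the sequence length are assumptions made for illustration.

```python
def pick_representative(face_regions, strategy="largest", fraction=0.5):
    """Choose the frame whose face region represents the sequence.

    `face_regions` is a list of (frame_number, left, top, width, height) tuples
    in frame order.
    """
    if not face_regions:
        raise ValueError("face sequence is empty")
    if strategy == "largest":
        # Frame in which the face region has the largest area.
        return max(face_regions, key=lambda r: r[3] * r[4])
    if strategy == "position":
        # Frame at a fixed position relative to the sequence length.
        index = int(len(face_regions) * fraction)
        index = max(0, min(index, len(face_regions) - 1))
        return face_regions[index]
    raise ValueError(f"unknown strategy: {strategy}")
```

The returned tuple identifies both the frame and the rectangle to cut out, which is what would be stored as the representative face image.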

Next, the face sequence feature extraction unit 300 extracts the features of each face sequence generated by the face tracking unit 230 (step S500).

Here, feature extraction from a face sequence will be described with reference to FIGS. 6 and 7. FIG. 6 is a flowchart of the process of analyzing a face sequence and extracting features. FIG. 7 is a diagram illustrating the progress of feature extraction from a face sequence along the steps of FIG. 6.

First, the face state analysis unit 310, which is an example of an analysis unit, analyzes the state of the face (subject) in each face image in the face sequence and outputs the analysis result to the face feature amount calculation unit 320 (step S501). By analyzing the state of the face in each face image (more broadly, the subject in each subject image), the face state analysis unit 310 makes it possible to extract, from the face images in each face sequence, face feature amounts that are effective for determination based on face similarity.
To make the determination based on face similarity more accurate, it is important that the parts of the face, such as the eyes, mouth, and nose, are present in the face image. That is, a face facing forward allows the eyes, mouth, nose, and so on to be identified more accurately than a face turned sideways or diagonally; it therefore represents the features of the face more accurately and is more effective for determination based on face similarity.
Therefore, in the present embodiment, the face state analysis unit 310 is configured to detect the face orientation in each face image in the face sequence. FIG. 7 shows the face orientation B for each face image A in the face sequence.
For example, the face state analysis unit 310 may include a plurality of face discriminators, each having the same configuration as the neural-network-based face discriminator. In this case, the discrimination parameters of each face discriminator are tuned and set in advance by sample learning for each face orientation. In this configuration, the face orientation corresponding to the face discriminator with the highest output, that is, the highest likelihood, among the plurality of face discriminators is taken as the analysis result of the face state analysis unit 310.
In the present embodiment, the face state analysis unit 310 uses the face orientation to analyze the face state. Alternatively, for example, means for individually searching the face image for parts such as the eyes, mouth, and nose may be provided, and the presence or absence of each part may be analyzed.
Further, when the eyes are open, face similarity can be determined more accurately than when the eyes are closed. Therefore, in addition to or instead of the above-described configuration, the face state analysis unit 310 may be provided with means for determining whether the eyes are open or closed, and may analyze the open/closed state of the eyes.
In addition, when the face is well illuminated and the skin is captured brightly as a whole, face similarity can be determined more accurately than when the face is partially shadowed. Therefore, in addition to or instead of the above-described configuration, the face state analysis unit 310 may be provided with means for determining the illumination state of the face from the brightness distribution of the skin portion of the face, and may analyze the illumination state.
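The multi-discriminator configuration described above, in which one discriminator is tuned per face orientation and the highest output decides the result, can be sketched as follows. The `Image` alias, the orientation labels, and the callable interface are assumptions for illustration; the discriminators themselves (neural networks in the embodiment) are represented only as functions returning a score.

```python
from typing import Callable, Dict, Sequence

# A grayscale image as rows of pixel values (representation assumed here).
Image = Sequence[Sequence[float]]


def estimate_orientation(
    face: Image, discriminators: Dict[str, Callable[[Image], float]]
) -> str:
    """Return the orientation label whose discriminator gives the highest output."""
    return max(discriminators, key=lambda label: discriminators[label](face))
```

For example, with stub discriminators `{"front": f, "left": g, "right": h}`, the function returns the key of whichever stub scores the given face image highest.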

Subsequently, the face feature amount calculation unit 320, which is an example of a feature amount calculation unit, calculates face feature amounts for the face images that are effective for determination based on face similarity, using the analysis result of the face state analysis unit 310 (step S502).
That is, the face feature amount calculation unit 320 calculates face feature amounts only for the face images that the analysis result indicates are effective for determination based on face similarity. For example, the face feature amount calculation unit 320 narrows the face images down to frontal face images in a predetermined state and calculates face feature amounts only for those.
More specifically, the face feature amount calculation unit 320 searches the narrowed-down face images for face feature points that are effective for face determination. For example, the face feature amount calculation unit 320 extracts the outer corners of the eyes, both ends of the mouth, the tip of the nose, and the like as face feature points by pattern matching. The face feature amount calculation unit 320 then extracts the local luminance distribution at each feature point as a face feature amount by Gabor wavelet transform and vectorizes it. That is, as shown in FIG. 7, a face feature vector C (face feature amount) is calculated for each frontal face image.
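As a rough illustration of step S502, the sketch below computes a feature vector from Gabor responses at given feature points. It uses a plain Gabor kernel (a Gaussian envelope multiplied by a complex sinusoid) with parameter values chosen arbitrarily for the example; the kernel design, the parameter names, and the way feature points are supplied are all assumptions and are simpler than what a practical system would use.

```python
import math
from typing import List, Sequence, Tuple

Image = Sequence[Sequence[float]]  # grayscale image as rows of pixel values (assumed)


def gabor_response(image: Image, x: int, y: int, wavelength: float,
                   theta: float, sigma: float, radius: int) -> float:
    """Magnitude of the complex Gabor response centred on pixel (x, y)."""
    height, width = len(image), len(image[0])
    real = imag = 0.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            px, py = x + dx, y + dy
            if not (0 <= px < width and 0 <= py < height):
                continue  # pixels outside the image contribute nothing
            envelope = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
            phase = 2.0 * math.pi * (dx * math.cos(theta) + dy * math.sin(theta)) / wavelength
            real += image[py][px] * envelope * math.cos(phase)
            imag += image[py][px] * envelope * math.sin(phase)
    return math.hypot(real, imag)


def face_feature_vector(image: Image, points: Sequence[Tuple[int, int]],
                        orientations: int = 4, wavelength: float = 4.0,
                        sigma: float = 2.0, radius: int = 4) -> List[float]:
    """Concatenate Gabor magnitudes over all feature points and orientations."""
    vector: List[float] = []
    for x, y in points:
        for k in range(orientations):
            theta = math.pi * k / orientations
            vector.append(gabor_response(image, x, y, wavelength, theta, sigma, radius))
    return vector
```

In the flow described above, this calculation would be applied only to the face images that the face state analysis unit 310 has judged to be frontal, with `points` holding the detected eye, mouth, and nose positions.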

Here, a known method is used to calculate the face feature amount (see, for example, Reference 3). The method of calculating the face feature amount is not limited to the method described in Reference 3; a method that calculates a local descriptor at each feature point (see, for example, Reference 4) may be adopted, or a simple method using, for example, a histogram of the luminance distribution of the face image may be adopted.
Reference 3: Wiskott et al., "Face Recognition by Elastic Bunch Graph Matching", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL.19, NO.7, JULY 1997
Reference 4: Schmid and Mohr, "Local Grayvalue Invariants for Image Retrieval", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL.19, NO.5, MAY 1997

Subsequently, in order to extract a representative face feature amount representing the face sequence, the face feature amount calculation unit 320 clusters the calculated face feature amounts to create a plurality of clusters (step S503).
For example, the face feature amount calculation unit 320 applies the K-means method with a predetermined number of clusters, taking as input the plurality of face feature vectors (face feature amounts) calculated from the face sequence.
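A minimal K-means sketch for step S503 is given below. It is a generic textbook formulation written without external libraries; the initialisation by random sampling, the fixed iteration count, and the `seed` parameter are assumptions of this example, not details from the embodiment.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def assign(vectors: Sequence[Vector], centroids: Sequence[Vector]) -> List[int]:
    """Label each vector with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(v, centroids[c]))
        for v in vectors
    ]


def k_means(vectors: Sequence[Vector], k: int, iterations: int = 20,
            seed: int = 0) -> Tuple[List[int], List[List[float]]]:
    """Cluster `vectors` into `k` clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input vectors.
    centroids = [list(v) for v in rng.sample(list(vectors), k)]
    for _ in range(iterations):
        labels = assign(vectors, centroids)
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the previous centroid if a cluster becomes empty
                centroids[c] = mean_vector(members)
    return assign(vectors, centroids), centroids
```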

Subsequently, the face feature amount calculation unit 320, which is an example of a representative feature amount extraction unit, extracts, based on the clustering result, the face feature amount that best represents the facial features of the face sequence as the representative face feature amount (step S504).
That is, the face feature amount calculation unit 320 takes a cluster containing a relatively large number of face feature vectors (for example, the cluster with the largest number of face feature vector samples) as the main cluster, calculates the average of the face feature vectors in the main cluster, and uses that average vector as the representative face feature amount.
For example, as shown in FIG. 8, the main cluster is determined from a plurality of face feature vectors B, and the average vector of the face feature vectors in the main cluster (representative face feature amount C) is obtained. Extracting the average vector of the main cluster as the representative face feature amount yields a face feature amount that is more robust against the noise contained in individual face images than sampling a single face feature amount arbitrarily.
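Step S504 can be sketched as follows, assuming the clustering step has already produced one integer label per face feature vector. The function name and the label format are assumptions for illustration.

```python
from collections import Counter
from typing import List, Sequence


def representative_feature(vectors: Sequence[Sequence[float]],
                           labels: Sequence[int]) -> List[float]:
    """Average the feature vectors belonging to the most populated cluster."""
    # The main cluster is the label that occurs most often.
    main_label, _ = Counter(labels).most_common(1)[0]
    members = [v for v, label in zip(vectors, labels) if label == main_label]
    return [sum(column) / len(members) for column in zip(*members)]
```

For instance, with vectors `[[1, 1], [3, 3], [100, 100]]` and labels `[0, 0, 1]`, the outlier in cluster 1 is ignored and the result is the mean of the first two vectors.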

Subsequently, the face feature amount calculation unit 320 stores the calculated face feature amounts and the extracted representative face feature amount in the face sequence storage unit 400 (step S600). At this time, an index is created for the extracted representative face feature amount so that it is consistent with the face image extracted by the representative pattern extraction unit 240. With this configuration, the face image extracted by the representative pattern extraction unit 240 and the representative face feature amount can both be accessed easily at search time.
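One simple way to keep the representative face image and the representative face feature amount under a shared index is to store them in a single record keyed by a sequence identifier. The sketch below is an in-memory illustration only; the `SequenceRecord` fields, the `FaceSequenceStore` class, and the use of string identifiers are assumptions, and the actual layout of the face sequence storage unit 400 is not specified here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SequenceRecord:
    representative_image: str            # e.g. a path to the cropped face image (assumed)
    features: List[List[float]]          # face feature amounts from step S502
    representative_feature: List[float]  # representative face feature amount from step S504


class FaceSequenceStore:
    """In-memory index from a sequence identifier to its stored record."""

    def __init__(self) -> None:
        self._records: Dict[str, SequenceRecord] = {}

    def put(self, sequence_id: str, record: SequenceRecord) -> None:
        self._records[sequence_id] = record

    def get(self, sequence_id: str) -> SequenceRecord:
        return self._records[sequence_id]

    def ids(self) -> List[str]:
        return list(self._records)
```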

Subsequently, the query input unit 510 receives input of a search condition for finding face sequences (step S700). More specifically, the query input unit 510 receives the user's designation of a face sequence with which at least one face image is associated (that is, the search condition).

Here, input of the search condition will be described with reference to FIG. 9. FIG. 9 is a diagram illustrating a display example of a user interface for inputting a search condition (query).
FIG. 9 shows an example in which the face sequences stored in the face sequence storage unit 400 are grouped by video title and the representative face image of each face sequence is listed in the input dialog window 800. The video section that each face sequence occupies in the video may also be shown. The user selects one face sequence from this list as the query.

In FIG. 9, the title 801 is information indicating the title of a video. If there is no title, the shooting time, recording time, or the like is displayed. The representative face image 802 is the representative face image of a face sequence. The selection area 803 indicates that a representative face image has been selected with the mouse cursor 804. The user can put the representative face image 802 into the selected state by clicking on it with the mouse cursor 804. In the present embodiment, however, only one representative face image can be in the selected state; when a new selection is made, the previously selected representative face image becomes unselected.
The slider 805 is a vertical scroll bar. By dragging the slider 805 with the mouse cursor 804 to scroll the screen vertically, the user can display representative face images that do not fit on one screen.
The search execution button 806 is a button for executing a face sequence search.
Note that the query is not limited to a face sequence; a face image may also be used as the query. In this case, the face detection unit 220 detects a face region, the face feature amount calculation unit 320 calculates a feature amount in the detected face region and vectorizes it, and the similarity with the representative face images to be searched can then be calculated.
In addition, the query may be set by combining a face sequence with other accompanying information, such as the title of the video to which the face sequence belongs, video keywords, and the recording date and time, so as to narrow down the face sequences for which the similarity determination unit 520 performs similarity determination.
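Narrowing down the candidate face sequences by accompanying information before similarity determination could look like the sketch below. The `SequenceInfo` fields and the two filter parameters are assumptions chosen for the example; other accompanying information such as keywords could be handled the same way.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Sequence


@dataclass
class SequenceInfo:
    sequence_id: str
    title: str         # title of the video to which the face sequence belongs
    recorded_on: date  # recording date of that video


def narrow_candidates(sequences: Sequence[SequenceInfo],
                      title_keyword: Optional[str] = None,
                      recorded_after: Optional[date] = None) -> List[SequenceInfo]:
    """Keep only the sequences that satisfy every condition that was given."""
    result = []
    for seq in sequences:
        if title_keyword is not None and title_keyword.lower() not in seq.title.lower():
            continue
        if recorded_after is not None and seq.recorded_on < recorded_after:
            continue
        result.append(seq)
    return result
```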

Subsequently, the similarity determination unit 520, which is an example of a calculation unit, calculates the similarity between the face sequence designated as the query and each face sequence stored in the face sequence storage unit 400. That is, the similarity determination unit 520 calculates the similarity between the face images associated with the face sequence designated by the user and the face images constituting each generated face sequence. The similarity determination unit 520, which is also an example of an extraction unit, then extracts information on each face sequence that includes a face image whose similarity satisfies a criterion (in other words, whose similarity is higher than a predetermined threshold) and outputs the extracted information to the display unit 530 (step S800). Note that the process of extracting sequences need not be performed by the similarity determination unit 520; a separate extraction unit may be provided for it. FIG. 10 shows an outline of the processing in the similarity determination unit 520.
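The extraction in step S800 amounts to keeping the sequences whose similarity exceeds the threshold. A minimal sketch follows; the dictionary of per-sequence similarities and the function name are assumptions, and ordering the result by descending similarity is a choice made for this example rather than something stated in the embodiment.

```python
from typing import Dict, List


def extract_matches(similarities: Dict[str, float], threshold: float) -> List[str]:
    """Return the IDs of sequences whose similarity is higher than the threshold."""
    hits = [(seq_id, score) for seq_id, score in similarities.items() if score > threshold]
    # Sort best match first (an assumption of this sketch).
    hits.sort(key=lambda item: item[1], reverse=True)
    return [seq_id for seq_id, _ in hits]
```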

In the present embodiment, the similarity between two face sequences is the maximum of the similarities between the face feature amounts of the two sequences. However, the configuration is not limited to using the maximum; the average of the similarities may be used instead. That is, the similarity determination unit 520 calculates the similarity based on the correlation between the face feature amounts calculated in step S502 and the face feature amounts of the face images associated with the face sequence given as the search condition. Note that the reciprocal of the Euclidean distance between face feature vectors is used as the similarity between face feature amounts.
In the present embodiment, the similarity between face sequences is calculated after the query is set. Alternatively, the similarities between face sequences may be calculated in advance and the highly correlated face sequences stored, and the stored results may be used when a query is set or when a search is executed.
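The similarity described above (reciprocal of the Euclidean distance, aggregated over all pairs of feature vectors by maximum or average) can be sketched as follows. The small `eps` term, which avoids division by zero for identical vectors, and the `use_mean` switch are assumptions added for this example.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def feature_similarity(a: Vector, b: Vector, eps: float = 1e-9) -> float:
    """Reciprocal of the Euclidean distance between two feature vectors."""
    return 1.0 / (math.dist(a, b) + eps)


def sequence_similarity(query_features: Sequence[Vector],
                        stored_features: Sequence[Vector],
                        use_mean: bool = False) -> float:
    """Maximum (or average) pairwise similarity between two face sequences."""
    scores = [feature_similarity(q, s) for q in query_features for s in stored_features]
    return sum(scores) / len(scores) if use_mean else max(scores)
```

With a query vector `[0, 0]` and stored vectors `[3, 4]` and `[0, 1]`, the pairwise distances are 5 and 1, so the maximum similarity is about 1.0 and the average is about 0.6.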

Note that, in addition to or instead of using the plurality of face feature amounts of each face sequence, the similarity between face sequences may be calculated using the representative face feature amount of each face sequence. For example, the similarity determination unit 520 uses, as the similarity, the maximum of the similarities between each face feature amount of the face sequence designated as the query and the representative face feature amount of each face sequence stored in the face sequence storage unit 400. That is, the similarity determination unit 520 calculates the similarity based on the correlation between the representative face feature amount extracted in step S504 and the face feature amounts of the face images associated with the face sequence given as the search condition.
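The variant using the representative face feature amount can be sketched in the same style. As before, the reciprocal Euclidean distance is used as the similarity, and the `eps` term is an assumption added to avoid division by zero.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def similarity_to_representative(query_features: Sequence[Vector],
                                 representative: Vector,
                                 eps: float = 1e-9) -> float:
    """Maximum similarity between the query's feature vectors and one representative vector."""
    return max(1.0 / (math.dist(q, representative) + eps) for q in query_features)
```

Compared with comparing every pair of feature vectors, this needs only one distance calculation per query vector for each stored face sequence.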

In view of the above, the similarity determination unit 520 is an example of a search unit that searches for face sequences that include a face image associated with the face sequence designated by the user.

Subsequently, the display unit 530, which is an example of an output unit, organizes the results of the similarity determination unit 520 and displays them on the display device 4 (step S900). For example, the display unit 530 displays the video section (video section information) corresponding to each extracted face sequence, based on the accompanying information (frame information) of that face sequence.
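If the accompanying frame information is taken to be the first and last frame numbers of a face sequence, the corresponding video section can be derived as in the sketch below. The inclusive frame range, the zero-based frame numbering, and the constant frame rate `fps` are all assumptions of this example.

```python
from typing import Tuple


def video_section(first_frame: int, last_frame: int, fps: float) -> Tuple[float, float]:
    """Convert an inclusive frame range into (start, end) times in seconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    # The section ends when the last frame finishes, hence last_frame + 1.
    return first_frame / fps, (last_frame + 1) / fps
```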

The content displayed on the display device 4 includes, for example, the representative face image (one of the face images constituting the face sequence) stored in the face sequence storage unit 400 in association with the video section. Examples of display relating to face sequences are shown in FIGS. 11 and 12.
In FIG. 11, the display unit 530 displays on the display device 4 the video section C corresponding to each face sequence in the video, together with the representative face image D stored in the face sequence storage unit 400. For the moving image data A, frame thumbnails B are displayed at predetermined time intervals. The video section C is an example of displaying a video section based on the analysis result of the face sequence. Because the display unit 530 displays the video section C, the user can grasp the sections in which a person appears in the video. Because the display unit 530 also displays the representative face image D, the user can check the face that represents the video section C.

In FIG. 12, on the other hand, the display unit 530 displays the search result window 900 on the display device 4. The title 901 is information indicating the title of a video. If there is no title, the display unit 530 displays the shooting time, recording time, or the like. The bar 902 represents the entire duration of the video. The video section 903 indicates the video section that a face sequence occupies in the video. Because the display unit 530 displays the bar 902 and the video section 903, the user can grasp the sections in which a person appears in the video. The representative face image 904 is the representative face image of the face sequence. Because the display unit 530 displays the representative face image, the user can check the face that represents the video section. The representative face image 904 is preferably placed within a predetermined range so that its association with the video section 903 can be recognized (for example, at a position closer to the center of the video section 903 than to the center of any other video section).
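The placement condition given as an example above (the representative face image is closer to the center of its own video section than to the center of any other video section) can be checked as in the sketch below. Treating positions as one-dimensional coordinates along the bar 902 is an assumption of this example.

```python
from typing import Sequence, Tuple

Section = Tuple[float, float]  # (start, end) position along the bar (assumed representation)


def is_well_placed(image_position: float, own_section: Section,
                   other_sections: Sequence[Section]) -> bool:
    """True if the image is nearer to its own section's center than to any other's."""
    own_center = (own_section[0] + own_section[1]) / 2.0
    own_distance = abs(image_position - own_center)
    return all(
        own_distance < abs(image_position - (start + end) / 2.0)
        for start, end in other_sections
    )
```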

In the above embodiment, the image search device detects human faces as subject patterns and generates face sequences. However, a configuration that generates sequences for other subject patterns from which the content of the video can be grasped may also be adopted.

<Other embodiments>
The present invention can also be realized by executing the following processing. That is, software (a program) that realizes the functions of the above-described embodiment is supplied to a system or apparatus via a network or various storage media, and a computer (or a CPU, MPU, or the like) of the system or apparatus reads out and executes the program.

According to the configuration of the above-described embodiment, a video can be searched accurately in units of video sections in which a predetermined subject appears, so the efficiency of video search and video editing is improved. In addition, accuracy is improved because matching is performed using a plurality of pieces of image information of the same person.
Further, according to the above-described embodiment, faces are detected from a video, the face images detected over a plurality of frames are grouped into one face sequence, and a search can be performed in units of face sequences, so the efficiency of video search and video editing is improved. In addition, accuracy is improved because the determination based on similarity is performed using a plurality of pieces of image information of the same person.

Further, according to the above-described embodiment, it becomes possible to search accurately, from among many video sections, in units of video sections in which a predetermined subject appears. This improves the efficiency of video search. In addition, because good-quality items are extracted from the image information of the subject in each unit video section and used for matching as the targets of the video section search, the accuracy of person determination in video sections in which the subject appears can be further improved.

Further, according to the above-described embodiment, when the similarity between sequences is evaluated using a feature amount that represents each sequence, taken from the image information of the subject in the unit video section, person determination in video sections in which the subject appears can be performed more stably.

The preferred embodiments of the present invention have been described in detail above. However, the present invention is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention described in the claims.

100 video input unit, 200 face sequence generation unit, 300 face sequence feature extraction unit, 400 face sequence storage unit, 500 face sequence search unit

Claims

An image search device comprising:
detecting means for detecting a plurality of subject patterns from a moving image having a plurality of frames;
generating means for generating, for each subject pattern, a sequence in which the subject images in the moving image corresponding to that subject pattern are collected over a series of frames;
determination means for determining, based on the state of each of the plurality of subject images constituting the sequence generated by the generating means, whether or not each subject image is effective for determining the similarity of the subject;
feature amount calculating means for calculating a feature amount of each subject image, among the plurality of subject images, determined to be effective by the determination means;
storage means for storing, in a storage unit, the feature amount calculated by the feature amount calculating means for the sequence;
similarity calculation means for calculating the similarity between the feature amount for a sequence given as a search condition and the feature amount for each sequence stored in the storage unit;
extraction means for extracting a sequence from the sequences stored in the storage unit based on the similarity calculated by the similarity calculation means; and
output means for outputting, as a search result, video section information corresponding to the sequence extracted by the extraction means.

The image search device according to claim 1, wherein the output means outputs, together with the video section information, one of the subject images constituting the sequence corresponding to the video section information to be output.

The image search device according to claim 1 or 2, further comprising representative feature amount generation means for generating a representative feature amount representative of the sequence based on the distribution of the feature amounts calculated by the feature amount calculating means,
wherein the storage means further stores, in the storage unit, the representative feature amount generated by the representative feature amount generation means for the sequence, and
the similarity calculation means calculates the similarity based on the correlation between the feature amount for the sequence given as the search condition and the representative feature amount for each sequence stored in the storage unit.

The image search device according to claim 3, wherein the representative feature amount generation means creates a plurality of clusters by clustering the feature amounts of the subject images calculated by the feature amount calculating means, and generates the representative feature amount of the sequence based on a cluster containing a relatively large number of feature amounts.

An image search method executed by an image search device, the method comprising:
a detection step of detecting a plurality of subject patterns from a moving image having a plurality of frames;
a generation step of generating, for each subject pattern, a sequence in which the subject images in the moving image corresponding to that subject pattern are collected over a series of frames;
a determination step of determining, based on the state of each of the plurality of subject images constituting the sequence generated in the generation step, whether or not each subject image is effective for determining the similarity of the subject;
a feature amount calculation step of calculating a feature amount of each subject image, among the plurality of subject images, determined to be effective in the determination step;
a storage step of storing, in a storage unit, the feature amount calculated in the feature amount calculation step for the sequence;
a similarity calculation step of calculating the similarity between the feature amount for a sequence given as a search condition and the feature amount for each sequence stored in the storage unit;
an extraction step of extracting a sequence from the sequences stored in the storage unit based on the similarity calculated in the similarity calculation step; and
an output step of outputting, as a search result, video section information corresponding to the sequence extracted in the extraction step.

A program that causes a computer to execute:
a detection step of detecting a plurality of subject patterns from a moving image having a plurality of frames;
a generation step of generating, for each subject pattern, a sequence in which the subject images in the moving image corresponding to that subject pattern are collected over a series of frames;
a determination step of determining, based on the state of each of the plurality of subject images constituting the sequence generated in the generation step, whether or not each subject image is effective for determining the similarity of the subject;
a feature amount calculation step of calculating a feature amount of each subject image, among the plurality of subject images, determined to be effective in the determination step;
a storage step of storing, in a storage unit, the feature amount calculated in the feature amount calculation step for the sequence;
a similarity calculation step of calculating the similarity between the feature amount for a sequence given as a search condition and the feature amount for each sequence stored in the storage unit;
an extraction step of extracting a sequence from the sequences stored in the storage unit based on the similarity calculated in the similarity calculation step; and
an output step of outputting, as a search result, video section information corresponding to the sequence extracted in the extraction step.