JP6151908B2

JP6151908B2 - Learning device, identification device, and program thereof

Info

Publication number: JP6151908B2
Application number: JP2012250151A
Authority: JP
Inventors: 吉彦河合; 藤井　真人; 真人藤井
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2012-11-14
Filing date: 2012-11-14
Publication date: 2017-06-21
Anticipated expiration: 2032-11-14
Also published as: JP2014099027A

Description

The present invention relates to an image feature amount calculation device, a learning device, an identification device, and programs therefor. In particular, it relates to an image feature amount calculation device, a learning device, an identification device, and programs therefor that calculate image feature amounts in order to detect objects contained in video or images.

As a method for analyzing the content of video, there are techniques that extract a feature amount from a video frame and determine, based on that feature amount, whether a specific subject appears in the frame. There are also techniques that perform machine learning for making such determinations. In this learning, the parameters of the determiner are adjusted using learning data (images) labeled as positive or negative examples. In other words, learning is performed on images annotated with the correct answer indicating whether the specific subject appears. A characteristic of this approach is that determiners for detecting a variety of subjects can be realized merely by changing the learning data, without changing the framework of the learning method itself.

To classify frame images into two classes according to whether a specific subject appears, the image data is first converted into some kind of feature vector. The simplest example of a method for obtaining a feature vector is to arrange statistics (for example, the mean and variance) of the R (red), G (green), and B (blue) pixel values over all pixels of the image as elements, yielding a feature vector of a few dimensions. Another example is to calculate the frequency components of the entire image and use their intensity distribution as the feature vector (that is, a feature vector whose elements are the intensity values of the individual frequency components). As yet another technique, Non-Patent Document 1 describes a method that detects feature points in a frame image, calculates gradient features from the areas surrounding those points, and then obtains a histogram of their occurrence frequencies as the feature vector of the frame image. This method is called the Bag of Visual Words (BoVW) method.
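
The simplest method above (per-channel statistics of the whole image) can be sketched as follows. This is only an illustration: the function name `rgb_statistics_vector`, the choice of mean and population variance as the statistics, and the representation of an image as a flat list of (R, G, B) tuples are assumptions made for this example, not details given in the patent.

```python
from statistics import fmean, pvariance


def rgb_statistics_vector(pixels):
    """Build a 6-dimensional feature vector from whole-image colour statistics.

    `pixels` is an iterable of (r, g, b) tuples covering the entire image
    (a hypothetical input format chosen for this sketch).
    """
    # Transpose the pixel list into three per-channel sequences: R, G, B.
    channels = list(zip(*pixels))
    vector = []
    for values in channels:
        vector.append(fmean(values))      # mean of the channel
        vector.append(pvariance(values))  # population variance of the channel
    return vector


if __name__ == "__main__":
    tiny_image = [(0, 10, 20), (2, 10, 40)]
    print(rgb_statistics_vector(tiny_image))
```

The resulting vector has only six elements, which illustrates why the text calls this a feature vector of a few dimensions.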

Non-Patent Document 2 describes a method that divides a frame image into a plurality of regions, calculates a feature vector for each region, and concatenates the calculated vectors to obtain the feature vector of the entire frame image. The frame image divisions specifically shown are, for example, a 2 × 2 division or a 1 × 3 division in the vertical and horizontal directions. The technique described in Non-Patent Document 2 thereby attempts to solve the problem that the position of the subject within the frame image cannot be reflected in the feature vector, and the problem that the features of the subject are mixed with those of the background.
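
A fixed-grid division of this kind can be sketched as follows. The helper names (`grid_regions`, `mean_intensity`, `grid_feature`), the use of a grayscale image stored as a list of rows, and the use of the mean intensity as the per-region feature are all assumptions for illustration; Non-Patent Document 2 uses richer per-region features.

```python
def grid_regions(width, height, cols, rows):
    """Return the (x, y, w, h) rectangles of a fixed cols x rows grid."""
    regions = []
    for r in range(rows):
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            y0, y1 = r * height // rows, (r + 1) * height // rows
            regions.append((x0, y0, x1 - x0, y1 - y0))
    return regions


def mean_intensity(image, region):
    """Stand-in per-region feature: mean pixel value inside the rectangle."""
    x, y, w, h = region
    values = [image[j][i] for j in range(y, y + h) for i in range(x, x + w)]
    return sum(values) / len(values)


def grid_feature(image, cols, rows):
    """Concatenate the per-region features in row-major grid order."""
    height, width = len(image), len(image[0])
    return [mean_intensity(image, reg)
            for reg in grid_regions(width, height, cols, rows)]


if __name__ == "__main__":
    image = [[1, 1, 2, 2],
             [1, 1, 2, 2],
             [3, 3, 4, 4],
             [3, 3, 4, 4]]
    print(grid_feature(image, 2, 2))
```

Because each element of the concatenated vector is tied to one grid cell, the vector encodes where in the frame a feature occurs, which a whole-image statistic cannot do.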

Non-Patent Document 1: G. Csurka, C. Bray, C. Dance and L. Fan, "Visual categorization with bags of keypoints," In Proc. ECCV Workshop on Statistical Learning in Computer Vision, pp. 59-74, 2004.
Non-Patent Document 2: S.-F. Chang, J. He, Y.-G. Jiang, E.E. Khoury, C.-W. Ngo, A. Yanagawa and E. Zavesky, "Columbia University/VIREO-City/IRIT TRECVID2008 high-level feature extraction and interactive video search," In Proc. TRECVID 2008 Workshop, 2008.

However, the technique described in Non-Patent Document 2 divides the frame image at fixed sizes and fixed positions, for example into 2 × 2 or 1 × 3 regions. Fixing the size and manner of division in this way results in insufficient robustness to variations in subject size. For example, the same automobile may appear in close-up, filling the entire frame image, or may appear small in a corner of the frame image. Fixing the image size of the divided regions may make it impossible to detect an automobile whose size departs from that region size.

As a separate problem, when a frame image is divided, the target subject may straddle a boundary between regions. When the subject straddles a region boundary, the feature vectors obtained from the divided images no longer accurately reflect the information of the subject as a whole.

These problems lead to reduced accuracy when detecting a specific subject in a frame image. The present invention has been made in view of these circumstances, and provides an image feature amount calculation device, a learning device, an identification device, and programs therefor for performing highly accurate subject detection.

[1] To solve the above problems, an image feature amount calculation device according to one aspect of the present invention comprises: a region image extraction unit that specifies the ranges of region images of a plurality of sizes contained in an input image; and a feature amount calculation unit that, based on the input image, calculates a feature amount for each of the region images specified by the region image extraction unit and generates the feature amount of the input image by concatenating the feature amounts calculated from the plurality of region images.

Here, a "region image" is an image of a partial region of the input image. An image of exactly the same region as the input image is also a region image. That the region images have a plurality of sizes means that region images of various vertical and horizontal sizes (in units such as numbers of pixels) are used. The plurality of sizes may be pixel counts that change stepwise by a predetermined difference (that is, region images whose vertical or horizontal side lengths form an arithmetic progression). They may instead be pixel counts that change stepwise by a predetermined ratio (that is, region images whose vertical or horizontal side lengths form a geometric progression). The sizes of the region images may also change stepwise in a more irregular manner.
The "feature amount of each region image" is the feature amount (a scalar or a vector) of the image obtained from one of the above region images.
"Concatenating the feature amounts calculated from a plurality of region images" is, for example, an operation that obtains a feature vector by simply arranging (concatenating) the feature amounts obtained from the individual region images as elements.
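
The two regular size schedules mentioned above (side lengths forming an arithmetic or a geometric progression) can be sketched as follows. The function names, the pixel values in the example, and the rounding of geometric sizes to whole pixels are assumptions for illustration; the patent does not prescribe particular sizes in this passage.

```python
def arithmetic_sizes(start, step, limit):
    """Side lengths start, start+step, ... not exceeding limit (in pixels)."""
    if step <= 0:
        raise ValueError("step must be positive")
    sizes = []
    size = start
    while size <= limit:
        sizes.append(size)
        size += step
    return sizes


def geometric_sizes(start, ratio, limit):
    """Side lengths start, start*ratio, ... not exceeding limit (in pixels)."""
    if ratio <= 1:
        raise ValueError("ratio must be greater than 1")
    sizes = []
    size = float(start)
    while round(size) <= limit:
        sizes.append(round(size))  # round to a whole number of pixels
        size *= ratio
    return sizes


if __name__ == "__main__":
    print(arithmetic_sizes(80, 80, 320))
    print(geometric_sizes(40, 2, 320))
```

If the largest size equals the side length of the input image, the input image itself is included as one of the region images, as the text permits.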

The combination of "specifying the ranges of region images of a plurality of sizes" and "calculating the feature amount of each specified region image and generating the feature amount of the input image by concatenating the feature amounts calculated from the plurality of region images" is one of the configurations that carry the technical features of the present embodiment. Because the region images have a plurality of sizes, a subject contained in the input image may extend beyond a given region image, may occupy only a relatively small part of a given region image, or may fit well within a region image of intermediate size. When the subject extends beyond a region image, the features of the subject may not be extracted well from that region image. When the subject appears small within a region image, the feature amount extracted from that region image may carry insufficient information about the subject. When the subject fits well within a region image, the feature amount extracted from that region image represents the features of the subject well. By concatenating the feature amounts calculated from each of the plurality of region images, it becomes relatively more likely that the features of a given subject are well represented somewhere in the concatenated feature amount.
Therefore, with this configuration, a feature amount that captures the features of the subject well can be extracted even if the size at which the subject appears changes.

[2] A learning device according to one aspect of the present invention comprises: the image feature amount calculation device according to [1]; and a classifier learning unit that, based on combinations of information indicating whether the input image is a positive example or a negative example and the feature amount of the input image generated by the feature amount calculation unit, obtains the parameters of a classifier for identifying whether an unknown input image is a positive example or a negative example.
Here, the process of obtaining the parameters of the classifier is a machine learning process based on learning data. Using a predetermined model, the classifier takes as input the feature amount extracted from an unknown input image and, as the result of a calculation using this feature amount and the parameters, outputs information indicating whether the input image is a positive example or a negative example. The parameters are usually a plurality of variables, and the processing of the classifier learning unit yields an optimal set of parameter values. "Positive example or negative example" indicates whether or not the input image belongs to a predetermined cluster. As a concrete example, it indicates whether or not a predetermined subject (a person, a car, a mountain, a dog, a cat, and so on) appears in the input image. This makes it possible to perform learning using good feature amounts.

[3] In one aspect of the present invention, in the above learning device, the region image extraction unit specifies the ranges of the region images such that a plurality of region images of the same size at least partially overlap one another.
This is realized by setting the setting value α or β described in the embodiment to less than 1 (0 < α < 1 or 0 < β < 1). This makes it more likely that a feature amount that represents the features of the subject well can be extracted.
Furthermore, when 0 < α ≤ 0.5 or 0 < β ≤ 0.5, any pixel in the original key frame image is contained in the ranges of at least two region images of the same size. In this case, the likelihood that the subject can be captured within a region image of appropriate size increases still further, so that a better feature amount can be extracted.
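
The effect of α and β on overlap can be sketched as follows. The definitions of α and β are given in the embodiment, which lies outside this passage, so this sketch makes an explicit assumption: α is taken to be the horizontal step between neighbouring regions as a fraction of the region width, and β the vertical step as a fraction of the region height. The function names `sliding_regions` and `coverage` are hypothetical.

```python
def sliding_regions(img_w, img_h, w, h, alpha, beta):
    """Enumerate same-size (x, y, w, h) regions placed on a regular lattice.

    ASSUMPTION: alpha and beta are the horizontal and vertical steps
    expressed as fractions of the region width and height. Values below 1
    therefore make neighbouring regions overlap.
    """
    step_x = max(1, int(w * alpha))
    step_y = max(1, int(h * beta))
    return [(x, y, w, h)
            for y in range(0, img_h - h + 1, step_y)
            for x in range(0, img_w - w + 1, step_x)]


def coverage(regions, px, py):
    """Count how many regions contain the pixel (px, py)."""
    return sum(1 for x, y, w, h in regions
               if x <= px < x + w and y <= py < y + h)


if __name__ == "__main__":
    no_overlap = sliding_regions(8, 4, 4, 4, 1.0, 1.0)
    half_step = sliding_regions(8, 4, 4, 4, 0.5, 1.0)
    print(len(no_overlap), len(half_step))
    print(coverage(no_overlap, 3, 0), coverage(half_step, 3, 0))
```

In this simplified sketch the regions never extend past the image border, so pixels at the very edge of the image can still fall in only one region; how the embodiment treats the border is not stated in this passage.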

[4] An identification device according to one aspect of the present invention comprises: the image feature amount calculation device according to [1]; and an identification unit that identifies whether the input image is a positive example or a negative example, based on parameters learned in advance and the feature amount of the input image generated by the feature amount calculation unit.
This makes it possible to identify whether the input image is a positive example or a negative example, based on the image feature amount obtained by the image feature amount calculation device and the learned parameters.

[5] In one aspect of the present invention, the above identification device further comprises a classifier learning unit that, based on combinations of information indicating whether the input image input as learning data is a positive example or a negative example and the feature amount of the input image generated by the feature amount calculation unit, obtains the parameters of a classifier for identifying whether an unknown input image is a positive example or a negative example, and the identification unit identifies whether the unknown input image is a positive example or a negative example by using the parameters obtained by the classifier learning unit as the parameters learned in advance.
This identification device thus performs both the learning process and the identification process.

[6] In one aspect of the present invention, in the above identification device, the region image extraction unit specifies the ranges of the region images such that a plurality of region images of the same size at least partially overlap one another.
Because a plurality of region images of the same size at least partially overlap one another, it becomes more likely that a feature amount that represents the features of the subject well can be calculated.

[7] One aspect of the present invention is a program for causing a computer to function as: a region image extraction unit that specifies the ranges of region images of a plurality of sizes contained in an input image; and a feature amount calculation unit that, based on the input image, calculates a feature amount for each of the region images specified by the region image extraction unit and generates the feature amount of the input image by concatenating the feature amounts calculated from the plurality of region images.

According to the present invention, the appearance of a subject can be determined with high accuracy without being affected by changes in the position or size of the subject within the image.
In particular, by calculating a feature amount that retains, as information, the features obtained from each of the region images of a plurality of sizes, a feature amount that is robust to changes in subject size can be obtained and used.
Also in particular, by using region images of the same size that at least partially overlap one another, a feature amount that is robust to changes in subject position can be obtained and used. In other words, good results can be obtained even for a subject located on a grid boundary.

FIG. 1 is a block diagram showing the schematic functional configuration of the identification device according to the first embodiment of the present invention.
FIG. 2 is a block diagram showing the detailed functional configuration of the feature amount calculation unit according to the embodiment.
FIG. 3 is a schematic diagram showing the ranges of the region images extracted from a frame image according to the embodiment.
FIG. 4 is a flowchart showing the procedure by which the region image extraction unit according to the embodiment extracts region images.
FIG. 5 is a schematic diagram showing the relationship between the various feature vectors calculated by the feature amount calculation unit according to the embodiment and the region images extracted by the region image extraction unit.
FIG. 6 is a schematic diagram showing a configuration example of the learning data in the embodiment.

Next, embodiments of the present invention will be described with reference to the drawings.
[First Embodiment]
FIG. 1 is a block diagram showing the schematic functional configuration of the identification device 2 according to the first embodiment. As illustrated, the identification device 2 contains the learning device 1. The learning device 1 comprises a learning video input unit 11, a key frame image extraction unit 13, a region image extraction unit 15, a feature amount calculation unit 17, and a classifier learning unit 19. The identification device 2 further comprises a video input unit 12, a key frame image extraction unit 14, a region image extraction unit 16, a feature amount calculation unit 18, and an identification unit 20. Although not illustrated as such, the combination of the region image extraction unit 15 and the feature amount calculation unit 17 functions as an image feature amount calculation device. Similarly, the combination of the region image extraction unit 16 and the feature amount calculation unit 18 functions as an image feature amount calculation device.

The learning device 1 performs machine learning for the identification unit 20 based on the learning data it has read.
The identification device 2 determines whether a specific subject appears in the input video, using the identification unit 20 trained by the learning device 1.

The learning video input unit 11 acquires learning video data from an external source.
The key frame image extraction unit 13 extracts key frame images from the learning video acquired by the learning video input unit 11. Specifically, the key frame image extraction unit 13 detects shot boundaries in the video, divides the video into shots, and then takes a frame image from the beginning or the middle of each shot. A shot boundary is detected, for example, by finding a point at which the sum of the temporal derivatives of the pixel values exceeds a predetermined threshold and forms a peak. For video with no shot boundaries, or video in which a single shot is very long, the key frame image extraction unit 13 extracts key frame images at predetermined time intervals, or at the times when the magnitude of the inter-frame motion vectors becomes equal to or greater than a threshold.
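
The shot boundary rule described above (a thresholded peak in the summed temporal change of pixel values) can be sketched as follows. Several details are assumptions made for illustration: frames are small grayscale images stored as lists of rows, the temporal derivative is approximated by the sum of absolute differences between consecutive frames, and the key frame is taken from the start of each shot (the text also allows the middle). The function names are hypothetical.

```python
def frame_difference(prev, curr):
    """Sum of absolute pixel differences between two consecutive frames."""
    return sum(abs(a - b)
               for row_p, row_c in zip(prev, curr)
               for a, b in zip(row_p, row_c))


def detect_shot_boundaries(frames, threshold):
    """Return indices of frames that start a new shot.

    A boundary is declared where the frame difference exceeds `threshold`
    and is a local peak relative to its neighbouring differences.
    """
    diffs = [frame_difference(frames[i - 1], frames[i])
             for i in range(1, len(frames))]
    boundaries = []
    for k, d in enumerate(diffs):
        left = diffs[k - 1] if k > 0 else float("-inf")
        right = diffs[k + 1] if k + 1 < len(diffs) else float("-inf")
        if d > threshold and d > left and d >= right:
            boundaries.append(k + 1)  # diffs[k] compares frames k and k+1
    return boundaries


def key_frame_indices(frames, threshold):
    """Pick the first frame of every shot as its key frame."""
    return [0] + detect_shot_boundaries(frames, threshold)


if __name__ == "__main__":
    frames = [[[0, 0]], [[0, 0]], [[1, 0]], [[9, 9]], [[9, 8]]]
    print(key_frame_indices(frames, threshold=10))
```

A suitable threshold depends on frame size and content, so in practice it would have to be tuned on representative video.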

The region image extraction unit 15 extracts region images of a plurality of sizes contained in the key frame image extracted by the key frame image extraction unit 13, and specifies the ranges of those region images. The region image extraction unit 15 outputs information on the ranges of the extracted region images.
The feature amount calculation unit 17 calculates a feature vector from the frame image extracted by the key frame image extraction unit 13. The method of calculating the feature vector is described in detail later.
The classifier learning unit 19 trains a classifier for determining whether a subject appears, using learning data labeled as positive or negative examples. The input data to the classifier learning unit 19 are the feature amounts (feature vectors) calculated by the feature amount calculation unit 17 from the key frame images, and each input image is accompanied by a label indicating whether it is a "positive example" or a "negative example". The classifier learning unit 19 uses this label as the correct answer and performs machine learning. General machine learning methods such as support vector machines, neural networks, and Bayesian networks can be used as the learning method of the classifier learning unit 19. A configuration example of the learning data is described in detail later with reference to FIG. 6.
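
The learning step can be illustrated with a deliberately small stand-in. The text names support vector machines, neural networks, and Bayesian networks; the sketch below instead uses a single-layer perceptron, chosen only because it fits in a few lines of dependency-free code. The function names, the encoding of labels as +1 (positive example) and -1 (negative example), and the toy feature vectors are all assumptions.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Learn weights w and bias b from labeled feature vectors.

    `labels` holds +1 for positive examples and -1 for negative examples.
    This is a stand-in for the learners named in the text, not one of them.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def classify(w, b, x):
    """Return +1 (positive example) or -1 (negative example)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1


if __name__ == "__main__":
    samples = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
    labels = [1, 1, -1, -1]
    w, b = train_perceptron(samples, labels)
    print([classify(w, b, x) for x in samples])
```

The learned pair (w, b) plays the role of the classifier parameters: training produces them once, and identification reuses them on feature vectors of unknown images.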

The video input unit 12 acquires video data from an external source. This video data is the video for which it is to be determined whether a specific subject appears.
The key frame image extraction unit 14 extracts key frame images by the same method as the key frame image extraction unit 13. However, the key frame image extraction unit 14 operates not on learning video data but on the video data acquired by the video input unit 12.
The region image extraction unit 16 extracts region images from the key frame image extracted by the key frame image extraction unit 14, by the same method as the region image extraction unit 15.
The feature amount calculation unit 18 extracts the feature amount of the key frame image by the same method as the feature amount calculation unit 17.
The identification unit 20 identifies whether the input image (an unknown image) is a positive example or a negative example, based on the feature amount calculated by the feature amount calculation unit 18. The identification unit 20 has been trained in advance by the classifier learning unit 19. In other words, the parameters that the identification unit 20 uses for identification have been optimized in advance by the learning process of the classifier learning unit 19.

The identification device 2 thereby determines whether a specific subject appears in the input video and outputs the determination result.

As already described, the key frame image extraction units 13 and 14 have the same function. The region image extraction units 15 and 16 also have the same function, as do the feature amount calculation units 17 and 18. The device may therefore be configured so that each pair of functional blocks with the same function is shared.

FIG. 2 is a block diagram showing the detailed functional configuration of the feature amount calculation unit 17. As illustrated, the feature amount calculation unit 17 comprises a feature point detection unit 171, a local feature quantization unit 174, a local feature vector generation unit 177, a color statistical feature calculation unit 172, a color feature vector generation unit 178, a texture feature calculation unit 173, a texture feature vector generation unit 179, and a feature vector generation unit 170.

As shown in FIG. 2, the frame image data is input to the region image extraction unit 15 and to the feature amount calculation unit 17. The region image extraction unit 15 sequentially extracts, from the input frame image, images of grid regions obtained by cutting out parts of it (these are called "region images"). The region image extraction unit 15 then supplies information indicating the range of each region image to the local feature vector generation unit 177, the color feature vector generation unit 178, and the texture feature vector generation unit 179. The shape of a region image is typically rectangular, in which case the information indicating its range is, for example, the coordinates of the pixels at its upper-left and lower-right corners, or the coordinates of the pixel at its upper-left corner together with its vertical and horizontal sizes.
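
The two equivalent ways of describing a rectangular region range mentioned above can be sketched as follows. The class name `Region`, its field names, and the convention that both corner pixels are inclusive are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Range of a rectangular region image: upper-left pixel plus size."""
    x: int
    y: int
    width: int
    height: int

    @classmethod
    def from_corners(cls, left, top, right, bottom):
        """Build a Region from upper-left and lower-right pixel coordinates.

        ASSUMPTION: both corner pixels belong to the region (inclusive).
        """
        return cls(left, top, right - left + 1, bottom - top + 1)

    def corners(self):
        """Return (left, top, right, bottom) with inclusive corner pixels."""
        return (self.x, self.y,
                self.x + self.width - 1, self.y + self.height - 1)


if __name__ == "__main__":
    region = Region.from_corners(10, 20, 19, 39)
    print(region, region.corners())
```

Passing only such range descriptions, rather than cropped pixel data, lets the downstream units reuse features computed once over the whole frame.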

The feature point detection unit 171 extracts feature points from the entire input frame image using a feature point detection method.
The local feature quantization unit 174 quantizes the features of the local areas surrounding the feature points detected by the feature point detection unit 171.
The local feature vector generation unit 177 generates a local feature vector by concatenating the local feature amounts of the individual region images.
The color statistical feature calculation unit 172 converts the color space of the input frame image data and calculates feature amounts in the converted color space.
The color feature vector generation unit 178 generates a color feature vector by concatenating the color feature amounts of the individual region images.
The texture feature calculation unit 173 calculates the texture features of the input frame image data by performing processing such as a wavelet transform.
The texture feature vector generation unit 179 calculates a texture feature vector based on the statistical feature values, for each region image, of the pixel values resulting from the wavelet transform.
The feature vector generation unit 170 generates a vector that concatenates the local feature vector, the color feature vector, and the texture feature vector.
The processing of each of these units is described in detail later.
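
The role of the local feature vector generation unit 177 can be sketched as follows, under stated assumptions. The detailed processing is described later in the patent and is not reproduced here, so this sketch assumes that each feature point has already been detected over the whole frame and quantized to a visual-word index, and that the local feature amount of a region image is the histogram of the visual words whose feature points fall inside it. The function names and the `(x, y, word)` tuple format are hypothetical.

```python
def region_histogram(keypoints, region, vocab_size):
    """Histogram of visual words for the feature points inside one region.

    `keypoints` is a list of (x, y, word) tuples, where `word` is the index
    of the quantized local feature (an assumed input format).
    """
    x, y, w, h = region
    hist = [0] * vocab_size
    for kx, ky, word in keypoints:
        if x <= kx < x + w and y <= ky < y + h:
            hist[word] += 1
    return hist


def local_feature_vector(keypoints, regions, vocab_size):
    """Concatenate the per-region histograms in the order of `regions`."""
    vector = []
    for region in regions:
        vector.extend(region_histogram(keypoints, region, vocab_size))
    return vector


if __name__ == "__main__":
    keypoints = [(1, 1, 0), (5, 1, 2), (6, 2, 2)]
    # One whole-frame region plus its left and right halves.
    regions = [(0, 0, 8, 4), (0, 0, 4, 4), (4, 0, 4, 4)]
    print(local_feature_vector(keypoints, regions, vocab_size=3))
```

Detecting and quantizing feature points once for the whole frame and then binning them per region avoids repeating the expensive detection step for every overlapping region.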

The feature amount calculation unit 18 has the same configuration as the feature amount calculation unit 17. Just as the region image extraction unit 15 supplies information on the extracted region images to the feature amount calculation unit 17, the region image extraction unit 16 supplies information on the extracted region images to the feature amount calculation unit 18.

Next, the details of each feature amount extraction are described.
(A) Extraction of local feature vectors
The bag-of-visual-words method described above is used to extract local feature vectors.
The feature point detection unit 171 extracts feature points from the entire input frame image using a feature point detection method such as SIFT (Scale-Invariant Feature Transform) or SURF (Speeded-Up Robust Features). SIFT and SURF are methods for detecting local features in an image. Details are given in the references [David G. Lowe, "Object recognition from local scale-invariant features," In Proc. IEEE International Conference on Computer Vision, vol. 2, pp. 1150-1157, 1999.] and [Herbert Bay, Tinne Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," In Proc. 9th European Conference on Computer Vision, vol. 3951, pp. 404-417, 2006.], respectively.

Then, the local feature quantization unit 174 quantizes the features of the local region around each feature point detected by the feature point detection unit 171. Specifically, the local feature quantization unit 174 quantizes the gradient feature amounts calculated from the local regions around the feature points by clustering them. For this purpose, the local feature quantization unit 174 clusters, in advance, gradient feature amounts obtained from learning data using, for example, k-means, and obtains a representative value for each cluster. The local feature quantization unit 174 then assigns each feature amount calculated from the input data to the cluster whose representative value is closest.
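The assignment step described above can be sketched as follows. This is a minimal illustration, assuming that Euclidean distance is used to find the closest representative value; the function name `assign_to_codebook` and the toy two-dimensional data are hypothetical and do not come from the specification.

```python
import numpy as np

def assign_to_codebook(descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return, for each descriptor, the index of the nearest cluster representative."""
    # Squared Euclidean distance between every descriptor (N, D) and every centroid (K, D).
    diff = descriptors[:, None, :] - codebook[None, :, :]
    dist2 = np.sum(diff * diff, axis=2)
    return np.argmin(dist2, axis=1)

# Toy example: three 2-D centroids stand in for representatives learned by k-means.
codebook = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
descriptors = np.array([[1.0, 1.0], [9.0, 0.5], [0.2, 8.0]])
words = assign_to_codebook(descriptors, codebook)
```

Real gradient descriptors have many more dimensions (for example, 128 for SIFT), but the nearest-representative rule is the same.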

Then, the local feature vector generation unit 177 obtains information on the range of each region image from the region image extraction unit 15. For the feature points contained in one region image, it obtains an appearance frequency histogram of the quantized gradient feature amounts and the sequence of frequency values that make up the histogram. The local feature vector generation unit 177 performs this processing for all region images. It then generates a local feature vector by concatenating the sequences of frequency values obtained from the individual region images over all region images. Let this local feature vector be V_l. The meaning of "concatenating over all region images" is described in detail later with reference to FIG. 5.
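A minimal sketch of the per-region histogram and its concatenation is given below. It assumes that each feature point is represented by its (x, y) position and the index of its assigned cluster, and that a region is given as (x, y, width, height); these representations and all names are illustrative assumptions.

```python
import numpy as np

def region_histogram(points_xy, words, region, vocab_size):
    """Count the visual words of the feature points that fall inside one region."""
    x0, y0, w, h = region
    inside = ((points_xy[:, 0] >= x0) & (points_xy[:, 0] < x0 + w) &
              (points_xy[:, 1] >= y0) & (points_xy[:, 1] < y0 + h))
    return np.bincount(words[inside], minlength=vocab_size)

def local_feature_vector(points_xy, words, regions, vocab_size):
    """Concatenate the histograms of all regions, in the order the regions are given."""
    return np.concatenate(
        [region_histogram(points_xy, words, r, vocab_size) for r in regions])

# Toy data: three feature points, a vocabulary of three visual words, three regions.
points_xy = np.array([[5, 5], [15, 5], [15, 15]])
words = np.array([0, 2, 2])
regions = [(0, 0, 20, 20), (0, 0, 10, 10), (10, 0, 10, 10)]
v_l = local_feature_vector(points_xy, words, regions, vocab_size=3)
```

Because the regions overlap and span several scales, one feature point can contribute to several histograms.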

(B) Extraction of color feature vectors
The color statistical feature calculation unit 172 converts the input frame image data into the HSV color space and the Lab color space. The HSV color space consists of three components: hue, saturation, and value (brightness). The Lab color space consists of a lightness component (L) and two color-opponent dimensions (a and b). The conversion from, for example, RGB pixel values to the HSV and Lab color spaces is performed using existing techniques. As a result of the conversion, the color statistical feature calculation unit 172 outputs, for each pixel in the frame image, the pixel value of each component c. Here, c ∈ {h, s, v, l, a, b}, where h, s, v, l, a, and b are the components of the HSV and Lab color spaces.

The color feature vector generation unit 178 obtains information on the range of each region image from the region image extraction unit 15 and, for each region image and each component c, calculates the mean μ_c, the standard deviation σ_c, and the cube root of the skewness ω_c of the pixel values. Specifically, the color feature vector generation unit 178 calculates these values according to expressions (1), (2), and (3).

In expressions (1) to (3), x is the abscissa, y is the ordinate, and f_c(x, y) is the pixel value of component c at coordinates (x, y). For both x and y, the range over which the symbol Σ takes the sum is the range of the region image in question. H_S and W_S are the vertical size (height) and horizontal size (width) of that region image, respectively, in units of pixels. H_S and W_S are discussed further below.
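The expressions themselves are not reproduced in this text. One reconstruction consistent with the symbol definitions above is shown below. It assumes that the three statistics are the usual moments normalized by the number of pixels H_S·W_S, and that "cube root of the skewness" means the cube root of the third central moment; both points are assumptions, not statements from the specification.

```latex
\begin{align}
\mu_c    &= \frac{1}{H_S W_S} \sum_{y} \sum_{x} f_c(x, y) \tag{1} \\
\sigma_c &= \sqrt{\frac{1}{H_S W_S} \sum_{y} \sum_{x} \bigl(f_c(x, y) - \mu_c\bigr)^{2}} \tag{2} \\
\omega_c &= \sqrt[3]{\frac{1}{H_S W_S} \sum_{y} \sum_{x} \bigl(f_c(x, y) - \mu_c\bigr)^{3}} \tag{3}
\end{align}
```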

The color feature vector generation unit 178 performs the above processing for all region images. It then generates a color feature vector by concatenating, over all region images, the sequence of values calculated from each region image (μ_h, σ_h, ω_h, μ_s, σ_s, ω_s, μ_v, σ_v, ω_v, μ_l, σ_l, ω_l, μ_a, σ_a, ω_a, μ_b, σ_b, ω_b). Let this color feature vector be V_c. The meaning of "concatenating over all region images" is described in detail later with reference to FIG. 5.
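The 18-value sequence for one region image could be computed as in the following sketch. It assumes that the frame has already been converted and stacked into an array with six planes in the order h, s, v, l, a, b, and that ω_c is taken as the cube root of the third central moment; the array layout and the function names are hypothetical.

```python
import numpy as np

def color_moments(channel: np.ndarray) -> tuple[float, float, float]:
    """Mean, standard deviation, and cube root of the third central moment."""
    values = channel.astype(np.float64).ravel()
    mu = values.mean()
    centered = values - mu
    sigma = np.sqrt(np.mean(centered ** 2))
    omega = np.cbrt(np.mean(centered ** 3))   # cbrt keeps the sign of a negative moment
    return float(mu), float(sigma), float(omega)

def region_color_features(frame_hsvlab: np.ndarray, region) -> np.ndarray:
    """frame_hsvlab: (H, W, 6) array of h, s, v, l, a, b planes; region: (x, y, w, h)."""
    x0, y0, w, h = region
    patch = frame_hsvlab[y0:y0 + h, x0:x0 + w, :]
    feats = []
    for c in range(patch.shape[2]):
        feats.extend(color_moments(patch[:, :, c]))
    return np.array(feats)

# Toy frame: the first plane is 2.0 in the top half and 0.0 in the bottom half.
frame = np.zeros((4, 4, 6))
frame[:2, :, 0] = 2.0
features = region_color_features(frame, (0, 0, 4, 4))
```

Concatenating the outputs of `region_color_features` over all region images yields a vector with the layout described for V_c.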

(C) Extraction of texture feature vectors
Here, a feature amount reflecting the texture of the image is calculated based on the Haar wavelet. First, the texture feature calculation unit 173 applies three levels of the Haar wavelet transform to the input frame image data. Next, the texture feature vector generation unit 179 obtains information on the range of each region image from the region image extraction unit 15, calculates, for each region image, the variance of the pixel values in each subband, and uses the sequence of these variance values as the feature amount of that region image. A texture feature vector is then generated by concatenating these numerical sequences over all region images. Let this texture feature vector be V_t. The meaning of "concatenating over all region images" is described in detail later with reference to FIG. 5.
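The following sketch illustrates one way to obtain per-region subband variances. Several details are assumptions because the text does not specify them: the averaging normalization of the Haar filters, the use of nine detail subbands plus the final approximation subband (ten values per region), and the mapping of a region onto a subband by dividing its coordinates by 2^level. Regions are assumed to be at least 8 pixels on each side so that no subband patch is empty.

```python
import numpy as np

def haar_step(a: np.ndarray):
    """One 2-D Haar level: returns the approximation and three detail subbands."""
    a = a[: a.shape[0] // 2 * 2, : a.shape[1] // 2 * 2]   # crop to an even size
    tl, tr = a[0::2, 0::2], a[0::2, 1::2]
    bl, br = a[1::2, 0::2], a[1::2, 1::2]
    approx = (tl + tr + bl + br) / 4.0
    top_bottom = (tl + tr - bl - br) / 4.0    # difference between the two rows
    left_right = (tl - tr + bl - br) / 4.0    # difference between the two columns
    diagonal = (tl - tr - bl + br) / 4.0
    return approx, (top_bottom, left_right, diagonal)

def haar_subbands(gray: np.ndarray, levels: int = 3):
    """Return a list of (level, subband) pairs for the whole frame."""
    subbands, approx = [], gray.astype(np.float64)
    for level in range(1, levels + 1):
        approx, details = haar_step(approx)
        subbands.extend((level, d) for d in details)
    subbands.append((levels, approx))
    return subbands

def region_texture_features(subbands, region) -> np.ndarray:
    """Variance of each subband inside the region (x, y, w, h) given in frame pixels."""
    x0, y0, w, h = region
    feats = []
    for level, band in subbands:
        s = 2 ** level                         # subband is downsampled by 2^level
        patch = band[y0 // s:(y0 + h) // s, x0 // s:(x0 + w) // s]
        feats.append(float(patch.var()))
    return np.array(feats)

# Toy frame: a smooth ramp, so every detail subband is constant.
gray = np.arange(256, dtype=np.float64).reshape(16, 16)
features = region_texture_features(haar_subbands(gray), (0, 0, 16, 16))
```

The transform is computed once for the whole frame, and only the variance is evaluated per region, which mirrors the division of work between units 173 and 179.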

As described above, the local feature vector generation unit 177 generates the local feature vector V_l, the color feature vector generation unit 178 generates the color feature vector V_c, and the texture feature vector generation unit 179 generates the texture feature vector V_t. The feature vector generation unit 170 then concatenates these three vectors to obtain the feature vector V, which is expressed by equation (4). The vector V concatenated by the feature vector generation unit 170 is the feature amount output from the feature amount calculation unit 17.
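Equation (4) is not reproduced in this text. A form consistent with the description is the simple concatenation below; the order of the three parts is an assumption based on the order in which the description lists them.

```latex
V = \begin{pmatrix} V_l & V_c & V_t \end{pmatrix} \tag{4}
```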

As described above, the feature amount calculation unit 17 calculates, based on the input image, the feature amount of each region image specified by the region image extraction unit 15, and generates the feature amounts of the input image (feature vectors V_l, V_c, V_t, and V) by concatenating the feature amounts calculated from the plurality of region images. The feature amount calculated by the feature amount calculation unit 17 therefore retains the features of each of the plurality of region images as information.

FIG. 3 is a schematic diagram showing the ranges of the grid-region region images extracted by the region image extraction units 15 and 16. The processing by the region image extraction unit 15 is described below as a representative; the processing by the region image extraction unit 16 is the same.
The region image extraction unit 15 changes the size of the region image in stages. In the example shown in the figure, the size of the extracted region image is gradually reduced in the order (a), (b), (c). When the size of the original input frame image is H vertically (height) and W horizontally (width), the size of the region image at the S-th scale (S = 1, 2, 3, ...) is H_S vertically and W_S horizontally, which are given by equation (5): H_S = δ^(S-1)·H and W_S = δ^(S-1)·W.

Here, δ is a constant representing the degree of scale change, with 0 < δ < 1. The value of δ can be set as appropriate within this range. As an example, δ = 0.5 in the case shown in the figure. In (a) of the figure, S = 1, H_1 = H, and W_1 = W. In (b), S = 2, H_2 = δH, and W_2 = δW. In (c), S = 3, H_3 = δ^2·H, and W_3 = δ^2·W. As also shown in the figure, the region image extraction unit 15 extracts the ranges of the region images while moving sequentially in steps of H_S × α in the vertical direction and W_S × β in the horizontal direction. Here, α and β are constants that can be set as appropriate, with 0 < α ≤ 1 and 0 < β ≤ 1. As an example, α = β = 0.5 in the case shown in the figure.

In each of (a) to (c) of the figure, only the upper-left corner of each region image frame is indicated, by a black dot and thick vertical and horizontal lines. The coordinates of the pixel at the upper-left corner of the entire frame image are (x, y) = (0, 0). In (a), S = 1 and the entire frame image corresponds to a single region image. That is, the number of region images N_1 for S = 1 is 1. In (b), S = 2, and the upper-left pixel of each region image takes the x coordinate (abscissa) values 0, βδW, and 2βδW and the y coordinate (ordinate) values 0, αδH, and 2αδH. The broken-line frame shown as an example in (b) indicates the region image whose upper-left pixel is at (x, y) = (βδW, αδH). The number of region images N_2 for S = 2 is 9. In (c), S = 3, and the upper-left pixel of each region image takes the x coordinate values 0, βδ^2·W, 2βδ^2·W, 3βδ^2·W, 4βδ^2·W, 5βδ^2·W, and 6βδ^2·W and the y coordinate values 0, αδ^2·H, 2αδ^2·H, 3αδ^2·H, 4αδ^2·H, 5αδ^2·H, and 6αδ^2·H. The broken-line frame shown as an example in (c) indicates the region image whose upper-left pixel is at (x, y) = (5βδ^2·W, 4αδ^2·H). The number of region images N_3 for S = 3 is 49.

That is, as described above, the region image extraction unit 15 specifies the ranges of the region images so that at least some of the region images of the same size partially overlap one another. Region images of the same size overlap in the vertical direction when α < 1 and in the horizontal direction when β < 1. As a result, even when a subject lies across the frame (boundary) of a region image, that is, when it does not fit within that one region image, the subject may still fit entirely within another region image of the same size. This makes it possible to extract feature amounts that represent the image features of the subject more appropriately.
In particular, when 0 < α ≤ 0.5 or 0 < β ≤ 0.5, any pixel in the original key frame image is included in the ranges of at least two region images of the same size. In this case, the likelihood that the subject is captured within a region image of an appropriate size increases further, so that better feature amounts can be extracted.

FIG. 4 is a flowchart showing the region image extraction procedure performed by the region image extraction unit 15. The procedure is described below along this flowchart. The processing by the region image extraction unit 16 is the same.
First, in step S1, the region image extraction unit 15 initializes the variable S to 1. As described above, S is the value that indexes the scale of the region image.
Next, in step S2, the region image extraction unit 15 determines whether the value of the variable S is less than a preset upper limit (the set scale). If it is less than the upper limit (step S2: YES), the process proceeds to step S3. Otherwise (step S2: NO), the processing of the entire flowchart ends.
Next, in step S3, the region image extraction unit 15 initializes the variable y to 0. This y represents the ordinate. This step initializes the ordinate of the region image.

Next, in step S4, the region image extraction unit 15 determines whether the variable y satisfies the condition expressed by the inequality y + H_S < H. If the condition is satisfied (step S4: YES, that is, another region image can still be taken in the vertical direction), the process proceeds to step S5. If it is not satisfied (step S4: NO, that is, the lower end of the frame image has been reached and no further region image can be taken in the vertical direction), the process branches to step S10.
When the process proceeds to step S5, the region image extraction unit 15 initializes the variable x to 0. This x represents the abscissa. This step initializes the abscissa of the region image.

Next, in step S6, the region image extraction unit 15 determines whether the variable x satisfies the condition expressed by the inequality x + W_S < W. If the condition is satisfied (step S6: YES, that is, another region image can still be taken in the horizontal direction), the process proceeds to step S7. If it is not satisfied (step S6: NO, that is, the right end of the frame image has been reached and no further region image can be taken in the horizontal direction), the process branches to step S9.
When the process proceeds to step S7, the region image extraction unit 15 extracts, according to the current values of the variables x and y, the grid region image of height H_S and width W_S whose base point (upper-left pixel) is at coordinates (x, y). The region image extraction unit 15 then passes information indicating the range of the extracted region image to the local feature vector generation unit 177, the color feature vector generation unit 178, and the texture feature vector generation unit 179. In response, each of these three units calculates the feature amount for that region image by the method described above.

Next, in step S8, the region image extraction unit 15 increases the value of the variable x by an increment of β·W_S. This advances the abscissa of the region image to the coordinates of the next region image. After this step, the process returns to step S6.
When the process proceeds from step S6 to step S9, the region image extraction unit 15 increases the value of the variable y by an increment of α·H_S. This advances the ordinate of the region image to the coordinates of the next region image. After this step, the process returns to step S4.
When the process proceeds from step S4 to step S10, the region image extraction unit 15 updates the variable S to the next value. That is, the value (S + 1) is stored in the storage area of the variable S. This advances the scale of the region image to the next stage. After this step, the process returns to step S2.

Through the series of processes described above, the region image extraction unit 15 extracts all the region images as illustrated in FIG. 3 and passes information indicating the range of each region image to the feature amount calculation unit 17. After the region image extraction unit 15 has finished extracting all the region images, each of the local feature vector generation unit 177, the color feature vector generation unit 178, and the texture feature vector generation unit 179 outputs, as described above, a feature vector in which the feature amount sequences for all region images are arranged. The feature vector generation unit 170 then outputs the feature vector obtained by concatenating those feature vectors. The relationship between the region image extraction unit 16 and the feature amount calculation unit 18 is the same.
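The loop structure of steps S1 to S10 can be sketched as follows. Two choices here are assumptions rather than statements from the flowchart: the bounds are tested inclusively (`<=`) and the scale loop includes the upper limit, because with these choices the sketch reproduces the region counts 1, 9, and 49 given for FIG. 3. Sizes and steps are rounded down to whole pixels, and the 640 × 480 frame size is hypothetical.

```python
def extract_regions(height, width, num_scales, delta=0.5, alpha=0.5, beta=0.5):
    """Enumerate (scale, x, y, w, h) rectangles for every scale and grid position."""
    regions = []
    for s in range(1, num_scales + 1):                 # steps S1, S2, S10
        h_s = int(height * delta ** (s - 1))           # H_S = delta^(S-1) * H
        w_s = int(width * delta ** (s - 1))            # W_S = delta^(S-1) * W
        step_y = max(1, int(alpha * h_s))
        step_x = max(1, int(beta * w_s))
        y = 0                                          # step S3
        while y + h_s <= height:                       # step S4 (inclusive bound assumed)
            x = 0                                      # step S5
            while x + w_s <= width:                    # step S6 (inclusive bound assumed)
                regions.append((s, x, y, w_s, h_s))    # step S7
                x += step_x                            # step S8
            y += step_y                                # step S9
    return regions

regions = extract_regions(480, 640, num_scales=3)
counts = [sum(1 for r in regions if r[0] == s) for s in (1, 2, 3)]
```

In an actual pipeline, the feature computation for each rectangle would take place where step S7 appends to the list.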

In this way, the size of the region image is changed in stages, feature amounts are extracted from each region image, and a feature amount (feature vector) containing the feature amounts of the individual region images as information is used. This provides robustness against variation in the size of the subject in the video.

FIG. 5 is a schematic diagram showing the relationship between the plurality of region images extracted by the method described above and the feature vector. In the figure, (a) to (d) correspond to the scale stages of the region image, with S = 1, 2, 3, and 4, respectively. As described above, the size of the entire frame image is H vertically (height) and W horizontally (width). The size of the region image is δ^(S-1)·H vertically (height) and δ^(S-1)·W horizontally (width), according to the value of S. In each of (a) to (d), one of the region images is indicated by a broken line. The concatenated feature vector is shown in the figure in the form of a two-dimensional graph. In this graph, the horizontal axis represents the order of the (scalar) feature amounts, and the vertical axis represents the magnitude of the values on a scale common to all feature amounts. The sequence of feature amounts in the range labeled "a1" is obtained from the region image in (a) of the figure. The sequences of feature amounts in the ranges labeled "b1", "b2", "b3", and so on are obtained from the respective region images in (b) of the figure.
The figure shows the ranges only up to "b4" and omits the rest, but in practice the feature amount sequences continue for as many region images as there are. The same applies to (c) and (d) of the figure, for which a feature amount sequence for each region image follows. In the present embodiment, a feature vector is generated in this manner by concatenating the feature amount sequences of all region images. That is, the local feature vector, the color feature vector, and the texture feature vector are each obtained by concatenating the feature amount value (or sequence of values) of each region image over all region images, as described with reference to FIG. 5.
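As a worked example of this layout, the sketch below computes where each scale's block starts in the color feature vector and how long the vector becomes for S = 1 to 4 with δ = α = β = 0.5. The 640 × 480 frame size is a hypothetical choice, the window count assumes that a window ending exactly on the frame border is included, and the value 18 is the number of color statistics per region image (6 components × 3 statistics).

```python
def count_positions(size: int, window: int, step: int) -> int:
    """Number of window positions along one axis (end position on the border included)."""
    return (size - window) // step + 1

height, width = 480, 640          # hypothetical frame size
values_per_region = 18            # 6 color components x (mean, std, cube root of skewness)

offsets = []                      # start index of each scale's block in the color vector
total = 0
for s in (1, 2, 3, 4):
    h_s = height // 2 ** (s - 1)  # delta = 0.5
    w_s = width // 2 ** (s - 1)
    n = count_positions(height, h_s, h_s // 2) * count_positions(width, w_s, w_s // 2)
    offsets.append(total)
    total += n * values_per_region
```

Under these assumptions the four scales contribute 1, 9, 49, and 225 region images, so the color feature vector alone has 284 × 18 = 5112 elements.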

Next, an example of how the learning data is organized is described.
FIG. 6 is a schematic diagram showing a configuration example of the learning data. The learning data is stored in a storage device inside the learning device 1. As already described, positive or negative labels are assigned to the learning data. The learning data is organized using, for example, an object-oriented database and has the table structure shown in the figure. It has the data items video number, video data location, frame identification information, and subject type (1 to 40). The data stores information on a plurality of video data items, and one or more key frames are associated with each video data item. The video number is a number assigned to identify the video data. The video data location is information indicating where the actual video data resides; for example, a path name in a file system is used. The frame identification information identifies each of the key frames contained in one video data item. A simple serial number of the key frame may be used as the frame identification information, or information that specifies the frame position in the video, for example in the format "hh:mm:ss.nnn" (hours:minutes:seconds.frame number), may be used. The column for each subject type stores a "positive" or "negative" label (information indicating whether the input image is a positive example or a negative example). These labels are the correct answers indicating whether each of the subject types (1 to 40) is contained as a subject in each key frame extracted by the key frame image extraction unit 13. The sixth to thirty-ninth subject types are omitted from the figure. A "positive" label indicates that the subject is contained in the key frame image. A "negative" label indicates that the subject is not contained in the key frame image. The label values are used as teacher data during learning.
The number of subject types is not limited to 40 and may be larger or smaller.
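One possible in-memory representation of a row of this table is sketched below. The class name, field names, and example paths are all hypothetical; the specification only names the data items and states that an object-oriented database may be used.

```python
from dataclasses import dataclass, field

@dataclass
class KeyFrameRecord:
    video_number: int                 # identifies the video data
    video_location: str               # e.g. a path name in a file system
    frame_id: str                     # serial number or "hh:mm:ss.nnn"
    # subject type (1..40) -> True for a "positive" label, False for "negative"
    labels: dict[int, bool] = field(default_factory=dict)

def training_pairs(records, subject_type):
    """Collect (video number, frame id, label) for one subject type."""
    return [(r.video_number, r.frame_id, r.labels[subject_type])
            for r in records if subject_type in r.labels]

# Two key frames of the same (hypothetical) video, labeled for subject types 1 and 2.
records = [
    KeyFrameRecord(1, "/videos/0001.mpg", "00:00:12.003", {1: True, 2: False}),
    KeyFrameRecord(1, "/videos/0001.mpg", "00:01:40.017", {1: False, 2: False}),
]
pairs = training_pairs(records, subject_type=1)
```

Keeping one label per subject type on each key frame lets the same records serve as teacher data for all the per-subject classifiers.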

The "positive" or "negative" label values are, for example, assigned manually after the key frame image extraction unit 13 has extracted the key frame images, and are then written into the learning data.

As described above, in the present embodiment, whether a specific subject appears in a video is determined by machine learning using learning data labeled as positive examples (a given object or event appears) or negative examples (it does not appear). To this end, an image feature amount is calculated that allows a specific subject to be determined robustly even when the position or size at which the subject appears in the frame image varies. Specifically, the video frame image is divided into grid regions (region images) of various sizes, a feature amount is calculated for each grid region, and the feature amounts are concatenated, which ensures robustness against size variation. The size of the grid regions is changed in stages. In addition, the grid regions are made to overlap one another so that objects lying on grid region boundaries are also handled.

[Evaluation experiment]
The results of an evaluation experiment performed on the present embodiment using actual video data are as follows. In this experiment, 40 types of subjects were detected in approximately 600 hours of video, and the detection accuracy was evaluated. The detection accuracy was obtained by applying the determination processing to all frame images in the test video, sorting the results in descending order of score, and calculating the estimated average precision over the top 2000 results. The parameter settings were δ = 0.5, α = 0.5, and β = 0.5. The range of the region image scale was 1 ≤ S ≤ 4.
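For orientation, the sketch below computes a plain average precision over the top k entries of a ranked list. It is not the estimated variant used in the experiment, whose exact definition is not given in this text, and the normalization by the number of true positives found within the top k is an assumption.

```python
def average_precision_at_k(ranked_labels, k):
    """Average precision over the top-k items of a list sorted by descending score.

    ranked_labels[i] is True if the i-th ranked item actually contains the subject.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_labels[:k], start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank    # precision at this rank
    return precision_sum / hits if hits else 0.0

# Toy ranking: true positives at ranks 1 and 3.
ap = average_precision_at_k([True, False, True, False], k=4)
```

In the toy ranking, the precision is 1/1 at rank 1 and 2/3 at rank 3, and the result is the mean of those two values.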

As the baseline for comparison (a method according to the prior art), a scheme that divides the frame image into a fixed grid was used. Specifically, the feature amounts of the divided regions were obtained using a division of the frame image into 2 × 2 (vertical × horizontal) regions and a division into 3 × 1 (vertical × horizontal) regions.

As a result, it was confirmed that the detection accuracy improved compared with the conventional method. When the accuracy was compared by subject type, some types showed an improvement of up to 4%. Table 1 shows the comparison of detection accuracy between the method of the present embodiment and the conventional method. As shown there, the method of the present embodiment achieved a better estimated average precision (averaged over the 40 subject types) than the conventional method.

[Second Embodiment]
Next, a second embodiment of the present invention will be described. Whereas the first embodiment performs both learning processing and identification processing, the second embodiment performs only learning processing. Of the functions included in the functional block diagram of FIG. 1, the present embodiment has only those of the learning device 1 and the identification unit 20. As in the first embodiment, the learning device 1 includes the learning video input unit 11, the key frame image extraction unit 13, the region image extraction unit 15, the feature amount calculation unit 17, and the discriminator learning unit 19. The processing functions, operations, and effects of these units are also the same as those described in the first embodiment, so their description is omitted. With this configuration, the learning device of the present embodiment can perform machine learning using good feature amounts and generate the identification unit 20 (that is, optimize its parameter values through learning).

[Third Embodiment]
Next, a third embodiment of the present invention will be described. Whereas the first embodiment performs both learning processing and identification processing, the third embodiment performs only identification processing. Of the functions included in the functional block diagram of FIG. 1, the present embodiment includes only the video input unit 12, the key frame image extraction unit 14, the region image extraction unit 16, the feature amount calculation unit 18, and the identification unit 20, and does not include the learning device 1. The processing functions, operations, and effects of these units are the same as those described in the first embodiment, so their description is omitted. The identification unit 20 has been trained in advance. With this configuration, the identification device of the present embodiment can perform identification processing using good feature amounts.

[Fourth Embodiment]
Next, a fourth embodiment of the present invention will be described. In the fourth embodiment, only the function of the image feature amount calculation device described in the first embodiment is implemented as a standalone device. As already described, the image feature amount calculation device is realized as a combination of the region image extraction unit 15 and the feature amount calculation unit 17. The functions, operations, and effects of the region image extraction unit 15 and the feature amount calculation unit 17 in this device are as already described, so their description is omitted here. With the configuration of the present embodiment, the image feature amount calculation device can calculate, from an input image, a good image feature amount, that is, one that is robust against changes in the size of the subject.

[Fifth Embodiment]
In the first to fourth embodiments, the range is moved at equal intervals when region images are extracted. The region image extraction units 15 and 16 of the present embodiment extract region images by a method different from that of the first to fourth embodiments. The extraction methods described below can be applied to any of the first to fourth embodiments. Technical matters other than the method of extracting region images are as already described for each embodiment, so their description is omitted here. The region image extraction units 15 and 16 of the present embodiment extract region images by one of the following methods.

In the first method, the density at which region images are extracted is varied according to the position in the input image. Specifically, in the flowchart described with reference to FIG. 4, the set values α and β are not held constant. For example, α and β are made small in regions near the center of the frame image and relatively large in regions near its periphery. This raises the detection accuracy when the subject is located near the center of the frame image. Conversely, when the detection accuracy is to be raised in the peripheral portion of the frame image, α and β are instead made relatively small in the peripheral portion, since smaller increments yield denser extraction. In either case, 0 < α ≤ 1 and 0 < β ≤ 1. By varying the extraction density in this way, finer-grained computation can be concentrated on the regions of the image that matter most, while the total amount of computation for calculating feature amounts and identifying subjects is kept down.
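A minimal sketch of the first method along one image axis follows. The linear interpolation between a center ratio and an edge ratio, and the names `variable_positions`, `ratio_center`, and `ratio_edge`, are assumptions made for illustration; the text only requires that α and β depend on position and stay within (0, 1].

```python
def variable_positions(frame_len, region_len, ratio_center, ratio_edge):
    """Start offsets of region images along one axis.

    The increment is region_len * ratio, where ratio plays the role of
    alpha (vertical axis) or beta (horizontal axis) and is interpolated
    from ratio_center at the frame center to ratio_edge at the frame edge.
    """
    positions = []
    pos = 0
    while pos + region_len <= frame_len:
        positions.append(pos)
        # Normalized distance of the region center from the frame center
        # (0 at the center, approaching 1 at the edge).
        dist = abs((pos + region_len / 2) - frame_len / 2) / (frame_len / 2)
        ratio = ratio_center + (ratio_edge - ratio_center) * dist
        pos += max(1, round(region_len * ratio))
    return positions
```

Calling the function once per axis and pairing the resulting offsets gives a set of region images that is dense near the center and sparse near the periphery. Swapping the two ratio arguments gives the opposite emphasis.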

In the second method, the density at which region images are extracted is made relatively high in regions where the target subject is likely to be present. Data representing the likelihood (probability value) that the subject is present at each location in the image is supplied from outside. This provides an effect similar to that of the first method: finer-grained computation can be concentrated on the regions of the image that matter most, while the total amount of computation for calculating feature amounts and identifying subjects is kept down.
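One way the second method might look in code is sketched below. The format of the externally supplied data is not specified in the text, so a coarse 2D list of probabilities `prob` covering the frame is assumed. The mapping from probability to increment ratio (linear, between `ratio_max` and `ratio_min`) and the rule for the vertical step are likewise assumptions.

```python
def prob_guided_regions(prob, height, width, h, w, ratio_min=0.25, ratio_max=1.0):
    """Region boxes (top, left, bottom, right) of size h x w, extracted more
    densely where the externally supplied probability map is high."""
    if h > height or w > width:
        return []
    rows, cols = len(prob), len(prob[0])

    def ratio_at(y, x):
        # Look up the probability at the region center in the coarse map.
        r = min(rows - 1, int((y + h / 2) * rows / height))
        c = min(cols - 1, int((x + w / 2) * cols / width))
        # High probability -> small increment ratio -> dense extraction.
        return ratio_max - (ratio_max - ratio_min) * prob[r][c]

    boxes = []
    y = 0
    while y + h <= height:
        x = 0
        ratios = []
        while x + w <= width:
            boxes.append((y, x, y + h, x + w))
            ratios.append(ratio_at(y, x))
            x += max(1, round(w * ratios[-1]))
        # Advance vertically by the finest ratio seen in this row so that
        # high-probability areas stay densely covered in both directions.
        y += max(1, round(h * min(ratios)))
    return boxes
```

With a map that is high on the left half of the frame and zero on the right, the left half receives overlapping regions while the right half receives non-overlapping ones.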

In the third method, a plurality of region images of the same size are extracted at random locations in the frame image.
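The third method can be sketched with the standard library alone. Uniform sampling of the top-left corner and the optional `seed` parameter (for reproducibility) are assumptions; the text only states that the locations are random and the sizes equal.

```python
import random


def random_regions(height, width, h, w, count, seed=None):
    """Return `count` boxes (top, left, bottom, right) of size h x w placed
    at uniformly random positions that lie fully inside the frame."""
    rng = random.Random(seed)
    boxes = []
    for _ in range(count):
        top = rng.randint(0, height - h)   # randint is inclusive at both ends
        left = rng.randint(0, width - w)
        boxes.append((top, left, top + h, left + w))
    return boxes
```

Fixing the seed lets the learning side and the identification side reuse the same region layout, which may matter if the feature amounts of the regions are concatenated in a fixed order.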

[Implementation of the first to fifth embodiments by a computer program]
The functions of the processing units in each of the embodiments described above may be realized by a computer. In that case, a program for realizing these functions may be recorded on a computer-readable recording medium, and the program recorded on the recording medium may be read into a computer system and executed. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. The "computer-readable recording medium" may further include a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or over a communication line such as a telephone line, and a medium that holds the program for a certain period of time, such as volatile memory inside a computer system that serves as the server or client in that case. The program may realize only part of the functions described above, or may realize those functions in combination with a program already recorded in the computer system.

Although a plurality of embodiments have been described above, the present invention can also be implemented in the following modifications.
(Modification 1) In the embodiments described above, α = 0.5 and β = 0.5 were used as an example. An example was also described in which the extraction density of region images is increased by setting α ≤ 0.5 or β ≤ 0.5. However, α > 0.5 or β > 0.5 may also be used.
(Modification 2) In the embodiments described above, local feature vectors, color feature vectors, and texture feature vectors were used as the feature amounts of an image. In this modification, other feature amounts may be used.
(Modification 3) In the embodiments described above, a label value of "positive example" or "negative example" is assigned, within the learning device 1, to the key frames extracted by the key frame image extraction unit 13. In this modification, key frame images corresponding to the video are instead extracted in advance, and the learning device 1 takes in the extracted key frame images together with their label value data as a set from outside. The learning device 1 then performs learning processing based on the key frame images and the label values, without using the video data itself.
(Modification 4) When the region image extraction unit extracts region images by the processing of the flowchart of FIG. 4 and a remainder is left at the bottom or right edge of the original frame image, the increment of the region image coordinates is adjusted so that the bottom or right edge of the region image coincides exactly with the bottom or right edge of the frame image. Alternatively, the coordinates of the region image may be determined so that it extends beyond the bottom or right edge of the frame image. When part of a region image extends outside the frame image, subsequent processing such as feature amount calculation is performed on the assumption that the protruding part has uniform pixel values (that is, that it carries no image information).
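The two alternatives of Modification 4 can be sketched along one axis as follows. Shortening only the final increment is one possible reading of "adjusting the increment", and the function names, the `snap_to_edge` flag, and the `fill` value are illustrative assumptions. Images are represented as plain lists of pixel rows to keep the sketch self-contained.

```python
def axis_positions(frame_len, region_len, ratio, snap_to_edge=True):
    """Start offsets along one axis with increment region_len * ratio.

    If a remainder is left at the far edge, either shorten the final
    increment so the last region ends exactly at the edge, or let the
    last region extend beyond the frame.
    """
    step = max(1, round(region_len * ratio))
    positions = list(range(0, frame_len - region_len + 1, step))
    last_fit = frame_len - region_len
    if positions and positions[-1] != last_fit:
        if snap_to_edge:
            positions.append(last_fit)               # shortened final increment
        else:
            positions.append(positions[-1] + step)   # region overshoots the frame
    return positions


def crop_with_padding(image, top, left, h, w, fill=0):
    """Cut an h x w region out of `image` (a list of pixel rows, with
    top >= 0 and left >= 0); pixels beyond the bottom or right edge are
    replaced by the uniform value `fill`."""
    img_h, img_w = len(image), len(image[0])
    return [
        [image[y][x] if y < img_h and x < img_w else fill
         for x in range(left, left + w)]
        for y in range(top, top + h)
    ]
```

For a 430-pixel axis with 100-pixel regions and a ratio of 0.5, snapping yields a final offset of 330, whereas the overshooting variant yields 350 and relies on the padding step for the last 20 pixels.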

Although the embodiments of the present invention and their modifications have been described in detail above, the specific configuration is not limited to these embodiments, and designs and the like that do not depart from the gist of the present invention are also included.

The present invention can be used for video content management and the like.

DESCRIPTION OF SYMBOLS
1 Learning device
2 Identification device
11 Learning video input unit
12 Video input unit
13, 14 Key frame image extraction unit
15, 16 Region image extraction unit
17, 18 Feature amount calculation unit
19 Discriminator learning unit
20 Identification unit
170 Feature vector generation unit
171 Feature point detection unit
172 Color statistical feature calculation unit
173 Texture feature calculation unit
174 Local feature quantization unit
177 Local feature vector generation unit
178 Color feature vector generation unit
179 Texture feature vector generation unit

Claims

A learning device comprising:
a region image extraction unit that designates ranges of region images of a plurality of sizes included in an input image;
a feature amount calculation unit that calculates, based on the input image, a feature amount of each of the region images designated by the region image extraction unit, and generates a feature amount of the input image by concatenating the feature amounts calculated from the plurality of region images; and
a discriminator learning unit that obtains parameters of a discriminator for discriminating whether an unknown input image is a positive example or a negative example, based on combinations of information indicating whether the input image is a positive example or a negative example and the feature amount of the input image generated by the feature amount calculation unit,
wherein the region image extraction unit designates the ranges of the region images while sequentially moving in increments of (H_S × α) pixels in the vertical direction and (W_S × β) pixels in the horizontal direction, so that a plurality of the region images of the same size, H_S pixels high and W_S pixels wide, at least partially overlap one another, and
wherein 0 < α ≤ 0.5 or 0 < β ≤ 0.5.

An identification device comprising:
a region image extraction unit that designates ranges of region images of a plurality of sizes included in an input image;
a feature amount calculation unit that calculates, based on the input image, a feature amount of each of the region images designated by the region image extraction unit, and generates a feature amount of the input image by concatenating the feature amounts calculated from the plurality of region images; and
an identification unit that identifies whether the input image is a positive example or a negative example, based on parameters learned in advance and the feature amount of the input image generated by the feature amount calculation unit,
wherein the region image extraction unit designates the ranges of the region images while sequentially moving in increments of (H_S × α) pixels in the vertical direction and (W_S × β) pixels in the horizontal direction, so that a plurality of the region images of the same size, H_S pixels high and W_S pixels wide, at least partially overlap one another, and
wherein 0 < α ≤ 0.5 or 0 < β ≤ 0.5.

The identification device according to claim 2, further comprising a discriminator learning unit that obtains parameters of a discriminator for discriminating whether an unknown input image is a positive example or a negative example, based on combinations of information indicating whether an input image input as learning data is a positive example or a negative example and the feature amount of that input image generated by the feature amount calculation unit,
wherein the identification unit identifies whether the unknown input image is a positive example or a negative example by using the parameters obtained by the discriminator learning unit as the parameters learned in advance.

A program for causing a computer to function as the learning device according to claim 1.

A program for causing a computer to function as the identification device according to claim 2 or 3.