JP5953151B2

JP5953151B2 - Learning device and program

Info

Publication number: JP5953151B2
Application number: JP2012157813A
Authority: JP
Inventors: 吉彦河合; 真人藤井
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2012-07-13
Filing date: 2012-07-13
Publication date: 2016-07-20
Anticipated expiration: 2032-07-13
Also published as: JP2014022837A

Description

The present invention relates to a learning device and a program for training a classifier that detects whether a detection target appears in a video.

One technique for searching video uses color histograms as an index. Specifically, a color histogram is computed for a given query video, and video sections with a similar color histogram are retrieved from a collection of videos prepared in advance. In this method, similar video sections are identified by shifting a time window across the video being searched and looking for sections whose color histograms resemble that of the query. However, a search method based on color takes no account of the content of the video. As a result, video of a semantically identical object cannot be detected when its colors differ, while entirely different objects or events are treated as identical when their colors are similar.
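The sliding-window search described above can be sketched as follows. This is only a minimal illustration under stated assumptions: frames are assumed to be lists of already quantized color indices, similarity is assumed to be histogram intersection, and the names `color_histogram`, `histogram_intersection`, and `find_similar_sections` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumption: a frame is a sequence of quantized color indices in [0, bins).
Frame = Sequence[int]


def color_histogram(frames: Sequence[Frame], bins: int) -> List[float]:
    """Normalized color histogram over all pixels of the given frames."""
    counts = [0] * bins
    total = 0
    for frame in frames:
        for color in frame:
            counts[color] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * bins


def histogram_intersection(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity in [0, 1]; 1 means identical normalized histograms."""
    return sum(min(x, y) for x, y in zip(a, b))


def find_similar_sections(
    query: Sequence[Frame], video: Sequence[Frame], bins: int, threshold: float
) -> List[Tuple[int, float]]:
    """Slide a window of len(query) frames over the video and keep the
    start positions whose histogram is similar enough to the query."""
    window = len(query)
    query_hist = color_histogram(query, bins)
    hits = []
    for start in range(len(video) - window + 1):
        section_hist = color_histogram(video[start:start + window], bins)
        similarity = histogram_intersection(query_hist, section_hist)
        if similarity >= threshold:
            hits.append((start, similarity))
    return hits
```

Because only color distributions are compared, a section showing a different object with the same mix of colors would match just as well, which is the limitation noted above.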

Advanced video search and summarization therefore require an index that reflects semantic content rather than surface features such as color and texture. One option is to use, for search, a classifier that determines whether a target belongs to a given category. Such a classifier is built by training on learning data consisting of positive and negative examples. A positive example is data in which the object or event to be detected appears, and a negative example is data in which it does not appear.

FIG. 7 is a diagram for explaining the construction of a classifier from learning data. The figure shows the vector space in which vectors whose elements are feature values obtained from the learning data are placed; in practice the space is multidimensional. Each point indicates the position of the feature vector of one item of learning data, with black points denoting positive examples and white points denoting negative examples. Constructing a classifier corresponds to determining the boundary between positive and negative examples in this space, as indicated by the dotted line. Consequently, the more learning data near the boundary is collected, the more accurate the resulting classifier.

The most reliable way to label learning data correctly as positive or negative is for a human to inspect every item and assign its label. However, building a sufficiently general-purpose classifier requires a large amount of data, so it is very difficult to create learning data for many kinds of objects and events in this way.

One approach to this problem is to train a classifier using a subset of the learning data that has already been labeled as positive or negative, and then to repeat a procedure in which the labels are corrected based on the detection results. In this approach, how the initial labels are assigned is important. Because classifier training and correction of the learning data are iterated on the basis of the labels, biased initial learning data yields a classifier specialized to only part of the data. For example, when building a classifier that detects clocks and watches in general, if the initial learning data contains only wristwatches, the resulting classifier cannot accurately detect wall clocks, table clocks, and the like. Similarly, if the initial learning data contains only footage shot from a particular angle, the classifier may detect the target accurately only from that angle.

One method of creating the initial learning data is to train a classifier using the results detected by several independently developed classification methods and thereby generate the initial learning data (see, for example, Non-Patent Document 1).

Non-Patent Document 1: Stephane Ayache and Georges Quenot, "Evaluation of Active Learning Strategies for Video Indexing", Signal Processing: Image Communication, Vol. 22/7-8, pp. 692-704, 2007.

With a method of creating initial learning data such as that of Non-Patent Document 1, the question is whether the classification methods used to generate the initial learning data provide sufficient accuracy and diversity. There is also the problem of how the classifier itself should be trained.

The present invention has been made in view of these circumstances, and provides a learning device and a program that construct a classifier capable of detecting a detection target, such as a specific object or event, from video with high accuracy by training on diverse learning data.

[1] One aspect of the present invention is a learning device comprising: a video data storage unit that stores video data; a learning data storage unit that stores learning data including a feature value of video data and a label indicating whether the video data is a positive example, in which a detection target appears, or a negative example, in which it does not appear; a classifier construction unit that constructs a classifier using the learning data stored in the learning data storage unit when initial learning data is registered in the learning data storage unit and when learning data is added to the learning data storage unit; a classifier detection unit that performs, on the learning data stored in the learning data storage unit, detection processing in which the classifier constructed by the classifier construction unit is used to detect, from among the video sections of input video data, video data of video sections that look similar to the current positive-example learning data and of video sections whose audio features are similar; a determination unit that determines the accuracy of the classifier based on the detection result of the classifier detection unit; and a learning data addition unit that, when the determination unit determines that the accuracy of the classifier has not reached a predetermined accuracy, selects part of the video data stored in the video data storage unit and adds, to the learning data storage unit, learning data generated by assigning a positive-example label to the feature value of the selected video data, wherein the learning data addition unit selects the part from among, of the video data stored in the video data storage unit, randomly selected video data, video data similar to the video data from which the positive-example learning data stored in the learning data storage unit was obtained, or video data detected by another classifier corresponding to a detection target similar to that of the classifier being constructed.
According to this aspect, the learning device constructs, from initial learning data consisting of positive and negative examples, a classifier that detects whether video is related to the detection target, and checks its accuracy by running detection on the learning data with the constructed classifier. When the accuracy is low, the learning device generates learning data from part of the video data stored in the video data storage unit and adds it to the current learning data as positive examples. The learning device repeats construction of the classifier from the learning data and addition of learning data until the accuracy is sufficiently high.
As a result, the learning device can generate unbiased learning data, which makes it possible to construct a classifier that detects a detection target such as a specific object or event from video with high accuracy.
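The build, check, and add cycle of this aspect can be sketched roughly as below. The sketch rests on several assumptions that the text does not state: a nearest-centroid rule stands in for the classifier, "accuracy" is taken to be the fraction of learning data whose detected label agrees with its stored label, and candidates are added in fixed-size batches. The names `Sample`, `build_classifier`, `agreement`, and `train` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Sample:
    features: Sequence[float]
    positive: bool  # True = positive example, False = negative example


Classifier = Callable[[Sequence[float]], bool]


def build_classifier(data: List[Sample]) -> Classifier:
    """Stand-in learner (assumption): nearest class centroid.
    Expects at least one positive and one negative sample."""
    def centroid(rows: List[Sequence[float]]) -> List[float]:
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    pos = centroid([s.features for s in data if s.positive])
    neg = centroid([s.features for s in data if not s.positive])
    return lambda f: dist2(f, pos) <= dist2(f, neg)


def agreement(clf: Classifier, data: List[Sample]) -> float:
    """Assumed accuracy measure: share of samples whose detection
    result matches the stored label."""
    return sum(clf(s.features) == s.positive for s in data) / len(data)


def train(
    initial: List[Sample],
    candidates: List[Sequence[float]],
    target: float = 0.9,
    batch: int = 2,
    max_rounds: int = 10,
) -> Tuple[Classifier, List[Sample]]:
    """Rebuild the classifier, adding candidates as positive examples,
    until the target is reached or the candidates run out."""
    data, pool = list(initial), list(candidates)
    clf = build_classifier(data)
    for _ in range(max_rounds):
        if agreement(clf, data) >= target or not pool:
            break
        added, pool = pool[:batch], pool[batch:]
        data.extend(Sample(f, True) for f in added)
        clf = build_classifier(data)
    return clf, data
```

In the device described here the candidates would come from the video data storage unit; in the sketch they are simply passed in as feature vectors.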

Also according to this aspect, the learning device selects, from the registered video data, part of the video data chosen at random, of the video data with high visual or auditory similarity to the positive-example video data, or of the video data detected using an already trained classifier whose detection target is semantically similar to the object or event targeted by the classifier being constructed. It then generates learning data from the selected video data and adds it to the current learning data.
As a result, the learning device can add learning data in a way that increases diversity, which makes it possible to construct a more accurate classifier.

[2] One aspect of the present invention is the learning device described above, wherein, when learning data is added to the learning data storage unit, the classifier detection unit performs detection with the classifier on the learning data stored in the learning data storage unit and sets the label of the learning data to positive or negative based on the detection result, and, when learning data is added to the learning data storage unit, the classifier construction unit constructs a classifier using the learning data stored in the learning data storage unit after the classifier detection unit has set the labels.
According to this aspect, before constructing a classifier in each iteration, the learning device runs detection with the current classifier on all the learning data, including the newly added learning data, and rewrites the labels of the learning data based on the detection result.
As a result, labeling errors in the learning data are corrected, which improves the performance of the constructed classifier.
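The relabeling step of this aspect could look roughly like the following. It is a sketch only: the learning data is assumed to be a list of dictionaries with `features` and `label` keys, the classifier is assumed to expose a real-valued score, and the threshold of 0.0 is an arbitrary illustrative choice. The function name `relabel` is hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def relabel(
    learning_data: List[Dict],
    score: Callable[[Sequence[float]], float],
    threshold: float = 0.0,
) -> int:
    """Overwrite each label with the current classifier's decision and
    return how many labels changed."""
    changed = 0
    for item in learning_data:
        detected = "positive" if score(item["features"]) >= threshold else "negative"
        if detected != item["label"]:
            item["label"] = detected
            changed += 1
    return changed
```

The returned count is not part of the described device; it is included only because it gives a simple way to see whether the labels have stabilized between iterations.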

[3] One aspect of the present invention is the learning device described above, further comprising a learning data correction unit that corrects the labels of the initial learning data registered in the learning data storage unit, or of the learning data whose labels were set by the classifier detection unit, based on user input or on the result of detection on the learning data by another classifier, wherein the classifier construction unit constructs a classifier using the learning data stored in the learning data storage unit after the learning data correction unit has corrected the labels, both when initial learning data is registered in the learning data storage unit and when learning data is added to the learning data storage unit.
According to this aspect, before constructing a classifier, the learning device corrects the positive and negative labels of the learning data based on user input or on the detection result of another classifier.
As a result, labeling errors in the learning data are corrected accurately, which improves the performance of the constructed classifier.

[4] One aspect of the present invention is the learning device described above, further comprising an initial learning data generation unit that detects whether text data representing the audio of the video data contains a keyword representing the detection target of the classifier to be constructed and other keywords related to that keyword, generates initial learning data by assigning a positive-example label to the feature value of the video data corresponding to the detected text data, and registers the initial learning data in the learning data storage unit.
According to this aspect, initial learning data can be generated based on the content of the video data rather than on surface features such as the color and texture of the video.
As a result, the learning device can generate diverse initial learning data based on the content of the video data and construct a classifier that enables accurate retrieval of video based on semantic content.

[5] One aspect of the present invention is a program that causes a computer to function as: a learning data storage unit that stores learning data including a feature value of video data and a label indicating whether the video data is a positive example, in which a detection target appears, or a negative example, in which it does not appear; a learning data storage unit that stores learning data including a feature value of video and a label indicating whether the video is a positive or a negative example with respect to the detection target; a classifier construction unit that constructs a classifier using the learning data stored in the learning data storage unit when initial learning data is registered in the learning data storage unit and when learning data is added to the learning data storage unit; a classifier detection unit that performs, on the learning data stored in the learning data storage unit, detection processing in which the classifier constructed by the classifier construction unit is used to detect, from among the video sections of input video data, video data of video sections that look similar to the current positive-example learning data and of video sections whose audio features are similar; a determination unit that determines the accuracy of the classifier based on the detection result of the classifier detection unit; and a learning data addition unit that, when the determination unit determines that the accuracy of the classifier has not reached a predetermined accuracy, selects part of the video data stored in the video data storage unit and adds, to the learning data storage unit, learning data generated by assigning a positive-example label to the feature value of the selected video data, the program causing the learning data addition unit to select the part from among, of the video data stored in the video data storage unit, randomly selected video data, video data similar to the video data from which the positive-example learning data stored in the learning data storage unit was obtained, or video data detected by another classifier corresponding to a detection target similar to that of the classifier being constructed.

According to the present invention, diverse learning data can be generated, and a classifier that detects a detection target such as a specific object or event from video with high accuracy can be constructed by training on the generated learning data.

FIG. 1 is a block diagram showing the configuration of a learning device according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of audio text data according to the embodiment.
FIG. 3 is a diagram showing an example of learning data according to the embodiment.
FIG. 4 is a diagram showing the processing flow of the learning device according to the embodiment.
FIG. 5 is a diagram showing experimental results obtained with the learning device according to the embodiment.
FIG. 6 is a diagram showing experimental results obtained with the learning device according to the embodiment.
FIG. 7 is a diagram for explaining the construction of a classifier using learning data.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a learning device 1 according to an embodiment of the present invention; only the functional blocks relevant to this embodiment are shown. The learning device 1 can be implemented, for example, by one or more computer devices such as server computers.

The learning device 1 performs training using positive-example and negative-example learning data generated from video data that has been input (hereinafter, "input video data") and constructs a classifier. Here, a positive example indicates that the object or event to be detected appears in the video, and a negative example indicates that it does not. A classifier is an algorithm that takes the features of a video as input and detects whether the video is related to the object or event to be detected.

First, the learning device 1 uses text representing the audio of the input video data to extract video sections corresponding to the keyword that represents the detection target of the classifier to be constructed, to synonyms and near-synonyms of that keyword, or to other keywords in a semantic inclusion relationship with it. The learning device 1 constructs a classifier using the extracted video sections as initial positive-example learning data. If the accuracy of the constructed classifier is insufficient, it adds to and corrects the learning data so as to make the learning data as diverse as possible, constructs the classifier again, and repeats this process.

When adding learning data, the learning device 1 mixes a fixed proportion of video data into the already generated learning data as positive examples. This video data is taken from the video sections obtained by dividing the input video data into units: randomly selected video sections, video sections with high audiovisual similarity to the positive examples, and video sections detected using already trained classifiers corresponding to semantically similar keywords. This prevents the learning data from becoming skewed toward one part of the data and makes it possible to construct a highly accurate classifier.

As shown in the figure, the learning device 1 comprises a storage unit 10, an input unit 11, a video section division unit 12, an initial learning data generation unit 13, a learning data correction unit 14, a classifier construction unit 15, a classifier detection unit 16, a classifier determination unit 17, and a learning data addition unit 18.

The storage unit 10 is implemented by a hard disk device, semiconductor memory, or the like, and comprises a video data storage unit 101, a learning data storage unit 102, and a classifier storage unit 103.
The video data storage unit 101 stores input video data and audio text data. The input video data is moving-image content data; this embodiment describes the case where broadcast programs are used as the moving images.
The audio text data includes text data representing the audio of the input video data and synchronization data identifying the portion of the input video data to which that text data corresponds. In this embodiment, the audio text data is either closed caption data, which is a transcription of the program audio, or speech recognition data, which is the result of applying speech recognition to the audio contained in the input video data.

The learning data storage unit 102 stores learning data for constructing a classifier. The learning data associates a video section in the input video data with feature data and with a label indicating whether the section is a positive or a negative example. The feature data represents the image feature values of the video section.
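One possible in-memory shape for an item of learning data is sketched below. The field names, the use of seconds for section boundaries, and the class names `Label`, `VideoSection`, and `LearningDatum` are all assumptions made for illustration; the text only states that a video section, feature data, and a label are associated with one another.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Label(Enum):
    POSITIVE = "positive"  # the detection target appears in the section
    NEGATIVE = "negative"  # the detection target does not appear


@dataclass(frozen=True)
class VideoSection:
    video_id: str     # hypothetical identifier of the input video data
    start_sec: float  # section start (assumed unit: seconds)
    end_sec: float    # section end (assumed unit: seconds)


@dataclass
class LearningDatum:
    section: VideoSection
    features: Tuple[float, ...]  # image feature values of the section
    label: Label                 # mutable, since labels are later corrected
```

The section is frozen while the datum is not, reflecting that the label is rewritten during the iterative process whereas the section it refers to does not change.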

The classifier storage unit 103 stores existing classifiers and new classifiers constructed from learning data. Each classifier is associated with a keyword representing its detection target. It is an algorithm that takes as input feature data representing the image features of video data and detects whether the video data is related to the detection target. A classifier uses a classification algorithm such as a support vector machine or a decision tree, and computes from the input feature data a value that quantitatively expresses the degree to which the video is related to the detection target.

The input unit 11 receives various kinds of input data, such as input video data, a keyword representing the object or event to be detected by the classifier to be newly constructed, and information identifying the learning data that the user has selected for label rewriting.
The video section division unit 12 divides each item of input video data into video sections. In this embodiment, the video section division unit 12 divides the input video data into individual shots. A shot is a section filmed continuously by a single camera, that is, a section bounded by camera switching points.

The initial learning data generation unit 13 generates initial learning data from the input video data. It comprises a closed caption extraction unit 131, a program speech recognition unit 132, a keyword expansion unit 133, a video section extraction unit 134, and a feature data extraction unit 135.
The closed caption extraction unit 131 extracts closed caption data from the input video data and uses it as audio text data. The program speech recognition unit 132 applies speech recognition to the program audio of the input video data and generates audio text data. The keyword expansion unit 133 uses a thesaurus or dictionary stored in a thesaurus storage device 5 connected to the learning device 1 to extract keywords similar to the input keyword, synonymous keywords, keywords in a semantic inclusion relationship with it, and so on. The video section extraction unit 134 uses the audio text data to extract from the input video data the video sections corresponding to the input keyword or to the keywords extracted by the keyword expansion unit 133. The feature data extraction unit 135 obtains feature data from the video data of the video sections extracted by the video section extraction unit 134 and generates the initial learning data.
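Keyword expansion and caption-based section extraction might be sketched as follows. The thesaurus is assumed to be a plain dictionary from a keyword to related terms, captions are assumed to carry start and end times in seconds as their synchronization data, and matching is assumed to be case-insensitive substring search, which would not be adequate for Japanese text without word segmentation. The names `expand_keyword` and `shots_for_keywords` are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

Caption = Tuple[float, float, str]  # (start_sec, end_sec, text); assumed layout
Shot = Tuple[float, float]          # (start_sec, end_sec)


def expand_keyword(keyword: str, thesaurus: Dict[str, Set[str]]) -> Set[str]:
    """Return the keyword together with its related terms, lowercased."""
    related = thesaurus.get(keyword, set())
    return {keyword.lower()} | {term.lower() for term in related}


def shots_for_keywords(
    captions: Sequence[Caption], shots: Sequence[Shot], keywords: Set[str]
) -> List[Shot]:
    """Select the shots that overlap in time with a caption containing
    any of the keywords."""
    matched = []
    for shot_start, shot_end in shots:
        for cap_start, cap_end, text in captions:
            overlaps = cap_start < shot_end and cap_end > shot_start
            if overlaps and any(k in text.lower() for k in keywords):
                matched.append((shot_start, shot_end))
                break  # one matching caption is enough for this shot
    return matched
```

Shots selected this way would be labeled as positive examples only provisionally, since a keyword being spoken does not guarantee that the target is visible.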

The learning data correction unit 14 corrects the positive or negative labels assigned to the learning data, based either on information entered through the input unit 11 or on the detection results of the classifiers constructed during the iterative training process. The initial learning data generation unit 13 generates the initial learning data by treating the video sections corresponding to the keywords as positive examples as they are. However, even when a keyword occurs in the program audio or closed captions, the target object or event does not necessarily appear in the video, so the positive and negative labels of the initial learning data need to be corrected. Likewise, learning data labeled on the basis of the detection results of classifiers constructed during the iterative training process may contain errors and omissions. The labels of the learning data therefore also need to be corrected during the iterative process.

The classifier construction unit 15 constructs a classifier from the learning data.
The classifier detection unit 16 applies the classifier constructed by the classifier construction unit 15 to the learning data and obtains a detection result. The classifier detection unit 16 treats the learning data judged from the detection result to be related to the detection target as the positive-example learning data for the next iteration.
The classifier determination unit 17 determines whether the accuracy of the classifier constructed by the classifier construction unit 15 is sufficient.

The learning data addition unit 18 adds learning data when the classifier determination unit 17 determines that the accuracy of the classifier is insufficient. If a classifier were simply rebuilt using as positive examples the learning data that the classifier's own detection results judged to be related to the detection target, it would become able to detect accurately only data similar to the learning data on which it was trained. Moreover, if the initial data has a problem such as insufficient diversity, there is a risk of training a classifier that detects only particular data accurately. The learning data addition unit 18 therefore adds, as positive-example learning data, video data selected from the video sections of the input video data by a classifier of a scheme that has no dependency on the classifier currently being trained. This avoids training a classifier that is biased toward particular data.

The learning data adding unit 18 comprises a random data selection unit 181, a similar video selection unit 182, a similar classifier detection unit 183, and a data mixing unit 184.
The random data selection unit 181 takes the video data of video sections extracted at random from the video sections of the input video data as candidates for addition to the learning data.
The similar video selection unit 182 selects, from the video sections of the input video data, the video data of video sections that look similar to the current positive-example learning data and of video sections whose audio features are similar to it, and takes them as candidates for addition to the learning data.
The similar classifier detection unit 183 selects, from the already trained classifiers stored in the classifier storage unit 103, trained classifiers that correspond to keywords semantically similar to the input keyword, keywords related to it, and keywords in a semantic inclusion relationship with it. The similar classifier detection unit 183 runs detection on the video sections of the input video data using the selected trained classifiers, and takes the video data of the detected video sections as candidates for addition to the learning data.
The data mixing unit 184 adds a fixed proportion of the video data nominated as candidates by the random data selection unit 181, the similar video selection unit 182, and the similar classifier detection unit 183 to the learning data as positive examples. When additions are made multiple times, the data mixing unit 184 does not add, as a positive example, any data that the learning data correction unit 14 has judged to be a negative example even once.

FIG. 2 shows an example of audio text data. The audio text data shown in the figure is closed caption data. It contains text data representing the program audio and synchronization data, expressed as time code information, corresponding to that text data.

FIG. 3 shows an example of learning data. As shown in the figure, learning data associates video section specifying data that identifies a video section, feature data representing the image features of that video section, and a label indicating whether the item is a positive or a negative example. The video section specifying data consists of the identification information of the input video data and the start and end positions of the video section within it. The start and end positions are expressed as playback time from the beginning of the input video data, for example as time code information.
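The record layout described for FIG. 3 can be sketched as a small data class. This is only an illustration: the class name `LearningSample`, the field names, and the example values are assumptions made here, and the patent does not prescribe any particular in-memory representation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LearningSample:
    """One item of learning data, following the fields listed for FIG. 3."""
    video_id: str          # identification information of the input video data
    start_sec: float       # start position, as playback time from the beginning
    end_sec: float         # end position, as playback time from the beginning
    features: List[float]  # feature data for the video section
    is_positive: bool      # label: True = positive example, False = negative


# Illustrative values only; they are not taken from the patent.
sample = LearningSample("program_001", 12.0, 15.5, [0.12, 0.80, 0.33], True)
duration = sample.end_sec - sample.start_sec
```

Keeping the label as a mutable field matches the later steps, in which labels are rewritten between positive and negative.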

FIG. 4 is a flowchart of the processing procedure of the learning device.
First, the input unit 11 of the learning device 1 receives input video data and a keyword representing the object or event to be detected by the classifier that is to be newly constructed. The input unit 11 writes the input video data to the video data storage unit 101 and outputs the keyword to the initial learning data generation unit 13.

The video section dividing unit 12 reads the input video data stored in the video data storage unit 101 and divides each item of input video data into shots. For example, the video section dividing unit 12 calculates the difference between adjacent frames of the input video data, detects cut points using that difference as an index, and splits the input video data into video sections at the detected cut points. The video section dividing unit 12 writes division data indicating the start and end positions of each video section to the video data storage unit 101 in association with the input video data (step S105). From then on, the learning device 1 identifies video sections in the video data on the basis of this division data.
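The cut-point detection in step S105 might look like the following minimal sketch. It assumes that each frame has already been reduced to a flat list of pixel intensities and that a cut is declared when the mean absolute difference between adjacent frames exceeds a fixed threshold. The text only says that the inter-frame difference is used as an index, so the difference measure, the threshold, and the function names are all assumptions.

```python
from typing import List, Tuple

Frame = List[float]


def frame_difference(a: Frame, b: Frame) -> float:
    """Mean absolute difference between two frames of equal size."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def detect_cuts(frames: List[Frame], threshold: float) -> List[int]:
    """Return each index i where a cut lies between frame i-1 and frame i."""
    return [
        i for i in range(1, len(frames))
        if frame_difference(frames[i - 1], frames[i]) > threshold
    ]


def split_into_shots(frames: List[Frame], threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs for each shot, end exclusive."""
    bounds = [0] + detect_cuts(frames, threshold) + [len(frames)]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]


# Four tiny two-pixel frames with one abrupt change in the middle.
frames = [[0.0, 0.0], [0.1, 0.0], [0.9, 1.0], [1.0, 1.0]]
shots = split_into_shots(frames, threshold=0.5)
```

The returned index pairs play the role of the division data; converting frame indices to time codes would depend on the frame rate of the input video.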

Next, the initial learning data generation unit 13 generates the initial learning data from the input video data (step S110).
First, when closed captions are superimposed on the input video data, the closed caption extraction unit 131 extracts them from the input video data and writes them to the video data storage unit 101 as audio text data.

Next, the program audio recognition unit 132 obtains audio data from input video data that carries no closed captions and performs speech recognition on the program audio in that audio data. The program audio recognition unit 132 generates speech recognition data that associates the text data resulting from the recognition with synchronization data indicating the part of the input video data from which the recognized audio was obtained, and writes it to the video data storage unit 101 as audio text data.

The keyword expansion unit 133 searches the thesaurus and dictionaries stored in the thesaurus storage device 5, which is provided either outside or inside the learning device 1, and reads out keywords similar to the input keyword, synonymous keywords, and keywords in a semantic inclusion relationship with it. For example, when the input keyword is "car" (車), the keyword expansion unit 133 obtains "automobile" (自動車), "car" (カー), "taxi" (タクシー), "passenger car" (乗用車), and so on as similar or synonymous keywords, and "land transport" (陸上交通), "minivan" (ワンボックスカー), "light car" (軽自動車), and so on as keywords in a semantic inclusion relationship. Hereinafter, the similar keywords, synonymous keywords, and keywords in a semantic inclusion relationship obtained from the input keyword are referred to as "related keywords".
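Keyword expansion of this kind can be sketched as a dictionary lookup. The thesaurus layout below (a mapping with `synonyms`, `broader`, and `narrower` lists) and the function name `expand_keyword` are hypothetical; the format of the thesaurus storage device 5 is not specified in the text, and the entries are English glosses of the examples above.

```python
from typing import Dict, List

# Hypothetical in-memory thesaurus; a real one would be loaded from storage.
THESAURUS: Dict[str, Dict[str, List[str]]] = {
    "car": {
        "synonyms": ["automobile", "taxi", "passenger car"],
        "broader": ["land transport"],
        "narrower": ["minivan", "light car"],
    }
}


def expand_keyword(keyword: str, thesaurus: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Collect related keywords for `keyword`, without duplicates."""
    entry = thesaurus.get(keyword, {})
    related: List[str] = []
    for relation in ("synonyms", "broader", "narrower"):
        for term in entry.get(relation, []):
            if term != keyword and term not in related:
                related.append(term)
    return related


related_keywords = expand_keyword("car", THESAURUS)
```

An unknown keyword simply yields an empty list, so the later search falls back to the input keyword alone.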

The video section extraction unit 134 searches the audio text data stored in the video data storage unit 101 for the input keyword and the related keywords, and obtains the synchronization data corresponding to each keyword it finds. This synchronization data represents the time within the program at which the keyword appeared. Denoting that time by t, t is given by the time code information in the closed caption or by the time at which the speech was recognized. The video section extraction unit 134 extracts from the input video data the video section corresponding to the obtained synchronization data.

For example, for a keyword appearance time t, the video section extraction unit 134 selects the video section from time t−δ to time t+δ, where δ is a predetermined duration. The video section extraction unit 134 takes time t−δ as the start position and time t+δ as the end position.
Alternatively, the video section extraction unit 134 selects the shot at time t. In this case, it selects, from the video sections indicated by the division data attached to the input video data, the video section that contains the time t indicated by the obtained synchronization data.
The feature data extraction unit 135 generates feature data representing the video features from the video data of the video section selected by the video section extraction unit 134.
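The two ways of choosing a video section around a keyword time t can be sketched as follows. The function names are assumptions, times are taken to be seconds from the start of the program, and clamping the window start at zero is an added safeguard that the text does not mention.

```python
from typing import List, Optional, Tuple

Section = Tuple[float, float]  # (start, end) in seconds


def window_section(t: float, delta: float) -> Section:
    """Fixed window from t - delta to t + delta (start clamped at 0: an assumption)."""
    return (max(0.0, t - delta), t + delta)


def shot_containing(t: float, shots: List[Section]) -> Optional[Section]:
    """Return the shot whose interval contains t, or None if there is none."""
    for start, end in shots:
        if start <= t < end:
            return (start, end)
    return None


# Illustrative division data for one program.
shots = [(0.0, 8.0), (8.0, 14.5), (14.5, 30.0)]
by_window = window_section(10.0, 3.0)
by_shot = shot_containing(10.0, shots)
```

The window variant needs no shot boundaries, while the shot variant reuses the division data written in step S105.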

Because the features used as feature data must cope with a wide variety of objects and events, general-purpose features are used rather than features specialized for a particular object or event. Specifically, feature data is generated by combining color moments over grid regions, edge direction histograms, Gabor wavelets, Haar wavelets, local binary patterns, and the like. Local binary patterns, for example, are described in T. Ojala, M. Pietikäinen and T. Mäenpää, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971-987, 2002 (Reference 1).
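As an illustration of one of the listed features, the sketch below computes the basic 3x3 local binary pattern and its 256-bin histogram for a grayscale image given as a list of rows. This is the plain textbook variant, not the multiresolution, rotation-invariant version of Reference 1, and the neighbor ordering and function names are choices made here.

```python
from typing import List

Image = List[List[int]]

# Clockwise neighbor offsets starting at the top-left pixel (an arbitrary choice).
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img: Image, r: int, c: int) -> int:
    """8-bit code: bit i is set when neighbor i is >= the center pixel."""
    center = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBORS):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img: Image) -> List[int]:
    """Histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist


img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
code = lbp_code(img, 1, 1)
hist = lbp_histogram(img)
```

A histogram like this would be one component of the combined feature vector, alongside the color and edge features.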

Alternatively, it is also possible to use features that incorporate an approach from generic object recognition, in which luminance gradient histograms are computed in local regions around feature points such as corners and a frequency histogram of them is then built. This is described, for example, in G. Csurka, C. Bray, C. Dance and L. Fan, "Visual categorization with bags of keypoints," in Proc. ECCV Workshop on Statistical Learning in Computer Vision, pp. 59-74, 2004 (Reference 2).
In addition, as features that take the time direction into account, it is conceivable to use sequences of inter-frame motion vectors, to consider the correlation of features between frames, or to use audio features.

The feature data extraction unit 135 generates learning data containing the video section specifying data for the extracted video section, the feature data of that video section, and a label indicating a positive example, writes it to the learning data storage unit 102, and registers it as the initial learning data.

Next, the learning data correction unit 14 corrects the labels of the initial learning data, all of which are currently set to positive (step S115). Correction is most accurate when done by hand. For the initial learning data, therefore, a person (the user) judges whether each item labeled as a positive example is correct, and the label of any item judged to be a negative example is changed to negative.

Specifically, the input unit 11 receives information identifying which of the positive-example learning data currently stored in the learning data storage unit 102 should be treated as negative examples. The learning data correction unit 14 rewrites the labels of the learning data identified by that information from positive to negative.

The classifier construction unit 15 constructs a classifier using the learning data currently stored in the learning data storage unit 102 and writes it to the classifier storage unit 103 (step S120). The classifier construction unit 15 uses a machine learning method such as a support vector machine or a random forest to construct the classifier.

The classifier detection unit 16 applies the classifier constructed in step S120 to the learning data currently stored in the learning data storage unit 102, using the feature data of each item of learning data as input, and obtains a detection result. The detection result gives, for each item of learning data, a value that quantifies how strongly it is related to the detection target, together with the rank derived from that value. For learning data judged from the detection result to be related to the detection target, the classifier detection unit 16 rewrites a negative label to positive. For learning data judged to be unrelated, it rewrites a positive label to negative. The classifier determination unit 17 then determines, from the detection result of the classifier constructed in step S120, whether the accuracy of that classifier is at or above a threshold (step S125).
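The relabeling and ranking that follow detection can be sketched as below. The text does not say how "related" is decided from the score, so this sketch assumes a simple decision threshold on the classifier score. The function name `relabel`, the threshold value, and the sample identifiers are all hypothetical.

```python
from typing import Dict, List, Tuple


def relabel(scores: Dict[str, float], threshold: float = 0.0) -> Tuple[Dict[str, bool], List[str]]:
    """Return new labels (True = positive) and sample ids ranked by descending score."""
    # Assumption: a sample counts as related when its score reaches the threshold.
    labels = {sample_id: score >= threshold for sample_id, score in scores.items()}
    ranking = sorted(scores, key=lambda sample_id: scores[sample_id], reverse=True)
    return labels, ranking


# Scores as a classifier might output them, e.g. signed distances for an SVM.
scores = {"a": 1.2, "b": -0.4, "c": 0.3}
labels, ranking = relabel(scores)
```

The ranking is what the accuracy check in step S125 consumes, since average precision is defined over ranked results.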

Average precision, an evaluation measure for ranked retrieval results, can be used as the index of classifier accuracy. Equation (1) gives the average precision over the top N detection results.

Here, r_k is 1 if the detection result ranked k-th is correct and 0 if it is incorrect. Whether each result is correct or incorrect is entered by a person through the input unit 11.
In equation (1), p(k) is the precision at each rank k among the top N results, and is calculated by equation (2).
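Since equations (1) and (2) are not reproduced in this text, the sketch below uses the common textbook definition of average precision, which may differ in normalization from the patent's own equation (1). It takes p(k) as the fraction of correct results among the top k, and the average precision as the sum of r_k · p(k) over k = 1..N divided by the number of correct results in the top N. The function names are assumptions.

```python
from typing import List


def precision_at(k: int, relevance: List[int]) -> float:
    """p(k): fraction of correct results among the top k (k is 1-based)."""
    return sum(relevance[:k]) / k


def average_precision(relevance: List[int]) -> float:
    """Average of p(k) over the ranks k at which r_k = 1, for the top N results.

    Assumed normalization: divide by the number of correct results in the
    top N. Returns 0.0 when there are no correct results.
    """
    hits = sum(relevance)
    if hits == 0:
        return 0.0
    total = sum(r * precision_at(k, relevance) for k, r in enumerate(relevance, start=1))
    return total / hits


# r_k for the top N = 4 results: ranks 1, 3, and 4 are correct.
relevance = [1, 0, 1, 1]
ap = average_precision(relevance)
```

For this example the correct ranks contribute p(1) = 1, p(3) = 2/3, and p(4) = 3/4, so the value is their mean.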

When the classifier determination unit 17 determines that the accuracy (average precision) calculated by equation (1) is below the threshold (step S125: NO), the learning data adding unit 18 mixes additional learning data into the learning data stored in the learning data storage unit 102 (step S130).

First, the random data selection unit 181 extracts video sections at random from the video sections of the input video data stored in the video data storage unit 101 and takes each extracted video section as a learning data generation candidate.

The similar video selection unit 182 identifies, in the learning data storage unit 102, the learning data whose label is set to positive, and obtains similarity detection feature data from the video sections of the input video data indicated by the video section specifying data in that learning data. The similar video selection unit 182 also obtains similarity detection feature data for every video section of each item of input video data stored in the video data storage unit 101.

Video features and audio features can be used as the similarity detection feature data. For example, the color histogram and texture of the video can be used as video features, and the frequency distribution and power distribution of the audio can be used as audio features.
Color, texture, and the like can be used as the image features of a video section represented by the similarity detection feature data. As described in Reference 2 above, it is also possible to use as image features the approach of building a frequency histogram from luminance gradient histograms in local regions around feature points such as corners.

For the similarity detection feature data of each video section of each item of input video data, the similar video selection unit 182 calculates a value that quantifies how similar it is to the similarity detection feature data of the video sections corresponding to the positive-example learning data. Based on this value, the similar video selection unit 182 identifies video sections that look similar to the current positive-example learning data or whose audio features are similar to it.

For example, the similar video selection unit 182 calculates the similarity between the feature data obtained from a video section of the input video data and the feature data obtained from each video section corresponding to positive-example learning data, and sums these similarities. The similar video selection unit 182 then identifies as learning data generation candidates the video sections whose total similarity is at or above a predetermined threshold, or a predetermined number of video sections taken in descending order of total similarity.
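The summed-similarity selection can be sketched as follows. Cosine similarity is used here purely as a stand-in, because the text does not name a similarity measure. The function names and the toy feature vectors are likewise assumptions, and only the top-k variant of the selection rule is shown.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def total_similarity(candidate: Vector, positives: List[Vector]) -> float:
    """Sum of similarities between one candidate and every positive example."""
    return sum(cosine(candidate, p) for p in positives)


def select_similar(candidates: Dict[str, Vector], positives: List[Vector], top_k: int) -> List[str]:
    """Ids of the top_k candidates in descending order of total similarity."""
    ranked = sorted(
        candidates,
        key=lambda sid: total_similarity(candidates[sid], positives),
        reverse=True,
    )
    return ranked[:top_k]


positives = [[1.0, 0.0], [1.0, 0.1]]
candidates = {"s1": [1.0, 0.0], "s2": [0.0, 1.0], "s3": [1.0, 1.0]}
selected = select_similar(candidates, positives, top_k=2)
```

The threshold variant would filter on `total_similarity(...) >= threshold` instead of slicing the ranked list.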

The similar classifier detection unit 183 searches the thesaurus and dictionaries stored in the thesaurus storage device 5 and reads out the related keywords for the input keyword. From the already trained classifiers stored in the classifier storage unit 103, it selects the trained classifiers whose detection targets are the related keywords. The similar classifier detection unit 183 obtains feature data for every video section of each item of input video data stored in the video data storage unit 101 and runs detection with the selected trained classifiers, using that feature data as input. It identifies the video sections that the trained classifiers detect as related as learning data generation candidates.

The data mixing unit 184 selects a fixed proportion of the learning data generation candidate video sections identified by the random data selection unit 181, the similar video selection unit 182, and the similar classifier detection unit 183. The mixing ratio among the candidate video sections identified by these three units can be varied according to the detection target.

For any selected video section whose feature data has not yet been generated, the data mixing unit 184 generates the feature data from the video data of that video section. The data mixing unit 184 generates learning data by associating the video section specifying data and feature data of each selected video section with a label set to positive, and appends it to the learning data storage unit 102.

Even when learning data is added multiple times through the iterative process, the similar video selection unit 182 and the similar classifier detection unit 183 need to identify the learning data generation candidate video sections only once, at the start. In the second and subsequent rounds of addition, the data mixing unit 184 selects the video sections from which to generate learning data from among these already identified candidates.
When mixing is performed multiple times, the data mixing unit 184 does not add, as a positive example, any data that the learning data correction unit 14 has judged to be a negative example even once.
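The mixing step, including the rule that data once judged negative is never re-added as positive, can be sketched as follows. Expressing the "fixed proportion" as per-source fractions of a total budget is an interpretation made here, and the function name, pool names, and ratio values are hypothetical.

```python
import random
from typing import Dict, List, Set


def mix_candidates(
    pools: Dict[str, List[str]],
    ratios: Dict[str, float],
    budget: int,
    rejected: Set[str],
    rng: random.Random,
) -> List[str]:
    """Pick section ids from each candidate pool according to its share of the budget.

    Sections in `rejected` (judged negative at least once) and sections already
    chosen from an earlier pool are skipped.
    """
    chosen: List[str] = []
    for source, section_ids in pools.items():
        eligible = [s for s in section_ids if s not in rejected and s not in chosen]
        quota = min(len(eligible), int(budget * ratios.get(source, 0.0)))
        chosen.extend(rng.sample(eligible, quota))
    return chosen


pools = {
    "random": ["a", "b", "c"],    # from the random data selection unit 181
    "similar": ["c", "d"],        # from the similar video selection unit 182
    "related": ["e"],             # from the similar classifier detection unit 183
}
ratios = {"random": 0.5, "similar": 0.25, "related": 0.25}
chosen = mix_candidates(pools, ratios, budget=4, rejected={"b", "d"}, rng=random.Random(0))
```

Because the ratios are a plain dictionary, they can be changed per detection target, as the text allows.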

Using the currently constructed classifier, the classifier detection unit 16 reads all the learning data from the learning data storage unit 102 and runs detection on it. For learning data judged from the detection result to be related, the classifier detection unit 16 rewrites a negative label to positive. For learning data judged to be unrelated, it rewrites a positive label to negative. A person then judges whether each of the top N items of learning data in the detection result is correct, and enters information identifying the items that were judged wrongly. The input unit 11 receives this information identifying the learning data whose labels are to be corrected, and the learning data correction unit 14 rewrites the label of each identified item to negative if it is set to positive, and to positive if it is set to negative (step S135).

A larger N gives greater accuracy, but N is usually chosen as a proportion of the total amount of learning data or according to the time and number of people available for the correction work. When full automation is required, one option is to prepare several classifiers based on entirely different algorithms and decide correctness by a majority vote among them.
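The fully automated alternative, a majority vote among independent classifiers, can be sketched as follows. The three lambda "judges" are toy stand-ins for classifiers built with different algorithms, and the strict-majority rule for ties is an assumption.

```python
from typing import Callable, List, Sequence

Features = Sequence[float]
Judge = Callable[[Features], bool]


def majority_vote(votes: List[bool]) -> bool:
    """True when strictly more than half of the votes are True."""
    return 2 * sum(votes) > len(votes)


def auto_judge(features: Features, judges: List[Judge]) -> bool:
    """Decide whether a sample is a correct detection by majority vote."""
    return majority_vote([judge(features) for judge in judges])


# Toy judges; real ones would be classifiers trained with unrelated methods.
judges: List[Judge] = [
    lambda f: f[0] > 0.5,
    lambda f: sum(f) / len(f) > 0.5,
    lambda f: max(f) > 0.9,
]
accepted = auto_judge([0.6, 0.95, 0.3], judges)
rejected = auto_judge([0.1, 0.2, 0.95], judges)
```

Using an odd number of judges avoids ties, which this sketch would otherwise resolve as "incorrect".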

The classifier construction unit 15 reads all the learning data from the learning data storage unit 102 and constructs a classifier from it (step S140).
The classifier detection unit 16 applies the classifier constructed in step S140 to the learning data stored in the learning data storage unit 102 and runs detection. For learning data judged from the detection result to be related to the detection target, the classifier detection unit 16 rewrites a negative label to positive. For learning data judged to be unrelated, it rewrites a positive label to negative (step S145).

After step S145, the learning device 1 repeats the processing from step S125, in which it determines from the detection result of the classifier constructed in step S140 whether the accuracy of that classifier is at or above the threshold. When the classifier determination unit 17 determines in step S125 that the accuracy is at or above the threshold (step S125: YES), the learning device 1 ends the processing.
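The overall control flow of steps S120 to S145 can be sketched as a loop over pluggable callbacks. The callback names, the `max_rounds` safety limit, and the toy stand-ins at the bottom are all assumptions. The text itself loops until the threshold is met, and the relabeling of step S145 is folded into `evaluate` here for brevity.

```python
from typing import Any, Callable, List, Tuple


def train_iteratively(
    data: List[Any],
    build: Callable[[List[Any]], Any],
    evaluate: Callable[[Any, List[Any]], float],
    add_data: Callable[[List[Any]], List[Any]],
    correct_labels: Callable[[Any, List[Any]], List[Any]],
    threshold: float,
    max_rounds: int = 10,
) -> Tuple[Any, List[float]]:
    """Rebuild the classifier until its accuracy reaches the threshold."""
    classifier = build(data)                     # step S120
    history: List[float] = []
    for _ in range(max_rounds):
        score = evaluate(classifier, data)       # step S125
        history.append(score)
        if score >= threshold:                   # S125: YES, finish
            break
        data = add_data(data)                    # step S130
        data = correct_labels(classifier, data)  # step S135
        classifier = build(data)                 # step S140
    return classifier, history


# Toy stand-ins so the loop can run: the "classifier" is just the data size.
final, history = train_iteratively(
    data=[0, 1, 2, 3],
    build=lambda d: len(d),
    evaluate=lambda clf, d: min(1.0, clf / 10),
    add_data=lambda d: d + [len(d), len(d) + 1, len(d) + 2],
    correct_labels=lambda clf, d: d,
    threshold=0.95,
)
```

With these stand-ins the score rises as data is added and the loop stops on the third evaluation.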

FIGS. 5 and 6 show the results of experiments using the learning device 1 according to this embodiment.
FIG. 5 shows how the average precision and the number of positive-example learning data items (# of Pos) for each keyword (object name) change with the number of classifier construction iterations. The average precision is calculated over the top 100 results. As the figure shows, for every keyword the average precision improves with each iteration, and the accuracy reaches the threshold after three to six iterations.

FIG. 6 shows how the mean, taken over the keywords, of the average precision values in FIG. 5 changes with the number of classifier construction iterations. As the figure shows, the rise in average precision begins to level off at around three iterations, and the value exceeds 0.95 at around the fifth iteration.
In this way, the learning device 1 can improve the detection accuracy of the classifier by training it while adding learning data.

According to the embodiment described above, the learning device 1 generates learning data labeled as positive or negative examples from video data such as television programs, and constructs a classifier for detecting a specific object or event on the basis of that learning data.
The learning device 1 searches the speech recognition results and closed captions of the program audio for the keyword representing the detection target of the classifier to be constructed and for additional keywords selected using a dictionary such as a thesaurus, and extracts the corresponding video sections as positive-example video data. This makes it possible to efficiently generate learning data for constructing a classifier that enables retrieval based on the content of the video rather than on surface features such as color and texture. It also makes it possible to construct classifiers for a variety of detection targets from a variety of programs, without specializing in particular objects or events and regardless of program genre or broadcaster.

The learning device 1 corrects the labels indicating whether each item of learning data is a positive or negative example, constructs a classifier from that data, takes the learning data detected by the constructed classifier as the positive examples of the next learning data, and repeats this process. This improves the accuracy of the classifier.

In the course of the iterative process, the learning device 1 also mixes into the learning data, as positive examples and in a fixed proportion, the video data of video sections selected at random from the input video data, of video sections with high visual or auditory similarity to the positive-example video data, and of video sections detected using trained classifiers whose detection targets are semantically similar to the object or event targeted by the classifier being constructed. Learning data is thereby added in a way that increases diversity, which prevents the iterative process from producing a classifier biased toward specific data.

By using still image data with attribute data in place of video data, it is also possible to construct a classifier that detects whether still image data is related to the detection target. In this case, the learning device 1 uses the text information about the still image described in the attribute data in place of the audio text data, and performs the same processing with each item of still image data treated as one video section. The feature data then represents the features of the still image.

The learning device 1 described above contains a computer system. The operations of the input unit 11, the video section dividing unit 12, the initial learning data generation unit 13, the learning data correction unit 14, the classifier construction unit 15, the classifier detection unit 16, the classifier determination unit 17, and the learning data adding unit 18 of the learning device 1 are stored in the form of a program on a computer-readable recording medium, and the processing described above is performed when the computer system reads and executes this program. The computer system here includes a CPU, various memories, an OS, and hardware such as peripheral devices.

If a WWW system is used, the "computer system" also includes the environment for providing (or displaying) web pages.
The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage units such as hard disks built into the computer system. It also includes media that hold a program dynamically for a short time, such as the communication line used when a program is transmitted over a network such as the Internet or over a communication line such as a telephone line, and media that hold a program for a certain period, such as the volatile memory inside the computer system that acts as the server or client in that case. The program may implement only part of the functions described above, or may implement those functions in combination with a program already recorded in the computer system.

Description of symbols
1 Learning device
5 Thesaurus storage device
10 Storage unit
11 Input unit
12 Video section dividing unit
13 Initial learning data generation unit
14 Learning data correction unit
15 Classifier construction unit
16 Classifier detection unit
17 Classifier determination unit
18 Learning data adding unit
101 Video data storage unit
102 Learning data storage unit
103 Classifier storage unit
131 Closed caption extraction unit
132 Program audio recognition unit
133 Keyword expansion unit
134 Video section extraction unit
135 Feature data extraction unit
181 Random data selection unit
182 Similar video selection unit
183 Similar classifier detection unit
184 Data mixing unit

Claims

A learning device comprising:
a video data storage unit that stores video data;
a learning data storage unit that stores learning data including a feature amount of the video data and a label indicating whether the learning data is a positive example, in which the detection target appears in the video data, or a negative example, in which it does not appear;
a classifier construction unit that constructs a classifier using the learning data stored in the learning data storage unit when initial learning data is registered in the learning data storage unit and when learning data is added to the learning data storage unit;
a classifier detection unit that performs detection processing in which the classifier constructed by the classifier construction unit is used to detect, from among the learning data stored in the learning data storage unit and the video sections of input video data, video data of video sections that are similar to the current positive-example learning data and of video sections that have similar audio characteristics;
a determination unit that determines the accuracy of the classifier based on the detection result of the classifier detection unit; and
a learning data adding unit that, when the determination unit determines that the accuracy of the classifier has not reached a predetermined accuracy, selects a part of the video data stored in the video data storage unit and adds to the learning data storage unit learning data generated by assigning a positive-example label to the feature amount of the selected video data,
wherein the learning data adding unit selects, as the part of the video data, video data selected at random from the video data stored in the video data storage unit, video data similar to the video data from which the positive-example learning data stored in the learning data storage unit was obtained, or video data detected by another classifier corresponding to a detection target similar to that of the classifier to be constructed.

The learning device according to claim 1, wherein:
when learning data is added to the learning data storage unit, the classifier detection unit performs detection on the learning data stored in the learning data storage unit using the classifier and, based on the detection result, sets the label of the learning data to a positive example or a negative example; and
when learning data is added to the learning data storage unit, the classifier construction unit constructs the classifier using the learning data stored in the learning data storage unit after the labels have been set by the classifier detection unit.

The learning device according to claim 2, further comprising a learning data correction unit that, for the initial learning data registered in the learning data storage unit or the learning data whose label has been set by the classifier detection unit, corrects the label of the learning data based on user input or on the result of detection on that learning data by another classifier,
wherein, when the initial learning data is registered in the learning data storage unit and when learning data is added to the learning data storage unit, the classifier construction unit constructs the classifier using the learning data stored in the learning data storage unit after the labels have been corrected by the learning data correction unit.

The learning device according to any one of claims 1 to 3, further comprising an initial learning data generation unit that detects whether text data representing the audio of the video data includes a keyword representing the detection target of the classifier to be constructed and other keywords related to that keyword, generates initial learning data by assigning a positive-example label to the feature amount of the video data corresponding to the detected text data, and registers the initial learning data in the learning data storage unit.

A program for causing a computer used in a learning device to function as:
a video data storage unit that stores video data;
a learning data storage unit that stores learning data including a feature amount of the video data and a label indicating whether the learning data is a positive example, in which the detection target appears in the video data, or a negative example, in which it does not appear;
a classifier construction unit that constructs a classifier using the learning data stored in the learning data storage unit when initial learning data is registered in the learning data storage unit and when learning data is added to the learning data storage unit;
a classifier detection unit that performs detection processing in which the classifier constructed by the classifier construction unit is used to detect, from among the learning data stored in the learning data storage unit and the video sections of input video data, video data of video sections that are similar to the current positive-example learning data and of video sections that have similar audio characteristics;
a determination unit that determines the accuracy of the classifier based on the detection result of the classifier detection unit; and
a learning data adding unit that, when the determination unit determines that the accuracy of the classifier has not reached a predetermined accuracy, selects a part of the video data stored in the video data storage unit and adds to the learning data storage unit learning data generated by assigning a positive-example label to the feature amount of the selected video data,
wherein the learning data adding unit selects, as the part of the video data, video data selected at random from the video data stored in the video data storage unit, video data similar to the video data from which the positive-example learning data stored in the learning data storage unit was obtained, or video data detected by another classifier corresponding to a detection target similar to that of the classifier to be constructed.