JP5545877B2

JP5545877B2 - Content recognition model learning apparatus, content recognition model learning method, and content recognition model learning program

Info

Publication number: JP5545877B2
Application number: JP2011017057A
Authority: JP
Inventors: 昭悟木村; 泰浩南; 英作前田; 鋭坂野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-01-28
Filing date: 2011-01-28
Publication date: 2014-07-09
Anticipated expiration: 2031-01-28
Also published as: JP2012159871A

Description

The present invention relates to a technique for learning a content recognition model that estimates meaning from media data (content), such as speech signals, acoustic signals, still images, and moving images (video), and from text information manually assigned to that data. Here, "meaning" is information that combines the objects, motions, actions, scenes, and the like contained in a speech signal, acoustic signal, still image, or video.

Video recognition techniques that automatically assign to a given video the language information describing it have long been under development. In recent years, with the spread of imaging devices such as digital video cameras and mobile phones and the growth of video sharing on the Internet, such video recognition techniques have become very important.

In addition, a technique has been proposed that learns a topic model, a statistical model that links two kinds of observed information through latent variables, and uses it to handle in a unified way both image labeling, which automatically assigns appropriate text labels to a given image, and image retrieval, which finds appropriate images for a given text label (see, for example, Non-Patent Documents 1 and 2).

Meanwhile, a method (dynamic learning) has been proposed that imitates the way humans come to understand video: it presents questions about the content of an image (video), obtains answers from the user, and uses them to update the model that understands the video (see, for example, Non-Patent Document 3).

Non-Patent Document 1: Nakayama, Harada, Kuniyoshi, Otsu, "Ultra-high-speed image recognition / retrieval method using stochastic structure learning corresponding to image / word concept", IEICE Technical Report, PRMU2007-147, December 2007.
Non-Patent Document 2: Kimura, Nakano, Sugiyama, Kameoka, Maeda, Sakano, "SSCDE: Semi-supervised canonical density estimation method for image recognition search", Image Recognition and Understanding Symposium MIRU2010, OS8-1, July 2010.
Non-Patent Document 3: Siddiquie, B. and Gupta, A., "Beyond active noun tagging: Modeling contextual interactions for multi-class active learning", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2979-2986, 2010.

However, video recognition by the methods of Non-Patent Documents 1 and 2 suffers from reduced learning accuracy due to noise in the additional information. Noise here means that the additional information (text) contains a word for an object that does not appear in the content, or that the text lacks an expression (label) for something that does appear in the content. Raising learning accuracy requires a large amount of training data, but manually preparing additional information free of such noise is very costly.

According to the technique of Non-Patent Document 3, the computer automatically generates questions about regions of the content that it wants to learn and proceeds with learning while obtaining answers about those regions, which both raises learning accuracy and reduces the human burden. Non-Patent Document 3 proposes the following three question forms (a to c).
(a) A question asking the user to specify which region of the content corresponds to an object the computer wants to learn (for example, "boat").
(b) A question asking the user what a region the computer could not recognize (an uncertain region) is, in terms of its relationship to a region the computer did recognize (a confirmed region) (for example, "what is above water?" or "what is brighter than water?").
(c) A question asking for the word (label) that expresses the relationship between two objects the computer recognized (for example, "what is the relation between boat and water?").

As described above, the technique of Non-Patent Document 3 presupposes "What"-type questions that identify what an uncertain region is, leaving the user a very high degree of freedom in answering. As a result, different labels may be assigned to the same object (uncertain region), and the number of label types may become enormous.

The present invention has been made in view of these circumstances. Its object is to provide a content recognition model learning apparatus, a content recognition model learning method, and a content recognition model learning program that switch among "What", "Which", and "Is this" question types according to the confidence with which an uncertain region is recognized from knowledge already learned, thereby limiting the freedom of the user's answers, resolving the problems of the prior art, and enabling highly accurate learning.

The present invention comprises: content recognition model storage means that stores information on a content recognition model for recognizing content data; content acquisition means that acquires content data including an acoustic signal or a video signal; additional information estimation means that uses the content recognition model stored in the content recognition model storage means to estimate additional information to be assigned to the content data, the additional information indicating the meaning of the acoustic signal or video signal included in the content data; confidence calculation means that obtains the confidence of the additional information estimated by the additional information estimation means; question display means that selects, based on the confidence obtained by the confidence calculation means, a question for determining the additional information to be assigned to the content data, and displays the selected question; answer acquisition means that acquires an answer to the question; and model update means that updates the content recognition model information stored in the content recognition model storage means based on the answer acquired by the answer acquisition means.

In the present invention, the question display means displays, when the confidence is high, only a question asking whether the additional information to be assigned to the content data matches the estimated additional information; when the confidence is medium, only a question that narrows the additional information down from among the estimated candidates; and when the confidence is low, a question asking what the additional information to be assigned to the content data is.
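The three-way switch described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the confidence is assumed to be normalized to [0, 1], and the thresholds `high` and `low`, the names `QuestionType`, `select_question_type`, and `render_question`, and the question wording are all hypothetical.

```python
from enum import Enum


class QuestionType(Enum):
    IS_THIS = "Is this"  # yes/no check of the single best estimate
    WHICH = "Which"      # pick one of a few estimated candidates
    WHAT = "What"        # open question, free-form answer


def select_question_type(confidence, high=0.8, low=0.4):
    """Map a confidence in [0, 1] to a question type.

    `high` and `low` are hypothetical thresholds, not values from the source.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence >= high:
        return QuestionType.IS_THIS
    if confidence >= low:
        return QuestionType.WHICH
    return QuestionType.WHAT


def render_question(qtype, candidates):
    """Build the question text from candidate labels ordered best-first."""
    if qtype is QuestionType.IS_THIS:
        return f"Is this {candidates[0]}?"
    if qtype is QuestionType.WHICH:
        return "Which is this: " + ", ".join(candidates) + "?"
    return "What is this?"
```

Restricting the high- and medium-confidence cases to closed questions is what limits the label vocabulary; only the low-confidence branch lets the user introduce a new label.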

The present invention is a content recognition model learning method for a content recognition model learning apparatus comprising content recognition model storage means that stores information on a content recognition model for recognizing content data, content acquisition means, additional information estimation means, confidence calculation means, question display means, answer acquisition means, and model update means, the method comprising: a content acquisition step in which the content acquisition means acquires content data including an acoustic signal or a video signal; an additional information estimation step in which the additional information estimation means uses the content recognition model stored in the content recognition model storage means to estimate additional information to be assigned to the content data, the additional information indicating the meaning of the acoustic signal or video signal included in the content data; a confidence calculation step in which the confidence calculation means obtains the confidence of the additional information estimated in the additional information estimation step; a question display step in which the question display means selects, based on the confidence obtained in the confidence calculation step, a question for determining the additional information to be assigned to the content data, and displays the selected question; an answer acquisition step in which the answer acquisition means acquires an answer to the question; and a model update step in which the model update means updates the content recognition model information stored in the content recognition model storage means based on the answer acquired in the answer acquisition step.

In the present invention, the question display step displays, when the confidence is high, only a question asking whether the additional information to be assigned to the content data matches the estimated additional information; when the confidence is medium, only a question that narrows the additional information down from among the estimated candidates; and when the confidence is low, a question asking what the additional information to be assigned to the content data is.

The present invention is a content recognition model learning program that causes a computer on a content recognition model learning apparatus, the apparatus comprising content recognition model storage means that stores information on a content recognition model for recognizing content data, content acquisition means, additional information estimation means, confidence calculation means, question display means, answer acquisition means, and model update means, to perform content recognition model learning processing, the program causing the computer to perform: a content acquisition step of acquiring content data including an acoustic signal or a video signal; an additional information estimation step of using the content recognition model stored in the content recognition model storage means to estimate additional information to be assigned to the content data, the additional information indicating the meaning of the acoustic signal or video signal included in the content data; a confidence calculation step of obtaining the confidence of the additional information estimated in the additional information estimation step; a question display step of selecting, based on the confidence obtained in the confidence calculation step, a question for determining the additional information to be assigned to the content data, and displaying the selected question; an answer acquisition step of acquiring an answer to the question; and a model update step of updating the content recognition model information stored in the content recognition model storage means based on the answer acquired in the answer acquisition step.

According to the present invention, changing the type of question according to the confidence of the estimated additional information improves the misidentification rate compared with asking only a single type of question. In addition, because the system automatically generates the questions that draw out information useful for improving learning accuracy, the burden on people is reduced.

A block diagram showing the configuration of an embodiment of the present invention.
A flowchart showing the operation of the apparatus shown in FIG. 1.
An explanatory diagram showing the operation of the apparatus shown in FIG. 1.
An explanatory diagram showing the operation of the apparatus shown in FIG. 1.
An explanatory diagram showing the operation of the apparatus shown in FIG. 1.
An explanatory diagram showing an example of a screen that displays a question.
An explanatory diagram showing experimental results for the apparatus according to the present invention.
An explanatory diagram showing experimental results for the apparatus according to the present invention.

A content recognition model learning apparatus according to an embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the embodiment. In the following description, content is media data such as speech signals, acoustic signals, still images, or moving images (video), and the set of additional information is information indicating what the content contains (text describing the content or structure of a video, or text data giving the time or place at which the video was shot). In FIG. 1, reference numeral 1 denotes an initial content feature set storage unit that stores the initial content feature set X = {x1, x2, …, xN} for a set of N contents given in advance (called the initial content set) G = {g1, g2, …, gN}. Reference numeral 2 denotes an initial additional information feature set storage unit that stores the initial additional information feature set Y = {y1, y2, …, yN}.

Reference numeral 3 denotes an initial model learning unit that learns the initial value of the content recognition model from the initial content feature set X and the initial additional information feature set Y for the initial content set G. Reference numeral 4 denotes a content recognition model storage unit that stores the content recognition model information. Reference numeral 5 denotes a new content acquisition unit that acquires new content gN+1, which is not necessarily included in the initial content set G, and computes its feature xN+1. When there is no new content, the new content acquisition unit 5 ends the process.

Reference numeral 6 denotes an additional information estimation unit that estimates additional information for the new content gN+1 acquired by the new content acquisition unit 5 and for the initial contents g1, g2, …, gN, using the content recognition model stored in the content recognition model storage unit 4. Reference numeral 7 denotes a neighbor sample extraction unit that treats the contents in the initial content set G that are similar to the new content acquired by the new content acquisition unit 5 as neighbor samples and extracts the neighbor sample set ^G = {^g1, ^g2, …, ^gH}, where H is the number of neighbor samples (the ^ is written over the G; the same applies to other symbols below).

Reference numeral 8 denotes a question selection unit that, for each neighbor sample in the neighbor sample set ^G, estimates labels (additional information features) ^y1, ^y2, …, ^yH with the learned content recognition model and, according to the degree of variation among the estimated labels (called the confidence), selects the question to present to the user about the content (additional information) of the input new content gN+1. Reference numeral 9 denotes a display unit, consisting of a display device such as a monitor, that displays the question selected by the question selection unit 8. Reference numeral 10 denotes an input unit, consisting of input devices such as a keyboard and mouse, for entering the answer to the question shown on the display unit 9. An input device that captures speech signals, Web information, or the like may be used in place of the keyboard and mouse.
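The text defines the confidence only as the degree of variation among the labels estimated for the neighbor samples. One way to turn that into a number, shown here purely as an assumption, is one minus the normalized entropy of the top label of each neighbor: the value is 1 when all neighbors agree and 0 when every neighbor has a different label. The function name `label_confidence` and the entropy-based measure are illustrative choices, not taken from the source.

```python
import math
from collections import Counter


def label_confidence(neighbor_labels):
    """Return a confidence in [0, 1] from the labels of the neighbor samples.

    Assumption: confidence = 1 - (entropy of the label distribution) / log(H),
    where H is the number of neighbor samples. The source does not specify
    this formula.
    """
    n = len(neighbor_labels)
    if n == 0:
        raise ValueError("at least one neighbor label is required")
    if n == 1:
        return 1.0  # a single neighbor cannot disagree with itself
    counts = Counter(neighbor_labels)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - entropy / math.log(n)
```

Any other dispersion measure (for example, the share of the majority label) would fit the description equally well; the point is only that tighter agreement among neighbors yields a higher score.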

Reference numeral 11 denotes an answer acquisition unit that acquires the answer entered through the input unit 10, determines from it the new additional information feature yN+1 for the input new content gN+1, and outputs the content feature set X(1) = X ∪ {xN+1}, obtained by adding the feature xN+1 of the new content to the initial content feature set, and the additional information feature set Y(1) = Y ∪ {yN+1}, obtained by adding the newly determined additional information feature yN+1. Here X ∪ {xN+1} denotes the union of the set X and {xN+1}, and the superscript (1) on X and Y denotes the number of update iterations. Reference numeral 12 denotes a content feature set storage unit that stores the content feature set X output from the answer acquisition unit 11. Reference numeral 13 denotes an additional information feature set storage unit that stores the additional information feature set Y output from the answer acquisition unit 11.

Reference numeral 14 denotes a model update control unit that counts the number of new contents (new content features and new additional information features) added to the training data (content feature set X and additional information feature set Y) since the content recognition model was created and, when that number reaches a predetermined value, passes control on so that the content recognition model is updated. While the number of new contents has not reached the predetermined value, it returns control to the new content acquisition unit 5 so that the process repeats.

Reference numeral 15 denotes a model update unit that, on instruction from the model update control unit 14, updates the content recognition model stored in the content recognition model storage unit 4 using the updated content feature set X(i) and additional information feature set Y(i) stored in the content feature set storage unit 12 and the additional information feature set storage unit 13. The update is the same processing as in the initial model learning unit 3, except that the input data are X(i) and Y(i) (i is a natural number) instead of the initial content feature set and the initial additional information feature set. When the model update unit 15 finishes, control passes to the new content acquisition unit 5, and the update of the content recognition model is repeated for new content.

The model update control unit 14 may be omitted, in which case the content recognition model is updated sequentially each time one new content is added. Having the model update control unit 14 update the content recognition model in a batch once a certain number of answers for new content have accumulated reduces the number of training runs relative to sequential updating, so that a high-accuracy content recognition model can be created efficiently.
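The interplay between the growing feature sets and the batched update can be sketched as below. This is an illustrative outline only: the class name `ModelUpdateController`, the `retrain` callback, and the `batch_size` parameter are hypothetical, and `batch_size=1` corresponds to the case where the model update control unit is omitted.

```python
class ModelUpdateController:
    """Accumulate answered samples and retrain once enough have arrived."""

    def __init__(self, retrain, initial_X, initial_Y, batch_size=1):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.retrain = retrain        # callable(X, Y): hypothetical model update
        self.X = list(initial_X)      # content feature set
        self.Y = list(initial_Y)      # additional information feature set
        self.batch_size = batch_size  # new samples required before an update
        self.pending = 0              # new samples since the last update

    def add_sample(self, x_new, y_new):
        """Append one answered sample; return True if the model was updated."""
        self.X.append(x_new)
        self.Y.append(y_new)
        self.pending += 1
        if self.pending < self.batch_size:
            return False
        self.retrain(self.X, self.Y)
        self.pending = 0
        return True
```

With a larger `batch_size` the relatively expensive retraining runs less often, at the cost of the questions being chosen with a slightly stale model in between.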

Next, the operation by which the initial model learning unit 3 learns the content recognition model from the information stored in the initial content feature set storage unit 1 and the initial additional information feature set storage unit 2 shown in FIG. 1 is described. The set of contents G = {g1, g2, …, gN} given in advance for learning the first content recognition model is the initial content set. The set of additional information A = {a1, a2, …, aN} associated with the contents in the initial content set G is the initial additional information set. The set of features extracted from each content in the initial content set G is the initial content feature set X = {x1, x2, …, xN}, and the set of features extracted from each item of additional information in the initial additional information set A is the initial additional information feature set Y = {y1, y2, …, yN}.

When the target content is an image, known features can be used as the initial content features, for example a color histogram, arbitrary components of the discrete cosine transform, arbitrary components of the Haar wavelet, higher-order local autocorrelation features (see N. Otsu and T. Kurita, "A new scheme for practical flexible and intelligent vision systems," Proc. IAPR Workshop on Computer Vision, pp. 431-435, 1988), or a Bag of Features representation of feature points selected by any method (see Li Fei-Fei et al., 2005). As the initial additional information feature, a binary vector expressing the presence or absence of each word (label) in the additional information can be used. This vector has as many dimensions as the total number of possible words, with each dimension corresponding to one word; the i-th element is 1 if word i appears in the additional information and 0 otherwise.

Alternatively, a vector expressing the number of occurrences of each word in the additional information (a word occurrence vector) can be used, as can a vector whose number of dimensions is a number of topics specified in advance and whose elements express the probability of each topic. The latter vector can be computed with a topic model such as probabilistic latent semantic analysis (pLSA) or latent Dirichlet allocation (LDA) (for details, see Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert R. G. Lanckriet, Roger Levy, Nuno Vasconcelos: "A new approach to cross-modal multimedia retrieval," ACM Multimedia 2010, pp. 251-260).
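The binary presence vector and the word occurrence vector can be illustrated with a few lines of Python. The vocabulary and the helper names `binary_label_vector` and `word_occurrence_vector` are made up for this example; the only thing taken from the text is that each dimension corresponds to one possible word.

```python
from collections import Counter


def binary_label_vector(words, vocabulary):
    """Element i is 1 if vocabulary[i] appears in `words`, else 0."""
    present = set(words)
    return [1 if w in present else 0 for w in vocabulary]


def word_occurrence_vector(words, vocabulary):
    """Element i is the number of times vocabulary[i] appears in `words`."""
    counts = Counter(words)
    return [counts[w] for w in vocabulary]


vocabulary = ["boat", "water", "sky"]   # hypothetical label vocabulary
annotation = ["water", "boat", "water"]  # words in one item of additional information
print(binary_label_vector(annotation, vocabulary))     # [1, 1, 0]
print(word_occurrence_vector(annotation, vocabulary))  # [1, 2, 0]
```

Words outside the vocabulary are simply ignored in this sketch; how the actual system treats them is not stated in the text.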

When the target content is an acoustic signal or a video signal, features of the acoustic or video signal extracted with a method such as the time-series active search method (TAS; see Japanese Patent No. 3065314) or the division match search method (DAL; see Japanese Patent No. 4327202) can be used.

First, the initial model learning unit 3 reads the initial content feature set X = {x1, x2, …, xN} stored in the initial content feature set storage unit 1 and the initial additional information feature set Y = {y1, y2, …, yN} stored in the initial additional information feature set storage unit 2 and, using these sets, learns a content recognition model for determining, for content whose additional information is unknown, the additional information that best represents that content.

The model learning may use the known techniques described in Non-Patent Documents 1 and 2. For example, latent variables Z = {z1, z2, …, zN} are first generated from the feature sets (X, Y). The latent variables can be obtained by a method using canonical correlation analysis (Non-Patent Document 1), a method using probabilistic canonical correlation analysis (Nakayama et al., "Image annotation and retrieval method for large-scale Web images: toward autonomous acquisition of image knowledge from Web collective intelligence", Image Recognition and Understanding Symposium MIRU2009, OS2-4, July 2009), a method using semi-supervised canonical correlation analysis (Non-Patent Document 2), or the like. The content recognition model is then learned by kernel density estimation (KDE; Parzen, E.: On estimation of a probability density function and mode, The Annals of Mathematical Statistics, vol. 33, no. 3, pp. 1065-1076, 1962) or semi-supervised kernel density estimation (SSKDE; see Non-Patent Document 2). Concretely, model learning means obtaining the model parameters of p(x|z_i), the conditional probability of content feature x given latent variable z_i, and of p(y|z_i), the conditional probability of additional information feature y given z_i.

z_1 to z_N are called latent variables and can be thought of as, for example, category labels to which the content belongs. Given the feature x_i of content g_i and the feature y_i of its additional information, the learned content recognition model returns the corresponding latent variable z_i (1 ≤ i ≤ N). Because x_i and y_i are each represented as multidimensional vectors, the multidimensional vector corresponding to z_i can be obtained by a linear transformation of x_i and y_i.

Next, the operation of the content recognition model learning apparatus shown in FIG. 1 when it acquires new content is described with reference to FIG. 2. First, the new content acquisition unit 5 acquires new content from outside (step S1) and outputs it to the additional information estimation unit 6 and the neighborhood sample extraction unit 7. In response, the additional information estimation unit 6 estimates additional information for the new content g_{N+1} acquired by the new content acquisition unit 5 and for the initial contents g_1, g_2, ..., g_N, using the content recognition model stored in the content recognition model storage unit 4 (step S2). Specifically, for the feature x_{N+1} of the content g_{N+1}, the additional information feature ^y_{N+1} is estimated by equation (1).

Here, D_y is the number of dimensions (number of elements) of the vector representing the additional information feature y_i (i = 1, 2, ..., N).

Next, the neighborhood sample extraction unit 7 computes the similarity between the latent variable z_{N+1} estimated for the new content g_{N+1} acquired by the new content acquisition unit 5 and the latent variable z_i (i = 1, 2, ..., N) estimated for each initial content. Initial contents whose latent variables have a similarity above a predetermined threshold, or whose similarity ranks at or above a predetermined position, are taken as neighborhood samples, and the set of neighborhood samples ^G = {^g_1, ^g_2, ..., ^g_H} (H is the number of neighborhood samples) is extracted (step S3). The similarity between latent variables is defined, for example, as the reciprocal of a distance between the multidimensional vectors (for example, the Euclidean (L2) distance, the Mahalanobis distance, or the Manhattan (L1) distance).
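The neighborhood extraction of step S3 can be illustrated with a minimal Python sketch. It assumes NumPy, latent variables stored as rows of an array, and the Euclidean distance; the function names, the `eps` guard against division by zero, and the `threshold` / `top_k` parameters are hypothetical and do not appear in the specification.

```python
import numpy as np

def similarity(z_new, z_all, eps=1e-12):
    """Similarity as the reciprocal of the Euclidean (L2) distance."""
    dist = np.linalg.norm(np.asarray(z_all) - np.asarray(z_new), axis=1)
    return 1.0 / (dist + eps)  # eps is an assumption to avoid division by zero

def extract_neighbors(z_new, z_all, threshold=None, top_k=None):
    """Return indices of initial contents treated as neighborhood samples.

    Either keep every sample whose similarity exceeds `threshold`,
    or keep the `top_k` most similar samples.
    """
    sims = similarity(z_new, z_all)
    if threshold is not None:
        return np.flatnonzero(sims > threshold)
    if top_k is not None:
        return np.argsort(-sims, kind="stable")[:top_k]
    raise ValueError("specify either threshold or top_k")
```

For example, with latent variables (0, 0), (1, 0), (5, 5) and a new latent variable (0.2, 0), a threshold of 1.0 keeps the first two samples, and `top_k=1` keeps only the first.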

Next, for each neighborhood sample in the neighborhood sample set ^G = {^g_1, ^g_2, ..., ^g_H}, the question selection unit 8 estimates a label (additional information feature) ^y_1, ^y_2, ..., ^y_H using the learned content recognition model. According to the degree of variation of the estimated labels (the certainty factor), it generates a question to present to the user about the content (additional information a_{N+1}) of the new content g_{N+1} acquired by the new content acquisition unit 5. The certainty factor of the new content's label is an index of how well the additional information feature ^y_{N+1} estimated for the new content g_{N+1} agrees with the additional information features ^y_j (j = 1, 2, ..., H) estimated for the elements of the neighborhood sample set ^G. The question selection unit 8 then determines whether the certainty factor is "high," "medium," or "low" (step S4).

If, as a result of this determination, a sufficient number of training data (initial contents) exist near the new content and the additional information (labels) of those training data is consistent, the question selection unit 8 regards the certainty factor as high, generates a question asking whether the label of the new content is "A" (for example, "Is this A?"), and displays it on the display unit 9 (step S5). Here, "A" is the label of the neighborhood sample set; because the labels are consistent, there is only one kind of label (A) (see FIG. 3).

If the determination shows that the certainty factor is medium, the question selection unit 8 regards the situation as one in which a sufficient number of training data exist near the input sample but their labels are not sufficiently consistent; that is, there are multiple candidate labels and it is difficult for the computer to decide automatically which is appropriate. Among the labels estimated for the elements of the neighborhood sample set, the most frequent label is taken as "A" and the second most frequent as "B." A question asking whether the label of the new content is A or B (for example, "Which is this, A or B?") is generated and displayed on the display unit 9 (step S6; see FIG. 4).

If the determination shows that the certainty factor is low, the question selection unit 8 regards the situation as one in which few training data exist near the input sample and the reliability of their labels is low; that is, the labels of the training data are likely to be untrustworthy. It generates a question asking what the label of the new content is (for example, "What is this?") and displays it on the display unit 9 (step S7; see FIG. 5).
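The three-way selection of steps S4 to S7 for the single-label case can be sketched as follows. This is only an illustration: the specification does not state how "a sufficient number" or "consistent" is measured, so the parameters `min_neighbors` and `min_agreement` and the function name are assumptions.

```python
from collections import Counter

def select_question(neighbor_labels, min_neighbors=5, min_agreement=1.0):
    """Choose a question format from the labels estimated for the neighbors.

    min_neighbors and min_agreement are hypothetical thresholds:
    too few neighbors -> low certainty; one dominant label -> high certainty;
    otherwise -> medium certainty.
    """
    n = len(neighbor_labels)
    if n < min_neighbors:
        return "What is this?"                      # low certainty (step S7)
    ranked = Counter(neighbor_labels).most_common(2)
    label_a, count_a = ranked[0]
    if len(ranked) == 1 or count_a / n >= min_agreement:
        return f"Is this {label_a}?"                # high certainty (step S5)
    label_b = ranked[1][0]
    return f"Which is this, {label_a} or {label_b}?"  # medium certainty (step S6)
```

With the default `min_agreement=1.0`, the high-certainty branch is taken only when every neighbor carries the same label, which matches the description that a consistent neighborhood has a single kind of label.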

In general, content often contains multiple objects. It is therefore desirable that the label given to content be a combination of multiple labels (a combination of the labels corresponding to the individual objects). In this case, the question can be generated by the following procedure.

First, labels (words) that do not appear in the additional information of the neighborhood samples are deleted from the additional information feature ^y_{N+1} estimated from the new content g_{N+1}; that is, the feature values corresponding to those labels are set to 0. If no label remains in ^y_{N+1} (it is the zero vector, meaning that no neighborhood sample has a label corresponding to any element of ^y_{N+1}), the additional information feature estimated from the new content can be judged to be not necessarily appropriate, or to have low reliability. This corresponds to the low-certainty case described above, so a question asking what the label of the new content is (for example, "What is this?") is generated.
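The label deletion described above can be sketched as below, assuming NumPy, an estimated feature vector `y_hat` of length D_y, and a binary matrix `neighbor_Y` with one row of additional information per neighborhood sample. The function name and the returned flag are hypothetical.

```python
import numpy as np

def prune_labels(y_hat, neighbor_Y):
    """Zero out labels that no neighborhood sample carries.

    Returns the pruned feature vector and a flag that is True when the
    result is the zero vector (the low-certainty case).
    """
    present = np.asarray(neighbor_Y).any(axis=0)   # labels seen in any neighbor
    pruned = np.where(present, np.asarray(y_hat, dtype=float), 0.0)
    return pruned, not pruned.any()
```

When the flag is True, the procedure falls back to the open question "What is this?".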

If, on the other hand, labels remain in ^y_{N+1} (it is not the zero vector), the average distance between the latent variable proportion ^Z_{N+1} for the new content g_{N+1} and each latent variable proportion ^Z_j (j = 1, 2, ..., H) in the neighborhood sample set is computed. If the average distance exceeds a preset threshold, the number of training data near the input sample can be judged to be extremely small. This corresponds to the low-certainty case described above, so a question asking what the label of the new content is (for example, "What is this?") is generated.

If the average distance is at or below the preset threshold, the possible combinations of the individual labels corresponding to the neighborhood sample set are generated, and the similarity of each generated label combination is computed. The similarity of a combination is the average of the association degrees of all label pairs in the combination. For example, let y_A be the additional information feature corresponding to label A, and let S(y_A, y_B) be the association degree of the label pair (A, B). The similarity of the multi-label output ABC is then the average of S(y_A, y_B), S(y_B, y_C), and S(y_C, y_A). Here, S(y_A, y_B) is the number of contents to which labels y_A and y_B are both assigned.
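The combination similarity defined above can be computed as in the following sketch. It assumes NumPy and a binary label matrix `Y` (one row per content, one column per label); the function names are hypothetical, and the requirement of at least two labels per combination is an assumption, since the average over pairs is undefined for a single label.

```python
from itertools import combinations
import numpy as np

def association_matrix(Y):
    """S[a, b] = number of contents to which labels a and b are both assigned."""
    Y = np.asarray(Y)
    return Y.T @ Y

def combination_similarity(S, label_ids):
    """Average association degree over all label pairs in the combination."""
    pairs = list(combinations(label_ids, 2))
    if not pairs:
        raise ValueError("a combination needs at least two labels")
    return sum(S[a, b] for a, b in pairs) / len(pairs)
```

For a combination of three labels A, B, C the function averages S(A, B), S(A, C), and S(B, C), which is the same set of pairs as in the example in the text.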

The computed similarities are compared with a preset threshold. If only one combination has a similarity above the threshold, this corresponds to the high-certainty case described above, so a question asking whether the label of the new content is A (for example, "Is this A?") is generated. Here, A stands for the labels contained in the combination whose similarity exceeds the threshold.

If, on the other hand, two or more combinations have a similarity above the threshold, this corresponds to the medium-certainty case described above, so a question asking whether the label of the new content is A or B (for example, "Which is this, A or B?") is generated. Here, A and B stand for the labels contained in the combination with the highest similarity and in the combination with the second highest similarity, respectively.
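The final decision of the multi-label procedure can be sketched as follows. The input `combo_scores` (a dictionary from a tuple of labels to its combination similarity), the function names, and the " + " formatting are hypothetical. The specification does not say what happens when no combination exceeds the threshold; falling back to the open question is an assumption of this sketch.

```python
def format_combo(combo):
    """Render a label combination for display (formatting is an assumption)."""
    return " + ".join(combo)

def multilabel_question(combo_scores, threshold):
    """Pick the question format from the combination similarities."""
    # Combinations above the threshold, from highest to lowest similarity.
    passing = sorted(
        (combo for combo, score in combo_scores.items() if score > threshold),
        key=lambda combo: combo_scores[combo],
        reverse=True,
    )
    if len(passing) == 1:                      # high certainty
        return f"Is this {format_combo(passing[0])}?"
    if len(passing) >= 2:                      # medium certainty
        return (f"Which is this, {format_combo(passing[0])} "
                f"or {format_combo(passing[1])}?")
    return "What is this?"                     # assumption: nothing passed
```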

In this way, when the certainty factor is high, only a question confirming whether the estimated label is correct is asked; when it is medium, only a question that narrows down the estimated label candidates is asked; and only when it is low is a question asking what the label is displayed, without using the estimated label information. This avoids the risk of giving different labels to the same object or concept and allows the content recognition model to be built more precisely.

Next, the answer acquisition unit 11 acquires the answer that the user entered from the input unit 10 in response to the question displayed on the display unit 9, and generates an additional information feature y_{N+1} corrected according to the answer. When the certainty factor is high, the question asks whether the label is A, so the user answers Yes (affirmative) or No (negative) (see FIG. 6). FIG. 6 is one example of the display; the format is not limited to selecting Yes and No radio buttons, and the user may also enter Yes or No directly. In the case of No, the user may also freely enter the correct label. The answer acquisition unit 11 determines whether the answer is Yes or No (step S8). If an affirmative answer (Yes) is obtained, the label is regarded as A, and an additional information feature y_{N+1} is generated in which the feature value of the label corresponding to A is 1 and the rest are 0 (step S9).

If, on the other hand, a negative answer (No) is obtained, the system's prior knowledge (what it has learned) is judged to be wrong, and the additional information feature is corrected according to the user's answer. That is, the process moves to step S7 in order to obtain the correct label from the user and generate an additional information feature y_{N+1} in which the feature value corresponding to that label is 1 and the rest are 0. If the correct label entered by the user in the No case is unknown to the system, the dimension of the additional information feature y_{N+1} is increased so that the feature corresponding to that label can be represented.

When the certainty factor is medium, the question asks whether the label is A or B, so the user answers A, B, or neither (see FIG. 6). The answer acquisition unit 11 determines what the answer was (step S10). If an affirmative answer (A or B) is obtained, the label is regarded as A or B, and an additional information feature y_{N+1} is generated in which the feature value corresponding to the label the user selected is 1 and the rest are 0 (steps S11 and S12).

If, on the other hand, a negative answer (neither) is obtained, the process moves to step S7, as with a negative answer in the high-certainty case, and the additional information feature is corrected according to the user's answer.

When the certainty factor is low, the question asks what the label is, so the user enters the correct label information (text) (see FIG. 6). In this case, an additional information feature y_{N+1} is generated in which the feature value corresponding to the label information entered from the input unit 10 is 1 and the rest are 0 (step S13). Here too, if the correct label entered by the user is unknown to the system, the dimension of the additional information feature y_{N+1} is increased so that the feature corresponding to that label can be represented.
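Generating the additional information feature from a user's answer, including the case of a label unknown to the system, can be sketched as follows. The list-based vocabulary, the function names, and the zero-padding of previously stored features are assumptions made for illustration; the specification only states that the dimension is increased.

```python
def answer_to_feature(label, vocabulary):
    """Build y_{N+1}: 1 for the answered label, 0 elsewhere.

    `vocabulary` is a hypothetical list of known labels; an unknown label
    is appended, which increases the dimension of the feature by one.
    """
    if label not in vocabulary:
        vocabulary.append(label)
    y = [0] * len(vocabulary)
    y[vocabulary.index(label)] = 1
    return y

def pad_features(features, dim):
    """Zero-pad stored feature vectors to the new dimension (an assumption)."""
    return [f + [0] * (dim - len(f)) for f in features]
```

After a new label is added, previously stored features would presumably need the same number of dimensions before the model is updated, which is what `pad_features` illustrates.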

The answer acquisition unit 11 adds the generated additional information feature y_{N+1} to the additional information feature set storage unit 13. The feature x_{N+1} corresponding to the new content g_{N+1} is added to the content feature set storage unit 12.

Next, on instruction from the model update control unit 14, the model update unit 15 updates the content recognition model stored in the content recognition model storage unit 4, using the content feature set and the additional information feature set updated by the answer acquisition unit 11 (step S14). This update is the same processing as that of the initial model learning unit 3, so a detailed description of the processing is omitted. The model update unit 15 keeps updating the content recognition model by repeating the above processing as long as new content exists.

Next, experimental results obtained with the content recognition model learning apparatus shown in FIG. 1 are described. As the content data set for learning, 5096 images from the PASCAL Visual Object Challenge (VOC2008; reference: M. Everingham et al., "The PASCAL VOC Challenge 2008 Results," http://www.pascal-network.org/challenges/VOC/voc2008/workshop/index.html) were used. Each image contains objects belonging to 20 object categories such as people, animals, vehicles, and furniture. Of these, 3596 images were used as the initial content set, from which the initial model learning unit 3 learned the content recognition model stored in the content recognition model storage unit 4, and 1000 images were used as new content, with which the model update unit 15 retrained (updated) the model. The remaining 500 images were used for the evaluation experiment.

Two methods, A and B, were used for comparison with the method of the present invention. Method A always asks the question "what is the label," regardless of the certainty factor. Method B selects the type of question at random.

For each of method A, method B, and the method of the present invention, FIGS. 7 and 8 show the misidentification rate of the labels (additional information) estimated by the learned content recognition model for the 500 evaluation images, and the user cost (the user's answer input time). "Iteration" on the horizontal axis is the number of images (training samples) input as new content. As the number of training samples increases, the misidentification rate improves with every method, and the improvement is particularly large for the method of the present invention (see FIG. 7). The user cost was evaluated by the user's answer input time. The text input time was taken as (number of input characters) / (average input time per character), and the time for operations other than text input (clicks, mouse movement, and so on) was ignored. The method of the present invention completes input at the lowest cost (see FIG. 8).

A program for realizing the functions of the processing units shown in FIG. 1 may be recorded on a computer-readable recording medium, and the content recognition model learning processing may be performed by having a computer system read and execute the program recorded on that medium. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer system" also includes a WWW system provided with a homepage providing environment (or display environment). The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into computer systems. The "computer-readable recording medium" further includes media that hold a program for a certain period of time, such as the volatile memory (RAM) inside a computer system that serves as a server or client when the program is transmitted over a network such as the Internet or a communication line such as a telephone line.

The program may also be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium, or by a transmission wave in a transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line like a telephone line. The program may realize only some of the functions described above. It may also be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.

The present invention is applicable to uses in which it is essential to estimate meaning (information combining objects, motions, actions, scenes, and the like contained in images or video) from media data such as images and video and from text information manually assigned to them.

1 ... initial content feature set storage unit, 2 ... initial additional information feature set storage unit, 3 ... initial model learning unit, 4 ... content recognition model storage unit, 5 ... new content acquisition unit, 6 ... additional information estimation unit, 7 ... neighborhood sample extraction unit, 8 ... question selection unit, 9 ... display unit, 10 ... input unit, 11 ... answer acquisition unit, 12 ... content feature set storage unit, 13 ... additional information feature set storage unit, 14 ... model update control unit, 15 ... model update unit

Claims

1. A content recognition model learning device comprising:
content recognition model storage means for storing information of a content recognition model for recognizing content data;
content acquisition means for acquiring content data including an audio signal or a video signal;
additional information estimation means for estimating, using the content recognition model stored in the content recognition model storage means, additional information that is to be given to the content data and that indicates the meaning of the audio signal or video signal included in the content data;
certainty factor calculation means for obtaining a certainty factor of the additional information to be given to the content data estimated by the additional information estimation means;
question display means for selecting, in order to determine the additional information to be given to the content data, a question whose question format is changed according to the certainty factor obtained by the certainty factor calculation means, and for displaying the selected question;
answer acquisition means for acquiring an answer to the question whose question format has been changed; and
model update means for updating the information of the content recognition model stored in the content recognition model storage means based on information of the answer acquired by the answer acquisition means.

2. The content recognition model learning device according to claim 1, wherein the question display means:
when the certainty factor is high, displays only a question asking whether the additional information to be given to the content data matches the estimated additional information;
when the certainty factor is medium, displays only a question that narrows down the additional information to be given to the content data from among the estimated additional information candidates; and
when the certainty factor is low, displays a question asking what the additional information to be given to the content data is.

3. A content recognition model learning method in a content recognition model learning device comprising content recognition model storage means storing information of a content recognition model for recognizing content data, content acquisition means, additional information estimation means, certainty factor calculation means, question display means, answer acquisition means, and model update means, the method comprising:
a content acquisition step in which the content acquisition means acquires content data including an audio signal or a video signal;
an additional information estimation step in which the additional information estimation means estimates, using the content recognition model stored in the content recognition model storage means, additional information that is to be given to the content data and that indicates the meaning of the audio signal or video signal included in the content data;
a certainty factor calculation step in which the certainty factor calculation means obtains a certainty factor of the additional information to be given to the content data estimated in the additional information estimation step;
a question display step in which the question display means selects, in order to determine the additional information to be given to the content data, a question whose question format is changed according to the certainty factor obtained in the certainty factor calculation step, and displays the selected question;
an answer acquisition step in which the answer acquisition means acquires an answer to the question whose question format has been changed; and
a model update step in which the model update means updates the information of the content recognition model stored in the content recognition model storage means based on information of the answer acquired in the answer acquisition step.

4. The content recognition model learning method according to claim 3, wherein the question display step:
when the certainty factor is high, displays only a question asking whether the additional information to be given to the content data matches the estimated additional information;
when the certainty factor is medium, displays only a question that narrows down the additional information to be given to the content data from among the estimated additional information candidates; and
when the certainty factor is low, displays a question asking what the additional information to be given to the content data is.

5. A content recognition model learning program for causing a computer to execute the content recognition model learning method according to claim 3 or 4.