JP5481710B2 - Video management apparatus, method, and program

Info

Publication number: JP5481710B2
Application number: JP2010190016A
Authority: JP
Inventors: 昭悟木村; 弘和亀岡; 邦夫柏野; 拓帆中野; 茂樹嵯峨山; 滋樹宮部; 順貴小野; 卓也西本
Original assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Current assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Priority date: 2010-08-26
Filing date: 2010-08-26
Publication date: 2014-04-23
Anticipated expiration: 2030-08-26
Also published as: JP2012049830A

Description

The present invention relates to a video management apparatus, method, and program, and more particularly to a video management apparatus, method, and program that recognize or retrieve video based on the relationship between the video and additional information attached to it.

Conventionally, two kinds of technology have been developed: video search technology, which retrieves a desired video based on given linguistic information, and video recognition technology, which automatically assigns to a given video linguistic information that describes it. In recent years, with the spread of imaging devices such as digital video cameras and mobile phones and the generalization of video sharing over the Internet, these video search and video recognition technologies have become very important.

When the target signal is limited to still images, many image recognition and retrieval techniques have been developed that realize image search and image recognition under a single framework. For example, a technique has been proposed that learns a topic model, which is a statistical model that links two kinds of observed information through latent variables, and uses this topic model to handle two tasks in a unified manner: image labeling, which automatically assigns appropriate text labels to a given image, and image retrieval, which finds appropriate images from a given text label (see, for example, Non-Patent Documents 1 and 2).

Non-Patent Document 1: Nakayama, Harada, Kuniyoshi, Otsu, "Ultra-high-speed image recognition and retrieval method using probabilistic structure learning of image-word concept correspondence", IEICE Technical Report, PRMU2007-147, December 2007.
Non-Patent Document 2: Kimura, Nakano, Kameoka, Sugiyama, Maeda, Sakano, "SSCDE: Semi-supervised canonical density estimation for image recognition and retrieval", Meeting on Image Recognition and Understanding (MIRU2010), OS8-1, July 2010.

However, if the techniques described in Non-Patent Documents 1 and 2 are applied to video recognition and search, processing proceeds in a way that completely ignores time information, which is one of the important kinds of information that video carries. Given that the image information, audio information, and so on contained in video are temporally continuous, such processing makes it difficult to model the video signal appropriately, and as a result the accuracy of video recognition and video search decreases.

The present invention has been made in view of the above problem. Its object is to provide a video management apparatus, method, and program that can accurately perform video recognition or video search that also takes time information into account, based on the relationship between video and the additional information attached to it.

To achieve the above object, the video management apparatus of the present invention includes topic trajectory model storage means and video recognition means. The topic trajectory model storage means stores a topic trajectory model constructed by topic trajectory model learning means. The model is learned from stored video section features and stored additional information features. Each stored video section feature represents the characteristics of a stored video section obtained by time-dividing, into predetermined sections, each stored video that is an element of a previously stored video set. Each stored additional information feature represents the characteristics of the stored additional information attached to the corresponding stored video section. Using a stored latent variable derivation function based on the stored video section feature set (the set of stored video section features) and the stored additional information feature set (the set of stored additional information features), the learning means obtains stored latent variables that describe the relationship between stored video sections and stored additional information. It then extracts a stored state variable set, which is a set of stored state variables indicating the distribution of the stored latent variables, or a stored state variable distribution, which is the distribution of those stored state variables. The topic trajectory model consists of the stored state variable set or stored state variable distribution together with the stored latent variable derivation function.

The video recognition means includes: input video section feature extraction means that extracts, from each input video section obtained by time-dividing an input video into predetermined sections, an input video section feature representing the characteristics of that section; input latent variable extraction means that uses an input latent variable derivation function based on the input video section feature set (the set of extracted input video section features) to obtain the input latent variable corresponding to each input video section feature, and extracts the input latent variable set (the set of those input latent variables); input state variable extraction means that, based on the extracted input latent variable set and the stored state variable set or stored state variable distribution, extracts an input state variable set, which is a set of input state variables describing the distribution of the input latent variables, or an input state variable distribution, which is the distribution of those input state variables; input latent variable correction means that, based on the extracted input state variable set or input state variable distribution, extracts a new corrected input latent variable set obtained by correcting the input latent variable set; output additional information feature estimation means that, based on the corrected input latent variable set and the input video section feature set, estimates an output additional information feature set, which is a set of output additional information features related to the input video section features; and output additional information estimation means that estimates the output additional information corresponding to the features indicated by the estimated output additional information feature set.

In this way, the input latent variables are corrected based on two things: the input state variables obtained from the input latent variables, which are derived from the input video section features and describe the relationship between input video sections and input additional information, and the topic trajectory model, which describes the relationship between the stored video set and the stored additional information set. The output additional information features that are likely to be attached to each video section are then estimated from the corrected input latent variables and the input video section features. Video recognition can therefore be performed accurately while also taking time information into account.

The video management apparatus of the present invention may further include: stored video recognition means that inputs each stored video, which is an element of the stored video set, as the input video and causes the video recognition means to output an output stored additional information set, which is a set of output stored additional information, as additional information corresponding to the stored video set; and output video estimation means that, based on input additional information, the stored additional information set corresponding to the stored video set, and the output stored additional information set output by the video recognition means, estimates as an output video a video from the stored video set that matches the input additional information.

In this way, the output stored additional information set, which is estimated accurately for the stored video set with time information taken into account, the stored additional information set corresponding to the stored video set, and the input additional information are compared in order to estimate an output video that matches the input additional information. Video search can therefore be performed accurately while also taking time information into account.

The video management apparatus of the present invention may further include the topic trajectory model learning means.

The video management method of the present invention stores in advance a topic trajectory model constructed as follows. The model is built from stored video section features, each representing the characteristics of a stored video section obtained by time-dividing, into predetermined sections, each stored video that is an element of a previously stored video set, and from stored additional information features, each representing the characteristics of the stored additional information attached to the corresponding stored video section. A stored latent variable derivation function based on the stored video section feature set (the set of stored video section features) and the stored additional information feature set (the set of stored additional information features) is used to obtain stored latent variables describing the relationship between stored video sections and stored additional information. A stored state variable set, which is a set of stored state variables indicating the distribution of the stored latent variables, or a stored state variable distribution, which is the distribution of those stored state variables, is then extracted. The topic trajectory model consists of the stored state variable set or stored state variable distribution together with the stored latent variable derivation function.

The method then proceeds as follows. From each input video section obtained by time-dividing an input video into predetermined sections, an input video section feature representing the characteristics of that section is extracted. Using an input latent variable derivation function based on the input video section feature set (the set of extracted input video section features), the input latent variable corresponding to each input video section feature is obtained, and the input latent variable set (the set of those input latent variables) is extracted. Based on the extracted input latent variable set and the stored state variable set or stored state variable distribution, an input state variable set, which is a set of input state variables describing the distribution of the input latent variables, or an input state variable distribution, which is the distribution of those input state variables, is extracted. Based on the extracted input state variable set or input state variable distribution, a new corrected input latent variable set obtained by correcting the input latent variable set is extracted. Based on the corrected input latent variable set and the input video section feature set, an output additional information feature set, which is a set of output additional information features related to the input video section features, is estimated. Finally, the output additional information corresponding to the features indicated by the estimated output additional information feature set is estimated.

The video management method of the present invention may also input each stored video, which is an element of the stored video set, as the input video, and output an output stored additional information set, which is a set of output stored additional information, as additional information corresponding to the stored video set. Based on input additional information, the stored additional information set corresponding to the stored video set, and the output stored additional information set, the method can then estimate as an output video a video from the stored video set that matches the input additional information.

The video management program of the present invention is a program for causing a computer to function as each of the means constituting the video management apparatus described above.

As described above, the video management apparatus, method, and program of the present invention have the effect that video recognition or video search that also takes time information into account can be performed accurately, based on the relationship between video and the additional information attached to it.

A block diagram showing the functional configuration of the video recognition apparatus of the first embodiment.
A block diagram showing the detailed functional configuration of the video recognition apparatus of the first embodiment.
A flowchart showing the recognition processing routine in the video recognition apparatus of the first embodiment.
A block diagram showing the functional configuration of the video recognition apparatus of the second embodiment.
A block diagram showing the detailed functional configuration of the video recognition apparatus of the second embodiment.
A flowchart showing the search processing routine in the video recognition apparatus of the second embodiment.

Embodiments of the present invention are described in detail below with reference to the drawings.

The first embodiment is described first. It covers the case where the video management apparatus of the present invention is applied to a video recognition apparatus that attaches additional information corresponding to an input video.

The video recognition apparatus 10 according to the first embodiment is a computer including a CPU, a RAM, and a ROM that stores a program for executing the recognition processing routine described later. Functionally, as shown in FIG. 1, it can be represented as a configuration including a topic trajectory model learning unit 1 and a video recognition unit 2.

As shown in FIG. 2, the topic trajectory model learning unit 1 can further be represented as a configuration including a stored video section feature extraction unit 11, a stored additional information feature extraction unit 12, a stored latent variable extraction unit 13, and a stored state variable extraction unit 14. The topic trajectory model learning unit 1 receives a stored video set, which is a set of previously stored video signals, and a stored additional information set, which is a set of additional information corresponding to each stored video. From these inputs it constructs and outputs a topic trajectory model, a statistical model for video recognition that describes the relationship between video signals and additional information.

The stored video section feature extraction unit 11 receives the stored video set and divides each stored video, which is an element of the set, along the time axis to extract a plurality of stored video sections. The stored video can be divided, for example, evenly into sections of a predetermined time length, or by computing cut points from the stored video and dividing at those cut points.
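The two division strategies named above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, sections are expressed as half-open frame index ranges, and the cut points are assumed to come from some external cut detector that is not shown.

```python
def split_fixed_length(num_frames, section_length):
    """Divide a video into sections of a fixed number of frames.

    Returns (start, end) frame index pairs with end exclusive.
    The last section may be shorter than section_length.
    """
    if section_length <= 0:
        raise ValueError("section_length must be positive")
    return [(start, min(start + section_length, num_frames))
            for start in range(0, num_frames, section_length)]


def split_at_cut_points(num_frames, cut_points):
    """Divide a video at externally detected cut points (frame indices)."""
    # Keep only cut points strictly inside the video, without duplicates.
    inner = sorted({c for c in cut_points if 0 < c < num_frames})
    bounds = [0] + inner + [num_frames]
    return list(zip(bounds[:-1], bounds[1:]))
```

For example, a 10-frame video with a section length of 4 yields three sections, the last of which holds the remaining 2 frames.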

Next, a key frame that represents each stored video section is extracted from the frames contained in that section. As a key frame extraction method, for example, the frame at a predetermined position in the stored video section (the first frame, the center frame, the last frame, or the like) can be selected. Alternatively, the average of the video signals of arbitrary frames contained in the stored video section may be used to create the key frame.
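Both key frame options can be written in a few lines of NumPy. The sketch below assumes a section is held as an array of shape (frames, height, width, channels); the function names and the `position` parameter are hypothetical, and the averaging variant averages all frames of the section, which is one possible choice of "arbitrary frames".

```python
import numpy as np


def select_key_frame(section, position="center"):
    """Pick the frame at a predetermined position in the section."""
    section = np.asarray(section)
    index = {
        "first": 0,
        "center": len(section) // 2,
        "last": len(section) - 1,
    }[position]
    return section[index]


def average_key_frame(section):
    """Create a key frame as the per-pixel average of all frames."""
    return np.asarray(section, dtype=np.float64).mean(axis=0)
```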

Subsequently, an image feature that expresses the characteristics of the key frame is computed from it and extracted as the stored video section feature. Image features can be computed using, for example, a color histogram, arbitrary components of the discrete cosine transform, arbitrary components of the Haar wavelet, higher-order local autocorrelation features (see N. Otsu and T. Kurita, "A new scheme for practical flexible and intelligent vision systems," Proc. IAPR Workshop on Computer Vision, pp. 431-435, 1988), a Bag of Features representation of feature points selected by any method (see Li Fei-Fei and Pietro Perona, "A Bayesian Hierarchical Model for Learning Natural Scene Categories," 2005), or any combination of these methods. Feature points can be selected using SIFT (see D. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, Vol. 60, No. 2, pp. 91-110, 2004), SURF (see H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," CVIU, vol. 110, no. 3, pp. 346-359, 2008), Harris, Harris-Laplace, random selection, or any combination of these methods. Feature values can be extracted from the feature points using methods such as SIFT and SURF.
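Of the image features listed, the color histogram is the simplest to illustrate. The sketch below is one possible formulation under stated assumptions: the key frame is an 8-bit image array of shape (height, width, channels), each channel is quantized into `bins_per_channel` bins, and the joint histogram is normalized to sum to 1. The function name and the default bin count are hypothetical and not taken from the patent.

```python
import numpy as np


def color_histogram(frame, bins_per_channel=4):
    """Joint color histogram of an 8-bit image, normalized to sum to 1."""
    frame = np.asarray(frame)
    pixels = frame.reshape(-1, frame.shape[-1])
    # Quantize each channel value (0..255) into a bin index.
    # Cast first so the multiplication cannot overflow uint8.
    quantized = (pixels.astype(np.int64) * bins_per_channel) // 256
    # Combine the per-channel bin indices into one joint bin index.
    index = np.zeros(len(pixels), dtype=np.int64)
    for channel in range(pixels.shape[1]):
        index = index * bins_per_channel + quantized[:, channel]
    num_bins = bins_per_channel ** pixels.shape[1]
    hist = np.bincount(index, minlength=num_bins).astype(np.float64)
    return hist / hist.sum()
```

With three channels and 4 bins per channel the feature vector has 64 dimensions, which keeps the later correlation analysis small.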

Alternatively, a window of a predetermined length may be applied to the stored video and shifted one frame at a time to extract stored video sections, and for each extracted section a feature taken from the stored video section itself may be computed, with TAS (see Japanese Patent No. 3065314) and BAM (see Japanese Patent No. 4358229) as representative examples.
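The windowing step alone can be sketched as below. Only the extraction of overlapping sections is shown; the TAS and BAM features themselves are defined in the cited patents and are not reproduced here. The function name is hypothetical.

```python
def sliding_sections(num_frames, window_length):
    """Return (start, end) pairs for a window shifted one frame at a time.

    end is exclusive; no section is returned if the video is shorter
    than the window.
    """
    if window_length <= 0:
        raise ValueError("window_length must be positive")
    return [(start, start + window_length)
            for start in range(0, num_frames - window_length + 1)]
```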

The stored additional information feature extraction unit 12 extracts, from each item of stored additional information (the additional information attached to a stored video section), a stored additional information feature, which is a vector expressing its characteristics. Here, additional information is information that describes the video. Typical additional information includes text describing the content or structure of the video and information about the time and place at which the video was shot.

For example, if the additional information is text, and in particular words, a binary vector expressing the presence or absence of each word (linguistic label) in the text label can be used as the stored additional information feature. More specifically, the stored additional information feature can be a vector with as many dimensions as the total number of words to be considered, where each dimension corresponds to one of those words. For example, the i-th dimension of the vector is set to 1 if the stored additional information contains word i and to 0 if it does not.
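The binary presence vector described above can be sketched as follows. The function name and the example vocabulary are hypothetical; words outside the vocabulary are simply ignored, which is an assumption the source does not state.

```python
import numpy as np


def binary_label_vector(labels, vocabulary):
    """Dimension i is 1 if vocabulary[i] appears in labels, else 0."""
    word_to_dim = {word: i for i, word in enumerate(vocabulary)}
    vector = np.zeros(len(vocabulary), dtype=np.int64)
    for word in labels:
        if word in word_to_dim:  # assumption: unknown words are skipped
            vector[word_to_dim[word]] = 1
    return vector


# Hypothetical example: a section labeled with "dog" (twice) and "sky".
feature = binary_label_vector(["dog", "sky", "dog"], ["sky", "sea", "dog"])
```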

A word co-occurrence vector may also be used as the stored additional information feature. More specifically, a vector whose d-th element is the number of occurrences of the d-th word in the text label can be extracted as the stored additional information feature.
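The count-based variant differs from the binary one only in accumulating occurrences. A minimal sketch, again with a hypothetical function name and with out-of-vocabulary words skipped by assumption:

```python
import numpy as np


def count_label_vector(words, vocabulary):
    """Element d is the number of times vocabulary[d] occurs in words."""
    word_to_dim = {word: d for d, word in enumerate(vocabulary)}
    vector = np.zeros(len(vocabulary), dtype=np.int64)
    for word in words:
        if word in word_to_dim:  # assumption: unknown words are skipped
            vector[word_to_dim[word]] += 1
    return vector
```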

Based on the stored video section feature set (the set of stored video section features) and the stored additional information feature set (the set of stored additional information features), the stored latent variable extraction unit 13 obtains a stored latent variable derivation function, which is a function that derives a vector (stored latent variable) describing the relationship between a stored video section and its stored additional information. Using this function, it obtains a latent variable from each stored video section feature and the corresponding stored additional information feature, treats these latent variables as stored latent variables, and extracts the stored latent variable set, which is the set of stored latent variables.

The stored latent variables can be extracted by, for example, the method using canonical correlation analysis described in Non-Patent Document 1, a method using probabilistic canonical correlation analysis (see Nakayama et al., "Image annotation and retrieval method for large-scale Web images: toward autonomous image knowledge acquisition from Web collective intelligence", Meeting on Image Recognition and Understanding (MIRU2009), OS2-4, July 2009), or the method using semi-supervised canonical correlation analysis described in Non-Patent Document 2.

For example, with the method using probabilistic canonical correlation analysis, stored latent variables such as those given by equation (1) below can be extracted.

Here, $\Lambda$ is a diagonal matrix whose diagonal elements are the singular values obtained by canonical correlation analysis with the stored video section feature set and the stored additional information feature set as input, and $W_x$ and $W_y$ are matrices whose rows are the left singular vectors obtained by the same canonical correlation analysis.
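As background for the quantities $\Lambda$, $W_x$, and $W_y$, the following is a sketch of ordinary linear canonical correlation analysis computed through a singular value decomposition of the whitened cross-covariance. It is a textbook formulation written under stated assumptions, not a reconstruction of equation (1): the function names, the regularization constant `reg`, and the column-wise layout of `Wx` and `Wy` are all choices made for this example.

```python
import numpy as np


def inverse_sqrt(cov):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    values, vectors = np.linalg.eigh(cov)
    return vectors @ np.diag(1.0 / np.sqrt(values)) @ vectors.T


def fit_cca(X, Y, reg=1e-6):
    """Linear CCA for paired rows of X (n, dx) and Y (n, dy).

    Returns the two means, projection matrices Wx and Wy (canonical
    directions as columns), and the canonical correlations, which are
    the singular values that would sit on the diagonal of Lambda.
    """
    n = len(X)
    mean_x, mean_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mean_x, Y - mean_y
    # Regularized covariances (reg is an assumption for stability).
    cov_xx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    cov_yy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    cov_xy = Xc.T @ Yc / n
    sx, sy = inverse_sqrt(cov_xx), inverse_sqrt(cov_yy)
    U, singular_values, Vt = np.linalg.svd(sx @ cov_xy @ sy,
                                           full_matrices=False)
    return mean_x, mean_y, sx @ U, sy @ Vt.T, singular_values
```

Projecting centered video features with `Wx` gives coordinates with approximately unit covariance, which is the property the later state modeling relies on when it treats latent variables as Gaussian.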

From the stored latent variable set, the stored state variable extraction unit 14 extracts a stored state variable set, which is a set of stored state variables (variables describing the distribution of the stored latent variables), or a stored state variable distribution, which is the distribution of the stored state variables. The topic trajectory model is constructed from the stored state variable set or stored state variable distribution extracted here together with the stored latent variable derivation function.

The method of extracting the stored state variable set or stored state variable distribution is not particularly limited. For example, assuming a hidden Markov model whose output distribution is a Gaussian mixture, one can use Viterbi learning, in which the Viterbi algorithm is incorporated into the E-step of the EM algorithm (see Japanese Unexamined Patent Application Publication No. 2009-259035), or a method in which the forward-backward algorithm is incorporated into the E-step of the EM algorithm (see Japanese Unexamined Patent Application Publication No. 2009-259035).

As also shown in FIG. 2, the video recognition unit 2 can further be represented as a configuration including an input video section feature extraction unit 21, an input latent variable extraction unit 22, an input state variable extraction unit 23, an input latent variable correction unit 24, an output additional information feature estimation unit 25, and an output additional information estimation unit 26. The video recognition unit 2 receives an input video set, which is a set of input videos to which additional information is to be attached by video recognition, and uses the topic trajectory model constructed by the topic trajectory model learning unit 1 to estimate and output the additional information corresponding to each input video.

The input video section feature extraction unit 21 performs the same processing as the stored video section feature extraction unit 11: it divides each input video, which is an element of the input video set, along the time axis to extract a plurality of input video sections. It then extracts a key frame from the frames contained in each input video section, computes from the key frame an image feature expressing its characteristics, and extracts this as the input video section feature.

The input latent variable extraction unit 22 uses an input latent variable derivation function based on the input video section feature set (the set of input video section features) to extract the input latent variable set, which is the set of input latent variables corresponding to the input video section features. Unlike the stored latent variable extraction unit 13, it computes the latent variables without additional information, so it can extract, for example, input latent variables as given by equation (2) below, which is described in Non-Patent Document 2.

入力状態変数抽出部２３は、入力潜在変数集合と、蓄積状態変数集合または蓄積状態変数分布とに基づいて、入力潜在変数の分布を記述する変数である入力状態変数の集合である入力状態変数集合、またはその入力状態変数の分布である入力状態変数分布を抽出する。これは、Viterbi searchまたはForward-backwardアルゴリズムを用いて、入力潜在変数の系列から入力状態変数~ｓ_ｉの系列~ｓ＝｛~s_１,~s_２、...｝、または、その確率分布を求めればよい。 The input state variable extraction unit 23 extracts, based on the input latent variable set and the stored state variable set or stored state variable distribution, either an input state variable set, which is the set of input state variables describing the distribution of the input latent variables, or an input state variable distribution, which is the distribution of those input state variables. This can be done by using Viterbi search or the forward-backward algorithm to obtain, from the sequence of input latent variables, the sequence of input state variables ~s_i, that is ~s = {~s_1, ~s_2, ...}, or its probability distribution.
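As an illustration of the Viterbi search mentioned above, the following sketch decodes the most likely state sequence from a sequence of latent vectors. It assumes a hidden Markov model in which each state emits latent vectors from a single isotropic Gaussian; the specification describes a Gaussian mixture per state, so this is a simplification, and the parameter names (`means`, `var`, `log_trans`, `log_init`) are hypothetical.

```python
import numpy as np

def log_gaussian(z, mean, var):
    """Log density of an isotropic Gaussian N(mean, var * I) evaluated at z."""
    d = z.shape[-1]
    diff = z - mean
    return -0.5 * (d * np.log(2 * np.pi * var) + np.sum(diff * diff, axis=-1) / var)

def viterbi(latents, means, var, log_trans, log_init):
    """Most likely state sequence for latent vectors under a Gaussian-emission HMM."""
    T, K = len(latents), len(means)
    # log_emit[t, k] = log p(z_t | state k)
    log_emit = np.stack([log_gaussian(latents, means[k], var) for k in range(K)], axis=1)
    score = np.empty((T, K))
    back = np.zeros((T, K), dtype=int)
    score[0] = log_init + log_emit[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_trans  # cand[i, j]: move from state i to j
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(K)] + log_emit[t]
    # Trace the best path backwards from the best final state.
    states = np.empty(T, dtype=int)
    states[-1] = np.argmax(score[-1])
    for t in range(T - 1, 0, -1):
        states[t - 1] = back[t, states[t]]
    return states

# Toy model: two states centred at (0, 0) and (5, 5) with "sticky" transitions.
means = np.array([[0.0, 0.0], [5.0, 5.0]])
log_trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_init = np.log(np.array([0.5, 0.5]))
latents = np.array([[0.1, -0.2], [0.3, 0.1], [4.8, 5.2], [5.1, 4.9], [5.0, 5.3]])
decoded = viterbi(latents, means, 1.0, log_trans, log_init)
```

Where a distribution over states is wanted instead of a single sequence, the forward-backward algorithm would replace the max operations with sums over the same emission and transition scores.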

入力潜在変数補正部２４は、入力状態変数集合または入力状態変数分布に基づいて、入力潜在変数集合を補正し、これを新たに補正入力潜在変数集合として抽出する。入力状態変数抽出部２３によって入力状態変数~ｓが得られると、トピック軌跡モデル学習部１で構築されたトピック軌跡モデルを用いることにより、その入力状態変数に対応する形で与えられている潜在変数の混合正規分布が決定されることになる。この分布そのもの、またはその平均ベクトルから、補正後の入力潜在変数^ｚを算出する。 The input latent variable correction unit 24 corrects the input latent variable set based on the input state variable set or the input state variable distribution, and extracts the result as a new corrected input latent variable set. Once the input state variable ~s has been obtained by the input state variable extraction unit 23, the topic trajectory model constructed by the topic trajectory model learning unit 1 determines the Gaussian mixture distribution of latent variables associated with that input state variable. The corrected input latent variable ^z is calculated from this distribution itself or from its mean vector.
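The mean-vector variant of this correction can be sketched as below. The sketch assumes one mean vector per state (a single Gaussian rather than the mixture described in the text) and shows two hypothetical helpers: one for a hard state sequence, as obtained by Viterbi search, and one for a state posterior distribution, as obtained by the forward-backward algorithm.

```python
import numpy as np

def correct_with_states(states, state_means):
    """Replace each latent variable by the mean vector of its decoded state."""
    return state_means[np.asarray(states)]

def correct_with_posteriors(posteriors, state_means):
    """Replace each latent variable by the posterior-weighted average of state means."""
    return np.asarray(posteriors) @ state_means

state_means = np.array([[0.0, 0.0], [5.0, 5.0]])
z_hard = correct_with_states([0, 1, 1], state_means)
z_soft = correct_with_posteriors([[0.5, 0.5], [0.0, 1.0]], state_means)
```

With a hard state sequence each section receives exactly one corrected vector, while the posterior-weighted form blends the state means in proportion to how probable each state is for that section.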

出力付加情報特徴推定部２５は、補正入力潜在変数集合と、入力映像区間特徴に基づいて、出力付加情報に対応するベクトルである出力付加情報特徴を推定する。入力潜在変数補正部２４によって補正後の入力潜在変数^ｚが得られると、例えば、確率的正準相関分析において潜在変数から付加情報を推定する式である下記（３）式、または、上記（１）式を式変形することで得られる下記（４）式により、入力潜在変数及び入力映像区間特徴に基づいて、出力付加情報特徴を計算する。 The output additional information feature estimation unit 25 estimates the output additional information feature, a vector corresponding to the output additional information, based on the corrected input latent variable set and the input video section features. Once the corrected input latent variable ^z has been obtained by the input latent variable correction unit 24, the output additional information feature is calculated from the input latent variable and the input video section feature using, for example, equation (3) below, which estimates additional information from a latent variable in probabilistic canonical correlation analysis, or equation (4) below, which is obtained by rearranging equation (1) above.

補正後の入力潜在変数^ｚは、平均ベクトルの形または分布の形で複数存在しているため、それらの値に基づいて可能性の高い出力付加情報特徴が複数得られる。 Because a plurality of corrected input latent variables ^z exist, in the form of mean vectors or of distributions, a plurality of likely output additional information features are obtained from those values.

出力付加情報推定部２６は、蓄積付加情報特徴抽出部１２と逆の処理を行って、出力付加情報特徴推定部２５により推定された出力付加情報特徴集合の各々から出力付加情報の各々を推定して、出力付加情報の集合である出力付加情報集合を出力する。 The output additional information estimation unit 26 performs the reverse of the processing of the stored additional information feature extraction unit 12 to estimate the output additional information from each element of the output additional information feature set estimated by the output additional information feature estimation unit 25, and outputs the output additional information set, which is the set of output additional information.
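One way to picture this reverse processing is shown below, under the assumption (not stated in this passage) that the additional information is a set of text labels and that its feature is an indicator vector over a fixed vocabulary. The forward mapping turns labels into a vector; the reverse mapping picks the labels with the highest estimated scores. The vocabulary, the `top_k` parameter, and both function names are hypothetical.

```python
import numpy as np

def labels_to_feature(labels, vocabulary):
    """Forward direction: indicator vector with 1.0 for each label that is present."""
    index = {word: i for i, word in enumerate(vocabulary)}
    feature = np.zeros(len(vocabulary))
    for word in labels:
        feature[index[word]] = 1.0
    return feature

def feature_to_labels(feature, vocabulary, top_k=2):
    """Reverse direction: return the top_k labels with the largest feature values."""
    order = np.argsort(-np.asarray(feature), kind="stable")[:top_k]
    return [vocabulary[i] for i in order]

vocabulary = ["beach", "dog", "car", "sky"]
encoded = labels_to_feature(["dog", "sky"], vocabulary)
estimated = feature_to_labels([0.1, 0.7, 0.05, 0.9], vocabulary, top_k=2)
```

A threshold on the score could be used instead of a fixed `top_k` when the number of labels per video section is not known in advance.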

次に、図３を参照して、第１の実施の形態の映像認識装置１０において実行される認識処理ルーチンについて説明する。 Next, with reference to FIG. 3, a recognition processing routine executed in the video recognition apparatus 10 according to the first embodiment will be described.

ステップ１００で、図示しない入力手段により入力された蓄積映像集合及び対応する蓄積付加情報集合を取得する。 In step 100, a stored video set and a corresponding stored additional information set input by input means (not shown) are acquired.

次に、ステップ１０２で、上記ステップ１００で取得した蓄積映像集合の要素である蓄積映像の各々を時間軸に沿って、複数の蓄積映像区間に分割し、蓄積映像区間に含まれるキーフレームから、その特性を表現する画像特徴を算出し、これを蓄積映像区間特徴として抽出する。 Next, in step 102, each stored video that is an element of the stored video set acquired in step 100 is divided along the time axis into a plurality of stored video sections, an image feature expressing its characteristics is calculated from the key frame included in each stored video section, and this is extracted as the stored video section feature.

次に、ステップ１０４で、上記ステップ１００で取得した蓄積付加情報集合の要素である蓄積付加情報の各々から、その特性を表現するベクトルである蓄積付加情報特徴を抽出する。 Next, in step 104, a stored additional information feature that is a vector representing the characteristic is extracted from each of the stored additional information that is an element of the stored additional information set acquired in step 100.

次に、ステップ１０６で、上記ステップ１０２で抽出した蓄積映像区間特徴の集合である蓄積映像区間特徴集合、及び上記ステップ１０４で抽出した蓄積付加情報特徴の集合である蓄積付加情報特徴集合に基づいて得られる蓄積潜在変数導出関数を用いて蓄積潜在変数を得て、蓄積潜在変数の集合である蓄積潜在変数集合を抽出する。 Next, in step 106, stored latent variables are obtained using the stored latent variable derivation function, which is derived from the stored video section feature set (the set of stored video section features extracted in step 102) and the stored additional information feature set (the set of stored additional information features extracted in step 104), and the stored latent variable set, which is the set of stored latent variables, is extracted.

次に、ステップ１０８で、上記ステップ１０６で抽出した蓄積潜在変数集合から、蓄積潜在変数の分布を記述する変数である蓄積状態変数の集合である蓄積状態変数集合、または蓄積状態変数の分布である蓄積状態変数分布を抽出して、トピック軌跡モデルを構築する。 Next, in step 108, a stored state variable set, which is the set of stored state variables describing the distribution of the stored latent variables, or a stored state variable distribution, which is the distribution of the stored state variables, is extracted from the stored latent variable set extracted in step 106, and the topic trajectory model is constructed.

次に、ステップ１１０で、図示しない入力手段により入力された認識対象の入力映像集合を取得する。 Next, in step 110, an input video set to be recognized input by an input unit (not shown) is acquired.

次に、ステップ１１２で、上記ステップ１０２と同様の処理により、入力された入力映像集合の要素である入力映像の各々を時間軸に沿って、複数の入力映像区間に分割し、入力映像区間に含まれるキーフレームから、その特性を表現する画像特徴を算出し、これを入力映像区間特徴として抽出する。 Next, in step 112, by the same processing as in step 102, each input video that is an element of the input video set is divided along the time axis into a plurality of input video sections, an image feature expressing its characteristics is calculated from the key frame included in each input video section, and this is extracted as the input video section feature.

次に、ステップ１１４で、上記ステップ１１２で抽出された入力映像区間特徴の集合である入力映像区間特徴集合と、入力潜在変数導出関数とに基づいて、入力映像区間特徴に対応する入力潜在変数を得て、入力潜在変数の集合である入力潜在変数集合を抽出する。 Next, in step 114, the input latent variable corresponding to the input video segment feature is determined based on the input video segment feature set that is the set of input video segment features extracted in step 112 and the input latent variable derivation function. Then, an input latent variable set that is a set of input latent variables is extracted.

次に、ステップ１１６で、上記ステップ１１４で抽出した入力潜在変数集合と、上記ステップ１０８で抽出した蓄積状態変数集合または蓄積状態変数分布とに基づいて、入力潜在変数の分布を記述する変数である入力状態変数の集合である入力状態変数集合、またはその入力状態変数の分布である入力状態変数分布を抽出する。 Next, in step 116, based on the input latent variable set extracted in step 114 and the stored state variable set or stored state variable distribution extracted in step 108, an input state variable set, which is the set of input state variables describing the distribution of the input latent variables, or an input state variable distribution, which is the distribution of those input state variables, is extracted.

次に、ステップ１１８で、上記ステップ１１６で抽出した入力状態変数集合または入力状態変数分布から、上記ステップ１１４で抽出した入力潜在変数集合を補正し、これを新たに補正入力潜在変数集合として抽出する。 Next, in step 118, the input latent variable set extracted in step 114 is corrected from the input state variable set or the input state variable distribution extracted in step 116, and this is newly extracted as a corrected input latent variable set. .

次に、ステップ１２０で、上記ステップ１１８で抽出した補正入力潜在変数集合と、上記ステップ１１２で抽出した入力映像区間特徴とに基づいて、出力付加情報に対応するベクトルである出力付加情報特徴を得て、出力付加情報特徴の集合である出力付加情報特徴集合を推定する。 Next, in step 120, based on the corrected input latent variable set extracted in step 118 and the input video section feature extracted in step 112, an output additional information feature that is a vector corresponding to the output additional information is obtained. Thus, an output additional information feature set that is a set of output additional information features is estimated.

次に、ステップ１２２で、上記ステップ１０４と逆の処理を行って、上記ステップ１２０で推定された出力付加情報特徴集合の各々から出力付加情報の各々を推定して、出力付加情報の集合である出力付加情報集合として出力する。 Next, in step 122, a process reverse to that in step 104 is performed to estimate each of the output additional information from each of the output additional information feature sets estimated in step 120, and this is a set of output additional information. Output as output additional information set.

以上説明したように、第１の実施の形態の映像認識装置によれば、入力映像区間と入力付加情報との関係を記述する入力映像区間特徴に基づく入力潜在変数から得られた入力状態変数と、蓄積映像集合と蓄積付加情報集合との関係を記述したトピック軌跡モデルとに基づいて入力潜在変数を補正し、補正された入力潜在変数と入力映像区間特徴とに基づいて、映像区間に付加される可能性の高い出力付加情報特徴を推定するため、時間情報も考慮して映像認識を精度良く行うことができる。 As described above, the video recognition apparatus of the first embodiment corrects the input latent variables based on (a) the input state variables obtained from the input latent variables, which are derived from the input video section features and describe the relationship between an input video section and input additional information, and (b) the topic trajectory model, which describes the relationship between the stored video set and the stored additional information set. It then estimates, from the corrected input latent variables and the input video section features, the output additional information features that are likely to be attached to each video section. Video recognition can therefore be performed accurately while also taking temporal information into account.

次に、第２の実施の形態について説明する。第２の実施の形態では、本発明の映像管理装置を、入力した付加情報に基づいて所望の映像を検索する映像検索装置に適用した場合について説明する。なお、第２の実施の形態の映像検索装置において、第１の実施の形態の映像認識装置の構成と同一の構成については、同一の符号を付して詳細な説明を省略する。 Next, a second embodiment will be described. In the second embodiment, a case will be described in which the video management apparatus of the present invention is applied to a video search apparatus that searches for a desired video based on input additional information. Note that, in the video search device of the second embodiment, the same components as those of the video recognition device of the first embodiment are denoted by the same reference numerals, and detailed description thereof is omitted.

第２の実施の形態に係る映像検索装置２０は、ＣＰＵと、ＲＡＭと、後述する検索処理ルーチンを実行するためのプログラムを記憶したＲＯＭとを備えたコンピュータで構成され、機能的には、図４に示すように、トピック軌跡モデル記憶部１５と、映像認識部２と、映像検索部３とを含んだ構成で表すことができる。 The video search apparatus 20 according to the second embodiment is a computer comprising a CPU, a RAM, and a ROM that stores a program for executing the search processing routine described later. Functionally, as shown in FIG. 4, it can be represented by a configuration including a topic trajectory model storage unit 15, a video recognition unit 2, and a video search unit 3.

トピック軌跡モデル記憶部１５には、第１の実施の形態の映像認識装置１０のトピック軌跡モデル学習部１により構築されたトピック軌跡モデルが記憶されている。 The topic trajectory model storage unit 15 stores a topic trajectory model constructed by the topic trajectory model learning unit 1 of the video recognition apparatus 10 according to the first embodiment.

図５に示すように、映像検索部３は、さらに、蓄積映像認識部３１と、出力映像推定部３２とを含んだ構成で表すことができる。映像検索部３は、第１の実施の形態の映像認識装置１０により構築したトピック軌跡モデルを用いて、入力された付加情報（入力付加情報）に適合した映像を検索する。 As shown in FIG. 5, the video search unit 3 can be represented by a configuration that further includes a stored video recognition unit 31 and an output video estimation unit 32. The video search unit 3 uses the topic trajectory model constructed by the video recognition apparatus 10 of the first embodiment to search for videos that match the input additional information.

蓄積映像認識部３１は、入力された蓄積映像集合の要素である蓄積映像の各々を入力映像として映像認識部２に入力し、映像認識部２の各部を第１の実施の形態と同様に動作させて、その出力である出力蓄積付加情報の集合である出力蓄積付加情報集合を出力する。 The stored video recognition unit 31 feeds each stored video, which is an element of the input stored video set, to the video recognition unit 2 as an input video, operates each part of the video recognition unit 2 in the same way as in the first embodiment, and outputs the resulting output stored additional information set, which is the set of output stored additional information.

出力映像推定部３２は、入力された入力付加情報、蓄積映像認識部３１に入力された蓄積映像集合に対応する蓄積付加情報集合、及び映像認識部２から出力された出力蓄積付加情報集合を入力とし、それらを比較することにより、入力付加情報に適合する出力映像を推定して出力する。 The output video estimation unit 32 takes as input the input additional information, the stored additional information set corresponding to the stored video set supplied to the stored video recognition unit 31, and the output stored additional information set produced by the video recognition unit 2, and compares them to estimate and output the output video that matches the input additional information.

次に、図６を参照して、第２の実施の形態の映像検索装置２０において実行される検索処理ルーチンについて説明する。 Next, with reference to FIG. 6, a search processing routine executed in the video search apparatus 20 of the second embodiment will be described.

ステップ２００で、図示しない入力手段により入力された入力付加情報及び蓄積映像集合を取得する。 In step 200, input additional information and a stored video set input by an input unit (not shown) are acquired.

次に、ステップ２０２で、出力蓄積付加情報集合出力処理を実行する。出力付加情報集合出力処理では、上記ステップ２００で入力された蓄積映像集合の要素である蓄積映像の各々を入力映像として、第１の実施の形態の認識処理（図３）のステップ１１０〜１２２と同様の処理が実行される。 Next, in step 202, the output stored additional information set output process is executed. In this process, the same processing as steps 110 to 122 of the recognition process of the first embodiment (FIG. 3) is executed with each stored video, which is an element of the stored video set input in step 200, as an input video.

次に、ステップ２０４で、上記ステップ２００で入力された入力付加情報、上記ステップ２００で入力された蓄積映像集合に対応する蓄積付加情報集合、及び上記ステップ２０２で出力された出力蓄積付加情報集合を比較することにより、蓄積映像集合から入力付加情報に適合する出力映像を推定して出力する。なお、出力する出力映像は、１つであっても複数であってもよい。また、入力付加情報との適合度に応じて、ランキングを付して出力するようにしてもよい。 Next, in step 204, the input additional information input in step 200, the stored additional information set corresponding to the stored video set input in step 200, and the output stored additional information set output in step 202 are compared, and the output video that matches the input additional information is estimated from the stored video set and output. One output video or several may be output. The output videos may also be ranked according to their degree of match with the input additional information.
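The comparison and ranking in step 204 could be sketched as follows. The text only says that the three inputs are compared, so the cosine similarity measure, the equal weighting of the two comparisons, and the names `rank_videos` and `weight` are assumptions made for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def rank_videos(query, stored_feats, estimated_feats, weight=0.5):
    """Rank stored videos by a weighted match of the query against both the stored
    additional information and the additional information estimated by recognition."""
    scores = np.array([
        weight * cosine(query, s) + (1 - weight) * cosine(query, e)
        for s, e in zip(stored_feats, estimated_feats)
    ])
    order = np.argsort(-scores, kind="stable")  # best match first
    return order, scores[order]

# Toy data: 3 stored videos, label vocabulary of size 3.
query = np.array([1.0, 0.0, 1.0])
stored_feats = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
estimated_feats = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.0], [1.0, 0.0, 1.0]])
order, ranked_scores = rank_videos(query, stored_feats, estimated_feats)
```

Returning the full ordering covers both options mentioned in the text: taking only the first index yields a single output video, while taking the first few yields a ranked list.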

以上説明したように、第２の実施の形態の映像検索装置によれば、時間情報も考慮して精度良く推定された蓄積映像集合に対する出力蓄積付加情報集合と、蓄積映像集合に対応する蓄積付加情報集合と、入力付加情報とを比較して、入力付加情報に適合する出力映像を推定するため、時間情報も考慮して精度良く映像検索を行うことができる。 As described above, the video search apparatus of the second embodiment estimates the output video that matches the input additional information by comparing the output stored additional information set, which is estimated accurately for the stored video set while taking temporal information into account, the stored additional information set corresponding to the stored video set, and the input additional information. Video search can therefore be performed accurately while also taking temporal information into account.

なお、第１の実施の形態は映像認識装置について、第２の実施の形態では映像検索装置について説明したが、第２の実施の形態において、トピック軌跡モデル記憶部を第１の実施の形態のトピック軌跡モデル学習部に置き換えることにより、映像の認識及び検索の両方を実行することができる映像認識検索装置として構成することができる。 The first embodiment described a video recognition apparatus and the second embodiment a video search apparatus. By replacing the topic trajectory model storage unit of the second embodiment with the topic trajectory model learning unit of the first embodiment, the apparatus can also be configured as a video recognition and search apparatus capable of performing both video recognition and video search.

また、本発明は、上述した実施の形態に限定されるものではなく、この発明の要旨を逸脱しない範囲内で様々な変形や応用が可能である。 Further, the present invention is not limited to the above-described embodiment, and various modifications and applications can be made without departing from the gist of the present invention.

例えば、上述の映像認識装置及び映像検索装置は、内部にコンピュータシステムを有しているが、「コンピュータシステム」は、ＷＷＷシステムを利用している場合であれば、ホームページ提供環境（あるいは表示環境）も含むものとする。 For example, the video recognition apparatus and video search apparatus described above contain a computer system; where a WWW system is used, the "computer system" also includes the homepage providing environment (or display environment).

また、本願明細書中において、プログラムが予めインストールされている実施の形態として説明したが、当該プログラムを、コンピュータ読み取り可能な記録媒体に格納して提供することも可能である。 Further, in the present specification, the embodiment has been described in which the program is installed in advance. However, the program can be provided by being stored in a computer-readable recording medium.

１トピック軌跡モデル学習部
２映像認識部
３映像検索部
１０映像認識装置
１５トピック軌跡モデル記憶部
２０映像検索装置
DESCRIPTION OF SYMBOLS
1 Topic trajectory model learning unit
2 Video recognition unit
3 Video search unit
10 Video recognition apparatus
15 Topic trajectory model storage unit
20 Video search apparatus

Claims

A video management apparatus comprising: topic trajectory model storage means for storing a topic trajectory model consisting of a stored state variable set or a stored state variable distribution and a stored latent variable derivation function; and video recognition means, wherein the stored state variable set or the stored state variable distribution is constructed by topic trajectory model learning means that, based on each of stored video section features indicating the characteristics of stored video sections, extracted from each of the stored video sections obtained by dividing each stored video, which is an element of a stored video set stored in advance, in time into predetermined sections, and on each of stored additional information features indicating the characteristics of stored additional information, extracted from each item of stored additional information attached to each of the stored video sections, extracts the stored state variable set, which is a set of stored state variables, or the stored state variable distribution, which is the distribution of the stored state variables, from the distribution of stored latent variables that describe the relationship between the stored video sections and the stored additional information and that are obtained using the stored latent variable derivation function based on a stored video section feature set, which is the set of the stored video section features, and a stored additional information feature set, which is the set of the stored additional information features, and wherein
the video recognition means includes:
input video section feature extraction means for extracting input video section features, each indicating the characteristics of an input video section, from each of the input video sections obtained by dividing an input video in time into predetermined sections;
input latent variable extraction means for obtaining the input latent variables corresponding to the input video section features by using an input latent variable derivation function based on an input video section feature set, which is the set of the input video section features extracted by the input video section feature extraction means, and extracting an input latent variable set, which is the set of the input latent variables;
input state variable extraction means for extracting, based on the input latent variable set extracted by the input latent variable extraction means and on the stored state variable set or the stored state variable distribution, an input state variable set, which is a set of input state variables describing the distribution of the input latent variables, or an input state variable distribution, which is the distribution of the input state variables;
input latent variable correction means for correcting the input latent variable set extracted by the input latent variable extraction means based on the input state variable set or the input state variable distribution extracted by the input state variable extraction means, and extracting the result as a new corrected input latent variable set;
output additional information feature estimation means for estimating, based on the corrected input latent variable set extracted by the input latent variable correction means and the input video section feature set extracted by the input video section feature extraction means, an output additional information feature set, which is the set of output additional information features related to the input video section features; and
output additional information estimation means for estimating the output additional information corresponding to the features indicated by the output additional information feature set estimated by the output additional information feature estimation means.

The video management apparatus according to claim 1, further comprising:
stored video recognition means for inputting each of the stored videos, which are elements of the stored video set, to the video recognition means as the input video, and outputting an output stored additional information set, which is the set of output stored additional information that the video recognition means outputs as the additional information corresponding to the stored video set; and
output video estimation means for estimating, as an output video, a video from the stored video set that matches input additional information, based on the input additional information, the stored additional information set corresponding to the input stored video set, and the output stored additional information set output by the video recognition means.

The video management apparatus according to claim 1, further comprising the topic trajectory model learning means.

A video management method comprising:
storing a topic trajectory model consisting of a stored state variable set or a stored state variable distribution and a stored latent variable derivation function, the stored state variable set or the stored state variable distribution being constructed by extracting, based on each of stored video section features indicating the characteristics of stored video sections, extracted from each of the stored video sections obtained by dividing each stored video, which is an element of a stored video set stored in advance, in time into predetermined sections, and on each of stored additional information features indicating the characteristics of stored additional information, extracted from each item of stored additional information attached to each of the stored video sections, the stored state variable set, which is a set of stored state variables, or the stored state variable distribution, which is the distribution of the stored state variables, from the distribution of stored latent variables that describe the relationship between the stored video sections and the stored additional information and that are obtained using the stored latent variable derivation function based on a stored video section feature set, which is the set of the stored video section features, and a stored additional information feature set, which is the set of the stored additional information features;
extracting input video section features, each indicating the characteristics of an input video section, from each of the input video sections obtained by dividing an input video in time into predetermined sections;
obtaining the input latent variables corresponding to the input video section features by using an input latent variable derivation function based on an input video section feature set, which is the set of the extracted input video section features, and extracting an input latent variable set, which is the set of the input latent variables;
extracting, based on the extracted input latent variable set and on the stored state variable set or the stored state variable distribution, an input state variable set, which is a set of input state variables describing the distribution of the input latent variables, or an input state variable distribution, which is the distribution of the input state variables;
correcting the input latent variable set based on the extracted input state variable set or input state variable distribution, and extracting the result as a new corrected input latent variable set;
estimating, based on the extracted corrected input latent variable set and the input video section feature set, an output additional information feature set, which is the set of output additional information features related to the input video section features; and
estimating the output additional information corresponding to the features indicated by the estimated output additional information feature set.

The video management method according to claim 4, further comprising:
inputting each of the stored videos, which are elements of the stored video set, as the input video, and outputting an output stored additional information set, which is the set of output stored additional information obtained as the additional information corresponding to the stored video set; and
estimating, as an output video, a video from the stored video set that matches input additional information, based on the input additional information, the stored additional information set corresponding to the input stored video set, and the output stored additional information set that was output.

A video management program for causing a computer to function as each of the means constituting the video management apparatus according to any one of claims 1 to 3.