JP2013105393A

JP2013105393A - Video additional information relationship learning device, method and program

Info

Publication number: JP2013105393A
Application number: JP2011249956A
Authority: JP
Inventors: Shogo Kimura; 昭悟木村; Hirokazu Kameoka; 弘和亀岡; Ei Sakano; 鋭坂野; Susumu Sugiyama; 将杉山
Original assignee: Nippon Telegraph and Telephone Corp; Tokyo Institute of Technology NUC
Current assignee: Nippon Telegraph and Telephone Corp; Tokyo Institute of Technology NUC
Priority date: 2011-11-15
Filing date: 2011-11-15
Publication date: 2013-05-30

Abstract

PROBLEM TO BE SOLVED: To provide a video additional information relationship learning device that uses both the image information and the acoustic information contained in a video, takes their mutual co-occurrence into account, and thereby learns the relationship between that information and linguistic information with high accuracy. SOLUTION: In the video additional information relationship learning device, a stored image feature extraction unit 5 extracts a complete stored image feature and an incomplete stored image feature from each complete stored video and each incomplete stored video, respectively. A stored acoustic feature extraction unit 6 extracts a complete stored acoustic feature and an incomplete stored acoustic feature from each complete stored video and each incomplete stored video, respectively. A stored additional information feature extraction unit 4 extracts a stored additional information feature from each item of stored additional information. A stored latent variable extraction unit 8 extracts a stored latent variable set, which is a set of variables for describing the relationship between video and additional information. A video/additional information relationship learning unit 9 learns a video/additional information relation model, which is a model describing the relationship between video and additional information.

Description

The present invention relates to a video additional information relationship learning device, method, and program.

Video retrieval technology, which searches for a desired video based on given linguistic information, and video recognition technology, which automatically assigns to a given video the linguistic information that describes it, have become very important with the spread of imaging devices such as digital cameras and mobile phones and with the generalization of video sharing on the Internet. A video can be regarded as a time series of still images, and many techniques aimed at video retrieval or video recognition split the video into still images and then apply image retrieval and image recognition techniques (for example, Non-Patent Document 1).

In recent years, many image recognition and retrieval techniques that realize image retrieval and image recognition within a single framework have also been developed, such as that of Non-Patent Document 2.

Furthermore, as an extension of these techniques, a technique has been developed that appropriately learns the relationship between images and linguistic information and achieves highly accurate image recognition and retrieval even when a sufficient number of images not associated with linguistic information cannot be prepared (Non-Patent Document 3).

Non-Patent Document 1: Sivic, J. and Zisserman, A., “Video Google: A Text Retrieval Approach to Object Matching in Videos,” Proceedings of the International Conference on Computer Vision (2003)
Non-Patent Document 2: Nakayama, Harada, Kuniyoshi, Otsu, “Ultra-high-speed image recognition and retrieval method using probabilistic structure learning of image-word concept correspondence,” IEICE Technical Report, PRMU2007-147, December 2007
Non-Patent Document 3: Kimura, Nakano, Sugiyama, Kameoka, Maeda, Sakano, “SSCDE: Semi-supervised canonical density estimation method for image recognition and retrieval,” Proceedings of the Symposium on Image Recognition and Understanding, OS8-1, July 2010

These techniques realize video recognition and retrieval using image information alone. However, a video normally carries an acoustic signal as well, and there are many events for which this acoustic signal is essential to understanding the content of the video. Examples include not only events such as “singing” and “cheering,” whose content becomes clear only when acoustic information is observed, but also events such as “a soccer goal” and “an explosion,” whose content becomes clearer when acoustic information is observed together with image information.

The present invention has been made in view of these circumstances. Its object is to provide a video additional information relationship learning device, method, and program that use both the image information and the acoustic information contained in a video, take their mutual co-occurrence into account, and can thereby learn the relationship between that information and linguistic information with higher accuracy.

To achieve the above object, a video additional information relationship learning device according to the present invention learns the relationship between a video, which is a moving image with sound, and additional information, which is information describing the video. The device comprises: stored image feature extraction means for extracting a complete stored image feature and an incomplete stored image feature, each a vector expressing the characteristics of an image, from each complete stored video, which is an element of a complete stored video set (a set of videos to which additional information has been given in advance), and from each incomplete stored video, which is an element of an incomplete stored video set (a set of videos to which no additional information has been given), respectively; stored acoustic feature extraction means for extracting a complete stored acoustic feature and an incomplete stored acoustic feature, each a vector expressing the characteristics of sound, from each complete stored video and each incomplete stored video, respectively; stored additional information feature extraction means for extracting a stored additional information feature, which is a vector expressing the characteristics of additional information, from each item of stored additional information, which is an element of a stored additional information set (the set of additional information that has been given); stored latent variable extraction means for extracting a stored latent variable set, which is a set of variables for describing the relationship between video and additional information, from the complete stored image feature set (the set of complete stored image features), the incomplete stored image feature set (the set of incomplete stored image features), the complete stored acoustic feature set (the set of complete stored acoustic features), the incomplete stored acoustic feature set (the set of incomplete stored acoustic features), and the stored additional information feature set (the set of stored additional information features); and video/additional information relationship learning means for learning a video/additional information relation model, which is a model describing the relationship between video and additional information, from the complete stored image feature set, the incomplete stored image feature set, the complete stored acoustic feature set, the incomplete stored acoustic feature set, the stored additional information feature set, and the stored latent variable set.

A video additional information relationship learning method according to the present invention is used in a video additional information relationship learning device that learns the relationship between a video, which is a moving image with sound, and additional information, which is information describing the video. The method comprises the steps of: extracting, by stored image feature extraction means, a complete stored image feature and an incomplete stored image feature, each a vector expressing the characteristics of an image, from each complete stored video, which is an element of a complete stored video set (a set of videos to which additional information has been given in advance), and from each incomplete stored video, which is an element of an incomplete stored video set (a set of videos to which no additional information has been given), respectively; extracting, by stored acoustic feature extraction means, a complete stored acoustic feature and an incomplete stored acoustic feature, each a vector expressing the characteristics of sound, from each complete stored video and each incomplete stored video, respectively; extracting, by stored additional information feature extraction means, a stored additional information feature, which is a vector expressing the characteristics of additional information, from each item of stored additional information, which is an element of a stored additional information set (the set of additional information that has been given); extracting, by stored latent variable extraction means, a stored latent variable set, which is a set of variables for describing the relationship between video and additional information, from the complete stored image feature set, the incomplete stored image feature set, the complete stored acoustic feature set, the incomplete stored acoustic feature set, and the stored additional information feature set; and learning, by video/additional information relationship learning means, a video/additional information relation model, which is a model describing the relationship between video and additional information, from the complete stored image feature set, the incomplete stored image feature set, the complete stored acoustic feature set, the incomplete stored acoustic feature set, the stored additional information feature set, and the stored latent variable set.

According to the present invention, the stored image feature extraction means extracts a complete stored image feature and an incomplete stored image feature, each a vector expressing the characteristics of an image, from each complete stored video, which is an element of the complete stored video set (a set of videos to which additional information has been given in advance), and from each incomplete stored video, which is an element of the incomplete stored video set (a set of videos to which no additional information has been given), respectively. The stored acoustic feature extraction means extracts a complete stored acoustic feature and an incomplete stored acoustic feature, each a vector expressing the characteristics of sound, from each complete stored video and each incomplete stored video, respectively.

Next, the stored additional information feature extraction means extracts a stored additional information feature, which is a vector expressing the characteristics of additional information, from each item of stored additional information, which is an element of the stored additional information set (the set of additional information that has been given). The stored latent variable extraction means extracts a stored latent variable set, which is a set of variables for describing the relationship between video and additional information, from the complete stored image feature set, the incomplete stored image feature set, the complete stored acoustic feature set, the incomplete stored acoustic feature set, and the stored additional information feature set.

Then, the video/additional information relationship learning means learns a video/additional information relation model, which is a model describing the relationship between video and additional information, from the complete stored image feature set, the incomplete stored image feature set, the complete stored acoustic feature set, the incomplete stored acoustic feature set, the stored additional information feature set, and the stored latent variable set.

In this way, a model describing the relationship between video and additional information is learned from the complete stored image features and complete stored acoustic features extracted from each complete stored video, the incomplete stored image features and incomplete stored acoustic features extracted from each incomplete stored video, and the stored latent variable set. As a result, both the image information and the acoustic information contained in the video are used, their mutual co-occurrence is taken into account, and the relationship between that information and linguistic information can be learned with higher accuracy.

A program according to the present invention is a program for causing a computer to function as each means of the video additional information relationship learning device described above.

As described above, the video additional information relationship learning device, method, and program of the present invention learn a model describing the relationship between video and additional information from the complete stored image features and complete stored acoustic features extracted from each complete stored video, the incomplete stored image features and incomplete stored acoustic features extracted from each incomplete stored video, and the stored latent variable set. This yields the effect that both the image information and the acoustic information contained in the video are used, their mutual co-occurrence is taken into account, and the relationship between that information and linguistic information can be learned with higher accuracy.

A block diagram showing a configuration example of the video additional information relationship learning device according to the first embodiment of the present invention.
A block diagram showing a configuration example of the stored latent variable extraction unit of the video additional information relationship learning device according to the first embodiment of the present invention.
A flowchart showing the model learning processing routine in the video additional information relationship learning device according to the first embodiment of the present invention.
A block diagram showing a configuration example of the semi-supervised video retrieval device according to the second embodiment of the present invention.
A flowchart showing the video retrieval processing routine in the semi-supervised video retrieval device according to the second embodiment of the present invention.
A block diagram showing a configuration example of the semi-supervised video recognition device according to the third embodiment of the present invention.
A flowchart showing the video recognition processing routine in the semi-supervised video recognition device according to the third embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

[First Embodiment]
<System configuration>
FIG. 1 is a block diagram showing a video additional information relationship learning device 100 according to the first embodiment of the present invention. The video additional information relationship learning device 100 takes as input a complete stored video set, which is a set of videos (image signal + acoustic signal) to which additional information describing the video has been given in advance; an incomplete stored video set, which is a set of videos to which no additional information has been given; and a stored additional information set, which is the set of additional information given to the videos. It outputs a video/additional information relation model, which is a model describing the relationship between video and additional information. Concretely, the device is a computer comprising a CPU (Central Processing Unit), a RAM, and a ROM storing a program for executing the model learning processing routine described later, and it is functionally configured as follows.

The video additional information relationship learning device 100 includes an input unit 10, a calculation unit 20, and an output unit 30.

The input unit 10 accepts the input of the complete stored video set, which is a set of videos to which additional information describing the video has been given in advance; the incomplete stored video set, which is a set of videos to which no additional information has been given; and the stored additional information set, which is the set of additional information given to the videos.

The calculation unit 20 includes a stored additional information database 1, a complete stored video database 2, an incomplete stored video database 3, a stored additional information feature extraction unit 4, a stored image feature extraction unit 5, a stored acoustic feature extraction unit 6, a feature database 7, a stored latent variable extraction unit 8, and a video/additional information relationship learning unit 9.

The stored additional information database 1 stores the input stored additional information set. The complete stored video database 2 stores the input complete stored video set. The incomplete stored video database 3 stores the input incomplete stored video set.

The stored image feature extraction unit 5 takes the complete stored video set and the incomplete stored video set as input. From each complete stored video (an element of the complete stored video set) and each incomplete stored video (an element of the incomplete stored video set), it extracts a complete stored image feature and an incomplete stored image feature, respectively, each a vector expressing the characteristics of the image signal contained in that stored video. It outputs the complete stored image feature set, which is the set of complete stored image features, and the incomplete stored image feature set, which is the set of incomplete stored image features.

The method for extracting the complete and incomplete stored image features is not particularly limited. For example, features may be extracted by any of the following methods, or by any combination of them, from each complete stored image (a frame of the image signal that makes up a complete stored video) and each incomplete stored image (a frame of the image signal that makes up an incomplete stored video).

・ Color histogram
・ Low-frequency components of the discrete cosine transform of each small region in the image
・ Histogram of the low-frequency and/or high-frequency components of the Haar wavelet
・ Higher-order local autocorrelation features (see Reference 1: N. Otsu and T. Kurita, “A new scheme for practical flexible and intelligent vision systems,” Proc. IAPR Workshop on Computer Vision, pp. 431-435, 1988.)
・ SIFT (see Reference 2: D. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, Vol. 60, No. 2, pp. 91-110, 2004.) and its various improvements
・ Bag of Features (see Reference 3: G. Csurka, C. Bray, C. Dance and L. Fan, “Visual categorization with bags of keypoints,” in Proc. of ECCV Workshop on Statistical Learning in Computer Vision, pp. 59-74, 2004.)
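As an illustration of the first item in the list above, the following is a minimal sketch of a color histogram computed for one frame. It is not taken from the patent: the function name `color_histogram`, the representation of a frame as a list of RGB tuples, and the choice of 4 bins per channel are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255 (assumed frame format)


def color_histogram(pixels: Sequence[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Return a normalized joint RGB histogram as a flat feature vector."""
    if not pixels:
        raise ValueError("frame has no pixels")
    b = bins_per_channel
    hist = [0.0] * (b ** 3)
    for r, g, bl in pixels:
        # Quantize each channel into b bins; v * b // 256 is always in 0..b-1.
        ri, gi, bi = r * b // 256, g * b // 256, bl * b // 256
        # Flatten the 3-D bin index into a single vector position.
        hist[(ri * b + gi) * b + bi] += 1.0
    total = float(len(pixels))
    return [count / total for count in hist]


# Hypothetical 4-pixel frame: two red pixels, one blue, one green.
frame = [(255, 0, 0), (255, 0, 0), (0, 0, 255), (0, 255, 0)]
feature = color_histogram(frame)
```

Normalizing by the pixel count makes the vector comparable across frames of different sizes, which is one reasonable choice when the same feature is extracted from both complete and incomplete stored videos.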

It is also possible to adopt, as the stored image feature, a histogram built in the same manner as Bag of Features from the features extracted by the above methods from each stored image in the section corresponding to the stored video (see Reference 4: K. Kashino, T. Kurozumi and H. Murase, “A quick search method for audio and video signals based on histogram pruning,” IEEE Transactions on Multimedia, Vol. 5, No. 3, pp. 348-357, 2003.).
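The histogram-over-frames idea can be sketched as follows. This is a hedged illustration, not the procedure of Reference 4: it assumes that a codebook of representative vectors already exists (for example from a clustering step that is not shown), and the names `nearest_codeword` and `bag_of_features` are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_codeword(feature: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codeword closest to `feature` in squared Euclidean distance."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(feature, codebook[k]))


def bag_of_features(frame_features: Sequence[Vector],
                    codebook: Sequence[Vector]) -> List[float]:
    """Quantize every per-frame feature and return the normalized codeword histogram."""
    if not frame_features:
        raise ValueError("no frame features given")
    hist = [0.0] * len(codebook)
    for feature in frame_features:
        hist[nearest_codeword(feature, codebook)] += 1.0
    n = float(len(frame_features))
    return [h / n for h in hist]


# Hypothetical 2-D per-frame features for one video section and a 2-word codebook.
codebook = [[0.0, 0.0], [1.0, 1.0]]
frames = [[0.1, 0.0], [0.9, 1.0], [1.2, 0.8], [0.0, 0.2]]
video_feature = bag_of_features(frames, codebook)
```

The result is a single fixed-length vector per video section, regardless of how many frames the section contains.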

In this way, the stored image feature extraction unit 5 extracts the complete stored image features and the incomplete stored image features, and outputs their respective sets, the complete stored image feature set and the incomplete stored image feature set.

The stored acoustic feature extraction unit 6 takes the complete stored video set and the incomplete stored video set as input. From each complete stored video (an element of the complete stored video set) and each incomplete stored video (an element of the incomplete stored video set), it extracts a complete stored acoustic feature and an incomplete stored acoustic feature, respectively, each a vector expressing the characteristics of the acoustic signal contained in that stored video. It outputs the complete stored acoustic feature set, which is the set of complete stored acoustic features, and the incomplete stored acoustic feature set, which is the set of incomplete stored acoustic features.

The method for extracting the complete and incomplete stored acoustic features is not particularly limited. For example, analysis windows may be applied to the acoustic signal that makes up each stored video, and features may be extracted from each analysis window by any of the following methods, or by any combination of them.

・ Mel-frequency cepstral coefficients (see Reference 5: J. Foote, “Content-based retrieval of music and audio,” in Multimedia Storage and Archiving Systems II, Proc. of SPIE, volume 3229, pages 138-147, 1997.)
・ Delta cepstrum (see Reference 6: S. Furui, “Speaker independent isolated word recognition using dynamic features speech spectrum,” IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 34, No. 1, pp. 52-59, 1986.)
・ Band-pass filter bank (see Reference 7: Kashino, Smith, Murase, “A quick search method for audio signals based on histogram features: time-series active search,” IEICE Transactions, Vol. J82-D2, No. 9, pp. 1365-1373, 1998.)
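The windowing step and a delta-type feature from the list above can be sketched as follows. This is only an illustration under stated assumptions: the Hann window, the frame length and hop size, and the regression formula used for the delta (a common textbook form, not necessarily the exact formulation of Reference 6) are choices made for the example, and the per-frame cepstral vectors passed to `delta` are assumed to be computed elsewhere.

```python
import math
from typing import List, Sequence


def frame_signal(signal: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split a signal into overlapping frames and apply a Hann window to each."""
    if frame_len < 2 or hop < 1:
        raise ValueError("frame_len must be >= 2 and hop must be >= 1")
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    # Only full frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
    return frames


def delta(features: Sequence[Sequence[float]], width: int = 2) -> List[List[float]]:
    """Regression-based time derivative of per-frame features.

    Frames before the first or after the last are replaced by the edge frame.
    """
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    last = len(features) - 1
    out = []
    for t in range(len(features)):
        row = []
        for d in range(len(features[t])):
            acc = 0.0
            for n in range(1, width + 1):
                nxt = features[min(t + n, last)][d]
                prv = features[max(t - n, 0)][d]
                acc += n * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


# Hypothetical inputs: a constant signal and a 1-D feature that rises by 1 per frame.
frames = frame_signal([1.0] * 8, frame_len=4, hop=2)
deltas = delta([[0.0], [1.0], [2.0], [3.0], [4.0]], width=2)
```

For a feature that increases linearly over time, the delta at interior frames equals the slope, which is a quick sanity check for the regression weights.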

As in the embodiment described for the stored image feature extraction unit 5, the stored acoustic feature extraction unit 6 may also adopt, as the stored acoustic feature, a histogram built in the same manner as Bag of Features from the features extracted by the above methods.

In this way, the stored acoustic feature extraction unit 6 extracts the complete stored acoustic features and the incomplete stored acoustic features, and outputs their respective sets, the complete stored acoustic feature set and the incomplete stored acoustic feature set.

The stored additional information feature extraction unit 4 takes the stored additional information set as input, extracts from each item of stored additional information (an element of the stored additional information set) a stored additional information feature, which is a vector expressing the characteristics of the additional information, and outputs the stored additional information feature set, which is the set of these stored additional information features.

The method of extracting the stored additional information feature is not particularly limited. In the present embodiment, language labels are assumed as the additional information, and a binary vector expressing the presence or absence of each language label is used as the stored additional information feature. That is, the stored additional information feature is constructed as follows.

The stored additional information feature is a vector with as many dimensions as the total number of language labels under consideration, and each dimension of the vector corresponds to one language label. Hereinafter, for convenience, a language label is referred to by the index of its corresponding dimension in this vector. If the stored additional information contains the i-th language label, the i-th dimension of the stored additional information feature is set to "1"; otherwise it is set to "0". Alternatively, a multidimensional vector obtained by compressing the feature created in this way using principal component analysis may be used as the stored additional information feature.
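The binary label vector described above can be illustrated with a short sketch. The label vocabulary and the function name `label_feature` are hypothetical and chosen only for this example; the optional principal component analysis step is omitted.

```python
import numpy as np

def label_feature(labels, vocabulary):
    """Return a binary vector whose i-th entry is 1 if vocabulary[i] is among labels."""
    present = set(labels)
    return np.array([1 if word in present else 0 for word in vocabulary],
                    dtype=np.int64)

# Hypothetical vocabulary of four language labels.
vocabulary = ["dog", "car", "music", "speech"]
y = label_feature(["music", "dog"], vocabulary)
print(y)  # [1 0 1 0]
```

The order of the vocabulary fixes the correspondence between dimensions and labels, so the same vocabulary must be used for every stored video.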

In this way, the stored additional information feature extraction unit 4 extracts the stored additional information features and outputs the stored additional information feature set, which is the set of these features.

The feature database 7 stores the extracted complete stored image feature set, incomplete stored image feature set, complete stored acoustic feature set, incomplete stored acoustic feature set, and stored additional information feature set.

The stored latent variable extraction unit 8 receives from the feature database 7 the complete stored image feature set, the incomplete stored image feature set, the complete stored acoustic feature set, the incomplete stored acoustic feature set, and the stored additional information feature set. From these feature sets it extracts the stored latent variable set, which is a set of variables for describing the relationship between video and additional information, and outputs this stored latent variable set.

The stored latent variable set is computed so that, when the image, the sound, and the additional information are each given as a vector, the image vector, acoustic vector, and additional information vector belonging to the same video are described by the same latent variable once they are mapped by some method into the space in which the latent variable set lies. In this sense, the latent variables relate the image, the sound, and the additional information to one another.
In practice, it is difficult for the mapped values of the individual vectors to coincide exactly. An objective function is therefore formulated, for example a mapping that maximizes the correlation between each of the image, acoustic, and additional information vectors and the vector that combines them, or a mapping that minimizes the squared error between the latent variables to which the image, acoustic, and additional information vectors are mapped. The mapping can then be computed with an optimization technique such as the method of Lagrange multipliers or a gradient method.
In general, however, only a small number of image, sound, and additional information triples known to come from the same video are available, so the computed mapping is expected to be inaccurate. The present invention addresses this problem by using the incomplete stored features to estimate the density over the entire latent variable space accurately.
The method of extracting the stored latent variable set is not particularly limited. In the present embodiment, the following method, which is an improved form of canonical correlation analysis (a kind of multivariate analysis), is used.

As shown in FIG. 2, the stored latent variable extraction unit 8 includes a complete stored feature set statistic calculation unit 81, an incomplete stored feature set statistic calculation unit 82, an integrated statistic calculation unit 83, a feature compression function determination unit 84, and a feature compression unit 85.

The complete stored feature set statistic calculation unit 81 receives the complete stored feature set (the complete stored image feature set X_C0, the complete stored acoustic feature set X_C1, and the stored additional information feature set Y_C), which is the set of combinations in which a complete stored image feature, a complete stored acoustic feature, and the corresponding stored additional information feature form a tuple. It calculates the complete stored feature set statistics, which are statistics representing this set, and outputs them.

The method of calculating the complete stored feature set statistics is not particularly limited. In the present embodiment, the auto-covariance matrices and cross-covariance matrices of the complete stored feature set are calculated as the complete stored feature set statistics.

The symbols needed for a concrete description of the method are now explained. As shown in the following equations, the complete stored image feature set is written X_C0, the incomplete stored image feature set is written X_I0, and their union, the stored image feature set, is written X_0. Similarly, the complete stored acoustic feature set is written X_C1, the incomplete stored acoustic feature set is written X_I1, and their union, the stored acoustic feature set, is written X_1. The stored additional information set is written Y.

Here, N is the number of elements in the complete stored feature set, and N_x is the number of elements in each of the stored image feature set and the stored acoustic feature set (that is, the two sets have the same number of elements). The elements x_{0,i}, x_{1,i}, and y_j (i = 1, 2, ..., N_x; j = 1, 2, ..., N) are column vectors of d_x0, d_x1, and d_y dimensions, respectively. Stored features with the same subscript correspond to one another. In the following description, when the complete stored feature set and the incomplete stored feature set must be clearly distinguished, an alternative notation may be used, as shown in the following equation.

In the following, to simplify the explanation, the mean of each stored feature set is assumed to always be the zero vector. If this is not the case, the same situation can be obtained by computing the mean vector of each set in advance and subtracting it from each stored feature.

Then, as the complete stored feature set statistics S_C, the auto-covariance matrices S_Cx0x0, S_Cx1x1, and S_yy and the cross-covariance matrices S_Cx0x1, S_Cx0y, and S_Cx1y of the stored feature sets are obtained by the following equations (1) to (6).

Here, x^T denotes the transpose (of a vector or matrix).

In this way, the complete stored feature set statistic calculation unit 81 extracts and outputs the complete stored feature set statistics S_C = {S_Cx0x0, S_Cx1x1, S_yy, S_Cx0x1, S_Cx0y, S_Cx1y}.
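The calculation of the auto-covariance and cross-covariance matrices can be sketched as follows. Equations (1) to (6) are not reproduced in this text, so the normalization by the number of samples N used here is an assumption, as are the function names and the random toy data. The sketch also performs the mean subtraction described above, so the zero-mean assumption holds for any input.

```python
import numpy as np

def center(F):
    """Subtract the mean vector so that the feature set has zero mean."""
    return F - F.mean(axis=0, keepdims=True)

def complete_statistics(X0, X1, Y):
    """Covariance statistics of a complete stored feature set.

    X0 : (N, d_x0) image features, X1 : (N, d_x1) acoustic features,
    Y  : (N, d_y) additional information features.
    Rows with the same index belong to the same video.
    """
    X0, X1, Y = center(X0), center(X1), center(Y)
    n = X0.shape[0]

    def cov(A, B):
        return A.T @ B / n          # assumed 1/N normalization

    return {
        "x0x0": cov(X0, X0), "x1x1": cov(X1, X1), "yy": cov(Y, Y),
        "x0x1": cov(X0, X1), "x0y": cov(X0, Y), "x1y": cov(X1, Y),
    }

# Toy data with hypothetical dimensions d_x0 = 4, d_x1 = 3, d_y = 2.
rng = np.random.default_rng(0)
X0 = rng.normal(size=(50, 4))
X1 = rng.normal(size=(50, 3))
Y = rng.normal(size=(50, 2))
S_C = complete_statistics(X0, X1, Y)
```

Each auto-covariance matrix is square and symmetric, while a cross-covariance matrix such as `S_C["x0y"]` has one row per image dimension and one column per additional information dimension.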

Next, the incomplete stored feature set statistic calculation unit 82 receives the incomplete stored feature set (the incomplete stored image feature set X_I0 and the incomplete stored acoustic feature set X_I1), which is the set of combinations in which an incomplete stored image feature and an incomplete stored acoustic feature form a pair. It calculates the incomplete stored feature set statistics, which are statistics representing this set, and outputs them. The method of calculating the incomplete stored feature set statistics is not particularly limited. Here, the auto-covariance matrices S_Ix0x0 and S_Ix1x1 and the cross-covariance matrix S_Ix0x1 of the incomplete stored feature set are calculated as the incomplete stored feature set statistics according to the following equations (7) to (9).

As described above, the incomplete stored feature set statistic calculation unit 82 outputs the incomplete stored feature set statistics as S_I = {S_Ix0x0, S_Ix1x1, S_Ix0x1}.

Next, the integrated statistic calculation unit 83 receives the complete stored feature set statistics and the incomplete stored feature set statistics, calculates from them a new statistic, the integrated statistic, and outputs it. The method of calculating the integrated statistic is not particularly limited. Here, two kinds of integrated statistics are calculated from the auto-covariance matrices and the cross-covariance matrices.

The first integrated statistic is calculated by the following equation (10). A character that carries an underbar (_) beneath it in the equations is written in the text with _ placed before the character. The first integrated statistic is therefore written _C.

Here, β is a predetermined constant satisfying 0 ≤ β ≤ 1, I_d is the d × d identity matrix, and 0 is a zero matrix. Since S_Cx0x0 is a d_x0 × d_x0 square matrix, S_Cx1x1 is a d_x1 × d_x1 square matrix, and S_yy is a d_y × d_y square matrix, the first integrated statistic _C is a (d_x0 + d_x1 + d_y) × (d_x0 + d_x1 + d_y) square matrix.
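The dimension argument above can be checked with a small sketch. Equation (10) is not reproduced in this text, so the form used below, a block-diagonal arrangement of the three auto-covariance matrices blended with an identity matrix at ratio β, is purely an assumption chosen to use the symbols named in the text (β, the identity matrix, and zero blocks). It should not be read as the equation of the specification. The function names are hypothetical.

```python
import numpy as np

def block_diag(*blocks):
    """Place square blocks on the diagonal and fill the rest with zeros."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    start = 0
    for b in blocks:
        k = b.shape[0]
        out[start:start + k, start:start + k] = b
        start += k
    return out

def regularized_block_statistic(S_x0x0, S_x1x1, S_yy, beta):
    """Assumed illustrative form: beta * blockdiag(S) + (1 - beta) * I."""
    D = block_diag(S_x0x0, S_x1x1, S_yy)
    return beta * D + (1.0 - beta) * np.eye(D.shape[0])

# Hypothetical dimensions d_x0 = 4, d_x1 = 3, d_y = 2.
S_x0x0 = 2.0 * np.eye(4)
S_x1x1 = 3.0 * np.eye(3)
S_yy = 5.0 * np.eye(2)
C_low = regularized_block_statistic(S_x0x0, S_x1x1, S_yy, beta=0.5)
```

Whatever the exact weighting, stacking blocks of sizes d_x0, d_x1, and d_y along the diagonal yields a square matrix of size d_x0 + d_x1 + d_y, which is 9 in this toy case.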

The second integrated statistic, on the other hand, is calculated by the following equation (11). A character that carries an overbar (￣) above it in the equations is written in the text with ￣ placed before the character. The second integrated statistic is therefore written ￣C.

Like the first integrated statistic, the second integrated statistic ￣C is a (d_x0 + d_x1 + d_y) × (d_x0 + d_x1 + d_y) square matrix.

As described above, the integrated statistic calculation unit 83 combines the first integrated statistic _C and the second integrated statistic ￣C into the integrated statistic C = {_C, ￣C} and outputs the integrated statistic C.

Next, the feature compression function determination unit 84 receives the integrated statistic C, determines the feature compression function, which is a function for compressing the image features, acoustic features, and additional information features, and outputs it. The method of determining the feature compression function is not particularly limited. Here, it is derived by solving a generalized eigenvalue problem that uses the first integrated statistic and the second integrated statistic.

First, consider the generalized eigenvalue problem expressed by the following equation (12).

Here, w is a (d_x0 + d_x1 + d_y)-dimensional vector. The feature compression function can be determined by solving the generalized eigenvalue problem of equation (12) and obtaining either a predetermined number of eigenvalue and eigenvector pairs, or the largest number of eigenvalue and eigenvector pairs for which the sum of the eigenvalues exceeds a predetermined threshold.

Specifically, this is done as follows. Each eigenvector w_i can be decomposed into a leading (d_x0 + d_x1)-dimensional vector w_{x,i} and a trailing d_y-dimensional vector w_{y,i}. Using the decomposed eigenvectors w_{x,i} and w_{y,i} and the corresponding eigenvalues λ_i, the (d_x0 + d_x1) × ^d transformation matrix T_x and the d_y × ^d transformation matrix T_y that characterize the feature compression function are obtained as in the following equations (13) and (14), where ^d denotes d with a hat.

Here, ^d is the number of eigenvalues and eigenvectors extracted and satisfies ^d ≤ min(d_x, d_y). Λ is a ^d × ^d diagonal matrix whose diagonal entries are the square roots of the eigenvalues λ_i.
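Solving a generalized eigenvalue problem and splitting the eigenvectors can be sketched as follows. Because equations (12) to (14) are not reproduced in this text, the sketch uses generic matrices: `A` is any symmetric matrix and `B` any symmetric positive definite matrix, and which integrated statistic plays which role is left open. The scaling by Λ in equations (13) and (14) is likewise not shown. The function name `generalized_eigh`, the toy matrices, and the dimensions (d_x0 + d_x1 = 4, d_y = 2) are hypothetical.

```python
import numpy as np

def generalized_eigh(A, B, k):
    """Solve A w = lambda B w and return the k largest eigenpairs.

    A : symmetric matrix, B : symmetric positive definite matrix.
    The problem is reduced to an ordinary symmetric eigenproblem
    through the Cholesky factorization B = L L^T.
    """
    L = np.linalg.cholesky(B)
    M = np.linalg.solve(L, np.linalg.solve(L, A).T)   # L^{-1} A L^{-T}
    M = (M + M.T) / 2.0                               # remove round-off asymmetry
    vals, vecs = np.linalg.eigh(M)                    # ascending eigenvalues
    order = np.argsort(vals)[::-1][:k]                # indices of the k largest
    W = np.linalg.solve(L.T, vecs[:, order])          # w = L^{-T} v
    return vals[order], W

rng = np.random.default_rng(0)
R = rng.normal(size=(6, 6))
A = R + R.T                                # symmetric toy matrix
Q = rng.normal(size=(6, 6))
B = Q @ Q.T + 6.0 * np.eye(6)              # symmetric positive definite toy matrix
vals, W = generalized_eigh(A, B, k=3)

# Split each eigenvector into its leading part (image and acoustic
# dimensions) and its trailing part (additional information dimensions).
d_x = 4
W_x, W_y = W[:d_x], W[d_x:]
```

The columns of `W_x` and `W_y` correspond to the vectors w_{x,i} and w_{y,i} of the text, ordered by decreasing eigenvalue.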

In this way, the feature compression function determination unit 84 calculates the transformation matrices T_x and T_y that characterize the feature compression function and outputs these transformation matrices.

Next, the feature compression unit 85 receives the stored image feature set X_0 (X_C0, X_I0), the stored acoustic feature set X_1 (X_C1, X_I1), the stored additional information feature set Y, and the feature compression function, compresses each feature with the feature compression function, and outputs the sets of compressed features: the stored image compressed feature set, the stored acoustic compressed feature set, and the stored additional information compressed feature set. The stored image compressed feature set ^X_0, the stored acoustic compressed feature set ^X_1, and the stored additional information compressed feature set ^Y are obtained by compressing each feature with the transformation matrices that characterize the feature compression function, as shown in the following equations (15) and (16).

In this way, the feature compression unit 85 derives the stored image compressed feature set ^X_0, the stored acoustic compressed feature set ^X_1, and the stored additional information compressed feature set ^Y, and outputs these compressed feature sets.

Finally, a set of multidimensional vectors obtained by combining ^X_0, ^X_1, and ^Y according to the following equations (17) and (18) is calculated and used as the stored latent variable set Z = {z_1, z_2, ..., z_Nx}.

Here, each a_i is a constant given in advance. In this way, the stored latent variable extraction unit 8 extracts the stored latent variable set Z and outputs it.

The video/additional information relationship learning unit 9 receives the complete stored acoustic feature set, the incomplete stored acoustic feature set, the complete stored image feature set, the incomplete stored image feature set, the stored additional information feature set, and the stored latent variable set. From these sets it learns the video/additional information relationship model, which is a model describing the relationship between video and additional information, and outputs this model. The method of learning the video/additional information relationship model is not particularly limited. Here, a method using a latent variable model learning unit 91, a video/latent variable relationship model learning unit 92, and an additional information/latent variable relationship model learning unit 93 is described.

The latent variable model learning unit 91 receives the stored latent variable set, learns the latent variable model, which is a model describing the structure of the stored latent variables, and outputs it. The method of learning the latent variable model is not particularly limited. Here, the occurrence probability p(z) of the latent variable z derived by the following equation (19) is adopted as the latent variable model.

Here, δ_{a,b} is the Kronecker delta.

In this way, the latent variable model learning unit 91 extracts and outputs the latent variable model p(z).

Next, the video/latent variable relationship model learning unit 92 receives the stored acoustic feature set, the stored image feature set, and the stored latent variable set, uses these sets to learn the video/latent variable relationship model, which is a model describing the relationship between video and latent variables, and outputs it. The method of learning the video/latent variable relationship model is not particularly limited. Here, the conditional occurrence probability p(x_0, x_1 | z) of the image feature x_0 and the acoustic feature x_1 given the latent variable z, obtained as in the following equation (20), is adopted as the video/latent variable relationship model. A character that carries a tilde (~) above it in the equations is written in the text with ~ placed before the character.

Here, ~z is the stored latent variable obtained by transforming the stored image feature x_0 and the stored acoustic feature x_1 with the feature compression unit 85, γ is a predetermined constant, and I_^d is the ^d × ^d identity matrix. Further, g(~z; z_n, γI_^d) denotes the multidimensional normal distribution of ~z with mean vector z_n and covariance matrix γI_^d.
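The normal distribution g(~z; z_n, γI_^d) can be evaluated as in the following sketch. Only the density itself is shown, since equation (20) is not reproduced in this text and the way the density enters that equation is therefore left open. The function name and the toy values are hypothetical. The logarithm of the density is returned because it is numerically more stable in high dimensions.

```python
import numpy as np

def isotropic_gaussian_logpdf(z_tilde, z_n, gamma):
    """Log density of a normal distribution with mean z_n and covariance gamma * I."""
    d = z_n.shape[0]
    diff = z_tilde - z_n
    return -0.5 * (d * np.log(2.0 * np.pi * gamma) + diff @ diff / gamma)

# Toy latent variables of dimension 3 (illustrative values only).
z_n = np.array([0.0, 1.0, -1.0])
z_tilde = np.array([0.5, 1.0, -1.5])
log_g = isotropic_gaussian_logpdf(z_tilde, z_n, gamma=0.25)
```

Because the covariance matrix is a multiple of the identity, the density depends on ~z only through its Euclidean distance to z_n, and a smaller γ concentrates the density more tightly around each stored latent variable.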

In this way, the video/latent variable relationship model learning unit 92 extracts the video/latent variable relationship model and outputs it.

The additional information/latent variable relationship model learning unit 93 receives the stored additional information feature set and the stored latent variable set, uses these sets to learn the additional information/latent variable relationship model, which is a model describing the relationship between additional information and latent variables, and outputs it.

The method of learning the additional information/latent variable relationship model is not particularly limited. Here, the conditional occurrence probability p(y | z) of the additional information feature y given the latent variable z, obtained as in the following equations (21) to (24), is adopted as the additional information/latent variable relationship model.

Here, μ is a constant satisfying 0 ≤ μ ≤ 1, and y_{n,i} is the i-th element of the stored additional information feature y_n. In other words, the above relations first assume that the language labels occur independently of one another (equation (21) above), and then generate the occurrence probability of each language label by mixing, at mixing ratio μ, the empirical distribution of the language label in each sample n (corresponding to δ_{y_i, y_{n,i}} in equation (22) above) with the empirical distribution of the language label over all samples (corresponding to M_i / M in equation (22) above).
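The mixture described above can be sketched as follows for binary label vectors. Equations (21) to (24) are not reproduced in this text, so two points are assumptions: M_i / M is taken to be the fraction of stored samples that carry label i, and the global term for an absent label is taken to be one minus that fraction. The function name `label_probability` and the toy label matrix are hypothetical.

```python
import numpy as np

def label_probability(y, y_n, global_rate, mu):
    """p(y | z_n) for binary label vectors, assuming independent labels.

    y, y_n      : (L,) arrays of 0/1 (candidate labels, labels of stored sample n)
    global_rate : (L,) fraction of stored samples carrying each label
                  (assumed meaning of M_i / M)
    mu          : mixing ratio between the two empirical distributions
    """
    y = np.asarray(y)
    y_n = np.asarray(y_n)
    sample_term = (y == y_n).astype(float)                    # Kronecker delta
    global_term = np.where(y == 1, global_rate, 1.0 - global_rate)
    per_label = mu * sample_term + (1.0 - mu) * global_term   # mixture per label
    return float(per_label.prod())                            # independence across labels

# Four stored samples with three language labels (illustrative values only).
Y = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])
global_rate = Y.mean(axis=0)
p = label_probability([1, 0, 1], Y[0], global_rate, mu=0.8)
```

With μ close to 1 the model trusts the labels of the individual stored sample, while smaller values of μ smooth the probabilities toward the overall label frequencies, which avoids assigning zero probability to labels that sample n happens not to carry.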

In this way, the additional information/latent variable relationship model learning unit 93 extracts the additional information/latent variable relationship model and outputs it.

As described above, the video/additional information relationship learning unit 9 combines the latent variable model, the video/latent variable relationship model, and the additional information/latent variable relationship model into the video/additional information relationship model and outputs this model.

<Operation of the video additional information relationship learning device>
Next, the operation of the video additional information relationship learning device 100 according to the present embodiment is described. First, a complete stored video set to which additional information has been given, a stored additional information set that is the set of that additional information, and an incomplete stored video set to which no additional information has been given are input to the video additional information relationship learning device 100. The device stores the input stored additional information set in the stored additional information database 1, the input complete stored video set in the complete stored video database 2, and the input incomplete stored video set in the incomplete stored video database 3. The video additional information relationship learning device 100 then executes the model learning processing routine shown in FIG. 3.

First, in step S101, the complete stored image features and the incomplete stored image features are extracted from each stored video of the complete stored video set and the incomplete stored video set and stored in the feature database 7. Then, in step S102, the complete stored acoustic features and the incomplete stored acoustic features are extracted from each stored video of the complete stored video set and the incomplete stored video set and stored in the feature database 7. In step S103, the stored additional information features are extracted from each item of additional information in the stored additional information set and stored in the feature database 7.

In the next step S104, for the complete stored feature set, which is the set of combinations of the complete stored image features, complete stored acoustic features, and stored additional information features extracted in steps S101 to S103, the auto-covariance matrices S_Cx0x0, S_Cx1x1, and S_yy and the cross-covariance matrices S_Cx0x1, S_Cx0y, and S_Cx1y are calculated according to equations (1) to (6) above.

Then, in step S105, for the incomplete stored feature set, which is the set of combinations of the incomplete stored image features and incomplete stored acoustic features extracted in steps S101 and S102, the auto-covariance matrices S_Ix0x0 and S_Ix1x1 and the cross-covariance matrix S_Ix0x1 are calculated according to equations (7) to (9) above.

In step S106, the first integrated statistic _C and the second integrated statistic ￣C are calculated according to equations (10) and (11) above, based on the auto-covariance matrices S_Cx0x0, S_Cx1x1, and S_yy and the cross-covariance matrices S_Cx0x1, S_Cx0y, and S_Cx1y of the complete stored feature set calculated in step S104, and on the auto-covariance matrices S_Ix0x0 and S_Ix1x1 and the cross-covariance matrix S_Ix0x1 of the incomplete stored feature set calculated in step S105.

Then, in step S107, the generalized eigenvalue problem expressed by equation (12) above is solved using the first integrated statistic _C and the second integrated statistic ￣C calculated in step S106, and a predetermined number of eigenvalue and eigenvector pairs are obtained. Using the obtained pairs, the transformation matrices T_x and T_y that characterize the feature compression function are calculated according to equations (13) and (14) above.

In the next step S108, the stored image compressed feature set ^X_0, the stored acoustic compressed feature set ^X_1, and the stored additional information compressed feature set ^Y are calculated according to equations (15) and (16) above, using the feature compression function determined by the transformation matrices T_x and T_y calculated in step S107. In step S109, the stored image compressed feature set ^X_0, the stored acoustic compressed feature set ^X_1, and the stored additional information compressed feature set ^Y calculated in step S108 are combined according to equations (17) and (18) above to calculate the stored latent variable set.

Then, in step S110, the latent variable model p(z) is learned according to equation (19) above, using the stored latent variable set calculated in step S109. In the next step S111, the video/latent variable relationship model p(x_0, x_1 | z) is learned according to equation (20) above, using the stored acoustic feature set, the stored image feature set, and the stored latent variable set.

Then, in step S112, the additional information/latent variable relationship model p(y | z) is learned according to equations (21) to (24) above, using the stored additional information feature set and the stored latent variable set. In step S113, the latent variable model p(z) learned in step S110, the video/latent variable relationship model p(x_0, x_1 | z) learned in step S111, and the additional information/latent variable relationship model p(y | z) learned in step S112 are output by the output unit 30 as the video/additional information relationship model, and the model learning processing routine ends.

As described above, the video additional information relationship learning device according to the first embodiment learns a model describing the relationship between video and additional information from the completely stored image features and completely stored acoustic features extracted from each completely stored video, the incompletely stored image features and incompletely stored acoustic features extracted from each incompletely stored video, and the stored latent variable set. By using both the image information and the acoustic information contained in the video, and by taking their mutual co-occurrence into account, the relationship between this information and linguistic information can be learned with higher accuracy.

Further, by calculating the integrated statistics with equations (10) and (11) above and the eigenvectors with equation (12) above, the correlation (co-occurrence relationship) among the three kinds of information contained in a video, namely image information, acoustic information, and additional information, can be learned simply. As a result, using the acoustic information itself and the combination of acoustic and image information as cues, the relationship between the video signal and the additional information, which cannot be obtained from image information alone, can be learned from the stored information, and that relationship can be used to improve the accuracy of video search and video recognition.

Further, because the stored latent variable extraction unit extracts the latent variables for describing the relationship between video and additional information from both videos with additional information and videos without additional information, the relationship between video and additional information can be learned accurately even when only a small number of videos with additional information are available.

Collecting a large number of videos with additional information is difficult, whereas videos themselves are very easy to collect in large quantities when additional information is not required. By also using such videos without additional information when learning the relationship between video and additional information, that relationship can be learned with higher accuracy than when only the few videos with additional information are used.

[Second Embodiment]
<System Configuration>
Next, a second embodiment of the present invention will be described. Parts having the same configuration as in the first embodiment are given the same reference numerals, and their description is omitted.

The second embodiment differs from the first embodiment in that the video/additional information relationship model is used to search for videos that are highly relevant to input additional information. The description takes as an example a case where the present invention is applied to a semi-supervised video search device that searches for a set of videos related to the input additional information.

As shown in FIG. 4, the semi-supervised video search device 200 according to the second embodiment receives as input a completely stored video set, an incompletely stored video set, a stored additional information set, and input additional information, which is additional information given separately from the stored additional information set, and outputs an additional information related video set, which is a set of videos related to the input additional information. The semi-supervised video search device 200 includes an input unit 10, a calculation unit 220, and an output unit 30.

The input unit 10 receives the completely stored video set, the incompletely stored video set, and the stored additional information set, and also receives the input additional information as a query for searching for videos.

The calculation unit 220 includes a stored additional information database 1, a completely stored video database 2, an incompletely stored video database 3, a stored additional information feature extraction unit 4, a stored image feature extraction unit 5, a stored acoustic feature extraction unit 6, a feature database 7, a stored latent variable extraction unit 8, a video/additional information relationship learning unit 9, an input additional information feature extraction unit 11, and a video search unit 12.

The input additional information feature extraction unit 11 receives the input additional information, extracts an input additional information feature, which is a vector expressing the characteristics of the input additional information, and outputs this feature. The extraction method is the same as that of the stored additional information feature extraction unit 4.

The video search unit 12 receives the input additional information feature, the completely stored video set, the incompletely stored video set, and the video/additional information relationship model. By giving the input additional information feature to the video/additional information relationship model, it selects, from the completely stored video set and the incompletely stored video set, additional information related videos, which are videos highly relevant to the input additional information, and outputs the set of these videos as the additional information related video set.

The method for selecting the additional information related videos is not particularly limited; the following method is described here.

First, the posterior probability of the image feature x_0 and the acoustic feature x_1 given the input additional information feature y_{given} is defined by equation (25) below.

This posterior probability p(x_0, x_1 | y_{given}) is calculated for each combination of a stored image feature and a stored acoustic feature drawn from the stored image feature set and the stored acoustic feature set. Either a fixed number of stored image feature and stored acoustic feature pairs with the largest posterior probabilities, or the pairs whose posterior probability exceeds a threshold, are selected, and the set of videos corresponding to the selected pairs is taken as the additional information related video set.
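The selection rule described above can be sketched as follows. This is a minimal illustration, not the device's implementation: the function name `select_related_videos` and its parameters are hypothetical, and the posterior values are assumed to have been computed beforehand with equation (25), which is not reproduced here.

```python
def select_related_videos(posteriors, top_k=None, threshold=None):
    """Return indices of stored videos chosen by posterior score.

    posteriors: one p(x_0, x_1 | y_given) value per stored
                image/acoustic feature pair (assumed precomputed).
    top_k:      keep this many highest-scoring pairs, or
    threshold:  keep every pair whose score exceeds this value.
    Both parameter names are illustrative, not from the specification.
    """
    if (top_k is None) == (threshold is None):
        raise ValueError("specify exactly one of top_k or threshold")
    scores = [float(s) for s in posteriors]
    # Stable sort, so ties keep their original storage order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    if top_k is not None:
        return order[:top_k]
    return [i for i in order if scores[i] > threshold]
```

For example, `select_related_videos(scores, top_k=10)` would correspond to the fixed-number variant, and `select_related_videos(scores, threshold=0.5)` to the threshold variant; the value 0.5 is arbitrary.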

In this way, the video search unit 12 selects the additional information related video set, which is then output by the output unit 30.

<Operation of the Semi-Supervised Video Search Device>
First, when a completely stored video set, a stored additional information set, and an incompletely stored video set are input to the semi-supervised video search device 200, the device stores the input stored additional information set in the stored additional information database 1, the input completely stored video set in the completely stored video database 2, and the input incompletely stored video set in the incompletely stored video database 3. The semi-supervised video search device 200 then executes the model learning processing routine shown in FIG. 3, as in the first embodiment.

When input additional information is input to the semi-supervised video search device 200 as a query for searching for videos, the device executes the video search processing routine shown in FIG. 5.

In step S201, the input additional information is received. In step S202, the input additional information feature is extracted from the input additional information.

In step S203, the posterior probability p(x_0, x_1 | y_{given}) is calculated according to equation (25) above for each combination of a stored image feature and a stored acoustic feature obtained from the stored image feature set and the stored acoustic feature set. In step S204, the combinations of stored image feature and stored acoustic feature for which the posterior probability calculated in step S203 is equal to or greater than a threshold are extracted, and the videos corresponding to the extracted combinations are selected from the completely stored video set and the incompletely stored video set as additional information related videos.

In step S205, the additional information related video set selected in step S204 is output by the output unit 30, and the video search processing routine ends.

As described above, the semi-supervised video search device according to the second embodiment uses the learned model describing the relationship between video and additional information, and can therefore search with high accuracy for videos that are highly relevant to the input additional information.

[Third Embodiment]
<System Configuration>
Next, a third embodiment of the present invention will be described. Parts having the same configuration as in the first embodiment are given the same reference numerals, and their description is omitted.

The third embodiment differs from the first embodiment in that the video/additional information relationship model is used to output additional information that is highly relevant to an input video. The description takes as an example a case where the present invention is applied to a semi-supervised video recognition device that outputs a set of additional information describing the input video.

As shown in FIG. 6, the semi-supervised video recognition device 300 according to the third embodiment receives as input a completely stored video set, an incompletely stored video set, a stored additional information set, and an input video, which is a video given separately from the completely stored video set and the incompletely stored video set, and outputs a video related additional information set, which is a set of additional information corresponding to the input video. The semi-supervised video recognition device 300 includes an input unit 10, a calculation unit 320, and an output unit 30.

The input unit 10 receives the completely stored video set, the incompletely stored video set, and the stored additional information set, and also receives the input video to be recognized.

The calculation unit 320 includes a stored additional information database 1, a completely stored video database 2, an incompletely stored video database 3, a stored additional information feature extraction unit 4, a stored image feature extraction unit 5, a stored acoustic feature extraction unit 6, a feature database 7, a stored latent variable extraction unit 8, a video/additional information relationship learning unit 9, an input image feature extraction unit 13, an input acoustic feature extraction unit 14, and a video recognition unit 15.

The input image feature extraction unit 13 receives the input video, extracts an input image feature, which is a vector expressing the characteristics of the image signal contained in the input video, and outputs this feature. The extraction method is the same as that of the stored image feature extraction unit 5, and its description is omitted.

The input acoustic feature extraction unit 14 receives the input video, extracts an input acoustic feature, which is a vector expressing the characteristics of the acoustic signal contained in the input video, and outputs this feature. The extraction method is the same as that of the stored acoustic feature extraction unit 6, and its description is omitted.

The video recognition unit 15 receives the input image feature, the input acoustic feature, and the video/additional information relationship model. By giving the input image feature and the input acoustic feature to the video/additional information relationship model, it extracts video related additional information, which is additional information highly relevant to the input video, and outputs it. The method for selecting the video related additional information is not particularly limited; the following method is described here.

First, the posterior probability p(y | x_{0,given}, x_{1,given}) of the additional information feature y given the input image feature x_{0,given} and the input acoustic feature x_{1,given} is defined by equation (26) below.

Here, from the formulation of the video/latent variable relationship model p(x_0, x_1 | z) and the additional information/latent variable relationship model p(y | z), the additional information feature ỹ that maximizes this posterior probability can be calculated by equation (27) below.

Here, z_{given} is the latent variable calculated from the input image feature x_{0,given} and the input acoustic feature x_{1,given} using the processing described for the stored latent variable extraction unit 8. Further, p(y_d = 1 | z_n) is the conditional probability that the additional information feature y_d takes the value 1 given the latent variable z_n.

Note that the additional information feature ỹ that maximizes the posterior probability is generally not a binary vector. Among the elements of ỹ, either a fixed number of elements with the largest values or the elements whose values exceed a threshold are selected, and the linguistic labels corresponding to the selected elements are collected as the video related additional information.
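The mapping from the real-valued vector ỹ to linguistic labels can be sketched as follows. This is an illustrative sketch only: the function name `select_labels`, its parameters, and the example label vocabulary are hypothetical, and ỹ is assumed to have been computed beforehand with equation (27), which is not reproduced here.

```python
def select_labels(y_tilde, labels, top_k=None, threshold=None):
    """Pick linguistic labels for the largest elements of y_tilde.

    y_tilde:   real-valued additional information feature (not binary).
    labels:    labels[d] is the linguistic label for element d
               (a hypothetical vocabulary supplied by the caller).
    top_k:     keep the labels of this many largest elements, or
    threshold: keep labels whose element value exceeds this value.
    """
    if len(y_tilde) != len(labels):
        raise ValueError("y_tilde and labels must have the same length")
    if (top_k is None) == (threshold is None):
        raise ValueError("specify exactly one of top_k or threshold")
    values = [float(v) for v in y_tilde]
    # Stable descending sort by element value.
    order = sorted(range(len(values)), key=lambda d: -values[d])
    if top_k is not None:
        chosen = order[:top_k]
    else:
        chosen = [d for d in order if values[d] > threshold]
    return [labels[d] for d in chosen]
```

The labels are returned in descending order of their element values, which gives a natural ranking of the recognition result.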

In this way, the video recognition unit 15 selects the video related additional information, which is then output by the output unit 30.

<Operation of the Semi-Supervised Video Recognition Device>
First, when a completely stored video set, a stored additional information set, and an incompletely stored video set are input to the semi-supervised video recognition device 300, the device stores the input stored additional information set in the stored additional information database 1, the input completely stored video set in the completely stored video database 2, and the input incompletely stored video set in the incompletely stored video database 3. The semi-supervised video recognition device 300 then executes the model learning processing routine, as in the first embodiment.

When an input video to be recognized is input to the semi-supervised video recognition device 300, the device executes the video recognition processing routine shown in FIG. 7.

In step S301, the input video is received. In step S302, the input image feature is extracted from the input video, and in step S303, the input acoustic feature is extracted from the input video.

In step S304, using the input image feature and the input acoustic feature extracted in steps S302 and S303, the additional information feature ỹ that maximizes the posterior probability p(y | x_{0,given}, x_{1,given}) is calculated according to equation (27) above.

In the next step S305, the elements whose values exceed a threshold are selected from the additional information feature ỹ calculated in step S304. In step S306, the linguistic labels corresponding to the elements selected in step S305 are collected and output by the output unit 30 as the video related additional information, and the video recognition processing routine ends.

As described above, the semi-supervised video recognition device according to the third embodiment uses the learned model describing the relationship between video and additional information, and can therefore accurately obtain additional information that is highly relevant to the input video as the video recognition result.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, although this specification describes embodiments in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium.

1 Stored additional information database
2 Completely stored video database
3 Incompletely stored video database
4 Stored additional information feature extraction unit
5 Stored image feature extraction unit
6 Stored acoustic feature extraction unit
7 Feature database
8 Stored latent variable extraction unit
9 Video/additional information relationship learning unit
11 Input additional information feature extraction unit
12 Video search unit
13 Input image feature extraction unit
14 Input acoustic feature extraction unit
15 Video recognition unit
15 Image recognition unit
44 Feature compression function determination unit
81 Completely stored feature set statistic calculation unit
82 Incompletely stored feature set statistic calculation unit
83 Integrated statistic calculation unit
84 Feature compression function determination unit
85 Feature compression unit
91 Latent variable model learning unit
92 Video/latent variable relationship model learning unit
93 Additional information/latent variable relationship model learning unit
100 Video additional information relationship learning device
200 Semi-supervised video search device
300 Semi-supervised video recognition device

Claims

A video additional information relationship learning device that learns a relationship between a video, which is a moving image with sound, and additional information, which is information describing the video, the device comprising:
stored image feature extracting means for extracting a completely stored image feature and an incompletely stored image feature, which are vectors expressing the characteristics of an image, from each completely stored video that is an element of a completely stored video set, which is a set of videos to which additional information is given in advance, and from each incompletely stored video that is an element of an incompletely stored video set, which is a set of videos to which no additional information is given;
accumulated acoustic feature extraction means for extracting a completely stored acoustic feature and an incompletely stored acoustic feature, which are vectors expressing acoustic characteristics, from each completely stored video that is an element of the completely stored video set and from each incompletely stored video that is an element of the incompletely stored video set;
stored additional information feature extracting means for extracting a stored additional information feature, which is a vector expressing the characteristics of the additional information, from each piece of stored additional information that is an element of a stored additional information set, which is a set of the additional information;
accumulated latent variable extracting means for extracting a stored latent variable set, which is a set of variables for describing the relationship between video and additional information, from a completely stored image feature set that is a set of the completely stored image features, an incompletely stored image feature set that is a set of the incompletely stored image features, a completely stored acoustic feature set that is a set of the completely stored acoustic features, an incompletely stored acoustic feature set that is a set of the incompletely stored acoustic features, and a stored additional information feature set that is a set of the stored additional information features; and
video/additional information relationship learning means for learning a video/additional information relationship model, which is a model describing the relationship between video and additional information, from the completely stored image feature set, the incompletely stored image feature set, the completely stored acoustic feature set, the incompletely stored acoustic feature set, the stored additional information feature set, and the stored latent variable set.

The accumulated latent variable extracting means includes
A complete accumulation feature set statistic that is a statistic representing a statistical property of a complete accumulation feature set that is a set of the combination of the perfect accumulation image feature, the perfect accumulation acoustic feature, and the corresponding accumulation additional information feature is calculated. A completely accumulated feature set statistic calculation means;
An incomplete accumulation feature set for calculating an incomplete accumulation feature set statistic that is a statistic expressing a statistical property of the incomplete accumulation feature set that is a set of a combination of the incomplete accumulation image feature and the incomplete accumulation acoustic feature A statistic calculation means;
An integrated statistic calculation means for calculating an integrated statistic by combining the complete accumulation feature set statistic and the incomplete accumulation feature set statistic;
Feature compression function determination means for determining a feature compression function that is a function for compressing the image feature, the acoustic feature, and the additional information feature using the integrated statistic;
Using the feature compression function, a stored image compression feature set obtained by compressing the stored image feature set, a stored acoustic compression feature set obtained by compressing the stored acoustic feature set, and a stored additional information compression obtained by compressing the stored additional information feature set. A feature compression unit that calculates a feature set, combines the stored image compression feature set, the stored acoustic compression feature set, and a stored additional information compression feature set to calculate the stored latent variable set;
The video additional information relationship learning device according to claim 1, comprising:

An input additional information feature extracting means for extracting an input additional information feature which is a vector expressing the characteristic of the additional information from the input additional information;
Video search means for providing the input additional information feature to the video / additional information relation model and searching for a video having high relevance to the input additional information from the complete stored video set and the incompletely stored video set; The video additional information relationship learning device according to claim 1, further comprising:

An input image feature extracting means for extracting an input image feature which is a vector expressing the characteristics of the image from the input video;
Input acoustic feature extraction means for extracting an input acoustic feature, which is a vector expressing acoustic characteristics, from the input video;
And a video recognition unit that gives the input image feature and the input acoustic feature to the video / additional information relation model and selects additional information highly relevant to the input video from the stored additional information set. Item 3. The video additional information relationship learning device according to Item 1 or 2.

A video additional information relationship learning method used in a video additional information relationship learning device that learns a relationship between a video that is a moving image with sound and additional information that is information that describes the video,
By the stored image feature extraction means, a complete stored video that is an element of a complete stored video set that is a set of videos to which additional information is given in advance, and an incompletely stored video set that is a set of videos to which no additional information is given. Extracting from each of the incompletely stored images that are the elements a fully stored image feature and an incompletely stored image feature that are vectors representing the characteristics of the image;
Complete accumulation, which is a vector that expresses acoustic characteristics from each of the completely accumulated video, which is an element of the completely accumulated video set, and the incompletely accumulated video, which is an element of the incompletely accumulated video set, by the accumulated acoustic feature extraction means Extracting acoustic features and imperfectly accumulated acoustic features;
A step of extracting, by the stored additional information feature extracting means, a stored additional information feature, which is a vector expressing the characteristics of the additional information, from each piece of stored additional information that is an element of the stored additional information set, which is a set of the additional information;
By the accumulated latent variable extraction means, a completely accumulated image feature set that is a set of the completely accumulated image features, an incompletely accumulated image feature set that is a set of the incompletely accumulated image features, and a complete accumulation that is a set of the completely accumulated acoustic features. To describe the relationship between video and additional information from an acoustic feature set, an incompletely stored acoustic feature set that is a set of the incompletely stored acoustic features, and a stored additional information feature set that is a set of the stored additional information features Extracting an accumulated latent variable set that is a set of variables;
The complete accumulated image feature set, the incompletely accumulated image feature set, the complete accumulated acoustic feature set, the incompletely accumulated acoustic feature set, the accumulated additional information feature set, and the video / additional information relationship learning means A video additional information relationship learning method comprising: learning a video / additional information relationship model, which is a model describing a relationship between a video and additional information, from a stored latent variable set.

The step of extracting the accumulated latent variable set by the accumulated latent variable extraction means includes:
A step of calculating, by the completely accumulated feature set statistic calculation means, a completely accumulated feature set statistic, which is a statistic expressing the statistical properties of the completely accumulated feature set, which is a set of combinations of a completely accumulated image feature, a completely accumulated acoustic feature, and the corresponding accumulated additional information feature;
A step of calculating, by the incompletely accumulated feature set statistic calculation means, an incompletely accumulated feature set statistic, which is a statistic expressing the statistical properties of the incompletely accumulated feature set, which is a set of combinations of an incompletely accumulated image feature and an incompletely accumulated acoustic feature;
A step of calculating, by the integrated statistic calculation means, an integrated statistic by combining the completely accumulated feature set statistic and the incompletely accumulated feature set statistic;
A step of determining, by the feature compression function determination means and using the integrated statistic, a feature compression function, which is a function for compressing the image features, the acoustic features, and the additional information features;
A step of calculating, by the feature compression means and using the feature compression function, an accumulated image compressed feature set obtained by compressing the accumulated image feature set, an accumulated acoustic compressed feature set obtained by compressing the accumulated acoustic feature set, and an accumulated additional information compressed feature set obtained by compressing the accumulated additional information feature set, and of calculating the accumulated latent variable set by combining the accumulated image compressed feature set, the accumulated acoustic compressed feature set, and the accumulated additional information compressed feature set.
The video additional information relationship learning method according to claim 5, wherein the extraction of the accumulated latent variable set includes the above steps.
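The claim above does not state which statistic is calculated, how the two statistics are combined, or how the feature compression function is determined. The following is only an illustrative sketch under explicit assumptions: the statistic is taken to be a covariance matrix over concatenated feature vectors, the integration is a weighted blend with a hypothetical weight `beta`, and the compression function is a projection onto the leading eigenvectors of the integrated statistic (a PCA-style stand-in, found here by power iteration). Feature centering is omitted for brevity. All function names and parameters are hypothetical and do not come from the patent.

```python
import math

def covariance(rows):
    """Biased sample covariance of equal-length feature vectors (assumed statistic)."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / n
             for j in range(d)] for i in range(d)]

def integrate(stat_complete, stat_incomplete, beta):
    """Blend both statistics with an assumed weight beta.
    The incomplete statistic covers only the leading image+acoustic block."""
    d, m = len(stat_complete), len(stat_incomplete)
    out = [[beta * stat_complete[i][j] for j in range(d)] for i in range(d)]
    for i in range(m):
        for j in range(m):
            out[i][j] += (1.0 - beta) * stat_incomplete[i][j]
    return out

def top_directions(matrix, k, iters=500):
    """Leading k eigenvectors of a symmetric matrix (power iteration with deflation)."""
    d = len(matrix)
    work = [row[:] for row in matrix]
    directions = []
    for _ in range(k):
        v = [float(i + 1) for i in range(d)]          # deterministic start vector
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(work[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigval = sum(v[i] * work[i][j] * v[j] for i in range(d) for j in range(d))
        directions.append(v)
        # Deflate: remove the component just found before searching for the next one.
        work = [[work[i][j] - eigval * v[i] * v[j] for j in range(d)] for i in range(d)]
    return directions

def compress(vec, directions, offset):
    """Project one modality's features onto its block of each direction."""
    return [sum(b[offset + j] * x for j, x in enumerate(vec)) for b in directions]

def latent(image, acoustic, extra, directions):
    """Combine the per-modality compressed features into one latent variable.
    Pass extra=None for an incompletely accumulated video."""
    parts = [compress(image, directions, 0),
             compress(acoustic, directions, len(image))]
    if extra is not None:
        parts.append(compress(extra, directions, len(image) + len(acoustic)))
    return [sum(vals) for vals in zip(*parts)]

# Toy data: each complete row is [image, acoustic, additional information].
complete = [[1.0, 0.5, 1.0], [2.0, 1.5, 0.0], [3.0, 2.0, 1.0]]
incomplete = [[1.5, 1.0], [2.5, 2.0], [0.5, 0.0]]
stat = integrate(covariance(complete), covariance(incomplete), beta=0.7)
dirs = top_directions(stat, k=2)
z = latent([1.0], [0.5], [1.0], dirs)
```

In this sketch, splitting each direction into image, acoustic, and additional information blocks plays the role of the per-modality compression, and summing the block projections plays the role of combining the compressed feature sets into a latent variable.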

A step of extracting, by the input additional information feature extraction means, an input additional information feature, which is a vector expressing the characteristics of additional information, from input additional information;
A step of providing, by the video search means, the input additional information feature to the video/additional information relationship model, and searching the completely accumulated video set and the incompletely accumulated video set for videos highly relevant to the input additional information. The video additional information relationship learning method according to claim 5, further comprising the above steps.
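The claim above does not specify how relevance between the input additional information and the accumulated videos is measured. As a minimal sketch, assuming that the relationship model has already mapped the input additional information feature and every accumulated video to latent vectors of the same dimension, and assuming cosine similarity as the relevance score, the search could look as follows. The names `cosine` and `search_videos` and the dictionary layout are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search_videos(query_latent, video_latents, top_k=3):
    """Return the ids of the top_k videos most similar to the query.
    video_latents maps a video id to its latent vector and may mix
    completely and incompletely accumulated videos."""
    ranked = sorted(video_latents.items(),
                    key=lambda item: cosine(query_latent, item[1]),
                    reverse=True)
    return [video_id for video_id, _ in ranked[:top_k]]

# Toy usage with made-up latent vectors.
videos = {"a": [1.0, 0.1], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
hits = search_videos([1.0, 0.0], videos, top_k=2)
```

Because both video sets share one latent space in this sketch, incompletely accumulated videos (those without additional information) can be retrieved in the same way as completely accumulated ones.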

Extracting an input image feature which is a vector expressing the characteristics of the image from the input video by the input image feature extracting means;
Extracting an input acoustic feature, which is a vector expressing acoustic characteristics, from the input video by an input acoustic feature extracting means;
Providing, by a video recognition means, the input image feature and the input acoustic feature to the video/additional information relationship model, and selecting additional information highly relevant to the input video from the accumulated additional information set. The video additional information relationship learning method according to claim 5, further comprising the above steps.
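The claim above leaves open how the model turns the input image and acoustic features into a selection of additional information. One possible reading, shown here purely as an assumption, applies a linear map to each modality, sums the results into a latent vector, and picks the accumulated additional information whose latent vector is nearest in squared Euclidean distance. The maps `image_map` and `acoustic_map`, the label names, and the function names are all hypothetical.

```python
def project(matrix, vec):
    """Apply a linear map given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def recognize(image_feat, acoustic_feat, image_map, acoustic_map,
              label_latents, top_k=1):
    """Return the top_k pieces of additional information closest to the
    latent vector of the input video (assumed nearest-neighbour rule)."""
    zi = project(image_map, image_feat)
    za = project(acoustic_map, acoustic_feat)
    z = [a + b for a, b in zip(zi, za)]

    def sqdist(u):
        return sum((p - q) ** 2 for p, q in zip(z, u))

    ranked = sorted(label_latents, key=lambda name: sqdist(label_latents[name]))
    return ranked[:top_k]

# Toy usage with made-up maps and label latent vectors.
image_map = [[1.0, 0.0], [0.0, 1.0]]
acoustic_map = [[1.0], [0.0]]
labels = {"music": [2.0, 0.1], "speech": [0.0, 2.0]}
best = recognize([1.0, 0.0], [1.0], image_map, acoustic_map, labels)
```

Using both modalities in one latent vector reflects the stated aim of exploiting image and acoustic information together; the distance rule itself is a placeholder for whatever the learned relationship model provides.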

A program for causing a computer to function as each means of the video additional information relationship learning device according to any one of claims 1 to 4.