JP6283308B2 - Image dictionary construction method, image representation method, apparatus, and program

Info

Publication number: JP6283308B2
Application number: JP2014261008A
Authority: JP
Inventors: 豪入江; 啓之新井; 行信谷口
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-12-24
Filing date: 2014-12-24
Publication date: 2018-02-21
Anticipated expiration: 2034-12-24
Also published as: JP2016122279A

Description

The present invention relates to an image dictionary construction method, an image representation method, an apparatus, and a program. More particularly, it relates to an image dictionary construction method, apparatus, and program for constructing an image dictionary used to obtain an image representation of an image, and to an image representation method, apparatus, and program for obtaining the image representation of an image.

As communication environments, computers, and distributed processing infrastructure have become more advanced and higher in quality, the number of images and videos circulating on networks has become enormous. One site is reported to receive 350 million image uploads every day, and another is reported to have 64 hours of new video published every minute.

Such an enormous volume of content is a rich source of information for users, but it also makes it increasingly difficult to reach the desired content quickly. Against this background, demand is growing for media analysis technology that efficiently finds the content a user wants to browse or watch.

The following description is limited to images. Since a video consists of a continuous series of images, it goes without saying that the description applies equally to video within the scope of this specification.

The first step in image analysis is to obtain an image representation, that is, to describe an image as a vector that can be compared numerically. For image recognition, for example, images that fall within a particular region of the space spanned by the image representation can then be classified into the same category. For image retrieval, when an image is given as a query, the similarity between images can be evaluated from their image representations and similar images retrieved. Image recommendation likewise finds and recommends images similar to those the user has viewed or is viewing, and summarizing a large number of images into a smaller number of representative images involves finding and omitting similar images. In every case, an image representation is required.

In view of these uses, an image representation should preferably capture the "semantic content" of an image. Here, "semantic content" refers to the parts and objects, and their characteristics, that characterize the subject ("dog", "house", "computer", etc.) or scene ("coast", "office", "forest", etc.) captured in the image; in other words, partial regions of the image that can be referred to in language. For a "dog", examples are the "ear shape" ("pointed ears", "drooping ears", etc.) and the "legs" ("short round legs", "elongated legs", etc.); for a "coast", examples are "beach", "sea", and "ship". If the presence or absence of such parts and objects in an image is known, the subject or scene being photographed can be inferred deductively from that set. Classification and retrieval based on semantic content then become possible, which has high industrial value.

Various image representation methods have been devised in the past. Patent Document 1 discloses a method in which statistics over the whole image, such as histograms of luminance, color, texture (pattern), and edges, are computed and used as the image representation.

Methods have also been disclosed in which an image dictionary serving as a model is constructed in advance and the image representation is obtained based on this dictionary.

Non-Patent Document 1 discloses a technique generally known as Bag-of-Words or Bag-of-Key-Points. This technique regards an image as a set of small regions a few pixels square and represents the image as a histogram of those small regions by counting their occurrences over the whole image. First, a set of small regions with particularly strong contrast is found in the images, each small region is described by its luminance gradients, and the descriptors are quantized to obtain an image dictionary (codebook). To represent an image, the luminance gradients of high-contrast small regions are computed over the whole image and encoded based on the dictionary. The frequency of occurrence of each code is then computed and arranged into a histogram, which serves as the image representation.

Non-Patent Document 2 discloses a method that finds regions characteristic of a subject or scene and represents an image by them. In this method, images are divided into random partial regions, and similar partial regions are grouped by clustering. Next, the feature amounts of the partial regions in each cluster are modeled as a function by a Support Vector Machine (SVM) or the like, and the dictionary is constructed as the set of partial regions that fit the model (that is, those for which the model outputs a high value when given the feature amount of the partial region). When a new image is input, the frequency with which partial regions similar to those registered in the dictionary appear is computed and arranged into a histogram to represent the image.

[Patent Document 1] JP 2014-67174 A

[Non-Patent Document 1] J. Sivic and A. Zisserman, "Video Google: A Text Retrieval Approach to Object Matching in Videos", In Proc. of IEEE International Conference on Computer Vision, 2003, pp. 1470-1477

[Non-Patent Document 2] S. Singh, A. Gupta, and A. A. Efros, "Unsupervised Discovery of Mid-Level Discriminative Patches", In Proceedings of the European Conference on Computer Vision, 2005, pp. 239-248

As stated above, for many industrial applications such as image recognition and image retrieval, an image representation should preferably express the semantic content of the subject or scene well. From this standpoint, the prior art has the following problems.

The techniques of Patent Document 1 and Non-Patent Document 1 represent an image by very low-level physical quantities (color, texture, luminance gradient, etc.) extracted from the whole image. Such low-level quantities, however, cannot distinguish subjects or scenes that look similar as a whole. In particular, closely related kinds of "bird" ("hawk" and "falcon", etc.) or closely related breeds of "dog" ("Siberian Husky" and "Alaskan Malamute", etc.) differ in some parts but closely resemble each other overall, so such image representations could not achieve practical discrimination accuracy.

The technique of Non-Patent Document 2, on the other hand, builds the image dictionary by extracting, from among characteristic partial regions, those that differ between subjects, and may therefore yield an image dictionary whose image representation can discriminate fine differences between subjects. However, because it still attempts to build the image dictionary from image features alone, it cannot necessarily identify and extract partial regions that correspond to the semantic content described above ("pointed ears", "elongated legs", etc.), and there was the problem that an effective image dictionary could not be obtained.

For these reasons, none of the previously disclosed inventions is an image dictionary construction technique or image representation technique capable of obtaining an image representation that expresses the semantic content of the subject or scene in an image, which is the requirement placed on an image representation.

The present invention has been made to solve the above problems, and its object is to provide an image dictionary construction method, apparatus, and program capable of constructing an image dictionary for obtaining an image representation that can discover meaningful, characteristic regions in an image.

Another object is to provide an image representation method, apparatus, and program capable of obtaining an image representation that can discover meaningful, characteristic regions in an image.

To achieve the above object, an image dictionary construction method according to a first invention is a method performed in an image dictionary construction apparatus that includes a partial region dividing unit, a feature amount extraction unit, a classification unit, a candidate region determination unit, and a classifier learning unit, and that constructs an image dictionary from each of one or more input images and document data corresponding to each of the images. The method comprises: a step in which the partial region dividing unit divides each of the one or more input images into one or more partial regions; a step in which the feature amount extraction unit extracts a feature amount for each partial region included in the set of partial regions produced by the partial region dividing unit; a step in which the classification unit classifies each partial region in the set into one of one or more clusters based on the similarity between the feature amounts extracted by the feature amount extraction unit; a step in which the candidate region determination unit determines, for each cluster, candidate regions, which are partial regions representative of the cluster, based on the feature amount of each partial region classified into the cluster by the classification unit and on the input document data corresponding to the image containing that partial region; and a step in which the classifier learning unit, for each cluster, learns a classifier for identifying whether a partial region belongs to the cluster, using the feature amounts of the candidate regions determined by the candidate region determination unit as positive examples and the feature amounts of partial regions not classified into the cluster as negative examples, and outputs the classifiers obtained for the clusters as the image dictionary.

An image dictionary construction apparatus according to the first invention constructs an image dictionary from each of one or more images received as input and document data corresponding to each of the images. The apparatus comprises: a partial region dividing unit that divides each of the one or more input images into one or more partial regions; a feature amount extraction unit that extracts a feature amount for each partial region included in the set of partial regions produced by the partial region dividing unit; a classification unit that classifies each partial region in the set into one of one or more clusters based on the similarity between the feature amounts extracted by the feature amount extraction unit; a candidate region determination unit that determines, for each cluster, candidate regions, which are partial regions representative of the cluster, based on the feature amount of each partial region classified into the cluster by the classification unit and on the input document data corresponding to the image containing that partial region; and a classifier learning unit that, for each cluster, learns a classifier for identifying whether a partial region belongs to the cluster, using the feature amounts of the candidate regions determined by the candidate region determination unit as positive examples and the feature amounts of partial regions not classified into the cluster as negative examples, and outputs the classifiers obtained for the clusters as the image dictionary.

An image representation method according to a second invention is a method performed in an image representation apparatus that includes a partial region dividing unit, a feature amount extraction unit, and a representation unit. The method comprises: a step in which the partial region dividing unit divides an input image into one or more partial regions; a step in which the feature amount extraction unit extracts a feature amount for each of the partial regions; and a step in which the representation unit, based on the feature amount of each partial region extracted by the feature amount extraction unit and on the image dictionary output by the image dictionary construction method according to the first invention, calculates for each partial region the confidence with which the partial region belongs to each of the clusters, determines from the calculated confidences whether the partial region belongs to one of the clusters or to none of them, and, based on the determination results, outputs as the image representation of the input image a histogram of how often partial regions were determined to belong to each cluster.

An image representation apparatus according to the second invention comprises: a partial region dividing unit that divides an input image into one or more partial regions; a feature amount extraction unit that extracts a feature amount for each of the partial regions; and a representation unit that, based on the feature amount of each partial region extracted by the feature amount extraction unit and on the image dictionary output by the image dictionary construction apparatus according to claim 2, calculates for each partial region the confidence with which the partial region belongs to each of the clusters, determines from the calculated confidences whether the partial region belongs to one of the clusters or to none of them, and, based on the determination results, outputs as the image representation of the input image a histogram of how often partial regions were determined to belong to each cluster.

A program according to the first invention is a program for causing a computer to execute each step of the image dictionary construction method or the image representation method according to the first invention.

According to the image dictionary construction method, apparatus, and program of the present invention, input images are divided into partial regions, each partial region is classified into a cluster, candidate regions representative of each cluster are determined based on the feature amounts of the partial regions and the document data corresponding to the images containing them, and classifiers are learned using the candidate regions as positive examples. This makes it possible to construct an image dictionary for obtaining an image representation that can discover meaningful, characteristic regions in an image.

According to the image representation method, apparatus, and program, an input image is divided into partial regions; for each partial region it is determined, based on the feature amount of the partial region and the image dictionary, whether the region belongs to one of the clusters or to none of them; and a histogram of how often partial regions were determined to belong to each cluster is output as the image representation of the input image. This makes it possible to obtain an image representation that can discover meaningful, characteristic regions in an image.
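The image representation step summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in this specification: each dictionary entry is assumed to be a linear classifier given as a hypothetical `(weights, bias)` pair, the confidence is taken to be its raw score, and a partial region is assigned to the highest-scoring cluster only when that score exceeds a hypothetical `threshold`; otherwise it is treated as belonging to no cluster.

```python
from typing import List, Sequence, Tuple

# Hypothetical dictionary entry: one linear classifier per cluster.
Classifier = Tuple[Sequence[float], float]  # (weights, bias)


def score(x: Sequence[float], clf: Classifier) -> float:
    """Confidence that feature vector x belongs to the classifier's cluster."""
    weights, bias = clf
    return sum(w * v for w, v in zip(weights, x)) + bias


def represent(features: Sequence[Sequence[float]],
              dictionary: Sequence[Classifier],
              threshold: float = 0.0) -> List[int]:
    """Histogram of how many partial regions were assigned to each cluster."""
    hist = [0] * len(dictionary)
    if not dictionary:
        return hist
    for x in features:
        scores = [score(x, clf) for clf in dictionary]
        best = max(range(len(scores)), key=scores.__getitem__)
        # Regions whose best score does not exceed the threshold
        # are treated as belonging to no cluster and are not counted.
        if scores[best] > threshold:
            hist[best] += 1
    return hist


# Toy example: two clusters, four partial regions with 2-D feature vectors.
toy_dictionary = [([1.0, 0.0], -0.5), ([0.0, 1.0], -0.5)]
toy_features = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.1], [0.9, 0.8]]
toy_hist = represent(toy_features, toy_dictionary)
```

In the toy example the third region scores below the threshold for both clusters and is therefore left out of the histogram.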

A block diagram showing the configuration of the image dictionary construction apparatus according to an embodiment of the present invention.

A block diagram showing the configuration of the image representation apparatus according to an embodiment of the present invention.

A flowchart showing the image dictionary construction processing routine in the image dictionary construction apparatus according to an embodiment of the present invention.

A flowchart showing the image representation processing routine in the image representation apparatus according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<Configuration of the Image Dictionary Construction Apparatus According to the First Embodiment of the Present Invention>

First, the configuration of the image dictionary construction apparatus according to the first embodiment of the present invention will be described. As shown in FIG. 1, the image dictionary construction apparatus 100 according to the first embodiment can be implemented by a computer that includes a CPU, a RAM, and a ROM storing various data and a program for executing the image dictionary construction processing routine described later. Functionally, as shown in FIG. 1, the image dictionary construction apparatus 100 comprises an image database 10, a computing unit 20, and an image dictionary 50.

The image database 10 stores the images themselves or addresses that uniquely identify the locations of the image files. It also stores document data corresponding to one or more of the stored images. This document data expresses semantic content relating to the image as a whole; it is a document describing the parts and objects that characterize the subject or scene captured in the image. It may be given as keywords or as sentences. In the former case, it is preferably given as words describing all or part of the subject or scene captured in the image: for example, "ear shape" ("pointed ears", "drooping ears", etc.) or "legs" ("short round legs", "elongated legs", etc.) when the subject is a "dog", and "beach", "sea", "ship", etc. for a "coast". In the latter case, it is preferably given as text describing the subject or scene of the image. The level of detail is arbitrary: for an image of a dog with pointed ears on a coast where a ship is sailing, for example, the text may read "There is a dog on the coast" or "There is a dog with pointed ears on a coast where a ship is sailing".

The document data may be prepared in any way. For example, when images from web pages on the Internet are used, an image is usually surrounded by text related to it, and this text may be used as the document data. This has the advantage that document data can be obtained without manual work. Alternatively, document data may be entered manually for each image, which has the advantage of yielding highly reliable document data based on accurate human judgment.

The image database 10 need only be able to store each image or address in association with its corresponding document data, and may be implemented with a so-called RDBMS (Relational Database Management System) or the like. The image database 10 may be internal or external to the image dictionary construction apparatus 100, and any known communication means may be used. Moreover, as long as the image dictionary construction apparatus 100 can receive one or more images as input, a database is not strictly required. In this embodiment, the image database 10 is assumed to be external and connected so as to communicate over the Internet using TCP/IP.
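As one way to picture the association between images and document data, the following minimal sketch uses Python's standard `sqlite3` module as a stand-in RDBMS. The table name `images`, its column names, and the `example.com` addresses are all hypothetical and chosen only for illustration; the specification does not prescribe a schema. The `document` column is nullable because document data is only required for one or more of the stored images, not all of them.

```python
import sqlite3

# In-memory database standing in for the image database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images ("
    " id INTEGER PRIMARY KEY,"
    " address TEXT NOT NULL UNIQUE,"  # address uniquely locating the image file
    " document TEXT)"                 # keywords or sentences; may be absent
)

rows = [
    ("http://example.com/a.jpg", "pointed ears, elongated legs"),  # keyword form
    ("http://example.com/b.jpg", "There is a dog on the coast"),   # sentence form
    ("http://example.com/c.jpg", None),                            # no document data
]
conn.executemany("INSERT INTO images (address, document) VALUES (?, ?)", rows)

# Images that can contribute document data to dictionary construction.
with_docs = conn.execute(
    "SELECT address, document FROM images "
    "WHERE document IS NOT NULL ORDER BY id"
).fetchall()
```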

Each unit of the image dictionary construction apparatus 100 and the image database 10 may be implemented by a computer, server, or the like equipped with an arithmetic processing device and a storage device, with the processing of each unit executed by a program. This program is stored in a storage device of the image dictionary construction apparatus 100, and can also be recorded on a recording medium such as a magnetic disk, optical disc, or semiconductor memory, or provided over a network. Of course, no component needs to be implemented on a single computer or server; components may be distributed over multiple computers connected by a network.

The computing unit 20 comprises a partial region dividing unit 30, a feature amount extraction unit 32, a classification unit 34, a candidate region determination unit 36, and a classifier learning unit 38.

The partial region dividing unit 30 reads one or more images input from the image database 10, divides each image into one or more partial regions, selects among them, and outputs the selected partial regions to the feature amount extraction unit 32.

The partial region extraction processing in the partial region dividing unit 30 is described in detail below. This processing is applied to every image stored in the image database 10, but because the same processing is applied to each, only the processing for a single image is described here.

The partial region extraction processing cuts out and extracts parts of the whole image. Specifically, the number of partial regions and the partial region size are specified, and partial regions are extracted at regular intervals.

For example, suppose the original image is 360 pixels high by 240 pixels wide, the number of partial regions is 16 × 16 = 256, and the partial region size is 32 × 32 pixels. In this case, one 32 × 32 pixel partial region is extracted at every vertical shift of (360 - 32) / 16 = 20 pixels (fractions truncated) and at every horizontal shift of (240 - 32) / 16 = 13 pixels.

Any positive integers may be set for the number of partial regions and the partial region size. Partial regions may overlap one another, and several settings may be used in combination.

Performing the above processing over the whole image yields a set of partial regions. This set is output to the feature amount extraction unit 32, and the processing ends.
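The regular-interval extraction described above can be sketched as follows. This is a minimal illustration that reproduces the arithmetic of the worked example (a 360 × 240 pixel image, 16 × 16 regions of 32 × 32 pixels); the function name `grid_regions` and its parameter names are hypothetical, and placing the first region at the top-left corner of the image is an assumption.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def grid_regions(height: int, width: int,
                 n_rows: int, n_cols: int, size: int) -> List[Box]:
    """Return the boxes of an n_rows x n_cols grid of size x size regions."""
    # Shift per step, with fractions truncated as in the worked example.
    step_y = (height - size) // n_rows
    step_x = (width - size) // n_cols
    boxes = []
    for r in range(n_rows):
        for c in range(n_cols):
            top, left = r * step_y, c * step_x
            boxes.append((top, left, top + size, left + size))
    return boxes


# Worked example: shifts of 20 pixels vertically and 13 pixels horizontally.
boxes = grid_regions(height=360, width=240, n_rows=16, n_cols=16, size=32)
```

With these settings the horizontal shift (13 pixels) and vertical shift (20 pixels) are both smaller than the region size (32 pixels), so neighboring regions overlap, which the text explicitly permits.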

The feature amount extraction unit 32 analyzes each partial region in the set of partial regions produced by the partial region dividing unit 30 and extracts predetermined feature amounts, which are output to the classification unit 34. In this embodiment, an image feature vector is extracted as the feature amount.

Feature amount extraction in the feature amount extraction unit 32 is described below. In this embodiment, all of the feature amounts listed below are extracted, but the choice of feature amounts is not essential to the embodiment, and any generally known feature extraction process may be used. Specifically, any numerical data with dimensions (a scalar or vector) extracted from an image, and any combination of such data, is valid; for example, brightness features, color features, texture features, landscape features, and shape features may be extracted.

The brightness feature can be obtained as a histogram of the V values, in the HSV color space, of the pixels in the partial region.
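The brightness histogram can be sketched as follows. The sketch assumes 8-bit RGB pixels, uses the standard HSV definition V = max(R, G, B), and normalizes the histogram; the bin count and the normalization are illustrative choices rather than values taken from the description.

```python
def v_histogram(pixels, bins=8):
    """Histogram of HSV V values for an iterable of (r, g, b) pixels in 0..255."""
    counts = [0] * bins
    for r, g, b in pixels:
        v = max(r, g, b) / 255.0              # V channel of HSV, in [0, 1]
        index = min(int(v * bins), bins - 1)  # clamp v == 1.0 into the last bin
        counts[index] += 1
    total = sum(counts)
    # Normalize so that regions of different sizes are comparable.
    return [c / total for c in counts] if total else counts
```

The same binning could be applied per axis to L*, a*, and b* values once the pixels have been converted to that color space.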

The color feature can be obtained as a histogram of the values along each axis (L*, a*, b*) of the L*a*b* color space.

As the texture feature, a local descriptor may be extracted at each keypoint sampled at regular intervals within the partial region. Suitable local descriptors include SIFT (Scale-Invariant Feature Transform), described in Reference 1 below, and SURF (Speeded-Up Robust Features), described in Reference 2 below.

[Reference 1] D.G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints", International Journal of Computer Vision, pp. 91-110, 2004.

[Reference 2] H. Bay, T. Tuytelaars, and L.V. Gool, "SURF: Speeded Up Robust Features", Lecture Notes in Computer Science, vol. 3951, pp. 404-417, 2006.

The local descriptors extracted in this way are, for example, 128-dimensional real-valued vectors per keypoint, so together they have 128 × (number of keypoints) dimensions. Alternatively, each vector can be converted to a code by looking it up in a codebook learned in advance, and a histogram can be generated by counting the codes that occur within blocks of a suitable size inside the partial region. In this case, the number of histogram bins equals the number of codes in the codebook. Sparse representations such as the one described in Reference 3, or Fisher-kernel-based feature representations such as those described in References 4 and 5, may also be used.

[Reference 3] Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang, and Yihong Gong, "Locality-constrained Linear Coding for Image Classification", IEEE Conference on Computer Vision and Pattern Recognition, pp. 3360-3367, 2010.

[Reference 4] Florent Perronnin, Jorge Sanchez, Thomas Mensink, "Improving the Fisher Kernel for Large-Scale Image Classification", European Conference on Computer Vision, pp. 143-156, 2010.

[Reference 5] Herve Jegou, Florent Perronnin, Matthijs Douze, Jorge Sanchez, Patrick Perez, Cordelia Schmid, "Aggregating Local Image Descriptors into Compact Codes", IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 34, No. 9, pp. 1704-1716, 2012.

In each case, the resulting feature is a real-valued vector whose length depends on the number of codes in the codebook.
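The quantization step described above (assigning each local descriptor to a learned code and counting occurrences) can be sketched as follows. The sketch assumes a codebook given as a list of codeword vectors and uses nearest-neighbor assignment under squared Euclidean distance, which is one common choice and not necessarily the one intended here.

```python
def nearest_code(descriptor, codebook):
    """Index of the codeword closest to `descriptor` (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, codeword in enumerate(codebook):
        dist = sum((d - c) ** 2 for d, c in zip(descriptor, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def code_histogram(descriptors, codebook):
    """Count how often each codeword is the nearest one; one bin per codeword."""
    hist = [0] * len(codebook)
    for descriptor in descriptors:
        hist[nearest_code(descriptor, codebook)] += 1
    return hist
```

The histogram has exactly as many bins as the codebook has entries, which matches the statement that the feature length depends on the number of codes.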

The scene feature describes the scenery or setting depicted in an image. For example, the GIST descriptor described in Reference 6 can be used. The GIST descriptor is expressed by the coefficients obtained by applying filters with fixed orientations to each block in the partial region, so the resulting feature is a vector whose length depends on the number of filters (number of blocks × number of orientations).

[Reference 6] A. Oliva and A. Torralba, "Building the gist of a scene: the role of global image features in recognition", Progress in Brain Research, 155, pp. 23-36, 2006.

The shape feature describes the shape of objects appearing in the image. For example, the Histogram of Oriented Gradients (HOG) feature described in Reference 7, or an edge histogram, can be used.

[Reference 7] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection", IEEE Conference on Computer Vision and Pattern Recognition, pp. 886-893, 2005.

One or more of these features may be used, and other well-known features may be used as well.

The feature extraction unit 32 outputs the features obtained by the above processing for each block of each partial region to the classification unit 34 and ends its processing.

The classification unit 34 classifies each partial region in the partial region set into one of one or more clusters, based on the similarity between the image feature vectors extracted by the feature extraction unit 32, and outputs the classification result to the candidate region determination unit 36.

Because the feature extraction unit 32 represents every partial region as an image feature vector of the same form, the classification unit 34 can use any clustering method. For example, the K-means method can be used to classify the regions into an arbitrary number of clusters. The number of clusters may be set, for example, to 1/4 of the total number of partial regions.

In some cases, a very large number of partial regions may belong to a single cluster, or only a handful may. Partial regions in such clusters tend to be either extremely generic or extremely rare, and may contribute little to image recognition or retrieval. Clusters containing at least a certain number of partial regions (for example, 1000 or more) and clusters containing at most a certain number (for example, 3 or fewer) may therefore be deleted.
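The size-based pruning can be sketched as follows, assuming the clustering step has already produced one cluster id per partial region. The default bounds keep clusters with 4 to 999 members, mirroring the example thresholds in the text; the function name and return format are illustrative.

```python
from collections import Counter


def filter_clusters(labels, min_size=4, max_size=999):
    """Keep only clusters whose member count lies in [min_size, max_size].

    `labels[i]` is the cluster id of partial region i. Returns a dict mapping
    each surviving cluster id to the list of region indices it contains.
    """
    sizes = Counter(labels)
    kept = {}
    for region_index, cluster_id in enumerate(labels):
        if min_size <= sizes[cluster_id] <= max_size:
            kept.setdefault(cluster_id, []).append(region_index)
    return kept
```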

The classification unit 34 then outputs the classification result (each partial region together with the cluster to which it belongs) to the candidate region determination unit 36 and ends its processing.

The candidate region determination unit 36 receives the classification result from the classification unit 34. For each cluster, it determines candidate regions, that is, partial regions that represent the cluster, based on the image feature vector of each partial region classified into the cluster and on the document data, stored in the image database 10, that corresponds to the image containing that partial region. The candidate regions are used as positive examples when the classifier learning unit 38 trains a classifier.

The processing of the candidate region determination unit 36 is the key to the goal of this embodiment, which is to construct an image dictionary that preserves semantic content. Specifically, the aim is to construct an image dictionary that stores partial regions satisfying the following two requirements.

Requirement 1 is that the partial region has a representative appearance that occurs frequently in the images of the image database 10. Requirement 2 is that the partial region captures the semantic content of the subjects and scenes that appear in the images of the image database 10.

Such partial regions not only describe the subjects and scenes in the image database efficiently (Requirement 1) but also capture their semantic content (Requirement 2), which makes them well suited to an image dictionary for high-accuracy image recognition, image retrieval, and similar tasks.

The first through fourth processes of the candidate region determination unit 36 are described in detail below. Because exactly the same processing is applied to every cluster, only the processing within a single cluster is described.

As the first process, the candidate region determination unit 36 associates partial regions with document data. As described earlier, the image database 10 stores each image (or an address that uniquely identifies its location) in association with document data describing the semantic content of the image as a whole. By checking which image a partial region was extracted from, the partial region can therefore be associated with the document data linked to that image. This first process associates document data with all or some of the partial regions.

As the second process, the candidate region determination unit 36 constructs, for each partial region that was associated with document data in the first process, a document vector that represents the document data numerically. Any known technique may be used. Most simply, a bag-of-words histogram, whose vector is built from normalized word frequencies, can be applied. Alternatively, a scheme that weights words by their occurrence frequency, such as tf-idf, may be applied. These methods are convenient because they convert document data into document vectors of the same form regardless of whether the document data registered in the image database 10 consists of single words or full sentences.
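The two document-vector options can be sketched as follows. The sketch assumes the document data has already been tokenized into lists of words (for Japanese text this would require a separate word segmentation step) and uses idf = log(N / df), which is only one of several common tf-idf variants.

```python
import math
from collections import Counter


def build_vocabulary(documents):
    """Sorted list of all distinct words across tokenized documents."""
    return sorted({word for doc in documents for word in doc})


def bow_vector(doc, vocabulary):
    """Normalized term-frequency vector for one tokenized document."""
    counts = Counter(doc)
    total = len(doc)
    return [counts[w] / total if total else 0.0 for w in vocabulary]


def tfidf_vectors(documents):
    """tf-idf vectors, using idf = log(N / df) (one common variant)."""
    vocabulary = build_vocabulary(documents)
    n_docs = len(documents)
    # Document frequency: number of documents in which each word occurs.
    df = Counter(word for doc in documents for word in set(doc))
    idf = [math.log(n_docs / df[w]) for w in vocabulary]
    return vocabulary, [
        [tf * w_idf for tf, w_idf in zip(bow_vector(doc, vocabulary), idf)]
        for doc in documents
    ]
```

Every document maps to a vector of the same length (the vocabulary size), whether it holds a single tag or a full sentence.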

The candidate region determination unit 36 then performs the third and fourth processes to find candidate regions from the image feature vectors of the partial regions and the document vectors of the associated document data.

The purpose of the third and fourth processes is to narrow the partial regions belonging to each cluster, together with their document vectors, down to a smaller number of candidate regions. More precisely, given a cluster containing the image feature vectors of M partial regions and their corresponding document vectors, T < M candidate regions are selected from them.

The basic approach of the third and fourth processes is to compute, for each partial region in the cluster being processed, a representativeness score expressing how well it represents the cluster, and to select the T regions with the highest scores as candidate regions. There are many ways to compute such a score. In the K-means method, for example, the representative is the cluster center, that is, the point whose total distance to the other partial regions in the cluster is smallest. From this viewpoint, the representativeness score is given by closeness to the other data. Following the same idea, this embodiment also defines the score by closeness to the other partial regions and determines it by clustering. The clustering in the third process is described in detail below.

In the third process, the candidate region determination unit 36 applies clustering separately to the image feature vectors and to the document vectors of the partial regions belonging to the cluster being processed, and computes representativeness scores. Because the processing is identical for image feature vectors and document vectors, only the case of image feature vectors is described.

Any known clustering method may be used. However, the spread of the features of the partial regions within a cluster often varies, and the appropriate distance (Euclidean distance, cosine similarity, histogram intersection, and so on) depends on the type of image feature vector (or document vector). In view of this, a method such as Affinity Propagation, which estimates the number of clusters automatically and can cluster on an arbitrary distance, is preferable to a method such as K-means, which requires the number of clusters to be specified in advance or is restricted to a particular kind of distance.

The clustering in the third process yields K cluster centers of image feature vectors. The candidate region determination unit 36 computes the representativeness scores with reference to these cluster centers.

In the third process, for example, the cluster center with the smallest average distance to the partial regions belonging to its cluster (any other statistic, such as the median, may be used instead) is taken as the representative cluster center and given a representativeness score of 1.0. The remaining scores are then assigned so that regions closer to the representative cluster center score higher. With dist denoting the distance from the representative cluster center, the representativeness score can be computed, for example, by Equation (1), which is not reproduced in this text.

Alternatively, dist may be taken as the distance from the nearest of the K cluster centers, and the representativeness score computed from it in the same way using Equation (1).
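The scoring idea can be sketched as follows. Because Equation (1) is not available in this text, the mapping `1 / (1 + dist)` used below is purely an assumed stand-in: it gives 1.0 at distance zero and decreases monotonically with distance, which is all the surrounding description requires. The function names and the Euclidean default are likewise illustrative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def representative_center(centers, members_by_center, distance=euclidean):
    """Pick the center whose (non-empty) member list lies closest on average."""
    def mean_distance(k):
        members = members_by_center[k]
        return sum(distance(centers[k], m) for m in members) / len(members)
    return min(range(len(centers)), key=mean_distance)


def representativeness(vectors, center, distance=euclidean):
    """Assumed scoring rule: 1 / (1 + dist), so dist = 0 gives a score of 1.0."""
    return [1.0 / (1.0 + distance(v, center)) for v in vectors]
```

Passing a different `distance` callable (for example, one based on cosine similarity or histogram intersection) changes the metric without touching the scoring logic.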

The above gives the representativeness score based on image feature vectors; the score based on document vectors is computed in the same way. If some partial regions have no associated document vector, their document-vector representativeness score may be set to a predetermined value, for example uniformly 0, or the mean or median of the values obtained for the document vectors that do exist.

Next, in the fourth process, the candidate region determination unit 36 selects candidate regions for the cluster being processed based on the representativeness scores computed from the image feature vectors and the document vectors. That is, a final representativeness score is derived from the two scores computed independently from the image feature vectors and the document vectors, and the candidate regions are selected according to it.

Most simply, the partial regions are ranked in descending order of the sum of their image-feature and document-vector representativeness scores, and the top T are selected as candidate regions. Instead of the sum, either the larger or the smaller of the two scores may be adopted as the representativeness score of the partial region.
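A minimal sketch of this selection follows, assuming the two score lists are aligned by region index. The tie-breaking rule (lower index first) is an assumption added to make the result deterministic.

```python
def select_candidates(image_scores, doc_scores, t, combine=lambda a, b: a + b):
    """Return indices of the `t` regions with the highest combined score.

    `combine` may be a sum (default), `max`, or `min`, matching the options
    described in the text. Ties keep the lower region index first.
    """
    combined = [combine(a, b) for a, b in zip(image_scores, doc_scores)]
    ranked = sorted(range(len(combined)), key=lambda i: (-combined[i], i))
    return ranked[:t]
```

Using `min` favors regions that are representative in both appearance and meaning, while `max` accepts regions that are strong in either one.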

Alternatively, the partial regions may be ranked separately by image-feature score and by document-vector score, in descending order, to produce two ranking lists, which are then merged into one. The Borda count, for example, can be used to merge the two rankings: ranks 1 through M in each list are awarded M points down to 1 point, respectively, and the T regions with the largest point totals are selected as candidate regions. Because this method relies only on rank order, it selects effective candidate regions robustly and accurately even when the representativeness scores are noisy.
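The rank fusion can be sketched with a Borda count as follows. The point scheme (M points for rank 1 down to 1 point for rank M) follows the text; breaking ties by region index is an added assumption.

```python
def borda_select(image_scores, doc_scores, t):
    """Fuse two rankings with a Borda count and return the top `t` indices."""
    m = len(image_scores)

    def points(scores):
        # Rank 1 (highest score) earns m points, rank m earns 1 point.
        order = sorted(range(m), key=lambda i: (-scores[i], i))
        awarded = [0] * m
        for rank, index in enumerate(order):
            awarded[index] = m - rank
        return awarded

    totals = [a + b for a, b in zip(points(image_scores), points(doc_scores))]
    return sorted(range(m), key=lambda i: (-totals[i], i))[:t]
```

Only the ordering of the scores matters here, so rescaling or mildly perturbing either score list leaves the selection unchanged as long as the ranks stay the same.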

The candidate region determination unit 36 then outputs the candidate regions selected for each cluster in the fourth process to the classifier learning unit 38 and ends its processing.

As its first process, the classifier learning unit 38 trains a classifier for each cluster, using the image feature vectors of the candidate regions selected by the candidate region determination unit 36 as positive examples and the image feature vectors of arbitrary partial regions not classified into that cluster as negative examples, and outputs the result as the image dictionary 50.

Many methods are known for training a classifier from given positive and negative examples, and any of them may be used. In this embodiment, an SVM or the support vector regression (SVR) described in Reference 8 can be used.

[Reference 8] A.J. Smola, B. Scholkopf, "A Tutorial on Support Vector Regression", Statistics and Computing, Vol. 14, Issue 3, pp. 199-222, 2004.

In either case, for a given cluster, training with its candidate regions as positive examples and arbitrary partial regions outside the cluster as negative examples yields a classifier that computes the confidence with which a partial region belongs to the cluster. The classifiers obtained for the clusters are output as the image dictionary 50, and the processing ends.

As a second process, the classifier learning unit 38 may apply the trained classifiers to a new image data set prepared in advance, determine which cluster each partial region in that data set belongs to, and output the resulting classification to the candidate region determination unit 36. The candidate region determination unit 36 then determines the candidate regions of each cluster for the newly classified data set by the processing described above and outputs them back to the classifier learning unit 38. In this way, when a new image data set is added, candidate region determination and classifier training are repeated on it until a predetermined condition is satisfied, and the resulting classifiers are output as the image dictionary 50. This reduces the influence of bias in the image data set and yields classifiers that respond (output a high positive value) to partial regions with specific semantic content.

In this embodiment, the classifier learning unit 38 performs both the first and second processes, but the classifiers obtained by the first process alone may instead be output as the image dictionary 50.

The above is one example of the detailed processing of each processing unit of the image dictionary construction device.

<Configuration of the Image Representation Device According to the First Embodiment of the Present Invention>

Next, the configuration of the image representation device according to the first embodiment of the present invention is described. As shown in FIG. 2, the image representation device 200 can be implemented as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the image representation processing routine described later. Functionally, as shown in FIG. 2, the image representation device 200 includes an image database 210 and a computing unit 220.

The image database 210 stores at least the images themselves or addresses that uniquely identify the locations of the image files. It is otherwise the same as the image database 10 of the image dictionary construction device 100.

The computing unit 220 includes a partial region dividing unit 230, a feature extraction unit 232, a representation unit 234, and an image dictionary 236.

The partial region dividing unit 230 reads the image input from the image database 210, divides it into partial regions, selects among them, and outputs the result to the feature extraction unit 232. It is otherwise the same as the partial region dividing unit 30 of the image dictionary construction device 100.

The feature extraction unit 232 analyzes each partial region in the set of partial regions produced by the partial region dividing unit 230 and extracts predetermined features, which are output to the representation unit 234. It is otherwise the same as the feature extraction unit 32 of the image dictionary construction device 100.

Based on the features of each partial region extracted by the feature extraction unit 232 and on the image dictionary 236 output by the image dictionary construction device 100, the representation unit 234 calculates, for each partial region, the confidence with which the region belongs to each cluster, and determines from these confidences whether the region belongs to one of the clusters or to none of them. Based on the result, the representation unit 234 outputs, as the image representation of the input image, a histogram giving for each cluster the frequency with which regions were judged to belong to it.

The processing of the representation unit 234 is described in detail below. At this point, the partial regions of the input image and the features of each partial region have been extracted. Given these, an image representation of the image is obtained using the image dictionary 236 learned in advance by the image dictionary construction device 100.

First, by processing similar to that of the classification unit 34 of the image dictionary construction device 100, the representation unit 234 determines whether each partial region belongs to one of the clusters or belongs to no cluster.

Most simply, the frequency with which regions were judged to belong to each cluster K_i (i = 1, ..., V) is counted, and the resulting V-dimensional histogram is used as the image representation.
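The histogram representation can be sketched as follows. The sketch assumes each dictionary entry is available as a callable that returns a confidence for a region's feature vector, and it assigns a region to its highest-scoring cluster only when that confidence exceeds a threshold. The callable interface, the arg-max rule, and the threshold of 0.0 are all illustrative assumptions; the description only says the decision is made from the computed confidences.

```python
def represent_image(region_features, scorers, threshold=0.0):
    """Build a V-bin histogram of cluster assignments for one image.

    `scorers[i](feature)` returns the confidence that a region belongs to
    cluster i. A region is counted for its best cluster only when that
    confidence exceeds `threshold`; otherwise it belongs to no cluster.
    """
    histogram = [0] * len(scorers)
    for feature in region_features:
        scores = [scorer(feature) for scorer in scorers]
        best = max(range(len(scores)), key=lambda i: scores[i])
        if scores[best] > threshold:
            histogram[best] += 1
    return histogram
```

For example, with two toy scorers that each respond to one feature dimension, `represent_image` counts how many regions fall to each cluster and silently skips regions that no scorer accepts.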

Alternatively, the frequencies may be computed by the process known as Spatial Pyramid Matching (or Spatial Pooling) described in Reference 9.

[Reference 9] S. Lazebnik, C. Schmid, J. Ponce, "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories", In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 2169-2178, 2006.

Beyond these, the image representation can be obtained from any statistic that can be computed from one or more clusters and the sets of elements belonging to them.

The representation unit 234 then outputs, as the image representation, the histogram giving for each cluster the frequency with which regions were judged to belong to it, and ends its processing.

<Operation of the Image Dictionary Construction Device According to the First Embodiment of the Present Invention>

Next, the operation of the image dictionary construction device 100 according to the first embodiment of the present invention is described. On receiving one or more images and the document data corresponding to each image from the image database 10, the image dictionary construction device 100 executes the image dictionary construction processing routine shown in FIG. 3.

First, in step S100, the one or more images received from the image database 10 and the document data corresponding to each image are read.

In step S102, each image read in step S100 is divided into one or more partial regions.

In step S104, each partial region in the partial region set obtained in step S102 is analyzed, and an image feature vector is extracted as the feature of each partial region.

In step S106, the partial regions in the set of all partial regions obtained in step S102 are each classified into one of one or more clusters, based on the image feature vectors extracted in step S104.

In step S108, a cluster to be processed is selected from the set of clusters obtained in step S106.

ステップＳ１１０では、ステップＳ１０８で選択したクラスタについて、ステップＳ１０６で当該クラスタに分類された部分領域の集合に含まれる部分領域と、ステップＳ１００で読み込んだ、当該部分領域を含む画像に対応する文書データとを対応づける。 In step S110, for the cluster selected in step S108, the partial area included in the set of partial areas classified into the cluster in step S106, and the document data corresponding to the image including the partial area read in step S100, and Associate.

ステップＳ１１２では、ステップＳ１１０で部分領域の各々に対応付けられた文書データに基づいて、文書ベクトルを部分領域ごとに構成する。 In step S112, a document vector is constructed for each partial area based on the document data associated with each partial area in step S110.
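This passage does not say how the document vector of step S112 is built. One common choice, shown here purely as an assumption, is a bag-of-words count vector over whitespace-separated tokens. The names `build_vocabulary` and `document_vector` are illustrative, and text in a language without whitespace word boundaries would first need a tokenizer.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def document_vector(document, vocabulary):
    """Return the term-count vector for one region's document data."""
    counts = Counter(document.lower().split())
    vec = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # tokens outside the vocabulary are ignored
            vec[vocabulary[tok]] = n
    return vec
```

In the first embodiment every partial region cut from the same image shares that image's document data, so such regions would receive identical document vectors under this sketch.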

In step S114, for the cluster selected in step S108, the image feature vectors of the partial regions classified into that cluster in step S106 are clustered to obtain cluster centers, and a representativeness score is computed for each partial region relative to those centers. Likewise, the document vectors constructed in step S112 for those partial regions are clustered to obtain cluster centers, and a second representativeness score is computed for each partial region relative to those centers. Then, for each partial region classified into the cluster, a final representativeness score is computed from the two scores.

In step S116, for the cluster selected in step S108, candidate regions, that is, partial regions that represent the cluster, are determined from the representativeness scores computed in step S114.
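Steps S114 and S116 can be illustrated with a deliberately simplified sketch. It assumes a single center per modality (the mean vector), a score of 1 / (1 + Euclidean distance to the center), a weighted sum with a hypothetical weight `alpha` to merge the two scores, and top-`k` selection of candidates. None of these specific choices is stated in this passage, and all function names are illustrative.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def closeness_scores(vectors):
    """Score each vector by closeness to the centroid (higher = more central)."""
    c = centroid(vectors)
    return [1.0 / (1.0 + math.dist(v, c)) for v in vectors]


def representativeness(image_vectors, document_vectors, alpha=0.5):
    """Merge the image-side and document-side scores of each partial region."""
    img = closeness_scores(image_vectors)
    doc = closeness_scores(document_vectors)
    return [alpha * i + (1.0 - alpha) * d for i, d in zip(img, doc)]


def select_candidates(scores, k=1):
    """Return the indices of the k highest-scoring partial regions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

A region scores highly only when it is typical of the cluster in both appearance and associated text, which is the property the combined score is meant to capture.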

In step S118, for the cluster selected in step S108, a classifier is trained using the candidate regions determined in step S116 as positive examples and partial regions that do not belong to the cluster as negative examples.
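The per-cluster training of step S118 can be sketched as follows. A plain perceptron stands in for whatever classifier an actual implementation uses; that substitution, together with the names `train_perceptron` and `decision` and the parameters `epochs` and `lr`, is an assumption made for illustration only.

```python
def train_perceptron(positives, negatives, epochs=100, lr=1.0):
    """Train a linear classifier (w, b) separating positives from negatives.

    positives: feature vectors of the cluster's candidate regions.
    negatives: feature vectors of partial regions outside the cluster.
    """
    dim = len(positives[0])
    w = [0.0] * dim
    b = 0.0
    samples = [(x, 1) for x in positives] + [(x, -1) for x in negatives]
    for _ in range(epochs):
        errors = 0
        for x, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified or on the boundary: update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                errors += 1
        if errors == 0:  # every training sample is classified correctly
            break
    return w, b


def decision(classifier, x):
    """Signed score; positive means x is judged to belong to the cluster."""
    w, b = classifier
    return sum(wi * xi for wi, xi in zip(w, x)) + b
```

Running this once per cluster and collecting the resulting `(w, b)` pairs gives a list that plays the role of the image dictionary output in step S122.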

In step S120, it is determined whether steps S108 to S118 have been completed for all clusters. If not, the process returns to step S108, selects another cluster, and repeats the processing; if so, the process proceeds to step S122.

In step S122, the classifiers trained in step S118 for all clusters are output as the image dictionary, and the process ends.

<Operation of the Image Representation Device According to the First Embodiment of the Present Invention>

Next, the operation of the image representation device 200 according to the first embodiment of the present invention will be described. On receiving an image from the image database 210, the image representation device 200 executes the image representation processing routine shown in FIG. 4.

First, in step S200, the image received from the image database 210 is read.

In step S202, the image read in step S200 is divided into one or more partial regions.

In step S204, a feature amount is extracted for each partial region in the set of partial regions produced in step S202.

In step S206, for each partial region, the confidence with which the partial region belongs to each cluster is calculated on the basis of the feature amount extracted in step S204 and the image dictionary 236 output by the image dictionary construction processing routine described above.

In step S208, on the basis of the confidences calculated in step S206, it is determined for each partial region whether it belongs to one of the clusters or to none of them.

In step S210, a histogram giving, for each cluster, the frequency with which partial regions were determined to belong to that cluster is constructed from the results of the determination in step S208.

In step S212, the histogram constructed in step S210 is output as the image representation, and the process ends.
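Steps S206 to S210 can be sketched as below. The sketch assumes that the image dictionary is a list of linear classifiers `(w, b)`, that the confidence is the signed linear score, and that a partial region is assigned to at most one cluster (the highest-scoring one) when its score exceeds a hypothetical `threshold`. These are illustrative assumptions rather than details given in this passage.

```python
def represent_image(region_features, dictionary, threshold=0.0):
    """Build the per-cluster histogram for one image.

    region_features: one feature vector per partial region of the image.
    dictionary: list of (w, b) linear classifiers, one per cluster.
    """
    if not dictionary:
        return []
    histogram = [0] * len(dictionary)
    for x in region_features:
        # Confidence of this region under each cluster's classifier.
        scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for w, b in dictionary]
        best = max(range(len(scores)), key=lambda k: scores[k])
        if scores[best] > threshold:
            histogram[best] += 1
        # Otherwise the region belongs to no cluster and is not counted.
    return histogram
```

The returned list has one bin per cluster, so images with different numbers of partial regions still map to vectors of the same length.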

As described above, the image dictionary construction device according to the first embodiment of the present invention divides each input image into partial regions, classifies the partial regions into clusters, determines for each cluster the candidate regions that represent it on the basis of the image feature vectors of the partial regions and the document data corresponding to the entire image containing each partial region, and trains a classifier using the candidate regions as positive examples. This makes it possible to construct an image dictionary for obtaining an image representation that can discover meaningful, characteristic regions in an image.

Similarly, the image representation device according to the first embodiment of the present invention divides the input image into partial regions, determines for each partial region, on the basis of its feature amount and the image dictionary, whether it belongs to one of the clusters or to none of them, and outputs as the image representation of the input image a histogram giving, for each cluster, the frequency with which partial regions were determined to belong to it. This makes it possible to obtain an image representation that can discover meaningful, characteristic regions in an image.

<Configuration of the Image Dictionary Construction Device According to the Second Embodiment of the Present Invention>

Next, the configuration of the image dictionary construction device according to the second embodiment of the present invention will be described. Parts configured in the same way as in the image dictionary construction device 100 of the first embodiment are given the same reference numerals, and their description is omitted.

As shown in FIG. 1, the image dictionary construction device 100 according to the second embodiment of the present invention can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the image dictionary construction processing routine described later. Functionally, as shown in FIG. 1, the image dictionary construction device 100 includes the image database 10, the calculation unit 20, and the image dictionary 50.

The image database 10 according to the second embodiment is assumed to store either the images themselves or addresses that uniquely identify the location of each image file. It is further assumed to store document data corresponding to one or more of the stored images. This document data describes the semantic content of a portion of the image.

The document data is also assumed to be stored together with position information indicating which portion of the image it describes. For example, for a portion enclosed by a rectangle, it suffices to give four values as position information: the horizontal and vertical pixel position, and the width and height. Because this makes it more specific where in the image each piece of semantic content appears, an image dictionary and an image representation that capture the semantic content more precisely can be obtained.

The calculation unit 20 according to the second embodiment includes the partial region dividing unit 30, the feature amount extraction unit 32, the classification unit 34, the candidate region determination unit 36, and the classifier learning unit 38.

The candidate region determination unit 36 according to the second embodiment receives the classification result output by the classification unit 34 and, for each cluster, determines candidate regions, that is, partial regions that represent the cluster, on the basis of the feature amount of each partial region classified into that cluster by the classification unit 34 and the document data, stored in the image database 10, that corresponds to a portion of the image containing that partial region.

As its first process, the candidate region determination unit 36 according to the second embodiment establishes the correspondence between partial regions and document data. As described above, the image database 10 stores each image (or an address that uniquely identifies its location) in association with document data describing the semantic content of a portion of that image. Therefore, by checking which image a partial region was extracted from, the partial region can be associated with the document data linked to that image. In the second embodiment, the document data describes a portion of the image and the position information of that portion is stored with it, so partial regions and document data are associated directly on the basis of this position information. For example, they can be associated using the overlap ratio of the two areas. That is, the document data is associated with the partial region when the ratio of the size of the intersection of the area to which the document data is assigned and the partial region to the size of their union is at or above a threshold (for example, 0.5). Suppose, for example, that a piece of document data is assigned to the image position (horizontal 36 pixels, vertical 56 pixels) with size (width 18 pixels, height 24 pixels), and that a partial region is taken at position (horizontal 40 pixels, vertical 60 pixels) with size (width 20 pixels, height 20 pixels). The overlap ratio is then (36 + 18 - 40) × 20 / (18 × 24 + 20 × 20 - (36 + 18 - 40) × 20) = 280 / 552 ≈ 0.51. If the threshold is set to 0.5, the document data is associated with this partial region. The second to fourth processes that follow the first process in the candidate region determination unit 36 according to the second embodiment are the same as the second to fourth processes in the candidate region determination unit 36 according to the first embodiment.
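The overlap ratio in the worked example is the intersection area divided by the union area of the two rectangles. The following sketch reproduces the numbers from the text; the function name `overlap_ratio` and the `(x, y, width, height)` tuple layout are illustrative choices.

```python
def overlap_ratio(a, b):
    """Intersection-over-union of two (x, y, width, height) rectangles."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)  # width of the overlap
    ih = min(ay + ah, by + bh) - max(ay, by)  # height of the overlap
    if iw <= 0 or ih <= 0:
        return 0.0  # the rectangles do not overlap
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union


doc_region = (36, 56, 18, 24)      # area to which the document data is assigned
partial_region = (40, 60, 20, 20)  # partial region cut from the image
ratio = overlap_ratio(doc_region, partial_region)
print(round(ratio, 2))  # 0.51 (280 / 552)
print(ratio >= 0.5)     # True: the document data is associated with the region
```

The overlap is 14 pixels wide (54 - 40) and 20 pixels high (80 - 60), which gives the intersection of 280 used in the text.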

The other configurations and operations of the image dictionary construction device according to the second embodiment are the same as those of the image dictionary construction device 100 of the first embodiment, and a detailed description is therefore omitted.

<Configuration of the Image Representation Device According to the Second Embodiment of the Present Invention>

Next, the configuration of the image representation device according to the second embodiment of the present invention will be described. Parts configured in the same way as in the image representation device 200 of the first embodiment are given the same reference numerals, and their description is omitted.

As shown in FIG. 2, the image representation device 200 according to the second embodiment of the present invention can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the image representation processing routine described later. Functionally, as shown in FIG. 2, the image representation device 200 includes the image database 210 and the calculation unit 220.

The other configurations and operations of the image representation device according to the second embodiment are the same as those of the image representation device 200 of the first embodiment, and a detailed description is therefore omitted.

As described above, the image dictionary construction device according to the second embodiment of the present invention divides each input image into partial regions, classifies the partial regions into clusters, determines for each cluster the candidate regions that represent it on the basis of the image feature vectors of the partial regions and the document data corresponding to a portion of the image containing each partial region, and trains a classifier using the candidate regions as positive examples. This makes it possible to construct an image dictionary for obtaining an image representation that can discover meaningful, characteristic regions in an image.

Similarly, the image representation device according to the second embodiment of the present invention divides the input image into partial regions, determines for each partial region, on the basis of its feature amount and the image dictionary, whether it belongs to one of the clusters or to none of them, and outputs as the image representation of the input image a histogram giving, for each cluster, the frequency with which partial regions were determined to belong to it. This makes it possible to obtain an image representation that can discover meaningful, characteristic regions in an image.

Furthermore, the image dictionary construction device of the embodiments described above uses as reference information not only image features but also the accompanying document data that indicates semantic content. This makes it possible to discover partial regions that express the semantic content of subjects and scenes, and to construct an image dictionary from them.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, to associate partial regions with document data and construct document vectors, the first embodiment described above uses document data corresponding to the entire image and the second embodiment uses document data corresponding to a portion of the image. The invention is not limited to this, however, and document data corresponding to both the entire image and a portion of it may be used. When both kinds of document data exist for a partial region, the document vector may be constructed giving priority to the document data corresponding to the portion, or it may be constructed from the combination of the document data corresponding to the entire image and the document data corresponding to the portion.
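The two alternatives described for this variation can be sketched as a small selection function. The name `document_for_region` and the `mode` values `"prefer_region"` and `"merge"` are hypothetical labels for the two options in the text, and joining documents with a space is an assumption about how they would be combined.

```python
def document_for_region(image_doc, region_docs, mode="prefer_region"):
    """Choose the text from which a partial region's document vector is built.

    image_doc: document data for the entire image (may be None or empty).
    region_docs: document data for image portions matched to this region.
    """
    if mode == "prefer_region":
        # Use portion-level data when any exists, else fall back to the image.
        if region_docs:
            return " ".join(region_docs)
        return image_doc or ""
    if mode == "merge":
        # Combine image-level and portion-level data into one document.
        parts = ([image_doc] if image_doc else []) + list(region_docs)
        return " ".join(parts)
    raise ValueError(f"unknown mode: {mode}")
```

The returned string can then be passed to whatever routine builds the document vector for that partial region.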

10, 210 Image database
20, 220 Calculation unit
30, 230 Partial region dividing unit
32, 232 Feature amount extraction unit
34 Classification unit
36 Candidate region determination unit
38 Classifier learning unit
50, 236 Image dictionary
100 Image dictionary construction device
200 Image representation device
234 Expression unit
250 Output unit

Claims

An image dictionary construction method performed by an image dictionary construction device that includes a partial region dividing unit, a feature amount extraction unit, a classification unit, a candidate region determination unit, and a classifier learning unit, and that constructs an image dictionary from one or more input images and document data corresponding to each of the images, the method including:
a step in which the partial region dividing unit divides each of the one or more input images into one or more partial regions;
a step in which the feature amount extraction unit extracts a feature amount for each partial region included in the set of partial regions produced by the partial region dividing unit;
a step in which the classification unit classifies each partial region in the set of partial regions into one of one or more clusters on the basis of similarity between the feature amounts extracted by the feature amount extraction unit;
a step in which the candidate region determination unit determines, for each of the clusters, a candidate region that is a partial region representing the cluster, on the basis of the feature amount of each partial region classified into the cluster by the classification unit and the input document data corresponding to the image that includes the partial region; and
a step in which the classifier learning unit, for each of the clusters, trains a classifier for identifying whether a partial region belongs to the cluster, using the feature amount of the candidate region determined by the candidate region determination unit as a positive example and the feature amounts of partial regions not classified into the cluster as negative examples, and outputs the classifiers obtained for the clusters as an image dictionary.

An image dictionary construction device that constructs an image dictionary from one or more input images and document data corresponding to each of the images, the device including:
a partial region dividing unit that divides each of the one or more input images into one or more partial regions;
a feature amount extraction unit that extracts a feature amount for each partial region included in the set of partial regions produced by the partial region dividing unit;
a classification unit that classifies each partial region in the set of partial regions into one of one or more clusters on the basis of similarity between the feature amounts extracted by the feature amount extraction unit;
a candidate region determination unit that determines, for each of the clusters, a candidate region that is a partial region representing the cluster, on the basis of the feature amount of each partial region classified into the cluster by the classification unit and the input document data corresponding to the image that includes the partial region; and
a classifier learning unit that, for each of the clusters, trains a classifier for identifying whether a partial region belongs to the cluster, using the feature amount of the candidate region determined by the candidate region determination unit as a positive example and the feature amounts of partial regions not classified into the cluster as negative examples, and outputs the classifiers obtained for the clusters as an image dictionary.

An image representation method performed by an image representation device that includes a partial region dividing unit, a feature amount extraction unit, and an expression unit, the method including:
a step in which the partial region dividing unit divides an input image into one or more partial regions;
a step in which the feature amount extraction unit extracts a feature amount for each of the partial regions; and
a step in which the expression unit calculates, for each of the partial regions, the confidence with which the partial region belongs to each of the clusters, on the basis of the feature amount extracted by the feature amount extraction unit and the image dictionary output by the image dictionary construction method according to claim 1, determines on the basis of the calculated confidence whether the partial region belongs to one of the clusters or to none of the clusters, and outputs, as an image representation of the input image, a histogram giving for each of the clusters the frequency with which partial regions were determined to belong to the cluster.

An image representation device including:
a partial region dividing unit that divides an input image into one or more partial regions;
a feature amount extraction unit that extracts a feature amount for each of the partial regions; and
an expression unit that calculates, for each of the partial regions, the confidence with which the partial region belongs to each of the clusters, on the basis of the feature amount extracted by the feature amount extraction unit and the image dictionary output by the image dictionary construction device according to claim 2, determines on the basis of the calculated confidence whether the partial region belongs to one of the clusters or to none of the clusters, and outputs, as an image representation of the input image, a histogram giving for each of the clusters the frequency with which partial regions were determined to belong to the cluster.

A program for causing a computer to execute each step of the image dictionary construction method according to claim 1 or the image representation method according to claim 3.