JP2012022419A - Learning data creation device, learning data creation method, and program

Info

Publication number: JP2012022419A
Application number: JP2010158360A
Authority: JP
Inventors: Stejic Zoran; ゾランステイチ
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2010-07-13
Filing date: 2010-07-13
Publication date: 2012-02-02
Anticipated expiration: 2030-07-13
Also published as: JP4926266B2

Abstract

[Problem] To automatically create learning data, without human intervention, from image data collected from the web.
[Solution] From the classification state of region images across clusters, which is based on the number of region images belonging to each cluster, clusters with a sparse distribution of region images are identified, and the region images belonging to those clusters are removed from the image data to create learning data. Because image regions that are not common across images are progressively removed, the learning data created from image data collected by keyword search retains the image regions that are common across images (the object that the keyword represents).
[Selected drawing] FIG. 1

Description

The present invention relates to a learning data creation device and the like that creates learning data used for object recognition.

Object recognition, a technique for recognizing objects such as physical items contained in an image, requires that feature quantities (numerical representations of image features such as color scheme, texture, and shape) be prepared for each object. The feature quantities of an object are obtained by preparing a large amount of image data for learning and applying machine learning or the like to those images.

Therefore, highly accurate object recognition requires a large amount of learning data that correctly represents the object. Learning data is generally produced by a person who visually checks the content of each image and labels the images that contain the object, so it requires manual labor.

In recent years, web search has become widespread and it has become possible to collect large amounts of data from the web, so related images can now be collected by running a web search with a keyword that represents the object.

However, because the search index used in web search is generated from keywords contained in web pages, a retrieved image does not necessarily contain an object that the keyword represents.

Moreover, even when the object is present, some images are shot under conditions unsuitable for learning data: for example, the object is small because it was photographed from a distance, or the lighting is insufficient. Consequently, images collected by web search still had to be sorted by hand or have the relevant part cropped out, which again took an enormous amount of effort.

As a technique for reducing the human effort of judging whether learning data is correct, a technique is known that divides images into multiple regions, clusters the resulting region images, has the user select positive examples from the resulting clusters, and creates learning data (an image dictionary) based on that selection (see Patent Document 1).

Patent Document 1: JP 2009-282660 A

However, the technique of Patent Document 1 still requires a human judgment, namely selecting positive examples from among the clusters, and that judgment becomes complicated and burdensome as the number of clusters grows.

The present invention was made in view of the above problem, and its object is to automatically create learning data, without human intervention, from image data collected from the web.

To achieve the above object, a first invention is a learning data creation device that creates learning data for object recognition from a plurality of image data collected by a keyword-based web search, the device comprising: cluster classification means that classifies region images detected from the collected image data into predetermined clusters based on the feature quantities of the region images, and generates, for each image data, the clusters to which its region images belong and the number of region images belonging to each of those clusters; non-common region identification means that identifies clusters with a sparse distribution of region images from the classification state of the region images across the clusters, which is based on the number of region images belonging to each cluster; and learning data creation means that creates learning data by removing the region images belonging to the identified clusters from the image data in which those region images were detected.

According to the first invention, clusters with a sparse distribution of region images are identified from the classification state of region images across clusters, which is based on the number of region images belonging to each cluster, and learning data is created by removing the region images belonging to those clusters from the image data. Because image regions that are not common across images are progressively removed, the image regions that are common across images remain in the learning data. Within image data collected by keyword search, the image regions that are additionally common across images are presumed to contain the object that the keyword represents. Learning data can therefore be created automatically, without human intervention, from image data collected from the web.

In a second invention, the non-common region identification means identifies clusters with a sparse distribution of region images based on, for each cluster into which region images were classified, the number of image data from which the region images classified into that cluster were detected.

According to the second invention, clusters with a sparse distribution of region images are identified based on the number of source image data of the region images classified into each cluster, so the common image regions can be identified according to which clusters each image was classified into.

A third invention further comprises quality determination means that determines whether the image data that would result from removing the region images is suitable as learning data, based on the ratio of the number of region images of that image data that belong to the clusters identified by the non-common region identification means to the number of region images detected from that image data, wherein the learning data creation means creates the learning data by removing the region images from image data that the quality determination means has determined to be suitable as learning data.

According to the third invention, learning data is created depending on whether the image data that would result from removing the region images is suitable as learning data, so high-quality learning data suited to object recognition can be created.
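The ratio used by the quality determination means can be illustrated with a minimal sketch. The function names, the threshold `max_ratio`, and the direction of the comparison (an image is rejected when too large a share of its region images would be removed) are assumptions made for illustration; the text above only states that the decision is based on the ratio.

```python
def non_common_ratio(region_vkids, non_common_vkids):
    """Share of an image's region images that belong to non-common clusters.

    region_vkids: cluster ID of every region image detected in one image.
    non_common_vkids: set of cluster IDs identified as sparsely populated.
    """
    if not region_vkids:
        raise ValueError("image has no region images")
    flagged = sum(1 for vk in region_vkids if vk in non_common_vkids)
    return flagged / len(region_vkids)


def is_suitable(region_vkids, non_common_vkids, max_ratio=0.5):
    # max_ratio is a hypothetical threshold, not a value taken from the text.
    return non_common_ratio(region_vkids, non_common_vkids) <= max_ratio


regions = ["VK1", "VK1", "VK2", "VK4"]       # toy data
ratio = non_common_ratio(regions, {"VK2"})   # one of four regions is flagged
```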

In a fourth invention, the learning data creation means removes the region images from the outside of the image data, based on the positions at which the region images classified into the identified clusters were detected in the image data.

According to the fourth invention, the region images are removed from the outside of the image based on the detection positions of the region images classified into the identified clusters, which prevents the object from being partially cut off in the image data created as learning data.

According to the present invention, learning data can be created automatically, without human intervention, from image data collected from the web.

Block diagram showing the functional configuration of the learning data creation device according to the present invention.
Flowchart of the feature vector generation processing.
Diagram showing the extraction of region images from image data and their mapping to visual keywords.
Flowchart of the learning data creation processing.
Conceptual diagram for explaining the identification of non-common regions and related steps.
Diagram showing an example of learning data creation.
Diagram for explaining another example of removing non-common regions.

[Configuration of image search device]
Hereinafter, an embodiment of the present invention is described with reference to the drawings.
FIG. 1 is a functional block diagram of a learning data creation device 1 to which the present invention is applied. The learning data creation device 1 is connected to the Internet via a communication network and can collect image data from the web over the Internet. From the collected data, it cuts out and selects image regions that contain the object, and thereby creates learning data used for object recognition.

The learning data creation device 1 of this embodiment identifies image regions that are not common across image data, determines that the identified image regions do not contain the common object, and creates learning data accordingly. The visual keyword technique is used to identify these non-common image regions.

Visual keywords are a technique that regards an image as a set of many small image regions and generates an index (feature vector) for the image based on the feature quantities obtained from the image regions that make up each image (hereinafter referred to as “region images” or “partial images” as appropriate). It can be seen as an application of text techniques that derive the features of a document from the keywords in its text.

With visual keywords, region images (visual segments) in an image are treated as keywords, so a feature vector representing a whole image can be generated by analyzing the image down to fine partial regions. In addition, text techniques that analyze a document as a set of words (keywords), such as inverted indexes, vector space models, and word frequencies, can be applied to image feature vectors, which makes large-scale, high-speed processing achievable.

Reference literature on image search using visual keywords includes the following:
・Sivic and Zisserman: “Efficient visual search for objects in videos”, Proceedings of the IEEE, Vol. 96, No. 4, pp. 548-566, Apr. 2008.
・Yang and Hauptmann: “A text categorization approach to video scene classification using keypoint features”, Carnegie Mellon University Technical Report, pp. 25, Oct. 2006.
・Jiang and Ngo: “Bag-of-visual-words expansion using visual relatedness for video indexing”, Proc. 31st ACM SIGIR Conf., pp. 769-770, Jul. 2008.
・Jiang, Ngo, and Yang: “Towards optimal bag-of-features for object categorization and semantic video retrieval”, Proc. 6th ACM CIVR Conf., pp. 494-501, Jul. 2007.
・Yang, Jiang, Hauptmann, and Ngo: “Evaluating bag-of-visual-words representations in scene classification”, Proc. 15th ACM MM Conf., Workshop on MMIR, pp. 197-206, Sep. 2007.

As shown in FIG. 1, the learning data creation device 1 comprises an image collection unit 10, an image DB (database) 15, a visual keyword generation unit 20, a visual keyword DB 25, a feature vector generation unit 30, a region management DB 35, a feature vector DB 40, a non-common region identification unit 50, a quality determination unit 60, a learning data creation unit 70, and a learning data DB 75.

These functional units are implemented by a computer, through the cooperation of a CPU (Central Processing Unit) as the arithmetic/control device, RAM (Random Access Memory) and ROM (Read Only Memory) as storage media, a communication interface, and the like.

The image collection unit 10 is a functional unit that collects image data from the web via the Internet. It obtains web pages associated with a predetermined keyword, for example by sending the keyword to a search engine. It then extracts the image data contained in those web pages and stores the keyword and the image data in association with each other in the image DB 15.

The search engine may also be an image search engine that searches image data directly. In that case, the image collection unit 10 receives the image data of the search results returned in response to the keyword and stores it in the image DB 15.

The image DB 15 is a database that accumulates the image data collected by the image collection unit 10. As shown in FIG. 1, it stores a keyword, an image ID, and image data in association with one another. The image ID is identification information that uniquely identifies each image data, and is assigned by the image collection unit 10 when the keyword and image data are stored.

The visual keyword generation unit 20 generates the classes (clusters) to which region images within an image are mapped when a feature vector of the image data is generated. It extracts a plurality of region images from images used for image search or from image data prepared in advance for learning, and clusters those region images based on their feature quantities. Standard clustering methods such as k-means or Hierarchical Agglomerative Clustering (HAC) are used.
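As an illustration of the clustering step, the following is a minimal sketch of plain k-means (Lloyd's algorithm) over region-image feature vectors. It is not the device's implementation: the function name, the naive initialisation from the first `k` points, and the fixed iteration count are assumptions chosen to keep the example short and deterministic, and `k` must not exceed the number of points.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster equal-length feature tuples; returns (centers, assignment)."""
    centers = [tuple(p) for p in points[:k]]  # naive, deterministic start
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


# Toy 2-D features forming two obvious groups.
features = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, assignment = kmeans(features, k=2)
```

Each resulting cluster would play the role of one visual keyword. A production system would more likely rely on an established clustering library and a better initialisation.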

The feature vector generation unit 30, described later, generates a feature vector by mapping (classifying) the region images detected from an image to the clusters formed by the clustering of the visual keyword generation unit 20. Each such cluster is called a “visual keyword”, in the sense of a feature space for representing an image as a collection of visual keywords.

The visual keyword DB 25 is a database that stores, in association with one another, a visual keyword ID (VKID) identifying each cluster formed by the clustering of the visual keyword generation unit 20, the center coordinates of the cluster, that is, the coordinates of its center point in the feature space (a multidimensional space), and a radius indicating the extent of the cluster.

The center coordinates are the mean of the feature quantities of the images belonging to the cluster, expressed as multidimensional coordinates in the feature space. The radius is obtained, for example, as the distance from the center coordinates to the farthest image belonging to the cluster.
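The center coordinates and radius described above can be computed as in the following sketch. The function name is hypothetical and Euclidean distance is an assumption; the text does not name a distance measure.

```python
import math


def cluster_center_and_radius(features):
    """Return (center, radius) for the feature tuples of one cluster.

    center: per-dimension mean of the member features.
    radius: distance from the center to the farthest member.
    """
    center = tuple(sum(col) / len(features) for col in zip(*features))
    radius = max(math.dist(center, f) for f in features)
    return center, radius


# Two members in a 2-D feature space (toy values).
center, radius = cluster_center_and_radius([(0.0, 0.0), (6.0, 8.0)])
```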

The feature vector generation unit 30 generates a feature vector for each image data by performing feature vector generation processing (see FIG. 2), which extracts region images from the image data and generates a feature vector based on the feature quantities of those region images. The feature vector generation processing is described later.

The region management DB 35 is a database that stores the correspondence among the region images detected from each image data by the feature vector generation unit 30, the visual keywords to which they are mapped, and the image data from which they were detected. As shown in FIG. 1, it stores an image ID, a region ID, and a VKID in association with one another.

The feature vector DB 40 is a database that stores the feature vectors generated by the feature vector generation unit 30 for each image. As shown in FIG. 1, it stores an image ID in association with the feature vector, namely the frequency of region images per visual keyword.

The feature vector generation processing is now described with reference to the flowchart of FIG. 2 and the conceptual diagram of FIG. 3.

First, the feature vector generation unit 30 reads image data stored in the image DB 15 and detects a plurality of region images from it (step S11). Region images can be detected either by detecting characteristic regions (feature regions) in the image or by dividing the image into predetermined regions.

Techniques for detecting feature regions include the following:
・Harris-affine
・Hessian-affine
・Maximally stable extremal regions (MSER)
・Difference of Gaussians (DoG)
・Laplacian of Gaussian (LoG)
・Determinant of Hessian (DoH)

Feature region detection techniques are published in, for example, “Local Invariant Feature Detectors: A Survey” (Foundations and Trends in Computer Graphics and Vision, Vol. 3, No. 3, pp. 177-280, 2007), and any known technique may be adopted as appropriate.

Methods of detecting region images by dividing the image into predetermined regions include, for example, dividing the image into a predetermined M×N grid of blocks, or dividing it so that each block has a predetermined size of m×n pixels. For example, when an image of 640×480 pixels is divided into 10×10 blocks, each block is 64×48 pixels.
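The block arithmetic in the example above can be checked with a short sketch. The function name and the (left, top, right, bottom) box convention are assumptions, as is the choice to ignore leftover pixels when the image size is not an exact multiple of the grid.

```python
def split_into_blocks(width, height, cols, rows):
    """Return pixel boxes (left, top, right, bottom) for a cols x rows grid.

    Integer division is used, so any remainder pixels at the right and
    bottom edges are left out of the grid.
    """
    block_w, block_h = width // cols, height // rows
    return [
        (c * block_w, r * block_h, (c + 1) * block_w, (r + 1) * block_h)
        for r in range(rows)
        for c in range(cols)
    ]


# 640x480 image divided into 10x10 blocks: each block is 64x48 pixels.
blocks = split_into_blocks(640, 480, cols=10, rows=10)
```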

FIG. 3 shows an example in which images are divided into predetermined regions: image No. 0001 is divided into 7×6 blocks, image No. 0002 into 5×7 blocks, and image No. 0003 into 6×6 blocks. The illustrated example uses only a few blocks to simplify the explanation; in practice an image is divided into tens of thousands of blocks.

Next, the feature vector generation unit 30 calculates the feature quantity of each detected region image (step S12). When feature regions have been extracted, local features that are robust to affine transformations such as scale change, rotation, and angle change are extracted. Examples of local features include the following:

・SIFT
・Gradient location and orientation histogram
・Shape context
・PCA-SIFT
・Spin images
・Steerable filters
・Differential invariants
・Complex filters
・Moment invariants

Local feature extraction is published in, for example, “A performance evaluation of local descriptors” (IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 27, No. 10, pp. 1615-1630, 2005), and any known technique may be adopted as appropriate.

A feature vector generated from feature quantities extracted from feature regions is effective as an indicator of the features of objects in the image, because it is generated from feature regions where an object is likely to be present.

When region images have been extracted by region division, image feature quantities that numerically express the features of each image, such as color scheme, texture, and shape, are used. A feature vector generated from the feature quantities of region images detected by region division is effective as an indicator of the overall composition of the image, because it is generated from every part that makes up the image.

The feature vector generation unit 30 then maps (classifies) the region images detected from the image data to visual keywords based on their feature quantities (step S13). A region image is mapped by selecting the visual keyword whose center point is closest to the region image's feature quantity in the feature space.
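The nearest-center mapping of step S13 can be sketched as follows. The function name, the dictionary layout of the centers, and the use of Euclidean distance are assumptions for illustration only.

```python
import math


def map_to_visual_keyword(feature, centers):
    """Return the VKID whose center is nearest to the given feature.

    centers: dict mapping VKID -> center coordinates in the feature space.
    """
    return min(centers, key=lambda vkid: math.dist(feature, centers[vkid]))


# Toy 2-D centers for two visual keywords.
vk_centers = {"VK1": (0.0, 0.0), "VK2": (10.0, 10.0)}
nearest_a = map_to_visual_keyword((1.0, 2.0), vk_centers)
nearest_b = map_to_visual_keyword((9.0, 8.0), vk_centers)
```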

In the example of FIG. 3, region images T1 and T3 to T6 detected from the image with image ID '0001' are mapped to visual keyword VK1, and region image T2 is mapped to visual keyword VK2. Region images T12 to T14 detected from the image with image ID '0002' are mapped to visual keyword VK3. Region image T11 of the image with image ID '0002' and region image T21 of the image with image ID '0003' are both mapped to visual keyword VK4.

Once each region image has been mapped to a visual keyword (cluster), the feature vector generation unit 30 counts the frequency of region images for each visual keyword, generates a multidimensional feature vector from these per-keyword frequencies, and stores it in the feature vector DB 40 (step S14).

For example, for image '0001' in FIG. 3, the frequency of region images detected from the image is '5' for visual keyword VK1, '1' for visual keyword VK2, and '0' for visual keyword VK3. A feature vector is generated whose elements are these frequencies for the respective visual keywords.
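The frequency count of step S14 amounts to a histogram over visual keywords. The sketch below reproduces the figures given for image '0001' (five region images in VK1, one in VK2, none in VK3); the function name and the fixed vocabulary order are assumptions.

```python
from collections import Counter


def feature_vector(region_vkids, vocabulary):
    """Count how many region images fall into each visual keyword.

    region_vkids: VKID assigned to each region image of one image.
    vocabulary: ordered list of all VKIDs; it fixes the vector layout.
    """
    counts = Counter(region_vkids)
    return [counts[vk] for vk in vocabulary]


# Image '0001': T1 and T3 to T6 map to VK1, T2 maps to VK2.
regions_0001 = ["VK1", "VK2", "VK1", "VK1", "VK1", "VK1"]
vector_0001 = feature_vector(regions_0001, ["VK1", "VK2", "VK3"])
print(vector_0001)  # [5, 1, 0]
```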

The feature vector generation unit 30 also assigns a region ID to each region image detected from the image data, and stores the VKID of the visual keyword to which the region image was mapped in the region management DB 35, in association with the image ID and the region ID. The region ID may be the XY coordinates within the image, or the row and column numbers used when the image was divided into regions.
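One way to picture the records held in the region management DB 35 is the following sketch, which uses (row, column) tuples as region IDs. The record type, the lookup helper, and the sample positions are hypothetical; only the three associated fields (image ID, region ID, VKID) come from the description.

```python
from collections import defaultdict
from typing import NamedTuple, Tuple


class RegionRecord(NamedTuple):
    image_id: str
    region_id: Tuple[int, int]  # (row, column) of the block in the grid
    vkid: str


def regions_by_image(records, vkid):
    """Group the region IDs mapped to one visual keyword by source image."""
    found = defaultdict(list)
    for rec in records:
        if rec.vkid == vkid:
            found[rec.image_id].append(rec.region_id)
    return dict(found)


records = [
    RegionRecord("0001", (0, 0), "VK1"),
    RegionRecord("0001", (0, 1), "VK2"),
    RegionRecord("0002", (0, 0), "VK4"),
    RegionRecord("0003", (2, 1), "VK4"),
]
vk4_regions = regions_by_image(records, "VK4")
```

A lookup of this kind is what later allows the regions belonging to a given cluster to be located in their source images.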

非共通領域特定部５０は、特徴ベクトル生成部３０により生成された特徴ベクトルを用いて、画像データ間を亘って非共通となる画像領域を特定する。詳細については、後述するが、簡単に説明すると、領域画像を検出する元となった複数の画像データ（以下「元画像」ともいう。）毎に生成された特徴ベクトルに基づいて、元画像がマッピングされたビジュアルキーワード、即ち、元画像の同一特徴量空間への分布状態を算出して、その分布状態によって画像間での非共通となる画像領域を判定する。 The non-common area specifying unit 50 uses the feature vector generated by the feature vector generating unit 30 to specify an image area that is not common across the image data. Although details will be described later, in brief, based on a feature vector generated for each of a plurality of image data (hereinafter also referred to as “original images”) from which the region image is detected, the original image is A mapped visual keyword, that is, a distribution state of the original image in the same feature amount space is calculated, and an image region that is not common between the images is determined based on the distribution state.

品質判定部６０は、非共通領域特定部５０により判定された非共通の画像領域を、元画像から除去した場合の、その除去後の画像データが学習データとして適しているか否かの品質を判定する。これは、オブジェクト認識に用いる学習データがオブジェクトの特徴を一定以上の品質で十分に表現していることが好ましいからであり、品質判定部６０は、非共通領域を元画像から除去した場合の領域画像のビジュアルキーワードへのマッピング状態に基づいて品質を判定する。 The quality determination unit 60 determines the quality of whether or not the removed image data is suitable as learning data when the non-common image region determined by the non-common region specifying unit 50 is removed from the original image. To do. This is because it is preferable that the learning data used for object recognition sufficiently express the characteristics of the object with a certain level of quality, and the quality determination unit 60 is an area when the non-common area is removed from the original image. The quality is determined based on the mapping state of the image to the visual keyword.

学習データ作成部７０は、品質判定部６０によって学習データに適していると判定された画像データに対して、非共通領域特定部５０によって非共通領域として特定された領域画像を除去して学習データを作成し、学習データＤＢ７５に記憶する。 The learning data creation unit 70 removes the region image specified as the non-common region by the non-common region specifying unit 50 from the image data determined to be suitable for the learning data by the quality determination unit 60, and the learning data Is stored in the learning data DB 75.

学習データＤＢ７５は、キーワードと学習データとを対応付けて記憶するデータベースであり、学習データ作成部７０が作成した学習データに、その学習データの作成元となった画像データに対応付けられた画像ＤＢ１５のキーワードが対応付けられて記憶される。 The learning data DB 75 is a database that stores keywords and learning data in association with each other. The learning data DB 75 associates the learning data created by the learning data creation unit 70 with the image data from which the learning data is created. Are stored in association with each other.

この学習データＤＢ７５に格納された学習データは、オブジェクト認識装置により利用されて、各キーワードで表されるオブジェクトの特徴が学習される。このオブジェクト認識装置が行う学習方法や特徴量の抽出方法等は、学習データ作成装置１のアルゴリズムに依存するものではなく、オブジェクト認識装置によって任意に設定されるものである。 The learning data stored in the learning data DB 75 is used by the object recognition device to learn the characteristics of the object represented by each keyword. The learning method, the feature amount extraction method, and the like performed by the object recognition device do not depend on the algorithm of the learning data creation device 1, but are arbitrarily set by the object recognition device.

In this embodiment, the image data to be stored as learning data is selected by the quality determination of the quality determination unit 60. Alternatively, the quality determination may be omitted, and learning data may be created simply by removing the image regions identified as non-common by the non-common region specifying unit 50 and then registered in the learning data DB 75.

[Learning data creation process]
Next, the learning data creation process executed by the non-common region specifying unit 50, the quality determination unit 60, and the learning data creation unit 70 is described with reference to the flowchart of FIG. 4 and the conceptual diagram of FIG. 5.

First, the non-common region specifying unit 50 calculates, for each image, the number of visual keywords to which that image's region images are mapped (the VK assignment count) and, for each visual keyword, the number of original images distributed to it (the original image distribution count).

Specifically, for the plurality of image data associated with the same keyword, the number of visual keywords to which the region images detected from each image are mapped is calculated per image (step S21). For example, as shown in FIG. 5, the region images detected from image IMG1 map to three visual keywords, VK1, VK2, and VK4, so its VK assignment count is 3.

In addition, for each visual keyword, the number of original images that have a region image mapped to that keyword is calculated as the original image distribution count (step S22). For example, in FIG. 5, region images from three original images, IMG1, IMG2, and IMG3, are mapped to visual keyword VK1, so its original image distribution count is 3.

Next, the non-common region specifying unit 50 selects one visual keyword (step S23) and determines whether the original image distribution count of that visual keyword is less than a predetermined threshold (step S24). If the count is less than the threshold (step S24; Yes), the region images mapped to that visual keyword are identified as non-common regions (step S25).

The non-common region specifying unit 50 then determines whether steps S23 to S25 have been performed for all visual keywords (step S26). If any visual keyword remains unprocessed, processing returns to step S23.

For example, if the threshold is set to 3 and the selected visual keyword is VK1, its original image distribution count of 3 is not less than the threshold, so the feature-space cluster represented by VK1 is judged to be a common region. Visual keyword VK2, as shown in FIG. 5, has a count below the threshold and is therefore judged to be a non-common region.
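Steps S21 to S26 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the dictionary `image_to_vks` and all function names are hypothetical, and only the IMG1 row and the VK1 count follow the FIG. 5 values quoted in the text. The other entries are invented so that six visual keywords exist and four of them fall below the threshold.

```python
from collections import defaultdict

# Hypothetical mapping: image id -> visual keywords its region images map to.
# Only IMG1 -> {VK1, VK2, VK4} and VK1 <- {IMG1, IMG2, IMG3} come from the text.
image_to_vks = {
    "IMG1": {"VK1", "VK2", "VK4"},
    "IMG2": {"VK1", "VK3", "VK4"},
    "IMG3": {"VK1", "VK4", "VK5"},
    "IMG4": {"VK2", "VK4", "VK6"},
}

def vk_assignment_counts(image_to_vks):
    """Step S21: number of distinct visual keywords per image."""
    return {img: len(vks) for img, vks in image_to_vks.items()}

def original_image_distribution_counts(image_to_vks):
    """Step S22: number of original images that map to each visual keyword."""
    counts = defaultdict(int)
    for vks in image_to_vks.values():
        for vk in vks:
            counts[vk] += 1
    return dict(counts)

def find_non_common(distribution_counts, threshold):
    """Steps S23 to S26: keywords whose count is below the threshold."""
    return {vk for vk, n in distribution_counts.items() if n < threshold}

assignment = vk_assignment_counts(image_to_vks)
distribution = original_image_distribution_counts(image_to_vks)
non_common = find_non_common(distribution, threshold=3)
```

With this assumed data, VK1 and VK4 survive as common regions and the other four keywords are marked non-common.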

In this way, by using the feature vectors generated with the visual keywords, it is possible to identify the visual keywords, that is, the clusters in the feature space, that are not common across the images.

The learning data creation device 1 of this embodiment removes the image regions belonging to these non-common visual keywords from the original images, so that the image regions common across the images are left behind. Because these common image regions share feature quantities within the set of images obtained by the keyword search, they can be regarded as containing the object represented by the keyword. Removing the non-common regions from the original images therefore yields images that contain the object.

The threshold used in step S24 to decide whether a region is non-common may be a constant, as described above, or may be set dynamically as P% (for example, 10%) of the total number of images collected for the keyword. Alternatively, the larger of the constant and P% of the total number of images may be selected.
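The three threshold options can be expressed in one small helper. The sketch below is only illustrative: the function name `non_common_threshold` and its parameters are assumptions, and the text does not specify any rounding of the P% value, so none is applied here.

```python
def non_common_threshold(total_images, constant=None, percent=None):
    """Return the step S24 threshold.

    constant : fixed number of images, or None to ignore it
    percent  : P, as a percentage of the collected images, or None to ignore it
    When both are given, the larger value is used.
    """
    candidates = []
    if constant is not None:
        candidates.append(constant)
    if percent is not None:
        candidates.append(total_images * percent / 100)
    if not candidates:
        raise ValueError("give a constant, a percentage, or both")
    return max(candidates)
```

For instance, with 200 collected images, a constant of 3 and P = 10 give a threshold of 20, while with only 10 images the constant 3 wins.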

When the non-common region specifying unit 50 determines in step S26 that all visual keywords have been processed (step S26; Yes), the quality determination unit 60 selects the original images one at a time (step S27) and performs steps S28 to S31 on each.

First, it determines whether the image that would result from removing, from the selected original image, the image regions belonging to the visual keywords identified as non-common satisfies the quality condition (step S28).

Specifically, the quality condition is judged to be satisfied if the ratio of the number of visual keywords over which each image remains distributed after the non-common regions are removed (the VK remaining count) to the total number of visual keywords that remain once the non-common visual keywords are removed from all images (the total VK remaining count) is at or above a predetermined value. This ratio is given by the following expression.

Distribution ratio of each image = VK remaining count / total VK remaining count

The VK remaining count is the number of visual keywords to which an image still has regions mapped after the image regions mapped to non-common visual keywords are removed from the original image. It is calculated for each image. In FIG. 5, it is the number of visual keywords excluding those enclosed by broken lines. For image IMG1, visual keywords VK1 and VK4 remain, so its VK remaining count is 2. For image IMG4, only the mapping to visual keyword VK4 remains, so its VK remaining count is 1.

The total VK remaining count is the total number of visual keywords that remain after the non-common visual keywords are removed from all visual keywords to which images are mapped. It is obtained for the collected images as a whole. In FIG. 5, six visual keywords have images mapped to them, of which four were identified as non-common regions, so the total VK remaining count is 2.

If the ratio of an image's VK remaining count to the total VK remaining count is at or below a predetermined threshold (for example, 0.5), the image is judged not to meet the quality required of learning data. Such a ratio means that, after the non-common regions are removed, the visual keywords left to express the object are few relative to the overall set (the total VK remaining count).
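The ratio and the pass/fail rule can be sketched as follows. As before, this is an assumption-laden illustration: `image_to_vks` and `non_common` are hypothetical data chosen to agree with the FIG. 5 figures quoted in the text (IMG1 keeps 2 keywords, IMG4 keeps 1, and 2 keywords remain overall), and the function names are not from the source. The check follows the wording above, under which a ratio equal to the threshold fails.

```python
# Hypothetical FIG. 5-style data; only the IMG1 and IMG4 outcomes follow the text.
image_to_vks = {
    "IMG1": {"VK1", "VK2", "VK4"},
    "IMG2": {"VK1", "VK3", "VK4"},
    "IMG3": {"VK1", "VK4", "VK5"},
    "IMG4": {"VK2", "VK4", "VK6"},
}
non_common = {"VK2", "VK3", "VK5", "VK6"}

def vk_remaining_ratios(image_to_vks, non_common):
    """Per-image ratio: VK remaining count / total VK remaining count."""
    all_vks = set().union(*image_to_vks.values())
    total_remaining = len(all_vks - non_common)
    ratios = {}
    for img, vks in image_to_vks.items():
        remaining = len(vks - non_common)
        # Guard against the case where every keyword was removed.
        ratios[img] = remaining / total_remaining if total_remaining else 0.0
    return ratios

def passes_quality(ratio, threshold=0.5):
    """An image fails when its ratio is at or below the threshold."""
    return ratio > threshold

ratios = vk_remaining_ratios(image_to_vks, non_common)
accepted = sorted(img for img, r in ratios.items() if passes_quality(r))
```

Under these assumptions IMG4 has a ratio of 0.5 and is rejected, while the other three images are kept.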

In other words, if few visual keywords (clusters in the feature space) remain after the non-common regions are removed, the removal is considered to have stripped away features needed to express the object, and the image is judged unsuitable as learning data.

Also, for example, if the non-common image regions are removed from an image in which the object appears small or in which only part of the object was photographed, the remaining image may become too small to be usable as learning data. Even in such cases, the quality determination described above makes it possible to select and register only learning data that is suitable for learning.

The threshold used in this quality determination may be a constant, as described above, or may be the average of the distribution ratios of the images. Alternatively, the larger of the constant and the average distribution ratio may be selected.
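A small helper can make the choice between these quality thresholds explicit. This is a sketch only; the name `quality_threshold` and its parameters are hypothetical, and the example ratios are the illustrative values used for the assumed FIG. 5 data rather than figures stated in the text.

```python
from statistics import mean

def quality_threshold(ratios, constant=None, use_mean=False):
    """Return the step S28 threshold.

    ratios   : distribution ratios of all images (a list of floats)
    constant : fixed threshold such as 0.5, or None to ignore it
    use_mean : include the average of the ratios as a candidate
    When both candidates are enabled, the larger one is used.
    """
    candidates = []
    if constant is not None:
        candidates.append(constant)
    if use_mean:
        candidates.append(mean(ratios))
    if not candidates:
        raise ValueError("enable a constant, the mean, or both")
    return max(candidates)

example_ratios = [1.0, 1.0, 1.0, 0.5]  # illustrative values only
```

With these example ratios the average is 0.875, so choosing the larger of the constant 0.5 and the average gives a stricter threshold than the constant alone.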

If the quality determination unit 60 determines that the image does not satisfy the quality condition (step S28; No), it deletes the image from the image DB 15 (step S30) and selects the next image (steps S31 → S27).

If it determines that the image satisfies the quality condition (step S28; Yes), learning data is created by removing from the image the image regions mapped to the non-common visual keywords, and the learning data is stored in the learning data DB 75 together with the keyword (step S29).

In this way, by removing the image regions identified as non-common across the images, based on the feature vectors generated from each image using the visual keywords, it is possible, as shown in FIG. 6, to cut out from images collected with a keyword such as "landmark" the image regions that represent the "landmark" object (the regions enclosed by broken lines) and so create learning data suitable for object recognition.

Also, when removing the regions identified as non-common would leave an image that does not meet the quality required of learning data, as with image IMG4 in FIG. 6, that image data is not registered as learning data, so accurate learning data can be created.

The embodiment described above is one example of applying the present invention, and the design may be modified as appropriate without departing from the purpose of the invention. Modifications of the present invention are described below.

[Changing the index of distribution over visual keywords]
First, in the example above, the original image distribution count was calculated as the measure of how images are distributed over the visual keywords. Instead, the number of image regions mapped to each visual keyword may be calculated and used to identify the non-common regions.

Specifically, in step S21 of FIG. 4, instead of the VK assignment count, the number of region images detected from each image is calculated per image as the detection count. In step S22, instead of the original image distribution count, the total number of region images mapped to each visual keyword is calculated as the VK mapping count. Then, in step S25, visual keywords whose VK mapping count is below a predetermined threshold are identified as non-common regions.

For example, in FIG. 5, 15 region images from image IMG1, 11 from image IMG2, and 7 from image IMG3 are mapped to visual keyword VK1, so their sum, 33, is the VK mapping count of VK1. If the threshold is set to 20, visual keywords VK2, VK5, and VK6 in FIG. 5 are identified as non-common regions.
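The region-count variant of steps S22 to S25 might look like the sketch below. The nested dictionary `region_counts` and the function names are hypothetical. The VK1 column (15, 11, 7) and the threshold of 20 are taken from the text; every other number is invented so that VK2, VK5, and VK6 fall below the threshold, as stated for FIG. 5.

```python
# Hypothetical counts: image id -> {visual keyword: number of region images}.
region_counts = {
    "IMG1": {"VK1": 15, "VK2": 4, "VK4": 9},
    "IMG2": {"VK1": 11, "VK3": 22, "VK4": 10},
    "IMG3": {"VK1": 7, "VK4": 8, "VK5": 5},
    "IMG4": {"VK2": 6, "VK4": 6, "VK6": 3},
}

def vk_mapping_counts(region_counts):
    """Total number of region images mapped to each visual keyword."""
    totals = {}
    for per_vk in region_counts.values():
        for vk, n in per_vk.items():
            totals[vk] = totals.get(vk, 0) + n
    return totals

def find_non_common_by_regions(mapping_counts, threshold):
    """Keywords whose VK mapping count is below the threshold."""
    return {vk for vk, n in mapping_counts.items() if n < threshold}

mapping = vk_mapping_counts(region_counts)
non_common = find_non_common_by_regions(mapping, threshold=20)
```

Compared with counting source images, this measure lets a keyword survive when a few images contribute many regions to it, as VK3 does in the assumed data.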

In step S28, where the quality as learning data is judged, the quality condition is judged to be satisfied if the ratio of the number of region images remaining in each image after the non-common regions are removed (the detection remaining count) to the number of image regions that remain when the non-common visual keywords are removed from all images (the total remaining region count) is at or above a predetermined value. This ratio is given by the following expression.

Distribution ratio of each image = detection remaining count / total remaining region count

The total remaining region count is the number of image regions that remain mapped to the visual keywords even after the image regions mapped to non-common visual keywords are removed from the original images, and is calculated for each image. In FIG. 5, the total number of image regions mapped to visual keywords VK1, VK3, and VK4, namely 88, is the total remaining region count.

The detection remaining count is the number of region images left in each image after the region images of the non-common visual keywords are removed, and is calculated for each image. In FIG. 5, image IMG1 has 24 region images mapped to visual keywords VK1, VK3, and VK4, which are the keywords other than the non-common ones.

If the ratio of an image's detection remaining count to the total remaining region count is at or below a predetermined threshold (for example, 0.5), the image is judged not to meet the quality required of learning data. In this way, using the number of region images mapped to each visual keyword both to identify the non-common visual keywords and to judge the quality as learning data makes it possible to create higher-quality learning data.
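The region-based ratio can be computed along the same lines. The sketch below assumes the reading in which the denominator is the 88 regions remaining across all images; `region_counts` is hypothetical data arranged to reproduce the two values the text gives for FIG. 5 (24 remaining regions for IMG1 and 88 overall), and the function names are not from the source.

```python
# Hypothetical counts: image id -> {visual keyword: number of region images}.
region_counts = {
    "IMG1": {"VK1": 15, "VK2": 4, "VK4": 9},
    "IMG2": {"VK1": 11, "VK3": 22, "VK4": 10},
    "IMG3": {"VK1": 7, "VK4": 8, "VK5": 5},
    "IMG4": {"VK2": 6, "VK4": 6, "VK6": 3},
}
non_common = {"VK2", "VK5", "VK6"}

def detection_remaining_counts(region_counts, non_common):
    """Regions left in each image once non-common keywords are removed."""
    return {
        img: sum(n for vk, n in per_vk.items() if vk not in non_common)
        for img, per_vk in region_counts.items()
    }

def remaining_region_ratios(region_counts, non_common):
    """Per-image ratio: detection remaining count / total remaining regions."""
    remaining = detection_remaining_counts(region_counts, non_common)
    total = sum(remaining.values())
    return {img: (n / total if total else 0.0) for img, n in remaining.items()}

remaining = detection_remaining_counts(region_counts, non_common)
ratios = remaining_region_ratios(region_counts, non_common)
```

Under this reading the ratios of all images add up to 1, so the threshold would presumably need to be chosen with the number of collected images in mind.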

[Removal of image regions mapped to non-common regions]
In the example above, learning data is created by removing from the original image the region images belonging to the visual keywords identified as non-common. Instead, the region that is actually removed may be determined from the positions, within the original image, of the region images belonging to those non-common visual keywords.

Specifically, suppose that image IMG5 shown in FIG. 7 is divided into the regions indicated by broken lines, and that the shaded regions are the image regions mapped to visual keywords identified as non-common.

As FIG. 7 shows, the positions of the image regions mapped to non-common regions reveal which image regions are common (the white regions). Among these common regions, the outermost ones in the up, down, left, and right directions are extracted, and the learning data is cut out so as to include them.

That is, in image IMG5, image regions P1 to P4 are extracted as the outer edge of the common image regions, and a frame F enclosing image regions P1 to P4 is extracted. The image regions outside frame F are removed from image IMG5, so that the image inside frame F becomes the learning data.
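The frame extraction can be pictured as computing the bounding box of the common regions and cropping to it. The sketch below is an illustration under assumptions: regions are represented as hypothetical `(x, y, width, height)` boxes, the image is a plain list of pixel rows, and the coordinates are invented because the text gives none for FIG. 7.

```python
def enclosing_frame(common_boxes):
    """Smallest frame (left, top, right, bottom) containing every common box.

    Each box is (x, y, width, height); right and bottom are exclusive.
    """
    if not common_boxes:
        raise ValueError("no common regions to enclose")
    left = min(x for x, y, w, h in common_boxes)
    top = min(y for x, y, w, h in common_boxes)
    right = max(x + w for x, y, w, h in common_boxes)
    bottom = max(y + h for x, y, w, h in common_boxes)
    return left, top, right, bottom

def crop_to_frame(pixels, frame):
    """Keep only the pixels inside the frame; everything outside is dropped."""
    left, top, right, bottom = frame
    return [row[left:right] for row in pixels[top:bottom]]

# Invented example: an 8x8 image and three common regions.
image = [[(x, y) for x in range(8)] for y in range(8)]
common_boxes = [(2, 1, 2, 2), (4, 3, 2, 2), (1, 3, 1, 1)]
frame = enclosing_frame(common_boxes)
learning_image = crop_to_frame(image, frame)
```

Cropping to a single rectangle keeps any non-common regions that happen to lie inside the frame, which matches the idea of keeping an image large enough to contain the whole object.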

The feature quantities used for object recognition with the learning data created by the learning data creation device 1 of this embodiment differ depending on the object recognition engine. Therefore, creating learning data large enough to contain the object allows the characteristics of the object to be extracted accurately during learning.

The embodiment disclosed here should be considered illustrative in all respects and not restrictive. The scope of the present invention is indicated by the claims rather than by the description above, and is intended to include all modifications within the meaning and scope equivalent to the claims.

1 Learning data creation device
10 Image collection unit
15 Image DB
20 Visual keyword generation unit
25 Visual keyword DB
30 Feature vector generation unit
35 Region management DB
40 Feature vector DB
50 Non-common region specifying unit
60 Quality determination unit
70 Learning data creation unit
75 Learning data DB

Claims

1. A learning data creation device that creates learning data for object recognition from a plurality of image data collected by keyword-based web search, the device comprising:
classification means for classifying region images detected from the collected plurality of image data into predetermined clusters based on feature quantities of the region images, and for generating, for each image data, the clusters to which the region images of that image data belong and the number of region images belonging to each cluster;
non-common region specifying means for specifying a cluster over which the region images are sparsely distributed, from the classification state of the region images for each cluster based on the number of region images belonging to each cluster; and
learning data creation means for creating learning data by removing the region images belonging to the specified cluster from the image data from which those region images were detected.

2. The learning data creation device according to claim 1, wherein the non-common region specifying means specifies the cluster over which the region images are sparsely distributed based on, for each cluster into which region images are classified, the number of image data from which the region images classified into that cluster were detected.

3. The learning data creation device according to claim 1, further comprising quality determination means for determining, based on the ratio of the number of region images of each image data that belong to the cluster specified by the non-common region specifying means to the number of region images detected from that image data, whether the image data with those region images removed is suitable as learning data,
wherein the learning data creation means creates the learning data by removing the region images from image data that the quality determination means has determined to be suitable as learning data.

4. The learning data creation device described above, wherein the learning data creation means removes the region images from the outer side of the image data based on the detected positions, within the image data, of the region images classified into the specified cluster.

5. A learning data creation method in which a computer creates learning data for object recognition from a plurality of image data collected by keyword-based web search, wherein the computer performs:
a classification step of classifying region images detected from the collected plurality of image data into predetermined clusters based on feature quantities of the region images, and generating, for each image data, the clusters to which the region images of that image data belong and the number of region images belonging to each cluster;
a non-common region specifying step of specifying a cluster over which the region images are sparsely distributed, from the classification state of the region images for each cluster based on the number of region images belonging to each cluster; and
a learning data creation step of creating learning data by removing the region images belonging to the specified cluster from the image data from which those region images were detected.

6. A program for causing a computer to execute the learning data creation method according to claim 5.