JP2009258953A

JP2009258953A - Image processing method, program for executing the method, storage medium, imaging apparatus, and image processing system

Info

Publication number: JP2009258953A
Application number: JP2008106546A
Authority: JP
Inventors: Keiji Yanai (柳井啓司)
Original assignee: University of Electro Communications NUC
Current assignee: University of Electro Communications NUC
Priority date: 2008-04-16
Filing date: 2008-04-16
Publication date: 2009-11-05
Anticipated expiration: 2028-04-16
Also published as: JP5018614B2

Abstract

PROBLEM TO BE SOLVED: To improve the accuracy with which subjects such as general objects and scenes are recognized in digital images tagged with the position information (latitude and longitude) of the shooting location acquired by GPS, and thereby to enable automatic classification and retrieval of digital photographs.
SOLUTION: In addition to the image of the recognition target (subject), image features of aerial photograph images and/or map images at various scales, each corresponding to a small area centered on the shooting position, are appended to the image features of the digital image to be recognized and used for recognition.
SELECTED DRAWING: Figure 1

Description

The present invention relates to an image processing method, a program for executing the method, a storage medium, an imaging device, and an image processing system. More specifically, it relates to an image processing technique for classifying recognition targets (subjects) in an image.

In recent years, the spread of digital cameras and camera-equipped mobile phones, together with the growing capacity of hard disks and other storage media, has made it possible for ordinary individuals to own and accumulate large numbers of digital images.
However, the digital devices on which captured images are stored, such as personal computers (PCs), digital cameras, and camera-equipped mobile phones, have no function for determining the subject (recognition target) in the captured and stored images.
Consequently, the semantic gap between digital devices and people in handling images has not narrowed, and at present human intervention is indispensable for classifying and searching large volumes of image data.
Metadata describing the content of captured images can be written by hand, but this is laborious, so it is not practical to annotate every captured image with metadata about its meaning and content.

To realize processing based on the meaning and content of an image without human intervention, it is necessary to recognize general subjects such as “lion”, “car”, “flower”, “mountain”, and “sunset” (general image recognition).
Having a digital device such as a computer recognize, by its generic name, a subject contained in an image captured in the real world is called “generic object recognition”, and it is one of the most important problems in image recognition research (see, for example, Non-Patent Documents 1 and 2).

Object recognition for images captured in the real world is broadly divided into two types: identification and classification.
Identification is recognition that distinguishes individual objects (“the object”): the input image is matched against models in a database, and the output is which model's object, if any, is present in the image.
Classification, by contrast, is recognition that distinguishes types of object (“an object”): objects (subjects) in the image are associated with classes defined by people, and the output is the class name of the object (in most cases a generic name).
“Object recognition” usually refers to identification, whereas “generic object recognition” means recognition in the sense of classification; this specification uses the terms according to these definitions.

Research on “generic object recognition” for digital images is currently advancing rapidly. The target images of “generic object recognition” here are images such as digital photographs taken with a digital camera or camera-equipped mobile phone, and the recognition targets are subjects in such images, namely various objects and scenes such as “lion”, “car”, “flower”, “mountain”, and “sunset”.

In “general image recognition”, the most basic approach is to recognize from the information contained in the image alone (the image data). In recent years, however, research has been proposed that uses for recognition the additional information (metadata) automatically embedded by the digital camera or camera-equipped mobile phone when the image is captured.
For example, using the capture time makes it easy to tell a “sunset” from a “sunrise”, which is difficult from image data alone.

As shown in Non-Patent Document 3, it has also been proposed to use metadata such as the capture time, whether the flash was used at the time of capture, and the focal length of the lens for image recognition. However, the various conventional documents do not disclose the use of position information.
Position information is an important item of metadata. It is normally acquired by GPS (Global Positioning System); some recent digital cameras and mobile phones have built-in GPS, and many imaging devices that can embed position information as metadata in captured images have appeared.

It is also possible to record the position at the time of capture by carrying a separate GPS device together with the digital camera, and then to embed the position information as additional information in the digital image file using a PC (personal computer).
Some attempts have also been made to use the position information in image files for image recognition (see, for example, Patent Document 1).

However, position information consists of only two numerical values, latitude and longitude, which by themselves are difficult to use as a clue for generic object recognition, and these attempts have not produced satisfactory results. The difficulty stems from the fact that it is not straightforward to decide how position information should be used as a clue for recognition.

Non-Patent Document 1: Keiji Yanai, “Current Status and Future of General Object Recognition”, invited lecture proceedings, IPSJ Special Interest Group on Computer Vision and Image Media, CVM2006, CVM155-17 (2006).
Non-Patent Document 2: Keiji Yanai, “Current Status and Future of General Object Recognition”, Transactions of Information Processing Society of Japan: Computer Vision and Image Media, Vol. 48, No. SIG16 (CVIM19), pp. 1-24, 2007.
Non-Patent Document 3: M. Boutell and J. Luo, “Bayesian Fusion of Camera Metadata Cues in Semantic Scene Classification”, Proceedings of Computer Vision and Pattern Recognition, pp. 623-630, 2004.
Patent Document 1: JP 2007-41762 A (page 18, FIG. 4)

In view of these circumstances, an object of the present invention is to improve image recognition accuracy by using, when classifying a recognition target image, aerial photograph images and map images of the vicinity of the shooting position, together with the recognition target image, as part of the clues for image recognition.

Another object of the present invention is to enable automatic classification and retrieval of digital photographs with position information taken with a digital camera or camera-equipped mobile phone.

A further object of the present invention is to enable various applications such as automatic tagging, automatic description generation, and automatic album creation for images taken with a digital camera or camera-equipped mobile phone.

The present inventors focused on the fact that the position information of the place where an image was captured is a stronger clue for generic object recognition than, for example, time information. For instance, an image of the “sea” can only be taken near the sea, and an image of a “lion” can hardly be taken anywhere but a zoo, apart from special circumstances such as shooting in Africa.

Based on this observation, the present inventors continued intensive study and found that the accuracy of general image recognition can be improved by using, together with the image features of the recognition target, the image features of aerial photograph or map images that represent the position of the shooting location, and thereby completed the present invention.

That is, the present invention is an image processing method including recognition processing that classifies a recognition target in a recognition target image using a classifier for classifying recognition targets in images, the method basically comprising:
inputting the recognition target image (S105);
generating a small area patch image from an aerial photograph image and/or map image corresponding to the shooting position of the recognition target image (S110);
obtaining a recognition result using the classifier (S135); and
determining whether the recognition target is present in the recognition target image (S140, S145, S150).

The method may further comprise:
extracting image features based on the small area patch image (S115);
creating a histogram from the extracted image features (S120);
selecting, from a codebook, the feature vector closest to the created histogram (S125); and
normalizing the selected feature vector (S130).

The step of generating the small area patch image (S110) may be a step of generating the small area patch image from a single image generated from the aerial photograph image and/or map image.

The step of generating the small area patch image (S110) may be a step of generating small area patch images from a plurality of images of different scales generated from the aerial photograph image and/or map image.

The step of generating the small area patch image (S110) may be a step of generating the small area patch images from the recognition target image and from a plurality of images of different scales generated from the aerial photograph image and/or map image.

The step of creating the histogram (S120) may be a step of concatenating the individual histograms generated from the plurality of extracted image features to generate a single histogram.

The step of creating the histogram (S120) may be a step of concatenating the histogram generated from the image features of the recognition target image with the histogram generated from the image features of the small area patch image to generate a single histogram.
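As a rough illustration of the histogram concatenation described above, the following sketch joins a histogram for the recognition target image with histograms for patch images into a single feature vector. The function names, the per-histogram L1 normalization, and the sample counts are assumptions made for this example; the text does not specify them at this point.

```python
from typing import List, Sequence


def l1_normalize(hist: Sequence[float]) -> List[float]:
    """Scale a histogram so its bins sum to 1 (an all-zero input stays all zero)."""
    total = float(sum(hist))
    if total == 0.0:
        return [0.0 for _ in hist]
    return [v / total for v in hist]


def concatenate_histograms(histograms: Sequence[Sequence[float]]) -> List[float]:
    """Normalize each histogram separately (an assumption), then join them end to end."""
    joined: List[float] = []
    for hist in histograms:
        joined.extend(l1_normalize(hist))
    return joined


# Hypothetical counts: one histogram for the photo, two for patches at different scales.
photo_hist = [2, 0, 2]
patch_hist_a = [1, 3]
patch_hist_b = [0, 4]
feature = concatenate_histograms([photo_hist, patch_hist_a, patch_hist_b])
```

Normalizing each part before joining keeps an image with many features from dominating the combined vector; whether this matches the normalization of step S130 is not stated here, so treat it as one possible design.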

The classifier may be generated by:
inputting a learning image and the classification of the learning image (S91);
generating a small area patch image from an aerial photograph image and/or map image corresponding to the shooting position of the learning image (S92); and
creating the classifier (S96).

The method may further comprise:
generating a small area patch image from an aerial photograph image and/or map image corresponding to the shooting position of the learning image (S92);
extracting image features of the small area patch image (S93);
creating a codebook from the extracted image features (S94); and
creating a histogram from the extracted image features using the codebook (S95).
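The codebook and histogram steps (S94, S95) can be sketched as follows. This is a minimal illustration, assuming plain k-means clustering for the codebook (the text here does not name a clustering method) and toy two-dimensional vectors standing in for real image features; all function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codeword closest to vec."""
    return min(range(len(codebook)), key=lambda i: squared_distance(vec, codebook[i]))


def build_codebook(features: Sequence[Vector], k: int, iterations: int = 10) -> List[List[float]]:
    """Plain k-means; the first k features seed the codewords (an arbitrary choice)."""
    codebook = [list(f) for f in features[:k]]
    for _ in range(iterations):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for f in features:
            clusters[nearest_index(f, codebook)].append(f)
        for i, members in enumerate(clusters):
            if members:  # keep the old codeword if a cluster is empty
                codebook[i] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def bag_of_keypoints(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Count how many features are assigned to each codeword."""
    hist = [0] * len(codebook)
    for f in features:
        hist[nearest_index(f, codebook)] += 1
    return hist


# Toy 2-D vectors standing in for real image feature vectors.
training = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0)]
codebook = build_codebook(training, k=2)
histogram = bag_of_keypoints([(0.5, 0.5), (9.0, 10.0), (0.0, 0.2)], codebook)
```

The resulting histogram has one bin per codeword, so its length is fixed by the codebook size regardless of how many features an image yields.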

The step of generating the small area patch image (S92) may be a step of generating a plurality of small area patch images of different scales from a plurality of aerial photograph images and/or map images of different scales corresponding to the shooting position of the learning image.

The step of generating the small area patch image (S92) may be a step of generating a single small area patch image from the aerial photograph image and/or map image corresponding to the shooting position of the learning image.

The step of generating the small area patch image (S92) may be a step of generating the small area patch images from the recognition target image and from a plurality of images of different scales generated from the aerial photograph image and/or map image.

The step of creating the histogram (S95) may be a step of concatenating the individual histograms generated from the plurality of extracted image features to generate a single histogram.

The step of creating the histogram (S95) may be a step of concatenating the histogram generated from the image features of the recognition target image with the histogram generated from the image features of the small area patch image to generate a single histogram.

The aerial photograph image and/or map image may be stored in a database accessible via a network.

The recognition target image may carry position information, and the aerial photograph image and/or map image may be associated with the recognition target image on the basis of that position information.

The image processing method described above can also be realized as an image processing program that causes a computer, an imaging device with an image classification function, or an image processing system to execute the method.

The image processing program can also be provided on a storage medium that stores it as a program readable and executable by a computer.

An imaging device with an image classification function, configured to be able to execute the image processing method, can also be provided.

The present invention is also an image processing system comprising a classifier for classifying recognition targets in images and image recognition means for classifying the recognition target in a recognition target image, wherein, basically,
the image recognition means includes:
an input unit that inputs the recognition target image;
a small area patch image generation unit that generates a small area patch image from an aerial photograph image and/or map image corresponding to the shooting position of the recognition target image;
a recognition result acquisition unit that obtains a recognition result using the classifier; and
a determination unit that determines whether the recognition target is present in the recognition target image.

The system may further include:
an image feature extraction unit that extracts image features;
a histogram creation unit that creates a histogram from the image features extracted by the image feature extraction unit;
a feature vector selection unit that selects, from a codebook, the feature vector closest to the created histogram;
a normalization unit that normalizes the feature vector selected by the feature vector selection unit; and
a recognition result acquisition unit that obtains a recognition result using the classifier, based on the feature vector normalized by the normalization unit.

The image feature extraction unit may extract the image features of a single small area patch image generated by the small area patch image generation unit.

The image feature extraction unit may extract the image features of each of a plurality of small area patch images of different scales generated by the small area patch image generation unit, and
the histogram creation unit may concatenate the individual histograms generated from the plurality of image features extracted by the image feature extraction unit to generate a single histogram.

The image feature extraction unit may extract the image features of the recognition target image as well as the image features of the small area patch image, and
the histogram creation unit may concatenate the histogram generated from the image features of the recognition target image with the histogram generated from the image features of the small area patch image to generate a single histogram.

The classifier may be generated by:
means for inputting a learning image and the classification of the learning image;
means for generating a small area patch image from an aerial photograph image and/or map image corresponding to the shooting position of the learning image; and
means for creating the classifier using the small area patch image.

The classifier may be generated by:
means for inputting a learning image and the classification of the learning image;
means for generating a small area patch image from an aerial photograph image and/or map image corresponding to the shooting position of the learning image;
means for extracting image features of the small area patch image;
means for creating a codebook from the extracted image features;
means for creating a histogram from the extracted image features using the codebook; and
means for creating the classifier using the histogram.

The classifier may be generated by:
means for inputting a learning image and the classification of the learning image;
means for generating a plurality of small area patch images of different scales from a plurality of aerial photograph images and/or map images of different scales corresponding to the shooting position of the learning image;
means for extracting the image features of each of the plurality of small area patch images;
means for creating a codebook from the plurality of extracted image features;
means for creating, using the codebook, a histogram from each of the plurality of extracted image features and concatenating these histograms to generate a single histogram; and
means for creating the classifier using the single histogram.

The classifier may be generated by:
means for inputting a learning image and the classification of the learning image;
means for generating a single small area patch image from the aerial photograph image and/or map image corresponding to the shooting position of the learning image;
means for extracting the image features of the learning image and of the small area patch image, respectively;
means for creating a codebook from the plurality of extracted image features;
means for creating, using the codebook, a histogram from each of the plurality of extracted image features and concatenating these histograms to generate a single histogram; and
means for creating the classifier using the single histogram.

The aerial photograph image and/or map image are preferably stored in a database accessible via a network. The system may also be configured so that the recognition target image holds position information.

According to the present invention, using aerial photograph images and map images of the vicinity of the shooting position, together with the recognition target image, as part of the clues for image recognition when classifying the recognition target image makes it possible to improve the accuracy of general image recognition.

Applying this advantage, the present invention also makes it possible to automatically classify and search digital photographs with position information taken with imaging devices such as digital cameras and camera-equipped mobile phones.

The present invention has many further effects; for example, it enables various applications such as automatic tagging, automatic description generation, and automatic album creation for images taken with imaging devices such as digital cameras and camera-equipped mobile phones.

Exemplary embodiments are described below.

FIG. 1 shows an overview of the image processing system of this example. The system collects images from the image storage unit 11 with position information, which is outside the system, and accumulates them in the image main body storage unit 15, and stores the position information attached to each image in the metadata storage unit 16. In addition, for each image's shooting position, it retrieves from the mapping service storage unit 12 the aerial photograph image or map image of the location given by the position information, generates small area patch images at different scales, and stores them in the storage units 13a, 13b, and 13c provided for the respective scales.

A database accessible via a network may be used as the image storage unit 11 with position information outside the system. One example is “Flickr (registered trademark)”, a social site published on the Internet.
“Flickr” is a social site where captured images can be posted (uploaded) and shared (downloaded), and more than one million images are said to be posted every day. In this specification, an image posted (uploaded) to Flickr is hereinafter called a “Flickr image”. Flickr strongly recommends attaching position information when posting a captured image. The number of images with position information posted to “Flickr”, that is, “Flickr images”, is therefore expected to keep growing.

A database accessible via a network may likewise be used as the mapping service storage unit 12. For example, aerial photograph images and map images provided by public search services published on the Internet, such as that of the Ministry of Land, Infrastructure, Transport and Tourism, or by private search services (for example, “Google (registered trademark)” and “Yahoo (registered trademark)”) may be used.
Because aerial photograph images and map images correspond to position information (latitude and longitude), they can be regarded as position information that has image features. They can therefore serve as an objective means of describing and identifying position information.

The following describes the case where the images, aerial photograph images, and map images are collected from “Flickr” and “Google”.

The above databases currently provide aerial photograph images and map images as separate images. Some of these aerial photograph images are composed of fragments of map information called “tiles”, each 256×256 pixels.
The aerial photograph images and map images can be viewed enlarged or reduced. Among those currently published, some offer 20 zoom levels, numbered 0 to 19.
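To make the relation between a latitude/longitude pair, a zoom level, and a 256×256 tile concrete, the sketch below converts a position into tile indices and a pixel offset inside the tile. The text does not state which projection or tile addressing the mapping service uses; this example assumes the Web Mercator tiling commonly used by web map services, and the function name is hypothetical.

```python
import math
from typing import Tuple

TILE_SIZE = 256  # pixels per tile edge, as stated in the text


def latlon_to_tile(lat_deg: float, lon_deg: float, zoom: int) -> Tuple[int, int, int, int]:
    """Return (tile_x, tile_y, px, py): tile indices and the pixel offset inside that tile.

    Assumes Web Mercator tiling with tile (0, 0) at the north-west corner.
    """
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    lat_rad = math.radians(lat_deg)
    x = (lon_deg + 180.0) / 360.0 * n
    y = (1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n
    tile_x, tile_y = int(x), int(y)
    px = int((x - tile_x) * TILE_SIZE)
    py = int((y - tile_y) * TILE_SIZE)
    return tile_x, tile_y, px, py


# Latitude 0, longitude 0 lies at the exact center of the map.
center_zoom0 = latlon_to_tile(0.0, 0.0, zoom=0)
center_zoom1 = latlon_to_tile(0.0, 0.0, zoom=1)
```

Under this assumption each increase of the zoom level by one doubles the number of tiles along each axis, so level 19 divides the map into 2^19 tiles per side.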

To recognize a target image, features must be extracted from it. The method of extracting image features in this example is described below.
Methods for describing image features range widely, from those that describe pixel-value statistics or eigenvalues to those that describe local features.
In this example, SIFT features, a kind of local feature, are used for feature extraction. To describe these local features compactly, the data are vector-quantized using the Bag of Keypoints method described later (see FIG. 3).
As an alternative, features can also be extracted with the color histogram method, which compares the proportions in which colors occur in images, on the assumption that similar images are composed of similar colors. Using a color histogram formed in a quantized color space reduces the amount of information used for matching compared with the original image, in which color information is assigned to every pixel, so the amount of computation can also be expected to decrease.
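The color histogram alternative mentioned above can be illustrated with a short sketch. It assumes 8-bit RGB pixels, four quantization levels per channel, and histogram intersection as the comparison measure; none of these choices is specified in the text, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # 8-bit (R, G, B)


def color_histogram(pixels: Sequence[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Quantize each channel into equal-width bins and return the proportion of each color."""
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
    total = len(pixels)
    return [count / total for count in hist] if total else hist


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Similarity in [0, 1]; 1.0 means identical color proportions."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Two tiny hypothetical "images": mostly red versus half red, half blue.
image_a = [(250, 10, 10), (240, 20, 5), (255, 0, 0), (10, 10, 250)]
image_b = [(250, 10, 10), (245, 5, 5), (0, 0, 255), (5, 5, 240)]
similarity = histogram_intersection(color_histogram(image_a), color_histogram(image_b))
```

With four levels per channel an image is summarized by 64 numbers regardless of its pixel count, which is the reduction in matching data that the paragraph above refers to.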

SIFT (Scale Invariant Feature Transform) is a method, proposed by David Lowe in 1999, for extracting feature points and their associated feature vectors; it represents the local image pattern around each feature point as a 128-dimensional feature vector.
SIFT features are robust to image scaling, rotation, and changes of viewpoint. SIFT feature extraction can be divided into two steps: extraction of feature points, and extraction of the feature vector at each feature point.

Specifically, as shown in FIGS. 4 and 5, one aerial photograph image containing the position information of a Flickr image classified by keyword, together with the eight aerial photograph images surrounding it (nine in total), is retrieved from the mapping service storage unit 12 for each of zoom levels 10, 12, and 14.
For each zoom level, the nine aerial photograph tiles (256 × 256 pixels each) are first joined in a 3 × 3 arrangement, with the tile containing the position information at the center. A 512 × 512 pixel square is then cut out of the joined image so that the position lies at the center of the square, and this square is used as the aerial photograph image (small region patch image) corresponding to the Flickr image.
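The joining and cropping step can be illustrated with a minimal sketch. It assumes each tile is a 256 × 256 list of pixel rows and that the pixel offset `(px, py)` of the shooting position inside the center tile is already known; the function names and the offset parameters are hypothetical, and the conversion from latitude and longitude to a tile and pixel offset is not shown.

```python
TILE = 256   # tile edge length in pixels, as stated in the text
PATCH = 512  # edge length of the cropped patch, as stated in the text


def stitch_tiles(tiles):
    """Join a 3x3 grid of tiles (indexed tiles[row][col][y][x]) into one 768x768 image."""
    mosaic = []
    for tile_row in tiles:
        for y in range(TILE):
            line = []
            for tile in tile_row:
                line.extend(tile[y])
            mosaic.append(line)
    return mosaic


def crop_patch(mosaic, px, py):
    """Cut a PATCH x PATCH square centred on pixel (px, py) of the centre tile."""
    cx, cy = TILE + px, TILE + py          # position in mosaic coordinates
    left, top = cx - PATCH // 2, cy - PATCH // 2
    return [row[left:left + PATCH] for row in mosaic[top:top + PATCH]]
```

Because the offset lies between 0 and 255, the 512 × 512 window always fits inside the 768 × 768 mosaic, which is presumably why the eight surrounding tiles are fetched.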

SIFT features are extracted from the aerial photograph images in the same way as from the Flickr images.
In this example, a tool called SIFT++ was used to extract the SIFT features. The algorithm in this tool is almost identical to that of Lowe, who proposed SIFT.

Feature points for the SIFT features are extracted by the GRID point extraction described next. In GRID point extraction, points are placed on a regular grid and used as the feature points for computing SIFT feature vectors.

The procedure for extracting SIFT features by GRID point extraction is as follows.
1. Determine the spacing between grid points. In this example, GRID points are extracted every 10 pixels in the image, and SIFT features are computed at those points.
2. Extract the grid points from the image and compute the gradient orientation at each point for several predetermined scales. The total number of grid points depends on the number of pixels in the image and on the grid spacing.
3. Compute the SIFT features for the extracted feature points.
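Steps 1 and 2 can be sketched as a small helper that lists the grid points for an image of a given size. The function name is hypothetical, and starting the grid at pixel 0 is an assumption; the text only fixes the 10-pixel spacing.

```python
def grid_points(width, height, step=10):
    """Return (x, y) grid points spaced `step` pixels apart, row by row.

    Starting the grid at pixel 0 is an assumption; the text only gives the spacing.
    """
    return [(x, y) for y in range(0, height, step) for x in range(0, width, step)]
```

For a 480 × 360 image this yields 48 × 36 = 1728 points, which illustrates how the point count depends on image size and spacing.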

In this example, GRID point extraction with SIFT++ can be realized by implementing the extraction of GRID points beforehand and then using the option that specifies these points explicitly.

Next, feature vectors at the extracted feature points are extracted using the Bag of Keypoints method.
The Bag of Keypoints model treats an image as a set of local features. The local features are vector-quantized to generate feature vectors called Visual Words. The collection of Visual Words is called a codebook, and it is used as a descriptor to generate a feature vector for the whole image. In this way, an image can be represented as a set (bag) of Visual Words.

The flow of image recognition with Bag of Keypoints is as follows (see FIG. 3).
1. Extract feature points from all image data.
2. Vector-quantize them to create a codebook.
3. Generate feature vectors for the learning images based on the codebook.
4. Generate feature vectors for the test images in the same way, and use a classifier to determine which category each image belongs to.

The codebook generation procedure is described with reference to FIGS. 3 and 6.
First, to generate the Visual Words, local features are extracted from all images using SIFT features at GRID points. Next, among the extracted features, the local features of the learning images are vector-quantized, and the center of each cluster is taken as a Visual Word; these together form the codebook.

Vector quantization uses the k-means method, the simplest clustering method.
In this method, the number of clusters k and an initial centroid for each cluster (which may be random) are fixed in advance, and the centroids are updated iteratively so that the average distance between each vector and its centroid is minimized. The size of the codebook depends on the number of clusters k.
In this example, vector quantization was performed with k fixed at 300. Clustering by the k-means method requires measuring distances between vectors; in this example the Euclidean distance was used as the distance measure.
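The iterative centroid update can be sketched in plain Python as follows. This is a minimal illustration of the standard k-means (Lloyd) iteration with random initial centroids, not the implementation used in the test; the function names, the fixed iteration count, and the seed are assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance; it ranks neighbours the same way as the Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Return k centroids (the codebook) for the given vectors."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]  # random initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: sq_dist(v, centroids[i]))
            clusters[nearest].append(v)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids
```

With k = 300 and 128-dimensional SIFT vectors, the returned list would play the role of the codebook described above.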

The procedure for creating the codebooks used to prepare the learning data is as follows. One codebook is created for one image.
1. For each keyword group, identify the positive example images (OK images) and the negative example images (NG images).
2. Create a codebook using the features extracted from all images of each keyword group.
3. Using the codebook obtained for each keyword, create a histogram over the codebook for each image of that keyword.

The procedure for creating the codebooks for the aerial photograph images corresponding to each image is as follows.
1. From the position information contained in each image, find the corresponding aerial photograph images (three zoom levels in this example) and prepare 3 × 3 = 9 tiles for each.
2. For each zoom level, crop the aerial photograph image to a 256 × 256 pixel square so that the position (latitude and longitude) is at the center.
3. Group the aerial photograph images by the keyword of the corresponding image, independently for each zoom level, and obtain the features and then the codebook in the same way as for the images.

At this point, four codebooks have been obtained for each keyword: one for the images of the keyword and one for each of the aerial photograph image groups at levels 10, 12, and 14.
Creating a codebook involves clustering. Creating a highly accurate codebook for a keyword requires clustering an extremely large amount of data, which must be traded off against processing time.
In this example, the feature points used for clustering were thinned out by selecting each feature point with a probability of 1/10, to speed up processing.
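The thinning step can be sketched as an independent coin flip per feature point. The function name and the seeded generator are assumptions added for reproducibility; only the 1/10 probability comes from the text.

```python
import random


def subsample(descriptors, keep_probability=0.1, seed=0):
    """Keep each descriptor independently with the given probability."""
    rng = random.Random(seed)
    return [d for d in descriptors if rng.random() < keep_probability]
```

On average this keeps one tenth of the input, so the clustering cost drops roughly in proportion.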

In this example, in connection with this clustering, an SVM (Support Vector Machine) is used as the classifier, that is, the means of learning and classification.
The SVM is a method for constructing a two-class pattern classifier using the linear threshold element, the simplest neuron model.
Combined with kernel learning, it becomes a nonlinear classifier. This extension is a technique called the kernel trick, and with it the SVM is considered one of the learning models with the best recognition performance among the many methods currently known.
As an alternative, any method known in the art may be selected, and the nearest neighbor method may be used. The nearest neighbor method is an interpolation technique that assigns the value of the nearest surrounding pixel to a given pixel. More specifically, the processing is described at, for example, "http://www.microsoft.com/japan/msdn/academic/Articles/Algorithm/04". The nearest neighbor method has the advantage of high processing speed.

In this example, SVMlight is used as the tool for running the SVM.
The input vector to the SVM used for learning and classification is composed of a position information (latitude and longitude) vector and histograms (bags) over the codebooks.

First, for each image, a histogram over the codebook of each group is created. Because a codebook is data describing as many representative SIFT vectors as the specified number of clusters, the histogram can be created by finding, for each SIFT feature of the image, the codebook vector at the smallest distance and casting a vote for that vector. In this example the Euclidean distance was used to measure the distance between vectors.
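The voting step can be sketched as a nearest-neighbour search over the codebook. This is a minimal illustration with a hypothetical function name; it uses a brute-force search and the Euclidean distance named in the text.

```python
import math


def bag_histogram(descriptors, codebook):
    """Count, for each codebook vector, how many descriptors it is nearest to."""
    hist = [0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)), key=lambda i: math.dist(d, codebook[i]))
        hist[nearest] += 1  # one vote per descriptor
    return hist
```

With a 300-entry codebook the result is a 300-element count vector for each image.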

Once the histograms have been completed for all images, combined vectors are created for each keyword by concatenating the vectors for each combination of histograms (bags) and position information. Because the aerial photograph images are handled independently at each of the three levels, nine vector patterns are created for each keyword (see FIGS. 2, 7, and 8).
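Building one combined vector amounts to a concatenation, sketched below. The function name is hypothetical, and appending the raw latitude and longitude without scaling is an assumption; the text does not say how the position values are encoded, nor does this sketch enumerate the nine patterns.

```python
def build_input_vector(bags, location=None):
    """Concatenate the chosen histograms and, optionally, (latitude, longitude)."""
    vector = []
    for bag in bags:
        vector.extend(bag)
    if location is not None:
        vector.extend(location)  # raw values; any scaling is left open here
    return vector
```

For example, passing the image histogram and one aerial histogram gives a vector for an "image plus one zoom level" pattern, and adding `location` appends the position information.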

In this example, nine vector patterns were created for each image of a keyword. Each vector pattern group is divided into two groups, positive example images and negative example images, based on the manual classification of the corresponding images. A fixed number of images is drawn at random from each group, and learning and classification are carried out by cross validation.

Next, the operation of the learning process (classifier) in this example is described with reference to FIGS. 9 and 10.
When the program that performs the learning process starts (S90), images stored in advance are read from the photograph body storage unit 15 (S91). For a given object (for example, "mountain", "sea", or "lion"; these are collectively referred to as a "classification" in this specification), these are positive example images, which are known in advance to contain the object, and negative example images, which are known in advance not to contain it.
To increase the learning accuracy of the classifier, at least 100 positive example images and at least 100 negative example images are preferably used.
If the height or width of a read image, or both, is 480 pixels or more, the image is preferably reduced, with its aspect ratio maintained, so that both height and width are less than 480 pixels.
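The size rule can be sketched as a function that computes the reduced dimensions. The function name is hypothetical, and scaling the longer side to exactly 479 pixels is an assumption; the text only requires both sides to end up below 480 with the aspect ratio maintained.

```python
def shrink_size(width, height, limit=480):
    """Return a size with both sides below `limit`, keeping the aspect ratio.

    Images already below the limit are returned unchanged.
    """
    longest = max(width, height)
    if longest < limit:
        return width, height
    scale = (limit - 1) / longest  # the longer side becomes limit - 1 pixels
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Rounding to whole pixels means the aspect ratio is preserved only approximately.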

Next, the shooting position information of the images read in S91 is read from the metadata storage unit 16, and small region patch images are created from the aerial photograph images corresponding to that position information (S92).
The images are associated so that the position coordinates of each image fall at the center of the aerial photograph image (14a to 14c in FIG. 1).
Aerial photograph images or map images at different scales are used as the small region patch images.
To improve the accuracy of the classifier, three or more different scales are preferably used.

Next, a classifier that performs image recognition is generated. Any method commonly used in this field can be used to generate the classifier. For example, as shown in FIG. 9, the monochrome bitmap data (for example, 256 × 256) of the small region patch images obtained in S91 can be input directly to a classifier (for example, an SVM) as 65536-dimensional (= 256 × 256) feature vectors to generate the classifier (S96).

As another example, as shown in FIG. 10, image features may next be extracted from all the images obtained in S91 and S92 (S93); this approach is suitable for further improving image recognition accuracy.
The image features are not particularly limited as long as the effects of the present invention are obtained, and either SIFT features or Haar features may be used. The case of SIFT features is described below.

In S93, grid points (GRID points) are set as the feature points.
In view of the trade-off between the amount of data to process and the gain in accuracy, grid points are preferably set at 10-pixel intervals vertically and horizontally in each image. An orientation histogram of the luminance gradient (also called a "SIFT feature vector") is then computed in the neighborhood of each feature point.
For better accuracy, four neighborhood sizes are preferably set, so that four SIFT feature vectors are computed from each feature point.
This processing yields several thousand SIFT feature vectors from one learning image.

Next, the codebooks are created (S94). In a typical case, this processing derives about 300 representative SIFT vectors from all of the several million SIFT feature vectors to create the codebooks 21a to 21d.
More specifically, about 10,000 vectors are selected by random sampling from all of the several million SIFT feature vectors.
Then 300 representative vectors are obtained from the roughly 10,000 selected SIFT feature vectors by cluster analysis. The clustering method is not particularly limited, and k-means clustering may be used.
k-means clustering is a variance-based optimization method and a typical technique that defines an evaluation function for the quality of a partition and divides the data into k clusters so as to minimize that function.

Next, the histograms 22a to 22d are created (S95). For each image, the codebook vector nearest to each of the several thousand extracted SIFT feature vectors is found. In a typical case, this produces a 300-dimensional histogram over the codebook.
The histogram is then normalized so that its elements sum to 1. The normalized histogram is the bag-of-keypoints vector representing the image.
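The normalization can be sketched in a few lines. The function name is hypothetical, and returning an all-zero histogram unchanged is an assumption, since the text does not discuss images with no feature vectors.

```python
def normalize(hist):
    """Scale a histogram so that its elements sum to 1."""
    total = sum(hist)
    if total == 0:  # assumption: leave an empty histogram as it is
        return list(hist)
    return [h / total for h in hist]
```

Normalizing removes the dependence on how many feature vectors an image produced, so images of different sizes become comparable.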

Next, the classifier is generated (S96). The classifier is generated by inputting to it, as learning data, the bag-of-keypoints vectors of the positive example images and of the negative example images obtained by the above processing. An SVM may be used as the classifier.
The classifier may also be generated (S96) by inputting to it, as learning data, histograms of the positive example images and of the negative example images obtained by the color histogram method. The classifier can be implemented by any method known in the art, and the nearest neighbor method mentioned above may be used.
The program that executes the learning process (S90) causes the CPU to carry out the above processing for all images (S97).

Next, the operation of the recognition process in this example is described with reference to FIGS. 11 and 12.
When the program that performs the recognition process starts (S100), a recognition target image is input (S105).
If the height or width of the read image, or both, is 480 pixels or more, the image is reduced, with its aspect ratio maintained, so that both height and width are less than 480 pixels.

Next, the shooting position information of the image read in S105 is read from the metadata storage unit 16, and small region patch images are created from the aerial photograph images or map images corresponding to that position information (S110). The images are associated so that the position coordinates of the image fall at the center of the aerial photograph image or map image. Aerial photograph images or map images at different scales are used as the small region patch images. To improve the accuracy of the classifier, three or more different scales are desirably used.

Next, any method commonly used in this field can be used to judge the image with the classifier. For example, as shown in FIG. 12, the monochrome bitmap data (for example, 256 × 256) of the small region patch image obtained in S110 can be input directly to a classifier (for example, an SVM) as a 65536-dimensional (= 256 × 256) feature vector, and the judgment made by the classifier (S135).

As another example, as shown in FIG. 11, image features may next be extracted from all the images obtained in S105 and S110 (S115); this approach is suitable for further improving image recognition accuracy. The image features are not particularly limited as long as the effects of the present invention are obtained; SIFT features may be used, and Haar features may also be used. For ease of explanation, the case of SIFT features is described below as an example.

In S115, grid points (GRID points) are set as the feature points. In view of the trade-off between the amount of data to process and the gain in accuracy, grid points are desirably set at 10-pixel intervals vertically and horizontally in each image. An orientation histogram of the luminance gradient (also called a "SIFT feature vector") is then computed in the neighborhood of each feature point. For better accuracy, four neighborhood sizes are desirably set, so that four SIFT feature vectors are computed from each feature point.
This processing yields several thousand SIFT feature vectors from one learning image.

Next, the codebook is searched (S125). In a typical case, for each of the several thousand SIFT feature vectors extracted from the one learning image, the codebook vector at the smallest distance is found and a vote is cast for that vector, which builds up a histogram. In this example the Euclidean distance was used to measure the distance between vectors. A 300-dimensional histogram over the codebook is obtained in this way.

Next, the histogram is normalized (S130). In the typical case above, each 300-dimensional histogram is normalized so that its elements sum to 1, which gives the bag-of-keypoints vector representing the recognition target image.
The resulting bag-of-keypoints vector is then input to the classifier to obtain a recognition result value for the recognition target image (S135). A support vector machine (SVM) trained by the learning process described above is preferably used as the classifier.

Next, the recognition result value is judged. If it is positive (S140: downward branch), the recognition target image is judged to contain a target object of the classification specified in advance (S145); if it is negative (S140: rightward branch), the recognition target image is judged not to contain such an object (S150).
The program S100 that executes the recognition process causes the CPU to carry out the above processing for all images (S155).
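The sign test in S140 can be illustrated with a hypothetical linear decision function. The weights and bias below stand in for a trained model and are not taken from the text; the trained SVM actually used is not shown in the source, and treating a value of exactly zero as negative is an assumption.

```python
def decision_value(weights, bias, vector):
    """Linear decision function w.x + b (a hypothetical stand-in for a trained SVM)."""
    return sum(w * x for w, x in zip(weights, vector)) + bias


def contains_target(score):
    """Positive recognition result value: target present. Zero is treated as negative here."""
    return score > 0
```

A kernel SVM would replace the dot product with a sum of kernel evaluations against the support vectors, but the final sign test stays the same.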

Next, test examples are described.

This example was implemented as computer software and run on a personal computer (PC) with access to the Internet (Web). The accuracy of generic image recognition according to this example was checked using about 5000 images, collected from Flickr, that contain position information within Japan.
For this test, five keywords were assigned to the images: scenery, ramen, mountain, shrine, and coast.
Three zoom levels, 10, 12, and 14, were used for the aerial photograph images.
Each image collected from Flickr was associated with the collected aerial photograph images that represent its position information.

The input data for the SVM are created by extracting features from the test data set and creating the codebooks and histograms.
In this way, the data of each group are created for each keyword from the data set of images, aerial photograph images, and position information.
In this test, 200 images were drawn at random from each group to create the input data set for the SVM.
To evaluate the test data as a whole more objectively, cross validation was used as the method of learning and classification (see FIG. 13).

The specific procedure is as follows. First, the positive example images, in which the recognition target appears (also called "OK data"), and the negative example images, in which it does not appear (also called "NG data"), are divided into equal parts. Every frame contains the same number of test sets. Because the test data are divided into five equal parts in this test, each frame contains 20 images.
That is, when the input data for the SVM are created as described above, the random sampling is done so that the learning data and the classification data are each of equal size.
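The division into frames can be sketched as follows. The sketch assumes that the 200 sampled images consist of 100 positive and 100 negative examples, which would give 20 of each per frame; that reading, the function names, and the choice to hold out one frame at a time are assumptions rather than details stated in the text.

```python
def make_folds(items, n_folds=5):
    """Split a list into n_folds consecutive parts of equal size."""
    size = len(items) // n_folds
    return [items[i * size:(i + 1) * size] for i in range(n_folds)]


def cross_validation_splits(positives, negatives, n_folds=5):
    """Yield (training, test) lists, holding out one frame of each class per round."""
    pos_folds = make_folds(positives, n_folds)
    neg_folds = make_folds(negatives, n_folds)
    for held_out in range(n_folds):
        test = pos_folds[held_out] + neg_folds[held_out]
        train = [x for i in range(n_folds) if i != held_out
                 for x in pos_folds[i] + neg_folds[i]]
        yield train, test
```

Each round trains on four frames and classifies the remaining one, so five results are produced per group.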

After the data were divided, the experiment was run with the learning data and the classification data rotated, giving five results for each group.
As in the evaluation of information retrieval, recall and precision can be computed from the SVM output. As measures that take both recall and precision into account, the F-measure and the recall-precision graph can also be obtained.
The recall-precision graph in particular makes it possible to examine the trade-off between the two.
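These measures can be sketched from lists of predicted and true labels. The text does not define the F-measure, so the sketch assumes the usual harmonic mean of precision and recall; the function name is hypothetical.

```python
def precision_recall_f(predicted, actual):
    """Return (precision, recall, F) for boolean predictions against boolean truth."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # correctly accepted
    fp = sum(1 for p, a in pairs if p and not a)    # wrongly accepted
    fn = sum(1 for p, a in pairs if not p and a)    # wrongly rejected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

Sweeping the decision threshold and recording precision at each recall level would give the points of a recall-precision graph.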

In this test, cross validation is used for learning and classification (see FIG. 13). Because five folds were used, the evaluation described above outputs five results.

In this test, the experimental results were evaluated by average precision.
For each keyword, the data sets of the nine groups are each evaluated. Because each group is divided into five folds, the average precision of each of the five folds is computed, and these values are reported together with their mean. In total, 45 average precision values are computed for each keyword.
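Average precision is not defined in the text. A minimal sketch, assuming the definition common in information retrieval (the mean of the precision values at the ranks where positive examples appear when the images are sorted by classifier output), could look like this; the function name is hypothetical.

```python
def average_precision(scores, labels):
    """Mean of precision at each rank where a positive example is retrieved."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0
```

Computing this once per fold and per group would give the 5 × 9 = 45 values mentioned above.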

The test results are reported as each average precision multiplied by 100, to four significant figures (see Table 1).
In each table, the group combinations are denoted with I for the image, G for the position information, and 10 ((1) in Table 1), 12 ((2) in Table 1), and 14 ((3) in Table 1) for the respective levels of the aerial photograph images.

For the keyword "scenery", combining the images with aerial photograph images was confirmed to improve accuracy at level 10 (I+(1) in Table 1) and level 12 (I+(2) in Table 1).
A large share of the positions for this keyword lie in cities and urban areas, and it is difficult to identify scenery from the local features of urban areas alone. This is thought to be why combining the scenery photographs with the aerial photograph images gives the highest accuracy.

For the keyword "ramen", metadata such as titles and descriptions indicate that many of the images were taken in ramen shops. As with "scenery", the positions are therefore relatively concentrated in urban areas. The combination with aerial photograph level 10 ((1) in Table 1) is thought to give the highest accuracy because urban characteristics tend to show up at this zoom level.

In this test, the codebooks for the aerial photograph images were created separately from those for the images, to avoid confusion with the image features. However, it is also possible to deliberately mix in the aerial photograph images of one zoom level and create an independent codebook for each zoom level.

As described above, this test confirmed that the image processing method of the present invention improves the accuracy of image recognition, using images with position information collected from the Web and the aerial photograph images corresponding to that position information.

Embodiments of the present invention have been described above with reference to the drawings and the like. However, the present invention is not limited to these examples, and it goes without saying that various modifications are possible within the scope of the technical idea described in the claims.

The image processing technique of the present invention, in particular the technique for determining whether or not an object (subject) to be classified is included in a digital image, can improve the accuracy of general image recognition on a personal computer (PC) connected to a network such as the Internet, on a PC connected to a LAN built in a home or office, or on a stand-alone PC used by an individual user.
Further, when incorporated into an imaging device such as a digital camera or a camera-equipped mobile phone, it contributes to the automatic classification and retrieval of captured digital images with position information.
It also makes it possible to provide various applied technologies and products, such as automatic tagging of images captured by a digital camera or the like, automatic generation of descriptive text, and automatic album creation.
It can be suitably used as an independent product, as software embedded in another product, or as a system available on the Internet.

FIG. 1 is a conceptual diagram showing an overview of an image processing system according to the present invention.
A conceptual diagram showing the input data to the classifier.
An explanatory diagram of the Bag of Keypoints method.
An explanatory diagram of an example of a method for collecting aerial photograph images (or map images).
An explanatory diagram of an example of a method for processing aerial photograph images (or map images).
An explanatory diagram of an example of codebook generation.
An explanatory diagram of an example of the input vector to the SVM for each keyword indicating the classification of a target image.
An explanatory diagram of an example of a vector pattern.
An example of a flowchart showing the operation of the learning process (classifier).
An example of a flowchart showing the operation of the learning process (classifier).
An example of a flowchart showing the operation of the recognition process.
An example of a flowchart showing the operation of the recognition process.
A conceptual diagram of the cross-validation method.

Explanation of symbols

11: Image storage unit
12: Mapping service storage unit
13a to 13c: Storage units for the aerial photograph image or map figure of each scale corresponding to the position information
14: Storage unit for the aerial photograph image or map figure of each scale whose position coordinates have been adjusted
15: Image body storage unit
16: Metadata storage unit
21: Codebook storage unit
22: Histogram storage unit
23: Storage unit for input data to the classifier
24: Input vector storage unit

Claims

An image processing method including a recognition process for classifying a recognition target in a recognition target image using a classifier for classifying recognition targets in images, the method comprising:
Inputting a recognition target image (S105);
Generating a small area patch image from the aerial photograph image and/or map image corresponding to the photographing position of the recognition target image (S110);
Obtaining a recognition result using the classifier (S135); and
Determining whether there is a recognition target in the recognition target image (S140, S145, S150).
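The last two steps, obtaining a recognition result and deciding whether the target is present, can be pictured with a linear decision function of the kind a linear SVM produces. This is only a sketch under stated assumptions: the weights, bias, and zero threshold are hypothetical values, not parameters of any classifier described in this document.

```python
def decision_score(weights, bias, input_vector):
    """Linear decision function: the signed distance-like score w . x + b."""
    return sum(w * x for w, x in zip(weights, input_vector)) + bias


def contains_target(weights, bias, input_vector, threshold=0.0):
    """Report the recognition target as present when the score exceeds the threshold."""
    return decision_score(weights, bias, input_vector) > threshold


# Hypothetical trained parameters and a two-bin input histogram.
weights = [1.0, -1.0]
bias = 0.0
input_vector = [0.75, 0.25]

score = decision_score(weights, bias, input_vector)
present = contains_target(weights, bias, input_vector)
```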

The image processing method according to claim 1, further comprising:
Extracting an image feature amount based on the small area patch image (S115);
Creating a histogram from the extracted image feature amount (S120);
Selecting a feature vector closest to the created histogram from a code book (S125); and
Normalizing the selected feature vector (S130).
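For orientation, the following sketch shows the common Bag of Keypoints formulation of these steps: each local feature vector is assigned to its nearest codeword, the assignments are counted into a histogram, and the histogram is normalized. It is an assumption that this common formulation matches the claimed steps in detail; the function names and toy values are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_codeword(descriptor, codebook):
    """Index of the codebook entry closest to the given feature vector."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(descriptor, codebook[i]))


def bag_of_keypoints_histogram(descriptors, codebook):
    """Count nearest-codeword assignments and normalize so the bins sum to 1."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_codeword(d, codebook)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [c / total for c in counts]


# Toy codebook with two codewords and four feature vectors from one patch image.
codebook = [[0.0, 0.0], [10.0, 10.0]]
patch_descriptors = [[0.5, 0.2], [9.0, 9.5], [9.8, 10.1], [0.1, 0.3]]
histogram = bag_of_keypoints_histogram(patch_descriptors, codebook)
```

Normalizing by the total count makes histograms comparable between patch images that yield different numbers of feature vectors.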

The image processing method according to claim 1 or 2, wherein the step (S110) of generating the small area patch image is a step of generating a small area patch image from one image generated from the aerial photograph image and/or map image.

The image processing method according to claim 1 or 2, wherein the step (S110) of generating the small area patch image is a step of generating a small area patch image from a plurality of images having different scales generated from the aerial photograph image and/or map image.
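One way to picture how images of different scales are looked up for a single shooting position is the Web Mercator tile indexing used by many online map services. The sketch below is an assumption for illustration only: this document does not state which tiling scheme the mapping service uses, and the coordinates and function names are hypothetical.

```python
import math


def tile_for_position(lat_deg, lon_deg, zoom):
    """Web Mercator (slippy map) tile indices containing the given position."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    x = int((lon_deg + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat_deg)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that positions on the map edge still map to a valid tile.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_at_zoom_levels(lat_deg, lon_deg, zoom_levels):
    """One tile index per zoom level, all covering the same shooting position."""
    return {z: tile_for_position(lat_deg, lon_deg, z) for z in zoom_levels}


# Hypothetical shooting position; zoom levels 10 and 12 as in the test above.
tiles = tiles_at_zoom_levels(35.0, 139.0, zoom_levels=(10, 12))
```

Each returned tile would then be fetched and cropped around the shooting position to obtain the small area patch image for that scale.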

The image processing method according to claim 1, wherein the step (S110) of generating the small area patch image is a step of generating the small area patch image from the recognition target image and a plurality of images having different scales generated from the aerial photograph image and/or the map image.

The image processing method according to claim 5, wherein the step (S120) of creating the histogram is a step of concatenating the respective histograms generated from the plurality of extracted image feature amounts to generate one histogram.

The step of creating the histogram (S120) concatenates the histogram generated from the image feature amount of the recognition target image and the histogram generated from the image feature amount of the small area patch image, to form one histogram. 6. The image processing method according to claim 2, wherein the image processing method is a step of generating an image.
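The concatenation described here can be sketched as simple list joining: the histogram of the photograph comes first, followed by the histogram of each aerial patch image in a fixed order. The ordering by zoom level, the bin counts, and all values below are hypothetical choices made for illustration.

```python
def concatenate_histograms(histograms):
    """Join several histograms end to end into a single input vector."""
    combined = []
    for h in histograms:
        combined.extend(h)
    return combined


# Toy histograms: one for the photograph, one per aerial zoom level.
photo_histogram = [0.25, 0.75]
aerial_histograms = {10: [0.5, 0.5, 0.0], 12: [0.0, 1.0, 0.0]}

# A fixed order (here ascending zoom level) keeps each bin at the same
# position in every input vector, for learning and recognition alike.
input_vector = concatenate_histograms(
    [photo_histogram] + [aerial_histograms[z] for z in sorted(aerial_histograms)]
)
```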

The image processing method according to claim 1, wherein the classifier is generated by:
Inputting a learning image and a classification of the learning image (S91);
Generating a small area patch image from the aerial photograph image and/or map image corresponding to the shooting position of the learning image (S92); and
Creating the classifier (S96).

9. The image processing method according to claim 8, further comprising:
Generating a small area patch image from the aerial photograph image and / or map image corresponding to the shooting position of the learning image (S92);
Extracting an image feature amount of the small area patch image (S93);
Creating a code book from the extracted image feature values (S94);
Using the code book to create a histogram from the extracted image feature quantity (S95).

The image processing method according to claim 9, wherein the step (S92) of generating the small area patch image is a step of generating a plurality of small area patch images having different scales from a plurality of aerial photograph images and/or map images having different scales corresponding to the shooting position of the learning image.

The image processing method according to claim 9, wherein the step (S92) of generating the small area patch image is a step of generating one small area patch image from the aerial photograph image and/or map image corresponding to the shooting position of the learning image.

The image processing method according to claim 9, wherein the step (S92) of generating the small area patch image is a step of generating the small area patch image from the recognition target image and a plurality of images having different scales generated from the aerial photograph image and/or the map image.

10. The step of creating the histogram (S95) is a step of concatenating the respective histograms generated from the plurality of extracted image feature amounts to generate one histogram. Image processing method.

The image processing method according to claim 9, wherein the step (S95) of creating the histogram is a step of concatenating the histogram generated from the image feature amount of the recognition target image and the histogram generated from the image feature amount of the small area patch image to generate one histogram.

15. The image processing method according to claim 1, wherein the aerial photograph image and / or the map image are stored in a database accessible via a network.

16. The recognition target image has position information, and based on the position information, the aerial photograph image and / or map image is associated with the recognition target image. The image processing method according to one item.

An image processing program for causing a computer, an imaging device with an image classification function, or an image processing system to execute the image processing method according to any one of claims 1 to 16.

A storage medium storing the image processing program according to claim 17 as a program that can be read and executed by a computer.

An imaging device with an image classification function configured to be able to execute the image processing method according to any one of claims 1 to 16.

An image processing system comprising: a classifier for classifying recognition targets in an image; and image recognition means for classifying the recognition target in a recognition target image, wherein the image recognition means includes:
An input unit for inputting a recognition target image;
A small area patch image generation unit that generates a small area patch image from an aerial photograph image and/or map image corresponding to the imaging position of the recognition target image;
A recognition result acquisition unit that obtains a recognition result using the classifier; and
A determination unit for determining the presence or absence of a recognition target in the recognition target image.

The image processing system according to claim 20, further comprising:
An image feature amount extraction unit for extracting an image feature amount;
A histogram creation unit that creates a histogram from the image feature amount extracted by the image feature amount extraction unit;
A feature vector selection unit that selects a feature vector closest to the created histogram from a code book;
A normalization unit for normalizing the feature vector selected by the feature vector selection unit; and
A recognition result acquisition unit that obtains a recognition result using the classifier based on the feature vector normalized by the normalization unit.

The image processing system according to claim 20 or 21, wherein the image feature amount extraction unit extracts an image feature amount of one small region patch image generated by the small region patch image generation unit.

The image processing system, wherein the image feature amount extraction unit extracts image feature amounts of a plurality of small area patch images having different scales generated by the small area patch image generation unit, and
The histogram creation unit generates one histogram by concatenating the respective histograms generated from the plurality of image feature amounts extracted by the image feature amount extraction unit.

The image processing system according to claim 20 or 21, wherein the image feature amount extraction unit extracts an image feature amount of the recognition target image and extracts an image feature amount of the small area patch image, and
The histogram creation unit generates one histogram by concatenating a histogram generated from the image feature amount of the recognition target image and a histogram generated from the image feature amount of the small area patch image.

The image processing system according to claim 20, wherein the classifier is generated by:
Means for inputting a learning image and a classification of the learning image;
Means for generating a small area patch image from an aerial photograph image and/or map image corresponding to the shooting position of the learning image; and
Means for creating a classifier using the small area patch image.

The image processing system according to claim 20, wherein the classifier is generated by:
Means for inputting a learning image and a classification of the learning image;
Means for generating a small area patch image from an aerial photograph image and/or map image corresponding to the shooting position of the learning image;
Means for extracting an image feature amount of the small area patch image;
Means for creating a code book from the extracted image feature amount;
Means for creating a histogram from the extracted image feature amount using the code book; and
Means for creating a classifier using the histogram.

The image processing system according to claim 20, wherein the classifier is generated by:
Means for inputting a learning image and a classification of the learning image;
Means for generating a plurality of small area patch images having different scales from a plurality of aerial photograph images and/or map images having different scales corresponding to the shooting position of the learning image;
Means for respectively extracting image feature amounts of the plurality of small area patch images;
Means for creating a code book from the extracted plurality of image feature amounts;
Means for creating, using the code book, respective histograms from the extracted plurality of image feature amounts and concatenating these histograms to generate one histogram; and
Means for creating a classifier using the one histogram.

The image processing system according to claim 20, wherein the classifier is generated by:
Means for inputting a learning image and a classification of the learning image;
Means for generating one small area patch image from the aerial photograph image and/or map image corresponding to the shooting position of the learning image;
Means for extracting image feature amounts of the learning image and the small area patch image;
Means for creating a code book from the extracted plurality of image feature amounts;
Means for creating, using the code book, respective histograms from the extracted plurality of image feature amounts and concatenating these histograms to generate one histogram; and
Means for creating a classifier using the one histogram.

The image processing system according to any one of claims 20 to 28, wherein the aerial photograph image and / or map image is stored in a database accessible via a network.

30. The image processing system according to claim 20, wherein the recognition target image holds position information.