JP5351084B2

JP5351084B2 - Image recognition apparatus and image recognition method

Info

Publication number: JP5351084B2
Application number: JP2010059627A
Authority: JP
Inventors: 満安倍; 悠一吉田
Original assignee: Denso IT Laboratory Inc
Current assignee: Denso IT Laboratory Inc
Priority date: 2010-03-16
Filing date: 2010-03-16
Publication date: 2013-11-27
Anticipated expiration: 2030-03-16
Also published as: JP2011192178A

Abstract

PROBLEM TO BE SOLVED: To provide an image recognition device that, even when recognizing a complicated object, accurately detects the object without increasing the amount of calculation, so that subsequent processing such as feature quantity extraction and pattern recognition can be performed.

SOLUTION: The image recognition device 1 includes: an imaging unit 21 for imaging a recognition object to generate recognition image data; a display unit 23 which displays a recognition image based on the recognition image data generated by the imaging unit 21; an input unit 22 for a user to input recognition position input data indicating, on the recognition image, the position of an element to be recognized; a recognition feature quantity extraction unit 31 which extracts a recognition image feature quantity from the recognition image data based on the recognition position input data input to the input unit 22; and a pattern recognition unit 32 which recognizes the recognition object based on the recognition image feature quantity extracted by the extraction unit 31.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to an image recognition apparatus and an image recognition method, and more particularly to an image recognition apparatus and an image recognition method for recognizing an object shown in a captured image.

Image recognition apparatuses that analyze an image captured by an imaging device such as a digital camera and recognize the object shown in it have long been widely known. Examples include code readers, such as QR code readers, and apparatuses that identify a book from an image of its cover.

These apparatuses generally process an image in three steps: (1) target detection (segmentation), (2) feature amount extraction, and (3) pattern identification (recognition processing). The whole sequence is automated: given an input image, computer software performs steps (1) to (3) and outputs a recognition result. When the shape of the recognition target is regular and image recognition processing is easy to apply, as with a code reader, the “target detection” step is easy.

In recent years, by contrast, systems that recognize more complex targets such as birds, flowers, dogs, cats, and insects have been proposed (see, for example, Non-Patent Document 1 for a system that recognizes flowers). In such systems, “target detection” must be performed accurately in order to improve recognition accuracy.

Furthermore, if the target can be detected part by part, recognition accuracy improves further. For example, when recognizing a bird, if the “target detection” step can locate not only the bird's whole body but also its head, wings, trunk, eyes, and beak individually, the accuracy of the recognition result after the subsequent “feature amount extraction” and “pattern identification” steps can be markedly improved. Therefore, to recognize a complex target with high accuracy, it is essential first to perform “target detection” properly.

The following are prior art documents related to the present invention.

Nilsback, M-E. and Zisserman, A., “A Visual Vocabulary for Flower Classification,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2006).
Csurka, G., Dance, C. R., Fan, L., Willamowski, J., and Bray, C., “Visual categorization with bags of keypoints,” Workshop on Statistical Learning in Computer Vision, ECCV, pages 1-22, 2004.

However, when recognizing complex targets such as birds, flowers, dogs, cats, and insects, it is not easy for a computer to detect the target and its individual parts, for the following reasons.

The first reason is the variety in the appearance of recognition targets. When recognizing a QR code, a book cover, or the like, the target is flat, so photographing it from the front yields a uniform appearance in the captured image. When the target is a three-dimensional object such as a bird, flower, dog, cat, or insect, however, its appearance in the captured image varies widely with the shooting angle. Its appearance also varies with shooting conditions such as illumination and zoom. Furthermore, when the target is an animal or plant, there are individual differences: even birds of the same species differ in shape, color, and so on. It is therefore not easy for computer software to detect the recognition target or its individual parts in a captured image.

The second reason is that the boundary to be detected as the recognition target, or as one of its individual parts, may be unclear in the image. When a computer performs “target detection”, it generally detects the recognition target or its individual parts by extracting edges in the image and partitioning the image into regions. In a photograph of the recognition target, however, there may be little color change at the boundary between the target and its background, or between adjacent parts of the target. In such cases it is difficult for the computer to detect those boundaries.

The third reason is that, for targets such as birds, flowers, dogs, cats, and insects, “target detection” does not necessarily have a true solution. For example, when the recognition target is a bird, the eyes and beak can be defined relatively clearly: it is clear where on the bird's body the eye begins and where the beak begins. For the wings and trunk, however, the boundaries cannot be defined unambiguously on the bird's body; the answer varies from person to person, and no unique answer exists. A task with no true solution is ill-suited to a computer.

The fourth reason is that the amount of calculation becomes enormous. Of steps (1) to (3) above, it is (1) “target detection” that requires the most calculation. In general, “target detection” for a complex target requires complicated processing and therefore consumes a large amount of computing resources. It is thus difficult to implement “target detection” for complex targets such as birds, flowers, dogs, cats, and insects on low-resource hardware.

Accordingly, the present invention provides an image recognition apparatus and an image recognition method that, even when recognizing a complex target, can detect the target accurately without increasing the amount of calculation and then perform subsequent processing such as feature amount extraction and pattern identification.

The present invention does not perform “target detection” automatically. Instead, it has the user enter the position of the recognition target (and its individual parts) via an input unit, thereby delegating to a human a task that computers handle poorly, and uses this detection result in the subsequent processing to recognize the target.

One aspect of the present invention is an image recognition apparatus comprising: an imaging unit for photographing a recognition target and generating recognition image data; a display unit that displays a recognition image based on the recognition image data generated by the imaging unit; an input unit with which a user inputs, for the recognition image, recognition position input data indicating the position of an element of the recognition target; a recognition feature amount extraction unit that extracts a recognition image feature amount from the recognition image data based on the recognition position input data input to the input unit; and a recognition unit that recognizes the recognition target based on the recognition image feature amount extracted by the recognition feature amount extraction unit.
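The role of the recognition feature amount extraction unit can be illustrated with a minimal Python sketch. It assumes a grayscale image stored as nested lists and a user-marked region given as a set of pixel coordinates; the intensity histogram is only a stand-in feature chosen for brevity, since this passage does not specify which feature amount is extracted. The names `region_histogram` and `user_region` are hypothetical.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]  # (row, column)


def region_histogram(image: List[List[int]], region: Set[Pixel], bins: int = 4) -> List[float]:
    """Normalized intensity histogram computed only over the user-marked pixels."""
    counts = [0] * bins
    for row, col in region:
        value = image[row][col]  # intensity assumed to lie in 0..255
        counts[min(value * bins // 256, bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * bins


# Toy 3x4 image: a bright patch in the upper-right corner on a dark background.
image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 10, 10],
]
# Hypothetical user input: the four bright pixels marked on the touch panel.
user_region = {(0, 2), (0, 3), (1, 2), (1, 3)}
feature = region_histogram(image, user_region)
```

Because the region comes from the user, no segmentation is computed here; the background pixels simply never enter the feature.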

With this configuration, of the “target detection”, “feature amount extraction”, and “pattern identification” processes required for recognition, “target detection” is performed by the user. Because detection relies on the user's cognitive ability, the target can be detected accurately even if its appearance varies widely and even if the edges bounding the target or its individual parts are indistinct. “Target detection” can also be performed when it has no single true solution. Moreover, none of this increases the amount of calculation.

In the image recognition apparatus of the present invention, the recognition unit recognizes the recognition target by identifying its correct label from the recognition image feature amount extracted by the recognition feature amount extraction unit, based on a learning result obtained using the relationship between learning image feature amounts and correct labels. Here, a learning image feature amount is an image feature amount extracted, based on learning position input data, from the learning image data corresponding to that learning position input data.

With this configuration, using the learning result allows the target to be recognized appropriately even when a boundary cannot be defined unambiguously, as with a bird's wings or trunk, and the position specified by the position input data therefore differs from user to user. In particular, in learning-based recognition, the more learning image feature amounts are used to generate the learning result, the more accurate recognition with that learning result becomes. Using a learning result generated from a large number of learning image feature amounts prepared in advance thus makes it possible to appropriately recognize targets whose boundaries are hard to determine unambiguously.
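As a rough illustration of recognition from a learning result, the following Python sketch trains a nearest-centroid classifier on labeled feature vectors and uses it to pick the correct label for a new feature vector. The nearest-centroid method, the function names, and the species labels are all assumptions made for the example; this passage does not state which learning algorithm is used.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def train(samples: Sequence[Tuple[List[float], str]]) -> Dict[str, List[float]]:
    """Learning result: one mean feature vector (centroid) per correct label."""
    grouped = defaultdict(list)
    for feature, label in samples:
        grouped[label].append(feature)
    return {
        label: [sum(column) / len(features) for column in zip(*features)]
        for label, features in grouped.items()
    }


def recognize(model: Dict[str, List[float]], feature: List[float]) -> str:
    """Return the label whose centroid is closest to the given feature vector."""
    def squared_distance(a: List[float], b: List[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(model, key=lambda label: squared_distance(model[label], feature))


# Hypothetical learning data: (learning image feature amount, correct label).
samples = [
    ([0.9, 0.1], "sparrow"),
    ([0.8, 0.2], "sparrow"),
    ([0.1, 0.9], "crow"),
    ([0.2, 0.8], "crow"),
]
model = train(samples)
label = recognize(model, [0.7, 0.3])
```

Averaging several training vectors per label also shows why slightly different user-drawn boundaries are tolerable: each one contributes a nearby feature vector rather than a conflicting one.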

The image recognition apparatus of the present invention further comprises: a learning database in which the learning image data, the learning position input data, and the correct labels are stored in association with one another; a learning feature amount extraction unit that extracts, based on the learning position input data, the learning image feature amount from the learning image data associated with it; and a learning unit that obtains the learning result using the relationship between the learning image feature amounts and the correct labels.

With this configuration, even if no learning result is stored in the image recognition apparatus in advance, or no learning result based on a sufficient number of learning image feature amounts has been obtained in the apparatus in advance, and the apparatus cannot obtain a learning result from outside through a communication function, the apparatus can obtain a learning result itself and can therefore recognize the recognition target appropriately using that learning result.

In the image recognition apparatus of the present invention, a plurality of pieces of learning position input data are input for one piece of learning image data; a plurality of learning image feature amounts are extracted by extracting an image feature amount from that one piece of learning image data based on each of the plurality of pieces of learning position input data; and the recognition unit recognizes the recognition target based on a learning result obtained using the relationship between the plurality of learning image feature amounts and the correct label.

With this configuration, the recognition target is recognized based on a plurality of learning results, so recognition accuracy can be improved.

In the image recognition apparatus of the present invention, the plurality of learning image feature amounts include an image feature amount extracted, based on combined learning position input data generated by combining a plurality of pieces of learning position input data input for one piece of learning image data, from the learning image data corresponding to the combined learning position input data.

With this configuration, the learning result can be obtained from a larger number of learning image feature amounts, further improving recognition accuracy.

The image recognition apparatus of the present invention further comprises a position input data combining unit that generates the combined learning position input data by combining a plurality of pieces of learning position input data input for one piece of learning image data.

With this configuration, even when the user photographs a subject to generate learning image data and inputs learning position input data for it, the number of learning image feature amounts available for obtaining the learning result can be increased on the basis of that learning position input data.
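One way to picture the combining of learning position input data is the Python sketch below. It assumes each annotation is a set of pixel coordinates and that combining means taking the union and the intersection of the annotated regions; the passage above does not say which combining operation is used, so both the operations and the name `combine_annotations` are assumptions.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]  # (row, column)


def combine_annotations(annotations: List[Set[Pixel]]) -> List[Set[Pixel]]:
    """Return the original regions plus their union and intersection (no duplicates)."""
    if not annotations:
        return []
    union = set().union(*annotations)
    common = set.intersection(*annotations)
    result = list(annotations)
    for candidate in (union, common):
        # Skip empty regions and regions that are already present.
        if candidate and candidate not in result:
            result.append(candidate)
    return result


# Two hypothetical annotations of the same part on the same learning image.
first = {(0, 0), (0, 1)}
second = {(0, 1), (0, 2)}
augmented = combine_annotations([first, second])
```

Each region in `augmented` can then be passed to feature extraction, so two annotations yield four learning image feature amounts instead of two.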

In the image recognition apparatus of the present invention, the input unit is a touch panel.

With this configuration, the user can input recognition position input data for the recognition image shown on the display unit by tracing or touching the display unit with a fingertip or the like.

In the image recognition apparatus of the present invention, the recognition position input data is data specified by a closed curve, a point, a line, or a combination of these.

With this configuration, a region, a point, or a line can be specified as the recognition position input data.
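A closed curve traced on the touch panel has to be turned into a region before pixels can be selected from it. The Python sketch below shows one standard way to do that, the ray-casting point-in-polygon test, under the assumption that the traced curve is sampled as a list of (x, y) vertices. The function name and the sampling assumption are illustrative and not taken from the patent text.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y)


def inside_closed_curve(point: Point, curve: List[Point]) -> bool:
    """Ray casting: count how many polygon edges a rightward ray from `point` crosses."""
    x, y = point
    inside = False
    n = len(curve)
    for i in range(n):
        x1, y1 = curve[i]
        x2, y2 = curve[(i + 1) % n]  # wrap around to close the curve
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical traced curve: a 4x4 square.
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
center_is_inside = inside_closed_curve((2.0, 2.0), square)
outside_is_inside = inside_closed_curve((5.0, 2.0), square)
```

Points and lines need no such conversion; they can be stored directly as one vertex or an open list of vertices.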

In the image recognition apparatus of the present invention, the elements of the recognition target are the recognition target as a whole and the individual parts of the recognition target.

With this configuration, not only the recognition target as a whole but also its individual parts can be specified by the recognition position input data, so recognition accuracy can be improved.

In the image recognition apparatus of the present invention, when one recognition target is photographed a plurality of times by the imaging unit to generate a plurality of pieces of recognition image data, and the user inputs a plurality of pieces of recognition position input data for those pieces of recognition image data through the input unit, the recognition feature amount extraction unit extracts a plurality of recognition image feature amounts based on the plurality of pieces of recognition position input data, and the recognition unit recognizes the one recognition target based on the plurality of recognition image feature amounts.

With this configuration, even if a characteristic part of the recognition target cannot be captured in a single shot, the target can be recognized from that part as long as one of the multiple shots captures it, so recognition accuracy can be improved.

In the image recognition apparatus of the present invention, the recognition unit recognizes the one recognition target based on a single recognition image feature amount obtained by integrating the plurality of recognition image feature amounts.

With this configuration, one recognition target can be recognized from multiple shots.

In the image recognition apparatus of the present invention, the single recognition image feature amount is the average of the plurality of recognition image feature amounts.

With this configuration, one recognition target can be recognized from multiple shots with simple processing.
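The averaging described above amounts to an element-wise mean of the feature vectors obtained from the individual shots. A minimal Python sketch, assuming each feature amount is a fixed-length list of floats (the function name `average_features` is hypothetical):

```python
from typing import List


def average_features(features: List[List[float]]) -> List[float]:
    """Element-wise mean of feature vectors extracted from several shots of one target."""
    if not features:
        raise ValueError("at least one feature vector is required")
    length = len(features[0])
    if any(len(f) != length for f in features):
        raise ValueError("all feature vectors must have the same length")
    return [sum(column) / len(features) for column in zip(*features)]


# Feature amounts from two hypothetical shots of the same bird.
shot_features = [
    [0.25, 0.75],
    [0.75, 0.25],
]
integrated = average_features(shot_features)
```

The integrated vector has the same length as each input, so it can be fed to the recognition unit exactly like a feature amount from a single shot.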

Another aspect of the present invention is an image recognition system that recognizes a recognition target from image data obtained by photographing, the system comprising a user interface device and a recognition device. The user interface device comprises: an imaging unit for photographing a recognition target and generating recognition image data; a display unit that displays a recognition image based on the recognition image data generated by the imaging unit; an input unit with which a user inputs, for the recognition image, recognition position input data indicating the position of an element of the recognition target; a data transmission unit that transmits the recognition image data generated by the imaging unit, and the recognition position input data input for it, to the recognition device; and a recognition result reception unit that receives the recognition device's recognition result for the recognition image data transmitted by the data transmission unit. The recognition device comprises: a data reception unit that receives the recognition image data and recognition position input data transmitted from the data transmission unit; a recognition feature amount extraction unit that extracts a recognition image feature amount from the recognition image data received by the data reception unit, based on the recognition position input data received by the data reception unit; a recognition unit that recognizes the recognition target based on the recognition image feature amount extracted by the recognition feature amount extraction unit; and a recognition result transmission unit that transmits the recognition result of the recognition unit to the user interface device.

With this configuration as well, as with the image recognition apparatus described above, “target detection” relies on the user's cognitive ability. The target can therefore be detected accurately even if its appearance varies widely and even if the edges bounding the target or its parts are indistinct, “target detection” can be performed even when it has no single true solution, and none of this increases the amount of calculation.
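The exchange between the data transmission unit and the data reception unit can be sketched as a single message that carries the image together with the position input data. The Python example below assumes a JSON message with a base64-encoded image and a dictionary of named elements; the wire format, the field names, and the function names are all assumptions for illustration, since the text does not define a transport or encoding.

```python
import base64
import json
from typing import Dict, List, Tuple

Positions = Dict[str, List[List[int]]]  # element name -> list of [x, y] points


def encode_request(image_bytes: bytes, positions: Positions) -> str:
    """User interface side: bundle image data and position input data into one message."""
    return json.dumps({
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "positions": positions,
    })


def decode_request(message: str) -> Tuple[bytes, Positions]:
    """Recognition device side: recover the image data and position input data."""
    payload = json.loads(message)
    return base64.b64decode(payload["image"]), payload["positions"]


# Hypothetical payload: a few placeholder bytes standing in for a captured image.
sent_image = b"\x89PNG-placeholder"
sent_positions = {"whole_body": [[1, 2], [3, 4], [5, 1]], "eye": [[2, 2]]}
message = encode_request(sent_image, sent_positions)
received_image, received_positions = decode_request(message)
```

Sending both pieces in one message keeps each position input tied to the image it was drawn on, which is what the recognition device needs for feature extraction.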

Still another aspect of the present invention is a mobile terminal comprising: an imaging unit for photographing a recognition target and generating recognition image data; a display unit that displays a recognition image based on the recognition image data generated by the imaging unit; an input unit with which a user inputs, for the recognition image, recognition position input data indicating the position of an element of the recognition target; a data transmission unit that transmits the recognition image data generated by the imaging unit, and the recognition position input data input for it, to a recognition device; and a recognition result reception unit that receives the recognition device's recognition result for the recognition image data transmitted by the data transmission unit.

Still another aspect of the present invention is a program that causes a mobile terminal having an imaging unit to execute: a display step of displaying a recognition image based on recognition image data generated by photographing a recognition target with the imaging unit; an input step of having a user input, for the recognition image, recognition position input data indicating the position of an element of the recognition target; a data transmission step of transmitting the recognition image data, and the recognition position input data input for it, to a recognition device; and a recognition result reception step of receiving the recognition device's recognition result for the recognition image data transmitted in the data transmission step.

Still another aspect of the present invention is an image recognition method comprising: an imaging step of photographing a recognition target and generating recognition image data; a display step of displaying a recognition image based on the recognition image data generated in the imaging step; an input step in which a user inputs, for the recognition image, recognition position input data indicating the position of an element of the recognition target; a recognition feature amount extraction step of extracting a recognition image feature amount from the recognition image data based on the recognition position input data input in the input step; and a recognition step of recognizing the recognition target based on the recognition image feature amount extracted in the recognition feature amount extraction step.

According to the present invention, “target detection” is performed by the user during image recognition, so even complex recognition targets can be detected accurately and the amount of calculation required for “target detection” can be reduced.

A diagram showing the overall configuration of the image recognition apparatus according to the embodiment of the present invention.
A front external view of the image recognition apparatus according to the embodiment of the present invention.
A rear external view of the image recognition apparatus according to the embodiment of the present invention.
A diagram illustrating an example of a plurality of recognition images and the recognition position input data for them in the embodiment of the present invention.
A diagram showing a learning image and a plurality of pieces of learning position input data for it in the embodiment of the present invention.
A diagram showing an example of combining learning position input data in the embodiment of the present invention.
A diagram showing the structure of the learning database in the embodiment of the present invention.
A flowchart showing the image recognition operation of the image recognition apparatus according to the embodiment of the present invention.
A flowchart showing the operation of inputting recognition position input data in the embodiment of the present invention.
A diagram showing an example of the display screen during shooting in the embodiment of the present invention.
A diagram showing an example of the display screen when inputting recognition position input data (whole body) in the embodiment of the present invention.
A diagram showing an example of the display screen when inputting recognition position input data (trunk) in the embodiment of the present invention.
A diagram showing an example of the display screen when inputting recognition position input data (eye) in the embodiment of the present invention.
A diagram showing an example of the display screen when inputting recognition position input data (leg) in the embodiment of the present invention.
A diagram showing an example of the display screen when selecting whether to shoot again in the embodiment of the present invention.

An image recognition apparatus according to an embodiment of the present invention is described below with reference to the drawings. In this embodiment, the recognition target is a bird, and the description takes as an example an image recognition apparatus that photographs a bird and determines the name of its species. The present invention can, however, be applied in the same way to recognition targets other than birds. The image recognition apparatus of the present invention is also effective for recognizing animals and plants such as flowers, dogs, cats, and insects, as well as man-made objects such as temples and automobiles.

FIG. 1 is a diagram showing the overall configuration of the image recognition apparatus according to the present embodiment. The image recognition apparatus 1 has a user interface unit 2 and a learning recognition unit 3. In the present embodiment, of the “target detection”, “feature amount extraction”, and “pattern identification” processes required for image recognition, “target detection” is performed in the user interface unit 2, and “feature amount extraction” and “pattern identification” are performed in the learning recognition unit 3.

The user interface unit 2 has an imaging unit 21, an input unit 22, a display unit 23, an image storage unit 24, and a position input data storage unit 25. The imaging unit 21 captures the subject, generates display image data, and outputs it to the display unit 23. When the user instructs shooting (presses the shutter), the imaging unit 21 generates image data and outputs it to the image storage unit 24 (hereinafter, this image data is referred to as "recognition image data", and an image displayed based on the recognition image data is referred to as a "recognition image"). The input unit 22 accepts position input and other input from the user.

When the user inputs position input data on the recognition image displayed on the display unit 23, the input unit 22 outputs that position input data to the display unit 23 (hereinafter, this position input data is referred to as "recognition position input data"). In the image recognition apparatus 1 of the present embodiment, the recognition position input data has six elements: the whole body of the bird, the torso, the two legs, and the two eyes. When the user indicates that input of the recognition position input data is finished, the input unit 22 outputs the recognition position input data to the position input data storage unit 25. The display unit 23 displays an image based on the display image data, the recognition image data, and the position input received by the input unit 22. The display unit 23 also displays various input buttons. The image storage unit 24 stores the recognition image data generated by the imaging unit 21. The position input data storage unit 25 stores the recognition position input data received from the input unit 22.

The learning recognition unit 3 has a recognition feature quantity extraction unit 31, a pattern identification unit 32, a learning unit 33, a learning feature quantity extraction unit 34, and a learning database 35. Based on the recognition position input data stored in the position input data storage unit 25, the recognition feature quantity extraction unit 31 extracts an image feature quantity from the corresponding recognition image data stored in the image storage unit 24 (hereinafter, an image feature quantity extracted from recognition image data is referred to as a "recognition image feature quantity"). The learning database 35 stores, for each of a plurality of pieces of learning image data, a correct label, a plurality of pieces of position input data entered by a plurality of annotators, and position input data created by morphing, all in association with one another (hereinafter, the position input data stored in the learning database 35 is referred to as "learning position input data").

The learning feature quantity extraction unit 34 extracts image feature quantities from the learning image data based on the learning position input data stored in the learning database 35 (hereinafter, an image feature quantity extracted from learning image data is referred to as a "learning image feature quantity"). The learning unit 33 associates each learning image feature quantity extracted by the learning feature quantity extraction unit 34 with the corresponding correct label taken from the learning database 35, and thereby obtains a learning result. The pattern identification unit 32 refers to the learning result obtained by the learning unit 33, performs pattern identification based on the recognition image feature quantity extracted by the recognition feature quantity extraction unit 31, and outputs the correct label for that recognition image feature quantity as the recognition result.

FIG. 2A is a front external view of the image recognition apparatus 1 of the present embodiment, and FIG. 2B is a rear external view of the same apparatus. As shown in FIGS. 2A and 2B, the image recognition apparatus 1 is a mobile terminal (for example, a mobile phone). As shown in FIG. 2A, the image recognition apparatus 1 has a touch panel 101 and a plurality of buttons 102 on its front. The touch panel 101 corresponds to the display unit 23 in FIG. 1 and also to the input unit 22. The buttons 102 likewise correspond to the input unit 22. As shown in FIG. 2B, the image recognition apparatus 1 has a lens 103 of the imaging unit 21 on its back.

FIG. 3 is a diagram showing an example of a plurality of recognition images and the recognition position input data for them. While the imaging unit 21 is capturing the target O, the user photographs the target O by pressing the shooting button (shutter button) displayed on the touch panel 101. When recognition image data is generated by shooting, the display unit 23 displays a recognition image based on the recognition image data. The user inputs recognition position input data on the recognition image displayed on the display unit 23 with a fingertip (a stylus or other implement may also be used; the same applies hereinafter). As described above, in the present embodiment the user inputs recognition position input data for six elements on the displayed recognition image: the whole body of the bird, the torso, the two legs, and the two eyes.

When the user wants to recognize a certain target, the user can prepare a plurality of pieces of recognition image data with different shooting angles by photographing the target from a plurality of different angles. In FIG. 3, image P1 is a recognition image obtained by photographing the target O from the front, image P2 is a recognition image obtained by photographing the target O from the front left, and image P3 is a recognition image obtained by photographing the target O from the rear left.

In FIG. 3, D1 to D3 denote the recognition position input data entered by the user for the recognition images P1 to P3, respectively. On the recognition image displayed on the display unit 23, the user indicates the whole body of the bird by enclosing it in one closed curve, indicates the torso by enclosing it in one closed curve, indicates each of the two legs with a line, and indicates each of the two eyes with a point. These closed curves, lines, and points constitute the recognition position input data.

If some elements, such as a leg or an eye, do not appear in the recognition image, only the elements that do appear are specified. In the example of FIG. 3, only one of the bird's eyes is visible in the recognition images P2 and P3, so the recognition position input data D2 and D3 each contain only one eye entry. However, each element of the recognition position input data must be specified in at least one of the pieces of recognition position input data entered for the plurality of pieces of recognition image data.
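
The requirement that every element be specified in at least one view can be checked, for example, as in the following minimal sketch. The element names and the dict-per-view layout are illustrative assumptions and are not taken from the embodiment.

```python
# Hypothetical names for the six elements of this embodiment.
ELEMENTS = ("whole_body", "torso", "leg_1", "leg_2", "eye_1", "eye_2")


def missing_elements(annotations):
    """Return the elements that are not specified in any view.

    `annotations` holds one dict per recognition image, mapping an element
    name to its position input (closed curve, line, or point). An element
    that is not visible in a view is absent from that view's dict or None.
    """
    specified = set()
    for view in annotations:
        specified.update(name for name, value in view.items() if value is not None)
    return [name for name in ELEMENTS if name not in specified]
```

An empty return value means that the set of views satisfies the condition; otherwise the returned names could be used to prompt the user for another shot.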

Even if the user cannot capture the characteristic parts of the target in a single shot, the user can shoot several times from different viewpoints. Photographing the target from a plurality of viewpoints and entering recognition position input data for each element in each view improves the accuracy of the subsequent feature quantity extraction and pattern identification. For example, suppose the elements of the recognition position input data include the beak and the wings, and the bird to be recognized has highly characteristic image feature quantities in its beak and wing pattern. Even if the first recognition image shows the beak but not the wing pattern, both characteristic image feature quantities can be obtained by taking a second recognition image in which the wings are clearly visible. In this way, the user can repeat shooting as many times as needed until the characteristic parts of the target have been adequately captured.

The recognition position input data of each element entered through the input unit 22 for one recognition image is stored in the position input data storage unit 25 as the recognition position input data for that recognition image. Recognition position input data is entered in this way for each of a plurality of recognition images, and the position input data storage unit 25 stores the plurality of pieces of recognition position input data for the plurality of pieces of recognition image data. When the user has finished entering recognition position input data for the plurality of recognition images, the user interface unit 2 associates each piece of recognition image data with its recognition position input data and outputs them to the recognition feature quantity extraction unit 31.

The recognition feature quantity extraction unit 31 uses the plurality of pieces of recognition image data and the corresponding recognition position input data to obtain a plurality of recognition image feature quantities for one recognition target. The recognition feature quantity extraction unit 31 averages these recognition image feature quantities and uses the result as the recognition image feature quantity of that recognition target. As described above, recognition position input data may not have been entered for some elements, so part of the recognition position input data may be missing. In that case, the corresponding part of the recognition image feature quantity is also missing, and the missing part is not taken into account in the averaging.

In the above description, a plurality of recognition image feature quantities are averaged to obtain a single recognition image feature quantity, but the plurality of recognition image feature quantities may instead be integrated into a single recognition image feature quantity by a method other than averaging.
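
The averaging described above, in which parts that are missing from a view are left out, can be sketched as follows. The dict-per-view layout and the function name are assumptions made for illustration only.

```python
def average_features(per_view_features):
    """Average per-element feature vectors over views, skipping missing ones.

    `per_view_features` holds one dict per recognition image, mapping an
    element name to its feature vector (a list of floats). An element that
    was not annotated in a view is simply absent from that view's dict, so it
    contributes neither to the sum nor to the divisor.
    """
    sums, counts = {}, {}
    for view in per_view_features:
        for name, vec in view.items():
            if name not in sums:
                sums[name] = [0.0] * len(vec)
                counts[name] = 0
            sums[name] = [s + v for s, v in zip(sums[name], vec)]
            counts[name] += 1
    return {name: [s / counts[name] for s in sums[name]] for name in sums}
```

Dividing by the per-element count rather than by the number of views is what keeps a missing eye or leg in one view from pulling the average toward zero.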

Next, learning will be described. As described above, in the present embodiment the user is asked to enter recognition position input data for six elements on the recognition image: the whole body of the bird, the torso, the two legs, and the two eyes. Depending on the element, however, different users may interpret it differently. For example, when the user is asked to indicate the bird's torso (chest and belly), the boundaries between the torso, the head, and the back are ambiguous, and no exact boundary can be defined. Users therefore enter recognition position input data in a variety of patterns according to their own interpretations. For this reason, in the image recognition apparatus 1 of the present embodiment, learning position input data entered by a plurality of annotators is collected in advance for each learning image and stored in the learning database 35 as independent learning samples. This allows the learning feature quantity extraction unit 34 to extract a different learning image feature quantity from each of a wide variety of learning position input data.

FIG. 4 is a diagram showing a learning image and a plurality of pieces of learning position input data for it. In the example of FIG. 4, for the learning image SP, annotator A has entered torso learning position input data DA, annotator B has entered torso learning position input data DB, and annotator C has entered torso learning position input data DC. Annotators A, B, and C have each interpreted the region of the bird's torso differently for the same learning image. Preparing a plurality of pieces of learning position input data from a plurality of annotators for one piece of learning image data in this way yields a plurality of learning image feature quantities.

FIG. 5 is a diagram showing an example of synthesizing learning position input data. In general, the more learning samples there are, the higher the recognition accuracy. However, collecting learning position input data by hand is laborious. In the present embodiment, therefore, the number of pieces of learning position input data is increased by synthesizing new learning position input data from the data entered by a plurality of annotators. The example of FIG. 5 shows, for one piece of learning image data, the synthesis of two pieces of learning position input data: DA entered by annotator A and DB entered by annotator B. By synthesizing pairs taken from the plurality of pieces of learning position input data entered by the annotators in this way, a plurality of pieces of synthesized learning position input data are obtained.

To synthesize two closed curves (or lines), a morphing technique is used that establishes correspondences between characteristic parts of the closed curves (or lines) and generates an intermediate shape. A technique for establishing correspondences between the characteristic parts of two closed curves is presented in Scott, C.; Nowak, R., "Robust Contour Matching via the Order Preserving Assignment Problem", IEEE Transactions on Image Processing, Volume 15, Issue 7, July 2006, pages 1831-1838.
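
The following is a minimal sketch of generating an intermediate shape from two closed curves. It does not implement the order preserving assignment method of the cited paper. As a plainly simpler stand-in, both curves are resampled to the same number of points by arc length, the cyclic shift that best aligns them is chosen, and corresponding points are linearly interpolated. It assumes that both curves were traced in the same direction, and the function names and the parameters `alpha` and `n` are hypothetical.

```python
import math


def resample_closed(curve, n):
    """Resample a closed polygonal curve to n points equally spaced by arc length."""
    pts = list(curve) + [curve[0]]  # append the first point to close the loop
    seg = [math.dist(pts[i], pts[i + 1]) for i in range(len(pts) - 1)]
    total = sum(seg)
    out, i, walked = [], 0, 0.0
    for k in range(n):
        target = total * k / n
        # advance to the segment that contains the target arc length
        while i < len(seg) - 1 and walked + seg[i] < target:
            walked += seg[i]
            i += 1
        t = (target - walked) / seg[i] if seg[i] > 0 else 0.0
        (x0, y0), (x1, y1) = pts[i], pts[i + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


def morph(curve_a, curve_b, alpha=0.5, n=64):
    """Blend two closed curves; alpha=0 reproduces curve_a, alpha=1 curve_b."""
    a = resample_closed(curve_a, n)
    b = resample_closed(curve_b, n)

    def cost(shift):
        # sum of squared distances for one cyclic alignment of b against a
        return sum(math.dist(a[i], b[(i + shift) % n]) ** 2 for i in range(n))

    shift = min(range(n), key=cost)
    return [((1 - alpha) * ax + alpha * b[(i + shift) % n][0],
             (1 - alpha) * ay + alpha * b[(i + shift) % n][1])
            for i, (ax, ay) in enumerate(a)]
```

Applying `morph` to every pair produced by `itertools.combinations` over the annotators' curves, possibly with several values of `alpha`, would be one way to obtain a plurality of synthesized samples from a few manual inputs.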

FIG. 6 is a diagram showing the structure of the learning database 35. As shown in FIG. 6, the learning database 35 stores in advance, for each of a plurality of pieces of learning image data, a plurality of pieces of learning position input data entered by a plurality of annotators and a plurality of pieces of learning position input data synthesized by morphing. In addition, in the learning database 35 each piece of learning image data is given a correct label.
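
One possible in-memory layout for an entry of such a database is sketched below. The class and field names are hypothetical, and the actual storage format of the learning database 35 is not specified in this description.

```python
from dataclasses import dataclass, field


@dataclass
class LearningEntry:
    """One learning image with its correct label and position inputs."""
    image_id: str
    label: str                                           # correct label
    manual_inputs: dict = field(default_factory=dict)    # annotator -> position input
    morphed_inputs: list = field(default_factory=list)   # synthesized by morphing

    def training_pairs(self):
        """Yield (position_input, label) pairs.

        Each manual or morphed position input is treated as an independent
        learning sample that shares the label of its image.
        """
        for position_input in self.manual_inputs.values():
            yield position_input, self.label
        for position_input in self.morphed_inputs:
            yield position_input, self.label
```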

Next, the extraction of image feature quantities in the recognition feature quantity extraction unit 31 and the learning feature quantity extraction unit 34 will be described. The recognition feature quantity extraction unit 31 takes the recognition position input data and the recognition image data as input, extracts a feature vector, and uses this feature vector as the recognition image feature quantity. Likewise, the learning feature quantity extraction unit 34 takes the learning position input data and the learning image data as input, extracts a feature vector, and uses this feature vector as the learning image feature quantity. Any feature extraction technique used in the field of image processing may be employed to extract the feature vector. The present embodiment uses the Visual Word histogram representation proposed in Non-Patent Document 2 cited above. With this technique, a feature vector of fixed length, independent of the area of the region, can be extracted from a partial region of the image.

The region to be converted into the Visual Word histogram representation is set according to the position input data as follows. When the position input data is a closed curve, the partial region enclosed by the closed curve is converted into the Visual Word histogram representation. This yields an N_1-dimensional feature vector. When the position input data is a point, the partial region of radius r around the point is converted into the Visual Word histogram representation. This yields an N_2-dimensional feature vector. The radius r may be a fixed value or, when a closed curve indicating the whole body has been entered as in the bird example, a value normalized by the size of that closed curve. Normalizing in this way yields a feature vector that does not depend on the scale of the target in the image. When the position input data is a line, the partial region of width w centered on the line is converted into the Visual Word histogram representation. This yields an N_3-dimensional feature vector. The width w may be a fixed value or, when a closed curve indicating the whole body has been entered as in the bird example, a value normalized by the size of that closed curve. Normalizing in this way yields a feature vector that does not depend on the scale of the target in the image.
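
The three kinds of region described above can be sketched as membership tests applied to keypoints that have already been assigned to visual words. Keypoint detection and vocabulary construction are outside this sketch. The function names are hypothetical, and taking the "size" of the whole-body curve to be its bounding-box diagonal is an assumption, since the description does not define the size measure.

```python
import math


def in_closed_curve(p, polygon):
    """Ray-casting test: is point p inside the closed curve (list of vertices)?"""
    x, y = p
    inside = False
    for (x0, y0), (x1, y1) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y0 > y) != (y1 > y) and x < x0 + (y - y0) * (x1 - x0) / (y1 - y0):
            inside = not inside
    return inside


def in_disc(p, center, r):
    """Is p within radius r of the input point?"""
    return math.dist(p, center) <= r


def in_band(p, line, w):
    """Is p within w/2 of the input line (a polyline with at least two points)?"""
    def seg_dist(a, b):
        (px, py), (ax, ay), (bx, by) = p, a, b
        dx, dy = bx - ax, by - ay
        len2 = dx * dx + dy * dy
        t = 0.0 if len2 == 0 else max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / len2))
        return math.hypot(px - (ax + t * dx), py - (ay + t * dy))
    return min(seg_dist(a, b) for a, b in zip(line, line[1:])) <= w / 2


def scaled_length(body_curve, ratio):
    """Assumed normalization: ratio times the bounding-box diagonal of the body curve."""
    xs, ys = [x for x, _ in body_curve], [y for _, y in body_curve]
    return ratio * math.hypot(max(xs) - min(xs), max(ys) - min(ys))


def word_histogram(keypoints, region_test, vocab_size):
    """Count visual words of keypoints ((x, y), word_index) that fall in the region."""
    hist = [0] * vocab_size
    for pos, word in keypoints:
        if region_test(pos):
            hist[word] += 1
    return hist
```

For an eye, for example, `word_histogram(kps, lambda p: in_disc(p, eye, scaled_length(body, 0.05)), vocab_size)` would give a histogram whose length depends only on the vocabulary size and not on the area of the region. The ratio 0.05 is an arbitrary illustrative value.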

The recognition feature quantity extraction unit 31 and the learning feature quantity extraction unit 34 obtain the feature vector of each element as described above and stack these feature vectors vertically to form a high-dimensional feature vector. In the bird recognition of the present embodiment there are two closed curves, two points, and two lines, so a feature vector of N = 2×N_1 + 2×N_2 + 2×N_3 dimensions is obtained. The recognition feature quantity extraction unit 31 and the learning feature quantity extraction unit 34 output this N-dimensional feature vector as the image feature quantity.
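
Stacking the per-element vectors into one N-dimensional vector can be sketched as follows. The element names and their fixed order are assumptions made for illustration. What matters is only that the same order is used for learning and for recognition.

```python
# Hypothetical fixed order: two closed curves, two points, two lines.
ELEMENT_ORDER = ("whole_body", "torso", "eye_1", "eye_2", "leg_1", "leg_2")


def stack_features(features):
    """Concatenate per-element feature vectors in a fixed order.

    `features` maps each element name to its feature vector. With closed-curve
    vectors of length N_1, point vectors of length N_2, and line vectors of
    length N_3, the result has 2*N_1 + 2*N_2 + 2*N_3 entries.
    """
    stacked = []
    for name in ELEMENT_ORDER:
        stacked.extend(features[name])
    return stacked
```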

The method of obtaining the image feature quantity based on the position input data is not limited to the above example. For example, a method other than the Visual Word histogram representation may be used. Also, for example, information on the closed curve itself (its shape and size) may be included in the image feature quantity. Further, when the position input data contains two closed curves as above, a feature quantity representing their relative positional relationship may be included in the image feature quantity. Similarly, the relative positional relationships among the closed curves, lines, and points may be included in the image feature quantity.

The learning unit 33 and the pattern identification unit 32 obtain a learning result using a support vector machine or a neural network and, by using that learning result, obtain a recognition result for the N-dimensional feature vector extracted by the recognition feature quantity extraction unit 31. When a support vector machine is employed, the learning result of the learning unit 33 is the support vectors; when a neural network is employed, the learning result of the learning unit 33 is the weights of the neural network.

The operation of the image recognition apparatus 1 configured as described above will now be described. FIG. 7 is a flowchart showing the image recognition operation of the image recognition apparatus 1. After photographing the recognition target, the user operates the user interface unit 2 and, with the image data of the captured image serving as recognition image data, inputs recognition position input data (step S71). Based on this recognition position input data, the image recognition apparatus 1 extracts a recognition image feature quantity from the recognition image data (step S72). Next, the image recognition apparatus 1 performs pattern identification using the extracted recognition image feature quantity and recognizes the recognition target (step S73).

FIG. 8 is a flowchart showing the operation of inputting recognition position input data. FIGS. 9A to 9F are examples of the display screen of the image recognition apparatus 1 while recognition position input data is being input. First, the imaging unit 21 of the image recognition apparatus 1 captures the subject, and the display unit 23 displays the display image (step S81). FIG. 9A is an example of the display screen at this time. A shooting button B1 is displayed on the display screen. The image recognition apparatus 1 determines whether the shooting button B1 has been pressed (step S82). The display unit 23 continues to display the display image captured by the imaging unit 21 until the shooting button B1 is pressed. When the shooting button B1 is pressed (YES in step S82), the imaging unit 21 generates recognition image data and stores it in the image storage unit 24, and the display unit 23 displays a recognition image based on this recognition image data.

When the recognition image is displayed on the display unit 23, the user inputs recognition position input data on the recognition image. First, the user inputs a closed curve enclosing the whole body and a closed curve enclosing the torso (step S83). FIG. 9B is an example of the display screen when inputting the closed curve enclosing the whole body, and FIG. 9C is an example of the display screen when inputting the closed curve enclosing the torso. The display screen on which the user inputs recognition position input data shows buttons B2 to B5 for selecting the element whose recognition position input data is to be input. To input the recognition position input data of the whole body, the user presses the whole-body button B2 to select the whole body, as shown in FIG. 9B, and inputs a closed curve enclosing the whole body. Specifically, once the whole-body button B2 has been pressed, a line can be drawn by tracing on the screen with a fingertip, so the user encircles the whole body of the bird with a line. When the fingertip is lifted from the screen, the start point and end point of the curve are joined automatically to form a closed curve. This closed curve becomes the recognition position input data of the whole body.
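
The automatic closing of the traced line when the fingertip is lifted can be sketched as follows. Representing the stroke as a list of sampled (x, y) touch positions and rejecting strokes with fewer than three samples are assumptions of this sketch, not behavior stated in the embodiment.

```python
def close_stroke(stroke):
    """Join the end point of a traced stroke to its start point.

    `stroke` is the list of (x, y) touch samples recorded between touch-down
    and touch-up. The returned list ends at the start point, so consecutive
    pairs of points describe a closed curve.
    """
    if len(stroke) < 3:
        raise ValueError("a closed curve needs at least three points")
    if stroke[0] == stroke[-1]:
        return list(stroke)
    return list(stroke) + [stroke[0]]
```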

To input the recognition position input data of the torso, the user presses the torso button B3 to select the torso, as shown in FIG. 9C, and inputs a closed curve enclosing the torso. Specifically, once the torso button B3 has been pressed, a line can be drawn by tracing on the screen with a fingertip, so the user encircles the bird's torso (the part including the belly and chest) with a line. When the fingertip is lifted from the screen, the start point and end point of the curve are joined automatically to form a closed curve. This closed curve becomes the recognition position input data of the torso.

The user presses the eye button B4 to select the eyes and specifies each of the bird's two eyes with a point (step S84). FIG. 9D is an example of the display screen at this time. Specifically, once the eye button B4 has been pressed, a point is drawn wherever the screen is touched with a fingertip, so the user touches both of the bird's eyes. These two points become the recognition position input data of the eyes.

The user further presses the leg button B5 to select the legs and specifies each of the bird's two legs with a line (step S85). FIG. 9E is an example of the display screen at this time. Specifically, once the leg button B5 has been pressed, a line can be drawn by tracing on the screen with a fingertip, so the user traces both of the bird's legs. These two lines become the recognition position input data of the legs. The recognition position input data of the elements in steps S83, S84, and S85 may be input in any order.

When input of the recognition position input data for the recognition image is finished, it is determined whether the same subject is to be photographed again (step S86). FIG. 9F is an example of the display screen at this time. On the display screen of FIG. 9F, the user presses the "Yes" button B6 to shoot again, or the "No" button B7 not to shoot again. If another shot is requested (YES in step S86), the process returns to step S81, where the imaging unit 21 captures the subject and waits for the shooting button B1 to be pressed. If no further shot is to be taken (NO in step S86), input of the recognition position input data ends, and the recognition position input data entered up to that point is output (step S87).
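
The loop of steps S81 to S87 can be summarized in the following sketch. The three callables stand in for the imaging unit, the annotation screens, and the reshoot dialog. They are hypothetical placeholders and not interfaces defined in this description.

```python
def collect_inputs(capture, annotate, ask_reshoot):
    """Run the shoot-and-annotate loop and return all (image, inputs) pairs.

    capture()      -> recognition image data once the shutter is pressed (S81, S82)
    annotate(img)  -> recognition position input data for that image (S83 to S85)
    ask_reshoot()  -> True if the user chooses to shoot again (S86)
    """
    results = []
    while True:
        image = capture()
        inputs = annotate(image)
        results.append((image, inputs))
        if not ask_reshoot():
            return results  # S87: output everything entered so far
```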

An embodiment of the present invention has been described above, but the above description is merely illustrative, and the present invention may also be carried out in other embodiments.

For example, in the above embodiment, the screen for inputting recognition position input data displays the buttons B2 to B5 for designating the element to be input, and the user presses one of the buttons to select an element and then inputs the recognition position input data of that element. Alternatively, the image recognition apparatus may designate to the user the element whose recognition position input data is to be input. In this case, the image recognition apparatus can prompt the user to input the recognition position input data of the corresponding element by presenting guidance such as "Enclose the whole body in a closed curve", "Enclose the torso in a closed curve", "Touch the eyes (two)", or "Trace the legs (two) with lines", either as text on the display unit 23 or read aloud through a device such as a speaker. The screen for inputting recognition position input data may also provide user interface elements that assist the input work, such as a clear button that resets what has been entered and an end button that finishes the input work.

In the above embodiment, the learning recognition unit 3 includes the learning database 35 and the learning feature amount extraction unit 34 used for learning, but the image recognition apparatus of the present invention need not include them. That is, in the above embodiment, the learning database 35 stores learning images, learning position input data, and correct labels in association with one another; the learning feature amount extraction unit 34 obtains image feature amounts from the learning images based on the learning position input data; and the learning unit 33 obtains a learning result by associating these image feature amounts with the correct labels. These processes may instead be performed in advance outside the image recognition apparatus 1, and the resulting learning result may be stored in the learning unit 33 beforehand. The image recognition apparatus 1 may also acquire such a learning result from outside by communication. With this configuration, the image recognition apparatus 1 no longer needs to include the learning database 35 and the learning feature amount extraction unit 34.

In the above embodiment, the learning database 35 stores in advance the learning images and the learning position input data entered for them by a plurality of annotators, but these data may instead be generated by the image recognition apparatus 1. In this case, the user interface unit 2 photographs the subject to be recognized in the same manner as described above to generate image data, and the user enters position input data and a correct label for that image data. This image data serves as learning image data, and this position input data serves as learning position input data. The learning image data and the learning position input data are input to the learning feature amount extraction unit 34, either directly or after being saved in the learning database 35. The learning feature amount extraction unit 34 extracts image feature amounts in the same manner as described above. The user further assigns a correct label to the learning image data via the input unit 22. The correct label may be assigned by selecting from the correct labels already in the learning database 35, or the user may enter it directly. The learning unit 33 obtains a learning result by associating the extracted image feature amounts with the correct label.

The learning position input data may also be combined in the image recognition apparatus 1. In this case, the image recognition apparatus 1 includes a position input data combining unit. The position input data combining unit combines, as described above, a plurality of pieces of learning position input data stored in the learning database 35 for the same learning image, and generates new learning position input data.

In the above embodiment, the user interface unit 2 and the learning recognition unit 3 are both provided in a mobile terminal and together constitute the image recognition apparatus 1, but embodiments of the present invention are not limited to this. For example, the user interface unit 2 may be provided in a mobile terminal, and the learning recognition unit 3 may be provided in another computer that can communicate with the mobile terminal. In this case, the recognition image data obtained when the user photographs with the imaging unit 21 and the recognition position input data entered by the user through the input unit 22 are transmitted to the other computer via a communication network. The other computer performs image recognition based on the recognition image data and recognition position input data received from the mobile terminal, and transmits the recognition result to the mobile terminal. The mobile terminal receives this recognition result.

In the above embodiment, after recognition position input data has been entered for one shot, the user interface unit 2 asks the user whether to photograph again, and the user decides. The user can thus repeat photographing and entry of recognition position input data until satisfied that enough has been entered. However, the user may be unable to judge how long to keep repeating. The user interface unit 2 may therefore be given a function that notifies the user whether sufficient recognition position input data has been obtained, or that forces the user to photograph again when it has not. As described above, obtaining a recognition image feature amount and performing pattern identification requires at least one piece of recognition position input data for each element of the recognition target. Accordingly, this function can inform the user whether at least one piece of recognition position input data has been obtained for every element, or can force the user to photograph again until at least one piece has been obtained for every element.
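The sufficiency check described here (at least one piece of recognition position input data per element, across all shots) might look like the sketch below. The element names in `REQUIRED_ELEMENTS` are assumptions taken from the bird guidance examples, and the functions `missing_elements` and `must_shoot_again` are hypothetical names introduced only for illustration.

```python
from typing import Dict, Iterable, List, Sequence, Set

# Assumed element names for the bird example; the embodiment does not fix identifiers.
REQUIRED_ELEMENTS = ("whole_body", "torso", "eyes", "legs")


def missing_elements(
    shots: Iterable[Dict[str, list]],
    required: Sequence[str] = REQUIRED_ELEMENTS,
) -> List[str]:
    """Return the required elements that have no position input in any shot."""
    covered: Set[str] = set()
    for marks in shots:
        # An element counts as covered only if something was actually drawn for it.
        covered.update(name for name, strokes in marks.items() if strokes)
    return [name for name in required if name not in covered]


def must_shoot_again(
    shots: Iterable[Dict[str, list]],
    required: Sequence[str] = REQUIRED_ELEMENTS,
) -> bool:
    """True while at least one required element is still unannotated."""
    return bool(missing_elements(shots, required))
```

The list returned by `missing_elements` could drive the notification variant (telling the user which elements are still missing), while `must_shoot_again` corresponds to the variant that forces another shot.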

The above embodiment has been described using an image recognition apparatus that recognizes birds as an example, but the image recognition apparatus of the present invention may also recognize other animals and plants such as insects, flowers, and fish. For example, when recognizing an insect, the position input data can be data in which the legs are designated by six lines and the head, thorax, and abdomen are each designated by a closed curve. When the recognition target is a flower, the position input data can be data in which the petals are designated by closed curves, the center of the flower by a point, and the stem by a line. When the recognition target is a fish, the position input data can be data in which the whole body, the dorsal fin, and the tail fin are each designated by a closed curve and the eye by a point. The image recognition apparatus of the present invention may further recognize man-made objects such as temples and automobiles.
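One way to picture these per-target conventions is as a small table mapping each element to the kind of stroke expected for it. The sketch below is an assumption-laden illustration: the `Shape` enum, the `SCHEMAS` table, the element identifiers, and the `validate` function are all hypothetical, and a count is recorded only where the text states one (six lines for insect legs).

```python
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Shape(Enum):
    CLOSED_CURVE = "closed_curve"
    POINT = "point"
    LINE = "line"


# element -> (expected shape, required count or None where the text gives no count)
SCHEMAS: Dict[str, Dict[str, Tuple[Shape, Optional[int]]]] = {
    "insect": {
        "leg": (Shape.LINE, 6),
        "head": (Shape.CLOSED_CURVE, None),
        "thorax": (Shape.CLOSED_CURVE, None),
        "abdomen": (Shape.CLOSED_CURVE, None),
    },
    "flower": {
        "petal": (Shape.CLOSED_CURVE, None),
        "center": (Shape.POINT, None),
        "stem": (Shape.LINE, None),
    },
    "fish": {
        "whole_body": (Shape.CLOSED_CURVE, None),
        "dorsal_fin": (Shape.CLOSED_CURVE, None),
        "tail_fin": (Shape.CLOSED_CURVE, None),
        "eye": (Shape.POINT, None),
    },
}


def validate(target: str, marks: Dict[str, List[Shape]]) -> List[str]:
    """Return a list of problems; an empty list means the marks fit the schema."""
    problems: List[str] = []
    for element, (shape, count) in SCHEMAS[target].items():
        drawn = marks.get(element, [])
        if not drawn:
            problems.append(f"{element}: missing")
        elif any(s is not shape for s in drawn):
            problems.append(f"{element}: expected {shape.value}")
        elif count is not None and len(drawn) != count:
            problems.append(f"{element}: expected {count}, got {len(drawn)}")
    return problems
```

Keeping the conventions in a data table rather than in code means a new recognition target (a temple or an automobile, say) would only need a new table entry.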

Because the present invention has the user perform the "target detection" part of image recognition, it can detect the target accurately even for complex recognition targets while reducing the amount of computation required for target detection. It is therefore applicable to image recognition apparatuses and the like that recognize an object shown in a captured image.

Description of Symbols
1 Image recognition apparatus
2 User interface unit
21 Imaging unit
22 Input unit
23 Display unit
24 Image storage unit
25 Position input data storage unit
3 Learning recognition unit
31 Recognition feature amount extraction unit
32 Pattern identification unit
33 Learning unit
34 Learning feature amount extraction unit
35 Learning database
101 Touch panel
102 Button
103 Lens

Claims

An image recognition apparatus comprising:
An imaging unit that captures a recognition target and generates recognition image data;
A display unit that displays a recognition image based on the recognition image data generated by the imaging unit;
An input unit through which a user inputs recognition position input data indicating, on the recognition image, the position of an element of the recognition target;
A recognition feature amount extraction unit that extracts a recognition image feature amount from the recognition image data based on the recognition position input data input to the input unit; and
A recognition unit that recognizes the recognition target based on the recognition image feature amount extracted by the recognition feature amount extraction unit,
Wherein the recognition unit recognizes the recognition target by identifying the correct label of the recognition target from the recognition image feature amount extracted by the recognition feature amount extraction unit, based on a learning result obtained using the relationship between a learning image feature amount and a correct label,
The learning image feature amount is an image feature amount extracted, based on learning position input data, from the learning image data corresponding to that learning position input data,
A plurality of pieces of learning position input data are input for one piece of the learning image data, and a plurality of the learning image feature amounts are extracted by extracting an image feature amount from the one piece of learning image data based on each of the plurality of pieces of learning position input data, and
The recognition unit recognizes the recognition target based on a learning result obtained using the relationship between the plurality of learning image feature amounts and the correct label.

The image recognition apparatus according to claim 1, further comprising:
A learning database in which the learning image data, the learning position input data, and the correct label are stored in association with one another;
A learning feature amount extraction unit that extracts, based on the learning position input data, the learning image feature amount from the learning image data associated with it; and
A learning unit that acquires the learning result using the relationship between the learning image feature amount and the correct label.

The image recognition apparatus according to claim 1, wherein the plurality of learning image feature amounts include an image feature amount extracted, based on combined learning position input data generated by combining the plurality of pieces of learning position input data input for the one piece of learning image data, from the learning image data corresponding to the combined learning position input data.

The image recognition apparatus according to claim 3, further comprising a position input data combining unit that generates the combined learning position input data by combining the plurality of pieces of learning position input data input for the one piece of learning image data.

The image recognition apparatus according to any one of claims 1 to 4, wherein the input unit is a touch panel.

The image recognition apparatus according to any one of claims 1 to 5, wherein the recognition position input data is data represented by a closed curve, a point, a line, or a combination thereof.

The image recognition apparatus according to any one of claims 1 to 6, wherein the elements of the recognition target are the whole of the recognition target and individual parts of the recognition target.

The image recognition apparatus according to claim 1, wherein, when the imaging unit photographs one recognition target a plurality of times to generate a plurality of pieces of recognition image data and the user inputs to the input unit a plurality of pieces of recognition position input data for the plurality of pieces of recognition image data, the recognition feature amount extraction unit extracts a plurality of recognition image feature amounts based on the plurality of pieces of recognition position input data, and
The recognition unit recognizes the one recognition target based on the plurality of recognition image feature amounts.

The image recognition apparatus according to claim 8, wherein the recognition unit recognizes the one recognition target based on one recognition image feature amount obtained by integrating the plurality of recognition image feature amounts.

The image recognition apparatus according to claim 9, wherein the one recognition image feature amount is an average of the plurality of recognition image feature amounts.

An image recognition system comprising a user interface device and a recognition device and recognizing a recognition target from image data obtained by photographing,
The user interface device includes:
An imaging unit for capturing a recognition target and generating image data for recognition;
A display unit for displaying a recognition image based on the recognition image data generated by the imaging unit;
An input unit for a user to input recognition position input data that indicates the position of the recognition target element with respect to the recognition image;
A data transmission unit that transmits the image data for recognition generated by the imaging unit and the position input data for recognition input thereto to the recognition device;
A recognition result receiving unit that receives a recognition result by the recognition device for recognition image data transmitted by the data transmission unit;
The recognition device includes:
A data receiving unit for receiving the recognition image data and the recognition position input data transmitted from the data transmission unit;
A recognition feature amount extraction unit that extracts a recognition image feature amount from recognition image data received by the data reception unit based on the recognition position input data received by the data reception unit;
A recognition unit for recognizing the recognition target based on the recognition image feature amount extracted by the recognition feature amount extraction unit;
A recognition result transmission unit that transmits a recognition result by the recognition unit to the user interface device,
The recognition unit recognizes the recognition target by identifying the correct label of the recognition target from the recognition image feature amount extracted by the recognition feature amount extraction unit, based on a learning result obtained using the relationship between a learning image feature amount and a correct label,
The learning image feature amount is an image feature amount extracted, based on learning position input data, from the learning image data corresponding to that learning position input data,
A plurality of pieces of learning position input data are input for one piece of the learning image data, and a plurality of the learning image feature amounts are extracted by extracting an image feature amount from the one piece of learning image data based on each of the plurality of pieces of learning position input data, and
The recognition unit recognizes the recognition target based on a learning result obtained using the relationship between the plurality of learning image feature amounts and the correct label.

An image recognition method comprising:
An imaging step of capturing a recognition target and generating recognition image data;
A display step of displaying a recognition image based on the recognition image data generated in the imaging step;
An input step in which a user inputs recognition position input data indicating, on the recognition image, the position of an element of the recognition target;
A recognition feature amount extraction step of extracting a recognition image feature amount from the recognition image data based on the recognition position input data input in the input step; and
A recognition step of recognizing the recognition target based on the recognition image feature amount extracted in the recognition feature amount extraction step,
Wherein, in the recognition step, the recognition target is recognized by identifying the correct label of the recognition target from the recognition image feature amount extracted by the recognition feature amount extraction unit, based on a learning result obtained using the relationship between a learning image feature amount and a correct label,
The learning image feature amount is an image feature amount extracted, based on learning position input data, from the learning image data corresponding to that learning position input data,
A plurality of pieces of learning position input data are input for one piece of the learning image data, and a plurality of the learning image feature amounts are extracted by extracting an image feature amount from the one piece of learning image data based on each of the plurality of pieces of learning position input data, and
In the recognition step, the recognition target is recognized based on a learning result obtained using the relationship between the plurality of learning image feature amounts and the correct label.