JP2019536164A

JP2019536164A - Image processing apparatus, image processing method, and image processing program

Info

Publication number: JP2019536164A
Application number: JP2019527275A
Authority: JP
Inventors: カランランパル
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2016-11-30
Filing date: 2016-11-30
Publication date: 2019-12-12
Anticipated expiration: 2036-11-30
Also published as: US20190311216A1; JP6756406B2; WO2018100668A1; US11138464B2

Abstract

[Problem] To provide an image processing apparatus capable of reducing the influence of the background on the similarity score or similarity label of an object recognition task. [Solution] An image processing apparatus 10 includes: a feature extraction unit 11 that obtains features in each scaled sample of a region of interest in a survey image; a saliency generation unit 12 that computes the probabilities of the pixels in the scaled samples that contribute to the score or label of an object of interest in the region of interest; and a dropout processing unit 13 that uses the computed probabilities to remove, from the scaled samples, features that are not essential for computing the score or label of the object. [Selected drawing] FIG. 6

Description

The present invention relates to an image processing apparatus, an image processing method, and an image processing program, and more particularly to an image processing apparatus, method, and program that remove the background for the purpose of object recognition.

Object recognition tasks have many practical applications in surveillance, biometrics, and the like. The purpose of these tasks is to output a label or score indicating the level of similarity between a pair of input images that contain an object of interest. The object may be a person, a vehicle, an animal, or the like. Metric learning is one of the most effective techniques for obtaining a similarity score. Its purpose is to compute the distance between input images. First, the input images are projected onto a feature space, which may be learned or hand-crafted. Then a metric or function is learned that can compute distances in the new feature space by effectively separating similar and dissimilar features by a given margin.

For robust object recognition, however, the influence of the background on the final score must also be considered, because in an unconstrained environment the background can cause false positives or false negatives. For example, even when the objects are very different, a recognition algorithm may still yield a high similarity score simply because the backgrounds are very similar. The reverse also holds: when the objects are similar but the backgrounds differ, the recognition score can be quite low. This problem therefore needs to be addressed.

However, little progress has been made on this problem, and it remains open. Many methods focus on improving detection so that the resulting image contains more of the object than of the background. Other methods concentrate on improving the features or the metrics. Little systematic effort has gone into background subtraction itself. A method that can deal with the influence of the background on the recognition task is therefore needed.

One method of recognizing an object is to combine multiple metrics (see Patent Document 1). Patent Document 1 describes extracting multiple hand-crafted features from an image and then applying several similarity functions, such as the Bhattacharyya coefficient and cosine similarity. Finally, the RankBoost algorithm is used to combine them. This method achieves high accuracy and combines the advantages of many metrics.

Another method, for scale estimation, uses a triangular graph (see Patent Document 2). Patent Document 2 describes fitting a triangular graph inside an object (for example, a person) by minimizing an energy function using dynamic programming. The graph describes the shape of the person. The method also improves its robustness by incorporating color information in the HSV color space.

Patent Document 3 describes a brightness transfer function (BTF). A BTF is a function that maps the appearance of an object from one camera to another. The BTF maps found from the individual training images are combined by weighting into a single model (WBTF), which is used for prediction. BTF maps are suitable when illumination variation is a major problem.

In Non-Patent Document 1, a hand-crafted feature called the Local Maximal Occurrence (LOMO) representation is computed for each image. In this method, a projection matrix is learned efficiently together with a metric function that is similar in principle to quadratic discriminant analysis.

Non-Patent Document 2 discloses end-to-end learning of the similarity between a pair of images. That is, the entire pipeline of feature generation, feature extraction, and metric learning is brought together by training a deep neural network. A contrastive loss function that helps improve discriminative ability is also proposed.

Patent Document 1: US Patent Application Publication No. 2007/0211938
Patent Document 2: US Patent Application Publication No. 2013/0343642
Patent Document 3: US Patent Application Publication No. 2014/0098221
Patent Document 4: Japanese Unexamined Patent Application Publication No. 2011-253354
Patent Document 5: Japanese Translation of PCT International Application Publication No. 2013-541119

Non-Patent Document 1: S. Liao, Y. Hu, Xiangyu Zhu, and S. Z. Li, "Person re-identification by Local Maximal Occurrence representation and metric learning," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 2197-2206.
Non-Patent Document 2: S. Chopra, R. Hadsell, and Y. LeCun, "Learning a Similarity Metric Discriminatively, with Application to Face Verification," 2005 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Washington, DC, 2005, pp. 539-546.

Object recognition involves extracting features from an input image in order to represent the object in a more descriptive space. A metric function or distance function is learned in this space or in a subspace of it, and the function can then be used to compare input images. However, the extracted features include not only features of the object in the input image but also features belonging to the background. These background features can cause inconsistent recognition results, and a metric learned from them is not sufficiently robust. A technique that can remove such features early in the learning process is therefore needed.

In Patent Document 1, multiple metrics are learned or computed from the input features. These metrics are then combined using a ranking function that evaluates each of them. The ranking function is learned in a manner similar to a boosting algorithm. This technique does not address the problem of background features directly. Patent Document 1 assumes that at least one of the metric functions has sufficient discriminative power to tell foreground features from background features. Whether that holds, however, depends on the application, the features, and so on, and the background features themselves are never handled directly.

The method described in Patent Document 2 models the shape of an object using a triangular graph and a color histogram, which improves the ability to distinguish foreground from background. If the object is non-rigid, however, it is difficult to model its shape effectively. For example, when the object is a human, the shape is arbitrary and is not a geometric shape such as a square or an ellipse. The method is therefore poorly suited to such cases.

In Patent Document 3, a function that maps the appearance of an object from one camera to another is learned. However, this learning is not always feasible when the environment cannot be accessed or controlled. Without calibration information it is difficult to learn the projection matrix, and hence difficult to learn the mapping function.

The method described in Non-Patent Document 1 requires hand-crafted features, that is, features designed specifically with a particular application in mind. Such features work very well in that application but do not generalize to other application areas.

The apparatus described in Patent Document 4 removes unnecessary features using depth information. Obtaining the depth information of an image requires dedicated hardware or camera calibration information.

The method described in Patent Document 5 uses pixel variance to remove unnecessary (background) features. A pixel with high variance is regarded as background and removed, and a pixel with low variance is regarded as foreground. Because the method assumes that low-variance pixels are necessarily foreground, it is not suitable for scenes with varying illumination.

One object of the present invention is to provide an image processing apparatus, an image processing method, and an image processing program that can reduce the influence of the background on the similarity score or similarity label of an object recognition task.

An image processing apparatus according to the present invention includes: feature extraction means for obtaining features in each scaled sample of a region of interest in a survey image; saliency generation means for computing the probabilities of the pixels in the scaled samples that contribute to the score or label of an object of interest in the region of interest; and dropout processing means for using the computed probabilities to remove, from the scaled samples, features that are not essential for computing the score or label of the object.

An image processing method according to the present invention includes: obtaining features in each scaled sample of a region of interest in a survey image; computing the probabilities of the pixels in the scaled samples that contribute to the score or label of an object of interest in the region of interest; and using the computed probabilities to remove, from the scaled samples, features that are not essential for computing the score or label of the object.

A non-transitory computer-readable recording medium according to the present invention stores an image processing program that, when executed by a computer, obtains features in each scaled sample of a region of interest in a survey image, computes the probabilities of the pixels in the scaled samples that contribute to the score or label of an object of interest in the region of interest, and uses the computed probabilities to remove, from the scaled samples, features that are not essential for computing the score or label of the object.

According to the present invention, the influence of the background on the similarity score or similarity label of an object recognition task can be reduced.

FIG. 1 is a block diagram showing a configuration example of an image processing apparatus 100 according to the first embodiment of the present invention.
FIG. 2 is a flowchart showing an example of the operation of the image processing apparatus 100 according to the first embodiment of the present invention.
FIG. 3 is a flowchart showing the estimation process performed by the image processing apparatus 100 according to the first embodiment of the present invention.
FIG. 4 is a flowchart showing the dropout process performed by the image processing apparatus 100 according to the first embodiment of the present invention.
FIG. 5 is a flowchart showing the saliency process performed by the image processing apparatus 100 according to the first embodiment of the present invention.
FIG. 6 is a block diagram showing a configuration example of an image processing apparatus 10 according to the second embodiment of the present invention.
FIG. 7 is a block diagram showing a hardware configuration example of a computer 1000 capable of implementing the image processing apparatus according to the embodiments of the present invention.

The overall approach to solving the above technical problems is summarized below. The performance of object recognition is affected by background features, especially when the background scene is complex, so the background features need to be suppressed. Given the position of an object of interest in an image, several scaled samples are generated, and features are extracted from these scaled samples. A saliency map is generated by back-propagating the output of an object detector to the input image. With the aid of the saliency map, the probability that each pixel belongs to the object or to the background is computed. Dropout is then applied by removing the neurons that correspond to features of pixels whose probability indicates background. Finally, a score is obtained by feature matching on the remaining features, and the target image with the highest score is selected as the output.

The present invention is designed to solve the problems described above. In addition to those problems, other apparent disadvantages that the present invention can overcome will become clear from the detailed description and the drawings.

<Embodiment 1>
Hereinafter, a first embodiment of the present invention will be described in detail.

FIG. 1 is a block diagram showing a configuration example of the image processing apparatus 100 according to the first embodiment of the present invention. Referring to FIG. 1, the image processing apparatus 100 includes an input unit 101, an object detection unit 102, a feature extraction unit 103, a learning unit 104, a model storage unit 105, a saliency generation unit 106, a dropout processing unit 107, a feature matching unit 108, a parameter update unit 109, an output unit 110, and a training data set storage unit 111.

In the tracking stage, the input unit 101 receives a series of frames, that is, images such as video frames or still images. The input unit 101 also receives a series of training frames, for example during or before the learning stage. In the following description, the frames may be referred to as "images" and a single frame as an "image". Likewise, the training frames are referred to as "training images" and a single training frame as a "training image".

The object detection unit 102 detects a region of interest in a frame, that is, an object such as a face or another object that may consist of several parts. In the following description, the object detection unit 102 detects a person in the frame. The object detection unit 102 provides the position of the person in the frame, that is, the xy coordinates of the upper-left and lower-right corners of the bounding box. In the following description, the object detection unit 102 may be referred to as the "object detector".

The feature extraction unit 103 extracts features from the region of interest provided by the object detection unit 102. Using the position provided by the object detection unit 102, the feature extraction unit 103 generates scaled samples. These samples are then normalized so that they lie in the same coordinate system, where the coordinates are defined in a coordinate system set in advance in the frame. Finally, features are extracted from these sample images. The features may be edges, texture, color, temporal or spatial information, and/or other higher-level or lower-level information from the sample images, or a combination of these.
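The scaled-sample step can be sketched as follows. This is a minimal illustration, not the apparatus itself: the function names `scaled_boxes` and `crop_and_resize`, the scale factors, the 128x64 output size, and nearest-neighbour resampling are all assumptions introduced here; the source only says that scaled samples are generated around the detected position and normalized to a common coordinate system.

```python
import numpy as np

def scaled_boxes(box, scales=(0.8, 1.0, 1.2)):
    """Scale a bounding box (x1, y1, x2, y2) about its centre.

    The scale factors are hypothetical; the source does not list them.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = x2 - x1, y2 - y1
    return [(cx - s * w / 2, cy - s * h / 2, cx + s * w / 2, cy + s * h / 2)
            for s in scales]

def crop_and_resize(image, box, out_hw=(128, 64)):
    """Crop `box` from `image` (H, W or H, W, C) and resample it to `out_hw`.

    Nearest-neighbour sampling stands in for the normalization step; pixels
    outside the frame are clamped to the nearest valid pixel.
    """
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    out_h, out_w = out_hw
    ys = np.clip(np.round(np.linspace(y1, y2 - 1, out_h)).astype(int), 0, h - 1)
    xs = np.clip(np.round(np.linspace(x1, x2 - 1, out_w)).astype(int), 0, w - 1)
    return image[np.ix_(ys, xs)]

# Usage: three scaled samples from one detection, all with the same shape.
frame = np.arange(200 * 100).reshape(200, 100)
samples = [crop_and_resize(frame, b) for b in scaled_boxes((20, 40, 60, 140))]
```

Resampling every crop to one fixed size is what lets the later steps compare feature maps across scales position by position.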

The learning unit 104 learns a model from one or more series of training frames. More specifically, from the features extracted from the training frames, the learning unit 104 learns the model that will be used to compute the saliency maps of the samples. As part of learning the model parameters, the learning unit 104 may compute a mean vector and a covariance matrix from the features of the samples. The learning unit 104 may also compute the gradient of the output of the object detector with respect to the input image.

This model essentially captures the distribution of the features of the scaled samples. More specifically, it captures the likelihood, as output by the object detector, that a pixel belongs to a particular label. Normally, the object detector maximizes its output score so that a given input image fits the desired label. The present embodiment requires the reverse: given the label to be generated, the object detector maximizes its output score so that the image fits the given label. The model storage unit 105 stores the parameters of the model, which are used for inference, that is, to evaluate the model on a given input.

The saliency generation unit 106 uses the model parameters stored in the model storage unit 105 to derive the probability that a given pixel belongs to a particular label. The probability is computed by taking the gradient of the output of the object detector with respect to a random image. The random image is updated repeatedly until its pixels represent the probabilities. This iterative procedure yields the required saliency map.
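The iterative update described above can be illustrated with a toy example. Everything in it is an assumption made for illustration: the detector is replaced by a hypothetical logistic score over a linear filter (so that its gradient is available in closed form), and the step count, step size, and final min-max rescaling to [0, 1] are choices made here, not values from the source.

```python
import numpy as np

def detector_score(image, weights, bias=0.0):
    """Toy stand-in for the object detector: logistic score of a linear filter."""
    return 1.0 / (1.0 + np.exp(-(np.sum(weights * image) + bias)))

def saliency_map(weights, bias=0.0, steps=50, lr=1.0, seed=0):
    """Gradient ascent on a random image so that it fits the detector's label.

    Returns the optimized image rescaled to [0, 1], read here as per-pixel
    probabilities of belonging to the object.
    """
    rng = np.random.default_rng(seed)
    image = rng.normal(scale=0.01, size=weights.shape)  # random start image
    for _ in range(steps):
        s = detector_score(image, weights, bias)
        grad = s * (1.0 - s) * weights  # d(score)/d(image) for the logistic model
        image += lr * grad              # move the image toward the label
    lo, hi = image.min(), image.max()
    return (image - lo) / (hi - lo + 1e-12)

# Usage: a filter that responds to the central 4x4 block of an 8x8 image.
w = np.zeros((8, 8))
w[2:6, 2:6] = 1.0
sal = saliency_map(w)
```

With a real detector the gradient would come from back-propagation through the network rather than from a closed-form expression; the loop structure would stay the same.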

The saliency map generated by the saliency generation unit 106 is input to the dropout processing unit 107. In the dropout processing unit 107, each feature of a sample is directly associated with its probability from the saliency map. If the probability of a feature is low, that is, if the feature belongs to the background class, the feature is removed. If the feature belongs to the object, it is rescaled using the probability. This produces the final features used for matching.
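A minimal sketch of this saliency-guided dropout is given below. The threshold value and the choice to rescale kept features by multiplying them with their probability are assumptions made here; the source states only that low-probability features are removed and the remaining ones are rescaled using the probability.

```python
import numpy as np

def saliency_dropout(features, saliency, threshold=0.5):
    """Drop background features and rescale object features.

    features: array of shape (H, W, C), one feature vector per pixel position.
    saliency: array of shape (H, W) with probabilities in [0, 1].
    Returns the reduced features and the mask that produced them.
    """
    keep = saliency > threshold
    mask = np.where(keep, saliency, 0.0)   # 0 where dropped, probability where kept
    return features * mask[..., None], mask

# Usage: a 2x2 map with 3 channels; two positions fall below the threshold.
feats = np.ones((2, 2, 3))
sal = np.array([[0.9, 0.2],
                [0.6, 0.4]])
reduced, mask = saliency_dropout(feats, sal)
```

Unlike conventional dropout, which zeroes units at random, the units removed here are chosen deterministically from the saliency map.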

The feature matching unit 108 selects the sample with the highest score by comparing the features of the target image with the features of the survey image. The features of the survey image at the selected scale are matched against the features of the registered target images, and the feature matching unit 108 generates a score for each target image. The model parameters are updated by the parameter update unit 109.
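As an illustration of the matching step, the sketch below scores a survey feature vector against each registered target and returns the best one. Cosine similarity is an assumption chosen for simplicity; the source does not specify the matching function, and the names `cosine_score` and `best_match` are hypothetical.

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two feature arrays (flattened)."""
    a, b = np.ravel(a), np.ravel(b)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def best_match(survey_features, gallery):
    """Score the survey features against every target in `gallery`.

    gallery: dict mapping a target ID to its feature array.
    Returns the ID with the highest score and the full score table.
    """
    scores = {gid: cosine_score(survey_features, f) for gid, f in gallery.items()}
    return max(scores, key=scores.get), scores

# Usage with two hypothetical gallery entries.
gallery = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}
best_id, scores = best_match(np.array([0.9, 0.1]), gallery)
```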

In the following description, the feature extraction unit 103, the saliency generation unit 106, the dropout processing unit 107, and the feature matching unit 108 may be referred to collectively as the "estimation processing unit".

The output unit 110 outputs the final target image or ID. The output unit 110 can draw a predetermined mark representing the ID at a predetermined position on the output frame, the position being given by the xy coordinates and the scale (width and height) of the bounding box of the object.

The training data set storage unit 111 stores one or more series of training samples, each containing a pair of a target image and a survey image together with a label indicating whether or not they show the same object. Note that the input unit 101 need not be used to implement the training data set storage unit 111.

Next, the operation of the image processing apparatus 100 according to the first embodiment will be described in detail with reference to the drawings.

FIG. 2 is a flowchart showing an example of the operation of the image processing apparatus 100 according to the first embodiment of the present invention.

The operation of the image processing apparatus 100 according to the first embodiment is broadly divided into a training phase and an evaluation phase. This paragraph gives an overview of the invention with reference to FIG. 2 and describes the evaluation phase. Object recognition begins with the detection of an object in an image or frame. As shown in FIG. 2, the input to the system (step S101) is an image of an object, called a survey image. The object detection unit 102 checks whether a target image exists (step S102). A target image is also called a registered gallery image. If no target image has been selected (NO in step S102), a previously registered image is selected from the gallery (step S103). The object detection unit 102 may be a specific implementation of a general object detector. The detected object and the selected target image are then provided to the estimation processing unit (step S104). If a target image exists (YES in step S102), the process proceeds directly to step S104. The estimation processing unit then generates an output score (step S105). The process ends once all target images have been compared.

Details of the estimation process are described later with reference to FIG. 3. The estimation processing unit is described briefly here. It assigns a score to each sample generated from the current frame and the target image, and outputs the sample with the highest score.

Next, the output unit 110 outputs the estimated ID or estimated label and the score, that is, the final output described above (step S105). If the processing of the image processing apparatus 100 has not finished (NO in step S106), the input unit 101 receives the next frame (step S101). If the processing has been ended by an instruction from the user of the image processing apparatus 100 via an input device (not shown) (YES in step S106), or if all target images have been processed, the image processing apparatus 100 stops the process shown in FIG. 2.

Next, the operation of the image processing apparatus 100 according to the first embodiment in the estimation processing phase will be described in detail with reference to the drawings.

FIG. 3 is a flowchart showing an example of the operation of the image processing apparatus 100 according to the first embodiment in the estimation processing phase.

As described above, a model for the estimation process must be learned. Given a target image, samples, which are scaled survey images (scaled query images), are generated in step S201. These samples are extracted around the region given by the position of the object and the scale provided by the object detector. Next, features are extracted from these samples (step S202). The extracted features are, for example, HOG (Histogram of Oriented Gradients), LBP (Local Binary Patterns), normalized gradients, or color histograms. In step S203, dropout processing is executed; it is described in detail later with reference to FIG. 4. If the apparatus is in the training phase (YES in step S204), the background features must be removed using the mask from step S203 (step S206). The mask is the output of the dropout processing in step S203. The background features are removed by element-wise multiplication of the mask with the feature map generated by feature extraction (step S202). Using the reduced features from the feature map, a score is obtained by feature matching (step S207). Finally, among all the scales, the one with the highest score is selected as the final output (step S208). In the case of NO in step S204, no masking is needed, and only a forward pass (classifier forward pass) that passes the data through unchanged is executed (step S205). This procedure is described in detail below.
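The masked branch of this flow (steps S206 to S208) can be sketched as follows. The sketch assumes that every scaled sample has already been normalized to the same feature-map shape as the target, and it uses cosine similarity as a stand-in matching score; both are assumptions, and `select_best_scale` is a hypothetical name.

```python
import numpy as np

def select_best_scale(feature_maps, masks, target_features):
    """Mask each scale's feature map, score it against the target, pick the best.

    feature_maps: list of arrays of shape (H, W, C), one per scaled sample.
    masks: list of arrays of shape (H, W), the dropout masks from step S203.
    target_features: array of shape (H, W, C) for the target image.
    """
    scores = []
    b = target_features.ravel()
    for fmap, mask in zip(feature_maps, masks):
        reduced = fmap * mask[..., None]            # S206: element-wise masking
        a = reduced.ravel()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        scores.append(float(a @ b / denom) if denom > 0 else 0.0)  # S207
    return int(np.argmax(scores)), scores           # S208: best-scoring scale

# Usage: the second mask suppresses the background, so that scale wins.
target = np.zeros((2, 2, 1))
target[0, 0, 0] = 1.0
fmap = np.ones((2, 2, 1))
full = np.ones((2, 2))
obj_only = np.array([[1.0, 0.0], [0.0, 0.0]])
best, scores = select_best_scale([fmap, fmap], [full, obj_only], target)
```

In the toy usage the unmasked sample is penalized by its background responses, which is the effect the masking step is meant to remove.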

In equation (1), the left-hand side is the mean of the samples. It is one of the parameters used to normalize the features before the actual dropout procedure. 'x_i' is the feature vector of the i-th sample, and 'N' is the total number of scaled samples.

In equation (2), 'V' is the variance of the feature vectors. Using these two equations, the features are normalized to zero mean and unit variance. The normalization is performed using the following equation.
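The equation images for (1) to (3) are not reproduced in this text. Under the definitions given here, the three equations would plausibly take the standard form of a sample mean, a sample variance, and a standardization. The symbols \mu and \hat{x}_i, the 1/N normalizer, and the element-wise reading of the square and square root are assumptions.

```latex
\begin{align}
\mu &= \frac{1}{N}\sum_{i=1}^{N} x_i \tag{1} \\
V &= \frac{1}{N}\sum_{i=1}^{N} \left(x_i - \mu\right)^2 \tag{2} \\
\hat{x}_i &= \frac{x_i - \mu}{\sqrt{V}} \tag{3}
\end{align}
```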

After the features are normalized, they are passed to the dropout processing (step S203). The dropout procedure is described in more detail below with reference to FIG. 4.

FIG. 4 is a flowchart of the dropout processing steps of the image processing apparatus 100 according to the first embodiment of the present invention. The first step of the dropout processing is saliency processing (step S301). Saliency processing generates a saliency map and is described in detail later with reference to FIG. 5. The saliency map is used to obtain pixel probabilities. Next, in step S302, a feature map is acquired. Step S202 extracted only the raw features, whereas in step S302 the features are resized to form a three-dimensional map. Next, in step S303, the saliency map entry at each pixel position, that is, along the x-axis and the y-axis, is compared with a threshold. This threshold is selected in advance. If the probability in the saliency map is greater than the threshold 'T' (YES in step S303), the corresponding feature is merely renormalized using equation (4) below (step S306).

If the probability is less than or equal to the threshold 'T' (NO in step S303), the corresponding feature is removed by setting it to zero in step S304. Next, analogously to step S302, the feature map is updated by reshaping the map from three dimensions back to its original dimensions (step S305).
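The thresholding in steps S302 to S305 can be sketched as follows. This is a minimal illustration, not the patented implementation. It assumes the features arrive as a flat vector of length H*W*C and the saliency map as an (H, W) array of probabilities. The function name `saliency_dropout` and the `channels` parameter are hypothetical. The renormalization of equation (4) is not reproduced in this text, so features that pass the threshold are left unchanged here.

```python
import numpy as np

def saliency_dropout(features, saliency, threshold, channels):
    """Zero out features at pixels whose saliency does not exceed the threshold."""
    h, w = saliency.shape
    # Step S302: reshape the flat features into a three-dimensional (H, W, C) map.
    fmap = features.reshape(h, w, channels)
    # Step S303: compare each saliency entry with the preselected threshold T.
    mask = (saliency > threshold).astype(fmap.dtype)
    # Step S304: features at or below the threshold are set to zero.
    fmap = fmap * mask[:, :, None]
    # Step S305: reshape the map back to the original dimensions.
    return fmap.reshape(features.shape), mask
```

Returning the mask alongside the updated features matches the description of the mask as the output of the dropout processing.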

Next, in step S307, the saliency map generated in step S301 is stored in the model storage unit 105. The image processing apparatus 100 then ends the dropout processing shown in FIG. 4. The saliency processing steps are described next with reference to FIG. 5.

FIG. 5 is a flowchart showing the saliency processing performed by the image processing apparatus 100 according to the first embodiment of the present invention. Referring to FIG. 5, the scaled samples are used as input, as shown in step S401. In step S402, the saliency map is initialized with random values drawn from a Gaussian distribution with zero mean and unit variance, as shown in equation (5) below.

Here, 'P' is the random value used for initialization, 'm' is the mean of the Gaussian distribution, 'S' is its standard deviation, and 'd' is the dimension of the saliency map. After initialization, the classifier forward pass is computed in step S403. Equation (6) expresses the computation of the classifier forward pass, that is, of the class label, given an input image that is the randomly initialized saliency map.
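The initialization of step S402 can be sketched with NumPy's random generator. The parameter names follow the symbols in the text ('m', 'S', 'd'); the function name `init_saliency_map` and the `seed` argument are hypothetical additions for reproducibility.

```python
import numpy as np

def init_saliency_map(d, m=0.0, S=1.0, seed=None):
    """Draw a saliency map of shape d from a Gaussian with mean m and std S."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=m, scale=S, size=d)
```

With the defaults, the map has zero mean and unit variance in expectation, as the text requires.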

In equation (6), 'L' is the classification function that takes the image 'I' as input, and 'c' is a constant used to regularize the maximization.

The next step is the classifier backward pass (step S404), which is used to obtain the gradient of equation (6) with respect to the input saliency map image. This step provides the direction in which the saliency map image should be updated so that equation (6) is maximized. Step S404 is executed using the following equation.

In equation (7), '∇L' is the gradient of the classifier function with respect to the saliency map image, and 'a' is a constant that controls the step size. In this equation, 'I' is the saliency map updated in step S405. In the next step, S406, the loss incurred after one forward pass and one backward pass is computed. If the loss is sufficiently small (YES in step S407), the algorithm has converged and the saliency processing can stop. If the loss is not yet sufficiently small (NO in step S407), the steps from step S403 onward are executed again. These steps are repeated until the loss of the saliency map image becomes small and the algorithm converges.
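The loop of steps S402 to S407 amounts to gradient ascent on the saliency map. Equations (6) and (7) are not reproduced in this text, so the following sketch rests on stated assumptions: the objective is taken to be L(I) - c*||I||^2, the update to be I <- I + a * gradient, and the classifier is replaced by a toy linear score w·I with hypothetical weights `w`. The stopping rule (change in objective below `tol`) is likewise an assumption standing in for the loss check of step S407.

```python
import numpy as np

def saliency_by_gradient_ascent(w, c=0.5, a=0.1, tol=1e-10, max_iter=1000, seed=0):
    """Gradient ascent on w.I - c*||I||^2, starting from a Gaussian-initialized map."""
    rng = np.random.default_rng(seed)
    I = rng.normal(0.0, 1.0, size=w.shape)      # step S402: random initialization

    def objective(I):
        # Step S403 (forward pass): toy classifier score minus the regularizer.
        return float(np.sum(w * I) - c * np.sum(I ** 2))

    prev = objective(I)
    for _ in range(max_iter):
        grad = w - 2.0 * c * I                  # step S404: backward pass (gradient)
        I = I + a * grad                        # step S405: update the saliency map
        cur = objective(I)                      # step S406: re-evaluate after the update
        if abs(cur - prev) < tol:               # step S407: convergence check
            break
        prev = cur
    return I
```

For this toy objective the maximizer is known in closed form, I = w / (2c), which gives a simple way to check that the loop converges.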

After the estimation process is completed by the steps described above, the features are renormalized again. The feature matching step can then be performed. Matching can be performed using kernel methods such as the intersection kernel, the Gaussian kernel, or the polynomial kernel.

Equation (8) gives the matching score 'r' between the features of the target image 'I' and the features of the survey image 'x'. Here, 'd' is the length of the feature dimension and 'j' is the index of the dimension. The target image with the lowest score is selected.
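The three kernels named for feature matching can be sketched in their textbook forms. Equation (8) is not reproduced in this text, so which kernel it uses is not known; the function names and the parameters `sigma`, `degree`, and `coef0` are hypothetical. Each function sums over the d feature dimensions indexed by j.

```python
import numpy as np

def intersection_kernel(a, b):
    """Histogram intersection: sum over dimensions of the element-wise minimum."""
    return float(np.sum(np.minimum(a, b)))

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||a - b||^2 / (2 * sigma^2))."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-np.sum(diff ** 2) / (2.0 * sigma ** 2)))

def polynomial_kernel(a, b, degree=2, coef0=1.0):
    """Polynomial kernel: (a.b + coef0) ** degree."""
    return float((np.dot(a, b) + coef0) ** degree)
```

All three return larger values for more similar feature vectors.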

One of the objects of the present invention is to provide an image processing apparatus that can recognize objects accurately and reduce the influence of the background on the similarity score or distance score.

The first effect of the present embodiment is that objects can be estimated accurately and the influence of the background on the recognition score can be reduced.

Other effects of the present embodiment are described below. One advantage of the present embodiment is that multiple metrics can still be used with this method, as in the combination of multiple metrics described in Patent Document 1. The image processing apparatus can be used to reduce the effect of the background, which improves the performance of each metric.

Another effect of the present embodiment is that, unlike Non-Patent Document 1 and Patent Document 2, the model parameters do not require manually generated features. Manually generated features narrow the applicability of a technique and hinder generalization. The image processing apparatus can be used with any technique that requires background removal.

A further effect of the present embodiment is that, unlike Patent Document 3, no projection matrix needs to be calculated, so camera calibration information is unnecessary.

Another effect of the present embodiment is that, as in Non-Patent Document 2, learning is performed end to end, and a similarity score is output directly when a pair of input images is given. However, unlike Non-Patent Document 2, the distance function of the image processing apparatus is not limited to the Euclidean distance.

In addition, heavy optimization techniques such as latent support vector machines are not required, so real-time operation is possible. Both rigid and non-rigid shapes can be recognized. Furthermore, exemplars of variations in shape, pose, and parts are not required.

The apparatus described in Patent Document 4 and the method described in Patent Document 5 are deterministic, not probabilistic. The present embodiment, by contrast, is a probabilistic method and requires the probability map provided by the saliency generation unit 106. Moreover, the present embodiment does not require hardware or calibration information, which is not envisaged in Patent Document 5.

<Embodiment 2>
Next, a second embodiment of the present invention will be described in detail with reference to the drawings.

FIG. 6 is a block diagram showing a configuration example of the image processing apparatus 10 according to the second embodiment of the present invention. Referring to FIG. 6, the image processing apparatus 10 includes: a feature extraction unit 11 (for example, the feature extraction unit 103) that acquires features in each scaled sample of a region of interest in a survey image; a saliency generation unit 12 (for example, the saliency generation unit 106) that calculates the probabilities of the pixels in the scaled samples that contribute to the score or label of the object of interest in the region of interest; and a dropout processing unit 13 (for example, the dropout processing unit 107) that uses the calculated probabilities to remove from the scaled samples the features that are not essential for calculating the score or label of the object.

With this configuration, the image processing apparatus can reduce the influence of the background on the similarity score or similarity label of an object recognition task.

The second embodiment provides the same effect as the first effect of the first embodiment, because the basic principle is the same in both embodiments.

The image processing apparatus 10 may include a feature matching unit (for example, the feature matching unit 108) that acquires the similarity between a given target image and the scaled samples of the survey image and selects the scaled sample with the maximum similarity as the final output.

With this configuration, the image processing apparatus can output the scaled sample with the maximum similarity.

The dropout processing unit 13 may use the calculated probabilities to generate a mask for removing features that are not essential for calculating the score or label of the object, and may remove the features from the scaled samples using the generated mask.

Here, the neurons belonging to background pixels are computed, and a mask is generated that can act as a threshold for the features belonging to such pixels.

With this configuration, the image processing apparatus can remove features from the scaled samples using the generated mask.

The image processing apparatus 10 may include a learning unit (for example, the learning unit 104) that learns model parameters from one or more series of training samples, each including a pair of a target image and a survey image, and labels indicating whether the objects shown in the images are the same object.

With this configuration, the image processing apparatus can learn the relationship between the target image and the survey image.

The image processing apparatus 10 may also include a feature map update unit (for example, the dropout processing unit 107) that updates the feature map by applying the mask generated by the dropout processing unit 13 in order to remove the features of pixels with low probability in the saliency map.

With this configuration, the image processing apparatus can update the feature map using the mask.

The image processing apparatus 10 may include a feature normalization unit (for example, the dropout processing unit 107) that renormalizes the remaining features after the dropout processing unit 13 has removed features.
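One possible reading of renormalizing the remaining features is sketched below. The actual renormalization (equation (4) of the first embodiment) is not reproduced in this text, so this is an assumption: the surviving features are re-standardized to zero mean and unit variance while removed features stay at zero. The function name `renormalize_remaining` and the parameters `keep` and `eps` are hypothetical.

```python
import numpy as np

def renormalize_remaining(features, keep, eps=1e-8):
    """Standardize only the kept features; removed features remain zero."""
    out = np.zeros_like(features, dtype=float)
    kept = features[keep].astype(float)
    if kept.size:
        # eps guards against division by zero when all kept features are equal.
        out[keep] = (kept - kept.mean()) / (kept.std() + eps)
    return out
```

Here `keep` is a boolean array of the same shape as `features`, for example the dropout mask converted to booleans.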

With this configuration, the image processing apparatus can execute the feature matching step using a kernel method.

Each of the image processing apparatus 100 and the image processing apparatus 10 can be realized using a computer, a program that controls the computer, dedicated hardware, or a set of a computer, a program that controls the computer, and dedicated hardware.

FIG. 7 is a block diagram showing a hardware configuration example of a computer 1000 that can realize the image processing apparatus 100 and the image processing apparatus 10 described above. Referring to FIG. 7, the computer 1000 includes a processor 1001, a memory 1002, a storage device 1003, and an interface 1004, which are communicably connected via a bus 1005. The computer 1000 can access a storage medium 2000. Each of the memory 1002 and the storage device 1003 may be a storage device such as a RAM (Random Access Memory) or a hard disk drive. The storage medium 2000 may be a storage device such as a RAM or a hard disk drive, a ROM (Read Only Memory), or a portable storage medium. The storage device 1003 may function as the storage medium 2000. The processor 1001 can read data and programs from, and write data and programs to, the memory 1002 and the storage device 1003. Via the interface 1004, the processor 1001 can communicate with, for example, a server (not shown) that provides frames to the processor 1001 and a terminal (not shown) that outputs the final output shape. The processor 1001 can access the storage medium 2000. The storage medium 2000 stores a program that causes the computer 1000 to operate as the image processing apparatus 100 or the image processing apparatus 10.

The processor 1001 loads into the memory 1002 the program, stored in the storage medium 2000, that causes the computer 1000 to operate as the image processing apparatus 100 or the image processing apparatus 10. By executing the program loaded into the memory 1002, the processor 1001 operates as the image processing apparatus 100 or the image processing apparatus 10.

The input unit 101, the object detection unit 102, the feature extraction unit 103, the learning unit 104, the saliency generation unit 106, the feature matching unit 108, the dropout processing unit 107, and the output unit 110 can be realized by dedicated programs that are loaded from the storage medium 2000 into the memory 1002 and that realize the functions of the respective units. The processor 1001 executes the dedicated programs. The model storage unit 105, the parameter update unit 109, and the training data set storage unit 111 can be realized by the memory 1002 and/or the storage device 1003, such as a hard disk device. The input unit 101, the object detection unit 102, the feature extraction unit 103, the learning unit 104, the model storage unit 105, the saliency generation unit 106, the dropout processing unit 107, the feature matching unit 108, the parameter update unit 109, the output unit 110, and the training data set storage unit 111 can also be realized by dedicated circuits that realize the functions of the respective units.

As a final point, it should be clear that the processes, techniques, and methodologies described and illustrated herein are not limited to or tied to any particular apparatus, and can be implemented using a combination of components. Various types of general-purpose devices may also be used in accordance with the teachings herein. The present invention has also been described using a particular set of examples. These are, however, merely illustrative and not restrictive. For example, the described software may be implemented in a wide variety of languages such as C, C++, Java (registered trademark), Python, and Perl. Other implementations of the technology of the present invention will be apparent to those skilled in the art.

Although the present invention has been particularly shown and described with reference to exemplary embodiments thereof, the present invention is not limited to these embodiments. Those skilled in the art will understand that various changes in form and detail may be made without departing from the spirit and scope of the present invention as defined by the claims.

Part or all of the above embodiments can also be described as in the following supplementary notes, but are not limited to them.

(Supplementary Note 1) An image processing method comprising: acquiring features in each scaled sample of a region of interest in a survey image; calculating the probabilities of the pixels in the scaled samples that contribute to the score or label of the object of interest in the region of interest; and using the calculated probabilities to remove from the scaled samples the features that are not essential for calculating the score or label of the object.

(Supplementary Note 2) The image processing method according to Supplementary Note 1, wherein a mask for removing features that are not essential for calculating the label or score of the object is generated, and the mask is applied to remove the features of pixels with low probability in the saliency map. Here, the neurons belonging to background pixels are computed, and a mask is generated that can act as a threshold for the features belonging to such pixels.

(Supplementary Note 3) The image processing method according to Supplementary Note 1 or Supplementary Note 2, wherein model parameters are learned from one or more series of training samples, each including a pair of a target image and a survey image, and labels indicating whether the objects shown in the images are the same object.

(Supplementary Note 4) The image processing method according to any one of Supplementary Notes 1 to 3, wherein scaled samples of an image are acquired from a given region of interest.

(Supplementary Note 5) The image processing method according to any one of Supplementary Notes 1 to 4, wherein the remaining features are renormalized after the features are removed.

(Supplementary Note 6) A non-transitory computer-readable recording medium recording an image processing program that, when executed by a computer, acquires features in each scaled sample of a region of interest in a survey image, calculates the probabilities of the pixels in the scaled samples that contribute to the score or label of the object of interest in the region of interest, and uses the calculated probabilities to remove from the scaled samples the features that are not essential for calculating the score or label of the object.

(Supplementary Note 7) The recording medium according to Supplementary Note 6, wherein the program, when executed by a computer, generates a mask for removing features that are not essential for calculating the label or score of the object, and applies the mask to remove the features of pixels with low probability in the saliency map. Here, the neurons belonging to background pixels are computed, and a mask is generated that can act as a threshold for the features belonging to such pixels.

(Supplementary Note 8) The recording medium according to Supplementary Note 6 or Supplementary Note 7, wherein the program, when executed by a computer, learns model parameters from one or more series of training samples, each including a pair of a target image and a survey image, and labels indicating whether the objects shown in the images are the same object.

(Supplementary Note 9) The recording medium according to any one of Supplementary Notes 6 to 8, wherein the program, when executed by a computer, acquires scaled samples of an image from a given region of interest.

(Supplementary Note 10) The recording medium according to any one of Supplementary Notes 6 to 9, wherein the program, when executed by a computer, renormalizes the remaining features after the features are removed.

10, 100 Image processing apparatus
11, 103 Feature extraction unit
12, 106 Saliency generation unit
13, 107 Dropout processing unit
101 Input unit
102 Object detection unit
104 Learning unit
105 Model storage unit
108 Feature matching unit
109 Parameter update unit
110 Output unit
111 Training data set storage unit
1000 Computer
1001 Processor
1002 Memory
1003 Storage device
1004 Interface
1005 Bus
2000 Storage medium

The image processing program according to the present invention causes a computer to execute: a feature acquisition process of acquiring features in each scaled sample of a region of interest in a survey image; a calculation process of calculating the probabilities of the pixels in the scaled samples that contribute to the score or label of the object of interest in the region of interest; and a removal process of using the calculated probabilities to remove from the scaled samples the features that are not essential for calculating the score or label of the object.

Claims

An image processing apparatus comprising:
feature extraction means for acquiring features in each scaled sample of a region of interest in a survey image;
saliency generation means for calculating the probabilities of the pixels in the scaled samples that contribute to the score or label of the object of interest in the region of interest; and
dropout processing means for using the calculated probabilities to remove from the scaled samples features that are not essential for calculating the score or label of the object.

The image processing apparatus according to claim 1, further comprising a feature matching unit that acquires the similarity between a given target image and the scaled samples of the survey image and selects the scaled sample having the maximum similarity as the final output.

The image processing apparatus according to claim 1, wherein the dropout processing means
uses the calculated probabilities to generate a mask for removing features that are not essential for calculating the score or label of the object, and
removes the features from the scaled samples using the generated mask.

3. The image processing apparatus described, further comprising a learning unit that learns model parameters from one or more series of training samples, each including a pair of a target image and a survey image, and labels indicating whether the objects shown in the images are the same object.

The image processing apparatus according to claim 3, further comprising a feature map update unit that updates the feature map by applying a mask generated by the dropout processing unit in order to remove a feature of a pixel having a low probability in the saliency map.

The image processing apparatus according to claim 1, further comprising a feature normalization unit that normalizes the remaining features again after removing the features by the dropout processing unit.

An image processing method comprising:
acquiring features in each scaled sample of a region of interest in a survey image;
calculating the probabilities of the pixels in the scaled samples that contribute to the score or label of the object of interest in the region of interest; and
using the calculated probabilities to remove from the scaled samples features that are not essential for calculating the score or label of the object.

The image processing method according to claim 7, further comprising:
acquiring the similarity between a given target image and the scaled samples of the survey image; and
selecting the scaled sample having the maximum similarity as the final output.

A non-transitory computer-readable recording medium recording an image processing program that, when executed by a computer:
acquires features in each scaled sample of a region of interest in a survey image;
calculates the probabilities of the pixels in the scaled samples that contribute to the score or label of the object of interest in the region of interest; and
uses the calculated probabilities to remove from the scaled samples features that are not essential for calculating the score or label of the object.

The recording medium according to claim 9, wherein the image processing program, when executed by a computer:
acquires the similarity between a given target image and the scaled samples of the survey image; and
selects the scaled sample having the maximum similarity as the final output.