JP2020017082A - Image object extraction device and program

Info

Publication number: JP2020017082A
Application number: JP2018139764A
Authority: JP
Inventors: 吉彦河合; Yoshihiko Kawai
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2018-07-25
Filing date: 2018-07-25
Publication date: 2020-01-30
Anticipated expiration: 2038-07-25
Also published as: JP7149124B2

Abstract

To provide an image object extraction device and a program that extract a specific object from an input image with high accuracy and in a relatively short time. SOLUTION: An image object extraction device according to the present invention comprises: a scale conversion section 11 which performs scale conversion on an input image; a computation area cutout section 12 which sequentially cuts out an attention area and a context area including its peripheral information while scanning the scale-converted input image; a size conversion section 14 which reduces the context area to the same size as the attention area; and a neural network section 15 which calculates, in parallel using a neural network, a feature amount for each of the partial images of the attention area and the size-converted context area, combines the calculated feature amounts, extracts a specific object on the basis of the combined feature amount, and causes the scale conversion to be repeated. SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for extracting a specific object appearing in an image, and in particular to an image object extraction device and a program that use a neural network to extract objects such as the sky, buildings, vehicles, and human faces from, for example, an image of a landscape.

For example, techniques using machine learning or neural networks are known for extracting objects such as the sky, buildings, vehicles, and human faces from an image of a landscape.

In particular, techniques for extracting a specific object using a neural network have been disclosed (see, for example, Non-Patent Documents 1 and 2).

Neural networks are widely used in tasks such as object extraction and object recognition. When a neural network is used to extract an object appearing in part of an input image, a region of interest (also called an "ROI") that is part of the input image (or part of a feature map calculated from the input image) is input to the neural network, which outputs the extraction result (see, for example, Non-Patent Document 3).

FIG. 8 shows a schematic configuration of a conventional image object extraction device 100 using a neural network. FIG. 9(a) shows an outline of object extraction processing using a neural network, and FIG. 9(b) is a version of FIG. 9(a) in which the input is simplified to one dimension for clarity.

The conventional image object extraction device 100 shown in FIG. 8 includes an attention area cutout unit 112, a scanning unit 113, and a neural network unit 115.

The attention area cutout unit 112 receives the input image I, cuts out a partial image of the region of interest (ROI) from the input image I based on the image coordinates specified by the scanning unit 113, and outputs the partial image to the neural network unit 115. The attention area cutout unit 112 thus functions as the input layer of the image object extraction device 100 shown in FIGS. 9(a) and 9(b).

Each time the downstream neural network unit 115 performs a feature calculation, the scanning unit 113 generates the coordinate values serving as the reference for the region of interest (ROI) while sequentially scanning the input image I (for example, one pixel at a time), and outputs the resulting image coordinates to the attention area cutout unit 112.

The neural network unit 115 consists of an attention area feature calculation unit 1151 and an object extraction unit 1154, each of which is a partial network forming part of the structure of the neural network.

The attention area feature calculation unit 1151 calculates, using a neural network, a feature amount for the partial image of the region of interest (ROI) input from the attention area cutout unit 112, and outputs the feature amount to the object extraction unit 1154. The attention area feature calculation unit 1151 thus functions as the feature calculation layer of the image object extraction device 100 shown in FIGS. 9(a) and 9(b), and calculates the feature amount (NA2 in the figure) from the partial image of the ROI (NA1 in the figure) using the neural network.

Here, the feature amount calculated by the attention area feature calculation unit 1151 may be any known feature amount obtained using a neural network, and is assumed to be represented as a feature map. As one way of calculating such a feature map, general object transformations (gradation conversion, sharpening/smoothing, edge extraction, morphing, and the like) may be applied to the partial image of the region of interest (ROI) and the result expressed as, for example, binary values, scalars, vectors, or matrices. More simply, the feature map can be expressed as a two-dimensional matrix calculated by a convolutional neural network. A convolutional neural network is usually composed of a combination of convolutional layers, pooling layers, fully connected layers, and the like.

The object extraction unit 1154 determines, from the feature amount of the region of interest (ROI) obtained from the attention area feature calculation unit 1151, whether that ROI contains the specific object targeted by the neural network (a vehicle, a person's face, or the like), and outputs the extraction result to the outside when it determines that the ROI is the object. The object extraction unit 1154 thus functions as the object extraction/output layer of the image object extraction device 100 shown in FIGS. 9(a) and 9(b): it determines whether the ROI contains the specific object targeted by the neural network and outputs the object extraction result (ND in the figure).

As described above, the conventional image object extraction device 100 inputs a region of interest (ROI) that is part of the image to the neural network and finally obtains an object extraction result, but it extracts objects without considering any information outside the ROI.

Non-Patent Document 1: Q. V. Le, "Building High-level Features Using Large Scale Unsupervised Learning," ICASSP, 2013
Non-Patent Document 2: A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," NIPS, 2012
Non-Patent Document 3: Yamada and Watanabe, "Tracking by Feature Map Selection of Convolutional Neural Network," Proceedings of the 79th National Convention of the Information Processing Society of Japan, Vol. 2 (Artificial Intelligence and Cognitive Science), pp. 2-385 to 2-386, lecture no. 1P-08, March 16-18, 2017

As described above, a conventional image object extraction device using a neural network inputs a region of interest (ROI) that is part of the image to the neural network and finally obtains the object extraction result, without considering any information outside the ROI.

As a result, extraction becomes difficult particularly when the object appearing in the input image is small, and there is room for improvement in object extraction accuracy.

In view of the above problem, an object of the present invention is to provide an image object extraction device and a program that extract a specific object from an input image accurately and in a relatively short time.

That is, the image object extraction device of the present invention is an image object extraction device that extracts a specific object from an input image, comprising: scale conversion means for sequentially generating input images that are scale-converted so that the input image is reduced stepwise at a predetermined magnification, with a predetermined first scale as the initial value; calculation region cutout means for sequentially cutting out, each at a predetermined size, a partial image of a region of interest and a partial image of a context region containing the region of interest and its surrounding information, while scanning the input image scale-converted by the scale conversion means; size conversion means for performing size conversion so that the sequentially cut-out partial image of the context region is reduced to the same size as the region of interest; attention area feature calculation means for calculating a first feature amount from the partial image of the region of interest using a neural network; context region feature calculation means for calculating a second feature amount from the size-converted partial image of the context region using a neural network; combining means for combining the first feature amount and the second feature amount to generate a combined feature amount; and object extraction means for extracting the specific object from the input image obtained through the scale conversion means by determining, on the basis of the combined feature amount, whether the region of interest contains the specific object, wherein at least the attention area feature calculation means, the context region feature calculation means, the combining means, and the object extraction means are configured as partial networks of a neural network, the attention area feature calculation means and the context region feature calculation means are configured to be processed in parallel, and the object extraction means causes the scale conversion means to repeat the scale conversion within a range in which the scale of the input image obtained through the scale conversion means does not fall below a predetermined threshold, thereby extracting objects of different sizes.

Further, in the image object extraction device of the present invention, the calculation region cutout means cuts out the partial image of the region of interest and the partial image of the context region, each at a fixed size, from the input image obtained through the scale conversion means, and cuts out the context region so that its centroid coincides with the centroid of the region of interest and its size is enlarged by a predetermined amount so as to include information surrounding the region of interest on all four sides.

Further, in the image object extraction device of the present invention, the calculation region cutout means cuts out the context region so that its area is more than 1 time and not more than 4 times the area of the region of interest.

Further, in the image object extraction device of the present invention, the first feature amount and the second feature amount are each represented by a feature map obtained by feature amount calculation processing of the same form.

Further, in the image object extraction device of the present invention, the attention area feature calculation means and the context region feature calculation means each calculate, by parallel processing based on a convolutional neural network as feature amount calculation processing of the same form, feature maps in which the positional relationships of the first feature amount and the second feature amount are correlated, with reference to the input image obtained through the scale conversion means.

Furthermore, the program of the present invention is configured as a program for causing a computer to function as the image object extraction device of the present invention.

According to the present invention, objects are extracted in consideration of both the region of interest (ROI) in the input image and the peripheral information containing that ROI (the context region). This limits the increase in computation, without an undesirable increase in calculation time, while improving object extraction accuracy. In particular, according to the present invention, an object can be extracted accurately even when it is so small relative to the input image that extraction is difficult with the conventional technique.

FIG. 1 is a block diagram showing a schematic configuration of an image object extraction device according to an embodiment of the present invention.
FIGS. 2(a) to 2(c) are explanatory diagrams of a region of interest (ROI) and a context region with respect to an input image in the image object extraction device according to the embodiment of the present invention.
FIG. 3 is a flowchart showing the operation of the image object extraction device according to the embodiment of the present invention.
FIG. 4 is an explanatory diagram of a parallel-processing neural network in the image object extraction device according to the embodiment of the present invention.
FIG. 5 is an explanatory diagram of the inputs and outputs when a convolutional neural network is used for the attention area feature calculation unit and the context region feature calculation unit in the image object extraction device according to the embodiment of the present invention.
FIG. 6 is a diagram showing a processing example of one example using a convolutional neural network in the image object extraction device according to the embodiment of the present invention.
FIG. 7 is a diagram showing a processing example of one example using a convolutional neural network in the image object extraction device according to the embodiment of the present invention.
FIG. 8 is a block diagram showing a schematic configuration of a conventional image object extraction device.
FIGS. 9(a) and 9(b) are explanatory diagrams of a neural network in the conventional image object extraction device.

Hereinafter, an image object extraction device 1 according to an embodiment of the present invention will be described with reference to the drawings.

(Overall configuration)
FIG. 1 is a block diagram showing a schematic configuration of the image object extraction device 1 according to an embodiment of the present invention. The image object extraction device 1 includes a scale conversion unit 11, a calculation region cutout unit 12, a scanning unit 13, a size conversion unit 14, and a neural network unit 15.

The scale conversion unit 11 is a functional unit that receives the input image I (W × H as width × height), temporarily stores it in a memory (not shown), and applies scale conversion so that the input image I is reduced stepwise at a predetermined magnification (1/k, where k is an arbitrary real number). Starting from the first scale (W × H) as the initial value, the scale conversion unit 11 outputs each stepwise-reduced input image I in turn to the calculation region cutout unit 12, within a range in which the scale of the input image I does not fall below a predetermined threshold.

In other words, so that objects of various sizes can be extracted, the image object extraction device 1 of the present embodiment applies the processing of the calculation region cutout unit 12 and subsequent units while the scale conversion unit 11 reduces the size of the input image I little by little.
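The stepwise reduction described above can be sketched as follows. This is a minimal illustration, not the device's implementation: the function name `pyramid_sizes`, the use of the shorter image side as the "scale" compared against the threshold, and the requirement k > 1 are all assumptions made for the example.

```python
def pyramid_sizes(width, height, k=1.25, min_side=32):
    """Return the (width, height) of each stepwise-reduced image.

    Assumptions for this sketch: each step multiplies both sides by 1/k,
    and reduction stops once the shorter side would fall below min_side.
    """
    if k <= 1.0:
        raise ValueError("k must exceed 1 so that each step shrinks the image")
    sizes = []
    w, h = float(width), float(height)
    while min(w, h) >= min_side:
        sizes.append((int(round(w)), int(round(h))))
        w, h = w / k, h / k
    return sizes


print(pyramid_sizes(640, 480, k=2.0, min_side=100))
```

With these illustrative values the image is processed at three scales before the shorter side drops below the threshold.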

As illustrated in FIG. 2 described later, the calculation region cutout unit 12 receives the input image I from the scale conversion unit 11 and temporarily stores it in a memory (not shown). Based on the image coordinates (p(i), q(i)) at the i-th scan specified by the scanning unit 13, it cuts out from the input image I a partial image of the region of interest (ROI) (w × h as width × height) and a partial image of the context region containing that ROI and its surrounding information (w' × h' as width × height), and outputs them to the neural network unit 15 and the size conversion unit 14, respectively.

The image size (w × h) of the region of interest (ROI) is a predetermined fixed value, and the image size (w' × h') of the context region is also a fixed value. The context region, however, has a centroid that coincides with the centroid of the ROI and a size enlarged by a predetermined amount so as to include information surrounding the ROI on all four sides.

For example, relative to the area A of the image size (w × h) of the region of interest (ROI), the area A' of the image size (w' × h') of the context region is set to satisfy A < A' ≤ 4A. The experimental results described later confirm that this range is preferable in terms of calculation time and detection accuracy.
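As a small worked check of this condition: if the context region enlarges both sides of the ROI by the same factor s (an assumption made here, not stated in the text), then A' = s²A and the condition A < A' ≤ 4A is equivalent to 1 < s ≤ 2. The function name and the pixel sizes below are illustrative only.

```python
def context_area_ok(w, h, w_ctx, h_ctx):
    """True when A < A' <= 4A for ROI area A = w*h and context area A' = w_ctx*h_ctx."""
    area_roi = w * h
    area_ctx = w_ctx * h_ctx
    return area_roi < area_ctx <= 4 * area_roi


print(context_area_ok(32, 32, 64, 64))  # sides doubled, area ratio 4
print(context_area_ok(32, 32, 96, 96))  # sides tripled, area ratio 9
```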

For example, FIGS. 2(a) to 2(c) are explanatory diagrams of the region of interest (ROI) and the context region with respect to the input image I in the image object extraction device 1 according to an embodiment of the present invention. The example in FIG. 2(a) shows the calculation region cutout unit 12 cutting out the ROI and the context region, based on the image coordinates (p(i), q(i)) at the i-th scan, within an input image I showing, for example, two objects Obj1 and Obj2. When the ROI is located on the object Obj1 as shown in FIG. 2(b), the calculation region cutout unit 12 cuts out a context region whose centroid coincides with that of the ROI, as shown in FIG. 2(c).

Note that when the region of interest (ROI) is located at an edge of the input image I and no surrounding information exists on one or more of its four sides, the missing portion is filled with a fixed value (for example, the midpoint of the dynamic range), so that the context region still has a centroid coinciding with that of the ROI and the fixed size (w' × h') enlarged by the predetermined amount relative to the ROI.
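The cutout with edge filling can be sketched in plain Python as below. Several details are assumptions made for illustration: the scan coordinates are treated as the window centre, the image is a list of rows of 8-bit grey values, the fill value 128 stands for the midpoint of the dynamic range, and the function names are hypothetical. Both window sizes are even here so that the two centroids coincide exactly.

```python
def crop_with_padding(image, cx, cy, w, h, fill=128):
    """Cut a w x h window around (cx, cy); pixels outside the image become `fill`."""
    rows, cols = len(image), len(image[0])
    x0, y0 = cx - w // 2, cy - h // 2
    return [
        [image[y][x] if 0 <= y < rows and 0 <= x < cols else fill
         for x in range(x0, x0 + w)]
        for y in range(y0, y0 + h)
    ]


def cut_regions(image, cx, cy, roi=(4, 4), ctx=(8, 8), fill=128):
    """Return the ROI patch and the larger context patch sharing the same centre."""
    roi_patch = crop_with_padding(image, cx, cy, *roi, fill=fill)
    ctx_patch = crop_with_padding(image, cx, cy, *ctx, fill=fill)
    return roi_patch, ctx_patch


# A 6 x 6 test image whose pixel value encodes its position (row * 10 + column).
img = [[r * 10 + c for c in range(6)] for r in range(6)]
roi_patch, ctx_patch = cut_regions(img, 0, 0, roi=(2, 2), ctx=(4, 4))
print(roi_patch)
print(ctx_patch)
```

At the image corner, both patches keep their fixed sizes and the part lying outside the image carries the fill value.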

Fixing the sizes of the region of interest (ROI) and the context region in this way stabilizes and simplifies the subsequent processing in the neural network unit 15. Because the input image I supplied to the calculation region cutout unit 12 is reduced stepwise by the scale conversion unit 11 from the initial scale (W × H), the ROI and the context region grow stepwise relative to the image, which makes it possible to extract objects of various sizes.
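The relative growth of the fixed window can be made concrete with a short calculation. Assuming each step reduces the image by 1/k with k > 1 (an assumption of this sketch), a fixed w × h window on the image after n reductions covers roughly w·kⁿ × h·kⁿ pixels of the original image. The function name and numbers are illustrative.

```python
def effective_window(w, h, k, steps):
    """Size, in original-image pixels, covered by a fixed w x h window after n reductions."""
    return [(w * k ** n, h * k ** n) for n in range(steps)]


print(effective_window(32, 32, 2.0, 3))
```

For a 32 × 32 window and k = 2, the covered area in the original image doubles per side at each step.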

The scanning unit 13 generates the coordinate values serving as the reference for the region of interest (ROI) while sequentially scanning the input image I supplied to the calculation region cutout unit 12 (for example, one pixel at a time), and outputs the image coordinates (p(i), q(i)) at the i-th scan to the calculation region cutout unit 12.
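A scan of this kind can be sketched as a generator of reference coordinates. The raster order (left to right, top to bottom) and the `stride` parameter are assumptions for illustration; the text only states that the scan proceeds sequentially, for example one pixel at a time.

```python
def scan_positions(width, height, stride=1):
    """Yield reference coordinates (p, q) in raster order (assumed order)."""
    for q in range(0, height, stride):
        for p in range(0, width, stride):
            yield p, q


print(list(scan_positions(3, 2)))
```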

The size conversion unit 14 reduces the partial image of the context region (w' × h') input from the calculation region cutout unit 12 to the same size (w × h) as the region of interest (ROI), and outputs it to the neural network unit 15. The reduction performed by the size conversion unit 14 can also be executed inside the neural network unit 15.
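The text does not specify how the reduction is performed. The following sketch uses block averaging, which is only one possible choice and is an assumption of this example, as are the function name and the restriction to integer reduction factors.

```python
def shrink_to(patch, out_w, out_h):
    """Reduce a patch to out_w x out_h by averaging equal-sized blocks (assumed method)."""
    in_h, in_w = len(patch), len(patch[0])
    if in_w % out_w or in_h % out_h:
        raise ValueError("this sketch only handles integer reduction factors")
    fx, fy = in_w // out_w, in_h // out_h
    out = []
    for oy in range(out_h):
        row = []
        for ox in range(out_w):
            block = [patch[oy * fy + dy][ox * fx + dx]
                     for dy in range(fy) for dx in range(fx)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


ctx_patch = [[0, 0, 4, 4],
             [0, 0, 4, 4],
             [8, 8, 12, 12],
             [8, 8, 12, 12]]
print(shrink_to(ctx_patch, 2, 2))
```

After this step the context patch has the same pixel dimensions as the ROI patch, so both can be fed to feature calculations of the same form.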

The neural network unit 15 has an attention area feature calculation unit 151, a context region feature calculation unit 152, a feature combining unit 153, and an object extraction unit 154, each of which is a partial network forming part of the structure of the neural network.

The attention area feature calculation unit 151 calculates, using a neural network, a feature amount for the partial image of the region of interest (ROI) input from the calculation region cutout unit 12, and outputs the feature amount to the feature combining unit 153.

The context region feature calculation unit 152 calculates, using a neural network, a feature amount for the partial image of the context region input from the size conversion unit 14, and outputs the feature amount to the feature combining unit 153.

Here, the feature amounts calculated by the attention area feature calculation unit 151 and the context region feature calculation unit 152 may each be any known feature amount obtained using a neural network, provided that both use feature amount calculation processing of the same form and are represented as feature maps whose positional relationships are correlated. As an example of such feature amount calculation processing, general object transformations (gradation conversion, sharpening/smoothing, edge extraction, morphing, and the like) may be applied to the partial images of the region of interest (ROI) and the context region; more simply, the feature amounts can be calculated by a convolutional neural network. As described later in an example, a feature map based on a convolutional neural network can be expressed as a two-dimensional matrix. A convolutional neural network is usually composed of a combination of convolutional layers, pooling layers, fully connected layers, and the like.

The feature combining unit 153 combines the feature amounts of the region of interest (ROI) and the context region calculated by the attention area feature calculation unit 151 and the context region feature calculation unit 152, respectively, and outputs the result to the object extraction unit 154. It then instructs the scanning unit 13 to generate the coordinate values serving as the reference for the next ROI in the input image I.
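The two-branch structure and the combining step can be sketched as below. This is a toy illustration under stated assumptions: `toy_feature_map` is a stand-in (2 × 2 average pooling) for whatever network each branch actually uses, and combining is shown as stacking the two same-shaped maps position by position, which is one way to keep their positional relationship aligned but is not confirmed by the text as the method used.

```python
def toy_feature_map(patch):
    """Stand-in for one branch: 2x2 average pooling (not the actual network)."""
    return [
        [(patch[y][x] + patch[y][x + 1] + patch[y + 1][x] + patch[y + 1][x + 1]) / 4
         for x in range(0, len(patch[0]) - 1, 2)]
        for y in range(0, len(patch) - 1, 2)
    ]


def combine_features(roi_map, ctx_map):
    """Stack two same-shaped feature maps so each position holds (roi, context)."""
    if len(roi_map) != len(ctx_map) or any(
            len(r) != len(c) for r, c in zip(roi_map, ctx_map)):
        raise ValueError("both branches must produce maps of the same shape")
    return [[(r, c) for r, c in zip(r_row, c_row)]
            for r_row, c_row in zip(roi_map, ctx_map)]


roi_patch = [[1, 1, 3, 3], [1, 1, 3, 3], [5, 5, 7, 7], [5, 5, 7, 7]]
ctx_patch = [[2] * 4 for _ in range(4)]  # context patch already reduced to ROI size

# The same form of calculation is applied to both inputs.
combined = combine_features(toy_feature_map(roi_patch), toy_feature_map(ctx_patch))
print(combined)
```

Because both inputs have the same size and pass through the same form of calculation, the two maps have the same shape and can be combined position by position before being passed to the extraction step.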

At this time, the scanning unit 13 determines whether object extraction has been completed over the whole of the current input image I. If not, it generates the coordinate values serving as the reference for the next region of interest (ROI) in that input image I. If so, it starts scanning from the initial position, with image coordinates (p(i), q(i)) for the i-th scan, on the input image I newly input to the calculation region cutout unit 12.

Based on the combined feature amount of the region of interest (ROI) and the context region obtained from the feature combining unit 153, the object extraction unit 154 determines whether the ROI contains the specific object targeted by the neural network, and extracts that specific object.

That is, when the object extraction unit 154 determines that the region of interest (ROI) is the specific object, it outputs to the outside, as the extraction result, either the position information of the ROI at the i-th scan of the input image I or the partial image of the ROI itself. This extraction result can be used for recognition processing such as vehicle recognition and face recognition.

In addition, each time the scanning unit 13 advances the scan, the object extraction unit 154 determines, on the basis of prior learning, whether an object is included, using the combined feature amount of the region of interest (ROI) and the context region obtained from the feature combining unit 153, and thereby performs object extraction over the whole input image I.

The object extraction processing of the object extraction unit 154 can be designed freely without restriction. The parameters of the neural network unit 15 (in particular, of the object extraction unit 154) relating to object extraction are learned in advance from a large number of image samples, based on the combined feature amounts of the region of interest (ROI) and the context region.

When object extraction over the whole of a given input image I is finished, the object extraction unit 154 instructs the scale conversion unit 11 to generate the next input image I by reducing that input image I at a predetermined magnification (1/k; k is an arbitrary real number).

従って、本実施形態の画像オブジェクト抽出装置１は、入力される入力画像（Ｗ×Ｈ）に対し、異なる様々なサイズのオブジェクトを抽出することができる。 Therefore, the image object extraction device 1 of the present embodiment can extract objects having various different sizes from the input image (W × H) to be input.

Note that, to aid understanding of the present invention, FIG. 1 shows an example in which the scale conversion unit 11, the calculation region cutout unit 12, the scanning unit 13, and the size conversion unit 14 are distinguished from the neural network unit 15 constituting the neural network; however, the entire image object extraction device 1 may also be configured as a single neural network.

(Device operation)
The image object extraction device 1 of the present embodiment will now be described more specifically with reference to FIGS. 3 and 4. FIG. 3 is a flowchart showing the operation of the image object extraction device 1 according to one embodiment of the present invention. FIG. 4 is an explanatory diagram of the parallel processing type neural network of the image object extraction device 1 according to one embodiment of the present invention.

First, as shown in FIG. 3, the image object extraction device 1 uses the scale conversion unit 11 to determine whether the scale of the input image (W × H) is smaller than a predetermined threshold (step S1).

When the scale of the input image (W × H) is smaller than the predetermined threshold (in this example, smaller than w × h), the scale conversion unit 11 ends the processing (step S1: Y). Otherwise (step S1: N), on the first pass it proceeds directly to step S3 with the image as the input image I; on each subsequent pass through (step S1: N), it reduces the scale of the input image I by the predetermined magnification (1/k; k is an arbitrary real number) and then proceeds to step S3 (step S2).
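The loop formed by steps S1 and S2 can be sketched as follows. This is a minimal illustration only: the function name `pyramid_sizes`, the rounding by truncation, and the reading of "smaller than w × h" as "either dimension smaller than the ROI" are assumptions, not details given in the specification.

```python
def pyramid_sizes(width, height, roi_w, roi_h, k):
    """Yield the (width, height) of each scale processed by steps S1 and S2.

    Assumption: the scale counts as "below the threshold" as soon as either
    dimension is smaller than the ROI size (roi_w x roi_h).
    """
    if k <= 1:
        raise ValueError("k must be greater than 1 so that each pass shrinks the image")
    w, h = width, height
    while w >= roi_w and h >= roi_h:   # step S1: stop once below the threshold
        yield w, h                     # this scale is scanned (steps S3 to S7)
        w, h = int(w / k), int(h / k)  # step S2: reduce by the magnification 1/k


# Example: a 64 x 48 input, an 8 x 8 ROI, and k = 2.
sizes = list(pyramid_sizes(64, 48, 8, 8, 2))
```

With these hypothetical values the image is processed at three scales, after which its height would fall below the ROI height and the loop ends.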

Next, the image object extraction device 1 uses the calculation region cutout unit 12 to cut out from the input image I, based on the image coordinates (p(i), q(i)) at the i-th scanning time point specified by the scanning unit 13, a partial image (w × h) of the region of interest (ROI) and a partial image (w' × h') of the context region, which contains that ROI and the information around it (step S3).
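Step S3 can be illustrated with a small cropping routine. The sketch assumes that (p, q) is the top-left corner of the ROI, that the context region is centred on the ROI (as in the worked example later in the description), and that pixels falling outside the image are filled with a constant; the specification does not say how image borders are handled, and the names `crop` and `cut_roi_and_context` are hypothetical.

```python
def crop(image, left, top, width, height, fill=0):
    """Return a height x width window; pixels outside the image get `fill`."""
    rows, cols = len(image), len(image[0])
    return [
        [image[y][x] if 0 <= y < rows and 0 <= x < cols else fill
         for x in range(left, left + width)]
        for y in range(top, top + height)
    ]


def cut_roi_and_context(image, p, q, w, h, ctx_w, ctx_h, fill=0):
    """Cut the w x h ROI at (p, q) and a ctx_w x ctx_h context centred on it."""
    roi = crop(image, p, q, w, h, fill)
    # Shift the context origin so that both windows share the same centre.
    ctx_left = p - (ctx_w - w) // 2
    ctx_top = q - (ctx_h - h) // 2
    context = crop(image, ctx_left, ctx_top, ctx_w, ctx_h, fill)
    return roi, context


# Example: an 8 x 8 image whose pixel value encodes its position.
image = [[y * 8 + x for x in range(8)] for y in range(8)]
roi, context = cut_roi_and_context(image, 2, 2, 4, 4, 8, 8)
```

Here the 4 × 4 ROI sits in the middle of the image, so the 8 × 8 context window coincides with the whole image.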

FIG. 4 is an explanatory diagram of the parallel processing type neural network of the image object extraction device 1 of the present embodiment. The scale conversion unit 11 and the calculation region cutout unit 12 function as the input layer of the image object extraction device 1 and cut out a region of interest (ROI) and a context region from the input image I, which FIG. 4 depicts in simplified one-dimensional form.

Next, the image object extraction device 1 uses the size conversion unit 14 to reduce the partial image (w' × h') of the context region to the same size (w × h) as the region of interest (ROI), and then applies the partial networks of the attention area feature calculation unit 151 and the context area feature calculation unit 152 in parallel (step S4). That is, the attention area feature calculation unit 151 and the context area feature calculation unit 152 each use a neural network to calculate, in parallel, the feature amounts of the ROI and of the context region, respectively.

Accordingly, as shown in FIG. 4, the size conversion unit 14 functions as the size conversion layer of the image object extraction device 1 and reduces the partial image of the context region (DS in the figure) to the same size (NB1 in the figure) as the region of interest (ROI) (NA1 in the figure). The attention area feature calculation unit 151 and the context area feature calculation unit 152 function as the feature calculation layer of the image object extraction device 1 (convolution layers, pooling layers, and the like in the case of a convolutional neural network) and use neural networks to calculate feature amounts (NA2 and NB2 in the figure) from the partial image of the ROI (NA1 in the figure) and the size-converted partial image of the context region (NB1 in the figure), respectively.

Next, the image object extraction device 1 uses the feature combining unit 153 to combine the feature amounts of the region of interest (ROI) and the context region calculated by the respective partial networks of the attention area feature calculation unit 151 and the context area feature calculation unit 152 (step S5).

Accordingly, as shown in FIG. 4, the feature combining unit 153 functions as the feature combining layer of the image object extraction device 1 (in the case of a convolutional neural network, a fully connected layer, which may include a softmax layer, or the like) and generates a combined feature amount (NC in the figure) by combining the feature amounts of the region of interest (ROI) and the context region.

Next, the image object extraction device 1 uses the object extraction unit 154 to determine, based on this combined feature amount of the region of interest (ROI) and the context region, whether the corresponding ROI contains the specific object targeted by the neural network (a vehicle, a person's face, or the like). When it determines that the ROI is the object, it outputs to the outside, as the extraction result, either the position information of the ROI at the i-th scanning time point for the input image I or the partial image of the ROI itself (step S6).

Accordingly, as shown in FIG. 4, the object extraction unit 154 functions as the object extraction and output layer of the image object extraction device 1, determines whether the corresponding region of interest (ROI) contains the specific object targeted by the neural network (a vehicle, a person's face, or the like), and outputs the object extraction result (ND in the figure).

The image object extraction device 1 also uses the scanning unit 13 to determine whether object extraction over the entire input image I has finished (step S7). If it has not (step S7: N), the scanning unit 13 generates the coordinate values that serve as the reference for the next region of interest (ROI) in the input image I, and the process returns to step S3. If object extraction over the entire input image I has finished (step S7: Y), the process returns to step S1 and, after passing through step S2, proceeds to step S3 so that scanning starts from the initial position on the new input image I supplied to the calculation region cutout unit 12.
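The scanning performed between steps S3 and S7 can be sketched as a generator of ROI positions. Raster order (left to right, then top to bottom), a fixed scan stride, and the name `scan_positions` are assumptions for illustration; the specification only requires that the scanning unit 13 supplies the coordinates of each successive ROI until the whole image has been covered.

```python
def scan_positions(img_w, img_h, roi_w, roi_h, stride):
    """Yield the top-left (p, q) of every ROI window, in raster order."""
    for q in range(0, img_h - roi_h + 1, stride):
        for p in range(0, img_w - roi_w + 1, stride):
            yield p, q


# Example: an 8 x 6 image scanned with a 4 x 4 ROI and a stride of 2 pixels.
positions = list(scan_positions(8, 6, 4, 4, 2))
```

When the generator is exhausted, the condition of step S7 (Y) is met and control would return to step S1 for the next, smaller scale.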

In this way, the image object extraction device 1 according to the present invention cuts out, together with the region of interest (ROI), a context region containing that ROI, reduces the image size of the context region to that of the ROI, and then uses a parallel processing type neural network that processes the ROI and the context region in parallel, so that objects are extracted at the image size of the ROI, which is the original calculation target.

(Example)
An example in which a convolutional neural network is used in the image object extraction device 1 according to the present invention will now be described with reference to FIGS. 5 to 7.

図５は、本発明による一実施形態の画像オブジェクト抽出装置１における注目領域特徴演算部１５１及びコンテキスト領域特徴演算部１５２に対し畳み込みニューラルネットワークを用いた場合の入出力に関する説明図である。 FIG. 5 is an explanatory diagram regarding input and output when a convolutional neural network is used for the attention area feature calculation unit 151 and the context area feature calculation unit 152 in the image object extraction device 1 according to one embodiment of the present invention.

First, as shown in FIG. 5, when convolutional neural networks are used as the partial networks of the attention area feature calculation unit 151 and the context area feature calculation unit 152, which form the feature calculation layer, the feature amounts of the region of interest (ROI) and the context region for an input image I of size W × H (in pixels) at the input layer are each output as a feature map, for example an m × n two-dimensional matrix. The output of the feature combining layer, which combines the outputs of the feature calculation layer, is expressed as m × n × 2.

つまり、特徴結合層である特徴結合部１５３は、例えば特徴マップとして２次元行列のｍ行ｎ列の値で表す２種類の特徴量を結合してオブジェクト抽出部１５４に出力する。 In other words, the feature combining unit 153, which is a feature combining layer, combines two types of feature amounts represented by values of m rows and n columns of a two-dimensional matrix as a feature map and outputs the combined feature amounts to the object extracting unit 154.

Here, the values of m and n are finite: m = 1, 2, ..., M and n = 1, 2, ..., N. The values of M and N are determined by the configuration of the neural network.

The output of the object extraction unit 154, which is the object extraction and output layer, represents the probability that the receptive field (the region of interest (ROI) and the context region in the input image I) corresponding to each neuron of the neural network is an object; for example, the larger the value at each position (m, n), the higher the likelihood that it is an object.

More specifically, an example in which a convolutional neural network is used in the image object extraction device 1 of the present embodiment will be described with reference to FIGS. 6 and 7. FIGS. 6 and 7 show a processing example of this convolutional neural network implementation. FIG. 7 represents the input image of FIG. 6 in one dimension for simplicity. The example shown in FIGS. 6 and 7 applies a parallel processing type convolutional neural network that processes the region of interest (ROI) and the context region in parallel.

図６及び図７に示す例では、スケール変換部１１の出力である入力画像Ｉから、演算領域切り出し部１２によって、４×４画素の注目領域（ＲＯＩ）と、８×８画素のコンテキスト領域の部分画像が切り出されるものとする（図１参照）。 In the examples shown in FIGS. 6 and 7, from the input image I, which is the output of the scale conversion unit 11, the calculation region extraction unit 12 outputs a region of interest (ROI) of 4 × 4 pixels and a context region of 8 × 8 pixels. It is assumed that a partial image is cut out (see FIG. 1).

ここで、８×８画素のコンテキスト領域の部分画像は、４×４画素の注目領域（ＲＯＩ）の重心と一致する重心を持つように切り出されている。 Here, the partial image of the context region of 8 × 8 pixels is cut out to have a center of gravity that matches the center of gravity of the region of interest (ROI) of 4 × 4 pixels.

そして、８×８画素のコンテキスト領域の部分画像は、サイズ変換部１４によって、縮小率１／２にダウンサンプリング（図示するＤＳ）され、注目領域（ＲＯＩ）と同じサイズに変換される。 Then, the partial image of the context region of 8 × 8 pixels is down-sampled (DS shown in the figure) by a reduction ratio of 1/2 by the size conversion unit 14 and converted into the same size as the region of interest (ROI).
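The 1/2 downsampling (DS) of the 8 × 8 context patch can be sketched as follows. The specification does not name the resampling method, so averaging each non-overlapping 2 × 2 block is an assumption made here purely for illustration, as is the function name `downsample_by_half`.

```python
def downsample_by_half(patch):
    """Halve width and height by averaging each non-overlapping 2 x 2 block."""
    rows, cols = len(patch), len(patch[0])
    if rows % 2 or cols % 2:
        raise ValueError("patch dimensions must be even")
    return [
        [(patch[y][x] + patch[y][x + 1]
          + patch[y + 1][x] + patch[y + 1][x + 1]) / 4.0
         for x in range(0, cols, 2)]
        for y in range(0, rows, 2)
    ]


# Example: an 8 x 8 context patch becomes a 4 x 4 patch, matching the ROI size.
context = [[float(y * 8 + x) for x in range(8)] for y in range(8)]
reduced = downsample_by_half(context)
```

Any other reduction that yields a w × h output (for example, nearest-neighbour or bilinear resampling) would fit the description equally well.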

４×４画素の注目領域（ＲＯＩ）と、サイズ変換後の４×４画素のコンテキスト領域の各部分画像は、畳み込みニューラルネットワークで構成するニューラルネットワーク部１５に入力される。 Each partial image of the region of interest (ROI) of 4 × 4 pixels and the context region of 4 × 4 pixels after the size conversion is input to the neural network unit 15 configured by a convolutional neural network.

本実施例のニューラルネットワーク部１５においても、注目領域特徴演算部１５１、コンテキスト領域特徴演算部１５２、特徴結合部１５３、及びオブジェクト抽出部１５４を有している（図１参照）。 The neural network unit 15 according to the present embodiment also includes an attention area feature calculation unit 151, a context area feature calculation unit 152, a feature combination unit 153, and an object extraction unit 154 (see FIG. 1).

The attention area feature calculation unit 151 and the context area feature calculation unit 152 are each configured as a partial network having a convolution layer (Conv in the figure) parameterized by (kernel size, stride) and a max pooling layer (MP in the figure) parameterized by (kernel size, stride).

In each convolution layer (Conv in the figure) of the attention area feature calculation unit 151 and the context area feature calculation unit 152, a convolution operation is performed on the respective receptive field (the partial image of the 4 × 4 pixel region of interest (ROI) or of the size-converted 4 × 4 pixel context region) while moving a 3 × 3 pixel kernel with a stride of 1 (a movement width of one pixel), forming a feature map that is a 2 × 2 two-dimensional matrix.

In the pooling layer (MP in the figure) of the attention area feature calculation unit 151, the kernel size is 2 × 2 and the stride is 2; the maximum value is extracted from the feature map obtained by the convolution operation on the region of interest (ROI), forming a feature map that is a 1 × 1 two-dimensional matrix.

In the pooling layer (MP in the figure) of the context area feature calculation unit 152, on the other hand, the kernel size is likewise 2 × 2 but the stride is 1; the maximum value is extracted from the feature map obtained by the convolution operation on the context region, forming a feature map that is a 1 × 1 two-dimensional matrix.
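The feature map sizes quoted for the convolution and pooling layers follow from the usual output-size relation for a sliding window without padding. The sketch below assumes no padding, which is consistent with the 4 → 2 → 1 sizes given in the text but is not stated explicitly; the helper name `out_size` is hypothetical.

```python
def out_size(in_size, kernel, stride):
    """Output length of a sliding window with no padding."""
    return (in_size - kernel) // stride + 1


# Convolution: 4-pixel input, 3 x 3 kernel, stride 1 (both branches).
conv_out = out_size(4, 3, 1)

# Max pooling on the ROI branch: 2 x 2 kernel, stride 2.
roi_pool_out = out_size(conv_out, 2, 2)

# Max pooling on the context branch: 2 x 2 kernel, stride 1.
ctx_pool_out = out_size(conv_out, 2, 1)
```

Both branches therefore end with a 1 × 1 map for a single patch; the differing pooling strides only matter once the network is slid across a larger input.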

Incidentally, when the attention area feature calculation unit 151 and the context area feature calculation unit 152, which is fed through the size conversion unit 14, are configured as a parallel processing type convolutional neural network, they are configured so that the center (centroid) points of their respective receptive fields (the region of interest (ROI) and the context region in the input image I) coincide and so that the strides (movement widths) of those receptive fields coincide. This raises the correlation between the ROI and the context region and improves the accuracy of the subsequent object extraction.

That is, the overall stride of the attention area feature calculation unit 151 for the region of interest (ROI), measured relative to the input image I, is 2 pixels (owing to stride 2 of the max pooling layer). If the receptive field of the ROI corresponding to row m, column n is the 4 × 4 rectangular region with corner image coordinates (x, y, x+4, y+4) in the input image I, then the image coordinates corresponding to row (m+1), column n are (x+2, y, (x+2)+4, y+4).

Similarly, the overall stride of the context area feature calculation unit 152, fed through the size conversion unit 14, for the context region, measured relative to the input image I, is also 2 pixels (the max pooling layer has stride 1, but the size conversion unit 14 downsamples by a reduction ratio of 1/2).
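The two overall strides can be checked by multiplying the downsampling factor by the strides of the layers that follow it. This is a small arithmetic sketch; the function name `overall_stride` is hypothetical, and expressing the 1/2 reduction as a factor of 2 is a notational choice made here.

```python
from math import prod  # available from Python 3.8


def overall_stride(downsample_factor, layer_strides):
    """Stride of a branch measured in pixels of the input image I."""
    return downsample_factor * prod(layer_strides)


# ROI branch: no downsampling, convolution stride 1, max pooling stride 2.
roi_stride = overall_stride(1, [1, 2])

# Context branch: 1/2 downsampling (factor 2), convolution stride 1, pooling stride 1.
ctx_stride = overall_stride(2, [1, 1])
```

Because both products equal 2, adjacent outputs of the two branches refer to positions in the input image that are shifted by the same amount, which is the alignment property the description relies on.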

For simplicity, FIG. 7 uses a one-dimensional representation to show, for the calculation of the attention area feature calculation unit 151 and for the calculation of the size conversion unit 14 and the context area feature calculation unit 152, the relationship between the pixels (receptive fields) of the input image I and each output of the calculation. In both calculations, the output of the strided calculation shown by the adjacent broken line corresponds to a position shifted by 2 pixels from the output of the calculation shown by the solid line, which confirms that the positional relationship between the region of interest (ROI) and the context region remains highly correlated (is not disturbed).

そして、図６に示す例では、注目領域特徴演算部１５１及びコンテキスト領域特徴演算部１５２からそれぞれ出力される１×１の２次元行列の特徴マップは、特徴結合部１５３によってチャンネル方向に結合され、１×１×２の特徴マップとしてオブジェクト抽出部１５４に出力される。 In the example illustrated in FIG. 6, the feature maps of the 1 × 1 two-dimensional matrix output from the attention area feature calculation unit 151 and the context area feature calculation unit 152 are combined in the channel direction by the feature combining unit 153. The data is output to the object extracting unit 154 as a 1 × 1 × 2 feature map.
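The channel-direction combination performed by the feature combining unit 153 can be sketched with nested lists. The function name `concat_channels` and the feature values are made up for illustration; only the shapes (two m × n maps combined into one m × n × 2 map) come from the description.

```python
def concat_channels(roi_map, ctx_map):
    """Stack two m x n feature maps into one m x n x 2 map (channel last)."""
    if len(roi_map) != len(ctx_map) or any(
        len(a) != len(b) for a, b in zip(roi_map, ctx_map)
    ):
        raise ValueError("feature maps must have the same shape")
    return [
        [[a, b] for a, b in zip(roi_row, ctx_row)]
        for roi_row, ctx_row in zip(roi_map, ctx_map)
    ]


# Example: two 1 x 1 maps (hypothetical values) become one 1 x 1 x 2 map.
combined = concat_channels([[0.7]], [[0.2]])
```

The object extraction unit 154 would then take `combined` as its input.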

オブジェクト抽出部１５４は、１×１×２の特徴マップを基に、事前学習に基づいてオブジェクトが含まれるか否かを判定し、その入力画像Ｉの全体からオブジェクト抽出を行う。このようなオブジェクト抽出部１５４を構成する部分ネットワークは、制約なしに自由に設計することができる。一般的には、畳み込み層とプーリング層を繰り返した後、全結合層、ソフトマックス層と連結するような構造が利用される。 The object extracting unit 154 determines whether or not an object is included based on the pre-learning based on the 1 × 1 × 2 feature map, and extracts the object from the entire input image I. The partial network constituting such an object extraction unit 154 can be freely designed without any restrictions. In general, a structure is used in which the convolutional layer and the pooling layer are repeated, and then connected to the fully connected layer and the softmax layer.

(Experimental results based on the example)
The effect of the image object extraction device 1 according to the present invention was verified experimentally. In the experiment, the image object extraction device 1 according to the present invention cut out from the input image I a partial image of an 8 × 8 pixel region of interest (ROI) and a partial image of a 16 × 16 pixel context region. The convolution and pooling layers were combined so that the overall stride of the attention area feature calculation unit 151 for the ROI was 2 and the overall stride of the context area feature calculation unit 152, fed through the size conversion unit 14, for the context region was also 2. The total number of convolution layers in the entire neural network unit 15, including the object extraction unit 154, was 3.

As a comparative example, object extraction was performed using only the 8 × 8 pixel region of interest (ROI), as in the conventional image object extraction device 100 illustrated in FIG. 8, and the total number of convolution layers was likewise set to 3 for consistency.

表１は、本発明と比較例に関するオブジェクト抽出の実験結果を示している。表１は、検出漏れの少なさを評価するための再現率による比較を示すものであり、本例ではサンプル数を３９７１枚の画像としている。本発明に係る再現率は、比較例と比べて約１％向上する結果となった。従って、注目領域（ＲＯＩ）の周辺情報を利用する方がオブジェクト抽出の精度が向上し、本発明の有効性が確認できた。 Table 1 shows experimental results of object extraction according to the present invention and a comparative example. Table 1 shows a comparison based on the recall ratio for evaluating the small number of detection omissions. In this example, the number of samples is 3971 images. The recall according to the present invention was improved by about 1% as compared with the comparative example. Therefore, the use of the peripheral information of the region of interest (ROI) improves the accuracy of object extraction and confirms the effectiveness of the present invention.

Table 2 shows a comparison of the computation time (required execution time) for object extraction between the present invention and the comparative example. The increase in computation time of the present invention over the comparative example is within an acceptable range. Notably, one could instead take the conventional technique and simply perform object extraction based only on the 8 × 8 pixel region of interest (ROI), perform it again based only on the 16 × 16 pixel context region, and then make the extraction decision from the two results together. In that case, even if accuracy comparable to the present invention were obtained, the computation time would be expected to be at least twice that of the comparative example shown in Table 2, which shows how small the computation time of the configuration of the present invention is kept.

From the results in Tables 1 and 2, it is considered that the image object extraction device 1 according to the present invention improves object extraction accuracy while limiting the increase in computation because it processes the region of interest (ROI) and the context region in parallel and because it reduces the context region to the same size as the ROI.

Accordingly, when extracting or recognizing an object in an input image, using both the region of interest (ROI) and the information around it (peripheral information), as in the image object extraction device 1 according to the present invention, is more effective than extracting or recognizing the object using only the input ROI, as in the conventional technique.

In particular, even when an object is so small relative to the input image that it is difficult to extract with the conventional technique, the configuration according to the present invention can extract it accurately, and this tendency becomes more pronounced the smaller the object is.

総括するに、従来技術の変形例として、はじめから注目領域（ＲＯＩ）の周辺情報を含むコンテキスト領域のみを演算対象とすることも考えられる。この場合、図８に示す従来技術の構成を変えることなく、注目領域（ＲＯＩ）の周辺情報を考慮できるようになるが、幾つかの問題が生じる。 To sum up, as a modification of the conventional technique, it is conceivable that only the context region including the peripheral information of the region of interest (ROI) is to be calculated from the beginning. In this case, the peripheral information of the region of interest (ROI) can be considered without changing the configuration of the related art shown in FIG. 8, but there are some problems.

第１に、ＲＯＩを含むコンテキスト領域の画像サイズをそのままにニューラルネットワークによりオブジェクト抽出を行うことになり、オブジェクト抽出に係る計算時間が増大する。即ち、この場合、コンテキスト領域の画像サイズが本来の演算対象のＲＯＩの画像サイズより相対的に拡大したものとなり、その拡大した面積に比例して計算時間が増大してしまう。特に、入力画像内からオブジェクトを抽出するタスクにおいては、上述したスケール変換部１１のようなスケール変換が有効である一方で、様々な位置や大きさの演算対象の画像に対して何度も実行すると、その計算時間は著しく増大する。 First, the object is extracted by the neural network while keeping the image size of the context region including the ROI as it is, and the calculation time for extracting the object increases. In other words, in this case, the image size of the context region is relatively larger than the original calculation target ROI image size, and the calculation time increases in proportion to the enlarged area. In particular, in the task of extracting an object from the input image, while the scale conversion as in the above-described scale conversion unit 11 is effective, it is executed many times for images to be calculated at various positions and sizes. Then, the calculation time significantly increases.

Second, if object extraction is performed by the neural network on the context region containing the ROI, the object that should be extracted at the original target ROI must then be cut out, by some method, from the context region in which the object was extracted, which adversely affects extraction accuracy and computation time.

Therefore, the image object extraction device 1 according to the present invention cuts out, together with the region of interest (ROI), a context region containing that ROI, reduces the image size of the context region to that of the ROI, and then uses a parallel processing type neural network that processes the ROI and the context region in parallel, so that objects are extracted at the image size of the ROI, which is the original calculation target. Consequently, as shown in Tables 1 and 2 above, having peripheral information around the ROI clearly improves object extraction accuracy over using the ROI alone, and the increase in computation is limited without undesirably increasing the computation time.

上述した実施形態の例に関して、画像オブジェクト抽出装置１として機能するコンピュータを構成し、これらの装置の各手段を機能させるためのプログラムを好適に用いることができる。具体的には、各手段を制御するための制御部をコンピュータ内の中央演算処理装置（ＣＰＵ）で構成でき、且つ、各手段を動作させるのに必要となるプログラムを適宜記憶する記憶部を少なくとも１つのメモリで構成させることができる。即ち、そのようなコンピュータに、ＣＰＵによって該プログラムを実行させることにより、上述した各手段の有する機能を実現させることができる。更に、各手段の有する機能を実現させるためのプログラムを、前述の記憶部（メモリ）の所定の領域に格納させることができる。そのような記憶部は、装置内部のＲＡＭ又はＲＯＭなどで構成させることができ、或いは又、外部記憶装置（例えば、ハードディスク）で構成させることもできる。また、そのようなプログラムは、コンピュータで利用されるＯＳ上のソフトウェア（ＲＯＭ又は外部記憶装置に格納される）の一部で構成させることができる。更に、そのようなコンピュータに、各手段として機能させるためのプログラムは、コンピュータ読取り可能な記録媒体に記録することができる。また、上述した各手段をハードウェア又はソフトウェアの一部として構成させ、各々を組み合わせて実現させることもできる。 With respect to the example of the above-described embodiment, a computer functioning as the image object extracting device 1 is configured, and a program for causing each unit of these devices to function can be suitably used. Specifically, a control unit for controlling each unit can be constituted by a central processing unit (CPU) in a computer, and at least a storage unit for appropriately storing a program necessary for operating each unit is provided. It can be configured with one memory. That is, by causing such a computer to execute the program by the CPU, the functions of the respective units described above can be realized. Furthermore, a program for realizing the function of each means can be stored in a predetermined area of the storage unit (memory). Such a storage unit can be configured by a RAM or a ROM inside the device, or can be configured by an external storage device (for example, a hard disk). Further, such a program can be configured as a part of software (stored in a ROM or an external storage device) on an OS used in a computer. Further, a program for causing such a computer to function as each means can be recorded on a computer-readable recording medium. In addition, each of the above-described units may be configured as a part of hardware or software, and each unit may be implemented in combination.

上述の実施形態及び実施例については代表的な例として説明したが、本発明の趣旨及び範囲内で、多くの変更及び置換することができることは当業者に明らかである。従って、本発明は、上述の実施形態及び実施例によって制限するものと解するべきではなく、特許請求の範囲によってのみ制限される。 Although the above embodiments and examples have been described as representative examples, it is apparent to those skilled in the art that many changes and substitutions can be made within the spirit and scope of the present invention. Therefore, the present invention should not be construed as limited by the above-described embodiments and examples, but only by the appended claims.

According to the present invention, an object can be extracted from an input image with high accuracy and in a relatively short time, so the invention is useful for applications that extract or recognize objects in images.

Reference Signs List
1 image object extraction device
11 scale conversion unit
12 calculation region cutout unit
13 scanning unit
14 size conversion unit
15 neural network unit
151 attention region feature calculation unit
152 context region feature calculation unit
153 feature combining unit
154 object extraction unit
100 image object extraction device
112 attention region cutout unit
113 scanning unit
115 neural network unit
1151 attention region feature calculation unit
1154 object extraction unit

Claims

1. An image object extraction device for extracting a specific object from an input image, comprising:
scale conversion means for sequentially generating scale-converted input images by reducing the input image stepwise at a predetermined magnification, starting from a predetermined initial scale;
calculation region cutout means for sequentially cutting out, each at a predetermined size, a partial image of an attention region and a partial image of a context region that includes the attention region and its surrounding information, while scanning the input image scale-converted by the scale conversion means;
size conversion means for reducing each sequentially cut-out partial image of the context region to the same size as the attention region;
attention region feature calculation means for calculating a first feature amount from the partial image of the attention region using a neural network;
context region feature calculation means for calculating a second feature amount from the size-converted partial image of the context region using a neural network;
combining means for combining the first feature amount and the second feature amount to generate a combined feature amount; and
object extraction means for extracting the specific object from the input image obtained through the scale conversion means by determining, based on the combined feature amount, whether the attention region contains the specific object,
wherein at least the attention region feature calculation means, the context region feature calculation means, the combining means, and the object extraction means are configured as partial networks within a neural network,
the attention region feature calculation means and the context region feature calculation means are configured to operate in parallel, and
the object extraction means extracts objects of different sizes by having the scale conversion means repeat the scale conversion as long as the scale of the input image obtained through the scale conversion means does not fall below a predetermined threshold.

2. The image object extraction device according to claim 1, wherein the calculation region cutout means cuts out the partial image of the attention region and the partial image of the context region, each at a fixed size, from the input image obtained through the scale conversion means, and cuts out the context region at a size enlarged by a predetermined amount relative to the attention region so as to include information above, below, left, and right of the attention region.

3. The image object extraction device according to claim 1, wherein the calculation region cutout means cuts out the context region so that its area is greater than 1 times and no more than 4 times the area of the attention region.

4. The image object extraction device according to claim 1, wherein the first feature amount and the second feature amount are each represented by a feature map obtained by feature amount calculation processing of the same format.

5. The image object extraction device according to any one of claims 1 to 4, wherein the attention region feature calculation means and the context region feature calculation means each perform, in parallel, processing based on a convolutional neural network as the feature amount calculation processing of the same format, and thereby calculate, from the input image obtained through the scale conversion means, feature maps in which the positional relationship is retained for each of the first feature amount and the second feature amount.

6. A program for causing a computer to function as the image object extraction device according to any one of claims 1 to 5.
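The geometry recited in claims 1 to 3 (the stepwise scale schedule, the paired attention and context cutouts, and the reduction of the context region to the attention size) can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the function names, the `stride` and `margin` parameters, and the choice to let the context box extend past the image border (which a real implementation would have to pad or clip) are all assumptions made for illustration. The neural network branches themselves are omitted.

```python
from typing import Iterator, Tuple

# (left, top, right, bottom); right and bottom are exclusive. Hypothetical type.
Box = Tuple[int, int, int, int]


def scale_schedule(initial_scale: float, magnification: float,
                   min_scale: float) -> Iterator[float]:
    """Yield scales that shrink stepwise until they would fall below min_scale."""
    if not 0.0 < magnification < 1.0:
        raise ValueError("magnification must lie in (0, 1) so the image shrinks")
    scale = initial_scale
    while scale >= min_scale:
        yield scale
        scale *= magnification


def context_box(attention: Box, margin: int) -> Box:
    """Enlarge the attention box by `margin` pixels on all four sides."""
    left, top, right, bottom = attention
    return (left - margin, top - margin, right + margin, bottom + margin)


def box_area(box: Box) -> int:
    left, top, right, bottom = box
    return (right - left) * (bottom - top)


def area_ratio(attention: Box, context: Box) -> float:
    """Context area divided by attention area (claim 3 asks for 1 < ratio <= 4)."""
    return box_area(context) / box_area(attention)


def context_shrink_factor(window: int, margin: int) -> float:
    """Resize factor that brings the context cutout down to the attention size."""
    return window / (window + 2 * margin)


def scan_regions(width: int, height: int, window: int, stride: int,
                 margin: int) -> Iterator[Tuple[Box, Box]]:
    """Raster-scan a scaled image, yielding (attention, context) box pairs.

    The context box may extend outside the image; handling that (padding or
    clipping) is left to the caller and is an assumption of this sketch.
    """
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            attention = (left, top, left + window, top + window)
            yield attention, context_box(attention, margin)


if __name__ == "__main__":
    for scale in scale_schedule(1.0, 0.5, 0.25):
        w, h = int(128 * scale), int(128 * scale)
        pairs = list(scan_regions(w, h, window=32, stride=16, margin=16))
        print(f"scale={scale}: {len(pairs)} attention/context pairs")
```

With a 32-pixel attention window and a 16-pixel margin, the context cutout is 64 by 64 pixels, which gives an area ratio of exactly 4 (the upper bound in claim 3) and a shrink factor of 0.5 for the size conversion. Each yielded pair would then be fed to the two feature calculation branches, whose outputs are combined before the object decision.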