JP2020017136A

JP2020017136A - Object detection and recognition apparatus, method, and program

Info

Publication number: JP2020017136A
Application number: JP2018140533A
Authority: JP
Inventors: Yongqing Sun (孫 泳青); Shingo Ando (安藤 慎吾); Tetsuya Kinebuchi (杵渕 哲也)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2018-07-26
Filing date: 2018-07-26
Publication date: 2020-01-30
Also published as: WO2020022329A1

Abstract

PROBLEM TO BE SOLVED: To detect and recognize even small objects with high accuracy. SOLUTION: An image to be subjected to object detection is acquired, and the background included in the image is separated from the foreground containing an object. Among the regions representing the separated foreground, regions of a predetermined size or smaller are extracted as object candidate regions. Based on each extracted object candidate region, a plurality of input images corresponding to the object candidate region and its surroundings are generated, and each generated input image is input to a CNN trained in advance to detect objects and recognize object categories. For each input image, the position of the object is detected and its category is recognized, and the category recognition results are integrated on the basis of the category recognition result of each input image. SELECTED DRAWING: Figure 1

Description

The present invention relates to an object detection and recognition device, method, and program, and more particularly to an object detection and recognition device, method, and program for detecting and recognizing objects in an image.

Techniques exist for detecting objects in video or images and recognizing the category (class) of each detected object. They are used to analyze and understand the scene content of video and images. In a typical processing flow, object candidate regions representing subjects (objects or persons) are first extracted from the video or image. Features of the object are then computed within each object candidate region, and the category is recognized from those features. Detection and recognition of the objects in the video or image are achieved by integrating the recognition results of the individual object candidate regions. One example of object detection and recognition based on deep learning is the following method.

First, the input image is divided into S × S cells, and P bounding boxes of differing width and height are defined in advance for each cell. Next, the input image is fed to a neural network model such as a CNN (Convolutional Neural Network), for example VGG (Visual Geometry Group), which simultaneously derives the probability that an object in each of the S × S cells belongs to a given category, the P bounding boxes associated with each cell, and the high-confidence bounding box that represents the true object region (the confidence measure depends on the width, height, and coordinates of the bounding box). Object detection and recognition are thereby achieved together.
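As a non-limiting illustration of the grid scheme just described, the following sketch computes the size of the prediction vector and the cell that contains a given pixel. It assumes the layout used in the YOLO paper, in which each box carries five numbers (x, y, width, height, confidence) and class probabilities are predicted once per cell; the function names and the example values S = 7, P = 2, 20 categories, and a 448 × 448 input are chosen for illustration only.

```python
from typing import Tuple


def yolo_output_size(s: int, p: int, num_classes: int) -> int:
    """Length of the prediction vector for an S x S grid with P boxes per cell.

    Assumes 5 numbers per box (x, y, w, h, confidence) and one set of
    class probabilities per cell, as in the YOLO paper.
    """
    return s * s * (p * 5 + num_classes)


def cell_of(x: float, y: float, width: int, height: int, s: int) -> Tuple[int, int]:
    """Return (row, col) of the grid cell that contains pixel position (x, y)."""
    col = min(int(x * s / width), s - 1)   # clamp so x == width stays in the last column
    row = min(int(y * s / height), s - 1)  # clamp so y == height stays in the last row
    return row, col


print(yolo_output_size(7, 2, 20))       # 1470
print(cell_of(100, 300, 448, 448, 7))   # (4, 1)
```

With a 7 × 7 grid on a 448 × 448 input, each cell covers 64 × 64 pixels, so an object much smaller than a cell contributes little to that cell's prediction.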

Non-Patent Document 1: Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi, “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016

Images and video captured in real environments, however, often have complex backgrounds or contain small objects. Commonly cited targets of object detection and recognition include, for example, a flying drone in footage of forests or mountains such as that shown in FIG. 4, and small road signs in road scenes.

The conventional method of Non-Patent Document 1 divides the entire image into fixed-size candidate regions and estimates the object type and region for each divided region. Consequently, when a candidate region contains multiple objects or a small object, the amount of information (features) representing those objects in the output feature map decreases as the CNN layers become deeper, and the accuracy of detection and recognition drops.

The present invention has been made to solve the above problem, and its object is to provide an object detection and recognition device, method, and program capable of detecting and recognizing even small objects with high accuracy.

To achieve the above object, an object detection and recognition device according to a first invention comprises: a separation unit that acquires an image to be subjected to object detection and separates the background included in the image from the foreground in which an object appears; an object candidate region extraction unit that extracts, as an object candidate region, a region of a predetermined size or smaller from among the regions representing the separated foreground; an input image generation unit that generates, based on the extracted object candidate region, a plurality of input images corresponding to the object candidate region and the periphery of the object candidate region; and a detection and recognition unit that inputs each of the generated input images to a CNN (Convolutional Neural Network) trained in advance to detect objects and recognize object categories, detects, for each input image, the position of the object included in that input image, recognizes the category of the object detected in each input image, and integrates the category recognition results on the basis of the category recognition result of each input image.

In the object detection and recognition device according to the first invention, the image may be acquired for each scene of a video to be subjected to object detection, and the processing of the separation unit, the object candidate region extraction unit, the input image generation unit, and the detection and recognition unit may be performed for each image of the scene.

In the object detection and recognition device according to the first invention, the input image generation unit may generate the plurality of input images by upsampling the extracted object candidate region and a plurality of regions obtained from the periphery of the object candidate region.

In the object detection and recognition device according to the first invention, the detection and recognition unit may integrate the recognition results by taking, for each category, the maximum or the average of the confidences computed for that category in the category recognition of each input image, and adopting the category with the highest resulting confidence as the integrated recognition result.

The object detection and recognition device according to the first invention may further include a learning unit that trains the CNN using the objects detected in the image by the detection and recognition unit and the categories of those objects, and the detection and recognition unit may perform the detection and the recognition using the trained CNN.

An object detection and recognition method according to a second invention comprises: a step in which a separation unit acquires an image to be subjected to object detection and separates the background included in the image from the foreground in which an object appears; a step in which an object candidate region extraction unit extracts, as an object candidate region, a region of a predetermined size or smaller from among the regions representing the separated foreground; a step in which an input image generation unit generates, based on the extracted object candidate region, a plurality of input images corresponding to the object candidate region and the periphery of the object candidate region; and a step in which a detection and recognition unit inputs each of the generated input images to a CNN (Convolutional Neural Network) trained in advance to detect objects and recognize object categories, detects, for each input image, the position of the object included in that input image, recognizes the category of the object detected in each input image, and integrates the category recognition results on the basis of the category recognition result of each input image.

A program according to a third invention is a program for causing a computer to function as each unit of the object detection and recognition device according to the first invention.

According to the object detection and recognition device, method, and program of the present invention, an image to be subjected to object detection is acquired; the background included in the image is separated from the foreground in which an object appears; a region of a predetermined size or smaller is extracted as an object candidate region from among the regions representing the separated foreground; a plurality of input images corresponding to the object candidate region and its periphery are generated based on the extracted object candidate region; each of the generated input images is input to a CNN trained in advance to detect objects and recognize object categories; the position of the object included in each input image is detected and the category of the object detected in each input image is recognized; and the category recognition results are integrated on the basis of the category recognition result of each input image. This provides the effect that even small objects can be detected and recognized with high accuracy.

FIG. 1 is a block diagram showing the configuration of an object detection and recognition device according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of generating a plurality of regions from an object candidate region.
FIG. 3 is a flowchart showing an object detection and recognition processing routine in the object detection and recognition device according to the embodiment of the present invention.
FIG. 4 is a diagram showing an example of an image to be subjected to object detection and recognition.

Embodiments of the present invention are described in detail below with reference to the drawings.

<Overview of the Embodiment of the Present Invention>

First, an overview of the embodiment of the present invention is given. To address the problem described above, an approach that separates the background from the foreground containing objects in a video or image, extracts from the foreground object candidate regions representing small objects, and performs detection and recognition only on the extracted object candidate regions is considered effective for detecting small objects.

In the embodiment of the present invention, small objects can be detected and recognized accurately by providing means for separating the background and the foreground in a video or image, means for extracting object candidate regions of a certain size or smaller from the foreground, and means for generating input images to be fed to a CNN that has been trained in advance with deep learning to detect objects and recognize object categories.

<Configuration of the Object Detection and Recognition Device According to the Embodiment of the Present Invention>

Next, the configuration of the object detection and recognition device according to the embodiment of the present invention is described. As shown in FIG. 1, the object detection and recognition device 100 according to the embodiment can be implemented as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the object detection and recognition processing routine described later. Functionally, as shown in FIG. 1, the object detection and recognition device 100 includes a storage unit 20, an acquisition unit 22, a separation unit 30, an object candidate region extraction unit 32, an input image generation unit 34, a detection and recognition unit 36, and a learning unit 38.

The storage unit 20 stores the video to be subjected to object detection and recognition. On receiving a processing instruction from the acquisition unit 22, the storage unit 20 outputs the video to the acquisition unit 22. The detection results and recognition results obtained by the detection and recognition unit 36 are also stored in the storage unit 20. Images rather than video may be stored in the storage unit 20, in which case the separation unit 30, the object candidate region extraction unit 32, the input image generation unit 34, and the detection and recognition unit 36 perform object detection and recognition on each image.

The acquisition unit 22 outputs a detection and recognition processing instruction to the storage unit 20, acquires the video stored in the storage unit 20, and outputs the acquired video to the separation unit 30. After the processing by the detection and recognition unit 36, the acquisition unit 22 outputs a learning processing instruction to the storage unit 20, acquires the detection results for the input images and the integrated category recognition results stored in the storage unit 20, and outputs them to the learning unit 38.

The separation unit 30 acquires, for each scene of the video, images to be subjected to object detection, and separates the background included in each image from the foreground in which objects appear. The separation unit 30 first extracts images (v_1, v_2, ..., v_N) from the video received from the acquisition unit 22 by taking frames at fixed time intervals. It then separates the background from the foreground in each image using dynamic features between temporally adjacent frames. The separation may use, for example, cv2.absdiff() from the OpenCV image processing library. The separated foreground is output to the object candidate region extraction unit 32.
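The frame-difference separation can be sketched as follows. This is a minimal pure-Python stand-in, not the embodiment's implementation: it computes the same per-pixel absolute difference that cv2.absdiff() computes on arrays, and then applies a threshold to obtain a binary foreground mask. The threshold value of 25 and the function names are assumptions made for illustration; the source does not specify how the difference is turned into a foreground region.

```python
from typing import List

Frame = List[List[int]]  # grayscale image as rows of 0-255 intensities


def abs_diff(prev: Frame, curr: Frame) -> Frame:
    """Per-pixel absolute difference between two frames of equal size."""
    return [[abs(c - p) for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]


def foreground_mask(prev: Frame, curr: Frame, threshold: int = 25) -> Frame:
    """Mark pixels that changed by more than `threshold` as foreground (1)."""
    return [[1 if d > threshold else 0 for d in row]
            for row in abs_diff(prev, curr)]


prev_frame = [[10, 10, 10, 10],
              [10, 10, 10, 10],
              [10, 10, 10, 10]]
curr_frame = [[10, 10, 10, 10],
              [10, 200, 210, 10],
              [10, 12, 10, 10]]

mask = foreground_mask(prev_frame, curr_frame)
print(mask)  # [[0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
```

The small change from 10 to 12 stays below the threshold and is treated as background noise, while the two strongly changed pixels form the foreground.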

The object candidate region extraction unit 32 extracts, as object candidate regions, regions of a predetermined size or smaller from among the regions representing the foreground separated by the separation unit 30, and outputs them to the input image generation unit 34. Specifically, regions representing the foreground are first extracted using an object region extraction method based on edge information, such as that of Non-Patent Document 2 below.

Non-Patent Document 2: C. Lawrence Zitnick, Piotr Dollár, “Edge Boxes: Locating Object Proposals from Edges”, ECCV 2014

Next, the object candidate region extraction unit 32 computes the size of each extracted foreground region, for example as the area of the region's bounding box. Regions of a predetermined size or smaller (for example, 50 × 50 pixels or smaller) are then extracted as object candidate regions.
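The size filter can be illustrated with a short sketch. It assumes that each foreground region is already available as a bounding box and that "50 × 50 pixels or smaller" is interpreted as a bounding-box area of at most 2500 pixels; bounding the width and height separately would be an equally plausible reading. The Box type and function name are hypothetical.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    x: int  # left edge in pixels
    y: int  # top edge in pixels
    w: int  # width in pixels
    h: int  # height in pixels


def select_small_regions(boxes: List[Box], max_area: int = 50 * 50) -> List[Box]:
    """Keep only the foreground boxes whose area does not exceed max_area."""
    return [b for b in boxes if b.w * b.h <= max_area]


foreground_boxes = [Box(5, 5, 20, 30), Box(100, 40, 50, 50), Box(0, 0, 300, 200)]
candidates = select_small_regions(foreground_boxes)
print(candidates)  # the 20x30 and 50x50 boxes are kept, the 300x200 box is dropped
```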

The input image generation unit 34 generates, based on each extracted object candidate region, a plurality of input images corresponding to the object candidate region and its periphery. For example, the input images are generated by upsampling the extracted object candidate region and a plurality of regions obtained from its periphery. Specifically, as shown in FIG. 2, with box e taken as one object candidate region, a region of the same area as e adjoining it on the upper left is generated as region a; likewise, regions b, c, and d are generated from the lower-left, upper-right, and lower-right peripheral areas. Regions a, b, c, d, and e are then upsampled; the regions can be enlarged using, for example, nearest-neighbor or bilinear interpolation. The upsampled regions a, b, c, d, and e are output to the detection and recognition unit 36 as input images. Using multiple regions in this way improves recognition accuracy.

In addition, the learning unit 38 described later can use high-resolution images or multiple training images to tune the parameters of the CNN, trained with deep learning, that detects objects and recognizes object categories. Because tuning the CNN parameters is expected to improve detection and recognition accuracy, input images for the peripheral regions a, b, c, and d are generated in addition to region e in order to increase the amount of data. FIG. 2 shows only one form, and other ways of increasing the data may be adopted according to application needs. For example, the data may be increased by generating a region that enlarges region e.
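The generation of regions a to e and their upsampling can be sketched as follows. The exact geometry of FIG. 2 is not reproduced in the text, so the sketch assumes that the four peripheral regions are the diagonal neighbors of e, each offset by the full width and height of e; regions that extend past the image border are clamped. Nearest-neighbor interpolation with an integer scale factor is used for the enlargement. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h), origin at the top-left corner
Image = List[List[int]]          # grayscale image as rows of intensities


def surrounding_regions(e: Box) -> Dict[str, Box]:
    """Candidate region e plus four same-size diagonal neighbors (assumed layout)."""
    x, y, w, h = e
    return {
        "a": (x - w, y - h, w, h),  # upper left
        "b": (x - w, y + h, w, h),  # lower left
        "c": (x + w, y - h, w, h),  # upper right
        "d": (x + w, y + h, w, h),  # lower right
        "e": e,
    }


def crop(image: Image, box: Box) -> Image:
    """Crop `box` from `image`, clamping the box to the image bounds."""
    x, y, w, h = box
    height, width = len(image), len(image[0])
    x0, y0 = min(max(x, 0), width), min(max(y, 0), height)
    x1, y1 = max(min(x + w, width), x0), max(min(y + h, height), y0)
    return [row[x0:x1] for row in image[y0:y1]]


def upsample_nearest(image: Image, scale: int) -> Image:
    """Enlarge by an integer factor, repeating each pixel `scale` times per axis."""
    return [[px for px in row for _ in range(scale)]
            for row in image for _ in range(scale)]


# 6x6 test image whose pixel value encodes its position as row * 10 + col.
image = [[r * 10 + c for c in range(6)] for r in range(6)]
regions = surrounding_regions((2, 2, 2, 2))
inputs = {name: upsample_nearest(crop(image, box), 2) for name, box in regions.items()}
print(inputs["e"])  # [[22, 22, 23, 23], [22, 22, 23, 23], [32, 32, 33, 33], [32, 32, 33, 33]]
```

Bilinear interpolation could replace upsample_nearest without changing the rest of the sketch.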

The detection and recognition unit 36 inputs each of the input images generated by the input image generation unit 34 to a CNN trained in advance to detect objects and recognize object categories, detects the position of the object included in each input image, recognizes the category of the object detected in each input image, and integrates the category recognition results on the basis of the category recognition result of each input image. As the CNN-based detection and recognition method, YOLO described in Non-Patent Document 1 is used, for example. With this method, for each input image, an object detection probability is computed for each cell and, in category recognition, a confidence for each category is computed for each of the bounding boxes. Accordingly, from the per-category confidences computed for each input image, the maximum confidence for each category may be obtained, and the category whose maximum is highest may be taken as the integrated category recognition result. Alternatively, the per-category confidences may be averaged over the input images, and the category with the highest average confidence may be taken as the integrated result.
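The two integration rules can be sketched as follows. For simplicity the sketch assumes that the CNN output for each input image has already been reduced to one confidence per category (the per-bounding-box confidences are not modeled), and that a category missing from an image counts as confidence 0.0. The function name and the example values are hypothetical.

```python
from typing import Dict, List


def integrate_categories(per_image: List[Dict[str, float]], mode: str = "max") -> str:
    """Return the category with the highest per-category max or mean confidence."""
    if not per_image:
        raise ValueError("at least one input image result is required")
    if mode not in ("max", "mean"):
        raise ValueError("mode must be 'max' or 'mean'")
    categories = sorted({c for conf in per_image for c in conf})
    scores = {}
    for category in categories:
        values = [conf.get(category, 0.0) for conf in per_image]
        scores[category] = max(values) if mode == "max" else sum(values) / len(values)
    return max(categories, key=lambda c: scores[c])


# Confidences for the five input images a, b, c, d, e (illustrative values).
results = [
    {"drone": 0.6, "bird": 0.1},
    {"drone": 0.6, "bird": 0.1},
    {"drone": 0.5, "bird": 0.95},
    {"drone": 0.6, "bird": 0.1},
    {"drone": 0.7, "bird": 0.1},
]
print(integrate_categories(results, "max"))   # bird
print(integrate_categories(results, "mean"))  # drone
```

The example shows that the two rules can disagree: the maximum rule follows a single very confident input image, whereas the average rule favors the category that is supported consistently across the input images.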

The detection and recognition unit 36 stores the object detection results for the input images and the integrated category recognition result in the storage unit 20.

The learning unit 38 receives from the acquisition unit 22 the detection results for the input images and the integrated category recognition results stored in the storage unit 20, uses them to tune the parameters of the CNN that detects objects and recognizes object categories, and feeds the learning result back to the detection and recognition unit 36. The learning may use a general CNN training technique such as error backpropagation. As a result of the learning by the learning unit 38, the detection and recognition unit 36 can detect and recognize objects using the CNN with tuned parameters.

The processing of the learning unit 38 may be performed at any time, separately from the series of object detection and recognition processing performed by the acquisition unit 22, the separation unit 30, the object candidate region extraction unit 32, the input image generation unit 34, and the detection and recognition unit 36.

<Operation of the Object Detection and Recognition Device According to the Embodiment of the Present Invention>

Next, the operation of the object detection and recognition device 100 according to the embodiment of the present invention with respect to object detection and recognition is described. The object detection and recognition device 100 executes the object detection and recognition processing routine shown in FIG. 3.

First, in step S100, the acquisition unit 22 outputs a detection and recognition processing instruction to the storage unit 20, acquires the video stored in the storage unit 20, and outputs the acquired video to the separation unit 30.

Next, in step S102, the separation unit 30 extracts, for each scene of the video, the images (v_1, v_2, ..., v_N) to be subjected to object detection from frames at fixed time intervals.
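Selecting frames at fixed time intervals in step S102 amounts to choosing every k-th frame index. The following sketch assumes that the frame rate of the video and the sampling interval in seconds are known; both parameters and the function name are illustrative and are not specified in the source.

```python
from typing import List


def sample_frame_indices(num_frames: int, fps: float, interval_sec: float) -> List[int]:
    """Indices of the frames taken at a fixed time interval from a video."""
    step = max(1, round(fps * interval_sec))  # at least every frame
    return list(range(0, num_frames, step))


print(sample_frame_indices(100, 30.0, 1.0))  # [0, 30, 60, 90]
```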

In step S104, the separation unit 30 selects a target image v_i.

In step S106, the separation unit 30 separates, for the target image v_i, the background from the foreground in which objects appear, using dynamic features between temporally adjacent frames.

In step S108, the object candidate region extraction unit 32 extracts, as object candidate regions, regions of a predetermined size or smaller from among the regions representing the foreground separated in step S106, and outputs them to the input image generation unit 34.

In step S110, the input image generation unit 34 generates, for the target image v_i and based on each extracted object candidate region, a plurality of input images corresponding to the object candidate region and its periphery.

In step S112, the detection and recognition unit 36 inputs, for the target image v_i, each of the input images generated in step S110 to the CNN trained in advance to detect objects and recognize object categories, detects the position of the object included in each input image, recognizes the category of the object detected in each input image, and integrates the category recognition results on the basis of the category recognition result of each input image.

In step S114, the detection and recognition unit 36 stores, for the target image v_i, the object detection results for the input images and the integrated category recognition result in the storage unit 20.

In step S116, the detection and recognition unit 36 determines whether processing has finished for all target images v_i. If so, the object detection and recognition processing routine ends; if not, the routine returns to step S104, selects the next target image v_i, and repeats the processing.
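The control flow of steps S104 to S116 can be summarized in a short skeleton. The four processing stages are passed in as callables, which are hypothetical placeholders for the separation unit 30, the object candidate region extraction unit 32, the input image generation unit 34, and the detection and recognition unit 36; a plain list stands in for the storage unit 20. Starting from the second image, so that a preceding frame is always available for the frame difference, is an assumption of this sketch.

```python
from typing import Callable, List, Sequence


def run_routine(images: Sequence, separate: Callable, extract_candidates: Callable,
                generate_inputs: Callable, detect_and_recognize: Callable) -> List:
    """Loop over the target images v_i, mirroring steps S104 to S116."""
    stored = []                                          # stands in for storage unit 20
    for i in range(1, len(images)):                      # S104: select v_i
        foreground = separate(images[i - 1], images[i])  # S106
        for region in extract_candidates(foreground):    # S108
            inputs = generate_inputs(images[i], region)  # S110
            result = detect_and_recognize(inputs)        # S112
            stored.append((i, region, result))           # S114
    return stored                                        # S116: all images processed


# Trivial stand-ins so that the skeleton runs end to end.
log = run_routine(
    images=["v_1", "v_2", "v_3"],
    separate=lambda prev, curr: f"fg({prev},{curr})",
    extract_candidates=lambda fg: [f"region_of_{fg}"],
    generate_inputs=lambda image, region: [region] * 5,
    detect_and_recognize=lambda inputs: ("drone", len(inputs)),
)
print(log)
```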

As described above, the object detection and recognition device according to the embodiment of the present invention acquires an image to be subjected to object detection; separates the background included in the image from the foreground in which an object appears; extracts, as an object candidate region, a region of a predetermined size or smaller from among the regions representing the separated foreground; generates, based on the extracted object candidate region, a plurality of input images corresponding to the object candidate region and its periphery; inputs each of the generated input images to a CNN trained in advance to detect objects and recognize object categories; detects the position of the object included in each input image and recognizes the category of the object detected in each input image; and integrates the category recognition results on the basis of the category recognition result of each input image. Even small objects can thereby be detected and recognized with high accuracy.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, although the above embodiment describes the case where the learning unit 38 is included in the object detection and recognition device 100, the invention is not limited to this, and the learning unit may be configured as a learning device separate from the object detection and recognition device 100.

Reference Signs List
20 accumulation unit
22 acquisition unit
30 separation unit
32 object candidate region extraction unit
34 input image generation unit
36 detection and recognition unit
38 learning unit
100 object detection and recognition device

Claims

An object detection and recognition device comprising:
a separation unit that acquires an image that is the target of object detection and separates a background contained in the image from a foreground in which an object appears;
an object candidate region extraction unit that extracts, from the regions representing the separated foreground, a region of a predetermined size or smaller as an object candidate region;
an input image generation unit that generates, based on the extracted object candidate region, a plurality of input images corresponding to the object candidate region and the surroundings of the object candidate region; and
a detection and recognition unit that inputs each of the generated input images to a CNN (Convolutional Neural Network) trained in advance to detect objects and recognize object categories, detects the position of the object contained in each of the input images, recognizes the category of the object detected in each of the input images, and integrates the object category recognition results based on the category recognition results for the respective input images.

The object detection and recognition device according to claim 1, wherein the image is acquired for each scene of a video that is the target of object detection, and the processes of the separation unit, the object candidate region extraction unit, the input image generation unit, and the detection and recognition unit are performed on the image of each scene.

The object detection and recognition device according to claim 1, wherein the input image generation unit generates the plurality of input images by upsampling the extracted object candidate region and a plurality of regions obtained from the surroundings of the object candidate region.

The object detection and recognition device according to any one of claims 1 to 3, wherein the detection and recognition unit integrates the recognition results by determining, from the reliability of each category calculated when recognizing the category for each of the input images, the category with the highest reliability using the maximum value or the average value of the reliability for each category, and takes that category as the integrated recognition result.

The object detection and recognition device according to any one of claims 1 to 4, further comprising a learning unit that trains the CNN using objects detected from images and the categories of those objects, wherein the detection and recognition unit performs the detection and recognition using the trained CNN.

An object detection and recognition method comprising:
a step in which a separation unit acquires an image that is the target of object detection and separates a background contained in the image from a foreground in which an object appears;
a step in which an object candidate region extraction unit extracts, from the regions representing the separated foreground, a region of a predetermined size or smaller as an object candidate region;
a step in which an input image generation unit generates, based on the extracted object candidate region, a plurality of input images corresponding to the object candidate region and the surroundings of the object candidate region; and
a step in which a detection and recognition unit inputs each of the generated input images to a CNN (Convolutional Neural Network) trained in advance to detect objects and recognize object categories, detects the position of the object contained in each of the input images, recognizes the category of the object detected in each of the input images, and integrates the object category recognition results based on the category recognition results for the respective input images.

A program for causing a computer to function as each unit of the object detection and recognition device according to any one of claims 1 to 5.