WO2020261324A1 - Object detection/recognition device, object detection/recognition method, and object detection/recognition program - Google Patents

Object detection/recognition device, object detection/recognition method, and object detection/recognition program

Info

Publication number
WO2020261324A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature map
layer
integrated
unit
recognition
Prior art date
Application number
PCT/JP2019/024906
Other languages
French (fr)
Japanese (ja)
Inventor
泳青 孫
島村 潤
淳 嵯峨田
Original Assignee
日本電信電話株式会社
Priority date
Filing date
Publication date
Application filed by 日本電信電話株式会社 filed Critical 日本電信電話株式会社
Priority to PCT/JP2019/024906 priority Critical patent/WO2020261324A1/en
Publication of WO2020261324A1 publication Critical patent/WO2020261324A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis

Definitions

  • The present invention is not limited to the above-described embodiment, and various modifications and applications are possible without departing from the gist of the present invention.
  • In the above embodiment, the case where the learning unit 107 is included in the object detection/recognition device 10 has been described as an example, but the present invention is not limited to this; the learning unit may be configured as a learning device separate from the object detection/recognition device 10.
  • The object detection/recognition process that is executed by the CPU reading software (a program) in the above embodiment may instead be executed by various processors other than the CPU.
  • Examples of such processors include a PLD (Programmable Logic Device), such as an FPGA (Field-Programmable Gate Array), whose circuit configuration can be changed after manufacture, and a dedicated electric circuit such as an ASIC (Application Specific Integrated Circuit), which is a processor having a circuit configuration designed exclusively for executing a specific process.
  • The object detection/recognition process may be executed by one of these various processors, or by a combination of two or more processors of the same type or different types (for example, a plurality of FPGAs, or a combination of a CPU and an FPGA).
  • More specifically, the hardware structure of these various processors is an electric circuit in which circuit elements such as semiconductor elements are combined.
  • The program may be provided in a form stored in a non-transitory storage medium such as a CD-ROM (Compact Disk Read Only Memory), a DVD-ROM (Digital Versatile Disk Read Only Memory), or a USB (Universal Serial Bus) memory. The program may also be downloaded from an external device via a network.
  • An image to be processed is input to a CNN (Convolutional Neural Network), and all the feature maps output in each layer of the CNN are integrated to generate an integrated feature map.
  • Enhanced hierarchical feature maps are generated in which the generated integrated feature map is reflected in each of the feature maps output in each layer.
  • An object detection/recognition device that detects object candidate regions from the image based on the generated enhanced hierarchical feature maps and recognizes the category and region of the object represented by each of the object candidate regions.
  • A non-transitory recording medium that stores a program executable by a computer so as to perform object detection/recognition processing.
  • In the object detection/recognition processing, an image to be processed is input to a CNN (Convolutional Neural Network), and all the feature maps output in each layer of the CNN are integrated to generate an integrated feature map.
  • Enhanced hierarchical feature maps are generated in which the generated integrated feature map is reflected in each of the feature maps output in each layer.
  • The object detection/recognition processing includes detecting object candidate regions from the image based on the generated enhanced hierarchical feature maps, and recognizing the category and region of the object represented by each of the object candidate regions.

Abstract

The present invention accurately recognizes the category and the area of an object represented by an image. An integration unit (104) inputs an image to be processed into a CNN, and integrates all hierarchical feature maps output from each layer of the CNN, thereby generating an integrated feature map; a generation unit (105) generates enhanced hierarchical feature maps by causing each hierarchical feature map to reflect the integrated feature map generated by the integration unit (104); and a recognition unit (106) detects each object candidate area in the image and recognizes the category and the area of the object represented by each object candidate area, on the basis of the enhanced hierarchical feature maps generated by the generation unit (105).

Description

Object detection/recognition device, object detection/recognition method, and object detection/recognition program
The disclosed technology relates to an object detection/recognition device, an object detection/recognition method, and an object detection/recognition program.
Semantic image segmentation and recognition is a technology that attempts to assign the pixels of a video or image to object categories. It is widely applied to autonomous driving, medical image analysis, state and pose estimation, and the like. In recent years, pixel-wise image segmentation techniques using deep learning have been actively studied. In the method called Mask RCNN (Non-Patent Document 1), which is an example of a typical processing flow, a feature map is first extracted from the input image through a CNN (Convolutional Neural Network)-based backbone network, as shown in part (a) of FIG. 8. Next, candidate regions related to objects (regions that appear to be objects) are detected in the extracted feature map (part (b) of FIG. 8). Finally, object position detection and pixel assignment are performed from the detected candidate regions (part (c) of FIG. 8). In addition, whereas the feature map extraction process of Mask RCNN uses only the output of the deep layers of the CNN, a hierarchical feature map extraction method called FPN (Feature Pyramid Network) (Non-Patent Document 2), which also uses the outputs of multiple layers including the information of the shallow layers, has been proposed, as shown in FIG. 9. Specifically, as shown in FIG. 10, hierarchical feature maps are generated from the deep layers to the shallow layers by upsampling processing.
The following observations can be made about CNN-based object segmentation and recognition methods.
First, the shallow layers of a CNN-based backbone network represent low-level image features of the input image; that is, they express details of an object such as lines, dots, and patterns.
Second, as the CNN layers become deeper, higher-level features of the image can be extracted. For example, features representing the characteristic contours of objects and the contextual relationships between objects can be extracted.
The Mask RCNN method of Non-Patent Document 1 performs the subsequent object candidate region detection and pixel-wise segmentation using only the feature map generated from the deep layers of the CNN. Since the low-level features expressing the details of an object are therefore lost, problems arise in which the detected object position is shifted and the accuracy of segmentation (pixel assignment) becomes low.
On the other hand, the FPN method of Non-Patent Document 2 propagates semantic information to the shallow layers of the CNN backbone network while upsampling from the feature maps of the deep layers. By performing object segmentation using multiple feature maps, the segmentation accuracy is improved to some extent; however, because low-level features are not actually incorporated into the high-level feature maps (upper layers), accuracy problems in object segmentation and recognition still arise.
The present invention has been made to solve the above problems, and an object of the present invention is to provide an object detection/recognition device, method, and program capable of accurately recognizing the category and region of an object represented by an image.
A first aspect of the present disclosure is an object detection/recognition device including: an integration unit that inputs an image to be processed into a CNN (Convolutional Neural Network) and integrates all the feature maps output in each layer of the CNN to generate an integrated feature map; a generation unit that generates enhanced hierarchical feature maps in which the integrated feature map generated by the integration unit is reflected in each of the feature maps output in each layer; and a recognition unit that, based on the enhanced hierarchical feature maps generated by the generation unit, detects object candidate regions from the image and recognizes the category and region of the object represented by each of the object candidate regions.
A second aspect of the present disclosure is an object detection/recognition method in which an integration unit inputs an image to be processed into a CNN and integrates all the feature maps output in each layer of the CNN to generate an integrated feature map; a generation unit generates enhanced hierarchical feature maps in which the integrated feature map generated by the integration unit is reflected in each of the feature maps output in each layer; and a recognition unit, based on the enhanced hierarchical feature maps generated by the generation unit, detects object candidate regions from the image and recognizes the category and region of the object represented by each of the object candidate regions.
A third aspect of the present disclosure is an object detection/recognition program that causes a computer to function as: an integration unit that inputs an image to be processed into a CNN and integrates all the feature maps output in each layer of the CNN to generate an integrated feature map; a generation unit that generates enhanced hierarchical feature maps in which the integrated feature map generated by the integration unit is reflected in each of the feature maps output in each layer; and a recognition unit that, based on the enhanced hierarchical feature maps generated by the generation unit, detects object candidate regions from the image and recognizes the category and region of the object represented by each of the object candidate regions.
According to the disclosed technology, the category and region of an object represented by an image can be recognized with high accuracy.
FIG. 1 is a block diagram showing the hardware configuration of the object detection/recognition device according to the present embodiment.
FIG. 2 is a block diagram showing the functional configuration of the object detection/recognition device according to the present embodiment.
FIG. 3 is a flowchart showing an example of the object detection/recognition process in the object detection/recognition device according to the present embodiment.
FIG. 4 is a diagram for explaining the process of converting each of the hierarchical feature maps to the same size.
FIG. 5 is a diagram for explaining the integration of the hierarchical feature maps converted to the same size, and the conversion into hierarchical feature maps of different sizes.
FIG. 6 is a diagram showing an example of a module used when converting the integrated feature map into different sizes.
FIG. 7 is a diagram for explaining the generation of the enhanced hierarchical feature maps.
FIG. 8 is a diagram for explaining the processing of Mask RCNN.
FIG. 9 is a diagram for explaining the processing of FPN.
FIG. 10 is a diagram for explaining the generation of hierarchical feature maps by upsampling processing.
Hereinafter, an example of an embodiment of the disclosed technology will be described with reference to the drawings. In the drawings, the same or equivalent components and parts are given the same reference numerals. The dimensional ratios in the drawings are exaggerated for convenience of explanation and may differ from the actual ratios.
<Overview of the embodiment according to the present invention>
First, an overview of the embodiment according to the present invention will be described.
In view of the problems described above, it is considered that fully using the information from the shallow layers to the deep layers of the CNN-based backbone network for feature map extraction is effective for accurate object detection and recognition.
Therefore, in the embodiment of the present invention, an image to be subjected to object detection and recognition is acquired, and an integrated feature map is generated for the acquired image by integrating all of the hierarchical feature maps of the layers output, for example, by an FPN through the CNN backbone network. Then, from the generated integrated feature map, a feature map of the corresponding size is extracted for each of the hierarchical feature maps, and each layer's feature map is enhanced by integrating it with the extracted feature map, thereby generating enhanced hierarchical feature maps that take the information of all layers into account. Object detection and recognition are then performed using the generated enhanced hierarchical feature maps.
<Configuration of the object detection/recognition device according to the present embodiment>
Next, the configuration of the object detection/recognition device according to the embodiment of the present invention will be described. FIG. 1 is a block diagram showing the hardware configuration of the object detection/recognition device 10.
As shown in FIG. 1, the object detection/recognition device 10 includes a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a storage 14, an input unit 15, a display unit 16, and a communication I/F (Interface) 17. These components are connected to one another via a bus 19 so that they can communicate with each other.
The CPU 11 is a central processing unit that executes various programs and controls each part. That is, the CPU 11 reads a program from the ROM 12 or the storage 14 and executes it using the RAM 13 as a work area. The CPU 11 controls each of the above components and performs various arithmetic processes in accordance with the program stored in the ROM 12 or the storage 14. In the present embodiment, the ROM 12 or the storage 14 stores an object detection/recognition program for executing the object detection/recognition process described later.
The ROM 12 stores various programs and various data. The RAM 13 temporarily stores a program or data as a work area. The storage 14 is composed of an HDD (Hard Disk Drive) or an SSD (Solid State Drive), and stores various programs, including an operating system, and various data.
The input unit 15 includes a pointing device such as a mouse, and a keyboard, and is used for performing various inputs.
The display unit 16 is, for example, a liquid crystal display and displays various types of information. The display unit 16 may adopt a touch panel system and also function as the input unit 15.
The communication I/F 17 is an interface for communicating with other devices, and uses standards such as Ethernet (registered trademark), FDDI, or Wi-Fi (registered trademark).
Next, the functional configuration of the object detection/recognition device 10 will be described.
FIG. 2 is a block diagram showing an example of the functional configuration of the object detection/recognition device 10.
As shown in FIG. 2, the object detection/recognition device 10 includes, as its functional configuration, a storage unit 101, an acquisition unit 102, an extraction unit 103, an integration unit 104, a generation unit 105, a recognition unit 106, and a learning unit 107. Each functional component is realized by the CPU 11 reading the object detection/recognition program stored in the ROM 12 or the storage 14, loading it into the RAM 13, and executing it.
The storage unit 101 stores images to be subjected to object detection and recognition, and the detection results and recognition results obtained by the recognition unit 106. When the storage unit 101 receives a processing instruction from the acquisition unit 102, it outputs an image to the acquisition unit 102. At the time of learning, the storage unit 101 stores images to which detection results and recognition results have been assigned in advance.
The acquisition unit 102 outputs a processing instruction to the storage unit 101, acquires an image stored in the storage unit 101, and outputs the acquired image to the extraction unit 103.
The extraction unit 103 receives the image to be processed from the acquisition unit 102, inputs the received image to a CNN (Convolutional Neural Network), and extracts the feature maps output in each layer of the CNN (hereinafter referred to as "hierarchical feature maps"). The extraction unit 103 outputs the extracted hierarchical feature maps to the integration unit 104.
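As an illustration of the extraction unit 103, the following sketch collects the feature map output by each stage of a small convolutional backbone. The three-stage network, its channel counts, and the input size are assumptions made for this example only; the embodiment itself may use a backbone such as VGG or ResNet, as noted in step S102 below.

```python
# Minimal sketch (PyTorch): a toy backbone whose per-stage outputs play the role of
# the hierarchical feature maps. The stage structure and channel counts are assumptions.
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    """Returns the feature map of every stage (shallow to deep)."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        c1 = self.stage1(x)   # shallow layer: low-level details (lines, dots, patterns)
        c2 = self.stage2(c1)  # intermediate layer
        c3 = self.stage3(c2)  # deep layer: semantic features (contours, context)
        return [c1, c2, c3]

image = torch.randn(1, 3, 256, 256)        # image to be processed
hierarchical_maps = ToyBackbone()(image)   # output corresponding to the extraction unit 103
```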
The integration unit 104 receives the hierarchical feature maps extracted by the extraction unit 103 and generates an integrated feature map in which all layers of the hierarchical feature maps are integrated.
Specifically, the integration unit 104 converts each of the hierarchical feature maps to the same size by upsampling them layer by layer, in order from the deepest layer, so as to match the resolution of the hierarchical feature map of the shallowest layer. Then, for each set of corresponding cells across the layers, the integration unit 104 obtains the feature of each cell of the integrated feature map by concatenating the features of the layers or by statistically processing them, thereby generating the integrated feature map. The integration unit 104 outputs the generated integrated feature map to the generation unit 105.
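A minimal sketch of the integration unit 104 is shown below, assuming FPN-style hierarchical feature maps that share a channel count. The use of bilinear interpolation for the upsampling is an assumption; the text only specifies upsampling to the resolution of the shallowest layer. Concatenation is used here, and the statistical alternatives are described in step S103 below.

```python
# Minimal sketch (PyTorch) of the integration unit 104, under the assumptions above.
import torch
import torch.nn.functional as F

def integrate(hierarchical_maps):
    # Upsample every layer to the resolution of the shallowest (largest) layer ...
    target = hierarchical_maps[0].shape[-2:]
    resized = [F.interpolate(m, size=target, mode="bilinear", align_corners=False)
               for m in hierarchical_maps]
    # ... then combine the layers cell by cell; concatenation is shown here.
    return torch.cat(resized, dim=1)

maps = [torch.randn(1, 256, 64, 64), torch.randn(1, 256, 32, 32), torch.randn(1, 256, 16, 16)]
integrated = integrate(maps)   # shape: (1, 768, 64, 64)
```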
The generation unit 105 receives the integrated feature map from the integration unit 104 and generates enhanced hierarchical feature maps that take the information of all layers of the CNN into account. That is, the generation unit 105 generates enhanced hierarchical feature maps in which the integrated feature map generated by the integration unit 104 is reflected in each of the hierarchical feature maps.
Specifically, the generation unit 105 converts the integrated feature map generated by the integration unit 104 into a plurality of integrated feature maps, each having the same size as one of the hierarchical feature maps, by convolution processing with different filter sizes. Then, the generation unit 105 integrates each of the integrated feature maps converted to each size with the hierarchical feature map of the corresponding size to generate the enhanced hierarchical feature maps. As the method of integrating an integrated feature map with a hierarchical feature map, the same method as the integration method used when generating the integrated feature map can be adopted. The generation unit 105 outputs the generated enhanced hierarchical feature maps to the recognition unit 106.
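The following sketch illustrates one possible form of the generation unit 105. Bringing the integrated feature map back to the size of each layer with strided 3x3 convolutions, and fusing by element-wise addition, are assumptions for this example; the patent only requires convolution processing with different filter sizes (for example, an inception-style module) and any of the integration methods described above.

```python
# Minimal sketch (PyTorch) of the generation unit 105, under the assumptions above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnhanceMaps(nn.Module):
    def __init__(self, integrated_channels, layer_channels=256, num_layers=3):
        super().__init__()
        # One branch per pyramid level: stride 1, 2, 4, ... to roughly match each resolution.
        self.branches = nn.ModuleList(
            nn.Conv2d(integrated_channels, layer_channels, kernel_size=3,
                      stride=2 ** i, padding=1)
            for i in range(num_layers))

    def forward(self, integrated, hierarchical_maps):
        enhanced = []
        for branch, fmap in zip(self.branches, hierarchical_maps):
            resized = branch(integrated)
            # Guard against off-by-one spatial sizes from the strided convolutions.
            resized = F.interpolate(resized, size=fmap.shape[-2:], mode="bilinear",
                                    align_corners=False)
            enhanced.append(fmap + resized)   # element-wise addition; concatenation also possible
        return enhanced

maps = [torch.randn(1, 256, 64, 64), torch.randn(1, 256, 32, 32), torch.randn(1, 256, 16, 16)]
integrated = torch.cat([F.interpolate(m, size=(64, 64), mode="bilinear", align_corners=False)
                        for m in maps], dim=1)            # (1, 768, 64, 64)
enhanced_maps = EnhanceMaps(768)(integrated, maps)         # one enhanced map per layer
```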
The recognition unit 106 detects object candidate regions from the image to be processed based on the enhanced hierarchical feature maps generated by the generation unit 105, and recognizes the category and region of the object represented by each of the object candidate regions.
Specifically, for the detection of object candidate regions, the recognition unit 106 uses deep-learning-based object detection (for example, process (b) of Mask RCNN shown in FIG. 8) and detects each object candidate region by performing pixel-wise object segmentation on the image to be processed. For the recognition of the object category and region, the recognition unit 106 uses a deep-learning-based recognition method (for example, process (c) of Mask RCNN shown in FIG. 8) to recognize the category, position, and region of the object represented by each object candidate region. The recognition unit 106 stores the detection results of the object candidate regions and the recognition results of the object categories, positions, and regions in the storage unit 101.
The learning unit 107 learns the parameters of the neural networks used in the extraction unit 103 and the recognition unit 106, using the detection results and recognition results obtained by the recognition unit 106 for each of the images to be processed stored in the storage unit 101, and the detection results and recognition results assigned in advance to each of those images. A general neural network training method such as error backpropagation may be used for the learning. Through the learning by the learning unit 107, the extraction unit 103 and the recognition unit 106 can each perform their processing using neural networks whose parameters have been tuned.
The processing of the learning unit 107 may be performed at an arbitrary timing, separately from the series of object detection and recognition processes performed by the acquisition unit 102, the extraction unit 103, the integration unit 104, the generation unit 105, and the recognition unit 106.
<Operation of the object detection/recognition device according to the present embodiment>
Next, the operation of the object detection/recognition device 10 will be described.
FIG. 3 is a flowchart showing the flow of the object detection/recognition process performed by the object detection/recognition device 10. The object detection/recognition process is performed by the CPU 11 reading the object detection/recognition program from the ROM 12 or the storage 14, loading it into the RAM 13, and executing it.
First, in step S101, the CPU 11, as the acquisition unit 102, outputs a processing instruction to the storage unit 101 and acquires an image to be processed from the images stored in the storage unit 101.
Next, in step S102, the CPU 11, as the extraction unit 103, inputs the image to be processed acquired in step S101 to the CNN-based backbone network and acquires the feature map output from each layer. For example, a CNN such as VGG or ResNet can be used. Then, for example, as shown in FIG. 10, the CPU 11, as the extraction unit 103, extracts the hierarchical feature maps by upsampling the feature map of an upper layer and combining it with the feature map of the layer below it.
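The sketch below illustrates an FPN-style top-down merge of the kind referred to in step S102. The 1x1 lateral convolutions, the common channel count of 256, and nearest-neighbor upsampling follow the usual FPN recipe and are assumptions here; the patent itself only refers to FPN (Non-Patent Document 2).

```python
# Minimal sketch (PyTorch) of a top-down merge producing hierarchical feature maps,
# under the FPN-style assumptions stated above. Weights are randomly initialised;
# this only illustrates the data flow.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDown(nn.Module):
    def __init__(self, in_channels=(64, 128, 256), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)

    def forward(self, backbone_maps):                 # ordered shallow -> deep
        laterals = [l(m) for l, m in zip(self.laterals, backbone_maps)]
        merged = [laterals[-1]]                       # start from the deepest layer
        for lat in reversed(laterals[:-1]):           # walk towards the shallow layers
            up = F.interpolate(merged[0], size=lat.shape[-2:], mode="nearest")
            merged.insert(0, lat + up)                # propagate semantic information downwards
        return merged                                 # hierarchical feature maps, shallow -> deep

c1 = torch.randn(1, 64, 64, 64)
c2 = torch.randn(1, 128, 32, 32)
c3 = torch.randn(1, 256, 16, 16)
hierarchical_maps = TopDown()([c1, c2, c3])           # each map now has 256 channels
```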
With such hierarchical feature maps, the semantic information of the deep layers (characteristic contours of objects, contextual information between objects, and so on) can also be propagated to the feature maps of the lower layers. At the time of object detection, this can be expected to make the contours of the detected object candidate regions smooth and to enable accurate detection without omissions.
Next, in step S103, the CPU 11, as the integration unit 104, receives the hierarchical feature maps extracted in step S102 and processes each layer so that all layers have the same size. Here, as shown in FIG. 4, the CPU 11, as the integration unit 104, converts each of the hierarchical feature maps to the same size by upsampling them layer by layer, in order from the deepest layer, so as to match the resolution of the hierarchical feature map of the shallowest layer. Then, the CPU 11, as the integration unit 104, integrates all layers of the hierarchical feature maps converted to the same size, for example as shown by arrow A in FIG. 5, to generate the integrated feature map.
A specific example of the integration method will be described. Let f_ij be the feature of cell j in the hierarchical feature map of the i-th layer after size conversion, and let N be the number of layers of the hierarchical feature maps. As the integration method, for example, the features of the layers may be concatenated so that the feature f_j_concat of cell j of the integrated feature map is (f_1j, f_2j, ..., f_ij, ..., f_Nj). Alternatively, the features of the layers may be added so that the feature f_j_concat of cell j of the integrated feature map is (f_1j + f_2j + ... + f_ij + ... + f_Nj). The average may also be taken, so that the feature f_j_concat is ((f_1j + f_2j + ... + f_ij + ... + f_Nj) / N). Furthermore, the maximum value f_ij_max, the minimum value f_ij_min, the median f_ij_median, or the like of the features f_ij of the layers may be used as the feature f_j_concat of the integrated feature map.
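As a concrete illustration of these per-cell options, the snippet below combines the feature of one cell j across N = 3 layers; the numbers are arbitrary and the features are scalars for simplicity.

```python
# Per-cell combination options for one cell j across N = 3 layers (arbitrary numbers).
import numpy as np

f_1j, f_2j, f_3j = 0.2, 0.7, 0.4            # feature of cell j in layers 1..3
layers = np.array([f_1j, f_2j, f_3j])

f_j_concat = tuple(layers)                   # concatenation: (0.2, 0.7, 0.4)
f_j_sum    = layers.sum()                    # addition: 1.3
f_j_mean   = layers.mean()                   # average: (f_1j + f_2j + f_3j) / N, about 0.433
f_j_max    = layers.max()                    # maximum: 0.7
f_j_min    = layers.min()                    # minimum: 0.2
f_j_median = np.median(layers)               # median: 0.4
```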
Because such feature map integration processing effectively and efficiently takes in the information of all layers of the CNN at once, the integrated feature map can be expected to reflect both the detailed features of an object (patterns, dots, and so on) and its semantic features (contours, positional relationships, and so on) more accurately.
Next, in step S104, the CPU 11, as the generation unit 105, receives the integrated feature map generated in step S103 and generates enhanced hierarchical feature maps that take the information of all layers of the CNN into account. For example, the CPU 11, as the generation unit 105, converts the integrated feature map generated in step S103 into a plurality of integrated feature maps, each having the same size as one of the hierarchical feature maps, by convolution processing with different filter sizes, as shown by arrow B in FIG. 5. For this processing, for example, the inception module of GoogLeNet as shown in FIG. 6 can be used.
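The following sketch shows an inception-style block with parallel branches of different filter sizes, in the spirit of the GoogLeNet inception module referred to in FIG. 6. The branch widths, the 1x1 bottlenecks, and the pooling branch are illustrative assumptions.

```python
# Minimal sketch (PyTorch) of an inception-style block, under the assumptions above.
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_channels, branch_channels=64):
        super().__init__()
        self.b1 = nn.Conv2d(in_channels, branch_channels, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv2d(in_channels, branch_channels, 1),
                                nn.Conv2d(branch_channels, branch_channels, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_channels, branch_channels, 1),
                                nn.Conv2d(branch_channels, branch_channels, 5, padding=2))
        self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(in_channels, branch_channels, 1))

    def forward(self, x):
        # All branches preserve the spatial size, so their outputs can be concatenated.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)

integrated = torch.randn(1, 768, 64, 64)
out = InceptionBlock(768)(integrated)        # shape: (1, 256, 64, 64)
```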
Then, as shown in part (b) of FIG. 7, the CPU 11, as the generation unit 105, integrates each of the integrated feature maps converted to each size with the hierarchical feature map of the corresponding size to generate the enhanced hierarchical feature maps. In FIG. 7, part (a) represents the extraction of the hierarchical feature maps, part (c) represents the generation of the integrated feature map, and part (b) represents the generation of the enhanced hierarchical feature maps.
A specific example of the method of integrating a hierarchical feature map with the integrated feature map when generating the enhanced hierarchical feature maps will be described. Let f_ij be the feature of cell j in the hierarchical feature map of the i-th layer, and let f_ij_concat be the feature of each cell j of the integrated feature map converted to the same size as the hierarchical feature map of the i-th layer. As the integration method, for example, the two features may be concatenated so that the feature f_ij_new of cell j of the enhanced hierarchical feature map of the i-th layer is (f_ij, f_ij_concat). Alternatively, the two features may be added so that the feature f_ij_new of cell j of the enhanced hierarchical feature map is (f_ij + f_ij_concat). The average may also be taken, so that the feature f_ij_new is ((f_ij + f_ij_concat) / 2). Furthermore, the maximum value or the minimum value of the two features may be used as the feature f_ij_new of the enhanced hierarchical feature map.
 Next, in step S105, the CPU 11, as the recognition unit 106, detects object candidate regions based on the augmented hierarchical feature map generated in step S104. For example, for each layer of the augmented hierarchical feature map, an objectness score is computed for each pixel by an RPN (Region Proposal Network), as shown in part (b) of FIG. 8, and regions whose scores are high in the corresponding areas of the layers are detected as object candidate regions.
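 A hedged sketch of this per-pixel scoring is given below: a small RPN-style head that produces one objectness score map per level of the augmented hierarchy. The layer widths, the sigmoid output, and the omission of anchor-box regression are simplifications assumed for illustration.

```python
import torch
import torch.nn as nn

class ObjectnessHead(nn.Module):
    """RPN-style head: one per-pixel objectness score map per pyramid level."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.score = nn.Conv2d(channels, 1, 1)

    def forward(self, level_maps):
        # Regions whose scores are high in the corresponding areas of the
        # levels are taken as object candidate regions.
        return [torch.sigmoid(self.score(torch.relu(self.conv(f))))
                for f in level_maps]
```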
 Then, as the recognition unit 106, the CPU 11 recognizes, for each of the detected object candidate regions, the category, position, and region of the object represented by that object candidate region, based on the augmented hierarchical feature map generated in step S104.
 For example, as the recognition unit 106, the CPU 11 generates a fixed-size feature map from the portion of each layer of the augmented hierarchical feature map that corresponds to the object candidate region, as shown in part (b) of FIG. 8. Then, as shown in part (c) of FIG. 8, the CPU 11, as the recognition unit 106, inputs the fixed-size feature map into an FCN (Fully Convolutional Network) to recognize the region of the object represented by the object candidate region, and inputs the fixed-size feature map into a fully connected layer to recognize the category of that object and the position of the box enclosing it.
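 A minimal sketch of this recognition step is shown below, assuming torchvision's roi_align for extracting the fixed-size feature map; the 7x7 crop size, the hidden width of 1024, the number of classes, and the head depths are illustrative assumptions rather than values from this disclosure.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class RecognitionHeads(nn.Module):
    """Per-candidate category, box position, and region (mask) prediction."""

    def __init__(self, channels, num_classes, crop=7):
        super().__init__()
        self.crop = crop
        # FCN-style head: per-pixel classification of the candidate region.
        self.mask_head = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, num_classes, 1))
        # Fully connected head: object category and enclosing-box position.
        self.fc = nn.Sequential(
            nn.Flatten(), nn.Linear(channels * crop * crop, 1024), nn.ReLU())
        self.cls = nn.Linear(1024, num_classes)
        self.box = nn.Linear(1024, 4)

    def forward(self, feature_map, boxes):
        # boxes: list with one (K, 4) tensor of candidate boxes per image,
        # given in the coordinate system of feature_map.
        crops = roi_align(feature_map, boxes, output_size=self.crop)
        h = self.fc(crops)
        return self.cls(h), self.box(h), self.mask_head(crops)
```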
 Then, as the recognition unit 106, the CPU 11 stores, in the storage unit 101, the detection results for the object candidate regions and the recognition results for the category, position, and region of the object represented by each object candidate region.
 Next, in step S106, the CPU 11 determines whether processing has been completed for all images stored in the storage unit 101. If it has, the object detection and recognition processing ends; if not, the processing returns to step S101 to acquire the next image and repeat the procedure.
 As described above, according to the object detection and recognition device of the present embodiment, the feature maps output by all layers of the CNN are integrated to generate an integrated feature map, and an augmented hierarchical feature map is generated by reflecting the integrated feature map in the feature map of each layer. Object candidate regions are then detected using this augmented hierarchical map, and for each candidate region the category and region of the object it represents are recognized. Because the information of all convolutional layers in the CNN, namely the high-level features representing an object's semantic information (the upper layers) and the low-level features representing its fine details (the lower layers), can be used equally and efficiently, the category and region of the object represented by the image can be recognized accurately.
 The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.
 For example, in the above embodiment, the learning unit 107 is included in the object detection and recognition device 10, but this is not restrictive; the learning unit may instead be configured as a learning device separate from the object detection and recognition device 10.
 The object detection and recognition processing that the CPU executes by reading software (a program) in each of the above embodiments may also be executed by various processors other than a CPU. Examples of such processors include a PLD (Programmable Logic Device) whose circuit configuration can be changed after manufacture, such as an FPGA (Field-Programmable Gate Array), and a dedicated electric circuit, which is a processor having a circuit configuration designed exclusively for executing specific processing, such as an ASIC (Application Specific Integrated Circuit). The object detection and recognition processing may be executed by one of these various processors, or by a combination of two or more processors of the same or different types (for example, a plurality of FPGAs, or a combination of a CPU and an FPGA). More specifically, the hardware structure of these various processors is an electric circuit in which circuit elements such as semiconductor elements are combined.
 In each of the above embodiments, the object detection and recognition processing program is stored (installed) in advance in the ROM 12 or the storage 14, but this is not restrictive. The program may be provided in a form stored in a non-transitory storage medium such as a CD-ROM (Compact Disk Read Only Memory), a DVD-ROM (Digital Versatile Disk Read Only Memory), or a USB (Universal Serial Bus) memory. The program may also be downloaded from an external device via a network.
 The following supplementary notes are further disclosed with respect to the above embodiments.
 (Appendix 1)
 An object detection and recognition device comprising:
 a memory; and
 at least one processor connected to the memory,
 wherein the processor is configured to:
 input an image to be processed into a CNN (Convolutional Neural Network) and integrate all the feature maps output by the layers of the CNN to generate an integrated feature map;
 generate an augmented hierarchical feature map by reflecting the generated integrated feature map in each of the feature maps output by the layers; and
 based on the generated augmented hierarchical feature map, detect object candidate regions from the image and recognize the category and region of the object represented by each of the object candidate regions.
 (Appendix 2)
 A non-transitory recording medium storing a program executable by a computer to perform object detection and recognition processing, the object detection and recognition processing comprising:
 inputting an image to be processed into a CNN (Convolutional Neural Network) and integrating all the feature maps output by the layers of the CNN to generate an integrated feature map;
 generating an augmented hierarchical feature map by reflecting the generated integrated feature map in each of the feature maps output by the layers; and
 based on the generated augmented hierarchical feature map, detecting object candidate regions from the image and recognizing the category and region of the object represented by each of the object candidate regions.
10 Object detection and recognition device
11 CPU
12 ROM
13 RAM
14 Storage
15 Input unit
16 Display unit
17 Communication I/F
19 Bus
101 Storage unit
102 Acquisition unit
103 Extraction unit
104 Integration unit
105 Generation unit
106 Recognition unit
107 Learning unit

Claims (6)

  1.  An object detection and recognition device comprising:
     an integration unit that inputs an image to be processed into a CNN (Convolutional Neural Network) and integrates all the feature maps output by the layers of the CNN to generate an integrated feature map;
     a generation unit that generates an augmented hierarchical feature map by reflecting the integrated feature map generated by the integration unit in each of the feature maps output by the layers; and
     a recognition unit that, based on the augmented hierarchical feature map generated by the generation unit, detects object candidate regions from the image and recognizes the category and region of the object represented by each of the object candidate regions.
  2.  The object detection and recognition device according to claim 1, wherein the integration unit generates the integrated feature map by upsampling the feature maps output by the layers, in order from the deepest layer, so as to convert each of the feature maps output by the layers to the same size matching the resolution of the feature map of the shallowest layer, and by concatenating or statistically processing the feature values of the corresponding cells of the layers.
  3.  The object detection and recognition device according to claim 1 or 2, wherein the generation unit generates the augmented hierarchical feature map by converting the integrated feature map, through convolution operations with different filter sizes, into a plurality of integrated feature maps each having the same size as one of the feature maps output by the layers, and integrating each of the plurality of integrated feature maps with the feature map output by the layer of the corresponding size.
  4.  The object detection and recognition device according to claim 3, wherein the generation unit generates the augmented hierarchical feature map by concatenating or statistically processing the feature values of the corresponding cells of the size-converted integrated feature map and of the feature map output by the layer of the corresponding size.
  5.  An object detection and recognition method in which:
     an integration unit inputs an image to be processed into a CNN (Convolutional Neural Network) and integrates all the feature maps output by the layers of the CNN to generate an integrated feature map;
     a generation unit generates an augmented hierarchical feature map by reflecting the integrated feature map generated by the integration unit in each of the feature maps output by the layers; and
     a recognition unit, based on the augmented hierarchical feature map generated by the generation unit, detects object candidate regions from the image and recognizes the category and region of the object represented by each of the object candidate regions.
  6.  An object detection and recognition program for causing a computer to function as:
     an integration unit that inputs an image to be processed into a CNN (Convolutional Neural Network) and integrates all the feature maps output by the layers of the CNN to generate an integrated feature map;
     a generation unit that generates an augmented hierarchical feature map by reflecting the integrated feature map generated by the integration unit in each of the feature maps output by the layers; and
     a recognition unit that, based on the augmented hierarchical feature map generated by the generation unit, detects object candidate regions from the image and recognizes the category and region of the object represented by each of the object candidate regions.
PCT/JP2019/024906 2019-06-24 2019-06-24 Object detection/recognition device, object detection/recognition method, and object detection/recognition program WO2020261324A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/024906 WO2020261324A1 (en) 2019-06-24 2019-06-24 Object detection/recognition device, object detection/recognition method, and object detection/recognition program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/024906 WO2020261324A1 (en) 2019-06-24 2019-06-24 Object detection/recognition device, object detection/recognition method, and object detection/recognition program

Publications (1)

Publication Number Publication Date
WO2020261324A1 true WO2020261324A1 (en) 2020-12-30

Family

ID=74061570

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/024906 WO2020261324A1 (en) 2019-06-24 2019-06-24 Object detection/recognition device, object detection/recognition method, and object detection/recognition program

Country Status (1)

Country Link
WO (1) WO2020261324A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019067406A (en) * 2017-10-04 2019-04-25 株式会社ストラドビジョン Method and device for generating feature map by using fun
JP2019096006A (en) * 2017-11-21 2019-06-20 キヤノン株式会社 Information processing device, and information processing method


Similar Documents

Publication Publication Date Title
EP3483767B1 (en) Device for detecting variant malicious code on basis of neural network learning, method therefor, and computer-readable recording medium in which program for executing same method is recorded
WO2020145180A1 (en) Object detection and recognition device, method, and program
US11875553B2 (en) Method and device for detecting object in real time by means of deep learning network model
JP6855535B2 (en) Simulation data optimization methods, devices, storage media and programs
KR102245220B1 (en) Apparatus for reconstructing 3d model from 2d images based on deep-learning and method thereof
CN111615702B (en) Method, device and equipment for extracting structured data from image
CN110991560A (en) Target detection method and system in combination with context information
CN112640037A (en) Learning device, inference device, learning model generation method, and inference method
JP6989450B2 (en) Image analysis device, image analysis method and program
Rövid et al. Towards raw sensor fusion in 3D object detection
KR102291041B1 (en) Learning apparatus based on game data and method for the same
CN108573510B (en) Grid map vectorization method and device
WO2020261324A1 (en) Object detection/recognition device, object detection/recognition method, and object detection/recognition program
CN112884702A (en) Polyp identification system and method based on endoscope image
CN112634174A (en) Image representation learning method and system
US20220375200A1 (en) Detection result analysis device, detection result analysis method, and computer readable medium
Narasimhamurthy et al. Fast architecture for low level vision and image enhancement for reconfigurable platform
CN116051850A (en) Neural network target detection method, device, medium and embedded electronic equipment
CN109961083B (en) Method and image processing entity for applying a convolutional neural network to an image
JP7238510B2 (en) Information processing device, information processing method and program
CN113379637A (en) Image restoration method, system, medium, and device based on progressive learning strategy
WO2021079441A1 (en) Detection method, detection program, and detection device
TWI826201B (en) Object detection method, object detection apparatus, and non-transitory storage medium
Wang et al. Deep learning method for rain streaks removal from single image
US20210158153A1 (en) Method and system for processing fmcw radar signal using lightweight deep learning network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19934491

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19934491

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP