WO2020261324A1 - Object detection/recognition device, object detection/recognition method, and object detection/recognition program - Google Patents

Object detection/recognition device, object detection/recognition method, and object detection/recognition program

Info

Publication number
WO2020261324A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature map
layer
integrated
unit
recognition
Prior art date
Application number
PCT/JP2019/024906
Other languages
French (fr)
Japanese (ja)
Inventor
泳青 孫
島村 潤
淳 嵯峨田
Original Assignee
日本電信電話株式会社
Priority date
Filing date
Publication date
Application filed by 日本電信電話株式会社 filed Critical 日本電信電話株式会社
Priority to PCT/JP2019/024906 priority Critical patent/WO2020261324A1/en
Publication of WO2020261324A1 publication Critical patent/WO2020261324A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis

Definitions

  • The present invention is not limited to the above-described embodiment, and various modifications and applications are possible without departing from the gist of the present invention.
  • In the above embodiment, the case where the learning unit 107 is included in the object detection/recognition device 10 has been described as an example, but the present invention is not limited to this; the learning unit may be configured as a learning device separate from the object detection/recognition device 10.
  • The object detection/recognition process that is executed by the CPU reading software (a program) in the above embodiment may instead be executed by various processors other than the CPU.
  • Examples of such processors include a PLD (Programmable Logic Device), such as an FPGA (Field-Programmable Gate Array), whose circuit configuration can be changed after manufacture, and a dedicated electric circuit such as an ASIC (Application Specific Integrated Circuit), which is a processor having a circuit configuration designed exclusively for executing a specific process.
  • The object detection/recognition process may be executed by one of these various processors, or by a combination of two or more processors of the same type or different types (for example, a plurality of FPGAs, or a combination of a CPU and an FPGA).
  • More specifically, the hardware structure of these various processors is an electric circuit in which circuit elements such as semiconductor elements are combined.
  • The program may be provided in a form stored in a non-transitory storage medium such as a CD-ROM (Compact Disk Read Only Memory), a DVD-ROM (Digital Versatile Disk Read Only Memory), or a USB (Universal Serial Bus) memory. The program may also be downloaded from an external device via a network.
  • An image to be processed is input to a CNN (Convolutional Neural Network), and all the feature maps output in each layer of the CNN are integrated to generate an integrated feature map.
  • Enhanced hierarchical feature maps are generated in which the generated integrated feature map is reflected in each of the feature maps output in each layer.
  • An object detection/recognition device that detects object candidate regions from the image based on the generated enhanced hierarchical feature maps and recognizes the category and region of the object represented by each of the object candidate regions.
  • A non-transitory recording medium that stores a program executable by a computer so as to perform object detection/recognition processing.
  • In the object detection/recognition processing, an image to be processed is input to a CNN (Convolutional Neural Network), and all the feature maps output in each layer of the CNN are integrated to generate an integrated feature map.
  • Enhanced hierarchical feature maps are generated in which the generated integrated feature map is reflected in each of the feature maps output in each layer.
  • The object detection/recognition processing includes detecting object candidate regions from the image based on the generated enhanced hierarchical feature maps, and recognizing the category and region of the object represented by each of the object candidate regions.

Abstract

The present invention accurately recognizes the category and the area of an object represented by an image. An integration unit (104) inputs an image to be processed into a CNN, and integrates all hierarchical feature maps output from each layer of the CNN, thereby generating an integrated feature map; a generation unit (105) generates enhanced hierarchical feature maps by causing each hierarchical feature map to reflect the integrated feature map generated by the integration unit (104); and a recognition unit (106) detects each object candidate area in the image and recognizes the category and the area of the object represented by each object candidate area, on the basis of the enhanced hierarchical feature maps generated by the generation unit (105).

Description

Object detection/recognition device, object detection/recognition method, and object detection/recognition program
The disclosed technology relates to an object detection/recognition device, an object detection/recognition method, and an object detection/recognition program.
Semantic image segmentation and recognition is a technology that attempts to assign the pixels of a video or image to object categories. It is widely applied to autonomous driving, medical image analysis, state and pose estimation, and the like. In recent years, pixel-wise image segmentation techniques using deep learning have been actively studied. In the method called Mask RCNN (Non-Patent Document 1), which is an example of a typical processing flow, a feature map is first extracted from the input image through a CNN (Convolutional Neural Network)-based backbone network, as shown in part (a) of FIG. 8. Next, candidate regions related to objects (regions that appear to be objects) are detected in the extracted feature map (part (b) of FIG. 8). Finally, object position detection and pixel assignment are performed from the detected candidate regions (part (c) of FIG. 8). In addition, whereas the feature map extraction process of Mask RCNN uses only the output of the deep layers of the CNN, a hierarchical feature map extraction method called FPN (Feature Pyramid Network) (Non-Patent Document 2), which also uses the outputs of multiple layers including the information of the shallow layers, has been proposed, as shown in FIG. 9. Specifically, as shown in FIG. 10, hierarchical feature maps are generated from the deep layers to the shallow layers by upsampling processing.
The following observations can be made about CNN-based object segmentation and recognition methods.
First, the shallow layers of a CNN-based backbone network represent low-level image features of the input image; that is, they express details of an object such as lines, dots, and patterns.
Second, as the CNN layers become deeper, higher-level features of the image can be extracted. For example, features representing the characteristic contours of objects and the contextual relationships between objects can be extracted.
The Mask RCNN method of Non-Patent Document 1 performs the subsequent object candidate region detection and pixel-wise segmentation using only the feature map generated from the deep layers of the CNN. Since the low-level features expressing the details of an object are therefore lost, problems arise in which the detected object position is shifted and the accuracy of segmentation (pixel assignment) becomes low.
On the other hand, the FPN method of Non-Patent Document 2 propagates semantic information to the shallow layers of the CNN backbone network while upsampling from the feature maps of the deep layers. By performing object segmentation using multiple feature maps, the segmentation accuracy is improved to some extent; however, because low-level features are not actually incorporated into the high-level feature maps (upper layers), accuracy problems in object segmentation and recognition still arise.
The present invention has been made to solve the above problems, and an object of the present invention is to provide an object detection/recognition device, method, and program capable of accurately recognizing the category and region of an object represented by an image.
A first aspect of the present disclosure is an object detection/recognition device including: an integration unit that inputs an image to be processed into a CNN (Convolutional Neural Network) and integrates all the feature maps output in each layer of the CNN to generate an integrated feature map; a generation unit that generates enhanced hierarchical feature maps in which the integrated feature map generated by the integration unit is reflected in each of the feature maps output in each layer; and a recognition unit that, based on the enhanced hierarchical feature maps generated by the generation unit, detects object candidate regions from the image and recognizes the category and region of the object represented by each of the object candidate regions.
A second aspect of the present disclosure is an object detection/recognition method in which an integration unit inputs an image to be processed into a CNN and integrates all the feature maps output in each layer of the CNN to generate an integrated feature map; a generation unit generates enhanced hierarchical feature maps in which the integrated feature map generated by the integration unit is reflected in each of the feature maps output in each layer; and a recognition unit, based on the enhanced hierarchical feature maps generated by the generation unit, detects object candidate regions from the image and recognizes the category and region of the object represented by each of the object candidate regions.
A third aspect of the present disclosure is an object detection/recognition program that causes a computer to function as: an integration unit that inputs an image to be processed into a CNN and integrates all the feature maps output in each layer of the CNN to generate an integrated feature map; a generation unit that generates enhanced hierarchical feature maps in which the integrated feature map generated by the integration unit is reflected in each of the feature maps output in each layer; and a recognition unit that, based on the enhanced hierarchical feature maps generated by the generation unit, detects object candidate regions from the image and recognizes the category and region of the object represented by each of the object candidate regions.
According to the disclosed technology, the category and region of an object represented by an image can be recognized with high accuracy.
FIG. 1 is a block diagram showing the hardware configuration of the object detection/recognition device according to the present embodiment.
FIG. 2 is a block diagram showing the functional configuration of the object detection/recognition device according to the present embodiment.
FIG. 3 is a flowchart showing an example of the object detection/recognition process in the object detection/recognition device according to the present embodiment.
FIG. 4 is a diagram for explaining the process of converting each of the hierarchical feature maps to the same size.
FIG. 5 is a diagram for explaining the integration of the hierarchical feature maps converted to the same size, and the conversion into hierarchical feature maps of different sizes.
FIG. 6 is a diagram showing an example of a module used when converting the integrated feature map into different sizes.
FIG. 7 is a diagram for explaining the generation of the enhanced hierarchical feature maps.
FIG. 8 is a diagram for explaining the processing of Mask RCNN.
FIG. 9 is a diagram for explaining the processing of FPN.
FIG. 10 is a diagram for explaining the generation of hierarchical feature maps by upsampling processing.
Hereinafter, an example of an embodiment of the disclosed technology will be described with reference to the drawings. In the drawings, the same or equivalent components and parts are given the same reference numerals. The dimensional ratios in the drawings are exaggerated for convenience of explanation and may differ from the actual ratios.
<Overview of the embodiment according to the present invention>
First, an overview of the embodiment according to the present invention will be described.
In view of the problems described above, it is considered that fully using the information from the shallow layers to the deep layers of the CNN-based backbone network for feature map extraction is effective for accurate object detection and recognition.
Therefore, in the embodiment of the present invention, an image to be subjected to object detection and recognition is acquired, and an integrated feature map is generated for the acquired image by integrating all of the hierarchical feature maps of the layers output, for example, by an FPN through the CNN backbone network. Then, from the generated integrated feature map, a feature map of the corresponding size is extracted for each of the hierarchical feature maps, and each layer's feature map is enhanced by integrating it with the extracted feature map, thereby generating enhanced hierarchical feature maps that take the information of all layers into account. Object detection and recognition are then performed using the generated enhanced hierarchical feature maps.
<Configuration of the object detection/recognition device according to the present embodiment>
Next, the configuration of the object detection/recognition device according to the embodiment of the present invention will be described. FIG. 1 is a block diagram showing the hardware configuration of the object detection/recognition device 10.
As shown in FIG. 1, the object detection/recognition device 10 includes a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a storage 14, an input unit 15, a display unit 16, and a communication I/F (Interface) 17. These components are connected to one another via a bus 19 so that they can communicate with each other.
The CPU 11 is a central processing unit that executes various programs and controls each part. That is, the CPU 11 reads a program from the ROM 12 or the storage 14 and executes it using the RAM 13 as a work area. The CPU 11 controls each of the above components and performs various arithmetic processes in accordance with the program stored in the ROM 12 or the storage 14. In the present embodiment, the ROM 12 or the storage 14 stores an object detection/recognition program for executing the object detection/recognition process described later.
The ROM 12 stores various programs and various data. The RAM 13 temporarily stores a program or data as a work area. The storage 14 is composed of an HDD (Hard Disk Drive) or an SSD (Solid State Drive), and stores various programs, including an operating system, and various data.
The input unit 15 includes a pointing device such as a mouse, and a keyboard, and is used for performing various inputs.
The display unit 16 is, for example, a liquid crystal display and displays various types of information. The display unit 16 may adopt a touch panel system and also function as the input unit 15.
The communication I/F 17 is an interface for communicating with other devices, and uses standards such as Ethernet (registered trademark), FDDI, or Wi-Fi (registered trademark).
Next, the functional configuration of the object detection/recognition device 10 will be described.
FIG. 2 is a block diagram showing an example of the functional configuration of the object detection/recognition device 10.
As shown in FIG. 2, the object detection/recognition device 10 includes, as its functional configuration, a storage unit 101, an acquisition unit 102, an extraction unit 103, an integration unit 104, a generation unit 105, a recognition unit 106, and a learning unit 107. Each functional component is realized by the CPU 11 reading the object detection/recognition program stored in the ROM 12 or the storage 14, loading it into the RAM 13, and executing it.
The storage unit 101 stores images to be subjected to object detection and recognition, and the detection results and recognition results obtained by the recognition unit 106. When the storage unit 101 receives a processing instruction from the acquisition unit 102, it outputs an image to the acquisition unit 102. At the time of learning, the storage unit 101 stores images to which detection results and recognition results have been assigned in advance.
The acquisition unit 102 outputs a processing instruction to the storage unit 101, acquires an image stored in the storage unit 101, and outputs the acquired image to the extraction unit 103.
The extraction unit 103 receives the image to be processed from the acquisition unit 102, inputs the received image to a CNN (Convolutional Neural Network), and extracts the feature maps output in each layer of the CNN (hereinafter referred to as "hierarchical feature maps"). The extraction unit 103 outputs the extracted hierarchical feature maps to the integration unit 104.
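As an illustration of the extraction unit 103, the following sketch collects the feature map output by each stage of a small convolutional backbone. The three-stage network, its channel counts, and the input size are assumptions made for this example only; the embodiment itself may use a backbone such as VGG or ResNet, as noted in step S102 below.

```python
# Minimal sketch (PyTorch): a toy backbone whose per-stage outputs play the role of
# the hierarchical feature maps. The stage structure and channel counts are assumptions.
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    """Returns the feature map of every stage (shallow to deep)."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        c1 = self.stage1(x)   # shallow layer: low-level details (lines, dots, patterns)
        c2 = self.stage2(c1)  # intermediate layer
        c3 = self.stage3(c2)  # deep layer: semantic features (contours, context)
        return [c1, c2, c3]

image = torch.randn(1, 3, 256, 256)        # image to be processed
hierarchical_maps = ToyBackbone()(image)   # output corresponding to the extraction unit 103
```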
The integration unit 104 receives the hierarchical feature maps extracted by the extraction unit 103 and generates an integrated feature map in which all layers of the hierarchical feature maps are integrated.
Specifically, the integration unit 104 converts each of the hierarchical feature maps to the same size by upsampling them layer by layer, in order from the deepest layer, so as to match the resolution of the hierarchical feature map of the shallowest layer. Then, for each set of corresponding cells across the layers, the integration unit 104 obtains the feature of each cell of the integrated feature map by concatenating the features of the layers or by statistically processing them, thereby generating the integrated feature map. The integration unit 104 outputs the generated integrated feature map to the generation unit 105.
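A minimal sketch of the integration unit 104 is shown below, assuming FPN-style hierarchical feature maps that share a channel count. The use of bilinear interpolation for the upsampling is an assumption; the text only specifies upsampling to the resolution of the shallowest layer. Concatenation is used here, and the statistical alternatives are described in step S103 below.

```python
# Minimal sketch (PyTorch) of the integration unit 104, under the assumptions above.
import torch
import torch.nn.functional as F

def integrate(hierarchical_maps):
    # Upsample every layer to the resolution of the shallowest (largest) layer ...
    target = hierarchical_maps[0].shape[-2:]
    resized = [F.interpolate(m, size=target, mode="bilinear", align_corners=False)
               for m in hierarchical_maps]
    # ... then combine the layers cell by cell; concatenation is shown here.
    return torch.cat(resized, dim=1)

maps = [torch.randn(1, 256, 64, 64), torch.randn(1, 256, 32, 32), torch.randn(1, 256, 16, 16)]
integrated = integrate(maps)   # shape: (1, 768, 64, 64)
```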
The generation unit 105 receives the integrated feature map from the integration unit 104 and generates enhanced hierarchical feature maps that take the information of all layers of the CNN into account. That is, the generation unit 105 generates enhanced hierarchical feature maps in which the integrated feature map generated by the integration unit 104 is reflected in each of the hierarchical feature maps.
Specifically, the generation unit 105 converts the integrated feature map generated by the integration unit 104 into a plurality of integrated feature maps, each having the same size as one of the hierarchical feature maps, by convolution processing with different filter sizes. Then, the generation unit 105 integrates each of the integrated feature maps converted to each size with the hierarchical feature map of the corresponding size to generate the enhanced hierarchical feature maps. As the method of integrating an integrated feature map with a hierarchical feature map, the same method as the integration method used when generating the integrated feature map can be adopted. The generation unit 105 outputs the generated enhanced hierarchical feature maps to the recognition unit 106.
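The following sketch illustrates one possible form of the generation unit 105. Bringing the integrated feature map back to the size of each layer with strided 3x3 convolutions, and fusing by element-wise addition, are assumptions for this example; the patent only requires convolution processing with different filter sizes (for example, an inception-style module) and any of the integration methods described above.

```python
# Minimal sketch (PyTorch) of the generation unit 105, under the assumptions above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnhanceMaps(nn.Module):
    def __init__(self, integrated_channels, layer_channels=256, num_layers=3):
        super().__init__()
        # One branch per pyramid level: stride 1, 2, 4, ... to roughly match each resolution.
        self.branches = nn.ModuleList(
            nn.Conv2d(integrated_channels, layer_channels, kernel_size=3,
                      stride=2 ** i, padding=1)
            for i in range(num_layers))

    def forward(self, integrated, hierarchical_maps):
        enhanced = []
        for branch, fmap in zip(self.branches, hierarchical_maps):
            resized = branch(integrated)
            # Guard against off-by-one spatial sizes from the strided convolutions.
            resized = F.interpolate(resized, size=fmap.shape[-2:], mode="bilinear",
                                    align_corners=False)
            enhanced.append(fmap + resized)   # element-wise addition; concatenation also possible
        return enhanced

maps = [torch.randn(1, 256, 64, 64), torch.randn(1, 256, 32, 32), torch.randn(1, 256, 16, 16)]
integrated = torch.cat([F.interpolate(m, size=(64, 64), mode="bilinear", align_corners=False)
                        for m in maps], dim=1)            # (1, 768, 64, 64)
enhanced_maps = EnhanceMaps(768)(integrated, maps)         # one enhanced map per layer
```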
The recognition unit 106 detects object candidate regions from the image to be processed based on the enhanced hierarchical feature maps generated by the generation unit 105, and recognizes the category and region of the object represented by each of the object candidate regions.
Specifically, for the detection of object candidate regions, the recognition unit 106 uses deep-learning-based object detection (for example, process (b) of Mask RCNN shown in FIG. 8) and detects each object candidate region by performing pixel-wise object segmentation on the image to be processed. For the recognition of the object category and region, the recognition unit 106 uses a deep-learning-based recognition method (for example, process (c) of Mask RCNN shown in FIG. 8) to recognize the category, position, and region of the object represented by each object candidate region. The recognition unit 106 stores the detection results of the object candidate regions and the recognition results of the object categories, positions, and regions in the storage unit 101.
The learning unit 107 learns the parameters of the neural networks used in the extraction unit 103 and the recognition unit 106, using the detection results and recognition results obtained by the recognition unit 106 for each of the images to be processed stored in the storage unit 101, and the detection results and recognition results assigned in advance to each of those images. A general neural network training method such as error backpropagation may be used for the learning. Through the learning by the learning unit 107, the extraction unit 103 and the recognition unit 106 can each perform their processing using neural networks whose parameters have been tuned.
The processing of the learning unit 107 may be performed at an arbitrary timing, separately from the series of object detection and recognition processes performed by the acquisition unit 102, the extraction unit 103, the integration unit 104, the generation unit 105, and the recognition unit 106.
<Operation of the object detection/recognition device according to the present embodiment>
Next, the operation of the object detection/recognition device 10 will be described.
FIG. 3 is a flowchart showing the flow of the object detection/recognition process performed by the object detection/recognition device 10. The object detection/recognition process is performed by the CPU 11 reading the object detection/recognition program from the ROM 12 or the storage 14, loading it into the RAM 13, and executing it.
First, in step S101, the CPU 11, as the acquisition unit 102, outputs a processing instruction to the storage unit 101 and acquires an image to be processed from the images stored in the storage unit 101.
Next, in step S102, the CPU 11, as the extraction unit 103, inputs the image to be processed acquired in step S101 to the CNN-based backbone network and acquires the feature map output from each layer. For example, a CNN such as VGG or ResNet can be used. Then, for example, as shown in FIG. 10, the CPU 11, as the extraction unit 103, extracts the hierarchical feature maps by upsampling the feature map of an upper layer and combining it with the feature map of the layer below it.
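The sketch below illustrates an FPN-style top-down merge of the kind referred to in step S102. The 1x1 lateral convolutions, the common channel count of 256, and nearest-neighbor upsampling follow the usual FPN recipe and are assumptions here; the patent itself only refers to FPN (Non-Patent Document 2).

```python
# Minimal sketch (PyTorch) of a top-down merge producing hierarchical feature maps,
# under the FPN-style assumptions stated above. Weights are randomly initialised;
# this only illustrates the data flow.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDown(nn.Module):
    def __init__(self, in_channels=(64, 128, 256), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)

    def forward(self, backbone_maps):                 # ordered shallow -> deep
        laterals = [l(m) for l, m in zip(self.laterals, backbone_maps)]
        merged = [laterals[-1]]                       # start from the deepest layer
        for lat in reversed(laterals[:-1]):           # walk towards the shallow layers
            up = F.interpolate(merged[0], size=lat.shape[-2:], mode="nearest")
            merged.insert(0, lat + up)                # propagate semantic information downwards
        return merged                                 # hierarchical feature maps, shallow -> deep

c1 = torch.randn(1, 64, 64, 64)
c2 = torch.randn(1, 128, 32, 32)
c3 = torch.randn(1, 256, 16, 16)
hierarchical_maps = TopDown()([c1, c2, c3])           # each map now has 256 channels
```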
With such hierarchical feature maps, the semantic information of the deep layers (characteristic contours of objects, contextual information between objects, and so on) can also be propagated to the feature maps of the lower layers. At the time of object detection, this can be expected to make the contours of the detected object candidate regions smooth and to enable accurate detection without omissions.
Next, in step S103, the CPU 11, as the integration unit 104, receives the hierarchical feature maps extracted in step S102 and processes each layer so that all layers have the same size. Here, as shown in FIG. 4, the CPU 11, as the integration unit 104, converts each of the hierarchical feature maps to the same size by upsampling them layer by layer, in order from the deepest layer, so as to match the resolution of the hierarchical feature map of the shallowest layer. Then, the CPU 11, as the integration unit 104, integrates all layers of the hierarchical feature maps converted to the same size, for example as shown by arrow A in FIG. 5, to generate the integrated feature map.
A specific example of the integration method will be described. Let f_ij be the feature of cell j in the hierarchical feature map of the i-th layer after size conversion, and let N be the number of layers of the hierarchical feature maps. As the integration method, for example, the features of the layers may be concatenated so that the feature f_j_concat of cell j of the integrated feature map is (f_1j, f_2j, ..., f_ij, ..., f_Nj). Alternatively, the features of the layers may be added so that the feature f_j_concat of cell j of the integrated feature map is (f_1j + f_2j + ... + f_ij + ... + f_Nj). The average may also be taken, so that the feature f_j_concat is ((f_1j + f_2j + ... + f_ij + ... + f_Nj) / N). Furthermore, the maximum value f_ij_max, the minimum value f_ij_min, the median f_ij_median, or the like of the features f_ij of the layers may be used as the feature f_j_concat of the integrated feature map.
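As a concrete illustration of these per-cell options, the snippet below combines the feature of one cell j across N = 3 layers; the numbers are arbitrary and the features are scalars for simplicity.

```python
# Per-cell combination options for one cell j across N = 3 layers (arbitrary numbers).
import numpy as np

f_1j, f_2j, f_3j = 0.2, 0.7, 0.4            # feature of cell j in layers 1..3
layers = np.array([f_1j, f_2j, f_3j])

f_j_concat = tuple(layers)                   # concatenation: (0.2, 0.7, 0.4)
f_j_sum    = layers.sum()                    # addition: 1.3
f_j_mean   = layers.mean()                   # average: (f_1j + f_2j + f_3j) / N, about 0.433
f_j_max    = layers.max()                    # maximum: 0.7
f_j_min    = layers.min()                    # minimum: 0.2
f_j_median = np.median(layers)               # median: 0.4
```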
Because such feature map integration processing effectively and efficiently takes in the information of all layers of the CNN at once, the integrated feature map can be expected to reflect both the detailed features of an object (patterns, dots, and so on) and its semantic features (contours, positional relationships, and so on) more accurately.
Next, in step S104, the CPU 11, as the generation unit 105, receives the integrated feature map generated in step S103 and generates enhanced hierarchical feature maps that take the information of all layers of the CNN into account. For example, the CPU 11, as the generation unit 105, converts the integrated feature map generated in step S103 into a plurality of integrated feature maps, each having the same size as one of the hierarchical feature maps, by convolution processing with different filter sizes, as shown by arrow B in FIG. 5. For this processing, for example, the inception module of GoogLeNet as shown in FIG. 6 can be used.
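The following sketch shows an inception-style block with parallel branches of different filter sizes, in the spirit of the GoogLeNet inception module referred to in FIG. 6. The branch widths, the 1x1 bottlenecks, and the pooling branch are illustrative assumptions.

```python
# Minimal sketch (PyTorch) of an inception-style block, under the assumptions above.
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_channels, branch_channels=64):
        super().__init__()
        self.b1 = nn.Conv2d(in_channels, branch_channels, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv2d(in_channels, branch_channels, 1),
                                nn.Conv2d(branch_channels, branch_channels, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_channels, branch_channels, 1),
                                nn.Conv2d(branch_channels, branch_channels, 5, padding=2))
        self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(in_channels, branch_channels, 1))

    def forward(self, x):
        # All branches preserve the spatial size, so their outputs can be concatenated.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)

integrated = torch.randn(1, 768, 64, 64)
out = InceptionBlock(768)(integrated)        # shape: (1, 256, 64, 64)
```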
Then, as shown in part (b) of FIG. 7, the CPU 11, as the generation unit 105, integrates each of the integrated feature maps converted to each size with the hierarchical feature map of the corresponding size to generate the enhanced hierarchical feature maps. In FIG. 7, part (a) represents the extraction of the hierarchical feature maps, part (c) represents the generation of the integrated feature map, and part (b) represents the generation of the enhanced hierarchical feature maps.
A specific example of the method of integrating a hierarchical feature map with the integrated feature map when generating the enhanced hierarchical feature maps will be described. Let f_ij be the feature of cell j in the hierarchical feature map of the i-th layer, and let f_ij_concat be the feature of each cell j of the integrated feature map converted to the same size as the hierarchical feature map of the i-th layer. As the integration method, for example, the two features may be concatenated so that the feature f_ij_new of cell j of the enhanced hierarchical feature map of the i-th layer is (f_ij, f_ij_concat). Alternatively, the two features may be added so that the feature f_ij_new of cell j of the enhanced hierarchical feature map is (f_ij + f_ij_concat). The average may also be taken, so that the feature f_ij_new is ((f_ij + f_ij_concat) / 2). Furthermore, the maximum value or the minimum value of the two features may be used as the feature f_ij_new of the enhanced hierarchical feature map.
 Next, in step S105, the CPU 11, as the recognition unit 106, detects object candidate regions based on the augmented hierarchical feature map generated in step S104. For example, for each layer of the augmented hierarchical feature map, an objectness score is computed for each pixel by an RPN (Region Proposal Network), as shown in part (b) of FIG. 8, and regions whose scores are high in the corresponding areas of the layers are detected as object candidate regions.
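 A hedged sketch of this per-pixel scoring is given below: a small RPN-style head that produces one objectness score map per level of the augmented hierarchy. The layer widths, the sigmoid output, and the omission of anchor-box regression are simplifications assumed for illustration.

```python
import torch
import torch.nn as nn

class ObjectnessHead(nn.Module):
    """RPN-style head: one per-pixel objectness score map per pyramid level."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.score = nn.Conv2d(channels, 1, 1)

    def forward(self, level_maps):
        # Regions whose scores are high in the corresponding areas of the
        # levels are taken as object candidate regions.
        return [torch.sigmoid(self.score(torch.relu(self.conv(f))))
                for f in level_maps]
```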
 Then, as the recognition unit 106, the CPU 11 recognizes, for each of the detected object candidate regions, the category, position, and region of the object represented by that object candidate region, based on the augmented hierarchical feature map generated in step S104.
 For example, as the recognition unit 106, the CPU 11 generates a fixed-size feature map from the portion of each layer of the augmented hierarchical feature map that corresponds to the object candidate region, as shown in part (b) of FIG. 8. Then, as shown in part (c) of FIG. 8, the CPU 11, as the recognition unit 106, inputs the fixed-size feature map into an FCN (Fully Convolutional Network) to recognize the region of the object represented by the object candidate region, and inputs the fixed-size feature map into a fully connected layer to recognize the category of that object and the position of the box enclosing it.
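 A minimal sketch of this recognition step is shown below, assuming torchvision's roi_align for extracting the fixed-size feature map; the 7x7 crop size, the hidden width of 1024, the number of classes, and the head depths are illustrative assumptions rather than values from this disclosure.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class RecognitionHeads(nn.Module):
    """Per-candidate category, box position, and region (mask) prediction."""

    def __init__(self, channels, num_classes, crop=7):
        super().__init__()
        self.crop = crop
        # FCN-style head: per-pixel classification of the candidate region.
        self.mask_head = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, num_classes, 1))
        # Fully connected head: object category and enclosing-box position.
        self.fc = nn.Sequential(
            nn.Flatten(), nn.Linear(channels * crop * crop, 1024), nn.ReLU())
        self.cls = nn.Linear(1024, num_classes)
        self.box = nn.Linear(1024, 4)

    def forward(self, feature_map, boxes):
        # boxes: list with one (K, 4) tensor of candidate boxes per image,
        # given in the coordinate system of feature_map.
        crops = roi_align(feature_map, boxes, output_size=self.crop)
        h = self.fc(crops)
        return self.cls(h), self.box(h), self.mask_head(crops)
```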
 Then, as the recognition unit 106, the CPU 11 stores, in the storage unit 101, the detection results for the object candidate regions and the recognition results for the category, position, and region of the object represented by each object candidate region.
 Next, in step S106, the CPU 11 determines whether processing has been completed for all images stored in the storage unit 101. If it has, the object detection and recognition processing ends; if not, the processing returns to step S101 to acquire the next image and repeat the procedure.
 As described above, according to the object detection and recognition device of the present embodiment, the feature maps output by all layers of the CNN are integrated to generate an integrated feature map, and an augmented hierarchical feature map is generated by reflecting the integrated feature map in the feature map of each layer. Object candidate regions are then detected using this augmented hierarchical map, and for each candidate region the category and region of the object it represents are recognized. Because the information of all convolutional layers in the CNN, namely the high-level features representing an object's semantic information (the upper layers) and the low-level features representing its fine details (the lower layers), can be used equally and efficiently, the category and region of the object represented by the image can be recognized accurately.
 The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.
 For example, in the above embodiment, the learning unit 107 is included in the object detection and recognition device 10, but this is not restrictive; the learning unit may instead be configured as a learning device separate from the object detection and recognition device 10.
 The object detection and recognition processing that the CPU executes by reading software (a program) in each of the above embodiments may also be executed by various processors other than a CPU. Examples of such processors include a PLD (Programmable Logic Device) whose circuit configuration can be changed after manufacture, such as an FPGA (Field-Programmable Gate Array), and a dedicated electric circuit, which is a processor having a circuit configuration designed exclusively for executing specific processing, such as an ASIC (Application Specific Integrated Circuit). The object detection and recognition processing may be executed by one of these various processors, or by a combination of two or more processors of the same or different types (for example, a plurality of FPGAs, or a combination of a CPU and an FPGA). More specifically, the hardware structure of these various processors is an electric circuit in which circuit elements such as semiconductor elements are combined.
 In each of the above embodiments, the object detection and recognition processing program is stored (installed) in advance in the ROM 12 or the storage 14, but this is not restrictive. The program may be provided in a form stored in a non-transitory storage medium such as a CD-ROM (Compact Disk Read Only Memory), a DVD-ROM (Digital Versatile Disk Read Only Memory), or a USB (Universal Serial Bus) memory. The program may also be downloaded from an external device via a network.
 The following supplementary notes are further disclosed with respect to the above embodiments.
 (Appendix 1)
 An object detection and recognition device comprising:
 a memory; and
 at least one processor connected to the memory,
 wherein the processor is configured to:
 input an image to be processed into a CNN (Convolutional Neural Network) and integrate all the feature maps output by the layers of the CNN to generate an integrated feature map;
 generate an augmented hierarchical feature map by reflecting the generated integrated feature map in each of the feature maps output by the layers; and
 based on the generated augmented hierarchical feature map, detect object candidate regions from the image and recognize the category and region of the object represented by each of the object candidate regions.
 (Appendix 2)
 A non-transitory recording medium storing a program executable by a computer to perform object detection and recognition processing, the object detection and recognition processing comprising:
 inputting an image to be processed into a CNN (Convolutional Neural Network) and integrating all the feature maps output by the layers of the CNN to generate an integrated feature map;
 generating an augmented hierarchical feature map by reflecting the generated integrated feature map in each of the feature maps output by the layers; and
 based on the generated augmented hierarchical feature map, detecting object candidate regions from the image and recognizing the category and region of the object represented by each of the object candidate regions.
10 Object detection and recognition device
11 CPU
12 ROM
13 RAM
14 Storage
15 Input unit
16 Display unit
17 Communication I/F
19 Bus
101 Storage unit
102 Acquisition unit
103 Extraction unit
104 Integration unit
105 Generation unit
106 Recognition unit
107 Learning unit

Claims (6)

  1.  An object detection and recognition device comprising:
     an integration unit that inputs an image to be processed into a CNN (Convolutional Neural Network) and integrates all the feature maps output by the layers of the CNN to generate an integrated feature map;
     a generation unit that generates an augmented hierarchical feature map by reflecting the integrated feature map generated by the integration unit in each of the feature maps output by the layers; and
     a recognition unit that, based on the augmented hierarchical feature map generated by the generation unit, detects object candidate regions from the image and recognizes the category and region of the object represented by each of the object candidate regions.
  2.  The object detection and recognition device according to claim 1, wherein the integration unit generates the integrated feature map by upsampling the feature maps output by the layers, in order from the deepest layer, so as to convert each of the feature maps output by the layers to the same size matching the resolution of the feature map of the shallowest layer, and by concatenating or statistically processing the feature values of the corresponding cells of the layers.
  3.  The object detection and recognition device according to claim 1 or 2, wherein the generation unit generates the augmented hierarchical feature map by converting the integrated feature map, through convolution operations with different filter sizes, into a plurality of integrated feature maps each having the same size as one of the feature maps output by the layers, and integrating each of the plurality of integrated feature maps with the feature map output by the layer of the corresponding size.
  4.  The object detection and recognition device according to claim 3, wherein the generation unit generates the augmented hierarchical feature map by concatenating or statistically processing the feature values of the corresponding cells of the size-converted integrated feature map and of the feature map output by the layer of the corresponding size.
  5.  An object detection and recognition method in which:
     an integration unit inputs an image to be processed into a CNN (Convolutional Neural Network) and integrates all the feature maps output by the layers of the CNN to generate an integrated feature map;
     a generation unit generates an augmented hierarchical feature map by reflecting the integrated feature map generated by the integration unit in each of the feature maps output by the layers; and
     a recognition unit, based on the augmented hierarchical feature map generated by the generation unit, detects object candidate regions from the image and recognizes the category and region of the object represented by each of the object candidate regions.
  6.  An object detection and recognition program for causing a computer to function as:
     an integration unit that inputs an image to be processed into a CNN (Convolutional Neural Network) and integrates all the feature maps output by the layers of the CNN to generate an integrated feature map;
     a generation unit that generates an augmented hierarchical feature map by reflecting the integrated feature map generated by the integration unit in each of the feature maps output by the layers; and
     a recognition unit that, based on the augmented hierarchical feature map generated by the generation unit, detects object candidate regions from the image and recognizes the category and region of the object represented by each of the object candidate regions.
PCT/JP2019/024906 2019-06-24 2019-06-24 Object detection/recognition device, object detection/recognition method, and object detection/recognition program WO2020261324A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/024906 WO2020261324A1 (en) 2019-06-24 2019-06-24 Object detection/recognition device, object detection/recognition method, and object detection/recognition program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/024906 WO2020261324A1 (en) 2019-06-24 2019-06-24 Object detection/recognition device, object detection/recognition method, and object detection/recognition program

Publications (1)

Publication Number Publication Date
WO2020261324A1 true WO2020261324A1 (en) 2020-12-30

Family

ID=74061570

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/024906 WO2020261324A1 (en) 2019-06-24 2019-06-24 Object detection/recognition device, object detection/recognition method, and object detection/recognition program

Country Status (1)

Country Link
WO (1) WO2020261324A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019067406A (en) * 2017-10-04 2019-04-25 株式会社ストラドビジョン Method and device for generating feature map by using fun
JP2019096006A (en) * 2017-11-21 2019-06-20 キヤノン株式会社 Information processing device, and information processing method


Similar Documents

Publication Publication Date Title
EP3483767B1 (en) Device for detecting variant malicious code on basis of neural network learning, method therefor, and computer-readable recording medium in which program for executing same method is recorded
WO2020145180A1 (en) Object detection and recognition device, method, and program
US11875553B2 (en) Method and device for detecting object in real time by means of deep learning network model
JP6855535B2 (en) Simulation data optimization methods, devices, storage media and programs
KR102245220B1 (en) Apparatus for reconstructing 3d model from 2d images based on deep-learning and method thereof
CN111615702B (en) Method, device and equipment for extracting structured data from image
CN110991560A (en) Target detection method and system in combination with context information
CN112640037A (en) Learning device, inference device, learning model generation method, and inference method
JP6989450B2 (en) Image analysis device, image analysis method and program
Rövid et al. Towards raw sensor fusion in 3D object detection
KR102291041B1 (en) Learning apparatus based on game data and method for the same
CN108573510B (en) Grid map vectorization method and device
WO2020261324A1 (en) Object detection/recognition device, object detection/recognition method, and object detection/recognition program
CN112884702A (en) Polyp identification system and method based on endoscope image
CN112634174A (en) Image representation learning method and system
US20220375200A1 (en) Detection result analysis device, detection result analysis method, and computer readable medium
Narasimhamurthy et al. Fast architecture for low level vision and image enhancement for reconfigurable platform
CN116051850A (en) Neural network target detection method, device, medium and embedded electronic equipment
CN109961083B (en) Method and image processing entity for applying a convolutional neural network to an image
JP7238510B2 (en) Information processing device, information processing method and program
CN113379637A (en) Image restoration method, system, medium, and device based on progressive learning strategy
WO2021079441A1 (en) Detection method, detection program, and detection device
TWI826201B (en) Object detection method, object detection apparatus, and non-transitory storage medium
Wang et al. Deep learning method for rain streaks removal from single image
US20210158153A1 (en) Method and system for processing fmcw radar signal using lightweight deep learning network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19934491

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19934491

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP