US11875553B2 - Method and device for detecting object in real time by means of deep learning network model - Google Patents

Method and device for detecting object in real time by means of deep learning network model

Info

Publication number
US11875553B2
US11875553B2
Authority
US
United States
Prior art keywords
feature map, input image, layer, object detection, CNN
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US17/282,468
Other versions
US20210383165A1 (en)
Inventor
Woong Jae WON
Tae Hun Kim
Soon Kwon
Jin Hee Lee
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Daegu Gyeongbuk Institute of Science and Technology
Original Assignee
Daegu Gyeongbuk Institute of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Daegu Gyeongbuk Institute of Science and Technology
Assigned to DAEGU GYEONGBUK INSTITUTE OF SCIENCE AND TECHNOLOGY. Assignors: KIM, TAE HUN; KWON, SOON; LEE, JIN HEE; WON, WOONG JAE
Publication of US20210383165A1
Application granted
Publication of US11875553B2

Classifications

    • G06T 7/11: Region-based segmentation (image analysis; segmentation; edge detection)
    • G06V 10/7715: Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G06F 18/2137: Feature extraction based on criteria of topology preservation, e.g. multidimensional scaling or self-organising maps
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24133: Classification techniques; Distances to prototypes
    • G06N 3/04: Neural network architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/08: Learning methods
    • G06N 3/09: Supervised learning
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06T 5/20: Image enhancement or restoration using local operators
    • G06T 2207/20024: Filtering details
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2210/12: Bounding box

Definitions

  • referring to FIG. 4, an AggNet 420, which is modified from ResNet 410, a conventional CNN model, may be used as a module of the feature extractor 110.
  • the AggNet may produce an aggregation by combining a layer before a 3×3 convolutional layer of ResNet with the layer after the 3×3 convolutional layer.
  • in ResNet, the size of the receptive field increases with each pass through a 3×3 convolutional layer; accordingly, as the number of layers increases, information is lost, and it becomes difficult to detect a small object.
  • in the AggNet, information from before the 3×3 convolutional layer is not lost; accordingly, information associated with a small receptive field is also transferred, loss of information is prevented, and a small object may be detected.
  • FIG. 5 is a diagram illustrating an operation of processing channel aggregation according to an embodiment.
  • channel aggregation 212 is a series of operations of producing an aggregated feature map by dividing a feature map into a plurality of channels, and performing an element-wise operation based on the divided channels.
  • a new feature map 530 having dimensions of 48×156×256 may be produced by dividing a feature map 510 having dimensions of 48×156×512 into two channels, and performing matched element-wise multiplication 520.
  • a channel division method and the number of divided channels may be variously changed.
  • the divided channels may be aggregated via other operations such as element-wise addition or the like, instead of being aggregated via element-wise multiplication.
  • a feature may be enhanced by performing sampling after channel aggregation, instead of directly sampling the second feature map 111.
  • when back-propagation is performed during learning, gradients are aggregated across the channels, so learning is performed effectively.
  • FIG. 6 is a diagram illustrating the configuration of an object detection apparatus according to an embodiment.
  • an object detection apparatus 601 may include a processor 602 and a memory 603 .
  • the processor 602 may include at least one of the apparatuses described with reference to FIGS. 1 to 5, or may perform at least one of the methods described with reference to FIGS. 1 to 5.
  • the memory 603 may store at least one of the features of the input images and the feature maps, or may store a program for implementing the object detection method.
  • the memory 603 may include a volatile memory or a non-volatile memory.
  • the processor 602 may implement a program, and may control the object detection apparatus 601 .
  • the code of a program implemented by the processor 602 may be stored in the memory 603 .
  • the object detection apparatus 601 may be connected to an external device (e.g., a personal computer or a network) via an input/output device, and may exchange data.
  • FIG. 7 is a diagram illustrating an object detection result according to an embodiment.
  • an object detection model based on the AggNet 420 of the present disclosure is capable of detecting a small object better than an object detection model based on a legacy CNN model such as VGG16 or ResNet50, as shown in diagram 710.
  • the object detection model based on AggNet may be capable of detecting an occluded object better than the object detection model based on the legacy CNN model, as shown in diagram 720 .
  • FIG. 8 is a diagram illustrating a multi-object detection model according to an embodiment.
  • the present disclosure applies a structure that is different from the legacy object detection model, so as to overcome the above-described drawbacks.
  • the legacy object detection model applies a multi-scale template (anchors)/feature scheme in order to secure the performance robust against changes in the sizes/shapes of various objects.
  • in this scheme, however, feature information of an object is lost; accordingly, detection performance may deteriorate when an image is distorted or an object changes shape.
  • the present disclosure may apply a Rezoom layer that extracts high-resolution feature information with a small number of operations, thereby improving object detection performance.
  • the present disclosure suggests a simple aggregation network block that reduces a loss of information associated with a small receptive field, so as to detect an occluded object or a small object well.
  • FIG. 8 illustrates a CNN-based multi-object detection model.
  • a CNN layer is a model (AggNet50) obtained by applying an aggregation network block to legacy ResNet50, and may produce a CNN feature associated with an input image.
  • a region proposal layer (RPL) produces a hidden feature by performing 1 ⁇ 1 convolution on the last layer feature of the CNN layer, and roughly estimates a region and object class information for each cell (grid) via 1 ⁇ 1 convolution again.
  • a region refinement layer produces a Rezoom feature based on the briefly estimated region and object class information, and estimates a region and object class information with high accuracy using the Rezoom feature.
  • the Rezoom layer produces a 12 ⁇ 39 ⁇ 500 hidden feature map of the region proposal layer by performing 1 ⁇ 1 convolution on the last layer of the CNN layer (AggNet50).
  • from the produced hidden feature map, a detection output of 12×39×8 may be produced using 8 1×1 convolution filters, estimating, for each cell, the bounding box (x, y, width, height) of an object, objectness (foreground/background), and an object class probability (car, pedestrian, cyclist).
  • a 48 ⁇ 156 ⁇ 256 feature map is produced by performing channel aggregation on a higher feature layer (AggNet_Block 2 : 156 ⁇ 48 ⁇ 512) of the CNN layer, and a Rezoomed feature of 12 ⁇ 39 ⁇ (9 ⁇ 256) is produced by performing sampling on each object region estimated for each cell in the RPL.
  • the sampling may map the object region estimated for each cell to the feature map, divide the width and the height into three parts so as to obtain 9 grids, select the central location value of each grid as a sample value, and obtain a 9*256 channel value for each cell.
  • the Rezoomed feature and the hidden feature obtained in the RPL are concatenated, a final object detection feature of 12×39×(9×256+500) is extracted, and object class/region information of 12×39×8 for each cell is estimated by performing full convolution. Finally, object classes/regions may be determined by performing a non-maximum suppression process on the estimated class/region information. (A runnable sketch of this overall flow is given after this list.)
  • the present disclosure may modify a residual block of ResNet50 into an aggregation network block (AggNet block), and may apply the AggNet block to a CNN layer in order to detect a small object by reducing a loss of information associated with a small receptive field.
  • in ResNet, a receptive field increases due to the 3×3 convolution performed between layers; the aggregation network block (AggNet block) modifies this structure so that feature information of a small receptive field is transferred well.
  • the present disclosure divides an upper CNN layer feature into two regions along the channel dimension, performs element-wise multiplication between the matching channels of the two regions so as to configure a channel aggregation feature, and performs sampling thereof in order to extract the Rezoomed feature.
  • the present disclosure provides a CNN-based real-time object detection model that is capable of effectively detecting object occlusion/a small object in a driving environment.
  • the object detection model may replace a residual-block of the legacy ResNet 50 with an aggregation block, and may use a Rezoom layer, instead of using the legacy multi-scale anchors/feature scheme, thereby efficiently improving the performance of detecting occlusion or a small object.
  • the apparatus, method, and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field-programmable gate array (FPGA), a programmable logic unit, a microprocessor, or any other device capable of executing instructions and responding to them.
  • a processing device may operate an operating system (OS) and one or more software applications implemented in the OS.
  • the processing device may access, store, modify, process, and produce data in response to implementation of software.
  • the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or a single processor and a single controller. The processing device may also be implemented with another processing configuration, such as a parallel processor.
  • Software may include computer programs, codes, instructions, or a combination of one or more of them, and may configure a processing device to operate as intended, or may command a processing device to operate independently or collectively.
  • Software and/or data may be embodied permanently or temporarily in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by a processing device or to provide instructions or data to a processing device.
  • Software may be distributively stored in or implemented in a computer system connected over a network.
  • Software and data may be stored in one or more computer-readable recording media.
  • the method according to the embodiments of the present disclosure may be implemented in the form of program instructions executed through various computer means, and may be recorded in a computer readable medium.
  • the computer readable medium may include a program instruction, a data file, a data structure, and the like independently or in combination.
  • the program instructions recorded in the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in the computer software field.
  • the computer readable recording medium may include, for example, magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices configured to store and execute program instructions, such as ROM, RAM, and flash memory.
  • program instructions may include, for example, high-level language code that can be executed by a computer using an interpreter, as well as machine code produced by a compiler.
  • the aforementioned hardware device may be configured to operate as one or more software modules in order to perform operations in the embodiments of the present disclosure, and vice versa.
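As a concrete illustration of the FIG. 8 flow summarized in the bullets above, the following PyTorch sketch wires the stages together: a 1×1 convolution produces the RPL hidden feature, a rough per-cell estimate is made, the higher-resolution layer is channel-aggregated, a stand-in gathers nine 256-channel values per cell in place of true box-dependent Rezoom sampling, and a final 1×1 convolution refines the estimate. The module name, the fixed shapes, and the strided-slicing stand-in are illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn

class DetectorSkeleton(nn.Module):
    """Hypothetical skeleton of the FIG. 8 flow; shapes follow the text's
    example (12x39 cells, 500 hidden channels, 48x156x512 higher layer)."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.hidden = nn.Conv2d(512, 500, 1)           # RPL hidden feature
        self.rpl = nn.Conv2d(500, n_classes + 5, 1)    # rough per-cell estimate
        self.refine = nn.Conv2d(500 + 9 * 256, n_classes + 5, 1)

    def forward(self, last_feat, higher_feat):
        h = self.hidden(last_feat)                     # 1 x 500 x 12 x 39
        rough = self.rpl(h)                            # rough region/class map
        a, b = torch.chunk(higher_feat, 2, dim=1)      # channel aggregation
        agg = a * b                                    # 1 x 256 x 48 x 156
        # Stand-in for Rezoom sampling: a real implementation samples `agg`
        # at nine grid centers inside the box estimated for each cell.
        rezoomed = agg[:, :, ::4, ::4].repeat(1, 9, 1, 1)
        refined = self.refine(torch.cat([h, rezoomed], dim=1))
        return rough, refined

model = DetectorSkeleton()
last = torch.randn(1, 512, 12, 39)      # last CNN layer feature
higher = torch.randn(1, 512, 48, 156)   # AggNet_Block2-scale feature
rough, refined = model(last, higher)
print(rough.shape, refined.shape)       # both torch.Size([1, 8, 12, 39])
```

Non-maximum suppression would then be applied to the refined per-cell boxes, as described in the bullets above.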

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed are a real-time object detection method and an apparatus therefor. An object detection apparatus may receive an input image, may extract a first feature map from the input image, may detect an object included in the input image based on the first feature map, may extract, from the input image, a second feature map having resolution higher than the resolution of the first feature map, may extract a third feature map from the second feature map based on the region of the detected object, and may redetect the object based on the first feature map and the third feature map.

Description

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present disclosure relates to a convolutional neural network (CNN) processing technology.
2. Description of the Prior Art
A neural network-based deep learning technology is utilized in various fields. For example, a deep learning-based biometric identification/authentication application that performs facial recognition, iris recognition, voice recognition, or the like may be employed by a terminal (e.g., a smart phone) where the application is embedded. A convolutional neural network (CNN) is a multi-layer neural network that utilizes a convolutional operation, and shows high performance in the deep learning-based image and voice recognition fields.
The legacy Faster R-CNN performs region proposal network (RPN)-based object detection, but its regression modeling and classification processes are separated, so its operation speed is slow. To improve on this, object detection models using a fully convolutional network (FCN) structure have been proposed. These models adopt a generally efficient structure, such as a VGG network, as the CNN layer. However, in a structure such as the VGG network, feature information associated with a receptive field tends to decrease as the number of convolutional layers increases, so the performance of detecting a small object may deteriorate. Accordingly, recent R-FCN and DSSD models improve performance by applying a residual network that is capable of overcoming this drawback. In addition, the legacy object detection models propose various multi-scale feature schemes that improve the expressiveness and receptiveness of object feature information, so as to improve performance on objects of various shapes and sizes. However, even a multi-scale feature scheme does not perform perfectly when detecting a small object or the like during driving.
In addition, CNN structures such as ResNet, SeNet, and the like have recently been proposed that increase receptiveness and expressiveness for information associated with various types, shapes, and sizes. However, the amount of computation increases because additional blocks are added.
SUMMARY OF THE INVENTION
A CNN-based multi-object detection model according to an embodiment can reduce a loss of information associated with a small receptive field by performing a small number of operations, and can detect an occluded object and a small object well. The CNN-based multi-object detection model according to an embodiment may combine high-resolution feature information by performing a small number of operations without defining an anchor box in advance, and may detect various sizes of objects based on the combined feature information.
According to an embodiment, a real-time object detection method may include: receiving an input image and extracting a first feature map from the input image by performing high-speed convolution between an input and a kernel via a high-speed convolutional network; detecting an object included in the input image based on the first feature map; extracting, from the input image, a second feature map having resolution higher than the resolution of the first feature map; extracting a third feature map from the second feature map based on a region of the detected object; and redetecting the object based on the first feature map and the third feature map.
According to an embodiment, the operation of extracting the second feature map may include: providing the input image to the CNN; and obtaining the second feature map from an intermediate layer having a dimension larger than a dimension of an output layer of the CNN.
According to an embodiment, the operation of extracting the third feature map includes sampling elements in a bounding box corresponding to the region of the detected object.
According to an embodiment, the operation of redetecting the object may include: concatenating the first feature map and the third feature map; and applying a plurality of filters corresponding to a detection result associated with the object to the concatenated feature map.
According to an embodiment, the operation of extracting the first feature map from the input image may include: providing the input image to the CNN; and applying, to an output of the CNN, a plurality of filters corresponding to a plurality of features included in the first feature map.
According to an embodiment, the operation of detecting the object may include applying a plurality of filters corresponding to a detection result associated with the object to the first feature map.
According to an embodiment, the operation of redetecting the object may include: detecting candidate regions corresponding to the object in the input image based on the first feature map and the third feature map; and determining a final detection region based on the candidate regions.
According to an embodiment, a convolutional neural network is provided, wherein the neural network that extracts the first feature map may include: a first layer; a second layer connected to the first layer and configured to perform convolution that is based on a receptive field exceeding a predetermined size, based on an output of the first layer; and a third layer connected to the first layer and the second layer, wherein the third layer collects an aggregation of the output of the first layer and an output of the second layer.
According to an embodiment, the operation of extracting the third feature map may include: dividing a plurality of channels included in the second feature map; producing an aggregated second feature map by performing an element-wise operation based on the divided channels; and extracting the third feature map from the aggregated second feature map based on the region of the detected object.
According to an embodiment, an object detection apparatus may include a processor configured to: receive an input image; extract a first feature map from the input image; detect an object included in the input image based on the first feature map; extract, from the input image, a second feature map having resolution higher than the resolution of the first feature map; extract a third feature map from the second feature map based on a region of the detected object; and redetect the object based on the first feature map and the third feature map.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other aspects, features, and advantages of the present disclosure will be more apparent from the following detailed description taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a diagram illustrating an object detection apparatus according to an embodiment;
FIG. 2 is a diagram illustrating an object detection apparatus according to an embodiment;
FIG. 3 is a flowchart illustrating an object detection method according to an embodiment;
FIG. 4 is a diagram illustrating a processing operation in an AggNet block according to an embodiment;
FIG. 5 is a diagram illustrating a processing operation in a channel aggregation according to an embodiment;
FIG. 6 is a diagram illustrating the configuration of an object detection apparatus according to an embodiment;
FIG. 7 is a diagram illustrating an object detection result according to an embodiment; and
FIG. 8 is a diagram illustrating a multi-object detection model according to an embodiment.
DETAILED DESCRIPTION OF THE EXEMPLARY EMBODIMENTS
Specific structural and functional details disclosed herein are merely representative for purposes of describing example embodiments. Example embodiments may, however, be embodied in many alternate forms. Accordingly, it should be understood that there is no intent to limit example embodiments to the particular forms disclosed; on the contrary, example embodiments are to cover all modifications, equivalents, and alternatives falling within their scope.
It will be understood that, although the terms first, second, and the like may be used herein to describe various elements, these terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element.
It will be understood that when an element is referred to as being “connected” to another element, it may be directly connected or coupled to the other element or intervening elements may be present.
As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises”, “includes”, or the like when used in this specification, specify the presence of stated features, integers, steps, operations, elements, components, or a combination thereof, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, or a combination thereof.
Unless defined differently, all terms used herein, which include technical terminologies or scientific terminologies, have the same meaning as commonly understood by one of ordinary skill in the art to which the present disclosure belongs. Such terms as those defined in commonly used dictionaries are to be interpreted to have the meanings equal to the contextual meanings in the relevant field of art, and are not to be interpreted to have ideal or excessively formal meanings unless clearly defined in the present disclosure.
Hereinafter, detailed description of embodiments will be provided with reference to drawings. Reference will now be made to example embodiments, which are illustrated in the accompanying drawings, wherein like reference numerals may refer to like components throughout.
FIG. 1 is a diagram illustrating an object detection apparatus according to an embodiment.
Referring to FIG. 1 , an object detection apparatus according to an embodiment may include a feature extractor 110, a first detector 120, a sampler 130, and a second detector 140. The object detection apparatus is an apparatus for detecting an object, and may be implemented as, for example, one or more software modules, one or more hardware modules, or various combinations thereof.
The feature extractor 110 may extract a first feature map 115 from an input image. The input image is an image including an object that is desired to be detected, for example, an image including an object(s) in a predetermined class such as a vehicle, a pedestrian, a bicycle, or the like. The input image may be obtained by a camera. For example, the input image may be obtained by a camera disposed to obtain an image while a vehicle is driving. The first feature map 115 may include a plurality of features extracted from the input image. Each of the plurality of features may have two dimensions (e.g., 12×39), and the first feature map 115 may have three dimensions (e.g., 12×39×500).
The first detector 120 may detect information associated with an object from the first feature map. For example, the first detector may output information associated with a bounding box corresponding to a detection result. The bounding box is in a polygonal shape that encloses an object desired to be detected, and the detection result may include location information and size information of the bounding box. For example, if the bounding box is in a rectangular shape, the location information of the bounding box may include the coordinates of one corner of the rectangular shape and the size information of the bounding box may include the width and the height of the rectangular shape.
The feature extractor 110 may extract a second feature map 111 from the input image. The second feature map may include a plurality of features extracted from the input image. Each of the plurality of features may have two dimensions (e.g., 48×156) greater than those of the first feature map, and the second feature map 111 may have three dimensions (e.g., 48×156×512).
The feature extractor may include a plurality of layers. For example, the feature extractor may include an input layer, one or more hidden layers, and an output layer. The feature extractor may include a hidden layer having a dimension greater than that of the output layer. In this instance, the feature extractor may output the second feature map using the hidden layer having a dimension greater than that of the output layer. Therefore, the second feature map may include information of resolution higher than the resolution of the first feature map. The hidden layer may include a convolutional layer.
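As an illustration of tapping such a hidden layer, the following PyTorch sketch captures an intermediate activation with a forward hook. The toy backbone, its strides, and the 192×624 input size are assumptions chosen only so that the two maps land on the 48×156 and 12×39 scales used in the examples that follow.

```python
import torch
import torch.nn as nn

# Toy stand-in for the feature extractor's CNN: `mid` plays the role of
# the hidden layer whose dimension exceeds that of the output layer.
class TinyBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.mid = nn.Conv2d(3, 512, 3, stride=4, padding=1)
        self.out = nn.Conv2d(512, 512, 3, stride=4, padding=1)

    def forward(self, x):
        return self.out(torch.relu(self.mid(x)))

net = TinyBackbone()
captured = {}
net.mid.register_forward_hook(lambda m, inp, out: captured.update(second=out))

image = torch.randn(1, 3, 192, 624)
low_res = net(image)                # basis for the first feature map (12x39)
second_map = captured["second"]     # higher-resolution second map (48x156)
print(low_res.shape, second_map.shape)
```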
The sampler 130 may extract a third feature map 135 from the second feature map. The sampler may perform sampling of the second feature map based on information associated with the object (e.g., the bounding box) detected by the first detector. Although described below, according to an embodiment, the second feature map may be sampled after going through channel aggregation.
The operation of the sampler and the detailed description related to the second feature map will be described with reference to FIGS. 2 and 5 .
The second detector may detect information associated with an object from the first feature map and the third feature map. According to an embodiment, the first feature map and the third feature map may be concatenated before the detection process.
The second detector may further use the second feature map having resolution higher than the resolution of the first feature map, in addition to the first feature map used by the first detector. Accordingly, the object detected by the second detector may have a higher reliability than the object detected by the first detector.
The detected object may include a plurality of pieces of information. For example, information associated with which of the object(s) in predetermined classes corresponds to the detected object, information associated with whether the object corresponds to a background or foreground, location information, size information, or various combinations thereof may be included.
A CNN-based multi-object detection model according to an embodiment may reduce a loss of information associated with a small receptive field by performing a small number of operations, and may effectively detect an occluded object and a small object. The CNN-based multi-object detection model according to an embodiment may detect objects in various sizes by combining feature information having high resolution with a small number of operations, without defining various blocks or filters, such as an anchor box or the like.
FIG. 2 is a diagram illustrating an object detection apparatus according to an embodiment.
Referring to FIG. 2, the feature extractor 110 may include a CNN 210 and a plurality of filters 220. The CNN may include convolutional layers designed to perform a convolutional operation. The convolutional layers included in the CNN may perform a convolutional operation associated with an input using a kernel. According to an embodiment, the CNN included in the feature extractor may be AggNet50, which improves on the representative legacy model ResNet50. AggNet50 is described in detail with reference to FIG. 4.
The plurality of filters included in the feature extractor may correspond to 500 1×1 convolutional filters. Each of the plurality of features included in the first feature map 115 may have two dimensions. For example, if the two dimensions correspond to 12×39, the first feature map to which 500 1×1 convolutional filters are applied may have dimensions of 12×39×500.
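A minimal sketch of this step follows, assuming the backbone emits a 512-channel map at the 12×39 scale (the input channel count is an assumption; the text fixes only the 500 output filters):

```python
import torch
import torch.nn as nn

backbone_out = torch.randn(1, 512, 12, 39)    # hypothetical CNN output

# 500 1x1 convolutional filters produce the first feature map (12x39x500).
filters_1x1 = nn.Conv2d(512, 500, kernel_size=1)
first_map = filters_1x1(backbone_out)
print(first_map.shape)                        # torch.Size([1, 500, 12, 39])
```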
The first detector 120 may include a plurality of filters 230. The number of filters may differ depending on the information associated with an object that is desired to be obtained. For example, if the information desired to be obtained is {car, pedestrian, cyclist, foreground/background, x, y, width, height} 280, the first detector may include 8 1×1 convolutional filters. In this instance, an output 235 having dimensions of 12×39×8 may be produced. If the number of classes for an object is n, an object result having dimensions of 12×39×(n+5) may be output via (n+5) 1×1 convolutional filters.
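The (n+5)-filter head can be sketched as follows; the per-cell channel ordering (n class scores, one objectness score, then x, y, width, height) is an assumption consistent with the set listed above:

```python
import torch
import torch.nn as nn

n = 3                                         # car, pedestrian, cyclist
head = nn.Conv2d(500, n + 5, kernel_size=1)   # (n+5) 1x1 filters

first_map = torch.randn(1, 500, 12, 39)
out = head(first_map)                         # 1 x 8 x 12 x 39
cls_scores = out[:, :n]                       # class scores per cell
objectness = out[:, n:n + 1]                  # foreground/background score
boxes = out[:, n + 1:]                        # x, y, width, height per cell
```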
The second feature map 111 may include information of resolution higher than the resolution of the first feature map. For example, each of the plurality of features included in the second feature map may have dimensions 211 of 48×156.
According to an embodiment, the second feature map may go through channel aggregation 212 before being sampled. For example, a new feature map having dimensions of 48×156×256 may be produced by dividing the second feature map having dimensions of 48×156×512 into two channels, and performing matched element-wise multiplication. The method of performing channel division and element-wise multiplication for channel aggregation may be variously modified, and the detailed descriptions related to the channel aggregation will be described with reference to FIG. 5 .
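A minimal sketch of the channel aggregation described here; splitting the 512 channels into two matched 256-channel halves and multiplying element-wise follows the example in the text, with element-wise addition as the mentioned alternative:

```python
import torch

second_map = torch.randn(1, 512, 48, 156)
half_a, half_b = torch.chunk(second_map, 2, dim=1)   # two 256-channel halves
aggregated = half_a * half_b                         # 1 x 256 x 48 x 156
# Alternative mentioned in the text: aggregated = half_a + half_b
```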
The sampler 130 and 240 may use the information associated with the object detected by the first detector as a bounding box, and may perform sampling of the second feature map so as to extract a third feature map. For example, the sampler may map an object region, estimated for each cell included in the detection result that the first detector obtains using the first feature map, to the second feature map which is a higher layer. The sampler may divide the mapped region into a plurality of grids (e.g., 9 grids by dividing the width and the height into three), and may select the value of a predetermined location (e.g., a central location) for each grid as a sample value.
For example, if the dimensions of the first feature map correspond to 12×39×500 and the dimensions of the second feature map correspond to 48×156×256, a channel value of 256*9 may be obtained for each cell, and a third feature map 245 having dimensions of 12×39×(256×9) may be extracted.
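The grid sampling can be sketched as below for a single detected box; the helper name and the coordinate convention are assumptions, and a full implementation would run this for the box estimated at each of the 12×39 cells:

```python
import torch

def rezoom_sample(feat, box):
    """Sample a 3x3 grid of center values from `feat` (C x H x W) inside
    `box` = (x1, y1, x2, y2), given in the feature map's own coordinates."""
    c, h, w = feat.shape
    x1, y1, x2, y2 = box
    xs = [x1 + (x2 - x1) * (i + 0.5) / 3.0 for i in range(3)]  # grid centers
    ys = [y1 + (y2 - y1) * (j + 0.5) / 3.0 for j in range(3)]
    samples = []
    for y in ys:
        for x in xs:
            r = min(max(int(y), 0), h - 1)
            col = min(max(int(x), 0), w - 1)
            samples.append(feat[:, r, col])    # one C-vector per grid cell
    return torch.cat(samples)                  # 9 * C values

feat = torch.randn(256, 48, 156)   # channel-aggregated second feature map
vec = rezoom_sample(feat, (40.0, 10.0, 70.0, 34.0))
print(vec.shape)                   # torch.Size([2304]) = 9 * 256
```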
The first feature map and the third feature map may be concatenated 250 before the detection process. For example, if the first feature map has the dimensions of 12×39×500, and the third feature map has the dimensions of 12×39×(256×9), a feature map 255 having dimensions of 12×39×(500+256×9) may be produced via concatenation.
The second detector 140 may include a plurality of filters 260. The number of filters may differ depending on the information to be obtained about an object. For example, if the desired information is {car, pedestrian, cyclist, foreground/background, x, y, width, height} 280, the second detector may include 8 1×1 convolutional filters, producing an output having dimensions of 12×39×8. As with the first detector, if the number of object classes is n, an object detection result having dimensions of 12×39×(n+5) may be output via (n+5) 1×1 convolutional filters.
According to an embodiment, a post-processing operation 150 may be performed based on the output of the second detector. If a plurality of candidate regions are output by the second detector, a post-processing operation may determine a final detection region from the plurality of candidate regions. For example, each of the 12×39 cells may include information associated with a bounding box, and the bounding boxes indicated by different cells may overlap each other. A post-processor may determine the final detection region based on the probability that each bounding box includes the object.
For example, the post-processor may determine the final detection region using a non-maximum suppression scheme 270. According to an embodiment, the post-processor may suppress any bounding box that overlaps the highest-probability bounding box by at least a predetermined ratio, so as to determine the final detection region. The final detection region may include one or more bounding boxes.
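For illustration only, the following sketch applies torchvision's ready-made nms operator rather than any implementation specific to the embodiment; the boxes, scores, and the 0.5 IoU threshold are all illustrative:

import torch
from torchvision.ops import nms

# Candidate boxes in (x1, y1, x2, y2) form with their objectness scores.
boxes = torch.tensor([[10., 10., 50., 50.],
                      [12., 12., 52., 52.],
                      [100., 100., 140., 140.]])
scores = torch.tensor([0.9, 0.8, 0.7])

# Boxes overlapping a higher-scoring box by at least the IoU threshold
# are suppressed; the second box heavily overlaps the first and is dropped.
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2])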
The specifications in FIG. 2 , such as the dimensions of features and filters, are merely examples, and the specifications for implementing the embodiments may be variously modified.
FIG. 3 is a flowchart illustrating an object detection method according to an embodiment.
Referring to FIG. 3 , an input image is received in image input operation 310, and feature extraction is performed, which may be carried out by the feature extractor 110 of FIG. 1 . The feature extractor may extract a first feature map and a second feature map from the input image.
In operation 325, the first feature map may be provided to the first detector, which may correspond to the first detector 120 of FIG. 1 . The first detector may detect information associated with an object from the first feature map.
In sampling operation 330, the second feature map may be provided to the sampler, which may correspond to the sampler 130 of FIG. 1 . The sampler may use the information associated with the object detected by the first detector as a bounding box, and may sample the second feature map. The sampler may extract a third feature map from the second feature map.
In operation 340, the third feature map may be provided to the second detector, which may correspond to the second detector 140 of FIG. 1 . The second detector may detect information associated with the object from the third feature map, and may output a detection result in operation 350.
FIG. 4 is a diagram illustrating a processing operation in an AggNet block according to an embodiment.
According to an embodiment, there is provided a technique that uses layer-wise aggregation to preserve information that would otherwise be lost as the receptive field grows. The layer-wise aggregation may aggregate the layers before and after a convolution whose receptive field exceeds a predetermined size (e.g., 1×1).
For example, referring to FIG. 4 , an AggNet 420 that is modified from a normal CNN model, ResNet 410, may be used as a module of the feature extractor 110. The AggNet may produce an aggregation by aggregating the layer before a 3×3 convolutional layer of ResNet and the layer after that 3×3 convolutional layer.
In the case of ResNet, the receptive field grows with each pass through a 3×3 convolutional layer. Accordingly, as the number of layers increases, information is lost, making it difficult to detect objects of small size. In the case of AggNet, however, the information from before the 3×3 convolutional layer is not lost: information associated with a small receptive field is also transferred, the loss of information is prevented, and small objects may be detected.
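For illustration only, one possible reading of such a block is sketched below (Python/PyTorch). The bottleneck channel counts and the use of element-wise addition as the aggregation operation are assumptions; the disclosure specifies only that the layer before the 3×3 convolution and the layer after it are aggregated:

import torch
import torch.nn as nn

class AggBlock(nn.Module):
    # A ResNet-style bottleneck in which the feature entering the 3x3
    # convolution is aggregated with the feature leaving it, so that
    # small-receptive-field information is carried forward.
    def __init__(self, channels: int = 256):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels // 4, kernel_size=1)
        self.conv3x3 = nn.Conv2d(channels // 4, channels // 4, kernel_size=3, padding=1)
        self.expand = nn.Conv2d(channels // 4, channels, kernel_size=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pre = self.relu(self.reduce(x))        # layer before the 3x3 convolution
        post = self.relu(self.conv3x3(pre))    # receptive field grows here
        agg = pre + post                       # aggregate pre- and post-3x3 features
        return self.relu(self.expand(agg) + x)  # usual residual connection

feat = torch.randn(1, 256, 48, 156)
print(AggBlock()(feat).shape)  # torch.Size([1, 256, 48, 156])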
FIG. 5 is a diagram illustrating an operation of processing channel aggregation according to an embodiment.
Referring to FIG. 5 , channel aggregation 212 is a series of operations that produce an aggregated feature map by dividing a feature map into a plurality of channels and performing an element-wise operation on the divided channels. For example, a new feature map 530 having dimensions of 48×156×256 may be produced by dividing a feature map 510 having dimensions of 48×156×512 into two channels and performing matched element-wise multiplication 520. According to an embodiment, the channel-division method and the number of divided channels may be variously changed. In addition, the divided channels may be aggregated via other operations, such as element-wise addition, instead of element-wise multiplication.
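For illustration only, a minimal sketch of this operation (Python/PyTorch); splitting the 512 channels into two contiguous 256-channel halves is one possible division choice, since, as noted above, the division method may vary:

import torch

# 48x156x512 in the description's height x width x channel notation.
feature = torch.randn(1, 512, 48, 156)
first_half, second_half = torch.chunk(feature, chunks=2, dim=1)  # two 256-channel halves
aggregated = first_half * second_half  # matched element-wise multiplication
print(aggregated.shape)  # torch.Size([1, 256, 48, 156]) -> 48x156x256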
A feature may be enhanced by performing channel aggregation before sampling, instead of directly sampling the second feature map 111. When back-propagation is performed during training, the gradients across the aggregated channels are combined, and learning is performed effectively.
FIG. 6 is a diagram illustrating the configuration of an object detection apparatus according to an embodiment.
Referring to FIG. 6 , an object detection apparatus 601 may include a processor 602 and a memory 603. The processor 602 may include at least one of the apparatuses described with reference to FIGS. 1 to 5 , or may perform at least one of the methods described with reference to FIGS. 1 to 5 . The memory 603 may store at least one of the features of input images and the features of feature maps, or may store a program implementing an object detection method. The memory 603 may include a volatile memory or a non-volatile memory.
The processor 602 may execute a program and may control the object detection apparatus 601. The code of the program executed by the processor 602 may be stored in the memory 603. The object detection apparatus 601 may be connected to an external device (e.g., a personal computer or a network) via an input/output device, and may exchange data.
FIG. 7 is a diagram illustrating an object detection result according to an embodiment.
Referring to FIG. 7 , an object detection model based on the AggNet 420 of the present disclosure is capable of detecting small objects better than object detection models based on legacy CNN models such as VGG16 or ResNet50, as shown in diagram 710. In addition, the AggNet-based object detection model may detect occluded objects better than the legacy CNN-based models, as shown in diagram 720.
FIG. 8 is a diagram illustrating a multi-object detection model according to an embodiment.
Conventional multi-object detection technologies fail to overcome drawbacks associated with the power consumed by a large number of operations, real-time system performance, image distortion, occlusion of objects, changes in the size/shape of objects, and the like.
The present disclosure applies a structure different from the legacy object detection model so as to overcome the above-described drawbacks.
First, the legacy object detection model applies a multi-scale template (anchor)/feature scheme in order to remain robust against changes in the sizes and shapes of various objects. However, this requires a large number of operations, and feature information of an object is lost; as a result, an image may be distorted or detection performance may deteriorate when shapes change. The present disclosure therefore applies a Rezoom layer that extracts high-resolution feature information with a small number of operations and improves object detection performance.
Second, in the legacy CNN-based object detection model, feature information of a small object is lost as the layers become deeper, making small objects difficult to detect. To solve this drawback, the present disclosure proposes a simple aggregation network block that reduces the loss of information associated with small receptive fields, so that occluded and small objects are detected well.
CNN-Based Multi-Object Detection Model
FIG. 8 illustrates a CNN-based multi-object detection model. The CNN layer is a model (AggNet50) obtained by applying an aggregation network block to legacy ResNet50, and may produce a CNN feature associated with an input image. A region proposal layer (RPL) produces a hidden feature by performing 1×1 convolution on the last-layer feature of the CNN layer, and roughly estimates region and object class information for each cell (grid) via a further 1×1 convolution. A region refinement layer (RRL) produces a Rezoom feature based on the roughly estimated region and object class information, and estimates the region and object class information with high accuracy using the Rezoom feature.
Rezoom Layer for Object Detection
As illustrated in FIG. 8 , the Rezoom layer produces the 12×39×500 hidden feature map of the region proposal layer by performing 1×1 convolution on the last layer of the CNN layer (AggNet50). From this hidden feature map, 8 1×1 convolution filters produce a 12×39×8 detection output that estimates, for each cell, the bounding box (x, y, width, height) of an object, objectness (foreground/background), and an object class probability (car, pedestrian, cyclist). To obtain a high-resolution feature for object detection, a 48×156×256 feature map is produced by performing channel aggregation on a higher feature layer of the CNN layer (AggNet_Block2: 48×156×512), and a Rezoomed feature of 12×39×(9×256) is produced by sampling each object region estimated for each cell in the RPL. Here, the sampling maps the object region estimated for each cell onto the feature map, divides the width and the height into three parts each so as to obtain 9 grid cells, selects the central location value of each grid cell as a sample value, and thereby obtains a 9×256 channel value for each cell.
In the RRL, the Rezoomed feature and the hidden feature obtained in the RPL are concatenated, a final object detection feature of 12×39×(9×256+500) is extracted, and object class/region information of 12×39×8 is estimated for each cell via 1×1 convolution. Finally, the object classes/regions may be determined by performing a non-maximum suppression process on the estimated class/region information.
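For illustration only, this refinement step may be sketched as follows (Python/PyTorch; the input tensors stand in for the hidden and Rezoomed features, and the 8-filter head matches the 3-class example above):

import torch
import torch.nn as nn

hidden = torch.randn(1, 500, 12, 39)        # RPL hidden feature (stand-in)
rezoomed = torch.randn(1, 9 * 256, 12, 39)  # Rezoomed feature, 2304 channels (stand-in)

# Concatenate along the channel dimension: 500 + 9*256 = 2804 channels per cell.
combined = torch.cat([hidden, rezoomed], dim=1)

# (n + 5) = 8 1x1 filters regress the refined per-cell class/region output.
refine_head = nn.Conv2d(500 + 9 * 256, 8, kernel_size=1)
refined = refine_head(combined)
print(refined.shape)  # torch.Size([1, 8, 12, 39]) -> 12x39x8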
Aggregate Network Layer
The present disclosure may modify a residual block of ResNet50 into an aggregation network block (AggNet block), and may apply the AggNet block to the CNN layer in order to detect small objects by reducing the loss of information associated with small receptive fields. In the legacy ResNet model, the receptive field increases due to the 3×3 convolution performed between layers. In the present disclosure, however, by aggregating the result of the 3×3 convolution with the previous layer, the structure is modified to transfer feature information of a small receptive field well.
In addition, as opposed to the legacy method that merely samples an upper-layer feature (AggNet_Block2) of the CNN, the present disclosure divides the upper CNN layer feature into two regions along the channel dimension, performs element-wise multiplication between the matching channels of the two regions so as to configure a channel aggregation feature, and samples that feature in order to extract the Rezoomed feature.
The present disclosure provides a CNN-based real-time object detection model that is capable of effectively detecting occluded and small objects in a driving environment. The object detection model may replace a residual block of the legacy ResNet50 with an aggregation block, and may use a Rezoom layer instead of the legacy multi-scale anchors/feature scheme, thereby efficiently improving the performance of detecting occluded or small objects.
The above-described embodiments may be implemented by hardware components, software components, and/or a combination thereof. For example, the apparatuses, methods, and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field-programmable gate array (FPGA), a programmable logic unit, a microprocessor, or any other device capable of executing instructions and responding thereto. A processing device may run an operating system (OS) and one or more software applications executed on the OS. In addition, the processing device may access, store, modify, process, and produce data in response to the execution of software. Although a single processing device is described for ease of explanation, those skilled in the art will understand that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or a single processor and a single controller. Other processing configurations, such as parallel processors, are also possible.
Software may include computer programs, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as intended or may command a processing device independently or collectively. Software and/or data may be embodied permanently or temporarily in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, so as to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over computer systems connected via a network, and may be stored or executed in a distributed manner. Software and data may be stored in one or more computer-readable recording media.
The method according to the embodiments of the present disclosure may be implemented in the form of program instructions executable through various computer means and recorded in a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, independently or in combination. The program instructions recorded in the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in the computer software field. The computer-readable recording medium may include, for example, magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM and DVD; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. The program instructions may include, for example, high-level language code executable by a computer using an interpreter, as well as machine code produced by a compiler. The aforementioned hardware device may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.
Although the embodiments have been described with reference to a limited number of drawings, those skilled in the art can make various changes and modifications based thereon. For example, an appropriate result can be achieved even if the described techniques are performed in an order different from the described method, and/or the described systems, structures, apparatuses, and components such as circuits are coupled or combined in a manner different from the described method, or are replaced or substituted by other components or equivalents.
Therefore, other implementations, other embodiments, and equivalents of the claims fall within the scope of the claims.

Claims (18)

What is claimed is:
1. An object detection method comprising:
receiving an input image;
extracting a first feature map from the input image;
detecting an object included in the input image based on the first feature map;
extracting, from the input image, a second feature map having a resolution higher than a resolution of the first feature map;
extracting a third feature map from the second feature map based on a region of the detected object; and
redetecting the object based on the first feature map and the third feature map.
2. The object detection method of claim 1, wherein the extracting of the second feature map comprises:
providing the input image to a convolutional neural network (CNN); and
obtaining the second feature map from an intermediate layer having a dimension larger than a dimension of an output layer of the CNN.
3. The object detection method of claim 1, wherein the extracting of the third feature map comprises:
sampling elements in a bounding box corresponding to the region of the detected object.
4. The object detection method of claim 1, wherein the redetecting of the object comprises:
concatenating the first feature map and the third feature map; and
applying a plurality of filters corresponding to a detection result associated with the object to the concatenated feature map.
5. The object detection method of claim 1, wherein the extracting of the first feature map from the input image comprises:
providing the input image to a convolutional neural network (CNN); and
applying, to an output of the CNN, a plurality of filters corresponding to a plurality of features included in the first feature map.
6. The object detection method of claim 1, wherein the detecting of the object comprises:
applying a plurality of filters corresponding to a detection result associated with the object to the first feature map.
7. The object detection method of claim 1, wherein the redetecting of the object comprises:
detecting candidate regions corresponding to the object in the input image based on the first feature map and the third feature map; and
determining a final detection region based on the candidate regions.
8. The object detection method of claim 1, wherein a neural network that extracts the first feature map comprises:
a first layer;
a second layer connected to the first layer, and configured to perform convolution that is based on a receptive field that exceeds a predetermined size, based on an output of the first layer; and
a third layer connected to the first layer and the second layer,
wherein the third layer collects an aggregation of the output of the first layer and an output of the second layer.
9. The object detection method of claim 1, wherein the extracting of the third feature map comprises:
dividing a plurality of channels included in the second feature map;
producing an aggregated second feature map by performing an element-wise operation based on the divided channels; and
extracting the third feature map from the aggregated second feature map based on the region of the detected object.
10. A non-transitory computer readable recording medium storing instructions that, when executed by a processor, cause the processor to perform the method of claim 1.
11. An object detection apparatus comprising:
a memory configured to store one or more instructions; and
a processor configured to, by executing the one or more instructions:
receive an input image;
extract a first feature map from the input image;
detect an object included in the input image based on the first feature map;
extract, from the input image, a second feature map having a resolution higher than a resolution of the first feature map;
extract a third feature map from the second feature map based on a region of the detected object; and
redetect the object based on the first feature map and the third feature map.
12. The object detection apparatus of claim 11, wherein the processor is further configured to:
provide the input image to a convolutional neural network (CNN); and
obtain the second feature map from an intermediate layer having a dimension larger than a dimension of an output layer of the CNN.
13. The object detection apparatus of claim 11, wherein the processor is further configured to:
perform sampling of elements in a bounding box corresponding to the region of the detected object.
14. The object detection apparatus of claim 11, wherein the processor is further configured to:
concatenate the first feature map and the third feature map; and
apply a plurality of filters corresponding to a detection result associated with the object to the concatenated feature map.
15. The object detection apparatus of claim 11, wherein the processor is further configured to:
provide the input image to a convolutional neural network (CNN); and
apply a plurality of filters corresponding to a plurality of features included in the first feature map to an output of the CNN.
16. The object detection apparatus of claim 11, wherein the processor is further configured to apply a plurality of filters corresponding to a detection result associated with the object to the first feature map.
17. The object detection apparatus of claim 11, wherein the processor is further configured to:
detect candidate regions corresponding to the object in the input image based on the first feature map and the third feature map; and
determine a final detection region based on the candidate regions.
18. The object detection apparatus of claim 11, wherein the processor is further configured to:
divide a plurality of channels included in the second feature map;
produce an aggregated second feature map by performing an element-wise operation based on the divided channels; and
extract the third feature map from the aggregated second feature map based on the region of the detected object.
US17/282,468 2018-10-05 2019-09-30 Method and device for detecting object in real time by means of deep learning network model Active 2040-11-27 US11875553B2 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
KR1020180118856A KR102108854B1 (en) 2018-10-05 2018-10-05 Real-time object detection method and apparatus by deep learning network model
KR10-2018-0118856 2018-10-05
PCT/KR2019/012699 WO2020071701A1 (en) 2018-10-05 2019-09-30 Method and device for detecting object in real time by means of deep learning network model

Publications (2)

Publication Number Publication Date
US20210383165A1 US20210383165A1 (en) 2021-12-09
US11875553B2 true US11875553B2 (en) 2024-01-16

Family

ID=70055937

Family Applications (1)

Application Number Title Priority Date Filing Date
US17/282,468 Active 2040-11-27 US11875553B2 (en) 2018-10-05 2019-09-30 Method and device for detecting object in real time by means of deep learning network model

Country Status (3)

Country Link
US (1) US11875553B2 (en)
KR (1) KR102108854B1 (en)
WO (1) WO2020071701A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240135492A1 (en) * 2022-10-12 2024-04-25 Google Llc Image super-resolution neural networks

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102108854B1 (en) * 2018-10-05 2020-05-12 재단법인대구경북과학기술원 Real-time object detection method and apparatus by deep learning network model
CN109801270B (en) * 2018-12-29 2021-07-16 北京市商汤科技开发有限公司 Anchor point determination method and device, electronic device and storage medium
CN111582062B (en) * 2020-04-21 2022-10-14 电子科技大学 Re-detection method in target tracking based on YOLOv3
KR102416698B1 (en) * 2020-06-03 2022-07-05 고려대학교 산학협력단 Method and apparatus for automatic recognition and measurement system of peripheral nerves on ultrasound images using deep learning algorithm
CN112070730A (en) * 2020-08-27 2020-12-11 宁波市电力设计院有限公司 Anti-vibration hammer falling detection method based on power transmission line inspection image
KR102270808B1 (en) * 2020-09-15 2021-06-29 국민대학교산학협력단 Visible network providing apparatus and method using wireless artificial intelligence
KR20220052620A (en) 2020-10-21 2022-04-28 삼성전자주식회사 Object traking method and apparatus performing the same
KR102512151B1 (en) * 2020-11-20 2023-03-20 재단법인대구경북과학기술원 Method and apparatus for object detection
KR102448268B1 (en) 2020-12-04 2022-09-28 주식회사 두원전자통신 Intelligent image analysis system for accuracy enhancement of object analysis by auto learning, estimation and distribution of object based on Deep Neural Network Algorithm
KR102861626B1 (en) 2020-12-14 2025-09-18 삼성디스플레이 주식회사 Afterimage detection device and display device including the same
WO2022211409A1 (en) * 2021-03-31 2022-10-06 현대자동차주식회사 Method and device for coding machine vision data by using reduction of feature map
KR102416066B1 (en) * 2021-05-13 2022-07-06 유니셈 (주) System and method for matching video
KR102787904B1 (en) 2021-11-08 2025-04-08 연세대학교 산학협력단 Method and device for detect various object in high resolution image
KR102590387B1 (en) * 2021-12-28 2023-10-17 고려대학교 산학협력단 System for diagnosis of carpal tunnel syndrome using muscle ultrasound imaging based on artificial intelligence and method thereof
CN114821034B (en) * 2022-03-29 2025-02-11 恒安嘉新(北京)科技股份公司 Training method, device, electronic device and medium for target detection model
KR102846723B1 (en) * 2022-09-29 2025-08-14 충북대학교 산학협력단 Real-time objection detection network system
KR102689248B1 (en) * 2023-06-14 2024-07-30 한국과학기술원 Large-scale sparse point cloud neural network accelerator with virtual voxel and roi-based skipping
TWI873681B (en) * 2023-06-14 2025-02-21 緯創資通股份有限公司 Object detection method, machine learning method, and electronic device
CN118135666B (en) * 2024-05-07 2024-08-02 武汉纺织大学 A classroom behavior recognition method based on real-time target detection

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20150052924A (en) 2013-11-06 2015-05-15 삼성전자주식회사 Method and apparatus for processing image
US20150347861A1 (en) 2014-05-30 2015-12-03 Apple Inc. Object-Of-Interest Detection And Recognition With Split, Full-Resolution Image Processing Pipeline
US20180150684A1 (en) 2016-11-30 2018-05-31 Shenzhen AltumView Technology Co., Ltd. Age and gender estimation using small-scale convolutional neural network (cnn) modules for embedded systems
KR20180065866A (en) 2016-12-07 2018-06-18 삼성전자주식회사 A method and apparatus for detecting a target
US20200143205A1 (en) * 2017-08-10 2020-05-07 Intel Corporation Convolutional neural network framework using reverse connections and objectness priors for object detection
US11074711B1 (en) * 2018-06-15 2021-07-27 Bertec Corporation System for estimating a pose of one or more persons in a scene
WO2020071701A1 (en) * 2018-10-05 2020-04-09 재단법인대구경북과학기술원 Method and device for detecting object in real time by means of deep learning network model

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. *
Teichmann, Marvin, et al. "Multinet: Real-time joint semantic reasoning for autonomous driving." 2018 IEEE intelligent vehicles symposium (IV). IEEE, 2018. *
Won, Woong-Jae, et al. "Aggnet: Simple aggregated network for real-time multiple object detection in road driving scene." 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018. *
Woong-Jae Won et al., "Real-Time CNN Model for Multi-Object Detection in Driving Scene", 2018 The Korean Society of Automotive Engineers Spring Conference, pp. 822-826, Jun. 2018.
Yoo, Donggeun, et al. "Attentionnet: Aggregating weak directions for accurate object detection." Proceedings of the IEEE international conference on computer vision. 2015. *
Zhang, Pingping, et al. "Amulet: Aggregating multi-level convolutional features for salient object detection." Proceedings of the IEEE international conference on computer vision. 2017. *

Also Published As

Publication number Publication date
KR20200044171A (en) 2020-04-29
KR102108854B1 (en) 2020-05-12
WO2020071701A1 (en) 2020-04-09
US20210383165A1 (en) 2021-12-09


Legal Events

Date Code Title Description
AS Assignment

Owner name: DAEGU GYEONGBUK INSTITUTE OF SCIENCE AND TECHNOLOGY, KOREA, REPUBLIC OF

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:WON, WOONG JAE;KIM, TAE HUN;KWON, SOON;AND OTHERS;REEL/FRAME:055807/0200

Effective date: 20210330

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STCF Information on status: patent grant

Free format text: PATENTED CASE