WO2020073310A1 - Method and apparatus for context-embedding and region-based object detection - Google Patents
- Publication number
- WO2020073310A1 (PCT/CN2018/110023)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- roi
- context
- final feature
- feature map
- proposed
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2137—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on criteria of topology preservation, e.g. multidimensional scaling or self-organising maps
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20016—Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Definitions
- Various example embodiments relate generally to methods and apparatuses for performing region-based object detection.
- Object detection is a task in the area of computer vision that is aimed at localizing and recognizing object instances with a bounding box.
- Convolutional neural network (CNN) -based object detection can be utilized in the areas of visual surveillance, Advanced Driver Assistant Systems (ADAS) , and human-machine interaction (HMI) .
- region-based detectors are discussed in, for example, Y.S. Cao, X. Niu, and Y. Dou, “Region-based convolutional neural networks for object detection in very high resolution remote sensing images, ” In International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, 2016; R. Girshick, “Fast r-cnn, ” Computer Science, 2015; and S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: towards real-time object detection with region proposal networks, ” in International Conference on Neural Information Processing Systems, 2015, pp. 91-99.
- region-based methods divide the object detection into two steps.
- a region proposal network (RPN) generates high-quality proposals.
- the proposals are further classified and regressed by a region-wise subnet.
- the region-free methods detect objects by regular and dense sampling over locations, scales and aspect ratios.
- a method of detecting an object in an image using a convolutional neural network includes generating, by the CNN, a plurality of reference feature maps based on the image; generating a feature pyramid including a plurality of final feature maps corresponding, respectively, to the plurality of reference feature maps; obtaining a proposed region of interest (ROI) ; generating at least a first context ROI based on the proposed ROI such that an area of the first context ROI is larger than an area of the proposed ROI; assigning the proposed ROI to a first final feature map from among the plurality of final feature maps; assigning the first context ROI to a second final feature map from among the plurality of final feature maps, a size of the first final feature map being different than a size of the second final feature map; extracting a first set of features from the first final feature map by performing an ROI pooling operation on the first final feature map using the proposed ROI; extracting a second set of features from the second final feature map by performing an ROI pooling operation on the second final feature map using the first context ROI; and determining, based on the first and second sets of extracted features, at least one of a location of the object with respect to the image and a class of the object.
- the feature pyramid may be generated based on the plurality of reference feature maps in accordance with a feature pyramid network (FPN) architecture.
- the area of the first context ROI may be 2² times the area of the proposed ROI.
- the method may further include concatenating the first and second sets of extracted features, wherein the determining includes determining, based on the concatenated sets of extracted features, at least one of a location of the object with respect to the image and a class of the object.
- the method may further include applying the concatenated sets of extracted features to a squeeze-and-excitation block (SEB) , wherein the at least one of a location of the object with respect to the image and a class of the object is determined based on an output of the SEB.
- the method may further include generating a second context ROI based on the proposed ROI such that an area of the second context ROI is larger than an area of the first context ROI; assigning the second context ROI to a third final feature map from among the plurality of final feature maps, a size of the third final feature map being different than the sizes of the first and second final feature maps; and extracting a third set of features from the third final feature map by performing ROI pooling on the third final feature map using the second context ROI, wherein the determining includes determining, based on the first, second and third sets of extracted features, at least one of the location of the object with respect to the image and the class of the object.
- the feature pyramid may be generated based on the plurality of reference feature maps in accordance with a feature pyramid network (FPN) architecture.
- the area of the first context ROI may be 2² times the area of the proposed ROI, and the area of the second context ROI may be 4² times the area of the proposed ROI.
- the method may further include concatenating the first, second and third sets of extracted features, wherein the determining includes determining, based on the concatenated sets of extracted features, at least one of a location of the object with respect to the image and a class of the object.
- the method may further include applying the concatenated sets of extracted features to a squeeze-and-excitation block (SEB) , wherein the at least one of a location of the object with respect to the image and a class of the object is determined based on an output of the SEB.
- a computer-readable medium includes program instructions for causing an apparatus to perform at least generating, by a convolutional neural network (CNN) , a plurality of reference feature maps based on an image that includes an object; generating a feature pyramid including a plurality of final feature maps corresponding, respectively, to the plurality of reference feature maps; obtaining a proposed region of interest (ROI) ; generating at least a first context ROI based on the proposed ROI such that an area of the first context ROI is larger than an area of the proposed ROI; assigning the proposed ROI to a first final feature map from among the plurality of final feature maps; assigning the first context ROI to a second final feature map from among the plurality of final feature maps, a size of the first final feature map being different than a size of the second final feature map; extracting a first set of features from the first final feature map by performing an ROI pooling operation on the first final feature map using the proposed ROI; extracting a second set of features from the second final feature map by performing an ROI pooling operation on the second final feature map using the first context ROI; and determining, based on the first and second sets of extracted features, at least one of a location of the object with respect to the image and a class of the object.
- the feature pyramid may be generated based on the plurality of reference feature maps in accordance with a feature pyramid network (FPN) architecture.
- the area of the first context ROI may be 2² times the area of the proposed ROI.
- the computer-readable medium may further include program instructions for causing an apparatus to perform at least concatenating the first and second sets of extracted features, wherein the determining includes determining, based on the concatenated sets of extracted features, at least one of a location of the object with respect to the image and a class of the object.
- the computer-readable medium may further include program instructions for causing an apparatus to perform at least applying the concatenated sets of extracted features to a squeeze-and-excitation block (SEB) , wherein the at least one of a location of the object with respect to the image and a class of the object is determined based on an output of the SEB.
- an apparatus includes at least one processor; and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform, generating, by a convolutional neural network (CNN) , a plurality of reference feature maps based on an image that includes an object; generating a feature pyramid including a plurality of final feature maps corresponding, respectively, to the plurality of reference feature maps; obtaining a proposed region of interest (ROI) ; generating at least a first context ROI based on the proposed ROI such that an area of the first context ROI is larger than an area of the proposed ROI; assigning the proposed ROI to a first final feature map from among the plurality of final feature maps; assigning the first context ROI to a second final feature map from among the plurality of final feature maps, a size of the first final feature map being different than a size of the second final feature map; extracting a first set of features from the first final feature map by performing an ROI pooling operation on the first final feature map using the proposed ROI; extracting a second set of features from the second final feature map by performing an ROI pooling operation on the second final feature map using the first context ROI; and determining, based on the first and second sets of extracted features, at least one of a location of the object with respect to the image and a class of the object.
- the feature pyramid may be generated based on the plurality of reference feature maps in accordance with a feature pyramid network (FPN) architecture.
- the area of the first context ROI may be twice the area of the proposed ROI.
- the at least one memory and the computer program code may be further configured to, with the at least one processor, cause the apparatus at least to perform concatenating the first and second sets of extracted features, wherein the determining includes determining, based on the concatenated sets of extracted features, at least one of a location of the object with respect to the image and a class of the object.
- the at least one memory and the computer program code may be further configured to, with the at least one processor, cause the apparatus at least to perform applying the concatenated sets of extracted features to a squeeze-and-excitation block (SEB) , wherein the at least one of a location of the object with respect to the image and a class of the object is determined based on an output of the SEB.
- FIG. 1 is a diagram of a surveillance network 10 according to at least some example embodiments.
- FIG. 2 is a diagram illustrating an example structure of an object detection device according to at least some example embodiments.
- FIG. 3 illustrates an object detection sub-network of a multi-scale convolutional neural network (MS-CNN) detector.
- FIG. 4 illustrates a portion of a backbone convolutional neural network (CNN) according to at least some example embodiments.
- FIG. 5 illustrates a feature pyramid network (FPN) according to at least some example embodiments.
- FIG. 6 illustrates a diagram of a portion of a context-embedding, region-based object detection network 600 according to at least some example embodiments.
- FIG. 7 is a flow chart illustrating an example algorithm for performing the context-embedding, region-based object detection method according to at least some example embodiments.
- Exemplary embodiments are discussed herein as being implemented in a suitable computing environment. Although not required, exemplary embodiments will be described in the general context of computer-executable instructions (e.g., program code) , such as program modules or functional processes, being executed by one or more computer processors or CPUs. Generally, program modules or functional processes include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
- a context-embedding, region-based object detection method is based on region-based object detection methods and includes embedding a context branch in order to obtain rich context information thereby resulting in improved object detection.
- the context information is beneficial for detecting small, blurred and occluded objects.
- the context-embedding, region-based object detection method employs a squeeze-and-excitation block in conjunction with the context branch to reduce or, alternatively, avoid noise information.
- the context-embedding, region-based object detection method according to at least some example embodiments can be applied in several different ways including, for example, visual surveillance.
- Example structures of a surveillance network and object detection device 100 which may utilize the context-embedding, region-based object detection method according to at least some example embodiments will be discussed below in section II of the present disclosure.
- Examples of using feature pyramids and context embedding to perform object detection will be discussed in section III of the present disclosure.
- examples of a convolutional neural network (CNN) architecture and algorithm for performing the context-embedding, region-based object detection method according to at least some example embodiments will be discussed in section IV of the present disclosure. Further, methods of training the CNN architecture will be discussed in section V of the present disclosure.
- FIG. 1 illustrates a diagram of a surveillance network 10 according to at least some example embodiments.
- the surveillance network 10 may include an object detection device 100 and a surveillance system 150.
- the surveillance system 150 may include one or more cameras each capturing image data representing a scene in a vicinity of the location of the camera.
- the surveillance system 150 includes the camera 152, which captures surveillance scene 154.
- the camera 152 may capture surveillance scene 154 by, for example, continuously capturing a plurality of temporally-adjacent images (i.e., capturing video or moving image data) of the surveillance scene 154.
- the camera 152 transmits image data 120 corresponding to the captured surveillance scene 154 to the object detection device 100.
- An example structure of the object detection device 100 will now be discussed in greater detail below with reference to FIG. 2.
- FIG. 2 is a diagram illustrating an example structure of the object detection device 100 according to at least some example embodiments.
- the object detection device 100 may include, for example, a data bus 259, a transmitting unit 252, a receiving unit 254, a memory unit 256, and a processing unit 258.
- the transmitting unit 252, receiving unit 254, memory unit 256, and processing unit 258 may send data to and/or receive data from one another using the data bus 259.
- the transmitting unit 252 is a device that includes hardware and any necessary software for transmitting signals including, for example, control signals or data signals via one or more wired and/or wireless connections to one or more other network elements in a wireless communications network.
- the receiving unit 254 is a device that includes hardware and any necessary software for receiving signals including, for example, control signals or data signals via one or more wired and/or wireless connections from one or more other network elements in a wireless communications network.
- the memory unit 256 may be any device capable of storing data including magnetic storage, flash storage, etc. Further, though not illustrated, the memory unit 256 may further include one or more of a port, dock, drive (e.g., optical drive) , or opening for receiving and/or mounting removable storage media (e.g., one or more of a USB flash drive, an SD card, an embedded MultiMediaCard (eMMC) , a CD, a DVD, and a Blu-ray disc) .
- the processing unit 258 may be any device capable of processing data including, for example, a processor.
- any operations described herein, for example with reference to any of FIGS. 1-7, as being performed by an object detection device may be performed by an electronic device having the structure of the object detection device 100 illustrated in FIG. 2.
- the object detection device 100 may be programmed, in terms of software and/or hardware, to perform any or all of the functions described herein as being performed by an object detection device. Consequently, the object detection device 100 may be embodied as a special purpose computer through software and/or hardware programming.
- the memory unit 256 may store a program that includes executable instructions (e.g., program code) corresponding to any or all of the operations described herein as being performed by an object detection device.
- the object detection device 100 may include hardware for reading data stored on the computer-readable medium.
- processing unit 258 may be a processor configured to perform any or all of the operations described herein with reference to FIGS. 1-4 as being performed by an object detection device, for example, by reading and executing the executable instructions (e.g., program code) stored in at least one of the memory unit 256 and a computer readable storage medium loaded into hardware included in the object detection device 100 for reading computer-readable mediums.
- the processing unit 258 may include a circuit (e.g., an integrated circuit) that has a structural design dedicated to performing any or all of the operations described herein with reference to FIGS. 1-6 as being performed by an object detection device.
- the above-referenced circuit included in the processing unit 258 may be an FPGA or ASIC physically programmed, through specific circuit design, to perform any or all of the operations described with reference to FIGS. 1-7 as being performed by an object detection device.
- the object detection device 100 performs region-based object detection using context embedding, which results in improved object detection performance for small, blurred and occluded objects relative to other object detection methods, while also being able to detect objects at multiple scales.
- Two features used by some other object detection methods, feature pyramids and context embedding, will now be discussed in greater detail below in section III.
- some object detection methods utilize feature pyramids, which include feature maps of multiple levels (i.e., multiple scales) .
- for example, the region-based detector multi-scale CNN (MS-CNN) uses convolutional layers of different spatial resolutions to generate region proposals of different scales.
- the different layers of the MS-CNN detector may have inconsistent semantics.
- An example of the MS-CNN is discussed, for example, in Z. Cai and Q. Fan, R.S Feris and N. Vasconcelos, "A unified multi-scale deep convolutional neural network for fast object detection, " European Conference on Computer Vision. Springer, Cham, 2016.
- FIG. 3 illustrates an object detection sub-network 300 of an MS-CNN detector.
- the MS-CNN object detection sub-network 300 includes trunk CNN layers 310, a first feature map 320 corresponding to a conv4-3 convolutional layer, and a second feature map 330 corresponding to a conv4-3-2x convolutional layer resulting from performing a deconvolution operation on first feature map 320 such that the second feature map 330 is an enlarged version of the first feature map 320.
- the first feature map 320 has the dimensions H/8 × W/8 × 512
- the second feature map 330 has the dimensions H/4 × W/4 × 512, where H is the height of the input image initially input to the MS-CNN detector and W is the width of the input image.
- the second feature map 330 includes a first region 334A (i.e., the innermost cube illustrated within the second feature map 330) and a second region 332A (i.e., the cube illustrated within the second feature map 330 as encompassing the first region 334A) .
- the second region 332A is an enlarged version of the first region 334A and is 1.5 times as large as the first region 334A.
- features of the second feature map 330 corresponding to the first region 334A are reduced, by ROI pooling, to a first fixed-dimension feature map 334B having the dimensions 7 × 7 × 512, and features of the second feature map 330 corresponding to the second region 332A are likewise reduced, by ROI pooling, to a second fixed-dimension feature map 332B.
- the MS-CNN object detection sub-network 300 concatenates the first and second fixed-dimension feature maps 334B and 332B, reduces the resulting feature map to a third fixed-dimension feature map 340B having the dimensions 5 × 5 × 512, and feeds the features of the third fixed-dimension feature map 340B to a fully connected layer 350 for determination of a class probability 370 and a bounding box 360.
- the MS-CNN detector attempts to embed context information of a high level of the feature pyramid included in the MS-CNN detector.
- the richness of the context information corresponding to the enlarged second region 332A may be limited because the enlarged second region 332A and the first region 334A are both mapped to the same level of the feature pyramid (i.e., the conv4-3-2x layer) .
- a context-embedding, region-based object detection method includes embedding a context branch such that features corresponding to a proposed region of interest (RoI) and context information corresponding to one or more enlarged RoIs are extracted from multiple levels of the feature pyramid. Consequently, the richness of the extracted context information may be improved relative to context information of the MS-CNN detector, and thus, the object detection performance of the context-embedding, region-based object detection method according to at least some example embodiments may also be improved.
- Example CNN architecture and algorithm for implementing the context-embedding, region-based object detection method according to at least some example embodiments
- the CNN structures and algorithms discussed below with reference to FIGS. 4-7 may be implemented by the object detection device 100 discussed above with reference to FIGS. 1 and 2. Thus, any or all operations discussed below with reference to FIGS. 4-7 may be executed or controlled by the object detection device 100 (i.e., the processing unit 258) .
- a CNN architecture for implementing the context-embedding, region-based object detection method may include a backbone CNN and a feature pyramid network (FPN) which may be used together to implement one or both of a region proposal network (RPN) and a context-embedding, region-based object detection network.
- FIG. 4 illustrates a portion of a backbone CNN 400 according to at least some example embodiments.
- one type of CNN that may serve as the backbone CNN 400 is the residual network CNN (i.e., ResNet) , examples of which (including ResNet36 and ResNet50) are discussed, for example, in K He, X Zhang, S Ren, J Sun, “Deep Residual Learning for Image Recognition, ” Proc. IEEE Computer Vision and Pattern Recognition, 2016.
- the structure of the backbone CNN 400 illustrated in FIG. 4 is the structure of the ResNet36 CNN.
- the backbone CNN 400 is implemented by the ResNet50 CNN.
- the backbone CNN 400 is not limited to the ResNet36 CNN and the ResNet50 CNN.
- the backbone CNN 400 may be implemented by any CNN that generates multiple feature maps having different scales.
- the backbone CNN 400 may include a plurality of convolution layers which output a plurality of reference feature maps, respectively.
- the backbone CNN 400 illustrated in FIG. 4 includes a first convolutional layer conv1_x (not illustrated) , a second convolutional layer conv2_x that outputs a second reference feature map C2, a third convolutional layer conv3_x that outputs a third reference feature map C3, a fourth convolutional layer conv4_x that outputs a fourth reference feature map C4, and a fifth convolutional layer conv5_x that outputs a fifth reference feature map C5.
- the reference feature maps C2, C3, C4, and C5 may form the basis of an FPN.
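- For illustration only, a minimal PyTorch sketch of such a backbone is shown below; it assumes a torchvision ResNet-50 trunk, which is only one possible choice and is not required by the example embodiments.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class Backbone(nn.Module):
    """Sketch of a backbone CNN exposing the reference feature maps C2-C5
    (the outputs of the conv2_x-conv5_x stages of a ResNet-50 trunk)."""
    def __init__(self):
        super().__init__()
        net = resnet50()
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)  # conv1_x
        self.conv2_x, self.conv3_x = net.layer1, net.layer2
        self.conv4_x, self.conv5_x = net.layer3, net.layer4

    def forward(self, x):
        x = self.stem(x)
        c2 = self.conv2_x(x)    # stride 4 with respect to the input image
        c3 = self.conv3_x(c2)   # stride 8
        c4 = self.conv4_x(c3)   # stride 16
        c5 = self.conv5_x(c4)   # stride 32
        return c2, c3, c4, c5

image = torch.randn(1, 3, 800, 800)      # an input image batch (N, 3, H, W)
c2, c3, c4, c5 = Backbone()(image)       # reference feature maps of decreasing size
```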
- FIG. 5 illustrates an FPN 500 according to at least some example embodiments.
- the FPN 500 may be constructed based on the reference feature maps (e.g., the second through fifth reference feature maps C2-C5) of the backbone CNN 400.
- examples of FPNs are discussed in T. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature Pyramid Networks for Object Detection, ” Proc. IEEE Computer Vision and Pattern Recognition, 2017; and T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, and Y. Chen, “Ron: Reverse connection with objectness prior networks for object detection, ” Proc. IEEE Computer Vision and Pattern Recognition, 2017.
- the FPN 500 employs a top-down architecture to create a feature pyramid that includes high-level semantic feature maps at all scales.
- the FPN 500 creates final feature maps P_(k0+2), P_(k0+1), P_k0, P_(k0-1), P_(k0-2) corresponding to reference feature maps C_(k0+2), C_(k0+1), C_k0, C_(k0-1), C_(k0-2), respectively, where k0 is a constant, the value of which may be set, for example, in accordance with the preferences of a designer and/or user of the object detection device 100.
- the constant k0 will be discussed in greater detail below with reference to Equation 1 and FIGS. 6 and 7.
- the final feature maps P generated by the FPN 500 can be used for one or both of region proposal and context-embedding, region-based object detection.
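- A minimal sketch of such a top-down pyramid is given below; the 256-channel width, nearest-neighbor upsampling, and stride-2 max pooling used to form P6 follow the FPN paper and are assumptions rather than requirements of the example embodiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Top-down feature pyramid sketch: lateral 1x1 convolutions on C2-C5,
    nearest-neighbor upsampling with element-wise addition, 3x3 smoothing to
    produce P2-P5, and stride-2 subsampling of P5 to produce P6."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, c2, c3, c4, c5):
        laterals = [lat(c) for lat, c in zip(self.lateral, (c2, c3, c4, c5))]
        for i in range(len(laterals) - 1, 0, -1):       # top-down pathway
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        p2, p3, p4, p5 = [sm(lat) for sm, lat in zip(self.smooth, laterals)]
        p6 = F.max_pool2d(p5, kernel_size=1, stride=2)  # stride-2 subsampling of P5
        return p2, p3, p4, p5, p6

fpn = SimpleFPN()
c2, c3, c4, c5 = (torch.randn(1, c, s, s) for c, s in
                  zip((256, 512, 1024, 2048), (200, 100, 50, 25)))
p2, p3, p4, p5, p6 = fpn(c2, c3, c4, c5)
```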
- FIG. 6 illustrates a diagram of a portion of a context-embedding, region-based object detection network 600 according to at least some example embodiments.
- FIG. 7 is a flow chart illustrating an example algorithm for performing the context-embedding, region-based object detection method according to at least some example embodiments.
- An example algorithm for performing the context-embedding, region-based object detection method according to at least some example embodiments will now be discussed with reference to FIGS. 4-7, with respect to an example scenario in which the algorithm is performed by the object detection device 100, and the object detection device 100 implements (i.e., embodies) the backbone CNN 400, FPN 500, and object detection network 600.
- the operations of the backbone CNN 400, FPN 500, and object detection network 600 may be performed by the object detection device 100 (e.g., by the processing unit 258 of the object detection device 100 executing computer-readable program code corresponding to the operations of the backbone CNN 400, FPN 500, and object detection network 600) .
- FIG. 7 will be explained with reference to detecting a single object included in an input image.
- the algorithm for performing the context-embedding, region-based object detection method is not limited to receiving an image including only one object, nor is the algorithm limited to detecting only one object.
- the input image can include several objects, and the algorithm is capable of detecting several objects of varying classes, locations and scales, concurrently.
- the object detection device 100 receives an input image including an object.
- the object detection device 100 may receive the input image as part of image data 120 received from the surveillance system 150, as is discussed above with reference to FIG. 1. After receiving the input image, the object detection device 100 may apply the received image as input to the backbone CNN 400. After step S710, the object detection device 100 proceeds to step S720.
- the object detection device 100 may generate reference feature maps.
- the object detection device 100 may generate, using the backbone CNN 400, a plurality of reference feature maps based on the input image received in step S710.
- the second to fifth convolutional layers {conv2_x, conv3_x, conv4_x, conv5_x} of the backbone CNN 400 may generate the second to fifth reference feature maps {C2, C3, C4, C5} , respectively.
- the reference feature maps {C2, C3, C4, C5} may each have different sizes/scales which decrease from the second reference feature map C2 to the fifth reference feature map C5.
- the object detection device 100 proceeds to step S730.
- the object detection device 100 may use an FPN to generate a feature pyramid including final feature maps.
- the object detection device 100 may generate a feature pyramid including a plurality of final feature maps corresponding, respectively, to the plurality of reference feature maps generated in step S720.
- the FPN 500 may generate second to fifth final feature maps and, optionally, an additional sixth final feature map {P2, P3, P4, P5, P6} .
- the second to fifth final feature maps {P2, P3, P4, P5} may correspond, respectively, to the second to fifth reference feature maps {C2, C3, C4, C5} generated in step S720.
- the sixth final feature map P6 may be generated by the FPN 500 based on the fifth final feature map P5, for example, by performing a stride 2 subsampling of the fifth final feature map P5, as is discussed, for example, in T. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature Pyramid Networks for Object Detection, ” Proc. IEEE Computer Vision and Pattern Recognition, 2017.
- the final feature maps {P2, P3, P4, P5, P6} may each have different sizes/scales which decrease from the second final feature map P2 to the sixth final feature map P6.
- the object detection device 100 proceeds to step S740.
- step S740 the object detection device 100 obtains a proposed region of interest (RoI or ROI) , and generates one or more context RoIs.
- RoI proposed region of interest
- the object detection device 100 may obtain a proposed RoI from an external source.
- the object detection device 100 may obtain the proposed RoI by implementing a region proposal network (RPN) based on the FPN 500, and using the FPN-based RPN to generate a proposed RoI.
- RPN region proposal network
- the final feature maps P_(k0+2), P_(k0+1), P_k0, P_(k0-1), P_(k0-2) generated by the FPN 500 as is illustrated in FIG. 5 can be used to implement the FPN-based RPN.
- One of ordinary skill in the art will recognize that example methods of implementing an FPN-based RPN are discussed in T. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature Pyramid Networks for Object Detection, ” Proc. IEEE Computer Vision and Pattern Recognition, 2017.
- the 6th final feature map P6 may be generated based on the 5th final feature map P5 in the same manner discussed above with reference to step S730.
- the FPN-based RPN may use anchors of three different aspect ratios {1: 2, 1: 1, 2: 1} for each of the second through sixth final feature maps P2-P6 such that the anchors used on the 5 different final feature maps {P2, P3, P4, P5, P6} have 5 different areas {32², 64², 128², 256², 512²} , respectively.
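- As a rough sketch of how such per-level anchors could be generated (the aspect-ratio convention, origin-centering and function name are illustrative assumptions, not part of the example embodiments):

```python
import torch

def level_anchors(area, aspect_ratios=(0.5, 1.0, 2.0)):
    """Anchor boxes (x1, y1, x2, y2) centered at the origin for one pyramid level,
    e.g. area=32**2 for P2 up to area=512**2 for P6. Ratio r is interpreted as h/w."""
    boxes = []
    for r in aspect_ratios:
        w = (area / r) ** 0.5          # w * h = area and h = r * w
        h = r * w
        boxes.append([-w / 2, -h / 2, w / 2, h / 2])
    return torch.tensor(boxes)

print(level_anchors(32 ** 2))          # three anchors of area 32^2 with ratios 1:2, 1:1, 2:1
```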
- the object detection device 100 may obtain the proposed RoI by either one of receiving the proposed RoI and generating the proposed RoI.
- the object detection device 100 may obtain one or more context RoIs by enlarging the proposed RoI.
- FIG. 6 illustrates an input image 605, a proposed RoI 610, and first and second context RoIs, 615A and 615B.
- step S740 is described as obtaining “a proposed RoI” for the purpose of simplicity and ease of description.
- the algorithm for performing the context-embedding, region-based object detection method according to at least some example embodiments is not limited to obtaining just one RoI or just one RoI at a time.
- the object detection device 100 is capable of obtaining several RoI’s of varying locations, scales and aspect ratios, concurrently, in step S740.
- while step S740 is described above with reference to an example scenario in which two context RoIs (i.e., two enlarged versions of the proposed RoI 610) are generated, according to at least some example embodiments, any number of context RoIs (e.g., 1, 3, 5, etc. ) may be generated by enlarging the proposed RoI 610.
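- A minimal sketch of generating context RoIs by enlarging a proposed RoI about its center is given below; the scale factors of 2 and 4 (giving 2² and 4² times the original area) and the optional clipping to the image boundary are illustrative assumptions.

```python
def make_context_rois(roi, scales=(2.0, 4.0), img_w=None, img_h=None):
    """Enlarge an (x1, y1, x2, y2) RoI about its center by each scale factor.
    A scale of 2 yields a context RoI with 2^2 times the original area."""
    x1, y1, x2, y2 = roi
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = x2 - x1, y2 - y1
    context_rois = []
    for s in scales:
        nx1, ny1 = cx - s * w / 2.0, cy - s * h / 2.0
        nx2, ny2 = cx + s * w / 2.0, cy + s * h / 2.0
        if img_w is not None and img_h is not None:      # optionally clip to the image
            nx1, ny1 = max(0.0, nx1), max(0.0, ny1)
            nx2, ny2 = min(float(img_w), nx2), min(float(img_h), ny2)
        context_rois.append((nx1, ny1, nx2, ny2))
    return context_rois

print(make_context_rois((100, 100, 180, 140), img_w=640, img_h=480))
```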
- the object detection device 100 proceeds to step S750.
- step S750 the object detection device 100 assigns the proposed RoI and the one or more context RoIs to final feature maps.
- the object detection device may assign the proposed RoI 610, the first context RoI 615A, and the second context RoI 615B to final feature maps, e.g., from among the final feature maps {P2, P3, P4, P5, P6} generated in step S730.
- to assign an RoI to one of the final feature maps, the object detection device 100 may use the following equation (the RoI-to-level assignment rule of the FPN architecture) : k = ⌊k0 + log2 (√ (w·h) /224) ⌋ (Equation 1)
- the object detection device 100 may apply the width ‘w’ and height ‘h’ of the RoI to Equation 1, above, to obtain the output k, and assign the RoI to the k-th final feature map Pk.
- the object detection network 600 assigns the proposed RoI 610 to the 3rd final feature map P3, as is illustrated in FIG. 6.
- the object detection network 600 assigns the first and second context RoIs 615A and 615B to the 4th and 5th final feature maps P4 and P5, respectively, as is illustrated in FIG. 6. After step S750, the object detection device 100 proceeds to step S760.
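- A small sketch of this assignment using Equation 1 is shown below; the values k0 = 4, the 224-pixel canonical size, and the example RoI dimensions are illustrative assumptions, under which a proposed RoI maps to P3 and its 2x and 4x enlargements map to P4 and P5, matching the FIG. 6 example.

```python
import math

def assign_pyramid_level(w, h, k0=4, k_min=2, k_max=5):
    """Map an RoI of width w and height h to a pyramid level k (Equation 1),
    following the FPN-style assignment rule with a 224-pixel canonical size."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224.0))
    return max(k_min, min(k_max, k))     # clamp to the available levels

# A proposed RoI and its 2x / 4x enlargements land on successively higher levels.
for scale in (1, 2, 4):
    print(scale, assign_pyramid_level(112 * scale, 112 * scale))   # -> 3, 4, 5
```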
- step S760 the object detection device 100 extracts a set of features from each final feature map to which one of the RoIs is assigned, using RoI pooling.
- the object detection network 600 embodied by the object detection device 100 may perform RoI pooling with respect to the proposed RoI 610 and the final feature map to which the proposed RoI 610 is assigned.
- the object detection network 600 performs RoI pooling on the final feature map to which the proposed RoI 610 is assigned (i.e., the 3rd final feature map P3) such that the features of the 3rd final feature map P3 which fall within the proposed RoI 610 are pooled, by an RoI pooling operation, to generate a fixed-size original feature map 620.
- the fixed-size original feature map 620 is a set of features extracted from the 3rd final feature map P3 based on the RoI that was originally proposed (i.e., the proposed RoI 610) .
- step S760 the object detection network 600 forms a context branch 630 by performing RoI pooling on the first context RoI 615A and the second context RoI 615B and the final feature maps to which the first context RoI 615A and the second context RoI 615B are assigned.
- the object detection network 600 performs RoI pooling on the final feature maps to which the first and second context RoIs 615A and 615B are respectively assigned (i.e., the 4th and 5th final feature maps P4 and P5) such that the features of the 4th final feature map P4 which fall within the first context RoI 615A are pooled, by an RoI pooling operation, to generate a first fixed-size context feature map 632, and the features of the 5th final feature map P5 which fall within the second context RoI 615B are pooled, by an RoI pooling operation, to generate a second fixed-size context feature map 634.
- the first fixed-size context feature map 632 is a set of features extracted from the 4th final feature map P4 based on the first context RoI 615A and the second fixed-size context feature map 634 is a set of features extracted from the 5th final feature map P5 based on the second context RoI 615B.
- the RoI pooling operations discussed above with reference to step S760 may be performed by using the operations of the RoI pooling layer discussed in R. Girshick, “Fast r-cnn, ” Computer Science, 2015.
- alternatively, the RoI pooling operations discussed above with reference to step S760 may be performed by using the operations of the RoIAlign layer. Examples of the RoIAlign layer are discussed, for example, in K. He, G. Gkioxari, P. Dollar, R. Girshick, “Mask R-CNN, ” in ICCV 2017.
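- For illustration, the feature extraction of step S760 can be sketched with torchvision's roi_align; the 7 × 7 output size, the spatial scales (1/8 for P3, 1/16 for P4), and the box coordinates below are assumptions, not values prescribed by the example embodiments.

```python
import torch
from torchvision.ops import roi_align

# Pool fixed-size features from the final feature map assigned to each RoI.
# Boxes are (batch_index, x1, y1, x2, y2) in input-image coordinates;
# spatial_scale converts them into feature-map coordinates.
p3 = torch.randn(1, 256, 100, 100)
proposed_roi = torch.tensor([[0.0, 100.0, 100.0, 180.0, 140.0]])
original_feats = roi_align(p3, proposed_roi, output_size=(7, 7),
                           spatial_scale=1.0 / 8, sampling_ratio=2)

p4 = torch.randn(1, 256, 50, 50)
context_roi = torch.tensor([[0.0, 60.0, 80.0, 220.0, 160.0]])
context_feats = roi_align(p4, context_roi, output_size=(7, 7),
                          spatial_scale=1.0 / 16, sampling_ratio=2)
print(original_feats.shape, context_feats.shape)   # both (1, 256, 7, 7)
```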
- step S770 the object detection device 100 determines a class and/or location of the object included in the image.
- the object detection network 600 may perform context embedding by concatenating the first and second fixed-size context feature maps 632 and 634 to the fixed-size original feature map 620, thereby forming the concatenated feature map 625, as is shown in FIG. 6.
- the object detection network 600 may obtain richer context features and improved object detection results because the features included in the concatenated feature map 625 were not all extracted from the same convolutional layer or the same layer of the feature pyramid {P2, P3, P4, P5, P6} .
- the object detection network 600 includes a squeeze and excitation (SE) block 640 and may apply the concatenated feature map 625 to the SE block 640 in order to reduce or, alternatively, eliminate noise information, for example, by recalibrating channel-wise feature responses.
- SE block 640 contains two steps: squeeze and excitation. The first step is to squeeze global spatial information into a channel descriptor. This is achieved by using global average pooling to generate the channel-wise statistics. The second step is adaptive recalibration.
- the SE block 640 may include a fully connected layer fc1 followed by a rectified linear unit (ReLU) , whose output has the dimensions 1 × 1 × C’.
- Example structures and methods for constructing and using SE blocks are described, for example, in Hu, Jie, Li Shen, and Gang Sun, "Squeeze-and-excitation networks, " arXiv: 1709.01507, 2017.
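- A minimal sketch of the context embedding and SE recalibration is given below; the 256-channel pooled maps, the 7 × 7 spatial size, and the reduction ratio of 16 are illustrative assumptions rather than values required by the example embodiments.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation sketch: global average pooling ('squeeze') followed by
    fc1 + ReLU and fc2 + sigmoid ('excitation'), whose output recalibrates the
    channel-wise responses of the input feature map."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)   # output is 1 x 1 x C'
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.Linear(channels // reduction, channels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                    # x: (N, C, H, W)
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))               # squeeze: channel-wise statistics
        w = self.sigmoid(self.fc2(self.relu(self.fc1(s))))
        return x * w.view(n, c, 1, 1)        # adaptive recalibration

# Context embedding: concatenate the original and context feature maps along the
# channel axis, then recalibrate with the SE block to suppress noisy channels.
original = torch.randn(2, 256, 7, 7)
context_1 = torch.randn(2, 256, 7, 7)
context_2 = torch.randn(2, 256, 7, 7)
concatenated = torch.cat([original, context_1, context_2], dim=1)
recalibrated = SEBlock(channels=concatenated.shape[1])(concatenated)
```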
- a class and bounding box (i.e., location) of the object included in the input image 605 are determined by using the output of the SE block 640 to generate class probability values 660 and bounding box values 670.
- the output of the SE block 640 may be applied to another fully connected layer 650 in order to generate class probability values (or class labels) 660 and bounding box values 670.
- Object detection utilizes bounding boxes to accurately locate where objects are and assign the objects correct class labels.
- the class probability values 660 and bounding box values 670 are object detection results of the context-embedding, region-based object detection method discussed above with reference to FIGS. 4-7.
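- For illustration, the region-wise head that produces these outputs can be sketched as below; it consumes the SE-recalibrated, concatenated features, and the channel width, pooling size, hidden size, and class count (e.g., 81 classes for COCO) are assumptions rather than requirements.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Sketch of the region-wise head: the SE-recalibrated, concatenated features
    (original + two context maps) are flattened and passed through fully connected
    layers to produce class probabilities and bounding-box regression values."""
    def __init__(self, num_classes, channels=3 * 256, pool=7):
        super().__init__()
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.Linear(channels * pool * pool, 1024),
                                nn.ReLU(inplace=True))
        self.cls = nn.Linear(1024, num_classes)        # class probability values 660
        self.box = nn.Linear(1024, 4 * num_classes)    # bounding box values 670

    def forward(self, recalibrated_feats):             # output of the SE block
        x = self.fc(recalibrated_feats)
        return self.cls(x).softmax(dim=-1), self.box(x)

head = DetectionHead(num_classes=81)                   # e.g., 80 COCO classes + background
cls_probs, box_deltas = head(torch.randn(2, 3 * 256, 7, 7))
```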
- At least some example embodiments of the context-embedding, region-based object detection method discussed above with reference to FIGS. 4-7 can be applied to a wide variety of functions including autonomous driving systems and video surveillance, as is discussed above with respect to FIG. 1.
- the object detection device 100 implementing the context-embedding, region-based object detection method discussed above with reference to FIGS. 4-7 can help count the pedestrian flow through the subway.
- the object detection device 100 implementing the context-embedding, region-based object detection method according to at least some example embodiments can help count the number of customers in the market, thereby enabling an owner or operator of the market to control the number of customers, for example, for safety reasons.
- the context-embedding, region-based object detection method includes enlarging the size of the original RoI (e.g., proposed RoI 610) in order to obtain more context information using the enlarged RoIs (e.g., first and second context RoIs 615A and 615B) . Further, the enlarged RoIs are mapped to a different feature map than the original RoIs, thereby boosting the representation power of the context information obtained via the enlarged RoIs. Thus, the obtained context information is beneficial for the task of detecting small and occluded objects in the input image.
- Example methods of training a CNN architecture to perform the context-embedding, region-based object detection method discussed above with reference to FIGS. 4-7 will now be discussed below in section V.
- the CNN architecture for performing the context-embedding, region-based object detection method discussed above with reference to FIGS. 4-7 can be trained in accordance with known CNN training techniques, for example, to set the various values of the filters used in the various convolutional layers (e.g., the filters of the first through fifth convolutional layers conv1_x -conv5_x of the backbone CNN 400 illustrated in FIG 4. ) .
- a proper loss function is designed.
- a multi-task loss function may be used.
- An example of a multi-task loss function is discussed, for example, in Lin T Y, Goyal P, Girshick R, et al., “Focal Loss for Dense Object Detection, ” Proc. IEEE Computer Vision and Pattern Recognition, 2017.
- training may be performed by using the Common Object in Context (COCO) train and val-minus-minival data sets as training data.
- the parameters of the above-referenced filters are iteratively updated until convergence by the stochastic gradient descent (SGD) algorithm.
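- A schematic SGD training loop is sketched below for illustration only; the model, the random data, and the loss terms are placeholders standing in for the detection network, the COCO training data, and a multi-task (classification plus bounding-box regression) loss.

```python
import torch
import torch.nn as nn

# Placeholder training-loop sketch: the filter parameters are iteratively updated
# by stochastic gradient descent under a multi-task loss.
model = nn.Linear(16, 4)                                   # stand-in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
cls_loss = nn.CrossEntropyLoss()                           # classification term
box_loss = nn.SmoothL1Loss()                               # bounding-box regression term

for step in range(100):
    x = torch.randn(8, 16)                                 # stand-in mini-batch
    labels = torch.randint(0, 4, (8,))
    box_targets = torch.randn(8, 4)
    logits = model(x)
    loss = cls_loss(logits, labels) + box_loss(logits, box_targets)   # multi-task loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```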
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Medical Informatics (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- Biodiversity & Conservation Biology (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Image Analysis (AREA)
Abstract
A method of detecting an object in an image using a convolutional neural network (CNN) includes generating, based on the image, a plurality of reference feature maps and a corresponding feature pyramid including a plurality of final feature maps; obtaining a proposed region of interest (ROI); generating at least a first context ROI having an area larger than an area of the proposed ROI; assigning the proposed ROI and the first context ROI to first and second final feature maps having different sizes; extracting, by performing ROI pooling, a first set of features from the first final feature map using the proposed ROI and a second set of features from the second final feature map using the first context ROI; and determining, based on the first and second sets of extracted features, at least one of a location of the object and a class of the object.
Description
1. Field
Various example embodiments relate generally to methods and apparatuses for performing region-based object detection.
2. Related Art
Object detection is a task in the area of computer vision that is aimed at localizing and recognizing object instances with a bounding box. Convolutional neural network (CNN) -based object detection can be utilized in the areas of visual surveillance, Advanced Driver Assistant Systems (ADAS) , and human-machine interaction (HMI) .
Current object detection frameworks could be grouped into two main streams: the region-based methods and the region-free methods. Examples of region-based detectors are discussed in, for example, Y.S. Cao, X. Niu, and Y. Dou, “Region-based convolutional neural networks for object detection in very high resolution remote sensing images, ” In International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, 2016; R. Girshick, “Fast r-cnn, ” Computer Science, 2015; and S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: towards real-time object detection with region proposal networks, ” in International Conference on Neural Information Processing Systems, 2015, pp. 91-99. Generally, region-based methods divide the object detection into two steps. In the first step, a region proposal network (RPN) generates high-quality proposals. Then, in the second step, the proposals are further classified and regressed by a region-wise subnet. Generally, the region-free methods detect objects by regular and dense sampling over locations, scales and aspect ratios.
SUMMARY
According to at least some example embodiments, a method of detecting an object in an image using a convolutional neural network (CNN) includes generating, by the CNN, a plurality of reference feature maps based on the image; generating a feature pyramid including a plurality of final feature maps corresponding, respectively, to the plurality of reference feature maps; obtaining a proposed region of interest (ROI) ; generating at least a first context ROI based on the proposed ROI such that an area of the first context ROI is larger than an area of the proposed ROI; assigning the proposed ROI to a first final feature map from among the plurality of final feature maps; assigning the first context ROI to a second final feature map from among the plurality of final feature maps, a size of the first final feature map being different than a size of the second final feature map; extracting a first set of features from the first final feature map by performing an ROI pooling operation on the first final feature map using the proposed ROI; extracting a second set of features from the second final feature map by performing an ROI pooling operation on the second final feature map using the first context ROI; and determining, based on the first and second sets of extracted features, at least one of a location of the object with respect to the image and a class of the object.
The feature pyramid may be generated based on the plurality of reference feature maps in accordance with a feature pyramid network (FPN) architecture.
The area of the first context ROI may be 2² times the area of the proposed ROI.
The method may further include concatenating the first and second sets of extracted features, wherein the determining includes determining, based on the concatenated sets of extracted features, at least one of a location of the object with respect to the image and a class of the object.
The method may further include applying the concatenated sets of extracted features to a squeeze-and-excitation block (SEB) , wherein the at least one of a location of the object with respect to the image and a class of the object is determined based on an output of the SEB.
The method may further include generating a second context ROI based on the proposed ROI such that an area of the second context ROI is larger than an area of the first context ROI; assigning the second context ROI to a third final feature map from among the plurality of final feature maps, a size of the third final feature map being different than the sizes of the first and second final feature maps; and extracting a third set of features from the third final feature map by performing ROI pooling on the third final feature map using the second context ROI, wherein the determining includes determining, based on the first, second and third sets of extracted features, at least one of the location of the object with respect to the image and the class of the object.
The feature pyramid may be generated based on the plurality of reference feature maps in accordance with a feature pyramid network (FPN) architecture.
The area of the first context ROI may be 2² times the area of the proposed ROI, and the area of the second context ROI may be 4² times the area of the proposed ROI.
The method may further include concatenating the first, second and third sets of extracted features, wherein the determining includes determining, based on the concatenated sets of extracted features, at least one of a location of the object with respect to the image and a class of the object.
The method may further include applying the concatenated sets of extracted features to a squeeze-and-excitation block (SEB) , wherein the at least one of a location of the object with respect to the image and a class of the object is determined based on an output of the SEB.
According to at least some example embodiments, a computer-readable medium includes program instructions for causing an apparatus to perform at least generating, by a convolutional neural network (CNN) , a plurality of reference feature maps based on an image that includes an object; generating a feature pyramid including a plurality of final feature maps corresponding, respectively, to the plurality of reference feature maps; obtaining a proposed region of interest (ROI) ; generating at least a first context ROI based on the proposed ROI such that an area of the first context ROI is larger than an area of the proposed ROI; assigning the proposed ROI to a first final feature map from among the plurality of final feature maps; assigning the first context ROI to a second final feature map from among the plurality of final feature maps, a size of the first final feature map being different than a size of the second final feature map; extracting a first set of features from the first final feature map by performing an ROI pooling operation on the first final feature map using the proposed ROI; extracting a second set of features from the second final feature map by performing an ROI pooling operation on the second final feature map using the first context ROI; and determining, based on the first and second sets of extracted features, at least one of a location of the object with respect to the image and a class of the object.
The feature pyramid may be generated based on the plurality of reference feature maps in accordance with a feature pyramid network (FPN) architecture.
The area of the first context ROI may be 2² times the area of the proposed ROI.
The computer-readable medium may further include program instructions for causing an apparatus to perform at least concatenating the first and second sets of extracted features, wherein the determining includes determining, based on the concatenated sets of extracted features, at least one of a location of the object with respect to the image and a class of the object.
The computer-readable medium may further include program instructions for causing an apparatus to perform at least applying the concatenated sets of extracted features to a squeeze-and-excitation block (SEB) , wherein the at least one of a location of the object with respect to the image and a class of the object is determined based on an output of the SEB.
According to at least some example embodiments, an apparatus includes at least one processor; and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform, generating, by a convolutional neural network (CNN) , a plurality of reference feature maps based on an image that includes an object; generating a feature pyramid including a plurality of final feature maps corresponding, respectively, to the plurality of reference feature maps; obtaining a proposed region of interest (ROI) ; generating at least a first context ROI based on the proposed ROI such that an area of the first context ROI is larger than an area of the proposed ROI; assigning the proposed ROI to a first final feature map from among the plurality of final feature maps; assigning the first context ROI to a second final feature map from among the plurality of final feature maps, a size of the first final feature map being different than a size of the second final feature map; extracting a first set of features from the first final feature map by performing an ROI pooling operation on the first final feature map using the proposed ROI; extracting a second set of features from the second final feature map by performing an ROI pooling operation on the second final feature map using the first context ROI; and determining, based on the first and second sets of extracted features, at least one of a location of the object with respect to the image and a class of the object.
The feature pyramid may be generated based on the plurality of reference feature maps in accordance with a feature pyramid network (FPN) architecture.
The area of the first context ROI may be twice the area of the proposed ROI.
The at least one memory and the computer program code may be further configured to, with the at least one processor, cause the apparatus at least to perform concatenating the first and second sets of extracted features, wherein the determining includes determining, based on the concatenated sets of extracted features, at least one of a location of the object with respect to the image and a class of the object.
The at least one memory and the computer program code may be further configured to, with the at least one processor, cause the apparatus at least to perform applying the concatenated sets of extracted features to a squeeze-and-excitation block (SEB) , wherein the at least one of a location of the object with respect to the image and a class of the object is determined based on an output of the SEB.
At least some example embodiments will become more fully understood from the detailed description provided below and the accompanying drawings, wherein like elements are represented by like reference numerals, which are given by way of illustration only and thus are not limiting of example embodiments and wherein:
FIG. 1 is a diagram of a surveillance network 10 according to at least some example embodiments.
FIG. 2 is a diagram illustrating an example structure of an object detection device according to at least some example embodiments.
FIG. 3 illustrates an object detection sub-network of a multi-scale convolutional neural network (MS-CNN) detector.
FIG. 4 illustrates a portion of a backbone convolutional neural network (CNN) according to at least some example embodiments.
FIG. 5 illustrates a feature pyramid network (FPN) according to at least some example embodiments.
FIG. 6 illustrates a diagram of a portion of a context-embedding, region-based object detection network 600 according to at least some example embodiments.
FIG. 7 is a flow chart illustrating an example algorithm for performing the context-embedding, region-based object detection method according to at least some example embodiments.
DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS
Various example embodiments will now be described more fully with reference to the accompanying drawings in which some example embodiments are shown.
Detailed illustrative embodiments are disclosed herein. However, specific structural and functional details disclosed herein are merely representative for purposes of describing at least some example embodiments. Example embodiments may, however, be embodied in many alternate forms and should not be construed as limited to only the embodiments set forth herein.
Accordingly, while example embodiments are capable of various modifications and alternative forms, embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that there is no intent to limit example embodiments to the particular forms disclosed, but on the contrary, example embodiments are to cover all modifications, equivalents, and alternatives falling within the scope of example embodiments. Like numbers refer to like elements throughout the description of the figures. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, there are no intervening elements present. Other words used to describe the relationship between elements should be interpreted in a like fashion (e.g., "between" versus "directly between" , "adjacent" versus "directly adjacent" , etc. ) .
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments. As used herein, the singular forms "a" , "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" , "comprising, " , "includes" and/or "including" , when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be noted that in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may in fact be executed substantially concurrently or may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
Exemplary embodiments are discussed herein as being implemented in a suitable computing environment. Although not required, exemplary embodiments will be described in the general context of computer-executable instructions (e.g., program code), such as program modules or functional processes, being executed by one or more computer processors or CPUs. Generally, program modules or functional processes include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
In the following description, illustrative embodiments will be described with reference to acts and symbolic representations of operations (e.g., in the form of flowcharts) that are performed by one or more processors, unless indicated otherwise. As such, it will be understood that such acts and operations, which are at times referred to as being computer-executed, include the manipulation by the processor of electrical signals representing data in a structured form. This manipulation transforms the data or maintains it at locations in the memory system of the computer, which reconfigures or otherwise alters the operation of the computer in a manner well understood by those skilled in the art.
I. Overview
As is discussed in greater detail below, a context-embedding, region-based object detection method according to at least some example embodiments is based on region-based object detection methods and includes embedding a context branch in order to obtain rich context information thereby resulting in improved object detection. According to at least some example embodiments, the context information is beneficial for detecting small, blurred and occluded objects. Further, as is also discussed in greater detail below, the context-embedding, region-based object detection method according to at least some example embodiments employs a squeeze-and-excitation block in conjunction with the context branch to reduce or, alternatively, avoid noise information. The context-embedding, region-based object detection method according to at least some example embodiments can be applied in several different ways including, for example, visual surveillance.
Example structures of a surveillance network and object detection device 100 which may utilize the context-embedding, region-based object detection method according to at least some example embodiments will be discussed below in section II of the present disclosure. Next, examples of using feature pyramids and context embedding to perform object detection will be discussed in section III of the present disclosure. Next, examples of a convolutional neural network (CNN) architecture and algorithm for performing the context-embedding, region-based object detection method according to at least some example embodiments will be discussed in section IV of the present disclosure. Further, methods of training the CNN architecture will be discussed in section V of the present disclosure.
II. Example structures for implementing the context-embedding, region-based object detection method according to at least some example embodiments
For example, FIG. 1 illustrates a diagram of a surveillance network 10 according to at least some example embodiments. As is shown in FIG. 1, the surveillance network 10 may include an object detection device 100 and a surveillance system 150.
The surveillance system 150 may include one or more cameras each capturing image data representing a scene in a vicinity of the location of the camera. For example, as is illustrated in FIG. 1, the surveillance system 150 includes the camera 152, which captures surveillance scene 154. The camera 152 may capture surveillance scene 154 by, for example, continuously capturing a plurality of temporally-adjacent images (i.e., capturing video or moving image data) of the surveillance scene 154. According to at least some example embodiments, the camera 152 transmits image data 120 corresponding to the captured surveillance scene 154 to the object detection device 100. An example structure of the object detection device 100 will now be discussed in greater detail below with reference to FIG. 2.
FIG. 2 is a diagram illustrating an example structure of the object detection device 100 according to at least some example embodiments.
Referring to FIG. 2, the object detection device 100 may include, for example, a data bus 259, a transmitting unit 252, a receiving unit 254, a memory unit 256, and a processing unit 258.
The transmitting unit 252, receiving unit 254, memory unit 256, and processing unit 258 may send data to and/or receive data from one another using the data bus 259.
The transmitting unit 252 is a device that includes hardware and any necessary software for transmitting signals including, for example, control signals or data signals via one or more wired and/or wireless connections to one or more other network elements in a wireless communications network.
The receiving unit 254 is a device that includes hardware and any necessary software for receiving signals including, for example, control signals or data signals via one or more wired and/or wireless connections to one or more other network elements in a wireless communications network.
The memory unit 256 may be any device capable of storing data including magnetic storage, flash storage, etc. Further, though not illustrated, the memory unit 256 may further include one or more of a port, dock, drive (e.g., optical drive), or opening for receiving and/or mounting removable storage media (e.g., one or more of a USB flash drive, an SD card, an embedded MultiMediaCard (eMMC), a CD, a DVD, and a Blu-ray disc).
The processing unit 258 may be any device capable of processing data including, for example, a processor.
According to at least one example embodiment, any operations described herein, for example with reference to any of FIGS. 1-7, as being performed by an object detection device may be performed by an electronic device having the structure of the object detection device 100 illustrated in FIG. 2. For example, according to at least one example embodiment, the object detection device 100 may be programmed, in terms of software and/or hardware, to perform any or all of the functions described herein as being performed by an object detection device. Consequently, the object detection device 100 may be embodied as a special purpose computer through software and/or hardware programming.
Examples of the object detection device 100 being programmed, in terms of software, to perform any or all of the functions described herein as being performed by an object detection device will now be discussed below. For example, the memory unit 256 may store a program that includes executable instructions (e.g., program code) corresponding to any or all of the operations described herein as being performed by an object detection device. According to at least one example embodiment, additionally or alternatively to being stored in the memory unit 256, the executable instructions (e.g., program code) may be stored in a computer-readable medium including, for example, an optical disc, flash drive, SD card, etc., and the object detection device 100 may include hardware for reading data stored on the computer-readable medium. Further, the processing unit 258 may be a processor configured to perform any or all of the operations described herein with reference to FIGS. 1-7 as being performed by an object detection device, for example, by reading and executing the executable instructions (e.g., program code) stored in at least one of the memory unit 256 and a computer-readable storage medium loaded into hardware included in the object detection device 100 for reading computer-readable media.
Examples of the object detection device 100 being programmed, in terms of hardware, to perform any or all of the functions described herein as being performed by an object detection device will now be discussed below. Additionally or alternatively to executable instructions (e.g., program code) corresponding to the functions described with reference to FIGS. 1-7 as being performed by an object detection device being stored in a memory unit or a computer-readable medium as is discussed above, the processing unit 258 may include a circuit (e.g., an integrated circuit) that has a structural design dedicated to performing any or all of the operations described herein with reference to FIGS. 1-7 as being performed by an object detection device. For example, the above-referenced circuit included in the processing unit 258 may be an FPGA or ASIC physically programmed, through specific circuit design, to perform any or all of the operations described with reference to FIGS. 1-7 as being performed by an object detection device.
According to at least some example embodiments, the object detection device 100 performs region-based object detection using context embedding, which results in improved object detection performance with respect to small, blurred and occluded objects relative to other object detection methods, while also being able to detect objects at multiple scales. Two features used by some other object detection methods, feature pyramids and context embedding, will now be discussed in greater detail below in section III.
III. Feature pyramids and embedding context
For example, some object detection methods utilize feature pyramids, which include feature maps of multiple levels (i.e., multiple scales). For example, the region-based detector multi-scale CNN (MS-CNN) uses convolutional layers of different spatial resolutions to generate region proposals of different scales. However, the different layers of the MS-CNN detector may have inconsistent semantics. An example of the MS-CNN is discussed, for example, in Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos, "A unified multi-scale deep convolutional neural network for fast object detection," European Conference on Computer Vision, Springer, Cham, 2016.
Further, in addition to using feature pyramids to generate region proposals, the MS-CNN detector also includes an object detection sub-network that utilizes context embedding. FIG. 3 illustrates an object detection sub-network 300 of an MS-CNN detector. As is illustrated in FIG. 3, the MS-CNN object detection sub-network 300 includes trunk CNN layers 310, a first feature map 320 corresponding to a conv4-3 convolutional layer, and a second feature map 330 corresponding to a conv4-3-2x convolutional layer resulting from performing a deconvolution operation on the first feature map 320, such that the second feature map 330 is an enlarged version of the first feature map 320. For the example depicted in FIG. 3, the first feature map 320 has the dimensions H/8 × W/8 × 512, and the second feature map 330 has the dimensions H/4 × W/4 × 512, where H is the height of the input image initially input to the MS-CNN detector and W is the width of the input image.
As is illustrated in FIG. 3, within the second feature map 330, there is a first region 334A (i.e., the innermost cube illustrated within the second feature map 330) and a second region 332A (i.e., the cube illustrated within the second feature map 330 as encompassing the first region 334A) . The second region 332A is an enlarged version of the first region 334A and is 1.5 times as large as the first region 334A. Further, as is also illustrated in FIG. 3, features of the second feature map 330 corresponding to a first region 334A are reduced, by ROI pooling, to a first fixed-dimension feature map 334B having the dimensions 7 × 7 × 512. Further, features of the second feature map 330 corresponding to a second region 332A are reduced, by ROI pooling, to a second fixed-dimension feature map 332B, which also has the dimensions 7 × 7 × 512. As is illustrated in FIG. 3, the MS-CNN object detection sub-network 300 concatenates the first and second fixed-dimension feature maps 334B and 332B, reduces the resulting feature map to a third fixed-dimension feature map 340B having the dimensions 5 × 5 × 512, and feeds the features of the third fixed-dimension feature map 340B to a fully connected layer 350 for determination of a class probability 370 and a bounding box 360. By using the enlarged second region 332A in conjunction with the first region 334A, the MS-CNN detector attempts to embed context information of a high level of the feature pyramid included in the MS-CNN detector. However, the richness of the context information corresponding to the enlarged second region 332A may be limited because the enlarged second region 332A and the first region 334A are both mapped to the same level of the feature pyramid (i.e., the conv4-3-2x layer) .
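For readers who want to trace the data flow of FIG. 3 concretely, the following is a minimal sketch, assuming PyTorch and torchvision are available; it is not the MS-CNN authors' implementation, and the input size, box coordinates and enlargement factor are illustrative assumptions only.

```python
# Illustrative sketch only (not the MS-CNN authors' code): pooling a proposed region
# and an enlarged context region from the SAME feature map, as in the sub-network of
# FIG. 3. Requires PyTorch and torchvision.
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 512, 128, 128)          # conv4-3-2x-like map (H/4 x W/4 x 512 for a 512x512 image)

x1, y1, x2, y2 = 100., 120., 180., 220.       # proposed region in image coordinates
cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
w, h = (x2 - x1) * 1.5, (y2 - y1) * 1.5       # context region 1.5x the proposed region (illustrative)
ctx = [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]

boxes = torch.tensor([[0., x1, y1, x2, y2],   # rows are (batch index, x1, y1, x2, y2)
                      [0., *ctx]])

# Both regions are pooled from the same H/4-resolution map (spatial_scale = 1/4).
pooled = roi_align(feat, boxes, output_size=(7, 7), spatial_scale=0.25)
f_obj, f_ctx = pooled[0], pooled[1]           # each 512 x 7 x 7, from the same level
fused = torch.cat([f_obj, f_ctx], dim=0)      # 1024 x 7 x 7 before further reduction
```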
In contrast, as is explained below with reference to FIGS. 4-6, a context-embedding, region-based object detection method according to at least some example embodiments disclosed herein includes embedding a context branch such that features corresponding to a proposed region of interest (RoI) and context information corresponding to one or more enlarged RoIs are extracted from multiple levels of the feature pyramid. Consequently, the richness of the extracted context information may be improved relative to context information of the MS-CNN detector, and thus, the object detection performance of the context-embedding, region-based object detection method according to at least some example embodiments may also be improved.
Examples of a convolutional neural network (CNN) architecture and an algorithm for performing the context-embedding, region-based object detection method according to at least some example embodiments will now be discussed in section IV of the present disclosure.
IV. Example CNN architecture and algorithm for implementing the context-embedding, region-based object detection method according to at least some example embodiments
According to at least some example embodiments, the CNN structures and algorithms discussed below with reference to FIGS. 4-7 may be implemented by the object detection device 100 discussed above with reference to FIGS. 1 and 2. Thus, any or all operations discussed below with reference to FIGS. 4-7 may be executed or controlled by the object detection device 100 (i.e., the processing unit 258) .
According to at least some example embodiments, a CNN architecture for implementing the context-embedding, region-based object detection method may include a backbone CNN and a feature pyramid network (FPN) which may be used together to implement one or both of a region proposal network (RPN) and a context-embedding, region-based object detection network.
For example, FIG. 4 illustrates a portion of a backbone CNN 400 according to at least some example embodiments. Further, one type of CNN that may serve as the backbone CNN 400 is the residual network CNN (i.e., ResNet), examples of which (including ResNet36 and ResNet50) are discussed, for example, in K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," Proc. IEEE Computer Vision and Pattern Recognition, 2016. For the purpose of simplicity, the structure of the backbone CNN 400 illustrated in FIG. 4 is the structure of the ResNet36 CNN. However, according to at least some example embodiments, the backbone CNN 400 is implemented by the ResNet50 CNN. Further, the backbone CNN 400 is not limited to the ResNet36 CNN and the ResNet50 CNN. According to at least some example embodiments, the backbone CNN 400 may be implemented by any CNN that generates multiple feature maps having different scales.
As is shown in FIG. 4, when the backbone CNN 400 is implemented by a ResNet, the backbone CNN 400 may include a plurality of convolution layers which output a plurality of reference feature maps, respectively. For example, the backbone CNN 400 illustrated in FIG. 4 includes a first convolutional layer conv1_x (not illustrated), a second convolutional layer conv2_x that outputs a second reference feature map C2, a third convolutional layer conv3_x that outputs a third reference feature map C3, a fourth convolutional layer conv4_x that outputs a fourth reference feature map C4, and a fifth convolutional layer conv5_x that outputs a fifth reference feature map C5. As will be discussed in greater detail below with reference to FIG. 5, the reference feature maps C2, C3, C4, and C5 may form the basis of an FPN.
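A minimal sketch of how such reference feature maps might be extracted, assuming a torchvision ResNet-50 stands in for the backbone CNN 400 (the patent's ResNet36/ResNet50 internals are not reproduced here):

```python
# Minimal sketch: extracting multi-scale reference feature maps {C2, C3, C4, C5}
# from a torchvision ResNet-50 used as a stand-in backbone. Requires torchvision.
import torch
import torchvision

backbone = torchvision.models.resnet50()

def reference_feature_maps(x):
    """Return the reference feature maps (C2, C3, C4, C5) for an input image batch x."""
    x = backbone.conv1(x)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    c2 = backbone.layer1(x)   # stride 4,  256 channels
    c3 = backbone.layer2(c2)  # stride 8,  512 channels
    c4 = backbone.layer3(c3)  # stride 16, 1024 channels
    c5 = backbone.layer4(c4)  # stride 32, 2048 channels
    return c2, c3, c4, c5

image = torch.randn(1, 3, 512, 512)   # hypothetical input image batch
C2, C3, C4, C5 = reference_feature_maps(image)
```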
FIG. 5 illustrates an FPN 500 according to at least some example embodiments. The FPN 500 may be constructed based on the reference feature maps (e.g., the second through fifth reference feature maps C2-C5) of the backbone CNN 400. Examples of FPNs are discussed in T. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature Pyramid Networks for Object Detection," Proc. IEEE Computer Vision and Pattern Recognition, 2017; T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, and Y. Chen, "Ron: Reverse connection with objectness prior networks for object detection," Proc. IEEE Computer Vision and Pattern Recognition, 2017; and Lin T Y, Goyal P, Girshick R, et al., "Focal Loss for Dense Object Detection," Proc. IEEE Computer Vision and Pattern Recognition, 2017. In contrast to the multi-scale feature maps of the MS-CNN detector discussed above with reference to FIG. 3, the FPN 500 employs a top-down architecture to create a feature pyramid that includes high-level semantic feature maps at all scales. For example, the FPN 500 creates final feature maps Pk0+2, Pk0+1, Pk0, Pk0-1, and Pk0-2 corresponding to reference feature maps Ck0+2, Ck0+1, Ck0, Ck0-1, and Ck0-2, respectively, where k0 is a constant, the value of which may be set, for example, in accordance with the preferences of a designer and/or user of the object detection device 100. The constant k0 will be discussed in greater detail below with reference to Equation 1 and FIGS. 6 and 7. Further, as is discussed in greater detail below with reference to FIGS. 6 and 7, the final feature maps P generated by the FPN 500 can be used for one or both of region proposal and context-embedding, region-based object detection.
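A minimal sketch of the top-down FPN construction in the spirit of the cited FPN paper, with channel counts chosen purely for illustration (they are not specified by this description):

```python
# Minimal FPN sketch: 1x1 lateral convolutions, top-down upsampling with element-wise
# addition, and 3x3 output convolutions producing final feature maps P2-P5.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.output = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, c2, c3, c4, c5):
        # Lateral projections of the reference feature maps.
        laterals = [lat(c) for lat, c in zip(self.lateral, (c2, c3, c4, c5))]
        # Top-down pathway: upsample the coarser map and add the finer lateral map.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        p2, p3, p4, p5 = [out(lat) for out, lat in zip(self.output, laterals)]
        return p2, p3, p4, p5
```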
FIG. 6 illustrates a diagram of a portion of a context-embedding, region-based object detection network 600 according to at least some example embodiments. FIG. 7 is a flow chart illustrating an example algorithm for performing the context-embedding, region-based object detection method according to at least some example embodiments. An example algorithm for performing the context-embedding, region-based object detection method according to at least some example embodiments will now be discussed with reference to FIGS. 4-7, with respect to an example scenario in which the algorithm is performed by the object detection device 100, and the object detection device 100 implements (i.e., embodies) the backbone CNN 400, FPN 500, and object detection network 600. Thus, operations described with respect to FIGS. 4-7 as being performed by the backbone CNN 400, FPN 500, or object detection network 600, or an element thereof, may be performed by the object detection device 100 (e.g., by the processing unit 258 of the object detection device 100 executing computer-readable program code corresponding to the operations of the backbone CNN 400, FPN 500, and object detection network 600).
Further, for the purpose of simplicity and ease of description, FIG. 7 will be explained with reference to detecting a single object included in an input image. However, the algorithm for performing the context-embedding, region-based object detection method according to at least some example embodiments is not limited to receiving an image including only one object, nor is the algorithm limited to detecting only one object. The input image can include several objects, and the algorithm is capable of detecting several objects of varying classes, locations and scales, concurrently.
Referring to FIG. 7, in step S710, the object detection device 100 receives an input image including an object. According to at least one example embodiment of the inventive concepts, the object detection device 100 may receive the input image as part of image data 120 received from the surveillance system 150, as is discussed above with reference to FIG. 1. After receiving the input image, the object detection device 100 may apply the received image as input to the backbone CNN 400. After step S710, the object detection device 100 proceeds to step S720.
In step S720, the object detection device 100 may generate reference feature maps. For example, the object detection device 100 may generate, using the backbone CNN 400, a plurality of reference feature maps based on the input image received in step S710.
For example, in step S720, the second to fifth convolutional layers {conv2_x, conv3_x, conv4_x, conv5_x} of the backbone CNN 400 may generate the second to fifth reference feature maps {C2, C3, C4, C5}, respectively. The reference feature maps {C2, C3, C4, C5} may each have different sizes/scales, which decrease from the second reference feature map C2 to the fifth reference feature map C5. After step S720, the object detection device 100 proceeds to step S730.
In step S730, the object detection device 100 may use an FPN to generate a feature pyramid including final feature maps. For example, the object detection device 100 may generate a feature pyramid including a plurality of final feature maps corresponding, respectively, to the plurality of reference feature maps generated in step S720.
For example, as is discussed above with reference to the FPN 500 illustrated in FIG. 5, in step S730, the FPN 500 may generate second to fifth final feature maps and, optionally, an additional sixth final feature map {P2, P3, P4, P5, P6}. The second to fifth final feature maps {P2, P3, P4, P5} may correspond, respectively, to the second to fifth reference feature maps {C2, C3, C4, C5} generated in step S720. The sixth final feature map P6 may be generated by the FPN 500 based on the fifth final feature map P5, for example, by performing a stride 2 subsampling of the fifth final feature map P5, as is discussed, for example, in T. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature Pyramid Networks for Object Detection," Proc. IEEE Computer Vision and Pattern Recognition, 2017. The final feature maps {P2, P3, P4, P5, P6} may each have different sizes/scales, which decrease from the second final feature map P2 to the sixth final feature map P6. After step S730, the object detection device 100 proceeds to step S740.
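A minimal sketch of the optional stride-2 subsampling that produces the sixth final feature map P6, assuming PyTorch; the tensor shape of P5 is an illustrative placeholder:

```python
# Sketch of the optional coarsest level: a stride-2 subsampling of P5, following the
# convention of the cited FPN paper.
import torch
import torch.nn.functional as F

P5 = torch.randn(1, 256, 16, 16)                 # placeholder for the P5 map from the FPN
P6 = F.max_pool2d(P5, kernel_size=1, stride=2)   # stride-2 subsampling -> 1 x 256 x 8 x 8
```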
In step S740, the object detection device 100 obtains a proposed region of interest (RoI or ROI) , and generates one or more context RoIs.
For example, according to at least some example embodiments, the object detection device 100 may obtain a proposed RoI from an external source. Alternatively, the object detection device 100 may obtain the proposed RoI by implementing a region proposal network (RPN) based on the FPN 500, and using the FPN-based RPN to generate a proposed RoI.
For example, according to at least some example embodiments, the final feature maps Pk0+2, Pk0+1, Pk0, Pk0-1, and Pk0-2 generated by the FPN 500 as is illustrated in FIG. 5 can be used to implement the FPN-based RPN. One of ordinary skill in the art will recognize that example methods of implementing an FPN-based RPN are discussed in T. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature Pyramid Networks for Object Detection," Proc. IEEE Computer Vision and Pattern Recognition, 2017. For example, when k0 = 4, the FPN 500 generates the second through sixth final feature maps P2, P3, P4, P5, and P6. The sixth final feature map P6 may be generated based on the fifth final feature map P5 in the same manner discussed above with reference to step S730. Further, in order to generate region proposals, the FPN-based RPN may use anchors of three different aspect ratios {1:2, 1:1, 2:1} for each of the second through sixth final feature maps P2-P6, such that the anchors used on the five different final feature maps {P2, P3, P4, P5, P6} have five different areas {32², 64², 128², 256², 512²}, respectively.
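A minimal sketch of the anchor shapes implied by the three aspect ratios and five per-level areas described above; the helper name and box parameterization are illustrative assumptions:

```python
# Sketch of the anchor shapes used by the FPN-based RPN: one base area per pyramid
# level {P2..P6} and three aspect ratios per level.
import math

areas = [32 ** 2, 64 ** 2, 128 ** 2, 256 ** 2, 512 ** 2]   # one area per level P2..P6
ratios = [0.5, 1.0, 2.0]                                    # 1:2, 1:1, 2:1 (height/width)

def anchor_shape(area, ratio):
    """Return (width, height) of an anchor with the given area and height-to-width ratio."""
    w = math.sqrt(area / ratio)
    h = w * ratio
    return w, h

for level, area in zip(("P2", "P3", "P4", "P5", "P6"), areas):
    shapes = [anchor_shape(area, r) for r in ratios]
    print(level, [(round(w), round(h)) for w, h in shapes])
```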
Thus, in step S740, the object detection device 100 may obtain the proposed RoI by either receiving the proposed RoI or generating the proposed RoI.
Further, in step S740, based on the obtained proposed RoI, the object detection device 100 may obtain one or more context RoIs by enlarging the proposed RoI. For example, FIG. 6 illustrates an input image 605, a proposed RoI 610, and first and second context RoIs 615A and 615B. According to at least some example embodiments of the inventive concepts, the object detection network 600 generates the first context RoI 615A by enlarging the area (i.e., w×h) of the proposed RoI 610 by a factor s1, and the object detection network 600 generates the second context RoI 615B by enlarging the area (i.e., w×h) of the proposed RoI 610 by a factor s2, where 'w' is the width of the proposed RoI 610, 'h' is the height of the proposed RoI 610, and s1 and s2 are both positive numbers greater than 1. In the example illustrated in FIG. 6, s1 = 2² and s2 = 4². Further, according to at least some example embodiments, the object detection network 600 may determine coordinates for context RoIs, which are generated by enlarging a proposed RoI, in such a manner that the context RoIs are concentric with the proposed RoI.
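A minimal sketch of concentric context-RoI generation under the area factors s1 = 2² and s2 = 4²; the box format and example coordinates are illustrative assumptions:

```python
# Sketch of context-RoI generation: enlarge a proposed RoI's area by a factor s while
# keeping the enlarged box concentric with the original, as described above.
import torch

def make_context_roi(boxes, s):
    """Enlarge (x1, y1, x2, y2) boxes so their area grows by factor s, keeping centers fixed."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    half_w = (boxes[:, 2] - boxes[:, 0]) / 2 * s ** 0.5
    half_h = (boxes[:, 3] - boxes[:, 1]) / 2 * s ** 0.5
    return torch.stack([cx - half_w, cy - half_h, cx + half_w, cy + half_h], dim=1)

proposed = torch.tensor([[100., 150., 220., 260.]])   # hypothetical proposed RoI
context_1 = make_context_roi(proposed, 2 ** 2)        # s1 = 2^2
context_2 = make_context_roi(proposed, 4 ** 2)        # s2 = 4^2
```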
Further, step S740 is described as obtaining “a proposed RoI” for the purpose of simplicity and ease of description. However, the algorithm for performing the context-embedding, region-based object detection method according to at least some example embodiments is not limited to obtaining just one RoI or just one RoI at a time. For example, the object detection device 100 is capable of obtaining several RoIs of varying locations, scales and aspect ratios, concurrently, in step S740.
Further, though step S740 is described above with reference to an example scenario in which two context RoIs (i.e., two enlarged versions of the proposed RoI 610) are generated, according to at least some example embodiments, any number of context RoIs (e.g., 1, 3, 5, etc.) may be generated by enlarging the proposed RoI 610. After step S740, the object detection device 100 proceeds to step S750.
In step S750, the object detection device 100 assigns the proposed RoI and the one or more context RoIs to final feature maps. For example, in step S750, the object detection device 100 may assign the proposed RoI 610, the first context RoI 615A, and the second context RoI 615B to final feature maps, e.g., from among the final feature maps {P2, P3, P4, P5, P6} generated in step S730.
For example, to perform the above-referenced assigning, the object detection device 100 may use the following equation:
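Equation 1 appears as an image in the published application and is not reproduced in this text; consistent with the FPN level-assignment rule cited below, it may be reconstructed as

$$k = \left\lfloor k_0 + \log_2\!\left(\sqrt{w \times h}\,/\,224\right)\right\rfloor \qquad \text{(Equation 1)}$$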
In Equation 1, 'w' represents the width of the RoI, 'h' represents the height of the RoI, and k0 is a constant, the value of which may be set, for example, in accordance with the preferences of a designer and/or user of the object detection device 100. In the example scenario illustrated in FIG. 6, k0 = 4, which corresponds to an RoI area of 224² (i.e., w×h = 224²). Equation 1, including additional details for setting k0, is discussed, for example, in T. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature Pyramid Networks for Object Detection," Proc. IEEE Computer Vision and Pattern Recognition, 2017.
For each of the proposed RoI 610, the first context RoI 615A, and the second context RoI 615B, the object detection device 100 may apply the width 'w' and height 'h' of the RoI to Equation 1, above, to obtain the output k, and assign the RoI to the k-th final feature map Pk. For example, in the example scenario illustrated in FIG. 6, when the width w and height h of the proposed RoI 610 are applied to Equation 1, k = 3. Accordingly, the object detection network 600 assigns the proposed RoI 610 to the third final feature map P3, as is illustrated in FIG. 6. Similarly, when the widths w and heights h of the first and second context RoIs 615A and 615B are applied to Equation 1, k = 4 and 5, respectively. Accordingly, the object detection network 600 assigns the first and second context RoIs 615A and 615B to the fourth and fifth final feature maps P4 and P5, respectively, as is illustrated in FIG. 6. After step S750, the object detection device 100 proceeds to step S760.
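A minimal sketch of the level-assignment rule, assuming the reconstructed form of Equation 1 above; the clamping range is an illustrative assumption:

```python
# Sketch of the level-assignment rule (Equation 1): each RoI (proposed or context) is
# mapped to a pyramid level k based on its area, then clamped to the available levels.
import math

def assign_level(w, h, k0=4, k_min=2, k_max=5):
    """Return the pyramid level index k such that the RoI of size w x h is assigned to P_k."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k_max, k))

# Example scenario similar to FIG. 6 (RoI sizes are hypothetical):
print(assign_level(112, 112))   # proposed RoI       -> 3
print(assign_level(224, 224))   # first context RoI  -> 4
print(assign_level(448, 448))   # second context RoI -> 5
```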
In step S760, the object detection device 100 extracts a set of features from each final feature map to which one of the RoIs is assigned, using RoI pooling. For example, in step S760, the object detection network 600 embodied by the object detection device 100 may perform RoI pooling with respect to the proposed RoI 610 and the final feature map to which the proposed RoI 610 is assigned. Specifically, with respect to the proposed RoI 610, the object detection network 600 performs RoI pooling on the final feature map to which the proposed RoI 610 is assigned (i.e., the third final feature map P3) such that the features of the third final feature map P3 which fall within the proposed RoI 610 are pooled, by an RoI pooling operation, to generate a fixed-size original feature map 620. Thus, the fixed-size original feature map 620 is a set of features extracted from the third final feature map P3 based on the RoI that was originally proposed, the proposed RoI 610.
Further, in step S760, the object detection network 600 forms a context branch 630 by performing RoI pooling on the first context RoI 615A and the second context RoI 615B and the final feature maps to which the first context RoI 615A and the second context RoI 615B are assigned. Specifically, with respect to the first and second context RoIs 615A and 615B, the object detection network 600 performs RoI pooling on the final feature maps to which the first and second context RoIs 615A and 615B are respectively assigned (i.e., the fourth and fifth final feature maps P4 and P5) such that the features of the fourth final feature map P4 which fall within the first context RoI 615A are pooled, by an RoI pooling operation, to generate a first fixed-size context feature map 632, and the features of the fifth final feature map P5 which fall within the second context RoI 615B are pooled, by an RoI pooling operation, to generate a second fixed-size context feature map 634. Thus, the first fixed-size context feature map 632 is a set of features extracted from the fourth final feature map P4 based on the first context RoI 615A, and the second fixed-size context feature map 634 is a set of features extracted from the fifth final feature map P5 based on the second context RoI 615B.
According to at least some example embodiments, the RoI pooling operations discussed above with reference to step S760 may be performed by using the operations of the RoI pooling layer discussed in R. Girshick, "Fast R-CNN," Computer Science, 2015. Alternatively, according to at least some example embodiments, the RoI pooling operations discussed above with reference to step S760 may be performed by using the operations of the RoIAlign layer. Examples of the RoIAlign layer are discussed, for example, in K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," Proc. IEEE International Conference on Computer Vision (ICCV), 2017. After step S760, the object detection device 100 then proceeds to step S770.
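A minimal sketch of the step S760 feature extraction using torchvision's roi_align (standing in for either the RoI pooling layer or the RoIAlign layer); feature-map sizes, strides, box coordinates and the 7 × 7 output resolution are illustrative assumptions:

```python
# Sketch: the proposed RoI and the two context RoIs are pooled from DIFFERENT pyramid
# levels (P3, P4, P5), then concatenated for the subsequent determination step.
import torch
from torchvision.ops import roi_align

P3 = torch.randn(1, 256, 128, 128)   # level assigned to the proposed RoI (stride 8)
P4 = torch.randn(1, 256, 64, 64)     # level assigned to the first context RoI (stride 16)
P5 = torch.randn(1, 256, 32, 32)     # level assigned to the second context RoI (stride 32)

roi      = torch.tensor([[0., 400., 400., 512., 512.]])   # (batch index, x1, y1, x2, y2)
ctx_roi1 = torch.tensor([[0., 344., 344., 568., 568.]])   # concentric, 2^2 times the area
ctx_roi2 = torch.tensor([[0., 232., 232., 680., 680.]])   # concentric, 4^2 times the area

f_roi  = roi_align(P3, roi,      output_size=(7, 7), spatial_scale=1 / 8)   # fixed-size original features
f_ctx1 = roi_align(P4, ctx_roi1, output_size=(7, 7), spatial_scale=1 / 16)  # first context features
f_ctx2 = roi_align(P5, ctx_roi2, output_size=(7, 7), spatial_scale=1 / 32)  # second context features

concat = torch.cat([f_roi, f_ctx1, f_ctx2], dim=1)   # concatenated feature map for step S770
```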
In step S770, the object detection device 100 determines a class and/or location of the object included in the image. For example, in step S770, the object detection network 600 may perform context embedding by concatenating the first and second fixed-size context feature maps 632 and 634 to the fixed-size original feature map 620, thereby forming the concatenated feature map 625, as is shown in FIG. 6.
Further, in contrast to the MS-CNN object detection sub-network 300 discussed above with respect to FIG. 3, the object detection network 600 may obtain richer context features and improved object detection results because the features included in the concatenated feature map 625 were not all extracted from the same convolutional layer or the same level of the feature pyramid {P2, P3, P4, P5, P6}.
As is also shown in FIG. 6, the object detection network 600 includes a squeeze-and-excitation (SE) block 640 and may apply the concatenated feature map 625 to the SE block 640 in order to reduce or, alternatively, eliminate noise information, for example, by recalibrating channel-wise feature responses. The SE block 640 performs two steps: squeeze and excitation. The first step is to squeeze global spatial information into a channel descriptor. This is achieved by using global average pooling to generate the channel-wise statistics. The second step is adaptive recalibration. For example, the SE block 640 may include a fully connected layer fc1 followed by a rectified linear unit (ReLU), whose output has the dimensions 1 × 1 × C'. Further, the SE block 640 may include another fully connected layer fc2 followed by a sigmoid, the output of which has the dimensions 1 × 1 × C (where, generally, C' = C/16) and is used to rescale the initial features of the concatenated feature map 625, for example, via channel-wise multiplication, as is shown in FIG. 6. Example structures and methods for constructing and using SE blocks are described, for example, in Hu, Jie, Li Shen, and Gang Sun, "Squeeze-and-Excitation Networks," arXiv:1709.01507, 2017.
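A minimal sketch of a squeeze-and-excitation block along the lines described above, assuming PyTorch; the reduction ratio of 16 follows the C' = C/16 relationship noted in the text:

```python
# Sketch of an SE block: global average pooling (squeeze), two fully connected layers
# with ReLU and sigmoid (excitation), and channel-wise rescaling of the input features.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)   # output dimension C'
        self.fc2 = nn.Linear(channels // reduction, channels)   # output dimension C
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                     # x: N x C x H x W
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))                # squeeze: global average pooling -> N x C
        s = self.sigmoid(self.fc2(self.relu(self.fc1(s))))   # excitation -> N x C
        return x * s.view(n, c, 1, 1)         # channel-wise rescaling of the input
```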
Next, a class and bounding box (i.e., location) of the object included in the input image 605 are determined by using the output of the SE block 640 to generate class probability values 660 and bounding box values 670. For example, the output of the SE block 640 may be applied to another fully connected layer 650 in order to generate the class probability values (or class labels) 660 and the bounding box values 670.
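A minimal sketch of the fully connected head that produces class scores and bounding box values from the recalibrated features; the hidden width and class count are illustrative assumptions:

```python
# Sketch of the head following the SE block: flatten the recalibrated features and feed
# them to a fully connected layer that branches into class scores and box values.
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    def __init__(self, in_features, num_classes=81):
        super().__init__()
        self.fc = nn.Linear(in_features, 1024)
        self.cls_score = nn.Linear(1024, num_classes)       # class probability values
        self.bbox_pred = nn.Linear(1024, 4 * num_classes)   # bounding box values

    def forward(self, x):
        x = torch.relu(self.fc(torch.flatten(x, start_dim=1)))
        return self.cls_score(x), self.bbox_pred(x)

# e.g., head = DetectionHead(in_features=3 * 256 * 7 * 7) for three concatenated 256 x 7 x 7 maps
```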
Object detection utilizes bounding boxes to accurately locate where objects are and assign the objects correct class labels. When image patches or frames of video are used as the input image in step S710, the class probability values 660 and bounding box values 670 are object detection results of the context-embedding, region-based object detection method discussed above with reference to FIGS. 4-7.
At least some example embodiments of the context-embedding, region-based object detection method discussed above with reference to FIGS. 4-7 can be applied to a wide variety of functions, including autonomous driving systems and video surveillance, as is discussed above with respect to FIG. 1. For example, referring to FIG. 1, when the camera 152 of the surveillance network 10 is placed at the entrance of a subway station, the object detection device 100 implementing the context-embedding, region-based object detection method discussed above with reference to FIGS. 4-7 can help count the pedestrian flow through the subway station. In addition, when the camera 152 of the surveillance network 10 is placed in a market, the object detection device 100 implementing the context-embedding, region-based object detection method according to at least some example embodiments can help count the number of customers in the market, thereby enabling an owner or operator of the market to control the number of customers, for example, for safety reasons.
Further, the context-embedding, region-based object detection method according to at least some example embodiments includes enlarging the size of the original RoI (e.g., proposed RoI 610) in order to obtain more context information using the enlarged RoIs (e.g., first and second context RoIs 615A and 615B) . Further, the enlarged RoIs are mapped to a different feature map than the original RoIs, thereby boosting the representation power of the context information obtained via the enlarged RoIs. Thus, the obtained context information is beneficial for the task of detecting small and occluded objects in the input image.
Example methods of training a CNN architecture to perform the context-embedding, region-based object detection method discussed above with reference to FIGS. 4-7 will now be discussed below in section V.
V. Example training methods
The CNN architecture for performing the context-embedding, region-based object detection method discussed above with reference to FIGS. 4-7 can be trained in accordance with known CNN training techniques, for example, to set the various values of the filters used in the various convolutional layers (e.g., the filters of the first through fifth convolutional layers conv1_x-conv5_x of the backbone CNN 400 illustrated in FIG. 4).
To begin the training stage, a proper loss function is designed. For the task of object detection, a multi-task loss function may be used. An example of a multi-task loss function is discussed, for example, in Lin T Y, Goyal P, Girshick R, et al., "Focal Loss for Dense Object Detection," Proc. IEEE Computer Vision and Pattern Recognition, 2017. Further, according to at least some example embodiments, training may be performed by using the Common Objects in Context (COCO) train and val-minus-minival data sets as training data. With the technique of back-propagation, the parameters of the above-referenced filters are iteratively updated until convergence by the stochastic gradient descent (SGD) algorithm.
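A minimal sketch of a multi-task loss (classification plus box regression) and an SGD update consistent with the training outline above; the loss weighting and optimizer settings are illustrative assumptions, not values from this description:

```python
# Sketch of a multi-task detection loss and a stochastic gradient descent update.
import torch
import torch.nn.functional as F

def multi_task_loss(cls_logits, cls_targets, box_preds, box_targets, box_weight=1.0):
    """Classification loss plus weighted box-regression loss."""
    cls_loss = F.cross_entropy(cls_logits, cls_targets)
    box_loss = F.smooth_l1_loss(box_preds, box_targets)
    return cls_loss + box_weight * box_loss

# Hypothetical training step (model, data and targets assumed to exist):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# loss = multi_task_loss(cls_logits, cls_targets, box_preds, box_targets)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```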
Example embodiments being thus described, it will be obvious that embodiments may be varied in many ways. Such variations are not to be regarded as a departure from example embodiments, and all such modifications are intended to be included within the scope of example embodiments.
Claims (20)
- A method of detecting an object in an image using a convolutional neural network (CNN), the method comprising:
generating, by the CNN, a plurality of reference feature maps based on the image;
generating a feature pyramid including a plurality of final feature maps corresponding, respectively, to the plurality of reference feature maps;
obtaining a proposed region of interest (ROI);
generating at least a first context ROI based on the proposed ROI such that an area of the first context ROI is larger than an area of the proposed ROI;
assigning the proposed ROI to a first final feature map from among the plurality of final feature maps;
assigning the first context ROI to a second final feature map from among the plurality of final feature maps, a size of the first final feature map being different than a size of the second final feature map;
extracting a first set of features from the first final feature map by performing an ROI pooling operation on the first final feature map using the proposed ROI;
extracting a second set of features from the second final feature map by performing an ROI pooling operation on the second final feature map using the first context ROI; and
determining, based on the first and second sets of extracted features, at least one of a location of the object with respect to the image and a class of the object.
- The method of claim 1, wherein the feature pyramid is generated based on the plurality of reference feature maps in accordance with a feature pyramid network (FPN) architecture.
- The method of claim 1, wherein the area of the first context ROI is 2² times the area of the proposed ROI.
- The method of claim 1, further comprising:
concatenating the first and second sets of extracted features,
wherein the determining includes determining, based on the concatenated sets of extracted features, at least one of a location of the object with respect to the image and a class of the object.
- The method of claim 4, further comprising:
applying the concatenated sets of extracted features to a squeeze-and-excitation block (SEB),
wherein the at least one of a location of the object with respect to the image and a class of the object is determined based on an output of the SEB.
- The method of claim 1, further comprising:
generating a second context ROI based on the proposed ROI such that an area of the second context ROI is larger than an area of the first context ROI;
assigning the second context ROI to a third final feature map from among the plurality of final feature maps, a size of the third final feature map being different than the sizes of the first and second final feature maps; and
extracting a third set of features from the third final feature map by performing ROI pooling on the third final feature map using the second context ROI,
wherein the determining includes determining, based on the first, second and third sets of extracted features, at least one of the location of the object with respect to the image and the class of the object.
- The method of claim 6, wherein the feature pyramid is generated based on the plurality of reference feature maps in accordance with a feature pyramid network (FPN) architecture.
- The method of claim 6, wherein the area of the first context ROI is 2² times the area of the proposed ROI, and the area of the second context ROI is 4² times the area of the proposed ROI.
- The method of claim 6, further comprising:
concatenating the first, second and third sets of extracted features,
wherein the determining includes determining, based on the concatenated sets of extracted features, at least one of a location of the object with respect to the image and a class of the object.
- The method of claim 9, further comprising:
applying the concatenated sets of extracted features to a squeeze-and-excitation block (SEB),
wherein the at least one of a location of the object with respect to the image and a class of the object is determined based on an output of the SEB.
- A computer-readable medium comprising program instructions for causing an apparatus to perform at least the following:
generating, by a convolutional neural network (CNN), a plurality of reference feature maps based on an image that includes an object;
generating a feature pyramid including a plurality of final feature maps corresponding, respectively, to the plurality of reference feature maps;
obtaining a proposed region of interest (ROI);
generating at least a first context ROI based on the proposed ROI such that an area of the first context ROI is larger than an area of the proposed ROI;
assigning the proposed ROI to a first final feature map from among the plurality of final feature maps;
assigning the first context ROI to a second final feature map from among the plurality of final feature maps, a size of the first final feature map being different than a size of the second final feature map;
extracting a first set of features from the first final feature map by performing an ROI pooling operation on the first final feature map using the proposed ROI;
extracting a second set of features from the second final feature map by performing an ROI pooling operation on the second final feature map using the first context ROI; and
determining, based on the first and second sets of extracted features, at least one of a location of the object with respect to the image and a class of the object.
- The computer-readable medium of claim 11, wherein the feature pyramid is generated based on the plurality of reference feature maps in accordance with a feature pyramid network (FPN) architecture.
- The computer-readable medium of claim 11, wherein the area of the first context ROI is 2² times the area of the proposed ROI.
- The computer-readable medium of claim 11, further comprising program instructions for causing an apparatus to perform at least the following:
concatenating the first and second sets of extracted features,
wherein the determining includes determining, based on the concatenated sets of extracted features, at least one of a location of the object with respect to the image and a class of the object.
- The computer-readable medium of claim 14, further comprising program instructions for causing an apparatus to perform at least the following:
applying the concatenated sets of extracted features to a squeeze-and-excitation block (SEB),
wherein the at least one of a location of the object with respect to the image and a class of the object is determined based on an output of the SEB.
- An apparatus comprising:
at least one processor; and
at least one memory including computer program code,
the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform,
generating, by a convolutional neural network (CNN), a plurality of reference feature maps based on an image that includes an object;
generating a feature pyramid including a plurality of final feature maps corresponding, respectively, to the plurality of reference feature maps;
obtaining a proposed region of interest (ROI);
generating at least a first context ROI based on the proposed ROI such that an area of the first context ROI is larger than an area of the proposed ROI;
assigning the proposed ROI to a first final feature map from among the plurality of final feature maps;
assigning the first context ROI to a second final feature map from among the plurality of final feature maps, a size of the first final feature map being different than a size of the second final feature map;
extracting a first set of features from the first final feature map by performing an ROI pooling operation on the first final feature map using the proposed ROI;
extracting a second set of features from the second final feature map by performing an ROI pooling operation on the second final feature map using the first context ROI; and
determining, based on the first and second sets of extracted features, at least one of a location of the object with respect to the image and a class of the object.
- The apparatus of claim 16, wherein the feature pyramid is generated based on the plurality of reference feature maps in accordance with a feature pyramid network (FPN) architecture.
- The apparatus of claim 16, wherein the area of the first context ROI is twice the area of the proposed ROI.
- The apparatus of claim 16, wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to perform:
concatenating the first and second sets of extracted features,
wherein the determining includes determining, based on the concatenated sets of extracted features, at least one of a location of the object with respect to the image and a class of the object.
- The apparatus of claim 19, wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to perform:
applying the concatenated sets of extracted features to a squeeze-and-excitation block (SEB),
wherein the at least one of a location of the object with respect to the image and a class of the object is determined based on an output of the SEB.
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2021520139A JP7192109B2 (en) | 2018-10-12 | 2018-10-12 | Method and Apparatus for Context Embedding and Region-Based Object Detection |
PCT/CN2018/110023 WO2020073310A1 (en) | 2018-10-12 | 2018-10-12 | Method and apparatus for context-embedding and region-based object detection |
US17/283,276 US11908160B2 (en) | 2018-10-12 | 2018-10-12 | Method and apparatus for context-embedding and region-based object detection |
EP18936706.3A EP3864621A4 (en) | 2018-10-12 | 2018-10-12 | Method and apparatus for context-embedding and region-based object detection |
CN201880099562.2A CN113168705A (en) | 2018-10-12 | 2018-10-12 | Method and apparatus for context-embedded and region-based object detection |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2018/110023 WO2020073310A1 (en) | 2018-10-12 | 2018-10-12 | Method and apparatus for context-embedding and region-based object detection |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020073310A1 true WO2020073310A1 (en) | 2020-04-16 |
Family
ID=70164352
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/110023 WO2020073310A1 (en) | 2018-10-12 | 2018-10-12 | Method and apparatus for context-embedding and region-based object detection |
Country Status (5)
Country | Link |
---|---|
US (1) | US11908160B2 (en) |
EP (1) | EP3864621A4 (en) |
JP (1) | JP7192109B2 (en) |
CN (1) | CN113168705A (en) |
WO (1) | WO2020073310A1 (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111950612A (en) * | 2020-07-30 | 2020-11-17 | 中国科学院大学 | FPN-based weak and small target detection method for fusion factor |
CN112150462A (en) * | 2020-10-22 | 2020-12-29 | 北京百度网讯科技有限公司 | Method, device, equipment and storage medium for determining target anchor point |
CN112419227A (en) * | 2020-10-14 | 2021-02-26 | 北京大学深圳研究生院 | Underwater target detection method and system based on small target search scaling technology |
CN112446327A (en) * | 2020-11-27 | 2021-03-05 | 中国地质大学(武汉) | Remote sensing image target detection method based on non-anchor frame |
CN112491891A (en) * | 2020-11-27 | 2021-03-12 | 杭州电子科技大学 | Network attack detection method based on hybrid deep learning in Internet of things environment |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11494616B2 (en) * | 2019-05-09 | 2022-11-08 | Shenzhen Malong Technologies Co., Ltd. | Decoupling category-wise independence and relevance with self-attention for multi-label image classification |
FR3103938B1 (en) * | 2019-12-03 | 2021-11-12 | Idemia Identity & Security France | Method of detecting at least one element of interest visible in an input image using a convolutional neural network |
JP6800453B1 (en) * | 2020-05-07 | 2020-12-16 | 株式会社 情報システムエンジニアリング | Information processing device and information processing method |
KR102687632B1 (en) * | 2022-11-22 | 2024-07-24 | 주식회사 슈퍼브에이아이 | Method for tracking objects for low frame-rate videos and object tracking device using the same |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106203432A (en) * | 2016-07-14 | 2016-12-07 | 杭州健培科技有限公司 | A kind of localization method of area-of-interest based on convolutional Neural net significance collection of illustrative plates |
CN106339680A (en) * | 2016-08-25 | 2017-01-18 | 北京小米移动软件有限公司 | Human face key point positioning method and device |
WO2017015947A1 (en) * | 2015-07-30 | 2017-02-02 | Xiaogang Wang | A system and a method for object tracking |
US20180150956A1 (en) * | 2016-11-25 | 2018-05-31 | Industrial Technology Research Institute | Character recognition systems and character recognition methods thereof using convolutional neural network |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9965719B2 (en) * | 2015-11-04 | 2018-05-08 | Nec Corporation | Subcategory-aware convolutional neural networks for object detection |
US20180039853A1 (en) | 2016-08-02 | 2018-02-08 | Mitsubishi Electric Research Laboratories, Inc. | Object Detection System and Object Detection Method |
US10354159B2 (en) * | 2016-09-06 | 2019-07-16 | Carnegie Mellon University | Methods and software for detecting objects in an image using a contextual multiscale fast region-based convolutional neural network |
CN107463892A (en) * | 2017-07-27 | 2017-12-12 | 北京大学深圳研究生院 | Pedestrian detection method in a kind of image of combination contextual information and multi-stage characteristics |
CN107871126A (en) * | 2017-11-22 | 2018-04-03 | 西安翔迅科技有限责任公司 | Model recognizing method and system based on deep-neural-network |
EP3729377A4 (en) * | 2017-12-18 | 2020-12-23 | Shanghai United Imaging Healthcare Co., Ltd. | Systems and methods for determining scanning parameter in imaging |
CN108090203A (en) * | 2017-12-25 | 2018-05-29 | Shanghai Qiniu Information Technology Co., Ltd. | Video classification method, device, storage medium and electronic device |
CN108304820B (en) * | 2018-02-12 | 2020-10-13 | Tencent Technology (Shenzhen) Co., Ltd. | Face detection method and device, and terminal device |
2018
- 2018-10-12 EP EP18936706.3A patent/EP3864621A4/en active Pending
- 2018-10-12 CN CN201880099562.2A patent/CN113168705A/en active Pending
- 2018-10-12 US US17/283,276 patent/US11908160B2/en active Active
- 2018-10-12 WO PCT/CN2018/110023 patent/WO2020073310A1/en unknown
- 2018-10-12 JP JP2021520139A patent/JP7192109B2/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017015947A1 (en) * | 2015-07-30 | 2017-02-02 | Xiaogang Wang | A system and a method for object tracking |
CN106203432A (en) * | 2016-07-14 | 2016-12-07 | Hangzhou Jianpei Technology Co., Ltd. | Region-of-interest localization method based on convolutional neural network saliency maps |
CN106339680A (en) * | 2016-08-25 | 2017-01-18 | Beijing Xiaomi Mobile Software Co., Ltd. | Face key point positioning method and device |
US20180150956A1 (en) * | 2016-11-25 | 2018-05-31 | Industrial Technology Research Institute | Character recognition systems and character recognition methods thereof using convolutional neural network |
Non-Patent Citations (12)
Title |
---|
HU, JIE; SHEN, LI; SUN, GANG: "Squeeze-and-excitation networks", ARXIV:1709.01507, 2017 |
K. HE; X. ZHANG; S. REN; J. SUN: "Deep Residual Learning for Image Recognition", PROC. IEEE COMPUTER VISION AND PATTERN RECOGNITION, 2016 |
K. HE; G. GKIOXARI; P. DOLLAR; R. GIRSHICK: "Mask R-CNN", ICCV, 2018 |
LIN, T. Y.; GOYAL, P.; GIRSHICK, R. ET AL.: "Focal Loss for Dense Object Detection", PROC. IEEE COMPUTER VISION AND PATTERN RECOGNITION, 2017 |
R. GIRSHICK: "Fast R-CNN", COMPUTER SCIENCE, 2015 |
REN, SHAOQING ET AL.: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 39, no. 6, XP055847873, DOI: 10.1109/TPAMI.2016.2577031 |
S. REN; K. HE; R. GIRSHICK; J. SUN: "Faster R-CNN: towards real-time object detection with region proposal networks", INTERNATIONAL CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS, 2015, pages 91-99 |
See also references of EP3864621A4 |
T. KONG; F. SUN; A. YAO; H. LIU; M. LU; Y. CHEN: "RON: Reverse connection with objectness prior networks for object detection", PROC. IEEE COMPUTER VISION AND PATTERN RECOGNITION, 2017 |
T. LIN; P. DOLLAR; R. GIRSHICK; K. HE; B. HARIHARAN; S. BELONGIE: "Feature Pyramid Networks for Object Detection", PROC. IEEE COMPUTER VISION AND PATTERN RECOGNITION, 2017 |
Y. S. CAO; X. NIU; Y. DOU: "Region-based convolutional neural networks for object detection in very high resolution remote sensing images", INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION, FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY, 2016 |
Z. CAI; Q. FAN; R. S. FERIS; N. VASCONCELOS: "A unified multi-scale deep convolutional neural network for fast object detection", EUROPEAN CONFERENCE ON COMPUTER VISION, SPRINGER, CHAM, 2016 |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111950612A (en) * | 2020-07-30 | 2020-11-17 | University of Chinese Academy of Sciences | FPN-based weak and small target detection method using a fusion factor |
CN112419227A (en) * | 2020-10-14 | 2021-02-26 | Peking University Shenzhen Graduate School | Underwater target detection method and system based on small target search scaling technology |
CN112419227B (en) * | 2020-10-14 | 2024-02-20 | Peking University Shenzhen Graduate School | Underwater target detection method and system based on small target search scaling technology |
CN112150462A (en) * | 2020-10-22 | 2020-12-29 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method, device, equipment and storage medium for determining target anchor point |
CN112150462B (en) * | 2020-10-22 | 2023-12-22 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method, device, equipment and storage medium for determining target anchor point |
US11915466B2 (en) | 2020-10-22 | 2024-02-27 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for determining target anchor, device and storage medium |
CN112446327A (en) * | 2020-11-27 | 2021-03-05 | China University of Geosciences (Wuhan) | Anchor-free remote sensing image target detection method |
CN112491891A (en) * | 2020-11-27 | 2021-03-12 | Hangzhou Dianzi University | Network attack detection method based on hybrid deep learning in an Internet of Things environment |
CN112491891B (en) * | 2020-11-27 | 2022-05-17 | Hangzhou Dianzi University | Network attack detection method based on hybrid deep learning in an Internet of Things environment |
CN112446327B (en) * | 2020-11-27 | 2022-06-07 | China University of Geosciences (Wuhan) | Anchor-free remote sensing image target detection method |
Also Published As
Publication number | Publication date |
---|---|
EP3864621A4 (en) | 2022-05-04 |
JP7192109B2 (en) | 2022-12-19 |
US11908160B2 (en) | 2024-02-20 |
US20210383166A1 (en) | 2021-12-09 |
JP2022504774A (en) | 2022-01-13 |
CN113168705A (en) | 2021-07-23 |
EP3864621A1 (en) | 2021-08-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11908160B2 (en) | Method and apparatus for context-embedding and region-based object detection | |
CN109815843B (en) | Image processing method and related product | |
US9928708B2 (en) | Real-time video analysis for security surveillance | |
US11393256B2 (en) | Method and device for liveness detection, and storage medium | |
Lindeberg | Scale selection | |
US10839537B2 (en) | Depth maps generated from a single sensor | |
CN110580428A (en) | image processing method, image processing device, computer-readable storage medium and electronic equipment | |
CN109413411B (en) | Black screen identification method and device of monitoring line and server | |
CN110033481A (en) | Method and apparatus for carrying out image procossing | |
KR102199094B1 (en) | Method and Apparatus for Learning Region of Interest for Detecting Object of Interest | |
CN110472599B (en) | Object quantity determination method and device, storage medium and electronic equipment | |
Hasinoff | Saturation (imaging) | |
CN111598065A (en) | Depth image acquisition method, living body identification method, apparatus, circuit, and medium | |
Fisher | Subpixel estimation | |
CN110826364A (en) | Stock position identification method and device | |
CN112036342A (en) | Document snapshot method, device and computer storage medium | |
Faraji et al. | Simplified active calibration | |
US11657608B1 (en) | Method and system for video content analysis | |
CN113255405B (en) | Parking space line identification method and system, parking space line identification equipment and storage medium | |
CN109035328B (en) | Method, system, device and storage medium for identifying image directivity | |
Shotton et al. | Semantic image segmentation: Traditional approach | |
TWI819219B (en) | Photographing method for dynamic scene compensation and a camera using the method | |
CN111862106A (en) | Image processing method based on light field semantics, computer device and storage medium | |
Ramanath et al. | Standard Illuminants | |
Matsushita | Shape from shading |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18936706; Country of ref document: EP; Kind code of ref document: A1 |
 | ENP | Entry into the national phase | Ref document number: 2021520139; Country of ref document: JP; Kind code of ref document: A |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | ENP | Entry into the national phase | Ref document number: 2018936706; Country of ref document: EP; Effective date: 20210512 |