WO2017124221A1 - System and method for object detection - Google Patents

System and method for object detection

Info

Publication number: WO2017124221A1
Authority: WIPO (PCT)
Prior art keywords: cluster, clusters, objects, bounding box, CNNs
Application number: PCT/CN2016/071193
Other languages: French (fr)
Inventors: Xiaogang Wang, Wanli OUYANG
Original assignee: Xiaogang Wang
Priority date: 2016-01-18 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2016-01-18
Publication date: 2017-07-27
Application filed by Xiaogang Wang
Priority to PCT/CN2016/071193 priority Critical patent/WO2017124221A1/en
Priority to CN201680079308.7A priority patent/CN108496185B/en
Publication of WO2017124221A1 publication Critical patent/WO2017124221A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/35 Categorising the entire scene, e.g. birthday party or wedding scene
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24143 Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations

Abstract

Disclosed is a method for object detection, comprising: grouping object classes to be detected into a plurality of object clusters constituting a hierarchical tree structure; obtaining an image and at least one bounding box for the obtained image; evaluating objects in each bounding box by CNNs respectively trained for each of the clusters of the hierarchical tree structure, from a root cluster to leaf clusters of the hierarchical tree structure, to determine a deepest leaf cluster of the objects; and outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object. A system for object detection is also disclosed.

Description

[Title established by the ISA under Rule 37.2] SYSTEM AND METHOD FOR OBJECT DETECTION
Technical Field
The disclosure relates to a method for object detection and a system thereof.
Background
Fine-tuning refers to the approach of initializing the model parameters for a target task from parameters pre-trained on another, related task. Fine-tuning from a deep model pre-trained on the large-scale ImageNet dataset has been found to yield state-of-the-art performance for many vision tasks such as tracking, segmentation, object detection, action recognition, and event detection.
When fine-tuning a deep model for object detection, the detection of multiple object classes is composed of multiple tasks: detection of each class is a task. At the application stage, the detection scores of different object classes are independent, and evaluation of the results is also independent for these object classes. Existing deep learning methods consider all classes/tasks jointly and learn a single feature representation. However, this shared representation is not the best for all object classes. If the learned representation can focus on specific classes, e.g. mammals, it describes those specific classes better.
Deep learning has been applied to generic object detection in many works. Existing works mainly focus on developing new deep models and better object detection pipelines, and they use one feature representation for all object classes. Likewise, when hand-crafted features are used, the same feature extraction mechanism is applied to all object classes. However, the same feature extraction mechanism is not the most suitable for every object class, which naturally reduces the accuracy for some object classes.
Summary
The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. This summary is not an extensive overview of the disclosure. It is intended to neither identify key or critical elements of the disclosure nor delineate any scope of particular embodiments of the disclosure, or any scope of the claims. Its sole purpose is to present some concepts of the disclosure in a simplified form as a prelude to the more detailed description that is presented later.
In an aspect, disclosed is a method for object detection, comprising: grouping object classes for an object to be detected into a plurality of object clusters constituting a hierarchical tree structure; obtaining an image and at least one bounding box for an obtained image; evaluating objects in each bounding box by CNNs respectively trained for each of the clusters of the hierarchical tree structure, from a root cluster to leaf clusters of the hierarchical tree structure, to determine a deepest leaf cluster of the objects; and outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object.
In one embodiment of the present application, the grouping comprises: obtaining training images containing objects to be detected and at least one bounding box for the training images from a training set; extracting, by a trained CNN, features for the object in each bounding box; and distributing the object class of the object in each bounding box into the object clusters constituting the hierarchical tree structure according to similarity among the extracted features.
In one embodiment of the present application, the distributing is based on a visual similarity.
In one embodiment of the present application, the evaluating comprises: extracting features from the obtained image by the trained CNN for a parent cluster; calculating, according to the extracted features, classification scores of objects for each child cluster of the parent cluster; accepting the objects into the child clusters with a classification score larger than a threshold value, wherein each such child cluster is used as a parent cluster in the next evaluating and the other clusters are not evaluated; and repeatedly performing the extracting, the calculating, and the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists.
In one embodiment of the present application, the method for object detection further comprises: training the CNNs respectively used for each of the object clusters, which comprises: initializing the CNNs respectively used for each of the object clusters with the CNNs of their parent clusters; evaluating objects in each bounding box through the extracting, the calculating, and the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists, to determine a deepest leaf cluster of the objects; outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object; fine-tuning the CNNs for each cluster based on dissimilarities between the predicted object class labels and the ground truth object class labels for objects in the training images; and repeating the initializing, the evaluating, the outputting, and the fine-tuning until the accuracy of the predicted object class labels converges.
In one embodiment of the present application, the extracting comprises: cropping the obtained images by the bounding box; warping the cropped images into a predefined size required by the trained CNNs; and extracting the features from the warped images by the trained CNNs.
In one embodiment of the present application, the classification score represents a probability that the object belongs to an object class in one cluster.
In one embodiment of the present application, the outputting comprises: determining that the determined leaf cluster is an end cluster of the hierarchical tree structure; and outputting the object class label at said leaf cluster as the predicted object class label of the object.
In an aspect, disclosed is a system for object detection, comprising: a grouping unit for grouping object classes for an object to be detected into a plurality of object clusters constituting a hierarchical tree structure; and a predictive unit for: obtaining an image and at least one bounding box for an obtained image; evaluating objects in each bounding box by CNNs respectively trained for each of the clusters of the hierarchical tree structure, from a root cluster to leaf clusters of the hierarchical tree structure, to determine a deepest leaf cluster of the objects; and outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object.
In an aspect, disclosed is a system for object detection, comprising: a memory that stores executable components; and a processor electrically coupled to the memory to execute the executable components for: grouping object classes for an object to be detected into a plurality of object clusters constituting a hierarchical tree structure; obtaining an image and at least one bounding box for an obtained image; evaluating objects in each bounding box by CNNs respectively trained for each of the clusters of the hierarchical tree structure, from a root cluster to leaf clusters of the hierarchical tree structure, to determine a deepest leaf cluster of the objects; and outputting an object class label at the deepest leaf cluster as a predicted object class label of the object.
Brief Description of the Drawing
Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 shows examples of object detection according to some embodiments of the present application;
Fig. 2 shows the overall pipeline of the system for object detection according to some embodiments of the present application;
Fig. 3 shows the steps used for the grouping unit according to some embodiments of the present application;
Fig. 4 shows an example of the hierarchical tree structure according to some embodiments of the present application;
Fig. 5 shows the steps used for the predictive unit according to some embodiments of the present application;
Fig. 6 is an algorithm showing the key steps of the predictive unit according to some embodiments of the present application; and
Fig. 7 shows the steps used for the train unit according to some embodiments of the present application.
Detailed Description
Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well-known process operations have not  been described in detail in order not to unnecessarily obscure the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a” , “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising” , when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The disclosure relates to object detection, of which the aim is to detect objects of certain classes on a given image, such as person, dog, and chair in Fig. 1.
Fig. 2 shows the overall pipeline of the system for object detection according to some embodiments. The system for object detection comprises a grouping unit 201, a predictive unit (202 and 204) , and a training unit 203. The grouping unit is used for grouping object classes to be detected into a plurality of object clusters constituting a hierarchical tree structure; the predictive unit is used for predicting objects contained in a given image; and the training unit is used for training the predictive unit before applying the predictive unit to the actual application.
In the grouping unit 201, the object classes to be detected are grouped into a plurality of object clusters constituting a hierarchical tree structure according to the corresponding features of these objects. Then the training unit 203 trains the predictive unit 202 by using images from a predetermined training set and the cluster labels from the grouping unit 201, and outputs the trained predictive unit 204, which has convolutional neural networks (CNNs) respectively used for each of the clusters in the hierarchical tree structure. Finally, the trained predictive unit 204 is used for the actual application: during the application, a given image is fed into the trained predictive unit 204, which extracts the features of objects in the image and predicts the object classes of these objects by its CNNs. The above units will be described in detail below with reference to the drawings.
Fig. 3 shows steps used for the grouping unit according to some disclosed embodiments.
In some embodiments, the grouping unit 201 takes as input images from a training set and at least one bounding box, wherein the images contain objects belonging to the object classes to be detected. In the grouping unit 201, objects in the images are grouped into a plurality of object clusters constituting a hierarchical tree structure, and the cluster labels are then outputted. As shown in Fig. 3, at step S301, the input image is cropped by a bounding box and warped into the predefined size required by the convolutional neural network; at step S302, given the input image cropped by the bounding box, features are extracted by a pre-trained convolutional neural network; and at step S303, the objects contained in the given image are distributed into a plurality of object clusters.
The distribution method can be any appropriate method; visual similarity is used here as the example for illustration. The visual similarity between classes a and b is as follows:
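As the original formula is reproduced only as an image placeholder in this text, one plausible form, assuming the similarity is the mean inner product of unit-normalized features over all pairs of training samples of the two classes (the normalization is an assumption, not stated in the surviving text), is:

$$\mathrm{sim}(a, b) = \frac{1}{N_a N_b} \sum_{i=1}^{N_a} \sum_{j=1}^{N_b} \frac{\langle h_{a,i},\, h_{b,j} \rangle}{\lVert h_{a,i} \rVert\, \lVert h_{b,j} \rVert},$$

with $N_a$ and $N_b$ the numbers of training samples of classes $a$ and $b$,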
where $h_{a,i}$ is the last GoogLeNet hidden-layer feature for the $i$th training sample of class $a$, and $h_{b,j}$ is that for the $j$th training sample of class $b$; $\langle h_{a,i}, h_{b,j} \rangle$ denotes the inner product between $h_{a,i}$ and $h_{b,j}$. With the similarity between two classes defined, object classes are grouped into a plurality of object clusters constituting a hierarchical tree structure, for example as shown in Fig. 4. At hierarchical level $l$, denote the $j_l$th group by $S_{l,j_l}$; for the present embodiment, $l = 1, \ldots, L$, $L = 4$, $j_l \in \{1, \ldots, J_l\}$, $J_1 = 1$, $J_2 = 4$, $J_3 = 7$, $J_4 = 18$. In some embodiments, there may be, for example, 200 object classes, so that initially $S_{1,1} = \{1, \ldots, 200\}$; on average there are then 200 object classes per group at level 1, 50 per group at level 2, 29 per group at level 3, and 11 per group at level 4. In Fig. 4, $S_{1,1} = S_{2,1} \cup S_{2,2} \cup S_{2,3} \cup S_{2,4}$ and $S_{2,1} = S_{3,1} \cup S_{3,2}$. In the hierarchical clustering results, the parent cluster $par(l, j_l)$ and children set $ch(l, j_l)$ of a cluster $(l, j_l)$ are defined such that

$$par(l+1, j') = (l, j_l) \quad \text{for every } (l+1, j') \in ch(l, j_l),$$

$$S_{l,j_l} = \bigcup_{(l+1,\, j') \in ch(l, j_l)} S_{l+1, j'}, \quad \text{and}$$

$$S_{l+1, j'} \cap S_{l+1, j''} = \emptyset \quad \text{for } j' \neq j''.$$

For example, as shown in Fig. 4, the child clusters of $S_{1,1}$ are $S_{2,1}$, $S_{2,2}$, $S_{2,3}$ and $S_{2,4}$, and $S_{1,1}$ is the parent cluster of $S_{2,1}$, $S_{2,2}$, $S_{2,3}$ and $S_{2,4}$.
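To make the grouping step concrete, the following is a minimal sketch, not the patent's reference implementation, of hierarchical grouping by visual similarity. It assumes per-class feature matrices taken from the last hidden layer of a pre-trained CNN; the group counts per level (1, 4, 7, 18) follow the example above, and the function names are hypothetical.

```python
# Illustrative sketch only: the patent does not publish reference code.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def class_similarity(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Mean inner product over all sample pairs of two classes
    (one plausible reading of the similarity defined above)."""
    # Normalize rows so the inner product behaves like cosine similarity.
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    return float(np.mean(a @ b.T))

def build_hierarchy(class_feats, n_groups_per_level=(1, 4, 7, 18)):
    """Group classes into one flat partition per level of the tree."""
    n = len(class_feats)
    sim = np.array([[class_similarity(class_feats[i], class_feats[j])
                     for j in range(n)] for i in range(n)])
    dist = sim.max() - sim                      # turn similarity into distance
    z = linkage(dist[np.triu_indices(n, k=1)], method='average')
    # Level 1 (k=1) is the single root cluster S_{1,1}.
    return {lvl + 1: fcluster(z, t=k, criterion='maxclust')
            for lvl, k in enumerate(n_groups_per_level)}
```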
In some embodiments, the predictive unit 202 or 204 takes as input images, bounding boxes, and an object set $S_{l,j_l}$. The predictive unit at the training stage and at the application stage differs only in the samples: at the training stage, the samples are obtained from training data; at the application stage, the samples are obtained from testing data. The predicted object class labels are outputted from the predictive unit.
Fig. 5 shows the steps used for the predictive unit according to some embodiments. At step S501, the inputted image is cropped by the bounding boxes and warped into the predefined size required by the CNNs used in the predictive unit; at step S502, objects in each bounding box are evaluated from the root cluster to the leaf clusters; and at step S503, the class label of the object in the cropped image is determined.
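Step S501's crop-and-warp operation might look like the following sketch; the 224×224 target size is an assumption (typical for GoogLeNet-style models), not stated in the text.

```python
# Sketch of step S501: crop by bounding box, warp to the CNN input size.
from PIL import Image

def crop_and_warp(image: Image.Image, box: tuple,
                  size: tuple = (224, 224)) -> Image.Image:
    """Crop the region given by box = (left, top, right, bottom) and
    resize it to the fixed input size the CNNs expect."""
    return image.crop(box).resize(size, Image.BILINEAR)
```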
Particularly, during evaluating, features of the cropped image are extracted at each cluster by its trained CNN, and a classification score for each cluster may then be calculated using the extracted features. The classification score for the object classes in one cluster may represent the probability that the object belongs to this cluster. The detailed evaluating process is described in Algorithm 1 shown in Fig. 6. At the cluster $(l, j_l)$, the detection scores (i.e., the classification scores) for the classes in group $S_{l,j_l}$ are evaluated (line 6 in Algorithm 1). These detection scores are used for deciding whether the children clusters $ch(l, j_l)$ need to be evaluated (line 8 in Algorithm 1). For a child cluster $(l+1, j') \in ch(l, j_l)$, if the maximum detection score among the classes in $S_{l+1,j'}$ is smaller than a threshold $T_l$, the sample is not considered a positive sample in class group $S_{l+1,j'}$, and the cluster $(l+1, j')$ and its children clusters are not evaluated.
For example, the detection scores for the 200 classes in $S_{1,1}$ are initially obtained at the node (1, 1) for a given sample of class bird. These 200-class scores are used for accepting this sample as an animal ($S_{2,1}$) and rejecting it as ball ($S_{2,2}$), instrument ($S_{2,3}$) or furniture ($S_{2,4}$). Then the scores for the animal classes are used for accepting the bird sample as vertebrate and rejecting it as invertebrate. Therefore, each node focuses on rejecting the sample as not belonging to a group of object classes. Finally, only the groups that are not rejected have the SVM scores for their classes (line 13 in Algorithm 1).
Finally, the cluster label of the deepest leaf cluster of the object is determined. If the determined cluster is an end cluster of the hierarchical tree structure, such as $S_{4,1}$, $S_{4,2}$, $S_{4,3}$ or $S_{4,4}$ as shown in Fig. 4, the class label, such as cow, bird, fish or ant, will be outputted. If the determined cluster is not an end cluster of the hierarchical tree structure, such as $S_{3,1}$, i.e., the classification scores of $S_{4,1}$, $S_{4,2}$, $S_{4,3}$ and $S_{4,4}$ are all smaller than the threshold, the object is considered background, and its class label will not be outputted.
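A compact sketch of this top-down evaluation (Algorithm 1 in Fig. 6) follows. The cluster-node structure, `score_fn`, and the per-level thresholds are placeholders; the patent computes per-class SVM scores on CNN features, which this sketch abstracts away, and for simplicity it descends only into the best-scoring accepted child, whereas Algorithm 1 may evaluate every accepted child.

```python
# Hedged sketch of the root-to-leaf evaluation; names are hypothetical.
def evaluate_tree(sample, root, score_fn, thresholds):
    """Walk from the root cluster to the deepest non-rejected leaf.

    root: cluster node with fields .level, .classes, .children (list).
    score_fn(sample, cluster) -> {class_label: score} for that cluster.
    thresholds[level]: rejection threshold T_l for children at that level.
    Returns a predicted class label, or None (background) when every
    child of some visited cluster is rejected before a leaf is reached.
    """
    node = root
    while True:
        scores = score_fn(sample, node)        # line 6 in Algorithm 1
        if not node.children:                  # reached an end cluster
            return max(scores, key=scores.get)
        # Keep only children whose best class score clears the threshold
        # (line 8 in Algorithm 1); rejected subtrees are never evaluated.
        survivors = [c for c in node.children
                     if max(scores[k] for k in c.classes) >= thresholds[node.level]]
        if not survivors:                      # all children rejected
            return None                        # treated as background
        node = max(survivors,
                   key=lambda c: max(scores[k] for k in c.classes))
```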
The CNNs respectively used for each of the clusters may be trained by the training unit before application. Fig. 7 shows the steps used for the training unit according to some embodiments. During training, at step S701, the images for training and the ground truth object class labels of the objects in the training images are obtained from a training set. At step S702, the CNNs of the predictive unit are initialized with the CNNs of their parent clusters, i.e., for the cluster $(l, j_l)$, a CNN model $M_{l,j_l}$ used for cluster $S_{l,j_l}$ is fine-tuned using the model of its parent cluster, $M_{l-1, par(j_l)}$, as the initial point; for example, as shown in Fig. 4, $M_{2,1}$ is initialized with $M_1$. At step S703, the training image is cropped and evaluated by the predictive unit, which outputs the predicted class label. At step S704, the predicted class labels are compared with the ground truth class labels, and the dissimilarities between them are computed. At step S705, whether the predicted class labels have converged to the ground truth labels is determined. If the predicted class labels have converged to the ground truth labels, the trained predictive unit is outputted; if not, the parameters of the CNNs are fine-tuned and steps S701 to S704 are repeated. In some embodiments, determining whether the predicted class labels have converged to the ground truth labels may be replaced with determining whether the accuracy of the predicted class labels can be further improved.
According to the process of the predictive unit during training, for one cluster, some cropped images with objects that do not belong to this cluster are rejected at its parent cluster; therefore, only a subset of object classes is used for fine-tuning the CNN for each cluster. In this way, the CNN may focus on learning the representations for this subset of object classes. Furthermore, when training the CNNs, the CNN for a parent cluster is used as the initial point of the CNN for its child cluster, so that knowledge from the parent cluster is transferred to the child cluster. Based on the above, at the training stage, the training of the CNN used for each cluster focuses on hard samples that cannot be handled well at its parent cluster. In this way, object detection will be faster and more accurate.
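The training loop of Fig. 7 could be sketched as follows in PyTorch-style code under stated assumptions: `tree.breadth_first()`, `train_loader.for_cluster(...)` (yielding only crops accepted by the node's parent), and `evaluate_accuracy(...)` are hypothetical helpers, and the optimizer settings are illustrative, not from the patent.

```python
# Hedged sketch of steps S701-S705; helper names are hypothetical.
import copy
import torch
import torch.nn.functional as F

def train_cluster_cnns(tree, train_loader, max_rounds=10):
    """Fine-tune one CNN per cluster, each initialized from its parent's CNN;
    stops when predicted-label accuracy no longer improves."""
    prev_acc = 0.0
    for _ in range(max_rounds):
        # S702: initialize every child model from its parent cluster's model,
        # transferring the parent's knowledge to the child.
        for node in tree.breadth_first():
            if node.parent is not None:
                node.model = copy.deepcopy(node.parent.model)
        # S703/S704: predict on training crops, compare with ground truth,
        # and fine-tune each cluster only on samples its parent accepted.
        for node in tree.breadth_first():
            opt = torch.optim.SGD(node.model.parameters(), lr=1e-3, momentum=0.9)
            for crops, labels in train_loader.for_cluster(node):
                loss = F.cross_entropy(node.model(crops), labels)
                opt.zero_grad()
                loss.backward()
                opt.step()
        # S705: stop once accuracy can no longer be improved.
        acc = evaluate_accuracy(tree, train_loader)   # hypothetical helper
        if acc <= prev_acc:
            break
        prev_acc = acc
```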
As will be appreciated by one skilled in the art, the present application may be embodied as a system, a method or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, with hardware aspects that may all generally be referred to herein as a "unit", "circuit", "module" or "system". Much of the inventive functionality and many of the inventive principles, when implemented, are best supported with integrated circuits (ICs), such as a digital signal processor and software therefor, or application-specific ICs. It is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein, will be readily capable of generating such ICs with minimal experimentation. Therefore, in the interest of brevity and minimization of any risk of obscuring the principles and concepts according to the present application, further discussion of such software and ICs, if any, will be limited to the essentials with respect to the principles and concepts used by the preferred embodiments. In addition, the present invention may take the form of an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware. For example, the system may comprise a memory that stores executable components and a processor, electrically coupled to the memory, to execute the executable components to perform the operations of the system, as discussed with reference to Figs. 1-7. Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
Although the preferred examples of the present application have been described, those skilled in the art can make variations or modifications to these examples upon learning of the basic inventive concept. The appended claims are intended to be construed as including the preferred examples and all variations or modifications that fall within the scope of the present application.
Obviously, those skilled in the art can make variations or modifications to the present application without departing from the spirit and scope of the present application. As such, if these variations or modifications belong to the scope of the claims and equivalent techniques, they may also fall within the scope of the present application.

Claims (24)

  1. A method for object detection, comprising:
    grouping object classes for an object to be detected into a plurality of object clusters constituting a hierarchical tree structure;
    obtaining an image and at least one bounding box for an obtained image;
    evaluating objects in each bounding box by CNNs respectively trained for each of the object clusters of the hierarchical tree structure, from a root cluster to leaf clusters of the hierarchical tree structure, to determine a deepest leaf cluster of the objects; and
    outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object.
  2. The method of claim 1, wherein the grouping comprises:
    obtaining training images containing object to be detected and at least one bounding box for the training images from a training set;
    extracting, by a trained CNN, features for the object in each bounding box; and
    distributing object class of the object in each bounding box into the object clusters constituting the hierarchical tree structure according to similarity among the extracted features.
  3. The method of claim 2, wherein the distributing is based on a visual similarity.
  4. The method of claim 1, wherein the evaluating comprises:
    extracting features from the obtained image by the trained CNN for a parent cluster;
    calculating, according to the extracted features, classification scores of objects for each child cluster of the parent cluster;
    accepting the objects into the child clusters with the classification score larger than a threshold value, and the child cluster is used as a parent cluster in the next evaluating, wherein the other clusters except for the said child cluster are not evaluated;
    performing repeatedly the extracting, the calculating, and the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists.
  5. The method of claim 4, wherein the method further comprises training CNNs respectively used for each of the object clusters, which comprises:
    initializing the CNNs respectively used for each of the object clusters with the CNNs of their parent clusters;
    evaluating objects in each bounding box through the extracting, the calculating, the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists, to determine a deepest leaf cluster of the objects;
    outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object;
    fine-tuning the CNNs for each cluster based on dissimilarities between the predicted object class labels and the ground truth object class labels for objects in the training images; and
    repeating the initializing, the evaluating, the outputting, and the fine-tuning until the accuracy of the predicted object class labels converges.
  6. The method of claim 5, wherein the extracting comprises:
    cropping the obtained images by the bounding box;
    warping the cropped images into predefined size required by the trained CNNs; and
    extracting the features from the warped images by the trained CNNs.
  7. The method of claim 4, wherein the classification score represents a probability that the object belongs to an object class in one cluster.
  8. The method of claim 1, wherein the outputting comprises:
    determining that the determined leaf cluster is an end cluster of the hierarchical tree structure; and
    outputting the object class label at said leaf cluster as the predicted object class label of the object.
  9. A system for object detection, comprising:
    a grouping unit for grouping object classes for an object to be detected into a plurality of object clusters constituting a hierarchical tree structure; and
    a predictive unit for:
    obtaining an image and at least one bounding box for an obtained image;
    evaluating objects in each bounding box by CNNs respectively trained for each of the object clusters of the hierarchical tree structure, from a root cluster to leaf clusters of the hierarchical tree structure, to determine a deepest leaf cluster of the objects; and
    outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object.
  10. The system of claim 9, wherein the grouping unit is further configured to:
    obtain training images containing object to be detected and at least one bounding box for the training images from a training set;
    extract, by a trained CNN, features for the object in each bounding box; and
    distribute object class of the object in each bounding box into the object clusters constituting the hierarchical tree structure according to similarity among the extracted features.
  11. The system of claim 10, wherein the distributing is based on a visual similarity.
  12. The system of claim 9, wherein the predictive unit is configured for:
    extracting features from the obtained image by the trained CNN for a parent cluster;
    calculating, according to the extracted features, classification scores of objects for each child cluster of the parent cluster;
    accepting the objects into the child clusters with the classification score larger than a threshold value, and the child cluster is used as a parent cluster in the next evaluating, wherein the other clusters except for the said child cluster are not evaluated;
    performing repeatedly the extracting, the calculating, and the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists.
  13. The system of claim 12, further comprising:
    a training unit configured to train CNNs respectively used for each of the object clusters by:
    initializing the CNNs respectively used for each of the object clusters with the CNNs of their parent clusters;
    evaluating objects in each bounding box through the extracting, the calculating, the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists, to determine a deepest leaf cluster of the objects;
    outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object;
    fine-tuning the CNNs for each cluster based on dissimilarities between the predicted object class labels and the ground truth object class labels for objects in the training images; and
    repeating the initializing, the evaluating, the outputting, and the fine-tuning until the accuracy of the predicted object class labels converges.
  14. The system of claim 13, wherein the predictive unit is configured to extract features from the obtained image by:
    cropping the obtained images by the bounding box;
    warping the cropped images into predefined size required by the trained CNNs; and
    extracting the features from the warped images by the trained CNNs.
  15. The system of claim 12, wherein the classification score represents a probability that the object belongs to an object class in one cluster.
  16. The system of claim 9, wherein the outputting comprises:
    determining that the determined leaf cluster is an end cluster of the hierarchical tree structure; and
    outputting the object class label at said leaf cluster as the predicted object class label of the object.
  17. A system for object detection, comprising:
    a memory that stores executable components; and
    a processor electrically coupled to the memory to execute the executable components for:
    grouping object classes for an object to be detected into a plurality of object clusters constituting a hierarchical tree structure;
    obtaining an image and at least one bounding box for an obtained image;
    evaluating objects in each bounding box by CNNs respectively trained for each of the object clusters of the hierarchical tree structure, from a root cluster to leaf clusters of the hierarchical tree structure, to determine a deepest leaf cluster of the objects; and
    outputting an object class label at the deepest leaf cluster as a predicted object class label of the object.
  18. The system of claim 17, wherein the grouping comprises:
    obtaining training images containing object to be detected and at least one bounding box for the training images from a training set;
    extracting, by a trained CNN, features for the object in each bounding box; and
    distributing object class of the object in each bounding box into the object clusters constituting the hierarchical tree structure according to similarity among the extracted features.
  19. The system of claim 18, wherein the distributing is based on a visual similarity.
  20. The system of claim 17, wherein the evaluating comprises:
    extracting features from the obtained image by the trained CNN for a parent cluster;
    calculating, according to the extracted features, classification scores of objects for each child cluster of the parent cluster;
    accepting the objects into the child clusters with the classification score larger than a threshold value, and the child cluster is used as a parent cluster in the next evaluating, wherein the other clusters except for the said child cluster are not evaluated;
    performing repeatedly the extracting, the calculating, and the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists.
  21. The system of claim 20, wherein the executable components further comprise:
    training CNNs respectively used for each of the object clusters, which comprises:
    initializing the CNNs respectively used for each of the object clusters with the CNNs of their parent clusters;
    evaluating objects in each bounding box through the extracting, the calculating, the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists, to determine a deepest leaf cluster of the objects;
    outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object;
    fine-tuning the CNNs for each cluster based on dissimilarities between the predicted object class labels and the ground truth object class labels for objects in the training images; and
    repeating the initializing, the evaluating, the outputting, and the fine-tuning until the accuracy of the predicted object class labels converges.
  22. The system of claim 21, wherein the extracting comprises:
    cropping the obtained images by the bounding box;
    warping the cropped images into predefined size required by the trained CNNs; and
    extracting the features from the warped images by the trained CNNs.
  23. The system of claim 21, wherein the classification score represents a probability that the object belongs to an object class in one cluster.
  24. The system of claim 17, wherein the outputting comprises:
    determining that the determined leaf cluster is an end cluster of the hierarchical tree structure; and
    outputting the object class label at said leaf cluster as the predicted object class label of the object.
PCT/CN2016/071193 2016-01-18 2016-01-18 System and method for object detection WO2017124221A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2016/071193 WO2017124221A1 (en) 2016-01-18 2016-01-18 System and method for object detection
CN201680079308.7A CN108496185B (en) 2016-01-18 2016-01-18 System and method for object detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/071193 WO2017124221A1 (en) 2016-01-18 2016-01-18 System and method for object detection

Publications (1)

Publication Number Publication Date
WO2017124221A1 (en) 2017-07-27

Family

ID=59361177

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/071193 WO2017124221A1 (en) 2016-01-18 2016-01-18 System and method for object detection

Country Status (2)

Country Link
CN (1) CN108496185B (en)
WO (1) WO2017124221A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3852054A1 (en) 2020-01-16 2021-07-21 Koninklijke Philips N.V. Method and system for automatically detecting anatomical structures in a medical image
US11270121B2 (en) 2019-08-20 2022-03-08 Microsoft Technology Licensing, Llc Semi supervised animated character recognition in video
US11366989B2 (en) 2019-08-20 2022-06-21 Microsoft Technology Licensing, Llc Negative sampling algorithm for enhanced image classification
US11450107B1 (en) 2021-03-10 2022-09-20 Microsoft Technology Licensing, Llc Dynamic detection and recognition of media subjects

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814885B (en) * 2020-07-10 2021-06-22 云从科技集团股份有限公司 Method, system, device and medium for managing image frames

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130148881A1 (en) * 2011-12-12 2013-06-13 Alibaba Group Holding Limited Image Classification
US20130259371A1 (en) * 2012-03-28 2013-10-03 Oncel Tuzel Appearance and Context Based Object Classification in Images
CN104992191A (en) * 2015-07-23 2015-10-21 厦门大学 Image classification method based on deep learning feature and maximum confidence path

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3630734B2 (en) * 1994-10-28 2005-03-23 キヤノン株式会社 Information processing method
CN1838150A (en) * 2005-03-09 2006-09-27 西门子共同研究公司 Probabilistic boosting tree structure for learned discriminative models
CN101290660A (en) * 2008-06-02 2008-10-22 中国科学技术大学 Tree-shaped assembled classification method for pedestrian detection
US8744172B2 (en) * 2011-06-15 2014-06-03 Siemens Aktiengesellschaft Image processing using random forest classifiers
US9117132B2 (en) * 2012-11-16 2015-08-25 Tata Consultancy Services Limited System and method facilitating designing of classifier while recognizing characters in a video
CN103324954B (en) * 2013-05-31 2017-02-08 中国科学院计算技术研究所 Image classification method based on tree structure and system using same
CN103530405B (en) * 2013-10-23 2016-08-31 天津大学 A kind of image search method based on hierarchy
CN104978328A (en) * 2014-04-03 2015-10-14 北京奇虎科技有限公司 Hierarchical classifier obtaining method, text classification method, hierarchical classifier obtaining device and text classification device
CN104182981B (en) * 2014-08-26 2017-02-22 北京邮电大学 Image detection method and device
CN104217225B (en) * 2014-09-02 2018-04-24 中国科学院自动化研究所 A kind of sensation target detection and mask method
CN104281851B (en) * 2014-10-28 2017-11-03 浙江宇视科技有限公司 The extracting method and device of logo information
CN104794489B (en) * 2015-04-23 2019-03-08 苏州大学 A kind of induction type image classification method and system based on deep tag prediction
CN105069472B (en) * 2015-08-03 2018-07-27 电子科技大学 A kind of vehicle checking method adaptive based on convolutional neural networks
CN105205501B (en) * 2015-10-04 2018-09-18 北京航空航天大学 A kind of weak mark image object detection method of multi classifier combination

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130148881A1 (en) * 2011-12-12 2013-06-13 Alibaba Group Holding Limited Image Classification
US20130259371A1 (en) * 2012-03-28 2013-10-03 Oncel Tuzel Appearance and Context Based Object Classification in Images
CN104992191A (en) * 2015-07-23 2015-10-21 厦门大学 Image classification method based on deep learning feature and maximum confidence path

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11270121B2 (en) 2019-08-20 2022-03-08 Microsoft Technology Licensing, Llc Semi supervised animated character recognition in video
US11366989B2 (en) 2019-08-20 2022-06-21 Microsoft Technology Licensing, Llc Negative sampling algorithm for enhanced image classification
EP3852054A1 (en) 2020-01-16 2021-07-21 Koninklijke Philips N.V. Method and system for automatically detecting anatomical structures in a medical image
WO2021144230A1 (en) 2020-01-16 2021-07-22 Koninklijke Philips N.V. Method and system for automatically detecting anatomical structures in a medical image
US11450107B1 (en) 2021-03-10 2022-09-20 Microsoft Technology Licensing, Llc Dynamic detection and recognition of media subjects

Also Published As

Publication number Publication date
CN108496185A (en) 2018-09-04
CN108496185B (en) 2022-09-16

Similar Documents

Publication Publication Date Title
US11741372B2 (en) Prediction-correction approach to zero shot learning
WO2017124221A1 (en) System and method for object detection
Jing et al. Videossl: Semi-supervised learning for video classification
Zhang et al. Self supervised deep representation learning for fine-grained body part recognition
WO2019200747A1 (en) Method and device for segmenting proximal femur, computer apparatus, and storage medium
EP2065813B1 (en) Object comparison, retrieval, and categorization methods and apparatuses
US10242295B2 (en) Method and apparatus for generating, updating classifier, detecting objects and image processing device
JP2019521443A (en) Cell annotation method and annotation system using adaptive additional learning
CN108491766B (en) End-to-end crowd counting method based on depth decision forest
US8761510B2 (en) Object-centric spatial pooling for image classification
WO2023109208A1 (en) Few-shot object detection method and apparatus
EP3620958A1 (en) Learning method, learning device for detecting lane through lane model and testing method, testing device using the same
WO2016090522A1 (en) Method and apparatus for predicting face attributes
CN108090489A (en) Offline handwriting Balakrishnan word recognition methods of the computer based according to grapheme segmentation
CN111062277A (en) Sign language-lip language conversion method based on monocular vision
Zhang et al. Moving foreground-aware visual attention and key volume mining for human action recognition
US10373028B2 (en) Pattern recognition device, pattern recognition method, and computer program product
Chen et al. Discover and learn new objects from documentaries
Yang et al. Bicro: Noisy correspondence rectification for multi-modality data via bi-directional cross-modal similarity consistency
Jiang et al. Dynamic proposal sampling for weakly supervised object detection
US11829442B2 (en) Methods and systems for efficient batch active learning of a deep neural network
CN111340057A (en) Classification model training method and device
Afkham et al. Joint visual vocabulary for animal classification
CN109389543B (en) Bus operation data statistical method, system, computing device and storage medium
EP3910549A1 (en) System and method for few-shot learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16885494

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16885494

Country of ref document: EP

Kind code of ref document: A1