WO2017124221A1 - System and method for object detection - Google Patents
System and method for object detection
- Publication number
- WO2017124221A1 (PCT/CN2016/071193)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/35—Categorising the entire scene, e.g. birthday party or wedding scene
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24143—Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
Definitions
- a classification score for each cluster may be calculated by using the extracted features.
- the classification score for object classes in one cluster may represent the probability that the object belongs to this cluster.
- the detailed evaluating process may be described in Algorithm 1 shown in Fig. 6.
- the detection scores, i.e., the classification scores, for the classes in group S l, jl are evaluated (line 6 in Algorithm 1) .
- These detection scores are used for deciding whether the child clusters ch (l, j l) need to be evaluated (line 8 in Algorithm 1) .
- the detection scores for 200 classes are obtained at the node (1, 1) for a given sample of class bird. These 200-class scores are used for accepting this sample as an animal (S 2, 1) and rejecting it as a ball (S 2, 2) , an instrument (S 2, 3) or furniture (S 2, 4) . The scores for animals are then used for accepting the bird sample as a vertebrate and rejecting it as an invertebrate. Therefore, each node focuses on rejecting samples that do not belong to its group of object classes. Finally, only the groups that are not rejected have the SVM scores for their classes (line 13 in Algorithm 1) .
- the cluster label of the deepest leaf cluster of the object is determined. If the determined cluster is an end cluster of the hierarchical tree structure, such as S 4, 1 , S 4, 2 , S 4, 3 or S 4, 4 as shown in Fig. 4, the class label, such as cow, bird, fish or ant, will be outputted. If the determined cluster is not an end cluster, such as S 3, 1 , i.e., the classification scores of S 4, 1 , S 4, 2 , S 4, 3 and S 4, 4 are all smaller than the threshold, the object is considered background, and its class label will not be outputted.
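The root-to-leaf acceptance/rejection walk described above can be sketched in a few lines. Everything here is illustrative: the cluster names, the toy score table, and the 0.5 threshold are assumptions for the sketch; in the described system the scores would come from the per-cluster CNN/SVM classifiers.

```python
# Hedged sketch of the root-to-leaf evaluation (Algorithm 1 style).
# Cluster names, scores, and the threshold are toy assumptions.

THRESHOLD = 0.5

class Cluster:
    def __init__(self, name, children=None, class_label=None):
        self.name = name
        self.children = children or []
        self.class_label = class_label  # set only for end (leaf) clusters

def evaluate(cluster, score_fn):
    """Descend from `cluster`, following a child whose score exceeds
    THRESHOLD; return the class label at the deepest accepted leaf,
    or None (background) if every child is rejected."""
    if not cluster.children:                  # end cluster reached
        return cluster.class_label
    for child in cluster.children:
        if score_fn(child.name) > THRESHOLD:  # accept this child, recurse
            return evaluate(child, score_fn)
    return None                               # rejected everywhere: background

# Toy hierarchy: root -> {animal -> {vertebrate (bird)}, furniture}
bird = Cluster("vertebrate", class_label="bird")
tree = Cluster("root", [Cluster("animal", [bird]), Cluster("furniture")])

scores = {"animal": 0.9, "vertebrate": 0.8, "furniture": 0.1}
label = evaluate(tree, lambda name: scores.get(name, 0.0))
```

Note how a rejection high in the tree (furniture at 0.1) means none of that branch's descendants are ever scored, which is the pruning the text describes.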
- the CNNs respectively used for each of the clusters may be trained by the training unit before application.
- Fig. 7 shows the steps used for the training unit according to some embodiments.
- the images for training and ground truth object class labels of the objects in the training image are obtained from a training set.
- the CNNs of the predictive unit are initialized with the CNNs of their parent clusters, i.e., for the cluster (l, j l ) , a CNN model M l, jl used for cluster S l, jl is fine-tuned using the model of its parent cluster model M l-1, par (jl) as initial point, for example, as shown in Fig. 4, M 2, 1 is initialized with M 1 .
- at step S703, the training image is cropped and predicted by the predictive unit, and the predicted class label is outputted.
- the predicted class labels are compared with the ground truth class labels, and the dissimilarities between them are computed.
- at step S705, whether the predicted class labels converge to the ground truth labels is determined. If they converge, the trained predictive unit is outputted; if not, the parameters of the CNNs are fine-tuned and steps S701 to S704 are repeated.
- determining whether the predicted class labels converge to the ground truth labels may be replaced with determining whether the accuracy of the predicted class labels can be further improved.
- in the process of the predictive unit during training, for one cluster, cropped images with objects that do not belong to the cluster are rejected at its parent cluster; therefore, only a subset of object classes is used for fine-tuning the CNN for each cluster. In this way, the CNN may focus on learning the representations for this subset of object classes. Furthermore, when training the CNNs, the CNN for a parent cluster is used as the initial point of the CNN for its child cluster, so that the knowledge from the parent cluster is transferred to the child cluster. Based on the above, at the training stage, the training of the CNN for each cluster focuses on hard samples that cannot be handled well at its parent cluster. In this way, object detection will be faster and more accurate.
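The parent-to-child transfer just described can be sketched schematically. The tree, the dictionary "parameters", and the fine_tune placeholder are toy stand-ins for real CNN weights and gradient-based fine-tuning; the sketch only shows the control flow: each cluster's model starts from a copy of its parent's and is then updated on the (shrinking) subset of samples its parent accepted.

```python
# Schematic sketch of parent-initialized, subset-fine-tuned training.
# fine_tune is a placeholder: it just counts how many samples shaped
# the model, standing in for actual CNN weight updates.

def fine_tune(params, samples):
    return {**params, "seen": params.get("seen", 0) + len(samples)}

def train_tree(cluster, parent_params, samples_by_cluster, models):
    name, children = cluster
    params = dict(parent_params)  # initialize from the parent cluster's CNN
    params = fine_tune(params, samples_by_cluster.get(name, []))
    models[name] = params
    for child in children:        # children start from this cluster's model
        train_tree(child, params, samples_by_cluster, models)

# Toy hierarchy and per-cluster accepted-sample lists (assumptions).
tree = ("root", [("animal", [("vertebrate", [])]), ("furniture", [])])
samples = {"root": [1, 2, 3, 4], "animal": [1, 2], "vertebrate": [1]}
models = {}
train_tree(tree, {"seen": 0}, samples, models)
```

Deeper clusters see fewer, harder samples but inherit everything their ancestors learned, which mirrors the paragraph above.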
- the present application may be embodied as a system, a method or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, with hardware aspects that may all generally be referred to herein as a “unit” , “circuit” , “module” or “system” .
- the present invention may also take the form of an entirely software embodiment (including firmware, resident software, micro-code, etc. ) or an embodiment combining software and hardware aspects.
- the system may comprise a memory that stores executable components and a processor, electrically coupled to the memory to execute the executable components to perform operations of the system, as discussed in reference to Figs. 1-7.
- the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
Abstract
Disclosed is a method for object detection, comprising: grouping object classes to be detected into a plurality of object clusters constituting a hierarchical tree structure; obtaining an image and at least one bounding box for the obtained image; evaluating objects in each bounding box by CNNs respectively trained for each of the clusters of the hierarchical tree structure, from a root cluster to leaf clusters of the hierarchical tree structure, to determine a deepest leaf cluster of the objects; and outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object. A system for object detection is also disclosed.
Description
The disclosures relate to a method for object detection and a system thereof.
Fine-tuning refers to the approach that initializes the model parameters for the target task from the parameters pre-trained on another related task. Fine-tuning from the deep model pre-trained on the large-scale ImageNet dataset is found to yield state-of-the-art performance for many vision tasks such as tracking, segmentation, object detection, action recognition, and event detection.
When fine-tuning the deep model for object detection, the detection of multiple object classes is composed of multiple tasks; detection of each class is a task. At the application stage, detection scores of different object classes are independent, and evaluation of the results is also independent for these object classes. Existing deep learning methods consider all classes/tasks jointly and learn a single feature representation. However, this shared representation is not the best for all object classes. If the learned representation can focus on specific classes, e.g. mammals, the learned representation is better in describing these specific classes.
Deep learning has been applied to generic object detection in many works. Existing works mainly focus on developing new deep models and better object detection pipelines. These works use one feature representation for all object classes. Similarly, when using hand-crafted features, the same feature extraction mechanism is used for all object classes. However, the same feature extraction mechanism is not the most suitable for every object class, which naturally reduces the accuracy for some object classes.
Summary
The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. This summary is not an extensive overview of the disclosure. It is intended to neither identify key or critical elements of the disclosure nor delineate any scope of particular embodiments of the disclosure, or any scope of the claims. Its sole purpose is to present some concepts of the disclosure in a simplified form as a prelude to the more detailed description that is presented later.
In an aspect, disclosed is a method for object detection, comprising: grouping object classes for an object to be detected into a plurality of object clusters constituting a hierarchical tree structure; obtaining an image and at least one bounding box for an obtained image; evaluating objects in each bounding box by CNNs respectively trained for each of the clusters of the hierarchical tree structure, from a root cluster to leaf clusters of the hierarchical tree structure, to determine a deepest leaf cluster of the objects; and outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object.
In one embodiment of the present application, the grouping comprises: obtaining training images containing object to be detected and at least one bounding box for the training images from a training set; extracting, by a trained CNN, features for the object in each bounding box; and distributing object class of the object in each bounding box into the object clusters constituting the hierarchical tree structure according to similarity among the extracted features.
In one embodiment of the present application, the distributing is based on a visual similarity.
In one embodiment of the present application, the evaluating comprises: extracting features from the obtained image by the trained CNN for a parent cluster; calculating, according to the extracted features, classification scores of objects for each child cluster of the parent cluster; accepting the objects into the child clusters with the classification score larger than a threshold value, wherein each accepted child cluster is used as a parent cluster in the next evaluation and the other clusters are not evaluated; and performing repeatedly the extracting, the calculating, and the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists.
In one embodiment of the present application, the method for object detection further comprises: training CNNs respectively used for each of the object clusters, which comprises: initializing the CNNs respectively used for each of the object clusters with the CNNs of their parent clusters; evaluating objects in each bounding box through the extracting, the calculating, and the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists, to determine a deepest leaf cluster of the objects; outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object; fine-tuning the CNNs for each cluster based on dissimilarities between the predicted object class labels and the ground truth object class labels for objects in the training images; and repeating the initializing, the evaluating, the outputting, and the fine-tuning until the accuracy of the predicted object class labels converges.
In one embodiment of the present application, the extracting comprises: cropping the obtained images by the bounding box; warping the cropped images into predefined size required by the trained CNNs; and extracting the features from the warped images by the trained CNNs.
In one embodiment of the present application, the classification score represents the probability that the object belongs to an object class in one cluster.
In one embodiment of the present application, the outputting comprises: determining that the determined leaf cluster is an end cluster of the hierarchical tree structure; and outputting the object class label at said leaf cluster as the predicted object class label of the object.
In an aspect, disclosed is a system for object detection, comprising: a grouping unit for grouping object classes for an object to be detected into a plurality of object clusters constituting a hierarchical tree structure; and a predictive unit for: obtaining an image and at least one bounding box for an obtained image; evaluating objects in each bounding box by CNNs respectively trained for each of the clusters of the hierarchical tree structure, from a root cluster to leaf clusters of the hierarchical tree structure, to determine a deepest leaf cluster of the objects; and outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object.
In an aspect, disclosed is a system for object detection, comprising: a memory that stores executable components; and a processor electrically coupled to the memory to execute the executable components for: grouping object classes for an object to be detected into a plurality of object clusters constituting a hierarchical tree structure; obtaining an image and at least one bounding box for an obtained image; evaluating objects in each bounding box by CNNs respectively trained for each of the clusters of the hierarchical tree structure, from a root cluster to leaf clusters of the hierarchical tree structure, to determine a deepest leaf cluster of the objects; and outputting an object class label at the deepest leaf cluster as a predicted object class label of the object.
Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 shows examples of object detection according to some embodiments of the present application;
Fig. 2 shows the overall pipeline of the system for object detection according to some embodiments of the present application;
Fig. 3 shows the steps used for the grouping unit according to some embodiments of the present application;
Fig. 4 shows an example of the hierarchical tree structure according to some embodiments of the present application;
Fig. 5 shows the steps used for the predictive unit according to some embodiments of the present application;
Fig. 6 is Algorithm showing the key steps of the predictive unit according to some embodiments of the present application; and
Fig. 7 shows the steps used for the train unit according to some embodiments of the present application.
Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well-known process operations have not
been described in detail in order not to unnecessarily obscure the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a” , “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising” , when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The disclosure relates to object detection, of which the aim is to detect objects of certain classes on a given image, such as person, dog, and chair in Fig. 1.
Fig. 2 shows the overall pipeline of the system for object detection according to some embodiments. The system for object detection comprises a grouping unit 201, a predictive unit (202 and 204) , and a training unit 203. The grouping unit is used for grouping object classes to be detected into a plurality of object clusters constituting a hierarchical tree structure; the predictive unit is used for predicting objects contained in a given image; and the training unit is used for training the predictive unit before applying the predictive unit to the actual application.
In the grouping unit 201, the object classes to be detected are grouped into a plurality of object clusters constituting a hierarchical tree structure according to the corresponding features of these objects. Then the training unit 203 trains the predictive unit 202 using images from a predetermined training set and the cluster labels from the grouping unit 201, and outputs the trained predictive unit 204, which has convolutional neural networks (CNNs) respectively used for each of the clusters in the hierarchical tree structure. Finally, the trained predictive unit 204 is used for the actual application: a given image is fed into the trained predictive unit 204, which extracts the features of objects in the image and predicts the object class of these objects by its CNNs. The above units will be described in detail by referring to the drawings in the following.
Fig. 3 shows steps used for the grouping unit according to some disclosed embodiments.
In some embodiments, the grouping unit 201 is input with images from a training set and at least one bounding box, wherein the images contain objects belonging to the object classes to be detected. In the grouping unit 201, objects in the images are grouped into a plurality of object clusters constituting a hierarchical tree structure, and then the cluster label is outputted. As shown in Fig. 3, at step S301, the input image is cropped by a bounding box and warped into the predefined size required by the convolutional neural network; at step S302, given an input image cropped by a bounding box, the features are extracted by a pre-trained convolutional neural network; and at step S303, objects contained in the given image are distributed into a plurality of object clusters.
The distribution may use any appropriate method; visual similarity is used here as the example. The visual similarity between classes a and b is defined as

sim(a, b) = (1 / (N_a · N_b)) Σ_i Σ_j 〈h_{a,i}, h_{b,j}〉,

where h_{a,i} is the last GoogLeNet hidden layer for the ith training sample of class a, h_{b,j} is the same for the jth training sample of class b, N_a and N_b are the numbers of training samples of classes a and b, and 〈h_{a,i}, h_{b,j}〉 denotes the inner product between h_{a,i} and h_{b,j}. With the similarity between two classes defined, the object classes are grouped into a plurality of object clusters constituting a hierarchical tree structure, for example, as shown in Fig. 4. At hierarchical level l, denote the j_l-th group by S_{l,j_l}; in the present embodiment, l = 1, …, L with L = 4, j_l ∈ {1, …, J_l}, and J_1 = 1, J_2 = 4, J_3 = 7, J_4 = 18. In some embodiments there may be, for example, 200 object classes, so that initially S_{1,1} = {1, …, 200}. As an example, there may be, on average, 200 object classes per group at level 1, 50 classes per group at level 2, 29 classes per group at level 3, and 11 classes per group at level 4. In Fig. 4, S_{1,1} = S_{2,1} ∪ S_{2,2} ∪ S_{2,3} ∪ S_{2,4} and S_{2,1} = S_{3,1} ∪ S_{3,2}. In the hierarchical clustering results, the parent cluster par(l, j_l) and the children set ch(l, j_l) of a cluster (l, j_l) are defined such that S_{l,j_l} = ∪_{(l+1,j′) ∈ ch(l,j_l)} S_{l+1,j′} and (l, j_l) = par(l+1, j′) for every (l+1, j′) ∈ ch(l, j_l). For example, as shown in Fig. 4, the child clusters of S_{1,1} are S_{2,1}, S_{2,2}, S_{2,3} and S_{2,4}, and S_{1,1} is the parent cluster of S_{2,1}, S_{2,2}, S_{2,3} and S_{2,4}.
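The inner-product class similarity, and one step of a simple agglomerative grouping over it, can be sketched as below. The average-linkage merging strategy is an illustrative choice, not one the text specifies; the feature arrays stand in for last-hidden-layer activations:

```python
import numpy as np


def class_similarity(feats_a, feats_b):
    """Mean inner product over all sample pairs of two classes.

    feats_a, feats_b: (n_samples, dim) arrays of hidden-layer features.
    """
    return float(np.mean(feats_a @ feats_b.T))


def merge_most_similar(clusters, features):
    """One agglomerative step: merge the two most similar clusters in place.

    clusters: {name: [class, ...]}; features: {class: (n, dim) array}.
    """
    best, pair = -np.inf, None
    names = list(clusters)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # Average-linkage similarity between the two clusters.
            s = np.mean([class_similarity(features[x], features[y])
                         for x in clusters[a] for y in clusters[b]])
            if s > best:
                best, pair = s, (a, b)
    a, b = pair
    clusters[a + "+" + b] = clusters.pop(a) + clusters.pop(b)
    return clusters
```

Repeating `merge_most_similar` and recording the merge order yields a hierarchy like the four-level tree of Fig. 4.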
In some embodiments, the predictive unit 202 or 204 receives images, bounding boxes, and an object set S_{l,j_l}. The predictive unit at the training stage and at the application stage differs only in the samples: at the training stage the samples are obtained from training data, while at the application stage they are obtained from testing data. The predictive unit outputs the predicted object class labels.
Fig. 5 shows the steps used for the predictive unit according to some embodiments. At step S501, the inputted image is cropped by the bounding boxes and warped into the predefined size required by the CNNs used in the predictive unit; at step S502, the objects in each bounding box are evaluated from the root cluster to the leaf clusters; and at step S503, the class label of the object in the cropped image is determined.
Particularly, during evaluation, features of the cropped image are extracted at each cluster by its trained CNN, and a classification score for each cluster may then be calculated from the extracted features. The classification score for the object classes in one cluster may represent the possibility that the object belongs to this cluster. The detailed evaluation process is described in Algorithm 1 shown in Fig. 6. At the cluster (l, j_l), the detection scores (i.e., the classification scores) for the classes in group S_{l,j_l} are evaluated (line 6 in Algorithm 1). These detection scores are used for deciding whether the children clusters ch(l, j_l) need to be evaluated (line 8 in Algorithm 1). For a child cluster (l+1, j′) ∈ ch(l, j_l), if the maximum detection score among the classes in S_{l+1,j′} is smaller than a threshold T_l, the sample is not considered a positive sample in class group S_{l+1,j′}, and the cluster (l+1, j′) and its children clusters are not evaluated.
For example, initially the detection scores for the 200 classes are obtained at the node (1, 1) for a given sample of class bird. These 200-class scores are used for accepting this sample as an animal S_{2,1} and rejecting it as a ball S_{2,2}, an instrument S_{2,3}, or furniture S_{2,4}. Then the scores of the animal classes are used for accepting the bird sample as a vertebrate and rejecting it as an invertebrate. Therefore, each node focuses on rejecting the sample as not belonging to a group of object classes. Finally, only the groups that are not rejected have SVM scores for their classes (line 13 in Algorithm 1).
Finally, the cluster label of the deepest leaf cluster of the object is determined. If the determined cluster is an end cluster of the hierarchical tree structure, such as S_{4,1}, S_{4,2}, S_{4,3} or S_{4,4} as shown in Fig. 4, the class label, such as cow, bird, fish or ant, will be outputted. If the determined cluster is not an end cluster of the hierarchical tree structure, such as S_{3,1}, i.e., the classification scores of S_{4,1}, S_{4,2}, S_{4,3} and S_{4,4} are all smaller than the threshold, the object is considered background, and no class label will be outputted.
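The top-down evaluation with threshold rejection can be sketched as the recursion below. The node dictionary layout and the `score` callback, which stands in for the per-cluster CNN features plus SVM scores, are illustrative assumptions:

```python
def evaluate(node, sample, threshold=0.5):
    """Walk the cluster tree top-down for one cropped sample.

    node: {"classes": [...], "score": fn(sample, cls), "children": [...]}.
    A child cluster is evaluated only if the best score among its classes
    at the parent clears the threshold. Returns the class label of the
    deepest surviving leaf, or None when the sample is rejected everywhere
    (i.e., treated as background).
    """
    scores = {c: node["score"](sample, c) for c in node["classes"]}
    children = node.get("children", [])
    if not children:
        best = max(scores, key=scores.get)
        return best if scores[best] >= threshold else None
    for child in children:
        # Parent-level scores for the child's class set gate the descent.
        if max(scores[c] for c in child["classes"]) >= threshold:
            label = evaluate(child, sample, threshold)
            if label is not None:
                return label
    return None  # every child rejected: background
```

Because whole subtrees are pruned as soon as their class group scores poorly, most clusters are never evaluated for a given sample.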
The CNNs respectively used for each of the clusters may be trained by the training unit before application. Fig. 7 shows the steps used for the training unit according to some embodiments. During training, at step S701, the images for training and the ground truth object class labels of the objects in the training images are obtained from a training set. At step S702, the CNNs of the predictive unit are initialized with the CNNs of their parent clusters; i.e., for the cluster (l, j_l), the CNN model M_{l,j_l} used for cluster S_{l,j_l} is fine-tuned using the model of its parent cluster, M_{l-1,par(j_l)}, as the initial point; for example, as shown in Fig. 4, M_{2,1} is initialized with M_{1,1}. At step S703, the training image is cropped and predicted by the predictive unit, and the predicted class label is outputted. At step S704, the predicted class labels are compared with the ground truth class labels, and the dissimilarities between them are computed. At step S705, it is determined whether the predicted class labels have converged to the ground truth labels. If so, the trained predictive unit is outputted; if not, the parameters of the CNNs are fine-tuned and steps S701 to S704 are repeated. In some embodiments, determining whether the predicted class labels have converged to the ground truth labels may be replaced by determining whether the accuracy of the predicted class labels can be further improved.
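The top-down training schedule of steps S701 to S705 can be sketched as follows. Models are plain dictionaries of parameters here, and `fine_tune` is a stand-in for the crop/predict/compare/update loop; both are illustrative assumptions:

```python
def fine_tune(model, node):
    """Stand-in for the S703-S705 fine-tuning loop on one cluster."""
    model["weights"] += len(node["classes"])  # illustrative parameter update
    return model


def train_tree(node, parent_model=None):
    """Train every cluster's model, each initialized from its parent's.

    node: {"classes": [...], "children": [...]}; the trained model is
    stored on the node under "model".
    """
    # S702: copy the parent's parameters as the initial point (knowledge
    # transfer); the root starts from a fresh model.
    model = dict(parent_model) if parent_model is not None else {"weights": 0.0}
    node["model"] = fine_tune(model, node)
    for child in node.get("children", []):
        train_tree(child, parent_model=node["model"])
```

The recursion guarantees a parent's CNN is fully fine-tuned before any of its children start from it.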
According to the process of the predictive unit, during training, cropped images whose objects do not belong to a given cluster are rejected at its parent cluster; therefore, only a subset of the object classes is used for fine-tuning the CNN of each cluster. In this way, each CNN can focus on learning the representations for this subset of object classes. Furthermore, when training the CNNs, the CNN of a parent cluster is used as the initial point of the CNN of its child cluster, so that knowledge from the parent cluster is transferred to the child cluster. Accordingly, at the training stage, the training of the CNNs respectively used for each cluster focuses on hard samples that cannot be handled well at their parent clusters. In this way, object detection becomes faster and more accurate.
As will be appreciated by one skilled in the art, the present application may be embodied as a system, a method or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, with hardware aspects that may all generally be referred to herein as a "unit," "circuit," "module" or "system." Much of the inventive functionality and many of the inventive principles, when implemented, are best supported with or in integrated circuits (ICs), such as a digital signal processor and software therefor, or application-specific ICs. It is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein, will be readily capable of generating such ICs with minimal experimentation. Therefore, in the interest of brevity and minimization of any risk of obscuring the principles and concepts according to the present application, further discussion of such software and ICs, if any, will be limited to the essentials with respect to the principles and concepts used by the preferred embodiments. In addition, the present invention may take the form of an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects. For example, the system may comprise a memory that stores executable components and a processor, electrically coupled to the memory, to execute the executable components to perform the operations of the system, as discussed with reference to Figs. 1-7. Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
Although the preferred examples of the present application have been described, those skilled in the art can make variations or modifications to these examples upon learning the basic inventive concept. The appended claims are intended to be construed as comprising the preferred examples and all variations or modifications falling within the scope of the present application.
Obviously, those skilled in the art can make variations or modifications to the present application without departing from its spirit and scope. If these variations or modifications belong to the scope of the claims and equivalent techniques, they also fall within the scope of the present application.
Claims (24)
- A method for object detection, comprising:
grouping object classes for an object to be detected into a plurality of object clusters constituting a hierarchical tree structure;
obtaining an image and at least one bounding box for the obtained image;
evaluating objects in each bounding box by CNNs respectively trained for each of the object clusters of the hierarchical tree structure, from a root cluster to leaf clusters of the hierarchical tree structure, to determine a deepest leaf cluster of the objects; and
outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object.
- The method of claim 1, wherein the grouping comprises:
obtaining training images containing objects to be detected and at least one bounding box for the training images from a training set;
extracting, by a trained CNN, features for the object in each bounding box; and
distributing the object class of the object in each bounding box into the object clusters constituting the hierarchical tree structure according to similarity among the extracted features.
- The method of claim 2, wherein the distributing is based on a visual similarity.
- The method of claim 1, wherein the evaluating comprises:
extracting features from the obtained image by the trained CNN for a parent cluster;
calculating, according to the extracted features, classification scores of objects for each child cluster of the parent cluster;
accepting the objects into the child clusters with the classification score larger than a threshold value, wherein each such child cluster is used as a parent cluster in the next evaluating and the other clusters except for said child clusters are not evaluated; and
performing repeatedly the extracting, the calculating, and the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists.
- The method of claim 4, wherein the method further comprises training the CNNs respectively used for each of the object clusters, which comprises:
initializing the CNNs respectively used for each of the object clusters with the CNNs of their parent clusters;
evaluating objects in each bounding box through the extracting, the calculating, and the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists, to determine a deepest leaf cluster of the objects;
outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object;
fine-tuning the CNNs for each cluster based on dissimilarities between the predicted object class labels and ground truth object class labels for objects in the training images; and
repeating the initializing, the evaluating, the outputting, and the fine-tuning until the accuracy of the predicted object class labels converges.
- The method of claim 5, wherein the extracting comprises:
cropping the obtained images by the bounding box;
warping the cropped images into the predefined size required by the trained CNNs; and
extracting the features from the warped images by the trained CNNs.
- The method of claim 4, wherein the classification score represents a possibility that the object belongs to an object class in one cluster.
- The method of claim 1, wherein the outputting comprises:
determining that the determined leaf cluster is an end cluster of the hierarchical tree structure; and
outputting the object class label at said leaf cluster as the predicted object class label of the object.
- A system for object detection, comprising:
a grouping unit for grouping object classes for an object to be detected into a plurality of object clusters constituting a hierarchical tree structure; and
a predictive unit for:
obtaining an image and at least one bounding box for the obtained image;
evaluating objects in each bounding box by CNNs respectively trained for each of the object clusters of the hierarchical tree structure, from a root cluster to leaf clusters of the hierarchical tree structure, to determine a deepest leaf cluster of the objects; and
outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object.
- The system of claim 9, wherein the grouping unit is further configured to:
obtain training images containing objects to be detected and at least one bounding box for the training images from a training set;
extract, by a trained CNN, features for the object in each bounding box; and
distribute the object class of the object in each bounding box into the object clusters constituting the hierarchical tree structure according to similarity among the extracted features.
- The system of claim 10, wherein the distributing is based on a visual similarity.
- The system of claim 9, wherein the predictive unit is configured for:
extracting features from the obtained image by the trained CNN for a parent cluster;
calculating, according to the extracted features, classification scores of objects for each child cluster of the parent cluster;
accepting the objects into the child clusters with the classification score larger than a threshold value, wherein each such child cluster is used as a parent cluster in the next evaluating and the other clusters except for said child clusters are not evaluated; and
performing repeatedly the extracting, the calculating, and the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists.
- The system of claim 12, further comprising:
a training unit configured to train the CNNs respectively used for each of the object clusters by:
initializing the CNNs respectively used for each of the object clusters with the CNNs of their parent clusters;
evaluating objects in each bounding box through the extracting, the calculating, and the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists, to determine a deepest leaf cluster of the objects;
outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object;
fine-tuning the CNNs for each cluster based on dissimilarities between the predicted object class labels and ground truth object class labels for objects in the training images; and
repeating the initializing, the evaluating, the outputting, and the fine-tuning until the accuracy of the predicted object class labels converges.
- The system of claim 13, wherein the predictive unit is configured to extract features from the obtained image by:
cropping the obtained images by the bounding box;
warping the cropped images into the predefined size required by the trained CNNs; and
extracting the features from the warped images by the trained CNNs.
- The system of claim 12, wherein the classification score represents a possibility that the object belongs to an object class in one cluster.
- The system of claim 9, wherein the outputting comprises:
determining that the determined leaf cluster is an end cluster of the hierarchical tree structure; and
outputting the object class label at said leaf cluster as the predicted object class label of the object.
- A system for object detection, comprising:
a memory that stores executable components; and
a processor electrically coupled to the memory to execute the executable components for:
grouping object classes for an object to be detected into a plurality of object clusters constituting a hierarchical tree structure;
obtaining an image and at least one bounding box for the obtained image;
evaluating objects in each bounding box by CNNs respectively trained for each of the object clusters of the hierarchical tree structure, from a root cluster to leaf clusters of the hierarchical tree structure, to determine a deepest leaf cluster of the objects; and
outputting an object class label at the deepest leaf cluster as a predicted object class label of the object.
- The system of claim 17, wherein the grouping comprises:
obtaining training images containing objects to be detected and at least one bounding box for the training images from a training set;
extracting, by a trained CNN, features for the object in each bounding box; and
distributing the object class of the object in each bounding box into the object clusters constituting the hierarchical tree structure according to similarity among the extracted features.
- The system of claim 18, wherein the distributing is based on a visual similarity.
- The system of claim 17, wherein the evaluating comprises:
extracting features from the obtained image by the trained CNN for a parent cluster;
calculating, according to the extracted features, classification scores of objects for each child cluster of the parent cluster;
accepting the objects into the child clusters with the classification score larger than a threshold value, wherein each such child cluster is used as a parent cluster in the next evaluating and the other clusters except for said child clusters are not evaluated; and
performing repeatedly the extracting, the calculating, and the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists.
- The system of claim 20, wherein the executable components further comprise:
training the CNNs respectively used for each of the object clusters, which comprises:
initializing the CNNs respectively used for each of the object clusters with the CNNs of their parent clusters;
evaluating objects in each bounding box through the extracting, the calculating, and the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists, to determine a deepest leaf cluster of the objects;
outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object;
fine-tuning the CNNs for each cluster based on dissimilarities between the predicted object class labels and ground truth object class labels for objects in the training images; and
repeating the initializing, the evaluating, the outputting, and the fine-tuning until the accuracy of the predicted object class labels converges.
- The system of claim 21, wherein the extracting comprises:
cropping the obtained images by the bounding box;
warping the cropped images into the predefined size required by the trained CNNs; and
extracting the features from the warped images by the trained CNNs.
- The system of claim 21, wherein the classification score represents a possibility that the object belongs to an object class in one cluster.
- The system of claim 17, wherein the outputting comprises:
determining that the determined leaf cluster is an end cluster of the hierarchical tree structure; and
outputting the object class label at said leaf cluster as the predicted object class label of the object.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2016/071193 WO2017124221A1 (en) | 2016-01-18 | 2016-01-18 | System and method for object detection |
CN201680079308.7A CN108496185B (en) | 2016-01-18 | 2016-01-18 | System and method for object detection |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2016/071193 WO2017124221A1 (en) | 2016-01-18 | 2016-01-18 | System and method for object detection |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2017124221A1 true WO2017124221A1 (en) | 2017-07-27 |
Family
ID=59361177
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2016/071193 WO2017124221A1 (en) | 2016-01-18 | 2016-01-18 | System and method for object detection |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108496185B (en) |
WO (1) | WO2017124221A1 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3852054A1 (en) | 2020-01-16 | 2021-07-21 | Koninklijke Philips N.V. | Method and system for automatically detecting anatomical structures in a medical image |
US11270121B2 (en) | 2019-08-20 | 2022-03-08 | Microsoft Technology Licensing, Llc | Semi supervised animated character recognition in video |
US11366989B2 (en) | 2019-08-20 | 2022-06-21 | Microsoft Technology Licensing, Llc | Negative sampling algorithm for enhanced image classification |
US11450107B1 (en) | 2021-03-10 | 2022-09-20 | Microsoft Technology Licensing, Llc | Dynamic detection and recognition of media subjects |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111814885B (en) * | 2020-07-10 | 2021-06-22 | 云从科技集团股份有限公司 | Method, system, device and medium for managing image frames |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130148881A1 (en) * | 2011-12-12 | 2013-06-13 | Alibaba Group Holding Limited | Image Classification |
US20130259371A1 (en) * | 2012-03-28 | 2013-10-03 | Oncel Tuzel | Appearance and Context Based Object Classification in Images |
CN104992191A (en) * | 2015-07-23 | 2015-10-21 | 厦门大学 | Image classification method based on deep learning feature and maximum confidence path |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3630734B2 (en) * | 1994-10-28 | 2005-03-23 | キヤノン株式会社 | Information processing method |
CN1838150A (en) * | 2005-03-09 | 2006-09-27 | 西门子共同研究公司 | Probabilistic boosting tree structure for learned discriminative models |
CN101290660A (en) * | 2008-06-02 | 2008-10-22 | 中国科学技术大学 | Tree-shaped assembled classification method for pedestrian detection |
US8744172B2 (en) * | 2011-06-15 | 2014-06-03 | Siemens Aktiengesellschaft | Image processing using random forest classifiers |
US9117132B2 (en) * | 2012-11-16 | 2015-08-25 | Tata Consultancy Services Limited | System and method facilitating designing of classifier while recognizing characters in a video |
CN103324954B (en) * | 2013-05-31 | 2017-02-08 | 中国科学院计算技术研究所 | Image classification method based on tree structure and system using same |
CN103530405B (en) * | 2013-10-23 | 2016-08-31 | 天津大学 | A kind of image search method based on hierarchy |
CN104978328A (en) * | 2014-04-03 | 2015-10-14 | 北京奇虎科技有限公司 | Hierarchical classifier obtaining method, text classification method, hierarchical classifier obtaining device and text classification device |
CN104182981B (en) * | 2014-08-26 | 2017-02-22 | 北京邮电大学 | Image detection method and device |
CN104217225B (en) * | 2014-09-02 | 2018-04-24 | 中国科学院自动化研究所 | A kind of sensation target detection and mask method |
CN104281851B (en) * | 2014-10-28 | 2017-11-03 | 浙江宇视科技有限公司 | The extracting method and device of logo information |
CN104794489B (en) * | 2015-04-23 | 2019-03-08 | 苏州大学 | A kind of induction type image classification method and system based on deep tag prediction |
CN105069472B (en) * | 2015-08-03 | 2018-07-27 | 电子科技大学 | A kind of vehicle checking method adaptive based on convolutional neural networks |
CN105205501B (en) * | 2015-10-04 | 2018-09-18 | 北京航空航天大学 | A kind of weak mark image object detection method of multi classifier combination |
-
2016
- 2016-01-18 CN CN201680079308.7A patent/CN108496185B/en active Active
- 2016-01-18 WO PCT/CN2016/071193 patent/WO2017124221A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN108496185A (en) | 2018-09-04 |
CN108496185B (en) | 2022-09-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11741372B2 (en) | Prediction-correction approach to zero shot learning | |
WO2017124221A1 (en) | System and method for object detection | |
Jing et al. | Videossl: Semi-supervised learning for video classification | |
Zhang et al. | Self supervised deep representation learning for fine-grained body part recognition | |
WO2019200747A1 (en) | Method and device for segmenting proximal femur, computer apparatus, and storage medium | |
EP2065813B1 (en) | Object comparison, retrieval, and categorization methods and apparatuses | |
US10242295B2 (en) | Method and apparatus for generating, updating classifier, detecting objects and image processing device | |
JP2019521443A (en) | Cell annotation method and annotation system using adaptive additional learning | |
CN108491766B (en) | End-to-end crowd counting method based on depth decision forest | |
US8761510B2 (en) | Object-centric spatial pooling for image classification | |
WO2023109208A1 (en) | Few-shot object detection method and apparatus | |
EP3620958A1 (en) | Learning method, learning device for detecting lane through lane model and testing method, testing device using the same | |
WO2016090522A1 (en) | Method and apparatus for predicting face attributes | |
CN108090489A (en) | Offline handwriting Balakrishnan word recognition methods of the computer based according to grapheme segmentation | |
CN111062277A (en) | Sign language-lip language conversion method based on monocular vision | |
Zhang et al. | Moving foreground-aware visual attention and key volume mining for human action recognition | |
US10373028B2 (en) | Pattern recognition device, pattern recognition method, and computer program product | |
Chen et al. | Discover and learn new objects from documentaries | |
Yang et al. | Bicro: Noisy correspondence rectification for multi-modality data via bi-directional cross-modal similarity consistency | |
Jiang et al. | Dynamic proposal sampling for weakly supervised object detection | |
US11829442B2 (en) | Methods and systems for efficient batch active learning of a deep neural network | |
CN111340057A (en) | Classification model training method and device | |
Afkham et al. | Joint visual vocabulary for animal classification | |
CN109389543B (en) | Bus operation data statistical method, system, computing device and storage medium | |
EP3910549A1 (en) | System and method for few-shot learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 16885494 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 16885494 Country of ref document: EP Kind code of ref document: A1 |