WO2017124221A1 - System and method for object detection - Google Patents

System and method for object detection

Info

Publication number: WO2017124221A1
Authority: WIPO (PCT)
Prior art keywords: cluster, clusters, objects, bounding box, CNNs
Application number: PCT/CN2016/071193
Other languages: French (fr)
Inventors: Xiaogang Wang, Wanli OUYANG
Original assignee: Xiaogang Wang
Priority date: 2016-01-18 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2016-01-18
Publication date: 2017-07-27
Application filed by Xiaogang Wang
Priority to PCT/CN2016/071193 priority Critical patent/WO2017124221A1/en
Priority to CN201680079308.7A priority patent/CN108496185B/en
Publication of WO2017124221A1 publication Critical patent/WO2017124221A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/35 Categorising the entire scene, e.g. birthday party or wedding scene
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24143 Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations

Abstract

Disclosed is a method for object detection, comprising: grouping object classes to be detected into a plurality of object clusters constituting a hierarchical tree structure; obtaining an image and at least one bounding box for the obtained image; evaluating objects in each bounding box by CNNs respectively trained for each of the clusters of the hierarchical tree structure, from a root cluster to leaf clusters of the hierarchical tree structure, to determine a deepest leaf cluster of the objects; and outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object. A system for object detection is also disclosed.

Description

[Title established by the ISA under Rule 37.2] SYSTEM AND METHOD FOR OBJECT DETECTION
Technical Field
The disclosure relates to a method for object detection and a system thereof.
Background
Fine-tuning refers to the approach of initializing the model parameters for a target task from parameters pre-trained on another, related task. Fine-tuning from a deep model pre-trained on the large-scale ImageNet dataset has been found to yield state-of-the-art performance for many vision tasks such as tracking, segmentation, object detection, action recognition, and event detection.
When fine-tuning a deep model for object detection, the detection of multiple object classes is composed of multiple tasks: detection of each class is a task. At the application stage, the detection scores of different object classes are independent, and evaluation of the results is also independent for these object classes. Existing deep learning methods consider all classes/tasks jointly and learn a single feature representation. However, this shared representation is not the best for all object classes. If the learned representation can focus on specific classes, e.g. mammals, it describes those specific classes better.
Deep learning has been applied to generic object detection in many works. Existing works mainly focus on developing new deep models and better object detection pipelines, and they use one feature representation for all object classes. Likewise, when hand-crafted features are used, the same feature extraction mechanism is applied to all object classes. However, the same feature extraction mechanism is not the most suitable for every object class, which naturally reduces the accuracy for some object classes.
Summary
The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. This summary is not an extensive overview of the disclosure. It is intended to neither identify key or critical elements of the disclosure nor delineate any scope of particular embodiments of the disclosure, or any scope of the claims. Its sole purpose is to present some concepts of the disclosure in a simplified form as a prelude to the more detailed description that is presented later.
In an aspect, disclosed is a method for object detection, comprising: grouping object classes for an object to be detected into a plurality of object clusters constituting a hierarchical tree structure; obtaining an image and at least one bounding box for an obtained image; evaluating objects in each bounding box by CNNs respectively trained for each of the clusters of the hierarchical tree structure, from a root cluster to leaf clusters of the hierarchical tree structure, to determine a deepest leaf cluster of the objects; and outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object.
In one embodiment of the present application, the grouping comprises: obtaining training images containing objects to be detected and at least one bounding box for the training images from a training set; extracting, by a trained CNN, features for the object in each bounding box; and distributing the object class of the object in each bounding box into the object clusters constituting the hierarchical tree structure according to similarity among the extracted features.
In one embodiment of the present application, the distributing is based on a visual similarity.
In one embodiment of the present application, the evaluating comprises: extracting features from the obtained image by the trained CNN for a parent cluster; calculating, according to the extracted features, classification scores of objects for each child cluster of the parent cluster; accepting the objects into the child clusters with a classification score larger than a threshold value, wherein each such child cluster is used as a parent cluster in the next evaluating and the other clusters are not evaluated; and repeatedly performing the extracting, the calculating, and the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists.
In one embodiment of the present application, the method for object detection further comprises: training the CNNs respectively used for each of the object clusters, which comprises: initializing the CNNs respectively used for each of the object clusters with the CNNs of their parent clusters; evaluating objects in each bounding box through the extracting, the calculating, and the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists, to determine a deepest leaf cluster of the objects; outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object; fine-tuning the CNNs for each cluster based on dissimilarities between the predicted object class labels and the ground truth object class labels for objects in the training images; and repeating the initializing, the evaluating, the outputting, and the fine-tuning until the accuracy of the predicted object class labels converges.
In one embodiment of the present application, the extracting comprises: cropping the obtained images by the bounding box; warping the cropped images into a predefined size required by the trained CNNs; and extracting the features from the warped images by the trained CNNs.
In one embodiment of the present application, the classification score represents a probability that the object belongs to an object class in one cluster.
In one embodiment of the present application, the outputting comprises: determining that the determined leaf cluster is an end cluster of the hierarchical tree structure; and outputting the object class label at said leaf cluster as the predicted object class label of the object.
In an aspect, disclosed is a system for object detection, comprising: a grouping unit for grouping object classes for an object to be detected into a plurality of object clusters constituting a hierarchical tree structure; and a predictive unit for: obtaining an image and at least one bounding box for an obtained image; evaluating objects in each bounding box by CNNs respectively trained for each of the clusters of the hierarchical tree structure, from a root cluster to leaf clusters of the hierarchical tree structure, to determine a deepest leaf cluster of the objects; and outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object.
In an aspect, disclosed is a system for object detection, comprising: a memory that stores executable components; and a processor electrically coupled to the memory to execute the executable components for: grouping object classes for an object to be detected into a plurality of object clusters constituting a hierarchical tree structure; obtaining an image and at least one bounding box for an obtained image; evaluating objects in each bounding box by CNNs respectively trained for each of the clusters of the hierarchical tree structure, from a root cluster to leaf clusters of the hierarchical tree structure, to determine a deepest leaf cluster of the objects; and outputting an object class label at the deepest leaf cluster as a predicted object class label of the object.
Brief Description of the Drawing
Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 shows examples of object detection according to some embodiments of the present application;
Fig. 2 shows the overall pipeline of the system for object detection according to some embodiments of the present application;
Fig. 3 shows the steps used for the grouping unit according to some embodiments of the present application;
Fig. 4 shows an example of the hierarchical tree structure according to some embodiments of the present application;
Fig. 5 shows the steps used for the predictive unit according to some embodiments of the present application;
Fig. 6 is an algorithm showing the key steps of the predictive unit according to some embodiments of the present application; and
Fig. 7 shows the steps used for the train unit according to some embodiments of the present application.
Detailed Description
Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well-known process operations have not  been described in detail in order not to unnecessarily obscure the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a” , “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising” , when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The disclosure relates to object detection, of which the aim is to detect objects of certain classes on a given image, such as person, dog, and chair in Fig. 1.
Fig. 2 shows the overall pipeline of the system for object detection according to some embodiments. The system for object detection comprises a grouping unit 201, a predictive unit (202 and 204) , and a training unit 203. The grouping unit is used for grouping object classes to be detected into a plurality of object clusters constituting a hierarchical tree structure; the predictive unit is used for predicting objects contained in a given image; and the training unit is used for training the predictive unit before applying the predictive unit to the actual application.
In the grouping unit 201, the object classes to be detected are grouped into a plurality of object clusters constituting a hierarchical tree structure according to the corresponding features of these objects. Then the training unit 203 trains the predictive unit 202 by using images from a predetermined training set and the cluster labels from the grouping unit 201, and outputs the trained predictive unit 204, which has convolutional neural networks (CNNs) respectively used for each of the clusters in the hierarchical tree structure. Finally, the trained predictive unit 204 is used for the actual application: during the application, a given image is fed into the trained predictive unit 204, which extracts the features of objects in the image and predicts the object classes of these objects by its CNNs. The above units will be described in detail below with reference to the drawings.
Fig. 3 shows steps used for the grouping unit according to some disclosed embodiments.
In some embodiments, the grouping unit 201 takes as input images from a training set and at least one bounding box, wherein the images contain objects belonging to the object classes to be detected. In the grouping unit 201, objects in the images are grouped into a plurality of object clusters constituting a hierarchical tree structure, and the cluster labels are then outputted. As shown in Fig. 3, at step S301, the input image is cropped by a bounding box and warped into the predefined size required by the convolutional neural network; at step S302, given the input image cropped by the bounding box, features are extracted by a pre-trained convolutional neural network; and at step S303, the objects contained in the given image are distributed into a plurality of object clusters.
The distribution method can be any appropriate method; visual similarity is used here as the example for illustration. The visual similarity between classes a and b is as follows:
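As the original formula is reproduced only as an image placeholder in this text, one plausible form, assuming the similarity is the mean inner product of unit-normalized features over all pairs of training samples of the two classes (the normalization is an assumption, not stated in the surviving text), is:

$$\mathrm{sim}(a, b) = \frac{1}{N_a N_b} \sum_{i=1}^{N_a} \sum_{j=1}^{N_b} \frac{\langle h_{a,i},\, h_{b,j} \rangle}{\lVert h_{a,i} \rVert\, \lVert h_{b,j} \rVert},$$

with $N_a$ and $N_b$ the numbers of training samples of classes $a$ and $b$,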
where $h_{a,i}$ is the last GoogLeNet hidden-layer feature for the $i$th training sample of class $a$, and $h_{b,j}$ is that for the $j$th training sample of class $b$; $\langle h_{a,i}, h_{b,j} \rangle$ denotes the inner product between $h_{a,i}$ and $h_{b,j}$. With the similarity between two classes defined, object classes are grouped into a plurality of object clusters constituting a hierarchical tree structure, for example as shown in Fig. 4. At hierarchical level $l$, denote the $j_l$th group by $S_{l,j_l}$; for the present embodiment, $l = 1, \ldots, L$, $L = 4$, $j_l \in \{1, \ldots, J_l\}$, $J_1 = 1$, $J_2 = 4$, $J_3 = 7$, $J_4 = 18$. In some embodiments, there may be, for example, 200 object classes, so that initially $S_{1,1} = \{1, \ldots, 200\}$; on average there are then 200 object classes per group at level 1, 50 per group at level 2, 29 per group at level 3, and 11 per group at level 4. In Fig. 4, $S_{1,1} = S_{2,1} \cup S_{2,2} \cup S_{2,3} \cup S_{2,4}$ and $S_{2,1} = S_{3,1} \cup S_{3,2}$. In the hierarchical clustering results, the parent cluster $par(l, j_l)$ and children set $ch(l, j_l)$ of a cluster $(l, j_l)$ are defined such that

$$par(l+1, j') = (l, j_l) \quad \text{for every } (l+1, j') \in ch(l, j_l),$$

$$S_{l,j_l} = \bigcup_{(l+1,\, j') \in ch(l, j_l)} S_{l+1, j'}, \quad \text{and}$$

$$S_{l+1, j'} \cap S_{l+1, j''} = \emptyset \quad \text{for } j' \neq j''.$$

For example, as shown in Fig. 4, the child clusters of $S_{1,1}$ are $S_{2,1}$, $S_{2,2}$, $S_{2,3}$ and $S_{2,4}$, and $S_{1,1}$ is the parent cluster of $S_{2,1}$, $S_{2,2}$, $S_{2,3}$ and $S_{2,4}$.
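To make the grouping step concrete, the following is a minimal sketch, not the patent's reference implementation, of hierarchical grouping by visual similarity. It assumes per-class feature matrices taken from the last hidden layer of a pre-trained CNN; the group counts per level (1, 4, 7, 18) follow the example above, and the function names are hypothetical.

```python
# Illustrative sketch only: the patent does not publish reference code.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def class_similarity(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Mean inner product over all sample pairs of two classes
    (one plausible reading of the similarity defined above)."""
    # Normalize rows so the inner product behaves like cosine similarity.
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    return float(np.mean(a @ b.T))

def build_hierarchy(class_feats, n_groups_per_level=(1, 4, 7, 18)):
    """Group classes into one flat partition per level of the tree."""
    n = len(class_feats)
    sim = np.array([[class_similarity(class_feats[i], class_feats[j])
                     for j in range(n)] for i in range(n)])
    dist = sim.max() - sim                      # turn similarity into distance
    z = linkage(dist[np.triu_indices(n, k=1)], method='average')
    # Level 1 (k=1) is the single root cluster S_{1,1}.
    return {lvl + 1: fcluster(z, t=k, criterion='maxclust')
            for lvl, k in enumerate(n_groups_per_level)}
```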
In some embodiments, the predictive unit 202 or 204 takes as input images, bounding boxes, and an object set $S_{l,j_l}$. The predictive unit at the training stage and at the application stage differs only in the samples: at the training stage, the samples are obtained from training data; at the application stage, the samples are obtained from testing data. The predicted object class labels are outputted from the predictive unit.
Fig. 5 shows the steps used for the predictive unit according to some embodiments. At step S501, the inputted image is cropped by the bounding boxes and warped into the predefined size required by the CNNs used in the predictive unit; at step S502, objects in each bounding box are evaluated from the root cluster to the leaf clusters; and at step S503, the class label of the object in the cropped image is determined.
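Step S501's crop-and-warp operation might look like the following sketch; the 224×224 target size is an assumption (typical for GoogLeNet-style models), not stated in the text.

```python
# Sketch of step S501: crop by bounding box, warp to the CNN input size.
from PIL import Image

def crop_and_warp(image: Image.Image, box: tuple,
                  size: tuple = (224, 224)) -> Image.Image:
    """Crop the region given by box = (left, top, right, bottom) and
    resize it to the fixed input size the CNNs expect."""
    return image.crop(box).resize(size, Image.BILINEAR)
```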
Particularly, during evaluating, features of the cropped image are extracted at each cluster by its trained CNN, and a classification score for each cluster may then be calculated using the extracted features. The classification score for the object classes in one cluster may represent the probability that the object belongs to this cluster. The detailed evaluating process is described in Algorithm 1 shown in Fig. 6. At the cluster $(l, j_l)$, the detection scores (i.e., the classification scores) for the classes in group $S_{l,j_l}$ are evaluated (line 6 in Algorithm 1). These detection scores are used for deciding whether the children clusters $ch(l, j_l)$ need to be evaluated (line 8 in Algorithm 1). For a child cluster $(l+1, j') \in ch(l, j_l)$, if the maximum detection score among the classes in $S_{l+1,j'}$ is smaller than a threshold $T_l$, the sample is not considered a positive sample in class group $S_{l+1,j'}$, and the cluster $(l+1, j')$ and its children clusters are not evaluated.
For example, the detection scores for the 200 classes in $S_{1,1}$ are initially obtained at the node (1, 1) for a given sample of class bird. These 200-class scores are used for accepting this sample as an animal ($S_{2,1}$) and rejecting it as ball ($S_{2,2}$), instrument ($S_{2,3}$) or furniture ($S_{2,4}$). Then the scores for the animal classes are used for accepting the bird sample as vertebrate and rejecting it as invertebrate. Therefore, each node focuses on rejecting the sample as not belonging to a group of object classes. Finally, only the groups that are not rejected have the SVM scores for their classes (line 13 in Algorithm 1).
Finally, the cluster label of the deepest leaf cluster of the object is determined. If the determined cluster is an end cluster of the hierarchical tree structure, such as $S_{4,1}$, $S_{4,2}$, $S_{4,3}$ or $S_{4,4}$ as shown in Fig. 4, the class label, such as cow, bird, fish or ant, will be outputted. If the determined cluster is not an end cluster of the hierarchical tree structure, such as $S_{3,1}$, i.e., the classification scores of $S_{4,1}$, $S_{4,2}$, $S_{4,3}$ and $S_{4,4}$ are all smaller than the threshold, the object is considered background, and its class label will not be outputted.
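A compact sketch of this top-down evaluation (Algorithm 1 in Fig. 6) follows. The cluster-node structure, `score_fn`, and the per-level thresholds are placeholders; the patent computes per-class SVM scores on CNN features, which this sketch abstracts away, and for simplicity it descends only into the best-scoring accepted child, whereas Algorithm 1 may evaluate every accepted child.

```python
# Hedged sketch of the root-to-leaf evaluation; names are hypothetical.
def evaluate_tree(sample, root, score_fn, thresholds):
    """Walk from the root cluster to the deepest non-rejected leaf.

    root: cluster node with fields .level, .classes, .children (list).
    score_fn(sample, cluster) -> {class_label: score} for that cluster.
    thresholds[level]: rejection threshold T_l for children at that level.
    Returns a predicted class label, or None (background) when every
    child of some visited cluster is rejected before a leaf is reached.
    """
    node = root
    while True:
        scores = score_fn(sample, node)        # line 6 in Algorithm 1
        if not node.children:                  # reached an end cluster
            return max(scores, key=scores.get)
        # Keep only children whose best class score clears the threshold
        # (line 8 in Algorithm 1); rejected subtrees are never evaluated.
        survivors = [c for c in node.children
                     if max(scores[k] for k in c.classes) >= thresholds[node.level]]
        if not survivors:                      # all children rejected
            return None                        # treated as background
        node = max(survivors,
                   key=lambda c: max(scores[k] for k in c.classes))
```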
The CNNs respectively used for each of the clusters may be trained by the training unit before application. Fig. 7 shows the steps used for the training unit according to some embodiments. During training, at step S701, the images for training and the ground truth object class labels of the objects in the training images are obtained from a training set. At step S702, the CNNs of the predictive unit are initialized with the CNNs of their parent clusters, i.e., for the cluster $(l, j_l)$, a CNN model $M_{l,j_l}$ used for cluster $S_{l,j_l}$ is fine-tuned using the model of its parent cluster, $M_{l-1, par(j_l)}$, as the initial point; for example, as shown in Fig. 4, $M_{2,1}$ is initialized with $M_1$. At step S703, the training image is cropped and evaluated by the predictive unit, which outputs the predicted class label. At step S704, the predicted class labels are compared with the ground truth class labels, and the dissimilarities between them are computed. At step S705, whether the predicted class labels have converged to the ground truth labels is determined. If the predicted class labels have converged to the ground truth labels, the trained predictive unit is outputted; if not, the parameters of the CNNs are fine-tuned and steps S701 to S704 are repeated. In some embodiments, determining whether the predicted class labels have converged to the ground truth labels may be replaced with determining whether the accuracy of the predicted class labels can be further improved.
According to the process of the predictive unit during training, for one cluster, some cropped images with objects that do not belong to this cluster are rejected at its parent cluster; therefore, only a subset of object classes is used for fine-tuning the CNN for each cluster. In this way, the CNN may focus on learning the representations for this subset of object classes. Furthermore, when training the CNNs, the CNN for a parent cluster is used as the initial point of the CNN for its child cluster, so that knowledge from the parent cluster is transferred to the child cluster. Based on the above, at the training stage, the training of the CNN used for each cluster focuses on hard samples that cannot be handled well at its parent cluster. In this way, object detection will be faster and more accurate.
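The training loop of Fig. 7 could be sketched as follows in PyTorch-style code under stated assumptions: `tree.breadth_first()`, `train_loader.for_cluster(...)` (yielding only crops accepted by the node's parent), and `evaluate_accuracy(...)` are hypothetical helpers, and the optimizer settings are illustrative, not from the patent.

```python
# Hedged sketch of steps S701-S705; helper names are hypothetical.
import copy
import torch
import torch.nn.functional as F

def train_cluster_cnns(tree, train_loader, max_rounds=10):
    """Fine-tune one CNN per cluster, each initialized from its parent's CNN;
    stops when predicted-label accuracy no longer improves."""
    prev_acc = 0.0
    for _ in range(max_rounds):
        # S702: initialize every child model from its parent cluster's model,
        # transferring the parent's knowledge to the child.
        for node in tree.breadth_first():
            if node.parent is not None:
                node.model = copy.deepcopy(node.parent.model)
        # S703/S704: predict on training crops, compare with ground truth,
        # and fine-tune each cluster only on samples its parent accepted.
        for node in tree.breadth_first():
            opt = torch.optim.SGD(node.model.parameters(), lr=1e-3, momentum=0.9)
            for crops, labels in train_loader.for_cluster(node):
                loss = F.cross_entropy(node.model(crops), labels)
                opt.zero_grad()
                loss.backward()
                opt.step()
        # S705: stop once accuracy can no longer be improved.
        acc = evaluate_accuracy(tree, train_loader)   # hypothetical helper
        if acc <= prev_acc:
            break
        prev_acc = acc
```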
As will be appreciated by one skilled in the art, the present application may be embodied as a system, a method or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, with hardware aspects that may all generally be referred to herein as a "unit", "circuit", "module" or "system". Much of the inventive functionality and many of the inventive principles, when implemented, are best supported with integrated circuits (ICs), such as a digital signal processor and software therefor, or application-specific ICs. It is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein, will be readily capable of generating such ICs with minimal experimentation. Therefore, in the interest of brevity and minimization of any risk of obscuring the principles and concepts according to the present application, further discussion of such software and ICs, if any, will be limited to the essentials with respect to the principles and concepts used by the preferred embodiments. In addition, the present invention may take the form of an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware. For example, the system may comprise a memory that stores executable components and a processor, electrically coupled to the memory, to execute the executable components to perform the operations of the system, as discussed with reference to Figs. 1-7. Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
Although the preferred examples of the present application have been described, those skilled in the art can make variations or modifications to these examples upon learning of the basic inventive concept. The appended claims are intended to be construed as including the preferred examples and all variations or modifications that fall within the scope of the present application.
Obviously, those skilled in the art can make variations or modifications to the present application without departing from the spirit and scope of the present application. As such, if these variations or modifications belong to the scope of the claims and equivalent techniques, they may also fall within the scope of the present application.

Claims (24)

  1. A method for object detection, comprising:
    grouping object classes for an object to be detected into a plurality of object clusters constituting a hierarchical tree structure;
    obtaining an image and at least one bounding box for an obtained image;
    evaluating objects in each bounding box by CNNs respectively trained for each of the object clusters of the hierarchical tree structure, from a root cluster to leaf clusters of the hierarchical tree structure, to determine a deepest leaf cluster of the objects; and
    outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object.
  2. The method of claim 1, wherein the grouping comprises:
    obtaining training images containing object to be detected and at least one bounding box for the training images from a training set;
    extracting, by a trained CNN, features for the object in each bounding box; and
    distributing object class of the object in each bounding box into the object clusters constituting the hierarchical tree structure according to similarity among the extracted features.
  3. The method of claim 2, wherein the distributing is based on a visual similarity.
  4. The method of claim 1, wherein the evaluating comprises:
    extracting features from the obtained image by the trained CNN for a parent cluster;
    calculating, according to the extracted features, classification scores of objects for each child cluster of the parent cluster;
    accepting the objects into the child clusters with the classification score larger than a threshold value, and the child cluster is used as a parent cluster in the next evaluating, wherein the other clusters except for the said child cluster are not evaluated;
    performing repeatedly the extracting, the calculating, and the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists.
  5. The method of claim 4, wherein the method further comprises training CNNs respectively used for each of the object clusters, which comprises:
    initializing the CNNs respectively used for each of the object clusters with the CNNs of their parent clusters;
    evaluating objects in each bounding box through the extracting, the calculating, the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists, to determine a deepest leaf cluster of the objects;
    outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object;
    fine-tuning the CNNs for each cluster based on dissimilarities between the predicted object class labels and the ground truth object class labels for objects in the training images; and
    repeating the initializing, the evaluating, the outputting, and the fine-tuning until the accuracy of the predicted object class labels converges.
  6. The method of claim 5, wherein the extracting comprises:
    cropping the obtained images by the bounding box;
    warping the cropped images into predefined size required by the trained CNNs; and
    extracting the features from the warped images by the trained CNNs.
  7. The method of claim 4, wherein the classification score represents a probability that the object belongs to an object class in one cluster.
  8. The method of claim 1, wherein the outputting comprises:
    determining that the determined leaf cluster is an end cluster of the hierarchical tree structure; and
    outputting the object class label at said leaf cluster as the predicted object class label of the object.
  9. A system for object detection, comprising:
    a grouping unit for grouping object classes for an object to be detected into a plurality of object clusters constituting a hierarchical tree structure; and
    a predictive unit for:
    obtaining an image and at least one bounding box for an obtained image;
    evaluating objects in each bounding box by CNNs respectively trained for each of the object clusters of the hierarchical tree structure, from a root cluster to leaf clusters of the hierarchical tree structure, to determine a deepest leaf cluster of the objects; and
    outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object.
  10. The system of claim 9, wherein the grouping unit is further configured to:
    obtain training images containing object to be detected and at least one bounding box for the training images from a training set;
    extract, by a trained CNN, features for the object in each bounding box; and
    distribute object class of the object in each bounding box into the object clusters constituting the hierarchical tree structure according to similarity among the extracted features.
  11. The system of claim 10, wherein the distributing is based on a visual similarity.
  12. The system of claim 9, wherein the predictive unit is configured for:
    extracting features from the obtained image by the trained CNN for a parent cluster;
    calculating, according to the extracted features, classification scores of objects for each child cluster of the parent cluster;
    accepting the objects into the child clusters with the classification score larger than a threshold value, and the child cluster is used as a parent cluster in the next evaluating, wherein the other clusters except for the said child cluster are not evaluated;
    performing repeatedly the extracting, the calculating, and the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists.
  13. The system of claim 12, further comprising:
    a training unit configured to train CNNs respectively used for each of the object clusters by:
    initializing the CNNs respectively used for each of the object clusters with the CNNs of their parent clusters;
    evaluating objects in each bounding box through the extracting, the calculating, the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists, to determine a deepest leaf cluster of the objects;
    outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object;
    fine-tuning the CNNs for each cluster based on dissimilarities between the predicted object class labels and the ground truth object class labels for objects in the training images; and
    repeating the initializing, the evaluating, the outputting, and the fine-tuning until the accuracy of the predicted object class labels converges.
  14. The system of claim 13, wherein the predictive unit is configured to extract features from the obtained image by:
    cropping the obtained images by the bounding box;
    warping the cropped images into predefined size required by the trained CNNs; and
    extracting the features from the warped images by the trained CNNs.
  15. The system of claim 12, wherein the classification score represents a probability that the object belongs to an object class in one cluster.
  16. The system of claim 9, wherein the outputting comprises:
    determining that the determined leaf cluster is an end cluster of the hierarchical tree structure; and
    outputting the object class label at said leaf cluster as the predicted object class label of the object.
  17. A system for object detection, comprising:
    a memory that stores executable components; and
    a processor electrically coupled to the memory to execute the executable components for:
    grouping object classes for an object to be detected into a plurality of object clusters constituting a hierarchical tree structure;
    obtaining an image and at least one bounding box for an obtained image;
    evaluating objects in each bounding box by CNNs respectively trained for each of the object clusters of the hierarchical tree structure, from a root cluster to leaf clusters of the hierarchical tree structure, to determine a deepest leaf cluster of the objects; and
    outputting an object class label at the deepest leaf cluster as a predicted object class label of the object.
  18. The system of claim 17, wherein the grouping comprises:
    obtaining training images containing object to be detected and at least one bounding box for the training images from a training set;
    extracting, by a trained CNN, features for the object in each bounding box; and
    distributing object class of the object in each bounding box into the object clusters constituting the hierarchical tree structure according to similarity among the extracted features.
  19. The system of claim 18, wherein the distributing is based on a visual similarity.
  20. The system of claim 17, wherein the evaluating comprises:
    extracting features from the obtained image by the trained CNN for a parent cluster;
    calculating, according to the extracted features, classification scores of objects for each child cluster of the parent cluster;
    accepting the objects into the child clusters with the classification score larger than a threshold value, and the child cluster is used as a parent cluster in the next evaluating, wherein the other clusters except for the said child cluster are not evaluated;
    performing repeatedly the extracting, the calculating, and the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists.
  21. The system of claim 20, wherein the executable components further comprise:
    training CNNs respectively used for each of the object clusters, which comprises:
    initializing the CNNs respectively used for each of the object clusters with the CNNs of their parent clusters;
    evaluating objects in each bounding box through the extracting, the calculating, the accepting until the object cluster is located in a last level or no classification score larger than the threshold value exists, to determine a deepest leaf cluster of the objects;
    outputting an object class label at the determined deepest leaf cluster as a predicted object class label of the object;
    fine-tuning the CNNs for each cluster based on dissimilarities between the predicted object class labels and the ground truth object class labels for objects in the training images; and
    repeating the initializing, the evaluating, the outputting, and the fine-tuning until the accuracy of the predicted object class labels converges.
  22. The system of claim 21, wherein the extracting comprises:
    cropping the obtained images by the bounding box;
    warping the cropped images into predefined size required by the trained CNNs; and
    extracting the features from the warped images by the trained CNNs.
  23. The system of claim 21, wherein the classification score represents a probability that the object belongs to an object class in one cluster.
  24. The system of claim 17, wherein the outputting comprises:
    determining that the determined leaf cluster is an end cluster of the hierarchical tree structure; and
    outputting the object class label at said leaf cluster as the predicted object class label of the object.
PCT/CN2016/071193 2016-01-18 2016-01-18 System and method for object detection WO2017124221A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2016/071193 WO2017124221A1 (en) 2016-01-18 2016-01-18 System and method for object detection
CN201680079308.7A CN108496185B (en) 2016-01-18 2016-01-18 System and method for object detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/071193 WO2017124221A1 (en) 2016-01-18 2016-01-18 System and method for object detection

Publications (1)

Publication Number Publication Date
WO2017124221A1 (en) 2017-07-27

Family

ID=59361177

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/071193 WO2017124221A1 (en) 2016-01-18 2016-01-18 System and method for object detection

Country Status (2)

Country Link
CN (1) CN108496185B (en)
WO (1) WO2017124221A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3852054A1 (en) 2020-01-16 2021-07-21 Koninklijke Philips N.V. Method and system for automatically detecting anatomical structures in a medical image
US11270121B2 (en) 2019-08-20 2022-03-08 Microsoft Technology Licensing, Llc Semi supervised animated character recognition in video
US11366989B2 (en) 2019-08-20 2022-06-21 Microsoft Technology Licensing, Llc Negative sampling algorithm for enhanced image classification
US11450107B1 (en) 2021-03-10 2022-09-20 Microsoft Technology Licensing, Llc Dynamic detection and recognition of media subjects

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814885B (en) * 2020-07-10 2021-06-22 云从科技集团股份有限公司 Method, system, device and medium for managing image frames

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130148881A1 (en) * 2011-12-12 2013-06-13 Alibaba Group Holding Limited Image Classification
US20130259371A1 (en) * 2012-03-28 2013-10-03 Oncel Tuzel Appearance and Context Based Object Classification in Images
CN104992191A (en) * 2015-07-23 2015-10-21 厦门大学 Image classification method based on deep learning feature and maximum confidence path

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3630734B2 (en) * 1994-10-28 2005-03-23 キヤノン株式会社 Information processing method
CN1838150A (en) * 2005-03-09 2006-09-27 西门子共同研究公司 Probabilistic boosting tree structure for learned discriminative models
CN101290660A (en) * 2008-06-02 2008-10-22 中国科学技术大学 Tree-shaped assembled classification method for pedestrian detection
US8744172B2 (en) * 2011-06-15 2014-06-03 Siemens Aktiengesellschaft Image processing using random forest classifiers
US9117132B2 (en) * 2012-11-16 2015-08-25 Tata Consultancy Services Limited System and method facilitating designing of classifier while recognizing characters in a video
CN103324954B (en) * 2013-05-31 2017-02-08 中国科学院计算技术研究所 Image classification method based on tree structure and system using same
CN103530405B (en) * 2013-10-23 2016-08-31 天津大学 A kind of image search method based on hierarchy
CN104978328A (en) * 2014-04-03 2015-10-14 北京奇虎科技有限公司 Hierarchical classifier obtaining method, text classification method, hierarchical classifier obtaining device and text classification device
CN104182981B (en) * 2014-08-26 2017-02-22 北京邮电大学 Image detection method and device
CN104217225B (en) * 2014-09-02 2018-04-24 中国科学院自动化研究所 A kind of sensation target detection and mask method
CN104281851B (en) * 2014-10-28 2017-11-03 浙江宇视科技有限公司 The extracting method and device of logo information
CN104794489B (en) * 2015-04-23 2019-03-08 苏州大学 A kind of induction type image classification method and system based on deep tag prediction
CN105069472B (en) * 2015-08-03 2018-07-27 电子科技大学 A kind of vehicle checking method adaptive based on convolutional neural networks
CN105205501B (en) * 2015-10-04 2018-09-18 北京航空航天大学 A kind of weak mark image object detection method of multi classifier combination

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130148881A1 (en) * 2011-12-12 2013-06-13 Alibaba Group Holding Limited Image Classification
US20130259371A1 (en) * 2012-03-28 2013-10-03 Oncel Tuzel Appearance and Context Based Object Classification in Images
CN104992191A (en) * 2015-07-23 2015-10-21 厦门大学 Image classification method based on deep learning feature and maximum confidence path

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11270121B2 (en) 2019-08-20 2022-03-08 Microsoft Technology Licensing, Llc Semi supervised animated character recognition in video
US11366989B2 (en) 2019-08-20 2022-06-21 Microsoft Technology Licensing, Llc Negative sampling algorithm for enhanced image classification
EP3852054A1 (en) 2020-01-16 2021-07-21 Koninklijke Philips N.V. Method and system for automatically detecting anatomical structures in a medical image
WO2021144230A1 (en) 2020-01-16 2021-07-22 Koninklijke Philips N.V. Method and system for automatically detecting anatomical structures in a medical image
US11450107B1 (en) 2021-03-10 2022-09-20 Microsoft Technology Licensing, Llc Dynamic detection and recognition of media subjects

Also Published As

Publication number Publication date
CN108496185A (en) 2018-09-04
CN108496185B (en) 2022-09-16

Similar Documents

Publication Publication Date Title
US11741372B2 (en) Prediction-correction approach to zero shot learning
WO2017124221A1 (en) System and method for object detection
Jing et al. Videossl: Semi-supervised learning for video classification
Zhang et al. Self supervised deep representation learning for fine-grained body part recognition
WO2019200747A1 (en) Method and device for segmenting proximal femur, computer apparatus, and storage medium
EP2065813B1 (en) Object comparison, retrieval, and categorization methods and apparatuses
US10242295B2 (en) Method and apparatus for generating, updating classifier, detecting objects and image processing device
JP2019521443A (en) Cell annotation method and annotation system using adaptive additional learning
CN108491766B (en) End-to-end crowd counting method based on depth decision forest
US8761510B2 (en) Object-centric spatial pooling for image classification
WO2023109208A1 (en) Few-shot object detection method and apparatus
EP3620958A1 (en) Learning method, learning device for detecting lane through lane model and testing method, testing device using the same
WO2016090522A1 (en) Method and apparatus for predicting face attributes
CN108090489A (en) Offline handwriting Balakrishnan word recognition methods of the computer based according to grapheme segmentation
CN111062277A (en) Sign language-lip language conversion method based on monocular vision
Zhang et al. Moving foreground-aware visual attention and key volume mining for human action recognition
US10373028B2 (en) Pattern recognition device, pattern recognition method, and computer program product
Chen et al. Discover and learn new objects from documentaries
Yang et al. Bicro: Noisy correspondence rectification for multi-modality data via bi-directional cross-modal similarity consistency
Jiang et al. Dynamic proposal sampling for weakly supervised object detection
US11829442B2 (en) Methods and systems for efficient batch active learning of a deep neural network
CN111340057A (en) Classification model training method and device
Afkham et al. Joint visual vocabulary for animal classification
CN109389543B (en) Bus operation data statistical method, system, computing device and storage medium
EP3910549A1 (en) System and method for few-shot learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16885494

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16885494

Country of ref document: EP

Kind code of ref document: A1