WO2016037300A1 - Method and system for multi-class object detection - Google Patents

Method and system for multi-class object detection

Info

Publication number
WO2016037300A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
bounding boxes
boxes
training
detector
Prior art date
Application number
PCT/CN2014/000833
Other languages
English (en)
Inventor
Xiaoou Tang
Wanli OUYANG
Xingyu ZENG
Shi QIU
Chen Change Loy
Xiaogang Wang
Original Assignee
Xiaoou Tang
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiaoou Tang filed Critical Xiaoou Tang
Priority to CN201480081846.0A priority Critical patent/CN106688011B/zh
Priority to PCT/CN2014/000833 priority patent/WO2016037300A1/fr
Publication of WO2016037300A1 publication Critical patent/WO2016037300A1/fr

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/768Arrangements for image or video recognition or understanding using pattern recognition or machine learning using context analysis, e.g. recognition aided by known co-occurring patterns
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24317Piecewise classification, i.e. whereby each classification requires several discriminant rules
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present application relates to a method and a system of multi-class object detection, whose aim is to automatically detect instances of objects of different classes in digital images and videos.
  • the aim of object detection is to detect instances of objects of a certain class in digital images and videos.
  • the performance of object detection systems depends heavily on image representation, whose quality can be influenced by many kinds of variations, such as viewpoints, illuminations, poses, and occlusions. Due to such uncontrollable factors, it is non-trivial to design a robust image representation that is sufficiently discriminative to represent a large quantity of object classes.
  • conventional approaches rely on hand-crafted features such as Gabor, SIFT, and HOG.
  • object detection based on hand-crafted features involves extracting multiple features on the landmarks of images at multiple scales, and concatenating them into high-dimensional feature vectors.
  • more recently, the Deep Convolutional Neural Network (CNN) has been applied to learn features directly from raw pixels.
  • existing deep CNN learning methods pre-train the CNN by using images without bounding box ground truth, and subsequently fine-tune the deep neural net using another set of images with bounding box ground truth.
  • the image set used for fine-tuning has a smaller number of semantic classes compared to the image set used for pre-training.
  • the number of semantic classes in the image set used for fine-tuning equals the number of actual classes we wish to detect.
  • the device may comprise a feature learning unit and a sub-boxes detector unit.
  • the feature learning unit is configured to determine a first neural network based on training images of a first training image set, wherein each of the training images has a plurality of bounding boxes with objects inside; and determine a second neural network based on bounding boxes of the training images of the first training image set and then further fine-tune the second neural network based on bounding boxes of training images of a second training image set.
  • the sub-boxes detector unit is configured to determine a binary classifier detector for the bounding boxes of the first and the second image sets based on the second neural network, each score of the determined binary classifier detector predicting one semantic object class inside one of the bounding boxes
  • a feature learning module configured to determine a plurality of classification features for each candidate bounding box of an inputted image
  • a sub-boxes detector module configured to utilize a pre-trained detection neural network to calculate a plurality of detection class scores for each candidate box based on the classification features determined by the feature learning module (203)
  • a context information module configured to concatenate the calculated classification class scores, and determine a final score for the candidate bounding box, the final score representing one semantic object class inside one of the bounding boxes of the inputted image.
  • a system for multi-class object detection which comprises a training device, configured to determine a classification neural network, and a detection neural network from a plurality of predetermined training image sets.
  • the system further comprises a prediction device, comprising a feature learning module configured to determine a plurality of features for each candidate bounding box of an inputted image based on the detection neural network, wherein the detection neural network takes the candidate bounding box as input and operates to output detection features for the candidate bounding box; a sub-boxes detector module configured to utilize the classification neural network to calculate a plurality of classification class scores for each candidate bounding box based on said detection features; and a context information module configured to concatenate the calculated classification class scores, and determine, based on the detection neural network, a final score for the candidate bounding box, the final score representing the semantic object class inside the box.
  • a method for training neural networks of multi-class object detection comprising:
  • each of the sub-boxes detector scores predicting one value for one of the bounding boxes for one semantic object class.
  • the present application further proposes a method for multi-class object detection, comprising:
  • the detection neural network takes the candidate bounding box as input and calculates feature values from the last hidden layer of the detection neural network
  • Fig. 1 is a schematic diagram illustrating an exemplary system for multi-class object detection according to one embodiment of the present application.
  • Fig. 2 is a schematic diagram illustrating an exemplary block diagram of training device according to one embodiment of the present application.
  • Fig. 3 illustrates a flow chart of the operations for the selective search unit according to one embodiment of the present application.
  • Fig. 4 illustrates a flow chart of the operations for the feature learning unit according to one embodiment of the present application.
  • Fig. 5 illustrates a flow chart for the feature learning unit to train a neural network according to one embodiment of the present application.
  • Fig. 6 illustrates sub-image patches according to one embodiment of the present application.
  • Fig. 7 illustrates a flow chart of the operations for the sub-boxes detector unit according to one embodiment of the present application.
  • Fig. 8 illustrates a flow chart of the operations for the sub-boxes detector unit according to another embodiment of the present application.
  • Fig. 9 illustrates a flow chart of the operations for the contextual information unit according to another embodiment of the present application.
  • Fig. 10 is a schematic diagram illustrating an exemplary configuration of neural network structure according to one embodiment of the present application.
  • Fig. 11 is a schematic diagram illustrating an exemplary configuration of deformation layer of the network according to one embodiment of the present application.
  • Fig. 12 is a schematic diagram illustrating an exemplary block diagram for the prediction device according to one embodiment of the present application.
  • Fig. 13 is a flow chart for the process showing how to output predicted bounding boxes and the corresponding scores for the predicted bounding boxes according to one embodiment of the present application.
  • Fig. 14 illustrates a flow chart of the operations for the model average unit according to another embodiment of the present application.
  • Fig. 1 is a schematic diagram illustrating an exemplary system 100 for multi-class object detection according to one embodiment of the present application.
  • the system 100 for multi-class object detection may comprise a training device 10 and a prediction device 20.
  • each box contains a target semantic object.
  • the training device 10 determines a classification neural network, a detection neural network, a plurality of (n) sub-boxes detectors and a plurality of (n) context information detectors from the retrieved training set.
  • the prediction device 20 can use the networks, sub-boxes detectors and context detectors to detect semantic classes in the images.
  • the prediction device 20 takes an image as input, and outputs bounding box coordinates (x, y, w, h), with which each box contains a target semantic object.
  • Fig. 2 is a schematic diagram illustrating an exemplary block diagram of training device 10 according to one embodiment of the present application.
  • the training device 10 may comprise a selective search unit 101, a region rejection unit 102, a feature learning unit 103, a sub-boxes detector unit 104 and a contextual information unit 105, which will be discussed in detail below.
  • the selective search unit 101 is configured to retrieve at least one digital image of videos, then propose an over-complete set of candidate bounding boxes that may have objects inside for each retrieved image, and then output a plurality of positive and negative candidate bounding boxes (x, y, w, h).
  • Fig. 3 illustrates a flow chart of the operations for the selective search unit 101 according to one embodiment of the present application.
  • the selective search unit 101 operates to resize each of the retrieved images to a fixed width, e.g., 500 pixels.
  • the selective search unit 101 performs super-pixel segmentation on each of the images to obtain a set of bounding box locations for each image, for example, a small set of data-driven, class-independent, high-quality bounding box locations.
  • the selective search unit 101 compares the candidate bounding boxes (i.e., the obtained bounding boxes) with manually labeled bounding boxes to determine whether the overlap between the candidate bounding boxes and the manually labeled bounding boxes is larger than a predetermined threshold (in terms of overlap area ratio), for example 0.5. If yes, the bounding box is regarded as a positive sample in step s304, whereas boxes with overlap less than 0.5 are regarded as negative samples in step s305. A sketch of this labeling rule is given below.
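  • As an illustrative sketch of the labeling rule above (a minimal Python sketch, assuming (x, y, w, h) box coordinates and assuming intersection-over-union as the overlap-area-ratio measure, which the present application does not fix):

```python
def overlap_ratio(box_a, box_b):
    # Intersection-over-union of two (x, y, w, h) boxes.
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def label_candidates(candidates, ground_truth, threshold=0.5):
    # Split candidate boxes into positive/negative samples (steps s304/s305).
    positives, negatives = [], []
    for box in candidates:
        best = max((overlap_ratio(box, gt) for gt in ground_truth), default=0.0)
        (positives if best > threshold else negatives).append(box)
    return positives, negatives
```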
  • the region rejection unit 102 is configured to throw away a large part of the candidate bounding boxes, according to their scores, to make the following procedure faster. This unit 102 is applied on only the fine-tuning set. In other words, the region rejection unit 102 receives at least one image of videos and the obtained positive and negative candidate bounding boxes (x, y, w, h), and determines which boxes of the obtained positive and negative candidate bounding boxes will be filtered out, based on the received images.
  • the region rejection unit 102 operates to obtain an object detection score for each positive and negative candidate bounding box.
  • the region rejection unit 102 may apply any existing object detector on the input images to obtain an object detection score for each positive and negative candidate bounding box (x, y, w, h).
  • the i-th candidate bounding box is rejected if the following rejection condition is satisfied:

    \max_{j} s_{i,j} < T,    (1)

    where i is the sample index, j is the class index, s_{i,j} is the object detection score of the i-th candidate bounding box for the j-th class, and T is a pre-determined threshold.
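  • A minimal sketch of this rejection rule (assuming the detector scores are collected into a num_boxes x num_classes array; the choice of detector is left open by the present application):

```python
import numpy as np

def reject_candidates(scores, threshold):
    # Apply rejection condition (1): a box is rejected when max_j s_ij < T.
    # `scores` has shape (num_boxes, num_classes); returns a keep-mask.
    scores = np.asarray(scores, dtype=float)
    return scores.max(axis=1) >= threshold
```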
  • the feature learning unit 103 is used to train a neural network whose last hidden-layer values are regarded as features.
  • the feature learning unit 103 receives, as its inputs, a pre-training set, a fine-tuning set and the filtered bounding boxes, and then determines, based on the inputs, a fine-tuned neural network, wherein the values outputted from the last hidden layer of the fine-tuned neural network will be regarded as features.
  • the pre-training set may consist of images and the corresponding ground truth bounding boxes (x, y, w, h) .
  • the pre-training set encompasses m object classes.
  • the fine-tuning set may consist of images and the corresponding ground truth bounding boxes (x, y, w, h).
  • the fine-tuning set encompasses n object classes.
  • Fig. 4 illustrates a flow chart of the operations for the feature learning unit 103 according to one embodiment of the present application.
  • the unit 103 operates to pre-train the first neural network using the images in the pre-training set with positive and negative bounding boxes as determined by the selective search unit 101.
  • the feature learning unit 103 may utilize a back-propagation algorithm to train a neural network.
  • Fig. 5 illustrates a flow chart for the feature learning unit 103 to train a neural network.
  • the feature learning unit 103 first creates a neural network and then randomly initializes the created network. The configuration of the created network will be discussed later.
  • in step s4012, the feature learning unit 103 calculates the pre-defined loss function for the inputted images in the pre-training set and for the candidate positive and negative image regions corresponding to the positive and negative bounding boxes.
  • in step s4013, the feature learning unit 103 calculates the gradient of the loss function L with respect to all the parameters \theta, that is, \partial L / \partial \theta. Then, in step s4014, the update process can be described as \theta \leftarrow \theta - lr \cdot \partial L / \partial \theta, where lr is one prefixed learning rate.
  • in step s4015, the feature learning unit 103 will check whether the stopping criterion, for example, whether the loss value on the validation set is increasing, is satisfied. If not, the feature learning unit 103 returns to step s4012 and runs through steps s4012-s4015 until the stopping criterion is satisfied. A sketch of this loop is given below.
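  • As an illustrative sketch of this training loop (assumptions: plain stochastic gradient descent and PyTorch-style data loaders; the present application fixes neither):

```python
import torch

def train(network, loss_fn, train_loader, val_loader, lr=0.01, max_epochs=100):
    # Back-propagation training following steps s4012-s4015.
    opt = torch.optim.SGD(network.parameters(), lr=lr)
    best_val = float("inf")
    for epoch in range(max_epochs):
        for regions, labels in train_loader:   # s4012: loss on pos/neg image regions
            loss = loss_fn(network(regions), labels)
            opt.zero_grad()
            loss.backward()                    # s4013: gradient w.r.t. all parameters
            opt.step()                         # s4014: theta <- theta - lr * gradient
        with torch.no_grad():                  # s4015: stop once validation loss rises
            val = sum(loss_fn(network(x), y).item() for x, y in val_loader)
        if val > best_val:
            break
        best_val = val
    return network
```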
  • a second neural network with the same structure as the pre-trained neural network will be created in step S402.
  • the second neural network is initialized by using the parameters of the pre-trained neural network.
  • the feature learning unit 103 operates to replace the output layer of the second neural network, which has m nodes, with a new output layer with n nodes.
  • the feature learning unit 103 operates to fine-tune the second neural network using the bounding boxes of the images in the pre-training set and then further fine-tune the second neural network using the bounding boxes of the images in the fine-tuning set.
  • alternatively, the first neural network may be trained/tuned by using the bounding boxes of the pre-training set, and then, in step s405, the feature learning unit 103 operates to fine-tune the second neural network using the bounding boxes of the images in the fine-tuning set.
  • the pre-train step (step s401) uses the whole images in the pre-training set to train the first neural network
  • the fine-tuning step (step s405) uses the image regions (bounding boxes containing objects) in the pre-training set and then further uses the fine-tuning set to train the second neural network.
  • the feature learning unit 103 operates to replace the output layer of the second neural network, which has m nodes, with a new output layer with n nodes; thus the difference between the pre-training step (step s401) and the fine-tuning step (step s405) is that the last layer of the first network has m nodes whereas the last layer of the second network has n nodes.
  • Prior art methods often use the whole images in the pre-training set to train the first neural network and use the image regions (bounding boxes containing objects) in the fine-tuning set to train the second neural network.
  • the process as proposed above in the present application uses the image regions (bounding boxes containing objects) in the pre-training set to improve the feature learning performance of the feature learning unit.
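  • A minimal sketch of this output-layer replacement (PyTorch is an assumption, and the attribute name `output` is illustrative; real network definitions differ):

```python
import copy

import torch.nn as nn

def adapt_output_layer(pretrained_net, feature_dim, n_classes):
    # Initialize the second network from the pre-trained first network,
    # then swap its m-node output layer for a new n-node layer.
    second_net = copy.deepcopy(pretrained_net)
    second_net.output = nn.Linear(feature_dim, n_classes)  # m nodes -> n nodes
    return second_net
```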
  • Sub-boxes detector unit 104
  • the sub-boxes detector unit 104 receives at least one image and the candidate bounding boxes (i.e., the boxes outputted from the unit 102), and then utilizes the fine-tuned network trained by the unit 103 to output a plurality of (n) Support Vector Machine (SVM) detectors, each of which predicts one value for one candidate bounding box for one semantic object class, such that n SVM detectors will be obtained for the prediction device (to be discussed later) to predict detection scores for n object classes.
  • the SVM is discussed as an example only, and any other binary classifier may be used in the embodiments of the present application.
  • the sub-boxes detector unit 104 calculates the feature vector F_B, using the fine-tuned neural network obtained from the feature learning unit 103, to describe each candidate bounding box's contents, and further divides each box into a plurality of sub-image patches.
  • Fig. 6 illustrates 4 sub-image patches as an example. It should be appreciated that a different number of sub-image patches may be used in the embodiments of the present application.
  • Fig. 7 illustrates a flow chart of the operations for the sub-boxes detector unit 104 according to one embodiment of the present application (following the max-average SVM approach).
  • the sub-boxes detector unit 104 divides the received bounding box into a plurality of (for example, 4) sub-image patches w.
  • for each sub-image-patch w, the sub-boxes detector unit 104 calculates its overlapping ratios with all object-bounding-boxes B using the following equation:

    O(w, B) = S_{w \cap B} / (S_w + S_B - S_{w \cap B}),

    where S_w, S_B, and S_{w \cap B} are the size of the sub-image-patch w, the size of the object-bounding-box B, and the size of the intersected region of the sub-image-patch w and the object-bounding-box B, respectively.
  • in step s703, for each sub-image-patch w, the object-bounding-box with the highest overlapping ratio is chosen as its corresponding box, i.e., B*(w) = argmax_B O(w, B).
  • the feature vector of that object-bounding-box is assigned to the sub-image-patch w to describe its contents.
  • in step s704, for each object-bounding-box proposal B with its sub-image-patches w_1, ..., w_K, the element-wise average of the feature vectors of the plurality of sub-image-patches and the element-wise maximum of the feature vectors of the plurality of sub-image-patches are calculated as

    F_avg = (1/K) \sum_k F_{B*(w_k)}  and  F_max = \max_k F_{B*(w_k)} (element-wise).

  • the feature vector F_B of the object-bounding-box B is concatenated with F_avg and F_max to create a longer feature vector to describe the image contents within the bounding box B.
  • the fine-tuned neural network obtained from the feature learning unit 103 is used to extract features from exact sub-image-patch regions. The element-wise average and maximum of the feature vectors are used to describe the image content.
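  • A minimal sketch of this max-average feature construction (NumPy is an assumption; F_B and the per-patch feature vectors are taken as given):

```python
import numpy as np

def max_average_feature(f_box, patch_features):
    # Build the concatenated descriptor [F_B, F_avg, F_max] for one proposal.
    # `patch_features` holds one feature vector per sub-image-patch, either
    # assigned from its best-overlapping box or extracted from the patch.
    stacked = np.stack([np.asarray(f) for f in patch_features])  # K x d
    f_avg = stacked.mean(axis=0)                 # element-wise average
    f_max = stacked.max(axis=0)                  # element-wise maximum
    return np.concatenate([np.asarray(f_box), f_avg, f_max])
```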
  • in step s706, the concatenated feature vectors and the ground-truth labels of the object-bounding-boxes B are used to train the binary classifier (for example, the SVM as discussed above) detector to output a likelihood score for every possible object class that the box might belong to.
  • Fig. 8 illustrates a flow chart of the operations for the sub-boxes detector unit 104 according to another embodiment of the present application (following the multiple-feature SVM approach).
  • the sub-boxes detector unit 104 divides the received bounding box into a plurality of (for example, 4) sub-image patches w.
  • in step s802, for each object-bounding-box B, its feature vector F_B and the feature vectors from the sub-image-patches are used to train separate support vector machines. For example, where there are 4 sub-image-patches, F_B and the 4 feature vectors from the 4 sub-image-patches are used to train 5 separate support vector machines.
  • in step s803, given a new object-bounding-box B and its feature vector extracted by the fine-tuned neural network obtained from the feature learning unit 103, the corresponding support vector machine is applied to calculate a likelihood score for each object class.
  • in step s804, for each sub-image-patch w, the sub-boxes detector unit 104 first calculates its overlapping ratios with all proposed object-bounding-boxes B using the same equation as above,
  • where S_w, S_B, and S_{w \cap B} are the size of the sub-image-patch w, the size of the object-bounding-box B, and the size of the intersected region of the sub-image-patch w and the object-bounding-box B, respectively.
  • the object-bounding-boxes whose overlapping ratio with w is larger than a predetermined threshold (for example, 0.5) are chosen as the candidate corresponding bounding boxes of w.
  • the corresponding trained support vector machine of w is used to test all its candidate corresponding bounding boxes. For each candidate bounding box, the trained support vector machine generates a score for each possible object class in step s805. The highest score of each object class from all candidate windows is chosen as the class likelihood score for w.
  • in step s806, the object-bounding-box and its (for example, 4) sub-image-patches are associated with a plurality of (for example, 5) sets of object class likelihood scores; the sets of scores are normalized independently and summed together to output a final set of object class likelihood scores, as sketched below.
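  • A minimal sketch of this score fusion (L1 normalization is an assumption; the present application only states that the score sets are normalized independently before summation):

```python
import numpy as np

def fuse_score_sets(score_sets):
    # Normalize each set of class likelihood scores independently (step s806),
    # then sum the normalized sets into one set of class likelihoods.
    fused = np.zeros(len(score_sets[0]), dtype=float)
    for scores in score_sets:
        scores = np.asarray(scores, dtype=float)
        norm = np.abs(scores).sum()
        fused += scores / norm if norm > 0 else scores
    return fused
```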
  • the contextual information unit 105 is configured to exploit contextual information to improve detection performance.
  • the contextual information unit 105 receives at least one image and receives the candidate bounding boxes from the unit 102.
  • the unit 105 further retrieves the scores of the sub-boxes detector from the sub-boxes detector unit 104 and the contextual information from the feature learning unit 103, i.e., the classification scores outputted from the first network.
  • the unit 105 utilizes the pre-trained network and the fine-tuned network to train one binary classifier (for example, an SVM) for each detection class of the candidate bounding boxes, so as to output n binary classifiers that predict an n-dimensional vector for each candidate bounding box.
  • Fig. 9 illustrates a flow chart of the operations for the contextual information unit 105 according to another embodiment of the present application.
  • the contextual information unit 105 utilizes the pre-trained network to output the classification score vector s_c (the contextual information) for the whole of the received image, where L_c is the number of the classification categories and s_c(i) is the probability of the i-th classification class, i.e., the i-th of the m classes in the pre-training set.
  • the contextual information unit 105 then operates to concatenate the classification score s_c and the detection score s_d obtained by the sub-boxes detector unit 104 for each bounding box in this image.
  • a new one-vs-all binary classifier (SVM) is trained for each of the n detection classes with contextual modeling.
  • the feature vector x_B may be concatenated from s_d(j) and a sparse feature vector x_c with a weight \lambda, i.e., by rule of:

    x_B = [ s_d(j), \lambda \cdot x_c ]

  • in order to avoid over-fitting to the training data, some irrelevant dimensions of the feature vector are set to zero in step s903.
  • in step s904, the contextual information unit 105 operates to train one binary classifier for each detection class. Let \delta_j select the most relevant classes in the classification task for the j-th class in the detection task: x_c(i) = s_c(i) if i \in \delta_j, and x_c(i) = 0 otherwise. Then the final score would be outputted as the score of the binary classifier in step s905. A sketch of this feature construction follows below.
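  • A minimal sketch of this contextual feature construction (NumPy and representing the relevant set \delta_j as a Python set are assumptions):

```python
import numpy as np

def context_feature(s_d_j, s_c, relevant, lam):
    # Build x_B = [s_d(j), lam * x_c] for the j-th detection class, where
    # x_c keeps only the classification classes in `relevant` (delta_j)
    # and zeroes the rest to avoid over-fitting.
    s_c = np.asarray(s_c, dtype=float)
    x_c = np.zeros_like(s_c)
    idx = list(relevant)
    x_c[idx] = s_c[idx]
    return np.concatenate([[s_d_j], lam * x_c])  # input to the j-th SVM
```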
  • the multi-class object detection system 100 has been discussed above. It shall be understood that there may be several models obtained by changing the settings of the feature learning unit, the sub-boxes detector unit and the contextual information unit. For example, the configuration of the network created by the feature learning unit may be changed with different layers. As these models share the same selective search unit, the candidate boxes are the same for all models. For each candidate box, different models may output different scores for different classes.
  • the prediction device 20 may further comprise a model average unit (not shown).
  • the model average unit is configured to utilize the advantages of several models to make the performance better. As instances of multiple classes need to be detected, different training settings may result in different performance; for example, one model setting may be better on some classes while another model may come out better on other classes. The model average unit is used to select different models for each class.
  • the model average unit tries to find a combination list for each class and averages the scores of the models in this list as the final score for each candidate box.
  • Fig. 14 illustrates a flow chart of the operations for the model average unit according to another embodiment of the present application.
  • in step s1401, the model average unit creates one empty list for each class. Multiple models can be obtained by changing the settings of the feature learning unit, the sub-boxes detector unit and the contextual information unit; those models share the same selective search unit.
  • in step s1402, for each class, this unit starts by selecting the best model as the starting point, and then tries to find one more model (step s1403) such that the performance on this class would improve by averaging the scores of those two models (the best model and said one more model); that model is then added to the list in step s1408.
  • steps s1402-s1407 are repeated until no more models can be added, or until the performance would become worse if one more model were added. The above procedure is repeated for all classes.
  • the model average unit would output one model list for each class, as sketched below.
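  • A minimal sketch of this greedy per-class model selection (the `evaluate` callback, e.g. average precision on a validation set, is an assumption):

```python
import numpy as np

def greedy_model_list(model_scores, evaluate):
    # model_scores: one array per model with that model's scores for all
    # candidate boxes of a given class; returns the chosen model indices.
    best = max(range(len(model_scores)), key=lambda m: evaluate(model_scores[m]))
    chosen = [best]                          # s1402: start from the best model
    improved = True
    while improved:                          # s1403-s1407: try to add one model
        improved = False
        current = evaluate(np.mean([model_scores[m] for m in chosen], axis=0))
        for m in range(len(model_scores)):
            if m in chosen:
                continue
            trial = np.mean([model_scores[i] for i in chosen + [m]], axis=0)
            if evaluate(trial) > current:    # s1408: keep the model if it helps
                chosen.append(m)
                improved = True
                break
    return chosen
```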
  • The neural network
  • the neural network structure consists of several kinds of layers.
  • Fig. 10 is a schematic diagram illustrating an exemplary configuration of neural network structure according to one embodiment of the present application.
  • Fig. 11 is a schematic diagram illustrating an exemplary configuration of deformation layer of the network according to one embodiment of the present application.
  • This layer receives the images and their labels, where x_ij is the j-th element of the d-dimensional feature vector of the i-th input image region, and y_ij is the j-th element of the n-dimensional label vector of the i-th input image region.
  • the convolution layer receives the output from the data layer and performs convolution, padding, sampling, and non-linear transformation operations.
  • the deformation layer is designed to learn the deformation constraints for different object parts. For a given channel of a convolution layer C with size V*H, the deformation layer takes small blocks of size (2R+1)*(2R+1) from that convolution layer C, subsamples each block to B, and produces a single output b from that block, for example as

    b = \max_{i,j} ( z_{i,j} - \sum_n c_n d_{n,i,j} ),

    where z_{i,j} is the subsampled convolution response at displacement (i, j), the d_{n,i,j} are deformation penalty maps, the c_n are learned deformation weights, and both i and j range from -R to R.
  • the deformation layer takes the P part detection maps as input and outputs P part scores, and it can capture multiple patterns simultaneously. A sketch of one plausible reading of this layer is given below.
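  • A minimal sketch of one plausible reading of the deformation layer (the penalty form z - \sum_n c_n d_n followed by a max is an assumption; the present application does not spell out the exact formula):

```python
import numpy as np

def deformation_layer(block, def_maps, c):
    # block:    (2R+1) x (2R+1) subsampled convolution responses z_ij
    # def_maps: N x (2R+1) x (2R+1) deformation penalty maps d_nij
    # c:        N learned deformation weights c_n
    penalty = np.tensordot(np.asarray(c), np.asarray(def_maps), axes=1)
    return float((np.asarray(block) - penalty).max())  # max over i, j in [-R, R]
```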
  • the output of convolution layer and deformation layer can be regarded as discriminative features.
  • the fully connected layer takes the discriminative features as input and computes the inner product between the features and its weights. One non-linear transformation is then applied to the product.
  • The prediction device 20
  • for each test image, the prediction device 20 outputs predicted bounding boxes (x, y, w, h) and the corresponding scores for the n object classes of the test image.
  • Fig. 12 is a schematic diagram illustrating an exemplary block diagram for the prediction device 20 according to one embodiment of the present application. As shown in Fig. 12, the prediction device 20 comprises a selective search module 201, a region rejection module 202, a feature learning module 203, a sub-boxes detector module 204, and a context information module 205.
  • Fig. 13 illustrates a flow chart for the process showing how the modules 201-205 cooperate to output predicted bounding boxes (x, y, w, h) and the corresponding scores for the predicted bounding boxes.
  • the selective search module 201 receives at least one test image and then proposes a number of candidate bounding boxes in the test image.
  • the received image includes a plurality of instances of (n) object classes (n semantic classes).
  • in step s1302, the region rejection module 202 selects some boxes from the large number of candidate bounding boxes by rule of formula (1). Once a candidate box is rejected, this box is thrown away; only the bounding boxes passing the region rejection module are passed to the following module, as discussed in reference to the training device.
  • the feature learning module 203 calculates the classification features for each candidate box using the fine-tuned network obtained from the training device. Here the fine-tuned network takes the image regions corresponding to the bounding boxes as input and calculates the classification features from the last hidden layer of the fine-tuned network.
  • the sub-boxes detector module 204 receives the calculated classification features from the module 203 and then uses the sub-boxes detectors (binary classifier detectors) obtained from the training device 10 to calculate the n class scores s_d for each candidate box.
  • the sub-boxes detector module 204 calculates the classification features of the plurality of sub-image-regions (for example, 4 sub-image-regions), obtaining the classification features for each sub-image-region using the fine-tuned network obtained in the training device 10. Then the sub-boxes detector module 204 calculates the detection scores s_d using the sub-boxes detectors (binary classifier detectors) trained in the training device 10.
  • the sub-boxes detector (SVM detector) finds the one bounding box having the maximum overlap value with each sub-image-region, calculates the feature for that bounding box using the fine-tuned network, and uses this feature to represent that sub-image-region. Once all four sub-image-regions get their corresponding representing features, the element-wise-max and element-wise-average values are extracted from those four representing features. The concatenated feature vectors, multiplied with the binary classifier (SVM) weights obtained in the training device, produce the scores s_d.
  • after the sub-boxes detector module 204 uses the detection network (i.e., the second network) obtained in the training device 10 to calculate the detection scores s_d, the context information module 205 concatenates the s_d from the previous step with the s_c calculated in this step, and finally multiplies the concatenated vector with the weights of the binary classifier (SVM) obtained from the training device 10 in step s1305.
  • the product is the final score for the candidate bounding boxes proposed by the selective search module 201. It shall be understood that there may be several models obtained by changing the settings of the feature learning unit and the sub-boxes detector unit. As these models share the same selective search unit, the candidate boxes are the same for all models. For each candidate box, different models may output different scores for different classes. A sketch of this final scoring step is given below.
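  • A minimal sketch of step s1305 (a linear SVM with weights and bias is an assumption):

```python
import numpy as np

def final_score(s_d, s_c, svm_weights, svm_bias=0.0):
    # Concatenate detection scores s_d with classification scores s_c and
    # apply the per-class context SVM learned in the training device 10.
    x = np.concatenate([np.asarray(s_d), np.asarray(s_c)])
    return float(np.dot(np.asarray(svm_weights), x) + svm_bias)
```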
  • the prediction device 20 may further comprise a model average unit (not shown).
  • the final scores are obtained by averaging the final scores of the multiple models selected by this model average unit for each candidate box, in the same way as discussed in reference to the training device 10.
  • detailed descriptions of the modules 201-205 are omitted herein since they function in the same way as the units 101-105 of the training device 10 discussed above.
  • the system 100 has been discussed above for the case in which it is implemented using certain hardware with specific circuitry, or a combination of hardware and software. It shall be appreciated that the devices 10 and 20 and the system 100 may also be implemented using software.
  • embodiments of the present invention may be adapted to a computer program product embodied on one or more computer readable storage media (comprising but not limited to disk storage, CD-ROM, optical memory and the like) containing computer program codes.
  • the system 100 may run in a general purpose computer, a computer cluster, a mainframe computer, a computing device dedicated to providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a device for training neural networks for multi-class object detection. The device may comprise a feature learning unit and a sub-boxes detector unit. According to one embodiment of the present invention, the feature learning unit is configured to: determine a first neural network based on training images of a first training image set, each image having a plurality of bounding boxes containing objects, wherein the determined first neural network outputs contextual information for an inputted image; and determine a second neural network based on the bounding boxes of the images in the first training image set and then fine-tune the second neural network based on the bounding boxes of the images in a second training image set. The sub-boxes detector unit is configured to determine sub-boxes detector scores for the bounding boxes based on the second neural network, each sub-boxes detector score predicting one value for one of the bounding boxes for one semantic object class.
PCT/CN2014/000833 2014-09-10 2014-09-10 Method and system for multi-class object detection WO2016037300A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201480081846.0A CN106688011B (zh) 2014-09-10 2014-09-10 Method and system for multi-class object detection
PCT/CN2014/000833 WO2016037300A1 (fr) 2014-09-10 2014-09-10 Method and system for multi-class object detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/000833 WO2016037300A1 (fr) 2014-09-10 2014-09-10 Method and system for multi-class object detection

Publications (1)

Publication Number Publication Date
WO2016037300A1 true WO2016037300A1 (fr) 2016-03-17

Family

ID=55458228

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/000833 WO2016037300A1 (fr) Method and system for multi-class object detection

Country Status (2)

Country Link
CN (1) CN106688011B (fr)
WO (1) WO2016037300A1 (fr)


Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229524A (zh) * 2017-05-25 2018-06-29 北京航空航天大学 一种基于遥感图像的烟囱和冷凝塔检测方法
EP3596655B1 (fr) * 2017-06-05 2023-08-09 Siemens Aktiengesellschaft Procédé et appareil d'analyse d'image
GB2565775A (en) * 2017-08-21 2019-02-27 Nokia Technologies Oy A Method, an apparatus and a computer program product for object detection
US10679129B2 (en) * 2017-09-28 2020-06-09 D5Ai Llc Stochastic categorical autoencoder network
CN111247559B (zh) * 2017-10-20 2023-10-31 丰田自动车欧洲公司 用于处理图像和确定对象的视点的方法和系统
CN108304856B (zh) * 2017-12-13 2020-02-28 中国科学院自动化研究所 基于皮层丘脑计算模型的图像分类方法
CN108121931B (zh) * 2017-12-18 2021-06-25 阿里巴巴(中国)有限公司 二维码数据处理方法、装置及移动终端
CN108416902B (zh) * 2018-02-28 2021-11-26 成都好享你网络科技有限公司 基于差异识别的实时物体识别方法和装置
WO2019246250A1 (fr) * 2018-06-20 2019-12-26 Zoox, Inc. Segmentation d'instances déduite d'une sortie de modèle d'apprentissage automatique
CN110570389B (zh) * 2018-09-18 2020-07-17 阿里巴巴集团控股有限公司 车辆损伤识别方法及装置
CN109543685A (zh) * 2018-10-16 2019-03-29 深圳大学 图像语义分割方法、装置和计算机设备
CA3115459A1 (fr) * 2018-11-07 2020-05-14 Foss Analytical A/S Analyseur de lait pour la classification de lait
CN109657551B (zh) * 2018-11-15 2023-11-14 天津大学 一种基于上下文信息增强的人脸检测方法
CN109657678B (zh) * 2018-12-17 2020-07-24 北京旷视科技有限公司 图像处理的方法、装置、电子设备和计算机存储介质
CN110298248A (zh) * 2019-05-27 2019-10-01 重庆高开清芯科技产业发展有限公司 一种基于语义分割的多目标跟踪方法及系统
US11055540B2 (en) * 2019-06-28 2021-07-06 Baidu Usa Llc Method for determining anchor boxes for training neural network object detection models for autonomous driving
EP3767521A1 (fr) * 2019-07-15 2021-01-20 Promaton Holding B.V. Détection d'objets et segmentation d'instances de nuages de points 3d basées sur l'apprentissage profond
CN112288686B (zh) * 2020-07-29 2023-12-19 深圳市智影医疗科技有限公司 一种模型训练方法、装置、电子设备和存储介质
CN112101134B (zh) * 2020-08-24 2024-01-02 深圳市商汤科技有限公司 物体的检测方法及装置、电子设备和存储介质
CN112418278A (zh) * 2020-11-05 2021-02-26 中保车服科技服务股份有限公司 一种多类物体检测方法、终端设备及存储介质
CN114387444B (zh) * 2021-12-24 2024-10-15 大连理工大学 一种基于负边界三元组损失和数据增强的零样本分类方法


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9235799B2 (en) * 2011-11-26 2016-01-12 Microsoft Technology Licensing, Llc Discriminative pretraining of deep neural networks
CN102521442B (zh) * 2011-12-06 2013-07-24 南京航空航天大学 基于特征样本的飞机结构件神经网络加工时间预测方法
CN102693409B (zh) * 2012-05-18 2014-04-09 四川大学 一种快速的图像中二维码码制类型识别方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102722712A (zh) * 2012-01-02 2012-10-10 西安电子科技大学 基于连续度的多尺度高分辨图像目标检测方法
US20130266214A1 (en) * 2012-04-06 2013-10-10 Brighham Young University Training an image processing neural network without human selection of features
US8527276B1 (en) * 2012-10-25 2013-09-03 Google Inc. Speech synthesis using deep neural networks
CN103902987A (zh) * 2014-04-17 2014-07-02 福州大学 一种基于卷积网络的台标识别方法
CN103955702A (zh) * 2014-04-18 2014-07-30 西安电子科技大学 基于深度rbf网络的sar图像地物分类方法

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017166098A1 (fr) * 2016-03-30 2017-10-05 Xiaogang Wang Procédé et système pour la détection d'un objet dans une vidéo
CN108885684A (zh) * 2016-03-30 2018-11-23 北京市商汤科技开发有限公司 用于检测视频中的对象的方法和系统
CN108885684B (zh) * 2016-03-30 2022-04-01 北京市商汤科技开发有限公司 用于检测视频中的对象的方法和系统
GB2556985B (en) * 2016-11-28 2021-04-21 Adobe Inc Facilitating sketch to painting transformations
GB2556985A (en) * 2016-11-28 2018-06-13 Adobe Systems Inc Facilitating sketch to painting transformations
US11783461B2 (en) 2016-11-28 2023-10-10 Adobe Inc. Facilitating sketch to painting transformations
US11775844B2 (en) * 2017-03-22 2023-10-03 Ebay Inc. Visual aspect localization presentation
US20210166146A1 (en) * 2017-03-22 2021-06-03 Ebay Inc. Visual aspect localization presentation
CN107016357A (zh) * 2017-03-23 2017-08-04 北京工业大学 一种基于时间域卷积神经网络的视频行人检测方法
CN107016357B (zh) * 2017-03-23 2020-06-16 北京工业大学 一种基于时间域卷积神经网络的视频行人检测方法
US11055580B2 (en) 2017-06-05 2021-07-06 Siemens Aktiengesellschaft Method and apparatus for analyzing an image
US10679351B2 (en) 2017-08-18 2020-06-09 Samsung Electronics Co., Ltd. System and method for semantic segmentation of images
CN111052146B (zh) * 2017-08-31 2023-05-12 三菱电机株式会社 用于主动学习的系统和方法
CN111052146A (zh) * 2017-08-31 2020-04-21 三菱电机株式会社 用于主动学习的系统和方法
CN109784487A (zh) * 2017-11-15 2019-05-21 富士通株式会社 用于事件检测的深度学习网络、该网络的训练装置及方法
CN109784487B (zh) * 2017-11-15 2023-04-28 富士通株式会社 用于事件检测的深度学习网络、该网络的训练装置及方法
US11586915B2 (en) 2017-12-14 2023-02-21 D-Wave Systems Inc. Systems and methods for collaborative filtering with variational autoencoders
JP2019125128A (ja) * 2018-01-16 2019-07-25 Necソリューションイノベータ株式会社 情報処理装置、制御方法、及びプログラム
JP7107544B2 (ja) 2018-01-16 2022-07-27 Necソリューションイノベータ株式会社 情報処理装置、制御方法、及びプログラム
CN110889318B (zh) * 2018-09-05 2024-01-19 斯特拉德视觉公司 利用cnn的车道检测方法和装置
CN110889318A (zh) * 2018-09-05 2020-03-17 斯特拉德视觉公司 利用cnn的车道检测方法和装置
CN109783666A (zh) * 2019-01-11 2019-05-21 中山大学 一种基于迭代精细化的图像场景图谱生成方法
CN111460878B (zh) * 2019-01-22 2023-11-24 斯特拉德视觉公司 利用网格生成器的神经网络运算方法及使用该方法的装置
CN111460878A (zh) * 2019-01-22 2020-07-28 斯特拉德视觉公司 利用网格生成器的神经网络运算方法及使用该方法的装置
US11900264B2 (en) 2019-02-08 2024-02-13 D-Wave Systems Inc. Systems and methods for hybrid quantum-classical computing
US20200257984A1 (en) * 2019-02-12 2020-08-13 D-Wave Systems Inc. Systems and methods for domain adaptation
US11625612B2 (en) * 2019-02-12 2023-04-11 D-Wave Systems Inc. Systems and methods for domain adaptation
NL2023577B1 (en) * 2019-07-26 2021-02-18 Suss Microtec Lithography Gmbh Method for detecting alignment marks, method for aligning a first substrate relative to a second substrate as well as apparatus
CN113137916B (zh) * 2020-01-17 2023-07-11 苹果公司 基于对象分类的自动测量
US11763479B2 (en) 2020-01-17 2023-09-19 Apple Inc. Automatic measurements based on object classification
CN113137916A (zh) * 2020-01-17 2021-07-20 苹果公司 基于对象分类的自动测量
US11574485B2 (en) 2020-01-17 2023-02-07 Apple Inc. Automatic measurements based on object classification
CN112308011A (zh) * 2020-11-12 2021-02-02 湖北九感科技有限公司 多特征联合目标检测方法及装置
CN112308011B (zh) * 2020-11-12 2024-03-19 湖北九感科技有限公司 多特征联合目标检测方法及装置
WO2022221932A1 (fr) * 2021-04-22 2022-10-27 Oro Health Inc. Procédé et système de détection automatisée de caractéristiques superficielles dans des images numériques
CN115661492A (zh) * 2022-12-28 2023-01-31 摩尔线程智能科技(北京)有限责任公司 图像比对方法、装置、电子设备、存储介质和程序产品
CN115661492B (zh) * 2022-12-28 2023-12-29 摩尔线程智能科技(北京)有限责任公司 图像比对方法、装置、电子设备、存储介质和程序产品

Also Published As

Publication number Publication date
CN106688011B (zh) 2018-12-28
CN106688011A (zh) 2017-05-17

Similar Documents

Publication Publication Date Title
WO2016037300A1 (fr) Method and system for multi-class object detection
US9965719B2 (en) Subcategory-aware convolutional neural networks for object detection
US20220215227A1 (en) Neural Architecture Search Method, Image Processing Method And Apparatus, And Storage Medium
US10275719B2 (en) Hyper-parameter selection for deep convolutional networks
US9811718B2 (en) Method and a system for face verification
US20180114071A1 (en) Method for analysing media content
CN110826379B (zh) 一种基于特征复用与YOLOv3的目标检测方法
US10867169B2 (en) Character recognition using hierarchical classification
US20210133439A1 (en) Machine learning prediction and document rendering improvement based on content order
US10762389B2 (en) Methods and systems of segmentation of a document
CN114332473B (zh) 目标检测方法、装置、计算机设备、存储介质及程序产品
CN107766860A (zh) 基于级联卷积神经网络的自然场景图像文本检测方法
WO2022152009A1 (fr) Procédé et appareil de détection de cible, dispositif et support d'enregistrement
CN107103608B (zh) 一种基于区域候选样本选择的显著性检测方法
Wu et al. Typical target detection in satellite images based on convolutional neural networks
CN110008899B (zh) 一种可见光遥感图像候选目标提取与分类方法
CN110008900A (zh) 一种由区域到目标的可见光遥感图像候选目标提取方法
WO2015146113A1 (fr) Système d'apprentissage de dictionnaire d'identification, procédé d'apprentissage de dictionnaire d'identification, et support d'enregistrement
Wang et al. Small vehicle classification in the wild using generative adversarial network
Aliakbarian et al. Deep action-and context-aware sequence learning for activity recognition and anticipation
CN114494823A (zh) 零售场景下的商品识别检测计数方法及系统
CN114743045B (zh) 一种基于双分支区域建议网络的小样本目标检测方法
CN111950545B (zh) 一种基于MSDNet和空间划分的场景文本检测方法
CN117523252A (zh) 一种基于深度学习的页岩孔隙类型检测与分类方法及系统
CN115019342A (zh) 一种基于类关系推理的濒危动物目标检测方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14901459

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14901459

Country of ref document: EP

Kind code of ref document: A1