WO2017059576A1 - Apparatus and method for pedestrian detection - Google Patents

Apparatus and method for pedestrian detection Download PDF

Info

Publication number
WO2017059576A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
testing
patches
detectors
generating
Prior art date
Application number
PCT/CN2015/091517
Other languages
French (fr)
Inventor
Xiaoou Tang
Yonglong TIAN
Ping Luo
Xiaogang Wang
Original Assignee
Beijing Sensetime Technology Development Co., Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co., Ltd filed Critical Beijing Sensetime Technology Development Co., Ltd
Priority to PCT/CN2015/091517 priority Critical patent/WO2017059576A1/en
Priority to CN201610876667.7A priority patent/CN106570453B/en
Publication of WO2017059576A1 publication Critical patent/WO2017059576A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition

Definitions

  • occlusion has various patterns.
  • the left or right half body part may be occluded by a tree, and the lower half body part may also be occluded by a car.
  • a part pool which contains various semantic body parts may be extensively constructed.
  • pedestrian can be considered as a rigid object with a 2m × m grid, where 2m and m indicate the numbers of grids in the vertical and horizontal dimensions, respectively.
  • Each grid is square and has equal size.
  • the grid is defined as the minimum unit, and each part prototype is constrained to be a rectangle.
  • the sizes for part prototypes are defined in terms of grids, where w and h indicate the width and height of a part prototype.
  • W_min and H_min are used to avoid over-local parts, since the focus is on middle-level semantic parts.
  • x and y are the coordinates of the top-left grid in the part prototype, and i is a unique id.
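The part-pool construction above can be sketched as follows; the grid size m and the minimum width and height are hypothetical values, not fixed by the text here:

```python
# Sketch of building the part-prototype pool: every axis-aligned
# rectangle of grid cells inside the 2m x m pedestrian grid, subject
# to a minimum width/height. M, W_MIN and H_MIN are illustrative.
from itertools import count

M = 3                  # horizontal grids; vertical grids = 2 * M
W_MIN, H_MIN = 1, 2    # hypothetical minimum part size, in grids

def build_part_pool(m=M, w_min=W_MIN, h_min=H_MIN):
    """Enumerate rectangular part prototypes (x, y, w, h, i)."""
    ids = count()
    pool = []
    for h in range(h_min, 2 * m + 1):        # height in grids
        for w in range(w_min, m + 1):        # width in grids
            for y in range(2 * m - h + 1):   # top-left row
                for x in range(m - w + 1):   # top-left column
                    pool.append((x, y, w, h, next(ids)))
    return pool
```

Each tuple matches the (x, y, w, h, i) notation above; the loop bounds keep x + w ≤ m and y + h ≤ 2m, so every prototype stays inside the grid.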
  • the first or second box generator 100 or 500 utilizes static images, such as training or testing images, as inputs and employs a pedestrian detector to detect the pedestrians in these images.
  • a region proposal method such as “selective search”, “Edgebox”, and “LDCF” may be used to generate candidate bounding boxes.
  • the size of the training or testing dataset is crucial for deep models, e.g., ConvNets.
  • the Caltech dataset, which is now the largest pedestrian benchmark, consists of ~250k labeled frames and ~350k annotated bounding boxes.
  • unlike the typical Reasonable training setting, which uses every 30th image in the video and is composed of ~1.7k pedestrians, we utilize every frame and employ ~50k pedestrian bounding boxes as positive training patches.
  • Negative patches have < 0.5 IoU with any ground truth and are proposed by LDCF.
  • the training patch generator 200 further comprises a labeling module 201 for labeling the candidate boxes as negative or positive candidate boxes by comparison with the ground truth boxes, and an extracting module 202 for extracting negative and positive training part patches from the negative and positive candidate boxes for each body part such as leg, head, and upper body.
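A minimal sketch of the labeling step performed by module 201, assuming boxes in (x1, y1, x2, y2) format and the 0.5 IoU threshold used for negative patches:

```python
# Label candidate boxes against ground truth by IoU. The threshold and
# box format are assumptions for illustration.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_candidates(candidates, ground_truths, thresh=0.5):
    """Return (box, is_positive) pairs for each candidate box."""
    return [(c, any(iou(c, g) >= thresh for g in ground_truths))
            for c in candidates]
```

Part patches for each body part are then cropped from the labeled candidate boxes, as module 202 does.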
  • Fig. 3 is an illustration of the training part patches, namely the output of the generator 200.
  • Fig. 4 is an example of generating training data for each part detector.
  • (1) Given a part prototype, the corresponding region within a negative pedestrian proposal is used as a negative sample for this part detector. This assumption rests on the fact that most upright pedestrians are well aligned, so the corresponding regions in negative and positive pedestrian patches should differ. For example, if a head-shoulder part occupied the upper one-third region of a negative proposal, that proposal would have been regarded as a positive pedestrian patch according to prior knowledge.
  • Each pedestrian is annotated with two BBs that denote the visible (B_vis) and full (B_full) parts. We divide the full part (B_full) into a 2m × m grid and compute the IoU between the visible part (B_vis) and each grid. Then the visible map is obtained by thresholding on the IoU value of each grid. If the visible grids of a ground truth can cover the template grids of a given part prototype, the corresponding region can be extracted as a positive sample.
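The visibility-map construction above can be sketched as follows; the grid size m, the threshold value, and the per-cell normalisation are assumptions for illustration:

```python
# Divide the full box into a 2m x m grid, mark each cell visible when
# its overlap with the visible box exceeds a threshold, then accept a
# part prototype if all of its template cells are visible. The overlap
# is normalised by the cell area here (an assumption; the text does
# not spell out the exact normalisation).
import numpy as np

def visible_map(b_full, b_vis, m=3, thresh=0.5):
    """b_full/b_vis: (x1, y1, x2, y2). Returns a (2m, m) boolean map."""
    x1, y1, x2, y2 = b_full
    gw, gh = (x2 - x1) / m, (y2 - y1) / (2 * m)
    vis = np.zeros((2 * m, m), dtype=bool)
    for r in range(2 * m):
        for c in range(m):
            cx1, cy1 = x1 + c * gw, y1 + r * gh
            cx2, cy2 = cx1 + gw, cy1 + gh
            ix = max(0.0, min(cx2, b_vis[2]) - max(cx1, b_vis[0]))
            iy = max(0.0, min(cy2, b_vis[3]) - max(cy1, b_vis[1]))
            if gw * gh > 0 and (ix * iy) / (gw * gh) >= thresh:
                vis[r, c] = True
    return vis

def covers(vis, part):
    """part = (x, y, w, h) in grid coordinates; True if fully visible."""
    x, y, w, h = part
    return bool(vis[y:y + h, x:x + w].all())
```

With an upper-half-visible pedestrian, for example, only prototypes confined to the top rows of the grid yield positive samples.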
  • the detector training unit 300 further comprises a mixing module 301 for mixing the positive and negative training part patches and splitting them into batches, a training module 302 for iteratively training each part detector by using the batches of part patches until each of all part detectors converges, and a parameter learning module 303 for learning parameters for handling shifting for each part detector.
  • each body part would shift from its fixed template position, and different parts of the same pedestrian may shift towards different orientations.
  • the positive training samples for each part detector are well aligned while the testing proposals may shift at all orientations. Thus, handling shifting for both the full body and parts is necessary.
  • the input size of our fully convolutional ConvNet can be changed.
  • the original input size of the AlexNet is 227 × 227.
  • the fully convolutional AlexNet is able to receive an expanded input size because the convolution and pooling operations are unrelated to input size. Since the step size of the receptive field for the classification layer is 32, the expanded input should be (227 + 32n) × (227 + 32n) in order to keep the forward procedure applicable, where n indicates the expanded step size and is a non-negative integer.
  • the expanded cropping patch is (X_min′, Y_min′, w′, h′).
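The input-size rule can be illustrated as below; the symmetric crop expansion is an assumption, since the defining formula for (X_min′, Y_min′, w′, h′) is not reproduced here:

```python
# A fully convolutional AlexNet with receptive-field stride 32 accepts
# inputs of side length 227 + 32n. The crop expansion below simply
# centres the original patch inside the enlarged input (our assumption).

def expanded_size(n):
    """Valid input side length for expansion step n (n >= 0)."""
    assert n >= 0
    return 227 + 32 * n

def expand_crop(x_min, y_min, w, h, n, scale_x, scale_y):
    """Hypothetical symmetric expansion of a crop (x_min, y_min, w, h).

    scale_x/scale_y convert the extra 32n input pixels into image
    coordinates.
    """
    dx, dy = 32 * n * scale_x, 32 * n * scale_y
    return (x_min - dx / 2, y_min - dy / 2, w + dx, h + dy)

print(expanded_size(0), expanded_size(2))  # prints: 227 291
```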
  • P_i,j is a penalty term with respect to the relative shifting distance from the proposed part box.
  • a is the single-orientation shifting penalty weight and b is a geometrical distance penalty weight.
  • n is fixed to 2 for all part prototypes, and the values of a and b for each part prototype are searched by a 6-fold cross-validation on the training set.
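The exact formula for the penalty P_i,j is not reproduced in the text; the sketch below (per-orientation terms weighted by a, plus a Euclidean-distance term weighted by b) is only one plausible form consistent with the two weights described above:

```python
# Hypothetical shifting penalty. Only the roles of a and b are taken
# from the text; the functional form is an assumption.
import math

def shift_penalty(dx, dy, a, b):
    """Penalty for a part box shifted by (dx, dy) from its template."""
    return a * (abs(dx) + abs(dy)) + b * math.hypot(dx, dy)

print(shift_penalty(0, 0, 0.1, 0.2))  # 0.0 -- no shift, no penalty
```

A different functional form may well be used in the actual application; cross-validation over a and b would proceed the same way either way.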
  • the detector selecting unit 400 further comprises a weight learning module 401 for learning combination weights of all part detectors, a selection module 402 for selecting one or more part detectors according to the combination weights, and a relearning module 403 for relearning the combination weights of the selected part detectors.
  • the output of its ConvNet detector may be directly used as the visible score, instead of stacking a linear SVM on top as in the RCNN framework. It is found that appending an SVM detector for mining hard negatives does not show significant improvement over directly using the ConvNet output, especially for GoogLeNet. This may be due to the fact that the training proposals generated by LDCF are already hard negatives. Thus, the SVM training stage is safely removed to save the time of feature extraction.
  • a linear SVM is employed to learn complementarity over the 45 part detector scores.
  • the 6 parts with the highest SVM weights are simply selected, yielding approximately the same performance. It is also shown that the performance improvement mainly benefits from the part complementarity.
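As a sketch of this selection stage: the linear SVM over the part-detector scores is emulated below with a tiny subgradient-descent trainer (the application does not name a solver; the trainer and all hyper-parameters are assumptions), after which the k parts with the largest weights are kept:

```python
# Learn complementarity weights over per-part detector scores with a
# linear SVM (hinge loss + L2, trained by subgradient descent), then
# select the top-k parts by weight.
import numpy as np

def train_linear_svm(scores, labels, lam=0.01, lr=0.1, epochs=200, seed=0):
    """scores: (N, P) detector scores; labels: (N,) in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(scores.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(scores)):
            margin = labels[i] * (scores[i] @ w)
            # Hinge-loss subgradient plus L2 regularisation.
            grad = lam * w - (labels[i] * scores[i] if margin < 1 else 0)
            w -= lr * grad
    return w

def select_parts(w, k=6):
    """Indices of the k parts with the largest SVM weights."""
    return np.argsort(w)[::-1][:k]
```

After selection, the weights of the kept parts are relearned, mirroring the relearning module 403.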
  • Fig. 8 is an illustration of the selected parts and their weights.
  • the testing patch generator 600 further comprises an extracting module for extracting testing part patches from the candidate boxes generated by the second box generator 500 as the generated testing patches for each body part corresponding to the selected part detectors.
  • the testing unit 700 further comprises an evaluation module 701 and a result generation module 702.
  • the evaluation module 701 may be configured to evaluate a score of each body part using the corresponding part detector from the testing part patches, the selected part detectors and the relearned combination weights.
  • the result generation module 702 may be configured to generate a detection score by combining the score of each body part in a weighted manner.
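The weighted combination performed by the result generation module 702 can be sketched as follows; the scores and weights are illustrative values only:

```python
# Combine per-part detector scores into one detection score using the
# relearned combination weights (a plain weighted sum here).
import numpy as np

def detection_score(part_scores, weights):
    """part_scores, weights: 1-D sequences over the selected parts."""
    part_scores = np.asarray(part_scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(part_scores @ weights)

print(detection_score([1.0, 1.0], [0.25, 0.5]))  # prints: 0.75
```

An occluded part then contributes a low score without vetoing the detection outright, which is the point of combining complementary parts.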
  • Fig. 10 is a schematic flowchart illustrating a method 2000 for pedestrian detection according to an embodiment of the present application.
  • the method 2000 may be described in detail with respect to Fig. 10.
  • candidate boxes are generated from a plurality of pedestrian training images, for example, by employing a region proposal method such as Selective Search, Edgebox, and LDCF.
  • a region proposal method such as Selective Search, Edgebox, and LDCF.
  • training part patches are generated from the ground truth boxes and the candidate boxes, which are generated from the plurality of pedestrian training images.
  • the step S220 of generating training part patches comprises the following steps.
  • the candidate boxes are labeled as negative or positive candidate boxes by comparison with the ground truth boxes.
  • negative and positive training part patches are extracted as the training part patches from the negative and positive candidate boxes.
  • step S230 at which part detectors are trained from the training part patches.
  • the step S230 of training part detectors comprises the following steps.
  • the positive and negative training part patches are mixed and split into batches.
  • each part detector is iteratively trained by using these batches until it converges.
  • parameters are learned for handling shifting.
  • the step S240 of selecting complementary part detectors comprises a step S241 of learning combination weights of all part detectors, a step S242 of selecting one or more part detectors according to the combination weights, and a step of S243 of relearning the combination weights of the selected part detectors.
  • step S250 at which corresponding candidate boxes are generated from a plurality of pedestrian testing images.
  • at step S260, testing part patches are generated from the candidate boxes generated from the plurality of pedestrian testing images.
  • the step S260 of generating testing part patches further comprises extracting testing part patches from the candidate boxes generated from the plurality of pedestrian testing images as the generated testing part patches for each body part corresponding to the selected part detectors.
  • at step S270, a detection result is generated from the testing part patches and the selected part detectors.
  • the step S270 of generating the detection result comprises the following steps.
  • a score of each body part is evaluated using the corresponding part detector from the testing part patches, the selected part detectors and the relearned combination weights.
  • a detection result is generated by combining the score of each body part in a weighted manner.
  • Fig. 15 shows a system 3000 for pedestrian detection.
  • the system 3000 comprises: a memory 310 that stores executable components; and a processor 320 electrically coupled to the memory 310 that executes the executable components to perform operations of the system 3000.
  • the executable components comprise: a first box generating component 311 configured for generating candidate boxes from a plurality of pedestrian training images; a training patch generating component 312 configured for generating training part patches from the candidate boxes generated by the first box generating component 311 and ground truth boxes; a detector training component 313 configured for training one or more part detectors from the generated training part patches; a detector selecting component 314 configured for selecting complementary part detectors from all the trained part detectors; a second box generating component 315 configured for generating candidate boxes from a plurality of pedestrian testing images; a testing patch generating component 316 configured for generating testing part patches from the candidate boxes generated by the second box generating component 315; and a testing component 317 configured for generating a detection result from the testing part patches and the selected part detectors.
  • the present application is from “Deep Learning Strong Parts for Pedestrian Detection”, and is intended to address the problem of detecting pedestrians in a single image, aiming at constructing a pedestrian detector that can handle occlusion at different levels.
  • the input is a single static image, and the output consists of detected bounding boxes and confidence scores.

Abstract

Disclosed is an apparatus for pedestrian detection. The apparatus comprises: a first box generator for generating candidate boxes from a plurality of pedestrian training images; a training patch generator for generating training part patches from the candidate boxes generated by the first box generator and ground truth boxes; a detector training unit for training part detectors from the training part patches; a detector selecting unit for selecting complementary part detectors from all the trained part detectors; a second box generator for generating candidate boxes from a plurality of pedestrian testing images; a testing patch generator for generating testing part patches from the candidate boxes generated by the second box generator; and a testing unit for generating a detection result from the testing part patches and the selected part detectors. A method and a system for pedestrian detection are also disclosed.

Description

APPARATUS AND METHOD FOR PEDESTRIAN DETECTION
Technical Field
The present application generally relates to a field of pedestrian detection, more particularly, to an apparatus and a method for pedestrian detection.
Background
Pedestrian detection has numerous applications in video surveillance, robotics and automotive safety, and has been studied extensively in recent years. While pedestrian detection quality has achieved steady improvements over the last several years, occlusion is still an obstacle to constructing a good pedestrian detector. For example, the current best performing detector, SpatialPooling+, attains a 75% average miss rate reduction over the VJ detector at the no-occlusion level, while attaining only 21% over VJ at the heavy-occlusion level. Occlusion is frequent: around 70% of all pedestrians in street scenes are occluded in at least one frame. Current pedestrian detectors for occlusion handling can generally be grouped into two categories: training specific detectors for different occlusion types, and modeling part visibility as latent variables. In the first category, constructing a specific detector requires prior knowledge of the occlusion types. The second category of approaches divides the pedestrian template into several parts and infers the visibility with latent variables. Though these methods achieve promising results, manually selecting parts may not be the optimal solution and may fail when handling pedestrian detection in scenarios beyond the street, such as crowded scenes and market surveillance, where occlusion types may change. Thus, there is a need to utilize extensive part detectors to handle pedestrian occlusion at different levels and thereby improve pedestrian detection.
Summary
According to an embodiment of the present application, disclosed is an apparatus for pedestrian detection. The apparatus comprises: a first box generator for generating candidate boxes from a plurality of pedestrian training images; a training patch generator for generating training part patches from the candidate boxes generated by the first box generator and ground truth boxes; a detector training unit for training one or more part detectors from the generated training part patches; a detector selecting unit for selecting complementary part detectors from all the trained part detectors; a second box generator for generating candidate boxes from a plurality of pedestrian testing images; a testing patch generator for generating testing part patches from the candidate boxes generated by the second box generator; and a testing unit for generating a detection result from the testing part patches and the selected part detectors.
According to another embodiment of the present application, disclosed is a method for pedestrian detection. The method comprises: generating candidate boxes from a plurality of pedestrian training images; generating training part patches from the candidate boxes generated from the plurality of pedestrian training images and ground truth boxes; training one or more part detectors from the generated training part patches; selecting complementary part detectors from all the trained part detectors; generating candidate boxes from a plurality of pedestrian testing images; generating testing part patches from the candidate boxes generated from the plurality of pedestrian testing images; and generating a detection result from the testing part patches and the selected part detectors.
According to yet another embodiment of the present application, disclosed is a system for pedestrian detection. The system comprises: a memory that stores executable components; and a processor electrically coupled to the memory that executes the executable components to perform operations of the system, wherein the executable components comprise: a first box generating component configured for generating candidate boxes from a plurality of pedestrian training images; a training patch generating component configured for generating training part patches from the candidate boxes generated by the first box generating component and ground truth boxes; a detector training component configured for training one or more part detectors from the generated training part patches; a detector selecting component configured for selecting complementary part detectors from all the trained part detectors; a second box generating component configured for generating candidate boxes from a plurality of pedestrian testing images; a testing patch generating component configured for generating testing part patches from the candidate boxes generated by the second box generating component; and a testing component configured for generating a detection result from the testing part patches and the selected part detectors.
The present invention has following characteristics:
1) hard negative reduction - with the assistance of deep learning of pedestrian attribute and scene attribute tasks, the number of hard negatives is significantly decreased;
2) weakly supervised training - this system can be trained only with weakly labeled data, i.e., the required supervision is the pedestrian bounding box instead of strong part annotations such as leg and arm;
3) strong part detectors - each part detector is already a strong detector, capable of detecting a pedestrian by observing only part of a candidate box; and
4) complementary parts selection - since not all part detectors are equally weighted and necessary in different scenarios, the present system can automatically select complementary parts and decide their weights.
Brief Description of the Drawing
Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 is a schematic diagram illustrating a system for pedestrian detection according to an embodiment of the present application.
Fig. 2 is a schematic diagram illustrating a training patch generator according to an embodiment of the present application.
Fig. 3 is an illustration of the training part patches according to an embodiment of the present application.
Fig. 4. is an example of generating training data for each part detector.
Fig. 5. is a schematic diagram illustrating a detector training unit according to another embodiment of the present application.
Fig. 6a shows how rapidly the IoU decreases with small shifts in the horizontal and vertical orientations.
Fig. 6b shows how the shifting problem is handled in AlexNet.
Fig. 7. is a schematic diagram illustrating a detector selecting unit according to an embodiment of the present application.
Fig. 8 is an example of the selected parts and their weights.
Fig. 9. is a schematic diagram illustrating a testing unit according to an embodiment of the present application.
Fig. 10 is a schematic flowchart illustrating a method for pedestrian detection according to an embodiment of the present application.
Fig. 11 is a schematic flowchart illustrating a process for generating training part patches according to an embodiment of the present application.
Fig. 12 is a schematic flowchart illustrating a process for training part detectors according to an embodiment of the present application.
Fig. 13 is a schematic flowchart illustrating a process for selecting complementary part detectors according to an embodiment of the present application.
Fig. 14 is a schematic flowchart illustrating a process for generating a detection result according to an embodiment of the present application.
Fig. 15 illustrates a system for pedestrian detection according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When appropriate, the same reference numbers are used throughout the drawings to refer to the same or like parts. Fig. 1 is a schematic diagram illustrating an exemplary apparatus 1000 for pedestrian detection with some disclosed embodiments.
It shall be appreciated that the apparatus 1000 may be implemented using certain hardware, software, or a combination thereof. In addition, the embodiments of the present invention may be adapted to a computer program product embodied on one or more computer-readable storage media (comprising but not limited to disk storage, CD-ROM, optical memory and the like) containing computer program code.
In the case that the apparatus 1000 is implemented with software, the apparatus 1000 can run in one or more systems that may include a general purpose computer, a computer cluster, a mainstream computer, a computing device dedicated to providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion.
Referring to Fig. 1 again, where the apparatus 1000 is implemented by hardware, it may comprise a first box generator 100, a training patch generator 200, a detector training unit 300, a detector selecting unit 400, a second box generator 500, a testing patch generator 600, and a testing unit 700. In the embodiment shown in Fig. 1, the first box generator 100 may be configured to generate candidate boxes from a plurality of pedestrian training images. In particular, most pedestrian patches are kept while most negative patches are filtered out. The training patch generator 200 may be configured to generate training part patches from the candidate boxes generated by the first box generator 100 and ground truth boxes. In particular, extensive part patches, such as the leg, head and upper body, are extracted for each candidate box. The detector training unit 300 may be configured to train one or more part detectors from the training part patches. The detector selecting unit 400 may be configured to select complementary part detectors from all the trained part detectors. The output of the detector selecting unit 400 may be a combination of the selected complementary part detectors. Each of the complementary part detectors may be selected based on its weight in a support vector machine (SVM). In some embodiments, the complementary part detectors may be those having the largest weights in the SVM. The second box generator 500 may be configured to generate candidate boxes from a plurality of pedestrian testing images. The testing patch generator 600 may be configured to generate testing part patches from the candidate boxes generated by the second box generator 500. The testing unit 700 may be configured to generate a detection result, such as a confidence score, from the testing part patches and the selected part detectors.
Normally, occlusion has various patterns. For instance, the left or right half of the body may be occluded by a tree, and the lower half of the body may be occluded by a car. Thus, a part pool which contains various semantic body parts may be extensively constructed.
In some embodiments, a pedestrian can be considered as a rigid object on a 2m × m grid, where 2m and m indicate the numbers of grids in the vertical and horizontal dimensions, respectively. Each grid is square and of equal size. Hereinafter, the grid is defined as the minimum unit, and each part prototype is constrained to be a rectangle. The sizes of part prototypes are defined as
S = { (w, h) | W_min ≤ w ≤ m, H_min ≤ h ≤ 2m }
where w and h indicate the width and height of a part prototype, in terms of grids. W_min and H_min are used to avoid over-local parts, since we focus on middle-level semantic parts.
Then, for each (w, h) ∈ S, sliding an h × w rectangle over the grid template generates part prototypes at different positions. The full part pool can be expressed as follows:
P = { (x, y, w, h, i) | (w, h) ∈ S, 1 ≤ x ≤ m − w + 1, 1 ≤ y ≤ 2m − h + 1 }
where x and y are the coordinates of the top-left grid in the part prototype and i is a unique id. Specifically, the full body part prototype is (1, 1, m, 2m, i_full). Setting m to a much larger number would generate an overlarge pool, which would incur too much computation in the training and testing stages. Also, setting W_min or H_min too small, such as W_min = 0.1 × m, would result in over-local part prototypes.
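The enumeration of the part pool defined above can be sketched as follows; the function name and the example values of m, W_min and H_min are illustrative choices, not values claimed by the present application.

```python
# Sketch: enumerate the part pool P over an m-wide, 2m-tall grid.
# Widths range over [W_min, m], heights over [H_min, 2m]; an h x w
# rectangle is slid to every valid 1-indexed top-left position (x, y).

def build_part_pool(m, w_min, h_min):
    pool = []
    part_id = 0
    for w in range(w_min, m + 1):           # widths in [W_min, m]
        for h in range(h_min, 2 * m + 1):   # heights in [H_min, 2m]
            for x in range(1, m - w + 2):           # 1 <= x <= m - w + 1
                for y in range(1, 2 * m - h + 2):   # 1 <= y <= 2m - h + 1
                    pool.append((x, y, w, h, part_id))
                    part_id += 1
    return pool
```

For example, with m = 3, W_min = H_min = 2, the pool contains exactly one full-body prototype of size (m, 2m) = (3, 6) at position (1, 1).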
The first and second box generators 100 and 500 take static images, i.e., training or testing images, as inputs and employ a pedestrian detector to detect the pedestrians in these images. For example, a region proposal method such as Selective Search, EdgeBox, or LDCF may be used to generate candidate bounding boxes.
The size of the training or testing dataset is crucial for deep models such as ConvNets. For example, consider the Caltech dataset, currently the largest pedestrian benchmark, which consists of ~250k labeled frames and ~350k annotated bounding boxes. Instead of using the typical Reasonable training setting, which uses every 30th image in the video and is composed of ~1.7k pedestrians, we utilize every frame and employ ~50k pedestrian bounding boxes as positive training patches. Negative patches have IoU < 0.5 with every ground truth box and are proposed by LDCF.
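The IoU-based labeling of candidate boxes described above can be sketched as follows; the helper names and the box representation (x, y, w, h) are illustrative, and only the 0.5 threshold is taken from the text.

```python
# Sketch: label candidate boxes against ground truth using IoU.
# A candidate is negative if its IoU with every ground truth box is
# below neg_thresh (0.5 in the text above).

def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # intersection width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # intersection height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def label_candidates(candidates, ground_truths, neg_thresh=0.5):
    labels = []
    for c in candidates:
        best = max((iou(c, g) for g in ground_truths), default=0.0)
        labels.append(1 if best >= neg_thresh else 0)
    return labels
```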
As shown in Fig. 2, the training patch generator 200 further comprises a labeling module 201 for labeling the candidate boxes as negative or positive candidate boxes by comparison with the ground truth boxes, and an extracting module 202 for extracting negative and positive training part patches from the negative and positive candidate boxes for each body part, such as the leg, head, and upper body. Fig. 3 is an illustration of the training part patches, namely the output of the generator 200.
Fig. 4 is an example of generating training data for each part detector. (1) Given a part prototype, the corresponding region within a negative pedestrian proposal is used as a negative sample for this part detector. This assumption rests on the fact that most upright pedestrians are well aligned, so the corresponding regions in negative and positive pedestrian patches should differ. For example, if a head-shoulder part occupied the upper one-third region of a negative proposal, that proposal would be regarded as a positive pedestrian patch according to prior knowledge. (2) Each pedestrian is annotated with two bounding boxes that denote the visible (Bvis) and full (Bfull) parts. We divide the full part (Bfull) into 2m × m grids and compute the IoU between the visible part (Bvis) and each grid. The visible map is then obtained by thresholding the IoU value of each grid. If the visible grids of a ground truth cover the template grids of a given part prototype, the corresponding region can be extracted as a positive sample.
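The visible-map construction in step (2) can be sketched as follows. As an assumption for illustration, each grid cell is marked visible when the fraction of the cell covered by the visible box exceeds a threshold (a simple stand-in for the per-grid IoU thresholding described above); the function names and the 0.5 threshold are not claimed values.

```python
# Sketch: threshold per-cell overlap with Bvis to get a 2m x m visible map,
# then test whether all cells of a part prototype are visible.

def visible_map(b_full, b_vis, m, thresh=0.5):
    x0, y0, w, h = b_full
    vx, vy, vw, vh = b_vis
    cell_w, cell_h = w / m, h / (2 * m)
    vis = [[False] * m for _ in range(2 * m)]
    for row in range(2 * m):
        for col in range(m):
            cx, cy = x0 + col * cell_w, y0 + row * cell_h
            ix = max(0.0, min(cx + cell_w, vx + vw) - max(cx, vx))
            iy = max(0.0, min(cy + cell_h, vy + vh) - max(cy, vy))
            # fraction of this cell covered by the visible box
            if (ix * iy) / (cell_w * cell_h) >= thresh:
                vis[row][col] = True
    return vis

def part_is_visible(vis, x, y, w, h):
    """All cells of prototype (x, y, w, h), 1-indexed, must be visible."""
    return all(vis[r][c] for r in range(y - 1, y - 1 + h)
                         for c in range(x - 1, x - 1 + w))
```

For instance, if only the upper half of a full box is visible, an upper-body prototype passes the check while a full-body prototype does not.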
As shown in Fig. 5, the detector training unit 300 further comprises a mixing module 301 for mixing the positive and negative training part patches and splitting them into batches, a training module 302 for iteratively training each part detector using these batches until all part detectors converge, and a parameter learning module 303 for learning parameters for handling shifting for each part detector.
It is known that fine-tuning a CNN pre-trained for the ImageNet classification task on object detection and segmentation data can significantly improve performance. In particular, the parameters learned in the pre-training phase are directly used as initial values for the fine-tuning stage. A similar strategy can be directly adopted to fine-tune generic CNN image classification models for part recognition. The main disparity between the pre-training and fine-tuning tasks is the type of input data: the image classification task takes the full image or whole object as input, which contains rich context information, while the part recognition task can only observe a middle-level part patch. Evaluated deep models include AlexNet, Clarifai, and GoogLeNet, the winning models of the ImageNet classification challenge in the past three years. AlexNet and Clarifai have ~60 million parameters and share a similar structure, while GoogLeNet uses 12× fewer parameters but employs a much deeper structure. The framework in the present invention is flexible enough to incorporate other generic deep models.
In a recognition-by-proposal detection scheme, i.e., with deep detectors, the location quality of proposals is key for the recognition stage. Pedestrian proposals usually suffer from poor location quality. As is known, the best proposal method, SpatialPooling+, recalls 93% of pedestrians at a 0.5 IoU threshold but only 10% at a 0.9 IoU threshold. Shifting is one of the major causes of low IoU. As shown in Fig. 6a, shifting a ground truth bounding box by 10% in the horizontal or vertical direction results in a 0.9 IoU value, which is a high-quality proposal. However, shifting in both directions leads to a 0.68 IoU value, which is less effective for the feature extraction and classification stages. Besides whole-body shifting, each body part may shift from its fixed template position, and different parts of the same pedestrian may shift in different directions. In our framework, the positive training samples for each part detector are well aligned, while the testing proposals may shift in all directions. Thus, handling shifting for both the full body and the parts is necessary.
A straightforward way to handle this problem is to crop multiple patches around each proposal with jitter, feed the cropped patches into the deep model, and choose the highest or average score, with a penalty, as the detection score. However, this method would increase the testing time by a factor of k, where k is the number of cropped patches for each proposal.
To reduce the testing computation, we first reformulate the generic ConvNet model with fully connected layers as a fully convolutional neural network, which does not require a fixed input size and can process multiple neighboring patches in only one forward pass. The input size of our fully convolutional ConvNet can therefore be changed. Take AlexNet as an example, whose original input size is 227 × 227. As illustrated in Table 1, after reformulating fc6, fc7, and fc8 as conv6 (1 × 1 × 4096), conv7 (1 × 1 × 4096), and conv8 (1 × 1 × 2), the fully convolutional AlexNet is able to receive an expanded input, because the convolution and pooling operations are unrelated to the input size. Since the step size of the receptive field of the classification layer is 32, the expanded input should be (227 + 32n) × (227 + 32n) in order to keep the forward procedure applicable, where n indicates the expansion step size and is a non-negative integer.
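The size relation stated above can be made concrete with two small helpers: an expanded input of side 227 + 32n contains (1 + n) sliding 227-pixel windows per axis at stride 32, which is exactly the side of the resulting score map. The function names are illustrative.

```python
# Sketch of the input-size / score-map-size arithmetic for the fully
# convolutional AlexNet described above.

def expanded_input_side(n):
    assert n >= 0, "n must be a non-negative integer"
    return 227 + 32 * n

def score_map_side(n):
    # number of 227-pixel windows at stride 32 inside the expanded input
    return (expanded_input_side(n) - 227) // 32 + 1
```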
Given a proposed part patch (Xmin, Ymin, w, h) and n, the expanded cropping patch is (Xmin', Ymin', w', h'), where
Xmin' = Xmin − (16n / 227) × w
Ymin' = Ymin − (16n / 227) × h
w' = ((227 + 32n) / 227) × w
h' = ((227 + 32n) / 227) × h
Then we resize the patch to (227 + 32n) × (227 + 32n) and feed it into the fully convolutional AlexNet. As a result, (1 + n) × (1 + n) neighboring 227 × 227 patches are explored simultaneously, while the expanded scale remains the same as the proposal scale. The final output of conv8 can be viewed as a (1 + n) × (1 + n) score map S, in which each score corresponds to a 227 × 227 region. The final score of the part patch is defined as
s = max_{i,j} ( S_{i,j} − P_{i,j} )
where P_{i,j} is a penalty term with respect to the relative shifting distance from the proposed part box. Denoting by c the index of the center (un-shifted) patch in the score map, the penalty may take the form

P_{i,j} = a × ( |i − c| + |j − c| ) + b × sqrt( (i − c)^2 + (j − c)^2 )

where a is the single-orientation shifting penalty weight, and b is the geometrical distance penalty weight.
In this implementation, we set n = 2 for all part prototypes and search for the values of a and b for each part prototype by 6-fold cross validation on the training set. Fig. 6b shows an example of the full body part detector with 9 neighboring patches evaluated, where a = 2 and b = 10. Shifting handling is a kind of context modeling that keeps scale invariance, whereas simply cropping a larger region with padding and resizing it to 227 × 227 would cause a scale gap between the training and testing stages.
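The shifting-handling computation above can be sketched as follows. The crop expansion follows the (16n/227, 32n/227) relation given earlier; the penalty expression is an illustration consistent with the a/b weights described in the text, assuming the un-shifted proposal sits at the center of the score map, and is not the claimed formula itself.

```python
import math

def expand_patch(x_min, y_min, w, h, n):
    """Expanded crop so that (1+n)^2 neighboring 227x227 windows are covered."""
    x_min2 = x_min - (16 * n / 227.0) * w
    y_min2 = y_min - (16 * n / 227.0) * h
    w2 = (227 + 32 * n) / 227.0 * w
    h2 = (227 + 32 * n) / 227.0 * h
    return x_min2, y_min2, w2, h2

def part_score(score_map, a, b):
    """Max over the (1+n) x (1+n) score map of score minus shift penalty."""
    size = len(score_map)
    c = (size - 1) / 2.0  # center index = un-shifted proposal
    best = -float("inf")
    for i in range(size):
        for j in range(size):
            penalty = a * (abs(i - c) + abs(j - c)) + b * math.hypot(i - c, j - c)
            best = max(best, score_map[i][j] - penalty)
    return best
```

With n = 2 the score map is 3 × 3, and a shifted position must beat the center score by more than its penalty to be selected.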
As shown in Fig. 7, the detector selecting unit 400 further comprises a weight learning module 401 for learning combination weights of all part detectors, a selection module 402 for selecting one or more part detectors according to the combination weights, and a relearning module 403 for relearning the combination weights of the selected part detectors.
For each part prototype, the output of its ConvNet detector may be used directly as the visible score, instead of stacking a linear SVM on top as in the RCNN framework. It is found that appending an SVM detector for mining hard negatives does not show significant improvement over directly using the ConvNet output, especially for GoogLeNet. This may be due to the fact that the training proposals generated by LDCF are already hard negatives. Thus, the SVM training stage is safely removed to save feature extraction time.
Then a linear SVM is employed to learn complementarity over the 45 part detector scores. To alleviate the testing computation cost, the 6 parts with the highest SVM weights are simply selected, yielding approximately the same performance. It is also shown that the performance improvement mainly benefits from part complementarity. Fig. 8 is an illustration of the selected parts and their weights.
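The select-by-weight step can be sketched as follows. For a self-contained example, a tiny logistic-regression loop stands in for the linear SVM named in the text; the function names, learning rate, and epoch count are illustrative assumptions.

```python
import math

# Sketch: learn linear combination weights over per-sample part-score
# vectors, then keep the k parts with the largest weights (the text keeps
# 6 of 45; a weight relearning pass on the kept parts would follow).

def learn_weights(scores, labels, lr=0.1, epochs=200):
    """scores: list of per-sample part-score vectors; labels: 0/1."""
    d = len(scores[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(scores, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid prediction
            g = p - y                        # gradient of the log loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def select_top_parts(w, k=6):
    """Indices of the k part detectors with the largest weights."""
    return sorted(range(len(w)), key=lambda i: w[i], reverse=True)[:k]
```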
The testing patch generator 600 further comprises an extracting module for extracting testing part patches from the candidate boxes generated by the second box generator 500 as the generated testing patches for each body part corresponding to the selected part detectors.
As shown in Fig. 9, the testing unit 700 further comprises an evaluation module 701 and a result generation module 702. The evaluation module 701 may be configured to evaluate a score of each body part using the corresponding part detector from the testing part  patches, the selected part detectors and the relearned combination weights. The result generation module 702 may be configured to generate a detection score by combining the score of each body part in a weighted manner.
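The testing unit's weighted fusion can be sketched as follows; `detectors` maps a part id to a scoring function and `weights` holds the relearned combination weights, with all names being illustrative stand-ins.

```python
# Sketch: each selected part detector scores its testing patch, and the
# detection score is the weighted sum of the per-part scores.

def detection_score(part_patches, detectors, weights):
    total = 0.0
    for part_id, patch in part_patches.items():
        total += weights[part_id] * detectors[part_id](patch)
    return total
```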
Fig. 10 is a schematic flowchart illustrating a method 2000 for pedestrian detection according to an embodiment of the present application. Hereinafter, the method 2000 may be described in detail with respect to Fig. 10.
At step S210, candidate boxes are generated from a plurality of pedestrian training images, for example, by employing a region proposal method such as Selective Search, Edgebox, and LDCF.
At step S220, training part patches are generated from ground truth boxes and the candidate boxes generated from the plurality of pedestrian training images.
As shown in Fig. 11, the step S220 of generating training part patches comprises the following steps. Specifically, at step S221, the candidate boxes are labeled as negative or positive candidate boxes by comparison with the ground truth boxes. At step S222, for each body part, negative and positive training part patches are extracted as the training part patches from the negative and positive candidate boxes.
And then the method 2000 proceeds with step S230, at which part detectors are trained from the training part patches.
As shown in Fig. 12, the step S230 of training part detectors comprises the following steps. Specifically, at step S231, the positive and negative training part patches are mixed and split into batches. At step S232, each part detector is iteratively trained using these batches until all part detectors converge. At step S233, for each part detector, parameters are learned for handling shifting.
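The mixing and batching of step S231 can be sketched as follows; the fixed seed and batch size are illustrative choices, and the part detector models themselves (step S232) are the ConvNets described earlier.

```python
import random

# Sketch of step S231: mix positive and negative part patches with their
# labels, shuffle them, and split the result into fixed-size batches.

def make_batches(positives, negatives, batch_size, seed=0):
    samples = [(p, 1) for p in positives] + [(n, 0) for n in negatives]
    random.Random(seed).shuffle(samples)         # mix
    return [samples[i:i + batch_size]            # split into batches
            for i in range(0, len(samples), batch_size)]
```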
And then the method 2000 proceeds with step S240 of selecting complementary part detectors from all the trained part detectors.
As shown in Fig. 13, the step S240 of selecting complementary part detectors comprises a step S241 of learning combination weights of all part detectors, a step S242 of selecting one or more part detectors according to the combination weights, and a step S243 of relearning the combination weights of the selected part detectors.
And then the method 2000 proceeds with step S250 at which corresponding candidate boxes are generated from a plurality of pedestrian testing images.
And then the method 2000 proceeds with step S260 at which testing part patches are generated from the candidate boxes generated from the plurality of pedestrian testing images.
The step S260 of generating testing part patches further comprises extracting testing part patches from the candidate boxes generated from the plurality of pedestrian testing images as the generated testing part patches for each body part corresponding to the selected part detectors.
And then the method 2000 proceeds with step S270 at which a detection result is generated from the testing part patches and the selected part detectors.
As shown in Fig. 14, the step S270 of generating a detection result comprises the following steps. At step S271, a score of each body part is evaluated using the corresponding part detector, from the testing part patches, the selected part detectors and the relearned combination weights. At step S272, a detection result is generated by combining the score of each body part in a weighted manner.
Fig. 15 shows a system 3000 for pedestrian detection. The system 3000 comprises: a memory 310 that stores executable components; and a processor 320 electrically coupled to the memory 310 that executes the executable components to perform operations of the system 3000. The executable components comprise: a first box generating component 311 configured for generating candidate boxes from a plurality of pedestrian training images; a training patch generating component 312 configured for generating training part patches from the candidate boxes generated by the first box generator and ground truth boxes; a detector training component 313 configured for training one or more part detectors from the generated training part patches; a detector selecting component 314 configured for selecting complementary part detectors from all the trained part detectors; a second box generating component 315 configured for generating candidate boxes from a plurality of pedestrian testing images; a testing patch generating component 316 configured for generating testing part patches from the candidate boxes generated by the second box generator; and a testing component 317 configured for generating a detection result from the testing part patches and the selected part  detectors.
The present application is based on “Deep Learning Strong Parts for Pedestrian Detection” and addresses the problem of detecting pedestrians in a single image, aiming at constructing a pedestrian detector that can handle occlusion at different levels. The input is a single static image, and the output consists of detected bounding boxes and confidence scores.
Obviously, those skilled in the art can make variations or modifications to the present invention without departing from the spirit and scope of the present invention. As such, if these variations or modifications belong to the scope of the claims and equivalent technique, they may also fall into the scope of the present invention.

Claims (25)

  1. An apparatus for pedestrian detection, comprising:
    a first box generator for generating candidate boxes from a plurality of pedestrian training images;
    a training patch generator for generating training part patches from the candidate boxes generated by the first box generator and ground truth boxes;
    a detector training unit for training one or more part detectors from the generated training part patches;
    a detector selecting unit for selecting complementary part detectors from all the trained part detectors;
    a second box generator for generating candidate boxes from a plurality of pedestrian testing images;
    a testing patch generator for generating testing part patches from the candidate boxes generated by the second box generator; and
    a testing unit for generating a detection result from the testing part patches and the selected part detectors.
  2. The apparatus of claim 1, wherein the training patch generator comprises:
    a labeling module configured to label the candidate boxes as negative or positive candidate boxes by comparison with the ground truth boxes; and
    an extracting module configured to extract negative and positive training part patches, as the generated training part patches, from the negative and positive candidate boxes for each body part.
  3. The apparatus of claim 2, wherein the detector training unit comprises:
    a mixing module configured to mix the positive and negative training part patches and split them into batches;
    a training module configured to iteratively train each part detector by using the batches until each of all part detectors converges.
  4. The apparatus of claim 2, wherein the detector training unit further comprises:
    a parameter learning module configured to learn parameters for handling shifting for each part detector.
  5. The apparatus of claim 3, wherein the detector selecting unit comprises:
    a weight learning module configured to learn combination weights of all part detectors; and
    a selection module configured to select the complementary part detectors according to the combination weights.
  6. The apparatus of claim 5, wherein the detector selecting unit further comprises:
    a relearning module configured to relearn the combination weights of the selected complementary part detectors.
  7. The apparatus of claim 5, wherein the testing patch generator further comprises:
    an extracting module configured to extract testing part patches from the candidate boxes generated by the second box generator as the generated testing patches for each body part corresponding to the selected part detectors.
  8. The apparatus of claim 7, wherein the testing unit further comprises:
    an evaluation module configured to evaluate a score of each body part using the corresponding part detector from the testing part patches, the selected part detectors and the relearned combination weights; and
    a result generation module configured to generate a detection result by combining the score of each body part in a weighted manner.
  9. A method for pedestrian detection, comprising:
    generating candidate boxes from a plurality of pedestrian training images;
    generating training part patches from the candidate boxes generated from the plurality of pedestrian training images and ground truth boxes;
    training one or more part detectors from the training part patches;
    selecting complementary part detectors from all the trained part detectors;
    generating candidate boxes from a plurality of pedestrian testing images;
    generating testing part patches from the candidate boxes generated from the plurality of pedestrian testing images; and
    generating a detection result from the testing part patches and the selected part detectors.
  10. The method of claim 9, wherein the step of generating training part patches comprises:
    labeling the candidate boxes as negative or positive candidate boxes by comparison with the ground truth boxes; and
    extracting negative and positive training part patches, as the generated training part patches, from the negative and positive candidate boxes for each body part.
  11. The method of claim 10, wherein the step of training part detectors comprises:
    mixing the positive and negative training part patches and splitting them into batches; and
    iteratively training each part detector by using the batches until each of all part detectors converges.
  12. The method of claim 11, wherein the step of training part detectors further comprises:
    for each part detector, learning parameters for handling shifting.
  13. The method of claim 11, wherein the step of selecting complementary part detectors comprises:
    learning combination weights of all part detectors; and
    selecting the complementary part detectors according to the combination weights.
  14. The method of claim 13, wherein the step of selecting complementary part detectors further comprises:
    relearning the combination weights of the selected complementary part detectors.
  15. The method of claim 13, wherein the step of generating testing part patches comprises:
    for each body part corresponding to the selected part detectors, extracting testing part patches from the candidate boxes generated from the plurality of pedestrian testing images as the generated testing part patches.
  16. The method of claim 15, wherein the step of generating detection result comprises:
    evaluating a score of each body part using the corresponding part detector from the testing part patches, the selected part detectors and the relearned combination weights; and
    generating a detection result by combining the score of each body part in a weighted manner.
  17. A system for pedestrian detection, comprising:
    a memory that stores executable components; and
    a processor electrically coupled to the memory that executes the executable components to perform operations of the system, wherein the executable components comprise:
    a first box generating component configured for generating candidate boxes from a plurality of pedestrian training images;
    a training patch generating component configured for generating training part patches from the candidate boxes generated by the first box generator and ground truth boxes;
    a detector training component configured for training one or more part detectors from the generated training part patches;
    a detector selecting component configured for selecting complementary part detectors from all the trained part detectors;
    a second box generating component configured for generating candidate boxes from a plurality of pedestrian testing images;
    a testing patch generating component configured for generating testing part patches from the candidate boxes generated by the second box generator; and
    a testing component configured for generating a detection result from the testing part patches and the selected part detectors.
  18. The system according to claim 17, wherein the training patch generating component further comprises:
    a labeling sub-component configured to label the candidate boxes as negative or positive candidate boxes by comparison with the ground truth boxes; and
    an extracting sub-component configured to extract negative and positive training part patches, as the generated training part patches, from the negative and positive candidate boxes for each body part.
  19. The system according to claim 18, wherein the detector training component further comprises:
    a mixing sub-component configured to mix the positive and negative training part patches and split them into batches;
    a training sub-component configured to iteratively train each part detector by using the batches until each of all part detectors converges.
  20. The system according to claim 19, wherein the detector training component further comprises:
    a parameter learning sub-component configured to learn parameters for handling shifting for each part detector.
  21. The system according to claim 19, wherein the detector selecting component further comprises:
    a weight learning sub-component configured to learn combination weights of all part detectors; and
    a selection sub-component configured to select the complementary part detectors according to the combination weights.
  22. The system according to claim 21, wherein the detector selecting component further comprises:
    a relearning sub-component configured to relearn the combination weights of the selected complementary part detectors.
  23. The system according to claim 21, wherein the testing patch generating component further comprises:
    an extracting sub-component configured to extract testing part patches from the candidate boxes generated by the second box generator as the generated testing patches for each body part corresponding to the selected part detectors.
  24. The system according to claim 21, wherein the testing patch generating component further comprises:
    an extracting sub-component configured to extract testing part patches from the candidate boxes generated by the second box generator as the generated testing patches for each body part corresponding to the selected part detectors.
  25. The system according to claim 24, wherein the testing component further comprises:
    an evaluation sub-component configured to evaluate a score of each body part using the corresponding part detector from the testing part patches, the selected part detectors and the relearned combination weights; and
    a result generation sub-component configured to generate a detection result by combining the score of each body part in a weighted manner.
PCT/CN2015/091517 2015-10-09 2015-10-09 Apparatus and method for pedestrian detection WO2017059576A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2015/091517 WO2017059576A1 (en) 2015-10-09 2015-10-09 Apparatus and method for pedestrian detection
CN201610876667.7A CN106570453B (en) 2015-10-09 2016-09-29 Method, device and system for pedestrian detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/091517 WO2017059576A1 (en) 2015-10-09 2015-10-09 Apparatus and method for pedestrian detection

Publications (1)

Publication Number Publication Date
WO2017059576A1 true WO2017059576A1 (en) 2017-04-13

Family

ID=58487177

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/091517 WO2017059576A1 (en) 2015-10-09 2015-10-09 Apparatus and method for pedestrian detection

Country Status (2)

Country Link
CN (1) CN106570453B (en)
WO (1) WO2017059576A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11188794B2 (en) 2017-08-10 2021-11-30 Intel Corporation Convolutional neural network framework using reverse connections and objectness priors for object detection
CN109697441B (en) 2017-10-23 2021-02-12 杭州海康威视数字技术股份有限公司 Target detection method and device and computer equipment
CN109447276B (en) * 2018-09-17 2021-11-02 烽火通信科技股份有限公司 Machine learning system, equipment and application method
CN109359558B (en) * 2018-09-26 2020-12-25 腾讯科技(深圳)有限公司 Image labeling method, target detection method, device and storage medium
CN110298302B (en) * 2019-06-25 2023-09-08 腾讯科技(深圳)有限公司 Human body target detection method and related equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090097739A1 (en) * 2007-10-10 2009-04-16 Honeywell International Inc. People detection in video and image data
US8131011B2 (en) * 2006-09-25 2012-03-06 University Of Southern California Human detection and tracking system
CN104217225A (en) * 2014-09-02 2014-12-17 中国科学院自动化研究所 A visual target detection and labeling method
US9042601B2 (en) * 2013-03-14 2015-05-26 Nec Laboratories America, Inc. Selective max-pooling for object detection

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102136075B (en) * 2011-03-04 2013-05-15 杭州海康威视数字技术股份有限公司 Multiple-viewing-angle human face detecting method and device thereof under complex scene
EP2574958B1 (en) * 2011-09-28 2017-02-22 Honda Research Institute Europe GmbH Road-terrain detection method and system for driver assistance systems
CN102609682B (en) * 2012-01-13 2014-02-05 北京邮电大学 Feedback pedestrian detection method for region of interest
CN103440487B (en) * 2013-08-27 2016-11-02 电子科技大学 A kind of natural scene text location method of local tone difference

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11182889B2 (en) 2017-09-29 2021-11-23 Alibaba Group Holding Limited System and method for authenticating physical objects based on captured images
US10762334B2 (en) 2017-09-29 2020-09-01 Alibaba Group Holding Limited System and method for entity recognition
US11720572B2 (en) 2018-01-08 2023-08-08 Advanced New Technologies Co., Ltd. Method and system for content recommendation
US11978000B2 (en) 2018-02-01 2024-05-07 Advanced New Technologies Co., Ltd. System and method for determining a decision-making strategy
TWI761642B (en) * 2018-02-01 2022-04-21 開曼群島商創新先進技術有限公司 Method, device and electronic device for determining decision-making strategy corresponding to business
US11790632B2 (en) 2018-08-24 2023-10-17 Advanced New Technologies Co., Ltd. Method and apparatus for sample labeling, and method and apparatus for identifying damage classification
US11010838B2 (en) 2018-08-31 2021-05-18 Advanced New Technologies Co., Ltd. System and method for optimizing damage detection results
US11113582B2 (en) 2018-08-31 2021-09-07 Advanced New Technologies Co., Ltd. Method and system for facilitating detection and identification of vehicle parts
US11080839B2 (en) 2018-08-31 2021-08-03 Advanced New Technologies Co., Ltd. System and method for training a damage identification model
US11216690B2 (en) 2018-08-31 2022-01-04 Alibaba Group Holding Limited System and method for performing image processing based on a damage assessment image judgement model
US11475660B2 (en) 2018-08-31 2022-10-18 Advanced New Technologies Co., Ltd. Method and system for facilitating recognition of vehicle parts based on a neural network
US11748399B2 (en) 2018-08-31 2023-09-05 Advanced New Technologies Co., Ltd. System and method for training a damage identification model
US11972599B2 (en) 2018-09-04 2024-04-30 Advanced New Technologies Co., Ltd. Method and apparatus for generating vehicle damage image on the basis of GAN network
US11069048B2 (en) 2018-09-07 2021-07-20 Advanced New Technologies Co., Ltd. System and method for facilitating efficient damage assessments
WO2020051545A1 (en) * 2018-09-07 2020-03-12 Alibaba Group Holding Limited Method and computer-readable storage medium for generating training samples for training a target detector
CN111914863A (en) * 2019-05-09 2020-11-10 顺丰科技有限公司 Target detection method and device, terminal equipment and computer readable storage medium
WO2021212737A1 (en) * 2020-04-23 2021-10-28 苏州浪潮智能科技有限公司 Person re-identification method, system, and device, and computer readable storage medium
CN111523469B (en) * 2020-04-23 2022-02-18 苏州浪潮智能科技有限公司 Pedestrian re-identification method, system, equipment and computer readable storage medium
CN111523469A (en) * 2020-04-23 2020-08-11 苏州浪潮智能科技有限公司 Pedestrian re-identification method, system, equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN106570453B (en) 2020-03-03
CN106570453A (en) 2017-04-19

Similar Documents

Publication Publication Date Title
WO2017059576A1 (en) Apparatus and method for pedestrian detection
US9965719B2 (en) Subcategory-aware convolutional neural networks for object detection
Zhang et al. Self-produced guidance for weakly-supervised object localization
JP6188400B2 (en) Image processing apparatus, program, and image processing method
US10860837B2 (en) Deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition
US10275719B2 (en) Hyper-parameter selection for deep convolutional networks
Endres et al. Category-independent object proposals with diverse ranking
Vicente et al. Leave-one-out kernel optimization for shadow detection
CN110383291B (en) System, method, and computer-readable medium for understanding machine learning decisions
Wang et al. Probabilistic inference for occluded and multiview on-road vehicle detection
US8965115B1 (en) Adaptive multi-modal detection and fusion in videos via classification-based-learning
CN110084299B (en) Target detection method and device based on multi-head fusion attention
US11803971B2 (en) Generating improved panoptic segmented digital images based on panoptic segmentation neural networks that utilize exemplar unknown object classes
KR102655789B1 (en) Face detecting method and apparatus
WO2016179808A1 (en) An apparatus and a method for face parts and face detection
KR102138680B1 (en) Apparatus for Video Recognition and Method thereof
Abbott et al. Deep object classification in low resolution lwir imagery via transfer learning
Khellal et al. Pedestrian classification and detection in far infrared images
Le et al. Co-localization with category-consistent features and geodesic distance propagation
Juang et al. Stereo-camera-based object detection using fuzzy color histograms and a fuzzy classifier with depth and shape estimations
Smitha et al. Optimal feed forward neural network based automatic moving vehicle detection system in traffic surveillance system
CN112418358A (en) Vehicle multi-attribute classification method for strengthening deep fusion network
Wu et al. Detection algorithm for dense small objects in high altitude image
Lotfi Trajectory clustering and behaviour retrieval from traffic surveillance videos
Anitha An Efficient Region Based Object Detection method using Deep learning Algorithm

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15905674

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15905674

Country of ref document: EP

Kind code of ref document: A1