WO2016095068A1 - Pedestrian detection apparatus and method - Google Patents

Pedestrian detection apparatus and method Download PDF

Info

Publication number
WO2016095068A1
Authority
WO
WIPO (PCT)
Prior art keywords
pedestrian
attributes
patches
pedestrian detection
data sources
Prior art date
2014-12-15
Application number
PCT/CN2014/001125
Other languages
French (fr)
Inventor
Xiaoou Tang
Yonglong TIAN
Ping Luo
Xiaogang Wang
Original Assignee
Xiaoou Tang
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2014-12-15
Publication date
Application filed by Xiaoou Tang
Priority to CN201480083931.0A (granted as CN107003834B)
Priority to PCT/CN2014/001125
Publication of WO2016095068A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Human Computer Interaction (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Disclosed is a pedestrian detection apparatus. The pedestrian detection apparatus may comprise an extracting device configured to extract a plurality of training patches from predetermined data sources comprising a pedestrian data source and background scene data sources; an assigning device configured to assign preset attributes to the training patches based on types of the plurality of training patches; a training device configured to train a neural network based on the training patches with the assigned preset attributes to generate a multi-task classification model; and a determining device configured to determine one or more pedestrians in an input image based on the generated multi-task classification model. A pedestrian detection method is also disclosed.

Description

PEDESTRIAN DETECTION APPARATUS AND METHOD
Technical Field
The present application generally relates to the field of image processing and, more particularly, to a pedestrian detection apparatus and a pedestrian detection method.
Background
With the rapid evolution and popularization of high-performance mobile and wearable devices in recent years, pedestrian detection has gained increasing attention for its wide range of potential applications. Pedestrian detection is challenging because of the large variations of, and the confusion between, human bodies and background scenes.
Current methods for pedestrian detection are generally grouped into two categories: models based on handcrafted features, and deep models. In the first category, conventional methods extract patches from images to train boosted classifiers. Although such methods are robust to certain pose changes, the feature representations and the classifiers cannot be jointly optimized to improve performance, and they are not able to capture large variations. In the second category, deep neural networks achieve promising results owing to their capacity to learn mid-level representations. However, previous deep models treat pedestrian detection as a single binary classification task; they mainly learn mid-level features and are not able to capture rich pedestrian variations.
Summary
According to an embodiment of the present application, disclosed is a pedestrian detection method. The method may comprise: extracting a plurality of training patches from predetermined data sources comprising a pedestrian data source and background scene data sources; assigning preset attributes to the training patches based on types of the training patches; training a neural network based on the training patches with the assigned preset attributes to generate a multi-task classification model; and determining one or more pedestrians in an input image based on the generated multi-task classification model.
According to an embodiment of the present application, disclosed is a pedestrian detection apparatus. The pedestrian detection apparatus may comprise an extracting device configured to extract a plurality of training patches from predetermined data sources comprising a pedestrian data source and background scene data sources; an assigning device configured to assign preset attributes to the training patches based on types of the plurality of training patches; a training device configured to train a neural network based on the training patches with the assigned preset attributes to generate a multi-task classification model; and a determining device configured to determine one or more pedestrians in an input image based on the generated multi-task classification model.
Brief Description of the Drawing
Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 is a schematic diagram illustrating a pedestrian detection apparatus consistent with some embodiments of the present application.
Fig. 2 is a schematic diagram illustrating a pedestrian detection apparatus when it is implemented in software, consistent with some disclosed embodiments.
Fig. 3 is a schematic diagram illustrating an extracting device consistent with some disclosed embodiments.
Fig. 4 is a schematic diagram illustrating a Task-Assistant Convolutional Neural Network consistent with some disclosed embodiments.
Fig. 5 is a schematic flowchart illustrating a pedestrian detection method consistent with some disclosed embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When appropriate, the same reference numbers are used throughout the drawings to refer to the same or like parts. Fig. 1 is a schematic diagram illustrating an exemplary pedestrian detection apparatus 1000 consistent with some disclosed embodiments.
Referring to Fig. 1, where the pedestrian detection apparatus 1000 is implemented in hardware, it may comprise an extracting device 100, an assigning device 200, a training device 300 and a determining device 400.
It shall be appreciated that the apparatus 1000 may be implemented using certain hardware, software, or a combination thereof. In addition, the embodiments of the present invention may be embodied in a computer program product on one or more computer readable storage media (comprising but not limited to disk storage, CD-ROM, optical memory and the like) containing computer program codes. Fig. 2 is a schematic diagram illustrating the pedestrian detection apparatus 1000 when it is implemented in software, consistent with some disclosed embodiments.
In the case that the pedestrian detection apparatus 1000 is implemented in software, the apparatus 1000 may include a general purpose computer, a computer cluster, a mainstream computer, a computing device dedicated to providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion. As shown in Fig. 2, the apparatus 1000 may include one or more processors (processors 102, 104, 106, etc.), a memory 112, a storage device 116, and a bus to facilitate information exchange among the various devices of the apparatus 1000. Processors 102-106 may include a central processing unit ("CPU"), a graphics processing unit ("GPU"), or other suitable information processing devices. Depending on the type of hardware being used, processors 102-106 can include one or more printed circuit boards, and/or one or more microprocessor chips. Processors 102-106 can execute sequences of computer program instructions to perform various methods that will be explained in greater detail below.
Memory 112 can include, among other things, a random access memory ("RAM") and a read-only memory ("ROM"). Computer program instructions can be stored, accessed, and read from memory 112 for execution by one or more of processors 102-106. For example, memory 112 may store one or more software applications. Further, memory 112 may store an entire software application or only a part of a software application that is executable by one or more of processors 102-106. It is noted that although only one block is shown in Fig. 2, memory 112 may include multiple physical devices installed on a central computing device or on different computing devices.
In the embodiment shown in Fig. 1, the extracting device 100 may be configured to extract a plurality of training patches from predetermined data sources comprising a pedestrian data source and background scene data sources. The pedestrian data source (abbreviated as the P data source) and the background scene data sources (each abbreviated as a B data source) may be existing data sources, for example background scene segmentation databases or the like.
In an embodiment, as shown in Fig. 3, the extracting device 100 may comprise a selector 110, a generating module 120 and an extracting module 130. The selector 110 may be configured to select a plurality of training images from the P data source and the B data sources, respectively.
The generating module 120 may be configured to generate candidate patches from the selected training images. For example, a region proposal method is employed to generate candidate patches from the training images. In an embodiment, the generating module 120 may be configured to determine whether each of the candidate patches is positive or negative based on a type of the candidate patch.
Since the number of negative patches is significantly larger than the number of positive patches in the P data source, the extracting module 130 may be configured to extract positive and negative patches from the generated candidate patches in the P data source and negative patches from the generated candidate patches in the B data sources.
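As an illustration only, this extraction step could be organized as in the following Python sketch. The region proposal routine, the IoU computation and the overlap thresholds are assumptions introduced here for clarity and are not specified by the present disclosure.

```python
# Hypothetical sketch of the patch extraction performed by the extracting
# module 130. propose_regions() and iou() are placeholder helpers.

def label_candidates(image, pedestrian_boxes, propose_regions, iou,
                     pos_thresh=0.5, neg_thresh=0.3):
    """Split region proposals into positive and negative patches.

    A candidate is positive when it overlaps a ground-truth pedestrian box
    strongly enough, and negative when it barely overlaps any of them. For a
    B (background scene) data source, pedestrian_boxes is empty, so every
    candidate becomes a negative patch.
    """
    positives, negatives = [], []
    for box in propose_regions(image):
        best = max((iou(box, gt) for gt in pedestrian_boxes), default=0.0)
        if best >= pos_thresh:
            positives.append(box)
        elif best <= neg_thresh:
            negatives.append(box)
        # candidates with intermediate overlap are ambiguous and discarded
    return positives, negatives
```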
Referring back to Fig. 1, the assigning device 200 may be configured to assign preset attributes to the training patches based on the types of the plurality of training patches. For example, there are three types of training patches, i.e., positive, hard negative and negative patches. As mentioned above, the extracting module 130 extracts positive and negative patches from the P data source and extracts negative patches from the B data sources. To this end, the assigning device 200 may be further configured to assign pedestrian attributes of the preset attributes to the positive patches from the P data source, and to assign scene attributes of the preset attributes to the negative patches from the B data sources. In an embodiment, the pedestrian attributes may be manually labeled on the positive patches from the P data source.
In an embodiment, the pedestrian attributes may comprise, for example, backpack, dark-trousers, hat, bag, gender, occlusion, riding, viewpoint, white-clothes and the like. The scene attributes may comprise, for example, sky, tree, building, road, traffic light, horizontal, vertical, vehicle and the like.
In an embodiment, as different B data sources may have different data distributions, the scene attributes comprise, to reduce these discrepancies, shared attributes, which are included in all the B data sources, and unshared attributes, each of which is included in only one of the B data sources. The former enable the learning of a shared representation across the B data sources, while the latter enhance the diversity of the attributes. A sketch of how such labels could be assembled is given below.
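To make the assignment concrete, the four groups of labels described above might be assembled as follows. The particular attribute names are examples taken from the description, while the ordering and binary encoding are assumptions of this sketch.

```python
# Hypothetical label assembly for one training patch (assigning device 200).

PED_ATTRS = ["backpack", "dark-trousers", "hat", "bag", "gender",
             "occlusion", "riding", "viewpoint", "white-clothes"]
SHARED_SCENE_ATTRS = ["sky", "tree", "building", "road", "vehicle"]
UNSHARED_SCENE_ATTRS = ["traffic-light", "horizontal", "vertical"]

def make_label(y, ped=(), shared=(), unshared=()):
    """Return (y, o_p, o_s, o_u) as binary label vectors.

    Positive patches from the P data source carry pedestrian attributes;
    negative patches from a B data source carry scene attributes only.
    """
    o_p = [1 if a in ped else 0 for a in PED_ATTRS]
    o_s = [1 if a in shared else 0 for a in SHARED_SCENE_ATTRS]
    o_u = [1 if a in unshared else 0 for a in UNSHARED_SCENE_ATTRS]
    return y, o_p, o_s, o_u

# A positive pedestrian patch:
label_p = make_label(1, ped={"backpack", "hat"})
# A negative patch from a background scene data source:
label_b = make_label(0, shared={"sky", "road"}, unshared={"horizontal"})
```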
The training device 300 in Fig. 1 may be configured to train a Task-Assistant Convolutional Neural Network (TA-CNN) based on the training patches with the assigned attributes to generate a multi-task classification model.
The determining device 400 in Fig. 1 may be configured to determine one or more pedestrians in an input image based on the generated multi-task classification model. Specifically, the determining device is further configured to obtain pedestrian detection results comprising a pedestrian detection score, pedestrian attributes and scene attributes based on the generated multi-task classification model, and to determine one or more pedestrians in an input image based on the obtained pedestrian detection results.
Fig. 4 is a schematic diagram illustrating a TA-CNN consistent with some disclosed embodiments. As shown in Fig. 4, the TA-CNN comprises a plurality of convolutional layers, at least one max-pooling layer and at least one fully-connected layer, wherein each of the max-pooling layers is followed by a convolutional layer. Hereinafter, an exemplary training process for the TA-CNN as mentioned above will be further discussed in detail. It is understood that the invention is not limited thereto.
In an embodiment, a training set D is constructed by combining the training patches extracted from the P and B data sources. The training set is formulated as

D = \{(x_n, y_n, o_n^p, o_n^s, o_n^u)\}_{n=1}^{N},

where each (y_n, o_n^p, o_n^s, o_n^u) is a four-tuple. Specifically, y_n denotes a binary label indicating whether an image patch is pedestrian or not, while o_n^p, o_n^s and o_n^u are three sets of binary labels, representing the pedestrian, shared scene, and unshared scene attributes, respectively. As shown in Fig. 4, the TA-CNN takes the image patch x_n as input and predicts y_n by stacking four convolutional layers (conv1 to conv4), four max-pooling layers, and two fully-connected layers (fc5 and fc6). The TA-CNN may be iteratively trained for multi-task learning based on the training patches with the assigned attributes until the multi-task classification model converges.
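A minimal PyTorch sketch of this topology is given below. The channel widths, kernel sizes, the 96x48 input resolution and the way the structure projection vector z enters fc5 (following the formula for h^{(L)} given later in the description) are assumptions, as the present disclosure does not fix these hyper-parameters.

```python
# Minimal sketch of the TA-CNN topology: conv1-conv4, four max-pooling
# layers, and fc5/fc6. All sizes below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TACNN(nn.Module):
    def __init__(self, spv_dim=18, n_outputs=1 + 9 + 5 + 3):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(3, 64, 3, padding=1),
            nn.Conv2d(64, 128, 3, padding=1),
            nn.Conv2d(128, 256, 3, padding=1),
            nn.Conv2d(256, 256, 3, padding=1),
        ])
        self.pool = nn.MaxPool2d(2)              # one pooling per conv layer
        self.fc5 = nn.Linear(256 * 6 * 3, 512)   # assumes 96x48 input patches
        self.proj = nn.Linear(spv_dim, 512)      # feeds the SPV z into fc5
        self.fc6 = nn.Linear(512, n_outputs)     # y plus all attribute tasks

    def forward(self, x, z):
        for conv in self.convs:
            x = self.pool(F.relu(conv(x)))       # cf. Eqns. (1) and (2) below
        h = x.flatten(1)
        h = F.relu(self.fc5(h) + self.proj(z))   # h = relu(Wh + b + W_z z + b_z)
        return torch.sigmoid(self.fc6(h))        # per-task probabilities
```

Here the 18 outputs stand for the pedestrian label plus the 9 pedestrian, 5 shared scene and 3 unshared scene attributes used in the earlier example lists; the real counts depend on the chosen attribute sets.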
In an embodiment, each hidden layer of the TA-CNN from conv1 to conv4 is computed recursively by convolution and max-pooling, which are formulated as

h_n^{v(l)} = \mathrm{relu}\Big(b^{v(l)} + \sum_{u} k^{vu(l)} * h_n^{u(l-1)}\Big),  (1)

h_{n,(i,j)}^{v(l)} = \max_{(p,q) \in \Omega_{(i,j)}} \big\{ h_{n,(p,q)}^{v(l)} \big\}.  (2)

In Eqn. (1), relu(x) = max(0, x) is the rectified linear function and * denotes the convolution operator applied on every pixel of the feature map h_n^{u(l-1)}, where h_n^{u(l-1)} and h_n^{v(l)} stand for the u-th input channel at the (l-1)-th layer and the v-th output channel at the l-th layer, respectively; k^{vu(l)} and b^{v(l)} denote the filters and the bias. In Eqn. (2), the feature map h_n^{v(l)} is partitioned into a grid with overlapping cells, each of which is denoted as Ω_(i,j), where (i,j) indicates the cell index. The max-pooling compares the values at each location (p,q) of a cell and outputs the maximum value of each cell.
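Purely to make Eqns. (1) and (2) concrete, a naive NumPy rendering for a single input/output channel pair follows; it trades efficiency for readability and is illustrative only.

```python
# Naive rendering of Eqn. (1) (convolution + relu) and Eqn. (2)
# (max-pooling over cells) for one channel pair, for illustration only.
import numpy as np

def relu(x):
    return np.maximum(0, x)

def conv_relu(h_prev, k, b):
    """h = relu(b + k * h_prev) for one input/output channel pair."""
    H, W = h_prev.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for p in range(out.shape[0]):
        for q in range(out.shape[1]):
            out[p, q] = np.sum(k * h_prev[p:p + kh, q:q + kw])
    return relu(out + b)

def max_pool(h, cell=3, stride=2):
    """Output the maximum over each (possibly overlapping) cell Omega_(i,j)."""
    H, W = h.shape
    rows = (H - cell) // stride + 1
    cols = (W - cell) // stride + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = h[i * stride:i * stride + cell,
                          j * stride:j * stride + cell].max()
    return out
```

With cell=3 and stride=2 the pooling cells overlap, matching the overlapping cells Ω_(i,j) described above.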
Each hidden layer in fc5 and fc6 is obtained by

h_n^{(l)} = \mathrm{relu}\big(W^{(l)} h_n^{(l-1)} + b^{(l)}\big),  (3)

where the higher-level representation is transformed from the lower level by a non-linear mapping; W^{(l)} and b^{(l)} are the weight matrix and the bias vector at the l-th layer.
The TA-CNN can be formulated as minimizing the log posterior probability with respect to a set of network parameters W:

W^* = \arg\min_{W} E, \quad E = -\sum_{n=1}^{N} \log p(y_n, o_n^p, o_n^s, o_n^u \mid x_n; W),  (4)

where E is a complete loss function regarding the entire training set. Here, the shared attributes o^s in Eqn. (4) are crucial to learn a shared representation across the multiple B scene data sources.
For clarity, suppose only the unshared scene attributes o^u are kept in the loss function, which then becomes

E = -\sum_{n=1}^{N} \log p(o_n^u \mid x_n^a),

where x^a denotes a sample of B_a. A shared representation can be learned if and only if all the samples share at least one attribute. Since the samples are independent, the loss function can be expanded as

E = -\sum_{i=1}^{I} \log p(o_i^{u_1} \mid x_i^{a_1}) - \sum_{j=1}^{J} \log p(o_j^{u_2} \mid x_j^{a_2}) - \sum_{k=1}^{K} \log p(o_k^{u_3} \mid x_k^{a_3}),

where I + J + K = N, implying that each data source is only used to optimize its corresponding unshared attributes, although all the data sources and attributes are trained in a single TA-CNN.
The above formulation is not sufficient to learn shared features among the data sources, especially when the data have large differences. To bridge the multiple B data sources, the shared attributes o^s are introduced, and the loss function develops into

E = -\sum_{n=1}^{N} \log p(o_n^s, o_n^u \mid x_n^a),

such that the TA-CNN can learn a shared representation across the B data sources, because the samples share the common targets o^s.
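One way to realize this multi-source objective in code is to mask out, for each sample, the unshared-attribute terms belonging to the other data sources, so that a single network is optimized while each B data source only supervises its own unshared attributes. The sketch below assumes per-task sigmoid outputs and a mask built from each sample's source identifier; both are implementation choices, not requirements of the disclosure.

```python
# Hypothetical masked multi-source loss: shared scene attributes are
# supervised for every B sample, while each unshared attribute group is
# supervised only by its own data source (cf. I + J + K = N above).
import torch
import torch.nn.functional as F

def masked_bce(pred, target, mask):
    """Binary cross-entropy counting only the entries enabled by mask."""
    loss = F.binary_cross_entropy(pred, target, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1)

def scene_loss(pred_s, tgt_s, pred_u, tgt_u, source_id, groups):
    """pred_s/tgt_s: shared attributes; pred_u/tgt_u: all unshared attributes.

    groups maps a source id to the column indices of that source's own
    unshared attributes; the other columns are masked out for the sample.
    """
    mask_u = torch.zeros_like(pred_u)
    for n, sid in enumerate(source_id):
        mask_u[n, groups[int(sid)]] = 1.0
    shared = masked_bce(pred_s, tgt_s, torch.ones_like(pred_s))
    unshared = masked_bce(pred_u, tgt_u, mask_u)
    return shared + unshared
```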
To reduce the gaps between the P and B data sources, a structure projection vector z_n is computed for each sample x_n, and Eqn. (4) turns into

E = -\sum_{n=1}^{N} \log p(y_n, o_n^p, o_n^s, o_n^u \mid x_n, z_n; W).  (5)

For example, the first term of the above decomposition can be written as

-\sum_{i=1}^{I} \log p(o_i^s, o_i^{u_1} \mid x_i^{a_1}, \tilde{z}_i^{a_1}),

where \tilde{z}^{a_1} is attained by projecting the corresponding x^{a_1} in B_{a_1} onto the feature space of P. Here, \tilde{z} is used to bridge the multiple data sources, because samples from different data sources are projected to a common space of P. The TA-CNN accordingly adopts a pair of data (x_n, z_n) as input. All the remaining terms can be derived in a similar way.
In an embodiment, the structure projection vector (SPV) for each sample is calculated by organizing the positive and negative data of P into two tree structures, respectively, so as to close the gaps between the P and B data sources. Each tree has a depth of three and partitions the data top-down, where each child node groups the data of its parent node into clusters. Then, the SPV of each sample is obtained by concatenating its distances to the means of the leaf nodes. Specifically, at each parent node, an HOG feature is extracted for each sample and k-means is applied to group the data.
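Under these assumptions, the SPV computation might look as follows; hog() is a placeholder for any HOG extractor, and the cluster counts per node are illustrative choices.

```python
# Hypothetical structure projection vector (SPV) computation. A depth-3
# tree (root -> k1 child nodes -> k1*k2 leaf nodes) is built separately
# over the positive and the negative data of the P data source.
import numpy as np
from sklearn.cluster import KMeans

def build_tree(feats, k1=3, k2=3):
    """Return the means of the leaf nodes of a depth-3 clustering tree."""
    leaves = []
    top = KMeans(n_clusters=k1, n_init=10).fit(feats)
    for c in range(k1):
        members = feats[top.labels_ == c]
        sub = KMeans(n_clusters=min(k2, len(members)), n_init=10).fit(members)
        leaves.extend(sub.cluster_centers_)
    return np.array(leaves)

def spv(sample_feat, pos_leaves, neg_leaves):
    """Concatenate the distances from a sample to every leaf mean of the
    positive tree and of the negative tree."""
    d_pos = np.linalg.norm(pos_leaves - sample_feat, axis=1)
    d_neg = np.linalg.norm(neg_leaves - sample_feat, axis=1)
    return np.concatenate([d_pos, d_neg])

# pos_leaves = build_tree(hog(positive_patches))  # hog() is a placeholder
# neg_leaves = build_tree(hog(negative_patches))
# z_n = spv(hog(x_n), pos_leaves, neg_leaves)     # one SPV per sample
```

With k1 = k2 = 3 this yields 2 x 9 = 18 distances per sample, which is why the network sketch earlier assumed spv_dim = 18.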
Hereinafter, an exemplary learning process for the TA-CNN will be further discussed in detail. To learn the network parameters W, Eqn. (5) is reformulated as

E = -\log p(y \mid x, z) - \sum_{i} \alpha_i \log p(o_i^p \mid x, z) - \sum_{j} \beta_j \log p(o_j^s \mid x, z) - \sum_{k} \gamma_k \log p(o_k^u \mid x, z),  (6)

where the main task is to predict the pedestrian label y, and the attribute estimations, i.e. o_i^p, o_j^s and o_k^u, are auxiliary semantic tasks; α, β and γ denote the importance coefficients that associate the multiple tasks. Here, p(y|x, z), p(o_i^p|x, z), p(o_j^s|x, z) and p(o_k^u|x, z) are modeled by softmax functions, for example,

p(y = m \mid x, z) = \frac{\exp(W_m h^{(L)})}{\sum_{m'} \exp(W_{m'} h^{(L)})},

where h^{(L)} and W indicate the top-layer feature vector and the parameter matrix of the main task y, respectively (W_m denoting the m-th row of W), and h^{(L)} is obtained by h^{(L)} = \mathrm{relu}(W^{(L)} h^{(L-1)} + b^{(L)} + W^{z} z + b^{z}).
Learning the multiple tasks in Eqn. (6) is cast as optimizing a single weighted multivariate cross-entropy loss, which can not only learn a compact weight matrix but also iteratively estimate the importance coefficients:

E = -\mathbf{y}^{T} \mathrm{diag}(\lambda) \log \hat{\mathbf{y}} - (\mathbf{1} - \mathbf{y})^{T} \mathrm{diag}(\lambda) \log(\mathbf{1} - \hat{\mathbf{y}}),  (7)

where λ denotes a vector of importance coefficients and diag(·) represents a diagonal matrix. Here, y = (y, o^p, o^s, o^u) is a vector of binary labels, concatenating the pedestrian label and all the attribute labels, and ŷ denotes the corresponding vector of predicted probabilities. The optimization of Eqn. (7) iterates between two steps: updating the network parameters with the importance coefficients fixed, and updating the coefficients with the network parameters fixed. Typically, the first step may be run for a sufficient number of iterations to reach a local minimum, and then the second step is performed to update the coefficients. According to an embodiment, the network parameters may be updated by using stochastic gradient descent and back-propagation (BP) to minimize Eqn. (7), where the error of the output layer is propagated top-down to update the filters or weights at each layer. In addition, the importance coefficients may be updated, with the network parameters fixed, by minimizing the posterior probability with respect to λ.
Since the methods for learning network parameters and importance coefficients are similar to the previous methods, the description thereof is omitted here for clarity.
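The alternating scheme could be organized as in the following sketch. The coefficient update is deliberately left schematic (a pass-through placeholder by default), since, as noted above, it follows previous methods; the inner step count and learning rate are likewise assumptions.

```python
# Hypothetical alternating optimization of Eqn. (7): run SGD/BP on the
# network parameters with lambda fixed, then update lambda with the
# network fixed. update_lambda defaults to a pass-through placeholder.
import torch

def weighted_ce(pred, target, lam):
    """Eqn. (7): weighted multivariate cross-entropy over all tasks."""
    eps = 1e-7
    pred = pred.clamp(eps, 1 - eps)
    per_task = -(target * pred.log() + (1 - target) * (1 - pred).log())
    return (per_task * lam).sum(dim=1).mean()

def train(model, loader, lam, update_lambda=lambda model, loader, lam: lam,
          epochs=10, inner_steps=1000, lr=0.01):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        # Step 1: update the network parameters, coefficients fixed.
        for step, (x, z, y) in enumerate(loader):
            opt.zero_grad()
            loss = weighted_ce(model(x, z), y, lam)
            loss.backward()                       # back-propagation
            opt.step()
            if step >= inner_steps:
                break
        # Step 2: update the coefficients, network parameters fixed
        # (schematic stand-in for the posterior-based update in the text).
        with torch.no_grad():
            lam = update_lambda(model, loader, lam)
    return model, lam
```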
The multi-task classification model may be configured to learn a detection task, a pedestrian attribute task and a scene attribute task. In an embodiment, the pedestrian detection results comprise a pedestrian detection score, pedestrian attributes and scene attributes.
Fig. 5 is a schematic flowchart illustrating a pedestrian detection method 2000 consistent with some disclosed embodiments. Hereafter, the method 2000 is described in detail with reference to Fig. 5.
At step S201, a plurality of training patches are extracted from predetermined data sources comprising a pedestrian data source and background scene data sources. According to an embodiment, during the extracting process, a plurality of training images are first selected from the pedestrian data source and the background scene data sources. For example, a region proposal method is employed to generate candidate patches from the training images. Whether each of the candidate patches is positive or negative is determined based on a type of the candidate patch. Positive and negative patches are extracted from the generated candidate patches in the P data source, and negative patches are extracted from the generated candidate patches in the B data sources.
At step S202, preset attributes are assigned to the training patches based on the types of the training patches. For example, the preset attributes may comprise pedestrian attributes, shared scene attributes and unshared scene attributes. In an embodiment, the pedestrian attributes are assigned to the positive patches from the P data source, and the shared and unshared scene attributes are assigned to the negative patches from the B data sources.
At step S203, a Task-Assistant Convolutional Neural Network (TA-CNN) is trained based on the training patches with the assigned attributes to generate a multi-task classification model. In an embodiment, the TA-CNN comprises a plurality of convolutional layers, at least one max-pooling layer and at least one fully-connected layer, wherein each of the max-pooling layers is followed by a corresponding convolutional layer of the convolutional layers.
At step S204, one or more pedestrians in an input image may be determined based on the generated multi-task classification model. Furthermore, pedestrian detection results are obtained based on the generated multi-task classification model. For example, the pedestrian detection results may comprise a pedestrian detection score, pedestrian attributes and scene attributes. One or more pedestrians in the input image are then determined based on the obtained pedestrian detection results.
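At test time, the detection branch of the trained model might be used as in the sketch below; the score threshold is a conventional choice and any non-maximum suppression over overlapping windows is omitted, as the present disclosure does not prescribe these details.

```python
# Hypothetical inference sketch (determining device 400): score each
# candidate window with the trained multi-task model and keep the windows
# whose pedestrian detection score clears a threshold.
import torch

def detect(model, patches, spvs, score_thresh=0.5):
    """patches: (N, 3, H, W) candidate windows; spvs: (N, spv_dim) SPVs.

    Returns the indices of windows judged to contain a pedestrian, their
    detection scores, and the attribute predictions that the model outputs
    alongside the scores.
    """
    model.eval()
    with torch.no_grad():
        out = model(patches, spvs)       # (N, 1 + number of attribute tasks)
    scores = out[:, 0]                   # pedestrian detection score
    attrs = out[:, 1:]                   # pedestrian and scene attributes
    keep = (scores >= score_thresh).nonzero(as_tuple=True)[0]
    return keep, scores[keep], attrs[keep]
```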
The multi-task classification model may be configured to learn a detection task, a pedestrian attribute task and a scene attribute task. The pedestrian detection results may comprise a pedestrian detection score, pedestrian attributes and scene attributes.
With the pedestrian detection apparatus and method of the present application, the number of hard negatives can be significantly decreased. Furthermore, the pedestrian attributes and the background scene attributes can be predicted simultaneously. By training the multiple tasks from multiple sources using a single TA-CNN, the visual gaps between multiple data sources can be bridged and, meanwhile, the attribute diversity can be enhanced. The scene attributes can be transferred from existing scene data sources without manual annotation.
Although the preferred examples of the present invention have been described, those skilled in the art can make variations or modifications to these examples upon knowing the basic inventive concept. The appended claims are intended to be construed as comprising the preferred examples and all the variations or modifications falling within the scope of the present invention.
Obviously, those skilled in the art can make variations or modifications to the present invention without departing from the spirit and scope of the present invention. As such, if these variations or modifications belong to the scope of the claims and the equivalent technique, they may also fall within the scope of the present invention.

Claims (14)

  1. A pedestrian detection method, comprising:
    extracting a plurality of training patches from predetermined data sources comprising a pedestrian data source and background scene data sources;
    assigning preset attributes to the training patches based on types of the training patches;
    training a neural network based on the training patches with the assigned preset attributes to generate a multi-task classification model; and
    determining one or more pedestrians in an input image based on the generated multi-task classification model.
  2. The pedestrian detection method according to claim 1, wherein the extracting comprises:
    selecting a plurality of training images from the pedestrian data source and the background scene data sources;
    generating candidate patches from the selected training images;
    extracting positive and negative patches from the generated candidate patches in the pedestrian data source, and
    extracting negative patches from the generated candidate patches in the background scene data sources.
  3. The pedestrian detection method according to claim 1, wherein the assigning preset attributes comprises:
    assigning pedestrian attributes of the preset attributes to the positive patches from the pedestrian data source; and
    assigning scene attributes of the preset attributes to the negative patches from the background scene data sources.
  4. The pedestrian detection method according to claim 3, wherein the scene attributes comprise shared attributes which are included in all the background scene data sources and unshared attributes which are included in one of the background scene data sources.
  5. The pedestrian detection method according to claim 1, wherein the neural network comprises a plurality of convolutional layers, at least one max-pooling layer and at least one fully-connected layer, and wherein each of the max-pooling layers is followed by a corresponding convolutional layer of the convolutional layers.
  6. The pedestrian detection method according to claim 1, wherein the multi-task classification model is configured to learn a detection task, a pedestrian attribute task and a scene attribute task.
  7. The pedestrian detection method according to claim 1, wherein the determining comprises:
    obtaining pedestrian detection results comprising a pedestrian detection score, pedestrian attributes and scene attributes based on the generated multi-task classification model, and
    determining one or more pedestrians in an input image based on the obtained pedestrian detection results.
  8. A pedestrian detection apparatus, comprising:
    an extracting device configured to extract a plurality of training patches from predetermined data sources comprising a pedestrian data source and background scene data sources;
    an assigning device configured to assign preset attributes to the training patches based on types of the plurality of training patches;
    a training device configured to train a neural network based on the training patches with the assigned preset attributes to generate a multi-task classification model; and
    a determining device configured to determine one or more pedestrians in an input image based on the generated multi-task classification model.
  9. The pedestrian detection apparatus according to claim 8, wherein the extracting device comprises:
    a selector configured to select a plurality of training images from the pedestrian data source and the background scene data sources, respectively;
    a generating module configured to generate candidate patches from the selected training images; and
    an extracting module configured to extract positive and negative patches from the generated candidate patches in the pedestrian data source, and to extract negative patches from the generated candidate patches in the background scene data sources.
  10. The pedestrian detection apparatus according to claim 8, wherein the assigning device is further configured to:
    assign pedestrian attributes of the preset attributes to the positive patches from the pedestrian data source; and
    assign scene attributes of the preset attributes to the negative patches from the background scene data sources.
  11. The pedestrian detection apparatus according to claim 10, wherein the scene attributes comprise shared attributes which are included in all the background scene data sources and unshared attributes which are included in one of the background scene data sources.
  12. The pedestrian detection apparatus according to claim 8, wherein the neural network comprises a plurality of convolutional layers, at least one max-pooling layer and at least one fully-connected layer, and wherein each of the max-pooling layers is followed by a corresponding convolutional layer of the convolutional layers.
  13. The pedestrian detection apparatus according to claim 8, wherein the multi-task classification model is configured to learn a detection task, a pedestrian attribute task and a scene attribute task.
  14.  The pedestrian detection apparatus according to claim 8, wherein the determining device is further configured to obtain pedestrian detection results comprising a pedestrian detection score, pedestrian attributes and scene attributes based on the generated multi-task classification model, and to determine one or more pedestrians in an input image based on the obtained pedestrian detection results.
PCT/CN2014/001125 2014-12-15 2014-12-15 Pedestrian detection apparatus and method WO2016095068A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201480083931.0A CN107003834B (en) 2014-12-15 2014-12-15 Pedestrian detection device and method
PCT/CN2014/001125 WO2016095068A1 (en) 2014-12-15 2014-12-15 Pedestrian detection apparatus and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/001125 WO2016095068A1 (en) 2014-12-15 2014-12-15 Pedestrian detection apparatus and method

Publications (1)

Publication Number Publication Date
WO2016095068A1 (en) 2016-06-23

Family

ID=56125540

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/001125 WO2016095068A1 (en) 2014-12-15 2014-12-15 Pedestrian detection apparatus and method

Country Status (2)

Country Link
CN (1) CN107003834B (en)
WO (1) WO2016095068A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018065158A1 (en) * 2016-10-06 2018-04-12 Siemens Aktiengesellschaft Computer device for training a deep neural network
CN110580487A (en) * 2018-06-08 2019-12-17 Oppo广东移动通信有限公司 Neural network training method, neural network construction method, image processing method and device
CN111191675A (en) * 2019-12-03 2020-05-22 深圳市华尊科技股份有限公司 Pedestrian attribute recognition model implementation method and related device

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111052126A (en) * 2017-09-04 2020-04-21 华为技术有限公司 Pedestrian attribute identification and positioning method and convolutional neural network system
CN108875536A (en) * 2018-02-06 2018-11-23 北京迈格威科技有限公司 Pedestrian's analysis method, device, system and storage medium
CN111626087A (en) * 2019-02-28 2020-09-04 北京市商汤科技开发有限公司 Neural network training and eye opening and closing state detection method, device and equipment
CN112149665A (en) * 2020-09-04 2020-12-29 浙江工业大学 High-performance multi-scale target detection method based on deep learning
CN113807650A (en) * 2021-08-04 2021-12-17 北京房江湖科技有限公司 House resource owner interview management method, system, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102842045A (en) * 2012-08-03 2012-12-26 华侨大学 Pedestrian detection method based on combined features
CN103902968A (en) * 2014-02-26 2014-07-02 中国人民解放军国防科学技术大学 Pedestrian detection model training method based on AdaBoost classifier

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8015132B2 (en) * 2008-05-16 2011-09-06 Samsung Electronics Co., Ltd. System and method for object detection and classification with multiple threshold adaptive boosting
CN104063719B (en) * 2014-06-27 2018-01-26 深圳市赛为智能股份有限公司 Pedestrian detection method and device based on depth convolutional network
CN104091178A (en) * 2014-07-01 2014-10-08 四川长虹电器股份有限公司 Method for training human body sensing classifier based on HOG features
CN104166861B (en) * 2014-08-11 2017-09-29 成都六活科技有限责任公司 A kind of pedestrian detection method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102842045A (en) * 2012-08-03 2012-12-26 华侨大学 Pedestrian detection method based on combined features
CN103902968A (en) * 2014-02-26 2014-07-02 中国人民解放军国防科学技术大学 Pedestrian detection model training method based on AdaBoost classifier

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018065158A1 (en) * 2016-10-06 2018-04-12 Siemens Aktiengesellschaft Computer device for training a deep neural network
CN110580487A (en) * 2018-06-08 2019-12-17 Oppo广东移动通信有限公司 Neural network training method, neural network construction method, image processing method and device
CN111191675A (en) * 2019-12-03 2020-05-22 深圳市华尊科技股份有限公司 Pedestrian attribute recognition model implementation method and related device
CN111191675B (en) * 2019-12-03 2023-10-24 深圳市华尊科技股份有限公司 Pedestrian attribute identification model realization method and related device

Also Published As

Publication number Publication date
CN107003834B (en) 2018-07-06
CN107003834A (en) 2017-08-01

Similar Documents

Publication Publication Date Title
WO2016095068A1 (en) Pedestrian detection apparatus and method
CN110363210B (en) Training method and server for image semantic segmentation model
US11741361B2 (en) Machine learning-based network model building method and apparatus
CN112052787B (en) Target detection method and device based on artificial intelligence and electronic equipment
Shao et al. Online multi-view clustering with incomplete views
US20220335711A1 (en) Method for generating pre-trained model, electronic device and storage medium
WO2019223384A1 (en) Feature interpretation method and device for gbdt model
JP2015506026A (en) Image classification
WO2021088365A1 (en) Method and apparatus for determining neural network
CN113128478B (en) Model training method, pedestrian analysis method, device, equipment and storage medium
CN112446888A (en) Processing method and processing device for image segmentation model
WO2021030899A1 (en) Automated image retrieval with graph neural network
CN116580257A (en) Feature fusion model training and sample retrieval method and device and computer equipment
Jiang et al. Consensus style centralizing auto-encoder for weak style classification
CN111639230B (en) Similar video screening method, device, equipment and storage medium
CN116644804A (en) Distributed training system, neural network model training method, device and medium
CN114783021A (en) Intelligent detection method, device, equipment and medium for wearing of mask
CN110135428A (en) Image segmentation processing method and device
CN113744280A (en) Image processing method, apparatus, device and medium
CN111027551B (en) Image processing method, apparatus and medium
JP6991960B2 (en) Image recognition device, image recognition method and program
CN108460453B (en) Data processing method, device and system for CTC training
KR101953479B1 (en) Group search optimization data clustering method and system using the relative ratio of distance
CN111723247A (en) Graph-based hypothetical computation
CN116228484B (en) Course combination method and device based on quantum clustering algorithm

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14908103

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14908103

Country of ref document: EP

Kind code of ref document: A1