WO2016095068A1 - Pedestrian detection apparatus and method - Google Patents

Pedestrian detection apparatus and method Download PDF

Info

Publication number
WO2016095068A1
Authority
WO
WIPO (PCT)
Prior art keywords
pedestrian
attributes
patches
pedestrian detection
data sources
Prior art date
2014-12-15
Application number
PCT/CN2014/001125
Other languages
French (fr)
Inventor
Xiaoou Tang
Yonglong TIAN
Ping Luo
Xiaogang Wang
Original Assignee
Xiaoou Tang
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2014-12-15
Publication date
Application filed by Xiaoou Tang
Priority to CN201480083931.0A (granted as CN107003834B)
Priority to PCT/CN2014/001125
Publication of WO2016095068A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Human Computer Interaction (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Disclosed is a pedestrian detection apparatus. The pedestrian detection apparatus may comprise an extracting device configured to extract a plurality of training patches from predetermined data sources comprising a pedestrian data source and background scene data sources; an assigning device configured to assign preset attributes to the training patches based on types of the plurality of training patches; a training device configured to train a neural network based on the training patches with the assigned preset attributes to generate a multi-task classification model; and a determining device configured to determine one or more pedestrians in an input image based on the generated multi-task classification model. A pedestrian detection method is also disclosed.

Description

PEDESTRIAN DETECTION APPARATUS AND METHOD
Technical Field
The present application generally relates to the field of image processing and, more particularly, to a pedestrian detection apparatus and a pedestrian detection method.
Background
With the rapid evolution and popularization of high-performance mobile and wearable devices in recent years, pedestrian detection has gained increasing attention for its wide range of potential applications. Pedestrian detection is challenging because of the large variations of, and the confusion between, human bodies and background scenes.
Current methods for pedestrian detection are generally grouped into two categories: models based on handcrafted features, and deep models. In the first category, conventional methods extract patches from images to train boosted classifiers. Although such methods are robust to certain pose changes, the feature representations and the classifiers cannot be jointly optimized to improve performance, and they are not able to capture large variations. In the second category, deep neural networks achieve promising results owing to their capacity to learn mid-level representations. However, previous deep models treat pedestrian detection as a single binary classification task; they mainly learn mid-level features and are not able to capture rich pedestrian variations.
Summary
According to an embodiment of the present application, disclosed is a pedestrian detection method. The method may comprise: extracting a plurality of training patches from predetermined data sources comprising a pedestrian data source and background scene data sources; assigning preset attributes to the training patches based on types of the training patches; training a neural network based on the training patches with the assigned preset attributes to generate a multi-task classification model; and determining one or more pedestrians in an input image based on the generated multi-task classification model.
According to an embodiment of the present application, disclosed is a pedestrian detection apparatus. The pedestrian detection apparatus may comprise an extracting device configured to extract a plurality of training patches from predetermined data sources comprising a pedestrian data source and background scene data sources; an assigning device configured to assign preset attributes to the training patches based on types of the plurality of training patches; a training device configured to train a neural network based on the training patches with the assigned preset attributes to generate a multi-task classification model; and a determining device configured to determine one or more pedestrians in an input image based on the generated multi-task classification model.
Brief Description of the Drawing
Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 is a schematic diagram illustrating a pedestrian detection apparatus consistent with some embodiments of the present application.
Fig. 2 is a schematic diagram illustrating a pedestrian detection apparatus when it is implemented in software, consistent with some disclosed embodiments.
Fig. 3 is a schematic diagram illustrating an extracting device consistent with some disclosed embodiments.
Fig. 4 is a schematic diagram illustrating a Task-Assistant Convolutional Neural Network consistent with some disclosed embodiments.
Fig. 5 is a schematic flowchart illustrating a pedestrian detection method consistent with some disclosed embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When appropriate, the same reference numbers are used throughout the drawings to refer to the same or like parts. Fig. 1 is a schematic diagram illustrating an exemplary pedestrian detection apparatus 1000 consistent with some disclosed embodiments.
Referring to Fig. 1, where the pedestrian detection apparatus 1000 is implemented in hardware, it may comprise an extracting device 100, an assigning device 200, a training device 300 and a determining device 400.
It shall be appreciated that the apparatus 1000 may be implemented using certain hardware, software, or a combination thereof. In addition, the embodiments of the present invention may be embodied in a computer program product on one or more computer readable storage media (comprising but not limited to disk storage, CD-ROM, optical memory and the like) containing computer program codes. Fig. 2 is a schematic diagram illustrating the pedestrian detection apparatus 1000 when it is implemented in software, consistent with some disclosed embodiments.
In the case that the pedestrian detection apparatus 1000 is implemented in software, the apparatus 1000 may include a general purpose computer, a computer cluster, a mainstream computer, a computing device dedicated to providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion. As shown in Fig. 2, the apparatus 1000 may include one or more processors (processors 102, 104, 106, etc.), a memory 112, a storage device 116, and a bus to facilitate information exchange among the various devices of the apparatus 1000. Processors 102-106 may include a central processing unit ("CPU"), a graphics processing unit ("GPU"), or other suitable information processing devices. Depending on the type of hardware being used, processors 102-106 can include one or more printed circuit boards, and/or one or more microprocessor chips. Processors 102-106 can execute sequences of computer program instructions to perform various methods that will be explained in greater detail below.
Memory 112 can include, among other things, a random access memory ("RAM") and a read-only memory ("ROM"). Computer program instructions can be stored, accessed, and read from memory 112 for execution by one or more of processors 102-106. For example, memory 112 may store one or more software applications. Further, memory 112 may store an entire software application or only a part of a software application that is executable by one or more of processors 102-106. It is noted that although only one block is shown in Fig. 2, memory 112 may include multiple physical devices installed on a central computing device or on different computing devices.
In the embodiment shown in Fig. 1, the extracting device 100 may be configured to extract a plurality of training patches from predetermined data sources comprising a pedestrian data source and background scene data sources. The pedestrian data source (abbreviated as the P data source) and the background scene data sources (each abbreviated as a B data source) may be existing data sources, for example background scene segmentation databases or the like.
In an embodiment, as shown in Fig. 3, the extracting device 100 may comprise a selector 110, a generating module 120 and an extracting module 130. The selector 110 may be configured to select a plurality of training images from the P data source and the B data sources, respectively.
The generating module 120 may be configured to generate candidate patches from the selected training images. For example, a region proposal method is employed to generate candidate patches from the training images. In an embodiment, the generating module 120 may be configured to determine whether each of the candidate patches is positive or negative based on a type of the candidate patch.
Since the number of negative patches is significantly larger than the number of positive patches in the P data source, the extracting module 130 may be configured to extract positive and negative patches from the generated candidate patches in the P data source and negative patches from the generated candidate patches in the B data sources.
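As an illustration only, this extraction step could be organized as in the following Python sketch. The region proposal routine, the IoU computation and the overlap thresholds are assumptions introduced here for clarity and are not specified by the present disclosure.

```python
# Hypothetical sketch of the patch extraction performed by the extracting
# module 130. propose_regions() and iou() are placeholder helpers.

def label_candidates(image, pedestrian_boxes, propose_regions, iou,
                     pos_thresh=0.5, neg_thresh=0.3):
    """Split region proposals into positive and negative patches.

    A candidate is positive when it overlaps a ground-truth pedestrian box
    strongly enough, and negative when it barely overlaps any of them. For a
    B (background scene) data source, pedestrian_boxes is empty, so every
    candidate becomes a negative patch.
    """
    positives, negatives = [], []
    for box in propose_regions(image):
        best = max((iou(box, gt) for gt in pedestrian_boxes), default=0.0)
        if best >= pos_thresh:
            positives.append(box)
        elif best <= neg_thresh:
            negatives.append(box)
        # candidates with intermediate overlap are ambiguous and discarded
    return positives, negatives
```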
Referring back to Fig. 1, the assigning device 200 may be configured to assign preset attributes to the training patches based on the types of the plurality of training patches. For example, there are three types of training patches, i.e., positive, hard negative and negative patches. As mentioned above, the extracting module 130 extracts positive and negative patches from the P data source and extracts negative patches from the B data sources. To this end, the assigning device 200 may be further configured to assign pedestrian attributes of the preset attributes to the positive patches from the P data source, and to assign scene attributes of the preset attributes to the negative patches from the B data sources. In an embodiment, the pedestrian attributes may be manually labeled on the positive patches from the P data source.
In an embodiment, the pedestrian attributes may comprise, for example, backpack, dark-trousers, hat, bag, gender, occlusion, riding, viewpoint, white-clothes and the like. The scene attributes may comprise, for example, sky, tree, building, road, traffic light, horizontal, vertical, vehicle and the like.
In an embodiment, as different B data sources may have different data distributions, the scene attributes comprise, to reduce these discrepancies, shared attributes, which are included in all the B data sources, and unshared attributes, each of which is included in only one of the B data sources. The former enable the learning of a shared representation across the B data sources, while the latter enhance the diversity of the attributes. A sketch of how such labels could be assembled is given below.
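To make the assignment concrete, the four groups of labels described above might be assembled as follows. The particular attribute names are examples taken from the description, while the ordering and binary encoding are assumptions of this sketch.

```python
# Hypothetical label assembly for one training patch (assigning device 200).

PED_ATTRS = ["backpack", "dark-trousers", "hat", "bag", "gender",
             "occlusion", "riding", "viewpoint", "white-clothes"]
SHARED_SCENE_ATTRS = ["sky", "tree", "building", "road", "vehicle"]
UNSHARED_SCENE_ATTRS = ["traffic-light", "horizontal", "vertical"]

def make_label(y, ped=(), shared=(), unshared=()):
    """Return (y, o_p, o_s, o_u) as binary label vectors.

    Positive patches from the P data source carry pedestrian attributes;
    negative patches from a B data source carry scene attributes only.
    """
    o_p = [1 if a in ped else 0 for a in PED_ATTRS]
    o_s = [1 if a in shared else 0 for a in SHARED_SCENE_ATTRS]
    o_u = [1 if a in unshared else 0 for a in UNSHARED_SCENE_ATTRS]
    return y, o_p, o_s, o_u

# A positive pedestrian patch:
label_p = make_label(1, ped={"backpack", "hat"})
# A negative patch from a background scene data source:
label_b = make_label(0, shared={"sky", "road"}, unshared={"horizontal"})
```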
The training device 300 in Fig. 1 may be configured to train a Task-Assistant Convolutional Neural Network (TA-CNN) based on the training patches with the assigned attributes to generate a multi-task classification model.
The determining device 400 in Fig. 1 may be configured to determine one or more pedestrians in an input image based on the generated multi-task classification model. Specifically, the determining device is further configured to obtain pedestrian detection results comprising a pedestrian detection score, pedestrian attributes and scene attributes based on the generated multi-task classification model, and to determine one or more pedestrians in an input image based on the obtained pedestrian detection results.
Fig. 4 is a schematic diagram illustrating a TA-CNN consistent with some disclosed embodiments. As shown in Fig. 4, the TA-CNN comprises a plurality of convolutional layers, at least one max-pooling layer and at least one fully-connected layer, wherein each of the max-pooling layers is followed by a convolutional layer. Hereinafter, an exemplary training process for the TA-CNN as mentioned above will be further discussed in detail. It is understood that the invention is not limited thereto.
In an embodiment, a training set D is constructed by combining the training patches extracted from the P and B data sources. The training set is formulated as

D = \{(x_n, y_n, o_n^p, o_n^s, o_n^u)\}_{n=1}^{N},

where each (y_n, o_n^p, o_n^s, o_n^u) is a four-tuple. Specifically, y_n denotes a binary label indicating whether an image patch is pedestrian or not, while o_n^p, o_n^s and o_n^u are three sets of binary labels, representing the pedestrian, shared scene, and unshared scene attributes, respectively. As shown in Fig. 4, the TA-CNN takes the image patch x_n as input and predicts y_n by stacking four convolutional layers (conv1 to conv4), four max-pooling layers, and two fully-connected layers (fc5 and fc6). The TA-CNN may be iteratively trained for multi-task learning based on the training patches with the assigned attributes until the multi-task classification model converges.
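A minimal PyTorch sketch of this topology is given below. The channel widths, kernel sizes, the 96x48 input resolution and the way the structure projection vector z enters fc5 (following the formula for h^{(L)} given later in the description) are assumptions, as the present disclosure does not fix these hyper-parameters.

```python
# Minimal sketch of the TA-CNN topology: conv1-conv4, four max-pooling
# layers, and fc5/fc6. All sizes below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TACNN(nn.Module):
    def __init__(self, spv_dim=18, n_outputs=1 + 9 + 5 + 3):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(3, 64, 3, padding=1),
            nn.Conv2d(64, 128, 3, padding=1),
            nn.Conv2d(128, 256, 3, padding=1),
            nn.Conv2d(256, 256, 3, padding=1),
        ])
        self.pool = nn.MaxPool2d(2)              # one pooling per conv layer
        self.fc5 = nn.Linear(256 * 6 * 3, 512)   # assumes 96x48 input patches
        self.proj = nn.Linear(spv_dim, 512)      # feeds the SPV z into fc5
        self.fc6 = nn.Linear(512, n_outputs)     # y plus all attribute tasks

    def forward(self, x, z):
        for conv in self.convs:
            x = self.pool(F.relu(conv(x)))       # cf. Eqns. (1) and (2) below
        h = x.flatten(1)
        h = F.relu(self.fc5(h) + self.proj(z))   # h = relu(Wh + b + W_z z + b_z)
        return torch.sigmoid(self.fc6(h))        # per-task probabilities
```

Here the 18 outputs stand for the pedestrian label plus the 9 pedestrian, 5 shared scene and 3 unshared scene attributes used in the earlier example lists; the real counts depend on the chosen attribute sets.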
In an embodiment, each hidden layer of the TA-CNN from conv1 to conv4 is computed recursively by convolution and max-pooling, which are formulated as

h_n^{v(l)} = \mathrm{relu}\Big(b^{v(l)} + \sum_{u} k^{vu(l)} * h_n^{u(l-1)}\Big),  (1)

h_{n,(i,j)}^{v(l)} = \max_{(p,q) \in \Omega_{(i,j)}} \big\{ h_{n,(p,q)}^{v(l)} \big\}.  (2)

In Eqn. (1), relu(x) = max(0, x) is the rectified linear function and * denotes the convolution operator applied on every pixel of the feature map h_n^{u(l-1)}, where h_n^{u(l-1)} and h_n^{v(l)} stand for the u-th input channel at the (l-1)-th layer and the v-th output channel at the l-th layer, respectively; k^{vu(l)} and b^{v(l)} denote the filters and the bias. In Eqn. (2), the feature map h_n^{v(l)} is partitioned into a grid with overlapping cells, each of which is denoted as Ω_(i,j), where (i,j) indicates the cell index. The max-pooling compares the values at each location (p,q) of a cell and outputs the maximum value of each cell.
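Purely to make Eqns. (1) and (2) concrete, a naive NumPy rendering for a single input/output channel pair follows; it trades efficiency for readability and is illustrative only.

```python
# Naive rendering of Eqn. (1) (convolution + relu) and Eqn. (2)
# (max-pooling over cells) for one channel pair, for illustration only.
import numpy as np

def relu(x):
    return np.maximum(0, x)

def conv_relu(h_prev, k, b):
    """h = relu(b + k * h_prev) for one input/output channel pair."""
    H, W = h_prev.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for p in range(out.shape[0]):
        for q in range(out.shape[1]):
            out[p, q] = np.sum(k * h_prev[p:p + kh, q:q + kw])
    return relu(out + b)

def max_pool(h, cell=3, stride=2):
    """Output the maximum over each (possibly overlapping) cell Omega_(i,j)."""
    H, W = h.shape
    rows = (H - cell) // stride + 1
    cols = (W - cell) // stride + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = h[i * stride:i * stride + cell,
                          j * stride:j * stride + cell].max()
    return out
```

With cell=3 and stride=2 the pooling cells overlap, matching the overlapping cells Ω_(i,j) described above.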
Each hidden layer in fc5 and fc6 is obtained by

h_n^{(l)} = \mathrm{relu}\big(W^{(l)} h_n^{(l-1)} + b^{(l)}\big),  (3)

where the higher-level representation is transformed from the lower level by a non-linear mapping; W^{(l)} and b^{(l)} are the weight matrix and the bias vector at the l-th layer.
The TA-CNN can be formulated as minimizing the log posterior probability with respect to a set of network parameters W:

W^* = \arg\min_{W} E, \quad E = -\sum_{n=1}^{N} \log p(y_n, o_n^p, o_n^s, o_n^u \mid x_n; W),  (4)

where E is a complete loss function regarding the entire training set. Here, the shared attributes o^s in Eqn. (4) are crucial to learn a shared representation across the multiple B scene data sources.
For clarity, suppose only the unshared scene attributes o^u are kept in the loss function, which then becomes

E = -\sum_{n=1}^{N} \log p(o_n^u \mid x_n^a),

where x^a denotes a sample of B_a. A shared representation can be learned if and only if all the samples share at least one attribute. Since the samples are independent, the loss function can be expanded as

E = -\sum_{i=1}^{I} \log p(o_i^{u_1} \mid x_i^{a_1}) - \sum_{j=1}^{J} \log p(o_j^{u_2} \mid x_j^{a_2}) - \sum_{k=1}^{K} \log p(o_k^{u_3} \mid x_k^{a_3}),

where I + J + K = N, implying that each data source is only used to optimize its corresponding unshared attributes, although all the data sources and attributes are trained in a single TA-CNN.
The above formulation is not sufficient to learn shared features among the data sources, especially when the data have large differences. To bridge the multiple B data sources, the shared attributes o^s are introduced, and the loss function develops into

E = -\sum_{n=1}^{N} \log p(o_n^s, o_n^u \mid x_n^a),

such that the TA-CNN can learn a shared representation across the B data sources, because the samples share the common targets o^s.
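One way to realize this multi-source objective in code is to mask out, for each sample, the unshared-attribute terms belonging to the other data sources, so that a single network is optimized while each B data source only supervises its own unshared attributes. The sketch below assumes per-task sigmoid outputs and a mask built from each sample's source identifier; both are implementation choices, not requirements of the disclosure.

```python
# Hypothetical masked multi-source loss: shared scene attributes are
# supervised for every B sample, while each unshared attribute group is
# supervised only by its own data source (cf. I + J + K = N above).
import torch
import torch.nn.functional as F

def masked_bce(pred, target, mask):
    """Binary cross-entropy counting only the entries enabled by mask."""
    loss = F.binary_cross_entropy(pred, target, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1)

def scene_loss(pred_s, tgt_s, pred_u, tgt_u, source_id, groups):
    """pred_s/tgt_s: shared attributes; pred_u/tgt_u: all unshared attributes.

    groups maps a source id to the column indices of that source's own
    unshared attributes; the other columns are masked out for the sample.
    """
    mask_u = torch.zeros_like(pred_u)
    for n, sid in enumerate(source_id):
        mask_u[n, groups[int(sid)]] = 1.0
    shared = masked_bce(pred_s, tgt_s, torch.ones_like(pred_s))
    unshared = masked_bce(pred_u, tgt_u, mask_u)
    return shared + unshared
```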
To reduce the gaps between the P and B data sources, a structure projection vector z_n is computed for each sample x_n, and Eqn. (4) turns into

E = -\sum_{n=1}^{N} \log p(y_n, o_n^p, o_n^s, o_n^u \mid x_n, z_n; W).  (5)

For example, the first term of the above decomposition can be written as

-\sum_{i=1}^{I} \log p(o_i^s, o_i^{u_1} \mid x_i^{a_1}, \tilde{z}_i^{a_1}),

where \tilde{z}^{a_1} is attained by projecting the corresponding x^{a_1} in B_{a_1} onto the feature space of P. Here, \tilde{z} is used to bridge the multiple data sources, because samples from different data sources are projected to a common space of P. The TA-CNN accordingly adopts a pair of data (x_n, z_n) as input. All the remaining terms can be derived in a similar way.
In an embodiment, the structure projection vector (SPV) for each sample is calculated by organizing the positive and negative data of P into two tree structures, respectively, so as to close the gaps between the P and B data sources. Each tree has a depth of three and partitions the data top-down, where each child node groups the data of its parent node into clusters. Then, the SPV of each sample is obtained by concatenating its distances to the means of the leaf nodes. Specifically, at each parent node, an HOG feature is extracted for each sample and k-means is applied to group the data.
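Under these assumptions, the SPV computation might look as follows; hog() is a placeholder for any HOG extractor, and the cluster counts per node are illustrative choices.

```python
# Hypothetical structure projection vector (SPV) computation. A depth-3
# tree (root -> k1 child nodes -> k1*k2 leaf nodes) is built separately
# over the positive and the negative data of the P data source.
import numpy as np
from sklearn.cluster import KMeans

def build_tree(feats, k1=3, k2=3):
    """Return the means of the leaf nodes of a depth-3 clustering tree."""
    leaves = []
    top = KMeans(n_clusters=k1, n_init=10).fit(feats)
    for c in range(k1):
        members = feats[top.labels_ == c]
        sub = KMeans(n_clusters=min(k2, len(members)), n_init=10).fit(members)
        leaves.extend(sub.cluster_centers_)
    return np.array(leaves)

def spv(sample_feat, pos_leaves, neg_leaves):
    """Concatenate the distances from a sample to every leaf mean of the
    positive tree and of the negative tree."""
    d_pos = np.linalg.norm(pos_leaves - sample_feat, axis=1)
    d_neg = np.linalg.norm(neg_leaves - sample_feat, axis=1)
    return np.concatenate([d_pos, d_neg])

# pos_leaves = build_tree(hog(positive_patches))  # hog() is a placeholder
# neg_leaves = build_tree(hog(negative_patches))
# z_n = spv(hog(x_n), pos_leaves, neg_leaves)     # one SPV per sample
```

With k1 = k2 = 3 this yields 2 x 9 = 18 distances per sample, which is why the network sketch earlier assumed spv_dim = 18.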
Hereinafter, an exemplary learning process for the TA-CNN will be further discussed in detail. To learn the network parameters W, Eqn. (5) is reformulated as

E = -\log p(y \mid x, z) - \sum_{i} \alpha_i \log p(o_i^p \mid x, z) - \sum_{j} \beta_j \log p(o_j^s \mid x, z) - \sum_{k} \gamma_k \log p(o_k^u \mid x, z),  (6)

where the main task is to predict the pedestrian label y, and the attribute estimations, i.e. o_i^p, o_j^s and o_k^u, are auxiliary semantic tasks; α, β and γ denote the importance coefficients that associate the multiple tasks. Here, p(y|x, z), p(o_i^p|x, z), p(o_j^s|x, z) and p(o_k^u|x, z) are modeled by softmax functions, for example,

p(y = m \mid x, z) = \frac{\exp(W_m h^{(L)})}{\sum_{m'} \exp(W_{m'} h^{(L)})},

where h^{(L)} and W indicate the top-layer feature vector and the parameter matrix of the main task y, respectively (W_m denoting the m-th row of W), and h^{(L)} is obtained by h^{(L)} = \mathrm{relu}(W^{(L)} h^{(L-1)} + b^{(L)} + W^{z} z + b^{z}).
Learning the multiple tasks in Eqn. (6) is cast as optimizing a single weighted multivariate cross-entropy loss, which can not only learn a compact weight matrix but also iteratively estimate the importance coefficients:

E = -\mathbf{y}^{T} \mathrm{diag}(\lambda) \log \hat{\mathbf{y}} - (\mathbf{1} - \mathbf{y})^{T} \mathrm{diag}(\lambda) \log(\mathbf{1} - \hat{\mathbf{y}}),  (7)

where λ denotes a vector of importance coefficients and diag(·) represents a diagonal matrix. Here, y = (y, o^p, o^s, o^u) is a vector of binary labels, concatenating the pedestrian label and all the attribute labels, and ŷ denotes the corresponding vector of predicted probabilities. The optimization of Eqn. (7) iterates between two steps: updating the network parameters with the importance coefficients fixed, and updating the coefficients with the network parameters fixed. Typically, the first step may be run for a sufficient number of iterations to reach a local minimum, and then the second step is performed to update the coefficients. According to an embodiment, the network parameters may be updated by using stochastic gradient descent and back-propagation (BP) to minimize Eqn. (7), where the error of the output layer is propagated top-down to update the filters or weights at each layer. In addition, the importance coefficients may be updated, with the network parameters fixed, by minimizing the posterior probability with respect to λ.
Since the methods for learning network parameters and importance coefficients are similar to the previous methods, the description thereof is omitted here for clarity.
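The alternating scheme could be organized as in the following sketch. The coefficient update is deliberately left schematic (a pass-through placeholder by default), since, as noted above, it follows previous methods; the inner step count and learning rate are likewise assumptions.

```python
# Hypothetical alternating optimization of Eqn. (7): run SGD/BP on the
# network parameters with lambda fixed, then update lambda with the
# network fixed. update_lambda defaults to a pass-through placeholder.
import torch

def weighted_ce(pred, target, lam):
    """Eqn. (7): weighted multivariate cross-entropy over all tasks."""
    eps = 1e-7
    pred = pred.clamp(eps, 1 - eps)
    per_task = -(target * pred.log() + (1 - target) * (1 - pred).log())
    return (per_task * lam).sum(dim=1).mean()

def train(model, loader, lam, update_lambda=lambda model, loader, lam: lam,
          epochs=10, inner_steps=1000, lr=0.01):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        # Step 1: update the network parameters, coefficients fixed.
        for step, (x, z, y) in enumerate(loader):
            opt.zero_grad()
            loss = weighted_ce(model(x, z), y, lam)
            loss.backward()                       # back-propagation
            opt.step()
            if step >= inner_steps:
                break
        # Step 2: update the coefficients, network parameters fixed
        # (schematic stand-in for the posterior-based update in the text).
        with torch.no_grad():
            lam = update_lambda(model, loader, lam)
    return model, lam
```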
The multi-task classification model may be configured to learn a detection task, a pedestrian attribute task and a scene attribute task. In an embodiment, the pedestrian detection results comprise a pedestrian detection score, pedestrian attributes and scene attributes.
Fig. 5 is a schematic flowchart illustrating a pedestrian detection method 2000 consistent with some disclosed embodiments. Hereafter, the method 2000 is described in detail with reference to Fig. 5.
At step S201, a plurality of training patches are extracted from predetermined data sources comprising a pedestrian data source and background scene data sources. According to an embodiment, during the extracting process, a plurality of training images are first selected from the pedestrian data source and the background scene data sources. For example, a region proposal method is employed to generate candidate patches from the training images. Whether each of the candidate patches is positive or negative is determined based on a type of the candidate patch. Positive and negative patches are extracted from the generated candidate patches in the P data source, and negative patches are extracted from the generated candidate patches in the B data sources.
At step S202, preset attributes are assigned to the training patches based on the types of the training patches. For example, the preset attributes may comprise pedestrian attributes, shared scene attributes and unshared scene attributes. In an embodiment, the pedestrian attributes are assigned to the positive patches from the P data source, and the shared and unshared scene attributes are assigned to the negative patches from the B data sources.
At step S203, a Task-Assistant Convolutional Neural Network (TA-CNN) is trained based on the training patches with the assigned attributes to generate a multi-task classification model. In an embodiment, the TA-CNN comprises a plurality of convolutional layers, at least one max-pooling layer and at least one fully-connected layer, wherein each of the max-pooling layers is followed by a corresponding convolutional layer of the convolutional layers.
At step S204, one or more pedestrians in an input image may be determined based on the generated multi-task classification model. Furthermore, pedestrian detection results are obtained based on the generated multi-task classification model. For example, the pedestrian detection results may comprise a pedestrian detection score, pedestrian attributes and scene attributes. One or more pedestrians in the input image are then determined based on the obtained pedestrian detection results.
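At test time, the detection branch of the trained model might be used as in the sketch below; the score threshold is a conventional choice and any non-maximum suppression over overlapping windows is omitted, as the present disclosure does not prescribe these details.

```python
# Hypothetical inference sketch (determining device 400): score each
# candidate window with the trained multi-task model and keep the windows
# whose pedestrian detection score clears a threshold.
import torch

def detect(model, patches, spvs, score_thresh=0.5):
    """patches: (N, 3, H, W) candidate windows; spvs: (N, spv_dim) SPVs.

    Returns the indices of windows judged to contain a pedestrian, their
    detection scores, and the attribute predictions that the model outputs
    alongside the scores.
    """
    model.eval()
    with torch.no_grad():
        out = model(patches, spvs)       # (N, 1 + number of attribute tasks)
    scores = out[:, 0]                   # pedestrian detection score
    attrs = out[:, 1:]                   # pedestrian and scene attributes
    keep = (scores >= score_thresh).nonzero(as_tuple=True)[0]
    return keep, scores[keep], attrs[keep]
```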
The multi-task classification model may be configured to learn a detection task, a pedestrian attribute task and a scene attribute task. The pedestrian detection results may comprise a pedestrian detection score, pedestrian attributes and scene attributes.
With the pedestrian detection apparatus and method of the present application, the number of hard negatives can be significantly decreased. Furthermore, the pedestrian attributes and the background scene attributes can be predicted simultaneously. By training the multiple tasks from multiple sources using a single TA-CNN, the visual gaps between multiple data sources can be bridged and, meanwhile, the attribute diversity can be enhanced. The scene attributes can be transferred from existing scene data sources without manual annotation.
Although the preferred examples of the present invention have been described, those skilled in the art can make variations or modifications to these examples upon knowing the basic inventive concept. The appended claims are intended to be construed as comprising the preferred examples and all the variations or modifications falling within the scope of the present invention.
Obviously, those skilled in the art can make variations or modifications to the present invention without departing from the spirit and scope of the present invention. As such, if these variations or modifications belong to the scope of the claims and the equivalent technique, they may also fall within the scope of the present invention.

Claims (14)

  1. A pedestrian detection method, comprising:
    extracting a plurality of training patches from predetermined data sources comprising a pedestrian data source and background scene data sources;
    assigning preset attributes to the training patches based on types of the training patches;
    training a neural network based on the training patches with the assigned preset attributes to generate a multi-task classification model; and
    determining one or more pedestrians in an input image based on the generated multi-task classification model.
  2. The pedestrian detection method according to claim 1, wherein the extracting comprises:
    selecting a plurality of training images from the pedestrian data source and the background scene data sources;
    generating candidate patches from the selected training images;
    extracting positive and negative patches from the generated candidate patches in the pedestrian data source, and
    extracting negative patches from the generated candidate patches in the background scene data sources.
  3. The pedestrian detection method according to claim 1, wherein the assigning preset attributes comprises:
    assigning pedestrian attributes of the preset attributes to the positive patches from the pedestrian data source; and
    assigning scene attributes of the preset attributes to the negative patches from the background scene data sources.
  4. The pedestrian detection method according to claim 3, wherein the scene attributes comprise shared attributes which are included in all the background scene data sources and unshared attributes which are included in one of the background scene data sources.
  5. The pedestrian detection method according to claim 1, wherein the neural network comprises a plurality of convolutional layers, at least one max-pooling layer and at least one fully-connected layer, and wherein each of the max-pooling layers is followed by a corresponding convolutional layer of the convolutional layers.
  6. The pedestrian detection method according to claim 1, wherein the multi-task classification model is configured to learn a detection task, a pedestrian attribute task and a scene attribute task.
  7. The pedestrian detection method according to claim 1, wherein the determining comprises:
    obtaining pedestrian detection results comprising a pedestrian detection score, pedestrian attributes and scene attributes based on the generated multi-task classification model, and
    determining one or more pedestrians in an input image based on the obtained pedestrian detection results.
  8. A pedestrian detection apparatus, comprising:
    an extracting device configured to extract a plurality of training patches from predetermined data sources comprising a pedestrian data source and background scene data sources;
    an assigning device configured to assign preset attributes to the training patches based on types of the plurality of training patches;
    a training device configured to train a neural network based on the training patches with the assigned preset attributes to generate a multi-task classification model; and
    a determining device configured to determine one or more pedestrians in an input image based on the generated multi-task classification model.
  9. The pedestrian detection apparatus according to claim 8, wherein the extracting device comprises:
    a selector configured to select a plurality of training images from the pedestrian data source and the background scene data sources, respectively;
    a generating module configured to generate candidate patches from the selected training images; and
    an extracting module configured to extract positive and negative patches from the generated candidate patches in the pedestrian data source, and to extract negative patches from the generated candidate patches in the background scene data sources.
  10. The pedestrian detection apparatus according to claim 8, wherein the assigning device is further configured to:
    assign pedestrian attributes of the preset attributes to the positive patches from the pedestrian data source; and
    assign scene attributes of the preset attributes to the negative patches from the background scene data sources.
  11. The pedestrian detection apparatus according to claim 10, wherein the scene attributes comprise shared attributes which are included in all the background scene data sources and unshared attributes which are included in one of the background scene data sources.
  12. The pedestrian detection apparatus according to claim 8, wherein the neural network comprises a plurality of convolutional layers, at least one max-pooling layer and at least one fully-connected layer, and wherein each of the max-pooling layers is followed by a corresponding convolutional layer of the convolutional layers.
  13. The pedestrian detection apparatus according to claim 8, wherein the multi-task classification model is configured to learn a detection task, a pedestrian attribute task and a scene attribute task.
  14.  The pedestrian detection apparatus according to claim 8, wherein the determining device is further configured to obtain pedestrian detection results comprising a pedestrian detection score, pedestrian attributes and scene attributes based on the generated multi-task classification model, and to determine one or more pedestrians in an input image based on the obtained pedestrian detection results.
PCT/CN2014/001125 2014-12-15 2014-12-15 Pedestrian detection apparatus and method WO2016095068A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201480083931.0A CN107003834B (en) 2014-12-15 2014-12-15 Pedestrian detection device and method
PCT/CN2014/001125 WO2016095068A1 (en) 2014-12-15 2014-12-15 Pedestrian detection apparatus and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/001125 WO2016095068A1 (en) 2014-12-15 2014-12-15 Pedestrian detection apparatus and method

Publications (1)

Publication Number Publication Date
WO2016095068A1 (en) 2016-06-23

Family

ID=56125540

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/001125 WO2016095068A1 (en) 2014-12-15 2014-12-15 Pedestrian detection apparatus and method

Country Status (2)

Country Link
CN (1) CN107003834B (en)
WO (1) WO2016095068A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018065158A1 (en) * 2016-10-06 2018-04-12 Siemens Aktiengesellschaft Computer device for training a deep neural network
CN110580487A (en) * 2018-06-08 2019-12-17 Oppo广东移动通信有限公司 Neural network training method, neural network construction method, image processing method and device
CN111191675A (en) * 2019-12-03 2020-05-22 深圳市华尊科技股份有限公司 Pedestrian attribute recognition model implementation method and related device

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111052126A (en) * 2017-09-04 2020-04-21 华为技术有限公司 Pedestrian attribute identification and positioning method and convolutional neural network system
CN108875536A (en) * 2018-02-06 2018-11-23 北京迈格威科技有限公司 Pedestrian's analysis method, device, system and storage medium
CN111626087A (en) * 2019-02-28 2020-09-04 北京市商汤科技开发有限公司 Neural network training and eye opening and closing state detection method, device and equipment
CN112149665A (en) * 2020-09-04 2020-12-29 浙江工业大学 High-performance multi-scale target detection method based on deep learning
CN113807650A (en) * 2021-08-04 2021-12-17 北京房江湖科技有限公司 House resource owner interview management method, system, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102842045A (en) * 2012-08-03 2012-12-26 华侨大学 Pedestrian detection method based on combined features
CN103902968A (en) * 2014-02-26 2014-07-02 中国人民解放军国防科学技术大学 Pedestrian detection model training method based on AdaBoost classifier

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8015132B2 (en) * 2008-05-16 2011-09-06 Samsung Electronics Co., Ltd. System and method for object detection and classification with multiple threshold adaptive boosting
CN104063719B (en) * 2014-06-27 2018-01-26 深圳市赛为智能股份有限公司 Pedestrian detection method and device based on depth convolutional network
CN104091178A (en) * 2014-07-01 2014-10-08 四川长虹电器股份有限公司 Method for training human body sensing classifier based on HOG features
CN104166861B (en) * 2014-08-11 2017-09-29 成都六活科技有限责任公司 A kind of pedestrian detection method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102842045A (en) * 2012-08-03 2012-12-26 华侨大学 Pedestrian detection method based on combined features
CN103902968A (en) * 2014-02-26 2014-07-02 中国人民解放军国防科学技术大学 Pedestrian detection model training method based on AdaBoost classifier

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018065158A1 (en) * 2016-10-06 2018-04-12 Siemens Aktiengesellschaft Computer device for training a deep neural network
CN110580487A (en) * 2018-06-08 2019-12-17 Oppo广东移动通信有限公司 Neural network training method, neural network construction method, image processing method and device
CN111191675A (en) * 2019-12-03 2020-05-22 深圳市华尊科技股份有限公司 Pedestrian attribute recognition model implementation method and related device
CN111191675B (en) * 2019-12-03 2023-10-24 深圳市华尊科技股份有限公司 Pedestrian attribute identification model realization method and related device

Also Published As

Publication number Publication date
CN107003834B (en) 2018-07-06
CN107003834A (en) 2017-08-01

Similar Documents

Publication Publication Date Title
WO2016095068A1 (en) Pedestrian detection apparatus and method
CN110363210B (en) Training method and server for image semantic segmentation model
US11741361B2 (en) Machine learning-based network model building method and apparatus
CN112052787B (en) Target detection method and device based on artificial intelligence and electronic equipment
Shao et al. Online multi-view clustering with incomplete views
US20220335711A1 (en) Method for generating pre-trained model, electronic device and storage medium
WO2019223384A1 (en) Feature interpretation method and device for gbdt model
JP2015506026A (en) Image classification
WO2021088365A1 (en) Method and apparatus for determining neural network
CN113128478B (en) Model training method, pedestrian analysis method, device, equipment and storage medium
CN112446888A (en) Processing method and processing device for image segmentation model
WO2021030899A1 (en) Automated image retrieval with graph neural network
CN116580257A (en) Feature fusion model training and sample retrieval method and device and computer equipment
Jiang et al. Consensus style centralizing auto-encoder for weak style classification
CN111639230B (en) Similar video screening method, device, equipment and storage medium
CN116644804A (en) Distributed training system, neural network model training method, device and medium
CN114783021A (en) Intelligent detection method, device, equipment and medium for wearing of mask
CN110135428A (en) Image segmentation processing method and device
CN113744280A (en) Image processing method, apparatus, device and medium
CN111027551B (en) Image processing method, apparatus and medium
JP6991960B2 (en) Image recognition device, image recognition method and program
CN108460453B (en) Data processing method, device and system for CTC training
KR101953479B1 (en) Group search optimization data clustering method and system using the relative ratio of distance
CN111723247A (en) Graph-based hypothetical computation
CN116228484B (en) Course combination method and device based on quantum clustering algorithm

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14908103

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14908103

Country of ref document: EP

Kind code of ref document: A1