CN107301376B - Pedestrian detection method based on deep learning multi-layer stimulation - Google Patents


Info

Publication number
CN107301376B
CN107301376B
Authority
CN
China
Prior art keywords
pedestrian
frame
candidate
target
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710385952.3A
Other languages
Chinese (zh)
Other versions
CN107301376A (en)
Inventor
李玺
李健
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201710385952.3A priority Critical patent/CN107301376B/en
Publication of CN107301376A publication Critical patent/CN107301376A/en
Application granted granted Critical
Publication of CN107301376B publication Critical patent/CN107301376B/en


Classifications

    • G: PHYSICS; G06: COMPUTING; CALCULATING OR COUNTING
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06F18/24: Pattern recognition; analysing; classification techniques
    • G06V10/25: Image preprocessing; determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V20/40: Scenes; scene-specific elements in video content
    • G06V40/20: Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a pedestrian detection method based on deep-learning multi-layer stimulation which, given a monitoring video and a target to be detected, marks the positions at which the target appears. The method comprises the following steps: acquiring a pedestrian data set for training a target detection model, and defining the algorithm objective; modeling the position deviation and apparent semantics of the pedestrian target; establishing a pedestrian multi-layer stimulation network model from the modeling result; and detecting pedestrian positions in the monitoring image using the trained detection model. The method is suitable for pedestrian detection in real video monitoring images, and retains good accuracy and robustness under a variety of complex conditions.

Description

Pedestrian detection method based on deep learning multi-layer stimulation
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a pedestrian detection method based on deep learning multi-layer stimulation.
Background
With the development of computer vision since the end of the 20th century, intelligent video processing technology has received widespread attention and research. Pedestrian detection is an important and challenging task whose goal is to accurately locate pedestrians in video surveillance images. The problem has high application value in fields such as video monitoring and intelligent robotics, and underpins a large number of high-level vision tasks. It is also challenging: first, how to represent the target-region information; second, how to model and optimize candidate-region extraction and target classification in a unified way. These challenges place high demands on the performance and robustness of the corresponding algorithms.
A typical pedestrian detection pipeline has three parts: 1) find candidate regions in the input image that may contain the target; 2) extract hand-crafted target features from those candidate regions; 3) run a classification algorithm on the features to perform detection. This approach has the following problems: 1) it relies on traditional visual features, which express only low-level visual information, whereas pedestrian detection requires a model with high-level abstract semantic understanding; 2) candidate-region extraction and feature classification are not jointly optimized by end-to-end learning; 3) deep features, where used, are not combined through multi-layer stimulation, so the target features are not abstract enough.
Disclosure of Invention
To solve the above problems, an object of the present invention is to provide a pedestrian detection method based on deep-learning multi-layer stimulation for detecting pedestrian positions in a given monitoring image. The method is based on a deep neural network, represents target-region information with multi-layer-stimulation deep visual features, models pedestrian detection within the Faster R-CNN framework, and adapts well to the complex conditions of real video monitoring scenes.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a pedestrian detection method based on deep learning multi-layer stimulation comprises the following steps:
s1, acquiring a pedestrian data set for training a target detection model, and defining an algorithm target;
s2, modeling the position deviation and the apparent semantic meaning of the pedestrian target;
s3, establishing a pedestrian multilayer stimulation network model according to the modeling result in the step S2;
and S4, detecting the pedestrian position in the monitored image by using the detection model.
Further, in step S1, the pedestrian data set used to train the target detection model comprises pedestrian images X_train and manually marked pedestrian positions B;
the algorithm targets are defined as: the pedestrian position P in one monitor image X is detected.
Further, in step S2, the modeling of the position deviation and the apparent semantic meaning of the pedestrian object specifically includes:
s21, according to the pedestrian data set XtrainAnd pedestrian position P modeling position deviation:
Figure BDA0001306371040000021
Figure BDA0001306371040000022
wherein, x, y are the coordinate of the middle point of the pedestrian frame label, w, h are the width and the length of the pedestrian frame label, and xa,yaIs the coordinate of the pedestrian candidate frame, wa,haIs the width and length of the pedestrian candidate frame; t is txAs the ratio of the deviation of the x coordinate of the pedestrian frame relative to the x coordinate of the marking frame to the width of the marking frame, tyAs the proportion of the deviation of the y coordinate of the pedestrian frame relative to the y coordinate of the marking frame corresponding to the length of the marking frame, twAs the ratio of the width of the pedestrian frame to the width of the marking frame, thThe length of the pedestrian frame is in proportion to the length of the marking frame;
s22, according to the pedestrian data set XtrainAnd pedestrian position P modeling appearance semantics:
s=<w,d>
Figure BDA0001306371040000023
where s represents the projection value of the feature d onto a projection vector w, w is the pedestrian weight projection vector, d is the pedestrian feature descriptor,<.,.>is the inner product operator, p (C ═ k | d) is the softmax function, indicating the probability values belonging to class k; sjIs the projection value of the feature d on the jth projection vector w; c is a discrete random variable with the value number of k; j is the index of the jth w of the total projection vectors w.
Further, in step S3, the step of establishing the pedestrian multi-layer stimulation network model according to the modeling result in step S2 specifically includes:
s31, establishing a multilayer stimulation convolutional neural network, wherein the input of the neural network is a monitoring image X and a pedestrian marking box B, and the output is a probability value p of a corresponding pedestrian candidate box and a pedestrian position deviation O in the X; the structure of the neural network is represented as mapping X → (p, O);
s32, child mapping X → p uses the soft maximum Softmax loss function, expressed as
Figure BDA0001306371040000031
Lcls(X,Y;θ)=-∑jYjlogp (C | d) formula (3)
Wherein Y is a binary vector, if the k-th class belongs to, the corresponding value is 1, and the rest is 0; l iscls(X, Y; θ) represents the softmax loss function of the entire training data set;
s33, child mapping X → O Using Euclidean loss function, expressed as
Lloc(t,v)=∑ismooth(ti,vi)
Figure BDA0001306371040000032
Wherein t isiIs a pedestrian position deviation tag, viIs a pedestrian position deviation predicted value; i represents the ith training sample;
s34 loss function of the whole multi-layer stimulation neural network
L=Lcls+LlocFormula (5)
The entire neural network is trained using a stochastic gradient descent and back propagation algorithm under a loss function L.
Further, in step S4, detecting pedestrian positions in the monitoring image comprises: inputting the monitoring image X to be detected into the trained neural network, judging from the output probability value of each candidate box whether it contains a pedestrian, and finally correcting with the predicted position deviation O to obtain the pedestrian positions P.
Compared with existing pedestrian detection methods, the pedestrian detection method applied to video monitoring scenes has the following beneficial effects:
First, the pedestrian detection method of the invention builds its model on a deep convolutional neural network, and unifies candidate-region generation and feature classification in the same network framework for joint learning and optimization, improving the final effectiveness of the method.
Second, the proposed multi-layer stimulation algorithm enriches the abstraction capability of the features, and the features it learns allow the classifier to learn more robust classification rules.
The pedestrian detection method applied to video monitoring scenes has good application value in intelligent video analysis systems and can effectively improve the efficiency and accuracy of pedestrian detection. For example, in traffic video monitoring, it can quickly and accurately detect the positions of all pedestrians, providing data for subsequent pedestrian search tasks and greatly reducing manual effort.
Drawings
FIG. 1 is a schematic flow chart of a pedestrian detection method applied to a video surveillance scene according to the present invention;
FIG. 2 is a schematic diagram of the loss function of the whole multi-layer neural network according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
On the contrary, the invention is intended to cover the alternatives, modifications and equivalents that may be included within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description, certain specific details are set forth to provide a thorough understanding of the invention. It will be apparent to one skilled in the art that the invention may be practiced without these specific details.
Referring to fig. 1, in a preferred embodiment of the present invention, a pedestrian detection method based on deep learning multi-layer stimulation comprises the following steps:
First, a pedestrian data set for training the target detection model is acquired, comprising pedestrian images X_train and manually marked pedestrian positions B;
the algorithm objective is defined as: detecting the pedestrian positions P in a monitoring image X.
Secondly, modeling the position deviation and the apparent semantics of the pedestrian target specifically comprises:
First, the position deviation is modeled from the pedestrian data set X_train and the pedestrian positions P:

t_x = (x - x_a)/w_a,  t_y = (y - y_a)/h_a,  t_w = log(w/w_a),  t_h = log(h/h_a)    formula (1)

wherein x, y are the centre coordinates of the labeled pedestrian box, w, h are the width and height of the labeled box, x_a, y_a are the centre coordinates of the pedestrian candidate box, and w_a, h_a are the width and height of the candidate box; t_x is the x-offset of the labeled box relative to the candidate box as a fraction of the candidate box's width, t_y is the y-offset as a fraction of the candidate box's height, t_w is the log-ratio of the labeled box's width to the candidate box's width, and t_h is the log-ratio of their heights;
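The parameterisation of formula (1) is the standard box-offset encoding of the Faster R-CNN framework that the method adopts. As an illustrative sketch only (the function names `encode_box` and `decode_box` are ours, not the patent's), it can be written in Python:

```python
import math

def encode_box(labeled, candidate):
    """Encode a labeled (ground-truth) box relative to a candidate box,
    following the formula (1) parameterisation: offsets are normalised by
    the candidate box size, scales are log-ratios.
    Boxes are (x, y, w, h): centre coordinates, width, height."""
    x, y, w, h = labeled
    xa, ya, wa, ha = candidate
    tx = (x - xa) / wa          # x offset as a fraction of candidate width
    ty = (y - ya) / ha          # y offset as a fraction of candidate height
    tw = math.log(w / wa)       # log ratio of widths
    th = math.log(h / ha)       # log ratio of heights
    return (tx, ty, tw, th)

def decode_box(t, candidate):
    """Apply a predicted deviation t to a candidate box (the inverse of
    encode_box), yielding a corrected box."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = candidate
    return (tx * wa + xa, ty * ha + ya, wa * math.exp(tw), ha * math.exp(th))
```

Inverting the encoding, as `decode_box` does, is how a predicted position deviation O corrects a candidate box in the detection step.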
Second, the apparent semantics are modeled from the pedestrian data set X_train and the pedestrian positions P:

s = <w, d>,  p(C = k | d) = exp(s_k) / Σ_j exp(s_j)    formula (2)

wherein s represents the projection of the feature d onto a projection vector w, w is the pedestrian weight projection vector, d is the pedestrian feature descriptor, <.,.> is the inner-product operator, and p(C = k | d) is the softmax function giving the probability of belonging to class k; s_j is the projection of the feature d onto the j-th projection vector; C is a discrete random variable taking k values; j is the index over the projection vectors w.
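The projection-plus-softmax of formula (2) can be sketched as follows; this is an illustrative Python fragment with a hypothetical function name, not code from the patent:

```python
import math

def class_probabilities(d, W):
    """Formula (2) as code: project the feature descriptor d onto each
    class weight vector w (s = <w, d>), then softmax the projections to
    get p(C = k | d) for every class k. W holds one weight vector per class."""
    s = [sum(wi * di for wi, di in zip(w, d)) for w in W]   # inner products
    m = max(s)                                              # for numerical stability
    exps = [math.exp(sj - m) for sj in s]
    z = sum(exps)
    return [e / z for e in exps]
```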
And then, a detection model for the pedestrian target is pre-trained according to the above modeling result. The method specifically comprises the following steps:
Firstly, a multi-layer stimulation convolutional neural network is established, wherein the input of the neural network is a monitoring image X and the pedestrian marking boxes B, and the output is the probability value p of each pedestrian candidate box in X together with the pedestrian position deviation O; thus, the structure of the neural network can be represented as the mapping X → (p, O);
Second, the sub-mapping X → p uses a soft-max (Softmax) loss function, expressed as

L_cls(X, Y; θ) = -Σ_j Y_j log p(C = j | d)    formula (3)

wherein Y is a one-hot vector whose entry for the true class k is 1 and whose other entries are 0; L_cls(X, Y; θ) represents the softmax loss over the entire training data set;
Third, the sub-mapping X → O uses the Euclidean loss function, expressed as

L_loc(t, v) = Σ_i smooth_L1(t_i - v_i),  smooth_L1(x) = 0.5x² if |x| < 1, |x| - 0.5 otherwise    formula (4)

wherein t_i is the position-deviation label, v_i is the predicted position deviation, and i denotes the i-th training sample.
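For clarity, the smooth-L1 localisation loss of formula (4) can be sketched in Python (illustrative only; the function names are ours):

```python
def smooth_l1(x):
    """Formula (4): quadratic near zero, linear further out, so large
    localisation errors do not produce exploding gradients."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def loc_loss(t, v):
    """Localisation loss of one sample: sum of smooth_l1 over the offset
    components, t being the labels and v the predictions."""
    return sum(smooth_l1(ti - vi) for ti, vi in zip(t, v))
```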
Fourth, referring to FIG. 2, the loss function of the entire multi-layer neural network is

L = L_cls + L_loc    formula (5)
The entire neural network is trained using a stochastic gradient descent and back propagation algorithm under a loss function L.
And finally, the pedestrians in the monitoring image are detected using the trained detection model. Specifically: the preprocessed image is fed into the multi-layer stimulation detection framework for computation. The framework extracts candidate boxes with 3 RPN networks; each RPN uses different feature information, so the candidate boxes it produces differ in size and scale. The candidate boxes extracted by each RPN are first obtained and filtered by their confidence scores down to 300 candidate regions per RPN. The candidate regions from the 3 RPNs are then merged, giving 900 candidate regions, which are ranked by classification confidence in descending order and filtered to the final 300 target candidate regions. Candidate boxes whose classification probability value does not exceed a given threshold are then discarded, overlapping duplicate detection boxes are removed with a non-maximum suppression algorithm, and the pedestrian positions P are finally obtained by correcting with the predicted position deviation O.
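The non-maximum suppression step mentioned above can be sketched as follows. This greedy IoU-based variant is the common formulation and is illustrative only; the patent does not specify its exact NMS parameters:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression: repeatedly keep the highest-scoring
    box and drop every remaining box whose IoU with it exceeds thresh.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep
```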
In the above embodiment, the pedestrian detection method of the invention first models the position deviation and the apparent semantics of the pedestrian target. On this basis, the original problem is converted into a multi-task learning problem, and a pedestrian detection model is established based on a deep neural network. Finally, the trained detection model is used to detect pedestrian positions in the monitoring image.
Through the above technical scheme, the embodiment of the invention develops a pedestrian detection algorithm based on deep-learning multi-layer stimulation. The invention can effectively model the position deviation and the apparent semantic information of the target at the same time, thereby detecting accurate pedestrian positions.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (1)

1. A pedestrian detection method based on deep learning multi-layer stimulation is characterized by comprising the following steps:
S1, acquiring a pedestrian data set for training a target detection model, and defining the algorithm objective; the pedestrian data set for training the target detection model comprises pedestrian images X_train and manually marked pedestrian positions B; the algorithm objective is defined as: detecting the pedestrian positions P in a monitoring image X;
s2, modeling the position deviation and the apparent semantics of the pedestrian target, specifically comprising:
S21, modeling the position deviation from the pedestrian data set X_train and the pedestrian positions P:

t_x = (x - x_a)/w_a,  t_y = (y - y_a)/h_a,  t_w = log(w/w_a),  t_h = log(h/h_a)    formula (1)

wherein x, y are the centre coordinates of the labeled pedestrian box, w, h are the width and height of the labeled box, x_a, y_a are the centre coordinates of the pedestrian candidate box, and w_a, h_a are the width and height of the candidate box; t_x is the x-offset of the labeled box relative to the candidate box as a fraction of the candidate box's width, t_y is the y-offset as a fraction of the candidate box's height, t_w is the log-ratio of the labeled box's width to the candidate box's width, and t_h is the log-ratio of their heights;
S22, modeling the apparent semantics from the pedestrian data set X_train and the pedestrian positions P:

s = <w, d>,  p(C = k | d) = exp(s_k) / Σ_j exp(s_j)    formula (2)

wherein s represents the projection of the feature d onto a projection vector w, w is the pedestrian weight projection vector, d is the pedestrian feature descriptor, <.,.> is the inner-product operator, and p(C = k | d) is the softmax function giving the probability of belonging to class k; s_j is the projection of the feature d onto the j-th projection vector; C is a discrete random variable taking k values; j is the index over the projection vectors w;
s3, establishing a pedestrian multilayer stimulation network model according to the modeling result in the step S2, which specifically comprises the following steps:
s31, establishing a multilayer stimulation convolutional neural network, wherein the input of the neural network is a monitoring image X and a pedestrian marking box B, and the output is a probability value p of a corresponding pedestrian candidate box and a pedestrian position deviation O in the X; the structure of the neural network is represented as mapping X → (p, O);
S32, the sub-mapping X → p uses the soft-max (Softmax) loss function, expressed as

L_cls(X, Y; θ) = -Σ_j Y_j log p(C = j | d)    formula (3)

wherein Y is a one-hot vector whose entry for the true class k is 1 and whose other entries are 0; L_cls(X, Y; θ) represents the softmax loss over the entire training data set;
S33, the sub-mapping X → O uses the Euclidean loss function, expressed as

L_loc(t, v) = Σ_i smooth_L1(t_i - v_i),  smooth_L1(x) = 0.5x² if |x| < 1, |x| - 0.5 otherwise    formula (4)

wherein t_i is the position-deviation label and v_i is the predicted position deviation; i denotes the i-th training sample;
S34, the loss function of the whole multi-layer stimulation neural network is

L = L_cls + L_loc    formula (5)
Training the whole neural network under a loss function L by using a random gradient descent and back propagation algorithm;
the multilayer stimulation neural network extracts candidate frames by using 3 RPN networks, the characteristic information utilized by each RPN network is different, so that the sizes and the scales of the obtained candidate frames are different, and each RPN network introduces a loss function L; in the detection process, obtaining candidate frames extracted by each RPN network, and filtering according to the respective confidence degrees to obtain 300 candidate regions; then combining the candidate regions in the 3 RPN networks to obtain 900 candidate regions; then, arranging according to the classification confidence degree from large to small, and filtering to obtain the final 300 target candidate regions; filtering the candidate frames according to whether the output candidate frame classification probability value is larger than a given threshold value or not, eliminating the detection frames with repeated crossing by adopting a non-maximum value inhibition algorithm, and finally correcting according to the predicted position deviation O to obtain the position P of the pedestrian;
S4, detecting the pedestrian positions in the monitored image using the detection model; wherein detecting the pedestrian positions in the monitored image comprises: inputting the monitoring image X to be detected into the trained neural network, judging from the output probability value of each candidate box whether it contains a pedestrian, and finally correcting with the predicted position deviation O to obtain the pedestrian positions P.
CN201710385952.3A 2017-05-26 2017-05-26 Pedestrian detection method based on deep learning multi-layer stimulation Active CN107301376B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710385952.3A CN107301376B (en) 2017-05-26 2017-05-26 Pedestrian detection method based on deep learning multi-layer stimulation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710385952.3A CN107301376B (en) 2017-05-26 2017-05-26 Pedestrian detection method based on deep learning multi-layer stimulation

Publications (2)

Publication Number Publication Date
CN107301376A CN107301376A (en) 2017-10-27
CN107301376B true CN107301376B (en) 2021-04-13

Family

ID=60138099

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710385952.3A Active CN107301376B (en) 2017-05-26 2017-05-26 Pedestrian detection method based on deep learning multi-layer stimulation

Country Status (1)

Country Link
CN (1) CN107301376B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163224B (en) * 2018-01-23 2023-06-20 天津大学 Auxiliary data labeling method capable of online learning
CN108537117B (en) * 2018-03-06 2022-03-11 哈尔滨思派科技有限公司 Passenger detection method and system based on deep learning
CN108446662A (en) * 2018-04-02 2018-08-24 电子科技大学 A kind of pedestrian detection method based on semantic segmentation information
CN110969657B (en) * 2018-09-29 2023-11-03 杭州海康威视数字技术股份有限公司 Gun ball coordinate association method and device, electronic equipment and storage medium
CN111178267A (en) * 2019-12-30 2020-05-19 成都数之联科技有限公司 Video behavior identification method for monitoring illegal fishing
CN111476089B (en) * 2020-03-04 2023-06-23 上海交通大学 Pedestrian detection method, system and terminal for multi-mode information fusion in image
CN111523478B (en) * 2020-04-24 2023-04-28 中山大学 Pedestrian image detection method acting on target detection system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106022237A (en) * 2016-05-13 2016-10-12 电子科技大学 Pedestrian detection method based on end-to-end convolutional neural network
CN106250812A (en) * 2016-07-15 2016-12-21 汤平 A kind of model recognizing method based on quick R CNN deep neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107430677B (en) * 2015-03-20 2022-04-12 英特尔公司 Target identification based on improving binary convolution neural network characteristics
US10382300B2 (en) * 2015-10-06 2019-08-13 Evolv Technologies, Inc. Platform for gathering real-time analysis

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106022237A (en) * 2016-05-13 2016-10-12 电子科技大学 Pedestrian detection method based on end-to-end convolutional neural network
CN106250812A (en) * 2016-07-15 2016-12-21 汤平 A kind of model recognizing method based on quick R CNN deep neural network

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection;Zhaowei Cai et al.;《European Conference on Computer Vision》;20160917;第354-358页 *
Deep Convolutional Neural Networks for Pedestrian Detection with Skip Pooling;Jie Liu et al.;《 2017 International Joint Conference on Neural Networks》;20170519;第1-9页 *
Fast R-CNN;Ross Girshick;《arXiv:1504.08083v2》;20150927;第2056-2063页 *
R-FCN: Object Detection via Region-based Fully Convolutional Networks;Jifeng Dai et al.;《arXiv:1605.06409v2》;20160621;第4页 *
Efficient object detection based on feature sharing (基于特征共享的高效物体检测); 任少卿 (Shaoqing Ren); 《China Doctoral Dissertations Full-text Database, Information Science & Technology》; 20160815; Vol. 2016, No. 8; Chapter 4 *

Also Published As

Publication number Publication date
CN107301376A (en) 2017-10-27

Similar Documents

Publication Publication Date Title
CN107301376B (en) Pedestrian detection method based on deep learning multi-layer stimulation
Zhou et al. Safety helmet detection based on YOLOv5
Wang et al. Actionness estimation using hybrid fully convolutional networks
CN106845487B (en) End-to-end license plate identification method
CN104050471B (en) Natural scene character detection method and system
CN107016357B (en) Video pedestrian detection method based on time domain convolutional neural network
CN111611874B (en) Face mask wearing detection method based on ResNet and Canny
CN111191667B (en) Crowd counting method based on multiscale generation countermeasure network
Deng et al. Amae: Adaptive motion-agnostic encoder for event-based object classification
Li et al. Sign language recognition based on computer vision
CN110298297A (en) Flame identification method and device
CN106815563B (en) Human body apparent structure-based crowd quantity prediction method
CN108648211A (en) A kind of small target detecting method, device, equipment and medium based on deep learning
CN103984955B (en) Multi-camera object identification method based on salience features and migration incremental learning
CN112801019B (en) Method and system for eliminating re-identification deviation of unsupervised vehicle based on synthetic data
CN108898623A (en) Method for tracking target and equipment
CN107330363B (en) Rapid internet billboard detection method
CN103500456B (en) A kind of method for tracing object based on dynamic Bayesian network network and equipment
Liu et al. D-CenterNet: An anchor-free detector with knowledge distillation for industrial defect detection
CN110111365B (en) Training method and device based on deep learning and target tracking method and device
CN108009512A (en) A kind of recognition methods again of the personage based on convolutional neural networks feature learning
Yin Object Detection Based on Deep Learning: A Brief Review
CN114170686A (en) Elbow bending behavior detection method based on human body key points
Pang et al. Dance video motion recognition based on computer vision and image processing
CN105118073A (en) Human body head target identification method based on Xtion camera

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant