CN111027372A - Pedestrian target detection and identification method based on monocular vision and deep learning - Google Patents

Pedestrian target detection and identification method based on monocular vision and deep learning

Info

Publication number
CN111027372A
CN111027372A (application CN201910991615.8A)
Authority
CN
China
Prior art keywords
pedestrian
feature
image
pyramid
phase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910991615.8A
Other languages
Chinese (zh)
Inventor
任清元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Vocational College of Industry
Original Assignee
Shandong Vocational College of Industry
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Vocational College of Industry
Priority to CN201910991615.8A
Publication of CN111027372A
Legal status: Pending

Classifications

    • G06V 40/103 — Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06F 18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045 — Combinations of networks
    • G06N 3/08 — Learning methods
    • G06V 20/53 — Recognition of crowd images, e.g. recognition of crowd congestion

Abstract

The invention belongs to the technical field of pedestrian target detection, and discloses a pedestrian target detection and identification method based on monocular vision and deep learning. The method establishes a small-sample pedestrian data set by collecting road pedestrian images in real scenes; performs pedestrian detection with a depth-feature-based target detection algorithm built on whole-image candidates and single regression; fine-tunes the weight parameters of the higher network layers on the VOC data set and the small-sample pedestrian data set through secondary transfer learning; extracts phase-consistency features from the multi-scale pyramid images and extracts the contour features of the pedestrian images to obtain a multi-scale pyramid feature map; and adopts a balanced focal loss function in place of the cross entropy loss function to measure the classification accuracy of the target. The method uses a CNN to obtain depth features and trains a deformable part model, effectively improving detection precision; transfer learning is introduced, and by analyzing the hidden layers of the AlexNet model the accuracy of pedestrian target detection and identification is improved.

Description

Pedestrian target detection and identification method based on monocular vision and deep learning
Technical Field
The invention belongs to the technical field of pedestrian target detection, and particularly relates to a pedestrian target detection and identification method based on monocular vision and deep learning.
Background
The state of the art commonly used in the industry is as follows: pedestrian detection has important applications in intelligent vehicles, robots, video surveillance and related directions; because pedestrian pose is variable and is influenced by factors such as illumination, background, clothing and occlusion, pedestrian detection remains a challenging topic in the field of computer vision. At present, pedestrian detection based on computer vision mostly follows the approach of feature extraction plus machine learning. For feature extraction, features such as contours, textures, frequency-domain information and color regions are commonly used to describe the difference between pedestrians and the background: features such as HOG, EOH, Edgelet and Shapelet describe pedestrian contours, LBP features describe pedestrian textures, CSS features describe the structural regions of the human body using the color similarity between local parts, and Haar wavelet features describe the frequency-domain information of pedestrians. Among these, the HOG feature describes the gradient strength and gradient-direction distribution of local image regions, represents pedestrian appearance and shape well, is insensitive to illumination and small offsets, has shown excellent performance in pedestrian detection, and has become the mainstream pedestrian detection method at present.
In summary, the problems of the prior art are as follows: traffic conditions under real road cameras are complex, pedestrian flow is dense and unevenly distributed, causing class imbalance among the samples; these conditions differ greatly from those in public data sets, so the characteristics of the different classes cannot be learned in a balanced way during model training, and the detection performance on the remaining classes is poor.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a pedestrian target detection and identification method based on monocular vision and deep learning.
The invention is realized in such a way that a pedestrian target detection and identification method based on monocular vision and deep learning comprises the following steps:
the method comprises the steps of firstly, establishing a small sample pedestrian data set, and collecting road pedestrian images in a real scene; respectively extracting the characteristics of the image and video data sets in the source domain by using the improved triple network; analyzing the common edge information of the pedestrians based on the posterior HOG characteristic of the gradient characteristic energy which is the statistical embodiment of the gradient information of the positive samples of a large number of pedestrians;
secondly, a depth feature-based target detection algorithm based on whole image candidates and single regression is realized and applied to pedestrian detection, double convolution is used for replacing a single convolution kernel, a gradient descent method with momentum is used for learning weight parameters of the network in a training stage, and a cross entropy loss function and a smooth L1 loss function are used as loss functions of a classifier and position regression;
thirdly, fine-tuning weight parameters of a network higher layer on the VOC data set and the small sample pedestrian data set through secondary transfer learning;
fourthly, extracting the features of the multi-scale pyramid images based on consistent phases, and extracting the contour features of the pedestrian images to obtain a multi-scale pyramid feature map;
and fifthly, adopting a balanced focal loss function in place of the cross entropy loss function to measure the classification accuracy of the target.
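The two loss functions named in the second step — cross entropy for classification and smooth L1 for position regression — can be illustrated with a minimal NumPy sketch (the function names and toy values are illustrative, not taken from the patent):

```python
import numpy as np

def cross_entropy(probs, label):
    """Cross-entropy loss for one sample; probs are softmax outputs."""
    return -np.log(probs[label])

def smooth_l1(x):
    """Smooth L1 loss, elementwise: quadratic near zero, linear elsewhere."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

# Classifier output over {background, pedestrian} and two box-offset errors
probs = np.array([0.2, 0.8])
cls_loss = cross_entropy(probs, label=1)           # -log(0.8)
reg_loss = smooth_l1(np.array([0.3, -2.0])).sum()  # 0.5*0.3^2 + (2.0 - 0.5)
```

The quadratic region of smooth L1 keeps gradients bounded for small regression errors, while the linear region avoids exploding gradients for outlier boxes.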
Further, for the pedestrian data set in the first step, the labeled source domain S = {(I_si, V_si)}_{i=1}^{N_s} contains N_s pedestrian image-video pairs, where I_si ∈ R^P is the i-th pedestrian image of the source domain and V_si ∈ R^P is the corresponding pedestrian video in the source domain; in the same way, the unlabeled pedestrian images and pedestrian videos of the target domain T are denoted {I_ti} and {V_ti}, respectively. The triplet network is constructed so that the distance between a video of the target pedestrian and an image of the same pedestrian is smaller than the distance between that video and an image of another pedestrian; the triplet loss is defined as

    L_tri = max(0, ||f_3d(V_a) - f_2d(I_p)||^2 - ||f_3d(V_a) - f_2d(I_n)||^2 + α)

where the anchor V_a, positive example I_p and negative example I_n come from the source domain X, f_2d denotes the 2D image feature extraction sub-network composed of several 2D convolution layers, and f_3d the 3D video feature extraction sub-network composed of several 3D convolution layers.
Furthermore, in the second step the AlexNet model is first truncated to obtain its convolution layers, and CNN model parameters are then obtained through transfer learning to extract rich high-level features; specifically, the first 5 convolution layers of the model are used to obtain depth features, and a latent-variable support vector machine is then trained with each layer of features of the feature pyramid to obtain the global detector and local detectors of the DPM. In the detection process, a global feature map and local feature maps are constructed for the test image; the local feature maps are pooled and then concatenated with the global feature map to obtain a new feature map, and the trained discriminant model is convolved with the concatenated feature map to obtain the detection result.
Further comprising:
firstly, depth features are extracted from the image pyramid by the truncated CNN containing 5 convolution layers, forming a 7-level, 256-channel depth feature pyramid; the initialized global detector and local detectors are then convolved with the 256-channel feature layers of each level respectively; the feature maps produced by the local detectors are pooled and concatenated with the global detector feature map to form a cascaded feature map; finally, a convolution of the target geometric filter with the cascaded feature map yields the final single-component model detection score.
Further comprising: the max pooling operation is equivalent to

    M_f(x) = max_{r ∈ {-k, …, k}} f(x + r),

where f: G → R is the input function and M_f: G → R its max-pooled version. Defining the deformation cost d_max by d_max(r) = 0 for r ∈ {-k, …, k} and d_max(r) = ∞ otherwise establishes the connection between max pooling and distance-transform pooling: the distance-transform pooling D_f: G → R of f is defined as

    D_f(x) = max_{y ∈ G} ( f(y) - d(x - y) ),

and with d = d_max it coincides with max pooling M_f.
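The max-pooling / distance-transform equivalence described above can be checked numerically. The following NumPy sketch (1-D case for brevity; names are illustrative) compares max pooling against distance-transform pooling under the cost d_max:

```python
import numpy as np

def max_pool_1d(f, k):
    """M_f(x) = max over r in {-k,...,k} of f(x+r), with edge clipping."""
    n = len(f)
    return np.array([f[max(0, x - k):min(n, x + k + 1)].max() for x in range(n)])

def dt_pool_1d(f, d):
    """Distance-transform pooling: D_f(x) = max_y (f(y) - d(x - y))."""
    n = len(f)
    return np.array([max(f[y] - d(x - y) for y in range(n)) for x in range(n)])

k = 1
d_max = lambda r: 0.0 if -k <= r <= k else np.inf  # deformation cost of max pooling
f = np.array([0.2, 1.0, 0.3, 0.7, 0.1])
# With d = d_max the two pooling operators produce identical outputs
```

The brute-force D_f here is O(n^2); in practice the generalized distance transform computes it in linear time, which is the point of the equivalence.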
in the convolutional layer, the feature pixel value of the pedestrian image is calculated as

    y_j^{l,p}(c, r) = f( Σ_i Σ_{(u,v) ∈ K} w_{ij}^l(u, v) · y_i^{l-1,p}(c + u, r + v) + b_j^l ),

where K = {(u, v) ∈ N^2 | 0 ≤ u < k_x; 0 ≤ v < k_y}, with k_x and k_y the length and width of the l-th layer convolution kernel w_{ij}^l; b_j^l is the offset of the j-th feature map of layer l; the variables c and r denote the current longitudinal and transverse feature pixel, while u and v index positions within the convolution kernel; p indicates the p-th training sample; and f is the activation function of the l-th layer. The operation is equivalent to convolving the convolution kernel with the input feature map y_i^{l-1,p}.
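The feature-pixel formula can be sketched for a single input channel and a single kernel as follows (a minimal NumPy illustration under an assumed ReLU activation; not the patent's implementation):

```python
import numpy as np

def conv_feature_map(x, w, b, act=lambda z: np.maximum(z, 0.0)):
    """Valid convolution implementing
    y(c, r) = f( sum_{(u,v) in K} w(u, v) * x(c+u, r+v) + b )
    for one input channel and one kernel."""
    kx, ky = w.shape
    H, W = x.shape
    out = np.zeros((H - kx + 1, W - ky + 1))
    for c in range(out.shape[0]):
        for r in range(out.shape[1]):
            out[c, r] = act(np.sum(w * x[c:c + kx, r:r + ky]) + b)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((2, 2)) / 4.0      # 2x2 averaging kernel
y = conv_feature_map(x, w, b=0.0)
# y[0, 0] = (0 + 1 + 4 + 5) / 4 = 2.5
```

A full layer would add the sum over input feature maps i and one kernel per (input, output) map pair, exactly as in the formula above.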
And further, establishing a multi-resolution pyramid model by using a Laplacian pyramid to finish multi-scale pyramid representation, extracting features of each layer of pedestrian images in the pyramid by using a phase-consistent algorithm, and finally fusing the pedestrian images in the multi-scale phase-consistent pyramid from top to bottom by using a multi-scale fusion algorithm to obtain an original pedestrian image phase-consistent feature map.
Further comprising: extracting features of each layer of pedestrian images in the pyramid by using a phase consistency algorithm, and finally fusing the pedestrian images in the multi-scale phase consistency pyramid from top to bottom by using a multi-scale fusion algorithm to obtain an original pedestrian image phase consistency feature map specifically comprises the following steps:
(1) search the phase consistency feature map of scale n to obtain the initial position (x0, y0) of a phase-consistent feature point;
(2) in the phase consistency feature map of scale n-1, search for a phase-consistent feature point within the 3×3 neighborhood of (x0, y0); if such a point exists at some position (x, y), then position (x, y) in the fused phase consistency feature map is a feature point; otherwise the original point is retained;
(3) at each feature point of the fused phase consistency map, search for the points connected to the feature point at scale n-1 to obtain detail features not contained in the fused map of scale n;
(4) search for the next feature point in the scale-n phase consistency feature map and repeat steps (1)-(3) until the whole map has been processed.
The invention further aims to provide a road traffic monitoring platform applying the pedestrian target detection and identification method based on monocular vision and deep learning.
In summary, the advantages and positive effects of the invention are: the method has the advantages that an integrated convolutional neural network target detection model based on whole image candidates and single regression is realized, double convolution is used for replacing a single convolution kernel, an internal competition mechanism is used for replacing an activation layer to realize nonlinearity of the network, the network parameter quantity is reduced, the abstract capability of features on the target is improved, and the end-to-end real-time detection of pedestrian information is realized. In the training stage, learning weight parameters of the network by using a gradient descent method with momentum, and taking a cross entropy loss function and a smooth L1 loss function as a classifier and a loss function of position regression; fine-tuning weight parameters of higher layers of the network on the VOC data set and the small sample road pedestrian data set through secondary transfer learning, enhancing the feature representation capability of the model, and improving the average accuracy by 12 percentage points; aiming at the problem of missing detection of pedestrians, a feature diagram pyramid with transverse connection is introduced, context information in feature representation is increased to a certain extent, a newly-added pyramid is attached to an original network in a jumping transmission mode, the original network structure is not changed, only a small number of weight parameters and calculation operation are added, and the overall detection speed of a pedestrian algorithm is not significantly influenced.
The method utilizes the CNN to obtain the depth characteristics, trains the deformable component model and effectively improves the detection precision of the algorithm. The concept of Transfer Learning (TL) is introduced, and the hidden layer in the AlexNet model is analyzed to discover that the function of the bottom layer is the extraction of the general image features, and the depth features of the pedestrian image are generated at the high layer, so that the accuracy of pedestrian target detection and identification is improved.
Drawings
Fig. 1 is a flowchart of a pedestrian target detection and identification method based on monocular vision and deep learning according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The following detailed description of the principles of the invention is provided in connection with the accompanying drawings.
As shown in fig. 1, the pedestrian target detection and identification method based on monocular vision and deep learning provided by the embodiment of the present invention includes the following steps:
s101: establishing a small sample pedestrian data set, and collecting road pedestrian images in a real scene; respectively extracting the characteristics of the image and video data sets in the source domain by using the improved triple network; based on the posterior HOG characteristics of the gradient characteristic energy, the gradient characteristic energy is the statistical embodiment of the gradient information of a large number of pedestrian positive samples, and the common edge information of pedestrians can be analyzed;
s102: the method comprises the steps of realizing a depth feature-based target detection algorithm based on whole image candidates and single regression, applying the algorithm to pedestrian detection, replacing a single convolution kernel with double convolution, learning weight parameters of a network by using a gradient descent method with momentum in a training stage, and taking a cross entropy loss function and a smooth L1 loss function as a classifier and a position regression loss function;
s103: fine-tuning weight parameters of a network higher layer on the VOC data set and the small sample pedestrian data set through secondary transfer learning;
s104: extracting the features of the multi-scale pyramid images based on the consistent phase, and extracting the contour features of the pedestrian images to obtain a multi-scale pyramid feature map;
s105: and (4) adopting a balanced focus loss function to replace a cross entropy loss function to measure the classification accuracy of the target.
In a preferred embodiment of the present invention, step S101 specifically includes: suppose the labeled source domain S = {(I_si, V_si)}_{i=1}^{N_s} contains N_s pedestrian image-video pairs, where I_si ∈ R^P is the i-th pedestrian image of the source domain and V_si ∈ R^P is the corresponding pedestrian video in the source domain. In the same way, the unlabeled pedestrian images and pedestrian videos of the target domain T are denoted {I_ti} and {V_ti}, respectively. Since the video features of pedestrians tend to contain richer information than the image features, the triplet network is constructed so that the distance between a video of the target pedestrian and an image of the same pedestrian (positive example) is smaller than the distance between that video and an image of another pedestrian (negative example). The triplet loss is defined as

    L_tri = max(0, ||f_3d(V_a) - f_2d(I_p)||^2 - ||f_3d(V_a) - f_2d(I_n)||^2 + α)

where the anchor V_a, positive example I_p and negative example I_n come from the source domain X, f_2d denotes the 2D image feature extraction sub-network composed of several 2D convolution layers, and f_3d the 3D video feature extraction sub-network composed of several 3D convolution layers.
To make the model converge faster, the more "difficult" triplets tend to be selected: for a given anchor video V_a, the positive image I_p is chosen to maximize ||f_3d(V_a) - f_2d(I_p)||^2 (the hardest positive), and the negative image I_n is chosen to minimize ||f_3d(V_a) - f_2d(I_n)||^2 (the hardest negative). In particular, an online triplet generator and mini-batches containing more samples are used, but only the hardest positive and hardest negative samples within each batch are computed.
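The online hard-triplet selection described above can be sketched as follows (a minimal NumPy illustration with made-up features; the sub-networks f_2d and f_3d are replaced by precomputed feature vectors, and the margin value is an assumption):

```python
import numpy as np

def hardest_triplet(video_feat, img_feats, labels, anchor_label):
    """For one anchor video feature and a batch of image features, pick the
    hardest positive (farthest same-identity image) and the hardest
    negative (closest other-identity image)."""
    d = np.linalg.norm(img_feats - video_feat, axis=1) ** 2
    pos = np.where(labels == anchor_label)[0]
    neg = np.where(labels != anchor_label)[0]
    return pos[np.argmax(d[pos])], neg[np.argmin(d[neg])]

def triplet_loss(a, p, n, margin=0.3):
    """Hinge-form triplet loss on already-extracted features."""
    return max(0.0, np.sum((a - p) ** 2) - np.sum((a - n) ** 2) + margin)

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))          # stand-in for f_2d outputs of a batch
labels = np.array([0, 0, 0, 1, 1, 1])    # two identities in the batch
anchor = feats[0]                        # stand-in for an f_3d video feature
ip, in_ = hardest_triplet(anchor, feats, labels, anchor_label=0)
loss = triplet_loss(anchor, feats[ip], feats[in_])
```

Mining only within the mini-batch keeps the generator online: no global nearest-neighbor search over the whole data set is needed per update.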
In a preferred embodiment of the present invention, in the depth-feature-based target detection algorithm of step S102, the AlexNet model is first truncated to obtain its convolution layers, and CNN model parameters are then obtained through transfer learning to extract rich high-level features; specifically, depth features are obtained with the first 5 convolution layers of the model, and a latent-variable support vector machine (LSVM) is then trained with each layer of features of the feature pyramid to obtain the global detector and local detectors of the DPM. In the detection process, a global feature map and local feature maps are constructed for the test image; the local feature maps are pooled and then concatenated with the global feature map to obtain a new feature map, and the trained discriminant model is convolved with the concatenated feature map to obtain the detection result.
Firstly, depth features are extracted from the image pyramid by the truncated CNN containing 5 convolution layers, forming a 7-level, 256-channel depth feature pyramid; the initialized global detector and local detectors are then convolved with the 256-channel feature layers of each level respectively; the feature maps produced by the local detectors are pooled and concatenated with the global detector feature map to form a cascaded feature map; finally, a convolution of the target geometric filter with the cascaded feature map yields the final single-component model detection score.
The max pooling operation is equivalent to

    M_f(x) = max_{r ∈ {-k, …, k}} f(x + r),

where f: G → R is the input function and M_f: G → R its max-pooled version. Defining the deformation cost d_max by d_max(r) = 0 for r ∈ {-k, …, k} and d_max(r) = ∞ otherwise establishes the connection between max pooling and distance-transform pooling: the distance-transform pooling D_f: G → R of f is defined as

    D_f(x) = max_{y ∈ G} ( f(y) - d(x - y) ),

and with d = d_max it coincides with max pooling M_f.
in the convolutional layer, the feature pixel value of the pedestrian image is calculated as

    y_j^{l,p}(c, r) = f( Σ_i Σ_{(u,v) ∈ K} w_{ij}^l(u, v) · y_i^{l-1,p}(c + u, r + v) + b_j^l ),

where K = {(u, v) ∈ N^2 | 0 ≤ u < k_x; 0 ≤ v < k_y}, with k_x and k_y the length and width of the l-th layer convolution kernel w_{ij}^l; b_j^l is the offset of the j-th feature map of layer l; the variables c and r denote the current longitudinal and transverse feature pixel, while u and v index positions within the convolution kernel; p indicates the p-th training sample; and f is the activation function of the l-th layer. The operation is equivalent to convolving the convolution kernel with the input feature map y_i^{l-1,p}.
In a preferred embodiment of the present invention, step S104 specifically includes: establishing a multi-resolution pyramid model by using a Laplacian Pyramid (LP), completing multi-scale pyramid representation, extracting features of pedestrian images in each layer of the pyramid by using a phase-consistent algorithm, and finally fusing the pedestrian images in the multi-scale phase-consistent pyramid from top to bottom by using a multi-scale fusion algorithm to obtain an original pedestrian image phase-consistent feature map.
The Laplacian pyramid decomposition is an image decomposition method that decomposes the original pedestrian image into different spatial scales; the Laplacian pyramid is constructed as follows:
(1) take the original pedestrian image G0 as the bottom layer of the Gaussian pyramid;
(2) filter the original pedestrian image with the Gaussian low-pass filter G and downsample it by removing alternate rows and columns to obtain the low-pass pedestrian image, which is the first layer G1 of the Gaussian pyramid;
(3) expand G1 by interpolation (upsampling) and filter it with the band-pass filter H to obtain G'1, then take the difference between it and the original pedestrian image to obtain the band-pass component, the zeroth layer LP0 of the Laplacian pyramid; here the low-pass filter G and the band-pass filter H are normalized filters;
(4) the next level of decomposition is performed on the low-pass Gaussian pyramid image just obtained, and the multi-scale decomposition is completed by iteration, which can be expressed by the formulas

    G_l(i, j) = Σ_m Σ_n G(m, n) · G_{l-1}(2i + m, 2j + n)

    G'_l(i, j) = 4 Σ_m Σ_n G(m, n) · G_l((i + m)/2, (j + n)/2)   (summing only over integer-valued coordinates)

    LP_{l-1}(i, j) = G_{l-1}(i, j) - G'_l(i, j),

where G(m, n) is the generating kernel of the low-pass filter, l is the level of decomposition of the Gaussian pyramid G and the Laplacian pyramid LP, and i and j are the row and column indices of the l-th pyramid level. The pyramid formed by G0, G1, …, Gn is the Gaussian pyramid.
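The iterative pyramid construction can be sketched as follows (a minimal NumPy illustration that substitutes a 2×2 box filter for the Gaussian low-pass filter G and nearest-neighbour upsampling for the band-pass expansion H; all names are illustrative):

```python
import numpy as np

def downsample(img):
    """2x2 box filter followed by taking every other row/column -
    a simple stand-in for the Gaussian low-pass filter G."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] +
                   img[0::2, 1::2] + img[1::2, 1::2])

def upsample(img, shape):
    """Nearest-neighbour expansion back to `shape` (stand-in for H)."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)[:shape[0], :shape[1]]

def laplacian_pyramid(g0, levels):
    """LP_{l-1} = G_{l-1} - expand(G_l); the top level stores the residual."""
    gauss, lp = [g0], []
    for _ in range(levels):
        gauss.append(downsample(gauss[-1]))
        lp.append(gauss[-2] - upsample(gauss[-1], gauss[-2].shape))
    lp.append(gauss[-1])  # low-pass residual at the top
    return lp

img = np.arange(64, dtype=float).reshape(8, 8)
pyr = laplacian_pyramid(img, levels=2)
# 2 band-pass levels plus the residual: shapes (8,8), (4,4), (2,2)
```

Because each band-pass level stores the exact difference against the expanded next level, the original image is recovered perfectly by summing from the top down, which is why the pyramid suits multi-scale feature fusion.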
in a preferred embodiment of the present invention, extracting features from each layer of pedestrian images in the pyramid by using a phase-consistent algorithm, and finally fusing the pedestrian images in the multi-scale phase-consistent pyramid from top to bottom by using a multi-scale fusion algorithm to obtain an original pedestrian image phase-consistent feature map specifically includes:
(1) search the phase consistency feature map of scale n to obtain the initial position (x0, y0) of a phase-consistent feature point;
(2) in the phase consistency feature map of scale n-1, search for a phase-consistent feature point within the 3×3 neighborhood of (x0, y0); if such a point exists at some position (x, y), then position (x, y) in the fused phase consistency feature map is a feature point; otherwise the original point is retained;
(3) at each feature point of the fused phase consistency map, search for the points connected to the feature point at scale n-1 to obtain detail features not contained in the fused map of scale n;
(4) search for the next feature point in the scale-n phase consistency feature map and repeat steps (1)-(3) until the whole map has been processed.
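Steps (1)-(2) of the fusion can be sketched as follows (a simplified NumPy illustration on two same-resolution binary feature-point maps; the scale-space bookkeeping of steps (3)-(4) is omitted, and all names are illustrative):

```python
import numpy as np

def fuse_feature_maps(coarse, fine):
    """For each feature point in the coarse-scale map, look for a feature
    point in its 3x3 neighbourhood in the finer-scale map; if one is found,
    mark that finer-scale position in the fused map."""
    fused = np.zeros_like(coarse)
    for x0, y0 in zip(*np.nonzero(coarse)):
        nb = fine[max(0, x0 - 1):x0 + 2, max(0, y0 - 1):y0 + 2]
        if nb.any():
            # take the first matching fine-scale position as the fused point
            dx, dy = np.argwhere(nb)[0]
            fused[max(0, x0 - 1) + dx, max(0, y0 - 1) + dy] = 1
    return fused

coarse = np.zeros((5, 5), dtype=int); coarse[2, 2] = 1
fine = np.zeros((5, 5), dtype=int);   fine[1, 3] = 1
# The coarse point at (2, 2) is matched to the fine point at (1, 3)
```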
The application effect of the present invention will be described in detail with reference to the simulation.
Training samples of pedestrian images and videos were acquired in different road environments, and a sample library containing 2000 positive samples of pedestrian head-and-shoulders was established. The program was written with Intel's open-source computer vision library OpenCV and adopts the SVM classification principle: a classifier is trained to obtain the classification model. When performing pedestrian detection on a video sequence, the trained classifier is imported and multi-scale detection is carried out on the frame under test with a sliding window; it is then judged whether an accurate pedestrian position has been obtained, and if the detection rate does not meet the requirement or the false detection rate is too high, the classifier is retrained; if the detection is accurate, the detected pedestrian is marked with a rectangular frame. The hardware environment of the experiment is an Intel i3-4130 (3.40 GHz) CPU with 2 GB of memory; to preserve the real-time requirement of the video, the frame rate of image acquisition and transmission is 20-30 frames/s. When there are few moving targets in the detection area, the average detection time per picture is 30 ms; when there are many moving targets, it is 80 ms, which meets the real-time requirement. Under small pose changes and small angle deviations, the detection rate of the method reaches 95% while keeping a low false detection rate.
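The multi-scale sliding-window loop described above can be sketched as follows (a minimal NumPy illustration; the trained SVM decision function is replaced by a toy intensity test, the resize is nearest-neighbour, and all names are assumptions):

```python
import numpy as np

def resize_nn(img, s):
    """Nearest-neighbour rescaling (stand-in for a proper image resize)."""
    h, w = max(1, int(img.shape[0] * s)), max(1, int(img.shape[1] * s))
    yi = (np.arange(h) / s).astype(int)
    xi = (np.arange(w) / s).astype(int)
    return img[yi][:, xi]

def sliding_window_detect(image, classify, win=(16, 8), stride=4,
                          scales=(1.0, 0.5)):
    """Slide a fixed window over each rescaled frame; `classify` stands in
    for the trained SVM's decision on a window. Returns (y, x, h, w)
    boxes mapped back to original-image coordinates."""
    detections = []
    for s in scales:
        scaled = resize_nn(image, s)
        h, w = scaled.shape
        if h < win[0] or w < win[1]:
            continue
        for y in range(0, h - win[0] + 1, stride):
            for x in range(0, w - win[1] + 1, stride):
                if classify(scaled[y:y + win[0], x:x + win[1]]):
                    detections.append((int(y / s), int(x / s),
                                       int(win[0] / s), int(win[1] / s)))
    return detections

# Toy frame: a bright 16x8 region plays the role of a pedestrian, and the
# "classifier" fires on windows with high mean intensity
img = np.zeros((32, 32))
img[8:24, 8:16] = 1.0
boxes = sliding_window_detect(img, classify=lambda p: p.mean() > 0.9)
```

In a real pipeline the window would carry HOG features into the trained SVM and overlapping hits would be merged by non-maximum suppression.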
Table 1. Comparison of the conventional method with the method of the invention (the table is provided as an image in the original publication).
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (8)

1. A pedestrian target detection and identification method based on monocular vision and deep learning is characterized in that the pedestrian target detection and identification method based on monocular vision and deep learning comprises the following steps:
the method comprises the steps of firstly, establishing a small sample pedestrian data set, and collecting road pedestrian images in a real scene; respectively extracting the characteristics of the image and video data sets in the source domain by using the improved triple network; analyzing the common edge information of the pedestrians based on the posterior HOG characteristic of the gradient characteristic energy which is the statistical embodiment of the gradient information of the positive samples of a large number of pedestrians;
secondly, a depth feature-based target detection algorithm based on whole image candidates and single regression is realized and applied to pedestrian detection, double convolution is used for replacing a single convolution kernel, a gradient descent method with momentum is used for learning weight parameters of the network in a training stage, and a cross entropy loss function and a smooth L1 loss function are used as loss functions of a classifier and position regression;
thirdly, fine-tuning weight parameters of a network higher layer on the VOC data set and the small sample pedestrian data set through secondary transfer learning;
fourthly, extracting the features of the multi-scale pyramid images based on consistent phases, and extracting the contour features of the pedestrian images to obtain a multi-scale pyramid feature map;
and fifthly, adopting a balanced focal loss function in place of the cross entropy loss function to measure the classification accuracy of the target.
2. The pedestrian target detection and identification method based on monocular vision and deep learning of claim 1, wherein in the first step the pedestrian data set is organized as follows: the labeled source domain S = {(I_si, V_si), i = 1, …, N_s} contains N_s pedestrian image–video pairs, I_si ∈ R^P being the i-th pedestrian image of the source domain and corresponding to the source-domain pedestrian video V_si ∈ R^P; in the same way, the unlabeled pedestrian images and pedestrian videos in the target domain T are denoted {I_ti, i = 1, …, N_t} and {V_ti, i = 1, …, N_t}, respectively. A triplet network is constructed so that the distance between a target pedestrian video and an image of the same pedestrian is smaller than the distance between that video and an image of a different pedestrian; the triplet loss is defined as:

L_trip = max(0, ||f_3d(V_a) − f_2d(I_p)||² − ||f_3d(V_a) − f_2d(I_n)||² + m),

where V_a, I_p and I_n come from the source domain, m is the margin, f_2d denotes the 2D image feature extraction sub-network composed of several 2D convolution layers, and f_3d denotes the 3D video feature extraction sub-network composed of several 3D convolution layers.
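The hinge-style triplet loss of claim 2 can be sketched in a few lines, assuming `v_anchor`, `i_pos` and `i_neg` are feature vectors already produced by the 3D and 2D sub-networks (the sub-networks themselves are not reproduced here), and an illustrative margin of 0.2:

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(v_anchor, i_pos, i_neg, margin=0.2):
    """Pull the matching image feature i_pos toward the anchor video feature,
    push the non-matching image feature i_neg away, up to the margin."""
    return max(0.0, sq_dist(v_anchor, i_pos) - sq_dist(v_anchor, i_neg) + margin)
```

When the positive pair is already closer than the negative pair by more than the margin, the loss is zero and the triplet contributes no gradient.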
3. The pedestrian target detection and identification method based on monocular vision and deep learning as claimed in claim 1, wherein in the second step the depth-feature-based target detection algorithm first truncates the AlexNet model to obtain its convolution layers, then obtains the CNN model parameters through transfer learning for extracting rich high-level features; specifically, the depth features are obtained with the first 5 convolution layers of the model, and the global detector and local detectors of the DPM are then obtained by training latent-variable support vector machines on each layer of the feature pyramid; in the detection process, a global feature map and local feature maps are constructed for a test image, the local feature maps are pooled and then concatenated with the global feature map to obtain a new feature map, and the trained discriminative model is convolved with the concatenated feature map to obtain the detection result.
4. The pedestrian target detection and identification method based on monocular vision and deep learning of claim 3, further comprising:
first, depth features are extracted from an image pyramid with the truncated CNN containing 5 convolution layers to form a 7-level, 256-channel depth feature pyramid; then the 256-channel feature layers of each level are convolved with the initialized global detector and local detectors, respectively; the feature map obtained from the local-detector convolution is pooled and concatenated with the global-detector feature map to form a cascaded feature map; finally, a convolution of the target geometric filter with the cascaded feature map yields the final single-component model detection score.
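The scoring cascade of claim 4 can be illustrated with a toy 1-D sketch, assuming single-channel features, a single local detector, and a 1×1 geometric filter with weights `geom_w` (all simplifications of the multi-channel pyramid described above):

```python
def correlate1d(signal, kernel):
    """Valid cross-correlation of a 1-D feature row with a detector filter."""
    n = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(n))
            for i in range(len(signal) - n + 1)]

def max_pool1d(signal, k=1):
    """Max pooling over a window of radius k around each position."""
    return [max(signal[max(0, i - k): i + k + 1]) for i in range(len(signal))]

def detection_scores(features, global_filter, local_filter, geom_w=(1.0, 1.0)):
    g = correlate1d(features, global_filter)              # global detector response
    l = max_pool1d(correlate1d(features, local_filter))   # pooled local response
    # channel-wise cascade followed by a 1x1 geometric filter
    return [geom_w[0] * gi + geom_w[1] * li for gi, li in zip(g, l)]
```

Pooling the local response before the cascade makes the score tolerant to small part displacements, which is the role the local detectors play in the DPM pipeline.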
5. The pedestrian target detection and identification method based on monocular vision and deep learning of claim 4, further comprising: max pooling is expressed as

M_f(x) = max_{r ∈ {−k, …, k}} f(x + r),

where f: G → R is the response function defined on the grid G and M_f: G → R is its max-pooled version; a connection between max pooling and distance-transform pooling can be established by defining the distance function

d(r) = 0 if r ∈ {−k, …, k}, and d(r) = +∞ otherwise,

so that the distance-transform pooling D_f: G → R, defined as

D_f(x) = max_{y ∈ G} ( f(y) − d(x − y) ),

reduces to max pooling;
in the convolution layers, the feature pixel value of the pedestrian image is computed as

x_j^{l,p}(c, r) = f( Σ_m Σ_{(u,v) ∈ K} w_{j,m}^l(u, v) · x_m^{l−1,p}(c + u, r + v) + b_j^l ),

where K = {(u, v) ∈ N² | 0 ≤ u < k_x, 0 ≤ v < k_y}, k_x and k_y respectively represent the length and width of the l-th layer convolution kernel w_j^l, b_j^l is the bias of the j-th feature map of layer l, the variables c and r index the current vertical and horizontal feature pixels respectively, the variables u and v index positions within the convolution kernel, p denotes the p-th training sample, and f is the activation function of the l-th layer; the convolution operation is thus equivalent to convolving the kernel with the input feature map x^{l−1,p}.
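The equivalence between max pooling and distance-transform pooling asserted in claim 5 can be checked numerically. The sketch below implements both definitions over a 1-D grid (the 0/∞ distance function is the one in the claim; the grid and radius are illustrative):

```python
def max_pool(f, x, k):
    """M_f(x) = max over r in {-k,...,k} of f(x + r), clipped to the grid."""
    return max(f[x + r] for r in range(-k, k + 1) if 0 <= x + r < len(f))

def dt_pool(f, x, k):
    """D_f(x) = max over y of f(y) - d(x - y), with d(r) = 0 for |r| <= k
    and d(r) = +inf otherwise; this reduces to max pooling."""
    def d(r):
        return 0.0 if -k <= r <= k else float('inf')
    return max(f[y] - d(x - y) for y in range(len(f)))
```

Because `f(y) - inf = -inf`, every position outside the window drops out of the maximum, leaving exactly the max-pooling window.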
6. The method for detecting and identifying the pedestrian target based on monocular vision and deep learning as claimed in claim 1, wherein the fourth step uses a Laplacian pyramid to establish a multi-resolution pyramid model and complete the multi-scale pyramid representation, uses a phase-congruency algorithm to extract features from each layer of pedestrian images in the pyramid, and finally uses a multi-scale fusion algorithm to fuse the pedestrian images in the multi-scale phase-congruency pyramid from top to bottom, obtaining the phase-congruency feature map of the original pedestrian image.
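The Laplacian-pyramid decomposition named in claim 6 can be sketched on a 1-D signal. This toy version uses pair-averaging instead of Gaussian filtering and nearest-neighbour upsampling (both simplifications), but it preserves the defining property that the pyramid reconstructs the original signal exactly:

```python
def downsample(s):
    """Average adjacent pairs (a crude stand-in for Gaussian blur + decimate)."""
    return [(s[2 * i] + s[2 * i + 1]) / 2.0 for i in range(len(s) // 2)]

def upsample(s):
    """Nearest-neighbour expansion back to twice the length."""
    return [v for v in s for _ in range(2)]

def laplacian_pyramid(signal, levels):
    pyr, cur = [], list(signal)
    for _ in range(levels):
        low = downsample(cur)
        # Laplacian band = detail lost between this scale and the next
        pyr.append([c - u for c, u in zip(cur, upsample(low))])
        cur = low
    pyr.append(cur)  # coarsest residual
    return pyr

def reconstruct(pyr):
    cur = pyr[-1]
    for band in reversed(pyr[:-1]):
        cur = [u + b for u, b in zip(upsample(cur), band)]
    return cur
```

Each band holds the detail at one scale, which is exactly the multi-resolution representation the phase-congruency extraction then operates on, layer by layer.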
7. The pedestrian target detection and identification method based on monocular vision and deep learning of claim 6, wherein extracting features from each layer of pedestrian images in the pyramid with the phase-congruency algorithm and fusing the multi-scale phase-congruency pyramid from top to bottom to obtain the phase-congruency feature map of the original pedestrian image specifically comprises the following steps:
(1) searching the phase-congruency feature map of scale n to obtain the initial position (x0, y0) of a phase-congruency feature point;
(2) in the phase-congruency feature map of scale n−1, searching the 3 × 3 neighbourhood of (x0, y0) for phase-congruency feature points; if such a point exists at position (x, y), then (x, y) is taken as the feature point in the fused phase-congruency feature map; otherwise, the original point is retained;
(3) at the feature point obtained in the fused phase-congruency map, searching for points connected to the feature point at scale n−1, so as to recover detail features not contained in the scale-n fusion map;
(4) searching for the next feature point in the scale-n phase-congruency feature map and repeating steps (1) to (3) until the whole map is processed.
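The coarse-to-fine point fusion in steps (1)–(2) can be sketched as a neighbourhood search over point sets, assuming feature points are given as `(x, y)` tuples with the coarse coordinates already mapped into the finer grid (step (3)'s connected-component refinement is omitted here):

```python
def fuse_feature_points(coarse_points, fine_points):
    """For each scale-n feature point, search its 3x3 neighbourhood in the
    scale n-1 map; a finer point found there replaces it, otherwise the
    coarse point is kept as-is."""
    fused = set()
    for (x0, y0) in coarse_points:
        neighbours = [(x0 + dx, y0 + dy)
                      for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
        refined = [p for p in neighbours if p in fine_points]
        fused.update(refined if refined else {(x0, y0)})
    return fused
```

Distant finer-scale points (outside every 3 × 3 neighbourhood) are ignored, so the fusion keeps only detail that is spatially consistent with the coarser map.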
8. A road traffic monitoring platform applying the pedestrian target detection and identification method based on monocular vision and deep learning of any one of claims 1 to 7.
CN201910991615.8A 2019-10-10 2019-10-10 Pedestrian target detection and identification method based on monocular vision and deep learning Pending CN111027372A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910991615.8A CN111027372A (en) 2019-10-10 2019-10-10 Pedestrian target detection and identification method based on monocular vision and deep learning

Publications (1)

Publication Number Publication Date
CN111027372A true CN111027372A (en) 2020-04-17

Family

ID=70201148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910991615.8A Pending CN111027372A (en) 2019-10-10 2019-10-10 Pedestrian target detection and identification method based on monocular vision and deep learning

Country Status (1)

Country Link
CN (1) CN111027372A (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582126A (en) * 2020-04-30 2020-08-25 浙江工商大学 Pedestrian re-identification method based on multi-scale pedestrian contour segmentation fusion
CN111582126B (en) * 2020-04-30 2024-02-27 浙江工商大学 Pedestrian re-recognition method based on multi-scale pedestrian contour segmentation fusion
CN111709294A (en) * 2020-05-18 2020-09-25 杭州电子科技大学 Express delivery personnel identity identification method based on multi-feature information
CN111709294B (en) * 2020-05-18 2023-07-14 杭州电子科技大学 Express delivery personnel identity recognition method based on multi-feature information
CN111738088A (en) * 2020-05-25 2020-10-02 西安交通大学 Pedestrian distance prediction method based on monocular camera
CN111709336B (en) * 2020-06-08 2024-04-26 杭州像素元科技有限公司 Expressway pedestrian detection method, equipment and readable storage medium
CN111709336A (en) * 2020-06-08 2020-09-25 杭州像素元科技有限公司 Highway pedestrian detection method and device and readable storage medium
CN112052886A (en) * 2020-08-21 2020-12-08 暨南大学 Human body action attitude intelligent estimation method and device based on convolutional neural network
WO2022036777A1 (en) * 2020-08-21 2022-02-24 暨南大学 Method and device for intelligent estimation of human body movement posture based on convolutional neural network
CN112052886B (en) * 2020-08-21 2022-06-03 暨南大学 Intelligent human body action posture estimation method and device based on convolutional neural network
CN112006678A (en) * 2020-09-10 2020-12-01 齐鲁工业大学 Electrocardiogram abnormity identification method and system based on combination of AlexNet and transfer learning
CN112287854A (en) * 2020-11-02 2021-01-29 湖北大学 Building indoor personnel detection method and system based on deep neural network
CN112597802A (en) * 2020-11-25 2021-04-02 中国科学院空天信息创新研究院 Pedestrian motion simulation method based on visual perception network deep learning
CN112733730A (en) * 2021-01-12 2021-04-30 中国石油大学(华东) Oil extraction operation field smoke suction personnel identification processing method and system
CN112733730B (en) * 2021-01-12 2022-11-18 中国石油大学(华东) Oil extraction operation field smoke suction personnel identification processing method and system
CN113257008A (en) * 2021-05-12 2021-08-13 兰州交通大学 Pedestrian flow dynamic control system and method based on deep learning
CN117876711A (en) * 2024-03-12 2024-04-12 金锐同创(北京)科技股份有限公司 Image target detection method, device, equipment and medium based on image processing

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200417