CN111027372A - Pedestrian target detection and identification method based on monocular vision and deep learning - Google Patents

Pedestrian target detection and identification method based on monocular vision and deep learning

Info

Publication number
CN111027372A
CN111027372A (application CN201910991615.8A)
Authority
CN
China
Prior art keywords
pedestrian
feature
image
pyramid
phase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910991615.8A
Other languages
Chinese (zh)
Inventor
任清元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Vocational College of Industry
Original Assignee
Shandong Vocational College of Industry
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Vocational College of Industry
Priority to CN201910991615.8A
Publication of CN111027372A
Legal status: Pending

Classifications

    • G06V 40/103 — Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06F 18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045 — Combinations of networks
    • G06N 3/08 — Learning methods
    • G06V 20/53 — Recognition of crowd images, e.g. recognition of crowd congestion

Abstract

The invention belongs to the technical field of pedestrian target detection, and discloses a pedestrian target detection and identification method based on monocular vision and deep learning. The method establishes a small-sample pedestrian data set by collecting road pedestrian images in real scenes; performs pedestrian detection with a depth-feature-based target detection algorithm built on whole-image candidates and single regression; fine-tunes the weight parameters of the higher network layers on the VOC data set and the small-sample pedestrian data set through secondary transfer learning; extracts phase-consistency features from the multi-scale pyramid images and extracts the contour features of the pedestrian images to obtain a multi-scale pyramid feature map; and adopts a balanced focal loss function in place of the cross entropy loss function to measure the classification accuracy of the target. The method uses a CNN to obtain depth features and trains a deformable part model, effectively improving detection precision; transfer learning is introduced, and by analyzing the hidden layers of the AlexNet model the accuracy of pedestrian target detection and identification is improved.

Description

Pedestrian target detection and identification method based on monocular vision and deep learning
Technical Field
The invention belongs to the technical field of pedestrian target detection, and particularly relates to a pedestrian target detection and identification method based on monocular vision and deep learning.
Background
The state of the art commonly used in the industry is as follows: pedestrian detection has important applications in intelligent vehicles, robots, video surveillance and related directions; because pedestrian pose is variable and is influenced by factors such as illumination, background, clothing and occlusion, pedestrian detection remains a challenging topic in the field of computer vision. At present, pedestrian detection based on computer vision mostly follows the approach of feature extraction plus machine learning. For feature extraction, features such as contours, textures, frequency-domain information and color regions are commonly used to describe the difference between pedestrians and the background: features such as HOG, EOH, Edgelet and Shapelet describe pedestrian contours, LBP features describe pedestrian textures, CSS features describe the structural regions of the human body using the color similarity between local parts, and Haar wavelet features describe the frequency-domain information of pedestrians. Among these, the HOG feature describes the gradient strength and gradient-direction distribution of local image regions, represents pedestrian appearance and shape well, is insensitive to illumination and small offsets, has shown excellent performance in pedestrian detection, and has become the mainstream pedestrian detection method at present.
In summary, the problems of the prior art are as follows: traffic conditions under real road cameras are complex, pedestrian flow is dense and unevenly distributed, causing class imbalance among the samples; these conditions differ greatly from those in public data sets, so the characteristics of the different classes cannot be learned in a balanced way during model training, and the detection performance on the remaining classes is poor.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a pedestrian target detection and identification method based on monocular vision and deep learning.
The invention is realized in such a way that a pedestrian target detection and identification method based on monocular vision and deep learning comprises the following steps:
the method comprises the steps of firstly, establishing a small sample pedestrian data set, and collecting road pedestrian images in a real scene; respectively extracting the characteristics of the image and video data sets in the source domain by using the improved triple network; analyzing the common edge information of the pedestrians based on the posterior HOG characteristic of the gradient characteristic energy which is the statistical embodiment of the gradient information of the positive samples of a large number of pedestrians;
secondly, a depth feature-based target detection algorithm based on whole image candidates and single regression is realized and applied to pedestrian detection, double convolution is used for replacing a single convolution kernel, a gradient descent method with momentum is used for learning weight parameters of the network in a training stage, and a cross entropy loss function and a smooth L1 loss function are used as loss functions of a classifier and position regression;
thirdly, fine-tuning weight parameters of a network higher layer on the VOC data set and the small sample pedestrian data set through secondary transfer learning;
fourthly, extracting the features of the multi-scale pyramid images based on consistent phases, and extracting the contour features of the pedestrian images to obtain a multi-scale pyramid feature map;
and fifthly, adopting a balanced focal loss function in place of the cross entropy loss function to measure the classification accuracy of the target.
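The two loss functions named in the second step — cross entropy for classification and smooth L1 for position regression — can be illustrated with a minimal NumPy sketch (the function names and toy values are illustrative, not taken from the patent):

```python
import numpy as np

def cross_entropy(probs, label):
    """Cross-entropy loss for one sample; probs are softmax outputs."""
    return -np.log(probs[label])

def smooth_l1(x):
    """Smooth L1 loss, elementwise: quadratic near zero, linear elsewhere."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

# Classifier output over {background, pedestrian} and two box-offset errors
probs = np.array([0.2, 0.8])
cls_loss = cross_entropy(probs, label=1)           # -log(0.8)
reg_loss = smooth_l1(np.array([0.3, -2.0])).sum()  # 0.5*0.3^2 + (2.0 - 0.5)
```

The quadratic region of smooth L1 keeps gradients bounded for small regression errors, while the linear region avoids exploding gradients for outlier boxes.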
Further, for the pedestrian data set in the first step, the labeled source domain S = {(I_si, V_si)}_{i=1}^{N_s} contains N_s pedestrian image-video pairs, where I_si ∈ R^P is the i-th pedestrian image of the source domain and V_si ∈ R^P is the corresponding pedestrian video in the source domain; in the same way, the unlabeled pedestrian images and pedestrian videos of the target domain T are denoted {I_ti} and {V_ti}, respectively. The triplet network is constructed so that the distance between a video of the target pedestrian and an image of the same pedestrian is smaller than the distance between that video and an image of another pedestrian; the triplet loss is defined as

    L_tri = max(0, ||f_3d(V_a) - f_2d(I_p)||^2 - ||f_3d(V_a) - f_2d(I_n)||^2 + α)

where the anchor V_a, positive example I_p and negative example I_n come from the source domain X, f_2d denotes the 2D image feature extraction sub-network composed of several 2D convolution layers, and f_3d the 3D video feature extraction sub-network composed of several 3D convolution layers.
Furthermore, in the second step the AlexNet model is first truncated to obtain its convolution layers, and CNN model parameters are then obtained through transfer learning to extract rich high-level features; specifically, the first 5 convolution layers of the model are used to obtain depth features, and a latent-variable support vector machine is then trained with each layer of features of the feature pyramid to obtain the global detector and local detectors of the DPM. In the detection process, a global feature map and local feature maps are constructed for the test image; the local feature maps are pooled and then concatenated with the global feature map to obtain a new feature map, and the trained discriminant model is convolved with the concatenated feature map to obtain the detection result.
Further comprising:
firstly, depth features are extracted from the image pyramid by the truncated CNN containing 5 convolution layers, forming a 7-level, 256-channel depth feature pyramid; the initialized global detector and local detectors are then convolved with the 256-channel feature layers of each level respectively; the feature maps produced by the local detectors are pooled and concatenated with the global detector feature map to form a cascaded feature map; finally, a convolution of the target geometric filter with the cascaded feature map yields the final single-component model detection score.
Further comprising: the max pooling operation is equivalent to

    M_f(x) = max_{r ∈ {-k, …, k}} f(x + r),

where f: G → R is the input function and M_f: G → R its max-pooled version. Defining the deformation cost d_max by d_max(r) = 0 for r ∈ {-k, …, k} and d_max(r) = ∞ otherwise establishes the connection between max pooling and distance-transform pooling: the distance-transform pooling D_f: G → R of f is defined as

    D_f(x) = max_{y ∈ G} ( f(y) - d(x - y) ),

and with d = d_max it coincides with max pooling M_f.
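The max-pooling / distance-transform equivalence described above can be checked numerically. The following NumPy sketch (1-D case for brevity; names are illustrative) compares max pooling against distance-transform pooling under the cost d_max:

```python
import numpy as np

def max_pool_1d(f, k):
    """M_f(x) = max over r in {-k,...,k} of f(x+r), with edge clipping."""
    n = len(f)
    return np.array([f[max(0, x - k):min(n, x + k + 1)].max() for x in range(n)])

def dt_pool_1d(f, d):
    """Distance-transform pooling: D_f(x) = max_y (f(y) - d(x - y))."""
    n = len(f)
    return np.array([max(f[y] - d(x - y) for y in range(n)) for x in range(n)])

k = 1
d_max = lambda r: 0.0 if -k <= r <= k else np.inf  # deformation cost of max pooling
f = np.array([0.2, 1.0, 0.3, 0.7, 0.1])
# With d = d_max the two pooling operators produce identical outputs
```

The brute-force D_f here is O(n^2); in practice the generalized distance transform computes it in linear time, which is the point of the equivalence.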
in the convolutional layer, the feature pixel value of the pedestrian image is calculated as

    y_j^{l,p}(c, r) = f( Σ_i Σ_{(u,v) ∈ K} w_{ij}^l(u, v) · y_i^{l-1,p}(c + u, r + v) + b_j^l ),

where K = {(u, v) ∈ N^2 | 0 ≤ u < k_x; 0 ≤ v < k_y}, with k_x and k_y the length and width of the l-th layer convolution kernel w_{ij}^l; b_j^l is the offset of the j-th feature map of layer l; the variables c and r denote the current longitudinal and transverse feature pixel, while u and v index positions within the convolution kernel; p indicates the p-th training sample; and f is the activation function of the l-th layer. The operation is equivalent to convolving the convolution kernel with the input feature map y_i^{l-1,p}.
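The feature-pixel formula can be sketched for a single input channel and a single kernel as follows (a minimal NumPy illustration under an assumed ReLU activation; not the patent's implementation):

```python
import numpy as np

def conv_feature_map(x, w, b, act=lambda z: np.maximum(z, 0.0)):
    """Valid convolution implementing
    y(c, r) = f( sum_{(u,v) in K} w(u, v) * x(c+u, r+v) + b )
    for one input channel and one kernel."""
    kx, ky = w.shape
    H, W = x.shape
    out = np.zeros((H - kx + 1, W - ky + 1))
    for c in range(out.shape[0]):
        for r in range(out.shape[1]):
            out[c, r] = act(np.sum(w * x[c:c + kx, r:r + ky]) + b)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((2, 2)) / 4.0      # 2x2 averaging kernel
y = conv_feature_map(x, w, b=0.0)
# y[0, 0] = (0 + 1 + 4 + 5) / 4 = 2.5
```

A full layer would add the sum over input feature maps i and one kernel per (input, output) map pair, exactly as in the formula above.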
And further, establishing a multi-resolution pyramid model by using a Laplacian pyramid to finish multi-scale pyramid representation, extracting features of each layer of pedestrian images in the pyramid by using a phase-consistent algorithm, and finally fusing the pedestrian images in the multi-scale phase-consistent pyramid from top to bottom by using a multi-scale fusion algorithm to obtain an original pedestrian image phase-consistent feature map.
Further comprising: extracting features of each layer of pedestrian images in the pyramid by using a phase consistency algorithm, and finally fusing the pedestrian images in the multi-scale phase consistency pyramid from top to bottom by using a multi-scale fusion algorithm to obtain an original pedestrian image phase consistency feature map specifically comprises the following steps:
(1) search the phase consistency feature map of scale n to obtain the initial position (x0, y0) of a phase-consistent feature point;
(2) in the phase consistency feature map of scale n-1, search for a phase-consistent feature point within the 3×3 neighborhood of (x0, y0); if such a point exists at some position (x, y), then position (x, y) in the fused phase consistency feature map is a feature point; otherwise the original point is retained;
(3) at each feature point of the fused phase consistency map, search for the points connected to the feature point at scale n-1 to obtain detail features not contained in the fused map of scale n;
(4) search for the next feature point in the scale-n phase consistency feature map and repeat steps (1)-(3) until the whole map has been processed.
The invention further aims to provide a road traffic monitoring platform applying the pedestrian target detection and identification method based on monocular vision and deep learning.
In summary, the advantages and positive effects of the invention are: the method has the advantages that an integrated convolutional neural network target detection model based on whole image candidates and single regression is realized, double convolution is used for replacing a single convolution kernel, an internal competition mechanism is used for replacing an activation layer to realize nonlinearity of the network, the network parameter quantity is reduced, the abstract capability of features on the target is improved, and the end-to-end real-time detection of pedestrian information is realized. In the training stage, learning weight parameters of the network by using a gradient descent method with momentum, and taking a cross entropy loss function and a smooth L1 loss function as a classifier and a loss function of position regression; fine-tuning weight parameters of higher layers of the network on the VOC data set and the small sample road pedestrian data set through secondary transfer learning, enhancing the feature representation capability of the model, and improving the average accuracy by 12 percentage points; aiming at the problem of missing detection of pedestrians, a feature diagram pyramid with transverse connection is introduced, context information in feature representation is increased to a certain extent, a newly-added pyramid is attached to an original network in a jumping transmission mode, the original network structure is not changed, only a small number of weight parameters and calculation operation are added, and the overall detection speed of a pedestrian algorithm is not significantly influenced.
The method utilizes the CNN to obtain the depth characteristics, trains the deformable component model and effectively improves the detection precision of the algorithm. The concept of Transfer Learning (TL) is introduced, and the hidden layer in the AlexNet model is analyzed to discover that the function of the bottom layer is the extraction of the general image features, and the depth features of the pedestrian image are generated at the high layer, so that the accuracy of pedestrian target detection and identification is improved.
Drawings
Fig. 1 is a flowchart of a pedestrian target detection and identification method based on monocular vision and deep learning according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The following detailed description of the principles of the invention is provided in connection with the accompanying drawings.
As shown in fig. 1, the pedestrian target detection and identification method based on monocular vision and deep learning provided by the embodiment of the present invention includes the following steps:
s101: establishing a small sample pedestrian data set, and collecting road pedestrian images in a real scene; respectively extracting the characteristics of the image and video data sets in the source domain by using the improved triple network; based on the posterior HOG characteristics of the gradient characteristic energy, the gradient characteristic energy is the statistical embodiment of the gradient information of a large number of pedestrian positive samples, and the common edge information of pedestrians can be analyzed;
s102: the method comprises the steps of realizing a depth feature-based target detection algorithm based on whole image candidates and single regression, applying the algorithm to pedestrian detection, replacing a single convolution kernel with double convolution, learning weight parameters of a network by using a gradient descent method with momentum in a training stage, and taking a cross entropy loss function and a smooth L1 loss function as a classifier and a position regression loss function;
s103: fine-tuning weight parameters of a network higher layer on the VOC data set and the small sample pedestrian data set through secondary transfer learning;
s104: extracting the features of the multi-scale pyramid images based on the consistent phase, and extracting the contour features of the pedestrian images to obtain a multi-scale pyramid feature map;
s105: and (4) adopting a balanced focus loss function to replace a cross entropy loss function to measure the classification accuracy of the target.
In a preferred embodiment of the present invention, step S101 specifically includes: suppose the labeled source domain S = {(I_si, V_si)}_{i=1}^{N_s} contains N_s pedestrian image-video pairs, where I_si ∈ R^P is the i-th pedestrian image of the source domain and V_si ∈ R^P is the corresponding pedestrian video in the source domain. In the same way, the unlabeled pedestrian images and pedestrian videos of the target domain T are denoted {I_ti} and {V_ti}, respectively. Since the video features of pedestrians tend to contain richer information than the image features, the triplet network is constructed so that the distance between a video of the target pedestrian and an image of the same pedestrian (positive example) is smaller than the distance between that video and an image of another pedestrian (negative example). The triplet loss is defined as

    L_tri = max(0, ||f_3d(V_a) - f_2d(I_p)||^2 - ||f_3d(V_a) - f_2d(I_n)||^2 + α)

where the anchor V_a, positive example I_p and negative example I_n come from the source domain X, f_2d denotes the 2D image feature extraction sub-network composed of several 2D convolution layers, and f_3d the 3D video feature extraction sub-network composed of several 3D convolution layers.
To make the model converge faster, the more "difficult" triplets tend to be selected: for a given anchor video V_a, the positive image I_p is chosen to maximize ||f_3d(V_a) - f_2d(I_p)||^2 (the hardest positive), and the negative image I_n is chosen to minimize ||f_3d(V_a) - f_2d(I_n)||^2 (the hardest negative). In particular, an online triplet generator and mini-batches containing more samples are used, but only the hardest positive and hardest negative samples within each batch are computed.
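The online hard-triplet selection described above can be sketched as follows (a minimal NumPy illustration with made-up features; the sub-networks f_2d and f_3d are replaced by precomputed feature vectors, and the margin value is an assumption):

```python
import numpy as np

def hardest_triplet(video_feat, img_feats, labels, anchor_label):
    """For one anchor video feature and a batch of image features, pick the
    hardest positive (farthest same-identity image) and the hardest
    negative (closest other-identity image)."""
    d = np.linalg.norm(img_feats - video_feat, axis=1) ** 2
    pos = np.where(labels == anchor_label)[0]
    neg = np.where(labels != anchor_label)[0]
    return pos[np.argmax(d[pos])], neg[np.argmin(d[neg])]

def triplet_loss(a, p, n, margin=0.3):
    """Hinge-form triplet loss on already-extracted features."""
    return max(0.0, np.sum((a - p) ** 2) - np.sum((a - n) ** 2) + margin)

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))          # stand-in for f_2d outputs of a batch
labels = np.array([0, 0, 0, 1, 1, 1])    # two identities in the batch
anchor = feats[0]                        # stand-in for an f_3d video feature
ip, in_ = hardest_triplet(anchor, feats, labels, anchor_label=0)
loss = triplet_loss(anchor, feats[ip], feats[in_])
```

Mining only within the mini-batch keeps the generator online: no global nearest-neighbor search over the whole data set is needed per update.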
In a preferred embodiment of the present invention, in the depth-feature-based target detection algorithm of step S102, the AlexNet model is first truncated to obtain its convolution layers, and CNN model parameters are then obtained through transfer learning to extract rich high-level features; specifically, depth features are obtained with the first 5 convolution layers of the model, and a latent-variable support vector machine (LSVM) is then trained with each layer of features of the feature pyramid to obtain the global detector and local detectors of the DPM. In the detection process, a global feature map and local feature maps are constructed for the test image; the local feature maps are pooled and then concatenated with the global feature map to obtain a new feature map, and the trained discriminant model is convolved with the concatenated feature map to obtain the detection result.
Firstly, depth features are extracted from the image pyramid by the truncated CNN containing 5 convolution layers, forming a 7-level, 256-channel depth feature pyramid; the initialized global detector and local detectors are then convolved with the 256-channel feature layers of each level respectively; the feature maps produced by the local detectors are pooled and concatenated with the global detector feature map to form a cascaded feature map; finally, a convolution of the target geometric filter with the cascaded feature map yields the final single-component model detection score.
The max pooling operation is equivalent to

    M_f(x) = max_{r ∈ {-k, …, k}} f(x + r),

where f: G → R is the input function and M_f: G → R its max-pooled version. Defining the deformation cost d_max by d_max(r) = 0 for r ∈ {-k, …, k} and d_max(r) = ∞ otherwise establishes the connection between max pooling and distance-transform pooling: the distance-transform pooling D_f: G → R of f is defined as

    D_f(x) = max_{y ∈ G} ( f(y) - d(x - y) ),

and with d = d_max it coincides with max pooling M_f.
in the convolutional layer, the feature pixel value of the pedestrian image is calculated as

    y_j^{l,p}(c, r) = f( Σ_i Σ_{(u,v) ∈ K} w_{ij}^l(u, v) · y_i^{l-1,p}(c + u, r + v) + b_j^l ),

where K = {(u, v) ∈ N^2 | 0 ≤ u < k_x; 0 ≤ v < k_y}, with k_x and k_y the length and width of the l-th layer convolution kernel w_{ij}^l; b_j^l is the offset of the j-th feature map of layer l; the variables c and r denote the current longitudinal and transverse feature pixel, while u and v index positions within the convolution kernel; p indicates the p-th training sample; and f is the activation function of the l-th layer. The operation is equivalent to convolving the convolution kernel with the input feature map y_i^{l-1,p}.
In a preferred embodiment of the present invention, step S104 specifically includes: establishing a multi-resolution pyramid model by using a Laplacian Pyramid (LP), completing multi-scale pyramid representation, extracting features of pedestrian images in each layer of the pyramid by using a phase-consistent algorithm, and finally fusing the pedestrian images in the multi-scale phase-consistent pyramid from top to bottom by using a multi-scale fusion algorithm to obtain an original pedestrian image phase-consistent feature map.
The Laplacian pyramid decomposition is an image decomposition method that decomposes the original pedestrian image into different spatial scales; the Laplacian pyramid is constructed as follows:
(1) take the original pedestrian image G0 as the bottom layer of the Gaussian pyramid;
(2) filter the original pedestrian image with the Gaussian low-pass filter G and downsample it by removing alternate rows and columns to obtain the low-pass pedestrian image, which is the first layer G1 of the Gaussian pyramid;
(3) expand G1 by interpolation (upsampling) and filter it with the band-pass filter H to obtain G'1, then take the difference between it and the original pedestrian image to obtain the band-pass component, the zeroth layer LP0 of the Laplacian pyramid; here the low-pass filter G and the band-pass filter H are normalized filters;
(4) the next level of decomposition is performed on the low-pass Gaussian pyramid image just obtained, and the multi-scale decomposition is completed by iteration, which can be expressed by the formulas

    G_l(i, j) = Σ_m Σ_n G(m, n) · G_{l-1}(2i + m, 2j + n)

    G'_l(i, j) = 4 Σ_m Σ_n G(m, n) · G_l((i + m)/2, (j + n)/2)   (summing only over integer-valued coordinates)

    LP_{l-1}(i, j) = G_{l-1}(i, j) - G'_l(i, j),

where G(m, n) is the generating kernel of the low-pass filter, l is the level of decomposition of the Gaussian pyramid G and the Laplacian pyramid LP, and i and j are the row and column indices of the l-th pyramid level. The pyramid formed by G0, G1, …, Gn is the Gaussian pyramid.
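The iterative pyramid construction can be sketched as follows (a minimal NumPy illustration that substitutes a 2×2 box filter for the Gaussian low-pass filter G and nearest-neighbour upsampling for the band-pass expansion H; all names are illustrative):

```python
import numpy as np

def downsample(img):
    """2x2 box filter followed by taking every other row/column -
    a simple stand-in for the Gaussian low-pass filter G."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] +
                   img[0::2, 1::2] + img[1::2, 1::2])

def upsample(img, shape):
    """Nearest-neighbour expansion back to `shape` (stand-in for H)."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)[:shape[0], :shape[1]]

def laplacian_pyramid(g0, levels):
    """LP_{l-1} = G_{l-1} - expand(G_l); the top level stores the residual."""
    gauss, lp = [g0], []
    for _ in range(levels):
        gauss.append(downsample(gauss[-1]))
        lp.append(gauss[-2] - upsample(gauss[-1], gauss[-2].shape))
    lp.append(gauss[-1])  # low-pass residual at the top
    return lp

img = np.arange(64, dtype=float).reshape(8, 8)
pyr = laplacian_pyramid(img, levels=2)
# 2 band-pass levels plus the residual: shapes (8,8), (4,4), (2,2)
```

Because each band-pass level stores the exact difference against the expanded next level, the original image is recovered perfectly by summing from the top down, which is why the pyramid suits multi-scale feature fusion.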
in a preferred embodiment of the present invention, extracting features from each layer of pedestrian images in the pyramid by using a phase-consistent algorithm, and finally fusing the pedestrian images in the multi-scale phase-consistent pyramid from top to bottom by using a multi-scale fusion algorithm to obtain an original pedestrian image phase-consistent feature map specifically includes:
(1) search the phase consistency feature map of scale n to obtain the initial position (x0, y0) of a phase-consistent feature point;
(2) in the phase consistency feature map of scale n-1, search for a phase-consistent feature point within the 3×3 neighborhood of (x0, y0); if such a point exists at some position (x, y), then position (x, y) in the fused phase consistency feature map is a feature point; otherwise the original point is retained;
(3) at each feature point of the fused phase consistency map, search for the points connected to the feature point at scale n-1 to obtain detail features not contained in the fused map of scale n;
(4) search for the next feature point in the scale-n phase consistency feature map and repeat steps (1)-(3) until the whole map has been processed.
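Steps (1)-(2) of the fusion can be sketched as follows (a simplified NumPy illustration on two same-resolution binary feature-point maps; the scale-space bookkeeping of steps (3)-(4) is omitted, and all names are illustrative):

```python
import numpy as np

def fuse_feature_maps(coarse, fine):
    """For each feature point in the coarse-scale map, look for a feature
    point in its 3x3 neighbourhood in the finer-scale map; if one is found,
    mark that finer-scale position in the fused map."""
    fused = np.zeros_like(coarse)
    for x0, y0 in zip(*np.nonzero(coarse)):
        nb = fine[max(0, x0 - 1):x0 + 2, max(0, y0 - 1):y0 + 2]
        if nb.any():
            # take the first matching fine-scale position as the fused point
            dx, dy = np.argwhere(nb)[0]
            fused[max(0, x0 - 1) + dx, max(0, y0 - 1) + dy] = 1
    return fused

coarse = np.zeros((5, 5), dtype=int); coarse[2, 2] = 1
fine = np.zeros((5, 5), dtype=int);   fine[1, 3] = 1
# The coarse point at (2, 2) is matched to the fine point at (1, 3)
```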
The application effect of the present invention will be described in detail with reference to the simulation.
Training samples of pedestrian images and videos were acquired in different road environments, and a sample library containing 2000 positive samples of pedestrian head-and-shoulders was established. The program was written with Intel's open-source computer vision library OpenCV and adopts the SVM classification principle: a classifier is trained to obtain the classification model. When performing pedestrian detection on a video sequence, the trained classifier is imported and multi-scale detection is carried out on the frame under test with a sliding window; it is then judged whether an accurate pedestrian position has been obtained, and if the detection rate does not meet the requirement or the false detection rate is too high, the classifier is retrained; if the detection is accurate, the detected pedestrian is marked with a rectangular frame. The hardware environment of the experiment is an Intel i3-4130 (3.40 GHz) CPU with 2 GB of memory; to preserve the real-time requirement of the video, the frame rate of image acquisition and transmission is 20-30 frames/s. When there are few moving targets in the detection area, the average detection time per picture is 30 ms; when there are many moving targets, it is 80 ms, which meets the real-time requirement. Under small pose changes and small angle deviations, the detection rate of the method reaches 95% while keeping a low false detection rate.
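The multi-scale sliding-window loop described above can be sketched as follows (a minimal NumPy illustration; the trained SVM decision function is replaced by a toy intensity test, the resize is nearest-neighbour, and all names are assumptions):

```python
import numpy as np

def resize_nn(img, s):
    """Nearest-neighbour rescaling (stand-in for a proper image resize)."""
    h, w = max(1, int(img.shape[0] * s)), max(1, int(img.shape[1] * s))
    yi = (np.arange(h) / s).astype(int)
    xi = (np.arange(w) / s).astype(int)
    return img[yi][:, xi]

def sliding_window_detect(image, classify, win=(16, 8), stride=4,
                          scales=(1.0, 0.5)):
    """Slide a fixed window over each rescaled frame; `classify` stands in
    for the trained SVM's decision on a window. Returns (y, x, h, w)
    boxes mapped back to original-image coordinates."""
    detections = []
    for s in scales:
        scaled = resize_nn(image, s)
        h, w = scaled.shape
        if h < win[0] or w < win[1]:
            continue
        for y in range(0, h - win[0] + 1, stride):
            for x in range(0, w - win[1] + 1, stride):
                if classify(scaled[y:y + win[0], x:x + win[1]]):
                    detections.append((int(y / s), int(x / s),
                                       int(win[0] / s), int(win[1] / s)))
    return detections

# Toy frame: a bright 16x8 region plays the role of a pedestrian, and the
# "classifier" fires on windows with high mean intensity
img = np.zeros((32, 32))
img[8:24, 8:16] = 1.0
boxes = sliding_window_detect(img, classify=lambda p: p.mean() > 0.9)
```

In a real pipeline the window would carry HOG features into the trained SVM and overlapping hits would be merged by non-maximum suppression.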
Table 1. Comparison of the conventional method with the method of the invention (the table is provided as an image in the original publication).
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (8)

1. A pedestrian target detection and identification method based on monocular vision and deep learning is characterized in that the pedestrian target detection and identification method based on monocular vision and deep learning comprises the following steps:
the method comprises the steps of firstly, establishing a small sample pedestrian data set, and collecting road pedestrian images in a real scene; respectively extracting the characteristics of the image and video data sets in the source domain by using the improved triple network; analyzing the common edge information of the pedestrians based on the posterior HOG characteristic of the gradient characteristic energy which is the statistical embodiment of the gradient information of the positive samples of a large number of pedestrians;
secondly, a depth feature-based target detection algorithm based on whole image candidates and single regression is realized and applied to pedestrian detection, double convolution is used for replacing a single convolution kernel, a gradient descent method with momentum is used for learning weight parameters of the network in a training stage, and a cross entropy loss function and a smooth L1 loss function are used as loss functions of a classifier and position regression;
thirdly, fine-tuning weight parameters of a network higher layer on the VOC data set and the small sample pedestrian data set through secondary transfer learning;
fourthly, extracting the features of the multi-scale pyramid images based on consistent phases, and extracting the contour features of the pedestrian images to obtain a multi-scale pyramid feature map;
and fifthly, adopting a balanced focal loss function in place of the cross entropy loss function to measure the classification accuracy of the target.
2. The pedestrian target detection and identification method based on monocular vision and deep learning of claim 1, wherein in the first step the pedestrian data set is organized as follows: the labeled source domain S = {(I_si, V_si), i = 1, …, N_s} contains N_s pedestrian image–video pairs, I_si ∈ R^P being the i-th pedestrian image of the source domain and corresponding to the source-domain pedestrian video V_si ∈ R^P; in the same way, the unlabeled pedestrian images and pedestrian videos in the target domain T are denoted {I_ti, i = 1, …, N_t} and {V_ti, i = 1, …, N_t}, respectively. A triplet network is constructed so that the distance between a target pedestrian video and an image of the same pedestrian is smaller than the distance between that video and an image of a different pedestrian; the triplet loss is defined as:

L_trip = max(0, ||f_3d(V_a) − f_2d(I_p)||² − ||f_3d(V_a) − f_2d(I_n)||² + m),

where V_a, I_p and I_n come from the source domain, m is the margin, f_2d denotes the 2D image feature extraction sub-network composed of several 2D convolution layers, and f_3d denotes the 3D video feature extraction sub-network composed of several 3D convolution layers.
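The hinge-style triplet loss of claim 2 can be sketched in a few lines, assuming `v_anchor`, `i_pos` and `i_neg` are feature vectors already produced by the 3D and 2D sub-networks (the sub-networks themselves are not reproduced here), and an illustrative margin of 0.2:

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(v_anchor, i_pos, i_neg, margin=0.2):
    """Pull the matching image feature i_pos toward the anchor video feature,
    push the non-matching image feature i_neg away, up to the margin."""
    return max(0.0, sq_dist(v_anchor, i_pos) - sq_dist(v_anchor, i_neg) + margin)
```

When the positive pair is already closer than the negative pair by more than the margin, the loss is zero and the triplet contributes no gradient.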
3. The pedestrian target detection and identification method based on monocular vision and deep learning as claimed in claim 1, wherein in the second step the depth-feature-based target detection algorithm first truncates the AlexNet model to obtain its convolution layers, then obtains the CNN model parameters through transfer learning for extracting rich high-level features; specifically, the depth features are obtained with the first 5 convolution layers of the model, and the global detector and local detectors of the DPM are then obtained by training latent-variable support vector machines on each layer of the feature pyramid; in the detection process, a global feature map and local feature maps are constructed for a test image, the local feature maps are pooled and then concatenated with the global feature map to obtain a new feature map, and the trained discriminative model is convolved with the concatenated feature map to obtain the detection result.
4. The pedestrian target detection and identification method based on monocular vision and deep learning of claim 3, further comprising:
first, depth features are extracted from an image pyramid with the truncated CNN containing 5 convolution layers to form a 7-level, 256-channel depth feature pyramid; then the 256-channel feature layers of each level are convolved with the initialized global detector and local detectors, respectively; the feature map obtained from the local-detector convolution is pooled and concatenated with the global-detector feature map to form a cascaded feature map; finally, a convolution of the target geometric filter with the cascaded feature map yields the final single-component model detection score.
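The scoring cascade of claim 4 can be illustrated with a toy 1-D sketch, assuming single-channel features, a single local detector, and a 1×1 geometric filter with weights `geom_w` (all simplifications of the multi-channel pyramid described above):

```python
def correlate1d(signal, kernel):
    """Valid cross-correlation of a 1-D feature row with a detector filter."""
    n = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(n))
            for i in range(len(signal) - n + 1)]

def max_pool1d(signal, k=1):
    """Max pooling over a window of radius k around each position."""
    return [max(signal[max(0, i - k): i + k + 1]) for i in range(len(signal))]

def detection_scores(features, global_filter, local_filter, geom_w=(1.0, 1.0)):
    g = correlate1d(features, global_filter)              # global detector response
    l = max_pool1d(correlate1d(features, local_filter))   # pooled local response
    # channel-wise cascade followed by a 1x1 geometric filter
    return [geom_w[0] * gi + geom_w[1] * li for gi, li in zip(g, l)]
```

Pooling the local response before the cascade makes the score tolerant to small part displacements, which is the role the local detectors play in the DPM pipeline.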
5. The pedestrian target detection and identification method based on monocular vision and deep learning of claim 4, further comprising: max pooling is expressed as

M_f(x) = max_{r ∈ {−k, …, k}} f(x + r),

where f: G → R is the response function defined on the grid G and M_f: G → R is its max-pooled version; a connection between max pooling and distance-transform pooling can be established by defining the distance function

d(r) = 0 if r ∈ {−k, …, k}, and d(r) = +∞ otherwise,

so that the distance-transform pooling D_f: G → R, defined as

D_f(x) = max_{y ∈ G} ( f(y) − d(x − y) ),

reduces to max pooling;
in the convolution layers, the feature pixel value of the pedestrian image is computed as

x_j^{l,p}(c, r) = f( Σ_m Σ_{(u,v) ∈ K} w_{j,m}^l(u, v) · x_m^{l−1,p}(c + u, r + v) + b_j^l ),

where K = {(u, v) ∈ N² | 0 ≤ u < k_x, 0 ≤ v < k_y}, k_x and k_y respectively represent the length and width of the l-th layer convolution kernel w_j^l, b_j^l is the bias of the j-th feature map of layer l, the variables c and r index the current vertical and horizontal feature pixels respectively, the variables u and v index positions within the convolution kernel, p denotes the p-th training sample, and f is the activation function of the l-th layer; the convolution operation is thus equivalent to convolving the kernel with the input feature map x^{l−1,p}.
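The equivalence between max pooling and distance-transform pooling asserted in claim 5 can be checked numerically. The sketch below implements both definitions over a 1-D grid (the 0/∞ distance function is the one in the claim; the grid and radius are illustrative):

```python
def max_pool(f, x, k):
    """M_f(x) = max over r in {-k,...,k} of f(x + r), clipped to the grid."""
    return max(f[x + r] for r in range(-k, k + 1) if 0 <= x + r < len(f))

def dt_pool(f, x, k):
    """D_f(x) = max over y of f(y) - d(x - y), with d(r) = 0 for |r| <= k
    and d(r) = +inf otherwise; this reduces to max pooling."""
    def d(r):
        return 0.0 if -k <= r <= k else float('inf')
    return max(f[y] - d(x - y) for y in range(len(f)))
```

Because `f(y) - inf = -inf`, every position outside the window drops out of the maximum, leaving exactly the max-pooling window.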
6. The method for detecting and identifying the pedestrian target based on monocular vision and deep learning as claimed in claim 1, wherein the fourth step uses a Laplacian pyramid to establish a multi-resolution pyramid model and complete the multi-scale pyramid representation, uses a phase-congruency algorithm to extract features from each layer of pedestrian images in the pyramid, and finally uses a multi-scale fusion algorithm to fuse the pedestrian images in the multi-scale phase-congruency pyramid from top to bottom, obtaining the phase-congruency feature map of the original pedestrian image.
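The Laplacian-pyramid decomposition named in claim 6 can be sketched on a 1-D signal. This toy version uses pair-averaging instead of Gaussian filtering and nearest-neighbour upsampling (both simplifications), but it preserves the defining property that the pyramid reconstructs the original signal exactly:

```python
def downsample(s):
    """Average adjacent pairs (a crude stand-in for Gaussian blur + decimate)."""
    return [(s[2 * i] + s[2 * i + 1]) / 2.0 for i in range(len(s) // 2)]

def upsample(s):
    """Nearest-neighbour expansion back to twice the length."""
    return [v for v in s for _ in range(2)]

def laplacian_pyramid(signal, levels):
    pyr, cur = [], list(signal)
    for _ in range(levels):
        low = downsample(cur)
        # Laplacian band = detail lost between this scale and the next
        pyr.append([c - u for c, u in zip(cur, upsample(low))])
        cur = low
    pyr.append(cur)  # coarsest residual
    return pyr

def reconstruct(pyr):
    cur = pyr[-1]
    for band in reversed(pyr[:-1]):
        cur = [u + b for u, b in zip(upsample(cur), band)]
    return cur
```

Each band holds the detail at one scale, which is exactly the multi-resolution representation the phase-congruency extraction then operates on, layer by layer.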
7. The pedestrian target detection and identification method based on monocular vision and deep learning of claim 6, wherein extracting features from each layer of pedestrian images in the pyramid with the phase-congruency algorithm and fusing the multi-scale phase-congruency pyramid from top to bottom to obtain the phase-congruency feature map of the original pedestrian image specifically comprises the following steps:
(1) searching the phase-congruency feature map of scale n to obtain the initial position (x0, y0) of a phase-congruency feature point;
(2) in the phase-congruency feature map of scale n−1, searching the 3 × 3 neighbourhood of (x0, y0) for phase-congruency feature points; if such a point exists at position (x, y), then (x, y) is taken as the feature point in the fused phase-congruency feature map; otherwise, the original point is retained;
(3) at the feature point obtained in the fused phase-congruency map, searching for points connected to the feature point at scale n−1, so as to recover detail features not contained in the scale-n fusion map;
(4) searching for the next feature point in the scale-n phase-congruency feature map and repeating steps (1) to (3) until the whole map is processed.
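The coarse-to-fine point fusion in steps (1)–(2) can be sketched as a neighbourhood search over point sets, assuming feature points are given as `(x, y)` tuples with the coarse coordinates already mapped into the finer grid (step (3)'s connected-component refinement is omitted here):

```python
def fuse_feature_points(coarse_points, fine_points):
    """For each scale-n feature point, search its 3x3 neighbourhood in the
    scale n-1 map; a finer point found there replaces it, otherwise the
    coarse point is kept as-is."""
    fused = set()
    for (x0, y0) in coarse_points:
        neighbours = [(x0 + dx, y0 + dy)
                      for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
        refined = [p for p in neighbours if p in fine_points]
        fused.update(refined if refined else {(x0, y0)})
    return fused
```

Distant finer-scale points (outside every 3 × 3 neighbourhood) are ignored, so the fusion keeps only detail that is spatially consistent with the coarser map.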
8. A road traffic monitoring platform applying the pedestrian target detection and identification method based on monocular vision and deep learning of any one of claims 1 to 7.
CN201910991615.8A 2019-10-10 2019-10-10 Pedestrian target detection and identification method based on monocular vision and deep learning Pending CN111027372A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910991615.8A CN111027372A (en) 2019-10-10 2019-10-10 Pedestrian target detection and identification method based on monocular vision and deep learning

Publications (1)

Publication Number Publication Date
CN111027372A true CN111027372A (en) 2020-04-17

Family

ID=70201148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910991615.8A Pending CN111027372A (en) 2019-10-10 2019-10-10 Pedestrian target detection and identification method based on monocular vision and deep learning

Country Status (1)

Country Link
CN (1) CN111027372A (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582126A (en) * 2020-04-30 2020-08-25 浙江工商大学 Pedestrian re-identification method based on multi-scale pedestrian contour segmentation fusion
CN111582126B (en) * 2020-04-30 2024-02-27 浙江工商大学 Pedestrian re-recognition method based on multi-scale pedestrian contour segmentation fusion
CN111709294A (en) * 2020-05-18 2020-09-25 杭州电子科技大学 Express delivery personnel identity identification method based on multi-feature information
CN111709294B (en) * 2020-05-18 2023-07-14 杭州电子科技大学 Express delivery personnel identity recognition method based on multi-feature information
CN111738088A (en) * 2020-05-25 2020-10-02 西安交通大学 Pedestrian distance prediction method based on monocular camera
CN111709336B (en) * 2020-06-08 2024-04-26 杭州像素元科技有限公司 Expressway pedestrian detection method, equipment and readable storage medium
CN111709336A (en) * 2020-06-08 2020-09-25 杭州像素元科技有限公司 Highway pedestrian detection method and device and readable storage medium
CN112052886A (en) * 2020-08-21 2020-12-08 暨南大学 Human body action attitude intelligent estimation method and device based on convolutional neural network
WO2022036777A1 (en) * 2020-08-21 2022-02-24 暨南大学 Method and device for intelligent estimation of human body movement posture based on convolutional neural network
CN112052886B (en) * 2020-08-21 2022-06-03 暨南大学 Intelligent human body action posture estimation method and device based on convolutional neural network
CN112006678A (en) * 2020-09-10 2020-12-01 齐鲁工业大学 Electrocardiogram abnormity identification method and system based on combination of AlexNet and transfer learning
CN112287854A (en) * 2020-11-02 2021-01-29 湖北大学 Building indoor personnel detection method and system based on deep neural network
CN112597802A (en) * 2020-11-25 2021-04-02 中国科学院空天信息创新研究院 Pedestrian motion simulation method based on visual perception network deep learning
CN112733730A (en) * 2021-01-12 2021-04-30 中国石油大学(华东) Oil extraction operation field smoke suction personnel identification processing method and system
CN112733730B (en) * 2021-01-12 2022-11-18 中国石油大学(华东) Oil extraction operation field smoke suction personnel identification processing method and system
CN113257008A (en) * 2021-05-12 2021-08-13 兰州交通大学 Pedestrian flow dynamic control system and method based on deep learning
CN117876711A (en) * 2024-03-12 2024-04-12 金锐同创(北京)科技股份有限公司 Image target detection method, device, equipment and medium based on image processing

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200417