CN115082854A - Pedestrian searching method oriented to security monitoring video - Google Patents

Pedestrian searching method oriented to security monitoring video

Info

Publication number
CN115082854A
CN115082854A
Authority
CN
China
Prior art keywords
pedestrian
space
frame
time
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210682446.1A
Other languages
Chinese (zh)
Inventor
冯德瀛
魏衍侠
肖海荣
张来刚
王政森
杨杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaocheng University
Original Assignee
Liaocheng University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaocheng University filed Critical Liaocheng University
Priority to CN202210682446.1A priority Critical patent/CN115082854A/en
Publication of CN115082854A publication Critical patent/CN115082854A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Abstract

The invention relates to a pedestrian search method for security surveillance video, comprising the following steps: step one, detect each pedestrian frame by frame in the surveillance video using a pre-trained region-based convolutional neural network and generate the corresponding spatial features; step two, organize the pedestrian spatial features extracted frame by frame using the hidden states output by a gated recurrent unit, add an average pooling layer at the output of the gated recurrent unit to reduce the dimension of the hidden state vectors, and generate the corresponding pedestrian spatiotemporal features; step three, index all pedestrian spatiotemporal features through locality-sensitive hashing, and determine the final search result by computing the similarity between the spatiotemporal feature of the pedestrian to be searched and the pedestrian spatiotemporal features in the surveillance video.

Description

Pedestrian searching method oriented to security monitoring video
Technical Field
The invention relates to the technical field of computer vision, and in particular to a pedestrian search method for security surveillance video.
Background
With the continuing construction of smart cities, more and more surveillance cameras are deployed throughout city streets and alleys, where they play an important role in locating missing elderly people and children, finding and positioning criminal suspects, and similar tasks. However, retrieving the relevant information from surveillance video is far from easy: the number of cameras is large, recording times are long, and the volume of video data grows geometrically, so finding a specific pedestrian target in massive surveillance video often consumes a great deal of time, manpower, and material resources. Moreover, because cameras are installed in different positions, surveillance scenes differ, and in large public places such as shopping malls, stations, and exhibition centers the dense pedestrian flow makes scenes even more complex and pedestrian search more challenging. How to find the relevant pedestrians in massive security surveillance video more quickly and accurately has therefore become one of the hot topics in the field of computer vision.
A pedestrian search method queries a specific pedestrian target in an unknown image or video data set, thereby finding images of the same pedestrian in that set. Most existing pedestrian search methods are trained on the CUHK-SYSU or PRW image data sets and perform search with the trained neural network model. Unlike the CUHK-SYSU and PRW image data sets, real security surveillance video contains not only the spatial features of pedestrians but also involves their temporal correlation. A pedestrian search method trained on CUHK-SYSU or PRW therefore does not consider this temporal correlation, and its robustness and reliability need further improvement when applied to security surveillance video.
A search of the prior-art literature shows that patent CN 112241682A provides an end-to-end pedestrian search method based on block partitioning and multi-layer information fusion. The method uses a convolutional neural network to extract preliminary features and a candidate-region extraction network to extract the region where a pedestrian is located, obtaining a high-level feature map. By partitioning the high-level features and fusing them with mid-level features, pedestrian search accuracy is improved. Although the method takes whole images shot by surveillance cameras as input, it does not consider the temporal correlation of pedestrians across consecutive frames, which limits its applicability to security surveillance video.
Further retrieval shows that patent CN 109165540A provides a pedestrian search method based on a prior candidate-box selection strategy. The method constructs pedestrian candidate-box vectors from the lengths and widths of all pedestrian boxes in the training set, obtains prior candidate boxes through the K-means++ clustering algorithm, identifies pedestrian identities, and finally determines pedestrian positions in surveillance images with a trained pedestrian search network. This method, too, processes only the pedestrian features of a single image and does not address the temporal correlation of pedestrians across the multiple frames of a surveillance video.
Disclosure of Invention
To address the above shortcomings of the prior art, the invention provides a pedestrian search method for security surveillance video. Exploiting the spatial invariance and temporal continuity of such video, the method generates pedestrian spatiotemporal features from the surveillance video, makes full use of the temporal correlation of pedestrians across consecutive frames, and thereby improves search accuracy. By indexing all pedestrian spatiotemporal features in the surveillance video, real-time search performance is guaranteed.
The invention is realized through the following technical scheme, and specifically comprises the following steps:
A pedestrian search method for security surveillance video comprises the following steps:
Step one: detect each pedestrian frame by frame in the surveillance video using a pre-trained region-based convolutional neural network and generate the corresponding spatial features.
Step two: organize the pedestrian spatial features extracted frame by frame using the hidden states output by a gated recurrent unit, add an average pooling layer at the output of the gated recurrent unit to reduce the dimension of the hidden state vectors, and generate the corresponding pedestrian spatiotemporal features.
Step three: index all pedestrian spatiotemporal features through locality-sensitive hashing, and determine the final search result by computing the similarity between the spatiotemporal feature of the pedestrian to be searched and the pedestrian spatiotemporal features in the surveillance video.
Further, step one is performed according to the following steps:
1) The surveillance video V = {v_1, v_2, …, v_N} contains N frame images, where the i-th frame image is denoted v_i;
2) The surveillance video V is processed frame by frame through the pre-trained region-based convolutional neural network, and the spatial feature s_{i,j} of the j-th pedestrian is extracted from the i-th frame image v_i;
3) After the N frames of the surveillance video V have been processed, all pedestrian spatial features are expressed as S = {s_{i,j}}, 1 ≤ i ≤ n_j, 1 ≤ j ≤ M, where n_j denotes the number of frames containing the j-th pedestrian and M denotes the total number of pedestrians appearing in the surveillance video.
further, the second step is executed according to the following steps:
1) the ith frame image v i The jth spatial feature s of the pedestrian extracted from the previous image i,j As input vectors, input to the gated loop unit;
2) spatial features s of pedestrians in gated cyclic units i,j Updating candidate hidden state vector c by tanh activation function i,j And is represented as: c. C i,j =tanh(W n s i,j +U n (r i,j ⊙h i-1,j )+b n ) Wherein h is i-1,j Representing the hidden state vector r corresponding to the jth pedestrian in the ith-1 frame image i,j Is h i-1,j Corresponding weight, W n 、U n And b n Network parameters of gated cycle cells;
3) in gated-cycle units, according to h i-1,j And c i,j Generating a hidden state vector h corresponding to the jth pedestrian in the ith frame image i,j And is represented as: h is i,j =z i,j h i-1,j +(1-z i,j )c i,j ,z i,j Is a combination h i-1,j And c i,j The weight of (c);
4) after the spatial features corresponding to the jth pedestrian are completely processed in the gate control cycle unit, a hidden vector sequence h corresponding to the jth pedestrian is obtained j ={h i,j },1≤i≤n j And all pedestrians in the video are represented as a hidden vector sequence H ═ H j },1≤j≤M;
5) Adding an average pooling layer at the output of the gated cyclic unit for the sequence h j Reducing dimensions and generating the space-time characteristic p of the jth pedestrian j And is represented by
Figure BDA0003698860790000031
All pedestrian spatiotemporal features are denoted as P ═ { P ═ P j },1≤j≤M。
Further, step three is performed according to the following steps:
1) All pedestrian spatiotemporal features P in the surveillance video are mapped into a Hamming vector space, with the j-th pedestrian spatiotemporal feature p_j mapped to a b-bit hash code;
2) After the spatiotemporal feature q of the pedestrian to be searched is mapped into the Hamming vector space, the similarity between q and the pedestrian spatiotemporal features in the video is computed;
3) After all similarities between q and all pedestrian spatiotemporal features P have been computed, the similarities are sorted, and the pedestrian images corresponding to the top-T spatiotemporal features constitute the final search result.
The beneficial effects of the invention are as follows: the method uses a region-based convolutional neural network to extract pedestrian spatial features frame by frame from security surveillance video, avoiding interference from complex surveillance backgrounds. The temporal correlation of pedestrian spatial features across consecutive frames is organized through a gated recurrent unit, and pedestrian spatiotemporal features are generated at its output by an added average pooling layer, strengthening the recognizability of the same pedestrian across frames and the distinction between different pedestrians. The large set of pedestrian spatiotemporal features is organized through locality-sensitive hashing, reducing the computation required during search and guaranteeing real-time pedestrian search. Compared with the prior art, the method can be applied to real security surveillance video and improves search accuracy while preserving real-time search performance.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 compares the search accuracy of the method of the invention with that of a search method using only pedestrian spatial features.
Detailed Description
The invention is further described below with reference to the accompanying drawings and specific embodiments. It should be understood that these examples are for illustration only and are not intended to limit the scope of the invention. Furthermore, various changes or modifications may be made by those skilled in the art after reading the teachings of the invention, and such equivalents likewise fall within the scope of the present application.
Example 1
The invention is realized through the following technical scheme; the specific steps are as follows:
First, since a security surveillance video consists of consecutive frame images, a pre-trained region-based convolutional neural network is used to detect pedestrians frame by frame in the surveillance video and to generate the corresponding spatial features. The specific method is as follows:
1) The surveillance video V = {v_1, v_2, …, v_N} contains N frame images, where the i-th frame image is denoted v_i. The pre-trained region-based convolutional neural network can be regarded as a nonlinear function f_RCNN(·);
2) The surveillance video V is processed frame by frame through the network f_RCNN(·), and the spatial feature s_{i,j} of the j-th pedestrian extracted from the i-th frame image v_i can be expressed as s_{i,j} = f_RCNN(v_i);
3) After the N frames of the surveillance video V have been processed, all pedestrian spatial features can be expressed as S = {s_{i,j}}, 1 ≤ i ≤ n_j, 1 ≤ j ≤ M, where n_j denotes the number of frames containing the j-th pedestrian and M denotes the total number of pedestrians appearing in the video.
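As a concrete illustration of step one, the sketch below extracts per-pedestrian spatial features frame by frame. It is a minimal approximation, assuming a torchvision Faster R-CNN detector plus a separate ResNet-50 crop encoder with a linear projection to 256 dimensions; the patent specifies only a single pre-trained region-based convolutional neural network f_RCNN(·), which would share these computations internally by reusing its region-of-interest features.

```python
import torch
import torch.nn as nn
import torchvision

# Hypothetical components standing in for the single network f_RCNN(.) of the patent.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
resnet = torchvision.models.resnet50(weights="DEFAULT").eval()
encoder = nn.Sequential(*list(resnet.children())[:-1])  # up to global average pooling
project = nn.Linear(2048, 256)                          # assumed mapping to 256-d s_{i,j}

@torch.no_grad()
def spatial_features(frame):
    """Detect pedestrians in one frame v_i and return one 256-d feature per person."""
    det = detector([frame])[0]                             # frame: 3xHxW float in [0, 1]
    person = (det["labels"] == 1) & (det["scores"] > 0.8)  # COCO label 1 = person
    feats = []
    for x1, y1, x2, y2 in det["boxes"][person].round().long():
        crop = frame[:, y1:y2, x1:x2].unsqueeze(0)         # pedestrian region of v_i
        crop = nn.functional.interpolate(crop, size=(224, 224))
        feats.append(project(encoder(crop).flatten(1)).squeeze(0))
    return feats                                           # the s_{i,j} of this frame
```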
Second, the pedestrian spatial features extracted from consecutive frames of the video are organized through a gated recurrent unit, and an average pooling layer is added at its output to generate the pedestrian spatiotemporal features.
the pedestrian spatial features extracted from the front and rear frames of the video are organized through the gate control circulation unit, and the average pooling layer is added at the output end to generate the pedestrian spatial-temporal features, wherein the pedestrian spatial features are as follows: because the gating circulation unit can effectively process the data sequence, the spatial features of the pedestrians extracted from the video can be organized by utilizing the hidden state output by the gating circulation unit, so that the time correlation information of the pedestrians in the front frame and the rear frame of the video is transmitted. Because the generated hidden state vector has higher dimensionality, the calculated amount is increased in the pedestrian searching process, so that an average pooling layer is added at the output end of the gating circulating unit, the hidden state vector is subjected to dimension reduction, and the corresponding pedestrian space-time characteristics are generated at the same time, so that the calculated amount in the subsequent searching process can be reduced, and the time correlation information of the pedestrian is included.
The step of organizing the pedestrian spatial features extracted from different frames of the video through the gating cycle unit and generating the pedestrian spatial-temporal features through the average pooling layer comprises the following steps:
1) The spatial feature s_{i,j} of the j-th pedestrian extracted from the i-th frame image v_i is fed to the gated recurrent unit as the input vector;
2) In the gated recurrent unit, the pedestrian spatial feature s_{i,j} updates the candidate hidden state vector c_{i,j} through the tanh activation function, expressed as: c_{i,j} = tanh(W_n s_{i,j} + U_n (r_{i,j} ⊙ h_{i-1,j}) + b_n), where h_{i-1,j} denotes the hidden state vector corresponding to the j-th pedestrian in the (i-1)-th frame image, r_{i,j} is the weight corresponding to h_{i-1,j}, and W_n, U_n and b_n are network parameters of the gated recurrent unit;
3) In the gated recurrent unit, the hidden state vector h_{i,j} corresponding to the j-th pedestrian in the i-th frame image is generated from h_{i-1,j} and c_{i,j}, expressed as: h_{i,j} = z_{i,j} h_{i-1,j} + (1 - z_{i,j}) c_{i,j}, where z_{i,j} is the weight combining h_{i-1,j} and c_{i,j}. Since h_{i,j} takes into account not only the candidate hidden state of the j-th pedestrian in the i-th frame image but also that pedestrian's hidden state in the (i-1)-th frame image, a temporal correlation is established for the j-th pedestrian between the (i-1)-th and i-th frame images;
4) After all spatial features corresponding to the j-th pedestrian have been processed in the gated recurrent unit, the hidden vector sequence h_j = {h_{i,j}}, 1 ≤ i ≤ n_j, corresponding to the j-th pedestrian can be obtained, describing the temporal correlation of that pedestrian across different frames of the video; all pedestrians in the video can then be represented as the hidden vector sequence H = {h_j}, 1 ≤ j ≤ M;
5) Because the dimension of the sequence h_j is high, an average pooling layer is added at the output of the gated recurrent unit to reduce the dimension of h_j while generating the spatiotemporal feature p_j of the j-th pedestrian, expressed as
p_j = (1/n_j) Σ_{i=1}^{n_j} h_{i,j}.
For the j-th pedestrian, the average pooling layer converts the high-dimensional sequence h_j into the single vector p_j, and all pedestrian spatiotemporal features may be expressed as P = {p_j}, 1 ≤ j ≤ M.
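A minimal sketch of the gated-recurrent-unit update and average pooling of step two, written directly from the equations above. The reset-gate and update-gate computations producing r_{i,j} and z_{i,j} follow the standard GRU formulation, which the patent leaves implicit, and the random parameter initialization is illustrative; in practice the parameters W, U and b would be learned.

```python
import numpy as np

d = 256                                  # feature dimension used in the embodiment
rng = np.random.default_rng(0)

def init(shape):
    return rng.normal(scale=0.01, size=shape)

# Candidate-state parameters W_n, U_n, b_n from the patent, plus the standard
# reset-gate (r) and update-gate (z) parameters assumed here.
Wn, Un, bn = init((d, d)), init((d, d)), np.zeros(d)
Wr, Ur, br = init((d, d)), init((d, d)), np.zeros(d)
Wz, Uz, bz = init((d, d)), init((d, d)), np.zeros(d)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(s_ij, h_prev):
    """One gated-recurrent-unit update for the spatial feature s_{i,j}."""
    r_ij = sigmoid(Wr @ s_ij + Ur @ h_prev + br)           # reset gate r_{i,j}
    z_ij = sigmoid(Wz @ s_ij + Uz @ h_prev + bz)           # update gate z_{i,j}
    c_ij = np.tanh(Wn @ s_ij + Un @ (r_ij * h_prev) + bn)  # candidate c_{i,j}
    return z_ij * h_prev + (1.0 - z_ij) * c_ij             # hidden state h_{i,j}

def spatiotemporal_feature(spatial_seq):
    """Fold the sequence s_{1,j}, ..., s_{n_j,j} into the single 256-d feature p_j."""
    h = np.zeros(d)
    hidden = []
    for s in spatial_seq:
        h = gru_step(s, h)
        hidden.append(h)
    return np.mean(hidden, axis=0)  # average pooling: p_j = (1/n_j) * sum_i h_{i,j}
```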
Third, all pedestrian spatiotemporal features are indexed through locality-sensitive hashing, and the pedestrian search result is determined according to similarity.
Indexing all pedestrian spatiotemporal features through locality-sensitive hashing and determining the search result by similarity works as follows: a surveillance video contains many pedestrians, each corresponding to one spatiotemporal feature, so a large number of pedestrian spatiotemporal features are generated. Computing similarities directly against all of them makes real-time pedestrian search difficult to guarantee; all pedestrian spatiotemporal features are therefore indexed with locality-sensitive hashing, and the final search result is given by computing the similarity between the spatiotemporal feature of the pedestrian to be searched and the pedestrian spatiotemporal features in the surveillance video.
The steps of indexing all pedestrian spatiotemporal features through locality-sensitive hashing and determining the search result by similarity are as follows:
1) All pedestrian spatiotemporal features P in the surveillance video are mapped into a Hamming vector space; the j-th pedestrian spatiotemporal feature p_j can be mapped to a b-bit hash code, denoted H: p_j → {0,1}^b;
2) After the spatiotemporal feature q of the pedestrian to be searched is mapped into the Hamming vector space, its similarity to a pedestrian spatiotemporal feature in the video can be computed as sim(q, p_j) = P[H(q) = H(p_j)] = Jaccard(q, p_j);
3) After all similarities between q and all pedestrian spatiotemporal features P have been computed, the similarities are sorted, and the pedestrian images corresponding to the top-T spatiotemporal features constitute the final search result.
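A sketch of step three under the assumption that the b-bit codes are produced by random-hyperplane (sign) hashing, one common locality-sensitive family; the patent fixes only the mapping H: p_j → {0,1}^b and the Jaccard similarity, not the concrete hash functions.

```python
import numpy as np

rng = np.random.default_rng(1)
b, d = 128, 256
planes = rng.normal(size=(b, d))        # one random hyperplane per hash bit (assumed)

def hash_code(p):
    """Map a spatiotemporal feature to a b-bit code H(p) in {0,1}^b."""
    return (planes @ p > 0).astype(np.uint8)

def jaccard(a, c):
    """Jaccard coefficient of two binary codes viewed as bit sets."""
    inter = np.sum((a == 1) & (c == 1))
    union = np.sum((a == 1) | (c == 1))
    return inter / union if union else 0.0

def search(q_feat, gallery_feats, top_t=5):
    """Rank gallery pedestrians by sim(q, p_j) and return the top-T indices."""
    q_code = hash_code(q_feat)
    sims = [jaccard(q_code, hash_code(p)) for p in gallery_feats]
    return np.argsort(sims)[::-1][:top_t]
```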
Example 2
This embodiment adopts the pedestrian search method for security surveillance video; the specific implementation steps are as follows:
1. A region-based convolutional neural network is adopted to extract pedestrian spatial features frame by frame from the surveillance video.
In the region-based convolutional neural network, the conv1 through conv4_3 layers of the ResNet-50 model are used to detect pedestrian bounding boxes, and the conv4_4 through conv5_3 layers perform pedestrian identification. After global average pooling and feature mapping, a 256-dimensional spatial feature s_{i,j} can be generated for the j-th pedestrian in the i-th frame image v_i; for all pedestrians in the surveillance video, a spatial feature sequence S of dimension 256 × n_j × M can then be generated.
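A sketch of this backbone split, assuming the torchvision ResNet-50 layout in which layer3 holds the six conv4_x bottleneck blocks and layer4 the three conv5_x blocks, so the conv1 through conv4_3 / conv4_4 through conv5_3 division of the embodiment corresponds to cutting inside layer3:

```python
import torch.nn as nn
import torchvision

resnet = torchvision.models.resnet50(weights="DEFAULT")
stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
# conv1 through conv4_3: used to detect pedestrian bounding boxes.
detection_trunk = nn.Sequential(stem, resnet.layer1, resnet.layer2, resnet.layer3[:3])
# conv4_4 through conv5_3: used for pedestrian identification.
identification_head = nn.Sequential(resnet.layer3[3:], resnet.layer4)
# After global average pooling, an assumed linear mapping yields the 256-d s_{i,j}.
pool, embed = nn.AdaptiveAvgPool2d(1), nn.Linear(2048, 256)
```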
2. The pedestrian spatial features extracted from consecutive frames of the video are organized through the gated recurrent unit, and an average pooling layer is added at its output to generate the pedestrian spatiotemporal features.
Because a security surveillance video contains many frames, only one gated recurrent unit is adopted to organize the pedestrian spatial features extracted from consecutive frame images, so as to keep the search efficient. The spatial feature s_{i,j} of the j-th pedestrian extracted from the i-th frame image v_i is taken as the input vector, and after processing by the gated recurrent unit a 256-dimensional hidden state vector h_{i,j} can be generated; the full set of spatial features of the j-th pedestrian can thus be represented as a hidden vector sequence h_j of dimension 256 × n_j, and all pedestrians in the video as a hidden vector sequence H of dimension 256 × n_j × M.
Because the hidden vector sequences h_j and H are high-dimensional, the average pooling layer reduces the dimension of each sequence h_j, converting the 256 × n_j dimensional hidden vector sequence h_j into a 256-dimensional pedestrian spatiotemporal feature p_j; the M pedestrians appearing in the video can then be represented as M independent pedestrian spatiotemporal features P = {p_j} (1 ≤ j ≤ M).
3. All pedestrian spatiotemporal features are indexed through locality-sensitive hashing, and the pedestrian search result is determined according to similarity.
To guarantee real-time pedestrian search, locality-sensitive hashing is adopted to index the M pedestrian spatiotemporal features, mapping them into a Hamming vector space; the j-th pedestrian spatiotemporal feature p_j can be mapped to a 128-bit hash code, denoted H: p_j → {0,1}^128.
The similarity between the spatiotemporal feature of the pedestrian to be searched and the pedestrian spatiotemporal features in the video is then computed according to the Jaccard coefficient in the Hamming vector space. After all similarities are sorted, the pedestrian images corresponding to the top 5 spatiotemporal features are selected as the final search result.
A simulation experiment of the method of the invention is as follows:
In the experiment, videos shot by 9 surveillance cameras were selected, 15000 video clips in total containing 897 pedestrian targets, from which a pedestrian search database was created. In this database, 11546 video clips were used as the training set and the remaining 3454 as the test set. Mean average precision (mAP) was selected to evaluate pedestrian search performance. With the number of hash bits varied, the method was compared against a search method using only pedestrian spatial features; the experimental results are shown in Fig. 2. As the hash code grows from 8 bits to 128 bits, the mAP of both methods increases, but the accuracy of the proposed method remains higher than that of the method using only pedestrian spatial features. The pedestrian spatiotemporal features used by the method not only contain the pedestrian spatial features of single frames but also organize the temporal correlation of those features across consecutive frames, which strengthens the discrimination between different pedestrians and helps improve search accuracy.

Claims (4)

1. A pedestrian search method for security surveillance video, comprising the following steps:
step one, detecting each pedestrian frame by frame in the surveillance video using a pre-trained region-based convolutional neural network and generating the corresponding spatial features;
step two, organizing the pedestrian spatial features extracted frame by frame using the hidden states output by a gated recurrent unit, adding an average pooling layer at the output of the gated recurrent unit to reduce the dimension of the hidden state vectors, and generating the corresponding pedestrian spatiotemporal features;
step three, indexing all pedestrian spatiotemporal features through locality-sensitive hashing, and determining the final search result by computing the similarity between the spatiotemporal feature of the pedestrian to be searched and the pedestrian spatiotemporal features in the surveillance video.
2. The pedestrian search method according to claim 1, wherein step one is performed according to the following steps:
1) the surveillance video V = {v_1, v_2, …, v_N} contains N frame images, where the i-th frame image is denoted v_i;
2) the surveillance video V is processed frame by frame through the pre-trained region-based convolutional neural network, and the spatial feature s_{i,j} of the j-th pedestrian is extracted from the i-th frame image v_i;
3) after the N frames of the surveillance video V have been processed, all pedestrian spatial features are expressed as S = {s_{i,j}}, 1 ≤ i ≤ n_j, 1 ≤ j ≤ M, where n_j denotes the number of frames containing the j-th pedestrian and M denotes the total number of pedestrians appearing in the surveillance video.
3. The pedestrian search method according to claim 1, wherein step two is performed according to the following steps:
1) the spatial feature s_{i,j} of the j-th pedestrian extracted from the i-th frame image v_i is fed to the gated recurrent unit as the input vector;
2) in the gated recurrent unit, the pedestrian spatial feature s_{i,j} updates the candidate hidden state vector c_{i,j} through the tanh activation function, expressed as: c_{i,j} = tanh(W_n s_{i,j} + U_n (r_{i,j} ⊙ h_{i-1,j}) + b_n), where h_{i-1,j} denotes the hidden state vector corresponding to the j-th pedestrian in the (i-1)-th frame image, r_{i,j} is the weight corresponding to h_{i-1,j}, and W_n, U_n and b_n are network parameters of the gated recurrent unit;
3) in the gated recurrent unit, the hidden state vector h_{i,j} corresponding to the j-th pedestrian in the i-th frame image is generated from h_{i-1,j} and c_{i,j}, expressed as: h_{i,j} = z_{i,j} h_{i-1,j} + (1 - z_{i,j}) c_{i,j}, where z_{i,j} is the weight combining h_{i-1,j} and c_{i,j};
4) after all spatial features corresponding to the j-th pedestrian have been processed in the gated recurrent unit, the hidden vector sequence h_j = {h_{i,j}}, 1 ≤ i ≤ n_j, corresponding to the j-th pedestrian is obtained, and all pedestrians in the video are represented as the hidden vector sequence H = {h_j}, 1 ≤ j ≤ M;
5) an average pooling layer is added at the output of the gated recurrent unit to reduce the dimension of the sequence h_j and generate the spatiotemporal feature p_j of the j-th pedestrian, expressed as
p_j = (1/n_j) Σ_{i=1}^{n_j} h_{i,j},
and all pedestrian spatiotemporal features are denoted P = {p_j}, 1 ≤ j ≤ M.
4. The pedestrian search method according to claim 1, wherein step three is performed according to the following steps:
1) all pedestrian spatiotemporal features P in the surveillance video are mapped into a Hamming vector space, with the j-th pedestrian spatiotemporal feature p_j mapped to a b-bit hash code;
2) after the spatiotemporal feature q of the pedestrian to be searched is mapped into the Hamming vector space, the similarity between q and the pedestrian spatiotemporal features in the video is computed;
3) after all similarities between q and all pedestrian spatiotemporal features P have been computed, the similarities are sorted, and the pedestrian images corresponding to the top-T spatiotemporal features constitute the final search result.
CN202210682446.1A 2022-06-16 2022-06-16 Pedestrian searching method oriented to security monitoring video Pending CN115082854A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210682446.1A CN115082854A (en) 2022-06-16 2022-06-16 Pedestrian searching method oriented to security monitoring video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210682446.1A CN115082854A (en) 2022-06-16 2022-06-16 Pedestrian searching method oriented to security monitoring video

Publications (1)

Publication Number Publication Date
CN115082854A (en) 2022-09-20

Family

ID=83253432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210682446.1A Pending CN115082854A (en) 2022-06-16 2022-06-16 Pedestrian searching method oriented to security monitoring video

Country Status (1)

Country Link
CN (1) CN115082854A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115689297A (en) * 2022-12-29 2023-02-03 杭州天阙科技有限公司 Child abduction risk early warning method, device, equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination