CN108549852B - Automatic learning method of pedestrian detector in specific scene based on deep network enhancement - Google Patents

Automatic learning method of pedestrian detector in specific scene based on deep network enhancement

Info

Publication number
CN108549852B
Authority
CN
China
Prior art keywords
neural network
pedestrian
sample
server side
under
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810264330.XA
Other languages
Chinese (zh)
Other versions
CN108549852A (en
Inventor
郑慧诚
何炜雄
谢晓华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Sun Yat Sen University
Original Assignee
National Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Sun Yat Sen University filed Critical National Sun Yat Sen University
Priority to CN201810264330.XA priority Critical patent/CN108549852B/en
Publication of CN108549852A publication Critical patent/CN108549852A/en
Application granted granted Critical
Publication of CN108549852B publication Critical patent/CN108549852B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Abstract

The invention discloses an automatic learning method of a pedestrian detector in a specific scene based on deep network enhancement, which comprises the following steps: training a first neural network and a second neural network with a universal data set at the server side, wherein the second neural network is deployed on the embedded device; capturing images of the current scene with the embedded device to obtain newly added image samples and transmitting them to the server side; testing the newly added image samples at the server side with the previously trained first neural network and labeling the samples according to the test scores; estimating the size of the pedestrian detection frame at the current camera height, removing positive samples whose detection frames differ obviously from the estimated size, and keeping the remaining samples; tuning the second neural network at the server side; and redeploying the tuned second neural network model from the server side to the embedded device. The method can quickly obtain an accurate pedestrian detection model in a specific scene.

Description

Automatic learning method of pedestrian detector in specific scene based on deep network enhancement
Technical Field
The invention relates to the field of pedestrian detection in video surveillance, and in particular to a method for automatically learning a pedestrian detector in a specific scene based on deep network enhancement.
Background
As camera coverage continues to expand, analyzing the behaviors, actions and trajectories of pedestrians from camera data has become an urgent need, and the technical basis of these needs is pedestrian detection. Pedestrian detection is performed by a pedestrian detector, whose task is to estimate the positions of pedestrians in the current scene; it plays a very important role in camera-based surveillance tasks such as pedestrian tracking and pedestrian recognition. Owing to factors such as illumination changes, camera angle changes and pedestrian pose changes, pedestrian detection remains a very challenging problem.
In recent years great progress has been made in this area: the combination of traditional HOG features with an SVM classifier achieved good results in pedestrian detection, and recent work based on convolutional neural networks, owing to their strong ability to learn the sample distribution, has pushed detector performance to a new level.
However, although these approaches achieve very good results on pedestrian detection benchmarks, the performance of a detector trained with a learning-based method depends on the distribution of its training set. When the detector works in another specific scene, occlusion, image quality and other scene factors can make the test distribution differ greatly from the training distribution, and detection performance degrades sharply. On the other hand, collecting and manually labeling data in every specific scene to train a dedicated model is very labor-intensive and becomes impractical when the number of deployed detectors is large. Therefore, how to use automatic learning to improve the adaptability of a pedestrian detector to a specific scene is a critical issue.
The existing methods mainly comprise the following methods:
(1) Methods based on context information and pedestrian size. See Xiaogang Wang, Meng Wang, and Wei Li, Scene-Specific Pedestrian Detection for Static Video Surveillance, IEEE TPAMI 36 (2014) 361-374. In this method, the current scene and the pedestrian size are modeled to obtain the probability that a detection frame is a positive or a negative sample, and the positive and negative samples obtained in this way are used to train an SVM classifier.
(2) Methods based on semi-supervision and an auxiliary detector. See Si Wu, Shufeng Wang, Robert Laganiere, Cheng Liu, Hau-San Wong, and Yong Xu, Exploiting Target Data to Learn Deep Convolutional Networks for Scene-Adapted Human Detection, IEEE TIP (2017). In this method, when a small number of positive and negative samples are available in the specific scene, an auxiliary detector is trained on these samples, further unlabeled samples are labeled through the output of the auxiliary detector, and these samples are finally used to train a model for the scene.
The above methods have several disadvantages. First, positive and negative samples in the current scene are obtained from context information such as pedestrian size and background modeling; because such information is not very reliable, the samples obtained in this way are relatively noisy. Meanwhile, the semi-supervised method requires a certain number of manually labeled samples, which is undoubtedly time-consuming and labor-intensive.
Disclosure of Invention
Aiming at the situation that current pedestrian detectors cannot locate pedestrians well in a specific scene, the invention provides an automatic learning method of a pedestrian detector in a specific scene based on deep network enhancement.
The purpose of the invention is achieved by the following technical scheme: an automatic learning method of a pedestrian detector in a specific scene based on deep network enhancement comprises the following steps:
(1) training a first neural network and a second neural network with a universal data set at the server side, wherein the second neural network is to be deployed on the embedded device;
(2) capturing images of the current scene with the embedded device during pedestrian detection to obtain newly added image samples, and transmitting the newly added image samples to the server side;
(3) testing the newly added image samples at the server side with the previously trained first neural network, and labeling the samples with the test scores of the first neural network;
(4) estimating the size of the pedestrian detection frame at the current height of the embedded device, calculating the difference between each detection frame in the positive samples and the estimated pedestrian detection frame, removing the sample if the difference exceeds a threshold, and keeping the remaining samples;
(5) tuning the second neural network at the server side with the remaining samples;
(6) redeploying the tuned second neural network model from the server side to the embedded device.
In the invention, the first neural network is deployed at the server side, so its structure can be designed to be complex and its training accuracy high. The second neural network is deployed on the embedded device, so its structure can be kept simple and the embedded device can meet the speed requirement. Newly added image samples are tested and labeled by the complex first neural network, high-scoring samples are screened out, and the second neural network is then tuned, so that accurate recognition results can be obtained quickly in the specific scene.
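For illustration, the loop formed by steps (1)-(6) can be outlined in the following Python sketch; every name in it (auto_learning_cycle, collect_scene_images, finetune_second_network and so on) is a hypothetical placeholder, not an interface defined by the invention.

```python
# Illustrative outline of the server/device loop implied by steps (1)-(6).
# All class, method and function names are hypothetical placeholders.

def auto_learning_cycle(server, device, score_threshold):
    # (2) the embedded device captures images of the current scene
    new_images = device.collect_scene_images()
    server.receive(new_images)                                # e.g. via FTP

    # (3) the complex first network scores the new images and labels the samples
    positives, negatives = [], []
    for image in new_images:
        for frame, score in server.first_network.detect(image):
            (positives if score >= score_threshold else negatives).append((frame, score))

    # (4) drop positives whose frame size deviates from the estimated pedestrian size
    positives, rejected = server.filter_by_size(positives)
    negatives.extend(rejected)

    # (5) tune the simple second network on the screened samples
    server.finetune_second_network(positives, negatives)

    # (6) redeploy the tuned model to the embedded device
    device.load_model(server.export_second_network())
```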
Preferably, in step (1), the step of training the first neural network and the second neural network using the common data set at the server side includes:
manually labeled data from scenes other than the current scene are used as the universal data set, a Faster R-CNN (faster region-based convolutional neural network) based on ResNet-101 (a 101-layer residual network) is used as the first neural network, and an AlexNet-based SSD (Single Shot MultiBox Detector) is used as the second neural network.
Furthermore, the network parameters of the pre-trained networks used by the first and second neural networks during training are obtained as follows: network parameters for classification are obtained by training on ImageNet, the layers after the last convolutional layer are removed, and the parameters of the remaining convolutional layers are used as initialization parameters for the current training.
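The patent does not name a deep learning framework (the solver settings given later suggest a Caffe-style implementation), but as a hedged illustration of this parameter-reuse step, the torchvision sketch below keeps only the ImageNet-pretrained convolutional layers of a classifier and discards everything after the last convolutional layer.

```python
# Hedged sketch of the parameter-reuse step: keep only the ImageNet-pretrained
# convolutional layers so they can initialise the detector backbones.
# The framework choice and the helper name are assumptions of this sketch.
import torch.nn as nn
import torchvision.models as models

def imagenet_conv_backbone(arch: str) -> nn.Module:
    if arch == "alexnet":          # backbone of the second network (SSD)
        return models.alexnet(weights="IMAGENET1K_V1").features
    if arch == "resnet101":        # backbone of the first network (Faster R-CNN)
        resnet = models.resnet101(weights="IMAGENET1K_V1")
        return nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool and fc
    raise ValueError(f"unsupported architecture: {arch}")
```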
Preferably, in step (2), the embedded device transmits the newly added image samples to the server side using FTP (File Transfer Protocol).
Furthermore, during pedestrian detection on the embedded device, the collected image samples are screened as follows:
let the number of pedestrians detected by the current device be N_p. If N_p ≥ T_p, where T_p is a preset threshold, the captured image is used as a newly added image sample and transmitted to the server side; otherwise, the current image is discarded. In this way the embedded device collects samples that are likely to be effective, which effectively shortens the time required for the subsequent tuning process and at the same time improves the performance of the tuned result.
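A minimal sketch of this screening-and-upload step is given below, using Python's standard ftplib for the FTP transfer mentioned above; the host, credentials and file naming are illustrative placeholders, and T_P = 3 follows the value used in the embodiment described later.

```python
# Hedged sketch of on-device screening plus FTP upload; host, credentials and
# the way the detection count is obtained are placeholders of this sketch.
import os
from ftplib import FTP

T_P = 3  # preset threshold on the number of detected pedestrians (embodiment value)

def maybe_upload(image_path: str, n_detected: int,
                 host: str = "server.example", user: str = "edge", passwd: str = "secret") -> bool:
    """Send the captured frame to the server only if enough pedestrians were detected."""
    if n_detected < T_P:
        return False                              # N_p < T_p: discard the current image
    with FTP(host) as ftp:                        # N_p >= T_p: transmit as a new sample
        ftp.login(user=user, passwd=passwd)
        with open(image_path, "rb") as f:
            ftp.storbinary(f"STOR {os.path.basename(image_path)}", f)
    return True
```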
preferably, in the step (3), the step of testing and labeling the newly added image sample by the first neural network is:
for each image I, the result after the test by the first neural network is recorded as
Figure BDA0001610986970000031
Wherein n is the total number of detection frames, liIs the position vector of the ith detection frame, li=[xl,yl,xr,yr],(xl,yl)、(xr,yr) The coordinates of the upper left corner and the lower right corner of the detection frame at the image position, siS is 0 or more to the probability that the ith detection frame is discriminated as a pedestriani≤1;
For each detection box, a set threshold T is used to determine whether the sample is positive, i.e., for sample { l }i,siAnd (4) the following steps:
if siGreater than or equal to T, then { l-i,si-is the positive sample;
if siIf < T, then { li,si-is the negative sample;
all positive sample sets obtained after all images are subjected to the above operations are set as P, and all negative sample sets are set as N.
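A short sketch of this labeling rule follows; `detect` is a placeholder standing in for the trained first neural network and is assumed to return (frame, score) pairs, with each frame given as (x_l, y_l, x_r, y_r) and each score s_i in [0, 1].

```python
# Hedged sketch of the score-threshold labeling step above.
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]   # (x_l, y_l, x_r, y_r)

def label_samples(images: Sequence, detect: Callable[[object], List[Tuple[Box, float]]],
                  T: float) -> Tuple[list, list]:
    """Split all detections over all images into the positive set P and negative set N."""
    P, N = [], []
    for image in images:
        for box, score in detect(image):
            (P if score >= T else N).append((box, score))    # s_i >= T -> positive
    return P, N
```

In the embodiment below the threshold T is set to 0.3.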
Preferably, in step (4), the method for estimating the size of the pedestrian detection frame at the current height of the embedded device is as follows:
a person stands under the camera, the person's target frame is taken as the pedestrian size, and the length and width of the target frame are taken as the person's height and width, respectively;
let the area of the pedestrian when standing at the i-th position under the camera be S_i, the height h_i, and the width w_i; data are collected multiple times, and the area S, height h and width w of a pedestrian at the current camera height are obtained by averaging.
Preferably, in step (4), the steps of judging whether to reject a sample are:
for each sample {l_i, s_i} in the positive sample set P, whether to remove it from the positive sample set is judged by the following criteria:
if |x_l - x_r| > γ·w, then {l_i, s_i} is removed from P and added to N;
if |x_l - x_r|·γ < w, then {l_i, s_i} is removed from P and added to N;
if |y_l - y_r| > γ·h, then {l_i, s_i} is removed from P and added to N;
if |y_l - y_r|·γ < h, then {l_i, s_i} is removed from P and added to N;
if |x_l - x_r|·|y_l - y_r| > γ_S·S, then {l_i, s_i} is removed from P and added to N;
if |x_l - x_r|·|y_l - y_r|·γ_S < S, then {l_i, s_i} is removed from P and added to N;
the positive sample set obtained after the above operations is denoted P_1, and the negative sample set is denoted N_1, where γ and γ_S are preset parameters greater than 1.
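The size estimation and the rejection criteria above can be sketched as follows; gamma corresponds to γ (set to 1.3 in the embodiment), while gamma_s stands for the second preset parameter γ_S (also greater than 1), whose value is not reproduced in this text and is therefore left as an explicit argument.

```python
# Hedged sketch of the size estimation and the size-based rejection criteria.
from statistics import mean

def estimate_pedestrian_size(measured_boxes):
    """Average height h, width w and area S of the measured standing-person boxes."""
    heights = [abs(yl - yr) for _, yl, _, yr in measured_boxes]
    widths = [abs(xl - xr) for xl, _, xr, _ in measured_boxes]
    areas = [bh * bw for bh, bw in zip(heights, widths)]
    return mean(areas), mean(heights), mean(widths)           # S, h, w

def filter_positive_samples(P, N, S, h, w, gamma, gamma_s):
    """Move positives whose frame deviates too much from (S, h, w) into the negatives."""
    P1, N1 = [], list(N)
    for box, score in P:
        xl, yl, xr, yr = box
        bw, bh = abs(xl - xr), abs(yl - yr)
        reject = (bw > gamma * w or bw * gamma < w or
                  bh > gamma * h or bh * gamma < h or
                  bw * bh > gamma_s * S or bw * bh * gamma_s < S)
        (N1 if reject else P1).append((box, score))
    return P1, N1
```

The returned P_1 and N_1 then form the data set D = {P_1, N_1} used to tune the second neural network in step (5).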
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The method of the invention tunes the pedestrian detector on effective samples collected by the camera, so the tuning is faster and gives better results; meanwhile, the embedded device can be tuned during its working process, without the tuning process having to be completed while the embedded device is online.
2. The invention provides an automatic learning method for a pedestrian detector in a specific scene based on deep network enhancement, which can accurately locate pedestrians and their corresponding regions in the specific scene.
3. In the invention, the first neural network (a large neural network with a complex structure) has a strong ability to learn the training data, so it still predicts well on a test set with unknown distribution.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 shows example detection results of a prior-art detector and of the scene-specific pedestrian detector of the invention, where (a)-(d) are pedestrian detection results obtained in a specific scene with a detector trained on a universal data set, and (e)-(h) are the detection results of the pedestrian detector obtained with the method of the invention in the same specific scene.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Example 1
Referring to FIG. 1, the method of this embodiment for automatic learning of a pedestrian detector in a specific scene based on deep network enhancement comprises the steps of:
(1) A first neural network and a second neural network are trained with a universal data set at the server side.
In the present embodiment, manually labeled data from scenes other than the current scene are used as the universal data set, a Faster R-CNN (faster region-based convolutional neural network) based on ResNet-101 (a 101-layer residual network) is used as the first neural network, and an AlexNet-based SSD (Single Shot MultiBox Detector) is used as the second neural network.
The pre-trained parameters used by the two networks during training are network parameters for classification obtained by training on ImageNet; the layers after the last convolutional layer are removed, and the parameters of the remaining convolutional layers are used as initialization parameters for the current training. The initial learning rate of the small network (second neural network) is 0.005, the learning-rate adjustment strategy is multistep with parameter gamma set to 0.5 and stepvalue set to 30000, and the number of training iterations is 100,000. The initial learning rate of the large network (first neural network) is 0.01, the learning-rate adjustment strategy is multistep with parameter gamma set to 0.98 and stepvalue set to 8500, and the number of training iterations is 200,000. A GeForce GTX 1080 Ti graphics card is used for network training.
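For reference, the "multistep" policy named above (assuming the usual Caffe-style solver semantics) multiplies the base learning rate by gamma each time the iteration count passes a stepvalue; a minimal sketch under that assumption:

```python
# Sketch of a Caffe-style "multistep" learning-rate schedule.
def multistep_lr(iteration: int, base_lr: float, gamma: float, stepvalues) -> float:
    drops = sum(1 for s in stepvalues if iteration >= s)
    return base_lr * (gamma ** drops)

# Second (small) network as configured above: base_lr 0.005, gamma 0.5, stepvalue 30000.
assert multistep_lr(10000, 0.005, 0.5, [30000]) == 0.005
assert multistep_lr(50000, 0.005, 0.5, [30000]) == 0.0025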
(2) During pedestrian detection, images of the current scene are captured by the embedded device to obtain newly added image samples, which are transmitted to the server side.
The camera used in the invention is an MT9M001C12STM. For the pedestrian activity in the real scene, images are acquired at 20 frames per second, each image is 640×480 pixels, and the images are transmitted to the embedded device used in the invention, a Raspberry Pi 3, as they are acquired. The embedded device in turn transmits the images to the server side using FTP (File Transfer Protocol).
Meanwhile, to ensure as far as possible that the collected image samples are effective, the number of pedestrians N_p detected by the current device during pedestrian detection is used as the basis for deciding whether an image is collected and transmitted to the server. The specific operation is:
N_p ≥ T_p → transmit the image to the server
N_p < T_p → discard the image
In this experiment T_p = 3. Through this operation the embedded device collects samples that are likely to be effective, which effectively shortens the time required for the subsequent tuning process and improves the performance of the tuned result.
(3) The newly added image samples are tested at the server side with the previously trained first neural network, and the samples are labeled with the test scores of the first neural network.
After an image I is obtained from the camera and transmitted to the server, it is tested on the server with the previously trained large neural network, and the samples are labeled with the test scores of the large network.
The large network used here is the Faster R-CNN (faster region-based convolutional neural network) based on ResNet-101 (a 101-layer residual network).
For each image I, the result after testing with the large network is recorded as {(l_i, s_i)}, i = 1, ..., n, wherein n is the total number of detection frames, l_i is the position vector of the i-th detection frame, i.e. l_i = [x_l, y_l, x_r, y_r], where (x_l, y_l) and (x_r, y_r) are the coordinates of the upper-left and lower-right corners of the detection frame in the image, and s_i is the probability that the i-th detection frame is judged to be a pedestrian, with 0 ≤ s_i ≤ 1.
For each detection frame, a hard threshold T = 0.3 is used to judge whether it is a positive sample, i.e., for a sample {l_i, s_i}:
s_i ≥ T → {l_i, s_i} is a positive sample
s_i < T → {l_i, s_i} is a negative sample
The set of all positive samples obtained after performing the above operation on all images is denoted P, and the set of all negative samples is denoted N.
(4) The size of the pedestrian detection frame at the current height is estimated, and positive samples whose detection frames differ obviously from the estimated size are rejected.
The pedestrian size is determined experimentally, as follows: a person stands under the camera, the person's target frame is taken as the pedestrian size, and the length and width of the target frame are taken as the person's height and width, respectively.
Let the area of the pedestrian standing at the i-th position under the camera be S_i, the height h_i, and the width w_i. Data are acquired 20 times in total, and the area S, height h and width w of a pedestrian at the current camera height are obtained by averaging.
For each sample {l_i, s_i} in the positive sample set P, whether to remove it from the positive sample set is judged by the following criteria:
|x_l - x_r| > γ·w → remove {l_i, s_i} from P and add it to N
|x_l - x_r|·γ < w → remove {l_i, s_i} from P and add it to N
|y_l - y_r| > γ·h → remove {l_i, s_i} from P and add it to N
|y_l - y_r|·γ < h → remove {l_i, s_i} from P and add it to N
|x_l - x_r|·|y_l - y_r| > γ_S·S → remove {l_i, s_i} from P and add it to N
|x_l - x_r|·|y_l - y_r|·γ_S < S → remove {l_i, s_i} from P and add it to N
In this experiment γ = 1.3, and γ_S is likewise a preset parameter greater than 1.
The positive sample set obtained after the above operations is denoted P_1, and the negative sample set is denoted N_1.
(5) The server side tunes the second neural network with the remaining samples.
The data set D = {P_1, N_1} is used to tune the AlexNet-based SSD (Single Shot MultiBox Detector) trained in step (1), with an initial learning rate of 0.0005, the multistep learning-rate adjustment strategy, parameter gamma set to 0.5, stepvalue set to 10000, and 30,000 training iterations.
(6) The tuned second neural network model is redeployed from the server side to the embedded device.
The tuned model is transmitted back to the embedded device using FTP (File Transfer Protocol), and the device is restarted. Up to this point, the embedded device has been working with the model trained on the universal data set.
In this embodiment, an experiment on the automatic learning method of the pedestrian detector was carried out in a real scene. Referring to FIG. 2, the results before automatic learning are shown in (a)-(d) and the results after automatic learning are shown in (e)-(h); it can be seen from the figures that the invention can accurately locate pedestrians and their corresponding regions in the specific scene.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (7)

1. A method for automatic learning of a pedestrian detector under a specific scene based on deep network enhancement, characterized by comprising the following steps:
(1) training a first neural network and a second neural network with a universal data set at the server side, wherein the second neural network is to be deployed on the embedded device;
(2) capturing images of the current scene with the embedded device during pedestrian detection to obtain newly added image samples, and transmitting the newly added image samples to the server side;
(3) testing the newly added image samples at the server side with the previously trained first neural network, and labeling the samples with the test scores of the first neural network;
(4) estimating the size of the pedestrian detection frame at the current height of the embedded device, calculating the difference between each detection frame in the positive samples and the estimated pedestrian detection frame, removing the sample if the difference exceeds a threshold, and keeping the remaining samples;
(5) tuning the second neural network at the server side with the remaining samples;
(6) redeploying the tuned second neural network model from the server side to the embedded device;
in the step (4), the steps of judging whether to reject a sample are:
for each sample {l_i, s_i} in the positive sample set P, judging whether to remove it from the positive sample set by the following criteria:
if |x_l - x_r| > γ·w, then {l_i, s_i} is removed from P and added to N;
if |x_l - x_r|·γ < w, then {l_i, s_i} is removed from P and added to N;
if |y_l - y_r| > γ·h, then {l_i, s_i} is removed from P and added to N;
if |y_l - y_r|·γ < h, then {l_i, s_i} is removed from P and added to N;
if |x_l - x_r|·|y_l - y_r| > γ_S·S, then {l_i, s_i} is removed from P and added to N;
if |x_l - x_r|·|y_l - y_r|·γ_S < S, then {l_i, s_i} is removed from P and added to N;
the positive sample set obtained after the above operations is denoted P_1 and the negative sample set is denoted N_1, wherein γ and γ_S are both preset parameters greater than 1.
2. The method for automatic learning of the pedestrian detector under the specific scene based on the deep network enhancement as claimed in claim 1, wherein in the step (1), the step of training the first neural network and the second neural network by using the common data set at the server side comprises:
the data which are manually marked under other scenes except the current scene are used as a universal data set, the fast R-CNN based on ResNet-101 is used as a first neural network, and the SSD based on AlexNet is used as a second neural network.
3. The method for automatic learning of the pedestrian detector under the specific scene based on the deep network enhancement as claimed in claim 2, wherein the network parameters of the pre-trained networks used by the first neural network and the second neural network during training are obtained as follows: network parameters for classification are obtained by training on ImageNet, the layers after the last convolutional layer are removed, and the parameters of the remaining convolutional layers are used as initialization parameters for the current training.
4. The method for automatic learning of pedestrian detectors under a specific scenario based on deep network enhancement as claimed in claim 1, wherein in step (2), the embedded device transmits the newly added image samples to the server side using FTP protocol.
5. The automatic learning method of the pedestrian detector under the specific scene based on the deep network enhancement as claimed in claim 1, wherein the embedded device is used for screening the collected image samples in the working process of pedestrian detection, and the steps are as follows:
let the number of pedestrians detected by the current device be N_p; if N_p ≥ T_p, where T_p is a preset threshold, the captured image is used as a newly added image sample and transmitted to the server side; otherwise, the current image is discarded.
6. The method for automatic learning of the pedestrian detector under the specific scene based on the deep network enhancement as claimed in claim 2, wherein in the step (3), the step of testing and labeling the newly added image sample by the first neural network is:
for each image I, the result of testing with the first neural network is recorded as {(l_i, s_i)}, i = 1, ..., n, wherein n is the total number of detection frames, l_i is the position vector of the i-th detection frame, l_i = [x_l, y_l, x_r, y_r], where (x_l, y_l) and (x_r, y_r) are the coordinates of the upper-left and lower-right corners of the detection frame in the image, and s_i is the probability that the i-th detection frame is judged to be a pedestrian, with 0 ≤ s_i ≤ 1;
for each detection frame, a set threshold T is used to judge whether the sample is positive, i.e., for a sample {l_i, s_i}:
if s_i ≥ T, then {l_i, s_i} is a positive sample;
if s_i < T, then {l_i, s_i} is a negative sample;
the set of all positive samples obtained after performing the above operation on all images is denoted P, and the set of all negative samples is denoted N.
7. The method for automatic learning of pedestrian detectors under a specific scene based on deep network enhancement as claimed in claim 6, wherein in step (4), the method for estimating the size of the pedestrian detection frame under the current height of the embedded device is as follows:
a person stands under the camera, the person's target frame is taken as the pedestrian size, and the length and width of the target frame are taken as the person's height and width, respectively;
let the area of the pedestrian when standing at the i-th position under the camera be S_i, the height h_i, and the width w_i; data are acquired multiple times, and the area S, height h and width w of a pedestrian at the current camera height are obtained by averaging.
CN201810264330.XA 2018-03-28 2018-03-28 Automatic learning method of pedestrian detector in specific scene based on deep network enhancement Active CN108549852B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810264330.XA CN108549852B (en) 2018-03-28 2018-03-28 Automatic learning method of pedestrian detector in specific scene based on deep network enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810264330.XA CN108549852B (en) 2018-03-28 2018-03-28 Automatic learning method of pedestrian detector in specific scene based on deep network enhancement

Publications (2)

Publication Number Publication Date
CN108549852A CN108549852A (en) 2018-09-18
CN108549852B true CN108549852B (en) 2020-09-08

Family

ID=63517102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810264330.XA Active CN108549852B (en) Automatic learning method of pedestrian detector in specific scene based on deep network enhancement

Country Status (1)

Country Link
CN (1) CN108549852B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614941B (en) * 2018-12-14 2023-02-03 中山大学 Embedded crowd density estimation method based on convolutional neural network model
CN109635750A (en) * 2018-12-14 2019-04-16 广西师范大学 A kind of compound convolutional neural networks images of gestures recognition methods under complex background
CN109816014A (en) * 2019-01-22 2019-05-28 天津大学 Generate method of the deep learning target detection network training with labeled data collection
CN110458114B (en) * 2019-08-13 2022-02-01 杜波 Method and device for determining number of people and storage medium
CN110705630A (en) * 2019-09-27 2020-01-17 聚时科技(上海)有限公司 Semi-supervised learning type target detection neural network training method, device and application
CN110909794B (en) * 2019-11-22 2022-09-13 乐鑫信息科技(上海)股份有限公司 Target detection system suitable for embedded equipment
CN111461120A (en) * 2020-04-01 2020-07-28 济南浪潮高新科技投资发展有限公司 Method for detecting surface defects of convolutional neural network object based on region
CN111582092B (en) * 2020-04-27 2023-12-22 西安交通大学 Pedestrian abnormal behavior detection method based on human skeleton
CN111695504A (en) * 2020-06-11 2020-09-22 重庆大学 Fusion type automatic driving target detection method
CN113706511A (en) * 2021-08-31 2021-11-26 佛山市南海区广工大数控装备协同创新研究院 Composite material damage detection method based on deep learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609686A (en) * 2012-01-19 2012-07-25 宁波大学 Pedestrian detection method
CN105528754A (en) * 2015-12-28 2016-04-27 湖南师范大学 Old people information service system based on dual neural network behavior recognition model
CN106845415A (en) * 2017-01-23 2017-06-13 中国石油大学(华东) A kind of pedestrian based on deep learning becomes more meticulous recognition methods and device
CN106982359A (en) * 2017-04-26 2017-07-25 深圳先进技术研究院 A kind of binocular video monitoring method, system and computer-readable recording medium
CN107341436A (en) * 2016-08-19 2017-11-10 北京市商汤科技开发有限公司 Gestures detection network training, gestures detection and control method, system and terminal
CN107463892A (en) * 2017-07-27 2017-12-12 北京大学深圳研究生院 Pedestrian detection method in a kind of image of combination contextual information and multi-stage characteristics

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016006626A (en) * 2014-05-28 2016-01-14 株式会社デンソーアイティーラボラトリ Detector, detection program, detection method, vehicle, parameter calculation device, parameter calculation program, and parameter calculation method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609686A (en) * 2012-01-19 2012-07-25 宁波大学 Pedestrian detection method
CN105528754A (en) * 2015-12-28 2016-04-27 湖南师范大学 Old people information service system based on dual neural network behavior recognition model
CN107341436A (en) * 2016-08-19 2017-11-10 北京市商汤科技开发有限公司 Gestures detection network training, gestures detection and control method, system and terminal
CN106845415A (en) * 2017-01-23 2017-06-13 中国石油大学(华东) A kind of pedestrian based on deep learning becomes more meticulous recognition methods and device
CN106982359A (en) * 2017-04-26 2017-07-25 深圳先进技术研究院 A kind of binocular video monitoring method, system and computer-readable recording medium
CN107463892A (en) * 2017-07-27 2017-12-12 北京大学深圳研究生院 Pedestrian detection method in a kind of image of combination contextual information and multi-stage characteristics

Also Published As

Publication number Publication date
CN108549852A (en) 2018-09-18

Similar Documents

Publication Publication Date Title
CN108549852B (en) Automatic learning method of pedestrian detector in specific scene based on deep network enhancement
Zhao et al. Cloud shape classification system based on multi-channel cnn and improved fdm
US9652694B2 (en) Object detection method, object detection device, and image pickup device
CN111783576B (en) Pedestrian re-identification method based on improved YOLOv3 network and feature fusion
CN106960195B (en) Crowd counting method and device based on deep learning
WO2020015492A1 (en) Method and device for identifying key time point of video, computer apparatus and storage medium
WO2019127273A1 (en) Multi-person face detection method, apparatus, server, system, and storage medium
CN110264493B (en) Method and device for tracking multiple target objects in motion state
KR101697161B1 (en) Device and method for tracking pedestrian in thermal image using an online random fern learning
CN109145708B (en) Pedestrian flow statistical method based on RGB and D information fusion
WO2020094088A1 (en) Image capturing method, monitoring camera, and monitoring system
Avgerinakis et al. Recognition of activities of daily living for smart home environments
WO2016149938A1 (en) Video monitoring method, video monitoring system and computer program product
Li et al. Robust people counting in video surveillance: Dataset and system
US20180060653A1 (en) Method and apparatus for annotating a video stream comprising a sequence of frames
Amirgholipour et al. A-CCNN: adaptive CCNN for density estimation and crowd counting
CN105512618B (en) Video tracing method
CN109886170B (en) Intelligent detection, identification and statistics system for oncomelania
CN103530638A (en) Method for matching pedestrians under multiple cameras
CN104200218B (en) A kind of across visual angle action identification method and system based on timing information
WO2013075295A1 (en) Clothing identification method and system for low-resolution video
CN113435355A (en) Multi-target cow identity identification method and system
CN113076860B (en) Bird detection system under field scene
CN114299606A (en) Sleep detection method and device based on front-end camera
CN113177476A (en) Identification method, system and test method for heel key points of standing long jump

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant