CN107085609A - A pedestrian retrieval method for multi-feature fusion based on a neural network - Google Patents

A pedestrian retrieval method for multi-feature fusion based on a neural network - Download PDF

Info

Publication number
CN107085609A
CN107085609A CN201710270659.2A
Authority
CN
China
Prior art keywords
pedestrian
vector
cnn
characteristic
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201710270659.2A
Other languages
Chinese (zh)
Inventor
吴耀文
周学平
廖宜良
张修
吴颖波
张勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HUBEI KENENG POWER ELECTRONICS CO Ltd
Jingzhou Power Supply Co of State Grid Hubei Electric Power Co Ltd
Original Assignee
HUBEI KENENG POWER ELECTRONICS CO Ltd
Jingzhou Power Supply Co of State Grid Hubei Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by HUBEI KENENG POWER ELECTRONICS CO Ltd, Jingzhou Power Supply Co of State Grid Hubei Electric Power Co Ltd filed Critical HUBEI KENENG POWER ELECTRONICS CO Ltd
Priority to CN201710270659.2A priority Critical patent/CN107085609A/en
Publication of CN107085609A publication Critical patent/CN107085609A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Library & Information Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to video analysis technology, and in particular to a pedestrian retrieval method that performs multi-feature fusion based on a neural network. Features are computed for each pedestrian detected in video and stored in a feature database for the gallery set; features are then computed for the query pedestrian and compared against the database, yielding a result list ranked with the most similar pedestrians first. By computing multiple retrieval features and feature distances and fusing them with an optimal distance weight vector W, the invention makes both the upper body and the lower body of the retrieved pedestrians resemble the query pedestrian, and ranking by a single fused feature distance keeps retrieval convenient. The method is widely applicable, accurate, and simple to use. It effectively addresses the shortcomings of current surveillance-video pedestrian retrieval: low detection accuracy and retrieval similarity, poor integration of the various detection methods and feature distances, narrow applicability, and complicated application.

Description

A pedestrian retrieval method for multi-feature fusion based on a neural network
Technical field
The present invention relates to video analysis technology, and in particular to a pedestrian retrieval method that performs multi-feature fusion based on a neural network.
Background art
There are many existing methods for image retrieval and pedestrian retrieval, for example the image retrieval method CEDD (CEDD: Color and Edge Directivity Descriptor. A Compact Descriptor for Image Indexing and Retrieval, Savvas A., 2008) and the pedestrian retrieval method WHOS (Person Re-Identification by Iterative Re-Weighted Sparse Ranking, Giuseppe Lisanti, 2015). These methods achieve good retrieval results on scientific data sets such as the ViPeR pedestrian retrieval data set (https://vision.soe.ucsc.edu/node/178), but perform poorly on pedestrians in real surveillance video; it is therefore necessary to integrate them into a new retrieval feature.
From the standpoint of retrieval results, some methods such as WHOS may succeed in retrieval, i.e. the query pedestrian appears in the result list, yet the other pedestrians in the results are not very similar to the query and so give the user little reference information. For example, if the query pedestrian wears blue clothes on the upper body and black trousers on the lower body, not many of the pedestrians ranked as most similar are also "blue upper body, black lower body". Other methods can retrieve by body part, e.g. A General Method for Appearance-based People Search Based on Textual Queries (R. Satta, 2012), and their results are more similar to the query pedestrian, but the retrieval process and feature distances are more complicated, requiring the explicit use of multiple feature distances plus a retrieval filtering stage; similarity cannot be expressed with a single feature distance.
At present there are many ways to compute a feature distance, for example the Bhattacharyya method (Bh distance for short; https://en.wikipedia.org/wiki/Bhattacharyya_distance) and the Tanimoto method (Fuzzy Algorithms: With Applications to Image Processing and Pattern Recognition, 1996). Different distances suit different features and scenes, and it is desirable to integrate these features for use on surveillance video.
In view of the rapid development of convolutional neural networks (CNN) and the excellent results they have achieved in many fields such as image recognition (ImageNet Classification with Deep Convolutional Neural Networks, 2012; OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks, 2014), how to use CNN models to combine the various detection methods and feature distances well, reaching the more desirable goal of wide applicability, high accuracy, and easy application, has become a direction of research.
Summary of the invention
In view of the above deficiencies of the prior art, the object of the present invention is to provide a neural-network-based multi-feature-fusion pedestrian retrieval method that computes multiple retrieval features and feature distances and fuses them with an optimal distance weight vector W, so that both the upper body and the lower body of the retrieved pedestrians resemble the query pedestrian, and so that ranking by a single feature distance keeps retrieval convenient; the method is widely applicable, accurate, and simple to use, and effectively solves the problem that current surveillance-video pedestrian retrieval suffers from low detection accuracy and low retrieval similarity.
The present invention achieves the above object through the following technical solution:
The basic idea of this neural-network-based multi-feature-fusion pedestrian retrieval method is as follows:
The initial step is pedestrian detection, whose result is a detection box (see Fig. 1). The basic retrieval process (see Fig. 2) computes features for the pedestrians detected in video and stores them in a feature database for the gallery set; features are then computed for the query pedestrian and compared against the database to obtain retrieval results ranked by similarity, highest first. For example, rank1 has the highest similarity, and in Fig. 2 rank1 and the query are the same pedestrian. The pedestrian foreground mask is crucial for pedestrian retrieval (see Fig. 3); the foreground mask obtained by this method has the same height and width as the RGB image block of the pedestrian box and takes only three values: background 0, upper body 1, and lower body 2. The method has two key steps, each served by its own CNN model: the computation of the pedestrian foreground mask, handled by the "foreground mask CNN" (see Fig. 4), which distinguishes upper body from lower body; and the computation of the feature distance, for which the "optimizing CNN" (see Fig. 7) learns the optimal distance weight vector W used to fuse the various feature distances into the degree of similarity between two pedestrians. The concrete steps are as follows:
A pedestrian retrieval method for multi-feature fusion based on a neural network, characterized in that it comprises the following steps:
(1) Extract the CNN foreground mask: For the input video and the pedestrians already detected in it, compute the GMM foreground mask inside each pedestrian box using a GMM (Gaussian Mixture Model), and recolor to gray the parts of the box's RGB image block that correspond to background in the GMM mask, thereby suppressing background interference. Using the motion information in the video, compute the optical-flow vector of every pixel in the pedestrian box. Then combine the GMM foreground mask, the optical-flow magnitude, the optical-flow direction, and the modified RGB image block into the "pedestrian mask combination feature" and feed it to the "foreground mask CNN", obtaining a CNN foreground mask that distinguishes upper body from lower body, i.e. whose values are only: background 0, upper body 1, and lower body 2.
(2) Compute the retrieval feature vectors: For each pedestrian's RGB image block and its corresponding upper/lower-body CNN foreground mask, compute the HS, RGB, improved-CEDD, and improved-WHOS features over the regions of the whole-body, upper-body, and lower-body foreground masks, giving 12 retrieval feature vectors in total, and store the feature vectors of the gallery set in the feature database. The improved CEDD feature is computed only over the pixels of the mask region; the improved WHOS likewise uses only mask-region pixels, and omits the HOG feature.
(3) Compute the feature distance: For the retrieval feature vectors of two pedestrians, compute the distance between each pair of same-type sub-feature vectors with both the Bhattacharyya method and the Tanimoto method, giving 24 distances for the 12 features and forming one 24-dimensional distance vector D. Then convert this distance vector to a single distance value with the weight vector W learned by the "optimizing CNN"; the conversion formula is d = W'D, where W is a 24-dimensional weight column vector, D is the 24-dimensional distance column vector, and the result d is a 1x1 scalar. To obtain results whose upper-body and lower-body features both resemble the query pedestrian, the method of the invention uses this single feature distance and needs no filtering step.
(4) Sort and output the retrieval results: For the query pedestrian, compute the feature vectors with steps (1) and (2), then compute the distance to each gallery pedestrian in the feature database with step (3), and finally sort these distance values to obtain the retrieval results; a small distance indicates high similarity and a large distance low similarity.
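At query time, steps (3) and (4) reduce to computing a 24-dimensional distance vector against every gallery pedestrian, fusing each with d = W'D, and sorting. A minimal NumPy sketch of that ranking stage, with random stand-in feature distances and a uniform placeholder W (the real distances would come from the Bhattacharyya/Tanimoto computations, and the real W from the "optimizing CNN"):

```python
import numpy as np

def rank_gallery(W, D_matrix):
    """Fuse each 24-dim distance vector with weights W (d = W'D) and
    return gallery indices sorted by ascending fused distance
    (smallest distance = most similar, i.e. rank-1)."""
    d = D_matrix @ W           # fused scalar distance per gallery pedestrian
    order = np.argsort(d)      # ascending: most similar first
    return order, d

# Stand-in data: 5 gallery pedestrians, 24 feature distances each.
rng = np.random.default_rng(0)
D_matrix = rng.random((5, 24))
D_matrix[2] *= 0.1                 # pedestrian 2 is close in every feature
W = np.full(24, 1 / 24)            # uniform weights standing in for the learned W

order, d = rank_gallery(W, D_matrix)
print(order[0])                    # → 2 (the near-duplicate ranks first)
```

The fusion makes the ranking a single sort over one scalar per gallery entry, which is the "1 feature distance, no filtering" convenience the method claims.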
The concrete steps for computing a pedestrian's CNN foreground mask in step (1) are as follows. The first step trains the "foreground mask CNN". First prepare the training samples: take pedestrians detected in surveillance video as samples, scale them to the standard size PxQ, and manually label each pedestrian's expected foreground mask with distinct mask values for the upper and lower body, i.e. background, upper body, and lower body labeled 0, 1, and 2 respectively; this serves as the output of a training sample. Then compute each sample's mask combination feature: use a GMM to compute the GMM foreground mask inside the pedestrian detection box, set to gray the regions of the box's RGB image block that correspond to GMM background, and compute the optical-flow vector of every pixel in the box. Each pedestrian thus yields six PxQ matrices: the GMM foreground mask, the optical-flow magnitude, the optical-flow direction, and the modified R, G, and B; these constitute the sample's input data. Finally, train the "foreground mask CNN" with the above training samples. This CNN has six layers: an input layer, a convolutional layer, a max-pooling layer, a convolutional layer, a max-pooling layer, and a fully connected layer; its input is the six PxQ matrices above and its output is a PxQ image whose values are 0, 1, or 2, representing background, upper body, and lower body respectively. The second step uses the trained "foreground mask CNN" to compute CNN foreground masks.
The weight vector W in step (3) is obtained as follows:
The first step prepares the training samples. For the pedestrians in the sample library, compute and store the retrieval feature vectors. Then select one pedestrian A and find another sample B1 belonging to the same pedestrian as A, then select N-1 (N > 3) other samples that do not belong to the same pedestrian as A, forming the sample group {B1, B2, ..., BN}, where A and B1 are the same pedestrian. Compute the 24-dimensional feature-distance vector between pedestrian A and each member of {B1, B2, ..., BN} and L2-normalize each distance vector, obtaining one Nx24 feature-distance matrix in which each row is the feature-distance vector between two pedestrian samples; one matrix forms one training sample. The desired output of every training sample is fixed as the N-element vector {0, 1, ..., 1} after L2 normalization, denoted Y. Many training samples are generated in this way.
The second step trains the "optimizing CNN" to compute W: feed the above training samples into the "optimizing CNN"; the learned 1x24 convolution kernels give the optimal weight vector W. The structure of the "optimizing CNN" comprises three layers: an input layer, a convolutional layer, and a max-pooling output layer. The input layer corresponds to the Nx24 feature-distance matrix D, the convolutional layer is a matrix of K 1x24 convolution kernels, and the max-pooling output layer is an N-dimensional vector Y. After training finishes, the K 1x24 convolution kernels are averaged to obtain the optimal W.
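A NumPy sketch of the "optimizing CNN" forward pass and of the final kernel-averaging step; the distance matrix and kernel values are random stand-ins for real data and trained parameters, and N, K are arbitrary example sizes:

```python
import numpy as np

def optimizing_cnn_forward(D, kernels):
    """Forward-pass sketch of the 'optimizing CNN': the input D is N x 24
    (one feature-distance vector per pedestrian pair), each 1x24 kernel
    produces an N x 1 response, and the output layer max-pools the K
    responses into one N-vector of fused distances."""
    responses = D @ kernels.T          # N x K: each column is one kernel's output
    return responses.max(axis=1)       # max-pooling across the K kernels

rng = np.random.default_rng(1)
D = rng.random((6, 24))                # N = 6 pedestrian pairs (stand-in distances)
kernels = rng.random((4, 24))          # K = 4 hypothetical trained kernels
y = optimizing_cnn_forward(D, kernels)
assert y.shape == (6,)

# After training, the K kernels are averaged into the single weight vector W.
W = kernels.mean(axis=0)               # W_i = (1/K) * sum_j V_ji
assert W.shape == (24,)
```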
Compared with the prior art, the beneficial effects of the present invention are as follows:
This neural-network-based multi-feature-fusion pedestrian retrieval method computes multiple retrieval features and feature distances and fuses them with the optimal distance weight vector W, so that both the upper and the lower body of the retrieved pedestrians resemble the query pedestrian, while ranking by a single feature distance keeps retrieval convenient. It is widely applicable, accurate, and simple to use, and effectively solves the problems that the accuracy and retrieval similarity of current surveillance-video pedestrian retrieval are low, that the various detection methods and feature distances cannot be combined well, that applicability is narrow, and that application is complicated.
Brief description of the drawings
Fig. 1 is the pedestrian detection box diagram of the present invention;
Fig. 2 is a schematic diagram of the pedestrian retrieval process of the present invention;
Fig. 3 is a schematic diagram of the pedestrian foreground mask of the present invention;
Fig. 4 is a schematic diagram of the computation of the pedestrian mask combination feature and the training of the foreground mask CNN of the present invention;
Fig. 5 is a schematic diagram of the CNN foreground mask computation of the present invention;
Fig. 6 is a schematic diagram of the computation of the retrieval features and feature-distance vector of the present invention;
Fig. 7 is a schematic diagram of the CNN structure for computing the optimal W of the present invention;
Fig. 8 is a schematic diagram of the structure of the "optimizing CNN" of the present invention;
Fig. 9 is a schematic diagram of the description of the neural-network-based multi-feature-fusion pedestrian retrieval algorithm of the present invention.
Embodiment
The pedestrian retrieval method for multi-feature fusion based on a neural network is described further below with reference to Figs. 1 to 9.
(1) The dimensions of the foreground mask:
For a detected pedestrian, compute the GMM foreground mask inside the pedestrian box using a GMM (Gaussian Mixture Model), and recolor to gray the parts of the box's RGB image block that correspond to background in the GMM mask, to suppress background interference; then scale height and width to the standard size PxQ. The CNN foreground mask is a PxQ matrix whose elements take only three values: background 0, upper body 1, and lower body 2. Referring to Fig. 4, the following data, all PxQ matrices, constitute the pedestrian mask combination feature: the optical-flow magnitude, the optical-flow direction, the GMM foreground mask, and the R, G, and B parts of the RGB image block.
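As an illustration, the six PxQ input channels can be stacked as below. All arrays here are synthetic stand-ins; real values would come from GMM background subtraction and optical flow, and P = 64, Q = 32 are example sizes (the patent leaves P and Q as parameters):

```python
import numpy as np

P, Q = 64, 32  # hypothetical normalized pedestrian size

rng = np.random.default_rng(2)
rgb = rng.integers(0, 256, size=(P, Q, 3)).astype(float)   # pedestrian RGB block
gmm_mask = (rng.random((P, Q)) > 0.4).astype(float)        # 1 = foreground, 0 = background
flow = rng.standard_normal((P, Q, 2))                       # per-pixel optical-flow vectors

# Gray out background pixels in the RGB block to suppress background clutter.
gray = 128.0
rgb_mod = np.where(gmm_mask[..., None] == 1, rgb, gray)

# Flow magnitude and direction channels.
flow_mag = np.hypot(flow[..., 0], flow[..., 1])
flow_dir = np.arctan2(flow[..., 1], flow[..., 0])

# Stack the six P x Q matrices into the CNN input.
combo = np.stack([gmm_mask, flow_mag, flow_dir,
                  rgb_mod[..., 0], rgb_mod[..., 1], rgb_mod[..., 2]])
assert combo.shape == (6, P, Q)
```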
(2) The computation of the feature distance and its dimensions:
To obtain retrieval results whose upper-body and lower-body features both resemble the query pedestrian, many retrieval methods require manually selecting and computing multiple feature distances and applying filtering; the method of the invention instead uses a single feature distance and needs no filtering. Referring to Figs. 6 to 8, the feature distance between two pedestrians computed by this method is one 1x1 scalar; a small distance indicates high similarity and a large distance low similarity. With this feature distance it is easier than with other methods to accurately retrieve pedestrians highly similar to the query, i.e. with similar upper- and lower-body features such as "white upper-body clothes, black lower-body trousers". To compute the feature distance, first compute the 12 retrieval features of each pedestrian, then compute the Bh and Tanimoto distances of these 12 features between the two pedestrians, obtaining one 24-dimensional feature-distance vector D, and finally compute the feature distance d with the formula:
d = W'D, where W and D are 24-dimensional vectors and W is the optimal distance weight vector.
The Bh distance follows the standard Bhattacharyya definition: for two n-dimensional vectors p and q,
d_Bh(p, q) = -ln( sum over i of sqrt(p_i * q_i) ).
The Tanimoto distance follows the standard Tanimoto coefficient: for two vectors x_i and x_j,
T_ij = (x_i . x_j) / (|x_i|^2 + |x_j|^2 - x_i . x_j),
and the range of T_ij is [0, 1].
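A small NumPy sketch of the two distance measures named above, under their standard textbook definitions (assumed here, since the method prescribes only the named measures):

```python
import numpy as np

def bhattacharyya_distance(p, q):
    """Bh distance between two n-dim histograms:
    d = -ln( sum_i sqrt(p_i * q_i) ); identical distributions give d = 0."""
    bc = np.sum(np.sqrt(p * q))
    return -np.log(bc)

def tanimoto_similarity(xi, xj):
    """Tanimoto coefficient T_ij = xi.xj / (|xi|^2 + |xj|^2 - xi.xj);
    lies in [0, 1] for non-negative vectors, 1 for identical vectors."""
    dot = np.dot(xi, xj)
    return dot / (np.dot(xi, xi) + np.dot(xj, xj) - dot)

p = np.array([0.25, 0.25, 0.5])
q = np.array([0.25, 0.25, 0.5])
d0 = bhattacharyya_distance(p, q)
t0 = tanimoto_similarity(p, q)
assert abs(d0) < 1e-12 and t0 == 1.0   # identical histograms: distance 0, similarity 1
```

Note that Tanimoto yields a similarity in [0, 1]; as one entry of the distance vector D it would in practice be mapped to a distance, e.g. 1 - T_ij.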
(3) The structure and use of the foreground mask CNN model:
Referring to Figs. 4 and 5, the foreground mask CNN computes the foreground mask of a detected pedestrian. Its input is one pedestrian's mask combination feature, comprising six PxQ matrices: the optical-flow magnitude, the optical-flow direction, the GMM foreground mask, and the R, G, and B parts of the RGB image block. Its output is the PxQ CNN foreground mask. This CNN has six layers:
[a] the input layer is six PxQ matrices;
[b] a convolutional layer with M1 convolution kernels of dimension 5x5x6, outputting M1 PxQ matrices;
[c] a max-pooling layer with 2x2 processing units, outputting M1 (P/2)x(Q/2) matrices;
[d] a convolutional layer with M2 convolution kernels of dimension 3x3xM1, outputting M2 (P/2)x(Q/2) matrices;
[e] a max-pooling layer with 2x2 processing units, outputting M2 (P/4)x(Q/4) matrices;
[f] a fully connected layer, outputting one PxQ matrix, the CNN foreground mask, with only three values: background 0, upper body 1, and lower body 2.
The loss function of the foreground mask CNN is the Euclidean distance between the CNN output mask and the expected mask.
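The layer dimensions can be traced with a small helper. This is only a shape walk-through of the architecture, not a trained model; it assumes "same"-padded convolutions, so that only the 2x2 max-pooling layers change the spatial size, and M1 = 16, M2 = 32 are hypothetical channel counts:

```python
def mask_cnn_shapes(P, Q, M1=16, M2=32):
    """Trace tensor shapes through the 6-layer foreground-mask CNN,
    assuming 'same'-padded convolutions (only pooling changes size)."""
    return [
        ("input",                  (6, P, Q)),
        ("conv 5x5x6 -> M1",       (M1, P, Q)),
        ("max-pool 2x2",           (M1, P // 2, Q // 2)),
        ("conv 3x3xM1 -> M2",      (M2, P // 2, Q // 2)),
        ("max-pool 2x2",           (M2, P // 4, Q // 4)),
        ("fully connected -> mask", (1, P, Q)),  # values in {0, 1, 2}
    ]

for name, shape in mask_cnn_shapes(64, 32):
    print(name, shape)
```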
(4) The training of the foreground mask CNN model:
Referring to Figs. 1 to 5, the training samples come from pedestrians in surveillance video, so that the trained model fits real conditions. First detect the pedestrians in the surveillance video, e.g. with the Piotr Dollar toolbox (http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html), obtaining pedestrian detection boxes; then manually label each pedestrian's PxQ expected foreground mask, with only three values: background 0, upper body 1, lower body 2. Next compute each pedestrian's mask combination feature, comprising six PxQ matrices: the optical-flow magnitude, the optical-flow direction, the GMM foreground mask, and the R, G, and B parts of the modified RGB image block. The data of each pedestrian thus form one training sample whose input data is the mask combination feature and whose output is the expected foreground mask. In the RGB image block of the pedestrian box, the regions corresponding to GMM-mask background are recolored gray, which removes some background noise and improves accuracy. More than 5000 training samples are collected, and training is performed with stochastic gradient descent (SGD), yielding the foreground mask CNN. The CNN uses the open-source MatConvNet (http://www.vlfeat.org/matconvnet/); the optical-flow vectors can be computed with the Piotr Dollar toolbox.
(5) The computation of the retrieval features:
Referring to Fig. 6, the present invention uses several retrieval features: HS, RGB, improved CEDD, and improved WHOS. Computing one pedestrian's retrieval features takes two steps:
[a] compute the CNN foreground mask distinguishing upper and lower body;
[b] for the pedestrian regions corresponding to the three CNN foreground masks (upper body, lower body, whole body), compute the above four features, obtaining 12 retrieval features.
The 12 retrieval features of each pedestrian are stored in the feature database of the gallery set.
CEDD comes from http://chatzichristofis.info; the original algorithm does not support an ROI (Region Of Interest), while the improved CEDD does, and the ROI can be any of the three CNN foreground masks.
The WHOS feature comes from http://www.micc.unifi.it/lisanti/source-code/re-id/; the original algorithm does not support an ROI, while the improved WHOS does, and the HOG feature of the original algorithm is removed.
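A sketch of the region-restricted feature idea: a color histogram computed only over the pixels a mask selects, evaluated for the whole-body, upper-body, and lower-body regions. The plain histogram here merely stands in for HS/RGB/CEDD/WHOS; the three regions times the four feature types give the 12 retrieval features. All data is synthetic:

```python
import numpy as np

def masked_histogram(channel, roi, bins=8):
    """Histogram of one image channel restricted to the ROI pixels,
    L1-normalized so regions of different size are comparable."""
    vals = channel[roi]
    hist, _ = np.histogram(vals, bins=bins, range=(0, 256))
    return hist / max(hist.sum(), 1)

# Synthetic mask with values 0 (background), 1 (upper body), 2 (lower body).
rng = np.random.default_rng(3)
mask = rng.integers(0, 3, size=(64, 32))
channel = rng.integers(0, 256, size=(64, 32)).astype(float)

regions = {"whole": mask > 0, "upper": mask == 1, "lower": mask == 2}
features = {name: masked_histogram(channel, roi) for name, roi in regions.items()}
assert all(abs(h.sum() - 1.0) < 1e-9 for h in features.values())
```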
(6) The structure of the optimizing CNN and the computation of the optimal distance weights W:
The optimizing CNN computes the optimal distance weight vector W. Referring to Figs. 6 and 8, it comprises three layers:
[a] input layer: one Nx24 matrix containing the feature-distance vectors of N pedestrian pairs;
[b] convolutional layer: K convolution kernels of dimension 1x24, outputting K Nx1 matrices;
[c] output layer: one processing unit of dimension 1xK, outputting one Nx1 matrix, each value representing the feature distance between one pedestrian pair.
After the optimizing CNN is trained, the K 1x24 kernels (i.e. K 24-dimensional vectors) are averaged to obtain the 24-dimensional optimal distance weight vector W, by the formula:
W_i = (1/K) * sum over j of V_ij, i = 1, ..., 24, where V_ij is the i-th element of the j-th 24-dimensional vector.
The loss function is shown in Fig. 7; L2 normalization follows the standard formula
X_hat = X / sqrt( sum over i of X_i^2 ), applied likewise to Y, where X and Y are 24-dimensional vectors.
(7) The training of the optimizing CNN:
Referring to Figs. 6 to 8, the training samples come from pedestrians in surveillance video, so that the trained model fits real conditions. One training sample is one Nx24 distance matrix and requires N pedestrians. For one pedestrian A, find another sample B1 belonging to the same pedestrian as A, then select N-1 (N > 3) other pedestrians B2 to BN that do not belong to the same pedestrian as A, forming the sample group {B1, B2, ..., BN}, where A and B1 are the same pedestrian. Compute the 24-dimensional feature-distance vector between sample A and each member of {B1, B2, ..., BN}, obtaining one Nx24 feature-distance matrix. The desired output is the N-dimensional vector {0, 1, 1, ..., 1} after L2 normalization, where 0 means distance 0 (the same pedestrian) and 1 means a different pedestrian; the desired output expresses the similarity ordering of the retrieval results. The loss function is defined in Fig. 7. More than 5000 samples are collected.
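One training sample can be assembled as follows; the 24-dimensional distance vectors are random stand-ins for the real Bh/Tanimoto distances between pedestrian A and each of B1..BN, and N = 5 is an arbitrary example size:

```python
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v)

def make_training_sample(dist_vectors):
    """Build one 'optimizing CNN' training sample from the N 24-dim
    feature-distance vectors between pedestrian A and the group
    {B1, ..., BN} (B1 being the same pedestrian as A).  Each row is
    L2-normalized; the target is {0, 1, ..., 1}, L2-normalized."""
    X = np.array([l2_normalize(d) for d in dist_vectors])   # N x 24 input matrix
    target = np.ones(len(dist_vectors))
    target[0] = 0.0                                         # A and B1: same pedestrian
    return X, l2_normalize(target)

rng = np.random.default_rng(4)
N = 5
X, Y = make_training_sample(rng.random((N, 24)))
assert X.shape == (N, 24) and Y.shape == (N,)
assert Y[0] == 0.0 and abs(np.linalg.norm(Y) - 1.0) < 1e-12
```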
The above describes only preferred embodiments of the present invention, and the examples do not limit the substantive content of the invention in any form. Any simple modification or variation that a person of ordinary skill in the art makes to the above embodiments according to the technical essence of the invention after reading this specification, and any equivalent embodiment obtained by equivalent variation or modification possibly using the technical content disclosed above, still falls within the scope of the technical solution of the invention without departing from its spirit and scope.

Claims (3)

1. a kind of pedestrian retrieval method that multiple features fusion is carried out based on neutral net, it is characterised in that it comprises the following steps:
(1)Extract CNN foreground masks:Video and the pedestrian having been detected by for input, using GMM(Gaussian Mixture Model) GMM foreground masks in pedestrian's square frame are calculated, covered with GMM prospects in the RGB image block that square frame is included The color of the corresponding part of background in code is changed to grey, can so eliminate the interference of background area;Recycle in video Movable information calculates the light stream vector of each pixel in pedestrian's square frame, then by GMM foreground masks, the light in pedestrian's square frame The amplitude of flow vector, the direction of light stream vector and amended RGB image block are combined into " pedestrian's mask assemblage characteristic ", input To " in foreground mask CNN ", obtaining distinguishing the CNN foreground masks of the lower part of the body, i.e. mask value only has:Background 0, the upper part of the body 1 With the lower part of the body 2;
(2) Compute retrieval feature vectors: for each pedestrian's RGB image block and the corresponding CNN foreground mask distinguishing the upper and lower body, compute the HS, RGB, improved CEDD, and improved WHOS features over each of the regions corresponding to the whole-body, upper-body, and lower-body foreground masks, giving 12 retrieval feature vectors in total, and store the retrieval feature vectors collected for the pedestrians to be searched in a feature database; the improved CEDD feature is computed only over the pixels of the foreground-mask region, and the improved WHOS feature is likewise computed only over the pixels of the foreground-mask region while omitting the HOG feature;
(3) Compute feature distances: for the retrieval feature vectors of 2 pedestrians, the distances between sub-feature vectors of the same type are computed with the Bhattacharyya method and the Tanimoto method respectively, yielding 24 distances for the 12 features, which form one 24-dimensional distance vector D; the weight vector W obtained from the "optimizing CNN" then converts this distance vector into a single distance value by the formula d = W'D, where W is a 24-dimensional weight column vector, D is the 24-dimensional distance column vector, and the result d is a 1×1 scalar; in order to obtain retrieval results whose upper-body and lower-body features are all similar to those of the pedestrian to be queried, the method of the invention uses this single feature distance and requires no filtering step;
(4) Sort and output retrieval results: for the pedestrian to be queried, compute the feature vectors by the methods of (1) and (2), then compute the distance to each pedestrian in the feature database by the method of (3), and finally sort these distance values to obtain the retrieval results, a small distance indicating high similarity and a large distance indicating low similarity.
2. The pedestrian retrieval method that performs multi-feature fusion based on a neural network according to claim 1, characterized in that the concrete steps of computing the CNN foreground mask of a pedestrian in step (1) are as follows. In the first step, the "foreground mask CNN" is trained. Training samples are prepared first: pedestrians detected in surveillance video are taken as samples and scaled to a standard size of P×Q; the expected foreground mask of each pedestrian is then labelled by hand with different mask values for the upper and lower body, i.e. background, upper body, and lower body are labelled 0, 1, and 2 respectively, serving as the output of the training sample. The mask combination feature of each sample is then computed: the GMM foreground mask within the pedestrian detection box is computed with a GMM, the region of the pedestrian-box RGB image block corresponding to the background of the GMM foreground mask is set to grey, and the optical-flow vector of each pixel in the detection box is computed, so that for each pedestrian 6 P×Q matrices are obtained: the GMM foreground mask, the magnitude of the optical flow, the direction of the optical flow, and the modified R, G, and B channels, which constitute the feature data of the sample and serve as the input of the training sample. Finally the "foreground mask CNN" is trained with these samples. This CNN has 6 layers: an input layer, a convolutional layer, a max-pooling layer, a convolutional layer, a max-pooling layer, and a fully connected layer; the input is the 6 P×Q matrices above, and the output is a P×Q image whose values are 0, 1, or 2, representing background, upper body, and lower body respectively. In the second step, the trained "foreground mask CNN" is used to compute the CNN foreground mask.
3. The pedestrian retrieval method that performs multi-feature fusion based on a neural network according to claim 1, characterized in that the weight vector W in step (3) is obtained as follows:
In the first step, training samples are prepared: for the pedestrians in the pedestrian sample library, the retrieval feature vectors are computed and stored; a pedestrian A is then selected, one other pedestrian B1 belonging to the same person as A is found, and N-1 (N>3) other pedestrians not belonging to the same person as A are selected, forming the sample group {B1, B2, ..., BN}, in which A and B1 belong to the same person; the 24-dimensional feature-distance vector between pedestrian A and each pedestrian in the sample group {B1, B2, ..., BN} is then computed, and each feature-distance vector is L2-normalized, yielding an N×24-dimensional feature-distance matrix in which each row is the feature-distance vector between 2 pedestrian samples, one matrix forming one training sample; the desired output of each training sample is fixed as the vector Y obtained by L2-normalizing the set {0, 1, ..., 1} of N elements; multiple training samples are generated in this way;
In the second step, the "optimizing CNN" is trained to compute W: the above training samples are input to the "optimizing CNN" for training, and the resulting 1×24-dimensional convolution kernel is the optimal weight vector W; the structure of the "optimizing CNN" has 3 layers: an input layer, a convolutional layer, and a max-pooling output layer, where the input layer corresponds to the N×24-dimensional feature-distance matrix, the convolutional layer consists of K convolution kernels of dimension 1×24, and the max-pooling output layer is the N-dimensional vector Y; after training finishes, the K 1×24 convolution kernels are averaged to obtain the optimal W.
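Steps (3) and (4) of claim 1 can be illustrated with a short sketch of the fused distance d = W'D and the subsequent ranking. This is a non-authoritative illustration: in the method, W comes from the trained "optimizing CNN", whereas here a hypothetical uniform weight vector stands in for it, and the gallery distance vectors are made-up constants.

```python
import numpy as np

def fused_distance(W, D):
    """Collapse a 24-dimensional distance vector D into one scalar d = W'D."""
    return float(np.dot(W, D))

def rank_gallery(W, gallery):
    """Return gallery indices sorted by fused distance, ascending:
    a smaller distance means higher similarity (claim 1, step (4))."""
    return sorted(range(len(gallery)),
                  key=lambda i: fused_distance(W, gallery[i]))

# Hypothetical uniform weights stand in for the learned W.
W = np.full(24, 1.0 / 24.0)
gallery = [np.full(24, 0.9), np.full(24, 0.1), np.full(24, 0.5)]
order = rank_gallery(W, gallery)  # gallery entry 1 has the smallest distance
```

With uniform weights the fused distance is simply the mean of the 24 per-feature distances, so the three gallery entries rank as 1, 2, 0.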
CN201710270659.2A 2017-04-24 2017-04-24 A kind of pedestrian retrieval method that multiple features fusion is carried out based on neutral net Withdrawn CN107085609A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710270659.2A CN107085609A (en) 2017-04-24 2017-04-24 A kind of pedestrian retrieval method that multiple features fusion is carried out based on neutral net


Publications (1)

Publication Number Publication Date
CN107085609A true CN107085609A (en) 2017-08-22

Family

ID=59611511

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710270659.2A Withdrawn CN107085609A (en) 2017-04-24 2017-04-24 A kind of pedestrian retrieval method that multiple features fusion is carried out based on neutral net

Country Status (1)

Country Link
CN (1) CN107085609A (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050129285A1 (en) * 2003-09-29 2005-06-16 Fuji Photo Film Co., Ltd. Collation system and computer readable medium storing thereon program
CN104484324A (en) * 2014-09-26 2015-04-01 徐晓晖 Pedestrian retrieval method based on multiple models and fuzzy color


Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107766794B (en) * 2017-09-22 2021-05-14 天津大学 Image semantic segmentation method with learnable feature fusion coefficient
CN107766794A (en) * 2017-09-22 2018-03-06 天津大学 The image, semantic dividing method that a kind of Fusion Features coefficient can learn
CN108416266A (en) * 2018-01-30 2018-08-17 同济大学 A kind of video behavior method for quickly identifying extracting moving target using light stream
US11270158B2 (en) 2018-02-09 2022-03-08 Beijing Sensetime Technology Development Co., Ltd. Instance segmentation methods and apparatuses, electronic devices, programs, and media
CN108460411A (en) * 2018-02-09 2018-08-28 北京市商汤科技开发有限公司 Example dividing method and device, electronic equipment, program and medium
CN108960331A (en) * 2018-07-10 2018-12-07 重庆邮电大学 A kind of recognition methods again of the pedestrian based on pedestrian image feature clustering
US20220012885A1 (en) * 2019-07-26 2022-01-13 Adobe Inc. Utilizing a two-stream encoder neural network to generate composite digital images
US11568544B2 (en) * 2019-07-26 2023-01-31 Adobe Inc. Utilizing a two-stream encoder neural network to generate composite digital images
CN111046724A (en) * 2019-10-21 2020-04-21 武汉大学 Pedestrian retrieval method based on area matching network
CN111046724B (en) * 2019-10-21 2021-09-14 武汉大学 Pedestrian retrieval method based on area matching network
CN110929770A (en) * 2019-11-15 2020-03-27 云从科技集团股份有限公司 Intelligent tracking method, system and equipment based on image processing and readable medium
CN110929619A (en) * 2019-11-15 2020-03-27 云从科技集团股份有限公司 Target object tracking method, system and device based on image processing and readable medium
CN111951189A (en) * 2020-08-13 2020-11-17 神思电子技术股份有限公司 Data enhancement method for multi-scale texture randomization
CN111951189B (en) * 2020-08-13 2022-05-06 神思电子技术股份有限公司 Data enhancement method for multi-scale texture randomization
CN113192101A (en) * 2021-05-06 2021-07-30 影石创新科技股份有限公司 Image processing method, image processing device, computer equipment and storage medium
CN113192101B (en) * 2021-05-06 2024-03-29 影石创新科技股份有限公司 Image processing method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107085609A (en) A kind of pedestrian retrieval method that multiple features fusion is carried out based on neutral net
CN111079602B (en) Vehicle fine granularity identification method and device based on multi-scale regional feature constraint
Zhou et al. Point to set similarity based deep feature learning for person re-identification
CN105512684B (en) Logo automatic identifying method based on principal component analysis convolutional neural networks
CN104063719B (en) Pedestrian detection method and device based on depth convolutional network
CN112801015B (en) Multi-mode face recognition method based on attention mechanism
CN107463920A (en) A kind of face identification method for eliminating partial occlusion thing and influenceing
CN104504395A (en) Method and system for achieving classification of pedestrians and vehicles based on neural network
CN104881671B (en) A kind of high score remote sensing image Local Feature Extraction based on 2D Gabor
CN107767416B (en) Method for identifying pedestrian orientation in low-resolution image
CN112862849B (en) Image segmentation and full convolution neural network-based field rice ear counting method
CN106022223B (en) A kind of higher-dimension local binary patterns face identification method and system
CN104077742B (en) Human face sketch synthetic method and system based on Gabor characteristic
CN113610046B (en) Behavior recognition method based on depth video linkage characteristics
CN109190458A (en) A kind of person of low position's head inspecting method based on deep learning
CN108876776B (en) Classification model generation method, fundus image classification method and device
CN112396036A (en) Method for re-identifying blocked pedestrians by combining space transformation network and multi-scale feature extraction
CN105868711A (en) Method for identifying human body behaviors based on sparse and low rank
CN116524255A (en) Wheat scab spore identification method based on Yolov5-ECA-ASFF
Sehree et al. Olive trees cases classification based on deep convolutional neural network from unmanned aerial vehicle imagery
CN107704509A (en) A kind of method for reordering for combining stability region and deep learning
Yun et al. Part-level convolutional neural networks for pedestrian detection using saliency and boundary box alignment
Zhang et al. Hyperspectral Image Classification Based on Spectral-Spatial Attention Tensor Network
CN105718858B (en) A kind of pedestrian recognition method based on positive and negative broad sense maximum pond
CN116778468A (en) Three-dimensional target detection method based on point cloud structure perception

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20170822)