CN104616032B - Multi-camera system target matching method based on depth convolutional neural networks - Google Patents
- Publication number
- CN104616032B CN104616032B CN201510047118.4A CN201510047118A CN104616032B CN 104616032 B CN104616032 B CN 104616032B CN 201510047118 A CN201510047118 A CN 201510047118A CN 104616032 B CN104616032 B CN 104616032B
- Authority
- CN
- China
- Prior art keywords
- image
- target
- ith
- representing
- block
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Image Analysis (AREA)
Abstract
A target matching method between multiple cameras based on a deep convolutional neural network. The invention initializes multiple convolution kernels with the locality preserving projection method and down-samples images with max pooling; through layer-by-layer feature transformation it extracts more robust and representative histogram features, and a multi-class support vector machine (SVM) classifier then performs classification and identification. When a target moves from the field of view of one camera into that of another, its features are extracted and matched to the corresponding target label, achieving accurate identification of targets in a multi-camera cooperative monitoring field, for use in target handover, tracking, and the like.
Description
Technical Field
The invention belongs to the field of intelligent video monitoring in computer vision and discloses a target matching method based on a deep convolutional neural network, suitable for multi-camera cooperative video monitoring systems.
Background
In large-scale video monitoring sites such as airports, subway stations, and squares, targets are tracked by a multi-camera cooperative monitoring system, and target matching among the cameras is a key step. In a large monitoring scene, camera calibration is difficult and complex, and the spatial relationships, temporal relationships, and time differences among cameras are hard to infer, so the widely applied matching methods between multiple cameras are feature-based, and the effectiveness of the selected features directly determines matching accuracy. Extracting robust features that effectively characterize the target, however, remains a difficult problem. Commonly used features such as color and texture seldom stay robust across all monitored scenes. We therefore propose a target matching method based on deep learning, which adaptively learns features from the video frame sequence to achieve accurate target matching. Compared with a traditional shallow neural network, a deep neural network overcomes the limitation of having few layers: by transforming features layer by layer it obtains more abstract feature representations, realizes target classification in the final output layer of the network, and greatly improves the speed and efficiency of target matching.
Disclosure of Invention
The invention aims to overcome the above defects of the prior art and provides a target matching method among multiple cameras based on deep learning. The specific steps are as follows:
(1) Preprocessing of the target images: extract n target images from the multi-camera domain and divide them among m labels; uniformly resize each image to h × w using bicubic interpolation, where h is the image height and w the image width; linearly scale the pixel values of the image samples so that they all fall in [0, 1]; store the labels of the n images as an n × 1 vector, each label taking a value in {1, …, m};
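The preprocessing above can be sketched as follows. This is only an illustrative fragment: it assumes the images have already been resized to h × w by a bicubic routine (e.g. in Pillow or OpenCV), since the patent does not fix a library.

```python
import numpy as np

def preprocess(images, labels):
    """Scale pixels into [0, 1] and store labels as an n x 1 vector.

    `images`: list of 2-D uint8 arrays, already bicubic-resized to h x w.
    `labels`: n integer labels with values in 1..m.
    """
    X = np.stack([im.astype(np.float64) / 255.0 for im in images])
    y = np.asarray(labels, dtype=np.int64).reshape(-1, 1)  # n x 1 data
    return X, y
```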
(2) extracting features based on a deep convolutional neural network:
(a) selecting n_t training samples from the target images extracted in step (1) as the perception nodes of the first input layer of the convolutional neural network, X = {X_1, X_2, …, X_{n_t}}, where X_i, i = 1, 2, …, n_t, denotes the ith image;
(b) the filter applied to target image feature extraction is a convolution kernel constructed with the locality preserving projection method; the specific construction is as follows:
For image X_i, perform block processing with block size p_1 × p_2; all the blocks of X_i are then X_i = [x_{i,1}, x_{i,2}, …, x_{i,hw}], where x_{i,j}, j = 1, …, hw, denotes the jth block vector of X_i. The block mean is then subtracted from each block to yield X̄_i = [x̄_{i,1}, x̄_{i,2}, …, x̄_{i,hw}], where x̄_{i,j}, j = 1, …, hw, denotes the block after mean removal. The same processing is done for all input images X, resulting in X = [X̄_1, X̄_2, …, X̄_{n_t}].
The eigenvectors are calculated from the generalized eigenproblem X L Xᵀ a = λ X D Xᵀ a, where a is an eigenvector, λ the corresponding eigenvalue, and D a diagonal matrix whose elements are the column (or row) sums of the weight matrix W. W is a sparse matrix of dimension n_t × n_t whose entry W_ij is determined by the Euclidean distance between samples X̄_i and X̄_j: the Euclidean distances between all samples are calculated and, for each sample, the k_nearest samples closest to it are found; if sample X̄_i lies within the k_nearest nearest neighbours of X̄_j, or X̄_j within the k_nearest nearest neighbours of X̄_i, then W_ij is set to a nonzero weight (in standard locality preserving projection, 1 or a heat-kernel value); otherwise W_ij = 0. D_ii = Σ_j W_ji, and L = D − W is the Laplacian matrix. Sort the calculated eigenvectors by the magnitude of their eigenvalues and take the first k_1 eigenvectors a_0, a_1, …, a_{k_1−1}; letting V_i^1 = a_{i−1} (reshaped to p_1 × p_2), i = 1, 2, …, k_1, then V^1 = [V_1^1, …, V_{k_1}^1] is the set of extracted convolution kernels.
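The kernel construction of step (b) can be sketched numerically as below. This is a hedged illustration, not the patented implementation: it assumes simple 0/1 nearest-neighbour weights for W, adds a small ridge term so the generalized eigenproblem stays well conditioned, and omits patch extraction and the reshape of each eigenvector to p_1 × p_2.

```python
import numpy as np

def lpp_kernels(patches, k1, k_nearest=5):
    """Derive k1 LPP projection vectors (convolution kernels) from
    mean-removed patch vectors stored as the columns of `patches`."""
    d, n = patches.shape
    # pairwise squared Euclidean distances between samples
    sq = (patches ** 2).sum(axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * patches.T @ patches
    np.fill_diagonal(dist2, np.inf)
    # symmetric k-nearest-neighbour graph with 0/1 weights (an assumption)
    W = np.zeros((n, n))
    nearest = np.argsort(dist2, axis=1)[:, :k_nearest]
    for i in range(n):
        W[i, nearest[i]] = 1.0
    W = np.maximum(W, W.T)
    D = np.diag(W.sum(axis=1))
    L = D - W                                        # Laplacian L = D - W
    # generalized eigenproblem  X L X^T a = lambda X D X^T a
    A = patches @ L @ patches.T
    B = patches @ D @ patches.T + 1e-8 * np.eye(d)   # ridge for stability
    vals, vecs = np.linalg.eig(np.linalg.solve(B, A))
    order = np.argsort(vals.real)        # LPP keeps the smallest eigenvalues
    return vecs[:, order[:k1]].real      # each column reshapes to p1 x p2
```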
The convolution kernels V^1 are convolved with each frame image X̄_i, i.e. Y_i^j = X̄_i * V_j^1, i = 1, …, n_t, j = 1, …, k_1; the convolutional layer thus generates n_t·k_1 output feature maps, denoted Y.
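The convolution Y_i^j = X̄_i * V_j^1 can be illustrated in plain NumPy; boundary handling is not specified in the text, so a "valid" convolution (with the kernel flipped, as in true convolution) is assumed here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_valid(img, kernel):
    """'Valid' 2-D convolution of one image with one p1 x p2 kernel."""
    p1, p2 = kernel.shape
    # all p1 x p2 windows of the image: shape (H-p1+1, W-p2+1, p1, p2)
    windows = sliding_window_view(img, (p1, p2))
    flipped = kernel[::-1, ::-1]   # flip both axes for true convolution
    return np.einsum('ijkl,kl->ij', windows, flipped)

img = np.arange(1, 10, dtype=float).reshape(3, 3)
ker = np.array([[1.0, 0.0], [0.0, 1.0]])
out = conv2d_valid(img, ker)  # -> [[6., 8.], [12., 14.]]
```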
(c) Down-sample the feature points of the obtained feature maps Y by max pooling. Let the sampling window size be s_1 × s_1; this yields n_t·k_1 output feature maps:

Z_i(j, k) = max{ Y_i((j−1)·u + p, (k−1)·v + q) : 1 ≤ p, q ≤ s_1 },

where Z_i(j, k) denotes the element in the jth row and kth column of the ith output feature map, i = 1, …, n_t·k_1; u, v denote the sampling steps; Y_i denotes the ith input feature map; and max{·} denotes the maximum function. In addition, the algorithm uses non-overlapping sampling, i.e. u = v = s_1;
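Non-overlapping max pooling with stride u = v = s_1, as used in step (c), reduces to a reshape-and-max in NumPy. A sketch (ragged borders are simply dropped, an assumption the patent leaves open):

```python
import numpy as np

def max_pool(fmap, s):
    """Non-overlapping s x s max pooling (stride u = v = s)."""
    H, W = fmap.shape
    H2, W2 = H // s, W // s
    fmap = fmap[:H2 * s, :W2 * s]              # drop any ragged border
    return fmap.reshape(H2, s, W2, s).max(axis=(1, 3))

Y = np.arange(1, 17, dtype=float).reshape(4, 4)
Z = max_pool(Y, 2)  # -> [[6., 8.], [14., 16.]]
```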
(d) Using a procedure similar to (b), take the feature maps Z obtained in step (c) as the input of the next convolutional layer; perform block processing with block size p_1 × p_2 and remove the block mean of each frame of image to obtain the input images:

Z = [Z̄_1, …, Z̄_{n_t k_1}],

where the ith input feature map Z̄_i, i = 1, …, n_t·k_1, denotes the ith mean-removed image and z̄_{i,j}, j = 1, …, hw, denotes the jth block vector of the ith image after mean removal. Construct a weight matrix W and, according to Z L Zᵀ a = λ Z D Zᵀ a, calculate the eigenvectors, sort them by the magnitude of their eigenvalues, and take the first k_2 eigenvectors as the selected convolution kernels V² = [V_1², …, V_{k_2}²], where V_i², i = 1, …, k_2, denotes the ith convolution kernel of V². The resulting kernels V² are then convolved with each frame image Z̄_i, generating n_t·k_1·k_2 output feature maps on the convolutional layer:

U_i^j = Z̄_i * V_j², i = 1, …, n_t·k_1, j = 1, …, k_2;
(e) Down-sample the feature points of the obtained feature maps U by max pooling, using steps similar to (c). Let the sampling window size be s_2 × s_2; this yields n_t·k_1·k_2 output feature maps:

O_i(j, k) = max{ U_i((j−1)·u + p, (k−1)·v + q) : 1 ≤ p, q ≤ s_2 },

where O_i(j, k) denotes the element in the jth row and kth column of the ith output feature map, i = 1, …, n_t·k_1·k_2; u, v denote the sampling steps; U_i denotes the ith input feature map; and max{·} denotes the maximum function. In addition, the algorithm uses non-overlapping sampling, i.e. u = v = s_2;
(f) Let P_i = {O_i^1, …, O_i^{k_2}}, i = 1, …, n_t·k_1, i.e. take every k_2 images in O as a group. Apply Heaviside binary quantization and convert to decimal, so that every k_2 images are converted into one image T_i = Σ_{j=1}^{k_2} 2^{j−1} H(P_i^j), i = 1, …, n_t·k_1, where H(·) denotes the Heaviside function, P_i^j denotes the jth image of P_i, and T_i denotes the decimal processing result, with value range [0, 2^{k_2} − 1]. Then take every k_1 images T_i as a group, divide each image into B blocks, calculate the histogram features of each block region, and connect the histogram features of the B blocks into a row vector, defined as hist(T_l^s), where l = 1, …, n_t, s = 1, …, k_1. Then, for each image X_l in (a), the feature vector finally extracted by the convolutional neural network is f_l = [hist(T_l^1), …, hist(T_l^{k_1})], l = 1, …, n_t;
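Step (f) — Heaviside binarization, decimal packing, and block histograms, in the style of PCANet-like output stages — can be sketched as follows. The block count B, the map contents, and the convention H(x) = 1 for x > 0 are illustrative assumptions.

```python
import numpy as np

def hash_histogram(maps, B):
    """Binarize k2 feature maps, pack them into one decimal image T with
    values in [0, 2**k2 - 1], then concatenate B block histograms."""
    k2 = len(maps)
    # T = sum_j 2^(j-1) * H(P^j), with H(x) = 1 for x > 0 (an assumption)
    T = sum((m > 0).astype(np.int64) << j for j, m in enumerate(maps))
    bins = np.arange(2 ** k2 + 1)
    blocks = np.array_split(T.ravel(), B)        # B block regions
    return np.concatenate([np.histogram(b, bins=bins)[0] for b in blocks])

maps = [np.array([[1, -1], [1, 1]]), np.array([[-1, 1], [1, -1]])]
feat = hash_histogram(maps, B=2)  # -> [0 1 1 0 0 1 0 1]
```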
(3) Classification and identification: the features f_l extracted above are taken as input and the target label corresponding to each feature vector as output, and a classifier model of the target is constructed with a multi-class support vector machine (SVM). Based on the classifier model, targets in the fields of view of different cameras can be labeled and classified, for target handover, tracking, and the like.
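Step (3) corresponds to an off-the-shelf multi-class SVM. The sketch below uses scikit-learn's SVC on toy random features; the feature matrix and labels are placeholders, since real features would come from step (2)(f).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
F = rng.normal(size=(60, 32))    # one row feature vector f_l per image
y = np.repeat([1, 2, 3], 20)     # m = 3 target labels
F[y == 2] += 2.0                 # shift classes apart (toy data only)
F[y == 3] -= 2.0

# one-vs-rest multi-class SVM as the target classifier model
clf = SVC(kernel="linear", decision_function_shape="ovr").fit(F, y)
labels = clf.predict(F[:5])      # label targets seen by another camera
```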
The invention has the beneficial effects that:
the invention adopts a local preserving projection method to initialize the convolution kernel, but not to randomly initialize the convolution kernel, so that the obvious characteristics of the target image can be accurately preserved, thereby leading the extracted histogram characteristics to be capable of keeping the invariance to the scale change and the rotation of the target, having stronger adaptability to the illumination change of a scene and greatly improving the identification rate of the target. The method performs downsampling on the image convolved by using the local preserving projection method, effectively reduces the feature dimension, avoids dimension disaster, greatly shortens the identification time of the target, and effectively eliminates the reduction of the identification rate caused by dimension reduction by adopting multi-convolution to perform convolution on the image and overlapping the features.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The method comprises two parts: target feature extraction and target classification and identification. For feature extraction, a neural network with multiple hidden layers is constructed; the sample features are transformed layer by layer, mapping the representation from the original space into a new feature space in which more useful features are learned. These features then serve as the input features of a multi-class SVM classifier for target classification and identification, finally improving the accuracy of classification or prediction. Fig. 1 shows the implementation block diagram of the algorithm; the specific steps are as follows:
(1) Preprocessing of the target images: extract n target images from the multi-camera domain and divide them among m labels; uniformly resize each image to h × w using bicubic interpolation, where h is the image height and w the image width; linearly scale the pixel values of the image samples so that they all fall in [0, 1]; store the labels of the n images as an n × 1 vector, each label taking a value in {1, …, m}; 001 denotes the preprocessed images;
(2) extracting features based on a deep convolutional neural network:
(a) Select n_t training samples from 001 as the perception nodes of the first input layer of the convolutional neural network, X = {X_1, X_2, …, X_{n_t}}, namely 002, where X_i, i = 1, 2, …, n_t, denotes the ith image;
(b) the filter applied to target image feature extraction is a convolution kernel constructed based on a local preserving projection method, and the specific construction method is as follows:
For image X_i in 002, perform block processing with block size p_1 × p_2; all the blocks of X_i are then X_i = [x_{i,1}, x_{i,2}, …, x_{i,hw}], where x_{i,j}, j = 1, …, hw, denotes the jth block vector of X_i. The block mean is then subtracted from each block to yield X̄_i = [x̄_{i,1}, …, x̄_{i,hw}], where x̄_{i,j}, j = 1, …, hw, denotes the block after mean removal. The same processing is done for all input images X, resulting in X = [X̄_1, …, X̄_{n_t}], namely 003.

The eigenvectors are calculated from the generalized eigenproblem X L Xᵀ a = λ X D Xᵀ a, where a is an eigenvector, λ the corresponding eigenvalue, and D a diagonal matrix whose elements are the column (or row) sums of the weight matrix W. W is a sparse matrix of dimension n_t × n_t whose entry W_ij is determined by the Euclidean distance between samples X̄_i and X̄_j: the Euclidean distances between all samples are calculated and, for each sample, the k_nearest samples closest to it are found; if sample X̄_i lies within the k_nearest nearest neighbours of X̄_j, or X̄_j within the k_nearest nearest neighbours of X̄_i, then W_ij is set to a nonzero weight; otherwise W_ij = 0. D_ii = Σ_j W_ji, and L = D − W is the Laplacian matrix. Sort the calculated eigenvectors by the magnitude of their eigenvalues and take the first k_1 eigenvectors a_0, a_1, …, a_{k_1−1}; letting V_i^1 = a_{i−1} (reshaped to p_1 × p_2), i = 1, 2, …, k_1, then V^1 = [V_1^1, …, V_{k_1}^1] is the set of extracted convolution kernels.

The convolution kernels V^1 are convolved with each frame image X̄_i in 003, obtaining Y_i^j = X̄_i * V_j^1, i = 1, …, n_t, j = 1, …, k_1; the convolutional layer thus generates n_t·k_1 output feature maps, denoted Y, namely 004;
(c) For the feature maps Y shown at 004, down-sample the feature points by max pooling. Let the sampling window size be s_1 × s_1; this yields n_t·k_1 output feature maps, namely 005:

Z_i(j, k) = max{ Y_i((j−1)·u + p, (k−1)·v + q) : 1 ≤ p, q ≤ s_1 },

where Z_i(j, k) denotes the element in the jth row and kth column of the ith output feature map, i = 1, …, n_t·k_1; u, v denote the sampling steps; Y_i denotes the ith input feature map; and max{·} denotes the maximum function. In addition, the algorithm uses non-overlapping sampling, i.e. u = v = s_1;
(d) Using a procedure similar to (b), take the feature maps Z obtained in step (c) (005) as the input of the next convolutional layer; perform block processing with block size p_1 × p_2 and remove the block mean of each frame of image to obtain the input images Z = [Z̄_1, …, Z̄_{n_t k_1}], namely 006, where the ith input feature map Z̄_i, i = 1, …, n_t·k_1, denotes the ith mean-removed image and z̄_{i,j}, j = 1, …, hw, denotes the jth block vector of the ith image after mean removal. Construct a weight matrix W and, according to Z L Zᵀ a = λ Z D Zᵀ a, calculate the eigenvectors, sort them by the magnitude of their eigenvalues, and take the first k_2 eigenvectors as the selected convolution kernels V² = [V_1², …, V_{k_2}²], where V_i², i = 1, …, k_2, denotes the ith convolution kernel of V². The resulting kernels V² are then convolved with each frame image Z̄_i, generating n_t·k_1·k_2 output feature maps on the convolutional layer, namely 007:

U_i^j = Z̄_i * V_j², i = 1, …, n_t·k_1, j = 1, …, k_2;
(e) For the feature maps U shown at 007, down-sample the feature points by max pooling, using steps similar to (c). Let the sampling window size be s_2 × s_2; this yields n_t·k_1·k_2 output feature maps, namely 008:

O_i(j, k) = max{ U_i((j−1)·u + p, (k−1)·v + q) : 1 ≤ p, q ≤ s_2 },

where O_i(j, k) denotes the element in the jth row and kth column of the ith output feature map, i = 1, …, n_t·k_1·k_2; u, v denote the sampling steps; U_i denotes the ith input feature map; and max{·} denotes the maximum function. In addition, the algorithm uses non-overlapping sampling, i.e. u = v = s_2;
(f) Let P_i = {O_i^1, …, O_i^{k_2}}, i = 1, …, n_t·k_1, i.e. take every k_2 images in O, as shown at 008, as a group. Apply Heaviside binary quantization and convert to decimal, so that every k_2 images are converted into one image T_i = Σ_{j=1}^{k_2} 2^{j−1} H(P_i^j), i = 1, …, n_t·k_1, where H(·) denotes the Heaviside function, P_i^j denotes the jth image of P_i, and T_i denotes the decimal processing result, with value range [0, 2^{k_2} − 1]. Then take every k_1 images T_i as a group, divide each image into B blocks, calculate the histogram features of each block region, and connect the histogram features of the B blocks into a row vector, defined as hist(T_l^s), where l = 1, …, n_t, s = 1, …, k_1. Then, for each image X_l in 002, the feature vector finally extracted by the convolutional neural network is f_l = [hist(T_l^1), …, hist(T_l^{k_1})], l = 1, …, n_t;
(3) Classification and identification: the features f_l extracted above are taken as input and the target label corresponding to each feature vector as output, and a classifier model of the target is constructed with a multi-class support vector machine (SVM). Based on this classifier model, the method can label and classify targets in the fields of view of different cameras, for target handover, tracking, and the like.
The embodiments described in this specification merely illustrate implementations of the inventive concept; the scope of the invention should not be considered limited to the specific forms set forth in the embodiments, but extends to equivalents that may occur to those skilled in the art upon consideration of the inventive concept.
Claims (1)
1. A multi-camera system target matching method based on a deep convolutional neural network is characterized by comprising the following steps:
(1) preprocessing of the target images: extracting n target images from the multi-camera domain and dividing them among m labels; uniformly resizing each image to h × w using bicubic interpolation, where h is the image height and w the image width; linearly scaling the pixel values of the image samples so that they all fall in [0, 1]; storing the labels of the n images as an n × 1 vector, each label taking a value in {1, …, m};
(2) extracting features based on a deep convolutional neural network:
(a) selecting n_t training samples from the target images extracted in step (1) as the perception nodes of the first input layer of the convolutional neural network, X = {X_1, X_2, …, X_{n_t}}, where X_i, i = 1, 2, …, n_t, denotes the ith image;
(b) the filter applied to target image feature extraction is a convolution kernel constructed with the locality preserving projection method; the specific construction is as follows:
for image X_i, performing block processing with block size p_1 × p_2; all the blocks of X_i are then X_i = [x_{i,1}, x_{i,2}, …, x_{i,hw}], where x_{i,j}, j = 1, …, hw, denotes the jth block vector of X_i; the block mean is then subtracted from each block to yield X̄_i = [x̄_{i,1}, …, x̄_{i,hw}], where x̄_{i,j} denotes the block after mean removal; the same processing is done for all input images X, resulting in X = [X̄_1, …, X̄_{n_t}];

the eigenvectors are calculated from the generalized eigenproblem X L Xᵀ a = λ X D Xᵀ a, where a is an eigenvector, λ the corresponding eigenvalue, and D a diagonal matrix whose elements are the column (or row) sums of the weight matrix W; W is a sparse matrix of dimension n_t × n_t whose entry W_ij is determined by the Euclidean distance between samples X̄_i and X̄_j: the Euclidean distances between all samples are calculated and, for each sample, the k_nearest samples closest to it are found; if sample X̄_i lies within the k_nearest nearest neighbours of X̄_j, or X̄_j within the k_nearest nearest neighbours of X̄_i, then W_ij is set to a nonzero weight, otherwise W_ij = 0; D_ii = Σ_j W_ji, and L = D − W is the Laplacian matrix; the calculated eigenvectors are sorted by the magnitude of their eigenvalues and the first k_1 eigenvectors a_0, a_1, …, a_{k_1−1} are taken; letting V_i^1 = a_{i−1} (reshaped to p_1 × p_2), i = 1, 2, …, k_1, then V^1 = [V_1^1, …, V_{k_1}^1] is the set of extracted convolution kernels;

the convolution kernels V^1 are convolved with each frame image X̄_i, i.e. Y_i^j = X̄_i * V_j^1, i = 1, …, n_t, j = 1, …, k_1; the convolutional layer thus generates n_t·k_1 output feature maps, denoted Y;
(c) performing feature point down-sampling on the obtained feature maps Y by max pooling; letting the sampling window size be s_1 × s_1, this yields n_t·k_1 output feature maps: Z_i(j, k) = max{ Y_i((j−1)·u + p, (k−1)·v + q) : 1 ≤ p, q ≤ s_1 }, where Z_i(j, k) denotes the element in the jth row and kth column of the ith output feature map, i = 1, …, n_t·k_1; u, v denote the sampling steps; Y_i denotes the ith input feature map; and max{·} denotes the maximum function; non-overlapping sampling is used here, i.e. u = v = s_1;
(d) using a procedure similar to (b), taking the feature maps Z obtained in step (c) as the input of the next convolutional layer; performing block processing with block size p_1 × p_2 and removing the block mean of each frame of image to obtain the input images Z = [Z̄_1, …, Z̄_{n_t k_1}], where the ith input feature map Z̄_i, i = 1, …, n_t·k_1, denotes the ith mean-removed image and z̄_{i,j} denotes the jth block vector of the ith image after mean removal; constructing a weight matrix W and, according to Z L Zᵀ a = λ Z D Zᵀ a, calculating the eigenvectors, sorting them by the magnitude of their eigenvalues, and taking the first k_2 eigenvectors as the selected convolution kernels V² = [V_1², …, V_{k_2}²], where V_i², i = 1, …, k_2, denotes the ith convolution kernel of V²; the resulting kernels V² are then convolved with each frame image Z̄_i, generating n_t·k_1·k_2 output feature maps on the convolutional layer: U_i^j = Z̄_i * V_j², i = 1, …, n_t·k_1, j = 1, …, k_2;
(e) for the feature maps U obtained above, performing feature point down-sampling by max pooling, using steps similar to (c); letting the sampling window size be s_2 × s_2, this yields n_t·k_1·k_2 output feature maps: O_i(j, k) = max{ U_i((j−1)·u + p, (k−1)·v + q) : 1 ≤ p, q ≤ s_2 }, where O_i(j, k) denotes the element in the jth row and kth column of the ith output feature map, i = 1, …, n_t·k_1·k_2; u, v denote the sampling steps; U_i denotes the ith input feature map; and max{·} denotes the maximum function; non-overlapping sampling is used here, i.e. u = v = s_2;
(f) letting P_i = {O_i^1, …, O_i^{k_2}}, i = 1, …, n_t·k_1, i.e. taking every k_2 images in O as a group; applying Heaviside binary quantization and converting to decimal, so that every k_2 images are converted into one image T_i = Σ_{j=1}^{k_2} 2^{j−1} H(P_i^j), i = 1, …, n_t·k_1, where H(·) denotes the Heaviside function, P_i^j denotes the jth image of P_i, and T_i denotes the decimal processing result, with value range [0, 2^{k_2} − 1]; then taking every k_1 images T_i as a group, dividing each image into B blocks, calculating the histogram features of each block region, and connecting the histogram features of the B blocks into a row vector hist(T_l^s), where l = 1, …, n_t, s = 1, …, k_1; then, for each image X_l in (a), the feature vector finally extracted by the convolutional neural network is f_l = [hist(T_l^1), …, hist(T_l^{k_1})], l = 1, …, n_t;
(3) classification and identification: taking the features f_l extracted above as input and the target label corresponding to each feature vector as output, a classifier model of the target is obtained by training a multi-class support vector machine (SVM); based on the classifier model, targets in the fields of view of different cameras can be labeled and classified, for target handover and tracking.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510047118.4A CN104616032B (en) | 2015-01-30 | 2015-01-30 | Multi-camera system target matching method based on depth convolutional neural networks |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510047118.4A CN104616032B (en) | 2015-01-30 | 2015-01-30 | Multi-camera system target matching method based on depth convolutional neural networks |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104616032A CN104616032A (en) | 2015-05-13 |
CN104616032B true CN104616032B (en) | 2018-02-09 |
Family
ID=53150469
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510047118.4A Expired - Fee Related CN104616032B (en) | 2015-01-30 | 2015-01-30 | Multi-camera system target matching method based on depth convolutional neural networks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104616032B (en) |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104850836B (en) * | 2015-05-15 | 2018-04-10 | 浙江大学 | Insect automatic distinguishing method for image based on depth convolutional neural networks |
CN107851174B (en) * | 2015-07-08 | 2021-06-01 | 北京市商汤科技开发有限公司 | Image semantic annotation equipment and method, and generation method and system of image semantic annotation model |
CN105138973B (en) * | 2015-08-11 | 2018-11-09 | 北京天诚盛业科技有限公司 | The method and apparatus of face authentication |
US10255040B2 (en) * | 2017-05-11 | 2019-04-09 | Veridium Ip Limited | System and method for biometric identification |
CN105320961A (en) * | 2015-10-16 | 2016-02-10 | 重庆邮电大学 | Handwriting numeral recognition method based on convolutional neural network and support vector machine |
CN105373796B (en) * | 2015-10-23 | 2019-01-25 | 河南眼神科技有限公司 | The method, apparatus and its application of image activation operation |
CN106611177A (en) * | 2015-10-27 | 2017-05-03 | 北京航天长峰科技工业集团有限公司 | Big data-based image classification method |
CN105354560A (en) * | 2015-11-25 | 2016-02-24 | 小米科技有限责任公司 | Fingerprint identification method and device |
CN105719313B (en) * | 2016-01-18 | 2018-10-23 | 青岛邃智信息科技有限公司 | A kind of motion target tracking method based on intelligent real-time video cloud |
WO2017206156A1 (en) * | 2016-06-03 | 2017-12-07 | Intel Corporation | Look-up convolutional layer in convolutional neural network |
CN106203318B (en) * | 2016-06-29 | 2019-06-11 | 浙江工商大学 | Camera network pedestrian recognition method based on the fusion of multi-level depth characteristic |
CN106227851B (en) * | 2016-07-29 | 2019-10-01 | 汤一平 | The image search method of depth of seam division search based on depth convolutional neural networks |
CN106407891B (en) * | 2016-08-26 | 2019-06-28 | 东方网力科技股份有限公司 | Target matching method and device based on convolutional neural networks |
US20180124437A1 (en) * | 2016-10-31 | 2018-05-03 | Twenty Billion Neurons GmbH | System and method for video data collection |
CN106709441B (en) * | 2016-12-16 | 2019-01-29 | 北京工业大学 | A kind of face verification accelerated method based on convolution theorem |
CN106504190B (en) * | 2016-12-29 | 2019-09-13 | 浙江工商大学 | A kind of three-dimensional video-frequency generation method based on 3D convolutional neural networks |
CN110506277B (en) * | 2017-02-13 | 2023-08-08 | 诺基亚技术有限公司 | Filter reuse mechanism for constructing robust deep convolutional neural networks |
CN106991428A (en) * | 2017-02-24 | 2017-07-28 | 中国科学院合肥物质科学研究院 | Insect image-recognizing method based on adaptive pool model |
CN106980880A (en) * | 2017-03-06 | 2017-07-25 | 北京小米移动软件有限公司 | The method and device of images match |
CN108572183B (en) | 2017-03-08 | 2021-11-30 | 清华大学 | Inspection apparatus and method of segmenting vehicle image |
CN106991396B (en) * | 2017-04-01 | 2020-07-14 | 南京云创大数据科技股份有限公司 | Target relay tracking algorithm based on intelligent street lamp partner |
CN107092935A (en) * | 2017-04-26 | 2017-08-25 | 国家电网公司 | A kind of assets alteration detection method |
CN107393523B (en) * | 2017-07-28 | 2020-11-13 | 深圳市盛路物联通讯技术有限公司 | Noise monitoring method and system |
CN107844795B (en) * | 2017-11-18 | 2018-09-04 | 中国人民解放军陆军工程大学 | Convolutional neural network feature extraction method based on principal component analysis |
CN109146921B (en) * | 2018-07-02 | 2021-07-27 | 华中科技大学 | Pedestrian target tracking method based on deep learning |
CN110320452A (en) * | 2019-06-21 | 2019-10-11 | 河南理工大学 | A kind of series fault arc detection method |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104036323A (en) * | 2014-06-26 | 2014-09-10 | 叶茂 | Vehicle detection method based on convolutional neural network |
CN104077613A (en) * | 2014-07-16 | 2014-10-01 | 电子科技大学 | Crowd density estimation method based on cascaded multilevel convolution neural network |
- 2015-01-30: application CN201510047118.4A filed in China; granted as patent CN104616032B (not active: Expired - Fee Related)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104036323A (en) * | 2014-06-26 | 2014-09-10 | 叶茂 | Vehicle detection method based on convolutional neural network |
CN104077613A (en) * | 2014-07-16 | 2014-10-01 | 电子科技大学 | Crowd density estimation method based on cascaded multilevel convolution neural network |
Non-Patent Citations (6)
Title |
---|
Fully Convolutional Neural Networks for Crowd Segmentation; Kai Kang et al.; Computer Science; 2014-12-31; vol. 49, no. 1; pp. 1-9 *
Infrared moving target detection and tracking based on tensor locality preserving projection; Hong Li et al.; Infrared Physics & Technology; 2009-09-15; pp. 77-83 *
Palmprint Recognition via Locality Preserving Projections and Extreme Learning Machine Neural Network; Jiwen Lu et al.; International Conference on Signal Processing; 2008-12-08; pp. 2096-2099 *
Similar looking Gujarati printed character recognition using Locality Preserving Projection and Artificial Neural Networks; Mandar Chaudhary et al.; 2012 Third International Conference on Emerging Applications of Information Technology; 2012-12-31; pp. 153-156 *
DOA estimation based on locality preserving projection and RBF neural network; Wang Rongxiu et al.; Science Technology and Engineering; 2013-08-31; vol. 13, no. 24; pp. 7054-7058 *
RBF neural network learning algorithm based on error projection and local projection; Song Shaoyun et al.; Journal of Yuxi Normal University; 2009-04-30; vol. 25, no. 4; pp. 46-50 *
Also Published As
Publication number | Publication date |
---|---|
CN104616032A (en) | 2015-05-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104616032B (en) | Multi-camera system target matching method based on depth convolutional neural networks | |
Huang et al. | Context-aware single image rain removal | |
CN108256562B (en) | Salient target detection method and system based on weak supervision time-space cascade neural network | |
Zhang et al. | Single-image crowd counting via multi-column convolutional neural network | |
Zhang et al. | 4-dimensional local spatio-temporal features for human activity recognition | |
CN106683119B (en) | Moving vehicle detection method based on aerial video image | |
An et al. | Image super-resolution by extreme learning machine | |
CN106127197B (en) | Image saliency target detection method and device based on saliency label sorting | |
Yang et al. | ContourGAN: Image contour detection with generative adversarial network | |
CN102542571B (en) | Moving target detecting method and device | |
KR101436369B1 (en) | Apparatus and method for detecting multiple object using adaptive block partitioning | |
CN110176024B (en) | Method, device, equipment and storage medium for detecting target in video | |
Shukla et al. | Moving object tracking of vehicle detection: a concise review | |
CN110929593A (en) | Real-time significance pedestrian detection method based on detail distinguishing and distinguishing | |
Shijila et al. | Simultaneous denoising and moving object detection using low rank approximation | |
CN110110618B (en) | SAR target detection method based on PCA and global contrast | |
CN106157330A (en) | A kind of visual tracking method based on target associating display model | |
CN115359407A (en) | Multi-vehicle tracking method in video | |
Chen et al. | Attention-based hierarchical fusion of visible and infrared images | |
Tsutsui et al. | Distantly supervised road segmentation | |
CN109448024B (en) | Visual tracking method and system for constructing constraint correlation filter by using depth data | |
CN109241932B (en) | Thermal infrared human body action identification method based on motion variance map phase characteristics | |
CN109117850B (en) | Method for identifying corresponding infrared target image by utilizing visible light target image | |
Bhinge et al. | Data-driven fusion of multi-camera video sequences: Application to abandoned object detection | |
CN111160255B (en) | Fishing behavior identification method and system based on three-dimensional convolution network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB03 | Change of inventor or designer information | ||
CB03 | Change of inventor or designer information |
Inventor after: Wang Huiyan; Hua Jing
Inventor before: Wang Huiyan; Wang Xun; He Xiaoshuang; Chen Weigang
|
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20180209 Termination date: 20200130 |