CN106372597B

CN106372597B - CNN Vehicle Detection method based on adaptive contextual information

Info

Publication number: CN106372597B
Application number: CN201610786130.1A
Authority: CN
Inventors: 李涛; 李冬梅; 张玉宏; 曲豪; 邹香玲; 张栋梁; 朱晓珺; 郭航宇; 高大伟; 刘永
Original assignee: Zhengzhou Zen Graphics Intelligent Technology Co Ltd
Current assignee: Zhengzhou Chantu Intelligent Technology Co ltd
Priority date: 2016-08-31
Filing date: 2016-08-31
Publication date: 2019-09-13
Anticipated expiration: 2036-08-31
Also published as: CN106372597A

Abstract

The invention discloses a CNN vehicle detection method based on adaptive context information, comprising a training stage and a detection stage. In the training stage, an adaptive context feature selection model is first trained for a specific traffic scene; on the basis of that model, a CNN traffic detection system based on adaptive context information is then trained. In the detection stage, the context and the target are predicted jointly, and post-processing then frames the traffic target accurately. The proposed CNN traffic detection system based on adaptive context information mainly comprises a CNN-based adaptive context selection model and a traffic detection system that incorporates this model, and it further improves the accuracy of vehicle and pedestrian detection.

Description

CNN traffic detection method based on self-adaptive context information

Technical Field

The invention relates to a CNN traffic detection technology based on self-adaptive context information, which can be applied in real time.

Background art:

To address increasingly serious traffic problems, Intelligent Transportation Systems (ITS) have been developed. Vehicle and pedestrian recognition is an important component of such systems, and a number of existing vehicle- and pedestrian-related technologies are already widely used.

Existing traffic detection systems mainly identify and detect different targets (pedestrians and vehicles) by describing the targets' appearance. Such systems currently describe target appearance either with hand-crafted features (such as HOG, LBP and SIFT) or with deep features learned directly from the images, and perform detection on that basis. In everyday traffic, however, the environment is mostly open and unconstrained, conditions are complex and changeable, and interference such as illumination changes, viewpoint changes and target occlusion is common. If detection relies only on the target's appearance, the target category cannot be judged accurately when the traffic target provides too little information in the image or video. Moreover, traffic scenes differ from one another, and a generic traffic target detection system that ignores these differences loses detection accuracy.

The invention content is as follows:

The invention provides a CNN traffic detection method based on self-adaptive context information. It aims to enrich the description of traffic targets with the context information that differs across traffic scenes in traffic video, and thereby to improve the accuracy of traffic target detection.

The technical scheme adopted for realizing the purpose of the invention is as follows: a CNN traffic detection method based on self-adaptive context information comprises a training phase and a detection phase, and is characterized in that,

the training phase comprises two steps:

First, in a specific traffic scene, an adaptive context feature selection model is trained. Two groups of CNN feature maps are extracted, one from the traffic target images and one from their context images in that scene. The difference between the two groups of feature maps is then computed at the same scale, and the position indexes of the feature maps whose sample difference degree is smaller than a set threshold are recorded and counted. Finally, the position indexes of K effective context CNN feature maps are selected to obtain the adaptive context selection model, where K is an integer greater than 0.

Second, on the basis of the adaptive context feature selection model, a CNN traffic detection system based on adaptive context information is trained. In the forward stage, two groups of CNN feature maps are extracted from the traffic target image and its context image, and the K feature map position indexes retained by the context feature selection model are used to keep the corresponding effective feature maps among the context CNN feature maps. The two resulting groups of feature maps are then convolved with a target kernel and a context kernel respectively to obtain a target score and a context score, and the two scores are fused through a mixing coefficient to obtain a detection score.

In the backward stage, the error between the detection score and the label is computed, and parameters such as the target kernel, the context kernel and the mixing coefficient are updated with the back-propagation (BP) algorithm.

Detection stage: in a specific traffic scene, a traffic image to be detected is first input and 256 feature maps are extracted with the CNN (convolutional neural network). On the one hand, the trained target kernel is convolved with the 256 feature maps to obtain a target mask map; on the other hand, K feature maps are selected from the 256 feature maps using the K feature map position indexes retained by the context feature selection model and are convolved with the trained context kernel to obtain a context mask map. The target mask map and the context mask map are then fused with the trained mixing coefficient to predict the target position jointly. Finally, post-processing frames the traffic target accurately. Here K is an integer greater than 0.

According to the scheme, the context characteristics can be adaptively blended according to different scenes through the difference measurement, the representation of the traffic target is enhanced, and the accuracy of traffic target depiction is effectively improved.

1. The CNN traffic detection system based on self-adaptive context information uses the first five layers of AlexNet to extract feature maps from the traffic image. (AlexNet is an eight-layer convolutional neural network model proposed in 2012. The network has 60 million parameters and 650,000 neurons, and consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers ending in a 1000-way softmax.) The steps are as follows:

1.1 Assume the input image x⁰ is represented by its three channel maps in RGB space. The convolutional layers are indexed by l = 1, 2, 3, 4, 5. M^l denotes the number of feature maps of the l-th convolutional layer: M¹ = 96, M² = 256, M³ = 384, M⁴ = 384, M⁵ = 256. The j-th feature map of the l-th convolutional layer is computed as follows:

where W^l represents the connection relationship between the feature maps of adjacent convolutional layers, and the remaining symbols denote the convolution operation, the convolution kernel and the offset, respectively.

1.2 The result then passes through the pooling layer and the nonlinear layer of the l-th layer, and the output is expressed as:

where g(·) denotes local response normalization and f(·) denotes the activation function, for which a non-saturating nonlinear function is used:

Thus, for the input image x⁰, the CNN yields 256 feature maps at the fifth convolutional layer, indexed by j = 1, ..., 256. Each feature map is 1/16 the size of the input image x⁰ along each side. For convenience, the system writes F(x⁰, j) for the j-th feature map of the 5th convolutional layer extracted from the input image x⁰.
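A plausible written form of the three formulas referred to in steps 1.1 and 1.2 is sketched below. It is an assumption built from the definitions in the text, not the patent's verbatim notation: the names of the pre-activation map, the kernel, the offset, the convolution sign and the pooling operator are chosen here for illustration, the order of normalization and pooling follows AlexNet, and the rectified linear form of f is assumed because AlexNet uses that non-saturating activation.

```latex
% Assumed reconstruction; symbol names are illustrative, with M^0 = 3 input channels.
\hat{x}_j^{l} = \sum_{i=1}^{M^{l-1}} W_{ij}^{l}\,\bigl(x_i^{l-1} \otimes k_{ij}^{l}\bigr) + b_j^{l}

x_j^{l} = \operatorname{pool}\Bigl(g\bigl(f(\hat{x}_j^{l})\bigr)\Bigr)

f(x) = \max(0,\, x)
```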

2. Adaptive context selection model. The traffic image set I that is read in is first expressed in a uniform form, in which each sample consists of a target image, a context image containing that target image, and a label y_n ∈ {0,1} marking positive and negative samples, where n is the sample index:

2.1 For an input target image of size 80 × 48 and its corresponding context image of size 144 × 112, the CNN extracts the feature maps of the 5th convolutional layer, giving 256 target feature maps of size 5 × 3 and 256 corresponding context feature maps of size 9 × 7, j = 1, ..., 256.
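The sizes quoted in step 2.1 follow from the 1/16 ratio stated for the fifth convolutional layer. The check below is a minimal sketch that assumes the ratio applies to each side and divides exactly; `conv5_size` is a hypothetical helper name, not something defined in the patent.

```python
def conv5_size(height, width, stride=16):
    """Spatial size of a fifth-layer feature map, assuming an overall stride of 16."""
    return height // stride, width // stride


print(conv5_size(80, 48))    # (5, 3)  target image
print(conv5_size(144, 112))  # (9, 7)  context image
```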

2.2 To compare each target feature map with its corresponding context feature map at the same scale, the target feature maps are upsampled so that each has size 9 × 7; the upsampled target feature maps are used in the comparison below.

2.3 Cosine similarity is used to measure the difference between the upsampled target feature map and the context feature map. Denote the target feature map by x and the context feature map by y, and compare their similarity with an empirical threshold ε, i.e.

S_cos(x,y)≤ε (4)

If the two feature maps differ little, the context feature map is discarded; otherwise it is retained, and the position index of each retained context feature map is recorded. The results over 2N positive and negative sample images are counted and sorted by how often each position occurs, and the position indexes of the context feature maps to keep are then selected, which realizes adaptive context feature selection, where N is an integer greater than 0. About 85% of the context feature maps are retained in the traffic system.
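The selection procedure of step 2 can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: feature maps are flat lists already brought to the same size, a context map is counted as retained when S_cos(x, y) ≤ ε (one reading of formula (4)), and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(x, y):
    """Cosine similarity of two equally sized feature maps given as flat lists."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm > 0 else 0.0


def select_context_indexes(samples, eps, k):
    """Return the k feature-map indexes retained most often across all samples.

    samples: list of (target_maps, context_maps) pairs, one pair per image.
    Assumption: a context map is retained when S_cos(x, y) <= eps.
    """
    votes = Counter()
    for target_maps, context_maps in samples:
        for j, (x, y) in enumerate(zip(target_maps, context_maps)):
            if cosine_similarity(x, y) <= eps:
                votes[j] += 1
    # Sort by frequency (descending); ties are broken by index for determinism.
    ranked = sorted(votes, key=lambda j: (-votes[j], j))
    return sorted(ranked[:k])


# Two toy samples with three feature maps each.
samples = [
    ([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, -1.0], [1.0, 0.0]]),
    ([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]], [[0.0, 1.0], [-1.0, 1.0], [0.0, 1.0]]),
]
print(select_context_indexes(samples, eps=0.5, k=2))  # [0, 1]
```

Map 1 is retained in both samples and maps 0 and 2 once each, so with K = 2 the tie is resolved in favour of the lower index.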

3. Training to obtain the relevant parameters of the traffic detection system. The training stage uses a forward pass to obtain a context mask and a target mask:

3.1 CNN features are extracted, using the method of step 1, from the target image and from the context image containing the target, giving the corresponding target feature maps and context feature maps, where j = 1, ..., 256 indexes the feature maps.

3.2 According to the adaptive context feature selection model obtained in step 2, the difference between each target feature map and its corresponding context feature map is measured with cosine similarity, and the K context feature maps to be retained are selected from the 256 context feature maps according to the threshold, where q = 1, ..., K and K ≤ 256.

3.3 The target feature maps and the context feature maps are convolved with separate target kernels and context kernels to obtain the corresponding target mask and context mask, expressed respectively as:

where b^o is the offset paired with the target kernel and b^c is the offset paired with the context kernel. The target kernel and the context kernel have the same sizes as the target feature map and the context feature map, respectively. Valid convolution (which has boundary loss) is used, so the target mask and the context mask are both scalars.
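The remark that valid convolution yields a scalar can be illustrated with a short sketch. It assumes each kernel has exactly the size of its feature map, so there is a single valid position, and it uses cross-correlation (no kernel flip), as is common in CNN code; the function name and the toy numbers are hypothetical.

```python
def valid_conv_full_size(feature_maps, kernels, offset):
    """Valid convolution when every kernel has the same size as its feature map.

    feature_maps, kernels: lists of 2-D lists with identical shapes.
    Only one valid position exists, so the result is a single number.
    """
    total = offset
    for fmap, kern in zip(feature_maps, kernels):
        for f_row, k_row in zip(fmap, kern):
            total += sum(f * k for f, k in zip(f_row, k_row))
    return total


maps = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
kernels = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
print(valid_conv_full_size(maps, kernels, offset=0.5))  # 9.5
```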

3.4 The detection score that incorporates context information, score_n, is expressed as:

where γ represents the mixing coefficient of the target and the context, γ ∈ [0,1]. A different γ must be obtained for each scene. This variable reflects the different roles that different context information plays in target detection across scenes. For example, if γ is 0, the context plays no role in target detection, and the model is equivalent to a CNN target detection model that does not consider context.
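One way to realize the fusion in step 3.4 is a convex combination of the two scores. That specific form is an assumption: the text fixes only that γ ∈ [0,1] and that γ = 0 removes the context contribution, which other forms would also satisfy. The function name `fuse_scores` is hypothetical.

```python
def fuse_scores(target_score, context_score, gamma):
    """Blend target and context scores (assumed convex combination)."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return (1.0 - gamma) * target_score + gamma * context_score


print(fuse_scores(0.5, 0.25, 0.0))  # 0.5, the context is ignored
print(fuse_scores(0.5, 0.25, 0.5))  # 0.375
```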

3.5 The system builds the objective function with the minimum mean square error criterion and uses the BP algorithm to gradually reduce the error between score_n and the label y_n. The objective function of the model is:

where 2N is the total number of positive and negative samples. To solve the optimization problem for the parameters in the above formula, the system trains the model parameters by stochastic gradient descent, and every parameter w is updated with the following formula until convergence:

where i is the iteration index and α is the learning rate of the gradient descent algorithm. Updating the parameters requires repeatedly computing the gradient of the objective function L(·).
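A form of the objective function and update rule consistent with the description in step 3.5 is sketched below. It is an assumption rather than the patent's exact formulas: the normalizing factor 1/(2N) in particular is illustrative, and a different constant would not change the minimizer.

```latex
% Assumed forms built from the definitions in step 3.5.
L(w) = \frac{1}{2N} \sum_{n=1}^{2N} \bigl(\mathrm{score}_n - y_n\bigr)^{2}

w_{i+1} = w_i - \alpha \, \frac{\partial L}{\partial w_i}
```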

4. Traffic target detection: the target position is predicted jointly from the target mask map and the context mask map, and the detection result for the traffic target is then obtained by non-maximum suppression.

4.1 First, the system takes an image I_n of a traffic scene as input and extracts feature maps with the method of step 1, producing 256 feature maps.

4.2 Then, according to the context feature map selection model obtained above, K effective context feature maps are selected from the 256 feature maps, supplementing the existing target feature maps with context information.

4.3 Next, the context feature maps and the target feature maps are convolved with their corresponding kernels to obtain the target mask map and the context mask map.

4.4 In the detection stage, same convolution (which has no boundary loss) is used, so the target mask map and the context mask map are both matrices. Finally, the position M of the target is predicted jointly from the target mask map and the context mask map by weighting.

4.5 Finally, the detection result for the corresponding traffic target is obtained through non-maximum suppression post-processing.
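The post-processing in step 4.5 can be illustrated with the standard greedy form of non-maximum suppression. The text does not specify the variant or the overlap threshold, so the box format `(x1, y1, x2, y2)`, the IoU threshold of 0.5 and the function names are assumptions made for this sketch.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, keep those not overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of 81/119, so it is suppressed, while the distant third box survives.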

The invention has the beneficial effects that: in the above solution of the present invention, for the problem of insufficient information of the traffic target itself, the related information from outside the target in the picture or video, such as the context information around the target, is used to directly or indirectly provide the auxiliary information for target detection, thereby improving the accuracy of traffic target detection. The scheme provides a CNN traffic detection system based on self-adaptive context information. The method mainly comprises a CNN-based adaptive context selection model and a traffic detection system fusing the model. Compared with the existing system, the method has the advantages that the context information around the target and the difference of different traffic scenes are fused, so that the accuracy of vehicle and pedestrian detection is further improved.

Description of the drawings:

FIG. 1 is an overall framework diagram of a context information adaptive CNN traffic detection system;

FIG. 2 is a diagram of an adaptive context selection model;

FIG. 3 is a parameter learning diagram for a context information adaptive CNN traffic detection system;

FIG. 4 is a diagram of a CNN traffic detection system detection process with adaptive context information;

FIG. 5 shows part of the traffic target detection results.

The specific implementation mode is as follows:

In traffic video, the description of traffic targets can be further enriched with the context information that differs across traffic scenes, which improves the accuracy of traffic target detection. The overall framework is shown in FIG. 1 and mainly comprises a training phase and a detection phase.

The training phase mainly comprises two steps. In the first step, an adaptive context feature selection model is trained for a specific traffic scene. Two groups of CNN feature maps are extracted, one from the traffic target images and one from their context images in that scene. The difference between the two groups of feature maps is then computed at the same scale, and the position indexes of the feature maps whose sample difference degree is smaller than a set threshold are recorded and counted. The position indexes of K effective context CNN feature maps are then selected to obtain the adaptive context selection model. In the second step, a CNN traffic detection system based on adaptive context information is trained on the basis of the adaptive context feature selection model. In the forward stage, two groups of CNN feature maps are extracted from the traffic target image and its context image, and the K feature map position indexes retained by the context feature selection model are used to keep the corresponding feature maps among the context CNN feature maps. The two resulting groups of feature maps are then convolved with a target kernel and a context kernel respectively to obtain a target score and a context score, and the two scores are fused through a mixing coefficient to obtain a detection score. In the backward stage, the error between the detection score and the label is computed, and parameters such as the target kernel, the context kernel and the mixing coefficient are updated with the back-propagation (BP) algorithm.

In the detection stage, in a specific traffic scene, a traffic image to be detected is first input and 256 feature maps are extracted with the CNN (convolutional neural network). On the one hand, the trained target kernel is convolved with the 256 feature maps to obtain a target mask map; on the other hand, K feature maps are selected from the 256 feature maps using the K feature map position indexes retained by the context feature selection model and are convolved with the trained context kernel to obtain a context mask map. The target mask map and the context mask map are then fused with the trained mixing coefficient to predict the target position jointly. Finally, post-processing frames the traffic target accurately.

I. CNN feature extraction

1) The CNN traffic detection system based on self-adaptive context information uses the first five layers of AlexNet to extract feature maps from the traffic image. The detailed steps are as follows:

(1.1) Assume the input image x⁰ is represented by its three channel maps in RGB space. The convolutional layers are indexed by l = 1, 2, 3, 4, 5. M^l denotes the number of feature maps of the l-th convolutional layer: M¹ = 96, M² = 256, M³ = 384, M⁴ = 384, M⁵ = 256. The j-th feature map of the l-th convolutional layer is computed as follows:

where W^l represents the connection relationship between the feature maps of adjacent convolutional layers, and the remaining symbols denote the convolution operation, the convolution kernel and the offset, respectively.

(1.2) The result then passes through the pooling layer and the nonlinear layer of the l-th layer, and the output is expressed as:

II. Adaptive context selection model

2) The traffic image set I that is read in is first expressed in a uniform form, in which each sample consists of a target image, a context image containing that target image, and a label y_n ∈ {0,1} marking positive and negative samples, where n is the sample index. The specific process is shown in FIG. 2:

(2.1) For an input target image of size 80 × 48 and its corresponding context image of size 144 × 112, the CNN extracts the feature maps of the 5th convolutional layer, giving 256 target feature maps of size 5 × 3 and 256 corresponding context feature maps of size 9 × 7, j = 1, ..., 256.

(2.2) To compare each target feature map with its corresponding context feature map at the same scale, the target feature maps are upsampled so that each has size 9 × 7; the upsampled target feature maps are used in the comparison below.
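The upsampling in step (2.2) could be done in several ways; the text does not name the interpolation method. The sketch below assumes nearest-neighbour resizing purely for illustration, and `upsample_nearest` is a hypothetical helper name.

```python
def upsample_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of a 2-D list to out_h x out_w."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [
        [fmap[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# A 5 x 3 target feature map filled with its own flat index.
target = [[r * 3 + c for c in range(3)] for r in range(5)]
up = upsample_nearest(target, 9, 7)
print(len(up), len(up[0]))  # 9 7
```

Bilinear interpolation would be an equally reasonable choice and would give smoother maps at the cost of slightly more arithmetic.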

(2.3) Cosine similarity is used to measure the difference between the upsampled target feature map and the context feature map. Denote the target feature map by x and the context feature map by y, and compare their similarity with an empirical threshold ε, i.e.

S_cos(x,y)≤ε (4)

If the two feature maps differ little, the context feature map is discarded; otherwise it is retained, and the position index of each retained context feature map is recorded. The results over 2N positive and negative sample images are counted and sorted by how often each position occurs, and the position indexes of the context feature maps to keep are then selected, which realizes adaptive context feature selection. About 85% of the context feature maps are retained in the traffic system.

III. Training to obtain the relevant parameters of the traffic detection system

3) The training phase uses a forward pass to obtain the context mask and the target mask; the overall framework for learning the parameters is shown in FIG. 3.

(3.1) CNN features are extracted, using the method of 1), from the target image and from the context image containing the target, giving the corresponding target feature maps and context feature maps, where j = 1, ..., 256 indexes the feature maps.

(3.2) According to the adaptive context feature selection model obtained in step 2), the difference between each target feature map and its corresponding context feature map is measured with cosine similarity, and the K context feature maps to be retained are selected from the 256 context feature maps according to the threshold, where q = 1, ..., K and K ≤ 256.

(3.3) The target feature maps and the context feature maps are convolved with separate target kernels and context kernels to obtain the corresponding target mask and context mask, expressed respectively as:

where b^o is the offset paired with the target kernel and b^c is the offset paired with the context kernel. The target kernel and the context kernel have the same sizes as the target feature map and the context feature map, respectively. Valid convolution (which has boundary loss) is used, so the target mask and the context mask are both scalars.

(3.4) The detection score that incorporates context information, score_n, is expressed as:

(3.5) The system builds the objective function with the minimum mean square error criterion and uses the BP algorithm to gradually reduce the error between score_n and the label y_n. The objective function of the model is:
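The training loop of step (3.5) can be sketched in a much reduced form. The sketch makes several simplifying assumptions that are not in the patent: the target and context scores are treated as already computed numbers (the kernels are held fixed), only the mixing coefficient γ is learned, the score is the convex combination (1 - γ)·o + γ·c, and γ is clipped to [0, 1] after each per-sample step. The function name `train_gamma` and the toy data are hypothetical.

```python
def train_gamma(target_scores, context_scores, labels, alpha=0.1, epochs=200):
    """Learn the mixing coefficient by per-sample gradient steps on squared error."""
    gamma = 0.5
    for _ in range(epochs):
        for o, c, y in zip(target_scores, context_scores, labels):
            score = (1.0 - gamma) * o + gamma * c
            # Derivative of (score - y)^2 with respect to gamma.
            grad = 2.0 * (score - y) * (c - o)
            gamma = min(1.0, max(0.0, gamma - alpha * grad))
    return gamma


# Toy data in which the context score matches the labels exactly,
# while the target score is uninformative.
o = [0.5, 0.5, 0.5, 0.5]
c = [1.0, 0.0, 1.0, 0.0]
y = [1, 0, 1, 0]
gamma = train_gamma(o, c, y)
print(round(gamma, 3))  # approaches 1.0 on this toy data
```

In the full system the same gradient signal would also flow into the target and context kernels through back-propagation; they are omitted here to keep the example short.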

IV. Traffic target detection based on the traffic detection system

4) The target position is predicted by combining the target mask map and the context mask map, and then the detection result of the traffic target is obtained by non-maximum suppression, and the detection process is shown in fig. 4.

(4.1) First, the system takes an image I_n of a traffic scene as input and extracts feature maps with the method of 1), producing 256 feature maps.

(4.2) Then, according to the context feature map selection model obtained above, K effective context feature maps are selected from the 256 feature maps, supplementing the existing target feature maps with context information.

(4.3) Next, the context feature maps and the target feature maps are convolved with their corresponding kernels to obtain the target mask map and the context mask map.

(4.4) In the detection stage, same convolution (which has no boundary loss) is used, so the target mask map and the context mask map are both matrices. Finally, the position M of the target is predicted jointly from the target mask map and the context mask map by weighting.

(4.5) Finally, the detection result for the corresponding traffic target is obtained through non-maximum suppression post-processing.
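Steps (4.3) and (4.4) can be illustrated with a small single-channel sketch: a zero-padded same convolution keeps the mask the size of the feature map, and the two masks are then combined by weighting. The convex weighting, the odd kernel size, the use of cross-correlation and the function names are assumptions for this example; the actual system sums over 256 target maps and K context maps.

```python
def same_conv(fmap, kernel, offset=0.0):
    """Zero-padded 'same' cross-correlation of a 2-D list with an odd-sized kernel."""
    h, w = len(fmap), len(fmap[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[offset] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + i - ph, c + j - pw
                    if 0 <= rr < h and 0 <= cc < w:
                        out[r][c] += fmap[rr][cc] * kernel[i][j]
    return out


def fuse_masks(target_mask, context_mask, gamma):
    """Weighted fusion of two equally sized mask maps (assumed convex form)."""
    return [
        [(1.0 - gamma) * t + gamma * c for t, c in zip(t_row, c_row)]
        for t_row, c_row in zip(target_mask, context_mask)
    ]


fmap = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
box_kernel = [[1.0] * 3 for _ in range(3)]
centre_kernel = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
target_mask = same_conv(fmap, box_kernel)
context_mask = same_conv(fmap, centre_kernel)
m = fuse_masks(target_mask, context_mask, gamma=0.5)
print(len(m), len(m[0]))  # 3 3
print(m[1][1])            # 1.5
```

The output mask has the same 3 × 3 shape as the input, which is what allows the fused map to be read as a position map before non-maximum suppression.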

The scheme establishes an efficient and fast traffic detection system. The system achieves satisfactory traffic target detection results, as shown in FIG. 5.

Claims

1. A CNN traffic detection method based on self-adaptive context information is characterized by comprising a training phase and a detection phase,

the training phase comprises two steps:

in the backward stage, calculating errors of the detection values and the labels, and updating the target kernel, the context kernel and the mixing coefficient parameters by using a BP algorithm;

a detection stage: in a specific traffic scene, a traffic image to be detected is first input and 256 feature maps are extracted with the CNN (convolutional neural network); on the one hand, the trained target kernel is convolved with the 256 feature maps to obtain a target mask map; on the other hand, K feature maps are selected from the 256 feature maps using the K feature map position indexes retained by the context feature selection model and are convolved with the trained context kernel to obtain a context mask map; the target mask map and the context mask map are then fused with the trained mixing coefficient to predict the target position jointly; and finally, post-processing frames the traffic target accurately.

2. The CNN traffic detection method based on adaptive context information according to claim 1, wherein the feature maps of an image are extracted with the first five layers of the CNN-based AlexNet model; the specific steps are as follows:

(1) assume the input image x⁰ is represented by its three channel maps in RGB space; the convolutional layers are indexed by l = 1, 2, 3, 4, 5; M^l denotes the number of feature maps of the l-th convolutional layer, M¹ = 96, M² = 256, M³ = 384, M⁴ = 384, M⁵ = 256; the j-th feature map of the l-th convolutional layer is computed as follows:

(2) the result then passes through the pooling layer and the nonlinear layer of the l-th layer, and the output is expressed as:

thus, for the input image x⁰, the CNN yields 256 feature maps at the fifth convolutional layer; each feature map is 1/16 the size of the input image x⁰; for convenience, the system writes F(x⁰, j) for the j-th feature map of the 5th convolutional layer extracted from the input image x⁰.

3. The CNN traffic detection method based on adaptive context information according to claim 1, wherein the adaptive context selection model is obtained as follows: the traffic image set I that is read in is first expressed in a uniform form, in which each sample consists of a target image, a context image containing that target image, and a label y_n ∈ {0,1} marking positive and negative samples, where n is the sample index; then:

(1) for an input target image of size 80 × 48 and its corresponding context image of size 144 × 112, the CNN extracts the feature maps of the 5th convolutional layer, giving 256 target feature maps of size 5 × 3 and 256 corresponding context feature maps of size 9 × 7;

(2) to compare each target feature map with its corresponding context feature map at the same scale, the target feature maps are upsampled so that each has size 9 × 7;

(3) cosine similarity is used to measure the difference between the upsampled target feature map and the context feature map; denote the target feature map by x and the context feature map by y, and compare their similarity with an empirical threshold ε, i.e.

S_cos(x,y)≤ε (4)

if the two feature maps differ little, the context feature map is discarded; otherwise it is retained, and the position index of each retained context feature map is recorded; the results over 2N positive and negative sample images are counted and sorted by how often each position occurs, and the position indexes of the context feature maps to keep are then selected, which realizes adaptive context feature selection, where N is an integer greater than 0.

4. The CNN traffic detection method based on adaptive context information of claim 3, further comprising training to obtain the relevant parameters of the traffic detection system, wherein the training phase uses a forward pass to obtain a context mask and a target mask:

(1) CNN features are extracted from the target image and from the context image containing the target, giving the corresponding target feature maps and context feature maps, where j = 1, ..., 256 indexes the feature maps;

(2) according to the adaptive context feature selection model obtained according to claim 3, the difference between each target feature map and its corresponding context feature map is measured with cosine similarity, and the K context feature maps to be retained are selected from the 256 context feature maps according to the threshold, where q = 1, ..., K and K is less than or equal to 256;

(3) the target feature maps and the context feature maps are convolved with separate target kernels and context kernels to obtain the corresponding target mask and context mask, expressed respectively as:

where b^o is the offset paired with the target kernel and b^c is the offset paired with the context kernel; the target kernel and the context kernel have the same sizes as the target feature map and the context feature map, respectively;

valid convolution (which has boundary loss) is used, so the target mask and the context mask are both scalars;

(4) the detection score that incorporates context information, score_n, is expressed as:

where γ represents the mixing coefficient of the target and the context, and γ ∈ [0,1]; a different γ must be obtained for each scene; this variable reflects the different roles that different context information plays in target detection across scenes; if γ is 0, the context plays no role in target detection, and the model is equivalent to a CNN target detection model that does not consider context;

(5) the system builds the objective function with the minimum mean square error criterion and uses the BP algorithm to gradually reduce the error between score_n and the label y_n; the objective function of the model is:

where 2N is the total number of positive and negative samples; to solve the optimization problem for the parameters in the above formula, the system trains the model parameters by stochastic gradient descent, and every parameter w is updated with the following formula until convergence:

where i is the iteration index and α is the learning rate of the gradient descent algorithm; updating the parameters requires repeatedly computing the gradient of the objective function L(·).

5. The CNN traffic detection method based on adaptive context information of claim 2, further comprising traffic target detection, wherein the target position is predicted jointly through a target mask map and a context mask map, and then the detection result of the traffic target is obtained through non-maximum suppression:

(1) first, the system takes an image I_n of a traffic scene as input and extracts feature maps by the method of claim 2, producing 256 feature maps;

(2) then, according to the context feature map selection model obtained above, K effective context feature maps are selected from the 256 feature maps, supplementing the existing target feature maps with context information;

(3) next, the context feature maps and the target feature maps are convolved with their corresponding kernels to obtain the target mask map and the context mask map;

(4) in the detection phase, same convolution (which has no boundary loss) is used, so the target mask map and the context mask map are both matrices; finally, the position M of the target is predicted jointly from the target mask map and the context mask map by weighting;

(5) the detection result for the corresponding traffic target is obtained through non-maximum suppression post-processing.