CNN traffic detection method based on self-adaptive context information
Technical Field
The invention relates to a CNN traffic detection technology based on self-adaptive context information, which can be applied in real time.
Background art:
To address increasingly serious traffic problems, Intelligent Transportation Systems (ITS) have been developed. Vehicle and pedestrian identification is an important component of such systems, and existing technologies for detecting vehicles and pedestrians are already widely used.
Existing traffic detection systems mainly identify and detect different targets (pedestrians and vehicles) by describing the appearance of the targets. Currently, such systems describe target appearance using either hand-crafted features (such as HOG, LBP, and SIFT) or deep features learned directly from the image through deep learning, and detect targets based on that appearance. However, everyday traffic is mostly an open, unconstrained environment: scenes are complex and changeable, with interference such as illumination changes, viewpoint changes, and target occlusion. If detection relies only on the appearance of the target, the target category cannot be judged accurately when the target itself provides too little information in the image or video. Moreover, traffic scenes differ from one another, and a general-purpose traffic target detection system that ignores these differences loses detection accuracy.
Summary of the invention:
The invention provides a CNN traffic detection method based on self-adaptive context information. It aims to enrich the description of traffic targets by using the different context information available in different traffic scenes of a traffic video, thereby improving the accuracy of traffic target detection.
The technical scheme adopted for realizing the purpose of the invention is as follows: a CNN traffic detection method based on self-adaptive context information comprises a training phase and a detection phase, and is characterized in that,
the training phase comprises two steps:
first, in a specific traffic scene, an adaptive context feature selection model is trained: two groups of CNN feature maps are extracted, one from the traffic target image and one from its context image in the specific traffic scene; the difference between the two groups of feature maps is then calculated at the same scale, and the position indexes of the feature maps whose difference measure (the similarity value between the two maps) is smaller than a set threshold are recorded and counted over the samples; the position indexes of K effective context CNN feature maps are then selected to obtain the adaptive context selection model, where K is an integer greater than 0;
second, on the basis of the adaptive context feature selection model, a CNN traffic detection system based on self-adaptive context information is trained; in the forward stage, two groups of CNN feature maps are first extracted from the traffic target image and its context image, and the K feature-map position indexes retained by the context feature selection model are used to keep the corresponding effective feature maps from the context CNN feature maps; the two resulting groups of feature maps are then convolved with a target kernel and a context kernel, respectively, to obtain a target score and a context score; the target score and the context score are then fused through a mixing coefficient to obtain a detection score;
in the backward stage, the error between the detection score and the label is calculated, and parameters such as the target kernel, the context kernel, and the mixing coefficient are updated using the BP (back-propagation) algorithm;
the detection stage: in a specific traffic scene, a traffic image to be detected is first input and 256 feature maps are extracted using the CNN (convolutional neural network); on the one hand, the 256 feature maps are convolved with the trained target kernel to obtain a target mask map; on the other hand, K feature maps are selected from the 256 feature maps using the K feature-map position indexes retained by the context feature selection model and are convolved with the trained context kernel to obtain a context mask map; the target mask map and the context mask map are then fused using the trained mixing coefficient to jointly predict the target position; finally, the traffic target is accurately framed through post-processing, where K is an integer greater than 0.
With this scheme, context features are adaptively incorporated for each scene by means of the difference measure, which strengthens the representation of the traffic target and effectively improves the accuracy with which traffic targets are described.
1. The CNN traffic detection system based on self-adaptive context information uses the first five layers of AlexNet to extract feature maps from the traffic image. (AlexNet is an eight-layer convolutional neural network model proposed in 2012. The network has 60 million parameters and 650,000 neurons and consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers ending in a 1000-way softmax.) The steps are as follows:
1.1 Assume the input image is $x_0$, expressed as $x_0 = \{x_0^1, x_0^2, x_0^3\}$, where $x_0^1$, $x_0^2$, and $x_0^3$ denote the three channel maps of the image $x_0$ in RGB space. The index of the convolutional layer is denoted by $l$, $l = 1, 2, 3, 4, 5$. $M_l$ denotes the number of feature maps of the $l$-th convolutional layer: $M_1 = 96$, $M_2 = 256$, $M_3 = 384$, $M_4 = 384$, $M_5 = 256$. The $j$-th feature map $\hat{x}_l^j$ of the $l$-th convolutional layer is calculated as follows:
$$\hat{x}_l^j = \sum_{i \in W_l} x_{l-1}^i \otimes k_l^{ij} + b_l^j$$
where $W_l$ represents the connection relation between the feature maps of adjacent convolutional layers, $\otimes$ represents the convolution operation, and $k_l^{ij}$ and $b_l^j$ represent the convolution kernel and the offset, respectively;
1.2 $x_l^j$ is obtained from $\hat{x}_l^j$ through the pooling layer and the nonlinear layer of the $l$-th layer, expressed as:
$$x_l^j = g\big(f(\hat{x}_l^j)\big)$$
where $g(\cdot)$ denotes local response normalization and $f(\cdot)$ denotes the activation function, for which the non-saturating nonlinear function is used:
$$f(x) = \max(0, x)$$
Thus, for an input image $x_0$, the CNN yields 256 feature maps $x_5^j$, $j = 1, \dots, 256$, at the fifth convolutional layer. Each feature map is 1/16 the size of the input image $x_0$. For convenience of expression, the system uses $F(x_0, j)$ to denote the $j$-th feature map of the fifth convolutional layer extracted from the input image $x_0$.
2. The adaptive context selection model first represents the read traffic image set $I$ in a uniform form: $I = \{(x_n^o, x_n^c, y_n)\}$, where $x_n^o$ denotes the target image, $x_n^c$ denotes the context image containing the target image, $y_n \in \{0, 1\}$ denotes the label of the positive or negative sample, and $n$ denotes the sample index:
2.1 For an input target image $x_n^o$ of size 80 × 48 and its corresponding context image $x_n^c$ of size 144 × 112, the CNN is used to extract the feature maps of the fifth convolutional layer, giving 256 target feature maps $F(x_n^o, j)$ of size 5 × 3 and 256 corresponding context feature maps $F(x_n^c, j)$ of size 9 × 7, $j = 1, \dots, 256$.
2.2 To compare the target feature map $F(x_n^o, j)$ with its corresponding context feature map $F(x_n^c, j)$ at the same scale, the target feature map $F(x_n^o, j)$ is upsampled so that each feature map has size 9 × 7; the result is denoted $\tilde{F}(x_n^o, j)$.
2.3 Cosine similarity is used to measure the difference between the two feature maps $\tilde{F}(x_n^o, j)$ and $F(x_n^c, j)$. Denote the target feature map $\tilde{F}(x_n^o, j)$ by $x$ and the context feature map $F(x_n^c, j)$ by $y$. The two maps are considered to differ when their similarity does not exceed an empirical threshold $\varepsilon$, i.e.
$$S_{\cos}(x, y) \le \varepsilon \qquad (4)$$
If the difference between $\tilde{F}(x_n^o, j)$ and $F(x_n^c, j)$ is small, the context feature map $F(x_n^c, j)$ is discarded; otherwise, the context feature map $F(x_n^c, j)$ is retained and its position index is recorded. The results over the 2N positive and negative sample images are counted and sorted by the frequency with which each position occurs, and the position indexes of the context feature maps to be kept are finally selected, realizing adaptive context feature selection, where N is an integer greater than 0. About 85% of the context feature maps are retained in this traffic system.
3. Training obtains the relevant parameters of the traffic detection system. The training stage uses a forward process to obtain a context mask and a target mask:
3.1 For the target image $x_n^o$ and the context image $x_n^c$ containing the target, CNN features are extracted as in step 1 to obtain the corresponding target feature maps $F(x_n^o, j)$ and context feature maps $F(x_n^c, j)$, where $j = 1, \dots, 256$ indexes the feature maps.
3.2 Using the adaptive context feature selection model obtained in step 2, which measures the difference between the target feature maps $F(x_n^o, j)$ and the corresponding context feature maps $F(x_n^c, j)$ by cosine similarity against a threshold, the K context feature maps to be retained, $F(x_n^c, j_q)$, are selected from the 256 context feature maps, where $q = 1, \dots, K$ and $j_q$ is the $q$-th retained position index.
3.3 The target feature maps and the context feature maps are convolved with separate target kernels and context kernels to obtain the corresponding target mask $m_n^o$ and context mask $m_n^c$, expressed respectively as:
$$m_n^o = \sum_{j=1}^{256} F(x_n^o, j) \otimes w_o^j + b_o$$
$$m_n^c = \sum_{q=1}^{K} F(x_n^c, j_q) \otimes w_c^q + b_c$$
where $w_o^j$ and $b_o$ denote the target kernel and the corresponding offset, and $w_c^q$ and $b_c$ denote the context kernel and the corresponding offset. The sizes of the target kernel $w_o^j$ and the context kernel $w_c^q$ match the sizes of $F(x_n^o, j)$ and $F(x_n^c, j_q)$, respectively. Valid convolution (which has boundary loss) is used, so $m_n^o$ and $m_n^c$ are both scalars.
3.4 The detection score $\mathrm{score}_n$ that incorporates the context information is obtained by combining the target mask $m_n^o$ and the context mask $m_n^c$, with the context contribution weighted by $\gamma$, where $\gamma$ denotes the mixing coefficient of target and context, $\gamma \in [0, 1]$. A different $\gamma$ must be learned for each scene; this variable reflects the different roles that context information plays in target detection in different scenes. For example, if $\gamma = 0$, the context has no effect on target detection, and the model is equivalent to a CNN target detection model that does not consider context.
3.5 The system builds the objective function using the minimum mean square error criterion and uses the BP algorithm to gradually reduce the error between $\mathrm{score}_n$ and the label $y_n$. The objective function of the model is:
$$L = \frac{1}{2N} \sum_{n=1}^{2N} \left(\mathrm{score}_n - y_n\right)^2$$
where 2N is the total number of positive and negative samples. To solve the optimization of the parameters in the above formula, the system trains the model parameters by stochastic gradient descent, and every parameter $w$ is updated with the following formula until convergence:
$$w_{i+1} = w_i - \alpha \frac{\partial L}{\partial w_i}$$
where $i$ is the iteration index and $\alpha$ is the learning rate of the gradient descent algorithm; updating the parameters requires iteratively computing the gradient of the objective function $L(\cdot)$.
4. Detecting traffic targets: the target position is predicted by combining the target mask map and the context mask map, and the detection result for the traffic target is then obtained by non-maximum suppression.
4.1 First, the system takes an image $I_n$ of the traffic scene as input and extracts feature maps by the method of step 1, generating 256 feature maps.
4.2 Then, according to the trained context feature map selection model, K effective context feature maps are selected from the 256 feature maps to supplement the existing target feature maps with context information.
4.3 Next, the context feature maps and the target feature maps are convolved with their corresponding convolution kernels to obtain the target mask map $M_o$ and the context mask map $M_c$.
4.4 In the detection stage, same convolution (which has no boundary loss) is used, so $M_o$ and $M_c$ are both matrices. Finally, the target mask map $M_o$ and the context mask map $M_c$ are combined in a weighted manner to jointly predict the target position $M$.
4.5 The detection result for the traffic target is obtained through non-maximum suppression post-processing.
Beneficial effects of the invention: to address the problem that a traffic target by itself provides insufficient information, the above solution uses related information from outside the target in the picture or video, such as the context information around the target, to provide direct or indirect auxiliary information for target detection, thereby improving the accuracy of traffic target detection. The scheme provides a CNN traffic detection system based on self-adaptive context information, which mainly comprises a CNN-based adaptive context selection model and a traffic detection system that incorporates this model. Compared with existing systems, the method takes into account both the context information around the target and the differences between traffic scenes, further improving the accuracy of vehicle and pedestrian detection.
Description of the drawings:
FIG. 1 is an overall framework diagram of a context information adaptive CNN traffic detection system;
FIG. 2 is a diagram of an adaptive context selection model;
FIG. 3 is a parameter learning diagram for a context information adaptive CNN traffic detection system;
FIG. 4 is a diagram of a CNN traffic detection system detection process with adaptive context information;
FIG. 5 shows part of the traffic target detection results.
Detailed description of the embodiments:
In a traffic video, the description of a traffic target can be enriched by using the different context information available in different traffic scenes, thereby improving the accuracy of traffic target detection. The overall framework is shown in FIG. 1 and mainly comprises a training phase and a detection phase.
The training phase mainly comprises two steps. In the first step, an adaptive context feature selection model is trained in a specific traffic scene. Two groups of CNN feature maps are first extracted, one from the traffic target image and one from its context image in the specific traffic scene. The difference between the two groups of feature maps is then calculated at the same scale, and the position indexes of the feature maps whose difference measure (the similarity value between the two maps) is smaller than a set threshold are recorded and counted over the samples. The position indexes of K effective context CNN feature maps are then selected to obtain the adaptive context selection model. In the second step, a CNN traffic detection system based on self-adaptive context information is trained on the basis of the adaptive context feature selection model. In the forward stage, two groups of CNN feature maps are first extracted from the traffic target image and its context image, and the K feature-map position indexes retained by the context feature selection model are used to keep the corresponding feature maps from the context CNN feature maps. The two resulting groups of feature maps are then convolved with a target kernel and a context kernel, respectively, to obtain a target score and a context score, and the two scores are fused through a mixing coefficient to obtain a detection score. In the backward stage, the error between the detection score and the label is calculated, and parameters such as the target kernel, the context kernel, and the mixing coefficient are updated using the BP (back-propagation) algorithm.
In the detection stage, in a specific traffic scene, a traffic image to be detected is first input and 256 feature maps are extracted using the CNN (convolutional neural network). On the one hand, the 256 feature maps are convolved with the trained target kernel to obtain a target mask map. On the other hand, K feature maps are selected from the 256 feature maps using the K feature-map position indexes retained by the context feature selection model and are convolved with the trained context kernel to obtain a context mask map. The target mask map and the context mask map are then fused using the trained mixing coefficient to jointly predict the target position. Finally, the traffic target is accurately framed through post-processing.
With this scheme, context features are adaptively incorporated for each scene by means of the difference measure, which strengthens the representation of the traffic target and effectively improves the accuracy with which traffic targets are described.
I. CNN feature extraction
1) The CNN traffic detection system based on self-adaptive context information uses the first five layers of AlexNet to extract feature maps from the traffic image. The detailed steps are as follows:
(1.1) Assume the input image is $x_0$, expressed as $x_0 = \{x_0^1, x_0^2, x_0^3\}$, where $x_0^1$, $x_0^2$, and $x_0^3$ denote the three channel maps of the image $x_0$ in RGB space. The index of the convolutional layer is denoted by $l$, $l = 1, 2, 3, 4, 5$. $M_l$ denotes the number of feature maps of the $l$-th convolutional layer: $M_1 = 96$, $M_2 = 256$, $M_3 = 384$, $M_4 = 384$, $M_5 = 256$. The $j$-th feature map $\hat{x}_l^j$ of the $l$-th convolutional layer is calculated as follows:
$$\hat{x}_l^j = \sum_{i \in W_l} x_{l-1}^i \otimes k_l^{ij} + b_l^j$$
where $W_l$ represents the connection relation between the feature maps of adjacent convolutional layers, $\otimes$ represents the convolution operation, and $k_l^{ij}$ and $b_l^j$ represent the convolution kernel and the offset, respectively.
(1.2) $x_l^j$ is obtained from $\hat{x}_l^j$ through the pooling layer and the nonlinear layer of the $l$-th layer, expressed as:
$$x_l^j = g\big(f(\hat{x}_l^j)\big)$$
where $g(\cdot)$ denotes local response normalization and $f(\cdot)$ denotes the activation function, for which the non-saturating nonlinear function is used:
$$f(x) = \max(0, x)$$
Thus, for an input image $x_0$, the CNN yields 256 feature maps $x_5^j$, $j = 1, \dots, 256$, at the fifth convolutional layer. Each feature map is 1/16 the size of the input image $x_0$. For convenience of expression, the system uses $F(x_0, j)$ to denote the $j$-th feature map of the fifth convolutional layer extracted from the input image $x_0$.
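The size relation stated above can be checked with a short sketch. It assumes that "1/16" refers to each side of the feature map, which agrees with the 80 × 48 and 144 × 112 inputs and the 5 × 3 and 9 × 7 feature maps given for the selection model; the function name `conv5_map_size` and the parameter `total_stride` are hypothetical and introduced only for illustration.

```python
def conv5_map_size(height, width, total_stride=16):
    """Spatial size of a fifth-layer feature map, assuming the five layers
    shrink each side by a total factor of `total_stride` (rounded up)."""
    return (-(-height // total_stride), -(-width // total_stride))


# Image sizes taken from the text: 80 x 48 target image, 144 x 112 context image.
target_size = conv5_map_size(80, 48)
context_size = conv5_map_size(144, 112)
print(target_size, context_size)  # (5, 3) (9, 7)
```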
II. Adaptive context selection model
2) The read traffic image set $I$ is first represented in a uniform form: $I = \{(x_n^o, x_n^c, y_n)\}$, where $x_n^o$ denotes the target image, $x_n^c$ denotes the context image containing the target image, $y_n \in \{0, 1\}$ denotes the label of the positive or negative sample, and $n$ denotes the sample index. The specific process is shown in FIG. 2:
(2.1) For an input target image $x_n^o$ of size 80 × 48 and its corresponding context image $x_n^c$ of size 144 × 112, the CNN is used to extract the feature maps of the fifth convolutional layer, giving 256 target feature maps $F(x_n^o, j)$ of size 5 × 3 and 256 corresponding context feature maps $F(x_n^c, j)$ of size 9 × 7, $j = 1, \dots, 256$.
(2.2) To compare the target feature map $F(x_n^o, j)$ with its corresponding context feature map $F(x_n^c, j)$ at the same scale, the target feature map $F(x_n^o, j)$ is upsampled so that each feature map has size 9 × 7; the result is denoted $\tilde{F}(x_n^o, j)$.
(2.3) Cosine similarity is used to measure the difference between the two feature maps $\tilde{F}(x_n^o, j)$ and $F(x_n^c, j)$. Denote the target feature map $\tilde{F}(x_n^o, j)$ by $x$ and the context feature map $F(x_n^c, j)$ by $y$. The two maps are considered to differ when their similarity does not exceed an empirical threshold $\varepsilon$, i.e.
$$S_{\cos}(x, y) \le \varepsilon \qquad (4)$$
If the difference between $\tilde{F}(x_n^o, j)$ and $F(x_n^c, j)$ is small, the context feature map $F(x_n^c, j)$ is discarded; otherwise, the context feature map $F(x_n^c, j)$ is retained and its position index is recorded. The results over the 2N positive and negative sample images are counted and sorted by the frequency with which each position occurs, and the position indexes of the context feature maps to be kept are finally selected, realizing adaptive context feature selection. About 85% of the context feature maps are retained in this traffic system.
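The selection procedure of steps (2.1) to (2.3) can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify the upsampling scheme, so nearest-neighbour upsampling is assumed; the feature maps are toy arrays rather than CNN outputs; and the names `upsample_nearest`, `cosine_similarity`, and `select_context_indices` are hypothetical. A position is counted whenever the cosine similarity satisfies the threshold condition (4), and the K most frequently counted positions are kept.

```python
import numpy as np


def upsample_nearest(fmap, out_shape):
    """Nearest-neighbour upsampling of a 2-D map (assumed scheme)."""
    rows = np.arange(out_shape[0]) * fmap.shape[0] // out_shape[0]
    cols = np.arange(out_shape[1]) * fmap.shape[1] // out_shape[1]
    return fmap[np.ix_(rows, cols)]


def cosine_similarity(x, y, eps=1e-12):
    """Cosine similarity between two maps, flattened to vectors."""
    x, y = x.ravel(), y.ravel()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + eps))


def select_context_indices(target_maps, context_maps, threshold, k):
    """target_maps: (S, J, 5, 3); context_maps: (S, J, 9, 7).
    Counts, per position j, how often S_cos <= threshold over the S samples
    and returns the k most frequently retained positions with the counts."""
    n_samples, n_maps = context_maps.shape[:2]
    counts = np.zeros(n_maps, dtype=int)
    for n in range(n_samples):
        for j in range(n_maps):
            up = upsample_nearest(target_maps[n, j], context_maps.shape[2:])
            if cosine_similarity(up, context_maps[n, j]) <= threshold:
                counts[j] += 1  # context map differs enough: retain it
    order = np.argsort(-counts, kind="stable")
    return np.sort(order[:k]), counts


# Toy data: 3 samples, 4 feature-map positions (256 in the text).
S, J = 3, 4
target = np.arange(S * J * 15, dtype=float).reshape(S, J, 5, 3) + 1.0
context = np.empty((S, J, 9, 7))
for n in range(S):
    for j in range(J):
        up = upsample_nearest(target[n, j], (9, 7))
        # Position 1 duplicates the target (redundant); the others differ.
        context[n, j] = up if j == 1 else -up

kept, counts = select_context_indices(target, context, threshold=0.9, k=3)
print(kept, counts)  # [0 2 3] [3 0 3 3]
```

In this toy run the redundant position 1 is never counted and is therefore dropped, which mirrors the rule that context maps with little difference from the target are discarded.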
III. Training to obtain the relevant parameters of the traffic detection system
3) The training stage uses a forward process to obtain a context mask and a target mask. The overall framework for learning the relevant parameters is shown in FIG. 3.
(3.1) For the target image $x_n^o$ and the context image $x_n^c$ containing the target, CNN features are extracted as in 1) to obtain the corresponding target feature maps $F(x_n^o, j)$ and context feature maps $F(x_n^c, j)$, where $j = 1, \dots, 256$ indexes the feature maps.
(3.2) Using the adaptive context feature selection model obtained in 2), which measures the difference between the target feature maps $F(x_n^o, j)$ and the corresponding context feature maps $F(x_n^c, j)$ by cosine similarity against a threshold, the K context feature maps to be retained, $F(x_n^c, j_q)$, are selected from the 256 context feature maps, where $q = 1, \dots, K$ and $j_q$ is the $q$-th retained position index.
(3.3) The target feature maps and the context feature maps are convolved with separate target kernels and context kernels to obtain the corresponding target mask $m_n^o$ and context mask $m_n^c$, expressed respectively as:
$$m_n^o = \sum_{j=1}^{256} F(x_n^o, j) \otimes w_o^j + b_o$$
$$m_n^c = \sum_{q=1}^{K} F(x_n^c, j_q) \otimes w_c^q + b_c$$
where $w_o^j$ and $b_o$ denote the target kernel and the corresponding offset, and $w_c^q$ and $b_c$ denote the context kernel and the corresponding offset. The sizes of the target kernel $w_o^j$ and the context kernel $w_c^q$ match the sizes of $F(x_n^o, j)$ and $F(x_n^c, j_q)$, respectively. Valid convolution (which has boundary loss) is used, so $m_n^o$ and $m_n^c$ are both scalars.
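The reason the training masks are scalars can be seen in a small sketch: when the kernel has the same size as the feature map, a valid convolution has exactly one output position. The code below is an illustration only. It uses cross-correlation (no kernel flip), as most CNN code does, which is an assumption; the kernel values and offset are arbitrary; and `valid_correlate2d` and `mask_score` are hypothetical names.

```python
import numpy as np


def valid_correlate2d(fmap, kernel):
    """'Valid' sliding-window product: only positions where the kernel
    fits entirely inside the feature map are computed."""
    kh, kw = kernel.shape
    oh, ow = fmap.shape[0] - kh + 1, fmap.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(fmap[r:r + kh, c:c + kw] * kernel)
    return out


def mask_score(fmaps, kernels, bias):
    """Sum of per-map valid correlations plus an offset.
    fmaps and kernels both have shape (J, H, W)."""
    return sum(valid_correlate2d(f, k) for f, k in zip(fmaps, kernels)) + bias


# 256 target feature maps of size 5 x 3 with same-sized kernels (toy values).
target_maps = np.ones((256, 5, 3))
target_kernels = np.full((256, 5, 3), 0.01)
m_o = mask_score(target_maps, target_kernels, bias=0.5)
print(m_o.shape)  # (1, 1): a single value, i.e. a scalar score
```

The same function applied to K retained context maps of size 9 × 7 with 9 × 7 kernels likewise returns a single value.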
(3.4) The detection score $\mathrm{score}_n$ that incorporates the context information is obtained by combining the target mask $m_n^o$ and the context mask $m_n^c$, with the context contribution weighted by $\gamma$, where $\gamma$ denotes the mixing coefficient of target and context, $\gamma \in [0, 1]$. A different $\gamma$ must be learned for each scene; this variable reflects the different roles that context information plays in target detection in different scenes. For example, if $\gamma = 0$, the context has no effect on target detection, and the model is equivalent to a CNN target detection model that does not consider context.
(3.5) The system builds the objective function using the minimum mean square error criterion and uses the BP algorithm to gradually reduce the error between $\mathrm{score}_n$ and the label $y_n$. The objective function of the model is:
$$L = \frac{1}{2N} \sum_{n=1}^{2N} \left(\mathrm{score}_n - y_n\right)^2$$
where 2N is the total number of positive and negative samples. To solve the optimization of the parameters in the above formula, the system trains the model parameters by stochastic gradient descent, and every parameter $w$ is updated with the following formula until convergence:
$$w_{i+1} = w_i - \alpha \frac{\partial L}{\partial w_i}$$
where $i$ is the iteration index and $\alpha$ is the learning rate of the gradient descent algorithm; updating the parameters requires iteratively computing the gradient of the objective function $L(\cdot)$.
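A minimal sketch of how the mixing coefficient could be learned is given below. Several points are assumptions rather than statements of the method: the fusion rule is taken to be the convex combination $(1-\gamma)\,m^o + \gamma\,m^c$, which is one form consistent with $\gamma \in [0, 1]$ and with the $\gamma = 0$ case described above; the target and context scores are toy numbers; only $\gamma$ is updated, whereas the kernels would be updated in the same way by back-propagation; and a full-batch gradient step is used for brevity where the text uses stochastic gradient descent. The names `fused_score`, `mse_loss`, and `update_gamma` are hypothetical.

```python
import numpy as np


def fused_score(m_o, m_c, gamma):
    # Assumed fusion rule: convex combination of target and context scores.
    return (1.0 - gamma) * m_o + gamma * m_c


def mse_loss(scores, labels):
    """Mean squared error between detection scores and labels."""
    return float(np.mean((scores - labels) ** 2))


def update_gamma(gamma, m_o, m_c, labels, lr):
    """One gradient step on gamma, then clipping to keep it in [0, 1]."""
    residual = fused_score(m_o, m_c, gamma) - labels
    grad = float(np.mean(2.0 * residual * (m_c - m_o)))  # dL/dgamma
    return float(np.clip(gamma - lr * grad, 0.0, 1.0))


# Toy target and context scores for two positive and two negative samples.
m_o = np.array([0.9, 0.4, 0.2, 0.6])
m_c = np.array([0.5, 0.9, 0.4, 0.1])
labels = np.array([1.0, 1.0, 0.0, 0.0])

gamma = 0.0  # start from the context-free model
loss_before = mse_loss(fused_score(m_o, m_c, gamma), labels)
for _ in range(200):
    gamma = update_gamma(gamma, m_o, m_c, labels, lr=0.5)
loss_after = mse_loss(fused_score(m_o, m_c, gamma), labels)
print(round(gamma, 4), loss_before, loss_after)
```

Starting from $\gamma = 0$, the learned coefficient settles strictly between 0 and 1 on this toy data, so both the target score and the context score contribute to the fused score.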
IV. Traffic target detection based on the traffic detection system
4) The target position is predicted by combining the target mask map and the context mask map, and the detection result for the traffic target is then obtained by non-maximum suppression. The detection process is shown in FIG. 4.
(4.1) First, the system takes an image $I_n$ of the traffic scene as input and extracts feature maps by the method of 1), generating 256 feature maps.
(4.2) Then, according to the trained context feature map selection model, K effective context feature maps are selected from the 256 feature maps to supplement the existing target feature maps with context information.
(4.3) Next, the context feature maps and the target feature maps are convolved with their corresponding convolution kernels to obtain the target mask map $M_o$ and the context mask map $M_c$.
(4.4) In the detection stage, same convolution (which has no boundary loss) is used, so $M_o$ and $M_c$ are both matrices. Finally, the target mask map $M_o$ and the context mask map $M_c$ are combined in a weighted manner to jointly predict the target position $M$.
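Steps (4.2) to (4.4) can be sketched as below. The sketch is illustrative only: it uses zero-padded cross-correlation for the same convolution, assumes the weighted combination is the convex form $(1-\gamma)\,M_o + \gamma\,M_c$, and uses 4 toy feature maps with hypothetical retained indexes `kept` instead of 256 real ones. The names `same_correlate2d` and `mask_map` are hypothetical.

```python
import numpy as np


def same_correlate2d(fmap, kernel):
    """Zero-padded sliding-window product whose output has the same size
    as the input feature map (no boundary loss)."""
    kh, kw = kernel.shape
    pad = ((kh // 2, kh - 1 - kh // 2), (kw // 2, kw - 1 - kw // 2))
    padded = np.pad(fmap, pad)
    out = np.empty(fmap.shape, dtype=float)
    for r in range(fmap.shape[0]):
        for c in range(fmap.shape[1]):
            out[r, c] = np.sum(padded[r:r + kh, c:c + kw] * kernel)
    return out


def mask_map(fmaps, kernels, bias):
    """Sum of per-map same correlations plus an offset."""
    return sum(same_correlate2d(f, k) for f, k in zip(fmaps, kernels)) + bias


# Toy stand-ins: 4 feature maps of a 12 x 20 grid (256 maps in the text).
feature_maps = np.ones((4, 12, 20))
kept = [0, 2, 3]                      # hypothetical retained context indexes
w_o, b_o = np.ones((4, 5, 3)), 0.0    # target kernels, 5 x 3 as in training
w_c, b_c = np.ones((3, 9, 7)), 0.0    # context kernels, 9 x 7 as in training
gamma = 0.25                          # hypothetical trained mixing coefficient

M_o = mask_map(feature_maps, w_o, b_o)
M_c = mask_map(feature_maps[kept], w_c, b_c)
M = (1.0 - gamma) * M_o + gamma * M_c  # assumed weighted fusion
print(M.shape)  # (12, 20): one fused response per feature-map location
```

Because the kernels slide over the whole feature map, every location receives a fused response, which is why the masks are matrices at detection time and scalars during training.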
(4.5) The detection result for the traffic target is obtained through non-maximum suppression post-processing.
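Non-maximum suppression is commonly implemented as the greedy procedure sketched below. The text does not say how candidate boxes are derived from the fused mask map, so the boxes, scores, and the IoU threshold of 0.5 here are assumptions chosen for illustration, and `iou` and `non_max_suppression` are hypothetical names.

```python
import numpy as np


def iou(box, boxes):
    """Intersection over union between one box and an array of boxes,
    each given as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap
    it by more than the threshold, and repeat on the remainder."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep


# Two heavily overlapping candidates for one target plus a separate target.
boxes = np.array([[10, 10, 50, 90],
                  [12, 12, 52, 92],
                  [100, 20, 140, 100]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
keep = non_max_suppression(boxes, scores)
print(keep)  # [0, 2]
```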
This scheme establishes an efficient and fast traffic detection system. As shown in FIG. 5, the system achieves satisfactory traffic target detection results.