CN111062384A - Vehicle window accurate positioning method based on deep learning - Google Patents

Vehicle window accurate positioning method based on deep learning

Info

Publication number
CN111062384A
Authority
CN
China
Prior art keywords
window
picture
stage
convolution
box
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911089593.2A
Other languages
Chinese (zh)
Other versions
CN111062384B (en)
Inventor
韩梦江
楼燚航
白燕
张永祥
陈杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Boyun Vision Beijing Technology Co ltd
Original Assignee
Boyun Vision Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Boyun Vision Beijing Technology Co ltd filed Critical Boyun Vision Beijing Technology Co ltd
Priority to CN201911089593.2A priority Critical patent/CN111062384B/en
Publication of CN111062384A publication Critical patent/CN111062384A/en
Application granted granted Critical
Publication of CN111062384B publication Critical patent/CN111062384B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08Detecting or categorising vehicles
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep-learning-based method for accurately positioning a vehicle window, which comprises the following steps: S1, acquiring a rough positioning frame of the vehicle window in the first stage; S11, selecting a sample group and calibrating the corner point coordinates of the vehicle window in each picture; S12, storing the pictures and corner coordinates as a data set; S13, inputting the data set into the first-stage deep convolutional network to extract a feature map; S14, inputting the feature map into the BOX regression layer to obtain an approximate positioning frame of the window; S2, acquiring the four accurate corner point coordinates of the window in the second stage; S21, expanding the approximate window positioning frame; S22, cropping the picture inside the expanded candidate frame; S23, converting the corner coordinates into coordinates relative to the expanded candidate frame; S24, inputting the cropped picture into the second-stage deep convolutional network to extract a feature map, and converting the feature map into a feature vector; and S25, inputting the feature vector into the linear regression layer to obtain the accurate corner coordinates of the window.

Description

Vehicle window accurate positioning method based on deep learning
Technical Field
The invention relates to the field of image processing, in particular to a deep-learning-based method for accurately positioning a vehicle window.
Background
In recent years, intelligent traffic systems and intelligent monitoring have developed rapidly, and vehicle window identification plays a significant role in both fields. Electronic police and vehicle checkpoints acquire large numbers of high-definition vehicle pictures in real time; applying these pictures effectively to extract as much information as possible, and thereby relieve traffic management pressure, is a current focus of attention in intelligent traffic and intelligent monitoring. Vehicle window identification makes it possible to further analyze driver information, locate safety belts, and improve the accuracy of vehicle type identification. In addition, if the vehicle window can be accurately positioned, more interference can be eliminated, and more accurate information about the vehicle interior can be obtained.
The goal of window positioning is to automatically identify the vehicle windows for a given series of vehicle pictures from different cameras, with different colors, orientations, types, and sizes.
At present, vehicle window positioning is generally tackled by detecting the window from effective features such as vehicle color, texture, and spatial relationships. The traditional methods include the following. One method processes dark-colored and light-colored cars with complex backgrounds under different illumination conditions separately, segmenting and positioning the window with a genetic algorithm constructed from a chromaticity function curve; this method suffers from long positioning time, a complicated process, and large resource consumption. Another method uses the texture information of the vehicle in the picture: after color-space conversion, the window can be roughly positioned by texture detection; its disadvantage is that excessive dependence on the vehicle's color and texture information weakens the robustness of the algorithm, so detection performance degrades greatly under different illumination and vehicle colors. Yet another method uses a sliding window to position the vehicle window with reference to a previously positioned region; the accuracy and precision of this method are difficult to bring up to practical requirements.
Disclosure of Invention
The invention aims to solve the above problems by providing a deep-learning-based method for accurately positioning a vehicle window, which outputs the coordinates of the window's four corner points.
In order to achieve the purpose, the technical scheme of the invention is as follows:
A deep-learning-based method for accurately positioning a vehicle window, comprising the following steps:
S1, acquiring a rough positioning frame of the front window of the vehicle in the first stage;
S11, selecting vehicle pictures as a sample group, and manually calibrating the coordinates of the four corner points (upper-left, upper-right, lower-left and lower-right) of the front window in each vehicle picture;
S12, storing each vehicle picture in correspondence with the corner coordinates of its front window to form a data set;
S13, inputting the data set into the first-stage deep convolutional network, a 23-layer neural network: five convolution operations are performed on each picture in the data set, the output feature map is batch-normalized after each convolution and then input into an activation function, a maximum pooling operation follows each of the first four convolutions, and after the fifth convolution the network splits into two branches; one branch continues with five further convolutions and one full convolution, while the other branch fuses, in the channel direction, the pre-branch feature map with the feature map obtained from the first branch's five convolutions; finally the two branches respectively perform one convolution and one full convolution to obtain the fused vehicle picture feature maps;
S14, inputting the vehicle picture feature map and the corresponding front window corner coordinates into a BOX regression layer, and regressing an approximate positioning frame of the front window after optimizing a loss function;
S2, acquiring the four accurate corner point coordinates of the front window of the vehicle in the second stage;
S21, enlarging the approximate front window positioning frame obtained in the first stage by a factor of 1.3 in both width and height to obtain an expanded candidate frame;
S22, cropping the picture region inside the expanded candidate frame from the vehicle picture to form a new picture;
S23, converting the coordinates of the four manually calibrated front window corner points into coordinates relative to the expanded candidate frame;
S24, inputting the cropped new picture into the second-stage deep convolutional network, extracting its feature map, and converting the feature map into a feature vector through the fully connected layer of the second-stage network;
and S25, inputting the feature vectors and the transformed relative coordinates into a linear regression layer, and optimizing a loss function and then performing regression to obtain four accurate corner point coordinates of the front window.
Further, the loss function of the BOX regression layer in step S14 adopts the smooth L1 loss, calculated as follows:

$$L_{loc}(x,l,g)=\sum_{i\in Pos}^{N}\sum_{m\in\{cx,cy,w,h\}} x_{ij}^{m}\,\mathrm{smooth}_{L1}\!\left(l_{i}^{m}-\hat{g}_{j}^{m}\right)$$

$$\mathrm{smooth}_{L1}(x)=\begin{cases}0.5x^{2} & \text{if } \lvert x\rvert<1\\ \lvert x\rvert-0.5 & \text{otherwise}\end{cases}$$

wherein $x_{ij}^{m}$ is an indicator parameter: when its value is 1, the i-th default box matches the j-th ground truth box; N is the number of candidate frames; m is a bounding-box position parameter, where cx and cy denote the x and y coordinates of the bounding-box center, w its width and h its height; $l_{i}^{m}$ is the predicted position of the bounding box corresponding to the default box, and $\hat{g}_{j}^{m}$ is the corresponding ground truth box position parameter value.
Further, in step S24, five convolution operations are performed on the cropped new picture; the activation function after each convolution is a parametric rectified linear unit; each of the first four convolutions is followed by a max pooling layer; and after the fifth convolution a fully connected layer integrates the extracted feature maps into a single feature vector.
Further, the loss function of the linear regression layer in step S25 adopts an L2 norm loss, calculated as follows:

$$L(\theta)=\sum_{i}\sum_{j=1}^{4}\left[\left(\frac{x_{ij}-\hat{x}_{ij}}{w}\right)^{2}+\left(\frac{y_{ij}-\hat{y}_{ij}}{h}\right)^{2}\right]$$

wherein $\theta$ is the weight of the second-stage deep convolutional network, i indexes the samples of each batch, j indexes the 4 corner points of the front window in each vehicle picture, $x_{ij}$ and $y_{ij}$ are the coordinates of the corner point to be regressed, $\hat{x}_{ij}$ and $\hat{y}_{ij}$ are the ground truth corner coordinates, and w and h are the width and height of the cropped new picture.
Compared with the prior art, the invention has the advantages and positive effects that:
the invention provides a detection method for regression of a rough positioning frame of a vehicle window and further regression of precise corner coordinates of the vehicle window by utilizing a deep convolutional network, which is carried out in two stages, wherein in the first stage, a 23-layer neural network is used for extracting multi-level and multi-scale characteristics of a vehicle picture, the extracted characteristics are applied to a BOX regression algorithm to obtain the rough positioning frame of the vehicle window, and in the second stage, a 6-layer convolutional neural network is used for carrying out linear regression on four corner coordinates of the vehicle window in the rough positioning frame of the vehicle window, so that the vehicle window can be precisely positioned.
By using two neural networks in stages to obtain the accurate window corner coordinates, the invention greatly improves the precision and accuracy of window positioning; the design combining improved small deep convolutional networks with regression algorithms effectively increases the computation speed of the neural networks; and because the window is positioned through the coordinates of its four corner points, the position of the trapezoidal window is obtained accurately, eliminating the large amount of edge interference produced by a rectangular positioning frame and allowing the vehicle interior information to be obtained more accurately.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a diagram of a training model framework at a first stage;
FIG. 2 is a diagram of a first stage deep convolutional network architecture;
FIG. 3 is a characteristic diagram of a BOX regression layer;
FIG. 4 is a model framework diagram of a BOX regression layer;
FIG. 5 is a diagram of a training model framework for the second phase.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived from the embodiments of the present invention by a person skilled in the art without any creative effort, should be included in the protection scope of the present invention.
As shown in figs. 1 to 5, the present invention provides a detection method that uses deep convolutional networks to regress a rough window positioning frame and then the precise window corner coordinates. The method is performed in two stages. In the first stage, a 23-layer neural network extracts multi-level, multi-scale features from the vehicle picture, and the extracted features are applied to a BOX regression algorithm to obtain the rough window positioning frame. In the second stage, a 6-layer convolutional neural network performs linear regression on the four window corner coordinates inside the rough positioning frame, so that the window can be precisely positioned.
The training model framework of the invention at the stage of obtaining the approximate positioning frame of the car window is shown in figure 1.
In the first stage, two data sets are input in batches to train the network model: one is a picture set containing vehicle pictures of different colors, orientations, types, and sizes from real monitoring camera scenes; the other holds the coordinates of the ground truth window bounding box for each vehicle picture. The deep convolutional network then extracts multi-level, multi-scale features from the input pictures; these features are input into the BOX regression layer, where default boxes are matched against the labeled ground truth boxes and the matched default boxes are used to regress the approximate coordinates of the window positioning frame.
The deep convolutional network at this stage differs from a common classification network: it is a structure modified from YOLOv3-tiny, shown in fig. 2.
As shown in fig. 2, five convolution operations are first performed on the input picture; after each convolution the output feature map is batch-normalized and then input into an activation function, and each of the first four convolutions is followed by a maximum pooling operation, so the height and width of the feature map are halved after each of these convolutions. After the fifth convolution, the network splits into two branches, and each branch performs a different number of convolution operations; the purpose of the split is to let the network extract feature information at different scales. One branch continues with five convolutions and one full convolution; the other fuses, in the channel direction, the pre-branch feature map with the feature map obtained after the first branch's five convolutions, the first branch being upsampled beforehand so that the two feature maps match in height and width. Finally, each branch performs one convolution and one full convolution, and the results are input to the BOX regression layer.
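To make the structure concrete, the following is a minimal PyTorch sketch of such a two-branch backbone, written under stated assumptions: the kernel sizes and channel widths are illustrative choices, since the text fixes only the overall pattern (five convolution + batch norm + activation blocks with pooling after the first four, a split, upsampling, and channel-wise fusion) and the two output scales.

import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out):
    # One convolution followed by batch normalization and an activation,
    # as described for each convolution in the first-stage network.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class TwoBranchBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        # Five convolutions; each of the first four is followed by 2x max
        # pooling, halving the feature map height and width.
        chans = [3, 16, 32, 64, 128, 256]
        stem = []
        for i in range(5):
            stem.append(conv_bn_act(chans[i], chans[i + 1]))
            if i < 4:
                stem.append(nn.MaxPool2d(2))
        self.stem = nn.Sequential(*stem)
        # First branch: further convolutions producing a coarser-scale map.
        self.branch_a = nn.Sequential(
            nn.MaxPool2d(2),
            conv_bn_act(256, 512),
            conv_bn_act(512, 256),
        )
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        # Second branch: fuse the pre-branch map with the upsampled
        # branch-A map along the channel dimension.
        self.fuse = conv_bn_act(256 + 256, 256)

    def forward(self, x):
        pre = self.stem(x)              # e.g. 20 x 20 for a 320 x 320 input
        a = self.branch_a(pre)          # e.g. 10 x 10, coarser scale
        b = self.fuse(torch.cat([pre, self.up(a)], dim=1))
        return a, b                     # two scales for the BOX regression layer

With a 320 × 320 input, the two returned maps are 10 × 10 and 20 × 20, matching the two feature map sizes used for default box matching below.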
As shown in fig. 3, default boxes with different aspect ratios (shown by the blue and red dotted lines in the figure) are preset on feature maps of different scales in the BOX regression layer; the default boxes are then matched against the ground truth box according to their IOU, position regression is performed on the matched default boxes, and the best approximate window positioning frame is selected by non-maximum suppression. In practice, our network generates two feature maps, 10 × 10 and 20 × 20, and sliding windows over these two maps are used to match default boxes to the ground truth box.
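As one way of realizing this presetting step, the short sketch below lays center-form default boxes over both feature maps; the aspect ratios and scales are assumptions for illustration, as the text does not specify them.

import itertools

def default_boxes(fmap_size, scale, aspect_ratios=(0.5, 1.0, 2.0)):
    # Center-form (cx, cy, w, h) boxes, normalized to [0, 1]; one set of
    # aspect ratios is anchored at the center of every feature map cell.
    boxes = []
    for i, j in itertools.product(range(fmap_size), repeat=2):
        cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size
        for ar in aspect_ratios:
            boxes.append((cx, cy, scale * ar ** 0.5, scale / ar ** 0.5))
    return boxes

# Hypothetical scales for the 20 x 20 (fine) and 10 x 10 (coarse) maps.
all_boxes = default_boxes(20, scale=0.2) + default_boxes(10, scale=0.4)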
As shown in FIG. 4, default boxes of different aspect ratios are matched to the input ground truth box by computing the best Jaccard overlap between each default box and the ground truth box; if the result exceeds the preset threshold, the match is considered successful and the default box is added to the list of boxes to be regressed. The Jaccard coefficient is calculated as follows:

$$J(A,B)=\frac{A\cap B}{A\cup B}$$

wherein A denotes the area covered by the default box and B the area covered by the ground truth box.
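A minimal sketch of this matching step is given below, assuming corner-form boxes; the 0.5 threshold is an assumption, as the text only speaks of a preset threshold.

def jaccard(a, b):
    # a, b: (x1, y1, x2, y2) boxes; returns intersection / union.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match(default_boxes, gt_box, thresh=0.5):
    # Keep every default box whose overlap with the ground truth box
    # exceeds the threshold; these form the list to be regressed.
    return [d for d in default_boxes if jaccard(d, gt_box) > thresh]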
The matched candidate boxes are then subjected to position regression to bring them closer to the ground truth box, with the smooth L1 loss selected as the regression loss function. The specific formula of the loss function is as follows:

$$L_{loc}(x,l,g)=\sum_{i\in Pos}^{N}\sum_{m\in\{cx,cy,w,h\}} x_{ij}^{m}\,\mathrm{smooth}_{L1}\!\left(l_{i}^{m}-\hat{g}_{j}^{m}\right)$$

$$\mathrm{smooth}_{L1}(x)=\begin{cases}0.5x^{2} & \text{if } \lvert x\rvert<1\\ \lvert x\rvert-0.5 & \text{otherwise}\end{cases}$$

wherein $x_{ij}^{m}$ is an indicator parameter: when its value is 1, the i-th default box matches the j-th ground truth box; N represents the number of candidate frames; m represents a bounding-box position parameter, where cx and cy represent the x and y coordinates of the bounding-box center, w its width and h its height; $l_{i}^{m}$ represents the predicted position of the bounding box corresponding to the default box; and $\hat{g}_{j}^{m}$ is the corresponding ground truth box position parameter value.
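A short sketch of this loss under stated assumptions: center-form (cx, cy, w, h) tensors and a boolean mask marking the matched default boxes.

import torch

def smooth_l1(x):
    # 0.5 * x^2 where |x| < 1, |x| - 0.5 elsewhere, applied elementwise.
    absx = x.abs()
    return torch.where(absx < 1, 0.5 * x ** 2, absx - 0.5)

def box_loss(pred, gt, matched):
    # pred, gt: (N, 4) predicted and ground-truth position parameters;
    # matched: (N,) bool mask of default boxes matched to the ground truth.
    return smooth_l1(pred[matched] - gt[matched]).sum()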
At this stage the loss function is computed during the network's forward pass, and during back-propagation the network weights are updated from the gradients of the samples; by continuously optimizing the loss function, the network learns to regress an approximate window positioning frame from the input picture.
The training model frame in the accurate positioning stage of the coordinates of the four corner points of the car window is shown in figure 5.
In the second stage, two data sets are likewise input in batches to train the network model: one consists of the window pictures produced by the previous stage, where the positioning frame is enlarged 1.3 times and the window region is cropped from the original picture as the input of this stage; the other holds the ground truth coordinates of the 4 window corner points for each picture. As before, a deep convolutional network extracts multi-level features from the input pictures; after the window-level features are extracted, they are input into a linear regression layer for the corner coordinates, and the model parameters are trained by continuously reducing the Euclidean distance between the regressed and real corner coordinates.
The deep convolutional network at this stage adopts a structure modified from ONet. The network first performs five convolution operations on the input picture, with a parametric rectified linear unit (PReLU) as the activation function after each convolution; each of the first four convolutions is followed by a max pooling layer, halving the height and width of the output feature map each time. Finally, a fully connected layer at the end of the convolutional layers integrates the feature map into a vector, which is input into the corner-coordinate linear regression layer, where the four accurate window corner coordinates are regressed.
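A minimal PyTorch sketch of such an ONet-style second-stage network follows. The five PReLU convolutions, pooling after the first four, the 256-dimensional feature vector, and the eight-value corner output follow the description; the channel widths, kernel sizes, and 48 × 48 input size are assumptions.

import torch.nn as nn

class CornerRegressor(nn.Module):
    def __init__(self, in_size=48):
        super().__init__()
        layers, c = [], 3
        for i, c_out in enumerate([32, 64, 64, 128, 128]):
            layers += [nn.Conv2d(c, c_out, 3, padding=1), nn.PReLU(c_out)]
            if i < 4:
                layers.append(nn.MaxPool2d(2))  # halves height and width
            c = c_out
        self.features = nn.Sequential(*layers)
        side = in_size // 16                          # four 2x poolings
        self.fc = nn.Linear(128 * side * side, 256)   # feature vector
        self.regress = nn.Linear(256, 8)              # 4 corners x (x, y)

    def forward(self, x):
        v = self.fc(self.features(x).flatten(1))
        return self.regress(v)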
The linear regression layer adopts an L2 norm loss function. In particular, to obtain more accurate coordinates, the x and y coordinates of both the to-be-regressed and the real corner points are divided by the width and the height of the input picture respectively, converting integer pixel coordinates into floating-point numbers, so that more accurate corner coordinates are obtained as the loss function is iteratively reduced. The loss function is specifically formulated as follows:

$$L(\theta)=\sum_{i}\sum_{j=1}^{4}\left[\left(\frac{x_{ij}-\hat{x}_{ij}}{w}\right)^{2}+\left(\frac{y_{ij}-\hat{y}_{ij}}{h}\right)^{2}\right]$$

wherein $\theta$ is the weight of the deep convolutional network, i indexes the samples of each batch, j indexes the 4 window corner points of each picture, $x_{ij}$ and $y_{ij}$ are the coordinates of the window corner point to be regressed, $\hat{x}_{ij}$ and $\hat{y}_{ij}$ are the ground truth corner coordinates, and w and h are the width and height of the input picture.
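A short sketch of this normalized L2 loss; the (B, 4, 2) tensor layout is an assumption.

import torch

def corner_loss(pred, gt, w, h):
    # pred, gt: (B, 4, 2) tensors of (x, y) corner coordinates in pixels.
    # x is divided by the picture width and y by its height, per the text.
    scale = pred.new_tensor([float(w), float(h)])  # broadcasts over corners
    return (((pred - gt) / scale) ** 2).sum()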
During the forward pass the network estimates the window corner coordinates and computes the loss function; during back-propagation it computes the gradient of the loss and keeps updating the network weights, so the loss keeps shrinking and the estimated corner coordinates keep approaching the real corner coordinates, yielding accurate window corner coordinates.
In summary, the first stage provides a method for obtaining the rough regression frame of the vehicle window, specifically comprising the following steps:
(1) selecting a sample group, and manually calibrating a front vehicle window;
(2) storing each data picture in correspondence with its window corner coordinates to form a data set;
(3) dividing a data set into a training set and a test set;
(4) extracting multilevel and multiscale characteristics of the vehicle picture by using the deep convolutional network designed at the stage;
(5) inputting the feature map obtained from the deep convolutional network into the BOX regression layer to regress the approximate window positioning frame.
In the second stage, the invention provides a method for acquiring the accurate coordinates of the four window corner points, specifically comprising the following steps:
(6) enlarging the vehicle window approximate positioning frame obtained in the first stage by 1.3 times in the width and height directions to obtain a candidate frame;
(7) cropping the candidate frame region from the original picture to form a new picture as the input of this stage;
(8) converting the coordinates of the four manually calibrated front window corner points into coordinates relative to the candidate frame (see the sketch after this list);
(9) extracting features of the input picture with the deep convolutional network designed for this stage, and converting the feature map into a 256-dimensional feature vector through the fully connected layer Fc;
(10) inputting the feature vector obtained from the deep convolutional network into the window corner coordinate linear regression layer, with the designed L2 norm loss function as the optimization target; in the actual test stage, the eight coordinate values of the 4 window corner points are output simply by extracting the feature vector of the input picture according to the above steps and performing linear regression, thereby obtaining the accurate position of the window.
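As a sketch of steps (6) to (8), the helpers below expand the stage-one frame by 1.3 about its center and shift the labeled corners into the crop's coordinate frame; the (x1, y1, x2, y2) box format is an assumption.

def expand_box(box, factor=1.3):
    # Grow a (x1, y1, x2, y2) box by the given factor about its center.
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * factor, (y2 - y1) * factor
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def to_relative(corners, box):
    # Express each labeled (x, y) corner relative to the crop's origin.
    x1, y1 = box[0], box[1]
    return [(x - x1, y - y1) for x, y in corners]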

Claims (4)

1. A deep-learning-based method for accurately positioning a vehicle window, characterized in that the method comprises the following steps:
S1, acquiring a rough positioning frame of the front window of the vehicle in the first stage;
S11, selecting vehicle pictures as a sample group, and manually calibrating the coordinates of the four corner points (upper-left, upper-right, lower-left and lower-right) of the front window in each vehicle picture;
S12, storing each vehicle picture in correspondence with the corner coordinates of its front window to form a data set;
S13, inputting the data set into the first-stage deep convolutional network, a 23-layer neural network: five convolution (Conv) operations are performed on each picture in the image set, the output feature map is batch-normalized (Batch norm) after each convolution and then input into a ReLU activation function, a maximum pooling (Maxpool) operation follows each of the first four convolutions starting from Conv1, and after the fifth convolution Conv5 the network splits into two branches; one branch continues with five further convolution operations starting from Conv6 and one full convolution Conv12, while the other branch applies a convolution Conv13 that fuses, in the channel direction, the pre-branch feature map with the upsampled (upsample) feature map obtained after the first branch's Conv12 operation; finally the two branches respectively perform a convolution Conv11 and a full convolution Conv15 to obtain the fused vehicle picture feature maps;
S14, inputting the vehicle picture feature map and the corresponding front window corner coordinates into a frame regression layer (BOX regression layer), and regressing an approximate positioning frame of the front window after optimizing a loss function;
S2, acquiring the four accurate corner point coordinates of the front window of the vehicle in the second stage;
S21, enlarging the approximate front window positioning frame obtained in the first stage by a factor of 1.3 in both width and height to obtain an expanded candidate frame (default box);
S22, cropping the picture region inside the expanded candidate frame from the vehicle picture to form a new picture;
S23, converting the coordinates of the four front window corner points in the manually calibrated annotation set (Annotation set) into coordinates relative to the expanded candidate frame;
S24, inputting the cropped new picture into the second-stage deep convolutional network, extracting its feature map, and converting the feature map into a feature vector through the fully connected layer (FC) of the second-stage network;
and S25, inputting the feature vectors and the transformed relative coordinates into a linear regression layer, and optimizing a loss function and then performing regression to obtain four accurate corner point coordinates of the front window.
2. The deep-learning-based method for accurately positioning a vehicle window of claim 1, wherein the loss function of the BOX regression layer in step S14 is the smooth L1 loss, calculated as follows:

$$L_{loc}(x,l,g)=\sum_{i\in Pos}^{N}\sum_{m\in\{cx,cy,w,h\}} x_{ij}^{m}\,\mathrm{smooth}_{L1}\!\left(l_{i}^{m}-\hat{g}_{j}^{m}\right)$$

$$\mathrm{smooth}_{L1}(x)=\begin{cases}0.5x^{2} & \text{if } \lvert x\rvert<1\\ \lvert x\rvert-0.5 & \text{otherwise}\end{cases}$$

wherein $x_{ij}^{m}$ is an indicator parameter: when its value is 1, the i-th default box matches the j-th labeled box (ground truth box); N is the number of candidate frames; m is a bounding-box position parameter, where cx and cy denote the x and y coordinates of the bounding-box center, w its width and h its height; $l_{i}^{m}$ is the predicted position of the bounding box corresponding to the default box, and $\hat{g}_{j}^{m}$ is the corresponding ground truth box position parameter value.
3. The deep-learning-based method for accurately positioning a vehicle window of claim 2, wherein in step S24 five convolution operations are performed on the cropped new picture, the activation function after each convolution is a parametric rectified linear unit, each of the first four convolutions is followed by a max pooling layer, and after the fifth convolution a fully connected layer integrates the extracted feature maps into a single feature vector.
4. The deep-learning-based method for accurately positioning a vehicle window of claim 3, wherein the loss function of the linear regression layer in step S25 is an L2 norm loss, calculated as follows:

$$L(\theta)=\sum_{i}\sum_{j=1}^{4}\left[\left(\frac{x_{ij}-\hat{x}_{ij}}{w}\right)^{2}+\left(\frac{y_{ij}-\hat{y}_{ij}}{h}\right)^{2}\right]$$

wherein $\theta$ is the weight of the second-stage deep convolutional network, i indexes the samples of each batch, j indexes the 4 corner points of the front window in each vehicle picture, $x_{ij}$ and $y_{ij}$ are the coordinates of the corner point to be regressed, $\hat{x}_{ij}$ and $\hat{y}_{ij}$ are the corresponding ground truth box corner coordinates, and w and h are the width and height of the cropped new picture.
CN201911089593.2A 2019-11-08 2019-11-08 Vehicle window accurate positioning method based on deep learning Active CN111062384B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911089593.2A CN111062384B (en) 2019-11-08 2019-11-08 Vehicle window accurate positioning method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911089593.2A CN111062384B (en) 2019-11-08 2019-11-08 Vehicle window accurate positioning method based on deep learning

Publications (2)

Publication Number Publication Date
CN111062384A true CN111062384A (en) 2020-04-24
CN111062384B CN111062384B (en) 2023-09-08

Family

ID=70298546

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911089593.2A Active CN111062384B (en) 2019-11-08 2019-11-08 Vehicle window accurate positioning method based on deep learning

Country Status (1)

Country Link
CN (1) CN111062384B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914612A (en) * 2020-05-21 2020-11-10 淮阴工学院 Construction graph primitive self-adaptive identification method based on improved convolutional neural network
CN112270278A (en) * 2020-11-02 2021-01-26 重庆邮电大学 Key point-based blue top house detection method

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107798335A (en) * 2017-08-28 2018-03-13 浙江工业大学 A kind of automobile logo identification method for merging sliding window and Faster R CNN convolutional neural networks
CN108256464A (en) * 2018-01-12 2018-07-06 适普远景遥感信息技术(北京)有限公司 High-resolution remote sensing image urban road extracting method based on deep learning
CN108428248A (en) * 2018-03-14 2018-08-21 苏州科达科技股份有限公司 Vehicle window localization method, system, equipment and storage medium
CN108764244A (en) * 2018-04-02 2018-11-06 华南理工大学 Potential target method for detecting area based on convolutional neural networks and condition random field
CN109740405A (en) * 2018-07-06 2019-05-10 博云视觉(北京)科技有限公司 A kind of non-alignment similar vehicle front window different information detection method
CN109902677A (en) * 2019-01-30 2019-06-18 深圳北斗通信科技有限公司 A kind of vehicle checking method based on deep learning
CN110096962A (en) * 2019-04-04 2019-08-06 苏州千视通视觉科技股份有限公司 Vehicle Detail based on region convolutional network identifies secondary structure method and device
CN110322522A (en) * 2019-07-11 2019-10-11 山东领能电子科技有限公司 A kind of vehicle color identification method based on the interception of target identification region
US20210142097A1 (en) * 2017-06-16 2021-05-13 Markable, Inc. Image processing system

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210142097A1 (en) * 2017-06-16 2021-05-13 Markable, Inc. Image processing system
CN107798335A (en) * 2017-08-28 2018-03-13 浙江工业大学 A kind of automobile logo identification method for merging sliding window and Faster R CNN convolutional neural networks
CN108256464A (en) * 2018-01-12 2018-07-06 适普远景遥感信息技术(北京)有限公司 High-resolution remote sensing image urban road extracting method based on deep learning
CN108428248A (en) * 2018-03-14 2018-08-21 苏州科达科技股份有限公司 Vehicle window localization method, system, equipment and storage medium
CN108764244A (en) * 2018-04-02 2018-11-06 华南理工大学 Potential target method for detecting area based on convolutional neural networks and condition random field
CN109740405A (en) * 2018-07-06 2019-05-10 博云视觉(北京)科技有限公司 A kind of non-alignment similar vehicle front window different information detection method
CN109902677A (en) * 2019-01-30 2019-06-18 深圳北斗通信科技有限公司 A kind of vehicle checking method based on deep learning
CN110096962A (en) * 2019-04-04 2019-08-06 苏州千视通视觉科技股份有限公司 Vehicle Detail based on region convolutional network identifies secondary structure method and device
CN110322522A (en) * 2019-07-11 2019-10-11 山东领能电子科技有限公司 A kind of vehicle color identification method based on the interception of target identification region

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
曲宝珠; 曹国; 刘宇; 周丽存: "Positioning algorithm for the front vehicle window in checkpoint images based on multi-feature integration" (基于多特征集成的卡口图像前车窗的定位算法), 信息技术 (Information Technology), no. 12

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914612A (en) * 2020-05-21 2020-11-10 淮阴工学院 Construction graph primitive self-adaptive identification method based on improved convolutional neural network
CN111914612B (en) * 2020-05-21 2024-03-01 淮阴工学院 Construction graphic primitive self-adaptive identification method based on improved convolutional neural network
CN112270278A (en) * 2020-11-02 2021-01-26 重庆邮电大学 Key point-based blue top house detection method

Also Published As

Publication number Publication date
CN111062384B (en) 2023-09-08

Similar Documents

Publication Publication Date Title
CN110276767A (en) Image processing method and device, electronic equipment, computer readable storage medium
CN111462128B (en) Pixel-level image segmentation system and method based on multi-mode spectrum image
CN108154149B (en) License plate recognition method based on deep learning network sharing
CN116681636B (en) Light infrared and visible light image fusion method based on convolutional neural network
CN111461036B (en) Real-time pedestrian detection method using background modeling to enhance data
CN117253154B (en) Container weak and small serial number target detection and identification method based on deep learning
CN112819858B (en) Target tracking method, device, equipment and storage medium based on video enhancement
CN116309607B (en) Ship type intelligent water rescue platform based on machine vision
CN107808140B (en) Monocular vision road recognition algorithm based on image fusion
CN111768415A (en) Image instance segmentation method without quantization pooling
CN112561899A (en) Electric power inspection image identification method
CN113159043A (en) Feature point matching method and system based on semantic information
CN114140672A (en) Target detection network system and method applied to multi-sensor data fusion in rainy and snowy weather scene
Zhou et al. Adapting semantic segmentation models for changes in illumination and camera perspective
CN116757988B (en) Infrared and visible light image fusion method based on semantic enrichment and segmentation tasks
CN111062384A (en) Vehicle window accurate positioning method based on deep learning
CN111488766A (en) Target detection method and device
CN117409244A (en) SCKConv multi-scale feature fusion enhanced low-illumination small target detection method
CN111832508B (en) DIE _ GA-based low-illumination target detection method
CN113627481A (en) Multi-model combined unmanned aerial vehicle garbage classification method for smart gardens
CN117372829A (en) Marine vessel target identification method, device, electronic equipment and readable medium
CN116721398A (en) Yolov5 target detection method based on cross-stage route attention module and residual information fusion module
CN111738964A (en) Image data enhancement method based on modeling
CN113537397B (en) Target detection and image definition joint learning method based on multi-scale feature fusion
CN112446292B (en) 2D image salient object detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant