CN110263679B - Fine-grained vehicle detection method based on deep neural network - Google Patents

Fine-grained vehicle detection method based on deep neural network

Info

Publication number
CN110263679B
CN110263679B
Authority
CN
China
Prior art keywords
vehicle
alpha
output
detection
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910476604.6A
Other languages
Chinese (zh)
Other versions
CN110263679A (en)
Inventor
袁泽剑
罗芳颖
刘芮金
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN201910476604.6A priority Critical patent/CN110263679B/en
Publication of CN110263679A publication Critical patent/CN110263679A/en
Application granted granted Critical
Publication of CN110263679B publication Critical patent/CN110263679B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/584 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of vehicle lights or traffic lights
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08 Detecting or categorising vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a fine-grained vehicle detection method based on a deep neural network that accurately detects the specific pose category and contour of a vehicle by defining the output, the detection network, and the training network. Given prior knowledge such as the ground plane and camera calibration information, the detection result can be used to estimate the drivable area, time to collision, and so on, further assisting the driver and safeguarding safe driving. Compared with a general object detection network, the method outputs more information and can meet different application requirements: it outputs the pose category and contour position of the vehicle, which helps to judge more accurately the position and heading of the vehicle on the road. The method also places low demands on the data-acquisition sensor, which benefits production and use: all computation is performed on ordinary RGB images, no depth sensor, radar, or similar equipment is needed, and a single ordinary camera suffices, at low cost.

Description

Fine-grained vehicle detection method based on deep neural network
[ technical field ]
The invention relates to a fine-grained vehicle detection method based on a deep neural network.
[ background of the invention ]
Vehicle detection is an important task in autonomous driving and driver-assistance systems; it can be used to compute collision distance and time to collision and to safeguard driving safety. A general object detection task yields only a coarse rectangular-box result: the rectangular box cannot distinguish the position of each face of the vehicle, cannot support accurate analysis of the passable area beside the vehicle, and is insensitive to changes in vehicle pose. Fine-grained vehicle detection therefore requires detecting the exact contour of the vehicle and distinguishing its sides from its front and rear.
Two approaches are mainly used for contour detection. One performs instance segmentation in ordinary RGB images based on segmentation proposals or per-pixel classification; the other captures images with depth information and performs object detection in RGB-D images. Although instance segmentation can recover the vehicle contour, it cannot distinguish the different faces of the vehicle and is slow to compute. 3D detection methods can obtain the vehicle position accurately, but require a 3D sensor to acquire depth information, which makes data acquisition expensive.
With continuing algorithmic improvements, neural networks that complete object detection in a single stage approach or meet the requirements of real-time detection. Completing vehicle contour detection directly within this framework avoids additional computation or acquisition cost and suits autonomous driving and driver-assistance systems.
[ summary of the invention ]
The invention aims to overcome the defects of the prior art and provide a fine-grained vehicle detection method based on a deep neural network.
To achieve this purpose, the invention adopts the following technical scheme:
a fine-grained vehicle detection method based on a deep neural network comprises the following steps:
step 1: defining an output
Given a rectangle (v, x, y, w, h), where v indicates positive and negative samples, v ∈ {0, 1}, 0 indicates background, and 1 indicates vehicle; x, y, w and h represent the position, width, and height of the rectangular box; on this basis, the output is extended with pose subclass codes and control points;
step 2: the detection network
Let (w_f, h_f, c_f) denote the width, height, and number of channels of the feature layer at scale f; if V, A, P are the numbers of categories v, a, p, respectively, then convolving a feature layer at scale f produces an output of dimension (w_f, h_f, B_f × (V + A + P + 4 + 3)); the detection result contains the information (v, a, p, x, y, w, h, α, β, γ), where B_f is the number of default boxes generated at each location;
in the detection process, the detector predicts a conditional probability at each node of the hierarchy, and the conditional probabilities from the root node down to a node are multiplied to obtain the joint probability; if the joint probability at a node falls below a chosen threshold, the downward traversal stops; the final category and the geometric shape of the vehicle are then predicted;
step 3: training the network
Let $x_{ij}^{d} \in \{0, 1\}$ be an indicator function of whether the i-th default box matches the j-th ground-truth box of category d; after matching against the ground-truth labels, N matched default boxes are obtained; the overall loss function is the sum of the classification and localization losses:

$$L = \frac{1}{N}\left(L_{cls} + L_{loc}\right)$$
The invention is further improved as follows:
the specific method of step 1 is as follows:
1-1) attitude coding
Two subclasses are used to encode the 9 imaged 2D vehicle poses p ∈ {p_1, ..., p_9}; the 9 poses are the 9 leaf nodes; the first subclass a ∈ {R, F, N} indicates which faces of the vehicle are visible, where R indicates the rear is visible, F indicates the front is visible, and N indicates that neither the front nor the rear is visible and only a side is visible; the other subclass is s, which represents the spatial configuration and determines whether the side is to the left or right of the front or rear; for a = N, s denotes the direction of the side; s ∈ {l, r, n}, where l denotes the left side, r denotes the right side, and n denotes that the target vehicle is directly ahead and only one rectangular face is visible; 9 different poses can be encoded from the values of a and s;
1-2) control points
3 virtual control points are defined on the basis of the rectangular box (x, y, w, h) to form the contour boundary of each visible face of the vehicle; α denotes the position of the boundary between two faces, and β, γ define the position of the upper base of the trapezoid; if s = l, then β, γ are defined on the leftmost border; across the 9 2D poses, black dots indicate the control points required in each pose: when s = n no control points are required, and when a = N and s = l or s = r only the two control points β and γ are required;
the output is defined as (v, a, s, x, y, w, h, α, β, γ), the output result of the third layer, i.e., the result of the leaf node, can be directly denoted by p, and thus, the output can also be defined as (v, a, p, x, y, w, h, α, β, γ);
1-3) hierarchical Structure
A hierarchical output structure is adopted, and the detection result is output in 3 layers: the first layer outputs whether the target is a vehicle, namely category v; the second layer outputs the visible-face information of the vehicle, namely category a; and the third layer outputs the precise pose category p.
The specific method of step 3 is as follows:
3-1) network classification
The loss function of the classification task is as follows:

$$L_{cls} = -\sum_{i \in Pos} \sum_{d} x_{ij}^{d} \log\left(\hat{c}_{i}^{d}\right) \;-\; \sum_{i \in Neg} \log\left(\hat{c}_{i}^{0}\right)$$

where $\hat{c}_{i}^{d}$ denotes the confidence of category d after softmax, computed as:

$$\hat{c}_{i}^{d} = \frac{\exp\left(c_{i}^{d}\right)}{\sum_{d'} \exp\left(c_{i}^{d'}\right)}$$
3-2) control Point regression
Let (α_x, α_y) denote the coordinates of control point α; control points β and γ are defined similarly. Owing to the geometric constraint that α, β and γ all lie on the boundary of the rectangular box, the only values that need to be regressed are the position of the rectangular box together with α_x, β_y and γ_y. The deviations relative to the default box are defined as:

$$\hat{\alpha}_{x} = \frac{\alpha_{x}^{g} - cx}{w}, \qquad \hat{\beta}_{y} = \frac{\beta_{y}^{g} - cy}{h}, \qquad \hat{\gamma}_{y} = \frac{\gamma_{y}^{g} - cy}{h}$$

where cx, cy denote the center coordinates of the default box; w, h denote the width and height of the default box; $\alpha_{x}^{g}$ denotes the ground-truth x coordinate of control point α and $\alpha_{x}$ its predicted value; $\beta_{y}^{g}$, $\beta_{y}$, $\gamma_{y}^{g}$ and $\gamma_{y}$ are defined analogously;
the loss of the localization task is as follows:

$$L_{loc} = L_{box} + \sum_{i \in Pos}\; \sum_{t \in \{\alpha_{x},\, \beta_{y},\, \gamma_{y}\}} \bar{x}_{i}^{t} \, L\!\left(t_{i} - \hat{t}_{i}\right)$$

where $L_{box}$ denotes the rectangular-box regression loss used in object detection; L denotes the robust loss function smooth L1; and $\bar{x}_{i}^{t}$ is an indicator function specifying whether the i-th default box contributes to coordinate t: when the ground-truth pose matched to a default box does not contain the control point α, that box does not contribute to the regression of α_x, and likewise for β_y and γ_y.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a deep neural network method for detecting fine-grained vehicles, which can accurately detect the specific attitude category and contour of the vehicles. When the priori knowledge such as the ground plane, camera calibration information and the like is given, the detection result can be used for estimating a travelable area, collision time and the like, and further assisting and guaranteeing the safe driving of a driver. The invention has three advantages:
compared with a common target detection network, the method can output more information and can meet different application requirements. The invention outputs the attitude category and the outline position information of the vehicle, and the information is beneficial to more accurately judging the position and the driving direction of the vehicle in the road.
Secondly, the calculated amount is small, and the time efficiency is high. The method provided by the invention is expanded from a frame for completing target detection in one step which is close to or meets the real-time detection requirement, compared with the target detection method, the method provided by the invention does not generate additional characteristics, does not increase the number of candidate windows, only expands the output channel of the detector, hardly increases the calculated amount, and has the same detection efficiency as the original target detection algorithm.
Thirdly, the requirement on a sensor for collecting data is low, and the production and the use are facilitated. The calculation of the invention is completed in the common RGB image, no equipment such as a depth sensor or a radar is needed, only one common camera is needed to meet the requirement, and the cost is low.
[ description of the drawings ]
FIG. 1 is an overall network framework;
FIG. 2 is an example of a vehicle contour detection effect;
FIG. 3 is a pose encoding and control point definition;
FIG. 4 is a detection network structure;
FIG. 5 is a demonstration of control point regression;
FIG. 6 shows a part of the actual test results.
[ detailed description ]
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments, and are not intended to limit the scope of the present disclosure. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Various structural schematics according to the disclosed embodiments of the invention are shown in the drawings. The figures are not drawn to scale, wherein certain details are exaggerated and possibly omitted for clarity of presentation. The shapes of various regions, layers and their relative sizes and positional relationships shown in the drawings are merely exemplary, and deviations may occur in practice due to manufacturing tolerances or technical limitations, and a person skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions, according to actual needs.
In the context of the present disclosure, when a layer/element is referred to as being "on" another layer/element, it can be directly on the other layer/element or intervening layers/elements may be present. In addition, if a layer/element is "on" another layer/element in one orientation, then that layer/element may be "under" the other layer/element when the orientation is reversed.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The invention is described in further detail below with reference to the accompanying drawings:
Referring to fig. 2, depending on the characteristics of images taken by a forward-facing camera, each visible face of the vehicle can be located with a rectangle or a trapezoid. Compared with the original rectangular box (dashed box), fig. 2 shows that this representation accurately depicts the front, rear and sides of the vehicle. Since the combinations of visible faces of a vehicle are limited, the present invention proposes a 2D vehicle pose encoding strategy to represent this output.
Definition of output
The output representation proposed by the invention is expanded on the basis of a rectangular frame. Given a rectangle (v, x, y, w, h), where v indicates positive and negative samples, v ∈ {0,1}, 0 indicates background, and 1 indicates vehicle. And x, y, w and h represent the position and width and height of the rectangular frame, and on the basis, the output is expanded from two parts of posture subclass coding and control points.
① Pose coding
The invention uses two subclasses to encode the 9 imaged 2D vehicle poses p ∈ {p_1, ..., p_9}; as shown in fig. 3(a), the 9 poses are the 9 leaf nodes of the tree. The first subclass a ∈ {R, F, N} indicates which faces of the vehicle are visible, where R indicates the rear is visible, F indicates the front is visible, and N indicates that neither the front nor the rear is visible and only a side is visible. The other subclass is s, which represents the spatial configuration and determines whether the side is to the left or right of the front (or rear). For a = N, s denotes the direction of the side. s ∈ {l, r, n}, where l denotes the left side, r denotes the right side, and n denotes that the target vehicle is almost directly ahead and only one rectangular face is visible. The 9 different poses can be encoded from the values of a and s.
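For illustration, the short sketch below enumerates the nine (a, s) combinations and maps each to a leaf label p1..p9. The concrete index assignment is an assumption; the patent only fixes the set of combinations and the tree of fig. 3(a).

```python
from itertools import product

# Pose sub-classes as defined in the text:
#   a in {R, F, N}: rear visible, front visible, or neither (only a side visible)
#   s in {l, r, n}: side to the left, side to the right, or target directly ahead
FACES = ("R", "F", "N")
SIDES = ("l", "r", "n")

# The patent states that the 9 (a, s) combinations correspond to the 9 leaf
# nodes p1..p9 of Fig. 3(a); the concrete index assignment below is assumed.
POSE_INDEX = {pair: i + 1 for i, pair in enumerate(product(FACES, SIDES))}

def encode_pose(a: str, s: str) -> str:
    """Return the leaf-node pose label p1..p9 for the sub-class pair (a, s)."""
    return f"p{POSE_INDEX[(a, s)]}"

if __name__ == "__main__":
    print(encode_pose("R", "r"))  # rear and right side visible
    print(encode_pose("N", "l"))  # only the left side visible
```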
② Control points
The invention defines 3 virtual control points on the basis of the rectangular box (x, y, w, h) to form the contour boundary of each visible face of the vehicle. α denotes the position of the boundary between the two faces, and β, γ define the position of the upper base of the trapezoid. The definitions of α, β, γ in the a = R, s = r pose are given in fig. 3(b). If s = l, then β, γ are defined on the leftmost boundary. Not all control points are required for each of the 9 2D poses; in fig. 3(a), the required control points are indicated by black dots for each pose: when s = n no control point is required, and when a = N and s = l or s = r only the two control points β and γ are required.
Finally, the output is defined as (v, a, s, x, y, w, h, α, β, γ), and if the result is output according to the tree structure shown in fig. 3(a), the output result of the third layer, i.e., the result of the leaf node, can be directly denoted by p, and thus, the output can also be defined as (v, a, p, x, y, w, h, α, β, γ).
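To make the geometric meaning of this output tuple concrete, the following sketch decodes (x, y, w, h, α, β, γ) into the two visible faces for the pose a = R, s = r shown in fig. 3(b). The vertex assignment used here (α splitting the box vertically into a rear rectangle and a side trapezoid, β and γ marking the upper base of that trapezoid on the rightmost edge) is our reading of the figure description and should be treated as an assumption.

```python
def decode_contour(x, y, w, h, alpha_x, beta_y, gamma_y):
    """Decode control points into the two visible faces for the pose a = R, s = r.

    Assumed geometry (our reading of Fig. 3(b), not stated verbatim in the text):
      - alpha_x: x coordinate of the vertical boundary between rear face and side face;
      - beta_y, gamma_y: y coordinates of the upper base of the side-face trapezoid,
        located on the rightmost edge of the bounding box.
    Returns (rear_face, side_face) as lists of (x, y) vertices.
    """
    x2, y2 = x + w, y + h
    rear_face = [(x, y), (alpha_x, y), (alpha_x, y2), (x, y2)]                # rectangle
    side_face = [(alpha_x, y), (x2, beta_y), (x2, gamma_y), (alpha_x, y2)]    # trapezoid
    return rear_face, side_face

if __name__ == "__main__":
    rear, side = decode_contour(x=100, y=50, w=120, h=80,
                                alpha_x=170, beta_y=60, gamma_y=120)
    print("rear face:", rear)
    print("side face:", side)
```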
③ Use of the hierarchical structure
Computing the confidences of the 9 poses and the background directly with a flat structure is also feasible, but the tree structure is more advantageous when refining the classification. For example, if the detector detects the vehicle and its front, i.e. determines that a = F, but cannot further determine the specific 2D pose, the tree-structured detector still outputs a high-confidence rectangular box while not outputting a contour from the control points, which preserves the correctness of the output.
Detecting a network
Let (w_f, h_f, c_f) denote the width, height, and number of channels of the feature layer at scale f, and let V, A, P be the numbers of categories v, a, p, respectively. Convolving a feature layer at scale f then produces an output of dimension (w_f, h_f, B_f × (V + A + P + 4 + 3)); the detection result contains the information (v, a, p, x, y, w, h, α, β, γ), where B_f is the number of default boxes generated at each location, as shown schematically in fig. 4.
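As a quick check of this dimensionality, the helper below evaluates B_f × (V + A + P + 4 + 3) for one feature layer, using the category counts implied by the text (V = 2, A = 3, P = 9) as illustrative defaults.

```python
def head_output_shape(w_f: int, h_f: int, b_f: int,
                      num_v: int = 2, num_a: int = 3, num_p: int = 9):
    """Output shape of the convolutional detection head at one feature scale.

    For every default box the head predicts V + A + P class scores,
    4 rectangle offsets (x, y, w, h) and 3 control-point offsets (alpha, beta, gamma).
    """
    channels_per_box = num_v + num_a + num_p + 4 + 3
    return (w_f, h_f, b_f * channels_per_box)

if __name__ == "__main__":
    # e.g. a 19x19 feature map with 6 default boxes per location
    print(head_output_shape(19, 19, 6))  # -> (19, 19, 126)
```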
In the detection process, the detector predicts a conditional probability at each node of the hierarchy, and multiplying the conditional probabilities from the root node down to a node gives the joint probability. If the joint probability at a node falls below a chosen threshold, the downward traversal stops; the final category and the geometric shape of the vehicle are then predicted.
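A minimal sketch of this decision rule, with an illustrative 0.5 threshold and dictionary layout; it is not the patent's implementation, only the traversal logic described above.

```python
def hierarchical_predict(p_vehicle, p_a_given_vehicle, p_p_given_a, threshold=0.5):
    """Traverse the 3-level hierarchy v -> a -> p, multiplying conditional
    probabilities into a joint probability and stopping as soon as it falls
    below the threshold. Returns (category, joint probability).

    p_vehicle:         P(v = vehicle)
    p_a_given_vehicle: dict a -> P(a | vehicle), a in {"R", "F", "N"}
    p_p_given_a:       dict a -> dict p -> P(p | a)
    """
    if p_vehicle < threshold:
        return ("background", p_vehicle)

    best_a = max(p_a_given_vehicle, key=p_a_given_vehicle.get)
    joint_a = p_vehicle * p_a_given_vehicle[best_a]
    if joint_a < threshold:
        return ("vehicle", p_vehicle)      # stop at level 1: box only, no contour

    best_p = max(p_p_given_a[best_a], key=p_p_given_a[best_a].get)
    joint_p = joint_a * p_p_given_a[best_a][best_p]
    if joint_p < threshold:
        return (best_a, joint_a)           # stop at level 2: visible faces known

    return (best_p, joint_p)               # level 3: exact 2D pose

if __name__ == "__main__":
    print(hierarchical_predict(
        0.95,
        {"R": 0.7, "F": 0.2, "N": 0.1},
        {"R": {"p1": 0.8, "p2": 0.1, "p3": 0.1},
         "F": {"p4": 0.4, "p5": 0.3, "p6": 0.3},
         "N": {"p7": 0.5, "p8": 0.3, "p9": 0.2}}))
```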
Network training
Let $x_{ij}^{d} \in \{0, 1\}$ be an indicator function of whether the i-th default box matches the j-th ground-truth box of category d. After matching against the ground-truth labels, N matched default boxes are obtained. The overall loss function is the sum of the classification and localization losses:

$$L = \frac{1}{N}\left(L_{cls} + L_{loc}\right)$$
① Network classification
The loss function of the classification task is as follows:

$$L_{cls} = -\sum_{i \in Pos} \sum_{d} x_{ij}^{d} \log\left(\hat{c}_{i}^{d}\right) \;-\; \sum_{i \in Neg} \log\left(\hat{c}_{i}^{0}\right)$$

where $\hat{c}_{i}^{d}$ denotes the confidence of category d after softmax, computed as:

$$\hat{c}_{i}^{d} = \frac{\exp\left(c_{i}^{d}\right)}{\sum_{d'} \exp\left(c_{i}^{d'}\right)}$$
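The NumPy sketch below implements the classification term as reconstructed above: per-box softmax confidences, cross-entropy over matched (positive) boxes, and a background term over negatives. Since the patent does not spell out how the three category levels v, a, p share this loss, the sketch treats a single generic category axis.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; logits has shape (num_boxes, num_classes)."""
    z = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def classification_loss(logits, labels, positive_mask):
    """Cross-entropy over matched (positive) boxes plus a background term over negatives.

    logits:        (N, C) raw class scores per default box, class 0 = background
    labels:        (N,)   matched ground-truth class index per default box
    positive_mask: (N,)   boolean, True where the box matched a ground-truth box
    """
    conf = softmax(logits)
    eps = 1e-12
    pos = positive_mask
    neg = ~positive_mask
    loss_pos = -np.log(conf[pos, labels[pos]] + eps).sum()
    loss_neg = -np.log(conf[neg, 0] + eps).sum()
    return loss_pos + loss_neg

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(8, 13))          # 8 default boxes, 13 classes
    labels = rng.integers(1, 13, size=8)
    pos = np.zeros(8, dtype=bool)
    pos[:3] = True
    print(classification_loss(logits, labels, pos))
```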
② control point regression
Let (α_x, α_y) denote the coordinates of control point α; control points β and γ are defined similarly. Owing to the geometric constraint that α, β and γ all lie on the boundary of the rectangular box, the only values that need to be regressed are the position of the rectangular box together with α_x, β_y and γ_y. The relationship between the ground-truth box, the default box and the control points is shown in fig. 5. The deviations relative to the default box are defined as:

$$\hat{\alpha}_{x} = \frac{\alpha_{x}^{g} - cx}{w}, \qquad \hat{\beta}_{y} = \frac{\beta_{y}^{g} - cy}{h}, \qquad \hat{\gamma}_{y} = \frac{\gamma_{y}^{g} - cy}{h}$$

where cx, cy denote the center coordinates of the default box; w, h denote the width and height of the default box; $\alpha_{x}^{g}$ denotes the ground-truth x coordinate of control point α and $\alpha_{x}$ its predicted value; $\beta_{y}^{g}$, $\beta_{y}$, $\gamma_{y}^{g}$ and $\gamma_{y}$ are defined analogously.
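A small sketch of encoding and decoding the control-point regression targets relative to a default box, following the SSD-style normalization assumed in the reconstruction above; the patent gives the exact formulas only as an image, so treat this as an approximation.

```python
def encode_control_points(default_box, alpha_x_gt, beta_y_gt, gamma_y_gt):
    """Encode ground-truth control-point coordinates as offsets w.r.t. a default box.

    default_box: (cx, cy, w, h) in center form.
    Returns (t_alpha_x, t_beta_y, t_gamma_y), the regression targets,
    using the SSD-style normalization assumed in the reconstruction above.
    """
    cx, cy, w, h = default_box
    return ((alpha_x_gt - cx) / w,
            (beta_y_gt - cy) / h,
            (gamma_y_gt - cy) / h)

def decode_control_points(default_box, t_alpha_x, t_beta_y, t_gamma_y):
    """Invert the encoding at inference time."""
    cx, cy, w, h = default_box
    return (t_alpha_x * w + cx,
            t_beta_y * h + cy,
            t_gamma_y * h + cy)

if __name__ == "__main__":
    box = (160.0, 90.0, 120.0, 80.0)
    targets = encode_control_points(box, alpha_x_gt=170.0, beta_y_gt=60.0, gamma_y_gt=120.0)
    print(targets)
    print(decode_control_points(box, *targets))  # round-trips to the inputs
```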
The loss of the localization task is as follows:

$$L_{loc} = L_{box} + \sum_{i \in Pos}\; \sum_{t \in \{\alpha_{x},\, \beta_{y},\, \gamma_{y}\}} \bar{x}_{i}^{t} \, L\!\left(t_{i} - \hat{t}_{i}\right)$$

where $L_{box}$ denotes the rectangular-box regression loss used in object detection; L denotes the robust loss function smooth L1; and $\bar{x}_{i}^{t}$ is an indicator function specifying whether the i-th default box contributes to coordinate t: when the ground-truth pose matched to a default box does not contain the control point α, that box does not contribute to the regression of α_x, and likewise for β_y and γ_y.
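The sketch below covers the control-point part of this localization loss: smooth L1 applied to the offset residuals, with a per-coordinate 0/1 mask implementing the indicator function, so boxes whose matched pose lacks a control point contribute nothing for that coordinate. The standard box-regression term L_box is omitted for brevity.

```python
import numpy as np

def smooth_l1(x):
    """Elementwise smooth L1 (Huber) loss with delta = 1."""
    absx = np.abs(x)
    return np.where(absx < 1.0, 0.5 * x * x, absx - 0.5)

def control_point_loss(pred_offsets, target_offsets, contrib_mask):
    """Masked smooth-L1 loss over the control-point coordinates.

    pred_offsets, target_offsets: (N, 3) arrays for (alpha_x, beta_y, gamma_y)
    contrib_mask:                 (N, 3) 0/1 indicator; a column is 0 when the matched
                                  ground-truth pose has no such control point
    """
    per_coord = smooth_l1(pred_offsets - target_offsets)
    return float((contrib_mask * per_coord).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    pred = rng.normal(size=(4, 3))
    target = rng.normal(size=(4, 3))
    mask = np.array([[1, 1, 1],    # pose with all three control points
                     [0, 1, 1],    # a = N: only beta and gamma
                     [0, 0, 0],    # s = n: no control points
                     [1, 1, 1]])
    print(control_point_loss(pred, target, mask))
```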
Network framework:
The network framework implementing the overall detection is shown in fig. 1. The input image passes through a feature extraction network to obtain different feature layers, and detection is performed on the different feature maps to obtain the required classification scores, rectangular boxes, and control-point regression results. The detector is shown in detail in fig. 4.
The parameter settings of the feature extraction network are shown in table 1, and the default step size stride is 1.
Table 1: Feature extraction network parameter settings
(The table entries are provided as an image in the original publication.)
Implementation details:
before training, an appropriate default box was generated by the K-Means clustering method. The pre-trained model on ImageNet was then fine-tuned on the collected dataset. The Adam gradient descent algorithm was used for training, with the size of the mini-batch set to 2, the initial learning rate set to 0.001 and descending at a rate of 0.94 of decay factor in each training period. On a single NVIDIA GeForce 1080Ti GPU, the entire training process based on tensrflow takes approximately 12 hours.
All objects in the target dataset are clustered into L categories, where L equals the number of feature layers used for detection. During clustering, the target scale of each cluster center is set to match the scale of the targets handled by each feature layer. Thus, for each feature layer k, a scale range $[s_k^{\min}, s_k^{\max}]$ is obtained for the objects belonging to that cluster; according to this scale range, let $s_k$ denote the anchor scale of the k-th feature layer.
Five aspect ratios are set, r ∈ {1, 2, 3, 1/2, 1/3}. The anchor sizes for each layer are as follows:

$$w_{k}^{r} = W \cdot s_{k}\sqrt{r}, \qquad h_{k}^{r} = H \cdot \frac{s_{k}}{\sqrt{r}}$$

where $w_{k}^{r}$ denotes the width of the anchor with aspect ratio r on the k-th feature layer; $h_{k}^{r}$ denotes the height of that anchor; W denotes the width of the input image; and H denotes the height of the input image.
When the aspect ratio r = 1, an anchor at one additional scale is also computed:

$$s_{k}' = \sqrt{s_{k}\, s_{k+1}}$$
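Applying the anchor equations as reconstructed above, the sketch below produces the anchor sizes for one feature layer; since those equations were recovered from the surrounding text rather than from the original images, this is an SSD-style approximation rather than the patent's exact parameterization.

```python
import math

def layer_anchor_sizes(s_k, s_k_next, img_w, img_h,
                       ratios=(1.0, 2.0, 3.0, 0.5, 1.0 / 3.0)):
    """Anchor (width, height) pairs in pixels for one feature layer.

    s_k, s_k_next: scale of this layer and of the next layer (relative to image size)
    img_w, img_h:  width and height of the input image
    Follows the SSD-style equations reconstructed above, which are an assumption.
    """
    anchors = [(img_w * s_k * math.sqrt(r), img_h * s_k / math.sqrt(r)) for r in ratios]
    # extra anchor at an intermediate scale for the aspect ratio r = 1
    s_prime = math.sqrt(s_k * s_k_next)
    anchors.append((img_w * s_prime, img_h * s_prime))
    return anchors

if __name__ == "__main__":
    for w, h in layer_anchor_sizes(0.2, 0.35, img_w=512, img_h=512):
        print(f"{w:.1f} x {h:.1f}")
```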
the effect of the network implementation is shown in fig. 6.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (2)

1. A fine-grained vehicle detection method based on a deep neural network is characterized by comprising the following steps:
step 1: defining an output
Given a rectangle (v, x, y, w, h), where v indicates positive and negative samples, v ∈ {0, 1}, 0 indicates background, and 1 indicates vehicle; x, y, w and h represent the position, width, and height of the rectangular box; on this basis, the output is extended with two parts, pose subclass coding and control points, as follows:
1-1) attitude coding
Two subclasses are used to encode the 9 imaged 2D vehicle poses p ∈ {p_1, ..., p_9}; the 9 poses are the 9 leaf nodes; the first subclass a ∈ {R, F, N} indicates which faces of the vehicle are visible, where R indicates the rear is visible, F indicates the front is visible, and N indicates that neither the front nor the rear is visible and only a side is visible; the other subclass is s, which represents the spatial configuration and determines whether the side is to the left or right of the front or rear; for a = N, s denotes the direction of the side; s ∈ {l, r, n}, where l denotes the left side, r denotes the right side, and n denotes that the target vehicle is directly ahead and only one rectangular face is visible; 9 different poses can be encoded from the values of a and s;
1-2) control points
3 virtual control points are defined on the basis of the rectangular box (x, y, w, h) to form the contour boundary of each visible face of the vehicle; α denotes the position of the boundary between two faces, and β, γ define the position of the upper base of the trapezoid; if s = l, then β, γ are defined on the leftmost border; among the 9 2D poses, no control point is needed when s = n, and only the two control points β and γ are needed when a = N and s = l or s = r;
the output is defined as (v, a, s, x, y, w, h, α, β, γ), the output result of the third layer, i.e., the result of the leaf node, can be directly denoted by p, and thus, the output can also be defined as (v, a, p, x, y, w, h, α, β, γ);
1-3) hierarchical Structure
A hierarchical output structure is adopted, and the detection result is output in 3 layers: the first layer outputs whether the target is a vehicle, namely category v; the second layer outputs the visible-face information of the vehicle, namely category a; and the third layer outputs the precise pose category p;
step 2: the detection network
Let (w_f, h_f, c_f) denote the width, height, and number of channels of the feature layer at scale f; if V, A, P are the numbers of categories v, a, p, respectively, then convolving a feature layer at scale f produces an output of dimension (w_f, h_f, B_f × (V + A + P + 4 + 3)); the detection result contains the information (v, a, p, x, y, w, h, α, β, γ), where B_f is the number of default boxes generated at each location;
in the detection process, the detector predicts a conditional probability at each node of the hierarchy, and the conditional probabilities from the root node down to a node are multiplied to obtain the joint probability; if the joint probability at a node falls below a chosen threshold, the downward traversal stops; the final category and the geometric shape of the vehicle are then predicted;
step 3: training the network
Let $x_{ij}^{d} \in \{0, 1\}$ be an indicator function of whether the i-th default box matches the j-th ground-truth box of category d; after matching against the ground-truth labels, N matched default boxes are obtained; the overall loss function is the sum of the classification and localization losses:

$$L = \frac{1}{N}\left(L_{cls} + L_{loc}\right)$$
2. the fine-grained vehicle detection method based on the deep neural network as claimed in claim 1, wherein the specific method in step 3 is as follows:
3-1) network classification
The loss function of the classification task is as follows:

$$L_{cls} = -\sum_{i \in Pos} \sum_{d} x_{ij}^{d} \log\left(\hat{c}_{i}^{d}\right) \;-\; \sum_{i \in Neg} \log\left(\hat{c}_{i}^{0}\right)$$

where $\hat{c}_{i}^{d}$ denotes the confidence of category d after softmax, computed as:

$$\hat{c}_{i}^{d} = \frac{\exp\left(c_{i}^{d}\right)}{\sum_{d'} \exp\left(c_{i}^{d'}\right)}$$
3-2) control Point regression
Let (α_x, α_y) denote the coordinates of control point α; control points β and γ are defined similarly. Owing to the geometric constraint that α, β and γ all lie on the boundary of the rectangular box, the only values that need to be regressed are the position of the rectangular box together with α_x, β_y and γ_y. The deviations relative to the default box are defined as:

$$\hat{\alpha}_{x} = \frac{\alpha_{x}^{g} - cx}{w}, \qquad \hat{\beta}_{y} = \frac{\beta_{y}^{g} - cy}{h}, \qquad \hat{\gamma}_{y} = \frac{\gamma_{y}^{g} - cy}{h}$$

where cx, cy denote the center coordinates of the default box; w, h denote the width and height of the default box; $\alpha_{x}^{g}$ denotes the ground-truth x coordinate of control point α and $\alpha_{x}$ its predicted value; $\beta_{y}^{g}$, $\beta_{y}$, $\gamma_{y}^{g}$ and $\gamma_{y}$ are defined analogously;
the loss of the localization task is as follows:

$$L_{loc} = L_{box} + \sum_{i \in Pos}\; \sum_{t \in \{\alpha_{x},\, \beta_{y},\, \gamma_{y}\}} \bar{x}_{i}^{t} \, L\!\left(t_{i} - \hat{t}_{i}\right)$$

where $L_{box}$ denotes the rectangular-box regression loss used in object detection; L denotes the robust loss function smooth L1; and $\bar{x}_{i}^{t}$ is an indicator function specifying whether the i-th default box contributes to coordinate t: when the ground-truth pose matched to a default box does not contain the control point α, that box does not contribute to the regression of α_x, and likewise for β_y and γ_y.
CN201910476604.6A 2019-06-03 2019-06-03 Fine-grained vehicle detection method based on deep neural network Active CN110263679B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910476604.6A CN110263679B (en) 2019-06-03 2019-06-03 Fine-grained vehicle detection method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910476604.6A CN110263679B (en) 2019-06-03 2019-06-03 Fine-grained vehicle detection method based on deep neural network

Publications (2)

Publication Number Publication Date
CN110263679A CN110263679A (en) 2019-09-20
CN110263679B true CN110263679B (en) 2021-08-13

Family

ID=67916462

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910476604.6A Active CN110263679B (en) 2019-06-03 2019-06-03 Fine-grained vehicle detection method based on deep neural network

Country Status (1)

Country Link
CN (1) CN110263679B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814847B (en) * 2020-06-19 2024-03-26 浙江工业大学 Clustering method based on three-dimensional contour of vehicle

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066953A (en) * 2017-03-22 2017-08-18 北京邮电大学 It is a kind of towards the vehicle cab recognition of monitor video, tracking and antidote and device
CN107590440A (en) * 2017-08-21 2018-01-16 南京邮电大学 The method and system of Human detection under a kind of Intelligent household scene
CN109800631A (en) * 2018-12-07 2019-05-24 天津大学 Fluorescence-encoded micro-beads image detecting method based on masked areas convolutional neural networks

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6839466B2 (en) * 1999-10-04 2005-01-04 Xerox Corporation Detecting overlapping images in an automatic image segmentation device with the presence of severe bleeding
GB2409031A (en) * 2003-12-11 2005-06-15 Sony Uk Ltd Face detection
US9547808B2 (en) * 2013-07-17 2017-01-17 Emotient, Inc. Head-pose invariant recognition of facial attributes
CN108596053B (en) * 2018-04-09 2020-06-02 华中科技大学 Vehicle detection method and system based on SSD and vehicle posture classification
CN109343041B (en) * 2018-09-11 2023-02-14 昆山星际舟智能科技有限公司 Monocular distance measuring method for advanced intelligent auxiliary driving
CN109829400B (en) * 2019-01-18 2023-06-30 青岛大学 Rapid vehicle detection method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066953A (en) * 2017-03-22 2017-08-18 北京邮电大学 It is a kind of towards the vehicle cab recognition of monitor video, tracking and antidote and device
CN107590440A (en) * 2017-08-21 2018-01-16 南京邮电大学 The method and system of Human detection under a kind of Intelligent household scene
CN109800631A (en) * 2018-12-07 2019-05-24 天津大学 Fluorescence-encoded micro-beads image detecting method based on masked areas convolutional neural networks

Also Published As

Publication number Publication date
CN110263679A (en) 2019-09-20

Similar Documents

Publication Publication Date Title
CN109636905B (en) Environment semantic mapping method based on deep convolutional neural network
CN109597087B (en) Point cloud data-based 3D target detection method
CN111310574B (en) Vehicle-mounted visual real-time multi-target multi-task joint sensing method and device
CN111832655B (en) Multi-scale three-dimensional target detection method based on characteristic pyramid network
Gosala et al. Bird’s-eye-view panoptic segmentation using monocular frontal view images
CN110222626B (en) Unmanned scene point cloud target labeling method based on deep learning algorithm
CN113128348B (en) Laser radar target detection method and system integrating semantic information
CN110728200A (en) Real-time pedestrian detection method and system based on deep learning
CN110688905B (en) Three-dimensional object detection and tracking method based on key frame
CN106156748A (en) Traffic scene participant's recognition methods based on vehicle-mounted binocular camera
CN106203342A (en) Target identification method based on multi-angle local feature coupling
CN110189339A (en) The active profile of depth map auxiliary scratches drawing method and system
CN110516633B (en) Lane line detection method and system based on deep learning
CN101996410A (en) Method and system of detecting moving object under dynamic background
CN111292366B (en) Visual driving ranging algorithm based on deep learning and edge calculation
CN113267761B (en) Laser radar target detection and identification method, system and computer readable storage medium
CN106023249A (en) Moving object detection method based on local binary similarity pattern
CN115424017B (en) Building inner and outer contour segmentation method, device and storage medium
CN108021857B (en) Building detection method based on unmanned aerial vehicle aerial image sequence depth recovery
CN115128628A (en) Road grid map construction method based on laser SLAM and monocular vision
CN114372523A (en) Binocular matching uncertainty estimation method based on evidence deep learning
CN110263679B (en) Fine-grained vehicle detection method based on deep neural network
CN117274749B (en) Fused 3D target detection method based on 4D millimeter wave radar and image
CN113221739B (en) Monocular vision-based vehicle distance measuring method
CN110598711A (en) Target segmentation method combined with classification task

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant