CN110263679A - Fine-grained vehicle detection method based on deep neural network - Google Patents

Fine-grained vehicle detection method based on deep neural network

Info

Publication number
CN110263679A
CN110263679A
Authority
CN
China
Prior art keywords
vehicle
classification
control point
indicates
indicate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910476604.6A
Other languages
Chinese (zh)
Other versions
CN110263679B (en)
Inventor
袁泽剑
罗芳颖
刘芮金
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN201910476604.6A priority Critical patent/CN110263679B/en
Publication of CN110263679A publication Critical patent/CN110263679A/en
Application granted granted Critical
Publication of CN110263679B publication Critical patent/CN110263679B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/584Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of vehicle lights or traffic lights
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08Detecting or categorising vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a fine-grained vehicle detection method based on a deep neural network. By defining the output, the detection network and the training network, the method can accurately detect the specific pose class and contour of a vehicle. Given prior knowledge such as the ground plane and camera calibration information, the detection results can be used to estimate the drivable region, the time to collision and so on, further assisting the driver and ensuring driving safety. Compared with a general object detection network, the present invention outputs more information and can meet different application demands. The invention outputs the pose class and contour position of the vehicle, which helps to judge the position of the vehicle on the road and its driving direction more accurately. The invention places low requirements on the sensors used to acquire data, which is favorable for production and use. The computation of the invention is carried out entirely on ordinary RGB images; no depth sensor, radar or similar equipment is needed, and a single ordinary camera is sufficient, so the cost is low.

Description

Fine-grained vehicle detection method based on deep neural network
[Technical field]
The present invention relates to a fine-grained vehicle detection method based on a deep neural network.
[Background art]
Vehicle detection is a vital task in autonomous driving and driver assistance systems; it can be used to compute collision distance and time to collision and to guarantee driving safety. General object detection can only produce a coarse rectangular bounding box, which cannot separate the individual faces of a vehicle, cannot accurately determine the passable area beside the vehicle, and is insensitive to changes in the vehicle's pose. What is required is a method that can detect the accurate contour of a vehicle and distinguish its side from its front and rear, i.e. fine-grained vehicle detection.
There are mainly two ways to realize contour detection. One is instance segmentation in ordinary RGB images based on segmentation proposals or pixel-wise classification; the other is to capture images with depth information and perform object detection in RGB-D images. Although instance segmentation can detect the contour of a vehicle, it cannot distinguish the different faces of the vehicle and is slow to compute. 3D detection methods can obtain the position of a vehicle accurately, but they require a 3D sensor to acquire depth information, so data acquisition is expensive.
With the continuous improvement of algorithms, one-stage object detection networks have approached or reached the requirements of real-time detection. Performing vehicle contour detection directly within this framework avoids additional computation or acquisition cost, and is therefore suitable for autonomous driving and driver assistance systems.
[Summary of the invention]
The purpose of the present invention is to overcome the above shortcomings of the prior art and to provide a fine-grained vehicle detection method based on a deep neural network.
To achieve the above purpose, the present invention is realized by the following technical scheme:
A fine-grained vehicle detection method based on a deep neural network, comprising the following steps:
Step 1: output definition
Given a rectangle (v, x, y, w, h), v indicates whether the sample is positive or negative, v ∈ {0, 1}, where 0 denotes background and 1 denotes a vehicle; x, y, w, h denote the position, width and height of the rectangular box; on this basis, the output is extended with two parts, a pose subclass encoding and control points;
Step 2: detection network
Let (w_f, h_f, c_f) be the width, height and number of channels of the feature layer at scale f; let V, A, P be the numbers of classes for v, a and p respectively; the feature layer at scale f is passed through a convolution to generate a detection result matrix of dimension (w_f, h_f, B_f × (V + A + P + 4 + 3)), where each detection result contains the information (v, a, p, x, y, w, h, α, β, γ) and B_f is the number of default boxes generated at each position;
During detection, what the detector predicts at each node of the hierarchy is a conditional probability, and the joint probability is obtained by multiplying the conditional probabilities from the root node down to that node; if the joint probability at some node falls below a chosen threshold, the downward judgment stops; otherwise the final vehicle class and geometry are predicted;
Step 3: training the network
Let an indicator function denote whether the i-th default box is matched to the j-th ground-truth box of class d; after matching with the annotated ground truth, N matched default boxes are obtained; the total loss function is the sum of the classification loss and the localization loss:
A further improvement of the present invention lies in that:
The specific method of step 1 is as follows:
1-1) Pose encoding
The nine 2D vehicle poses after imaging, p ∈ {p1, ..., p9}, are encoded with two subclasses; the nine poses correspond to nine leaf nodes; the first subclass a ∈ {R, F, N} indicates which face of the vehicle is visible, where R means the rear is visible, F means the front is visible, and N means neither the front nor the rear is visible and only the side can be seen; the other subclass s represents the spatial configuration and determines whether the side is to the left or to the right of the front or rear face; for a = N, s indicates the direction of the side; s ∈ {l, r, n}, where l means left, r means right, and n means the target vehicle is directly ahead and only one rectangular face can be seen; nine different poses can be encoded according to the values of a and s;
1-2) Control points
Three virtual control points are defined on the basis of the rectangular box (x, y, w, h) to form the contour boundary of each visible face of the vehicle; α indicates the position of the dividing line between two faces, while β and γ define the positions of the top and bottom edges of the trapezoid; if s = l, β and γ are defined on the leftmost boundary; for the nine 2D poses, the required control points are marked with black dots on each pose; when s = n no control point is needed, and when a = N with s = l or s = r only the two control points β and γ are needed;
The output is defined as (v, a, s, x, y, w, h, α, β, γ); the output of the third layer, i.e. the result at a leaf node, can be represented directly by p, so the output can also be defined as (v, a, p, x, y, w, h, α, β, γ);
1-3) Hierarchical structure
A hierarchical output structure is used and the detection result is output in three layers: the first layer outputs whether the target is a vehicle, i.e. class v; the second layer outputs the visible-face information of the vehicle, i.e. class a; the third layer outputs the exact pose class p.
The specific method of step 3 is as follows:
3-1) Network classification
The loss function of the classification task is given by the following formula;
In the formula, the class confidence after softmax is computed as follows:
3-2) Control point regression
Let (α_x, α_y) denote the coordinates of control point α, with similar definitions for control points β and γ; because the three points α, β, γ are geometrically constrained to lie on the boundary of the rectangular box, the only values that need to be regressed are the position of the rectangular box and α_x, β_y, γ_y; the deviations between the predicted values and the default box are defined as:
In the formula, cx, cy denote the coordinates of the center of the default box; w and h denote the width and height of the default box; the ground-truth x coordinate of control point α is used together with its prediction α_x; β_y and γ_y are defined similarly;
The loss of the localization task is as follows:
In the formula, L_box denotes the loss function for rectangular box regression in object detection; L denotes the robust smooth-L1 loss function; the indicator function indicates whether the i-th default box contributes to coordinate t; when the pose of the ground truth matched to a default box does not include control point α, the box does not contribute to the regression of α_x, and β_y, γ_y are handled similarly.
Compared with the prior art, the present invention has the following beneficial effects:
The present invention proposes a deep neural network method for fine-grained vehicle detection that can accurately detect the specific pose class and contour of a vehicle. Given prior knowledge such as the ground plane and camera calibration information, the detection results can be used to estimate the drivable region, the time to collision and so on, further assisting the driver and ensuring driving safety. The present invention has three major advantages:
First, compared with a general object detection network, the present invention outputs more information and can meet different application demands. The invention outputs the pose class and contour position of the vehicle, which helps to judge the position of the vehicle on the road and its driving direction more accurately.
Second, the amount of computation is small and the time efficiency is high. Since the proposed method is an extension of a one-stage object detection framework that has approached or reached real-time requirements, it generates no additional features and does not increase the number of candidate windows compared with the underlying object detection method; it only extends the output channels of the detector, adds almost no computation, and has the same detection efficiency as the original object detection algorithm.
Third, the requirements on the sensors used to acquire data are low, which is favorable for production and use. The computation of the invention is carried out entirely on ordinary RGB images; no depth sensor, radar or similar equipment is needed, and a single ordinary camera meets the requirement, so the cost is low.
[Description of the drawings]
Fig. 1 is the overall network framework;
Fig. 2 is an example of the vehicle contour detection effect;
Fig. 3 shows the pose encoding and the control point definition;
Fig. 4 is the structure of the detection network;
Fig. 5 illustrates the control point regression;
Fig. 6 shows some actual detection results.
[Detailed description of the embodiments]
In order to enable those skilled in the art to better understand the solution of the present invention, the technical scheme in the embodiments of the present invention is described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them, and they are not intended to limit the scope of the disclosure. In the following description, descriptions of well-known structures and technologies are omitted to avoid unnecessarily obscuring the disclosed concepts. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Various structural schematic diagrams according to the disclosed embodiments are shown in the accompanying drawings. These figures are not drawn to scale; some details are magnified for clarity of presentation and some details may be omitted. The shapes of the various regions and layers shown in the figures, as well as their relative sizes and positional relationships, are merely exemplary; in practice there may be deviations due to manufacturing tolerances or technical limitations, and those skilled in the art may additionally design regions/layers with different shapes, sizes and relative positions as required.
In the context of the present disclosure, when a layer/element is referred to as being "on" another layer/element, it can be directly on the other layer/element, or there may be intermediate layers/elements between them. In addition, if a layer/element is "on" another layer/element in one orientation, the layer/element can be "under" the other layer/element when the orientation is reversed.
It should be noted that the terms "first", "second", etc. in the description, the claims and the above drawings are used to distinguish similar objects, and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable under appropriate circumstances, so that the embodiments of the present invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product or device.
The invention is described in further detail below with reference to the accompanying drawings:
Referring to Fig. 2, according to the characteristics of images taken by a forward-facing camera, one visible face of a vehicle can be localized with a rectangle or a trapezoid. Compared with the original rectangular box (the dashed box in Fig. 2), this representation can accurately describe the front, rear and sides of the vehicle. Since the combinations of visible faces of a vehicle are limited, the present invention proposes a 2D vehicle pose encoding strategy to represent this output.
Output definition
The output representation proposed by the present invention is extended on the basis of the rectangular box. Given a rectangle (v, x, y, w, h), v indicates whether the sample is positive or negative, v ∈ {0, 1}, where 0 denotes background and 1 denotes a vehicle. x, y, w, h denote the position, width and height of the rectangular box. On this basis, the output is extended with two parts: a pose subclass encoding and control points.
1. Pose encoding
The present invention encodes the nine 2D vehicle poses after imaging, p ∈ {p1, ..., p9}, with two subclasses, as shown in Fig. 3(a); the nine poses are the nine leaf nodes in the figure. The first subclass a ∈ {R, F, N} indicates which face of the vehicle is visible: R means the rear is visible, F means the front is visible, and N means neither the front nor the rear is visible and only the side can be seen. The other subclass s represents the spatial configuration; it determines whether the side is to the left or to the right of the front (or rear) face. For a = N, s indicates the direction of the side. s ∈ {l, r, n}, where l means left, r means right, and n means the target vehicle is almost directly ahead and only one rectangular face can be seen. Nine different poses can be encoded according to the values of a and s.
2. Control points
The present invention defines three virtual control points on the basis of the rectangular box (x, y, w, h) to further form the contour boundary of each visible face of the vehicle. α indicates the position of the dividing line between two faces, while β and γ define the positions of the top and bottom edges of the trapezoid. The definitions of α, β, γ under the pose a = R, s = r are given in Fig. 3(b). If s = l, β and γ are defined on the leftmost boundary. Not every one of the nine 2D poses requires all the control points: in Fig. 3(a) the required control points are marked with black dots on each pose; when s = n no control point is needed, and when a = N with s = l or s = r only the two control points β and γ are needed.
Finally, the output is defined as (v, a, s, x, y, w, h, α, β, γ). If the result is output according to the tree structure shown in Fig. 3(a), the output of the third layer, i.e. the result at a leaf node, can be represented directly by p; therefore the output can also be defined as (v, a, p, x, y, w, h, α, β, γ).
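A minimal Python sketch of this output tuple and of the (a, s) → p pose encoding is given below. It is not taken from the patent text; the concrete labels p1...p9 and their assignment to (a, s) pairs are assumptions made purely for illustration.

```python
# Illustrative sketch (assumptions noted above): the fine-grained detection output
# (v, a, s, x, y, w, h, alpha, beta, gamma) and the mapping from the two subclasses
# (a, s) to one of the nine leaf-node poses p1..p9.
from dataclasses import dataclass

# a in {R, F, N}: rear visible, front visible, or side only.
# s in {l, r, n}: side on the left, side on the right, or directly ahead.
POSE_TABLE = {
    ('R', 'l'): 'p1', ('R', 'r'): 'p2', ('R', 'n'): 'p3',
    ('F', 'l'): 'p4', ('F', 'r'): 'p5', ('F', 'n'): 'p6',
    ('N', 'l'): 'p7', ('N', 'r'): 'p8', ('N', 'n'): 'p9',
}

@dataclass
class VehicleDetection:
    v: int        # 0 = background, 1 = vehicle
    a: str        # visible-face subclass
    s: str        # spatial-configuration subclass
    x: float      # rectangular box position and size
    y: float
    w: float
    h: float
    alpha: float  # dividing line between two visible faces
    beta: float   # top edge of the trapezoid
    gamma: float  # bottom edge of the trapezoid

    @property
    def p(self) -> str:
        # The leaf-node pose class is determined jointly by (a, s).
        return POSE_TABLE[(self.a, self.s)]
```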
3. Use of the hierarchical structure
Using a flattened structure to compute the confidences of the nine poses and the background directly is also a feasible solution, but the tree structure is advantageous when refining the classification. For example, if the detector detects a vehicle and its front, it judges a = F; if it cannot further determine the specific 2D pose, the tree-structured detector can still output a rectangular box with high confidence without further outputting the contour from the control points, which guarantees the correctness of the output result.
Detection network
Let (w_f, h_f, c_f) be the width, height and number of channels of the feature layer at scale f. Let V, A, P be the numbers of classes for v, a and p respectively. The feature layer at scale f is passed through a convolution to generate a detection result matrix of dimension (w_f, h_f, B_f × (V + A + P + 4 + 3)); each detection result contains the information (v, a, p, x, y, w, h, α, β, γ), where B_f is the number of default boxes generated at each position, as shown schematically in Fig. 4.
During detection, what the detector predicts at each node of the hierarchy is a conditional probability, and the joint probability is obtained by multiplying the conditional probabilities from the root node down to that node. If the joint probability at some node falls below a chosen threshold, the downward judgment stops; otherwise the final vehicle class and geometry are predicted.
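The following hedged TensorFlow sketch illustrates the two points above: a prediction head whose channel layout matches B_f × (V + A + P + 4 + 3), and the thresholded product of conditional probabilities along the class hierarchy. The layer choice, the threshold value and the function names are assumptions for illustration, not specifications from the patent.

```python
# Sketch only: SSD-style prediction head and hierarchical joint-probability thresholding.
import tensorflow as tf

V, A, P = 2, 3, 9   # numbers of classes for v, a and p
B_f = 5             # assumed number of default boxes per position on this feature layer

def detection_head(feature_layer: tf.Tensor) -> tf.Tensor:
    # (batch, w_f, h_f, c_f) -> (batch, w_f, h_f, B_f * (V + A + P + 4 + 3))
    channels = B_f * (V + A + P + 4 + 3)
    return tf.keras.layers.Conv2D(channels, kernel_size=3, padding='same')(feature_layer)

def hierarchical_decision(p_v, p_a_given_v, p_p_given_a, threshold=0.3):
    """Multiply conditional probabilities from the root down; stop once the joint falls below the threshold."""
    if p_v < threshold:
        return None                        # judged as background
    if p_v * p_a_given_v < threshold:
        return ('vehicle',)                # box only, visible face uncertain
    if p_v * p_a_given_v * p_p_given_a < threshold:
        return ('vehicle', 'face')         # box + visible face, full pose/contour withheld
    return ('vehicle', 'face', 'pose')     # full fine-grained output
```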
Network training
Let an indicator function denote whether the i-th default box is matched to the j-th ground-truth box of class d. After matching with the annotated ground truth, N matched default boxes are obtained. The total loss function is the sum of the classification loss and the localization loss.
1. Network classification
The loss function of the classification task is given by the following formula.
In the formula, the class confidence after softmax is computed as follows:
2. Control point regression
Let (α_x, α_y) denote the coordinates of control point α; similar definitions apply to control points β and γ. Because the three points α, β, γ are geometrically constrained to lie on the boundary of the rectangular box, the only values that need to be regressed are the position of the rectangular box and α_x, β_y, γ_y. The relationship between the ground-truth box, the default box and the control points is shown in Fig. 5. The deviations between the predicted values and the default box are defined as:
In the formula, cx, cy denote the coordinates of the center of the default box; w and h denote the width and height of the default box; the ground-truth x coordinate of control point α is used together with its prediction α_x; β_y and γ_y are defined similarly.
The loss of the localization task is as follows:
In the formula, L_box denotes the loss function for rectangular box regression in object detection; L denotes the robust smooth-L1 loss function; the indicator function indicates whether the i-th default box contributes to coordinate t; when the pose of the ground truth matched to a default box does not include control point α, the box does not contribute to the regression of α_x, and β_y, γ_y are handled similarly.
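The sketch below illustrates the masked control-point regression just described. Because the offset and loss formulas appear only in the figures, an SSD-style normalization by the default box and the standard smooth-L1 form are assumed here; the function names are hypothetical and the snippet is illustrative rather than the patent's exact definition.

```python
# Sketch: control-point regression targets relative to the matched default box, and a
# smooth-L1 loss masked by whether the matched ground-truth pose actually uses the point.
import tensorflow as tf

def control_point_targets(gt_alpha_x, gt_beta_y, gt_gamma_y, cx, cy, w, h):
    # Normalized offsets of the three regressed coordinates (assumed SSD-style normalization).
    t_alpha_x = (gt_alpha_x - cx) / w
    t_beta_y  = (gt_beta_y  - cy) / h
    t_gamma_y = (gt_gamma_y - cy) / h
    return t_alpha_x, t_beta_y, t_gamma_y

def masked_smooth_l1(pred, target, mask):
    """Smooth-L1 loss; `mask` is 1 only when the matched pose contains the control point."""
    diff = tf.abs(pred - target)
    per_element = tf.where(diff < 1.0, 0.5 * diff * diff, diff - 0.5)
    return tf.reduce_sum(mask * per_element)
```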
Network framework:
The network framework that realizes the whole detection is shown in Fig. 1. The input image passes through a feature extraction network to obtain different feature layers; detection is carried out on the different feature maps to obtain the final class scores and the regression results for the rectangular box and the control points. The detailed detector is shown in Fig. 4.
The parameter settings of the feature extraction network are shown in Table 1; the default stride is 1.
Table 1 Parameter settings of the feature extraction network
Implementation details:
Before training, suitable default boxes are generated by the K-Means clustering method. A model pre-trained on ImageNet is then fine-tuned on the collected dataset. Training uses the Adam gradient descent algorithm with a mini-batch size of 2; the initial learning rate is set to 0.001 and decays with a decay factor of 0.94 in each training cycle. On a single NVIDIA GeForce 1080Ti GPU, the entire TensorFlow-based training process takes around 12 hours.
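For concreteness, the training settings above could be written as follows. This is a sketch against the TensorFlow 2 Keras API chosen for illustration; the patent does not name specific API calls, and steps_per_epoch is a placeholder that depends on the dataset size.

```python
# Sketch: Adam optimizer, mini-batch size 2, initial learning rate 0.001 decayed by 0.94 per cycle.
import tensorflow as tf

BATCH_SIZE = 2
steps_per_epoch = 10000   # placeholder: number of mini-batches in one training cycle

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001,
    decay_steps=steps_per_epoch,  # one decay step per training cycle
    decay_rate=0.94,
    staircase=True)

optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```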
All targets in the dataset to be applied are clustered into L classes, where L equals the number of feature layers used for detection. The target scale of each cluster center is matched to the scale of the targets handled by the corresponding feature layer. In this way, the scale range of the targets belonging to each feature layer is obtained. According to the scale range, the anchor scale of each layer is set, and five aspect ratios r ∈ {1, 2, 3, 1/2, 1/3} are used. The anchors of each layer are computed as follows:
In the formula, w_k^r denotes the width of the anchor with aspect ratio r on the k-th feature layer; h_k^r denotes the height of the anchor with aspect ratio r on the k-th feature layer; W denotes the width of the input image; H denotes the height of the input image.
When the aspect ratio r = 1, a box of an additional scale is also computed:
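Since the anchor formulas themselves are given only in the figures, the following sketch assumes the standard SSD convention (width s·sqrt(r), height s/sqrt(r), and an extra box of scale sqrt(s_k · s_{k+1}) when r = 1) purely for illustration; the patent's exact computation may differ.

```python
# Sketch: per-layer anchor (default box) sizes from a layer scale and the five aspect ratios.
import math

ASPECT_RATIOS = [1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0]

def layer_anchors(s_k: float, s_k_next: float):
    """Return (width, height) pairs for one feature layer with scale s_k (SSD-style assumption)."""
    anchors = [(s_k * math.sqrt(r), s_k / math.sqrt(r)) for r in ASPECT_RATIOS]
    # Extra box of an additional scale for aspect ratio 1.
    s_extra = math.sqrt(s_k * s_k_next)
    anchors.append((s_extra, s_extra))
    return anchors
```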
The detection results achieved by the network are shown in Fig. 6.
The above content merely illustrates the technical idea of the present invention and does not limit the protection scope of the present invention; any change made on the basis of the technical scheme according to the technical idea proposed by the present invention falls within the protection scope of the claims of the present invention.

Claims (3)

1. A fine-grained vehicle detection method based on a deep neural network, characterized by comprising the following steps:
Step 1: output definition
Given a rectangle (v, x, y, w, h), v indicates whether the sample is positive or negative, v ∈ {0, 1}, where 0 denotes background and 1 denotes a vehicle; x, y, w, h denote the position, width and height of the rectangular box; on this basis, the output is extended with two parts, a pose subclass encoding and control points;
Step 2: detection network
Let (w_f, h_f, c_f) be the width, height and number of channels of the feature layer at scale f; let V, A, P be the numbers of classes for v, a and p respectively; the feature layer at scale f is passed through a convolution to generate a detection result matrix of dimension (w_f, h_f, B_f × (V + A + P + 4 + 3)), where each detection result contains the information (v, a, p, x, y, w, h, α, β, γ) and B_f is the number of default boxes generated at each position;
During detection, what the detector predicts at each node of the hierarchy is a conditional probability, and the joint probability is obtained by multiplying the conditional probabilities from the root node down to that node; if the joint probability at some node falls below a chosen threshold, the downward judgment stops; otherwise the final vehicle class and geometry are predicted;
Step 3: training the network
Let an indicator function denote whether the i-th default box is matched to the j-th ground-truth box of class d; after matching with the annotated ground truth, N matched default boxes are obtained; the total loss function is the sum of the classification loss and the localization loss:
2. The fine-grained vehicle detection method based on a deep neural network according to claim 1, characterized in that the specific method of step 1 is as follows:
1-1) Pose encoding
The nine 2D vehicle poses after imaging, p ∈ {p1, ..., p9}, are encoded with two subclasses; the nine poses correspond to nine leaf nodes; the first subclass a ∈ {R, F, N} indicates which face of the vehicle is visible, where R means the rear is visible, F means the front is visible, and N means neither the front nor the rear is visible and only the side can be seen; the other subclass s represents the spatial configuration and determines whether the side is to the left or to the right of the front or rear face; for a = N, s indicates the direction of the side; s ∈ {l, r, n}, where l means left, r means right, and n means the target vehicle is directly ahead and only one rectangular face can be seen; nine different poses can be encoded according to the values of a and s;
1-2) Control points
Three virtual control points are defined on the basis of the rectangular box (x, y, w, h) to form the contour boundary of each visible face of the vehicle; α indicates the position of the dividing line between two faces, while β and γ define the positions of the top and bottom edges of the trapezoid; if s = l, β and γ are defined on the leftmost boundary; for the nine 2D poses, the required control points are marked with black dots on each pose; when s = n no control point is needed, and when a = N with s = l or s = r only the two control points β and γ are needed;
The output is defined as (v, a, s, x, y, w, h, α, β, γ); the output of the third layer, i.e. the result at a leaf node, can be represented directly by p, so the output can also be defined as (v, a, p, x, y, w, h, α, β, γ);
1-3) Hierarchical structure
A hierarchical output structure is used and the detection result is output in three layers: the first layer outputs whether the target is a vehicle, i.e. class v; the second layer outputs the visible-face information of the vehicle, i.e. class a; the third layer outputs the exact pose class p.
3. The fine-grained vehicle detection method based on a deep neural network according to claim 1, characterized in that the specific method of step 3 is as follows:
3-1) Network classification
The loss function of the classification task is given by the following formula;
In the formula, the class confidence after softmax is computed as follows:
3-2) Control point regression
Let (α_x, α_y) denote the coordinates of control point α, with similar definitions for control points β and γ; because the three points α, β, γ are geometrically constrained to lie on the boundary of the rectangular box, the only values that need to be regressed are the position of the rectangular box and α_x, β_y, γ_y; the deviations between the predicted values and the default box are defined as:
In the formula, cx, cy denote the coordinates of the center of the default box; w and h denote the width and height of the default box; the ground-truth x coordinate of control point α is used together with its prediction α_x; β_y and γ_y are defined similarly;
The loss of the localization task is as follows:
In the formula, L_box denotes the loss function for rectangular box regression in object detection; L denotes the robust smooth-L1 loss function; the indicator function indicates whether the i-th default box contributes to coordinate t; when the pose of the ground truth matched to a default box does not include control point α, the box does not contribute to the regression of α_x, and β_y, γ_y are handled similarly.
CN201910476604.6A 2019-06-03 2019-06-03 Fine-grained vehicle detection method based on deep neural network Active CN110263679B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910476604.6A CN110263679B (en) 2019-06-03 2019-06-03 Fine-grained vehicle detection method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910476604.6A CN110263679B (en) 2019-06-03 2019-06-03 Fine-grained vehicle detection method based on deep neural network

Publications (2)

Publication Number Publication Date
CN110263679A true CN110263679A (en) 2019-09-20
CN110263679B CN110263679B (en) 2021-08-13

Family

ID=67916462

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910476604.6A Active CN110263679B (en) 2019-06-03 2019-06-03 Fine-grained vehicle detection method based on deep neural network

Country Status (1)

Country Link
CN (1) CN110263679B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814847A (en) * 2020-06-19 2020-10-23 浙江工业大学 Clustering method based on vehicle three-dimensional contour

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030152272A1 (en) * 1999-10-04 2003-08-14 Dennis L. Venable Detecting overlapping images in an automatic image segmentation device with the presence of severe bleeding
CN1658220A (en) * 2003-12-11 2005-08-24 索尼英国有限公司 Object detection
US9547808B2 (en) * 2013-07-17 2017-01-17 Emotient, Inc. Head-pose invariant recognition of facial attributes
CN107066953A (en) * 2017-03-22 2017-08-18 北京邮电大学 It is a kind of towards the vehicle cab recognition of monitor video, tracking and antidote and device
CN107590440A (en) * 2017-08-21 2018-01-16 南京邮电大学 The method and system of Human detection under a kind of Intelligent household scene
CN108596053A (en) * 2018-04-09 2018-09-28 华中科技大学 A kind of vehicle checking method and system based on SSD and vehicle attitude classification
CN109343041A (en) * 2018-09-11 2019-02-15 昆山星际舟智能科技有限公司 The monocular distance measuring method driven for high-grade intelligent auxiliary
CN109800631A (en) * 2018-12-07 2019-05-24 天津大学 Fluorescence-encoded micro-beads image detecting method based on masked areas convolutional neural networks
CN109829400A (en) * 2019-01-18 2019-05-31 青岛大学 A kind of fast vehicle detection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WADIM KEHL: "SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again", IEEE Xplore *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814847A (en) * 2020-06-19 2020-10-23 浙江工业大学 Clustering method based on vehicle three-dimensional contour
CN111814847B (en) * 2020-06-19 2024-03-26 浙江工业大学 Clustering method based on three-dimensional contour of vehicle

Also Published As

Publication number Publication date
CN110263679B (en) 2021-08-13

Similar Documents

Publication Publication Date Title
CN111091105B (en) Remote sensing image target detection method based on new frame regression loss function
CN111832655B (en) Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN104778721B (en) The distance measurement method of conspicuousness target in a kind of binocular image
CN106228125B (en) Method for detecting lane lines based on integrated study cascade classifier
CN110287826B (en) Video target detection method based on attention mechanism
CN107463890B (en) A kind of Foregut fermenters and tracking based on monocular forward sight camera
CN111553201B (en) Traffic light detection method based on YOLOv3 optimization algorithm
CN110175576A (en) A kind of driving vehicle visible detection method of combination laser point cloud data
CN103218787B (en) Multi-source heterogeneous remote sensing image reference mark automatic acquiring method
CN106682586A (en) Method for real-time lane line detection based on vision under complex lighting conditions
CN111784657A (en) Digital image-based system and method for automatically identifying cement pavement diseases
CN104166841A (en) Rapid detection identification method for specified pedestrian or vehicle in video monitoring network
CN105069843A (en) Rapid extraction method for dense point cloud oriented toward city three-dimensional modeling
CN104657717B (en) A kind of pedestrian detection method based on layering nuclear sparse expression
CN103279759A (en) Vehicle front trafficability analyzing method based on convolution nerve network
CN101126812A (en) High resolution ratio remote-sensing image division and classification and variety detection integration method
CN103530590A (en) DPM (direct part mark) two-dimensional code recognition system
CN108416292A (en) A kind of unmanned plane image method for extracting roads based on deep learning
CN110674674A (en) Rotary target detection method based on YOLO V3
CN105654516A (en) Method for detecting small moving object on ground on basis of satellite image with target significance
CN111339830A (en) Target classification method based on multi-modal data features
CN104599288A (en) Skin color template based feature tracking method and device
CN117274749B (en) Fused 3D target detection method based on 4D millimeter wave radar and image
CN110147748A (en) A kind of mobile robot obstacle recognition method based on road-edge detection
Warner et al. Pine Island Glacier (Antarctica) velocities from Landsat7 images between 2001 and 2011: FFT-based image correlation for images with data gaps

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant