CN117292346A - Vehicle running risk early warning method for driver and vehicle state integrated sensing - Google Patents

Vehicle running risk early warning method for driver and vehicle state integrated sensing Download PDF

Info

Publication number
CN117292346A
CN117292346A CN202311284729.1A
Authority
CN
China
Prior art keywords
vehicle
driver
channel
time
early warning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311284729.1A
Other languages
Chinese (zh)
Inventor
俞山川
骆中斌
宋浪
李刚
王少飞
谢耀华
彭亚雪
周欣
陈晨
周盼
陈奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Merchants Chongqing Communications Research and Design Institute Co Ltd
Original Assignee
China Merchants Chongqing Communications Research and Design Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Merchants Chongqing Communications Research and Design Institute Co Ltd filed Critical China Merchants Chongqing Communications Research and Design Institute Co Ltd
Priority to CN202311284729.1A priority Critical patent/CN117292346A/en
Publication of CN117292346A publication Critical patent/CN117292346A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/588Recognition of the road, e.g. of lane markings; Recognition of the vehicle driving pattern in relation to the road
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0635Risk analysis of enterprise or organisation activities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention discloses a vehicle running risk early warning method oriented to integrated perception of driver and vehicle state, which comprises the following steps: training a driving behavior recognition model with a driving state data set to obtain a trained driving behavior recognition model; training a vehicle track recognition model with a lane offset data set to obtain a trained vehicle track recognition model; inputting acquired driver images into the trained driving behavior recognition model and outputting the driver's driving behavior recognition result; inputting acquired vehicle running state images into the trained vehicle track recognition model and outputting a lane departure recognition result; judging whether the driving behavior recognition result and/or the lane departure recognition result meets the early warning requirement, and if so, reminding the driver to correct the driving behavior and drive in the correct lane; if not, no processing is performed. The invention can synchronously sense the driver's driving state and the vehicle's running track in real time, and enhances the reliability of driving early warning.

Description

Vehicle running risk early warning method for driver and vehicle state integrated sensing
Technical Field
The invention relates to the field of vehicle driving early warning, in particular to a vehicle driving risk early warning method for integrally sensing the states of a driver and a vehicle.
Background
In recent years, collisions have remained the dominant type of road transport safety accident, accounting for 70.5% of accidents and 68.4% of fatalities, which exposes insufficient prevention and control of collision accidents involving road transport vehicles. Before a vehicle collision, driver fatigue or distraction, together with lane departure and insufficient following distance, is present in 40-50% of these accidents. Early warning of vehicle running risk is therefore of great significance for improving road safety and traffic management.
Intelligent assistance is increasingly applied in the field of vehicle running early warning. However, current vehicle running risk early warning technology suffers from false alarms or missed alarms of potential danger, and also fails to effectively identify complex traffic situations and risk factors, which greatly reduces the reliability of risk early warning. A vehicle running risk early warning method oriented to integrated perception of driver and vehicle state is therefore needed to solve these problems.
Disclosure of Invention
Therefore, the invention aims to overcome the defects in the prior art and provides a vehicle running risk early warning method for integrated sensing of driver and vehicle state, which can synchronously sense the driver's driving state and the vehicle's running track in real time, enhance the reliability of driving early warning, and improve the safe driving level of the vehicle.
The vehicle running risk early warning method of the invention, oriented to integrated perception of driver and vehicle state, comprises the following steps:
creating a driving state data set of a driver; training the driving behavior recognition model by using the driving state data set to obtain a trained driving behavior recognition model;
creating a lane departure data set; training the vehicle track recognition model by using the lane offset data set to obtain a trained vehicle track recognition model;
inputting the acquired driver images into a trained driving behavior recognition model, and outputting a driving behavior recognition result of the driver;
inputting the acquired vehicle running state image into a trained vehicle track recognition model, and outputting a lane deviation recognition result;
judging whether the driving behavior recognition result and/or the lane departure recognition result meets the early warning requirement; if so, reminding the driver to correct the driving behavior and drive in the correct lane; if not, no processing is performed.
Further, the driving behavior recognition model comprises a policy network and a two-dimensional convolutional neural network; the policy network comprises a feature extractor and a long short-term memory module; a mixed attention mechanism module is embedded in the backbone network of the two-dimensional convolutional neural network; the mixed attention mechanism module comprises a spatio-temporal excitation sub-module, a channel excitation sub-module and a motion excitation sub-module;
the spatio-temporal excitation sub-module uses a single-channel three-dimensional convolution to represent spatio-temporal features;
the channel excitation sub-module adaptively calibrates the feature responses of the channels based on the interdependencies between channels;
the motion excitation sub-module calculates a time difference at the feature level to excite motion-sensitive channels.
Further, the policy network adaptively selects different frame scales to improve driving behavior recognition efficiency, including:
at time step t < T_0, frame I_t is resized to the lowest resolution and sent to the feature extractor, where T_0 is a set time period and I_t is the driver state image frame at time t;
the long short-term memory module updates the hidden state and output using the extracted features and the previous state;
given the hidden state, the policy distribution is estimated and the action a_t at time t is sampled via a Gumbel-Softmax operation;
if a_t < L, the frame is resized to spatial resolution 3 × H_{a_t} × W_{a_t} and forwarded to the corresponding backbone network to obtain a frame-level prediction, where L is the number of resolution categories of the state image, and H_{a_t} and W_{a_t} are the height and width of the image at time t for action a_t;
if a_t ≥ L, the backbone network skips prediction on the current frame, and the policy network skips the following F_{a_t-L} - 1 frames, where F_{a_t-L} is the skip length when a_t ≥ L.
Further, the spatio-temporal excitation sub-module uses a single-channel three-dimensional convolution to represent spatio-temporal features, specifically comprising:
for a given input image X ∈ R^(N×T×C×H×W), the input tensors are averaged over the channel axis to obtain a global spatio-temporal tensor F ∈ R^(N×T×1×H×W); F is then reshaped to F* ∈ R^(N×T×1×H×W) and fed to a three-dimensional convolution layer K with kernel size 3 × 3, obtaining F*_o; finally, F*_o is reshaped to F^o ∈ R^(N×T×1×H×W) and fed to a Sigmoid activation to obtain a spatio-temporal mask M ∈ R^(N×T×1×H×W), and the final output Y is: Y = X + X ⊙ M;
where ⊙ denotes element-wise multiplication of the spatio-temporal mask M with all channel inputs X; T is the number of segments into which the video corresponding to the image is divided; N is the batch size for the T segments; C is the number of image channels; H is the image height; W is the image width.
Further, the channel excitation sub-module adaptively calibrates the feature responses of the channels based on the interdependencies between channels, specifically comprising:
for a given input image X ∈ R^(N×T×C×H×W), the global spatial information F ∈ R^(N×T×C×1×1) of the input elements is first obtained by averaging the input; the number of channels of F is compressed by a ratio r to obtain F_r = K_1 * F, where K_1 is a 1 × 1 two-dimensional convolution layer and F_r ∈ R^(N×T×C/r×1×1);
F_r is then reshaped to F*_r so that temporal reasoning can be performed; a one-dimensional convolution layer K_2 with kernel size 3 processes F*_r to obtain F*_temp = K_2 * F*_r, where F*_temp ∈ R^(N×C/r×T);
F*_temp is reshaped back to F_temp ∈ R^(N×T×C/r×1×1), which is then decompressed by a 1 × 1 two-dimensional convolution layer K_3 to obtain F^o = K_3 * F_temp, and fed to a Sigmoid activation to obtain the channel mask M, where F^o ∈ R^(N×T×C×1×1) and M ∈ R^(N×T×C×1×1);
the final output Y is: Y = X + X ⊙ M.
Further, the motion excitation sub-module calculates a time difference at the feature level to excite motion-sensitive channels, comprising:
for a given input image X ∈ R^(N×T×C×H×W), the number of channels is compressed by a ratio r using a 1 × 1 two-dimensional convolution layer to obtain F_r ∈ R^(N×T×C/r×H×W); a 1 × 1 two-dimensional convolution layer is later used to decompress F_r;
the motion features are modeled as F_m = K * F_r[:, t+1, :, :, :] - F_r[:, t, :, :, :];
where K is a 3 × 3 two-dimensional convolution layer, F_r[:, t+1, :, :, :] denotes the compressed feature map at time t+1, and F_r[:, t, :, :, :] denotes the compressed feature map at time t;
the motion features are concatenated along the time dimension, with 0 padded as the last element, as follows:
F_m = [F_m(1), ..., F_m(t-1), 0]; where F_m(t-1) is the (t-1)-th motion representation;
F_m is then averaged to obtain the global spatial information of the input elements.
Further, the vehicle track recognition model includes an improved DeepLabv3+ network;
the improved DeepLabv3+ network takes DeepLabv3+ as the basic framework, replaces the DeepLabv3+ backbone network Xception with the lightweight network MobileNetv2, adds a channel attention mechanism module, and replaces the ASPP structure in the DeepLabv3+ network with Dense-ASPP;
the channel attention mechanism module is used to focus attention among the channels of the feature map.
Further, the method further comprises the following steps:
training a vehicle distance recognition model with a front-vehicle distance data set to obtain a trained vehicle distance recognition model; inputting the acquired front-vehicle distance image into the trained vehicle distance recognition model and outputting a front-vehicle distance recognition result; judging whether the front-vehicle distance recognition result is smaller than a distance threshold; if so, reminding the driver to correct the driving behavior; if not, no processing is performed.
Further, the vehicle distance recognition model includes an improved YOLOv5 network;
the improved YOLOv5 network uses YOLOv5 as the basic framework, replaces the convolution operations of YOLOv5 with the Ghost Module from GhostNet, and introduces the Coordinate Attention mechanism, which embeds position information into the channels.
Further, if the front vehicle is directly in front of the host vehicle, the distance d between the host vehicle and the front vehicle is determined according to the following formula:
where h is the vertical distance between the camera mounted on the host vehicle and the front vehicle; θ is the camera pitch angle; the intersection point of the camera lens optical axis with the image plane is O(x, y), the focal length is f, the imaging point of the front-vehicle bottom center point on the image plane is D(u, v), and the angle between the line from the front-vehicle bottom center point to the camera and the lens optical axis is α;
if the front vehicle is diagonally ahead of the host vehicle, the distance D between the host vehicle and the front vehicle is determined according to the following formula:
where γ is the yaw angle of the front vehicle.
The beneficial effects of the invention are as follows: the invention discloses a vehicle running risk early warning method for integrated sensing of driver and vehicle state. Based on in-vehicle video, driver distraction and fatigue detection is combined with lane departure and front-vehicle distance detection into an integrated detection, so that the driver's driving state and the vehicle's running track are sensed synchronously in real time, the vehicle's running state undergoes real-time risk assessment, risks are warned in advance, and the driver is reminded by voice to correct the driving behavior and return to a safe driving state as soon as possible, thereby improving the safe driving level of the vehicle.
Drawings
The invention is further described below with reference to the accompanying drawings and examples:
FIG. 1 is a schematic diagram of a vehicle driving risk early warning process according to the present invention;
FIG. 2 is a schematic diagram of video key frame extraction according to the present invention;
FIG. 3 is a flow chart of driver distraction and fatigue driving behavior identification in accordance with the present invention;
FIG. 4 is a schematic diagram of the SCM module architecture of ResNet-50 of the present invention;
FIG. 5 is a schematic diagram of the principle of operation of the spatio-temporal excitation submodule of the present invention;
FIG. 6 is a schematic diagram of the working principle of the channel excitation sub-module of the present invention;
FIG. 7 is a schematic diagram of the principle of operation of the motion-activated sub-module of the present invention;
FIG. 8 is a schematic diagram of a modified deep labv3+ network structure of the present invention;
FIG. 9 is a schematic diagram showing a lane departure frame image according to the present invention;
FIG. 10 is a schematic diagram of a vehicle distance measurement flow chart according to the present invention;
FIG. 11 is a schematic diagram of a modified YOLOv5 network architecture of the present invention;
FIG. 12 is a schematic diagram of pitch angle based distance measurement principles of the present invention;
fig. 13 is a schematic diagram of a principle of pitch angle and yaw angle based vehicle distance measurement according to the present invention.
Detailed Description
The invention is further described with reference to the accompanying drawings, in which:
the invention relates to a vehicle running risk early warning method facing to the integrated perception of a driver and a vehicle state, which comprises the following steps:
creating a driving state data set of a driver; training the driving behavior recognition model by using the driving state data set to obtain a trained driving behavior recognition model;
creating a lane departure data set; training the vehicle track recognition model by using the lane offset data set to obtain a trained vehicle track recognition model;
inputting the acquired driver images into a trained driving behavior recognition model, and outputting a driving behavior recognition result of the driver;
inputting the acquired vehicle running state image into a trained vehicle track recognition model, and outputting a lane deviation recognition result;
judging whether the driving behavior recognition result and/or the lane departure recognition result meets the early warning requirement; if so, reminding the driver to correct the driving behavior and drive in the correct lane; if not, no processing is performed.
As shown in fig. 1, the present invention obtains a real-time driving behavior sequence of the driver and a real-time running state sequence of the vehicle through a vehicle-mounted bidirectional camera. After data preprocessing operations such as cropping and scaling (to obtain video frame sizes that meet the model input requirements), the sequences are input into the trained recognition models. When the driving behavior recognition model recognizes fatigued or distracted driving, the vehicle track recognition model recognizes lane departure, or the vehicle distance recognition model recognizes that the front vehicle is too close, the driver behavior and the vehicle track are considered together and the driving safety risk is evaluated.
Further, the recognized driving behavior and vehicle state can be quantified continuously to achieve a more accurate and timely warning. For example, if distracted driving is recognized in every driver image for 2 consecutive seconds, or distracted driving is recognized for 1 consecutive second while lane departure is detected in the vehicle running state image, the current vehicle is considered to be at driving risk; at this point early warning is triggered and the driver is reminded by voice to correct the driving behavior and quickly return to a safe driving state, thereby improving driving safety.
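As an illustration only, the following Python sketch shows one way this fusion logic could be quantified, assuming a frame rate of 40 frames per second (consistent with the 20 frames per 0.5 s mentioned later); the function name and thresholds are illustrative assumptions, not values specified by the invention:

# Hypothetical sketch of the warning-decision fusion described above.
# Assumes 40 frames per second, so 2 s of distraction corresponds to 80
# consecutive frames and 1 s to 40 frames; thresholds are illustrative.

FPS = 40  # assumed frames per second

def should_warn(distraction_flags, lane_departure_flags):
    """distraction_flags / lane_departure_flags: lists of per-frame booleans,
    most recent frame last."""
    last_2s = distraction_flags[-2 * FPS:]
    last_1s = distraction_flags[-1 * FPS:]
    lane_1s = lane_departure_flags[-1 * FPS:]

    # Rule 1: distraction detected in every frame of the last 2 seconds.
    rule_continuous_distraction = len(last_2s) == 2 * FPS and all(last_2s)
    # Rule 2: distraction throughout the last second AND lane departure detected.
    rule_distraction_plus_departure = (
        len(last_1s) == 1 * FPS and all(last_1s) and any(lane_1s)
    )
    return rule_continuous_distraction or rule_distraction_plus_departure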
In the present embodiment, driving behavior is treated as a continuous motion. Compared with methods that recognize from a single image, using a video frame sequence as input and recognizing the driver's state from the temporal, spatial and motion characteristics of the input data yields better recognition accuracy. The invention therefore detects driver fatigue and distraction based on adaptive frame resolution.
Because of the large amount of redundancy from static scenes or very low frame quality (blur, low light, etc.), processing every frame of the video is often unnecessary and inefficient. A frame-skip mechanism is therefore designed alongside adaptive frame-resolution selection in a unified framework using the policy network: frames are skipped (i.e., their resolution is set to zero) when appropriate, further improving the efficiency of action recognition. Meanwhile, a two-dimensional convolutional neural network (CNN) cannot capture long-range temporal relationships, while a three-dimensional CNN incurs a large computational cost. Therefore, the input video is processed by the policy network, and a mixed attention mechanism module is embedded in the two-dimensional CNN backbone network.
The driving behavior recognition model comprises a policy network and a two-dimensional convolutional neural network. The policy network comprises a feature extractor and a long short-term memory module. A mixed attention mechanism module is embedded in the backbone network of the two-dimensional CNN; it comprises a spatio-temporal excitation sub-module (STE), a channel excitation sub-module (CE) and a motion excitation sub-module (ME). The STE uses a single-channel three-dimensional convolution to represent spatio-temporal features; the CE adaptively calibrates the feature responses of the channels based on the interdependencies between channels; the ME calculates a time difference at the feature level to excite motion-sensitive channels.
In the invention, a driving state data set covering normal driving, distracted driving and fatigued driving is constructed from YAWDD and other public data sets and split into training, validation and test sets at a ratio of 6:2:2; the driving behavior recognition model is trained and validated on these sets, and the trained model is packaged and integrated into the system.
In this embodiment, the policy network adaptively selects different frame scales to improve driving behavior recognition efficiency. A series of resolutions is written in decreasing order as {S_0, S_1, ..., S_{L-1}}, where S_0 = (H_0, W_0) is the original (and highest) frame resolution and S_{L-1} = (H_{L-1}, W_{L-1}) is the lowest resolution. The frame at time t in the l-th scale is denoted I_t^l. Frame skipping is a special case of "selecting resolution S_L". The skip sequence (in ascending order) is defined as {F_1, ..., F_M}, where the i-th skip operation skips the current frame and the following (F_i - 1) frames. The choices of resolution and skipping together form the action space Ω.
The policy network includes a lightweight feature extractor Φ(·; θ_Φ) and a long short-term memory (LSTM) module.
At time step t < T_0, frame I_t is resized to the lowest resolution I_t^{L-1} and sent to the feature extractor:
f_t = Φ(I_t^{L-1}; θ_Φ)   (1)
where T_0 is a set time period, I_t is the driver state image frame at time t, f_t is the feature vector, and θ_Φ denotes the learnable parameters.
The LSTM updates the hidden state h_t and output o_t using the extracted features and the previous states:
[h_t, o_t] = LSTM(f_t, h_{t-1}, o_{t-1}; θ_LSTM)   (2)
Given the hidden state, the policy network estimates the policy distribution and samples the action a_t ∈ Ω = {0, 1, ..., L+M-1} via a Gumbel-Softmax operation:
a_t ~ GUMBEL(h_t, θ_G)   (3)
If a_t < L, the frame is resized to spatial resolution 3 × H_{a_t} × W_{a_t} and forwarded to the corresponding backbone network Ψ_{a_t} to obtain a frame-level prediction:
y_t = Ψ_{a_t}(I_t^{a_t}; θ_{Ψ_{a_t}})   (4)
where I_t^{a_t} is the resized frame and y_t is the predicted value; L is the number of resolution categories of the state image, and H_{a_t} and W_{a_t} are the height and width of the image at time t for action a_t.
If a_t ≥ L, the backbone network skips prediction on the current frame, and the policy network skips the following F_{a_t-L} - 1 frames, where F_{a_t-L} is the skip length when a_t ≥ L.
Furthermore, to save computation, the lowest-resolution policy and prediction can be generated with a shared policy network (with Φ' denoting the corresponding feature vector).
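A minimal PyTorch sketch of the policy network in equations (1)-(3) is given below for illustration; the layer sizes, the tiny feature extractor, and the default action-space sizes are assumptions, not values taken from the invention:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNetwork(nn.Module):
    """Sketch of equations (1)-(3): a lightweight feature extractor plus an
    LSTM cell, with Gumbel-Softmax sampling of an action over L resolutions
    and M skip choices. All layer sizes here are illustrative assumptions."""

    def __init__(self, feat_dim=256, hidden_dim=256, num_resolutions=4, num_skips=3):
        super().__init__()
        self.extractor = nn.Sequential(            # Φ(·; θ_Φ), e.g. a tiny CNN
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTMCell(feat_dim, hidden_dim)       # eq. (2)
        self.action_head = nn.Linear(hidden_dim, num_resolutions + num_skips)

    def forward(self, frame_lowres, h, c, tau=1.0):
        f_t = self.extractor(frame_lowres)                   # eq. (1)
        h, c = self.lstm(f_t, (h, c))                        # eq. (2)
        logits = self.action_head(h)
        # eq. (3): differentiable hard sample a_t ~ Gumbel-Softmax(logits)
        a_onehot = F.gumbel_softmax(logits, tau=tau, hard=True)
        a_t = a_onehot.argmax(dim=-1)                        # action index in Ω
        return a_t, a_onehot, h, c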
In this embodiment, in order to obtain more accurate predictions, an SCM module is added to the backbone network. The SCM module consists of three sub-modules: the spatio-temporal excitation sub-module (STE), the channel excitation sub-module (CE) and the motion excitation sub-module (ME).
The STE excites spatio-temporal information with a 3D convolution; unlike a conventional three-dimensional convolution, the module averages all channels to obtain global spatio-temporal features, which significantly reduces the computation of the three-dimensional convolution, and the output of the STE contains global spatio-temporal information. The CE activates channel correlations for temporal information, and its output contains channel correlations from a temporal perspective. The ME exploits motion inferred from the video by modeling the differences between adjacent frames at the feature level, and is then combined with the modules above to exploit the rich information contained in the video.
All tensors outside the SCM module are 4D, i.e., (N (batch size) × T (number of segments), C (channels), H (height), W (width)). The input 4D tensor is reshaped into a 5D tensor (N, T, C, H, W) and then fed into the SCM module so that specific dimensions can be operated on inside the module. The 5D output tensor is reshaped back to 4D before being fed to the next 2D convolution block. In this way, the output of the SCM module perceives information from the spatio-temporal perspective, channel correlations and motion.
FIG. 4 shows the ResNet-50-SCM architecture, with an SCM module inserted at the beginning of each residual block. The figure gives the output feature-map size of each ResNet-50 layer (CLS denotes the number of classes and T the number of segments). The input video is first divided evenly into T segments, and one frame is randomly sampled from each segment of the video processed by the policy network.
The STE effectively models spatio-temporal information using three-dimensional convolution. In this stage, the STE generates a spatio-temporal mask M ∈ R^(N×T×1×H×W) that is multiplied element-wise with the all-channel input X ∈ R^(N×T×C×H×W).
As shown in FIG. 5, for a given image input X ∈ R^(N×T×C×H×W), the input tensors are averaged over the channel axis to obtain a global spatio-temporal tensor F ∈ R^(N×T×1×H×W). F is then reshaped to F* ∈ R^(N×T×1×H×W) and fed to a three-dimensional convolution layer K with kernel size 3 × 3:
F*_o = K * F*   (5)
Finally, F*_o is reshaped to F^o ∈ R^(N×T×1×H×W) and fed to a Sigmoid activation to obtain the spatio-temporal mask M ∈ R^(N×T×1×H×W). This can be expressed as:
M = δ(F^o)   (6)
The final output is:
Y = X + X ⊙ M   (7)
where ⊙ denotes element-wise multiplication of the spatio-temporal mask M with all channel inputs X.
T denotes the number of segments into which the video corresponding to the image is divided; N denotes the batch size for the T segments; C denotes the number of image channels; H denotes the image height; W denotes the image width.
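The following PyTorch sketch illustrates the STE computation of equations (5)-(7); treating the channel-averaged tensor as a single-channel 3D volume over (T, H, W) for the three-dimensional convolution, and the 3×3×3 kernel, are interpretations assumed here:

import torch
import torch.nn as nn

class SpatioTemporalExcitation(nn.Module):
    """Sketch of the STE sub-module (equations (5)-(7))."""

    def __init__(self):
        super().__init__()
        # single-channel 3D convolution K; a 3x3x3 kernel is assumed
        self.conv3d = nn.Conv3d(1, 1, kernel_size=3, padding=1, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (N, T, C, H, W)
        f = x.mean(dim=2, keepdim=True)           # F: (N, T, 1, H, W), channel average
        f_star = f.permute(0, 2, 1, 3, 4)         # F*: (N, 1, T, H, W) for Conv3d
        f_star_o = self.conv3d(f_star)            # eq. (5): F*_o = K * F*
        f_o = f_star_o.permute(0, 2, 1, 3, 4)     # F^o: (N, T, 1, H, W)
        m = self.sigmoid(f_o)                     # eq. (6): M = δ(F^o)
        return x + x * m                          # eq. (7): Y = X + X ⊙ M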
The design of the CE is similar to the STE block, as shown in fig. 6.
Given an input X ∈ R^(N×T×C×H×W), the global spatial information of the input elements is first obtained by averaging the input spatially, which can be expressed as:
F = AvgPool(X)   (8)
where F ∈ R^(N×T×C×1×1). Compressing the number of channels of F by the ratio r (the channel compression ratio) can be written as:
F_r = K_1 * F   (9)
where K_1 is a 1 × 1 two-dimensional convolution layer and F_r ∈ R^(N×T×C/r×1×1).
F_r is then reshaped to F*_r so that temporal reasoning can be performed. A one-dimensional convolution layer K_2 with kernel size 3 processes F*_r:
F*_temp = K_2 * F*_r   (10)
where F*_temp ∈ R^(N×C/r×T). F*_temp is then reshaped back to F_temp ∈ R^(N×T×C/r×1×1), decompressed by a 1 × 1 two-dimensional convolution layer K_3, and fed to a Sigmoid activation. These last two steps yield the channel mask M and can be written respectively as:
F^o = K_3 * F_temp   (11)
M = δ(F^o)   (12)
where F^o ∈ R^(N×T×C×1×1) and M ∈ R^(N×T×C×1×1). Finally, the output of the CE is formulated in the same way as equation (7), using the newly generated mask.
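A PyTorch sketch of the CE computation of equations (8)-(12) follows; the channel compression ratio r = 16 and the exact reshaping steps are assumed defaults for illustration:

import torch
import torch.nn as nn

class ChannelExcitation(nn.Module):
    """Sketch of the CE sub-module (equations (8)-(12))."""

    def __init__(self, channels, r=16):
        super().__init__()
        reduced = max(channels // r, 1)
        self.squeeze = nn.Conv2d(channels, reduced, kernel_size=1, bias=False)   # K_1
        self.temporal = nn.Conv1d(reduced, reduced, kernel_size=3, padding=1,
                                  bias=False)                                    # K_2
        self.expand = nn.Conv2d(reduced, channels, kernel_size=1, bias=False)    # K_3
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        f = x.mean(dim=(3, 4), keepdim=True)                  # eq. (8): (N, T, C, 1, 1)
        f = self.squeeze(f.view(n * t, c, 1, 1))              # eq. (9): (N*T, C/r, 1, 1)
        cr = f.shape[1]
        f_star = f.view(n, t, cr).permute(0, 2, 1)            # (N, C/r, T) for the temporal conv
        f_temp = self.temporal(f_star)                        # eq. (10)
        f_temp = f_temp.permute(0, 2, 1).reshape(n * t, cr, 1, 1)
        f_o = self.expand(f_temp)                             # eq. (11)
        m = self.sigmoid(f_o).view(n, t, c, 1, 1)             # eq. (12)
        return x + x * m                                      # same form as eq. (7)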
The ME is used in parallel with the STE and CE modules described above; as shown in fig. 7, motion information is modeled from adjacent frames.
The same compression and decompression strategy as in the CE sub-module is used, with two 1 × 1 two-dimensional convolution layers applied as in equations (9) and (11), respectively. Given the compressed feature F_r, the motion features are modeled by a similar operation:
F_m = K * F_r[:, t+1, :, :, :] - F_r[:, t, :, :, :]   (13)
where K is a 3 × 3 two-dimensional convolution layer, F_r[:, t+1, :, :, :] denotes the compressed feature map at time t+1, and F_r[:, t, :, :, :] denotes the compressed feature map at time t.
The motion features are concatenated along the time dimension, with 0 padded as the last element: F_m = [F_m(1), ..., F_m(t-1), 0], where F_m(t-1) is the (t-1)-th motion representation.
F_m is then processed by the same spatial averaging as in equation (8), i.e., F_m is averaged spatially to obtain the global spatial information of the input elements.
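The following PyTorch sketch illustrates the ME computation of equation (13) together with the zero-padded temporal concatenation and spatial averaging; the compression ratio r = 16 and the final mask-and-residual form (mirroring equation (7)) are assumptions made for illustration:

import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    """Sketch of the ME sub-module (equation (13) plus zero-padded concatenation)."""

    def __init__(self, channels, r=16):
        super().__init__()
        reduced = max(channels // r, 1)
        self.squeeze = nn.Conv2d(channels, reduced, 1, bias=False)           # compression
        self.motion = nn.Conv2d(reduced, reduced, 3, padding=1, bias=False)  # K in eq. (13)
        self.expand = nn.Conv2d(reduced, channels, 1, bias=False)            # decompression
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        f_r = self.squeeze(x.reshape(n * t, c, h, w)).view(n, t, -1, h, w)
        # eq. (13): F_m(t) = K * F_r[:, t+1] - F_r[:, t], for each adjacent pair
        nxt = self.motion(f_r[:, 1:].reshape(n * (t - 1), -1, h, w)).view(n, t - 1, -1, h, w)
        f_m = nxt - f_r[:, :-1]
        # concatenate along time and pad the last element with zeros
        f_m = torch.cat([f_m, torch.zeros_like(f_m[:, :1])], dim=1)          # (N, T, C/r, H, W)
        f_m = f_m.mean(dim=(3, 4), keepdim=True)                             # spatial average, as eq. (8)
        m = self.sigmoid(self.expand(f_m.reshape(n * t, -1, 1, 1))).view(n, t, c, 1, 1)
        return x + x * m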
In this embodiment, the vehicle track recognition model includes an improved DeepLabv3+ network.
The improved DeepLabv3+ network takes DeepLabv3+ as the basic framework, replaces the DeepLabv3+ backbone network Xception with the lightweight network MobileNetv2, adds a channel attention mechanism module, and replaces the ASPP structure in the DeepLabv3+ network with Dense-ASPP.
The channel attention mechanism module is used to focus attention among the channels of the feature map.
First, video captured by the dashboard camera is split into frames and screenshots, and the captured images are combined with the public TuSimple and CULane data sets to form a lane offset data set for lane line detection. Second, the data sets are preprocessed, e.g., the illuminance of darker images is enhanced, and manual annotation is performed so that the model can learn the deep features of lane lines. Finally, the annotated data set is fed into the vehicle track recognition model for training; the degree of training is monitored through the loss function, and if the training effect is poor, the obtained parameters are propagated back through the network and training continues until an ideal lane line detection model is obtained.
The invention builds a lightweight network model that extracts features more effectively, taking DeepLabv3+ as the basic framework and adding a dense structure to address the incomplete feature extraction of the original network. The invention also adds an attention mechanism to increase the focus on important features, which benefits model training and real-time detection accuracy.
As shown in fig. 8, the DeepLabv3+ backbone network Xception is replaced with the lightweight network MobileNetv2, which has a small number of parameters and high speed and is therefore well suited to real-time detection scenarios. To improve the performance of the improved model, the invention adds an SE (Squeeze-and-Excitation) module, i.e., a channel attention mechanism module. The SE module focuses attention among the channels of the feature map and automatically learns the importance of different channels.
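For illustration, a minimal SE (channel attention) block of the kind described here is sketched below in PyTorch; the reduction ratio of 16 and the torchvision MobileNetV2 backbone mentioned in the usage comment are assumptions:

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal Squeeze-and-Excitation (channel attention) block."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global average pooling
        self.fc = nn.Sequential(                     # excitation: two FC layers
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
        return x * w                                  # re-weight the channels of the feature map

# Illustrative use with a MobileNetV2 backbone from torchvision (assumed available):
# feats = torchvision.models.mobilenet_v2(weights=None).features(img)  # (N, 1280, H/32, W/32)
# feats = SEBlock(1280)(feats)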
To address the problem that increasing dilation rates make it more difficult to capture local features, densely connected atrous spatial pyramid pooling (Dense Atrous Spatial Pyramid Pooling, Dense-ASPP) is introduced, and the ASPP structure of the original network is replaced with Dense-ASPP.
Based on the above lane line detection, as shown in fig. 9, assume that the pixel coordinate of the front of the vehicle body in the image is (180, 0), D_L and D_R denote the distances from the vehicle center to the left and right lane lines, and (L_x, L_y) and (R_x, R_y) denote the pixel coordinates of the left and right lane lines in the image. The distance from the vehicle center to the left lane line is D_L = 180 - L_x, and similarly D_R = R_x - 180.
Acquiring 20 frames of images normally takes 0.5 seconds, so whether the vehicle is deviating can be judged from the continuous changes of D_L and D_R over these 20 frames. An array L stores the 20 D_L values and an array R stores the 20 D_R values.
Further, the lateral velocities V_L and V_R at this time are given by formulas (14) and (15):
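A minimal sketch of this departure check follows, assuming the lateral velocity is taken as the change of D_L (or D_R) over the 0.5 s window, which is one plausible reading of formulas (14) and (15); the departure threshold is an illustrative assumption:

# Sketch built on D_L = 180 - L_x and D_R = R_x - 180 over a 20-frame window.

WINDOW_SECONDS = 0.5  # 20 frames

def lateral_velocities(L, R):
    """L, R: arrays of 20 consecutive D_L / D_R values (pixels)."""
    v_l = (L[-1] - L[0]) / WINDOW_SECONDS   # > 0: moving away from the left line
    v_r = (R[-1] - R[0]) / WINDOW_SECONDS
    return v_l, v_r

def lane_departure(L, R, min_margin=30.0):
    """Flag departure when the vehicle gets too close to either lane line;
    the 30-pixel margin is an illustrative threshold."""
    return min(L[-1], R[-1]) < min_margin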
in this embodiment, the method further includes:
training the vehicle distance recognition model by using the front vehicle distance data set to obtain a trained vehicle distance recognition model; inputting the acquired front vehicle distance image into a trained vehicle distance recognition model, and outputting a front vehicle distance recognition result; judging whether the front vehicle distance recognition result is smaller than a distance threshold value, if so, reminding a driver to correct driving behaviors; if not, the processing is not performed.
The invention selects monocular vision for vehicle distance recognition and analysis; distance measurement is performed mainly from a monocular ranging model and the detected target box. Monocular ranging requires only one camera and some coordinate system conversions, and has the advantages of low computational cost and low power consumption.
The invention uses the dashboard camera of the driving vehicle as the main data-collection device; the vehicle collects the front-vehicle distance data set in real time in various scenarios such as busy urban roads, expressway sections and highways.
As shown in fig. 11, the vehicle distance recognition model uses YOLOv5 as the basic framework, modifies the backbone network accordingly, and incorporates attention mechanisms to meet the requirements of light weight, high accuracy, multi-scale vehicle detection, detection under complex environmental conditions, and real-time operation.
In the YOLOv5 backbone network, the CSP structure uses many convolution (Conv) operations, and the computational cost of convolution grows with the number of layers. To address this, the invention replaces the YOLOv5 convolution operations with the Ghost Module from GhostNet, which increases model speed and reduces the number of parameters without affecting accuracy. The Coordinate Attention (CA) mechanism is introduced to embed position information into the channels, overcoming the drawback of some attention mechanisms that ignore position information.
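A PyTorch sketch of the GhostNet Ghost Module used to replace ordinary convolutions is given below; the ratio of 2 and the 3 × 3 depthwise kernel are the common GhostNet defaults, assumed here:

import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Sketch of a Ghost Module: a small primary convolution plus cheap
    depthwise operations that generate the remaining 'ghost' feature maps."""

    def __init__(self, in_ch, out_ch, kernel_size=1, ratio=2, dw_size=3, stride=1):
        super().__init__()
        init_ch = out_ch // ratio                 # channels from the primary conv
        cheap_ch = out_ch - init_ch               # channels from the cheap operation
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel_size, stride, kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(               # depthwise conv generates the ghost maps
            nn.Conv2d(init_ch, cheap_ch, dw_size, 1, dw_size // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(cheap_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)   # (N, out_ch, H, W)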
After the vehicle in front of the host vehicle is detected, the pixel values of the corresponding target are extracted, and ranging is modeled based on the camera pitch angle and the front-vehicle yaw angle, as shown in fig. 12.
In the figure, the lower dashed line from the upper left corner is the lens optical axis, whose intersection with the image plane is O(x, y) (image-plane coordinate system), and the focal length is f; the upper dashed line from the upper left corner is the line from the detected front-vehicle bottom center point to the camera, whose imaging point on the image plane is D(u, v) (pixel coordinate system), and the angle between this line and the optical axis is α. With the camera mounted on the host vehicle, the horizontal distance from the camera to the front vehicle, i.e., the distance between the host vehicle and the front vehicle, is:
where h is the vertical distance between the camera and the front vehicle, and θ is the camera pitch angle (the angle between the camera lens optical axis and the horizontal plane).
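A minimal Python sketch of this pitch-angle ranging model follows, assuming the standard monocular form d = h / tan(θ + α) with α = arctan((v - y) / f); this form is consistent with the symbols defined above but is an assumed reconstruction rather than the formula reproduced from the original:

import math

def forward_distance(h, theta, f, y0, v):
    # h: vertical camera-to-vehicle distance; theta: camera pitch angle (rad)
    # f: focal length in pixels; y0: image-plane center row; v: imaged row of
    # the front-vehicle bottom center point
    alpha = math.atan2(v - y0, f)        # angle between the target ray and the optical axis
    return h / math.tan(theta + alpha)   # horizontal distance d to the front vehicle

# Example with assumed values: h = 1.3 m, theta = 2 deg, f = 1000 px,
# target imaged 60 px below the image center:
# print(forward_distance(1.3, math.radians(2.0), 1000.0, 360.0, 420.0))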
In an actual scenario, the front vehicle does not always run directly ahead of the host vehicle; it may be to the left or right, i.e., diagonally ahead of the host vehicle, as shown in fig. 13.
β is the horizontal angle between the camera's extrinsic orientation and the optical axis, and γ is the yaw angle of the front vehicle. B'(B_x, B_y) denotes the position, in the pixel coordinate system, of the landing point B of the yaw trajectory at the bottom center of the front vehicle, and O'(u, v) is the pixel center point. The distance based on the camera pitch angle and the yaw angle of the diagonally-ahead vehicle is:
in the formula (17), D is the distance between the vehicle and the front vehicle, and θ and γ respectively represent the pitch angle of the camera and the yaw angle of the front vehicle.
Since the pitch and yaw angles are difficult to obtain directly, a method for obtaining these angles in real time can be designed: lane lines are essentially parallel, and when imaged, the two lane lines eventually meet at a point, called the vanishing point. The vanishing point is used to calculate the yaw and pitch angles of the camera.
First, a Gabor filter is used to compute the texture direction of each pixel in the captured image; then, whether a pixel votes is determined from its confidence; finally, a fast local voting method determines the position of the vanishing point.
The functional expression of the Gabor filter is as follows:
where ω and φ represent the scale and direction respectively, and x, y represent the pixel coordinate position.
Since Gabor filtering yields texture responses in 36 directions for each pixel, and not all of them are reliable, a confidence measure is introduced; when it exceeds a set threshold, the texture in that direction is treated as a voting point. The confidence of a pixel d(x, y) in the image can be defined as:
where r_i(d) denotes the response value of the i-th direction at the pixel; the local maximum responses occur between r_5(d) and r_15(d), and the remaining directions rarely contribute. The threshold is set as:
t = 0.4 (max C(d) - min C(d))   (20)
When the confidence is greater than t, the pixel is a voting point. The final road vanishing point is then selected from the voting points and their confidences.
The principle of the voting algorithm is as follows: a candidate vanishing point H is selected, a circle is drawn with H as the center and a radius of one third of the image size, and half of this circle is used as the candidate region. The specific voting formula is as follows:
where α is the angle between the texture direction of a voting point P and the line segment PH, and d(P, H) denotes the distance from point P to the circle center; every pixel in the image is taken as a candidate point, and the point with the highest voting score is finally taken as the vanishing point.
After the vanishing point is obtained, the pitch angle and the yaw angle are respectively as follows:
where R is the camera image rotation matrix, R_xz denotes the matrix of rotation about the x- and z-axes, and R_yz denotes the matrix of rotation about the y- and z-axes.
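For illustration, the following sketch recovers the pitch and yaw from the vanishing point under a simple pinhole-camera assumption with focal length f (pixels) and principal point (u0, v0); the rotation-matrix formulation above is not reproduced, so this form is an assumption:

import math

def pitch_yaw_from_vanishing_point(u_vp, v_vp, u0, v0, f):
    # vanishing point above the image center implies the camera is tilted down
    pitch = math.atan2(v0 - v_vp, f)
    # horizontal offset of the vanishing point implies a yaw angle
    yaw = math.atan2(u_vp - u0, f)
    return pitch, yaw

# Example with an assumed 1280x720 image and f = 1000 px:
# pitch, yaw = pitch_yaw_from_vanishing_point(655.0, 330.0, 640.0, 360.0, 1000.0)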
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered by the scope of the claims of the present invention.

Claims (10)

1. A vehicle running risk early warning method facing to the integrated perception of a driver and a vehicle state is characterized in that: comprising the following steps:
creating a driving state data set of a driver; training the driving behavior recognition model by using the driving state data set to obtain a trained driving behavior recognition model;
creating a lane departure data set; training the vehicle track recognition model by using the lane offset data set to obtain a trained vehicle track recognition model;
inputting the acquired driver images into a trained driving behavior recognition model, and outputting a driving behavior recognition result of the driver;
inputting the acquired vehicle running state image into a trained vehicle track recognition model, and outputting a lane deviation recognition result;
judging whether the driving behavior recognition result and/or the lane deviation recognition result meet the early warning requirement, if so, reminding a driver to correct the driving behavior and drive on a correct lane; if not, the processing is not performed.
2. The vehicle running risk early warning method for driver and vehicle state integrated perception according to claim 1, characterized in that: the driving behavior recognition model comprises a strategy network and a two-dimensional convolutional neural network; the strategy network comprises a feature extractor and a long-term and short-term memory module; a mixed attention mechanism module is embedded in a backbone network of the two-dimensional convolutional neural network; the mixed attention mechanism module comprises a space-time excitation sub-module, a channel excitation sub-module and a motion excitation sub-module;
the space-time excitation submodule uses single-channel three-dimensional convolution to represent space-time characteristics;
the channel excitation submodule adaptively calibrates characteristic responses of the channels based on interdependencies between the channels;
the motion-excitation sub-module calculates a time difference at a feature level to stimulate a motion-sensitive channel.
3. The vehicle running risk early warning method for driver and vehicle state integrated perception according to claim 2, characterized in that: the strategy network adaptively selects different frame scales to improve driving behavior recognition efficiency, comprising the following steps:
at time step t < T_0, frame I_t is resized to the lowest resolution and sent to the feature extractor, where T_0 is a set time period and I_t is the driver state image frame at time t;
the long short-term memory module updates the hidden state and output using the extracted features and the previous state;
given the hidden state, the policy distribution is estimated and the action a_t at time t is sampled via a Gumbel-Softmax operation;
if a_t < L, the frame is resized to spatial resolution 3 × H_{a_t} × W_{a_t} and forwarded to the corresponding backbone network to obtain a frame-level prediction, where L is the number of resolution categories of the state image, and H_{a_t} and W_{a_t} are the height and width of the image at time t for action a_t;
if a_t ≥ L, the backbone network skips prediction on the current frame, and the policy network skips the following F_{a_t-L} - 1 frames, where F_{a_t-L} is the skip length when a_t ≥ L.
4. The vehicle running risk early warning method for driver and vehicle state integrated perception according to claim 2, characterized in that: the space-time excitation submodule uses single-channel three-dimensional convolution to represent space-time characteristics and specifically comprises the following steps:
for a given input image X ∈ R^(N×T×C×H×W), the input tensors are averaged over the channel axis to obtain a global spatio-temporal tensor F ∈ R^(N×T×1×H×W); F is then reshaped to F* ∈ R^(N×T×1×H×W) and fed to a three-dimensional convolution layer K with kernel size 3 × 3, obtaining F*_o; finally, F*_o is reshaped to F^o ∈ R^(N×T×1×H×W) and fed to a Sigmoid activation to obtain a spatio-temporal mask M ∈ R^(N×T×1×H×W), and the final output Y is: Y = X + X ⊙ M;
where ⊙ denotes element-wise multiplication of the spatio-temporal mask M with all channel inputs X; T is the number of segments into which the video corresponding to the image is divided; N is the batch size for the T segments; C is the number of image channels; H is the image height; W is the image width.
5. The vehicle running risk early warning method for driver and vehicle state integrated perception according to claim 2, characterized in that: the channel excitation submodule adaptively calibrates characteristic response of the channels based on interdependence among the channels, and specifically comprises the following steps:
for a given input image X ∈ R^(N×T×C×H×W), the global spatial information F ∈ R^(N×T×C×1×1) of the input elements is first obtained by averaging the input; the number of channels of F is compressed by a ratio r to obtain F_r = K_1 * F, where K_1 is a 1 × 1 two-dimensional convolution layer and F_r ∈ R^(N×T×C/r×1×1);
F_r is then reshaped to F*_r so that temporal reasoning can be performed; a one-dimensional convolution layer K_2 with kernel size 3 processes F*_r to obtain F*_temp = K_2 * F*_r, where F*_temp ∈ R^(N×C/r×T);
F*_temp is reshaped back to F_temp ∈ R^(N×T×C/r×1×1), which is then decompressed by a 1 × 1 two-dimensional convolution layer K_3 to obtain F^o = K_3 * F_temp, and fed to a Sigmoid activation to obtain the channel mask M, where F^o ∈ R^(N×T×C×1×1) and M ∈ R^(N×T×C×1×1);
the final output Y is: Y = X + X ⊙ M.
6. The vehicle running risk early warning method for driver and vehicle state integrated perception according to claim 2, characterized in that: the motion-excitation sub-module calculates a time difference at a feature level to stimulate a motion-sensitive channel, comprising:
for a given input image X ∈ R^(N×T×C×H×W), the number of channels is compressed by a ratio r using a 1 × 1 two-dimensional convolution layer to obtain F_r ∈ R^(N×T×C/r×H×W); a 1 × 1 two-dimensional convolution layer is later used to decompress F_r;
the motion features are modeled as F_m = K * F_r[:, t+1, :, :, :] - F_r[:, t, :, :, :];
where K is a 3 × 3 two-dimensional convolution layer, F_r[:, t+1, :, :, :] denotes the compressed feature map at time t+1, and F_r[:, t, :, :, :] denotes the compressed feature map at time t;
the motion features are concatenated along the time dimension, with 0 padded as the last element, as follows:
F_m = [F_m(1), ..., F_m(t-1), 0]; where F_m(t-1) is the (t-1)-th motion representation;
F_m is then averaged to obtain the global spatial information of the input elements.
7. The vehicle running risk early warning method for driver and vehicle state integrated perception according to claim 1, characterized in that: the vehicle track recognition model comprises an improved DeepLabv3+ network;
the improved DeepLabv3+ network takes DeepLabv3+ as the basic framework, replaces the DeepLabv3+ backbone network Xception with the lightweight network MobileNetv2, adds a channel attention mechanism module, and replaces the ASPP structure in the DeepLabv3+ network with Dense-ASPP;
the channel attention mechanism module is used for focusing attention among channels of the feature map.
8. The vehicle running risk early warning method for driver and vehicle state integrated perception according to claim 1, characterized in that: further comprises:
training the vehicle distance recognition model by using the front vehicle distance data set to obtain a trained vehicle distance recognition model; inputting the acquired front vehicle distance image into a trained vehicle distance recognition model, and outputting a front vehicle distance recognition result; judging whether the front vehicle distance recognition result is smaller than a distance threshold value, if so, reminding a driver to correct driving behaviors; if not, the processing is not performed.
9. The vehicle running risk early warning method for driver and vehicle state integrated perception of claim 8, wherein: the vehicle distance identification model comprises a modified YOLOv5 network;
the improved YOLOv5 network uses YOLOv5 as a basic framework, uses a Ghost Module in GhostNet to replace the convolution operation of YOLOv5, introduces an attention mechanism Coordinate Attention, and embeds position information into the channel.
10. The vehicle running risk early warning method for driver and vehicle state integrated perception of claim 8, wherein: if the front vehicle is right in front of the vehicle, determining a distance d between the vehicle and the front vehicle according to the following formula:
h is the distance between the camera arranged on the vehicle and the front vehicle in the vertical direction; θ is the camera pitch angle; the intersection point of the lens optical axis of the camera and the image plane is O (x, y), the focal length is f, the imaging point of the center point of the bottom of the front vehicle at the image plane is D (u, v), and the included angle between the straight line from the center point of the bottom of the front vehicle to the camera and the lens optical axis is alpha;
if the front vehicle is in front of the side of the vehicle, determining a distance D between the vehicle and the front vehicle according to the following formula:
wherein, gamma is the yaw angle of the front vehicle.
CN202311284729.1A 2023-10-07 2023-10-07 Vehicle running risk early warning method for driver and vehicle state integrated sensing Pending CN117292346A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311284729.1A CN117292346A (en) 2023-10-07 2023-10-07 Vehicle running risk early warning method for driver and vehicle state integrated sensing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311284729.1A CN117292346A (en) 2023-10-07 2023-10-07 Vehicle running risk early warning method for driver and vehicle state integrated sensing

Publications (1)

Publication Number Publication Date
CN117292346A true CN117292346A (en) 2023-12-26

Family

ID=89238683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311284729.1A Pending CN117292346A (en) 2023-10-07 2023-10-07 Vehicle running risk early warning method for driver and vehicle state integrated sensing

Country Status (1)

Country Link
CN (1) CN117292346A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117636270A (en) * 2024-01-23 2024-03-01 南京理工大学 Vehicle robbery event identification method and device based on monocular camera
CN117636270B (en) * 2024-01-23 2024-04-09 南京理工大学 Vehicle robbery event identification method and device based on monocular camera

Similar Documents

Publication Publication Date Title
CN110097109B (en) Road environment obstacle detection system and method based on deep learning
US11250296B2 (en) Automatic generation of ground truth data for training or retraining machine learning models
US9311711B2 (en) Image processing apparatus and image processing method
CN109334563B (en) Anti-collision early warning method based on pedestrians and riders in front of road
US11527078B2 (en) Using captured video data to identify pose of a vehicle
CN111860274B (en) Traffic police command gesture recognition method based on head orientation and upper half skeleton characteristics
JP6574611B2 (en) Sensor system for obtaining distance information based on stereoscopic images
JP4271720B1 (en) Vehicle periphery monitoring device
US11024042B2 (en) Moving object detection apparatus and moving object detection method
WO2021096629A1 (en) Geometry-aware instance segmentation in stereo image capture processes
CN110807352B (en) In-vehicle scene visual analysis method for dangerous driving behavior early warning
CN117292346A (en) Vehicle running risk early warning method for driver and vehicle state integrated sensing
EP2741234B1 (en) Object localization using vertical symmetry
JP2019106193A (en) Information processing device, information processing program and information processing method
TW202101965A (en) Sensor device and signal processing method
CN117015792A (en) System and method for generating object detection tags for automated driving with concave image magnification
CN113557524A (en) Method for representing a mobile platform environment
JP7269694B2 (en) LEARNING DATA GENERATION METHOD/PROGRAM, LEARNING MODEL AND EVENT OCCURRENCE ESTIMATING DEVICE FOR EVENT OCCURRENCE ESTIMATION
CN117115690A (en) Unmanned aerial vehicle traffic target detection method and system based on deep learning and shallow feature enhancement
JP6472504B1 (en) Information processing apparatus, information processing program, and information processing method
US20240089577A1 (en) Imaging device, imaging system, imaging method, and computer program
Kondyli et al. A 3D experimental framework for exploring drivers' body activity using infrared depth sensors
CN113450385B (en) Night work engineering machine vision tracking method, device and storage medium
CN112329566A (en) Visual perception system for accurately perceiving head movements of motor vehicle driver
CN110556024B (en) Anti-collision auxiliary driving method and system and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination