CN117292346A - Vehicle running risk early warning method for driver and vehicle state integrated sensing - Google Patents

Vehicle running risk early warning method for driver and vehicle state integrated sensing Download PDF

Info

Publication number
CN117292346A
CN117292346A CN202311284729.1A
Authority
CN
China
Prior art keywords
vehicle
driver
channel
time
early warning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311284729.1A
Other languages
Chinese (zh)
Inventor
俞山川
骆中斌
宋浪
李刚
王少飞
谢耀华
彭亚雪
周欣
陈晨
周盼
陈奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Merchants Chongqing Communications Research and Design Institute Co Ltd
Original Assignee
China Merchants Chongqing Communications Research and Design Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Merchants Chongqing Communications Research and Design Institute Co Ltd filed Critical China Merchants Chongqing Communications Research and Design Institute Co Ltd
Priority to CN202311284729.1A priority Critical patent/CN117292346A/en
Publication of CN117292346A publication Critical patent/CN117292346A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/588Recognition of the road, e.g. of lane markings; Recognition of the vehicle driving pattern in relation to the road
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0635Risk analysis of enterprise or organisation activities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention discloses a vehicle running risk early warning method oriented to integrated perception of driver and vehicle state, which comprises the following steps: training a driving behavior recognition model with a driving state data set to obtain a trained driving behavior recognition model; training a vehicle track recognition model with a lane offset data set to obtain a trained vehicle track recognition model; inputting acquired driver images into the trained driving behavior recognition model and outputting the driver's driving behavior recognition result; inputting acquired vehicle running state images into the trained vehicle track recognition model and outputting a lane departure recognition result; judging whether the driving behavior recognition result and/or the lane departure recognition result meets the early warning requirement, and if so, reminding the driver to correct the driving behavior and drive in the correct lane; if not, no processing is performed. The invention can synchronously sense the driver's driving state and the vehicle's running track in real time, and enhances the reliability of driving early warning.

Description

Vehicle running risk early warning method for driver and vehicle state integrated sensing
Technical Field
The invention relates to the field of vehicle driving early warning, in particular to a vehicle driving risk early warning method for integrally sensing the states of a driver and a vehicle.
Background
In recent years, collisions have remained the dominant type of road transport safety accident, accounting for 70.5% of accidents and 68.4% of fatalities, which exposes insufficient prevention and control of collision accidents involving road transport vehicles. Before a vehicle collision, driver fatigue or distraction, together with lane departure and insufficient following distance, is present in 40-50% of these accidents. Early warning of vehicle running risk is therefore of great significance for improving road safety and traffic management.
Intelligent assistance is increasingly applied in the field of vehicle running early warning. However, current vehicle running risk early warning technology suffers from false alarms or missed alarms of potential danger, and also fails to effectively identify complex traffic situations and risk factors, which greatly reduces the reliability of risk early warning. A vehicle running risk early warning method oriented to integrated perception of driver and vehicle state is therefore needed to solve these problems.
Disclosure of Invention
Therefore, the invention aims to overcome the defects in the prior art and provides a vehicle running risk early warning method for integrated sensing of driver and vehicle state, which can synchronously sense the driver's driving state and the vehicle's running track in real time, enhance the reliability of driving early warning, and improve the safe driving level of the vehicle.
The vehicle running risk early warning method of the invention, oriented to integrated perception of driver and vehicle state, comprises the following steps:
creating a driving state data set of a driver; training the driving behavior recognition model by using the driving state data set to obtain a trained driving behavior recognition model;
creating a lane departure data set; training the vehicle track recognition model by using the lane offset data set to obtain a trained vehicle track recognition model;
inputting the acquired driver images into a trained driving behavior recognition model, and outputting a driving behavior recognition result of the driver;
inputting the acquired vehicle running state image into a trained vehicle track recognition model, and outputting a lane deviation recognition result;
judging whether the driving behavior recognition result and/or the lane departure recognition result meets the early warning requirement; if so, reminding the driver to correct the driving behavior and drive in the correct lane; if not, no processing is performed.
Further, the driving behavior recognition model comprises a policy network and a two-dimensional convolutional neural network; the policy network comprises a feature extractor and a long short-term memory module; a mixed attention mechanism module is embedded in the backbone network of the two-dimensional convolutional neural network; the mixed attention mechanism module comprises a spatio-temporal excitation sub-module, a channel excitation sub-module and a motion excitation sub-module;
the spatio-temporal excitation sub-module uses a single-channel three-dimensional convolution to represent spatio-temporal features;
the channel excitation sub-module adaptively calibrates the feature responses of the channels based on the interdependencies between channels;
the motion excitation sub-module calculates a time difference at the feature level to excite motion-sensitive channels.
Further, the policy network adaptively selects different frame scales to improve driving behavior recognition efficiency, including:
at time step t < T_0, frame I_t is resized to the lowest resolution and sent to the feature extractor, where T_0 is a set time period and I_t is the driver state image frame at time t;
the long short-term memory module updates the hidden state and output using the extracted features and the previous state;
given the hidden state, the policy distribution is estimated and the action a_t at time t is sampled via a Gumbel-Softmax operation;
if a_t < L, the frame is resized to spatial resolution 3 × H_{a_t} × W_{a_t} and forwarded to the corresponding backbone network to obtain a frame-level prediction, where L is the number of resolution categories of the state image, and H_{a_t} and W_{a_t} are the height and width of the image at time t for action a_t;
if a_t ≥ L, the backbone network skips prediction on the current frame, and the policy network skips the following F_{a_t-L} - 1 frames, where F_{a_t-L} is the skip length when a_t ≥ L.
Further, the spatio-temporal excitation sub-module uses a single-channel three-dimensional convolution to represent spatio-temporal features, specifically comprising:
for a given input image X ∈ R^(N×T×C×H×W), the input tensors are averaged over the channel axis to obtain a global spatio-temporal tensor F ∈ R^(N×T×1×H×W); F is then reshaped to F* ∈ R^(N×T×1×H×W) and fed to a three-dimensional convolution layer K with kernel size 3 × 3, obtaining F*_o; finally, F*_o is reshaped to F^o ∈ R^(N×T×1×H×W) and fed to a Sigmoid activation to obtain a spatio-temporal mask M ∈ R^(N×T×1×H×W), and the final output Y is: Y = X + X ⊙ M;
where ⊙ denotes element-wise multiplication of the spatio-temporal mask M with all channel inputs X; T is the number of segments into which the video corresponding to the image is divided; N is the batch size for the T segments; C is the number of image channels; H is the image height; W is the image width.
Further, the channel excitation sub-module adaptively calibrates the feature responses of the channels based on the interdependencies between channels, specifically comprising:
for a given input image X ∈ R^(N×T×C×H×W), the global spatial information F ∈ R^(N×T×C×1×1) of the input elements is first obtained by averaging the input; the number of channels of F is compressed by a ratio r to obtain F_r = K_1 * F, where K_1 is a 1 × 1 two-dimensional convolution layer and F_r ∈ R^(N×T×C/r×1×1);
F_r is then reshaped to F*_r so that temporal reasoning can be performed; a one-dimensional convolution layer K_2 with kernel size 3 processes F*_r to obtain F*_temp = K_2 * F*_r, where F*_temp ∈ R^(N×C/r×T);
F*_temp is reshaped back to F_temp ∈ R^(N×T×C/r×1×1), which is then decompressed by a 1 × 1 two-dimensional convolution layer K_3 to obtain F^o = K_3 * F_temp, and fed to a Sigmoid activation to obtain the channel mask M, where F^o ∈ R^(N×T×C×1×1) and M ∈ R^(N×T×C×1×1);
the final output Y is: Y = X + X ⊙ M.
Further, the motion excitation sub-module calculates a time difference at the feature level to excite motion-sensitive channels, comprising:
for a given input image X ∈ R^(N×T×C×H×W), the number of channels is compressed by a ratio r using a 1 × 1 two-dimensional convolution layer to obtain F_r ∈ R^(N×T×C/r×H×W); a 1 × 1 two-dimensional convolution layer is later used to decompress F_r;
the motion features are modeled as F_m = K * F_r[:, t+1, :, :, :] - F_r[:, t, :, :, :];
where K is a 3 × 3 two-dimensional convolution layer, F_r[:, t+1, :, :, :] denotes the compressed feature map at time t+1, and F_r[:, t, :, :, :] denotes the compressed feature map at time t;
the motion features are concatenated along the time dimension, with 0 padded as the last element, as follows:
F_m = [F_m(1), ..., F_m(t-1), 0]; where F_m(t-1) is the (t-1)-th motion representation;
F_m is then averaged to obtain the global spatial information of the input elements.
Further, the vehicle track recognition model includes an improved DeepLabv3+ network;
the improved DeepLabv3+ network takes DeepLabv3+ as the basic framework, replaces the DeepLabv3+ backbone network Xception with the lightweight network MobileNetv2, adds a channel attention mechanism module, and replaces the ASPP structure in the DeepLabv3+ network with Dense-ASPP;
the channel attention mechanism module is used to focus attention among the channels of the feature map.
Further, the method further comprises the following steps:
training a vehicle distance recognition model with a front-vehicle distance data set to obtain a trained vehicle distance recognition model; inputting the acquired front-vehicle distance image into the trained vehicle distance recognition model and outputting a front-vehicle distance recognition result; judging whether the front-vehicle distance recognition result is smaller than a distance threshold; if so, reminding the driver to correct the driving behavior; if not, no processing is performed.
Further, the vehicle distance recognition model includes an improved YOLOv5 network;
the improved YOLOv5 network uses YOLOv5 as the basic framework, replaces the convolution operations of YOLOv5 with the Ghost Module from GhostNet, and introduces the Coordinate Attention mechanism, which embeds position information into the channels.
Further, if the front vehicle is directly in front of the host vehicle, the distance d between the host vehicle and the front vehicle is determined according to the following formula:
where h is the vertical distance between the camera mounted on the host vehicle and the front vehicle; θ is the camera pitch angle; the intersection point of the camera lens optical axis with the image plane is O(x, y), the focal length is f, the imaging point of the front-vehicle bottom center point on the image plane is D(u, v), and the angle between the line from the front-vehicle bottom center point to the camera and the lens optical axis is α;
if the front vehicle is diagonally ahead of the host vehicle, the distance D between the host vehicle and the front vehicle is determined according to the following formula:
where γ is the yaw angle of the front vehicle.
The beneficial effects of the invention are as follows: the invention discloses a vehicle running risk early warning method for integrated sensing of driver and vehicle state. Based on in-vehicle video, driver distraction and fatigue detection is combined with lane departure and front-vehicle distance detection into an integrated detection, so that the driver's driving state and the vehicle's running track are sensed synchronously in real time, the vehicle's running state undergoes real-time risk assessment, risks are warned in advance, and the driver is reminded by voice to correct the driving behavior and return to a safe driving state as soon as possible, thereby improving the safe driving level of the vehicle.
Drawings
The invention is further described below with reference to the accompanying drawings and examples:
FIG. 1 is a schematic diagram of a vehicle driving risk early warning process according to the present invention;
FIG. 2 is a schematic diagram of video key frame extraction according to the present invention;
FIG. 3 is a flow chart of driver distraction and fatigue driving behavior identification in accordance with the present invention;
FIG. 4 is a schematic diagram of the SCM module architecture of ResNet-50 of the present invention;
FIG. 5 is a schematic diagram of the principle of operation of the spatio-temporal excitation submodule of the present invention;
FIG. 6 is a schematic diagram of the working principle of the channel excitation sub-module of the present invention;
FIG. 7 is a schematic diagram of the principle of operation of the motion-activated sub-module of the present invention;
FIG. 8 is a schematic diagram of a modified deep labv3+ network structure of the present invention;
FIG. 9 is a schematic diagram showing a lane departure frame image according to the present invention;
FIG. 10 is a schematic diagram of a vehicle distance measurement flow chart according to the present invention;
FIG. 11 is a schematic diagram of a modified YOLOv5 network architecture of the present invention;
FIG. 12 is a schematic diagram of pitch angle based distance measurement principles of the present invention;
fig. 13 is a schematic diagram of a principle of pitch angle and yaw angle based vehicle distance measurement according to the present invention.
Detailed Description
The invention is further described with reference to the accompanying drawings, in which:
the invention relates to a vehicle running risk early warning method facing to the integrated perception of a driver and a vehicle state, which comprises the following steps:
creating a driving state data set of a driver; training the driving behavior recognition model by using the driving state data set to obtain a trained driving behavior recognition model;
creating a lane departure data set; training the vehicle track recognition model by using the lane offset data set to obtain a trained vehicle track recognition model;
inputting the acquired driver images into a trained driving behavior recognition model, and outputting a driving behavior recognition result of the driver;
inputting the acquired vehicle running state image into a trained vehicle track recognition model, and outputting a lane deviation recognition result;
judging whether the driving behavior recognition result and/or the lane departure recognition result meets the early warning requirement; if so, reminding the driver to correct the driving behavior and drive in the correct lane; if not, no processing is performed.
As shown in fig. 1, the present invention obtains a real-time driving behavior sequence of the driver and a real-time running state sequence of the vehicle through a vehicle-mounted bidirectional camera. After data preprocessing operations such as cropping and scaling (to obtain video frame sizes that meet the model input requirements), the sequences are input into the trained recognition models. When the driving behavior recognition model recognizes fatigued or distracted driving, the vehicle track recognition model recognizes lane departure, or the vehicle distance recognition model recognizes that the front vehicle is too close, the driver behavior and the vehicle track are considered together and the driving safety risk is evaluated.
Further, the recognized driving behavior and vehicle state can be quantified continuously to achieve a more accurate and timely warning. For example, if distracted driving is recognized in every driver image for 2 consecutive seconds, or distracted driving is recognized for 1 consecutive second while lane departure is detected in the vehicle running state image, the current vehicle is considered to be at driving risk; at this point early warning is triggered and the driver is reminded by voice to correct the driving behavior and quickly return to a safe driving state, thereby improving driving safety.
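As an illustration only, the following Python sketch shows one way this fusion logic could be quantified, assuming a frame rate of 40 frames per second (consistent with the 20 frames per 0.5 s mentioned later); the function name and thresholds are illustrative assumptions, not values specified by the invention:

# Hypothetical sketch of the warning-decision fusion described above.
# Assumes 40 frames per second, so 2 s of distraction corresponds to 80
# consecutive frames and 1 s to 40 frames; thresholds are illustrative.

FPS = 40  # assumed frames per second

def should_warn(distraction_flags, lane_departure_flags):
    """distraction_flags / lane_departure_flags: lists of per-frame booleans,
    most recent frame last."""
    last_2s = distraction_flags[-2 * FPS:]
    last_1s = distraction_flags[-1 * FPS:]
    lane_1s = lane_departure_flags[-1 * FPS:]

    # Rule 1: distraction detected in every frame of the last 2 seconds.
    rule_continuous_distraction = len(last_2s) == 2 * FPS and all(last_2s)
    # Rule 2: distraction throughout the last second AND lane departure detected.
    rule_distraction_plus_departure = (
        len(last_1s) == 1 * FPS and all(last_1s) and any(lane_1s)
    )
    return rule_continuous_distraction or rule_distraction_plus_departure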
In the present embodiment, driving behavior is treated as a continuous motion. Compared with methods that recognize from a single image, using a video frame sequence as input and recognizing the driver's state from the temporal, spatial and motion characteristics of the input data yields better recognition accuracy. The invention therefore detects driver fatigue and distraction based on adaptive frame resolution.
Because of the large amount of redundancy from static scenes or very low frame quality (blur, low light, etc.), processing every frame of the video is often unnecessary and inefficient. A frame-skip mechanism is therefore designed alongside adaptive frame-resolution selection in a unified framework using the policy network: frames are skipped (i.e., their resolution is set to zero) when appropriate, further improving the efficiency of action recognition. Meanwhile, a two-dimensional convolutional neural network (CNN) cannot capture long-range temporal relationships, while a three-dimensional CNN incurs a large computational cost. Therefore, the input video is processed by the policy network, and a mixed attention mechanism module is embedded in the two-dimensional CNN backbone network.
The driving behavior recognition model comprises a policy network and a two-dimensional convolutional neural network. The policy network comprises a feature extractor and a long short-term memory module. A mixed attention mechanism module is embedded in the backbone network of the two-dimensional CNN; it comprises a spatio-temporal excitation sub-module (STE), a channel excitation sub-module (CE) and a motion excitation sub-module (ME). The STE uses a single-channel three-dimensional convolution to represent spatio-temporal features; the CE adaptively calibrates the feature responses of the channels based on the interdependencies between channels; the ME calculates a time difference at the feature level to excite motion-sensitive channels.
In the invention, a driving state data set covering normal driving, distracted driving and fatigued driving is constructed from YAWDD and other public data sets and split into training, validation and test sets at a ratio of 6:2:2; the driving behavior recognition model is trained and validated on these sets, and the trained model is packaged and integrated into the system.
In this embodiment, the policy network adaptively selects different frame scales to improve driving behavior recognition efficiency. A series of resolutions is written in decreasing order as {S_0, S_1, ..., S_{L-1}}, where S_0 = (H_0, W_0) is the original (and highest) frame resolution and S_{L-1} = (H_{L-1}, W_{L-1}) is the lowest resolution. The frame at time t in the l-th scale is denoted I_t^l. Frame skipping is a special case of "selecting resolution S_L". The skip sequence (in ascending order) is defined as {F_1, ..., F_M}, where the i-th skip operation skips the current frame and the following (F_i - 1) frames. The choices of resolution and skipping together form the action space Ω.
The policy network includes a lightweight feature extractor Φ(·; θ_Φ) and a long short-term memory (LSTM) module.
At time step t < T_0, frame I_t is resized to the lowest resolution I_t^{L-1} and sent to the feature extractor:
f_t = Φ(I_t^{L-1}; θ_Φ)   (1)
where T_0 is a set time period, I_t is the driver state image frame at time t, f_t is the feature vector, and θ_Φ denotes the learnable parameters.
The LSTM updates the hidden state h_t and output o_t using the extracted features and the previous states:
[h_t, o_t] = LSTM(f_t, h_{t-1}, o_{t-1}; θ_LSTM)   (2)
Given the hidden state, the policy network estimates the policy distribution and samples the action a_t ∈ Ω = {0, 1, ..., L+M-1} via a Gumbel-Softmax operation:
a_t ~ GUMBEL(h_t, θ_G)   (3)
If a_t < L, the frame is resized to spatial resolution 3 × H_{a_t} × W_{a_t} and forwarded to the corresponding backbone network Ψ_{a_t} to obtain a frame-level prediction:
y_t = Ψ_{a_t}(I_t^{a_t}; θ_{Ψ_{a_t}})   (4)
where I_t^{a_t} is the resized frame and y_t is the predicted value; L is the number of resolution categories of the state image, and H_{a_t} and W_{a_t} are the height and width of the image at time t for action a_t.
If a_t ≥ L, the backbone network skips prediction on the current frame, and the policy network skips the following F_{a_t-L} - 1 frames, where F_{a_t-L} is the skip length when a_t ≥ L.
Furthermore, to save computation, the lowest-resolution policy and prediction can be generated with a shared policy network (with Φ' denoting the corresponding feature vector).
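A minimal PyTorch sketch of the policy network in equations (1)-(3) is given below for illustration; the layer sizes, the tiny feature extractor, and the default action-space sizes are assumptions, not values taken from the invention:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNetwork(nn.Module):
    """Sketch of equations (1)-(3): a lightweight feature extractor plus an
    LSTM cell, with Gumbel-Softmax sampling of an action over L resolutions
    and M skip choices. All layer sizes here are illustrative assumptions."""

    def __init__(self, feat_dim=256, hidden_dim=256, num_resolutions=4, num_skips=3):
        super().__init__()
        self.extractor = nn.Sequential(            # Φ(·; θ_Φ), e.g. a tiny CNN
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTMCell(feat_dim, hidden_dim)       # eq. (2)
        self.action_head = nn.Linear(hidden_dim, num_resolutions + num_skips)

    def forward(self, frame_lowres, h, c, tau=1.0):
        f_t = self.extractor(frame_lowres)                   # eq. (1)
        h, c = self.lstm(f_t, (h, c))                        # eq. (2)
        logits = self.action_head(h)
        # eq. (3): differentiable hard sample a_t ~ Gumbel-Softmax(logits)
        a_onehot = F.gumbel_softmax(logits, tau=tau, hard=True)
        a_t = a_onehot.argmax(dim=-1)                        # action index in Ω
        return a_t, a_onehot, h, c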
In this embodiment, in order to obtain more accurate predictions, an SCM module is added to the backbone network. The SCM module consists of three sub-modules: the spatio-temporal excitation sub-module (STE), the channel excitation sub-module (CE) and the motion excitation sub-module (ME).
The STE excites spatio-temporal information with a 3D convolution; unlike a conventional three-dimensional convolution, the module averages all channels to obtain global spatio-temporal features, which significantly reduces the computation of the three-dimensional convolution, and the output of the STE contains global spatio-temporal information. The CE activates channel correlations for temporal information, and its output contains channel correlations from a temporal perspective. The ME exploits motion inferred from the video by modeling the differences between adjacent frames at the feature level, and is then combined with the modules above to exploit the rich information contained in the video.
All tensors outside the SCM module are 4D, i.e., (N (batch size) × T (number of segments), C (channels), H (height), W (width)). The input 4D tensor is reshaped into a 5D tensor (N, T, C, H, W) and then fed into the SCM module so that specific dimensions can be operated on inside the module. The 5D output tensor is reshaped back to 4D before being fed to the next 2D convolution block. In this way, the output of the SCM module perceives information from the spatio-temporal perspective, channel correlations and motion.
FIG. 4 shows the ResNet-50-SCM architecture, with an SCM module inserted at the beginning of each residual block. The figure gives the output feature-map size of each ResNet-50 layer (CLS denotes the number of classes and T the number of segments). The input video is first divided evenly into T segments, and one frame is randomly sampled from each segment of the video processed by the policy network.
The STE effectively models spatio-temporal information using three-dimensional convolution. In this stage, the STE generates a spatio-temporal mask M ∈ R^(N×T×1×H×W) that is multiplied element-wise with the all-channel input X ∈ R^(N×T×C×H×W).
As shown in FIG. 5, for a given image input X ∈ R^(N×T×C×H×W), the input tensors are averaged over the channel axis to obtain a global spatio-temporal tensor F ∈ R^(N×T×1×H×W). F is then reshaped to F* ∈ R^(N×T×1×H×W) and fed to a three-dimensional convolution layer K with kernel size 3 × 3:
F*_o = K * F*   (5)
Finally, F*_o is reshaped to F^o ∈ R^(N×T×1×H×W) and fed to a Sigmoid activation to obtain the spatio-temporal mask M ∈ R^(N×T×1×H×W). This can be expressed as:
M = δ(F^o)   (6)
The final output is:
Y = X + X ⊙ M   (7)
where ⊙ denotes element-wise multiplication of the spatio-temporal mask M with all channel inputs X.
T denotes the number of segments into which the video corresponding to the image is divided; N denotes the batch size for the T segments; C denotes the number of image channels; H denotes the image height; W denotes the image width.
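The following PyTorch sketch illustrates the STE computation of equations (5)-(7); treating the channel-averaged tensor as a single-channel 3D volume over (T, H, W) for the three-dimensional convolution, and the 3×3×3 kernel, are interpretations assumed here:

import torch
import torch.nn as nn

class SpatioTemporalExcitation(nn.Module):
    """Sketch of the STE sub-module (equations (5)-(7))."""

    def __init__(self):
        super().__init__()
        # single-channel 3D convolution K; a 3x3x3 kernel is assumed
        self.conv3d = nn.Conv3d(1, 1, kernel_size=3, padding=1, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (N, T, C, H, W)
        f = x.mean(dim=2, keepdim=True)           # F: (N, T, 1, H, W), channel average
        f_star = f.permute(0, 2, 1, 3, 4)         # F*: (N, 1, T, H, W) for Conv3d
        f_star_o = self.conv3d(f_star)            # eq. (5): F*_o = K * F*
        f_o = f_star_o.permute(0, 2, 1, 3, 4)     # F^o: (N, T, 1, H, W)
        m = self.sigmoid(f_o)                     # eq. (6): M = δ(F^o)
        return x + x * m                          # eq. (7): Y = X + X ⊙ M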
The design of the CE is similar to the STE block, as shown in fig. 6.
Given an input X ∈ R^(N×T×C×H×W), the global spatial information of the input elements is first obtained by averaging the input spatially, which can be expressed as:
F = AvgPool(X)   (8)
where F ∈ R^(N×T×C×1×1). Compressing the number of channels of F by the ratio r (the channel compression ratio) can be written as:
F_r = K_1 * F   (9)
where K_1 is a 1 × 1 two-dimensional convolution layer and F_r ∈ R^(N×T×C/r×1×1).
F_r is then reshaped to F*_r so that temporal reasoning can be performed. A one-dimensional convolution layer K_2 with kernel size 3 processes F*_r:
F*_temp = K_2 * F*_r   (10)
where F*_temp ∈ R^(N×C/r×T). F*_temp is then reshaped back to F_temp ∈ R^(N×T×C/r×1×1), decompressed by a 1 × 1 two-dimensional convolution layer K_3, and fed to a Sigmoid activation. These last two steps yield the channel mask M and can be written respectively as:
F^o = K_3 * F_temp   (11)
M = δ(F^o)   (12)
where F^o ∈ R^(N×T×C×1×1) and M ∈ R^(N×T×C×1×1). Finally, the output of the CE is formulated in the same way as equation (7), using the newly generated mask.
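A PyTorch sketch of the CE computation of equations (8)-(12) follows; the channel compression ratio r = 16 and the exact reshaping steps are assumed defaults for illustration:

import torch
import torch.nn as nn

class ChannelExcitation(nn.Module):
    """Sketch of the CE sub-module (equations (8)-(12))."""

    def __init__(self, channels, r=16):
        super().__init__()
        reduced = max(channels // r, 1)
        self.squeeze = nn.Conv2d(channels, reduced, kernel_size=1, bias=False)   # K_1
        self.temporal = nn.Conv1d(reduced, reduced, kernel_size=3, padding=1,
                                  bias=False)                                    # K_2
        self.expand = nn.Conv2d(reduced, channels, kernel_size=1, bias=False)    # K_3
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        f = x.mean(dim=(3, 4), keepdim=True)                  # eq. (8): (N, T, C, 1, 1)
        f = self.squeeze(f.view(n * t, c, 1, 1))              # eq. (9): (N*T, C/r, 1, 1)
        cr = f.shape[1]
        f_star = f.view(n, t, cr).permute(0, 2, 1)            # (N, C/r, T) for the temporal conv
        f_temp = self.temporal(f_star)                        # eq. (10)
        f_temp = f_temp.permute(0, 2, 1).reshape(n * t, cr, 1, 1)
        f_o = self.expand(f_temp)                             # eq. (11)
        m = self.sigmoid(f_o).view(n, t, c, 1, 1)             # eq. (12)
        return x + x * m                                      # same form as eq. (7)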
The ME is used in parallel with the STE and CE modules described above; as shown in fig. 7, motion information is modeled from adjacent frames.
The same compression and decompression strategy as in the CE sub-module is used, with two 1 × 1 two-dimensional convolution layers applied as in equations (9) and (11), respectively. Given the compressed feature F_r, the motion features are modeled by a similar operation:
F_m = K * F_r[:, t+1, :, :, :] - F_r[:, t, :, :, :]   (13)
where K is a 3 × 3 two-dimensional convolution layer, F_r[:, t+1, :, :, :] denotes the compressed feature map at time t+1, and F_r[:, t, :, :, :] denotes the compressed feature map at time t.
The motion features are concatenated along the time dimension, with 0 padded as the last element: F_m = [F_m(1), ..., F_m(t-1), 0], where F_m(t-1) is the (t-1)-th motion representation.
F_m is then processed by the same spatial averaging as in equation (8), i.e., F_m is averaged spatially to obtain the global spatial information of the input elements.
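The following PyTorch sketch illustrates the ME computation of equation (13) together with the zero-padded temporal concatenation and spatial averaging; the compression ratio r = 16 and the final mask-and-residual form (mirroring equation (7)) are assumptions made for illustration:

import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    """Sketch of the ME sub-module (equation (13) plus zero-padded concatenation)."""

    def __init__(self, channels, r=16):
        super().__init__()
        reduced = max(channels // r, 1)
        self.squeeze = nn.Conv2d(channels, reduced, 1, bias=False)           # compression
        self.motion = nn.Conv2d(reduced, reduced, 3, padding=1, bias=False)  # K in eq. (13)
        self.expand = nn.Conv2d(reduced, channels, 1, bias=False)            # decompression
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        f_r = self.squeeze(x.reshape(n * t, c, h, w)).view(n, t, -1, h, w)
        # eq. (13): F_m(t) = K * F_r[:, t+1] - F_r[:, t], for each adjacent pair
        nxt = self.motion(f_r[:, 1:].reshape(n * (t - 1), -1, h, w)).view(n, t - 1, -1, h, w)
        f_m = nxt - f_r[:, :-1]
        # concatenate along time and pad the last element with zeros
        f_m = torch.cat([f_m, torch.zeros_like(f_m[:, :1])], dim=1)          # (N, T, C/r, H, W)
        f_m = f_m.mean(dim=(3, 4), keepdim=True)                             # spatial average, as eq. (8)
        m = self.sigmoid(self.expand(f_m.reshape(n * t, -1, 1, 1))).view(n, t, c, 1, 1)
        return x + x * m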
In this embodiment, the vehicle track recognition model includes an improved DeepLabv3+ network.
The improved DeepLabv3+ network takes DeepLabv3+ as the basic framework, replaces the DeepLabv3+ backbone network Xception with the lightweight network MobileNetv2, adds a channel attention mechanism module, and replaces the ASPP structure in the DeepLabv3+ network with Dense-ASPP.
The channel attention mechanism module is used to focus attention among the channels of the feature map.
First, video captured by the dashboard camera is split into frames and screenshots, and the captured images are combined with the public TuSimple and CULane data sets to form a lane offset data set for lane line detection. Second, the data sets are preprocessed, e.g., the illuminance of darker images is enhanced, and manual annotation is performed so that the model can learn the deep features of lane lines. Finally, the annotated data set is fed into the vehicle track recognition model for training; the degree of training is monitored through the loss function, and if the training effect is poor, the obtained parameters are propagated back through the network and training continues until an ideal lane line detection model is obtained.
The invention builds a lightweight network model that extracts features more effectively, taking DeepLabv3+ as the basic framework and adding a dense structure to address the incomplete feature extraction of the original network. The invention also adds an attention mechanism to increase the focus on important features, which benefits model training and real-time detection accuracy.
As shown in fig. 8, the DeepLabv3+ backbone network Xception is replaced with the lightweight network MobileNetv2, which has a small number of parameters and high speed and is therefore well suited to real-time detection scenarios. To improve the performance of the improved model, the invention adds an SE (Squeeze-and-Excitation) module, i.e., a channel attention mechanism module. The SE module focuses attention among the channels of the feature map and automatically learns the importance of different channels.
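For illustration, a minimal SE (channel attention) block of the kind described here is sketched below in PyTorch; the reduction ratio of 16 and the torchvision MobileNetV2 backbone mentioned in the usage comment are assumptions:

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal Squeeze-and-Excitation (channel attention) block."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global average pooling
        self.fc = nn.Sequential(                     # excitation: two FC layers
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
        return x * w                                  # re-weight the channels of the feature map

# Illustrative use with a MobileNetV2 backbone from torchvision (assumed available):
# feats = torchvision.models.mobilenet_v2(weights=None).features(img)  # (N, 1280, H/32, W/32)
# feats = SEBlock(1280)(feats)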
To address the problem that increasing dilation rates make it more difficult to capture local features, densely connected atrous spatial pyramid pooling (Dense Atrous Spatial Pyramid Pooling, Dense-ASPP) is introduced, and the ASPP structure of the original network is replaced with Dense-ASPP.
Based on the above lane line detection, as shown in fig. 9, assume that the pixel coordinate of the front of the vehicle body in the image is (180, 0), D_L and D_R denote the distances from the vehicle center to the left and right lane lines, and (L_x, L_y) and (R_x, R_y) denote the pixel coordinates of the left and right lane lines in the image. The distance from the vehicle center to the left lane line is D_L = 180 - L_x, and similarly D_R = R_x - 180.
Acquiring 20 frames of images normally takes 0.5 seconds, so whether the vehicle is deviating can be judged from the continuous changes of D_L and D_R over these 20 frames. An array L stores the 20 D_L values and an array R stores the 20 D_R values.
Further, the lateral velocities V_L and V_R at this time are given by formulas (14) and (15):
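A minimal sketch of this departure check follows, assuming the lateral velocity is taken as the change of D_L (or D_R) over the 0.5 s window, which is one plausible reading of formulas (14) and (15); the departure threshold is an illustrative assumption:

# Sketch built on D_L = 180 - L_x and D_R = R_x - 180 over a 20-frame window.

WINDOW_SECONDS = 0.5  # 20 frames

def lateral_velocities(L, R):
    """L, R: arrays of 20 consecutive D_L / D_R values (pixels)."""
    v_l = (L[-1] - L[0]) / WINDOW_SECONDS   # > 0: moving away from the left line
    v_r = (R[-1] - R[0]) / WINDOW_SECONDS
    return v_l, v_r

def lane_departure(L, R, min_margin=30.0):
    """Flag departure when the vehicle gets too close to either lane line;
    the 30-pixel margin is an illustrative threshold."""
    return min(L[-1], R[-1]) < min_margin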
in this embodiment, the method further includes:
training the vehicle distance recognition model by using the front vehicle distance data set to obtain a trained vehicle distance recognition model; inputting the acquired front vehicle distance image into a trained vehicle distance recognition model, and outputting a front vehicle distance recognition result; judging whether the front vehicle distance recognition result is smaller than a distance threshold value, if so, reminding a driver to correct driving behaviors; if not, the processing is not performed.
The invention selects monocular vision for vehicle distance recognition and analysis; distance measurement is performed mainly from a monocular ranging model and the detected target box. Monocular ranging requires only one camera and some coordinate system conversions, and has the advantages of low computational cost and low power consumption.
The invention uses the dashboard camera of the driving vehicle as the main data-collection device; the vehicle collects the front-vehicle distance data set in real time in various scenarios such as busy urban roads, expressway sections and highways.
As shown in fig. 11, the vehicle distance recognition model uses YOLOv5 as the basic framework, modifies the backbone network accordingly, and incorporates attention mechanisms to meet the requirements of light weight, high accuracy, multi-scale vehicle detection, detection under complex environmental conditions, and real-time operation.
In the YOLOv5 backbone network, the CSP structure uses many convolution (Conv) operations, and the computational cost of convolution grows with the number of layers. To address this, the invention replaces the YOLOv5 convolution operations with the Ghost Module from GhostNet, which increases model speed and reduces the number of parameters without affecting accuracy. The Coordinate Attention (CA) mechanism is introduced to embed position information into the channels, overcoming the drawback of some attention mechanisms that ignore position information.
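A PyTorch sketch of the GhostNet Ghost Module used to replace ordinary convolutions is given below; the ratio of 2 and the 3 × 3 depthwise kernel are the common GhostNet defaults, assumed here:

import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Sketch of a Ghost Module: a small primary convolution plus cheap
    depthwise operations that generate the remaining 'ghost' feature maps."""

    def __init__(self, in_ch, out_ch, kernel_size=1, ratio=2, dw_size=3, stride=1):
        super().__init__()
        init_ch = out_ch // ratio                 # channels from the primary conv
        cheap_ch = out_ch - init_ch               # channels from the cheap operation
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel_size, stride, kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(               # depthwise conv generates the ghost maps
            nn.Conv2d(init_ch, cheap_ch, dw_size, 1, dw_size // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(cheap_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)   # (N, out_ch, H, W)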
After the vehicle in front of the host vehicle is detected, the pixel values of the corresponding target are extracted, and ranging is modeled based on the camera pitch angle and the front-vehicle yaw angle, as shown in fig. 12.
In the figure, the lower dashed line from the upper left corner is the lens optical axis, whose intersection with the image plane is O(x, y) (image-plane coordinate system), and the focal length is f; the upper dashed line from the upper left corner is the line from the detected front-vehicle bottom center point to the camera, whose imaging point on the image plane is D(u, v) (pixel coordinate system), and the angle between this line and the optical axis is α. With the camera mounted on the host vehicle, the horizontal distance from the camera to the front vehicle, i.e., the distance between the host vehicle and the front vehicle, is:
where h is the vertical distance between the camera and the front vehicle, and θ is the camera pitch angle (the angle between the camera lens optical axis and the horizontal plane).
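A minimal Python sketch of this pitch-angle ranging model follows, assuming the standard monocular form d = h / tan(θ + α) with α = arctan((v - y) / f); this form is consistent with the symbols defined above but is an assumed reconstruction rather than the formula reproduced from the original:

import math

def forward_distance(h, theta, f, y0, v):
    # h: vertical camera-to-vehicle distance; theta: camera pitch angle (rad)
    # f: focal length in pixels; y0: image-plane center row; v: imaged row of
    # the front-vehicle bottom center point
    alpha = math.atan2(v - y0, f)        # angle between the target ray and the optical axis
    return h / math.tan(theta + alpha)   # horizontal distance d to the front vehicle

# Example with assumed values: h = 1.3 m, theta = 2 deg, f = 1000 px,
# target imaged 60 px below the image center:
# print(forward_distance(1.3, math.radians(2.0), 1000.0, 360.0, 420.0))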
In an actual scenario, the front vehicle does not always run directly ahead of the host vehicle; it may be to the left or right, i.e., diagonally ahead of the host vehicle, as shown in fig. 13.
β is the horizontal angle between the camera's extrinsic orientation and the optical axis, and γ is the yaw angle of the front vehicle. B'(B_x, B_y) denotes the position, in the pixel coordinate system, of the landing point B of the yaw trajectory at the bottom center of the front vehicle, and O'(u, v) is the pixel center point. The distance based on the camera pitch angle and the yaw angle of the diagonally-ahead vehicle is:
in the formula (17), D is the distance between the vehicle and the front vehicle, and θ and γ respectively represent the pitch angle of the camera and the yaw angle of the front vehicle.
Since the pitch and yaw angles are difficult to obtain directly, a method for obtaining these angles in real time can be designed: lane lines are essentially parallel, and when imaged, the two lane lines eventually meet at a point, called the vanishing point. The vanishing point is used to calculate the yaw and pitch angles of the camera.
First, a Gabor filter is used to compute the texture direction of each pixel in the captured image; then, whether a pixel votes is determined from its confidence; finally, a fast local voting method determines the position of the vanishing point.
The functional expression of the Gabor filter is as follows:
where ω and φ represent the scale and direction respectively, and x, y represent the pixel coordinate position.
Since Gabor filtering yields texture responses in 36 directions for each pixel, and not all of them are reliable, a confidence measure is introduced; when it exceeds a set threshold, the texture in that direction is treated as a voting point. The confidence of a pixel d(x, y) in the image can be defined as:
where r_i(d) denotes the response value of the i-th direction at the pixel; the local maximum responses occur between r_5(d) and r_15(d), and the remaining directions rarely contribute. The threshold is set as:
t = 0.4 (max C(d) - min C(d))   (20)
When the confidence is greater than t, the pixel is a voting point. The final road vanishing point is then selected from the voting points and their confidences.
The principle of the voting algorithm is as follows: a candidate vanishing point H is selected, a circle is drawn with H as the center and a radius of one third of the image size, and half of this circle is used as the candidate region. The specific voting formula is as follows:
where α is the angle between the texture direction of a voting point P and the line segment PH, and d(P, H) denotes the distance from point P to the circle center; every pixel in the image is taken as a candidate point, and the point with the highest voting score is finally taken as the vanishing point.
After the vanishing point is obtained, the pitch angle and the yaw angle are respectively as follows:
where R is the camera image rotation matrix, R_xz denotes the matrix of rotation about the x- and z-axes, and R_yz denotes the matrix of rotation about the y- and z-axes.
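For illustration, the following sketch recovers the pitch and yaw from the vanishing point under a simple pinhole-camera assumption with focal length f (pixels) and principal point (u0, v0); the rotation-matrix formulation above is not reproduced, so this form is an assumption:

import math

def pitch_yaw_from_vanishing_point(u_vp, v_vp, u0, v0, f):
    # vanishing point above the image center implies the camera is tilted down
    pitch = math.atan2(v0 - v_vp, f)
    # horizontal offset of the vanishing point implies a yaw angle
    yaw = math.atan2(u_vp - u0, f)
    return pitch, yaw

# Example with an assumed 1280x720 image and f = 1000 px:
# pitch, yaw = pitch_yaw_from_vanishing_point(655.0, 330.0, 640.0, 360.0, 1000.0)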
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered by the scope of the claims of the present invention.

Claims (10)

1. A vehicle running risk early warning method facing to the integrated perception of a driver and a vehicle state is characterized in that: comprising the following steps:
creating a driving state data set of a driver; training the driving behavior recognition model by using the driving state data set to obtain a trained driving behavior recognition model;
creating a lane departure data set; training the vehicle track recognition model by using the lane offset data set to obtain a trained vehicle track recognition model;
inputting the acquired driver images into a trained driving behavior recognition model, and outputting a driving behavior recognition result of the driver;
inputting the acquired vehicle running state image into a trained vehicle track recognition model, and outputting a lane deviation recognition result;
judging whether the driving behavior recognition result and/or the lane deviation recognition result meet the early warning requirement, if so, reminding a driver to correct the driving behavior and drive on a correct lane; if not, the processing is not performed.
2. The vehicle running risk early warning method for driver and vehicle state integrated perception according to claim 1, characterized in that: the driving behavior recognition model comprises a strategy network and a two-dimensional convolutional neural network; the strategy network comprises a feature extractor and a long-term and short-term memory module; a mixed attention mechanism module is embedded in a backbone network of the two-dimensional convolutional neural network; the mixed attention mechanism module comprises a space-time excitation sub-module, a channel excitation sub-module and a motion excitation sub-module;
the space-time excitation submodule uses single-channel three-dimensional convolution to represent space-time characteristics;
the channel excitation submodule adaptively calibrates characteristic responses of the channels based on interdependencies between the channels;
the motion-excitation sub-module calculates a time difference at a feature level to stimulate a motion-sensitive channel.
3. The vehicle running risk early warning method for driver and vehicle state integrated perception according to claim 2, characterized in that: the strategy network adaptively selects different frame scales to improve driving behavior recognition efficiency, comprising the following steps:
at time step t < T_0, frame I_t is resized to the lowest resolution and sent to the feature extractor, where T_0 is a set time period and I_t is the driver state image frame at time t;
the long short-term memory module updates the hidden state and output using the extracted features and the previous state;
given the hidden state, the policy distribution is estimated and the action a_t at time t is sampled via a Gumbel-Softmax operation;
if a_t < L, the frame is resized to spatial resolution 3 × H_{a_t} × W_{a_t} and forwarded to the corresponding backbone network to obtain a frame-level prediction, where L is the number of resolution categories of the state image, and H_{a_t} and W_{a_t} are the height and width of the image at time t for action a_t;
if a_t ≥ L, the backbone network skips prediction on the current frame, and the policy network skips the following F_{a_t-L} - 1 frames, where F_{a_t-L} is the skip length when a_t ≥ L.
4. The vehicle running risk early warning method for driver and vehicle state integrated perception according to claim 2, characterized in that: the space-time excitation submodule uses single-channel three-dimensional convolution to represent space-time characteristics and specifically comprises the following steps:
for a given input image X ∈ R^(N×T×C×H×W), the input tensors are averaged over the channel axis to obtain a global spatio-temporal tensor F ∈ R^(N×T×1×H×W); F is then reshaped to F* ∈ R^(N×T×1×H×W) and fed to a three-dimensional convolution layer K with kernel size 3 × 3, obtaining F*_o; finally, F*_o is reshaped to F^o ∈ R^(N×T×1×H×W) and fed to a Sigmoid activation to obtain a spatio-temporal mask M ∈ R^(N×T×1×H×W), and the final output Y is: Y = X + X ⊙ M;
where ⊙ denotes element-wise multiplication of the spatio-temporal mask M with all channel inputs X; T is the number of segments into which the video corresponding to the image is divided; N is the batch size for the T segments; C is the number of image channels; H is the image height; W is the image width.
5. The vehicle running risk early warning method for driver and vehicle state integrated perception according to claim 2, characterized in that: the channel excitation submodule adaptively calibrates characteristic response of the channels based on interdependence among the channels, and specifically comprises the following steps:
for a given input image X ∈ R^(N×T×C×H×W), the global spatial information F ∈ R^(N×T×C×1×1) of the input elements is first obtained by averaging the input; the number of channels of F is compressed by a ratio r to obtain F_r = K_1 * F, where K_1 is a 1 × 1 two-dimensional convolution layer and F_r ∈ R^(N×T×C/r×1×1);
F_r is then reshaped to F*_r so that temporal reasoning can be performed; a one-dimensional convolution layer K_2 with kernel size 3 processes F*_r to obtain F*_temp = K_2 * F*_r, where F*_temp ∈ R^(N×C/r×T);
F*_temp is reshaped back to F_temp ∈ R^(N×T×C/r×1×1), which is then decompressed by a 1 × 1 two-dimensional convolution layer K_3 to obtain F^o = K_3 * F_temp, and fed to a Sigmoid activation to obtain the channel mask M, where F^o ∈ R^(N×T×C×1×1) and M ∈ R^(N×T×C×1×1);
the final output Y is: Y = X + X ⊙ M.
6. The vehicle running risk early warning method for driver and vehicle state integrated perception according to claim 2, characterized in that: the motion-excitation sub-module calculates a time difference at a feature level to stimulate a motion-sensitive channel, comprising:
for a given input image X ∈ R^(N×T×C×H×W), the number of channels is compressed by a ratio r using a 1 × 1 two-dimensional convolution layer to obtain F_r ∈ R^(N×T×C/r×H×W); a 1 × 1 two-dimensional convolution layer is later used to decompress F_r;
the motion features are modeled as F_m = K * F_r[:, t+1, :, :, :] - F_r[:, t, :, :, :];
where K is a 3 × 3 two-dimensional convolution layer, F_r[:, t+1, :, :, :] denotes the compressed feature map at time t+1, and F_r[:, t, :, :, :] denotes the compressed feature map at time t;
the motion features are concatenated along the time dimension, with 0 padded as the last element, as follows:
F_m = [F_m(1), ..., F_m(t-1), 0]; where F_m(t-1) is the (t-1)-th motion representation;
F_m is then averaged to obtain the global spatial information of the input elements.
7. The vehicle running risk early warning method for driver and vehicle state integrated perception according to claim 1, characterized in that: the vehicle track recognition model comprises an improved DeepLabv3+ network;
the improved DeepLabv3+ network takes DeepLabv3+ as the basic framework, replaces the DeepLabv3+ backbone network Xception with the lightweight network MobileNetv2, adds a channel attention mechanism module, and replaces the ASPP structure in the DeepLabv3+ network with Dense-ASPP;
the channel attention mechanism module is used for focusing attention among channels of the feature map.
8. The vehicle running risk early warning method for driver and vehicle state integrated perception according to claim 1, characterized in that: further comprises:
training the vehicle distance recognition model by using the front vehicle distance data set to obtain a trained vehicle distance recognition model; inputting the acquired front vehicle distance image into a trained vehicle distance recognition model, and outputting a front vehicle distance recognition result; judging whether the front vehicle distance recognition result is smaller than a distance threshold value, if so, reminding a driver to correct driving behaviors; if not, the processing is not performed.
9. The vehicle running risk early warning method for driver and vehicle state integrated perception of claim 8, wherein: the vehicle distance identification model comprises a modified YOLOv5 network;
the improved YOLOv5 network uses YOLOv5 as a basic framework, uses a Ghost Module in GhostNet to replace the convolution operation of YOLOv5, introduces an attention mechanism Coordinate Attention, and embeds position information into the channel.
10. The vehicle running risk early warning method for driver and vehicle state integrated perception of claim 8, wherein: if the front vehicle is right in front of the vehicle, determining a distance d between the vehicle and the front vehicle according to the following formula:
h is the distance between the camera arranged on the vehicle and the front vehicle in the vertical direction; θ is the camera pitch angle; the intersection point of the lens optical axis of the camera and the image plane is O (x, y), the focal length is f, the imaging point of the center point of the bottom of the front vehicle at the image plane is D (u, v), and the included angle between the straight line from the center point of the bottom of the front vehicle to the camera and the lens optical axis is alpha;
if the front vehicle is in front of the side of the vehicle, determining a distance D between the vehicle and the front vehicle according to the following formula:
wherein, gamma is the yaw angle of the front vehicle.
CN202311284729.1A 2023-10-07 2023-10-07 Vehicle running risk early warning method for driver and vehicle state integrated sensing Pending CN117292346A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311284729.1A CN117292346A (en) 2023-10-07 2023-10-07 Vehicle running risk early warning method for driver and vehicle state integrated sensing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311284729.1A CN117292346A (en) 2023-10-07 2023-10-07 Vehicle running risk early warning method for driver and vehicle state integrated sensing

Publications (1)

Publication Number Publication Date
CN117292346A true CN117292346A (en) 2023-12-26

Family

ID=89238683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311284729.1A Pending CN117292346A (en) 2023-10-07 2023-10-07 Vehicle running risk early warning method for driver and vehicle state integrated sensing

Country Status (1)

Country Link
CN (1) CN117292346A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117636270A (en) * 2024-01-23 2024-03-01 南京理工大学 Vehicle robbery event identification method and device based on monocular camera
CN117636270B (en) * 2024-01-23 2024-04-09 南京理工大学 Vehicle robbery event identification method and device based on monocular camera

Similar Documents

Publication Publication Date Title
CN110097109B (en) Road environment obstacle detection system and method based on deep learning
US11250296B2 (en) Automatic generation of ground truth data for training or retraining machine learning models
US9311711B2 (en) Image processing apparatus and image processing method
CN109334563B (en) Anti-collision early warning method based on pedestrians and riders in front of road
US11527078B2 (en) Using captured video data to identify pose of a vehicle
CN111860274B (en) Traffic police command gesture recognition method based on head orientation and upper half skeleton characteristics
JP6574611B2 (en) Sensor system for obtaining distance information based on stereoscopic images
JP4271720B1 (en) Vehicle periphery monitoring device
US11024042B2 (en) Moving object detection apparatus and moving object detection method
WO2021096629A1 (en) Geometry-aware instance segmentation in stereo image capture processes
CN110807352B (en) In-vehicle scene visual analysis method for dangerous driving behavior early warning
CN117292346A (en) Vehicle running risk early warning method for driver and vehicle state integrated sensing
EP2741234B1 (en) Object localization using vertical symmetry
JP2019106193A (en) Information processing device, information processing program and information processing method
TW202101965A (en) Sensor device and signal processing method
CN117015792A (en) System and method for generating object detection tags for automated driving with concave image magnification
CN113557524A (en) Method for representing a mobile platform environment
JP7269694B2 (en) LEARNING DATA GENERATION METHOD/PROGRAM, LEARNING MODEL AND EVENT OCCURRENCE ESTIMATING DEVICE FOR EVENT OCCURRENCE ESTIMATION
CN117115690A (en) Unmanned aerial vehicle traffic target detection method and system based on deep learning and shallow feature enhancement
JP6472504B1 (en) Information processing apparatus, information processing program, and information processing method
US20240089577A1 (en) Imaging device, imaging system, imaging method, and computer program
Kondyli et al. A 3D experimental framework for exploring drivers' body activity using infrared depth sensors
CN113450385B (en) Night work engineering machine vision tracking method, device and storage medium
CN112329566A (en) Visual perception system for accurately perceiving head movements of motor vehicle driver
CN110556024B (en) Anti-collision auxiliary driving method and system and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination