CN112800879B - Vehicle-mounted video-based front vehicle position prediction method and prediction system - Google Patents
Vehicle-mounted video-based front vehicle position prediction method and prediction system
- Publication number
- CN112800879B (application CN202110051940.3A)
- Authority
- CN
- China
- Prior art keywords
- vehicle
- sequence
- front vehicle
- optical flow
- frame
- Prior art date
- Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/29—Graphical models, e.g. Bayesian networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a method for predicting the position of a preceding vehicle based on vehicle-mounted video, which comprises the following steps: constructing a vehicle position prediction model based on an encoding-decoding framework, which predicts the position and scale of the preceding vehicle from historical data of the bounding box of the preceding vehicle and of the optical flow within that bounding box, together with predicted motion information of the own vehicle; constructing a sample set and training the vehicle position prediction model; acquiring vehicle-mounted video; performing vehicle detection and tracking on the video frames and calculating optical flow to obtain the bounding box sequence and optical flow sequence of the preceding vehicle; predicting the motion information of the own vehicle to form a motion prediction sequence; extracting the bounding boxes of the preceding vehicle and the optical flows within them in the T video frames before the current time t, together with the predicted motion information of the own vehicle in the Δ video frames after t, and inputting them into the vehicle position prediction model to obtain the bounding box sequence of the preceding vehicle in the Δ video frames after t, thereby predicting the position and scale of the preceding vehicle. Based only on the video information captured by a dashboard camera, the method can predict the position and scale of the preceding vehicle in real time.
Description
Technical Field
The invention belongs to the technical field of driver assistance, and particularly relates to a method and a system for predicting the position of a preceding vehicle based on vehicle-mounted video.
Background
With the continuous development of society, automobiles have become widespread in China. Alongside the convenience they bring, many problems follow, such as frequent traffic safety accidents, harsh road driving environments, and pollution of the ecological environment. These problems threaten people's lives and property, traffic accidents above all, so safe driving has become an urgent public need. Traffic accidents often occur because a driver cannot respond in time to the behavior of other traffic participants on the road; meanwhile, dashboard cameras are now used by a large number of vehicle owners and can record video and sound throughout the entire driving process.
Existing vehicle position prediction methods, proposed both domestically and abroad, can be roughly divided into two categories: traditional methods and deep learning-based methods.
Traditional vehicle position prediction methods such as Bayesian filtering have overly simple structures, cannot model complex vehicle motion patterns, and often perform poorly on long-term prediction. Dynamic Bayesian networks can alleviate these problems by describing the latent factors that determine a vehicle's trajectory with a graphical model and by explicitly modeling the physical process that generates the trajectory; however, a model structure fixed by the designer's intuition is not sufficient to capture the variety of dynamic traffic scenes, performance in real traffic scenes is limited, and the high computational complexity cannot meet the requirement of real-time prediction.
In recent years, deep learning methods have shown great capability in image processing, and many researchers have applied the recurrent neural network structure and its variants to the vehicle position prediction task. These methods use the vehicle's past driving data, train deep learning network models, and achieve good prediction results in their respective application scenarios. However, these studies have two problems: first, the vehicle's past driving data must be captured by various sensors mounted on the vehicle, which are not common on today's production vehicles; second, they can only predict the pixel position of the preceding vehicle and cannot predict its scale.
The invention predicts the position and scale of the preceding vehicle in real time based only on the image information captured by a dashboard camera, so that the driver has enough time to avoid traffic accidents while driving, and the method can be better applied to real-world scenes.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the invention provides a method for predicting the position of a preceding vehicle based on vehicle-mounted video, which can predict the position and scale of the preceding vehicle in real time based only on the video captured by a dashboard camera, so that the driver has enough time to avoid traffic accidents while driving, and which can be better applied to real-world scenes.
The technical scheme is as follows: the invention discloses a vehicle-mounted video-based front vehicle position prediction method, which comprises a training stage and a prediction stage, wherein the training stage comprises the following steps:
S1, constructing a vehicle position prediction model based on an encoding-decoding framework, wherein the vehicle position prediction model is used for predicting the bounding boxes of the preceding vehicle at times t+1, t+2, …, t+Δ after the current time t according to the bounding boxes of the preceding vehicle at times t-0, t-1, …, t-(T-1) before the current time t, the optical flows within those bounding boxes, and the motion information of the own vehicle at times t+1, t+2, …, t+Δ after the current time t;
the input of the vehicle position prediction model includes: the bounding box sequence B of the preceding vehicle and the optical flow sequence F within the bounding box of the preceding vehicle in the video frames of the T times before the current time t, and the motion prediction sequence M of the own vehicle in the video frames of the Δ times after the current time t;
the output of the vehicle position prediction model is the predicted bounding box sequence Y of the preceding vehicle in the video frame images of the Δ times after the current time t;
the vehicle position prediction model comprises: a preceding vehicle bounding box encoder, a preceding vehicle optical flow encoder, a feature fusion unit, and a preceding vehicle position prediction decoder;
the preceding vehicle bounding box encoder is used for encoding the bounding box sequence B of the preceding vehicle to obtain a time-series feature vector of the preceding vehicle;
the preceding vehicle optical flow encoder is used for encoding the optical flow sequence F within the bounding box of the preceding vehicle to obtain a motion feature vector of the preceding vehicle;
the feature fusion unit concatenates the time-series feature vector and the motion feature vector of the preceding vehicle into a fused feature vector of the preceding vehicle;
the preceding vehicle position prediction decoder decodes the fused feature vector according to the motion prediction sequence M of the own vehicle to obtain the predicted bounding boxes of the preceding vehicle in the video frames of the Δ times after the current time t;
S2, constructing a sample set and training the vehicle position prediction model, comprising the following steps:
S2-1, collecting a plurality of vehicle-mounted video clips of duration s in which a preceding vehicle is visible, sampling the video frames in each video clip, and determining, for the sampled video frames, the bounding box sequence B_tr of the preceding vehicle, the optical flow sequence F_tr within the bounding box, and the motion prediction sequence M_tr of the own vehicle at the times corresponding to the video frames, to form a sample set;
S2-2, dividing the sample set into a training set and a verification set, and setting a learning rate σ and a batch size N;
S2-3, adopting an Adam optimizer in the training process and determining the number of training batches N' according to the number of samples in the training set and N; taking the B_tr and F_tr corresponding to the video frames of the first s' of each video clip in a training sample, together with the M_tr corresponding to the video frames of the last s'', as the input of the vehicle position prediction model, and the B_tr corresponding to the video frames of the last s'' as the output; training the model, storing the model parameters, and verifying the prediction accuracy of the model with the verification set; s' + s'' = s;
S2-4, selecting the model parameters with the highest prediction accuracy among the N' batches of training as the parameters of the vehicle position prediction model;
the prediction phase comprises:
A camera capable of capturing the preceding vehicle is mounted on the own vehicle, and the video data collected by the camera while the vehicle is driving is obtained;
vehicle detection and tracking are performed on each frame of the video to obtain the bounding box sequence of each preceding vehicle, which is stored in B_test(i), where i is the index of the preceding vehicle; at the same time the optical flow within the bounding box is calculated and stored in F_test(i); the motion information of the own vehicle in future frames is obtained and stored in the sequence M_test;
a first sliding window of length T is applied to the sequences B_test(i) and F_test(i), and a second sliding window of length Δ is applied to the sequence M_test, to extract, respectively, the bounding boxes of vehicle i and the optical flows within them in the T video frames before the current time t, and the predicted motion information of the own vehicle in the Δ video frames after the current time t; these are input into the trained vehicle position prediction model to obtain the bounding box sequence Y'(i) = [Y'_{t+1}(i), Y'_{t+2}(i), …, Y'_{t+δ}(i), …, Y'_{t+Δ}(i)] of the preceding vehicle i in the Δ video frames after the current time t, and the position of the predicted bounding boxes relative to the bounding box of the preceding vehicle i in the video frame at the current time is calculated, where B_{test,t+0}(i) is the bounding box of the preceding vehicle i at the current time t and 1 ≤ δ ≤ Δ;
the predicted trajectory of the preceding vehicle i is obtained from the centers of the bounding boxes in Y'(i), and the scale of the preceding vehicle i is obtained from the widths and heights of the bounding boxes in Y'(i).
The surrounding frame sequence of the front vehicle is calculated by adopting the following steps:
a.1, carrying out vehicle detection on video frame images at continuous T moments to obtain surrounding frames of all vehicles in each frame image;
and A.2, tracking the vehicle enclosure frame obtained in the step A.1 by adopting a multi-target tracking algorithm, giving the same number to the same vehicle in different frames, and forming a front vehicle enclosure frame sequence B of T moments according to a time sequence.
The optical flow sequence within the bounding box of the preceding vehicle is calculated by the following steps:
B.1, calculating, for the video images at the T consecutive times, the optical flow between each frame and the image of the previous frame, to obtain the optical flow map corresponding to each frame; the two-dimensional optical flow vector at the j-th pixel of the optical flow map is I_j = (u_j, v_j), where u_j and v_j are the vertical and horizontal components of the optical flow vector, respectively;
B.2, cropping, from the optical flow map corresponding to the image at time t-τ, the region covered by the bounding box of the preceding vehicle in that image, and scaling it to a preset uniform size to obtain the optical flow map within the bounding box at time t-τ; the optical flow sequence F within the bounding box of the preceding vehicle over the T times is formed in chronological order, where t-τ denotes the τ-th time before time t and 0 ≤ τ < T.
The motion prediction sequence of the own vehicle is calculated by the following steps:
C.1, for the video frames at times t-0, t-1, …, t-(T-1) before the current time t, calculating the camera rotation matrix R_{t-τ} and translation vector V_{t-τ} between video frames P_{t-τ-1} and P_{t-τ} at adjacent times, to form a rotation matrix sequence RS and a translation vector sequence VS, where 0 ≤ τ < T, specifically comprising steps C.1-1 to C.1-2:
C.1-1, calculating the essential matrix E by the eight-point method, comprising:
C.1-1-1, extracting feature points of P_{t-τ-1} and P_{t-τ} by the Surf algorithm and selecting the 8 best-matched pairs of feature points (a_l, a'_l), l = 1, 2, …, 8; where a_l and a'_l denote the coordinates, on the normalized plane, of the pixel positions of the l-th pair of matched feature points in video frames P_{t-τ-1} and P_{t-τ}, respectively, a_l = [x_l, y_l, 1]^T, a'_l = [x'_l, y'_l, 1]^T; a_l and a'_l are each 3×1 matrices, where T denotes the matrix transpose;
C.1-1-2, combining the 8 pairs of matched feature points into 3×8 matrices a and a':
a^T E a' = 0
Solving this system of equations yields the essential matrix E, where E is a 3×3 matrix;
C.1-2, performing singular value decomposition on E to obtain the camera rotation matrix R_{t-τ} and translation vector V_{t-τ}, where R_{t-τ} is a 3×3 matrix and V_{t-τ} is a 3-dimensional column vector;
Finally the rotation matrix sequence RS = {R_{t-(T-1)}, …, R_{t-τ}, …, R_{t-1}, R_{t-0}} and the translation vector sequence VS = {V_{t-(T-1)}, …, V_{t-τ}, …, V_{t-1}, V_{t-0}} of the T video frames before time t are obtained;
C.2, for the camera rotation matrices and translation vectors in the RS and VS obtained in C.1, calculating the cumulative value of each R_{t-τ} and V_{t-τ} with that of the previous time, the cumulative values being denoted R'_{t-τ} and V'_{t-τ};
C.3, taking the R'_{t-0} and V'_{t-0} finally obtained in C.2 as the rotation matrix and translation vector passed to the camera at the next time, as follows:
R_{t+1} = R'_{t-0}
V_{t+1} = V'_{t-0}
C.4, appending the R_{t+1} and V_{t+1} obtained in C.3 to the end of the rotation matrix sequence RS and translation vector sequence VS obtained in C.1, respectively, and continuing to perform C.2 and C.3 until all rotation matrices {R_{t+1}, R_{t+2}, …, R_{t+δ}, …, R_{t+Δ}} and all translation vectors {V_{t+1}, V_{t+2}, …, V_{t+δ}, …, V_{t+Δ}} of the Δ video frames after time t are obtained, 1 ≤ δ ≤ Δ;
C.5, calculating the motion vectors of the own vehicle at the Δ times after the current time t to form the motion prediction sequence M = {M_{t+1}, M_{t+2}, …, M_{t+δ}, …, M_{t+Δ}} of the own vehicle, specifically comprising steps C.5-1 to C.5-2:
C.5-1, extracting from the rotation matrix R_{t+δ} the rotation angle information of the camera about the x, y and z axes and representing it as a 3-dimensional row vector ψ_{t+δ}, where r_jk denotes the value in the j-th row and k-th column of the rotation matrix R_{t+δ}, j, k ∈ {1, 2, 3}; atan2() and atan() both denote arctangent functions, but the result of atan2() lies in (-π, π] while the result of atan() lies in (-π/2, π/2);
C.5-2, concatenating the vector ψ_{t+δ} with the translation vector V_{t+δ}^T converted into a three-dimensional row vector, to form a 6-dimensional row vector M_{t+δ}: M_{t+δ} = [ψ_{t+δ}, V_{t+δ}^T];
Finally the motion prediction sequence M = {M_{t+1}, M_{t+2}, …, M_{t+δ}, …, M_{t+Δ}} of the own vehicle is obtained;
C.6, passing M through a fully connected layer FC_4 to transform the dimension of all of its motion vectors.
The preceding vehicle bounding box encoder comprises an encoding gated recurrent neural network GRU_b and a first fully connected layer FC_1; the input of GRU_b is the bounding box B_{t-τ} at each time in the bounding box sequence B of the preceding vehicle together with the hidden state vector passed down by GRU_b at the previous time, and its output is the encoding result of the bounding box of the preceding vehicle at the current time; FC_1 performs a dimension transformation on the final output of GRU_b to obtain the time-series feature vector of the preceding vehicle at the current time t.
The preceding vehicle optical flow encoder comprises a CNN-based motion feature extraction network FEN and a second fully connected layer FC_2; the input of the FEN is the optical flow sequence F within the bounding box of the preceding vehicle, and its output is the encoding result of the optical flow within the bounding box of the preceding vehicle at the current time; the FEN is based on the ResNet50 architecture and comprises a convolution layer conv1, a Relu layer, a max pooling layer maxPool, and 4 residual learning blocks connected in sequence; conv1 has 2m input channels, where m is the number of optical flow maps sampled from the optical flow sequence F, i.e., m optical flow maps are uniformly sampled from F; the 4 residual learning blocks all have a three-layer structure, i.e., each residual learning block consists of convolutional network layers and Relu layers connected in series;
m optical flow maps are uniformly sampled from the optical flow sequence F within the bounding box of the preceding vehicle; their vertical and horizontal components form 2m optical flow components, which are input into the FEN, and the output of the FEN is the motion feature of the optical flow maps within the bounding box of the preceding vehicle at the current time;
FC_2 performs a dimension transformation on the motion feature output by the FEN to obtain the motion feature vector of the preceding vehicle at the current time t.
The preceding vehicle position prediction decoder comprises a decoding gated recurrent neural network GRU_d and a third fully connected layer FC_3; the input of GRU_d is the fusion vector Mh_{t+δ} of the predicted value M_{t+δ} of the own-vehicle motion information at time t+δ and the hidden state vector passed down by GRU_d at the previous time, together with the hidden state vector passed down by GRU_d at the previous time, 1 ≤ δ ≤ Δ; its output is the decoding result of the bounding box of the preceding vehicle at time t+δ; FC_3 performs a dimension transformation on this decoding result to obtain the bounding box of the preceding vehicle at time t+δ.
On the other hand, the invention also discloses a prediction system for implementing the above vehicle-mounted video-based method for predicting the position of a preceding vehicle, which comprises:
a vehicle position prediction model based on an encoding-decoding framework, used for predicting the bounding boxes of the preceding vehicle at times t+1, t+2, …, t+Δ after the current time t according to the bounding boxes of the preceding vehicle at times t-0, t-1, …, t-(T-1) before the current time t, the optical flows within those bounding boxes, and the motion information of the own vehicle at times t+1, t+2, …, t+Δ after the current time t;
the vehicle position prediction model comprises: a preceding vehicle bounding box encoder, a preceding vehicle optical flow encoder, a feature fusion unit, and a preceding vehicle position prediction decoder;
the preceding vehicle bounding box encoder is used for encoding the bounding box sequence B of the preceding vehicle to obtain the time-series feature vector of the preceding vehicle;
the preceding vehicle optical flow encoder is used for encoding the optical flow sequence F within the bounding box of the preceding vehicle to obtain the motion feature vector of the preceding vehicle;
the feature fusion unit concatenates the time-series feature vector and the motion feature vector of the preceding vehicle into the fused feature vector of the preceding vehicle;
the preceding vehicle position prediction decoder decodes the fused feature vector according to the motion prediction sequence M of the own vehicle to obtain the predicted bounding boxes of the preceding vehicle in the video frames of the Δ times after the current time t;
a vehicle bounding box acquisition module, used for acquiring the bounding box sequence B of the preceding vehicle in the vehicle-mounted video;
a vehicle bounding box optical flow acquisition module, used for acquiring the optical flow sequence F within the bounding box of the preceding vehicle in the vehicle-mounted video;
and an own-vehicle motion information prediction module, used for predicting the motion information of the own vehicle at future times to form the own-vehicle motion prediction sequence M.
Beneficial effects: the method for predicting the position of a preceding vehicle disclosed by the invention has the following advantages: 1. it is based only on the video image information captured by a dashboard camera, which effectively solves the problem of low applicability on current production vehicles caused by other existing methods' reliance on various sensors to obtain information; 2. it adopts a deep learning network model based on an encoding-decoding framework, which can predict not only the position of the preceding vehicle but also its scale, significantly improving prediction performance.
Drawings
FIG. 1 is a flow chart of the vehicle-mounted video-based method for predicting the position of a preceding vehicle according to the present invention;
FIG. 2 is a schematic illustration of video frame vehicle detection tracking;
FIG. 3 is a schematic diagram of an optical flow extraction method for adjacent frames;
FIG. 4 is a schematic diagram of a vehicle position prediction model;
FIG. 5 is a schematic diagram of a GRU structure;
FIG. 6 is a schematic diagram of a motion feature extraction network;
FIG. 7 is a schematic view of a sliding window;
FIG. 8 is a diagram illustrating predicted results in an example;
fig. 9 is a schematic structural diagram of a vehicle-mounted video-based front vehicle position prediction system disclosed in the invention.
Detailed Description
The invention is further elucidated with reference to the drawings and the detailed description.
As shown in FIG. 1, the invention discloses a method for predicting the position of a vehicle ahead based on a vehicle-mounted video, which comprises a training stage and a prediction stage, wherein the training stage comprises the following steps:
S1, constructing a vehicle position prediction model based on an encoding-decoding framework, wherein the vehicle position prediction model is used for predicting the bounding boxes of the preceding vehicle at times t+1, t+2, …, t+Δ after the current time t according to the bounding boxes of the preceding vehicle at times t-0, t-1, …, t-(T-1) before the current time t, the optical flows within those bounding boxes, and the motion information of the own vehicle at times t+1, t+2, …, t+Δ after the current time t;
in the present embodiment, T is 20, and Δ is 40;
the input of the vehicle position prediction model includes: the bounding box sequence B of the preceding vehicle and the optical flow sequence F within the bounding box of the preceding vehicle in the video frames of the T times before the current time t, and the motion prediction sequence M of the own vehicle in the video frames of the Δ times after the current time t;
where B = [B_{t-0}, B_{t-1}, …, B_{t-τ}, …, B_{t-(T-1)}], and B_{t-τ} is the bounding box of the preceding vehicle in the video frame at the τ-th time before time t; the bounding box is represented by the horizontal and vertical coordinates x_{t-τ}, y_{t-τ} of its center point and its width w_{t-τ} and height h_{t-τ}, i.e., B_{t-τ} = (x_{t-τ}, y_{t-τ}, w_{t-τ}, h_{t-τ}); 0 ≤ τ < T;
In the invention, the surrounding frame sequence of the front vehicle is calculated by adopting the following steps:
A.1, vehicle detection is performed on the video frame images at the T consecutive times to obtain the bounding boxes of all vehicles in each frame image;
In this embodiment, a vehicle detection model built on Mask-RCNN is used for vehicle detection; the model is trained on the COCO data set, and its output is the vehicle bounding boxes in an image, each represented by a 4-dimensional vector; the video images are uniformly scaled to 1024 × 1024 before being input to Mask-RCNN.
A.2, the vehicle bounding boxes obtained in step A.1 are tracked by a multi-target tracking algorithm, the same vehicle is given the same number in different frames, and the bounding box sequence B of the preceding vehicle over the T times is formed in chronological order. In this embodiment, multi-target tracking is performed with the Sort algorithm, an online real-time multi-target tracking algorithm suitable for tracking vehicles in vehicle-mounted video. FIG. 2 is a schematic diagram of video frame vehicle detection and tracking: 3 vehicles are detected in two video frames at different times, and the same vehicles are numbered 1, 2, and 3 in both frames.
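As an illustration of step A.1, the following sketch shows per-frame vehicle detection with a COCO-pretrained Mask R-CNN; torchvision's detector is used here only as a readily available stand-in for the detection model of the embodiment, the score threshold and vehicle label set are assumptions, and the Sort tracking step is omitted.

```python
# Hedged sketch of step A.1: detect vehicles in one frame with a COCO-pretrained
# Mask R-CNN (torchvision model used as a stand-in) and convert the detections
# into the (cx, cy, w, h) bounding-box form B_{t-τ} used in the text.
import torch
import torchvision

detector = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
VEHICLE_LABELS = {3, 6, 8}  # COCO category ids for car, bus, truck

def detect_vehicles(frame, score_thresh=0.7):
    """frame: float tensor (3, H, W) in [0, 1]; returns a list of (cx, cy, w, h) boxes."""
    with torch.no_grad():
        out = detector([frame])[0]
    boxes = []
    for box, label, score in zip(out["boxes"], out["labels"], out["scores"]):
        if int(label) in VEHICLE_LABELS and float(score) >= score_thresh:
            x1, y1, x2, y2 = box.tolist()
            boxes.append(((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1))
    return boxes
```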
F = [F_{t-0}, F_{t-1}, …, F_{t-τ}, …, F_{t-(T-1)}], where F_{t-τ} is the optical flow map within the bounding box of the preceding vehicle in the video frame at the τ-th time before time t, F_{t-τ} = {(u_{t-τ}(p), v_{t-τ}(p))}, and (u_{t-τ}(p), v_{t-τ}(p)) is the two-dimensional optical flow vector at the p-th pixel of the optical flow map;
the optical flow sequence in the front vehicle surrounding frame is calculated by adopting the following steps:
b.1, calculating the optical flow of each frame and the previous frame of image of the frame of the video images at the continuous T moments to obtain an optical flow graph corresponding to each frame of image; in the embodiment, the FlowNet2 algorithm is adopted to calculate the optical flow of adjacent frames; the two-dimensional optical flow vector of the jth pixel point in the optical flow graph is as follows: i is j =(u j ,v j ),u j ,v j Vertical and horizontal components of the optical flow vector, respectively; as shown in fig. 3.
And B.2, intercepting a covering part of the front vehicle surrounding frame in the image at the T-T moment from a light flow graph corresponding to the image at the T-T moment, zooming to a preset uniform size to obtain a light flow graph in the surrounding frame at the T-T moment, and forming a light flow sequence F in the front vehicle surrounding frame at the T moments according to a time sequence, wherein the T-T represents the T-th moment before the moment T, and T is more than or equal to 0 and less than T. In this embodiment, the optical flow maps within the bounding box are uniformly scaled to 224 x 224.
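A minimal sketch of step B.2 follows, assuming a dense flow map of shape (H, W, 2) is available; OpenCV's Farneback flow is shown only as an easily reproducible substitute for FlowNet2, and the 224×224 size follows the embodiment.

```python
# Hedged sketch of step B.2: compute a dense flow map for adjacent frames
# (Farneback used as a substitute for FlowNet2), crop the region covered by the
# preceding vehicle's bounding box, and resize it to 224x224.
import cv2

def flow_in_bbox(prev_gray, curr_gray, bbox, out_size=224):
    """bbox = (cx, cy, w, h) in pixels; returns an (out_size, out_size, 2) flow patch."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    cx, cy, w, h = bbox
    x1, y1 = max(int(cx - w / 2), 0), max(int(cy - h / 2), 0)
    x2, y2 = int(cx + w / 2), int(cy + h / 2)
    patch = flow[y1:y2, x1:x2]
    return cv2.resize(patch, (out_size, out_size), interpolation=cv2.INTER_LINEAR)
```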
During driving, in addition to the motion of the vehicles in the scene ahead, the own vehicle itself is also moving; the motion of the own vehicle must therefore be predicted in order to predict the motion of the preceding vehicle.
The motion information prediction sequence of the own vehicle is calculated by the following steps:
C.1, for the video frames at times t-0, t-1, …, t-(T-1) before the current time t, calculating the camera rotation matrix R_{t-τ} and translation vector V_{t-τ} between video frames P_{t-τ-1} and P_{t-τ} at adjacent times, to form a rotation matrix sequence RS and a translation vector sequence VS, where 0 ≤ τ < T, specifically comprising steps C.1-1 to C.1-2:
C.1-1, calculating the essential matrix E by the eight-point method, comprising:
C.1-1-1, extracting feature points of P_{t-τ-1} and P_{t-τ} by the Surf algorithm and selecting the 8 best-matched pairs of feature points (a_l, a'_l), l = 1, 2, …, 8; where a_l and a'_l denote the coordinates, on the normalized plane, of the pixel positions of the l-th pair of matched feature points in video frames P_{t-τ-1} and P_{t-τ}, respectively, a_l = [x_l, y_l, 1]^T, a'_l = [x'_l, y'_l, 1]^T; a_l and a'_l are each 3×1 matrices, where T denotes the matrix transpose;
C.1-1-2, combining the 8 pairs of matched feature points into 3×8 matrices a and a':
a^T E a' = 0
Solving this system of equations yields the essential matrix E, where E is a 3×3 matrix;
C.1-2, performing singular value decomposition on E to obtain the camera rotation matrix R_{t-τ} and translation vector V_{t-τ}, where R_{t-τ} is a 3×3 matrix and V_{t-τ} is a 3-dimensional column vector;
Finally the rotation matrix sequence RS = {R_{t-(T-1)}, …, R_{t-τ}, …, R_{t-1}, R_{t-0}} and the translation vector sequence VS = {V_{t-(T-1)}, …, V_{t-τ}, …, V_{t-1}, V_{t-0}} of the T video frames before time t are obtained;
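As a sketch of step C.1, the relative camera pose between adjacent frames can be recovered with OpenCV; ORB features are used here because SURF requires the contrib build, RANSAC replaces the plain eight-point solution, and the camera intrinsic matrix K is assumed known, so this is only an approximation of the procedure above.

```python
# Hedged sketch of step C.1: match feature points between adjacent frames,
# estimate the essential matrix E, and decompose it into the camera rotation
# matrix R_{t-τ} and translation vector V_{t-τ}.
import cv2
import numpy as np

def relative_pose(prev_gray, curr_gray, K):
    orb = cv2.ORB_create(2000)                      # SURF needs opencv-contrib; ORB as a stand-in
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    matches = sorted(matches, key=lambda m: m.distance)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, V, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, V                                     # 3x3 rotation, 3x1 translation
```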
C.2, for the camera rotation matrices and translation vectors in the RS and VS obtained in C.1, calculating the cumulative value of each R_{t-τ} and V_{t-τ} with that of the previous time, the cumulative values being denoted R'_{t-τ} and V'_{t-τ};
C.3, taking the R'_{t-0} and V'_{t-0} finally obtained in C.2 as the rotation matrix and translation vector passed to the camera at the next time, as follows:
R_{t+1} = R'_{t-0}
V_{t+1} = V'_{t-0}
C.4, appending the R_{t+1} and V_{t+1} obtained in C.3 to the end of the rotation matrix sequence RS and translation vector sequence VS obtained in C.1, respectively, and continuing to perform C.2 and C.3 until all rotation matrices {R_{t+1}, R_{t+2}, …, R_{t+δ}, …, R_{t+Δ}} and all translation vectors {V_{t+1}, V_{t+2}, …, V_{t+δ}, …, V_{t+Δ}} of the Δ video frames after time t are obtained, 1 ≤ δ ≤ Δ;
C.5, calculating the motion vectors of the own vehicle at the Δ times after the current time t to form the motion prediction sequence M = {M_{t+1}, M_{t+2}, …, M_{t+δ}, …, M_{t+Δ}} of the own vehicle, specifically comprising steps C.5-1 to C.5-2:
C.5-1, extracting from the rotation matrix R_{t+δ} the rotation angle information of the camera about the x, y and z axes and representing it as a 3-dimensional row vector ψ_{t+δ}, where r_jk denotes the value in the j-th row and k-th column of the rotation matrix R_{t+δ}, j, k ∈ {1, 2, 3}; atan2() and atan() both denote arctangent functions, but the result of atan2() lies in (-π, π] while the result of atan() lies in (-π/2, π/2);
C.5-2, concatenating the vector ψ_{t+δ} with the translation vector V_{t+δ}^T converted into a three-dimensional row vector, to form a 6-dimensional row vector M_{t+δ}: M_{t+δ} = [ψ_{t+δ}, V_{t+δ}^T];
Finally the motion prediction sequence M = {M_{t+1}, M_{t+2}, …, M_{t+δ}, …, M_{t+Δ}} of the own vehicle is obtained;
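A minimal sketch of steps C.5-1 and C.5-2 follows; since the exact angle formulas are not reproduced in the text above, the common Z-Y-X (atan2-based) rotation-matrix-to-Euler-angle convention is assumed.

```python
# Hedged sketch of steps C.5-1/C.5-2: extract x/y/z rotation angles ψ_{t+δ} from
# the rotation matrix R_{t+δ} with atan2 and concatenate them with the
# translation vector V_{t+δ} into the 6-dimensional motion vector M_{t+δ}.
import numpy as np

def motion_vector(R, V):
    """R: 3x3 rotation matrix; V: length-3 translation vector."""
    psi_x = np.arctan2(R[2, 1], R[2, 2])
    psi_y = np.arctan2(-R[2, 0], np.sqrt(R[2, 1] ** 2 + R[2, 2] ** 2))
    psi_z = np.arctan2(R[1, 0], R[0, 0])
    psi = np.array([psi_x, psi_y, psi_z])        # 3-d row vector ψ_{t+δ}
    return np.concatenate([psi, np.ravel(V)])    # 6-d row vector M_{t+δ}
```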
C.6, M is passed through a fully connected layer FC_4 to transform the dimension of all of its motion vectors so that it matches the dimension of the hidden state vector passed down by the decoding gated recurrent neural network GRU_d at the previous time; in this embodiment the output dimension of this fully connected layer is 512.
The output of the vehicle position prediction model is the predicted bounding box sequence Y of the preceding vehicle in the video frame images of the Δ times after the current time t, Y = [Y_{t+1}, Y_{t+2}, …, Y_{t+δ}, …, Y_{t+Δ}]; Y_{t+δ} denotes the predicted bounding box of the preceding vehicle in the video frame image at the δ-th time after time t, represented by the horizontal and vertical coordinates of its center point and its width and height, i.e., Y_{t+δ} = (x_{t+δ}, y_{t+δ}, w_{t+δ}, h_{t+δ});
As shown in FIG. 4, the vehicle position prediction model comprises: a preceding vehicle bounding box encoder 1-1, a preceding vehicle optical flow encoder 1-2, a feature fusion unit 1-3, and a preceding vehicle position prediction decoder 1-4;
The preceding vehicle bounding box encoder 1-1 encodes the bounding box sequence B of the preceding vehicle to obtain the time-series feature vector of the preceding vehicle.
The preceding vehicle bounding box encoder mainly uses a gated recurrent unit (GRU) network for encoding. A GRU retains only the information relevant for prediction and forgets irrelevant data. Its structure is shown in FIG. 5: the inputs are the input In_t at the current time and the hidden state vector h_{t-1} passed down by the GRU at the previous time, where h_{t-1} represents the position and scale information of the preceding vehicle over the past time period. Combining In_t and h_{t-1}, the GRU outputs the hidden state vector h_t at the current time. The whole forward propagation process is computed as:
z_t = σ(W_z · [h_{t-1}, In_t])
r_t = σ(W_r · [h_{t-1}, In_t])
h̃_t = tanh(W · [r_t ⊙ h_{t-1}, In_t])
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
where z_t is the output of the update gate, σ() is the sigmoid function, W_z is the weight parameter of the update gate, r_t is the output of the reset gate, W_r is the weight parameter of the reset gate, h̃_t is the candidate output to be determined at the current time, tanh() is the hyperbolic tangent function, W is the weight parameter of the value to be determined, [,] denotes the concatenation of two vectors, and ⊙ denotes element-wise multiplication. This group of formulas is abbreviated as h_t = GRU_c(U, h_{t-1}; V), where c denotes the specific application, U is the input value of GRU_c at the current time, and V is the weight parameter of GRU_c.
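The GRU recurrence above can be written out directly; the sketch below implements the standard update-gate/reset-gate equations (torch.nn.GRUCell provides the same recurrence with bias terms), and the weight shapes are assumptions for illustration.

```python
# Hedged sketch of one GRU forward step with update gate z_t, reset gate r_t
# and candidate state, following the standard formulation assumed above.
import torch

def gru_step(in_t, h_prev, W_z, W_r, W):
    """in_t: input In_t; h_prev: h_{t-1}; each W acts on the concatenation [h_{t-1}, In_t]."""
    joint = torch.cat([h_prev, in_t], dim=-1)
    z_t = torch.sigmoid(joint @ W_z)                                   # update gate
    r_t = torch.sigmoid(joint @ W_r)                                   # reset gate
    h_cand = torch.tanh(torch.cat([r_t * h_prev, in_t], dim=-1) @ W)   # candidate output
    return (1 - z_t) * h_prev + z_t * h_cand                           # hidden state h_t
```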
The preceding vehicle bounding box encoder comprises an encoding gated recurrent neural network GRU_b and a first fully connected layer FC_1; the input of GRU_b is the bounding box B_{t-τ} at each time in the bounding box sequence B of the preceding vehicle together with the hidden state vector passed down by GRU_b at the previous time, and its output is the encoding result of the bounding box of the preceding vehicle at the current time; FC_1 performs a dimension transformation on the final output of GRU_b to obtain the time-series feature vector of the preceding vehicle at the current time t.
The structure of the encoding gated recurrent neural network GRU_b follows the abbreviated form above, where φ() denotes a linear mapping using the ReLU activation function and θ_b denotes the weight parameter V of GRU_b. In this embodiment, the hidden state vector of GRU_b has dimension 512, and FC_1 transforms its final output to dimension 256, i.e., the time-series feature vector of the preceding vehicle has dimension 256.
The preceding vehicle optical flow encoder 1-2 encodes the optical flow sequence F within the bounding box of the preceding vehicle to obtain the motion feature vector of the preceding vehicle.
The preceding vehicle optical flow encoder comprises a CNN-based motion feature extraction network FEN and a second fully connected layer FC_2; the input of the FEN is the optical flow sequence F within the bounding box of the preceding vehicle, and its output is the encoding result of the optical flow within the bounding box of the preceding vehicle at the current time; as shown in FIG. 6, the FEN is based on the ResNet50 architecture and comprises a convolution layer conv1, a Relu layer, a max pooling layer maxPool, and 4 residual learning blocks connected in sequence, as shown in FIG. 6-(a); conv1 has 2m input channels, where m is the number of optical flow maps sampled from the optical flow sequence F, i.e., m optical flow maps are uniformly sampled from F, and in this embodiment m is 10; the 4 residual learning blocks all have a three-layer structure, i.e., each residual learning block consists of 3 convolutional network layers Conv2 and Relu layers connected in series, as shown in FIG. 6-(b).
m optical flow maps are uniformly sampled from the optical flow sequence F within the bounding box of the preceding vehicle, and the vertical and horizontal components of each optical flow map are treated as its two channels. The vertical and horizontal components of the m optical flow maps form 2m optical flow components, which are input into the FEN; the output of the FEN is the motion feature of the optical flow maps within the bounding box of the preceding vehicle at the current time. In this embodiment, the motion feature extracted by the FEN has 2048 dimensions, and FC_2 transforms it to dimension 256, yielding the 256-dimensional motion feature vector of the preceding vehicle at the current time t.
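The FEN can be sketched by adapting a standard ResNet-50: its first convolution is replaced so that it accepts 2m flow channels and its classification head is removed so that the pooled 2048-dimensional feature is exposed, which FC_2 then reduces to 256 dimensions; using torchvision's ResNet-50 as the backbone is an assumption made for illustration.

```python
# Hedged sketch of the motion feature extraction network FEN (ResNet-50 backbone
# with a 2m-channel first convolution) followed by FC_2.
import torch.nn as nn
import torchvision

class FlowEncoder(nn.Module):
    def __init__(self, m=10, out_dim=256):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        backbone.conv1 = nn.Conv2d(2 * m, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)   # conv1 with 2m input channels
        backbone.fc = nn.Identity()                          # keep the 2048-d pooled feature
        self.fen = backbone
        self.fc2 = nn.Linear(2048, out_dim)                  # FC_2

    def forward(self, flows):
        """flows: (batch, 2m, 224, 224) stacked components of the m sampled flow maps."""
        return self.fc2(self.fen(flows))                     # 256-d motion feature vector
```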
The feature fusion unit 1-3 concatenates the time-series feature vector and the motion feature vector of the preceding vehicle into the fused feature vector of the preceding vehicle, which represents the historical bounding box information and the historical optical flow information of the vehicle, i.e., information on the position, scale, appearance, and motion of the preceding vehicle at different times in the past time period; in this embodiment, the fused feature vector is a 512-dimensional vector.
The preceding vehicle position prediction decoder 1-4 decodes the fused feature vector according to the motion prediction sequence M of the own vehicle to obtain the predicted bounding boxes of the preceding vehicle in the video frames of the Δ times after the current time t.
The preceding vehicle position prediction decoder comprises a decoding gated recurrent neural network GRU_d and a third fully connected layer FC_3; the input of GRU_d is the fusion vector Mh_{t+δ} of the predicted value M_{t+δ} of the own-vehicle motion information at time t+δ and the hidden state vector passed down by GRU_d at the previous time, together with the hidden state vector passed down by GRU_d at the previous time, 1 ≤ δ ≤ Δ; its output is the decoding result of the bounding box of the preceding vehicle at time t+δ; FC_3 performs a dimension transformation on this decoding result, converting it into a 4-dimensional vector, to obtain the bounding box of the preceding vehicle at time t+δ.
The structure of the decoding gated recurrent neural network GRU_d follows the abbreviated form above, where θ_d is the weight parameter V of GRU_d.
In this embodiment, the fusion vector Mh_{t+δ} is calculated as follows: the 6-dimensional vector M_{t+δ} is transformed into a 512-dimensional vector by a fourth fully connected layer FC_4, this vector is linearly mapped using the ReLU activation function, and the linearly mapped vector is added to the hidden state vector passed down by GRU_d at the previous time to obtain the 512-dimensional fusion vector Mh_{t+δ}, where Average() denotes averaging the two vectors after addition.
S2, constructing a sample set and training the vehicle position prediction model, comprising the following steps:
S2-1, collecting a plurality of vehicle-mounted video clips of duration s in which a preceding vehicle is visible, sampling the video frames in each video clip, and determining, for the sampled video frames, the bounding box sequence B_tr of the preceding vehicle, the optical flow sequence F_tr within the bounding box, and the motion information sequence M_tr of the own vehicle at the times corresponding to the video frames, to form a sample set;
S2-2, dividing the sample set into a training set and a verification set, and setting a learning rate σ and a batch size N;
S2-3, adopting an Adam optimizer in the training process and determining the number of training batches N' according to the number of samples in the training set and N; taking the B_tr and F_tr corresponding to the video frames of the first s' of each video clip in a training sample, together with the M_tr corresponding to the video frames of the last s'', as the input of the vehicle position prediction model, and the B_tr corresponding to the video frames of the last s'' as the output; training the model, storing the model parameters, and verifying the prediction accuracy of the model with the verification set; s' + s'' = s;
S2-4, selecting the model parameters with the highest prediction accuracy among the N' batches of training as the parameters of the vehicle position prediction model;
In this embodiment, 1000 video clips are collected, each 3 seconds long at 20 frames per second, and the bounding boxes of the vehicle in the following 2 seconds are predicted from the bounding boxes of the vehicle in the first 1 second; the training set accounts for 70% of the sample set and the verification set for 30%. The training process uses an Adam optimizer with a fixed learning rate of 0.0005 and a batch size of 64, for a total of 40 batches. During training, the difference between the actual bounding box sequence of the vehicle and the bounding box sequence Y in the prediction result is measured with a smooth L1 loss function, the error is back-propagated for optimization, and the final network weight parameters are stored; in the loss function, || denotes the modulus of a vector.
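A training-loop sketch matching the settings above (Adam, learning rate 0.0005, batch size 64, smooth L1 loss between predicted and ground-truth box sequences) is given below; `model` and the data loader are placeholders for the encoders and decoder described earlier, and interpreting the 40 training rounds as epochs is an assumption.

```python
# Hedged sketch of the training procedure with a smooth L1 loss and Adam.
import torch
import torch.nn as nn

def train(model, loader, epochs=40, lr=5e-4, device="cpu"):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.SmoothL1Loss()
    model.to(device).train()
    for _ in range(epochs):
        for B_tr, F_tr, M_tr, Y_gt in loader:   # past boxes, past flows, future ego-motion, future boxes
            Y_pred = model(B_tr.to(device), F_tr.to(device), M_tr.to(device))
            loss = criterion(Y_pred, Y_gt.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```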
The prediction phase comprises:
A camera capable of capturing the preceding vehicle is mounted on the own vehicle, and the video data collected by the camera while the vehicle is driving is obtained;
vehicle detection and tracking are performed on each frame of the video to obtain the bounding box sequence of each preceding vehicle, which is stored in B_test(i), where i is the index of the preceding vehicle; at the same time the optical flow within the bounding box is calculated and stored in F_test(i); the motion information of the own vehicle in future frames is obtained and stored in the sequence M_test;
a first sliding window SW-1 of length T is applied to the sequences B_test(i) and F_test(i), and a second sliding window SW-2 of length Δ is applied to the sequence M_test, to extract, respectively, the bounding boxes of vehicle i and the optical flows within them in the T video frames before the current time t, and the predicted motion information of the own vehicle in the Δ video frames after the current time t; these are input into the trained vehicle position prediction model to obtain the bounding box sequence Y'(i) = [Y'_{t+1}(i), Y'_{t+2}(i), …, Y'_{t+δ}(i), …, Y'_{t+Δ}(i)] of the preceding vehicle i in the Δ video frames after the current time t, and the position of the predicted bounding boxes relative to the bounding box of the preceding vehicle i in the video frame at the current time is calculated, where B_{test,t+0}(i) is the bounding box of the preceding vehicle i at the current time t and 1 ≤ δ ≤ Δ; the sliding windows are shown in FIG. 7. As time goes on, the two sliding windows each move forward one step, and the position of the preceding vehicle at the next time is predicted.
The predicted trajectory of the preceding vehicle i is obtained from the centers of the bounding boxes in Y'(i), and the scale of the preceding vehicle i is obtained from the widths and heights of the bounding boxes in Y'(i).
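The sliding-window inference can be sketched as below; treating the network output as offsets that are added back to the bounding box B_{test,t+0}(i) at the current time is an assumption about the relative-position formula that is not reproduced in the text, and the deque-based buffers are illustrative only.

```python
# Hedged sketch of the prediction-stage sliding windows: keep the last T boxes
# and flow patches of vehicle i, feed them with the next Δ ego-motion vectors to
# the trained model, and shift the predictions back to absolute coordinates.
import collections
import torch

T = 20  # window length; Δ = 40 future frames in the embodiment
B_window = collections.deque(maxlen=T)   # bounding boxes (cx, cy, w, h) of vehicle i, oldest first
F_window = collections.deque(maxlen=T)   # flow patches inside those boxes

def predict_future_boxes(model, M_future):
    """M_future: (Δ, 6) predicted ego-motion vectors for the next Δ frames."""
    B = torch.stack(list(B_window)).unsqueeze(0)   # (1, T, 4)
    F = torch.stack(list(F_window)).unsqueeze(0)   # (1, T, 2m, 224, 224)
    M = M_future.unsqueeze(0)                      # (1, Δ, 6)
    with torch.no_grad():
        Y_rel = model(B, F, M).squeeze(0)          # (Δ, 4) predictions relative to the current box
    return Y_rel + B_window[-1]                    # absolute bounding box sequence Y'(i)
```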
In this embodiment, the prediction result is displayed in the video frame at the current time, as shown in fig. 8.
As shown in fig. 9, the present invention also discloses a prediction system for implementing the method for predicting a position of a vehicle ahead based on a vehicle-mounted video, including:
a vehicle position prediction model 1 based on an encoding-decoding framework, used for predicting the bounding boxes of the preceding vehicle at times t+1, t+2, …, t+Δ after the current time t according to the bounding boxes of the preceding vehicle at times t-0, t-1, …, t-(T-1) before the current time t, the optical flows within those bounding boxes, and the motion information of the own vehicle at times t+1, t+2, …, t+Δ after the current time t;
the vehicle position prediction model comprises: a preceding vehicle bounding box encoder 1-1, a preceding vehicle optical flow encoder 1-2, a feature fusion unit 1-3, and a preceding vehicle position prediction decoder 1-4;
the preceding vehicle bounding box encoder encodes the bounding box sequence B of the preceding vehicle to obtain the time-series feature vector of the preceding vehicle;
the preceding vehicle optical flow encoder encodes the optical flow sequence F within the bounding box of the preceding vehicle to obtain the motion feature vector of the preceding vehicle;
the feature fusion unit concatenates the time-series feature vector and the motion feature vector of the preceding vehicle into the fused feature vector of the preceding vehicle;
the preceding vehicle position prediction decoder decodes the fused feature vector according to the motion prediction sequence M of the own-vehicle motion information to obtain the predicted bounding boxes of the preceding vehicle in the video frames of the Δ times after the current time t;
a vehicle bounding box acquisition module 2, used for acquiring the bounding box sequence B of the preceding vehicle in the vehicle-mounted video;
a vehicle bounding box optical flow acquisition module 3, used for acquiring the optical flow sequence F within the bounding box of the preceding vehicle in the vehicle-mounted video;
and an own-vehicle motion information prediction module 4, used for predicting the motion information of the own vehicle at future times to form the own-vehicle motion prediction sequence M.
Claims (10)
1. A vehicle-mounted video-based front vehicle position prediction method comprises a training phase and a prediction phase, and is characterized in that the training phase comprises the following steps:
S1, constructing a vehicle position prediction model based on an encoding-decoding framework, wherein the vehicle position prediction model is used for predicting the bounding boxes of the preceding vehicle at times t+1, t+2, …, t+Δ after the current time t according to the bounding boxes of the preceding vehicle at times t-0, t-1, …, t-(T-1) before the current time t, the optical flows within those bounding boxes, and the motion information of the own vehicle at times t+1, t+2, …, t+Δ after the current time t;
the input of the vehicle position prediction model includes: the bounding box sequence B of the preceding vehicle and the optical flow sequence F within the bounding box of the preceding vehicle in the video frames of the T times before the current time t, and the motion prediction sequence M of the own vehicle in the video frames of the Δ times after the current time t;
the output of the vehicle position prediction model is the predicted bounding box sequence Y of the preceding vehicle in the video frame images of the Δ times after the current time t;
the vehicle position prediction model comprises: a preceding vehicle bounding box encoder, a preceding vehicle optical flow encoder, a feature fusion unit, and a preceding vehicle position prediction decoder;
the preceding vehicle bounding box encoder is used for encoding the bounding box sequence B of the preceding vehicle to obtain a time-series feature vector of the preceding vehicle;
the preceding vehicle optical flow encoder is used for encoding the optical flow sequence F within the bounding box of the preceding vehicle to obtain a motion feature vector of the preceding vehicle;
the feature fusion unit concatenates the time-series feature vector and the motion feature vector of the preceding vehicle into a fused feature vector of the preceding vehicle;
the preceding vehicle position prediction decoder decodes the fused feature vector according to the motion prediction sequence M of the own vehicle to obtain the predicted bounding boxes of the preceding vehicle in the video frames of the Δ times after the current time t;
S2, constructing a sample set and training the vehicle position prediction model, comprising the following steps:
S2-1, collecting a plurality of vehicle-mounted video clips of duration s in which a preceding vehicle is visible, sampling the video frames in each video clip, and determining, for the sampled video frames, the bounding box sequence B_tr of the preceding vehicle, the optical flow sequence F_tr within the bounding box, and the motion prediction sequence M_tr of the own vehicle at the times corresponding to the video frames, to form a sample set;
S2-2, dividing the sample set into a training set and a verification set, and setting a learning rate σ and a batch size N;
S2-3, adopting an Adam optimizer in the training process and determining the number of training batches N' according to the number of samples in the training set and N; taking the B_tr and F_tr corresponding to the video frames of the first s' of each video clip in a training sample, together with the M_tr corresponding to the video frames of the last s'', as the input of the vehicle position prediction model, and the B_tr corresponding to the video frames of the last s'' as the output; training the model, storing the model parameters, and verifying the prediction accuracy of the model with the verification set; s' + s'' = s;
S2-4, selecting the model parameters with the highest prediction accuracy among the N' batches of training as the parameters of the vehicle position prediction model;
the prediction phase comprises:
a camera capable of capturing the preceding vehicle is mounted on the own vehicle, and the video data collected by the camera while the vehicle is driving is obtained;
vehicle detection and tracking are performed on each frame of the video to obtain the bounding box sequence of each preceding vehicle, which is stored in B_test(i), where i is the index of the preceding vehicle; at the same time the optical flow within the bounding box is calculated and stored in F_test(i); the motion information of the own vehicle in future frames is obtained and stored in the sequence M_test;
a first sliding window of length T is applied to the sequences B_test(i) and F_test(i), and a second sliding window of length Δ is applied to the sequence M_test, to extract, respectively, the bounding boxes of vehicle i and the optical flows within them in the T video frames before the current time t, and the predicted motion information of the own vehicle in the Δ video frames after the current time t; these are input into the trained vehicle position prediction model to obtain the bounding box sequence Y'(i) = [Y'_{t+1}(i), Y'_{t+2}(i), …, Y'_{t+δ}(i), …, Y'_{t+Δ}(i)] of the preceding vehicle i in the Δ video frames after the current time t, and the position of the predicted bounding boxes relative to the bounding box of the preceding vehicle i in the video frame at the current time is calculated, where B_{test,t+0}(i) is the bounding box of the preceding vehicle i at the current time t and 1 ≤ δ ≤ Δ;
the predicted trajectory of the preceding vehicle i is obtained from the centers of the bounding boxes in Y'(i), and the scale of the preceding vehicle i is obtained from the widths and heights of the bounding boxes in Y'(i).
2. A preceding vehicle position prediction method according to claim 1, characterized in that the sequence of bounding boxes of the preceding vehicle is calculated using the steps of:
a.1, carrying out vehicle detection on video frame images at continuous T moments to obtain surrounding frames of all vehicles in each frame image;
and A.2, tracking the vehicle enclosure frame obtained in the step A.1 by adopting a multi-target tracking algorithm, giving the same number to the same vehicle in different frames, and forming a front vehicle enclosure frame sequence B of T moments according to a time sequence.
3. The preceding vehicle position prediction method according to claim 1, characterized in that the optical flow sequence within the bounding box of the preceding vehicle is calculated by the following steps:
B.1, calculating, for the video images at the T consecutive times, the optical flow between each frame and the image of the previous frame, to obtain the optical flow map corresponding to each frame; the two-dimensional optical flow vector at the j-th pixel of the optical flow map is I_j = (u_j, v_j), where u_j and v_j are the vertical and horizontal components of the optical flow vector, respectively;
B.2, cropping, from the optical flow map corresponding to the image at time t-τ, the region covered by the bounding box of the preceding vehicle in that image, and scaling it to a preset uniform size to obtain the optical flow map within the bounding box at time t-τ; the optical flow sequence F within the bounding box of the preceding vehicle over the T times is formed in chronological order, where t-τ denotes the τ-th time before time t and 0 ≤ τ < T.
4. The preceding vehicle position prediction method according to claim 1, characterized in that the motion prediction sequence of the own vehicle is calculated by using:
c.1, calculating the video frames at T-0, T-1, … and T- (T-1) before the current time TAdjacent moment video frame P t-τ-1 And P t-τ Camera rotation matrix R t-τ And a translation vector V t-τ Forming a rotation matrix sequence RS and a translation vector sequence VS, and the value is more than or equal to 0 and less than or equal to tau<T, specifically comprising the steps C.1-1 to C.1-2:
C.1-1, calculating the essential matrix E using the eight-point method, comprising the following steps:
C.1-1-1, extracting feature points of P_{t−τ−1} and P_{t−τ} using the SURF algorithm and selecting the 8 best-matched pairs of feature points (a_l, a′_l), l = 1, 2, …, 8; where a_l and a′_l are the normalized-plane coordinates of the l-th pair of matched feature points in video frames P_{t−τ−1} and P_{t−τ}, respectively, with a_l = [x_l, y_l, 1]^T and a′_l = [x′_l, y′_l, 1]^T; a_l and a′_l are each 3 × 1 matrices, where T denotes the matrix transpose;
C.1-1-2, stacking the 8 pairs of matched feature points into 3 × 8 matrices A and A′ and imposing the constraint

A^T E A′ = 0

solving this system of equations yields the essential matrix E, which is a 3 × 3 matrix;
C.1-2, performing singular value decomposition on E to obtain the camera rotation matrix R_{t−τ} and translation vector V_{t−τ}, where R_{t−τ} is a 3 × 3 matrix and V_{t−τ} is a 3-dimensional column vector;
finally obtaining the rotation matrix sequence RS = {R_{t−(T−1)}, …, R_{t−τ}, …, R_{t−1}, R_{t−0}} and the translation vector sequence VS = {V_{t−(T−1)}, …, V_{t−τ}, …, V_{t−1}, V_{t−0}} of the T video frames before time t;
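Steps C.1-1 and C.1-2 can be approximated with OpenCV's two-view geometry routines. The sketch below uses SIFT instead of SURF (SURF is not shipped in all OpenCV builds) and `cv2.findEssentialMat`/`cv2.recoverPose` in place of a hand-rolled eight-point solve and SVD, so it is an analogous implementation rather than the literal claimed procedure; `K` is the camera intrinsic matrix.

```python
import cv2
import numpy as np

def relative_pose(img_prev, img_curr, K):
    """Match feature points between adjacent frames, estimate the essential matrix E,
    and decompose it into a rotation matrix R and translation vector V."""
    detector = cv2.SIFT_create()
    kp1, des1 = detector.detectAndCompute(img_prev, None)
    kp2, des2 = detector.detectAndCompute(img_curr, None)
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)
    matches = sorted(matches, key=lambda m: m.distance)[:50]        # keep best matches
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)   # essential matrix
    _, R, V, _ = cv2.recoverPose(E, pts1, pts2, K)                  # decomposition of E
    return R, V                                                     # 3x3 rotation, 3x1 translation
```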
C.2, for the camera rotation matrices and translation vectors in the RS and VS obtained in C.1, calculating the cumulative value of each R_{t−τ} and V_{t−τ} with that of the previous time, the cumulative values being denoted R′_{t−τ} and V′_{t−τ};
C.3, taking the R′_{t−0} and V′_{t−0} finally obtained in C.2 as the rotation matrix and translation vector of the camera at the next time, as given by the following formulas:

R_{t+1} = R′_{t−0}
V_{t+1} = V′_{t−0}
C.4, appending the R_{t+1} and V_{t+1} obtained in C.3 to the end of the rotation matrix sequence RS and the translation vector sequence VS obtained in C.1, respectively, and continuing to execute C.2 and C.3 until all rotation matrices {R_{t+1}, R_{t+2}, …, R_{t+δ}, …, R_{t+△}} and all translation vectors {V_{t+1}, V_{t+2}, …, V_{t+δ}, …, V_{t+△}} of the △ video frames after time t are obtained, with 1 ≤ δ ≤ △;
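The loop structure of steps C.2-C.4 is sketched below. Because the accumulation formula of step C.2 appears only as an image in the original claim, it is left here as a user-supplied callable `accumulate(RS, VS)`; the function only illustrates how the extrapolated pose is appended and the process repeated for △ future frames.

```python
def extrapolate_ego_motion(RS, VS, accumulate, Delta):
    """Repeatedly accumulate the pose sequences (C.2), take the last cumulative value
    as the pose of the next future frame (C.3), append it and iterate (C.4) until
    Delta future rotations and translations are available."""
    RS, VS = list(RS), list(VS)
    future_R, future_V = [], []
    for _ in range(Delta):
        R_next, V_next = accumulate(RS, VS)   # cumulative value at tau = 0 (formula not reproduced)
        RS.append(R_next)                     # append and repeat
        VS.append(V_next)
        future_R.append(R_next)
        future_V.append(V_next)
    return future_R, future_V                 # {R_{t+1..t+Delta}}, {V_{t+1..t+Delta}}
```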
C.5, calculating the motion vector of the own vehicle at each of the △ times after the current time t to form the own-vehicle motion prediction sequence M = {M_{t+1}, M_{t+2}, …, M_{t+δ}, …, M_{t+△}}; this specifically comprises steps C.5-1 to C.5-2:
C.5-1, extracting from the rotation matrix R_{t+δ} the rotation angle information of the camera about the x, y and z axes and representing it as a 3-dimensional row vector ψ_{t+δ}, where r_{jk} denotes the value in the j-th row and k-th column of R_{t+δ}, j, k ∈ {1, 2, 3}; atan2() and atan() both denote arctangent functions, but the range of atan() is (−π/2, π/2) while the range of atan2() is (−π, π];
C.5-2, concatenating the vector ψ_{t+δ} with the translation vector V_{t+δ}^T, converted into a three-dimensional row vector, to form the 6-dimensional row vector M_{t+δ}: M_{t+δ} = [ψ_{t+δ}, V_{t+δ}^T];
finally obtaining the own-vehicle motion prediction sequence M = {M_{t+1}, M_{t+2}, …, M_{t+δ}, …, M_{t+△}};
C.6, passing M through a fully connected layer FC_4 to transform the dimension of all its motion vectors.
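Step C.5 maps each future rotation matrix to angles about the x, y and z axes via atan2 and concatenates them with the translation. The sketch below uses a standard ZYX extraction; since the patent's exact angle formula is shown only as an image, this particular convention is an assumption.

```python
import numpy as np

def motion_vector(R, V):
    """Extract x/y/z rotation angles from rotation matrix R with atan2 (standard ZYX
    convention, assumed) and concatenate them with translation V into a 6-D row vector M."""
    theta_x = np.arctan2(R[2, 1], R[2, 2])                       # rotation about x
    theta_y = np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2]))   # rotation about y
    theta_z = np.arctan2(R[1, 0], R[0, 0])                       # rotation about z
    psi = np.array([theta_x, theta_y, theta_z])                  # 3-D row vector psi_{t+delta}
    return np.concatenate([psi, np.asarray(V).ravel()])          # M_{t+delta} = [psi, V^T]
```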
5. The preceding vehicle position prediction method according to claim 1, characterized in that the preceding vehicle bounding box encoder comprises an encoding gated recurrent neural network GRU_b and a first fully connected layer FC_1; the input of GRU_b is the bounding box B_{t−τ} at each time in the preceding vehicle bounding box sequence B together with the hidden state vector passed down by GRU_b at the previous time, and its output is the encoding result of the preceding vehicle bounding box at the current time; FC_1 performs dimension conversion on the final output of GRU_b to obtain the time-series feature vector of the preceding vehicle at the current time t.
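A minimal PyTorch sketch of the claim 5 encoder (GRU_b followed by FC_1); the hidden and output dimensions are illustrative assumptions.

```python
import torch.nn as nn

class BoxEncoder(nn.Module):
    """GRU_b consumes the bounding box at each past time step; FC_1 converts the final
    hidden state into the time-series feature vector of the preceding vehicle."""
    def __init__(self, box_dim=4, hidden_dim=128, feat_dim=256):
        super().__init__()
        self.gru_b = nn.GRU(box_dim, hidden_dim, batch_first=True)
        self.fc_1 = nn.Linear(hidden_dim, feat_dim)

    def forward(self, boxes):            # boxes: (batch, T, 4)
        _, h_last = self.gru_b(boxes)    # final hidden state, (1, batch, hidden_dim)
        return self.fc_1(h_last[-1])     # time-series feature vector, (batch, feat_dim)
```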
6. The preceding vehicle position prediction method according to claim 1, characterized in that the preceding vehicle optical flow encoder comprises a CNN-based motion feature extraction network FEN and a second fully connected layer FC_2; the input of the FEN is the optical flow sequence F within the preceding vehicle bounding box, and its output is the optical flow encoding result within the preceding vehicle bounding box at the current time; the FEN is based on the ResNet50 framework and comprises a convolutional layer conv1, a Relu layer, a max-pooling layer maxPool and 4 residual learning blocks connected in sequence; the number of input channels of conv1 is 2m, where m is the number of optical flow maps sampled from the optical flow sequence F, i.e., m optical flow maps are uniformly sampled from F; the 4 residual learning blocks each adopt a three-layer structure, i.e., each residual learning block consists of convolutional layers and Relu layers connected in series;
the vertical and horizontal components of the m optical flow maps uniformly sampled from the optical flow sequence F within the preceding vehicle bounding box form 2m optical flow components, which are input into the FEN; the output of the FEN is the motion feature of the in-box optical flow of the preceding vehicle at the current time.
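A minimal PyTorch sketch of the claim 6 optical flow encoder: a ResNet50 backbone whose first convolution accepts 2m channels, followed by FC_2. Reusing the full torchvision ResNet50 (rather than re-implementing the conv1/Relu/maxPool/residual-block stack) and the chosen m and output size are assumptions for illustration.

```python
import torch.nn as nn
from torchvision.models import resnet50

class FlowEncoder(nn.Module):
    """FEN: ResNet50-based feature extractor with a 2m-channel first convolution,
    followed by a fully connected layer FC_2."""
    def __init__(self, m=8, feat_dim=256):
        super().__init__()
        backbone = resnet50(weights=None)
        backbone.conv1 = nn.Conv2d(2 * m, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)   # 2m-channel input (u and v of m flow maps)
        backbone.fc = nn.Identity()                          # keep the 2048-d pooled feature
        self.fen = backbone
        self.fc_2 = nn.Linear(2048, feat_dim)

    def forward(self, flows):            # flows: (batch, 2m, H, W)
        return self.fc_2(self.fen(flows))
```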
7. The preceding vehicle position prediction method according to claim 1, characterized in that the preceding vehicle position prediction decoder comprises a decoding gated recurrent neural network GRU_d and a third fully connected layer FC_3; the input of GRU_d is the fusion vector Mh_{t+δ} of the predicted own-vehicle motion information M_{t+δ} at time t+δ and the hidden state vector passed down by GRU_d at the previous time, with 1 ≤ δ ≤ △, and its output is the decoding result of the preceding vehicle bounding box at time t+δ; FC_3 performs dimension conversion on this decoding result to obtain the preceding vehicle bounding box at time t+δ.
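A minimal PyTorch sketch of the claim 7 decoder: at each future step the ego-motion prediction is fused with the previous hidden state and fed to a GRU cell, and FC_3 maps the output to a bounding box. Fusing by concatenation and the stated dimensions are assumptions.

```python
import torch
import torch.nn as nn

class BoxDecoder(nn.Module):
    """GRU_d decodes future bounding boxes step by step from the ego-motion predictions
    M_{t+delta} fused with its own hidden state; FC_3 converts each output to a box."""
    def __init__(self, motion_dim=6, hidden_dim=512, box_dim=4):
        super().__init__()
        self.gru_d = nn.GRUCell(motion_dim + hidden_dim, hidden_dim)
        self.fc_3 = nn.Linear(hidden_dim, box_dim)

    def forward(self, motions, h0):                  # motions: (batch, Delta, 6); h0: (batch, hidden)
        h, boxes = h0, []
        for delta in range(motions.size(1)):
            fused = torch.cat([motions[:, delta], h], dim=1)   # fusion vector Mh_{t+delta}
            h = self.gru_d(fused, h)
            boxes.append(self.fc_3(h))               # predicted box at time t + delta + 1
        return torch.stack(boxes, dim=1)             # (batch, Delta, 4)
```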
8. A preceding vehicle position prediction system based on vehicle-mounted video, characterized by comprising:
a vehicle position prediction model based on an encoder-decoder framework, used for predicting the bounding boxes of the preceding vehicle at times t+1, t+2, …, t+△ after the current time t from the bounding boxes of the preceding vehicle at times t−0, t−1, …, t−(T−1) before the current time t, the optical flow within those bounding boxes, and the motion information of the own vehicle at times t+1, t+2, …, t+△ after the current time t;
the vehicle position prediction model comprises: a preceding vehicle bounding box encoder, a preceding vehicle optical flow encoder, a feature fusion unit and a preceding vehicle position prediction decoder;
the preceding vehicle bounding box encoder is used for encoding the preceding vehicle bounding box sequence B to obtain the time-series feature vector of the preceding vehicle;
the preceding vehicle optical flow encoder is used for encoding the optical flow sequence F within the preceding vehicle bounding box to obtain the motion feature vector of the preceding vehicle;
the feature fusion unit connects the time-series feature vector and the motion feature vector of the preceding vehicle into the fused feature vector of the preceding vehicle;
the preceding vehicle position prediction decoder decodes the fused feature vector according to the own-vehicle motion prediction sequence M to obtain the predicted bounding boxes of the preceding vehicle in the video frames at the △ times after the current time t;
a vehicle bounding box acquisition module, used for acquiring the bounding box sequence B of the preceding vehicle in the vehicle-mounted video;
a vehicle bounding box optical flow acquisition module, used for acquiring the optical flow sequence F within the preceding vehicle bounding box in the vehicle-mounted video;
and an own-vehicle motion information prediction module, used for predicting the motion information of the own vehicle at future times to form the own-vehicle motion prediction sequence M.
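Putting the modules of claim 8 together, a minimal sketch of the overall encoder-decoder assembly might look as follows, reusing the BoxEncoder, FlowEncoder and BoxDecoder sketches above; the linear projection of the fused feature to the decoder's initial hidden state is an illustrative assumption.

```python
import torch
import torch.nn as nn

class VehiclePositionPredictor(nn.Module):
    """Encoders produce the temporal and motion feature vectors, the fusion unit
    concatenates them, and the decoder uses the ego-motion prediction sequence M
    to emit the future bounding boxes of the preceding vehicle."""
    def __init__(self, box_enc, flow_enc, decoder, fused_dim=512, hidden_dim=512):
        super().__init__()
        self.box_enc, self.flow_enc, self.decoder = box_enc, flow_enc, decoder
        self.to_hidden = nn.Linear(fused_dim, hidden_dim)   # project fused feature to decoder state

    def forward(self, boxes, flows, ego_motion):
        q_t = self.box_enc(boxes)                  # time-series feature vector
        o_t = self.flow_enc(flows)                 # motion feature vector
        fused = torch.cat([q_t, o_t], dim=1)       # fused feature vector of the preceding vehicle
        return self.decoder(ego_motion, self.to_hidden(fused))
```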
9. The preceding vehicle position prediction system according to claim 8, characterized in that the preceding vehicle bounding box encoder comprises an encoding gated recurrent neural network GRU_b and a first fully connected layer FC_1; the input of GRU_b is the bounding box B_{t−τ} at each time in the preceding vehicle bounding box sequence B together with the hidden state vector passed down by GRU_b at the previous time, and its output is the encoding result of the preceding vehicle bounding box at the current time; FC_1 performs dimension conversion on the final output of GRU_b to obtain the time-series feature vector of the preceding vehicle at the current time t.
10. The preceding vehicle position prediction system according to claim 8, characterized in that the preceding vehicle optical flow encoder comprises a CNN-based motion feature extraction network FEN and a second fully connected layer FC_2; the input of the FEN is the optical flow sequence F within the preceding vehicle bounding box, and its output is the optical flow encoding result within the preceding vehicle bounding box at the current time; the FEN is based on the ResNet50 framework and comprises a convolutional layer conv1, a Relu layer, a max-pooling layer maxPool and 4 residual learning blocks connected in sequence; the number of input channels of conv1 is 2m, where m is the number of optical flow maps sampled from the optical flow sequence F, i.e., m optical flow maps are uniformly sampled from F; the 4 residual learning blocks each adopt a three-layer structure, i.e., each residual learning block consists of convolutional layers and Relu layers connected in series;
the vertical and horizontal components of the m optical flow maps uniformly sampled from the optical flow sequence F within the preceding vehicle bounding box form 2m optical flow components, which are input into the FEN; the output of the FEN is the motion feature of the in-box optical flow of the preceding vehicle at the current time.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110051940.3A CN112800879B (en) | 2021-01-15 | 2021-01-15 | Vehicle-mounted video-based front vehicle position prediction method and prediction system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112800879A CN112800879A (en) | 2021-05-14 |
CN112800879B true CN112800879B (en) | 2022-08-26 |
Family
ID=75811025
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110051940.3A Active CN112800879B (en) | 2021-01-15 | 2021-01-15 | Vehicle-mounted video-based front vehicle position prediction method and prediction system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112800879B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113610900B (en) * | 2021-10-11 | 2022-02-15 | 深圳佑驾创新科技有限公司 | Method and device for predicting scale change of vehicle tail sequence and computer equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108846854A (en) * | 2018-05-07 | 2018-11-20 | 中国科学院声学研究所 | A kind of wireless vehicle tracking based on motion prediction and multiple features fusion |
CN111914664A (en) * | 2020-07-06 | 2020-11-10 | 同济大学 | Vehicle multi-target detection and track tracking method based on re-identification |
CN111931905A (en) * | 2020-07-13 | 2020-11-13 | 江苏大学 | Graph convolution neural network model and vehicle track prediction method using same |
Non-Patent Citations (1)
Title |
---|
Vehicle behavior detection method based on a hybrid CNN and LSTM model; Wang Shuo et al.; Intelligent Computer and Applications; 2020-02-01 (No. 02); full text *
Also Published As
Publication number | Publication date |
---|---|
CN112800879A (en) | 2021-05-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yang et al. | Top-view trajectories: A pedestrian dataset of vehicle-crowd interaction from controlled experiments and crowded campus | |
CN109740419A (en) | A kind of video behavior recognition methods based on Attention-LSTM network | |
Piccoli et al. | Fussi-net: Fusion of spatio-temporal skeletons for intention prediction network | |
Bai et al. | Deep learning based motion planning for autonomous vehicle using spatiotemporal LSTM network | |
CN109910909A (en) | A kind of interactive prediction technique of vehicle track net connection of more vehicle motion states | |
CN110599521B (en) | Method for generating trajectory prediction model of vulnerable road user and prediction method | |
CN104506800A (en) | Scene synthesis and comprehensive monitoring method and device for electronic police cameras in multiple directions | |
CN108267123A (en) | A kind of double-current vehicle-mounted pedestrian vehicle Forecasting Methodology based on bounding box and range prediction | |
CN113592905B (en) | Vehicle driving track prediction method based on monocular camera | |
CN111292366A (en) | Visual driving ranging algorithm based on deep learning and edge calculation | |
CN114820708A (en) | Peripheral multi-target trajectory prediction method based on monocular visual motion estimation, model training method and device | |
CN112800879B (en) | Vehicle-mounted video-based front vehicle position prediction method and prediction system | |
CN117274749A (en) | Fused 3D target detection method based on 4D millimeter wave radar and image | |
CN113435356B (en) | Track prediction method for overcoming observation noise and perception uncertainty | |
CN114299473A (en) | Driver behavior identification method based on multi-source information fusion | |
CN117058474B (en) | Depth estimation method and system based on multi-sensor fusion | |
CN114620059B (en) | Automatic driving method, system thereof and computer readable storage medium | |
CN117516581A (en) | End-to-end automatic driving track planning system, method and training method integrating BEVFomer and neighborhood attention transducer | |
CN112733734A (en) | Traffic abnormal event detection method based on combination of Riemann manifold characteristics and LSTM network | |
Lee et al. | Low computational vehicle lane changing prediction using drone traffic dataset | |
Wang et al. | An end-to-end auto-driving method based on 3D LiDAR | |
CN115512323A (en) | Vehicle track prediction method in automatic driving field of vision based on deep learning | |
Wang et al. | LSTM-based prediction method of surrounding vehicle trajectory | |
Liu et al. | End-to-end control of autonomous vehicles based on deep learning with visual attention | |
CN111242044A (en) | Night unmanned vehicle scene prediction method based on ConvLSTM dual-channel coding network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||