CN108803617B

CN108803617B - Trajectory prediction method and apparatus

Info

Publication number: CN108803617B
Application number: CN201810752554.5A
Authority: CN
Inventors: 邹文斌; 周长源; 吴迪; 王振楠; 唐毅; 李霞
Original assignee: Shenzhen University
Current assignee: Shenzhen University
Priority date: 2018-07-10
Filing date: 2018-07-10
Publication date: 2020-03-20
Anticipated expiration: 2038-07-10
Also published as: CN108803617A

Abstract

The embodiment of the invention provides a track prediction method and a track prediction device, relates to the field of local navigation of robots and intelligent vehicles, and is applied to vehicles provided with vehicle-mounted cameras, and the method comprises the following steps: and shooting the surrounding environment by using the vehicle-mounted camera to obtain a video sequence comprising surrounding vehicles and a vehicle background. And positioning the surrounding vehicles from the video sequence, extracting historical track information of the surrounding vehicles, and taking scene semantic information obtained by image segmentation of the video sequence as auxiliary information. And inputting the historical track information and the auxiliary information into a neural network model to obtain the predicted track of the surrounding vehicle. The trajectory prediction method can improve the accuracy of predicting the vehicle trajectory.

Description

Trajectory prediction method and apparatus

Technical Field

The invention relates to the field of local navigation of robots and intelligent vehicles, in particular to a track prediction method and a track prediction device.

Background

During vehicle travel, it is important to predict future trajectories of other traffic participants to avoid collision of the autonomous vehicle with other vehicles. Assuming that all traffic participants comply with traffic regulations and that human drivers can subconsciously predict the future trajectory of a target, modeling methods are typically employed to predict the future trajectories of other traffic participants for autonomous vehicles.

However, most of the current work is to extract visual semantic messages by using static images, or to learn a driving network by adopting an end-to-end structure, wherein the former ignores the time continuity in the driving situation, and the latter lacks the interpretability of the training network, thereby causing the problem of low accuracy of predicting the vehicle track.

Disclosure of Invention

The invention mainly aims to provide a track prediction method and a track prediction device, which can improve the accuracy of vehicle track prediction.

The track prediction method provided by the first aspect of the embodiment of the invention is applied to a vehicle provided with a vehicle-mounted camera, and comprises the following steps: shooting the surrounding environment by using a vehicle-mounted camera to obtain a video sequence comprising surrounding vehicles and a vehicle background; positioning the surrounding vehicles from the video sequence, extracting historical track information of the surrounding vehicles, and taking scene semantic information obtained by image segmentation of the video sequence as auxiliary information; and inputting the historical track information and the auxiliary information into a neural network model to obtain the predicted track of the surrounding vehicle.

A trajectory prediction apparatus provided in a second aspect of an embodiment of the present invention is applied to a vehicle provided with a vehicle-mounted camera, and includes: the acquisition module is used for shooting the surrounding environment by utilizing the vehicle-mounted camera to acquire a video sequence comprising surrounding vehicles and a vehicle background; the extraction and segmentation module is used for positioning the surrounding vehicles from the video sequence, extracting historical track information of the surrounding vehicles, and taking scene semantic information obtained by image segmentation of the video sequence as auxiliary information; and the output module is used for inputting the historical track information and the auxiliary information into a neural network model to obtain the predicted track of the surrounding vehicle.

In the embodiment, the video sequence including the surrounding vehicles and the vehicle background is acquired through the vehicle-mounted camera, the video sequence is subjected to image segmentation to acquire scene semantic information, then the scene semantic information and the historical track information are input into the neural network model to acquire the predicted track, instead of extracting the scene semantic information by using a static image to analyze, so that the time continuity of the neural network model in the embodiment is ensured, and the accuracy of the predicted vehicle track is improved.

Drawings

Fig. 1 is a schematic flow chart illustrating an implementation of a trajectory prediction method according to a first embodiment of the present invention;

FIG. 2 is a schematic flow chart of an implementation of a trajectory prediction method according to a second embodiment of the present invention;

FIG. 3 is a diagram of a neural network model of a trajectory prediction method according to a second embodiment of the present invention;

FIG. 4 is a schematic diagram illustrating an application of a trajectory prediction method according to a second embodiment of the present invention;

fig. 5 is a schematic structural diagram of a trajectory prediction apparatus according to a third embodiment of the present invention.

Detailed Description

In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, fig. 1 is a schematic diagram illustrating a flow chart of a track prediction method according to a first embodiment of the present invention, where the method is applied to a vehicle with a vehicle-mounted camera. As shown in fig. 1, the trajectory prediction method mainly includes the following steps:

101. and shooting the surrounding environment by using the vehicle-mounted camera to obtain a video sequence comprising surrounding vehicles and a vehicle background.

Specifically, in the automatic driving process of the vehicle, assuming that all traffic participants obey the traffic rules, a method for establishing a model is adopted to predict the future tracks of other traffic participants. In the process of establishing the model, the surrounding environment information needs to be acquired, so the surrounding environment is photographed by using a vehicle-mounted camera on the vehicle, and a video sequence comprising the surrounding vehicle and the background of the vehicle is acquired. Wherein, the frame number per second of the video sequence can be selected according to the actual situation. The distance between the surrounding vehicle and the vehicle provided with the vehicle-mounted camera is within a certain range, the vehicle having potential influence on the vehicle provided with the vehicle-mounted camera exists, and the range can be 30 meters around the vehicle provided with the vehicle-mounted camera.

102. And positioning the surrounding vehicles from the video sequence, extracting historical track information of the surrounding vehicles, and taking scene semantic information obtained by image segmentation of the video sequence as auxiliary information.

Specifically, the motion in the video sequence is an artifact of motion formed by displaying frames in quick succession, the video sequence of each frame is a static image, the surrounding vehicle is positioned in the video sequence of each frame, and the track information of the surrounding vehicle can be seen from the continuous multi-frame video sequence, so that for the video sequence of the current frame, the historical track information of the surrounding vehicle is obtained from the video sequence of the past multi-frame.

The scene semantic information obtained by image segmentation of the video sequence of each frame is used as auxiliary information. The image segmentation means that objects in the video sequence of each frame are segmented according to semantic categories and labeled with scene semantic information, such as pedestrians, surrounding vehicles, buildings, sky, vegetation, road barriers, lane lines, road identification information, traffic signal light information and the like, so as to identify a drivable area in the video sequence of the current frame. By using scene semantic information as auxiliary information, the method can have certain robustness to apparent change of the target.

Optionally, since the regions corresponding to different semantic categories are different feature regions, and boundaries of the different feature regions are edges, the video sequence of each frame may be segmented by using edge detection, so as to extract the required target. The edge indicates the end of one feature region and the beginning of another feature region, and the internal features or attributes of the desired target are consistent and inconsistent with the features or attributes of the other feature regions, such as features of gray scale, color or texture.

103. And inputting the historical track information and the auxiliary information into a neural network model to obtain the predicted track of the surrounding vehicle.

In particular, the neural network is a complex network system formed by a large number of simple neurons widely connected with each other, is a highly complex nonlinear dynamical learning system, and has large-scale parallel, distributed storage and processing, self-organization, self-adaptation and self-learning capabilities. Therefore, a neural network model is obtained by establishing a mathematical model by using a neural network, and the obtained historical track information and auxiliary information are input into the neural network model to obtain the predicted track of the surrounding vehicle.

In the embodiment of the invention, the video sequence comprising surrounding vehicles and vehicle backgrounds is obtained through the vehicle-mounted camera, the video sequence is subjected to image segmentation to obtain scene semantic information, then the scene semantic information and historical track information are input into the neural network model to obtain the predicted track, instead of extracting the scene semantic information by using a static image to analyze, so that the time continuity of the neural network model in the embodiment is ensured, and the accuracy of the predicted vehicle track is improved.

Referring to fig. 2, fig. 2 is a schematic diagram of a flow chart of a track prediction method according to a second embodiment of the present invention, where the method is applied to a vehicle with a vehicle-mounted camera. As shown in fig. 2, the trajectory prediction method mainly includes the following steps:

201. and shooting the surrounding environment by using the vehicle-mounted camera to obtain a video sequence comprising surrounding vehicles and a vehicle background.

202. And positioning the surrounding vehicles from the video sequence, extracting historical track information of the surrounding vehicles, and taking scene semantic information obtained by image segmentation of the video sequence as auxiliary information.

203. And inputting the auxiliary information into the convolutional neural network to obtain spatial characteristic information.

Specifically, the neural network model comprises a convolutional neural network, a first layer long short term memory network, a second layer long short term memory network and a full connection layer.

The convolutional neural network is a kind of feedforward neural network. And (3) carrying out image segmentation and annotation on the video sequence to obtain scene semantic information which is used as auxiliary information and input into the convolutional neural network to obtain spatial characteristic information. The auxiliary information is image information, can adopt one-bit effective coding to code, and uses the channel number as the semantic category number, inputs the auxiliary information into a four-layer convolution neural network, the convolution kernel can be 3 x 4, obtains the space characteristic information, and the space characteristic information is expressed by 6-dimensional vector.

As shown in fig. 3, the convolutional neural network includes a convolutional layer, a linear correction unit, a pooling layer, and a Dropout layer. The convolutional layer may extract features in the side information. The linear layer may introduce non-linear characteristics. The pooling layer may compress the inputted auxiliary information and extract the main features. A Dropout layer may be used to alleviate the over-fitting problem.

204. And inputting the historical track information into the first layer long and short term memory network to obtain time characteristic information. And inputting the spatial characteristic information and the temporal characteristic information into the second layer long-short term memory network to obtain the joint characteristic information.

In particular, a Long-short Term neural (LSTM) network is a time-recursive network. The historical track information has a certain time sequence and a certain context correlation exists at the position, namely the historical track information is used as a sequence input and needs to be continuously subjected to position characteristics before and after learning, so the LSTM network is used for training the historical track information and connecting the track information of the historical frames for estimating the track information of the current frame.

As shown in fig. 3, the historical track information is input into the first-layer LSTM network to obtain temporal characteristic information, and the temporal characteristic information and the spatial characteristic information obtained in step 203 are input into the second-layer LSTM network to obtain joint characterization information. And because the dimension of the three-dimensional space grid is 6, the first-layer LSTM network not only can learn the time characteristic information, but also can make the dimension of the time characteristic information consistent with that of the space characteristic information. In practical applications, the number of the elements of the first layer of LSTM network may be 100, and the second layer of LSTM network may include two layers of LSTM networks each having a number of 300.

205. And inputting the joint characteristic information into a full-connection layer to obtain the predicted track.

Specifically, each node of the fully-connected layer is connected with all nodes of the previous layer for integrating all the extracted features of the previous layer, so that the combined characterization information is input into the fully-connected layer, a series of matrix multiplication is performed to obtain the output of the neural network model, and the predicted trajectory J of T time steps is obtained. In practical applications, the time T may be 1.6s (unit: second)

Wherein, the neural network model comprises the following formula:

J←M_p(h,a):H×A。

where J denotes the predicted trajectory, M denotes a mapping relationship between H, A and J, H denotes the history trajectory information, a denotes the auxiliary information, p denotes the surrounding vehicle, H denotes position information of the vehicle p in the T-th frame video sequence, a denotes scene semantic information of the vehicle p in the T-th frame video sequence, J denotes position information of the vehicle p in the T-th frame video sequence from T +1 frame, and T denotes each frame.

As shown in fig. 3, in the present embodiment, an image Segmentation-Long-short Term Memory network (SEG-LSTM) is proposed to merge multiple streams of history frames and predict future trajectories of surrounding vehicles.

The number of layers of the LSTM network, the number of units of each layer of the LSTM network, the number of layers of the convolutional neural network and the size of the convolutional kernel belong to network hyper-parameters, and are determined through cross validation. The role of cross-validation is to determine the optimal hyper-parameters while avoiding model overfitting. Illustratively, first, the data set is divided into a training set and a test set, with a ratio of 5: 1. Then the training set is divided into 5 parts, each part is taken as a verification set in turn, the other 4 parts are taken as the training sets to carry out 5 times of training and verification, the corresponding average accuracy rate can be obtained by using different hyper-parameters, and the hyper-parameter with the optimal effect is taken to determine the numerical value.

As shown in fig. 4, a video sequence is divided into a plurality of video sequences with time step lengths by frame, and position information is obtained by detecting and tracking the video sequence of each frame, and semantic information is obtained by image segmentation. Then, the position information and semantic information of the same frame are input into an LSTM network for training, a plurality of video sequences of historical frames and a current frame are trained to obtain a predicted track,

206. and respectively acquiring the minimum relative distance between the vehicle and each surrounding vehicle through the depth camera. And converting the two-dimensional space prediction track into a three-dimensional space prediction track according to the minimum relative distance.

Specifically, the predicted track is a two-dimensional spatial predicted track, and a depth camera is further arranged in the vehicle.

Wherein the two-dimensional spatial predicted trajectory is converted into a three-dimensional spatial predicted trajectory according to the minimum relative distance by the following formula:

wherein x, y, w, h respectively represent elements of the two-dimensional spatial prediction track in the pixel bounding box of each frame of the video sequence, and x_r,y_r,w_r,h_rRespectively representing the elements of a three-dimensional space prediction track in a pixel bounding box in each frame of a video sequence, f is represented as the focal length of the depth camera, d_minExpressed as the minimum relative distance of the vehicle from each of the surrounding vehicles.

Wherein, if the subscript p is ignored, the historical track information and the predicted track can be defined as a three-dimensional space occupying grid, that is

H,J∈R⁶＝{x,y,w,h,d_min,d_max}

In the formula (d)_maxIndicating the maximum distance of the vehicle from each of the surrounding vehicles.

In the embodiment of the invention, firstly, a video sequence comprising surrounding vehicles and vehicle backgrounds is obtained through a vehicle-mounted camera, the video sequence is subjected to image segmentation to obtain scene semantic information, then the scene semantic information and historical track information are input into a neural network model to obtain a predicted track, instead of extracting the scene semantic information by using a static image to analyze, so that the time continuity of the neural network model in the embodiment is ensured, and the accuracy of predicting the vehicle track is improved. In addition, the convolutional neural network and the LSTM network can improve the robustness of tracking surrounding vehicles, and the image segmentation is adopted to obtain scene semantic information, so that the interpretability of the training process can be improved.

Referring to fig. 5, fig. 5 is a schematic structural diagram of a trajectory prediction device according to a third embodiment of the present invention, and the trajectory prediction device is applied to a vehicle with a vehicle-mounted camera. As shown in fig. 5, the trajectory prediction apparatus mainly includes:

an obtaining module 301, configured to use the vehicle-mounted camera to photograph the surrounding environment, and obtain a video sequence including the surrounding vehicle and the vehicle background.

And an extracting and dividing module 302, configured to locate surrounding vehicles from the video sequence, extract historical track information of the surrounding vehicles, and use scene semantic information obtained by performing image division on the video sequence as auxiliary information.

And the output module 303 is configured to input the historical trajectory information and the auxiliary information into the neural network model to obtain a predicted trajectory of the surrounding vehicle.

Further, the neural network model includes a convolutional neural network, a first layer long short term memory network, a second layer long short term memory network, and a full link layer,

the output module 303 is further configured to input the auxiliary information to the convolutional neural network to obtain spatial feature information.

The output module 303 is further configured to input the historical trajectory information into the first-tier long-short term memory network to obtain the time characteristic information.

The output module 303 is further configured to input the spatial feature information and the temporal feature information into the second layer long-term and short-term memory network to obtain the joint feature information.

The output module 303 is further configured to input the joint feature information into the full connection layer to obtain a predicted track.

Further, the neural network model includes the following formula:

J←M_p(h,a):H×A。

wherein J denotes a predicted trajectory, M denotes a mapping relationship between H, A and J, H denotes history trajectory information, a denotes auxiliary information, p denotes a surrounding vehicle, H denotes position information of the vehicle p in the T-th frame video sequence, a denotes scene semantic information of the vehicle p in the T-th frame video sequence, J denotes position information of the vehicle p in the T-th frame video sequence from T +1 frame, and T denotes each frame.

Furthermore, the predicted track is a two-dimensional space predicted track, a depth camera is also arranged in the vehicle,

the obtaining module 301 is further configured to obtain, through the depth camera, the minimum relative distance between the vehicle and each surrounding vehicle.

The apparatus further comprises a conversion module 304 which,

and a conversion module 304, configured to convert the two-dimensional spatial prediction trajectory into a three-dimensional spatial prediction trajectory according to the minimum relative distance.

Further, the conversion module 304 is further configured to convert the two-dimensional spatial predicted trajectory into the three-dimensional spatial predicted trajectory according to the minimum relative distance by the following formula:

wherein x, y, w, h respectively represent elements of the two-dimensional spatial prediction track in the pixel bounding box of each frame of the video sequence, and x_r,y_r,w_r,h_rRespectively representing the elements of a three-dimensional spatial prediction track in a pixel bounding box in each frame of a video sequence, f representing the focal length of the depth camera, d_minExpressed as vehicle and surrounding vehiclesThe minimum relative distance of.

The process of the above modules to implement each function may specifically refer to the related contents in the embodiments shown in fig. 1 to fig. 4, and is not described herein again.

In the embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and other divisions may be realized in practice, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication link may be electrical, mechanical or other form of coupling or communication link of the modules indirectly through some interfaces.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, the functional modules in the embodiments of the present invention may be integrated into one processing module. Each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.

It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no acts or modules are necessarily required of the invention.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In view of the above description of the trajectory prediction method and apparatus, the terminal and the computer readable storage medium provided by the present invention, those skilled in the art will recognize that there may be variations in the embodiments and applications of the method and apparatus provided by the present invention.

Claims

1. A track prediction method is applied to a vehicle provided with a vehicle-mounted camera, and is characterized by comprising the following steps:

shooting the surrounding environment by using a vehicle-mounted camera to obtain a video sequence comprising surrounding vehicles and a vehicle background;

positioning the surrounding vehicles from the video sequence, extracting historical track information of the surrounding vehicles, and taking scene semantic information obtained by image segmentation of the video sequence as auxiliary information;

inputting the historical track information and the auxiliary information into a neural network model to obtain a predicted track of the surrounding vehicle;

the predicted trajectory is a two-dimensional spatial predicted trajectory, and a depth camera is further disposed in the vehicle, then the method further includes:

respectively acquiring the minimum relative distance between the vehicle and each surrounding vehicle through the depth camera;

converting the two-dimensional space prediction track into a three-dimensional space prediction track according to the minimum relative distance;

converting the two-dimensional space prediction track into a three-dimensional space prediction track according to the minimum relative distance by the following formula:

wherein x, y, w, h respectively represent elements of the two-dimensional spatial prediction track in the pixel bounding box of each frame of the video sequence, and x_r,y_r,w_r,h_rRespectively representing the elements of a three-dimensional spatial prediction track in a pixel bounding box in each frame of a video sequence, f representing the focal length of the depth camera, d_minExpressed as the minimum relative distance of the vehicle from each of the surrounding vehicles.

2. The trajectory prediction method of claim 1, wherein the neural network model comprises a convolutional neural network, a first layer long short term memory network, a second layer long short term memory network, and a fully connected layer, and the inputting the historical trajectory information and the auxiliary information into the neural network model to obtain the predicted trajectory of the surrounding vehicle comprises:

inputting the auxiliary information to the convolutional neural network to obtain spatial characteristic information;

inputting the historical track information into the first layer long and short term memory network to obtain time characteristic information;

inputting the spatial characteristic information and the temporal characteristic information into the second layer long-short term memory network to obtain joint characteristic information;

and inputting the joint characteristic information into a full-connection layer to obtain the predicted track.

3. The trajectory prediction method of claim 1, wherein the neural network model comprises the following formula:

J←M_p(h,a):H×A；

wherein J represents the predicted trajectory, M represents a mapping relationship between H, A and J, H represents the historical trajectory information, a represents the auxiliary information, p represents the surrounding vehicle, H represents position information of the vehicle p in the T-th frame video sequence, a represents scene semantic information of the vehicle p in the T-th frame video sequence, J represents position information of the vehicle p in the T-th frame video sequence from T +1 frame, and T represents each frame.

4. A trajectory prediction device applied to a vehicle provided with an onboard camera, the device comprising:

the acquisition module is used for shooting the surrounding environment by utilizing the vehicle-mounted camera to acquire a video sequence comprising surrounding vehicles and a vehicle background;

the extraction and segmentation module is used for positioning the surrounding vehicles from the video sequence, extracting historical track information of the surrounding vehicles, and taking scene semantic information obtained by image segmentation of the video sequence as auxiliary information;

the output module is used for inputting the historical track information and the auxiliary information into a neural network model to obtain the predicted track of the surrounding vehicle;

the predicted track is a two-dimensional spatial predicted track, and a depth camera is further arranged in the vehicle;

the acquisition module is further configured to acquire, through the depth camera, minimum relative distances between the vehicle and each of the surrounding vehicles, respectively;

the apparatus may further comprise a conversion module for,

the conversion module is used for converting the two-dimensional space prediction track into a three-dimensional space prediction track according to the minimum relative distance;

the conversion module is further configured to convert the two-dimensional spatial prediction trajectory into a three-dimensional spatial prediction trajectory according to the minimum relative distance by using the following formula:

5. The trajectory prediction device of claim 4, wherein the neural network model includes a convolutional neural network, a first layer long short term memory network, a second layer long short term memory network, and a fully connected layer,

the output module is further configured to input the auxiliary information to the convolutional neural network to obtain spatial feature information;

the output module is further used for inputting the historical track information into the first layer long-short term memory network to obtain time characteristic information;

the output module is further configured to input the spatial feature information and the temporal feature information into the second layer long-short term memory network to obtain joint feature information;

and the output module is also used for inputting the combined characteristic information into a full connection layer to obtain the predicted track.

6. The trajectory prediction device of claim 4, wherein the neural network model comprises the following formula:

J←M_p(h,a):H×A；