CN114035575B - Unmanned vehicle motion planning method and system based on semantic segmentation - Google Patents
Unmanned vehicle motion planning method and system based on semantic segmentation
- Publication number
- CN114035575B CN114035575B CN202111301912.9A CN202111301912A CN114035575B CN 114035575 B CN114035575 B CN 114035575B CN 202111301912 A CN202111301912 A CN 202111301912A CN 114035575 B CN114035575 B CN 114035575B
- Authority
- CN
- China
- Prior art keywords
- network
- semantic segmentation
- image
- images
- speed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0231—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
- G05D1/0246—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means
- G05D1/0253—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means extracting relative motion information from a plurality of images taken successively, e.g. visual odometry, optical flow
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0221—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0223—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0276—Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Aviation & Aerospace Engineering (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Electromagnetism (AREA)
- Image Analysis (AREA)
- Traffic Control Systems (AREA)
Abstract
The invention discloses a semantic segmentation-based unmanned vehicle motion planning method and system. Views are obtained from the vehicle's vision sensors; semantically segmented views and corresponding class weights are obtained through a semantic segmentation network and a category weight network; an expert driving experience base bound to the corresponding views is used to train the simulation learning module, which fuses the semantically segmented views and the corresponding weights at the pixel level and extracts feature vectors from the fused images. After training is completed, speed and steering are obtained end to end by directly feeding the current views through the semantic segmentation module and the simulation learning module, and motion planning is carried out based on the speed and steering. The invention plans unmanned vehicle motion well: in the Carla simulation environment, the behavior planning accuracy is improved by 27.8 percent over methods that do not use semantic segmentation in certain specific scenes, such as reflective obstacles caused by standing water on rainy days.
Description
Technical Field
The invention relates to the field of unmanned driving, in particular to a semantic segmentation-based unmanned vehicle motion planning method and system.
Background
In recent years, deep neural network models based on end-to-end learning have become an emerging research hotspot in the field of automatic driving. Unlike traditional methods, they avoid the reliance on prior environment modeling and map perception directly to control quantities, which simplifies the system. In 2016, NVIDIA demonstrated the feasibility of the end-to-end control approach by training a neural network on real vehicle driving data. In 2017, Dosovitskiy et al. proposed the end-to-end automatic driving model CIL, and experiments showed that CIL can effectively use navigation information for navigation. In 2019, Codevilla et al. proposed the CILRS model, which uses ResNet34 for environment perception, achieving good results and to some extent eliminating the abnormal stopping problem of the CIL model.
The traditional end-to-end model takes only the current environment information as network input and relies only on the current image to obtain the road trajectory and obstacle information at that moment, whereas a human driver judges the movement trend and speed of dynamic obstacles from visual information over a period of time in the past in order to determine the driving strategy. Compared with a human driver, the conventional model inevitably loses the motion information of dynamic obstacles.
To acquire dynamic information about the surrounding environment, environmental information at successive times must be added as input to the neural network, which inevitably increases the network's data processing load and thus delays the feedback of vehicle driving information.
Disclosure of Invention
The invention aims to provide a semantic segmentation-based unmanned vehicle motion planning method and system, which can reduce the data processing burden of the learning network while increasing the amount of data input, and control the speed and direction of an unmanned vehicle more accurately, thereby effectively determining the driving strategy.
The technical solution for realizing the purpose of the invention is as follows:
a semantic segmentation based unmanned vehicle motion planning method comprises the following steps:
periodically collecting continuous frame surrounding views of the left direction, the front direction and the right direction of the vehicle;
based on the environmental information classification, respectively carrying out semantic segmentation on the left and right direction current view data and the continuous n frames of forward view data through a semantic segmentation network to obtain a semantic segmentation image;
based on the environmental information classification, acquiring category weight information corresponding to the continuous n frames of forward images through a category weight network;
acquiring the speed and direction corresponding to each frame of view based on an expert database;
taking the segmented images, the category weight information and the speed and direction corresponding to the view as training data, training and learning the dynamic surround vision imitating learning network based on the neural network, and repeating the steps until a convergent dynamic surround vision imitating learning model is obtained;
collecting surrounding views in the left direction, the front direction and the right direction of a vehicle, acquiring semantic segmentation images of the surrounding views in the left direction, the front direction and the right direction and category weight information corresponding to continuous n frames of forward images, and inputting the semantic segmentation images and the category weight information into a dynamic surrounding vision simulation learning model to acquire a planning speed and a planning direction;
and carrying out motion planning on the unmanned vehicle based on the planned speed and the planned direction.
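The end-to-end planning flow described by the steps above can be summarized in code. The sketch below is only an illustration of the data flow, assuming Python; the callables segment, class_weights, dscil and pid_controller are hypothetical placeholders for the semantic segmentation network, the category weight network, the trained simulation learning model and the speed controller, not names used in the invention.

```python
# A minimal sketch of the planning pipeline; all callables are assumed placeholders.
def plan_motion(left_view, right_view, forward_views, nav_command, current_speed,
                segment, class_weights, dscil, pid_controller):
    """forward_views: the n most recent forward frames, oldest first."""
    # semantic segmentation of the current left/right views and all n forward frames
    seg_left = segment(left_view)
    seg_right = segment(right_view)
    seg_forward = [segment(frame) for frame in forward_views]

    # per-class spatial weights predicted from the n forward frames
    weights = class_weights(forward_views)

    # the end-to-end model outputs the planned speed and steering for the nav command
    planned_speed, steering = dscil(seg_forward, weights, seg_left, seg_right, nav_command)

    # longitudinal control: convert the speed error into throttle and brake
    throttle, brake = pid_controller(planned_speed, current_speed)
    return steering, throttle, brake
```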
Further, the semantic segmentation network comprises a residual network ResNet, a fully convolutional network FCN, an atrous spatial pyramid pooling (ASPP) module and a decoder; obtaining the semantic segmentation images by respectively performing semantic segmentation on the left-direction and right-direction current view data and the continuous n frames of forward view data through the semantic segmentation network specifically comprises the following steps:
performing semantic segmentation on the current left and right view data through a residual error network ResNet and a full convolution network FCN, and performing semantic segmentation on the current and previous n-1 frame forward view data;
based on the environmental information classification, extracting and classifying semantic segmentation image feature information by adopting atrous spatial pyramid pooling;
and recovering the details and the spatial dimension of the image features through a decoder based on the classified image feature information to obtain a final semantic segmentation image.
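The segmentation stage described above (ResNet backbone, FCN-style decoding and atrous spatial pyramid pooling) is close in spirit to a DeepLabv3 design. The snippet below is a rough stand-in, not the exact network of the invention: it instantiates a comparable 13-class model with torchvision, and the 88 × 200 input layout is an assumption based on the image sizes mentioned later in the description.

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

# comparable DeepLabv3-style network with 13 output classes (stand-in, not the patent's network)
model = deeplabv3_resnet50(weights=None, weights_backbone=None, num_classes=13)
model.eval()

with torch.no_grad():
    frame = torch.randn(1, 3, 88, 200)      # one RGB view (H=88, W=200, assumed layout)
    logits = model(frame)["out"]            # (1, 13, 88, 200) per-pixel class scores
    seg_image = logits.argmax(dim=1)        # (1, 88, 200) semantic segmentation image
```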
Further, the category weight network is specifically: a residual unit is formed by four residual blocks, the fourth residual block is copied three times and the copies are connected to the residual unit in cascade, together with a parallel atrous spatial pyramid pooling structure.
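A minimal sketch of such a category weight network is given below, assuming PyTorch. The ResNet18 trunk, the channel-wise stacking of the 4 forward frames, the layer widths and the sigmoid normalisation of the weight map are assumptions; only the replicated fourth residual stage and the parallel dilated branches (rates 6/12/18/24, taken from the later description) follow the text.

```python
import copy
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CategoryWeightNet(nn.Module):
    def __init__(self, num_frames=4, num_classes=13, out_size=(88, 200)):
        super().__init__()
        trunk = resnet18(weights=None)
        # stack the 4 forward frames along the channel axis (assumption)
        trunk.conv1 = nn.Conv2d(3 * num_frames, 64, 7, stride=2, padding=3, bias=False)
        self.stem = nn.Sequential(trunk.conv1, trunk.bn1, trunk.relu, trunk.maxpool,
                                  trunk.layer1, trunk.layer2, trunk.layer3, trunk.layer4)
        # replicate the (512 -> 512) block of the fourth residual stage three times, in cascade
        self.extra = nn.Sequential(*[copy.deepcopy(trunk.layer4[-1]) for _ in range(3)])
        # parallel dilated 3x3 convolutions with rates 6/12/18/24 (ASPP-style)
        self.aspp = nn.ModuleList(
            nn.Conv2d(512, 64, 3, padding=r, dilation=r) for r in (6, 12, 18, 24))
        self.fuse = nn.Conv2d(64 * 4, num_classes, 1)
        self.out_size = out_size

    def forward(self, frames):                      # frames: (B, 12, 88, 200)
        f = self.extra(self.stem(frames))
        f = torch.cat([branch(f) for branch in self.aspp], dim=1)
        w = nn.functional.interpolate(self.fuse(f), size=self.out_size,
                                      mode="bilinear", align_corners=False)
        return torch.sigmoid(w)                     # (B, 13, 88, 200) per-class spatial weights

weights = CategoryWeightNet()(torch.randn(1, 12, 88, 200))
```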
Further, the dynamic surround vision simulation learning network comprises a feature extraction fusion network and a branch decision network, and an L1 loss function is adopted; wherein:
the feature extraction fusion network fuses category weight information and semantic segmentation images of forward images, n 512-dimensional feature vectors are obtained from the fused images through a ResNet34 residual error network, the n 512-dimensional feature vectors obtain a 512-dimensional feature vector through a single-layer LSTM network, 64-dimensional feature vectors are obtained from the semantic segmentation images in the left direction and the semantic segmentation images in the right direction through the residual error network respectively, and one 512-dimensional feature vector and two 64-dimensional feature vectors are spliced to obtain a 640-dimensional combined feature vector;
the branch decision network activates the corresponding branch network according to the navigation information, the navigation information c comprising four states: road following, going straight, turning left and turning right;
the L1 loss function is:
wherein,vehicle speed v at time t +1 t+1 Is predicted value of (4), (v) is greater than or equal to>Is the predicted value of the square disk angle at time t, a t Is the steering wheel angle at time t, v t+1 Is the vehicle speed at time t + 1.
Further, the number of hidden nodes of the LSTM network is 128, and the updating process is as follows:
external state S based on last moment t-1 And input x of the current time t Input gate i is calculated t And an output gate o t Forget door t And candidate statesThe calculation formula is as follows:
i t =σ(M i x t +N i S t-1 +b i )
o t =σ(M o x t +N o S t-1 +b o )
l t =σ(M l x t +N l S t-1 +b l )
wherein, tanh is an activation function, M i And N i Is a parameter matrix, with a value interval of (0, 1), a being a logistic function, a value range of (0,1);
passing forgetting door t And an input gate i t To update the memory cell I t The update formula is:
passing the internal state to the external state S t External state S t Comprises the following steps:
S t =o t tanh(I t )
Further, the 640-dimensional joint feature vector is:

e_t = [e_t^F, e_t^R, e_t^L]
e_t^F = L(R_34(S_t^F), R_34(S_{t-1}^F), ..., R_34(S_{t-n+1}^F))
e_t^R = R_18(S_t^R)
e_t^L = R_18(S_t^L)

wherein e_t is the 640-dimensional joint feature vector obtained by splicing the 512-dimensional feature vector with the two 64-dimensional feature vectors, e_t^F is the 512-dimensional feature vector corresponding to the n successive frames of forward images, e_t^R and e_t^L are the 64-dimensional feature vectors corresponding to the right and left images at time t, S_t^F, S_t^R and S_t^L are the semantic segmentation images corresponding to the forward, right and left images at time t, R_34 is the 34-layer residual network function, R_18 is the 18-layer residual network function, and L is the LSTM network function.
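A sketch of this feature extraction and fusion stage in PyTorch is shown below. The LSTM hidden size is set to 512 so that the fused forward feature is 512-dimensional as stated here (the text elsewhere cites 128 hidden nodes, so this is a judgment call); the 3-channel input convention and the 64-dimensional projection layer on ResNet18 are likewise assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34, resnet18

class FeatureFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.forward_net = resnet34(weights=None)
        self.forward_net.fc = nn.Identity()            # -> 512-d feature per forward frame
        self.side_net = resnet18(weights=None)
        self.side_net.fc = nn.Linear(512, 64)          # -> 64-d feature per side view (assumed head)
        self.lstm = nn.LSTM(input_size=512, hidden_size=512, num_layers=1, batch_first=True)

    def forward(self, fused_forward, seg_left, seg_right):
        # fused_forward: (B, n, 3, H, W); seg_left / seg_right: (B, 3, H, W)
        b, n = fused_forward.shape[:2]
        per_frame = self.forward_net(fused_forward.flatten(0, 1)).view(b, n, 512)
        _, (h_n, _) = self.lstm(per_frame)             # h_n: (1, B, 512)
        e_f = h_n.squeeze(0)                           # fused forward feature (B, 512)
        e_r = self.side_net(seg_right)                 # (B, 64)
        e_l = self.side_net(seg_left)                  # (B, 64)
        return torch.cat([e_f, e_r, e_l], dim=1)       # joint feature (B, 640)
```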
Further, the dynamic surround vision simulation learning model is as follows:
(v_{t+1}, a_t) = F(e_t, c)
(v_{t+1}, a_t) = A_c(e_t)
wherein v_{t+1} is the vehicle speed at time t+1, a_t is the steering wheel angle at time t, c is the navigation instruction, F is the mapping from the dynamic environment feature e_t and the navigation instruction c to the vehicle speed v_{t+1} and the steering wheel angle a_t, and A_c is the decision branch network corresponding to the navigation instruction c.
Further, the collection period is 0.3s, n =4, and the environmental information classification includes: empty, buildings, fences, pedestrians, pillars, lane dividing lines, roads, sidewalks, vegetation, vehicles, walls, signal lamps and other unidentified types.
Further, a PID controller is adopted to carry out motion planning of the unmanned vehicle, and the PID controller adopts a PI control algorithm, specifically:

m_t = k_s·(v̂_{t+1} − v_t) + k_i·∫(v̂_{t+1} − v_t)dt
V_t = 0 when m_t ≤ 0;  V_t = min(1.3·m_t, 0.75) when m_t > 0
b_t = 0 when m_t > 0;  b_t = min(0.35·|m_t|, 1) when m_t ≤ 0

wherein v̂_{t+1} is the predicted value of the vehicle speed v_{t+1} at time t+1, v_t is the vehicle speed at time t, V_t is the throttle size at time t, b_t is the brake size at time t, k_s is the proportional coefficient, k_i is the integral coefficient, and m_t is the longitudinal control quantity.
An unmanned vehicle motion planning system based on the unmanned vehicle motion planning method comprises a visual sensor, a category weight module, a semantic segmentation module, a simulation learning module, a speed control module and a steering control module, wherein:
the vision sensor is used for periodically acquiring surrounding views of the left direction, the middle direction and the right direction of the vehicle;
the category weight module is used for acquiring category weight information corresponding to the forward image through a category weight network;
the semantic segmentation module is used for performing semantic segmentation on left-direction current view data and right-direction current view data and continuous n frames of forward view data respectively by a semantic segmentation network to obtain a semantic segmentation image;
The simulation learning module carries out network learning based on neural network learning to obtain a converged dynamic surround vision simulation learning model. The simulation learning module comprises an image fusion module and a branch decision module: the image fusion module fuses the category weight information with the semantic segmentation image of each forward frame; the fused images pass through a ResNet34 residual network to obtain n 512-dimensional feature vectors, which pass through a single-layer LSTM network to obtain one 512-dimensional feature vector; the left-direction and right-direction semantic segmentation images each pass through a residual network to obtain 64-dimensional feature vectors; and the 512-dimensional feature vector and the two 64-dimensional feature vectors are spliced to obtain a 640-dimensional joint feature vector. The branch decision module activates the corresponding branch network according to the navigation information, the navigation information c comprising four states: road following, going straight, turning left and turning right.
The speed control module controls the speed of the unmanned vehicle according to the planned speed output by the simulation learning module;
the steering control module controls the steering of the unmanned vehicle according to the planned direction output by the simulation learning module.
Compared with the prior art, the invention has the following beneficial effects: the invention determines the motion trajectory and speed of dynamic obstacles from continuous multi-frame visual information and reduces the amount of input data the neural network must process through semantic segmentation; the multi-frame visual information is input into the semantic segmentation network to strip out invalid environmental information, while the processed images fully retain effective environmental information such as roads, obstacles and pedestrians and are closer to the environmental information perceived by a human driver while driving; the multi-frame semantic segmentation images are input into the end-to-end neural network model to obtain vehicle control information, so that unmanned driving behavior planning can be performed well and the accuracy of the speed and direction of the unmanned vehicle is improved.
Drawings
Fig. 1 is a model architecture diagram of the unmanned vehicle motion planning system of the present invention.
Fig. 2 is a schematic view of the operating environment of the present invention, based on the CARLA unmanned vehicle simulation platform version 8.04.
FIG. 3 is a diagram illustrating semantic segmentation.
Fig. 4 is a diagram of an LSTM network architecture.
Fig. 5 is a diagram of a dynamic surround vision mock learning network architecture.
Fig. 6 is a schematic diagram of proportional-integral control.
FIG. 7 is a graph of throttle/brake and control quantities for a vehicle.
FIG. 8 is a schematic diagram of the convolution of holes at different expansion ratios.
Detailed Description
In order to make the environmental information acquired by an unmanned vehicle closer to the environmental information perceived by a human driver during driving, and to reduce the data processing burden of the neural network while increasing the amount of data input, the invention provides a dynamic simulation learning algorithm based on the combination of surround vision and semantic segmentation. Multi-frame visual information is first input into a trained semantic segmentation network to strip out invalid environmental information, while the processed images fully retain effective environmental information such as roads, obstacles and pedestrians; the processed multi-frame semantic segmentation images are then input into an end-to-end learning network model to obtain vehicle control information. Following the same logic as a human driver's perception of the environment, the motion trajectory and speed of dynamic obstacles are determined from historical visual information, and semantic segmentation reduces the amount of input data the neural network must process, so that the driving strategy can be determined effectively.
The model provided by the method comprises visual perception, a category weight network, a semantic segmentation network, image fusion, a dynamic surround vision simulation learning network (based on a neural network and named DSCIL, Dynamic Surround-view Conditional Imitation Learning), and speed and steering control of the unmanned vehicle. First, left, middle and right surround views are obtained through the vision sensors mounted on the vehicle, and 4 consecutive frames of forward images are input into the category weight network, which outputs pixel-level weights for all 13 semantic segmentation categories for the current frame. The 4 consecutive forward frames and the current left and right frames are then input into the semantic segmentation network to obtain the corresponding semantic images. The resulting semantic segmentation images and class weights are then input into the DSCIL model, where they are multiplied per class at the pixel level to obtain a representation of the spatial importance of each class after semantic segmentation, and the expected vehicle speed is output; finally, speed and steering are controlled according to the error between the expected vehicle speed and the current vehicle speed.
The invention provides a semantic segmentation-based unmanned vehicle motion planning method, which comprises the following steps of:
periodically acquiring continuous frame surrounding views in the left direction, the front direction and the right direction of a vehicle;
based on the environmental information classification, performing semantic segmentation on each frame of view data in the left direction, the front direction and the right direction through a semantic segmentation network to obtain a semantic segmentation image, wherein the segmentation image is as shown in FIG. 3;
based on the environmental information classification, obtaining the category weight information corresponding to each frame of forward image through a category weight network;
based on the environmental information classification, acquiring category weight information corresponding to the continuous n frames of forward images through a category weight network;
acquiring the speed and direction corresponding to each frame of view based on an expert database;
taking the segmented images, the category weight information and the speed and the direction corresponding to the view as training data, training and learning the dynamic surround vision imitation learning network based on the neural network, and repeating the steps until a convergent dynamic surround vision imitation learning model is obtained;
collecting surrounding views in the left direction, the front direction and the right direction of a vehicle, acquiring semantic segmentation images of the surrounding views in the left direction, the front direction and the right direction and category weight information corresponding to continuous n frames of forward images, and inputting the semantic segmentation images and the category weight information into a dynamic surrounding vision simulation learning model to acquire a planning speed and a planning direction;
and based on the planning speed and the planning direction, a PID controller is adopted to plan the motion of the unmanned vehicle.
Direct semantic segmentation usually loses some spatial information, which is not conducive to end-to-end behavior planning for unmanned driving. A category weight network is therefore designed, constructed with ResNet as the backbone, which finally outputs a spatial importance weight corresponding to each category, i.e. an output of 13 × 200 × 88.
ResNet has four residual blocks in total; the fourth residual block is copied three times and the copies are connected behind the residual units in cascade. Meanwhile, a parallel ASPP structure is adopted to process the feature map output by the preceding layers: four dilated convolutions with different expansion rates (6, 12, 18 and 24) are applied on top of the feature map, context information is captured through a global average pooling layer, and a 1 × 1 convolution then fuses the features processed by the ASPP branches.
The semantic segmentation network is constructed with ResNet as the backbone, as is commonly done. After image segmentation is realized with the FCN, dilated convolution is introduced to obtain multi-scale feature information, and more semantic feature information is mined through a pyramid structure to improve the classification effect.
Dilated (hole) convolution samples the original image by setting a dilation rate. As shown in fig. 8, when rate = 1 it is no different from standard convolution; when the rate is greater than 1, the convolution kernel is expanded by the dilation rate and the input is sampled every (rate − 1) pixels, which enlarges the receptive field and extracts semantic segmentation features over a larger range without increasing the number of parameters or the amount of computation.
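The effect of the dilation rate can be checked directly: a 3 × 3 kernel with rate r keeps 9 parameters per channel while its spatial extent grows to (2r + 1) × (2r + 1), and with padding equal to the rate the output resolution is unchanged. The snippet below simply prints these quantities for the rates used above.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 88, 200)
for rate in (1, 6, 12, 18, 24):
    conv = nn.Conv2d(1, 1, kernel_size=3, padding=rate, dilation=rate, bias=False)
    span = 2 * rate + 1                       # spatial extent covered by the dilated kernel
    # output shape stays (1, 1, 88, 200); parameter count stays at 9 per channel
    print(rate, tuple(conv(x).shape), conv.weight.numel(), span)
```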
Atrous spatial pyramid pooling (ASPP) uses dilated convolutions with different expansion rates to learn multi-scale features, so that regions of any scale can be classified more accurately.
The semantic segmentation network employs an encoder-decoder module that recovers the details and spatial dimensions of image features using 16-times bilinear upsampling to obtain the segmentation result. Since the invention aims at classifying environmental information, the decoder module used for segmentation is replaced by a fully connected layer with softmax as the output layer, connecting each neuron with the output to realize classification of the input image. The classification classes are: 0 empty (None), 1 Buildings, 2 Fences, 3 other unidentified types (Other), 4 Pedestrians, 5 poles (Poles), 6 lane lines (RoadLines), 7 Roads, 8 Sidewalks, 9 Vegetation, 10 Vehicles, 11 Walls and 12 traffic signs (TrafficSigns).
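For reference when interpreting the 13 × 200 × 88 class weight output, the class indices listed above can be kept as a small lookup table; the English label spellings follow common CARLA naming and are otherwise an assumption.

```python
# index -> class name lookup for the 13 semantic categories used in this description
SEMANTIC_CLASSES = {
    0: "None", 1: "Buildings", 2: "Fences", 3: "Other", 4: "Pedestrians",
    5: "Poles", 6: "RoadLines", 7: "Roads", 8: "Sidewalks", 9: "Vegetation",
    10: "Vehicles", 11: "Walls", 12: "TrafficSigns",
}
```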
In order to take the influence of dynamic obstacles into account, the perception mode of a human driver is imitated, and more comprehensive environmental information is obtained by adding historical multi-frame visual information. The DSCIL model adopts 4 frames of forward images at intervals of 0.3 s, i.e. the forward image at the current moment and the forward images 0.3 s, 0.6 s and 0.9 s earlier. Compared with a single-frame image, this makes it possible to judge the driving speed and trajectory of vehicles more effectively.
With reference to fig. 5, the dynamic surround vision simulation learning network includes a feature extraction fusion network and a branch decision network, and adopts an L1 loss function; specifically, the method comprises the following steps:
referring to FIG. 4, the number of hidden nodes of the LSTM network is 128, the structure of the LSTM network is shown in FIG. 5, wherein S t ,I t ,x t Respectively a state at the time t, an internal state, a candidate state and a network input; and i t ,o t ,l t The input gate, the output gate and the forgetting gate at the moment t respectively have the value range of (0, 1), and allow the information to pass through in a certain proportion. Where σ represents the logistic function and the range is (0, 1). The state S is updated at every moment, the state S is short-term memory, the network parameters can be regarded as long-term memory, and the updating period is far slower than that of the state S. The internal state I may hold critical information at a certain moment and for a period of time, which is longer than short-term memory and shorter than long-term memory.
The updating process of the LSTM network is as follows: (1) first, the input gate, output gate, forget gate and candidate state are calculated from the external state S_{t-1} at the previous moment and the input x_t at the current moment, as shown in equations 1.1–1.4, where M and N are parameter matrices and b is a bias parameter; (2) the forget gate l_t and the input gate i_t are used to update the memory cell I_t, as shown in equation 1.5; (3) the output gate o_t passes the internal state to the external state S_t, as shown in equation 1.6.

i_t = σ(M_i·x_t + N_i·S_{t-1} + b_i)    (1.1)
o_t = σ(M_o·x_t + N_o·S_{t-1} + b_o)    (1.2)
l_t = σ(M_l·x_t + N_l·S_{t-1} + b_l)    (1.3)
Ĩ_t = tanh(M_c·x_t + N_c·S_{t-1} + b_c)    (1.4)
I_t = l_t ⊙ I_{t-1} + i_t ⊙ Ĩ_t    (1.5)
S_t = o_t ⊙ tanh(I_t)    (1.6)
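The update equations (1.1)–(1.6) can be written out directly as a single LSTM step with 128 hidden units, as stated above. The candidate-state parameter names (M_c, N_c, b_c) and the random initialisation are illustrative assumptions.

```python
import torch

def lstm_step(x_t, S_prev, I_prev, params):
    M_i, N_i, b_i, M_o, N_o, b_o, M_l, N_l, b_l, M_c, N_c, b_c = params
    i_t = torch.sigmoid(x_t @ M_i.T + S_prev @ N_i.T + b_i)      # input gate   (1.1)
    o_t = torch.sigmoid(x_t @ M_o.T + S_prev @ N_o.T + b_o)      # output gate  (1.2)
    l_t = torch.sigmoid(x_t @ M_l.T + S_prev @ N_l.T + b_l)      # forget gate  (1.3)
    I_cand = torch.tanh(x_t @ M_c.T + S_prev @ N_c.T + b_c)      # candidate    (1.4)
    I_t = l_t * I_prev + i_t * I_cand                            # memory cell  (1.5)
    S_t = o_t * torch.tanh(I_t)                                  # ext. state   (1.6)
    return S_t, I_t

d_in, d_h = 512, 128                                              # 512-d input, 128 hidden nodes
params = [torch.randn(d_h, d_in) if k % 3 == 0 else
          (torch.randn(d_h, d_h) if k % 3 == 1 else torch.zeros(d_h))
          for k in range(12)]
S, I = torch.zeros(1, d_h), torch.zeros(1, d_h)
S, I = lstm_step(torch.randn(1, d_in), S, I, params)
```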
512-dimensional fusion characteristics obtained by four frames of forward images are spliced with two 64-dimensional image characteristics obtained by the left and right images, so that the description of a semantic segmentation network and a DSCIL model on a dynamic environment is obtained.
The dynamic surround vision simulation learning network is mainly trained by fusing images and corresponding expert libraries, so that end-to-end unmanned vehicle speed and steering can be obtained by surrounding views in the current state, a speed and steering control module is input to make vehicle driving decisions, and the dynamic surround vision simulation learning network performs speed and direction planning based on a branch decision network.
A vehicle faces four situations while driving: following the road, and turning left, going straight or turning right when meeting an intersection. For different navigation instructions, the DSCIL model uses different branch decision networks, i.e. each navigation instruction activates only its corresponding branch network.
Similar to a single-pole multi-throw switch, the branch network is selectively activated according to the navigation information c, and different navigation information activates the corresponding branch network to predict the speed and steering wheel angle of the vehicle. The navigation information c includes four states: road following, going straight, turning left and turning right. Unlike the CILRS model, the network (the dynamic surround vision simulation learning network) does not additionally use the speed as an input to predict the throttle and brake, which avoids the network learning the logically erroneous mapping between ground speed and low acceleration that exists in the data. Such a mapping may result in the vehicle stopping even though the target has not been reached and the road ahead is passable.
The four branch networks have the same structure. The input of each is the 640-dimensional environmental feature vector fused by the visual perception network, and a 2-dimensional vector is finally output through a fully connected network with 256 nodes per layer, the two outputs being the predicted vehicle speed and the steering wheel angle. The vehicle speed is a real number greater than 0, and the angle has a value range of (−1, 1). The last fully connected layer uses dropout with probability 0.5 to avoid overfitting.
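A sketch of one such branch is given below: a 640-dimensional input, fully connected layers with 256 nodes, dropout 0.5 before the 2-dimensional output, and one branch per navigation command. The softplus/tanh output activations are assumptions chosen only to respect the stated value ranges (speed > 0, angle in (−1, 1)).

```python
import torch
import torch.nn as nn

class DecisionBranch(nn.Module):
    def __init__(self, in_dim=640, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Dropout(p=0.5),                       # dropout 0.5 on the last hidden layer
            nn.Linear(hidden, 2),                    # 2-d output: speed and steering angle
        )

    def forward(self, e_t):
        out = self.net(e_t)
        speed = nn.functional.softplus(out[:, 0])    # predicted v_{t+1} > 0 (assumed activation)
        steer = torch.tanh(out[:, 1])                # steering angle in (-1, 1) (assumed activation)
        return speed, steer

# one branch per navigation command; only the branch matching c is activated
branches = nn.ModuleDict({c: DecisionBranch() for c in ("follow", "straight", "left", "right")})
speed, steer = branches["left"](torch.randn(1, 640))
```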
Suppose S_t^F, S_t^R and S_t^L denote the semantic segmentation images of the forward, right and left views at time t. The fusion feature of the forward images at that moment is obtained by equation 1.7, where R_34 is the function of the 34-layer residual network and L is the function of the LSTM network:

e_t^F = L(R_34(S_t^F), R_34(S_{t-1}^F), ..., R_34(S_{t-n+1}^F))    (1.7)

The feature vectors e_t^R and e_t^L of the right and left images at time t are obtained by equations 1.8 and 1.9, where R_18 is the function of the 18-layer residual network:

e_t^R = R_18(S_t^R)    (1.8)
e_t^L = R_18(S_t^L)    (1.9)

The dynamic environment feature e_t at time t is the splice of the forward image fusion feature e_t^F, the right image feature e_t^R and the left image feature e_t^L, as shown in equation 1.10:

e_t = [e_t^F, e_t^R, e_t^L]    (1.10)

As shown in equation 1.11, the vehicle speed v_{t+1} at time t+1 and the steering wheel angle a_t at time t predicted by the network are a function F of the dynamic environment feature e_t and the navigation instruction c. Since different navigation instructions correspond to different decision branch networks, the decision branch network corresponding to navigation instruction c is denoted A_c, giving equation 1.12:

(v_{t+1}, a_t) = F(e_t, c)    (1.11)
(v_{t+1}, a_t) = A_c(e_t)    (1.12)

Combining equations 1.10, 1.11 and 1.12 yields the final mathematical description of the DSCIL network (the dynamic surround vision simulation learning model), equation 1.13:

(v_{t+1}, a_t) = A_c([L(R_34(S_t^F), ..., R_34(S_{t-n+1}^F)), R_18(S_t^R), R_18(S_t^L)])    (1.13)
Codevilla et al. found experimentally that L1 loss is more suitable for autonomous driving tasks than L2 loss, so the DSCIL model uses the L1 loss shown in equation 1.14, where â_t is the predicted value of the steering wheel angle a_t at time t and v̂_{t+1} is the predicted value of the vehicle speed v_{t+1} at time t+1. The first term of the loss function is thus the steering wheel angle error term and the second term is the vehicle speed error term; both error terms have equal weight 0.5.

Loss = 0.5·|â_t − a_t| + 0.5·|v̂_{t+1} − v_{t+1}|    (1.14)
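Equation (1.14) translates directly into a short loss function; the sketch below assumes the predictions and targets are tensors of matching shape.

```python
import torch

def dscil_loss(pred_steer, true_steer, pred_speed, true_speed):
    # equally weighted L1 terms: 0.5 on the steering error, 0.5 on the speed error (eq. 1.14)
    return 0.5 * torch.abs(pred_steer - true_steer).mean() + \
           0.5 * torch.abs(pred_speed - true_speed).mean()
```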
The branch decision network only solves the lateral control problem of the vehicle; the longitudinal control problem is solved with PID (proportional-integral-derivative) control based on the current vehicle speed and the next expected vehicle speed.
To solve the longitudinal control problem of the vehicle, proportional-integral (PI) control is introduced to control the vehicle speed. The principle of proportional-integral control is shown in FIG. 6, where a_t is the set value of the system at time t, b_t is the output value of the system at time t, and the deviation c_t is the difference between the set value and the output value, as shown in equation 1.15. As shown in equation 1.16, the control quantity m_t of the controlled object is determined by the proportional and integral terms of the deviation c_t, where k_s is the proportional coefficient and k_i is the integral coefficient. The proportional term adjusts according to the deviation: it increases when the deviation increases and decreases when the deviation decreases, but offset-free control cannot be achieved by the proportional term alone. The integral term effectively eliminates the steady-state error by integrating the system deviation.

c_t = a_t − b_t    (1.15)
m_t = k_s·c_t + k_i·∫c_t dt    (1.16)
Empirically, the proportional coefficient of the speed control is set to 0.25 and the integral coefficient to 0.2, giving the control quantity m_t shown in equation 1.17, where v_t is the vehicle speed at time t and v̂_t is the expected value of the speed at time t. The relationship between the longitudinal control quantity m_t and the throttle V_t at time t is shown in equation 1.18: when m_t ≤ 0, the throttle size is 0; when m_t > 0, the throttle size is the minimum of 1.3 times m_t and 0.75. The relationship between m_t and the brake b_t at time t is shown in equation 1.19: when m_t > 0, the brake size is 0; when m_t ≤ 0, the brake size is the minimum of 0.35 times |m_t| and 1.

m_t = 0.25·(v̂_t − v_t) + 0.2·∫(v̂_t − v_t)dt    (1.17)
V_t = 0 when m_t ≤ 0;  V_t = min(1.3·m_t, 0.75) when m_t > 0    (1.18)
b_t = 0 when m_t > 0;  b_t = min(0.35·|m_t|, 1) when m_t ≤ 0    (1.19)
FIG. 7 shows the relationship between the longitudinal control quantity m_t, the throttle V_t and the brake b_t at time t. The abscissa is the control quantity m_t and the ordinate is the throttle/brake; the solid line shows how the throttle varies with the control quantity and the dotted line shows how the brake varies with it. When the longitudinal control quantity m_t is positive, the brake is 0 and the throttle is proportional to m_t within a certain range, limited by a threshold of 0.75 beyond that range; when the longitudinal control quantity is negative, the throttle is 0 and the brake is proportional to |m_t| within a certain range, limited by a threshold of 1.0 beyond that range.
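A compact sketch of this longitudinal controller: a PI law on the speed error with k_s = 0.25 and k_i = 0.2, followed by the throttle/brake mapping described above. The 0.3 s integration step is an assumption taken from the image acquisition period.

```python
class PISpeedController:
    def __init__(self, k_s=0.25, k_i=0.2, dt=0.3):
        self.k_s, self.k_i, self.dt = k_s, k_i, dt
        self.integral = 0.0

    def step(self, desired_speed, current_speed):
        error = desired_speed - current_speed
        self.integral += error * self.dt
        m_t = self.k_s * error + self.k_i * self.integral      # eq. 1.17
        if m_t > 0:
            throttle, brake = min(1.3 * m_t, 0.75), 0.0        # eq. 1.18
        else:
            throttle, brake = 0.0, min(0.35 * abs(m_t), 1.0)   # eq. 1.19
        return throttle, brake

controller = PISpeedController()
throttle, brake = controller.step(desired_speed=8.0, current_speed=6.5)
```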
A system based on unmanned vehicle motion planning, includes a vision sensor, a category weight module, a semantic segmentation module, a mimic learning module, a speed control module, and a steering control module, wherein:
the vision sensor is used for periodically acquiring surrounding views of the left direction, the middle direction and the right direction of the vehicle;
the category weight module is used for acquiring category weight information corresponding to the forward image through a category weight network;
the semantic segmentation module is used for performing semantic segmentation on left-direction current view data and right-direction current view data and continuous n frames of forward view data respectively by a semantic segmentation network to obtain a semantic segmentation image;
The simulation learning module carries out network learning based on neural network learning to obtain a converged dynamic surround vision simulation learning model. The simulation learning module comprises an image fusion module and a branch decision module: the image fusion module fuses the category weight information with the semantic segmentation image of each forward frame; the fused images pass through a ResNet34 residual network to obtain n 512-dimensional feature vectors, which pass through a single-layer LSTM network to obtain one 512-dimensional feature vector; the left-direction and right-direction semantic segmentation images each pass through a residual network to obtain 64-dimensional feature vectors; and the 512-dimensional feature vector and the two 64-dimensional feature vectors are spliced to obtain a 640-dimensional joint feature vector. The branch decision module activates the corresponding branch network according to the navigation information, the navigation information c comprising four states: road following, going straight, turning left and turning right.
The speed control module controls the speed of the unmanned vehicle according to the planned speed output by the simulation learning module;
the steering control module controls the steering of the unmanned vehicle according to the planned direction output by the simulation learning module.
Example 1
With reference to fig. 1, the unmanned vehicle motion planning system provided in this embodiment obtains the current left, middle and right surround views through the vision sensors of the unmanned vehicle. The 4 frames of forward images are input into the category weight module to obtain the 13 × 200 × 88 spatial category weight information corresponding to the current frame; meanwhile, the 4 forward frames and the current left and right frames are input into the semantic segmentation module to obtain the corresponding 200 × 88 semantic images. The 4 forward frames obtained by combining the category weight information with the corresponding semantic segmentation images pass through a ResNet34 network to yield four 512-dimensional feature vectors, and these four vectors pass through a single-layer LSTM network to yield one 512-dimensional feature vector; the processed current left and right frames each pass through the same ResNet18 network to yield 64-dimensional feature vectors, and the three feature vectors are spliced into a 640-dimensional joint feature vector. The joint feature vector predicts the vehicle speed and steering wheel angle through the 3-layer fully connected imitation learning module.
The model is tested on the driving evaluation benchmarks of the CARLA automatic driving platform, namely CARLA benchmark and NoCrash benchmark, and compared with five other classical end-to-end driving models.
With reference to fig. 2, the CARLA benchmark includes 4 different driving conditions: Straight, One Turn, Navigation and Nav. Dynamic. Particularly in the dynamic environment, the test results are significantly better than those of the other models, which fully illustrates the advantages of the model in dynamic environments. When tested in the New Weather & Town environment, the task completion rate of the model is 100% in the Straight driving condition and 98% in the other three driving conditions, a large improvement over the other five classical models; for example, compared with the CAL model, the completion rates of the method in the four driving conditions are higher by 6%, 26%, 30% and 34% respectively.
The NoCrash benchmark has stricter task requirements than the CARLA benchmark: if a strong collision occurs during driving, the task is judged to have failed. The model is tested and compared with five other classical end-to-end models. In the New Weather environment, for the tasks under the three traffic conditions Empty, Regular and Dense, the task completion rates of the model are 100 ± 0%, 85 ± 3% and 36 ± 2%, while the CAL model achieves 85 ± 2%, 68 ± 5% and 33 ± 2% and the CILRS model achieves 96 ± 1%, 77 ± 1% and 47 ± 1%. In the New Town environment, the task completion rates of the model are 67 ± 2%, 56 ± 1% and 26 ± 1%, while the CAL model achieves 36 ± 6%, 26 ± 2% and 9 ± 1% and the CILRS model achieves 66 ± 2%, 49 ± 5% and 23 ± 1%. Across the different end-to-end models, the model provided by the method achieves the best performance in most driving tasks, particularly in the dynamic environment, which fully illustrates its reliability in dynamic environments.
Claims (6)
1. A semantic segmentation based unmanned vehicle motion planning method is characterized by comprising the following steps:
periodically acquiring continuous frame surrounding views in the left direction, the front direction and the right direction of a vehicle;
based on environmental information classification, respectively performing semantic segmentation on left and right direction current view data and continuous n frames of forward view data through a semantic segmentation network to obtain semantic segmentation images;
based on the environmental information classification, acquiring category weight information corresponding to the continuous n frames of forward images through a category weight network;
acquiring the speed and direction corresponding to each frame of view based on an expert database;
taking the segmented images, the category weight information and the speed and direction corresponding to the view as training data, training and learning the dynamic surround vision imitating learning network based on the neural network, and repeating the steps until a convergent dynamic surround vision imitating learning model is obtained; the dynamic surround vision simulation learning network comprises a feature extraction fusion network and a branch decision network, and adopts an L1 loss function; wherein:
the feature extraction fusion network fuses category weight information and a semantic segmentation image of a forward image, n 512-dimensional feature vectors are obtained from the fused image through a ResNet34 residual error network, the n 512-dimensional feature vectors obtain a 512-dimensional feature vector through a single-layer LSTM network, 64-dimensional feature vectors are obtained from the semantic segmentation image in the left direction and the semantic segmentation image in the right direction through the residual error network respectively, and a 640-dimensional combined feature vector is obtained by splicing the 512-dimensional feature vector and the two 64-dimensional feature vectors;
the branch decision network activates the corresponding branch network according to the navigation information, the navigation information c comprising four states: road following, going straight, turning left and turning right;
the L1 loss function is:
wherein,vehicle speed v at time t +1 t+1 Is predicted value of (4), (v) is greater than or equal to>Is the predicted value of the square disk angle at time t, a t Is the steering wheel angle at time t, v t+1 Vehicle speed at time t + 1;
the number of hidden nodes of the LSTM network is 128, and the updating process is as follows:
external state S based on last moment t-1 And input x of the current time t Calculating input gate i t And an output gate o t Forget door t And candidate statesThe calculation formula is as follows:
i t =σ(M i x t +N i S t-1 +b i )
o t =σ(M o x t +N o S t-1 +b o )
l t =σ(M l x t +N l S t-1 +b l )
wherein tanh is the activation function, M i And N i Is a parameter matrix, the value range is (0, 1), sigma is a logistic function, and the value range is (0, 1);
through forgetting door l t And an input gate i t To update the memory cell I t The update formula is:
passing the internal state to the external state S t External state S t Comprises the following steps:
S t =o t tanh(I t )
the 640-dimensional joint feature vector is:
wherein e is t 512-dimensional feature vectors and two 64-dimensional feature vectors are spliced to obtain a 640-dimensional joint feature vector,for corresponding 512-dimensional feature vectors for n successive frames of forward images>640-dimensional associated feature vectors corresponding to right and left images at time t, respectively>Semantic segmentation images R corresponding to the forward image, the right image and the left image at the time t respectively 34 Is 34 layer residual networkFunction, R 18 Is a 18-layer residual network function, L is an LSTM network function;
the dynamic surround vision imitation learning model is as follows:
(v_{t+1}, a_t) = F(e_t, c)
(v_{t+1}, a_t) = A_c(e_t)
wherein v_{t+1} is the vehicle speed at time t+1, a_t is the steering wheel angle at time t, c is the navigation instruction, F is the mapping from the dynamic environment feature e_t and the navigation instruction c to the vehicle speed v_{t+1} and the steering wheel angle a_t, and A_c is the decision branch network corresponding to the navigation instruction c;
surrounding views in the left direction, the front direction and the right direction of the vehicle are collected, semantic segmentation images of the surrounding views in the left direction, the front direction and the right direction and category weight information corresponding to continuous n frames of forward images are obtained and input to a dynamic surrounding vision simulation learning model to obtain a planning speed and a planning direction;
and carrying out motion planning on the unmanned vehicle based on the planning speed and the planning direction.
2. The unmanned vehicle motion planning method of claim 1, wherein the semantic segmentation network comprises a residual network ResNet, a fully convolutional network FCN, an atrous spatial pyramid pooling module and a decoder; obtaining the semantic segmentation images by respectively performing semantic segmentation on the left-direction and right-direction current view data and the continuous n frames of forward view data through the semantic segmentation network specifically comprises the following steps:
performing semantic segmentation on the current left and right view data through a residual error network ResNet and a full convolution network FCN, and performing semantic segmentation on the current and previous n-1 frame forward view data;
based on the environmental information classification, extracting and classifying semantic segmentation image feature information by adopting atrous spatial pyramid pooling;
and recovering the details and the space dimensionality of the image features through a decoder based on the classified image feature information to obtain a final semantic segmentation image.
3. The unmanned vehicle motion planning method of claim 1, wherein the category weight network is specifically: a residual unit is formed by four residual blocks, the fourth residual block is copied three times and the copies are connected to the residual unit in cascade, together with a parallel atrous spatial pyramid pooling structure.
4. The unmanned vehicle motion planning method of claim 1, wherein the collection period is 0.3s, n =4, and the classification of the environmental information comprises: empty, buildings, fences, pedestrians, pillars, lane dividing lines, roads, sidewalks, vegetation, vehicles, walls, signal lamps and other unidentified types.
5. The unmanned vehicle motion planning method according to claim 1, wherein a PID controller is adopted to plan the motion of the unmanned vehicle, and the PID controller adopts a PI control algorithm, specifically:

m_t = k_s·(v̂_{t+1} − v_t) + k_i·∫(v̂_{t+1} − v_t)dt
V_t = 0 when m_t ≤ 0;  V_t = min(1.3·m_t, 0.75) when m_t > 0
b_t = 0 when m_t > 0;  b_t = min(0.35·|m_t|, 1) when m_t ≤ 0

wherein v̂_{t+1} is the predicted value of the vehicle speed v_{t+1} at time t+1, v_t is the vehicle speed at time t, V_t is the throttle size at time t, b_t is the brake size at time t, k_s is the proportional coefficient, k_i is the integral coefficient, and m_t is the longitudinal control quantity.
6. An unmanned vehicle motion planning system based on the unmanned vehicle motion planning method based on semantic segmentation according to any one of claims 1 to 5, which is characterized by comprising a visual sensor, a category weight module, a semantic segmentation module, a simulation learning module, a speed control module and a steering control module, wherein:
the vision sensor is used for periodically acquiring surrounding views of the left direction, the middle direction and the right direction of the vehicle;
the category weight module is used for acquiring category weight information corresponding to the forward image through a category weight network;
the semantic segmentation module is used for performing semantic segmentation on left-direction current view data and right-direction current view data and continuous n frames of forward view data respectively by a semantic segmentation network to obtain a semantic segmentation image;
the simulation learning module performs network learning based on neural network learning to obtain a converged dynamic surround vision simulation learning model; the simulation learning module comprises an image fusion module and a branch decision module, wherein the image fusion module fuses the category weight information with the semantic segmentation image of each forward frame, the fused images pass through a ResNet34 residual network to obtain n 512-dimensional feature vectors, the n 512-dimensional feature vectors pass through a single-layer LSTM network to obtain one 512-dimensional feature vector, the left-direction and right-direction semantic segmentation images each pass through a residual network to obtain 64-dimensional feature vectors, and the 512-dimensional feature vector and the two 64-dimensional feature vectors are spliced to obtain a 640-dimensional joint feature vector; the branch decision module activates the corresponding branch network according to the navigation information, the navigation information c comprising four states: road following, going straight, turning left and turning right;
the speed control module controls the speed of the unmanned vehicle according to the planned speed output by the simulation learning module;
the steering control module controls the steering of the unmanned vehicle according to the planned direction output by the simulation learning module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111301912.9A CN114035575B (en) | 2021-11-04 | 2021-11-04 | Unmanned vehicle motion planning method and system based on semantic segmentation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111301912.9A CN114035575B (en) | 2021-11-04 | 2021-11-04 | Unmanned vehicle motion planning method and system based on semantic segmentation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114035575A CN114035575A (en) | 2022-02-11 |
CN114035575B true CN114035575B (en) | 2023-03-31 |
Family
ID=80142861
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111301912.9A Active CN114035575B (en) | 2021-11-04 | 2021-11-04 | Unmanned vehicle motion planning method and system based on semantic segmentation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114035575B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114035575B (en) * | 2021-11-04 | 2023-03-31 | 南京理工大学 | Unmanned vehicle motion planning method and system based on semantic segmentation |
CN116048096B (en) * | 2023-02-23 | 2024-04-30 | 南京理工大学 | Unmanned vehicle movement planning method based on hierarchical depth perception |
CN117078923B (en) * | 2023-07-19 | 2024-07-16 | 苏州大学 | Automatic driving environment-oriented semantic segmentation automation method, system and medium |
CN117184105B (en) * | 2023-07-20 | 2024-03-26 | 清华大学 | Steering angular velocity prediction method and device based on multi-mode data fusion |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109410307A (en) * | 2018-10-16 | 2019-03-01 | 大连理工大学 | A kind of scene point cloud semantic segmentation method |
CN110189337A (en) * | 2019-05-31 | 2019-08-30 | 广东工业大学 | A kind of automatic Pilot image, semantic dividing method |
DE102018128531A1 (en) * | 2018-11-14 | 2020-05-14 | Valeo Schalter Und Sensoren Gmbh | System and method for analyzing a three-dimensional environment represented by a point cloud through deep learning |
CN111210435A (en) * | 2019-12-24 | 2020-05-29 | 重庆邮电大学 | Image semantic segmentation method based on local and global feature enhancement module |
CN111443701A (en) * | 2018-12-29 | 2020-07-24 | 南京理工大学 | Unmanned vehicle/robot behavior planning method based on heterogeneous deep learning |
CN111832453A (en) * | 2020-06-30 | 2020-10-27 | 杭州电子科技大学 | Unmanned scene real-time semantic segmentation method based on double-path deep neural network |
CN113486568A (en) * | 2021-04-06 | 2021-10-08 | 南京晓庄学院 | Vehicle control dynamic simulation learning algorithm based on surround vision |
CN114035575A (en) * | 2021-11-04 | 2022-02-11 | 南京理工大学 | Unmanned vehicle motion planning method and system based on semantic segmentation |
-
2021
- 2021-11-04 CN CN202111301912.9A patent/CN114035575B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109410307A (en) * | 2018-10-16 | 2019-03-01 | 大连理工大学 | A kind of scene point cloud semantic segmentation method |
DE102018128531A1 (en) * | 2018-11-14 | 2020-05-14 | Valeo Schalter Und Sensoren Gmbh | System and method for analyzing a three-dimensional environment represented by a point cloud through deep learning |
CN111443701A (en) * | 2018-12-29 | 2020-07-24 | 南京理工大学 | Unmanned vehicle/robot behavior planning method based on heterogeneous deep learning |
CN110189337A (en) * | 2019-05-31 | 2019-08-30 | 广东工业大学 | A kind of automatic Pilot image, semantic dividing method |
CN111210435A (en) * | 2019-12-24 | 2020-05-29 | 重庆邮电大学 | Image semantic segmentation method based on local and global feature enhancement module |
CN111832453A (en) * | 2020-06-30 | 2020-10-27 | 杭州电子科技大学 | Unmanned scene real-time semantic segmentation method based on double-path deep neural network |
CN113486568A (en) * | 2021-04-06 | 2021-10-08 | 南京晓庄学院 | Vehicle control dynamic simulation learning algorithm based on surround vision |
CN114035575A (en) * | 2021-11-04 | 2022-02-11 | 南京理工大学 | Unmanned vehicle motion planning method and system based on semantic segmentation |
Non-Patent Citations (1)
Title |
---|
Real-time semantic segmentation method for scenes in unmanned driving; Qin Feiwei; Journal of Computer-Aided Design & Computer Graphics; 2021-07-30; Vol. 33, No. 7; pp. 1026-1038 *
Also Published As
Publication number | Publication date |
---|---|
CN114035575A (en) | 2022-02-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114035575B (en) | Unmanned vehicle motion planning method and system based on semantic segmentation | |
CN112215337B (en) | Vehicle track prediction method based on environment attention neural network model | |
CN112099496B (en) | Automatic driving training method, device, equipment and medium | |
Fernando et al. | Deep inverse reinforcement learning for behavior prediction in autonomous driving: Accurate forecasts of vehicle motion | |
CN112703459B (en) | Iterative generation of confrontational scenarios | |
CN112888612A (en) | Autonomous vehicle planning | |
CN113044064B (en) | Vehicle self-adaptive automatic driving decision method and system based on meta reinforcement learning | |
CN113715842B (en) | High-speed moving vehicle control method based on imitation learning and reinforcement learning | |
CN112232490A (en) | Deep simulation reinforcement learning driving strategy training method based on vision | |
CN112784485B (en) | Automatic driving key scene generation method based on reinforcement learning | |
Qiao et al. | Behavior planning at urban intersections through hierarchical reinforcement learning | |
CN113255054A (en) | Reinforcement learning automatic driving method based on heterogeneous fusion characteristics | |
CN113552883B (en) | Ground unmanned vehicle autonomous driving method and system based on deep reinforcement learning | |
CN110281949A (en) | A kind of automatic Pilot unifies hierarchical decision making method | |
Sun et al. | Human-like highway trajectory modeling based on inverse reinforcement learning | |
CN118097989B (en) | Multi-agent traffic area signal control method based on digital twin | |
CN114620059A (en) | Automatic driving method and system thereof, and computer readable storage medium | |
US20240208546A1 (en) | Predictive models for autonomous vehicles based on object interactions | |
CN111552294A (en) | Outdoor robot path-finding simulation system and method based on time dependence | |
CN113486568A (en) | Vehicle control dynamic simulation learning algorithm based on surround vision | |
CN111443701A (en) | Unmanned vehicle/robot behavior planning method based on heterogeneous deep learning | |
CN114779764A (en) | Vehicle reinforcement learning motion planning method based on driving risk analysis | |
CN116048096B (en) | Unmanned vehicle movement planning method based on hierarchical depth perception | |
US11808582B1 (en) | System processing scenario objects during simulation | |
CN111508256A (en) | Traffic information reconstruction method based on regional time-space domain and intelligent traffic system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |