CN110956148A

CN110956148A - Autonomous obstacle avoidance method and device for unmanned vehicle, electronic device and readable storage medium

Info

Publication number: CN110956148A
Application number: CN201911236281.XA
Authority: CN
Inventors: 宗文豪
Original assignee: Shanghai Duomin Intelligent Technology Co ltd
Current assignee: Shanghai Duomin Intelligent Technology Co ltd
Priority date: 2019-12-05
Filing date: 2019-12-05
Publication date: 2020-04-03
Anticipated expiration: 2039-12-05
Also published as: CN110956148B

Abstract

The invention provides an autonomous obstacle avoidance method and device for an unmanned vehicle, electronic equipment and a readable storage medium, wherein the autonomous obstacle avoidance method comprises the following steps: acquiring current state information; according to the current state information and the historical state information, the obstacle avoidance network generates action information with high prediction evaluation; executing the previous action information, and repeating the process until the destination is reached; the obstacle avoidance network comprises an action generation network and a strategy evaluation network; the former obtains fusion state information according to the current state information and the historical state information; predicting current action information according to the fusion state information; the latter obtains the prediction evaluation of the current action information according to the return value, the fusion state information and the current action information; and adjusting a subsequent action generation strategy according to the prediction evaluation. According to the invention, a cyclic neural network and an attention mechanism are introduced in reinforcement learning to give higher attention to past abnormal states, so that the unmanned vehicle can effectively avoid obstacles by means of memorizing the past abnormal states.

Description

Autonomous obstacle avoidance method and device for unmanned vehicle, electronic device and readable storage medium

Technical Field

The invention relates to the field of unmanned driving, in particular to an autonomous obstacle avoidance method and device for an unmanned vehicle, electronic equipment and a readable storage medium.

Background

In unknown environments, unmanned vehicle operation requires attention to avoid any possible static and dynamic obstacles. To achieve this, the control algorithm needs to take into account a series of environmental information acquired by external sensors.

With the development of artificial intelligence, reinforcement learning methods have been attempted for use in unmanned vehicle control. The goal of reinforcement learning is to learn optimal behavior through interaction of the agent with the environment. The reinforcement learning is unsupervised learning, the training sample comes from the interactive experience of the intelligent agent and the environment, and the special condition existing in the environment can be effectively solved without the need of sample marking. Meanwhile, in order to adapt to the prediction scene of a high-dimensional data space, large-scale deep learning is introduced on the basis of a reinforcement learning framework, so that the motion space obtained by prediction is more suitable for a changeable scene.

The autonomous obstacle avoidance of the unmanned vehicle is a partially observable Markov process which is not only related to the current state but also related to the preorder state and has higher real-time requirement. For example, the unmanned vehicle detects that there is an obstacle in front for a certain distance at time t _ n, but with the position or pose adjustment of the unmanned vehicle, there is a possibility that the obstacle detected before may be in a blind area of vision at time t _ (n + x), which requires the unmanned vehicle to make a timely control by means of memory of the past state. Specifically, for example, when the vehicle head is far away from the road edge within a certain range, the road edge can be seen, but when the vehicle head is close to the road edge, the vehicle head gradually covers the road edge, so that the road edge cannot be seen in the visual field range, and at the moment, the unmanned vehicle is required to timely steer by means of the memory of the previous state. However, at present, based on the existing deep reinforcement learning algorithm, such as RDPG, DDPG (depth deterministic strategy gradient algorithm), and the like, the constructed model shows a common performance in the above-mentioned scenario, and even cannot converge.

Disclosure of Invention

One of the objectives of the present invention is to provide an autonomous obstacle avoidance method and apparatus for an unmanned vehicle, an electronic device, and a readable storage medium, so as to overcome at least some of the disadvantages in the prior art.

The technical scheme provided by the invention is as follows:

an autonomous obstacle avoidance method for an unmanned vehicle comprises the following steps: acquiring current state information, wherein the current state information comprises current environment state information and the current state of an unmanned vehicle; according to the current state information and the historical state information, the trained obstacle avoidance network generates current action information; executing the current action information, repeating the process to obtain next state information, updating the current action information according to the next state information, and repeating the steps until the unmanned vehicle reaches the destination; the obstacle avoidance network adopts an Actor-Critic structure and comprises an action generation network and a strategy evaluation network; the action generating network is used for processing the current state information and the historical state information through a first recurrent neural network to obtain fusion state information; predicting current action information according to the fusion state information; the policy evaluation network is configured to obtain a return value obtained by executing the current action information under the current state information; processing the current action information through the second recurrent neural network according to the return value, the fusion state information and the current action information to obtain the prediction evaluation of the current action information; and the action generation network adjusts a subsequent action generation strategy according to the prediction evaluation.

Further, the obtaining of the prediction evaluation of the current action information by the processing of the second recurrent neural network according to the return value, the fusion state information, and the current action information includes: obtaining state action fusion information according to the return value, the fusion state information and the current action information; processing the state action fusion information through a second fully-connected neural network to obtain pre-evaluation information; performing one-step attention processing on the state action fusion information and the pre-evaluation information to obtain weight-corrected state action fusion information; and according to the weight-corrected state action fusion information and the prediction evaluation of the historical action information, the prediction evaluation of the current action information is obtained through the processing of the second recurrent neural network.

Further, the obtaining of the state and motion fusion information with corrected weight by performing further attention processing on the state and motion fusion information and the pre-evaluation information specifically includes: calculating the correlation between the state action fusion information and the pre-evaluation information to obtain a correlation coefficient; normalizing the correlation information to obtain a corresponding weight factor; and adjusting the state action fusion information by using the weight factor to obtain the state action fusion information of weight correction.

Further, the correlation of the state action fusion information and the pre-evaluation information is calculated according to the following formula:

wherein the content of the first and second substances,

for the state action fusion information at time t,

as the pre-evaluation information at time j, w1 and w2 are coefficients,

a correlation coefficient indicating the pre-evaluation information at the time j and the state action fusion information at the time t; normalizing the correlation information according to the following formula to obtain corresponding weight factors

Obtaining weight-corrected state action fusion information according to the following formula

Further, the report value obtained by executing the current action information under the current state information specifically includes: if the action information is executed under the current state information and no collision occurs, the reported value is the distance traveled by the unmanned vehicle in unit time; and if the action information is executed under the current state information and collision can occur, the return value is a preset penalty value.

Further, the training of the obstacle avoidance network includes: training an obstacle avoidance network through interactive information between the environment and the unmanned vehicle, and updating network parameters through a minimized loss function; the loss function comprises the value increment of the old strategy and the new strategy and the KL divergence between the old strategy and the new strategy; and when the KL divergence between the new strategy and the old strategy is smaller than a preset threshold and the accumulated return value based on the new strategy is higher than the accumulated return value based on the old strategy, updating the old strategy by using the new strategy.

Further, a loss function J at time t is calculated according to the following formula_t：

Wherein the content of the first and second substances,

representing a cumulative reward function proxy objective function,

representing the squared loss of the return function, c₁,c₂Is the coefficient, s_π(s_t) Represents a cross-entropy loss gain that encourages policy heuristics, pi represents a policy,

indicates the desired estimated value, A_π(t) merit function, r_tIs the reported value at the time t.

The invention also provides an autonomous obstacle avoidance device of the unmanned vehicle, which comprises: the state acquisition module is used for acquiring current state information, wherein the current state information comprises current environment state information and the current state of the unmanned vehicle; the obstacle avoidance module is used for generating current action information through the trained obstacle avoidance network according to the current state information and the historical state information; the triggering module is used for executing the action information, triggering to obtain next state information, updating the current action information according to the next state information, and repeating the steps until the unmanned vehicle reaches the destination; wherein, keep away the barrier network and adopt Actor-criticic structure, keep away the barrier module and include: the action generating unit is used for processing the current state information and the historical state information through a first cyclic neural network to obtain fusion state information; predicting current action information according to the fusion state information; the strategy evaluation unit is used for acquiring a return value obtained by executing the current action information under the current state information; processing the current action information through the second recurrent neural network according to the return value, the fusion state information and the current action information to obtain the prediction evaluation of the current action information; and the action generating unit is used for adjusting a subsequent action generating strategy according to the prediction evaluation.

The present invention also provides an electronic device comprising: a memory for storing a computer program; and the processor is used for realizing the autonomous obstacle avoidance method of the unmanned vehicle when the computer program is run.

The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the aforementioned autonomous obstacle avoidance method for an unmanned vehicle.

The autonomous obstacle avoidance method and device for the unmanned vehicle, the electronic device and the readable storage medium provided by the invention can bring the following beneficial effects:

1. according to the invention, by introducing the recurrent neural network and introducing the memory mechanism into the action generation network and the strategy evaluation network, the current detected roadblock and the previously detected roadblock can be comprehensively considered, and reasonable obstacle avoidance action is made; the current prediction evaluation and the previous prediction evaluation can be comprehensively considered to generate more appropriate evaluation output; in a word, the memory is increased, and the obstacle avoidance network can predict output more accurately.

2. The invention introduces the attention mechanism to give higher attention to past abnormal states in reinforcement learning, so that the unmanned vehicle can carry out timely control by means of memory of past abnormal states, and the obstacle is effectively avoided.

3. The invention limits the updating amplitude of the new strategy and the old strategy by using the KL divergence, avoids the strategy from rapidly forgetting the experience learned by the past samples during updating, and ensures that the strategy change is smooth and controllable.

Drawings

The above features, technical features, advantages and implementations of an autonomous obstacle avoidance method and apparatus for an unmanned vehicle, an electronic device, and a readable storage medium will be further described in detail below with reference to the accompanying drawings.

FIG. 1 is a flow chart of one embodiment of an autonomous obstacle avoidance method of an unmanned vehicle of the present invention;

FIG. 2 is a flow diagram for one embodiment of step S300 in FIG. 1;

FIG. 3 is a flow chart of another embodiment of an autonomous obstacle avoidance method of an unmanned vehicle of the present invention;

FIG. 4 is a flow chart of another embodiment of step S300 in FIG. 1;

fig. 5 is a schematic structural diagram of an embodiment of an autonomous obstacle avoidance apparatus of an unmanned vehicle according to the present invention;

fig. 6 is a schematic structural diagram of another embodiment of the autonomous obstacle avoidance apparatus of the unmanned vehicle of the present invention;

FIG. 7 is a schematic diagram of an electronic device in accordance with one embodiment of the invention;

fig. 8 is a test result diagram of another embodiment of the autonomous obstacle avoidance method for an unmanned vehicle according to the present invention;

the reference numbers illustrate:

100. the system comprises a state acquisition module, a 200 obstacle avoidance module, a 210 action generation unit, a 220 strategy evaluation unit, a 300 trigger module, a 400 training module, 440 electronic equipment, 410 memory, 420 processor, 430 computer program.

Detailed Description

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.

For the sake of simplicity, the drawings only schematically show the parts relevant to the present invention, and they do not represent the actual structure as a product. In addition, in order to make the drawings concise and understandable, components having the same structure or function in some of the drawings are only schematically illustrated or only labeled. In this document, "one" means not only "only one" but also a case of "more than one".

In an embodiment of the present invention, as shown in fig. 1 and fig. 2, an autonomous obstacle avoidance method for an unmanned vehicle includes:

step S200, current state information is obtained, wherein the current state information comprises current environment state information and the current state of the unmanned vehicle;

step S300, according to the current state information and the historical state information, the trained obstacle avoidance network generates current action information;

the obstacle avoidance network adopts an Actor-Critic structure and comprises an action generation network and a strategy evaluation network; step 300 comprises:

an action generating network:

step 310, according to the current state information and the historical state information, obtaining fusion state information through first cyclic neural network processing;

step 320, predicting current action information according to the fusion state information;

and (3) policy evaluation network:

step 330, obtaining a return value obtained by executing the current action information under the current state information;

step 340, obtaining a prediction evaluation of the current action information through the processing of the second recurrent neural network according to the return value, the fusion state information and the current action information;

step 350, according to the prediction evaluation, generating a network adjustment subsequent action generation strategy by the action generation network;

step S400, judging whether the unmanned vehicle reaches the destination; if yes, ending;

if not, after the current action information enters the next environment state, the step S410 jumps to the step S200 to obtain the next state information, updates the current state information with the next state information, and updates the current action information according to the updated current state information.

Specifically, the unmanned vehicle includes external sensors such as a laser radar, a camera, and the like. The external sensor is used for monitoring obstacles in the surrounding environment in the unmanned vehicle moving process, and obtaining information such as the distance and the direction of the obstacles relative to the vehicle body, namely environment state information, by analyzing time sequence point cloud data acquired by a laser radar or image data acquired by a camera. The unmanned vehicle also comprises an internal sensor for acquiring the speed, position information and the like of the unmanned vehicle; by analyzing the data collected by the internal sensors, the state (i.e., position and velocity information) of the unmanned vehicle is obtained. The current state information includes current environmental state information and a current state of the unmanned vehicle.

The unmanned vehicle also comprises an obstacle avoidance network for controlling the autonomous obstacle avoidance of the unmanned vehicle. The obstacle avoidance network adopts a common Actor-Critic model structure in large-scale deep reinforcement learning, wherein Actor represents an action generation network, and Critic represents a strategy evaluation network. The Actor network is used to learn a mapping a ═ λ(s) of the current state to the motion space, where s is the current state information and a is the predicted motion information. And the Critic network is used for evaluating the quality of the action by combining the return value given by the environment obtained by executing the action information in the current state, so that the whole algorithm is controlled to evolve to the maximum accumulated return value. The final goal of the overall algorithm is to obtain the maximum cumulative reward value. The accumulated reward value reflects the long-term incentives that are obtained from the start time to the end time (e.g., to the destination).

The cumulative reward value β is calculated as follows:

wherein the content of the first and second substances,gamma is the attenuation factor, r(s)_t,a_t) Is the reward value at time t (also called reward function at time t), which is defined as the reward value under specific state and specific action, s_tIs status information at time t, a_tE represents an expectation function for the motion information at time t.

The action generating network is composed of a recurrent neural network, such as a unidirectional LSTM (long short term memory network), or a bidirectional LSTM. The current state information and the historical state information are input into the action generating network to obtain the fusion state information, and the fusion state information not only considers the current state information, but also considers the stored historical state information (namely the state information before the current state information). And predicting action information to be taken at the current moment according to the fusion state information. The action information includes accelerator pedal information, brake pedal information, gear information, steering information, and the like. And controlling the driving of the unmanned vehicle through the action information.

Sometimes, an obstacle is detected in the historical state information, but no obstacle is detected in the current state information, for example, the radar point cloud detects that a road block exists in front of the unmanned vehicle at a certain distance at the time t _ n, but at the moment, the unmanned vehicle cannot turn due to the obstacle existing in the close distance between the left and right, the unmanned vehicle does not need to turn at the time t _ (n + x), but with the pose adjustment of the unmanned vehicle at the time t _ (n + x), the previously detected road block may be in a view blind area, and the road block is not detected at the time t _ (n + x). If only the current state information is considered and no obstacle exists, the unmanned vehicle can be caused to meet the obstacle in the subsequent driving process; the correct treatment is to require the unmanned vehicle to timely carry out obstacle avoidance treatment by means of the memory of the previous state. Because the recurrent neural network technology is adopted, the obtained fusion state information carries historical state information, so that the influence of the historical state information can be considered when the action information is output, and the barrier can be effectively avoided.

Optionally, processing the current state information and the historical state information through a first fully-connected neural network to obtain precoding state information corresponding to each; processing the pre-coding state information through a first cyclic neural network to obtain fusion state information; and generating corresponding action information according to the fusion state information. The fully-connected neural network is an artificial neural network consisting of a plurality of layers of neurons, and the circulating neural network can adopt a unidirectional LSTM network.

Policy evaluation network obtains information s at current state_tLower execution action information a_tThe obtained return value is r(s)_t,a_t). Optionally, if the unmanned vehicle executes the action information in the current environment state and no collision occurs, the reported value is the distance traveled by the unmanned vehicle in unit time; if the unmanned vehicle executes the action information in the current environment state and can collide, the return value is a preset penalty value. The predetermined penalty value is negative. By designing the reward function, the action generation network can evolve to an action strategy for obtaining high reward, so that the obstacle can be effectively avoided.

And the strategy evaluation network carries out prediction evaluation on the current action information through the processing of the second recurrent neural network according to the return value at the current time, the fusion state information and the stored evaluation of the past action, and predicts the possible obtained accumulated return value.

The action generating network adjusts the subsequent action generating strategy according to the prediction evaluation. For example, at time t, action a in state s receives a high predictive rating, and the same or similar state s1 is encountered in the future, encouraging the generation of a similar action a.

And judging whether the unmanned vehicle reaches the destination. If not, executing the action information, reaching the next environment state, jumping to the step S200, acquiring the next state information, updating the current action information according to the next state information, and repeating the steps until the unmanned vehicle reaches the destination.

In the embodiment, by introducing the cyclic neural network and introducing the memory mechanism into the action generating network, the currently detected roadblock and the previously detected roadblock can be comprehensively considered, and more reasonable obstacle avoiding action can be performed; by introducing the recurrent neural network and introducing a memory mechanism into the strategy evaluation network, the current prediction evaluation and the previous prediction evaluation can be comprehensively considered, and more appropriate evaluation output is generated; in a word, the memory is increased, and the obstacle avoidance network can predict output more accurately.

In another embodiment of the present invention, as shown in fig. 1 and 4, an autonomous obstacle avoidance method for an unmanned vehicle includes:

on the basis of the foregoing embodiment, as shown in fig. 4, step S300 includes:

step S311, the current state information and the historical state information are processed by a first fully-connected neural network to obtain the precoding state information corresponding to each;

step S312, the precoding state information is processed by a first long-short term memory network to obtain fusion state information;

step S321, predicting current action information according to the fusion state information;

step S331 of acquiring a return value obtained by executing the current action information under the current state information;

step S341, obtaining state action fusion information according to the return value, the fusion state information and the current action information;

step S342, according to the state action fusion information, pre-evaluation information is obtained through processing of a second full-connection neural network;

step S343, the state action fusion information and the pre-evaluation information are subjected to one-step attention processing to obtain weight-corrected state action fusion information;

step S344, according to the predicted evaluation of the state action fusion information and the historical action information corrected by the weight, the predicted evaluation of the current action information is obtained through the processing of a second long-short term memory network;

step S351 is to generate an action generation policy for adjusting the subsequent action of the network according to the prediction evaluation.

Specifically, a fully-connected neural network is an artificial neural network composed of multiple layers of multiple neurons. The network has a memory mechanism by introducing a long-term and short-term memory network on the basis of a fully-connected neural network.

The action generating network includes a first fully-connected neural network and a first recurrent neural network. The first fully-connected network firstly completes pre-coding on the input environment state information and mines the relation between shallow states, but does not have a time sequence relation. The first long short term memory network (in this embodiment, a unidirectional long short term memory network) is then used to fit the pre-coded state information to the fused state information mapping. The loop layer formed by the first long-short term memory network allows the fused state information to encode an implicit representation with the past time step state information (i.e., historical state information).

Because the dependency relationship established for the time sequence samples by the long and short term memory network is gradually attenuated along with the increase of the time interval, the obstacle detected in the history state with longer time interval can be ignored by the unmanned vehicle due to the change of the pose in the obstacle avoidance process. In order to solve the problems, the attention degree of variable weight is realized on the state information of different time steps, the state information with weight correction is obtained by introducing an attention mechanism into a strategy evaluation network, and the weight of the state of the time step on the prediction return of the output strategy evaluation is increased once the environment state is abnormal.

The strategy evaluation network Critic comprises a second fully-connected neural network, one-step attention processing and a second recurrent neural network.

The strategy evaluation network firstly acquires a return value obtained by executing the current action information under the current state information; obtaining state action fusion information according to the return value, the fusion state information and the current action information; for example, the return value, the fusion state information, and the current action information are spliced to obtain state action fusion information. And processing the state action fusion information through a second fully-connected neural network to obtain pre-evaluation information. And performing one-step attention processing on the state action fusion information and the pre-evaluation information to obtain weight-corrected state action fusion information, and performing one-layer processing on the state action fusion information and the pre-evaluation information to obtain weight-corrected prediction evaluation information.

Assuming that a round counts T time steps, wherein the state action fusion information of the tth step (T epsilon (1, T), namely T time) is recorded as

Recording the pre-evaluation information of the t step

Processed by a second fully-connected neural network to obtain

Obtaining the state action fusion information of weight correction in the t step through the following attention processing

1. Calculating the pre-evaluation information of the j step

And fusion state information of the t step

The correlation between the two groups of the data to obtain a correlation coefficient

Wherein the content of the first and second substances,

for the state action fusion information at time t,

is pre-evaluation information at time j, w₁And w₂As a function of the number of the coefficients,

a correlation coefficient indicating the pre-evaluation information at the time j and the state action fusion information at the time t;

2. by means of normalizing fingersNormalizing the correlation coefficient by a number function (softmax function) to obtain a corresponding weight factor

3. Calculating the state action fusion information of weight correction according to the following formula:

fusing weight-corrected state actions into information

And the prediction evaluation of the historical action information is processed by a second long-short term memory network to obtain the weight corrected prediction evaluation information

Because the memory duration of the recurrent neural network is limited and the attenuation of state memory at intervals is larger, the embodiment gives higher attention to the abnormal sensing state which occurs once by introducing an attention mechanism into the strategy evaluation network, and further improves the memory of the whole system, thereby improving the accuracy of strategy evaluation.

In another embodiment of the present invention, as shown in fig. 3, an autonomous obstacle avoidance method for an unmanned vehicle includes:

on the basis of the embodiments shown in fig. 1 and 4, the following steps are added:

step S100, an obstacle avoidance network is trained through interactive information between the environment and the unmanned vehicle, and network parameters are updated through a minimum loss function.

Firstly, a plurality of training samples are generated according to the interactive information between the unmanned vehicle and the training environment and are stored in the experience playback pool. And extracting a plurality of sample slices from the experience playback pool, and inputting the sample slices as a series of environment perception states into an action generation network and a strategy evaluation network to be trained. Preferably, samples with higher reward values (e.g., greater than a predetermined threshold) are preferentially sampled because such samples have higher learning value. The action generating network generates an action a according to the input environment state s and a preset action space, and the strategy evaluation network obtains an evaluation value v which reacts to the action according to the action a and the environment state s. Finally, the action a with the largest return is selected as the actually executed action, and the operation is circulated. And finishing the self-adaptive obstacle avoidance of the unmanned vehicle under a new scene by using the trained parameters until the Actor and Critic networks are stable in performance and finally converge.

One specific training process is as follows:

step 1, initializing and setting an unmanned vehicle simulation experiment environment, and determining state sensing information and action space information. For example, the horizontal and longitudinal speeds of the unmanned vehicle, the laser point cloud, the radar image and the placing positions of surrounding obstacles, and the motion track of the dynamic obstacle are initialized; and (4) defining the destination position reached by the unmanned vehicle.

Step 2, initializing parameters of the action generating network, parameters of the strategy evaluating network, parameters of the target action generating network, parameters of the target strategy evaluating network, and the experience playback pool R.

And according to the complexity of the environment state information, designing the network scale of the action generation network and the strategy evaluation network. For example, the number of hidden layers of each layer of the fully-connected neuron, the number of layers of the neural network layer, the number of subunits of the loop layer, the maximum number of rounds of strategy iteration, and the like. The higher the latitude of the environmental status information, the larger the network scale of the proposed setting.

In order to facilitate recording and updating of the algorithm during training, a new model and an old model are respectively set for an action generation network and a strategy evaluation network, and each model is allocated with a corresponding parameter space, namely the action generation network

Parameter w of_aAnd target action generating network

Parameter w'_aPolicy evaluation network

Parameter w of_vAnd target policy evaluation network

Parameter w'_v。

And 3, starting simulation, generating a plurality of training samples according to the interactive information between the unmanned vehicle and the training environment in the self-adaptive driving process, recording each training sample in a transition form, and storing the training samples in the experience playback pool.

A transition comprises a 4-tuple, which is respectively the state s of the current time step_tAction a of this time step_tThe return value r obtained by executing the action at the time step_tState s of the next time step_t+1。

Specifically, a current state s is received_t(ii) a Selecting action a in preset action space according to current strategy_t(ii) a Performing action a_tThe obtained return value r_tAnd a new state s_t+1. Preservation(s)_t,a_t,r_t,s_t+1) Into the experience playback pool R.

Repeating the above process, collecting transitions of a certain time step and putting the transitions into a experience playback pool.

Step 4, sampling with priority from the experience playback pool to obtain a plurality of sample slices; and training the obstacle avoidance network by using the plurality of sample slices, and continuously and iteratively updating the network until convergence.

Specifically, several sample slices are extracted from the experience playback pool; a loss function is calculated from the sample slice. The parameters of the network are evaluated by a minimization loss function update policy, and the parameters of the network are generated using a policy gradient update action of the samples. And generating network parameters according to the updated action, updating the parameters of the target action generation network, and updating the parameters of the target strategy evaluation according to the updated strategy evaluation network parameters.

Wherein the loss function J at time t_tCan be expressed as:

the loss function includes a total of three terms, which are: cumulative reward function proxy objective function

Loss of square of a return function

Cross entropy loss gain s to encourage policy exploratory_π(s_t)。c₁,c₂Is the coefficient of the number of the first and second,

representing the desired estimate.

Corresponding to new strategy (use)

Representation), the squared loss of the state cost function of the old policy (represented by pi), is used to evaluate the accuracy of the policy evaluation network's generated value v. V is a function of the value of the state,is an expectation of accumulated return values. V_π(s_t) As a function of the state cost of the old policy,

is a state cost function of the new policy. The predicted value of the strategy evaluation function is to approach the state value function continuously.

Typically, the loss function consists of the squared loss of the reward function. Considering that the intelligent agent is in a blind exploration environment state in the initial stage of algorithm training, the difference between samples is large, so that the strategy updating amplitude is too large, the strategy is easy to deviate from the correct optimization direction, and the algorithm is not converged or the updating speed is slow. Thus, the dominant function A is introduced_π(t) New policy

Value increment of pi relative to the old strategy.

Is the action state cost function of the new policy.

The method is another expression form of the dominant function, the dominant function is cut, the dominant function is limited within a certain value range, and large fluctuation is avoided. r is_tIs the reported value at the time t. The Clip function is a clipping function, epsilon is a preset fluctuation range, and Clip (1-epsilon, 1+ epsilon) represents that the limiting value range is [ 1-epsilon, 1+ epsilon ]]When the value is less than 1-epsilon, taking 1-epsilon; when the sum is larger than 1+ epsilon, 1+ epsilon is taken.

Due to the difference in the action probability distribution space between the new strategy and the old strategy, the difference is called the KL divergence between the new strategy and the old strategy. The greater the difference in the two distributions, the greater the KL divergence. The difference uses a cross-entropy loss gain s_π(s_t) To measure. Introduction of s_π(s_t) Therefore, the situation that the new strategy stays in place and falls into local optimization can be avoided.

By in a loss functionDominance function introducing clipping

Sum cross entropy loss gain s_π(s_t) The advantage function is ensured to be monotonous and not to be reduced, and the KL divergence between the old strategy and the new strategy is limited to be smaller than a certain threshold value.

Calculating a loss function, and updating network parameters by minimizing the loss function, thereby ensuring that the strategy is updated along the direction in which the value function is monotonous and not reduced, and ensuring that the change amplitude of the strategy is controllable; by limiting the KL divergence between the old and new policies to be smaller than a smaller value, it is ensured that the accumulated return value obtained by the new policy is higher than that of the old policy.

And 5: and recording and tracking the return performance of the accumulated turn in the training process, and terminating the training once the turn performance reaches a higher level and the unmanned vehicle can safely reach the end position.

And 6, after the training is ended, saving model network parameters.

In the embodiment, by introducing a clipped dominance function and a cross entropy loss gain into a loss function and combining the principle of minimizing the loss function, it is ensured that the policy is updated along the direction in which the cost function is monotonously not reduced, and at the same time, it is ensured that the policy change amplitude is controllable (i.e., the KL divergence limits the update amplitudes of the new and old policies). Therefore, the method can prevent the algorithm from generating large adjustment amplitude when encountering samples with obvious distribution difference with the previous training samples, so that the new strategy is evolved to a completely different direction, and the final strategy cannot be converged. Therefore, the KL divergence is used for limiting the updating amplitude of the new strategy and the old strategy, so that the experience learned by the past samples is not rapidly forgotten by the algorithm during updating.

In one embodiment of the present invention, as shown in fig. 5, an autonomous obstacle avoidance apparatus for an unmanned vehicle includes:

a state obtaining module 100, configured to obtain current state information, where the current state information includes current environmental state information and a current state of an unmanned vehicle;

the obstacle avoidance module 200 is configured to generate current action information through the trained obstacle avoidance network according to the current state information and the historical state information;

the triggering module 300 is configured to execute the action information, trigger to obtain next state information, update current action information according to the next state information, and cycle the process until the unmanned vehicle reaches a destination;

wherein, keep away the barrier network and adopt Actor-criticic structure, keep away the barrier module and include:

an action generating unit 210, configured to obtain fusion state information by performing a first recurrent neural network processing according to the current state information and the historical state information; predicting current action information according to the fusion state information;

a policy evaluation unit 220, configured to obtain a return value obtained by executing the current action information under the current state information; processing the current action information through the second recurrent neural network according to the return value, the fusion state information and the current action information to obtain the prediction evaluation of the current action information;

the action generating unit 210 is configured to adjust a subsequent action generating policy according to the prediction evaluation.

Specifically, the unmanned vehicle comprises an external sensor for monitoring obstacles in the surrounding environment in the movement process of the unmanned vehicle, and information such as the distance and direction of the obstacles relative to the vehicle body, namely environment state information, is obtained by analyzing time sequence point cloud data acquired by a laser radar or image data acquired by a camera. The unmanned vehicle also comprises an internal sensor for acquiring the speed, position information and the like of the unmanned vehicle; by analyzing the data collected by the internal sensors, the state (i.e., position and velocity information) of the unmanned vehicle is obtained. The current state information includes current environmental state information and a current state of the unmanned vehicle.

The motion generation network is constituted by a recurrent neural network. And inputting the current state information and the historical state information into the action generation network to obtain the fusion state information, wherein the fusion state information not only considers the current state information, but also considers the stored historical state information. And predicting action information to be taken at the current moment according to the fusion state information.

In some cases, an obstacle is detected in the history state information, but an obstacle is not detected in the current state information. If only the current state information is considered and no obstacle exists, the unmanned vehicle can be caused to meet the obstacle in the subsequent driving process; the correct treatment is to require the unmanned vehicle to timely carry out obstacle avoidance treatment by means of the memory of the previous state. Because the recurrent neural network technology is adopted, the obtained fusion state information carries historical state information, so that the influence of the historical state information can be considered when the action information is output, and the barrier can be effectively avoided.

Policy evaluation network obtains information s at current state_tLower execution action information a_tThe obtained return value is r(s)_t,a_t). Optionally, if the unmanned vehicle is in the current environmentExecuting the action information in a state, and if the collision does not occur, the return value is the distance traveled by the unmanned vehicle in unit time; if the unmanned vehicle executes the action information in the current environment state and can collide, the return value is a preset penalty value. The predetermined penalty value is negative. By designing the reward function, the action generation network can evolve to an action strategy for obtaining high reward, so that the obstacle can be effectively avoided.

In another embodiment of the present invention, as shown in fig. 5, an autonomous obstacle avoidance apparatus for an unmanned vehicle includes:

on the basis of the foregoing embodiment, the obstacle avoidance module 200 is refined, specifically:

an action generating unit 210, configured to process the current state information and the historical state information through a first fully-connected neural network to obtain precoding state information corresponding to each of the current state information and the historical state information; processing the pre-coding state information through a first long-short term memory network to obtain fusion state information; predicting current action information according to the fusion state information;

a policy evaluation unit 220, configured to obtain a return value obtained by executing the current action information under the current state information; obtaining state action fusion information according to the return value, the fusion state information and the current action information; processing the state action fusion information through a second fully-connected neural network to obtain pre-evaluation information; performing one-step attention processing on the state action fusion information and the pre-evaluation information to obtain weight-corrected state action fusion information; according to the weight corrected state action fusion information and the prediction evaluation of the historical action information, the prediction evaluation of the current action information is obtained through the processing of a second long-short term memory network;

the action generating unit 210 is further configured to adjust a subsequent action generating policy according to the prediction evaluation.

In another embodiment of the present invention, as shown in fig. 6, an autonomous obstacle avoidance apparatus for an unmanned vehicle includes:

on the basis of the embodiment shown in fig. 5, a training module 400 is added:

and the training module 400 is used for training the obstacle avoidance network through the interactive information between the environment and the unmanned vehicle, and updating the network parameters through a minimum loss function.

One specific training process is as follows:

Parameter w of_aAnd target action generating network

Parameter w'_aPolicy evaluation network

Parameter w of_vAnd target policy evaluation network

Parameter w'_v。

Preservation(s)_t,a_t,r_t,s_t+1) Into the experience playback pool R. Repeating the above process, collecting transitions of a certain time step and putting the transitions into a experience playback pool.

And 4, sampling a certain batch of samples from the playback pool by adopting a sampling mode with priority, inputting the samples into a network structure for learning, and continuously updating the network in an iterative manner until convergence.

Wherein the loss function J at time t_tCan be expressed as:

Loss of square of a return function

representing the desired estimate.

Corresponding to new strategy (use)

Representation), the squared loss of the state cost function of the old policy (represented by pi), is used to evaluate the accuracy of the policy evaluation network's generated value v. V is a state cost function, which is an expectation of accumulated return values. V_π(s_t) As a function of the state cost of the old policy,

Value increment of pi relative to the old strategy.

Is the action state cost function of the new policy.

And 6, after the training is ended, saving model network parameters.

The embodiment of the autonomous obstacle avoidance apparatus for an unmanned vehicle provided by the invention and the embodiment of the autonomous obstacle avoidance method provided by the invention are based on the same inventive concept, and can obtain the same technical effects. Therefore, other specific contents of the embodiment of the autonomous obstacle avoidance apparatus may refer to the description of the embodiment of the foregoing autonomous obstacle avoidance method.

In another embodiment of the present invention, as shown in fig. 7, an electronic device 440 includes:

including a memory 410 and a processor 420. The memory 410 is used to store a computer program 430. When the processor runs the computer program, the autonomous obstacle avoidance method of the unmanned vehicle is realized.

As an example, the processor 420 realizes the steps S200 to S410 according to the foregoing description when executing the computer program. The processor 420 implements the functions of the modules and units in the autonomous obstacle avoidance of the unmanned vehicle described above when executing the computer program. As yet another example, the processor 420, when executing the computer program, implements the functions of the state acquisition module 100, the obstacle avoidance module 200, the action generation unit 210, the policy evaluation unit 220, and the trigger module 300.

Alternatively, the computer program may be divided into one or more modules/units according to the particular needs to accomplish the invention. Each module/unit may be a series of computer program instruction segments capable of performing a particular function. The computer program instruction segment is used for describing the execution process of the computer program in autonomous obstacle avoidance of the unmanned vehicle. As an example, the computer program may be divided into modules/units in the virtual device, such as a state acquisition module, an obstacle avoidance module, an action generation unit, a policy evaluation unit, a trigger module.

The processor is used for realizing the autonomous obstacle avoidance method of the unmanned vehicle by executing the computer program. The processor may be a Central Processing Unit (CPU), Graphics Processing Unit (GPU), Digital Signal Processor (DSP), Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), general purpose processor or other logic device, etc., as desired.

The memory may be any internal storage unit and/or external storage device capable of implementing data, program storage. For example, the memory may be a plug-in hard disk, a smart card (SMC), a Secure Digital (SD) card, or a flash card. The memory is used for storing computer programs, other programs and data of the autonomous obstacle avoidance device of the unmanned vehicle.

The electronic device 440 may be any computer device, such as a desktop computer (desktop), a laptop computer (laptop), a Personal Digital Assistant (PDA), or a server (server). The electronic device 440 may further include an input/output device, a display device, a network access device, a bus, and the like, as needed. The electronic device 440 may also be a single chip computer or a computing device integrating a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU).

It will be understood by those skilled in the art that the above-mentioned units and modules for implementing the corresponding functions are divided for the purpose of convenient illustration and description, and the above-mentioned units and modules are further divided or combined according to the application requirements, that is, the internal structures of the devices/apparatuses are divided and combined again to implement the above-mentioned functions. Each unit and module in the above embodiments may be separate physical units, or two or more units and modules may be integrated into one physical unit. The units and modules in the above embodiments may implement corresponding functions by using hardware and/or software functional units. Direct coupling, indirect coupling or communication connection among a plurality of units, components and modules in the above embodiments can be realized through a bus or an interface; the coupling, connection, etc. between the multiple units or devices may be electrical, mechanical, or the like. Accordingly, the specific names of the units and modules in the above embodiments are only for convenience of description and distinction, and do not limit the scope of protection of the present application.

In one embodiment of the present invention, a computer-readable storage medium has a computer program stored thereon, and when executed by a processor, the computer program can implement the autonomous obstacle avoidance method for an unmanned vehicle as described in the foregoing embodiments. That is, when part or all of the technical solutions of the embodiments of the present invention contributing to the prior art are embodied by means of a computer software product, the computer software product is stored in a computer-readable storage medium. The computer readable storage medium can be any portable computer program code entity apparatus or device. For example, the computer readable storage medium may be a U disk, a removable magnetic disk, a magnetic diskette, an optical disk, a computer memory, a read-only memory, a random access memory, etc.

Another embodiment constructed by adopting the unmanned vehicle obstacle avoidance algorithm is applied to the TORCS simulation environment. Including a variety of racetracks, including static obstacles such as curbs, trees, and buildings. There are also moving vehicles as dynamic obstacles. The training of the network is divided into two cases, wherein the scenario is that only static obstacles are contained, and the scenario is that dynamic and static obstacles are contained.

The action generating network Actor and the strategy evaluation network criticic are both constructed through Tensiloflow. The fully connected layers of the two networks consist of 100 and 200 neurons, respectively. The output layer selects the activation function as the RELU linear activation function. The input and output of the setting algorithm are shown in tables 1 and 2 below:

TABLE 1 control algorithm input State information

TABLE 2 control Algorithm output action information

As shown in fig. 8, after training, the unmanned vehicle can reach the end point in about 15000 rounds when the unmanned vehicle is less than 1000 steps in a single round, and the training termination condition is triggered, which indicates that the unmanned vehicle has learned a better strategy, can completely run the entire track, and can repeatedly run for multiple rounds. The loss function of the algorithm converges gradually.

It should be noted that the above embodiments can be freely combined as necessary. The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims

1. An autonomous obstacle avoidance method for an unmanned vehicle is characterized by comprising the following steps:

acquiring current state information, wherein the current state information comprises current environment state information and the current state of an unmanned vehicle;

according to the current state information and the historical state information, the trained obstacle avoidance network generates current action information;

executing the current action information, repeating the process to obtain next state information, updating the current action information according to the next state information, and repeating the steps until the unmanned vehicle reaches the destination;

the obstacle avoidance network adopts an Actor-Critic structure and comprises an action generation network and a strategy evaluation network;

the action generating network is used for processing the current state information and the historical state information through a first recurrent neural network to obtain fusion state information; predicting current action information according to the fusion state information;

the policy evaluation network is configured to obtain a return value obtained by executing the current action information under the current state information; processing the current action information through the second recurrent neural network according to the return value, the fusion state information and the current action information to obtain the prediction evaluation of the current action information;

and the action generation network adjusts a subsequent action generation strategy according to the prediction evaluation.

2. The autonomous obstacle avoidance method of an unmanned vehicle according to claim 1, wherein the obtaining of the predictive evaluation of the current motion information by the processing of the second recurrent neural network according to the return value, the fusion state information, and the current motion information includes:

obtaining state action fusion information according to the return value, the fusion state information and the current action information;

processing the state action fusion information through a second fully-connected neural network to obtain pre-evaluation information;

performing one-step attention processing on the state action fusion information and the pre-evaluation information to obtain weight-corrected state action fusion information;

and according to the weight-corrected state action fusion information and the prediction evaluation of the historical action information, the prediction evaluation of the current action information is obtained through the processing of the second recurrent neural network.

3. The autonomous obstacle avoidance method of the unmanned vehicle according to claim 2, wherein the obtaining of the state-motion fusion information with corrected weight by performing one-step attention processing on the state-motion fusion information and the pre-evaluation information specifically includes:

calculating the correlation between the state action fusion information and the pre-evaluation information to obtain a correlation coefficient;

normalizing the correlation information to obtain a corresponding weight factor;

and adjusting the state action fusion information by using the weight factor to obtain the state action fusion information of weight correction.

4. The autonomous obstacle avoidance method of the unmanned vehicle according to claim 3, characterized in that:

calculating the correlation between the state action fusion information and the pre-evaluation information according to the following formula:

wherein the content of the first and second substances,

for the state action fusion information at time t,

normalizing the correlation information according to the following formula to obtain corresponding weight factors

5. The autonomous obstacle avoidance method of an unmanned vehicle according to claim 1, wherein the obtaining of the return value by executing the current action information under the current state information specifically includes:

if the action information is executed under the current state information and no collision occurs, the reported value is the distance traveled by the unmanned vehicle in unit time;

and if the action information is executed under the current state information and collision can occur, the return value is a preset penalty value.

6. The autonomous obstacle avoidance method of an unmanned vehicle of claim 1, wherein the training of the obstacle avoidance network comprises:

training an obstacle avoidance network through interactive information between the environment and the unmanned vehicle, and updating network parameters through a minimized loss function; the loss function comprises the value increment of the old strategy and the new strategy and the KL divergence between the old strategy and the new strategy; and when the KL divergence between the new strategy and the old strategy is smaller than a preset threshold and the accumulated return value based on the new strategy is higher than the accumulated return value based on the old strategy, updating the old strategy by using the new strategy.

7. The autonomous obstacle avoidance method of the unmanned vehicle according to claim 6, characterized in that:

calculating the loss function J at the time t according to the following formula_t：

Wherein the content of the first and second substances,

representing a cumulative reward function proxy objective function,

indicates the desired estimated value, A_π(t) is a merit function, r_tIs the reported value at the time t.

8. An autonomous obstacle avoidance apparatus of an unmanned vehicle, comprising:

the state acquisition module is used for acquiring current state information, wherein the current state information comprises current environment state information and the current state of the unmanned vehicle;

the obstacle avoidance module is used for generating current action information through the trained obstacle avoidance network according to the current state information and the historical state information;

the triggering module is used for executing the action information, triggering to obtain next state information, updating the current action information according to the next state information, and repeating the steps until the unmanned vehicle reaches the destination;

the action generating unit is used for processing the current state information and the historical state information through a first cyclic neural network to obtain fusion state information; predicting current action information according to the fusion state information;

the strategy evaluation unit is used for acquiring a return value obtained by executing the current action information under the current state information; processing the current action information through the second recurrent neural network according to the return value, the fusion state information and the current action information to obtain the prediction evaluation of the current action information;

and the action generating unit is used for adjusting a subsequent action generating strategy according to the prediction evaluation.

9. An electronic device, comprising:

a memory for storing a computer program;

a processor for implementing the method of autonomous obstacle avoidance of an unmanned vehicle according to any of claims 1 to 7 when running the computer program.

10. A computer-readable storage medium having stored thereon a computer program, characterized in that:

the computer program, when executed by a processor, implements the autonomous obstacle avoidance method of an unmanned vehicle of any of claims 1 to 7.