CN110956148A - Autonomous obstacle avoidance method and device for unmanned vehicle, electronic device and readable storage medium

Info

Publication number: CN110956148A
Authority: CN (China)
Prior art keywords: information, action, current, state information, state
Legal status: Granted
Application number: CN201911236281.XA
Other languages: Chinese (zh)
Other versions: CN110956148B (en)
Inventor: 宗文豪
Current Assignee: Shanghai Duomin Intelligent Technology Co ltd
Original Assignee: Shanghai Duomin Intelligent Technology Co ltd
Application filed by Shanghai Duomin Intelligent Technology Co ltd
Priority to CN201911236281.XA
Publication of CN110956148A; application granted and published as CN110956148B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention provides an autonomous obstacle avoidance method and device for an unmanned vehicle, an electronic device, and a readable storage medium. The autonomous obstacle avoidance method comprises the following steps: acquiring current state information; generating, by the obstacle avoidance network, action information with a high predicted evaluation according to the current state information and historical state information; executing the action information, and repeating the process until the destination is reached. The obstacle avoidance network comprises an action generation network and a strategy evaluation network. The former obtains fusion state information from the current state information and the historical state information and predicts current action information from the fusion state information; the latter obtains a prediction evaluation of the current action information from the return value, the fusion state information, and the current action information, and the subsequent action generation strategy is adjusted according to the prediction evaluation. By introducing a recurrent neural network and an attention mechanism into reinforcement learning, higher attention is given to past abnormal states, so that the unmanned vehicle can effectively avoid obstacles by memorizing past abnormal states.

Description

Autonomous obstacle avoidance method and device for unmanned vehicle, electronic device and readable storage medium
Technical Field
The invention relates to the field of unmanned driving, in particular to an autonomous obstacle avoidance method and device for an unmanned vehicle, electronic equipment and a readable storage medium.
Background
In unknown environments, unmanned vehicle operation requires attention to avoid any possible static and dynamic obstacles. To achieve this, the control algorithm needs to take into account a series of environmental information acquired by external sensors.
With the development of artificial intelligence, reinforcement learning methods have been tried for unmanned vehicle control. The goal of reinforcement learning is to learn optimal behavior through the agent's interaction with the environment. Reinforcement learning does not require labeled samples: its training samples come from the agent's interaction experience with the environment, so special situations present in the environment can be handled effectively without sample labeling. Meanwhile, in order to adapt to prediction in high-dimensional data spaces, large-scale deep learning is introduced on top of the reinforcement learning framework, so that the predicted action space is better suited to changeable scenes.
Autonomous obstacle avoidance of an unmanned vehicle is a partially observable Markov process that depends not only on the current state but also on preceding states, and it has strict real-time requirements. For example, the unmanned vehicle detects an obstacle a certain distance ahead at time t_n, but as the vehicle's position or pose is adjusted, the previously detected obstacle may fall into a blind area of the field of view at time t_(n+x); the unmanned vehicle then has to rely on its memory of past states to take timely control. As a concrete example, when the vehicle head is some distance from the road edge, the road edge can be seen, but as the vehicle head approaches, it gradually occludes the road edge so that the edge can no longer be seen within the field of view; at that moment the unmanned vehicle must steer in time by relying on its memory of the earlier state. However, models built on existing deep reinforcement learning algorithms such as RDPG and DDPG (deep deterministic policy gradient) perform only moderately in such scenarios, and may even fail to converge.
Disclosure of Invention
One of the objectives of the present invention is to provide an autonomous obstacle avoidance method and apparatus for an unmanned vehicle, an electronic device, and a readable storage medium, so as to overcome at least some of the disadvantages in the prior art.
The technical scheme provided by the invention is as follows:
An autonomous obstacle avoidance method for an unmanned vehicle comprises the following steps: acquiring current state information, wherein the current state information comprises current environment state information and the current state of the unmanned vehicle; generating current action information through the trained obstacle avoidance network according to the current state information and historical state information; executing the current action information, obtaining the next state information, updating the current action information according to the next state information, and repeating this process until the unmanned vehicle reaches the destination. The obstacle avoidance network adopts an Actor-Critic structure and comprises an action generation network and a strategy evaluation network. The action generation network is used for processing the current state information and the historical state information through a first recurrent neural network to obtain fusion state information, and predicting the current action information according to the fusion state information. The strategy evaluation network is used for obtaining the return value obtained by executing the current action information under the current state information, and obtaining the prediction evaluation of the current action information through the processing of a second recurrent neural network according to the return value, the fusion state information, and the current action information. The action generation network adjusts its subsequent action generation strategy according to the prediction evaluation.
Further, the obtaining of the prediction evaluation of the current action information through the processing of the second recurrent neural network according to the return value, the fusion state information, and the current action information includes: obtaining state action fusion information according to the return value, the fusion state information, and the current action information; processing the state action fusion information through a second fully-connected neural network to obtain pre-evaluation information; performing attention processing on the state action fusion information and the pre-evaluation information to obtain weight-corrected state action fusion information; and obtaining the prediction evaluation of the current action information through the processing of the second recurrent neural network according to the weight-corrected state action fusion information and the prediction evaluation of the historical action information.
Further, the obtaining of the weight-corrected state action fusion information by performing attention processing on the state action fusion information and the pre-evaluation information specifically includes: calculating the correlation between the state action fusion information and the pre-evaluation information to obtain a correlation coefficient; normalizing the correlation coefficient to obtain a corresponding weight factor; and adjusting the state action fusion information by using the weight factor to obtain the weight-corrected state action fusion information.
Further, the correlation between the state action fusion information and the pre-evaluation information is calculated according to the following formula:

e_{t,j} = w_1 · x_t + w_2 · q_j

where x_t is the state action fusion information at time t, q_j is the pre-evaluation information at time j, w_1 and w_2 are coefficients, and e_{t,j} is the correlation coefficient between the pre-evaluation information at time j and the state action fusion information at time t. The correlation coefficient is normalized according to the following formula to obtain the corresponding weight factor, where T is the number of time steps in a round:

α_{t,j} = exp(e_{t,j}) / Σ_{k=1}^{T} exp(e_{t,k})

The weight-corrected state action fusion information is obtained according to the following formula:

x'_t = Σ_{j=1}^{T} α_{t,j} · x_j
Further, the return value obtained by executing the current action information under the current state information is specifically: if the action information is executed under the current state information and no collision occurs, the return value is the distance traveled by the unmanned vehicle in unit time; and if executing the action information under the current state information causes a collision, the return value is a preset penalty value.
Further, the training of the obstacle avoidance network includes: training the obstacle avoidance network through interaction information between the environment and the unmanned vehicle, and updating network parameters by minimizing a loss function; the loss function comprises the value increment of the new strategy relative to the old strategy and the KL divergence between the old strategy and the new strategy; and when the KL divergence between the new strategy and the old strategy is smaller than a preset threshold and the accumulated return value based on the new strategy is higher than that based on the old strategy, updating the old strategy with the new strategy.
Further, the loss function J_t at time t is calculated according to the following formula:

J_t = Ê[ -L_t^CLIP + c_1 · L_t^VF - c_2 · s_π(s_t) ]

where L_t^CLIP represents the surrogate objective of the cumulative reward function, L_t^VF represents the squared loss of the return function, c_1 and c_2 are coefficients, s_π(s_t) represents the cross-entropy loss gain that encourages policy exploration, π represents a policy, Ê denotes the expected estimated value, A^π(t) is the advantage function, and r_t is the return value at time t.
The invention also provides an autonomous obstacle avoidance device for an unmanned vehicle, which comprises: a state acquisition module, configured to acquire current state information, wherein the current state information comprises current environment state information and the current state of the unmanned vehicle; an obstacle avoidance module, configured to generate current action information through the trained obstacle avoidance network according to the current state information and historical state information; and a trigger module, configured to execute the action information, trigger the acquisition of the next state information, update the current action information according to the next state information, and repeat this process until the unmanned vehicle reaches the destination. The obstacle avoidance network adopts an Actor-Critic structure, and the obstacle avoidance module comprises: an action generation unit, configured to obtain fusion state information through the processing of a first recurrent neural network according to the current state information and the historical state information, and predict the current action information according to the fusion state information; and a strategy evaluation unit, configured to obtain the return value obtained by executing the current action information under the current state information, and obtain the prediction evaluation of the current action information through the processing of a second recurrent neural network according to the return value, the fusion state information, and the current action information. The action generation unit adjusts the subsequent action generation strategy according to the prediction evaluation.
The present invention also provides an electronic device, comprising: a memory for storing a computer program; and a processor for implementing the aforementioned autonomous obstacle avoidance method for an unmanned vehicle when running the computer program.
The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the aforementioned autonomous obstacle avoidance method for an unmanned vehicle.
The autonomous obstacle avoidance method and device for the unmanned vehicle, the electronic device and the readable storage medium provided by the invention can bring the following beneficial effects:
1. By introducing a recurrent neural network and thereby a memory mechanism into the action generation network and the strategy evaluation network, the currently detected roadblock and previously detected roadblocks can be considered together and a reasonable obstacle avoidance action taken; the current prediction evaluation and previous prediction evaluations can also be considered together to generate a more appropriate evaluation output. In short, memory is added, so the obstacle avoidance network predicts its outputs more accurately.
2. The invention introduces an attention mechanism into reinforcement learning to give higher attention to past abnormal states, so that the unmanned vehicle can perform timely control by memorizing past abnormal states and effectively avoid obstacles.
3. The invention uses the KL divergence to limit the update amplitude between the new strategy and the old strategy, prevents the strategy from rapidly forgetting experience learned from past samples during updates, and keeps strategy changes smooth and controllable.
Drawings
The above features, technical features, advantages and implementations of an autonomous obstacle avoidance method and apparatus for an unmanned vehicle, an electronic device, and a readable storage medium will be further described in detail below with reference to the accompanying drawings.
FIG. 1 is a flow chart of one embodiment of an autonomous obstacle avoidance method of an unmanned vehicle of the present invention;
FIG. 2 is a flow diagram for one embodiment of step S300 in FIG. 1;
FIG. 3 is a flow chart of another embodiment of an autonomous obstacle avoidance method of an unmanned vehicle of the present invention;
FIG. 4 is a flow chart of another embodiment of step S300 in FIG. 1;
FIG. 5 is a schematic structural diagram of an embodiment of an autonomous obstacle avoidance apparatus of an unmanned vehicle according to the present invention;
FIG. 6 is a schematic structural diagram of another embodiment of the autonomous obstacle avoidance apparatus of the unmanned vehicle of the present invention;
FIG. 7 is a schematic diagram of an electronic device in accordance with one embodiment of the invention;
FIG. 8 is a test result diagram of another embodiment of the autonomous obstacle avoidance method for an unmanned vehicle according to the present invention;
the reference numbers illustrate:
100. the system comprises a state acquisition module, a 200 obstacle avoidance module, a 210 action generation unit, a 220 strategy evaluation unit, a 300 trigger module, a 400 training module, 440 electronic equipment, 410 memory, 420 processor, 430 computer program.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
For the sake of simplicity, the drawings only schematically show the parts relevant to the present invention, and they do not represent the actual structure as a product. In addition, in order to make the drawings concise and understandable, components having the same structure or function in some of the drawings are only schematically illustrated or only labeled. In this document, "one" means not only "only one" but also a case of "more than one".
In an embodiment of the present invention, as shown in fig. 1 and fig. 2, an autonomous obstacle avoidance method for an unmanned vehicle includes:
step S200, current state information is obtained, wherein the current state information comprises current environment state information and the current state of the unmanned vehicle;
step S300, according to the current state information and the historical state information, the trained obstacle avoidance network generates current action information;
the obstacle avoidance network adopts an Actor-Critic structure and comprises an action generation network and a strategy evaluation network; step 300 comprises:
an action generating network:
step 310, according to the current state information and the historical state information, obtaining fusion state information through first cyclic neural network processing;
step 320, predicting current action information according to the fusion state information;
a policy evaluation network:
step 330, obtaining a return value obtained by executing the current action information under the current state information;
step 340, obtaining a prediction evaluation of the current action information through the processing of the second recurrent neural network according to the return value, the fusion state information and the current action information;
step 350, the action generation network adjusts the subsequent action generation strategy according to the prediction evaluation;
step S400, judging whether the unmanned vehicle reaches the destination; if yes, ending;
if not, in step S410, after the current action information is executed and the next environment state is entered, jump to step S200 to obtain the next state information, update the current state information with the next state information, and update the current action information according to the updated current state information.
Specifically, the unmanned vehicle includes external sensors such as a laser radar, a camera, and the like. The external sensor is used for monitoring obstacles in the surrounding environment in the unmanned vehicle moving process, and obtaining information such as the distance and the direction of the obstacles relative to the vehicle body, namely environment state information, by analyzing time sequence point cloud data acquired by a laser radar or image data acquired by a camera. The unmanned vehicle also comprises an internal sensor for acquiring the speed, position information and the like of the unmanned vehicle; by analyzing the data collected by the internal sensors, the state (i.e., position and velocity information) of the unmanned vehicle is obtained. The current state information includes current environmental state information and a current state of the unmanned vehicle.
The unmanned vehicle also comprises an obstacle avoidance network for controlling the autonomous obstacle avoidance of the unmanned vehicle. The obstacle avoidance network adopts the Actor-Critic model structure commonly used in large-scale deep reinforcement learning, where Actor denotes the action generation network and Critic denotes the strategy evaluation network. The Actor network is used to learn a mapping a = λ(s) from the current state to the action space, where s is the current state information and a is the predicted action information. The Critic network is used to evaluate the quality of the action by combining the return value given by the environment for executing the action information in the current state, so that the whole algorithm is driven to evolve toward the maximum accumulated return value. The final goal of the overall algorithm is to obtain the maximum cumulative reward value. The accumulated reward value reflects the long-term reward obtained from the start time to the end time (e.g., reaching the destination).
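For illustration only, the sketch below gives a minimal PyTorch-style Actor-Critic pair matching this description: fully-connected pre-coding followed by an LSTM in the Actor, and an LSTM-based evaluator in the Critic. The class names, layer sizes, and the relu/tanh choices are assumptions made for the sketch, and the attention step described later is omitted here; this is not the patent's exact implementation.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Action generation network: maps current + historical states to an action."""
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.precode = nn.Linear(state_dim, hidden_dim)                 # fully-connected pre-coding
        self.rnn = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)    # first recurrent network
        self.head = nn.Linear(hidden_dim, action_dim)                   # action prediction head

    def forward(self, state_seq):
        # state_seq: (batch, time, state_dim), history plus current state
        h = torch.relu(self.precode(state_seq))
        fused, _ = self.rnn(h)                      # fusion state information per time step
        return torch.tanh(self.head(fused[:, -1]))  # action for the current step

class Critic(nn.Module):
    """Strategy evaluation network: scores the current action given fused state info."""
    def __init__(self, fused_dim, action_dim, hidden_dim=128):
        super().__init__()
        # input: fusion state info + action info + return value (1 scalar)
        self.fc = nn.Linear(fused_dim + action_dim + 1, hidden_dim)
        self.rnn = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)    # second recurrent network
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, fused_seq, action_seq, reward_seq):
        # fused_seq: (batch, time, fused_dim), action_seq: (batch, time, action_dim),
        # reward_seq: (batch, time, 1)
        x = torch.cat([fused_seq, action_seq, reward_seq], dim=-1)
        h = torch.relu(self.fc(x))
        out, _ = self.rnn(h)
        return self.value_head(out[:, -1])           # predicted evaluation of the current action
```

In such a sketch the Actor's fused per-step outputs would also be passed to the Critic as the fusion state information described above.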
The cumulative reward value β is calculated as follows:

β = E[ Σ_t γ^t · r(s_t, a_t) ]

where γ is the attenuation (discount) factor, r(s_t, a_t) is the reward value at time t (also called the reward function at time t), defined as the reward obtained for a specific action in a specific state, s_t is the state information at time t, a_t is the action information at time t, and E represents the expectation function.
The action generating network is composed of a recurrent neural network, such as a unidirectional LSTM (long short term memory network), or a bidirectional LSTM. The current state information and the historical state information are input into the action generating network to obtain the fusion state information, and the fusion state information not only considers the current state information, but also considers the stored historical state information (namely the state information before the current state information). And predicting action information to be taken at the current moment according to the fusion state information. The action information includes accelerator pedal information, brake pedal information, gear information, steering information, and the like. And controlling the driving of the unmanned vehicle through the action information.
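As a purely illustrative aid, the action information listed above could be held in a small structure such as the following sketch; the field names and value ranges are assumptions and do not come from the patent.

```python
from dataclasses import dataclass

@dataclass
class ActionInfo:
    """Action information used to control the driving of the unmanned vehicle."""
    throttle: float   # accelerator pedal, e.g. 0.0 to 1.0 (assumed range)
    brake: float      # brake pedal, e.g. 0.0 to 1.0 (assumed range)
    gear: int         # gear selection
    steering: float   # steering, e.g. -1.0 (full left) to 1.0 (full right)
```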
Sometimes an obstacle is detected in the historical state information but not in the current state information. For example, at time t_n the radar point cloud detects a roadblock a certain distance ahead of the unmanned vehicle; the vehicle cannot turn at that moment because there are obstacles at close range on its left and right, and it does not need to turn until time t_(n+x). However, with the pose adjustment of the unmanned vehicle, the previously detected roadblock may by then lie in a blind area of the field of view and is no longer detected at time t_(n+x). If only the current state information were considered, in which no obstacle appears, the unmanned vehicle would run into the obstacle during subsequent driving; the correct handling requires the unmanned vehicle to perform obstacle avoidance in time by relying on its memory of previous states. Because recurrent neural network technology is adopted, the obtained fusion state information carries historical state information, so the influence of the historical state information is taken into account when the action information is output, and the obstacle can be effectively avoided.
Optionally, processing the current state information and the historical state information through a first fully-connected neural network to obtain precoding state information corresponding to each; processing the pre-coding state information through a first cyclic neural network to obtain fusion state information; and generating corresponding action information according to the fusion state information. The fully-connected neural network is an artificial neural network consisting of a plurality of layers of neurons, and the circulating neural network can adopt a unidirectional LSTM network.
The strategy evaluation network obtains the return value r(s_t, a_t) obtained by executing action information a_t under the current state information s_t. Optionally, if the unmanned vehicle executes the action information in the current environment state and no collision occurs, the return value is the distance traveled by the unmanned vehicle in unit time; if executing the action information in the current environment state would cause a collision, the return value is a preset penalty value. The preset penalty value is negative. Through this design of the reward function, the action generation network evolves toward action strategies that obtain high rewards, so obstacles can be effectively avoided.
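A minimal sketch of this reward design is shown below; the distance-per-unit-time reward and the negative preset penalty follow the description above, while the function name and the numeric penalty value are assumptions.

```python
COLLISION_PENALTY = -50.0  # preset penalty value; magnitude is an assumption, must be negative

def reward(distance_per_step: float, collision: bool) -> float:
    """Return value r(s_t, a_t) as described above.

    If executing the action in the current state causes no collision, the return
    value is the distance travelled by the unmanned vehicle in unit time;
    otherwise it is the preset (negative) penalty value.
    """
    if collision:
        return COLLISION_PENALTY
    return distance_per_step
```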
And the strategy evaluation network carries out prediction evaluation on the current action information through the processing of the second recurrent neural network according to the return value at the current time, the fusion state information and the stored evaluation of the past action, and predicts the possible obtained accumulated return value.
The action generating network adjusts the subsequent action generating strategy according to the prediction evaluation. For example, at time t, action a in state s receives a high predictive rating, and the same or similar state s1 is encountered in the future, encouraging the generation of a similar action a.
And judging whether the unmanned vehicle reaches the destination. If not, executing the action information, reaching the next environment state, jumping to the step S200, acquiring the next state information, updating the current action information according to the next state information, and repeating the steps until the unmanned vehicle reaches the destination.
In this embodiment, by introducing a recurrent neural network and thus a memory mechanism into the action generation network, the currently detected roadblock and previously detected roadblocks can be considered together, and a more reasonable obstacle avoidance action can be taken; by introducing a recurrent neural network and a memory mechanism into the strategy evaluation network, the current prediction evaluation and previous prediction evaluations can be considered together to generate a more appropriate evaluation output. In short, memory is added, so the obstacle avoidance network predicts its outputs more accurately.
In another embodiment of the present invention, as shown in fig. 1 and 4, an autonomous obstacle avoidance method for an unmanned vehicle includes:
on the basis of the foregoing embodiment, as shown in fig. 4, step S300 includes:
step S311, the current state information and the historical state information are processed by a first fully-connected neural network to obtain the precoding state information corresponding to each;
step S312, the precoding state information is processed by a first long-short term memory network to obtain fusion state information;
step S321, predicting current action information according to the fusion state information;
step S331 of acquiring a return value obtained by executing the current action information under the current state information;
step S341, obtaining state action fusion information according to the return value, the fusion state information and the current action information;
step S342, according to the state action fusion information, pre-evaluation information is obtained through processing of a second full-connection neural network;
step S343, the state action fusion information and the pre-evaluation information are subjected to attention processing to obtain weight-corrected state action fusion information;
step S344, according to the predicted evaluation of the state action fusion information and the historical action information corrected by the weight, the predicted evaluation of the current action information is obtained through the processing of a second long-short term memory network;
step S351, the action generation network adjusts the subsequent action generation strategy according to the prediction evaluation.
Specifically, a fully-connected neural network is an artificial neural network composed of multiple layers of multiple neurons. The network has a memory mechanism by introducing a long-term and short-term memory network on the basis of a fully-connected neural network.
The action generating network includes a first fully-connected neural network and a first recurrent neural network. The first fully-connected network firstly completes pre-coding on the input environment state information and mines the relation between shallow states, but does not have a time sequence relation. The first long short term memory network (in this embodiment, a unidirectional long short term memory network) is then used to fit the pre-coded state information to the fused state information mapping. The loop layer formed by the first long-short term memory network allows the fused state information to encode an implicit representation with the past time step state information (i.e., historical state information).
Because the dependency that the long-short term memory network establishes over time-series samples gradually decays as the time interval grows, an obstacle detected in a history state long ago may be ignored by the unmanned vehicle as its pose changes during obstacle avoidance. To solve this problem, variable-weight attention is applied to the state information of different time steps: an attention mechanism is introduced into the strategy evaluation network to obtain weight-corrected state information, so that once the environment state of a time step is abnormal, the weight of that time step's state in the predicted return output by the strategy evaluation is increased.
The strategy evaluation network Critic comprises a second fully-connected neural network, an attention processing step, and a second recurrent neural network.
The strategy evaluation network first obtains the return value obtained by executing the current action information under the current state information, and obtains state action fusion information from the return value, the fusion state information, and the current action information; for example, the return value, the fusion state information, and the current action information are concatenated to obtain the state action fusion information. The state action fusion information is processed through the second fully-connected neural network to obtain pre-evaluation information. The state action fusion information and the pre-evaluation information are then subjected to attention processing to obtain weight-corrected state action fusion information, from which the weight-corrected prediction evaluation information is further obtained.
Assuming that a round counts T time steps, the state action fusion information of the t-th step (t ∈ (1, T), i.e., time t) is recorded as x_t, and the pre-evaluation information of the t-th step, obtained by processing x_t through the second fully-connected neural network, is recorded as q_t. The weight-corrected state action fusion information x'_t of the t-th step is obtained through the following attention processing:

1. Calculate the correlation between the pre-evaluation information q_j of the j-th step and the state action fusion information x_t of the t-th step to obtain the correlation coefficient e_{t,j}:

e_{t,j} = w_1 · x_t + w_2 · q_j

where x_t is the state action fusion information at time t, q_j is the pre-evaluation information at time j, w_1 and w_2 are coefficients, and e_{t,j} denotes the correlation coefficient between the pre-evaluation information at time j and the state action fusion information at time t.

2. Normalize the correlation coefficients with a normalized exponential function (softmax function) to obtain the corresponding weight factors α_{t,j}:

α_{t,j} = exp(e_{t,j}) / Σ_{k=1}^{T} exp(e_{t,k})

3. Calculate the weight-corrected state action fusion information according to the following formula:

x'_t = Σ_{j=1}^{T} α_{t,j} · x_j

The weight-corrected state action fusion information x'_t and the prediction evaluation of the historical action information are then processed by the second long-short term memory network to obtain the weight-corrected prediction evaluation information v_t.
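The attention step above could be prototyped roughly as in the following sketch; the additive scoring form, the scalar reduction, and the tensor layout are assumptions chosen to mirror the formulas, not a verbatim reproduction of the patent's implementation.

```python
import torch

def attention_correct(x, q, w1, w2):
    """Weight-corrected state action fusion information.

    x:  (T, d) state action fusion information for an episode of T steps
    q:  (T, d) pre-evaluation information from the second fully-connected network
    w1, w2: coefficients (scalars in this sketch)
    Returns x_hat: (T, d), where step t is a weighted combination of all steps j,
    with weights given by a softmax over the correlation coefficients e[t, j].
    """
    # e[t, j] = w1 * x_t + w2 * q_j, reduced to one scalar score per (t, j) pair
    e = (w1 * x).sum(-1, keepdim=True) + (w2 * q).sum(-1).unsqueeze(0)  # (T, T)
    alpha = torch.softmax(e, dim=1)   # normalize over j to get weight factors
    x_hat = alpha @ x                 # weighted combination of the fusion information
    return x_hat
```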
Because the memory duration of the recurrent neural network is limited and state memory decays more strongly over longer intervals, this embodiment introduces an attention mechanism into the strategy evaluation network to give higher attention to abnormal sensing states that have occurred, further improving the memory of the whole system and thereby improving the accuracy of the strategy evaluation.
In another embodiment of the present invention, as shown in fig. 3, an autonomous obstacle avoidance method for an unmanned vehicle includes:
on the basis of the embodiments shown in fig. 1 and 4, the following steps are added:
step S100, an obstacle avoidance network is trained through interactive information between the environment and the unmanned vehicle, and network parameters are updated through a minimum loss function.
Firstly, a plurality of training samples are generated from the interaction information between the unmanned vehicle and the training environment and stored in the experience playback pool. Several sample slices are then extracted from the experience playback pool and input, as a series of environment perception states, into the action generation network and the strategy evaluation network to be trained. Preferably, samples with higher reward values (e.g., greater than a predetermined threshold) are sampled preferentially, because such samples have higher learning value. The action generation network generates an action a according to the input environment state s and the preset action space, and the strategy evaluation network obtains an evaluation value v of the action according to the action a and the environment state s. Finally, the action a with the largest return is selected as the action actually executed, and this procedure is repeated until the Actor and Critic networks perform stably and finally converge; the trained parameters are then used to complete adaptive obstacle avoidance of the unmanned vehicle in new scenes.
One specific training process is as follows:
step 1, initializing and setting an unmanned vehicle simulation experiment environment, and determining state sensing information and action space information. For example, the horizontal and longitudinal speeds of the unmanned vehicle, the laser point cloud, the radar image and the placing positions of surrounding obstacles, and the motion track of the dynamic obstacle are initialized; and (4) defining the destination position reached by the unmanned vehicle.
Step 2, initializing parameters of the action generating network, parameters of the strategy evaluating network, parameters of the target action generating network, parameters of the target strategy evaluating network, and the experience playback pool R.
The network scale of the action generation network and the strategy evaluation network is designed according to the complexity of the environment state information, for example, the number of neurons in each fully-connected hidden layer, the number of neural network layers, the number of sub-units of the recurrent (loop) layer, the maximum number of rounds of strategy iteration, and so on. The higher the dimensionality of the environment state information, the larger the recommended network scale.
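By way of example only, such scale settings might be collected in a configuration like the sketch below; every value shown is an assumption and should be tuned to the dimensionality of the environment state information.

```python
# Illustrative network-scale settings only; all numbers are assumptions.
config = {
    "fc_hidden_units": 256,    # neurons per fully-connected hidden layer
    "fc_layers": 2,            # number of fully-connected layers
    "lstm_units": 128,         # sub-units of the recurrent (loop) layer
    "max_iterations": 10_000,  # maximum number of strategy-iteration rounds
}
```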
In order to facilitate recording and updating of the algorithm during training, a new model and an old model are respectively set for the action generation network and the strategy evaluation network, and each model is allocated a corresponding parameter space: the action generation network with parameters w_a and the target action generation network with parameters w'_a; the strategy evaluation network with parameters w_v and the target strategy evaluation network with parameters w'_v.
Step 3, start the simulation, generate a plurality of training samples according to the interaction information between the unmanned vehicle and the training environment during adaptive driving, record each training sample in the form of a transition, and store the training samples in the experience playback pool.
A transition comprises a 4-tuple: the state s_t of the current time step, the action a_t of this time step, the return value r_t obtained by executing the action at this time step, and the state s_{t+1} of the next time step.

Specifically, the current state s_t is received; an action a_t is selected in the preset action space according to the current strategy; the action a_t is executed, yielding the return value r_t and a new state s_{t+1}. The tuple (s_t, a_t, r_t, s_{t+1}) is saved into the experience playback pool R.

The above process is repeated, and transitions covering a certain number of time steps are collected and put into the experience playback pool.
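A minimal experience playback pool holding such 4-tuples might look like the sketch below; the class name, capacity, and the simple reward-threshold rule used to prefer high-reward samples are assumptions for illustration.

```python
import random
from collections import deque

class ReplayPool:
    """Experience playback pool R storing transitions (s_t, a_t, r_t, s_{t+1})."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s_t, a_t, r_t, s_next):
        self.buffer.append((s_t, a_t, r_t, s_next))

    def sample(self, batch_size, reward_threshold=None):
        # Optionally prefer high-reward transitions, as suggested above.
        pool = list(self.buffer)
        if reward_threshold is not None:
            preferred = [tr for tr in pool if tr[2] > reward_threshold]
            if len(preferred) >= batch_size:
                pool = preferred
        return random.sample(pool, min(batch_size, len(pool)))
```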
Step 4, sampling with priority from the experience playback pool to obtain a plurality of sample slices; and training the obstacle avoidance network by using the plurality of sample slices, and continuously and iteratively updating the network until convergence.
Specifically, several sample slices are extracted from the experience playback pool, and the loss function is calculated from the sample slices. The parameters of the strategy evaluation network are updated by minimizing the loss function, and the parameters of the action generation network are updated using the policy gradient computed from the samples. The parameters of the target action generation network are then updated according to the updated action generation network parameters, and the parameters of the target strategy evaluation network are updated according to the updated strategy evaluation network parameters.
The loss function J_t at time t can be expressed as:

J_t = Ê[ -L_t^CLIP + c_1 · L_t^VF - c_2 · s_π(s_t) ]

The loss function comprises three terms in total: the cumulative reward surrogate objective function L_t^CLIP, the squared loss of the return function L_t^VF, and the cross-entropy loss gain s_π(s_t) that encourages policy exploration. c_1 and c_2 are coefficients, and Ê represents the expected estimate.

L_t^VF corresponds to the squared loss between the state value function of the new strategy (denoted π') and that of the old strategy (denoted π), and is used to evaluate the accuracy of the value v generated by the strategy evaluation network. V is the state value function, i.e., the expectation of the accumulated return value; V^π(s_t) is the state value function of the old strategy, and V^{π'}(s_t) is the state value function of the new strategy. The predicted value of the strategy evaluation function should continuously approach the state value function:

L_t^VF = ( V^{π'}(s_t) - V^π(s_t) )²
Typically, the loss function consists only of the squared loss of the return function. However, in the initial stage of algorithm training the agent explores the environment blindly and the differences between samples are large, so the strategy update amplitude becomes too large, the strategy easily deviates from the correct optimization direction, and the algorithm fails to converge or updates slowly. Therefore, the advantage function A^π(t) is introduced, representing the value increment of the new strategy π' relative to the old strategy π, where Q^{π'}(s_t, a_t) is the action state value function of the new strategy:

A^π(t) = Q^{π'}(s_t, a_t) - V^π(s_t)

A clipped form of the advantage function is also used, in which the advantage is limited to a certain value range to avoid large fluctuations. r_t is the return value at time t. Clip is a clipping function, ε is a preset fluctuation range, and Clip(·, 1-ε, 1+ε) limits the value to the range [1-ε, 1+ε]: when the value is less than 1-ε, 1-ε is taken; when it is greater than 1+ε, 1+ε is taken.
The difference between the action probability distributions of the new strategy and the old strategy is called the KL divergence between the new and old strategies; the greater the difference between the two distributions, the greater the KL divergence. This difference is measured using the cross-entropy loss gain s_π(s_t). Introducing s_π(s_t) prevents the new strategy from standing still and falling into a local optimum.
By introducing the clipped advantage function and the cross-entropy loss gain s_π(s_t) into the loss function, the advantage function is ensured to be monotonically non-decreasing, and the KL divergence between the old strategy and the new strategy is limited to be smaller than a certain threshold.
The loss function is calculated and the network parameters are updated by minimizing it, which ensures that the strategy is updated along the direction in which the value function is monotonically non-decreasing and that the amplitude of strategy changes is controllable; by limiting the KL divergence between the old and new strategies to a small value, the accumulated return value obtained by the new strategy is ensured to be higher than that of the old strategy.
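The sketch below shows one common way to assemble such a loss, following the standard PPO-style combination of a clipped surrogate objective, a squared value loss, and an exploration (entropy) bonus; the exact formulation, signs, and coefficients used in the patent may differ, and all variable names here are assumptions.

```python
import torch

def ppo_style_loss(ratio, advantage, value_new, value_old, entropy,
                   eps=0.2, c1=0.5, c2=0.01):
    """Loss combining a clipped surrogate objective, a squared value loss,
    and an exploration bonus; it is minimized to update the networks.

    ratio:     new-policy probability over old-policy probability for the taken action
    advantage: advantage estimate A(t)
    """
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)                 # limit update amplitude
    l_clip = torch.min(ratio * advantage, clipped * advantage).mean()
    l_vf = ((value_new - value_old) ** 2).mean()                   # squared value loss
    return -l_clip + c1 * l_vf - c2 * entropy.mean()               # minimize this quantity
```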
Step 5, record and track the accumulated round return performance during training, and terminate training once the round performance reaches a high level and the unmanned vehicle can safely reach the end position.
Step 6, after training ends, save the model network parameters.
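Putting steps 3 to 5 together, a heavily simplified training loop might look like the sketch below; the environment interface (env.reset, env.step), the agent methods, the update function, and the stopping rule are assumptions made for illustration.

```python
def train(env, actor, critic, pool, update_fn, return_target,
          n_episodes=1000, batch_size=64):
    """Simplified loop: collect transitions, store them, sample, and update."""
    for episode in range(n_episodes):
        s = env.reset()
        done, episode_return = False, 0.0
        while not done:
            a = actor.select_action(s)        # action from the current strategy
            s_next, r, done = env.step(a)     # execute the action, observe the return value
            pool.add(s, a, r, s_next)         # step 3: store the transition
            s, episode_return = s_next, episode_return + r
        batch = pool.sample(batch_size)       # step 4: sample slices (with priority)
        update_fn(actor, critic, batch)       # minimize the loss / apply the policy gradient
        if episode_return > return_target:    # step 5: stop once round performance is high
            return
```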
In this embodiment, by introducing a clipped advantage function and a cross-entropy loss gain into the loss function and minimizing that loss function, the strategy is ensured to be updated along the direction in which the value function is monotonically non-decreasing, while the amplitude of strategy changes remains controllable (i.e., the KL divergence limits the update amplitude between the new and old strategies). This prevents the algorithm from making large adjustments when it encounters samples whose distribution differs markedly from previous training samples, which would drive the new strategy in a completely different direction and prevent the final strategy from converging. In this way, the KL divergence is used to limit the update amplitude between the new and old strategies, so that the algorithm does not rapidly forget the experience learned from past samples during updates.
In one embodiment of the present invention, as shown in fig. 5, an autonomous obstacle avoidance apparatus for an unmanned vehicle includes:
a state obtaining module 100, configured to obtain current state information, where the current state information includes current environmental state information and a current state of an unmanned vehicle;
the obstacle avoidance module 200 is configured to generate current action information through the trained obstacle avoidance network according to the current state information and the historical state information;
the triggering module 300 is configured to execute the action information, trigger to obtain next state information, update current action information according to the next state information, and cycle the process until the unmanned vehicle reaches a destination;
wherein the obstacle avoidance network adopts an Actor-Critic structure, and the obstacle avoidance module comprises:
an action generating unit 210, configured to obtain fusion state information by performing a first recurrent neural network processing according to the current state information and the historical state information; predicting current action information according to the fusion state information;
a policy evaluation unit 220, configured to obtain a return value obtained by executing the current action information under the current state information; processing the current action information through the second recurrent neural network according to the return value, the fusion state information and the current action information to obtain the prediction evaluation of the current action information;
the action generating unit 210 is configured to adjust a subsequent action generating policy according to the prediction evaluation.
Specifically, the unmanned vehicle comprises an external sensor for monitoring obstacles in the surrounding environment in the movement process of the unmanned vehicle, and information such as the distance and direction of the obstacles relative to the vehicle body, namely environment state information, is obtained by analyzing time sequence point cloud data acquired by a laser radar or image data acquired by a camera. The unmanned vehicle also comprises an internal sensor for acquiring the speed, position information and the like of the unmanned vehicle; by analyzing the data collected by the internal sensors, the state (i.e., position and velocity information) of the unmanned vehicle is obtained. The current state information includes current environmental state information and a current state of the unmanned vehicle.
The unmanned vehicle also comprises an obstacle avoidance network for controlling the autonomous obstacle avoidance of the unmanned vehicle. The obstacle avoidance network adopts the Actor-Critic model structure commonly used in large-scale deep reinforcement learning, where Actor denotes the action generation network and Critic denotes the strategy evaluation network. The Actor network is used to learn a mapping a = λ(s) from the current state to the action space, where s is the current state information and a is the predicted action information. The Critic network is used to evaluate the quality of the action by combining the return value given by the environment for executing the action information in the current state, so that the whole algorithm is driven to evolve toward the maximum accumulated return value. The final goal of the overall algorithm is to obtain the maximum cumulative reward value. The accumulated reward value reflects the long-term reward obtained from the start time to the end time (e.g., reaching the destination).
The motion generation network is constituted by a recurrent neural network. And inputting the current state information and the historical state information into the action generation network to obtain the fusion state information, wherein the fusion state information not only considers the current state information, but also considers the stored historical state information. And predicting action information to be taken at the current moment according to the fusion state information.
In some cases, an obstacle is detected in the history state information, but an obstacle is not detected in the current state information. If only the current state information is considered and no obstacle exists, the unmanned vehicle can be caused to meet the obstacle in the subsequent driving process; the correct treatment is to require the unmanned vehicle to timely carry out obstacle avoidance treatment by means of the memory of the previous state. Because the recurrent neural network technology is adopted, the obtained fusion state information carries historical state information, so that the influence of the historical state information can be considered when the action information is output, and the barrier can be effectively avoided.
Optionally, processing the current state information and the historical state information through a first fully-connected neural network to obtain precoding state information corresponding to each; processing the pre-coding state information through a first cyclic neural network to obtain fusion state information; and generating corresponding action information according to the fusion state information. The fully-connected neural network is an artificial neural network consisting of a plurality of layers of neurons, and the circulating neural network can adopt a unidirectional LSTM network.
The strategy evaluation network obtains the return value r(s_t, a_t) obtained by executing action information a_t under the current state information s_t. Optionally, if the unmanned vehicle executes the action information in the current environment state and no collision occurs, the return value is the distance traveled by the unmanned vehicle in unit time; if executing the action information in the current environment state would cause a collision, the return value is a preset penalty value. The preset penalty value is negative. Through this design of the reward function, the action generation network evolves toward action strategies that obtain high rewards, so obstacles can be effectively avoided.
And the strategy evaluation network carries out prediction evaluation on the current action information through the processing of the second recurrent neural network according to the return value at the current time, the fusion state information and the stored evaluation of the past action, and predicts the possible obtained accumulated return value.
The action generating network adjusts the subsequent action generating strategy according to the prediction evaluation. For example, at time t, action a in state s receives a high predictive rating, and the same or similar state s1 is encountered in the future, encouraging the generation of a similar action a.
And judging whether the unmanned vehicle reaches the destination. If not, executing the action information, reaching the next environment state, jumping to the step S200, acquiring the next state information, updating the current action information according to the next state information, and repeating the steps until the unmanned vehicle reaches the destination.
In this embodiment, by introducing a recurrent neural network and thus a memory mechanism into the action generation network, the currently detected roadblock and previously detected roadblocks can be considered together, and a more reasonable obstacle avoidance action can be taken; by introducing a recurrent neural network and a memory mechanism into the strategy evaluation network, the current prediction evaluation and previous prediction evaluations can be considered together to generate a more appropriate evaluation output. In short, memory is added, so the obstacle avoidance network predicts its outputs more accurately.
In another embodiment of the present invention, as shown in fig. 5, an autonomous obstacle avoidance apparatus for an unmanned vehicle includes:
on the basis of the foregoing embodiment, the obstacle avoidance module 200 is refined, specifically:
an action generating unit 210, configured to process the current state information and the historical state information through a first fully-connected neural network to obtain precoding state information corresponding to each of the current state information and the historical state information; processing the pre-coding state information through a first long-short term memory network to obtain fusion state information; predicting current action information according to the fusion state information;
a policy evaluation unit 220, configured to obtain a return value obtained by executing the current action information under the current state information; obtaining state action fusion information according to the return value, the fusion state information and the current action information; processing the state action fusion information through a second fully-connected neural network to obtain pre-evaluation information; performing one-step attention processing on the state action fusion information and the pre-evaluation information to obtain weight-corrected state action fusion information; according to the weight corrected state action fusion information and the prediction evaluation of the historical action information, the prediction evaluation of the current action information is obtained through the processing of a second long-short term memory network;
the action generating unit 210 is further configured to adjust a subsequent action generating policy according to the prediction evaluation.
Specifically, a fully-connected neural network is an artificial neural network composed of multiple layers of neurons. By adding a long short-term memory (LSTM) network on top of the fully-connected neural network, the network acquires a memory mechanism.
The action generation network includes a first fully-connected neural network and a first recurrent neural network. The first fully-connected network first pre-codes the input environment state information and mines the shallow relations between states, but captures no temporal relations. The first long short-term memory network (in this embodiment, a unidirectional LSTM) is then used to fit the mapping from the pre-coding state information to the fusion state information. The recurrent layer formed by the first LSTM allows the fusion state information to encode an implicit representation together with the state information of past time steps (i.e., the historical state information).
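A minimal sketch of this structure in TensorFlow/Keras is given below, assuming one fixed-size state vector per time step; the layer sizes, activations, and function name are illustrative assumptions rather than the patent's actual configuration:

import tensorflow as tf

def build_action_generation_network(state_dim, action_dim,
                                    precode_dim=100, lstm_units=64):
    # Sequence of past and current state vectors (variable length).
    states = tf.keras.Input(shape=(None, state_dim))
    # First fully-connected network: per-time-step pre-coding of the states.
    precoded = tf.keras.layers.Dense(precode_dim, activation="relu")(states)
    # First recurrent network: a unidirectional LSTM fuses the pre-coded states.
    fused = tf.keras.layers.LSTM(lstm_units)(precoded)
    # Action head: maps the fusion state information to action information.
    actions = tf.keras.layers.Dense(action_dim, activation="tanh")(fused)
    return tf.keras.Model(states, actions, name="actor")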
Because the dependency that a long short-term memory network establishes across time-series samples decays gradually as the time interval grows, an obstacle detected in a historical state far in the past may be ignored by the unmanned vehicle as its pose changes during obstacle avoidance. To solve this problem, variable-weight attention is applied to the state information of different time steps: an attention mechanism is introduced into the policy evaluation network to obtain weight-corrected state information, so that once an abnormal environment state occurs, the weight of that time step's state in the predicted return output by the policy evaluation is increased.
The policy evaluation network Critic comprises a second fully-connected neural network, a one-step attention operation, and a second recurrent neural network.
The policy evaluation network first obtains the return value obtained by executing the current action information under the current state information, and then derives the state action fusion information from the return value, the fusion state information, and the current action information; for example, the return value, the fusion state information, and the current action information are concatenated to obtain the state action fusion information. The state action fusion information is processed through the second fully-connected neural network to obtain pre-evaluation information. One-step attention processing is then performed on the state action fusion information and the pre-evaluation information to obtain the weight-corrected state action fusion information, from which the weight-corrected predictive evaluation information is obtained by a further layer of processing.
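A sketch of the Critic pipeline just described, again in TensorFlow/Keras; the use of tf.keras.layers.Attention for the one-step attention step and all layer sizes are assumptions made for illustration only:

import tensorflow as tf

def build_policy_evaluation_network(fusion_dim, action_dim,
                                    hidden=200, lstm_units=64):
    # Sequence of state action fusion vectors: [return value, fusion state, action].
    sa_fusion = tf.keras.Input(shape=(None, 1 + fusion_dim + action_dim))
    # Second fully-connected network: pre-evaluation information.
    pre_eval = tf.keras.layers.Dense(hidden, activation="relu")(sa_fusion)
    # One-step attention: correlate the pre-evaluation with the fused sequence
    # and re-weight the state action fusion information.
    key = tf.keras.layers.Dense(hidden)(sa_fusion)
    weighted = tf.keras.layers.Attention()([pre_eval, sa_fusion, key])
    # Second recurrent network: predictive evaluation over the weighted sequence.
    evaluation = tf.keras.layers.LSTM(lstm_units)(weighted)
    value = tf.keras.layers.Dense(1)(evaluation)   # predicted cumulative return
    return tf.keras.Model(sa_fusion, value, name="critic")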
Because the memory duration of a recurrent neural network is limited and the memory of states further back in time decays more strongly, this embodiment introduces an attention mechanism into the policy evaluation network to give higher attention to abnormal sensing states that have occurred, which further improves the memory of the whole system and thereby the accuracy of the policy evaluation.
In another embodiment of the present invention, as shown in fig. 6, an autonomous obstacle avoidance apparatus for an unmanned vehicle includes:
on the basis of the embodiment shown in fig. 5, a training module 400 is added:
and the training module 400 is used for training the obstacle avoidance network through the interactive information between the environment and the unmanned vehicle, and updating the network parameters through a minimum loss function.
First, a number of training samples are generated from the interaction information between the unmanned vehicle and the training environment and stored in the experience replay pool. Several sample slices are then extracted from the experience replay pool and fed, as a sequence of environment perception states, into the action generation network and policy evaluation network to be trained. Preferably, samples with higher reward values (e.g., greater than a predetermined threshold) are sampled first, because such samples have higher learning value. The action generation network generates an action a from the input environment state s and the preset action space, and the policy evaluation network produces an evaluation value v for that action from the action a and the environment state s. Finally, the action a with the largest return is selected as the action actually executed, and this cycle repeats until the Actor and Critic networks perform stably and finally converge; the trained parameters are then used to complete adaptive obstacle avoidance of the unmanned vehicle in new scenes.
One specific training process is as follows:
step 1, initializing and setting an unmanned vehicle simulation experiment environment, and determining state sensing information and action space information. For example, the horizontal and longitudinal speeds of the unmanned vehicle, the laser point cloud, the radar image and the placing positions of surrounding obstacles, and the motion track of the dynamic obstacle are initialized; and (4) defining the destination position reached by the unmanned vehicle.
Step 2: initialize the parameters of the action generation network, the parameters of the policy evaluation network, the parameters of the target action generation network, the parameters of the target policy evaluation network, and the experience replay pool R.
The network scale of the action generation network and the policy evaluation network is designed according to the complexity of the environment state information: for example, the number of hidden neurons in each fully-connected layer, the number of neural network layers, the number of sub-units of the recurrent layer, the maximum number of rounds of policy iteration, and so on. The higher the dimensionality of the environment state information, the larger the recommended network scale.
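One possible way to express such scale settings is a small configuration helper; every number below is an illustrative assumption, not a value from the patent:

def network_config(state_dim):
    # Larger state dimensionality suggests a larger network scale.
    scale = 2 if state_dim > 64 else 1
    return {
        "fc_hidden_units": 100 * scale,          # neurons per fully-connected layer
        "fc_layers": 2 * scale,                  # number of fully-connected layers
        "lstm_units": 64 * scale,                # sub-units of the recurrent layer
        "max_policy_iterations": 10000 * scale,  # maximum rounds of policy iteration
    }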
To facilitate recording and updating of the algorithm during training, a new model and an old model are set up for the action generation network and the policy evaluation network respectively, and each model is allocated a corresponding parameter space: the parameters w_a of the action generation network, the parameters w'_a of the target action generation network, the parameters w_v of the policy evaluation network, and the parameters w'_v of the target policy evaluation network.
Step 3: start the simulation, generate a number of training samples from the interaction information between the unmanned vehicle and the training environment during adaptive driving, record each training sample in the form of a transition, and store the training samples in the experience replay pool.
A transition is a 4-tuple consisting of the state s_t of the current time step, the action a_t taken at this time step, the return value r_t obtained by executing that action, and the state s_{t+1} of the next time step.
The tuple (s_t, a_t, r_t, s_{t+1}) is stored in the experience replay pool R. The above process is repeated, and the transitions collected over a certain number of time steps are placed into the experience replay pool.
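A small sketch of the transition record and of a replay pool that samples high-reward transitions first; the threshold rule is an illustrative stand-in for the priority sampling described in step 4, and the capacity and threshold values are assumptions:

import random
from collections import namedtuple

# One transition records (s_t, a_t, r_t, s_{t+1}) for a single time step.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

class ReplayPool:
    def __init__(self, capacity=100000, reward_threshold=0.0):
        self.buffer = []
        self.capacity = capacity
        self.reward_threshold = reward_threshold

    def add(self, transition):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)               # discard the oldest transition
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Prefer transitions whose return value exceeds the threshold.
        high = [t for t in self.buffer if t.reward > self.reward_threshold]
        pool = high if len(high) >= batch_size else self.buffer
        return random.sample(pool, min(batch_size, len(pool)))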
Step 4: sample a batch of samples from the replay pool using a priority-based sampling scheme, input them into the network structure for learning, and iteratively update the network until convergence.
Specifically, several sample slices are extracted from the experience replay pool and a loss function is calculated from them. The parameters of the policy evaluation network are updated by minimizing the loss function, and the parameters of the action generation network are updated using the policy gradient of the samples. The parameters of the target action generation network are then updated from the updated action generation network parameters, and the parameters of the target policy evaluation network from the updated policy evaluation network parameters.
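The target-network update mentioned here can be sketched as a soft parameter copy; the smoothing coefficient tau is an assumption, since the text only states that the target parameters follow the freshly updated networks:

def update_target_parameters(updated_weights, target_weights, tau=0.01):
    # Blend each freshly updated parameter into the corresponding target parameter.
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(updated_weights, target_weights)]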
The loss function J_t at time t can be expressed as:

J_t = Ê_t[ L_t^CLIP − c_1·L_t^VF + c_2·s_π(s_t) ]

The loss function contains three terms: the cumulative-reward surrogate objective function L_t^CLIP, the squared loss of the return function L_t^VF, and the cross-entropy loss gain s_π(s_t) that encourages policy exploration. c_1 and c_2 are coefficients, and Ê_t denotes the expected estimate.
L_t^VF corresponds to the squared loss between the state value function of the new policy (denoted π̃) and that of the old policy (denoted π), and is used to evaluate the accuracy of the value v generated by the policy evaluation network. V is the state value function, i.e. the expectation of the accumulated return values. V_π(s_t) is the state value function of the old policy, and V_π̃(s_t) is the state value function of the new policy. The value predicted by the policy evaluation function should continuously approach the state value function.
Typically, the loss function consists only of the squared loss of the return function. However, in the initial stage of training the agent explores the environment blindly, so the differences between samples are large; the policy update amplitude then becomes too large, the policy easily deviates from the correct optimization direction, and the algorithm fails to converge or updates slowly. Therefore, the advantage function A_π(t) is introduced, representing the value increment of the new policy π̃ relative to the old policy π; Q_π̃(s_t, a_t) is the action-state value function of the new policy. L_t^CLIP is another expression of the advantage function in which the advantage is clipped and limited to a certain value range to avoid large fluctuations. r_t is the return value at time t. Clip is a clipping function, ε is a preset fluctuation range, and clip(·, 1−ε, 1+ε) limits the value to the range [1−ε, 1+ε]: values smaller than 1−ε are set to 1−ε, and values larger than 1+ε are set to 1+ε.
The new policy and the old policy differ in their action probability distribution spaces; this difference is the KL divergence between the new and old policies, and the greater the difference between the two distributions, the greater the KL divergence. The difference is measured with the cross-entropy loss gain s_π(s_t). Introducing s_π(s_t) prevents the new policy from staying in place and falling into a local optimum.
The loss function is calculated and the network parameters are updated by minimizing it, which ensures that the policy is updated along a direction in which the value function is monotonically non-decreasing and that the magnitude of the policy change remains controllable; by limiting the KL divergence between the old and new policies to a small value, the cumulative return value obtained by the new policy is ensured to be higher than that of the old policy.
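A sketch of how the three loss terms could be combined, following the standard clipped-surrogate formulation; because the patent's exact formula is reproduced only as an image, the clipping of the probability ratio, the signs, and the default coefficients below are assumptions:

import tensorflow as tf

def loss_at_time_t(advantage, ratio, v_new, v_target, entropy,
                   c1=0.5, c2=0.01, eps=0.2):
    # Clipped surrogate objective: limit the update amplitude between policies.
    clipped = tf.minimum(ratio * advantage,
                         tf.clip_by_value(ratio, 1.0 - eps, 1.0 + eps) * advantage)
    surrogate = tf.reduce_mean(clipped)                        # L_t^CLIP
    value_loss = tf.reduce_mean(tf.square(v_new - v_target))   # L_t^VF
    # Minimising this expression maximises the surrogate objective and the
    # exploration (cross-entropy) gain while penalising value-prediction error.
    return -(surrogate - c1 * value_loss + c2 * entropy)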
Step 5: record and track the accumulated round return during training, and terminate training once the round performance reaches a sufficiently high level and the unmanned vehicle can safely reach the end position.
Step 6: after training ends, save the model network parameters.
In this embodiment, by introducing the clipped advantage function and the cross-entropy loss gain into the loss function and minimizing that loss, the policy is updated along a direction in which the value function is monotonically non-decreasing, while the magnitude of policy change remains controllable (i.e., the KL divergence limits the update amplitude between the new and old policies). This prevents the algorithm from making large adjustments when it encounters samples whose distribution differs markedly from previous training samples, which would push the new policy in a completely different direction and keep the final policy from converging. Limiting the update amplitude between new and old policies via the KL divergence therefore ensures that the experience learned from past samples is not quickly forgotten during updates.
The embodiment of the autonomous obstacle avoidance apparatus for an unmanned vehicle provided by the invention and the embodiment of the autonomous obstacle avoidance method provided by the invention are based on the same inventive concept, and can obtain the same technical effects. Therefore, other specific contents of the embodiment of the autonomous obstacle avoidance apparatus may refer to the description of the embodiment of the foregoing autonomous obstacle avoidance method.
In another embodiment of the present invention, as shown in fig. 7, an electronic device 440 includes a memory 410 and a processor 420. The memory 410 is used to store a computer program 430; when the processor runs the computer program, the autonomous obstacle avoidance method of the unmanned vehicle described above is implemented.
As an example, the processor 420 realizes steps S200 to S410 described above when executing the computer program. The processor 420 likewise implements the functions of the modules and units of the autonomous obstacle avoidance apparatus of the unmanned vehicle described above when executing the computer program. As yet another example, the processor 420, when executing the computer program, implements the functions of the state acquisition module 100, the obstacle avoidance module 200, the action generation unit 210, the policy evaluation unit 220, and the trigger module 300.
Alternatively, the computer program may be divided into one or more modules/units according to the particular needs to accomplish the invention. Each module/unit may be a series of computer program instruction segments capable of performing a particular function. The computer program instruction segment is used for describing the execution process of the computer program in autonomous obstacle avoidance of the unmanned vehicle. As an example, the computer program may be divided into modules/units in the virtual device, such as a state acquisition module, an obstacle avoidance module, an action generation unit, a policy evaluation unit, a trigger module.
The processor is used for realizing the autonomous obstacle avoidance method of the unmanned vehicle by executing the computer program. The processor may be a Central Processing Unit (CPU), Graphics Processing Unit (GPU), Digital Signal Processor (DSP), Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), general purpose processor or other logic device, etc., as desired.
The memory may be any internal storage unit and/or external storage device capable of storing data and programs. For example, the memory may be a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card. The memory is used to store the computer program as well as the other programs and data of the autonomous obstacle avoidance apparatus of the unmanned vehicle.
The electronic device 440 may be any computer device, such as a desktop computer (desktop), a laptop computer (laptop), a Personal Digital Assistant (PDA), or a server (server). The electronic device 440 may further include an input/output device, a display device, a network access device, a bus, and the like, as needed. The electronic device 440 may also be a single chip computer or a computing device integrating a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU).
It will be understood by those skilled in the art that the above-mentioned units and modules for implementing the corresponding functions are divided for the purpose of convenient illustration and description, and the above-mentioned units and modules are further divided or combined according to the application requirements, that is, the internal structures of the devices/apparatuses are divided and combined again to implement the above-mentioned functions. Each unit and module in the above embodiments may be separate physical units, or two or more units and modules may be integrated into one physical unit. The units and modules in the above embodiments may implement corresponding functions by using hardware and/or software functional units. Direct coupling, indirect coupling or communication connection among a plurality of units, components and modules in the above embodiments can be realized through a bus or an interface; the coupling, connection, etc. between the multiple units or devices may be electrical, mechanical, or the like. Accordingly, the specific names of the units and modules in the above embodiments are only for convenience of description and distinction, and do not limit the scope of protection of the present application.
In one embodiment of the present invention, a computer-readable storage medium has a computer program stored thereon; when executed by a processor, the computer program implements the autonomous obstacle avoidance method for an unmanned vehicle described in the foregoing embodiments. That is, when part or all of the technical solutions of the embodiments of the present invention that contribute to the prior art are embodied as a computer software product, the computer software product is stored in a computer-readable storage medium. The computer-readable storage medium can be any entity or device capable of carrying the computer program code. For example, the computer-readable storage medium may be a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory, or a random access memory.
Another embodiment built with the unmanned vehicle obstacle avoidance algorithm is applied to the TORCS simulation environment, which includes a variety of racetracks containing static obstacles such as curbs, trees, and buildings, as well as moving vehicles serving as dynamic obstacles. The network is trained in two cases: a scenario containing only static obstacles, and a scenario containing both dynamic and static obstacles.
The action generation network Actor and the policy evaluation network Critic are both built with TensorFlow. The fully connected layers of the two networks consist of 100 and 200 neurons, respectively. The output layer uses the ReLU activation function. The inputs and outputs of the algorithm are shown in Tables 1 and 2 below:
Table 1. Control algorithm input state information
Table 2. Control algorithm output action information
As shown in fig. 8, after about 15000 training rounds the unmanned vehicle can reach the end point in fewer than 1000 steps per round, triggering the training termination condition. This indicates that the unmanned vehicle has learned a good strategy, can drive the entire track, and can do so repeatedly over multiple rounds. The loss function of the algorithm gradually converges.
It should be noted that the above embodiments can be freely combined as necessary. The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (10)

1. An autonomous obstacle avoidance method for an unmanned vehicle is characterized by comprising the following steps:
acquiring current state information, wherein the current state information comprises current environment state information and the current state of an unmanned vehicle;
according to the current state information and the historical state information, the trained obstacle avoidance network generates current action information;
executing the current action information, repeating the process to obtain next state information, updating the current action information according to the next state information, and repeating the steps until the unmanned vehicle reaches the destination;
the obstacle avoidance network adopts an Actor-Critic structure and comprises an action generation network and a strategy evaluation network;
the action generating network is used for processing the current state information and the historical state information through a first recurrent neural network to obtain fusion state information; predicting current action information according to the fusion state information;
the policy evaluation network is configured to obtain a return value obtained by executing the current action information under the current state information; processing the current action information through the second recurrent neural network according to the return value, the fusion state information and the current action information to obtain the prediction evaluation of the current action information;
and the action generation network adjusts a subsequent action generation strategy according to the prediction evaluation.
2. The autonomous obstacle avoidance method of an unmanned vehicle according to claim 1, wherein the obtaining of the predictive evaluation of the current motion information by the processing of the second recurrent neural network according to the return value, the fusion state information, and the current motion information includes:
obtaining state action fusion information according to the return value, the fusion state information and the current action information;
processing the state action fusion information through a second fully-connected neural network to obtain pre-evaluation information;
performing one-step attention processing on the state action fusion information and the pre-evaluation information to obtain weight-corrected state action fusion information;
and according to the weight-corrected state action fusion information and the prediction evaluation of the historical action information, the prediction evaluation of the current action information is obtained through the processing of the second recurrent neural network.
3. The autonomous obstacle avoidance method of the unmanned vehicle according to claim 2, wherein the obtaining of the state-motion fusion information with corrected weight by performing one-step attention processing on the state-motion fusion information and the pre-evaluation information specifically includes:
calculating the correlation between the state action fusion information and the pre-evaluation information to obtain a correlation coefficient;
normalizing the correlation information to obtain a corresponding weight factor;
and adjusting the state action fusion information by using the weight factor to obtain the state action fusion information of weight correction.
4. The autonomous obstacle avoidance method of the unmanned vehicle according to claim 3, characterized in that:
calculating the correlation between the state action fusion information and the pre-evaluation information according to the following formula:

e_{t,j} = w_1·x_t + w_2·q_j

wherein x_t is the state action fusion information at time t, q_j is the pre-evaluation information at time j, w_1 and w_2 are coefficients, and e_{t,j} denotes the correlation coefficient between the pre-evaluation information at time j and the state action fusion information at time t;

normalizing the correlation coefficients according to the following formula to obtain the corresponding weight factors α_{t,j}:

α_{t,j} = exp(e_{t,j}) / Σ_k exp(e_{t,k})

and obtaining the weight-corrected state action fusion information x̃_t according to the following formula:

x̃_t = Σ_j α_{t,j}·x_j
5. The autonomous obstacle avoidance method of an unmanned vehicle according to claim 1, wherein the obtaining of the return value by executing the current action information under the current state information specifically includes:
if the action information is executed under the current state information and no collision occurs, the reported value is the distance traveled by the unmanned vehicle in unit time;
and if the action information is executed under the current state information and collision can occur, the return value is a preset penalty value.
6. The autonomous obstacle avoidance method of an unmanned vehicle of claim 1, wherein the training of the obstacle avoidance network comprises:
training an obstacle avoidance network through interactive information between the environment and the unmanned vehicle, and updating network parameters through a minimized loss function; the loss function comprises the value increment of the old strategy and the new strategy and the KL divergence between the old strategy and the new strategy; and when the KL divergence between the new strategy and the old strategy is smaller than a preset threshold and the accumulated return value based on the new strategy is higher than the accumulated return value based on the old strategy, updating the old strategy by using the new strategy.
7. The autonomous obstacle avoidance method of the unmanned vehicle according to claim 6, characterized in that:
calculating the loss function J_t at time t according to the following formula:

J_t = Ê_t[ L_t^CLIP − c_1·L_t^VF + c_2·s_π(s_t) ]

wherein L_t^CLIP represents the cumulative-reward surrogate objective function, L_t^VF represents the squared loss of the return function, c_1 and c_2 are coefficients, s_π(s_t) represents the cross-entropy loss gain that encourages policy exploration, π represents a policy, Ê_t denotes the expected estimate, A_π(t) is the advantage function, and r_t is the return value at time t.
8. An autonomous obstacle avoidance apparatus of an unmanned vehicle, comprising:
the state acquisition module is used for acquiring current state information, wherein the current state information comprises current environment state information and the current state of the unmanned vehicle;
the obstacle avoidance module is used for generating current action information through the trained obstacle avoidance network according to the current state information and the historical state information;
the triggering module is used for executing the action information, triggering to obtain next state information, updating the current action information according to the next state information, and repeating the steps until the unmanned vehicle reaches the destination;
wherein, keep away the barrier network and adopt Actor-criticic structure, keep away the barrier module and include:
the action generating unit is used for processing the current state information and the historical state information through a first cyclic neural network to obtain fusion state information; predicting current action information according to the fusion state information;
the strategy evaluation unit is used for acquiring a return value obtained by executing the current action information under the current state information; processing the current action information through the second recurrent neural network according to the return value, the fusion state information and the current action information to obtain the prediction evaluation of the current action information;
and the action generating unit is used for adjusting a subsequent action generating strategy according to the prediction evaluation.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the method of autonomous obstacle avoidance of an unmanned vehicle according to any of claims 1 to 7 when running the computer program.
10. A computer-readable storage medium having stored thereon a computer program, characterized in that:
the computer program, when executed by a processor, implements the autonomous obstacle avoidance method of an unmanned vehicle of any of claims 1 to 7.
CN201911236281.XA 2019-12-05 2019-12-05 Autonomous obstacle avoidance method and device for unmanned vehicle, electronic equipment and readable storage medium Active CN110956148B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911236281.XA CN110956148B (en) 2019-12-05 2019-12-05 Autonomous obstacle avoidance method and device for unmanned vehicle, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911236281.XA CN110956148B (en) 2019-12-05 2019-12-05 Autonomous obstacle avoidance method and device for unmanned vehicle, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN110956148A true CN110956148A (en) 2020-04-03
CN110956148B CN110956148B (en) 2024-01-23

Family

ID=69980184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911236281.XA Active CN110956148B (en) 2019-12-05 2019-12-05 Autonomous obstacle avoidance method and device for unmanned vehicle, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN110956148B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107450593A (en) * 2017-08-30 2017-12-08 清华大学 A kind of unmanned plane autonomous navigation method and system
CN108629144A (en) * 2018-06-11 2018-10-09 湖北交投智能检测股份有限公司 A kind of bridge health appraisal procedure
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
CN109976340A (en) * 2019-03-19 2019-07-05 中国人民解放军国防科技大学 Man-machine cooperation dynamic obstacle avoidance method and system based on deep reinforcement learning
CN109948781A (en) * 2019-03-21 2019-06-28 中国人民解放军国防科技大学 Continuous action online learning control method and system for automatic driving vehicle
CN110262511A (en) * 2019-07-12 2019-09-20 同济人工智能研究院(苏州)有限公司 Biped robot's adaptivity ambulation control method based on deeply study

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHOU Neng, "Research on Control Methods for Mobile Robots Based on Deep Reinforcement Learning in Complex Scenes" *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582441B (en) * 2020-04-16 2021-07-30 清华大学 High-efficiency value function iteration reinforcement learning method of shared cyclic neural network
CN111582441A (en) * 2020-04-16 2020-08-25 清华大学 High-efficiency value function iteration reinforcement learning method of shared cyclic neural network
CN112256056B (en) * 2020-10-19 2022-03-01 中山大学 Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning
CN112256056A (en) * 2020-10-19 2021-01-22 中山大学 Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning
CN112346457A (en) * 2020-11-03 2021-02-09 智邮开源通信研究院(北京)有限公司 Control method and device for obstacle avoidance, electronic equipment and readable storage medium
CN112258097A (en) * 2020-12-23 2021-01-22 睿至科技集团有限公司 Driving assistance method and system based on big data
CN112258097B (en) * 2020-12-23 2021-03-26 睿至科技集团有限公司 Driving assistance method and system based on big data
CN112904890A (en) * 2021-01-15 2021-06-04 北京国网富达科技发展有限责任公司 Unmanned aerial vehicle automatic inspection system and method for power line
CN112965499A (en) * 2021-03-08 2021-06-15 哈尔滨工业大学(深圳) Unmanned vehicle driving decision-making method based on attention model and deep reinforcement learning
CN112965499B (en) * 2021-03-08 2022-11-01 哈尔滨工业大学(深圳) Unmanned vehicle driving decision-making method based on attention model and deep reinforcement learning
CN113386790A (en) * 2021-06-09 2021-09-14 扬州大学 Automatic driving decision-making method for cross-sea bridge road condition
CN113687651A (en) * 2021-07-06 2021-11-23 清华大学 Path planning method and device for delivering vehicles according to needs
CN113687651B (en) * 2021-07-06 2023-10-03 清华大学 Path planning method and device for dispatching vehicles on demand
CN114781072A (en) * 2022-06-17 2022-07-22 北京理工大学前沿技术研究院 Decision-making method and system for unmanned vehicle
CN114815904A (en) * 2022-06-29 2022-07-29 中国科学院自动化研究所 Attention network-based unmanned cluster countermeasure method and device and unmanned equipment
CN114839884A (en) * 2022-07-05 2022-08-02 山东大学 Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN114839884B (en) * 2022-07-05 2022-09-30 山东大学 Underwater vehicle bottom layer control method and system based on deep reinforcement learning

Also Published As

Publication number Publication date
CN110956148B (en) 2024-01-23

Similar Documents

Publication Publication Date Title
CN110956148B (en) Autonomous obstacle avoidance method and device for unmanned vehicle, electronic equipment and readable storage medium
US11836625B2 (en) Training action selection neural networks using look-ahead search
US11842261B2 (en) Deep reinforcement learning with fast updating recurrent neural networks and slow updating recurrent neural networks
CN110262511B (en) Biped robot adaptive walking control method based on deep reinforcement learning
CN111260027B (en) Intelligent agent automatic decision-making method based on reinforcement learning
CN112937564A (en) Lane change decision model generation method and unmanned vehicle lane change decision method and device
US11182676B2 (en) Cooperative neural network deep reinforcement learning with partial input assistance
KR20190028531A (en) Training machine learning models for multiple machine learning tasks
KR102310490B1 (en) The design of GRU-based cell structure robust to missing value and noise of time-series data in recurrent neural network
CN110447041B (en) Noise neural network layer
CN111783994A (en) Training method and device for reinforcement learning
CN112172813B (en) Car following system and method for simulating driving style based on deep inverse reinforcement learning
CN114162146B (en) Driving strategy model training method and automatic driving control method
JP2023512722A (en) Reinforcement learning using adaptive return calculation method
CN117008620A (en) Unmanned self-adaptive path planning method, system, equipment and medium
CN113743603A (en) Control method, control device, storage medium and electronic equipment
CN116430842A (en) Mobile robot obstacle avoidance method, device, equipment and storage medium
CN115906673A (en) Integrated modeling method and system for combat entity behavior model
Prescott Explorations in reinforcement and model-based learning
CN114397817A (en) Network training method, robot control method, network training device, robot control device, equipment and storage medium
CN114118371A (en) Intelligent agent deep reinforcement learning method and computer readable medium
CN112884129B (en) Multi-step rule extraction method, device and storage medium based on teaching data
CN117556681B (en) Intelligent air combat decision method, system and electronic equipment
KR102590791B1 (en) Method and apparatus of uncertainty-conditioned deep reinforcement learning
US20220101196A1 (en) Device for and computer implemented method of machine learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant