CN113408782B - Robot path navigation method and system based on improved DDPG algorithm - Google Patents

Robot path navigation method and system based on improved DDPG algorithm

Info

Publication number
CN113408782B
CN113408782B (application CN202110512658.0A)
Authority
CN
China
Prior art keywords
robot
network
ddpg
lstm
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110512658.0A
Other languages
Chinese (zh)
Other versions
CN113408782A (en)
Inventor
吕蕾
赵盼盼
周青林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University
Priority to CN202110512658.0A
Publication of CN113408782A
Application granted
Publication of CN113408782B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G06Q 10/047: Optimisation of routes or paths, e.g. travelling salesman problem
    • G01C 21/20: Instruments for performing navigational calculations
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods


Abstract

The invention discloses a robot path navigation method and system based on an improved DDPG algorithm. The current state information and the target position of the robot are acquired and input into a trained improved DDPG network to obtain optimal executable action data, with which the robot completes collision-free path navigation. The improved DDPG network computes the reward value of the DDPG network with a curiosity reward mechanism model comprising a plurality of LSTM models connected in series: the inputs of all the LSTM models are connected to the output of the Actor current network, the output of the last LSTM model is connected to the input of a CNN model, and the output of the CNN model is connected to the input of the Actor current network. Curiosity-based robot path navigation can make the robot more intelligent.

Description

Robot path navigation method and system based on improved DDPG algorithm
Technical Field
The invention relates to the technical field of path planning, in particular to a robot path navigation method and system based on an improved DDPG algorithm.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
With the development of artificial intelligence technology, robots have gradually moved from industrial production into our daily lives; in recent years the service industry in particular has developed vigorously, and the demand of human society for mobile robots keeps growing. Path planning is a key problem to be solved in the robotics field. Path planning for mobile robots is a complex problem: an autonomous mobile robot is required to find an obstacle-free path from an initial position to a target position under the given constraints. As the environments that robots face become more complex, robots are required to anticipate obstacles and avoid collisions with them at a higher level.
Traditional navigation solutions such as genetic algorithms and simulated annealing perform well for navigation, but they all design a general solution under the assumption that the environment is known. As robots are used in every industry, the environments they work in become more and more complex, and these earlier solutions no longer handle such problems well. In recent years, deep reinforcement learning, which combines reinforcement learning with deep learning, has been widely applied to robot path navigation. Deep learning has unique advantages in feature extraction and object perception and is widely used in computer vision and related fields; reinforcement learning, by learning a policy through interaction with the environment, has strong decision-making ability for reaching a maximum return or a specific goal. Deep reinforcement learning, combining the two, has successfully solved robot navigation in complex environments. The Deep Deterministic Policy Gradient (DDPG) algorithm is one of the earliest deep reinforcement learning networks. As a classic deep reinforcement learning algorithm, DDPG is a policy-learning method for continuous, high-dimensional action spaces. Compared with earlier reinforcement learning methods, DDPG has great advantages for continuous control problems and has been applied in many fields such as robot path navigation, automatic driving, and robotic arm control.
However, sensitivity to hyperparameters and reward values that tend to diverge have long been problems that DDPG cannot solve well. In reinforcement learning the reward feedback R is usually hard-coded by hand, and because the reward of each step cannot be simply predicted, the reward function is usually designed to be sparse; the robot therefore cannot obtain immediate feedback, and its learning ability remains low.
In the process of implementing the invention, the inventor finds that the following technical problems exist in the prior art:
the robot path navigation realized based on the prior art has the problem of inaccurate navigation.
Disclosure of Invention
In order to solve the defects of the prior art, the invention provides a robot path navigation method and a system based on an improved DDPG algorithm;
in a first aspect, the invention provides a robot path navigation method based on an improved DDPG algorithm;
the robot path navigation method based on the improved DDPG algorithm comprises the following steps:
acquiring current state information and a target position of the robot;
inputting the current state information and the target position of the robot into the trained improved DDPG network to obtain optimal executable action data;
the robot completes collision-free path navigation according to the optimal executable action data;
wherein the improved DDPG network is based on the DDPG network, and the calculation of the reward value of the DDPG network is completed by utilizing a curiosity reward mechanism model; the curiosity reward mechanism model comprises: a plurality of LSTM models which are connected in series in sequence; in the LSTM models which are sequentially connected in series, the input ends of all the LSTM models are connected with the output end of the current network of the Actor, the output end of the last LSTM model is connected with the input end of the CNN model, and the output end of the CNN model is connected with the input end of the current network of the Actor.
In a second aspect, the present invention provides a robot path navigation system based on an improved DDPG algorithm;
a robot path navigation system based on an improved DDPG algorithm comprises:
an acquisition module configured to: acquiring current state information and a target position of the robot;
an output module configured to: inputting the current state information and the target position of the robot into the trained improved DDPG network to obtain optimal executable action data;
a navigation module configured to: the robot completes collision-free path navigation according to the optimal executable action data;
wherein the improved DDPG network is based on the DDPG network, and the calculation of the reward value of the DDPG network is completed by utilizing a curiosity reward mechanism model; the curiosity reward mechanism model comprises: a plurality of LSTM models which are connected in series in sequence; in the LSTM models which are sequentially connected in series, the input ends of all the LSTM models are connected with the output end of the current network of the Actor, the output end of the last LSTM model is connected with the input end of the CNN model, and the output end of the CNN model is connected with the input end of the current network of the Actor.
In a third aspect, the present invention further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device is running, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first aspect.
In a fourth aspect, the present invention also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
the invention utilizes the total sum of the internal reward generated by curiosity and the external reward of the algorithm as the total reward generated by the interaction of the robot and the environment. The reward function module is embedded with a long-short term memory artificial neural network (LSTM) and a Convolutional Neural Network (CNN). A plurality of past states are input into the LSTM network, the prediction of the next state is output, and the difference value between the predicted value and the actual state of the next state is used as an internal reward. In human society, people often have past experience in predicting what happens next, and embedding LSTM networks into curiosity mechanisms is just for reference to this mental feature. While using the CNN network to perform a reverse prediction of the action for the next state generated by the previous network. Curiosity has been considered by some scientists as one of the basic attributes of intelligence, and robot path navigation based on curiosity can make a robot more intelligent, and even under the condition that reward is sparse and even no external reward exists, the robot can feel like a human.
The invention borrows the thinking characteristics of human beings and embeds a curiosity mechanism in the reward function module. The most recent batch of states is input into the robot's curiosity mechanism as experience data, and an LSTM network with long short-term memory is used to predict the next state, so that the curiosity prediction preserves the time order. The difference between the predicted next state and the actual next state is used as the internal reward value, which alleviates the sparse-reward problem of the original DDPG algorithm.
The invention feeds the next state S_{t+1}' predicted by the LSTM network, together with the actual state S_t, into a CNN network with a feature-extraction function, which outputs a predicted value A_t' of the action A_t; the difference between the actual action A_t and the action A_t' predicted by the CNN network serves as a constraint. The LSTM network and the CNN network are trained simultaneously through back-propagation of the gradient. With the CNN module added, the key state features that influence the action can be extracted.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
FIG. 1 is a block diagram of an algorithm of a robot path navigation method based on improved curiosity and DDPG algorithm according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an LSTM model algorithm embedded in a curiosity mechanism according to an embodiment of the invention;
fig. 3 is a schematic diagram of CNN module algorithm embedded in the curiosity mechanism according to the embodiment of the present invention.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular is intended to include the plural unless the context clearly indicates otherwise. It should furthermore be understood that the terms "comprises" and "comprising", and any variations thereof, are intended to cover a non-exclusive inclusion, so that a process, method, system, article or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article or apparatus.
The embodiments and features of the embodiments of the invention may be combined with each other without conflict.
Example one
The embodiment provides a robot path navigation method based on an improved DDPG algorithm;
as shown in fig. 1, the robot path navigation method based on the improved DDPG algorithm includes:
s101: acquiring current state information and a target position of the robot;
s102: inputting the current state information and the target position of the robot into the trained improved DDPG network to obtain optimal executable action data;
s103: the robot completes collision-free path navigation according to the optimal executable action data;
wherein the improved DDPG network is based on the DDPG network, and the calculation of the reward value of the DDPG network is completed by utilizing a curiosity reward mechanism model; the curiosity reward mechanism model comprises: a plurality of LSTM models which are connected in series in sequence; in the LSTM models which are sequentially connected in series, the input ends of all the LSTM models are connected with the output end of the current network of the Actor, the output end of the last LSTM model is connected with the input end of the CNN model, and the output end of the CNN model is connected with the input end of the current network of the Actor.
As shown in fig. 3, the CNN model includes three convolutional layers connected in sequence.
The improved DDPG network is based on the DDPG network, with a stack structure added to its experience replay pool. The experience replay pool therefore stores two kinds of batches: samples obtained by the original random sampling, and time-ordered samples obtained through the stack structure. The time-ordered samples obtained through the stack structure are used for training the curiosity reward mechanism model, while the randomly sampled ones are used for training the Actor module and the Critic module of the DDPG network.
Further, the current state information includes: the current position of the robot, the current angular velocity of the robot, the current linear velocity of the robot and the current environment information of the robot.
Further, S102: inputting the current state information and the target position of the robot into the trained improved DDPG network to obtain optimal executable action data; the method specifically comprises the following steps:
and inputting the current state information and the target position of the robot into the trained improved DDPG network, and generating optimal executable action data by an Actor module of the improved DDPG network.
Further, S102: inputting the current state information and the target position of the robot into the trained improved DDPG network to obtain optimal executable action data; wherein, the improved DDPG network comprises:
the system comprises an Actor module, an experience playback pool and a criticic module which are connected in sequence;
the Actor module comprises an Actor current network and an Actor target network which are sequentially connected;
the Critic module comprises a Critic current network and a Critic target network which are sequentially connected;
wherein, the Actor current network is connected with all LSTM models of the curiosity reward mechanism model; the Actor current network is also connected to the output of the CNN model of the curiosity reward mechanism model.
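For concreteness, the following is a minimal PyTorch sketch of an Actor/Critic pair of the kind used by DDPG, matching the module structure described above; the layer widths, activations and the tanh-scaled action output are illustrative assumptions and are not specified by the patent.

```python
# Minimal sketch of the DDPG Actor/Critic pair (current networks); the target
# networks are created as copies of these. Layer sizes are assumed values.
import torch
import torch.nn as nn


class Actor(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, max_action: float = 1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Deterministic policy: maps the robot state to a continuous action
        # (e.g. linear and angular velocity), scaled to the action bounds.
        return self.max_action * self.net(state)


class Critic(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Q(s, a): estimated return of taking `action` in `state`.
        return self.net(torch.cat([state, action], dim=-1))
```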
Further, S102: inputting the current state information and the target position of the robot into the trained improved DDPG network to obtain optimal executable action data; wherein the training step of the trained improved DDPG network comprises the following steps:
s1021: constructing a training set; the training set comprises the state of the robot at each moment along known robot navigation paths;
s1022: and inputting the training set into the improved DDPG network to finish the training of an Actor module, a Critic module and a curiosity reward mechanism model of the improved DDPG network.
Further, the training of the curiosity reward mechanism model is completed, and the training step comprises the following steps:
s10221: in state S_t, the robot selects the corresponding action A_t and, by interacting with the environment, generates the next state S_{t+1} and a reward value R;
s10222: the experience data (S_t, A_t, R, S_{t+1}, done) generated by the interaction of the robot with the environment are stored in the experience replay pool; a stack structure is added to the replay pool so that the experience data can be accessed in time order, and done indicates whether the robot navigation has finished;
s10223: the time-ordered experience data in the stack structure are input into the LSTM network; as shown in FIG. 2, the first LSTM model receives only the robot state information of the corresponding moment, while every later LSTM model receives two inputs, the robot state information of the corresponding moment and the output value of the LSTM model of the previous moment; the last LSTM model outputs the predicted value S_{t+1}' of the robot state at the next moment;
s10224: the difference between the actual next state S_{t+1} and the predicted next state S_{t+1}' is taken as the internal reward r_i, and the sum of the internal reward r_i and the original external reward r_e is taken as the total reward R of the robot exploring the environment; the difference between the actual next state S_{t+1} and the predicted next state S_{t+1}' is also used as the first constraint condition in the training process;
s10225: the robot state S_t at the current moment and the predicted value S_{t+1}' of the robot state at the next moment are input into the convolutional neural network CNN, which outputs the inversely predicted action A_t';
s10226: the difference between the inversely predicted action A_t' and the actual action A_t is used as the second constraint condition in the training process, and the curiosity reward mechanism model is trained through back-propagation of the gradient, completing the training of the curiosity reward mechanism model.
It should be understood that the time ordering of the samples preserved as described in S10222 is independent of the original random sampling mechanism. To avoid overfitting during training, the correlation between samples must be broken, so DDPG usually selects a batch of data for network training by random sampling. To also obtain time-ordered samples for training the LSTM network, the invention adds a stack structure that is independent of the random sampling module: the last-in, first-out property of the stack maintains the time order of the samples, data are stored at the top of the stack, and a batch of data samples is taken from the top of the stack when data are fetched. When experience data are stored, they are written both into the stack mechanism and into the original queue mechanism. When data are fetched, the queue mechanism keeps the original random sampling mode and is used for training the Actor module and the Critic module of the network, while the stack mechanism keeps its stack fetching behaviour, which guarantees that the most recent experience data, in time order, are retrieved.
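A minimal sketch of such a replay pool is given below; the class and method names are hypothetical. Every transition is written into both a queue (sampled at random for the Actor/Critic updates) and a stack whose top always holds the most recent experience (read out as a time-ordered batch for the LSTM curiosity model).

```python
# Sketch of a replay pool with two access paths, assuming the dual
# queue/stack storage described above; names are illustrative only.
import random
from collections import deque


class ReplayPool:
    def __init__(self, capacity: int = 100_000):
        self.queue = deque(maxlen=capacity)  # queue mechanism: random sampling
        self.stack = deque(maxlen=capacity)  # stack mechanism: newest on top

    def store(self, s, a, r, s_next, done):
        transition = (s, a, r, s_next, done)
        self.queue.append(transition)        # arrival order
        self.stack.append(transition)        # top of stack = latest step

    def sample_random(self, batch: int):
        # Breaks temporal correlation; used to train the Actor and Critic.
        return random.sample(list(self.queue), batch)

    def sample_recent(self, batch: int):
        # Take the newest `batch` transitions from the top of the stack and
        # return them in chronological order for the LSTM predictor.
        return list(self.stack)[-batch:]
```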
It should be understood that, in S10223, a person's anticipation of what happens next is usually based on previous experience. Borrowing this human thinking characteristic, the time-ordered state sequence is input into the LSTM network, and the memory function of the LSTM is used to predict the next state S_{t+1}'. The specific calculation is as follows:
S_{t+1}' = L(S_{t-n}, S_{t-(n-1)}, ..., S_{t-2}, S_{t-1}, S_t; θ)
where S denotes the state at a given moment and θ denotes the parameters of the LSTM network. The predicted value S_{t+1}' of the next state is obtained after the time-ordered state sequence passes through the LSTM.
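One way to realise L(·; θ) is a recurrent network that consumes the window of recent states and maps the final hidden state to a next-state prediction, as in the sketch below; the hidden size and the single recurrent layer are assumptions (PyTorch's nn.LSTM performs internally the step-by-step chaining shown in FIG. 2).

```python
# Sketch of the next-state predictor S'_{t+1} = L(S_{t-n}, ..., S_t; theta).
# Hidden size and the single-layer LSTM are assumed for illustration.
import torch
import torch.nn as nn


class StatePredictor(nn.Module):
    def __init__(self, state_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, state_dim)

    def forward(self, state_seq: torch.Tensor) -> torch.Tensor:
        # state_seq: (batch, n + 1, state_dim), oldest state first.
        out, _ = self.lstm(state_seq)
        # The output at the last time step summarises the whole window and
        # is mapped to a prediction of the next state.
        return self.head(out[:, -1, :])
```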
It should be understood that, in S10224, the difference between the predicted next state S_{t+1}' and the next state S_{t+1} actually produced by the interaction with the environment is used as the internal reward value; at the same time, to keep the LSTM network from predicting absurd solutions, the difference between the actual next state S_{t+1} and the predicted value S_{t+1}' is used as a constraint for training the LSTM network. The specific calculation is as follows:
r_i = ||S_{t+1}' - S_{t+1}||
R = r_e + r_i
Min(||S_{t+1}' - S_{t+1}||)
where r_i is the internal reward generated by the improved curiosity mechanism, namely the difference between the actual next state S_{t+1} and the predicted next state S_{t+1}'; r_e is the external reward of the DDPG algorithm; and R, the sum of the internal and external rewards of the improved curiosity algorithm, is the total reward value. Min(||S_{t+1}' - S_{t+1}||) is the constraint of the LSTM network.
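These three relations map directly onto code. The sketch below assumes the L2 norm for the state gap and realises the Min(·) constraint as a mean-squared-error loss that is back-propagated through the LSTM.

```python
# Sketch of the curiosity reward: r_i = ||S'_{t+1} - S_{t+1}||, R = r_e + r_i,
# with the same gap serving as the LSTM training loss (the "Min" constraint).
import torch
import torch.nn.functional as F


def curiosity_reward(s_next_pred: torch.Tensor,
                     s_next: torch.Tensor,
                     r_external: torch.Tensor):
    # Internal reward: distance between the predicted and the actual next state.
    r_intrinsic = torch.norm(s_next_pred - s_next, dim=-1)
    # Total reward fed back to the DDPG agent.
    r_total = r_external + r_intrinsic
    # First constraint: minimising the same gap anchors the LSTM's
    # predictions to the states actually visited.
    lstm_loss = F.mse_loss(s_next_pred, s_next)
    return r_total.detach(), lstm_loss
```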
It should be understood that, in S10225 and S10226, the feature-extraction capability of the CNN network is used: the state S_t and the state S_{t+1}' predicted by the curiosity mechanism are input into the CNN network, which outputs the inversely predicted action A_t'; at the same time, the difference between the actual action A_t and the predicted action A_t' is used as another constraint. With the CNN network, the key state features that influence the action can be extracted. The specific calculation is as follows:
A_t' = H(S_t, S_{t+1}'; w)
Min(A_t, A_t')
where w denotes the parameters of the CNN network and A_t' is the predicted value of the action A_t generated by the CNN network.
Through the first constraint and the second constraint generated on the predicted action, the LSTM network and the CNN network can be trained simultaneously by back-propagation of the gradient.
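As an illustration of the inverse model A_t' = H(S_t, S_{t+1}'; w), the sketch below stacks the two state vectors as a two-channel one-dimensional signal and passes them through three convolutional layers, in line with the three-layer CNN of FIG. 3; the channel counts, kernel sizes and the one-dimensional treatment of the state are assumptions.

```python
# Sketch of the inverse model A'_t = H(S_t, S'_{t+1}; w): a small CNN that
# regresses the action connecting the current state to the predicted next
# state. Channel/kernel sizes are assumed values.
import torch
import torch.nn as nn


class InverseActionModel(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(2, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(32 * state_dim, action_dim)

    def forward(self, s_t: torch.Tensor, s_next_pred: torch.Tensor) -> torch.Tensor:
        # Stack the two state vectors as channels: (batch, 2, state_dim).
        x = torch.stack([s_t, s_next_pred], dim=1)
        features = self.conv(x)                        # (batch, 32, state_dim)
        return self.head(features.flatten(start_dim=1))


# Second constraint: e.g. F.mse_loss(a_pred, a_actual); back-propagating it
# together with the state-prediction loss trains the CNN and the LSTM jointly.
```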
At the same time, a CNN network with a feature-extraction function is embedded: the next state predicted by the LSTM network and the actual current state are used as its input, and the CNN network outputs a predicted value of the action. The difference between the actual action and the action predicted by the CNN network is taken as a constraint. The LSTM network and the CNN network are trained simultaneously through back-propagation of the gradient. With the CNN module added, the key state features that influence the action can be extracted.
A batch of experience data is selected from the experience replay pool by random sampling to train the Critic network and the Actor network, and the parameters are updated through back-propagation of the gradient.
The network parameters are copied from the current networks to the target networks at regular intervals using a soft update between the current networks and the target networks.
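The soft update can be written as θ_target ← τ·θ + (1 - τ)·θ_target for each parameter pair; a minimal sketch with a typical, assumed value of τ follows.

```python
# Soft update: every target parameter drifts toward its current-network
# counterpart. tau = 0.005 is an assumed, commonly used value.
import torch


@torch.no_grad()
def soft_update(current_net: torch.nn.Module, target_net: torch.nn.Module,
                tau: float = 0.005) -> None:
    for p, p_target in zip(current_net.parameters(), target_net.parameters()):
        p_target.mul_(1.0 - tau).add_(p, alpha=tau)
```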
The LSTM network, as shown in fig. 2, borrows the thinking characteristics of human beings: in human society, people usually rely on past experience to predict what happens next, and the invention embeds a curiosity mechanism with this property in the reward function module. The most recent batch of states is input into the robot's curiosity mechanism as experience data, and the LSTM network with its long short-term memory is used to predict the next state; because the time-ordered state sequence is fed into the LSTM and its memory function is used for prediction, the curiosity prediction preserves the time order. The difference between the predicted next state and the actual next state is used as the internal reward value, and, to keep the LSTM from predicting absurd solutions, the difference between the actual next state and its predicted value is used as a constraint condition.
The CNN network module, as shown in fig. 3, embeds a CNN network with a feature-extraction function in the curiosity mechanism; it takes the next state predicted by the LSTM network and the actual current state as inputs, outputs a predicted value of the action, and takes the difference between the actual action and the action predicted by the CNN network as a constraint condition. The LSTM network and the CNN network are trained simultaneously through back-propagation of the gradient. With the CNN module added, the key state features that influence the action can be extracted.
Example two
The embodiment provides a robot path navigation system based on an improved DDPG algorithm;
a robot path navigation system based on an improved DDPG algorithm comprises:
an acquisition module configured to: acquiring current state information and a target position of the robot;
an output module configured to: inputting the current state information and the target position of the robot into the trained improved DDPG network to obtain optimal executable action data;
a navigation module configured to: the robot completes collision-free path navigation according to the optimal executable action data;
wherein the improved DDPG network is based on the DDPG network, and the calculation of the reward value of the DDPG network is completed by utilizing a curiosity reward mechanism model; the curiosity reward mechanism model comprises: a plurality of LSTM models which are connected in series in sequence; in the LSTM models which are sequentially connected in series, the input ends of all the LSTM models are connected with the output end of the current network of the Actor, the output end of the last LSTM model is connected with the input end of the CNN model, and the output end of the CNN model is connected with the input end of the current network of the Actor.
It should be noted here that the acquisition module, the output module and the navigation module described above correspond to steps S101 to S103 of the first embodiment; the examples and application scenarios realized by these modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the first embodiment. It should also be noted that the above modules, as part of a system, may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical functional division, and in actual implementation, there may be another division, for example, a plurality of modules may be combined or may be integrated into another system, or some features may be omitted, or not executed.
EXAMPLE III
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method of the first embodiment may be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in storage media well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or registers. The storage medium is located in a memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, the details are not described here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Example four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. The robot path navigation method based on the improved DDPG algorithm is characterized by comprising the following steps:
acquiring current state information and a target position of the robot;
inputting the current state information and the target position of the robot into the trained improved DDPG network to obtain optimal executable action data;
the robot completes collision-free path navigation according to the optimal executable action data;
wherein the improved DDPG network is based on the DDPG network, and the calculation of the reward value of the DDPG network is completed by utilizing a curiosity reward mechanism model; the curiosity reward mechanism model comprises: a plurality of LSTM models which are connected in series in sequence; in the LSTM models which are sequentially connected in series, the input ends of all LSTM models are connected with the output end of the current network of the Actor, the output end of the last LSTM model is connected with the input end of the CNN model, and the output end of the CNN model is connected with the input end of the current network of the Actor;
completing the training of a curiosity reward mechanism model, wherein the training step comprises the following steps:
(a) In state S_t, the robot selects the corresponding action A_t and, by interacting with the environment, generates the next state S_{t+1} and a reward value R;
(b) The experience data (S_t, A_t, R, S_{t+1}, done) generated by the interaction of the robot with the environment are stored in the experience replay pool; a stack structure is added to the replay pool so that the experience data can be accessed in time order, and done indicates whether the robot navigation has finished;
(c) The time-ordered experience data in the stack structure are input into the LSTM network; as shown in FIG. 2, the first LSTM model receives only the robot state information of the corresponding moment, while every later LSTM model receives two inputs, the robot state information of the corresponding moment and the output value of the LSTM model of the previous moment; the last LSTM model outputs the predicted value S_{t+1}' of the robot state at the next moment;
(d) The difference between the actual next state S_{t+1} and the predicted next state S_{t+1}' is taken as the internal reward r_i, and the sum of the internal reward r_i and the original external reward r_e is taken as the total reward R of the robot exploring the environment; the difference between the actual next state S_{t+1} and the predicted next state S_{t+1}' is also used as the first constraint condition in the training process;
(e) The robot state S_t at the current moment and the predicted value S_{t+1}' of the robot state at the next moment are input into the convolutional neural network CNN, which outputs the inversely predicted action A_t';
(f) The difference between the inversely predicted action A_t' and the actual action A_t is used as the second constraint condition in the training process, and the curiosity reward mechanism model is trained through back-propagation of the gradient, completing the training of the curiosity reward mechanism model;
the improved DDPG network is based on the DDPG network, and a stack structure is additionally arranged on an experience playback pool of the DDPG network; storing two batches of data in an experience playback pool, wherein one batch of data is a sample obtained by original random sampling, and the other batch of data is a time sequence sample obtained by a stack structure; the time sequence sample obtained by the stack structure is used for training a curiosity reward mechanism model; and randomly sampling the obtained samples for use in the training of an Actor module and a Critic module of the DDPG network.
2. The robot path navigation method based on the improved DDPG algorithm as claimed in claim 1, wherein the current state information and the target position of the robot are inputted into the trained improved DDPG network to obtain the optimal executable action data; the method specifically comprises the following steps:
and inputting the current state information and the target position of the robot into the trained improved DDPG network, and generating optimal executable action data by an Actor module of the improved DDPG network.
3. The robot path navigation method based on the improved DDPG algorithm as claimed in claim 1, wherein the current state information and the target position of the robot are inputted into the trained improved DDPG network to obtain the optimal executable action data; wherein, the improved DDPG network comprises:
the Actor module, the experience playback pool and the Critic module are connected in sequence;
the Actor module comprises an Actor current network and an Actor target network which are sequentially connected;
the Critic module comprises a Critic current network and a Critic target network which are sequentially connected;
wherein, the Actor current network is connected with all LSTM models of the curiosity reward mechanism model; the Actor current network is also connected to the output of the CNN model of the curiosity reward mechanism model.
4. The robot path navigation method based on the improved DDPG algorithm as claimed in claim 1, wherein the current state information and the target position of the robot are inputted into the trained improved DDPG network to obtain the optimal executable action data; wherein the training step of the trained improved DDPG network comprises the following steps:
(1) Constructing a training set; the training set comprises the state of the robot with known robot navigation paths at each moment;
(2) And inputting the training set into the improved DDPG network to finish the training of an Actor module, a Critic module and a curiosity reward mechanism model of the improved DDPG network.
5. The robot path navigation method based on the improved DDPG algorithm of claim 1, wherein the current state information comprises: the current position of the robot, the current angular velocity of the robot, the current linear velocity of the robot and the current environment information of the robot.
6. The robot path navigation system based on the improved DDPG algorithm is characterized by comprising the following steps:
an acquisition module configured to: acquiring current state information and a target position of the robot;
an output module configured to: inputting the current state information and the target position of the robot into the trained improved DDPG network to obtain optimal executable action data;
a navigation module configured to: the robot completes collision-free path navigation according to the optimal executable action data;
wherein the improved DDPG network is based on the DDPG network, and the calculation of the reward value of the DDPG network is completed by utilizing a curiosity reward mechanism model; the curiosity reward mechanism model comprises: a plurality of LSTM models which are connected in series in sequence; in the LSTM models which are sequentially connected in series, the input ends of all the LSTM models are connected with the output end of the current network of the Actor, the output end of the last LSTM model is connected with the input end of the CNN model, and the output end of the CNN model is connected with the input end of the current network of the Actor;
finishing the training of the curiosity reward mechanism model, wherein the training step comprises the following steps:
(a) In state S_t, the robot selects the corresponding action A_t and, by interacting with the environment, generates the next state S_{t+1} and a reward value R;
(b) The experience data (S_t, A_t, R, S_{t+1}, done) generated by the interaction of the robot with the environment are stored in the experience replay pool; a stack structure is added to the replay pool so that the experience data can be accessed in time order, and done indicates whether the robot navigation has finished;
(c) The time-ordered experience data in the stack structure are input into the LSTM network; as shown in FIG. 2, the first LSTM model receives only the robot state information of the corresponding moment, while every later LSTM model receives two inputs, the robot state information of the corresponding moment and the output value of the LSTM model of the previous moment; the last LSTM model outputs the predicted value S_{t+1}' of the robot state at the next moment;
(d) The difference between the actual next state S_{t+1} and the predicted next state S_{t+1}' is taken as the internal reward r_i, and the sum of the internal reward r_i and the original external reward r_e is taken as the total reward R of the robot exploring the environment; the difference between the actual next state S_{t+1} and the predicted next state S_{t+1}' is also used as the first constraint condition in the training process;
(e) The robot state S_t at the current moment and the predicted value S_{t+1}' of the robot state at the next moment are input into the convolutional neural network CNN, which outputs the inversely predicted action A_t';
(f) The difference between the inversely predicted action A_t' and the actual action A_t is used as the second constraint condition in the training process, and the curiosity reward mechanism model is trained through back-propagation of the gradient, completing the training of the curiosity reward mechanism model;
the improved DDPG network is based on the DDPG network, and a stack structure is additionally arranged on an experience playback pool of the DDPG network; storing two batches of data in an experience playback pool, wherein one batch of data is a sample obtained by original random sampling, and the other batch of data is a time sequence sample obtained by a stack structure; the time sequence sample obtained by the stack structure is used for training a curiosity reward mechanism model; and randomly sampling the obtained samples for use in the training of an Actor module and a Critic module of the DDPG network.
7. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein a processor is coupled to the memory, the one or more computer programs being stored in the memory, and wherein when the electronic device is running, the processor executes the one or more computer programs stored in the memory to cause the electronic device to perform the method of any of the preceding claims 1-5.
8. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 5.
CN202110512658.0A 2021-05-11 2021-05-11 Robot path navigation method and system based on improved DDPG algorithm Active CN113408782B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110512658.0A CN113408782B (en) 2021-05-11 2021-05-11 Robot path navigation method and system based on improved DDPG algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110512658.0A CN113408782B (en) 2021-05-11 2021-05-11 Robot path navigation method and system based on improved DDPG algorithm

Publications (2)

Publication Number Publication Date
CN113408782A CN113408782A (en) 2021-09-17
CN113408782B true CN113408782B (en) 2023-01-31

Family

ID=77678380

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110512658.0A Active CN113408782B (en) 2021-05-11 2021-05-11 Robot path navigation method and system based on improved DDPG algorithm

Country Status (1)

Country Link
CN (1) CN113408782B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114355980B (en) * 2022-01-06 2024-03-08 上海交通大学宁波人工智能研究院 Four-rotor unmanned aerial vehicle autonomous navigation method and system based on deep reinforcement learning
CN116578094A (en) * 2023-06-05 2023-08-11 中科南京智能技术研究院 Autonomous obstacle avoidance planning method, device and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110488859A (en) * 2019-07-15 2019-11-22 北京航空航天大学 A kind of Path Planning for UAV based on improvement Q-learning algorithm
CN111487864A (en) * 2020-05-14 2020-08-04 山东师范大学 Robot path navigation method and system based on deep reinforcement learning
CN111523731A (en) * 2020-04-24 2020-08-11 山东师范大学 Crowd evacuation movement path planning method and system based on Actor-Critic algorithm
CN112629542A (en) * 2020-12-31 2021-04-09 山东师范大学 Map-free robot path navigation method and system based on DDPG and LSTM

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112668235B (en) * 2020-12-07 2022-12-09 中原工学院 Robot control method based on off-line model pre-training learning DDPG algorithm

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110488859A (en) * 2019-07-15 2019-11-22 北京航空航天大学 A kind of Path Planning for UAV based on improvement Q-learning algorithm
CN111523731A (en) * 2020-04-24 2020-08-11 山东师范大学 Crowd evacuation movement path planning method and system based on Actor-Critic algorithm
CN111487864A (en) * 2020-05-14 2020-08-04 山东师范大学 Robot path navigation method and system based on deep reinforcement learning
CN112629542A (en) * 2020-12-31 2021-04-09 山东师范大学 Map-free robot path navigation method and system based on DDPG and LSTM

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Curiosity-driven Exploration for Mapless Navigation with Deep Reinforcement Learning; Oleksii Zhelo et al.; arXiv:1804.00456v2; 2018-05-14; second page *

Also Published As

Publication number Publication date
CN113408782A (en) 2021-09-17

Similar Documents

Publication Publication Date Title
CN112937564B (en) Lane change decision model generation method and unmanned vehicle lane change decision method and device
CN110168578B (en) Multi-tasking neural network with task-specific paths
CN113408782B (en) Robot path navigation method and system based on improved DDPG algorithm
KR102481885B1 (en) Method and device for learning neural network for recognizing class
JP7510637B2 (en) How to generate a general-purpose trained model
JP7367233B2 (en) System and method for robust optimization of reinforcement learning based on trajectory-centered models
US11080586B2 (en) Neural network reinforcement learning
CN112596515B (en) Multi-logistics robot movement control method and device
US11182676B2 (en) Cooperative neural network deep reinforcement learning with partial input assistance
CN112119404A (en) Sample efficient reinforcement learning
WO2014018793A1 (en) Apparatus and methods for efficient updates in spiking neuron networks
CN112115352A (en) Session recommendation method and system based on user interests
CN108830376B (en) Multivalent value network deep reinforcement learning method for time-sensitive environment
JP7139524B2 (en) Control agents over long timescales using time value transfer
EP3502978A1 (en) Meta-learning system
US20220036186A1 (en) Accelerated deep reinforcement learning of agent control policies
CN112904852B (en) Automatic driving control method and device and electronic equipment
Lee et al. Hierarchical emotional episodic memory for social human robot collaboration
JP2024506025A (en) Attention neural network with short-term memory unit
Bakker Reinforcement learning by backpropagation through an LSTM model/critic
JP7283774B2 (en) ELECTRONIC APPARATUS AND OPERATING METHOD THEREOF, AND COMPUTER PROGRAM FOR PRECISE BEHAVIOR PROFILING FOR IMPLANTING HUMAN INTELLIGENCE TO ARTIFICIAL INTELLIGENCE
US20240143975A1 (en) Neural network feature extractor for actor-critic reinforcement learning models
WO2023123838A1 (en) Network training method and apparatus, robot control method and apparatus, device, storage medium, and program
Liu et al. Forward-looking imaginative planning framework combined with prioritized-replay double DQN
KR20210113939A (en) Electronic device for high-precision profiling to develop artificial inntelligence with human-like intelligence, and operating method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant