CN113408782B - Robot path navigation method and system based on improved DDPG algorithm - Google Patents

Robot path navigation method and system based on improved DDPG algorithm

Info

Publication number
CN113408782B
CN113408782B (application CN202110512658.0A)
Authority
CN
China
Prior art keywords
robot
network
ddpg
lstm
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110512658.0A
Other languages
Chinese (zh)
Other versions
CN113408782A (en)
Inventor
吕蕾
赵盼盼
周青林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University
Priority to CN202110512658.0A
Publication of CN113408782A
Application granted
Publication of CN113408782B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G06Q 10/047: Optimisation of routes or paths, e.g. travelling salesman problem
    • G01C 21/20: Instruments for performing navigational calculations
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods


Abstract

The invention discloses a robot path navigation method and system based on an improved DDPG algorithm. The current state information and the target position of the robot are acquired and input into a trained improved DDPG network to obtain optimal executable action data, with which the robot completes collision-free path navigation. The improved DDPG network computes the reward value of the DDPG network with a curiosity reward mechanism model comprising a plurality of LSTM models connected in series: the inputs of all the LSTM models are connected to the output of the Actor current network, the output of the last LSTM model is connected to the input of a CNN model, and the output of the CNN model is connected to the input of the Actor current network. Curiosity-based robot path navigation can make the robot more intelligent.

Description

Robot path navigation method and system based on improved DDPG algorithm
Technical Field
The invention relates to the technical field of path planning, in particular to a robot path navigation method and system based on an improved DDPG algorithm.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
With the development of artificial intelligence technology, robots have gradually moved from industrial production into our daily lives; in recent years the service industry in particular has developed vigorously, and the demand of human society for mobile robots keeps growing. Path planning is a key problem to be solved in the robotics field. Path planning for mobile robots is a complex problem: an autonomous mobile robot is required to find an obstacle-free path from an initial position to a target position under the given constraints. As the environments that robots face become more complex, robots are required to anticipate obstacles and avoid collisions with them at a higher level.
Traditional navigation solutions such as genetic algorithms and simulated annealing perform well for navigation, but they all design a general solution under the assumption that the environment is known. As robots are used in every industry, the environments they work in become more and more complex, and these earlier solutions no longer handle such problems well. In recent years, deep reinforcement learning, which combines reinforcement learning with deep learning, has been widely applied to robot path navigation. Deep learning has unique advantages in feature extraction and object perception and is widely used in computer vision and related fields; reinforcement learning, by learning a policy through interaction with the environment, has strong decision-making ability for reaching a maximum return or a specific goal. Deep reinforcement learning, combining the two, has successfully solved robot navigation in complex environments. The Deep Deterministic Policy Gradient (DDPG) algorithm is one of the earliest deep reinforcement learning networks. As a classic deep reinforcement learning algorithm, DDPG is a policy-learning method for continuous, high-dimensional action spaces. Compared with earlier reinforcement learning methods, DDPG has great advantages for continuous control problems and has been applied in many fields such as robot path navigation, automatic driving, and robotic arm control.
However, sensitivity to hyperparameters and reward values that tend to diverge have long been problems that DDPG cannot solve well. In reinforcement learning the reward feedback R is usually hard-coded by hand, and because the reward of each step cannot be simply predicted, the reward function is usually designed to be sparse; the robot therefore cannot obtain immediate feedback, and its learning ability remains low.
In the process of implementing the invention, the inventor finds that the following technical problems exist in the prior art:
the robot path navigation realized based on the prior art has the problem of inaccurate navigation.
Disclosure of Invention
In order to solve the defects of the prior art, the invention provides a robot path navigation method and a system based on an improved DDPG algorithm;
in a first aspect, the invention provides a robot path navigation method based on an improved DDPG algorithm;
the robot path navigation method based on the improved DDPG algorithm comprises the following steps:
acquiring current state information and a target position of the robot;
inputting the current state information and the target position of the robot into the trained improved DDPG network to obtain optimal executable action data;
the robot completes collision-free path navigation according to the optimal executable action data;
wherein the improved DDPG network is based on the DDPG network, and the calculation of the reward value of the DDPG network is completed by utilizing a curiosity reward mechanism model; the curiosity reward mechanism model comprises: a plurality of LSTM models which are connected in series in sequence; in the LSTM models which are sequentially connected in series, the input ends of all the LSTM models are connected with the output end of the current network of the Actor, the output end of the last LSTM model is connected with the input end of the CNN model, and the output end of the CNN model is connected with the input end of the current network of the Actor.
In a second aspect, the present invention provides a robot path navigation system based on an improved DDPG algorithm;
a robot path navigation system based on an improved DDPG algorithm comprises:
an acquisition module configured to: acquiring current state information and a target position of the robot;
an output module configured to: inputting the current state information and the target position of the robot into the trained improved DDPG network to obtain optimal executable action data;
a navigation module configured to: the robot completes collision-free path navigation according to the optimal executable action data;
wherein the improved DDPG network is based on the DDPG network, and the calculation of the reward value of the DDPG network is completed by utilizing a curiosity reward mechanism model; the curiosity reward mechanism model comprises: a plurality of LSTM models which are connected in series in sequence; in the LSTM models which are sequentially connected in series, the input ends of all the LSTM models are connected with the output end of the current network of the Actor, the output end of the last LSTM model is connected with the input end of the CNN model, and the output end of the CNN model is connected with the input end of the current network of the Actor.
In a third aspect, the present invention further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device is running, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first aspect.
In a fourth aspect, the present invention also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
the invention utilizes the total sum of the internal reward generated by curiosity and the external reward of the algorithm as the total reward generated by the interaction of the robot and the environment. The reward function module is embedded with a long-short term memory artificial neural network (LSTM) and a Convolutional Neural Network (CNN). A plurality of past states are input into the LSTM network, the prediction of the next state is output, and the difference value between the predicted value and the actual state of the next state is used as an internal reward. In human society, people often have past experience in predicting what happens next, and embedding LSTM networks into curiosity mechanisms is just for reference to this mental feature. While using the CNN network to perform a reverse prediction of the action for the next state generated by the previous network. Curiosity has been considered by some scientists as one of the basic attributes of intelligence, and robot path navigation based on curiosity can make a robot more intelligent, and even under the condition that reward is sparse and even no external reward exists, the robot can feel like a human.
The invention borrows the thinking characteristics of human beings and embeds a curiosity mechanism in the reward function module. The most recent batch of states is input into the robot's curiosity mechanism as experience data, and an LSTM network with long short-term memory is used to predict the next state, so that the curiosity prediction preserves the time order. The difference between the predicted next state and the actual next state is used as the internal reward value, which alleviates the sparse-reward problem of the original DDPG algorithm.
The invention feeds the next state S_{t+1}' predicted by the LSTM network, together with the actual state S_t, into a CNN network with a feature-extraction function, which outputs a predicted value A_t' of the action A_t; the difference between the actual action A_t and the action A_t' predicted by the CNN network serves as a constraint. The LSTM network and the CNN network are trained simultaneously through back-propagation of the gradient. With the CNN module added, the key state features that influence the action can be extracted.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
FIG. 1 is a block diagram of an algorithm of a robot path navigation method based on improved curiosity and DDPG algorithm according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an LSTM model algorithm embedded in a curiosity mechanism according to an embodiment of the invention;
fig. 3 is a schematic diagram of CNN module algorithm embedded in the curiosity mechanism according to the embodiment of the present invention.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular is intended to include the plural unless the context clearly indicates otherwise. It should furthermore be understood that the terms "comprises" and "comprising", and any variations thereof, are intended to cover a non-exclusive inclusion, so that a process, method, system, article or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article or apparatus.
The embodiments and features of the embodiments of the invention may be combined with each other without conflict.
Example one
The embodiment provides a robot path navigation method based on an improved DDPG algorithm;
as shown in fig. 1, the robot path navigation method based on the improved DDPG algorithm includes:
s101: acquiring current state information and a target position of the robot;
s102: inputting the current state information and the target position of the robot into the trained improved DDPG network to obtain optimal executable action data;
s103: the robot completes collision-free path navigation according to the optimal executable action data;
wherein the improved DDPG network is based on the DDPG network, and the calculation of the reward value of the DDPG network is completed by utilizing a curiosity reward mechanism model; the curiosity reward mechanism model comprises: a plurality of LSTM models which are connected in series in sequence; in the LSTM models which are sequentially connected in series, the input ends of all the LSTM models are connected with the output end of the current network of the Actor, the output end of the last LSTM model is connected with the input end of the CNN model, and the output end of the CNN model is connected with the input end of the current network of the Actor.
As shown in fig. 3, the CNN model includes three convolutional layers connected in sequence.
The improved DDPG network is based on the DDPG network, with a stack structure added to its experience replay pool. The experience replay pool therefore stores two kinds of batches: samples obtained by the original random sampling, and time-ordered samples obtained through the stack structure. The time-ordered samples obtained through the stack structure are used for training the curiosity reward mechanism model, while the randomly sampled ones are used for training the Actor module and the Critic module of the DDPG network.
Further, the current state information includes: the current position of the robot, the current angular velocity of the robot, the current linear velocity of the robot and the current environment information of the robot.
Further, S102: inputting the current state information and the target position of the robot into the trained improved DDPG network to obtain optimal executable action data; the method specifically comprises the following steps:
and inputting the current state information and the target position of the robot into the trained improved DDPG network, and generating optimal executable action data by an Actor module of the improved DDPG network.
Further, S102: inputting the current state information and the target position of the robot into the trained improved DDPG network to obtain optimal executable action data; wherein, the improved DDPG network comprises:
the system comprises an Actor module, an experience playback pool and a criticic module which are connected in sequence;
the Actor module comprises an Actor current network and an Actor target network which are sequentially connected;
the Critic module comprises a Critic current network and a Critic target network which are sequentially connected;
wherein, the Actor current network is connected with all LSTM models of the curiosity reward mechanism model; the Actor current network is also connected to the output of the CNN model of the curiosity reward mechanism model.
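For concreteness, the following is a minimal PyTorch sketch of an Actor/Critic pair of the kind used by DDPG, matching the module structure described above; the layer widths, activations and the tanh-scaled action output are illustrative assumptions and are not specified by the patent.

```python
# Minimal sketch of the DDPG Actor/Critic pair (current networks); the target
# networks are created as copies of these. Layer sizes are assumed values.
import torch
import torch.nn as nn


class Actor(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, max_action: float = 1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Deterministic policy: maps the robot state to a continuous action
        # (e.g. linear and angular velocity), scaled to the action bounds.
        return self.max_action * self.net(state)


class Critic(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Q(s, a): estimated return of taking `action` in `state`.
        return self.net(torch.cat([state, action], dim=-1))
```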
Further, S102: inputting the current state information and the target position of the robot into the trained improved DDPG network to obtain optimal executable action data; wherein the training step of the trained improved DDPG network comprises the following steps:
s1021: constructing a training set; the training set comprises the state of the robot at each moment along known robot navigation paths;
s1022: and inputting the training set into the improved DDPG network to finish the training of an Actor module, a Critic module and a curiosity reward mechanism model of the improved DDPG network.
Further, the training of the curiosity reward mechanism model is completed, and the training step comprises the following steps:
s10221: in state S_t, the robot selects the corresponding action A_t and, by interacting with the environment, generates the next state S_{t+1} and a reward value R;
s10222: the experience data (S_t, A_t, R, S_{t+1}, done) generated by the interaction of the robot with the environment are stored in the experience replay pool; a stack structure is added to the replay pool so that the experience data can be accessed in time order, and done indicates whether the robot navigation has finished;
s10223: the time-ordered experience data in the stack structure are input into the LSTM network; as shown in FIG. 2, the first LSTM model receives only the robot state information of the corresponding moment, while every later LSTM model receives two inputs, the robot state information of the corresponding moment and the output value of the LSTM model of the previous moment; the last LSTM model outputs the predicted value S_{t+1}' of the robot state at the next moment;
s10224: the difference between the actual next state S_{t+1} and the predicted next state S_{t+1}' is taken as the internal reward r_i, and the sum of the internal reward r_i and the original external reward r_e is taken as the total reward R of the robot exploring the environment; the difference between the actual next state S_{t+1} and the predicted next state S_{t+1}' is also used as the first constraint condition in the training process;
s10225: the robot state S_t at the current moment and the predicted value S_{t+1}' of the robot state at the next moment are input into the convolutional neural network CNN, which outputs the inversely predicted action A_t';
s10226: the difference between the inversely predicted action A_t' and the actual action A_t is used as the second constraint condition in the training process, and the curiosity reward mechanism model is trained through back-propagation of the gradient, completing the training of the curiosity reward mechanism model.
It should be understood that the time ordering of the samples preserved as described in S10222 is independent of the original random sampling mechanism. To avoid overfitting during training, the correlation between samples must be broken, so DDPG usually selects a batch of data for network training by random sampling. To also obtain time-ordered samples for training the LSTM network, the invention adds a stack structure that is independent of the random sampling module: the last-in, first-out property of the stack maintains the time order of the samples, data are stored at the top of the stack, and a batch of data samples is taken from the top of the stack when data are fetched. When experience data are stored, they are written both into the stack mechanism and into the original queue mechanism. When data are fetched, the queue mechanism keeps the original random sampling mode and is used for training the Actor module and the Critic module of the network, while the stack mechanism keeps its stack fetching behaviour, which guarantees that the most recent experience data, in time order, are retrieved.
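A minimal sketch of such a replay pool is given below; the class and method names are hypothetical. Every transition is written into both a queue (sampled at random for the Actor/Critic updates) and a stack whose top always holds the most recent experience (read out as a time-ordered batch for the LSTM curiosity model).

```python
# Sketch of a replay pool with two access paths, assuming the dual
# queue/stack storage described above; names are illustrative only.
import random
from collections import deque


class ReplayPool:
    def __init__(self, capacity: int = 100_000):
        self.queue = deque(maxlen=capacity)  # queue mechanism: random sampling
        self.stack = deque(maxlen=capacity)  # stack mechanism: newest on top

    def store(self, s, a, r, s_next, done):
        transition = (s, a, r, s_next, done)
        self.queue.append(transition)        # arrival order
        self.stack.append(transition)        # top of stack = latest step

    def sample_random(self, batch: int):
        # Breaks temporal correlation; used to train the Actor and Critic.
        return random.sample(list(self.queue), batch)

    def sample_recent(self, batch: int):
        # Take the newest `batch` transitions from the top of the stack and
        # return them in chronological order for the LSTM predictor.
        return list(self.stack)[-batch:]
```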
It should be understood that, in S10223, a person's anticipation of what happens next is usually based on previous experience. Borrowing this human thinking characteristic, the time-ordered state sequence is input into the LSTM network, and the memory function of the LSTM is used to predict the next state S_{t+1}'. The specific calculation is as follows:
S_{t+1}' = L(S_{t-n}, S_{t-(n-1)}, ..., S_{t-2}, S_{t-1}, S_t; θ)
where S denotes the state at a given moment and θ denotes the parameters of the LSTM network. The predicted value S_{t+1}' of the next state is obtained after the time-ordered state sequence passes through the LSTM.
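One way to realise L(·; θ) is a recurrent network that consumes the window of recent states and maps the final hidden state to a next-state prediction, as in the sketch below; the hidden size and the single recurrent layer are assumptions (PyTorch's nn.LSTM performs internally the step-by-step chaining shown in FIG. 2).

```python
# Sketch of the next-state predictor S'_{t+1} = L(S_{t-n}, ..., S_t; theta).
# Hidden size and the single-layer LSTM are assumed for illustration.
import torch
import torch.nn as nn


class StatePredictor(nn.Module):
    def __init__(self, state_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, state_dim)

    def forward(self, state_seq: torch.Tensor) -> torch.Tensor:
        # state_seq: (batch, n + 1, state_dim), oldest state first.
        out, _ = self.lstm(state_seq)
        # The output at the last time step summarises the whole window and
        # is mapped to a prediction of the next state.
        return self.head(out[:, -1, :])
```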
It should be understood that, in S10224, the difference between the predicted next state S_{t+1}' and the next state S_{t+1} actually produced by the interaction with the environment is used as the internal reward value; at the same time, to keep the LSTM network from predicting absurd solutions, the difference between the actual next state S_{t+1} and the predicted value S_{t+1}' is used as a constraint for training the LSTM network. The specific calculation is as follows:
r_i = ||S_{t+1}' - S_{t+1}||
R = r_e + r_i
Min(||S_{t+1}' - S_{t+1}||)
where r_i is the internal reward generated by the improved curiosity mechanism, namely the difference between the actual next state S_{t+1} and the predicted next state S_{t+1}'; r_e is the external reward of the DDPG algorithm; and R, the sum of the internal and external rewards of the improved curiosity algorithm, is the total reward value. Min(||S_{t+1}' - S_{t+1}||) is the constraint of the LSTM network.
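These three relations map directly onto code. The sketch below assumes the L2 norm for the state gap and realises the Min(·) constraint as a mean-squared-error loss that is back-propagated through the LSTM.

```python
# Sketch of the curiosity reward: r_i = ||S'_{t+1} - S_{t+1}||, R = r_e + r_i,
# with the same gap serving as the LSTM training loss (the "Min" constraint).
import torch
import torch.nn.functional as F


def curiosity_reward(s_next_pred: torch.Tensor,
                     s_next: torch.Tensor,
                     r_external: torch.Tensor):
    # Internal reward: distance between the predicted and the actual next state.
    r_intrinsic = torch.norm(s_next_pred - s_next, dim=-1)
    # Total reward fed back to the DDPG agent.
    r_total = r_external + r_intrinsic
    # First constraint: minimising the same gap anchors the LSTM's
    # predictions to the states actually visited.
    lstm_loss = F.mse_loss(s_next_pred, s_next)
    return r_total.detach(), lstm_loss
```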
It should be understood that, in S10225 and S10226, the feature-extraction capability of the CNN network is used: the state S_t and the state S_{t+1}' predicted by the curiosity mechanism are input into the CNN network, which outputs the inversely predicted action A_t'; at the same time, the difference between the actual action A_t and the predicted action A_t' is used as another constraint. With the CNN network, the key state features that influence the action can be extracted. The specific calculation is as follows:
A_t' = H(S_t, S_{t+1}'; w)
Min(A_t, A_t')
where w denotes the parameters of the CNN network and A_t' is the predicted value of the action A_t generated by the CNN network.
Through the first constraint and the second constraint generated on the predicted action, the LSTM network and the CNN network can be trained simultaneously by back-propagation of the gradient.
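As an illustration of the inverse model A_t' = H(S_t, S_{t+1}'; w), the sketch below stacks the two state vectors as a two-channel one-dimensional signal and passes them through three convolutional layers, in line with the three-layer CNN of FIG. 3; the channel counts, kernel sizes and the one-dimensional treatment of the state are assumptions.

```python
# Sketch of the inverse model A'_t = H(S_t, S'_{t+1}; w): a small CNN that
# regresses the action connecting the current state to the predicted next
# state. Channel/kernel sizes are assumed values.
import torch
import torch.nn as nn


class InverseActionModel(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(2, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(32 * state_dim, action_dim)

    def forward(self, s_t: torch.Tensor, s_next_pred: torch.Tensor) -> torch.Tensor:
        # Stack the two state vectors as channels: (batch, 2, state_dim).
        x = torch.stack([s_t, s_next_pred], dim=1)
        features = self.conv(x)                        # (batch, 32, state_dim)
        return self.head(features.flatten(start_dim=1))


# Second constraint: e.g. F.mse_loss(a_pred, a_actual); back-propagating it
# together with the state-prediction loss trains the CNN and the LSTM jointly.
```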
At the same time, a CNN network with a feature-extraction function is embedded: the next state predicted by the LSTM network and the actual current state are used as its input, and the CNN network outputs a predicted value of the action. The difference between the actual action and the action predicted by the CNN network is taken as a constraint. The LSTM network and the CNN network are trained simultaneously through back-propagation of the gradient. With the CNN module added, the key state features that influence the action can be extracted.
A batch of experience data is selected from the experience replay pool by random sampling to train the Critic network and the Actor network, and the parameters are updated through back-propagation of the gradient.
The network parameters are copied from the current networks to the target networks at regular intervals using a soft update between the current networks and the target networks.
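The soft update can be written as θ_target ← τ·θ + (1 - τ)·θ_target for each parameter pair; a minimal sketch with a typical, assumed value of τ follows.

```python
# Soft update: every target parameter drifts toward its current-network
# counterpart. tau = 0.005 is an assumed, commonly used value.
import torch


@torch.no_grad()
def soft_update(current_net: torch.nn.Module, target_net: torch.nn.Module,
                tau: float = 0.005) -> None:
    for p, p_target in zip(current_net.parameters(), target_net.parameters()):
        p_target.mul_(1.0 - tau).add_(p, alpha=tau)
```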
The LSTM network, as shown in fig. 2, borrows the thinking characteristics of human beings: in human society, people usually rely on past experience to predict what happens next, and the invention embeds a curiosity mechanism with this property in the reward function module. The most recent batch of states is input into the robot's curiosity mechanism as experience data, and the LSTM network with its long short-term memory is used to predict the next state; because the time-ordered state sequence is fed into the LSTM and its memory function is used for prediction, the curiosity prediction preserves the time order. The difference between the predicted next state and the actual next state is used as the internal reward value, and, to keep the LSTM from predicting absurd solutions, the difference between the actual next state and its predicted value is used as a constraint condition.
The CNN network module, as shown in fig. 3, embeds a CNN network with a feature-extraction function in the curiosity mechanism; it takes the next state predicted by the LSTM network and the actual current state as inputs, outputs a predicted value of the action, and takes the difference between the actual action and the action predicted by the CNN network as a constraint condition. The LSTM network and the CNN network are trained simultaneously through back-propagation of the gradient. With the CNN module added, the key state features that influence the action can be extracted.
Example two
The embodiment provides a robot path navigation system based on an improved DDPG algorithm;
a robot path navigation system based on an improved DDPG algorithm comprises:
an acquisition module configured to: acquiring current state information and a target position of the robot;
an output module configured to: inputting the current state information and the target position of the robot into the trained improved DDPG network to obtain optimal executable action data;
a navigation module configured to: the robot completes collision-free path navigation according to the optimal executable action data;
wherein the improved DDPG network is based on the DDPG network, and the calculation of the reward value of the DDPG network is completed by utilizing a curiosity reward mechanism model; the curiosity reward mechanism model comprises: a plurality of LSTM models which are connected in series in sequence; in the LSTM models which are sequentially connected in series, the input ends of all the LSTM models are connected with the output end of the current network of the Actor, the output end of the last LSTM model is connected with the input end of the CNN model, and the output end of the CNN model is connected with the input end of the current network of the Actor.
It should be noted here that the acquisition module, the output module and the navigation module described above correspond to steps S101 to S103 of the first embodiment; the examples and application scenarios realized by these modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the first embodiment. It should also be noted that the above modules, as part of a system, may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical functional division, and in actual implementation, there may be another division, for example, a plurality of modules may be combined or may be integrated into another system, or some features may be omitted, or not executed.
EXAMPLE III
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method of the first embodiment may be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in storage media well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or registers. The storage medium is located in a memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, the details are not described here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Example four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. The robot path navigation method based on the improved DDPG algorithm is characterized by comprising the following steps:
acquiring current state information and a target position of the robot;
inputting the current state information and the target position of the robot into the trained improved DDPG network to obtain optimal executable action data;
the robot completes collision-free path navigation according to the optimal executable action data;
wherein the improved DDPG network is based on the DDPG network, and the calculation of the reward value of the DDPG network is completed by utilizing a curiosity reward mechanism model; the curiosity reward mechanism model comprises: a plurality of LSTM models which are connected in series in sequence; in the LSTM models which are sequentially connected in series, the input ends of all LSTM models are connected with the output end of the current network of the Actor, the output end of the last LSTM model is connected with the input end of the CNN model, and the output end of the CNN model is connected with the input end of the current network of the Actor;
completing the training of a curiosity reward mechanism model, wherein the training step comprises the following steps:
(a) In state S_t, the robot selects the corresponding action A_t and, by interacting with the environment, generates the next state S_{t+1} and a reward value R;
(b) The experience data (S_t, A_t, R, S_{t+1}, done) generated by the interaction of the robot with the environment are stored in the experience replay pool; a stack structure is added to the replay pool so that the experience data can be accessed in time order, and done indicates whether the robot navigation has finished;
(c) The time-ordered experience data in the stack structure are input into the LSTM network; as shown in FIG. 2, the first LSTM model receives only the robot state information of the corresponding moment, while every later LSTM model receives two inputs, the robot state information of the corresponding moment and the output value of the LSTM model of the previous moment; the last LSTM model outputs the predicted value S_{t+1}' of the robot state at the next moment;
(d) The difference between the actual next state S_{t+1} and the predicted next state S_{t+1}' is taken as the internal reward r_i, and the sum of the internal reward r_i and the original external reward r_e is taken as the total reward R of the robot exploring the environment; the difference between the actual next state S_{t+1} and the predicted next state S_{t+1}' is also used as the first constraint condition in the training process;
(e) The robot state S_t at the current moment and the predicted value S_{t+1}' of the robot state at the next moment are input into the convolutional neural network CNN, which outputs the inversely predicted action A_t';
(f) The difference between the inversely predicted action A_t' and the actual action A_t is used as the second constraint condition in the training process, and the curiosity reward mechanism model is trained through back-propagation of the gradient, completing the training of the curiosity reward mechanism model;
the improved DDPG network is based on the DDPG network, and a stack structure is additionally arranged on an experience playback pool of the DDPG network; storing two batches of data in an experience playback pool, wherein one batch of data is a sample obtained by original random sampling, and the other batch of data is a time sequence sample obtained by a stack structure; the time sequence sample obtained by the stack structure is used for training a curiosity reward mechanism model; and randomly sampling the obtained samples for use in the training of an Actor module and a Critic module of the DDPG network.
2. The robot path navigation method based on the improved DDPG algorithm as claimed in claim 1, wherein the current state information and the target position of the robot are inputted into the trained improved DDPG network to obtain the optimal executable action data; the method specifically comprises the following steps:
and inputting the current state information and the target position of the robot into the trained improved DDPG network, and generating optimal executable action data by an Actor module of the improved DDPG network.
3. The robot path navigation method based on the improved DDPG algorithm as claimed in claim 1, wherein the current state information and the target position of the robot are inputted into the trained improved DDPG network to obtain the optimal executable action data; wherein, the improved DDPG network comprises:
the Actor module, the experience playback pool and the Critic module are connected in sequence;
the Actor module comprises an Actor current network and an Actor target network which are sequentially connected;
the Critic module comprises a Critic current network and a Critic target network which are sequentially connected;
wherein, the Actor current network is connected with all LSTM models of the curiosity reward mechanism model; the Actor current network is also connected to the output of the CNN model of the curiosity reward mechanism model.
4. The robot path navigation method based on the improved DDPG algorithm as claimed in claim 1, wherein the current state information and the target position of the robot are inputted into the trained improved DDPG network to obtain the optimal executable action data; wherein the training step of the trained improved DDPG network comprises the following steps:
(1) Constructing a training set; the training set comprises the state of the robot with known robot navigation paths at each moment;
(2) And inputting the training set into the improved DDPG network to finish the training of an Actor module, a Critic module and a curiosity reward mechanism model of the improved DDPG network.
5. The robot path navigation method based on the improved DDPG algorithm of claim 1, wherein the current state information comprises: the current position of the robot, the current angular velocity of the robot, the current linear velocity of the robot and the current environment information of the robot.
6. The robot path navigation system based on the improved DDPG algorithm is characterized by comprising the following steps:
an acquisition module configured to: acquiring current state information and a target position of the robot;
an output module configured to: inputting the current state information and the target position of the robot into the trained improved DDPG network to obtain optimal executable action data;
a navigation module configured to: the robot completes collision-free path navigation according to the optimal executable action data;
wherein the improved DDPG network is based on the DDPG network, and the calculation of the reward value of the DDPG network is completed by utilizing a curiosity reward mechanism model; the curiosity reward mechanism model comprises: a plurality of LSTM models which are connected in series in sequence; in the LSTM models which are sequentially connected in series, the input ends of all the LSTM models are connected with the output end of the current network of the Actor, the output end of the last LSTM model is connected with the input end of the CNN model, and the output end of the CNN model is connected with the input end of the current network of the Actor;
finishing the training of the curiosity reward mechanism model, wherein the training step comprises the following steps:
(a) In state S_t, the robot selects the corresponding action A_t and, by interacting with the environment, generates the next state S_{t+1} and a reward value R;
(b) The experience data (S_t, A_t, R, S_{t+1}, done) generated by the interaction of the robot with the environment are stored in the experience replay pool; a stack structure is added to the replay pool so that the experience data can be accessed in time order, and done indicates whether the robot navigation has finished;
(c) The time-ordered experience data in the stack structure are input into the LSTM network; as shown in FIG. 2, the first LSTM model receives only the robot state information of the corresponding moment, while every later LSTM model receives two inputs, the robot state information of the corresponding moment and the output value of the LSTM model of the previous moment; the last LSTM model outputs the predicted value S_{t+1}' of the robot state at the next moment;
(d) The difference between the actual next state S_{t+1} and the predicted next state S_{t+1}' is taken as the internal reward r_i, and the sum of the internal reward r_i and the original external reward r_e is taken as the total reward R of the robot exploring the environment; the difference between the actual next state S_{t+1} and the predicted next state S_{t+1}' is also used as the first constraint condition in the training process;
(e) The robot state S_t at the current moment and the predicted value S_{t+1}' of the robot state at the next moment are input into the convolutional neural network CNN, which outputs the inversely predicted action A_t';
(f) The difference between the inversely predicted action A_t' and the actual action A_t is used as the second constraint condition in the training process, and the curiosity reward mechanism model is trained through back-propagation of the gradient, completing the training of the curiosity reward mechanism model;
the improved DDPG network is based on the DDPG network, and a stack structure is additionally arranged on an experience playback pool of the DDPG network; storing two batches of data in an experience playback pool, wherein one batch of data is a sample obtained by original random sampling, and the other batch of data is a time sequence sample obtained by a stack structure; the time sequence sample obtained by the stack structure is used for training a curiosity reward mechanism model; and randomly sampling the obtained samples for use in the training of an Actor module and a Critic module of the DDPG network.
7. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein a processor is coupled to the memory, the one or more computer programs being stored in the memory, and wherein when the electronic device is running, the processor executes the one or more computer programs stored in the memory to cause the electronic device to perform the method of any of the preceding claims 1-5.
8. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 5.
CN202110512658.0A 2021-05-11 2021-05-11 Robot path navigation method and system based on improved DDPG algorithm Active CN113408782B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110512658.0A CN113408782B (en) 2021-05-11 2021-05-11 Robot path navigation method and system based on improved DDPG algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110512658.0A CN113408782B (en) 2021-05-11 2021-05-11 Robot path navigation method and system based on improved DDPG algorithm

Publications (2)

Publication Number Publication Date
CN113408782A CN113408782A (en) 2021-09-17
CN113408782B true CN113408782B (en) 2023-01-31

Family

ID=77678380

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110512658.0A Active CN113408782B (en) 2021-05-11 2021-05-11 Robot path navigation method and system based on improved DDPG algorithm

Country Status (1)

Country Link
CN (1) CN113408782B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114355980B (en) * 2022-01-06 2024-03-08 上海交通大学宁波人工智能研究院 Four-rotor unmanned aerial vehicle autonomous navigation method and system based on deep reinforcement learning
CN116578094A (en) * 2023-06-05 2023-08-11 中科南京智能技术研究院 Autonomous obstacle avoidance planning method, device and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110488859A (en) * 2019-07-15 2019-11-22 北京航空航天大学 A kind of Path Planning for UAV based on improvement Q-learning algorithm
CN111487864A (en) * 2020-05-14 2020-08-04 山东师范大学 Robot path navigation method and system based on deep reinforcement learning
CN111523731A (en) * 2020-04-24 2020-08-11 山东师范大学 Crowd evacuation movement path planning method and system based on Actor-Critic algorithm
CN112629542A (en) * 2020-12-31 2021-04-09 山东师范大学 Map-free robot path navigation method and system based on DDPG and LSTM

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112668235B (en) * 2020-12-07 2022-12-09 中原工学院 Robot control method based on off-line model pre-training learning DDPG algorithm

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110488859A (en) * 2019-07-15 2019-11-22 北京航空航天大学 A kind of Path Planning for UAV based on improvement Q-learning algorithm
CN111523731A (en) * 2020-04-24 2020-08-11 山东师范大学 Crowd evacuation movement path planning method and system based on Actor-Critic algorithm
CN111487864A (en) * 2020-05-14 2020-08-04 山东师范大学 Robot path navigation method and system based on deep reinforcement learning
CN112629542A (en) * 2020-12-31 2021-04-09 山东师范大学 Map-free robot path navigation method and system based on DDPG and LSTM

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Curiosity-driven Exploration for Mapless Navigation with Deep Reinforcement Learning; Oleksii Zhelo et al.; arXiv:1804.00456v2; 2018-05-14; second page *

Also Published As

Publication number Publication date
CN113408782A (en) 2021-09-17

Similar Documents

Publication Publication Date Title
CN112937564B (en) Lane change decision model generation method and unmanned vehicle lane change decision method and device
CN110168578B (en) Multi-tasking neural network with task-specific paths
CN113408782B (en) Robot path navigation method and system based on improved DDPG algorithm
KR102481885B1 (en) Method and device for learning neural network for recognizing class
JP7510637B2 (en) How to generate a general-purpose trained model
JP7367233B2 (en) System and method for robust optimization of reinforcement learning based on trajectory-centered models
US11080586B2 (en) Neural network reinforcement learning
CN112596515B (en) Multi-logistics robot movement control method and device
US11182676B2 (en) Cooperative neural network deep reinforcement learning with partial input assistance
CN112119404A (en) Sample efficient reinforcement learning
WO2014018793A1 (en) Apparatus and methods for efficient updates in spiking neuron networks
CN112115352A (en) Session recommendation method and system based on user interests
CN108830376B (en) Multivalent value network deep reinforcement learning method for time-sensitive environment
JP7139524B2 (en) Control agents over long timescales using time value transfer
EP3502978A1 (en) Meta-learning system
US20220036186A1 (en) Accelerated deep reinforcement learning of agent control policies
CN112904852B (en) Automatic driving control method and device and electronic equipment
Lee et al. Hierarchical emotional episodic memory for social human robot collaboration
JP2024506025A (en) Attention neural network with short-term memory unit
Bakker Reinforcement learning by backpropagation through an LSTM model/critic
JP7283774B2 (en) ELECTRONIC APPARATUS AND OPERATING METHOD THEREOF, AND COMPUTER PROGRAM FOR PRECISE BEHAVIOR PROFILING FOR IMPLANTING HUMAN INTELLIGENCE TO ARTIFICIAL INTELLIGENCE
US20240143975A1 (en) Neural network feature extractor for actor-critic reinforcement learning models
WO2023123838A1 (en) Network training method and apparatus, robot control method and apparatus, device, storage medium, and program
Liu et al. Forward-looking imaginative planning framework combined with prioritized-replay double DQN
KR20210113939A (en) Electronic device for high-precision profiling to develop artificial inntelligence with human-like intelligence, and operating method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant