CN111645076B - Robot control method and equipment - Google Patents

Robot control method and equipment

Info

Publication number
CN111645076B
CN111645076B (application CN202010552467.2A)
Authority
CN
China
Prior art keywords
executed
action
actions
current state
exploration
Prior art date
Legal status
Active
Application number
CN202010552467.2A
Other languages
Chinese (zh)
Other versions
CN111645076A (en)
Inventor
王东署
胡宇航
罗勇
辛健斌
王河山
马天磊
贾建华
张方方
陈书立
Current Assignee
Zhengzhou University
Original Assignee
Zhengzhou University
Priority date
Filing date
Publication date
Application filed by Zhengzhou University
Priority to CN202010552467.2A
Publication of CN111645076A
Application granted
Publication of CN111645076B

Classifications

    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 - Programme-controlled manipulators
    • B25J9/16 - Programme controls
    • B25J9/1602 - Programme controls characterised by the control system, structure, architecture
    • B25J9/1628 - Programme controls characterised by the control loop
    • B25J9/163 - Programme controls characterised by the control loop: learning, adaptive, model based, rule based expert control
    • B25J9/1656 - Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664 - Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Automation & Control Theory (AREA)
  • Manipulator (AREA)
  • Feedback Control In General (AREA)

Abstract

The method and the device dynamically adjust the exploration speed by simulating the neural regulation mechanism of the anterior cingulate cortex in the brain physiology of primates, and adjust the degree of exploration and exploitation in real time according to the environment, so that a dynamic balance between exploiting the environment and exploring the environment is achieved, the learning convergence speed of the robot's behavior decision process is improved, and a better global solution is obtained.

Description

Robot control method and equipment
Technical Field
The present application relates to the field of computers, and in particular, to a robot control method and apparatus.
Background
In the prior art, a robot needs to learn in an unknown environment and adapt to it, and reinforcement learning is one of the key technologies for doing so. The advantage of reinforcement learning is that it does not require a given desired output; instead, through online interactive training driven by reinforcement signals, it obtains a better-performing control strategy by maximizing the cumulative return the robot receives while moving in the environment. Reinforcement learning is therefore often used in the study of robot behavior decisions.
At present, an important problem faced by reinforcement learning algorithms is the balance between exploring the environment and exploiting it, and how well exploration and exploitation are coordinated directly influences the efficiency of reinforcement learning. Exploration means that during learning the robot traverses more of the state space in the environment, usually at the cost of some short-term benefit; by gathering enough information it can learn a behavior strategy that is better overall. However, because robot behavior decisions involve high dimensionality, large action sets and complex information, excessive exploration causes the curse of dimensionality and slow learning convergence in reinforcement learning, greatly increases the amount of computation, and makes it difficult to meet the real-time requirements of behavior decisions. Exploitation means that after learning independently for a period of time the robot has formed a certain behavior strategy; at this point, in order to let the robot obtain a larger reward and accelerate learning convergence, its exploratory behavior is gradually reduced and converted into exploitation, that is, the robot no longer explores unknown action strategies but selects the action strategy that is optimal under the current information according to the experience it has learned. But if the environment is exploited too early or too heavily, it is difficult for the system to learn the optimal behavior strategy. Therefore, a proper exploration-exploitation coordination and balancing mechanism is critical to the efficiency of the reinforcement learning algorithm.
Algorithms for balancing exploration and exploitation generally fall into two broad categories: undirected and directed methods. The undirected methods among current action selection methods require fine tuning of exploration parameters; their drawback is that they do not consider the uncertainty of each action's expected reward, and the correct values of the exploration parameters can only be determined after many simulations. The drawback of directed methods is that a large number of complex calculations are required to converge to an optimal solution.
Currently, the exploration-exploitation balancing strategy commonly used in robot reinforcement learning is an indirect selection strategy, which ignores the uncertainty of the environment during learning and uses selection probabilities to balance exploration and exploitation; such strategies include several commonly used methods, for example the ε-greedy method, the Boltzmann distribution method and heuristic action selection. The ε-greedy strategy is widely used because it is simple to implement, but the parameter ε is a fixed value, so the problem of balancing exploration and exploitation in a dynamic learning process remains, which affects the learning rate and efficiency of the algorithm to a certain extent. The Boltzmann distribution method works with action selection probabilities: it associates the selection of an action with a value function and adjusts the selection probability of each action using a temperature parameter. Its disadvantage is that the initial value of the temperature parameter is uncertain, and the parameter setting has a certain influence on the learning rate and efficiency of the algorithm. In short, none of these methods can adjust the degree of exploration or exploitation in real time according to the robot's environment, which leads to defects such as poor adaptability, slow convergence and getting stuck in local optima.
Therefore, a robot control method that can achieve a dynamic balance between exploiting and exploring the environment during robot learning, adjust the degrees of exploration and exploitation in real time according to the environment, converge quickly and yield a better global solution after stabilization is a direction that those skilled in the art need to keep researching.
Disclosure of Invention
An object of the present application is to provide a robot control method and apparatus, so as to solve the problem in the prior art of how to adjust the degree of exploration and utilization in the robot learning process, thereby increasing the learning convergence rate and obtaining a better global solution.
According to an aspect of the present application, there is provided a robot control method including:
acquiring a current state, at least two actions to be executed and corresponding weights of the actions, wherein the current state comprises a current environment and reward information, and determining a reward prediction error signal based on the current state, the at least two actions to be executed and the corresponding weights of the actions;
based on the reward prediction error signal, adjusting the exploration speed through an anterior cingulate cortex neural regulation mechanism to obtain the exploration speed corresponding to the current state;
and determining and executing the optimal action to be executed from all the actions to be executed based on the exploration speed, all the actions to be executed and the corresponding weights thereof.
Further, in the above robot control method, the adjusting the exploration speed through an anterior cingulate cortex neural regulation mechanism based on the reward prediction error signal to obtain the exploration speed corresponding to the current state includes:
determining, based on the reward prediction error signal, a correct neuron response value and an incorrect neuron response value through the anterior cingulate cortex neural regulation mechanism;
acquiring a correct neuron response update rate and an incorrect neuron response update rate, and calculating the alertness value corresponding to the current state using the correct neuron response value and the incorrect neuron response value corresponding to the current state together with the correct neuron response update rate and the incorrect neuron response update rate;
and adjusting the exploration speed through the alertness value to obtain the exploration speed corresponding to the current state.
Further, in the robot control method, the determining and executing the optimal action to be executed from all the actions to be executed based on the exploration speed, all the actions to be executed and their corresponding weights includes:
performing equation conversion based on the exploration speed, all the actions to be executed and the weights corresponding to the actions to be executed to obtain the execution probability of each action to be executed corresponding to the current state;
and determining the optimal action to be executed based on the execution probability of each action to be executed and executing.
Further, in the robot control method, the determining and executing the optimal action to be performed based on the execution probability of each action to be performed includes:
obtaining the similarity of the execution probability of all the actions to be executed based on the execution probability of each action to be executed;
if the similarity of the execution probabilities of all the actions to be executed is greater than a similarity threshold, randomly selecting one action to be executed from all the actions to be executed as the optimal action to be executed and executing;
and if the similarity of the execution probabilities of all the actions to be executed is less than or equal to the similarity threshold, taking the action to be executed with the highest execution probability in all the actions to be executed as the optimal action to be executed and executing.
Further, the robot control method further includes:
acquiring an updating state after the optimal action to be executed is executed;
and updating the action to be executed and the corresponding weight thereof based on the updating state.
Further, in the robot control method, updating the weight corresponding to the action to be performed based on the update status includes:
judging whether the optimal action to be executed is collided or not based on the updating state;
and if the collision does not occur, updating the action to be executed and the weight corresponding to the action to be executed based on the current state, the optimal action to be executed and the updating state to obtain the updated weight corresponding to the action to be executed.
According to another aspect of the present application, there is also provided a computer readable medium having computer readable instructions stored thereon, which, when executed by a processor, cause the processor to implement the method of any one of the above.
According to another aspect of the present application, there is also provided a robot control apparatus including:
one or more processors;
a computer-readable medium for storing one or more computer-readable instructions,
when executed by the one or more processors, cause the one or more processors to implement a method as in any one of the above.
Compared with the prior art, the method and the device obtain the current state of the robot, at least two actions to be executed and their corresponding weights, where the current state includes the current environment and reward feedback information obtained based on the current environment information; based on the reward feedback information, the exploration speed is adjusted through the anterior cingulate cortex neural regulation mechanism to obtain the exploration speed corresponding to the current state; and the optimal action to be executed is determined from all the actions to be executed and executed based on the exploration speed, all the actions to be executed and their corresponding weights. That is, the exploration speed is dynamically adjusted by simulating the anterior cingulate cortex neural regulation mechanism in the brain physiology of primates, and the degree of exploration and exploitation is adjusted in real time according to the environment, so that a dynamic balance between exploiting and exploring the environment is achieved, the learning convergence speed of the robot's behavior decision process is improved, and a better global solution is obtained.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 illustrates a flow diagram of a robot control method in accordance with an aspect of the subject application;
FIG. 2 illustrates an ACC neuromodulation mechanism-based behavioral decision model schematic of a robotic control method according to an aspect of the present application;
FIG. 3 illustrates an ACC neural network architecture diagram of a robot control method in accordance with an aspect of the subject application;
FIG. 4 illustrates a flow diagram of an embodiment of a robot control method in accordance with an aspect of the subject application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (e.g., Central Processing Units (CPUs)), input/output interfaces, network interfaces, and memory.
The Memory may include volatile Memory in a computer readable medium, Random Access Memory (RAM), and/or nonvolatile Memory such as Read Only Memory (ROM) or flash Memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, Phase-Change RAM (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, magnetic cassette tape, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media does not include transitory computer readable media (transitory media), such as modulated data signals and carrier waves.
Fig. 1 is a schematic flowchart of a robot control method according to an aspect of the present application, the method is applicable to various motion scenarios of a mobile robot, and the method includes steps S11, S12, and S13, where the method specifically includes:
step S11, obtaining the current state of the robot, at least two actions to be executed and their corresponding weights, where the current state includes the current environment and reward information, and determining a reward prediction error signal based on the current state, the at least two actions to be executed and their corresponding weights, where the reward prediction error signal is a reinforcement signal of the reward prediction error and is calculated by the following formula:
δ(t) = R_t + γ·max_{a∈A} Q(s_{t+1}, a) − Q(s_t, a_i)
wherein R_t represents the reward information obtained at time t when the robot interacts with the environment in real time; γ is the discount factor; Q(s, a) is the expected value, in reinforcement learning, of selecting and executing the action a to be executed in the current state s; s_t represents the state of the robot at time t; i is the number of the action to be executed, with value range i = 1, 2, …, n; a_i is the action to be executed with serial number i; A is the set of all actions to be executed a_i.
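As an illustration of how this reward prediction error could be computed, a minimal Python sketch follows; the dictionary-based Q table and the names reward_prediction_error, actions and gamma are choices made here for clarity and are not taken from the patent.

```python
def reward_prediction_error(Q, s, a_i, s_next, R_t, actions, gamma=0.9):
    """Reward prediction error delta for executing action a_i in state s.

    Q       : dict mapping (state, action) -> expected value Q(s, a)
    s, a_i  : current state and the action that was executed
    s_next  : state s_{t+1} observed after executing a_i
    R_t     : reward information received at time t
    actions : the set A of all actions to be executed a_i
    gamma   : discount factor
    """
    # Best expected value obtainable from the next state over all actions in A
    best_next = max(Q.get((s_next, a), 0.0) for a in actions)
    # delta(t) = R_t + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_i)
    return R_t + gamma * best_next - Q.get((s, a_i), 0.0)
```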
Step S12, based on the reward prediction error signal, adjusting the exploration speed through the anterior cingulate cortex (ACC) neural regulation mechanism to obtain the exploration speed corresponding to the current state, where the exploration speed indicates the degree to which the robot explores the environment. Adjusting the exploration speed using the ACC neural regulation mechanism realizes a balance between exploration and exploitation for the robot: during the robot's learning process it can try different actions and explore unknown action strategies while also selecting the optimal action to be executed in the current state according to the experience it has learned, which improves the robot's learning, cognition and evolution abilities so that the optimal action to be executed can be obtained.
And step S13, determining and executing the optimal action to be executed from all the actions to be executed based on the exploration speed, all the actions to be executed and the corresponding weights thereof, thereby realizing obtaining a better global solution.
In the above steps S11 to S13, first, the current state of the robot, at least two actions to be executed and their corresponding weights are obtained, where the current state includes the current environment, and reward feedback information is obtained based on the current environment information; then, based on the reward feedback information, the exploration speed is adjusted through the anterior cingulate cortex neural regulation mechanism to obtain the exploration speed corresponding to the current state; finally, the optimal action to be executed is determined from all the actions to be executed and executed based on the exploration speed, all the actions to be executed and their corresponding weights. That is, the exploration speed is dynamically adjusted by simulating the anterior cingulate cortex neural regulation mechanism in the brain physiology of primates, and the degree of exploration and exploitation is adjusted in real time according to the environment, so that a dynamic balance between exploiting and exploring the environment is achieved, the learning convergence speed of the robot's behavior decision process is improved, and a better global solution is obtained.
For example, as shown in fig. 2, the model is based on the Actor-Critic algorithm in reinforcement learning: the Actor part is a three-layer radial basis function neural network, and the Critic part is a Q-learning algorithm. First, the current state s of the robot, the actions to be executed a_1, a_2, a_3, …, a_n and their corresponding weights W_1, W_2, W_3, …, W_n are obtained, where the current state s includes the current environment and reward information R, and a reward prediction error signal δ is determined based on the current state s, the at least two actions to be executed and their corresponding weights. Then, based on the reward prediction error signal δ, the exploration speed is adjusted through the anterior cingulate cortex neural regulation mechanism to obtain the exploration speed β corresponding to the current state. Finally, based on the exploration speed β, all the actions to be executed a_1, a_2, a_3, …, a_n and their corresponding weights W_1, W_2, W_3, …, W_n, the optimal action to be executed is determined from all the actions a_1, a_2, a_3, …, a_n and executed. That is, the exploration speed β is dynamically adjusted by simulating the anterior cingulate cortex neural regulation mechanism in the brain physiology of primates, and the degree of exploration and exploitation is adjusted in real time according to the environment, so that dynamic switching between exploiting and exploring the environment is realized, the learning convergence speed of the robot's behavior decision process is improved, and a better global solution is obtained after stabilization.
Following the above embodiment of the present application, the step S12 of adjusting the exploration speed through the anterior cingulate cortex neural regulation mechanism based on the reward prediction error signal to obtain the exploration speed corresponding to the current state includes:
determining, through the anterior cingulate cortex neural regulation mechanism, a correct neuron response value and an error neuron response value based on the reward prediction error signal, where the reward prediction error signal affects a set of feedback classification neurons in the ACC: correct neurons (COR) and error neurons (ERR). The reward prediction error signal can be negative or positive; when a negative reward prediction error signal is obtained, the error neurons ERR in the ACC respond, and when a positive reinforcement signal is obtained, the correct neurons COR in the anterior cingulate cortex ACC respond. The correct neuron response value and the error neuron response value determined through the ACC neural regulation mechanism are used for the subsequent adjustment of the exploration speed.
Acquiring a correct neuron response update rate and an incorrect neuron response update rate, and calculating the alertness value corresponding to the current state using the correct neuron response value and the incorrect neuron response value corresponding to the current state together with the correct neuron response update rate and the incorrect neuron response update rate, where the alertness value indicates the degree of alertness of the robot in the current environment. The alertness value is introduced in this method to simulate the different reactions of a human under different degrees of alertness, so that the robot switches dynamically between exploiting and exploring the environment. The alertness value is calculated by the following formula:
β*(t) ← β*(t) + μ⁺δ⁺(t) + μ⁻δ⁻(t)
where μ⁺ is the correct neuron response update rate, μ⁻ is the incorrect neuron response update rate, δ⁺(t) is the correct neuron response value, and δ⁻(t) is the incorrect neuron response value.
Adjusting the exploration speed through the alertness value to obtain the exploration speed corresponding to the current state: if the robot selects the optimal action to be executed, the alertness value decreases and the exploration speed is reduced; at this point the robot continues to repeat the current action, i.e. it exploits the environment. Conversely, when a non-optimal action to be executed is selected, the alertness value increases and the exploration speed increases; at this point the robot should adjust its action selection in time, i.e. it may try to randomly select an action to be executed, which manifests as exploring the environment. The exploration speed is obtained by filtering the alertness value through a sigmoid function, with the following formula:
β = ω₁ / (1 + exp(ω₂·[1 − β*] + ω₃))
where ω₁, ω₂ and ω₃ are all constants, with ω₁ > ω₃ > 0 and ω₂ < 0, and β* is the alertness value.
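A minimal Python sketch of the two formulas above is given below; the numeric values for μ⁺, μ⁻, ω₁, ω₂ and ω₃ are placeholders chosen only to satisfy the stated constraints (ω₁ > ω₃ > 0, ω₂ < 0), not values taken from the patent.

```python
import math

def update_alertness(beta_star, delta_plus, delta_minus, mu_plus=0.1, mu_minus=0.2):
    """Alertness update: beta*(t) <- beta*(t) + mu+ * delta+(t) + mu- * delta-(t)."""
    return beta_star + mu_plus * delta_plus + mu_minus * delta_minus

def exploration_speed(beta_star, w1=10.0, w2=-5.0, w3=1.0):
    """Sigmoid filtering: beta = w1 / (1 + exp(w2 * (1 - beta*) + w3))."""
    return w1 / (1.0 + math.exp(w2 * (1.0 - beta_star) + w3))

# Example usage with placeholder neuron response values
beta_star = update_alertness(0.5, delta_plus=0.2, delta_minus=0.0)
beta = exploration_speed(beta_star)
```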
For example, as shown in fig. 3 and fig. 4, first the current state s of the robot and at least two actions to be executed a_1, a_2, a_3, …, a_n with their corresponding weights W_1, W_2, W_3, …, W_n are obtained, where the current state includes the current environment V. The visual input, i.e. the current environment V (for example objects seen on a screen or objects on a table), is fed into the posterior parietal cortex; reward information R is then received at the ventral tegmental area (VTA), and the reward prediction error signal δ is calculated from this reward information R. A set of feedback classification neurons in the ACC, the correct neurons and the error neurons, adjust the required alertness value by means of the reward prediction error signal δ. Finally, the actions to be executed, their corresponding weights and the alertness value are passed to the lateral prefrontal cortex (LPFC), which selects the action that should be executed and adjusts the exploration speed according to the magnitude of the alertness value, thereby realizing the dynamic balance between exploration and exploitation.
Following the above embodiment of the present application, the step S13 of determining and executing the optimal action to be executed from all the actions to be executed based on the exploration speed, all the actions to be executed and their corresponding weights includes:
performing equation conversion based on the exploration speed, all the actions to be executed and their corresponding weights to obtain the execution probability of each action to be executed corresponding to the current state, which is calculated by the following specific formula:
P(a_i | s) = exp(β·W_i) / Σ_{j=1}^{n} exp(β·W_j)
and n is the total n types of actions to be executed and selected when the robot is in the state s, j is the number of the actions to be executed, and the range of j is 1, 2, … …, and n and beta are the search speed.
And determining and executing the optimal action to be executed based on the execution probability of each action to be executed, and realizing dynamic adjustment of the exploration speed, namely balance of exploration and utilization so as to obtain the optimal action to be executed.
For example, first, all the actions to be executed a_1, a_2, a_3, …, a_n, their corresponding weights W_1, W_2, W_3, …, W_n and the exploration speed β are obtained. Then, based on the exploration speed β, all the actions to be executed a_1, a_2, a_3, …, a_n and their corresponding weights W_1, W_2, W_3, …, W_n, the execution probabilities corresponding to the current state are obtained through the Boltzmann-Softmax conversion: P(a_1) for action a_1, P(a_2) for action a_2, P(a_3) for action a_3, …, P(a_n) for action a_n, where P(a_1) + P(a_2) + P(a_3) + … + P(a_n) = 1. Finally, the optimal action to be executed is obtained and executed according to the execution probabilities P(a_1), P(a_2), P(a_3), …, P(a_n), so that the selected action to be executed is the optimal action to be executed.
Further, determining and executing the optimal action to be executed based on the execution probability of each action to be executed comprises:
obtaining the similarity of the execution probability of all the actions to be executed based on the execution probability of each action to be executed;
if the similarity of the execution probabilities of all the actions to be executed is greater than the similarity threshold, one action to be executed is randomly selected from all the actions to be executed as the optimal action to be executed and executed, the exploration speed is a smaller value at this time, and the execution probabilities of all the actions to be executed are close to each other, so that the action to be executed which originally has the maximum weight does not have the highest execution probability after the execution probability conversion, that is, one action to be executed can be randomly selected as the optimal action to be executed and executed, further exploration of the external environment by the robot is achieved, and the learning convergence speed of the robot in the action decision process is improved.
If the similarity of the execution probabilities of all the actions to be executed is smaller than or equal to the similarity threshold, taking the action to be executed with the highest execution probability in all the actions to be executed as the optimal action to be executed and executing the action, wherein the exploration speed is a larger value at this time, and the difference between the execution probabilities of all the actions to be executed is increased, namely the action to be executed with the highest execution probability is the optimal action to be executed, so that the robot can select the optimal action to be executed for execution when making an action decision.
For example, the execution probabilities P(a_1), P(a_2), P(a_3), …, P(a_n) of the actions to be executed a_1, a_2, a_3, …, a_n are obtained, and the similarity X of the execution probabilities of all the actions to be executed is computed. A similarity threshold K is preset; if the similarity of the execution probabilities of all the actions to be executed is greater than the similarity threshold, i.e. X > K, the execution probabilities of all the actions to be executed are close to each other. One action to be executed is then randomly selected as the optimal action to be executed and executed, which realizes further exploration of the external environment by the robot and improves the learning convergence speed of the robot's behavior decision process.
As another example, the execution probabilities P(a_1), P(a_2), P(a_3), …, P(a_n) of the actions to be executed a_1, a_2, a_3, …, a_n are obtained, and the similarity X of the execution probabilities of all the actions to be executed is computed. A similarity threshold K is preset; if the similarity of the execution probabilities of all the actions to be executed is less than the similarity threshold, i.e. X < K, the execution probabilities of all the actions to be executed differ considerably. The action to be executed with the highest execution probability is then taken as the optimal action to be executed and executed, so that the robot selects the optimal action to be executed when making a behavior decision.
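A Python sketch of this selection rule follows; the patent does not specify how the similarity X of the execution probabilities is computed, so a simple spread-based measure (X = 1 − (max P − min P)) and the threshold value K are assumptions used only for illustration.

```python
import random

def select_action(actions, probs, K=0.9):
    """Choose the optimal action to be executed from the execution probabilities.

    actions : list of candidate actions a_1..a_n
    probs   : their execution probabilities P(a_1)..P(a_n) as a list
    K       : preset similarity threshold
    """
    # Assumed similarity measure: X is large when all probabilities are close together
    X = 1.0 - (max(probs) - min(probs))
    if X > K:
        # Probabilities are nearly equal: explore by picking an action at random
        return random.choice(actions)
    # Probabilities differ clearly: exploit by picking the most probable action
    return actions[probs.index(max(probs))]

# Example
print(select_action(["left", "right", "forward"], [0.34, 0.33, 0.33]))
print(select_action(["left", "right", "forward"], [0.1, 0.8, 0.1]))
```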
In another preferred embodiment of the present application, the method further comprises:
acquiring an update state after the optimal action to be executed is executed, wherein the update state comprises an updated current environment and updated reward information;
and updating the action to be executed and the weight corresponding to the action to be executed based on the updating state, so that the updating of the action to be executed and the weight corresponding to the action to be executed is realized, the learning convergence speed of the robot in the action decision process is improved, and a better global solution is obtained.
For example, an optimal action to be executed is executed, and an update state V after the optimal action to be executed is acquired; and updating the action to be executed and the corresponding weight thereof based on the updating state V, so that the learning convergence speed of the robot in the action decision process is improved, and a better global solution is obtained.
Next, the above embodiment of the present application updates the action to be executed and the weight corresponding to the action to be executed based on the update status, including:
judging whether the optimal action to be executed is collided or not based on the updating state;
and if the collision does not occur, updating the action to be executed and the corresponding weight thereof based on the current state, the optimal action to be executed and the updating state to obtain the updated action to be executed and the corresponding weight thereof.
For example, the optimal action to be executed is executed, and the updated state v after executing the optimal action to be executed is acquired. If it is known from the obtained updated state v that the robot did not collide with an obstacle, the actions to be executed a_1, a_2, a_3, …, a_n and their corresponding weights W_1, W_2, W_3, …, W_n are updated based on the current state s, the optimal action to be executed and the updated state v, obtaining the updated weights W_1', W_2', W_3', …, W_n' corresponding to the actions to be executed a_1, a_2, a_3, …, a_n, so as to improve the learning, cognition and evolution abilities of the robot and obtain a better action to be executed in later behavior decisions.
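The sketch below shows one way such an update could look in Python; the patent does not give the concrete update rule, so a standard TD-style update with a learning rate alpha applied to both the Q value and the actor weight is assumed purely for illustration.

```python
def update_after_step(Q, W, s, a_star, s_next, R_t, actions,
                      alpha=0.1, gamma=0.9, collided=False):
    """Update the weights after executing the optimal action a_star.

    Q        : dict mapping (state, action) -> expected value (critic)
    W        : dict mapping action -> weight (actor preference)
    s, s_next: current state and the updated state observed after acting
    R_t      : reward contained in the updated state
    collided : whether the updated state indicates a collision
    """
    if collided:
        return Q, W                       # no update when a collision occurred
    best_next = max(Q.get((s_next, a), 0.0) for a in actions)
    delta = R_t + gamma * best_next - Q.get((s, a_star), 0.0)
    Q[(s, a_star)] = Q.get((s, a_star), 0.0) + alpha * delta   # critic update
    W[a_star] = W.get(a_star, 0.0) + alpha * delta             # actor weight update
    return Q, W
```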
According to another aspect of the present application, there is also provided a computer readable medium having stored thereon computer readable instructions which, when executed by a processor, cause the processor to implement the robot control method described above.
According to another aspect of the present application, there is also provided a robot control apparatus characterized by comprising:
one or more processors;
a computer-readable medium for storing one or more computer-readable instructions,
when executed by the one or more processors, cause the one or more processors to implement the robot control method described above.
Here, for details of each embodiment of the device, reference may be made to the corresponding parts of the method embodiments described above, and details are not repeated here.
In summary, the present application obtains the current state of the robot, at least two actions to be executed and their corresponding weights, where the current state includes the current environment and reward information, and determines a reward prediction error signal based on the current state, the at least two actions to be executed and their corresponding weights; based on the reward prediction error signal, the exploration speed is adjusted through the anterior cingulate cortex neural regulation mechanism to obtain the exploration speed corresponding to the current state; and the optimal action to be executed is determined from all the actions to be executed and executed based on the exploration speed, all the actions to be executed and their corresponding weights. That is, the exploration speed is dynamically adjusted by simulating the anterior cingulate cortex neural regulation mechanism in the brain physiology of primates, and the degree of exploration and exploitation is adjusted in real time according to the environment, so that a dynamic balance between exploiting and exploring the environment is achieved, the learning convergence speed of the robot's behavior decision process is improved, and a better global solution is obtained.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (5)

1. A robot control method, characterized in that the method comprises:
acquiring a current state, at least two actions to be executed and corresponding weights of the actions, wherein the current state comprises a current environment and reward information, and determining a reward prediction error signal based on the current state, the at least two actions to be executed and the corresponding weights of the actions, and the reward prediction error signal is calculated by the following formula:
δ(t) = R_t + γ·max_{a∈A} Q(s_{t+1}, a) − Q(s_t, a_i)
wherein R_t represents the reward information obtained at time t when the robot interacts with the environment in real time; γ is the discount factor; Q(s, a) is the expected value, in reinforcement learning, of selecting and executing the action a to be executed in the current state s; s_t represents the state of the robot at time t; i is the number of the action to be executed, with value range i = 1, 2, …, n; a_i is the action to be executed with serial number i; A is the set of all actions to be executed a_i;
based on the reward prediction error signal, adjusting the exploration speed through an anterior cingulate cortex neural regulation mechanism to obtain the exploration speed corresponding to the current state, including: determining, through the anterior cingulate cortex neural regulation mechanism, a correct neuron response value and an incorrect neuron response value based on the reward prediction error signal; acquiring a correct neuron response update rate and an incorrect neuron response update rate, and calculating the alertness value corresponding to the current state using the correct neuron response value and the incorrect neuron response value corresponding to the current state together with the correct neuron response update rate and the incorrect neuron response update rate; and adjusting the exploration speed through the alertness value to obtain the exploration speed corresponding to the current state;
determining and executing the optimal action to be executed from all the actions to be executed based on the exploration speed, all the actions to be executed and the corresponding weights thereof, wherein the method comprises the following steps:
performing equation conversion based on the exploration speed, all the actions to be executed and the weights corresponding to the actions to be executed to obtain the execution probability of each action to be executed corresponding to the current state;
determining and executing the optimal action to be executed based on the execution probability of each action to be executed, wherein the method comprises the following steps: obtaining the similarity of the execution probability of all the actions to be executed based on the execution probability of each action to be executed; if the similarity of the execution probabilities of all the actions to be executed is greater than a similarity threshold, randomly selecting one action to be executed from all the actions to be executed as the optimal action to be executed and executing; and if the similarity of the execution probabilities of all the actions to be executed is less than or equal to the similarity threshold, taking the action to be executed with the highest execution probability in all the actions to be executed as the optimal action to be executed and executing.
2. The method of claim 1, wherein the method further comprises:
acquiring an updating state after the optimal action to be executed is executed;
and updating the weight corresponding to the action to be executed based on the updating state.
3. The method of claim 2, wherein updating the action to be performed and its corresponding weight based on the update status comprises:
judging whether the optimal action to be executed is collided or not based on the updating state;
and if the collision does not occur, updating the action to be executed and the weight corresponding to the action to be executed based on the current state, the optimal action to be executed and the updating state to obtain the updated weight corresponding to the action to be executed.
4. A computer readable medium having computer readable instructions stored thereon, which, when executed by a processor, cause the processor to implement the method of any one of claims 1 to 3.
5. A robot control apparatus, characterized in that the apparatus comprises:
one or more processors;
a computer-readable medium for storing one or more computer-readable instructions,
the one or more computer-readable instructions, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-3.
CN202010552467.2A 2020-06-17 2020-06-17 Robot control method and equipment Active CN111645076B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010552467.2A CN111645076B (en) 2020-06-17 2020-06-17 Robot control method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010552467.2A CN111645076B (en) 2020-06-17 2020-06-17 Robot control method and equipment

Publications (2)

Publication Number Publication Date
CN111645076A CN111645076A (en) 2020-09-11
CN111645076B (en) 2021-05-11

Family

ID=72342733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010552467.2A Active CN111645076B (en) 2020-06-17 2020-06-17 Robot control method and equipment

Country Status (1)

Country Link
CN (1) CN111645076B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113537318B (en) * 2021-07-01 2023-04-07 郑州大学 Robot behavior decision method and device simulating human brain memory mechanism
CN113671834B (en) * 2021-08-24 2023-09-01 郑州大学 Robot flexible behavior decision method and equipment
CN113848946B (en) * 2021-10-20 2023-11-03 郑州大学 Robot behavior decision method and equipment based on nerve regulation mechanism

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103345749B (en) * 2013-06-27 2016-04-13 中国科学院自动化研究所 A kind of brain network function connectivity lateralization detection method based on modality fusion
US10343279B2 (en) * 2015-07-10 2019-07-09 Board Of Trustees Of Michigan State University Navigational control of robotic systems and other computer-implemented processes using developmental network with turing machine learning
CN108600379A (en) * 2018-04-28 2018-09-28 中国科学院软件研究所 A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient
CN110000781B (en) * 2019-03-29 2021-06-08 郑州大学 Development network-based mobile robot motion direction pre-decision method

Also Published As

Publication number Publication date
CN111645076A (en) 2020-09-11

Similar Documents

Publication Publication Date Title
CN111645076B (en) Robot control method and equipment
US11461661B2 (en) Stochastic categorical autoencoder network
EP4231197B1 (en) Training machine learning models on multiple machine learning tasks
Gijsberts et al. Real-time model learning using incremental sparse spectrum gaussian process regression
US20230047151A1 (en) Systems and Methods for Neural Networks Allocating Capital
US5598510A (en) Self organizing adaptive replicate (SOAR)
US20220067588A1 (en) Transforming a trained artificial intelligence model into a trustworthy artificial intelligence model
US9330358B1 (en) Case-based reasoning system using normalized weight vectors
US11748600B2 (en) Quantization parameter optimization method and quantization parameter optimization device
Huang et al. Interpretable policies for reinforcement learning by empirical fuzzy sets
JP2022515941A (en) Generating hostile neuropil-based classification system and method
Xie et al. Modeling adaptive preview time of driver model for intelligent vehicles based on deep learning
WO2021200392A1 (en) Data adjustment system, data adjustment device, data adjustment method, terminal device, and information processing device
US20220299232A1 (en) Machine learning device and environment adjusting apparatus
CN113780394B (en) Training method, device and equipment for strong classifier model
US20240020531A1 (en) System and Method for Transforming a Trained Artificial Intelligence Model Into a Trustworthy Artificial Intelligence Model
CN113671834A (en) Robot flexible behavior decision method and device
Kochenderfer Adaptive modelling and planning for learning intelligent behaviour
Motta Goulart et al. An evolutionary algorithm for large margin classification
US11869383B2 (en) Method, system and non-transitory computer- readable recording medium for providing information on user's conceptual understanding
CN113848946B (en) Robot behavior decision method and equipment based on nerve regulation mechanism
JP7491622B1 (en) Pattern recognition device, learning method, and program
US20240078921A1 (en) Method, system and non-transitory computer-readable recording medium for answering prediction for learning problem
WO2022217856A1 (en) Methods, devices and media for re-weighting to improve knowledge distillation
US20230043618A1 (en) Computation apparatus, neural network system,neuron model apparatus, computation method and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant