CN111645076B - Robot control method and equipment - Google Patents

Robot control method and equipment

Info

Publication number
CN111645076B
CN111645076B (application CN202010552467.2A)
Authority
CN
China
Prior art keywords
executed
action
actions
current state
exploration
Prior art date
Legal status
Active
Application number
CN202010552467.2A
Other languages
Chinese (zh)
Other versions
CN111645076A (en)
Inventor
王东署
胡宇航
罗勇
辛健斌
王河山
马天磊
贾建华
张方方
陈书立
Current Assignee
Zhengzhou University
Original Assignee
Zhengzhou University
Priority date
Filing date
Publication date
Application filed by Zhengzhou University
Priority to CN202010552467.2A
Publication of CN111645076A
Application granted
Publication of CN111645076B

Classifications

    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 - Programme-controlled manipulators
    • B25J9/16 - Programme controls
    • B25J9/1602 - Programme controls characterised by the control system, structure, architecture
    • B25J9/1628 - Programme controls characterised by the control loop
    • B25J9/163 - Programme controls characterised by the control loop: learning, adaptive, model based, rule based expert control
    • B25J9/1656 - Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664 - Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Automation & Control Theory (AREA)
  • Manipulator (AREA)
  • Feedback Control In General (AREA)

Abstract

The method and the device dynamically adjust the exploration speed by simulating the neural regulation mechanism of the anterior cingulate cortex in the brain physiology of primates, and adjust the degree of exploration and exploitation in real time according to the environment, so that a dynamic balance between exploiting the environment and exploring the environment is achieved, the learning convergence speed of the robot's behavior decision process is improved, and a better global solution is obtained.

Description

Robot control method and equipment
Technical Field
The present application relates to the field of computers, and in particular, to a robot control method and apparatus.
Background
In the prior art, a robot needs to learn in an unknown environment and adapt to it, and reinforcement learning is one of the key technologies for doing so. The advantage of reinforcement learning is that it does not require a given desired output; instead, through online interactive training driven by reinforcement signals, it obtains a better-performing control strategy by maximizing the cumulative return the robot receives while moving in the environment. Reinforcement learning is therefore often used in the study of robot behavior decisions.
At present, an important problem faced by reinforcement learning algorithms is the balance between exploring the environment and exploiting it, and how well exploration and exploitation are coordinated directly influences the efficiency of reinforcement learning. Exploration means that during learning the robot traverses more of the state space in the environment, usually at the cost of some short-term benefit; by gathering enough information it can learn a behavior strategy that is better overall. However, because robot behavior decisions involve high dimensionality, large action sets and complex information, excessive exploration causes the curse of dimensionality and slow learning convergence in reinforcement learning, greatly increases the amount of computation, and makes it difficult to meet the real-time requirements of behavior decisions. Exploitation means that after learning independently for a period of time the robot has formed a certain behavior strategy; at this point, in order to let the robot obtain a larger reward and accelerate learning convergence, its exploratory behavior is gradually reduced and converted into exploitation, that is, the robot no longer explores unknown action strategies but selects the action strategy that is optimal under the current information according to the experience it has learned. But if the environment is exploited too early or too heavily, it is difficult for the system to learn the optimal behavior strategy. Therefore, a proper exploration-exploitation coordination and balancing mechanism is critical to the efficiency of the reinforcement learning algorithm.
Algorithms for balancing exploration and exploitation generally fall into two broad categories: undirected and directed methods. The undirected methods among current action selection methods require fine tuning of exploration parameters; their drawback is that they do not consider the uncertainty of each action's expected reward, and the correct values of the exploration parameters can only be determined after many simulations. The drawback of directed methods is that a large number of complex calculations are required to converge to an optimal solution.
Currently, the exploration-exploitation balancing strategy commonly used in robot reinforcement learning is an indirect selection strategy, which ignores the uncertainty of the environment during learning and uses selection probabilities to balance exploration and exploitation; such strategies include several commonly used methods, for example the ε-greedy method, the Boltzmann distribution method and heuristic action selection. The ε-greedy strategy is widely used because it is simple to implement, but the parameter ε is a fixed value, so the problem of balancing exploration and exploitation in a dynamic learning process remains, which affects the learning rate and efficiency of the algorithm to a certain extent. The Boltzmann distribution method works with action selection probabilities: it associates the selection of an action with a value function and adjusts the selection probability of each action using a temperature parameter. Its disadvantage is that the initial value of the temperature parameter is uncertain, and the parameter setting has a certain influence on the learning rate and efficiency of the algorithm. In short, none of these methods can adjust the degree of exploration or exploitation in real time according to the robot's environment, which leads to defects such as poor adaptability, slow convergence and getting stuck in local optima.
Therefore, a robot control method that can achieve a dynamic balance between exploiting and exploring the environment during robot learning, adjust the degrees of exploration and exploitation in real time according to the environment, converge quickly and yield a better global solution after stabilization is a direction that those skilled in the art need to keep researching.
Disclosure of Invention
An object of the present application is to provide a robot control method and apparatus, so as to solve the problem in the prior art of how to adjust the degree of exploration and utilization in the robot learning process, thereby increasing the learning convergence rate and obtaining a better global solution.
According to an aspect of the present application, there is provided a robot control method including:
acquiring a current state, at least two actions to be executed and corresponding weights of the actions, wherein the current state comprises a current environment and reward information, and determining a reward prediction error signal based on the current state, the at least two actions to be executed and the corresponding weights of the actions;
based on the reward prediction error signal, adjusting the exploration speed through an anterior cingulate cortex neural regulation mechanism to obtain the exploration speed corresponding to the current state;
and determining and executing the optimal action to be executed from all the actions to be executed based on the exploration speed, all the actions to be executed and the corresponding weights thereof.
Further, in the above robot control method, the adjusting the exploration speed through an anterior cingulate cortex neural regulation mechanism based on the reward prediction error signal to obtain the exploration speed corresponding to the current state includes:
determining, based on the reward prediction error signal, a correct neuron response value and an incorrect neuron response value through the anterior cingulate cortex neural regulation mechanism;
acquiring a correct neuron response update rate and an incorrect neuron response update rate, and calculating the alertness value corresponding to the current state using the correct neuron response value and the incorrect neuron response value corresponding to the current state together with the correct neuron response update rate and the incorrect neuron response update rate;
and adjusting the exploration speed through the alertness value to obtain the exploration speed corresponding to the current state.
Further, in the robot control method, the determining and executing the optimal action to be executed from all the actions to be executed based on the exploration speed, all the actions to be executed and their corresponding weights includes:
performing equation conversion based on the exploration speed, all the actions to be executed and the weights corresponding to the actions to be executed to obtain the execution probability of each action to be executed corresponding to the current state;
and determining the optimal action to be executed based on the execution probability of each action to be executed and executing.
Further, in the robot control method, the determining and executing the optimal action to be performed based on the execution probability of each action to be performed includes:
obtaining the similarity of the execution probability of all the actions to be executed based on the execution probability of each action to be executed;
if the similarity of the execution probabilities of all the actions to be executed is greater than a similarity threshold, randomly selecting one action to be executed from all the actions to be executed as the optimal action to be executed and executing;
and if the similarity of the execution probabilities of all the actions to be executed is less than or equal to the similarity threshold, taking the action to be executed with the highest execution probability in all the actions to be executed as the optimal action to be executed and executing.
Further, the robot control method further includes:
acquiring an updating state after the optimal action to be executed is executed;
and updating the action to be executed and the corresponding weight thereof based on the updating state.
Further, in the robot control method, updating the weight corresponding to the action to be performed based on the update status includes:
judging whether the optimal action to be executed is collided or not based on the updating state;
and if the collision does not occur, updating the action to be executed and the weight corresponding to the action to be executed based on the current state, the optimal action to be executed and the updating state to obtain the updated weight corresponding to the action to be executed.
According to another aspect of the present application, there is also provided a computer readable medium having computer readable instructions stored thereon, which, when executed by a processor, cause the processor to implement the method of any one of the above.
According to another aspect of the present application, there is also provided a robot control apparatus including:
one or more processors;
a computer-readable medium for storing one or more computer-readable instructions,
when executed by the one or more processors, cause the one or more processors to implement a method as in any one of the above.
Compared with the prior art, the method and the device obtain the current state of the robot, at least two actions to be executed and their corresponding weights, where the current state includes the current environment and reward feedback information obtained based on the current environment information; based on the reward feedback information, the exploration speed is adjusted through the anterior cingulate cortex neural regulation mechanism to obtain the exploration speed corresponding to the current state; and the optimal action to be executed is determined from all the actions to be executed and executed based on the exploration speed, all the actions to be executed and their corresponding weights. That is, the exploration speed is dynamically adjusted by simulating the anterior cingulate cortex neural regulation mechanism in the brain physiology of primates, and the degree of exploration and exploitation is adjusted in real time according to the environment, so that a dynamic balance between exploiting and exploring the environment is achieved, the learning convergence speed of the robot's behavior decision process is improved, and a better global solution is obtained.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 illustrates a flow diagram of a robot control method in accordance with an aspect of the subject application;
FIG. 2 illustrates an ACC neuromodulation mechanism-based behavioral decision model schematic of a robotic control method according to an aspect of the present application;
FIG. 3 illustrates an ACC neural network architecture diagram of a robot control method in accordance with an aspect of the subject application;
FIG. 4 illustrates a flow diagram of an embodiment of a robot control method in accordance with an aspect of the subject application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (e.g., Central Processing Units (CPUs)), input/output interfaces, network interfaces, and memory.
The Memory may include volatile Memory in a computer readable medium, Random Access Memory (RAM), and/or nonvolatile Memory such as Read Only Memory (ROM) or flash Memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, Phase-Change RAM (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, magnetic cassette tape, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media does not include transitory computer readable media (transitory media), such as modulated data signals and carrier waves.
Fig. 1 is a schematic flowchart of a robot control method according to an aspect of the present application, the method is applicable to various motion scenarios of a mobile robot, and the method includes steps S11, S12, and S13, where the method specifically includes:
step S11, obtaining the current state of the robot, at least two actions to be executed and their corresponding weights, where the current state includes the current environment and reward information, and determining a reward prediction error signal based on the current state, the at least two actions to be executed and their corresponding weights, where the reward prediction error signal is a reinforcement signal of the reward prediction error and is calculated by the following formula:
δ(t) = R_t + γ·max_{a∈A} Q(s_{t+1}, a) − Q(s_t, a_i)
wherein R_t represents the reward information obtained at time t when the robot interacts with the environment in real time; γ is the discount factor; Q(s, a) is the expected value, in reinforcement learning, of selecting and executing the action a to be executed in the current state s; s_t represents the state of the robot at time t; i is the number of the action to be executed, with value range i = 1, 2, …, n; a_i is the action to be executed with serial number i; A is the set of all actions to be executed a_i.
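As an illustration of how this reward prediction error could be computed, a minimal Python sketch follows; the dictionary-based Q table and the names reward_prediction_error, actions and gamma are choices made here for clarity and are not taken from the patent.

```python
def reward_prediction_error(Q, s, a_i, s_next, R_t, actions, gamma=0.9):
    """Reward prediction error delta for executing action a_i in state s.

    Q       : dict mapping (state, action) -> expected value Q(s, a)
    s, a_i  : current state and the action that was executed
    s_next  : state s_{t+1} observed after executing a_i
    R_t     : reward information received at time t
    actions : the set A of all actions to be executed a_i
    gamma   : discount factor
    """
    # Best expected value obtainable from the next state over all actions in A
    best_next = max(Q.get((s_next, a), 0.0) for a in actions)
    # delta(t) = R_t + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_i)
    return R_t + gamma * best_next - Q.get((s, a_i), 0.0)
```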
Step S12, based on the reward prediction error signal, adjusting the exploration speed through the anterior cingulate cortex (ACC) neural regulation mechanism to obtain the exploration speed corresponding to the current state, where the exploration speed indicates the degree to which the robot explores the environment. Adjusting the exploration speed using the ACC neural regulation mechanism realizes a balance between exploration and exploitation for the robot: during the robot's learning process it can try different actions and explore unknown action strategies while also selecting the optimal action to be executed in the current state according to the experience it has learned, which improves the robot's learning, cognition and evolution abilities so that the optimal action to be executed can be obtained.
And step S13, determining and executing the optimal action to be executed from all the actions to be executed based on the exploration speed, all the actions to be executed and the corresponding weights thereof, thereby realizing obtaining a better global solution.
In the above steps S11 to S13, first, the current state of the robot, at least two actions to be executed and their corresponding weights are obtained, where the current state includes the current environment, and reward feedback information is obtained based on the current environment information; then, based on the reward feedback information, the exploration speed is adjusted through the anterior cingulate cortex neural regulation mechanism to obtain the exploration speed corresponding to the current state; finally, the optimal action to be executed is determined from all the actions to be executed and executed based on the exploration speed, all the actions to be executed and their corresponding weights. That is, the exploration speed is dynamically adjusted by simulating the anterior cingulate cortex neural regulation mechanism in the brain physiology of primates, and the degree of exploration and exploitation is adjusted in real time according to the environment, so that a dynamic balance between exploiting and exploring the environment is achieved, the learning convergence speed of the robot's behavior decision process is improved, and a better global solution is obtained.
For example, as shown in fig. 2, the model is based on the Actor-Critic algorithm in reinforcement learning: the Actor part is a three-layer radial basis function neural network, and the Critic part is a Q-learning algorithm. First, the current state s of the robot, the actions to be executed a_1, a_2, a_3, …, a_n and their corresponding weights W_1, W_2, W_3, …, W_n are obtained, where the current state s includes the current environment and reward information R, and a reward prediction error signal δ is determined based on the current state s, the at least two actions to be executed and their corresponding weights. Then, based on the reward prediction error signal δ, the exploration speed is adjusted through the anterior cingulate cortex neural regulation mechanism to obtain the exploration speed β corresponding to the current state. Finally, based on the exploration speed β, all the actions to be executed a_1, a_2, a_3, …, a_n and their corresponding weights W_1, W_2, W_3, …, W_n, the optimal action to be executed is determined from all the actions a_1, a_2, a_3, …, a_n and executed. That is, the exploration speed β is dynamically adjusted by simulating the anterior cingulate cortex neural regulation mechanism in the brain physiology of primates, and the degree of exploration and exploitation is adjusted in real time according to the environment, so that dynamic switching between exploiting and exploring the environment is realized, the learning convergence speed of the robot's behavior decision process is improved, and a better global solution is obtained after stabilization.
Following the above embodiment of the present application, the step S12 of adjusting the exploration speed through the anterior cingulate cortex neural regulation mechanism based on the reward prediction error signal to obtain the exploration speed corresponding to the current state includes:
determining, through the anterior cingulate cortex neural regulation mechanism, a correct neuron response value and an error neuron response value based on the reward prediction error signal, where the reward prediction error signal affects a set of feedback classification neurons in the ACC: correct neurons (COR) and error neurons (ERR). The reward prediction error signal can be negative or positive; when a negative reward prediction error signal is obtained, the error neurons ERR in the ACC respond, and when a positive reinforcement signal is obtained, the correct neurons COR in the anterior cingulate cortex ACC respond. The correct neuron response value and the error neuron response value determined through the ACC neural regulation mechanism are used for the subsequent adjustment of the exploration speed.
Acquiring a correct neuron response update rate and an incorrect neuron response update rate, and calculating the alertness value corresponding to the current state using the correct neuron response value and the incorrect neuron response value corresponding to the current state together with the correct neuron response update rate and the incorrect neuron response update rate, where the alertness value indicates the degree of alertness of the robot in the current environment. The alertness value is introduced in this method to simulate the different reactions of a human under different degrees of alertness, so that the robot switches dynamically between exploiting and exploring the environment. The alertness value is calculated by the following formula:
β*(t) ← β*(t) + μ⁺δ⁺(t) + μ⁻δ⁻(t)
where μ⁺ is the correct neuron response update rate, μ⁻ is the incorrect neuron response update rate, δ⁺(t) is the correct neuron response value, and δ⁻(t) is the incorrect neuron response value.
Adjusting the exploration speed through the alertness value to obtain the exploration speed corresponding to the current state: if the robot selects the optimal action to be executed, the alertness value decreases and the exploration speed is reduced; at this point the robot continues to repeat the current action, i.e. it exploits the environment. Conversely, when a non-optimal action to be executed is selected, the alertness value increases and the exploration speed increases; at this point the robot should adjust its action selection in time, i.e. it may try to randomly select an action to be executed, which manifests as exploring the environment. The exploration speed is obtained by filtering the alertness value through a sigmoid function, with the following formula:
β = ω₁ / (1 + exp(ω₂·[1 − β*] + ω₃))
where ω₁, ω₂ and ω₃ are all constants, with ω₁ > ω₃ > 0 and ω₂ < 0, and β* is the alertness value.
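A minimal Python sketch of the two formulas above is given below; the numeric values for μ⁺, μ⁻, ω₁, ω₂ and ω₃ are placeholders chosen only to satisfy the stated constraints (ω₁ > ω₃ > 0, ω₂ < 0), not values taken from the patent.

```python
import math

def update_alertness(beta_star, delta_plus, delta_minus, mu_plus=0.1, mu_minus=0.2):
    """Alertness update: beta*(t) <- beta*(t) + mu+ * delta+(t) + mu- * delta-(t)."""
    return beta_star + mu_plus * delta_plus + mu_minus * delta_minus

def exploration_speed(beta_star, w1=10.0, w2=-5.0, w3=1.0):
    """Sigmoid filtering: beta = w1 / (1 + exp(w2 * (1 - beta*) + w3))."""
    return w1 / (1.0 + math.exp(w2 * (1.0 - beta_star) + w3))

# Example usage with placeholder neuron response values
beta_star = update_alertness(0.5, delta_plus=0.2, delta_minus=0.0)
beta = exploration_speed(beta_star)
```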
For example, as shown in fig. 3 and fig. 4, first the current state s of the robot and at least two actions to be executed a_1, a_2, a_3, …, a_n with their corresponding weights W_1, W_2, W_3, …, W_n are obtained, where the current state includes the current environment V. The visual input, i.e. the current environment V (for example objects seen on a screen or objects on a table), is fed into the posterior parietal cortex; reward information R is then received at the ventral tegmental area (VTA), and the reward prediction error signal δ is calculated from this reward information R. A set of feedback classification neurons in the ACC, the correct neurons and the error neurons, adjust the required alertness value by means of the reward prediction error signal δ. Finally, the actions to be executed, their corresponding weights and the alertness value are passed to the lateral prefrontal cortex (LPFC), which selects the action that should be executed and adjusts the exploration speed according to the magnitude of the alertness value, thereby realizing the dynamic balance between exploration and exploitation.
Following the above embodiment of the present application, the step S13 of determining and executing the optimal action to be executed from all the actions to be executed based on the exploration speed, all the actions to be executed and their corresponding weights includes:
performing equation conversion based on the exploration speed, all the actions to be executed and their corresponding weights to obtain the execution probability of each action to be executed corresponding to the current state, which is calculated by the following specific formula:
P(a_i | s) = exp(β·W_i) / Σ_{j=1}^{n} exp(β·W_j)
and n is the total n types of actions to be executed and selected when the robot is in the state s, j is the number of the actions to be executed, and the range of j is 1, 2, … …, and n and beta are the search speed.
And determining and executing the optimal action to be executed based on the execution probability of each action to be executed, and realizing dynamic adjustment of the exploration speed, namely balance of exploration and utilization so as to obtain the optimal action to be executed.
For example, first, all the actions to be executed a_1, a_2, a_3, …, a_n, their corresponding weights W_1, W_2, W_3, …, W_n and the exploration speed β are obtained. Then, based on the exploration speed β, all the actions to be executed a_1, a_2, a_3, …, a_n and their corresponding weights W_1, W_2, W_3, …, W_n, the execution probabilities corresponding to the current state are obtained through the Boltzmann-Softmax conversion: P(a_1) for action a_1, P(a_2) for action a_2, P(a_3) for action a_3, …, P(a_n) for action a_n, where P(a_1) + P(a_2) + P(a_3) + … + P(a_n) = 1. Finally, the optimal action to be executed is obtained and executed according to the execution probabilities P(a_1), P(a_2), P(a_3), …, P(a_n), so that the selected action to be executed is the optimal action to be executed.
Further, determining and executing the optimal action to be executed based on the execution probability of each action to be executed comprises:
obtaining the similarity of the execution probability of all the actions to be executed based on the execution probability of each action to be executed;
if the similarity of the execution probabilities of all the actions to be executed is greater than the similarity threshold, one action to be executed is randomly selected from all the actions to be executed as the optimal action to be executed and executed, the exploration speed is a smaller value at this time, and the execution probabilities of all the actions to be executed are close to each other, so that the action to be executed which originally has the maximum weight does not have the highest execution probability after the execution probability conversion, that is, one action to be executed can be randomly selected as the optimal action to be executed and executed, further exploration of the external environment by the robot is achieved, and the learning convergence speed of the robot in the action decision process is improved.
If the similarity of the execution probabilities of all the actions to be executed is smaller than or equal to the similarity threshold, taking the action to be executed with the highest execution probability in all the actions to be executed as the optimal action to be executed and executing the action, wherein the exploration speed is a larger value at this time, and the difference between the execution probabilities of all the actions to be executed is increased, namely the action to be executed with the highest execution probability is the optimal action to be executed, so that the robot can select the optimal action to be executed for execution when making an action decision.
For example, the execution probabilities P(a_1), P(a_2), P(a_3), …, P(a_n) of the actions to be executed a_1, a_2, a_3, …, a_n are obtained, and the similarity X of the execution probabilities of all the actions to be executed is computed. A similarity threshold K is preset; if the similarity of the execution probabilities of all the actions to be executed is greater than the similarity threshold, i.e. X > K, the execution probabilities of all the actions to be executed are close to each other. One action to be executed is then randomly selected as the optimal action to be executed and executed, which realizes further exploration of the external environment by the robot and improves the learning convergence speed of the robot's behavior decision process.
As another example, the execution probabilities P(a_1), P(a_2), P(a_3), …, P(a_n) of the actions to be executed a_1, a_2, a_3, …, a_n are obtained, and the similarity X of the execution probabilities of all the actions to be executed is computed. A similarity threshold K is preset; if the similarity of the execution probabilities of all the actions to be executed is less than the similarity threshold, i.e. X < K, the execution probabilities of all the actions to be executed differ considerably. The action to be executed with the highest execution probability is then taken as the optimal action to be executed and executed, so that the robot selects the optimal action to be executed when making a behavior decision.
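A Python sketch of this selection rule follows; the patent does not specify how the similarity X of the execution probabilities is computed, so a simple spread-based measure (X = 1 − (max P − min P)) and the threshold value K are assumptions used only for illustration.

```python
import random

def select_action(actions, probs, K=0.9):
    """Choose the optimal action to be executed from the execution probabilities.

    actions : list of candidate actions a_1..a_n
    probs   : their execution probabilities P(a_1)..P(a_n) as a list
    K       : preset similarity threshold
    """
    # Assumed similarity measure: X is large when all probabilities are close together
    X = 1.0 - (max(probs) - min(probs))
    if X > K:
        # Probabilities are nearly equal: explore by picking an action at random
        return random.choice(actions)
    # Probabilities differ clearly: exploit by picking the most probable action
    return actions[probs.index(max(probs))]

# Example
print(select_action(["left", "right", "forward"], [0.34, 0.33, 0.33]))
print(select_action(["left", "right", "forward"], [0.1, 0.8, 0.1]))
```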
In another preferred embodiment of the present application, the method further comprises:
acquiring an update state after the optimal action to be executed is executed, wherein the update state comprises an updated current environment and updated reward information;
and updating the action to be executed and the weight corresponding to the action to be executed based on the updating state, so that the updating of the action to be executed and the weight corresponding to the action to be executed is realized, the learning convergence speed of the robot in the action decision process is improved, and a better global solution is obtained.
For example, an optimal action to be executed is executed, and an update state V after the optimal action to be executed is acquired; and updating the action to be executed and the corresponding weight thereof based on the updating state V, so that the learning convergence speed of the robot in the action decision process is improved, and a better global solution is obtained.
Next, the above embodiment of the present application updates the action to be executed and the weight corresponding to the action to be executed based on the update status, including:
judging whether the optimal action to be executed is collided or not based on the updating state;
and if the collision does not occur, updating the action to be executed and the corresponding weight thereof based on the current state, the optimal action to be executed and the updating state to obtain the updated action to be executed and the corresponding weight thereof.
For example, the optimal action to be executed is executed, and the updated state v after executing the optimal action to be executed is acquired. If it is known from the obtained updated state v that the robot did not collide with an obstacle, the actions to be executed a_1, a_2, a_3, …, a_n and their corresponding weights W_1, W_2, W_3, …, W_n are updated based on the current state s, the optimal action to be executed and the updated state v, obtaining the updated weights W_1', W_2', W_3', …, W_n' corresponding to the actions to be executed a_1, a_2, a_3, …, a_n, so as to improve the learning, cognition and evolution abilities of the robot and obtain a better action to be executed in later behavior decisions.
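The sketch below shows one way such an update could look in Python; the patent does not give the concrete update rule, so a standard TD-style update with a learning rate alpha applied to both the Q value and the actor weight is assumed purely for illustration.

```python
def update_after_step(Q, W, s, a_star, s_next, R_t, actions,
                      alpha=0.1, gamma=0.9, collided=False):
    """Update the weights after executing the optimal action a_star.

    Q        : dict mapping (state, action) -> expected value (critic)
    W        : dict mapping action -> weight (actor preference)
    s, s_next: current state and the updated state observed after acting
    R_t      : reward contained in the updated state
    collided : whether the updated state indicates a collision
    """
    if collided:
        return Q, W                       # no update when a collision occurred
    best_next = max(Q.get((s_next, a), 0.0) for a in actions)
    delta = R_t + gamma * best_next - Q.get((s, a_star), 0.0)
    Q[(s, a_star)] = Q.get((s, a_star), 0.0) + alpha * delta   # critic update
    W[a_star] = W.get(a_star, 0.0) + alpha * delta             # actor weight update
    return Q, W
```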
According to another aspect of the present application, there is also provided a computer readable medium having stored thereon computer readable instructions which, when executed by a processor, cause the processor to implement the robot control method described above.
According to another aspect of the present application, there is also provided a robot control apparatus characterized by comprising:
one or more processors;
a computer-readable medium for storing one or more computer-readable instructions,
when executed by the one or more processors, cause the one or more processors to implement the robot control method described above.
Here, for details of each embodiment of the device, reference may be made to the corresponding parts of the method embodiments described above, and details are not repeated here.
In summary, the present application obtains the current state of the robot, at least two actions to be executed and their corresponding weights, where the current state includes the current environment and reward information, and determines a reward prediction error signal based on the current state, the at least two actions to be executed and their corresponding weights; based on the reward prediction error signal, the exploration speed is adjusted through the anterior cingulate cortex neural regulation mechanism to obtain the exploration speed corresponding to the current state; and the optimal action to be executed is determined from all the actions to be executed and executed based on the exploration speed, all the actions to be executed and their corresponding weights. That is, the exploration speed is dynamically adjusted by simulating the anterior cingulate cortex neural regulation mechanism in the brain physiology of primates, and the degree of exploration and exploitation is adjusted in real time according to the environment, so that a dynamic balance between exploiting and exploring the environment is achieved, the learning convergence speed of the robot's behavior decision process is improved, and a better global solution is obtained.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (5)

1. A robot control method, characterized in that the method comprises:
acquiring a current state, at least two actions to be executed and corresponding weights of the actions, wherein the current state comprises a current environment and reward information, and determining a reward prediction error signal based on the current state, the at least two actions to be executed and the corresponding weights of the actions, and the reward prediction error signal is calculated by the following formula:
δ(t) = R_t + γ·max_{a∈A} Q(s_{t+1}, a) − Q(s_t, a_i)
wherein R_t represents the reward information obtained at time t when the robot interacts with the environment in real time; γ is the discount factor; Q(s, a) is the expected value, in reinforcement learning, of selecting and executing the action a to be executed in the current state s; s_t represents the state of the robot at time t; i is the number of the action to be executed, with value range i = 1, 2, …, n; a_i is the action to be executed with serial number i; A is the set of all actions to be executed a_i;
based on the reward prediction error signal, adjusting the exploration speed through an anterior cingulate cortex neural regulation mechanism to obtain the exploration speed corresponding to the current state, including: determining, through the anterior cingulate cortex neural regulation mechanism, a correct neuron response value and an incorrect neuron response value based on the reward prediction error signal; acquiring a correct neuron response update rate and an incorrect neuron response update rate, and calculating the alertness value corresponding to the current state using the correct neuron response value and the incorrect neuron response value corresponding to the current state together with the correct neuron response update rate and the incorrect neuron response update rate; and adjusting the exploration speed through the alertness value to obtain the exploration speed corresponding to the current state;
determining and executing the optimal action to be executed from all the actions to be executed based on the exploration speed, all the actions to be executed and the corresponding weights thereof, wherein the method comprises the following steps:
performing equation conversion based on the exploration speed, all the actions to be executed and the weights corresponding to the actions to be executed to obtain the execution probability of each action to be executed corresponding to the current state;
determining and executing the optimal action to be executed based on the execution probability of each action to be executed, wherein the method comprises the following steps: obtaining the similarity of the execution probability of all the actions to be executed based on the execution probability of each action to be executed; if the similarity of the execution probabilities of all the actions to be executed is greater than a similarity threshold, randomly selecting one action to be executed from all the actions to be executed as the optimal action to be executed and executing; and if the similarity of the execution probabilities of all the actions to be executed is less than or equal to the similarity threshold, taking the action to be executed with the highest execution probability in all the actions to be executed as the optimal action to be executed and executing.
2. The method of claim 1, wherein the method further comprises:
acquiring an updating state after the optimal action to be executed is executed;
and updating the weight corresponding to the action to be executed based on the updating state.
3. The method of claim 2, wherein updating the action to be performed and its corresponding weight based on the update status comprises:
judging whether the optimal action to be executed is collided or not based on the updating state;
and if the collision does not occur, updating the action to be executed and the weight corresponding to the action to be executed based on the current state, the optimal action to be executed and the updating state to obtain the updated weight corresponding to the action to be executed.
4. A computer readable medium having computer readable instructions stored thereon, which, when executed by a processor, cause the processor to implement the method of any one of claims 1 to 3.
5. A robot control apparatus, characterized in that the apparatus comprises:
one or more processors;
a computer-readable medium for storing one or more computer-readable instructions,
the one or more computer-readable instructions, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-3.
CN202010552467.2A 2020-06-17 2020-06-17 Robot control method and equipment Active CN111645076B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010552467.2A CN111645076B (en) 2020-06-17 2020-06-17 Robot control method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010552467.2A CN111645076B (en) 2020-06-17 2020-06-17 Robot control method and equipment

Publications (2)

Publication Number Publication Date
CN111645076A CN111645076A (en) 2020-09-11
CN111645076B (en) 2021-05-11

Family

ID=72342733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010552467.2A Active CN111645076B (en) 2020-06-17 2020-06-17 Robot control method and equipment

Country Status (1)

Country Link
CN (1) CN111645076B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113537318B (en) * 2021-07-01 2023-04-07 郑州大学 Robot behavior decision method and device simulating human brain memory mechanism
CN113671834B (en) * 2021-08-24 2023-09-01 郑州大学 Robot flexible behavior decision method and equipment
CN113848946B (en) * 2021-10-20 2023-11-03 郑州大学 Robot behavior decision method and equipment based on nerve regulation mechanism

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103345749B (en) * 2013-06-27 2016-04-13 中国科学院自动化研究所 A kind of brain network function connectivity lateralization detection method based on modality fusion
US10343279B2 (en) * 2015-07-10 2019-07-09 Board Of Trustees Of Michigan State University Navigational control of robotic systems and other computer-implemented processes using developmental network with turing machine learning
CN108600379A (en) * 2018-04-28 2018-09-28 中国科学院软件研究所 A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient
CN110000781B (en) * 2019-03-29 2021-06-08 郑州大学 Development network-based mobile robot motion direction pre-decision method

Also Published As

Publication number Publication date
CN111645076A (en) 2020-09-11

Similar Documents

Publication Publication Date Title
CN111645076B (en) Robot control method and equipment
US11461661B2 (en) Stochastic categorical autoencoder network
EP4231197B1 (en) Training machine learning models on multiple machine learning tasks
Gijsberts et al. Real-time model learning using incremental sparse spectrum gaussian process regression
US20230047151A1 (en) Systems and Methods for Neural Networks Allocating Capital
US5598510A (en) Self organizing adaptive replicate (SOAR)
US20220067588A1 (en) Transforming a trained artificial intelligence model into a trustworthy artificial intelligence model
US9330358B1 (en) Case-based reasoning system using normalized weight vectors
US11748600B2 (en) Quantization parameter optimization method and quantization parameter optimization device
Huang et al. Interpretable policies for reinforcement learning by empirical fuzzy sets
JP2022515941A (en) Generating hostile neuropil-based classification system and method
Xie et al. Modeling adaptive preview time of driver model for intelligent vehicles based on deep learning
WO2021200392A1 (en) Data adjustment system, data adjustment device, data adjustment method, terminal device, and information processing device
US20220299232A1 (en) Machine learning device and environment adjusting apparatus
CN113780394B (en) Training method, device and equipment for strong classifier model
US20240020531A1 (en) System and Method for Transforming a Trained Artificial Intelligence Model Into a Trustworthy Artificial Intelligence Model
CN113671834A (en) Robot flexible behavior decision method and device
Kochenderfer Adaptive modelling and planning for learning intelligent behaviour
Motta Goulart et al. An evolutionary algorithm for large margin classification
US11869383B2 (en) Method, system and non-transitory computer- readable recording medium for providing information on user's conceptual understanding
CN113848946B (en) Robot behavior decision method and equipment based on nerve regulation mechanism
JP7491622B1 (en) Pattern recognition device, learning method, and program
US20240078921A1 (en) Method, system and non-transitory computer-readable recording medium for answering prediction for learning problem
WO2022217856A1 (en) Methods, devices and media for re-weighting to improve knowledge distillation
US20230043618A1 (en) Computation apparatus, neural network system,neuron model apparatus, computation method and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant