CN113671834B - Robot flexible behavior decision method and equipment - Google Patents

Robot flexible behavior decision method and equipment

Info

Publication number
CN113671834B
CN113671834B (application CN202110973178.4A)
Authority
CN
China
Prior art keywords
robot
information
current
neural network
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110973178.4A
Other languages
Chinese (zh)
Other versions
CN113671834A (en)
Inventor
王东署
朱觐镳
王河山
辛健斌
马天磊
贾建华
罗勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou University
Original Assignee
Zhengzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou University
Priority to CN202110973178.4A
Publication of CN113671834A
Application granted
Publication of CN113671834B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion; electric
    • G05B13/04 - Adaptive control systems; electric; involving the use of models or simulators
    • G05B13/042 - Adaptive control systems in which a parameter or coefficient is automatically adjusted to optimise the performance

Abstract

The application provides a robot flexible behavior decision method and equipment. The current environment information, a target task and the current state information of the robot are acquired; a neural network hybrid model is constructed based on a supervised learning model and a reinforcement learning model, the combination coefficient of the hybrid model is dynamically adjusted according to the current environment information, and the reinforcement learning model is improved by introducing a curiosity index, yielding an improved neural network hybrid model; the current environment information, the target task and the robot state information are input into the improved neural network hybrid model to obtain a flexible behavior decision. In other words, reinforcement learning and supervised learning are dynamically combined, and dynamic self-adaptive adjustment of environment exploration-utilization is realized within reinforcement learning, so that flexible behavior decisions of the mobile robot in an unknown environment are achieved and the adaptability and learning capability of the mobile robot are improved.

Description

Robot flexible behavior decision method and equipment
Technical Field
The application relates to the field of computers, in particular to a robot flexible behavior decision method and equipment.
Background
In the prior art, with the development of technology, the mobile robot has become an ideal carrier combining robotics and artificial intelligence. Carrying the latest achievements of both fields, mobile robots have been used to handle various tasks and are widely applied in human production and daily life.
An important prerequisite for a mobile robot to complete various unspecified tasks is to fully and effectively recognize the environment in which it is located. In the environment cognition process, the robot generally has no prior knowledge of the environment and may encounter various dynamic or static obstacles, various emergencies, or even various 'traps' during its movement. How to realize flexible behavior decisions of a mobile robot in a dynamic environment has therefore always been an important problem for robotics researchers.
To solve this problem, researchers have proposed various behavior decision methods, such as robust nonsingular terminal sliding mode control, forward/backward motion control, sliding mode control + attractive ellipsoid method, anti-interference PID control, central pattern generators, and Hopf oscillator + Kuramoto oscillator. However, these behavior decision methods are mostly aimed at a specific application scenario; once the application scenario changes, the corresponding behavior decision needs to be modified accordingly, so their ability to adapt to dynamic environments is poor.
In addition to the above methods, in recent years scholars have provided a basis for robot behavior decisions by simulating the function of cells located in the brain. Moreover, with the development of intelligent science, more and more intelligent control algorithms have been applied to robot behavior decisions, such as fuzzy inference systems, fuzzy logic + behavior trees, neural networks, neural inverse reinforcement learning, and feedforward neural network + Q-learning.
In recent years, with the development of neurobiology and the progressively deeper research on the cognitive mechanisms of the human brain, researchers have begun to consider introducing the cognitive principles of the human brain into the behavior decisions of mobile robots and to solve the robot behavior decision problem by simulating the thinking mode of the human brain. Fang, Naveros, Hausknecht, Kui and others have respectively proposed robot behavior decision solutions that simulate cerebellar function for different robots. Research shows that, in addition to the cerebellum, the basal ganglia play an important role in behavioral decisions, and both are linked to target-guided behavior decisions. Since both are associated with target-guided behavior decisions, the idea naturally arises of whether the two can be jointly incorporated into robot behavior decisions. At present, there have been studies jointly applying the cerebellum and the basal ganglia to robot behavior decisions. For example, Dasgupta et al. simulated the learning principles of the basal ganglia and the cerebellum with an Actor-Critic model and input correlation learning respectively, proposed a reward-modulated heterosynaptic plasticity model, and adaptively combined the two systems of cerebellum and basal ganglia for behavior control of a four-wheeled robot. However, in this model there is no direct feedback (interaction) between the cerebellum and the basal ganglia. Ruan Xiaogang et al. proposed a tendency-based, motion-related heuristic dynamic programming learning mechanism to explore the cooperative mechanism between the cerebellum and the basal ganglia, adopting an Actor model and a Critic model to simulate the functions of the cerebellum and the basal ganglia respectively for motion control of a two-wheeled self-balancing robot. In that model, the motion output of the cerebellum is taken as one of the inputs of the basal ganglia, but the basal ganglia do not feed back to the cerebellum, i.e. the interaction between the two is unidirectional (cerebellum to basal ganglia) rather than bidirectional.
Therefore, how to realize a direct interactive connection between the cerebellum and the basal ganglia in a neural network model, integrate cerebellar supervised learning and basal-ganglia reinforcement learning, realize flexible behavior decisions of a mobile robot in an unknown environment, and at the same time enable the robot to obtain continuous and stable learning ability, is the direction of research for those skilled in the art.
Disclosure of Invention
The application aims to provide a robot flexible behavior decision method and device, which solve the problem in the prior art of how to dynamically combine cerebellar supervised learning and basal-ganglia reinforcement learning in mobile robot behavior decisions, so that the robot obtains continuous and stable learning capability and the adaptability of the mobile robot to dynamic environments is improved.
According to one aspect of the present application, there is provided a robot flexible behavior decision method comprising:
acquiring current environment information, a target task and current state information of a robot, wherein the current environment information comprises obstacle position information;
constructing a neural network mixed model based on a supervised learning model and a reinforcement learning model, dynamically adjusting the combination coefficient of the neural network mixed model according to the current environmental information, and improving the reinforcement learning model by increasing curiosity indexes to obtain an improved neural network mixed model;
and inputting the current environment information, the target task and the robot state information into the improved neural network hybrid model to obtain a flexible behavior decision.
Further, in the robot flexible behavior decision method, the dynamically adjusting the combination coefficient of the neural network hybrid model according to the current environmental information includes:
obtaining the nearest obstacle distance among all obstacle information from the current environment information;
adjusting the combination coefficient of the neural network hybrid model according to the nearest obstacle distance, wherein the combination coefficient is expressed as follows:
wherein ω1 = 1 - ω2,
a = ω1·a1 + ω2·a2,
m1 and m2 are both positive constants, ω2 is the combination coefficient, enedis is the nearest obstacle distance information, a represents the optimal action decision at the current moment, a1 represents the behavior decision made by the reinforcement learning model, and a2 represents the behavior decision made by the supervised learning model.
Further, in the robot flexible behavior decision method, the improving of the reinforcement learning model by increasing the curiosity index to obtain an improved neural network hybrid model includes:
obtaining maximum rewards and minimum rewards obtained by the robot in a reinforcement learning model and environmental alertness;
Calculating the curiosity index based on the maximum reward, the minimum reward, and the environmental alertness;
and dynamically adjusting environment exploration and utilization in the reinforcement learning model by utilizing the curiosity index to obtain an improved neural network hybrid model.
Further, in the robot flexible behavior decision method, the obtaining of the maximum reward and the minimum reward obtained by the robot in the reinforcement learning model, and the environmental alertness, includes:
dynamically adjusting the learning rate and the discount rate based on the current environmental information to obtain a real-time learning rate and a real-time discount rate;
acquiring a state action value function of the current moment of the robot by combining the real-time learning rate and the real-time discount rate, and calculating to obtain a strengthening signal corresponding to the current moment according to a reward prediction error;
and obtaining the environmental alertness based on the response value of the strengthening signal.
Further, in the robot flexible behavior decision method, the dynamically adjusting the learning rate and the discount rate based on the current environmental information to obtain a real-time learning rate and a real-time discount rate includes:
calculating a similarity value of the current environmental information and the historical environmental information;
if the maximum similarity value is greater than or equal to a similarity threshold, accumulating the number of times of the same environmental information corresponding to the current environmental information;
if the maximum similarity value is smaller than the similarity threshold, memorizing the current environmental information for the first time and setting the number of times of the same environmental information corresponding to the current environmental information to 1;
and adjusting the learning rate and the discount rate based on the same environmental information times corresponding to the current environmental information to obtain the real-time learning rate and the real-time discount rate.
Further, in the robot flexible behavior decision method, the adjusting the learning rate and the discount rate based on the same environmental information times corresponding to the current environmental information to obtain the real-time learning rate and the real-time discount rate further includes:
and correcting the real-time learning rate and the real-time discount rate according to the nearest obstacle distance in a dynamic environment.
Further, in the robot flexible behavior decision method, the curiosity index calculated based on the maximum reward, the minimum reward, and the environmental alertness is expressed as follows:
wherein Rmax and Rmin respectively represent the maximum reward and the minimum reward obtained by the robot, and ρ is an adjustment coefficient.
Further, in the robot flexible behavior decision method, the method dynamically adjusts environmental exploration and utilization by using the curiosity index in the reinforcement learning model to obtain an improved neural network hybrid model, which comprises the following steps:
presetting a curiosity index threshold in the reinforcement learning model, and calculating the execution probability of each action to be executed based on the curiosity index and a state action value function at the current moment;
when the curiosity index is greater than or equal to the curiosity index threshold, the robot performs environment exploration;
when the curiosity index is smaller than the curiosity index threshold, the robot performs environment utilization;
in the process of the adjusted environment exploration and utilization, determining an optimal action behavior decision according to the execution probability of each action to be executed, thereby realizing dynamic self-adaptive adjustment of environment exploration and utilization and obtaining the improved neural network hybrid model.
Further, in the method for deciding the flexible behavior of the robot, the inputting the current environmental information, the target task and the state information of the robot into the improved neural network hybrid model to obtain the flexible behavior decision includes:
Inputting the current environment information, the target task and the robot state information into the improved neural network hybrid model, and obtaining the combination coefficient of the improved neural network hybrid model based on the current environment information;
and according to the nearest obstacle distance, calculating the combination coefficient in real time to adjust the proportion of the action to be executed determined based on supervised learning and the action to be executed determined based on reinforcement learning in the final behavior decision of the robot, so as to obtain the flexible behavior decision in the dynamic environment of the robot.
According to another aspect of the present application, there is also provided a computer readable medium having stored thereon computer readable instructions which, when executed by a processor, cause the processor to implement a robot behavior decision method as described above.
According to another aspect of the present application, there is also provided a robot flexible behavior decision device, comprising:
one or more processors;
a computer readable medium for storing one or more computer readable instructions,
the one or more computer-readable instructions, when executed by the one or more processors, cause the one or more processors to implement the robot flexible behavior decision method as described above.
Compared with the prior art, the present application acquires the current environment information, the target task and the current state information of the robot, wherein the current environment information comprises obstacle position information; constructs a neural network hybrid model based on a supervised learning model and a reinforcement learning model, dynamically adjusts the combination coefficient of the neural network hybrid model according to the current environment information, and improves the reinforcement learning model by increasing the curiosity index to obtain an improved neural network hybrid model; and inputs the current environment information, the target task and the robot state information into the improved neural network hybrid model to obtain a flexible behavior decision. That is, the application integrates cerebellar supervised learning and basal-ganglia reinforcement learning, accelerates the convergence rate of basal-ganglia reinforcement learning by using cerebellar supervised learning, and at the same time improves the knowledge base of cerebellar supervised learning by using basal-ganglia reinforcement learning, realizing direct interaction between the cerebellum and the basal ganglia. This in turn realizes flexible behavior decisions of the mobile robot in an unknown environment, improves the adaptability of the mobile robot to dynamic environments, improves the reinforcement learning function of the basal ganglia, realizes dynamic self-adaptive adjustment of environment exploration-utilization in reinforcement learning, and enables the robot to obtain continuous and stable learning capability.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings in which:
FIG. 1 illustrates a flow diagram of a robot flexible behavior decision method in accordance with an aspect of the present application;
FIG. 2 illustrates the inverted-U relationship between locus coeruleus activity and attention-demanding task performance in a robot flexible decision method in accordance with an aspect of the application;
FIG. 3 illustrates a schematic diagram of the feed-forward neural network operating principle for simulating the function of the anterior cingulate gyrus and the prefrontal cortex in a robot flexible decision method according to one aspect of the application;
FIG. 4 illustrates a schematic diagram of a reinforcement learning model with dynamic self-adaptive adjustment of environment exploration-utilization in a robot flexible decision method in accordance with an aspect of the application.
The same or similar reference numbers in the drawings refer to the same or similar parts.
Detailed Description
The application is described in further detail below with reference to the accompanying drawings.
In one exemplary configuration of the application, the terminal, the device of the service network, and the trusted party each include one or more processors (e.g., a Central Processing Unit (CPU)), input/output interfaces, network interfaces, and memory.
The memory may include non-persistent memory in a computer readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change RAM (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media (transmission media), such as modulated data signals and carrier waves.
Fig. 1 shows a flow chart of a robot flexible behavior decision method according to an aspect of the present application, which is suitable for a robot control process in various complex environments, and the method includes steps S1, S2 and S3, wherein the method specifically includes:
Step S1, the current environment information, the target task and the current state information of the robot are obtained, wherein the current environment information comprises obstacle position information. The current environment information is used to indicate the environmental bearings among the robot, the obstacle and the target, and comprises obstacle position information and target position information, namely, the included angle between the current robot and the target, the included angle between the robot and the obstacle, the distance between the robot and the target, and the distance between the robot and the obstacle;
Step S2, a neural network hybrid model is constructed based on the supervised learning model and the reinforcement learning model, and the combination coefficient of the neural network hybrid model is dynamically adjusted according to the current environment information, realizing dynamic adjustment of the proportions of cerebellar supervised learning and basal-ganglia reinforcement learning in the behavior decision of the mobile robot and improving the adaptability of the mobile robot to dynamic environments. A curiosity index is introduced to improve the reinforcement learning model, obtaining an improved neural network hybrid model. Here, it is considered that a living organism has different degrees of curiosity about the environment when exploring it and when exploiting it: curiosity is high during exploration and low during utilization. Therefore, a curiosity index is introduced into the reinforcement learning model, and the magnitude of curiosity is calculated by jointly combining the anterior cingulate cortex, the prefrontal cortex and neurotransmitters, so as to simulate the dynamic switching of locus coeruleus activity between the tonic mode and the phasic mode, thereby realizing dynamic self-adaptive adjustment of environment exploration-utilization in reinforcement learning.
And step S3, inputting the current environment information, the target task and the robot state information into the improved neural network hybrid model to obtain a flexible behavior decision.
Through steps S1 to S3, the current environment information, the target task and the current state information of the robot are acquired, wherein the current environment information comprises obstacle position information; a neural network hybrid model is constructed based on the supervised learning model and the reinforcement learning model, the combination coefficient of the neural network hybrid model is dynamically adjusted according to the current environment information, and the reinforcement learning model is improved by increasing the curiosity index to obtain an improved neural network hybrid model; the current environment information, the target task and the robot state information are input into the improved neural network hybrid model to obtain a flexible behavior decision. In other words, the application integrates cerebellar supervised learning and basal-ganglia reinforcement learning, accelerates the convergence rate of basal-ganglia reinforcement learning by using cerebellar supervised learning, and at the same time improves the knowledge base of cerebellar supervised learning by using basal-ganglia reinforcement learning, thereby realizing direct interaction between the cerebellum and the basal ganglia. This realizes flexible behavior decisions of the mobile robot in an unknown environment, improves the adaptability of the mobile robot to dynamic environments, improves the reinforcement learning function of the basal ganglia, realizes dynamic self-adaptive adjustment of environment exploration-utilization in reinforcement learning, and enables the robot to obtain continuous and stable learning capability.
For example, first, when the robot moves in the environment, the current environment information Xu is received through the vision sensor, and the following input is designed:
Xu = (θf, θe, df, de)
wherein θf, θe, df and de respectively represent the current included angle between the robot and the target, the included angle between the robot and the obstacle, the distance between the robot and the target, and the distance between the robot and the obstacle. The target task G and the current state information S of the robot are acquired. Then, a neural network hybrid model M is constructed based on the supervised learning model and the reinforcement learning model, and the combination coefficient ω2 of the neural network hybrid model M is dynamically adjusted according to the current environment information Xu. The method integrates the supervised learning of the cerebellum and the reinforcement learning of the basal ganglia to dynamically and adaptively adjust the exploration-utilization balance in reinforcement learning, accelerates the convergence rate of basal-ganglia reinforcement learning by using cerebellar supervised learning, and at the same time improves the knowledge base of cerebellar supervised learning by using basal-ganglia reinforcement learning, realizing direct interaction between the cerebellum and the basal ganglia. Through this integration, flexible behavior decisions of the mobile robot in an unknown environment are realized, and the adaptability of the robot in dynamic environments in particular is improved. The curiosity index is increased to improve the reinforcement learning model, so that dynamic self-adaptive adjustment of environment exploration-utilization in reinforcement learning is realized, and an improved neural network hybrid model M' is finally obtained. Finally, the current environment information Xu, the target task G and the robot state information S are input into the improved neural network hybrid model M' to obtain a flexible behavior decision.
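As a minimal illustration of how such an input vector might be assembled in software, the following Python sketch is provided. The function names, variable names and numeric values are hypothetical and not part of the patent; they only mirror the four quantities θf, θe, df and de described above.

import numpy as np

def build_environment_input(theta_f, theta_e, d_f, d_e):
    """Assemble the perception-layer input Xu = (theta_f, theta_e, d_f, d_e):
    angle to the target, angle to the nearest obstacle, distance to the target,
    and distance to the nearest obstacle."""
    return np.array([theta_f, theta_e, d_f, d_e], dtype=float)

# Hypothetical usage: the improved hybrid model M' (not specified in code form
# in the patent) would consume Xu together with the target task G and state S.
X_u = build_environment_input(theta_f=0.35, theta_e=-1.2, d_f=4.8, d_e=0.9)
G = np.array([5.0, 5.0])        # assumed target position
S = np.array([0.0, 0.0, 0.1])   # assumed robot pose (x, y, heading)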
Next, in the foregoing embodiment, in step S2, dynamically adjusting the combination coefficient of the neural network hybrid model according to the current environmental information includes:
obtaining the nearest obstacle distance among all obstacle information from the current environment information;
adjusting the combination coefficient of the neural network hybrid model according to the nearest obstacle distance, wherein the combination coefficient is expressed as follows:
ω1 = 1 - ω2
a = ω1·a1 + ω2·a2
wherein m1 and m2 are both positive constants, ω2 is the combination coefficient, enedis is the nearest obstacle distance information, a represents the optimal action decision at the current moment, a1 represents the behavior decision made by the reinforcement learning model, and a2 represents the behavior decision made by the supervised learning model.
When the robot begins to explore the environment, the new environment is unfamiliar compared with the environments memorized in the brain; calculated by the above formula, the value of ω1 is small and the value of ω2 is large, and at this moment the system preferentially adopts the supervised learning of the cerebellum to accelerate action decision-making. As the exploration proceeds and the environment becomes more familiar, the strangeness of the environment decreases, the value of ω1 increases and the value of ω2 decreases, and the system begins to rely more on the reinforcement learning of the basal ganglia to guide the robot's action selection. When the robot encounters a new environment again, the above process repeats and the robot again begins to select actions with the supervised learning of the cerebellum. As the explored environment changes, the robot dynamically adjusts the proportions of cerebellar supervised learning and basal-ganglia reinforcement learning in its action decisions to adapt to environmental changes. This scheme can give full play to the roles of cerebellar supervised learning and basal-ganglia reinforcement learning in the robot's action decisions, realize flexible behavior decisions of the robot in an unknown environment, and ensure that the robot obtains continuous and stable learning capability.
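The following sketch illustrates this dynamic blending. The patent gives the exact expression for ω2 only as an image, so the exponential dependence on the nearest obstacle distance enedis used below (and its direction, with the cerebellar action dominating near obstacles) is an illustrative assumption; only the relations ω1 = 1 - ω2 and a = ω1·a1 + ω2·a2 follow the formulas above.

import numpy as np

def combination_coefficient(enedis, m1=1.0, m2=0.5):
    """Placeholder for omega_2(enedis); assumed here to grow as the nearest
    obstacle gets closer. m1 and m2 are positive constants as in the patent."""
    return float(np.clip(m1 * np.exp(-m2 * enedis), 0.0, 1.0))

def blend_actions(a1, a2, enedis):
    """a1: action from basal-ganglia reinforcement learning,
    a2: action from cerebellar supervised learning (assumed continuous vectors).
    Returns a = omega_1 * a1 + omega_2 * a2 with omega_1 = 1 - omega_2."""
    w2 = combination_coefficient(enedis)
    w1 = 1.0 - w2
    return w1 * np.asarray(a1) + w2 * np.asarray(a2)

# With this placeholder, the RL action dominates far from obstacles and the
# supervised action dominates close to them.
print(blend_actions(a1=[0.6, 0.1], a2=[0.0, 0.4], enedis=5.0))
print(blend_actions(a1=[0.6, 0.1], a2=[0.0, 0.4], enedis=0.2))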
In still another embodiment of the present application, the step S2 of adding curiosity index to improve the reinforcement learning model to obtain an improved neural network hybrid model includes:
step S21, obtaining the maximum rewards and the minimum rewards and the environmental alertness obtained by the robot in the reinforcement learning model, comprising:
dynamically adjusting the learning rate and the discount rate based on the current environmental information to obtain a real-time learning rate and a real-time discount rate;
A state-action value function at the current moment of the robot is acquired by combining the real-time learning rate and the real-time discount rate, and the reinforcement signal corresponding to the current moment is calculated according to the reward prediction error. When the model runs, the reinforcement signal δ is first calculated: the difference between the reward r obtained when the robot selects a certain action ai and the currently predicted value of that action, i.e. the reward prediction error, is used as the reinforcement signal δ, that is:
δ = r - Q(s, ai)
The robot updates the action values and selects actions according to the reward prediction error, so the reinforcement signal δ(t) based on the reward prediction error can be described as follows:
wherein rt is the reward the robot obtains at time t.
In addition to influencing the reward value of each action, the dopamine signal also influences the feedback-classification neurons in the anterior cingulate cortex. When there is a negative reinforcement signal (representing a penalty), the neurons in the anterior cingulate cortex marked as "wrong" respond, and the response value is denoted δ-(t); when there is a positive reinforcement signal (representing a reward), the neurons marked as "correct" respond, and the response value is denoted δ+(t).
The environmental alertness is obtained based on the response value of the reinforcement signal. For example, if the robot takes a certain action a in a scene s at a certain time t and obtains a positive reinforcement signal δ+(t), then β*(t) + η+·δ+(t) → β*(t), with η+ > 0; if a negative reinforcement signal δ-(t) is obtained, then β*(t) + η-·δ-(t) → β*(t), with η- < 0. Whether for a large positive reinforcement signal δ+(t) or a strongly negative reinforcement signal δ-(t), the alertness obtained by the robot tends in the same direction and takes a relatively large value. When the value of the reinforcement signal δ(t) is moderate, the obtained environmental alertness value is also moderate.
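A short sketch of the reinforcement signal and the alertness update described above is given below; it assumes a scalar action value Q(s, ai), and the constants η+ and η- are illustrative.

def reinforcement_signal(reward, q_value):
    """delta = r - Q(s, ai): the reward prediction error used as the reinforcement signal."""
    return reward - q_value

def update_alertness(beta, delta, eta_pos=0.1, eta_neg=-0.1):
    """beta*(t) + eta+ * delta+(t) -> beta*(t) for a positive reinforcement signal,
    beta*(t) + eta- * delta-(t) -> beta*(t) for a negative one (eta+ > 0, eta- < 0),
    so strongly positive and strongly negative signals both raise the alertness."""
    return beta + (eta_pos if delta >= 0 else eta_neg) * delta

# Example: a surprising reward and a surprising penalty both raise alertness.
beta = 0.2
beta = update_alertness(beta, reinforcement_signal(reward=1.0, q_value=0.1))
beta = update_alertness(beta, reinforcement_signal(reward=-1.0, q_value=0.1))
print(beta)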
Step S22, calculating the curiosity index based on the maximum rewards, the minimum rewards and the environmental alertness; the curiosity index is expressed as follows:
wherein Rmax and Rmin respectively represent the maximum reward and the minimum reward obtained by the robot, and ρ is an adjustment coefficient.
And S23, dynamically adjusting environment exploration and utilization by utilizing the curiosity index in the reinforcement learning model to obtain an improved neural network hybrid model.
Here, when the reinforcement signal δ(t) obtained by the robot takes a large positive value or a strongly negative value, the environmental alertness β* takes a larger value and the curiosity index Cur is larger; the robot can then select among more actions, which guides the robot to explore the environment. Conversely, when the value of the reinforcement signal δ(t) obtained by the robot is moderate, the environmental alertness β* is also moderate and the curiosity index Cur is smaller, i.e. the activity of the action neurons in the prefrontal cortex is more sharply differentiated, so the robot almost always selects the action with the highest reward value, which manifests as utilization of the environment. The trend of the curiosity index Cur as the reinforcement signal δ(t) changes coincides with the inverted-U relationship between locus coeruleus activity and attention-demanding task performance depicted in fig. 2. Therefore, the curiosity index Cur is used to simulate the dynamic switching of locus coeruleus activity between the tonic mode and the phasic mode, so that dynamic adjustment of environment exploration-utilization in the robot's reinforcement learning process can be realized.
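The patent gives the exact expression for Cur only as an image; the stand-in below (alertness scaled by the reward range and the adjustment coefficient ρ) is an assumption that merely preserves the monotonic relationship described above, and the threshold value is illustrative.

def curiosity_index(beta, r_max, r_min, rho=1.0):
    """Hypothetical stand-in for Cur(beta*, R_max, R_min, rho): curiosity grows
    with the environmental alertness beta*, scaled by the reward range."""
    return rho * beta / max(r_max - r_min, 1e-8)

def explore_or_exploit(cur, cur_threshold=0.3):
    """Cur >= threshold -> environment exploration; Cur < threshold -> utilization."""
    return "explore" if cur >= cur_threshold else "exploit"

print(explore_or_exploit(curiosity_index(beta=0.9, r_max=1.0, r_min=-1.0)))  # explore
print(explore_or_exploit(curiosity_index(beta=0.2, r_max=1.0, r_min=-1.0)))  # exploit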
For example, as shown in fig. 3: first, the visual input is transmitted to the posterior parietal cortex, and a reward signal r is received in the ventral tegmental area, from which the reinforcement learning signal δ is calculated. The anterior cingulate cortex uses δ to calculate, update and store the value of each possible action, while the environmental alertness is calculated from the reinforcement signal δ by a group of feedback-classification neurons (neurons labeled "correct" and neurons labeled "wrong") in the anterior cingulate cortex, and the orbitofrontal cortex evaluates the reward of each action. The action values, the alertness and the reward of each action are then transferred to the prefrontal cortex, which performs action evaluation and selection; the selected action to be performed is transmitted, through the striatum, substantia nigra and thalamus, to the premotor cortex. Finally, the premotor cortex output controls the motion of the robot, a meta-learning rule (meta learning) is introduced, and the history of the actions performed by the robot is fed back to the anterior cingulate cortex.
Following the above embodiment, the dynamically adjusting the learning rate and the discount rate based on the current environmental information to obtain a real-time learning rate and a real-time discount rate includes:
calculating a similarity value of the current environmental information and the historical environmental information;
if the maximum similarity value is greater than or equal to a similarity threshold, accumulating the number of times of the same environmental information corresponding to the current environmental information;
if the maximum similarity value is smaller than the similarity threshold, memorizing the current environmental information for the first time and setting the number of times of the same environmental information corresponding to the current environmental information to 1;
and adjusting the learning rate and the discount rate based on the same environmental information times corresponding to the current environmental information to obtain the real-time learning rate and the real-time discount rate.
For example, during the movement of the robot, the number of times the same environmental information is cumulatively obtained is recorded. Let the current state information of the robot be denoted by s and the cumulative number of times the same environmental information is obtained be denoted by N(s). At the current moment, the robot receives the input current environment information Xu from the perception layer and compares it with all the historical environmental information memorized in the brain. When there is a certain environmental state s0 in memory whose environmental bearing information and Xu satisfy the following condition:
where K is the similarity threshold, the memorized bearing information of the environmental state s0 is considered identical to the current environment information Xu obtained by the robot at the current moment, i.e. the current state information of the robot is the same as the historical state information s0 in memory, and the cumulative count N(s0) of obtaining the same environmental information for s0 is increased by 1. When the above condition is not satisfied, it is considered that no historical environmental information in memory matches the current environment information Xu at the current moment; the robot memorizes this position for the first time, and the number of times of the same environmental information corresponding to the current environmental information is 1. Summarizing, the formula is:
The larger the value of N(s), i.e. the more times the same environmental information has been accumulated and the more times the robot has passed through this region during its movement, the more familiar the robot is considered to be with the region; the locus coeruleus then operates in the tonic mode and the norepinephrine content decreases. Conversely, the smaller the value of N(s), the fewer times the robot has passed through the region and the more unfamiliar the robot is considered to be with it; the locus coeruleus then operates in the phasic mode and the norepinephrine content increases.
The environmental state s corresponding to N(s) is first transferred to the locus coeruleus and stimulates it to produce norepinephrine, whose content Ne can be expressed as follows:
Ne = -k1·N(s) + c1
wherein k1 and c1 are both positive constants.
Norepinephrine acts on the basal forebrain to affect the acetylcholine content, which can be expressed as follows:
Ach = k2·Ne + c2
wherein k2 and c2 are positive constants.
The acetylcholine content affects the learning rate α, then:
α = 1 - 1/(1 + e^Ach)
At the same time, the norepinephrine content also affects the magnitude of the discount rate γ, then:
γ = k3·Ne + c3
wherein k3 and c3 are positive constants.
According to the method, the learning rate and the discount rate are dynamically adjusted, so that the convergence rate of the robot is improved, the flexible behavior decision in the unknown environment of the robot is realized, and the continuous and stable learning capability of the robot is ensured.
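These relations translate directly into code. The constants k1, c1, k2, c2, k3 and c3 are unspecified positive constants in the patent, so the values used below are illustrative.

import math

def norepinephrine(n_s, k1=0.1, c1=1.0):
    """Ne = -k1 * N(s) + c1: greater familiarity lowers the norepinephrine content."""
    return -k1 * n_s + c1

def acetylcholine(ne, k2=1.0, c2=0.0):
    """Ach = k2 * Ne + c2."""
    return k2 * ne + c2

def learning_rate(ach):
    """alpha = 1 - 1 / (1 + e^Ach)."""
    return 1.0 - 1.0 / (1.0 + math.exp(ach))

def discount_rate(ne, k3=0.5, c3=0.3):
    """gamma = k3 * Ne + c3."""
    return k3 * ne + c3

ne = norepinephrine(n_s=3)   # a region the robot has visited three times
print(learning_rate(acetylcholine(ne)), discount_rate(ne))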
Next, in the foregoing embodiment, the adjusting the learning rate and the discount rate based on the number of times of the same environmental information corresponding to the current environmental information, to obtain the real-time learning rate and the real-time discount rate, further includes:
and correcting the real-time learning rate and the real-time discount rate according to the nearest obstacle distance in a dynamic environment. Here, in the dynamic environment, the robot is to pay attention not only to the position information of the static obstacle but also to the position information of the dynamic obstacle. Therefore, the learning rate and the discount rate in the dynamic environment need to be corrected, and the robot is prevented from collision, so that the adaptability of the robot in the dynamic environment is improved. When the robot encounters an unfamiliar dynamic environment, the projection process of the norepinephrine-acetylcholine-learning rate is enhanced, the learning rate is improved, and the robot is accelerated to learn a new environment; meanwhile, the projection process of the norepinephrine-serotonin-discount rate is enhanced, and the discount rate is improved. The enhancement signal delta (t) based on the reward prediction error becomes larger, which is manifested by a stronger response of the robot to the obstacle, and both projections reflect that the robot is constantly "exploring-utilizing balance" in motion, as shown in fig. 4.
The corrected norepinephrine content Ne' is calculated as follows:
Ne' = -k1·N(s) + c1 - k'1·enedis
wherein k'1 is a positive constant, enedis is the distance to the nearest obstacle under the current environment information of the robot, and k1 and c1 are both positive constants.
The norepinephrine content affects the acetylcholine content, and the acetylcholine content Ach' in a dynamic environment can be expressed as:
Ach' = k2·Ne' + k'2·enedis
wherein k'2 is a positive constant.
The learning rate α' in a dynamic environment is then obtained from Ach' in the same way as α is obtained from Ach in the static case. Similarly, the new discount rate γ' in a dynamic environment can be expressed as:
γ' = k'3·Ne'
wherein k'3 is a positive constant.
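The corrected dynamic-environment quantities follow the same pattern in code; the sketch below assumes that α' mirrors the static-environment sigmoid with Ach replaced by Ach', and all constants are illustrative.

import math

def norepinephrine_dynamic(n_s, enedis, k1=0.1, c1=1.0, k1p=0.2):
    """Ne' = -k1*N(s) + c1 - k1'*enedis: Ne' rises as the nearest obstacle gets closer."""
    return -k1 * n_s + c1 - k1p * enedis

def acetylcholine_dynamic(ne_p, enedis, k2=1.0, k2p=0.2):
    """Ach' = k2*Ne' + k2'*enedis."""
    return k2 * ne_p + k2p * enedis

def learning_rate_dynamic(ach_p):
    """Assumed analogue of the static case: alpha' = 1 - 1/(1 + e^Ach')."""
    return 1.0 - 1.0 / (1.0 + math.exp(ach_p))

def discount_rate_dynamic(ne_p, k3p=0.5):
    """gamma' = k3' * Ne'."""
    return k3p * ne_p

ne_p = norepinephrine_dynamic(n_s=3, enedis=0.4)
print(learning_rate_dynamic(acetylcholine_dynamic(ne_p, enedis=0.4)),
      discount_rate_dynamic(ne_p))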
For example, in an experimental environment of 500 x 500, there are 10 static obstacles and 3 dynamic obstacles in addition to the robot and the target, and one of the three dynamic obstacles moves vertically downward at a uniform velocity, one moves straight rightward at a uniform velocity, and one moves randomly. The performance of robot path planning is improved under the conditions of changing the learning rate or changing the discount rate, and the possibility of local minima and collision is reduced. And in a dynamic environment, when the variable learning rate is adopted, the number of times the robot reaches the destination is obviously higher than the number of times the robot reaches the target when the variable discount rate is adopted.
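For reference, a minimal sketch of such an experimental setup is shown below; the obstacle positions and speeds are invented for illustration, and only the counts and motion patterns follow the description above.

import numpy as np

rng = np.random.default_rng(0)
WORLD = (500, 500)

static_obstacles = rng.uniform(0, 500, size=(10, 2))   # 10 static obstacles

dynamic_obstacles = [
    {"pos": np.array([100.0, 450.0]), "vel": np.array([0.0, -2.0])},  # uniform, vertically downward
    {"pos": np.array([50.0, 250.0]),  "vel": np.array([2.0, 0.0])},   # uniform, straight rightward
    {"pos": np.array([300.0, 300.0]), "vel": None},                   # random motion
]

def step_dynamic_obstacles():
    """Advance the three dynamic obstacles by one time step."""
    for ob in dynamic_obstacles:
        v = ob["vel"] if ob["vel"] is not None else rng.uniform(-2.0, 2.0, size=2)
        ob["pos"] = np.clip(ob["pos"] + v, 0, WORLD[0])

step_dynamic_obstacles()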
Following the above embodiment, the step S23 dynamically adjusts the environmental exploration and utilization by using the curiosity index in the reinforcement learning model to obtain an improved neural network hybrid model, which includes:
A curiosity index threshold is preset in the reinforcement learning model, and the execution probability of each action to be executed is calculated based on the curiosity index and the state-action value function at the current moment. Here, the activity of the prefrontal cortex neurons can be converted, via a Softmax equation based on the Boltzmann distribution, from the value of each action into the probability of performing the relevant action ai under the current state information s of the robot:
when the curiosity index is greater than or equal to the curiosity index threshold, the robot performs environment exploration;
when the curiosity index is smaller than the curiosity index threshold, the robot performs environment utilization;
and in the process of the environment exploration and utilization after adjustment, determining an optimal action behavior decision according to the execution probability of each action to be executed, and realizing the dynamic self-adaptive adjustment of the environment exploration and utilization to obtain the improved neural network hybrid model.
When the curiosity index Cur takes a larger value, the execution probabilities of the actions are close to one another, so the robot can select among more actions, which manifests as exploring the environment; when the curiosity index Cur is smaller, the differences between the probabilities of the actions increase, i.e. the activity of the action neurons in the prefrontal cortex is more sharply differentiated, so the robot almost always selects the action with the highest reward value, which manifests as utilizing the environment. Therefore, the switching between the robot's exploration of the environment and its utilization of the environment can be dynamically adjusted by means of the curiosity index Cur.
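A sketch of this curiosity-modulated action selection is given below. Treating Cur as the temperature of the Boltzmann distribution is an interpretation of the description above (the patent's Softmax expression itself is given as an image), so the exact parameterization is an assumption.

import numpy as np

rng = np.random.default_rng(0)

def action_probabilities(q_values, cur):
    """Softmax over action values with the curiosity index Cur as the temperature:
    a large Cur gives near-uniform probabilities (exploration), a small Cur lets
    the highest-valued action dominate (utilization)."""
    q = np.asarray(q_values, dtype=float)
    temperature = max(cur, 1e-6)
    z = (q - q.max()) / temperature      # subtract the max for numerical stability
    expz = np.exp(z)
    return expz / expz.sum()

def select_action(q_values, cur):
    """Sample an action index according to the Boltzmann probabilities."""
    p = action_probabilities(q_values, cur)
    return int(rng.choice(len(p), p=p))

q = [0.9, 0.5, 0.1]
print(action_probabilities(q, cur=5.0))    # probabilities close together: exploration
print(action_probabilities(q, cur=0.05))   # sharply peaked: utilization
print(select_action(q, cur=0.5))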
In yet another embodiment of the present application, step S3 inputs the current environmental information, the target task, and the robot state information into the improved neural network hybrid model to obtain a flexible behavior decision, including:
inputting the current environment information, the target task and the robot state information into the improved neural network hybrid model, and then obtaining the combination coefficient of the improved neural network hybrid model based on the current environment information;
and according to the nearest obstacle distance, calculating the combination coefficient in real time to adjust the proportion of the action to be executed determined based on supervised learning and the action to be executed determined based on reinforcement learning in the final behavior decision of the robot, so as to obtain the flexible behavior decision of the robot.
The application dynamically adjusts the weights of reinforcement learning and supervision learning according to the continuously-changing environmental information, so that different behavior decision strategies are adopted according to different environments in the whole motion process of the robot, thereby improving the learning capacity and adaptability of the robot and ensuring the efficiency and quality of the robot for completing target tasks.
According to another aspect of the present application, there is also provided a computer readable medium having stored thereon computer readable instructions executable by a processor to cause the processor to implement the robot flexible behavior decision method as described above.
According to another aspect of the present application, there is also provided a robot flexible behavior decision device, the device comprising:
one or more processors;
a computer readable medium for storing one or more computer readable instructions;
the one or more computer-readable instructions, when executed by the one or more processors, cause the one or more processors to implement the robot flexible behavior decision method as described above.
For the details of each embodiment of the device, reference may be made to the corresponding parts of the embodiments of the robot flexible behavior decision method, which are not repeated here.
In summary, the present application obtains the current environmental information, the target task and the current state information of the robot, where the current environmental information includes the obstacle position information; constructing a neural network mixed model based on a supervised learning model and a reinforcement learning model, dynamically adjusting the combination coefficient of the neural network mixed model according to the current environmental information, and improving the reinforcement learning model by increasing curiosity indexes to obtain an improved neural network mixed model; the current environment information, the target task and the robot state information are input into the improved neural network mixed model to obtain a flexible behavior decision, namely, the application integrates the supervised learning of the cerebellum and the reinforcement learning of the basal ganglia, accelerates the convergence rate of the reinforcement learning of the basal ganglia by using the supervised learning of the cerebellum, and improves the knowledge base of the supervised learning of the cerebellum by using the reinforcement learning of the basal ganglia at the same time, thereby realizing the direct interaction between the cerebellum and the basal ganglia, further realizing the flexible behavior decision in the unknown environment of the mobile robot, improving the adaptability of the dynamic environment of the mobile robot, improving the reinforcement learning function of the basal ganglia, realizing the dynamic self-adaptive regulation of environment exploration-utilization in the reinforcement learning, and enabling the robot to obtain continuous and stable learning capability.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, e.g., using Application Specific Integrated Circuits (ASIC), a general purpose computer or any other similar hardware device. In one embodiment, the software program of the present application may be executed by a processor to perform the steps or functions described above. Likewise, the software programs of the present application (including associated data structures) may be stored on a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. In addition, some steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
Furthermore, portions of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application by way of operation of the computer. Program instructions for invoking the inventive methods may be stored in fixed or removable recording media and/or transmitted via a data stream in a broadcast or other signal bearing medium and/or stored within a working memory of a computer device operating according to the program instructions. An embodiment according to the application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to operate a method and/or a solution according to the embodiments of the application as described above.
It will be evident to those skilled in the art that the application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the apparatus claims can also be implemented by means of one unit or means in software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.

Claims (8)

1. A method for robot flexible behavior decision-making, the method comprising:
acquiring current environment information, a target task and current state information of a robot, wherein the current environment information comprises obstacle position information;
Constructing a neural network mixed model based on a supervised learning model and a reinforcement learning model, dynamically adjusting the combination coefficient of the neural network mixed model according to the current environmental information, and improving the reinforcement learning model by increasing curiosity indexes to obtain an improved neural network mixed model;
wherein the dynamically adjusting the combination coefficient of the neural network hybrid model according to the current environment information comprises: obtaining the nearest obstacle distance among all obstacle information from the current environment information; and adjusting the combination coefficient of the neural network hybrid model according to the nearest obstacle distance, wherein the combination coefficient is expressed as follows:
wherein ω1 = 1 - ω2,
a = ω1·a1 + ω2·a2,
m1 and m2 are both positive constants, ω2 is the combination coefficient, enedis is the nearest obstacle distance information, a represents the optimal action behavior decision at the current moment, a1 represents the behavior decision made by the reinforcement learning model, and a2 represents the behavior decision made by the supervised learning model;
the improving of the reinforcement learning model by increasing the curiosity index to obtain an improved neural network hybrid model comprises: obtaining the maximum reward and the minimum reward obtained by the robot in the reinforcement learning model, and the environmental alertness; and calculating the curiosity index based on the maximum reward, the minimum reward, and the environmental alertness, wherein the curiosity index is expressed as follows:
wherein Cur represents the curiosity index, β* represents the environmental alertness, Rmax and Rmin respectively represent the maximum reward and the minimum reward obtained by the robot, and ρ is an adjustment coefficient;
dynamically adjusting environment exploration and utilization by utilizing the curiosity index in the reinforcement learning model to obtain an improved neural network hybrid model;
and inputting the current environment information, the target task and the robot state information into the improved neural network hybrid model to obtain a flexible behavior decision.
2. The method of claim 1, wherein the obtaining of the maximum reward and the minimum reward obtained by the robot in the reinforcement learning model, and the environmental alertness, comprises:
dynamically adjusting the learning rate and the discount rate based on the current environmental information to obtain a real-time learning rate and a real-time discount rate;
acquiring a state action value function of the current moment of the robot by combining the real-time learning rate and the real-time discount rate, and calculating to obtain a strengthening signal corresponding to the current moment according to a reward prediction error;
and obtaining the environmental alertness based on the response value of the strengthening signal.
3. The method of claim 2, wherein dynamically adjusting the learning rate and the discount rate based on the current environmental information results in a real-time learning rate and a real-time discount rate, comprising:
calculating a similarity value of the current environmental information and the historical environmental information;
if the similarity value is greater than or equal to a similarity threshold, accumulating the number of times of the same environmental information corresponding to the current environmental information;
if the similarity value is smaller than the similarity threshold, memorizing the current environmental information for the first time and setting the number of times of the same environmental information corresponding to the current environmental information to 1;
and adjusting the learning rate and the discount rate based on the same environmental information times corresponding to the current environmental information to obtain the real-time learning rate and the real-time discount rate.
4. The method of claim 3, wherein the adjusting the learning rate and the discount rate based on the same number of times of the environmental information corresponding to the current environmental information results in the real-time learning rate and the real-time discount rate, further comprising:
and correcting the real-time learning rate and the real-time discount rate according to the nearest obstacle distance in a dynamic environment.
5. The method of claim 4, wherein the dynamically adjusting environment exploration and utilization in the reinforcement learning model by using the curiosity index to obtain the improved neural network hybrid model comprises:
presetting a curiosity index threshold in the reinforcement learning model, and calculating the execution probability of each action to be executed based on the curiosity index and a state action value function at the current moment;
when the curiosity index is greater than or equal to the curiosity index threshold, the robot performs environment exploration;
when the curiosity index is smaller than the curiosity index threshold, the robot performs environment utilization;
and in the adjusted environment exploration and utilization process, determining the optimal action behavior decision according to the execution probability of each action to be executed, so as to obtain the improved neural network hybrid model.
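Illustrative sketch of claim 5, not part of the claims: the curiosity-threshold gating between exploration and utilization follows the claim, while the softmax form of the per-action execution probability and its temperature coupling to the curiosity index are assumptions:

```python
import numpy as np

def select_action(q_values, curiosity, cur_threshold, rng=None):
    """Curiosity-gated action selection: explore when Cur >= threshold,
    otherwise exploit. q_values are the state-action values Q(s, a)
    at the current moment."""
    rng = rng or np.random.default_rng()
    q = np.asarray(q_values, dtype=float)
    # Execution probability of each candidate action (assumed softmax form;
    # higher curiosity flattens the distribution).
    temperature = 1.0 + max(curiosity, 0.0)
    probs = np.exp((q - q.max()) / temperature)
    probs /= probs.sum()
    if curiosity >= cur_threshold:
        return int(rng.choice(len(q), p=probs))   # explore: sample by probability
    return int(np.argmax(q))                      # exploit: greedy action
```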
6. The method according to any one of claims 1-5, wherein the inputting the current environmental information, the target task and the robot state information into the improved neural network hybrid model to obtain the flexible behavior decision comprises:
Inputting the current environment information, the target task and the robot state information into the improved neural network hybrid model, and obtaining the combination coefficient of the improved neural network hybrid model based on the current environment information;
and according to the nearest obstacle distance, calculating the combination coefficient in real time to adjust the proportion of the action to be executed determined based on supervised learning and the action to be executed determined based on reinforcement learning in the final behavior decision of the robot, so as to obtain the flexible behavior decision of the robot.
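Illustrative sketch of claim 6, not part of the claims: the coefficients are recomputed in real time from the nearest obstacle distance as the claim requires, but the logistic functional form, the constants m1 and m2, and the direction of the weighting between the supervised and reinforcement-learning decisions are assumptions:

```python
import math

def combination_coefficients(nearest_obstacle_dist, m1=1.0, m2=1.0):
    """Return (w1, w2) with w1 + w2 = 1, recomputed at every step from the
    nearest obstacle distance. m1 and m2 are positive constants; here w2
    (the supervised-learning weight) is assumed to grow as the obstacle
    gets closer, while open space shifts weight w1 to the RL decision."""
    w2 = 1.0 / (1.0 + math.exp(m1 * (nearest_obstacle_dist - m2)))
    return 1.0 - w2, w2
```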
7. A computer readable medium having stored thereon computer readable instructions executable by a processor to cause the processor to implement the method of any one of claims 1 to 6.
8. A robot flexible behavior decision device, the device comprising:
one or more processors;
a computer readable medium for storing one or more computer readable instructions which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 6.
CN202110973178.4A 2021-08-24 2021-08-24 Robot flexible behavior decision method and equipment Active CN113671834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110973178.4A CN113671834B (en) 2021-08-24 2021-08-24 Robot flexible behavior decision method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110973178.4A CN113671834B (en) 2021-08-24 2021-08-24 Robot flexible behavior decision method and equipment

Publications (2)

Publication Number Publication Date
CN113671834A CN113671834A (en) 2021-11-19
CN113671834B true CN113671834B (en) 2023-09-01

Family

ID=78545466

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110973178.4A Active CN113671834B (en) 2021-08-24 2021-08-24 Robot flexible behavior decision method and equipment

Country Status (1)

Country Link
CN (1) CN113671834B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114161419B (en) * 2021-12-13 2023-09-15 大连理工大学 Efficient learning method for robot operation skills guided by scene memory

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108762281A (en) * 2018-06-08 2018-11-06 哈尔滨工程大学 Memory-association reinforcement-learning-based decision-making method for an embedded real-time underwater intelligent robot
CN111645076A (en) * 2020-06-17 2020-09-11 郑州大学 Robot control method and equipment
JP2021034050A (en) * 2019-08-21 2021-03-01 哈爾浜工程大学 Auv action plan and operation control method based on reinforcement learning
CN112612274A (en) * 2020-12-22 2021-04-06 清华大学 Autonomous motion decision control method and system for ultrasonic inspection robot
CN113031437A (en) * 2021-02-26 2021-06-25 同济大学 Water pouring service robot control method based on dynamic model reinforcement learning
CN113077052A (en) * 2021-04-28 2021-07-06 平安科技(深圳)有限公司 Reinforced learning method, device, equipment and medium for sparse reward environment
CN113110592A (en) * 2021-04-23 2021-07-13 南京大学 Unmanned aerial vehicle obstacle avoidance and path planning method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11179846B2 (en) * 2018-07-24 2021-11-23 Yan Yufik Method and systems for enhancing collaboration between robots and human operators
CN109190720B (en) * 2018-07-28 2021-08-06 深圳市商汤科技有限公司 Intelligent agent reinforcement learning method, device, equipment and medium
US20210103815A1 (en) * 2019-10-07 2021-04-08 Deepmind Technologies Limited Domain adaptation for robotic control using self-supervised learning


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on a navigation knowledge acquisition algorithm based on improved Q-learning; Zheng Bingwen; Kexue Zhi You (Issue 04); full text *

Also Published As

Publication number Publication date
CN113671834A (en) 2021-11-19

Similar Documents

Publication Publication Date Title
Li et al. An improved DQN path planning algorithm
Chen et al. Deep reinforcement learning based path tracking controller for autonomous vehicle
Roozegar et al. XCS-based reinforcement learning algorithm for motion planning of a spherical mobile robot
CN113671834B (en) Robot flexible behavior decision method and equipment
CN116476825B (en) Automatic driving lane keeping control method based on safe and reliable reinforcement learning
Lin et al. Nonlinear system control using self-evolving neural fuzzy inference networks with reinforcement evolutionary learning
Schultheis et al. Receding horizon curiosity
Kong et al. Energy management strategy for electric vehicles based on deep Q-learning using Bayesian optimization
Wu et al. Uncertainty-aware model-based reinforcement learning with application to autonomous driving
Gu et al. DM-DQN: Dueling Munchausen deep Q network for robot path planning
Han et al. Reinforcement learning guided by double replay memory
Iqbal et al. Intelligent multimedia content delivery in 5G/6G networks: a reinforcement learning approach
Jiang et al. Path tracking control based on Deep reinforcement learning in Autonomous driving
Huang et al. A novel policy based on action confidence limit to improve exploration efficiency in reinforcement learning
Koh et al. Cooperative control of mobile robots with stackelberg learning
Liu et al. Her-pdqn: A reinforcement learning approach for uav navigation with hybrid action spaces and sparse rewards
CN115526270A (en) Spatio-temporal fusion reasoning and lifelong cognition learning method for behavior evolution in open environment
Zhang et al. An efficient planning method based on deep reinforcement learning with hybrid actions for autonomous driving on highway
CN114559439A (en) Intelligent obstacle avoidance control method and device for mobile robot and electronic equipment
CN114610039A (en) Robot control method, device, robot and storage medium
Liu et al. Safe offline reinforcement learning through hierarchical policies
Wulfe Uav collision avoidance policy optimization with deep reinforcement learning
García et al. Incremental reinforcement learning for multi-objective robotic tasks
Yu et al. Coordinated collision avoidance for multi‐vehicle systems based on collision time
Wang et al. A computational developmental model of perceptual learning for mobile robot

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant