CN113671834B - Robot flexible behavior decision method and equipment - Google Patents

Robot flexible behavior decision method and equipment

Info

Publication number
CN113671834B
CN113671834B (application CN202110973178.4A)
Authority
CN
China
Prior art keywords
robot
information
current
neural network
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110973178.4A
Other languages
Chinese (zh)
Other versions
CN113671834A (en)
Inventor
王东署
朱觐镳
王河山
辛健斌
马天磊
贾建华
罗勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou University
Original Assignee
Zhengzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou University
Priority to CN202110973178.4A
Publication of CN113671834A
Application granted
Publication of CN113671834B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion; electric
    • G05B13/04 - Adaptive control systems; electric; involving the use of models or simulators
    • G05B13/042 - Adaptive control systems in which a parameter or coefficient is automatically adjusted to optimise the performance

Abstract

The application provides a robot flexible behavior decision method and equipment. The current environment information, a target task and the current state information of the robot are acquired; a neural network hybrid model is constructed based on a supervised learning model and a reinforcement learning model, the combination coefficient of the hybrid model is dynamically adjusted according to the current environment information, and the reinforcement learning model is improved by introducing a curiosity index, yielding an improved neural network hybrid model; the current environment information, the target task and the robot state information are input into the improved neural network hybrid model to obtain a flexible behavior decision. In other words, reinforcement learning and supervised learning are dynamically combined, and dynamic self-adaptive adjustment of environment exploration-utilization is realized within reinforcement learning, so that flexible behavior decisions of the mobile robot in an unknown environment are achieved and the adaptability and learning capability of the mobile robot are improved.

Description

Robot flexible behavior decision method and equipment
Technical Field
The application relates to the field of computers, in particular to a robot flexible behavior decision method and equipment.
Background
In the prior art, with the development of technology, the mobile robot has become an ideal carrier combining robotics and artificial intelligence. Carrying the latest achievements of both fields, mobile robots have been used to handle various tasks and are widely applied in human production and daily life.
An important prerequisite for a mobile robot to complete various unspecified tasks is to fully and effectively recognize the environment in which it is located. In the environment cognition process, the robot generally has no prior knowledge of the environment and may encounter various dynamic or static obstacles, various emergencies, or even various 'traps' during its movement. How to realize flexible behavior decisions of a mobile robot in a dynamic environment has therefore always been an important problem for robotics researchers.
To solve this problem, researchers have proposed various behavior decision methods, such as robust nonsingular terminal sliding mode control, forward/backward motion control, sliding mode control + attractive ellipsoid method, anti-interference PID control, central pattern generators, and Hopf oscillator + Kuramoto oscillator. However, these behavior decision methods are mostly aimed at a specific application scenario; once the application scenario changes, the corresponding behavior decision needs to be modified accordingly, so their ability to adapt to dynamic environments is poor.
In addition to the above methods, in recent years scholars have provided a basis for robot behavior decisions by simulating the function of cells located in the brain. Moreover, with the development of intelligent science, more and more intelligent control algorithms have been applied to robot behavior decisions, such as fuzzy inference systems, fuzzy logic + behavior trees, neural networks, neural inverse reinforcement learning, and feedforward neural network + Q-learning.
In recent years, with the development of neurobiology and the progressively deeper research on the cognitive mechanisms of the human brain, researchers have begun to consider introducing the cognitive principles of the human brain into the behavior decisions of mobile robots and to solve the robot behavior decision problem by simulating the thinking mode of the human brain. Fang, Naveros, Hausknecht, Kui and others have respectively proposed robot behavior decision solutions that simulate cerebellar function for different robots. Research shows that, in addition to the cerebellum, the basal ganglia play an important role in behavioral decisions, and both are linked to target-guided behavior decisions. Since both are associated with target-guided behavior decisions, the idea naturally arises of whether the two can be jointly incorporated into robot behavior decisions. At present, there have been studies jointly applying the cerebellum and the basal ganglia to robot behavior decisions. For example, Dasgupta et al. simulated the learning principles of the basal ganglia and the cerebellum with an Actor-Critic model and input correlation learning respectively, proposed a reward-modulated heterosynaptic plasticity model, and adaptively combined the two systems of cerebellum and basal ganglia for behavior control of a four-wheeled robot. However, in this model there is no direct feedback (interaction) between the cerebellum and the basal ganglia. Ruan Xiaogang et al. proposed a tendency-based, motion-related heuristic dynamic programming learning mechanism to explore the cooperative mechanism between the cerebellum and the basal ganglia, adopting an Actor model and a Critic model to simulate the functions of the cerebellum and the basal ganglia respectively for motion control of a two-wheeled self-balancing robot. In that model, the motion output of the cerebellum is taken as one of the inputs of the basal ganglia, but the basal ganglia do not feed back to the cerebellum, i.e. the interaction between the two is unidirectional (cerebellum to basal ganglia) rather than bidirectional.
Therefore, how to realize a direct interactive connection between the cerebellum and the basal ganglia in a neural network model, integrate cerebellar supervised learning and basal-ganglia reinforcement learning, realize flexible behavior decisions of a mobile robot in an unknown environment, and at the same time enable the robot to obtain continuous and stable learning ability, is the direction of research for those skilled in the art.
Disclosure of Invention
The application aims to provide a robot flexible behavior decision method and device, which solve the problem in the prior art of how to dynamically combine cerebellar supervised learning and basal-ganglia reinforcement learning in mobile robot behavior decisions, so that the robot obtains continuous and stable learning capability and the adaptability of the mobile robot to dynamic environments is improved.
According to one aspect of the present application, there is provided a robot flexible behavior decision method comprising:
acquiring current environment information, a target task and current state information of a robot, wherein the current environment information comprises obstacle position information;
constructing a neural network mixed model based on a supervised learning model and a reinforcement learning model, dynamically adjusting the combination coefficient of the neural network mixed model according to the current environmental information, and improving the reinforcement learning model by increasing curiosity indexes to obtain an improved neural network mixed model;
and inputting the current environment information, the target task and the robot state information into the improved neural network hybrid model to obtain a flexible behavior decision.
Further, in the robot flexible behavior decision method, the dynamically adjusting the combination coefficient of the neural network hybrid model according to the current environmental information includes:
obtaining the nearest obstacle distance among all obstacle information from the current environment information;
adjusting the combination coefficient of the neural network hybrid model according to the nearest obstacle distance, wherein the combination coefficient is expressed as follows:
wherein ω1 = 1 - ω2,
a = ω1·a1 + ω2·a2,
m1 and m2 are both positive constants, ω2 is the combination coefficient, enedis is the nearest obstacle distance information, a represents the optimal action decision at the current moment, a1 represents the behavior decision made by the reinforcement learning model, and a2 represents the behavior decision made by the supervised learning model.
Further, in the robot flexible behavior decision method, the improving of the reinforcement learning model by increasing the curiosity index to obtain an improved neural network hybrid model includes:
obtaining maximum rewards and minimum rewards obtained by the robot in a reinforcement learning model and environmental alertness;
Calculating the curiosity index based on the maximum reward, the minimum reward, and the environmental alertness;
and dynamically adjusting environment exploration and utilization in the reinforcement learning model by utilizing the curiosity index to obtain an improved neural network hybrid model.
Further, in the robot flexible behavior decision method, the obtaining of the maximum reward and the minimum reward obtained by the robot in the reinforcement learning model, and the environmental alertness, includes:
dynamically adjusting the learning rate and the discount rate based on the current environmental information to obtain a real-time learning rate and a real-time discount rate;
acquiring a state action value function of the current moment of the robot by combining the real-time learning rate and the real-time discount rate, and calculating to obtain a strengthening signal corresponding to the current moment according to a reward prediction error;
and obtaining the environmental alertness based on the response value of the strengthening signal.
Further, in the robot flexible behavior decision method, the dynamically adjusting the learning rate and the discount rate based on the current environmental information to obtain a real-time learning rate and a real-time discount rate includes:
calculating a similarity value of the current environmental information and the historical environmental information;
if the maximum similarity value is greater than or equal to a similarity threshold, accumulating the number of times of the same environmental information corresponding to the current environmental information;
if the maximum similarity value is smaller than the similarity threshold, memorizing the current environmental information for the first time and setting the number of times of the same environmental information corresponding to the current environmental information to 1;
and adjusting the learning rate and the discount rate based on the same environmental information times corresponding to the current environmental information to obtain the real-time learning rate and the real-time discount rate.
Further, in the robot flexible behavior decision method, the adjusting the learning rate and the discount rate based on the same environmental information times corresponding to the current environmental information to obtain the real-time learning rate and the real-time discount rate further includes:
and correcting the real-time learning rate and the real-time discount rate according to the nearest obstacle distance in a dynamic environment.
Further, in the robot flexible behavior decision method, the curiosity index calculated based on the maximum reward, the minimum reward, and the environmental alertness is expressed as follows:
wherein Rmax and Rmin respectively represent the maximum reward and the minimum reward obtained by the robot, and ρ is an adjustment coefficient.
Further, in the robot flexible behavior decision method, the method dynamically adjusts environmental exploration and utilization by using the curiosity index in the reinforcement learning model to obtain an improved neural network hybrid model, which comprises the following steps:
presetting a curiosity index threshold in the reinforcement learning model, and calculating the execution probability of each action to be executed based on the curiosity index and a state action value function at the current moment;
when the curiosity index is greater than or equal to the curiosity index threshold, the robot performs environment exploration;
when the curiosity index is smaller than the curiosity index threshold, the robot performs environment utilization;
in the process of the adjusted environment exploration and utilization, determining an optimal action behavior decision according to the execution probability of each action to be executed, thereby realizing dynamic self-adaptive adjustment of environment exploration and utilization and obtaining the improved neural network hybrid model.
Further, in the method for deciding the flexible behavior of the robot, the inputting the current environmental information, the target task and the state information of the robot into the improved neural network hybrid model to obtain the flexible behavior decision includes:
Inputting the current environment information, the target task and the robot state information into the improved neural network hybrid model, and obtaining the combination coefficient of the improved neural network hybrid model based on the current environment information;
and according to the nearest obstacle distance, calculating the combination coefficient in real time to adjust the proportion of the action to be executed determined based on supervised learning and the action to be executed determined based on reinforcement learning in the final behavior decision of the robot, so as to obtain the flexible behavior decision in the dynamic environment of the robot.
According to another aspect of the present application, there is also provided a computer readable medium having stored thereon computer readable instructions which, when executed by a processor, cause the processor to implement a robot behavior decision method as described above.
According to another aspect of the present application, there is also provided a robot flexible behavior decision device, comprising:
one or more processors;
a computer readable medium for storing one or more computer readable instructions,
the one or more computer-readable instructions, when executed by the one or more processors, cause the one or more processors to implement the robot flexible behavior decision method as described above.
Compared with the prior art, the present application acquires the current environment information, the target task and the current state information of the robot, wherein the current environment information comprises obstacle position information; constructs a neural network hybrid model based on a supervised learning model and a reinforcement learning model, dynamically adjusts the combination coefficient of the neural network hybrid model according to the current environment information, and improves the reinforcement learning model by increasing the curiosity index to obtain an improved neural network hybrid model; and inputs the current environment information, the target task and the robot state information into the improved neural network hybrid model to obtain a flexible behavior decision. That is, the application integrates cerebellar supervised learning and basal-ganglia reinforcement learning, accelerates the convergence rate of basal-ganglia reinforcement learning by using cerebellar supervised learning, and at the same time improves the knowledge base of cerebellar supervised learning by using basal-ganglia reinforcement learning, realizing direct interaction between the cerebellum and the basal ganglia. This in turn realizes flexible behavior decisions of the mobile robot in an unknown environment, improves the adaptability of the mobile robot to dynamic environments, improves the reinforcement learning function of the basal ganglia, realizes dynamic self-adaptive adjustment of environment exploration-utilization in reinforcement learning, and enables the robot to obtain continuous and stable learning capability.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings in which:
FIG. 1 illustrates a flow diagram of a robot flexible behavior decision method in accordance with an aspect of the present application;
FIG. 2 illustrates the inverted-U relationship between locus coeruleus activity and attention-demanding task performance in a robot flexible decision method in accordance with an aspect of the application;
FIG. 3 illustrates a schematic diagram of the feed-forward neural network operating principle for simulating the function of the anterior cingulate gyrus and the prefrontal cortex in a robot flexible decision method according to one aspect of the application;
FIG. 4 illustrates a schematic diagram of a reinforcement learning model with dynamic self-adaptive adjustment of environment exploration-utilization in a robot flexible decision method in accordance with an aspect of the application.
The same or similar reference numbers in the drawings refer to the same or similar parts.
Detailed Description
The application is described in further detail below with reference to the accompanying drawings.
In one exemplary configuration of the application, the terminal, the device of the service network, and the trusted party each include one or more processors (e.g., a Central Processing Unit (CPU)), input/output interfaces, network interfaces, and memory.
The memory may include non-persistent memory in a computer readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change RAM (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media (transmission media), such as modulated data signals and carrier waves.
Fig. 1 shows a flow chart of a robot flexible behavior decision method according to an aspect of the present application, which is suitable for a robot control process in various complex environments, and the method includes steps S1, S2 and S3, wherein the method specifically includes:
Step S1, the current environment information, the target task and the current state information of the robot are obtained, wherein the current environment information comprises obstacle position information. The current environment information is used to indicate the environmental bearings among the robot, the obstacle and the target, and comprises obstacle position information and target position information, namely, the included angle between the current robot and the target, the included angle between the robot and the obstacle, the distance between the robot and the target, and the distance between the robot and the obstacle;
Step S2, a neural network hybrid model is constructed based on the supervised learning model and the reinforcement learning model, and the combination coefficient of the neural network hybrid model is dynamically adjusted according to the current environment information, realizing dynamic adjustment of the proportions of cerebellar supervised learning and basal-ganglia reinforcement learning in the behavior decision of the mobile robot and improving the adaptability of the mobile robot to dynamic environments. A curiosity index is introduced to improve the reinforcement learning model, obtaining an improved neural network hybrid model. Here, it is considered that a living organism has different degrees of curiosity about the environment when exploring it and when exploiting it: curiosity is high during exploration and low during utilization. Therefore, a curiosity index is introduced into the reinforcement learning model, and the magnitude of curiosity is calculated by jointly combining the anterior cingulate cortex, the prefrontal cortex and neurotransmitters, so as to simulate the dynamic switching of locus coeruleus activity between the tonic mode and the phasic mode, thereby realizing dynamic self-adaptive adjustment of environment exploration-utilization in reinforcement learning.
And step S3, inputting the current environment information, the target task and the robot state information into the improved neural network hybrid model to obtain a flexible behavior decision.
Through steps S1 to S3, the current environment information, the target task and the current state information of the robot are acquired, wherein the current environment information comprises obstacle position information; a neural network hybrid model is constructed based on the supervised learning model and the reinforcement learning model, the combination coefficient of the neural network hybrid model is dynamically adjusted according to the current environment information, and the reinforcement learning model is improved by increasing the curiosity index to obtain an improved neural network hybrid model; the current environment information, the target task and the robot state information are input into the improved neural network hybrid model to obtain a flexible behavior decision. In other words, the application integrates cerebellar supervised learning and basal-ganglia reinforcement learning, accelerates the convergence rate of basal-ganglia reinforcement learning by using cerebellar supervised learning, and at the same time improves the knowledge base of cerebellar supervised learning by using basal-ganglia reinforcement learning, thereby realizing direct interaction between the cerebellum and the basal ganglia. This realizes flexible behavior decisions of the mobile robot in an unknown environment, improves the adaptability of the mobile robot to dynamic environments, improves the reinforcement learning function of the basal ganglia, realizes dynamic self-adaptive adjustment of environment exploration-utilization in reinforcement learning, and enables the robot to obtain continuous and stable learning capability.
For example, first, when the robot moves in the environment, the current environment information Xu is received through the vision sensor, and the following input is designed:
Xu = (θf, θe, df, de)
wherein θf, θe, df and de respectively represent the current included angle between the robot and the target, the included angle between the robot and the obstacle, the distance between the robot and the target, and the distance between the robot and the obstacle. The target task G and the current state information S of the robot are acquired. Then, a neural network hybrid model M is constructed based on the supervised learning model and the reinforcement learning model, and the combination coefficient ω2 of the neural network hybrid model M is dynamically adjusted according to the current environment information Xu. The method integrates the supervised learning of the cerebellum and the reinforcement learning of the basal ganglia to dynamically and adaptively adjust the exploration-utilization balance in reinforcement learning, accelerates the convergence rate of basal-ganglia reinforcement learning by using cerebellar supervised learning, and at the same time improves the knowledge base of cerebellar supervised learning by using basal-ganglia reinforcement learning, realizing direct interaction between the cerebellum and the basal ganglia. Through this integration, flexible behavior decisions of the mobile robot in an unknown environment are realized, and the adaptability of the robot in dynamic environments in particular is improved. The curiosity index is increased to improve the reinforcement learning model, so that dynamic self-adaptive adjustment of environment exploration-utilization in reinforcement learning is realized, and an improved neural network hybrid model M' is finally obtained. Finally, the current environment information Xu, the target task G and the robot state information S are input into the improved neural network hybrid model M' to obtain a flexible behavior decision.
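As a minimal illustration of how such an input vector might be assembled in software, the following Python sketch is provided. The function names, variable names and numeric values are hypothetical and not part of the patent; they only mirror the four quantities θf, θe, df and de described above.

import numpy as np

def build_environment_input(theta_f, theta_e, d_f, d_e):
    """Assemble the perception-layer input Xu = (theta_f, theta_e, d_f, d_e):
    angle to the target, angle to the nearest obstacle, distance to the target,
    and distance to the nearest obstacle."""
    return np.array([theta_f, theta_e, d_f, d_e], dtype=float)

# Hypothetical usage: the improved hybrid model M' (not specified in code form
# in the patent) would consume Xu together with the target task G and state S.
X_u = build_environment_input(theta_f=0.35, theta_e=-1.2, d_f=4.8, d_e=0.9)
G = np.array([5.0, 5.0])        # assumed target position
S = np.array([0.0, 0.0, 0.1])   # assumed robot pose (x, y, heading)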
Next, in the foregoing embodiment, in step S2, dynamically adjusting the combination coefficient of the neural network hybrid model according to the current environmental information includes:
obtaining the nearest obstacle distance among all obstacle information from the current environment information;
adjusting the combination coefficient of the neural network hybrid model according to the nearest obstacle distance, wherein the combination coefficient is expressed as follows:
ω1 = 1 - ω2
a = ω1·a1 + ω2·a2
wherein m1 and m2 are both positive constants, ω2 is the combination coefficient, enedis is the nearest obstacle distance information, a represents the optimal action decision at the current moment, a1 represents the behavior decision made by the reinforcement learning model, and a2 represents the behavior decision made by the supervised learning model.
When the robot begins to explore the environment, the new environment is unfamiliar compared with the environments memorized in the brain; calculated by the above formula, the value of ω1 is small and the value of ω2 is large, and at this moment the system preferentially adopts the supervised learning of the cerebellum to accelerate action decision-making. As the exploration proceeds and the environment becomes more familiar, the strangeness of the environment decreases, the value of ω1 increases and the value of ω2 decreases, and the system begins to rely more on the reinforcement learning of the basal ganglia to guide the robot's action selection. When the robot encounters a new environment again, the above process repeats and the robot again begins to select actions with the supervised learning of the cerebellum. As the explored environment changes, the robot dynamically adjusts the proportions of cerebellar supervised learning and basal-ganglia reinforcement learning in its action decisions to adapt to environmental changes. This scheme can give full play to the roles of cerebellar supervised learning and basal-ganglia reinforcement learning in the robot's action decisions, realize flexible behavior decisions of the robot in an unknown environment, and ensure that the robot obtains continuous and stable learning capability.
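The following sketch illustrates this dynamic blending. The patent gives the exact expression for ω2 only as an image, so the exponential dependence on the nearest obstacle distance enedis used below (and its direction, with the cerebellar action dominating near obstacles) is an illustrative assumption; only the relations ω1 = 1 - ω2 and a = ω1·a1 + ω2·a2 follow the formulas above.

import numpy as np

def combination_coefficient(enedis, m1=1.0, m2=0.5):
    """Placeholder for omega_2(enedis); assumed here to grow as the nearest
    obstacle gets closer. m1 and m2 are positive constants as in the patent."""
    return float(np.clip(m1 * np.exp(-m2 * enedis), 0.0, 1.0))

def blend_actions(a1, a2, enedis):
    """a1: action from basal-ganglia reinforcement learning,
    a2: action from cerebellar supervised learning (assumed continuous vectors).
    Returns a = omega_1 * a1 + omega_2 * a2 with omega_1 = 1 - omega_2."""
    w2 = combination_coefficient(enedis)
    w1 = 1.0 - w2
    return w1 * np.asarray(a1) + w2 * np.asarray(a2)

# With this placeholder, the RL action dominates far from obstacles and the
# supervised action dominates close to them.
print(blend_actions(a1=[0.6, 0.1], a2=[0.0, 0.4], enedis=5.0))
print(blend_actions(a1=[0.6, 0.1], a2=[0.0, 0.4], enedis=0.2))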
In still another embodiment of the present application, the step S2 of adding curiosity index to improve the reinforcement learning model to obtain an improved neural network hybrid model includes:
step S21, obtaining the maximum rewards and the minimum rewards and the environmental alertness obtained by the robot in the reinforcement learning model, comprising:
dynamically adjusting the learning rate and the discount rate based on the current environmental information to obtain a real-time learning rate and a real-time discount rate;
A state-action value function at the current moment of the robot is acquired by combining the real-time learning rate and the real-time discount rate, and the reinforcement signal corresponding to the current moment is calculated according to the reward prediction error. When the model runs, the reinforcement signal δ is first calculated: the difference between the reward r obtained when the robot selects a certain action ai and the currently predicted value of that action, i.e. the reward prediction error, is used as the reinforcement signal δ, that is:
δ = r - Q(s, ai)
The robot updates the action values and selects actions according to the reward prediction error, so the reinforcement signal δ(t) based on the reward prediction error can be described as follows:
wherein rt is the reward the robot obtains at time t.
In addition to influencing the reward value of each action, the dopamine signal also influences the feedback-classification neurons in the anterior cingulate cortex. When there is a negative reinforcement signal (representing a penalty), the neurons in the anterior cingulate cortex marked as "wrong" respond, and the response value is denoted δ-(t); when there is a positive reinforcement signal (representing a reward), the neurons marked as "correct" respond, and the response value is denoted δ+(t).
The environmental alertness is obtained based on the response value of the reinforcement signal. For example, if the robot takes a certain action a in a scene s at a certain time t and obtains a positive reinforcement signal δ+(t), then β*(t) + η+·δ+(t) → β*(t), with η+ > 0; if a negative reinforcement signal δ-(t) is obtained, then β*(t) + η-·δ-(t) → β*(t), with η- < 0. Whether for a large positive reinforcement signal δ+(t) or a strongly negative reinforcement signal δ-(t), the alertness obtained by the robot tends in the same direction and takes a relatively large value. When the value of the reinforcement signal δ(t) is moderate, the obtained environmental alertness value is also moderate.
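A short sketch of the reinforcement signal and the alertness update described above is given below; it assumes a scalar action value Q(s, ai), and the constants η+ and η- are illustrative.

def reinforcement_signal(reward, q_value):
    """delta = r - Q(s, ai): the reward prediction error used as the reinforcement signal."""
    return reward - q_value

def update_alertness(beta, delta, eta_pos=0.1, eta_neg=-0.1):
    """beta*(t) + eta+ * delta+(t) -> beta*(t) for a positive reinforcement signal,
    beta*(t) + eta- * delta-(t) -> beta*(t) for a negative one (eta+ > 0, eta- < 0),
    so strongly positive and strongly negative signals both raise the alertness."""
    return beta + (eta_pos if delta >= 0 else eta_neg) * delta

# Example: a surprising reward and a surprising penalty both raise alertness.
beta = 0.2
beta = update_alertness(beta, reinforcement_signal(reward=1.0, q_value=0.1))
beta = update_alertness(beta, reinforcement_signal(reward=-1.0, q_value=0.1))
print(beta)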
Step S22, calculating the curiosity index based on the maximum rewards, the minimum rewards and the environmental alertness; the curiosity index is expressed as follows:
wherein Rmax and Rmin respectively represent the maximum reward and the minimum reward obtained by the robot, and ρ is an adjustment coefficient.
And S23, dynamically adjusting environment exploration and utilization by utilizing the curiosity index in the reinforcement learning model to obtain an improved neural network hybrid model.
Here, when the reinforcement signal δ(t) obtained by the robot takes a large positive value or a strongly negative value, the environmental alertness β* takes a larger value and the curiosity index Cur is larger; the robot can then select among more actions, which guides the robot to explore the environment. Conversely, when the value of the reinforcement signal δ(t) obtained by the robot is moderate, the environmental alertness β* is also moderate and the curiosity index Cur is smaller, i.e. the activity of the action neurons in the prefrontal cortex is more sharply differentiated, so the robot almost always selects the action with the highest reward value, which manifests as utilization of the environment. The trend of the curiosity index Cur as the reinforcement signal δ(t) changes coincides with the inverted-U relationship between locus coeruleus activity and attention-demanding task performance depicted in fig. 2. Therefore, the curiosity index Cur is used to simulate the dynamic switching of locus coeruleus activity between the tonic mode and the phasic mode, so that dynamic adjustment of environment exploration-utilization in the robot's reinforcement learning process can be realized.
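The patent gives the exact expression for Cur only as an image; the stand-in below (alertness scaled by the reward range and the adjustment coefficient ρ) is an assumption that merely preserves the monotonic relationship described above, and the threshold value is illustrative.

def curiosity_index(beta, r_max, r_min, rho=1.0):
    """Hypothetical stand-in for Cur(beta*, R_max, R_min, rho): curiosity grows
    with the environmental alertness beta*, scaled by the reward range."""
    return rho * beta / max(r_max - r_min, 1e-8)

def explore_or_exploit(cur, cur_threshold=0.3):
    """Cur >= threshold -> environment exploration; Cur < threshold -> utilization."""
    return "explore" if cur >= cur_threshold else "exploit"

print(explore_or_exploit(curiosity_index(beta=0.9, r_max=1.0, r_min=-1.0)))  # explore
print(explore_or_exploit(curiosity_index(beta=0.2, r_max=1.0, r_min=-1.0)))  # exploit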
For example, as shown in fig. 3: first, the visual input is transmitted to the posterior parietal cortex, and a reward signal r is received in the ventral tegmental area, from which the reinforcement learning signal δ is calculated. The anterior cingulate cortex uses δ to calculate, update and store the value of each possible action, while the environmental alertness is calculated from the reinforcement signal δ by a group of feedback-classification neurons (neurons labeled "correct" and neurons labeled "wrong") in the anterior cingulate cortex, and the orbitofrontal cortex evaluates the reward of each action. The action values, the alertness and the reward of each action are then transferred to the prefrontal cortex, which performs action evaluation and selection; the selected action to be performed is transmitted, through the striatum, substantia nigra and thalamus, to the premotor cortex. Finally, the premotor cortex output controls the motion of the robot, a meta-learning rule (meta learning) is introduced, and the history of the actions performed by the robot is fed back to the anterior cingulate cortex.
Following the above embodiment, the dynamically adjusting the learning rate and the discount rate based on the current environmental information to obtain a real-time learning rate and a real-time discount rate includes:
calculating a similarity value of the current environmental information and the historical environmental information;
if the maximum similarity value is greater than or equal to a similarity threshold, accumulating the number of times of the same environmental information corresponding to the current environmental information;
if the maximum similarity value is smaller than the similarity threshold, memorizing the current environmental information for the first time and setting the number of times of the same environmental information corresponding to the current environmental information to 1;
and adjusting the learning rate and the discount rate based on the same environmental information times corresponding to the current environmental information to obtain the real-time learning rate and the real-time discount rate.
For example, during the movement of the robot, the number of times the same environmental information is cumulatively obtained is recorded. Let the current state information of the robot be denoted by s and the cumulative number of times the same environmental information is obtained be denoted by N(s). At the current moment, the robot receives the input current environment information Xu from the perception layer and compares it with all the historical environmental information memorized in the brain. When there is a certain environmental state s0 in memory whose environmental bearing information and Xu satisfy the following condition:
where K is the similarity threshold, the memorized bearing information of the environmental state s0 is considered identical to the current environment information Xu obtained by the robot at the current moment, i.e. the current state information of the robot is the same as the historical state information s0 in memory, and the cumulative count N(s0) of obtaining the same environmental information for s0 is increased by 1. When the above condition is not satisfied, it is considered that no historical environmental information in memory matches the current environment information Xu at the current moment; the robot memorizes this position for the first time, and the number of times of the same environmental information corresponding to the current environmental information is 1. Summarizing, the formula is:
The larger the value of N(s), i.e. the more times the same environmental information has been accumulated and the more times the robot has passed through this region during its movement, the more familiar the robot is considered to be with the region; the locus coeruleus then operates in the tonic mode and the norepinephrine content decreases. Conversely, the smaller the value of N(s), the fewer times the robot has passed through the region and the more unfamiliar the robot is considered to be with it; the locus coeruleus then operates in the phasic mode and the norepinephrine content increases.
The environmental state s corresponding to N(s) is first transferred to the locus coeruleus and stimulates it to produce norepinephrine, whose content Ne can be expressed as follows:
Ne = -k1·N(s) + c1
wherein k1 and c1 are both positive constants.
Norepinephrine acts on the basal forebrain to affect the acetylcholine content, which can be expressed as follows:
Ach = k2·Ne + c2
wherein k2 and c2 are positive constants.
The acetylcholine content affects the learning rate α, then:
α = 1 - 1/(1 + e^Ach)
At the same time, the norepinephrine content also affects the magnitude of the discount rate γ, then:
γ = k3·Ne + c3
wherein k3 and c3 are positive constants.
According to the method, the learning rate and the discount rate are dynamically adjusted, so that the convergence rate of the robot is improved, the flexible behavior decision in the unknown environment of the robot is realized, and the continuous and stable learning capability of the robot is ensured.
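These relations translate directly into code. The constants k1, c1, k2, c2, k3 and c3 are unspecified positive constants in the patent, so the values used below are illustrative.

import math

def norepinephrine(n_s, k1=0.1, c1=1.0):
    """Ne = -k1 * N(s) + c1: greater familiarity lowers the norepinephrine content."""
    return -k1 * n_s + c1

def acetylcholine(ne, k2=1.0, c2=0.0):
    """Ach = k2 * Ne + c2."""
    return k2 * ne + c2

def learning_rate(ach):
    """alpha = 1 - 1 / (1 + e^Ach)."""
    return 1.0 - 1.0 / (1.0 + math.exp(ach))

def discount_rate(ne, k3=0.5, c3=0.3):
    """gamma = k3 * Ne + c3."""
    return k3 * ne + c3

ne = norepinephrine(n_s=3)   # a region the robot has visited three times
print(learning_rate(acetylcholine(ne)), discount_rate(ne))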
Next, in the foregoing embodiment, the adjusting the learning rate and the discount rate based on the number of times of the same environmental information corresponding to the current environmental information, to obtain the real-time learning rate and the real-time discount rate, further includes:
and correcting the real-time learning rate and the real-time discount rate according to the nearest obstacle distance in a dynamic environment. Here, in the dynamic environment, the robot is to pay attention not only to the position information of the static obstacle but also to the position information of the dynamic obstacle. Therefore, the learning rate and the discount rate in the dynamic environment need to be corrected, and the robot is prevented from collision, so that the adaptability of the robot in the dynamic environment is improved. When the robot encounters an unfamiliar dynamic environment, the projection process of the norepinephrine-acetylcholine-learning rate is enhanced, the learning rate is improved, and the robot is accelerated to learn a new environment; meanwhile, the projection process of the norepinephrine-serotonin-discount rate is enhanced, and the discount rate is improved. The enhancement signal delta (t) based on the reward prediction error becomes larger, which is manifested by a stronger response of the robot to the obstacle, and both projections reflect that the robot is constantly "exploring-utilizing balance" in motion, as shown in fig. 4.
The corrected norepinephrine content Ne' is calculated as follows:
Ne' = -k1·N(s) + c1 - k'1·enedis
wherein k'1 is a positive constant, enedis is the distance to the nearest obstacle under the current environment information of the robot, and k1 and c1 are both positive constants.
The norepinephrine content affects the acetylcholine content, and the acetylcholine content Ach' in a dynamic environment can be expressed as:
Ach' = k2·Ne' + k'2·enedis
wherein k'2 is a positive constant.
The learning rate α' in a dynamic environment is then obtained from Ach' in the same way as α is obtained from Ach in the static case. Similarly, the new discount rate γ' in a dynamic environment can be expressed as:
γ' = k'3·Ne'
wherein k'3 is a positive constant.
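The corrected dynamic-environment quantities follow the same pattern in code; the sketch below assumes that α' mirrors the static-environment sigmoid with Ach replaced by Ach', and all constants are illustrative.

import math

def norepinephrine_dynamic(n_s, enedis, k1=0.1, c1=1.0, k1p=0.2):
    """Ne' = -k1*N(s) + c1 - k1'*enedis: Ne' rises as the nearest obstacle gets closer."""
    return -k1 * n_s + c1 - k1p * enedis

def acetylcholine_dynamic(ne_p, enedis, k2=1.0, k2p=0.2):
    """Ach' = k2*Ne' + k2'*enedis."""
    return k2 * ne_p + k2p * enedis

def learning_rate_dynamic(ach_p):
    """Assumed analogue of the static case: alpha' = 1 - 1/(1 + e^Ach')."""
    return 1.0 - 1.0 / (1.0 + math.exp(ach_p))

def discount_rate_dynamic(ne_p, k3p=0.5):
    """gamma' = k3' * Ne'."""
    return k3p * ne_p

ne_p = norepinephrine_dynamic(n_s=3, enedis=0.4)
print(learning_rate_dynamic(acetylcholine_dynamic(ne_p, enedis=0.4)),
      discount_rate_dynamic(ne_p))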
For example, in an experimental environment of 500 x 500, there are 10 static obstacles and 3 dynamic obstacles in addition to the robot and the target, and one of the three dynamic obstacles moves vertically downward at a uniform velocity, one moves straight rightward at a uniform velocity, and one moves randomly. The performance of robot path planning is improved under the conditions of changing the learning rate or changing the discount rate, and the possibility of local minima and collision is reduced. And in a dynamic environment, when the variable learning rate is adopted, the number of times the robot reaches the destination is obviously higher than the number of times the robot reaches the target when the variable discount rate is adopted.
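For reference, a minimal sketch of such an experimental setup is shown below; the obstacle positions and speeds are invented for illustration, and only the counts and motion patterns follow the description above.

import numpy as np

rng = np.random.default_rng(0)
WORLD = (500, 500)

static_obstacles = rng.uniform(0, 500, size=(10, 2))   # 10 static obstacles

dynamic_obstacles = [
    {"pos": np.array([100.0, 450.0]), "vel": np.array([0.0, -2.0])},  # uniform, vertically downward
    {"pos": np.array([50.0, 250.0]),  "vel": np.array([2.0, 0.0])},   # uniform, straight rightward
    {"pos": np.array([300.0, 300.0]), "vel": None},                   # random motion
]

def step_dynamic_obstacles():
    """Advance the three dynamic obstacles by one time step."""
    for ob in dynamic_obstacles:
        v = ob["vel"] if ob["vel"] is not None else rng.uniform(-2.0, 2.0, size=2)
        ob["pos"] = np.clip(ob["pos"] + v, 0, WORLD[0])

step_dynamic_obstacles()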
Following the above embodiment, the step S23 dynamically adjusts the environmental exploration and utilization by using the curiosity index in the reinforcement learning model to obtain an improved neural network hybrid model, which includes:
A curiosity index threshold is preset in the reinforcement learning model, and the execution probability of each action to be executed is calculated based on the curiosity index and the state-action value function at the current moment. Here, the activity of the prefrontal cortex neurons can be converted, via a Softmax equation based on the Boltzmann distribution, from the value of each action into the probability of performing the relevant action ai under the current state information s of the robot:
when the curiosity index is greater than or equal to the curiosity index threshold, the robot performs environment exploration;
when the curiosity index is smaller than the curiosity index threshold, the robot performs environment utilization;
and in the process of the environment exploration and utilization after adjustment, determining an optimal action behavior decision according to the execution probability of each action to be executed, and realizing the dynamic self-adaptive adjustment of the environment exploration and utilization to obtain the improved neural network hybrid model.
When the curiosity index Cur takes a larger value, the execution probabilities of the actions are close to one another, so the robot can select among more actions, which manifests as exploring the environment; when the curiosity index Cur is smaller, the differences between the probabilities of the actions increase, i.e. the activity of the action neurons in the prefrontal cortex is more sharply differentiated, so the robot almost always selects the action with the highest reward value, which manifests as utilizing the environment. Therefore, the switching between the robot's exploration of the environment and its utilization of the environment can be dynamically adjusted by means of the curiosity index Cur.
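A sketch of this curiosity-modulated action selection is given below. Treating Cur as the temperature of the Boltzmann distribution is an interpretation of the description above (the patent's Softmax expression itself is given as an image), so the exact parameterization is an assumption.

import numpy as np

rng = np.random.default_rng(0)

def action_probabilities(q_values, cur):
    """Softmax over action values with the curiosity index Cur as the temperature:
    a large Cur gives near-uniform probabilities (exploration), a small Cur lets
    the highest-valued action dominate (utilization)."""
    q = np.asarray(q_values, dtype=float)
    temperature = max(cur, 1e-6)
    z = (q - q.max()) / temperature      # subtract the max for numerical stability
    expz = np.exp(z)
    return expz / expz.sum()

def select_action(q_values, cur):
    """Sample an action index according to the Boltzmann probabilities."""
    p = action_probabilities(q_values, cur)
    return int(rng.choice(len(p), p=p))

q = [0.9, 0.5, 0.1]
print(action_probabilities(q, cur=5.0))    # probabilities close together: exploration
print(action_probabilities(q, cur=0.05))   # sharply peaked: utilization
print(select_action(q, cur=0.5))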
In yet another embodiment of the present application, step S3 inputs the current environmental information, the target task, and the robot state information into the improved neural network hybrid model to obtain a flexible behavior decision, including:
inputting the current environment information, the target task and the robot state information into the improved neural network hybrid model, and then obtaining the combination coefficient of the improved neural network hybrid model based on the current environment information;
and according to the nearest obstacle distance, calculating the combination coefficient in real time to adjust the proportion of the action to be executed determined based on supervised learning and the action to be executed determined based on reinforcement learning in the final behavior decision of the robot, so as to obtain the flexible behavior decision of the robot.
The application dynamically adjusts the weights of reinforcement learning and supervision learning according to the continuously-changing environmental information, so that different behavior decision strategies are adopted according to different environments in the whole motion process of the robot, thereby improving the learning capacity and adaptability of the robot and ensuring the efficiency and quality of the robot for completing target tasks.
According to another aspect of the present application, there is also provided a computer readable medium having stored thereon computer readable instructions executable by a processor to cause the processor to implement the robot flexible behavior decision method as described above.
According to another aspect of the present application, there is also provided a robot flexible behavior decision device, the device comprising:
one or more processors;
a computer readable medium for storing one or more computer readable instructions;
the one or more computer-readable instructions, when executed by the one or more processors, cause the one or more processors to implement the robot flexible behavior decision method as described above.
For the details of each embodiment of the device, reference may be made to the corresponding parts of the embodiments of the robot flexible behavior decision method, which are not repeated here.
In summary, the present application obtains the current environmental information, the target task and the current state information of the robot, where the current environmental information includes the obstacle position information; constructing a neural network mixed model based on a supervised learning model and a reinforcement learning model, dynamically adjusting the combination coefficient of the neural network mixed model according to the current environmental information, and improving the reinforcement learning model by increasing curiosity indexes to obtain an improved neural network mixed model; the current environment information, the target task and the robot state information are input into the improved neural network mixed model to obtain a flexible behavior decision, namely, the application integrates the supervised learning of the cerebellum and the reinforcement learning of the basal ganglia, accelerates the convergence rate of the reinforcement learning of the basal ganglia by using the supervised learning of the cerebellum, and improves the knowledge base of the supervised learning of the cerebellum by using the reinforcement learning of the basal ganglia at the same time, thereby realizing the direct interaction between the cerebellum and the basal ganglia, further realizing the flexible behavior decision in the unknown environment of the mobile robot, improving the adaptability of the dynamic environment of the mobile robot, improving the reinforcement learning function of the basal ganglia, realizing the dynamic self-adaptive regulation of environment exploration-utilization in the reinforcement learning, and enabling the robot to obtain continuous and stable learning capability.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, e.g., using Application Specific Integrated Circuits (ASIC), a general purpose computer or any other similar hardware device. In one embodiment, the software program of the present application may be executed by a processor to perform the steps or functions described above. Likewise, the software programs of the present application (including associated data structures) may be stored on a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. In addition, some steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
Furthermore, portions of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application by way of operation of the computer. Program instructions for invoking the inventive methods may be stored in fixed or removable recording media and/or transmitted via a data stream in a broadcast or other signal bearing medium and/or stored within a working memory of a computer device operating according to the program instructions. An embodiment according to the application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to operate a method and/or a solution according to the embodiments of the application as described above.
It will be evident to those skilled in the art that the application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the apparatus claims can also be implemented by means of one unit or means in software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.

Claims (8)

1. A method for robot flexible behavior decision-making, the method comprising:
acquiring current environment information, a target task and current state information of a robot, wherein the current environment information comprises obstacle position information;
Constructing a neural network mixed model based on a supervised learning model and a reinforcement learning model, dynamically adjusting the combination coefficient of the neural network mixed model according to the current environmental information, and improving the reinforcement learning model by increasing curiosity indexes to obtain an improved neural network mixed model;
wherein the dynamically adjusting the combination coefficient of the neural network hybrid model according to the current environment information comprises: obtaining the nearest obstacle distance among all obstacle information from the current environment information; and adjusting the combination coefficient of the neural network hybrid model according to the nearest obstacle distance, wherein the combination coefficient is expressed as follows:
wherein ω1 = 1 - ω2,
a = ω1·a1 + ω2·a2,
m1 and m2 are both positive constants, ω2 is the combination coefficient, enedis is the nearest obstacle distance information, a represents the optimal action behavior decision at the current moment, a1 represents the behavior decision made by the reinforcement learning model, and a2 represents the behavior decision made by the supervised learning model;
the improving of the reinforcement learning model by increasing the curiosity index to obtain an improved neural network hybrid model comprises: obtaining the maximum reward and the minimum reward obtained by the robot in the reinforcement learning model, and the environmental alertness; and calculating the curiosity index based on the maximum reward, the minimum reward, and the environmental alertness, wherein the curiosity index is expressed as follows:
wherein Cur represents the curiosity index, β* represents the environmental alertness, Rmax and Rmin respectively represent the maximum reward and the minimum reward obtained by the robot, and ρ is an adjustment coefficient;
dynamically adjusting environment exploration and utilization by utilizing the curiosity index in the reinforcement learning model to obtain an improved neural network hybrid model;
and inputting the current environment information, the target task and the robot state information into the improved neural network hybrid model to obtain a flexible behavior decision.
2. The method of claim 1, wherein the obtaining of the maximum reward and the minimum reward obtained by the robot in the reinforcement learning model, and the environmental alertness, comprises:
dynamically adjusting the learning rate and the discount rate based on the current environmental information to obtain a real-time learning rate and a real-time discount rate;
acquiring a state action value function of the current moment of the robot by combining the real-time learning rate and the real-time discount rate, and calculating to obtain a strengthening signal corresponding to the current moment according to a reward prediction error;
and obtaining the environmental alertness based on the response value of the strengthening signal.
3. The method of claim 2, wherein dynamically adjusting the learning rate and the discount rate based on the current environmental information results in a real-time learning rate and a real-time discount rate, comprising:
calculating a similarity value of the current environmental information and the historical environmental information;
if the similarity value is greater than or equal to a similarity threshold, accumulating the number of times of the same environmental information corresponding to the current environmental information;
if the similarity value is smaller than the similarity threshold, memorizing the current environmental information for the first time and setting the number of times of the same environmental information corresponding to the current environmental information to 1;
and adjusting the learning rate and the discount rate based on the same environmental information times corresponding to the current environmental information to obtain the real-time learning rate and the real-time discount rate.
4. The method of claim 3, wherein the adjusting the learning rate and the discount rate based on the same number of times of the environmental information corresponding to the current environmental information results in the real-time learning rate and the real-time discount rate, further comprising:
and correcting the real-time learning rate and the real-time discount rate according to the nearest obstacle distance in a dynamic environment.
5. The method of claim 4, wherein the dynamically adjusting environment exploration and utilization in the reinforcement learning model by using the curiosity index to obtain the improved neural network hybrid model comprises:
presetting a curiosity index threshold in the reinforcement learning model, and calculating the execution probability of each action to be executed based on the curiosity index and a state action value function at the current moment;
when the curiosity index is greater than or equal to the curiosity index threshold, the robot performs environment exploration;
when the curiosity index is smaller than the curiosity index threshold, the robot performs environment utilization;
and in the adjusted environment exploration and utilization process, determining the optimal action behavior decision according to the execution probability of each action to be executed, so as to obtain the improved neural network hybrid model.
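Illustrative sketch of claim 5, not part of the claims: the curiosity-threshold gating between exploration and utilization follows the claim, while the softmax form of the per-action execution probability and its temperature coupling to the curiosity index are assumptions:

```python
import numpy as np

def select_action(q_values, curiosity, cur_threshold, rng=None):
    """Curiosity-gated action selection: explore when Cur >= threshold,
    otherwise exploit. q_values are the state-action values Q(s, a)
    at the current moment."""
    rng = rng or np.random.default_rng()
    q = np.asarray(q_values, dtype=float)
    # Execution probability of each candidate action (assumed softmax form;
    # higher curiosity flattens the distribution).
    temperature = 1.0 + max(curiosity, 0.0)
    probs = np.exp((q - q.max()) / temperature)
    probs /= probs.sum()
    if curiosity >= cur_threshold:
        return int(rng.choice(len(q), p=probs))   # explore: sample by probability
    return int(np.argmax(q))                      # exploit: greedy action
```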
6. The method according to any one of claims 1-5, wherein the inputting the current environmental information, the target task and the robot state information into the improved neural network hybrid model to obtain the flexible behavior decision comprises:
Inputting the current environment information, the target task and the robot state information into the improved neural network hybrid model, and obtaining the combination coefficient of the improved neural network hybrid model based on the current environment information;
and according to the nearest obstacle distance, calculating the combination coefficient in real time to adjust the proportion of the action to be executed determined based on supervised learning and the action to be executed determined based on reinforcement learning in the final behavior decision of the robot, so as to obtain the flexible behavior decision of the robot.
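Illustrative sketch of claim 6, not part of the claims: the coefficients are recomputed in real time from the nearest obstacle distance as the claim requires, but the logistic functional form, the constants m1 and m2, and the direction of the weighting between the supervised and reinforcement-learning decisions are assumptions:

```python
import math

def combination_coefficients(nearest_obstacle_dist, m1=1.0, m2=1.0):
    """Return (w1, w2) with w1 + w2 = 1, recomputed at every step from the
    nearest obstacle distance. m1 and m2 are positive constants; here w2
    (the supervised-learning weight) is assumed to grow as the obstacle
    gets closer, while open space shifts weight w1 to the RL decision."""
    w2 = 1.0 / (1.0 + math.exp(m1 * (nearest_obstacle_dist - m2)))
    return 1.0 - w2, w2
```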
7. A computer readable medium having stored thereon computer readable instructions executable by a processor to cause the processor to implement the method of any one of claims 1 to 6.
8. A robot flexible behavior decision device, the device comprising:
one or more processors;
a computer readable medium for storing one or more computer readable instructions which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 6.
CN202110973178.4A 2021-08-24 2021-08-24 Robot flexible behavior decision method and equipment Active CN113671834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110973178.4A CN113671834B (en) 2021-08-24 2021-08-24 Robot flexible behavior decision method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110973178.4A CN113671834B (en) 2021-08-24 2021-08-24 Robot flexible behavior decision method and equipment

Publications (2)

Publication Number Publication Date
CN113671834A CN113671834A (en) 2021-11-19
CN113671834B true CN113671834B (en) 2023-09-01

Family

ID=78545466

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110973178.4A Active CN113671834B (en) 2021-08-24 2021-08-24 Robot flexible behavior decision method and equipment

Country Status (1)

Country Link
CN (1) CN113671834B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114161419B (en) * 2021-12-13 2023-09-15 大连理工大学 Efficient learning method for robot operation skills guided by scene memory

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108762281A (en) * 2018-06-08 2018-11-06 哈尔滨工程大学 Memory-association reinforcement-learning-based decision-making method for an embedded real-time underwater intelligent robot
CN111645076A (en) * 2020-06-17 2020-09-11 郑州大学 Robot control method and equipment
JP2021034050A (en) * 2019-08-21 2021-03-01 哈爾浜工程大学 Auv action plan and operation control method based on reinforcement learning
CN112612274A (en) * 2020-12-22 2021-04-06 清华大学 Autonomous motion decision control method and system for ultrasonic inspection robot
CN113031437A (en) * 2021-02-26 2021-06-25 同济大学 Water pouring service robot control method based on dynamic model reinforcement learning
CN113077052A (en) * 2021-04-28 2021-07-06 平安科技(深圳)有限公司 Reinforced learning method, device, equipment and medium for sparse reward environment
CN113110592A (en) * 2021-04-23 2021-07-13 南京大学 Unmanned aerial vehicle obstacle avoidance and path planning method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11179846B2 (en) * 2018-07-24 2021-11-23 Yan Yufik Method and systems for enhancing collaboration between robots and human operators
CN109190720B (en) * 2018-07-28 2021-08-06 深圳市商汤科技有限公司 Intelligent agent reinforcement learning method, device, equipment and medium
US20210103815A1 (en) * 2019-10-07 2021-04-08 Deepmind Technologies Limited Domain adaptation for robotic control using self-supervised learning


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on a navigation knowledge acquisition algorithm based on improved Q-learning; Zheng Bingwen; Kexue Zhi You (Issue 04); full text *

Also Published As

Publication number Publication date
CN113671834A (en) 2021-11-19

Similar Documents

Publication Publication Date Title
Li et al. An improved DQN path planning algorithm
Chen et al. Deep reinforcement learning based path tracking controller for autonomous vehicle
Roozegar et al. XCS-based reinforcement learning algorithm for motion planning of a spherical mobile robot
CN113671834B (en) Robot flexible behavior decision method and equipment
CN116476825B (en) Automatic driving lane keeping control method based on safe and reliable reinforcement learning
Lin et al. Nonlinear system control using self-evolving neural fuzzy inference networks with reinforcement evolutionary learning
Schultheis et al. Receding horizon curiosity
Kong et al. Energy management strategy for electric vehicles based on deep Q-learning using Bayesian optimization
Wu et al. Uncertainty-aware model-based reinforcement learning with application to autonomous driving
Gu et al. DM-DQN: Dueling Munchausen deep Q network for robot path planning
Han et al. Reinforcement learning guided by double replay memory
Iqbal et al. Intelligent multimedia content delivery in 5G/6G networks: a reinforcement learning approach
Jiang et al. Path tracking control based on Deep reinforcement learning in Autonomous driving
Huang et al. A novel policy based on action confidence limit to improve exploration efficiency in reinforcement learning
Koh et al. Cooperative control of mobile robots with stackelberg learning
Liu et al. Her-pdqn: A reinforcement learning approach for uav navigation with hybrid action spaces and sparse rewards
CN115526270A (en) Spatio-temporal fusion reasoning and lifelong cognition learning method for behavior evolution in open environment
Zhang et al. An efficient planning method based on deep reinforcement learning with hybrid actions for autonomous driving on highway
CN114559439A (en) Intelligent obstacle avoidance control method and device for mobile robot and electronic equipment
CN114610039A (en) Robot control method, device, robot and storage medium
Liu et al. Safe offline reinforcement learning through hierarchical policies
Wulfe Uav collision avoidance policy optimization with deep reinforcement learning
García et al. Incremental reinforcement learning for multi-objective robotic tasks
Yu et al. Coordinated collision avoidance for multi‐vehicle systems based on collision time
Wang et al. A computational developmental model of perceptual learning for mobile robot

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant