CN112008734A - Robot control method and device based on component interaction degree - Google Patents

Robot control method and device based on component interaction degree

Publication number: CN112008734A (other versions: CN112008734B)
Application number: CN202010813591.XA
Authority: CN (China)
Original language: Chinese (zh)
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Prior art keywords: robot, controlled, state information, component, information
Inventors: 余超, 董银昭, 葛宏伟, 陈炳才, 孙亮
Current assignee: Sun Yat Sen University / National Sun Yat Sen University (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: National Sun Yat Sen University
Application filed by National Sun Yat Sen University; priority to CN202010813591.XA; granted and published as CN112008734B

Classifications

    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B25: HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J: MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J11/00: Manipulators not otherwise provided for
    • B25J11/0005: Manipulators having means for high-level communication with users, e.g. speech generator, face recognition means
    • B25J9/00: Programme-controlled manipulators
    • B25J9/16: Programme controls
    • B25J9/1602: Programme controls characterised by the control system, structure, architecture
    • B25J9/1656: Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664: Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
    • B25J9/1679: Programme controls characterised by the tasks executed

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Automation & Control Theory (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a robot control method based on the degree of interaction between components, which comprises the following steps: acquiring the overall state information of a robot to be controlled; inputting the overall state information of the robot to be controlled into an action prediction model, which structurally decomposes the overall state information to obtain the state information of each component and then calculates the degree of interaction between each component and the remaining components from the state information of each component; determining the enhanced state information of each component according to its degree of interaction with the remaining components; predicting the action information of each component from its enhanced state information, and then generating the overall predicted action information of the robot to be controlled from the predicted action information of all the components; and finally, controlling the motion of the robot to be controlled according to its overall predicted action information. By implementing the embodiments of the invention, the complexity of robot control can be reduced and the stability of robot control can be improved.

Description

Robot control method and device based on component interaction degree
Technical Field
The invention relates to the technical field of intelligent robots, in particular to a robot control method and device based on component interaction degree.
Background
Deep Reinforcement Learning (DRL) enables robot behavior control to work well in challenging tasks such as locomotion and manipulation. However, when existing DRL algorithms control real robots, they face high-dimensional control problems by searching directly in a high-dimensional state and action space, so the complexity of robot control is high. In addition, because existing DRL algorithms are end-to-end, that is, they directly search the whole state-action space and output the finally learned motion policy, the resulting motion policy lacks interpretability, and the stability of robot control is poor.
Disclosure of Invention
The embodiment of the invention provides a robot control method and device based on the component interaction degree, which can reduce the complexity of robot control and improve the stability of robot control.
An embodiment of the present invention provides a robot control method based on a component interaction degree, including:
acquiring integral state information of a robot to be controlled;
inputting the overall state information of the robot to be controlled into a constructed action prediction model so that the action prediction model generates the overall predicted action information of the robot to be controlled according to the overall state information of the robot to be controlled;
wherein the generating, by the action prediction model, of the overall predicted action information of the robot to be controlled according to the overall state information of the robot to be controlled specifically includes the following steps:
carrying out structural decomposition on the overall state information of the robot to be controlled to obtain the state information of each part of the robot to be controlled, and then calculating the interaction degree between each part and the rest of parts according to the state information of each part of the robot to be controlled; determining the enhanced state information of each component according to the interaction degree of each component and the other components; predicting the predicted action information of each component according to the enhanced state information of each component, and then generating the overall predicted action information of the robot to be controlled according to the predicted action information of all the components;
and controlling the robot to be controlled to move according to the overall predicted action information of the robot to be controlled.
Further, the structural decomposition is performed on the overall state information of the robot to be controlled, state information of each component of the robot to be controlled is obtained, and then the degree of interaction between each component and the rest of the components is calculated according to the state information of each component of the robot to be controlled, which specifically includes:
acquiring integral state information of the robot to be controlled at a first moment, performing structural decomposition, and generating state information of each part of the robot to be controlled at the first moment;
selecting one component from all the components one by one as a selected component, predicting the predicted state information of the other components except the selected component at a second moment through a preset state prediction network according to the state information of the selected component at the first moment after each selected component is determined, calculating a prediction error according to the predicted state information of the other components at the second moment and the actual state information of the other components at the second moment, and determining the interaction degree between the selected component and the other components according to the prediction error; wherein the second time is a next time of the first time.
Further, the predicting the predicted action information of each component according to the enhanced state information of each component specifically includes:
inputting the enhanced state information of each component into a preset action prediction network so that the action prediction network outputs the mean value and the variance of the enhanced state information of each component;
and obtaining the predicted action information of each component through Gaussian distribution sampling according to the mean value and the variance of the enhanced state information of each component.
Further, still include: adjusting network parameters of the state prediction network by:
acquiring motion trail information of the robot to be controlled from a sample database; the motion trail information comprises overall state information and overall action information of the robot to be controlled at each moment;
the overall state information of the robot to be controlled at each moment is input into the state prediction network, so that the state prediction network generates overall predicted action information of the robot to be controlled at each moment;
calculating the error between the overall predicted action information of the robot to be controlled at each moment and the overall action information of the robot to be controlled at each moment, and adjusting the network parameters of the state prediction network according to the error;
and generating an adjacency matrix according to a preset attention network, and adjusting the network parameters again according to the adjacency matrix.
On the basis of the above method item embodiment, the present invention correspondingly provides an apparatus item embodiment:
the embodiment of the invention provides a robot control device based on component interaction degree, which comprises a data acquisition module, an action prediction module and a motion control module, wherein the data acquisition module is used for acquiring the data of a robot;
the data acquisition module is used for acquiring the overall state information of the robot to be controlled;
the action prediction module is used for inputting the overall state information of the robot to be controlled into the constructed action prediction model, so that the action prediction model generates the overall predicted action information of the robot to be controlled according to the overall state information of the robot to be controlled; the generating, by the action prediction model, of the overall predicted action information of the robot to be controlled according to the overall state information of the robot to be controlled specifically includes the following steps: carrying out structural decomposition on the overall state information of the robot to be controlled to obtain the state information of each component of the robot to be controlled, and then calculating the degree of interaction between each component and the remaining components according to the state information of each component of the robot to be controlled; determining the enhanced state information of each component according to the degree of interaction between each component and the remaining components; predicting the predicted action information of each component according to the enhanced state information of each component, and then generating the overall predicted action information of the robot to be controlled according to the predicted action information of all the components;
and the motion control module is used for controlling the motion of the robot to be controlled according to the overall predicted action information of the robot to be controlled.
Further, the structural decomposition is performed on the overall state information of the robot to be controlled, state information of each component of the robot to be controlled is obtained, and then the degree of interaction between each component and the rest of the components is calculated according to the state information of each component of the robot to be controlled, which specifically includes:
acquiring integral state information of the robot to be controlled at a first moment, performing structural decomposition, and generating state information of each part of the robot to be controlled at the first moment;
selecting one component from all the components one by one as a selected component, predicting the predicted state information of the other components except the selected component at a second moment through a preset state prediction network according to the state information of the selected component at the first moment after each selected component is determined, calculating a prediction error according to the predicted state information of the other components at the second moment and the actual state information of the other components at the second moment, and determining the interaction degree between the selected component and the other components according to the prediction error; wherein the second time is a next time of the first time.
Further, the predicting the predicted action information of each component according to the enhanced state information of each component specifically includes:
inputting the enhanced state information of each component into a preset action prediction network so that the action prediction network outputs the mean value and the variance of the enhanced state information of each component;
and obtaining the predicted action information of each component through Gaussian distribution sampling according to the mean value and the variance of the enhanced state information of each component.
Further, the device also comprises a parameter adjusting module; the parameter adjusting module is used for adjusting the network parameters of the state prediction network in the following way:
acquiring motion trail information of the robot to be controlled from a sample database; the motion trail information comprises overall state information and overall action information of the robot to be controlled at each moment;
the overall state information of the robot to be controlled at each moment is input into the state prediction network, so that the state prediction network generates overall predicted action information of the robot to be controlled at each moment;
calculating the error between the overall predicted action information of the robot to be controlled at each moment and the overall action information of the robot to be controlled at each moment, and adjusting the network parameters of the state prediction network according to the error;
and generating an adjacency matrix according to a preset attention network, and adjusting the network parameters again according to the adjacency matrix.
The embodiment of the invention has the following beneficial effects:
The embodiments of the invention provide a robot control method and device based on the degree of interaction between components. The method acquires the overall state information of a robot to be controlled and inputs it into an action prediction model; the action prediction model structurally decomposes the overall state information according to the physical structure of the robot to obtain the state information of each component, calculates the degree of interaction of each component, generates the enhanced state information of each component from the degree of interaction and the state information of each component, predicts the action of each component from the enhanced state information, and finally integrates all the predicted actions to obtain the overall predicted action of the robot to be controlled. Compared with the prior art, the method and device first decompose the overall action prediction problem into prediction problems for the action of each component, achieving dimensionality reduction: when facing a high-dimensional control problem, there is no need to search directly in the overall high-dimensional state and action space, which simplifies learning and reduces the complexity of robot control.
Drawings
Fig. 1 is a schematic flowchart of a robot control method based on a degree of interaction between components according to an embodiment of the present invention.
FIG. 2 is a schematic diagram comparing the average reward value of the action prediction model provided by the invention with that of existing DRL algorithms in the Half-Cheetah robot environment.
FIG. 3 is a schematic diagram comparing the cumulative reward value of the action prediction model provided by the invention with that of existing DRL algorithms in the Half-Cheetah robot environment.
FIG. 4 is a schematic diagram of the Half-Cheetah robot provided by the invention in different postures.
FIG. 5 is a collaboration graph corresponding to different postures of the Half-Cheetah robot provided by the invention.
Fig. 6 is a schematic structural diagram of a robot control device based on component interaction degree according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, an embodiment of the present invention provides a robot control method based on a component interaction degree, including:
Step S101, acquiring the overall state information of the robot to be controlled;
Step S102, inputting the overall state information of the robot to be controlled into a constructed action prediction model, so that the action prediction model generates the overall predicted action information of the robot to be controlled according to the overall state information of the robot to be controlled; the generating, by the action prediction model, of the overall predicted action information of the robot to be controlled according to the overall state information of the robot to be controlled specifically includes the following steps:
carrying out structural decomposition on the overall state information of the robot to be controlled to obtain the state information of each part of the robot to be controlled, and then calculating the interaction degree between each part and the rest of parts according to the state information of each part of the robot to be controlled; determining the enhanced state information of each component according to the interaction degree of each component and the other components; predicting the predicted action information of each component according to the enhanced state information of each component, and then generating the overall predicted action information of the robot to be controlled according to the predicted action information of all the components;
Step S103, controlling the robot to be controlled to move according to the overall predicted action information of the robot to be controlled.
For step S101, the overall state information of the robot to be controlled refers to the state elements in the Markov decision process, including but not limited to velocity and position information;
in step S102, in a preferred embodiment, the performing structural decomposition on the overall state information of the robot to be controlled to obtain state information of each component of the robot to be controlled, and then calculating the interaction degree between each component and the other components according to the state information of each component of the robot to be controlled specifically includes:
acquiring integral state information of the robot to be controlled at a first moment, performing structural decomposition, and generating state information of each part of the robot to be controlled at the first moment;
selecting one component from all the components one by one as a selected component, predicting the predicted state information of the other components except the selected component at a second moment through a preset state prediction network according to the state information of the selected component at the first moment after each selected component is determined, calculating a prediction error according to the predicted state information of the other components at the second moment and the actual state information of the other components at the second moment, and determining the interaction degree between the selected component and the other components according to the prediction error; wherein the second time is a next time of the first time.
Preferably, the predicting the predicted action information of each component according to the enhanced state information of each component specifically includes:
inputting the enhanced state information of each component into a preset action prediction network so that the action prediction network outputs the mean value and the variance of the enhanced state information of each component;
and obtaining the predicted action information of each component through Gaussian distribution sampling according to the mean value and the variance of the enhanced state information of each component.
In the present invention, a Markov decision process for a multi-agent system is first defined. The robot to be controlled comprises a plurality of components (e.g. ankle, knee, hip, etc.), each of which controls part of the behavior of the robot. Taking each component as an agent (hereinafter simply "agent"), the Markov decision process of the multi-agent system is defined as a five-tuple $\langle n, S, A, P, R \rangle$, where $n$ denotes the number of agents, $S$ the states (a state $s$ includes information such as velocity and position), $A$ the actions performed by the agents (information such as joint angles), $P$ the action transition probabilities, and $R$ the reward values. Let $s_i$ and $a_i$ denote the state and action information of the $i$-th component, and let $s_g$ denote the state information shared by all agents. The state of the entire robot can be expressed as:

$$s = \langle s_1, \dots, s_i, \dots, s_n, s_g \rangle \qquad (1)$$

$P(s' \mid s, a)$ denotes the probability that the robot transitions from the current state $s$ to the next state $s'$ when performing action $a$. $R(s, a)$ denotes the reward value the robot obtains by taking action $a$ in the current state $s$.
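As an illustrative sketch (not part of the patent text), the structural decomposition of the overall state in formula (1) into per-component states can be written as follows; the component slice boundaries and the function name `decompose_state` are assumptions of ours:

```python
import numpy as np

# Sketch: split a flat overall state vector s = <s_1, ..., s_n, s_g> into
# n per-component states plus the shared state s_g. The slice boundaries
# are hypothetical; a real robot would derive them from its kinematic
# structure (ankle, knee, hip, ...).
def decompose_state(s, component_dims, global_dim):
    """Return ([s_1, ..., s_n], s_g) for a flat state vector s."""
    parts, offset = [], 0
    for d in component_dims:
        parts.append(s[offset:offset + d])
        offset += d
    s_g = s[offset:offset + global_dim]
    return parts, s_g

s = np.arange(10.0)                          # toy overall state
parts, s_g = decompose_state(s, [3, 3, 2], 2)
```

Concatenating `parts` and `s_g` back together recovers the original flat state, so the decomposition loses no information.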
Next, the degree of interaction is defined. Information interaction is an important mode of multi-agent (component) collaboration, and the invention uses DoI (Degree of Interaction) to represent the degree of information interaction between different agents. A collaboration graph is used to model the DoI between different agents, as follows:

$$G = (V, W), \quad V = \{A_i \mid i \in [n]\}, \quad W = \{w_{ij} \mid i, j \in [n]\} \qquad (2)$$

where $G$ denotes the collaboration graph and $V$ the set of agents. $A_i$ denotes the $i$-th agent, and $W$ is an adjacency matrix, i.e., the set of weights between any two agents of the collaboration graph. The weight $w_{ij}$ represents the importance of agent $A_j$ to agent $A_i$, i.e., the DoI. According to different $w_{ij}$, three forms of DoI can be defined:
1. If $w_{ij} = 1$ for all agents, the DoI is called the Global Degree of Interaction (GDoI). In this case every agent can learn the state information of any other agent, and the state information of every other agent is considered equally important.
2. If each agent's own weight is 1 and the weight $w_{ij}$ between any two different agents is 0, the DoI is called the Independent Degree of Interaction (IDoI). In this case each agent learns only its own state information.
3. If $w_{ij}$ varies continuously, the DoI is called the Dynamic Degree of Interaction (DDoI). In this case each agent can learn the information of the other agents, but with different weights.
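The three DoI regimes above can be illustrated as adjacency matrices; the function names and the row normalization in `dynamic_doi` are illustrative choices of ours, not the patent's:

```python
import numpy as np

# Illustrative sketch of the three DoI regimes as n x n adjacency matrices W.
def global_doi(n):
    """GDoI: every weight w_ij = 1."""
    return np.ones((n, n))

def independent_doi(n):
    """IDoI: w_ii = 1, w_ij = 0 for i != j."""
    return np.eye(n)

def dynamic_doi(raw_weights):
    """DDoI: weights vary continuously; here row-normalised (our choice)."""
    w = np.asarray(raw_weights, dtype=float)
    return w / w.sum(axis=1, keepdims=True)
```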
With the above definitions in mind, the motion prediction model of the present invention will be described in detail below;
the action prediction model comprises a state prediction network, an attention network, an action prediction network and a reward value network;
first, the state prediction network will be explained:
the state prediction network comprises a state predictor and an interaction degree cooperative graph generator, wherein the state predictor is constructed by taking the state information of one intelligent agent in the robot to be controlled at the current moment as input and taking the predicted state information of the other intelligent agents at the next moment as output;
secondly, the interaction degree collaborative diagram generator compares the predicted state information output by the state predictor with the real state information of the other agents, calculates a prediction error, and determines the interaction degree among the agents according to the prediction error; the method comprises the following specific steps:
First, a continuous trajectory $\{s_t, s_{t+1}\},\ t \in \{1, 2, \dots, T\}$ of the robot to be controlled is extracted from the sample database, where $s_t$ is the overall state information of the robot to be controlled at time step $t$ and $s_{t+1}$ is the overall state information at time step $t+1$. Structural decomposition is applied to $s_t$ to obtain the state information of each agent at time step $t$, and to $s_{t+1}$ to obtain the state information of each agent at time step $t+1$. Suppose the state information of agent $A_i$ at time step $t$ is $s_i^t$ and the state information of agent $A_j$ at time step $t+1$ is $s_j^{t+1}$. Inputting the state information $s_i^t$ of agent $A_i$ into the state predictor yields the predicted state information $\hat{s}_j^{t+1}$ of agent $A_j$ at time step $t+1$.

Then, the interaction degree collaboration graph generator calculates the degree of interaction between agents and generates the corresponding collaboration graph. Specifically, the prediction error between agent $A_j$'s state information (actual state information) $s_j^{t+1}$ and predicted state information $\hat{s}_j^{t+1}$ at time step $t+1$ can be expressed as:

$$e_{ij}^{t} = \left\| s_j^{t+1} - \hat{s}_j^{t+1} \right\|^2 \qquad (3)$$

After calculating the prediction error at each time step, the degree of interaction between agent $A_i$ and agent $A_j$ can be expressed in terms of the prediction error $e_{ij}$ over the continuous trajectory $\{s_t, s_{t+1}\},\ t \in \{1, 2, \dots, T\}$, as shown in (4):

$$w_{ij} = \frac{1}{T} \sum_{t=1}^{T} e_{ij}^{t} \qquad (4)$$

where $e_{ij}^{t}$ denotes the prediction error, $w_{ij}$ denotes the degree of interaction between agent $A_i$ and agent $A_j$, and $T$ denotes the length of the trajectory.
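Equations (3) and (4) can be sketched as follows; the array layout, the squared-norm error, and the function name `interaction_degree` are our reading of the text, not the patent's code:

```python
import numpy as np

# Sketch of equations (3)-(4): per-step prediction error e_ij^t between the
# true next state of agent j and the state predicted from agent i, averaged
# over a trajectory of length T to give the interaction degree w_ij.
def interaction_degree(true_next, pred_next):
    """true_next, pred_next: arrays of shape (T, n, n, d), where entry
    [t, i, j] is agent j's next state (true / predicted from agent i)."""
    e = np.sum((true_next - pred_next) ** 2, axis=-1)  # e_ij^t, shape (T, n, n)
    return e.mean(axis=0)                              # w_ij,   shape (n, n)
```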
After the degree of interaction between each component and the other components is obtained, the action prediction model calculates the enhanced state information of each agent. The enhanced state can be expressed as:

$$\tilde{s}_i = \left\langle s_i,\ \sum_{j \neq i} w_{ij}\, s_j,\ s_g \right\rangle \qquad (5)$$

where $\tilde{s}_i$ denotes the enhanced state of agent $A_i$, $w_{ij}$ denotes the degree of interaction between agent $A_i$ and agent $A_j$, $s_j$ denotes the state of agent $A_j$, and $s_g$ denotes the global state of the robot.
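Equation (5) can be sketched as follows, assuming (our reading of the text) that the other agents' states are aggregated as a DoI-weighted sum:

```python
import numpy as np

# Sketch of equation (5): each agent's enhanced state combines its own
# state, a DoI-weighted aggregate of the other agents' states, and the
# shared global state s_g. The weighted-sum aggregation is an assumption.
def enhanced_state(states, w, s_g):
    """states: (n, d) per-agent states; w: (n, n) DoI matrix; s_g: shared state."""
    n = states.shape[0]
    out = []
    for i in range(n):
        mask = np.arange(n) != i
        agg = (w[i, mask, None] * states[mask]).sum(axis=0)  # sum_j w_ij * s_j
        out.append(np.concatenate([states[i], agg, s_g]))
    return np.stack(out)
```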
Next, the network parameter update of the state prediction network will be described:
in a preferred embodiment, the network parameters of the state prediction network are adjusted by:
acquiring motion trail information of the robot to be controlled from a sample database; the motion trail information comprises overall state information and overall action information of the robot to be controlled at each moment;
the overall state information of the robot to be controlled at each moment is input into the state prediction network, so that the state prediction network generates overall predicted action information of the robot to be controlled at each moment;
calculating the error between the overall predicted action information of the robot to be controlled at each moment and the overall action information of the robot to be controlled at each moment, and adjusting the network parameters of the state prediction network according to the error;
and generating an adjacency matrix according to a preset attention network, and adjusting the network parameters again according to the adjacency matrix.
Specifically, after each training round, the trajectory $\langle s_t, a_t, s_{t+1}, r_t \rangle,\ t \in [0, T]$ is fetched from the memory bank. The error $loss_p$ between the real states and the predicted states is calculated using the information of the trajectory, and the parameters $\Theta_p$ of the state prediction network are updated to minimize $loss_p$:

$$loss_p = \sum_{t=1}^{T} \sum_{i=1}^{n} \sum_{j=1}^{n} \left\| s_j^{t+1} - \hat{s}_j^{t+1} \right\|^2$$

This completes one update of the network parameters of the state prediction network. Then, after every few training rounds (the specific number of rounds can be chosen according to the actual situation), an attention network is used to generate an adjacency matrix $W_a$ that corrects the collaboration graph $W_p$ generated by the state prediction network, further updating the parameters $\Theta_p$ and thereby obtaining a more accurate degree of interaction, as in the formula:

$$loss_{a \to p} = \left\| W_a - W_p \right\|_2 \qquad (6)$$

where $loss_{a \to p}$ denotes the error between the two adjacency matrices.
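The correction loss of equation (6) can be sketched as follows; `alignment_loss` is an illustrative name, and reading the 2-norm as the Frobenius norm of the matrix difference is our assumption:

```python
import numpy as np

# Sketch of equation (6): distance between the attention-generated adjacency
# matrix W_a and the prediction-based collaboration graph W_p, used to
# further adjust the state prediction network's parameters.
def alignment_loss(w_a, w_p):
    """Frobenius norm ||W_a - W_p|| of the adjacency-matrix difference."""
    return float(np.linalg.norm(w_a - w_p))
```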
The following describes the attention network:
First, the state $s_i$ of agent $A_i$ is input into a multi-layer perceptron $F_{in}$, which outputs a feature vector of dimension $b$:

$$f_i = F_{in}(s_i) \qquad (7)$$

where $f_i \in \mathbb{R}^b$ is the feature vector of agent $A_i$. Then, the joint feature vector $\langle f_i, f_j \rangle$ of any two agents is input into the attention network to output the similarity value $K_{ij}$ between $f_i$ and $f_j$. Normalization with the Soft-max function then yields the DoI of $A_j$ with respect to $A_i$:

$$w_{ij} = \frac{\exp\!\left(f_i^{\mathsf{T}} f_j\right)}{\sum_{k=1}^{n} \exp\!\left(f_i^{\mathsf{T}} f_k\right)} \qquad (8)$$

where $\mathsf{T}$ denotes the transpose.
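Equations (7)-(8) can be sketched as follows, taking the features $f_i$ as given rather than computing $F_{in}$, and assuming a dot-product similarity $K_{ij} = f_i^{\mathsf{T}} f_j$ (our reading of the transpose in the text):

```python
import numpy as np

# Sketch of equations (7)-(8): pairwise similarities K_ij = f_i^T f_j,
# soft-maxed per row to obtain attention-based DoI weights w_ij.
def attention_doi(features):
    """features: (n, b) agent feature vectors f_i; returns (n, n) DoI matrix."""
    k = features @ features.T                 # K_ij = f_i^T f_j
    k = k - k.max(axis=1, keepdims=True)      # shift for numerical stability
    e = np.exp(k)
    return e / e.sum(axis=1, keepdims=True)   # w_ij = softmax_j(K_ij)
```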
Following this, the bonus value network is described:
The reward value network (Critic) is used to predict the cumulative discounted reward B̂(s_t) of the whole robot. The goal of the reward value network is to update its parameters Θ_B so that the return estimation loss L_BL is minimized. Using a trajectory <s_t, a_t, s_{t+1}, r_t>, t∈[0,T], fetched from the memory bank:
L_BL = Σ_{t=0}^{T} ( Σ_{t'=t}^{T} γ^{t'-t} r_{t'} − B̂(s_t) )^2 (9)
where T represents the number of time steps of the trajectory, γ is the discount coefficient, r_{t'} represents the reward value at time step t', and B̂(s_t) represents the cumulative discounted reward predicted by the Critic from the whole-robot state s_t at time step t.
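The return target inside L_BL — the cumulative discounted reward from each time step onward — can be sketched in plain Python (names illustrative):

```python
def discounted_returns(rewards, gamma):
    """Cumulative discounted reward from each time step t onward,
    computed by a single backward pass over the trajectory."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

def critic_loss(rewards, values, gamma):
    """Squared error between discounted returns and the Critic's estimates."""
    targets = discounted_returns(rewards, gamma)
    return sum((ret - v) ** 2 for ret, v in zip(targets, values))

rets = discounted_returns([1.0, 1.0, 1.0], 0.5)  # [1.75, 1.5, 1.0]
```

When the Critic's estimates exactly match the discounted returns, the loss is zero, which is the minimization target for Θ_B.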
Next, the action prediction network is described:
The action prediction network is constructed by taking the overall state information of the robot to be controlled as input and the overall predicted action information as output.
Specifically, the enhanced state s̃_i of each agent is defined and input to the Actor network, which outputs the mean μ_i and variance σ_i of each agent's enhanced state:
(μ_i, σ_i) = Actor(s̃_i) (10)
Then, using Gaussian distribution sampling, the action a_i of each agent is obtained:
a_i = μ_i + σ_i · x (11)
where x is a random number drawn from a standard normal distribution.
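The Gaussian sampling step can be sketched with Python's standard library; the seeded generator is only for reproducibility:

```python
import random

def sample_action(mu, sigma, rng=None):
    """Sample each joint action a_i = mu_i + sigma_i * x with x ~ N(0, 1)."""
    rng = rng or random.Random(0)
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

# With sigma = 0 the policy is deterministic and returns the mean.
a = sample_action([0.2, -0.1], [0.0, 0.0])
```

This reparameterized form (mean plus scaled noise) is what lets gradients flow through μ_i and σ_i during policy training.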
The overall action information of the robot may be represented as the set of all agents' actions:
a = <a_1, ..., a_i, ..., a_n> (12)
The action prediction network is used to learn the optimized behavior policy π_Θ(a|s) of the robot to be controlled. It updates the parameter Θ to maximize the discounted return J_policy:
J_policy(Θ) = E_t[ min( ρ_t(Θ) Â_t, clip(ρ_t(Θ), 1−ε, 1+ε) Â_t ) ] (13)
ρ_t(Θ) = π_Θ(a_t | s_t) / π_{Θ'}(a_t | s_t) (14)
where ε represents a balance hyperparameter; ρ_t(Θ) represents the ratio of the action probabilities estimated by the Actor network with the current parameters Θ and the old parameters Θ' at time step t; and Â_t represents the advantage function at time step t.
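A single-time-step sketch of a clipped surrogate objective built from the probability ratio and the advantage; the exact min/clip form is an assumption about the patent's objective, following the PPO convention:

```python
def clipped_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate for one time step:
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

# A large ratio with positive advantage is clipped at 1 + eps.
j = clipped_objective(ratio=2.0, advantage=1.0, eps=0.2)  # 1.2
```

The clipping bounds how far a single update can move the new policy away from the old one, which stabilizes training.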
After all the networks are trained, the action prediction model is obtained; the overall state information of the robot to be controlled is then input, and the overall action information of the robot to be controlled can be output.
Step S103: after the action prediction model outputs the overall action information of the robot to be controlled, the robot is controlled to move and interacts with the environment. The robot moves for one round (T steps); at any time step t, the robot executes the action a_t and then interacts with the environment to obtain the state s_{t+1} of step t+1 and the reward value r_t. Finally, the whole movement trajectory <s_t, a_t, s_{t+1}, r_t>, t∈[0,T], is stored in the sample database.
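The interaction loop of step S103 can be sketched with a stand-in environment; all names here are hypothetical:

```python
class DummyEnv:
    """Stand-in environment: the state is a step counter, reward is always 1."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, action):
        self.s += 1
        return self.s, 1.0  # next state, reward

def rollout(env, policy, horizon):
    """Run one round of `horizon` steps and collect <s_t, a_t, s_{t+1}, r_t>."""
    memory, s = [], env.reset()
    for _ in range(horizon):
        a = policy(s)
        s_next, r = env.step(a)
        memory.append((s, a, s_next, r))
        s = s_next
    return memory

traj = rollout(DummyEnv(), policy=lambda s: 0, horizon=3)
```

In the actual method, `policy` would be the trained action prediction model and the collected tuples would go into the sample database.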
In order to better explain the technical scheme of the invention, the Half-Cheetah robot is taken as the robot to be controlled, and the technical scheme of the invention is further explained as follows:
Each joint of the Half-Cheetah robot is first modeled as an agent. The Half-Cheetah is a planar bipedal robot whose state information has 17 dimensions and whose action information has 6 dimensions; it comprises six agents (components). The state and action information of each agent may be expressed as:
A_1: s_1 = (ρ_2, ψ_1, θ_1), a_1 = (θ_0)    A_4: s_4 = (ρ_5, ψ_1, θ_4), a_4 = (θ_3)
A_2: s_2 = (ρ_3, ψ_1, θ_2), a_2 = (θ_1)    A_5: s_5 = (ρ_6, ψ_1, θ_5), a_5 = (θ_4)
A_3: s_3 = (ρ_4, ψ_1, θ_3), a_3 = (θ_2)    A_6: s_6 = (ρ_7, ψ_1, θ_6), a_6 = (θ_5)
where ρ_b (b∈[0,7]), ψ_c (c∈[0,7]) and θ_d (d∈[0,7]) represent the position, velocity and angle information, respectively, of the joints of the different components.
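Structural decomposition of the 17-dimensional Half-Cheetah state into six per-agent states might look as follows; the index slices are invented for illustration and are not the patent's exact (ρ, ψ, θ) mapping:

```python
def decompose_state(global_state, agent_slices):
    """Split the whole-robot state vector into per-agent state vectors.
    `agent_slices` maps an agent name to the indices it observes."""
    return {name: [global_state[i] for i in idx]
            for name, idx in agent_slices.items()}

state17 = list(range(17))  # stand-in for Half-Cheetah's 17-dim state
# Hypothetical indices: one position, one velocity, one angle per agent.
slices = {f"A{k}": [k + 1, 8, 10 + k] for k in range(1, 7)}
per_agent = decompose_state(state17, slices)
```

Each agent then sees only its own slice plus, through the cooperation graph, the joint state of the agents it interacts with.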
The reward value r for all agents is defined as follows:
r = V_x − α·||a||_2^2
a = <a_1, ..., a_i, ..., a_6>
where V_x is the forward speed of the Half-Cheetah robot, a represents the action taken by the entire robot, and α is a coefficient penalizing large actions.
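A sketch of a velocity-minus-control-cost reward of this shape; the 0.1 coefficient is the conventional Gym Half-Cheetah value, assumed here rather than taken from the patent:

```python
def halfcheetah_reward(v_x, action, ctrl_cost=0.1):
    """Forward velocity reward minus a quadratic action penalty.
    `ctrl_cost` is an assumed coefficient, not the patent's value."""
    return v_x - ctrl_cost * sum(a * a for a in action)

# Forward speed 2.0 with one joint exerting a unit action.
r = halfcheetah_reward(2.0, [1.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # 1.9
```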
A continuous trajectory <s_t, s_{t+1}>, t∈[0,T], is sampled from the memory bank. The state prediction network is executed to output the cooperation graph G_P and the adjacency matrix W_P, which remain unchanged for one learning round.
At each time step t, the joint state ŝ_i on which each agent is based is calculated according to the cooperation graph, and the enhanced state of each agent is formed as s̃_i = <s_i, ŝ_i>.
Using equations (10), (11) and (12), the actions taken by the entire robot are obtained, namely:
(μ_i, σ_i) = Actor(s̃_i)
a_i = μ_i + σ_i · x
a = <a_1, ..., a_i, ..., a_6>
where x represents a Gaussian distribution sample.
The robot then interacts with the environment. From the current state s_t, the robot performs the action a_t and then interacts with the environment to obtain the next state s_{t+1} and the reward value r_t. The robot moves continuously for 300 steps, and finally the whole movement trajectory <s_t, a_t, s_{t+1}, r_t>, t∈{1, 2, ..., 300}, is stored in the memory bank.
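The memory bank that stores these 300-step trajectories can be sketched as a fixed-capacity buffer; this is a hypothetical minimal implementation:

```python
from collections import deque

class MemoryBank:
    """Fixed-capacity store for movement transitions (a simple replay buffer).
    Oldest transitions are discarded once capacity is reached."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        self.buffer.append(transition)

    def __len__(self):
        return len(self.buffer)

bank = MemoryBank(capacity=300)
for t in range(1, 301):
    bank.store((t, "a", t + 1, 0.0))  # (s_t, a_t, s_{t+1}, r_t) placeholder
```

The `deque(maxlen=...)` choice gives O(1) appends and automatic eviction, which matches the round-based overwrite behavior a bounded memory bank needs.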
The cooperation graph is then updated and rectified. First, the state prediction network is updated using equation (5) so that the error loss_P between the real state and the predicted state is minimized, yielding a more accurate state prediction network. The state prediction network thus obtains an approximately accurate cooperation graph G_P, which remains unchanged within one complete learning trajectory. However, the state prediction network may cause the DoI to change too slowly to accommodate the changes of the robot's different agents, whereas the attention model must recalculate the DoI at every time step; such a precise topology may result in high complexity, local optima, and low learning efficiency. To balance the two models, reducing the computational load while still obtaining accurate DoI values, the present invention further updates the parameters Θ_p every 50 rounds, using the adjacency matrix generated by the attention model to rectify the prediction model, as shown in equation (6).
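The every-50-rounds rectification schedule reduces to a simple counter check; the helper name and `period` parameter are illustrative:

```python
def should_rectify(round_idx, period=50):
    """True on the training rounds where the attention model's adjacency
    matrix is used to rectify the state prediction network."""
    return round_idx > 0 and round_idx % period == 0

# Over 200 rounds, rectification fires at rounds 50, 100, 150, 200.
rounds = [r for r in range(1, 201) if should_rectify(r)]
```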
As shown in fig. 2 and 3, fig. 2 and 3 compare the motion prediction model provided by the present invention (identified as apdt in the figures) with existing DRL algorithms (including PPO, DDPG, AC, REINFORCE and CEM) in terms of the average reward value and the cumulative reward value; it can be seen from the figures that apdt attains the best reward value and cumulative reward value as the number of rounds increases.
Fig. 4 shows schematic diagrams of different postures of the Half-Cheetah robot, where (a) in fig. 4 shows the Half-Cheetah robot walking, (b) shows it jumping, and (c) shows it landing.
Fig. 5 shows the cooperation graphs corresponding to the different postures of the Half-Cheetah robot, where (d) in fig. 5 corresponds to (a) in fig. 4, (e) in fig. 5 corresponds to (b) in fig. 4, and (f) in fig. 5 corresponds to (c) in fig. 4. The black arrows indicate a bidirectional connection, i.e. both agents consider each other's information; a grey arrow represents a one-way connection, where the agent at the arrow's head must take the information of the agent at the arrow's tail into consideration for cooperation; the black dashed circle marks the joint most attended to by the other agents. When walking, in order to maintain smooth motion, there is information interaction among all major components (i.e., the front and rear thighs, knees and ankles). When the robot jumps, the most important joint is the rear thigh: it takes the state information of the other joints into account so that the robot can make coordinated actions in preparation for jumping. When the robot lands, the rear thigh and the front ankle are the most important joints and should receive higher attention.
On the basis of the above method item embodiment, the present invention provides an apparatus item embodiment, and as shown in fig. 6, the present invention provides a robot control apparatus based on a component interaction degree, including: the device comprises a data acquisition module, an action prediction module and a motion control module;
the data acquisition module is used for acquiring the overall state information of the robot to be controlled;
the action prediction module is used for inputting the overall state information of the robot to be controlled into the constructed action prediction model so as to enable the action prediction model to generate the overall predicted action information of the robot to be controlled according to the overall state information of the robot to be controlled; the action prediction model generates overall predicted action information of the robot to be controlled according to the overall state information of the robot to be controlled, and the action prediction model specifically comprises the following steps: carrying out structural decomposition on the overall state information of the robot to be controlled to obtain the state information of each part of the robot to be controlled, and then calculating the interaction degree between each part and the rest of parts according to the state information of each part of the robot to be controlled; determining the enhanced state information of each component according to the interaction degree of each component and the other components; predicting the predicted action information of each component according to the enhanced state information of each component, and then generating the overall predicted action information of the robot to be controlled according to the predicted action information of all the components;
and the motion control module is used for controlling the motion of the robot to be controlled according to the overall predicted action information of the robot to be controlled.
In a preferred embodiment, the performing structural decomposition on the overall state information of the robot to be controlled to obtain state information of each component of the robot to be controlled, and then calculating the interaction degree between each component and the other components according to the state information of each component of the robot to be controlled specifically includes:
acquiring integral state information of the robot to be controlled at a first moment, performing structural decomposition, and generating state information of each part of the robot to be controlled at the first moment;
selecting one component from all the components one by one as a selected component, predicting the predicted state information of the other components except the selected component at a second moment through a preset state prediction network according to the state information of the selected component at the first moment, calculating a prediction error according to the predicted state information of the other components at the second moment and the actual state information of the other components at the second moment, and determining the interaction degree between the selected component and the other components according to the prediction error; wherein the second time is a next time of the first time.
In a preferred embodiment, the predicting the predicted action information of each component according to the enhanced status information of each component specifically includes:
inputting the enhanced state information of each component into a preset action prediction network so that the action prediction network outputs the mean value and the variance of the enhanced state information of each component;
and obtaining the predicted action information of each component through Gaussian distribution sampling according to the mean value and the variance of the enhanced state information of each component.
In a preferred embodiment, the system further comprises a parameter adjusting module;
the parameter adjusting module is used for adjusting the network parameters of the state prediction network in the following way:
acquiring motion trail information of the robot to be controlled from a sample database; the motion trail information comprises overall state information and overall action information of the robot to be controlled at each moment;
the overall state information of the robot to be controlled at each moment is input into the state prediction network, so that the state prediction network generates overall predicted action information of the robot to be controlled at each moment;
calculating the error between the overall predicted action information of the robot to be controlled at each moment and the overall action information of the robot to be controlled at each moment, and adjusting the network parameters of the state prediction network according to the error;
and generating an adjacency matrix according to a preset attention network, and adjusting the network parameters again according to the adjacency matrix.
It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, in the drawings of the device embodiments provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (8)

1. A robot control method based on a degree of component interaction, comprising:
acquiring integral state information of a robot to be controlled;
inputting the overall state information of the robot to be controlled into a constructed action prediction model so that the action prediction model generates the overall predicted action information of the robot to be controlled according to the overall state information of the robot to be controlled;
the action prediction model generates overall predicted action information of the robot to be controlled according to the overall state information of the robot to be controlled, and the action prediction model specifically comprises the following steps:
carrying out structural decomposition on the overall state information of the robot to be controlled to obtain the state information of each part of the robot to be controlled, and then calculating the interaction degree between each part and the rest of parts according to the state information of each part of the robot to be controlled; determining the enhanced state information of each component according to the interaction degree of each component and the other components; predicting the predicted action information of each component according to the enhanced state information of each component, and then generating the overall predicted action information of the robot to be controlled according to the predicted action information of all the components;
and controlling the robot to be controlled to move according to the overall predicted action information of the robot to be controlled.
2. The method according to claim 1, wherein the structural decomposition is performed on the overall state information of the robot to be controlled to obtain the state information of each component of the robot to be controlled, and then the degree of interaction between each component and the other components is calculated according to the state information of each component of the robot to be controlled, specifically comprising:
acquiring integral state information of the robot to be controlled at a first moment, performing structural decomposition, and generating state information of each part of the robot to be controlled at the first moment;
selecting one component from all the components one by one as a selected component, predicting the predicted state information of the other components except the selected component at a second moment through a preset state prediction network according to the state information of the selected component at the first moment after each selected component is determined, calculating a prediction error according to the predicted state information of the other components at the second moment and the actual state information of the other components at the second moment, and determining the interaction degree between the selected component and the other components according to the prediction error; wherein the second time is a next time of the first time.
3. The method for controlling a robot according to claim 1, wherein the predicting the predicted motion information of each component according to the enhanced status information of each component is specifically:
inputting the enhanced state information of each component into a preset action prediction network so that the action prediction network outputs the mean value and the variance of the enhanced state information of each component;
and obtaining the predicted action information of each component through Gaussian distribution sampling according to the mean value and the variance of the enhanced state information of each component.
4. The method for robot control based on degree of interaction of parts according to claim 2, further comprising: adjusting network parameters of the state prediction network by:
acquiring motion trail information of the robot to be controlled from a sample database; the motion trail information comprises overall state information and overall action information of the robot to be controlled at each moment;
the overall state information of the robot to be controlled at each moment is input into the state prediction network, so that the state prediction network generates overall predicted action information of the robot to be controlled at each moment;
calculating the error between the overall predicted action information of the robot to be controlled at each moment and the overall action information of the robot to be controlled at each moment, and adjusting the network parameters of the state prediction network according to the error;
and generating an adjacency matrix according to a preset attention network, and adjusting the network parameters again according to the adjacency matrix.
5. A robot control apparatus based on a degree of component interaction, comprising: the device comprises a data acquisition module, an action prediction module and a motion control module;
the data acquisition module is used for acquiring the overall state information of the robot to be controlled;
the action prediction module is used for inputting the overall state information of the robot to be controlled into the constructed action prediction model so as to enable the action prediction model to generate the overall predicted action information of the robot to be controlled according to the overall state information of the robot to be controlled; the action prediction model generates overall predicted action information of the robot to be controlled according to the overall state information of the robot to be controlled, and the action prediction model specifically comprises the following steps: carrying out structural decomposition on the overall state information of the robot to be controlled to obtain the state information of each part of the robot to be controlled, and then calculating the interaction degree between each part and the rest of parts according to the state information of each part of the robot to be controlled; determining the enhanced state information of each component according to the interaction degree of each component and the other components; predicting the predicted action information of each component according to the enhanced state information of each component, and then generating the overall predicted action information of the robot to be controlled according to the predicted action information of all the components;
and the motion control module is used for controlling the motion of the robot to be controlled according to the overall predicted action information of the robot to be controlled.
6. The robot control device based on the component interaction degree according to claim 5, wherein the overall state information of the robot to be controlled is structurally decomposed to obtain the state information of each component of the robot to be controlled, and then the interaction degree between each component and the rest of the components is calculated according to the state information of each component of the robot to be controlled, and specifically comprises the following steps:
acquiring integral state information of the robot to be controlled at a first moment, performing structural decomposition, and generating state information of each part of the robot to be controlled at the first moment;
selecting one component from all the components one by one as a selected component, predicting the predicted state information of the other components except the selected component at a second moment through a preset state prediction network according to the state information of the selected component at the first moment after each selected component is determined, calculating a prediction error according to the predicted state information of the other components at the second moment and the actual state information of the other components at the second moment, and determining the interaction degree between the selected component and the other components according to the prediction error; wherein the second time is a next time of the first time.
7. The robot control device based on the component interaction degree according to claim 5, wherein the predicted action information of each component is predicted according to the enhanced state information of each component, and the predicted action information is specifically as follows:
inputting the enhanced state information of each component into a preset action prediction network so that the action prediction network outputs the mean value and the variance of the enhanced state information of each component;
and obtaining the predicted action information of each component through Gaussian distribution sampling according to the mean value and the variance of the enhanced state information of each component.
8. The robot control device based on the component interaction degree according to claim 6, characterized by further comprising a parameter adjusting module;
the parameter adjusting module is used for adjusting the network parameters of the state prediction network in the following way:
acquiring motion trail information of the robot to be controlled from a sample database; the motion trail information comprises overall state information and overall action information of the robot to be controlled at each moment;
the overall state information of the robot to be controlled at each moment is input into the state prediction network, so that the state prediction network generates overall predicted action information of the robot to be controlled at each moment;
calculating the error between the overall predicted action information of the robot to be controlled at each moment and the overall action information of the robot to be controlled at each moment, and adjusting the network parameters of the state prediction network according to the error;
and generating an adjacency matrix according to a preset attention network, and adjusting the network parameters again according to the adjacency matrix.
CN202010813591.XA 2020-08-13 2020-08-13 Robot control method and device based on component interaction degree Active CN112008734B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010813591.XA CN112008734B (en) 2020-08-13 2020-08-13 Robot control method and device based on component interaction degree


Publications (2)

Publication Number Publication Date
CN112008734A true CN112008734A (en) 2020-12-01
CN112008734B CN112008734B (en) 2021-10-15

Family

ID=73506042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010813591.XA Active CN112008734B (en) 2020-08-13 2020-08-13 Robot control method and device based on component interaction degree

Country Status (1)

Country Link
CN (1) CN112008734B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109794937A (en) * 2019-01-29 2019-05-24 南京邮电大学 A kind of Soccer robot collaboration method based on intensified learning
US20190172230A1 (en) * 2017-12-06 2019-06-06 Siemens Healthcare Gmbh Magnetic resonance image reconstruction with deep reinforcement learning
US10363657B2 (en) * 2016-12-23 2019-07-30 X Development Llc Multi-agent coordination under sparse networking
CN111432015A (en) * 2020-03-31 2020-07-17 中国人民解放军国防科技大学 Dynamic noise environment-oriented full-coverage task allocation method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHAO YU, DONGXU WANG, JIANKANG REN, HONGWEI GE, AND LIANG SUN: "Decentralized Multiagent Reinforcement", 《SPRINGER》 *

Also Published As

Publication number Publication date
CN112008734B (en) 2021-10-15

Similar Documents

Publication Publication Date Title
CN112668235B (en) Robot control method based on off-line model pre-training learning DDPG algorithm
Russell et al. Q-decomposition for reinforcement learning agents
Qiang et al. Reinforcement learning model, algorithms and its application
Badgwell et al. Reinforcement learning–overview of recent progress and implications for process control
Xu et al. Learning multi-agent coordination for enhancing target coverage in directional sensor networks
CN112329948A (en) Multi-agent strategy prediction method and device
CN112990485A (en) Knowledge strategy selection method and device based on reinforcement learning
CN114139637A (en) Multi-agent information fusion method and device, electronic equipment and readable storage medium
Tagliaferri et al. A real-time strategy-decision program for sailing yacht races
Hafez et al. Efficient intrinsically motivated robotic grasping with learning-adaptive imagination in latent space
Oliehoek et al. The decentralized POMDP framework
US11948079B2 (en) Multi-agent coordination method and apparatus
CN112008707B (en) Robot control method and device based on component decomposition
Tong et al. Enhancing rolling horizon evolution with policy and value networks
CN112008734B (en) Robot control method and device based on component interaction degree
Li et al. Learning adversarial policy in multiple scenes environment via multi-agent reinforcement learning
Espinós Longa et al. Swarm Intelligence in Cooperative Environments: Introducing the N-Step Dynamic Tree Search Algorithm
CN115909027A (en) Situation estimation method and device
CN115587615A (en) Internal reward generation method for sensing action loop decision
CN113379063B (en) Whole-flow task time sequence intelligent decision-making method based on online reinforcement learning model
Zhan et al. Dueling network architecture for multi-agent deep deterministic policy gradient
Morales Deep Reinforcement Learning
CN114118371A (en) Intelligent agent deep reinforcement learning method and computer readable medium
Junru et al. Decentralized multi-task reinforcement learning policy gradient method with momentum over networks
Zhang et al. Stm-gail: Spatial-Temporal meta-gail for learning diverse human driving strategies

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant