CN112008734A

CN112008734A - Robot control method and device based on component interaction degree

Info

Publication number: CN112008734A
Application number: CN202010813591.XA
Authority: CN
Inventors: 余超; 董银昭; 葛宏伟; 陈炳才; 孙亮
Original assignee: National Sun Yat Sen University
Current assignee: Sun Yat Sen University; National Sun Yat Sen University
Priority date: 2020-08-13
Filing date: 2020-08-13
Publication date: 2020-12-01
Anticipated expiration: 2040-08-13
Also published as: CN112008734B

Abstract

The invention discloses a robot control method based on component interaction degree, which comprises the following steps: acquiring integral state information of a robot to be controlled; the method comprises the steps that the overall state information of a robot to be controlled is input into an action prediction model, the action prediction model carries out structural decomposition on the overall state information to obtain the state information of each component, and then the interaction degree between each component and the rest of components is calculated according to the state information of each component; determining the enhanced state information of each component according to the interaction degree of each component and the other components; predicting the predicted action information of each component according to the enhanced state information of each component, and then generating the overall predicted action information of the robot to be controlled according to the predicted action information of all the components; and finally, controlling the motion of the robot to be controlled according to the overall predicted motion information of the robot to be controlled. By implementing the embodiment of the invention, the complexity of robot control can be reduced, and the stability of robot control can be improved.

Description

Robot control method and device based on component interaction degree

Technical Field

The invention relates to the technical field of intelligent robots, in particular to a robot control method and device based on component interaction degree.

Background

Deep Reinforcement Learning (DRL) can ensure that various robot behavior controls work well in challenging tasks such as motion and manipulation. However, in the control of the robot in reality of the existing DRL algorithm, the robot is directly searched in the state and behavior space of the high altitude in the aspect of facing the control problem of the high altitude, and the complexity of the robot control is high; in addition, because the existing DRL algorithm is an end-to-end behavior, that is, the whole state-behavior space is directly searched and the finally learned motion strategy is output, the finally obtained motion strategy lacks interpretability, and the stability of robot control is poor.

Disclosure of Invention

The embodiment of the invention provides a robot control method and device based on the component interaction degree, which can reduce the complexity of robot control and improve the stability of robot control.

An embodiment of the present invention provides a robot control method based on a component interaction degree, including:

acquiring integral state information of a robot to be controlled;

inputting the overall state information of the robot to be controlled into a constructed action prediction model so that the action prediction model generates the overall predicted action information of the robot to be controlled according to the overall state information of the robot to be controlled;

the action prediction model generates overall predicted action information of the robot to be controlled according to the overall state information of the robot to be controlled, and the action prediction model specifically comprises the following steps:

carrying out structural decomposition on the overall state information of the robot to be controlled to obtain the state information of each part of the robot to be controlled, and then calculating the interaction degree between each part and the rest of parts according to the state information of each part of the robot to be controlled; determining the enhanced state information of each component according to the interaction degree of each component and the other components; predicting the predicted action information of each component according to the enhanced state information of each component, and then generating the overall predicted action information of the robot to be controlled according to the predicted action information of all the components;

and controlling the robot to be controlled to move according to the overall predicted action information of the robot to be controlled.

Further, the structural decomposition is performed on the overall state information of the robot to be controlled, state information of each component of the robot to be controlled is obtained, and then the degree of interaction between each component and the rest of the components is calculated according to the state information of each component of the robot to be controlled, which specifically includes:

acquiring integral state information of the robot to be controlled at a first moment, performing structural decomposition, and generating state information of each part of the robot to be controlled at the first moment;

selecting one component from all the components one by one as a selected component, predicting the predicted state information of the other components except the selected component at a second moment through a preset state prediction network according to the state information of the selected component at the first moment after each selected component is determined, calculating a prediction error according to the predicted state information of the other components at the second moment and the actual state information of the other components at the second moment, and determining the interaction degree between the selected component and the other components according to the prediction error; wherein the second time is a next time of the first time.

Further, the predicting the predicted action information of each component according to the enhanced state information of each component specifically includes:

inputting the enhanced state information of each component into a preset action prediction network so that the action prediction network outputs the mean value and the variance of the enhanced state information of each component;

and obtaining the predicted action information of each component through Gaussian distribution sampling according to the mean value and the variance of the enhanced state information of each component.

Further, still include: adjusting network parameters of the state prediction network by:

acquiring motion trail information of the robot to be controlled from a sample database; the motion trail information comprises overall state information and overall action information of the robot to be controlled at each moment;

the overall state information of the robot to be controlled at each moment is input into the state prediction network, so that the state prediction network generates overall predicted action information of the robot to be controlled at each moment;

calculating the error between the overall predicted action information of the robot to be controlled at each moment and the overall action information of the robot to be controlled at each moment, and adjusting the network parameters of the state prediction network according to the error;

and generating an adjacency matrix according to a preset attention network, and adjusting the network parameters again according to the adjacency matrix.

On the basis of the above method item embodiment, the present invention correspondingly provides an apparatus item embodiment:

the embodiment of the invention provides a robot control device based on component interaction degree, which comprises a data acquisition module, an action prediction module and a motion control module, wherein the data acquisition module is used for acquiring the data of a robot;

the data acquisition module is used for acquiring the overall state information of the robot to be controlled;

the action prediction module is used for inputting the overall state information of the robot to be controlled into the constructed action prediction model so as to enable the action prediction model to generate the overall predicted action information of the robot to be controlled according to the overall state information of the robot to be controlled; the action prediction model generates overall predicted action information of the robot to be controlled according to the overall state information of the robot to be controlled, and the action prediction model specifically comprises the following steps: carrying out structural decomposition on the overall state information of the robot to be controlled to obtain the state information of each part of the robot to be controlled, and then calculating the interaction degree between each part and the rest of parts according to the state information of each part of the robot to be controlled; determining the enhanced state information of each component according to the interaction degree of each component and the other components; predicting the predicted action information of each component according to the enhanced state information of each component, and then generating the overall predicted action information of the robot to be controlled according to the predicted action information of all the components;

and the motion control module is used for controlling the motion of the robot to be controlled according to the overall predicted action information of the robot to be controlled.

Further, the device also comprises a parameter adjusting module; the parameter adjusting module is used for adjusting the network parameters of the state prediction network in the following way:

The embodiment of the invention has the following beneficial effects:

the embodiment of the invention provides a robot control method and a device based on component interaction degree, the method comprises the steps of obtaining the overall state information of a robot to be controlled, inputting the overall state information into an action prediction model, carrying out structural decomposition on the overall state information according to the physical structure of the robot by the action prediction model to obtain the state information of each component, then calculating the interaction degree of each component, next generating the enhanced state information of each component according to the interaction degree and the state information of each component, predicting the prediction action of each component through the enhanced state information, and finally integrating all the prediction actions to obtain the overall prediction action of the robot to be controlled; compared with the prior art, the method and the device have the advantages that the overall motion prediction problem is firstly decomposed into the prediction problem of the motion of each part, dimension reduction is realized, direct searching in the overall high-latitude state and behavior space is not needed on the aspect of facing the high-latitude control problem, the complexity of machine learning is simplified, and the complexity of robot control is reduced.

Drawings

Fig. 1 is a schematic flowchart of a robot control method based on a degree of interaction between components according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of the comparison between the motion prediction model provided by the present invention and the average reward value of the existing DRL algorithm in the context of Half-Cheetah robot.

FIG. 3 is a schematic diagram of comparing the motion prediction model provided by the present invention with the cumulative reward value of the existing DRL algorithm in the context of Half-Cheetah robot.

FIG. 4 is a schematic diagram of the Half-Cheetah robot provided by the present invention in different postures.

FIG. 5 is a collaboration diagram corresponding to different postures of the Half-Cheetah robot provided by the invention.

Fig. 6 is a schematic structural diagram of a robot control device based on component interaction degree according to an embodiment of the present invention

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in fig. 1, an embodiment of the present invention provides a robot control method based on a component interaction degree, including:

s101, acquiring the overall state information of the robot to be controlled;

step S102, inputting the overall state information of the robot to be controlled into a constructed action prediction model so that the action prediction model generates the overall predicted action information of the robot to be controlled according to the overall state information of the robot to be controlled; the action prediction model generates overall predicted action information of the robot to be controlled according to the overall state information of the robot to be controlled, and the action prediction model specifically comprises the following steps:

and S103, controlling the robot to be controlled to move according to the overall predicted action information of the robot to be controlled.

For step S101, the overall state information of the robot to be controlled refers to state elements in the Markov decision process, including but not limited to speed, position and other information;

in step S102, in a preferred embodiment, the performing structural decomposition on the overall state information of the robot to be controlled to obtain state information of each component of the robot to be controlled, and then calculating the interaction degree between each component and the other components according to the state information of each component of the robot to be controlled specifically includes:

Preferably, the predicting the predicted action information of each component according to the enhanced state information of each component specifically includes:

In the present invention, a Markov decision process for a multi-agent is first defined. The robot to be controlled comprises a plurality of parts (e.g. ankle, knee, crotch etc.) each controlling the behavior of the robot. Taking each component as an agent (hereinafter, the agent is referred to as the agent), the markov decision process of a multi-agent is defined as a five-tuple:<n,S,A,P,R>where n represents the number of agents, s represents the current state of the agent (including information such as speed and location), a represents the action performed by the agent (information such as angle of movement), p represents the agent action transition probability, and R represents the reward value. Suppose s_iRespectively represent the ithStatus and action information of individual components; s_gRepresenting state information shared by all agents. The state of the entire robot can be expressed as:

s＝<s₁,...s_i,...s_n,s_g> (1)

p (s '| s, a) represents the probability that the robot will jump from the current state s to the next state s' when performing action a. R (s, a) represents the prize value that the robot gets from taking action a at the current state s.

Defining the interaction degree immediately; information interaction is an important mode in multi-agent (component) collaboration, and the invention defines that DoI (interaction degree) represents the information interaction degree among different agents. And a synergy graph is used to model DoI between different agents as follows:

G＝(V,W)

V＝{A_i|i∈[n]}

W＝{w_ij|i,j∈[n]} (2)

where G represents a collaboration graph and V represents a collection of agents. A. the_iRepresenting the ith agent, W is a adjacency matrix, i.e., a set of weights between any two agents of the collaborative map. Weight W_ijThe representation being agent A_jTo A_iThe importance level of (d) is DoI (degree of interaction). According to different W_ijThree different forms of DoI can be defined:

1. if W is for all agents_ijAll 1, it is called Global Degree of Interaction (GDoI). At the moment, each agent can learn the state information of any one agent, and the importance degree of the state information of each other agent is consistent.

2. If the agent's own weight is 1 and the weight W between any two different agents_ijAll of which are 0, are called Independent Degree of Interaction (IDoI). Each agent only learns its own state information at this time.

3. If W is_ijThe continuous variation is called Dynamic Degree of Interaction (DDoI). This is achieved byEach agent may learn the information of other agents but have different values.

With the above definitions in mind, the motion prediction model of the present invention will be described in detail below;

the action prediction model comprises a state prediction network, an attention network, an action prediction network and a reward value network;

first, the state prediction network will be explained:

the state prediction network comprises a state predictor and an interaction degree cooperative graph generator, wherein the state predictor is constructed by taking the state information of one intelligent agent in the robot to be controlled at the current moment as input and taking the predicted state information of the other intelligent agents at the next moment as output;

secondly, the interaction degree collaborative diagram generator compares the predicted state information output by the state predictor with the real state information of the other agents, calculates a prediction error, and determines the interaction degree among the agents according to the prediction error; the method comprises the following specific steps:

firstly, extracting a continuous track { s ] of a robot to be controlled from a sample database^t,s^t+1},t∈{1,2,...,T}；

s^tFor the global state information of the robot to be controlled at time step t, s^t+1The overall state information of the robot to be controlled at the time step of t +1 is obtained; to s^tCarrying out structural decomposition to obtain state information of each intelligent agent at t time step; to s^t+1Carrying out structural decomposition to obtain state information of each agent at the time step of t + 1; suppose one of agents A_iThe status information at time step t is

Agent A_jThe status information at time step t +1 is

In the general of agent A_iState information at time step t

Input into the state predictor to obtain the state predictor related to the agent A_jPredicted state information at t +1

Then, the interaction degree cooperation graph generator further calculates the interaction degree among the agents to generate a corresponding cooperation graph, specifically, the agent A_jState information at time step t +1 (actual state information)

And agent A_jPredicted state information at t +1 time step

The prediction error of (a) can be expressed as:

after calculating the prediction error at each time, agent A_iAnd agent A_jCan be expressed as a continuous track s^t，s^t+1Predicted error e in {1, 2.,. T }, T ∈_ijAs shown in (4):

wherein e_ijWhich is representative of the prediction error, is,

representative of agent A_iAnd agent A_jT represents the length of the track.

After the interaction degree of each component and other components is obtained, the action prediction model calculates the enhanced state information of each agent; the enhanced state may be represented as:

wherein

Representative of agent A_iEnhanced state of (a), w_i,jRepresentative of agent A_iAnd agent A_jDegree of interaction of s_jRepresentative of agent A_jState of (1), s_gRepresenting the global state of the robot.

Next, the network parameter update of the state prediction network will be described:

in a preferred embodiment, the network parameters of the state prediction network are adjusted by:

In particular, after each training round, the trace of the memory bank access < s is fetched_t,a_t,s_t+1,r_t>,t∈[0,T]. Calculating the error loss between the real state and the predicted state using the information of the trajectory_p. Predicting a parameter Θ of a network by updating a state_pTo the minimumReduce loss_pThe following are:

completing one-time updating of network parameters of the state prediction network through the formula; next, after every few training rounds (the specific number of rounds can be selected according to actual conditions), an attention network is used to generate the adjacency matrix W_aCollaborative map W generated by a deskew state prediction network_PFurther update the parameter theta_pThereby obtaining more accurate interaction degree. As the formula:

loss_a→p＝||W_a-W_p||₂ (6)

therein, loss_a→pRepresenting the error of two adjacent matrices.

The following describes the attention network:

first, agent A_iState S of_iIs first input to a multi-layer perceptron (F)_in) Thus, a feature vector with dimension b is output, such as:

f_i＝F_in(s_i) (7)

wherein f is_i∈R^bIs agent A_iThe feature vector of (2).

Then, the joint feature vector of any two agents < f_i,f_j>Is input to an attention network to output f_iAnd f_jThe value of similarity between, i.e. K_ij. Then, positive planning is carried out by using a Soft-max function, thereby obtaining A_jFor A_iThe DoI of (1):

where T represents a transposed symbol.

Following this, the bonus value network is described:

the reward value network is used for predicting the accumulated discount reward values of all robots

The goal of the bonus value network is to update the parameters

To make the estimated loss L of return_BLAnd (4) minimizing. Track < s with fetch bank access_t,a_t,s_t+1,r_t>,t∈[0,T]The method comprises the following steps:

where T represents the number of intermediate time steps of the trajectory, γ is the attenuation coefficient, r^t'represents the prize value at time step t',

s representing the Critic utilization of the entire robot at time step t^tThe predicted cumulative rebate award value.

Next, the operation prediction network will be described:

the action prediction network is constructed by taking the integral state information of the robot to be controlled as input and taking the integral pre-stored state as output;

in particular, the enhanced state of each agent is defined

Inputting a status Actor network, and outputting the mean value mu of each intelligent agent enhancement status_iSum variance σ_iSuch as:

then, using Gaussian distribution sampling, the action a of each agent is obtained_iThe method comprises the following steps:

where x is a random number.

The overall motion information of the robot may be represented as a set of each agent motion:

a＝＜a₁,...a_i,...a_n> (12)

action prediction network for learning optimized behavior strategy pi of robot to be controlled_Θ(as). It can derive the parameter Θ to maximize the discount return J_policy：

Wherein represents a balance hyperparameter;

when the representative robot is at a time step t, the Actor network respectively uses the ratio of the two robot actions estimated by the current parameter theta and the old parameter theta';

representing the merit function at time step t.

After all the networks are trained, obtaining the action prediction model; then, the whole state information of the robot to be controlled is input, and then the whole action information of the robot to be controlled can be output;

step S103, after the action prediction model outputs the overall action information of the robot to be controlled, the robot is controlled to move, and interaction with the environment is realized; the robot moves for one round (T step), and at any time step T, the robot executes the actiona^tThen interacting with the environment to obtain the state s of the step t +1^t+1And a prize value r^t. Finally the whole movement track<s^t,a^a,s^t+1,r^t>,t∈[0,T]Storing into the sample data.

In order to better explain the technical scheme of the invention, the Half-Cheetah robot is taken as the robot to be controlled, and the technical scheme of the invention is further explained as follows:

each joint of the Half-Cheetah robot is first modeled as an agent. The Half-Cheetah is a planar biped robot having state information including 17 dimensions and motion information including 6 dimensions, and includes six agents (components). The state and action information for each agent may be expressed as:

A₁:s₁＝(ρ₂,ψ₁₁),a₁＝(θ₀)A₄:s₄＝(ρ₅,ψ₁₄),a₄＝(θ₃)

A₂:s₂＝(ρ₃,ψ₁₂),a₂＝(θ₁)A₅:s₅＝(ρ₆,ψ₁₅),a₅＝(θ₄)

A₃:s₃＝(ρ₄,ψ₁₃),a₃＝(θ₂)A₆:s₆＝(ρ₇,ψ₁₆),a₆＝(θ₅)

where ρ is_b(b∈[0,7])，ψ_c(c∈[0,7]) And theta_d(d∈[0,7]) Information representing the position, velocity, angle of the joint, respectively, of the different components.

The prize values r for all agents are defined as follows:

a＝<a₁,...a_i,...a₆>

wherein V_xFront of Half-Cheetah robotThe speed in, a, represents the action taken by the entire robot.

Sampling a continuous trace of one from the memory bank:<s^t,s^t+1>,t∈[0,T]. Execution state prediction network output cooperation graph G_PAnd an adjacency matrix W_PIt remains unchanged for one learning round.

Calculating the joint state according to which all agents are based at each time step t

Enhanced state of each agent

Using equations (10) (11) (12), the actions taken by the entire robot are obtained, namely:

where it represents a gaussian distribution sample.

The robot then interacts with the environment. From the current state s of the robot^tPerforming action a^tThen interacts with the environment to obtain the next state s of the robot^t+1And a prize value r^t. The robot continuously moves for 300 steps, and finally the whole movement track is obtained

And (c) storing the t E {1, 2., 300} in a memory bank.

And updating and correcting the cooperation map. Firstly, the state prediction network is updated by using the formula (5) to ensure that the real state and the prediction are performedError between states loss_PAnd the minimum is obtained, so that a more accurate state prediction network is obtained. Thus, the state prediction network obtains an approximative ground co-plot G_P. It remains unchanged for one in a complete learning trajectory. The state prediction network may cause the DoI to change too slowly to accommodate the changes of different agents of the robot. The DoI needs to be recalculated for each time step of the attention model. Such a precise topology may result in high complexity, local optimality, and low learning efficiency. In order to balance the two models, the calculation load degree is reduced, and meanwhile, an accurate DoI value is obtained. Therefore, the present invention further updates the parameter Θ every 50 rounds using the adjacency matrix rectification prediction model generated by the attention model_pAs shown in equation (6).

As shown in fig. 2 and 3, fig. 2 and 3 compare the motion prediction model (identified as apdt in the figure) provided by the present invention with the existing DRL algorithm (including PPO, DDPG, AC, REIN FORCE and CEM) to obtain the comparison between the average prize value and the jackpot value, and it can be seen from the figure that the apdt has the optimal prize value and jackpot value as the round increases.

Fig. 4 shows schematic diagrams of different postures of the Half-Cheetah robot, wherein (a) in fig. 4 is when the Half-Cheetah robot walks, (b) when the Half-Cheetah robot jumps, and (c) when the Half-Cheetah robot lands.

Fig. 5 shows a synergy diagram corresponding to different poses of the Half-Cheetah robot, where (d) in fig. 5 corresponds to (a) in fig. four, (e) in fig. 5 corresponds to (b) in fig. four, and (f) in fig. 5 corresponds to (c) in fig. four, where the black arrows indicate a bidirectional connection, i.e. both agents consider information of opposite parties; the grey arrow represents a one-way connection, and the agents pointed by the arrow must take the information of the agents at the tail into consideration for cooperation; the black dashed circle represents the joint most highly focused by other agents. At this time, in order to maintain smooth walking, there is information interaction between all major components (i.e., the front and rear of the thigh, the knee and the ankle). When the robot jumps, the most important joint is the back thigh. This means that it takes into account the state information of other joints so that the robot can make coordinated actions in preparation for jumping. When the robot lands, the rear thigh and the forefoot ankle are the most important joints and higher attention should be paid.

On the basis of the above method item embodiment, the present invention provides an apparatus item embodiment, and as shown in fig. 6, the present invention provides a robot control apparatus based on a component interaction degree, including: the device comprises a data acquisition module, an action prediction module and a motion control module;

In a preferred embodiment, the performing structural decomposition on the overall state information of the robot to be controlled to obtain state information of each component of the robot to be controlled, and then calculating the interaction degree between each component and the other components according to the state information of each component of the robot to be controlled specifically includes:

selecting one component from all the components one by one as a selected component, predicting the predicted state information of the other components except the selected component at a second moment through a preset state prediction network according to the state information of the selected component at the first moment, calculating a prediction error according to the predicted state information of the other components at the second moment and the actual state information of the other components at the second moment, and determining the interaction degree between the selected component and the other components according to the prediction error; wherein the second time is a next time of the first time.

In a preferred embodiment, the predicting the predicted action information of each component according to the enhanced status information of each component specifically includes:

In a preferred embodiment, the system further comprises a parameter adjusting module;

the parameter adjusting module is used for adjusting the network parameters of the state prediction network in the following way:

It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, in the drawings of the device embodiments provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims

1. A robot control method based on a degree of component interaction, comprising:

acquiring integral state information of a robot to be controlled;

2. The method according to claim 1, wherein the structural decomposition is performed on the overall state information of the robot to be controlled to obtain the state information of each component of the robot to be controlled, and then the degree of interaction between each component and the other components is calculated according to the state information of each component of the robot to be controlled, specifically comprising:

3. The method for controlling a robot according to claim 1, wherein the predicting the predicted motion information of each component according to the enhanced status information of each component is specifically:

4. The method for robot control based on degree of interaction of parts according to claim 2, further comprising: adjusting network parameters of the state prediction network by:

5. A robot control apparatus based on a degree of component interaction, comprising: the device comprises a data acquisition module, an action prediction module and a motion control module;

6. The robot control device based on the component interaction degree is characterized in that the overall state information of the robot to be controlled is structurally decomposed to obtain the state information of each component of the robot to be controlled, and then the interaction degree between each component and the rest of the components is calculated according to the state information of each component of the robot to be controlled, and specifically comprises the following steps:

7. The robot control device based on the component interaction degree is characterized in that the predicted action information of each component is predicted according to the enhanced state information of each component, and the predicted action information is specifically as follows:

8. The robot control device based on the component interaction degree is characterized by further comprising a parameter adjusting module;