CN109794937B - Football robot cooperation method based on reinforcement learning - Google Patents


Publication number
CN109794937B
CN109794937B (application CN201910083609.2A)
Authority
CN
China
Legal status
Active
Application number
CN201910083609.2A
Other languages
Chinese (zh)
Other versions
CN109794937A (en)
Inventor
胡丽娟
梁志伟
李汉辉
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201910083609.2A priority Critical patent/CN109794937B/en
Publication of CN109794937A publication Critical patent/CN109794937A/en
Application granted granted Critical
Publication of CN109794937B publication Critical patent/CN109794937B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a football robot cooperation method based on reinforcement learning, which comprises the following steps: S1, constructing a reinforcement learning basic model of the football robot based on a Sarsa(λ) algorithm with communication added, and setting a reward and punishment mechanism r of the reinforcement learning basic model; S2, defining a specified number of state variables based on the distances and angles between the football robots; S3, setting an operable action set of the football robot, the football robot selecting its next action based on the reward and punishment mechanism r, the state variables and mutual communication between the football robots. By establishing a reward and punishment mechanism in the constructed reinforcement learning basic model, the football robot can select its next action according to the current environment and the reward and punishment mechanism, and the football robots can learn and update through mutual communication, which effectively improves their cooperation efficiency.

Description

Football robot cooperation method based on reinforcement learning
Technical Field
The invention belongs to the field of football robots, and particularly relates to a football robot cooperation method based on reinforcement learning.
Background
As a typical multi-robot system, the football robot adversarial match provides a good experimental platform for research on intelligence theory and the integrated application of multiple technologies. The demand is ever stronger for football robots that can autonomously take appropriate measures in response to changes in the surrounding environment during motion, which involves a series of research topics such as robot positioning, path planning, coordination control, target tracking and decision making.
In recent years, many scholars and experts have produced substantial results. For example, the Chinese patent application No. 201120008202.2 discloses an intelligent robot game device comprising a mechanical part and a circuit control part; the mechanical part includes a table, a console and robots, the circuit control part includes a control module on the console and a controlled module on the robots, and an adversarial game scene can be formed. The Chinese patent with application No. 201010175496.8 discloses a robot education platform comprising a box body and, arranged in it, a mechanical assembly, a sensor unit, a control unit, an execution unit, an interface conversion unit, a task software optical disk and a power module, suitable for various classroom teaching experiments. The Chinese patent with application No. 200410016867.2 discloses an embedded direct drive device for a football robot which, aiming at the deficiencies of the rotating parts of existing autonomous robots, provides a compact and flexibly debuggable drive device that gives the robot fast movement, accurate positioning, impact resistance and strong antagonism. The Chinese patent with application No. 201120313058.3 discloses binocular vision navigation for an indoor football robot, which adopts a global infrared vision positioning mode combined with sensor information to achieve high-precision positioning and navigation of an indoor mobile robot; however, it is only applicable to situations where the obstacles are fixed, the environment is stable, and a single robot operates.
The prior art focuses mainly on the mechanical design of robot platforms, modifications of robot drive devices, and motion control in a fixed environment or for a single robot; no coordination and cooperation control scheme applicable to adversarial football robot matches has been seen. Moreover, in existing football robot matches it often happens that a football robot cannot determine its own pose on the field and spins in place, so that scoring chances are missed and goals are delayed.
Disclosure of Invention
Aiming at the problem of low cooperation efficiency of football robots in football robot matches in the prior art, the invention provides a football robot cooperation method based on reinforcement learning, which achieves high cooperation efficiency by constructing a reinforcement learning basic model of the football robot based on a Sarsa(λ) algorithm with communication added and by letting the football robots communicate with each other. The specific technical scheme is as follows:
a method of reinforcement learning-based soccer robot collaboration, the method comprising:
S1, constructing a reinforcement learning basic model of the football robot based on a Sarsa(λ) algorithm with communication added, and setting a reward and punishment mechanism r of the reinforcement learning basic model;
S2, defining a specified number of state variables based on the distances and angles between the football robots;
S3, setting an operable action set of the football robot, the football robot selecting the next action based on the reward and punishment mechanism r, the state variables and mutual communication between the football robots.
Further, the football robots include attack-end robots and defense-end robots, and the number of state variables is set based on the sum of the numbers of attack-end and defense-end robots.
Further, the method further comprises: a designated football robot among the attack-end robots or the defense-end robots communicates with the remaining football robots through the Sarsa(λ) algorithm, and broadcasts its own state and action messages through the communication.
Further, the reward and punishment mechanism r is:
r = 1, if the attacking side scores a goal; r = 0.01, if a pass succeeds; r = 0, otherwise
Further, the operable action set includes three types: passing, dribbling and shooting.
The football robot cooperation method based on reinforcement learning of the invention is applied to a football robot match comprising attack-end robots and defense-end robots. For all football robots of the attacking end, or all football robots of the defending end, a reinforcement learning basic model of the football robot is first established based on a Sarsa(λ) algorithm with communication added; a basic action set and a reward and punishment mechanism of the football robot are established in the reinforcement learning basic model, and a specified number of state variables are set according to the number of football robots. The football robots can then select the actions to execute in the match according to the reward and punishment mechanism, their own environment, and the communication information exchanged with the other football robots, thereby realizing cooperation. Compared with the prior art, the invention can effectively improve the cooperation efficiency of the football robots and improve the watchability of football robot matches.
Drawings
FIG. 1 is a block flow diagram of a cooperation method of a soccer robot based on reinforcement learning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a basic reinforcement learning model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of state variables of players in an embodiment employing the method of the present invention;
FIG. 4 is a schematic diagram of a simulation experiment on an HFO platform using the method of the present invention;
FIGS. 5(a) and 5(b) are graphs showing the comparison of the experimental results of the cooperative efficiency of the soccer robot with and without communication in the embodiment of the present invention;
FIG. 6 is a comparison graph showing the learning performance of the soccer robot according to the present invention;
fig. 7 is a comparison graph showing the learning performance of the intercommunication between different soccer robots according to the embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
Example one
Referring to fig. 1, in an embodiment of the present invention, a soccer robot cooperation method based on reinforcement learning is provided, which specifically includes:
s1, constructing a reinforced learning basic model of the football robot based on the Sarsa (lambda) algorithm added with communication, and setting a reward and punishment mechanism r of the reinforced learning basic model.
Referring to fig. 2, the principle of the reinforcement learning basic model is as follows: the football robot selects an action while sensing the current environment; the environment state transitions to a new state, which correspondingly generates a reinforcement signal fed back to the football robot; the football robot then determines its next action according to the current environment information and the reinforcement signal. The key points of the football robot reinforcement learning in the invention include:
Policy: a key component of a reinforcement learning agent that provides a mapping from environmental perception states to control actions. Value function: also known as the return value; it evaluates the behavior derived from the existing policy and estimates the merit of the current state, i.e., the consequence of taking an action under the current policy; through continual correction, the value function in turn corrects the policy. Reward and punishment value: used to estimate the instantaneous desirability of the environmental perception state produced by one control action; that is, an action taken by the football robot in a certain state obtains a corresponding reward and punishment value, positive when the expectation is met and negative when it is not. Environment model: a planning tool for predicting future behavior scenarios in view of future possibilities.
In the embodiment of the invention, during the learning process of the reinforcement learning basic model, the football robot continually tries to select actions, and the reinforcement signal provided by the environment evaluates the quality of an action rather than telling the system how to generate correct actions; meanwhile, because the external environment provides little information for adjusting actions, the reinforcement learning system of the football robot must learn from the robot's own experience. The football robot finally obtains the optimal strategy, namely how to cooperate to score, by adjusting the evaluation value of actions through the reinforcement signal.
The Sarsa(λ) algorithm adopted by the invention is a variant of the Sarsa algorithm. The working principle of Sarsa is as follows: the name comes from updating the Q value using the experience State → Action → Reward → State′ → Action′, where the Q value is the value of the policy being executed. An experience of Sarsa has the form (s, a, r, s′, a′), meaning: the agent performs action a in the current state s, receives the reward and punishment value r, arrives at state s′, and there decides to perform action a′; the experience (s, a, r, s′, a′) provides a new estimate, r + γQ(s′, a′), for updating Q(s, a). Sarsa(λ) differs in that, for each state s and action a, it maintains an eligibility trace e(s, a); each time a new reward or punishment is received, Q(s, a) is updated for every pair, but only those whose trace exceeds a certain threshold need be touched, which is not only more efficient but also loses little accuracy. The specific procedure of the Sarsa(λ) algorithm is:
Sarsa(λ,S,A,γ,α)
inputting:
s is a set of states, A is a set of actions, γ is a discount rate, α is a step size, and λ is an attenuation rate
Internal state:
real value arrays Q (s, a) and e (s, a), previous state s, previous behavior a
begin:
Randomly initialize Q(s, a)
For all s, a, initialize e(s, a) = 0
Observe the current state s
Select a using a Q-based policy
repeat forever:
Perform action a
Observe reward and punishment r and state s′
Select action a′ using a Q-based policy
δ ← r + γQ(s′, a′) − Q(s, a)
e(s, a) ← e(s, a) + 1
For all s″, a″:
Q(s″, a″) ← Q(s″, a″) + αδe(s″, a″)
e(s″, a″) ← γλe(s″, a″)
s ← s′
a ← a′
end-repeat
End
Where e (s, a) is also called the eligibility trace, where s and a are the set of all states and all actions, respectively; after each action is performed, the Q value of each "state-action" pair is updated.
Preferably, the reward and punishment mechanism r of the present invention is:
r = 1, if the attacking side scores a goal; r = 0.01, if a pass succeeds; r = 0, otherwise
In the invention, the goal of the task is for an attacking player to score, so the reward and punishment value r after a goal is set to 1, and other actions are given correspondingly small reward and punishment values r; experiments show that a successful pass can also be given a small reward and punishment value r (e.g. 0.01), and r = 0 is equally valid here; no discount is used because each segment ends by chance.
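The reward scheme described above can be sketched as a simple function. The value 0.01 for a successful pass is the example value from the text (r = 0 works equally well), and the event names are illustrative assumptions:

```python
def reward(event):
    """Reward and punishment mechanism r, reconstructed from the description:
    1 for a goal, a small bonus (0.01, illustrative) for a successful pass,
    0 otherwise."""
    if event == "goal":
        return 1.0
    if event == "pass_success":
        return 0.01
    return 0.0
```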
S2, defining a specified number of state variables based on the distance and the angle between the football robots;
In the embodiment of the invention, since the method is applied to a football robot match, the football robots include attack-end robots and defense-end robots, and the number of state variables is set based on the sum of the two; referring specifically to fig. 3, in this embodiment the attacking players are white and the defending players are black. The attacking players are indexed by their distance to the ball as O1, O2, …, Om, with O1 nearest to the ball; similarly, the defending players are indexed by their distance to the ball as D1, D2, …, Dn, and the goalkeeper, who may be any one of the defending players, is denoted Dg. For a football robot match with four attacking players and five defending players, the invention describes the positional relationship of the football robots with the following 17 state variables: dist(O1,O2), dist(O1,O3), dist(O1,O4), the distance from the ball holder O1 to each teammate; dist(O1,Dg), the distance from the ball holder O1 to the goalkeeper; dist(O1,GL), dist(O2,GL), dist(O3,GL), dist(O4,GL), the distance of each attacking player from the goal line GL; min_dist(O1,D), min_dist(O2,D), min_dist(O3,D), min_dist(O4,D), the closest distance of each attacking player to a defending player; min_ang(O2,O1,D), min_ang(O3,O1,D), min_ang(O4,O1,D), the minimum angle ∠OiO1D over all defending players D; min_dist(O1,Ddcone), the closest distance from the ball holder O1 to a defending player within the cone Ddcone, where Ddcone is the cone with vertex O1, half-angle 60 degrees, and axis passing through the goal; and max_goal_ang(O1), the maximum angle max(∠GPleftO1Dg, ∠GPrightO1Dg), i.e., the largest angle formed by the ray from the ball holder O1 to the goalkeeper and the rays from O1 to the two goal posts GPleft and GPright. Among these, dist(O1,GL), max_goal_ang(O1) and dist(O1,Dg) directly influence the selection of the shooting action; min_dist(O1,D) and min_dist(O1,Ddcone) directly influence the selection of the dribbling action; and the other state variables influence the selection of the passing action.
In this embodiment, only the attack efficiency of the attacking players is considered, so the number of state variables is linearly related to the number of attacking players and unrelated to the number of defending players; of course, the number of state variables for defending players would bear the same linear relationship to their number.
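As a concrete illustration, the distance- and angle-based state variables can be computed from player coordinates as below. This is a sketch of a subset of the 17 variables (the cone-restricted distance and the maximum goal angle are omitted); the function names, the coordinate convention, and the goal line being a vertical line at `goal_line_x` are all assumptions:

```python
import math

def dist(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def angle(a, vertex, b):
    """Angle a-vertex-b in degrees, as used for min_ang(Oi, O1, D)."""
    v1 = (a[0] - vertex[0], a[1] - vertex[1])
    v2 = (b[0] - vertex[0], b[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n = math.hypot(*v1) * math.hypot(*v2)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / n))))

def state_variables(attackers, defenders, goalkeeper, goal_line_x):
    """Sketch of 15 of the 17 state variables for 4 attackers; positions are
    (x, y) tuples and attackers[0] is the ball holder O1 (nearest the ball).
    Keys follow the patent's notation."""
    O1 = attackers[0]
    s = {}
    for i, Oi in enumerate(attackers[1:], start=2):
        s[f"dist(O1,O{i})"] = dist(O1, Oi)              # holder to teammates
    s["dist(O1,Dg)"] = dist(O1, goalkeeper)             # holder to goalkeeper
    for i, Oi in enumerate(attackers, start=1):
        s[f"dist(O{i},GL)"] = abs(goal_line_x - Oi[0])  # to goal line
        s[f"min_dist(O{i},D)"] = min(dist(Oi, D) for D in defenders)
    for i, Oi in enumerate(attackers[1:], start=2):
        s[f"min_ang(O{i},O1,D)"] = min(angle(Oi, O1, D) for D in defenders)
    return s
```

Computing min_dist(O1, Ddcone) and max_goal_ang(O1) would additionally require the goal-post positions and the 60-degree cone test.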
And S3, setting an operable action set of the football robot, and selecting the next action by the football robot based on a reward and punishment mechanism r and the mutual communication of the state variables and the football robot.
The operable action set includes three types: passing, dribbling and shooting. The passing action PassK is indexed by distance to teammates rather than by an actual jersey number: PassK kicks the ball to the K-th nearest teammate, K = 2, 3, …, m. Dribble is the ball-carrying action that encourages the attacker to approach the goal; Shoot kicks the ball at the goal, scoring on success. When the ball is not held by an attacking player, the attacking player closest to the ball rushes directly to the ball (GetBall) to gain possession; meanwhile, the other attacking players always keep formation and push forward (GetOpen). The pseudo code is as follows:
if has ball possession then
perform an action from the set {Pass2, …, Passm, Dribble, Shoot}
else if is the attacking player closest to the ball then
GetBall (rush to the ball)
else
GetOpen (move to a formation grid point).
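The pseudo code above amounts to a simple per-robot dispatch, sketched below in Python. The learned choice among {Pass2, …, Passm, Dribble, Shoot} is abstracted as a callable `q_policy`; all names and the dict-based robot representation are illustrative assumptions:

```python
def select_behavior(robot, ball_holder, attackers, ball_pos, q_policy):
    """Behavior dispatch for one attacker, following the pseudo code:
    the ball holder asks the learned Q policy; otherwise the attacker
    nearest the ball chases it and everyone else keeps formation."""
    if robot is ball_holder:
        return q_policy(robot)      # learned choice: pass / dribble / shoot

    def d(p, q):                    # Euclidean distance helper
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

    closest = min(attackers, key=lambda r: d(r["pos"], ball_pos))
    if robot is closest:
        return "GetBall"            # nearest free attacker rushes to the ball
    return "GetOpen"                # others keep formation and push forward
```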
In the embodiment of the invention, a designated football robot among the attack-end robots or the defense-end robots communicates with the remaining football robots through the Sarsa(λ) algorithm and broadcasts its own state and action messages; for example, when a player selects an action in state s and receives a reward or punishment r, a message is broadcast to the team. A concrete implementation is given by the following pseudo code:
reinforcement learning for communication
Initialization:
for all training segments do
s ← NULL
repeat
if has ball possession then
s ← getCurrentStateFromEnvironment (obtain the state of the current environment)
select and execute action a according to the Q function
r ← waitForRewardFromEnvironment (wait for the environment to evaluate the action and give the corresponding reward and punishment value)
broadcast the message (s, a, r)
else if is the attacking player closest to the ball then
GetBall (rush to the ball)
else
GetOpen (move to a formation grid point)
if the broadcast message (sm, am, rm) is received then
if state s is empty then
s, a, r ← sm, am, rm
else
s′, a′, r′ ← sm, am, rm
Q(s,a) ← Q(s,a) + α(r + γQ(s′,a′) − Q(s,a))
s, a, r ← s′, a′, r′
until the segment ends.
The above listing is the learning task of the football robot in the reinforcement learning basic model; the following three cases are defined as the end of a segment: a goal is scored, the ball crosses the boundary, or the defenders (including the goalkeeper) gain possession of the ball. Each football robot stores a current action-value function; the attacking player holding the ball performs an action, receives a reward or punishment, and then broadcasts the message (s, a, r) to the team. Each football robot is initialized with (s, a, r) at the beginning, and subsequent messages are dynamically updated according to (s′, a′, r′). Meanwhile, in order to guarantee message consistency, the method also provides a special football robot that serves as the medium for all communication: communication information between the football robots is first sent to this special football robot and then broadcast by it to the other football robots. Since the special football robot acts as the intermediate communication medium and is independent of the other football robots in communication, the integrity and reliability of the communication information can be achieved.
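The broadcast-driven Sarsa update in the listing can be sketched as a small class: each robot buffers the last message (s, a, r) and, when the next message (s′, a′, r′) arrives, applies Q(s,a) ← Q(s,a) + α(r + γQ(s′,a′) − Q(s,a)). The class and method names are assumptions, not from the patent:

```python
class CommLearner:
    """One robot's view of the communication-driven learning update:
    seed on the first broadcast, then apply a one-step Sarsa update
    for every subsequent broadcast message."""

    def __init__(self, alpha=0.1, gamma=0.9):
        self.alpha, self.gamma = alpha, gamma
        self.Q = {}          # action-value table Q[(s, a)]
        self.prev = None     # last (s, a, r) heard over the team channel

    def on_broadcast(self, s_m, a_m, r_m):
        if self.prev is None:            # first message just seeds (s, a, r)
            self.prev = (s_m, a_m, r_m)
            return
        s, a, r = self.prev
        q = self.Q.get((s, a), 0.0)
        q_next = self.Q.get((s_m, a_m), 0.0)
        # Sarsa update driven by the received (s', a', r') message
        self.Q[(s, a)] = q + self.alpha * (r + self.gamma * q_next - q)
        self.prev = (s_m, a_m, r_m)      # shift: (s, a, r) <- (s', a', r')
```

Routing every `on_broadcast` call through a single mediator robot, as the text describes, would keep all teammates' tables consistent.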
Example two
The method of the first embodiment is verified on the HFO experimental platform with m attacking players and n defending players, where the defenders include a goalkeeper and n ≥ m. A half-court offense task is performed on half of the football field and begins near the half-way line, with the ball held by an attacking player. Referring to fig. 4, a classic 4v5 HFO setup is illustrated, where the white filled circle is the ball, with four attacking players and five defending players including the goalkeeper. During the experiment, in order to shoot and score successfully on the HFO platform, the attacking players must learn the three actions of passing, dribbling and shooting through the reinforcement learning basic model, while the simulated defending players try to block them.
Preferably, the invention first carries out 30 groups of experiments each for learning with and without communication between the football robots in order to analyze the error; the specific number of experimental groups can be chosen according to the practical situation, and this is only a preferred embodiment, not a limitation of the method. Referring to figs. 5(a) and 5(b), the x-axis represents the experiment group number and the y-axis represents the score y obtained after 20000 segments of learning, calculated by the formula
y = (1/20000) Σj=1..20000 rj
where rj is the reward and punishment value obtained by the agent at the end of the j-th segment. Fig. 5(a) shows the score obtained in each group of experiments with communicating learning, where the dashed line is the average of the 30 groups of learning, and the variance calculated by the variance formula is only 0.0005; fig. 5(b) shows the score obtained in each group of experiments with non-communicating learning, where the dashed line is the average of the 30 groups of learning, and the variance is only 0.0025. The scores obtained by the two kinds of learning can therefore each be represented by their average value, the error being negligible within the allowable range.
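The score used in the experiments is an average of per-segment rewards; a minimal sketch, with the averaging form reconstructed from the surrounding description:

```python
def segment_score(rewards):
    """Score y after N learning segments: the mean of the reward and
    punishment values r_j collected at the end of each segment
    (with r_j in {0, 1} this equals the goal success rate)."""
    return sum(rewards) / len(rewards)
```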
Referring to fig. 6, which compares the performance of the reinforcement learning algorithm in the football robot with and without communication, the x-axis represents the segment number, i.e., xi indicates that the agent is learning in the i-th segment, where i ∈ [1, 20000]; the y-axis represents the score yi obtained at the end of each segment's learning task, calculated by the formula
yi = (1/i) Σj=1..i rj
where rj is the reward and punishment value at the end of the j-th segment. It can be seen from the figure that during the first 5000 segments of learning, the performance of both the communicating and the non-communicating football robots increases roughly linearly, with the communicating learning rising more rapidly; after 5000 segments, the efficiency of the communicating learning increases markedly; after 20000 segments, both learning curves essentially converge, with a learning success rate of 20.09% without communication and about 31.08% with communication, an improvement of 10.99% over learning without communication. The comparison shows that adding communication improves the learning efficiency of the football robots.
In the embodiment of the invention, in order to eliminate hidden state, the field of view of the football robot is set to 360 degrees; meanwhile, to compare more clearly the performance of the reinforcement learning algorithm after communication is added, comparison experiments with communicating learning are carried out for different numbers of attacking players. Specifically, referring to fig. 7, football robot systems containing four players, three players, two players and a single player are each learned and updated; it can be seen that all four curves increase roughly linearly within a certain number of learning segments and then tend to converge. In every learning segment, the learning curve of the four communicating players always lies above the other curves; between 5000 and 10000 segments, the learning efficiency accelerates as the number of communicating players increases; and after 20000 segments, the learning score of the system containing four communicating players is far higher than that of systems containing fewer football robots. This comparison shows that, in the method, the more football robots the system contains, the higher the learning efficiency achieved through communication among them; that is, during an actual match, the method can effectively improve the cooperation efficiency of the whole football robot system and thereby its overall attacking efficiency.
In summary, the football robot cooperation method based on reinforcement learning of the invention is applied to a football robot match comprising attack-end robots and defense-end robots. For all football robots of the attacking end, or all football robots of the defending end, a reinforcement learning basic model of the football robot is first established based on a Sarsa(λ) algorithm with communication added; a basic action set and a reward and punishment mechanism of the football robot are established in the reinforcement learning basic model, and a specified number of state variables are set according to the number of football robots. The football robots can then select the actions to execute in the match according to the reward and punishment mechanism, their own environment, and the communication information exchanged with the other football robots, thereby realizing cooperation. Compared with the prior art, the invention can effectively improve the cooperation efficiency of the football robots and improve the watchability of football robot matches.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described in the foregoing detailed description, or equivalent changes may be made in some of the features of the embodiments described above. All equivalent structures made by using the contents of the specification and the attached drawings of the invention can be directly or indirectly applied to other related technical fields, and are also within the protection scope of the patent of the invention.

Claims (4)

1. A football robot cooperation method based on reinforcement learning is characterized in that the method comprises the following steps:
S1, constructing a reinforcement learning basic model of the football robot based on a Sarsa(λ) algorithm with communication added, and setting a reward and punishment mechanism r of the reinforcement learning basic model; the principle of the reinforcement learning basic model is that the football robot selects an action while sensing the current environment, the environment state transitions to a new state, the new state correspondingly generates a reinforcement signal fed back to the football robot, and the football robot determines its next action according to the current environment information and the reinforcement signal;
S2, defining a specified number of state variables based on the distance and the angle between the football robots, including: the distance between the attacking ball holder and each teammate, the distance between the ball holder and the goalkeeper, the distance between each attacking player and the goal line, the closest distance between each attacking player and a defending player, the minimum angle, the closest distance between the ball holder and a defending player within the ball holder's cone, and the maximum angle;
S3, setting an operable action set of the football robot, the football robot selecting the next action based on the reward and punishment mechanism r, the state variables and mutual communication between the football robots; the operable action set includes three types: passing, dribbling and shooting, wherein the passing action PassK, indexed by distance to teammates, kicks the ball to the K-th nearest teammate; Dribble is the ball-carrying action that encourages the attacker to approach the goal; Shoot kicks the ball at the goal to score; when the ball is not held by an attacking player, the attacking player closest to the ball rushes directly to the ball to gain possession; meanwhile, the other attacking players always keep formation and push forward.
2. The reinforcement-learning-based football robot cooperation method of claim 1, wherein the football robots include attack-end robots and defense-end robots, the number of state variables being set based on the sum of the numbers of attack-end and defense-end robots.
3. The reinforcement learning-based soccer robot collaboration method of claim 2, wherein the method further comprises: and the attacking-end robot or the appointed football robot in the defending-end robot communicates with the rest football robots through the Sarsa (lambda) algorithm, and broadcasts the state and action messages of the attacking-end robot or the defending-end robot through the communication.
4. The reinforcement-learning-based football robot cooperation method of claim 1, wherein the reward and punishment mechanism r is:
[Formula for the reward and punishment mechanism r provided as an image in the original publication]
CN201910083609.2A 2019-01-29 2019-01-29 Football robot cooperation method based on reinforcement learning Active CN109794937B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910083609.2A CN109794937B (en) 2019-01-29 2019-01-29 Football robot cooperation method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910083609.2A CN109794937B (en) 2019-01-29 2019-01-29 Football robot cooperation method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN109794937A CN109794937A (en) 2019-05-24
CN109794937B true CN109794937B (en) 2021-10-01

Family

ID=66559083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910083609.2A Active CN109794937B (en) 2019-01-29 2019-01-29 Football robot cooperation method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN109794937B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110280019A (en) * 2019-06-21 2019-09-27 南京邮电大学 Soccer robot Defending Policy based on intensified learning
CN110370295B (en) * 2019-07-02 2020-12-18 浙江大学 Small-sized football robot active control ball suction method based on deep reinforcement learning
CN111136659B (en) * 2020-01-15 2022-06-21 南京大学 Mechanical arm action learning method and system based on third person scale imitation learning
CN111781922B (en) * 2020-06-15 2021-10-26 中山大学 Multi-robot collaborative navigation method based on deep reinforcement learning
CN112008734B (en) * 2020-08-13 2021-10-15 中山大学 Robot control method and device based on component interaction degree
CN113312840B (en) * 2021-05-25 2023-02-17 广州深灵科技有限公司 Badminton playing method and system based on reinforcement learning
CN113467481B (en) * 2021-08-11 2022-10-25 哈尔滨工程大学 Path planning method based on improved Sarsa algorithm

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001191286A (en) * 1999-10-30 2001-07-17 Korea Advanced Inst Of Sci Technol Soccer robot controlling system using ir module
CN1394660A (en) * 2002-08-06 2003-02-05 哈尔滨工业大学 Full-automatic football robot and its intelligent control system
CN104063541A (en) * 2014-06-18 2014-09-24 南京邮电大学 Hierarchical decision making mechanism-based multirobot cooperation method
WO2014151926A2 (en) * 2013-03-15 2014-09-25 Brain Corporation Robotic training apparatus and methods
CN104865960A (en) * 2015-04-29 2015-08-26 山东师范大学 Multi-intelligent-body formation control method based on plane
CN106964145A (en) * 2017-03-28 2017-07-21 南京邮电大学 A kind of apery Soccer robot pass control method and team's ball-handling method
CN207198660U (en) * 2017-08-31 2018-04-06 安徽朗巴智能科技有限公司 The intelligence control system that novel football robot is independently shot
US9950421B2 (en) * 2010-07-02 2018-04-24 Softbank Robotics Europe Humanoid game-playing robot, method and system for using said robot
CN108563112A (en) * 2018-03-30 2018-09-21 南京邮电大学 Control method for emulating Soccer robot ball-handling

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001191286A (en) * 1999-10-30 2001-07-17 Korea Advanced Inst Of Sci Technol Soccer robot controlling system using ir module
CN1394660A (en) * 2002-08-06 2003-02-05 哈尔滨工业大学 Full-automatic football robot and its intelligent control system
US9950421B2 (en) * 2010-07-02 2018-04-24 Softbank Robotics Europe Humanoid game-playing robot, method and system for using said robot
WO2014151926A2 (en) * 2013-03-15 2014-09-25 Brain Corporation Robotic training apparatus and methods
CN104063541A (en) * 2014-06-18 2014-09-24 南京邮电大学 Hierarchical decision making mechanism-based multirobot cooperation method
CN104865960A (en) * 2015-04-29 2015-08-26 山东师范大学 Multi-intelligent-body formation control method based on plane
CN106964145A (en) * 2017-03-28 2017-07-21 南京邮电大学 A kind of apery Soccer robot pass control method and team's ball-handling method
CN207198660U (en) * 2017-08-31 2018-04-06 安徽朗巴智能科技有限公司 The intelligence control system that novel football robot is independently shot
CN108563112A (en) * 2018-03-30 2018-09-21 南京邮电大学 Control method for emulating Soccer robot ball-handling

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Omnidirectional Walking and Team Cooperation of Soccer Robots in RoboCup3D Simulation; Shen Ping; China Master's Theses Full-text Database, Information Science and Technology; 2016-05-15; full text *

Also Published As

Publication number Publication date
CN109794937A (en) 2019-05-24

Similar Documents

Publication Publication Date Title
CN109794937B (en) Football robot cooperation method based on reinforcement learning
Browning et al. STP: Skills, tactics, and plays for multi-robot control in adversarial environments
CN106352738A (en) Multi-missile cooperative guidance method based on output consistency
CN110928329A (en) Multi-aircraft track planning method based on deep Q learning algorithm
CN114063644B (en) Unmanned fighter plane air combat autonomous decision-making method based on pigeon flock reverse countermeasure learning
CN107803025A (en) Analogy method is aimed at and triggered during a kind of 3D high-precision reals
Xiang et al. Research on UAV swarm confrontation task based on MADDPG algorithm
Schwab et al. Learning skills for small size league robocup
Zhu et al. Learning primitive skills for mobile robots
Liu et al. Comparing heuristic search methods for finding effective group behaviors in RTS game
CN113741186A (en) Double-machine air combat decision method based on near-end strategy optimization
Shi et al. Research on self-adaptive decision-making mechanism for competition strategies in robot soccer
Vicerra et al. A multiple level MIMO fuzzy logic based intelligence for multiple agent cooperative robot system
Reis et al. Coordination in multi-robot systems: Applications in robotic soccer
Gorman et al. Imitative learning of combat behaviours in first-person computer games
CN107315349B (en) Ball hitting motion control method of robot
CN110280019A (en) Soccer robot Defending Policy based on intensified learning
CN104460668A (en) Method for improving soccer robot shooting efficiency
CN110711368B (en) Ball hitting method and device of table tennis robot
CN114757092A (en) System and method for training multi-agent cooperative communication strategy based on teammate perception
CN113377099A (en) Robot pursuit game method based on deep reinforcement learning
Chen et al. Commander-Soldiers Reinforcement Learning for Cooperative Multi-Agent Systems
Kober et al. Learning prioritized control of motor primitives
Li Design and implement of soccer player AI training system using unity ML-agents
He The Design of a Soccer Robot Game Strategy Based on Fuzzy Decision Algorithms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant