CN116039956B - Spacecraft sequence game method, device and medium based on Monte Carlo tree search - Google Patents


Info

Publication number
CN116039956B
CN116039956B (application CN202211364933.XA)
Authority
CN
China
Prior art keywords
spacecraft
decision
node
nodes
action
Prior art date
Legal status
Active
Application number
CN202211364933.XA
Other languages
Chinese (zh)
Other versions
CN116039956A (en)
Inventor
叶东 (Ye Dong)
贾振 (Jia Zhen)
姜锐 (Jiang Rui)
田鑫龙 (Tian Xinlong)
张剑桥 (Zhang Jianqiao)
Current Assignee
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority claimed from CN202211364933.XA
Publication of CN116039956A
Application granted
Publication of CN116039956B

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B64AIRCRAFT; AVIATION; COSMONAUTICS
    • B64GCOSMONAUTICS; VEHICLES OR EQUIPMENT THEREFOR
    • B64G1/00Cosmonautic vehicles
    • B64G1/22Parts of, or equipment specially adapted for fitting in or to, cosmonautic vehicles
    • B64G1/24Guiding or controlling apparatus, e.g. for attitude control
    • B64G1/242Orbits and trajectories
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B64AIRCRAFT; AVIATION; COSMONAUTICS
    • B64GCOSMONAUTICS; VEHICLES OR EQUIPMENT THEREFOR
    • B64G1/00Cosmonautic vehicles
    • B64G1/22Parts of, or equipment specially adapted for fitting in or to, cosmonautic vehicles
    • B64G1/24Guiding or controlling apparatus, e.g. for attitude control
    • B64G1/244Spacecraft control systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • G06N5/042Backward inferencing

Abstract

The embodiment of the invention discloses a spacecraft sequence game method based on Monte Carlo tree search, belonging to the technical field of spacecraft orbit control. The method comprises the following steps: in the current round, constructing the initial state information s₀ of the current round; taking the initial state information of the current round as the root node of a game tree, and selecting one or more subtrees to be explored for constructing the game tree from the candidate states formed by expansion in a discrete action space; updating the utility estimation information of all nodes on the path from the root node to the leaf nodes through backtracking propagation, according to the state evaluation information of each of the leaf nodes expanded in the subtree to be explored; making the optimal action decision of the current round according to the updated utility estimation information of the game tree; and controlling the motion state of the decision spacecraft according to the optimal action decision, so that the opponent spacecraft makes its action decision based on the controlled motion state of the decision spacecraft.

Description

Spacecraft sequence game method, device and medium based on Monte Carlo tree search
Technical Field
The embodiment of the invention relates to the technical field of spacecraft orbit control, in particular to a spacecraft sequence game method, device and medium based on Monte Carlo tree search.
Background
Conventional studies of the spacecraft orbit game problem are usually based on the assumption of continuous spacecraft maneuvering, while most spacecraft in actual mission scenarios adopt impulsive (pulse) maneuvers, so the spacecraft orbit game problem under impulsive maneuvering lacks a unified description.
The design of the terminal reward surface for the spacecraft orbit game problem has no unified form and lacks universality and flexibility.
Extensive-form game problems are usually solved with game tree methods, which require state evaluation at the nodes; a conventional game tree method must evaluate the state of every node, which consumes a large amount of computing resources.
Disclosure of Invention
In view of this, the embodiment of the invention expects to provide a spacecraft sequence game method, device and medium based on Monte Carlo tree search; the method can model the spacecraft orbit game problem under impulsive maneuvering and provide a good solution to the sub-game problem under limited time and computing resources.
The technical scheme of the embodiment of the invention is realized as follows:
In a first aspect, an embodiment of the present invention provides a spacecraft sequence gaming method based on monte carlo tree search, including:
in the current round, constructing initial state information s₀ of the current round;
Taking the initial state information of the current round as a root node of a game tree, and selecting one or more to-be-explored subtrees for constructing the game tree from candidate states formed by expanding in a discrete action space;
updating utility estimation information of all nodes on a path from the root node to the leaf nodes through backtracking propagation according to state estimation information of each of all the leaf nodes expanded in the subtree to be explored;
according to the utility estimation information updated by the game tree, making an optimal action decision of the current round;
and controlling the motion state of the decision spacecraft according to the optimal action decision, so that the opponent spacecraft makes its action decision based on the controlled motion state of the decision spacecraft.
In a second aspect, an embodiment of the present invention provides a spacecraft sequence gaming device based on a monte carlo tree search, including a first construction portion, a second construction portion, an updating portion, a decision portion, and a control portion; wherein,
The first construction part is configured to construct, in the current round, the initial state information s₀ of the current round;
The second construction part is configured to take the initial state information of the current round as the root node of the game tree, and to select, from the candidate states formed by expansion in the discrete action space, one or more subtrees to be explored for constructing the game tree;
the updating part is configured to update utility estimation information of all nodes on a path from the root node to the leaf nodes through backtracking according to state estimation information of each of all the leaf nodes expanded in the subtree to be explored;
the decision part is configured to make an optimal action decision of the current round according to the updated utility estimation information of the game tree;
the control part is configured to control the motion state of the decision spacecraft according to the optimal action decision, so that the opponent spacecraft makes its action decision based on the controlled motion state of the decision spacecraft.
In a third aspect, embodiments of the present invention provide a computing device, the computing device comprising: a communication interface, a memory and a processor; the components are coupled together by a bus system; wherein,
The communication interface is used for receiving and transmitting signals in the process of receiving and transmitting information with other external network elements;
the memory is used for storing a computer program capable of running on the processor;
the processor is configured to execute the steps of the spacecraft sequence game method based on the Monte Carlo tree search in the first aspect when running the computer program, which are not repeated here.
In a fourth aspect, an embodiment of the present invention provides a computer storage medium, where a spacecraft sequence game program based on a monte carlo tree search is stored, where the spacecraft sequence game program based on a monte carlo tree search implements the steps of the spacecraft sequence game method based on a monte carlo tree search in the first aspect when the spacecraft sequence game program based on a monte carlo tree search is executed by at least one processor.
The embodiment of the invention provides a spacecraft sequence game method, device and medium based on Monte Carlo tree search. First, the initial state information of the current round is constructed, giving a discretized model description of the game problem under impulsive maneuvering. Then, subtrees to be explored are constructed by selecting, from the candidate states formed by expansion in the discrete action space, directions favorable to the utility estimate; state information is evaluated only at the leaf nodes of each subtree, and the utility estimation information of all nodes on the search path is updated backward to make the optimal action decision. In this way the selection of game actions reflects the final game objective, the search range of the game tree is reduced, and state evaluation is not required at every node, thereby reducing the amount of computation and solving for a good solution of the game problem under limited computing resources.
Drawings
FIG. 1 is a schematic view of the solar-interference constraint on satellite monitoring provided by an embodiment of the present invention;
fig. 2 is a schematic flow chart of a spacecraft sequence game method based on Monte Carlo tree search according to an embodiment of the invention;
FIG. 3 is a schematic diagram of the sequence-game state transition process according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the discrete impulse action space provided by an embodiment of the present invention;
FIG. 5 is a diagram comparing the complete game and the sub-game according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the state transition process according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of exploring a new node according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of leaf-node state evaluation according to an embodiment of the present invention;
fig. 9 is a schematic diagram of game tree construction for the pursuit-evasion spacecraft sequence game according to an embodiment of the present invention;
fig. 10 is a schematic diagram of a spacecraft sequence game device based on Monte Carlo tree search according to an embodiment of the invention;
fig. 11 is a schematic hardware structure of a computing device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
Consider the monitoring game problem under the solar-interference constraint; the visibility under specific solar interference is shown in fig. 1. The objectives of the two game-playing spacecraft are as follows: the tracking spacecraft seeks optimal close observation, and the escape spacecraft seeks to destroy that observation. The main objective of the tracking spacecraft is to achieve the optimal observation angle and relative distance in close observation. The main objective of the escape spacecraft is to create an observation condition that spoils the opponent's view. Because the relative angle is shared by both sides, when the opponent spacecraft's observation is destroyed (backlit by the sun), the escape spacecraft itself is naturally located in a favorable, down-sun observation position.
Referring to fig. 2, the method for game playing of spacecraft sequences based on monte carlo tree search provided by the embodiment of the invention can be applied to decision spacecraft, and it can be understood that the decision spacecraft can be either a tracking spacecraft or an escape spacecraft, and the method comprises:
S201: in the current round, constructing initial state information s₀ of the current round;
S202: taking the initial state information of the current round as a root node of a game tree, and selecting one or more to-be-explored subtrees for constructing the game tree from candidate states formed by expanding in a discrete action space;
S203: updating utility estimation information of all nodes on a path from the root node to the leaf nodes through backtracking propagation according to state estimation information of each of all the leaf nodes expanded in the subtree to be explored;
s204: according to the utility estimation information updated by the game tree, making an optimal action decision of the current round;
S205: and controlling the motion state of the decision spacecraft according to the optimal action decision, so that the opponent spacecraft makes its action decision based on the controlled motion state of the decision spacecraft.
The above scheme expresses the process by which the decision spacecraft makes a single action decision according to the current state in the spacecraft orbit game under impulsive maneuvering. It can be understood that, taking the decision spacecraft as the tracking spacecraft as an example, after the tracking spacecraft finishes action control according to the optimal action decision, the opponent spacecraft (i.e. the escape spacecraft) makes its own optimal action decision following the same process, which is not repeated here. At this point one interactive round between the game-playing spacecraft is completed, as shown by the Δt_p and Δt_e phases in fig. 3; these two phases alternate in a loop to form the complete process of the adversarial spacecraft sequence game under impulsive maneuvering shown in fig. 3.
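The alternating Δt_p / Δt_e decision phases described above can be sketched with a deliberately simplified one-dimensional toy model (the function name, the scalar state and the greedy one-step policy are purely illustrative stand-ins for a full tree-search decision; none of this is from the patent):

```python
def play_rounds(p0, e0, n_rounds, actions=(-1, 0, 1)):
    """Alternating decision phases (pursuer's Δt_p phase, then evader's
    Δt_e phase, as in Fig. 3) on a 1-D toy state. Each greedy one-step
    choice stands in for a full Monte Carlo tree-search decision."""
    p, e = p0, e0
    for _ in range(n_rounds):
        # pursuer's phase: close the gap
        p += min(actions, key=lambda a: abs(p + a - e))
        # evader's phase: open the gap, reacting to the pursuer's new state
        e += max(actions, key=lambda a: abs(e + a - p))
    return p, e
```

With equal "maneuver budgets" per round, the optimal toy play keeps the gap constant, which mirrors the interactive-round structure rather than any orbital dynamics.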
For the solution shown in fig. 2, in some possible implementations, the initial state information includes the decision spacecraft's own motion information, the observed motion information of the opponent spacecraft formed by its action decision in the previous round, and the relative position of the sun, described as the following formula:

s_t = [x_{e,t}, x_{p,t}, x_{sun,t}]

wherein x_sun denotes the relative position of the sun; x_i = [r_i, v_i], i = e, p denotes the motion information, comprising the position r_i and velocity v_i, of the decision spacecraft e and the opponent spacecraft p; and the subscript t denotes the discrete time.
For the above implementation manner, it should be noted that, in the embodiment of the present invention, under the complete information, it is assumed that there is no error in the state measurement of the opposite party, that is, the accurate position and speed of the opposite party can be obtained.
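As an illustration of how such a state vector might be assembled, a minimal sketch follows (the function name and the flat-list representation are assumptions for illustration, not from the patent):

```python
def build_state(r_e, v_e, r_p, v_p, x_sun):
    """Assemble s_t = [x_e, x_p, x_sun] with x_i = [r_i, v_i] (LVLH frame),
    as a flat list. Under the complete-information assumption, the
    opponent's position and velocity are known without measurement error."""
    x_e = list(r_e) + list(v_e)   # escape spacecraft: position + velocity
    x_p = list(r_p) + list(v_p)   # tracking spacecraft: position + velocity
    return x_e + x_p + list(x_sun)
```

The resulting 15-element vector (two 6-element spacecraft states plus the 3-element sun position) serves as the root-node state of the game tree.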
For the technical solution shown in fig. 2, in some possible implementations, the decision spacecraft uses initial state information of a current round as a root node of a game tree, selects one or more sub-trees to be explored for constructing the game tree from candidate states formed by expanding in a discrete action space, and includes:
taking initial state information of the current round as a root node of a game tree, and taking all candidate states generated on the basis of the initial state information in a discrete action space as a first layer child node S' of the game tree;
Selecting one or more nodes from the first layer of child nodes S' as nodes to be unfolded;
predicting the actions of the decision spacecraft and the opponent spacecraft over a set number of subsequent rounds, and expanding each node to be expanded based on the predicted actions to form the subtree to be explored corresponding to that node to be expanded.
For the above implementation, in some examples, taking the initial state information of the current round as the root node of the game tree and taking all candidate states generated from the initial state information in the discrete action space as the first-layer child nodes S' of the game tree includes:
uniformly dividing the continuous action space to obtain a discrete action space;
forming candidate actions corresponding to each sampling space in the discrete action space according to the direction corresponding to each sampling space in the discrete action space;
according to the initial state information and each candidate action, performing the state transition through the following formula to obtain the candidate state corresponding to each candidate action:

x_{i,n+1} = Φ(n) (x_{i,n} + [0_3; a_n])

wherein Φ(n) denotes the state transition matrix of the relative-motion C-W (Clohessy-Wiltshire) equation; n denotes the discrete time; x_{i,n} = [r_n; v_n] denotes the motion state of the tracking spacecraft or the escape spacecraft at time n; r_n denotes the position vector of the tracking or escape spacecraft in the LVLH coordinate system; v_n denotes the velocity vector of the tracking or escape spacecraft in the LVLH coordinate system; and a_n denotes the action (velocity impulse) vector expressed in the LVLH coordinate system;
all candidate states are taken as first layer child nodes S' of the game tree.
For the above example, in detail, for the continuous action space A, the discrete action space may be obtained by sampling: on the premise of accepting a certain error, the whole action space is uniformly divided and candidate actions in all directions are approximated. As shown in fig. 4, in the space coordinate system with the center of mass of the decision spacecraft as the origin, the sphere surface is the continuous action space A; it is uniformly divided to obtain the discrete action space, and each point on the sphere surface in fig. 4 represents one sampling cell of the discrete action space. It can be understood that each sampling cell corresponds to one direction and therefore to one candidate action. A candidate action a_{i,n} may be described as:

a_{i,n} = [v_x, v_y, v_z]

wherein v_x, v_y, v_z respectively denote the velocity components of the decision spacecraft along the x-, y- and z-axes of the LVLH coordinate system shown in fig. 4.
For the above example, it should also be noted that the orbital dynamics of the game spacecraft are assumed to be common knowledge: each player knows the other's equation of motion. Thus, the state transition process characterized by the state transition equation in the above implementation can be shown as in fig. 6, where s represents the initial state information and A(s) represents the candidate action space. Based on the space represented by A(s), N candidate actions may be formed, such as a_{e,n} shown in fig. 6, and each candidate action forms one of N candidate states, such as s' shown in fig. 6; in this way a discretized model description can be given for the game problem under impulsive maneuvering.
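A sketch of the two ingredients above, assuming the standard closed-form Clohessy-Wiltshire solution (x radial, y along-track, z cross-track, n the orbital mean motion) and an azimuth/elevation grid as one simple way to divide the sphere uniformly; the function names and grid resolution are illustrative assumptions, not the patent's:

```python
import math

def discrete_actions(dv, n_az=8, n_el=4):
    """Sample impulse directions on the sphere of radius dv (the continuous
    action space A of Fig. 4) with an azimuth/elevation grid; returns a
    list of (vx, vy, vz) velocity impulses."""
    actions = []
    for i in range(n_az):
        az = 2 * math.pi * i / n_az
        for j in range(n_el):
            el = math.pi * (j + 0.5) / n_el - math.pi / 2  # avoid the poles
            actions.append((dv * math.cos(el) * math.cos(az),
                            dv * math.cos(el) * math.sin(az),
                            dv * math.sin(el)))
    return actions

def cw_transition(x, a, n, t):
    """Propagate one state x = [rx, ry, rz, vx, vy, vz] over time t with the
    closed-form Clohessy-Wiltshire solution, after applying the velocity
    impulse a = (ax, ay, az); n is the mean motion of the reference orbit."""
    rx, ry, rz, vx, vy, vz = x
    vx, vy, vz = vx + a[0], vy + a[1], vz + a[2]   # impulsive maneuver
    s, c = math.sin(n * t), math.cos(n * t)
    rx1 = (4 - 3 * c) * rx + s / n * vx + 2 / n * (1 - c) * vy
    ry1 = 6 * (s - n * t) * rx + ry + 2 / n * (c - 1) * vx + (4 * s - 3 * n * t) / n * vy
    rz1 = c * rz + s / n * vz
    vx1 = 3 * n * s * rx + c * vx + 2 * s * vy
    vy1 = 6 * n * (c - 1) * rx - 2 * s * vx + (4 * c - 3) * vy
    vz1 = -n * s * rz + c * vz
    return [rx1, ry1, rz1, vx1, vy1, vz1]
```

Expanding a node then means applying `cw_transition` once per element of `discrete_actions(dv)` to obtain the candidate child states.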
For the above implementation, in some examples, the selecting one or more nodes from the first layer child nodes S' as the node to be expanded includes:
the confidence upper bound UCB (the Upper Confidence Bound) value for each candidate state S 'in the first tier child node S' is calculated by:
wherein the former part Q (s') represents the utility estimation of the node state, reflects the utilization of information, has an initial value of 0, and is reversely and retrospectively updated according to the leaf node state estimation information; the latter part represents information brought by exploring new nodes; n(s) = Σ a∈A(s) n (s, a) represents the number of times the state s is accessed; c is a constant and is obtained through configuration, and in general, c is a positive number when the decision spacecraft is an escape spacecraft and c is a negative number when the decision spacecraft is a tracking spacecraft;
if the decision spacecraft is the escape spacecraft, the node corresponding to the maximum UCB value in the first-layer child nodes S' is taken as the node to be expanded; if the decision spacecraft is the tracking spacecraft, the node corresponding to the minimum UCB value in the first-layer child nodes S' is taken as the node to be expanded.
It should be noted that the above example describes one selection of a node to be expanded by the decision spacecraft. As shown in fig. 7, s is the initial state of the decision spacecraft, and the states formed after expansion over the action space A(s) form the first-layer child nodes of the game tree rooted at s; for example, expanding the initial state s with an action a_{e,n} in A(s) yields the corresponding node among the first-layer child nodes, so that expansion over the whole action space forms the first-layer child nodes S'. A node, such as s', is then selected from the first layer by its UCB value and, as a node to be expanded, is expanded further. After the node to be expanded has been expanded according to the subsequent steps, its utility estimation information is updated backward; understandably, the node with the largest or smallest UCB value is then selected as the next node to be expanded according to the updated utility estimation information. Iterating this loop while the decision time allows, one or more nodes of the first-layer child nodes S' may be selected as nodes to be expanded. It can further be understood that, when the UCB value is calculated, a node in the first layer that has not yet been visited obtains a large value because its exploration term is large, i.e. its probability of being selected as the node to be expanded is larger; time permitting, once all first-layer nodes have been visited, the utility term (the first part of the UCB formula) becomes the main reference information for selecting subsequent nodes to be expanded.
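The selection rule above can be sketched as follows (the helper names and the child representation as (Q, visit-count) pairs are illustrative assumptions; unvisited children are forced to be tried first, matching the behavior described in the text):

```python
import math

def ucb(Q, N_parent, n_visits, c):
    """UCB value of a child: Q(s') + c * sqrt(ln N(s) / n(s, a)).
    Unvisited children get +/- infinity so they are explored first."""
    if n_visits == 0:
        return math.inf if c > 0 else -math.inf
    return Q + c * math.sqrt(math.log(N_parent) / n_visits)

def select_child(children, c):
    """children: list of (Q, n_visits) pairs. c > 0 (escape spacecraft):
    take the max-UCB child; c < 0 (tracking spacecraft): take the min-UCB
    child. Returns the index of the selected child."""
    N_parent = sum(n for _, n in children) or 1
    scores = [ucb(q, N_parent, n, c) for q, n in children]
    best = max(scores) if c > 0 else min(scores)
    return scores.index(best)
```

Note that with a negative c the exploration term is subtracted, so for the minimizing (tracking) player rarely visited nodes still become more attractive, preserving the exploration/exploitation trade-off on both sides.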
For the above implementation, in some examples, predicting the actions of the decision spacecraft and the opponent spacecraft for a set number of rounds, and expanding each node to be expanded based on the predicted actions to form the subtree to be explored corresponding to each node to be expanded, includes:
s701: setting the maximum value of the expansion layer number of the subtree to be explored, which is correspondingly expanded by the node to be expanded, as M, setting the initial value of M as 0, representing the layer number of the subtree to be explored, and recording the state of the node to be expanded as s '' m At this time, the decision time of the opponent spacecraft;
s702: randomly selecting an action a on a discrete action space p E A(s), for s 'according to the selected action' m Unfolding and transferring the state to s' m+1 =f(s′ m ,a p ) Will s' m+1 As s' m Adding the child nodes of the tree to be explored;
s703: if m+1 is less than M-1, s' m+1 The corresponding node is not the leaf node of the subtree to be explored, and the node s 'is continued to be explored' m+1 Expanding in the action space, wherein at the moment, one action a is randomly selected in the discrete action space for the decision moment of the decision spacecraft e E A(s), for said s 'according to the action' m+1 Unfolding is carried out from s' m+1 Migration to s' m+2 =f(s′ m+1 ,a e ) Will s' m+2 As s' m+1 Adding the child nodes of the tree to be explored;
s704: if m+2 is less than M-1, i.e. s' m+2 If the corresponding node is not the leaf node of the subtree to be explored, assigning m+2 to m, repeating the steps 702 to 704, and exploring the game tree of the next round;
s705: if m+1=m-1 or m+2=m-1, respectively, s' M-1 And as the leaf nodes of the subtrees to be explored, the nodes to be unfolded complete the construction of the subtrees to be explored corresponding to the nodes to be unfolded.
It should be noted that the above example describes the process of expanding the subtree to be explored corresponding to a node to be expanded, which simulates a virtual game of a set number of rounds between the decision spacecraft and the opponent spacecraft. As shown in fig. 7, after the decision spacecraft selects one node s' to be expanded among the first-layer child nodes, it selects an action in the action space through a random strategy as the action decision of the opponent spacecraft; expanded by this opponent action decision, the node to be expanded reaches the state node s'' in fig. 7. Actions are predicted downward in this way for a specific number of rounds until the leaf nodes represented by dotted lines in fig. 8 are reached, and the subtree to be explored corresponding to the node s' completes one expansion.
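The random alternating rollout of steps S701-S705 can be sketched generically as follows (the function name, the generic `transition` callback and the seeded default generator are illustrative assumptions for reproducibility, not from the patent):

```python
import random

def expand_subtree(s0, actions, transition, depth_M, rng=None):
    """Expand one to-be-explored subtree from node s'_0 = s0 by a random
    rollout of depth_M - 1 alternating moves (opponent moves first, as in
    step S702); returns the state path ending at the leaf s'_{M-1}."""
    rng = rng or random.Random(0)
    path = [s0]
    for _ in range(depth_M - 1):
        a = rng.choice(actions)      # random strategy for both players
        path.append(transition(path[-1], a))
    return path
```

Only the final element of the returned path (the leaf) needs a state evaluation; all intermediate nodes merely propagate the simulated game forward.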
In view of the above implementation, it should also be noted that the game tree search method is an effective tool for analyzing and solving extensive-form games, but its complexity is directly related to the state space. In the game process the initial state is used as the root node of a game tree, the game tree is then expanded according to different candidate actions, finally the optimal action is selected and the actual state is updated, after which the game tree of the next round is expanded. As shown in fig. 5(a), the whole search space can be fully expanded under a finite action space, but the computing-resource cost is extremely high, the problem of combinatorial explosion arises, and full expansion to the optimal solution is feasible only when the action space is relatively small. In practical situations a small-scale sub-game problem can be solved instead, as shown in fig. 5(b). In some possible implementations, the above embodiment reduces the size of the sub-game problem in three ways, lowering the demand on computing resources and meeting the requirement of obtaining a good solution within a specified decision time. First, while the decision time allows, the expansion of the subtrees to be explored is biased toward directions favorable to the utility estimate, without requiring complete expansion. Second, for a node to be expanded in S', the decision spacecraft only predicts forward to the depth of a specific number of rounds to form the subtree to be explored, without reaching a game-ending state. Third, for the subtrees to be explored, state evaluation is needed only for the leaf nodes, not for all nodes.
For the solution shown in fig. 2, in some possible implementations, the state evaluation information of each of the leaf nodes expanded in the subtree to be explored includes:
the relative distance d(t) and the relative angle θ(t) of each of the leaf nodes are obtained according to the following formulas:

d(t) = ‖x_pe‖
θ(t) = arccos( (x_pe · x_ps) / (‖x_pe‖ ‖x_ps‖) )

wherein x_pe = x_e − x_p and x_ps = x_sun − x_p (taken on the position components);
based on the relative distance d(t) and the relative angle θ(t) corresponding to each leaf node, the state evaluation information corresponding to each leaf node is obtained through a terminal reward surface, formula (3), which attains its maximum at the desired values; wherein θ denotes the relative angle, d denotes the relative distance, θ_d denotes the desired optimal angle, and d_d denotes the desired optimal distance.
For the above example, it should be noted that the state of a node is represented by its utility under the current situation information, and node evaluation is directly related to the selection of game actions. The most important aspect of situation evaluation is the design of the reward at the game-ending time, since the reward is a direct reflection of intent; intent modeling and state rewards directly affect the search direction and action selection of the game tree. In some examples, the embodiment of the invention provides terminal reward surfaces that jointly consider the relative distance and the relative angle, as shown in formula (3): θ denotes the relative angle, i.e. the angle, with the tracking spacecraft as the reference center, between the line from the tracking spacecraft to the escape spacecraft and the line from the tracking spacecraft to the sun; d denotes the relative distance; θ_d denotes the desired optimal angle; d_d denotes the desired optimal distance. It can be understood that the optimal reward is achieved at the optimal angle and optimal distance.
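The exact reward surface of formula (3) is not reproduced in this text; the Gaussian-product form below is an assumed stand-in that merely has the property described (a maximum at θ = θ_d, d = d_d), with illustrative width parameters, alongside a direct computation of d and θ from the position vectors:

```python
import math

def rel_distance_angle(r_e, r_p, r_sun):
    """d(t) = ||x_pe||; theta(t) = angle between x_pe = r_e - r_p and
    x_ps = r_sun - r_p (position components, tracker-centred)."""
    x_pe = [a - b for a, b in zip(r_e, r_p)]
    x_ps = [a - b for a, b in zip(r_sun, r_p)]
    d = math.sqrt(sum(a * a for a in x_pe))
    ns = math.sqrt(sum(a * a for a in x_ps))
    cos_t = sum(a * b for a, b in zip(x_pe, x_ps)) / (d * ns)
    return d, math.acos(max(-1.0, min(1.0, cos_t)))

def terminal_reward(theta, d, theta_d, d_d, sig_theta=0.3, sig_d=10.0):
    """An assumed Gaussian-product reward surface peaked at (theta_d, d_d);
    sig_theta and sig_d are illustrative tuning widths, not patent values."""
    return (math.exp(-((theta - theta_d) / sig_theta) ** 2)
            * math.exp(-((d - d_d) / sig_d) ** 2))
```

Any surface with a unique maximum at the desired angle and distance would play the same role: the leaf evaluation r that is backpropagated through the subtree.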
For the solution shown in fig. 2, in some possible implementations, the updating utility estimation information of all nodes on the path from the root node to the leaf node through backtracking propagation includes:
updating the visit counts n(s) of all nodes on the path from the root node to the leaf node, as shown in the following formula:

n(s) ← n(s) + 1

and updating the utility estimates Q(s) of all node states on the path between the root node and the leaf node, as shown in the following formula:

Q(s) ← Q(s) + (r − Q(s)) / n(s)

wherein r is the state evaluation information of the leaf node after the subtree to be explored is expanded, and Q(s) is the utility estimate of each node state on the path.
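One way this backtracking update could look, assuming the incremental running-mean form that is standard in Monte Carlo tree search and representing each node as a small dict (an illustrative choice, not the patent's data structure):

```python
def backpropagate(path, r):
    """Backtracking propagation along the root-to-leaf path: each node dict
    gets n(s) <- n(s) + 1 followed by the running-mean utility update
    Q(s) <- Q(s) + (r - Q(s)) / n(s)."""
    for node in path:
        node["n"] += 1
        node["Q"] += (r - node["Q"]) / node["n"]
```

After k updates with rewards r_1..r_k, Q is exactly their mean, so no reward history needs to be stored per node.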
For the technical solution shown in fig. 2, in some possible implementations, the making an optimal action decision of the current round according to the utility estimation information updated by the game tree includes:
according to the updated utility estimation information, for the escape spacecraft, the action corresponding to the node with the largest UCB value among the first-layer child nodes S' is selected as the optimal action decision of the current round; for the tracking spacecraft, the action corresponding to the node with the smallest UCB value among the first-layer child nodes S' is selected as the optimal action decision of the current round, that is (with c = 0 in the UCB value, so that UCB(s') = Q(s')):

for the escape spacecraft: a* = argmax_{a∈A(s)} Q(f(s, a))
for the tracking spacecraft: a* = argmin_{a∈A(s)} Q(f(s, a))
it should be noted that, for the above implementation, in some examples, setting c=0 is to prevent selecting an action corresponding to an unexplored node as an optimal action decision.
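The final move choice with c = 0 can be sketched as follows (the function name and the action -> (Q, n) mapping are illustrative assumptions; skipping unvisited actions implements the guard that c = 0 provides against picking an unexplored node whose Q is still at its initial 0):

```python
def best_action(children, is_escape):
    """Final decision with c = 0 (pure exploitation): the escape spacecraft
    takes the max-Q action, the tracking spacecraft the min-Q action.
    children maps action -> (Q, n); unvisited actions (n == 0) are skipped."""
    visited = {a: q for a, (q, n) in children.items() if n > 0}
    pick = max if is_escape else min
    return pick(visited, key=visited.get)
```

This is the standard distinction between in-tree selection (c ≠ 0, exploratory) and the committed root decision (c = 0, exploitative).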
For the technical solution shown in fig. 2 and its implementations and examples, it should be noted that the motion state of the decision spacecraft is controlled according to the optimal action decision, so that the opponent spacecraft makes its action decision based on the controlled motion state of the decision spacecraft. In making its action decision, the opponent spacecraft may adopt the same game tree method as the foregoing technical solution, constructing an opponent game tree as shown in fig. 9. It can be understood that, at its decision moment, the tracking spacecraft takes its own motion state, the observed motion state of the escape spacecraft, and the sun position information at that moment as the initial state; within the allowed decision time Δt_p, the initial state is taken as the root node, states corresponding to one or more candidate actions over the action space are selected as first-layer child nodes to be expanded, the actions of a set number of rounds are predicted forward to complete the construction of the tracking spacecraft's game tree, the state information of all leaf nodes of that game tree is evaluated, the utility estimation information of all nodes on the paths from the leaf nodes to the root node is updated by back propagation, and the tracking spacecraft makes the optimal action decision according to the updated utility estimation information and executes the optimal action. Likewise, the escape spacecraft completes its optimal action decision at its own decision moment within the allowed decision time Δt_e, and after the escape spacecraft controls its motion state by executing the optimal action decision, the game enters the next round. It should also be noted that the UCB algorithm is used to select actions and further enlarge the explored space during the tree search; as the number of samples tends to infinity, the probability of selecting the optimal action tends to 1. Compared with a conventional game tree search algorithm, the method can solve a larger state-space problem, and, having the character of the UCB1 algorithm, it can balance the trade-off between exploration and exploitation.
Based on the same inventive concept as the foregoing technical solution, referring to fig. 10, there is shown a spacecraft sequence gaming device 100 based on monte carlo tree search provided by an embodiment of the present invention, where the device 100 includes: a first construction section 1001, a second construction section 1002, an updating section 1003, a decision section 1004, and a control section 1005; wherein,
The first construction part 1001 is configured to construct, in the current round, initial state information s_0 of the current round;
the second construction portion 1002 is configured to take the initial state information of the current round as the root node of the game tree, and to select, from candidate states formed by expansion in the discrete action space, one or more subtrees to be explored for constructing the game tree;
the updating part 1003 is configured to update, by back-propagation, the utility estimation information of all nodes on the path from the root node to each leaf node, according to the state evaluation information of each of the leaf nodes expanded in the subtree to be explored;
the decision section 1004 is configured to make an optimal action decision of the current round according to the utility estimation information updated by the game tree;
the control part 1005 is configured to control the motion state of the decision spacecraft according to the optimal motion decision, so that the opponent spacecraft makes a motion decision based on the motion state after the control of the decision spacecraft.
In some examples, the initial state information includes:
the initial state information is described according to the motion information of the decision spacecraft, the observed motion information of the opponent spacecraft formed by the action decision of the previous round, and the sun relative position, as in the following formula:
wherein x_sun denotes the sun relative position; x_i, i = e, p, denotes the motion information of the decision spacecraft e and the opponent spacecraft p, comprising position r_i and velocity v_i; and the subscript t denotes a discrete time.
In some examples, the second build portion 1002 is configured to:
taking initial state information of the current round as a root node of a game tree, and taking all candidate states generated on the basis of the initial state information in a discrete action space as a first layer child node S' of the game tree;
selecting one or more nodes from the first layer of child nodes S' as nodes to be unfolded;
predicting actions of the spacecraft and the opponent spacecraft in a follow-up set number of rounds, and expanding each node to be expanded based on the predicted actions to form a subtree to be explored corresponding to each node to be expanded.
In some examples, the second build portion 1002 is configured to:
uniformly dividing the continuous action space to obtain a discrete action space;
forming candidate actions corresponding to each sampling space in the discrete action space according to the direction corresponding to each sampling space in the discrete action space;
According to the initial state information and each candidate action, performing state transition through the following formula to obtain a candidate state corresponding to each candidate action;
wherein Φ(n) denotes the state transition matrix of the relative-motion C-W equation; n denotes a discrete time; x_{i,n} denotes the motion state of the tracking spacecraft or the escape spacecraft at time n; r_n denotes the position vector of the tracking or escape spacecraft based on the LVLH coordinate system; v_n denotes the velocity vector of the tracking or escape spacecraft based on the LVLH coordinate system; a_n denotes the action vector based on the LVLH coordinate system;
all candidate states are taken as first layer child nodes S' of the game tree.
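As a concrete illustration, the discretization and state transition above can be sketched as follows. The axis-aligned thrust directions, the step size, and the double-integrator stand-in for the C-W state transition matrix Φ(n) are all illustrative assumptions, not the patent's actual dynamics:

```python
def discretize_actions(a_max=0.1):
    """Uniformly sample the continuous thrust space: one candidate action
    per coordinate direction (+/-x, +/-y, +/-z), plus coasting."""
    actions = []
    for axis in range(3):
        for sign in (+1.0, -1.0):
            a = [0.0, 0.0, 0.0]
            a[axis] = sign * a_max
            actions.append(a)
    actions.append([0.0, 0.0, 0.0])  # null action (no thrust)
    return actions

def transition(state, action, dt=10.0):
    """One-step transition of state = [r, v]; a simple double integrator
    is used here as a placeholder for the C-W transition matrix Phi(n)
    (hypothetical simplification)."""
    r, v = state[:3], state[3:]
    r_next = [r[i] + v[i] * dt + 0.5 * action[i] * dt ** 2 for i in range(3)]
    v_next = [v[i] + action[i] * dt for i in range(3)]
    return r_next + v_next
```

Under this sketch, the first-layer child nodes S′ would be `transition(s0, a)` for every `a` in `discretize_actions()`.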
In some examples, the second build portion 1002 is configured to:
calculating a confidence upper bound UCB value corresponding to each candidate state S 'in the first layer child node S' by the following formula:
wherein the former part Q(s′) denotes the utility estimate of the node state and reflects the exploitation of information; its initial value is 0, and it is updated by backward backtracking according to the leaf-node state evaluation information; the latter part denotes the information gained by exploring new nodes; n(s) = Σ_{a∈A(s)} n(s, a) denotes the number of times the state s has been visited; c is a constant obtained through configuration; in general, c is a positive number when the decision spacecraft is an escape spacecraft and a negative number when the decision spacecraft is a tracking spacecraft;
If the decision spacecraft is an escape spacecraft, taking a node corresponding to the UCB maximum value in the first layer of child nodes S' as a node to be unfolded; and if the decision spacecraft is a tracking spacecraft, taking a node corresponding to the UCB minimum value in the first layer of child nodes S' as a node to be unfolded.
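A minimal sketch of this selection rule, assuming the standard UCB1 form Q(s′) + c·sqrt(ln n(s) / n(s, a)) for the elided formula:

```python
import math

def ucb(q, n_s, n_sa, c):
    """q: utility estimate Q(s'); n_s: parent visit count n(s);
    n_sa: child visit count n(s, a); c: configured constant.
    Unvisited children get the extreme value so each is tried once."""
    if n_sa == 0:
        return math.inf if c > 0 else -math.inf
    return q + c * math.sqrt(math.log(n_s) / n_sa)

def choose_node(children, c):
    """children: list of (Q, n_sa) pairs for the first-layer nodes S'.
    c > 0 (escape spacecraft): pick the max-UCB node; c < 0 (tracking
    spacecraft): pick the min-UCB node, per the sign convention above."""
    n_s = max(1, sum(n for _, n in children))
    values = [ucb(q, n_s, n, c) for q, n in children]
    pick = max if c > 0 else min
    return values.index(pick(values))
```

The zero-visit branch guarantees every first-layer child is expanded at least once before the exploitation term dominates.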
In some examples, the second build portion 1002 is configured to:
step 1: setting the maximum value of the expansion layer number of the subtree to be explored, which is correspondingly expanded by the node to be expanded, as M, setting the initial value of M as 0, representing the layer number of the subtree to be explored, and recording the state of the node to be expanded as s '' m At this time, the decision time of the opponent spacecraft;
step 2: randomly selecting an action a on a discrete action space p E A(s), for s 'according to the selected action' m Unfolding and transferring the state to s' m+1 =f(s′ m ,a p ) Will s' m+1 As s' m Adding the child nodes of the tree to be explored;
step 3: if m+1 is less than M-1, s' m+1 The corresponding node is not the leaf node of the subtree to be explored, and the node s 'is continued to be explored' m+1 Expanding in the action space, wherein at the moment, one action a is randomly selected in the discrete action space for the decision moment of the decision spacecraft e E A(s), for said s 'according to the action' m+1 Unfolding is carried out from s' m+1 Migration to s' m+2 =f(s′ m+1 ,a e ) Will s' m+2 As s' m+1 Adding the child nodes of the tree to be explored;
step 4: if m+2 is less than M-1, i.e. s' m+2 If the corresponding node is not the leaf node of the subtree to be explored, assigning m+2 to m, repeating the steps 2 to 4, and furtherExploring a game tree of the next round;
step 5: if m+1=m-1 or m+2=m-1, respectively, s' M-1 And as the leaf nodes of the subtrees to be explored, the nodes to be unfolded complete the construction of the subtrees to be explored corresponding to the nodes to be unfolded.
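Steps 1 to 5 can be sketched as a random alternating rollout. The scalar toy state and the transition function `f` below are illustrative placeholders for the spacecraft dynamics:

```python
import random

def expand_to_leaf(s0, actions, f, M, rng=None):
    """From the node to be expanded s'_0, repeatedly apply a randomly
    chosen action (alternating opponent a_p and decision spacecraft a_e)
    via s'_{m+1} = f(s'_m, a) until depth M-1 is reached; returns the
    path of states, whose last element is the leaf node s'_{M-1}."""
    rng = rng or random.Random(0)
    path = [s0]
    for _ in range(M - 1):
        a = rng.choice(actions)      # random strategy over A(s)
        path.append(f(path[-1], a))  # state transfer
    return path

# toy illustration: scalar state, actions shift it up or down by one
path = expand_to_leaf(0, [-1, +1], lambda s, a: s + a, M=4)
```

Each call produces one rollout path; repeating the call under the decision-time budget yields the subtrees to be explored.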
In some examples, the update portion 1003 is configured to:
the relative distance d (t) and the relative angle θ (t) for each of the leaf nodes are obtained according to the following formulas:
d(t) = ‖x_pe‖
wherein x_pe = x_e − x_p; x_ps = x_sun − x_p.
Based on the relative distance d (t) and the relative included angle theta (t) corresponding to each leaf node, the state evaluation information corresponding to each leaf node is obtained through the following formula:
wherein θ denotes the relative included angle, d denotes the relative distance, θ_d denotes the desired optimal angle, and d_d denotes the desired optimal distance.
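A sketch of this leaf evaluation in pure Python; the quadratic penalty used as the scoring expression is an assumption standing in for the elided formula:

```python
import math

def evaluate_leaf(x_e, x_p, x_sun, theta_d, d_d):
    """d(t) = ||x_pe|| with x_pe = x_e - x_p; theta(t) is the angle
    between x_pe and x_ps = x_sun - x_p. The returned score penalizes
    deviation from the desired angle theta_d and distance d_d
    (assumed quadratic form, not the patent's exact expression)."""
    x_pe = [e - p for e, p in zip(x_e, x_p)]
    x_ps = [s - p for s, p in zip(x_sun, x_p)]
    d = math.sqrt(sum(c * c for c in x_pe))
    dot = sum(a * b for a, b in zip(x_pe, x_ps))
    norm_ps = math.sqrt(sum(c * c for c in x_ps))
    cos_t = max(-1.0, min(1.0, dot / (d * norm_ps)))  # clamp for acos
    theta = math.acos(cos_t)
    return -(theta - theta_d) ** 2 - (d - d_d) ** 2
```

The score is maximal (zero) exactly when the leaf sits at the desired angle and distance, which matches the sign conventions of the escape/tracking roles above.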
In some examples, the update portion 1003 is configured to:
Updating the access times n(s) of all nodes on the path from the root node to the leaf node, wherein the access times n(s) are shown as follows:
n(s)←n(s)+1
updating the utility estimates of all node states on the path between the root node and the leaf node, as shown in the following formula:
wherein r is the state evaluation information of the leaf node obtained after the subtree to be explored is expanded, and the updated quantity is the utility estimate of each node state on the path.
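The back-propagation step can be sketched as an incremental-mean update; that the elided update formula is exactly this running mean is an assumption (it is the standard MCTS form consistent with n(s) ← n(s) + 1):

```python
def backpropagate(path, r):
    """Walk the path from leaf to root: increment the visit count
    n(s) <- n(s) + 1 and fold the leaf evaluation r into the utility
    estimate Q(s) as a running mean over all visits."""
    for node in path:
        node["n"] += 1
        node["Q"] += (r - node["Q"]) / node["n"]

# usage: after evaluating a leaf with score r, update its whole path
nodes = [{"n": 0, "Q": 0.0} for _ in range(3)]
backpropagate(nodes, 1.0)
backpropagate(nodes, 0.0)
```

After the two updates above, every node on the path holds the mean of the two leaf evaluations, which is what the UCB selection in the next iteration consumes as Q(s′).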
In some examples, the decision portion 1004 is configured to:
according to the updated utility estimation information: for the escape spacecraft, selecting the action corresponding to the node with the largest UCB value among the first-layer child nodes S′ as the optimal action decision of the current round; for the tracking spacecraft, selecting the action corresponding to the node with the smallest UCB value among the first-layer child nodes S′ as the optimal action decision of the current round, as follows:
for an escape spacecraft to the outside,
for the purpose of tracking a spacecraft,
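Given the updated first-layer statistics, the final choice of the round reduces to an arg-max or arg-min; representing the per-node UCB (or utility) scores as a plain list is an assumed simplification:

```python
def best_action(values, role):
    """values: score of each first-layer child node S'.
    The escape spacecraft maximizes and the tracking spacecraft
    minimizes; returns the index of the selected candidate action."""
    pick = max if role == "escape" else min
    return values.index(pick(values))
```

The index returned here maps back to the candidate action whose expansion produced that first-layer node.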
it will be appreciated that, in this embodiment, a "part" may be part of a circuit, part of a processor, or part of a program or software; it may equally be a unit, and a module may likewise be non-modular.
In addition, each component in the present embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional modules.
The integrated units, if implemented in the form of software functional modules and not sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present embodiment may be embodied essentially, or in part, in the form of a software product, which is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the method described in the present embodiment. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
Therefore, the embodiment provides a computer storage medium, where a spacecraft sequence game program based on a monte carlo tree search is stored, and when the spacecraft sequence game program based on the monte carlo tree search is executed by at least one processor, the steps of the spacecraft sequence game method based on the monte carlo tree search in the above technical scheme are implemented.
Referring to fig. 11, a specific hardware structure of a computing device 110 of the spacecraft sequence gaming device 100 capable of implementing the above-mentioned monte carlo tree search is provided according to an embodiment of the present invention, and the computing device 110 may be a wireless device, a mobile or cellular phone (including a so-called smart phone), a Personal Digital Assistant (PDA), a video game console (including a video display, a mobile video game device, a mobile video conference unit), a laptop computer, a desktop computer, a television set-top box, a tablet computing device, an electronic book reader, a fixed or mobile media player, and the like. The computing device 110 includes: a communication interface 1101, a memory 1102 and a processor 1103; the various components are coupled together by a bus system 1104. It is to be appreciated that the bus system 1104 is employed to facilitate connected communications between the components. The bus system 1104 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration, the various buses are labeled as bus system 1104 in fig. 11. Wherein,
The communication interface 1101 is configured to receive and send signals during the process of receiving and sending information with other external network elements;
the memory 1102 is used for storing a computer program capable of running on the processor 1103;
the processor 1103 is configured to execute the steps of the spacecraft sequence game method based on the monte carlo tree search in the foregoing technical solution when the computer program is executed, which is not described herein.
It will be appreciated that the memory 1102 in embodiments of the invention can be volatile memory, nonvolatile memory, or both. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be Random Access Memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DR RAM). The memory 1102 of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The processor 1103 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the method described above may be completed by integrated logic circuits in hardware or by instructions in the form of software in the processor 1103. The processor 1103 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, or a register. The storage medium is located in the memory 1102, and the processor 1103 reads the information in the memory 1102 and completes the steps of the above method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more application specific integrated circuits (Application Specific Integrated Circuits, ASIC), digital signal processors (Digital Signal Processing, DSP), digital signal processing devices (DSP devices, DSPD), programmable logic devices (Programmable Logic Device, PLD), field programmable gate arrays (Field-Programmable Gate Array, FPGA), general purpose processors, controllers, microcontrollers, microprocessors, other electronic units designed to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
It can be appreciated that the above exemplary technical solutions of the spacecraft sequence gaming device 100 and the computing device 110 based on the monte carlo tree search are the same as the technical solutions of the spacecraft sequence gaming method based on the monte carlo tree search described above, and therefore, for details that are not described in detail in the technical solutions of the spacecraft sequence gaming device 100 and the computing device 110 based on the monte carlo tree search, reference may be made to the description of the technical solutions of the spacecraft sequence gaming method based on the monte carlo tree search described above. The embodiments of the present application will not be described in detail.
It should be noted that: the technical schemes described in the embodiments of the present invention may be arbitrarily combined without any collision.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A spacecraft sequence game method based on Monte Carlo tree search is characterized by comprising the following steps:
in the current round, constructing initial state information s_0 of the current round;
Taking the initial state information of the current round as a root node of a game tree, and selecting one or more to-be-explored subtrees for constructing the game tree from candidate states formed by expanding in a discrete action space;
updating utility estimation information of all nodes on a path from the root node to the leaf nodes through backtracking propagation according to state evaluation information of each of all the leaf nodes expanded in the subtree to be explored;
according to the utility estimation information updated by the game tree, making an optimal action decision of the current round;
Controlling the motion state of the decision-making spacecraft according to the optimal motion decision so as to enable the opponent spacecraft to perform motion decision based on the motion state of the decision-making spacecraft after control;
wherein,
the initial state information comprises motion information of the decision spacecraft, motion information formed by an observation opponent spacecraft based on a previous round of execution motion decision, and sun relative position;
the method for constructing the to-be-explored subtrees of the game tree by taking the initial state information of the current round as the root node of the game tree comprises the steps of:
taking initial state information of the current round as a root node of a game tree, and taking all candidate states generated on the basis of the initial state information in a discrete action space as a first layer child node S' of the game tree;
selecting one or more nodes from the first layer of child nodes S' as nodes to be unfolded;
predicting actions of the spacecraft and the opponent spacecraft in a follow-up set number of rounds, and expanding each node to be expanded based on the predicted actions to form a subtree to be explored corresponding to each node to be expanded;
The leaf nodes are obtained as follows: when the decision spacecraft selects one node to be expanded among the first-layer child nodes, a virtual decision moment of the opponent spacecraft is reached; the decision spacecraft selects an action in the action space through a random strategy as the action decision of the opponent spacecraft; the node to be expanded reaches a state node s′ after being expanded by the action decision of the opponent spacecraft; and the actions of a specific number of rounds are predicted downwards to reach the leaf nodes of the subtree to be explored.
2. The method of claim 1, wherein the initial state information comprises:
the initial state information is described according to the motion information of the decision spacecraft, the observed motion information of the opponent spacecraft formed by the action decision of the previous round, and the sun relative position, as in the following formula:
wherein x_sun denotes the sun relative position; x_i, i = e, p, denotes the motion information of the decision spacecraft e and the opponent spacecraft p, comprising position r_i and velocity v_i; and the subscript t denotes a discrete time.
3. The method according to claim 2, wherein the taking the initial state information of the current round as the root node of the game tree takes all candidate states generated on the basis of the initial state information in the discrete action space as the first layer child nodes S' of the game tree, including:
Uniformly dividing a continuous action space in a space coordinate system formed by taking the mass center of the decision spacecraft as an origin to obtain a discrete action space;
forming candidate actions corresponding to each sampling space in the discrete action space according to the direction corresponding to each sampling space in the discrete action space;
according to the initial state information and each candidate action, performing state transition through the following formula to obtain a candidate state corresponding to each candidate action;
wherein Φ(n) denotes the state transition matrix of the relative-motion C-W equation; n denotes a discrete time; x_{i,n} denotes the motion state of the tracking spacecraft or the escape spacecraft at time n; r_n denotes the position vector of the tracking or escape spacecraft based on the LVLH coordinate system; v_n denotes the velocity vector of the tracking or escape spacecraft based on the LVLH coordinate system; a_n denotes the action vector based on the LVLH coordinate system;
all candidate states are taken as first layer child nodes S' of the game tree.
4. A method according to claim 3, wherein said selecting one or more nodes from said first layer of sub-nodes S' as nodes to be expanded comprises:
Calculating a confidence upper bound UCB value corresponding to each candidate state S 'in the first layer child node S' by the following formula:
wherein the former part Q(s′) denotes the utility estimate of the node state and reflects the exploitation of information; its initial value is 0, and it is updated by backward backtracking according to the leaf-node state evaluation information; the latter part denotes the information gained by exploring new nodes; n(s) = Σ_{a∈A(s)} n(s, a) denotes the number of times the state s has been visited; c is a constant obtained through configuration; in general, c is a positive number when the decision spacecraft is an escape spacecraft and a negative number when the decision spacecraft is a tracking spacecraft;
if the decision spacecraft is an escape spacecraft, taking a node corresponding to the UCB maximum value in the first layer of child nodes S' as a node to be unfolded; and if the decision spacecraft is a tracking spacecraft, taking a node corresponding to the UCB minimum value in the first layer of child nodes S' as a node to be unfolded.
5. The method of claim 2, wherein predicting the actions of the own and opponent spacecraft for a subsequent set number of rounds and expanding each of the nodes to be expanded based on the predicted actions to form a subtree to be explored corresponding to each of the nodes to be expanded comprises:
Step 1: setting the maximum value of the expansion layer number of the subtree to be explored, which is correspondingly expanded by the node to be expanded, as M, setting the initial value of M as 0, representing the layer number of the subtree to be explored, and recording the state of the node to be expanded as s '' m At this time, it is opponent spacecraft blockA strategy time;
step 2: randomly selecting an action a on a discrete action space p E A(s), for s 'according to the selected action' m Unfolding and transferring the state to s' m+1 =f(s′ m ,a p ) Will s' m+1 As s' m Adding the child nodes of the tree to be explored;
step 3: if m+1 is less than M-1, s' m+1 The corresponding node is not the leaf node of the subtree to be explored, and the node s 'is continued to be explored' m+1 Expanding in the action space, wherein at the moment, one action a is randomly selected in the discrete action space for the decision moment of the decision spacecraft e E A(s), for said s 'according to the action' m+1 Unfolding is carried out from s' m+1 Migration to s' m+2 =f(s′ m+1 ,a e ) Will s' m+2 As s' m+1 Adding the child nodes of the tree to be explored;
step 4: if m+2 is less than M-1, i.e. s' m+2 If the corresponding node is not the leaf node of the subtree to be explored, assigning m+2 to m, repeating the steps 2 to 4, and exploring the game tree of the next round;
Step 5: if m+1=m-1 or m+2=m-1, respectively, s' M-1 And as the leaf nodes of the subtrees to be explored, the nodes to be unfolded complete the construction of the subtrees to be explored corresponding to the nodes to be unfolded.
6. The method of claim 5, wherein the state evaluation information for each of all leaf nodes deployed in the subtree to be explored comprises:
the relative distance d (t) and the relative angle θ (t) for each of the leaf nodes are obtained according to the following formulas:
d(t) = ‖x_pe‖
wherein x_pe = x_e − x_p; x_ps = x_sun − x_p.
Based on the relative distance d (t) and the relative included angle theta (t) corresponding to each leaf node, the state evaluation information corresponding to each leaf node is obtained through the following formula:
wherein θ denotes the relative included angle, d denotes the relative distance, θ_d denotes the desired optimal angle, and d_d denotes the desired optimal distance.
7. The method of claim 6, wherein updating utility estimation information for all nodes on a path between the root node to the leaf node by backtracking comprises:
updating the access times n(s) of all nodes on the path from the root node to the leaf node, wherein the access times n(s) are shown as follows:
n(s)←n(s)+1
updating the utility estimates of all node states on the path between the root node and the leaf node, as shown in the following formula:
wherein r is the state evaluation information of the leaf node obtained after the subtree to be explored is expanded, and the updated quantity is the utility estimate of each node state on the path.
8. The method of claim 7, wherein the making of the optimal action decision for the current round based on the updated utility estimation information for the game tree comprises:
according to the updated utility estimation information: for the escape spacecraft, selecting the action corresponding to the node with the largest UCB value among the first-layer child nodes S′ as the optimal action decision of the current round; for the tracking spacecraft, selecting the action corresponding to the node with the smallest UCB value among the first-layer child nodes S′ as the optimal action decision of the current round, as follows:
for an escape spacecraft to the outside,
for the purpose of tracking a spacecraft,
9. a spacecraft sequence gaming device based on monte carlo tree search, the device comprising: a first construction section, a second construction section, an updating section, a decision section and a control section; wherein,
the first construction part is configured to construct initial state information s of the current round in the current round 0
the second construction part is configured to take the initial state information of the current round as the root node of the game tree, and to select, from the candidate states formed in the discrete action space, one or more subtrees to be explored for constructing the game tree;
the updating part is configured to update, by backtracking propagation, the utility estimation information of all nodes on the path from the root node to each leaf node according to the state evaluation information of each of the leaf nodes expanded in the subtree to be explored;
the decision part is configured to make an optimal action decision of the current round according to the updated utility estimation information of the game tree;
the control part is configured to control the motion state of the decision-making spacecraft according to the optimal motion decision so as to enable the opponent spacecraft to perform motion decision based on the motion state after the control of the decision-making spacecraft;
wherein,
the initial state information comprises motion information of the decision spacecraft, motion information formed by an observation opponent spacecraft based on a previous round of execution motion decision, and sun relative position;
the method for constructing the to-be-explored subtrees of the game tree by taking the initial state information of the current round as the root node of the game tree comprises the steps of:
taking initial state information of the current round as a root node of a game tree, and taking all candidate states generated on the basis of the initial state information in a discrete action space as a first layer child node S' of the game tree;
Selecting one or more nodes from the first layer of child nodes S' as nodes to be unfolded;
predicting actions of the spacecraft and the opponent spacecraft in a follow-up set number of rounds, and expanding each node to be expanded based on the predicted actions to form a subtree to be explored corresponding to each node to be expanded;
the leaf nodes are obtained as follows: when the decision spacecraft selects one node to be expanded among the first-layer child nodes, a virtual decision moment of the opponent spacecraft is reached; the decision spacecraft selects an action in the action space through a random strategy as the action decision of the opponent spacecraft; the node to be expanded reaches a state node s′ after being expanded by the action decision of the opponent spacecraft; and the actions of a specific number of rounds are predicted downwards to reach the leaf nodes of the subtree to be explored.
10. A computer storage medium, characterized in that it stores a spacecraft sequence gaming program based on a monte carlo tree search, which when executed by at least one processor implements the spacecraft sequence gaming method steps based on a monte carlo tree search of any one of claims 1 to 8.
CN202211364933.XA 2022-11-02 2022-11-02 Spacecraft sequence game method, device and medium based on Monte Carlo tree search Active CN116039956B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211364933.XA CN116039956B (en) 2022-11-02 2022-11-02 Spacecraft sequence game method, device and medium based on Monte Carlo tree search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211364933.XA CN116039956B (en) 2022-11-02 2022-11-02 Spacecraft sequence game method, device and medium based on Monte Carlo tree search

Publications (2)

Publication Number Publication Date
CN116039956A CN116039956A (en) 2023-05-02
CN116039956B true CN116039956B (en) 2023-11-14

Family

ID=86122484

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211364933.XA Active CN116039956B (en) 2022-11-02 2022-11-02 Spacecraft sequence game method, device and medium based on Monte Carlo tree search

Country Status (1)

Country Link
CN (1) CN116039956B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110489668A (en) * 2019-09-11 2019-11-22 东北大学 Synchronous game monte carlo search sets mutation method more under non-complete information
CN110727870A (en) * 2019-10-18 2020-01-24 东北大学 Novel single-tree Monte Carlo search method for sequential synchronous game
KR20200108728A (en) * 2019-03-11 2020-09-21 성균관대학교산학협력단 Methods and apparatuses for selecting feature based on monte carlo tree search
WO2022007199A1 (en) * 2020-07-06 2022-01-13 哈尔滨工业大学 Robot state planning method based on monte carlo tree search algorithm
CN114812569A (en) * 2022-04-19 2022-07-29 中国人民解放军国防科技大学 Method, device and equipment for estimating relative state of pursuit game maneuvering spacecraft

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10168998B2 (en) * 2017-03-20 2019-01-01 Google Llc Automated interface design

Also Published As

Publication number Publication date
CN116039956A (en) 2023-05-02

Similar Documents

Publication Publication Date Title
CN111860813B (en) Device and method for performing forward operation of convolutional neural network
JP5746509B2 (en) Combining speculative physical modeling with purpose-based artificial intelligence
CN113472419B (en) Safe transmission method and system based on space-based reconfigurable intelligent surface
CN109214245B (en) Target tracking method, device, equipment and computer readable storage medium
CN115300910B (en) Confusion-removing game strategy model generation method based on multi-agent reinforcement learning
CN109993308A (en) Learning system and method, shared platform and method, medium are shared based on cloud platform
WO2023246819A1 (en) Model training method and related device
KR102290251B1 (en) Learning mehtod and learning device for controlling aircraft
CN114261400A (en) Automatic driving decision-making method, device, equipment and storage medium
CN114037882A (en) Edge artificial intelligence device, electronic device and method thereof
Wöhlke et al. A performance-based start state curriculum framework for reinforcement learning
Darbandi et al. Involving Kalman filter technique for increasing the reliability and efficiency of cloud computing
CN114281103B (en) Aircraft cluster collaborative search method with zero interaction communication
CN116203979A (en) Monocular unmanned aerial vehicle obstacle avoidance method, device and medium based on depth deterministic strategy gradient
CN113060305B (en) Track transfer method and device based on energy optimization and storage medium
CN113139696B (en) Trajectory prediction model construction method and trajectory prediction method and device
CN116039956B (en) Spacecraft sequence game method, device and medium based on Monte Carlo tree search
CN110490317B (en) Neural network operation device and operation method
CN114237303B (en) Unmanned aerial vehicle path planning method and device based on Monte Carlo tree search
CN114625167A (en) Unmanned aerial vehicle collaborative search method and system based on heuristic Q-learning algorithm
KR20220160391A (en) Generating collision-free path by rnn-based multi-agent deep reinforcement learning
CN109726805B (en) Method for designing neural network processor by using black box simulator
CN116039957B (en) Spacecraft online game planning method, device and medium considering barrier constraint
Yang et al. Distributed unmanned aerial vehicle positioning for multitarget tracking with bearing-only information
CN113148227A (en) Satellite cluster distributed control method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant