CN112001583B - Strategy determination method, central control equipment and storage medium - Google Patents

Info

Publication number
CN112001583B
Authority
CN
China
Legal status
Active
Application number
CN202010650248.8A
Other languages
Chinese (zh)
Other versions
CN112001583A
Inventor
李军
徐文博
刘冰
王晓悦
王伟
江金寿
田建辉
陈科
叶金华
何圣华
Current Assignee
Ordnance Science and Research Academy of China
Original Assignee
Ordnance Science and Research Academy of China
Application filed by Ordnance Science and Research Academy of China filed Critical Ordnance Science and Research Academy of China
Priority to CN202010650248.8A priority Critical patent/CN112001583B/en
Publication of CN112001583A publication Critical patent/CN112001583A/en
Application granted granted Critical
Publication of CN112001583B publication Critical patent/CN112001583B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G06Q10/0637 Strategic management or analysis, e.g. setting a goal or target of an organisation; Planning actions based on goals; Analysis or evaluation of effectiveness of goals
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services
    • G06Q50/265 Personal security, identity or safety


Abstract

The invention discloses a strategy determination method, a central control device and a storage medium, which address the problem that the existing way of determining a sealing (containment) control strategy is not flexible enough. In an embodiment of the invention, the central control device receives state parameters reported by at least one intelligent device and the relative position information of the obstacle closest to that device; determines the movement information corresponding to the intelligent device based on the policy model corresponding to the device, according to the reported state parameters and the relative obstacle position; and sends the determined movement information so that the intelligent device can adjust its position accordingly. Because the central control device determines the movement information by which the intelligent devices adjust their positions, it can command their movements; the sealing control strategy determined by the policy model is more accurate and timely, and it adapts to city blocks with different building distributions.

Description

Strategy determination method, central control equipment and storage medium
Technical Field
The present invention relates to the field of command and control technologies, and in particular, to a policy determination method, a central control device, and a storage medium.
Background
When a major safety event or a public health event occurs in a modern city, implementing effective three-dimensional sealing control at the key points of a city block is a complex combinatorial optimization and graph theory problem, and professional personnel are required to manually perform mathematical modeling and strategy deduction for the city block.
The existing urban block sealing control strategy requires experts to evaluate a city model, calculate the position information of key points, and allocate intelligent devices or personnel to carry out the sealing control. This consumes large amounts of manpower and material resources and, in an emergency, risks failing to meet the time-limit requirements. Moreover, existing sealing control strategies based on graph theory and expert experience are limited and are not suitable for city blocks with different building distributions.
Therefore, the existing way of determining a sealing control strategy is not flexible enough.
Disclosure of Invention
Exemplary embodiments of the present invention provide a policy determination method, a central control device and a storage medium, so as to solve the problem that the existing way of determining a sealing control policy is not flexible enough.
According to a first aspect of the exemplary embodiments, there is provided a policy determination method, the method comprising:
receiving state parameters reported by at least one intelligent device and relative position information of an obstacle closest to the intelligent device; the state parameters of the intelligent equipment comprise current position information and motion speed of the intelligent equipment;
determining mobile information corresponding to the intelligent equipment based on a strategy model corresponding to the intelligent equipment according to the state parameters reported by the intelligent equipment and the relative position information of the obstacle closest to the intelligent equipment;
and sending the determined mobile information corresponding to the intelligent equipment so that the intelligent equipment can adjust the position according to the mobile information.
In some exemplary embodiments, if the movement information is acceleration information, the determining, according to the state parameter reported by the smart device and the relative position information of the obstacle closest to the smart device, the movement information corresponding to the smart device based on a policy model corresponding to the smart device includes:
and inputting the state parameters reported by the intelligent equipment and the relative position information of the obstacle closest to the intelligent equipment into a strategy model corresponding to the intelligent equipment, and acquiring acceleration information output by the strategy model corresponding to the intelligent equipment.
In some exemplary embodiments, if the movement information is target location information, the determining, according to the state parameter reported by the smart device and the relative location information of the obstacle closest to the smart device, the movement information corresponding to the smart device based on a policy model corresponding to the smart device includes:
respectively inputting the state parameters reported by each intelligent device in M intelligent devices and the relative position information of the obstacle closest to the intelligent device into a strategy model corresponding to the intelligent device, acquiring acceleration information output by the strategy model corresponding to the intelligent device, and sending the acceleration information to the intelligent device; wherein M is the number of the preset intelligent devices, and M is more than or equal to 1;
determining an evaluation parameter set corresponding to the M intelligent devices based on a strategy evaluation model according to the state parameters of the M intelligent devices and the acceleration information corresponding to the M intelligent devices;
receiving the state parameter reported again by each intelligent device in the M intelligent devices after adjusting the position according to the acceleration information and the relative position information of the obstacle closest to the intelligent device, determining the acceleration information corresponding to each intelligent device until the evaluation parameters in the evaluation parameter sets corresponding to the M intelligent devices are all converged, determining the position information of the M intelligent devices, and taking the position information of the M intelligent devices as the target position information.
In some exemplary embodiments, the policy evaluation model and the policy models corresponding to the M smart devices are trained according to the following:
performing N rounds of training on the strategy evaluation model and the strategy model corresponding to the at least one intelligent device, acquiring the trained strategy evaluation model after the N rounds of training, and determining the strategy model corresponding to each intelligent device after the N rounds of training according to a particle swarm algorithm; wherein N is a positive integer;
wherein each round of training performs the following process:
taking the state parameter of each intelligent device in M intelligent devices and the relative position information of an obstacle closest to the intelligent device as the input of a strategy model, and acquiring the acceleration information output by the strategy model corresponding to the intelligent device;
taking the acceleration information output by the strategy models corresponding to the M intelligent devices and the state parameters of the M intelligent devices as the input of the strategy evaluation model, and taking the actual evaluation parameter set corresponding to the M intelligent devices as the output of the strategy evaluation model to train the strategy evaluation model;
and adjusting parameters of the strategy evaluation model according to the prediction evaluation parameter set output by the strategy evaluation model and corresponding to the M intelligent devices, and adjusting parameters of the strategy model corresponding to the M intelligent devices.
In some exemplary embodiments, the determining, according to a particle swarm algorithm, a policy model corresponding to each intelligent device after N rounds of training includes:
determining a global optimal strategy model in strategy models corresponding to the M intelligent devices in N rounds of training according to a particle swarm algorithm, performing weighted summation operation on the global optimal strategy model and a historical optimal strategy model of the strategy model corresponding to each intelligent device in the N rounds of training, and determining the strategy model corresponding to each intelligent device.
According to a second aspect of the exemplary embodiments, there is provided a central control apparatus including a processor, a memory, and a transceiver;
wherein the processor is configured to read a program in the memory and execute:
receiving state parameters reported by at least one intelligent device and relative position information of an obstacle closest to the intelligent device; the state parameters of the intelligent equipment comprise current position information and motion speed of the intelligent equipment;
determining mobile information corresponding to the intelligent equipment based on a strategy model corresponding to the intelligent equipment according to the state parameters reported by the intelligent equipment and the relative position information of the obstacle closest to the intelligent equipment;
and sending the determined mobile information corresponding to the intelligent equipment so that the intelligent equipment can adjust the position according to the mobile information.
In some exemplary embodiments, the processor is specifically configured to:
and if the mobile information is acceleration information, inputting the state parameters reported by the intelligent equipment and the relative position information of the obstacle closest to the intelligent equipment into a strategy model corresponding to the intelligent equipment, and acquiring the acceleration information output by the strategy model corresponding to the intelligent equipment.
In some exemplary embodiments, the processor is specifically configured to:
if the mobile information is target position information, respectively inputting the state parameter reported by each intelligent device in M intelligent devices and the relative position information of the obstacle closest to the intelligent device into a strategy model corresponding to the intelligent device, acquiring acceleration information output by the strategy model corresponding to the intelligent device, and sending the acceleration information to the intelligent device; wherein M is the number of the preset intelligent devices, and M is more than or equal to 1;
determining an evaluation parameter set corresponding to the M intelligent devices based on a strategy evaluation model according to the state parameters of the M intelligent devices and the acceleration information corresponding to the M intelligent devices;
receiving the state parameter reported again by each intelligent device in the M intelligent devices after adjusting the position according to the acceleration information and the relative position information of the obstacle closest to the intelligent device, determining the acceleration information corresponding to each intelligent device until the evaluation parameters in the evaluation parameter sets corresponding to the M intelligent devices are all converged, determining the position information of the M intelligent devices, and taking the position information of the M intelligent devices as the target position information.
In some exemplary embodiments, the processor is specifically configured to train the policy evaluation model and the policy models corresponding to the M smart devices according to the following:
performing N rounds of training on the strategy evaluation model and the strategy model corresponding to the at least one intelligent device, acquiring the trained strategy evaluation model after the N rounds of training, and determining the strategy model corresponding to each intelligent device after the N rounds of training according to a particle swarm algorithm; wherein N is a positive integer;
wherein each round of training performs the following process:
taking the state parameter of each intelligent device in M intelligent devices and the relative position information of an obstacle closest to the intelligent device as the input of a strategy model, and acquiring the acceleration information output by the strategy model corresponding to the intelligent device;
taking the acceleration information output by the strategy models corresponding to the M intelligent devices and the state parameters of the M intelligent devices as the input of the strategy evaluation model, and taking the actual evaluation parameter set corresponding to the M intelligent devices as the output of the strategy evaluation model to train the strategy evaluation model;
and adjusting parameters of the strategy evaluation model according to the prediction evaluation parameter set output by the strategy evaluation model and corresponding to the M intelligent devices, and adjusting parameters of the strategy model corresponding to the M intelligent devices.
In some exemplary embodiments, the processor is specifically configured to:
determining a global optimal strategy model in strategy models corresponding to the M intelligent devices in N rounds of training according to a particle swarm algorithm, performing weighted summation operation on the global optimal strategy model and a historical optimal strategy model of the strategy model corresponding to each intelligent device in the N rounds of training, and determining the strategy model corresponding to each intelligent device.
According to a third aspect of the exemplary embodiments, a policy determining apparatus is provided, so as to implement the policy determining method according to any one of the first aspect of the embodiments of the present invention.
According to a fourth aspect of the exemplary embodiments, there is provided a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the policy determination method according to any one of the first aspect of the embodiments of the present invention.
The technical scheme provided by the embodiment of the invention at least has the following beneficial effects:
according to the policy determining method provided by the embodiment of the invention, the central control device can determine the mobile information corresponding to the intelligent device based on the policy model corresponding to the intelligent device according to the state parameters reported by the intelligent device and the relative position information of the obstacle closest to the intelligent device, and send the determined mobile information to the intelligent device, so that the intelligent device can adjust the position according to the mobile information. In the embodiment of the invention, the central control equipment can determine the movement information of the intelligent equipment for position adjustment, so that the intelligent equipment is instructed to move, and an instruction is issued without depending on expert analysis.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a schematic diagram of a policy determination system according to an embodiment of the present invention;
fig. 2 is a flowchart of a policy determination method according to an embodiment of the present invention;
fig. 3 is a schematic diagram illustrating that an intelligent device determines relative position information of an obstacle closest to a current heading direction according to an embodiment of the present invention;
fig. 4 is an interaction flowchart of an intelligent device and a central control device according to an embodiment of the present invention;
fig. 5 is a schematic diagram of simulation drilling of an urban sealing control problem according to an embodiment of the present invention;
fig. 6 is a flowchart of a policy evaluation model and a training method of policy models corresponding to M pieces of intelligent devices according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a central control device according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a policy determining apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be described in detail and clearly with reference to the accompanying drawings. In the description of the embodiments herein, unless otherwise specified, "/" means "or"; for example, A/B may mean A or B. "And/or" herein merely describes an association relationship between associated objects and means that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, in the description of the embodiments of the present application, "a plurality" means two or more.
In the following, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of embodiments of the application, unless stated otherwise, "plurality" means two or more.
Some terms appearing herein are explained below:
1. the term "Markov Decision Process (MDP)" in the embodiments of the present invention is a mathematical model for sequential Decision to model the randomness strategy and reward achievable by an agent in an environment where the system state has Markov properties. The MDP is built based on a set of interactive objects, namely agents and environments, with elements including state, actions, policies and rewards. In the simulation of MDP, the agent perceives the current system state and acts on the environment in a strategic manner, thereby changing the state of the environment and receiving rewards, the accumulation of which over time is referred to as rewards.
2. In the embodiments of the present invention, the term "Deep Learning (DL)" refers to a research direction in the field of Machine Learning (ML). The concept of deep learning originates from research on artificial neural networks; a multilayer perceptron with multiple hidden layers is one deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, so as to discover distributed feature representations of data.
3. In the embodiments of the present invention, the term "Particle Swarm Optimization (PSO)" is also referred to as the particle swarm algorithm. It is a random search algorithm based on group cooperation, developed by simulating the foraging behavior of bird flocks, and is generally considered one of the swarm intelligence (SI) methods. The core of PSO is that individuals in a group share information, so that the movement of the whole group evolves from disorder to order in the problem-solving space, thereby finding the optimal solution of the problem (a sketch of the core update follows this list).
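As an illustration of that core update, a standard PSO step can be sketched as follows (a generic sketch of the textbook algorithm, not the patent's specific use of PSO; the coefficients w, c1 and c2 are conventional defaults chosen here for illustration):

```python
import numpy as np

# Standard particle swarm optimization update (illustrative sketch).
# w, c1, c2 are conventional defaults, not values specified by the patent.
def pso_step(pos, vel, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """pos, vel, pbest: arrays of shape (num_particles, dim); gbest: (dim,)."""
    r1 = np.random.rand(*pos.shape)
    r2 = np.random.rand(*pos.shape)
    vel = (w * vel
           + c1 * r1 * (pbest - pos)   # pull toward each particle's own best
           + c2 * r2 * (gbest - pos))  # pull toward the swarm's global best
    return pos + vel, vel
```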
When a major security event or a public health event occurs in a modern city, effective three-dimensional sealing control needs to be implemented at key points of city blocks. The existing sealing control strategies for city blocks generally require experts to evaluate a city model based on graph theory before determining the strategy; this risks failing to meet time-limit requirements, and the resulting strategy is limited and cannot be applied to other city blocks.
To address the above problem, an embodiment of the present invention provides a policy determination system, as shown in fig. 1, comprising a central control device 10 and M intelligent devices 11, 12, 13, … (M ≥ 1). The intelligent devices may be ground-force agents, such as armored vehicles and intelligent robots, or aerial-force agents, such as unmanned aerial vehicles and quadrotors.
In a specific implementation, the central control device 10 can control M intelligent devices simultaneously, and for convenience of description, the following describes a policy determination method provided by an embodiment of the present invention by taking a single intelligent device 11 as an example:
the intelligent device 11 detects current position information and motion speed, the detected position information and motion speed are used as state parameters of the intelligent device 11, meanwhile, the intelligent device 11 detects relative position information of an obstacle closest to the intelligent device 11 in the current advancing direction, and the intelligent device 11 reports the state parameters and the relative position information of the obstacle closest to the intelligent device 11 to the central control device 10; the central control device 10 determines acceleration information corresponding to the intelligent device 11 based on a strategy model corresponding to the intelligent device 11 according to the received state parameters of the intelligent device 11 and the relative position information of the obstacle closest to the intelligent device 11, and sends the acceleration information to the intelligent device 11; the smart device 11 performs position adjustment according to the received acceleration information.
According to the policy determining method provided by the embodiment of the invention, the central control device can determine the mobile information corresponding to the intelligent device based on the policy model corresponding to the intelligent device according to the state parameters reported by the intelligent device and the relative position information of the obstacle closest to the intelligent device, and send the determined mobile information to the intelligent device, so that the intelligent device can adjust the position according to the mobile information. In the embodiment of the invention, the central control equipment can determine the movement information of the intelligent equipment for position adjustment, so that the intelligent equipment is instructed to move, and an instruction is issued without depending on expert analysis.
As shown in fig. 2, a policy determining method according to an embodiment of the present invention includes the following steps:
step S201, receiving state parameters reported by at least one intelligent device and relative position information of an obstacle closest to the intelligent device; the state parameters of the intelligent equipment comprise current position information and motion speed of the intelligent equipment;
step S202, determining mobile information corresponding to the intelligent equipment based on a strategy model corresponding to the intelligent equipment according to the state parameters reported by the intelligent equipment and the relative position information of the obstacle closest to the intelligent equipment;
and S203, sending the determined mobile information corresponding to the intelligent device so that the intelligent device can adjust the position according to the mobile information.
The urban block sealing control problem involves the forces of an attacker and a defender: the attacker aims to capture target buildings in the urban block, while the defender aims to defend them. The strategy determination method provided by the embodiment of the invention can be applied to determining the attack strategy of the attacker as well as the defense strategy of the defender.
In an optional implementation, a Markov decision process is used to mathematically model the city block sealing control problem. The mathematical model comprises a state space S, an action space A, a return function R and a discount rate γ designed for the problem; the return function R and the discount rate γ can be used to determine an evaluation parameter after an intelligent device adjusts its position, and the evaluation parameter represents the influence of that adjustment on the win-or-lose outcome of the sealing control problem.
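To make these four elements concrete, a minimal rollout loop showing how the return function R and the discount rate γ accumulate into an evaluation of a behavior sequence might look like this (a generic MDP sketch; env and policy are hypothetical stand-ins, not components defined by the patent):

```python
# Minimal sketch of a Markov decision process rollout (illustrative only).
def rollout(env, policy, gamma=0.99, horizon=100):
    """Accumulate discounted rewards (the return) over one episode."""
    state = env.reset()                          # initial system state
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)                   # act according to the policy
        state, reward, done = env.step(action)   # environment state changes
        ret += discount * reward                 # reward accrues into return
        discount *= gamma                        # future rewards discounted
        if done:
            break
    return ret
```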
Optionally, the state space S is designed according to the following manner:
A city block model is established with the center of a target building in the city block as the origin. The intelligent device detects its position (x, y) relative to the origin as its current position information and detects its current movement speed (v_x, v_y); the detected relative position and movement speed serve as the state parameters of the intelligent device. Meanwhile, the intelligent device detects the relative position information (c_x, c_y) of the obstacle closest to it in the current movement direction. The state of an intelligent device comprises its state parameters and the relative position information of its closest obstacle and may be expressed as s = (x, y, v_x, v_y, c_x, c_y). Assuming there are M intelligent devices, the overall state of the M intelligent devices is S = {s_1, s_2, …, s_M}.
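Assembled as vectors, one device's state s and the joint state S of M devices could be constructed as follows (a minimal sketch whose field ordering simply mirrors the definition above):

```python
import numpy as np

# Illustrative construction of s = (x, y, vx, vy, cx, cy) and S = {s_1..s_M}.
def device_state(x, y, vx, vy, cx, cy):
    return np.array([x, y, vx, vy, cx, cy], dtype=np.float32)

def joint_state(per_device_states):        # list of M arrays of shape (6,)
    return np.stack(per_device_states)     # joint state, shape (M, 6)
```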
An optional implementation manner is that the intelligent device determines the relative position information of the obstacle closest to the current heading direction according to the following manner:
In a virtual simulation environment, with the current position of the intelligent device as the origin, a high-speed particle is emitted in the intelligent device's direction of motion, as shown in fig. 3. The speed of the high-speed particle is a preset value far greater than the maximum movement speed of the intelligent device; the particle stops moving when it contacts the first obstacle. The motion time of the high-speed particle is determined, and the relative position information of the obstacle is determined according to the speed and motion time of the high-speed particle and the position information of the intelligent device;
determining the relative position information of the obstacle and the intelligent device according to the following formula:
c_x = v_x * Δt
c_y = v_y * Δt
where c_x is the relative position of the obstacle to the intelligent device in the x-axis direction, c_y is the relative position of the obstacle to the intelligent device in the y-axis direction, v_x and v_y are the components of the intelligent device's movement speed in the x-axis and y-axis directions respectively, and Δt is the motion time of the high-speed particle.
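In a simulation environment, the high-speed-particle probe described above can be sketched as follows (hits_obstacle is an assumed collision query provided by the simulator; the probe speed, step size and range are illustrative values):

```python
import math

# Sketch of the "high-speed particle" probe for the closest obstacle ahead.
# hits_obstacle(x, y) is an assumed simulator collision query.
def closest_obstacle_offset(x, y, vx, vy, hits_obstacle,
                            particle_speed=1e3, dt=1e-3, max_time=1.0):
    speed = math.hypot(vx, vy)
    if speed == 0.0:
        return None                       # no heading direction to probe
    ux, uy = vx / speed, vy / speed       # unit vector of the heading
    px, py, t = x, y, 0.0
    while t < max_time:
        px += ux * particle_speed * dt    # advance the probe particle
        py += uy * particle_speed * dt
        t += dt
        if hits_obstacle(px, py):
            return vx * t, vy * t         # c = v * Δt, per the formulas above
    return None                           # no obstacle within probe range
```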
The action space A of the Markov decision process comprises the acceleration information to be executed next by the M intelligent devices; the central control device determines the acceleration information corresponding to each intelligent device based on the policy model corresponding to that device, according to the state parameters reported by the device and the relative position information of the obstacle closest to it.
In specific implementation, the state parameters reported by the intelligent device and the relative position information of the obstacle closest to the intelligent device are input into the strategy model corresponding to the intelligent device, and the acceleration information output by the strategy model corresponding to the intelligent device is obtained.
The central control device can simultaneously receive the state parameters reported by the M intelligent devices and the relative position information of the obstacles closest to them, determine the acceleration information corresponding to the M intelligent devices based on the M corresponding policy models, and send each intelligent device its acceleration information so that it can adjust its position. The acceleration information corresponding to the M intelligent devices forms the action space A of the Markov decision process: A = {(a_ix, a_iy) | -d_1 ≤ a_ix ≤ d_1, -d_2 ≤ a_iy ≤ d_2}, where (a_ix, a_iy) is the acceleration information corresponding to the i-th of the M intelligent devices, and d_1 and d_2 are adjustable hyper-parameters greater than 0.
After the intelligent device receives the acceleration information sent by the central control device, it adjusts its movement speed accordingly. In a specific implementation, the intelligent device maintains the received acceleration for a preset time and, once the preset time ends, resets the acceleration to 0; this changes the movement speed of the intelligent device and thereby adjusts its position.
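On the device side, applying a received acceleration command, holding it for the preset time and then resetting it to 0 can be sketched like this (simple Euler integration; the bounds d1 and d2 and the time constants are illustrative assumptions):

```python
# Sketch of how a device might apply a received acceleration command.
# d1, d2, hold_time and dt are illustrative values, not patent parameters.
def apply_acceleration(x, y, vx, vy, ax, ay,
                       d1=1.0, d2=1.0, hold_time=0.5, dt=0.05):
    ax = max(-d1, min(d1, ax))            # keep the action inside A
    ay = max(-d2, min(d2, ay))
    t = 0.0
    while t < hold_time:                  # acceleration held for preset time
        vx += ax * dt
        vy += ay * dt
        x += vx * dt
        y += vy * dt
        t += dt
    return x, y, vx, vy                   # acceleration then resets to 0
```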
When the central control device determines the acceleration information of an intelligent device, the position information input into the policy model corresponding to that device is the device's position coordinates in the coordinate system whose origin is the center of the target building, and the acceleration information output by the policy model corresponds to the next movement action, determined by the central control device, for seizing or defending the target building.
As shown in fig. 4, an interaction flowchart of an intelligent device and a central control device according to an embodiment of the present invention includes the following steps:
s401, the intelligent device determines position information and movement speed as state parameters of the intelligent device, and determines relative position information of an obstacle closest to the intelligent device in the current movement direction;
s402, the intelligent equipment reports the state parameters and the relative position information of the obstacle closest to the intelligent equipment to the central control equipment;
step S403, the central control device inputs the state parameters reported by the intelligent device and the relative position information of the obstacle closest to the intelligent device into the strategy model corresponding to the intelligent device, and obtains the acceleration information output by the strategy model corresponding to the intelligent device;
s404, the central control device sends the determined acceleration information to the intelligent device;
and S405, the intelligent device adjusts the position according to the received acceleration information.
The following describes the policy determination method provided by the embodiment of the present invention in two specific embodiments:
example 1
In the problem of urban sealing control, a central control device receives state parameters reported by M intelligent devices and relative position information of an obstacle closest to the intelligent devices in real time, inputs the state parameters reported by each intelligent device and the relative position information of the obstacle closest to the intelligent device into a strategy model corresponding to the intelligent device, and obtains acceleration information corresponding to the intelligent device, which is output by the strategy model corresponding to the intelligent device.
The central control equipment respectively sends the determined acceleration information corresponding to the M intelligent devices to the corresponding intelligent devices, and after the intelligent devices receive the acceleration information sent by the central control equipment, the intelligent devices adjust the movement speed according to the acceleration information within preset time, so that the positions are adjusted.
Example 2
In the urban sealing control problem, a simulation exercise is carried out, for example in a simulation environment as shown in fig. 5. Taking the current defender as an example, according to the state parameters reported by the defender's M intelligent devices in the simulation exercise and the relative position information of the obstacle closest to each device, the state parameters reported by each intelligent device and the relative position information of its closest obstacle are input into the policy model corresponding to that device, and the acceleration information output by that policy model is acquired.
The central control device respectively sends the determined acceleration information corresponding to the M intelligent devices to the corresponding intelligent devices, and after receiving the acceleration information sent by the central control device, the intelligent devices adjust the movement speed according to the acceleration information within a preset time, so that the positions are adjusted;
the central control equipment determines an evaluation parameter set corresponding to the M intelligent equipment on the basis of a strategy evaluation model according to the state parameters of the M intelligent equipment and the acceleration information corresponding to the M intelligent equipment;
in specific implementation, the central control device inputs state parameters of the M intelligent devices and acceleration information corresponding to the M intelligent devices into the policy evaluation model, and obtains an evaluation parameter set corresponding to the M intelligent devices output by the policy evaluation model, where the evaluation parameter set includes M evaluation parameters corresponding to the M intelligent devices, and the evaluation parameters are used to represent influences on the win and lose results of the city sealing control problem after the intelligent devices adjust positions according to the acceleration information.
The central control device receives the state parameters sent again after the intelligent devices adjust their positions, together with the relative positions of the obstacles closest to them, and the above process is repeated until the evaluation parameters in the evaluation parameter sets corresponding to the M intelligent devices converge. At that point it is determined that the attacker and the defender have reached a Nash equilibrium in the simulation exercise, and the current position distribution of the defender's intelligent devices is the optimal deployment strategy determined by the central control device.
The position information of the M current intelligent devices is determined and used as the target position information. The target position information is the optimal deployment strategy obtained through the simulation exercise, and in the actual urban sealing control problem the defender's intelligent devices are deployed according to the target position information.
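The exercise loop of Embodiment 2, iterating until the evaluation parameters converge and then reading off the defender positions as the target position information, can be summarized as follows (a sketch; the interfaces and the convergence tolerance are assumptions made for illustration):

```python
# Sketch of the simulated-exercise loop from Embodiment 2. The interfaces
# (collect_reports, policy_models, critic, send_all) and tol are assumptions.
def run_exercise(collect_reports, policy_models, critic, send_all, tol=1e-3):
    prev_evals = None
    while True:
        reports = collect_reports()              # states + closest obstacles
        accels = [policy_models[i](r) for i, r in enumerate(reports)]
        send_all(accels)                         # devices adjust positions
        evals = critic(reports, accels)          # evaluation parameter set
        if prev_evals is not None and all(
                abs(e - p) < tol for e, p in zip(evals, prev_evals)):
            # Converged: Nash equilibrium reached; the current defender
            # positions are taken as the target position information.
            return [(r["x"], r["y"]) for r in reports]
        prev_evals = evals
```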
The embodiment of the invention also provides a strategy evaluation model and a training method of strategy models corresponding to the M intelligent devices, which comprises the following steps:
as shown in fig. 6, the method for training the policy evaluation model and the policy models corresponding to M pieces of intelligent devices provided by the embodiment of the present invention includes the following steps:
Step S601, initializing the parameters of the strategy evaluation model and of the strategy models corresponding to the M intelligent devices, and initializing the training round n = 1;
step S602, judging whether the training round number N is smaller than a preset training round number N, if so, executing step S603; otherwise, go to step S609;
step S603, taking the state parameter of each intelligent device in the M intelligent devices and the relative position information of the obstacle closest to the intelligent device as the input of the strategy model, and acquiring the acceleration information output by the strategy model corresponding to the intelligent device;
step S604, taking the acceleration information output by the strategy models corresponding to the M intelligent devices and the state parameters of the M intelligent devices as the input of the strategy evaluation model, and acquiring a prediction evaluation parameter set output by the strategy evaluation model;
step S605, determining actual evaluation parameter sets corresponding to the M intelligent devices;
in specific implementation, according to the state parameters of the intelligent equipment and the relative position information of the obstacle closest to the intelligent equipment, determining an immediate return value of the intelligent equipment after the position of the intelligent equipment is adjusted according to the acceleration information determined by the strategy model;
for example, when the intelligent device is an attacker intelligent device, the immediate return value R after the intelligent device adjusts the position according to the acceleration information determined by the policy model is determined according to the following formula:
Figure BDA0002574671760000141
wherein the content of the first and second substances,
Figure BDA0002574671760000142
the value is reported immediately after the ith intelligent device in the M intelligent devices of the attacker adjusts the position according to the acceleration information determined by the strategy model,
Figure BDA0002574671760000143
the position information of the ith intelligent device after the position is adjusted according to the acceleration information,
Figure BDA0002574671760000144
for the motion speed of the ith intelligent device adjusted according to the acceleration information,
Figure BDA0002574671760000145
and the i-th intelligent device adjusts the relative position information of the closest obstacle after the position is adjusted according to the acceleration information, wherein the oc and the epsilon are adjustable hyper-parameters.
When the intelligent device is a defender intelligent device, the immediate return value R after the intelligent device adjusts its position according to the acceleration information determined by the strategy model is likewise computed by a return formula that appears only as an image in the source publication. The quantities in that formula are: the immediate return value of the j-th of the defender's M intelligent devices after it adjusts its position according to the acceleration information determined by the strategy model; the position information of the j-th intelligent device after the adjustment; the position information of the i-th intelligent device of the attacker; the movement speed of the i-th intelligent device of the attacker; and the relative position information of the obstacle closest to the i-th intelligent device of the attacker. α and ε are adjustable hyper-parameters. Case A is that the j-th intelligent device of the defender is a ground force, the current simulation has not ended, and the attacker's intelligent device has not been captured; case B is that the j-th intelligent device of the defender is an air force, the current simulation has not ended, and the attacker's intelligent device has not been captured.
After the immediate return value R following the intelligent device's position adjustment is determined, the actual evaluation parameter set corresponding to the intelligent device is determined according to the following formula:
U = R + γ * Q(s', π(s', w), θ)
wherein U is the actual evaluation parameter set corresponding to the intelligent device, R is the immediate return value after the intelligent device adjusts its position, γ is the discount rate, Q() is the strategy evaluation model, π() is the strategy model corresponding to the intelligent device, s' is the state of the intelligent device after it adjusts its position according to the acceleration, w denotes the parameters of the strategy model corresponding to the intelligent device, and θ denotes the parameters of the strategy evaluation model.
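Computed directly from the formula above, the actual evaluation parameter is a bootstrapped target; a minimal sketch follows (q_model and policy are assumed callables standing in for Q(., ., θ) and π(., w)):

```python
# Sketch of the target U = R + gamma * Q(s', pi(s', w), theta).
# q_model and policy are assumed callables (e.g., neural networks).
def evaluation_target(r, s_next, policy, q_model, gamma=0.99):
    a_next = policy(s_next)       # pi(s', w): next action under the policy
    return r + gamma * q_model(s_next, a_next)   # discounted bootstrap value
```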
Step S606, determining a loss value between the predicted evaluation parameter set and the actual evaluation parameter set based on a loss function, and adjusting the parameters of the strategy evaluation model by gradient descent to reduce that loss value;
alternatively, the loss function may be a mean square error loss function.
Step S607, according to the predicted evaluation parameter set output by the strategy evaluation model, adjusting the parameters of the strategy model corresponding to each intelligent device by gradient descent so as to increase the predicted evaluation parameters output by the strategy evaluation model;
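Taken together, steps S606 and S607 resemble a deterministic actor-critic update; a minimal PyTorch-style sketch under that reading follows (actor and critic are assumed torch modules, and the optimizers and tensors are assumed to be prepared by the caller; this is an interpretation, not code from the patent):

```python
import torch

# Sketch of steps S606-S607 as a deterministic actor-critic update.
def train_step(actor, critic, actor_opt, critic_opt, states, actions, targets):
    # S606: mean-squared-error loss between the predicted and actual
    # evaluation parameters, reduced by gradient descent on the critic.
    critic_loss = torch.nn.functional.mse_loss(critic(states, actions), targets)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # S607: adjust the policy model to increase the critic's predicted
    # evaluation parameter (gradient descent on its negative).
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```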
step S608, updating the training round n ═ n +1, and returning to step S602;
step S609, determining a global optimal strategy model in strategy models corresponding to M intelligent devices in N rounds of training according to a particle swarm optimization;
in specific implementation, in N rounds of training, when parameters of the policy models corresponding to M pieces of intelligent equipment are adjusted according to the evaluation parameters, the parameter adjustment value of the policy model corresponding to each piece of intelligent equipment is recorded, and after N rounds of training, the policy model with the minimum parameter adjustment value is used as the global optimal policy model.
And S610, performing weighted summation operation on the global optimal strategy model and the historical optimal strategy model in the N rounds of training corresponding to each intelligent device, and determining the strategy model corresponding to each intelligent device.
In specific implementation, the strategy model corresponding to the intelligent equipment with the minimum parameter adjustment value in the N rounds of training is used as the historical optimal strategy model corresponding to the intelligent equipment; and carrying out weighted summation on the global optimal strategy model and the historical optimal strategy model in the N rounds of training corresponding to each intelligent device according to a preset weight value, and determining the strategy model corresponding to each intelligent device.
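The weighted summation of the global optimal strategy model and each device's historical optimal strategy model can be realized as a blend in parameter space, for example (a sketch assuming PyTorch state dicts; the preset weight alpha is an illustrative assumption):

```python
import torch

# Sketch of blending the global optimal policy model with one device's
# historical optimal policy model by weighted summation of parameters.
def blend_policy_params(global_best, personal_best, alpha=0.5):
    gb, pb = global_best.state_dict(), personal_best.state_dict()
    blended = {k: alpha * gb[k] + (1.0 - alpha) * pb[k] for k in gb}
    return blended   # load into a model via model.load_state_dict(blended)
```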
According to the training method for the strategy evaluation model and the strategy models corresponding to the M intelligent devices provided by the embodiment of the invention, multiple rounds of adversarial training can be performed on all intelligent devices in the simulation environment, each round of adversarial training comprising the N-round training process described above. In each round of adversarial training, when a defender intelligent device captures an attacker intelligent device, the captured attacker device stops running; when all attacker intelligent devices have stopped running, the round ends and the defender wins. When the elapsed time of the round exceeds the preset time and the attacker has failed to capture the target, the round ends and the defender wins; when any attacker agent successfully captures the target, the round ends and the attacker wins.
Based on the same inventive concept, an embodiment of the present invention further provides a central control device. Since the principle by which the central control device solves the problem is similar to that of the policy determination method in the embodiments of the present invention, the implementation of the device may refer to the implementation of the method, and repeated details are not described again.
As shown in fig. 7, an embodiment of the present invention provides a central control device, which includes a processor 701, a memory 702, and a transceiver 703;
the processor 701 is configured to read a program in the memory 702 and execute:
receiving state parameters reported by at least one intelligent device and relative position information of an obstacle closest to the intelligent device; the state parameters of the intelligent equipment comprise current position information and motion speed of the intelligent equipment;
determining mobile information corresponding to the intelligent equipment based on a strategy model corresponding to the intelligent equipment according to the state parameters reported by the intelligent equipment and the relative position information of the obstacle closest to the intelligent equipment;
and sending the determined mobile information corresponding to the intelligent equipment so that the intelligent equipment can adjust the position according to the mobile information.
In some exemplary embodiments, the processor 701 is specifically configured to:
and if the mobile information is acceleration information, inputting the state parameters reported by the intelligent equipment and the relative position information of the obstacle closest to the intelligent equipment into a strategy model corresponding to the intelligent equipment, and acquiring the acceleration information output by the strategy model corresponding to the intelligent equipment.
In some exemplary embodiments, the processor 701 is specifically configured to:
if the mobile information is target position information, respectively inputting the state parameter reported by each intelligent device in M intelligent devices and the relative position information of the obstacle closest to the intelligent device into a strategy model corresponding to the intelligent device, acquiring acceleration information output by the strategy model corresponding to the intelligent device, and sending the acceleration information to the intelligent device; wherein M is the number of the preset intelligent devices, and M is more than or equal to 1;
determining an evaluation parameter set corresponding to the M intelligent devices based on a strategy evaluation model according to the state parameters of the M intelligent devices and the acceleration information corresponding to the M intelligent devices;
receiving the state parameter reported again by each intelligent device in the M intelligent devices after adjusting the position according to the acceleration information and the relative position information of the obstacle closest to the intelligent device, determining the acceleration information corresponding to each intelligent device until the evaluation parameters in the evaluation parameter sets corresponding to the M intelligent devices are all converged, determining the position information of the M intelligent devices, and taking the position information of the M intelligent devices as the target position information.
In some exemplary embodiments, the processor 701 is specifically configured to train the policy evaluation model and the policy models corresponding to the M smart devices according to the following manners:
performing N rounds of training on the strategy evaluation model and the strategy model corresponding to the at least one intelligent device, acquiring the trained strategy evaluation model after the N rounds of training, and determining the strategy model corresponding to each intelligent device after the N rounds of training according to a particle swarm algorithm; wherein N is a positive integer;
wherein each round of training performs the following process:
taking the state parameter of each intelligent device in M intelligent devices and the relative position information of an obstacle closest to the intelligent device as the input of a strategy model, and acquiring the acceleration information output by the strategy model corresponding to the intelligent device;
taking the acceleration information output by the strategy models corresponding to the M intelligent devices and the state parameters of the M intelligent devices as the input of the strategy evaluation model, and taking the actual evaluation parameter set corresponding to the M intelligent devices as the output of the strategy evaluation model to train the strategy evaluation model;
and adjusting parameters of the strategy evaluation model according to the prediction evaluation parameter set output by the strategy evaluation model and corresponding to the M intelligent devices, and adjusting parameters of the strategy model corresponding to the M intelligent devices.
In some exemplary embodiments, the processor 701 is specifically configured to:
determining a global optimal strategy model in strategy models corresponding to the M intelligent devices in N rounds of training according to a particle swarm algorithm, performing weighted summation operation on the global optimal strategy model and a historical optimal strategy model of the strategy model corresponding to each intelligent device in the N rounds of training, and determining the strategy model corresponding to each intelligent device.
As shown in fig. 8, an embodiment of the present invention provides a policy determining apparatus, including:
a receiving module 801, configured to receive the state parameters reported by at least one intelligent device and the relative position information of the obstacle closest to the intelligent device; wherein the state parameters of the intelligent device comprise the current position information and motion speed of the intelligent device;
a determining module 802, configured to determine the movement information corresponding to the intelligent device based on the strategy model corresponding to the intelligent device, according to the state parameters reported by the intelligent device and the relative position information of the obstacle closest to the intelligent device;
a sending module 803, configured to send the determined movement information corresponding to the intelligent device, so that the intelligent device can adjust its position according to the movement information.
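The three modules can be read as the receive/determine/send stages of one control step. The structural sketch below assumes a hypothetical transport layer and per-device strategy models with a predict() method; none of these interfaces are specified by the embodiment.

```python
# Hedged structural sketch of modules 801-803; all interfaces are hypothetical.
class PolicyDeterminingApparatus:
    def __init__(self, strategy_models, transport):
        self.strategy_models = strategy_models   # one strategy model per device
        self.transport = transport               # hypothetical messaging layer

    def receive(self, device_id):
        # receiving module 801: state parameters + nearest-obstacle relative position
        return self.transport.recv(device_id)

    def determine(self, device_id, state, obstacle_rel):
        # determining module 802: movement information from the device's strategy model
        return self.strategy_models[device_id].predict(state, obstacle_rel)

    def send(self, device_id, movement):
        # sending module 803: the device adjusts its position on receipt
        self.transport.send(device_id, movement)
```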
In some exemplary embodiments, the determining module 802 is specifically configured to:
if the movement information is acceleration information, inputting the state parameters reported by the intelligent device and the relative position information of the obstacle closest to the intelligent device into the strategy model corresponding to the intelligent device, and acquiring the acceleration information output by that strategy model.
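For the acceleration case, the following is a minimal sketch of what such a strategy model could look like, assuming a small feed-forward network and 2-D positions; the embodiment fixes neither the architecture nor the dimensionality.

```python
# Hedged sketch of a strategy model for the acceleration case; layer sizes
# and 2-D kinematics are illustrative assumptions.
import torch
import torch.nn as nn

class StrategyModel(nn.Module):
    def __init__(self, state_dim=4, obstacle_dim=2, accel_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + obstacle_dim, 64), nn.ReLU(),
            nn.Linear(64, accel_dim), nn.Tanh(),   # bounded acceleration output
        )

    def forward(self, state, obstacle_rel):
        # input: reported state parameters + nearest-obstacle relative position
        return self.net(torch.cat([state, obstacle_rel], dim=-1))

model = StrategyModel()
state = torch.tensor([[0.0, 0.0, 1.0, 0.5]])   # x, y, vx, vy (position + motion speed)
obstacle_rel = torch.tensor([[3.0, -1.0]])     # nearest obstacle, relative position
print(model(state, obstacle_rel))              # acceleration information to send back
```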
In some exemplary embodiments, the determining module 802 is specifically configured to:
if the movement information is target position information, respectively inputting the state parameter reported by each of the M intelligent devices and the relative position information of the obstacle closest to that device into the strategy model corresponding to that device, acquiring the acceleration information output by that strategy model, and sending the acceleration information to the device; wherein M is the preset number of intelligent devices, and M is greater than or equal to 1;
determining, based on a strategy evaluation model, an evaluation parameter set corresponding to the M intelligent devices according to the state parameters of the M intelligent devices and the acceleration information corresponding to the M intelligent devices;
receiving the state parameter reported again by each of the M intelligent devices after it adjusts its position according to the acceleration information, together with the relative position information of the obstacle closest to that device, and determining the acceleration information corresponding to each intelligent device, repeating this process until the evaluation parameters in the evaluation parameter sets corresponding to the M intelligent devices have all converged; then determining the position information of the M intelligent devices and taking that position information as the target position information.
In some exemplary embodiments, the policy determining apparatus further includes a training module 804, where the training module 804 is specifically configured to:
performing N rounds of training on the strategy evaluation model and the strategy model corresponding to the at least one intelligent device, obtaining the trained strategy evaluation model after the N rounds of training, and determining, according to a particle swarm algorithm, the strategy model corresponding to each intelligent device after the N rounds of training; wherein N is a positive integer;
wherein each round of training performs the following process:
taking the state parameter of each of the M intelligent devices and the relative position information of the obstacle closest to that device as the input of the device's strategy model, and acquiring the acceleration information output by that strategy model;
taking the acceleration information output by the strategy models corresponding to the M intelligent devices and the state parameters of the M intelligent devices as the input of the strategy evaluation model, and taking the actual evaluation parameter set corresponding to the M intelligent devices as the target output of the strategy evaluation model, so as to train the strategy evaluation model;
and adjusting the parameters of the strategy evaluation model according to the predicted evaluation parameter set corresponding to the M intelligent devices output by the strategy evaluation model, and likewise adjusting the parameters of the strategy models corresponding to the M intelligent devices.
In some exemplary embodiments, the training module 804 is specifically configured to:
determining, according to a particle swarm algorithm, a globally optimal strategy model among the strategy models corresponding to the M intelligent devices over the N rounds of training, performing a weighted summation of the globally optimal strategy model and the historically optimal strategy model of the strategy model corresponding to each intelligent device over the N rounds of training, and thereby determining the strategy model corresponding to each intelligent device.
Since the computer storage medium in the embodiment of the present invention may be applied to the above policy determination method, its technical effects can likewise be obtained by referring to the method embodiment, and the details are not repeated here.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (6)

1. A method for policy determination, the method comprising:
receiving state parameters reported by at least one intelligent device and relative position information of an obstacle closest to the intelligent device; wherein the state parameters of the intelligent device comprise the current position information and motion speed of the intelligent device;
determining movement information corresponding to the intelligent device based on a strategy model corresponding to the intelligent device, according to the state parameters reported by the intelligent device and the relative position information of the obstacle closest to the intelligent device;
sending the determined movement information corresponding to the intelligent device so that the intelligent device can adjust its position according to the movement information;
wherein, if the movement information is acceleration information, the determining of the movement information corresponding to the intelligent device based on the strategy model corresponding to the intelligent device according to the state parameter reported by the intelligent device and the relative position information of the obstacle closest to the intelligent device comprises:
inputting the state parameters reported by the intelligent device and the relative position information of the obstacle closest to the intelligent device into the strategy model corresponding to the intelligent device, and acquiring the acceleration information output by the strategy model corresponding to the intelligent device;
wherein the inputting of the state parameters reported by the intelligent device and the relative position information of the obstacle closest to the intelligent device into the strategy model corresponding to the intelligent device, and the acquiring of the acceleration information output by the strategy model corresponding to the intelligent device, comprise:
respectively inputting the state parameters reported by each of the M intelligent devices and the relative position information of the obstacle closest to that device into the strategy model corresponding to that device, and acquiring the acceleration information output by that strategy model; wherein M is the preset number of intelligent devices, and M is greater than or equal to 1;
wherein, after the determined movement information corresponding to the intelligent device is sent to the intelligent device, the method further comprises:
determining, based on a strategy evaluation model, an evaluation parameter set corresponding to the M intelligent devices according to the state parameters of the M intelligent devices and the acceleration information corresponding to the M intelligent devices; wherein the evaluation parameter set comprises M evaluation parameters corresponding to the M intelligent devices, and the evaluation parameters represent the influence of the M intelligent devices, after their positions are adjusted according to the acceleration information, on the win/loss outcome of the urban containment and control problem;
receiving the state parameter reported again by each of the M intelligent devices after it adjusts its position according to the acceleration information, together with the relative position information of the obstacle closest to that device, and determining the acceleration information corresponding to each intelligent device, repeating this process until the evaluation parameters in the evaluation parameter sets corresponding to the M intelligent devices have all converged; then determining the position information of the M intelligent devices and taking that position information as the target position information.
2. The method of claim 1, wherein the strategy evaluation model and the strategy models corresponding to the M intelligent devices are trained in the following manner:
performing N rounds of training on the strategy evaluation model and the strategy model corresponding to the at least one intelligent device, obtaining the trained strategy evaluation model after the N rounds of training, and determining, according to a particle swarm algorithm, the strategy model corresponding to each intelligent device after the N rounds of training; wherein N is a positive integer;
wherein each round of training performs the following process:
taking the state parameter of each of the M intelligent devices and the relative position information of the obstacle closest to that device as the input of the device's strategy model, and acquiring the acceleration information output by that strategy model;
taking the acceleration information output by the strategy models corresponding to the M intelligent devices and the state parameters of the M intelligent devices as the input of the strategy evaluation model, and taking the actual evaluation parameter set corresponding to the M intelligent devices as the target output of the strategy evaluation model, so as to train the strategy evaluation model;
and adjusting the parameters of the strategy evaluation model according to the predicted evaluation parameter set corresponding to the M intelligent devices output by the strategy evaluation model, and likewise adjusting the parameters of the strategy models corresponding to the M intelligent devices.
3. The method of claim 2, wherein determining the strategy model corresponding to each intelligent device after the N rounds of training according to the particle swarm algorithm comprises:
determining, according to a particle swarm algorithm, a globally optimal strategy model among the strategy models corresponding to the M intelligent devices over the N rounds of training, performing a weighted summation of the globally optimal strategy model and the historically optimal strategy model of the strategy model corresponding to each intelligent device over the N rounds of training, and thereby determining the strategy model corresponding to each intelligent device.
4. A central control device comprising a processor, a memory, and a transceiver;
wherein the processor is configured to read a program in the memory and execute:
receiving state parameters reported by at least one intelligent device and relative position information of an obstacle closest to the intelligent device; wherein the state parameters of the intelligent device comprise the current position information and motion speed of the intelligent device;
determining movement information corresponding to the intelligent device based on a strategy model corresponding to the intelligent device, according to the state parameters reported by the intelligent device and the relative position information of the obstacle closest to the intelligent device;
sending the determined movement information corresponding to the intelligent device so that the intelligent device can adjust its position according to the movement information;
the processor is specifically configured to:
if the movement information is acceleration information, inputting the state parameters reported by the intelligent device and the relative position information of the obstacle closest to the intelligent device into the strategy model corresponding to the intelligent device, and acquiring the acceleration information output by that strategy model;
the processor is specifically configured to:
if the movement information is target position information, respectively inputting the state parameter reported by each of the M intelligent devices and the relative position information of the obstacle closest to that device into the strategy model corresponding to that device, and acquiring the acceleration information output by that strategy model; wherein M is the preset number of intelligent devices, and M is greater than or equal to 1;
the processor is further configured to:
determining, based on a strategy evaluation model, an evaluation parameter set corresponding to the M intelligent devices according to the state parameters of the M intelligent devices and the acceleration information corresponding to the M intelligent devices; wherein the evaluation parameter set comprises M evaluation parameters corresponding to the M intelligent devices, and the evaluation parameters represent the influence of the M intelligent devices, after their positions are adjusted according to the acceleration information, on the win/loss outcome of the urban containment and control problem;
receiving the state parameter reported again by each of the M intelligent devices after it adjusts its position according to the acceleration information, together with the relative position information of the obstacle closest to that device, and determining the acceleration information corresponding to each intelligent device, repeating this process until the evaluation parameters in the evaluation parameter sets corresponding to the M intelligent devices have all converged; then determining the position information of the M intelligent devices and taking that position information as the target position information.
5. The central control device of claim 4, wherein the processor is specifically configured to train the strategy evaluation model and the strategy models corresponding to the M intelligent devices in the following manner:
performing N rounds of training on the strategy evaluation model and the strategy model corresponding to the at least one intelligent device, obtaining the trained strategy evaluation model after the N rounds of training, and determining, according to a particle swarm algorithm, the strategy model corresponding to each intelligent device after the N rounds of training; wherein N is a positive integer;
wherein each round of training performs the following process:
taking the state parameter of each of the M intelligent devices and the relative position information of the obstacle closest to that device as the input of the device's strategy model, and acquiring the acceleration information output by that strategy model;
taking the acceleration information output by the strategy models corresponding to the M intelligent devices and the state parameters of the M intelligent devices as the input of the strategy evaluation model, and taking the actual evaluation parameter set corresponding to the M intelligent devices as the target output of the strategy evaluation model, so as to train the strategy evaluation model;
and adjusting the parameters of the strategy evaluation model according to the predicted evaluation parameter set corresponding to the M intelligent devices output by the strategy evaluation model, and likewise adjusting the parameters of the strategy models corresponding to the M intelligent devices.
6. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 3.
CN202010650248.8A 2020-07-08 2020-07-08 Strategy determination method, central control equipment and storage medium Active CN112001583B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010650248.8A CN112001583B (en) 2020-07-08 2020-07-08 Strategy determination method, central control equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112001583A CN112001583A (en) 2020-11-27
CN112001583B (en) 2021-06-22

Family

ID=73467553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010650248.8A Active CN112001583B (en) 2020-07-08 2020-07-08 Strategy determination method, central control equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112001583B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110955241B (en) * 2019-11-22 2023-04-14 深圳市优必选科技股份有限公司 Mobile robot obstacle avoidance method and device, mobile robot and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018198033A (en) * 2017-05-25 2018-12-13 日本電信電話株式会社 Mobile body control method, mobile body controller, and program
CN107831777A (en) * 2017-09-26 2018-03-23 中国科学院长春光学精密机械与物理研究所 A kind of aircraft automatic obstacle avoiding system, method and aircraft
CN109670270A (en) * 2019-01-11 2019-04-23 山东师范大学 Crowd evacuation emulation method and system based on the study of multiple agent deeply
CN110865655A (en) * 2019-12-12 2020-03-06 电子科技大学 Formation and obstacle avoidance control method for unmanned aerial vehicle in unmanned aerial vehicle system
CN111142557A (en) * 2019-12-23 2020-05-12 清华大学 Unmanned aerial vehicle path planning method and system, computer equipment and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on UAV Target Tracking and Obstacle Avoidance; Meng Fankun; China Doctoral Dissertations Full-text Database, Engineering Science and Technology II; 2018-01-15 (No. 01); pp. C031-62 *
Research on Driving-Condition Evaluation and Autonomous Obstacle Avoidance Control for Small Intelligent Vehicles; Lyu Dandan; China Master's Theses Full-text Database, Engineering Science and Technology II; 2016-07-15 (No. 07); pp. C034-142 *

Also Published As

Publication number Publication date
CN112001583A (en) 2020-11-27

Similar Documents

Publication Publication Date Title
Yuan et al. A novel GRU-RNN network model for dynamic path planning of mobile robot
Lamont et al. UAV swarm mission planning and routing using multi-objective evolutionary algorithms
CN112180967B (en) Multi-unmanned aerial vehicle cooperative countermeasure decision-making method based on evaluation-execution architecture
CN110442129B (en) Control method and system for multi-agent formation
CN112131786A (en) Target detection and distribution method and device based on multi-agent reinforcement learning
CN109300144A (en) A kind of pedestrian track prediction technique of mosaic society's power model and Kalman filtering
CN114489144B (en) Unmanned aerial vehicle autonomous maneuver decision method and device and unmanned aerial vehicle
CN116661503B (en) Cluster track automatic planning method based on multi-agent safety reinforcement learning
CN114510012A (en) Unmanned cluster evolution system and method based on meta-action sequence reinforcement learning
Aswani et al. Improving surveillance using cooperative target observation
CN117057656A (en) Digital twinning-based smart city management method and system
CN112183288A (en) Multi-agent reinforcement learning method based on model
Zhao et al. Hybrid navigation method for multiple robots facing dynamic obstacles
CN113110101B (en) Production line mobile robot gathering type recovery and warehousing simulation method and system
Cao et al. Autonomous maneuver decision of UCAV air combat based on double deep Q network algorithm and stochastic game theory
CN117313561B (en) Unmanned aerial vehicle intelligent decision model training method and unmanned aerial vehicle intelligent decision method
CN112001583B (en) Strategy determination method, central control equipment and storage medium
CN116562332B (en) Robot social movement planning method in man-machine co-fusion environment
Lin et al. PSO-BPNN-based prediction of network security situation
Liu RETRACTED: Research on decision-making strategy of soccer robot based on multi-agent reinforcement learning
CN116400726A (en) Rotor unmanned aerial vehicle escape method and system based on reinforcement learning
CN102332175A (en) Flock animation method based on shape constraints
Souidi et al. Multi-Agent Dynamic Leader-Follower Path Planning Applied to the Multi-Pursuer Multi-Evader Game
Oweis et al. Server based control flocking for aerial-systems
Singh et al. Automatic identification of the leader in a swarm using an optimized clustering and probabilistic approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant