CN116841208A - Unmanned underwater vehicle formation control simulation method, system and equipment


Info

Publication number
CN116841208A
Authority
CN
China
Prior art keywords: agent, simulation, UUV, models, situation data
Legal status: Pending
Application number: CN202311105997.2A
Other languages: Chinese (zh)
Inventors: 尹辉, 马骏, 曹一丁, 郭伟, 黄安付
Current assignee: Baiyang Times Beijing Technology Co., Ltd.
Original assignee: Baiyang Times Beijing Technology Co., Ltd.
Application filed by Baiyang Times Beijing Technology Co., Ltd.
Priority application: CN202311105997.2A
Publication: CN116841208A


Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems, electric
    • G05B13/04: Adaptive control systems, electric, involving the use of models or simulators
    • G05B13/042: Adaptive control systems, electric, involving models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Abstract

The application discloses an unmanned underwater vehicle formation control simulation method, system and device. The method comprises: a plurality of UUV agent models acquire environmental situation data from a simulation environment and output decision action information according to the environmental situation data, so that the corresponding simulation objects execute corresponding formation actions according to the decision action information; a simulation object is a component that simulates an unmanned underwater vehicle in the simulation environment. The UUV agent models are cooperatively trained in the CTDE (Centralized Training, Decentralized Execution) paradigm, which stabilizes the training process, and are adversarially trained against a league population of multiple policy types, which makes them robust. The UUV agent models control the simulation objects to execute formation actions according to the simulation environment situation data, realizing unmanned underwater vehicle formation control simulation in a virtual environment, which offers guidance for real unmanned underwater vehicle formation control scenarios.

Description

Unmanned underwater vehicle formation control simulation method, system and equipment
Technical Field
The application relates to the field of data processing, and in particular to an unmanned underwater vehicle formation control simulation method, system and device.
Background
Unmanned underwater vehicles (Unmanned Underwater Vehicle, UUV) offer good concealment, the ability to work for long periods in harsh environments, and high overall cost-effectiveness. With the rapid development of Internet of Things technology, information technology and artificial intelligence, UUVs have in recent years attracted great attention in the military field at home and abroad, and having multiple UUVs form a formation to execute tasks cooperatively has become an inevitable path for UUV development.
Therefore, how to better control multiple UUVs is a pressing problem.
Disclosure of Invention
To address this problem, the application provides an unmanned underwater vehicle formation control simulation method, system and device that realize formation control simulation in a virtual environment through agent models with better training stability and stronger robustness.
The application discloses the following technical scheme:
the first aspect of the application provides an unmanned underwater vehicle formation control simulation method, which comprises the following steps:
acquiring environmental situation data from a simulation environment by a plurality of UUV agent models; outputting decision action information according to the environmental situation data so that the corresponding simulation objects execute corresponding formation actions according to the decision action information; the UUV agent models are obtained by cooperative training in the CTDE paradigm followed by adversarial training based on a league population of multiple policy types; a simulation object is a component that simulates an unmanned underwater vehicle in the simulation environment.
In one possible implementation manner, the training method of the UUV agent models includes:
constructing a plurality of first agent models;
performing cooperative training on the plurality of first agent models in the CTDE paradigm to obtain a plurality of second agent models;
and performing adversarial training on the plurality of second agent models based on a league population of multiple policy types to obtain the plurality of UUV agent models.
In one possible implementation manner, the cooperative training of the plurality of first agent models in the CTDE paradigm to obtain a plurality of second agent models comprises:
each first agent acquires local environmental situation data and inputs it into its own decision network, which outputs decision action information;
inputting the local environmental situation data of each first agent and the corresponding decision action information into a centralized value network; the centralized value network assembles global environmental situation observation data from the local environmental situation data of all agents, and outputs a value function according to the global environmental situation observation data and the decision action information of each agent;
updating the decision network parameters of each first agent using the value of the value function;
repeating the above steps until the value of the value function meets a first preset condition, and using the updated plurality of first agents as the plurality of second agent models.
In one possible implementation manner, the adversarial training of the plurality of second agent models based on the league population of multiple policy types to obtain the plurality of UUV agent models comprises:
pre-training agent models of a plurality of policy types for each second agent model; the agent models of the plurality of policy types include: a main agent model, a historical-partner agent model, and a main-agent exploiter model (an agent trained to exploit the main agent's weaknesses);
constructing the agent models of the plurality of policy types into a league population;
selecting two agent models from the league population for an adversarial match, recording the win rate of each agent model, and storing the agent models back into the league population after each match;
and repeating the above steps until the win rate of the main agent models meets a second preset condition, and outputting the main agent models with the highest win rates as the UUV agent models.
In one possible implementation, each agent model comprises a neural network module and a knowledge rule module.
In one possible implementation manner, the outputting decision action information according to the environmental situation data includes:
the neural network module responds to the input of environment situation data and outputs first decision action information;
and the knowledge rule module responds to the input of the environmental situation data and outputs second decision action information according to a first preset rule.
In one possible implementation manner, the outputting decision action information according to the environmental situation data includes:
the neural network module responds to the input of the environmental situation data and outputs high-level task decision information;
and the knowledge rule module outputs low-level decision action information according to the high-level task decision information and a second preset rule.
In one possible implementation, the knowledge rule module includes a task layer knowledge rule module and an execution layer knowledge rule module;
the outputting decision action information according to the environmental situation data comprises the following steps:
the neural network module responds to the input of environment situation data and outputs first task information;
the task layer knowledge rule module responds to the input of the environmental situation data and outputs second task information according to a third preset rule;
And the execution layer knowledge rule module responds to the input of the first task information and/or the second task information and outputs decision action information according to a fourth preset rule.
A second aspect of the present application provides an unmanned underwater vehicle formation control simulation system, comprising: a plurality of UUV agent models, a simulation interaction module and a simulation environment; the simulation environment comprises a plurality of simulation objects; a simulation object is a component that simulates an unmanned underwater vehicle in the simulation environment;
the UUV agent models are used for acquiring environmental situation data and outputting decision action information to the simulation objects according to the environmental situation data; the UUV agent models are obtained by cooperative training in the CTDE paradigm and adversarial training based on a league population of multiple policy types;
the simulation interaction module is used for transmitting environmental situation data to the UUV agent models and transmitting the decision action information of the UUV agent models to the corresponding simulation objects;
the simulation object is used for executing corresponding formation actions according to the decision action information.
A third aspect of the present application provides a computer device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the unmanned underwater vehicle formation control simulation method according to any implementation of the first aspect is realized.
Compared with the prior art, the application has the following beneficial effects:
according to the unmanned underwater vehicle formation control simulation method provided by the application, a plurality of UUV intelligent body models acquire environmental situation data from a simulation environment; outputting decision action information according to the environmental situation data so that the corresponding simulation object executes corresponding formation action according to the decision action information; the simulation object is a component for simulating a plurality of unmanned underwater vehicles in a simulation environment; the UUV intelligent body models are cooperatively trained in a CTDE mode, so that the performance of the intelligent body training process is stable, and the countermeasure training is performed based on the alliance population of multiple strategy types, so that the UUV intelligent body models have strong robustness. The UUV intelligent body models control the simulation objects to execute formation actions according to the simulation environment situation data, unmanned underwater vehicle formation control simulation is realized in the virtual environment, and the method has guiding significance for actual unmanned underwater vehicle formation control scenes.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the application; other drawings can be derived from them by a person skilled in the art without inventive effort.
FIG. 1a is a schematic diagram of an unmanned underwater vehicle formation control simulation system according to an embodiment of the present application;
FIG. 1b is a schematic diagram of a simulation interaction module according to an embodiment of the present application;
FIG. 2 is a multi-agent training flow chart provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of the lateral combination of rules and a neural network provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of the longitudinal combination of rules and a neural network provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of the hybrid combination of rules and a neural network provided by an embodiment of the present application;
FIG. 6a is a schematic diagram of a behavior tree structure according to an embodiment of the present application;
FIG. 6b is a schematic diagram illustrating the composition of nodes of a behavior tree according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a centralized training process according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of the decentralized execution process according to an embodiment of the present application;
FIG. 9 is a schematic diagram of the adversarial training process according to an embodiment of the present application;
fig. 10 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order that the above-recited objects, features and advantages of the present application may become more readily apparent, embodiments of the application are described in more detail below with reference to the appended drawings.
As described above, unmanned underwater vehicles (Unmanned Underwater Vehicle, UUV) offer good concealment, the ability to work for long periods in harsh environments, and high overall cost-effectiveness. With the rapid development of Internet of Things technology, information technology and artificial intelligence, UUVs have in recent years attracted great attention in the military field at home and abroad, and having multiple UUVs form a formation to execute tasks cooperatively has become an inevitable path for UUV development.
Research on formation control of multi-UUV systems is influenced by the structure and software design of the individual UUV, the communication quality among UUV members, the system architecture, and other factors, and the control architecture has many layers. In practical research, the UUV formation control problem is typically decoupled into a single-UUV path tracking control sub-problem and a multi-UUV formation control sub-problem.
Path tracking control means that a single UUV starts from a given initial state and, under the continuous excitation of its path tracking controller, follows a given smooth path to complete the tracking task. Each UUV contains an independent path tracking controller whose goal is to drive the error between the UUV's real-time position and the desired path position to zero. The path tracking task is posed from the perspective of a single UUV; as UUV technology has matured, having multiple UUVs form a formation to execute tasks cooperatively has become an inevitable path for UUV development.
Multi-UUV formation control refers to the control technology that adjusts the speed, heading and pose of each UUV in a formation system according to the states of the other members, so that the multiple UUVs navigate in a coordinated formation; its core is multi-UUV training technology. The formation control task arises when multiple UUVs execute tasks cooperatively, and belongs to the macro level of the cluster field. Because the concept of UUV cluster formation was proposed relatively late, this task is currently a main research direction; how to realize intelligence is the key point, and integrating multiple known algorithms or developing new ones has become the main solution.
Mainstream UUV formation control algorithms include the artificial potential field method, the leader-follower method, the virtual structure method, the behavior-based control method, and the like.
The artificial potential field method was first proposed by Khatib in 1985. Its guiding idea is to map the motion of an agent in the external environment into a virtual potential field, in which the target point exerts an attractive force on the agent while obstacles and other threats exert repulsive forces; the agent accelerates and moves under the resultant of the forces of the two virtual potential fields. The standard mathematical form of this method is recalled below.
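For reference, the quadratic potentials commonly used in the literature for this method (a standard form, not a formula recited in this application) are:

```latex
U_{att}(q) = \tfrac{1}{2}\,k_{a}\,\rho^{2}(q, q_{goal}), \qquad
U_{rep}(q) =
\begin{cases}
\tfrac{1}{2}\,k_{r}\left(\dfrac{1}{\rho(q, q_{obs})} - \dfrac{1}{\rho_{0}}\right)^{2}, & \rho(q, q_{obs}) \le \rho_{0},\\[1ex]
0, & \rho(q, q_{obs}) > \rho_{0},
\end{cases}
```

with the resultant force $F(q) = -\nabla U_{att}(q) - \nabla U_{rep}(q)$, where $\rho(\cdot,\cdot)$ denotes distance, $k_{a}$ and $k_{r}$ are gain coefficients, and $\rho_{0}$ is the influence radius of an obstacle.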
The leader-follower method is one of the most commonly used methods in UUV cluster formation control. Its idea is to designate one member of the UUV formation as the leader and the other members as followers; the leader follows a path preset by its path tracking controller, and the followers perform formation control according to their position and velocity errors relative to the leader.
The virtual structure method, proposed by Lewis et al., is a centralized control method. Its main idea is to regard the multi-UUV system as a virtual rigid structure in which each UUV is a point with a fixed relative position; all UUVs in the system move with reference to the virtual geometric center.
Behavior-based control method: formation control is decomposed into a series of basic behaviors, including formation forming, following, obstacle avoidance and the like, and formation motion control is realized through the synthesis of these behaviors, so that each individual in the system can complete tasks in cooperation with the other individuals according to its own decisions.
However, when multiple UUVs perform complex tasks, the above methods face the following problems:
(1) Artificial potential field method: easy to implement and effective for formation reconstruction and cooperative obstacle avoidance, but the potential field function is difficult to design and the method easily falls into local extrema.
(2) Leader-follower method: the leader is independent of the followers and has difficulty obtaining information such as the followers' speed and pose; once the leader fails or loses contact with the followers, the UUV formation system cannot operate normally, so the robustness and reliability of the system are weak.
(3) Virtual structure method: because the whole team must move as a rigid structure, the method has low flexibility and is not suitable for cooperative obstacle avoidance or formation reconstruction, which limits its range of application.
(4) Behavior-based control method: control instructions are formed from preset information and trigger conditions, which reduces the flexibility and adaptability of the system.
Therefore, how to better control a multi-UUV formation is a pressing problem.
In view of the above, the embodiment of the application provides a method, a system and equipment for simulating formation control of an unmanned underwater vehicle.
To help readers understand the scheme of the application, the architecture of the unmanned underwater vehicle formation control simulation system provided by the embodiment of the application is introduced first.
Fig. 1a is a schematic diagram of an unmanned underwater vehicle formation control simulation system according to an embodiment of the present application. As shown in fig. 1a, the system comprises:
a plurality of UUV agent models, a simulation interaction module and a simulation environment; the simulation environment comprises a plurality of simulation objects; a simulation object is a component that simulates an unmanned underwater vehicle in the simulation environment;
the UUV agent models are used for acquiring environmental situation data and outputting decision action information to the simulation objects according to the environmental situation data; the UUV agent models are obtained by cooperative training in the CTDE paradigm and adversarial training based on a league population of multiple policy types;
the simulation interaction module is used for transmitting environmental situation data to the UUV agent models and transmitting the decision action information of the UUV agent models to the corresponding simulation objects;
the simulation object is used for executing corresponding formation actions according to the decision action information.
In the embodiment of the application, the plurality of UUV agent models are cooperatively trained in the CTDE paradigm, which stabilizes the training process, and are adversarially trained against a league population of multiple policy types, which makes them robust. The UUV agent models control the simulation objects to execute formation actions according to the environment information, realizing unmanned underwater vehicle formation control simulation in a virtual environment, which offers guidance for real unmanned underwater vehicle formation control scenarios.
In one example, the simulation environment further comprises: simulation environment interfaces, simulation engines, etc.
Fig. 1b is a schematic structural diagram of the simulation interaction module according to an embodiment of the present application. As shown in fig. 1b, the simulation interaction module comprises: a task planning component, a situation collection component, a model control component, an operation control component, a codec component, and the like. The plurality of UUV agents send request or control instructions to the simulation interaction module through the codec component, and the task planning, situation collection, model control and operation control components return the planning data, situation data and control results obtained from the simulation environment to the UUV agents according to the instruction content. After acquiring task and situation information from the simulation environment through the simulation interaction module, an agent model makes a decision through its neural network and then outputs decision action information, which is transmitted to the simulation environment through the simulation interaction module. The simulation interaction module thus interfaces the plurality of UUV agent models with the simulation environment to realize control of the simulation objects.
When a UUV agent model performs formation simulation control, the environmental situation data is passed through feature engineering and input into the neural network; after fusion of the overall situation is completed, the situation is further evaluated to produce decision action information. The decision action information is decoded into specific action instructions, target selection instructions and sensor selection instructions, which are output through the simulation interaction module to control the UUV simulation objects, as sketched below.
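The following Python sketch illustrates one possible shape of this interaction loop. All class, method and field names here are illustrative assumptions; the application does not disclose a concrete API.

```python
# Hypothetical sketch of the agent/simulation interaction loop described above.
# All identifiers are assumed for illustration; none are from the application.

def encode_features(raw_situation: dict) -> list:
    # Codec component: flatten raw situation data into a feature vector.
    return [raw_situation.get(k, 0.0) for k in ("lon", "lat", "depth", "speed")]

def decode_decision(decision: int) -> str:
    # Codec component: map a discrete decision index to an action instruction.
    return ["turn", "dive", "change_speed", "sensor_on"][decision]

class SimulationInteraction:
    """Bridges the UUV agent models and the simulation environment."""
    def __init__(self, sim_env):
        self.sim_env = sim_env          # assumed simulation-environment interface

    def step(self, agents):
        for agent in agents:
            raw = self.sim_env.query_situation(agent.agent_id)  # situation collection
            obs = encode_features(raw)                          # feature engineering
            decision = agent.act(obs)                           # neural-network decision
            self.sim_env.control(agent.agent_id,
                                 decode_decision(decision))     # operation control
```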
The embodiment of the application adopts the following three technologies at different stages of realizing the plurality of UUV agent models:
1) In the modeling stage, an agent model construction technique based on the combination of rules and networks is used, which alleviates the dimensional explosion problem of the plurality of UUV agent models.
2) In the training stage, the multi-agent system is cooperatively trained in the CTDE (Centralized Training, Decentralized Execution) paradigm, which makes the training process more stable.
3) In the optimization stage, an agent evolution training technique based on a tournament mechanism widens the applicable range of the multi-UUV agent models and enhances their robustness.
The unmanned underwater vehicle formation control simulation method provided by the embodiment of the application is applied to the unmanned underwater vehicle formation control simulation system. The method comprises the following steps:
acquiring environmental situation data from a simulation environment by a plurality of UUV agent models; outputting decision action information according to the environmental situation data so that the corresponding simulation objects execute corresponding formation actions according to the decision action information; the UUV agent models are obtained by cooperative training in the CTDE paradigm followed by adversarial training based on a league population of multiple policy types; a simulation object is a component that simulates an unmanned underwater vehicle in the simulation environment.
Referring to fig. 2, fig. 2 is a flowchart of multi-agent training provided in an embodiment of the present application. As shown in fig. 2, the method includes S201 to S203:
s201, constructing a plurality of first agent models.
The first agent models are the initialized UUV agent models; after cooperative training and adversarial training optimization, they become the agent models that output the policy in the final simulated formation control.
The agent models used in the embodiment of the application (including the constructed first agent models, the trained agent models and the agent models during training) comprise a neural network module and a knowledge rule module; the neural network module comprises a decision network and a value network.
The combination modes of rules and neural networks in each agent model include lateral combination, longitudinal combination and hybrid combination.
In one embodiment, the outputting decision action information according to the environmental situation data includes:
the neural network module responds to the input of environment situation data and outputs first decision action information;
and the knowledge rule module responds to the input of the environmental situation data and outputs second decision action information according to a first preset rule.
Fig. 3 is a schematic diagram of the lateral combination of rules and a neural network according to an embodiment of the present application. In the lateral combination mode, the command decision content is decomposed laterally: the neural network model is responsible for the command decisions of some units, and the decision rules are responsible for the command decisions of the remaining units. As shown in fig. 3, the environmental situation data is input into the neural network and the knowledge rule module separately; the neural network outputs first action information to command part of the agent's action functions, and the knowledge rule module outputs corresponding second action information according to preset rules to command the remaining action functions. A sketch of this mode follows.
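The following Python sketch shows one way the lateral combination could be organized. The class names, the random stand-in policy and the depth rule are assumptions for illustration only.

```python
# Hypothetical sketch of the lateral combination: the network commands some
# action channels while a rule module commands the others, side by side.
import random

class TinyPolicyNet:
    """Stand-in for the neural network module (random policy, illustration only)."""
    def act(self, obs: list) -> str:
        return random.choice(["turn_left", "turn_right", "hold_course"])

class RuleModule:
    """Stand-in for the knowledge rule module (assumed 'first preset rule')."""
    def act(self, obs: list) -> str:
        depth = obs[0]
        return "ascend" if depth > 100.0 else "hold_depth"

def lateral_decision(obs: list) -> dict:
    # Both sub-modules see the same situation data but command different units.
    return {
        "maneuver": TinyPolicyNet().act(obs),    # first decision action information
        "depth_control": RuleModule().act(obs),  # second decision action information
    }

print(lateral_decision([120.0]))
```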
In one embodiment, the outputting decision action information according to the environmental situation data includes:
the neural network module responds to the input of the environmental situation data and outputs high-level task decision information;
and the knowledge rule module outputs low-level decision action information according to the high-level task decision information and a second preset rule.
Fig. 4 is a schematic diagram of the longitudinal combination of rules and a neural network according to an embodiment of the present application. In the longitudinal combination mode, the command decision content is divided into multiple layers by decision granularity, and the different layers are realized by different decision algorithms; for example, the neural network is responsible for top-level macroscopic task-level decisions, while the knowledge rules handle the decision content of the low-level task execution layer. As shown in fig. 4, the neural network outputs macro actions characterizing high-level instructions in response to input from the simulation environment; the macro actions are input into the knowledge rules, which then output the corresponding low-level decision action information. A sketch of this mode follows.
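A minimal Python sketch of the longitudinal mode, assuming a two-level hierarchy (macro task expanded into concrete maneuvers); the task names and expansion table are invented for illustration.

```python
# Hypothetical sketch of the longitudinal combination: the network picks a
# high-level macro task, then a rule table expands it into low-level actions.
MACRO_TASKS = ["search", "follow_leader", "evade"]

def network_macro_decision(obs: list) -> str:
    # Stand-in for the neural network's top-level task decision.
    return MACRO_TASKS[int(obs[0]) % len(MACRO_TASKS)]

# Assumed 'second preset rule': expand each macro task into maneuvers.
TASK_EXPANSION = {
    "search":        ["set_speed_low", "sweep_sonar"],
    "follow_leader": ["match_heading", "match_speed"],
    "evade":         ["dive", "set_speed_high"],
}

def longitudinal_decision(obs: list) -> list:
    macro = network_macro_decision(obs)   # high-level task decision information
    return TASK_EXPANSION[macro]          # low-level decision action information

print(longitudinal_decision([2.0]))       # ['dive', 'set_speed_high']
```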
In one embodiment, the knowledge rule module includes a task layer knowledge rule module and an execution layer knowledge rule module;
the outputting decision action information according to the environmental situation data comprises the following steps:
the neural network module responds to the input of environment situation data and outputs first task information;
the task layer knowledge rule module responds to the input of the environmental situation data and outputs second task information according to a third preset rule;
and the execution layer knowledge rule module responds to the input of the first task information and/or the second task information and outputs decision action information according to a fourth preset rule.
Fig. 5 is a schematic diagram of the hybrid combination of rules and a neural network according to an embodiment of the present application. As shown in fig. 5, the hybrid combination mode combines the above two modes: hierarchical decisions spanning a top layer and a bottom layer, and lateral decisions divided by unit type, finally integrated into a unified output. The environmental situation data is input into the neural network and the task layer knowledge rule module separately; the neural network outputs a first task, and the task layer knowledge rule module outputs a second task. The execution layer knowledge rule module combines the first task and the second task and outputs the decision actions.
The knowledge rules can be implemented with a behavior tree. A behavior tree is a directed tree whose structure can be regarded approximately as a decision tree: its leaf nodes, called execution nodes, are responsible for executing the defined behavior logic at the appropriate time, while its non-leaf nodes, called control nodes, comprise composite nodes and decorator nodes and are responsible for selecting the appropriate set of execution nodes to run.
In each cycle the behavior tree performs a traversal from the root node to the leaf nodes; each traversal signal passed from a parent node to a child node is called a tick. When a node is visited and executed, it must report an execution result to its parent node: success, failure or running. The parent node decides the direction of the decision by checking the execution result.
Fig. 6a is a schematic diagram of a behavior tree structure according to an embodiment of the present application. As shown in fig. 6a, the functional nodes of a behavior tree can be divided into three types. The composite nodes determine the structure of the behavior tree, allowing it to adapt to different logic requirements by visiting different nodes according to the corresponding constraint conditions (such as condition 1 to condition 6). Fig. 6b is a schematic diagram of the behavior tree node composition according to an embodiment of the present application. As shown in fig. 6b, the composite nodes comprise sequence nodes, selector nodes and parallel nodes, and the interpretation of the various constraint conditions is mainly concentrated in the composite nodes. The decorator nodes mainly wrap leaf nodes and composite nodes to realize more complex logic; they mainly comprise gate nodes, loop nodes and limit nodes. For the agent model behavior simulating the unmanned underwater vehicle, specific behaviors such as continuing a maneuver or terminating a maneuver in certain states can be realized in this way. The leaf nodes, at the lowest layer, have no child nodes and represent the actions taken directly in the environment; they comprise action nodes and precondition nodes. A sketch of a minimal behavior tree follows.
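A minimal Python sketch of sequence and selector nodes, two of the composite node types named above; the status values, blackboard layout and example tree are illustrative assumptions, not structures disclosed by the application.

```python
# Hypothetical minimal behavior tree: selector/sequence composites over
# condition and action leaves, ticked once per decision cycle.
SUCCESS, FAILURE, RUNNING = "success", "failure", "running"

class Leaf:
    def __init__(self, fn):
        self.fn = fn
    def tick(self, blackboard):
        return self.fn(blackboard)            # report result to the parent

class Sequence:
    """Succeeds only if all children succeed, in order."""
    def __init__(self, *children): self.children = children
    def tick(self, bb):
        for child in self.children:
            status = child.tick(bb)
            if status != SUCCESS:
                return status                 # propagate failure/running upward
        return SUCCESS

class Selector:
    """Returns the first non-failure child result."""
    def __init__(self, *children): self.children = children
    def tick(self, bb):
        for child in self.children:
            status = child.tick(bb)
            if status != FAILURE:
                return status
        return FAILURE

# Example: evade if a threat is present, otherwise keep formation.
tree = Selector(
    Sequence(Leaf(lambda bb: SUCCESS if bb["threat"] else FAILURE),
             Leaf(lambda bb: bb["actions"].append("evade") or SUCCESS)),
    Leaf(lambda bb: bb["actions"].append("keep_formation") or SUCCESS),
)
bb = {"threat": False, "actions": []}
tree.tick(bb)
print(bb["actions"])   # ['keep_formation']
```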
S202, performing cooperative training on the plurality of first agent models in the CTDE paradigm to obtain a plurality of second agent models.
During CTDE training, a centralized value network deployed in a central controller can obtain the interaction data of all agent models with the simulation environment and use these data to train the models, i.e., centralized training; during execution, each agent outputs its actions only according to its own local observation data, i.e., decentralized execution.
In one embodiment, S202 comprises: deploying a centralized value network on a central controller; each first agent acquires local environmental situation data and inputs it into its own decision network, which outputs decision action information; the local environmental situation data of each first agent and the corresponding decision action information are input into the centralized value network, which assembles global environmental situation observation data from the local data of all agents and outputs a value function according to the global observation data and the decision actions of all agents; the decision network parameters of each first agent are updated using the value of the value function; the above steps are repeated until the value function meets a first preset condition, and the updated first agents are used as the second agent models.
In the centralized training stage, all $m$ agent models participate jointly, and the $m$ policy network parameters $\theta^{1},\dots,\theta^{m}$ are improved together with the value network parameter $w$. Let the current parameters of the $m$ policy networks be $\theta^{1}_{\mathrm{now}},\dots,\theta^{m}_{\mathrm{now}}$, and let the parameters of the current value network and the target network be $w_{\mathrm{now}}$ and $\bar{w}_{\mathrm{now}}$ respectively.
Fig. 7 is a schematic diagram of a centralized training process according to an embodiment of the present application. As shown in figure 7 of the drawings,
Each agent $i$ interacts with the environment to obtain its current local environmental situation data $o^{i}_{t}$, and its policy network outputs a decision action $a^{i}_{t}$:

$a^{i}_{t} \sim \pi\!\left(\cdot \mid o^{i}_{t};\, \theta^{i}_{\mathrm{now}}\right)$

where $\pi$ is the policy function.

Each agent $i$ executes its decision action and obtains environmental feedback, comprising the local environmental situation data $o^{i}_{t+1}$ of the next time step and the reward $r_{t}$ given by the environment.

Each agent transmits its local environmental situation data of the current and next time steps to the central controller, which assembles the current global environmental situation data $s_{t} = \left(o^{1}_{t}, \dots, o^{m}_{t}\right)$ and the global environmental situation data of the next time step $s_{t+1} = \left(o^{1}_{t+1}, \dots, o^{m}_{t+1}\right)$.

Inputting $s_{t}$ into the value network yields the value function estimate of the current state:

$\hat{q}_{t} = q\!\left(s_{t}, a^{1}_{t}, \dots, a^{m}_{t};\, w_{\mathrm{now}}\right)$

Inputting $s_{t+1}$ into the target network yields the value function estimate of the target state:

$\hat{q}_{t+1} = q\!\left(s_{t+1}, a^{1}_{t+1}, \dots, a^{m}_{t+1};\, \bar{w}_{\mathrm{now}}\right)$

The central controller computes the state-action value function error (the TD error):

$\delta_{t} = \hat{q}_{t} - \left(r_{t} + \gamma\, \hat{q}_{t+1}\right)$

where $\delta_{t}$ is the state-action value function error computed by the central controller, $\hat{q}_{t}$ is the value estimate of the current environmental state $s_{t}$, $\hat{q}_{t+1}$ is the target network's value estimate of the state $s_{t+1}$, $r_{t}$ is the reward obtained when the environmental state transitions from $s_{t}$ to $s_{t+1}$, and $\gamma$ is the decay factor weighting the importance of future rewards, typically taking a value in $[0, 1]$.

The central controller updates the value network parameters:

$w_{\mathrm{new}} = w_{\mathrm{now}} - \alpha\, \delta_{t}\, \nabla_{w}\, q\!\left(s_{t}, a^{1}_{t}, \dots, a^{m}_{t};\, w_{\mathrm{now}}\right)$

where $\alpha$ is the value network learning rate, $\nabla_{w} q$ is the gradient of the value function with respect to the parameter $w$ at the environmental state $s_{t}$, and $w_{\mathrm{new}}$ is the new value network parameter.

The central controller updates the target network parameters by a moving average:

$\bar{w}_{\mathrm{new}} = \tau\, w_{\mathrm{new}} + (1 - \tau)\, \bar{w}_{\mathrm{now}}$

where $\tau$ is a hyperparameter controlling the ratio of the new and old weight vectors in the averaging operation, and $\bar{w}_{\mathrm{new}}$ is the updated target network parameter.

Each agent $i$ updates its policy network parameters:

$\theta^{i}_{\mathrm{new}} = \theta^{i}_{\mathrm{now}} - \beta\, \delta_{t}\, \nabla_{\theta^{i}} \ln \pi\!\left(a^{i}_{t} \mid o^{i}_{t};\, \theta^{i}_{\mathrm{now}}\right)$

where $\beta$ is the learning rate controlling the step size of each parameter update, $\delta_{t}$ is the state-action value function error at time step $t$, and $\nabla_{\theta^{i}} \ln \pi$ is the gradient of the log-policy with respect to the parameters of agent $i$ at the current time step; the policy function $\pi$ outputs decision action data based on the local environmental situation data received by agent $i$ at time step $t$.
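The update rules above can be sketched compactly in PyTorch-style Python. The network shapes, the single-step setting, the reuse of the current actions as the next-step actions, and all hyperparameter values are assumptions made for illustration, not details disclosed by the application.

```python
# Hypothetical single-step CTDE update (centralized critic, decentralized
# actors), following the update rules above. Shapes and values are assumed.
import torch
import torch.nn as nn

m, obs_dim, act_dim = 3, 8, 4                      # agents, obs size, actions
actors = [nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(),
                        nn.Linear(32, act_dim)) for _ in range(m)]
critic = nn.Sequential(nn.Linear(m * (obs_dim + act_dim), 64), nn.ReLU(),
                       nn.Linear(64, 1))
target = nn.Sequential(nn.Linear(m * (obs_dim + act_dim), 64), nn.ReLU(),
                       nn.Linear(64, 1))
target.load_state_dict(critic.state_dict())
alpha, beta, gamma, tau = 1e-3, 1e-3, 0.99, 0.01

def ctde_step(obs, next_obs, reward):
    # 1) Each actor samples an action from its *local* observation only.
    dists   = [torch.distributions.Categorical(logits=a(o))
               for a, o in zip(actors, obs)]
    acts    = [d.sample() for d in dists]
    one_hot = [torch.nn.functional.one_hot(a, act_dim).float() for a in acts]

    # 2) The centralized critic sees the global state (all obs + all actions).
    s_t = torch.cat(obs + one_hot)
    q_t = critic(s_t)
    with torch.no_grad():                        # target-state value estimate
        next_in = torch.cat(next_obs + one_hot)  # next actions approximated
        q_next  = target(next_in)
        delta   = q_t.detach() - (reward + gamma * q_next)   # TD error

    # 3) Critic update: gradient step on the squared TD error.
    critic_loss = 0.5 * (q_t - (reward + gamma * q_next)) ** 2
    critic.zero_grad(); critic_loss.backward()
    with torch.no_grad():
        for p in critic.parameters():
            p -= alpha * p.grad

    # 4) Decentralized actor updates: theta <- theta - beta * delta * grad log pi.
    for actor, dist, act in zip(actors, dists, acts):
        actor.zero_grad()
        (delta * dist.log_prob(act)).backward()
        with torch.no_grad():
            for p in actor.parameters():
                p -= beta * p.grad

    # 5) Target network moving average: w_bar <- tau*w + (1-tau)*w_bar.
    with torch.no_grad():
        for pt, pc in zip(target.parameters(), critic.parameters()):
            pt.mul_(1 - tau).add_(tau * pc)

obs      = [torch.randn(obs_dim) for _ in range(m)]
next_obs = [torch.randn(obs_dim) for _ in range(m)]
ctde_step(obs, next_obs, reward=torch.tensor(1.0))
```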
In the decentralized execution stage, after training is complete, the value network is no longer needed: each agent makes decisions using only its locally deployed policy network, and the decision process requires no communication. Fig. 8 is a schematic diagram of the decentralized execution process provided in an embodiment of the present application. As shown in fig. 8, agent 1, agent 2, and so on each acquire local environmental situation data, input it into their own decision networks (actors), and output decision actions into the simulation environment.
The embodiment of the application adopts a reinforcement learning algorithm, such as the PPO (Proximal Policy Optimization) algorithm, to optimize the state space design, the action space design, the neural network parameters and the reward function design of the UUV agent models.
The state space design covers the UUV agent model's own state information (such as longitude, latitude, depth, speed and sensor states) and marine environment information (such as sea water temperature and salinity, sea state, ocean currents, sea surface visibility, sea water transparency, sea surface wind speed and wind direction).
The action space design specifies the decision actions of the UUV agent models, including turning, diving, speed changes, sensor control and the like.
Reward function design: the reward functions of the plurality of UUV agent models mainly combine main-line rewards and auxiliary rewards. Main-line rewards are set for achieving and improving the main qualitative and quantitative objectives of the reinforcement learning, such as the plurality of UUV agent models cooperatively finding a given target. Auxiliary rewards add other process-level rewards or penalty terms so that the reward function becomes dense, for example rewarding a UUV agent model for moving within a certain distance of a target. A sketch of such a reward function follows.
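The following sketch shows one way a sparse main-line reward could be densified with an auxiliary shaping term; the weights, magnitudes and distance values are invented for illustration and are not recited in the application.

```python
# Hypothetical reward combining a sparse main-line term (target found)
# with a dense auxiliary term (progress toward the target).
import math

def reward(uuv_pos, target_pos, target_found: bool,
           prev_dist: float, w_aux: float = 0.1) -> float:
    dist = math.dist(uuv_pos, target_pos)
    r_main = 10.0 if target_found else 0.0   # main-line reward (sparse)
    r_aux  = w_aux * (prev_dist - dist)      # auxiliary reward: closing distance
    return r_main + r_aux

# One step: the UUV moved 50 m closer to the target without finding it yet.
print(reward((0.0, 0.0), (950.0, 0.0), False, prev_dist=1000.0))  # 5.0
```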
S203, performing adversarial training on the plurality of second agent models based on a league population of multiple policy types to obtain the plurality of UUV agent models.
A league population of multiple policy types is constructed for the plurality of second agent models obtained after cooperative training; adversarial training within this league population yields the plurality of UUV agent models that simulate the plurality of unmanned underwater vehicles.
S203 comprises: pre-training agent models of a plurality of policy types for each second agent model, the policy types comprising a main agent model, a historical-partner agent model and a main-agent exploiter model; constructing the agent models of the plurality of policy types into a league population; selecting two agent models from the league population for an adversarial match, recording the win rate of each agent model, and storing the agent models back into the league population after each match; repeating the above steps until the win rate of the main agent models meets a second preset condition, and outputting the main agent models with the highest win rates as the UUV agent models.
The agent models of the plurality of policy types are agent models pre-trained from each second agent.
At least three policy types are included:
1) Main agent model: the policy being trained, which will be the agent model that outputs the final policy.
2) Historical-partner model: an agent model targeting the historical weaknesses of all agents in the league population, making the league more robust.
3) Main-agent exploiter model: used to play against the main agent and find its weaknesses, making the main agent more robust.
The agents in the league population are added to an opponent pool; each main agent model preferentially selects agent models with high win rates for pairwise matches, and the win rate of each agent model is recorded. At preset intervals, the main agent model of the corresponding time step is saved and added to the league population. These steps are repeated until the main agent models' win rates against all agents in the opponent pool exceed a certain proportion; training then ends, and the final main agent models are output as the agent models used for formation control simulation.
The embodiment of the application carries out adversarial training based on an agent league population of multiple policy types. When selecting opponents, it follows the priority logic of the Prioritized Fictitious Self-Play (PFSP) algorithm: a harder opponent receives a larger optimization weight, rather than all opponents being weighted equally. A tournament win-rate table records the win-rate matrix of all past games between agents and is updated after every round; this table weights the probability that any pair of agents is selected, so an opponent against which the learner's win rate is lower is more likely to be chosen. A sketch of this opponent-sampling rule follows.
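A minimal sketch of PFSP-style opponent sampling, weighting each candidate by a hard-opponent function of the learner's win rate against it. The weighting function f(p) = (1 - p)^2 is a common choice from the self-play literature and is an assumption here, not a formula recited in the application.

```python
# Hypothetical PFSP-style opponent sampling: candidates the learner rarely
# beats (low win rate p) receive a larger sampling weight f(p) = (1 - p)**2.
import random

def pfsp_sample(win_rates: dict, rng=random) -> str:
    """win_rates maps opponent id -> learner's win rate against it in [0, 1]."""
    ids = list(win_rates)
    weights = [(1.0 - win_rates[i]) ** 2 for i in ids]   # harder => heavier
    if sum(weights) == 0.0:                              # learner beats everyone
        return rng.choice(ids)
    return rng.choices(ids, weights=weights, k=1)[0]

table = {"history_1": 0.9, "history_2": 0.55, "exploiter_1": 0.2}
print(pfsp_sample(table))   # most often 'exploiter_1', the hardest opponent
```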
Fig. 9 is a schematic diagram of the adversarial training process according to an embodiment of the present application. As shown in fig. 9, the process comprises:
For the plurality of second agent models obtained after cooperative training (agent 1, agent 2, agent 3, ... agent n), three agent models are generated from each second agent model, and these agent models are constructed into a league population. According to the win rates of the agent models, strong opponents are selected from the league population for matches; after each match the agents' win rates are evaluated, the win-rate table is updated, and the resulting main agents are stored back into the league population. The adversarial training loops until the main agent models' win rates against all agents in the opponent pool exceed a certain proportion, at which point training ends. The main agents whose win rates exceed the preset value are output as the agents used for formation control simulation.
The embodiment of the application provides a computer readable storage medium on which a computer program is stored; when the program is executed by a processor, the unmanned underwater vehicle formation control simulation method is realized.
In practical applications, the computer-readable storage medium may take the form of any combination of one or more computer-readable media. A computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this embodiment, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer readable program code embodied therein. Such a propagated data signal may take a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that can transmit, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the remote computer case, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
As shown in fig. 10, a schematic structural diagram of a computer device is provided in an embodiment of the present application. The computer device 12 shown in fig. 10 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present application.
As shown in FIG. 10, the computer device 12 is in the form of a general purpose computing device. Components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 10, commonly referred to as a "hard disk drive"). Although not shown in fig. 10, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
The computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the computer device 12, and/or any devices (e.g., network card, modem, etc.) that enable the computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through network adapter 20. As shown in fig. 10, the network adapter 20 communicates with other modules of the computer device 12 via the bus 18. It should be appreciated that although not shown in fig. 10, other hardware and/or software modules may be used in connection with computer device 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example implementing the unmanned underwater vehicle formation control simulation method provided by the embodiments of the present application.
It should be noted that the term "comprising" and variants thereof as used herein are open ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms are given in the description.
It should be noted that the terms "first," "second," and the like herein are merely used for distinguishing between different devices, modules, or units and not for limiting the order or interdependence of the functions performed by such devices, modules, or units.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.
While several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the application. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
The above description is only illustrative of the preferred embodiments of the present application and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to in the present application is not limited to the specific combinations of the technical features described above, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the spirit of the disclosure, for example, solutions formed by replacing the above features with technical features of similar functions disclosed in (but not limited to) the present application.

Claims (10)

1. A method of simulating formation control for an unmanned underwater vehicle, the method comprising:
acquiring environmental situation data from a simulation environment by a plurality of UUV agent models; outputting decision action information according to the environmental situation data so that the corresponding simulation objects execute corresponding formation actions according to the decision action information; the UUV agent models are obtained by cooperative training in the CTDE paradigm and adversarial training based on a league population of multiple policy types; a simulation object is a component that simulates an unmanned underwater vehicle in the simulation environment.
2. The method of claim 1, wherein the training method of the plurality of UUV agent models comprises:
constructing a plurality of first agent models;
performing cooperative training on the plurality of first agent models by adopting a CTDE training mode to obtain a plurality of second agent models;
and performing countermeasure training on the plurality of second agent models based on the multi-strategy-type alliance population to obtain the plurality of UUV agent models.
3. The method of claim 2, wherein performing cooperative training on the plurality of first agent models by adopting the CTDE training mode to obtain the plurality of second agent models comprises:
inputting the local environmental situation data acquired by each first agent into a respective decision network, and outputting decision action information;
inputting the local environment situation data of each first agent and the corresponding output decision action information into a centralized value network, wherein the centralized value network obtains global environment situation observation data according to the local environment situation data of each agent, and outputs a value function according to the global environment situation observation data and the decision action information of each agent;
updating the decision network parameters of each first agent using the value of the value function;
and cycling the above steps until the value of the value function meets a first preset condition, and taking the updated plurality of first agents as the plurality of second agent models.
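For orientation only: a minimal PyTorch sketch of the CTDE (centralized training with decentralized execution) structure recited in claim 3 — per-agent decision networks over local observations, plus one centralized value network over the global observation and all actions. Agent counts, dimensions, and layer sizes are arbitrary assumptions, not values from the patent.

import torch
import torch.nn as nn

N_AGENTS, OBS_DIM, ACT_DIM = 3, 16, 4

class DecisionNet(nn.Module):
    """Per-agent decision network: local situation data -> decision action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, ACT_DIM), nn.Tanh())

    def forward(self, local_obs):
        return self.net(local_obs)

class CentralizedValueNet(nn.Module):
    """Centralized critic: global observation plus all actions -> value."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_AGENTS * (OBS_DIM + ACT_DIM), 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, all_obs, all_actions):
        # Global environmental situation observation data is assembled from
        # every agent's local observation, as the claim describes.
        x = torch.cat([all_obs.flatten(1), all_actions.flatten(1)], dim=1)
        return self.net(x)

actors = [DecisionNet() for _ in range(N_AGENTS)]
critic = CentralizedValueNet()

obs = torch.randn(8, N_AGENTS, OBS_DIM)  # batch of local observations
acts = torch.stack([actors[i](obs[:, i]) for i in range(N_AGENTS)], dim=1)
value = critic(obs, acts)                # value used to update each actor
print(value.shape)                       # torch.Size([8, 1])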
4. The method of claim 3, wherein performing countermeasure training on the plurality of second agent models based on the multi-strategy-type alliance population to obtain the plurality of UUV agent models comprises:
pre-training agent models of a plurality of strategy types for each second agent model, the agent models of the plurality of strategy types including: a main agent model, a historical partner agent model, and a main-agent defect-strategy agent model;
constructing the agent models of the plurality of strategy types into an alliance population;
selecting two agent models from the alliance population for countermeasure training, recording the winning rate of each agent model, and storing the agent models after each round of countermeasure training into the alliance population;
and cycling the above steps until the winning rate of the main agent model meets a second preset condition, and outputting the plurality of main agent models with the highest winning rates as the plurality of UUV agent models.
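A hedged sketch of the alliance-population loop in claim 4: sample two models, run a match, record win rates, return the post-match models to the population, and stop once a main agent's win rate meets the preset condition. play_match is a coin-flip placeholder for a full simulated formation engagement, and all names and thresholds are invented for illustration.

import random

def play_match(agent_a, agent_b):
    # Placeholder: a real implementation would roll out both policies in the
    # simulation environment; returns True if agent_a wins.
    return random.random() < 0.5

def league_training(population, rounds=200, target_win_rate=0.8):
    stats = {}
    for _ in range(rounds):
        a, b = random.sample(population, 2)        # select two agent models
        a_wins = play_match(a, b)
        for agent, won in ((a, a_wins), (b, not a_wins)):
            s = stats.setdefault(agent["name"], {"games": 0, "wins": 0})
            s["games"] += 1
            s["wins"] += int(won)
        # Store the post-match agent models back into the alliance population.
        population.extend([dict(a), dict(b)])
        # Second preset condition: a main agent's win rate reaches the target.
        if any(n.startswith("main") and s["games"] >= 10
               and s["wins"] / s["games"] >= target_win_rate
               for n, s in stats.items()):
            break
    return stats

# Alliance population mixing the three strategy types named in claim 4.
population = [{"name": "main-0"}, {"name": "main-1"},
              {"name": "historical-0"}, {"name": "exploiter-0"}]
print(league_training(population))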
5. The method of claim 1, wherein each UUV agent model comprises: a neural network module and a knowledge rule module.
6. The method of claim 5, wherein outputting decision action information from the environmental situation data comprises:
the neural network module responds to the input of environment situation data and outputs first decision action information;
and the knowledge rule module responds to the input of the environmental situation data and outputs second decision action information according to a first preset rule.
7. The method of claim 5, wherein outputting decision action information from the environmental situation data comprises:
the neural network module responds to the input of the environmental situation data and outputs high-level task decision information;
and the knowledge rule module outputs bottom layer decision action information according to a second preset rule according to the high layer task decision information.
8. The method of claim 5, wherein the knowledge rules module comprises a task layer knowledge rules module and an execution layer knowledge rules module;
the outputting decision action information according to the environmental situation data comprises the following steps:
the neural network module responds to the input of environment situation data and outputs first task information;
the task layer knowledge rule module responds to the input of the environmental situation data and outputs second task information according to a third preset rule;
and the execution layer knowledge rule module responds to the input of the first task information and/or the second task information and outputs decision action information according to a fourth preset rule.
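A toy sketch of the hybrid architecture in claims 5-8, using the task-layer and execution-layer knowledge rule modules of claim 8. The thresholds, rule tables, and field names are invented stand-ins for the claimed "preset rules", not the patent's actual rules.

def neural_module(situation):
    # Stand-in for the trained neural network module: propose a first task.
    return "regroup" if situation["formation_spread_m"] > 50.0 else "hold_formation"

def task_layer_rules(situation):
    # Hypothetical "third preset rule": a nearby obstacle triggers an avoid task.
    return "avoid" if situation["obstacle_range_m"] < 100.0 else None

def execution_layer_rules(task):
    # Hypothetical "fourth preset rule": map tasks to low-level decision actions.
    actions = {"regroup":        {"speed_mps": 2.0, "heading_deg": 0.0},
               "hold_formation": {"speed_mps": 1.0, "heading_deg": 0.0},
               "avoid":          {"speed_mps": 1.5, "heading_deg": 45.0}}
    return actions[task]

def decide(situation):
    first_task = task = neural_module(situation)   # claim 8: first task information
    second_task = task_layer_rules(situation)      # claim 8: second task information
    # Letting rule-derived tasks override the network is a design choice the
    # claims leave open.
    return execution_layer_rules(second_task or first_task)

print(decide({"formation_spread_m": 80.0, "obstacle_range_m": 500.0}))
# -> {'speed_mps': 2.0, 'heading_deg': 0.0}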
9. An unmanned underwater vehicle formation control simulation system, comprising: a plurality of UUV agent models, a simulation interaction module, and a simulation environment; wherein the simulation environment comprises a plurality of simulation objects, and the simulation objects are components used for simulating unmanned underwater vehicles in the simulation environment;
the UUV agent models are used for acquiring environmental situation data from the simulation environment, and outputting decision action information to the simulation objects according to the environmental situation data; the UUV agent models are obtained by performing cooperative training in a CTDE mode and performing countermeasure training based on the multi-strategy-type alliance population;
the simulation interaction module is used for transmitting the environmental situation data to the UUV agent models, and transmitting the decision action information of the UUV agent models to the corresponding simulation objects;
the simulation object is used for executing corresponding formation actions according to the decision action information.
10. An unmanned underwater vehicle formation control simulation device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the unmanned underwater vehicle formation control simulation method of any one of claims 1-8.
CN202311105997.2A 2023-08-30 2023-08-30 Unmanned underwater vehicle formation control simulation method, system and equipment Pending CN116841208A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311105997.2A CN116841208A (en) 2023-08-30 2023-08-30 Unmanned underwater vehicle formation control simulation method, system and equipment


Publications (1)

Publication Number Publication Date
CN116841208A true CN116841208A (en) 2023-10-03

Family

ID=88163821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311105997.2A Pending CN116841208A (en) 2023-08-30 2023-08-30 Unmanned underwater vehicle formation control simulation method, system and equipment

Country Status (1)

Country Link
CN (1) CN116841208A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108646589A (en) * 2018-07-11 2018-10-12 北京晶品镜像科技有限公司 A kind of battle simulation training system and method for the formation of attack unmanned plane
CN109407680A (en) * 2018-12-28 2019-03-01 大连海事大学 The distributed object collaborative allocation of unmanned boat formation reconfiguration
KR101990790B1 (en) * 2018-12-12 2019-06-19 사단법인 한국선급 System for collective collaboration training of ship based virtual reality
CN111694365A (en) * 2020-07-01 2020-09-22 武汉理工大学 Unmanned ship formation path tracking method based on deep reinforcement learning
CN112295229A (en) * 2020-10-28 2021-02-02 中国电子科技集团公司第二十八研究所 Intelligent game confrontation platform
CN112527016A (en) * 2020-12-02 2021-03-19 北京航空航天大学 Intelligent cluster integrated fault-tolerant time-varying formation control method and system
CN113255998A (en) * 2021-05-25 2021-08-13 北京理工大学 Expressway unmanned vehicle formation method based on multi-agent reinforcement learning
CN116430888A (en) * 2023-01-16 2023-07-14 中国人民解放军国防科技大学 Multi-unmanned aerial vehicle air combat strategy generation method, device and computer equipment
CN116466579A (en) * 2023-03-29 2023-07-21 同济大学 Longitudinal formation management control method for heterogeneous internet-connected intelligent trucks


Similar Documents

Publication Publication Date Title
CN111780777B (en) Unmanned vehicle route planning method based on improved A-star algorithm and deep reinforcement learning
US11914376B2 (en) USV formation path-following method based on deep reinforcement learning
CN108803321B (en) Autonomous underwater vehicle track tracking control method based on deep reinforcement learning
CN108873687B (en) Intelligent underwater robot behavior system planning method based on deep Q learning
Ruan et al. Gcs: Graph-based coordination strategy for multi-agent reinforcement learning
CN114460943B (en) Self-adaptive target navigation method and system for service robot
CN113848974B (en) Aircraft trajectory planning method and system based on deep reinforcement learning
Lan et al. Path planning for underwater gliders in time-varying ocean current using deep reinforcement learning
CN115265547A (en) Robot active navigation method based on reinforcement learning in unknown environment
CN113391633A (en) Urban environment-oriented mobile robot fusion path planning method
Levine et al. Learning robotic navigation from experience: principles, methods and recent results
Wei et al. Deep hierarchical reinforcement learning based formation planning for multiple unmanned surface vehicles with experimental results
Jin et al. Safe-Nav: learning to prevent PointGoal navigation failure in unknown environments
CN116661503B (en) Cluster track automatic planning method based on multi-agent safety reinforcement learning
CN115906673B (en) Combat entity behavior model integrated modeling method and system
CN116841208A (en) Unmanned underwater vehicle formation control simulation method, system and equipment
Chen et al. Survey of multi-agent strategy based on reinforcement learning
CN115933712A (en) Bionic fish leader-follower formation control method based on deep reinforcement learning
CN114861368B (en) Construction method of railway longitudinal section design learning model based on near-end strategy
CN114396949B (en) DDPG-based mobile robot apriori-free map navigation decision-making method
CN114355897B (en) Vehicle path tracking control method based on model and reinforcement learning hybrid switching
Li et al. Research on the agricultural machinery path tracking method based on deep reinforcement learning
CN111723941B (en) Rule generation method and device, electronic equipment and storage medium
Geiger et al. Experimental and causal view on information integration in autonomous agents
Wang et al. Intelligent path planning algorithm of Autonomous Underwater Vehicle based on vision under ocean current

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination