CN116992952A - Pre-training method, training method and system for collaborative guidance law model

Info

Publication number
CN116992952A
CN116992952A CN202310852767.6A CN202310852767A CN116992952A CN 116992952 A CN116992952 A CN 116992952A CN 202310852767 A CN202310852767 A CN 202310852767A CN 116992952 A CN116992952 A CN 116992952A
Authority
CN
China
Prior art keywords
training
network
evaluation
moment
value
Prior art date
Legal status
Pending
Application number
CN202310852767.6A
Other languages
Chinese (zh)
Inventor
路鹰
赵大海
胡一帆
韩特
付斌
邱璐莹
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Application filed by Northwestern Polytechnical University
Priority application: CN202310852767.6A
Publication of CN116992952A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/092 Reinforcement learning
    • F MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F41 WEAPONS
    • F41G WEAPON SIGHTS; AIMING
    • F41G3/00 Aiming or laying means
    • F41G3/22 Aiming or laying means for vehicle-borne armament, e.g. on aircraft


Abstract

The application relates to the technical field of control, and in particular to a pre-training method, a training method and a system for a collaborative guidance law model. The pre-training method comprises: obtaining pre-training samples of the slave missile and a pre-training model; performing offline reinforcement learning training on the pre-training model based on the pre-training samples to obtain a pre-training cooperative guidance law model; and using the network parameters of the pre-training evaluation network and the network parameters of the pre-training evaluation target network in the pre-training cooperative guidance law model as the initial network parameters of the evaluation network to be trained and of the evaluation target network to be trained in the cooperative guidance law model training process. This simplifies the actual training process, improves training efficiency, and allows the cooperative guidance law model to be obtained quickly and efficiently.

Description

Pre-training method, training method and system for collaborative guidance law model
Technical Field
The application relates to the technical field of control, in particular to a pre-training method, a training method and a training system of a collaborative guidance law model.
Background
During an engagement, an unmanned aerial vehicle must continuously adjust its own action strategy according to the position, state and strategy of the enemy aircraft, so that it can strike the enemy accurately along a given path or trajectory. When multiple unmanned aerial vehicles fight together, or multiple missiles are launched, target tracking and striking must be completed cooperatively. A time-cooperative attack of all slave missiles on the target can be achieved by introducing a remaining-time cooperative compensation command into the multi-missile time-cooperative guidance law, that is, in a mode in which the master missile guides and the slave missiles track.
In the related art, the neural network is trained directly with samples from the experience pool; however, a large amount of data is required during training, and online model training takes a long time.
Disclosure of Invention
The application aims to provide a pre-training method, a training method and a training system for a collaborative guidance law model, which can improve training efficiency.
The first object of the present application is achieved by the following technical solutions:
in a first aspect, a method for pre-training a collaborative guidance law model is provided, including:
obtaining pre-training samples of the slave missile, wherein each pre-training sample comprises: an environmental state at a first moment, a first reward value corresponding to the first moment, a second reward value corresponding to a second moment, and an environmental state at a third moment;
obtaining a pre-training model, the pre-training model comprising: the pre-training network and the pre-training target network, the pre-training network comprising: a pre-training action network and a pre-training evaluation network, the pre-training target network comprising: a pre-training action target network and a pre-training evaluation target network;
performing reinforcement learning training on the pre-training model according to the pre-training samples until a stopping condition is reached, to obtain a pre-training cooperative guidance law model,
wherein the network parameters of the pre-training evaluation network and the network parameters of the pre-training evaluation target network in the pre-training cooperative guidance law model are used as the initial network parameters of the evaluation network to be trained and the initial network parameters of the evaluation target network to be trained in the cooperative guidance law model training process.
In one possible implementation, the acquiring the pre-training sample of the slave projectile includes:
acquiring a first time environmental state, and acquiring a corresponding action value at a first time by utilizing a collaborative guidance law according to the first time environmental state;
simulating through a training environment according to the action value corresponding to the first moment to obtain an environment state at the second moment and a reward value corresponding to the first moment;
obtaining a second action value corresponding to the second moment according to the second moment environmental state and the cooperative guidance law; and simulating through a training environment according to the action value corresponding to the second moment to obtain the environment state at the third moment and a second rewarding value corresponding to the second moment.
In one possible implementation manner, the simulating through the training environment according to the action value corresponding to the first time to obtain the environmental state at the second time and the reward value corresponding to the first time includes:
Simulating through a training environment according to the action value corresponding to the first moment to obtain a second moment environment state;
determining a lead angle corresponding to the missile at the second moment, a relative distance between the missile and the target and a relative speed between the missile and the target according to the environmental state at the second moment, wherein the missile comprises a master missile and a slave missile;
when the missile hits, determining a first reward value corresponding to the first moment according to the master missile hit moment and the slave missile hit moment;
when the missile has not hit, determining a predicted hit moment corresponding to the master missile and a predicted hit moment corresponding to the slave missile according to the lead angle corresponding to the missile, the relative distance between the missile and the target and the relative speed between the missile and the target; and determining the first reward value corresponding to the first moment according to the predicted hit moment corresponding to the master missile and the predicted hit moment corresponding to the slave missile.
In one possible implementation manner, the performing reinforcement learning training on the pre-training model according to the pre-training sample until a stopping condition is reached, to obtain a pre-training cooperative guidance law model, including:
obtaining an evaluation value according to the first moment environmental state and a pre-training network;
obtaining an initial evaluation target value by utilizing a pre-training target network according to the environmental state at the third moment;
determining an evaluation target value according to the initial evaluation target value, the first reward value and the second reward value;
and performing iterative training on the pre-training model according to the evaluation target value, the evaluation value and the pre-training sample until a stopping condition is reached to obtain the pre-training cooperative guidance law model.
In one possible implementation manner, the determining an evaluation target value according to the initial evaluation target value, the first reward value and the second reward value includes:
determining an intermediate evaluation target value according to the initial evaluation target value, a decay discount coefficient and the second reward value;
and determining the evaluation target value according to the intermediate evaluation target value, the decay discount coefficient and the first reward value.
In one possible implementation manner, the performing iterative training on the pre-training model according to the evaluation target value, the evaluation value and the pre-training sample until reaching a stopping condition to obtain a pre-training cooperative guidance law model includes:
after the calculation of the pre-training model of the samples with the preset number is completed, determining an evaluation network parameter according to the evaluation target value and the evaluation value corresponding to the samples with the preset number, and updating a pre-training evaluation network according to the evaluation network parameter;
after the pre-training evaluation network has been updated for a first preset period, updating the pre-training evaluation target network according to the evaluation network parameters determined after the first preset period;
performing strategy gradient calculation on the pre-training action network according to the updated pre-training evaluation network to obtain an action loss value, and updating network parameters of the pre-training action network according to the action loss value;
after the pre-training action network has been updated for a second preset period, updating the network parameters of the pre-training action target network according to the action loss value determined after the second preset period;
and performing iterative training until a stopping condition is reached to obtain a pre-training cooperative guidance law model.
In one possible implementation, the pre-training sample further includes: the corresponding action value of the first moment,
the strategy gradient calculation is carried out on the pre-training action network according to the updated pre-training evaluation network to obtain an action loss value, and the method comprises the following steps:
evaluating the action value corresponding to the first moment to obtain an offline expert evaluation value corresponding to the first moment;
determining a correction value according to the offline expert evaluation value and the evaluation value corresponding to the first moment;
when the offline expert evaluation value corresponding to the first moment is larger than the evaluation value corresponding to the first moment, the correction value is the 2-norm of the difference between the action value corresponding to the first moment and the predicted action value corresponding to the first moment; otherwise, the correction value is 0; the predicted action value corresponding to the first moment is obtained according to the environmental state at the first moment and the pre-training action network,
And performing strategy gradient calculation on the pre-training action network according to the corrected value and the updated pre-training evaluation network to obtain an action loss value.
In a second aspect, the present application provides a method for training a collaborative guidance law model, including:
obtaining a plurality of training samples, wherein each training sample comprises: an environmental state at a first moment, a first reward value corresponding to the first moment, a second reward value corresponding to a second moment, and an environmental state at a third moment;
according to the training samples, performing iterative reinforcement learning training on the model to be trained to obtain a cooperative guidance law model;
wherein the model to be trained comprises a network to be trained and a target network to be trained, initial parameters of the evaluation network to be trained in the network to be trained and initial parameters of the evaluation target network to be trained in the target network to be trained are the corresponding parameters in a pre-training cooperative guidance law model, and the pre-training cooperative guidance law model is obtained by performing reinforcement learning training according to a plurality of pre-training samples.
In a third aspect, a pre-training system for a collaborative guidance law model is provided, comprising:
a first acquisition module for acquiring pre-training samples of the slave missile, wherein each pre-training sample comprises: an environmental state at a first moment, a first reward value corresponding to the first moment, a second reward value corresponding to a second moment, and an environmental state at a third moment;
A second acquisition module, configured to acquire a pre-training model, where the pre-training model includes: the pre-training network and the pre-training target network, the pre-training network comprising: a pre-training action network and a pre-training evaluation network, the pre-training target network comprising: a pre-training action target network and a pre-training evaluation target network;
a first training module, configured to perform reinforcement learning training on the pre-training model according to the pre-training sample until a stopping condition is reached, to obtain a pre-training cooperative guidance law model,
the network parameters of the pre-training evaluation network and the network parameters of the pre-training evaluation target network in the pre-training cooperative guidance law model are used as the initial network parameters of the evaluation network to be trained and the initial network parameters of the evaluation target network to be trained in the cooperative guidance law model training process.
In a fourth aspect, a training system for a collaborative guidance law model is provided, comprising:
a third obtaining module, configured to obtain a plurality of training samples, where each training sample comprises: an environmental state at a first moment, a first reward value corresponding to the first moment, a second reward value corresponding to a second moment, and an environmental state at a third moment;
The second training module is used for carrying out iterative reinforcement learning training on the model to be trained according to the plurality of training samples to obtain a cooperative guidance law model;
wherein the model to be trained comprises a network to be trained and a target network to be trained, initial parameters of the evaluation network to be trained in the network to be trained and initial parameters of the evaluation target network to be trained in the target network to be trained are the corresponding parameters in a pre-training cooperative guidance law model, and the pre-training cooperative guidance law model is obtained by performing reinforcement learning training according to a plurality of pre-training samples.
In a fifth aspect, there is provided an electronic device comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform operations corresponding to the method shown in any possible implementation of the first aspect.
In a sixth aspect, there is provided another electronic device comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform operations corresponding to the method according to the second aspect.
In a seventh aspect, a computer readable storage medium is provided, the storage medium storing at least one instruction, at least one program, code set, or instruction set, the at least one instruction, at least one program, code set, or instruction set being loaded and executed by a processor to implement a method as shown in any one of the possible implementations of the first aspect.
In an eighth aspect, a computer readable storage medium is provided, the storage medium storing at least one instruction, at least one program, code set, or instruction set, the at least one instruction, at least one program, code set, or instruction set being loaded and executed by a processor to implement a method as described in the second aspect.
In summary, the present application includes at least one of the following beneficial technical effects:
according to the scheme, pre-training samples of the slave missile and a pre-training model are obtained, and the pre-training model is then subjected to offline reinforcement learning training based on the pre-training samples to obtain a pre-training collaborative guidance law model; the network parameters of the pre-training evaluation network and the network parameters of the pre-training evaluation target network in the pre-training collaborative guidance law model are used as the initial network parameters of the evaluation network to be trained and the initial network parameters of the evaluation target network to be trained in the collaborative guidance law model training process, so that the actual training process can be simplified, training efficiency can be improved, and the collaborative guidance law model can be obtained quickly and efficiently.
Drawings
FIG. 1 is a schematic diagram of a master-slave bullet information interaction provided by an embodiment of the present application;
FIG. 2 is a schematic flow chart of a method for pre-training a collaborative guidance law model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a pre-training process according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of a training method of a collaborative guidance law model according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating a specific training process according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a pre-training system for collaborative guidance law model according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a training system for collaborative guidance law model according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Description of the embodiments
The present application is described in further detail below with reference to fig. 1-8.
The embodiments are provided only to explain the present application and are not to be construed as limiting it; modifications to the embodiments that those skilled in the art may make after reading this specification without any creative contribution are protected by patent law within the scope of the claims of the present application.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In addition, the term "and/or" herein merely describes an association relationship between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In this context, unless otherwise specified, the term "/" generally indicates that the associated objects are in an "or" relationship.
Collaborative guidance law design is an important factor affecting missile interception accuracy, and the military requirement for multi-missile cooperative hits becomes increasingly urgent as combat scenarios and combat tasks grow more complex. A collaborative guidance law is typically obtained by analyzing and deriving a remaining-flight-time estimation formula under a small-angle assumption and by neglecting the higher-order terms of the remaining-time expansion; the remaining-time difference between missiles is driven to zero by controlling the overload command of each individual missile, thereby achieving a cooperative strike.
The aim of cooperative guidance is to make the remaining flight times of all missiles equal; if a remaining-flight-time error exists, the action value is continuously adjusted. When the remaining flight times of all missiles are the same, the missiles can achieve a coordinated hit on the target. However, such cooperative guidance laws are difficult to generalize and apply in engineering.
The embodiment of the application provides a scheme for a multi-missile cooperative guidance system: a cooperative guidance law model is designed using reinforcement learning so that all missiles strike the target cooperatively. To allow the cooperative guidance law model to be trained quickly, the model is first pre-trained offline in an appropriate way to obtain network parameters, and these network parameters are then used as the initial parameters of the online training process so that iterative learning can proceed quickly and efficiently.
Both the pre-training process and the training process in the embodiments of the application involve: a master-slave cooperative information topology for information interaction, a multi-missile motion model, and the corresponding cooperative guidance law.
Specifically, a master-slave system information topology is further described.
The problem of N missiles hitting a target time-cooperatively can be converted into the problem of N-1 slave missiles cooperating in time with the master missile, that is, the master missile guides and the slave missiles track. The time at which each slave missile reaches the target is adjusted by introducing a remaining-time cooperative compensation command on top of the slave missile's basic proportional navigation guidance law, so that all slave missiles attack the target at the same time as the master missile. The guidance law of the master missile uses only the master missile's own detection information, whereas each slave missile must receive the state and detection information of the master missile and combine them with its own state for guidance; the slave missiles do not exchange information with one another, so the whole missile group forms a star topology, as shown in fig. 1.
Specifically, the multi-missile motion model is further described; taking the horizontal plane as an example, a planar engagement model is built for each missile and the target, as follows.
Here i = 0, 1, 2, …, n; among all missiles, i = 0 is the master missile and i = 1, 2, …, n are the slave missiles.
The variables of the model are: the relative velocity of the i-th missile with respect to the target; the speed of the target; the lead angle of the target; the speed of the i-th missile; the lead angle of the i-th missile; the rate of change of the line-of-sight angle between the i-th missile and the target; the relative distance from the i-th missile to the target; the line-of-sight angle between the i-th missile and the target; the heading angle of the target; the heading angle of the i-th missile and its rate of change; and the lateral acceleration acting on the i-th missile.
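A planar engagement model consistent with the variables listed above can, as one commonly used form given here only as an assumed illustration, be written as:

\[ \dot r_i = V_T \cos\sigma_{T,i} - V_i \cos\sigma_i, \qquad r_i \dot q_i = V_T \sin\sigma_{T,i} - V_i \sin\sigma_i, \qquad \dot\theta_i = \frac{a_i}{V_i}, \]

where \( r_i \) is the relative distance from the i-th missile to the target, \( q_i \) the line-of-sight angle, \( \sigma_i = \theta_i - q_i \) and \( \sigma_{T,i} = \theta_T - q_i \) the lead angles of the missile and the target, \( \theta_i \) and \( \theta_T \) the heading angles, \( V_i \) and \( V_T \) the speeds, and \( a_i \) the lateral acceleration acting on the i-th missile; the relative velocity of the i-th missile with respect to the target can be taken as \( V_{r,i} = -\dot r_i \).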
Specifically, the framework of the collaborative guidance laws is further described.
In the present application, the cooperative guidance law includes a basic proportional navigation guidance law and/or a cooperative correction term.
In one implementation, during training the cooperative guidance law framework (the training action network) is formed jointly by the basic proportional navigation guidance law and the cooperative correction term; during pre-training, the guidance law framework (the pre-training action network) may be the basic proportional navigation guidance law alone.
Here i is the index of the missile. The basic proportional navigation guidance law forms feedback from the missile's own states and guides the missile to fly toward and attack the target; the master missile (i = 0) uses the basic proportional navigation guidance law, and each slave missile uses the basic proportional navigation guidance law for slave missiles, where N is the proportional navigation coefficient, σ is the lead angle, K is the time-cooperative correction coefficient, t̂_go is the estimate of the remaining flight time, Δt̂_go is the estimated difference between the slave missile's own remaining time and that of the master missile, and r is the relative distance from the missile to the target.
The time-cooperative control term of the guidance law, namely the cooperative correction term g_i, is used to adjust the remaining time from the slave missile to the target; it can be decided by a reinforcement learning method and is represented by a neural network compensation term. Its purpose is to minimize the hit-time difference between the master missile and the slave missile, that is, to minimize |t_f0 − t_fi|, where t_f0 denotes the hit time of the master missile under its guidance law and t_fi denotes the hit time of the i-th missile.
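As one commonly used form consistent with the symbols above (an assumed illustration rather than the exact expressions of the filing), the master and slave guidance commands can be written as:

\[ a_0 = N V_{r,0} \dot q_0, \qquad a_i = N V_{r,i} \dot q_i + g_i, \qquad g_i = K \, \Delta\hat t_{go,i}, \qquad \Delta\hat t_{go,i} = \hat t_{go,i} - \hat t_{go,0}, \]

where the cooperative correction term \( g_i \) is taken as \( K \, \Delta\hat t_{go,i} \) in the classical time-cooperative law and is produced by the action network during reinforcement learning training, with the design objective \( \min_{g_i} |t_{f0} - t_{fi}| \).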
In order to improve training efficiency, the present application performs offline pre-training to obtain the network parameters of the pre-training evaluation network and the network parameters of the pre-training evaluation target network, which are used as the initial network parameters of the evaluation network to be trained and the initial network parameters of the evaluation target network to be trained in the collaborative guidance law model training process.
Specifically, an embodiment of the present application provides a method for pre-training a collaborative guidance law model, as shown in fig. 2. The method provided in the embodiment of the present application may be executed by a training system, and the training system may be a server or a terminal device. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides cloud computing services. The terminal device may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like; the terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein. The method includes:
step S110, obtaining pre-training samples of the slave missile, where each pre-training sample comprises: an environmental state at a first moment, a first reward value corresponding to the first moment, a second reward value corresponding to a second moment, and an environmental state at a third moment;
a preset number of pre-training samples are randomly selected from the experience pool.
Step S120, obtaining a pre-training model, where the pre-training model includes: a pre-training network and a pre-training target network, the pre-training network comprising: a pre-training action network and a pre-training evaluation network, the pre-training target network comprising: a pre-training action target network and a pre-training evaluation target network;
The pre-training model adopted by the embodiment of the application comprises a pre-training action network, a pre-training evaluation network, a pre-training action target network and a pre-training evaluation target network. The pre-training evaluation network and the pre-training evaluation target network may each include two evaluation networks; in practical applications an evaluation network may overestimate values, and because the smaller of the evaluation values obtained by the two evaluation networks is selected, overestimation can be reduced. After training of the action networks (the pre-training action network and the pre-training action target network) is completed, the trained parameters can be used as the initial parameters of the neural network controller to be trained, and the fully trained model can then be used to realize attitude control of the missile.
The action network inputs the environmental state and outputs the action value; the evaluation network (the pre-training evaluation network and the pre-training evaluation target network) inputs the environmental state and the action value, and outputs the corresponding evaluation value. The purpose of the action network is to be able to output an action value that maximizes the evaluation value according to the state; the evaluation network can evaluate and output an evaluation value according to the environmental state and the action value.
It will be appreciated that the pre-training action network and the pre-training action target network are identical in structure, and the pre-training evaluation network and the pre-training evaluation target network are identical in structure.
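Purely as an illustrative sketch (the layer sizes, class names and the use of PyTorch are assumptions and are not part of the filing), the action network and the twin evaluation networks described above could be defined as follows:

    import torch
    import torch.nn as nn

    class ActionNetwork(nn.Module):
        """Maps an environmental state to an action value (cooperative correction command)."""
        def __init__(self, state_dim: int, action_dim: int, max_action: float = 1.0):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 128), nn.ReLU(),
                nn.Linear(128, 128), nn.ReLU(),
                nn.Linear(128, action_dim), nn.Tanh(),
            )
            self.max_action = max_action

        def forward(self, state: torch.Tensor) -> torch.Tensor:
            return self.max_action * self.net(state)

    class TwinEvaluationNetwork(nn.Module):
        """Two evaluation (critic) networks; the smaller of their outputs is used
        to reduce overestimation, as described above."""
        def __init__(self, state_dim: int, action_dim: int):
            super().__init__()
            def make_q() -> nn.Sequential:
                return nn.Sequential(
                    nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
                    nn.Linear(128, 128), nn.ReLU(),
                    nn.Linear(128, 1),
                )
            self.q1, self.q2 = make_q(), make_q()

        def forward(self, state: torch.Tensor, action: torch.Tensor):
            sa = torch.cat([state, action], dim=-1)
            return self.q1(sa), self.q2(sa)

The target networks have the same structure and are initialized as copies of these networks.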
In the embodiment of the application, the pre-training action network can comprise a basic proportional navigation guidance law term and/or a time-cooperative correction term, where the time-cooperative correction term is g_i; the decision can be made by adjusting the remaining time to hit, and preferably the pre-training action network consists of the basic proportional navigation guidance law term.
Step S130, performing reinforcement learning training on the pre-training model according to the pre-training samples until a stopping condition is reached, so as to obtain the pre-training cooperative guidance law model.
The network parameters of the pre-training evaluation network and the network parameters of the pre-training evaluation target network in the pre-training cooperative guidance law model are used as the initial network parameters of the evaluation network to be trained and the initial network parameters of the evaluation target to be trained in the cooperative guidance law model training process.
In the pre-training process, the pre-training model can be initialized with random parameters, and reinforcement learning is then carried out on the initialized pre-training model based on the pre-training samples until a stopping condition is reached. The stopping condition may be that the difference between the hit moment of the master missile and the hit moment of the slave missile is not greater than a preset threshold; the preset threshold may be set according to actual requirements, and the smaller the difference between the master missile hit moment and the slave missile hit moment, the better the model effect, with the preset threshold preferably being 0.
Therefore, in the embodiment of the application, the pre-training samples of the slave missile and the pre-training model are obtained, and the pre-training model is subjected to offline reinforcement learning training based on the pre-training samples to obtain the pre-training collaborative guidance law model; the network parameters of the pre-training evaluation network and the network parameters of the pre-training evaluation target network in the pre-training collaborative guidance law model are used as the initial network parameters of the evaluation network to be trained and the initial network parameters of the evaluation target network to be trained in the collaborative guidance law model training process, so that the actual training process can be simplified, training efficiency can be improved, and the collaborative guidance law model can be obtained rapidly and efficiently.
In one implementation, the manner in which the pre-training samples of the slave cartridges are obtained includes:
acquiring a first time environmental state, and acquiring a corresponding action value at the first time by utilizing a cooperative guidance law according to the first time environmental state;
simulating through a training environment according to the action value corresponding to the first moment to obtain an environment state at the second moment and a reward value corresponding to the first moment;
obtaining a second action value corresponding to the second moment according to the environmental state and the cooperative guidance law at the second moment; and simulating through the training environment according to the action value corresponding to the second moment to obtain the environment state at the third moment and a second rewarding value corresponding to the second moment.
Before the training samples are collected, in order to train the cooperative guidance law algorithm based on reinforcement learning, the method may further include: first, a Markov decision model (s, a, R, s') is built for the slave missile, where s is the environmental state at the current moment, a is the action value corresponding to the current moment, R is the reward function corresponding to the current moment, and s' is the state at the next moment corresponding to the current moment. The observation state is defined from the engagement variables of the missiles and the target (the state components are described in more detail below), including the speed of the target. The goal of the training design is to minimize the error between the slave missile hit time and the master missile hit time, that is, to minimize |t_f0 − t_fi|, where t_f0 is the hit duration of the master missile at hit and t_fi is the hit duration of the slave missile at hit; and the reward R of the agent is designed accordingly.
Specifically, when the pre-training samples are collected, the samples can be obtained through cooperative guidance law interception simulation: the master missile attacks the target directly using the traditional proportional navigation guidance law, and the slave missile adopts a classical time-cooperative guidance law.
Specifically, with the environmental state S_0 at the first moment known, the action network (the cooperative guidance law) predicts the corresponding action value a_0; the action value a_0 is then input into the training environment for simulation to obtain the environmental state S_1 at the second moment and the reward value R_0 corresponding to the first moment. Further, according to the environmental state S_1 at the second moment and the cooperative guidance law, the second action value a_1 corresponding to the second moment is obtained; according to the action value a_1 corresponding to the second moment, simulation is performed in the training environment to obtain the environmental state S_2 at the third moment and the second reward value R_1 corresponding to the second moment. A sample (S_0, R_0, R_1, S_2) is thus obtained and placed in the experience pool; of course, the sample may also be (S_0, a_0, R_0, S_1, a_1, R_1, S_2, a_2).
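A minimal sketch of this sample-collection step (the environment interface env.reset/env.step and the function cooperative_guidance_law are assumed placeholders, not part of the filing):

    def collect_pretraining_samples(env, cooperative_guidance_law, num_samples, experience_pool):
        """Roll the simulated engagement forward two steps at a time and store
        (S0, a0, R0, S1, a1, R1, S2) samples in the experience pool."""
        s0 = env.reset()
        for _ in range(num_samples):
            a0 = cooperative_guidance_law(s0)      # action value at the first moment
            s1, r0, done1 = env.step(a0)           # second-moment state, first reward value
            a1 = cooperative_guidance_law(s1)      # action value at the second moment
            s2, r1, done2 = env.step(a1)           # third-moment state, second reward value
            experience_pool.append((s0, a0, r0, s1, a1, r1, s2))
            s0 = env.reset() if (done1 or done2) else s2
        return experience_pool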
The experience pool helps to remove correlation among samples: successive actions in reinforcement learning are usually strongly correlated, so the samples are placed in the experience pool and a batch of samples is randomly selected during subsequent training, which makes the neural network training more accurate.
Therefore, in the embodiment of the application, the action value corresponding to the first moment can be obtained with the action network according to the environmental state at the first moment, and simulation in the training environment yields the environmental state at the second moment and the reward value corresponding to the first moment; then, according to the environmental state at the second moment, the action value corresponding to the second moment is obtained with the action network, and simulation in the training environment yields the environmental state at the third moment and the reward value corresponding to the second moment, thereby generating a pre-training sample of the slave missile. Using such sample sequences of three consecutive moments during pre-training reduces the error of the evaluation network under wide-range random initialization.
In the embodiment of the application, the process of determining the reward value corresponding to the first moment and the process of determining the reward value corresponding to the second moment are the same, and the two may refer to each other.
In the embodiment of the present application, the reward R of the agent may be designed as follows.
The reward takes a hit reward when the missiles hit and a pre-hit reward before they hit.
The hit reward is related only to the hit-time difference between the master missile and the slave missile; its scale factor can be set as required and its value is greater than 0.
The pre-hit reward is related to the predicted missile hit times; its proportionality coefficient can be set as required and its value is greater than 0, and it depends on the predicted hit time of the master missile and the predicted hit time of slave missile i.
In one possible implementation of the embodiment of the application, the pre-hit reward further contains a term whose function is to constrain the amplitude of the output command, avoiding oversized or saturated commands; the coefficient k_a of this term is greater than 0.
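A minimal sketch of a reward with this shape (the functional forms and the coefficients k1, k2, ka below are assumptions chosen to match the description, not the exact formulas of the filing):

    def reward(hit, t_f0, t_fi, t_hat_f0, t_hat_fi, action, k1=1.0, k2=0.1, ka=0.01):
        """Reward for slave missile i.

        hit                : True once the missiles have hit the target
        t_f0, t_fi         : actual hit times of the master and slave missile (valid on hit)
        t_hat_f0, t_hat_fi : predicted hit times before the hit
        action             : current cooperative correction command, penalised to avoid saturation
        """
        if hit:
            # hit reward: related only to the master/slave hit-time difference
            return -k1 * abs(t_f0 - t_fi)
        # pre-hit reward: related to the predicted hit-time difference,
        # plus an assumed term constraining the command amplitude
        return -k2 * abs(t_hat_f0 - t_hat_fi) - ka * action ** 2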
Furthermore, according to the action value corresponding to the first moment, the simulation is performed through the training environment to obtain the environment state at the second moment and the rewarding value corresponding to the first moment, which comprises the following steps:
simulating through a training environment according to the action value corresponding to the first moment to obtain the environment state at the second moment;
determining the lead angle corresponding to the missile at the second moment, the relative distance between the missile and the target and the relative speed between the missile and the target according to the environmental state at the second moment, wherein the missile comprises the master missile and the slave missile;
when the missile hits, determining the first reward value corresponding to the first moment according to the master missile hit moment and the slave missile hit moment;
when the missile has not hit, determining the predicted hit moment corresponding to the master missile and the predicted hit moment corresponding to the slave missile according to the lead angle corresponding to the missile, the relative distance between the missile and the target and the relative speed between the missile and the target; and determining the first reward value corresponding to the first moment according to the predicted hit moment corresponding to the master missile and the predicted hit moment corresponding to the slave missile.
Specifically, the environmental state at the second moment is obtained from the action corresponding to the first moment, and it includes: the relative distance from the master missile to the target, the line-of-sight angle, the heading angle and the speed of the master missile; the relative distance from the slave missile to the target, the line-of-sight angle, the heading angle and the speed of the slave missile; and the speed of the target. The relative speed between the slave missile and the target is then calculated from the environmental state at the second moment using the formulas of the multi-missile motion model.
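Purely as an illustration of how such a state could be represented (the field names are assumptions):

    from dataclasses import dataclass

    @dataclass
    class EngagementState:
        """Environmental state at one moment, as enumerated above."""
        r_master: float        # relative distance from the master missile to the target
        q_master: float        # line-of-sight angle of the master missile
        theta_master: float    # heading angle of the master missile
        v_master: float        # speed of the master missile
        r_slave: float         # relative distance from the slave missile to the target
        q_slave: float         # line-of-sight angle of the slave missile
        theta_slave: float     # heading angle of the slave missile
        v_slave: float         # speed of the slave missile
        v_target: float        # speed of the target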
Whether the next moment is a terminal state is then judged: if so, the missile is considered to have hit; otherwise, it is considered not to have hit.
Further, when the missile hits, the first reward value corresponding to the first moment is determined using the master missile hit moment and the slave missile hit moment, both of which are obtained from the simulation.
When the missile has not hit, the predicted hit moment corresponding to the master missile and the predicted hit moment corresponding to the slave missile are determined according to the lead angle corresponding to the missile, the relative distance between the missile and the target and the relative speed between the missile and the target; the first reward value corresponding to the first moment is then determined according to the predicted hit moment corresponding to the master missile and the predicted hit moment corresponding to the slave missile.
In another possible implementation of the embodiment of the present application, when the missile has not hit, the predicted hit moment corresponding to the master missile and the predicted hit moment corresponding to the slave missile are calculated according to the lead angle corresponding to the missile, the relative distance between the missile and the target, the relative speed between the missile and the target and the proportional navigation coefficient; the first reward value corresponding to the first moment is then determined according to the predicted hit moment corresponding to the master missile and the predicted hit moment corresponding to the slave missile.
Therefore, after simulation according to the action value corresponding to the first moment yields the environmental state at the second moment, the states of the master missile and the slave missile at the second moment are determined from that environmental state. When the missiles hit the target, the first reward value is determined from the moments at which the master missile and the slave missile hit the target; when the missiles miss, the hit moments are predicted by combining the lead angle corresponding to the missile, the relative distance between the missile and the target and the relative speed between the missile and the target, and the first reward value is determined from the prediction results, which improves the accuracy of the determined reward value.
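For reference, a commonly used proportional-navigation time-to-go estimate that matches the quantities named above (lead angle \(\sigma\), relative distance \(r\), relative closing speed \(V_r\) and navigation coefficient \(N\)) is, as an assumed illustration:

\[ \hat t_{go} = \frac{r}{V_r}\left( 1 + \frac{\sigma^2}{2(2N - 1)} \right), \qquad \hat t_f = t + \hat t_{go}, \]

so that the predicted hit moments of the master missile and the slave missile can be compared before an actual hit occurs.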
Further, in the embodiment of the present application, step S130, performing reinforcement learning training on the pre-training model according to the pre-training samples until a stopping condition is reached to obtain the pre-training cooperative guidance law model, includes: obtaining an evaluation value according to the environmental state at the first moment and the pre-training network; obtaining an initial evaluation target value with the pre-training target network according to the environmental state at the third moment; determining an evaluation target value according to the initial evaluation target value, the first reward value and the second reward value; and performing iterative training on the pre-training model according to the evaluation target value, the evaluation value and the pre-training samples until a stopping condition is reached, to obtain the pre-training cooperative guidance law model.
Referring to fig. 3, which is a schematic diagram of the pre-training provided by the embodiment of the application, action prediction is performed by the pre-training action network in the pre-training network according to the environmental state at the first moment to obtain the predicted action value corresponding to the first moment; the predicted action value corresponding to the first moment and the environmental state at the first moment are then both input into the first pre-training evaluation network and the second pre-training evaluation network to obtain a first evaluation value and a second evaluation value, and the smaller evaluation value is taken as the final evaluation value;
action prediction is performed by the pre-training action target network in the pre-training target network according to the environmental state at the third moment to obtain the predicted target action value corresponding to the third moment; the predicted target action value corresponding to the third moment and the environmental state at the third moment are then input into the first pre-training evaluation target network and the second pre-training evaluation target network to obtain a first initial evaluation target value and a second initial evaluation target value, and the smaller initial evaluation target value is taken as the final initial evaluation target value. In one possible case the predicted target action value is the value directly output by the pre-training action target network, and in another possible case it is the sum of the value output by the pre-training action target network and a noise term, which encourages exploration and yields a more accurate initial evaluation target value;
Then, an intermediate evaluation target value is determined according to the initial evaluation target value, the decay discount coefficient and the second reward value, and the evaluation target value is determined based on the intermediate evaluation target value, the decay discount coefficient and the first reward value. By introducing the decay discount coefficient, multi-step temporal-difference errors are introduced with stepwise decay, so that the finally determined evaluation target value is more accurate.
Specifically, to further reduce the error of the evaluation network under wide-range random initialization, multi-step temporal-difference errors with stepwise decay are introduced into the update estimate of the pre-training evaluation network. That is, the update of the pre-training evaluation network uses a sample of three consecutive moments \((s_t, R_t, R_{t+1}, s_{t+2})\): when the target value \(y\) of the pre-training evaluation target network at moment \(t\) is calculated, the reward \(R_{t+1}\) at moment \(t+1\) and the state \(s_{t+2}\) at moment \(t+2\) are used. Specifically, the initial evaluation target value is the smaller output of the two pre-training evaluation target networks at \((s_{t+2}, \tilde a_{t+2})\), and the evaluation target value \(y\) is then obtained by formula (1):

\[ y = R_t + \gamma R_{t+1} + \gamma^2 \min_{j=1,2} Q_{\theta'_{Q_j}}(s_{t+2}, \tilde a_{t+2}) \]  formula (1);

where the subscript \(\theta'_{Q_j}\) denotes the pre-training evaluation target networks, \(\gamma\) is the decay discount coefficient of the reward, and \(\tilde a_{t+1}\) (the action in state \(s_{t+1}\)) and \(\tilde a_{t+2}\) (the action in state \(s_{t+2}\)) are the outputs of the pre-training action target network in the corresponding states with an exploration disturbance added; they can be calculated by formula (2) and formula (3), respectively:

\[ \tilde a_{t+1} = \mu_{\theta'_\mu}(s_{t+1}) + \varepsilon \]  formula (2);

\[ \tilde a_{t+2} = \mu_{\theta'_\mu}(s_{t+2}) + \varepsilon \]  formula (3);

where the subscript \(\theta'_\mu\) denotes the pre-training action target network, whose update frequency and update amplitude are both low, and \(\varepsilon \sim \mathrm{clip}(\mathcal{N}(0, \tilde\sigma), -c, c)\) is a random exploration disturbance drawn from a normal distribution and subject to an amplitude constraint.
It can be seen that, in the embodiment of the application, exploration noise is added to the output of the action target network in order to avoid local convergence of the agent, eliminate overfitting of the accumulated-return estimate and enhance the agent's generalization over the environment. The expected accumulated return is computed using the smaller of the two evaluation estimates, which prevents overestimation of the accumulated return and thus facilitates training convergence.
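A minimal sketch of this two-step target computation (PyTorch-style code under the assumptions above; gamma, noise_std and noise_clip are hyperparameters):

    import torch

    @torch.no_grad()
    def two_step_target(r_t, r_t1, s_t2, actor_target, critic_target,
                        gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
        """Evaluation target value y of formula (1): y = r_t + gamma * r_t1 + gamma^2 * min(Q1', Q2')."""
        a_t2 = actor_target(s_t2)
        # target-policy smoothing: clipped Gaussian exploration disturbance (formulas (2)/(3))
        noise = (torch.randn_like(a_t2) * noise_std).clamp(-noise_clip, noise_clip)
        a_t2 = (a_t2 + noise).clamp(-max_action, max_action)
        q1, q2 = critic_target(s_t2, a_t2)
        initial_target = torch.min(q1, q2)            # initial evaluation target value
        intermediate = r_t1 + gamma * initial_target  # intermediate evaluation target value
        return r_t + gamma * intermediate             # evaluation target value y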
For the embodiment of the application, the evaluation target value is determined jointly by the initial evaluation target value obtained from the environmental state at the third moment, the first reward value and the second reward value, so that the accuracy of each update can be improved and the efficiency of pre-training further increased.
In one implementation manner, the iterative training is performed on the pre-training model according to the evaluation target value, the evaluation value and the pre-training sample until the stopping condition is reached to obtain the pre-training cooperative guidance law model, which comprises the following steps:
after the calculation of the pre-training model of the samples with the preset number is completed, determining an evaluation network parameter according to an evaluation target value and an evaluation value corresponding to the samples with the preset number, and updating the pre-training evaluation network according to the evaluation network parameter;
Specifically, the evaluation network loss can be determined from the computed evaluation values and the computed evaluation target values y according to formula (4), and the parameters of the pre-training evaluation network are updated on this basis:

\[ L(\theta_{Q_j}) = \frac{1}{M} \sum_{k=1}^{M} \left( y_k - Q_{\theta_{Q_j}}(s_k, a_k) \right)^2, \quad j = 1, 2 \]  formula (4);

where M is the number of samples, i.e., the preset number.
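Continuing the sketch (assumed PyTorch-style code, not the filing's exact implementation), the corresponding update of the two pre-training evaluation networks could be:

    import torch.nn.functional as F

    def update_critics(critic, critic_optimizer, s_t, a_t, y):
        """One gradient step on formula (4): mean-squared error between y and both evaluation networks."""
        q1, q2 = critic(s_t, a_t)
        critic_loss = F.mse_loss(q1, y) + F.mse_loss(q2, y)
        critic_optimizer.zero_grad()
        critic_loss.backward()
        critic_optimizer.step()
        return critic_loss.item()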
After the pre-training evaluation network has been updated for the first preset period, the pre-training evaluation target network is updated according to the evaluation network parameters determined after the first preset period.
The pre-training evaluation target network has a low update frequency and a small update amplitude; that is, every time the evaluation network parameters of the pre-training evaluation network have been updated for the first preset period, the parameters corresponding to the pre-training evaluation target network are updated once.
In one possible case, the pre-training evaluation target network is updated as \( \theta'_Q \leftarrow \tau \theta_Q + (1 - \tau)\theta'_Q \), where \(\tau\) is much smaller than 1.
In another possible case, the network parameters of the pre-training evaluation target network may be directly set as the network parameters of the pre-training evaluation network corresponding to the current time.
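A soft update of this kind can be sketched as follows (an assumed helper; tau is much smaller than 1):

    def soft_update(target_net, source_net, tau=0.005):
        """theta_target <- tau * theta_source + (1 - tau) * theta_target."""
        for tgt, src in zip(target_net.parameters(), source_net.parameters()):
            tgt.data.copy_(tau * src.data + (1.0 - tau) * tgt.data)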
Performing strategy gradient calculation on the pre-training action network according to the updated pre-training evaluation network to obtain an action loss value, and updating network parameters of the pre-training action network according to the action loss value;
The evaluation network is used to compute the strategy gradient of the action network, and the action network is updated according to the strategy-gradient definition.
After the pre-training action network has been updated for a second preset period, the network parameters of the pre-training action target network are updated according to the action loss value determined after the second preset period.
specifically, the first preset period may be identical to or different from the second preset period, which is not limited in the embodiment of the present application.
In one possible case, the pre-training action target network is updated as \( \theta'_\mu \leftarrow \tau \theta_\mu + (1 - \tau)\theta'_\mu \), where \(\tau\) is much smaller than 1.
In another possible case, the network parameters of the pre-training action target network may be directly set as the network parameters of the pre-training action network corresponding to the current time.
And performing iterative training until a stopping condition is reached to obtain a pre-training cooperative guidance law model.
In the embodiment of the application, a twin-delayed deep deterministic strategy gradient (TD3) algorithm is adopted, which addresses the problem that the estimation deviation of the evaluation network may otherwise be large.
Further, in an embodiment of the present application, the pre-training sample further includes: the corresponding action value of the first moment,
performing strategy gradient calculation on the pre-training action network according to the updated pre-training evaluation network to obtain an action loss value, wherein the strategy gradient calculation comprises the following steps:
evaluating the action value corresponding to the first moment to obtain an offline expert evaluation value corresponding to the first moment;
determining a correction value according to the offline expert evaluation value and the evaluation value corresponding to the first moment;
when the offline expert evaluation value corresponding to the first moment is larger than the evaluation value corresponding to the first moment, the correction value is the 2-norm of the difference between the action value corresponding to the first moment and the predicted action value corresponding to the first moment; otherwise, the correction value is 0; the predicted action value corresponding to the first moment is obtained according to the environmental state at the first moment and the pre-training action network,
And performing strategy gradient calculation on the pre-training action network according to the corrected value and the updated pre-training evaluation network to obtain an action loss value.
For the embodiment of the application, the reinforcement learning evaluation network is pre-trained. The basic framework of the pre-training algorithm is consistent with the TD3 algorithm. In the embodiment of the application, the target value of the pre-training evaluation network can be obtained by equations (2), (5) and (6):

\[ y = R + \gamma \min_{j=1,2} Q_{\theta'_{Q_j}}(s', \tilde a') \]  equation (5);

\[ L(\theta_{Q_j}) = \frac{1}{N} \sum \left( y - Q_{\theta_{Q_j}}(s, a) \right)^2 \]  equation (6);

where (s, a, R, s') is the state, action, reward and next-state transition obtained by the cooperative guidance law interception simulation, \(\tilde a'\) is a random exploration action obtained around the action output by the target network after policy smoothing, \(\theta'_{Q_1}\) and \(\theta'_{Q_2}\) respectively denote the parameters of the two pre-training evaluation target networks, and N denotes the number of samples of the slave missile.
In the pre-training algorithm, the pre-training action network parameters \(\theta_\mu\) are updated according to the strategy gradient. For the embodiment of the application, the difference between the evaluation value of the action output by the network and the evaluation value of the offline expert strategy's action is introduced to determine a correction value G, and the strategy gradient is corrected according to the correction value G.
Specifically, the offline expert evaluation value of the action value corresponding to the first moment, obtained on the basis of the action network (i.e., the cooperative guidance law), is compared with the evaluation value of the pre-training action network corresponding to the first moment, and the correction value is calculated according to equation (7): if the offline expert evaluation value corresponding to the first moment is larger than the evaluation value corresponding to the first moment, then G is the 2-norm of the difference between the two actions (the action value corresponding to the first moment in the pre-training sample and the predicted action value corresponding to the first moment obtained from the environmental state at the first moment and the pre-training action network); otherwise G is 0:

\[ G = \begin{cases} \left\lVert a - \mu_{\theta_\mu}(s) \right\rVert_2, & Q(s, a) > Q(s, \mu_{\theta_\mu}(s)) \\ 0, & \text{otherwise} \end{cases} \]  equation (7);

The loss function of the strategy gradient is then calculated according to equation (8), which combines the evaluation value of the action output by the pre-training action network with the correction value G weighted by \(\lambda\), where \(\lambda\) is an adjustment parameter.
It can be seen that, in the embodiment of the present application, the parameters of the action network are corrected by adding the correction term, so that the training efficiency can be improved.
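A minimal sketch of the corrected policy-gradient update of equations (7) and (8), assuming PyTorch modules for the pre-training action network and the first evaluation network; the adjustment parameter `lam` and all function and variable names are illustrative rather than the patent's own.

```python
import torch

def actor_update_with_correction(actor, critic1, expert_actions, states,
                                 actor_optim, lam=0.1):
    """Policy-gradient update of the pre-training action network with the
    expert-based correction term (cf. equations (7)-(8))."""
    pred_actions = actor(states)

    with torch.no_grad():
        q_expert = critic1(states, expert_actions)   # offline expert evaluation value
        q_pred = critic1(states, pred_actions)       # evaluation value of predicted action
        # The correction applies only where the expert action scores higher (equation (7)).
        mask = (q_expert > q_pred).float()

    # Two-norm between the expert action and the predicted action, gated by the mask.
    correction = mask.squeeze(-1) * torch.norm(expert_actions - pred_actions, dim=-1)

    # Maximize the evaluation value while pulling toward the expert action whenever
    # the expert action is rated higher by the evaluation network (cf. equation (8)).
    actor_loss = (-critic1(states, pred_actions)).mean() + lam * correction.mean()
    actor_optim.zero_grad()
    actor_loss.backward()
    actor_optim.step()
    return actor_loss.item()
```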
Specifically, a training method of a collaborative guidance law model according to an embodiment of the present application, as shown in fig. 4, includes: step S410-step S420, wherein:
step S410, obtaining a plurality of training samples, where each training sample includes: the first time environmental state, the first rewarding value corresponding to the first time, the second rewarding value corresponding to the second time and the third time environmental state;
step S420, performing iterative reinforcement learning training on the model to be trained according to a plurality of training samples to obtain a collaborative guidance law model;
The model to be trained comprises a network to be trained and a target network to be trained; the initial parameters of the evaluation network to be trained in the network to be trained and the initial parameters of the evaluation target network to be trained in the target network to be trained are the corresponding parameters in a pre-training collaborative guidance law model, and the pre-training collaborative guidance law model is obtained by performing reinforcement learning training according to a plurality of pre-training samples.
In one possible case, only the network parameters of the pre-training evaluation network and the network parameters of the pre-training evaluation target network are used as the initial network parameters of the evaluation network to be trained and the initial network parameters of the evaluation target network to be trained in the cooperative guidance law model training process, so that the initial model to be trained has a better exploration space and the training efficiency can be improved.
In another possible case, all the network parameters of the pre-training network and of the pre-training target network after pre-training are used as the initial network parameters of the model to be trained, so that the network model to be trained converges faster during training.
Specifically, the training process in the embodiment of the present application may refer to the pre-training process, and the related training content is therefore not described in detail again.
In the embodiment of the application, the training evaluation network is used to solve the policy gradient of the training action network, and the training action network is updated according to the definition of the policy gradient. The policy gradient calculation formula (9) can be as follows:

$\nabla_{\phi} J \approx \frac{1}{N}\sum \nabla_{a} Q_{\theta}(s, a)\big|_{a = \pi_{\phi}(s)} \, \nabla_{\phi} \pi_{\phi}(s)$    equation (9);

Specifically, the sampled policy gradient $\nabla_{\phi} J$ is used to update the network parameters of the training action network, where $\nabla_{a} Q_{\theta}(s, a)$ is the gradient of the output of the training evaluation network with respect to the action output by the training action network, $\nabla_{\phi} \pi_{\phi}(s)$ is the gradient of the action output by the training action network with respect to the training action network parameters, $\pi_{\phi}$ is the training action network with $s$ as input and $a$ as output, and $\phi$ denotes the network parameters of the training action network. The derivation adopts the chain rule: theoretically the objective function $J$ (the benefit) should be differentiated with respect to $\phi$; by the chain rule, the training evaluation network $Q$ is first differentiated with respect to the action $a$, and then the action $a$ is differentiated with respect to the action network parameters $\phi$. In addition, since the goal of training the action network is to maximize the evaluation value, the average evaluation value is taken as the benefit.
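The chain-rule relation of equation (9) can be checked with automatic differentiation: backpropagating the average evaluation value through the action yields exactly the product of the two gradients described above. A small sketch, with `actor` and `critic` assumed to be PyTorch modules:

```python
import torch

def sampled_policy_gradient(actor, critic, states):
    """Illustrative check that differentiating the mean evaluation value through
    the action realizes the chain rule of equation (9):
    dJ/dphi = (1/N) * sum dQ/da * da/dphi."""
    actions = actor(states)                   # a = pi_phi(s)
    benefit = critic(states, actions).mean()  # J: average evaluation value
    grads = torch.autograd.grad(benefit, list(actor.parameters()))
    return grads                              # one gradient tensor per actor parameter
```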
Further, the action network in the network to be trained and the action target network in the target network to be trained include: a base proportional guidance law term and/or a cooperative correction term.

In one implementation, the action network in the network to be trained and the action target network in the target network to be trained comprise a base proportional guidance law term and a time-cooperative correction term; for example, the guidance command may be formed as the sum of the base proportional guidance law term and the time-cooperative correction term.
It can be understood that the time-cooperative control term in the guidance law is a complex multi-parameter term, so the control strategy is difficult to obtain in a short time, and the cooperative control term needs to be solved by training with a reinforcement learning method, so as to minimize the difference between the moments at which the master missile and the slave missile hit the target. Because the action strategy is continuous, and in order to prevent overestimation and to train the agent quickly and effectively, the agent is trained with the twin-delayed deep deterministic policy gradient (TD3) algorithm. In the embodiment of the application, the parameters of the evaluation network obtained in pre-training are used as initial parameters, so that the agent can carry out fast and efficient learning-training iterations, thereby alleviating the harsh conditions faced by the collaborative guidance law in engineering applications.
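As a hedged illustration of an action network that combines a base proportional guidance law term with a learned time-cooperative correction term, the sketch below assumes a simple proportional-navigation form a_pn = N · Vc · λ̇ and a bounded correction output; the state layout, gains and bounds are illustrative assumptions and not the patent's exact guidance-law structure.

```python
import torch
import torch.nn as nn

class CooperativeActionNetwork(nn.Module):
    """Action network whose output is a base proportional-navigation term plus a
    learned time-cooperative correction term (illustrative structure only)."""

    def __init__(self, state_dim, hidden_dim=128, nav_gain=4.0, max_correction=10.0):
        super().__init__()
        self.nav_gain = nav_gain
        self.max_correction = max_correction
        self.correction_net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.Tanh(),
        )

    def forward(self, state, closing_speed, los_rate):
        # Base proportional guidance law term.
        a_pn = self.nav_gain * closing_speed * los_rate
        # Time-cooperative correction term, bounded by max_correction.
        a_coop = self.max_correction * self.correction_net(state)
        return a_pn + a_coop
```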
Based on any of the above embodiments, an embodiment of the present application provides a specific training process, please refer to fig. 5, including:
establishing a cooperative guidance law mathematical model;
simulating the engagement environment according to the cooperative guidance law mathematical model, and determining the observed environmental state and action value of the agent to obtain pre-training samples and training samples;
The pre-training process comprises the following steps: performing offline pre-training to obtain the evaluation network after pre-training is completed. In order to improve the exploration efficiency of the trained collaborative guidance law in the reinforcement learning process and accelerate the training process, the embodiment of the application adopts an offline reinforcement learning pre-training framework based on LBC.
Training process: the parameters of the evaluation network after pre-training is completed are taken as the initial parameters of the evaluation network during training, and the parameters of the action network after pre-training is completed are taken as the initial parameters of the action network during training. The current environmental state is then obtained, an action value is generated by the training action network, and evaluation values are produced by the two evaluation networks; the training evaluation network and the training action network are updated using the updated environment and the reward value. After the evaluation values and target values of one sequence have been collected, whether the termination condition is reached is judged; if so, training is considered complete and the trained model is obtained, and if not, training continues with the samples of the next sequence. That is, the pre-trained evaluation network and action network are used as the initial network parameters, and training is then carried out: the environmental state is input into the action network, the action network outputs an action, the action and the environmental state are input into the evaluation network, the evaluation network outputs an evaluation value, and the environment is then updated to determine the actual reward, whereby the networks are updated. In summary, the embodiment of the application is based on an improved guidance law structure; in the training process, in order to make the new guidance algorithm quickly applicable to a larger engagement scenario, the action network is first appropriately pre-trained, and the twin-delayed deep deterministic policy gradient algorithm is then combined to carry out fast and efficient learning-training iterations. The reinforcement learning cooperative guidance law obtained by training with the designed framework has the obvious advantages of a larger application range and higher time-cooperation precision.
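The overall training stage described above can be summarized by the following skeleton, in which the evaluation networks and their target copies are initialized from the pre-trained parameters and TD3-style delayed updates follow. The rollout and update routines are passed in as callables and stand in for the procedures sketched earlier; all names, learning rates and periods are illustrative assumptions.

```python
import copy
import itertools
import torch

def train_from_pretrained(actor, critic1, critic2,
                          pretrained_critic1, pretrained_critic2,
                          collect_batch, critic_update_fn, actor_update_fn,
                          steps=100_000, policy_delay=2, tau=0.005):
    """Skeleton of the training stage: the evaluation networks start from the
    pre-trained parameters, after which interaction and delayed updates follow.
    `collect_batch`, `critic_update_fn` and `actor_update_fn` are caller-supplied
    callables (e.g. the sketches given earlier in this description)."""
    # Initialize the evaluation networks from the pre-training cooperative guidance law model.
    critic1.load_state_dict(pretrained_critic1.state_dict())
    critic2.load_state_dict(pretrained_critic2.state_dict())
    target_actor = copy.deepcopy(actor)
    target_critic1 = copy.deepcopy(critic1)
    target_critic2 = copy.deepcopy(critic2)

    critic_optim = torch.optim.Adam(
        itertools.chain(critic1.parameters(), critic2.parameters()), lr=3e-4)
    actor_optim = torch.optim.Adam(actor.parameters(), lr=3e-4)

    for step in range(steps):
        batch = collect_batch(actor)  # one interaction / replay sample
        critic_update_fn(critic1, critic2, target_critic1, target_critic2,
                         target_actor, batch, critic_optim)
        if step % policy_delay == 0:  # delayed action-network and target updates
            actor_update_fn(actor, critic1, batch, actor_optim)
            for net, target in ((actor, target_actor),
                                (critic1, target_critic1),
                                (critic2, target_critic2)):
                for p, tp in zip(net.parameters(), target.parameters()):
                    tp.data.mul_(1.0 - tau).add_(tau * p.data)
    return actor
```

The same skeleton covers both possible cases described above: initializing only the evaluation networks, or additionally copying pre-trained action-network parameters into `actor` before training.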
In the above embodiments, the pre-training method of the collaborative guidance law model is described from the perspective of the method flow. The following embodiment describes a pre-training system of the collaborative guidance law model from the perspective of modules or units, as detailed below.
An embodiment of the present application provides a pre-training system for a collaborative guidance law model, as shown in fig. 6, where the system may include:
a first obtaining module 610, configured to obtain pre-training samples of the slave projectile, where each pre-training sample includes: a first time environmental state, a first rewarding value corresponding to the first time, a second rewarding value corresponding to the second time and a third time environmental state;
a second obtaining module 620, configured to obtain a pre-training model, where the pre-training model includes: a pre-training network and a pre-training target network, the pre-training network comprising: a pre-training action network and a pre-training evaluation network, the pre-training target network comprising: a pre-training action target network and a pre-training evaluation target network;
a first training module 630, configured to perform reinforcement learning training on the pre-training model according to the pre-training sample until a stopping condition is reached, obtain a pre-training collaborative guidance law model,
The network parameters of the pre-training evaluation network and the network parameters of the pre-training evaluation target network in the pre-training cooperative guidance law model are used as the initial network parameters of the evaluation network to be trained and the initial network parameters of the evaluation target network to be trained in the cooperative guidance law model training process.
In one possible implementation of the present invention,
the first acquisition module 610, when executing the acquisition of the pre-training samples of the slave cartridges, is configured to:
acquiring a first time environmental state, and acquiring a corresponding action value at the first time by utilizing a cooperative guidance law according to the first time environmental state;
simulating through a training environment according to the action value corresponding to the first moment to obtain an environment state at the second moment and a reward value corresponding to the first moment;
obtaining a second action value corresponding to the second moment according to the environmental state and the cooperative guidance law at the second moment; and simulating through the training environment according to the action value corresponding to the second moment to obtain the environment state at the third moment and a second rewarding value corresponding to the second moment.
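A minimal sketch of this two-step sample collection, assuming an `env.step(action)` interface that returns the next environmental state and the reward, and a `guidance_law(state)` callable playing the role of the cooperative guidance law acting as the offline expert; both interfaces are illustrative assumptions.

```python
def collect_pretraining_sample(env, guidance_law, s1):
    """Collect one two-step pre-training sample for the slave missile:
    (first state, first action, first reward, second reward, third state)."""
    a1 = guidance_law(s1)   # action value at the first moment (offline expert action)
    s2, r1 = env.step(a1)   # environment state at the second moment, first reward value
    a2 = guidance_law(s2)   # action value at the second moment
    s3, r2 = env.step(a2)   # environment state at the third moment, second reward value
    return s1, a1, r1, r2, s3
```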
In one possible implementation manner, when the first obtaining module 610 performs the simulation according to the action value corresponding to the first time through the training environment to obtain the environmental state at the second time and the reward value corresponding to the first time, the first obtaining module is configured to:
Simulating through a training environment according to the action value corresponding to the first moment to obtain the environment state at the second moment;
determining the lead angle corresponding to each missile at the second moment, the relative distance between the missile and the target, and the relative speed between the missile and the target according to the environmental state at the second moment, wherein the missiles comprise a master missile and a slave missile;
when the missiles hit, determining a first reward value corresponding to the first moment according to the hit moment of the master missile and the hit moment of the slave missile;
when the missiles have not hit, determining the predicted hit moment corresponding to the master missile and the predicted hit moment corresponding to the slave missile according to the lead angle corresponding to each missile, the relative distance between the missile and the target and the relative speed between the missile and the target; and determining the first reward value corresponding to the first moment according to the predicted hit moment corresponding to the master missile and the predicted hit moment corresponding to the slave missile.
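A hedged sketch of this reward determination: the predicted hit moment is approximated from the relative distance, relative speed and lead angle, and the reward penalizes the gap between the master and slave (predicted or actual) hit moments. The estimator, state layout and scaling are assumptions for illustration; the patent's exact reward shaping is not reproduced here.

```python
import math

def predicted_hit_time(rel_distance, rel_speed, lead_angle):
    """Rough time-to-go estimate from relative distance, relative (closing) speed
    and lead angle; an illustrative approximation only."""
    closing_speed = max(rel_speed * math.cos(lead_angle), 1e-3)
    return rel_distance / closing_speed

def cooperative_reward(master_state, slave_state, hit_times=None, scale=1.0):
    """Reward that penalizes the difference between master and slave hit moments.
    `*_state` are (rel_distance, rel_speed, lead_angle) tuples; `hit_times` holds
    the actual hit moments once both missiles have hit."""
    if hit_times is not None:            # both missiles have hit: use actual moments
        t_master, t_slave = hit_times
    else:                                # not yet hit: use predicted hit moments
        t_master = predicted_hit_time(*master_state)
        t_slave = predicted_hit_time(*slave_state)
    return -scale * abs(t_master - t_slave)
```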
In one possible implementation, the first training module 630 is configured to, when performing reinforcement learning training on the pre-training model according to the pre-training sample until a stopping condition is reached, obtain a pre-training cooperative guidance law model:
obtaining an evaluation value according to the first moment environmental state and the pre-training network;
obtaining an initial evaluation target value by utilizing a pre-training target network according to the environmental state at the third moment;
Determining an evaluation target value according to the initial evaluation target value, the first rewarding value and the second rewarding value;
and carrying out iterative training on the pre-training model according to the evaluation target value, the evaluation value and the pre-training sample until a stopping condition is reached to obtain the pre-training cooperative guidance law model.
In one possible implementation, the first training module 630, when executing the determination of the evaluation target value based on the initial evaluation target value, the first reward value, and the second reward value, is configured to:
determining an intermediate evaluation target value according to the initial evaluation target value, the decay discount coefficient and the second reward value;
determining the evaluation target value according to the intermediate evaluation target value, the decay discount coefficient and the first reward value.
In one possible implementation, the first training module 630 is configured to, when performing iterative training on the pre-training model according to the evaluation target value, the evaluation value and the pre-training sample until reaching the stopping condition to obtain the pre-training collaborative guidance law model:
after the calculation of the pre-training model of the samples with the preset number is completed, determining an evaluation network parameter according to the evaluation target value and the evaluation value corresponding to the samples with the preset number, and updating a pre-training evaluation network according to the evaluation network parameter;
after the pre-training evaluation network has been updated for a first preset period, updating the pre-training evaluation target network according to the evaluation network parameters determined after the first preset period;
performing strategy gradient calculation on the pre-training action network according to the updated pre-training evaluation network to obtain an action loss value, and updating network parameters of the pre-training action network according to the action loss value;
after the pre-training action network has been updated for a second preset period, updating the network parameters of the pre-training action target network according to the action loss value determined after the second preset period;
and performing iterative training until a stopping condition is reached to obtain a pre-training cooperative guidance law model.
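A small sketch of the delayed (periodic) target-network updates described in this implementation, assuming the networks are PyTorch nn.Module instances and using a soft-update coefficient `tau`; the preset periods and `tau` are illustrative assumptions.

```python
def maybe_update_targets(step, eval_period, action_period,
                         critic, target_critic, actor, target_actor, tau=0.005):
    """Periodic soft updates: the evaluation target network is refreshed every
    `eval_period` evaluation-network updates, and the action target network every
    `action_period` action-network updates."""
    def soft_update(net, target):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)

    if step % eval_period == 0:
        soft_update(critic, target_critic)     # first preset period
    if step % action_period == 0:
        soft_update(actor, target_actor)       # second preset period
```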
In one possible implementation, the pre-training sample further comprises: the corresponding action value of the first moment,
the first training module 630 is configured to, when performing policy gradient calculation on the pre-training action network according to the updated pre-training evaluation network to obtain an action loss value:
evaluating the action value corresponding to the first moment to obtain an offline expert evaluation value corresponding to the first moment;
determining a correction value according to the offline expert evaluation value and the evaluation value corresponding to the first moment;
when the offline expert evaluation value corresponding to the first moment is larger than the evaluation value corresponding to the first moment, the correction value is the two-norm of the difference between the action value corresponding to the first moment and the predicted action value corresponding to the first moment; otherwise, the correction value is 0; the predicted action value corresponding to the first moment is obtained according to the environmental state at the first moment and the pre-training action network,
And performing strategy gradient calculation on the pre-training action network according to the correction value and the updated pre-training evaluation network to obtain an action loss value.
The pre-training system of the collaborative guidance law model provided by the embodiment of the application is applicable to the above embodiments of the pre-training method of the collaborative guidance law model, and details are not repeated here.
In the above embodiments, the training method of the collaborative guidance law model is described from the perspective of the method flow. The following embodiment describes a training system of the collaborative guidance law model from the perspective of modules or units, as detailed below.
The embodiment of the application provides a training system of a collaborative guidance law model, as shown in fig. 7, the system may include:
a third obtaining module 710, configured to obtain a plurality of training samples, where each training sample includes: the first time environmental state, the first rewarding value corresponding to the first time, the second rewarding value corresponding to the second time and the third time environmental state;
the second training module 720 is configured to perform iterative reinforcement learning training on the model to be trained according to the plurality of training samples, so as to obtain a collaborative guidance law model;
The model to be trained comprises a network to be trained and a target network to be trained; the initial parameters of the evaluation network to be trained in the network to be trained and the initial parameters of the evaluation target network to be trained in the target network to be trained are the corresponding parameters in a pre-training collaborative guidance law model, and the pre-training collaborative guidance law model is obtained by performing reinforcement learning training according to a plurality of pre-training samples.
The training system of the collaborative guidance law model provided by the embodiment of the application is applicable to the above embodiments of the training method of the collaborative guidance law model, and details are not repeated here.
In an embodiment of the present application, as shown in fig. 8, an electronic device 800 shown in fig. 8 includes: a processor 801 and a memory 803. The processor 801 is coupled to a memory 803, such as via a bus 802. Optionally, the electronic device 800 may also include a transceiver 804. It should be noted that, in practical applications, the transceiver 804 is not limited to one, and the structure of the electronic device 800 is not limited to the embodiment of the present application.
The processor 801 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor 801 may also be a combination that implements computing functions, for example, a combination including one or more microprocessors, a combination of a DSP and a microprocessor, and the like.
Bus 802 may include a path to transfer information between the aforementioned components. Bus 802 may be a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, among others. Bus 802 may be classified as an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 8, but this does not mean that there is only one bus or only one type of bus.
The memory 803 may be, but is not limited to, a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disk storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 803 is used to store application code for performing the aspects of the present application and is controlled by the processor 801 for execution. The processor 801 is configured to execute application code stored in the memory 803 to implement what is shown in the foregoing method embodiment.
Electronic devices include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as stationary terminals such as digital TVs and desktop computers. The electronic device shown in fig. 8 is only an example and should not be construed as limiting the functionality and scope of use of the embodiments of the application.
Embodiments of the present application provide a computer-readable storage medium having a computer program stored thereon, which when run on a computer, causes the computer to perform the corresponding method embodiments described above.
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
The foregoing is only a partial embodiment of the present application, and it should be noted that it will be apparent to those skilled in the art that modifications and adaptations can be made without departing from the principles of the present application, and such modifications and adaptations should and are intended to be comprehended within the scope of the present application.

Claims (10)

1. A method of pre-training a collaborative guidance law model, comprising:
obtaining pre-training samples of the slave projectile, wherein each pre-training sample comprises: a first time environmental state, a first rewarding value corresponding to the first time, a second rewarding value corresponding to the second time and a third time environmental state;
obtaining a pre-training model, the pre-training model comprising: the pre-training network and the pre-training target network, the pre-training network comprising: a pre-training action network and a pre-training evaluation network, the pre-training target network comprising: a pre-training action target network and a pre-training evaluation target network;
performing reinforcement learning training on the pre-training model according to the pre-training sample until a stopping condition is reached to obtain a pre-training cooperative guidance law model,
wherein the network parameters of the pre-training evaluation network and the network parameters of the pre-training evaluation target network in the pre-training cooperative guidance law model are used as the initial network parameters of the evaluation network to be trained and the initial network parameters of the evaluation target network to be trained in the cooperative guidance law model training process.
2. The method of pre-training a collaborative guidance law model according to claim 1, wherein the obtaining pre-training samples of slave projectiles comprises:
acquiring a first time environmental state, and acquiring a corresponding action value at a first time by utilizing a collaborative guidance law according to the first time environmental state;
simulating through a training environment according to the action value corresponding to the first moment to obtain an environment state at the second moment and a reward value corresponding to the first moment;
obtaining a second action value corresponding to the second moment according to the second moment environmental state and the cooperative guidance law; and simulating through a training environment according to the action value corresponding to the second moment to obtain the environment state at the third moment and a second rewarding value corresponding to the second moment.
3. The method for pre-training the collaborative guidance law model according to claim 2, wherein the simulating by the training environment according to the action value corresponding to the first moment to obtain the environmental state at the second moment and the reward value corresponding to the first moment includes:
simulating through a training environment according to the action value corresponding to the first moment to obtain a second moment environment state;
determining the lead angle corresponding to each missile at the second moment, the relative distance between the missile and the target and the relative speed between the missile and the target according to the environmental state at the second moment, wherein the missiles comprise a master missile and a slave missile;
when the missiles hit, determining the first reward value corresponding to the first moment according to the hit moment of the master missile and the hit moment of the slave missile;
when the missiles have not hit, determining the predicted hit moment corresponding to the master missile and the predicted hit moment corresponding to the slave missile according to the lead angle corresponding to each missile, the relative distance between the missile and the target and the relative speed between the missile and the target; and determining the first reward value corresponding to the first moment according to the predicted hit moment corresponding to the master missile and the predicted hit moment corresponding to the slave missile.
4. The method for pre-training a collaborative guidance law model according to claim 1, wherein the performing reinforcement learning training on the pre-training model according to the pre-training sample until a stopping condition is reached, to obtain a pre-training collaborative guidance law model, comprises:
obtaining an evaluation value according to the first moment environmental state and a pre-training network;
obtaining an initial evaluation target value by utilizing a pre-training target network according to the environmental state at the third moment;
determining an evaluation target value according to the initial evaluation target value, the first rewarding value and the second rewarding value;
and performing iterative training on the pre-training model according to the evaluation target value, the evaluation value and the pre-training sample until a stopping condition is reached to obtain the pre-training cooperative guidance law model.
5. The method of pre-training a collaborative guidance law model according to claim 4, wherein determining an evaluation target value based on the initial evaluation target value, the first rewards value, and the second rewards value comprises:
determining an intermediate evaluation target value according to the initial evaluation target value, the decay discount coefficient and the second reward value;
and determining the evaluation target value according to the intermediate evaluation target value, the decay discount coefficient and the first reward value.
6. The method for pre-training a collaborative guidance law model according to claim 4, wherein the iterative training of the pre-training model based on the evaluation target value, the evaluation value and the pre-training sample until a stop condition is reached comprises:
after the calculation of the pre-training model of the samples with the preset number is completed, determining an evaluation network parameter according to the evaluation target value and the evaluation value corresponding to the samples with the preset number, and updating a pre-training evaluation network according to the evaluation network parameter;
after the pre-training evaluation network has been updated for a first preset period, updating the pre-training evaluation target network according to the evaluation network parameters determined after the first preset period;
Performing strategy gradient calculation on the pre-training action network according to the updated pre-training evaluation network to obtain an action loss value, and updating network parameters of the pre-training action network according to the action loss value;
after the pre-training action network has been updated for a second preset period, updating the network parameters of the pre-training action target network according to the action loss value determined after the second preset period;
and performing iterative training until a stopping condition is reached to obtain a pre-training cooperative guidance law model.
7. The method of pre-training a collaborative guidance law model according to claim 6, wherein the pre-training sample further comprises: the corresponding action value of the first moment,
the strategy gradient calculation is carried out on the pre-training action network according to the updated pre-training evaluation network to obtain an action loss value, and the method comprises the following steps:
evaluating the action value corresponding to the first moment to obtain an offline expert evaluation value corresponding to the first moment;
determining a correction value according to the offline expert evaluation value and the evaluation value corresponding to the first moment;
when the offline expert evaluation value corresponding to the first moment is larger than the evaluation value corresponding to the first moment, the correction value is the two-norm of the difference between the action value corresponding to the first moment and the predicted action value corresponding to the first moment; otherwise, the correction value is 0; the predicted action value corresponding to the first moment is obtained according to the environmental state at the first moment and the pre-training action network,
And performing strategy gradient calculation on the pre-training action network according to the correction value and the updated pre-training evaluation network to obtain an action loss value.
8. A training method of a collaborative guidance law model, comprising:
obtaining a plurality of training samples, wherein each training sample comprises: the first time environmental state, the first rewarding value corresponding to the first time, the second rewarding value corresponding to the second time and the third time environmental state;
according to the training samples, performing iterative reinforcement learning training on the model to be trained to obtain a cooperative guidance law model;
wherein the model to be trained comprises a network to be trained and a target network to be trained, initial parameters of the evaluation network to be trained in the network to be trained and initial parameters of the evaluation target network to be trained in the target network to be trained are corresponding parameters in a pre-training cooperative guidance law model, and the pre-training cooperative guidance law model is obtained by performing reinforcement learning training according to a plurality of pre-training samples.
9. A pre-training system for a collaborative guidance law model, comprising:
a first acquisition module for acquiring pre-training samples of the slave projectile, wherein each pre-training sample comprises: a first time environmental state, a first rewarding value corresponding to the first time, a second rewarding value corresponding to the second time and a third time environmental state;
A second acquisition module, configured to acquire a pre-training model, where the pre-training model includes: the pre-training network and the pre-training target network, the pre-training network comprising: a pre-training action network and a pre-training evaluation network, the pre-training target network comprising: a pre-training action target network and a pre-training evaluation target network;
a first training module, configured to perform reinforcement learning training on the pre-training model according to the pre-training sample until a stopping condition is reached, to obtain a pre-training cooperative guidance law model,
wherein the network parameters of the pre-training evaluation network and the network parameters of the pre-training evaluation target network in the pre-training cooperative guidance law model are used as the initial network parameters of the evaluation network to be trained and the initial network parameters of the evaluation target network to be trained in the cooperative guidance law model training process.
10. A training system for collaborative guidance law models, comprising:
a third obtaining module, configured to obtain a plurality of training samples, where each training sample includes: the first time environmental state, the first rewarding value corresponding to the first time, the second rewarding value corresponding to the second time and the third time environmental state;
The second training module is used for carrying out iterative reinforcement learning training on the model to be trained according to the plurality of training samples to obtain a cooperative guidance law model;
wherein the model to be trained comprises a network to be trained and a target network to be trained, initial parameters of the evaluation network to be trained in the network to be trained and initial parameters of the evaluation target network to be trained in the target network to be trained are corresponding parameters in a pre-training cooperative guidance law model, and the pre-training cooperative guidance law model is obtained by performing reinforcement learning training according to a plurality of pre-training samples.
CN202310852767.6A 2023-07-12 2023-07-12 Pre-training method, training method and system for collaborative guidance law model Pending CN116992952A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310852767.6A CN116992952A (en) 2023-07-12 2023-07-12 Pre-training method, training method and system for collaborative guidance law model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310852767.6A CN116992952A (en) 2023-07-12 2023-07-12 Pre-training method, training method and system for collaborative guidance law model

Publications (1)

Publication Number Publication Date
CN116992952A true CN116992952A (en) 2023-11-03

Family

ID=88520590

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310852767.6A Pending CN116992952A (en) 2023-07-12 2023-07-12 Pre-training method, training method and system for collaborative guidance law model

Country Status (1)

Country Link
CN (1) CN116992952A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117807410A (en) * 2024-02-29 2024-04-02 东北大学 Method and device for determining set speed of steel-turning roller, storage medium and terminal
CN117807410B (en) * 2024-02-29 2024-05-31 东北大学 Method and device for determining set speed of steel-turning roller, storage medium and terminal

Similar Documents

Publication Publication Date Title
He et al. Computational missile guidance: A deep reinforcement learning approach
CN112791394B (en) Game model training method and device, electronic equipment and storage medium
CN116992952A (en) Pre-training method, training method and system for collaborative guidance law model
CN113962012A (en) Unmanned aerial vehicle countermeasure strategy optimization method and device
CN114638339A (en) Intelligent agent task allocation method based on deep reinforcement learning
Yuan et al. Research on UCAV maneuvering decision method based on heuristic reinforcement learning
CN113239472B (en) Missile guidance method and device based on reinforcement learning
Guan et al. Modeling of Dual‐Spinning Projectile with Canard and Trajectory Filtering
CN116050515B (en) XGBoost-based parallel deduction multi-branch situation prediction method
Khan et al. Playing a FPS doom video game with deep visual reinforcement learning
CN112619163A (en) Flight path control method and device, electronic equipment and storage medium
Zhu et al. Mastering air combat game with deep reinforcement learning
CN114935893B (en) Motion control method and device for aircraft in combat scene based on double-layer model
CN115826621A (en) Unmanned aerial vehicle motion planning method and system based on deep reinforcement learning
Wang et al. Study on fuzzy neural sliding mode guidance law with terminal angle constraint for maneuvering target
CN112257259B (en) Method and system for estimating whole-course trajectory of ballistic missile based on improved autonomous multiple models
CN115186378A (en) Real-time solution method for tactical control distance in air combat simulation environment
CN115222023A (en) Neural network-based non-cooperative target control strategy identification method and system
CN112642163A (en) Motion trajectory prediction method and device, electronic equipment and storage medium
CN116952076A (en) Master-slave bullet collaborative guidance method and system without pilot head of slave bullet
CN118034065B (en) Training method and device for unmanned aerial vehicle decision network
CN116258317B (en) Satellite formation anti-monitoring optimal control method, device and medium based on game theory
CN115826623B (en) Mobile target attack planning method based on incomplete information
Liang et al. Homing Guidance Law Design against Maneuvering Targets Based on DDPG
CN117313814B (en) Intelligent decision-making and power-calculating acceleration system of unmanned aerial vehicle

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination