CN114841098B - Deep reinforcement learning Beidou navigation chip design method based on sparse representation drive - Google Patents

Deep reinforcement learning Beidou navigation chip design method based on sparse representation drive

Info

Publication number
CN114841098B
CN114841098B (application CN202210384663.2A)
Authority
CN
China
Prior art keywords
network
objective function
value
current
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210384663.2A
Other languages
Chinese (zh)
Other versions
CN114841098A (en)
Inventor
唐建浩
李珍妮
郑少龙
谢胜利
元荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202210384663.2A priority Critical patent/CN114841098B/en
Publication of CN114841098A publication Critical patent/CN114841098A/en
Application granted granted Critical
Publication of CN114841098B publication Critical patent/CN114841098B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/30 Circuit design
    • G06F 30/32 Circuit design at the digital level
    • G06F 30/327 Logic synthesis; Behaviour synthesis, e.g. mapping logic, HDL to netlist, high-level language to RTL or netlist
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/30 Circuit design
    • G06F 30/39 Circuit design at the physical level
    • G06F 30/392 Floor-planning or layout, e.g. partitioning or placement
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Geometry (AREA)
  • Architecture (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a Beidou navigation chip design method based on sparse-representation-driven deep reinforcement learning, which comprises the following steps: obtaining a graph embedding, a current macro cell embedding and a netlist metadata embedding based on the macro cell features, netlist graph information and netlist metadata of the chip, and obtaining a three-dimensional state space through a second fully connected network; imposing a sparse constraint on the neurons of the last hidden layer of the value network with an ℓ1 regularizer to obtain a value network based on sparse representation; inputting the three-dimensional state space into the sparse-representation-based value network to obtain a value function; and inputting the three-dimensional state space into a strategy network and obtaining the optimal layout strategy of the Beidou navigation chip macro cells under the guidance of the value function. The value network based on sparse representation alleviates catastrophic interference in value network parameter learning and improves the accuracy and robustness of Beidou navigation chip design based on deep reinforcement learning.

Description

Deep reinforcement learning Beidou navigation chip design method based on sparse representation driving
Technical Field
The invention relates to the field of machine learning and chip design, in particular to a novel Beidou navigation chip design method based on sparse representation driven deep reinforcement learning.
Background
At present, the positioning chips used in the navigation products of domestic companies such as Xiaotiancai, Huashi and Xiaomi basically depend on imports, mainly from foreign enterprises such as u-blox and SONY. The current navigation chip design flow takes a long time, so navigation chips are developed very slowly. The chip layout stage is the most complex and time-consuming; its complexity mainly comes from three aspects: the size of the netlist graph, the grid granularity of the chip canvas, and the excessively high computational cost of evaluating the true target metrics. Although chip design has been studied for decades, existing chip layout tools still need weeks of iteration to generate a layout solution that meets design criteria in all respects. It is therefore very important to develop a new Beidou navigation chip design method that improves the accuracy of chip design and shortens the chip design cycle.
Deep reinforcement learning combines the decision-making capability of reinforcement learning with the perception capability of deep learning, shows excellent adaptability and learning capability, and can be used to solve complex perception and decision-making problems. Recently, Google proposed a chip placement method based on deep reinforcement learning whose goal is to quickly map a netlist containing macro cells and standard cells onto a chip canvas while optimizing power, performance and area (PPA) and respecting constraints on placement density and routing congestion. Chip design is regarded as a reinforcement learning problem, and a deep reinforcement learning network is trained to optimize the chip layout. The method comprises two steps: first, a value network guides the training of a policy network so that the policy network gives the optimal layout policy for the current macro cell, and the trained policy network then guides all the macro cells of the chip to be placed one after another in order of size; second, after all macro cells are laid out, the standard cells are laid out by a force-directed method, completing the mapping from the netlist to the chip canvas. Experimental results show that, compared with the most advanced baseline models, the method achieves better PPA on Google's TPU. More importantly, within 6 hours it can generate a chip layout that is superior or comparable to the designs of professional human chip designers.
However, value networks in deep reinforcement learning are often affected by catastrophic interference. Back-propagation on inputs from different states acts on the same neurons, so previously learned parameters are overwritten, the network forgets what it learned from earlier batches of data, the bias of the value-function approximation becomes large, and the accuracy with which the strategy network generates the layout strategy for the current chip macro cell is affected. Therefore, how to relieve catastrophic interference in value network parameter learning and improve the accuracy and robustness of Beidou navigation chip design based on deep reinforcement learning is an urgent problem in the field of artificial intelligence chip design.
Disclosure of Invention
The invention aims to provide a Beidou navigation chip design method based on sparse representation-driven deep reinforcement learning, which is used for solving the problem of catastrophic interference of value network parameter learning based on a value network of sparse representation and improving the accuracy and robustness of the Beidou navigation chip design based on deep reinforcement learning.
In order to achieve the purpose, the invention provides the following scheme:
a Beidou navigation chip design method based on sparse representation driven deep reinforcement learning comprises the following steps:
obtaining graph embedding and current macro unit embedding based on macro unit characteristics and netlist graph information of a chip;
obtaining netlist metadata embedding by passing netlist metadata of a chip through a first full-connection network;
embedding the graph, the current macro cell and the netlist metadata through the second fully connected network to obtain a three-dimensional state space;
neuron addition to last layer hidden layer of value network
Figure SMS_1
Carrying out sparse constraint on the regulons to obtain a value network based on sparse representation;
inputting the three-dimensional state space into the value network based on sparse representation to obtain a value function;
and inputting the three-dimensional state space into the strategy network and obtaining the optimal layout strategy of the Beidou navigation chip macro unit under the guidance of the cost function.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a Beidou navigation chip design method based on sparse representation driven deep reinforcement learning, which comprises the following steps: obtaining graph embedding and current macro unit embedding based on macro unit characteristics and netlist graph information of a chip; obtaining netlist metadata embedding by passing netlist metadata of a chip through a first full-connection network; embedding the graph, the current macro cell and the netlist metadata through the second fully-connected network to obtain a three-dimensional state space; neuron addition to last hidden layer of value network
Figure SMS_2
Carrying out sparse constraint on the regulons to obtain a value network based on sparse representation; inputting the three-dimensional state space into the value network based on sparse representation to obtain a value function; and inputting the three-dimensional state space into the strategy network and obtaining the optimal layout strategy of the Beidou navigation chip macro unit under the guidance of the cost function. By carrying out sparse representation on the neurons of the last hidden layer of the value network, the problem of catastrophic interference in the learning process of the value function parameters in the chip design method based on deep reinforcement learning is solved, so that the deviation of approximate estimation of the value function is reduced, and the accuracy of the layout strategy output by the strategy network can be ensured.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a Beidou navigation chip design method based on sparse representation-driven deep reinforcement learning according to embodiment 1 of the present invention;
FIG. 2 is a specific application process of a value network and a policy network provided in embodiment 1 of the present invention;
FIG. 3 is a diagram of a physical model architecture of a sparse representation-based value network according to embodiment 1 of the present invention;
fig. 4 is a physical model diagram of a policy network provided in embodiment 1 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A fully connected neural network is used as the value network; the network takes as input the three-dimensional state space corresponding to the macro units in the environment and outputs the value function of that state space. However, the value network is often affected by catastrophic interference, so that the bias of the value-function approximation becomes large. Alleviating catastrophic interference in value network parameter learning therefore improves the accuracy and robustness of chip layout, and has wide application prospects in the field of artificial-intelligence-based navigation chip design.
The invention aims to provide a Beidou navigation chip design method based on sparse representation-driven deep reinforcement learning, which is used for solving the problem of catastrophic interference of value network parameter learning based on a value network of sparse representation and improving the accuracy and robustness of the Beidou navigation chip design based on deep reinforcement learning.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Example 1
Firstly, a neural network architecture is constructed to encode the information of the netlist, with the aim of extracting information such as node types and the link properties between nodes in the netlist into a low-dimensional vector representation and generating an information-rich state space for the strategy network, as shown in FIG. 1. Then, a deconvolution network is initialized as the strategy network and a fully connected neural network is initialized as the value network. Both take the three-dimensional state space as input; the strategy network outputs a two-dimensional chip layout strategy, and the value network outputs the value function of the state space. The strategy network and the value network are trained to output, respectively, the probability distribution over the available positions of the current macro unit and the value estimate of the state space corresponding to the current macro unit. At the same time, while keeping the objective function of the strategy network unchanged, an ℓ1 regularizer is used to sparsely represent the neurons of the last hidden layer of the value network, constructing a brand-new objective function for the value network driven by sparse representation. Because the optimization problem of the value network objective function remains a convex optimization problem after the ℓ1 regularization term is added, the objective function is solved with a subgradient algorithm and the weight parameters of the value network are updated, yielding a value-function estimate with smaller bias, which further guides the training of the strategy network, relieves the influence of catastrophic interference on value network training, and improves the estimation accuracy and robustness of the value network. Finally, the chip macro-unit layout strategy output by the trained strategy network accurately and efficiently guides the layout of the Beidou navigation chip macro units, mapping the chip macro units one by one onto the chip canvas in order of size, as shown in FIG. 2.
As shown in FIG. 1, the present embodiment provides a Beidou navigation chip design method based on sparse representation-driven deep reinforcement learning, including:
S1: obtaining a graph embedding and a current macro unit embedding based on the macro unit characteristics and netlist graph information of the chip;
The chip comprises a plurality of macro units and standard units, and different macro units and standard units are composed of different basic circuits; the macro unit features describe the functional characteristics of the macro units, and the netlist graph describes the basic information of all the macro units of the chip. Obtaining the graph embedding and macro unit embedding based on the macro unit characteristics and netlist graph information of the chip specifically comprises:
inputting net list graph information of the navigation chip into a graph neural network;
performing graph convolution operation on macro unit features of the navigation chip and a netlist graph by using the graph neural network to generate edge embedding and macro unit embedding;
taking the mean of the edge embeddings to obtain the graph embedding;
and adding current macro unit information to the macro unit embedding to obtain the current macro unit embedding.
S2: obtaining a netlist metadata embedding by passing the netlist metadata of the chip through a first fully connected network;
S3: passing the graph embedding, the current macro unit embedding and the netlist metadata embedding through the second fully connected network to obtain a three-dimensional state space S_t;
The state space S_t is input into the trained strategy network through the fully connected layer, and the layout strategy a_t generated by the strategy network for the current macro unit is executed in the action space to obtain the next state space S_{t+1}.
The three-dimensional state space is generated as follows. First, the netlist graph information of the navigation chip is input into a graph neural network, which performs graph convolution on the macro unit features and the netlist graph to generate edge embeddings and macro unit embeddings. Second, the netlist metadata is passed through a fully connected network to obtain the netlist metadata embedding, the edge embeddings are reduced by their mean to obtain the graph embedding, and current macro unit information is added to the macro unit embeddings to obtain the current macro unit embedding. Finally, the netlist metadata embedding, the graph embedding and the current macro unit embedding are input together into a fully connected network to generate the three-dimensional state space used for training the value network and the strategy network, as shown in FIG. 1.
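For illustration only, the following PyTorch sketch shows one way such a state encoder could be wired; the single round of graph convolution, the module names and all dimensions (node feature size, metadata size, a 16 × 8 × 8 state) are assumptions and are not taken from the patent.

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Illustrative encoder: netlist graph + metadata -> three-dimensional state space S_t."""
    def __init__(self, node_dim=16, meta_dim=10, embed_dim=32, state_shape=(16, 8, 8)):
        super().__init__()
        self.state_shape = state_shape
        self.node_fc = nn.Linear(node_dim, embed_dim)       # macro unit embedding
        self.edge_fc = nn.Linear(2 * embed_dim, embed_dim)  # edge embedding from the two endpoint macros
        self.meta_fc = nn.Linear(meta_dim, embed_dim)       # "first fully connected network" (metadata)
        flat = state_shape[0] * state_shape[1] * state_shape[2]
        self.fuse_fc = nn.Linear(3 * embed_dim, flat)       # "second fully connected network" (fusion)

    def forward(self, node_feats, edges, metadata, current_idx):
        node_emb = torch.relu(self.node_fc(node_feats))                    # macro unit embeddings
        src, dst = edges                                                   # index tensors of the netlist edges
        edge_emb = torch.relu(self.edge_fc(torch.cat([node_emb[src], node_emb[dst]], dim=-1)))
        graph_emb = edge_emb.mean(dim=0)                                   # mean-reduce edges -> graph embedding
        current_emb = node_emb[current_idx]                                # current macro unit embedding
        meta_emb = torch.relu(self.meta_fc(metadata))                      # netlist metadata embedding
        fused = torch.cat([graph_emb, current_emb, meta_emb], dim=-1)
        return self.fuse_fc(fused).view(self.state_shape)                  # S_t, here 16 x 8 x 8

# Example call with random data: 50 macro units, 120 netlist edges.
encoder = StateEncoder()
edges = (torch.randint(0, 50, (120,)), torch.randint(0, 50, (120,)))
S_t = encoder(torch.randn(50, 16), edges, torch.randn(10), current_idx=3)
```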
S4: imposing a sparse constraint on the neurons of the last hidden layer of the value network with an ℓ1 regularizer to obtain a value network based on sparse representation;
The state space S_t and the next state space S_{t+1} are input into the value network to obtain the values V(S_t, W) and V(S_{t+1}, W) of the two state spaces; together with the reward R given by the external environment, the temporal-difference error (TD-error) is calculated, and the objective function of the value network is constructed from the TD-error. Then an ℓ1 regularizer is added to the objective function of the value network to impose a sparse constraint on the neurons of the last hidden layer, and the weight parameters of the value network are updated by the subgradient descent method.
S5: inputting the three-dimensional state space into the value network based on sparse representation to obtain a value function;
the state of the intelligent agent in the t step in the environment is S t I.e. the current chip macro cell placement on the chip canvas, perform action a in the state t I.e. the collection of all locations in the space of the discrete canvas, and then obtain a reward R for the environment giving that action t Defined as the weighted sum of the wireless network and congestion. The agent transitions to the next state S t+1 Then, the next action a is executed t+1
Constructing a value network objective function:
constructing a value network function V (S, W) to approximate a state S t And V is the value of W, wherein W represents the weight parameter of the value network, and the discount rate of return is defined to be gamma. Then, the timing difference error δ (TD-error) can be expressed as:
δ=R t +γV(S t+1 ,W)-V(S t ,W)
the value network updates the network parameters by minimizing the TD-error, so the objective function of the value network is obtained by solving the expectation of the square of the TD-error, which is as follows:
f(W)=E[(R t +γV(S t+1 ,W)-V(S t ,W)) 2 ]
applying output y of neuron of last layer hidden layer of value network
Figure SMS_7
Regularization sub h (y) = lambda y | | non-calculation 1 Sparse constraint is performed, and an objective function is obtained as follows:
Figure SMS_8
wherein λ represents
Figure SMS_9
The regularization subparameter, E (-) represents the desired operation.
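As a concrete illustration of this objective, the sketch below exposes the last hidden layer y of a fully connected value network and adds the λ||y||_1 penalty to the squared TD-error; the flattened state dimension, hidden width and coefficient values are placeholders, not the patent's specification.

```python
import torch
import torch.nn as nn

class SparseValueNetwork(nn.Module):
    """Fully connected value network that returns both V(S, W) and the last hidden layer y."""
    def __init__(self, state_dim=1024, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, 1)

    def forward(self, state):
        y = self.trunk(state)                    # y: activations of the last hidden layer
        return self.head(y).squeeze(-1), y

def value_objective(net, S_t, S_next, R_t, gamma=0.99, lam=1e-3):
    """f(W) = E[(R_t + gamma * V(S_{t+1}, W) - V(S_t, W))^2 + lam * ||y||_1]."""
    V_t, y = net(S_t)
    with torch.no_grad():                        # the bootstrap target is treated as a constant
        V_next, _ = net(S_next)
    td_error = R_t + gamma * V_next - V_t
    return (td_error.pow(2) + lam * y.abs().sum(dim=-1)).mean()

# Example: a batch of 32 flattened 16 x 8 x 8 states.
net = SparseValueNetwork()
loss = value_objective(net, torch.randn(32, 1024), torch.randn(32, 1024), torch.randn(32))
loss.backward()
```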
Before the three-dimensional state space is input into the sparse-representation-based value network, the method further comprises the following steps:
(1) Constructing the objective function of the value network based on sparse representation to obtain the value network objective function;
The expression of the value network objective function is:
f(W) = E[(R_t + γV(S_{t+1}, W) - V(S_t, W))^2 + λ||y||_1]
where W denotes the weight parameters of the value network; R_t denotes the reward value for performing action a_t in state S_t; γ denotes the discount rate; S_{t+1} and S_t denote the next layout state and the current layout state of the current macro unit on the chip canvas; V(·) denotes the value estimate; λ||y||_1 denotes the ℓ1 regularizer; λ denotes the ℓ1 regularization parameter; and E(·) denotes the expectation.
(2) Carrying out weight optimization on the value network objective function to obtain the optimized value network objective function.
Specifically, the weight optimization of the value network objective function includes:
1) Solving the current value network objective function by using a subgradient descent algorithm to obtain the current updated value weight parameter;
The formula for the current updated value weight parameter is:
W_i = W_{i-1} - α_W (∇f(W_{i-1}) + λ Σ_{j=1}^{K} sign(y_j) · ∂y_j/∂W_{i-1})
where y_j denotes the j-th neuron of the last hidden layer; K denotes the total number of neurons in the last hidden layer; ∂y_j/∂W_{i-1} denotes the derivative of y_j with respect to W_{i-1}; α_W denotes the learning rate; W_{i-1} denotes the weight value at the (i-1)-th iteration; ∇f(W_{i-1}) denotes the gradient of f(W_{i-1}) with respect to the weight parameter W_{i-1}; and sign(y_j) is the sign function, i.e., the subgradient of ||y||_1.
2) Substituting the current updated value weight parameter into the current value network objective function to obtain the current updated value network objective function;
3) Judging whether the relative error between the current value network objective function and the last value network objective function is smaller than a first preset value or not to obtain a first judgment result;
if the first judgment result is yes, the current updated value network objective function is the optimized value network objective function;
if the first judgment result is negative, judging whether the current iteration times are equal to the first maximum iteration times to obtain a second judgment result;
if the second judgment result is yes, the current updated value network objective function is the optimized value network objective function;
and if the second judgment result is negative, the currently updated value network objective function is made to be the current value network objective function, and the step of solving the current value network objective function by using a sub-gradient descent algorithm to obtain the current updated value weight parameter is returned.
Training yields a sparsely represented value network, as shown in FIG. 3, in which white squares represent activated (non-zero) neurons and grey squares represent inactive neurons; only a small portion of the neurons in the last hidden layer are activated, which generates a sparse representation and alleviates catastrophic interference in value network parameter learning. By adopting this sparse-representation-driven deep reinforcement learning method, the influence of catastrophic interference on the performance of the value network can be relieved, improving the accuracy and robustness of the value-function approximation.
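A minimal sketch of the optimization loop described above is given below; it reuses value_objective from the earlier sketch, relies on autograd supplying sign(y) as the subgradient of the ℓ1 term, and the learning rate, tolerance (the "first preset value") and iteration cap are illustrative.

```python
import torch

def optimize_value_network(net, S_t, S_next, R_t, lr=1e-3, tol=1e-4, max_iter=500):
    """Subgradient descent on the sparse value objective with the two stopping tests above:
    relative error below a preset value, or the maximum number of iterations reached."""
    prev = None
    for i in range(max_iter):
        net.zero_grad()
        f = value_objective(net, S_t, S_next, R_t)       # current value network objective
        f.backward()                                     # d|y|/dy = sign(y): a subgradient of ||y||_1
        with torch.no_grad():
            for W in net.parameters():
                W -= lr * W.grad                         # W_i = W_{i-1} - alpha_W * subgradient
        if prev is not None and abs(f.item() - prev) / (abs(prev) + 1e-12) < tol:
            break                                        # relative error below the preset value
        prev = f.item()
    return net
```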
The TD-error is then replaced by the advantage function, the objective function of the strategy network is constructed with the PPO algorithm, and the weight parameters of the strategy network are updated with the Adam algorithm. In this way, the influence of catastrophic interference on the performance of the value network can be relieved, and the accuracy and robustness of the value-function approximation are improved. Therefore, before step S6 the method further comprises:
(1) Constructing an objective function of the strategy network to obtain a strategy network objective function;
A strategy network function π(a_t|S_t) is constructed, where S_t denotes the information of the t-th state, including the current macro unit and the entire netlist, and a_t denotes the t-th action, i.e., the layout strategy for the current macro unit generated by the strategy network, comprising the set of locations on the discrete canvas space where the chip can be laid out and used to guide the layout of the current macro unit. In this Beidou navigation chip design method based on sparse deep reinforcement learning, the Proximal Policy Optimization (PPO) algorithm is adopted to construct the objective function of the strategy network:
obj(θ) = E[ min( r_t(θ) Â_t, clip(r_t(θ), 1-ε, 1+ε) Â_t ) ]
where θ denotes the weight parameters of the strategy network, r_t(θ) = π_θ(a_t|S_t) / π_old(a_t|S_t) denotes the probability ratio between the new and old strategy network functions, π_old(a_t|S_t) denotes the old strategy, Â_t denotes the advantage function (for which the TD-error can be substituted), and ε here denotes the PPO clipping coefficient.
(2) Updating the weights of the strategy network objective function to obtain the updated strategy network objective function.
The optimization problem of the strategy network objective function is a convex optimization problem, and the Adam algorithm can be used directly to update the weight parameters of the strategy network. Updating the weights of the strategy network objective function specifically includes:
1) Differentiating the current strategy objective function to obtain the currently updated gradient g_i(θ);
From the currently updated gradient g_i(θ), the first-order estimate m_i and the second-order estimate v_i are calculated:
m_i = β_1 m_{i-1} + (1 - β_1) g_i(θ)
v_i = β_2 v_{i-1} + (1 - β_2) g_i(θ)^2
where β_1 and β_2 denote the decay coefficients of the first-order estimate m_i and the second-order estimate v_i, respectively.
2) From the first-order estimate m_i and the second-order estimate v_i, the first-order bias correction m̂_i and the second-order bias correction v̂_i are obtained:
m̂_i = m_i / (1 - β_1^i)
v̂_i = v_i / (1 - β_2^i)
3) From the first-order bias correction m̂_i and the second-order bias correction v̂_i, the current strategy update weight parameter is obtained;
The calculation formula of the strategy network weight parameter is:
θ_i = θ_{i-1} + α_θ · m̂_i / (√(v̂_i) + ε)
where α_θ denotes the learning rate, which controls the step size; m̂_i and v̂_i denote the first-order and second-order bias corrections, respectively; and ε denotes the numerical stability parameter, used to prevent the denominator from being zero.
4) Substituting the current strategy updating weight parameter into the current strategy network objective function to obtain the current updated strategy network objective function;
5) Judging whether the relative error between the current strategy network objective function and the previous strategy network objective function is smaller than a second preset value or not to obtain a third judgment result;
if the third judgment result is yes, the current updated policy network objective function is the optimized policy network objective function;
if the third judgment result is negative, judging whether the current iteration times are equal to the second maximum iteration times to obtain a fourth judgment result;
if the fourth judgment result is yes, the current updated policy network objective function is the optimized policy network objective function;
if the fourth judgment result is negative, the strategy network objective function after the current update is made to be the current strategy network objective function, and the step of 'obtaining the current updated gradient by differentiating the current strategy objective function' is returned.
Judging the iteration cutoff involves two conditions: if the difference between two consecutive strategy network objective functions obj_i and obj_{i-1} becomes smaller than a preset small value, i.e., |obj_i - obj_{i-1}| falls below that value, the iteration stops; otherwise, iteration continues up to the maximum number of iteration steps.
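The sketch below illustrates the strategy-network side under the same caveats: GridPolicy is a toy stand-in for the deconvolution strategy network, the clipped PPO surrogate follows the standard formulation referred to above, and torch.optim.Adam supplies the moment estimates and bias corrections (m̂_i, v̂_i, ε) in place of hand-written updates.

```python
import copy
import torch
import torch.nn as nn
from torch.distributions import Categorical

class GridPolicy(nn.Module):
    """Toy stand-in for the strategy network: flattened state -> distribution over canvas cells."""
    def __init__(self, state_dim=1024, grid_cells=64):
        super().__init__()
        self.fc = nn.Linear(state_dim, grid_cells)

    def forward(self, S_t):
        return Categorical(logits=self.fc(S_t))

def ppo_objective(policy, policy_old, S_t, a_t, advantage, clip_eps=0.2):
    """Clipped PPO surrogate: E[min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)]."""
    logp_new = policy(S_t).log_prob(a_t)
    with torch.no_grad():
        logp_old = policy_old(S_t).log_prob(a_t)
    ratio = torch.exp(logp_new - logp_old)                  # pi_theta(a_t|S_t) / pi_old(a_t|S_t)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return torch.min(ratio * advantage, clipped * advantage).mean()

def train_policy(policy, S_t, a_t, advantage, lr=3e-4, tol=1e-4, max_iter=200):
    """Maximise the PPO surrogate with Adam, stopping on relative error or max iterations."""
    policy_old = copy.deepcopy(policy)                      # frozen old strategy pi_old
    opt = torch.optim.Adam(policy.parameters(), lr=lr)      # Adam keeps m_i, v_i and their bias corrections
    prev = None
    for i in range(max_iter):
        obj = ppo_objective(policy, policy_old, S_t, a_t, advantage)
        opt.zero_grad()
        (-obj).backward()                                   # Adam minimises, so negate to ascend the surrogate
        opt.step()
        if prev is not None and abs(obj.item() - prev) / (abs(prev) + 1e-12) < tol:
            break
        prev = obj.item()
    return policy

# Example: 32 transitions; the advantage can be the TD-error from the value network.
policy = GridPolicy()
train_policy(policy, torch.randn(32, 1024), torch.randint(0, 64, (32,)), torch.randn(32))
```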
S6: inputting the three-dimensional state space into the strategy network and obtaining the optimal layout strategy of the Beidou navigation chip macro units under the guidance of the value function.
The layout of the chip macro units is completed one by one, and layout strategies are generated for the macro units in order of decreasing size so as to ensure sufficient space. The information of the current macro unit (predefined logic modules such as flip-flops, arithmetic logic units and hardware registers) is changed one by one to change the input of the strategy network, the optimal layout strategy of the Beidou navigation chip macro unit corresponding to the optimal value function in the current state is obtained, and the layout of the Beidou navigation chip macro units is thereby guided.
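A sketch of this placement loop is shown below; the list of macro areas, the state_fn callback that builds the flattened state for the next macro unit from what has already been placed, and the greedy masking of occupied cells are assumptions made for illustration, and the loop can be wired to the GridPolicy and StateEncoder sketches above.

```python
import torch

def place_macros(macro_areas, state_fn, policy, grid=(8, 8)):
    """Lay out macro units one by one, largest first, using the trained strategy network."""
    order = sorted(range(len(macro_areas)), key=lambda i: macro_areas[i], reverse=True)
    placements = {}                                         # macro index -> (row, col) on the canvas
    for idx in order:
        S_t = state_fn(placements, idx)                     # state for the current macro unit
        with torch.no_grad():
            probs = policy(S_t).probs.clone()               # layout strategy over canvas cells
        for r, c in placements.values():
            probs[r * grid[1] + c] = 0.0                    # mask cells that are already occupied
        cell = int(probs.argmax())                          # greedy choice among the remaining cells
        placements[idx] = divmod(cell, grid[1])
    return placements

# Example wiring with the earlier sketches (random data for illustration only):
# placements = place_macros([4.0, 2.5, 1.0], lambda p, i: torch.randn(1024), GridPolicy())
```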
A deconvolution network is adopted as the strategy network, so that the input three-dimensional state space is mapped by the strategy network to a two-dimensional chip layout strategy. The deconvolution network consists of an input layer, deconvolution layers, and an output layer. Similar to a convolutional network, the input layer of the deconvolution network receives data in a non-fully-connected manner and the output layer produces data in a fully connected manner, with one or more deconvolution layers performing deconvolution between the input layer and the output layer. Assume that the input matrix of the deconvolution network is Y and the output matrix is X, and that the number of channels (i.e., the convolution kernels used by each deconvolution layer) is 4; the physical model of the deconvolution network is shown in FIG. 4, in which a network layer is denoted L_k (the k-th network layer). The input layer stores the 8 × 8 × 16 input matrix Y (width 8, height 8, depth 16), which enters the first deconvolution layer through non-full connections, passes through 4 channels into the second deconvolution layer, passes through another 4 channels into the output layer, and the output layer obtains the output matrix X through a fully connected operation.
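The sketch below shows one plausible realization of such a deconvolution strategy network; only the 8 × 8 × 16 input, the 4 channels per deconvolution layer and the fully connected output layer come from the description above, while the kernel sizes, strides and the 32 × 32 output canvas are assumptions (FIG. 4 is not reproduced here).

```python
import torch
import torch.nn as nn

class DeconvPolicyNet(nn.Module):
    """Deconvolution strategy network: 16 x 8 x 8 state -> probabilities over a 2-D canvas."""
    def __init__(self, in_channels=16, channels=4, canvas=32):
        super().__init__()
        self.canvas = canvas
        # Two deconvolution layers with 4 channels each, upsampling 8x8 -> 16x16 -> 32x32.
        self.deconv1 = nn.ConvTranspose2d(in_channels, channels, kernel_size=4, stride=2, padding=1)
        self.deconv2 = nn.ConvTranspose2d(channels, channels, kernel_size=4, stride=2, padding=1)
        # Fully connected output layer producing one score per canvas cell.
        self.out_fc = nn.Linear(channels * canvas * canvas, canvas * canvas)

    def forward(self, state):                                # state: (N, 16, 8, 8)
        x = torch.relu(self.deconv1(state))
        x = torch.relu(self.deconv2(x))
        logits = self.out_fc(x.flatten(1))                   # (N, 32 * 32)
        probs = torch.softmax(logits, dim=-1)                # two-dimensional layout strategy
        return probs.view(-1, self.canvas, self.canvas)

# Example: one 16 x 8 x 8 state produces a 32 x 32 placement probability map.
net = DeconvPolicyNet()
placement_map = net(torch.randn(1, 16, 8, 8))
```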
In this embodiment, a new neural network architecture is constructed as the embedding layer of the strategy-value network, and the netlist graph node features and the information of the current macro unit to be placed are encoded to generate the three-dimensional state space. After the state space is obtained, it is input into the trained deep reinforcement learning network, which, with catastrophic interference alleviated, outputs a more accurate optimal layout strategy for the Beidou navigation chip macro units, thereby guiding the chip macro units to be mapped onto the chip canvas one by one in order of size.
The present embodiment has the following advantages: 1. The invention introduces an ℓ1 regularizer into the value network objective function and updates the weight parameters with a subgradient method, thereby realizing sparse-representation-driven deep reinforcement learning and applying it to the layout of the Beidou navigation chip; this better alleviates the influence of catastrophic interference in deep reinforcement learning and improves the accuracy and robustness of value network estimation. 2. The sparse-representation-driven deep reinforcement learning network replaces manual experience, shortens the development cycle of the navigation chip and reduces the chip development cost, and the proposed sparse-representation deep reinforcement learning algorithm can be applied to other, more complex chip design environments.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the foregoing, the description is not to be taken in a limiting sense.

Claims (6)

1. A Beidou navigation chip design method based on sparse representation driven deep reinforcement learning is characterized by comprising the following steps of:
obtaining a graph embedding and a current macro unit embedding based on the macro unit characteristics and netlist graph information of the chip;
obtaining a netlist metadata embedding by passing the netlist metadata of the chip through a first fully connected network;
passing the graph embedding, the current macro unit embedding and the netlist metadata embedding through a second fully connected network to obtain a three-dimensional state space;
imposing a sparse constraint on the neurons of the last hidden layer of the value network with an ℓ1 regularizer to obtain a value network based on sparse representation; the value network is a fully connected neural network;
inputting the three-dimensional state space into the value network based on sparse representation to obtain a value function;
inputting the three-dimensional state space into a strategy network and obtaining an optimal layout strategy of the Beidou navigation chip macro units under the guidance of the value function; the strategy network is a deconvolution network;
said inputting said three-dimensional state space into said sparse representation-based value network further comprises:
(1) Constructing an objective function of the value network based on sparse representation to obtain a value network objective function; the expression of the value network objective function is as follows:
f(W) = E[(R_t + γV(S_{t+1}, W) - V(S_t, W))^2 + λ||y||_1]
where W denotes the weight parameters of the value network; R_t denotes the reward value for taking action a_t in state S_t; γ denotes the discount rate; S_{t+1} and S_t denote the next layout state and the current layout state of the current macro unit on the chip canvas; V(·) denotes the value estimate; λ||y||_1 denotes the ℓ1 regularizer; λ denotes the ℓ1 regularization parameter; and E(·) denotes the expectation;
(2) Carrying out weight optimization on the value network objective function to obtain an optimized value network objective function; the method specifically comprises the following steps:
solving the current value network objective function by using a sub-gradient descent algorithm to obtain a current updated value weight parameter;
substituting the current updated value weight parameter into the current value network objective function to obtain the current updated value network objective function;
judging whether the relative error between the current value network objective function and the last value network objective function is smaller than a first preset value or not to obtain a first judgment result;
if the first judgment result is yes, the current updated value network objective function is an optimized value network objective function;
if the first judgment result is negative, judging whether the current iteration times are equal to the first maximum iteration times to obtain a second judgment result;
if the second judgment result is yes, the current updated value network objective function is the optimized value network objective function;
if the second judgment result is negative, the current updated value network objective function is made to be the current value network objective function, and the step of solving the current value network objective function by using a sub-gradient descent algorithm to obtain the current updated value weight parameter is returned.
2. The method of claim 1, wherein obtaining the graph embedding and macro unit embedding based on the macro unit characteristics and netlist graph information of the chip specifically comprises:
inputting the net list graph information of the navigation chip into a graph neural network;
performing graph convolution operation on the macro unit features of the navigation chip and the netlist graph by using the graph neural network to generate edge embedding and macro unit embedding;
taking the mean of the edge embeddings to obtain the graph embedding;
and adding current macro unit information to the macro unit embedding to obtain the current macro unit embedding.
3. The method of claim 1, wherein the current update value weight parameter is formulated as:
W_i = W_{i-1} - α_W (∇f(W_{i-1}) + λ Σ_{j=1}^{K} sign(y_j) · ∂y_j/∂W_{i-1})
where y_j denotes the j-th neuron of the last hidden layer; K denotes the total number of neurons in the last hidden layer; ∂y_j/∂W_{i-1} denotes the derivative of y_j with respect to W_{i-1}; α_W denotes the learning rate; W_{i-1} denotes the weight value at the (i-1)-th iteration; ∇f(W_{i-1}) denotes the gradient of f(W_{i-1}) with respect to the weight parameter W_{i-1}; and sign(y_j) is the sign function, i.e., the subgradient of ||y||_1.
4. The method of claim 1, wherein inputting the three-dimensional state space into the policy network further comprises:
constructing an objective function of the strategy network to obtain a strategy network objective function;
and updating the weight of the strategy network objective function to obtain the updated strategy network objective function.
5. The method according to claim 4, wherein the updating the weight of the policy network objective function specifically includes:
deriving the current strategy network objective function to obtain a current updated gradient;
calculating a first order estimate and a second order estimate from the current updated gradient;
obtaining a first order estimated bias correction and a second order estimated bias correction according to the first order estimation and the second order estimation;
obtaining a current strategy updating weight parameter according to the first-order estimation deviation correction and the second-order estimation deviation correction;
substituting the current strategy updating weight parameter into the current strategy network objective function to obtain the current updated strategy network objective function;
judging whether the relative error between the current strategy network objective function and the previous strategy network objective function is smaller than a second preset value or not to obtain a third judgment result;
if the third judgment result is yes, the current updated strategy network objective function is an optimized strategy network objective function;
if the third judgment result is negative, judging whether the current iteration times are equal to the second maximum iteration times to obtain a fourth judgment result;
if the fourth judgment result is yes, the current updated policy network objective function is the optimized policy network objective function;
if the fourth judgment result is negative, the current updated strategy network objective function is made to be the current strategy network objective function, and the step of 'obtaining the current updated gradient by differentiating the current strategy network objective function' is returned.
6. The method according to claim 5, wherein the expression of the current policy update weight parameter is:
θ_i = θ_{i-1} + α_θ · m̂_i / (√(v̂_i) + ε)
where α_θ denotes the learning rate; m̂_i and v̂_i denote the first-order and second-order bias corrections, respectively; and ε denotes the stability parameter.
CN202210384663.2A 2022-04-13 2022-04-13 Deep reinforcement learning Beidou navigation chip design method based on sparse representation drive Active CN114841098B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210384663.2A CN114841098B (en) 2022-04-13 2022-04-13 Deep reinforcement learning Beidou navigation chip design method based on sparse representation drive

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210384663.2A CN114841098B (en) 2022-04-13 2022-04-13 Deep reinforcement learning Beidou navigation chip design method based on sparse representation drive

Publications (2)

Publication Number Publication Date
CN114841098A CN114841098A (en) 2022-08-02
CN114841098B true CN114841098B (en) 2023-04-18

Family

ID=82563580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210384663.2A Active CN114841098B (en) 2022-04-13 2022-04-13 Deep reinforcement learning Beidou navigation chip design method based on sparse representation drive

Country Status (1)

Country Link
CN (1) CN114841098B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116911245B (en) * 2023-07-31 2024-03-08 曲阜师范大学 Layout method, system, equipment and storage medium of integrated circuit

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753468A (en) * 2020-06-28 2020-10-09 中国科学院自动化研究所 Elevator system self-learning optimal control method and system based on deep reinforcement learning
CN112019249A (en) * 2020-10-22 2020-12-01 中山大学 Intelligent reflecting surface regulation and control method and device based on deep reinforcement learning
CN112130570A (en) * 2020-09-27 2020-12-25 重庆大学 Blind guiding robot of optimal output feedback controller based on reinforcement learning
CN113554166A (en) * 2021-06-16 2021-10-26 中国人民解放军国防科技大学 Deep Q network reinforcement learning method and equipment for accelerating cognitive behavior model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059100B (en) * 2019-03-20 2022-02-22 广东工业大学 SQL sentence construction method based on actor-critic network
CN111204476B (en) * 2019-12-25 2021-10-29 上海航天控制技术研究所 Vision-touch fusion fine operation method based on reinforcement learning
CN113093727B (en) * 2021-03-08 2023-03-28 哈尔滨工业大学(深圳) Robot map-free navigation method based on deep security reinforcement learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753468A (en) * 2020-06-28 2020-10-09 中国科学院自动化研究所 Elevator system self-learning optimal control method and system based on deep reinforcement learning
CN112130570A (en) * 2020-09-27 2020-12-25 重庆大学 Blind guiding robot of optimal output feedback controller based on reinforcement learning
CN112019249A (en) * 2020-10-22 2020-12-01 中山大学 Intelligent reflecting surface regulation and control method and device based on deep reinforcement learning
CN113554166A (en) * 2021-06-16 2021-10-26 中国人民解放军国防科技大学 Deep Q network reinforcement learning method and equipment for accelerating cognitive behavior model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yang Weiyi; Bai Chenjia; Cai Chao; Zhao Yingnan; Liu Peng. A survey of the sparse reward problem in deep reinforcement learning. Computer Science, 2020, (03): 190-199. *

Also Published As

Publication number Publication date
CN114841098A (en) 2022-08-02

Similar Documents

Publication Publication Date Title
CN113053115B (en) Traffic prediction method based on multi-scale graph convolution network model
CN109271683B (en) Building group automatic arrangement algorithm for sunlight constraint
CN106910337A (en) A kind of traffic flow forecasting method based on glowworm swarm algorithm Yu RBF neural
CN107705556A (en) A kind of traffic flow forecasting method combined based on SVMs and BP neural network
CN114372438B (en) Chip macro-unit layout method and system based on lightweight deep reinforcement learning
CN114841098B (en) Deep reinforcement learning Beidou navigation chip design method based on sparse representation drive
CN107578121A (en) Based on the power transformation engineering cost forecasting method for improving glowworm swarm algorithm optimization SVM
Whigham et al. Predicting chlorophyll-a in freshwater lakes by hybridising process-based models and genetic algorithms
CN115455899A (en) Analytic layout method based on graph neural network
CN115358136A (en) Structural rigidity optimization design method based on neural network
CN110766201A (en) Revenue prediction method, system, electronic device, computer-readable storage medium
CN116562218B (en) Method and system for realizing layout planning of rectangular macro-cells based on reinforcement learning
CN107491841A (en) Nonlinear optimization method and storage medium
Menzel et al. Application of free form deformation techniques in evolutionary design optimisation
US11790136B2 (en) Method for automating semiconductor design based on artificial intelligence
US11657206B1 (en) Method for semiconductor design based on artificial intelligence
CN116882539A (en) Water quality data prediction method based on improved Re-GCN model
CN110222847A (en) A kind of machine learning method and device
CN113592296B (en) Public policy decision method, device, electronic equipment and storage medium
CN109919294A (en) A kind of image enchancing method of glowworm swarm algorithm and cuckoo searching algorithm Parallel Fusion
CN109902870A (en) Electric grid investment prediction technique based on AdaBoost regression tree model
CN114707655A (en) Quantum line conversion method, quantum line conversion system, storage medium and electronic equipment
CN115169215A (en) Multi-objective optimization method and system considering nitrate pollution and seawater invasion process
CN106874998A (en) A kind of step matrix disassembling method certainly based on Pareto optimization
CN105512754A (en) Conjugate prior-based single-mode distribution estimation optimization method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant