CN114372438A - Chip macro-unit layout method and system based on lightweight deep reinforcement learning - Google Patents

Chip macro-unit layout method and system based on lightweight deep reinforcement learning

Info

Publication number
CN114372438A
Authority
CN
China
Prior art keywords
network
strategy
chip
sub
lightweight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210030064.0A
Other languages
Chinese (zh)
Other versions
CN114372438B (en)
Inventor
李珍妮
谢胜利
王名为
元荣
凌家城
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202210030064.0A priority Critical patent/CN114372438B/en
Publication of CN114372438A publication Critical patent/CN114372438A/en
Application granted granted Critical
Publication of CN114372438B publication Critical patent/CN114372438B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/30Circuit design
    • G06F30/39Circuit design at the physical level
    • G06F30/392Floor-planning or layout, e.g. partitioning or placement
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/30Circuit design
    • G06F30/39Circuit design at the physical level
    • G06F30/398Design verification or optimisation, e.g. using design rule check [DRC], layout versus schematics [LVS] or finite element methods [FEM]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2115/00Details relating to the type of the circuit
    • G06F2115/06Structured ASICs
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Abstract

The invention relates to a chip macro-cell layout method and system based on lightweight deep reinforcement learning. The policy network is divided into several mutually independent sub-networks according to the channels, which provides a new multi-channel, multi-layer structured-pruning idea for making the policy network lightweight and offers a way for the policy network to process data in blocks in the future. By introducing a group ℓ1 regularizer into the objective function of the policy network, intra-group and inter-group sparsity constraints are imposed on the weight parameters of the sub-networks, and the sparsified policy network is pruned and compressed. This better eliminates the gradient computation caused by unimportant input data, solves the problem of redundant network weight parameters, reduces the waste of storage and computation resources during the chip macro-cell placement stage of the deep-reinforcement-learning-based chip layout method, lowers the hardware requirements of the macro-cell placement process, and promotes the renewal and development of hardware design.

Description

Chip macro-unit layout method and system based on lightweight deep reinforcement learning
Technical Field
The invention relates to the field of machine learning and the field of chip layout, in particular to a chip macro-unit layout method and system based on lightweight deep reinforcement learning.
Background
The chip, i.e., the carrier of an integrated circuit, requires four important processes to come into being: design, manufacture, packaging, and testing. Progress in chips has driven the rapid development of many fields such as new-energy vehicles, the Internet of Things, artificial intelligence, and edge computing. However, although China is a major scientific and technological country whose demand for chips ranks first in the world, the self-supply rate of domestically made chips is less than 10%. Vigorously developing domestic chips and achieving domestic substitution for most commercial chips would further promote the transformation and upgrading of China's manufacturing industry and is a necessary path for China to become a technological power. However, the current chip design process often takes years, and the most complicated and time-consuming stage is chip layout, i.e., mapping a netlist containing macro-cell and standard-cell information onto the chip canvas. The complexity of chip layout stems mainly from three aspects: the size of the netlist, the granularity of the grid onto which the chip is placed, and the prohibitive computational cost of evaluating the true target metrics (evaluation with industry-standard EDA tools takes several hours or even more than a day). Despite decades of research on the chip layout problem, experts still need weeks of iteration with existing chip layout tools to generate a layout solution that meets all design criteria.
Recently, Google proposed a chip layout method based on deep reinforcement learning, aiming to quickly map a netlist containing macro cells and standard cells onto a chip canvas while optimizing power, performance, and area (PPA) and observing constraints on placement density and routing congestion. Google treats chip layout as a reinforcement learning problem and optimizes it by training a deep reinforcement learning network. Experimental results show that, compared with the most advanced reference models, the method achieves better PPA on Google's TPU. More importantly, it can generate, within 6 hours, a chip layout that is superior or comparable to one designed by a professional human chip designer.
However, the chip layout environment is complex, and the chip layout method based on deep reinforcement learning needs to train a huge, redundant deconvolution network as the policy network to generate an optimal layout strategy for the chip macro cells. As a result, training the policy network and generating the chip macro-cell layout strategy occupy huge storage and computation resources, which places high demands on hardware devices.
Therefore, making the deep reinforcement learning network lightweight reduces the hardware requirements of the chip macro-cell placement stage in the chip layout method based on deep reinforcement learning, promotes the renewal and development of hardware design, and has broad application prospects in the field of artificial-intelligence chip layout.
Disclosure of Invention
The invention aims to provide a chip macro-cell layout method and system based on lightweight deep reinforcement learning, which use a lightweight deep reinforcement learning network to reduce the hardware requirements of the chip macro-cell placement process in the chip layout method based on deep reinforcement learning and to promote the renewal and development of hardware design.
In order to achieve the purpose, the invention provides the following scheme:
a chip macro-cell layout method based on lightweight deep reinforcement learning comprises the following steps:
generating a three-dimensional state space according to the macro-cell features and the netlist information of the chip; the netlist information of the chip comprises a netlist graph and netlist metadata;
training a lightweight deep reinforcement learning network; the lightweight deep reinforcement learning network comprises a lightweight policy network and a value network; the value network is used to guide the training of the lightweight policy network; the lightweight policy network comprises a plurality of sub-networks and is obtained by introducing a group ℓ1 regularizer and training a deconvolution network through pruning and compression operations;
taking the three-dimensional state space as input, and outputting an optimal layout strategy of the chip macro unit according to the trained lightweight deep reinforcement learning network;
and guiding the macro units to be mapped to the chip canvas one by one according to the optimal layout strategy.
A chip macro cell layout system based on lightweight deep reinforcement learning comprises:
the data acquisition module is used for generating a three-dimensional state space according to the macro-cell features and the netlist information of the chip; the netlist information of the chip comprises a netlist graph and netlist metadata;
the model training module is used for training a lightweight deep reinforcement learning network; the lightweight deep reinforcement learning network comprises a lightweight policy network and a value network; the value network is used to guide the training of the lightweight policy network; the lightweight policy network comprises a plurality of sub-networks and is obtained by introducing a group ℓ1 regularizer and training a deconvolution network through pruning and compression operations;
the strategy generation module is used for taking the three-dimensional state space as input and outputting an optimal layout strategy of the chip macro unit according to the trained lightweight deep reinforcement learning network;
and the mapping module is used for guiding the macro units to be mapped to the chip canvas one by one according to the optimal layout strategy.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a chip macro-unit layout method and a chip macro-unit layout system based on lightweight deep reinforcement learning, wherein a strategy network is divided into a plurality of mutually independent sub-networks according to channels, so that a new idea of multi-channel multi-layer structured pruning is provided for the lightweight of the strategy network, and a method is provided for the strategy network to perform block processing on data in the future; by introducing groups in the objective function of the policy network
Figure BDA0003465999110000031
The regularizer performs sparse constraint in and among groups on weight parameters of the sub-network, and performs pruning compression on a sparse strategy network, so that gradient calculation caused by some unimportant input data can be better eliminated, the problem of network weight parameter redundancy is solved, waste of storage resources and calculation resources in a chip macro unit layout process in a chip layout method based on deep reinforcement learning is reduced, requirements of the chip macro unit layout process on hardware equipment are reduced, and the updating development of hardware design is promoted.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a chip macro-cell layout method based on lightweight deep reinforcement learning according to embodiment 1 of the present invention;
FIG. 2 is a structural view of an embedding layer in embodiment 1 of the present invention;
fig. 3 is a schematic diagram of a training process of a lightweight deep reinforcement learning network in embodiment 1 of the present invention;
fig. 4 is a diagram of a physical model structure of a second policy network in embodiment 1 of the present invention;
fig. 5 is a structural diagram of a chip macro-cell layout system based on lightweight deep reinforcement learning according to embodiment 2 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a chip macro-unit layout method and a chip macro-unit layout system based on light-weight deep reinforcement learning, which reduce the requirements of a chip macro-unit layout process in the chip layout method based on deep reinforcement learning on hardware equipment by using a light-weight deep reinforcement learning network and promote the updating and development of hardware design.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Example 1:
a chip layout method based on deep reinforcement learning is provided by Google and specifically comprises the following two steps: firstly, a Value Network (Value Network) guides the training of a policy Network (policy Network), so that the policy Network gives the optimal layout policy of the current macro units, and then the trained policy Network guides all the macro units of a chip to be sequentially placed according to the size sequence; and secondly, after the layout of all macro cells is finished, finishing the layout of the standard cells by a force guiding method, thereby finishing the mapping from the netlist to the canvas of the chip. The method is the first placement layout of the chip with generalization capability, which can learn from the previous netlist layout and serve the new netlist layout, which enables the strategy network to generate the optimal layout strategy for the chip faster and better over time. However, the chip layout method based on deep reinforcement learning needs to train a huge redundant deconvolution network as a strategy network, which results in that the training of the strategy network and the generation of the chip macro-unit layout strategy occupy huge storage resources and calculation resources, and have high requirements on hardware devices.
To address this, referring to fig. 1, this embodiment provides a chip macro-cell layout method based on lightweight deep reinforcement learning, which uses a lightweight deep reinforcement learning network to reduce the hardware requirements of the chip macro-cell placement process and to promote the renewal and development of hardware design. The method comprises the following steps:
s1: generating a three-dimensional state space according to the macro unit characteristics and the network list information of the chip; the net list information of the chip includes a net list map and net list metadata.
A new neural network architecture is constructed as the embedding layer, encoding the netlist graph of the chip, the node features, and the information of the current macro to be placed to generate a three-dimensional state space, as shown in fig. 2 (a sketch in code follows the list below). This specifically comprises:
(1) inputting the macro-cell features and the netlist graph into a graph neural network and generating macro-cell embeddings and edge embeddings through graph convolution operations;
(2) inputting the netlist metadata into a fully connected network to obtain the netlist metadata embedding;
(3) averaging the edge embeddings to obtain the graph embedding;
(4) fusing the current macro-cell information with the macro-cell embedding to obtain the current macro-cell embedding;
(5) inputting the netlist metadata embedding, the graph embedding, and the current macro-cell embedding into the fully connected network to obtain the current three-dimensional state space S_t.
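For illustration, the five steps above can be sketched in PyTorch as follows; this is a hedged sketch rather than the patent's implementation, and the module names, dimensions, unbatched tensors, and the single graph-convolution round are assumptions:

```python
import math
import torch
import torch.nn as nn

class EmbeddingLayer(nn.Module):
    def __init__(self, macro_feat_dim, meta_dim, embed_dim=32, state_shape=(8, 8, 16)):
        super().__init__()
        self.node_fc = nn.Linear(macro_feat_dim, embed_dim)   # projects macro-cell features
        self.edge_fc = nn.Linear(2 * embed_dim, embed_dim)    # builds edge embeddings from endpoints
        self.meta_fc = nn.Sequential(nn.Linear(meta_dim, embed_dim), nn.ReLU())
        self.state_shape = state_shape
        self.fuse_fc = nn.Linear(3 * embed_dim, math.prod(state_shape))

    def forward(self, macro_feats, edges, metadata, current_idx):
        # (1) graph neural network: one neighbour-averaging round, then edge embeddings
        h = torch.relu(self.node_fc(macro_feats))
        agg = torch.zeros_like(h).index_add_(0, edges[1], h[edges[0]])
        deg = torch.zeros(h.size(0), 1).index_add_(0, edges[1], torch.ones(edges.size(1), 1))
        node_emb = torch.relu(h + agg / deg.clamp(min=1))                        # macro-cell embeddings
        edge_emb = torch.relu(self.edge_fc(
            torch.cat([node_emb[edges[0]], node_emb[edges[1]]], dim=-1)))        # edge embeddings
        meta_emb = self.meta_fc(metadata)                                         # (2) metadata embedding
        graph_emb = edge_emb.mean(dim=0)                                          # (3) graph embedding
        cur_emb = node_emb[current_idx]                                           # (4) current macro-cell embedding
        state = self.fuse_fc(torch.cat([meta_emb, graph_emb, cur_emb], dim=-1))   # (5) fuse everything
        return state.view(*self.state_shape)                                      # 3-D state space S_t
```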
S2: training a lightweight deep reinforcement learning network; the lightweight deep reinforcement learning network comprises a lightweight policy network and a value network; the value network is used to guide the training of the lightweight policy network; the lightweight policy network comprises a plurality of sub-networks and is obtained by introducing a group ℓ1 regularizer and training a deconvolution network through pruning and compression operations.
As shown in fig. 3, the specific training process of the lightweight deep reinforcement learning network includes:
(1) initializing a deconvolution network based on a reinforcement learning structure to obtain a first deep reinforcement learning network, wherein the first deep reinforcement learning network comprises a first policy network and a value network;
(2) performing multi-channel, multi-layer structured processing on the first policy network to obtain a second policy network;
(3) introducing a group ℓ1 regularizer into the objective function of the second policy network and imposing intra-group and inter-group sparsity constraints on the weight parameters of the sub-networks to obtain a sparsified policy network;
(4) pruning and compressing the sparsified policy network to obtain the lightweight policy network.
In order to make the specific processes of (1) to (4) more clearly understood by those skilled in the art, the following description is made specifically.
1. Constructing a first policy network for self-learning chip macro-cell layout based on reinforcement learning
In this embodiment, a deconvolution network is adopted as the first policy network, making full use of the relationship between adjacent elements in the input three-dimensional state matrix, so that from the input three-dimensional state space the first policy network outputs the optimal two-dimensional layout strategy for the chip macro cells. The deconvolution network consists of an input layer, deconvolution layers, and an output layer. Similar to a convolutional network, the input layer of the deconvolution network receives data through non-fully-connected links, the output layer produces data through full connection, and one or more deconvolution layers lie between the input layer and the output layer.
Assume that the first policy network takes an input data matrix Y and produces an output data matrix X. The input layer, the deconvolution layers, and the output layer each contain a preset number of channels (i.e., the convolution kernels used in each layer). The physical model of the first policy network is shown in FIG. 4, where a network layer is denoted L (L_k denoting the k-th network layer) and a channel is denoted θ. The input matrix Y (width 8, height 8, depth 16) stored in the input layer L_1 is passed to the first deconvolution layer L_2 through one-to-one, non-full links, then enters the second deconvolution layer L_3 through 4 channels, and then enters the output layer through 4 channels, where the output matrix X is obtained through a full-connection operation.
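A minimal sketch of such a deconvolution policy network follows; the two deconvolution layers, 4 channels, the 8×8×16 input, and the 32×32 placement grid are illustrative assumptions rather than the exact sizes of FIG. 4:

```python
import torch
import torch.nn as nn

class DeconvPolicyNetwork(nn.Module):
    def __init__(self, in_depth=16, channels=4, grid=32):
        super().__init__()
        self.deconv1 = nn.ConvTranspose2d(in_depth, channels, 4, stride=2, padding=1)  # 8x8 -> 16x16
        self.deconv2 = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)  # 16x16 -> 32x32
        self.out_fc = nn.Linear(channels * grid * grid, grid * grid)                    # fully connected output layer
        self.grid = grid

    def forward(self, state):                          # state: (batch, in_depth, 8, 8)
        h = torch.relu(self.deconv1(state))
        h = torch.relu(self.deconv2(h))
        logits = self.out_fc(h.flatten(1))             # full-connection output operation
        return logits.softmax(-1).view(-1, self.grid, self.grid)  # 2-D placement distribution
```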
2. Carrying out multi-channel and multi-layer structured preprocessing on the first policy network to obtain a second policy network
In the first policy network, the channel sizes of different network layers are not identical because the network layers themselves differ in size. The width and height of a channel can therefore be set empirically when the first policy network is constructed, but the depth of a channel must equal the depth of its corresponding network layer divided by the number of channels. Thus, unlike a fully connected network, in the first policy network each channel is connected only to part of the elements of its corresponding network layer, and each element of a network layer is connected to only one channel. That is, assuming the number of channels of the first policy network is 4, the square cross-section formed by the height and width of a network layer is divided into 4 equally sized blocks according to the number of channels; each block is deconvolved by its corresponding channel, the results are averaged through the activation function, and the averaged result forms the input of the next network layer. For example, as indicated by the dotted lines in FIG. 4, the first data matrix of the second deconvolution layer L_3 is obtained by deconvolving the first data matrix of each of the 4 blocks of the first deconvolution layer L_2 and averaging the results; similarly, the remaining 3 data matrices of L_3 are each obtained by deconvolving and averaging the 4 data matrices corresponding to the 4 blocks of L_2.
Because each element in the first policy network is connected to only one channel, this embodiment divides the first policy network into several mutually independent sub-networks according to the channels, obtaining the second policy network. Referring to FIG. 4, looking from the output layer of the first policy network back toward its input layer, the first policy network can be divided into 4 mutually independent sub-networks according to the number of channels. The inputs Y_1, Y_2, Y_3, Y_4 of the 4 sub-networks are determined by the last deconvolution layer of the first policy network. Specifically, in the first policy network of FIG. 4, the first data matrix input to the third-layer deconvolution layer L_3 is obtained by deconvolving and then averaging the corresponding first data matrices in the 4 channels of the second-layer deconvolution layer L_2. Since the neurons of the input layer L_1 correspond one-to-one with the neurons of the second-layer deconvolution layer L_2, these 4 data matrices form the input Y_1 of the first sub-network, and the inputs Y_2, Y_3, Y_4 of the remaining three sub-networks are obtained in the same way. The input data of the 4 sub-networks are therefore completely different but identical in size. The 4 groups of input data, different in content but identical in size, are deconvolved in the mutually independent sub-networks, which finally output X_1, X_2, X_3, X_4 at the output layer, each of the same size as X; the output X of the first policy network is recovered by averaging these four outputs.
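The block-wise deconvolve-and-average step described above can be sketched as follows; the sizes and the per-block kernels are illustrative assumptions:

```python
import torch
import torch.nn as nn

def structured_deconv_layer(x, kernels):
    """x: (batch, depth, H, W); kernels: one ConvTranspose2d per channel block."""
    channels = len(kernels)
    blocks = torch.chunk(x, channels, dim=1)            # split the layer depth into equal blocks
    outputs = [torch.relu(k(b)) for k, b in zip(kernels, blocks)]
    return torch.stack(outputs, dim=0).mean(dim=0)      # average the block outputs

# Example: 4 independent channel blocks, each fed one slice of the input layer
channels, depth = 4, 16
kernels = nn.ModuleList(
    nn.ConvTranspose2d(depth // channels, depth // channels, 4, stride=2, padding=1)
    for _ in range(channels))
y = torch.randn(1, depth, 8, 8)
next_layer_input = structured_deconv_layer(y, kernels)  # shape (1, 4, 16, 16)
```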
3. Introducing a group ℓ1 regularizer into the objective function of the second policy network to impose intra-group and inter-group sparsity constraints on the weight parameters of the sub-networks and obtain a sparsified policy network
(1) Constructing the value network objective function
The current state of the agent in the environment is S_t. After the agent performs action a_t in the current state, the environment gives a reward R for that action, with discount rate γ, and the agent transitions to the next state S_{t+1}, where it executes the next action a_{t+1}.
A value network function V(S, W) is constructed to approximate the value in state S_t, where W denotes the weight parameters of the value network. The temporal-difference error δ (TD-error) can then be expressed as:
δ = R + γV(S_{t+1}, W) - V(S_t, W)
The value network updates its parameters by minimizing the TD-error, so the objective function of the value network is obtained by taking the expectation of the squared TD-error, specifically:
J(W) = E[δ²]
where E(·) denotes expectation.
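In code, the TD-error δ and the value objective J(W) = E[δ²] reduce to a few lines (a sketch; value_net is assumed to be any module that maps a state to a scalar value):

```python
import torch

def value_loss(value_net, s_t, s_next, reward, gamma=0.99):
    """TD-error delta = R + gamma*V(S_{t+1}, W) - V(S_t, W); value objective J(W) = E[delta^2]."""
    v_t = value_net(s_t)                        # V(S_t, W)
    with torch.no_grad():
        v_next = value_net(s_next)              # V(S_{t+1}, W), treated as a fixed target
    delta = reward + gamma * v_next - v_t       # TD-error
    return delta.pow(2).mean(), delta.detach()  # loss to minimise, plus delta reused as the advantage
```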
(2) Constructing the objective function of the policy network
A second policy network function π(a_t|S_t) is constructed, where S_t denotes the current state of the agent in the environment and a_t denotes an action the agent may perform in the current state. In the chip layout method based on deep reinforcement learning, the Proximal Policy Optimization (PPO) algorithm is used to construct the objective function of the second policy network:
J(θ) = E[min(r_t(θ)·Â_t, clip(r_t(θ), 1-ε, 1+ε)·Â_t)]
where θ denotes the weight parameters of the policy network, r_t(θ) = π_θ(a_t|S_t)/π_θold(a_t|S_t) denotes the probability ratio between the new and old policy network functions, ε denotes the clipping range, and Â_t denotes the advantage function (the TD-error can be used instead).
To prune the policy network effectively, the weight parameters within and between the sub-network groups must be made sparse. To this end, this embodiment introduces a group ℓ1 regularizer, imposing intra-group and inter-group sparsity constraints on the weight parameters of the sub-networks of the second policy network. Because the policy network updates its parameters by maximizing J(θ), the sparse regularization term on θ is negated here, giving the objective function of the sparsified policy network:
J_s(θ) = J(θ) - α·Σ_{m=1}^{M} Σ_{n=1}^{N} ||θ_m^(n)||_1
where J_s(θ) denotes the objective function of the sparsified policy network; α > 0 denotes the regularization term parameter; ||·||_1 denotes the ℓ1 regularizer; θ_m^(n) denotes the weight parameter matrix of the n-th layer of the m-th sub-network; M denotes the total number of sub-networks; and N denotes the total number of network layers in a sub-network.
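In code, this sparsified objective sits on top of the PPO surrogate as a simple penalty term (a sketch; `subnetworks` is assumed to be the list of sub-network modules, and the value of α is illustrative):

```python
import torch

def sparsified_objective(ppo_obj, subnetworks, alpha=1e-4):
    """J_s(theta) = J(theta) - alpha * sum_m sum_n ||theta_m^(n)||_1."""
    penalty = sum(p.abs().sum()                  # l1 norm of each layer's weight matrix theta_m^(n)
                  for subnet in subnetworks      # m = 1..M sub-networks
                  for p in subnet.parameters())  # n = 1..N layers per sub-network
    return ppo_obj - alpha * penalty             # the objective is maximised, so the penalty is subtracted
```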
Although the group ℓ1 regularizer introduced into the objective function of the second policy network imposes intra-group and inter-group sparsity constraints on the weight parameters of the sub-networks, the optimization problem of the objective function is still a convex optimization problem, so the Adam algorithm is used directly to further process the objective function of the value network and the objective function of the sparsified policy network, realizing alternate updates of the weight parameters, specifically as follows:
(1) Optimization of the value network objective function
Since the objective function of the value network is a convex function, the Adam algorithm can be used to optimize it directly. First, the objective function is differentiated to obtain the gradient g_t(W) at the t-th iterative update; then g_t(W) is used to compute the first-order estimate m_t and the second-order estimate v_t:
m_t = β_1·m_{t-1} + (1-β_1)·g_t(W)
v_t = β_2·v_{t-1} + (1-β_2)·g_t(W)²
where β_1 and β_2 denote the decay coefficients of the first-order estimate m_t and the second-order estimate v_t, respectively, and m_{t-1} and v_{t-1} are the first- and second-order estimates at the (t-1)-th iterative update. From m_t and v_t, the bias-corrected estimates m̂_t and v̂_t are computed:
m̂_t = m_t / (1 - β_1^t)
v̂_t = v_t / (1 - β_2^t)
Further, the update formula of the value network is obtained:
W_{t+1} = W_t - α_W·m̂_t / (√v̂_t + ε)
where α_W denotes the learning rate controlling the step size and ε denotes a numerical stability parameter that prevents the denominator from being 0.
(2) Optimization of the objective function of the sparsified policy network
Since the optimization problem of the objective function of the sparsified policy network is still a convex optimization problem, the Adam algorithm can be used directly to update its weight parameters. Similarly, the objective function of the sparsified policy network is differentiated to obtain the gradient g_t(θ) at the t-th iterative update; g_t(θ) is then used to compute the first-order estimate m_t and the second-order estimate v_t and their bias corrections m̂_t and v̂_t, and the update formula of the policy network is obtained:
θ_{t+1} = θ_t + α_θ·m̂_t / (√v̂_t + ε)
where α_θ denotes the learning rate controlling the step size and ε denotes a numerical stability parameter that prevents the denominator from being zero.
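Both updates are the standard Adam recursion, which torch.optim.Adam already implements; the hand-written sketch below shows one step for a single parameter tensor (treating the policy update as gradient ascent via maximize=True is an interpretation, not a formula quoted from the patent):

```python
import torch

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, maximize=False):
    m = beta1 * m + (1 - beta1) * grad                       # first-order estimate m_t
    v = beta2 * v + (1 - beta2) * grad.pow(2)                # second-order estimate v_t
    m_hat = m / (1 - beta1 ** t)                             # bias-corrected m_t
    v_hat = v / (1 - beta2 ** t)                             # bias-corrected v_t
    step = lr * m_hat / (v_hat.sqrt() + eps)
    new_param = param + step if maximize else param - step   # ascent for the policy, descent for the value net
    return new_param, m, v
```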
4. Pruning and compressing the sparsified policy network to obtain the lightweight policy network
After the sparsified policy network is obtained, the policy network can be pruned, compressed, and fine-tuned to make it lightweight. In this embodiment, the threshold limiting the number of sub-networks is set to T_p and the pruning threshold for the weight parameters is set to T_θ; after a certain number of iterative updates, pruning of the policy network begins. Specifically, if the number of sub-networks is greater than T_p and the expected value E[θ_m] of the weight parameter matrix θ_m of the m-th sub-network satisfies
|E[θ_m]| < T_θ
the weight parameters of that sub-network are set to zero, completing the pruning operation on the sparsified policy network.
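A sketch of this pruning rule; representing each sub-network as a separate module and the concrete threshold values are assumptions:

```python
import torch

def prune_subnetworks(subnetworks, t_p, t_theta):
    """Zero out sub-network m when the sub-network count exceeds T_p and |E[theta_m]| < T_theta."""
    kept = []
    for subnet in subnetworks:
        theta_m = torch.cat([p.detach().flatten() for p in subnet.parameters()])
        if len(subnetworks) > t_p and theta_m.mean().abs() < t_theta:
            with torch.no_grad():
                for p in subnet.parameters():
                    p.zero_()                       # pruning: set the sub-network's weights to zero
        else:
            kept.append(subnet)                     # survivors form the non-redundant policy network
    return kept
```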
Then, the redundant weight parameters in the pruned policy network are removed to obtain a non-redundant policy network; the non-redundant policy network is compressed, regenerating a brand-new lightweight policy network. Finally, keeping the objective functions of the lightweight policy network and the value network unchanged, the current state space is input and the deep reinforcement learning network is fine-tuned until it converges again, yielding the final lightweight deep reinforcement learning network.
The value network guides the training of the lightweight policy network, specifically as follows:
The current three-dimensional state space S_t is input through a fully connected layer into the trained lightweight policy network, which generates a probability distribution over the available positions for the current macro cell (i.e., the action space a_t of the current macro cell); an action is randomly sampled from the action space a_t and executed, yielding the next three-dimensional state space S_{t+1}.
The current three-dimensional state space S_t and the next three-dimensional state space S_{t+1} are input into the value network to obtain the first value V(S_t, W) and the second value V(S_{t+1}, W) of the two state spaces; together with the reward R given by the external environment, the temporal-difference error (TD-error) is computed. The TD-error replaces the advantage function in the objective function of the lightweight policy network, the objective function of the sparsified policy network is constructed with the PPO algorithm, the gradient of the objective function is computed, and the weight parameters of the lightweight policy network are updated, thereby guiding the training of the lightweight policy network.
In addition, the TD-error can be used to construct the objective function of the value network, whose gradient is then computed to update the weight parameters of the value network.
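Putting the pieces together, one value-guided training step for the lightweight policy network might look as follows. This sketch reuses the value_loss and ppo_objective helpers sketched earlier; the env object with a step method, the optimizers, and the log-probability bookkeeping for the old policy are assumptions:

```python
import torch

def train_step(policy_net, value_net, policy_opt, value_opt, env, s_t, log_prob_old, gamma=0.99):
    probs = policy_net(s_t).flatten(1)                            # distribution over canvas positions (action space a_t)
    dist = torch.distributions.Categorical(probs)
    a_t = dist.sample()                                           # randomly sample and execute one action
    s_next, reward = env.step(a_t)                                # environment returns S_{t+1} and reward R
    loss_v, delta = value_loss(value_net, s_t, s_next, reward, gamma)
    value_opt.zero_grad(); loss_v.backward(); value_opt.step()    # update the value network with the TD-error
    obj = ppo_objective(dist.log_prob(a_t), log_prob_old, delta)  # TD-error stands in for the advantage
    policy_opt.zero_grad(); (-obj).backward(); policy_opt.step()  # maximise the PPO surrogate
    return s_next, dist.log_prob(a_t).detach()
```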
S3: taking the three-dimensional state space as input, and outputting an optimal layout strategy of the chip macro unit according to the trained lightweight deep reinforcement learning network;
Specifically, the current macro-cell information is changed macro by macro, changing the input of the lightweight policy network, so that the optimal layout strategy for all chip macro cells is obtained and the placement of the chip macro cells is guided.
S4: and guiding the macro units to be mapped to the chip canvas one by one according to the optimal layout strategy.
This embodiment constructs a new neural network architecture as the embedding layer of the policy-value network and encodes the netlist graph of the chip, the node features, and the information of the current macro to be placed to generate a three-dimensional state space. After the state space is obtained, the policy network is made lightweight by a multi-channel, multi-layer deconvolution-network pruning technique based on the group ℓ1 regularizer; the policy network and the value network are trained, the policy network outputting the probability distribution over available positions for the current macro cell and the value network outputting the reward estimate for the current macro-cell position. The three-dimensional state space is input into the trained lightweight deep reinforcement learning network, which outputs the optimal layout strategy for the chip macro cells while occupying fewer storage and computation resources, guiding the chip macro cells to be mapped onto the chip canvas one by one in order of size and reducing the computation required by the policy network to generate placement strategies for the chip macro cells.
In this embodiment, the policy network is divided into several mutually independent sub-networks according to the channels, which provides a new multi-channel, multi-layer structured-pruning idea for making the policy network lightweight and offers a way for the policy network to process data in blocks in the future. By introducing a group ℓ1 regularizer into the objective function of the policy network, intra-group and inter-group sparsity constraints are imposed on the weight parameters of the policy-network sub-networks, and the sparsified policy network is pruned and compressed, realizing a self-learning chip macro-cell layout method based on lightweight deep reinforcement learning that better eliminates the gradient computation caused by unimportant input data, solves the problem of redundant network weight parameters, and reduces the waste of storage and computation resources during the chip macro-cell placement stage of the deep-reinforcement-learning-based chip layout method.
Example 2
Referring to fig. 5, the embodiment provides a chip macro cell layout system based on lightweight deep reinforcement learning, including:
the data acquisition module M1, used for generating a three-dimensional state space according to the macro-cell features and the netlist information of the chip; the netlist information of the chip comprises a netlist graph and netlist metadata;
the model training module M2, used for training a lightweight deep reinforcement learning network; the lightweight deep reinforcement learning network comprises a lightweight policy network and a value network; the value network is used to guide the training of the lightweight policy network; the lightweight policy network comprises a plurality of sub-networks and is obtained by introducing a group ℓ1 regularizer and training a deconvolution network through pruning and compression operations;
the strategy generation module M3 is used for taking the three-dimensional state space as input and outputting the optimal layout strategy of the chip macro unit according to the trained lightweight deep reinforcement learning network;
and the mapping module M4 is used for guiding the macro units to be mapped onto the chip canvas one by one according to the optimal layout strategy.
The emphasis of each embodiment in the present specification is on the difference from the other embodiments, and the same and similar parts among the various embodiments may be referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A chip macro-cell layout method based on lightweight deep reinforcement learning, characterized by comprising the following steps:
generating a three-dimensional state space according to the macro-cell features and the netlist information of the chip, the netlist information of the chip comprising a netlist graph and netlist metadata;
training a lightweight deep reinforcement learning network, the lightweight deep reinforcement learning network comprising a lightweight policy network and a value network, the value network being used to guide the training of the lightweight policy network, and the lightweight policy network comprising a plurality of sub-networks and being obtained by introducing a group ℓ1 regularizer and training a deconvolution network through pruning and compression operations;
taking the three-dimensional state space as input and outputting the optimal layout strategy of the chip macro cells according to the trained lightweight deep reinforcement learning network;
and guiding the macro cells to be mapped onto the chip canvas one by one according to the optimal layout strategy.
2. The chip macro-cell layout method based on lightweight deep reinforcement learning according to claim 1, wherein generating the three-dimensional state space according to the macro-cell features and the netlist information of the chip specifically comprises:
inputting the macro-cell features and the netlist graph into a graph neural network and generating macro-cell embeddings and edge embeddings through graph convolution operations;
inputting the netlist metadata into a fully connected network to obtain the netlist metadata embedding;
averaging the edge embeddings to obtain the graph embedding;
fusing the current macro-cell information with the macro-cell embedding to obtain the current macro-cell embedding;
and inputting the netlist metadata embedding, the graph embedding, and the current macro-cell embedding into the fully connected network to obtain the current three-dimensional state space.
3. The chip macro-cell layout method based on lightweight deep reinforcement learning according to claim 1, wherein guiding the training of the lightweight policy network by using the value network specifically comprises:
inputting the current three-dimensional state space into the lightweight policy network to obtain the action space of the current macro cell;
randomly sampling an action from the current action space and executing it to obtain the next three-dimensional state space;
inputting the current three-dimensional state space and the next three-dimensional state space into the value network to obtain a first value and a second value;
obtaining a temporal-difference error according to the first value, the second value, and the reward given by the external environment for the current action;
substituting the temporal-difference error for the advantage function in the objective function of the lightweight policy network;
and guiding the training of the lightweight policy network according to the substituted objective function.
4. The chip macro-cell layout method based on lightweight deep reinforcement learning according to claim 1, wherein the lightweight policy network comprising a plurality of sub-networks and being obtained by introducing a group ℓ1 regularizer and training a deconvolution network through pruning and compression operations specifically comprises:
initializing a deconvolution network based on a reinforcement learning structure to obtain a first policy network;
performing multi-channel, multi-layer structured processing on the first policy network to obtain a second policy network;
introducing a group ℓ1 regularizer into the objective function of the second policy network and imposing intra-group and inter-group sparsity constraints on the weight parameters of the sub-networks to obtain a sparsified policy network;
and pruning and compressing the sparsified policy network to obtain the lightweight policy network.
5. The chip macro-cell layout method based on lightweight deep reinforcement learning according to claim 4, wherein
the first policy network comprises network layers including an input layer, an output layer, and at least one deconvolution layer located between the input layer and the output layer;
each network layer comprises a preset number of channels;
and the depth of each channel is the depth of the corresponding network layer divided by the number of channels.
6. The chip macro-cell layout method based on lightweight deep reinforcement learning according to claim 5, wherein performing multi-channel, multi-layer structured processing on the first policy network to obtain the second policy network specifically comprises:
dividing each network layer into a plurality of regions corresponding to the channels;
forming a plurality of mutually independent sub-networks from the corresponding regions of the network layers, wherein the input data of each sub-network is determined by the input data of the corresponding channel in the last deconvolution layer of the first policy network;
and taking the first policy network divided into the plurality of sub-networks as the second policy network.
7. The chip macro-cell layout method based on lightweight deep reinforcement learning according to claim 4, wherein introducing the group ℓ1 regularizer into the objective function of the second policy network and imposing intra-group and inter-group sparsity constraints on the weight parameters of the sub-networks to obtain the sparsified policy network specifically comprises:
J_s(θ) = J(θ) - α·Σ_{m=1}^{M} Σ_{n=1}^{N} ||θ_m^(n)||_1
wherein J_s(θ) represents the objective function of the sparsified policy network; J(θ) represents the objective function of the second policy network; α represents the regularization term parameter, α > 0; ||·||_1 represents the ℓ1 regularizer; θ_m^(n) represents the weight parameter matrix of the n-th layer of the m-th sub-network; M represents the total number of sub-networks; and N represents the total number of network layers in a sub-network.
8. The chip macro-cell layout method based on lightweight deep reinforcement learning according to claim 4, wherein after introducing the group ℓ1 regularizer into the objective function of the second policy network and imposing intra-group and inter-group sparsity constraints on the weight parameters of the sub-networks to obtain the sparsified policy network, the weight parameters of the value network and the weight parameters of the sparsified policy network are optimized by the Adam algorithm.
9. The chip macro-cell layout method based on lightweight deep reinforcement learning according to claim 4, wherein pruning and compressing the sparsified policy network to obtain the lightweight policy network specifically comprises:
setting a sub-network threshold and a pruning threshold for the weight parameters of the sub-networks;
comparing the number of sub-networks with the sub-network threshold, and comparing the weight parameters of the sub-networks with the pruning threshold;
if the number of sub-networks is greater than the sub-network threshold and the expected value of a sub-network's weight parameters is less than or equal to the pruning threshold, setting the current weight parameters of that sub-network to zero to complete the pruning operation on the sparsified policy network;
removing the redundant weight parameters in the pruned sparsified policy network to obtain a non-redundant policy network;
and compressing the non-redundant policy network and fine-tuning it until convergence to obtain the lightweight policy network.
10. A chip macro-cell layout system based on lightweight deep reinforcement learning, characterized by comprising:
a data acquisition module, configured to generate a three-dimensional state space according to the macro-cell features and the netlist information of the chip, the netlist information of the chip comprising a netlist graph and netlist metadata;
a model training module, configured to train a lightweight deep reinforcement learning network, the lightweight deep reinforcement learning network comprising a lightweight policy network and a value network, the value network being used to guide the training of the lightweight policy network, and the lightweight policy network comprising a plurality of sub-networks and being obtained by introducing a group ℓ1 regularizer and training a deconvolution network through pruning and compression operations;
a strategy generation module, configured to take the three-dimensional state space as input and output the optimal layout strategy of the chip macro cells according to the trained lightweight deep reinforcement learning network;
and a mapping module, configured to guide the macro cells to be mapped onto the chip canvas one by one according to the optimal layout strategy.
CN202210030064.0A 2022-01-12 2022-01-12 Chip macro-unit layout method and system based on lightweight deep reinforcement learning Active CN114372438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210030064.0A CN114372438B (en) 2022-01-12 2022-01-12 Chip macro-unit layout method and system based on lightweight deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210030064.0A CN114372438B (en) 2022-01-12 2022-01-12 Chip macro-unit layout method and system based on lightweight deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN114372438A true CN114372438A (en) 2022-04-19
CN114372438B CN114372438B (en) 2023-04-07

Family

ID=81144202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210030064.0A Active CN114372438B (en) 2022-01-12 2022-01-12 Chip macro-unit layout method and system based on lightweight deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114372438B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106909728A (en) * 2017-02-21 2017-06-30 电子科技大学 A kind of FPGA interconnection resources configuration generating methods based on enhancing study
US20200067637A1 (en) * 2018-08-21 2020-02-27 The George Washington University Learning-based high-performance, energy-efficient, fault-tolerant on-chip communication design framework
CN111105035A (en) * 2019-12-24 2020-05-05 西安电子科技大学 Neural network pruning method based on combination of sparse learning and genetic algorithm
CN113505210A (en) * 2021-07-12 2021-10-15 广东工业大学 Medical question-answer generating system based on lightweight Actor-Critic generating type confrontation network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANNA GOLDIE et al.: "Placement Optimization with Deep Reinforcement Learning" *
SHAO Weiping et al.: "Lightweight Convolutional Neural Network Design with MobileNet and YOLOv3" *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115270686A (en) * 2022-06-24 2022-11-01 无锡芯光互连技术研究院有限公司 Chip layout method based on graph neural network
CN115828831A (en) * 2023-02-14 2023-03-21 之江实验室 Multi-core chip operator placement strategy generation method based on deep reinforcement learning
CN116562218A (en) * 2023-05-05 2023-08-08 之江实验室 Method and system for realizing layout planning of rectangular macro-cells based on reinforcement learning
CN116562218B (en) * 2023-05-05 2024-02-20 之江实验室 Method and system for realizing layout planning of rectangular macro-cells based on reinforcement learning
CN117829085A (en) * 2024-03-04 2024-04-05 中国科学技术大学 Connection diagram generation method suitable for chip wiring
CN117829085B (en) * 2024-03-04 2024-05-17 中国科学技术大学 Connection diagram generation method suitable for chip wiring

Also Published As

Publication number Publication date
CN114372438B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN114372438B (en) Chip macro-unit layout method and system based on lightweight deep reinforcement learning
CN114896937A (en) Integrated circuit layout optimization method based on reinforcement learning
CN108573303A (en) It is a kind of that recovery policy is improved based on the complex network local failure for improving intensified learning certainly
CN112488208B (en) Method for acquiring remaining life of island pillar insulator
CN112508192B (en) Increment heap width learning system with degree of depth structure
CN111859790A (en) Intelligent design method for curve reinforcement structure layout based on image feature learning
CN113708969B (en) Collaborative embedding method of cloud data center virtual network based on deep reinforcement learning
CN113344174A (en) Efficient neural network structure searching method based on probability distribution
CN111242285A (en) Deep learning model training method, system, device and storage medium
Zhang et al. Memory-efficient hierarchical neural architecture search for image restoration
CN113128617B (en) Spark and ASPSO based parallelization K-means optimization method
CN112836823B (en) Convolutional neural network back propagation mapping method based on cyclic recombination and blocking
CN112651488A (en) Method for improving training efficiency of large-scale graph convolution neural network
CN114841098A (en) Deep reinforcement learning Beidou navigation chip design method based on sparse representation driving
CN111488981A (en) Method for selecting sparse threshold of depth network parameter based on Gaussian distribution estimation
CN113052810B (en) Small medical image focus segmentation method suitable for mobile application
CN106897292A (en) A kind of internet data clustering method and system
CN107273970B (en) Reconfigurable platform of convolutional neural network supporting online learning and construction method thereof
CN111160557B (en) Knowledge representation learning method based on double-agent reinforcement learning path search
CN116822617A (en) Contrast learning training method and system based on structural heavy parameterization
CN112270353B (en) Clustering method for multi-target group evolution software module
Zhang et al. Crescendonet: A simple deep convolutional neural network with ensemble behavior
CN116451586A (en) Self-adaptive DNN compression method based on balance weight sparsity and Group Lasso regularization
CN113011589B (en) Co-evolution-based hyperspectral image band selection method and system
EP4040342A1 (en) Deep neutral network structure learning and simplifying method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant