CN116738923A - Chip layout optimization method based on reinforcement learning with constraint - Google Patents

Chip layout optimization method based on reinforcement learning with constraint

Info

Publication number
CN116738923A
CN116738923A
Authority
CN
China
Prior art keywords
constraint
soft
hard
layout
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310359245.2A
Other languages
Chinese (zh)
Other versions
CN116738923B (en)
Inventor
欧阳雅捷
刘晓翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University filed Critical Jinan University
Priority to CN202310359245.2A priority Critical patent/CN116738923B/en
Publication of CN116738923A publication Critical patent/CN116738923A/en
Application granted granted Critical
Publication of CN116738923B publication Critical patent/CN116738923B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/30Circuit design
    • G06F30/39Circuit design at the physical level
    • G06F30/392Floor-planning or layout, e.g. partitioning or placement
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/30Circuit design
    • G06F30/32Circuit design at the digital level
    • G06F30/33Design verification, e.g. functional simulation or model checking
    • G06F30/3308Design verification, e.g. functional simulation or model checking using simulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/30Circuit design
    • G06F30/36Circuit design at the analogue level
    • G06F30/367Design verification, e.g. using simulation, simulation program with integrated circuit emphasis [SPICE], direct methods or relaxation methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2111/00Details relating to CAD techniques
    • G06F2111/04Constraint-based CAD

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Geometry (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Architecture (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Design And Manufacture Of Integrated Circuits (AREA)

Abstract

The invention provides a chip layout optimization method based on reinforcement learning with constraint, which belongs to the field of integrated circuits and comprises the following steps: establishing a model based on a Markov decision process for the chip layout problem; distinguishing hard constraints and soft constraints for the chip design layout field; designing a reinforcement learning algorithm to handle hard constraints and soft constraints; designing a reward function to handle hard constraints and soft constraints respectively; training the agent with the constrained reinforcement learning algorithm, so that the agent finds a strategy that optimizes the soft constraints on the premise of satisfying the hard constraints; after the training of the agent is completed, applying the trained agent to the actual chip layout problem, and obtaining an optimized layout scheme through the action sequence executed by the agent. The invention adopts a constrained reinforcement learning algorithm and a targeted constraint processing mode, and can optimize the soft constraints on the premise of satisfying the hard constraints, thereby realizing a chip layout scheme with high performance and low power consumption.

Description

Chip layout optimization method based on reinforcement learning with constraint
Technical Field
The invention belongs to the field of integrated circuits, and particularly relates to a chip layout optimization method based on reinforcement learning with constraint.
Background
In modern integrated circuit design, chip layout is a critical step that directly affects chip performance, power consumption, and cost. The chip layout problem can be regarded as an optimization problem that requires optimizing multiple objectives under certain constraints. In chip layout there are two types of constraint, hard and soft. Hard constraints are conditions that must be met, such as routing rules and power supply requirements. Soft constraints are objectives that one wishes to optimize as far as possible, such as power consumption and performance. Violating a hard constraint can cause the chip to fail, while the degree to which the soft constraints are optimized determines the performance of the chip. Therefore, how to optimize the soft constraints on the premise of satisfying the hard constraints has become an important research direction.
Conventional chip layout optimization methods typically rely on heuristic algorithms and human experience. However, with the rapid development of integrated circuit technology, the complexity of chips keeps increasing, and conventional methods have difficulty coping with it. In recent years, reinforcement learning has received attention as a method of autonomous learning. However, existing reinforcement learning methods often have difficulty distinguishing between hard constraints and soft constraints, so they may violate hard constraints when solving chip layout problems, thereby affecting chip usability.
Disclosure of Invention
The invention aims to provide a chip layout optimization method based on reinforcement learning with constraint, which adopts a constrained reinforcement learning algorithm and a targeted constraint processing mode, and can optimize the soft constraints on the premise of satisfying the hard constraints, thereby realizing a chip layout scheme with high performance and low power consumption.
In order to achieve the above object, the present invention provides a chip layout optimization method based on constraint reinforcement learning, the method comprising:
s1: establishing a model based on a Markov decision process for the chip layout problem;
s2: aiming at the chip design layout field, distinguishing hard constraint and soft constraint;
s3: designing a reinforcement learning algorithm to process hard constraint and soft constraint;
s4: designing a reward function to respectively process hard constraint and soft constraint;
s5: training the intelligent agent by using a reinforcement learning algorithm with constraint, so that the intelligent agent finds a strategy for optimizing soft constraint on the premise of meeting hard constraint;
s6: after the training of the agent is completed, applying the trained agent to the actual chip layout problem, and obtaining an optimized layout scheme through the action sequence executed by the agent.
Further, establishing a model based on a Markov decision process for the chip layout problem, wherein the model comprises states, actions, state transition probabilities and rewarding functions;
the state is S, the current situation of the chip layout is represented, and the state is defined as a tuple;
the action is A, which represents the operation of the intelligent agent on the layout;
the state transition probability is P, which indicates the probability that the system will transition to a new state after executing a certain action in a given state;
the reward function is R and is used for evaluating the reward obtained by the agent after executing a certain action.
Further, the tuple includes placed element positions, an unplaced element list, a hard constraint state, and a soft constraint state;
the placed element positions represent the elements that have been placed on the chip and their position information;
the unplaced element list represents the elements that have not yet been placed on the chip;
the hard constraint state represents the connection relations between the placed elements in the current layout, as well as the distances and sizes between the elements;
the soft constraint state represents the performance indicators of the placed elements in the current layout.
Further, for the chip design layout field, hard constraints and soft constraints are distinguished. The hard constraints include, but are not limited to, space limitations, overlap limitations, connection limitations, power limitations, and thermal limitations; the soft constraints include, but are not limited to, power consumption optimization, delay optimization, space utilization optimization, line length optimization, and thermal profile optimization.
Further, in the process of hard constraint and soft constraint by designing a reinforcement learning algorithm, the reinforcement learning algorithm specifically comprises:
defining a feasibility function f (s, a) representing whether the action a taken in the state s satisfies the hard constraint; when the hard constraint is satisfied, f (s, a) =1; otherwise, f (s, a) =0;
the expected soft constraint rewards are maximized without violating the hard constraints, and therefore the objective function is expressed as:
J(π)=E_{s,a~π}[r(s,a)*f(s,a)]
where pi is the policy and r (s, a) represents the soft constraint reward obtained by taking action a in state s;
in order to optimize the objective function, the loss function is as follows:
L(π)=E_{s,a~π}[-r(s,a)*f(s,a)+λ*D_KL(π_old||π)]
wherein D_KL represents KL divergence, which is used for measuring the difference between the new strategy pi and the old strategy pi_old; lambda is a super parameter for balancing soft constraint rewards and policy update magnitudes;
the objective function is optimized by iteratively updating the strategy pi, in each iteration track data is first collected, then the strategy is updated using the loss function, in the updating process, it is ensured that the new strategy pi satisfies f (s, a) =1.
Further, the design reward function respectively processes hard constraint and soft constraint, specifically:
hard constraint processing: defining a state transition function T(s, a, s') representing the probability of transitioning to state s' after performing action a in state s; for state transitions that satisfy the hard constraints, the original transition probability is maintained, and if a state transition violates a hard constraint, its probability is set to 0 so as to prohibit that state transition;
soft constraint processing: a weight is assigned to each soft constraint, the weight is adjusted according to the specific requirements of the problem and the optimization target, and finally the weighted sum of the soft constraints is incorporated into the reward function.
Further, the state transition is expressed as:
T(s,a,s')=P(s'|s,a)*f(s,a,s')
wherein P (s ' |s, a) is the original state transition probability, f (s, a, s ') is an indication function, and the value is 1 when the state transition (s, a, s ') satisfies the hard constraint, otherwise, the value is 0;
the reward function is expressed as:
R(s,a,s')=r(s,a,s')+∑w_i*g_i(s,a,s')
where r(s, a, s') represents the original reward, w_i is the weight of the i-th soft constraint, and g_i(s, a, s') represents the contribution of the i-th soft constraint under the state transition (s, a, s').
Further, the reinforcement learning algorithm with constraint is used for training the intelligent agent, so that the intelligent agent finds a strategy for optimizing soft constraint on the premise of meeting hard constraint, and the strategy specifically comprises the following steps:
s5-1: ensuring that the hard constraint is satisfied;
s5-2: optimizing soft constraints;
s5-3: experience replay is adopted;
s5-4: using the target network;
s5-5: decaying the exploration rate.
Further, the reward function contains performance metrics of the layout, including power consumption and delay.
Further, after the training of the agent is completed, the trained agent is applied to the actual chip layout problem, and an optimized layout scheme is obtained through the action sequence executed by the agent, with the following specific steps:
s6-1: according to the current layout state, enabling the intelligent agent to execute actions;
s6-2: before each action is performed, checking whether the action would cause the hard constraint to be violated; if yes, skipping the action, and selecting the next action;
s6-3: updating the current layout scheme according to the action selected by the agent;
s6-4: after each time of updating the layout, calculating the satisfaction degree of soft constraint under the new layout;
s6-5: steps S6-1 to S6-4 are repeated until a preset number of optimizations or another termination condition is reached.
The beneficial technical effects of the invention are at least as follows:
(1) The hard constraint is incorporated into the calculation of the state transition probability, so that the intelligent agent is ensured to always meet the hard constraint, and the reliability of the layout scheme is improved. The optimization target of the soft constraint is embodied in the reward function, so that the intelligent agent can optimize the soft constraint in the learning process, and the performance of the layout scheme is improved. The hard constraint and the soft constraint are fully considered in the definition of the state and the action, so that the intelligent agent is helped to fully understand and master the characteristics of the layout problem in the learning process, and a better layout strategy is found.
(2) Hard constraints and soft constraints are handled separately in the reinforcement learning process. The hard constraint is guaranteed not to be violated by the adjustment of the state transition probability, while the soft constraint is optimized by the adjustment of the reward function. Such a design allows our algorithm to efficiently optimize soft constraints while following hard constraints.
(3) Chip layout optimization under the condition of considering hard constraint and soft constraint is realized. The resulting layout should be capable of exhibiting advantages in terms of hard and soft constraints.
(4) The hard constraint and the soft constraint are clearly distinguished and processed in a targeted manner, so that the reliability and the optimization degree of the chip layout scheme are improved. By adopting the reinforcement learning algorithm with the constraint, the soft constraint can be effectively optimized on the premise of meeting the hard constraint. Through autonomous learning, the intelligent agent can find out a strategy for optimizing the chip layout under the condition of no manual intervention, so that the design complexity and the labor cost are reduced.
Drawings
The invention will be further described with reference to the accompanying drawings. The embodiments do not constitute any limitation of the invention, and other drawings can be obtained by one of ordinary skill in the art from the following drawings without inventive effort.
FIG. 1 is a flow chart of a chip layout optimization method based on reinforcement learning with constraint.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
As shown in fig. 1, the method provided by the embodiment of the invention includes:
s1: Establishing a model based on a Markov decision process for the chip layout problem. The chip layout problem is modeled as a Markov Decision Process (MDP) that includes states, actions, transition probabilities, and a reward function. The state represents the current layout state, the action represents the operation performed on the layout, the transition probability represents the transition relations among the states, and the reward function is used to evaluate the quality of the layout. The specific definitions are as follows:
a) State (S): the state represents the current situation of the chip layout. The state is defined as a tuple (placed element positions, unplaced element list, hard constraint state, soft constraint state), where
Placed element positions: representing the elements that have been placed on the chip and their position information.
List of unplaced elements: representing the elements that have not yet been placed on the chip.
Hard constraint state: representing the connection relationships between the placed components in the current layout, as well as the distance, size, etc. constraints between the components. This information can be used to ensure that the layout always satisfies the hard constraints during state transitions.
Soft constraint state: representing performance metrics such as power consumption, delay, etc. of the placed elements in the current layout. This information can be used to evaluate the performance of the layout in optimizing the soft constraints.
b) Action (A): an action represents an operation that the agent can take, such as selecting an unplaced element and placing it at a certain position in the layout. The action set needs to be generated on the premise that the hard constraints are satisfied, so as to ensure that all selectable actions do not violate the hard constraints.
c) State transition probability (P): state transition probabilities describe the probability that a system will transition to a new state after performing some action in a given state. The inclusion of the hard constraint in the calculation of the state transition probability causes the state transition that violates the hard constraint to be disabled. This helps ensure that the agent always meets the hard constraints during the learning process.
d) Reward function (R): the reward function is used to evaluate the reward that the agent obtains after performing an action. In the present invention, the reward function mainly considers the optimization of the soft constraints. In order to embody the optimization targets of the soft constraints, performance indicators of the layout (such as power consumption, delay, etc.) are included in the reward function, so that the agent can optimize these indicators during the learning process.
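A minimal sketch of this MDP formulation is given below, assuming a simple coordinate-pair representation of placed elements; the field names and types are illustrative assumptions rather than the patented data structure.

```python
# Illustrative sketch of the MDP state and action from S1; names and types are assumptions.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class LayoutState:
    placed: Dict[str, Tuple[float, float]] = field(default_factory=dict)  # element id -> (x, y) position
    unplaced: List[str] = field(default_factory=list)                     # elements not yet on the chip
    hard_state: Dict[str, float] = field(default_factory=dict)            # e.g. clearances, overlaps, sizes
    soft_state: Dict[str, float] = field(default_factory=dict)            # e.g. power and delay estimates

@dataclass
class PlaceAction:
    element: str                       # which unplaced element to place
    position: Tuple[float, float]      # where to place it on the chip
```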
By definition, the present invention has the following advantages in handling hard and soft constraints:
the hard constraint is incorporated into the calculation of the state transition probability, so that the intelligent agent is ensured to always meet the hard constraint, and the reliability of the layout scheme is improved.
The optimization target of the soft constraint is embodied in the reward function, so that the intelligent agent can optimize the soft constraint in the learning process, and the performance of the layout scheme is improved.
The hard constraint and the soft constraint are fully considered in the definition of the state and the action, so that the intelligent agent is helped to fully understand and master the characteristics of the layout problem in the learning process, and a better layout strategy is found.
S2: for the chip design layout field, hard constraints and soft constraints are distinguished. For the chip design layout field, the hard constraint and the soft constraint are clearly distinguished. Hard constraints including routing rules, power supplies, etc., violating hard constraints can cause the chip to fail to function properly; soft constraints include power consumption, performance, etc., and optimizing soft constraints can improve the overall performance of the chip.
Some specific definitions in engineering are given below:
hard constraint: the following are some examples of hard constraints for the chip layout problem:
space limitations: the component must lie entirely within the chip boundary.
Overlap limit: there cannot be any spatial overlap between the two elements.
Connection restriction: the connections between all elements must meet predefined connection rules.
Power supply limitation: the power requirements of each element must be within a specified range.
Thermal limit: the temperature profile of the chip must meet design requirements.
Soft constraint: the following are some examples of soft constraints for the chip layout problem:
Power consumption optimization: the total power consumption of the chip is reduced.
Delay optimization: reducing the overall delay of the signal transmission path.
Space utilization optimization: the utilization rate of the chip space is improved, and the idle area of the layout is reduced.
Line length optimization: the overall length of the interconnect lines is reduced.
Optimizing heat distribution: the heat distribution inside the chip is improved, and the local hot spots are reduced.
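For illustration, two of the hard constraints above (space and overlap limitations) and one soft metric (wirelength) might be evaluated as in the sketch below; the geometry conventions, function names, and the half-perimeter wirelength proxy are assumptions for demonstration, not part of the claimed method.

```python
# Illustrative constraint checks on a rectangle-based layout; names and formulas are assumptions.
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height)

def inside_chip(rect: Rect, chip_w: float, chip_h: float) -> bool:
    """Space limitation: the element must lie entirely within the chip boundary."""
    x, y, w, h = rect
    return x >= 0.0 and y >= 0.0 and x + w <= chip_w and y + h <= chip_h

def overlaps(a: Rect, b: Rect) -> bool:
    """Overlap limitation: True if two elements share any area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def hard_constraints_ok(layout: Dict[str, Rect], chip_w: float, chip_h: float) -> bool:
    rects = list(layout.values())
    if any(not inside_chip(r, chip_w, chip_h) for r in rects):
        return False
    return not any(overlaps(rects[i], rects[j])
                   for i in range(len(rects)) for j in range(i + 1, len(rects)))

def total_wirelength(layout: Dict[str, Rect], nets: List[List[str]]) -> float:
    """Soft metric: half-perimeter wirelength over each net's element centers."""
    length = 0.0
    for net in nets:
        xs = [layout[e][0] + layout[e][2] / 2 for e in net]
        ys = [layout[e][1] + layout[e][3] / 2 for e in net]
        length += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return length
```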
S3: the design reinforcement learning algorithm handles hard constraints and soft constraints. An improved Constrained Policy Optimization (CPO) algorithm is designed that specifically handles both hard and soft constraints. Referred to as HS-CPO (Hard-Soft Constrained Policy Optimization).
The core idea of HS-CPO is to incorporate hard and soft constraints into the objective function and the loss function, respectively. Specifically, the hard constraint is represented as a feasibility function to measure whether the policy satisfies the hard constraint. Soft constraints are then included as part of the optimization objective in the loss function.
The key components of the HS-CPO algorithm are as follows:
feasibility function: a feasibility function f (s, a) is defined, indicating whether taking action a in state s satisfies the hard constraint. When the hard constraint is satisfied, f (s, a) =1; otherwise, f (s, a) =0.
To be more adaptive to the chip layout task, the hard constraints of the chip layout may be mapped to the state space and the action space and incorporated into the feasibility function.
Specifically, a deep neural network is used as a function approximator, and the state s and the action a are input to output a feasibility function value. By training the neural network, a feasibility function capable of effectively judging the hard constraint in the chip layout problem is obtained.
This straightforward binary definition can be problematic for chip layout, so the feasibility function f(s, a) is redefined according to the characteristics of the chip layout problem. In a chip layout task, the hard constraints may include minimum distances between elements, restrictions on hot-spot areas, etc. A targeted feasibility function is therefore designed, such as f(s, a) = 1 - exp(-Σc_i), where c_i is the penalty term for each hard constraint. In this way, the hard-constraint characteristics of the chip layout task can be captured better.
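A minimal sketch of such a penalty-shaped feasibility function follows, assuming rectangular elements and two illustrative penalty terms (minimum spacing and hot-spot exclusion); the concrete definitions of the c_i terms are assumptions for demonstration only.

```python
# Sketch of the penalty-shaped feasibility function f = 1 - exp(-sum of hard-constraint penalties);
# the penalty terms below (minimum spacing, hot-spot exclusion) are illustrative assumptions.
import math
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height)

def spacing_penalty(layout: Dict[str, Rect], min_dist: float) -> float:
    """Sum of how far each element pair falls short of the required minimum spacing."""
    penalty, rects = 0.0, list(layout.values())
    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            xi, yi, wi, hi = rects[i]
            xj, yj, wj, hj = rects[j]
            dx = max(0.0, xi - (xj + wj), xj - (xi + wi))   # horizontal gap (0 if overlapping)
            dy = max(0.0, yi - (yj + hj), yj - (yi + hi))   # vertical gap (0 if overlapping)
            penalty += max(0.0, min_dist - math.hypot(dx, dy))
    return penalty

def hotspot_penalty(layout: Dict[str, Rect], hotspots: List[Rect]) -> float:
    """Number of elements whose center falls inside a restricted hot-spot region."""
    count = 0
    for x, y, w, h in layout.values():
        cx, cy = x + w / 2, y + h / 2
        if any(hx <= cx <= hx + hw and hy <= cy <= hy + hh for hx, hy, hw, hh in hotspots):
            count += 1
    return float(count)

def feasibility(layout: Dict[str, Rect], min_dist: float, hotspots: List[Rect]) -> float:
    """Penalty-shaped form stated in the text: f = 1 - exp(-Σ c_i)."""
    total = spacing_penalty(layout, min_dist) + hotspot_penalty(layout, hotspots)
    return 1.0 - math.exp(-total)
```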
Objective function: in HS-CPO, our goal is to maximize the desired soft constraint rewards without violating the hard constraint. Thus, the objective function can be expressed as:
J(π)=E_{s,a~π}[r(s,a)*f(s,a)]
where pi is the policy and r (s, a) represents the soft constraint reward obtained by taking action a in state s.
Loss function: in order to optimize the objective function, the loss function is designed as follows:
L(π)=E_{s,a~π}[-r(s,a)*f(s,a)+λ*D_KL(π_old||π)]
wherein D_KL represents KL divergence, which is used for measuring the difference between the new strategy pi and the old strategy pi_old; lambda is a super parameter used to balance soft constraint rewards and policy update magnitudes.
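For illustration, the HS-CPO loss might be computed as in the following PyTorch sketch for a discrete action space; weighting the reward term by the log-probability of the taken action is an assumed policy-gradient surrogate for the expectation, and the tensor shapes and λ value are likewise assumptions.

```python
# PyTorch sketch of L(π) = E[-r(s,a)*f(s,a) + λ*D_KL(π_old || π)]; interfaces are assumptions.
import torch
import torch.nn.functional as F

def hs_cpo_loss(new_logits: torch.Tensor,    # current policy logits, shape (batch, num_actions)
                old_logits: torch.Tensor,    # frozen logits of π_old, same shape
                actions: torch.Tensor,       # actions taken, shape (batch,)
                soft_rewards: torch.Tensor,  # soft-constraint rewards r(s, a), shape (batch,)
                feasible: torch.Tensor,      # feasibility f(s, a) in {0, 1}, shape (batch,)
                lam: float = 0.1) -> torch.Tensor:
    new_log_probs = F.log_softmax(new_logits, dim=-1)
    old_log_probs = F.log_softmax(old_logits.detach(), dim=-1)
    # KL(π_old || π), averaged over the batch, penalizes large policy updates.
    kl = torch.sum(old_log_probs.exp() * (old_log_probs - new_log_probs), dim=-1).mean()
    # Reward term only counts transitions that satisfy the hard constraints (f = 1).
    taken_log_probs = new_log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    reward_term = (soft_rewards * feasible * taken_log_probs).mean()
    return -reward_term + lam * kl
```

Minimizing this loss pushes probability mass toward feasible, high-reward actions while the KL term keeps the new policy close to the old one.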
Algorithm iteration: HS-CPO optimizes the objective function by iteratively updating the strategy π. In each iteration, trajectory data are first collected, and then the strategy is updated with the loss function. During the update, the new policy is ensured to satisfy the hard constraints (i.e., f(s, a) = 1). Through HS-CPO, the chip layout problem can be solved effectively while handling both the hard and the soft constraints.
S4: the bonus function is designed to handle hard constraints and soft constraints, respectively. The hard constraint and the soft constraint are treated differently when designing the bonus function. For a hard constraint, it is translated into a portion of the state transition probability such that state transitions that violate the hard constraint are disabled. For soft constraints, they are incorporated into the reward function in order to optimize it during the learning process. The method comprises the following steps:
hard constraint processing: To incorporate the hard constraints into the state transition probabilities, a state transition function T(s, a, s') is defined that represents the probability of transitioning to state s' after performing action a in state s. For state transitions that satisfy the hard constraints, the original transition probabilities are maintained. However, if a state transition violates a hard constraint, its probability is set to 0 to prohibit that transition. Specifically, the state transition function may be defined as:
T(s,a,s')=P(s'|s,a)*f(s,a,s')
wherein P (s ' |s, a) is the original state transition probability, f (s, a, s ') is an indication function, and the value is 1 when the state transition (s, a, s ') satisfies the hard constraint, otherwise, is 0.
Soft constraint processing: To incorporate the soft constraints into the reward function, each soft constraint is first assigned a weight. The weights can be adjusted according to the specific requirements and optimization objectives of the problem. The weighted sum of the individual soft constraints is then incorporated into the reward function. Specifically, the reward function may be defined as:
R(s,a,s')=r(s,a,s')+∑w_i*g_i(s,a,s')
where r(s, a, s') represents the original reward, w_i is the weight of the i-th soft constraint, and g_i(s, a, s') represents the contribution of the i-th soft constraint under the state transition (s, a, s').
In this way, hard constraints and soft constraints can be handled separately in the reinforcement learning process. The hard constraint is guaranteed not to be violated by the adjustment of the state transition probability, while the soft constraint is optimized by the adjustment of the reward function. Such a design allows the algorithm to efficiently optimize the soft constraints while following the hard constraints.
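The two treatments can be summarized in a short sketch: transition probabilities are masked by the hard-constraint indicator, and soft constraints enter the reward as a weighted sum. The weights and the two soft-constraint terms below are assumed values for demonstration.

```python
# Sketch of the two treatments in S4; the weights w_i and soft-constraint terms g_i are assumed values.
from typing import Dict

def masked_transition(p_original: float, hard_ok: bool) -> float:
    """T(s, a, s') = P(s'|s, a) * f(s, a, s'): forbidden transitions get probability 0."""
    return p_original if hard_ok else 0.0

def shaped_reward(base_reward: float,
                  soft_terms: Dict[str, float],
                  weights: Dict[str, float]) -> float:
    """R(s, a, s') = r(s, a, s') + sum_i w_i * g_i(s, a, s')."""
    return base_reward + sum(weights[name] * value for name, value in soft_terms.items())

# Example: negative g_i values reward reductions in power and wirelength for this transition.
reward = shaped_reward(
    base_reward=1.0,
    soft_terms={"power": -0.3, "wirelength": -1.2},  # g_i(s, a, s') values
    weights={"power": 0.5, "wirelength": 0.2},       # w_i chosen per optimization goal
)
```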
S5: training the intelligent agent by using a reinforcement learning algorithm with constraint, so that the intelligent agent finds a strategy for optimizing soft constraint on the premise of meeting hard constraint. During the training process, the agent will learn how to maximize the reward function without violating the hard constraint, thereby achieving optimization of the soft constraint. The method comprises the following steps:
s5-1: Ensuring that the hard constraints are satisfied: during the training of the agent, it must be ensured that the hard constraints are not violated. To this end, every time an action is about to be performed, it is checked whether the action would result in a hard constraint being violated. If so, execution of the action is prohibited and a negative reward is given to penalize the agent. In this way, the agent learns to obey the hard constraints.
S5-2: optimizing soft constraints: on the premise of meeting the hard constraint, the intelligent agent is expected to find out the strategy for optimizing the soft constraint. Soft constraints can be incorporated into the reward function, enabling the intelligent agent to optimize the soft constraints during training by adjusting weights and contribution values. Specifically, each soft constraint may be assigned a weight based on the characteristics of the problem, and these weights may be added to the reward function multiplied by the contribution of the corresponding soft constraint.
S5-3: experience playback is used: in order to improve training effect of the intelligent agent, an experience playback technology is adopted. After each action is performed, state transitions (including state, action, rewards, and next state) are stored in an experience playback buffer. Then, a batch of experience is randomly extracted from the buffer for training. By doing so, the time correlation can be broken, and the learning effect of the intelligent agent is improved.
S5-4: using the target network: to stabilize the training process, a target network is employed. The target network is a network with the same structure as the main network, but its parameters are updated slowly during the training process. The target value may be calculated using a target network to reduce instability during training.
S5-5: attenuation exploration rate: in order to gradually turn the agent around during the training process to utilize the learned knowledge, the exploration rate may be gradually reduced during the training process. Thus, the agent initially explores the environment in large quantities, and as training proceeds, it is increasingly focused on learned strategies.
Only the first two steps are designed specifically for the hard and soft constraints; the remaining steps are common reinforcement learning techniques that improve the learning effect and stability of the agent.
Specifically, the training process of HS-CPO is as follows:
1. The policy network π and the value function network V are initialized, together with the target networks π_target and V_target. An initial value of the exploration rate ε and its decay rate are set.
2. For each training round:
a) Generating a trajectory τ: starting from the initial state s0, the following operations are performed until a terminal state is reached:
i. An action is selected randomly with probability ε, or according to the current policy π with probability 1-ε.
ii. Whether the selected action a satisfies the hard constraints is checked. If so, the action is performed; otherwise another action is selected and a negative reward is given.
iii. The soft constraint reward r(s, a) is calculated.
iv. The state transition quadruple (s, a, r, s') is stored in the experience replay buffer.
v. The current state is updated: s = s'.
b) A small batch of data of size N is randomly extracted from the experience replay buffer.
c) Updating the value function network V using the small batch of data:
i. A target value is calculated using the target network V_target: y = r + γ*V_target(s').
ii. The predicted value V(s) of the value function network is calculated.
iii. The mean square error loss is calculated: L_V = (V(s) - y)^2.
iv. The parameters of the value function network V are updated using gradient descent.
d) Updating the policy network π using the small batch of data:
i. The action probability ratio is calculated: ρ = π(a|s) / π_old(a|s).
ii. The policy parameters are updated by minimizing the HS-CPO loss L(π) described in S3, ensuring that the new policy satisfies f(s, a) = 1.
e) Adjusting the soft constraint weights: during training, the weights of the soft constraints in the reward function can be adjusted appropriately according to the degree of optimization of the different soft constraints, so as to achieve a more balanced optimization effect.
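A compact skeleton of this training loop is sketched below; the environment and network interfaces (reset, step, feasible_actions, best_action, update, soft_update_from) are hypothetical names introduced for illustration, not the patented implementation.

```python
# Hypothetical interfaces throughout; this is a sketch of the procedure above, not the patented code.
import random
from collections import deque

def train(env, policy, value_net, target_value_net, rounds=100,
          eps_start=1.0, eps_decay=0.99, gamma=0.99, batch_size=64):
    buffer = deque(maxlen=100_000)                     # experience replay buffer
    eps = eps_start
    for _ in range(rounds):
        state, done = env.reset(), False
        while not done:                                # a) generate a trajectory
            feasible = env.feasible_actions(state)     # only hard-constraint-feasible actions
            if random.random() < eps:
                action = random.choice(feasible)       # explore with probability ε
            else:
                action = policy.best_action(state, feasible)
            next_state, soft_reward, done = env.step(action)
            buffer.append((state, action, soft_reward, next_state))
            state = next_state
        if len(buffer) >= batch_size:
            batch = random.sample(buffer, batch_size)          # b) sample a small batch
            value_net.update(batch, target_value_net, gamma)   # c) y = r + gamma * V_target(s')
            policy.update(batch)                               # d) minimize the HS-CPO loss
            target_value_net.soft_update_from(value_net)       # slow target-network update
        eps *= eps_decay                               # decay the exploration rate
```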
S6: after the training of the agent is completed, the training method is applied to the actual chip layout problem, and an optimized layout scheme is obtained through the action sequence executed by the agent. After the agent training is completed, it is applied to the actual chip layout problem. Through the action sequence executed by the intelligent agent, an optimized layout scheme can be obtained, and the scheme meets the hard constraint and simultaneously shows superiority in the aspect of soft constraint. In the project, the optimization is performed according to the following steps:
s6-1: Executing the actions of the agent: in S1-S5, the problem modeling has been established, the hard and soft constraints have been defined, the reinforcement learning algorithm has been selected, and the agent training has been completed. The trained agent is now applied to the practical chip layout problem: according to the current layout state, the agent executes a series of actions to optimize the layout.
S6-2: hard constraint checking: before each action is performed, it is checked whether the action would result in a hard constraint being violated. If so, this action is skipped and the next action is selected. This ensures that the layout always meets the hard constraint requirements.
S6-3: updating the layout: and updating the current layout scheme according to the action selected by the agent. This may include moving the assembly, rotating the assembly, changing the wiring, etc.
S6-4: evaluating the soft constraint satisfaction degree: after each update of the layout, the degree of satisfaction of the soft constraint under the new layout is calculated. This may be achieved by calculating a bonus function that already contains soft constraint related information.
S6-5: iterative optimization: steps S6-1-S6-4 are repeated until a preset number of optimizations or other termination conditions are reached (e.g., layout quality reaches a desired goal). In the whole optimization process, the intelligent agent can optimize soft constraint as much as possible on the premise of meeting hard constraint according to the learned strategy.
Through this process, on the basis of S1-S5, chip layout optimization under the condition of considering hard constraint and soft constraint is achieved. The resulting layout should be capable of exhibiting advantages in terms of hard and soft constraints.
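The deployment stage of S6 can likewise be sketched as a short loop, reusing the same hypothetical environment interface as the training sketch above.

```python
# Sketch of the S6 deployment loop with the same hypothetical interfaces as above.
def optimize_layout(env, policy, max_steps=1000, target_quality=None):
    state = env.reset()
    for _ in range(max_steps):
        feasible = env.feasible_actions(state)            # S6-2: skip actions violating hard constraints
        if not feasible:
            break
        action = policy.best_action(state, feasible)      # S6-1: act from the current layout state
        state, _, done = env.step(action)                 # S6-3: update the layout
        quality = env.soft_constraint_score(state)        # S6-4: evaluate soft-constraint satisfaction
        if done or (target_quality is not None and quality >= target_quality):
            break                                         # S6-5: termination condition reached
    return env.current_layout(state)
```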
While embodiments of the invention have been shown and described, it will be understood by those skilled in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.

Claims (10)

1. A method for optimizing a chip layout based on reinforcement learning with constraints, the method comprising:
s1: establishing a model based on a Markov decision process for the chip layout problem;
s2: aiming at the chip design layout field, distinguishing hard constraint and soft constraint;
s3: designing a reinforcement learning algorithm to process hard constraint and soft constraint;
s4: designing a reward function to respectively process hard constraint and soft constraint;
s5: training the intelligent agent by using a reinforcement learning algorithm with constraint, so that the intelligent agent finds a strategy for optimizing soft constraint on the premise of meeting hard constraint;
s6: after the training of the agent is completed, applying the trained agent to the actual chip layout problem, and obtaining an optimized layout scheme through the action sequence executed by the agent.
2. The method for optimizing a chip layout based on reinforcement learning with constraint according to claim 1, wherein the model based on a markov decision process is built for the chip layout problem, and the model includes states, actions, state transition probabilities and rewarding functions;
the state is S, the current situation of the chip layout is represented, and the state is defined as a tuple;
the action is A, which represents the operation of the intelligent agent on the layout;
the state transition probability is P, which indicates the probability that the system will transition to a new state after executing a certain action in a given state;
the reward function is R and is used for evaluating the reward obtained by the agent after executing a certain action.
3. The method for optimizing a chip layout based on reinforcement learning with constraint according to claim 2, wherein the tuple comprises placed element positions, an unplaced element list, a hard constraint state, and a soft constraint state;
the placed element positions represent the elements that have been placed on the chip and their position information;
the unplaced element list represents the elements that have not yet been placed on the chip;
the hard constraint state represents the connection relations between the placed elements in the current layout, as well as the distances and sizes between the elements;
the soft constraint state represents the performance indicators of the placed elements in the current layout.
4. The chip layout optimization method based on constraint reinforcement learning of claim 1, wherein the hard constraints and the soft constraints are distinguished for the chip design layout field, the hard constraints including but not limited to space limitations, overlap limitations, connection limitations, power limitations, and thermal limitations; the soft constraints including but not limited to power consumption optimization, delay optimization, space utilization optimization, line length optimization, and thermal profile optimization.
5. The chip layout optimization method based on constraint reinforcement learning according to claim 1, wherein the reinforcement learning algorithm is designed to process hard constraints and soft constraints, and specifically comprises:
defining a feasibility function f (s, a) representing whether the action a taken in the state s satisfies the hard constraint; when the hard constraint is satisfied, f (s, a) =1; otherwise, f (s, a) =0;
the expected soft constraint rewards are maximized without violating the hard constraints, and therefore the objective function is expressed as:
J(π)=E_{s,a~π}[r(s,a)*f(s,a)]
where pi is the policy and r (s, a) represents the soft constraint reward obtained by taking action a in state s;
in order to optimize the objective function, the loss function is as follows:
L(π)=E_{s,a~π}[-r(s,a)*f(s,a)+λ*D_KL(π_old||π)]
wherein D_KL represents KL divergence, which is used for measuring the difference between the new strategy pi and the old strategy pi_old; lambda is a super parameter for balancing soft constraint rewards and policy update magnitudes;
the objective function is optimized by iteratively updating the strategy pi, in each iteration track data is first collected, then the strategy is updated using the loss function, in the updating process, it is ensured that the new strategy pi satisfies f (s, a) =1.
6. The chip layout optimization method based on constraint reinforcement learning according to claim 1, wherein the design reward function respectively processes hard constraints and soft constraints, specifically:
hard constraint processing: defining a state transition function T(s, a, s') representing the probability of transitioning to state s' after performing action a in state s; for state transitions that satisfy the hard constraints, the original transition probability is maintained, and if a state transition violates a hard constraint, its probability is set to 0 so as to prohibit that state transition;
soft constraint processing: a weight is assigned to each soft constraint, the weight is adjusted according to the specific requirements of the problem and the optimization target, and finally the weighted sum of the soft constraints is incorporated into the reward function.
7. The method for optimizing a chip layout based on constrained reinforcement learning of claim 6, wherein the state transition is expressed as:
T(s,a,s')=P(s'|s,a)*f(s,a,s')
wherein P (s ' |s, a) is the original state transition probability, f (s, a, s ') is an indication function, and the value is 1 when the state transition (s, a, s ') satisfies the hard constraint, otherwise, the value is 0;
the reward function is expressed as:
R(s,a,s')=r(s,a,s')+∑w_i*g_i(s,a,s')
where r(s, a, s') represents the original reward, w_i is the weight of the i-th soft constraint, and g_i(s, a, s') represents the contribution of the i-th soft constraint under the state transition (s, a, s').
8. The chip layout optimization method based on constraint reinforcement learning according to claim 1, wherein the training of the agent by using the constraint reinforcement learning algorithm enables the agent to find a strategy for optimizing soft constraints on the premise of meeting hard constraints, specifically:
s5-1: ensuring that the hard constraint is satisfied;
s5-2: optimizing soft constraints;
s5-3: experience replay is adopted;
s5-4: using the target network;
s5-5: decaying the exploration rate.
9. The method for optimizing a chip layout based on constrained reinforcement learning according to claim 2, wherein the reward function comprises performance metrics of the layout, the performance metrics including power consumption and delay.
10. The method for optimizing the chip layout based on the reinforcement learning with constraint according to claim 1, wherein after the training of the agent is completed, the trained agent is applied to the actual chip layout problem, and an optimized layout scheme is obtained through an action sequence executed by the agent, with the following specific steps:
s6-1: according to the current layout state, enabling the intelligent agent to execute actions;
s6-2: before each action is performed, checking whether the action would cause the hard constraint to be violated; if yes, skipping the action, and selecting the next action;
s6-3: updating the current layout scheme according to the action selected by the agent;
s6-4: after each time of updating the layout, calculating the satisfaction degree of soft constraint under the new layout;
s6-5: steps S6-1 to S6-4 are repeated until a preset number of optimizations or another termination condition is reached.
CN202310359245.2A 2023-04-04 2023-04-04 Chip layout optimization method based on reinforcement learning with constraint Active CN116738923B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310359245.2A CN116738923B (en) 2023-04-04 2023-04-04 Chip layout optimization method based on reinforcement learning with constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310359245.2A CN116738923B (en) 2023-04-04 2023-04-04 Chip layout optimization method based on reinforcement learning with constraint

Publications (2)

Publication Number Publication Date
CN116738923A true CN116738923A (en) 2023-09-12
CN116738923B CN116738923B (en) 2024-04-05

Family

ID=87912185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310359245.2A Active CN116738923B (en) 2023-04-04 2023-04-04 Chip layout optimization method based on reinforcement learning with constraint

Country Status (1)

Country Link
CN (1) CN116738923B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117828701A (en) * 2024-03-05 2024-04-05 中国石油大学(华东) Engineering drawing layout optimization method, system, equipment and medium
CN117972812A (en) * 2024-03-26 2024-05-03 中国石油大学(华东) Engineering drawing layout optimization method, device, equipment and medium
CN117972812B (en) * 2024-03-26 2024-06-07 中国石油大学(华东) Engineering drawing layout optimization method, device, equipment and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111006693A (en) * 2019-12-12 2020-04-14 中国人民解放军陆军工程大学 Intelligent aircraft track planning system and method thereof
CN111144728A (en) * 2019-12-18 2020-05-12 东南大学 Deep reinforcement learning-based economic scheduling method for cogeneration system
US20200285204A1 (en) * 2019-03-04 2020-09-10 Fujitsu Limited Reinforcement learning method and reinforcement learning system
US20210247744A1 (en) * 2018-08-09 2021-08-12 Siemens Aktiengesellschaft Manufacturing process control using constrained reinforcement machine learning
US20220164657A1 (en) * 2020-11-25 2022-05-26 Chevron U.S.A. Inc. Deep reinforcement learning for field development planning optimization
CN115270698A (en) * 2022-06-23 2022-11-01 广东工业大学 Chip global automatic layout method based on deep reinforcement learning
CN115437406A (en) * 2022-09-16 2022-12-06 西安电子科技大学 Aircraft reentry tracking guidance method based on reinforcement learning algorithm

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210247744A1 (en) * 2018-08-09 2021-08-12 Siemens Aktiengesellschaft Manufacturing process control using constrained reinforcement machine learning
US20200285204A1 (en) * 2019-03-04 2020-09-10 Fujitsu Limited Reinforcement learning method and reinforcement learning system
CN111006693A (en) * 2019-12-12 2020-04-14 中国人民解放军陆军工程大学 Intelligent aircraft track planning system and method thereof
CN111144728A (en) * 2019-12-18 2020-05-12 东南大学 Deep reinforcement learning-based economic scheduling method for cogeneration system
US20220164657A1 (en) * 2020-11-25 2022-05-26 Chevron U.S.A. Inc. Deep reinforcement learning for field development planning optimization
CN115270698A (en) * 2022-06-23 2022-11-01 广东工业大学 Chip global automatic layout method based on deep reinforcement learning
CN115437406A (en) * 2022-09-16 2022-12-06 西安电子科技大学 Aircraft reentry tracking guidance method based on reinforcement learning algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Azalia Mirhoseini et al.: "Chip Placement with Deep Reinforcement Learning", arXiv, pages 1-15 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117828701A (en) * 2024-03-05 2024-04-05 中国石油大学(华东) Engineering drawing layout optimization method, system, equipment and medium
CN117828701B (en) * 2024-03-05 2024-05-24 中国石油大学(华东) Engineering drawing layout optimization method, system, equipment and medium
CN117972812A (en) * 2024-03-26 2024-05-03 中国石油大学(华东) Engineering drawing layout optimization method, device, equipment and medium
CN117972812B (en) * 2024-03-26 2024-06-07 中国石油大学(华东) Engineering drawing layout optimization method, device, equipment and medium

Also Published As

Publication number Publication date
CN116738923B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
CN116738923B (en) Chip layout optimization method based on reinforcement learning with constraint
Liu et al. An adaptive online parameter control algorithm for particle swarm optimization based on reinforcement learning
Ghorbani et al. Particle swarm optimization with smart inertia factor for solving non‐convex economic load dispatch problems
CN113760553B (en) Mixed part cluster task scheduling method based on Monte Carlo tree search
Long et al. A self‐learning artificial bee colony algorithm based on reinforcement learning for a flexible job‐shop scheduling problem
CN115940294B (en) Multi-stage power grid real-time scheduling strategy adjustment method, system, equipment and storage medium
CN110490319B (en) Distributed deep reinforcement learning method based on fusion neural network parameters
CN115758981A (en) Layout planning method based on reinforcement learning and genetic algorithm
CN111768028A (en) GWLF model parameter adjusting method based on deep reinforcement learning
JP7137074B2 (en) Optimization calculation method, optimization calculation device, and optimization calculation program
CN115238599A (en) Energy-saving method for refrigerating system and model reinforcement learning training method and device
JP6975685B2 (en) Learning control method and computer system
CN114942799B (en) Workflow scheduling method based on reinforcement learning in cloud edge environment
CN117833263A (en) New energy power grid voltage control method and system based on DDPG
CN114861368A (en) Method for constructing railway longitudinal section design learning model based on near-end strategy
KR20220162096A (en) Deep Neural Network Structure for Inducing Rational Reinforcement Learning Agent Behavior
CN113919108A (en) Multi-population hierarchical assisted evolution-based reference source structure optimization method
CN113256128A (en) Task scheduling method for balancing resource usage by reinforcement learning in power internet of things
Morales Deep Reinforcement Learning
CN113128753A (en) Operation order intelligent generation method based on deep reinforcement learning
Halici Reinforcement learning in random neural networks for cascaded decisions
CN117669739B (en) Agent-based intelligent negotiation strategy optimization method and system
Itazuro et al. Design environment of reinforcement learning agents for intelligent multiagent system
CN116506352B (en) Network data continuing forwarding selection method based on centralized reinforcement learning
CN114741970B (en) Improved circuit parameter optimization method for depth deterministic strategy gradient algorithm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant