CN114942895B - Address mapping strategy design method based on reinforcement learning - Google Patents
- Publication number
- CN114942895B (application CN202210714310.4A)
- Authority
- CN
- China
- Prior art keywords
- bim
- reinforcement learning
- strategy
- network
- learning model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/10—Address translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to an address mapping strategy design method based on reinforcement learning. A binary invertible matrix (Binary Invertible Matrix, BIM) is used to represent mainstream address mapping strategies, and the address mapping strategy with the best row-buffer (line cache) hit rate is trained in combination with a reinforcement learning model. The invertibility of the BIM guarantees a valid one-to-one mapping between physical addresses and memory cell addresses, and the BIM offers flexible expression of address mapping strategies, low hardware overhead, and other advantages that are difficult to replace.
Description
Technical Field
The invention relates to an address mapping strategy design method based on reinforcement learning.
Background
In computer architecture, processor performance and memory performance have long improved at unequal rates, so memory access latency has become a major factor limiting system performance. Since the "memory wall" problem was first identified, improving the performance of hardware accelerators has been a key research objective in computer architecture, and the memory controller is one of the keys to improving accelerator performance. Researchers at home and abroad have optimized memory controllers from many angles to reduce system latency. However, most address mapping strategies are highly application-specific: they cannot be generalized to other workloads and lack the flexibility needed to achieve high-performance memory access in domain-specific accelerators.
Disclosure of Invention
The invention aims to provide an address mapping strategy design method based on reinforcement learning, which uses a binary invertible matrix to represent mainstream address mapping strategies and, in combination with a reinforcement learning model, trains the address mapping strategy with the optimal row-buffer hit rate.
In order to achieve the above purpose, the technical scheme of the invention is as follows: an address mapping strategy design method based on reinforcement learning uses a binary invertible matrix BIM to represent the address mapping strategy and trains the strategy with the optimal row-buffer hit rate in combination with a reinforcement learning model. The implementation is as follows: the one-dimensional expansion of the BIM is taken as the input of the reinforcement learning model; the row-buffer hit rate of the initial BIM is taken as the model's current optimal value H_best; the model selects actions according to their probabilities to obtain candidate BIMs; when the row-buffer hit rate computed for a candidate BIM is higher than that of the current BIM, the model replaces the current BIM with the candidate; the reward value is then recalculated and the model parameters are updated. The model iterates and optimizes in this way and converges, under a preset stopping rule, to the trained BIM, simultaneously yielding the address mapping strategy with the highest row-buffer hit rate.
Compared with the prior art, the invention has the following beneficial effects: the invention combines, for the first time, a binary invertible matrix BIM with reinforcement learning for the address mapping strategy design of a memory controller. The BIM is extremely flexible in expressing address mapping strategies and can correctly represent all current mainstream strategies. In addition, the invention employs a reinforcement learning model based on the policy gradient, so that the BIM learns the address mapping strategy with the highest row-buffer hit rate for the different access patterns of a neural network accelerator. The trained BIM can then be implemented in hardware within the memory controller.
Drawings
Fig. 1 is a representation of an address mapping policy.
Fig. 2 is a schematic diagram showing mainstream address mapping policies expressed by BIM.
FIG. 3 is a schematic diagram of reinforcement learning strategy network optimization BIM.
FIG. 4 is an optimized iterative BIM algorithm.
FIG. 5 is a Mini-batch training reinforcement learning model algorithm.
FIG. 6 is a schematic diagram of a reinforcement learning model system workflow.
Detailed Description
The technical scheme of the invention is specifically described below with reference to the accompanying drawings.
The invention relates to an address mapping strategy design method based on reinforcement learning, which uses a binary invertible matrix BIM to represent the address mapping strategy and trains the strategy with the optimal row-buffer hit rate in combination with a reinforcement learning model. The implementation is as follows: the one-dimensional expansion of the BIM is taken as the input of the reinforcement learning model; the row-buffer hit rate of the initial BIM is taken as the model's current optimal value H_best; the model selects actions according to their probabilities to obtain candidate BIMs; when the row-buffer hit rate computed for a candidate BIM is higher than that of the current BIM, the model replaces the current BIM with the candidate; the reward value is then recalculated and the model parameters are updated. The model iterates and optimizes in this way and converges, under a preset stopping rule, to the trained BIM, simultaneously yielding the address mapping strategy with the highest row-buffer hit rate.
The following is a specific implementation procedure of the present invention.
The principle of a memory address mapping strategy is that physical addresses are mapped to specific DRAM cell locations according to certain rules. FIG. 1 illustrates the currently prevailing memory address mapping strategies using a simplified 8-bit physical address: the first 2 bits are Bank bits, followed by 4 row bits, and the last 2 bits are column address bits. FIG. 1(a) shows BRC, which maps addresses to physical locations in Bank-row-column order. FIG. 1(b) shows RBC, which permutes the Bank bits and the row bits, placing the row bits before the Bank bits and leaving the column bits unchanged. FIG. 1(c) shows bit reversal, i.e., the initial Bank bits and row bits are arranged in reverse order. FIG. 1(d) shows Permutation-based mapping, which XORs the Bank bits with part of the row address bits to generate new Bank address bits. FIG. 1(e) is the memory address mapping strategy based on a binary invertible matrix (Binary Invertible Matrix, BIM), which multiplies the initial physical address by the BIM to obtain the corresponding mapped address.
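As an illustration of this principle, the following Python sketch applies a BIM to an address using only AND/XOR logic. It is a model for illustration, not the patent's implementation; the 8-bit layout and the MSB-first bit order are assumptions based on the FIG. 1 description, and the RBC row order shown is hypothetical.

```python
def map_address(bim, addr, bits=8):
    """Multiply an address bit-vector by the BIM over GF(2): AND gates
    form the bit products and XOR gates accumulate the sums."""
    vec = [(addr >> i) & 1 for i in reversed(range(bits))]  # MSB-first bits
    out = []
    for row in bim:
        acc = 0
        for r, v in zip(row, vec):
            acc ^= r & v
        out.append(acc)
    return int("".join(map(str, out)), 2)

# Identity BIM: address unchanged, i.e. the plain BRC layout of FIG. 1(a).
identity = [[1 if i == j else 0 for j in range(8)] for i in range(8)]

# Moving the two Bank-bit rows after the four row-bit rows gives an
# RBC-style remap (hypothetical row order, for illustration only).
rbc = [identity[i] for i in (2, 3, 4, 5, 0, 1, 6, 7)]

print(bin(map_address(rbc, 0b10110011)))
```

Permutation-matrix BIMs reproduce bit-reordering policies such as BRC, RBC, and bit reversal, while rows containing several 1s reproduce XOR-based Permutation policies like that of FIG. 1(d).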
All of the strategies described above can be represented by a binary invertible matrix BIM: the required mapping is obtained by multiplying the original address by the BIM. Because the BIM consists only of 0s and 1s, the mapping can be realized in hardware with nothing more than AND gates and XOR gates, which perform the multiplications and additions respectively; this keeps the hardware overhead of memory address mapping low. The invertibility of the BIM guarantees a valid one-to-one mapping between physical addresses and memory cell addresses. As shown in FIG. 2, the mainstream address mapping policies of FIGS. 2(a)-(d) can each be expressed by a BIM. Because of these advantages of flexible expression and low hardware overhead, the BIM is chosen as the carrier of the memory address mapping strategy in the reinforcement-learning-based memory controller system.
1. Reinforcement learning optimization of BIM
The optimization of the BIM in the invention consists of applying elementary row/column transformations, within a policy gradient algorithm model, to a matrix initialized as the binary identity matrix. The action space of the reinforcement learning model consists of all possible row/column swaps of the binary invertible matrix.
(1) Policy network design
The invention uses a policy network π to learn the actions that optimize the BIM toward an address mapping strategy with higher access efficiency. The policy network is designed as two cascaded fully connected layers; the first layer uses ReLU as its activation function to introduce non-linearity, and the output of the second fully connected layer is fed to a Softmax function. FIG. 3 shows an example of BIM optimization. The design expands the BIM row by row into one-dimensional data serving as the input of the policy network. Based on the output probability distribution, the model selects an action from the action space as the current optimization action for the BIM. The BIM is transformed according to this action into a new binary invertible matrix, which serves as the input of the policy network at the next time step. In the example, the BIM is simplified to a 6x6 binary invertible matrix used as the address mapping policy; a row/column transformation is selected according to the model output, and the optimization iterates in this loop.
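A minimal NumPy sketch of such a policy network follows. The 6x6 size matches the FIG. 3 example, but the hidden width, the weight initialization, and the untrained random weights are illustrative assumptions, not values given in the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

b = 6                      # simplified 6x6 BIM, as in the example of FIG. 3
n_in = b * b               # the flattened BIM is the network input
n_hidden = 32              # hidden width is an assumption, not from the text
n_actions = b              # b-1 first-row swaps plus one NOP (see section (2))

W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
W2 = rng.normal(0.0, 0.1, (n_actions, n_hidden))

def policy(bim):
    """Two cascaded fully connected layers: ReLU, then softmax over actions."""
    x = np.asarray(bim, dtype=float).reshape(-1)
    h = np.maximum(0.0, W1 @ x)            # first layer + ReLU
    logits = W2 @ h
    e = np.exp(logits - logits.max())      # numerically stable softmax
    return e / e.sum()

probs = policy(np.eye(b))                  # start from the identity BIM
action = rng.choice(n_actions, p=probs)    # sample an action by probability
```

Sampling from the softmax output, rather than taking the argmax, is what lets the model explore the action space during training.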
(2) Action space optimization
In the binary invertible matrix BIM model, the number of actions in the action space of the reinforcement learning model is 2·C(b,2) = b(b-1), where b is the number of rows/columns of the binary invertible matrix. For a 32x32 BIM, the total action space is 992, i.e., each BIM transformation has 992 choices. Since training the reinforcement learning model requires many iterations of learning, such a large action space makes the search excessively long and degrades the performance of the model. To solve the problem of an oversized action space, this section compresses the action space during BIM optimization.
From linear algebra, any sequence of row/column swaps of a matrix can be expressed through elementary permutation matrices. As shown in equation (1), a row swap of the BIM is performed by left-multiplying the BIM by a permutation matrix M_pre, and a column swap by right-multiplying the BIM by a permutation matrix M_post:

BIM' = M_pre · BIM (row swap),  BIM' = BIM · M_post (column swap)    (1)
Because the BIM optimization starts from the identity matrix and permutation matrices are closed under multiplication, a series of row/column transformations can be equivalently realized using row transformations alone. Therefore, this work compresses the action space to the set of row-transformation actions only. The transformation expression is as follows:

M_pre · I · M_post = (M_pre · M_post) · I    (2)
After this compression of the action space, the action space is halved. To reduce the action search space further, this work restricts the BIM transformations to swaps between the first row and each of the other rows, giving b-1 possible actions. The feasibility basis of this design is that any swap of two rows i and j can be composed from first-row swaps, since (i j) = (0 i)(0 j)(0 i), so the optimization result of the BIM is not affected. In addition, a no-operation (NOP) action is added to the action space. In summary, the action space of the BIM model is finally reduced to b actions; for b = 32, the action space contains 32 actions.
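The compressed action space can be sketched as below. The decomposition check illustrates why first-row swaps suffice; the b = 6 size and the action encoding (0 for NOP) are illustrative assumptions.

```python
b = 6
bim = [[1 if i == j else 0 for j in range(b)] for i in range(b)]

def apply_action(bim, action):
    """Action 0 is the NOP; action i (1 <= i <= b-1) swaps row 0 with row i,
    so the whole action space has exactly b elements."""
    new = [row[:] for row in bim]
    if action:
        new[0], new[action] = new[action], new[0]
    return new

# A swap of two arbitrary rows i and j decomposes into first-row swaps:
# (i j) = (0 i)(0 j)(0 i), so restricting to b actions loses no generality.
m = bim
for a in (2, 4, 2):
    m = apply_action(m, a)

direct = [row[:] for row in bim]
direct[2], direct[4] = direct[4], direct[2]
assert m == direct       # three first-row swaps reproduce the (2 4) swap
```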
(3) Iterative optimization
The reinforcement learning model optimizes the address mapping strategy BIM iteratively. First, the rows of the BIM are expanded into a one-dimensional vector used as input, and the row-buffer hit rate H of the initial BIM address mapping strategy is measured and taken as the current optimal value H_best of the model. BIM optimization runs for a preset number of k iterations; after each iteration the row hit rate of the result is measured, and if it is higher than H_best, that BIM is kept as the optimal address mapping strategy. The BIM is iteratively optimized in this way, and H_best increases monotonically over the iterations. Pseudocode for the iterative optimization of the address mapping strategy BIM is shown in FIG. 4.
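The iteration of FIG. 4 can be sketched as follows. The hit-rate evaluator here is a random stand-in for the real trace-driven row-buffer measurement, and uniform action sampling replaces the trained policy network; both are assumptions for illustration only.

```python
import random
random.seed(1)

def row_hit_rate(bim):
    """Stand-in for the row-buffer hit-rate test; the real value comes from
    simulating a memory access trace under the candidate mapping."""
    return random.random()

def apply_action(bim, action):
    new = [row[:] for row in bim]
    if action:                              # action 0 is the NOP
        new[0], new[action] = new[action], new[0]
    return new

b, k = 6, 50
bim = [[1 if i == j else 0 for j in range(b)] for i in range(b)]
best_bim, h_best = bim, row_hit_rate(bim)   # initial BIM sets H_best

for _ in range(k):                          # k optimization iterations
    action = random.randrange(b)            # placeholder for sampling from pi
    candidate = apply_action(bim, action)
    h = row_hit_rate(candidate)
    if h > h_best:                          # keep only improving mappings
        best_bim, h_best = candidate, h
        bim = candidate
```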
2. Model training
During model training, the policy network generates the next action a_t at the current time step, and the current BIM is transformed by this action into the BIM of the next time step. After k steps, the policy network obtains a reward value r_k = H_k. Reinforcement learning maximizes the cumulative reward value, and at the same time yields the BIM-based address mapping strategy with the highest row hit rate.
The invention iterates the optimization of the model with the policy gradient algorithm mentioned above. The formula for the cumulative reward value is:

R_t = γ^(k+1) · r_k    (3)

where γ is the discount factor. The value function V_φ(BIM_t) is used to predict the cumulative reward value; the neural network containing the parameter φ is updated by means of the policy gradient.
The value network has the same intermediate structure as the policy network, two fully connected layers; the difference is that the output of the value network is a single scalar describing the predicted cumulative reward. The advantage of an action expresses how much better it is for the agent to select that action in the current state than for the policy network to select an action at random. The specific formula is as follows:

A_t = R_t - V_t    (4)
The maximized objective function is:

J(θ) = E_{a_t~π_θ}[A_t · log π_θ(a_t | BIM_t)]

The policy gradient is:

∇_θ J(θ) = E[A_t · ∇_θ log π_θ(a_t | BIM_t)]

The loss function of the value network is:

L(φ) = (R_t - V_φ(BIM_t))^2

The gradient of the value network is:

∇_φ L(φ) = -2 · (R_t - V_φ(BIM_t)) · ∇_φ V_φ(BIM_t)
In the network model, the gradients of the parameters are computed with the back-propagation algorithm; lr_π and lr_v are the learning rates of the policy network and the value network respectively (see FIG. 5 for the specific formulas), and both are set to 0.001 in this work.
The invention updates the model parameters with the Mini-batch method. In the experiments, the batch size is set to 64, meaning that within one batch the policy network performs 64 iterative updates; the experience (actions, rewards, etc.) collected over these 64 iterations is used to update the model parameters. However, a naive Mini-batch implementation stores all the input data, intermediate results, and other data, causing serious storage overhead. To solve this problem, the experiments accumulate the gradients within one batch as the parameter gradients; the accumulated gradients of the policy network and the value network are g_θ and g_φ (see FIG. 5 for the specific formulas). The algorithm for training the reinforcement learning model with the Mini-batch method is shown in FIG. 5.
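The storage-saving idea can be sketched as running sums. The per-step gradient values below are placeholders, not the patent's formulas (which appear in FIG. 5); only the batch size and the 0.001 learning rates come from the text.

```python
# Instead of retaining every step's inputs and intermediate results,
# keep only the running gradient sums g_theta (policy) and g_phi (value).
batch = 64
lr_pi, lr_v = 0.001, 0.001          # learning rates used in this work

g_theta, g_phi = 0.0, 0.0
for step in range(batch):
    # placeholder per-step gradients; in the real model these come from
    # back-propagation through the policy and value networks
    step_g_theta, step_g_phi = 0.01, 0.02
    g_theta += step_g_theta
    g_phi += step_g_phi

# One parameter update per batch, from the accumulated gradients.
theta_step = lr_pi * g_theta / batch
phi_step = lr_v * g_phi / batch
```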
3. Workflow process
The overall process of the iterative optimization training of the invention is shown in FIG. 6. The one-dimensional expansion of the 32x32 binary invertible matrix BIM is taken as the input of the policy network and the value network, and forward inference is performed through both networks. During training, the row hit rate determines whether the BIM is updated: the model selects actions according to their probabilities to obtain candidate BIMs, and when the computed row-buffer hit rate of a candidate BIM is higher than that of the current BIM, the system replaces the current BIM with the candidate. The row-buffer hit rate of the new BIM is then recomputed as the reward value, and the parameters of both networks are updated. The system iterates and optimizes in this way and converges, under the set stopping rule, to a trained BIM strategy. At the same time, the address mapping strategy with the highest row hit rate is obtained, which can be ported to a hardware implementation in the FPGA MIG IP.
The above is a preferred embodiment of the present invention; all changes made according to the technical solution of the present invention whose resulting effects do not exceed the scope of that solution belong to the protection scope of the present invention.
Claims (3)
1. An address mapping strategy design method based on reinforcement learning, characterized in that a binary invertible matrix BIM is used to represent the address mapping strategy, and the address mapping strategy with the best row-buffer hit rate is trained in combination with a reinforcement learning model; the specific implementation is as follows: the one-dimensional expansion of the binary invertible matrix BIM is taken as the input of the reinforcement learning model; the row-buffer hit rate of the initial BIM is taken as the model's current optimal value H_best; the model selects actions according to their probabilities to obtain candidate BIMs; when the row-buffer hit rate computed for a candidate BIM is higher than that of the current BIM, the model replaces the current BIM with the candidate; the reward value is then recalculated and the model parameters are updated; the model iterates and optimizes in this way and converges, under a preset stopping rule, to the trained BIM, simultaneously yielding the address mapping strategy with the highest row-buffer hit rate; the action space of the reinforcement learning model consists of all possible row/column swaps of the binary invertible matrix BIM, whose number is 2·C(b,2) = b(b-1), where b is the number of rows/columns of the binary invertible matrix; to solve the problem of an oversized action space of the reinforcement learning model, the action space is compressed, specifically as follows:
as shown in the following equation, a row swap of the BIM is performed by left-multiplying the BIM by a permutation matrix M_pre, and a column swap by right-multiplying the BIM by a permutation matrix M_post:

BIM' = M_pre · BIM (row swap),  BIM' = BIM · M_post (column swap)
since the optimization starts from the identity matrix and permutation matrices are closed under multiplication, a series of row/column transformations is equivalently realized using row transformations alone; thus the action space is compressed to the set of row-transformation actions only; the transformation expression is as follows:

M_pre · I · M_post = (M_pre · M_post) · I
after the above compression, the action space is halved; to reduce the action search space further, the transformations of the binary invertible matrix BIM are restricted to swaps between the first row and each of the other rows, giving b-1 possible actions; at the same time, a no-operation (NOP) action is added to the action space; the action space of the reinforcement learning model is finally reduced to b actions.
2. The reinforcement-learning-based address mapping strategy design method according to claim 1, characterized in that the reinforcement learning model consists of a policy network and a value network; the policy network consists of two cascaded fully connected layers, the first layer using ReLU as its activation function and the output of the second layer being fed to a Softmax function; during training of the reinforcement learning model, the policy network generates the next action a_t at the current time step, and the current BIM is transformed by this action into the BIM of the next time step; after a preset number of iterations k, the policy network obtains a reward value r_k = H_k, where H_k is the row-buffer hit rate of the BIM after k iterations; the formula for the cumulative reward value is:

R_t = γ^(k+1) · r_k

where γ is the discount factor;
the value network has the same intermediate structure as the policy network, two fully connected layers, the difference being that the output of the value network is a single scalar describing the predicted cumulative reward; the advantage of an action expresses how much better the reward of selecting that action in the current state is than letting the policy network select an action at random; the specific formula is as follows:

A_t = R_t - V_t

where A_t is the advantage function and V_t is the return estimated after selecting actions according to the policy π in state s_t;
the maximized objective function is:

J(θ) = E_{a_t~π_θ}[A_t · log π_θ(a_t | BIM_t)]

wherein J(θ) is the maximized objective function, and maximizing J(θ) continuously optimizes the parameter θ of the neural network model; π_θ is the policy π parameterized by θ, i.e., the policy learned in the corresponding environment to maximize the cumulative reward; BIM_t denotes the current binary invertible matrix;

the policy gradient, i.e., the derivative of the maximized objective function, is:

∇_θ J(θ) = E[A_t · ∇_θ log π_θ(a_t | BIM_t)]

the loss function of the value network is:

L(φ) = (R_t - V_φ(BIM_t))^2

the gradient of the value network is:

∇_φ L(φ) = -2 · (R_t - V_φ(BIM_t)) · ∇_φ V_φ(BIM_t)
the value function V_φ(BIM_t) is used to predict the cumulative reward value, and the neural network containing the parameter φ is updated by means of the policy gradient; in the reinforcement learning model, the gradients of the parameters are computed with the back-propagation algorithm, and lr_π and lr_v are the learning rates of the policy network and the value network, respectively.
3. The reinforcement-learning-based address mapping strategy design method according to claim 2, characterized in that the parameters of the reinforcement learning model are updated with the Mini-batch method, the gradients within one batch being accumulated as the parameter gradients, the accumulated gradients of the policy network and the value network being g_θ and g_φ.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210714310.4A CN114942895B (en) | 2022-06-22 | 2022-06-22 | Address mapping strategy design method based on reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210714310.4A CN114942895B (en) | 2022-06-22 | 2022-06-22 | Address mapping strategy design method based on reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114942895A CN114942895A (en) | 2022-08-26 |
CN114942895B true CN114942895B (en) | 2024-06-04 |
Family
ID=82911016
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210714310.4A Active CN114942895B (en) | 2022-06-22 | 2022-06-22 | Address mapping strategy design method based on reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114942895B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111858396A (en) * | 2020-07-27 | 2020-10-30 | 福州大学 | Memory self-adaptive address mapping method and system |
CN113568845A (en) * | 2021-07-29 | 2021-10-29 | 北京大学 | Memory address mapping method based on reinforcement learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10452533B2 (en) * | 2015-07-14 | 2019-10-22 | Western Digital Technologies, Inc. | Access network for address mapping in non-volatile memories |
-
2022
- 2022-06-22 CN CN202210714310.4A patent/CN114942895B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111858396A (en) * | 2020-07-27 | 2020-10-30 | 福州大学 | Memory self-adaptive address mapping method and system |
CN113568845A (en) * | 2021-07-29 | 2021-10-29 | 北京大学 | Memory address mapping method based on reinforcement learning |
Non-Patent Citations (1)
Title |
---|
An Efficient Memory Access Strategy for Image Transposition and Block Processing (面向图像转置和分块处理的一种高效内存访问策略); Shen Huanghui, Wang Zhensong, Zheng Weimin; Journal of Computer Research and Development; 2013-01-15 (No. 01), pp. 188-196 *
Also Published As
Publication number | Publication date |
---|---|
CN114942895A (en) | 2022-08-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109948029B (en) | Neural network self-adaptive depth Hash image searching method | |
JP7451008B2 (en) | Quantum circuit determination methods, devices, equipment and computer programs | |
CN113098714B (en) | Low-delay network slicing method based on reinforcement learning | |
CN112491818B (en) | Power grid transmission line defense method based on multi-agent deep reinforcement learning | |
CN109951392B (en) | Intelligent routing method for medium and large networks based on deep learning | |
CN112613608A (en) | Reinforced learning method and related device | |
CN117875397B (en) | Parameter selection method and device to be updated, computing equipment and storage medium | |
CN114942895B (en) | Address mapping strategy design method based on reinforcement learning | |
CN114254545A (en) | Parallel connection refrigerator system load control optimization method, system, equipment and medium | |
CN112131089B (en) | Software defect prediction method, classifier, computer device and storage medium | |
CN117520956A (en) | Two-stage automatic feature engineering method based on reinforcement learning and meta learning | |
CN116502779A (en) | Traveling merchant problem generation type solving method based on local attention mechanism | |
CN116185498A (en) | Integrated memory and calculation chip, and calculation method and device thereof | |
CN109582911A (en) | For carrying out the computing device of convolution and carrying out the calculation method of convolution | |
CN114444697A (en) | Knowledge graph-based common sense missing information multi-hop inference method | |
CN110766133B (en) | Data processing method, device, equipment and storage medium in embedded equipment | |
Tang et al. | Modeling and optimization of a class of networked evolutionary games with random entrance and time delays | |
Ventura | Quantum computational intelligence: answers and questions | |
CN116306948B (en) | Quantum information processing device and quantum information processing method | |
CN117492371B (en) | Optimization method, system and equipment for active power filter model predictive control | |
CN116151171B (en) | Full-connection I Xin Moxing annealing treatment circuit based on parallel tempering | |
Li et al. | A One-Shot Reparameterization Method for Reducing the Loss of Tile Pruning on DNNs | |
CN117775224A (en) | Marine hybrid power energy management method and system based on MOEAD algorithm | |
CN118014054B (en) | Mechanical arm multitask reinforcement learning method based on parallel recombination network | |
US11983606B2 (en) | Method and device for constructing quantum circuit of QRAM architecture, and method and device for parsing quantum address data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |