CN109032967B - Cache address mapping method based on three-dimensional many-core processor - Google Patents


Info

Publication number
CN109032967B
CN109032967B (application CN201810757396.2A)
Authority
CN
China
Prior art keywords
cache
bank
dimensional
distribution
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810757396.2A
Other languages
Chinese (zh)
Other versions
CN109032967A (en
Inventor
陈小文
王子聪
郭阳
鲁建壮
陈海燕
陈胜刚
刘胜
雷元武
王耀华
郭晓伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201810757396.2A priority Critical patent/CN109032967B/en
Publication of CN109032967A publication Critical patent/CN109032967A/en
Application granted granted Critical
Publication of CN109032967B publication Critical patent/CN109032967B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/10Address translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a Cache address mapping method based on a three-dimensional many-core processor, which comprises the following steps: S1, constructing an objective function of a nonlinear programming problem and solving it to obtain the optimal probability distribution with which each Bank of the non-uniform Cache address mapping is accessed, wherein the access distance comprises the distances in the three directions of the three-dimensional mesh network; S2, adjusting the probability distribution to finally obtain the required three-dimensional distribution matrix; S3, calculating the quantity distribution of Cache blocks mapped by each Bank according to the three-dimensional distribution matrix; and S4, adjusting the number of Cache blocks mapped by each Bank within the three-dimensional space according to that quantity distribution. The method can realize Cache address mapping for a three-dimensional many-core processor, achieve network-delay balance at large network scales, and improve the operating efficiency of the three-dimensional many-core processor.

Description

Cache address mapping method based on three-dimensional many-core processor
Technical Field
The invention relates to the technical field of three-dimensional many-core processors, in particular to a Cache address mapping method based on a three-dimensional many-core processor.
Background
The Three-Dimensional Network-on-Chip (3D NoC), owing to its good scalability, is the main interconnection structure in three-dimensional many-core processors. The growing number of processing cores improves processor performance on the one hand, but also drives the network-on-chip to ever larger scales on the other. In a three-dimensional mesh network, as the network scale grows, the differences in communication distance and delay between the nodes of the processing cores become larger: communication between nearby processing cores is cheaper than communication between distant ones. Moreover, this communication advantage is not uniform across nodes. Specifically, a processing core at a central node has shorter distances to the other nodes than processing cores at peripheral nodes and is therefore favored in network communication; this advantage keeps widening as the network scale increases, so the delay differences between network packets grow, i.e. the problem of unbalanced network delay arises.
As the demand for Cache capacity keeps expanding, three-dimensional many-core processors usually organize the Last-Level Cache (LLC) over the 3D NoC using a Non-Uniform Cache Access (NUCA) architecture. In a 3D NoC-based NUCA architecture the LLC is physically distributed over the processing-core nodes, while the Cache banks (Banks) of the nodes logically form a unified shared Cache. A typical NUCA-based three-dimensional stacked many-core system-on-chip under a 4 × 4 × 4 three-dimensional mesh network is shown in FIG. 1: each processing unit contains a primary instruction/data Cache (L1I/L1D), a secondary shared Cache Bank and a network interface, and is connected to a router via the network interface. The number on each node is its serial number in the network; the distributed shared secondary Cache Banks are organized as a static NUCA (S-NUCA) structure and are cross-addressed in units of Cache blocks.
However, in the above NUCA structure, when a processing core issues a Cache access request, the access time depends on the distance between the requesting core's node and the node holding the Cache Bank that contains the accessed data: when the distance is short the access time is short, and when a distant Bank is accessed the access time is long. With the traditional NUCA structure, as the network scale expands and the number of nodes increases, Cache access delay becomes dominated by network delay, so the network-delay imbalance propagates into the Cache access delay. The delay differences between Cache access requests grow, producing unbalanced Cache access delay: some Cache access requests suffer very large delays, blocking the execution of the processing cores that issued them, forming a system bottleneck and seriously degrading overall system performance.
The Chinese patent application CN107729261A discloses a Cache address mapping method in a multi-core/many-core processor, which can effectively relieve the problem of unbalanced network delay in the traditional two-dimensional multi-core/many-core processor by combining a non-uniform design, but the scheme aims at the Cache address mapping in the two-dimensional multi-core/many-core processor, and the two-dimensional address mapping in the two-dimensional multi-core/many-core processor is simple and low in algorithm complexity compared with a three-dimensional processor.
In summary, the contradiction between the uniformity of the Cache address mapping mechanism of the traditional three-dimensional many-core processor and the non-uniformity of the network topology causes unbalanced network delay in practical use, which limits further improvement of system performance; and the Cache address mapping of a two-dimensional multi-core/many-core processor cannot be applied directly to a three-dimensional many-core processor. A Cache mapping method for three-dimensional many-core processors is therefore urgently needed to solve the network-delay balancing problem in three-dimensional many-core processors, especially at large network scales.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the above technical problems in the prior art, the invention provides a Cache address mapping method for a three-dimensional many-core processor that is simple to implement, can achieve network-delay balance of the three-dimensional many-core processor at large network scales, and improves the operating efficiency of the three-dimensional many-core processor.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
a Cache address mapping method based on a three-dimensional many-core processor comprises the following steps:
s1, constructing an objective function of a nonlinear programming problem based on the visited probability of each Bank and the visited distance between the banks when a target three-dimensional many-core processor adopts non-uniform Cache address mapping, and solving the constructed objective function to obtain the optimal visited probability distribution of each Bank of the non-uniform Cache address mapping, wherein the visited distance comprises the distance in three directions in a three-dimensional grid network;
s2, adjusting the probability distribution obtained in the step S1 to finally obtain a required three-dimensional distribution matrix;
s3, calculating the quantity distribution of Cache blocks mapped by each Bank according to the three-dimensional distribution matrix obtained in the step S2;
and S4, adjusting the number of Cache blocks mapped by each Bank in a three-dimensional space range according to the number distribution of the Cache blocks mapped by each Bank obtained in the step S3.
As a further improvement of the invention: in step S1, a Cache access cost distribution function is specifically constructed from access distances between banks in the three-dimensional mesh network and the probability distribution of accesses to the banks, and the objective function is constructed based on the Cache access cost distribution function.
As a further improvement of the invention: specifically, a vector C standard deviation is selected to construct the objective function, wherein the vector C represents the distribution of the access overhead of the Cache of the Bank of each node, namely D ═ Ci]VV is the size of the network on chip in the target processor architecture, ciCache access overhead for the ith Bank, i.e.
Figure BDA0001727057700000021
hi,jRepresents the access distance, p, of nodes i and j in the three-dimensional mesh networkiThe visited probability of Bank for node i;
the constructed objective function is specifically as follows:
min σ(C) = √( (1/V) · Σ_{i=1}^{V} ( c_i − μ(C) )² )
and setting constraint conditions:
Σ_{i=1}^{V} p_i = 1

p_i ≥ 0, i = 1, 2, …, V
where μ (C) is the average of the Cache access overheads of all nodes obtained from vector C.
As a further improvement of the invention: the access distance is specifically a manhattan distance.
As a further improvement of the invention: in the step S2, the probability distribution obtained in the step S1 is adjusted according to the network scale, so as to match the network size, and finally obtain the required three-dimensional distribution matrix.
As a further improvement of the invention: in step S3, specifically, B is 2mObtaining the quantity distribution B of Cache blocks mapped by each Bank by xP, wherein P is the probability distribution of accessing the Bank, and when the Bank address occupies m bits, the mapping interval is 2mAnd (4) one Cache block.
As a further improvement of the invention: in step S4, a first target Bank in the network grid is remapped to a second target Bank according to the number distribution of Cache blocks mapped by each Bank and the number of Cache blocks during mapping of consistent Cache addresses, where the first target Bank is a Bank with a smaller number of Cache blocks during mapping of mapped Cache blocks than during mapping of consistent Cache addresses, and the second target Bank is a Bank with a larger number of Cache blocks during mapping of mapped Cache blocks than during mapping of consistent storage.
As a further improvement of the invention: the first target Bank is a Bank close to the peripheral position in the network grid, and the second target Bank is a Bank close to the central position in the network grid, namely mapping the Cache block of the first target Bank close to the peripheral position in the network grid to the second target Bank close to the central position in the network grid.
As a further improvement of the invention: the specific steps for adjusting the number of Cache blocks mapped by each Bank are as follows: equally dividing a network grid formed by each Bank node into eight regions, judging the size relationship between the number of Cache blocks mapped by each node and the number of Cache blocks in the mapping process of the consistent Cache address in each region, and if the size relationship is smaller than the size relationship, judging that the corresponding node is a first target Bank close to the periphery in the network grid; and if so, judging that the corresponding node is a second target Bank close to the central position, and remapping the Cache block of the first target Bank close to the peripheral position in the network grid in each region to the second target Bank close to the central position.
As a further improvement of the invention: the non-uniform Cache address mapping also comprises a step of setting a Bank address segment, specifically, a Bank ID field is expanded with a zone bit and an index bit, the zone bit is used for identifying the number of groups obtained after each Cache block is divided, and the index bit is stored in the Bank address to which a target Cache block should be mapped under an S-NUCA structure.
Compared with the prior art, the invention has the advantages that:
1. The Cache address mapping method based on a three-dimensional many-core processor of the invention considers the uniform memory-to-LLC mapping adopted by traditional three-dimensional many-core processors together with the structural characteristics of the three-dimensional many-core processor, and introduces a non-uniform design to realize Cache address mapping. It first constructs an objective function based on the visited probability of each Bank and the access distances between Banks, solves it to obtain the optimal visited-probability distribution of the Banks for the non-uniform Cache address mapping, adjusts the result into the required three-dimensional distribution matrix, calculates from it the quantity distribution of Cache blocks mapped by each Bank, and then adjusts the number of Cache blocks mapped by each Bank within the three-dimensional space according to that quantity distribution. The optimized Cache address mapping corrects the unbalanced network-delay state and achieves network-delay balance, so that, combined with the non-uniform design, the problem of unbalanced network delay in traditional three-dimensional many-core processors is effectively solved and system performance is effectively improved.
2. The Cache address mapping method based on a three-dimensional many-core processor of the invention, starting from the structural characteristics of the three-dimensional many-core processor, solves for the optimal Bank visited-probability distribution of the non-uniform Cache address mapping while simultaneously considering the distances in the three directions of the three-dimensional mesh network, and so obtains a Bank visited-probability distribution matched to the three-dimensional many-core processor structure. By adjusting the number of Cache blocks mapped by each Bank within the three-dimensional space according to the Cache-block quantity distribution, three-dimensional Cache address mapping tailored to the three-dimensional many-core processor is realized; combined with the non-uniform design, efficient Cache address mapping of the three-dimensional many-core processor is achieved, alleviating the problem of unbalanced network delay in the three-dimensional many-core processor.
3. The Cache address mapping method based on a three-dimensional many-core processor of the invention optimizes the non-uniform Cache address mapping distribution starting from the uniform Cache address mapping and applies it to the three-dimensional many-core processor: within the three-dimensional space, Cache blocks are remapped from Banks with few mapped Cache blocks to Banks with many mapped Cache blocks, which effectively improves the network-delay balance of the three-dimensional many-core processor.
4. In the Cache address mapping method based on a three-dimensional many-core processor of the invention, the Cache access overhead distribution is obtained from the access distances in the three directions of the three-dimensional mesh network and the visited-probability distribution of the Banks, and the objective function is then constructed from the Cache access overhead distribution of the Banks, so that the optimal Bank visited-probability distribution is obtained on the basis of Cache access overhead and the quantity distribution of Cache blocks mapped by each Bank can be optimized to the greatest extent.
Drawings
Fig. 1 is a schematic diagram of a typical three-dimensional stacked many-core system on chip based on a NUCA structure under a 4 × 4 × 4 three-dimensional mesh network.
FIG. 2 is a schematic diagram of an implementation flow of the Cache address mapping method based on the three-dimensional many-core processor in the embodiment.
Fig. 3 is a diagram showing the access probability and the number distribution of Cache blocks of each Bank obtained in the embodiment (4 × 4 × 4) of the present invention.
Fig. 4 is a schematic diagram illustrating an implementation principle of non-uniform Cache address mapping in the embodiment (4 × 4 × 4) of the present invention.
Fig. 5 is a schematic diagram of the mapping result of each Cache block obtained in the embodiment (4 × 4 × 4) of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and specific preferred embodiments of the description, without thereby limiting the scope of protection of the invention.
As shown in fig. 2, the Cache address mapping method based on the three-dimensional many-core processor in this embodiment includes the steps of:
s1, constructing an objective function of a nonlinear programming problem based on the visited probability of each Bank and the visited distance between the banks when the non-uniform Cache address mapping is adopted in a target three-dimensional many-core processor, and solving the constructed objective function to obtain the optimal visited probability distribution of each Bank of the non-uniform Cache address mapping, wherein the visited distance comprises the distance in three directions in a three-dimensional grid network;
s2, adjusting the probability distribution obtained in the step S1 to finally obtain a required three-dimensional distribution matrix;
s3, calculating the quantity distribution of Cache blocks mapped by each Bank according to the three-dimensional distribution matrix obtained in the step S2;
and S4, adjusting the number of Cache blocks of each Bank mapping in a three-dimensional space range according to the number distribution of the Cache blocks of each Bank mapping obtained in the step S3.
According to the method, based on the structural characteristics of the three-dimensional many-core processor, the optimal visited-probability distribution of the Banks for the non-uniform Cache address mapping is solved while considering the distances in the three directions of the three-dimensional mesh network, so that a Bank visited-probability distribution matched to the three-dimensional many-core processor structure is obtained. Meanwhile, the number of Cache blocks mapped by each Bank is adjusted within the three-dimensional space according to the Cache-block quantity distribution, realizing three-dimensional Cache address mapping for the three-dimensional many-core processor; combined with the non-uniform design, efficient Cache address mapping of the three-dimensional many-core processor is achieved, alleviating the problem of unbalanced network delay in the three-dimensional many-core processor.
Uniform LLC mapping means that the Cache blocks in memory are mapped one by one, with the Cache block as the interleaving unit, onto the Banks of the LLC. Considering this uniform memory-to-LLC mapping adopted by traditional three-dimensional many-core processors, the method introduces a non-uniform design to realize Cache address mapping. An objective function is first constructed based on the visited probability of each Bank and the access distances between Banks, and solved to obtain the optimal visited-probability distribution of the Banks for the non-uniform Cache address mapping; after adjustment into the required three-dimensional distribution matrix form, the quantity distribution of Cache blocks mapped by each Bank, i.e. the Cache-block mapping proportion of each Bank, is calculated; the number of Cache blocks mapped by each Bank is then adjusted within the three-dimensional space according to that quantity distribution. The optimized Cache address mapping corrects the unbalanced network-delay state and achieves network-delay balance, thereby effectively solving the problem of unbalanced network delay in traditional three-dimensional many-core processors and effectively improving system performance.
In this embodiment, the number of Cache blocks mapped by each Bank is adjusted according to the Cache-block quantity distribution within one mapping interval.
In this example, in step S1, a Cache access overhead distribution function is specifically constructed from access distances between banks in the three-dimensional mesh network and the probability distribution of accesses to the banks, and an objective function is constructed based on the Cache access overhead distribution function, that is, after Cache access overhead distribution is obtained from the access distances in three directions and the probability distribution of accesses to the banks in the three-dimensional mesh network, an objective function is constructed from the Cache access overhead distribution of the banks, so that optimal Bank access probability distribution can be obtained based on the Cache access overhead.
In this embodiment, a mesh network using an XYZ dimension-order routing policy is adopted, and the access distance is the Manhattan distance. In step S1, the number of network nodes in the three-dimensional many-core processor structure is first input as V, and the visited probability of the Bank at node i (which may also be regarded as its Cache-block mapping proportion in memory) is denoted p_i. The non-uniform Cache address mapping distribution to be calculated is represented by a vector P:

P = [p_i]_V    (1)

Each p_i is then given the uniform initial value:

p_i = 1/V    (2)
assuming that the processing core at node j needs to access Bank at node i, the non-contention delay between them can be expressed by the following equation:
t_{i,j} = h_{i,j} · τ_{1hop}    (3)

where h_{i,j} denotes the Manhattan distance (hop count) between node i and node j in the network, and τ_{1hop} denotes the delay of a single hop. Consider a contiguous storage area containing M Cache blocks in total, of which m_i are mapped to the Bank at node i; each core then sends m_i requests to the i-th Bank, so the total non-contention delay at the Bank at node i is:

T_i = Σ_{j=1}^{V} m_i · t_{i,j} = m_i · τ_{1hop} · Σ_{j=1}^{V} h_{i,j}    (4)

Dividing T_i by the constant M · τ_{1hop} normalizes the non-contention delay, which is then defined as the Cache access overhead of the i-th Bank, denoted c_i:

c_i = T_i / (M · τ_{1hop}) = (m_i / M) · Σ_{j=1}^{V} h_{i,j}    (5)

where m_i/M is the proportion of Cache blocks mapped to the i-th Bank, which can therefore be replaced by p_i, so the Cache access overhead c_i of each Bank can be expressed as:

c_i = p_i · Σ_{j=1}^{V} h_{i,j}    (6)
the Cache access overhead distribution of the Bank of each node is represented by a vector C, so that the average value of the distribution C can be obtained:
μ(C) = (1/V) · Σ_{i=1}^{V} c_i    (7)
in order to balance the average access delay of each node, it is desirable that the average access distances of each node are as close as possible to each other, that is, the standard deviation of the set of elements in the distribution C is as small as possible, so that the standard deviation is selected as an optimized objective function:
min σ(C) = √( (1/V) · Σ_{i=1}^{V} ( c_i − μ(C) )² )    (8)
in addition, the sum of the access probabilities of all banks should be equal to 1, and the access probability of each Bank should be greater than or equal to 0, i.e., the following constraint equation is satisfied:
Σ_{i=1}^{V} p_i = 1,  p_i ≥ 0 (i = 1, 2, …, V)    (9)
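Equations (6)–(8) can be evaluated directly. A minimal Python sketch, assuming a linear node numbering with x varying fastest (the patent does not fix one):

```python
def overhead_and_sigma(P, dims):
    """Evaluate equations (6)-(8): the per-Bank Cache access overheads
    c_i = p_i * sum_j h_{i,j}, their mean mu(C), and the objective
    sigma(C).  Linear node numbering with x varying fastest is assumed."""
    X, Y, Z = dims
    V = X * Y * Z

    def coords(i):
        return i % X, (i // X) % Y, i // (X * Y)

    def hop(i, j):  # Manhattan distance h_{i,j} in the 3D mesh
        (xi, yi, zi), (xj, yj, zj) = coords(i), coords(j)
        return abs(xi - xj) + abs(yi - yj) + abs(zi - zj)

    C = [P[i] * sum(hop(i, j) for j in range(V)) for i in range(V)]
    mu = sum(C) / V                                     # equation (7)
    sigma = (sum((c - mu) ** 2 for c in C) / V) ** 0.5  # equation (8)
    return C, mu, sigma
```

On a 2 × 2 × 2 mesh every node is a corner, so the uniform initial distribution p_i = 1/V already gives σ(C) = 0; on any larger mesh the uniform distribution leaves σ(C) > 0, which is exactly the imbalance the optimization removes.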
that is, in this embodiment, a vector C standard deviation is specifically selected to construct an objective function, where the vector C represents the Bank Cache access overhead distribution of each node, that is, D ═ Ci]VV is the size of the network on chip in the target processor architecture, ciCache access overhead for the ith Bank, i.e.
Figure BDA0001727057700000075
hi,jRepresents the access distance, p, of nodes i and j in the three-dimensional mesh networkiThe visited probability of Bank for node i; the specific objective function for constructing the nonlinear programming problem is as follows:
Figure BDA0001727057700000076
and setting the constraint conditions as follows:
Figure BDA0001727057700000077
Figure BDA0001727057700000078
where μ (C) is the average of the Cache access overheads of all nodes obtained from vector C.
Solving this nonlinear programming problem with a nonlinear programming method yields the optimal visited probability of each Bank for the non-uniform Cache address mapping, i.e. the optimal visited-probability distribution vector P of the Banks.
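For the objective exactly as stated, a useful sanity check exists: since c_i = p_i · Σ_j h_{i,j}, choosing p_i proportional to 1/Σ_j h_{i,j} equalizes all c_i and drives σ(C) to zero while satisfying constraints (9). The sketch below gives this closed-form feasible point for checking a numerical solver; it is not the patent's own solution method, and the node numbering is an illustrative assumption:

```python
def equalizing_distribution(dims):
    """Closed-form feasible point for the stated problem: with
    S_i = sum_j h_{i,j}, setting p_i proportional to 1/S_i makes every
    c_i = p_i * S_i equal, hence sigma(C) = 0, while meeting the
    constraints sum p_i = 1 and p_i >= 0.  Linear node numbering with
    x varying fastest is assumed."""
    X, Y, Z = dims
    V = X * Y * Z

    def coords(i):
        return i % X, (i // X) % Y, i // (X * Y)

    def hop(i, j):
        (xi, yi, zi), (xj, yj, zj) = coords(i), coords(j)
        return abs(xi - xj) + abs(yi - yj) + abs(zi - zj)

    S = [sum(hop(i, j) for j in range(V)) for i in range(V)]
    norm = sum(1.0 / s for s in S)
    return [1.0 / (s * norm) for s in S]
```

Consistent with the distribution in Fig. 3, this assigns higher visited probability to central Banks (small Σ_j h_{i,j}) than to corner Banks.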
In this embodiment, in step S2, the probability distribution obtained in step S1 is specifically adjusted according to the network scale, so as to match the network size, and finally obtain the required three-dimensional distribution matrix. That is, according to the distribution vector P, the size of the distribution vector P is resized according to the network size to be suitable for the network size, thereby obtaining a final three-dimensional distribution matrix P.
After the visited-probability distribution P of each Bank is obtained, the proportion of memory Cache blocks mapped to each Bank is calculated from P; specifically, each Bank's mapping proportion of memory Cache blocks equals its visited probability. Since the Bank address occupies m bits of the physical memory-space address, the mapping interval is 2^m Cache blocks, and in step S3 of this embodiment the Cache-block quantity distribution B mapped by each Bank is obtained as B = 2^m × P.
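A sketch of computing B = 2^m × P; since the entries must be integer block counts that still sum to the mapping interval 2^m, a largest-remainder rounding rule is assumed here (the patent does not specify one):

```python
def block_distribution(P, m):
    """B = 2**m * P rounded to integer Cache-block counts that still sum
    to the mapping interval 2**m.  The largest-remainder rounding rule is
    an illustrative assumption."""
    total = 2 ** m
    raw = [p * total for p in P]
    B = [int(r) for r in raw]
    leftover = total - sum(B)
    # hand the leftover blocks to the Banks with the largest remainders
    order = sorted(range(len(P)), key=lambda i: raw[i] - B[i], reverse=True)
    for i in order[:leftover]:
        B[i] += 1
    return B
```

With m = 10 and a uniform distribution over 64 Banks this reproduces the uniform baseline of 16 Cache blocks per Bank.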
In step S4 of this embodiment, according to the Cache-block quantity distribution of each Bank and the per-Bank count under uniform Cache address mapping, a first target Bank in the network grid is remapped to a second target Bank, where the first target Bank maps fewer Cache blocks than under uniform mapping and the second target Bank maps more. That is, starting from the uniform Cache address mapping, the number of Cache blocks mapped by each Bank is adjusted according to the quantity distribution B: Banks of the network grid that map fewer Cache blocks than under uniform Cache address mapping have blocks remapped to target Banks that map more.
This way of optimizing the non-uniform Cache address mapping distribution starting from the uniform Cache address mapping is applied to the three-dimensional many-core processor: within the three-dimensional space, taking the per-Bank Cache-block count under uniform Cache address mapping as the reference, blocks of Banks with few mapped Cache blocks are remapped to Banks with many mapped Cache blocks, which effectively improves the network-delay balance of the three-dimensional many-core processor.
Because the central nodes of the network grid map many Cache blocks while the peripheral nodes map few, based on the above optimization principle the first target Bank in this embodiment is a Bank near the periphery of the network grid and the second target Bank is a Bank near the centre; that is, Cache blocks of first target Banks near the periphery are mapped to second target Banks near the centre. The specific steps are as follows: the network grid formed by the Bank nodes is divided equally into eight regions; within each region, the number of Cache blocks mapped by each node is compared with the per-Bank count under uniform Cache address mapping; if it is smaller, the node is judged to be a first target Bank near the periphery, and if it is larger, a second target Bank near the centre; the Cache blocks of the first target Banks near the periphery of each region are then remapped to the second target Banks near the centre. In this way the Cache-block distribution can be optimized quickly and effectively, efficiently alleviating the problem of unbalanced network delay in the three-dimensional many-core processor.
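The eight-region split of the mesh can be sketched as follows, again assuming a linear node numbering with x varying fastest; each octant of a 4 × 4 × 4 mesh then holds 2 × 2 × 2 = 8 nodes:

```python
def octants(dims):
    """Divide an X*Y*Z mesh with even dimensions into its eight equal
    regions; returns {(hi_x, hi_y, hi_z): [node indices]} under a linear
    numbering with x varying fastest (an illustrative assumption)."""
    X, Y, Z = dims
    regions = {}
    for i in range(X * Y * Z):
        x, y, z = i % X, (i // X) % Y, i // (X * Y)
        key = (x >= X // 2, y >= Y // 2, z >= Z // 2)
        regions.setdefault(key, []).append(i)
    return regions
```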
The invention is further illustrated below by taking a 4 × 4 × 4 three-dimensional mesh network as an example.
Fig. 3 shows the access probability and the Cache-block number distribution of each Bank calculated for the embodiment of the three-dimensional mesh network of size 4 × 4 × 4. Since the distribution matrix P is three-dimensional, P_i is used to represent the i-th component of the three-dimensional matrix P along the third dimension. Compared with consistent Cache address mapping under the structure of a traditional three-dimensional many-core processor, with the Bank address field occupying 10 bits, the central nodes are found to map more Cache blocks and the peripheral nodes fewer.
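As a rough sketch of how the block-count distribution can be derived from the probability distribution (the scaling rule B = 2^m × P appears in claim 4; the helper name and the rounding scheme are assumptions for illustration), an m-bit Bank address field addresses 2^m Cache blocks, here m = 10, i.e. 1024 blocks:

```python
def blocks_from_probability(p, m=10):
    """Turn a visited-probability vector p (sums to 1) into integer
    per-Bank Cache-block counts summing to 2**m, using largest-remainder
    rounding so no block is lost."""
    total = 1 << m                        # 1024 Cache blocks for m = 10
    raw = [pi * total for pi in p]
    counts = [int(r) for r in raw]
    # hand the blocks lost to truncation to the largest fractional parts
    short = total - sum(counts)
    order = sorted(range(len(p)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[:short]:
        counts[i] += 1
    return counts
```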
The non-uniform Cache address mapping for the mesh network of size 4 × 4 × 4 is shown in Fig. 4. As shown in Fig. 4(a), the original 6-bit Bank ID field is extended to a 10-bit Bank address, with the high 4 bits used as flag bits (Bank tag) and the low 6 bits as index bits (Bank index). The 1024 Cache blocks can thus be divided into 16 groups according to their flag bits, each group containing 64 Cache blocks, and the index bits indicate the Bank address to which a Cache block would be mapped under the original S-NUCA structure. Fig. 4(b) gives the number of Cache blocks mapped to the Bank of each node; by symmetry, the nodes of a 4 × 4 × 4 three-dimensional mesh network can be divided into 8 regions of 8 (2 × 2 × 2) nodes each, and each region follows a similar mapping manner, i.e., some Cache blocks of nodes near the periphery are mapped to nodes near the center. Taking the sub-region selected in Fig. 4(c) as an example, Fig. 4(d) and Fig. 4(e) show, respectively, the sequence number of each node in the sub-region and the number of Cache blocks it maps. Node 22, near the center, needs to map 20 Cache blocks, 4 more than under the consistent mapping (16 Cache blocks per Bank), so one block is taken from each of the lightly loaded Bank nodes 2, 3, 7 and 19 and remapped to node 22. Nodes 6, 18 and 23 each need to map 17 Cache blocks, one more than under the consistent mapping, so one block is remapped from node 3 to each of them; node 3 then maps exactly 12 Cache blocks.
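The 10-bit Bank address layout of Fig. 4(a) amounts to a simple bit split; a sketch follows (function and constant names are assumptions for illustration, not identifiers from the patent):

```python
BANK_TAG_BITS = 4   # high 4 bits: group flag (Bank tag)
BANK_IDX_BITS = 6   # low 6 bits: original S-NUCA Bank (Bank index)

def split_bank_addr(bank_addr):
    """Split a 10-bit Bank address into (tag, index)."""
    tag = (bank_addr >> BANK_IDX_BITS) & ((1 << BANK_TAG_BITS) - 1)
    index = bank_addr & ((1 << BANK_IDX_BITS) - 1)
    return tag, index
```

Iterating over all 1024 Bank addresses yields 16 tag groups of 64 blocks each, matching the grouping described above.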
The mapping results for each group of Cache blocks in the mesh network of size 4 × 4 × 4 are shown in Fig. 5, which mainly shows the results for the sub-region of Fig. 4(c). The mapping of the first 12 groups (i.e., tag equal to 0 to 11) is identical to that under the consistent structure, while for the last 4 groups (i.e., tag equal to 12 to 15) some Cache blocks are mapped to nodes near the center. The mapping results show that the Cache address mapping method of the invention can balance the network of the three-dimensional many-core processor and solve the problem of unbalanced network delay in traditional three-dimensional many-core processors.
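To connect the example back to the optimization objective of step S1, the following sketch evaluates the standard deviation σ(C) that the method seeks to minimize, under the assumption (consistent with the definitions in claim 1) that the overhead of the i-th Bank is c_i = p_i · Σ_j h_{i,j} with Manhattan distances h_{i,j}. It also illustrates why central nodes map more Cache blocks: choosing p_i inversely proportional to Σ_j h_{i,j} makes every c_i equal and drives σ(C) to zero. This is a hedged illustration, not the patented solver.

```python
import math
from itertools import product

N = 4
nodes = list(product(range(N), repeat=3))   # the 64 Bank nodes of a 4x4x4 mesh
V = len(nodes)

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# total Manhattan distance from every node to Bank i (small for central Banks)
dist_sum = [sum(manhattan(nodes[i], nodes[j]) for j in range(V)) for i in range(V)]

def access_cost_stddev(p):
    """sigma(C) for a visited-probability vector p, with c_i = p_i * sum_j h_ij."""
    c = [p[i] * dist_sum[i] for i in range(V)]
    mu = sum(c) / V
    return math.sqrt(sum((ci - mu) ** 2 for ci in c) / V)

uniform = [1.0 / V] * V                      # S-NUCA: every Bank equally likely
weights = [1.0 / d for d in dist_sum]        # favor central Banks
total = sum(weights)
balanced = [w / total for w in weights]      # p_i proportional to 1 / dist_sum[i]
```

With `uniform`, central Banks have a smaller distance sum and hence a smaller c_i, so σ(C) is nonzero; with `balanced`, all c_i coincide and σ(C) vanishes up to floating-point error.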
The foregoing describes preferred embodiments of the invention and is not to be construed as limiting the invention in any way. Although the present invention has been described with reference to preferred embodiments, it is not limited thereto. Therefore, any simple modification, equivalent change or variation made to the above embodiments in accordance with the technical spirit of the present invention, without departing from the content of the technical solution of the present invention, shall fall within the protection scope of the technical solution of the present invention.

Claims (5)

1. A Cache address mapping method based on a three-dimensional many-core processor is characterized by comprising the following steps:
s1, constructing an objective function of a nonlinear programming problem based on the visited probability of each Bank and the access distance between the Banks when a target three-dimensional many-core processor adopts non-uniform Cache address mapping, and solving the constructed objective function to obtain the optimal visited probability distribution of each Bank under non-uniform Cache address mapping, wherein the access distance comprises distances in three directions in the three-dimensional mesh network;
s2, adjusting the probability distribution obtained in the step S1 to finally obtain a required three-dimensional distribution matrix;
s3, calculating the quantity distribution of Cache blocks mapped by each Bank according to the three-dimensional distribution matrix obtained in the step S2;
s4, adjusting the number of Cache blocks mapped by each Bank in a three-dimensional space range according to the number distribution of the Cache blocks mapped by each Bank obtained in the step S3;
in step S1, a Cache access cost distribution function is specifically constructed from access distances between banks in a three-dimensional mesh network and the probability distribution of access to each Bank, and the objective function is constructed based on the Cache access cost distribution function;
specifically, the standard deviation of a vector C is selected to construct the objective function, wherein the vector C represents the distribution of the Cache access overheads of the Banks of the nodes, namely C = [c_i]_V, V is the size of the network on chip in the target processor architecture, and c_i is the Cache access overhead of the i-th Bank, i.e.
c_i = p_i × Σ_{j=0}^{V−1} h_{i,j}
wherein h_{i,j} represents the access distance between nodes i and j in the three-dimensional mesh network, and p_i is the visited probability of the Bank of node i;
the constructed objective function is specifically as follows:
min σ(C) = √( (1/V) × Σ_{i=0}^{V−1} (c_i − μ(C))² )
and setting constraint conditions:
Σ_{i=0}^{V−1} p_i = 1
pi≥0(0≤i≤V-1)
μ(C) = (1/V) × Σ_{i=0}^{V−1} c_i
wherein μ(C) is the average value of the Cache access overheads of all nodes obtained from the vector C;
in step S4, the N × N × N network grid formed by the Bank nodes is symmetrically divided into eight regions; within each region, the number of Cache blocks mapped by each node is compared with the per-Bank number under consistent Cache address mapping; if it is smaller, the corresponding node is judged to be a first target Bank near the peripheral position in the network grid; if it is larger, the corresponding node is judged to be a second target Bank near the central position; and in each region, the Cache blocks of the first target Banks near the peripheral position in the network grid are remapped to the second target Banks near the central position.
2. The Cache address mapping method based on the three-dimensional many-core processor according to claim 1, wherein the access distance is specifically a Manhattan distance.
3. The Cache address mapping method based on the three-dimensional many-core processor as claimed in claim 1, wherein in the step S2, the probability distribution obtained in the step S1 is adjusted according to the network scale so as to match the network size, and the required three-dimensional distribution matrix is finally obtained.
4. The Cache address mapping method based on the three-dimensional many-core processor as claimed in claim 1, wherein in step S3, the number distribution B of Cache blocks mapped by each Bank is obtained specifically according to B = 2^m × P, where P is the probability distribution of accessing each Bank, and when the Bank address occupies m bits the mapping interval is 2^m Cache blocks.
5. The Cache address mapping method based on the three-dimensional many-core processor according to claim 1, wherein the non-uniform Cache address mapping further comprises a step of setting a Bank address field, specifically, extending the Bank ID field with flag bits and index bits, wherein the flag bits identify the group into which each Cache block is divided, and the index bits store the Bank address to which the target Cache block would be mapped under the S-NUCA structure.
CN201810757396.2A 2018-07-11 2018-07-11 Cache address mapping method based on three-dimensional many-core processor Active CN109032967B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810757396.2A CN109032967B (en) 2018-07-11 2018-07-11 Cache address mapping method based on three-dimensional many-core processor

Publications (2)

Publication Number Publication Date
CN109032967A CN109032967A (en) 2018-12-18
CN109032967B true CN109032967B (en) 2021-10-01

Family

ID=64642149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810757396.2A Active CN109032967B (en) 2018-07-11 2018-07-11 Cache address mapping method based on three-dimensional many-core processor

Country Status (1)

Country Link
CN (1) CN109032967B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110034950B (en) * 2019-02-28 2021-08-10 华南理工大学 Mapping method for network on 3D chip

Citations (1)

Publication number Priority date Publication date Assignee Title
CN107729261A (en) * 2017-09-28 2018-02-23 中国人民解放军国防科技大学 Cache address mapping method in multi-core/many-core processor

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US8103894B2 (en) * 2009-04-24 2012-01-24 International Business Machines Corporation Power conservation in vertically-striped NUCA caches
US8621157B2 (en) * 2011-06-13 2013-12-31 Advanced Micro Devices, Inc. Cache prefetching from non-uniform memories
CN103810119B (en) * 2014-02-28 2017-01-04 北京航空航天大学 The temperature difference on sheet is utilized to reduce the cache design method of STT-MRAM power consumption

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant