CN109032967B - Cache address mapping method based on three-dimensional many-core processor - Google Patents


Info

Publication number
CN109032967B
CN109032967B (application CN201810757396.2A)
Authority
CN
China
Prior art keywords
cache
bank
dimensional
distribution
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810757396.2A
Other languages
Chinese (zh)
Other versions
CN109032967A (en
Inventor
陈小文
王子聪
郭阳
鲁建壮
陈海燕
陈胜刚
刘胜
雷元武
王耀华
郭晓伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201810757396.2A priority Critical patent/CN109032967B/en
Publication of CN109032967A publication Critical patent/CN109032967A/en
Application granted granted Critical
Publication of CN109032967B publication Critical patent/CN109032967B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/10Address translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a Cache address mapping method based on a three-dimensional many-core processor, which comprises the following steps: S1, constructing an objective function of a nonlinear programming problem and solving it to obtain the optimal probability distribution with which each Bank of the non-uniform Cache address mapping is accessed, wherein the access distance comprises the distances in the three directions of the three-dimensional mesh network; S2, adjusting the probability distribution to finally obtain the required three-dimensional distribution matrix; S3, calculating the quantity distribution of Cache blocks mapped by each Bank according to the three-dimensional distribution matrix; and S4, adjusting the number of Cache blocks mapped by each Bank within the three-dimensional space according to that quantity distribution. The method can realize Cache address mapping for a three-dimensional many-core processor, achieve network-delay balance at large network scales, and improve the operating efficiency of the three-dimensional many-core processor.

Description

Cache address mapping method based on three-dimensional many-core processor
Technical Field
The invention relates to the technical field of three-dimensional many-core processors, in particular to a Cache address mapping method based on a three-dimensional many-core processor.
Background
The Three-Dimensional Network-on-Chip (3D NoC), owing to its good scalability, is the main interconnection structure in three-dimensional many-core processors. The growing number of processing cores improves processor performance on the one hand, but also drives the network-on-chip to ever larger scales on the other. In a three-dimensional mesh network, as the network scale grows, the differences in communication distance and delay between the nodes of the processing cores become larger: communication between nearby processing cores is cheaper than communication between distant ones. Moreover, this communication advantage is not uniform across nodes. Specifically, a processing core at a central node has shorter distances to the other nodes than processing cores at peripheral nodes and is therefore favored in network communication; this advantage keeps widening as the network scale increases, so the delay differences between network packets grow, i.e. the problem of unbalanced network delay arises.
As the demand for Cache capacity keeps expanding, three-dimensional many-core processors usually organize the Last-Level Cache (LLC) over the 3D NoC using a Non-Uniform Cache Access (NUCA) architecture. In a 3D NoC-based NUCA architecture the LLC is physically distributed over the processing-core nodes, while the Cache banks (Banks) of the nodes logically form a unified shared Cache. A typical NUCA-based three-dimensional stacked many-core system-on-chip under a 4 × 4 × 4 three-dimensional mesh network is shown in FIG. 1: each processing unit contains a primary instruction/data Cache (L1I/L1D), a secondary shared Cache Bank and a network interface, and is connected to a router via the network interface. The number on each node is its serial number in the network; the distributed shared secondary Cache Banks are organized as a static NUCA (S-NUCA) structure and are cross-addressed in units of Cache blocks.
However, in the above NUCA structure, when a processing core issues a Cache access request, the access time depends on the distance between the requesting core's node and the node holding the Cache Bank that contains the accessed data: when the distance is short the access time is short, and when a distant Bank is accessed the access time is long. With the traditional NUCA structure, as the network scale expands and the number of nodes increases, Cache access delay becomes dominated by network delay, so the network-delay imbalance propagates into the Cache access delay. The delay differences between Cache access requests grow, producing unbalanced Cache access delay: some Cache access requests suffer very large delays, blocking the execution of the processing cores that issued them, forming a system bottleneck and seriously degrading overall system performance.
The Chinese patent application CN107729261A discloses a Cache address mapping method in a multi-core/many-core processor, which can effectively relieve the problem of unbalanced network delay in the traditional two-dimensional multi-core/many-core processor by combining a non-uniform design, but the scheme aims at the Cache address mapping in the two-dimensional multi-core/many-core processor, and the two-dimensional address mapping in the two-dimensional multi-core/many-core processor is simple and low in algorithm complexity compared with a three-dimensional processor.
In summary, the contradiction between the uniformity of the Cache address mapping mechanism of the traditional three-dimensional many-core processor and the non-uniformity of the network topology causes unbalanced network delay in practical use, which limits further improvement of system performance; and the Cache address mapping of a two-dimensional multi-core/many-core processor cannot be applied directly to a three-dimensional many-core processor. A Cache mapping method for three-dimensional many-core processors is therefore urgently needed to solve the network-delay balancing problem in three-dimensional many-core processors, especially at large network scales.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the above technical problems in the prior art, the invention provides a Cache address mapping method for a three-dimensional many-core processor that is simple to implement, can achieve network-delay balance of the three-dimensional many-core processor at large network scales, and improves the operating efficiency of the three-dimensional many-core processor.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
a Cache address mapping method based on a three-dimensional many-core processor comprises the following steps:
s1, constructing an objective function of a nonlinear programming problem based on the visited probability of each Bank and the visited distance between the banks when a target three-dimensional many-core processor adopts non-uniform Cache address mapping, and solving the constructed objective function to obtain the optimal visited probability distribution of each Bank of the non-uniform Cache address mapping, wherein the visited distance comprises the distance in three directions in a three-dimensional grid network;
s2, adjusting the probability distribution obtained in the step S1 to finally obtain a required three-dimensional distribution matrix;
s3, calculating the quantity distribution of Cache blocks mapped by each Bank according to the three-dimensional distribution matrix obtained in the step S2;
and S4, adjusting the number of Cache blocks mapped by each Bank in a three-dimensional space range according to the number distribution of the Cache blocks mapped by each Bank obtained in the step S3.
As a further improvement of the invention: in step S1, a Cache access cost distribution function is specifically constructed from access distances between banks in the three-dimensional mesh network and the probability distribution of accesses to the banks, and the objective function is constructed based on the Cache access cost distribution function.
As a further improvement of the invention: specifically, a vector C standard deviation is selected to construct the objective function, wherein the vector C represents the distribution of the access overhead of the Cache of the Bank of each node, namely D ═ Ci]VV is the size of the network on chip in the target processor architecture, ciCache access overhead for the ith Bank, i.e.
Figure BDA0001727057700000021
hi,jRepresents the access distance, p, of nodes i and j in the three-dimensional mesh networkiThe visited probability of Bank for node i;
the constructed objective function is specifically as follows:
min σ(C) = √( (1/V) · Σ_{i=1}^{V} ( c_i − μ(C) )² )
and setting constraint conditions:
Σ_{i=1}^{V} p_i = 1

p_i ≥ 0, i = 1, 2, …, V
where μ (C) is the average of the Cache access overheads of all nodes obtained from vector C.
As a further improvement of the invention: the access distance is specifically a manhattan distance.
As a further improvement of the invention: in the step S2, the probability distribution obtained in the step S1 is adjusted according to the network scale, so as to match the network size, and finally obtain the required three-dimensional distribution matrix.
As a further improvement of the invention: in step S3, specifically, B is 2mObtaining the quantity distribution B of Cache blocks mapped by each Bank by xP, wherein P is the probability distribution of accessing the Bank, and when the Bank address occupies m bits, the mapping interval is 2mAnd (4) one Cache block.
As a further improvement of the invention: in step S4, a first target Bank in the network grid is remapped to a second target Bank according to the number distribution of Cache blocks mapped by each Bank and the number of Cache blocks during mapping of consistent Cache addresses, where the first target Bank is a Bank with a smaller number of Cache blocks during mapping of mapped Cache blocks than during mapping of consistent Cache addresses, and the second target Bank is a Bank with a larger number of Cache blocks during mapping of mapped Cache blocks than during mapping of consistent storage.
As a further improvement of the invention: the first target Bank is a Bank close to the peripheral position in the network grid, and the second target Bank is a Bank close to the central position in the network grid, namely mapping the Cache block of the first target Bank close to the peripheral position in the network grid to the second target Bank close to the central position in the network grid.
As a further improvement of the invention: the specific steps for adjusting the number of Cache blocks mapped by each Bank are as follows: equally dividing a network grid formed by each Bank node into eight regions, judging the size relationship between the number of Cache blocks mapped by each node and the number of Cache blocks in the mapping process of the consistent Cache address in each region, and if the size relationship is smaller than the size relationship, judging that the corresponding node is a first target Bank close to the periphery in the network grid; and if so, judging that the corresponding node is a second target Bank close to the central position, and remapping the Cache block of the first target Bank close to the peripheral position in the network grid in each region to the second target Bank close to the central position.
As a further improvement of the invention: the non-uniform Cache address mapping also comprises a step of setting a Bank address segment, specifically, a Bank ID field is expanded with a zone bit and an index bit, the zone bit is used for identifying the number of groups obtained after each Cache block is divided, and the index bit is stored in the Bank address to which a target Cache block should be mapped under an S-NUCA structure.
Compared with the prior art, the invention has the advantages that:
1. The Cache address mapping method based on a three-dimensional many-core processor of the invention considers the uniform memory-to-LLC mapping adopted by traditional three-dimensional many-core processors together with the structural characteristics of the three-dimensional many-core processor, and introduces a non-uniform design to realize Cache address mapping. It first constructs an objective function based on the visited probability of each Bank and the access distances between Banks, solves it to obtain the optimal visited-probability distribution of the Banks for the non-uniform Cache address mapping, adjusts the result into the required three-dimensional distribution matrix, calculates from it the quantity distribution of Cache blocks mapped by each Bank, and then adjusts the number of Cache blocks mapped by each Bank within the three-dimensional space according to that quantity distribution. The optimized Cache address mapping corrects the unbalanced network-delay state and achieves network-delay balance, so that, combined with the non-uniform design, the problem of unbalanced network delay in traditional three-dimensional many-core processors is effectively solved and system performance is effectively improved.
2. The Cache address mapping method based on a three-dimensional many-core processor of the invention, starting from the structural characteristics of the three-dimensional many-core processor, solves for the optimal Bank visited-probability distribution of the non-uniform Cache address mapping while simultaneously considering the distances in the three directions of the three-dimensional mesh network, and so obtains a Bank visited-probability distribution matched to the three-dimensional many-core processor structure. By adjusting the number of Cache blocks mapped by each Bank within the three-dimensional space according to the Cache-block quantity distribution, three-dimensional Cache address mapping tailored to the three-dimensional many-core processor is realized; combined with the non-uniform design, efficient Cache address mapping of the three-dimensional many-core processor is achieved, alleviating the problem of unbalanced network delay in the three-dimensional many-core processor.
3. The Cache address mapping method based on a three-dimensional many-core processor of the invention optimizes the non-uniform Cache address mapping distribution starting from the uniform Cache address mapping and applies it to the three-dimensional many-core processor: within the three-dimensional space, Cache blocks are remapped from Banks with few mapped Cache blocks to Banks with many mapped Cache blocks, which effectively improves the network-delay balance of the three-dimensional many-core processor.
4. In the Cache address mapping method based on a three-dimensional many-core processor of the invention, the Cache access overhead distribution is obtained from the access distances in the three directions of the three-dimensional mesh network and the visited-probability distribution of the Banks, and the objective function is then constructed from the Cache access overhead distribution of the Banks, so that the optimal Bank visited-probability distribution is obtained on the basis of Cache access overhead and the quantity distribution of Cache blocks mapped by each Bank can be optimized to the greatest extent.
Drawings
Fig. 1 is a schematic diagram of a typical three-dimensional stacked many-core system on chip based on a NUCA structure under a 4 × 4 × 4 three-dimensional mesh network.
FIG. 2 is a schematic diagram of an implementation flow of the Cache address mapping method based on the three-dimensional many-core processor in the embodiment.
Fig. 3 is a diagram showing the access probability and the number distribution of Cache blocks of each Bank obtained in the embodiment (4 × 4 × 4) of the present invention.
Fig. 4 is a schematic diagram illustrating an implementation principle of non-uniform Cache address mapping in the embodiment (4 × 4 × 4) of the present invention.
Fig. 5 is a schematic diagram of the mapping result of each Cache block obtained in the embodiment (4 × 4 × 4) of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and specific preferred embodiments of the description, without thereby limiting the scope of protection of the invention.
As shown in fig. 2, the Cache address mapping method based on the three-dimensional many-core processor in this embodiment includes the steps of:
s1, constructing an objective function of a nonlinear programming problem based on the visited probability of each Bank and the visited distance between the banks when the non-uniform Cache address mapping is adopted in a target three-dimensional many-core processor, and solving the constructed objective function to obtain the optimal visited probability distribution of each Bank of the non-uniform Cache address mapping, wherein the visited distance comprises the distance in three directions in a three-dimensional grid network;
s2, adjusting the probability distribution obtained in the step S1 to finally obtain a required three-dimensional distribution matrix;
s3, calculating the quantity distribution of Cache blocks mapped by each Bank according to the three-dimensional distribution matrix obtained in the step S2;
and S4, adjusting the number of Cache blocks of each Bank mapping in a three-dimensional space range according to the number distribution of the Cache blocks of each Bank mapping obtained in the step S3.
According to the method, based on the structural characteristics of the three-dimensional many-core processor, the optimal visited-probability distribution of the Banks for the non-uniform Cache address mapping is solved while considering the distances in the three directions of the three-dimensional mesh network, so that a Bank visited-probability distribution matched to the three-dimensional many-core processor structure is obtained. Meanwhile, the number of Cache blocks mapped by each Bank is adjusted within the three-dimensional space according to the Cache-block quantity distribution, realizing three-dimensional Cache address mapping for the three-dimensional many-core processor; combined with the non-uniform design, efficient Cache address mapping of the three-dimensional many-core processor is achieved, alleviating the problem of unbalanced network delay in the three-dimensional many-core processor.
Uniform LLC mapping means that the Cache blocks in memory are mapped one by one, with the Cache block as the interleaving unit, onto the Banks of the LLC. Considering this uniform memory-to-LLC mapping adopted by traditional three-dimensional many-core processors, the method introduces a non-uniform design to realize Cache address mapping. An objective function is first constructed based on the visited probability of each Bank and the access distances between Banks, and solved to obtain the optimal visited-probability distribution of the Banks for the non-uniform Cache address mapping; after adjustment into the required three-dimensional distribution matrix form, the quantity distribution of Cache blocks mapped by each Bank, i.e. the Cache-block mapping proportion of each Bank, is calculated; the number of Cache blocks mapped by each Bank is then adjusted within the three-dimensional space according to that quantity distribution. The optimized Cache address mapping corrects the unbalanced network-delay state and achieves network-delay balance, thereby effectively solving the problem of unbalanced network delay in traditional three-dimensional many-core processors and effectively improving system performance.
In this embodiment, the number of Cache blocks mapped by each Bank is adjusted according to the Cache-block quantity distribution within one mapping interval.
In this example, in step S1, a Cache access overhead distribution function is specifically constructed from access distances between banks in the three-dimensional mesh network and the probability distribution of accesses to the banks, and an objective function is constructed based on the Cache access overhead distribution function, that is, after Cache access overhead distribution is obtained from the access distances in three directions and the probability distribution of accesses to the banks in the three-dimensional mesh network, an objective function is constructed from the Cache access overhead distribution of the banks, so that optimal Bank access probability distribution can be obtained based on the Cache access overhead.
In this embodiment, a mesh network using an XYZ dimension-order routing policy is adopted, and the access distance is the Manhattan distance. In step S1, the number of network nodes in the three-dimensional many-core processor structure is first input as V, and the visited probability of the Bank at node i (which may also be regarded as its Cache-block mapping proportion in memory) is denoted p_i. The non-uniform Cache address mapping distribution to be calculated is represented by a vector P:

P = [p_i]_V    (1)

Each p_i is then given the uniform initial value:

p_i = 1/V    (2)
assuming that the processing core at node j needs to access Bank at node i, the non-contention delay between them can be expressed by the following equation:
t_{i,j} = h_{i,j} · τ_{1hop}    (3)

where h_{i,j} denotes the Manhattan distance (hop count) between node i and node j in the network, and τ_{1hop} denotes the delay of a single hop. Consider a contiguous storage area containing M Cache blocks in total, of which m_i are mapped to the Bank at node i; each core then sends m_i requests to the i-th Bank, so the total non-contention delay at the Bank at node i is:

T_i = Σ_{j=1}^{V} m_i · t_{i,j} = m_i · τ_{1hop} · Σ_{j=1}^{V} h_{i,j}    (4)

Dividing T_i by the constant M · τ_{1hop} normalizes the non-contention delay, which is then defined as the Cache access overhead of the i-th Bank, denoted c_i:

c_i = T_i / (M · τ_{1hop}) = (m_i / M) · Σ_{j=1}^{V} h_{i,j}    (5)

where m_i/M is the proportion of Cache blocks mapped to the i-th Bank, which can therefore be replaced by p_i, so the Cache access overhead c_i of each Bank can be expressed as:

c_i = p_i · Σ_{j=1}^{V} h_{i,j}    (6)
the Cache access overhead distribution of the Bank of each node is represented by a vector C, so that the average value of the distribution C can be obtained:
μ(C) = (1/V) · Σ_{i=1}^{V} c_i    (7)
in order to balance the average access delay of each node, it is desirable that the average access distances of each node are as close as possible to each other, that is, the standard deviation of the set of elements in the distribution C is as small as possible, so that the standard deviation is selected as an optimized objective function:
min σ(C) = √( (1/V) · Σ_{i=1}^{V} ( c_i − μ(C) )² )    (8)
in addition, the sum of the access probabilities of all banks should be equal to 1, and the access probability of each Bank should be greater than or equal to 0, i.e., the following constraint equation is satisfied:
Σ_{i=1}^{V} p_i = 1,  p_i ≥ 0 (i = 1, 2, …, V)    (9)
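Equations (6)–(8) can be evaluated directly. A minimal Python sketch, assuming a linear node numbering with x varying fastest (the patent does not fix one):

```python
def overhead_and_sigma(P, dims):
    """Evaluate equations (6)-(8): the per-Bank Cache access overheads
    c_i = p_i * sum_j h_{i,j}, their mean mu(C), and the objective
    sigma(C).  Linear node numbering with x varying fastest is assumed."""
    X, Y, Z = dims
    V = X * Y * Z

    def coords(i):
        return i % X, (i // X) % Y, i // (X * Y)

    def hop(i, j):  # Manhattan distance h_{i,j} in the 3D mesh
        (xi, yi, zi), (xj, yj, zj) = coords(i), coords(j)
        return abs(xi - xj) + abs(yi - yj) + abs(zi - zj)

    C = [P[i] * sum(hop(i, j) for j in range(V)) for i in range(V)]
    mu = sum(C) / V                                     # equation (7)
    sigma = (sum((c - mu) ** 2 for c in C) / V) ** 0.5  # equation (8)
    return C, mu, sigma
```

On a 2 × 2 × 2 mesh every node is a corner, so the uniform initial distribution p_i = 1/V already gives σ(C) = 0; on any larger mesh the uniform distribution leaves σ(C) > 0, which is exactly the imbalance the optimization removes.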
that is, in this embodiment, a vector C standard deviation is specifically selected to construct an objective function, where the vector C represents the Bank Cache access overhead distribution of each node, that is, D ═ Ci]VV is the size of the network on chip in the target processor architecture, ciCache access overhead for the ith Bank, i.e.
Figure BDA0001727057700000075
hi,jRepresents the access distance, p, of nodes i and j in the three-dimensional mesh networkiThe visited probability of Bank for node i; the specific objective function for constructing the nonlinear programming problem is as follows:
Figure BDA0001727057700000076
and setting the constraint conditions as follows:
Figure BDA0001727057700000077
Figure BDA0001727057700000078
where μ (C) is the average of the Cache access overheads of all nodes obtained from vector C.
Solving this nonlinear programming problem with a nonlinear programming method yields the optimal visited probability of each Bank for the non-uniform Cache address mapping, i.e. the optimal visited-probability distribution vector P of the Banks.
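For the objective exactly as stated, a useful sanity check exists: since c_i = p_i · Σ_j h_{i,j}, choosing p_i proportional to 1/Σ_j h_{i,j} equalizes all c_i and drives σ(C) to zero while satisfying constraints (9). The sketch below gives this closed-form feasible point for checking a numerical solver; it is not the patent's own solution method, and the node numbering is an illustrative assumption:

```python
def equalizing_distribution(dims):
    """Closed-form feasible point for the stated problem: with
    S_i = sum_j h_{i,j}, setting p_i proportional to 1/S_i makes every
    c_i = p_i * S_i equal, hence sigma(C) = 0, while meeting the
    constraints sum p_i = 1 and p_i >= 0.  Linear node numbering with
    x varying fastest is assumed."""
    X, Y, Z = dims
    V = X * Y * Z

    def coords(i):
        return i % X, (i // X) % Y, i // (X * Y)

    def hop(i, j):
        (xi, yi, zi), (xj, yj, zj) = coords(i), coords(j)
        return abs(xi - xj) + abs(yi - yj) + abs(zi - zj)

    S = [sum(hop(i, j) for j in range(V)) for i in range(V)]
    norm = sum(1.0 / s for s in S)
    return [1.0 / (s * norm) for s in S]
```

Consistent with the distribution in Fig. 3, this assigns higher visited probability to central Banks (small Σ_j h_{i,j}) than to corner Banks.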
In this embodiment, in step S2, the probability distribution obtained in step S1 is specifically adjusted according to the network scale, so as to match the network size, and finally obtain the required three-dimensional distribution matrix. That is, according to the distribution vector P, the size of the distribution vector P is resized according to the network size to be suitable for the network size, thereby obtaining a final three-dimensional distribution matrix P.
After the visited-probability distribution P of each Bank is obtained, the proportion of memory Cache blocks mapped to each Bank is calculated from P; specifically, each Bank's mapping proportion of memory Cache blocks equals its visited probability. Since the Bank address occupies m bits of the physical memory-space address, the mapping interval is 2^m Cache blocks, and in step S3 of this embodiment the Cache-block quantity distribution B mapped by each Bank is obtained as B = 2^m × P.
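A sketch of computing B = 2^m × P; since the entries must be integer block counts that still sum to the mapping interval 2^m, a largest-remainder rounding rule is assumed here (the patent does not specify one):

```python
def block_distribution(P, m):
    """B = 2**m * P rounded to integer Cache-block counts that still sum
    to the mapping interval 2**m.  The largest-remainder rounding rule is
    an illustrative assumption."""
    total = 2 ** m
    raw = [p * total for p in P]
    B = [int(r) for r in raw]
    leftover = total - sum(B)
    # hand the leftover blocks to the Banks with the largest remainders
    order = sorted(range(len(P)), key=lambda i: raw[i] - B[i], reverse=True)
    for i in order[:leftover]:
        B[i] += 1
    return B
```

With m = 10 and a uniform distribution over 64 Banks this reproduces the uniform baseline of 16 Cache blocks per Bank.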
In step S4 of this embodiment, according to the Cache-block quantity distribution of each Bank and the per-Bank count under uniform Cache address mapping, a first target Bank in the network grid is remapped to a second target Bank, where the first target Bank maps fewer Cache blocks than under uniform mapping and the second target Bank maps more. That is, starting from the uniform Cache address mapping, the number of Cache blocks mapped by each Bank is adjusted according to the quantity distribution B: Banks of the network grid that map fewer Cache blocks than under uniform Cache address mapping have blocks remapped to target Banks that map more.
This way of optimizing the non-uniform Cache address mapping distribution starting from the uniform Cache address mapping is applied to the three-dimensional many-core processor: within the three-dimensional space, taking the per-Bank Cache-block count under uniform Cache address mapping as the reference, blocks of Banks with few mapped Cache blocks are remapped to Banks with many mapped Cache blocks, which effectively improves the network-delay balance of the three-dimensional many-core processor.
Because the central nodes of the network grid map many Cache blocks while the peripheral nodes map few, based on the above optimization principle the first target Bank in this embodiment is a Bank near the periphery of the network grid and the second target Bank is a Bank near the centre; that is, Cache blocks of first target Banks near the periphery are mapped to second target Banks near the centre. The specific steps are as follows: the network grid formed by the Bank nodes is divided equally into eight regions; within each region, the number of Cache blocks mapped by each node is compared with the per-Bank count under uniform Cache address mapping; if it is smaller, the node is judged to be a first target Bank near the periphery, and if it is larger, a second target Bank near the centre; the Cache blocks of the first target Banks near the periphery of each region are then remapped to the second target Banks near the centre. In this way the Cache-block distribution can be optimized quickly and effectively, efficiently alleviating the problem of unbalanced network delay in the three-dimensional many-core processor.
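The eight-region split of the mesh can be sketched as follows, again assuming a linear node numbering with x varying fastest; each octant of a 4 × 4 × 4 mesh then holds 2 × 2 × 2 = 8 nodes:

```python
def octants(dims):
    """Divide an X*Y*Z mesh with even dimensions into its eight equal
    regions; returns {(hi_x, hi_y, hi_z): [node indices]} under a linear
    numbering with x varying fastest (an illustrative assumption)."""
    X, Y, Z = dims
    regions = {}
    for i in range(X * Y * Z):
        x, y, z = i % X, (i // X) % Y, i // (X * Y)
        key = (x >= X // 2, y >= Y // 2, z >= Z // 2)
        regions.setdefault(key, []).append(i)
    return regions
```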
The invention is further illustrated below by taking a 4 × 4 × 4 three-dimensional mesh network as an example.
Fig. 3 shows the access probability and the Cache-block number distribution of each Bank calculated for the embodiment of the three-dimensional mesh network of size 4 × 4 × 4. Since the distribution matrix P is three-dimensional, P_i is used to represent the i-th component of the three-dimensional matrix P along the third dimension. Compared with consistent Cache address mapping under the structure of a traditional three-dimensional many-core processor, with the Bank address field occupying 10 bits, the central nodes are found to map more Cache blocks and the peripheral nodes fewer.
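As a rough sketch of how the block-count distribution can be derived from the probability distribution (the scaling rule B = 2^m × P appears in claim 4; the helper name and the rounding scheme are assumptions for illustration), an m-bit Bank address field addresses 2^m Cache blocks, here m = 10, i.e. 1024 blocks:

```python
def blocks_from_probability(p, m=10):
    """Turn a visited-probability vector p (sums to 1) into integer
    per-Bank Cache-block counts summing to 2**m, using largest-remainder
    rounding so no block is lost."""
    total = 1 << m                        # 1024 Cache blocks for m = 10
    raw = [pi * total for pi in p]
    counts = [int(r) for r in raw]
    # hand the blocks lost to truncation to the largest fractional parts
    short = total - sum(counts)
    order = sorted(range(len(p)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[:short]:
        counts[i] += 1
    return counts
```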
The non-uniform Cache address mapping for the mesh network of size 4 × 4 × 4 is shown in Fig. 4. As shown in Fig. 4(a), the original 6-bit Bank ID field is extended to a 10-bit Bank address, with the high 4 bits used as flag bits (Bank tag) and the low 6 bits as index bits (Bank index). The 1024 Cache blocks can thus be divided into 16 groups according to their flag bits, each group containing 64 Cache blocks, and the index bits indicate the Bank address to which a Cache block would be mapped under the original S-NUCA structure. Fig. 4(b) gives the number of Cache blocks mapped to the Bank of each node; by symmetry, the nodes of a 4 × 4 × 4 three-dimensional mesh network can be divided into 8 regions of 8 (2 × 2 × 2) nodes each, and each region follows a similar mapping manner, i.e., some Cache blocks of nodes near the periphery are mapped to nodes near the center. Taking the sub-region selected in Fig. 4(c) as an example, Fig. 4(d) and Fig. 4(e) show, respectively, the sequence number of each node in the sub-region and the number of Cache blocks it maps. Node 22, near the center, needs to map 20 Cache blocks, 4 more than under the consistent mapping (16 Cache blocks per Bank), so one block is taken from each of the lightly loaded Bank nodes 2, 3, 7 and 19 and remapped to node 22. Nodes 6, 18 and 23 each need to map 17 Cache blocks, one more than under the consistent mapping, so one block is remapped from node 3 to each of them; node 3 then maps exactly 12 Cache blocks.
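The 10-bit Bank address layout of Fig. 4(a) amounts to a simple bit split; a sketch follows (function and constant names are assumptions for illustration, not identifiers from the patent):

```python
BANK_TAG_BITS = 4   # high 4 bits: group flag (Bank tag)
BANK_IDX_BITS = 6   # low 6 bits: original S-NUCA Bank (Bank index)

def split_bank_addr(bank_addr):
    """Split a 10-bit Bank address into (tag, index)."""
    tag = (bank_addr >> BANK_IDX_BITS) & ((1 << BANK_TAG_BITS) - 1)
    index = bank_addr & ((1 << BANK_IDX_BITS) - 1)
    return tag, index
```

Iterating over all 1024 Bank addresses yields 16 tag groups of 64 blocks each, matching the grouping described above.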
The mapping results for each group of Cache blocks in the mesh network of size 4 × 4 × 4 are shown in Fig. 5, which mainly shows the results for the sub-region of Fig. 4(c). The mapping of the first 12 groups (i.e., tag equal to 0 to 11) is identical to that under the consistent structure, while for the last 4 groups (i.e., tag equal to 12 to 15) some Cache blocks are mapped to nodes near the center. The mapping results show that the Cache address mapping method of the invention can balance the network of the three-dimensional many-core processor and solve the problem of unbalanced network delay in traditional three-dimensional many-core processors.
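To connect the example back to the optimization objective of step S1, the following sketch evaluates the standard deviation σ(C) that the method seeks to minimize, under the assumption (consistent with the definitions in claim 1) that the overhead of the i-th Bank is c_i = p_i · Σ_j h_{i,j} with Manhattan distances h_{i,j}. It also illustrates why central nodes map more Cache blocks: choosing p_i inversely proportional to Σ_j h_{i,j} makes every c_i equal and drives σ(C) to zero. This is a hedged illustration, not the patented solver.

```python
import math
from itertools import product

N = 4
nodes = list(product(range(N), repeat=3))   # the 64 Bank nodes of a 4x4x4 mesh
V = len(nodes)

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# total Manhattan distance from every node to Bank i (small for central Banks)
dist_sum = [sum(manhattan(nodes[i], nodes[j]) for j in range(V)) for i in range(V)]

def access_cost_stddev(p):
    """sigma(C) for a visited-probability vector p, with c_i = p_i * sum_j h_ij."""
    c = [p[i] * dist_sum[i] for i in range(V)]
    mu = sum(c) / V
    return math.sqrt(sum((ci - mu) ** 2 for ci in c) / V)

uniform = [1.0 / V] * V                      # S-NUCA: every Bank equally likely
weights = [1.0 / d for d in dist_sum]        # favor central Banks
total = sum(weights)
balanced = [w / total for w in weights]      # p_i proportional to 1 / dist_sum[i]
```

With `uniform`, central Banks have a smaller distance sum and hence a smaller c_i, so σ(C) is nonzero; with `balanced`, all c_i coincide and σ(C) vanishes up to floating-point error.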
The foregoing describes preferred embodiments of the invention and is not to be construed as limiting the invention in any way. Although the present invention has been described with reference to preferred embodiments, it is not limited thereto. Therefore, any simple modification, equivalent change or variation made to the above embodiments in accordance with the technical spirit of the present invention, without departing from the content of the technical solution of the present invention, shall fall within the protection scope of the technical solution of the present invention.

Claims (5)

1. A Cache address mapping method based on a three-dimensional many-core processor is characterized by comprising the following steps:
s1, constructing an objective function of a nonlinear programming problem based on the visited probability of each Bank and the access distance between the Banks when a target three-dimensional many-core processor adopts non-uniform Cache address mapping, and solving the constructed objective function to obtain the optimal visited probability distribution of each Bank under non-uniform Cache address mapping, wherein the access distance comprises distances in three directions in the three-dimensional mesh network;
s2, adjusting the probability distribution obtained in the step S1 to finally obtain a required three-dimensional distribution matrix;
s3, calculating the quantity distribution of Cache blocks mapped by each Bank according to the three-dimensional distribution matrix obtained in the step S2;
s4, adjusting the number of Cache blocks mapped by each Bank in a three-dimensional space range according to the number distribution of the Cache blocks mapped by each Bank obtained in the step S3;
in step S1, a Cache access cost distribution function is specifically constructed from access distances between banks in a three-dimensional mesh network and the probability distribution of access to each Bank, and the objective function is constructed based on the Cache access cost distribution function;
specifically, the standard deviation of a vector C is selected to construct the objective function, wherein the vector C represents the distribution of the Cache access overheads of the Banks of the nodes, namely C = [c_i]_V, V is the size of the network on chip in the target processor architecture, and c_i is the Cache access overhead of the i-th Bank, i.e.
c_i = p_i × Σ_{j=0}^{V−1} h_{i,j}
wherein h_{i,j} represents the access distance between nodes i and j in the three-dimensional mesh network, and p_i is the visited probability of the Bank of node i;
the constructed objective function is specifically as follows:
min σ(C) = √( (1/V) × Σ_{i=0}^{V−1} (c_i − μ(C))² )
and setting constraint conditions:
Σ_{i=0}^{V−1} p_i = 1
pi≥0(0≤i≤V-1)
μ(C) = (1/V) × Σ_{i=0}^{V−1} c_i
wherein μ(C) is the average value of the Cache access overheads of all nodes obtained from the vector C;
in step S4, the N × N × N network grid formed by the Bank nodes is symmetrically divided into eight regions; within each region, the number of Cache blocks mapped by each node is compared with the per-Bank number under consistent Cache address mapping; if it is smaller, the corresponding node is judged to be a first target Bank near the peripheral position in the network grid; if it is larger, the corresponding node is judged to be a second target Bank near the central position; and in each region, the Cache blocks of the first target Banks near the peripheral position in the network grid are remapped to the second target Banks near the central position.
2. The Cache address mapping method based on the three-dimensional many-core processor according to claim 1, wherein the access distance is specifically a Manhattan distance.
3. The Cache address mapping method based on the three-dimensional many-core processor as claimed in claim 1, wherein in the step S2, the probability distribution obtained in the step S1 is adjusted according to the network scale so as to match the network size, and the required three-dimensional distribution matrix is finally obtained.
4. The Cache address mapping method based on the three-dimensional many-core processor as claimed in claim 1, wherein in step S3, the number distribution B of Cache blocks mapped by each Bank is obtained specifically according to B = 2^m × P, where P is the probability distribution of accessing each Bank, and when the Bank address occupies m bits the mapping interval is 2^m Cache blocks.
5. The Cache address mapping method based on the three-dimensional many-core processor according to claim 1, wherein the non-uniform Cache address mapping further comprises a step of setting a Bank address field, specifically, extending the Bank ID field with flag bits and index bits, wherein the flag bits identify the group into which each Cache block is divided, and the index bits store the Bank address to which the target Cache block would be mapped under the S-NUCA structure.
CN201810757396.2A 2018-07-11 2018-07-11 Cache address mapping method based on three-dimensional many-core processor Active CN109032967B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810757396.2A CN109032967B (en) 2018-07-11 2018-07-11 Cache address mapping method based on three-dimensional many-core processor

Publications (2)

Publication Number Publication Date
CN109032967A CN109032967A (en) 2018-12-18
CN109032967B true CN109032967B (en) 2021-10-01

Family

ID=64642149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810757396.2A Active CN109032967B (en) 2018-07-11 2018-07-11 Cache address mapping method based on three-dimensional many-core processor

Country Status (1)

Country Link
CN (1) CN109032967B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110034950B (en) * 2019-02-28 2021-08-10 华南理工大学 Mapping method for network on 3D chip

Citations (1)

Publication number Priority date Publication date Assignee Title
CN107729261A (en) * 2017-09-28 2018-02-23 中国人民解放军国防科技大学 Cache address mapping method in multi-core/many-core processor

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US8103894B2 (en) * 2009-04-24 2012-01-24 International Business Machines Corporation Power conservation in vertically-striped NUCA caches
US8621157B2 (en) * 2011-06-13 2013-12-31 Advanced Micro Devices, Inc. Cache prefetching from non-uniform memories
CN103810119B (en) * 2014-02-28 2017-01-04 北京航空航天大学 The temperature difference on sheet is utilized to reduce the cache design method of STT-MRAM power consumption

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant