CN1217505A - Address mapping technique and apparatus in high-speed buffer storage device - Google Patents


Publication number
CN1217505A
CN1217505A (Application CN 97120245)
Authority
CN
China
Prior art keywords
cache
address
matrix
data
row
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 97120245
Other languages
Chinese (zh)
Other versions
CN1081361C (en)
Inventor
刘志勇
李恩有
乔香珍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo Zhongke IC Design Center Co., Ltd.
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN97120245A
Publication of CN1217505A
Application granted
Publication of CN1081361C
Anticipated expiration
Status: Expired - Fee Related

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present invention provides an address mapping transformation technique for a CACHE system, and a device implementing it. It is characterized in that the CACHE system adopts an XOR address scattering mechanism to improve the way main-memory data are distributed over the lines of the CACHE data bank. The inputs of the address scattering mechanism are the CACHE tag (flag-bit) field of the data address (or a part of it) and the CACHE line-number field (or a part of it); its output is a CACHE line number (or a part of it). The newly formed CACHE line number indicates the actual access address (line number) in both the CACHE data bank and the CACHE tag bank.

Description

Address mapping technique and device in a cache memory system
The invention belongs to the field of computer technology, and in particular relates to an address mapping technique and device in a CACHE system.
Progress in computer system architecture has made the gap between the speed of the storage system and that of the processing system increasingly significant. On the one hand, advanced architectural techniques (such as instruction and arithmetic pipelining, reduced instruction set computing (RISC), very long instruction word (VLIW) techniques, and multiple parallel function units) have shortened the execution period of instructions in the control and arithmetic parts of the system; on the other hand, progress in hardware technology has rapidly raised CPU clock frequencies. The processing speed of CPUs roughly doubles every two years or so. By contrast, the access speed of main memory (DRAM) improves only slowly, and main memory speed has long been a major bottleneck for the processing capability of computer systems. As the gap between CPU and main memory speeds widens, improving the speed of the storage system occupies an ever more critical position in improving overall computer system performance.
An effective way to improve storage system performance is to employ a cache memory (CACHE). Modern computer systems generally adopt a CACHE structure, and often a multi-level CACHE structure. The effectiveness of the CACHE (and indeed of the whole memory hierarchy) rests on the locality principle of program accesses. In fact, the benefit of the CACHE is fully realized only when the data stored in it can be accessed repeatedly by the processing unit; in other words, only when the data in the CACHE have high reusability.
However, in many widely used algorithms, especially the core algorithms of applications such as scientific and engineering computation, image processing and signal processing, the program's accesses to data do not satisfy the locality condition required for the CACHE to be effective, so providing a CACHE does not raise system speed. The main cause of this failure of locality is line conflict in CACHE accesses; viewed from the system architecture, the fundamental cause of these line conflicts is the traditional address mapping between the CACHE and main memory.
A CACHE bank usually consists of N = 2^n CACHE lines, each line consisting of 2^w bytes. The line is the elementary unit of data exchange between the CACHE and main memory. Suppose the memory space of the system is 2^m bytes and the memory byte address of a datum is A. Under the traditional mapping between CACHE and memory addresses, the number l of the CACHE line in which this datum is placed (numbering lines from 0) is

l = (A mod 2^(n+w)) / 2^w

This mapping is shown in Fig. 1, the known mapping between CACHE and memory addresses.
In Fig. 1, the data address issued by the CPU is divided into three fields, called the tag (flag-bit) field t, the line field l and the in-line displacement field d. The function of the tag field t is explained below. The line field l indicates the number of the CACHE line to be accessed, and the in-line displacement field d indicates the byte position within the line.
The storage of the CACHE consists of a tag bank and a data bank. There may be one data bank (and tag bank), or several. A CACHE with a single data bank is called a direct-mapped CACHE: the data to be accessed are stored in the line specified by the line field described above. A CACHE with several data banks is called a set-associative CACHE, and the number of its data banks is called its associativity. In a set-associative CACHE, the data to be accessed may be stored in the CACHE line specified by the line field in any one of the banks. The tag bank stores the tag field of the data address together with other control information.
When the computer needs to access data, the tag is read from the line of the tag bank specified by the line field of the data address, and the tag read out is compared with the tag field of the data address. If they match, the data stored in that CACHE line are indeed the required data; this situation is called an access hit. The hit signal thus produced is used to gate the access to the line specified by the line field of the data address (controlling the sending out of data read from the line, or the writing of data into it). If the tag read out does not match the tag field of the data address, the situation is called a miss, and a miss signal is issued to start a memory access cycle of the CACHE system.
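As an illustration (not part of the patent itself), the tag-compare hit/miss check just described can be modeled in a few lines of software; the class and method names below are our own, and a direct-mapped CACHE with one data unit per line is assumed.

```python
# Illustrative model of the tag-compare hit/miss check for a direct-mapped
# CACHE with 2**n lines, one data unit per line (names are our own).
class DirectMappedCache:
    def __init__(self, n):
        self.n = n
        self.tags = [None] * (2 ** n)   # tag (flag-bit) bank
        self.data = [None] * (2 ** n)   # data bank

    def access(self, addr):
        line = addr % (2 ** self.n)     # traditional line field
        tag = addr >> self.n            # tag (flag-bit) field
        if self.tags[line] == tag:
            return True                  # access hit: gate the line access
        self.tags[line] = tag            # miss: start a memory access cycle
        self.data[line] = addr           # and refill the line
        return False

c = DirectMappedCache(4)
first = c.access(35)         # cold miss
second = c.access(35)        # same address again: hit
third = c.access(35 + 16)    # same line, different tag: conflict miss
print(first, second, third)  # False True False
```

The third access shows the line conflict discussed below: two addresses that differ by the CACHE capacity compete for the same line.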
Since the mapping technique discussed here is in principle independent of w, we assume w = 0 in what follows; that is, we assume each CACHE line stores only one data unit (not necessarily one byte). The mapping between the CACHE address (line number) and the memory address is then

l = A mod 2^n

that is, the CACHE line address is obtained by taking the memory address of the datum modulo 2^n.
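A small check (our illustration, not the patent's) makes the consequence of this mapping concrete: under l = A mod 2^n, a walk down one column of a row-major N x N matrix, i.e. a stride-N access, lands every element on the same line.

```python
# Conflict under the traditional mapping l = A mod 2**n (our illustration):
# a 16 x 16 matrix stored row-major from address 0; the 16 elements of one
# column all share a single CACHE line, so each access flushes the previous.
n = 4
N = 2 ** n

def traditional_line(addr):
    return addr % (2 ** n)

col0_addrs = [row * N + 0 for row in range(N)]       # column 0: stride N
lines_used = {traditional_line(a) for a in col0_addrs}
print(len(lines_used))   # 1 -> all 16 elements compete for one line
```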
For details of CACHE structure, see [reference 6].
In the core algorithms of practical applications, the accesses to a data structure often exhibit regular patterns; we call such a group of regularly accessed data a data pattern [reference 3]. Under the traditional address mapping described above, different data items of a commonly used data pattern are often mapped to the same CACHE line, causing competition for CACHE lines: the data already occupying a CACHE line are flushed out by later data mapped to the same line. We call this situation a CACHE line access conflict, or CACHE line conflict for short. Because of CACHE conflicts, even data with temporal locality in the executing program may not be accessed repeatedly from the CACHE, which lowers the CACHE access hit rate.
For example, consider an N × N matrix. Under the address allocation of a C program (row-major storage), the N data elements of each column may all be mapped to the same CACHE line; in the FORTRAN language (column-major storage), the N data elements of each row may be. The CACHE space utilization is then only 1/N. It is easy to see that even if, to raise the utilization of the CACHE, a blocked computation scheme with √N × √N blocks is adopted, the utilization of the CACHE is still only √N/N = 1/√N.
It should be noted that:
First, raising the space utilization of the CACHE does not necessarily raise the CACHE hit rate.
Second, the problems above cannot be solved simply by increasing the CACHE capacity.
The cause of these problems lies in the data access patterns of programs combined with the traditional mapping between CACHE and memory addresses shown in Fig. 1.
U.S. Patent No. 5,133,061 [reference 4] proposes a method to improve CACHE access performance: the data address is multiplied by a specially designed bit matrix to produce the CACHE line address, thereby randomizing data addresses. The specially designed bit matrix is constructed from "Sierpinski's Gasket" (the binary pattern of Pascal's triangle). With this method, cyclic data access patterns whose stride is any integer power of 2 avoid CACHE line conflicts.
Reference 7 proposes obtaining the CACHE line number by taking the data address modulo 2^n - 1, rather than modulo 2^n as in a traditional CACHE. This method lets cyclic data access patterns of any integer stride avoid CACHE line conflicts (unless the access stride is a multiple or a divisor of 2^n - 1), so as to improve CACHE performance.
Reference 1 proposes switching to another mapping function when too many conflicts are found during program execution under the current CACHE line address mapping function, so as to improve CACHE performance.
For a CACHE with several data banks (a set-associative CACHE), reference 5 proposes adopting different mapping functions for different banks to improve CACHE performance.
The object of the present invention is to provide an address mapping technique in a CACHE system, namely a new and simple mapping mechanism between CACHE and memory addresses called the XOR scattering mechanism (XOR Scattering Mechanism). Its purpose is to reduce and, in the ideal case, eliminate CACHE access conflicts for commonly used data patterns, to improve the effective access speed of the CACHE storage system, and thereby to raise the processing speed of the whole system.
The invention is characterized in that an improved CACHE system adopts an XOR address scattering mechanism to improve the distribution of main-memory data over the lines of the CACHE data bank. The inputs of the address scattering mechanism are the CACHE tag field of the data address (or a part of it) and the CACHE line field (or a part of it); its output is a CACHE line number (or a part of it). The newly formed CACHE line number indicates the actual access address (line number) in both the CACHE data bank and the CACHE tag bank. The address scattering mechanism realizes either the mapping function EE defined by the formula

s^T = (R × i^T) ⊕ (I × j^T)

or the mapping function LR defined by the formula

s^T = (H × i^T) ⊕ (I × j^T)

where i and j are the inputs of the address scattering mechanism and s is its output. The device is an XOR scattering mechanism comprising XOR gates and their wiring, characterized in that one input of each XOR gate is connected to the line field of the CPU data address register, the other input to the tag field tag of the CPU data address register, and the outputs of the XOR gates are connected to the tag bank and the data bank respectively.
The technique proposed here has the following notable features:
First, our aim is to cover a variety of commonly used data access patterns with a single function, and not only the data access patterns formed by simple loops in programs (even though the loop strides may differ).
Second, we have devised two mapping functions specifically to achieve this aim; these two functions are superior to existing schemes in the simplicity of their hardware implementation, in time delay, and in the range of data access patterns they can cover.
The technique proposed here has the following advantages:
First, simple implementation. Since the address mapping between CACHE and main memory is realized by logic operations, avoiding any arithmetic operation, the whole mapping mechanism needs only a small number of XOR gates. Its hardware complexity is only O(n), and its time delay is only a constant, O(1) (merely the delay of one or two levels of XOR gates). The technique is therefore applicable not only in large systems but also in medium, small and even micro systems; it can be realized not only in a first-level CACHE system but also in a second-level CACHE system; and it can be realized "on chip" (integrated in the CPU chip) as well as "off chip". Hence not only CPU manufacturers but also computer system designers can implement it.
Second, strong capability. Accesses to commonly used data patterns are not all realized as simple loops with different strides; consider, for example, access to a data block of a two-dimensional matrix. By studying the commonly used data patterns of numerical analysis, image processing and pattern recognition, signal processing, and scientific and engineering computation, we have designed powerful address mappings that provide conflict-free access for these patterns, or reduce their conflicts, in the CACHE.
Third, flexible use. Because the hardware implementation is simple, different mapping functions can be realized in the address mapping mechanism of the same system, so as to satisfy different application requirements. The choice can be selected and controlled by the programmer, controlled automatically by a higher-level compilation system, or realized by a combination of the two.
For ease of further understanding the features, effects and implementation of the present invention, the invention is further described below in conjunction with the accompanying drawings, in which:
Fig. 1 is the traditional mapping between CACHE and memory addresses.
Fig. 2 is the structure of the new CACHE of the present invention.
Fig. 3 and Fig. 4 show the CACHE mapping of a 16 × 16 matrix under the EE mapping function of the present invention.
Fig. 5 is the implementation circuit of the EE function of the present invention.
Fig. 6 shows the CACHE mapping of a 16 × 16 matrix under the LR mapping (n = 4).
Fig. 7 shows the CACHE mapping of a 32 × 32 matrix under the LR mapping (n = 5).
Fig. 8 is the implementation circuit of the LR mapping function of the present invention.
Fig. 9 illustrates the movement of a data block in the present invention.
In the traditional mapping between CACHE and memory addresses shown in Fig. 1, the memory address is divided into three bit fields. Let the memory address be m bits. Its lowest w bits are the byte number of the datum within a CACHE line (a CACHE line contains 2^w bytes), called the in-line displacement field d; its middle n bits are the CACHE line number (the capacity of a CACHE bank is 2^n lines), called the line field l; its highest t bits are the tag field tag (t = m - (n + w)).
In our XOR mapping, the tag field and the in-line displacement field of the CACHE are formed exactly as in the traditional mapping, while the CACHE line field is formed by a bitwise XOR between certain bits of the traditional line field l and certain bits of the tag field tag. The structure of the CACHE system is then as shown in Fig. 2.
In principle, the mapping function of the scattering mechanism in Fig. 2 should establish a one-to-one correspondence between l and s, so as to improve the utilization of the CACHE space. According to the performance requirements of the system, either a single mapping function or several mapping functions may be realized in the same mechanism, letting the user (or the compiler) select different mapping functions for different application problems. The choice of mapping function may be made according to the data access pattern (selecting a different mapping function), or according to the size of the data structure (choosing different bit sections of l and tag to operate on in forming certain bit sections of s).
We use i to denote the tag field of the memory address in Fig. 2, j the line field l, and s the CACHE line number (set number) formed by the mapping function. We call the scattering mechanism proposed here the "XOR scattering mechanism" and the functions proposed the "XOR mapping functions". The term "XOR mapping function" derives from the general description given by Frailong et al. of a class of skewing schemes for parallel memory banks [reference 2]. Replacing the bank number of the parallel-memory skewing scheme by the CACHE line number of a CACHE system, we give the following description.
An XOR mapping function is described by the following formula:

s^T = (A × i^T) ⊕ (B × j^T)    (Formula 1)

where i and j are n-bit vectors, A and B are n × n matrices, all operations in the formula are carried out over GF(2) (the Galois field with two elements), and s is the CACHE line number produced by the mapping function.
Note, first, that i and j are both required here to be n bits, but in a real system the tag field and the l field of the memory address do not necessarily have the same number of bits (or, owing to the machine structure, they may not have the same number of bits suitable for the transformation); in that case i and j are simply to be understood as the corresponding bit sections of tag and l participating in the transformation. Second, to satisfy the requirement of fully utilizing the CACHE space, A and B should be nonsingular matrices; only then do l and s have a one-to-one correspondence.
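Formula 1 can be checked with a few lines of GF(2) arithmetic. The sketch below is ours, not the patent's hardware: an n × n bit matrix is represented as a list of row integers, multiplication over GF(2) is bitwise AND followed by a parity, and addition is XOR.

```python
# GF(2) realization of Formula 1: s^T = (A x i^T) xor (B x j^T).
# Bit u of the result is the parity of (row u of the matrix AND the vector).
def parity(x):
    return bin(x).count("1") & 1

def gf2_matvec(M, vec):
    s = 0
    for u, row in enumerate(M):          # M[u] holds the bits of row u
        s |= parity(row & vec) << u
    return s

def xor_map(A, B, i, j):
    return gf2_matvec(A, i) ^ gf2_matvec(B, j)

n = 4
I = [1 << u for u in range(n)]           # identity: A = B = I gives s = i xor j
print(bin(xor_map(I, I, 0b1010, 0b0110)))   # 0b1100

# With B nonsingular, l -> s is one-to-one for each fixed tag i:
images = {xor_map(I, I, 0b0011, j) for j in range(2 ** n)}
print(len(images))                       # 16 distinct lines
```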
The two mapping functions we propose are described below.
The EE function is one of several skewing schemes for parallel memory banks; an intuitive and detailed description of it can be found in [reference 3]. It must be explained that in [reference 3] the EE function was proposed as a skewing scheme for a parallel memory system, with no relation to the structure and mapping of a CACHE. The EE function can be expressed as:

s^T = (R × i^T) ⊕ (I × j^T)    (Formula 2)

where R is the n × n back-diagonal matrix and I is the n × n identity matrix; that is, denoting by r_{u,v} the element of R in row u and column v (0 ≤ u, v ≤ n-1), r_{u,v} = 1 when u = n-1-v, and r_{u,v} = 0 otherwise.
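Since each row u of R has its single 1 in column n-1-u, the product R × i^T simply reverses the bit order of i, and Formula 2 reduces to s = bit-reverse(i) xor j. A sketch under that reading (our code, not the patent's circuit):

```python
# EE mapping of Formula 2 as bit reversal plus XOR (our sketch).
def bit_reverse(x, n):
    return int(format(x, "0{}b".format(n))[::-1], 2)

def ee_line(tag, line, n=4):
    return bit_reverse(tag, n) ^ line    # (R x i^T) xor (I x j^T)

print(ee_line(0b0001, 0b0000))   # 8: tag bit 0 becomes bit n-1 of s
print(ee_line(0b1010, 0b1010))   # 15: reversed tag 0b0101 xor line 0b1010
```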
An outstanding advantage of the EE function is that it is very simple to realize. From its structure it can be seen that its hardware needs only a few XOR gate circuits. Considering the development of device technology, for the foreseeable future this function needs only a few to a few tens of XOR gates, so its cost can be described as very small.
Another outstanding advantage of the EE function is that it is powerful: it can cover many of the most frequently used data access patterns.
When the address of the first element of a 2^n × 2^n matrix is 0, the EE function guarantees conflict-free CACHE access to the following commonly used data patterns:
any row or any column of the matrix;
square main blocks and rectangular main blocks of various aspect ratios;
square or rectangular main blocks that have been shifted horizontally or vertically;
distributed blocks, i.e. the sets of 2^n elements occupying the same position within each of 2^n square or rectangular main blocks;
sets of 2^n elements located according to a certain rule on two rows (or two columns), called "partial row pairs" (or "partial column pairs").
These data access patterns are the patterns commonly used in the core algorithms of applications such as scientific and engineering computation, numerical analysis, image processing and signal processing. For their detailed formal definitions and proofs, see the section "Properties of the mapping functions EE and LR" below.
It should be noted that the property of the EE mapping function of guaranteeing conflict-free access to horizontally or vertically shifted square and rectangular data blocks is very significant for a CACHE system. Thanks to this property, even in the simplest CACHE, with associativity 1, a data block of 2^n consecutive elements starting at any point causes at most one CACHE line conflict on any one CACHE line. In real CACHE systems, first-level CACHEs mostly realize an associativity of 2 or more. Therefore we can say that in an EE-mapped CACHE system with N lines and associativity at least 2, any access to a data block of N elements within an N × N matrix completely avoids CACHE conflicts.
Considering the importance of block data access to a large number of practical algorithms, this property of the EE mapping function is obviously significant for CACHE systems with associativity at least 2. This is the main reason we propose adopting the EE function as the address mapping between CACHE and main memory.
Fig. 3 and Fig. 4 show how the elements of a 16 × 16 matrix are mapped into a CACHE with 16 lines by the EE function. Suppose the matrix occupies data units 0 to 255 of main memory, stored row by row (the mode adopted by the C language); the numbers in the figures are then the CACHE line numbers to which the corresponding data elements are mapped under the EE mapping. Several data patterns (square and rectangular data blocks) with conflict-free access under the EE mapping are marked.
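The conflict-free patterns marked in Fig. 3 and Fig. 4 can be reproduced by enumerating the EE mapping over the 16 × 16 matrix. The check below is our illustration; it verifies one row, one column and the main block MBLK(4,4:4,8).

```python
# Enumerating the EE map over a 16 x 16 matrix stored row-major at address 0
# (our check of the patterns in Figs. 3 and 4). Element (r, c) sits at
# address r*16 + c, so its tag field is r and its original line field is c.
def bit_reverse(x, n):
    return int(format(x, "0{}b".format(n))[::-1], 2)

n, N = 4, 16

def ee(r, c):
    return bit_reverse(r, n) ^ c

row7 = {ee(7, c) for c in range(N)}                            # one matrix row
col5 = {ee(r, 5) for r in range(N)}                            # one matrix column
mblk = {ee(4 + a, 8 + b) for a in range(4) for b in range(4)}  # MBLK(4,4:4,8)
print(len(row7), len(col5), len(mblk))   # 16 16 16: all conflict-free
```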
Fig. 5 gives the circuit realizing the EE mapping for a CACHE system with only 16 lines. By Formula 2 of the EE function, any row of the matrix R (back-diagonal matrix) and of the matrix I (identity matrix) contains only a single 1, so each bit of s is formed by the XOR of just one bit of the tag field (tag) of the CPU address with one bit of the original line field (l). Here the tag field and the line field of the CPU data address are each 4 bits. As seen from the figure, the EE function needs only n (here n = 4) XOR gates, and its delay is only the delay of one level of XOR gate circuit (independent of n).
In Fig. 5, one input of each XOR gate is connected to the line field of the CPU data address register and the other to the tag field tag of the CPU data address register; the outputs of the XOR gates are connected to the tag bank and the data bank respectively.
Although the EE function introduced above realizes conflict-free access to a large number of commonly used data access patterns, and has special significance for data block accesses, it cannot guarantee conflict-free access to vectors with arbitrary power-of-2 strides. The method proposed in reference 4 can realize conflict-free access to vectors of any power-of-2 stride, but its structure is complex, and it cannot at the same time realize conflict-free access to the large number of other commonly used data patterns. Considering the practical importance of power-of-2 stride vector patterns, it is significant to study functions that simultaneously satisfy vectors of any power-of-2 stride and the other commonly used data patterns, and that are simple to implement. The LR function is the mapping function we constructed for this purpose. The LR function is constructed as follows:
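The limitation stated above is easy to exhibit. In this illustration of ours, under the EE map with n = 4, the 16 elements of a stride-4 vector occupy only 4 of the 16 CACHE lines.

```python
# EE cannot scatter every power-of-2 stride (our check, n = 4, 16 lines):
# the 16 elements of a stride-4 vector starting at address 0 fall on 4 lines.
def bit_reverse(x, n):
    return int(format(x, "0{}b".format(n))[::-1], 2)

def ee(addr, n=4):
    return bit_reverse(addr >> n, n) ^ (addr % (1 << n))

stride4_lines = {ee(4 * k) for k in range(16)}
print(sorted(stride4_lines))   # [0, 4, 8, 12] -> heavy line conflicts
```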
s^T = (H × i^T) ⊕ (I × j^T)    (Formula 3)

where I is still the identity matrix, and the construction of the matrix H is given in the accompanying drawing (Figure A9712024500111).
In the address mapping technique and device of the present invention for a CACHE system, an improved CACHE system adopts an XOR address scattering mechanism to improve the distribution of main-memory data over the lines of the CACHE data bank. The inputs of the address scattering mechanism are the CACHE tag field of the data address (or a part of it) and the CACHE line field (or a part of it); its output is a CACHE line number (or a part of it). The newly formed CACHE line number indicates the actual access address (line number) in both the CACHE data bank and the CACHE tag bank. The address scattering mechanism realizes either the mapping function EE defined by the formula

s^T = (R × i^T) ⊕ (I × j^T)

or the mapping function LR defined by the formula

s^T = (H × i^T) ⊕ (I × j^T)

where i and j are the inputs of the address scattering mechanism and s is its output.
The LR mapping function provides conflict-free CACHE access to the following data patterns of an N × N matrix:
the rows and columns of the matrix;
square blocks and rectangular blocks of various aspect ratios;
distributed blocks, i.e. the sets of N elements occupying the same position within each of N square or rectangular main blocks;
equally spaced main vectors with spacing 2^i (i an arbitrary integer, i < n);
shifted equally spaced main vectors (with spacing 2^i, where i is an integer less than n);
the main diagonal (when n is even).
The equally spaced main vectors and shifted equally spaced main vectors above are exactly the data patterns formed by N data elements accessed by a loop with stride 2^i. This access pattern is one of the most important access patterns in the core algorithms of digital image processing and signal processing (for example, the FFT). Solving the conflict-free access of this type of data pattern therefore has important practical significance for applications such as digital image processing and signal processing.
Fig. 6 and Fig. 7 give the CACHE line numbers of the elements of a 16 × 16 matrix and a 32 × 32 matrix under the LR mapping function. Fig. 7 also illustrates continuous blocks and distributed blocks, as well as the CACHE lines to which the elements of vectors with different starting points and different strides are mapped.
Fig. 8 gives the circuit realizing the LR mapping function for a 16-line (2^n = 2^4) CACHE system. By Formula 3 of the LR function, any row of the H matrix contains at most two 1s, and any row of the I matrix (identity matrix) contains at most one 1, so each bit of s is formed by the XOR of at most two bits of the tag field tag with one bit of the original line field l. As can be seen from the figure, this function needs only 6 XOR gates to realize, and its delay is only the delay of two levels of XOR gate circuits.
Inputs 7, 6, 5 and 4 in Fig. 5 and Fig. 8 correspond to the CACHE tag field tag in Fig. 2 above; inputs 3, 2, 1 and 0 in Fig. 5 and Fig. 8 correspond to the original CACHE line field l in Fig. 2; and the newly formed CACHE line field in Fig. 5 and Fig. 8 corresponds to the CACHE line number s in Fig. 2.
The hardware cost of the EE and LR mapping functions proposed here is of logarithmic order: if the CACHE capacity is N = 2^n lines, the hardware cost is O(log2 N) = O(n). Furthermore, for both the EE function and the LR function, the delay is of constant order, O(1), independent of the size of the CACHE. For a hardware realization, these properties are a very important advantage.
This part gives the precise definitions and formal proofs of the properties of the mapping functions EE and LR. In the following, we always assume that the CACHE bank contains N = 2^n lines and that each CACHE line stores one data element; when we speak of a matrix, we assume it is of size N × N; we assume that the address of the first element of the matrix is 0 and that the matrix is stored "row by row" (as compiled from the C language). If the matrix is stored "column by column" (as compiled from FORTRAN), it is easy to see from the proofs that the properties below still hold. In the proofs below, the rows and columns of every n × n matrix are numbered in the order n-1, n-2, ..., 1, 0; the top row is row n-1 and the leftmost column is column n-1, so the subscript of the top-left element of a matrix is (n-1, n-1) and the coordinate of the bottom-right element is (0, 0).
The properties of the EE function are proved below.
Theorem 1: Under the EE mapping, no two elements in any one row of the matrix produce a CACHE conflict; hence the N elements of any row of the matrix can reside simultaneously in a CACHE bank of N CACHE lines.
Proof: Let a_{u,v} and a_{x,y} be two elements of an arbitrary row of the matrix A, and let s1 and s2 be their CACHE line numbers. Since the two elements lie in the same row of the matrix, u = x and v ≠ y. Hence

s1 ⊕ s2 = ((R × u^T) ⊕ (I × v^T)) ⊕ ((R × x^T) ⊕ (I × y^T))
        = (R × (u ⊕ x)^T) ⊕ (I × (v ⊕ y)^T) = I × (v ⊕ y)^T

Since I is the identity matrix, it is nonsingular, so s1 ⊕ s2 is nonzero; that is, the two matrix elements are stored in different lines of the CACHE.
Theorem 2: Under the EE mapping, no two elements in any one column of the matrix produce a CACHE conflict; hence the N elements of any column of the matrix can reside simultaneously in a CACHE bank of N CACHE lines.
Noting that R is a nonsingular matrix, this property is easily proved in the same way as Theorem 1.
An outstanding advantage of the EE function is its guarantee of conflict-free access to data blocks of various shapes containing N elements. We now give the formal definition of data blocks and the corresponding proofs.
Definition 1: The continuous PQ block BLK(P, Q : k, l) of an N × N matrix X is defined as
BLK(P, Q : k, l) = { x_{k+a, l+b} | (0 ≤ a ≤ P-1) ∧ (0 ≤ b ≤ Q-1) }
where P = 2^p, Q = 2^q, p and q are integers, and p + q = n.
Definition 2: The main PQ block MBLK(P, Q : k, l) of an N × N matrix X is defined as a continuous PQ block satisfying (k MOD P = 0) ∧ (l MOD Q = 0).
Fig. 3, Fig. 4 get the bid and understand main PQ piece MBLK (4,4:0,0) and MBLK (2,8:2,8) in 16 * 16 the matrix.
In the matrix operation of more complicated, because matrix size is very big, for the cache systems that makes full use of in the hierarchical memory system structure improves computing velocity, the program designer usually wants to adopt partitioning of matrix algorithm to come the execution speed of accelerating application.In addition, a kind of application of very typical matrix-block access is the imaging filtering algorithm of widespread use in the image processing.Because the importance of partitioning of matrix algorithm in numerical operation and other application, generally parallel storage means is all listed it in very important parallel access pattern.But, what have significance more is that row address and column address are the main P * Q piece of starting point with 2 positive integer time power all, because in the partitioned matrix computing, it is that the matrix-block of 2 positive integer time power carries out piecemeal and calculates that the program designer is divided into length and width to matrix usually, and partitioned matrix at this moment is exactly the main PQ piece of matrix.
Theorem 3: Under the EE mapping, no two elements of any main PQ block of the matrix produce a CACHE conflict; hence the N elements of any main PQ block of the matrix can reside simultaneously in a CACHE memory bank with N CACHE rows.

Proof: Since the multiplications and additions used in generating the CACHE row number are all performed over GF(2), the row number can be rewritten as follows:

S_{u,v} = (R × u) ⊕ (I × v) = [R I] × [u; v] = C × l = ([C1 C3] × [l1; l3]) ⊕ ([C2 C4] × [l2; l4])

where:

matrix C1 consists of columns n-1 through p of R, l1 = (u_{n-1}, u_{n-2}, …, u_p)^τ;
matrix C2 consists of columns p-1 through 0 of R, l2 = (u_{p-1}, u_{p-2}, …, u_0)^τ;
matrix C3 consists of columns n-1 through q of I, l3 = (v_{n-1}, v_{n-2}, …, v_q)^τ;
matrix C4 consists of columns q-1 through 0 of I, l4 = (v_{q-1}, v_{q-2}, …, v_0)^τ.

Consider the matrix [C2 C4] formed from C2 and C4. Since C2 is formed from columns p-1 through 0 of R (p columns in all) and C4 from columns q-1 through 0 of I (q columns in all), and p + q = n, the matrix [C2 C4] is an n × n matrix, and it is easily seen to have rank n.

For any two elements x_{u,v} and x_{u',v'} in the same main PQ block of the N × N matrix X we have l1 = l1' and l3 = l3', while l2 = l2' and l4 = l4' cannot both hold. Hence

S_{u,v} ⊕ S_{u',v'} = (([C1 C3] × [l1; l3]) ⊕ ([C2 C4] × [l2; l4])) ⊕ (([C1 C3] × [l1'; l3']) ⊕ ([C2 C4] × [l2'; l4']))
= ([C1 C3] × ([l1; l3] ⊕ [l1'; l3'])) ⊕ ([C2 C4] × ([l2; l4] ⊕ [l2'; l4']))
= [C2 C4] × ([l2; l4] ⊕ [l2'; l4']) ≠ 0

that is, the CACHE row numbers mapped from any two elements differ. Therefore the EE function guarantees that any two elements of a main PQ block of the matrix are stored in different CACHE rows; that is, the N elements of a main PQ block of the matrix can all reside in the CACHE simultaneously.
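Theorem 3 can likewise be checked by enumeration. The sketch again assumes the illustrative bit-reversal choice of R (for which the matrix [C2 C4] is nonsingular for every split p + q = n); the patent's R is any matrix with that property.

```python
# Enumerate every main PQ block of a 16 x 16 matrix, for every split P = 2^p,
# Q = 2^q with p + q = n, and verify its N elements map to N distinct CACHE rows.
# Assumption: R is the bit-reversal permutation matrix (illustrative choice).
n, N = 4, 16

def rev(u):                         # R x u over GF(2) for bit-reversal R
    return int(format(u, f'0{n}b')[::-1], 2)

def cache_row(u, v):                # EE mapping
    return rev(u) ^ v

for p in range(n + 1):
    P, Q = 1 << p, 1 << (n - p)
    for k in range(0, N, P):            # k mod P == 0
        for l in range(0, N, Q):        # l mod Q == 0
            rows = {cache_row(k + a, l + b) for a in range(P) for b in range(Q)}
            assert len(rows) == N       # Theorem 3: MBLK(P,Q:k,l) conflict-free
print("all main PQ blocks conflict-free")
```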
Definition 3: The shifted PQ block SHBLK(P, Q: k, l) of an N × N matrix X is defined as a contiguous PQ block satisfying (k mod P = 0) ∨ (l mod Q = 0).

Theorem 4: Under the EE mapping, no two elements of any shifted PQ block of the matrix produce a CACHE conflict; hence the N elements of any shifted PQ block of the matrix can reside simultaneously in a CACHE memory bank with N CACHE rows.

Proof: Similar to the proof of Theorem 3 and omitted here.

A shifted PQ block of the matrix is obtained by displacing a main PQ block horizontally or vertically. Theorem 4 thus expresses a stronger property than Theorem 3: it guarantees not only that the N elements of a main block can reside simultaneously in a CACHE memory bank with N CACHE rows, but also that, as long as the main block is displaced in only one direction, horizontal or vertical, its N elements can still reside in the CACHE simultaneously.
Theorem 5: In a CACHE system whose associativity is 2 or greater and whose CACHE memory bank has N rows, if the EE function is adopted, then any contiguous data block of the N × N matrix containing N elements can reside in the CACHE memory bank simultaneously.

Proof: As shown in Fig. 9, take an arbitrary contiguous P × Q block mnop containing N elements; it can be regarded as a main block abcd that has been moved X elements in the horizontal direction and Y elements in the vertical direction. If moving the main block abcd X elements horizontally yields the shifted block efgh, then clearly mnop is obtained from efgh by moving it Y elements vertically. If moving the main block abcd Y elements vertically yields the shifted block ijkl, then clearly mnop is obtained from ijkl by moving it X elements horizontally. By Theorem 4 the CACHE row numbers corresponding to the N elements of efgh are free of conflicts, so the CACHE row numbers corresponding to any two elements within mngh are also free of conflicts. By the same reasoning, the CACHE row numbers corresponding to any two elements within hckp are free of conflicts.

Furthermore, cgok belongs to another main PQ block, so the CACHE row numbers corresponding to any two of its elements are also free of conflicts.

In addition, it follows from Theorem 3 that the elements of cgok have no CACHE row-number conflicts with the elements of jngc, nor with those of hckp.

In summary, a CACHE row-number conflict can only occur between an element of cgok and an element of mjch, or between an element of jngc and an element of hckp. Several conflicting pairs may exist at the same time, but each conflict is only ever one-to-one, so an associativity of 2 suffices. This proves the conclusion of Theorem 5.
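The one-to-one bound of the proof can be checked exhaustively for a small bank: under the same assumed bit-reversal R as in the earlier sketches, every contiguous P × Q block, aligned or not, maps at most two elements onto any single CACHE row, so an associativity of 2 is indeed sufficient.

```python
# Check Theorem 5: any contiguous block of N elements needs associativity <= 2.
# Assumption: R is the bit-reversal permutation matrix (illustrative choice).
from collections import Counter

n, N = 4, 16

def rev(u):
    return int(format(u, f'0{n}b')[::-1], 2)

def cache_row(u, v):
    return rev(u) ^ v

worst = 0
for p in range(n + 1):
    P, Q = 1 << p, 1 << (n - p)
    for k in range(N - P + 1):          # arbitrary (unaligned) block origin
        for l in range(N - Q + 1):
            counts = Counter(cache_row(k + a, l + b)
                             for a in range(P) for b in range(Q))
            worst = max(worst, max(counts.values()))
assert worst <= 2                       # conflicts are only ever one-to-one
print("max elements per CACHE row over all contiguous blocks:", worst)
```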
Because of the importance of block algorithms in applications such as scientific and engineering computation, numerical analysis, image processing, pattern recognition and signal processing, Theorem 5 expresses an extremely important and advantageous property of the EE function.

In some image processing algorithms, the access pattern formed by visiting the point occupying the same position in each successive block of the picture matrix is used; we call this pattern a "scattered block", with the following formal definition.

Definition 4: Let a and b be integers with 0 ≤ a ≤ P-1 and 0 ≤ b ≤ Q-1. The discrete PQ block SCBLK(P, Q: a, b) of an N × N matrix X is defined as the set of all elements x_{u,v} with u mod P = a and v mod Q = b.

This pattern is used mainly in image processing and pattern recognition algorithms.
Theorem 6: Under the EE mapping, no two elements of any discrete PQ block of the matrix produce a CACHE conflict; hence the N elements of any discrete PQ block of the matrix can reside simultaneously in a CACHE memory bank with N CACHE rows.

Proof: As in the proof of Theorem 3, rewrite the CACHE row expression of a matrix element as

S_{u,v} = (R × u) ⊕ (I × v) = ([C1 C3] × [l1; l3]) ⊕ ([C2 C4] × [l2; l4])

It is easy to prove that the matrix [C1 C3] has rank n.

For any two elements x_{u,v} and x_{u',v'} of a discrete P × Q block SCBLK(P, Q: a, b) of the N × N matrix we have l2 = l2' and l4 = l4', while l1 = l1' and l3 = l3' cannot both hold. Suppose that under the EE method x_{u,v} and x_{u',v'} are stored in CACHE rows S_{u,v} and S_{u',v'} respectively. Then

S_{u,v} ⊕ S_{u',v'} = (([C1 C3] × [l1; l3]) ⊕ ([C2 C4] × [l2; l4])) ⊕ (([C1 C3] × [l1'; l3']) ⊕ ([C2 C4] × [l2'; l4']))
= ([C1 C3] × ([l1; l3] ⊕ [l1'; l3'])) ⊕ ([C2 C4] × ([l2; l4] ⊕ [l2'; l4']))
= [C1 C3] × ([l1; l3] ⊕ [l1'; l3']) ≠ 0

that is, the CACHE rows of any two elements differ, and therefore the N elements of any discrete block can all reside in the CACHE simultaneously.
In image processing algorithms the patterns "partial row pair" and "partial column pair" are also used at times. They are defined as follows.

Definition 5: The following N-1 elements of an N × N matrix X are called a partial row pair PRP of X:

PRP(k) = { x_{k,0}, x_{k,1}, …, x_{k,k-1}, x_{N-1-k,0}, x_{N-1-k,1}, …, x_{N-1-k,N-2-k} },

where 0 ≤ k ≤ N-1.

Definition 6: The following N-1 elements of an N × N matrix X are called a partial column pair PCP of X:

PCP(k) = { x_{0,k}, x_{1,k}, …, x_{k-1,k}, x_{0,N-1-k}, x_{1,N-1-k}, …, x_{N-2-k,N-1-k} },

where 0 ≤ k ≤ N-1.
Lemma 1: For any u, v (0 ≤ u, v ≤ N-1),

S_{u,v} = S_{N-1-u,N-1-v},

where S_{u,v} (respectively S_{N-1-u,N-1-v}) is the CACHE row number of element x_{u,v} (respectively x_{N-1-u,N-1-v}) under the EE mapping.

Proof: Every bit of u (or v) is the one's complement of the corresponding bit of N-1-u (or N-1-v), so every bit of S_{u,v} is obtained from the corresponding bit of S_{N-1-u,N-1-v} by complementing it twice; the lemma follows.

Theorem 7: Under the EE mapping, the N-1 elements of any partial row pair of an N × N matrix X can reside simultaneously in the N rows of the CACHE memory bank.

Proof: By Definition 5, for any u, if x_{k,u} ∈ PRP(k) then x_{N-1-k,N-1-u} ∉ PRP(k), and vice versa. By Lemma 1, the CACHE row numbers occupied by the elements of the pair lying in matrix row k are exactly the CACHE row numbers occupied in matrix row N-1-k by the elements that do not belong to PRP(k). By Theorem 1, the N elements of any one matrix row have no CACHE row-number conflicts. Hence the N-1 elements of any partial row pair have no CACHE row-number conflicts.
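Both Lemma 1 and Theorem 7 can be verified directly. With the assumed bit-reversal R, complementing u complements every bit of R × u, so the two complements cancel in the XOR, exactly as the lemma's proof argues.

```python
# Check Lemma 1 and Theorem 7 for the EE mapping.
# Assumption: R is the bit-reversal permutation matrix (illustrative choice).
n, N = 4, 16

def rev(u):
    return int(format(u, f'0{n}b')[::-1], 2)

def cache_row(u, v):
    return rev(u) ^ v

# Lemma 1: S_{u,v} == S_{N-1-u, N-1-v}
assert all(cache_row(u, v) == cache_row(N - 1 - u, N - 1 - v)
           for u in range(N) for v in range(N))

# Theorem 7: the N-1 elements of each partial row pair occupy distinct rows.
for k in range(N):
    prp = [(k, j) for j in range(k)] + \
          [(N - 1 - k, j) for j in range(N - 1 - k)]
    assert len({cache_row(u, v) for u, v in prp}) == len(prp) == N - 1
print("Lemma 1 and Theorem 7 verified")
```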
Theorem 8: Under the EE mapping, the N-1 elements of any partial column pair of an N × N matrix X can reside simultaneously in N-1 rows of the CACHE memory bank.

The proof of Theorem 8 is similar to that of Theorem 7 and is omitted here.
The properties of the LR function are proved below.

Theorem 9: Under the LR mapping, no two elements in any one row of the matrix produce a CACHE conflict; hence the N elements of any row of the matrix can reside simultaneously in a CACHE memory bank with N CACHE rows.

The proof of this theorem is similar to that of Theorem 1 and is omitted here.

Theorem 10: Under the LR mapping, no two elements in any one column of the matrix produce a CACHE conflict; hence the N elements of any column of the matrix can reside simultaneously in a CACHE memory bank with N CACHE rows.

The proof of this theorem is similar to that of Theorem 2 and is omitted here.

Theorem 11: Under the LR mapping, no two elements of any main PQ block of the matrix produce a CACHE conflict; hence the N elements of any main PQ block of the matrix can reside simultaneously in a CACHE memory bank with N CACHE rows.

The proof of this theorem is similar to that of Theorem 3 and is omitted here.

Theorem 12: Under the LR mapping, no two elements of any discrete PQ block of the matrix produce a CACHE conflict; hence the N elements of any discrete PQ block of the matrix can reside simultaneously in a CACHE memory bank with N CACHE rows.

The proof of this theorem is similar to that of Theorem 6 and is omitted here.
Definition 7: The constant-stride vector SEQ(N, S: k, l) of an N × N matrix X is defined as

SEQ(N, S: k, l) = { x_{i,j} | (i × N + j = k × N + l + q × S) ∧ (q = 0, 1, …, N-1) }

where S, the stride of the vector, is a positive integer, and x_{k,l} is the first element of the vector.

Definition 8: The main constant-stride vector MSEQ(N, S: k, l) of an N × N matrix X is defined as a constant-stride vector satisfying the condition (k × N + l) mod (S × N) = 0.

Fig. 6 marks the main constant-stride vectors MSEQ(32, 4: 4, 0) and MSEQ(32, 16: 16, 0) in a 32 × 32 matrix for a parallel memory system with N = 32.

Theorem 13: Under the LR mapping, the N elements of any main constant-stride vector MSEQ(N, S: k, l) with stride S = 2^s can reside in the CACHE simultaneously.
Proof: For S = 2^s (i.e., when the stride is an integer power of 2) the CACHE row expression of an element can be rewritten as follows:

S_{u,v} = (H × u) ⊕ (I × v) = [H I] × [u; v] = C × l = ([C1 C4] × [l1; l4]) ⊕ ([C2 C3] × [l2; l3])

where:

matrix C1 consists of columns n-1 through s of H, l1 = (u_{n-1}, u_{n-2}, …, u_s)^τ;
matrix C2 consists of columns s-1 through 0 of H, l2 = (u_{s-1}, u_{s-2}, …, u_0)^τ;
matrix C3 consists of columns n-1 through s of I, l3 = (v_{n-1}, v_{n-2}, …, v_s)^τ;
matrix C4 consists of columns s-1 through 0 of I, l4 = (v_{s-1}, v_{s-2}, …, v_0)^τ.

It can be proved that the matrix [C2 C3] has rank n.

For any two elements x_{u,v} and x_{u',v'} of a main constant-stride vector MSEQ(N, S: k, l) with stride S = 2^s of an N × N matrix we have l1 = l1' and l4 = l4', while l2 = l2' and l3 = l3' cannot both hold. Hence

S_{u,v} ⊕ S_{u',v'} = (([C1 C4] × [l1; l4]) ⊕ ([C2 C3] × [l2; l3])) ⊕ (([C1 C4] × [l1'; l4']) ⊕ ([C2 C3] × [l2'; l3']))
= ([C1 C4] × ([l1; l4] ⊕ [l1'; l4'])) ⊕ ([C2 C3] × ([l2; l3] ⊕ [l2'; l3']))
= [C2 C3] × ([l2; l3] ⊕ [l2'; l3']) ≠ 0

Therefore, under the LR mapping, any two elements of a main constant-stride vector with stride S = 2^s of an N × N matrix are mapped to different CACHE rows; that is, the N elements of any main constant-stride vector with stride S = 2^s can all reside in the CACHE simultaneously.
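Theorem 13 can be checked with a concrete H. We assume H is the "suffix-parity" matrix over GF(2): bit i of H × u is the parity of bits i through n-1 of u. For this H every trailing s × s block is nonsingular, so [C2 C3] has rank n for every s. This choice is purely illustrative; the patent's actual H may differ.

```python
# Check Theorem 13 for the LR mapping S = (H x u) XOR (I x v).
# Assumption: H is the suffix-parity matrix over GF(2) (bit i of H x u is the
# parity of bits i..n-1 of u); the patent's actual H may differ.
n, N = 4, 16

def Hx(u):                          # H x u over GF(2)
    r = 0
    for i in range(n):
        r |= (bin(u >> i).count('1') & 1) << i
    return r

def cache_row(u, v):                # LR mapping
    return Hx(u) ^ v

for s in range(n + 1):
    stride = 1 << s
    # main constant-stride vectors start at linear addresses that are
    # multiples of stride * N (Definition 8)
    for start in range(0, N * N, stride * N):
        idxs = [start + q * stride for q in range(N)]
        rows = {cache_row(i // N, i % N) for i in idxs}
        assert len(rows) == N       # Theorem 13: no conflicts in MSEQ
print("all power-of-two-stride main vectors conflict-free")
```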
Definition 9: The shifted main constant-stride vector SHMSEQ(N, S: k, l) of an N × N matrix X is defined as a constant-stride vector satisfying the condition 0 ≤ (k × N + l) mod (S × N) ≤ S-1.

Fig. 7 marks the shifted main constant-stride vectors SHMSEQ(32, 2: 0, 1) and SHMSEQ(32, 8: 8, 2) in a 32 × 32 matrix for a parallel memory system with N = 32.

Theorem 14: Under the LR mapping, the N elements of any shifted main constant-stride vector SHMSEQ(N, S: k, l) with stride S = 2^s can reside in the CACHE simultaneously. The proof of this theorem is similar to that of Theorem 13 and is omitted here.

Theorem 15: Under the LR mapping, the N elements of any constant-stride vector SEQ(N, S: k, l) with stride S = 2^s can reside simultaneously in a CACHE whose associativity is 2 or greater.

The correctness of Theorem 15 follows from the idea used in the proof of Theorem 5 together with the results of Theorems 13 and 14; the proof is omitted here.

Parallel access to constant-stride vectors is of great significance in scientific computation and in the solution of engineering problems; in particular, the computation of the fast Fourier transform (FFT) uses constant-stride vectors whose strides are positive-integer powers of 2. The FFT is widely applied in many fields of scientific and engineering computation, such as image processing, digital signal processing and pattern recognition. Since the FFT algorithm iterates repeatedly with strides that are different integer powers of 2, while the number of rows of the CACHE memory bank is itself an integer power of 2, performing the CACHE address mapping in the conventional way usually produces a large number of CACHE conflicts and makes efficient computation difficult. Using the LR function for the CACHE mapping solves this problem effectively.
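The contrast with conventional modulo indexing can be made concrete. Below, the conventional direct-mapped index (linear address mod N) collapses a stride-N sweep, typical of an FFT butterfly stage, onto a single CACHE row, while an LR-style XOR mapping (with the same assumed suffix-parity H as in the earlier sketch) spreads it over all N rows.

```python
# Power-of-two strides vs. conventional modulo cache indexing: a stride-N
# access sweep (as in FFT butterfly stages) under both mappings.
# Assumption: H is the suffix-parity matrix; the patent's actual H may differ.
n, N = 4, 16

def Hx(u):                          # H x u over GF(2)
    r = 0
    for i in range(n):
        r |= (bin(u >> i).count('1') & 1) << i
    return r

addresses = [q * N for q in range(N)]            # stride N = 16 sweep

conventional = {a % N for a in addresses}        # direct modulo indexing
lr_mapped = {Hx(a // N) ^ (a % N) for a in addresses}

print(len(conventional), "distinct rows under modulo mapping")   # all collide
print(len(lr_mapped), "distinct rows under LR-style mapping")    # no conflicts
```

The modulo mapping uses 1 row out of 16 for this sweep; the XOR mapping uses all 16.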
Theorem 16: Under the LR mapping, when n is even, the N elements of the principal diagonal of an N × N matrix can reside in the CACHE simultaneously.

Proof: Let S_{u,u} and S_{v,v} be the CACHE row numbers of any two elements x_{u,u} and x_{v,v} of the principal diagonal of matrix X. Then

S_{u,u} ⊕ S_{v,v} = ((H × u) ⊕ (I × u)) ⊕ ((H × v) ⊕ (I × v))
= ((H ⊕ I) × u) ⊕ ((H ⊕ I) × v)
= (H ⊕ I) × (u ⊕ v)

If we can show that the rank of the matrix H ⊕ I is n, then u ≠ v implies (H ⊕ I) × (u ⊕ v) ≠ 0, and hence S_{u,u} ≠ S_{v,v}.

Let C = H ⊕ I, with c_{x,y} denoting the element of C at index (x, y); C can then be written out explicitly. [The two formula images of the source, giving C in general and for even n, are not reproduced here.]

When n is even, C is triangular with respect to its anti-diagonal and all of its anti-diagonal elements are 1, so the rank of C is n.
Theorem 17: Under the LR mapping, when n is even, the N elements of the anti-diagonal of an N × N matrix can reside in the CACHE simultaneously.

Proof: For any two elements x_{u,N-1-u} and x_{v,N-1-v} of the anti-diagonal of matrix X, suppose they are stored in CACHE rows S_{u,N-1-u} and S_{v,N-1-v} respectively. Then

S_{u,N-1-u} ⊕ S_{v,N-1-v} = ((H × u) ⊕ (I × (N-1-u))) ⊕ ((H × v) ⊕ (I × (N-1-v)))
= (H × (u ⊕ v)) ⊕ (I × ((N-1-u) ⊕ (N-1-v)))
= (H ⊕ I) × (u ⊕ v)

since (N-1-u) ⊕ (N-1-v) = u ⊕ v. From the proof of Theorem 16 we know that when n is even the rank of H ⊕ I is n, and u ≠ v, so the theorem follows.
From the above proofs it can be seen that in the present invention the bit-field mapping functions EE and LR transform the main-memory address of a datum into its storage address within the CACHE, thereby reducing CACHE data access conflicts in practical core algorithms and raising the effective memory access speed of the system, so as to improve the computing speed of the whole computing system. The present technique has been realized in the design of the second-level CACHE of an experimental system; even when the technique is applied only in the design of a two-level CACHE system, it can improve the running speed of some commonly used algorithms on the whole computer system by 30%-60%.
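A minimal software model of the XOR scattering mechanism described above can be sketched as follows. The field widths and line size are illustrative assumptions, not taken from the patent: the tag field and the line-number field of the address are fed through an array of XOR gates, and the gate outputs address both the tag bank and the data bank.

```python
# Sketch of the XOR address scattering mechanism: one XOR gate per line-number
# bit, fed by the tag field and the line-number field of the address.
# Field widths are assumed for illustration (16 lines, 4-byte lines).
LINE_BITS, OFFSET_BITS = 4, 2

def scattered_line(address):
    mask = (1 << LINE_BITS) - 1
    line = (address >> OFFSET_BITS) & mask            # CACHE line-number field
    tag = address >> (OFFSET_BITS + LINE_BITS)        # CACHE tag field
    return (tag & mask) ^ line                        # XOR gate array output

# A stride of 2**(OFFSET_BITS + LINE_BITS) bytes maps every access to line 0
# under conventional indexing; the scattered line numbers are all distinct.
lines = {scattered_line(a * 64) for a in range(16)}
assert len(lines) == 16
print("scattered lines:", sorted(lines))
```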

Claims (3)

1. An address mapping technique and apparatus in a CACHE system, characterized in that the improved CACHE (cache memory) system adopts an XOR address scattering mechanism to improve the distribution of internal-storage data among the rows of the CACHE data memory bank; the inputs of this address scattering mechanism are the CACHE tag field of the data address (or a partial field thereof) and the CACHE line-number field (or a partial field thereof), and the output of this address scattering mechanism is a CACHE line number (or a partial field thereof). The newly formed CACHE line number is used to indicate the actual access address (line number) of the CACHE data memory bank and of the CACHE tag memory bank.

2. The address mapping technique and apparatus in a CACHE system according to claim 1, characterized in that said address scattering mechanism is realized by the mapping function EE defined by the formula

S^τ = (R × i^τ) ⊕ (I × j^τ)

or by the mapping function LR defined by the formula

S^τ = (H × i^τ) ⊕ (I × j^τ)

where i and j in the formulas are the inputs of the address scattering mechanism and S is its output.

3. An address mapping technique and apparatus in a CACHE system, in which the apparatus is an XOR scattering mechanism comprising XOR gates and their interconnections, characterized in that one input of each XOR gate is connected to the line-number field of the CPU data address register and the other input to the tag field of the CPU data address register, and the outputs of the XOR gates are connected to the tag memory bank and to the data memory bank respectively.
CN97120245A 1997-11-06 1997-11-06 Address mapping technique and apparatus in high-speed buffer storage device Expired - Fee Related CN1081361C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN97120245A CN1081361C (en) 1997-11-06 1997-11-06 Address mapping technique and apparatus in high-speed buffer storage device


Publications (2)

Publication Number Publication Date
CN1217505A true CN1217505A (en) 1999-05-26
CN1081361C CN1081361C (en) 2002-03-20

Family

ID=5175833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN97120245A Expired - Fee Related CN1081361C (en) 1997-11-06 1997-11-06 Address mapping technique and apparatus in high-speed buffer storage device

Country Status (1)

Country Link
CN (1) CN1081361C (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102323909A (en) * 2011-09-13 2012-01-18 北京北大众志微系统科技有限责任公司 Memory management method and device for realizing application of high-capacity cache
CN102833769A (en) * 2012-09-20 2012-12-19 苏州坤里达信息科技有限公司 Efficient management method for interference matrix in wireless communication network system
CN103279430A (en) * 2012-07-27 2013-09-04 中南大学 Cache index mapping method and device in graphic processing unit
CN103678255A (en) * 2013-12-16 2014-03-26 合肥优软信息技术有限公司 FFT efficient parallel achieving optimizing method based on Loongson number three processor
CN105630698A (en) * 2014-10-28 2016-06-01 华为技术有限公司 Extension cache configuration method and device and extension cache
CN106687937A (en) * 2014-09-11 2017-05-17 高通股份有限公司 Cache bank spreading for compression algorithms

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5390308A (en) * 1992-04-15 1995-02-14 Rambus, Inc. Method and apparatus for address mapping of dynamic random access memory
US5392408A (en) * 1993-09-20 1995-02-21 Apple Computer, Inc. Address selective emulation routine pointer address mapping system
JPH08263376A (en) * 1995-03-22 1996-10-11 Nec Ibaraki Ltd Cache controller

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102323909A (en) * 2011-09-13 2012-01-18 北京北大众志微系统科技有限责任公司 Memory management method and device for realizing application of high-capacity cache
CN102323909B (en) * 2011-09-13 2014-03-19 北京北大众志微系统科技有限责任公司 Memory management method and device for realizing application of high-capacity cache
CN103279430A (en) * 2012-07-27 2013-09-04 中南大学 Cache index mapping method and device in graphic processing unit
CN103279430B (en) * 2012-07-27 2015-11-04 中南大学 Buffer memory index-mapping method in Graphics Processing Unit and device
CN102833769A (en) * 2012-09-20 2012-12-19 苏州坤里达信息科技有限公司 Efficient management method for interference matrix in wireless communication network system
CN103678255A (en) * 2013-12-16 2014-03-26 合肥优软信息技术有限公司 FFT efficient parallel achieving optimizing method based on Loongson number three processor
CN106687937A (en) * 2014-09-11 2017-05-17 高通股份有限公司 Cache bank spreading for compression algorithms
CN106687937B (en) * 2014-09-11 2020-06-23 高通股份有限公司 Cache bank expansion for compression algorithms
CN105630698A (en) * 2014-10-28 2016-06-01 华为技术有限公司 Extension cache configuration method and device and extension cache

Also Published As

Publication number Publication date
CN1081361C (en) 2002-03-20


Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: NINGBO ZHONGKE IC DESIGN CENTER LTD.

Free format text: FORMER OWNER: INST. OF COMPUTING TECHN. ACADEMIA SINICA

Effective date: 20050114

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20050114

Address after: 315040 building, 6 building, Ningbo science and Technology Park, Zhejiang Province

Patentee after: Ningbo Zhongke IC Design Center Co., Ltd.

Address before: 100083 mailbox 912, Beijing City

Patentee before: Institute of Computing Technology, Chinese Academy of Sciences

C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20020320

Termination date: 20091207