CN107577616A - Method and system for partitioning a last-level shared cache - Google Patents
Method and system for partitioning a last-level shared cache
- Publication number: CN107577616A (application CN201710791546.7A)
- Authority
- CN
- China
- Prior art keywords
- page
- cache
- processor core
- page-color count
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Memory System Of A Hierarchy Structure (AREA)
Abstract
The present application provides a method for partitioning a last-level shared cache. The method determines the best-fit cache size for each processor core at run time; from that size it derives the number of page colors to allocate to each core; from the best-fit size and the page-color count it computes the number of cache lines to allocate to each core; and it then partitions the last-level shared cache in descending order of each core's page-color count and cache-line count. By partitioning the last-level cache along the two dimensions of page colors and cache lines, the partitioning granularity is refined and scalability is improved. The application also provides a system for partitioning a last-level shared cache, which has the same benefits.
Description
Technical field
The present application relates to the server field, and in particular to a method and system for partitioning a last-level shared cache.
Background
Server chips today are mostly chip multiprocessors (CMPs): the Intel E5-2600 v4 has 22 cores, the Cavium ThunderX has 48, and so on. To guarantee the quality of service of applications running in parallel on different cores, along with related metrics such as bandwidth, measures are generally taken in system software or hardware to reduce interference between applications. Among these, partitioning the on-chip last-level shared cache is one of the most effective: it makes rational use of last-level cache resources and provides better quality of service for latency-sensitive programs. Existing strategies for partitioning a shared cache fall broadly into two kinds: partitioning by cache line and partitioning by cache set.
On-chip last-level shared caches are generally set-associative: the cache is divided into a number of sets, each containing the same number of cache lines. Caches are typically managed with a least-recently-used (LRU) policy, which can be described as three sub-policies:
1) Insertion policy: data accessed for the first time is inserted into the highest-priority line (MRU, most recently used) of the corresponding set;
2) Promotion policy: when a cache line in the cache is hit, it is promoted to the highest-priority (MRU) position of its set;
3) Replacement policy: when all lines of a set are filled with data and new data must be inserted, the line in the lowest-priority position is evicted from the cache.
Line-based partitioning divides the cache "vertically": a processor core's data may only be placed in a corresponding small subset of the cache lines, so data from different cores does not interfere. This is the most common way to mitigate interference between applications, but its scalability is poor and it is unsuited to scenarios with large numbers of programs running in parallel. For example, the Cavium ThunderX has 48 cores, but each set of its on-chip last-level cache has only 16 lines; likewise the Intel E5-2600 v4 has 22 cores, but its last-level cache allows only 20 distinct partitions.
The other strategy, set-based partitioning, divides the cache "horizontally": each processor core is allocated memory pages at particular addresses (page coloring), so that its data can only be placed in the corresponding cache sets, achieving the purpose of partitioning the cache. This strategy also has scalability limits. Take the 48-core Cavium ThunderX as an example: physical addresses are 48 bits and the page size is 64 KB, so the page offset is 16 bits and the page number is 32 bits. The cache line size is 128 B, each set has 16 lines, and the cache size is 16 MB, so the line offset is 7 bits and the set index is 13 bits. The page number and set index overlap in 4 bits, which are the color bits, so only 16 colors can be allocated. The number of colors could in fact be increased by shrinking the page size, but larger pages have proven more beneficial in practice. Moreover, if the page size is reduced to increase the number of colors, the number of pages per color shrinks, i.e. each processor core has fewer allocatable pages; this can leave no pages of a given color to allocate and cause memory allocation failures, even while pages of other colors sit idle in the system. In addition, the programs run by each core change constantly, so to allocate cache space reasonably the number of page colors assigned to a core must also grow and shrink. Adjusting a core's page-color count is an expensive process: reducing by one color, for example, requires the pages of that color held by the core to be migrated, and the corresponding TLB (translation lookaside buffer) entries and cache lines must be flushed and written back, with no small impact on system performance.
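The color-bit arithmetic quoted above can be sketched directly. The function below is illustrative (its name and structure are not from the patent); it reproduces the ThunderX figures: a 64 KB page, 128 B lines, a 16 MB 16-way cache, and hence 4 color bits, i.e. 16 colors.

```python
import math

def page_color_count(page_size, line_size, cache_size, ways):
    """Number of page colors = 2^(overlap between the page-number bits
    and the set-index bits of the physical address)."""
    page_offset_bits = int(math.log2(page_size))   # 64 KB page -> 16 bits
    line_offset_bits = int(math.log2(line_size))   # 128 B line -> 7 bits
    sets = cache_size // (line_size * ways)        # 16 MB / (128 B * 16) -> 8192 sets
    set_index_bits = int(math.log2(sets))          # -> 13 bits
    # The set-index field spans bits [line_offset, line_offset + set_index_bits);
    # the color bits are the part of that field above the page offset.
    color_bits = line_offset_bits + set_index_bits - page_offset_bits
    return 2 ** max(color_bits, 0)

# ThunderX-like geometry from the text: only 16 colors for 48 cores.
print(page_color_count(64 * 1024, 128, 16 * 1024 * 1024, 16))  # -> 16
```

Shrinking the page to 4 KB in the same geometry would yield 8 color bits (256 colors), illustrating the trade-off the background describes.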
Summary of the invention
The purpose of the present application is to provide a method and system for partitioning a last-level shared cache, solving the poor scalability of existing partitioning techniques.
To solve the above technical problem, the application provides a method for partitioning a last-level shared cache, whose technical solution is as follows:
determining the best-fit cache size for each processor core at run time;
determining, from the best-fit cache size, the number of page colors to allocate to each processor core;
computing, from the best-fit cache size and the page-color count, the number of cache lines to allocate to each processor core;
partitioning the last-level shared cache in descending order of each processor core's page-color count and cache-line count.
Determining the best-fit cache size for each processor core at run time includes:
combining different line counts with different page-color counts pairwise;
collecting, for each combination, the last-level cache misses per thousand instructions of each processor core;
determining each processor core's best-fit cache size from the misses per thousand instructions.
Determining, from the best-fit cache size, the number of page colors to allocate to each processor core includes:
judging whether the best-fit cache size M of the processor core satisfies M ≥ S/4 and, if so, setting the core's page-color count to K, where S is the last-level shared cache size of the processor and K is the processor's total number of page colors;
if not, determining from M the parameter n satisfying M ∈ [S/2^(n+1), S/2^n); the core's page-color count is then K/2^(n-1), where n ≥ 2 and K/2^(n-1) ≥ 2.
Partitioning the last-level shared cache in descending order of the processor cores' page-color counts and cache-line counts includes:
sorting the processor cores by page-color count in descending order;
sorting processor cores with equal page-color counts by cache-line count in descending order to obtain the final order;
partitioning the last-level shared cache according to the final order.
The method may further include:
when data is inserted, determining the insertion position of the data according to the cache-line count of the processor core.
The method may further include:
when a cache line is accessed, promoting the cache line's priority position with a preset probability.
The application also provides a system for partitioning a last-level shared cache, including:
a cache determining module for determining the best-fit cache size for each processor core at run time;
a page-color determining module for determining, from the best-fit cache size, the number of page colors to allocate to each processor core;
a cache-line determining module for computing, from the best-fit cache size and the page-color count, the number of cache lines to allocate to each processor core;
a partitioning module for partitioning the last-level shared cache in descending order of each processor core's page-color count and cache-line count.
The cache determining module includes:
a combination unit for combining different line counts with different page-color counts pairwise;
a collection unit for collecting, for each combination, the last-level cache misses per thousand instructions of each processor core;
a determining unit for determining each processor core's best-fit cache size from the misses per thousand instructions.
The page-color determining module includes:
a judging unit for judging whether the best-fit cache size M of the processor core satisfies M ≥ S/4 and, if so, setting the core's page-color count to K, where S is the last-level shared cache size of the processor and K is the processor's total number of page colors;
a traversal unit for, when M does not satisfy M ≥ S/4, determining from M the parameter n satisfying M ∈ [S/2^(n+1), S/2^n) and setting the core's page-color count to K/2^(n-1), where n ≥ 2 and K/2^(n-1) ≥ 2.
The partitioning module includes:
a first sorting unit for sorting the processor cores by page-color count in descending order;
a second sorting unit for sorting processor cores with equal page-color counts by cache-line count in descending order to obtain the final order;
a partitioning unit for partitioning the last-level shared cache according to the final order.
The method for partitioning a last-level shared cache provided by the application determines the best-fit cache size for each processor core at run time; determines, from that size, the number of page colors to allocate to each core; computes, from the best-fit size and the page-color count, the number of cache lines to allocate to each core; and partitions the last-level shared cache in descending order of each core's page-color count and cache-line count. By partitioning the last-level cache along the two dimensions of page colors and cache lines, the partitioning granularity is refined and scalability is improved. The application also provides a system for partitioning a last-level shared cache with the same benefits, which is not repeated here.
Brief description of the drawings
To explain the embodiments of the present application or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only embodiments of the application; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for partitioning a last-level shared cache provided by an embodiment of the application;
Fig. 2 is a schematic diagram of a cache provided by an embodiment of the application;
Fig. 3 is a schematic diagram of allocation to processor core P1 provided by an embodiment of the application;
Fig. 4 is a schematic diagram of allocation to processor core P2 provided by an embodiment of the application;
Fig. 5 is a schematic diagram of allocation to processor core P3 provided by an embodiment of the application;
Fig. 6 is a schematic diagram of a system for partitioning a last-level shared cache provided by an embodiment of the application.
Detailed description of the embodiments
To make the purpose, technical solutions and advantages of the embodiments of the application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the application.
Referring to Fig. 1, a flowchart of a method for partitioning a last-level shared cache provided by an embodiment of the application, the method may include the following steps:
S101: determine the best-fit cache size for each processor core at run time.
When a server runs, the cache required by the program on each processor core differs, so the best-fit cache size for each core must be determined. The application does not limit how the best-fit cache size of each core is obtained; one preferred method is provided here:
combine different line counts with different page-color counts pairwise; for each combination, collect the last-level cache misses per thousand instructions of each processor core; determine each core's best-fit cache size from the misses per thousand instructions.
Specifically, different line-count and page-color-count partitionings are applied to a processor core, e.g. line counts {1, 2, 4, 8, 16} and page-color counts {2, 4, 8, 12, 16} combined pairwise, and the last-level cache MPKI (misses per kilo-instructions) of the corresponding core under each combination is collected as the index for determining the core's suitable cache size. Other methods of determining a core's best-fit cache size are of course possible, and all fall within the scope of protection of the application.
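The MPKI sweep above can be sketched as follows. The patent only says that MPKI is collected for each (line count, page-color count) combination and used as the index; the selection rule below (smallest allocation within a tolerance of the best observed MPKI) and all names are assumptions for illustration.

```python
from itertools import product

LINE_OPTIONS  = [1, 2, 4, 8, 16]    # per-set line counts from the text
COLOR_OPTIONS = [2, 4, 8, 12, 16]   # page-color counts from the text

def best_fit_cache(mpki, tolerance=0.05):
    """Pick the smallest allocation whose MPKI is within `tolerance` of the
    best MPKI over all combinations. `mpki` maps (lines, colors) to the
    measured misses per thousand instructions for that combination."""
    best = min(mpki.values())
    candidates = [(l * c, l, c)
                  for l, c in product(LINE_OPTIONS, COLOR_OPTIONS)
                  if mpki[(l, c)] <= best * (1 + tolerance)]
    size, lines, colors = min(candidates)   # smallest size, ties broken arbitrarily
    return size, lines, colors
```

With a synthetic MPKI curve that flattens once the allocation reaches 64 lines, the sketch returns a best-fit size of 64 lines rather than the maximum configuration.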
S102: determine, from the best-fit cache size, the number of page colors to allocate to each processor core.
The page-color count can be computed from the best-fit cache size. The purpose of this step is to determine the page-color count, and any method that determines a core's page-color count falls within the scope of protection of the application. One preferred method of deriving the page-color count from the best-fit cache size is provided here:
Let the last-level shared cache size of the processor be S and the total number of page colors be K. Judge whether the best-fit cache size M of the core satisfies M ≥ S/4; if so, the core's page-color count is K. If not, determine the parameter n satisfying M ∈ [S/2^(n+1), S/2^n); the core's page-color count is then K/2^(n-1), where n ≥ 2 and K/2^(n-1) ≥ 2.
The benefit of this method is that it mitigates the performance cost of re-coloring a core's pages. Under this classification rule, when the cache size allocated to a core changes but remains within the same class, its allocated page colors are unchanged, avoiding re-coloring of the pages. Other methods of determining a core's page-color count are of course possible; the application is not limited to this one.
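The classification rule can be sketched directly, with the figures from the later embodiment (S = 64 lines, K = 8 colors, best-fit sizes 40, 12 and 10) as a check. The inequality direction follows that worked example (40 > 64/4 gives all 8 colors); the function name and loop structure are mine.

```python
def page_colors(m, s, k):
    """Page-color count for a core with best-fit cache size m, total LLC
    size s, and k total page colors (all sizes in cache lines).
    m >= s/4 gets all k colors; otherwise m in [s/2^(n+1), s/2^n) with
    n >= 2 gets k/2^(n-1) colors, bounded below by 2."""
    if m >= s / 4:
        return k
    n = 2
    while not (s / 2 ** (n + 1) <= m < s / 2 ** n):
        n += 1                       # scan n = 2, 3, ... for the matching class
    return max(k // 2 ** (n - 1), 2)

# Figures from the embodiment: s = 64 lines, k = 8 colors.
print(page_colors(40, 64, 8))  # P1 -> 8
print(page_colors(12, 64, 8))  # P2 -> 4
print(page_colors(10, 64, 8))  # P3 -> 4
```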
S103: compute, from the best-fit cache size and the page-color count, the number of cache lines to allocate to each processor core.
The computation here is the best-fit cache size divided by the page-color count, rounded down. For example, if the best-fit size is 22 cache lines and the page-color count is 4, the allocated line count is ⌊22/4⌋ = 5.
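This step is a single floor division; the sketch below (function name mine) reproduces the worked numbers from the text.

```python
def lines_per_set(best_fit, colors):
    """Cache lines allocated per set: the best-fit size (in cache lines)
    divided by the page-color count, rounded down."""
    return best_fit // colors

print(lines_per_set(22, 4))  # -> 5, the example above
print(lines_per_set(40, 8))  # P1 in the embodiment -> 5
print(lines_per_set(10, 4))  # P3 in the embodiment -> 2
```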
S104: partition the last-level shared cache in descending order of each processor core's page-color count and cache-line count.
Once each core's page-color count and cache-line count are obtained, the cores can be ordered in preparation for partitioning the last-level shared cache. Based on the allocated cache space, the cache is partitioned from largest to smallest. For each allocation, the contiguous cache sets with the largest free space are chosen first, so that the partitioned regions are adjacent to one another and cache space is not wasted. Because the page-color count is determined by the best-fit cache size, and the cache-line count by the best-fit size together with the page-color count, the page-color count takes priority over the cache-line count: a core with a larger best-fit cache size necessarily has a page-color count no smaller than that of a core with a smaller best-fit size, and cores with larger page-color counts are allocated memory before cores with smaller ones.
Specifically, the method can be: sort the processor cores by page-color count in descending order; sort cores with equal page-color counts by cache-line count in descending order to obtain the final order; partition the last-level shared cache according to the final order.
Alternatively, the last-level shared cache can be partitioned for the cores directly in descending order of best-fit cache size, with the page-color count and cache-line count serving as parameters of the partitioning.
It is worth noting that when the best-fit cache size is not an integer multiple of the page-color count, the remainder can be ignored: in the example above with 22 cache lines and a page-color count of 4, the allocated line count is 5, so the allocated cache is a 4 × 5 region and the remaining two cache lines are ignored. The advantage of this is that the last-level cache is used as fully as possible. Contiguous cache sets are of course chosen for a core's allocation wherever possible.
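The ordering rule above amounts to a descending sort on the (page-color count, cache-line count) pair; a minimal sketch, with names of my own choosing:

```python
def allocation_order(cores):
    """Order cores for partitioning: larger page-color count first,
    then larger cache-line count. `cores` maps name -> (colors, lines)."""
    return sorted(cores, key=lambda c: cores[c], reverse=True)

# P1..P3 from the embodiment: (colors, lines) = (8, 5), (4, 3), (4, 2).
order = allocation_order({"P1": (8, 5), "P2": (4, 3), "P3": (4, 2)})
print(order)  # -> ['P1', 'P2', 'P3']
```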
The embodiment of the application thus provides a method for partitioning a last-level shared cache in two dimensions. It solves both the problem that line-based partitioning in the prior art does not suit large numbers of parallel programs and the problem that set-based partitioning easily causes memory allocation failures. Partitioning the last-level shared cache along the two dimensions of cache sets and cache lines improves its utilization, effectively addresses the scalability of cache partitioning, and better reduces interference between applications.
Based on the above embodiment, the application also provides a management strategy for a cache set after the last-level cache has been partitioned, described in detail below:
Suppose the line partitioning of a cache set is {A1, A2, ..., An}, where Ai is the number of cache lines allocated to core i in this set; then Σᵢ Ai ≤ W, where W is the total number of cache lines in the set. The set management method can be described as three sub-policies:
Insertion policy: the number of cache lines allocated to a core determines the position at which its data is inserted. For example, if a core has been allocated 5 lines in a set containing 8 lines, its data is inserted at priority position 5 of that set.
Promotion policy: when a cache line is hit, it is promoted one priority position with a certain probability P. The probability P is a preset parameter that can be chosen manually; the method of determining it is not limited here. It can, for example, be determined from the associativity of the last-level cache together with saturation counters. Different promotion policies can of course also be set according to the kind of line hit (e.g. in a hybrid-memory host, whether the line belongs to DRAM or non-volatile memory) and the kind of access (demand access, write-back access, etc.).
Replacement policy: consistent with LRU, the data in the lowest-priority position is selected and evicted from the set.
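The three sub-policies can be sketched for one set as follows. This is a simplified model under my own conventions (position 0 is the lowest priority, the last position is MRU; class and method names are not from the patent), not the patent's implementation:

```python
import random

class PartitionedSet:
    """One cache set under the per-core policy described above: a core
    allocated `alloc` lines inserts new data at priority position `alloc`,
    a hit promotes the line one position with probability p, and
    replacement evicts the lowest-priority line."""
    def __init__(self, ways, p=0.75):
        self.lines = [None] * ways          # index i = priority position i + 1
        self.p = p

    def insert(self, tag, alloc):
        self.lines.pop(0)                   # replacement: evict lowest priority
        self.lines.insert(alloc - 1, tag)   # insertion at the core's position

    def hit(self, tag):
        i = self.lines.index(tag)
        if i + 1 < len(self.lines) and random.random() < self.p:
            # promotion: move up one position with probability p (3/4 in the text)
            self.lines[i], self.lines[i + 1] = self.lines[i + 1], self.lines[i]
```

With `ways=8` and `alloc=5` this reproduces the example above: the core's data enters at priority position 5 rather than at MRU.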
Based on the above embodiment, this embodiment provides a concrete method of partitioning the last-level cache, as follows:
As shown in Fig. 2, a cache contains 8 page-colored cache sets, each with 8 cache lines, for a total size of 64 cache lines. Three processor cores currently need cache partitions: core P1 is computed to need 40 cache lines, and cores P2 and P3 need 12 and 10 cache lines respectively. Since 40 > 64/4 = 16, P1's page-color count is 8. Since 64/2³ = 8 ≤ 12 < 16 = 64/2² and 64/2³ = 8 ≤ 10 < 16 = 64/2², i.e. n = 2, the corresponding page-color count for P2 and P3 is 8/2 = 4. Further, 40/8 = 5, 12/4 = 3 and ⌊10/4⌋ = 2, so the cache-line counts allocated to P1, P2 and P3 are 5, 3 and 2 respectively. P1, with the larger page-color count, is allocated first, followed by P2 and P3.
Figs. 3 to 5 show the allocation order of the processor cores and the allocation process. Before allocation, each set's partitioned line count is 0. Fig. 3 shows the state after allocating P1, with each set's allocated line count updated to 5. Fig. 4 shows the state after allocating P2: contiguous sets are selected and the corresponding allocated line count is updated to 8. Cache lines are similarly allocated to P3, as shown in Fig. 5.
As mentioned earlier, data belonging to P1 is inserted at priority position 5 of its set, and data of P2 and P3 is inserted at priority positions 3 and 2 respectively. On a hit, data is promoted one priority position forward with a certain probability P, which can be taken as 3/4. On replacement, the lowest-priority position is selected for eviction.
A system for partitioning a last-level shared cache provided by an embodiment of the application is introduced below; the system described below and the method described above may be referred to in correspondence with each other.
Referring to Fig. 6, a schematic diagram of a system for partitioning a last-level shared cache provided by an embodiment of the application, the system can include:
a cache determining module 100 for determining the best-fit cache size for each processor core at run time;
a page-color determining module 200 for determining, from the best-fit cache size, the number of page colors to allocate to each processor core;
a cache-line determining module 300 for computing, from the best-fit cache size and the page-color count, the number of cache lines to allocate to each processor core;
a partitioning module 400 for partitioning the last-level shared cache in descending order of each processor core's page-color count and cache-line count.
Based on the above embodiment, as a preferred embodiment, the cache determining module can include:
a combination unit for combining different line counts with different page-color counts pairwise;
a collection unit for collecting, for each combination, the last-level cache misses per thousand instructions of each processor core;
a determining unit for determining each processor core's best-fit cache size from the misses per thousand instructions.
Based on the above embodiment, as a preferred embodiment, the page-color determining module can include:
a judging unit for judging whether the best-fit cache size M of the processor core satisfies M ≥ S/4 and, if so, setting the core's page-color count to K, where S is the last-level shared cache size of the processor and K is the processor's total number of page colors;
a traversal unit for, when M does not satisfy M ≥ S/4, determining the parameter n satisfying M ∈ [S/2^(n+1), S/2^n) and setting the core's page-color count to K/2^(n-1), where n ≥ 2 and K/2^(n-1) ≥ 2.
Based on the above embodiment, as a preferred embodiment, the partitioning module can include:
a first sorting unit for sorting the processor cores by page-color count in descending order;
a second sorting unit for sorting processor cores with equal page-color counts by cache-line count in descending order to obtain the final order;
a partitioning unit for partitioning the last-level shared cache according to the final order.
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be referred to one another. As for the system provided by an embodiment, since it corresponds to the method provided by an embodiment, its description is relatively brief; for relevant details, see the description of the method.
The method and system for partitioning a last-level shared cache provided by the application have been described in detail above. Specific examples are used herein to set forth the principles and implementations of the application; the description of the above embodiments is only intended to help in understanding the method of the application and its core ideas. It should be pointed out that those of ordinary skill in the art can make improvements and modifications to the application without departing from its principles, and these improvements and modifications also fall within the scope of protection of the claims of the application.
It should also be noted that, in this specification, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between them. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.
Claims (10)
- 1. A method for partitioning a last-level shared cache, characterized by comprising: determining a best-fit cache size for each processor core at run time; determining, according to the best-fit cache size, the number of page colors to be allocated to each processor core; calculating, according to the best-fit cache size and the number of page colors, the number of cache lines to be allocated to each processor core; and partitioning the last-level shared cache in descending order of the number of page colors and the number of cache lines corresponding to each processor core.
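The claims rely on page coloring without spelling out the mechanism. As background, a page's "color" is the overlap between its physical page-frame-number bits and the cache's set-index bits, so that all pages of one color map to a fixed slice of the cache. The sketch below is not from the patent; the cache geometry (8 MiB, 16-way, 64 B lines, 4 KiB pages) is an illustrative assumption.

```python
# Background sketch of page coloring (assumed geometry, not the claimed method).
CACHE_SIZE = 8 * 1024 * 1024   # 8 MiB last-level cache
WAYS = 16                      # associativity
LINE = 64                      # cache line size in bytes
PAGE = 4096                    # OS page size in bytes

SETS = CACHE_SIZE // (WAYS * LINE)   # number of cache sets (8192 here)
SETS_PER_PAGE = PAGE // LINE         # consecutive sets covered by one page (64)
NUM_COLORS = SETS // SETS_PER_PAGE   # total page colors K (128 here)

def page_color(phys_addr):
    """Color = the set-index bits lying above the page offset."""
    return (phys_addr // PAGE) % NUM_COLORS
```

Restricting a core's allocator to a subset of colors confines that core's data to the corresponding subset of cache sets, which is what makes a software-only partition of the last-level shared cache possible.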
- 2. The method according to claim 1, characterized in that determining the best-fit cache size for each processor core at run time comprises: combining different cache-line counts and different page-color counts pairwise; collecting, for each combination, the misses per thousand instructions of the last-level cache of each processor core; and determining the best-fit cache size for each processor core at run time according to the misses per thousand instructions.
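The selection step of claim 2 can be sketched as follows: enumerate the (cache-line count, page-color count) pairs and pick the one with the lowest measured misses per thousand instructions (MPKI). The `mpki` table here is fabricated sample data used only to exercise the selection logic; in practice these numbers would come from hardware performance counters.

```python
# Hedged sketch of claim 2: pairwise combinations scored by measured MPKI.
from itertools import product

def best_combination(line_counts, color_counts, mpki):
    """mpki maps (lines, colors) -> measured misses per 1000 instructions."""
    combos = product(line_counts, color_counts)   # combine pairwise
    return min(combos, key=lambda c: mpki[c])     # fewest misses wins

# Fabricated measurements for one core.
mpki = {(4, 8): 3.1, (4, 16): 2.2, (8, 8): 2.4, (8, 16): 1.9}
assert best_combination([4, 8], [8, 16], mpki) == (8, 16)
```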
- 3. The method according to claim 1 or 2, characterized in that determining, according to the best-fit cache size, the number of page colors to be allocated to each processor core comprises: judging whether the best-fit cache size M of the processor core satisfies M ≥ S/4, and if so, setting the page-color count of the corresponding processor core to K, where S is the last-level shared cache size of the processor and K is the total number of page colors of the processor; if not, determining, according to the best-fit cache size M, the parameter n satisfying M ∈ [S/2^(n+1), S/2^n), and setting the page-color count of the corresponding processor core to K/2^(n-1), where n ≥ 2 and K/2^(n-1) ≥ 2.
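The mapping of claim 3 from a core's best-fit cache size M to its page-color count can be sketched directly: K colors when M ≥ S/4, otherwise K/2^(n-1) for the n ≥ 2 whose interval [S/2^(n+1), S/2^n) contains M. The concrete values of S and K below are illustrative assumptions.

```python
# Hedged sketch of claim 3's page-color assignment.
def page_colors(M, S, K):
    """Page colors for a core with best-fit cache size M; S = LLC size,
    K = total page colors."""
    if M >= S / 4:
        return K
    n = 2
    while M < S / 2 ** (n + 1):        # find n with S/2^(n+1) <= M < S/2^n
        n += 1
    count = K // 2 ** (n - 1)
    return count if count >= 2 else 2  # claim requires K/2^(n-1) >= 2

S, K = 8192, 128                       # assumed: 8 MiB LLC (in KiB), 128 colors
assert page_colors(4096, S, K) == 128  # M = S/2 >= S/4  -> all K colors
assert page_colors(1024, S, K) == 64   # M in [S/8, S/4) -> n=2 -> K/2
assert page_colors(512, S, K) == 32    # M in [S/16, S/8) -> n=3 -> K/4
```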
- 4. The method according to claim 3, characterized in that partitioning the last-level shared cache in descending order of the page-color count and the cache-line count corresponding to each processor core comprises: sorting the processor cores by page-color count from largest to smallest; further sorting processor cores with the same page-color count by cache-line count from largest to smallest to obtain a final order; and partitioning the last-level shared cache according to the final order.
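The two-level ordering of claim 4 reduces to a sort with a compound key: page-color count descending, ties broken by cache-line count descending. The core names and allocation numbers below are illustrative.

```python
# Hedged sketch of claim 4's final ordering before the cache is carved up.
def division_order(alloc):
    """alloc: core -> (page_colors, cache_lines); returns cores in the
    descending order in which the LLC is partitioned."""
    return sorted(alloc, key=lambda c: (-alloc[c][0], -alloc[c][1]))

alloc = {"core0": (64, 8), "core1": (128, 4), "core2": (64, 12)}
# core1 has the most colors; core2 beats core0 on cache lines at equal colors.
assert division_order(alloc) == ["core1", "core2", "core0"]
```

Sorting with negated keys keeps the sort stable, so cores tied on both counts retain their original relative order.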
- 5. The method according to claim 4, characterized by further comprising: when data is to be inserted, determining the insertion position of the data according to the cache-line count of the processor core.
- 6. The method according to claim 5, characterized by further comprising: when a cache line is accessed, promoting the priority position of the cache line with a preset probability.
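Claims 5 and 6 together describe an insertion/promotion policy in the style of BIP/RRIP replacement schemes: a new block's starting position within a set is bounded by its core's cache-line allocation, and a hit promotes the line toward the most-recently-used end only with a preset probability. The list model of a set and the probability value below are illustrative assumptions, not details taken from the patent.

```python
# Hedged sketch of claims 5-6; index 0 models the MRU end of a set.
import random

PROMOTE_PROB = 0.5          # assumed preset probability

def insert(cache_set, block, core_lines):
    """Claim 5: insertion position bounded by the core's line allocation."""
    pos = min(core_lines - 1, len(cache_set))
    cache_set.insert(pos, block)

def on_access(cache_set, block, rng=random.random):
    """Claim 6: probabilistically promote an accessed line toward MRU."""
    i = cache_set.index(block)
    if i > 0 and rng() < PROMOTE_PROB:
        cache_set.insert(i - 1, cache_set.pop(i))

s = []
insert(s, "a", core_lines=4)          # empty set -> position 0
insert(s, "b", core_lines=2)          # position min(1, 1) = 1
assert s == ["a", "b"]
on_access(s, "b", rng=lambda: 0.0)    # rng below threshold -> promotion
assert s == ["b", "a"]
```

Promoting only probabilistically keeps streaming data from evicting a core's resident working set, which complements the spatial partition established by the earlier claims.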
- 7. A system for partitioning a last-level shared cache, characterized by comprising: a cache determining module, configured to determine a best-fit cache size for each processor core at run time; a page-color determining module, configured to determine, according to the best-fit cache size, the number of page colors to be allocated to each processor core; a cache-line determining module, configured to calculate, according to the best-fit cache size and the number of page colors, the number of cache lines to be allocated to each processor core; and a partitioning module, configured to partition the last-level shared cache in descending order of the page-color count and the cache-line count corresponding to each processor core.
- 8. The system according to claim 7, characterized in that the cache determining module comprises: a combining unit, configured to combine different cache-line counts and different page-color counts pairwise; a collecting unit, configured to collect, for each combination, the misses per thousand instructions of the last-level cache of each processor core; and a determining unit, configured to determine the best-fit cache size for each processor core at run time according to the misses per thousand instructions.
- 9. The system according to claim 7 or 8, characterized in that the page-color determining module comprises: a judging unit, configured to judge whether the best-fit cache size M of the processor core satisfies M ≥ S/4, and if so, set the page-color count of the corresponding processor core to K, where S is the last-level shared cache size of the processor and K is the total number of page colors of the processor; and a traversal unit, configured to, when the best-fit cache size M of the processor core does not satisfy M ≥ S/4, determine, according to the best-fit cache size M, the parameter n satisfying M ∈ [S/2^(n+1), S/2^n), and set the page-color count of the corresponding processor core to K/2^(n-1), where n ≥ 2 and K/2^(n-1) ≥ 2.
- 10. The system according to claim 9, characterized in that the partitioning module comprises: a first sorting unit, configured to sort the processor cores by page-color count from largest to smallest; a second sorting unit, configured to further sort processor cores with the same page-color count by cache-line count from largest to smallest to obtain a final order; and a partitioning unit, configured to partition the last-level shared cache according to the final order.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710791546.7A CN107577616B (en) | 2017-09-05 | 2017-09-05 | Method and system for dividing last-level shared cache |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107577616A true CN107577616A (en) | 2018-01-12 |
CN107577616B CN107577616B (en) | 2020-09-18 |
Family
ID=61029865
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710791546.7A Active CN107577616B (en) | 2017-09-05 | 2017-09-05 | Method and system for dividing last-level shared cache |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107577616B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110188026A (en) * | 2019-05-31 | 2019-08-30 | 龙芯中科技术有限公司 | The determination method and device of fast table default parameters |
CN110688072A (en) * | 2019-09-30 | 2020-01-14 | 上海兆芯集成电路有限公司 | Cache system and operation method thereof |
CN111258927A (en) * | 2019-11-13 | 2020-06-09 | 北京大学 | Application program CPU last-level cache miss rate curve prediction method based on sampling |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102521150A (en) * | 2011-11-28 | 2012-06-27 | 华为技术有限公司 | Application program cache distribution method and device |
CN103077128A (en) * | 2012-12-29 | 2013-05-01 | 华中科技大学 | Method for dynamically partitioning shared cache in multi-core environment |
US20160170890A1 (en) * | 2013-11-01 | 2016-06-16 | Cisco Technology, Inc. | Bounded cache searches |
Non-Patent Citations (1)
Title |
---|
ZHANG Ludan et al.: "Page-Coloring-Based Dynamic Partitioning of Shared Cache on Multi-core Processors", Chinese Journal of Computers * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110188026A (en) * | 2019-05-31 | 2019-08-30 | 龙芯中科技术有限公司 | The determination method and device of fast table default parameters |
CN110188026B (en) * | 2019-05-31 | 2023-05-12 | 龙芯中科技术股份有限公司 | Method and device for determining missing parameters of fast table |
CN110688072A (en) * | 2019-09-30 | 2020-01-14 | 上海兆芯集成电路有限公司 | Cache system and operation method thereof |
CN111258927A (en) * | 2019-11-13 | 2020-06-09 | 北京大学 | Application program CPU last-level cache miss rate curve prediction method based on sampling |
Also Published As
Publication number | Publication date |
---|---|
CN107577616B (en) | 2020-09-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7899994B2 (en) | Providing quality of service (QoS) for cache architectures using priority information | |
Jeannot et al. | Near-optimal placement of MPI processes on hierarchical NUMA architectures | |
CN105491138B (en) | Distributed load scheduling method based on load rate graded triggering | |
CN107577616A (en) | Method and system for partitioning a last-level shared cache | |
CN108845960B (en) | Memory resource optimization method and device | |
US20180373635A1 (en) | Managing cache partitions based on cache usage information | |
CN103455443B (en) | Buffer management method and device | |
CN105391654A (en) | Account activeness-based system resource allocation method and device | |
CN103023963B (en) | A kind of method for cloud storage resources configuration optimization | |
Liu et al. | Going vertical in memory management: Handling multiplicity by multi-policy | |
CN109582600B (en) | Data processing method and device | |
CN102663115B (en) | Main memory database access optimization method on basis of page coloring technology | |
US20100250890A1 (en) | Managing working set use of a cache via page coloring | |
CN102567077B (en) | Virtualized resource distribution method based on game theory | |
US20130191605A1 (en) | Managing addressable memory in heterogeneous multicore processors | |
CN107870871B (en) | Method and device for allocating cache | |
CN104572501B (en) | Access trace locality analysis-based shared buffer optimization method in multi-core environment | |
CN112148665A (en) | Cache allocation method and device | |
DE112016004367T5 (en) | Technologies for automatic processor core allocation management and communication using direct data placement in private buffers | |
CN107729267A (en) | The scattered distribution of resource and the interconnection structure for support by multiple engine execute instruction sequences | |
CN104346404A (en) | Method, equipment and system for accessing data | |
CN106126434B (en) | The replacement method and its device of the cache lines of the buffer area of central processing unit | |
CN112540934B (en) | Method and system for ensuring service quality when multiple delay key programs are executed together | |
US20170160959A1 (en) | Computer memory management method and system | |
CN110308965A (en) | The rule-based heuristic virtual machine distribution method and system of cloud data center |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right |
Effective date of registration: 20200821 Address after: 215100 No. 1 Guanpu Road, Guoxiang Street, Wuzhong Economic Development Zone, Suzhou City, Jiangsu Province Applicant after: SUZHOU LANGCHAO INTELLIGENT TECHNOLOGY Co.,Ltd. Address before: 450018 Henan province Zheng Dong New District of Zhengzhou City Xinyi Road No. 278 16 floor room 1601 Applicant before: ZHENGZHOU YUNHAI INFORMATION TECHNOLOGY Co.,Ltd. |
|
TA01 | Transfer of patent application right | ||
GR01 | Patent grant | ||
GR01 | Patent grant |