CN104252392A - Method for accessing data cache and processor - Google Patents

Method for accessing data cache and processor

Info

Publication number
CN104252392A
Authority
CN
China
Prior art keywords
data
data buffer
thread
buffer memory
shared
Prior art date
Legal status
Granted
Application number
CN201310269618.3A
Other languages
Chinese (zh)
Other versions
CN104252392B (en
Inventor
徐远超
范东睿
张浩
叶笑春
Current Assignee
Huawei Technologies Co Ltd
Institute of Computing Technology of CAS
Original Assignee
Huawei Technologies Co Ltd
Institute of Computing Technology of CAS
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd, Institute of Computing Technology of CAS filed Critical Huawei Technologies Co Ltd
Priority to CN201310269618.3A priority Critical patent/CN104252392B/en
Priority to PCT/CN2014/080063 priority patent/WO2014206218A1/en
Publication of CN104252392A publication Critical patent/CN104252392A/en
Application granted granted Critical
Publication of CN104252392B publication Critical patent/CN104252392B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0842Multiuser, multiprocessor or multiprocessing cache systems for multiprocessing or multitasking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

Embodiments of the invention provide a method for accessing a data cache, and a processor, and relate to the field of computers. The method and the processor can narrow the range of a data search, reduce access latency, and improve system performance. The data cache of the processor is a level-1 cache comprising a private data cache and a shared data cache; the private data cache comprises multiple private caches and stores the private data of threads, while the shared data cache stores the data shared among the threads. When data in the data cache of the processor are accessed, the data type of the data is determined according to a flag bit appended to the physical address corresponding to the data, the data types comprising private data and shared data; the thread corresponding to the data is determined from the access, and the data cache corresponding to that thread is then accessed according to the thread and the data type, so as to obtain the data from that data cache. The embodiments of the invention are used for distinguishing between, and accessing, the data caches.

Description

Method for accessing a data cache, and processor
Technical field
The present invention relates to the field of computers, and in particular to a method for accessing a data cache, and a processor.
Background art
After processors entered the multi-core era, memory access has been the persistent bottleneck of system performance: the performance of the memory system grows far more slowly than processor performance, and memory-access speed severely limits computation speed. Current multi-core caches typically use a multi-level hierarchy in which the L1 cache is private and the other levels are shared.
A multi-core processor provides greater parallel computing capability and can run multiple program loads simultaneously, but programs running concurrently on cores that share a cache interfere with one another's performance. This is mainly because the data of one program replaces, in the shared cache, data belonging to another: when the replaced data are reused they must be fetched from memory again, which adds memory-access latency and consumes memory bandwidth, lowers resource utilization, and makes program performance hard to predict. The problem is especially pronounced when a memory-intensive streaming application with a low reuse rate runs together with a program that is not memory-intensive but has a high reuse rate.
The cache therefore needs to be managed sensibly. In the prior art, one implementation divides the shared cache into multiple portions, each associated with an entity, normally the smallest scheduling entity of the operating system, such as a thread. This division, however, does not consider the possibility of data being shared between threads: if shared data exist but there is no shared cache, the data shared between threads end up with multiple copies in the private caches, which requires more cache space and requires maintaining cache coherence among the copies. Another implementation partitions the cache by page coloring, but this limits the physical memory space each thread can use. Page coloring achieves good cache isolation between independent processes, but in the multithreaded programs of streaming-data applications the threads share a great deal of data, which makes complete isolation disadvantageous. In other words, page coloring suits cache isolation between processes, and is less suitable for cache isolation between multiple threads within the same process.
Summary of the invention
Embodiments of the invention provide a method for accessing a data cache, and a processor, which can narrow the range of a data search, reduce access latency, and improve system performance.
To achieve the above objective, the embodiments of the invention adopt the following technical solutions:
According to a first aspect, a processor is provided, comprising a program counter, a register file, an instruction prefetch unit, an instruction decode unit, an instruction issue unit, an address generation unit, an arithmetic logic unit, a shared floating-point unit, a shared instruction cache and an internal bus, and further comprising:
a data cache, where the data cache is a level-1 cache comprising a private data cache and a shared data cache; the private data cache comprises multiple private caches, each private cache stores the private data of a thread, and the shared data cache stores the data shared among the threads.
With reference to the first aspect, in a first possible implementation of the first aspect, the processor has a simultaneous multithreading (SMT) architecture, the private caches are in one-to-one correspondence with hardware threads, and all hardware threads share the shared data cache.
According to a second aspect, a method for accessing a data cache is provided, comprising:
when data in the data cache of a processor are accessed, determining the data type of the data according to a flag bit appended to the physical address corresponding to the data, the data types comprising private data and shared data;
determining, from the accessed data, the thread corresponding to the data, and then accessing, according to the thread and the data type, the data cache corresponding to the thread so as to obtain the data, where the data cache is a private data cache or a shared data cache.
With reference to the second aspect, in a first possible implementation of the second aspect, the method further comprises:
if the data are not present in the private data cache, accessing main memory and backfilling the cache line containing the data, obtained from main memory, into the private data cache corresponding to the thread;
if the data are not present in the shared data cache, accessing main memory and backfilling the cache line containing the data, obtained from main memory, into the shared data cache.
With reference to the first possible implementation of the second aspect, in a second possible implementation, determining the data type of the data according to the flag bit appended to the physical address corresponding to the data comprises:
if the flag bit in the physical address is a first flag, determining that the data type of the data is private data;
if the flag bit in the physical address is a second flag, determining that the data type of the data is shared data.
With reference to the second possible implementation of the second aspect, in a third possible implementation, accessing, according to the thread and the data type, the data cache corresponding to the thread comprises:
if the data type is private data, accessing the private data cache corresponding to the thread;
if the data type is shared data, accessing the shared data cache.
With reference to the third possible implementation of the second aspect, in a fourth possible implementation, the private data cache comprises multiple private caches; the private data cache stores the private data of the threads, and the shared data cache stores the data shared among the threads;
where the private caches are in one-to-one correspondence with hardware threads, and all hardware threads share the shared data cache.
Embodiments of the present invention provide a method for accessing a data cache, and a processor. The processor comprises a program counter, a register file, an instruction prefetch unit, an instruction decode unit, an instruction issue unit, an address generation unit, an arithmetic logic unit, a shared floating-point unit, a shared instruction cache and an internal bus, and further comprises a data cache. The data cache is a level-1 cache comprising a private data cache and a shared data cache; the private data cache comprises multiple private caches that store the private data of threads, and the shared data cache stores the data shared among the threads. When data in the data cache of the processor are accessed, the data type of the data is determined according to a flag bit appended to the corresponding physical address, the data types comprising private data and shared data; the thread corresponding to the data is determined from the access, and the data cache corresponding to that thread, either a private data cache or the shared data cache, is accessed according to the thread and the data type so as to obtain the data. This narrows the range of a data search, reduces access latency, and improves system performance.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show merely some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of a processor according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a cache partition according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of a method for accessing a data cache according to an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are merely some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
An embodiment of the present invention provides a processor 01 which, as shown in Fig. 1, comprises a program counter 011, a register file 012, an instruction prefetch unit 013, an instruction decode unit 014, an instruction issue unit 015, an address generation unit 016, an arithmetic logic unit 017, a shared floating-point unit 018, a shared instruction cache 019 and an internal bus, and further comprises:
a data cache 021, where the data cache 021 is a level-1 cache comprising a private data cache 0211 and a shared data cache 0212; the private data cache 0211 comprises multiple private caches 0211a and stores the private data of threads, and the shared data cache 0212 stores the data shared among the threads.
The processor 01 has a simultaneous multithreading (SMT) architecture: the private caches are in one-to-one correspondence with hardware threads, and all hardware threads share the shared data cache 0212. An SMT architecture allows instructions of multiple threads to be issued to the functional units within one clock cycle, improving the utilization of those units. A private cache is used by a single user, whereas a shared cache is used jointly by multiple users.
There are 16 program counters (PCs), PC0 to PC15; within a processor core, the number of logical processor cores (hardware threads) equals the number of PCs.
Each logical processor core in a processor core corresponds to one general register file (GRF), so the GRFs are equal in number to the PCs.
The instruction prefetch unit (Fetch) fetches instructions; the instruction decode unit (Decoder) decodes them; the instruction issue unit (Issue) issues them. The address generation unit (AGU) is the module that performs all address computation and generates the addresses used to access memory. The arithmetic logic unit (ALU) is the execution unit of the central processing unit (CPU) and can be built from AND gates and OR gates. The shared floating-point unit (Shared Float Point Unit) is the circuit unit dedicated to floating-point arithmetic in the processor; the shared instruction cache stores instructions; and the internal bus connects the components of the processor.
The data cache 021 is the level-1 cache (L1 cache) of the processor 01, and this L1 cache comprises the private data cache 0211 and the shared data cache 0212.
The private data cache 0211 comprises multiple independent private caches (D-caches) 0211a that store the private data of each hardware thread, while the shared data cache 0212 stores the data shared among the threads.
The private caches and the shared cache sit at the same level, L1. When the CPU fills data into the cache, the private data of a thread are stored in its private cache, and the data shared among threads are stored in the shared cache.
A person skilled in the art will understand that existing multi-core caches generally use a multi-level hierarchy in which the L1 cache is private and the other levels, such as L2 and L3, are shared. The present invention does not adopt such a multi-level hierarchy; only the L1 cache is retained. In this way each hardware thread has its own private cache, and all hardware threads share the shared cache.
For example, the processor 01 may be a many-core processor in which each processor core has an SMT architecture and the data cache 021 is a component of each core. The hardware implementation of this cache may be as shown in Fig. 2: inside a processor core with an SMT architecture there are multiple hardware threads, each hardware thread corresponds to one private cache 0211a, and all hardware threads share one shared data cache 0212. The private caches and the shared data cache belong to the same level.
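The cache organization described above can be sketched as a data structure. This is a minimal illustration only: the thread count of 16 mirrors the 16 PCs mentioned earlier, but the cache and line sizes are invented for the example and do not come from the patent.

```c
#include <stddef.h>

#define NUM_HW_THREADS    16   /* mirrors the 16 PCs described above */
#define PRIV_CACHE_LINES  64   /* illustrative sizes, not from the patent */
#define SHARED_CACHE_LINES 256
#define LINE_SIZE         64

/* One cache line. */
typedef struct {
    unsigned long tag;
    int valid;
    unsigned char data[LINE_SIZE];
} cache_line_t;

/* Per-thread private data cache (one D-cache 0211a). */
typedef struct {
    cache_line_t lines[PRIV_CACHE_LINES];
} private_dcache_t;

/* The L1 data cache of one SMT core: one private D-cache per hardware
 * thread, plus a single shared data cache at the same level. */
typedef struct {
    private_dcache_t priv[NUM_HW_THREADS];   /* one per hardware thread */
    cache_line_t shared[SHARED_CACHE_LINES]; /* shared by all threads   */
} l1_dcache_t;
```

The key structural point the sketch captures is that the private caches and the shared cache are peers within a single L1 structure, not different levels of a hierarchy.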
Thus, an embodiment of the present invention provides a processor comprising a program counter, a register file, an instruction prefetch unit, an instruction decode unit, an instruction issue unit, an address generation unit, an arithmetic logic unit, a shared floating-point unit, a shared instruction cache and an internal bus, and further comprising a data cache. The data cache is a level-1 cache comprising a private data cache and a shared data cache at the same level; the private data cache comprises multiple private caches that store the private data of threads, and the shared data cache stores the data shared among the threads. In this way the range of a data search can be narrowed, access latency reduced, and system performance improved.
An embodiment of the present invention provides a method for accessing cached data which, as shown in Fig. 3, comprises:
101. When data in the data cache of the processor are accessed, the processor determines the data type of the data according to a flag bit appended to the physical address corresponding to the data, the data types comprising private data and shared data.
Exemplarily, the data type of the data can be determined by modifying the page table entry (PTE) flags in the operating system. With the paging mechanism supported through compilation, thread-private data are stored in page frames exclusive to each thread, and data shared by the threads belonging to one process are stored in page frames shared by those threads.
Specifically, the operating system allocates memory in units of pages and writes the base address of a page frame into a page table entry. To indicate whether the page frame pointed to is a private region or a shared region, a one-bit flag is defined in the reserved bits of the page table entry; this flag distinguishes whether the physical page frame corresponding to the entry is a private region. Exemplarily, the flag is set to 1 for a private region and to 0 for a shared region; the flag values for the private and shared regions are not limited here. As shown in Table 1, taking the page table entry structure for a 4 KB page as an example, a one-bit flag can be defined among bits 9 to 14 to distinguish whether the physical page frame corresponding to the entry is a private region or a shared region.
Table 1: Page table entry structure for a 4 KB page
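The flag described above can be sketched as follows. This is an illustration, not the patent's implementation: the text only says the flag lives somewhere in bits 9 to 14 of the PTE, so bit 9 is an arbitrary choice here, and the helper names are invented. The encoding (1 = private, 0 = shared) follows the example in the text.

```c
#include <stdint.h>

/* The patent reserves one bit among PTE bits 9-14 as share_flag;
 * bit 9 is chosen here purely for illustration.
 * 1 = private region, 0 = shared region, per the example in the text. */
#define PTE_SHARE_FLAG_SHIFT 9
#define PTE_SHARE_FLAG_MASK  (UINT64_C(1) << PTE_SHARE_FLAG_SHIFT)

/* Mark the page frame behind this PTE as thread-private. */
static uint64_t pte_mark_private(uint64_t pte) {
    return pte | PTE_SHARE_FLAG_MASK;
}

/* Mark the page frame behind this PTE as shared between threads. */
static uint64_t pte_mark_shared(uint64_t pte) {
    return pte & ~PTE_SHARE_FLAG_MASK;
}

/* Read the flag back: nonzero means private. */
static int pte_is_private(uint64_t pte) {
    return (pte & PTE_SHARE_FLAG_MASK) != 0;
}
```

On a real OS this bit would have to be one the hardware ignores (software-available), which is why the patent confines it to the PTE's reserved bits.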
The CPU uses a virtual address when accessing cached data. It first looks up the translation lookaside buffer (TLB), which caches virtual-to-physical address translations, to obtain the physical address for the virtual address. If the TLB does not contain the virtual address, the paging process is entered to obtain the physical address together with the share_flag bit stored in the page table entry, and the flag is placed in the physical address held in the TLB entry. If the TLB does contain the virtual address, the physical address and the share_flag are read directly from the TLB entry, and the flag is appended to the physical address. Table 2 shows the composition of the physical address: share_flag, tag, set index, block offset and byte offset.
Table 2: Composition of the physical address

| share_flag | tag | set index | block offset | byte offset |
In this way, by defining a flag in a certain reserved bit of the page table entry and passing that flag to the CPU as an additional bit of the physical address, the CPU can determine from the appended flag the data type of the data to be accessed. If the flag bit in the physical address is a first flag, the data type of the data is determined to be private data; if it is a second flag, the data type is determined to be shared data. For example, the first flag may be 1 and the second flag 0.
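Decomposing such a physical address can be sketched as below. The field widths are assumptions invented for the example — the patent gives the field order in Table 2 but not the widths — and the function names are illustrative.

```c
#include <stdint.h>

/* Illustrative field widths for the Table 2 layout
 * (share_flag | tag | set index | block offset | byte offset);
 * the actual widths are not specified in the patent. */
#define BYTE_OFF_BITS   3
#define BLOCK_OFF_BITS  3
#define SET_IDX_BITS    6
#define TAG_BITS        20
#define SHARE_FLAG_BIT  (BYTE_OFF_BITS + BLOCK_OFF_BITS + SET_IDX_BITS + TAG_BITS)

/* Extract the appended share_flag: 1 = private data, 0 = shared data. */
static int paddr_share_flag(uint64_t paddr) {
    return (int)((paddr >> SHARE_FLAG_BIT) & 1);
}

/* Extract the set index used to pick a cache set. */
static uint64_t paddr_set_index(uint64_t paddr) {
    return (paddr >> (BYTE_OFF_BITS + BLOCK_OFF_BITS)) &
           ((1u << SET_IDX_BITS) - 1);
}

/* Extract the tag compared against the stored line tags. */
static uint64_t paddr_tag(uint64_t paddr) {
    return (paddr >> (BYTE_OFF_BITS + BLOCK_OFF_BITS + SET_IDX_BITS)) &
           ((UINT64_C(1) << TAG_BITS) - 1);
}
```

Because the flag sits above the tag, the normal tag/index comparison logic is untouched; the flag only steers which cache structure is probed.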
102. The processor determines, from the accessed data, the thread corresponding to the data, and then accesses, according to the thread and the data type, the cache corresponding to the thread so as to obtain the data, the cache being a private cache or the shared cache.
Specifically, after determining the data type of the data to be accessed, the CPU can determine from the access which hardware thread initiated it, and then determine the cache region to access from that hardware thread and the data type: if share_flag is 1, the data type is private data and the private data cache corresponding to the hardware thread is accessed; if share_flag is 0, the data type is shared data and the shared data cache in the data cache is accessed, so as to obtain the data. Accesses to the private data cache and the shared data cache are completed by hardware.
Each hardware thread corresponds to one private data cache, all hardware threads share one shared data cache, and the private data cache and the shared data cache belong to the same level, the L1 cache.
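The routing decision just described depends only on two inputs, the requesting hardware thread and the share_flag. A minimal sketch, with invented type and function names:

```c
/* Which kind of L1 data cache to probe, per the scheme above. */
typedef enum { PRIVATE_DCACHE, SHARED_DCACHE } dcache_kind_t;

typedef struct {
    dcache_kind_t kind;
    int thread_id;   /* meaningful only for PRIVATE_DCACHE */
} dcache_sel_t;

/* share_flag = 1 -> the requesting thread's own private D-cache;
 * share_flag = 0 -> the single shared data cache. */
static dcache_sel_t select_dcache(int hw_thread, int share_flag) {
    dcache_sel_t sel;
    if (share_flag) {
        sel.kind = PRIVATE_DCACHE;
        sel.thread_id = hw_thread;
    } else {
        sel.kind = SHARED_DCACHE;
        sel.thread_id = -1;   /* shared cache belongs to no one thread */
    }
    return sel;
}
```

This is the source of the speedup claimed later: the search is confined to one small private cache or the one shared cache, never both.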
In addition, if the access to the private data cache corresponding to the hardware thread misses, i.e. the requested data are not in the private data cache, the CPU accesses the main memory in the storage system, obtains from it the cache line containing the requested data, and backfills that line into the private data cache of the hardware thread. If the access to the shared data cache misses, i.e. the requested data are not in the shared data cache, the CPU accesses main memory, obtains the cache line containing the requested data, and backfills it into the shared data cache used by all hardware threads. When a cache line is backfilled into a data cache that is full, the least recently used (LRU) line can be replaced by the backfilled line; if the line is not yet in the cache and space is available, the line is backfilled directly. The replacement policy used during backfill is the same as in the prior art and is not repeated here.
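The backfill-with-LRU-eviction step can be sketched as follows. This is a simplified, fully associative model with invented names and sizes, using an age counter as the recency metric; a real cache would track recency per set in hardware.

```c
#include <stdint.h>

#define NLINES 4   /* tiny cache, for illustration only */

typedef struct {
    uint64_t tag;
    int valid;
    unsigned age;   /* higher = used longer ago */
} line_t;

/* Backfill `tag` into the cache: reuse an invalid slot if one exists,
 * otherwise evict the least recently used line. Returns the slot used. */
static int backfill(line_t cache[NLINES], uint64_t tag) {
    int victim = 0;
    for (int i = 0; i < NLINES; i++) {
        if (!cache[i].valid) { victim = i; break; }  /* free slot wins */
        if (cache[i].age > cache[victim].age) victim = i; /* track LRU */
    }
    for (int i = 0; i < NLINES; i++)
        cache[i].age++;          /* every line grows one tick older */
    cache[victim].tag = tag;
    cache[victim].valid = 1;
    cache[victim].age = 0;       /* the backfilled line is newest */
    return victim;
}
```

The same routine serves both caches in the scheme above; only the target structure (the thread's private cache or the shared cache) differs.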
In this way, because the threads of high-throughput application programs are highly similar and their degree of data sharing is low, repartitioning the data cache stores each thread's private data separately in its own private cache, free of any interference, while shared data are stored in the shared data cache. When the CPU looks up data in the cache, the flag bit of the physical address directly indicates whether the target of the search is the private data cache or the shared data cache, which narrows the search range, reduces access latency, and improves system performance.
Therefore, an embodiment of the present invention provides a method for accessing a data cache: when data in the data cache of a processor are accessed, the data type of the data is determined from the flag bit in the corresponding physical address, the data types comprising private data and shared data; the thread corresponding to the data is determined from the access, and the data cache corresponding to that thread, either a private data cache or the shared data cache, is accessed according to the thread and the data type so as to obtain the data. This narrows the range of a data search, reduces access latency, and improves system performance.
In the several embodiments provided in this application, it should be understood that the disclosed processor and method may be implemented in other ways. For example, the described apparatus embodiments are merely schematic: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. Furthermore, the mutual couplings, direct couplings or communication connections shown or discussed may be implemented through interfaces, and the indirect couplings or communication connections between devices or units may be electrical, mechanical or of another form.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may exist separately as physical units, or two or more units may be integrated into one unit. The units may be implemented in the form of hardware, or in the form of hardware plus software functional units.
All or part of the steps of the method embodiments above may be carried out by program instructions controlling the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The foregoing is merely a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (7)

1. A processor, comprising a program counter, a register file, an instruction prefetch unit, an instruction decode unit, an instruction issue unit, an address generation unit, an arithmetic logic unit, a shared floating-point unit, a shared instruction cache and an internal bus, characterized by further comprising:
a data cache, where the data cache is a level-1 cache comprising a private data cache and a shared data cache; the private data cache comprises multiple private caches and stores the private data of threads, and the shared data cache stores the data shared among the threads.
2. The processor according to claim 1, characterized in that the processor has a simultaneous multithreading architecture, the private caches are in one-to-one correspondence with hardware threads, and all hardware threads share the shared data cache.
3. A method for accessing a data cache, characterized by comprising:
when data in the data cache of a processor are accessed, determining the data type of the data according to a flag bit appended to the physical address corresponding to the data, the data types comprising private data and shared data;
determining, from the accessed data, the thread corresponding to the data, and then accessing, according to the thread and the data type, the data cache corresponding to the thread so as to obtain the data, where the data cache is a private data cache or a shared data cache.
4. The method according to claim 3, characterized in that the method further comprises:
if the data are not present in the private data cache, accessing main memory and backfilling the cache line containing the data, obtained from main memory, into the private data cache corresponding to the thread;
if the data are not present in the shared data cache, accessing main memory and backfilling the cache line containing the data, obtained from main memory, into the shared data cache.
5. The method according to claim 4, characterized in that determining the data type of the data according to the flag bit appended to the physical address corresponding to the data comprises:
if the flag bit in the physical address is a first flag, determining that the data type of the data is private data;
if the flag bit in the physical address is a second flag, determining that the data type of the data is shared data.
6. The method according to claim 5, characterized in that accessing, according to the thread and the data type, the data cache corresponding to the thread comprises:
if the data type is private data, accessing the private data cache corresponding to the thread;
if the data type is shared data, accessing the shared data cache.
7. The method according to claim 6, characterized in that the private data cache comprises multiple private caches, each private cache stores the private data of a thread, and the shared data cache stores the data shared among the threads;
where the private caches are in one-to-one correspondence with hardware threads, and all hardware threads share the shared data cache.
CN201310269618.3A 2013-06-28 2013-06-28 A kind of method and processor accessing data buffer storage Active CN104252392B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201310269618.3A CN104252392B (en) 2013-06-28 2013-06-28 A kind of method and processor accessing data buffer storage
PCT/CN2014/080063 WO2014206218A1 (en) 2013-06-28 2014-06-17 Method and processor for accessing data cache

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310269618.3A CN104252392B (en) 2013-06-28 2013-06-28 A kind of method and processor accessing data buffer storage

Publications (2)

Publication Number Publication Date
CN104252392A true CN104252392A (en) 2014-12-31
CN104252392B CN104252392B (en) 2019-06-18

Family

ID=52141029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310269618.3A Active CN104252392B (en) 2013-06-28 2013-06-28 Method for accessing data cache and processor

Country Status (2)

Country Link
CN (1) CN104252392B (en)
WO (1) WO2014206218A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5895487A (en) * 1996-11-13 1999-04-20 International Business Machines Corporation Integrated processing and L2 DRAM cache
CN101510191A (en) * 2009-03-26 2009-08-19 浙江大学 Multi-core system structure with buffer window and implementing method thereof
CN101571843A (en) * 2008-04-29 2009-11-04 国际商业机器公司 Method, apparatus and system for dynamically sharing a cache in a multi-core processor
CN102270180A (en) * 2011-08-09 2011-12-07 清华大学 Multicore processor cache and management method thereof
CN103092788A (en) * 2012-12-24 2013-05-08 华为技术有限公司 Multi-core processor and data access method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7827357B2 (en) * 2007-07-31 2010-11-02 Intel Corporation Providing an inclusive shared cache among multiple core-cache clusters


Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106815174A (en) * 2015-11-30 2017-06-09 大唐移动通信设备有限公司 Data access control method and node controller
CN106815174B (en) * 2015-11-30 2019-07-30 大唐移动通信设备有限公司 Data access control method and Node Controller
CN105677581A (en) * 2016-01-05 2016-06-15 上海斐讯数据通信技术有限公司 Internal storage access device and method
CN105743803A (en) * 2016-01-21 2016-07-06 华为技术有限公司 Data processing device for shared caches
CN105743803B (en) * 2016-01-21 2019-01-25 华为技术有限公司 Data processing device for a shared cache
CN107037260A (en) * 2016-11-24 2017-08-11 国网河南省电力公司周口供电公司 Telecommunication microgrid electric energy meter
CN107943743A (en) * 2017-11-17 2018-04-20 江苏微锐超算科技有限公司 Information storage and reading method for a computing device, and shared virtual medium carrier chip
CN109840410A (en) * 2017-12-28 2019-06-04 中国科学院计算技术研究所 Method and system for isolating and protecting data in a process
CN109840410B (en) * 2017-12-28 2021-09-21 中国科学院计算技术研究所 Method and system for isolating and protecting data in process
CN110865968A (en) * 2019-04-17 2020-03-06 成都海光集成电路设计有限公司 Multi-core processing device and data transmission method between cores thereof
CN110865968B (en) * 2019-04-17 2022-05-17 成都海光集成电路设计有限公司 Multi-core processing device and data transmission method between cores thereof
CN110046053A (en) * 2019-04-19 2019-07-23 上海兆芯集成电路有限公司 Processing system for dispatching tasks and memory access method thereof
US11294716B2 (en) 2019-04-19 2022-04-05 Shanghai Zhaoxin Semiconductor Co., Ltd. Processing system for managing process and its acceleration method
US10929187B2 (en) 2019-04-19 2021-02-23 Shanghai Zhaoxin Semiconductor Co., Ltd. Processing system and heterogeneous processor acceleration method
CN110083388A (en) * 2019-04-19 2019-08-02 上海兆芯集成电路有限公司 Processing system for scheduling and access method thereof
CN110083388B (en) * 2019-04-19 2021-11-12 上海兆芯集成电路有限公司 Processing system for scheduling and access method thereof
US11216304B2 (en) 2019-04-19 2022-01-04 Shanghai Zhaoxin Semiconductor Co., Ltd. Processing system for scheduling and distributing tasks and its acceleration method
CN110083387A (en) * 2019-04-19 2019-08-02 上海兆芯集成电路有限公司 Processing system using a polling mechanism and its memory access method
US11301297B2 (en) 2019-04-19 2022-04-12 Shanghai Zhaoxin Semiconductor Co., Ltd. Processing system for dispatching tasks and memory access method thereof
US11256633B2 (en) 2019-04-19 2022-02-22 Shanghai Zhaoxin Semiconductor Co., Ltd. Processing system with round-robin mechanism and its memory access method
CN112199217A (en) * 2020-10-23 2021-01-08 无锡江南计算技术研究所 Software and hardware cooperative thread private data access optimization method
CN112199217B (en) * 2020-10-23 2022-07-12 无锡江南计算技术研究所 Software and hardware cooperative thread private data access optimization method
WO2022199357A1 (en) * 2021-03-23 2022-09-29 北京灵汐科技有限公司 Data processing method and apparatus, electronic device, and computer-readable storage medium
CN114035847B (en) * 2021-11-08 2023-08-29 海飞科(南京)信息技术有限公司 Method and apparatus for parallel execution of kernel programs
CN114035847A (en) * 2021-11-08 2022-02-11 海飞科(南京)信息技术有限公司 Method and apparatus for parallel execution of kernel programs
CN114036084A (en) * 2021-11-17 2022-02-11 海光信息技术股份有限公司 Data access method, shared cache, chip system and electronic equipment
CN114265812A (en) * 2021-11-29 2022-04-01 山东云海国创云计算装备产业创新中心有限公司 Method, device, equipment and medium for reducing access delay of RISC-V vector processor
CN114265812B (en) * 2021-11-29 2024-02-02 山东云海国创云计算装备产业创新中心有限公司 Method, device, equipment and medium for reducing access delay of RISC-V vector processor
CN114217861A (en) * 2021-12-06 2022-03-22 海光信息技术股份有限公司 Data processing method and device, electronic device and storage medium
CN114327777A (en) * 2021-12-30 2022-04-12 元心信息科技集团有限公司 Method and device for determining global page directory, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN104252392B (en) 2019-06-18
WO2014206218A1 (en) 2014-12-31

Similar Documents

Publication Publication Date Title
CN104252392A (en) Method for accessing data cache and processor
US11645135B2 (en) Hardware apparatuses and methods for memory corruption detection
US9921972B2 (en) Method and apparatus for implementing a heterogeneous memory subsystem
US10802987B2 (en) Computer processor employing cache memory storing backless cache lines
US8732711B2 (en) Two-level scheduler for multi-threaded processing
US8560781B2 (en) Technique for using memory attributes
EP2831749B1 (en) Hardware profiling mechanism to enable page level automatic binary translation
US6513107B1 (en) Vector transfer system generating address error exception when vector to be transferred does not start and end on same memory page
WO2017172354A1 (en) Hardware apparatuses and methods for memory performance monitoring
EP2542973A1 (en) Gpu support for garbage collection
JP2007293839A (en) Method for managing replacement of sets in locked cache, computer program, caching system and processor
US6553486B1 (en) Context switching for vector transfer unit
US20140189192A1 (en) Apparatus and method for a multiple page size translation lookaside buffer (tlb)
US11531562B2 (en) Systems, methods, and apparatuses for resource monitoring
CN112148641A (en) System and method for tracking physical address accesses by a CPU or device
CN112948285A (en) Priority-based cache line eviction algorithm for flexible cache allocation techniques
US10013352B2 (en) Partner-aware virtual microsectoring for sectored cache architectures
US6625720B1 (en) System for posting vector synchronization instructions to vector instruction queue to separate vector instructions from different application programs
EP4020233B1 (en) Automated translation lookaside buffer set rebalancing
Gupta et al. A comparative study of cache optimization techniques and cache mapping techniques
CN115934584A (en) Memory access tracker in device private memory

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant