CN103927277A - CPU (central processing unit) and GPU (graphic processing unit) on-chip cache sharing method and device - Google Patents

Info

Publication number
CN103927277A
CN103927277A (application CN201410147375.0A)
Authority
CN
China
Prior art keywords
access request
gpu
cpu
speed cache
write
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410147375.0A
Other languages
Chinese (zh)
Other versions
CN103927277B (en)
Inventor
石伟
邓宇
郭御风
龚锐
任巨
张明
马爱永
高正坤
窦强
童元满
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Application filed by National University of Defense Technology
Priority to CN201410147375.0A
Publication of CN103927277A
Application granted
Publication of CN103927277B
Legal status: Active

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a method and device for sharing an on-chip cache between a CPU (central processing unit) and a GPU (graphics processing unit). The method includes: buffering memory access requests from the CPU and the GPU in separate queues; arbitrating among the buffered requests of the different types; when executing a CPU access request, routing its read and write data through the cache; when executing a GPU access request, letting the data read from or written to external memory bypass the cache and operating on external memory directly, notifying a CPU core to invalidate or update its private data copy only when a write hits in the cache. The device comprises a CPU request queue, a GPU request queue, an arbiter and a cache pipeline execution unit. The method and device accommodate the different access characteristics of the CPU and the GPU simultaneously, deliver high performance, and are simple and cheap to implement in hardware.

Description

Method and device for sharing an on-chip cache between a CPU and a GPU
Technical field
The present invention relates to the field of computer microprocessors, and in particular to a method and device for sharing an on-chip cache between a CPU and a GPU.
Background technology
With the rapid development of very-large-scale integration (VLSI) and embedded technology, the transistor budget of a single chip keeps growing, and system-on-chip (SoC) technology has emerged accordingly. An SoC chip typically integrates multiple IP cores with different functions and offers fairly complete functionality. An SoC chip used in a handheld terminal such as a mobile phone or a PDA can integrate almost the entire functionality of an embedded information processing system, realizing information acquisition, input, storage, processing and output on a single chip. Some current embedded systems (such as mobile phones and game consoles) place high demands on multimedia performance for graphics, image and video processing, so a graphics processing unit (GPU) is often integrated into the SoC chip as well.
In an SoC chip that integrates both a CPU and a GPU, the two processing units generally need to share on-chip resources such as the cache and the memory controller. However, the limited on-chip memory bandwidth can hardly satisfy the high bandwidth requirements of the CPU and the GPU at the same time, so the performance of both suffers to some extent. In addition, the memory access characteristics of the CPU and the GPU differ considerably, which places different requirements on the on-chip cache. CPU access requests are latency-sensitive and demand fast service; GPU access requests are bandwidth-sensitive and demand high-bandwidth service, without which the GPU cannot process the images to be displayed in real time. In summary, the way the on-chip cache is shared affects the performance of both the CPU and the GPU to a certain extent, and the low-latency requirement of the CPU and the high-bandwidth requirement of the GPU cannot both be met. As more and more SoC chips integrate a CPU and a GPU, the memory access contention between them has become a technical problem in urgent need of a solution.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method and device for sharing an on-chip cache between a CPU and a GPU that accommodate the different access characteristics of the CPU and the GPU simultaneously, deliver high performance, and are simple and cheap to implement in hardware.
To solve the above technical problem, the present invention adopts the following technical scheme:
A method for sharing an on-chip cache between a CPU and a GPU, implemented in the following steps:
1) buffer access requests from the CPU and access requests from the GPU in separate queues;
2) arbitrate among the buffered access requests of the different types; the request that wins arbitration enters the pipeline;
3) check the type of the access request entering the pipeline: if it is a CPU request, route the read and write data of the request through the cache while executing it; if it is a GPU request, let the data read from or written to external memory bypass the cache and operate on external memory directly, notifying a CPU core to invalidate or update its private data copy only when a write hits in the cache.
Further, the detailed steps of step 2) are as follows:
2.1) arbitrate among the buffered access requests of the different types according to a preset priority state value that indicates which queue has priority: if the priority state value indicates CPU priority, take one request from the buffered CPU requests as the arbitration winner and send it into the pipeline; otherwise, if the priority state value indicates GPU priority, take one request from the buffered GPU requests as the arbitration winner and send it into the pipeline;
2.2) update the priority state value for the next round of arbitration according to a preset update strategy.
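As a minimal software sketch of steps 2.1)-2.2) (the patent describes hardware, so this is illustrative only; the queue names are invented here, and serving the other queue when the priority queue is empty is an assumption this sketch makes, not something the text states):

```python
from collections import deque

def arbitrate(cpu_queue: deque, gpu_queue: deque, cpu_priority: bool):
    """Select the access request that wins arbitration (step 2.1).

    The queue indicated by the priority state value is served first.
    Falling back to the other queue when it is empty is an assumption
    of this sketch, so the pipeline is not left idle.
    """
    primary, secondary = (
        (cpu_queue, gpu_queue) if cpu_priority else (gpu_queue, cpu_queue)
    )
    if primary:
        return primary.popleft()
    if secondary:
        return secondary.popleft()
    return None  # both queues empty: nothing enters the pipeline
```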
Further, the detailed steps of step 2.2) are as follows:
2.2.1) examine the current priority state value: if it indicates CPU priority, go to step 2.2.2), otherwise go to step 2.2.3);
2.2.2) check the buffered GPU access requests and the current GPU bandwidth usage: if the buffered GPU requests are empty or the current GPU bandwidth usage meets its requirement, keep the priority state value at CPU priority for the next arbitration, completing the update; if the buffered GPU requests are non-empty and the current GPU bandwidth usage does not meet its requirement, set the priority state value to GPU priority for the next arbitration, completing the update;
2.2.3) check the buffered CPU core access requests and the current GPU bandwidth usage: if the buffered CPU core requests are empty or the current GPU bandwidth usage does not meet its requirement, keep the priority state value at GPU priority for the next arbitration, completing the update; if the buffered CPU core requests are non-empty and the current GPU bandwidth usage meets its requirement, set the priority state value to CPU priority for the next arbitration, completing the update.
Further, the detailed steps of step 3) are as follows:
3.1) check the type of the access request entering the pipeline: when it is a CPU request, go to step 3.2) if the operation is a read, otherwise go to step 3.3); when it is a GPU request, go to step 3.4) if the operation is a read, otherwise go to step 3.5);
3.2) determine whether the request hits in the cache: on a hit, return the hit data directly to the CPU core that issued the request; on a miss, fetch the read data from external memory, fill it into the cache and return it to the CPU core that issued the request; the request is then complete;
3.3) determine whether the request hits in the cache: on a hit, write the data into the cache to replace the hit data, and send the CPU cores a command to invalidate or update their private data copies; on a miss, follow the write-allocate policy and write the data into a newly allocated cache block; the request is then complete;
3.4) determine whether the request hits in the cache: on a hit, return the hit data directly to the requesting GPU; on a miss, fetch the read data from external memory and return it directly to the requesting GPU without writing it into the cache; the request is then complete;
3.5) determine whether the request hits in the cache: on a hit, write the data into the cache to replace the hit data, then send the CPU cores a command to invalidate or update their private data copies; on a miss, follow the no-write-allocate policy and write the data only to external memory, not into the cache; the request is then complete.
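The four cases in steps 3.2)-3.5) can be sketched as a small software model. This is an illustration under assumptions, not the patented hardware: the cache and external memory are modeled as plain dicts, and the second return value flags whether the CPU cores must be told to invalidate or update their private data copies.

```python
def execute(cache: dict, memory: dict, src: str, op: str, addr: int, value=None):
    """Execute one access request per steps 3.1)-3.5).

    Returns (read_data, notify_cpu); notify_cpu is True when the CPU
    cores must invalidate or update their private data copies.
    """
    hit = addr in cache
    if src == "CPU":
        if op == "read":
            if not hit:
                cache[addr] = memory[addr]    # 3.2) miss: fetch and fill the cache
            return cache[addr], False
        cache[addr] = value                   # 3.3) hit: replace; miss: write-allocate
        return None, hit                      # notify cores only on a write hit
    # GPU requests bypass the cache (steps 3.4-3.5)
    if op == "read":
        return (cache[addr] if hit else memory[addr]), False
    if hit:
        cache[addr] = value                   # 3.5) write hit: update cache, notify
        return None, True
    memory[addr] = value                      # 3.5) write miss: no-write-allocate
    return None, False
```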
The present invention also provides a device for sharing an on-chip cache between a CPU and a GPU, comprising:
a CPU request queue and a GPU request queue, for buffering access requests from the CPU and access requests from the GPU in separate queues;
an arbiter, for arbitrating among the buffered access requests of the different types, the request that wins arbitration entering the pipeline;
a cache pipeline execution unit, for checking the type of the access request entering the pipeline: if it is a CPU request, it routes the read and write data of the request through the cache while executing it; if it is a GPU request, it lets the data read from or written to external memory bypass the cache and operates on external memory directly, notifying a CPU core to invalidate or update its private data copy only when a write hits in the cache.
Further, the arbiter comprises:
a dynamic priority arbitration module, for arbitrating among the buffered access requests of the different types according to a preset priority state value that indicates which queue has priority: if the priority state value indicates CPU priority, it takes one request from the buffered CPU requests as the arbitration winner and sends it into the pipeline; otherwise, if the priority state value indicates GPU priority, it takes one request from the buffered GPU requests as the arbitration winner and sends it into the pipeline;
a priority state value update module, for updating the priority state value for the next arbitration according to a preset update strategy.
Further, the priority state value update module comprises:
a priority state value judgment submodule, for examining the current priority state value: if it indicates CPU priority, it calls the CPU core priority control submodule, otherwise it calls the GPU priority control submodule;
a CPU core priority control submodule, for checking the buffered GPU access requests and the current GPU bandwidth usage: if the buffered GPU requests are empty or the current GPU bandwidth usage meets its requirement, it keeps the priority state value at CPU priority for the next arbitration, completing the update; if the buffered GPU requests are non-empty and the current GPU bandwidth usage does not meet its requirement, it sets the priority state value to GPU priority for the next arbitration;
a GPU priority control submodule, for checking the buffered CPU core access requests and the current GPU bandwidth usage: if the buffered CPU core requests are empty or the current GPU bandwidth usage does not meet its requirement, it keeps the priority state value at GPU priority for the next arbitration, completing the update; if the buffered CPU core requests are non-empty and the current GPU bandwidth usage meets its requirement, it sets the priority state value to CPU priority for the next arbitration.
Further, the cache pipeline execution unit comprises:
an access request checking module, for checking the type of the access request entering the pipeline: when it is a CPU request, it calls the CPU read execution module if the operation is a read, otherwise the CPU write execution module; when it is a GPU request, it calls the GPU read execution module if the operation is a read, otherwise the GPU write execution module;
a CPU read execution module, for determining whether the request hits in the cache: on a hit, it returns the hit data directly to the CPU core that issued the request; on a miss, it fetches the read data from external memory, fills it into the cache and returns it to the CPU core that issued the request;
a CPU write execution module, for determining whether the request hits in the cache: on a hit, it writes the data into the cache to replace the hit data and sends the CPU cores a command to invalidate or update their private data copies; on a miss, it follows the write-allocate policy and writes the data into a newly allocated cache block;
a GPU read execution module, for determining whether the request hits in the cache: on a hit, it returns the hit data directly to the requesting GPU; on a miss, it fetches the read data from external memory and returns it directly to the requesting GPU without writing it into the cache;
a GPU write execution module, for determining whether the request hits in the cache: on a hit, it writes the data into the cache to replace the hit data, then sends the CPU cores a command to invalidate or update their private data copies; on a miss, it follows the no-write-allocate policy and writes the data only to external memory, not into the cache.
The method of the present invention for sharing an on-chip cache between a CPU and a GPU has the following advantages:
1. In accordance with the memory access characteristics of the CPU and the GPU, the present invention buffers CPU requests and GPU requests in separate queues, arbitrates among the buffered requests of the different types, and sends the winner into the pipeline. When the pipeline executes a CPU request, the read and write data pass through the cache; when it executes a GPU request, the data read from or written to external memory bypass the cache and external memory is operated on directly, with a CPU core notified to invalidate or update its private data copy only when a write hits in the cache. The invention thus optimizes each kind of request separately: CPU programs (the sources of CPU access requests) exhibit strong locality, so their data enter the cache to improve program performance, whereas the data accessed by GPU programs (the sources of GPU access requests) are streaming in nature and exhibit poor locality, so they generally bypass the cache, greatly reducing their impact on CPU programs. The invention therefore accommodates the different access characteristics of the CPU and the GPU simultaneously and achieves high performance.
2. Because the invention merely buffers CPU and GPU requests in separate queues and arbitrates among them before the winner enters the pipeline for execution, it lets the GPU share the cache on top of an existing multi-core processor design, largely retains the structure by which a multi-core processor accesses its cache, and changes the original processor structure little, so the hardware is simple and the cost is low. Moreover, the modified cache structure can still be used in multi-core processors without a GPU, so it also offers good compatibility.
3. Because the invention buffers CPU and GPU requests in separate queues and arbitrates among them before the winner enters the pipeline, different arbitration strategies can be selected flexibly according to the memory access characteristics of the CPU and the GPU, so whichever CPU or GPU program most urgently needs service is given priority, improving the flexibility of the system.
The device of the present invention for sharing an on-chip cache between a CPU and a GPU corresponds to the method of the present invention and therefore has the same technical effects, which are not repeated here.
Brief description of the drawings
Fig. 1 is a flowchart of the method of the embodiment of the present invention.
Fig. 2 is a schematic diagram of arbitration among the different types of access requests in the method of the embodiment.
Fig. 3 is the state machine diagram for updating the priority state value in the method of the embodiment.
Fig. 4 is a logical block diagram of the device of the embodiment.
Fig. 5 is a logical block diagram of the arbiter in the device of the embodiment.
Fig. 6 is a block diagram of the dynamic priority arbitration module in the device of the embodiment.
Fig. 7 is a logical block diagram of the priority state value update module in the device of the embodiment.
Fig. 8 is a logical block diagram of the cache pipeline execution unit in the device of the embodiment.
Fig. 9 is a block diagram of a microprocessor applying the device of the embodiment.
Fig. 10 is a block diagram of the cache pipeline execution unit in the device of the embodiment.
Fig. 11 is a flowchart of the device of the embodiment executing a CPU access request.
Fig. 12 is a flowchart of the device of the embodiment executing a GPU access request.
Embodiments
As shown in Fig. 1, the method of this embodiment for sharing an on-chip cache between a CPU and a GPU is implemented in the following steps:
1) buffer access requests from the CPU and access requests from the GPU in separate queues;
2) arbitrate among the buffered access requests of the different types; the request that wins arbitration enters the pipeline;
3) check the type of the access request entering the pipeline: if it is a CPU request, route the read and write data of the request through the cache while executing it; if it is a GPU request, let the data read from or written to external memory bypass the cache and operate on external memory directly, notifying a CPU core to invalidate or update its private data copy only when a write hits in the cache.
As shown in Fig. 2, this embodiment arbitrates among the buffered access requests of the different types, and the winner enters the pipeline: if a buffered CPU request wins, that request from the CPU enters the pipeline; if a buffered GPU request wins, that request from the GPU enters the pipeline. In this embodiment, the detailed steps of step 2) are as follows:
2.1) arbitrate among the buffered access requests of the different types according to a preset priority state value that indicates which queue has priority: if the priority state value indicates CPU priority, take one request from the buffered CPU requests as the arbitration winner and send it into the pipeline; otherwise, if the priority state value indicates GPU priority, take one request from the buffered GPU requests as the arbitration winner and send it into the pipeline;
2.2) update the priority state value for the next arbitration according to a preset update strategy.
Combining steps 2.1)-2.2): this embodiment adopts dynamic priority arbitration, in which the buffered access requests of the different types are arbitrated according to the priority state value (if it indicates CPU priority, one buffered CPU request is taken as the winner and enters the pipeline; otherwise, if it indicates GPU priority, one buffered GPU request is taken as the winner and enters the pipeline), and after every arbitration the priority state value is updated for the next round according to the preset update strategy. This dynamic priority arbitration can change the priority of the requests according to the current state and demands of the system, so whichever CPU-side or GPU-side program most urgently needs service is given priority, improving the flexibility of the system.
In this embodiment, the detailed steps of step 2.2) are as follows:
2.2.1) examine the current priority state value: if it indicates CPU priority, go to step 2.2.2), otherwise go to step 2.2.3);
2.2.2) check the buffered GPU access requests and the current GPU bandwidth usage: if the buffered GPU requests are empty or the current GPU bandwidth usage meets its requirement, keep the priority state value at CPU priority for the next arbitration, completing the update; if the buffered GPU requests are non-empty and the current GPU bandwidth usage does not meet its requirement, set the priority state value to GPU priority for the next arbitration, completing the update;
2.2.3) check the buffered CPU core access requests and the current GPU bandwidth usage: if the buffered CPU core requests are empty or the current GPU bandwidth usage does not meet its requirement, keep the priority state value at GPU priority for the next arbitration, completing the update; if the buffered CPU core requests are non-empty and the current GPU bandwidth usage meets its requirement, set the priority state value to CPU priority for the next arbitration, completing the update.
Combining steps 2.2.1)-2.2.3): this embodiment weighs the priority of CPU requests against GPU requests according to the current system state, which comprises three quantities: the number of buffered GPU requests, the number of buffered CPU core requests, and the current GPU bandwidth usage. By taking all three into account, the limited on-chip memory bandwidth can largely satisfy the bandwidth requirements of the CPU and the GPU at the same time, so both achieve near-optimal performance. The performance degradation caused by memory access contention between the CPU and the GPU is thus largely overcome, the overall performance of the processor is improved, whichever CPU or GPU program most urgently needs service is given priority, and the flexibility of the system is improved.
As shown in Fig. 3, this embodiment denotes the priority state value by Priority. When Priority=0, the CPU request queue has higher priority than the GPU, i.e. Priority=0 means CPU priority; when Priority=1, the GPU request queue has higher priority than the CPU, i.e. Priority=1 means GPU priority. The system initializes the priority state value to 0 by default, i.e. Priority=0, defaulting to CPU priority. When Priority=0: if the buffered GPU requests are empty or the current GPU bandwidth usage meets its requirement, the state stays at CPU priority (Priority=0) for the next arbitration and the update is complete; if the buffered GPU requests are non-empty and the current GPU bandwidth usage does not meet its requirement, the state is set to GPU priority (Priority=1) for the next arbitration. When Priority=1: if the buffered CPU core requests are empty or the current GPU bandwidth usage does not meet its requirement, the state stays at GPU priority (Priority=1) for the next arbitration and the update is complete; if the buffered CPU core requests are non-empty and the current GPU bandwidth usage meets its requirement, the state is set to CPU priority (Priority=0) for the next arbitration.
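The two-state machine of Fig. 3 can be sketched as a pure function. This is an illustration only: `gpu_bw_ok` stands for "the current GPU bandwidth usage meets its requirement" and abstracts a measurement whose implementation the text leaves unspecified.

```python
def update_priority(priority: int, cpu_pending: bool, gpu_pending: bool,
                    gpu_bw_ok: bool) -> int:
    """One transition of the Fig. 3 state machine.

    priority: 0 = CPU priority, 1 = GPU priority (the Priority encoding).
    cpu_pending / gpu_pending: whether each request queue is non-empty.
    gpu_bw_ok: whether the GPU's bandwidth requirement is currently met.
    """
    if priority == 0:
        # Stay CPU-priority unless GPU work is queued and starved of bandwidth.
        return 1 if (gpu_pending and not gpu_bw_ok) else 0
    # Priority == 1: return to CPU priority once CPU work is queued
    # and the GPU's bandwidth requirement is already satisfied.
    return 0 if (cpu_pending and gpu_bw_ok) else 1
```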
In this embodiment, the detailed steps of step 3) are as follows:
3.1) check the type of the access request entering the pipeline: when it is a CPU request, go to step 3.2) if the operation is a read, otherwise go to step 3.3); when it is a GPU request, go to step 3.4) if the operation is a read, otherwise go to step 3.5);
3.2) determine whether the request hits in the cache: on a hit, return the hit data directly to the CPU core that issued the request; on a miss, fetch the read data from external memory, fill it into the cache and return it to the CPU core that issued the request; the request is then complete;
3.3) determine whether the request hits in the cache: on a hit, write the data into the cache to replace the hit data, and send the CPU cores a command to invalidate or update their private data copies; on a miss, follow the write-allocate policy and write the data into a newly allocated cache block; the request is then complete;
3.4) determine whether the request hits in the cache: on a hit, return the hit data directly to the requesting GPU; on a miss, fetch the read data from external memory and return it directly to the requesting GPU without writing it into the cache; the request is then complete;
3.5) determine whether the request hits in the cache: on a hit, write the data into the cache to replace the hit data, then send the CPU cores a command to invalidate or update their private data copies; on a miss, follow the no-write-allocate policy and write the data only to external memory, not into the cache; the request is then complete.
Summarizing steps 3.1)-3.5): when this embodiment executes a CPU access request, the data read or written are cached in the cache, and invalidate or update messages are sent to the other CPUs to maintain cache coherence among the multiple CPUs; on a CPU read miss, the data read from external memory are filled into the cache as a replacement; on a CPU write miss, the write-allocate policy is followed; when a CPU write is executed, invalidate or update messages are sent to the other CPUs to maintain coherence. When this embodiment executes a GPU access request, the data read or written are kept out of the cache as far as possible: only when a write hits in the cache are invalidate or update messages sent to the other CPUs to maintain coherence; on a read miss, the data are sent from external memory directly to the GPU core; on a write miss, the data are written directly to external memory, following the no-write-allocate policy.
As shown in Figure 4, the device of this embodiment for sharing an on-chip cache between a CPU and a GPU comprises:
A CPU request queue 1 and a GPU request queue 2, for buffering access requests from the CPU and from the GPU by class; these access requests are represented as messages.
An arbiter 3, for arbitrating among the different classes of buffered access requests; the winning access request enters the pipeline.
A cache pipeline execution unit 4, for checking the request type of the access request entering the pipeline. If the access request comes from the CPU, its read and write data pass through the cache when it is executed. If the access request comes from the GPU, the data it reads from or writes to the external memory bypasses the cache and the external memory is operated on directly; only when a write hits in the cache are the CPU cores notified to invalidate or update their private data backups.
As shown in Figure 5, the arbiter 3 of this embodiment comprises:
A dynamic priority arbitration module 31, for arbitrating among the different classes of buffered access requests according to a preset priority state value that indicates the priority level of each class. If the priority state value indicates CPU priority, one access request is taken from the buffered CPU access requests as the arbitration winner and enters the pipeline; otherwise, if it indicates GPU priority, one access request is taken from the buffered GPU access requests as the arbitration winner and enters the pipeline.
A priority state value update module 32, for updating the priority state value for the next arbitration according to a preset update strategy.
As shown in Figure 6, the dynamic priority arbitration module 31 specifically comprises an arbitration execution module and a selector. The arbitration execution module arbitrates among the different classes of buffered requests according to the preset priority state value, and the selector, according to the arbitration result, takes one access request from the buffered CPU or GPU requests as the winner and sends it into the pipeline. After access requests have entered the CPU request queue 1 and the GPU request queue 2 respectively, the arbiter 3 arbitrates through the dynamic priority arbitration module 31: if a CPU access request wins arbitration, it enters the pipeline; conversely, if a GPU access request wins, that GPU request enters the pipeline.
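The selection performed by the arbitration execution module and selector can be sketched as follows. One detail is our assumption, not stated in the patent: when the preferred queue is empty, the sketch serves the other queue rather than idling the pipeline.

```python
from collections import deque

# Sketch of the dynamic priority arbitration of Figures 5-6: the selector pops
# one request from the queue named by the current priority state value.

CPU_PRIORITY, GPU_PRIORITY = "CPU", "GPU"

def arbitrate(cpu_queue, gpu_queue, priority_state):
    """Pick the winning request for this cycle, or None if nothing is pending."""
    preferred = cpu_queue if priority_state == CPU_PRIORITY else gpu_queue
    other = gpu_queue if priority_state == CPU_PRIORITY else cpu_queue
    if preferred:
        return preferred.popleft()   # winner enters the pipeline
    if other:
        return other.popleft()       # assumption: serve the other class when idle
    return None
```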
As shown in Figure 7, the priority state value update module 32 of this embodiment comprises:
A priority state value judging submodule 321, for examining the current priority state value: if it indicates CPU priority, the CPU core priority status control submodule 322 is called; otherwise the GPU priority status control submodule 323 is called.
A CPU core priority status control submodule 322, for checking the buffered GPU access requests and the current GPU bandwidth utilization: if the GPU request queue is empty or the current GPU bandwidth meets the requirement, the priority state value is kept at CPU priority for the next arbitration and the update is complete; if the GPU request queue is non-empty and the current GPU bandwidth does not meet the requirement, the priority state value is set to GPU priority for the next arbitration.
A GPU priority status control submodule 323, for checking the buffered CPU core access requests and the current GPU bandwidth utilization: if the CPU request queue is empty or the current GPU bandwidth does not meet the requirement, the priority state value is kept at GPU priority for the next arbitration and the update is complete; if the CPU request queue is non-empty and the current GPU bandwidth meets the requirement, the priority state value is set to CPU priority for the next arbitration.
From the structure of the priority state value update module 32 above, the arbiter 3 of this embodiment can weigh the priority of CPU requests against GPU requests according to the current system state, and thereby optimize system performance. CPU programs are latency-sensitive and their access requests need fast service, whereas the GPU is more sensitive to bandwidth. Therefore, as long as the GPU's memory bandwidth requirement is satisfied, CPU access requests have higher priority than GPU access requests. A hardware counter records the actual memory bandwidth consumed by the GPU per unit time: if the bandwidth is below expectation, the GPU's access requests need a priority boost to prevent discontinuities in its graphics and image processing; if the bandwidth is above expectation, the GPU is working normally and more bandwidth can be allocated to the CPU to improve CPU performance.
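The update strategy of Figure 7 can be sketched as a pure function of the current state. One assumption of ours: "bandwidth meets the requirement" is modeled as the counter reaching a fixed threshold.

```python
# Sketch of the priority-state update of Figure 7. "Bandwidth meets the
# requirement" is modeled (our assumption) as measured bandwidth >= threshold.

def update_priority(state, cpu_queue_empty, gpu_queue_empty,
                    gpu_bandwidth, required_bandwidth):
    """Return the priority state value to use for the next arbitration."""
    gpu_bw_ok = gpu_bandwidth >= required_bandwidth
    if state == "CPU":
        # Submodule 322: yield to the GPU only if it has pending requests
        # and its bandwidth requirement is not being met.
        if not gpu_queue_empty and not gpu_bw_ok:
            return "GPU"
        return "CPU"
    # Submodule 323: return priority to the CPU once it has pending requests
    # and the GPU's bandwidth requirement is satisfied.
    if not cpu_queue_empty and gpu_bw_ok:
        return "CPU"
    return "GPU"
```

This makes the stated trade-off explicit: the CPU keeps priority by default, and the GPU preempts it only while its measured bandwidth lags the requirement.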
As shown in Figure 8, the cache pipeline execution unit 4 of this embodiment comprises:
An access request checking module 41, for checking the request type of the access request entering the pipeline. When the request type is CPU, the CPU read operation execution module 42 is called if the operation is a read, otherwise the CPU write operation execution module 43 is called; when the request type is GPU, the GPU read operation execution module 44 is called if the operation is a read, otherwise the GPU write operation execution module 45 is called.
A CPU read operation execution module 42, for determining whether the access request hits in the cache: on a hit, the hit data is returned directly to the CPU core that issued the request; on a miss, the external memory is accessed to fetch the read data, which is filled into the cache and returned to the issuing CPU core.
A CPU write operation execution module 43, for determining whether the access request hits in the cache: on a hit, the write data replaces the hit data in the cache and an order to invalidate or update the private data backup is sent to the CPU cores; on a miss, the write-allocate principle is applied and the write data is written into a newly allocated cache block.
A GPU read operation execution module 44, for determining whether the access request hits in the cache: on a hit, the hit data is returned directly to the GPU; on a miss, the external memory is accessed to fetch the read data, which is returned directly to the requesting GPU without being written into the cache.
A GPU write operation execution module 45, for determining whether the access request hits in the cache: on a hit, the write data replaces the hit data in the cache and an order to invalidate or update the private data backup is sent to the CPU cores; on a miss, the no-write-allocate principle is applied and the write data is written only to the external memory, not into the cache.
As shown in Figure 9, a multi-core microprocessor (SoC chip) adopting the device of this embodiment consists of n CPU cores (core 1 to core n), one GPU, one shared cache and one memory controller, with each CPU core configured with its own private cache. The n CPU cores and the GPU core are connected to the shared cache through an interconnect such as a bus, a crossbar or an on-chip network, and the cache accesses the external memory through the memory controller. All data accessed by the CPU cores passes through the cache, which maintains cache coherence, while data accessed by the GPU core generally bypasses the cache, avoiding the occupation of a large amount of cache capacity.
As shown in Figure 10, the cache pipeline execution unit 4 comprises the following components. A cache Tag module records the address information of the data stored in the cache and is used to determine whether accessed data has a backup in the cache. A Tag state module records status information for the cache Tags, such as whether a Tag is valid and whether the data corresponding to a Tag has been modified. A cache data module records the data corresponding to the addresses recorded by the Tags. A cache directory module records the Tag portion of the upper-level caches and is used for cache coherence management: when a write by the CPU or GPU to an address hits, the coherence management hardware sends invalidation or update messages, according to the directory information, to the CPUs holding data for that address. A CPU write buffer records the data that the CPU needs to write to the external memory; when the CPU performs a write under the write-through policy, the data is written to the external memory as well as the cache, and the write data is placed in the CPU write buffer. A GPU write buffer records the data that the GPU needs to write to the external memory; on a GPU write miss, the data is written to the external memory directly through the GPU write buffer and no longer enters the cache. A read buffer records the results returned from external memory reads by the CPU and GPU. A miss buffer: when a CPU or GPU read or write misses in the cache, the corresponding read or write request is sent to the external memory, and the miss buffer records the request and its address. A CPU return queue and a GPU return queue record the data the cache returns to the CPU and to the GPU respectively; the GPU return queue records only the data the GPU is to read, whereas the CPU return queue records not only the data the CPU is to read but also information such as invalidations and updates for coherence management.
The cache directory records all CPU data accesses but no GPU data accesses. When the CPU or GPU writes the cache, the coherence hardware invalidates or updates the related data in the CPUs according to the directory state, so the CPUs always obtain up-to-date data; but because GPU-accessed data does not enter the cache, the coherence hardware cannot guarantee that all data accessed by the GPU is up to date. Although this embodiment does not guarantee in hardware that GPU-accessed data is the latest, the usage environment ensures that the proposed cache device works correctly: in an SoC integrating a CPU and a GPU, the CPU usually controls the execution of the GPU and supplies it with the necessary data; while the GPU runs, the CPU and GPU are in relatively independent running states and normally do not exchange data, so CPU writes normally do not touch the GPU's address space and no data inconsistency arises.
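The directory lookup on a write hit can be sketched as follows. This is a toy model, not the patent's hardware: the directory is a plain dict from address to the set of CPU core ids holding a private copy, and `writer=None` stands for a GPU writer (which keeps no copy).

```python
# Toy sketch of the cache-directory lookup: on a write hit, invalidation
# messages go only to the CPU cores that the directory records as sharers.

def write_hit_invalidations(directory, addr, writer=None):
    """Return the sorted list of CPU core ids to invalidate; update the directory.

    directory: dict mapping address -> set of CPU core ids with a private copy.
    writer: the writing CPU core id, or None for a GPU writer.
    """
    sharers = directory.get(addr, set())
    targets = sorted(c for c in sharers if c != writer)  # never message the writer
    # Under an invalidation protocol the stale private copies are dropped;
    # a CPU writer keeps its own copy, a GPU writer keeps none.
    directory[addr] = set() if writer is None else {writer}
    return targets
```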
As shown in Figure 11, this embodiment executes an access request from the CPU as follows:
A1) A CPU core issues an access request.
A2) The access request is buffered into the CPU request queue dedicated to CPU access requests.
A3) The access request is arbitrated; if it loses arbitration it waits for the next round, and when it finally wins arbitration it enters the cache pipeline.
A4) The access request type is discriminated: a read request jumps to step A5), a write request to step A6).
A5) The cache Tag (address tag) and Tag state are accessed to determine whether the read hits in the cache. On a read hit, the cache data is accessed and the correct data is sent into the CPU return queue. On a read miss, the access request is sent into the miss buffer, the correct data is then obtained from the next memory hierarchy level (e.g. the external memory), and this data enters both the CPU return queue and the cache.
A6) The cache Tag (address tag) and Tag state are accessed to determine whether the write hits in the cache. On a write hit, the data is written to the corresponding location in the cache, the cache directory is queried, and invalidation or update messages are sent to the private caches of the other CPUs according to the directory. On a write miss, the write-allocate principle is applied and the data is written into a newly allocated cache block.
As shown in Figure 12, this embodiment executes an access request from the GPU as follows:
B1) The GPU issues an access request.
B2) The access request is buffered into the GPU request queue dedicated to GPU access requests.
B3) The access request is arbitrated; if it loses arbitration it waits for the next round, and when it finally wins arbitration it enters the cache pipeline.
B4) The access request type is discriminated: a read request jumps to step B5), a write request to step B6).
B5) The cache Tag (address tag) and Tag state are accessed to determine whether the read hits in the cache. On a read hit, the cache data is accessed and the correct data is sent into the GPU return queue. On a read miss, the access request is sent into the miss buffer, the correct data is then obtained from the next memory hierarchy level (e.g. the external memory), and this data enters only the GPU return queue, not the cache.
B6) The cache Tag (address tag) and Tag state are accessed to determine whether the write hits in the cache. On a write hit, the data is written to the corresponding location in the cache, the cache directory is queried, and invalidation or update messages are sent to the private caches of the other CPUs according to the directory. On a write miss, the no-write-allocate principle is applied: the data is written directly to the external memory and no backup of it is stored in the cache.
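The GPU path B5)–B6) above can be sketched compactly, with a dict standing in for the tag and data arrays and a list for the miss buffer. The queue, arbitration and coherence-message stages are deliberately simplified away; `gpu_access` is our name for this behavioral illustration, not a function from the patent.

```python
# Behavioral sketch of the GPU read/write path: read misses bypass the cache
# (no fill), write hits update the cached copy, write misses go straight to
# external memory (no-write-allocate).

def gpu_access(cache, memory, miss_buffer, op, addr, data=None):
    hit = addr in cache
    if op == "read":
        if hit:
            return cache[addr]                 # B5: read hit -> GPU return queue
        miss_buffer.append(("read", addr))     # B5: read miss goes via miss buffer
        return memory[addr]                    # returned data is NOT filled into the cache
    if hit:
        cache[addr] = data                     # B6: write hit replaces the cached copy
        # (invalidate/update messages to CPU private caches omitted here)
    else:
        memory[addr] = data                    # B6: write miss -> external memory only
    return None
```

Running a read miss followed by a write miss shows the bypass: the cache stays empty while the miss buffer and external memory record the traffic.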
The above are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical schemes falling under the concept of the present invention belong to its protection scope. It should be pointed out that, for those skilled in the art, improvements and modifications made without departing from the principles of the present invention should also be regarded as within the protection scope of the present invention.

Claims (8)

1. A method for a CPU and a GPU to share an on-chip cache, characterized in that its implementation steps are as follows:
1) buffering access requests from the CPU and access requests from the GPU by class;
2) arbitrating among the different classes of buffered access requests, the winning access request entering the pipeline;
3) checking the request type of the access request entering the pipeline: if the access request is from the CPU, its read and write data pass through the cache when the CPU access request is executed; if the access request is from the GPU, the data it reads from or writes to the external memory bypasses the cache and the external memory is operated on directly when the GPU access request is executed, the CPU cores being notified to invalidate or update their private data backups only when a write hits in the cache.
2. The method for a CPU and a GPU to share an on-chip cache according to claim 1, characterized in that the detailed steps of said step 2) are as follows:
2.1) arbitrating among the different classes of buffered access requests according to a preset priority state value indicating the priority level of each class: if the priority state value indicates CPU priority, taking one access request from the buffered CPU access requests as the arbitration winner to enter the pipeline; otherwise, if the priority state value indicates GPU priority, taking one access request from the buffered GPU access requests as the arbitration winner to enter the pipeline;
2.2) updating the priority state value for the next arbitration according to a preset priority state value update strategy.
3. The method for a CPU and a GPU to share an on-chip cache according to claim 2, characterized in that the detailed steps of said step 2.2) are as follows:
2.2.1) examining the current priority state value: if it indicates CPU priority, jumping to step 2.2.2), otherwise jumping to step 2.2.3);
2.2.2) checking the buffered GPU access requests and the current GPU bandwidth utilization: if the GPU request queue is empty or the current GPU bandwidth meets the requirement, keeping the priority state value at CPU priority for the next arbitration, the update being complete; if the GPU request queue is non-empty and the current GPU bandwidth does not meet the requirement, setting the priority state value to GPU priority for the next arbitration, the update being complete;
2.2.3) checking the buffered CPU core access requests and the current GPU bandwidth utilization: if the CPU request queue is empty or the current GPU bandwidth does not meet the requirement, keeping the priority state value at GPU priority for the next arbitration, the update being complete; if the CPU request queue is non-empty and the current GPU bandwidth meets the requirement, setting the priority state value to CPU priority for the next arbitration, the update being complete.
4. The method for a CPU and a GPU to share an on-chip cache according to claim 1, 2 or 3, characterized in that the detailed steps of said step 3) are as follows:
3.1) checking the request type of the access request entering the pipeline: when the request type is CPU, jumping to step 3.2) if the operation type of the access request is a read, otherwise to step 3.3); when the request type is GPU, jumping to step 3.4) if the operation type is a read, otherwise to step 3.5);
3.2) determining whether the access request hits in the cache: on a hit, returning the hit data directly to the CPU core that issued the request; on a miss, accessing the external memory to fetch the read data, filling it into the cache and returning it to the issuing CPU core; the access request is then complete;
3.3) determining whether the access request hits in the cache: on a hit, writing the write data into the cache to replace the hit data and sending an order to invalidate or update the private data backup to the CPU cores; on a miss, applying the write-allocate principle and writing the data into a newly allocated cache block; the access request is then complete;
3.4) determining whether the access request hits in the cache: on a hit, returning the hit data directly to the GPU; on a miss, accessing the external memory to fetch the read data and returning it directly to the requesting GPU without writing it into the cache; the access request is then complete;
3.5) determining whether the access request hits in the cache: on a hit, writing the write data into the cache to replace the hit data and then sending an order to invalidate or update the private data backup to the CPU cores; on a miss, applying the no-write-allocate principle and writing the write data only to the external memory, not into the cache; the access request is then complete.
5. A device for a CPU and a GPU to share an on-chip cache, characterized by comprising:
a CPU request queue (1) and a GPU request queue (2), for buffering access requests from the CPU and access requests from the GPU by class;
an arbiter (3), for arbitrating among the different classes of buffered access requests, the winning access request entering the pipeline;
a cache pipeline execution unit (4), for checking the request type of the access request entering the pipeline: if the access request is from the CPU, its read and write data pass through the cache when the CPU access request is executed; if the access request is from the GPU, the data it reads from or writes to the external memory bypasses the cache and the external memory is operated on directly when the GPU access request is executed, the CPU cores being notified to invalidate or update their private data backups only when a write hits in the cache.
6. The device for a CPU and a GPU to share an on-chip cache according to claim 5, characterized in that said arbiter (3) comprises:
a dynamic priority arbitration module (31), for arbitrating among the different classes of buffered access requests according to a preset priority state value indicating the priority level of each class: if the priority state value indicates CPU priority, taking one access request from the buffered CPU access requests as the arbitration winner to enter the pipeline; otherwise, if the priority state value indicates GPU priority, taking one access request from the buffered GPU access requests as the arbitration winner to enter the pipeline;
a priority state value update module (32), for updating the priority state value for the next arbitration according to a preset priority state value update strategy.
7. The device for a CPU and a GPU to share an on-chip cache according to claim 6, characterized in that said priority state value update module (32) comprises:
a priority state value judging submodule (321), for examining the current priority state value: if it indicates CPU priority, calling the CPU core priority status control submodule (322), otherwise calling the GPU priority status control submodule (323);
a CPU core priority status control submodule (322), for checking the buffered GPU access requests and the current GPU bandwidth utilization: if the GPU request queue is empty or the current GPU bandwidth meets the requirement, keeping the priority state value at CPU priority for the next arbitration, the update being complete; if the GPU request queue is non-empty and the current GPU bandwidth does not meet the requirement, setting the priority state value to GPU priority for the next arbitration;
a GPU priority status control submodule (323), for checking the buffered CPU core access requests and the current GPU bandwidth utilization: if the CPU request queue is empty or the current GPU bandwidth does not meet the requirement, keeping the priority state value at GPU priority for the next arbitration, the update being complete; if the CPU request queue is non-empty and the current GPU bandwidth meets the requirement, setting the priority state value to CPU priority for the next arbitration.
8. The device for a CPU and a GPU to share an on-chip cache according to claim 5, 6 or 7, characterized in that said cache pipeline execution unit (4) comprises:
an access request checking module (41), for checking the request type of the access request entering the pipeline: when the request type is CPU, calling the CPU read operation execution module (42) if the operation type of the access request is a read, otherwise calling the CPU write operation execution module (43); when the request type is GPU, calling the GPU read operation execution module (44) if the operation type is a read, otherwise calling the GPU write operation execution module (45);
a CPU read operation execution module (42), for determining whether the access request hits in the cache: on a hit, returning the hit data directly to the CPU core that issued the request; on a miss, accessing the external memory to fetch the read data, filling it into the cache and returning it to the issuing CPU core;
a CPU write operation execution module (43), for determining whether the access request hits in the cache: on a hit, writing the write data into the cache to replace the hit data and sending an order to invalidate or update the private data backup to the CPU cores; on a miss, applying the write-allocate principle and writing the data into a newly allocated cache block;
a GPU read operation execution module (44), for determining whether the access request hits in the cache: on a hit, returning the hit data directly to the GPU; on a miss, accessing the external memory to fetch the read data and returning it directly to the requesting GPU without writing it into the cache;
a GPU write operation execution module (45), for determining whether the access request hits in the cache: on a hit, writing the write data into the cache to replace the hit data and then sending an order to invalidate or update the private data backup to the CPU cores; on a miss, applying the no-write-allocate principle and writing the write data only to the external memory, not into the cache.
CN201410147375.0A 2014-04-14 2014-04-14 CPU and GPU shares the method and device of on chip cache Active CN103927277B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410147375.0A CN103927277B (en) 2014-04-14 2014-04-14 CPU and GPU shares the method and device of on chip cache


Publications (2)

Publication Number Publication Date
CN103927277A true CN103927277A (en) 2014-07-16
CN103927277B CN103927277B (en) 2017-01-04

Family

ID=51145500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410147375.0A Active CN103927277B (en) 2014-04-14 2014-04-14 CPU and GPU shares the method and device of on chip cache

Country Status (1)

Country Link
CN (1) CN103927277B (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461957A (en) * 2014-08-28 2015-03-25 浪潮(北京)电子信息产业有限公司 Method and device for heterogeneous multi-core CPU share on-chip caching
CN104809078A (en) * 2015-04-14 2015-07-29 苏州中晟宏芯信息科技有限公司 Exiting and avoiding mechanism based on hardware resource access method of shared cache
CN106502959A (en) * 2016-11-16 2017-03-15 湖南国科微电子股份有限公司 The structure and system in package, pcb board of master chip and Big Dipper chip shared drive
CN107861890A (en) * 2016-09-22 2018-03-30 龙芯中科技术有限公司 Memory access processing method, device and electronic equipment
CN107920025A (en) * 2017-11-20 2018-04-17 北京工业大学 A kind of dynamic routing method towards CPU GPU isomery network-on-chips
CN108009008A (en) * 2016-10-28 2018-05-08 北京市商汤科技开发有限公司 Data processing method and system, electronic equipment
CN108199985A (en) * 2017-12-29 2018-06-22 中国人民解放军国防科技大学 NoC arbitration method based on global node information in GPGPU
CN108733492A (en) * 2018-05-20 2018-11-02 北京工业大学 A kind of batch scheduling memory method divided based on Bank
CN109101443A (en) * 2018-07-27 2018-12-28 天津国芯科技有限公司 A kind of arbitration device and method of weight timesharing
CN109144578A (en) * 2018-06-28 2019-01-04 中国船舶重工集团公司第七0九研究所 A kind of video card resource allocation method and device based on Godson computer
CN109313557A (en) * 2016-07-07 2019-02-05 英特尔公司 The device of local memory access is shared for optimizing GPU thread
CN109992413A (en) * 2019-03-01 2019-07-09 中国科学院计算技术研究所 A kind of accelerator towards breadth-first search algorithm, method and storage medium
CN110223214A (en) * 2019-06-10 2019-09-10 西安博图希电子科技有限公司 A kind of method, apparatus and computer storage medium reducing texture cell amount of access
CN110457238A (en) * 2019-07-04 2019-11-15 中国民航大学 The method paused when slowing down GPU access request and instruction access cache
CN110688335A (en) * 2019-09-11 2020-01-14 上海高性能集成电路设计中心 Device for splitting cache space storage instruction into independent micro-operations
CN111190735A (en) * 2019-12-30 2020-05-22 湖南大学 Linux-based on-chip CPU/GPU (Central processing Unit/graphics processing Unit) pipelined computing method and computer system
CN112764668A (en) * 2019-11-01 2021-05-07 伊姆西Ip控股有限责任公司 Method, electronic device and computer program product for expanding GPU memory
CN112783803A (en) * 2021-01-27 2021-05-11 于慧 Computer CPU-GPU shared cache control method and system
CN113377688A (en) * 2021-05-13 2021-09-10 中国人民解放军军事科学院国防科技创新研究院 L1 cache sharing method for GPU
WO2022222040A1 (en) * 2021-04-20 2022-10-27 华为技术有限公司 Method for accessing cache of graphics processor, graphics processor, and electronic device
CN116166575A (en) * 2023-02-03 2023-05-26 摩尔线程智能科技(北京)有限责任公司 Method, device, equipment, medium and program product for configuring access segment length

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8595425B2 (en) * 2009-09-25 2013-11-26 Nvidia Corporation Configurable cache for multiple clients
CN102270180B (en) * 2011-08-09 2014-04-02 清华大学 Multicore processor cache and management method thereof
CN102902640B (en) * 2012-09-28 2015-07-08 杭州中天微系统有限公司 Request arbitration method and device for consistency multinuclear treater
CN103593306A (en) * 2013-11-15 2014-02-19 浪潮电子信息产业股份有限公司 Design method for Cache control unit of protocol processor

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461957A (en) * 2014-08-28 2015-03-25 浪潮(北京)电子信息产业有限公司 Method and device for heterogeneous multi-core CPU share on-chip caching
CN104809078A (en) * 2015-04-14 2015-07-29 苏州中晟宏芯信息科技有限公司 Exiting and avoiding mechanism based on hardware resource access method of shared cache
CN104809078B (en) * 2015-04-14 2019-05-14 苏州中晟宏芯信息科技有限公司 Based on the shared cache hardware resource access method for exiting yielding mechanism
CN109313557B (en) * 2016-07-07 2023-07-11 英特尔公司 Apparatus for optimizing GPU thread shared local memory access
CN109313557A (en) * 2016-07-07 2019-02-05 英特尔公司 The device of local memory access is shared for optimizing GPU thread
CN107861890A (en) * 2016-09-22 2018-03-30 龙芯中科技术有限公司 Memory access processing method, device and electronic equipment
CN107861890B (en) * 2016-09-22 2020-04-14 龙芯中科技术有限公司 Memory access processing method and device and electronic equipment
CN108009008B (en) * 2016-10-28 2022-08-09 北京市商汤科技开发有限公司 Data processing method and system and electronic equipment
CN108009008A (en) * 2016-10-28 2018-05-08 北京市商汤科技开发有限公司 Data processing method and system, electronic equipment
CN106502959A (en) * 2016-11-16 2017-03-15 湖南国科微电子股份有限公司 Structure, system-in-package and PCB for memory sharing between a main chip and a Beidou chip
CN106502959B (en) * 2016-11-16 2019-09-13 湖南国科微电子股份有限公司 Structure, system-in-package and PCB for memory sharing between a main chip and a Beidou chip
CN107920025A (en) * 2017-11-20 2018-04-17 北京工业大学 Dynamic routing method for CPU-GPU heterogeneous network-on-chip
CN107920025B (en) * 2017-11-20 2021-09-14 北京工业大学 Dynamic routing method for CPU-GPU heterogeneous network on chip
CN108199985A (en) * 2017-12-29 2018-06-22 中国人民解放军国防科技大学 NoC arbitration method based on global node information in GPGPU
CN108199985B (en) * 2017-12-29 2020-07-24 中国人民解放军国防科技大学 NoC arbitration method based on global node information in GPGPU
CN108733492A (en) * 2018-05-20 2018-11-02 北京工业大学 Bank-partition-based batch memory scheduling method
CN109144578A (en) * 2018-06-28 2019-01-04 中国船舶重工集团公司第七0九研究所 Graphics card resource allocation method and device based on a Loongson computer
CN109144578B (en) * 2018-06-28 2021-09-03 中国船舶重工集团公司第七0九研究所 Display card resource allocation method and device based on Loongson computer
CN109101443A (en) * 2018-07-27 2018-12-28 天津国芯科技有限公司 Weight time-sharing arbitration device and method
CN109101443B (en) * 2018-07-27 2021-09-28 天津国芯科技有限公司 Weight time-sharing arbitration device and method
CN109992413A (en) * 2019-03-01 2019-07-09 中国科学院计算技术研究所 Accelerator, method and storage medium for breadth-first search algorithm
CN110223214A (en) * 2019-06-10 2019-09-10 西安博图希电子科技有限公司 Method, apparatus and computer storage medium for reducing texture unit access volume
CN110457238A (en) * 2019-07-04 2019-11-15 中国民航大学 Method for mitigating stalls when GPU memory access requests and instructions access the cache
CN110457238B (en) * 2019-07-04 2023-01-03 中国民航大学 Method for mitigating stalls when GPU memory access requests and instructions access the cache
CN110688335A (en) * 2019-09-11 2020-01-14 上海高性能集成电路设计中心 Device for splitting cache space storage instruction into independent micro-operations
CN112764668A (en) * 2019-11-01 2021-05-07 伊姆西Ip控股有限责任公司 Method, electronic device and computer program product for expanding GPU memory
CN111190735B (en) * 2019-12-30 2024-02-23 湖南大学 Linux-based on-chip CPU/GPU pipelined computing method and computer system
CN111190735A (en) * 2019-12-30 2020-05-22 湖南大学 Linux-based on-chip CPU/GPU (central processing unit/graphics processing unit) pipelined computing method and computer system
CN112783803A (en) * 2021-01-27 2021-05-11 于慧 Computer CPU-GPU shared cache control method and system
CN112783803B (en) * 2021-01-27 2022-11-18 湖南中科长星科技有限公司 Computer CPU-GPU shared cache control method and system
WO2022222040A1 (en) * 2021-04-20 2022-10-27 华为技术有限公司 Method for accessing cache of graphics processor, graphics processor, and electronic device
CN113377688B (en) * 2021-05-13 2022-10-11 中国人民解放军军事科学院国防科技创新研究院 L1 cache sharing method for GPU
CN113377688A (en) * 2021-05-13 2021-09-10 中国人民解放军军事科学院国防科技创新研究院 L1 cache sharing method for GPU
CN116166575A (en) * 2023-02-03 2023-05-26 摩尔线程智能科技(北京)有限责任公司 Method, device, equipment, medium and program product for configuring access segment length
CN116166575B (en) * 2023-02-03 2024-01-23 摩尔线程智能科技(北京)有限责任公司 Method, device, equipment, medium and program product for configuring access segment length

Also Published As

Publication number Publication date
CN103927277B (en) 2017-01-04

Similar Documents

Publication Publication Date Title
CN103927277B (en) Method and device for sharing on-chip cache between CPU and GPU
US8180981B2 (en) Cache coherent support for flash in a memory hierarchy
US8230179B2 (en) Administering non-cacheable memory load instructions
CN103370696B (en) Multi-core system and core data reading method
CN101853227B (en) Method and processor for improving processing of memory-mapped input/output requests
US8078862B2 (en) Method for assigning physical data address range in multiprocessor system
KR101511972B1 (en) Methods and apparatus for efficient communication between caches in hierarchical caching design
US9135177B2 (en) Scheme to escalate requests with address conflicts
US10761986B2 (en) Redirecting data to improve page locality in a scalable data fabric
CN105556503B (en) Dynamic memory control method and system thereof
Bock et al. Concurrent page migration for mobile systems with OS-managed hybrid memory
KR20160099722A (en) Integrated circuits with cache-coherency
US9183150B2 (en) Memory sharing by processors
CN104461957A (en) Method and device for sharing on-chip cache among heterogeneous multi-core CPUs
CN117377943A (en) Compute-in-memory parallel processing system and method
US10754791B2 (en) Software translation prefetch instructions
US20220197506A1 (en) Data placement with packet metadata
CN104049951A (en) Replaying memory transactions while resolving memory access faults
JP6055456B2 (en) Method and apparatus for efficient communication between caches in a hierarchical cache design
CN104049905A (en) Migrating pages of different sizes between heterogeneous processors
El-Kustaban et al. Design and Implementation of a Chip Multiprocessor with an Efficient Multilevel Cache System

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant