CN103927277A - CPU (central processing unit) and GPU (graphic processing unit) on-chip cache sharing method and device - Google Patents

Info

Publication number
CN103927277A
CN103927277A (application CN201410147375.0A)
Authority
CN
China
Prior art keywords
access request
gpu
cpu
speed cache
write
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410147375.0A
Other languages
Chinese (zh)
Other versions
CN103927277B (en)
Inventor
石伟
邓宇
郭御风
龚锐
任巨
张明
马爱永
高正坤
窦强
童元满
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Application filed by National University of Defense Technology
Priority to CN201410147375.0A
Publication of CN103927277A
Application granted
Publication of CN103927277B
Legal status: Active

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a method and device for sharing an on-chip cache between a CPU (central processing unit) and a GPU (graphics processing unit). The method includes: buffering memory access requests from the CPU and the GPU in separate queues; arbitrating among the buffered requests of the different types; when executing a CPU access request, routing its read and write data through the cache; when executing a GPU access request, letting the data read from or written to external memory bypass the cache and operating on external memory directly, notifying a CPU core to invalidate or update its private data copy only when a write hits in the cache. The device comprises a CPU request queue, a GPU request queue, an arbiter and a cache pipeline execution unit. The method and device accommodate the different access characteristics of the CPU and the GPU simultaneously, deliver high performance, and are simple and cheap to implement in hardware.

Description

Method and device for sharing an on-chip cache between a CPU and a GPU
Technical field
The present invention relates to the field of computer microprocessors, and in particular to a method and device for sharing an on-chip cache between a CPU and a GPU.
Background technology
With the rapid development of very-large-scale integration (VLSI) and embedded technology, the transistor budget of a single chip keeps growing, and system-on-chip (SoC) technology has emerged accordingly. An SoC chip typically integrates multiple IP cores with different functions and offers fairly complete functionality. An SoC chip used in a handheld terminal such as a mobile phone or a PDA can integrate almost the entire functionality of an embedded information processing system, realizing information acquisition, input, storage, processing and output on a single chip. Some current embedded systems (such as mobile phones and game consoles) place high demands on multimedia performance for graphics, image and video processing, so a graphics processing unit (GPU) is often integrated into the SoC chip as well.
In an SoC chip that integrates both a CPU and a GPU, the two processing units generally need to share on-chip resources such as the cache and the memory controller. However, the limited on-chip memory bandwidth can hardly satisfy the high bandwidth requirements of the CPU and the GPU at the same time, so the performance of both suffers to some extent. In addition, the memory access characteristics of the CPU and the GPU differ considerably, which places different requirements on the on-chip cache. CPU access requests are latency-sensitive and demand fast service; GPU access requests are bandwidth-sensitive and demand high-bandwidth service, without which the GPU cannot process the images to be displayed in real time. In summary, the way the on-chip cache is shared affects the performance of both the CPU and the GPU to a certain extent, and the low-latency requirement of the CPU and the high-bandwidth requirement of the GPU cannot both be met. As more and more SoC chips integrate a CPU and a GPU, the memory access contention between them has become a technical problem in urgent need of a solution.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method and device for sharing an on-chip cache between a CPU and a GPU that accommodate the different access characteristics of the CPU and the GPU simultaneously, deliver high performance, and are simple and cheap to implement in hardware.
To solve the above technical problem, the present invention adopts the following technical scheme:
A method for sharing an on-chip cache between a CPU and a GPU, implemented in the following steps:
1) buffer access requests from the CPU and access requests from the GPU in separate queues;
2) arbitrate among the buffered access requests of the different types; the request that wins arbitration enters the pipeline;
3) check the type of the access request entering the pipeline: if it is a CPU request, route the read and write data of the request through the cache while executing it; if it is a GPU request, let the data read from or written to external memory bypass the cache and operate on external memory directly, notifying a CPU core to invalidate or update its private data copy only when a write hits in the cache.
Further, the detailed steps of step 2) are as follows:
2.1) arbitrate among the buffered access requests of the different types according to a preset priority state value that indicates which queue has priority: if the priority state value indicates CPU priority, take one request from the buffered CPU requests as the arbitration winner and send it into the pipeline; otherwise, if the priority state value indicates GPU priority, take one request from the buffered GPU requests as the arbitration winner and send it into the pipeline;
2.2) update the priority state value for the next round of arbitration according to a preset update strategy.
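As a minimal software sketch of steps 2.1)-2.2) (the patent describes hardware, so this is illustrative only; the queue names are invented here, and serving the other queue when the priority queue is empty is an assumption this sketch makes, not something the text states):

```python
from collections import deque

def arbitrate(cpu_queue: deque, gpu_queue: deque, cpu_priority: bool):
    """Select the access request that wins arbitration (step 2.1).

    The queue indicated by the priority state value is served first.
    Falling back to the other queue when it is empty is an assumption
    of this sketch, so the pipeline is not left idle.
    """
    primary, secondary = (
        (cpu_queue, gpu_queue) if cpu_priority else (gpu_queue, cpu_queue)
    )
    if primary:
        return primary.popleft()
    if secondary:
        return secondary.popleft()
    return None  # both queues empty: nothing enters the pipeline
```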
Further, the detailed steps of step 2.2) are as follows:
2.2.1) examine the current priority state value: if it indicates CPU priority, go to step 2.2.2), otherwise go to step 2.2.3);
2.2.2) check the buffered GPU access requests and the current GPU bandwidth usage: if the buffered GPU requests are empty or the current GPU bandwidth usage meets its requirement, keep the priority state value at CPU priority for the next arbitration, completing the update; if the buffered GPU requests are non-empty and the current GPU bandwidth usage does not meet its requirement, set the priority state value to GPU priority for the next arbitration, completing the update;
2.2.3) check the buffered CPU core access requests and the current GPU bandwidth usage: if the buffered CPU core requests are empty or the current GPU bandwidth usage does not meet its requirement, keep the priority state value at GPU priority for the next arbitration, completing the update; if the buffered CPU core requests are non-empty and the current GPU bandwidth usage meets its requirement, set the priority state value to CPU priority for the next arbitration, completing the update.
Further, the detailed steps of step 3) are as follows:
3.1) check the type of the access request entering the pipeline: when it is a CPU request, go to step 3.2) if the operation is a read, otherwise go to step 3.3); when it is a GPU request, go to step 3.4) if the operation is a read, otherwise go to step 3.5);
3.2) determine whether the request hits in the cache: on a hit, return the hit data directly to the CPU core that issued the request; on a miss, fetch the read data from external memory, fill it into the cache and return it to the CPU core that issued the request; the request is then complete;
3.3) determine whether the request hits in the cache: on a hit, write the data into the cache to replace the hit data, and send the CPU cores a command to invalidate or update their private data copies; on a miss, follow the write-allocate policy and write the data into a newly allocated cache block; the request is then complete;
3.4) determine whether the request hits in the cache: on a hit, return the hit data directly to the requesting GPU; on a miss, fetch the read data from external memory and return it directly to the requesting GPU without writing it into the cache; the request is then complete;
3.5) determine whether the request hits in the cache: on a hit, write the data into the cache to replace the hit data, then send the CPU cores a command to invalidate or update their private data copies; on a miss, follow the no-write-allocate policy and write the data only to external memory, not into the cache; the request is then complete.
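The four cases in steps 3.2)-3.5) can be sketched as a small software model. This is an illustration under assumptions, not the patented hardware: the cache and external memory are modeled as plain dicts, and the second return value flags whether the CPU cores must be told to invalidate or update their private data copies.

```python
def execute(cache: dict, memory: dict, src: str, op: str, addr: int, value=None):
    """Execute one access request per steps 3.1)-3.5).

    Returns (read_data, notify_cpu); notify_cpu is True when the CPU
    cores must invalidate or update their private data copies.
    """
    hit = addr in cache
    if src == "CPU":
        if op == "read":
            if not hit:
                cache[addr] = memory[addr]    # 3.2) miss: fetch and fill the cache
            return cache[addr], False
        cache[addr] = value                   # 3.3) hit: replace; miss: write-allocate
        return None, hit                      # notify cores only on a write hit
    # GPU requests bypass the cache (steps 3.4-3.5)
    if op == "read":
        return (cache[addr] if hit else memory[addr]), False
    if hit:
        cache[addr] = value                   # 3.5) write hit: update cache, notify
        return None, True
    memory[addr] = value                      # 3.5) write miss: no-write-allocate
    return None, False
```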
The present invention also provides a device for sharing an on-chip cache between a CPU and a GPU, comprising:
a CPU request queue and a GPU request queue, for buffering access requests from the CPU and access requests from the GPU in separate queues;
an arbiter, for arbitrating among the buffered access requests of the different types, the request that wins arbitration entering the pipeline;
a cache pipeline execution unit, for checking the type of the access request entering the pipeline: if it is a CPU request, it routes the read and write data of the request through the cache while executing it; if it is a GPU request, it lets the data read from or written to external memory bypass the cache and operates on external memory directly, notifying a CPU core to invalidate or update its private data copy only when a write hits in the cache.
Further, the arbiter comprises:
a dynamic priority arbitration module, for arbitrating among the buffered access requests of the different types according to a preset priority state value that indicates which queue has priority: if the priority state value indicates CPU priority, it takes one request from the buffered CPU requests as the arbitration winner and sends it into the pipeline; otherwise, if the priority state value indicates GPU priority, it takes one request from the buffered GPU requests as the arbitration winner and sends it into the pipeline;
a priority state value update module, for updating the priority state value for the next arbitration according to a preset update strategy.
Further, the priority state value update module comprises:
a priority state value judgment submodule, for examining the current priority state value: if it indicates CPU priority, it calls the CPU core priority control submodule, otherwise it calls the GPU priority control submodule;
a CPU core priority control submodule, for checking the buffered GPU access requests and the current GPU bandwidth usage: if the buffered GPU requests are empty or the current GPU bandwidth usage meets its requirement, it keeps the priority state value at CPU priority for the next arbitration, completing the update; if the buffered GPU requests are non-empty and the current GPU bandwidth usage does not meet its requirement, it sets the priority state value to GPU priority for the next arbitration;
a GPU priority control submodule, for checking the buffered CPU core access requests and the current GPU bandwidth usage: if the buffered CPU core requests are empty or the current GPU bandwidth usage does not meet its requirement, it keeps the priority state value at GPU priority for the next arbitration, completing the update; if the buffered CPU core requests are non-empty and the current GPU bandwidth usage meets its requirement, it sets the priority state value to CPU priority for the next arbitration.
Further, the cache pipeline execution unit comprises:
an access request checking module, for checking the type of the access request entering the pipeline: when it is a CPU request, it calls the CPU read execution module if the operation is a read, otherwise the CPU write execution module; when it is a GPU request, it calls the GPU read execution module if the operation is a read, otherwise the GPU write execution module;
a CPU read execution module, for determining whether the request hits in the cache: on a hit, it returns the hit data directly to the CPU core that issued the request; on a miss, it fetches the read data from external memory, fills it into the cache and returns it to the CPU core that issued the request;
a CPU write execution module, for determining whether the request hits in the cache: on a hit, it writes the data into the cache to replace the hit data and sends the CPU cores a command to invalidate or update their private data copies; on a miss, it follows the write-allocate policy and writes the data into a newly allocated cache block;
a GPU read execution module, for determining whether the request hits in the cache: on a hit, it returns the hit data directly to the requesting GPU; on a miss, it fetches the read data from external memory and returns it directly to the requesting GPU without writing it into the cache;
a GPU write execution module, for determining whether the request hits in the cache: on a hit, it writes the data into the cache to replace the hit data, then sends the CPU cores a command to invalidate or update their private data copies; on a miss, it follows the no-write-allocate policy and writes the data only to external memory, not into the cache.
The method of the present invention for sharing an on-chip cache between a CPU and a GPU has the following advantages:
1. In accordance with the memory access characteristics of the CPU and the GPU, the present invention buffers CPU requests and GPU requests in separate queues, arbitrates among the buffered requests of the different types, and sends the winner into the pipeline. When the pipeline executes a CPU request, the read and write data pass through the cache; when it executes a GPU request, the data read from or written to external memory bypass the cache and external memory is operated on directly, with a CPU core notified to invalidate or update its private data copy only when a write hits in the cache. The invention thus optimizes each kind of request separately: CPU programs (the sources of CPU access requests) exhibit strong locality, so their data enter the cache to improve program performance, whereas the data accessed by GPU programs (the sources of GPU access requests) are streaming in nature and exhibit poor locality, so they generally bypass the cache, greatly reducing their impact on CPU programs. The invention therefore accommodates the different access characteristics of the CPU and the GPU simultaneously and achieves high performance.
2. Because the invention merely buffers CPU and GPU requests in separate queues and arbitrates among them before the winner enters the pipeline for execution, it lets the GPU share the cache on top of an existing multi-core processor design, largely retains the structure by which a multi-core processor accesses its cache, and changes the original processor structure little, so the hardware is simple and the cost is low. Moreover, the modified cache structure can still be used in multi-core processors without a GPU, so it also offers good compatibility.
3. Because the invention buffers CPU and GPU requests in separate queues and arbitrates among them before the winner enters the pipeline, different arbitration strategies can be selected flexibly according to the memory access characteristics of the CPU and the GPU, so whichever CPU or GPU program most urgently needs service is given priority, improving the flexibility of the system.
The device of the present invention for sharing an on-chip cache between a CPU and a GPU corresponds to the method of the present invention and therefore has the same technical effects, which are not repeated here.
Brief description of the drawings
Fig. 1 is a flowchart of the method of the embodiment of the present invention.
Fig. 2 is a schematic diagram of arbitration among the different types of access requests in the method of the embodiment.
Fig. 3 is the state machine diagram for updating the priority state value in the method of the embodiment.
Fig. 4 is a logical block diagram of the device of the embodiment.
Fig. 5 is a logical block diagram of the arbiter in the device of the embodiment.
Fig. 6 is a block diagram of the dynamic priority arbitration module in the device of the embodiment.
Fig. 7 is a logical block diagram of the priority state value update module in the device of the embodiment.
Fig. 8 is a logical block diagram of the cache pipeline execution unit in the device of the embodiment.
Fig. 9 is a block diagram of a microprocessor applying the device of the embodiment.
Fig. 10 is a block diagram of the cache pipeline execution unit in the device of the embodiment.
Fig. 11 is a flowchart of the device of the embodiment executing a CPU access request.
Fig. 12 is a flowchart of the device of the embodiment executing a GPU access request.
Embodiments
As shown in Fig. 1, the method of this embodiment for sharing an on-chip cache between a CPU and a GPU is implemented in the following steps:
1) buffer access requests from the CPU and access requests from the GPU in separate queues;
2) arbitrate among the buffered access requests of the different types; the request that wins arbitration enters the pipeline;
3) check the type of the access request entering the pipeline: if it is a CPU request, route the read and write data of the request through the cache while executing it; if it is a GPU request, let the data read from or written to external memory bypass the cache and operate on external memory directly, notifying a CPU core to invalidate or update its private data copy only when a write hits in the cache.
As shown in Fig. 2, this embodiment arbitrates among the buffered access requests of the different types, and the winner enters the pipeline: if a buffered CPU request wins, that request from the CPU enters the pipeline; if a buffered GPU request wins, that request from the GPU enters the pipeline. In this embodiment, the detailed steps of step 2) are as follows:
2.1) arbitrate among the buffered access requests of the different types according to a preset priority state value that indicates which queue has priority: if the priority state value indicates CPU priority, take one request from the buffered CPU requests as the arbitration winner and send it into the pipeline; otherwise, if the priority state value indicates GPU priority, take one request from the buffered GPU requests as the arbitration winner and send it into the pipeline;
2.2) update the priority state value for the next arbitration according to a preset update strategy.
Combining steps 2.1)-2.2): this embodiment adopts dynamic priority arbitration, in which the buffered access requests of the different types are arbitrated according to the priority state value (if it indicates CPU priority, one buffered CPU request is taken as the winner and enters the pipeline; otherwise, if it indicates GPU priority, one buffered GPU request is taken as the winner and enters the pipeline), and after every arbitration the priority state value is updated for the next round according to the preset update strategy. This dynamic priority arbitration can change the priority of the requests according to the current state and demands of the system, so whichever CPU-side or GPU-side program most urgently needs service is given priority, improving the flexibility of the system.
In this embodiment, the detailed steps of step 2.2) are as follows:
2.2.1) examine the current priority state value: if it indicates CPU priority, go to step 2.2.2), otherwise go to step 2.2.3);
2.2.2) check the buffered GPU access requests and the current GPU bandwidth usage: if the buffered GPU requests are empty or the current GPU bandwidth usage meets its requirement, keep the priority state value at CPU priority for the next arbitration, completing the update; if the buffered GPU requests are non-empty and the current GPU bandwidth usage does not meet its requirement, set the priority state value to GPU priority for the next arbitration, completing the update;
2.2.3) check the buffered CPU core access requests and the current GPU bandwidth usage: if the buffered CPU core requests are empty or the current GPU bandwidth usage does not meet its requirement, keep the priority state value at GPU priority for the next arbitration, completing the update; if the buffered CPU core requests are non-empty and the current GPU bandwidth usage meets its requirement, set the priority state value to CPU priority for the next arbitration, completing the update.
Combining steps 2.2.1)-2.2.3): this embodiment weighs the priority of CPU requests against GPU requests according to the current system state, which comprises three quantities: the number of buffered GPU requests, the number of buffered CPU core requests, and the current GPU bandwidth usage. By taking all three into account, the limited on-chip memory bandwidth can largely satisfy the bandwidth requirements of the CPU and the GPU at the same time, so both achieve near-optimal performance. The performance degradation caused by memory access contention between the CPU and the GPU is thus largely overcome, the overall performance of the processor is improved, whichever CPU or GPU program most urgently needs service is given priority, and the flexibility of the system is improved.
As shown in Fig. 3, this embodiment denotes the priority state value by Priority. When Priority=0, the CPU request queue has higher priority than the GPU, i.e. Priority=0 means CPU priority; when Priority=1, the GPU request queue has higher priority than the CPU, i.e. Priority=1 means GPU priority. The system initializes the priority state value to 0 by default, i.e. Priority=0, defaulting to CPU priority. When Priority=0: if the buffered GPU requests are empty or the current GPU bandwidth usage meets its requirement, the state stays at CPU priority (Priority=0) for the next arbitration and the update is complete; if the buffered GPU requests are non-empty and the current GPU bandwidth usage does not meet its requirement, the state is set to GPU priority (Priority=1) for the next arbitration. When Priority=1: if the buffered CPU core requests are empty or the current GPU bandwidth usage does not meet its requirement, the state stays at GPU priority (Priority=1) for the next arbitration and the update is complete; if the buffered CPU core requests are non-empty and the current GPU bandwidth usage meets its requirement, the state is set to CPU priority (Priority=0) for the next arbitration.
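The two-state machine of Fig. 3 can be sketched as a pure function. This is an illustration only: `gpu_bw_ok` stands for "the current GPU bandwidth usage meets its requirement" and abstracts a measurement whose implementation the text leaves unspecified.

```python
def update_priority(priority: int, cpu_pending: bool, gpu_pending: bool,
                    gpu_bw_ok: bool) -> int:
    """One transition of the Fig. 3 state machine.

    priority: 0 = CPU priority, 1 = GPU priority (the Priority encoding).
    cpu_pending / gpu_pending: whether each request queue is non-empty.
    gpu_bw_ok: whether the GPU's bandwidth requirement is currently met.
    """
    if priority == 0:
        # Stay CPU-priority unless GPU work is queued and starved of bandwidth.
        return 1 if (gpu_pending and not gpu_bw_ok) else 0
    # Priority == 1: return to CPU priority once CPU work is queued
    # and the GPU's bandwidth requirement is already satisfied.
    return 0 if (cpu_pending and gpu_bw_ok) else 1
```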
In this embodiment, the detailed steps of step 3) are as follows:
3.1) check the type of the access request entering the pipeline: when it is a CPU request, go to step 3.2) if the operation is a read, otherwise go to step 3.3); when it is a GPU request, go to step 3.4) if the operation is a read, otherwise go to step 3.5);
3.2) determine whether the request hits in the cache: on a hit, return the hit data directly to the CPU core that issued the request; on a miss, fetch the read data from external memory, fill it into the cache and return it to the CPU core that issued the request; the request is then complete;
3.3) determine whether the request hits in the cache: on a hit, write the data into the cache to replace the hit data, and send the CPU cores a command to invalidate or update their private data copies; on a miss, follow the write-allocate policy and write the data into a newly allocated cache block; the request is then complete;
3.4) determine whether the request hits in the cache: on a hit, return the hit data directly to the requesting GPU; on a miss, fetch the read data from external memory and return it directly to the requesting GPU without writing it into the cache; the request is then complete;
3.5) determine whether the request hits in the cache: on a hit, write the data into the cache to replace the hit data, then send the CPU cores a command to invalidate or update their private data copies; on a miss, follow the no-write-allocate policy and write the data only to external memory, not into the cache; the request is then complete.
Summarizing steps 3.1)-3.5): when this embodiment executes a CPU access request, the data read or written are cached in the cache, and invalidate or update messages are sent to the other CPUs to maintain cache coherence among the multiple CPUs; on a CPU read miss, the data read from external memory are filled into the cache as a replacement; on a CPU write miss, the write-allocate policy is followed; when a CPU write is executed, invalidate or update messages are sent to the other CPUs to maintain coherence. When this embodiment executes a GPU access request, the data read or written are kept out of the cache as far as possible: only when a write hits in the cache are invalidate or update messages sent to the other CPUs to maintain coherence; on a read miss, the data are sent from external memory directly to the GPU core; on a write miss, the data are written directly to external memory, following the no-write-allocate policy.
As shown in Figure 4, the device of this embodiment for sharing an on-chip cache between a CPU and a GPU comprises:
A CPU request queue 1 and a GPU request queue 2, for buffering access requests from the CPU and from the GPU by class; these access requests are represented as messages.
An arbiter 3, for arbitrating among the different classes of buffered access requests; the winning access request enters the pipeline.
A cache pipeline execution unit 4, for checking the request type of the access request entering the pipeline. If the access request comes from the CPU, its read and write data pass through the cache when it is executed. If the access request comes from the GPU, the data it reads from or writes to the external memory bypasses the cache and the external memory is operated on directly; only when a write hits in the cache are the CPU cores notified to invalidate or update their private data backups.
As shown in Figure 5, the arbiter 3 of this embodiment comprises:
A dynamic priority arbitration module 31, for arbitrating among the different classes of buffered access requests according to a preset priority state value that indicates the priority level of each class. If the priority state value indicates CPU priority, one access request is taken from the buffered CPU access requests as the arbitration winner and enters the pipeline; otherwise, if it indicates GPU priority, one access request is taken from the buffered GPU access requests as the arbitration winner and enters the pipeline.
A priority state value update module 32, for updating the priority state value for the next arbitration according to a preset update strategy.
As shown in Figure 6, the dynamic priority arbitration module 31 specifically comprises an arbitration execution module and a selector. The arbitration execution module arbitrates among the different classes of buffered requests according to the preset priority state value, and the selector, according to the arbitration result, takes one access request from the buffered CPU or GPU requests as the winner and sends it into the pipeline. After access requests have entered the CPU request queue 1 and the GPU request queue 2 respectively, the arbiter 3 arbitrates through the dynamic priority arbitration module 31: if a CPU access request wins arbitration, it enters the pipeline; conversely, if a GPU access request wins, that GPU request enters the pipeline.
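The selection performed by the arbitration execution module and selector can be sketched as follows. One detail is our assumption, not stated in the patent: when the preferred queue is empty, the sketch serves the other queue rather than idling the pipeline.

```python
from collections import deque

# Sketch of the dynamic priority arbitration of Figures 5-6: the selector pops
# one request from the queue named by the current priority state value.

CPU_PRIORITY, GPU_PRIORITY = "CPU", "GPU"

def arbitrate(cpu_queue, gpu_queue, priority_state):
    """Pick the winning request for this cycle, or None if nothing is pending."""
    preferred = cpu_queue if priority_state == CPU_PRIORITY else gpu_queue
    other = gpu_queue if priority_state == CPU_PRIORITY else cpu_queue
    if preferred:
        return preferred.popleft()   # winner enters the pipeline
    if other:
        return other.popleft()       # assumption: serve the other class when idle
    return None
```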
As shown in Figure 7, the priority state value update module 32 of this embodiment comprises:
A priority state value judging submodule 321, for examining the current priority state value: if it indicates CPU priority, the CPU core priority status control submodule 322 is called; otherwise the GPU priority status control submodule 323 is called.
A CPU core priority status control submodule 322, for checking the buffered GPU access requests and the current GPU bandwidth utilization: if the GPU request queue is empty or the current GPU bandwidth meets the requirement, the priority state value is kept at CPU priority for the next arbitration and the update is complete; if the GPU request queue is non-empty and the current GPU bandwidth does not meet the requirement, the priority state value is set to GPU priority for the next arbitration.
A GPU priority status control submodule 323, for checking the buffered CPU core access requests and the current GPU bandwidth utilization: if the CPU request queue is empty or the current GPU bandwidth does not meet the requirement, the priority state value is kept at GPU priority for the next arbitration and the update is complete; if the CPU request queue is non-empty and the current GPU bandwidth meets the requirement, the priority state value is set to CPU priority for the next arbitration.
From the structure of the priority state value update module 32 above, the arbiter 3 of this embodiment can weigh the priority of CPU requests against GPU requests according to the current system state, and thereby optimize system performance. CPU programs are latency-sensitive and their access requests need fast service, whereas the GPU is more sensitive to bandwidth. Therefore, as long as the GPU's memory bandwidth requirement is satisfied, CPU access requests have higher priority than GPU access requests. A hardware counter records the actual memory bandwidth consumed by the GPU per unit time: if the bandwidth is below expectation, the GPU's access requests need a priority boost to prevent discontinuities in its graphics and image processing; if the bandwidth is above expectation, the GPU is working normally and more bandwidth can be allocated to the CPU to improve CPU performance.
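The update strategy of Figure 7 can be sketched as a pure function of the current state. One assumption of ours: "bandwidth meets the requirement" is modeled as the counter reaching a fixed threshold.

```python
# Sketch of the priority-state update of Figure 7. "Bandwidth meets the
# requirement" is modeled (our assumption) as measured bandwidth >= threshold.

def update_priority(state, cpu_queue_empty, gpu_queue_empty,
                    gpu_bandwidth, required_bandwidth):
    """Return the priority state value to use for the next arbitration."""
    gpu_bw_ok = gpu_bandwidth >= required_bandwidth
    if state == "CPU":
        # Submodule 322: yield to the GPU only if it has pending requests
        # and its bandwidth requirement is not being met.
        if not gpu_queue_empty and not gpu_bw_ok:
            return "GPU"
        return "CPU"
    # Submodule 323: return priority to the CPU once it has pending requests
    # and the GPU's bandwidth requirement is satisfied.
    if not cpu_queue_empty and gpu_bw_ok:
        return "CPU"
    return "GPU"
```

This makes the stated trade-off explicit: the CPU keeps priority by default, and the GPU preempts it only while its measured bandwidth lags the requirement.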
As shown in Figure 8, the cache pipeline execution unit 4 of this embodiment comprises:
An access request checking module 41, for checking the request type of the access request entering the pipeline. When the request type is CPU, the CPU read operation execution module 42 is called if the operation is a read, otherwise the CPU write operation execution module 43 is called; when the request type is GPU, the GPU read operation execution module 44 is called if the operation is a read, otherwise the GPU write operation execution module 45 is called.
A CPU read operation execution module 42, for determining whether the access request hits in the cache: on a hit, the hit data is returned directly to the CPU core that issued the request; on a miss, the external memory is accessed to fetch the read data, which is filled into the cache and returned to the issuing CPU core.
A CPU write operation execution module 43, for determining whether the access request hits in the cache: on a hit, the write data replaces the hit data in the cache and an order to invalidate or update the private data backup is sent to the CPU cores; on a miss, the write-allocate principle is applied and the write data is written into a newly allocated cache block.
A GPU read operation execution module 44, for determining whether the access request hits in the cache: on a hit, the hit data is returned directly to the GPU; on a miss, the external memory is accessed to fetch the read data, which is returned directly to the requesting GPU without being written into the cache.
A GPU write operation execution module 45, for determining whether the access request hits in the cache: on a hit, the write data replaces the hit data in the cache and an order to invalidate or update the private data backup is sent to the CPU cores; on a miss, the no-write-allocate principle is applied and the write data is written only to the external memory, not into the cache.
As shown in Figure 9, a multi-core microprocessor (SoC chip) adopting the device of this embodiment consists of n CPU cores (core 1 to core n), one GPU, one shared cache and one memory controller, with each CPU core configured with its own private cache. The n CPU cores and the GPU core are connected to the shared cache through an interconnect such as a bus, a crossbar or an on-chip network, and the cache accesses the external memory through the memory controller. All data accessed by the CPU cores passes through the cache, which maintains cache coherence, while data accessed by the GPU core generally bypasses the cache, avoiding the occupation of a large amount of cache capacity.
As shown in Figure 10, the cache pipeline execution unit 4 comprises the following components. A cache Tag module records the address information of the data stored in the cache and is used to determine whether accessed data has a backup in the cache. A Tag state module records status information for the cache Tags, such as whether a Tag is valid and whether the data corresponding to a Tag has been modified. A cache data module records the data corresponding to the addresses recorded by the Tags. A cache directory module records the Tag portion of the upper-level caches and is used for cache coherence management: when a write by the CPU or GPU to an address hits, the coherence management hardware sends invalidation or update messages, according to the directory information, to the CPUs holding data for that address. A CPU write buffer records the data that the CPU needs to write to the external memory; when the CPU performs a write under the write-through policy, the data is written to the external memory as well as the cache, and the write data is placed in the CPU write buffer. A GPU write buffer records the data that the GPU needs to write to the external memory; on a GPU write miss, the data is written to the external memory directly through the GPU write buffer and no longer enters the cache. A read buffer records the results returned from external memory reads by the CPU and GPU. A miss buffer: when a CPU or GPU read or write misses in the cache, the corresponding read or write request is sent to the external memory, and the miss buffer records the request and its address. A CPU return queue and a GPU return queue record the data the cache returns to the CPU and to the GPU respectively; the GPU return queue records only the data the GPU is to read, whereas the CPU return queue records not only the data the CPU is to read but also information such as invalidations and updates for coherence management.
The cache directory records all CPU data accesses but no GPU data accesses. When the CPU or GPU writes the cache, the coherence hardware invalidates or updates the related data in the CPUs according to the directory state, so the CPUs always obtain up-to-date data; but because GPU-accessed data does not enter the cache, the coherence hardware cannot guarantee that all data accessed by the GPU is up to date. Although this embodiment does not guarantee in hardware that GPU-accessed data is the latest, the usage environment ensures that the proposed cache device works correctly: in an SoC integrating a CPU and a GPU, the CPU usually controls the execution of the GPU and supplies it with the necessary data; while the GPU runs, the CPU and GPU are in relatively independent running states and normally do not exchange data, so CPU writes normally do not touch the GPU's address space and no data inconsistency arises.
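The directory lookup on a write hit can be sketched as follows. This is a toy model, not the patent's hardware: the directory is a plain dict from address to the set of CPU core ids holding a private copy, and `writer=None` stands for a GPU writer (which keeps no copy).

```python
# Toy sketch of the cache-directory lookup: on a write hit, invalidation
# messages go only to the CPU cores that the directory records as sharers.

def write_hit_invalidations(directory, addr, writer=None):
    """Return the sorted list of CPU core ids to invalidate; update the directory.

    directory: dict mapping address -> set of CPU core ids with a private copy.
    writer: the writing CPU core id, or None for a GPU writer.
    """
    sharers = directory.get(addr, set())
    targets = sorted(c for c in sharers if c != writer)  # never message the writer
    # Under an invalidation protocol the stale private copies are dropped;
    # a CPU writer keeps its own copy, a GPU writer keeps none.
    directory[addr] = set() if writer is None else {writer}
    return targets
```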
As shown in Figure 11, this embodiment executes an access request from the CPU as follows:
A1) A CPU core issues an access request.
A2) The access request is buffered into the CPU request queue dedicated to CPU access requests.
A3) The access request is arbitrated; if it loses arbitration it waits for the next round, and when it finally wins arbitration it enters the cache pipeline.
A4) The access request type is discriminated: a read request jumps to step A5), a write request to step A6).
A5) The cache Tag (address tag) and Tag state are accessed to determine whether the read hits in the cache. On a read hit, the cache data is accessed and the correct data is sent into the CPU return queue. On a read miss, the access request is sent into the miss buffer, the correct data is then obtained from the next memory hierarchy level (e.g. the external memory), and this data enters both the CPU return queue and the cache.
A6) The cache Tag (address tag) and Tag state are accessed to determine whether the write hits in the cache. On a write hit, the data is written to the corresponding location in the cache, the cache directory is queried, and invalidation or update messages are sent to the private caches of the other CPUs according to the directory. On a write miss, the write-allocate principle is applied and the data is written into a newly allocated cache block.
As shown in Figure 12, this embodiment executes an access request from the GPU as follows:
B1) The GPU issues an access request.
B2) The access request is buffered into the GPU request queue dedicated to GPU access requests.
B3) The access request is arbitrated; if it loses arbitration it waits for the next round, and when it finally wins arbitration it enters the cache pipeline.
B4) The access request type is discriminated: a read request jumps to step B5), a write request to step B6).
B5) The cache Tag (address tag) and Tag state are accessed to determine whether the read hits in the cache. On a read hit, the cache data is accessed and the correct data is sent into the GPU return queue. On a read miss, the access request is sent into the miss buffer, the correct data is then obtained from the next memory hierarchy level (e.g. the external memory), and this data enters only the GPU return queue, not the cache.
B6) The cache Tag (address tag) and Tag state are accessed to determine whether the write hits in the cache. On a write hit, the data is written to the corresponding location in the cache, the cache directory is queried, and invalidation or update messages are sent to the private caches of the other CPUs according to the directory. On a write miss, the no-write-allocate principle is applied: the data is written directly to the external memory and no backup of it is stored in the cache.
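The GPU path B5)–B6) above can be sketched compactly, with a dict standing in for the tag and data arrays and a list for the miss buffer. The queue, arbitration and coherence-message stages are deliberately simplified away; `gpu_access` is our name for this behavioral illustration, not a function from the patent.

```python
# Behavioral sketch of the GPU read/write path: read misses bypass the cache
# (no fill), write hits update the cached copy, write misses go straight to
# external memory (no-write-allocate).

def gpu_access(cache, memory, miss_buffer, op, addr, data=None):
    hit = addr in cache
    if op == "read":
        if hit:
            return cache[addr]                 # B5: read hit -> GPU return queue
        miss_buffer.append(("read", addr))     # B5: read miss goes via miss buffer
        return memory[addr]                    # returned data is NOT filled into the cache
    if hit:
        cache[addr] = data                     # B6: write hit replaces the cached copy
        # (invalidate/update messages to CPU private caches omitted here)
    else:
        memory[addr] = data                    # B6: write miss -> external memory only
    return None
```

Running a read miss followed by a write miss shows the bypass: the cache stays empty while the miss buffer and external memory record the traffic.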
The above are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical schemes falling under the concept of the present invention belong to its protection scope. It should be pointed out that, for those skilled in the art, improvements and modifications made without departing from the principles of the present invention should also be regarded as within the protection scope of the present invention.

Claims (8)

1. A method for a CPU and a GPU to share an on-chip cache, characterized in that its implementation steps are as follows:
1) buffering access requests from the CPU and access requests from the GPU by class;
2) arbitrating among the different classes of buffered access requests, the winning access request entering the pipeline;
3) checking the request type of the access request entering the pipeline: if the access request is from the CPU, its read and write data pass through the cache when the CPU access request is executed; if the access request is from the GPU, the data it reads from or writes to the external memory bypasses the cache and the external memory is operated on directly when the GPU access request is executed, the CPU cores being notified to invalidate or update their private data backups only when a write hits in the cache.
2. The method for a CPU and a GPU to share an on-chip cache according to claim 1, characterized in that the detailed steps of said step 2) are as follows:
2.1) arbitrating among the different classes of buffered access requests according to a preset priority state value indicating the priority level of each class: if the priority state value indicates CPU priority, taking one access request from the buffered CPU access requests as the arbitration winner to enter the pipeline; otherwise, if the priority state value indicates GPU priority, taking one access request from the buffered GPU access requests as the arbitration winner to enter the pipeline;
2.2) updating the priority state value for the next arbitration according to a preset priority state value update strategy.
3. The method for a CPU and a GPU to share an on-chip cache according to claim 2, characterized in that the detailed steps of said step 2.2) are as follows:
2.2.1) examining the current priority state value: if it indicates CPU priority, jumping to step 2.2.2), otherwise jumping to step 2.2.3);
2.2.2) checking the buffered GPU access requests and the current GPU bandwidth utilization: if the GPU request queue is empty or the current GPU bandwidth meets the requirement, keeping the priority state value at CPU priority for the next arbitration, the update being complete; if the GPU request queue is non-empty and the current GPU bandwidth does not meet the requirement, setting the priority state value to GPU priority for the next arbitration, the update being complete;
2.2.3) checking the buffered CPU core access requests and the current GPU bandwidth utilization: if the CPU request queue is empty or the current GPU bandwidth does not meet the requirement, keeping the priority state value at GPU priority for the next arbitration, the update being complete; if the CPU request queue is non-empty and the current GPU bandwidth meets the requirement, setting the priority state value to CPU priority for the next arbitration, the update being complete.
4. The method for a CPU and a GPU to share an on-chip cache according to claim 1, 2 or 3, characterized in that the detailed steps of said step 3) are as follows:
3.1) checking the request type of the access request entering the pipeline: when the request type is CPU, jumping to step 3.2) if the operation type of the access request is a read, otherwise to step 3.3); when the request type is GPU, jumping to step 3.4) if the operation type is a read, otherwise to step 3.5);
3.2) determining whether the access request hits in the cache: on a hit, returning the hit data directly to the CPU core that issued the request; on a miss, accessing the external memory to fetch the read data, filling it into the cache and returning it to the issuing CPU core; the access request is then complete;
3.3) determining whether the access request hits in the cache: on a hit, writing the write data into the cache to replace the hit data and sending an order to invalidate or update the private data backup to the CPU cores; on a miss, applying the write-allocate principle and writing the data into a newly allocated cache block; the access request is then complete;
3.4) determining whether the access request hits in the cache: on a hit, returning the hit data directly to the GPU; on a miss, accessing the external memory to fetch the read data and returning it directly to the requesting GPU without writing it into the cache; the access request is then complete;
3.5) determining whether the access request hits in the cache: on a hit, writing the write data into the cache to replace the hit data and then sending an order to invalidate or update the private data backup to the CPU cores; on a miss, applying the no-write-allocate principle and writing the write data only to the external memory, not into the cache; the access request is then complete.
5. A device for a CPU and a GPU to share an on-chip cache, characterized by comprising:
a CPU request queue (1) and a GPU request queue (2), for buffering access requests from the CPU and access requests from the GPU by class;
an arbiter (3), for arbitrating among the different classes of buffered access requests, the winning access request entering the pipeline;
a cache pipeline execution unit (4), for checking the request type of the access request entering the pipeline: if the access request is from the CPU, its read and write data pass through the cache when the CPU access request is executed; if the access request is from the GPU, the data it reads from or writes to the external memory bypasses the cache and the external memory is operated on directly when the GPU access request is executed, the CPU cores being notified to invalidate or update their private data backups only when a write hits in the cache.
6. The device for a CPU and a GPU to share an on-chip cache according to claim 5, characterized in that said arbiter (3) comprises:
a dynamic priority arbitration module (31), for arbitrating among the different classes of buffered access requests according to a preset priority state value indicating the priority level of each class: if the priority state value indicates CPU priority, taking one access request from the buffered CPU access requests as the arbitration winner to enter the pipeline; otherwise, if the priority state value indicates GPU priority, taking one access request from the buffered GPU access requests as the arbitration winner to enter the pipeline;
a priority state value update module (32), for updating the priority state value for the next arbitration according to a preset priority state value update strategy.
7. The device for a CPU and a GPU to share an on-chip cache according to claim 6, characterized in that said priority state value update module (32) comprises:
a priority state value judging submodule (321), for examining the current priority state value: if it indicates CPU priority, calling the CPU core priority status control submodule (322), otherwise calling the GPU priority status control submodule (323);
a CPU core priority status control submodule (322), for checking the buffered GPU access requests and the current GPU bandwidth utilization: if the GPU request queue is empty or the current GPU bandwidth meets the requirement, keeping the priority state value at CPU priority for the next arbitration, the update being complete; if the GPU request queue is non-empty and the current GPU bandwidth does not meet the requirement, setting the priority state value to GPU priority for the next arbitration;
a GPU priority status control submodule (323), for checking the buffered CPU core access requests and the current GPU bandwidth utilization: if the CPU request queue is empty or the current GPU bandwidth does not meet the requirement, keeping the priority state value at GPU priority for the next arbitration, the update being complete; if the CPU request queue is non-empty and the current GPU bandwidth meets the requirement, setting the priority state value to CPU priority for the next arbitration.
8. The device for a CPU and a GPU to share an on-chip cache according to claim 5, 6 or 7, characterized in that said cache pipeline execution unit (4) comprises:
an access request checking module (41), for checking the request type of the access request entering the pipeline: when the request type is CPU, calling the CPU read operation execution module (42) if the operation type of the access request is a read, otherwise calling the CPU write operation execution module (43); when the request type is GPU, calling the GPU read operation execution module (44) if the operation type is a read, otherwise calling the GPU write operation execution module (45);
a CPU read operation execution module (42), for determining whether the access request hits in the cache: on a hit, returning the hit data directly to the CPU core that issued the request; on a miss, accessing the external memory to fetch the read data, filling it into the cache and returning it to the issuing CPU core;
a CPU write operation execution module (43), for determining whether the access request hits in the cache: on a hit, writing the write data into the cache to replace the hit data and sending an order to invalidate or update the private data backup to the CPU cores; on a miss, applying the write-allocate principle and writing the data into a newly allocated cache block;
a GPU read operation execution module (44), for determining whether the access request hits in the cache: on a hit, returning the hit data directly to the GPU; on a miss, accessing the external memory to fetch the read data and returning it directly to the requesting GPU without writing it into the cache;
a GPU write operation execution module (45), for determining whether the access request hits in the cache: on a hit, writing the write data into the cache to replace the hit data and then sending an order to invalidate or update the private data backup to the CPU cores; on a miss, applying the no-write-allocate principle and writing the write data only to the external memory, not into the cache.
CN201410147375.0A 2014-04-14 2014-04-14 CPU and GPU shares the method and device of on chip cache Active CN103927277B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410147375.0A CN103927277B (en) 2014-04-14 2014-04-14 CPU and GPU shares the method and device of on chip cache


Publications (2)

Publication Number Publication Date
CN103927277A true CN103927277A (en) 2014-07-16
CN103927277B CN103927277B (en) 2017-01-04

Family

ID=51145500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410147375.0A Active CN103927277B (en) 2014-04-14 2014-04-14 CPU and GPU shares the method and device of on chip cache

Country Status (1)

Country Link
CN (1) CN103927277B (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461957A (en) * 2014-08-28 2015-03-25 浪潮(北京)电子信息产业有限公司 Method and device for heterogeneous multi-core CPU share on-chip caching
CN104809078A (en) * 2015-04-14 2015-07-29 苏州中晟宏芯信息科技有限公司 Exiting and avoiding mechanism based on hardware resource access method of shared cache
CN106502959A (en) * 2016-11-16 2017-03-15 湖南国科微电子股份有限公司 The structure and system in package, pcb board of master chip and Big Dipper chip shared drive
CN107861890A (en) * 2016-09-22 2018-03-30 龙芯中科技术有限公司 Memory access processing method, device and electronic equipment
CN107920025A (en) * 2017-11-20 2018-04-17 北京工业大学 A kind of dynamic routing method towards CPU GPU isomery network-on-chips
CN108009008A (en) * 2016-10-28 2018-05-08 北京市商汤科技开发有限公司 Data processing method and system, electronic equipment
CN108199985A (en) * 2017-12-29 2018-06-22 中国人民解放军国防科技大学 NoC arbitration method based on global node information in GPGPU
CN108733492A (en) * 2018-05-20 2018-11-02 北京工业大学 A kind of batch scheduling memory method divided based on Bank
CN109101443A (en) * 2018-07-27 2018-12-28 天津国芯科技有限公司 A kind of arbitration device and method of weight timesharing
CN109144578A (en) * 2018-06-28 2019-01-04 中国船舶重工集团公司第七0九研究所 A kind of video card resource allocation method and device based on Godson computer
CN109313557A (en) * 2016-07-07 2019-02-05 英特尔公司 The device of local memory access is shared for optimizing GPU thread
CN109992413A (en) * 2019-03-01 2019-07-09 中国科学院计算技术研究所 A kind of accelerator towards breadth-first search algorithm, method and storage medium
CN110223214A (en) * 2019-06-10 2019-09-10 西安博图希电子科技有限公司 A kind of method, apparatus and computer storage medium reducing texture cell amount of access
CN110457238A (en) * 2019-07-04 2019-11-15 中国民航大学 The method paused when slowing down GPU access request and instruction access cache
CN110688335A (en) * 2019-09-11 2020-01-14 上海高性能集成电路设计中心 Device for splitting cache space storage instruction into independent micro-operations
CN111190735A (en) * 2019-12-30 2020-05-22 湖南大学 Linux-based on-chip CPU/GPU (Central processing Unit/graphics processing Unit) pipelined computing method and computer system
CN112764668A (en) * 2019-11-01 2021-05-07 伊姆西Ip控股有限责任公司 Method, electronic device and computer program product for expanding GPU memory
CN112783803A (en) * 2021-01-27 2021-05-11 于慧 Computer CPU-GPU shared cache control method and system
CN113377688A (en) * 2021-05-13 2021-09-10 中国人民解放军军事科学院国防科技创新研究院 L1 cache sharing method for GPU
WO2022222040A1 (en) * 2021-04-20 2022-10-27 华为技术有限公司 Method for accessing cache of graphics processor, graphics processor, and electronic device
CN116166575A (en) * 2023-02-03 2023-05-26 摩尔线程智能科技(北京)有限责任公司 Method, device, equipment, medium and program product for configuring access segment length

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8595425B2 (en) * 2009-09-25 2013-11-26 Nvidia Corporation Configurable cache for multiple clients
CN102270180B (en) * 2011-08-09 2014-04-02 清华大学 Multicore processor cache and management method thereof
CN102902640B (en) * 2012-09-28 2015-07-08 杭州中天微系统有限公司 Request arbitration method and device for consistency multinuclear treater
CN103593306A (en) * 2013-11-15 2014-02-19 浪潮电子信息产业股份有限公司 Design method for Cache control unit of protocol processor

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461957A (en) * 2014-08-28 2015-03-25 浪潮(北京)电子信息产业有限公司 Method and device for heterogeneous multi-core CPU share on-chip caching
CN104809078A (en) * 2015-04-14 2015-07-29 苏州中晟宏芯信息科技有限公司 Exiting and avoiding mechanism based on hardware resource access method of shared cache
CN104809078B (en) * 2015-04-14 2019-05-14 苏州中晟宏芯信息科技有限公司 Based on the shared cache hardware resource access method for exiting yielding mechanism
CN109313557B (en) * 2016-07-07 2023-07-11 英特尔公司 Apparatus for optimizing GPU thread shared local memory access
CN109313557A (en) * 2016-07-07 2019-02-05 英特尔公司 The device of local memory access is shared for optimizing GPU thread
CN107861890A (en) * 2016-09-22 2018-03-30 龙芯中科技术有限公司 Memory access processing method, device and electronic equipment
CN107861890B (en) * 2016-09-22 2020-04-14 龙芯中科技术有限公司 Memory access processing method and device and electronic equipment
CN108009008B (en) * 2016-10-28 2022-08-09 北京市商汤科技开发有限公司 Data processing method and system and electronic equipment
CN108009008A (en) * 2016-10-28 2018-05-08 北京市商汤科技开发有限公司 Data processing method and system, electronic equipment
CN106502959A (en) * 2016-11-16 2017-03-15 湖南国科微电子股份有限公司 Structure, system-in-package and PCB for memory sharing between a main chip and a Beidou chip
CN106502959B (en) * 2016-11-16 2019-09-13 湖南国科微电子股份有限公司 Structure, system-in-package and PCB for memory sharing between a main chip and a Beidou chip
CN107920025A (en) * 2017-11-20 2018-04-17 北京工业大学 Dynamic routing method for CPU-GPU heterogeneous network-on-chip
CN107920025B (en) * 2017-11-20 2021-09-14 北京工业大学 Dynamic routing method for CPU-GPU heterogeneous network on chip
CN108199985A (en) * 2017-12-29 2018-06-22 中国人民解放军国防科技大学 NoC arbitration method based on global node information in GPGPU
CN108199985B (en) * 2017-12-29 2020-07-24 中国人民解放军国防科技大学 NoC arbitration method based on global node information in GPGPU
CN108733492A (en) * 2018-05-20 2018-11-02 北京工业大学 Bank-partition-based batch memory scheduling method
CN109144578A (en) * 2018-06-28 2019-01-04 中国船舶重工集团公司第七0九研究所 Graphics card resource allocation method and device based on a Loongson computer
CN109144578B (en) * 2018-06-28 2021-09-03 中国船舶重工集团公司第七0九研究所 Display card resource allocation method and device based on Loongson computer
CN109101443A (en) * 2018-07-27 2018-12-28 天津国芯科技有限公司 Weight time-sharing arbitration device and method
CN109101443B (en) * 2018-07-27 2021-09-28 天津国芯科技有限公司 Weight time-sharing arbitration device and method
CN109992413A (en) * 2019-03-01 2019-07-09 中国科学院计算技术研究所 Accelerator, method and storage medium for breadth-first search algorithm
CN110223214A (en) * 2019-06-10 2019-09-10 西安博图希电子科技有限公司 Method, apparatus and computer storage medium for reducing texture unit access volume
CN110457238A (en) * 2019-07-04 2019-11-15 中国民航大学 Method for mitigating stalls when GPU memory access requests and instructions access the cache
CN110457238B (en) * 2019-07-04 2023-01-03 中国民航大学 Method for mitigating stalls when GPU memory access requests and instructions access the cache
CN110688335A (en) * 2019-09-11 2020-01-14 上海高性能集成电路设计中心 Device for splitting cache space storage instruction into independent micro-operations
CN112764668A (en) * 2019-11-01 2021-05-07 伊姆西Ip控股有限责任公司 Method, electronic device and computer program product for expanding GPU memory
CN111190735B (en) * 2019-12-30 2024-02-23 湖南大学 Linux-based on-chip CPU/GPU pipelined computing method and computer system
CN111190735A (en) * 2019-12-30 2020-05-22 湖南大学 Linux-based on-chip CPU/GPU (central processing unit/graphics processing unit) pipelined computing method and computer system
CN112783803A (en) * 2021-01-27 2021-05-11 于慧 Computer CPU-GPU shared cache control method and system
CN112783803B (en) * 2021-01-27 2022-11-18 湖南中科长星科技有限公司 Computer CPU-GPU shared cache control method and system
WO2022222040A1 (en) * 2021-04-20 2022-10-27 华为技术有限公司 Method for accessing cache of graphics processor, graphics processor, and electronic device
CN113377688B (en) * 2021-05-13 2022-10-11 中国人民解放军军事科学院国防科技创新研究院 L1 cache sharing method for GPU
CN113377688A (en) * 2021-05-13 2021-09-10 中国人民解放军军事科学院国防科技创新研究院 L1 cache sharing method for GPU
CN116166575A (en) * 2023-02-03 2023-05-26 摩尔线程智能科技(北京)有限责任公司 Method, device, equipment, medium and program product for configuring access segment length
CN116166575B (en) * 2023-02-03 2024-01-23 摩尔线程智能科技(北京)有限责任公司 Method, device, equipment, medium and program product for configuring access segment length

Also Published As

Publication number Publication date
CN103927277B (en) 2017-01-04

Similar Documents

Publication Publication Date Title
CN103927277B (en) Method and device for sharing on-chip cache between CPU and GPU
US8180981B2 (en) Cache coherent support for flash in a memory hierarchy
US8230179B2 (en) Administering non-cacheable memory load instructions
CN103370696B (en) Multi-core system and core data reading method
CN101853227B (en) Method and processor for improving processing of memory-mapped input/output requests
US8078862B2 (en) Method for assigning physical data address range in multiprocessor system
KR101511972B1 (en) Methods and apparatus for efficient communication between caches in hierarchical caching design
US9135177B2 (en) Scheme to escalate requests with address conflicts
US10761986B2 (en) Redirecting data to improve page locality in a scalable data fabric
CN105556503B (en) Dynamic memory control method and system thereof
Bock et al. Concurrent page migration for mobile systems with OS-managed hybrid memory
KR20160099722A (en) Integrated circuits with cache-coherency
US9183150B2 (en) Memory sharing by processors
CN104461957A (en) Method and device for sharing on-chip cache among heterogeneous multi-core CPUs
CN117377943A (en) Compute-in-memory parallel processing system and method
US10754791B2 (en) Software translation prefetch instructions
US20220197506A1 (en) Data placement with packet metadata
CN104049951A (en) Replaying memory transactions while resolving memory access faults
JP6055456B2 (en) Method and apparatus for efficient communication between caches in a hierarchical cache design
CN104049905A (en) Migrating pages of different sizes between heterogeneous processors
El-Kustaban et al. Design and Implementation of a Chip Multiprocessor with an Efficient Multilevel Cache System

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant