CN106708626A - Low power consumption-oriented heterogeneous multi-core shared cache partitioning method - Google Patents
- Publication number
- CN106708626A CN106708626A CN201611187228.1A CN201611187228A CN106708626A CN 106708626 A CN106708626 A CN 106708626A CN 201611187228 A CN201611187228 A CN 201611187228A CN 106708626 A CN106708626 A CN 106708626A
- Authority
- CN
- China
- Prior art keywords
- gpu
- cache
- cpu
- application programs
- ipc
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
- G06F15/163—Interprocessor communication
- G06F15/173—Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- Computer Hardware Design (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
The invention discloses a low power consumption-oriented heterogeneous multi-core shared cache partitioning method. The method comprises the following steps: statically partitioning the shared last-level L2 cache, allocating a fixed 50% of the cache space to the CPU application and the remaining space to the GPU application; on the basis of this even split, performing an optimal static-ratio partition that assigns different proportions to the CPU and GPU applications; and finally applying a dynamic self-adaptive partitioning mechanism that, according to the IPC metric, dynamically changes at run time the proportions of the last-level cache occupied by the CPU and GPU applications, so as to reduce system power consumption and improve system performance.
Description
Technical field
The invention belongs to the field of computer-architecture cache systems, and in particular relates to a low power consumption-oriented heterogeneous multi-core shared cache partitioning method.
Background technology
With the continued development of multi-core processors, traditional homogeneous multi-core architectures struggle to meet the demands of large-scale computation, so industry has formed heterogeneous multi-core processor architectures by integrating different types of processors on the same chip. A heterogeneous multi-core processor (HMP) fuses together cores with different computing capabilities and is widely used in industries such as aerospace, industrial control, and instrumentation to meet performance requirements while reducing power consumption and cost. Because an HMP integrates processor cores with different characteristics and performance, it can distribute different types of computation to different types of cores for parallel processing: fast, complex cores can execute the serial portions of code, while simpler cores can process data in parallel. This provides a more flexible and efficient processing mechanism for applications with different demands, satisfies the requirements of various application environments on real-time behavior, power consumption, reliability, and cost, and has become a focus of current research.
A graphics processing unit (Graphics Processing Unit, GPU) has relatively simple control logic and integrates a large number of parallel processing cores, giving it high peak efficiency (computing performance per unit of power). GPUs have surpassed CPUs in floating-point performance since their inception, and the huge gap between GPU parallel execution and CPU sequential execution has led many program developers to hand the compute-intensive portions of their programs to the GPU. Computer architecture has accordingly moved from the traditional multi-core CPU era into the multi-core CPU-GPU heterogeneous era. The appearance of heterogeneous multi-core processor architectures such as AMD Fusion, Intel Sandy Bridge, and Nvidia Denver indicates that the heterogeneous multi-core architecture has become the mainstream of the current era.
In a heterogeneous multi-core processor, the CPU and GPU are integrated on one chip and share the last-level cache. Compared with the CPU, the GPU runs far more threads at higher parallelism, so GPU applications can reach much higher data-access rates than CPU applications and have a certain tolerance for access latency. Because CPU and GPU applications exhibit different program locality and inconsistent memory-access behavior, the data of different processes contend for the shared last-level cache, creating resource contention. When CPU and GPU applications execute together, the GPU applications' high access rate means most of the usable last-level cache space is occupied by the GPU, leaving very limited space for the CPU; the CPU applications' hit rate in the shared LLC then drops substantially. For many CPU applications, each cache miss, and especially each last-level cache miss, requires an extra access to off-chip main memory, causing unnecessary power overhead. In a heterogeneous multi-core architecture, the partitioning of the shared LLC is therefore crucial to system performance and power consumption, and a reasonable, efficient shared-cache partitioning method is very necessary for reducing the system's power overhead.
At present, some existing research addresses the cache subsystem. Xie and Loh et al. proposed using hardware counters to dynamically partition the cache among threads running on the same chip so that the inter-thread cache hit rate is maximized. Lee and Kim et al. proposed a dynamic cache partitioning strategy that considers the influence of memory-level parallelism and cache miss rate on performance. Qureshi and Patt et al. proposed a utility-based shared last-level cache partitioning method that uses a performance monitor, UMON (Utility Monitor), to count the positions at which each process's accesses hit in the shared last-level cache, guiding the execution of the next period's partitioning strategy. However, existing cache-management work mainly targets homogeneous multi-core system environments: it cannot adapt to a heterogeneous environment combining CPUs and GPUs, still less distinguish requests from the CPU from requests from the GPU, which makes shared last-level cache allocation unfair and seriously affects system performance and power consumption.
Content of the invention
The present invention proposes a low power consumption-oriented heterogeneous multi-core shared cache partitioning method. First, the shared last-level L2 cache is statically partitioned: a fixed 50% of the cache space is allocated to the CPU application, and the remaining space is left to the GPU application. On the basis of this even split, an optimal static-ratio partition is performed, assigning unequal proportions to the CPU and GPU applications. Finally, a dynamic self-adaptive partitioning mechanism is applied: using a low-power-oriented dynamic partitioning algorithm, the proportions of the last-level cache occupied by the CPU and GPU applications are dynamically changed at run time according to the IPC metric, so as to reduce system power consumption and improve system performance.
To achieve the above purpose, the present invention adopts the following technical scheme.
A low power consumption-oriented heterogeneous multi-core shared cache partitioning method comprises the following steps:
Step 1: distinguish CPU requests from GPU requests; track access requests and use a flag bit, TagID, to distinguish the access requests of different cores.
Step 2: according to the TagID flag of each core's access request, determine the L2 Cache address the request is mapped to.
Step 3: implement the static partition, comprising the following steps:
Step 3.1, even split: in the request buffer queue L1RequestToL2Cache of the L2 Cache, check the TagID message flag; if the request comes from a GPU core's L1 Cache, map it to an odd address, and if it comes from a CPU core's L1 Cache, map it to an even address.
Step 3.2, optimal partition: on the basis of the even split, change step by step the ratio of cache addresses allocated to the CPU and GPU applications; count the instructions per cycle (Instructions Per Cycle, IPC) executed by the CPU and GPU programs sharing the cache, and find the partition ratio with the best performance and lowest power consumption.
Step 4: implement the dynamic self-adaptive partition.
The CPU and GPU last-level cache shares in steps 2 and 3 are fixed before the application runs and are not adjusted dynamically according to the application's behavior. Dynamic partitioning instead collects the characteristics of CPU-core and GPU-core access requests at run time and partitions the cache adaptively.
Preferably, step 4 specifically includes:
Step 4.1: monitor access requests, obtain the memory-access behavior of the CPU applications and GPU applications separately, and count the IPC values of the CPU applications and of the GPU applications.
Step 4.2: according to the GPU's IPC metric, compute the application's performance gain σ, and allocate cache ways (Cache way) to the application with the larger performance gain.
Step 4.3: periodically execute steps 4.1 and 4.2; according to the current period's GPU memory-access information, compute the GPU application's IPC gain value, and allocate cache ways to the corresponding application at the start of the next period.
Preferably, step 4.2 specifically proceeds as follows. Let Threshold_low be the lower bound of the evaluation threshold and Threshold_high its upper bound.
(1) If the GPU's IPC gain σ is less than Threshold_low, the GPU application is cache-insensitive, and the cache ways are allocated to the CPU application.
(2) If σ is greater than or equal to Threshold_low and less than or equal to Threshold_high, the GPU application is cache-sensitive, and allocating ways to the GPU application brings a larger benefit, so the cache ways are allocated to the GPU application.
(3) If σ is greater than Threshold_high, the GPU application has undergone a phase change, and the cache partition is restored to its initial state.
Compared with the prior art, the present invention has the following advantages:
When GPU applications and CPU applications share the last-level cache, the GPU applications' strong parallelism and tolerance of memory latency cause them to occupy most of the LLC space, seriously hurting the CPU programs' memory-access hit rate and incurring the overhead of extra main-memory accesses, which affects system performance and power consumption. The LLC static partition and dynamic self-adaptive partition effectively confine the address spaces of the CPU and GPU to particular cache ways, avoiding unfair competition from GPU applications, improving the CPU applications' utilization of the LLC, and reducing the memory-access miss rate, thereby reducing power consumption and improving system performance.
Brief description of the drawings
To make the purpose and scheme of the present invention easier to understand, the invention is further described below with reference to the accompanying drawings.
Fig. 1 is the CPU+GPU heterogeneous multi-core system architecture diagram. The heterogeneous multi-core system consists of 2 CPU cores and 4 GPU cores; each core contains a private L1 Cache; all CPU cores and GPU cores share the L2 Cache, which is the last-level shared cache (LLC); and all cores exchange data with the main-memory controller (DRAM/MEMORY) over the on-chip communication network (NoC).
Fig. 2 is the SLICC operating-mechanism diagram.
Fig. 3 is a schematic diagram of the cache partitioning method.
Fig. 4 is the flow chart of the dynamic self-adaptive LLC partitioning algorithm.
Specific embodiment
To make the purpose, technical scheme, and advantages of the present invention clearer, embodiments of the invention are described in detail below with reference to the accompanying drawings.
The present invention concerns a low power consumption-oriented heterogeneous multi-core shared cache partitioning method. As shown in Fig. 1, the example heterogeneous processor has two CPU cores and four GPU cores; each core has its own L1 Cache, and all cores share one L2 Cache. The CPU test programs are single-threaded SPEC CPU2006 benchmarks, and the GPU applications are taken from Rodinia. Each workload consists of one CPU test program and one GPU application. In the simulator, the coherence protocol is described with SLICC (Specification Language for Implementing Cache Coherence) scripts; Fig. 2 shows the SLICC operating mechanism. The concrete steps are as follows:
Step 1: distinguish CPU access requests from GPU access requests. Add a flag bit, TagID, that records each L1 Cache's number and distinguishes whether an L1 Cache belongs to a CPU core or a GPU core.
A workload group (containing 2 benchmark programs) is run: the CPU benchmark is one SPEC CPU2006 test program running on one CPU core, and the GPU benchmark is a Rodinia test program whose host code runs on the other CPU core and is launched by that CPU core onto the 4 GPU cores. In total there are 6 L1 caches issuing messages from different cores. A TagID flag bit is added to each L1 Cache controller, and the TagID is used to distinguish L1 Cache messages from different cores.
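The tagging described in step 1 can be sketched as follows. This is an illustrative Python sketch, not the patent's SLICC source; the `Request` type, the `make_request` helper, and the 0/1 TagID encoding are assumptions for illustration.

```python
from dataclasses import dataclass

# Assumed TagID encoding: 0 for CPU cores, 1 for GPU cores.
CPU, GPU = 0, 1

@dataclass
class Request:
    addr: int     # requested L2 address
    tag_id: int   # stamped by the issuing L1 cache controller

def make_request(addr, core_type):
    """An L1 controller stamps every outgoing L2 request with its core type."""
    return Request(addr=addr, tag_id=core_type)

def is_gpu_request(req):
    """The shared L2 controller uses the TagID to tell request origins apart."""
    return req.tag_id == GPU
```

In the simulator this check happens at the L2 controller's input port; here it is reduced to a single predicate on the tag.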
Step 2: according to the TagID flag of each core's access request, determine the L2 Cache address the request is mapped to.
The flag bit added in step 1 distinguishes the L1 Cache messages of different cores. When an access-request message from an L1 Cache controller reaches the L2 Cache controller, the message type is judged at the port L1RequestL2Network_in, and L1 Cache request messages from CPU cores and from GPU cores are mapped to different address spaces.
Step 3: implement the static partition.
CPU applications and GPU applications share the L2 Cache address space. The L2 Cache addresses accessed by CPU applications are confined to certain fixed cache ways (Cache Way), and the L2 Cache addresses accessed by GPU applications are confined to the remaining fixed ways, as shown in Fig. 3, realizing the static partition. The static partition scheme effectively prevents the GPU's parallel multithreading from unfairly occupying the L2 cache, guarantees the CPU's utilization of the L2 cache, reduces the CPU's accesses to off-chip main memory, and further reduces the system's power consumption.
Step 3.1: even split.
The L1RequestL2Network_in message queue stores messages from the CPU and GPU L1 Cache controllers. Each message is mapped to the corresponding cache_entry of the L2 Cache by the function getCacheEntry(in_msg.addr) according to the address in_msg.addr. A parity-check function F_O(in_msg.addr) is added that returns true if in_msg.addr is an even address and false otherwise. For a message from a CPU L1 Cache, F_O(in_msg.addr) is checked first: if it returns true, the message is mapped directly to L2cache[addr]; if false, it is mapped to L2cache[addr+1], so that messages from CPU L1 Caches always land on even addresses. For a message from a GPU L1 Cache, the message is mapped directly when F_O(in_msg.addr) returns false, and mapped to L2cache[addr+1] when it returns true. The L2 Cache address space is thus divided into odd and even halves, allocated to the GPU applications and CPU applications respectively, reducing the CPU applications' memory-access miss rate.
Step 3.2: optimal partition.
The other static-partition mode is an unequal-ratio split, which allocates unequal shares of the L2 Cache space to the CPU and GPU applications. Based on the L2 cache shares of the CPU and GPU in current Intel and AMD products, 1/8 of the space is given to the GPU application and 7/8 to the CPU application; this cache-space ratio works well. A new address-partition function C_G(in_msg.addr) is therefore added. For a message from a GPU L1 Cache, if in_msg.addr % 8 equals 0, the address L2cache[addr] is allocated to the GPU; otherwise the message is remapped to an address inside the GPU's 1/8 share (given in the original text as L2cache[addr/8+8]). For a message from a CPU L1 Cache, if in_msg.addr % 8 is not equal to 0, the address L2cache[addr] is allocated directly to the CPU; otherwise L2cache[addr+1] is allocated to the CPU. The L2 Cache address space is thus divided in a 1:7 ratio, the GPU occupying 1 part and the CPU 7 parts. Although the GPU applications occupy less of the L2 cache, the GPU's high parallelism and latency tolerance mean the smaller L2 cache has little effect on GPU performance, while the CPU applications occupy most of the L2 Cache, which effectively reduces the memory-access miss rate and so reduces the system's power overhead.
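The 1:7 split of step 3.2 can be sketched as follows: addresses that are multiples of 8 form the GPU's share, everything else the CPU's. The remapping of an out-of-region GPU address is an assumption here (snapping down to the nearest multiple of 8), since the translated text's expression "addr/8+8" is ambiguous; the function name is also illustrative.

```python
def map_one_to_seven(addr, is_cpu):
    """Confine GPU requests to addresses divisible by 8 (1/8 of the space)
    and CPU requests to the remaining 7/8, as in the 1:7 static partition."""
    in_gpu_region = (addr % 8 == 0)
    if is_cpu:
        # Shift a CPU request off a GPU-owned slot, as the text describes (addr+1).
        return addr if not in_gpu_region else addr + 1
    # Assumed remap: snap a GPU request onto its own region.
    return addr if in_gpu_region else (addr // 8) * 8
```

Under this mapping a CPU request never lands on a multiple of 8 and a GPU request always does, giving the 1:7 occupancy described above.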
Step 4: implement the dynamic self-adaptive partition. The CPU and GPU last-level cache shares in steps 2 and 3 are fixed before the application runs and are not adjusted dynamically according to the application's behavior. Dynamic partitioning collects the characteristics of CPU-core and GPU-core access requests at run time and partitions the cache adaptively; the flow of the adaptive partitioning algorithm is shown in Fig. 4. Adaptive dynamic partitioning changes the ratio of L2 Cache allocated to the CPU and GPU applications at run time according to their access-request characteristics, so that applications with different characteristics can be allocated different shares of the L2 Cache space and the performance benefit is maximized.
Step 4.1: monitor access requests, obtain the memory-access behavior of the CPU applications and GPU applications separately, and count the IPC values of the CPU applications and of the GPU applications. The IPC value is the average number of instructions the CPU or GPU executes per cycle, computed as:
IPC = (number of instructions executed) / (number of cycles)
IPC reflects well how a change in L2 cache capacity affects CPU and GPU application performance, so monitoring the IPC value reveals how application performance changes with L2 cache capacity. In the first sampling period after each repartition of the L2 cache, the IPC value of the GPU application is recorded every cycle.
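The per-period IPC bookkeeping can be sketched as follows. This is a hedged Python sketch: the class and method names are assumptions, and real hardware would read instruction and cycle counters rather than take them as arguments.

```python
class IPCMonitor:
    """Records one IPC sample per sampling period and averages them,
    as needed for the gain computation in step 4.2."""

    def __init__(self):
        self.samples = []

    def record_period(self, instructions, cycles):
        # IPC = instructions executed / cycles elapsed in the period.
        ipc = instructions / cycles
        self.samples.append(ipc)
        return ipc

    def average_ipc(self):
        """Average IPC over the recorded periods (the previous phase's baseline)."""
        return sum(self.samples) / len(self.samples)
```

One monitor instance per side (CPU and GPU) suffices, since the method counts the two applications' IPC values separately.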
Step 4.2: according to the GPU's IPC metric, compute the application's performance gain σ, and allocate cache ways to the application with the larger performance gain. The specific method is as follows.
Let X_i be the IPC value of the i-th cycle after the cache repartition, Threshold_low the lower bound of the evaluation threshold, and Threshold_high its upper bound. The gain σ is the difference between the IPC of the current sampling period and the GPU application's average IPC in the previous partitioning phase; with IPC_avg denoting that average, it is computed as:
σ = (IPC_current − IPC_avg) / IPC_avg
(1) If the computed GPU IPC gain σ is less than Threshold_low, the current GPU application can be judged cache-insensitive: allocating more L2 Cache to it would not affect GPU application performance, but would instead lower the CPU applications' L2 Cache access efficiency and sharply raise their memory-access miss rate. Allocation of L2 cache to the GPU application should then stop, and the cache ways should be allocated to the CPU applications, effectively improving the utilization of the L2 Cache.
(2) If the computed GPU IPC gain σ is greater than or equal to Threshold_low and less than or equal to Threshold_high, the GPU application is cache-sensitive, and allocating L2 Cache space to the GPU application brings a larger benefit, so the cache ways are allocated to the GPU application.
(3) If the computed GPU IPC gain σ is greater than Threshold_high, the GPU application has undergone a phase change, and the cache partition is restored to its initial state.
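The three-way decision in cases (1)-(3) can be sketched as a single function. The action names returned here are illustrative assumptions; the thresholds are parameters rather than the patent's (unspecified) concrete values.

```python
def partition_decision(sigma, threshold_low, threshold_high):
    """Map the GPU's IPC gain sigma onto one of the three reallocation
    actions of step 4.2, given the lower and upper evaluation thresholds."""
    if sigma < threshold_low:
        return "give_ways_to_cpu"    # (1) GPU is cache-insensitive
    elif sigma <= threshold_high:
        return "give_ways_to_gpu"    # (2) GPU is cache-sensitive
    else:
        return "reset_partition"     # (3) phase change: restore initial split
```

Step 4.3 would call this once per sampling period and apply the returned action at the start of the next period.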
Step 4.3: periodically execute steps 4.1 and 4.2; according to the current period's GPU memory-access information, compute the GPU application's IPC gain value, and allocate the cache ways to the corresponding application at the start of the next period.
In the low power consumption-oriented heterogeneous multi-core shared cache partitioning method of the present invention: Intel's Sandy Bridge architecture and AMD's Kaveri architecture integrate CPU and GPU cores in one chip, forming heterogeneous multi-core processor architectures. Such an architecture simplifies communication between the CPU and GPU and lets the CPU and GPU share the last-level cache (LLC). Because GPU cores have a higher degree of parallelism than CPU cores, the GPU can reach a higher data-access rate; most of the cache space would therefore be occupied by GPU applications, leaving very limited space for CPU applications and seriously increasing the CPU applications' memory latency and power overhead, while the GPU applications' strong parallelism shields them from memory latency, so the effect on GPU application performance is limited. Therefore, to ensure that CPU applications obtain a fair share of the cache, the LLC resources shared by the CPU and GPU can be partitioned statically and with adaptive dynamic partitioning. During static partitioning, a fixed percentage of the cache is reserved for the CPU and the rest is left to the GPU. The adaptive dynamic partitioning method analyzes online, at run time, the CPU and GPU applications' sensitivity to cache capacity and adjusts the partition sizes dynamically. Cache partitioning effectively alleviates the CPU and GPU applications' contention for the shared cache and reduces the CPU applications' accesses to off-chip main memory, thereby achieving the purpose of reducing the system's power consumption.
Claims (3)
1. A low power consumption-oriented heterogeneous multi-core shared cache partitioning method, characterized by comprising the following steps:
Step 1: distinguish CPU requests from GPU requests; track access requests and use a flag bit, TagID, to distinguish the access requests of different cores;
Step 2: according to the TagID flag of each core's access request, determine the L2 Cache address the request is mapped to;
Step 3: implement the static partition, comprising the following steps:
Step 3.1, even split: in the request buffer queue L1RequestToL2Cache of the L2 Cache, check the TagID message flag; if the request comes from a GPU core's L1 Cache, map it to an odd address, and if it comes from a CPU core's L1 Cache, map it to an even address;
Step 3.2, optimal partition: on the basis of the even split, change step by step the ratio of cache addresses allocated to the CPU and GPU applications; count the instructions per cycle (Instructions Per Cycle, IPC) executed by the CPU and GPU programs sharing the cache, and find the partition ratio with the best performance and lowest power consumption;
Step 4: implement the dynamic self-adaptive partition: the CPU and GPU last-level cache shares in steps 2 and 3 are fixed before the application runs and are not adjusted dynamically according to the application's behavior; dynamic partitioning collects the characteristics of CPU-core and GPU-core access requests at run time and partitions the cache adaptively.
2. The low power consumption-oriented heterogeneous multi-core shared cache partitioning method of claim 1, characterized in that step 4 specifically includes:
Step 4.1: monitor access requests, obtain the memory-access behavior of the CPU applications and GPU applications separately, and count the IPC values of the CPU applications and of the GPU applications;
Step 4.2: according to the GPU's IPC metric, compute the application's performance gain σ, and allocate cache ways (Cache way) to the application with the larger performance gain;
Step 4.3: periodically execute steps 4.1 and 4.2; according to the current period's GPU memory-access information, compute the GPU application's IPC gain value, and allocate the cache ways to the corresponding application at the start of the next period.
3. The low power consumption-oriented heterogeneous multi-core shared cache partitioning method of claim 1, characterized in that step 4.2 specifically proceeds as follows: let Threshold_low be the lower bound of the evaluation threshold and Threshold_high its upper bound;
(1) if the GPU's IPC gain σ is less than Threshold_low, the GPU application is cache-insensitive, and the cache ways are allocated to the CPU application;
(2) if the GPU's IPC gain σ is greater than or equal to Threshold_low and less than or equal to Threshold_high, the GPU application is cache-sensitive, and allocating ways to the GPU application brings a larger benefit, so the cache ways are allocated to the GPU application;
(3) if the GPU's IPC gain σ is greater than Threshold_high, the GPU application has undergone a phase change, and the cache partition is restored to its initial state.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611187228.1A CN106708626A (en) | 2016-12-20 | 2016-12-20 | Low power consumption-oriented heterogeneous multi-core shared cache partitioning method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611187228.1A CN106708626A (en) | 2016-12-20 | 2016-12-20 | Low power consumption-oriented heterogeneous multi-core shared cache partitioning method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106708626A | 2017-05-24 |
Family
ID=58939396
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611187228.1A Pending CN106708626A (en) | 2016-12-20 | 2016-12-20 | Low power consumption-oriented heterogeneous multi-core shared cache partitioning method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106708626A (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107463510A (en) * | 2017-08-21 | 2017-12-12 | 北京工业大学 | High performance-oriented heterogeneous multi-core shared cache management method |
CN108154461A (en) * | 2017-12-06 | 2018-06-12 | 中国航空工业集团公司西安航空计算技术研究所 | Low-power-consumption GPU shader task and unified shader array task scene mapping structure |
CN108459912A (en) * | 2018-04-10 | 2018-08-28 | 郑州云海信息技术有限公司 | Last-level cache management method and related device |
CN109101332A (en) * | 2017-06-20 | 2018-12-28 | 畅想芯科有限公司 | Asymmetric multi-core heterogeneous parallel processing system |
CN109753134A (en) * | 2018-12-24 | 2019-05-14 | 四川大学 | GPU internal energy consumption control system and method based on global decoupling |
CN110389833A (en) * | 2019-06-28 | 2019-10-29 | 北京大学深圳研究生院 | Performance scheduling method and system for a processor |
CN111897747A (en) * | 2020-07-24 | 2020-11-06 | 宁波中控微电子有限公司 | Dynamic cache allocation method for an on-chip coprocessor, and system on chip |
CN112000465A (en) * | 2020-07-21 | 2020-11-27 | 山东师范大学 | Method and system for reducing performance interference of delay-sensitive programs in a data center environment |
CN112783803A (en) * | 2021-01-27 | 2021-05-11 | 于慧 | Computer CPU-GPU shared cache control method and system |
CN113780336A (en) * | 2021-07-27 | 2021-12-10 | 浙江工业大学 | Lightweight cache partitioning method and device based on machine learning |
CN114138179A (en) * | 2021-10-19 | 2022-03-04 | 苏州浪潮智能科技有限公司 | Method and device for dynamically adjusting write cache space |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105068940A (en) * | 2015-07-28 | 2015-11-18 | 北京工业大学 | Adaptive page policy determination method based on bank partitioning |
2016-12-20: CN patent application CN201611187228.1A filed (publication CN106708626A); status: Pending
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105068940A (en) * | 2015-07-28 | 2015-11-18 | 北京工业大学 | Adaptive page policy determination method based on bank partitioning |
Non-Patent Citations (6)
Title |
---|
GUANG SUO; XUEJUN YANG; GUANGHUI LIU ET AL.: "IPC-Based Cache Partitioning: An IPC-Oriented Dynamic Shared Cache Partitioning Mechanism", 2008 International Conference on Convergence and Hybrid Information Technology * |
S. KIM; D. CHANDRA; Y. SOLIHIN: "Fair cache sharing and partitioning in a chip multiprocessor architecture", 13th International Conference on Parallel Architecture and Compilation Techniques (PACT 2004) * |
SUN CHUANWEI: "Dynamic partitioning of the shared cache on CPU-GPU fused architectures", China Masters' Theses Full-text Database, Information Science and Technology * |
SUN SUN: "Research on key techniques for improving on-chip cache utilization of multi-core processors", China Doctoral Dissertations Full-text Database, Information Science and Technology * |
YANG LI; DENG ZHENJIE; LIU HONGYAN: "Study Guide for Microcomputer Principles and Interface Technology (2nd Edition)", 31 August 2007 * |
CHEN XI; JIANG LEMIN: "Microcomputer Principles and Interface Technology", 31 July 2006 * |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109101332A (en) * | 2017-06-20 | 2018-12-28 | 畅想芯科有限公司 | Asymmetric multicore heterogeneous parallel processing system |
CN107463510B (en) * | 2017-08-21 | 2020-05-08 | 北京工业大学 | High-performance heterogeneous multi-core shared cache buffer management method |
CN107463510A (en) * | 2017-08-21 | 2017-12-12 | 北京工业大学 | High performance-oriented heterogeneous multi-core shared cache management method |
CN108154461A (en) * | 2017-12-06 | 2018-06-12 | 中国航空工业集团公司西安航空计算技术研究所 | Low-power-consumption GPU shader task and unified shader array task scene mapping structure |
CN108459912B (en) * | 2018-04-10 | 2021-09-17 | 郑州云海信息技术有限公司 | Last-level cache management method and related device |
CN108459912A (en) * | 2018-04-10 | 2018-08-28 | 郑州云海信息技术有限公司 | Last-level cache management method and related device |
CN109753134A (en) * | 2018-12-24 | 2019-05-14 | 四川大学 | GPU internal energy consumption control system and method based on global decoupling |
CN109753134B (en) * | 2018-12-24 | 2022-04-15 | 四川大学 | Global decoupling-based GPU internal energy consumption control system and method |
CN110389833A (en) * | 2019-06-28 | 2019-10-29 | 北京大学深圳研究生院 | Performance scheduling method and system for a processor |
CN110389833B (en) * | 2019-06-28 | 2023-06-16 | 北京大学深圳研究生院 | Performance scheduling method and system for processor |
CN112000465A (en) * | 2020-07-21 | 2020-11-27 | 山东师范大学 | Method and system for reducing performance interference of delay-sensitive programs in a data center environment |
CN112000465B (en) * | 2020-07-21 | 2023-02-03 | 山东师范大学 | Method and system for reducing performance interference of delay-sensitive programs in a data center environment |
CN111897747A (en) * | 2020-07-24 | 2020-11-06 | 宁波中控微电子有限公司 | Dynamic cache allocation method for an on-chip coprocessor, and system on chip |
CN112783803A (en) * | 2021-01-27 | 2021-05-11 | 于慧 | Computer CPU-GPU shared cache control method and system |
CN112783803B (en) * | 2021-01-27 | 2022-11-18 | 湖南中科长星科技有限公司 | Computer CPU-GPU shared cache control method and system |
CN113780336A (en) * | 2021-07-27 | 2021-12-10 | 浙江工业大学 | Lightweight cache partitioning method and device based on machine learning |
CN113780336B (en) * | 2021-07-27 | 2024-02-02 | 浙江工业大学 | Lightweight cache partitioning method and device based on machine learning |
CN114138179A (en) * | 2021-10-19 | 2022-03-04 | 苏州浪潮智能科技有限公司 | Method and device for dynamically adjusting write cache space |
CN114138179B (en) * | 2021-10-19 | 2023-08-15 | 苏州浪潮智能科技有限公司 | Method and device for dynamically adjusting write cache space |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106708626A (en) | Low power consumption-oriented heterogeneous multi-core shared cache partitioning method | |
Liu et al. | A software memory partition approach for eliminating bank-level interference in multicore systems | |
Stuecheli et al. | The virtual write queue: Coordinating DRAM and last-level cache policies | |
CN110704360B (en) | Graph calculation optimization method based on heterogeneous FPGA data flow | |
CN104067227B (en) | Branch prediction logic | |
CN104252392B (en) | Data cache access method and processor |
US8335892B1 (en) | Cache arbitration between multiple clients | |
US8839259B2 (en) | Thread scheduling on multiprocessor systems | |
US8904154B2 (en) | Execution migration | |
Tsai et al. | Adaptive scheduling for systems with asymmetric memory hierarchies | |
CN103218208A (en) | System and method for performing shaped memory access operations | |
CN103207774A (en) | Method And System For Resolving Thread Divergences | |
CN107463510B (en) | High-performance heterogeneous multi-core shared cache buffer management method | |
CN103218309A (en) | Multi-level instruction cache prefetching | |
Arora | The architecture and evolution of CPU-GPU systems for general purpose computing | |
CN106250348B (en) | Heterogeneous multi-core architecture cache management method based on GPU memory-access characteristics | |
CN108132834A (en) | Task allocation method and system under a multi-level shared cache architecture | |
Tian et al. | Abndp: Co-optimizing data access and load balance in near-data processing | |
Liu et al. | A space-efficient fair cache scheme based on machine learning for NVMe SSDs | |
Rai et al. | Improving CPU performance through dynamic GPU access throttling in CPU-GPU heterogeneous processors | |
Rai et al. | Exploiting Dynamic Reuse Probability to Manage Shared Last-level Caches in CPU-GPU Heterogeneous Processors | |
García-Guirado et al. | Energy-efficient cache coherence protocols in chip-multiprocessors for server consolidation | |
Jia et al. | Coordinate channel-aware page mapping policy and memory scheduling for reducing memory interference among multimedia applications | |
BiTalebi et al. | LARA: Locality-aware resource allocation to improve GPU memory-access time | |
CN112817639A (en) | Method for accessing register file by GPU read-write unit through operand collector |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | | |
SE01 | Entry into force of request for substantive examination | | |
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20170524 |