CN106708626A - Low power consumption-oriented heterogeneous multi-core shared cache partitioning method - Google Patents


Info

Publication number
CN106708626A
Authority
CN
China
Prior art keywords
gpu
cache
cpu
application programs
ipc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611187228.1A
Other languages
Chinese (zh)
Inventor
方娟
刘士建
程妍瑾
常泽清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201611187228.1A priority Critical patent/CN106708626A/en
Publication of CN106708626A publication Critical patent/CN106708626A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 - Digital computers in general; Data processing equipment in general
    • G06F 15/16 - Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163 - Interprocessor communication
    • G06F 15/173 - Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a low power consumption-oriented heterogeneous multi-core shared cache partitioning method. The method comprises the following steps: statically partitioning the shared last-level cache (L2 Cache), allocating a fixed 50% of the cache space to CPU applications and the remaining space to GPU applications; on the basis of this even partitioning, performing an optimal static proportional partitioning that allocates different proportions to the CPU and GPU applications; and finally applying a dynamic adaptive proportional partitioning mechanism that, guided by an IPC partitioning metric, dynamically changes at run time the proportions of the last-level cache occupied by the CPU and GPU applications, so as to reduce system power consumption and improve system performance.

Description

A low-power-oriented heterogeneous multi-core shared cache partitioning method
Technical field
The invention belongs to the field of cache architecture in computer systems, and in particular relates to a low-power-oriented heterogeneous multi-core shared cache partitioning method.
Background technology
With the continuing development of multi-core processors, traditional multi-core architectures have struggled to meet the demands of large-scale computation, so industry has formed heterogeneous multi-core processor architectures by integrating different types of processors on the same chip. A Heterogeneous Multi-core Processor (HMP) fuses processors with different computing capabilities together and is widely used in industries such as aerospace, industrial control, and instrumentation to meet performance requirements while reducing power consumption and cost. Because an HMP integrates processor cores with different characteristics and capabilities, different types of computational tasks can be distributed to different types of cores for parallel processing: a fast, complex core can execute the serial portions of code, while simpler cores can process data in parallel. HMPs thus provide a more flexible and efficient processing mechanism for applications with different demands, satisfy the requirements of various application environments on real-time behavior, power consumption, reliability, and cost, and have become a focus of current research.
A Graphics Processing Unit (GPU) has relatively simple control logic, integrates a large number of parallel processing cores, and offers a high peak efficiency (computing performance per unit of power consumed). GPUs have surpassed CPUs in floating-point performance since their inception, and the huge gap between GPU parallel execution and CPU sequential execution has led many program developers to hand the computation-intensive parts of their code to the GPU. Computer architecture has accordingly moved from the traditional multi-core CPU era into the heterogeneous multi-core CPU-GPU era; the appearance of heterogeneous multi-core processor architectures such as AMD Fusion, Intel Sandy Bridge, and Nvidia Denver marks heterogeneous multi-core as the mainstream architecture of the current era.
On a heterogeneous multi-core processor, the CPU and GPU are integrated on one chip and share the last-level cache (LLC). Compared with the CPU, the GPU has far more threads and a higher degree of parallelism, so GPU applications can reach a much higher data access rate than CPU applications and exhibit a certain tolerance for access latency. Because CPU and GPU applications differ in program locality and memory access behavior, the data of different processes compete for the shared last-level cache, creating resource contention. When CPU and GPU applications execute together, the high access rate of the GPU applications means that most of the usable last-level cache space is occupied by them, leaving very limited space for the CPU; the CPU applications' hit rate in the shared LLC drops substantially. For many CPU applications, every cache miss, in particular a last-level cache miss, requires an extra access to off-chip main memory, causing unnecessary power overhead. Therefore, under a heterogeneous multi-core architecture, how the shared LLC is partitioned is critical to system performance and power consumption, and a reasonable and efficient shared cache partitioning method is necessary to reduce the power overhead of the system.
At present, some existing research addresses the cache subsystem. Xie and Loh et al. proposed dynamically partitioning the cache among threads running on the same chip by using hardware counters, so that the cache hit rate across threads is maximized. Lee and Kim et al. proposed a dynamic cache partitioning strategy that considers the influence of memory-level parallelism and cache miss rate on application performance. Qureshi and Patt et al. proposed a utility-based shared last-level cache partitioning method that uses a performance monitor, UMON (Utility Monitor), to count where each process's accesses hit in the shared last-level cache and to guide the partitioning strategy of the next period. However, existing cache management work is mainly aimed at homogeneous multi-core system environments: it is not suited to a heterogeneous environment combining a CPU with a GPU and cannot distinguish requests from the CPU from requests from the GPU, which leads to unfair allocation of the shared last-level cache and seriously affects system performance and power consumption.
Summary of the invention
The present invention proposes a low-power-oriented heterogeneous multi-core shared cache partitioning method. First, the shared last-level cache (L2 Cache) is statically partitioned: a fixed 50% of the cache space is allocated to CPU applications, and the remaining space is left to GPU applications. On the basis of this even partitioning, an optimal static proportional partitioning is then carried out, allocating unequal proportions to the CPU and GPU applications. Finally, a dynamic adaptive proportional partitioning mechanism is applied: using a low-power-oriented dynamic partitioning algorithm, the proportions of the last-level cache occupied by the CPU and GPU applications are changed dynamically at run time according to an IPC partitioning metric, so as to reduce system power consumption and improve system performance.
To achieve the above purpose, the present invention adopts the following technical scheme.
A low-power-oriented heterogeneous multi-core shared cache partitioning method comprises the following steps:
Step 1: distinguish CPU requests from GPU requests. Track access requests and use a TagID flag to distinguish the access requests of different cores;
Step 2: according to the TagID flag of each core's access request, determine the L2 Cache address to which the access request is mapped;
Step 3: realize static partitioning, comprising the following steps:
Step 3.1: even partitioning.
In the request buffer queue L1RequestToL2Cache of the L2 Cache, examine the TagID message flag: if the request comes from the L1 Cache of a GPU core, map the access request to an odd address; if the request comes from the L1 Cache of a CPU core, map the access request to an even address;
Step 3.2: optimal partitioning. Starting from the even partitioning, change step by step the ratio of cache addresses allocated to the CPU and GPU applications, count the instructions executed per unit period (Instructions Per Cycle, IPC) by the CPU and GPU programs sharing the cache, and find the partition ratio with the best performance and lowest power consumption;
Step 4: realize dynamic adaptive partitioning.
The last-level cache shares of the CPU and GPU in steps 2 and 3 are fixed before the application programs run and are not adjusted dynamically at run time according to application characteristics. Dynamic partitioning collects the characteristics of CPU-core and GPU-core access requests at run time and realizes adaptive dynamic partitioning.
Preferably, step 4 specifically comprises:
Step 4.1: monitor access requests and obtain the memory access behavior of the CPU applications and the GPU applications respectively. Count the IPC values of the CPU applications and of the GPU applications separately;
Step 4.2: according to the IPC metric of the GPU, calculate the performance gain σ of the application programs, and allocate cache ways (Cache way) to the application with the maximum performance gain;
Step 4.3: periodically execute steps 4.1 and 4.2; according to the memory access information of the GPU in the current period, calculate the IPC gain of the GPU application, and allocate the cache ways to the corresponding application at the start of the next period.
Preferably, step 4.2 specifically comprises the following. Let Threshold_low be the lower bound of the threshold evaluation and Threshold_high the upper bound:
1. If the IPC gain σ of the GPU is less than the threshold Threshold_low, the GPU application is cache-insensitive, and the cache ways are allocated to the CPU applications.
2. If the IPC gain σ of the GPU is greater than or equal to Threshold_low and less than or equal to Threshold_high, the GPU application is cache-sensitive and allocating the ways to it brings greater benefit, so the cache ways are allocated to the GPU application.
3. If the IPC gain σ of the GPU is greater than Threshold_high, the GPU application has undergone a phase change, and the cache partitioning is restored to its original state.
Compared with the prior art, the present invention has the following advantages:
GPU applications and CPU applications share the last-level cache, and the GPU applications' strong parallelism and latency tolerance cause them to occupy most of the LLC space. This severely degrades the memory access hit rate of CPU programs, incurs the extra overhead of accessing main memory, and harms the performance and power consumption of the system. The LLC static partitioning and dynamic adaptive partitioning methods effectively restrict the address spaces of the CPU and the GPU to particular cache ways, avoid unfair competition from GPU applications, improve the CPU applications' utilization of the LLC, and reduce the memory access miss rate, thereby reducing power consumption and improving system performance.
Brief description of the drawings
To make the purpose and scheme of the present invention easier to understand, the present invention is further described below in conjunction with the accompanying drawings.
Fig. 1 is an architecture diagram of the CPU+GPU heterogeneous multi-core system. The heterogeneous multi-core architecture consists of 2 CPU cores and 4 GPU cores; each core contains a private L1 Cache, all CPU cores and GPU cores share the L2 Cache as the last-level shared cache (LLC), and all cores exchange data with the main memory controller DRAM (MEMORY) over the on-chip communication network NOC.
Fig. 2 is a diagram of the SLICC operating mechanism;
Fig. 3 is a schematic diagram of the Cache partitioning method;
Fig. 4 is a flow chart of the dynamic adaptive LLC partitioning algorithm.
Specific embodiment
To make the purpose, technical scheme, and advantages of the present invention clearer, embodiments of the present invention are described in detail below in conjunction with the accompanying drawings.
The present invention relates to a low-power-oriented heterogeneous multi-core shared cache partitioning method. As shown in Fig. 1, it is illustrated with a heterogeneous processor that has two CPU cores and four GPU cores, where each core owns its own L1 Cache and all cores share one L2 Cache. The CPU test programs are single-threaded SPEC CPU2006 benchmarks, and the GPU applications come from Rodinia; each workload consists of one CPU test program and one GPU application. In the simulator, the coherence protocol is described with SLICC (Specification Language for Implementing Cache Coherence) scripts; Fig. 2 shows the SLICC operating mechanism. The concrete steps are as follows:
Step 1: distinguish CPU access requests from GPU access requests. Add a TagID flag recording the number of each L1 Cache, so as to distinguish whether an L1 Cache belongs to a CPU core or a GPU core.
One workload group (containing 2 benchmark programs) is run: the CPU benchmark, one SPEC CPU2006 test program, runs on one CPU core, while the GPU benchmark, a Rodinia test program, runs on the other CPU core, which guides the GPU test program onto the 4 GPU cores. There are therefore 6 sources of L1 cache messages from the different cores in total. A TagID flag is added to each L1 Cache Controller, and the TagID distinguishes the L1 Cache messages coming from the different cores.
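As an illustration of this tagging step, the following is a minimal C++ sketch (written in C++ rather than SLICC for readability) of how a TagID might travel with an L1-to-L2 request; the type and field names are assumptions made for illustration, not identifiers from the patent:

```cpp
#include <cstdint>

// Origin of an L1 request; one TagID per L1 cache controller.
enum class CoreType : uint8_t { CPU, GPU };

struct TagID {
    CoreType type;    // issuing core is a CPU core or a GPU core
    uint8_t  coreId;  // which of the six L1 caches sent the message
};

// An L1-to-L2 access request carrying its origin tag.
struct CacheRequest {
    uint64_t addr;  // requested line address
    TagID    tag;   // set by the issuing L1 Cache Controller
};

// The L2 controller consults the tag to choose a mapping policy.
inline bool isGPURequest(const CacheRequest& req) {
    return req.tag.type == CoreType::GPU;
}
```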
Step 2: according to the TagID flag of each core's access request, determine the L2 Cache address to which the access request is mapped.
The flag added in step 1 distinguishes the L1 Cache messages of the different cores. When an access request message from an L1 Cache Controller reaches the L2 Cache Controller, the message type is checked in the port L1RequestL2Network_in, and L1 Cache request messages from CPU cores and from GPU cores are mapped to different address spaces.
Step 3: realize static partitioning.
CPU applications and GPU applications share the L2 Cache address space. The L2 Cache addresses accessed by CPU applications are restricted to certain fixed cache ways (Cache Way), and the L2 Cache addresses accessed by GPU applications are restricted to the remaining fixed ways, as shown in Fig. 3, realizing static partitioning. The static partitioning scheme effectively prevents the GPU's parallel multithreading from unfairly occupying the L2 cache, guarantees the CPU's utilization of the L2 cache, reduces the overhead of CPU accesses to off-chip main memory, and further reduces the power consumption of the system.
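The following minimal C++ sketch shows one way such way-based restriction could be enforced at victim selection, assuming a 16-way L2 with an even 8/8 split; the associativity, mask values, and helper name are illustrative assumptions, not details given by the patent:

```cpp
#include <cstdint>

constexpr int kL2Ways = 16;  // assumed associativity

// Bit i set => way i may be used by that requester class.
constexpr uint16_t kCpuWayMask = 0x00FF;  // ways 0-7 reserved for CPU requests
constexpr uint16_t kGpuWayMask = 0xFF00;  // ways 8-15 reserved for GPU requests

// On a miss, choose the LRU victim inside the requester's own partition,
// so one class can never evict lines held in the other class's ways.
int selectVictimWay(uint16_t allowedMask, const uint8_t lruAge[kL2Ways]) {
    int victim = -1;
    uint8_t oldest = 0;
    for (int way = 0; way < kL2Ways; ++way) {
        bool allowed = ((allowedMask >> way) & 1u) != 0;
        if (allowed && lruAge[way] >= oldest) {
            oldest = lruAge[way];
            victim = way;
        }
    }
    return victim;  // index of the oldest allowed way
}
```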
Step 3.1: even partitioning.
Messages from the CPU or GPU L1 Cache Controllers are stored in the L1RequestL2Network_in message queue, and each message is mapped to the corresponding cache_entry of the L2 Cache by the getCacheEntry(in_msg.addr) function according to the address in_msg.addr. A parity flag function F_O(in_msg.addr) is added that returns true if in_msg.addr is an even address and false otherwise. For a message from a CPU L1 Cache, it is first judged whether F_O(in_msg.addr) returns true: if true, the message is mapped directly into L2cache[addr] of the L2 Cache; if false, it is mapped into L2cache[addr+1], so that messages from CPU L1 Caches are always mapped to even addresses. For a message from a GPU L1 Cache, the message is mapped directly when F_O(in_msg.addr) returns false, and when the return value is true the message is mapped into L2cache[addr+1]. The L2 Cache address space is thereby divided into an even-address part and an odd-address part, allocated respectively to the CPU applications and the GPU applications, which reduces the memory access miss rate of the CPU applications.
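The parity-based mapping just described can be summarized in a short C++ sketch; F_O follows the description directly, while mapL2Address is an assumed wrapper name and the sketch deliberately ignores set-index details:

```cpp
#include <cstdint>

// Parity flag function from the description: true for an even address.
inline bool F_O(uint64_t addr) { return (addr & 1u) == 0; }

// Map a line address into the even (CPU) or odd (GPU) half of the L2 space,
// mirroring the L2cache[addr] / L2cache[addr+1] choice in the description.
uint64_t mapL2Address(uint64_t addr, bool fromGPU) {
    if (fromGPU)
        return F_O(addr) ? addr + 1 : addr;  // GPU requests land on odd addresses
    return F_O(addr) ? addr : addr + 1;      // CPU requests land on even addresses
}
```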
Step 3.2: optimal partitioning.
The other mode of static partitioning is unequal proportional partitioning, in which unequal shares of the L2 Cache space are allocated to the CPU and GPU applications. Following the L2 cache shares of the CPU and GPU in current Intel and AMD products, 1/8 of the space is given to the GPU applications and 7/8 to the CPU applications; this cache share proportion works well. A new address partition function C_G(in_msg.addr) is therefore added. For a message from a GPU L1 Cache, if in_msg.addr % 8 equals 0, the address L2cache[addr] is allocated to the GPU; if in_msg.addr % 8 is not equal to 0, the address L2cache[addr/8+8] is allocated to the GPU. For a message from a CPU L1 Cache, if in_msg.addr % 8 is not equal to 0, the address L2cache[addr] is allocated directly to the CPU; otherwise the address L2cache[addr+1] is allocated to the CPU. The L2 Cache address space is thus partitioned in the unequal ratio 1:7, the GPU occupying 1 part and the CPU 7 parts. Although the GPU applications occupy less of the L2 cache, the GPU's high parallelism and latency tolerance mean that a smaller L2 cache has no obvious impact on GPU performance, while the CPU applications occupy most of the L2 Cache and the memory access miss rate is effectively reduced, so the power overhead of the system decreases.
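A corresponding C++ sketch of the 1:7 split follows; where the description's remapping formula for GPU addresses (L2cache[addr/8+8]) is ambiguous, the sketch rounds the address down to a multiple of 8 as one plausible reading, flagged as an assumption in the comments:

```cpp
#include <cstdint>

// Address partition function C_G from the description: addresses that are
// multiples of 8 form the GPU's 1/8 share, all others the CPU's 7/8 share.
uint64_t C_G(uint64_t addr, bool fromGPU) {
    if (fromGPU) {
        if (addr % 8 == 0) return addr;  // already in the GPU share
        // Assumption: remap into the GPU share by rounding down to a
        // multiple of 8; the description's own formula here is garbled.
        return (addr / 8) * 8;
    }
    if (addr % 8 != 0) return addr;      // already in the CPU share
    return addr + 1;                     // step off the GPU's multiple-of-8 slot
}
```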
Step 4: realize dynamic adaptive partitioning. The last-level cache shares of the CPU and GPU in steps 2 and 3 are fixed before the applications run and are not adjusted dynamically at run time according to application characteristics. Dynamic partitioning collects the characteristics of CPU-core and GPU-core access requests at run time and realizes adaptive dynamic partitioning; the flow of the adaptive partitioning algorithm is shown in Fig. 4. Adaptive dynamic partitioning can, according to the access request characteristics of the CPU and GPU at run time, dynamically change the proportion of the L2 Cache allocated to the CPU and GPU applications, so that applications with different characteristics are allocated L2 Cache spaces of different proportions and the performance benefit is maximized.
Step 4.1: monitor access requests and obtain the memory access behavior of the CPU applications and the GPU applications respectively, counting the IPC values of the CPU applications and of the GPU applications. The IPC value is the average number of instructions executed per cycle by the CPU or GPU, computed as follows:

IPC = number of instructions executed / number of clock cycles

IPC reflects well the influence of L2 cache capacity changes on the performance of CPU and GPU applications, so monitoring the IPC value shows how application performance changes as the L2 cache capacity changes. In the first sampling period after each repartitioning of the L2 cache, the IPC value of the GPU application is recorded in each cycle.
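As a concrete illustration of this sampling step, here is a small C++ sketch computing the per-period IPC from a pair of hypothetical hardware counters; the PerfCounters struct and its fields are illustrative, not names from the patent:

```cpp
#include <cstdint>

// Hypothetical per-core counters sampled at the end of each period.
struct PerfCounters {
    uint64_t instructions;  // instructions committed during the period
    uint64_t cycles;        // clock cycles elapsed during the period
};

// IPC = instructions executed / clock cycles for one sampling period.
double computeIPC(const PerfCounters& c) {
    return c.cycles ? static_cast<double>(c.instructions) / c.cycles : 0.0;
}
```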
Step 4.2: according to the IPC metric of the GPU, calculate the performance gain σ of the application, and allocate cache ways (Cache Way) to the application with the maximum performance gain. The specific method is as follows:
Let X_i denote the IPC value of the i-th period after a cache repartitioning, Threshold_low the lower bound of the threshold evaluation, and Threshold_high the upper bound. The IPC gain σ of the current sampling period is its deviation from the GPU application's average IPC over the previous partitioning stage:

σ = X_i - (X_1 + X_2 + ... + X_n) / n

where X_1, ..., X_n are the IPC samples recorded during the previous stage.
(1) If the computed IPC gain σ of the GPU is less than the threshold Threshold_low, the current GPU application can be judged cache-insensitive: allocating more L2 Cache to it would not affect GPU application performance, but it would reduce the CPU applications' access efficiency in the L2 Cache and sharply raise their memory access miss rate. At this point allocation of L2 cache to the GPU application should stop, and the cache ways (Cache Way) are allocated to the CPU applications, which effectively improves the utilization of the L2 Cache.
(2) If the computed IPC gain σ of the GPU is greater than or equal to Threshold_low and less than or equal to Threshold_high, the GPU application is cache-sensitive, and allocating L2 Cache space to the GPU application brings greater benefit, so the cache ways are allocated to the GPU application.
(3) If the computed IPC gain σ of the GPU is greater than Threshold_high, the GPU application has undergone a phase change, and the cache partitioning is restored to its original state.
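The three cases above condense into one small decision routine. A C++ sketch follows; the concrete threshold values are placeholders chosen for illustration, since the patent does not specify Threshold_low and Threshold_high numerically:

```cpp
// Possible reallocation decisions at the end of a sampling period.
enum class Action { GiveWayToCPU, GiveWayToGPU, ResetPartition };

// Decide the next-period reallocation from the GPU's IPC gain sigma.
// The threshold values are placeholders; the patent does not state them.
Action decideRepartition(double sigma,
                         double thresholdLow  = 0.05,
                         double thresholdHigh = 0.50) {
    if (sigma < thresholdLow)
        return Action::GiveWayToCPU;   // case (1): GPU is cache-insensitive
    if (sigma <= thresholdHigh)
        return Action::GiveWayToGPU;   // case (2): GPU is cache-sensitive
    return Action::ResetPartition;     // case (3): phase change, restore split
}
```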
Step 4.3: periodically execute steps 4.1 and 4.2; according to the memory access information of the GPU in the current period, compute the IPC gain of the GPU application, and allocate the cache ways (Cache Way) to the corresponding application when the next period starts.
In the low-power-oriented heterogeneous multi-core shared cache partitioning method of the present invention, Intel's Sandy Bridge architecture and AMD's Kaveri architecture have already realized the integration of CPU and GPU cores on one chip, forming the heterogeneous multi-core processor architecture. This architecture simplifies communication between the CPU and GPU and lets them share the last-level cache (LLC) resource. Because GPU cores possess a higher degree of parallelism than CPU cores, the GPU can reach a higher data access rate, so most of the cache space is taken by the GPU applications, leaving only very limited space for the CPU applications and severely affecting the CPU applications' memory access latency and power overhead; meanwhile the good parallelism of GPU applications shields them from memory access latency, so the influence on GPU application performance is limited. Therefore, to guarantee that CPU applications obtain a fair share of the cache, the LLC resource shared by the CPU and GPU can be partitioned statically and then adaptively and dynamically. In static partitioning, a fixed percentage of the cache is reserved for the CPU and the remainder is left to the GPU. The dynamic adaptive partitioning method adjusts the partition sizes at run time by analyzing online the sensitivity of the CPU and GPU applications to the cache capacity. Cache partitioning effectively alleviates contention between CPU and GPU applications for the shared cache and reduces the overhead of CPU applications accessing off-chip main memory, thereby achieving the purpose of reducing the power consumption of the system.

Claims (3)

1. A low-power-oriented heterogeneous multi-core shared cache partitioning method, characterized in that it comprises the following steps:
Step 1: distinguish CPU requests from GPU requests, track access requests, and use a TagID flag to distinguish the access requests of different cores;
Step 2: according to the TagID flag of each core's access request, determine the L2 Cache address to which the access request is mapped;
Step 3: realize static partitioning, comprising the following steps:
Step 3.1: even partitioning.
In the request buffer queue L1RequestToL2Cache of the L2 Cache, examine the TagID message flag: if the request comes from the L1 Cache of a GPU core, map the access request to an odd address; if the request comes from the L1 Cache of a CPU core, map the access request to an even address;
Step 3.2: optimal partitioning. Starting from the even partitioning, change step by step the ratio of cache addresses allocated to the CPU and GPU applications, count the instructions executed per unit period (Instructions Per Cycle, IPC) by the CPU and GPU programs sharing the cache, and find the partition ratio with the best performance and lowest power consumption;
Step 4: realize dynamic adaptive partitioning.
The last-level cache shares of the CPU and GPU in steps 2 and 3 are fixed before the application programs run and are not adjusted dynamically at run time according to application characteristics; dynamic partitioning collects the characteristics of CPU-core and GPU-core access requests at run time and realizes adaptive dynamic partitioning.
2. The low-power-oriented heterogeneous multi-core shared cache partitioning method of claim 1, characterized in that step 4 specifically comprises:
Step 4.1: monitor access requests and obtain the memory access behavior of the CPU applications and the GPU applications respectively, counting the IPC values of the CPU applications and of the GPU applications;
Step 4.2: according to the IPC metric of the GPU, calculate the performance gain σ of the application programs, and allocate cache ways (Cache way) to the application with the maximum performance gain;
Step 4.3: periodically execute steps 4.1 and 4.2; according to the memory access information of the GPU in the current period, calculate the IPC gain of the GPU application, and allocate the cache ways to the corresponding application at the start of the next period.
3. The low-power-oriented heterogeneous multi-core shared cache partitioning method of claim 1, characterized in that step 4.2 specifically comprises the following. Let Threshold_low be the lower bound of the threshold evaluation and Threshold_high the upper bound:
1. If the IPC gain σ of the GPU is less than the threshold Threshold_low, the GPU application is cache-insensitive, and the cache ways are allocated to the CPU applications.
2. If the IPC gain σ of the GPU is greater than or equal to Threshold_low and less than or equal to Threshold_high, the GPU application is cache-sensitive and allocating the ways to it brings greater benefit, so the cache ways are allocated to the GPU application.
3. If the IPC gain σ of the GPU is greater than Threshold_high, the GPU application has undergone a phase change, and the cache partitioning is restored to its original state.
CN201611187228.1A 2016-12-20 2016-12-20 Low power consumption-oriented heterogeneous multi-core shared cache partitioning method Pending CN106708626A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611187228.1A CN106708626A (en) 2016-12-20 2016-12-20 Low power consumption-oriented heterogeneous multi-core shared cache partitioning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611187228.1A CN106708626A (en) 2016-12-20 2016-12-20 Low power consumption-oriented heterogeneous multi-core shared cache partitioning method

Publications (1)

Publication Number Publication Date
CN106708626A true CN106708626A (en) 2017-05-24

Family

ID=58939396

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611187228.1A Pending CN106708626A (en) 2016-12-20 2016-12-20 Low power consumption-oriented heterogeneous multi-core shared cache partitioning method

Country Status (1)

Country Link
CN (1) CN106708626A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463510A (en) * 2017-08-21 2017-12-12 北京工业大学 A high-performance-oriented heterogeneous multi-core shared cache management method
CN108154461A (en) * 2017-12-06 2018-06-12 中国航空工业集团公司西安航空计算技术研究所 A low-power-consumption GPU shading task and unified shader array task scene mapping architecture
CN108459912A (en) * 2018-04-10 2018-08-28 郑州云海信息技术有限公司 A last-level cache management method and related apparatus
CN109101332A (en) * 2017-06-20 2018-12-28 畅想芯科有限公司 Asymmetric multi-core heterogeneous parallel processing system
CN109753134A (en) * 2018-12-24 2019-05-14 四川大学 A GPU internal energy consumption control system and method based on global decoupling
CN110389833A (en) * 2019-06-28 2019-10-29 北京大学深圳研究生院 A performance scheduling method and system for a processor
CN111897747A (en) * 2020-07-24 2020-11-06 宁波中控微电子有限公司 Cache dynamic allocation method for an on-chip coprocessor and a system-on-chip
CN112000465A (en) * 2020-07-21 2020-11-27 山东师范大学 Method and system for reducing performance interference of delay-sensitive programs in a data center environment
CN112783803A (en) * 2021-01-27 2021-05-11 于慧 Computer CPU-GPU shared cache control method and system
CN113780336A (en) * 2021-07-27 2021-12-10 浙江工业大学 Lightweight cache partitioning method and device based on machine learning
CN114138179A (en) * 2021-10-19 2022-03-04 苏州浪潮智能科技有限公司 Method and device for dynamically adjusting write cache space

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105068940A (en) * 2015-07-28 2015-11-18 北京工业大学 Self-adaptive page strategy determination method based on Bank division

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105068940A (en) * 2015-07-28 2015-11-18 北京工业大学 Self-adaptive page strategy determination method based on Bank division

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Guang Suo; Xuejun Yang; Guanghui Liu et al.: "IPC-Based Cache Partitioning: An IPC-Oriented Dynamic Shared Cache Partitioning Mechanism", 2008 International Conference on Convergence and Hybrid Information Technology *
S. Kim; D. Chandra; Y. Solihin: "Fair cache sharing and partitioning in a chip multiprocessor architecture", 13th International Conference on Parallel Architectures and Compilation Techniques, 2004. PACT 2004 *
孙传伟: "CPU-GPU融合架构上共享Cache的动态划分技术" [Dynamic partitioning of the shared Cache on a CPU-GPU fused architecture], 中国优秀硕士学位论文全文数据库 信息科技辑 [China Master's Theses Full-text Database, Information Science and Technology] *
孙荪: "提高多核处理器片上Cache利用率的关键技术研究" [Research on key technologies for improving on-chip Cache utilization of multi-core processors], 中国博士学位论文全文数据库 信息科技辑 [China Doctoral Dissertations Full-text Database, Information Science and Technology] *
杨立, 邓振杰, 刘宏雁: 《微型计算机原理与接口技术学习指导 第二版》 [Study Guide to Microcomputer Principles and Interface Technology, 2nd Edition], 31 August 2007 *
陈希, 蒋乐民: 《微机原理与接口技术》 [Microcomputer Principles and Interface Technology], 31 July 2006 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101332A (en) * 2017-06-20 2018-12-28 畅想芯科有限公司 Asymmetric multicore heterogeneous parallel processing system
CN107463510B (en) * 2017-08-21 2020-05-08 北京工业大学 High-performance heterogeneous multi-core shared cache buffer management method
CN107463510A (en) * 2017-08-21 2017-12-12 北京工业大学 A high-performance-oriented heterogeneous multi-core shared cache management method
CN108154461A (en) * 2017-12-06 2018-06-12 中国航空工业集团公司西安航空计算技术研究所 A low-power-consumption GPU shading task and unified shader array task scene mapping architecture
CN108459912B (en) * 2018-04-10 2021-09-17 郑州云海信息技术有限公司 Last-level cache management method and related device
CN108459912A (en) * 2018-04-10 2018-08-28 郑州云海信息技术有限公司 A last-level cache management method and related apparatus
CN109753134A (en) * 2018-12-24 2019-05-14 四川大学 A GPU internal energy consumption control system and method based on global decoupling
CN109753134B (en) * 2018-12-24 2022-04-15 四川大学 Global decoupling-based GPU internal energy consumption control system and method
CN110389833A (en) * 2019-06-28 2019-10-29 北京大学深圳研究生院 A performance scheduling method and system for a processor
CN110389833B (en) * 2019-06-28 2023-06-16 北京大学深圳研究生院 Performance scheduling method and system for processor
CN112000465A (en) * 2020-07-21 2020-11-27 山东师范大学 Method and system for reducing performance interference of delay sensitive program in data center environment
CN112000465B (en) * 2020-07-21 2023-02-03 山东师范大学 Method and system for reducing performance interference of delay sensitive program in data center environment
CN111897747A (en) * 2020-07-24 2020-11-06 宁波中控微电子有限公司 Cache dynamic allocation method of on-chip coprocessor and on-chip system
CN112783803A (en) * 2021-01-27 2021-05-11 于慧 Computer CPU-GPU shared cache control method and system
CN112783803B (en) * 2021-01-27 2022-11-18 湖南中科长星科技有限公司 Computer CPU-GPU shared cache control method and system
CN113780336A (en) * 2021-07-27 2021-12-10 浙江工业大学 Lightweight cache partitioning method and device based on machine learning
CN113780336B (en) * 2021-07-27 2024-02-02 浙江工业大学 Lightweight cache dividing method and device based on machine learning
CN114138179A (en) * 2021-10-19 2022-03-04 苏州浪潮智能科技有限公司 Method and device for dynamically adjusting write cache space
CN114138179B (en) * 2021-10-19 2023-08-15 苏州浪潮智能科技有限公司 Method and device for dynamically adjusting write cache space

Similar Documents

Publication Publication Date Title
CN106708626A (en) Low power consumption-oriented heterogeneous multi-core shared cache partitioning method
Liu et al. A software memory partition approach for eliminating bank-level interference in multicore systems
Stuecheli et al. The virtual write queue: Coordinating DRAM and last-level cache policies
CN110704360B (en) Graph calculation optimization method based on heterogeneous FPGA data flow
CN104067227B (en) Branch prediction logic
CN104252392B (en) A kind of method and processor accessing data buffer storage
US8335892B1 (en) Cache arbitration between multiple clients
US8839259B2 (en) Thread scheduling on multiprocessor systems
US8904154B2 (en) Execution migration
Tsai et al. Adaptive scheduling for systems with asymmetric memory hierarchies
CN103218208A (en) System and method for performing shaped memory access operations
CN103207774A (en) Method And System For Resolving Thread Divergences
CN107463510B (en) High-performance heterogeneous multi-core shared cache buffer management method
CN103218309A (en) Multi-level instruction cache prefetching
Arora The architecture and evolution of cpu-gpu systems for general purpose computing
CN106250348B A heterogeneous multi-core architecture cache management method based on GPU memory access characteristics
CN108132834A Task allocation method and system under a multi-level shared cache architecture
Tian et al. Abndp: Co-optimizing data access and load balance in near-data processing
Liu et al. A space-efficient fair cache scheme based on machine learning for nvme ssds
Rai et al. Improving CPU performance through dynamic GPU access throttling in CPU-GPU heterogeneous processors
Rai et al. Exploiting Dynamic Reuse Probability to Manage Shared Last-level Caches in CPU-GPU Heterogeneous Processors
García-Guirado et al. Energy-efficient cache coherence protocols in chip-multiprocessors for server consolidation
Jia et al. Coordinate channel-aware page mapping policy and memory scheduling for reducing memory interference among multimedia applications
BiTalebi et al. LARA: Locality-aware resource allocation to improve GPU memory-access time
CN112817639A (en) Method for accessing register file by GPU read-write unit through operand collector

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170524

RJ01 Rejection of invention patent application after publication