WO2022267443A1 - Memory resource dynamic regulation and control method and system based on memory access and performance modeling - Google Patents

Memory resource dynamic regulation and control method and system based on memory access and performance modeling

Info

Publication number
WO2022267443A1
WO2022267443A1 · PCT/CN2022/070519
Authority
WO
WIPO (PCT)
Prior art keywords
memory access
memory
core
delay
token bucket
Prior art date
Application number
PCT/CN2022/070519
Other languages
French (fr)
Chinese (zh)
Inventor
徐易难
周耀阳
王卅
唐丹
孙凝晖
包云岗
Original Assignee
中国科学院计算技术研究所
Priority date
Filing date
Publication date
Application filed by 中国科学院计算技术研究所
Publication of WO2022267443A1 publication Critical patent/WO2022267443A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/0223User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023Free address space management
    • G06F12/0238Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory
    • G06F12/0246Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory in block erasable memory, e.g. flash memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the invention belongs to the technical field of quality-of-service assurance for key applications in real-time multi-core systems, and particularly relates to a method and system for dynamic regulation of memory resources based on memory access and performance modeling.
  • Intel Xeon series processors are equipped with Resource Director Technology (RDT), which includes cache monitoring, cache allocation, memory bandwidth monitoring and memory bandwidth allocation.
  • RDT Resource Director Technology
  • the operating system uses the resource allocation technology to monitor the cache and bandwidth usage of different cores, and adjusts the resources available to a single core by directly setting allocation ratios, thereby reducing performance interference and guaranteeing the performance of key workloads in complex environments.
  • the Application Slowdown Model (ASM) combines analysis of the shared cache and main memory; it assumes that, for a memory-bound application, performance is proportional to the rate at which memory access requests are issued, and that at the highest priority a process can reach its maximum memory bandwidth.
  • ASM minimizes interference during memory access on the one hand and quantifies the interference in the shared cache on the other, then periodically evaluates the performance loss to achieve feedback-driven dynamic adjustment of hardware resources.
  • Intel RDT allows only static partitioning of resources, and allocation amounts are based only on the needs of known sensitive applications.
  • RDT relies on manual control by software (the operating system or the user), and the hardware cannot adjust resource amounts dynamically at runtime. Because software control granularity is usually coarse, hardware resources are wasted and overall system performance suffers.
  • the application slowdown model is not architecture-portable. Because of the large number of shared resources, ASM must make extensive intrusive modifications to the system bus, memory controller, prefetcher and other components to support priorities, which is very costly to implement. The need to account for hardware implementation details increases modeling complexity, making it difficult to migrate the resource contention evaluation model between platforms.
  • the purpose of the present invention is to overcome the shortcomings of the prior art described above, namely static-only partitioning, lack of architecture portability and lack of application generality, and to propose a key application quality-of-service assurance method and system based on memory access latency prediction, performance loss prediction and dynamic bandwidth adjustment.
  • the present invention proposes a method for dynamic regulation and control of memory resources based on memory access and performance modeling, which includes:
  • Step 1: in the multi-core system, use the historical memory access requests issued to the DRAM by a preset process running alone as training data, and the latency of each historical request as the training target, to train a neural network model and obtain a memory access latency model;
  • Step 2: when the multi-core system runs multiple processes, record the target process's memory access requests and feed them into the memory access latency model to obtain the latency t_solo each target request would have without inter-process interference; at the same time, measure the actual latency t_mix of the request, and compute the latency scale-up ratio LatS = t_mix / t_solo;
  • Step 3: count the target process's in-core and out-of-core execution clock cycles and, combined with the latency scale-up ratio, obtain the performance loss of the target process in the multi-process run relative to running alone;
  • Step 4: when the performance loss exceeds a threshold, limit the DRAM memory traffic of processes other than the target process, so as to dynamically allocate DRAM bandwidth resources in real time and guarantee the target process's quality of service.
  • the performance loss is estimated as Loss = 1 - (A + B/LatS) / (A + B), where A is the number of in-core execution clock cycles, B is the number of out-of-core execution clock cycles, and LatS is the latency scale-up ratio.
  • step 4 includes: using token bucket technology to limit DRAM memory access traffic of processes other than the target process.
  • Each core of the multi-core system has an independent token bucket. A fixed number of tokens is added to the bucket every fixed period, and the bucket has a maximum token capacity. All memory access requests issued by the core pass through the token bucket; each request packet is marked when it enters the bucket, its entry time is recorded, and the bucket is checked for an available token. If one is available, the packet is sent to the lower layer and the token count decreases according to the request's data size; otherwise, the request is placed in the waiting queue.
  • the present invention also proposes a dynamic control system for memory resources based on memory access and performance modeling, which includes:
  • Module 1 is used to take, in the multi-core system, the historical memory access requests issued to the DRAM by a preset process running alone as training data and the latency of each historical request as the training target, to train a neural network model and obtain a memory access latency model;
  • Module 2 is used, when the multi-core system runs multiple processes, to record the target process's memory access requests and feed them into the latency model to obtain the latency t_solo each target request would have without inter-process interference, to measure at the same time the actual latency t_mix of the request, and to compute the latency scale-up ratio LatS = t_mix / t_solo;
  • Module 3 is used to count the target process's in-core and out-of-core execution clock cycles and, combined with the latency scale-up ratio, obtain the performance loss of the target process in the multi-process run relative to running alone;
  • Module 4 is used to limit the DRAM access traffic of processes other than the target process when the performance loss is greater than a threshold, so as to dynamically allocate DRAM bandwidth resources in real time and ensure the service quality of the target process.
  • in the memory resource dynamic control system based on memory access and performance modeling, the performance loss is Loss = 1 - (A + B/LatS) / (A + B), where A is the number of in-core execution clock cycles, B is the number of out-of-core execution clock cycles, and LatS is the latency scale-up ratio.
  • the module 4 includes: using token bucket technology to limit the DRAM memory access flow of processes other than the target process.
  • Each core of the multi-core system has an independent token bucket. A fixed number of tokens is added to the bucket every fixed period, and the bucket has a maximum token capacity. All memory access requests issued by the core pass through the token bucket; each request packet is marked when it enters the bucket, its entry time is recorded, and the bucket is checked for an available token. If one is available, the packet is sent to the lower layer and the token count decreases according to the request's data size; otherwise, the request is placed in the waiting queue.
  • the present invention has the following advantages:
  • the present invention proposes a technique for guaranteeing key applications' quality of service on real-time multi-core hardware through dynamic memory bandwidth partitioning, providing a fine-grained, high-precision, fast-response, non-intrusive solution.
  • the present invention designs the overall architecture of an automatic process performance regulation mechanism; through a label mechanism the hardware directly obtains the priority of upper-layer applications and provides differentiated hardware resource allocation for processes of different priorities.
  • DRAM Dynamic random access memory
  • based on memory access latency, the performance loss of a process relative to running alone is estimated with an average error of only 8.78%, better than existing related techniques.
  • by dynamically adjusting memory bandwidth allocation, other processes' memory access interference with key processes is effectively reduced, and the quality of service of a high-priority process is accurately guaranteed to reach 90% of its standalone level.
  • Fig. 1 is a schematic diagram of the position of AutoMBA of the present invention in the system and of its composition;
  • Fig. 2 is a schematic diagram of the inputs and outputs of the multi-layer perceptron model;
  • Fig. 3 is a schematic diagram showing that the execution time of an in-order processor can be divided into an in-core part and an out-of-core part;
  • Fig. 4 is a schematic diagram of the operating principle of the token bucket mechanism.
  • Key point 3: use quota-based token bucket technology to control programs' memory bandwidth, and dynamically adjust the token bucket parameters to control the memory bandwidth allocation of different programs, guaranteeing the performance of key programs and achieving the set target of 90% of ideal performance (standard deviation 4.19%).
  • the core technical principles of the present invention include: (1) based on a single application's memory access request sequence and the requests' timestamps in the multi-core environment, model the memory so as to faithfully reproduce main-memory behavior, estimate online the latency each request would have when running alone, and combine this with the measured latency in the actual mixed environment to obtain the latency scale-up ratio (Latency Scale-up, LatS); (2) count the process's in-core and out-of-core execution clock cycles and, combined with LatS, compute the process's execution time when running alone, thereby quantifying the interference the process suffers; (3) according to the priority relationships of the different cores, use token bucket technology to dynamically allocate memory bandwidth resources in real time, guaranteeing the quality of service of key applications.
  • the present invention proposes (1) modeling the DRAM access latency, treating the DRAM Bank structure as a black-box model, ShadowDRAM; the DRAM is modeled as a whole, and since every bank in the DRAM behaves similarly, the same black-box model is used for the different banks.
  • the input is information about the current and the past k memory access requests;
  • the output is the latency of the current memory access request.
  • the present invention considers that the current state of the DRAM and its controller depends on all historical access requests; given a current memory access request on top of this state, the request's access latency can be accurately predicted by modeling.
  • the reason is that, for a memory access component, the input signal is non-trivial only when a memory access request is received and does not otherwise affect how the internal state evolves; that is, in the absence of requests the component behaves as a Moore-type sequential circuit. The internal state of the DRAM and its controller can therefore be determined from the meaningful past inputs alone, i.e., the historical access request sequence.
  • Non-blocking caches introduce memory-level parallelism: the cache issues multiple memory access requests before receiving replies, and reply data may arrive out of order. The processing of a single request is therefore affected by all requests issued within a short window before and after it, and may be handled earlier or later as a result.
  • the present invention proposes that the DRAM latency prediction model (the black-box model) samples a request q and predicts its latency only when no request to the same Bank is issued in the interval after q is issued and before its reply data is received.
  • the main reasons are threefold: (1) latency prediction must be relatively real-time, and the gap between two consecutive requests to a bank may be large, so the model input cannot contain future information; (2) the sampling restriction above guarantees that future requests cannot affect q's processing time, so the historical request sequence fed to the model is sufficient to predict the current request's latency; (3) the model's computation may take a long time, and selecting only some requests by sampling to predict latency and performance loss both reserves sufficient time for the computation and reduces system power consumption.
  • the present invention models the Bank structure in DRAM: based on sampling of requests that satisfy certain conditions, the latency of the current request is predicted from a limited number of access-history entries of the corresponding bank.
  • the condition may be, for example, sampling one request out of every 100, or requiring that no new request be issued between the time a request is sent and the time its reply is received, or another feasible filtering condition; a minimal sketch of such a filter is shown below.
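  • The following Python sketch (an illustration under assumed request fields, not the patent's hardware implementation) expresses the second filtering condition:

```python
from dataclasses import dataclass

@dataclass
class MemRequest:
    issue_time: int   # cycle at which the request was issued
    reply_time: int   # cycle at which its reply data was received
    bank: int         # DRAM bank addressed by the request

def is_sampled(req, requests):
    """Sample req only if no other request to the same bank was issued
    between req's issue time and the receipt of its reply (the second
    filtering condition described above)."""
    return not any(
        other is not req
        and other.bank == req.bank
        and req.issue_time < other.issue_time < req.reply_time
        for other in requests
    )
```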
  • the present invention proposes modeling the DRAM Bank with a machine learning method, whose advantages are: (1) using only basic arithmetic operations and a small amount of stored history, the latency of the current memory access request can be predicted with fairly high accuracy; (2) the machine learning model is general and portable: its training requires very little internal information about the DRAM controller or DRAM chips, so porting to another platform only requires capturing new memory access traces and retraining the model; (3) partial sampling of memory access requests effectively reduces the number of latency predictions and lowers the system's dynamic power consumption.
  • the present invention uses a machine learning method to model the DRAM Bank based on a Multi-layer Perceptron (MLP) model.
  • MLP Multi-layer Perceptron
  • the reasons are: (1) the MLP is broadly applicable: with different parameters it can accurately simulate DRAM Banks under different configurations, so the latency prediction mechanism can be reused across platforms, effectively reducing porting effort; (2) the MLP model with the ReLU activation function involves no complex function evaluation: results are produced by simple multiply-add operations, and there is no longer any need to maintain the numerous queues and state machines inside the DRAM and its controller, which effectively reduces the power overhead of the latency prediction mechanism.
  • training data can be obtained by pre-running preset threads or programs and capturing their memory accesses and the latencies of the corresponding requests.
  • in-core processing includes the execution units, accesses to the core-private cache, etc.
  • out-of-core processing includes sending read requests to DRAM, DRAM-internal processing, etc.
  • CPU clock cycles can be strictly divided into two types, in-core time and out-of-core time, of which only out-of-core time is subject to multi-core interference.
  • the present invention considers that, when a process runs simultaneously with other processes on a multi-core architecture, its count of memory cycles (MC, out-of-core) increases relative to running alone, while its count of non-memory cycles (NMC, in-core) stays the same.
  • the hardware can classify each clock cycle as NMC or MC and count the two kinds over a period of time, recorded as A and B respectively. Because of inter-core interference, memory request latency rises and the MC count grows; the higher the increase ratio, the greater the interference from resource competition and the greater the process's performance loss.
  • when the process executes the same stage alone, the NMC count is still A, but the MC count shrinks in proportion to LatS; that is, the execution time the process needs when running alone is A + B/LatS.
  • Execution_Time is the execution time of the process or program;
  • Execution_Time_solo is its execution time when running alone;
  • Execution_Time_mix is its execution time in the mixed run (a worked sketch of the resulting loss estimate follows below).
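  • As a worked example of the relation above, the following Python sketch (an illustration, not the patent's implementation) computes the estimated standalone time and the resulting performance loss, assuming LatS = t_mix / t_solo >= 1:

```python
def performance_loss(A, B, lat_s):
    """Estimate a process's performance loss in a mixed run relative to solo.

    A      -- in-core (NMC) clock cycles, unaffected by interference
    B      -- out-of-core (MC) clock cycles measured in the mixed run
    lat_s  -- latency scale-up LatS = t_mix / t_solo (>= 1)
    """
    time_mix = A + B                 # Execution_Time_mix
    time_solo = A + B / lat_s        # Execution_Time_solo, per the text
    return 1.0 - time_solo / time_mix

# Example: A = 6000, B = 4000, LatS = 2  ->  solo time = 8000 cycles,
# mixed time = 10000 cycles, so the estimated performance loss is 20%.
```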
  • the present invention proposes an automatic memory bandwidth allocation mechanism (Automatic Memory Bandwidth Allocation, AutoMBA): using SEMAL, the operation of a program can be monitored dynamically at runtime and its degree of performance loss obtained, and from the memory bandwidth observed during the mixed run its best-case (standalone) bandwidth demand is forecast. Using token bucket technology, the system regulates the memory bandwidth of the different cores; by limiting low-priority memory traffic, the quality of service of key applications is guaranteed first.
  • AutoMBA Automatic Memory Bandwidth Allocation
  • Token Bucket (TB) technology is the basic tool of AutoMBA, which can effectively and accurately control the memory access bandwidth of different cores.
  • each CPU core has a private, independent token bucket; a set number of tokens (Inc) is added automatically every fixed period (Freq), and the bucket has a maximum capacity (Size).
  • All memory access requests issued by the core pass through the token bucket; every request packet is marked when it enters the bucket, and its entry time is recorded.
  • If an available token exists, the packet is sent to the lower layer, and the token count decreases according to the amount of data requested.
  • Otherwise, the request is blocked and placed in the waiting queue.
  • a timer synchronized with the system clock is kept in the token bucket; each time it reaches Freq it resets and triggers the automatic addition of Inc tokens.
  • requests in the waiting queue are then sent to the lower layer and the token count is reduced accordingly. For example, the previously blocked requests PACKET 2 and PACKET 3 are released at time t3; if each requests one unit of data, ntokens decreases by 2 at the same time (a minimal sketch of this mechanism follows below).
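  • The following minimal Python sketch illustrates the token bucket behaviour just described, with the Size/Freq/Inc parameters named as in the text; the packet representation and unit token costs are illustrative assumptions:

```python
from collections import deque

class TokenBucket:
    """Per-core token bucket: Inc tokens are added every Freq cycles,
    capped at Size; a forwarded request consumes tokens equal to its data size."""

    def __init__(self, size, freq, inc):
        self.size, self.freq, self.inc = size, freq, inc
        self.ntokens = size          # current token count
        self.timer = 0               # timer synchronized with the system clock
        self.waiting = deque()       # requests blocked for lack of tokens

    def tick(self):
        # Advance one cycle; on every Freq-th cycle the timer resets and
        # Inc tokens are added (up to the maximum capacity Size).
        self.timer += 1
        if self.timer == self.freq:
            self.timer = 0
            self.ntokens = min(self.size, self.ntokens + self.inc)
            self._drain()

    def handle(self, packet, cost, now):
        packet["enter_time"] = now   # mark the packet on entry
        if self.ntokens >= cost:
            self.ntokens -= cost
            return packet            # forwarded to the lower layer
        self.waiting.append((packet, cost))
        return None                  # blocked in the waiting queue

    def _drain(self):
        # Release waiting requests while tokens remain, reducing ntokens
        # accordingly (e.g. two unit-cost packets reduce ntokens by 2).
        while self.waiting and self.ntokens >= self.waiting[0][1]:
            packet, cost = self.waiting.popleft()
            self.ntokens -= cost     # packet would be forwarded here
```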
  • in each sampling interval (Sampling Interval, SI): (1) each core's TB runs automatically with its configured parameters, and the core issues memory access requests subject to the number of tokens currently remaining; (2) the latency prediction module (LPM) records the memory access history of the high-priority core (the target process) and predicts the latency t_solo its requests would have when running alone; (3) the token bucket controller (TBM), on one hand, handles the requests issued by the TBs and forwards requests from the different cores to the DRAM controller; on the other hand, it obtains the predicted t_solo from the LPM, measures the actual latency t_mix, and uses the ratio between t_solo and t_mix together with the core's memory cycle counts to estimate the performance loss of the target process relative to running alone, recording the result in preset registers.
  • LPM delay prediction module
  • An updating interval (Updating Interval, UI) consists of multiple sampling periods.
  • at the end of each UI, the AutoMBA mechanism evaluates the target process's performance loss over the past UI: when the loss is too large, the TBM automatically restricts the memory traffic of the remaining cores, reducing inter-core interference and improving the target process's performance; when the target process's performance meets the requirement, flow control on the remaining cores is relaxed.
  • AutoMBA's control algorithm consists of two steps, Observe and Act. Observe runs at the end of each SI: combining the memory access latencies and cycle counts, the hardware computes the process's performance loss and sets the corresponding counters. Act runs at the end of each UI: if the hardware finds that the target process kept its performance loss under 10% in most SIs, the maximum traffic allowed to the remaining cores is increased, and the more SIs that satisfied the target, the larger the increase; if the target process's loss reached 50% or more in at least 3 SIs, the traffic allowed to the remaining cores is halved outright; for losses in between, the token bucket Inc parameter is likewise adjusted correspondingly (a sketch of this loop follows below).
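  • A minimal Python sketch of this Observe/Act loop is given below. The 10% and 50% thresholds and the 3-SI count come from the text above; the concrete step sizes for relaxing throttling and the handling of intermediate cases are illustrative assumptions, since the text leaves them open:

```python
def observe(si_losses, loss):
    # End of each SI: record the performance loss estimated by the hardware.
    si_losses.append(loss)

def act(si_losses, other_buckets, target_loss=0.10, severe_loss=0.50):
    # End of each UI: adjust the Inc parameter of the non-target cores' buckets.
    n_good = sum(1 for l in si_losses if l < target_loss)
    n_severe = sum(1 for l in si_losses if l >= severe_loss)
    for tb in other_buckets:
        if n_severe >= 3:
            tb.inc = max(1, tb.inc // 2)      # halve allowed traffic outright
        elif n_good > len(si_losses) // 2:
            tb.inc += n_good                  # relax: more good SIs, bigger step (assumed)
        else:
            tb.inc = max(1, tb.inc - 1)       # intermediate case: small tightening (assumed)
    si_losses.clear()
```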
  • the invention proposes a method and system for dynamic regulation of memory resources based on memory access and performance modeling: a memory latency model predicts the memory access latency an application would have when running alone; combined with the application's measured latency in a mixed multi-application environment, this yields the latency scale-up ratio. Counting the process's in-core and out-of-core execution clock cycles and combining them with this ratio gives the process's execution time when running alone, from which the performance loss the process suffers is quantified. When the target process's performance loss exceeds a threshold, memory traffic other than the target process's is limited, ensuring that the high-priority process's quality of service stays close to its standalone level.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Memory System (AREA)

Abstract

Provided in the present invention are a method and system for dynamic regulation and control of memory resources based on memory access and performance modeling. By guaranteeing the quality of service of key applications on real-time multi-core hardware through dynamic memory bandwidth resource partitioning, a non-invasive solution with fine granularity, high precision and quick response is provided. In the present invention, an overall architecture of an automatic process performance regulation mechanism is designed; by means of a label mechanism, the hardware directly acquires the priority of an upper-layer application, so as to provide differentiated hardware resource allocation for processes with different priorities. On the basis of a machine learning method, latency modeling is performed on the bank structure of a dynamic random access memory. For the problem of guaranteeing the quality of service of key applications in a real-time multi-core environment, memory access interference of other processes with a key process can be effectively reduced by dynamically adjusting memory bandwidth allocation, such that the quality of service of a high-priority process is accurately guaranteed.

Description

Method and system for dynamic regulation and control of memory resources based on memory access and performance modeling

Technical Field
The invention belongs to the technical field of quality-of-service assurance for key applications in real-time multi-core systems, and particularly relates to a method and system for dynamic regulation of memory resources based on memory access and performance modeling.
Background Art
In a real-time system, the quality of service of key applications must be guaranteed; at the hardware level, this means the amount of hardware resources allocated to key processes must be assured. As applications demand ever more computing resources, scenarios such as cloud computing, smartphones and 5G base stations keep raising the processing requirements on computer hardware, and multi-core has become the standard configuration of almost all real-time systems. In a multi-core scenario, however, multiple applications running on the same processor compete for hardware resources, causing performance fluctuations that in turn affect the performance of the real-time system.
Therefore, some work has already addressed the question of how, in a system with real-time requirements, to correctly and efficiently control the allocation of hardware resources among different applications and guarantee the quality of service of key applications.
Intel Xeon series processors are equipped with Resource Director Technology (RDT), which includes cache monitoring, cache allocation, memory bandwidth monitoring and memory bandwidth allocation. The operating system uses these technologies to monitor the cache and bandwidth usage of different cores, and adjusts the resources available to a single core by directly setting allocation ratios, thereby reducing performance interference and guaranteeing the performance of key workloads in complex environments.
The Application Slowdown Model (ASM) combines analysis of the shared cache and main memory. It assumes that, for a memory-bound application, performance is proportional to the rate at which memory access requests are issued, and that at the highest priority a process can reach its maximum memory bandwidth. ASM minimizes interference during memory access on the one hand and quantifies the interference in the shared cache on the other; it then periodically evaluates the performance loss to achieve feedback-driven dynamic adjustment of hardware resources.
Intel RDT allows only static partitioning of resources, and allocation amounts are based only on the needs of known sensitive applications. At runtime, because the hardware has no awareness of program demands, RDT relies on manual control by software (the operating system or the user), and the hardware cannot adjust resource amounts dynamically. Since software control granularity is usually coarse, this wastes hardware resources and negatively affects overall system performance.
The application slowdown model is not architecture-portable. Because of the large number of shared resources, ASM must make extensive intrusive modifications to the system bus, memory controller, prefetcher and other components in order to control hardware resources and support priorities, which is very costly to implement. The need to account for hardware implementation details increases modeling complexity, making it difficult to migrate the resource contention evaluation model between platforms.
Existing methods all share the problem of using heuristic rules to judge inter-core interference in specific scenarios and therefore lack general applicability: RDT hardware cannot automatically identify and adjust the resource partitioning, while ASM assumes that a memory-bound application at the highest priority can reach its maximum memory bandwidth and uses this to quantify the bandwidth loss caused by multi-core interference, a premise that does not necessarily hold.
Disclosure of the Invention
The purpose of the present invention is to overcome the shortcomings of the prior art described above, namely static-only partitioning, lack of architecture portability and lack of application generality, and to propose a key application quality-of-service assurance method and system based on memory access latency prediction, performance loss prediction and dynamic bandwidth adjustment.
In view of the deficiencies of the prior art, the present invention proposes a method for dynamic regulation and control of memory resources based on memory access and performance modeling, comprising:
Step 1: in the multi-core system, use the historical memory access requests issued to the DRAM by a preset process running alone as training data, and the latency of each historical request as the training target, to train a neural network model and obtain a memory access latency model;
Step 2: when the multi-core system runs multiple processes, record the target process's memory access requests and feed them into the memory access latency model to obtain the latency t_solo each target request would have without inter-process interference; at the same time, measure the actual latency t_mix of the request, and compute the latency scale-up ratio LatS = t_mix / t_solo;
Step 3: count the target process's in-core and out-of-core execution clock cycles and, combined with the latency scale-up ratio, obtain the performance loss of the target process in the multi-process run relative to running alone;

Step 4: when the performance loss exceeds a threshold, limit the DRAM memory traffic of processes other than the target process, so as to dynamically allocate DRAM bandwidth resources in real time and guarantee the target process's quality of service.
In the above method, the historical memory access request information includes the current request h_0 to the target Bank and the past k history entries h_i (i = 1, ..., k), where each h_i (i = 0, ..., k) contains its issue time t_i and the row address row_i and column address col_i it accesses. The inputs of the latency model are the differences between the current request and each history entry, t_0 - t_i, row_0 - row_i and col_0 - col_i; the output is the latency of the current request h_0, Latency = g(h_0, ..., h_k), and the latency model is trained by fitting the function g.
In the above method, the performance loss is:

Loss = 1 - (A + B/LatS) / (A + B)

where A is the number of in-core execution clock cycles, B is the number of out-of-core execution clock cycles, and LatS is the latency scale-up ratio.
In the above method, step 4 comprises: using token bucket technology to limit the DRAM memory traffic of processes other than the target process.
In the above method, each core of the multi-core system has an independent token bucket. A fixed number of tokens is added to the bucket every fixed period, and the bucket has a maximum token capacity. All memory access requests issued by the core pass through the token bucket; each request packet is marked when it enters the bucket, its entry time is recorded, and the bucket is checked for an available token. If one is available, the packet is sent to the lower layer and the token count decreases according to the request's data size; otherwise, the request is placed in the waiting queue.
The present invention also proposes a system for dynamic regulation and control of memory resources based on memory access and performance modeling, comprising:
Module 1 is used to take, in the multi-core system, the historical memory access requests issued to the DRAM by a preset process running alone as training data and the latency of each historical request as the training target, to train a neural network model and obtain a memory access latency model;

Module 2 is used, when the multi-core system runs multiple processes, to record the target process's memory access requests and feed them into the latency model to obtain the latency t_solo each target request would have without inter-process interference, to measure at the same time the actual latency t_mix of the request, and to compute the latency scale-up ratio LatS = t_mix / t_solo;

Module 3 is used to count the target process's in-core and out-of-core execution clock cycles and, combined with the latency scale-up ratio, obtain the performance loss of the target process in the multi-process run relative to running alone;

Module 4 is used, when the performance loss exceeds a threshold, to limit the DRAM memory traffic of processes other than the target process, so as to dynamically allocate DRAM bandwidth resources in real time and guarantee the target process's quality of service.
In the above system, the historical memory access request information includes the current request h_0 to the target Bank and the past k history entries h_i (i = 1, ..., k), where each h_i (i = 0, ..., k) contains its issue time t_i and the row address row_i and column address col_i it accesses. The inputs of the latency model are the differences between the current request and each history entry, t_0 - t_i, row_0 - row_i and col_0 - col_i; the output is the latency of the current request h_0, Latency = g(h_0, ..., h_k), and the latency model is trained by fitting the function g.
In the above system, the performance loss is:

Loss = 1 - (A + B/LatS) / (A + B)

where A is the number of in-core execution clock cycles, B is the number of out-of-core execution clock cycles, and LatS is the latency scale-up ratio.
In the above system, module 4 comprises: using token bucket technology to limit the DRAM memory traffic of processes other than the target process.
In the above system, each core of the multi-core system has an independent token bucket. A fixed number of tokens is added to the bucket every fixed period, and the bucket has a maximum token capacity. All memory access requests issued by the core pass through the token bucket; each request packet is marked when it enters the bucket, its entry time is recorded, and the bucket is checked for an available token. If one is available, the packet is sent to the lower layer and the token count decreases according to the request's data size; otherwise, the request is placed in the waiting queue.
It can be seen from the above solutions that the advantages of the present invention are as follows:
The present invention proposes a technique for guaranteeing the quality of service of key applications on real-time multi-core hardware through dynamic memory bandwidth partitioning, providing a fine-grained, high-precision, fast-response, non-intrusive solution. The invention designs the overall architecture of an automatic process performance regulation mechanism; through a label mechanism the hardware directly obtains the priority of upper-layer applications and provides differentiated hardware resource allocation for processes of different priorities. It innovatively models the latency of the bank structure of dynamic random access memory (DRAM) with a machine learning method, reaching over 90% prediction accuracy in the vast majority of scenarios, with an average error of 2.78%. Based on memory access latency, it estimates a process's performance loss relative to running alone with an average error of only 8.78%, better than existing related techniques. For the problem of guaranteeing key applications' quality of service in a real-time multi-core environment, dynamically adjusting memory bandwidth allocation effectively reduces other processes' memory access interference with key processes and accurately guarantees that a high-priority process's quality of service reaches 90% of its standalone level.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the position of AutoMBA of the present invention in the system and of its composition;

Fig. 2 is a schematic diagram of the inputs and outputs of the multi-layer perceptron model;

Fig. 3 is a schematic diagram showing that the execution time of an in-order processor can be divided into an in-core part and an out-of-core part;

Fig. 4 is a schematic diagram of the operating principle of the token bucket mechanism.
Best Mode for Carrying Out the Invention
While studying multi-core program performance analysis and dynamic resource adjustment and optimization, the inventors found that the prior art does not combine low-level hardware information, such as memory latency, bandwidth and program memory access characteristics, with high-level software information, leading to missing hardware information, unknowable actual software behavior and complex control techniques. Through research, the inventors found that this problem can be solved by modeling actual software performance loss using latency, bandwidth and access-characteristic information, deriving the estimated performance loss online in hardware, and performing token-bucket-based memory bandwidth allocation in a feedback manner driven by continuous observation. Specifically, this application includes the following key technical points:
Key point 1: use a machine learning method to model memory latency offline from historical memory access address sequences; the average error rate on the SPEC CPU2006 benchmarks is 2.84%. Here offline means that the model is already built by the time the program runs (it was completed while the program was offline), rather than being built online during execution.
Key point 2: use a machine learning method to model offline, from the measured memory access latency, the estimated ideal memory access latency, the program's memory bandwidth, its memory access frequency and other information, the performance loss a program suffers in the multi-core case; the average error on the SPEC CPU2006 benchmarks is 8.78%.
Key point 3: use quota-based token bucket technology to control programs' memory bandwidth, and dynamically adjust the token bucket parameters to control the memory bandwidth allocation of different programs, guaranteeing the performance of key programs and achieving the set target of 90% of ideal performance (standard deviation 4.19%).
To make the above features and effects of the present invention clearer and easier to understand, embodiments are described in detail below in conjunction with the accompanying drawings.
As shown in Fig. 1, the core technical principles of the present invention include: (1) based on a single application's memory access request sequence and the requests' timestamps in the multi-core environment, model the memory so as to faithfully reproduce main-memory behavior, estimate online the latency each request would have when running alone, and combine this with the measured latency in the actual mixed environment to obtain the latency scale-up ratio (Latency Scale-up, LatS); (2) count the process's in-core and out-of-core execution clock cycles and, combined with LatS, compute the process's execution time when running alone, thereby quantifying the interference the process suffers; (3) according to the priority relationships of the different cores, use token bucket technology to dynamically allocate memory bandwidth resources in real time, guaranteeing the quality of service of key applications.
The specific implementation of the technical solution of the present invention is as follows:
(1) DRAM memory access latency modeling
To obtain in real time, in a multi-core environment, the latency a process's memory access requests would have in a standalone environment, the present invention proposes: (1) modeling the DRAM access latency, treating the DRAM Bank structure as a black-box model, ShadowDRAM; the DRAM is modeled as a whole, and since every bank in the DRAM behaves similarly, the same black-box model is used for the different banks; its input is information about the current and the past k memory access requests, and its output is the latency of the current request; (2) when multiple processes run, obtaining the upper-layer application's process information from the labeled environment; the hardware records a single process's memory access requests and feeds them into the ShadowDRAM latency model to obtain the latency t_solo each request would have without multi-core interference.
The present invention holds that the current state of the DRAM and its controller depends on all historical access requests; given a current memory access request on top of this state, the request's access latency can be accurately predicted by modeling. The reason is that, for a memory access component, the input signal is non-trivial only when a memory access request is received and does not otherwise affect how the internal state evolves; that is, in the absence of requests the component behaves as a Moore-type sequential circuit. The internal state of the DRAM and its controller can therefore be determined from the meaningful past inputs alone, i.e., the historical access request sequence.
More particularly, the inventors found that the latency of a DRAM access request can be determined by a small number of surrounding requests to the same Bank. This is because: (1) the state of the row buffer inside a DRAM Bank has the greatest influence on access latency, and that state is determined by the previous request to the bank, so older requests do not affect it; (2) non-blocking caches introduce memory-level parallelism: the cache issues multiple memory access requests before receiving replies, and reply data may arrive out of order, so the processing of a single request is affected by all requests issued within a short window before and after it and may be handled earlier or later as a result.
The present invention proposes that the DRAM latency prediction model (the black-box model) samples a request q and predicts its latency only when no request to the same bank is issued during the period after q is issued and before its reply data is received. There are three main reasons: (1) latency prediction must be reasonably real-time, and the time between two consecutive requests to a bank can be large, so the model's input cannot include future information; (2) the sampling condition above guarantees that future requests cannot affect the processing time of request q, so the historical request sequence fed to the model is sufficient to predict the latency of the current request; (3) the model's computation may take a relatively long time, and predicting latency and performance loss only for a sampled subset of requests both reserves enough time for the computation and reduces system power consumption.
Summarizing the discussion above, in order to find a general latency prediction mechanism and avoid unnecessary study of DRAM internals, and based on the assumption that the current state of the DRAM and its controller can be determined from part of the access history of the current bank, the present invention models the bank structure in DRAM: by sampling the subset of requests that satisfy certain conditions, it predicts the latency of the current request from a bounded amount of access history for the corresponding bank. The condition may be, for example, sampling one request out of every 100, or requiring that no new request be issued between a request's issue and the receipt of its reply, or any other feasible filtering condition.
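As an illustration only (the patent does not prescribe an implementation; the Req fields and trace representation below are hypothetical), the second filtering condition can be sketched in Python as follows:

    from dataclasses import dataclass

    @dataclass
    class Req:
        bank: int    # target bank of the request
        issue: int   # cycle at which the request was issued
        reply: int   # cycle at which the reply data arrived

    def should_sample(q: Req, trace: list) -> bool:
        # Sample q only if no other request to the same bank was issued
        # while q was outstanding (after its issue, before its reply).
        return not any(r is not q and r.bank == q.bank
                       and q.issue < r.issue < q.reply
                       for r in trace)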
The present invention proposes modeling the DRAM bank with machine learning, whose advantages are: (1) using only basic arithmetic operations and a small amount of stored history, the latency of the current memory access request can be predicted with fairly high accuracy; (2) the machine learning model is general and portable: its training requires very little internal information about the DRAM controller or the DRAM devices, so porting to a different platform only requires re-capturing memory access traces and retraining the model; (3) partial sampling of memory access requests effectively reduces the number of latency predictions and lowers the system's dynamic power consumption.
The present invention models the DRAM bank with a multi-layer perceptron (MLP), for two reasons: (1) the MLP is universally applicable: with different parameters it can accurately model DRAM banks under different configurations, so the latency prediction mechanism can be reused across platforms, effectively reducing porting effort; (2) the MLP model with ReLU activation involves no complex function evaluation: results are produced by simple multiply-accumulate operations, and there is no need to mirror the many queues and state machines inside the DRAM and its controller, which effectively reduces the power overhead of the latency prediction mechanism.
As shown in Figure 2, the present invention uses the current request h_0 to the target bank and the past k access histories h_i (i=1,…,k) to predict memory access latency, where each h_i (i=0,…,k) carries the issue time t_i of h_i and the row address row_i and column address col_i accessed by h_i. The model's actual input encodes each history entry by its difference from the current request: during training and prediction, each history entry h_i is represented by t_0 - t_i, row_0 - row_i, col_0 - col_i. The model's output is the latency of the current request, Latency = g(h_0,…,h_k), and training amounts to continuously fitting the function g. The latency labels for training can be obtained by running preset threads or programs in advance and capturing their memory accesses together with the latency of each request.
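A minimal sketch of this training setup follows (using scikit-learn for concreteness; the hidden layer sizes and K = 4 are assumptions, not values fixed by the patent):

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    K = 4  # number of history entries fed to the model (assumed value of k)

    def encode(current, history):
        # current = (t0, row0, col0); history = the K most recent same-bank
        # requests, each a (t, row, col) tuple, oldest first.  Each history
        # entry is represented by its difference from the current request.
        t0, row0, col0 = current
        feats = []
        for t, row, col in history[-K:]:
            feats += [t0 - t, row0 - row, col0 - col]
        return feats

    # X, y would be gathered by tracing preset programs run alone:
    # X = [encode(cur, hist) for each sampled request], y = measured latencies.
    model = MLPRegressor(hidden_layer_sizes=(32, 32), activation="relu",
                         max_iter=1000)
    # model.fit(np.array(X), np.array(y))              # fit g offline
    # t_solo = model.predict([encode(cur, hist)])[0]   # online prediction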
(2) Program performance loss modeling
From a microarchitectural point of view, the execution of a process can be divided into an in-core part and an out-of-core part. The in-core part includes the functional units, accesses to the core's private caches, and so on; the out-of-core part includes issuing read requests to DRAM, processing inside the DRAM, and so on.
As shown in Figure 3, in an in-order processor, when a memory access instruction causes an L2 cache miss, the core's pipeline stalls and resumes only after the DRAM returns the data. Therefore, during the execution of a program on an in-order processor, CPU clock cycles can be strictly divided into in-core time and out-of-core time, of which only the out-of-core time is subject to multi-core interference.
In an out-of-order processor, later instructions may under certain conditions be executed early, so a cache miss does not necessarily stall the core's pipeline. However, from the point of view of the system as a whole, in any clock cycle either some out-of-core memory access request is being processed, or all memory access requests have completed; these two kinds of cycles are called memory cycles (MCs) and non-memory cycles (NMCs), respectively.
On the other hand, in a multi-core architecture, requests sent from the L2 over the system bus to the DRAM controller suffer interference from inter-core resource contention, which increases their overall latency: for example, if the system bus is occupied by another core when a request is issued, the request must wait until the other core's request finishes its transfer. The largest source of inter-core interference lies inside the DRAM bank. The row buffer mechanism is similar to a cache: when a program has good locality, its access latency is greatly reduced. In a multi-core environment, however, requests from other processes are likely to be interleaved between two consecutive requests of a single process, destroying the locality of DRAM accesses and hurting access latency.
Considering the two aspects above, the present invention holds that when a process runs concurrently with other processes on a multi-core architecture, its number of MCs increases relative to running alone, while its number of NMCs stays the same. In a multi-core environment, by checking whether any DRAM access request is currently outstanding, the hardware can classify every clock cycle as an NMC or an MC and count the two kinds of cycles over a time interval, denoting the counts A and B respectively. Because of inter-core interference, memory access latency grows and the MC count rises; the larger the increase, the greater the interference from resource contention and the greater the performance loss of the process.
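A minimal sketch of this per-cycle classification (illustrative only; has_outstanding_dram_request stands in for the hardware probe and is hypothetical):

    def count_cycles(n_cycles, has_outstanding_dram_request):
        # Classify every cycle of a sampling window as NMC (A) or MC (B)
        # by checking for outstanding DRAM requests, as described above.
        A = B = 0
        for cycle in range(n_cycles):
            if has_outstanding_dram_request(cycle):
                B += 1   # memory cycle (MC): some DRAM request in flight
            else:
                A += 1   # non-memory cycle (NMC)
        return A, B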
Based on this, the present invention proposes the Slowdown Estimation via Memory Access Latency (SEMAL) model: after obtaining the single-core memory access latency t_solo of a process from the machine learning model, and combining it with the actual multi-core latency t_mix, we can estimate the scale-up of the process's MC count from the latency scale-up ratio LatS = t_mix / t_solo. Mapped back to the standalone scenario, when the process executes the same phase its NMC count is still A, but its MC count shrinks by the factor LatS, i.e. the execution time the process would need is A + B/LatS.
Therefore, in a multi-core environment, based on the relative lengthening of the process's execution time, its performance loss relative to running alone is computed as:
Slowdown = Execution_Time_mix / Execution_Time_solo = (A + B) / (A + B / LatS)
where Execution_Time denotes the execution time of the process or program, Execution_Time_solo its execution time when running alone, and Execution_Time_mix its execution time in the mixed environment.
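A worked illustration with hypothetical numbers:

    # Suppose a window counts A = 6,000,000 NMCs and B = 4,000,000 MCs in the
    # mixed run, and SEMAL estimates LatS = t_mix / t_solo = 2.0.
    A, B, LatS = 6_000_000, 4_000_000, 2.0
    time_mix  = A + B            # 10,000,000 cycles observed in the mixed run
    time_solo = A + B / LatS     #  8,000,000 cycles estimated for a solo run
    slowdown  = time_mix / time_solo
    print(slowdown)              # 1.25, i.e. a 25% performance loss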
(3) Token-bucket-based quality-of-service guarantee
On top of the SEMAL model, the present invention proposes the Automatic Memory Bandwidth Allocation (AutoMBA) mechanism: using SEMAL, we can monitor a program at run time, obtain its degree of performance loss and, from its memory bandwidth in the mixed run, predict its bandwidth demand in the best case (running alone). Using token bucket technology, the system can regulate the memory access bandwidth of the different cores and, by throttling low-priority memory traffic, preferentially guarantee the quality of service of critical applications.
Token bucket (TB) technology is the basic tool of AutoMBA; it can control the memory access bandwidth of each core effectively and precisely. As shown in Figure 4, every CPU core has a private, independent token bucket to which a fixed number of tokens (Inc) is automatically added every fixed period (Freq), up to a maximum capacity (Size). All memory access requests issued by the core pass through the token bucket, and every request packet is tagged on entry with the time it entered the bucket. At that point, if tokens are available (e.g. PACKET 0 issued at time t_0), the packet is forwarded to the lower level and the token count is decreased by the amount of data requested. Otherwise, if no tokens remain (e.g. PACKET 1 and PACKET 2 issued at times t_1 and t_2), the request is blocked and placed in a waiting queue. The bucket contains a timer synchronized with the system clock; each time it reaches Freq it resets and triggers the automatic addition of Inc tokens. At that point, while tokens remain, requests in the waiting queue are forwarded to the lower level and the token count is decreased accordingly; for example, at time t_3 the previously blocked requests PACKET 2 and PACKET 3 are released, and if each requests one unit of data, ntokens is decreased by 2.
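A hedged sketch of this per-core bucket (the Freq, Inc and Size parameters and the entry-time tag come from the text; the beats attribute, meaning the amount of data a packet requests, and the forward hook are assumptions):

    from collections import deque

    class TokenBucket:
        def __init__(self, freq, inc, size, forward):
            self.freq, self.inc, self.size = freq, inc, size
            self.ntokens = size        # start full
            self.waiting = deque()     # blocked requests, in arrival order
            self.forward = forward     # hands a packet to the DRAM controller

        def tick(self, cycle):
            # Timer synchronized with the system clock: every freq cycles,
            # add inc tokens (capped at size) and drain the waiting queue.
            if cycle % self.freq == 0:
                self.ntokens = min(self.size, self.ntokens + self.inc)
                while self.waiting and self.ntokens >= self.waiting[0].beats:
                    req = self.waiting.popleft()
                    self.ntokens -= req.beats
                    self.forward(req)

        def request(self, req, cycle):
            req.enter_time = cycle     # tag the packet with its entry time
            if self.ntokens >= req.beats:
                self.ntokens -= req.beats   # tokens available: send downstream
                self.forward(req)
            else:
                self.waiting.append(req)    # no tokens left: block in the queue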
Within each sampling interval (SI): (1) each core's TB runs automatically with its configured parameters, limiting the core's memory requests according to the tokens currently remaining; (2) the latency prediction module (LPM) records the memory access history of the high-priority core (or target process) and predicts the latency t_solo its requests would have when running alone; (3) the token bucket controller (TBM), on the one hand, handles the requests emitted by the TBs and forwards requests from the different cores to the DRAM controller; on the other hand, it takes the predicted t_solo from the LPM, measures the actual latency t_mix, and from the relative ratio of t_solo and t_mix together with the core's memory cycle counts estimates the performance loss of the target process relative to running alone, recording it in dedicated registers.
An updating interval (UI) consists of several sampling intervals. At the end of each UI, the AutoMBA mechanism evaluates how much performance the target process lost during the past UI. If the target process has lost too much performance, the TBM automatically throttles the memory traffic of the other cores, reducing inter-core interference and improving the target's performance; if the target's performance meets the requirement, the traffic limits on the other cores can be relaxed.
AutoMBA's control algorithm proceeds in two steps, Observe and Act. Observe runs at the end of every SI: combining the latency measurements with the memory cycle counts, the hardware computes the performance loss of the process and sets the corresponding counters. Act runs at the end of every UI: if the hardware finds that the target process suffered less than 10% performance loss in most SIs, it raises the maximum traffic allowed to the other cores, and the more SIs that qualify, the larger the increase; if the target's performance loss reached 50% or more in at least 3 SIs, the traffic allowed to the other cores is immediately halved; for performance losses in the intermediate ranges, the token bucket Inc parameters are adjusted correspondingly.
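A hedged sketch of the Act step (the 10%, 50% and 3-SI thresholds come from the text; the exact increment policy and the majority test are assumptions):

    def act(other_buckets, slowdowns, n_si):
        # slowdowns: one SEMAL slowdown estimate per SI of the past UI,
        # recorded by Observe (1.0 means no loss, 1.10 means 10% loss, ...).
        good = sum(1 for s in slowdowns if s <= 1.10)
        bad  = sum(1 for s in slowdowns if s >= 1.50)
        for tb in other_buckets:               # low-priority cores only
            if bad >= 3:
                tb.inc = max(1, tb.inc // 2)   # halve their allowed traffic
            elif good > n_si // 2:
                tb.inc += good                 # relax; more good SIs, bigger raise
            # losses in intermediate ranges would adjust tb.inc correspondingly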
The following are system embodiments corresponding to the method embodiments above; this embodiment can be implemented in cooperation with the embodiments above. The relevant technical details mentioned in the embodiments above remain valid in this embodiment and, to reduce repetition, are not repeated here. Correspondingly, the relevant technical details mentioned in this embodiment can also be applied in the embodiments above.
The present invention also proposes a memory resource dynamic regulation system based on memory access and performance modeling, comprising:
Module 1, configured to, in a multi-core system, train a neural network model using the historical memory access request information of a preset process accessing the DRAM alone as training data and the latency corresponding to that historical request information as the training target, obtaining a memory access latency model;
Module 2, configured to, when the multi-core system runs multiple processes, record the target memory access requests of a target process and feed them into the latency model to obtain the latency t_solo of each target request in the absence of inter-process interference, while measuring the actual latency t_mix of the target request, and to divide the latency t_mix by the latency t_solo to obtain the memory access latency scale-up ratio;
Module 3, configured to count the numbers of in-core and out-of-core execution clock cycles of the target process and, combined with the latency scale-up ratio, obtain the performance loss of the target process when running among multiple processes relative to running alone;
Module 4, configured to, when the performance loss exceeds a threshold, throttle the DRAM memory traffic of processes other than the target process, dynamically allocating DRAM bandwidth resources in real time and guaranteeing the quality of service of the target process.
In the described memory resource dynamic regulation system based on memory access and performance modeling, the historical memory access request information includes the current request information h_0 for the target bank and the past k access histories h_i (i=1,…,k), where each h_i (i=0,…,k) contains the issue time t_i of h_i and the row address row_i and column address col_i accessed by h_i; the inputs of the latency model are the differences t_0 - t_i, row_0 - row_i and col_0 - col_i between the access histories and the current request; the output of the latency model is the latency of the current request, Latency = g(h_0,…,h_k); and the training of the latency model is completed by fitting the function g.
In the described memory resource dynamic regulation system based on memory access and performance modeling, the performance loss is:
Slowdown = (A + B) / (A + B / LatS)
where A is the number of in-core execution clock cycles, B is the number of out-of-core execution clock cycles, and LatS is the memory access latency scale-up ratio.
In the described memory resource dynamic regulation system based on memory access and performance modeling, Module 4 comprises: using token bucket technology to throttle the DRAM memory traffic of processes other than the target process.
In the described memory resource dynamic regulation system based on memory access and performance modeling, each core of the multi-core system has an independent token bucket; a fixed number of tokens is automatically added to the bucket every fixed period, and the bucket has a maximum token capacity. All memory access requests issued by the core pass through the token bucket; every request packet is tagged on entering the bucket with its entry time, and the bucket is checked for available tokens: if tokens are available, the packet is forwarded to the lower level and the number of tokens in the bucket is decreased by the amount of data requested; otherwise the request is placed in the waiting queue.
Industrial Applicability
The present invention proposes a memory resource dynamic regulation method and system based on memory access and performance modeling, comprising: capturing the memory access request sequence of the application to be executed together with the timestamps of the requests, modeling the memory to obtain a DRAM access latency model that predicts the latency the application would see when running alone, and combining this with the application's latency in a mixed multi-application execution environment to obtain the latency scale-up ratio; counting the in-core and out-of-core execution clock cycles of the process and, combined with the scale-up ratio, obtaining the execution time the process would need when running alone, thereby quantifying its performance loss; and, when the performance loss of the target process exceeds a threshold, throttling memory traffic other than that from the target process. This guarantees the high-priority process a quality of service close to what it would receive running alone.

Claims (10)

  1. A memory resource dynamic regulation method based on memory access and performance modeling, characterized in that it comprises:
    Step 1: in the multi-core system, training a neural network model using the historical memory access request information of a preset process accessing the DRAM alone as training data and the latency corresponding to that historical request information as the training target, obtaining a memory access latency model;
    Step 2: when the multi-core system runs multiple processes, recording the target memory access requests of a target process and feeding them into the latency model to obtain the latency t_solo of each target request in the absence of inter-process interference, while measuring the actual latency t_mix of the target request, and dividing the latency t_mix by the latency t_solo to obtain the memory access latency scale-up ratio;
    Step 3: counting the numbers of in-core and out-of-core execution clock cycles of the target process and, combined with the latency scale-up ratio, obtaining the performance loss of the target process when running among multiple processes relative to running alone;
    Step 4: when the performance loss exceeds a threshold, throttling the DRAM memory traffic of processes other than the target process, dynamically allocating DRAM bandwidth resources in real time and guaranteeing the quality of service of the target process.
  2. The memory resource dynamic regulation method based on memory access and performance modeling according to claim 1, characterized in that the historical memory access request information includes the current request information h_0 for the target bank and the past k access histories h_i (i=1,…,k), where each h_i (i=0,…,k) contains the issue time t_i of h_i and the row address row_i and column address col_i accessed by h_i; the inputs of the latency model are the differences t_0 - t_i, row_0 - row_i and col_0 - col_i between the access histories and the current request; the output of the latency model is the latency of the current request, Latency = g(h_0,…,h_k); and the training of the latency model is completed by fitting the function g.
  3. The memory resource dynamic regulation method based on memory access and performance modeling according to claim 1, characterized in that the performance loss is:
    Slowdown = (A + B) / (A + B / LatS)
    where A is the number of in-core execution clock cycles, B is the number of out-of-core execution clock cycles, and LatS is the memory access latency scale-up ratio.
  4. The memory resource dynamic regulation method based on memory access and performance modeling according to claim 1, characterized in that step 4 comprises: using token bucket technology to throttle the DRAM memory traffic of processes other than the target process.
  5. The memory resource dynamic regulation method based on memory access and performance modeling according to claim 1, characterized in that
    each core of the multi-core system has an independent token bucket; a fixed number of tokens is automatically added to the bucket every fixed period, and the bucket has a maximum token capacity; all memory access requests issued by the core pass through the token bucket; every request packet is tagged on entering the bucket with its entry time, and the bucket is checked for available tokens: if tokens are available, the packet is forwarded to the lower level and the number of tokens in the bucket is decreased by the amount of data requested; otherwise the request is placed in the waiting queue.
  6. A memory resource dynamic regulation system based on memory access and performance modeling, characterized in that it comprises:
    Module 1, configured to, in a multi-core system, train a neural network model using the historical memory access request information of a preset process accessing the DRAM alone as training data and the latency corresponding to that historical request information as the training target, obtaining a memory access latency model;
    Module 2, configured to, when the multi-core system runs multiple processes, record the target memory access requests of a target process and feed them into the latency model to obtain the latency t_solo of each target request in the absence of inter-process interference, while measuring the actual latency t_mix of the target request, and to divide the latency t_mix by the latency t_solo to obtain the memory access latency scale-up ratio;
    Module 3, configured to count the numbers of in-core and out-of-core execution clock cycles of the target process and, combined with the latency scale-up ratio, obtain the performance loss of the target process when running among multiple processes relative to running alone;
    Module 4, configured to, when the performance loss exceeds a threshold, throttle the DRAM memory traffic of processes other than the target process, dynamically allocating DRAM bandwidth resources in real time and guaranteeing the quality of service of the target process.
  7. The memory resource dynamic regulation system based on memory access and performance modeling according to claim 6, characterized in that the historical memory access request information includes the current request information h_0 for the target bank and the past k access histories h_i (i=1,…,k), where each h_i (i=0,…,k) contains the issue time t_i of h_i and the row address row_i and column address col_i accessed by h_i; the inputs of the latency model are the differences t_0 - t_i, row_0 - row_i and col_0 - col_i between the access histories and the current request; the output of the latency model is the latency of the current request, Latency = g(h_0,…,h_k); and the training of the latency model is completed by fitting the function g.
  8. The memory resource dynamic regulation system based on memory access and performance modeling according to claim 6, characterized in that the performance loss is:
    Slowdown = (A + B) / (A + B / LatS)
    where A is the number of in-core execution clock cycles, B is the number of out-of-core execution clock cycles, and LatS is the memory access latency scale-up ratio.
  9. The memory resource dynamic regulation system based on memory access and performance modeling according to claim 6, characterized in that Module 4 comprises: using token bucket technology to throttle the DRAM memory traffic of processes other than the target process.
  10. The memory resource dynamic regulation system based on memory access and performance modeling according to claim 6, characterized in that
    each core of the multi-core system has an independent token bucket; a fixed number of tokens is automatically added to the bucket every fixed period, and the bucket has a maximum token capacity; all memory access requests issued by the core pass through the token bucket; every request packet is tagged on entering the bucket with its entry time, and the bucket is checked for available tokens: if tokens are available, the packet is forwarded to the lower level and the number of tokens in the bucket is decreased by the amount of data requested; otherwise the request is placed in the waiting queue.