CN113505084A - Memory resource dynamic regulation and control method and system based on memory access and performance modeling - Google Patents

Memory resource dynamic regulation and control method and system based on memory access and performance modeling Download PDF

Info

Publication number
CN113505084A
CN113505084A (application CN202110702890.0A)
Authority
CN
China
Prior art keywords
memory access
access
memory
delay
token bucket
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110702890.0A
Other languages
Chinese (zh)
Other versions
CN113505084B (en
Inventor
徐易难
周耀阳
王卅
唐丹
孙凝晖
包云岗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202110702890.0A priority Critical patent/CN113505084B/en
Publication of CN113505084A publication Critical patent/CN113505084A/en
Priority to PCT/CN2022/070519 priority patent/WO2022267443A1/en
Application granted granted Critical
Publication of CN113505084B publication Critical patent/CN113505084B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/0223User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023Free address space management
    • G06F12/0238Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory
    • G06F12/0246Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory in block erasable memory, e.g. flash memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a memory resource dynamic regulation and control method and system based on memory access and performance modeling. It guarantees the quality of service of key applications on real-time multi-core hardware through dynamic partitioning of memory bandwidth resources, providing a non-invasive solution with fine granularity, high precision and quick response. The invention designs the overall architecture of an automatic process-performance regulation mechanism: hardware directly obtains the priority of upper-layer applications through a label mechanism, thereby providing differentiated hardware resource allocation for processes with different priorities. The Bank structure of the dynamic random access memory is delay-modeled based on a machine learning method. For the problem of guaranteeing the quality of service of key applications in a real-time multi-core environment, dynamically adjusting the memory bandwidth allocation effectively reduces the memory access interference of other processes on the key process and accurately guarantees the quality of service of the high-priority process.

Description

Memory resource dynamic regulation and control method and system based on memory access and performance modeling
Technical Field
The invention belongs to the technical field of critical application service quality assurance in a real-time multi-core system scene, and particularly relates to a memory resource dynamic regulation and control method and system based on memory access and performance modeling.
Background
In a real-time system, the quality of service of key applications must be guaranteed, which at the hardware level means guaranteeing the amount of hardware resources allocated to key processes. As applications' demand for computing resources keeps growing, scenarios such as cloud computing, smartphones and 5G base stations place ever higher processing requirements on computer hardware, and multi-core has become the standard configuration of almost all real-time systems. However, in a multi-core scenario, multiple applications running on the same processor may contend for hardware resources, causing performance fluctuation and in turn affecting the performance of the real-time system.
Therefore, there are some works around the problem of how to accurately and efficiently control the allocation of hardware resources among different applications and guarantee the quality of service of critical applications in a system with real-time requirements.
Intel Xeon series processors are equipped with Resource Director Technology (RDT), which includes cache monitoring technology, cache allocation technology, memory bandwidth monitoring technology, memory bandwidth allocation technology, and the like. The operating system uses these resource allocation technologies to monitor the cache and bandwidth usage of different cores and adjusts the amount of resources available to a single core by directly assigning a resource allocation proportion, thereby reducing performance interference and guaranteeing the performance of key loads in a complex environment.
The Application Slowdown Model (ASM) combines analysis of the shared cache and main memory; it holds that for a memory-access-bound application, performance is proportional to the issue rate of memory access requests, and that a process can reach its maximum memory access bandwidth when given the highest priority. ASM reduces interference on the memory access path as much as possible on one hand, and quantifies interference in the shared cache on the other, thereby periodically evaluating performance loss and realizing feedback-based dynamic adjustment of hardware resources.
Intel RDT only allows static partitioning of resources and the allocation amount is based only on the requirements of known sensitive applications. In actual operation, because hardware does not sense program requirements, RDT relies on manual control of software (operating system or user), hardware cannot dynamically adjust resource quantity during operation, and because software generally has a coarse regulation granularity, hardware resources are wasted and overall system performance is negatively affected.
The performance loss model of the application program does not have architecture universality, and due to the existence of a large amount of shared resources, in order to realize the control of hardware resources, the ASM needs to carry out large-range invasive modification on components such as a system bus, a memory controller, a prefetcher and the like, so that the support of the ASM on priority is ensured, and the realization cost is very high. Due to the fact that hardware implementation details need to be considered, modeling complexity is increased, and migration of the resource competition evaluation model between different platforms becomes difficult.
The existing methods all suffer from insufficient application generality: heuristic rules are used to judge inter-core interference in specific scenarios; RDT hardware cannot automatically identify and adjust the resource partitioning; and ASM assumes that a memory-access-bound application reaches its maximum memory access bandwidth under the highest priority in order to quantify the bandwidth loss caused by inter-core interference, but this premise does not necessarily hold.
Disclosure of Invention
The invention aims to overcome the defects that the prior art can only realize static division, does not have architecture universality and does not have application universality, and provides a key application service quality guarantee method and a system based on memory access delay prediction, performance loss prediction and bandwidth dynamic adjustment technology.
Aiming at the defects of the prior art, the invention provides a memory resource dynamic regulation and control method based on memory access and performance modeling, which comprises the following steps:
step 1, taking, as training data, historical memory access request information of a preset process accessing the DRAM alone in the multi-core system, and taking the delays corresponding to the historical memory access request information as the training target, training a neural network model to obtain a memory access delay model;
step 2, when the system runs multiple processes, recording a target memory access request of a target process and inputting it into the memory access delay model to obtain the memory access delay t_solo the target request would have in the absence of multi-process interference; simultaneously measuring the actual memory access delay t_mix of the target request, and dividing t_mix by t_solo to obtain the memory access delay increase ratio;
step 3, counting the numbers of clock cycles the target process executes inside and outside the core, and combining them with the memory access delay increase ratio to obtain the performance loss of the target process when running with multiple processes relative to running alone;
and 4, when the performance loss is larger than a threshold value, limiting the DRAM access flow of the process except the target process so as to dynamically allocate DRAM bandwidth resources in real time and ensure the service quality of the target process.
The memory resource dynamic regulation and control method based on memory access and performance modeling, wherein the historical memory access request information comprises the current request h_0 to a target Bank and the past k memory access histories h_i (i = 1, …, k), where each h_i (i = 0, …, k) includes the time t_i at which h_i was issued, and the row address row_i and column address col_i accessed by h_i; the inputs of the memory access delay model are the differences t_0 - t_i, row_0 - row_i and col_0 - col_i between each memory access history and the current request information; the output of the memory access delay model is the memory access delay g(h_0, …, h_k) of the current request h_0; and the training of the memory access delay model is completed by fitting the function g.
The memory resource dynamic regulation and control method based on memory access and performance modeling, wherein the performance loss is:

PerfLoss = (A + B) / (A + B / LatS)

wherein A is the number of clock cycles of in-core execution, B is the number of clock cycles of out-of-core execution, and LatS is the memory access delay increase ratio.
The memory resource dynamic regulation and control method based on memory access and performance modeling, wherein the step 4 comprises the following steps: using token bucket technology, DRAM access traffic for processes other than the target process is limited.
The memory resource dynamic regulation and control method based on memory access and performance modeling, wherein
Each core of the multi-core system is provided with an independent token bucket; a fixed number of tokens is automatically added to the token bucket at regular intervals, and the token bucket has a maximum token capacity; all memory access requests issued by a core pass through its token bucket; each memory access request packet is tagged on entering the token bucket, and its entry time is recorded; whether an available token exists in the token bucket is then judged: if so, the packet is sent to the lower layer, and the number of tokens in the bucket is reduced according to the data volume of the request; if not, the request is sent to a waiting queue.
The invention also provides a memory resource dynamic regulation and control system based on memory access and performance modeling, which comprises:
the module 1 is used for taking, as training data, historical memory access request information of a preset process accessing the DRAM alone in the multi-core system, and taking the delays corresponding to the historical memory access request information as the training target, training a neural network model to obtain a memory access delay model;
module 2, used for, when the system runs multiple processes, recording a target memory access request of a target process and inputting it into the memory access delay model to obtain the memory access delay t_solo the target request would have in the absence of multi-process interference, simultaneously measuring the actual memory access delay t_mix of the target request, and dividing t_mix by t_solo to obtain the memory access delay increase ratio;
a module 3, configured to count the numbers of clock cycles the target process executes inside and outside the core and combine them with the memory access delay increase ratio to obtain the performance loss of the target process when running with multiple processes relative to running alone;
and the module 4 is used for limiting the DRAM access flow of the process except the target process when the performance loss is greater than a threshold value so as to dynamically allocate DRAM bandwidth resources in real time and guarantee the service quality of the target process.
The memory resource dynamic regulation and control system based on memory access and performance modeling, wherein the historical memory access request information comprises the current request h_0 to a target Bank and the past k memory access histories h_i (i = 1, …, k), where each h_i (i = 0, …, k) includes the time t_i at which h_i was issued, and the row address row_i and column address col_i accessed by h_i; the inputs of the memory access delay model are the differences t_0 - t_i, row_0 - row_i and col_0 - col_i between each memory access history and the current request information; the output of the memory access delay model is the memory access delay g(h_0, …, h_k) of the current request h_0; and the training of the memory access delay model is completed by fitting the function g.
The memory resource dynamic regulation and control system based on memory access and performance modeling, wherein the performance loss is:

PerfLoss = (A + B) / (A + B / LatS)

wherein A is the number of clock cycles of in-core execution, B is the number of clock cycles of out-of-core execution, and LatS is the memory access delay increase ratio.
The memory resource dynamic regulation and control system based on memory access and performance modeling, wherein the module 4 comprises: using token bucket technology, DRAM access traffic for processes other than the target process is limited.
The memory resource dynamic regulation and control system based on memory access and performance modeling is characterized in that
Each core of the multi-core system is provided with an independent token bucket; a fixed number of tokens is automatically added to the token bucket at regular intervals, and the token bucket has a maximum token capacity; all memory access requests issued by a core pass through its token bucket; each memory access request packet is tagged on entering the token bucket, and its entry time is recorded; whether an available token exists in the token bucket is then judged: if so, the packet is sent to the lower layer, and the number of tokens in the bucket is reduced according to the data volume of the request; if not, the request is sent to a waiting queue.
According to the scheme, the invention has the advantages that:
the invention provides a technology for guaranteeing the quality of service of key application through dynamic memory bandwidth resource division on real-time multi-core hardware, and provides a non-invasive solution with fine granularity, high precision and quick response. The invention designs the overall architecture of a process performance automatic regulation mechanism, and hardware directly acquires the priority of upper application through a label mechanism, thereby providing differentiated hardware resource allocation for processes with different priorities. The method is characterized in that a body (Bank) structure of a Dynamic Random Access Memory (DRAM) is innovatively subjected to delay modeling based on a machine learning method, the prediction accuracy can reach over 90% in most scenes, and the average error is 2.78%. The performance loss of the process relative to the single operation of the process is estimated based on the memory access delay, and the average error is only 8.78 percent, which is better than that of the prior related art. Aiming at the problem of guaranteeing the service quality of key application, under the real-time multi-core environment, the memory bandwidth allocation is dynamically adjusted to effectively reduce the memory access interference of other processes to the key process, and the service quality of the high-priority process is accurately guaranteed to reach 90% of that of the high-priority process when the high-priority process operates alone.
Drawings
FIG. 1 is a schematic diagram of the position of the AutoMBA in the system and the composition structure thereof;
FIG. 2 is a schematic diagram of the input and output of a multi-layered perceptron model;
FIG. 3 is a schematic diagram showing that the execution time of an in-order processor can be divided into in-core and out-of-core portions;
fig. 4 is a schematic diagram illustrating the operation of the token bucket mechanism.
Detailed Description
When the inventor conducts multi-core program performance analysis and dynamic resource adjustment optimization research, the inventor finds that the prior art does not combine low-level hardware information such as delay, bandwidth and memory access characteristics of a program with high-level software information, so that the problems of hardware information loss, unknown actual software performance, complex control technology and the like are caused. The inventor finds that solving the problem can be realized by modeling by combining information such as delay, bandwidth and access characteristics with actual performance loss of software, deducing on line on hardware to obtain estimated performance loss of the software, and performing token bucket-based memory bandwidth allocation in a feedback manner according to continuous observation results. Specifically, the present application includes the following key technical points:
Key point 1: using a machine learning method, memory delay is modeled offline based on historical memory access address sequences, with an average error rate of 2.84% on the SPEC CPU2006 benchmarks. Here, offline means that the model is built before the program runs (training is completed offline); there is no need to build the model online during program execution.
Key point 2: using a machine learning method, the performance loss of a program under multi-core conditions is modeled offline based on information such as the actually measured memory access delay, the estimated ideal memory access delay, and the program's memory access bandwidth and frequency, with an average error of 8.78% on the SPEC CPU2006 benchmarks.
Key point 3: the memory access bandwidth of a program is controlled using a quota-based token bucket technique; by adjusting token bucket parameters, the memory bandwidth allocation of different programs is dynamically controlled, thereby guaranteeing the performance of key programs and achieving the set ideal performance target of 90% (standard deviation 4.19%).
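The quota-based token bucket control of key point 3 can be sketched as a minimal software model (the patent specifies a hardware mechanism; the class, parameter names and cycle-based refill policy below are illustrative assumptions, not the patented implementation):

```python
# Illustrative sketch of a per-core token bucket: periodic refill up to a
# maximum capacity, token consumption proportional to request data volume,
# and a waiting queue for requests that find no available tokens.
from collections import deque

class TokenBucket:
    def __init__(self, capacity, refill_tokens, refill_interval):
        self.capacity = capacity            # maximum token capacity
        self.tokens = capacity
        self.refill_tokens = refill_tokens  # tokens added per interval
        self.refill_interval = refill_interval
        self.wait_queue = deque()           # requests lacking tokens

    def tick(self, cycle):
        # Periodically add a fixed number of tokens, capped at capacity,
        # then try to release queued requests in arrival order.
        if cycle % self.refill_interval == 0:
            self.tokens = min(self.capacity, self.tokens + self.refill_tokens)
            self._drain()

    def request(self, cycle, size):
        # Tag the request with its entry time, then check token availability.
        req = {"enter_cycle": cycle, "size": size}
        if self.tokens >= size:
            self.tokens -= size  # consume tokens per request data volume
            return "sent"
        self.wait_queue.append(req)
        return "queued"

    def _drain(self):
        while self.wait_queue and self.tokens >= self.wait_queue[0]["size"]:
            req = self.wait_queue.popleft()
            self.tokens -= req["size"]
```

Shrinking `refill_tokens` or `capacity` for non-critical cores throttles their DRAM traffic, which is the knob the dynamic regulation mechanism turns when the target process's performance loss exceeds the threshold.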
In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
As shown in fig. 1, the core technical principle of the present invention includes: (1) based on the memory access request sequence of single application and the time stamps of the memory access request sequence under the multi-core environment, the main memory behavior is consistently simulated by modeling the memory, the delay of the memory access request in the independent operation is estimated on line, and the memory access delay under the actual mixed environment is combined to obtain the memory access delay improvement ratio (Latency Scale-up, LatS); (2) counting the number of clock cycles of the process during the in-core execution and the out-core execution, and calculating by combining with the LatS to obtain the execution time of the process during the independent operation, thereby quantifying the interference on the process; (3) and dynamically allocating memory bandwidth resources in real time by using a token bucket technology according to the priority relations of different cores, and ensuring the service quality of key application.
The technical scheme of the invention is implemented as follows:
(1) DRAM memory access delay modeling
In order to obtain, in real time under a multi-core environment, the delay a process's memory access request would have in a solo-run environment, the invention proposes: (1) modeling the memory access delay of the DRAM, regarding the DRAM Bank structure as a black-box model, ShadowDRAM; the invention models the whole DRAM, and since the behavior of each Bank in the DRAM is similar, the same black-box model is used for different Banks; its input is the relevant information of the current and past k memory access requests, and its output is the delay of the current memory access request; (2) during multi-process operation, process information of the upper-layer application is obtained through the labeling mechanism, and the memory access requests of a single process are recorded by hardware and input into the DRAM memory access delay model ShadowDRAM to obtain the delay t_solo of the memory access request in the absence of multi-core interference.
The present invention recognizes that the current state of the DRAM and its controller is dependent on all historical access requests, and if a current access request input is given based thereon, the access latency of that request can be accurately predicted by modeling. The reason is that for a memory access component, the input signal is non-trivial only when the memory access request is received, and the change mode of the internal state cannot be influenced at other moments, namely, the memory access component is a Moore type sequential circuit when no memory access request exists. Thus, the internal state of the DRAM and its controller need only be determined from past meaningful input signals, i.e., a historical access request sequence.
More specifically, the inventors have discovered that the latency of a DRAM access request can be determined by the sequence of memory requests that access the same Bank within a short time window before and after that request. This is because (1) the row buffer state in a DRAM Bank has the greatest effect on access latency, and that state is determined by the most recent requests to access the Bank rather than by older ones; (2) the non-blocking cache brings memory-level parallelism, that is, the cache continuously issues multiple memory access requests without waiting for replies and may receive reply data out of order, so the processing of a single request is advanced or delayed by all requests within a short time before and after it.
The invention provides that only when no access request of the same Bank is sent in a time period after a memory access request q is sent and before reply data is received, a DRAM delay prediction model (black box model) samples q and predicts the delay of the q. The reasons are mainly as follows: (1) the prediction of the delay requires relative real-time, and the time difference between two consecutive requests for a Bank access can be large, so the input of the model cannot contain future information; (2) the sampling condition limitation ensures that future requests cannot influence the processing time of the request q, so that the historical memory access request sequence information input by the model is enough to perform delay prediction of the current request; (3) the calculation time required by the model is possibly longer, and part of requests are selected by a sampling method to predict delay and performance loss, so that sufficient time can be reserved for calculation, and the power consumption of the system can be reduced.
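The sampling condition above can be expressed as a small predicate (a hedged sketch; the function and field names are our assumptions, and a hardware implementation would track this with in-flight request state rather than a scan):

```python
# A request q is sampled for delay prediction only if no other request to the
# same Bank is issued between q's issue time and the arrival of q's reply,
# so that future requests cannot influence q's processing time.

def should_sample(q, all_requests):
    # q and entries of all_requests are dicts with keys: bank, issue, reply.
    for r in all_requests:
        if r is q or r["bank"] != q["bank"]:
            continue
        # another same-Bank request issued inside q's in-flight window
        if q["issue"] < r["issue"] < q["reply"]:
            return False
    return True
```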
In view of the above discussion, to find a generic latency prediction mechanism that minimizes unnecessary study of the internal structure of the DRAM, and based on the assumption that the current state of the DRAM and its controller can be determined by part of the historical access requests to the current Bank, the present invention attempts to model the Bank structure in the DRAM using a limited number of access histories of the corresponding Bank, and to predict the latency of the current request by sampling the requests that satisfy certain conditions. Such conditions may be, for example, sampling one of every 100 memory access requests, or requiring that no new same-Bank request is issued in the period after a request is issued and before its reply is received; other filtering conditions are also applicable.
The invention provides that the DRAM Bank is modeled by using a machine learning method, and the method has the advantages that: (1) predicting the delay of the current memory access request at a relatively high precision by utilizing basic arithmetic operation and a small amount of historical information storage calculation; (2) the machine learning model has universality and universality, the requirements of the training process on the DRAM controller and the DRAM particle internal information are very little, and the memory access information training model only needs to be grabbed again when the memory access information training model is transplanted on different platforms; (3) based on the partial sampling of the memory access request, the memory access delay prediction number is effectively reduced, and the dynamic power consumption of the system is reduced.
The invention adopts a machine learning method and models the DRAM Bank based on a Multi-Layer Perceptron (MLP) model, for the following reasons: (1) the MLP is broadly general and, under different parameters, can accurately simulate DRAM Banks under different configurations, so the memory access delay prediction mechanism can be reused across platforms, effectively reducing porting workload; (2) the MLP model and the ReLU activation function involve no complex function operations; results can be produced by simple multiply-add operations alone, and there is no need to maintain the large numbers of queues and state machines in the DRAM and its controller, effectively reducing the power overhead of the delay prediction mechanism.
As shown in FIG. 2, the present invention uses the current request h_0 to a target Bank and the past k memory access histories h_i (i = 1, …, k) to predict memory access delay, where each h_i (i = 0, …, k) contains the time t_i at which h_i was issued, and the row address row_i and column address col_i accessed by h_i; the actual memory access history input to the model is represented by the differences between the historical request information and the current request information. In the training and prediction of the model, for each memory access history h_i we use t_0 - t_i, row_0 - row_i and col_0 - col_i to represent its relationship to the current request; the output of the model is the memory access delay of the current request, g(h_0, …, h_k), and the training process of the model continuously fits the function g. The memory access delay of the current request can be obtained by running some preset threads or programs in advance and capturing the memory access traces and delays of the corresponding requests.
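The feature construction and the multiply-add-plus-ReLU structure described above can be sketched as follows (an illustrative toy: the function names and the fixed weights are our assumptions, standing in for parameters that would come from offline training on captured traces):

```python
# Encode each history entry h_i relative to the current request h_0 as
# (t_0 - t_i, row_0 - row_i, col_0 - col_i), then run a one-hidden-layer
# MLP forward pass with ReLU to produce a predicted access delay.

def encode(history, k):
    # history[0] is the current request h_0; history[1..k] are past requests
    # to the same Bank, each a dict with keys t, row, col.
    h0 = history[0]
    feats = []
    for hi in history[1:k + 1]:
        feats += [h0["t"] - hi["t"], h0["row"] - hi["row"], h0["col"] - hi["col"]]
    return feats

def relu(x):
    return x if x > 0 else 0.0

def mlp_predict(feats, w_hidden, b_hidden, w_out, b_out):
    # One hidden layer with ReLU, scalar output: predicted access latency.
    hidden = [relu(sum(w * f for w, f in zip(ws, feats)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out
```

Only basic arithmetic and a small history buffer are needed at inference time, which is why the patent argues the mechanism is cheap enough for hardware.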
(2) Program performance loss modeling
From the microarchitectural viewpoint, the execution of a process can be divided into an in-core part and an out-of-core part: the in-core flow includes the arithmetic units and accesses to the core-private caches, while the out-of-core part includes read requests issued to the DRAM, the DRAM's internal processing, and the like.
As shown in FIG. 3, in an in-order processor, after a memory access instruction causes an L2 cache miss, the in-core pipeline stalls and execution continues only after the DRAM returns data. Therefore, during program execution, the CPU clock cycles of an in-order processor can be strictly divided into in-core time and out-of-core time, of which only the out-of-core time is subject to multi-core interference.
In an out-of-order processor, instructions that are located later under certain conditions may be executed in advance, and therefore a cache miss does not necessarily cause a stall of the CPU core internal pipeline. However, from the overall system perspective, in any one clock cycle, either the out-of-core Memory requests are being processed or all the Memory requests are completed, and these two types of clock Cycles are respectively referred to as Memory Cycles (MCs) and Non-Memory Cycles (NMCs).
On the other hand, in a multi-core architecture, resource contention among the cores causes interference: the latency of requests traveling from L2 through the system bus to the DRAM controller increases as a whole. For example, if the system bus is occupied by another core when a request is issued, the request must wait for the other core's request to finish transmitting before it can be sent. The largest source of inter-core interference lies inside the DRAM Bank. The row buffer mechanism is similar to a cache: when a program has good locality, its access latency is greatly reduced. In a multi-core environment, however, the access requests of other processes are likely to be inserted between two consecutive access requests of a single process, so the locality of DRAM accesses is destroyed and the access latency suffers.
In view of the above two aspects, the present invention observes that, when a process runs simultaneously with other processes on a multi-core architecture, its number of MCs increases compared with running alone, while its number of NMCs stays the same. In a multi-core environment, by checking whether there is currently an uncompleted DRAM access request, hardware can classify each clock cycle as an NMC or an MC and count the number of each type within a time interval, denoted A and B respectively. Because of inter-core interference, memory access request latency grows and the number of MCs increases; the higher the increase ratio, the larger the interference caused by resource contention and the larger the performance loss of the process.
Based on this, the invention proposes a memory-access-latency-based performance loss estimation (SEMAL) model: after obtaining the process's single-core memory access latency tsolo through machine learning, we use the actual multi-core memory access latency tmix to compute the latency Scale-up, LatS = tmix/tsolo, which estimates the increase ratio of the process's MC count. Compared with the mixed-run scenario, when the process executes the same stage alone, its number of NMCs is still A, but its number of MCs shrinks by the LatS ratio; that is, the execution time the process would need when running alone is A + B/LatS.
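A minimal numeric sketch of this estimate (the function name is ours; the arithmetic follows the text):

```python
def estimate_solo_cycles(A, B, t_solo, t_mix):
    """Given NMC count A and MC count B measured in the mixed run, and the
    predicted solo latency t_solo vs. measured mixed latency t_mix,
    estimate the cycles the process would need when running alone."""
    lat_s = t_mix / t_solo   # latency scale-up, LatS = t_mix / t_solo
    return A + B / lat_s     # NMCs unchanged, MCs shrink by LatS

# Example: A = 1000 NMCs, B = 500 MCs, latency doubled under contention.
print(estimate_solo_cycles(1000, 500, t_solo=50, t_mix=100))  # 1250.0
```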
Therefore, in a multi-core environment, based on the relative extension of the execution time of the process, we calculate the performance loss of the process relative to running alone as:
Performance_Loss = (Execution_Time_mix - Execution_Time_solo) / Execution_Time_solo = ((A + B) - (A + B/LatS)) / (A + B/LatS)
where Execution_Time_solo is the execution time of the process or program when running alone, and Execution_Time_mix is its execution time under mixed (multi-process) execution.
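Using the quantities defined in the text (mixed execution time A + B cycles, estimated solo execution time A + B/LatS cycles), the performance loss can be computed directly. This is a sketch; the function name and example values are illustrative:

```python
def performance_loss(A, B, lat_s):
    """Relative slowdown of the process versus running alone."""
    t_mix = A + B            # measured cycles in the mixed run
    t_solo = A + B / lat_s   # estimated cycles when running alone
    return (t_mix - t_solo) / t_solo

# Example: with A = 1000, B = 500 and LatS = 2, the process takes
# 1500 cycles instead of an estimated 1250, a 20% performance loss.
print(performance_loss(1000, 500, 2))  # 0.2
```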
(3) Service quality guarantee technology based on token bucket
Based on the SEMAL model, the invention proposes an Automatic Memory Bandwidth Allocation (AutoMBA) mechanism: using SEMAL, the running state of a program can be monitored dynamically at runtime to obtain its degree of performance loss, and its bandwidth demand in the optimal (solo-run) case can be predicted from its memory access bandwidth during mixed execution. Using the token bucket technique, the system can then regulate the memory access bandwidth of different cores and preferentially guarantee the quality of service of critical applications by throttling low-priority memory traffic.
The Token Bucket (TB) is the basic tool of AutoMBA and can effectively and precisely control the memory access bandwidth of different cores. As shown in FIG. 4, each CPU core has a private, independent token bucket: a certain number of tokens (Inc) is added automatically at a fixed interval (Freq), and the bucket has a maximum capacity (Size). All memory access requests issued by the core pass through the token bucket; every request packet is marked when it enters the bucket, and its entry time is recorded. If tokens are available in the bucket (e.g., PACKET0 sent at time t0), the packet is forwarded to the lower layer and the token count is reduced according to the amount of data requested. Otherwise, if no tokens remain in the bucket (e.g., PACKET1 and PACKET2 sent at times t1 and t2), the request is blocked and placed in a wait queue. A timer synchronized with the system clock is kept in the token bucket; each time it reaches Freq it resets and triggers the automatic addition of Inc tokens. At that point, if tokens are available, requests in the wait queue are forwarded to the lower layer and the token count is reduced accordingly; for example, at time t3 the previously blocked PACKET2 and PACKET3 are released, and if the data amount of each of the two requests is 1, ntokens is decreased by 2.
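The token bucket behavior described above can be sketched as a small simulation. Size, Inc, and Freq are the parameters from the text; the class layout, initial token count, and per-request cost of 1 are illustrative assumptions:

```python
from collections import deque

class TokenBucket:
    def __init__(self, size, inc, freq, ntokens):
        self.size, self.inc, self.freq = size, inc, freq
        self.ntokens = ntokens   # current token count
        self.waitq = deque()     # blocked requests
        self.timer = 0           # synchronized with the system clock
        self.released = []       # packets forwarded to the lower layer

    def send(self, packet, cost=1):
        # Forward immediately if tokens are available, otherwise block
        # the request in the wait queue.
        if self.ntokens >= cost:
            self.ntokens -= cost
            self.released.append(packet)
        else:
            self.waitq.append((packet, cost))

    def tick(self):
        # Every Freq cycles the timer resets, Inc tokens are added
        # (capped at Size), and blocked requests are retried in order.
        self.timer += 1
        if self.timer == self.freq:
            self.timer = 0
            self.ntokens = min(self.size, self.ntokens + self.inc)
            while self.waitq and self.ntokens >= self.waitq[0][1]:
                pkt, cost = self.waitq.popleft()
                self.ntokens -= cost
                self.released.append(pkt)

# Example in the spirit of FIG. 4: PACKET0 passes at once, PACKET2 and
# PACKET3 block until the next refill releases both, consuming 2 tokens.
tb = TokenBucket(size=4, inc=2, freq=2, ntokens=1)
tb.send("PACKET0")                       # token available: forwarded
tb.send("PACKET2"); tb.send("PACKET3")   # no tokens left: queued
tb.tick(); tb.tick()                     # Freq reached: +Inc, queue drains
print(tb.released, tb.ntokens)  # ['PACKET0', 'PACKET2', 'PACKET3'] 0
```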
In each Sampling Interval (SI): (1) the TB of each core runs automatically with its configured parameters, limiting the core's issuing of memory access requests according to the number of remaining tokens; (2) a latency prediction module (LPM) records the access history of the high-priority core (or target process) and predicts the latency tsolo that the core's memory access requests would have when running alone; (3) the token bucket controller (TBM) processes the access requests emitted by the TBs, forwards requests from the different cores to the DRAM controller, obtains the predicted tsolo from the LPM, measures the actual memory access latency tmix, and, using the ratio of tsolo to tmix together with the core's memory access clock cycle counts, estimates the performance loss of the target process relative to running alone, recording it in preset registers.
One Update Interval (UI) consists of multiple sampling intervals. At the end of each UI, the AutoMBA mechanism evaluates the degree of performance loss suffered by the target process during the past UI. When the performance loss of the target process is excessive, the TBM automatically limits the memory access traffic of the other cores, reducing inter-core interference and improving the performance of the target process; when the performance of the target process meets the requirement, the traffic control on the remaining cores can be relaxed.
The control algorithm of AutoMBA is divided into two steps, Observe and Act. Observe is performed at the end of each SI: combining the memory access latency with the memory cycle counts, the hardware calculates the performance loss of the process and sets the corresponding counters. Act is performed at the end of each UI: if the hardware finds that the target process suffered less than 10% performance loss in most SIs, the maximum traffic allowed to the other cores is increased, and the more SIs satisfy this condition, the larger the increase; when the performance loss of the target process exceeds 50% in no fewer than 3 SIs, the traffic allowed to the remaining cores is directly halved; when the performance loss falls in the remaining interval, the corresponding token bucket Inc parameters are adjusted.
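The Act step can be sketched as a policy over the per-SI loss samples of the past UI. The 10%/50% thresholds and the 3-SI rule come from the text; the step sizes, bounds, and the handling of the middle band are illustrative assumptions:

```python
def act(loss_per_si, inc, inc_min=1, inc_max=64):
    """Return the new token-bucket Inc parameter for the non-target cores."""
    good = sum(1 for x in loss_per_si if x < 0.10)  # SIs with <10% loss
    bad = sum(1 for x in loss_per_si if x > 0.50)   # SIs with >50% loss
    if bad >= 3:                         # heavy loss: halve allowed traffic
        return max(inc_min, inc // 2)
    if good > len(loss_per_si) // 2:     # mostly healthy: raise allowance,
        return min(inc_max, inc + good)  # more good SIs -> larger increase
    return inc  # middle band: the text adjusts Inc here (policy unspecified)

print(act([0.05] * 8, inc=8))             # 16: all 8 SIs healthy
print(act([0.6, 0.7, 0.55, 0.1], inc=8))  # 4: 3 SIs above 50% loss
```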
The following are system embodiments corresponding to the above method embodiments, and this embodiment can be implemented in cooperation with the above embodiments. The technical details mentioned in the above embodiments remain valid in this embodiment and, to reduce repetition, are not repeated here. Correspondingly, the technical details mentioned in this embodiment can also be applied to the above embodiments.
The invention also provides a memory resource dynamic regulation and control system based on memory access and performance modeling, which comprises:
the module 1 is used for training a neural network model, in a multi-core system, with the historical memory access request information of a preset process accessing the DRAM alone as training data and the latency corresponding to that historical memory access request information as the training target, to obtain a memory access latency model;
module 2, for recording, when the multi-core system runs multiple processes, a target memory access request of the target process and inputting it into the memory access latency model to obtain the latency tsolo the target memory access request would have without multi-process interference, while measuring the actual latency tmix of the target memory access request, and dividing the latency tmix by the latency tsolo to obtain the memory access latency increase ratio;
a module 3, configured to count the numbers of clock cycles the target process executes inside and outside the core and, combined with the memory access latency increase ratio, obtain the performance loss of the target process when running with multiple processes relative to running alone;
and the module 4 is used for limiting the DRAM access traffic of processes other than the target process when the performance loss exceeds a threshold, so as to dynamically allocate DRAM bandwidth resources in real time and guarantee the quality of service of the target process.
The above memory resource dynamic regulation and control system based on memory access and performance modeling, wherein the historical memory access request information includes the current request information h0 accessing a target Bank and the past k memory access histories hi (i = 1, …, k), where each hi (i = 0, …, k) includes the time ti at which hi was issued, the row address rowi accessed by hi, and the column address coli; the input of the memory access latency model is the differences t0-ti, row0-rowi, and col0-coli between the memory access histories and the current request information; the output of the memory access latency model is the memory access latency g(h0, …, hk) of the current request information h0; and the training of the memory access latency model is completed by fitting the function g.
The memory resource dynamic regulation and control system based on memory access and performance modeling is characterized in that the performance loss is as follows:
Performance_Loss = ((A + B) - (A + B/LatS)) / (A + B/LatS) = (B - B/LatS) / (A + B/LatS)
wherein A is the number of clock cycles executed inside the core, B is the number of clock cycles executed outside the core, and LatS is the memory access latency increase ratio.
The memory resource dynamic regulation and control system based on memory access and performance modeling, wherein the module 4 is configured to limit, using the token bucket technique, the DRAM access traffic of processes other than the target process.
The memory resource dynamic regulation and control system based on memory access and performance modeling is characterized in that
each core of the multi-core system is provided with an independent token bucket; a certain number of tokens is automatically added to the token bucket at a fixed interval, and the token bucket has a maximum token capacity; all memory access requests issued by a core pass through its token bucket; every access request packet is marked when it enters the token bucket and its entry time is recorded; the token bucket judges whether an available token exists: if so, the packet is sent to the lower layer and the number of tokens in the bucket is reduced according to the data amount of the access request; if not, the access request is placed in a wait queue.

Claims (10)

1. A memory resource dynamic regulation and control method based on memory access and performance modeling is characterized by comprising the following steps:
step 1, in a multi-core system, taking the historical memory access request information of a preset process accessing the DRAM alone as training data and the latency corresponding to that historical memory access request information as the training target, training a neural network model to obtain a memory access latency model;
step 2, when the multi-core system runs multiple processes, recording a target memory access request of a target process and inputting it into the memory access latency model to obtain the latency tsolo the target memory access request would have without multi-process interference, while measuring the actual latency tmix of the target memory access request, and dividing the latency tmix by the latency tsolo to obtain the memory access latency increase ratio;
step 3, counting the numbers of clock cycles the target process executes inside and outside the core and, combined with the memory access latency increase ratio, obtaining the performance loss of the target process when running with multiple processes relative to running alone;
and step 4, when the performance loss exceeds a threshold, limiting the DRAM access traffic of processes other than the target process, so as to dynamically allocate DRAM bandwidth resources in real time and guarantee the quality of service of the target process.
2. The memory resource dynamic regulation and control method based on memory access and performance modeling as claimed in claim 1, wherein the historical memory access request information includes the current request information h0 accessing a target Bank and the past k memory access histories hi (i = 1, …, k), where each hi (i = 0, …, k) includes the time ti at which hi was issued, the row address rowi accessed by hi, and the column address coli; the input of the memory access latency model is the differences t0-ti, row0-rowi, and col0-coli between the memory access histories and the current request information; the output of the memory access latency model is the memory access latency g(h0, …, hk) of the current request information h0; and the training of the memory access latency model is completed by fitting the function g.
3. The memory resource dynamic regulation method based on memory access and performance modeling of claim 1, wherein the performance loss is:
Performance_Loss = ((A + B) - (A + B/LatS)) / (A + B/LatS) = (B - B/LatS) / (A + B/LatS)
wherein A is the number of clock cycles executed inside the core, B is the number of clock cycles executed outside the core, and LatS is the memory access latency increase ratio.
4. The memory resource dynamic regulation and control method based on memory access and performance modeling as claimed in claim 1, wherein the step 4 comprises: limiting, using the token bucket technique, the DRAM access traffic of processes other than the target process.
5. The memory resource dynamic regulation and control method based on memory access and performance modeling as claimed in claim 1,
each core of the multi-core system is provided with an independent token bucket; a certain number of tokens is automatically added to the token bucket at a fixed interval, and the token bucket has a maximum token capacity; all memory access requests issued by a core pass through its token bucket; every access request packet is marked when it enters the token bucket and its entry time is recorded; the token bucket judges whether an available token exists: if so, the packet is sent to the lower layer and the number of tokens in the bucket is reduced according to the data amount of the access request; if not, the access request is placed in a wait queue.
6. A memory resource dynamic regulation and control system based on memory access and performance modeling is characterized by comprising:
the module 1 is used for training a neural network model, in a multi-core system, with the historical memory access request information of a preset process accessing the DRAM alone as training data and the latency corresponding to that historical memory access request information as the training target, to obtain a memory access latency model;
module 2, for recording, when the multi-core system runs multiple processes, a target memory access request of a target process and inputting it into the memory access latency model to obtain the latency tsolo the target memory access request would have without multi-process interference, while measuring the actual latency tmix of the target memory access request, and dividing the latency tmix by the latency tsolo to obtain the memory access latency increase ratio;
a module 3, configured to count the numbers of clock cycles the target process executes inside and outside the core and, combined with the memory access latency increase ratio, obtain the performance loss of the target process when running with multiple processes relative to running alone;
and the module 4 is used for limiting the DRAM access traffic of processes other than the target process when the performance loss exceeds a threshold, so as to dynamically allocate DRAM bandwidth resources in real time and guarantee the quality of service of the target process.
7. The memory resource dynamic regulation and control system based on memory access and performance modeling as claimed in claim 6, wherein the historical memory access request information includes the current request information h0 accessing a target Bank and the past k memory access histories hi (i = 1, …, k), where each hi (i = 0, …, k) includes the time ti at which hi was issued, the row address rowi accessed by hi, and the column address coli; the input of the memory access latency model is the differences t0-ti, row0-rowi, and col0-coli between the memory access histories and the current request information; the output of the memory access latency model is the memory access latency g(h0, …, hk) of the current request information h0; and the training of the memory access latency model is completed by fitting the function g.
8. The memory resource dynamic regulation and control system based on memory access and performance modeling as claimed in claim 6, wherein the performance loss is:
Performance_Loss = ((A + B) - (A + B/LatS)) / (A + B/LatS) = (B - B/LatS) / (A + B/LatS)
wherein A is the number of clock cycles executed inside the core, B is the number of clock cycles executed outside the core, and LatS is the memory access latency increase ratio.
9. The memory resource dynamic regulation and control system based on memory access and performance modeling as claimed in claim 6, wherein the module 4 is configured to limit, using the token bucket technique, the DRAM access traffic of processes other than the target process.
10. The memory resource dynamic regulation and control system based on memory access and performance modeling as claimed in claim 6, wherein
each core of the multi-core system is provided with an independent token bucket; a certain number of tokens is automatically added to the token bucket at a fixed interval, and the token bucket has a maximum token capacity; all memory access requests issued by a core pass through its token bucket; every access request packet is marked when it enters the token bucket and its entry time is recorded; the token bucket judges whether an available token exists: if so, the packet is sent to the lower layer and the number of tokens in the bucket is reduced according to the data amount of the access request; if not, the access request is placed in a wait queue.
CN202110702890.0A 2021-06-24 2021-06-24 Memory resource dynamic regulation and control method and system based on memory access and performance modeling Active CN113505084B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110702890.0A CN113505084B (en) 2021-06-24 2021-06-24 Memory resource dynamic regulation and control method and system based on memory access and performance modeling
PCT/CN2022/070519 WO2022267443A1 (en) 2021-06-24 2022-01-06 Memory resource dynamic regulation and control method and system based on memory access and performance modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110702890.0A CN113505084B (en) 2021-06-24 2021-06-24 Memory resource dynamic regulation and control method and system based on memory access and performance modeling

Publications (2)

Publication Number Publication Date
CN113505084A true CN113505084A (en) 2021-10-15
CN113505084B CN113505084B (en) 2023-09-12

Family

ID=78010810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110702890.0A Active CN113505084B (en) 2021-06-24 2021-06-24 Memory resource dynamic regulation and control method and system based on memory access and performance modeling

Country Status (2)

Country Link
CN (1) CN113505084B (en)
WO (1) WO2022267443A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022267443A1 (en) * 2021-06-24 2022-12-29 中国科学院计算技术研究所 Memory resource dynamic regulation and control method and system based on memory access and performance modeling

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117319322B (en) * 2023-12-01 2024-02-27 成都睿众博芯微电子技术有限公司 Bandwidth allocation method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140122801A1 (en) * 2012-10-29 2014-05-01 Advanced Micro Devices, Inc. Memory controller with inter-core interference detection
CN105700946A (en) * 2016-01-15 2016-06-22 华中科技大学 Scheduling system and method for equalizing memory access latency among multiple threads under NUMA architecture
WO2019020028A1 (en) * 2017-07-26 2019-01-31 华为技术有限公司 Method and apparatus for allocating shared resource
CN112083957A (en) * 2020-09-18 2020-12-15 海光信息技术股份有限公司 Bandwidth control device, multithread controller system and memory access bandwidth control method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505084B (en) * 2021-06-24 2023-09-12 中国科学院计算技术研究所 Memory resource dynamic regulation and control method and system based on memory access and performance modeling

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140122801A1 (en) * 2012-10-29 2014-05-01 Advanced Micro Devices, Inc. Memory controller with inter-core interference detection
CN105700946A (en) * 2016-01-15 2016-06-22 华中科技大学 Scheduling system and method for equalizing memory access latency among multiple threads under NUMA architecture
WO2019020028A1 (en) * 2017-07-26 2019-01-31 华为技术有限公司 Method and apparatus for allocating shared resource
CN112083957A (en) * 2020-09-18 2020-12-15 海光信息技术股份有限公司 Bandwidth control device, multithread controller system and memory access bandwidth control method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIONG, DONGLIANG 等: "Providing Predictable Performance via a Slowdown Estimation Model", ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, vol. 14, no. 3, pages 1 - 26, XP058673014, DOI: 10.1145/3124451 *


Also Published As

Publication number Publication date
WO2022267443A1 (en) 2022-12-29
CN113505084B (en) 2023-09-12

Similar Documents

Publication Publication Date Title
Ali et al. Batch: Machine learning inference serving on serverless platforms with adaptive batching
WO2021174735A1 (en) Dynamic resource scheduling method for guaranteeing latency slo of latency-sensitive application, and system
US20080059712A1 (en) Method and apparatus for achieving fair cache sharing on multi-threaded chip multiprocessors
US9430277B2 (en) Thread scheduling based on predicted cache occupancies of co-running threads
Jog et al. Exploiting core criticality for enhanced GPU performance
EP3087503B1 (en) Cloud compute scheduling using a heuristic contention model
CN113505084B (en) Memory resource dynamic regulation and control method and system based on memory access and performance modeling
US9632836B2 (en) Scheduling applications in a clustered computer system
WO2019091387A1 (en) Method and system for provisioning resources in cloud computing
US8898674B2 (en) Memory databus utilization management system and computer program product
Li et al. Amoeba: Qos-awareness and reduced resource usage of microservices with serverless computing
CN114730276A (en) Determining an optimal number of threads per core in a multi-core processor complex
CN111190735B (en) On-chip CPU/GPU pipelining calculation method based on Linux and computer system
US20050125797A1 (en) Resource management for a system-on-chip (SoC)
US11907550B2 (en) Method for dynamically assigning memory bandwidth
CN105528250B (en) The evaluation and test of Multi-core computer system certainty and control method
CN108574600B (en) Service quality guarantee method for power consumption and resource competition cooperative control of cloud computing server
CN116820784B (en) GPU real-time scheduling method and system for reasoning task QoS
Niknia et al. An SMDP-based approach to thermal-aware task scheduling in NoC-based MPSoC platforms
CN111625347B (en) Fine-grained cloud resource control system and method based on service component level
Gupta et al. Timecube: A manycore embedded processor with interference-agnostic progress tracking
CN112306628A (en) Virtual network function resource management framework based on multi-core server
Mirosanlou et al. Duomc: Tight DRAM latency bounds with shared banks and near-cots performance
CN101661406A (en) Processing unit dispatching device and method
CN107329813B (en) Global sensing data active prefetching method and system for many-core processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant