CN113505084B - Memory resource dynamic regulation and control method and system based on memory access and performance modeling - Google Patents

Memory resource dynamic regulation and control method and system based on memory access and performance modeling

Info

Publication number
CN113505084B
CN113505084B (application number CN202110702890.0A)
Authority
CN
China
Prior art keywords
access
memory
core
delay
token bucket
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110702890.0A
Other languages
Chinese (zh)
Other versions
CN113505084A (en)
Inventor
徐易难
周耀阳
王卅
唐丹
孙凝晖
包云岗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202110702890.0A priority Critical patent/CN113505084B/en
Publication of CN113505084A publication Critical patent/CN113505084A/en
Priority to PCT/CN2022/070519 priority patent/WO2022267443A1/en
Application granted granted Critical
Publication of CN113505084B publication Critical patent/CN113505084B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/0223User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023Free address space management
    • G06F12/0238Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory
    • G06F12/0246Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory in block erasable memory, e.g. flash memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Memory System (AREA)

Abstract

The application provides a memory resource dynamic regulation and control method and system based on memory access and performance modeling. By dynamically partitioning memory bandwidth resources on real-time multi-core hardware, the technique guarantees the quality of service of critical applications with a fine-grained, high-precision, fast-response and non-invasive solution. The application designs the overall architecture of an automatic process-performance regulation mechanism in which hardware obtains the priority of upper-layer applications directly through a tagging mechanism, providing differentiated hardware resource allocation for processes of different priorities. The method performs latency modeling of the Bank structure of the dynamic random access memory based on machine learning. For the problem of guaranteeing the quality of service of critical applications, dynamically adjusting the memory bandwidth allocation in a real-time multi-core environment effectively reduces the memory access interference of other processes on the critical process and precisely guarantees the quality of service of high-priority processes.

Description

Memory resource dynamic regulation and control method and system based on memory access and performance modeling
Technical Field
The application belongs to the technical field of guaranteeing the quality of service of critical applications in real-time multi-core systems, and particularly relates to a memory resource dynamic regulation and control method and system based on memory access and performance modeling.
Background
In a real-time system, the quality of service of critical applications must be guaranteed, which on the hardware side means guaranteeing the amount of hardware resources allocated to critical processes. As applications demand ever more computing resources, scenarios such as cloud computing, smartphones and 5G base stations keep raising their requirements on the processing capability of computer hardware, and multi-core systems have become the standard configuration of almost all real-time systems. However, in a multi-core scenario, multiple applications running on the same processor may contend with each other for hardware resources, resulting in performance fluctuations, which in turn affect the performance of the real-time system.
Therefore, prior work has addressed how to correctly and efficiently control the allocation of hardware resources among different applications and guarantee the quality of service of critical applications in systems with real-time requirements.
Intel Xeon series processors are equipped with Resource Director Technology (RDT), which includes cache monitoring, cache allocation, memory bandwidth monitoring, memory bandwidth allocation and other techniques. Using RDT, the operating system monitors the cache and bandwidth usage of different cores and adjusts the amount of resources available to a single core by directly specifying a resource allocation ratio, thereby reducing performance interference and guaranteeing the performance of critical loads in complex environments.
The Application Slowdown Model (ASM) combines analysis of the shared cache and main memory. It holds that, for memory-bound applications, performance is proportional to the rate at which memory requests are issued, and that a process given the highest priority can reach its maximum memory bandwidth. ASM reduces interference during the memory access phase as much as possible on the one hand, and quantifies interference in the shared cache on the other, periodically evaluating performance loss and thereby achieving feedback-driven dynamic adjustment of hardware resources.
Intel RDT allows only static partitioning of resources, and the allocated amounts are based only on the known needs of sensitive applications. In actual operation, RDT depends on manual control by software (the operating system or the user): because the hardware has no awareness of program requirements, it cannot dynamically adjust resource amounts at run time, while software regulation is generally coarse-grained, which wastes hardware resources and negatively affects overall system performance.
The application slowdown model lacks architectural generality. Because of the large amount of shared resources, ASM requires large-scale invasive modification of components such as the system bus, memory controller and prefetcher in order to control hardware resources and guarantee support for priorities, so the implementation cost is high. Since hardware implementation details must be considered, modeling complexity grows, and migrating the resource-contention evaluation model between different platforms becomes difficult.
Existing methods all suffer from judging inter-core interference with heuristic rules tied to specific scenarios and from insufficient application generality: RDT hardware cannot automatically identify and adjust the resource partitioning, while ASM assumes that a memory-bound application can reach its maximum memory bandwidth when given the highest priority in order to quantify the bandwidth loss caused by inter-core interference, but this premise does not necessarily hold.
Disclosure of Invention
The application aims to overcome the defects of the prior art, namely that it supports only static partitioning and lacks architectural and application generality, and provides a critical-application quality-of-service guarantee method and system based on memory access latency prediction, performance loss prediction and dynamic bandwidth adjustment.
Aiming at the defects of the prior art, the application provides a memory resource dynamic regulation and control method based on memory access and performance modeling, which comprises the following steps:
step 1, in the multi-core system, training a neural network model by taking the historical memory access request information of a preset process on the DRAM as training data and the latency corresponding to the historical request information as the training target, to obtain a memory access latency model;
step 2, when the multi-core system runs multiple processes, recording a target access request of a target process and feeding it into the access latency model to obtain the access latency t_solo the request would have without multi-process interference, simultaneously measuring the actual access latency t_mix of the request, and dividing t_mix by t_solo to obtain the memory access latency increase ratio;
step 3, counting the numbers of in-core and out-of-core execution clock cycles of the target process and, combining them with the latency increase ratio, obtaining the performance loss of the target process, when running among multiple processes, relative to running alone;
and step 4, when the performance loss exceeds a threshold, limiting the DRAM memory access traffic of processes other than the target process, so as to dynamically allocate DRAM bandwidth resources in real time and guarantee the quality of service of the target process.
The memory resource dynamic regulation and control method based on memory access and performance modeling, wherein the historical memory access request information comprises the current request h_0 to a target Bank and its past k access histories h_i (i = 1, …, k), where each h_i (i = 0, …, k) comprises the issue time t_i of h_i and the row address row_i and column address col_i accessed by h_i; the inputs of the access latency model are the differences t_0 - t_i, row_0 - row_i and col_0 - col_i between each access history and the current request; the output of the access latency model is the latency of the current request h_0, Latency = g(h_0, …, h_k), and training of the access latency model is completed by fitting the function g.
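For illustration (not part of the patent text), the following Python sketch shows one way the input features described above might be assembled from a same-Bank access history; all names are hypothetical:

```python
# Hypothetical sketch: build the (t, row, col) delta features described above.
from dataclasses import dataclass

@dataclass
class Access:
    t: int      # issue time (cycles)
    row: int    # DRAM row address accessed
    col: int    # DRAM column address accessed

def build_features(history: list[Access], k: int) -> list[int]:
    """history[0] is the current request h_0; history[1..k] are past
    same-Bank requests h_1..h_k, newest first."""
    h0 = history[0]
    feats = []
    for hi in history[1:k + 1]:
        feats += [h0.t - hi.t, h0.row - hi.row, h0.col - hi.col]
    return feats
```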
The memory resource dynamic regulation and control method based on memory access and performance modeling, wherein the performance loss is:

PerformanceLoss = (B - B/LatS) / (A + B)

where A is the number of in-core execution clock cycles, B is the number of out-of-core execution clock cycles, and LatS is the memory access latency increase ratio.
The memory resource dynamic regulation and control method based on memory access and performance modeling, wherein step 4 comprises: using the token bucket technique, limiting the DRAM memory access traffic of processes other than the target process.
The memory resource dynamic regulation and control method based on memory access and performance modeling, wherein each core of the multi-core system is provided with an independent token bucket; a fixed number of tokens is automatically added to the token bucket at fixed intervals, and the token bucket has a maximum token capacity; all memory access requests issued by a core pass through its token bucket: any request packet is tagged when it enters the bucket and its entry time is recorded, and whether the token bucket has available tokens is judged; if so, the packet is sent to the lower level while the number of tokens in the bucket is reduced according to the data amount of the request; otherwise the request is sent to a waiting queue.
The application also provides a memory resource dynamic regulation and control system based on memory access and performance modeling, which comprises the following steps:
the module 1 is used for training, in the multi-core system, a neural network model by taking the historical memory access request information of a preset process on the DRAM as training data and the latency corresponding to the historical request information as the training target, to obtain a memory access latency model;
a module 2 for, when the multi-core system runs multiple processes, recording a target access request of a target process and feeding it into the access latency model to obtain the access latency t_solo the request would have without inter-process interference, simultaneously measuring the actual access latency t_mix of the request, and dividing t_mix by t_solo to obtain the memory access latency increase ratio;
the module 3 is used for counting the numbers of in-core and out-of-core execution clock cycles of the target process and, combining them with the latency increase ratio, obtaining the performance loss of the target process, when running among multiple processes, relative to running alone;
and the module 4 is used for, when the performance loss exceeds a threshold, limiting the DRAM memory access traffic of processes other than the target process, so as to dynamically allocate DRAM bandwidth resources in real time and guarantee the quality of service of the target process.
The memory resource dynamic regulation and control system based on memory access and performance modeling, wherein the historical memory access request information comprises the current request h_0 to a target Bank and its past k access histories h_i (i = 1, …, k), where each h_i (i = 0, …, k) comprises the issue time t_i of h_i and the row address row_i and column address col_i accessed by h_i; the inputs of the access latency model are the differences t_0 - t_i, row_0 - row_i and col_0 - col_i between each access history and the current request; the output of the access latency model is the latency of the current request h_0, Latency = g(h_0, …, h_k), and training of the access latency model is completed by fitting the function g.
The memory resource dynamic regulation and control system based on memory access and performance modeling, wherein the performance loss is:

PerformanceLoss = (B - B/LatS) / (A + B)

where A is the number of in-core execution clock cycles, B is the number of out-of-core execution clock cycles, and LatS is the memory access latency increase ratio.
The memory resource dynamic regulation and control system based on memory access and performance modeling, wherein the module 4 comprises: using the token bucket technique, limiting the DRAM memory access traffic of processes other than the target process.
The memory resource dynamic regulation and control system based on memory access and performance modeling, wherein each core of the multi-core system is provided with an independent token bucket; a fixed number of tokens is automatically added to the token bucket at fixed intervals, and the token bucket has a maximum token capacity; all memory access requests issued by a core pass through its token bucket: any request packet is tagged when it enters the bucket and its entry time is recorded, and whether the token bucket has available tokens is judged; if so, the packet is sent to the lower level while the number of tokens in the bucket is reduced according to the data amount of the request; otherwise the request is sent to a waiting queue.
The advantages of the application are as follows:
the application provides a technology for guaranteeing the service quality of key applications on real-time multi-core hardware through dynamic memory bandwidth resource division, and provides a fine-grained, high-precision and quick-response non-invasive solution. The application designs the overall architecture of the automatic regulation mechanism of the process performance, and the hardware directly obtains the priority of the upper layer application through the label mechanism, thereby providing differentiated hardware resource allocation for the processes with different priorities. The method is innovatively based on a machine learning method to carry out delay modeling on a volume (Bank) structure of a dynamic random access memory (Dynamic random access memory, DRAM), and can reach more than 90% of prediction accuracy in most scenes, and the average error is 2.78%. The average error of the memory delay based estimation process is only 8.78% relative to the performance loss of the memory delay based estimation process when the memory delay alone is operated, which is superior to the prior art. Aiming at the problem of guaranteeing the service quality of key application, in a real-time multi-core environment, the access interference of other processes to the key process is effectively reduced by dynamically adjusting the memory bandwidth allocation, and the service quality of the high-priority process is accurately guaranteed to reach 90% of that of the high-priority process when the high-priority process runs independently.
Drawings
FIG. 1 is a schematic diagram of the location of AutoMBA in a system and its constituent structure according to the present application;
FIG. 2 is a schematic diagram of the inputs and outputs of a multi-layer perceptron model;
FIG. 3 is a schematic diagram showing that the execution time of a sequential processor can be divided into two parts, i.e., in-core and out-of-core;
fig. 4 is a schematic diagram of the principle of operation of the token bucket mechanism.
Detailed Description
While conducting research on multi-core program performance analysis and dynamic resource adjustment, the inventors found that the prior art does not combine low-level hardware information, such as memory latency, bandwidth and the memory access characteristics of programs, with high-level software information, leading to problems such as loss of hardware information, unknown actual software performance, and complex control mechanisms. The inventors found that this can be solved by combining latency, bandwidth, access-characteristic and other information into a model of the software's actual performance loss, deriving the estimated performance loss of the software online in hardware, and performing token-bucket-based memory bandwidth allocation in a feedback manner according to continuous observations. Specifically, the application comprises the following key technical points:
key point 1, using a machine learning method, memory latency is modeled offline based on historical memory access sequences, with an average error rate of 2.84% on the SPEC CPU2006 benchmarks. Here offline means the model has already been built, off-line, before the program runs; the model does not need to be constructed online at program run time.
key point 2, using a machine learning method, the performance loss of a program under multi-core conditions is modeled offline based on information such as the measured access latency, the estimated ideal access latency, the program's memory access bandwidth and its memory access frequency, with an average error of 8.78% on the SPEC CPU2006 benchmarks.
key point 3, a quota-based token bucket technique is used to control the memory access bandwidth of programs, and the bandwidth allocation of different programs is dynamically controlled by adjusting the token bucket parameters, guaranteeing the performance of the critical program and reaching the set ideal performance target of 90% (standard deviation 4.19%).
In order to make the above features and effects of the present application more clearly understood, the following specific examples are given with reference to the accompanying drawings.
As shown in fig. 1, the core technical principles of the present application include: (1) based on the access request sequence of a single application and its timestamps in the multi-core environment, main memory behavior is faithfully simulated by modeling the memory, the latency each access request would have when the application runs alone is estimated online, and it is combined with the access latency measured in the actual mixed environment to obtain the memory access latency increase ratio (LatS); (2) the numbers of in-core and out-of-core clock cycles of the process are counted and, combined with LatS, the execution time the process would need when running alone is calculated, thereby quantifying the interference the process suffers; (3) according to the priority relation of the different cores, the token bucket technique is used to dynamically allocate memory bandwidth resources in real time, guaranteeing the quality of service of critical applications.
The specific implementation process of the technical scheme of the application is as follows:
(1) DRAM access delay modeling
To obtain, in real time in a multi-core environment, the access latency a process's requests would have when it runs alone, the application provides: (1) modeling the DRAM access latency, treating the DRAM Bank structure as a black-box model (Shadow DRAM); the application models the whole DRAM, and since every Bank in the DRAM behaves similarly, the same black-box model is used for different Banks; the model takes as input information about the current and the past k access requests and outputs the latency of the current access request; (2) when multiple processes run, the process information of the upper-layer application is obtained through the tagging mechanism, and hardware records the access requests of a single process and feeds them into the Shadow DRAM access latency model to obtain the latency t_solo of each access request in the absence of multi-core interference.
The present application recognizes that the current state of a DRAM and its controller depends on all historical access requests; given the current access request as input on that basis, the request's access latency can be accurately predicted by modeling. The reason is that, for a memory device, the input signal is only nontrivial when a memory request is received; at all other times the inputs do not affect how the internal state changes, i.e., the device behaves as a Moore-type sequential circuit when there is no memory request. Thus, the internal state of the DRAM and its controller can be determined solely from past meaningful input signals, i.e., the historical access request sequence.
More specifically, the inventors found that the latency of a DRAM access request can be determined by the small window of requests to the same Bank issued shortly before and after it. This is because (1) the row buffer state inside a DRAM Bank has the greatest impact on memory latency, the state of the row buffer is determined by the most recent requests to that Bank, and earlier, more distant memory requests do not affect the row buffer state; and (2) the non-blocking cache brings memory-level parallelism, i.e., the cache issues multiple memory requests consecutively without waiting for replies and may receive reply data out of order, which causes the processing of a single request to be advanced or postponed by all requests in a short window before and after it.
The application provides that the DRAM latency prediction model (the black-box model) samples a request q, and predicts its latency, only when no request to the same Bank is issued in the period after q is issued and before its reply data is received. The reasons are as follows: (1) latency prediction must be relatively real-time, and the time between two consecutive requests to a Bank may be large, so the model input cannot contain future information; (2) this sampling restriction guarantees that future requests do not affect the processing time of request q, so the historical access sequence fed to the model is sufficient for predicting the current request's latency; (3) the computation time of the model may be long, and predicting latency and performance loss only for requests selected by sampling both leaves ample time for the computation and reduces system power consumption.
Synthesizing the above discussion: to find a generic latency prediction mechanism while minimizing unnecessary dependence on DRAM internals, the application models the Bank structure in DRAM on the assumption that the current state of the DRAM and its controller can be determined from part of the current Bank's historical access requests, and predicts the latency of the current request from a limited number of access histories of the corresponding Bank, sampling only requests that meet certain conditions. A condition may be, for example, taking every 100th request, or requiring that no new request to the same Bank be issued between a request's issue and its reply (as in the sketch below), or other feasible filtering conditions.
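As an illustrative sketch of the second filtering condition (the function name and the event-list representation are our assumptions, not from the patent):

```python
# Hypothetical sketch: sample request q only if no other request to the
# same Bank is issued between q's issue time and its reply time.
def is_sampleable(q_issue: int, q_reply: int,
                  same_bank_issue_times: list[int]) -> bool:
    return not any(q_issue < t < q_reply for t in same_bank_issue_times)
```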
The application proposes modeling the DRAM Bank with a machine learning method, which has the following advantages: (1) it predicts the latency of the current memory request with relatively high accuracy using only basic arithmetic operations and a small amount of stored history; (2) the machine learning model is general and portable: training requires almost no internal information about the DRAM controller or the DRAM chips, so porting to a different platform only requires capturing access traces again and retraining; (3) by sampling only part of the access requests, the number of latency predictions is effectively reduced, lowering the system's dynamic power consumption.
The application adopts a machine learning method to model DRAM Banks based on a Multi-Layer Perceptron (MLP) model, for the following reasons: (1) the MLP is general and can accurately simulate DRAM Banks under different configurations with different parameters, so the access latency prediction mechanism can be reused across platforms, effectively reducing porting effort; (2) the MLP model with the ReLU activation function contains no complex function evaluation and produces its result with simple multiply-add operations, without maintaining the large number of queues and state machines in the DRAM and its controller, effectively reducing the power overhead of the latency prediction mechanism.
As shown in FIG. 2, the application uses the current request h_0 to the target Bank and its past k access histories h_i (i = 1, …, k) to predict access latency, where each h_i (i = 0, …, k) contains the issue time t_i of h_i and the row address row_i and column address col_i accessed by h_i, while the actual model input represents each access history by its difference from the current request. In training and prediction, each access history h_i is represented by t_0 - t_i, row_0 - row_i and col_0 - col_i to capture its relationship to the current request, and the output of the model is the memory latency of the current request, Latency = g(h_0, …, h_k); the training process of the model is a continuous fitting of the function g. The access latencies used for training can be obtained by running preset threads or programs in advance and capturing their access requests and the latencies of the corresponding requests.
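The patent specifies the MLP's inputs and output but not its dimensions; the following PyTorch sketch is one plausible rendering under assumed sizes (k = 8 histories and hidden widths 64 and 32 are guesses, not from the patent):

```python
# Hypothetical sketch of the latency-prediction MLP (all sizes are assumptions).
import torch
import torch.nn as nn

K = 8                       # number of past same-Bank accesses used
IN_DIM = 3 * K              # (t, row, col) delta per history entry

model = nn.Sequential(      # simple multiply-add layers with ReLU, as in the text
    nn.Linear(IN_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 1),       # predicted latency (cycles)
)

def train_step(optimizer, feats: torch.Tensor, latency: torch.Tensor) -> float:
    """feats: (batch, IN_DIM) delta features; latency: (batch, 1) measured cycles."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(feats), latency)  # fit the function g
    loss.backward()
    optimizer.step()
    return loss.item()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```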
(2) Program performance loss modeling
From the microarchitectural point of view, the execution of a process can be divided into an in-core part and an out-of-core part: the in-core flow covers the functional units and accesses to the core's private caches, while the out-of-core part covers sending read requests to the DRAM, processing inside the DRAM, and so on.
As shown in FIG. 3, in a sequential (in-order) processor, when a memory access instruction causes an L2 cache miss, the core's internal pipeline stalls and execution continues only after the DRAM returns the data. Thus, during program execution on a sequential processor, CPU clock cycles can be strictly divided into two categories, in-core time and out-of-core time, of which only the out-of-core time is subject to multi-core interference.
In an out-of-order processor, younger instructions may under certain conditions be executed ahead of time, so a cache miss does not necessarily stall the internal pipeline of the CPU core. However, from the overall system perspective, in any clock cycle either some out-of-core memory request is being processed or all memory requests have completed; these two kinds of clock cycles are called Memory Cycles (MCs) and Non-Memory Cycles (NMCs), respectively.
On the other hand, in a multi-core architecture, because of interference caused by inter-core resource contention, requests sent from L2 through the system bus to the DRAM controller see their latency increase overall; for example, when a request is about to be sent while the system bus is occupied by another core, it must wait until the other core's request finishes transmitting. The largest source of inter-core interference lies inside the DRAM Bank. The row buffer mechanism is similar to a cache: when a program's locality is good, its access latency drops greatly. In a multi-core environment, however, memory requests of other processes are very likely to be inserted between two consecutive memory requests of a single process, destroying the locality of DRAM accesses and affecting memory latency.
Considering the two aspects above, the application holds that when a process runs simultaneously with other processes on a multi-core architecture, its MC count increases compared with running alone, while its NMC count stays the same. In a multi-core environment, by checking whether there is currently an outstanding DRAM access request, hardware can classify each clock cycle as NMC or MC and count the two kinds of cycles over a period of time, denoted A and B respectively. Because of inter-core interference, access request latency grows and the MC count grows with it; the higher the increase ratio, the greater the interference caused by resource contention and the greater the performance loss of the process.
Based on this, the present application proposes a performance loss estimation model based on memory access latency (Slowdown Estimation via Memory Access Latency, SEMAL): the single-run access latency t_solo of the process is obtained by the machine learning method and combined with the actual multi-core access latency t_mix; the increase ratio of the process's MC count is then estimated from the latency increase ratio LatS = t_mix / t_solo. When the process executes the same phase in the corresponding stand-alone scenario, its NMC count is still A, but its MC count shrinks by the factor LatS, i.e., the execution time the process would need is A + B/LatS.
Therefore, in a multi-core environment, based on this relative extension of the process execution time, the performance loss relative to running alone is calculated as:

PerformanceLoss = (Execution_time_mix - Execution_time_solo) / Execution_time_mix = (B - B/LatS) / (A + B)

where Execution_time_solo is the execution time of the process or program when running alone, estimated as A + B/LatS, and Execution_time_mix = A + B is its execution time in the mixed run.
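As a worked example (ours, not from the patent): with A = 6,000 NMCs, B = 4,000 MCs and LatS = 2, the mixed-run time is 10,000 cycles, the estimated stand-alone time is 6,000 + 4,000/2 = 8,000 cycles, and the performance loss is (10,000 - 8,000)/10,000 = 20%. A minimal Python sketch of the same arithmetic:

```python
# Illustrative sketch of the SEMAL estimate (variable names are ours).
def performance_loss(a_nmc: int, b_mc: int, lat_s: float) -> float:
    """a_nmc: non-memory cycles; b_mc: memory cycles; lat_s = t_mix / t_solo."""
    time_mix = a_nmc + b_mc               # measured mixed-run time
    time_solo = a_nmc + b_mc / lat_s      # estimated stand-alone time
    return (time_mix - time_solo) / time_mix

assert abs(performance_loss(6000, 4000, 2.0) - 0.2) < 1e-9
```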
(3) Service quality guarantee technology based on token bucket
Based on the SEMAL model, the application proposes an Automatic Memory Bandwidth Allocation (AutoMBA) mechanism. Using SEMAL, the running state of a program can be monitored dynamically at run time to obtain its degree of performance loss, and its bandwidth demand in the ideal case (running alone) can be predicted from its memory bandwidth in the mixed run. Using the token bucket technique, the system regulates the access bandwidth of different cores and preferentially guarantees the quality of service of critical applications by limiting low-priority access traffic.
Token Bucket (TB) technology is the basic tool of AutoMBA and can effectively and precisely control the access bandwidth of different cores. As shown in fig. 4, each CPU core has a private, independent token bucket, which automatically gains a fixed number of tokens (Inc) at a fixed interval (Freq) and has a maximum capacity (Size). All access requests issued by the core pass through the token bucket; every request packet is tagged on entry and its entry time is recorded. At that point, if tokens are available in the bucket (e.g., PACKET_0 issued at time t_0), the packet is sent to the lower level while the number of tokens is reduced according to the requested data amount. Otherwise, if no tokens remain in the bucket (e.g., PACKET_1 and PACKET_2 issued at times t_1 and t_2), the request is blocked and sent to the waiting queue. The token bucket contains a timer synchronized to the system clock, which resets every time Freq is reached and triggers the token count to increase by Inc. At that point, while tokens remain, requests in the waiting queue are sent to the lower level and the token count decreases accordingly; e.g., at time t_3 the previously blocked PACKET_2 and PACKET_3 are released, and if each carries one unit of requested data, nTokens decreases by 2.
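The patent describes a hardware mechanism; the following Python sketch is a simplified software model of one such token bucket. The parameter names Freq, Inc and Size follow the text, while everything else (the initial fill, queueing every packet through the bucket) is an assumption:

```python
# Hypothetical software model of the per-core token bucket.
from collections import deque

class TokenBucket:
    def __init__(self, freq: int, inc: int, size: int):
        self.freq, self.inc, self.size = freq, inc, size
        self.tokens = size                 # assumed: bucket starts full
        self.timer = 0
        self.wait_queue: deque = deque()   # blocked request packets

    def tick(self, issued_packets: list[int]) -> list[int]:
        """Advance one cycle; each packet is represented by its data amount
        in tokens. Returns the packets forwarded to the lower level."""
        self.timer += 1
        if self.timer == self.freq:        # refill Inc tokens every Freq cycles
            self.timer = 0
            self.tokens = min(self.size, self.tokens + self.inc)
        self.wait_queue.extend(issued_packets)
        sent = []
        while self.wait_queue and self.tokens >= self.wait_queue[0]:
            cost = self.wait_queue.popleft()
            self.tokens -= cost            # consume tokens per data amount
            sent.append(cost)
        return sent
```

A TBM along the lines described below would instantiate one such bucket per core and retune Inc per core to reallocate bandwidth.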
In each Sampling Interval (SI): (1) the TB of each core operates automatically according to its configured parameters, and the cores are limited in issuing memory access requests based on the number of currently remaining tokens; (2) a Latency Prediction Module (LPM) records the access history of the high-priority core (the target process) and predicts the latency t_solo that an access request issued by that core would have when running alone; (3) a token bucket controller (TBM) on the one hand handles the access requests released by the TBs, forwarding requests from different cores to the DRAM controller, and on the other hand obtains the predicted t_solo from the LPM and measures the actual access latency t_mix; from the relative proportion of t_solo and t_mix and the relative proportion of the core's memory cycle counts, it estimates the performance loss of the target process relative to running alone and records it in preset registers.
One Updating Interval (UI) consists of multiple sampling intervals. At the end of each UI, the AutoMBA mechanism evaluates the degree of the target process's performance loss over the past UI. When the target process's performance loss is excessive, the TBM automatically limits the memory access traffic of the remaining cores, reducing inter-core interference and improving the target process's performance; when the target process's performance meets the requirement, the traffic control on the remaining cores can be relaxed.
The AutoMBA control algorithm is divided into two steps, Observe and Act. Observe runs at the end of each SI: hardware computes the process's performance loss from the memory latency and memory cycle counts and sets the corresponding counters. Act runs at the end of each UI: if hardware judges that the target process kept its performance loss below 10% in most SIs, the maximum traffic allowed to the remaining cores is raised, and the more SIs that met the target, the larger the increase; when the target process's performance loss reached more than 50% in no fewer than 3 SIs, the traffic allowed to the remaining cores is directly halved; when the performance loss falls in the remaining intervals, the corresponding token bucket Inc parameters are likewise adjusted.
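For illustration only, a minimal sketch of this Act policy, reusing the TokenBucket sketch above. The 10%/50% thresholds, the "no fewer than 3 SIs" rule and the halving come from the text; the step sizes and the majority test are our assumptions:

```python
# Hypothetical Act step, run once at the end of each Updating Interval (UI).
def act(si_losses: list[float], other_buckets: list["TokenBucket"]) -> None:
    """si_losses: target-process performance loss per SI over the past UI."""
    ok = sum(loss < 0.10 for loss in si_losses)    # SIs meeting the target
    bad = sum(loss >= 0.50 for loss in si_losses)  # severely degraded SIs
    for tb in other_buckets:
        if ok > len(si_losses) // 2:
            tb.inc += ok                    # relax: more compliant SIs, bigger raise
        elif bad >= 3:
            tb.inc = max(1, tb.inc // 2)    # halve the allowed traffic
        else:
            tb.inc = max(1, tb.inc - 1)     # assumed small downward adjustment
```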
The following is a system embodiment corresponding to the above method embodiment, and the two may be implemented in cooperation. The technical details mentioned in the above embodiments remain valid in this embodiment and, to reduce repetition, are not repeated here; correspondingly, the technical details mentioned in this embodiment also apply to the embodiments above.
The application also provides a memory resource dynamic regulation and control system based on memory access and performance modeling, which comprises the following steps:
the module 1 is used for training, in the multi-core system, a neural network model by taking the historical memory access request information of a preset process on the DRAM as training data and the latency corresponding to the historical request information as the training target, to obtain a memory access latency model;
a module 2 for, when the multi-core system runs multiple processes, recording a target access request of a target process and feeding it into the access latency model to obtain the access latency t_solo the request would have without inter-process interference, simultaneously measuring the actual access latency t_mix of the request, and dividing t_mix by t_solo to obtain the memory access latency increase ratio;
the module 3 is used for counting the numbers of in-core and out-of-core execution clock cycles of the target process and, combining them with the latency increase ratio, obtaining the performance loss of the target process, when running among multiple processes, relative to running alone;
and the module 4 is used for, when the performance loss exceeds a threshold, limiting the DRAM memory access traffic of processes other than the target process, so as to dynamically allocate DRAM bandwidth resources in real time and guarantee the quality of service of the target process.
The memory resource dynamic regulation and control system based on memory access and performance modeling, wherein the historical memory access request information comprises the current request h_0 to a target Bank and its past k access histories h_i (i = 1, …, k), where each h_i (i = 0, …, k) comprises the issue time t_i of h_i and the row address row_i and column address col_i accessed by h_i; the inputs of the access latency model are the differences t_0 - t_i, row_0 - row_i and col_0 - col_i between each access history and the current request; the output of the access latency model is the latency of the current request h_0, Latency = g(h_0, …, h_k), and training of the access latency model is completed by fitting the function g.
The memory resource dynamic regulation and control system based on memory access and performance modeling, wherein the performance loss is:

PerformanceLoss = (B - B/LatS) / (A + B)

where A is the number of in-core execution clock cycles, B is the number of out-of-core execution clock cycles, and LatS is the memory access latency increase ratio.
The memory resource dynamic regulation and control system based on memory access and performance modeling, wherein the module 4 comprises: using the token bucket technique, limiting the DRAM memory access traffic of processes other than the target process.
The memory resource dynamic regulation and control system based on memory access and performance modeling, wherein each core of the multi-core system is provided with an independent token bucket; a fixed number of tokens is automatically added to the token bucket at fixed intervals, and the token bucket has a maximum token capacity; all memory access requests issued by a core pass through its token bucket: any request packet is tagged when it enters the bucket and its entry time is recorded, and whether the token bucket has available tokens is judged; if so, the packet is sent to the lower level while the number of tokens in the bucket is reduced according to the data amount of the request; otherwise the request is sent to a waiting queue.

Claims (10)

1. A memory resource dynamic regulation and control method based on memory access and performance modeling is characterized by comprising the following steps:
step 1, in a multi-core system, training a neural network model by taking the historical memory access request information of a preset process on the DRAM as training data and the latency corresponding to the historical request information as the training target, to obtain a memory access latency model;
step 2, when the multi-core system runs multiple processes, recording a target access request of a target process and feeding it into the access latency model to obtain the access latency t_solo the request would have without multi-process interference, simultaneously measuring the actual access latency t_mix of the request, and dividing t_mix by t_solo to obtain the memory access latency increase ratio;
step 3, counting the numbers of in-core and out-of-core execution clock cycles of the target process and, combining them with the latency increase ratio, obtaining the performance loss of the target process, when running among multiple processes, relative to running alone;
and step 4, when the performance loss exceeds a threshold, limiting the DRAM memory access traffic of processes other than the target process, so as to dynamically allocate DRAM bandwidth resources in real time and guarantee the quality of service of the target process.
2. The memory resource dynamic regulation and control method based on memory access and performance modeling of claim 1, characterized in that the historical access request information comprises the current request h_0 to a target Bank and its past k access histories h_i, i = 1, …, k, where h_0 comprises the issue time t_0 of h_0 and the row address row_0 and column address col_0 accessed by h_0, and h_i comprises the issue time t_i of h_i and the row address row_i and column address col_i accessed by h_i; the inputs of the access latency model are the differences t_0 - t_i, row_0 - row_i and col_0 - col_i between each access history and the current request; the output of the access latency model is the latency of the current request h_0, Latency = g(h_0, …, h_k), and training of the access latency model is completed by fitting the function g.
3. The memory resource dynamic regulation and control method based on memory access and performance modeling of claim 1, wherein the performance loss is:

PerformanceLoss = (B - B/LatS) / (A + B)

where A is the number of in-core execution clock cycles, B is the number of out-of-core execution clock cycles, and LatS is the memory access latency increase ratio.
4. The memory resource dynamic regulation and control method based on memory access and performance modeling of claim 1, wherein step 4 comprises: using the token bucket technique, limiting the DRAM memory access traffic of processes other than the target process.
5. The memory resource dynamic regulation and control method based on memory access and performance modeling of claim 1, wherein each core of the multi-core system is provided with an independent token bucket; a fixed number of tokens is automatically added to the token bucket at fixed intervals, and the token bucket has a maximum token capacity; all memory access requests issued by a core pass through its token bucket: any request packet is tagged when it enters the bucket and its entry time is recorded, and whether the token bucket has available tokens is judged; if so, the packet is sent to the lower level while the number of tokens in the bucket is reduced according to the data amount of the request; otherwise the request is sent to a waiting queue.
6. Memory resource dynamic regulation and control system based on memory access and performance modeling, which is characterized by comprising:
the module 1 is used for training, in the multi-core system, a neural network model by taking the historical memory access request information of a preset process on the DRAM as training data and the latency corresponding to the historical request information as the training target, to obtain a memory access latency model;
a module 2 for, when the multi-core system runs multiple processes, recording a target access request of a target process and feeding it into the access latency model to obtain the access latency t_solo the request would have without inter-process interference, simultaneously measuring the actual access latency t_mix of the request, and dividing t_mix by t_solo to obtain the memory access latency increase ratio;
the module 3 is used for counting the numbers of in-core and out-of-core execution clock cycles of the target process and, combining them with the latency increase ratio, obtaining the performance loss of the target process, when running among multiple processes, relative to running alone;
and the module 4 is used for, when the performance loss exceeds a threshold, limiting the DRAM memory access traffic of processes other than the target process, so as to dynamically allocate DRAM bandwidth resources in real time and guarantee the quality of service of the target process.
7. The memory resource dynamic regulation and control system based on memory access and performance modeling of claim 6, wherein the historical memory access request information comprises the current request h_0 to a target Bank and its past k access histories h_i, i = 1, …, k, where h_0 comprises the issue time t_0 of h_0 and the row address row_0 and column address col_0 accessed by h_0, and h_i comprises the issue time t_i of h_i and the row address row_i and column address col_i accessed by h_i; the inputs of the access latency model are the differences t_0 - t_i, row_0 - row_i and col_0 - col_i between each access history and the current request; the output of the access latency model is the latency of the current request h_0, Latency = g(h_0, …, h_k), and training of the access latency model is completed by fitting the function g.
8. The memory resource dynamic regulation and control system based on memory access and performance modeling of claim 6, wherein the performance loss is:

PerformanceLoss = (B - B/LatS) / (A + B)

where A is the number of in-core execution clock cycles, B is the number of out-of-core execution clock cycles, and LatS is the memory access latency increase ratio.
9. The memory resource dynamic regulation and control system of claim 6, wherein the module 4 comprises: using the token bucket technique, limiting the DRAM memory access traffic of processes other than the target process.
10. The memory resource dynamic regulation and control system based on memory access and performance modeling of claim 6, wherein each core of the multi-core system is provided with an independent token bucket; a fixed number of tokens is automatically added to the token bucket at fixed intervals, and the token bucket has a maximum token capacity; all memory access requests issued by a core pass through its token bucket: any request packet is tagged when it enters the bucket and its entry time is recorded, and whether the token bucket has available tokens is judged; if so, the packet is sent to the lower level while the number of tokens in the bucket is reduced according to the data amount of the request; otherwise the request is sent to a waiting queue.
CN202110702890.0A 2021-06-24 2021-06-24 Memory resource dynamic regulation and control method and system based on memory access and performance modeling Active CN113505084B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110702890.0A CN113505084B (en) 2021-06-24 2021-06-24 Memory resource dynamic regulation and control method and system based on memory access and performance modeling
PCT/CN2022/070519 WO2022267443A1 (en) 2021-06-24 2022-01-06 Memory resource dynamic regulation and control method and system based on memory access and performance modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110702890.0A CN113505084B (en) 2021-06-24 2021-06-24 Memory resource dynamic regulation and control method and system based on memory access and performance modeling

Publications (2)

Publication Number Publication Date
CN113505084A CN113505084A (en) 2021-10-15
CN113505084B true CN113505084B (en) 2023-09-12

Family

ID=78010810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110702890.0A Active CN113505084B (en) 2021-06-24 2021-06-24 Memory resource dynamic regulation and control method and system based on memory access and performance modeling

Country Status (2)

Country Link
CN (1) CN113505084B (en)
WO (1) WO2022267443A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505084B (en) * 2021-06-24 2023-09-12 中国科学院计算技术研究所 Memory resource dynamic regulation and control method and system based on memory access and performance modeling
CN118034901A (en) * 2022-11-11 2024-05-14 华为技术有限公司 Memory management method, device and related equipment
CN117319322B (en) * 2023-12-01 2024-02-27 成都睿众博芯微电子技术有限公司 Bandwidth allocation method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105700946A (en) * 2016-01-15 2016-06-22 华中科技大学 Scheduling system and method for equalizing memory access latency among multiple threads under NUMA architecture
WO2019020028A1 (en) * 2017-07-26 2019-01-31 华为技术有限公司 Method and apparatus for allocating shared resource
CN112083957A (en) * 2020-09-18 2020-12-15 海光信息技术股份有限公司 Bandwidth control device, multithread controller system and memory access bandwidth control method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8880809B2 (en) * 2012-10-29 2014-11-04 Advanced Micro Devices Inc. Memory controller with inter-core interference detection
CN113505084B (en) * 2021-06-24 2023-09-12 Institute of Computing Technology, Chinese Academy of Sciences Memory resource dynamic regulation and control method and system based on memory access and performance modeling

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105700946A (en) * 2016-01-15 2016-06-22 华中科技大学 Scheduling system and method for equalizing memory access latency among multiple threads under NUMA architecture
WO2019020028A1 (en) * 2017-07-26 2019-01-31 华为技术有限公司 Method and apparatus for allocating shared resource
CN112083957A (en) * 2020-09-18 2020-12-15 海光信息技术股份有限公司 Bandwidth control device, multithread controller system and memory access bandwidth control method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIONG, Dongliang, et al. Providing Predictable Performance via a Slowdown Estimation Model. ACM Transactions on Architecture and Code Optimization, 2017, Vol. 14, No. 3, pp. 1-26. *

Also Published As

Publication number Publication date
WO2022267443A1 (en) 2022-12-29
CN113505084A (en) 2021-10-15

Similar Documents

Publication Publication Date Title
CN113505084B (en) Memory resource dynamic regulation and control method and system based on memory access and performance modeling
US8069444B2 (en) Method and apparatus for achieving fair cache sharing on multi-threaded chip multiprocessors
CN115037749B (en) Large-scale micro-service intelligent multi-resource collaborative scheduling method and system
US9632836B2 (en) Scheduling applications in a clustered computer system
Verner et al. Scheduling processing of real-time data streams on heterogeneous multi-GPU systems
CN111176817B (en) Method for analyzing interference between DAG (demand-oriented architecture) real-time tasks on multi-core processor based on division scheduling
EP3929745A1 (en) Apparatus and method for a closed-loop dynamic resource allocation control framework
US9244733B2 (en) Apparatus and method for scheduling kernel execution order
Li et al. Amoeba: Qos-awareness and reduced resource usage of microservices with serverless computing
CN116820784B (en) GPU real-time scheduling method and system for reasoning task QoS
Fu et al. Cache-aware utilization control for energy efficiency in multi-core real-time systems
EP3929746A1 (en) Apparatus and method for a resource allocation control framework using performance markers
US20050125797A1 (en) Resource management for a system-on-chip (SoC)
CN115269108A (en) Data processing method, device and equipment
CN104820616A (en) Task scheduling method and device
Amert et al. OpenVX and real-time certification: The troublesome history
CN105528250B (en) The evaluation and test of Multi-core computer system certainty and control method
CN108574600B (en) Service quality guarantee method for power consumption and resource competition cooperative control of cloud computing server
Buzen et al. Best/1-design of a tool for computer system capacity planning
CN117349026A (en) Distributed computing power scheduling system for AIGC model training
US20100199282A1 (en) Low burden system for allocating computational resources in a real time control environment
CN110928649A (en) Resource scheduling method and device
CN107329813B (en) Global sensing data active prefetching method and system for many-core processor
KR100547625B1 (en) Intelligent Monitoring System and Method for Grid Information Service
Wei et al. Predicting and reining in application-level slowdown on spatial multitasking GPUs

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant