CN118051189B - Memory access optimizing method, device, equipment, medium and program product - Google Patents

Memory access optimizing method, device, equipment, medium and program product

Info

Publication number
CN118051189B
Authority
CN
China
Prior art keywords
memory
physical page
access
execution information
array
Legal status
Active
Application number
CN202410448915.2A
Other languages
Chinese (zh)
Other versions
CN118051189A (en)
Inventor
白铠豪
张菁
王宝林
宋卓
Current Assignee
Alibaba Cloud Computing Ltd
Original Assignee
Alibaba Cloud Computing Ltd
Application filed by Alibaba Cloud Computing Ltd
Priority to CN202410448915.2A
Publication of CN118051189A
Application granted
Publication of CN118051189B


Abstract

The embodiment of the application provides a memory access optimization method, device, equipment, medium and program product, relating to the field of computers. The method includes: acquiring execution information of a memory access instruction; extracting the physical address from the execution information of the memory access instruction; constructing array elements according to the physical addresses of the memory access instructions and obtaining a memory array formed by the array elements; calculating the access heat of a physical page according to the total number of times the memory region corresponding to the physical page represented by the array elements has been accessed, and determining a target physical page whose access heat is greater than a heat threshold; and migrating the memory space corresponding to the target physical page to high-speed memory. The application identifies hot physical pages on the basis of a memory array in dense form, which reduces both the storage footprint and the complexity of the data involved in the heat calculation, thereby reducing processor overhead and improving overall system performance.

Description

Memory access optimizing method, device, equipment, medium and program product
Technical Field
The present application relates to the field of computer technologies, and in particular, to a memory access optimization method, a memory access optimization apparatus, an electronic device, a machine-readable medium, and a computer program product.
Background
In a scenario with both high-speed volatile memory and low-speed persistent memory, identifying the physical pages touched by memory access operations, finding the hot physical pages that are accessed frequently, and migrating those hot physical pages into the high-speed volatile memory better exploits the speed advantage of the high-speed volatile memory, thereby improving the overall memory access performance of the hybrid memory scenario.
At present, the kernel identifies and migrates hot physical pages by periodically scanning the page table. Specifically, a page table entry usually has an access bit (Accessed bit, also called a reference bit); when the processor accesses a physical page through a memory access operation, the access bit is set to a specific value. By checking whether the access bit in the page table has been set, the processor can determine whether a physical page has been accessed since the last scan. During page table scanning, the processor clears the access bits of the accessed physical pages and moves physical pages to an active or inactive list according to their access frequency, thereby distinguishing the hot physical pages.
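To make this related-art mechanism concrete, the following is a minimal sketch of one scan pass, assuming a hypothetical flat array of page table entries and a hypothetical position for the Accessed bit; it is an illustration, not the kernel's actual implementation:

    #include <stdint.h>
    #include <stddef.h>

    #define PTE_ACCESSED (1ull << 5)  /* hypothetical Accessed-bit position */

    /* One scan pass: count pages whose Accessed bit was set since the last
     * scan, then clear the bit so the next pass sees fresh information. */
    static void scan_access_bits(uint64_t *pte, size_t n, unsigned *access_count)
    {
        for (size_t i = 0; i < n; i++) {
            if (pte[i] & PTE_ACCESSED) {
                access_count[i]++;        /* page i was touched this period */
                pte[i] &= ~PTE_ACCESSED;  /* clear for the next scan period */
            }
        }
    }

Every pass walks every entry whether or not it was accessed, which is why page table scanning is described below as a resource-intensive operation.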
However, the inventors found that in this scheme page table scanning is a resource-intensive operation, so the processor incurs a large overhead, which in turn affects system performance. The processor can improve the identification of hot physical pages by increasing the scan frequency, but this further increases the processor overhead.
Disclosure of Invention
The embodiment of the application provides a memory access optimization method, aiming to solve the problem in the related art that the processor overhead is large and system performance suffers as a result.
Correspondingly, the embodiment of the application also provides a memory access optimization apparatus, memory access optimization equipment, an electronic device, a machine-readable medium and a computer program product, to ensure the implementation and application of the above method.
In order to solve the above problem, an embodiment of the present application discloses a memory access optimization method, where the method includes:
acquiring execution information of a memory access instruction;
extracting the physical address from the execution information of the memory access instruction;
constructing an array element according to the physical address of the memory access instruction and obtaining a memory array formed by the array elements, where the array element represents a physical page and the physical page contains the physical address;
calculating the access heat of a physical page according to the total number of times the memory region corresponding to the physical page represented by the array elements in the memory array has been accessed, and determining a target physical page whose access heat is greater than a heat threshold;
and migrating the memory space corresponding to the target physical page to high-speed memory.
The embodiment of the application discloses a memory access optimization apparatus, which includes:
an information acquisition module, configured to acquire the execution information of a memory access instruction;
an extraction module, configured to extract the physical address from the execution information of the memory access instruction;
a construction module, configured to construct an array element according to the physical address of the memory access instruction and obtain a memory array formed by the array elements, where the array element represents a physical page and the physical page contains the physical address;
a calculation module, configured to calculate the access heat of physical pages according to the memory array and determine a target physical page whose access heat is greater than a heat threshold, where the access heat of a physical page is positively correlated with the total number of times the memory region corresponding to the physical page represented by the array elements in the memory array has been accessed;
and a migration module, configured to migrate the memory space corresponding to the target physical page into high-speed memory.
The embodiment of the application also discloses memory access optimization equipment, which includes:
a processor and a sampling device;
the sampling device is configured to acquire the execution information of memory access instructions from the instruction pipeline queue of the processor;
the processor is configured to:
acquire the execution information of the memory access instructions sampled by the sampling device;
extract the physical address from the execution information of the memory access instruction;
construct an array element according to the physical address of the memory access instruction and obtain a memory array formed by the array elements, where the array element represents a physical page and the physical page contains the physical address;
calculate the access heat of a physical page according to the total number of times the memory region corresponding to the physical page represented by the array elements in the memory array has been accessed, and determine a target physical page whose access heat is greater than a heat threshold;
and migrate the memory space corresponding to the target physical page to high-speed memory.
The embodiment of the application also discloses an electronic device, which comprises: a processor; and a memory having executable code stored thereon that, when executed, causes the processor to perform a method as described in one or more of the embodiments of the application.
Embodiments of the application also disclose one or more machine-readable media having executable code stored thereon that, when executed, cause a processor to perform a method as described in one or more of the embodiments of the application.
Compared with the related art, the embodiment of the application has the following advantages:
According to the embodiment of the application, the physical addresses of the sampled memory access instructions are built into a memory array in dense form, and the heat calculation of physical pages is performed on the basis of that memory array, thereby identifying the hot physical pages. This reduces both the storage footprint and the complexity of the data involved in the heat calculation, which reduces processor overhead and improves overall system performance. The embodiment of the application therefore achieves sampling, hot physical page identification and migration at a low processor cost, giving better overall system performance.
The foregoing description is only an overview of the technical solutions of the present application. In order that the technical means of the present application may be more clearly understood and implemented according to the content of the specification, and to make the above and other objects, features and advantages of the present application more apparent, specific embodiments of the application are set forth below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a system architecture diagram of an embodiment of the present application;
FIG. 2 is a schematic diagram of a usage scenario of an instant messaging application according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a usage scenario of an e-commerce application in accordance with an embodiment of the present application;
FIG. 4 is a flowchart illustrating a method for optimizing memory access according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating steps of a method for optimizing memory access according to an embodiment of the present application;
FIG. 6 is a diagram illustrating a memory address distribution according to an embodiment of the present application;
FIG. 7 is a diagram of a physical address distribution according to an embodiment of the present application;
FIG. 8 is a flowchart of an implementation of a method for optimizing memory access according to an embodiment of the present application;
FIG. 9 is a block diagram of a memory access optimizing device according to an embodiment of the present application;
FIG. 10 is a block diagram of an apparatus for optimizing memory access according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a device according to an embodiment of the present application.
Detailed Description
The following describes the embodiments of the present application clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are some, but not all, embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
The terms "first", "second" and the like in the description and the claims are used to distinguish between similar elements and not necessarily to describe a particular sequence or chronological order. It is to be understood that the data so used may be interchanged where appropriate, so that the embodiments of the present application can be implemented in orders other than those illustrated or described herein. Objects identified by "first", "second", etc. are generally of one type, and the number of objects is not limited; for example, the first object may be one or more. Furthermore, the term "and/or" in the specification and claims describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates an "or" relationship between the objects before and after it. The term "plurality" in the embodiments of the present application means two or more, and other quantifiers are interpreted similarly.
For a better understanding of the present application, the following description is given to illustrate the concepts related to the present application to those skilled in the art:
Sampling device: a technology based on the Statistical Profiling Extension (SPE), which provides a hardware-supported performance analysis function. SPE is intended to help developers optimize application and system performance by collecting statistical information about software behavior without significantly affecting the execution speed of the processor. The registers of the SPE are used to control and configure the parameters of performance monitoring.
Instruction pipeline (pipeline): the instruction pipeline queue splits the instruction processing of a computer into multiple steps and accelerates instruction execution through the parallel operation of multiple hardware processing units. It works much like an assembly line in a factory, hence the name.
Memory access instruction: an instruction for reading data from or storing data into memory.
Compute Express Link (CXL): a high-speed serial protocol designed specifically for data centers and high-performance computing environments, which aims to improve the communication efficiency between different types of hardware in a data center.
Access latency: the time that elapses from the initiation of a read/write request to memory until the requested data is received. Different memory types differ significantly in access latency, and these differences are reflected in end-to-end performance.
Cold/hot memory: in computer systems, "cold/hot memory" generally refers to the access pattern and frequency of data in memory. The concept is based on one observation: not all memory data is accessed equally; some data is accessed frequently (hot data) while other data is accessed rarely (cold data). The purpose of distinguishing cold and hot memory is to optimize system performance and the use of memory resources. Keeping hot data in a fast-access storage medium reduces access latency and improves system response speed; at the same time, migrating cold data to lower-cost storage media reduces overall storage cost.
Analysis process: a tool used in the application for analyzing instruction execution performance, specifically the perf tool, a powerful performance analysis tool that can report events such as processor performance counters and trace points and supports various performance data from low-level hardware to high-level events.
Access heat: the higher the frequency with which a physical page is accessed within a period, the higher the access heat of that physical page.
High-speed memory: for example, dynamic random access memory (DRAM), a random access memory technology widely used as the main memory of computers. High-speed memory is volatile, meaning that the information stored in it is lost on power failure. It provides a very high data read/write speed and is suitable for temporarily storing data cached by the processor and running programs.
Persistent memory (PMEM, persistent memory module): a persistent storage technology, also known as non-volatile memory. It combines the high-speed access characteristics of conventional memory with the data retention capability of storage devices, so data is preserved even on power failure. Persistent memory is faster than traditional storage but typically slower than high-speed memory, and can be used to extend main memory or as fast secondary storage.
Physical page: in paged memory management, physical memory is divided into memory blocks of fixed size, and each such block is called a physical page. The physical page is the minimum unit of memory management; physical memory is allocated in units of pages.
Memory array: an array representation of memory. A one-dimensional array presents data in a particularly dense form, which saves storage space. An array element in the memory array can represent a physical page.
The overall implementation scenario of the embodiment of the application may be a CXL application scenario, against the background of the many memory problems in data centers. The introduction of CXL solves the problem of resource sharing between devices; CXL allows more flexibility in memory subsystem design and fine-grained control of memory bandwidth and capacity. Furthermore, CXL is not tied to a specific device memory medium, so the shared memory of the whole system may be composed of different parts, forming a heterogeneous memory system. To fully exploit the performance of heterogeneous memory systems, an effective solution is memory hierarchy expansion: a mechanism that mixes a high-speed volatile memory medium with other low-speed memory media to form a multi-level memory, and improves the performance, service life and other overall properties of the hybrid memory by managing it effectively.
Because of the difference in access speed between the memory media in heterogeneous memory, more frequently accessed physical pages (i.e. "hot" pages) may be placed in high-speed memory such as DRAM, while less frequently accessed physical pages (i.e. "cold" pages) may be placed in relatively low-speed memory such as non-volatile memory (NVM). This is an ideal solution for improving the performance of heterogeneous memory systems. Based on this solution, the embodiment of the application analyzes the memory access process of application programs in a server, identifies the hot physical pages in the access process, and migrates those hot physical pages into high-speed memory accordingly, thereby exploiting the speed advantage of the high-speed volatile memory and improving the overall memory access performance of the hybrid memory scenario.
The memory access optimization method in the embodiment of the application can be applied to different scenarios, for example to a usage scenario of an instant messaging application or of an e-commerce application. Taking the instant messaging application as an example: during its use, the device performs many memory access operations. For instance, after a user clicks a friend's avatar, the friend's detailed information is displayed, which involves a memory access operation that reads the friend's detailed information from memory; if the user publishes a video, the publishing action involves a memory access operation that writes the video data into memory. In the embodiment of the application, while the processor executes the instant messaging application, the sampling device acquires the execution information of memory access instructions from the instruction pipeline queue; the physical address is extracted from the execution information of each memory access instruction; array elements are constructed from the physical addresses to obtain a memory array formed by the array elements; the access heat of each physical page is calculated from the memory array, and target physical pages whose access heat is greater than the heat threshold are determined; finally, the memory space corresponding to the target physical pages is migrated into high-speed memory. Migrating the hot physical pages into high-speed memory thus exploits the speed advantage of the high-speed volatile memory and improves the overall memory access performance of the hybrid memory scenario.
Referring to FIG. 1, which shows a system architecture diagram of the memory access optimization method according to an embodiment of the present application, a server includes a sampling device, a low-speed memory, a high-speed memory and a processor, where the processor comprises a transmission module, an analysis module and a migration module.
In the embodiment of the application, in order to migrate hot physical pages, the hot physical pages are determined from the execution information collected while the processor performs memory access operations, and the function of the sampling device is to acquire the execution information of memory access instructions from the instruction pipeline queue of the processor. Specifically, the sampling device can be integrated in the server and can be hardware implemented based on the SPE technology; SPE provides a non-invasive hardware sampling method, so the sampling device can sample and analyze the execution information of the architectural instructions defined by the instruction set architecture. In the embodiment of the application, the sampling device acquires the execution information generated while the processor executes memory access instructions, samples the instructions according to the sampling frequency configured in the register of the sampling device, and transmits the sampled results from kernel mode to user mode for analysis. The execution information sampled by SPE is very accurate and does not drift, and because SPE sampling is hardware sampling, the sampling process does not occupy processor resources, which reduces processor overhead. Since the cost of sampling is small, the sampling rate can be set high, thereby improving the precision of the subsequent hot-page analysis.
In addition, related-art processors collect information by periodically scanning the access bits in the page table. To meet data consistency requirements, page table scanning may involve locking and synchronizing page table entries, which can lead to contention and lock competition and thus affect the performance of multiple processes or threads. By contrast, the SPE-based hardware sampling of the embodiment of the application extracts memory access instructions from the instruction pipeline queue of the processor and tracks and records their execution information, which is an instantaneous state; the atomicity of the memory access instruction is guaranteed without using a lock, so no process-related lock needs to be held and no process behavior is blocked, and the impact on process performance is extremely low.
The transmission module of the processor transmits the execution information of the memory access instructions sampled by the sampling device from kernel mode to user mode for analysis by the user-mode analysis module. The transmission module first parses the execution information into a format supported by the analysis process in the analysis module, performs software filtering, and transmits the remaining filtered execution information to user mode; the software filtering aims to discard execution information with a wrong format or an excessive data volume (which may contain much redundant content).
The analysis module of the processor analyzes, via the analysis process, the data packets containing the execution information that have been transmitted to user mode, and uses the execution information of the memory access instructions to identify the hot physical pages with frequent access operations for subsequent migration. Specifically, the analysis process of the embodiment of the application extracts the physical address from the execution information of each memory access instruction and constructs array elements from the physical addresses, obtaining a memory array formed by the array elements; an array element in the memory array represents a physical page, and the hot physical pages can be identified on the basis of the memory array. Notably, the whole physical address space in memory is divided into individual sections; the sections are discrete and sparsely distributed, with gaps between them, so the physical addresses of the sampled memory access instructions are likewise sparsely distributed, and data in this form occupies a larger storage space and is processed less efficiently. The embodiment of the application converts the physical addresses into array elements to obtain a memory array in which the array elements are continuously and densely distributed, which reduces both the storage footprint and the complexity of the data involved in the heat calculation, reduces processor overhead, and improves overall system performance. The embodiment of the application therefore achieves sampling, hot physical page identification and migration at a low processor cost, giving better overall system performance.
Furthermore, the analysis process identifies, through the memory array, the hot physical pages that undergo high-frequency memory access operations. Specifically, one array element in the memory array represents a physical page, and a physical page represents a 4 KB region of memory space. In the embodiment of the application, the number of memory access operations on the memory space represented by a physical page within a period can be counted and added to the access heat calculated for that physical page in the previous period, yielding the access heat of the physical page in the current period. If the access heat is greater than the heat threshold, the physical page is considered a hot physical page undergoing high-frequency access operations, and it can be added to a to-be-migrated linked list to await migration.
The migration module of the processor can call a linked-list migration interface and perform the migration operation with the to-be-migrated linked list as the dimension and the physical pages in the list as the migration granularity, realizing the targeted migration of hot physical pages from the low-speed memory to the high-speed memory.
The embodiment of the application implements hardware sampling with the sampling device, so the sampling process does not occupy processor resources, which reduces processor overhead and improves system performance. The hardware sampling implemented by the sampling device extracts memory access instructions from the instruction pipeline queue of the processor and tracks and records their execution information, an instantaneous state, so the sampling cost is small, a higher sampling frequency can be set, and high sampling precision is guaranteed. In addition, the embodiment of the application builds the physical addresses of the sampled memory access instructions into a dense memory array and performs the heat calculation of physical pages on that array to identify the hot physical pages, which reduces both the storage footprint and the complexity of the data involved in the heat calculation, reduces processor overhead, and improves overall system performance. The embodiment of the application therefore achieves sampling, hot physical page identification and migration at a low processor cost, giving better overall system performance.
It should be noted that the memory access optimization method provided by the embodiment of the present application can be implemented in several specific scenarios, as follows:
In one implementation, referring to FIG. 2, which shows a schematic diagram of a usage scenario of an instant messaging application provided by an embodiment of the present application, the scenario involves user equipment and an instant messaging application server. The instant messaging application server can sample the execution information of memory access instructions through an internally integrated sampling device, construct a dense memory array from the physical addresses in the execution information, and then, based on the memory array, identify hot physical pages and migrate them from the low-speed memory to the high-speed memory. Assuming that a physical page related to the friend-information viewing function of the instant messaging application is identified as a hot physical page with frequent memory access operations, then after migration is complete, a friend-information viewing operation initiated by the user equipment triggers the instant messaging application server's memory access in the high-speed memory, and the access result is returned to the user equipment for display. This fully exploits the advantages of the high-speed memory and improves the performance of the instant messaging application.
In another implementation, referring to FIG. 3, which shows a schematic diagram of a usage scenario of an e-commerce application provided by an embodiment of the present application, the scenario involves user equipment and an e-commerce application server. The e-commerce application server can sample the execution information of memory access instructions through an internally integrated sampling device, construct a dense memory array from the physical addresses in the execution information, and then, based on the memory array, identify hot physical pages and migrate them from the low-speed memory to the high-speed memory. Assuming that a physical page related to the commodity picture display function of the e-commerce application is identified as a hot physical page with frequent memory access operations, then after migration is complete, a commodity picture display operation initiated by the user equipment triggers the e-commerce application server's memory access in the high-speed memory, and the access result is returned to the user equipment for display. This fully exploits the advantages of the high-speed memory and improves the performance of the e-commerce application.
It should be noted that the execution information of memory access instructions, the user information (including but not limited to user equipment information, personal user information, etc.) and the data (including but not limited to data for analysis, stored data, displayed data, etc.) involved in the present application are all information and data authorized by the user or fully authorized by all parties; the collection, use and processing of the related data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation entries are provided for the user to choose to authorize or refuse.
Referring to FIG. 4, which shows a flowchart of the steps of a memory access optimization method according to an embodiment of the present application, the method includes:
Step 101, acquiring execution information of a memory access instruction.
In the embodiment of the application, the execution information produced while the processor executes instructions is generated in the instruction pipeline of the processor. The instruction pipeline queue receives, processes and dispatches instructions and maintains their state in time order; instructions initiated by the processor are processed and dispatched through the instruction pipeline queue, which also maintains the various states arising during instruction execution.
Furthermore, a memory access instruction is an instruction for reading data from or storing data into memory, so based on the memory access optimization requirement of the embodiment of the application, the execution information of memory access instructions can be obtained from the instruction pipeline queue for subsequent optimization. The execution information of a memory access instruction may include the virtual address, the physical address, the execution duration and similar information.
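As an illustration of what one sampled record might carry, the following is a minimal sketch of a decoded record on the analysis side; the struct name and field set are assumptions derived from the fields listed above, not the actual SPE packet layout:

    #include <stdint.h>

    /* Hypothetical decoded form of one sampled memory access record.
     * Field names are illustrative; the real SPE record format differs. */
    struct mem_access_record {
        uint64_t virt_addr;   /* virtual address accessed by the instruction  */
        uint64_t phys_addr;   /* physical address, used to locate the page    */
        uint32_t latency_cyc; /* execution duration (total latency) in cycles */
        uint8_t  is_store;    /* 1 = store, 0 = load                          */
    };

The phys_addr and latency_cyc fields are the ones the later steps rely on: the physical address drives the memory array construction, and the execution duration drives the filtering described below.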
Step 102, extracting the physical address from the execution information of the memory access instruction.
In the embodiment of the application, the physical page is the minimum unit of memory management, physical memory being allocated in units of pages, and a memory access instruction accesses memory space through a physical address. The physical address can therefore be extracted from the execution information of the memory access instruction, and further analysis can determine the physical page touched by the memory access operation, so that migration can be performed at the granularity of physical pages.
Step 103, constructing an array element according to the physical address of the memory access instruction and obtaining a memory array formed by the array elements, where the array element represents a physical page and the physical page contains the physical address.
In the embodiment of the application, the whole physical address space in memory is divided into individual sections; the sections are discrete and sparsely distributed, with gaps between them, so the physical addresses of all sampled memory access instructions are likewise sparsely distributed, and data in this form occupies a larger storage space and is processed less efficiently. The embodiment of the application converts the physical addresses into array elements to obtain a memory array in which the array elements are continuously and densely distributed. Compared with the sparsely distributed set of physical addresses, the densely distributed array elements have a more compact data structure and lower data complexity; subsequently using this more compact, less complex memory array in the heat calculation reduces the memory footprint of the data involved in the calculation, reduces the processor's heat-calculation cost, and improves overall system performance.
Step 104, calculating the access heat of a physical page according to the total number of times the memory region corresponding to the physical page represented by the array elements in the memory array has been accessed, and determining a target physical page whose access heat is greater than a heat threshold.
The access heat of a physical page is positively correlated with the total number of times the memory region corresponding to the physical page represented by the array elements in the memory array has been accessed.
In the embodiment of the application, after the memory array has been constructed, the hot physical pages undergoing high-frequency memory access operations can be identified through it. Specifically, one array element in the memory array represents a physical page, and a physical page represents a 4 KB region of memory space. The number of memory access operations on the memory space represented by a physical page within a period can be counted and added to the access heat calculated for that physical page in the previous period, yielding the access heat of the physical page in the current period. If the access heat is greater than the heat threshold, the physical page is considered a hot physical page (target physical page) undergoing high-frequency access operations, and it can be added to a to-be-migrated linked list to await migration.
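A minimal sketch of this per-period heat accumulation is given below; the counter arrays, the threshold value and the decay-free accumulation are illustrative assumptions rather than the patent's exact bookkeeping:

    #include <stdint.h>
    #include <stddef.h>

    #define HEAT_THRESHOLD 64  /* assumed threshold; tuned in practice */

    /* page_count[i]: accesses sampled for page i in the current period.
     * heat[i]: access heat carried over from previous periods.
     * Fills out_hot[] with the indices of pages exceeding the threshold. */
    static size_t end_of_period(uint32_t *page_count, uint32_t *heat,
                                size_t npages, size_t *out_hot)
    {
        size_t nhot = 0;
        for (size_t i = 0; i < npages; i++) {
            heat[i] += page_count[i];   /* fold this period into the heat   */
            page_count[i] = 0;          /* reset counter for the next period */
            if (heat[i] > HEAT_THRESHOLD)
                out_hot[nhot++] = i;    /* candidate for the migration list */
        }
        return nhot;
    }

Because the counters live in one dense array, this pass is a linear scan over compact data, in contrast with walking a sparsely distributed set of physical addresses.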
Step 105, migrating the memory space corresponding to the target physical page into the high-speed memory.
The migration module can call a linked-list migration interface and perform the migration operation with the to-be-migrated linked list as the dimension and the physical pages in the list as the migration granularity, realizing the targeted migration of hot physical pages from the low-speed memory to the high-speed memory. Executing the migration operation specifically means applying for a new physical page in the high-speed memory, copying the data content of the target physical page to be migrated into the memory space corresponding to the new physical page, then releasing the old data of the target physical page, and finally replacing the recorded target physical page with the new physical page in the process page table. After migration, the process's accesses to the target physical page land in the new physical page in the high-speed memory, achieving the effect of high-speed access.
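The following is a minimal user-space sketch of those four steps under stated assumptions: the allocation, release and remapping helpers are stand-ins for kernel interfaces (such as the kernel's own page migration machinery), not real APIs:

    #include <stdlib.h>
    #include <string.h>
    #include <stdbool.h>

    #define PAGE_SIZE 4096

    /* Stand-ins for high-speed-memory allocation and page-table update. */
    static void *alloc_page_in_dram(void) { return malloc(PAGE_SIZE); }
    static void  release_page(void *page) { free(page); }
    static bool  remap_in_page_table(void *old_page, void *new_page)
    {
        (void)old_page; (void)new_page;  /* page-table update elided */
        return true;
    }

    /* Migrate one hot page: allocate in DRAM, copy, remap, release. */
    static bool migrate_hot_page(void **page_slot)
    {
        void *old_page = *page_slot;
        void *new_page = alloc_page_in_dram(); /* 1. new page in high-speed memory */
        if (!new_page)
            return false;
        memcpy(new_page, old_page, PAGE_SIZE); /* 2. copy the data content */
        if (!remap_in_page_table(old_page, new_page)) {
            release_page(new_page);            /* roll back on failure */
            return false;
        }
        *page_slot = new_page;                 /* 3. record the new page */
        release_page(old_page);                /* 4. release the old data */
        return true;                           /* later accesses land in DRAM */
    }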
In summary, the application builds the physical addresses of the sampled memory access instructions into a memory array in dense form and performs the heat calculation of physical pages on the basis of that memory array to identify the hot physical pages, which reduces both the storage footprint and the complexity of the data involved in the heat calculation, thereby reducing processor overhead and improving overall system performance. The embodiment of the application therefore achieves sampling, hot physical page identification and migration at a low processor cost, giving better overall system performance.
Referring to FIG. 5, which shows a flowchart of the specific steps of a memory access optimization method according to an embodiment of the present application, the method includes:
Step 201, controlling the sampling device to acquire sampling parameters from the register of the sampling device.
Step 202, receiving the remaining execution information sent by the sampling device.
The sampling parameters include a sampling frequency and filtering conditions. The sampling device acquires the execution information of memory access instructions from the instruction pipeline queue of the processor according to the sampling frequency, discards the execution information meeting the filtering conditions, and then sends the remaining execution information; the filtering conditions are used to screen out memory access instructions that do not actually access memory.
In the embodiment of the present application, for steps 201 to 202, the sampling device may be integrated in the server and may be hardware implemented based on the SPE technology; SPE provides a non-invasive hardware sampling method, so the sampling device can sample and analyze the information of the architectural instructions defined by the instruction set architecture. In the embodiment of the application, the sampling device acquires the execution information generated while the processor executes memory access instructions, samples the instructions according to the sampling frequency configured in the register of the sampling device, and transmits the sampled results from kernel mode to user mode for analysis. The instruction pipeline queue receives, processes and dispatches instructions and maintains their state in time order; instructions issued by the processor are processed and dispatched through the instruction pipeline queue, which also maintains the various states arising during instruction execution.
Furthermore, a memory access instruction is an instruction for reading data from or storing data into memory, so based on the memory access optimization requirement of the embodiment of the application, the sampling device can be instructed to acquire the execution information of the memory access instructions in the instruction pipeline queue for subsequent optimization. The execution information of a memory access instruction may include the virtual address, the physical address, the execution duration and similar information.
The execution information sampled by SPE is very accurate and does not drift, and because SPE sampling is hardware sampling, the sampling process does not occupy processor resources, which reduces processor overhead. Since the cost of sampling is small, the sampling rate can be set high, thereby improving the precision of the subsequent hot-page analysis.
Specifically, during hardware sampling, the embodiment of the application can control the sampling time interval by setting the sampling frequency, thereby controlling the precision of the subsequent hot-page analysis (the sampling frequency is positively correlated with the subsequent analysis precision). In addition, filtering conditions can be set during hardware sampling, so that the execution information irrelevant to memory access optimization is filtered out; this improves the quality of the sampled data and reduces the output data volume of the sampling device, improving the subsequent analysis precision while saving overhead.
In addition, the sampling frequency and the filtering conditions may be configured in advance in a register integrated inside the sampling device, from which the sampling device obtains them when sampling starts.
The sampling frequency determines the time interval between two adjacent sampling actions; the higher the sampling frequency, the larger the volume of data acquired within a period and the higher the precision of the subsequent hot-page analysis. The embodiment of the application can set the sampling frequency reasonably according to actual requirements (such as the required sampling precision and the equipment cost). The processor controls the sampling device to obtain the sampling frequency from the register; after obtaining it, the sampling device acquires the execution information of memory access instructions from the instruction pipeline queue at that frequency and provides it to the processor for subsequent use. The sampling itself requires no processor intervention, which saves processor overhead.
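As an illustration of how such a sampling period can be configured from software, the sketch below opens a sampling event through the Linux perf_event_open syscall; the PMU type constant is an assumption (on a real system it would be read from /sys/bus/event_source/devices/arm_spe_0/type), and the SPE-specific config bits are omitted:

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <string.h>
    #include <unistd.h>

    /* Assumed PMU type id for the SPE sampling device; read it from
     * /sys/bus/event_source/devices/arm_spe_0/type on a real system. */
    #define ASSUMED_SPE_PMU_TYPE 8

    static int open_spe_sampling(pid_t pid, int cpu)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size          = sizeof(attr);
        attr.type          = ASSUMED_SPE_PMU_TYPE;
        attr.sample_period = 1024; /* sample one operation out of every 1024 */
        attr.sample_type   = PERF_SAMPLE_ADDR | PERF_SAMPLE_PHYS_ADDR;
        attr.disabled      = 1;    /* enable later via PERF_EVENT_IOC_ENABLE */

        return (int)syscall(SYS_perf_event_open, &attr, pid, cpu, -1, 0);
    }

PERF_SAMPLE_PHYS_ADDR requests the physical address in each sample, which is exactly the field the memory array construction in step 205 consumes.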
Furthermore, in the embodiment of the application, the purpose of setting the filtering conditions is to filter out the execution information irrelevant to memory access optimization from the sampled execution information, which improves the quality of the sampled data and reduces the output data volume of the sampling device, thereby improving the subsequent analysis precision while saving overhead.
In one implementation, since there is a cache between the memory and the processor, data frequently accessed by the processor is stored into the cache from memory in advance. When the processor performs a memory access operation, the memory access instruction first accesses the cache, and only on a cache miss does it access the memory. A filtering condition therefore aims to filter out the execution information of memory access instructions that do not actually access memory (for example, no memory access is needed when the cache hits).
Optionally, the execution information includes cache hit information and an execution duration. The sampling device discards a first memory access instruction when the cache hit information indicates that the instruction hit at least one level of the three-level cache, and discards a second memory access instruction when the execution duration of the instruction falls within a preset duration range.
In the embodiment of the present application, in order to improve execution efficiency and reduce the interaction between the processor and the memory, modern processors integrate a multi-level cache architecture, a common one being the three-level cache structure comprising a first-level cache L1, a second-level cache L2 and a third-level cache L3. The L1 cache is the cache closest to the processor, with the smallest capacity and the fastest speed. The L2 cache has a larger capacity but is slower than L1; it serves as the cache for L1 and stores data that the processor needs but that L1 cannot hold. The L3 cache has the largest capacity and the slowest speed and can be regarded as the cache for L2.
When the processor runs, it first looks for the required data in the L1 cache according to the memory access instruction, then in the L2 cache, then in the L3 cache, and if the data is not found in the third-level cache, it fetches the data from memory. The longer the lookup path, the longer the access takes, so if some data is to be accessed very frequently, keeping it in the L1 cache makes access very fast.
Based on the above architecture design, some memory access instructions hit in a certain cache level, while others miss in all three cache levels and actually access the memory.
To achieve the above objective, the filtering conditions can take two specific forms. The first scheme filters according to hit information: the execution information of a sampled memory access instruction includes hit information that reflects the hits/misses of the instruction in each cache level and in memory (a hit indicates that the access operation completed in that storage medium, i.e. the data to be read was found; a miss indicates that it did not complete, i.e. the data was not found). Based on the hit information, when a miss in the third-level cache L3 is determined for a memory access instruction, the instruction goes on to actually access the memory; conversely, an instruction that does not miss in L3 (for example, one that hits in the L1, L2 or L3 cache) is considered by the sampling device not to actually access the memory, so it is determined to be a first memory access instruction and is removed by the filtering condition.
The second scheme filters according to the execution duration. Since the access path passes through the L1, L2 and L3 caches in turn before reaching the memory, the time a memory access instruction spends at each level can be characterized: suppose the instruction starts accessing the L1 cache at time a, would start accessing the L2 cache at time b, the L3 cache at time c, and the memory at time d, with a, b, c, d ordered from early to late on the time axis. Based on the execution duration in the sampled execution information, the storage medium in which the access completed can be judged. If the access starts at time a and the end time falls between time a and time b, the access operation is considered to have completed in the L1 cache; the instruction did not actually access the memory and is a second memory access instruction to be filtered out. If the end time falls between time b and time c, the access completed in the L2 cache; again a second memory access instruction to be filtered out. If the end time falls between time c and time d, the access completed in the L3 cache; again a second memory access instruction to be filtered out. If the end time is later than time d, the access operation is considered to have been performed in the memory, and the instruction is retained.
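A minimal sketch of this duration-based classification follows; the boundary constants stand in for the intervals a-b, b-c and c-d above and would be calibrated per platform:

    #include <stdbool.h>
    #include <stdint.h>

    /* Placeholder latency boundaries (in cycles) corresponding to the
     * times b, c, d measured from time a; real values are platform-specific. */
    #define LAT_L2_START   8   /* b - a */
    #define LAT_L3_START  24   /* c - a */
    #define LAT_MEM_START 60   /* d - a */

    /* Keep a sample only if its total latency implies it reached memory. */
    static bool reached_memory(uint32_t latency_cyc)
    {
        if (latency_cyc < LAT_L2_START)  return false; /* completed in L1 */
        if (latency_cyc < LAT_L3_START)  return false; /* completed in L2 */
        if (latency_cyc < LAT_MEM_START) return false; /* completed in L3 */
        return true;                                   /* actual memory access */
    }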
Step 203, in the process of parsing the execution information of the memory access instructions into data packets in a format supported by the analysis process, discarding the data packets whose parsed data volume is greater than a first data volume threshold, and discarding the data packets whose parsed format is not supported by the analysis process.
In the embodiment of the application, while the processor transmits the sampled execution information of the memory access instructions from kernel mode to user mode, the execution information can be parsed into data packets in a format supported by the user-mode analysis process. During parsing, one pass of software filtering can be performed: the format and the data volume of each parsed data packet are examined, and packets with a wrong format or an excessive data volume (containing much redundant data) are filtered out, which further improves the quality of the data packets handled by the analysis process, reduces overhead, and improves system performance.
Step 204, after the execution information has been parsed into a format supported by the analysis process, extracting the physical address from the execution information of the memory access instruction through the analysis process.
In the embodiment of the application, after the sampled execution information of the memory access instructions has been transmitted to user mode in a format supported by the analysis process, the user-mode analysis process extracts the physical address from the execution information of the memory access instructions and performs the subsequent hot-page analysis.
The analysis process may employ the perf tool, a powerful performance analysis tool that can report events such as processor performance counters and trace points and supports various performance data from low-level hardware to high-level events.
Optionally, after step 202, the method may further include:
Step A1, writing the execution information of the memory access instructions into a first buffer.
Step A2, when the data volume in the first buffer reaches a second data volume threshold, providing the first buffer to the analysis process so that the analysis process analyzes the execution information in the first buffer; and, while the second buffer is not being used by the analysis process, continuing to write the execution information of the memory access instructions into the second buffer.
Step A3, when the data volume in the second buffer reaches the second data volume threshold, providing the second buffer to the analysis process so that the analysis process analyzes the execution information in the second buffer; and, while the first buffer is not being used by the analysis process, returning to step A1.
In the embodiment of the application, for steps A1 to A3, the transmission link that carries the data sampled by the sampling device from kernel mode to user mode and the analysis link of the analysis process can be further optimized with the ping-pong buffer technique. In the related art, while the sampled execution information is transmitted from kernel mode to user mode, it must be written into a buffer, and the analysis process reads the execution information from the buffer for analysis. The related art adopts a serial blocking scheme: when the buffer being written becomes full, an interrupt is triggered, and the analysis process responds to the interrupt by analyzing the data in the buffer; during that analysis, the sampling device cannot continue working (the data it samples must be written into the buffer before it can proceed), so gaps appear in the operation of the sampling device and the execution information within a period of time is missed, which affects the precision of the subsequent hot physical page identification.
The embodiment of the application adopts the ping-pong buffer technique, setting up a first buffer and a second buffer, with the sampling device as the producer and the analysis process as the consumer. The producer first samples and writes the sampled data into the first buffer, while the second buffer remains empty; when the first buffer has been filled (the second data volume threshold is reached), the producer switches over and continues writing sampled data into the second buffer, while the consumer reads the data out of the first buffer for analysis; after the second buffer has been filled (reaching the second data volume threshold), the producer switches back and writes sampled data into the first buffer, while the consumer reads the data out of the second buffer. Throughout the whole process, one buffer is always being read while the other is being written, and the two exchange read/write roles like a ping-pong ball going back and forth. Compared with the serial blocking scheme of the related art, the interaction between writing and reading data with the ping-pong buffer is smoother and more efficient; the probability that the sampling device is blocked and cannot sample is greatly reduced, continuous sampling of the execution information is guaranteed, the probability of missed samples is reduced, and the precision of the subsequent hot physical page identification is improved.
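A minimal single-threaded sketch of the producer-side switching logic follows; consume() is a hypothetical stand-in for handing a full buffer to the analysis process, and a real implementation would add synchronization between producer and consumer:

    #include <stddef.h>
    #include <stdint.h>

    #define BUF_CAP 4096  /* the second data volume threshold, in records */

    struct ping_pong {
        uint64_t buf[2][BUF_CAP]; /* the two buffers */
        size_t   fill;            /* records written into the active buffer */
        int      active;          /* 0 or 1: the buffer currently written */
    };

    /* Stand-in for handing a full buffer to the analysis process. */
    static void consume(const uint64_t *records, size_t n)
    {
        (void)records; (void)n;   /* analysis elided in this sketch */
    }

    static void produce(struct ping_pong *pp, uint64_t record)
    {
        pp->buf[pp->active][pp->fill++] = record;
        if (pp->fill == BUF_CAP) {
            int full = pp->active;
            pp->active ^= 1;      /* writer switches to the other buffer... */
            pp->fill = 0;
            consume(pp->buf[full], BUF_CAP); /* ...while the full one is read */
        }
    }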
Step 205, extracting a memory segment index of the physical address, and converting the memory segment index into array elements to obtain a memory array formed by the array elements, where the memory segment index is used to indicate a memory area object to which a physical page where the physical address is located belongs.
Wherein the memory area objects in the memory are in a sparse distribution form, and the array elements in the memory array are in a dense distribution form.
In the embodiment of the present application, referring to fig. 6, which shows a memory address distribution schematic diagram: under a sparse memory model, the continuous address space is divided into sections, and memory within a section is contiguous. struct mem_section (a pointer array) is the data structure representing a physical memory section in the system, describing how physical memory in the system is divided. It can be seen that, with this sectioned structure of the continuous address space, the physical addresses pointed to by struct mem_section entries are discretely (sparsely) distributed (for example, the physical addresses for sections 1-5 may not exist); such a data structure occupies a large storage space and affects system performance.
Referring further to fig. 7, which shows a physical address distribution diagram: in one memory implementation, the supported physical address is 48 bits wide, the physical page size is 4KB (12 bits), and the section size is 1GB (30 bits). Bits 0-11 are the page frame offset, which characterizes the displacement within a physical page, whose size is typically 4KB; bits 12-29 are the page count index, which characterizes the displacement within the memory array; bits 30-47 are the memory section index, used to indicate the memory region object to which the physical page containing the physical address belongs.
In the embodiment of the application, the memory section index of the physical address can be extracted and converted into an array element, obtaining a memory array composed of such array elements. In the specific conversion process, the memory section index is used as the index into the pointer array struct mem_section to determine the section in which the physical page (page frame) is located, and the page count index is used as the index into the memory array page_count attached to that section, locating the specific access count of the physical page, whose record is then updated. The memory array (struct page array) is attached to the section structures; relative to the discrete (sparse) distribution of the section structures themselves, the array elements within a memory array are contiguous (dense), which saves data storage space.
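As a minimal sketch of the bit-level split just described, assuming the 48/30/12-bit layout of fig. 7 (the macro and variable names are illustrative, not kernel symbols):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT    12   /* 4KB physical pages */
    #define SECTION_SHIFT 30   /* 1GB sections       */
    #define IDX_BITS (SECTION_SHIFT - PAGE_SHIFT)   /* 18 bits per section */

    int main(void)
    {
        uint64_t paddr = 0x00400000;   /* a sampled physical address */

        uint64_t section_index    = paddr >> SECTION_SHIFT;                     /* bits 30-47 */
        uint64_t page_count_index = (paddr >> PAGE_SHIFT) & ((1ul << IDX_BITS) - 1); /* bits 12-29 */
        uint64_t page_offset      = paddr & ((1ul << PAGE_SHIFT) - 1);          /* bits 0-11  */

        /* section_index would select the struct mem_section; page_count_index
           indexes the dense page_count array attached to that section.        */
        printf("section=%llu page=%llu offset=%llu\n",
               (unsigned long long)section_index,
               (unsigned long long)page_count_index,
               (unsigned long long)page_offset);
        return 0;
    }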
In one practical example, assume the physical page size is 4KB (4096 bytes) and consider the physical address 0x00400000 (4194304 in decimal). To calculate the page frame number (PFN), the physical address is divided by the physical page size:
PFN = physical address/page size;
PFN=0x00400000/4096;
PFN=1024;
The PFN corresponding to physical address 0x00400000 is 1024.
In the kernel, the translation process may be done by macros or inline functions; for example, the __pa() macro translates a virtual address to a physical address, while the page_to_pfn() function translates a struct page pointer in the kernel to the corresponding PFN. After the PFN is acquired, the corresponding mem_section can be looked up and the offset within that mem_section calculated, yielding the array element, namely the element at that offset from the start of the mem_section's array.
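The following userspace sketch models the lookup path just described: a sparse table of section pointers, most of which are NULL holes, with a dense counter array attached to each populated section. The names (mem_section_tbl, pfn_to_counter) are assumptions for exposition, not kernel interfaces.

    #include <stdint.h>
    #include <stdlib.h>
    #include <stdio.h>

    #define PAGE_SHIFT        12
    #define PAGES_PER_SECTION (1ul << 18)   /* 1GB section / 4KB page */
    #define NSECTIONS         8             /* small illustrative table */

    /* Sparse pointer table: most entries are NULL "holes", matching the
       sparse distribution of struct mem_section described above.        */
    static uint32_t *mem_section_tbl[NSECTIONS];

    /* PFN -> address of its counter element, or NULL if the section is a
       hole with no memory behind it.                                     */
    static uint32_t *pfn_to_counter(uint64_t pfn)
    {
        uint64_t sec = pfn / PAGES_PER_SECTION;
        uint64_t off = pfn % PAGES_PER_SECTION;

        if (sec >= NSECTIONS || mem_section_tbl[sec] == NULL)
            return NULL;
        return mem_section_tbl[sec] + off;  /* section start + offset */
    }

    int main(void)
    {
        /* Only section 0 is populated with a dense counter array. */
        mem_section_tbl[0] = calloc(PAGES_PER_SECTION, sizeof(uint32_t));

        uint64_t pfn = 0x00400000 >> PAGE_SHIFT;   /* == 1024, as above */
        uint32_t *c = pfn_to_counter(pfn);
        if (c) {
            (*c)++;
            printf("counter for PFN %llu is %u\n",
                   (unsigned long long)pfn, (unsigned)*c);
        }
        return 0;
    }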
Step 206, counting the number of times of access to the physical page corresponding to the array element in the memory array in the current period.
And 207, superposing the access times of the physical pages in the current period and the access heat of the physical pages in the previous period to obtain the access heat of the physical pages in the current period.
In the embodiment of the present application, for steps 206-207, the analysis process can identify thermophysical pages for high-frequency access operations through the memory array. Specifically, one array element in the memory array represents one physical page, and a physical page represents a 4KB span of memory space. In the embodiment of the application, the number of memory access operations on the memory space represented by the physical page in the current period (i.e., the number of times a physical address of an access operation falls within the memory space corresponding to the physical page) can be counted and superposed with the access heat of the physical page calculated in the previous period, yielding the access heat of the physical page in the current period.
For example, assuming that the access heat of a physical page is calculated to be 10 in period 1 and 17 in period 2, if the current period is period 3 and the number of times of access to the memory space corresponding to the physical page in period 3 is counted to be 11, the access heat of the physical page calculated in period 3=17+11=28. If the hot threshold of the thermophysical page (target physical page) is determined to be 25, the physical page may be determined to be a thermophysical page.
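A minimal sketch of the counting and superposition of steps 206-207, reproducing the numbers of the example above (the struct and the threshold value are illustrative assumptions):

    #include <stdbool.h>
    #include <stdio.h>

    #define HEAT_THRESHOLD 25

    /* Illustrative per-page state for steps 206-207. */
    struct page_stat {
        unsigned heat;    /* accumulated access heat      */
        unsigned count;   /* accesses counted this period */
    };

    /* End-of-period update: superpose this period's count onto the heat. */
    static bool end_period(struct page_stat *p)
    {
        p->heat += p->count;
        p->count = 0;
        return p->heat > HEAT_THRESHOLD;   /* thermophysical page? */
    }

    int main(void)
    {
        struct page_stat p = {0};
        unsigned counts[] = {10, 7, 11};   /* periods 1..3 of the example */
        for (int i = 0; i < 3; i++) {
            p.count = counts[i];
            bool hot = end_period(&p);     /* heat: 10, 17, 28 */
            printf("period %d: heat=%u hot=%d\n", i + 1, p.heat, hot);
        }
        return 0;
    }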
Optionally, the method may further include:
Step B1: after every preset number of periods, dividing the access heat of the physical page calculated in the next period by a preset coefficient, and updating the access heat of the physical page counted in that period to the result of the operation, where the preset coefficient is a coefficient greater than 1.
In the embodiment of the application, due to the uncertainty of access operations, the access heat of a physical page may spike or dip across periods, which can lead to misjudging thermophysical pages. For example, the calculated heat of a physical page may have remained high over past periods; when the page suddenly cools down in the current period, the purely superposed heat calculation still reports a high value, so the page is easily misjudged as a thermophysical page in the current period. To solve this problem, the embodiment of the application divides the access heat of the physical page calculated in the next period by a preset coefficient greater than 1 (for example, 2) after each preset number of periods, appropriately attenuating the access heat. This addresses the misjudgment problem and reduces the probability of falsely identifying thermophysical pages.
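A minimal sketch of the attenuation of step B1, assuming a preset coefficient of 2 and a preset period count of 4 (both values are illustrative):

    #include <stdio.h>

    #define DECAY_EVERY 4   /* preset number of periods (assumed value) */
    #define DECAY_COEFF 2   /* preset coefficient, must be > 1          */

    /* Apply the periodic attenuation of step B1 to an accumulated heat. */
    static unsigned decay(unsigned heat, unsigned period)
    {
        return (period % DECAY_EVERY == 0) ? heat / DECAY_COEFF : heat;
    }

    int main(void)
    {
        unsigned heat = 40;   /* heat accumulated so far */
        for (unsigned period = 1; period <= 8; period++) {
            heat = decay(heat, period);   /* halved at periods 4 and 8 */
            printf("period %u: heat=%u\n", period, heat);
        }
        return 0;
    }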
Step 208, determining that the access heat is greater than a target physical page of a heat threshold, and adding the target physical page to a linked list to be migrated.
Step 209, calling a linked list migration interface, and migrating a memory space corresponding to the target physical page in the linked list to be migrated to the high-speed memory.
In the related art, the main kernel-mode interface for physical page migration is migrate_pages; driven from user mode, it migrates the physical pages related to a process at process granularity, without distinguishing hot pages from cold ones. Such an interface is clearly unsuited to the in-memory thermophysical page migration scenario of the embodiment of the application.
In the embodiment of the present application, for steps 208 to 209, a linked list migration interface may be implemented, where the linked list migration interface is used to implement a migration operation using a linked list to be migrated as a dimension and using physical pages in the linked list to be migrated as migration granularity, that is, implement targeted migration of thermophysical pages from low-speed memory to high-speed memory, thereby providing a migration interface that satisfies a migration scenario of thermophysical pages in memory.
In addition, the embodiment of the application allows the heat threshold to be set flexibly in user mode, thereby flexibly adjusting the distribution of hot and cold memory across the memory space. For example, when memory access hot spots are concentrated, such as under long-tail access patterns, the heat threshold can be set to a larger value in user mode, reducing the probability of access thrashing.
Optionally, step 209 may specifically include:
Substep 2091, when it is detected that the memory space corresponding to the target physical page is not in the high-speed memory, applying for a new physical page in the high-speed memory.
Substep 2092, copying the data content stored in the memory space corresponding to the target physical page to the memory space corresponding to the new physical page in the high-speed memory.
In the embodiment of the present application, for sub-steps 2091-2092, the migration operation specifically applies for a new physical page in the high-speed memory when detecting that the memory space corresponding to the target physical page is not in the high-speed memory, and copies the data content stored in the memory space corresponding to the target physical page to the memory space corresponding to the new physical page in the high-speed memory, thereby completing the purpose of migrating the physical page from the low-speed memory to the high-speed memory.
Optionally, in the case that it is detected that the memory space corresponding to the target physical page is not in the high-speed memory, step 209 may further include:
Substep 2093, setting the target physical page to an unchangeable to-be-migrated state.
In the embodiment of the application, when it is detected that the memory space corresponding to the target physical page is not in the high-speed memory, the target physical page can first be set to the unchangeable to-be-migrated state to ensure data consistency.
Optionally, after sub-step 2092, it may further include:
Substep 2094, replacing the recorded target physical page with the new physical page in a process page table.
Substep 2095, releasing the data content in the memory space corresponding to the target physical page.
In the embodiment of the present application, for sub-steps 2094-2095, after copying the data content of the target physical page to be migrated to the memory space corresponding to the new physical page, the recorded target physical page is replaced with the new physical page in the process page table, and the old data of the target physical page is then released. After the migration, the process's accesses to the target physical page land on the new physical page in the high-speed memory, achieving the effect of high-speed access.
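The following userspace sketch models sub-steps 2091-2095 end to end; the types and helpers are assumptions for exposition (error handling omitted) and are not real kernel interfaces.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    enum tier { SLOW_MEM, FAST_MEM };

    struct phys_page {
        enum tier tier;
        int       locked;   /* the unchangeable to-be-migrated flag      */
        void     *mem;      /* backing storage standing in for the page  */
    };

    /* Stand-in for one process page table entry mapping a virtual page. */
    static struct phys_page *pte;

    static void migrate_to_fast(struct phys_page *target)
    {
        if (target->tier == FAST_MEM)
            return;                                   /* nothing to migrate */

        target->locked = 1;                           /* substep 2093 */

        struct phys_page *fresh = malloc(sizeof *fresh);   /* substep 2091 */
        fresh->tier   = FAST_MEM;
        fresh->locked = 0;
        fresh->mem    = malloc(PAGE_SIZE);

        memcpy(fresh->mem, target->mem, PAGE_SIZE);   /* substep 2092 */

        pte = fresh;                                  /* substep 2094 */

        free(target->mem);                            /* substep 2095 */
        target->mem = NULL;
    }

    int main(void)
    {
        struct phys_page cold = { SLOW_MEM, 0, calloc(1, PAGE_SIZE) };
        pte = &cold;
        migrate_to_fast(&cold);
        printf("page now in %s memory\n",
               pte->tier == FAST_MEM ? "high-speed" : "low-speed");
        return 0;
    }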
In summary, referring to fig. 8, which shows a flowchart of an implementation of the approach: S1, start the decoder. S2, judge whether sampling data exist; if not, end the flow; if yes, execute S3, reading the data packets transmitted by the transmission module and filtered by software. Then execute S4, judging whether the migration condition is met (that is, the memory array is filled and not yet migrated); if the migration condition is met, execute S9 to perform migration and continue; if the migration condition is not met, proceed to S5, obtaining the physical address in the execution information of the access instruction, then S6, calculating the heat of the physical page, then S7, judging whether the heat of the physical page reaches the heat threshold. If it does not, return to S4; if it does, execute S8, adding the physical page to the linked list to be migrated, then S9, performing the migration, after which the flow ends.
In summary, the embodiment of the application realizes hardware sampling by the sampling device, so the sampling process does not occupy processor resources, which reduces processor overhead and improves system performance. The hardware sampling realized by the sampling device extracts memory access instructions from the instruction pipeline queue of the processor and tracks and records their execution information, which is an instantaneous state, so the sampling cost is small; a higher sampling frequency can therefore be set, ensuring high sampling precision. In addition, the embodiment of the application constructs the physical addresses of the sampled memory access instructions into a dense memory array and performs the physical page heat calculation based on that memory array to identify thermophysical pages, which reduces both the storage space occupied by the data participating in the heat calculation and the complexity of that data, thereby reducing processor overhead and improving overall system performance. The embodiment of the application thus realizes sampling, thermophysical page identification and migration on the basis of low processor overhead, with better overall system performance.
Referring to fig. 9, an embodiment of the present application further provides a device for optimizing memory access, including:
A processor and a sampling device;
The sampling device is used for acquiring the execution information of the memory access instruction from the instruction pipeline queue of the processor;
The processor is configured to:
Acquiring execution information of the access instruction sampled by the sampling device;
Extracting a physical address in the execution information of the access instruction;
constructing an array element according to the physical address of the memory access instruction, and obtaining a memory array formed by the array element, wherein the array element is used for representing a physical page, and the physical page comprises the physical address;
According to the total number of times of access of the memory area corresponding to the physical page represented by the array elements in the memory array, calculating to obtain the access heat of the physical page, and determining a target physical page with the access heat greater than a heat threshold;
and migrating the memory space corresponding to the target physical page to a high-speed memory.
In the embodiment of the present application, the related interaction content between the processor and the sampling device may refer to the related description of the above embodiment, which is not described herein.
In summary, the embodiment of the application realizes hardware sampling by the sampling device, so the sampling process does not occupy processor resources, which reduces processor overhead and improves system performance. The hardware sampling realized by the sampling device extracts memory access instructions from the instruction pipeline queue of the processor and tracks and records their execution information, which is an instantaneous state, so the sampling cost is small; a higher sampling frequency can therefore be set, ensuring high sampling precision. In addition, the embodiment of the application constructs the physical addresses of the sampled memory access instructions into a dense memory array and performs the physical page heat calculation based on that memory array to identify thermophysical pages, which reduces both the storage space occupied by the data participating in the heat calculation and the complexity of that data, thereby reducing processor overhead and improving overall system performance. The embodiment of the application thus realizes sampling, thermophysical page identification and migration on the basis of low processor overhead, with better overall system performance.
Referring to fig. 10, a block diagram of an apparatus for optimizing memory access according to an embodiment of the present application is shown, including:
The information acquisition module 301 is configured to acquire execution information of the access instruction;
The extracting module 302 is configured to extract a physical address in the execution information of the memory access instruction;
A construction module 303, configured to construct an array element according to a physical address of the memory access instruction, and obtain a memory array configured by the array element, where the array element is used to characterize a physical page, and the physical page includes the physical address;
The calculation module 304 is configured to calculate, according to the total number of times that the physical page represented by the array element in the memory array is accessed, an access heat of the physical page, and determine a target physical page whose access heat is greater than a heat threshold;
and the migration module 305 is configured to migrate the memory space corresponding to the target physical page to the high-speed memory.
Optionally, the information obtaining module 301 includes:
the control submodule is used for controlling the sampling device to acquire sampling parameters from a register of the sampling device, the sampling parameters comprising a sampling frequency and a filtering condition; the sampling device is used for acquiring the execution information of access instructions from the instruction pipeline queue of the processor according to the sampling frequency, discarding the execution information meeting the filtering condition, and then sending the remaining execution information; the filtering condition is used for screening out access instructions that did not actually access the memory;
and the receiving sub-module is used for receiving the remaining execution information sent by the sampling device.
Optionally, the execution information includes cache hit information and execution duration; the sampling device is configured to discard a first access instruction when the cache hit information indicates that the first access instruction hit at least one level of the three-level cache, and to discard a second access instruction when its execution duration falls within a preset duration range.
Optionally, the extracting module 302 includes:
and the analysis process sub-module is used for extracting the physical address in the execution information of the access instruction through the analysis process after analyzing the execution information into a format supported by the analysis process.
Optionally, the apparatus further includes:
and the second filtering module is used for, in the process of parsing the execution information of access instructions into data packets in the format supported by the analysis process, discarding data packets whose parsed data volume is larger than a first data volume threshold and discarding data packets whose parsed format is not supported by the analysis process.
Optionally, the apparatus further includes:
The writing sub-module is used for writing the execution information of the access instruction into the first cache tile;
the first circulation sub-module is used for providing the first cache tile to an analysis process for the analysis process to analyze the execution information in the first cache tile under the condition that the data volume in the first cache tile reaches a second data volume threshold; and under the condition that the second cache tile is not used by the analysis process, continuing to write the execution information of the access instruction into the second cache tile;
The second circulation sub-module is configured to provide the second cache tile to an analysis process, where the analysis process analyzes execution information in the second cache tile when the data size in the second cache tile reaches the second data size threshold; and under the condition that the first cache tile is not used by the analysis process, entering the step of writing the execution information of the access instruction into the first cache tile.
Optionally, the building module 303 includes:
The construction submodule is used for extracting a memory segment index of the physical address, converting the memory segment index into array elements and obtaining a memory array formed by the array elements, wherein the memory segment index is used for indicating a memory area object of a physical page where the physical address is located;
Wherein the memory area objects in the memory are in a sparse distribution form, and the array elements in the memory array are in a dense distribution form.
Optionally, the computing module 304 includes:
The first calculation sub-module is used for counting the accessed times of the physical pages corresponding to the array elements in the memory array in the current period;
and the second calculation sub-module is used for superposing the access times of the physical pages in the current period and the access heat of the physical pages in the previous period to obtain the access heat of the physical pages in the current period.
Optionally, the apparatus further includes:
and the attenuation module is used for dividing the access heat of the physical pages calculated in the next period by a preset coefficient after a preset number of periods, and updating the access heat of the physical pages counted in the next period into an operation result, wherein the preset coefficient is a coefficient larger than 1.
Optionally, the apparatus further includes:
The adding module is used for adding the target physical page to a linked list to be migrated;
a migration module 305 comprising:
and the calling sub-module is used for calling a linked list migration interface and migrating the memory space corresponding to the target physical page in the linked list to be migrated to the high-speed memory.
Optionally, the migration module 305 includes:
An application submodule, configured to apply for a new physical page in the high-speed memory when it is detected that the memory space corresponding to the target physical page is not in the high-speed memory;
and the copy sub-module is used for copying the data content stored in the memory space corresponding to the target physical page to the memory space corresponding to the new physical page in the high-speed memory.
Optionally, in the case that it is detected that the memory space corresponding to the target physical page is not in the high-speed memory, the migration module 305 further includes:
a modification sub-module, configured to set the target physical page to an unchangeable state to be migrated;
The migration module 305 further includes:
a replacing sub-module, configured to replace, in a process page table, the recorded target physical page with the new physical page;
And the release sub-module is used for releasing the data content in the memory space corresponding to the target physical page.
In summary, the application constructs the physical addresses of the sampled memory access instructions into a memory array in a dense form and performs the physical page heat calculation based on that memory array to identify thermophysical pages, which reduces both the storage space occupied by the data participating in the heat calculation and the complexity of that data, thereby reducing processor overhead and improving overall system performance. The embodiment of the application thus realizes sampling, thermophysical page identification and migration on the basis of low processor overhead, with better overall system performance.
The embodiment of the application also provides a non-volatile readable storage medium storing one or more modules (programs); when the one or more modules are applied to a device, they may cause the device to execute instructions for each method step in the embodiment of the application.
Embodiments of the application provide one or more machine-readable media having instructions stored thereon that, when executed by one or more processors, cause an electronic device to perform a method as described in one or more of the above embodiments. In the embodiment of the application, the electronic equipment comprises various types of equipment such as terminal equipment, servers (clusters) and the like.
Embodiments of the present disclosure may be implemented as an apparatus for performing a desired configuration using any suitable hardware, firmware, software, or any combination thereof, which may include electronic devices such as terminal devices, servers (clusters), etc. Fig. 11 schematically illustrates an exemplary apparatus 1000 that may be used to implement various embodiments described in embodiments of the present application.
For one embodiment, FIG. 11 illustrates an example apparatus 1000 having one or more processors 1002, a control module (chipset) 1004 coupled to at least one of the processor(s) 1002, a Memory 1006 coupled to the control module 1004, a Non-Volatile Memory 1008 coupled to the control module 1004, one or more input/output devices 1010 coupled to the control module 1004, and a network interface 1012 coupled to the control module 1004.
The processor 1002 may include one or more single-core or multi-core processors, and the processor 1002 may include any combination of general-purpose or special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In some embodiments, the apparatus 1000 can be used as a terminal device, a server (cluster), or the like in the embodiments of the present application.
In some embodiments, the apparatus 1000 can include one or more computer-readable media (e.g., memory 1006 or NVM/storage 1008) having instructions 1014 and one or more processors 1002 in combination with the one or more computer-readable media configured to execute the instructions 1014 to implement the modules to perform the actions described in this disclosure.
For one embodiment, the control module 1004 may include any suitable interface controller to provide any suitable interface to at least one of the processor(s) 1002 and/or any suitable device or component in communication with the control module 1004.
The control module 1004 may include a memory controller module to provide an interface to the memory 1006. The memory controller modules may be hardware modules, software modules, and/or firmware modules.
Memory 1006 may be used to load and store data and/or instructions 1014 for device 1000, for example. For one embodiment, the memory 1006 may include any suitable volatile memory, such as a suitable dynamic random access memory (DRAM). In some embodiments, the memory 1006 may comprise double data rate fourth-generation synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, the control module 1004 may include one or more input/output controllers to provide an interface to the NVM/storage 1008 and the input/output device(s) 1010.
For example, NVM/storage 1008 may be used to store data and/or instructions 1014. NVM/storage 1008 may include any suitable non-volatile memory (e.g., flash memory) and/or any suitable non-volatile storage device(s) (e.g., hard disk drives (HDDs), compact disc (CD) drives, and/or digital versatile disc (DVD) drives).
NVM/storage 1008 may include storage resources that are physically part of the device on which apparatus 1000 is installed, or may be accessible by the device without necessarily being part of the device. For example, NVM/storage 1008 may be accessed over a network via input/output device(s) 1010.
Input/output device(s) 1010 may provide an interface for apparatus 1000 to communicate with any other suitable device, input/output device 1010 may include communication components, audio components, sensor components, and the like. Network interface 1012 may provide an interface for device 1000 to communicate over one or more networks, and device 1000 may communicate wirelessly with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols, such as accessing a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G, 5G, etc., or a combination thereof.
For one embodiment, at least one of the processor(s) 1002 may be packaged together with logic of one or more controllers (e.g., memory controller modules) of the control module 1004. For one embodiment, at least one of the processor(s) 1002 may be packaged together with logic of one or more controllers of the control module 1004 to form a system-in-package (SiP). For one embodiment, at least one of the processor(s) 1002 may be integrated on the same die as logic of one or more controllers of the control module 1004. For one embodiment, at least one of the processor(s) 1002 may be integrated on the same die as logic of one or more controllers of the control module 1004 to form a system on chip (SoC).
In various embodiments, the apparatus 1000 may be, but is not limited to being: a server, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.), among other terminal devices. In various embodiments, device 1000 may have more or fewer components and/or different architectures. For example, in some embodiments, the apparatus 1000 includes one or more cameras, a keyboard, a liquid crystal display (LCD) screen (including a touch screen display), a non-volatile memory port, multiple antennas, a graphics chip, an application-specific integrated circuit (ASIC), and a speaker.
The detection device may adopt a main control chip as the processor or control module; sensor data, position information and the like may be stored in the memory or NVM/storage device; a sensor group may serve as the input/output device; and the communication interface may comprise the network interface.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The embodiment of the present application provides a computer program product which, when executed by a processor, implements the processes of the above memory access optimization method embodiment and achieves the same technical effects; to avoid repetition, details are not repeated here. The computer program product may be stored in a storage medium.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the application.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or terminal device that comprises the element.
The foregoing has described in detail the methods, apparatuses, devices, electronic devices, machine-readable media and computer program products for optimizing memory accesses provided by the present application, and specific examples have been provided herein to illustrate the principles and embodiments of the present application, and the above examples are only for aiding in the understanding of the methods and core ideas of the present application; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in accordance with the ideas of the present application, the present description should not be construed as limiting the present application in view of the above.

Claims (17)

1. The method for optimizing the memory access is characterized by comprising the following steps of:
acquiring execution information of a memory access instruction;
Extracting a physical address in the execution information of the access instruction;
constructing an array element according to the physical address of the memory access instruction, and obtaining a memory array formed by the array element, wherein the array element is used for representing a physical page, and the physical page comprises the physical address;
According to the total number of times of access of the memory area corresponding to the physical page represented by the array elements in the memory array, calculating to obtain the access heat of the physical page, and determining a target physical page with the access heat greater than a heat threshold;
Migrating the memory space corresponding to the target physical page to a high-speed memory;
constructing an array element according to the physical address of the access instruction, and obtaining the memory array formed by the array element comprises:
And extracting a memory segment index of the physical address, converting the memory segment index into array elements, and obtaining a memory array formed by the array elements, wherein the memory segment index is used for indicating a memory area object to which a physical page where the physical address is located belongs.
2. The method of claim 1, wherein the obtaining the execution information of the access instruction comprises:
Controlling a sampling device to acquire sampling parameters from a register of the sampling device, the sampling parameters comprising a sampling frequency and a filtering condition; the sampling device is used for acquiring the execution information of access instructions from the instruction pipeline queue of the processor according to the sampling frequency, discarding the execution information meeting the filtering condition, and then sending the remaining execution information; the filtering condition is used for screening out access instructions that did not actually access the memory;
And receiving the remaining execution information sent by the sampling device.
3. The method of claim 2, wherein the execution information includes cache hit information and execution duration; the sampling device is configured to discard a first access instruction when the cache hit information indicates that the first access instruction hit at least one level of the three-level cache, and to discard a second access instruction when its execution duration falls within a preset duration range.
4. The method of claim 1, wherein the extracting the physical address in the execution information of the memory access instruction comprises:
and after the execution information is analyzed into a format supported by an analysis process, extracting a physical address in the execution information of the access instruction through the analysis process.
5. The method of claim 4, wherein after the obtaining the execution information of the access instruction, the method further comprises:
in the process of parsing the execution information of the access instruction into data packets in the format supported by the analysis process, discarding data packets whose parsed data volume is larger than a first data volume threshold, and discarding data packets whose parsed format is not supported by the analysis process.
6. The method of claim 1, wherein after the obtaining the execution information of the access instruction, the method further comprises:
writing the execution information of the access instruction into a first cache tile;
Providing the first cache tile to an analysis process for the analysis process to analyze the execution information in the first cache tile under the condition that the data volume in the first cache tile reaches a second data volume threshold; and under the condition that the second cache tile is not used by the analysis process, continuing to write the execution information of the access instruction into the second cache tile;
Providing the second cache tile to an analysis process for the analysis process to analyze the execution information in the second cache tile when the data amount in the second cache tile reaches the second data amount threshold; and under the condition that the first cache tile is not used by the analysis process, entering the step of writing the execution information of the access instruction into the first cache tile.
7. The method of claim 1, wherein the memory area objects in the memory are in a sparse distribution and the array elements in the memory array are in a dense distribution.
8. The method of claim 1, wherein the calculating the access heat of the physical page according to the total number of times the memory segment corresponding to the physical page represented by the array element in the memory array is accessed comprises:
in the current period, counting the accessed times of the physical pages corresponding to the array elements in the memory array;
And superposing the access times of the physical pages in the current period and the access heat of the physical pages in the previous period to obtain the access heat of the physical pages in the current period.
9. The method of claim 8, wherein the method further comprises:
after every preset number of periods, dividing the access heat of the physical page calculated in the next period by a preset coefficient, and updating the access heat of the physical page counted in that period to the result of the operation, wherein the preset coefficient is a coefficient greater than 1.
10. The method of claim 1, wherein after the determining that the access heat is greater than a target physical page of a heat threshold, the method further comprises:
adding the target physical page to a linked list to be migrated;
The transferring the memory space corresponding to the target physical page to the high-speed memory includes:
And calling a linked list migration interface to migrate the memory space corresponding to the target physical page in the linked list to be migrated to the high-speed memory.
11. The method of claim 1, wherein the migrating the memory space corresponding to the target physical page to the high-speed memory comprises:
under the condition that the memory space corresponding to the target physical page is detected not to be in the high-speed memory, applying for a new physical page in the high-speed memory;
copying the data content stored in the memory space corresponding to the target physical page to the memory space corresponding to the new physical page in the high-speed memory.
12. The method of claim 11, wherein in the event that it is detected that the memory space corresponding to the target physical page is not in the high-speed memory, the method further comprises:
setting the target physical page to an unchangeable to-be-migrated state;
after copying the data content stored in the memory space corresponding to the target physical page to the memory space corresponding to the new physical page in the high-speed memory, the method further includes:
Replacing the recorded target physical page with the new physical page in a process page table;
and releasing the data content in the memory space corresponding to the target physical page.
13. An optimizing device for memory access, comprising:
the information acquisition module is used for acquiring the execution information of the access instruction;
the extraction module is used for extracting a physical address in the execution information of the access instruction;
The construction module is used for constructing an array element according to the physical address of the memory access instruction to obtain a memory array formed by the array element, wherein the array element is used for representing a physical page, and the physical page comprises the physical address;
The calculation module is used for calculating access heat of the physical page according to the total accessed times of the memory chip areas corresponding to the physical page represented by the array elements in the memory array, and determining a target physical page with the access heat being greater than a heat threshold;
the migration module is used for migrating the memory space corresponding to the target physical page into a high-speed memory;
constructing an array element according to the physical address of the access instruction, and obtaining the memory array formed by the array element comprises:
And extracting a memory segment index of the physical address, converting the memory segment index into array elements, and obtaining a memory array formed by the array elements, wherein the memory segment index is used for indicating a memory area object to which a physical page where the physical address is located belongs.
14. An apparatus for optimizing memory access, comprising:
A processor and a sampling device;
The sampling device is used for acquiring the execution information of the memory access instruction from the instruction pipeline queue of the processor;
The processor is configured to:
Acquiring execution information of the access instruction sampled by the sampling device;
Extracting a physical address in the execution information of the access instruction;
constructing an array element according to the physical address of the memory access instruction, and obtaining a memory array formed by the array element, wherein the array element is used for representing a physical page, and the physical page comprises the physical address;
According to the total number of times of access of the memory area corresponding to the physical page represented by the array elements in the memory array, calculating to obtain the access heat of the physical page, and determining a target physical page with the access heat greater than a heat threshold;
Migrating the memory space corresponding to the target physical page to a high-speed memory;
constructing an array element according to the physical address of the access instruction, and obtaining the memory array formed by the array element comprises:
And extracting a memory segment index of the physical address, converting the memory segment index into array elements, and obtaining a memory array formed by the array elements, wherein the memory segment index is used for indicating a memory area object to which a physical page where the physical address is located belongs.
15. An electronic device, comprising:
A processor; and
A memory having executable code stored thereon that, when executed, causes the processor to perform the method of any of claims 1 to 12.
16. One or more machine readable media having executable code stored thereon that, when executed, causes a processor to perform the method of any of claims 1 to 12.
17. A computer program product, characterized in that it is executed by a processor to implement the method of any one of claims 1 to 12.