CN116521583A - Method for recording access address sequence by using on-chip cache and data processing device - Google Patents

Method for recording access address sequence by using on-chip cache and data processing device

Info

Publication number
CN116521583A
Authority
CN
China
Prior art keywords
buffer
memory
access
record
buffer area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310371988.1A
Other languages
Chinese (zh)
Inventor
卢天越
李海锋
陈明宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202310371988.1A priority Critical patent/CN116521583A/en
Publication of CN116521583A publication Critical patent/CN116521583A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0893Caches characterised by their organisation or structure
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0877Cache access modes
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention provides a method for recording an access address sequence using an on-chip cache, comprising the following steps: providing a first buffer in the memory controller and a second buffer in the last-level cache; after a processor core issues a memory access request, generating an access record from the request and staging it in the first buffer; when the number of access records staged in the first buffer reaches a staging threshold, writing the access records of the first buffer into the second buffer; a program running on the processor obtains and processes the access records by reading the second buffer directly. The invention also provides a data processing apparatus that runs programs using this method of recording an address sequence with an on-chip cache, the apparatus comprising: a memory; a memory controller containing an access record and control unit and the first buffer; and a processor whose last-level cache contains the second buffer.

Description

Method for recording access address sequence by using on-chip cache and data processing device
Technical Field
The invention belongs to the technical field of computer hardware, and in particular relates to a method and a data processing apparatus for recording an access address sequence using an on-chip cache.
Background
In recent years, network speed and bandwidth have been approaching those of memory, and network-based remote memory systems have received widespread attention; the virtual memory subsystem of the operating system plays an important role in such remote memory systems. On the critical path, the virtual memory subsystem is responsible for reading, prefetching and replacement between remote media across the network, slow devices such as disks (SSD, HDD), and the local page cache. This process involves multiple memory accesses and I/O operations. Fig. 1 shows a typical server architecture, in which the last-level cache (LLC), the memory controller and the PCIe controller are interconnected by an on-chip bus. Whenever an LLC miss occurs, the memory controller receives a data request packet carrying an address and then initiates a read/write request to memory over an off-chip memory bus (e.g., DDR4). After receiving an I/O request sent by a processor core over the on-chip bus, the PCIe controller initiates a read/write request to the I/O device over the PCIe bus, and the device issues a direct memory access (DMA) request to memory through the memory controller. To let the processor access I/O data faster and reduce data access latency, after receiving the data returned by the I/O device the PCIe controller can also write it directly into the cache using the on-chip bus protocol (e.g., Data Direct I/O technology, Alian M, Yuan Y, Zhang J, et al.; "Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks", Farshin A, Roozbeh A, Maguire Jr G Q, et al.), so that the processor core can read the data directly from the cache.
However, the virtual memory management mechanism of existing operating systems has a serious limitation that makes the prefetching and replacement of memory pages inaccurate. The root cause of this limitation is that the operating system cannot perceive in real time which page addresses an application is accessing. In existing architectures, an operating system has two ways to perceive where an application is running:
(1) Determining which memory pages the application has accessed via the access bit of the page table entry (PTE). Each page table entry of the application contains page-status flag bits, one of which is the access bit; every time a page is accessed, the access bit of the corresponding PTE is set to 1. By periodically (every interval T) clearing and then checking the application's access bits, the operating system can determine whether the application accessed the corresponding page within time T. This approach has little impact on application performance and requires no modification of existing hardware.
However, this method suffers from coarse granularity, high latency and high overhead. 1) Coarse granularity: the operating system only learns whether the application touched a page during the interval, not how many times the page was accessed, which prevents the operating system from making finer-grained decisions. 2) High latency: traversing all page table entries of a big-data application takes a long time; for example, if checking one access bit takes 20 nanoseconds on average, checking the access bits of all page table entries of a 10 GB application takes roughly 52 milliseconds. 3) High overhead: monitoring the access bits in real time requires periodically clearing and re-checking every page table entry, and obtaining the application's access pattern more promptly requires raising the checking frequency, which increases the cost further.
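The scan-cost figure quoted above can be sanity-checked with a short calculation (a sketch; it assumes 4 KB pages, a page size the text does not state explicitly):

```python
# Estimate the time to scan the access bits of every page table entry
# for a 10 GB application, assuming 4 KB pages (an assumption; the text
# does not give the page size) and 20 ns per access-bit check.
PAGE_SIZE = 4 * 1024          # bytes per page (assumed)
APP_SIZE = 10 * 1024**3       # 10 GB working set
CHECK_NS = 20                 # nanoseconds per access-bit check (from the text)

num_pages = APP_SIZE // PAGE_SIZE            # number of PTEs to scan
scan_ms = num_pages * CHECK_NS / 1_000_000   # total scan time in milliseconds

print(num_pages)   # 2621440 pages
print(scan_ms)     # about 52.4 ms, matching the ~52 ms quoted above
```

This confirms the order of magnitude claimed in the text: a full access-bit sweep costs tens of milliseconds per pass.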
(2) Determining where the application is running via page faults. By modifying status bits in a page's corresponding PTE, the application triggers a page-fault exception when it accesses that page. In the page-fault handler, the operating system obtains the faulting process number and address, and thereby the application's real-time running position. However, this method greatly degrades application performance: each page fault causes a context switch and unnecessary kernel overhead, so it is unsuitable for letting the operating system perceive application access positions.
Disclosure of Invention
In view of the above problems, the present invention proposes a method for recording an address sequence using an on-chip cache, comprising: providing a first buffer in the memory controller and a second buffer in the last-level cache of the processor; when any processor core of the processor issues a memory access request, generating an access record from the request and staging it in the first buffer; when the number of access records staged in the first buffer reaches a staging threshold, writing the access records of the first buffer into the second buffer over the on-chip bus; a program running on the processor obtains and processes the access records by reading the second buffer directly.
In the method for recording an access address sequence using an on-chip cache, when the access records in the first buffer are written into the second buffer, the value of the second write pointer of the second buffer is updated to match the value of the first write pointer of the first buffer.
The program determines whether new access records have arrived in the second buffer by monitoring the write pointer and read pointer of the second buffer; after detecting and reading the new records, it updates the read pointer of the second buffer.
In the method for recording an access address sequence using an on-chip cache, the first buffer and the second buffer are ring buffers.
In the method for recording an access address sequence using an on-chip cache, the second buffer is configured by writing reservation information to a configuration interface of the memory controller; the reservation information comprises a buffer head address and a buffer length.
In the method for recording an access address sequence using an on-chip cache, each access record comprises the request address, the read/write type and the access time of the memory request.
In the method for recording an access address sequence using an on-chip cache, when the access records in the first buffer are written into the second buffer, the records are filtered according to a preset filtering rule.
The present invention also proposes a data processing apparatus that runs at least one program using the method of recording an address sequence with an on-chip cache described above, the data processing apparatus comprising: a memory controller containing an access record and control unit and the first buffer, the access record and control unit generating access records from the memory requests issued by the processor, and the first buffer caching those records; and a processor having a last-level cache and at least one processor core, the last-level cache containing the second buffer. The program obtains and processes the access records by reading the second buffer directly; the first buffer and the second buffer are ring buffers; and the program configures the second buffer through the configuration interface of the memory controller.
Drawings
Fig. 1 is a schematic diagram of a prior art server architecture hierarchy.
Fig. 2 is a schematic diagram of a software and hardware architecture of the present invention.
FIG. 3 is a diagram of a bus packet and trace format according to the present invention.
FIG. 4 is a schematic diagram of the reverse page table structure of the present invention.
Fig. 5 is an overall structural view of the present invention.
Fig. 6 is a software configuration flow chart of the present invention.
Fig. 7 is a software processing Trace flow chart of the present invention.
FIG. 8 is a flow chart of the operation of the memory access control unit of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples are intended only to illustrate the invention, not to limit its scope.
The invention aims to solve the problem that the operating system cannot perceive an application's real-time memory access positions at low cost, and proposes a mechanism that adds an access recording unit to the memory controller and delivers real-time access information to the operating system through the cache.
While an application runs, the CPU issues memory access instructions and obtains the required data by querying the cache hierarchy level by level and then memory (DRAM). Both the cache hierarchy and the memory controller can observe the application's access positions, so hardware logic can be added at the appropriate point to extract access addresses, deposit the address information into a designated memory region, and let the operating system monitor that region in real time, thereby allowing the operating system to perceive the application's access positions as they happen. Two problems must be solved to achieve this: 1) at which level of the architecture to add hardware while keeping hardware changes minimal; and 2) since the volume of access information is large, transferring it through memory would consume substantial memory bandwidth and pollute the cache.
1) At which level of the architecture to add hardware. Adding the logic that captures access requests inside the per-core cache hierarchy would require modifying every core, which is too invasive and too difficult. The memory controller, by contrast, is a location where all access requests can be observed and where hardware logic is comparatively easy to modify. The invention therefore adds the access record and control unit to the memory controller.
2) How to reduce memory bandwidth consumption and cache pollution. The on-chip bus protocol provides operations that write the cache directly. The invention uses a small, separately partitioned region of the last-level cache to transfer the access information. With a cache partitioning technique (Khang Nguyen, 2016, Introduction to Cache Allocation Technology), this region can be kept from being evicted by other accesses, so no memory writes caused by cache write-back occur. Data transfer therefore consumes no memory bandwidth, transfer and access speed are improved, and cache pollution is confined to a small portion of the last-level cache.
Referring to fig. 2, the method for recording the address sequence by using the on-chip cache according to the present invention includes:
In step S1, an access record and control unit 1001 is added to the memory controller 1000. It records the request address, read/write type, access time, etc. of each memory request, generates a record (Trace), attaches a sequence number to each Trace, and stages the Traces containing real-time access address information in a first buffer 1002 inside the memory controller.
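Step S1 can be modeled with a short sketch (an illustrative simplification, not the hardware implementation; the field names and the staging capacity are assumptions):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Trace:
    seq: int        # running sequence number attached to each record
    addr: int       # request address
    is_write: bool  # read/write type
    time: int       # access time

class RecordUnit:
    """Minimal model of the access record and control unit 1001:
    generate one Trace per memory request and stage it in the first
    buffer 1002 until the staging threshold is reached."""
    def __init__(self, capacity=16):
        self.staging = deque()     # first buffer 1002 (capacity is assumed)
        self.capacity = capacity
        self.next_seq = 0

    def on_request(self, addr, is_write, time):
        """Record one request; return True when a batch flush is due."""
        self.staging.append(Trace(self.next_seq, addr, is_write, time))
        self.next_seq += 1
        return len(self.staging) >= self.capacity

unit = RecordUnit(capacity=2)
assert unit.on_request(0x1000, False, 0) is False  # below threshold
assert unit.on_request(0x2000, True, 1) is True    # threshold reached, flush
```

The `True` return value stands in for the condition that triggers the batch write of step S3.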
In step S2, a portion of the last-level cache 2000 is partitioned off with the cache partitioning technique to hold the second buffer 2003 that stores the Traces, the second write pointer 2004 and the read pointer 2005, ensuring that 2003, 2004 and 2005 remain resident in the cache.
In step S3, when the Traces staged in the first buffer 1002 accumulate to a certain amount (e.g., half full, full, or after a certain time has elapsed), the access record and control unit 1001 reuses the on-chip bus cache-write command to write the Trace data of the first buffer 1002 into the second buffer 2003 in the cache according to the value of the first write pointer 1006 in the memory controller 1000, and then updates the second write pointer 2004 to match the first write pointer 1006. No cache miss or memory read/write request is generated while transferring Trace data, so no memory bandwidth is consumed. Transferring Traces in batches raises transfer efficiency, lowers the update frequency of the write pointer, and reduces bus bandwidth consumption.
In step S4, the software (program) 3000 running on a processor core reads the second write pointer 2004 and the read pointer 2005 from the cache 2000, stages both in processor registers, reads the Traces lying between the two pointers, and after the Trace data has been read updates 2005 to equal the staged value of 2004. The program detects overflow by checking whether the sequence numbers in the records are consecutive. The program may then process, move or discard the Traces directly. Reading Traces this way generates no cache misses, which keeps the software fast enough to process Traces in time and to notice overflows.
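The software side of step S4 can be sketched as follows (a simplified model: pointers are monotonically increasing indices taken modulo the buffer size, and Traces are plain dictionaries; both are assumptions for illustration):

```python
def read_traces(second_buffer, write_ptr, read_ptr, last_seq):
    """Snapshot both pointers, read every Trace between them, advance
    the read pointer, and flag an overflow whenever the sequence
    numbers stop being consecutive (dropped records)."""
    size = len(second_buffer)
    traces, overflow = [], False
    while read_ptr != write_ptr:
        t = second_buffer[read_ptr % size]
        if last_seq is not None and t["seq"] != last_seq + 1:
            overflow = True          # a gap in sequence numbers => records lost
        last_seq = t["seq"]
        traces.append(t)
        read_ptr += 1
    return traces, read_ptr, last_seq, overflow

# Two new Traces are waiting in a 4-slot second buffer.
buf = [{"seq": 0}, {"seq": 1}, None, None]
traces, rp, last, ovf = read_traces(buf, write_ptr=2, read_ptr=0, last_seq=None)
```

After the call the read pointer equals the snapshotted write pointer and no overflow is reported, mirroring the update of pointer 2005 described above.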
In addition to the above steps, the access record and control unit 1001 may pre-filter the Traces staged in the first buffer 1002 according to preset filtering rules (e.g., by address, by read/write type, or by the preprocessing results of configurable hardware), so that only a subset of the Traces is transferred to the second buffer 2003, reducing the total amount of Trace data to be transferred.
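A pre-filter of the kind described can be expressed as a small predicate builder (the concrete rules below, an address window and a writes-only flag, are illustrative; the patent leaves the rule set configurable):

```python
def make_filter(addr_lo=None, addr_hi=None, writes_only=False):
    """Build a Trace pre-filter: keep a Trace only if it passes every
    configured rule (address range, read/write type)."""
    def keep(trace):
        if writes_only and not trace["is_write"]:
            return False
        if addr_lo is not None and trace["addr"] < addr_lo:
            return False
        if addr_hi is not None and trace["addr"] >= addr_hi:
            return False
        return True
    return keep

# Keep only write accesses in [0x1000, 0x2000).
keep = make_filter(addr_lo=0x1000, addr_hi=0x2000, writes_only=True)
```

Applying `keep` before the batch transfer of step S3 reduces the data volume exactly as the paragraph above describes.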
The overall structure of the invention is shown in fig. 5 and comprises a hardware part and a software part: the access recording unit in the memory controller (hardware) and the ring buffer held in the cache.
Hardware part:
the hardware processing flow comprises the following steps:
As shown in fig. 6, the program writes configuration information through the memory controller's configuration interface to set the head address and length of the second buffer. As shown in fig. 2, after receiving the information transmitted through the configuration interface, the memory controller 1000 configures the head address and length of the second buffer 2003.
For each new memory access request, the access record and control unit generates a new Trace and stages it in the first buffer of the memory controller. The first buffer is typically small and is not directly accessible to the processor. Once the first buffer is full, the access record and control unit reuses the on-chip bus cache-write command to write the Traces in a batch into the second buffer in the cache, and updates the second write pointer.
Specific implementation examples of hardware:
The access recording unit contains a 128-byte ring staging buffer, which is used to stage the generated Traces.
On every last-level cache miss, the access record and control unit generates one 6-byte record and stages it in its ring staging buffer (128 bytes in size). Once the Traces in the ring buffer reach 64 bytes, the memory controller generates one bus packet; as shown in fig. 3, each packet carries 64 bytes of Trace information together with the target physical address. On receiving the packet, the last-level cache controller writes the Trace information into the cache line corresponding to the target physical address. Each time a fixed length of Trace data has been transmitted, the memory controller generates one additional bus packet that updates the write pointer in the cache.
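The packing scheme above can be sketched as follows. The patent fixes the record size (6 bytes) and the flush threshold (64 bytes) but not the field encoding, so the layout here, a 7-bit sequence number, a 1-bit read/write flag and a 40-bit physical page number, is an assumption for illustration:

```python
def pack_record(seq, addr, is_write):
    """Pack one access record into 6 bytes (assumed layout: the patent
    specifies the 6-byte size only): byte 0 holds a 7-bit sequence
    number and a read/write bit; bytes 1..5 hold the 40-bit page number."""
    assert 0 <= seq < 128
    b0 = (seq << 1) | (1 if is_write else 0)
    return bytes([b0]) + (addr >> 12).to_bytes(5, "little")

# Stage records until at least 64 bytes have accumulated, then emit one
# 64-byte bus packet (11 records * 6 bytes = 66 bytes >= 64).
staged = bytearray()
for i in range(11):
    staged += pack_record(i, 0x1000 * i, i % 2 == 0)
packet, staged = bytes(staged[:64]), staged[64:]
```

Note that 64 is not a multiple of 6, so in this sketch a partial record stays behind in the staging buffer; the hardware's exact handling of that boundary is not specified in the text.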
Software part:
Configuration flow: as shown in fig. 6, the program configures the second buffer that holds the Traces through the configuration registers of the memory controller. In an embodiment of the present invention, the configuration registers comprise a buffer head address 3007 and a buffer length 3008 (shown in fig. 2).
Trace processing flow: as shown in fig. 7, the program determines whether new Traces have been produced by checking the write pointer 2004 and read pointer 2005 of the second buffer; if so, it reads the Traces in a batch and updates the read pointer. The program may poll continuously to guarantee processing speed, or use periodic checking or an interrupt mechanism.
One embodiment of processing Trace:
the program portion includes a filter table and a reverse page table, wherein the thread in which the filter table resides often occupies 1 processor core. In this embodiment, the trace passed by the hardware is processed using the filter table and the reverse page table.
The filter table is a linear array recording the number of accesses to each physical page. The reverse page table is also a linear array recording, for each physical page, the corresponding virtual address and process number; the structure of each reverse page table entry is shown in fig. 4 and comprises a 16-bit process number and a 40-bit virtual page number.
The filter-table thread continuously checks the read and write pointers of the ring buffer in the cache; once it detects that the write pointer has changed, it processes the Traces between the read pointer and the write pointer. The filter table extracts the physical page number of each Trace; once a physical page number has been accessed N times (N = 8), the page is marked as a hot page, the corresponding reverse page table entry is looked up to obtain the process number and virtual page number, and these are provided to the operating system.
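The hot-page detection above can be sketched as follows. For brevity the two linear arrays are modeled with dictionaries, 4 KB pages are assumed, and the reverse page table is simplified to one (process number, virtual page number) pair per physical page; all of these are illustrative choices, not the patent's data layout:

```python
from collections import defaultdict

HOT_THRESHOLD = 8  # N = 8 in this embodiment

def process_traces(traces, filter_table, reverse_page_table, hot_pages):
    """Filter-table thread body: count accesses per physical page; once
    a page has been seen N times, look up its process number and
    virtual page number in the reverse page table and report it."""
    for t in traces:
        ppn = t["addr"] >> 12                    # physical page number (4 KB pages assumed)
        filter_table[ppn] += 1
        if filter_table[ppn] == HOT_THRESHOLD:   # page just became hot
            pid, vpn = reverse_page_table[ppn]
            hot_pages.append((pid, vpn))         # hand to the operating system

filter_table = defaultdict(int)
reverse_page_table = {0x1234: (42, 0xABCD)}      # illustrative entry
hot_pages = []
traces = [{"addr": 0x1234 << 12}] * HOT_THRESHOLD
process_traces(traces, filter_table, reverse_page_table, hot_pages)
```

After eight accesses to the same physical page, one (process number, virtual page number) pair is reported, exactly once.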
One embodiment using Trace:
After the program has obtained the complete access information, it can learn from this information to perform operations such as prefetching and replacement, as shown in fig. 2.
This lightweight method for letting the operating system perceive an application's real-time access positions reuses the on-chip bus cache-write command, requires only small changes to the memory controller, consumes no memory bandwidth, does not pollute the cache, and introduces no new cache-coherence overhead; its negative impact on the operating system and other applications is almost negligible, while the operating system and applications can perceive access information in real time.
The above embodiments serve only to illustrate the present invention, not to limit it. Those of ordinary skill in the relevant art may make various changes and modifications without departing from the spirit and scope of the invention; all equivalent technical solutions therefore also fall within the scope of the invention, which is defined by the claims.

Claims (10)

1. A method for recording an address sequence using an on-chip cache, comprising:
providing a first buffer in the memory controller and a second buffer in the last-level cache of the processor;
when any processor core of the processor issues a memory access request, generating an access record from the request and staging it in the first buffer;
when the number of access records staged in the first buffer reaches a staging threshold, writing the access records of the first buffer into the second buffer over the on-chip bus;
a program running on the processor obtaining and processing the access records by reading the second buffer directly.
2. The method of claim 1, wherein when the access records in the first buffer are written into the second buffer, the value of the second write pointer of the second buffer is updated to match the value of the first write pointer of the first buffer.
3. The method of claim 1, wherein the program determines whether new access records have arrived in the second buffer by monitoring the write pointer and the read pointer of the second buffer, and, after determining that new access records have arrived and performing a read operation on them, updates the read pointer of the second buffer.
4. The method of claim 1, wherein the first buffer and the second buffer are ring buffers.
5. The method of claim 1, wherein the second buffer is configured by writing reservation information through a configuration interface of the memory controller, the reservation information comprising a buffer head address and a buffer length.
6. The method of claim 1, wherein the memory record includes a request address, a read-write type, and an access time of the memory request.
7. The method of claim 1, wherein when the access records in the first buffer are written into the second buffer, the access records are filtered according to a preset filtering rule.
8. A data processing apparatus that runs at least one program using the method of recording an address sequence with an on-chip cache of any one of claims 1 to 7, the data processing apparatus comprising:
a memory controller containing an access record and control unit and a first buffer, the access record and control unit generating access records from the memory requests issued by the processor, and the first buffer caching the access records;
a processor having a last-level cache and at least one processor core, the last-level cache containing a second buffer, the second buffer caching the access records sent from the first buffer, and the program obtaining and processing the access records by reading the second buffer directly.
9. The data processing apparatus of claim 8 wherein the first buffer and the second buffer are each ring buffers.
10. The data processing apparatus of claim 8 wherein the program configures the second buffer via a memory controller configuration interface.
CN202310371988.1A 2023-04-10 2023-04-10 Method for recording access address sequence by using on-chip cache and data processing device Pending CN116521583A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310371988.1A CN116521583A (en) 2023-04-10 2023-04-10 Method for recording access address sequence by using on-chip cache and data processing device


Publications (1)

Publication Number Publication Date
CN116521583A true CN116521583A (en) 2023-08-01

Family

ID=87407398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310371988.1A Pending CN116521583A (en) 2023-04-10 2023-04-10 Method for recording access address sequence by using on-chip cache and data processing device

Country Status (1)

Country Link
CN (1) CN116521583A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination