CN113254321A - Method and system for evaluating memory access performance of processor - Google Patents

Info

Publication number
CN113254321A
Authority
CN
China
Prior art keywords
memory
data
access
length
preset length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110633327.2A
Other languages
Chinese (zh)
Other versions
CN113254321B (en)
Inventor
李腾
叶飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Hengwei Intelligent Technology Co ltd
Original Assignee
Embedway Technologies Shanghai Corp
Application filed by Embedway Technologies Shanghai Corp
Priority to CN202110633327.2A
Publication of CN113254321A
Application granted
Publication of CN113254321B
Current legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/34: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3419: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/3003: Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3037: Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a memory, e.g. virtual memory, cache

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a method and a system for evaluating memory access performance of a processor. Memory data of a first preset length is obtained; the memory data of the first preset length is segmented, and each segment of data is accessed in sequence to obtain access data, the access being a read operation or a write operation; and memory read operation performance or memory write operation performance is determined based on the access data. Because the scheme performs only read operations, or only write operations, on the data in the memory segment by segment, the read operation performance and the write operation performance of the memory can be tested separately. This avoids the limitation of the prior art, in which only the combined read-write performance of the memory can be obtained, so that the performance of each individual operation is determined and a main performance index of the whole machine or the CPU is thereby determined.

Description

Method and system for evaluating memory access performance of processor
Technical Field
The present application relates to the field of electronic information technologies, and in particular, to a method and a system for evaluating memory access performance of a processor.
Background
On general-purpose CPU platforms such as ARM, memory read-write throughput (MB/s) is one of the main performance indexes of the whole machine or the CPU.
The memory evaluation tool and algorithm currently in wide use in the high-performance CPU field is the STREAM scheme and software tool proposed by the computer science school of the University of Virginia. Its idea is to use C or another high-level programming language to read and write a large section of contiguous virtual memory (generally over 512 MB) multiple times, obtain the time of each complete read-write pass, take the time of the fastest pass, and calculate the throughput value from the data volume and that execution time.
However, the above scheme tests the combined performance of a mixed read-write workload and cannot reflect the memory read performance and the memory write performance separately.
Disclosure of Invention
In view of the above, the present application provides a method and a system for evaluating memory access performance of a processor, which has the following specific scheme:
a method of evaluating processor memory access performance, comprising:
obtaining memory data, wherein the memory data is a first preset length;
segmenting the memory data with the first preset length, and sequentially accessing each segment of data to obtain access data, wherein the access is read operation or write operation;
determining a memory read operation performance or a memory write operation performance based on the access data.
Further, the segmenting the memory data with the first preset length and sequentially accessing each segment of data to obtain access data includes:
dividing the memory data with the first preset length into a preset number of memory segments, wherein the length of each memory segment is a second preset length, and the product of the second preset length and the preset number is equal to the first preset length;
and sequentially accessing each memory segment data to obtain the access data of each memory segment data.
Further, the sequentially accessing each piece of data includes:
each core in the multi-core processor accesses one segment of data in parallel, and different cores in the multi-core processor access different segments of data.
Further, the segmenting the memory data with the first preset length and sequentially accessing each segment of data to obtain access data includes:
starting access from the byte at the first address in the memory data, wherein the length of the currently accessed data is a second preset length;
starting access from a byte at a second address in the memory data, wherein the length of the currently accessed data is a second preset length until the access of the memory data with the first preset length is completed, and the difference value between the second address and the first address is the second preset length.
Further, the determining the memory read operation performance or the memory write operation performance based on the access data includes:
determining the shortest time length, the longest time length, the average time length and/or the read operation performance measurement value consumed by executing the read operation on the memory data with the first preset length based on the read operation data;
and determining the shortest time length, the longest time length, the average time length and/or the write operation performance measurement value consumed by performing the write operation on the memory data with the first preset length based on the write operation data.
Further, the sequentially accessing each piece of data includes:
each piece of data is accessed in turn by assembly language matched to the multi-core processor.
A system for evaluating memory access performance of a processor, comprising:
an obtaining unit, configured to obtain memory data, wherein the memory data has a first preset length;
an access unit, configured to segment the memory data of the first preset length and access each segment of data in sequence to obtain access data, wherein the access is a read operation or a write operation;
a determining unit, configured to determine a memory read operation performance or a memory write operation performance based on the access data.
Further, the access unit is configured to:
dividing the memory data with the first preset length into a preset number of memory segments, wherein the length of each memory segment is a second preset length, and the product of the second preset length and the preset number is equal to the first preset length; and sequentially accessing each memory segment data to obtain the access data of each memory segment data.
Further, the accessing unit accesses each piece of data in sequence, including:
the access unit accesses a segment of data in parallel through each core in the multi-core processor, and different cores in the multi-core processor access different segments of data.
Further, a storage medium for storing at least one set of instructions;
the set of instructions is for being called and performing at least the method of evaluating processor memory access performance as described in any of the above.
According to the above technical scheme, the method and the system for evaluating memory access performance of a processor disclosed in the present application obtain memory data of a first preset length, segment the memory data of the first preset length, access each segment of data in sequence to obtain access data, the access being a read operation or a write operation, and determine the memory read operation performance or the memory write operation performance based on the access data. Because the scheme performs only read operations, or only write operations, on the data in the memory segment by segment, the read operation performance and the write operation performance of the memory can be tested separately. This avoids the limitation of the prior art, in which only the combined read-write performance of the memory can be obtained, so that the performance of each individual operation is determined and a main performance index of the whole machine or the CPU is thereby determined.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart of a method for evaluating memory access performance of a processor according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a method for evaluating memory access performance of a processor according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a method for evaluating memory access performance of a processor according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a system for evaluating memory access performance of a processor according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The present application discloses a method for evaluating memory access performance of a processor, a flowchart of which is shown in fig. 1, and the method includes:
step S11, obtaining memory data, wherein the memory data has a first preset length;
step S12, segmenting the memory data with the first preset length, and sequentially accessing each segment of data to obtain access data, wherein the access is read operation or write operation;
step S13, determining the memory read operation performance or the memory write operation performance based on the access data.
ARMv8-A is a reduced instruction set processor architecture promoted by ARM, a company in the UK, and is widely used in fields such as high-performance consumer electronics and servers. As on other CPU platforms, such as x86 or MIPS, memory read-write throughput is one of the main performance indexes of the whole machine or the CPU.
The memory evaluation tool and algorithm currently in wide use in the high-performance CPU field is the STREAM scheme and software tool proposed by the computer science school of the University of Virginia. The current version of the STREAM software tool is 5.10, so the tool is referred to here as STREAM 5.10. The main idea of the STREAM 5.10 scheme is to use C or another high-level programming language, such as Fortran, to read and write a large section of contiguous virtual memory multiple times, obtain the time of each complete read-write pass, take the time of the fastest pass, and calculate the throughput value from the data volume and that execution time.
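The throughput calculation described above can be sketched as follows. This is a minimal illustration in Python and is not taken from the STREAM source; the function name best_throughput_mb_s, the convention that 1 MB equals 10^6 bytes, and the sample timings are all assumptions made for the example.

```python
def best_throughput_mb_s(total_bytes, pass_seconds):
    """Return throughput in MB/s based on the fastest complete pass.

    total_bytes:  data volume moved in one complete pass.
    pass_seconds: measured duration of each complete pass, in seconds.
    Assumes 1 MB = 10**6 bytes (an assumption of this sketch).
    """
    if not pass_seconds:
        raise ValueError("at least one timed pass is required")
    fastest = min(pass_seconds)
    if fastest <= 0:
        raise ValueError("pass durations must be positive")
    return total_bytes / fastest / 1e6


# Hypothetical timings for three passes over a 512 MB region.
example = best_throughput_mb_s(512 * 10**6, [0.25, 0.20, 0.22])
```

Taking the fastest pass rather than the mean reduces the influence of passes that were disturbed by scheduling or interrupts.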
This scheme can evaluate the combined performance of some typical mixed read-write applications, but it does not reflect the memory read performance and the memory write performance separately. If the CPU issues both read operations and write operations to a memory or memory controller at the same time, neither the reads nor the writes can reach their own maximum throughput bandwidth. In addition, in some systems the read and write loads are unbalanced; for example, on a general-purpose processor platform for non-network applications, memory read operations are generally considered more frequent than memory write operations, which makes it even harder for the above scheme to show the memory read performance and the memory write performance clearly.
To solve this problem, the present scheme executes read operations and write operations on the memory separately in the CPU, so as to obtain the memory read operation performance and the memory write operation performance respectively.
A section of memory data of known length, for example 1024 bytes, is obtained in advance, and the pre-stored memory data is accessed segment by segment to obtain access data, wherein the access may be a read operation or a write operation.
If the access is a read operation, the pre-stored memory data is segmented and each segment of data is then read in sequence to obtain read operation data for each segment, such as the duration of reading each segment of data and/or the duration of reading the complete pre-acquired memory data.
The memory data may be segmented as follows: the memory data of the first preset length is divided into a preset number of memory segments, wherein the length of each memory segment is a second preset length, and the product of the second preset length and the preset number is equal to the first preset length. Dividing the memory data of the first preset length evenly into the preset number of memory segments makes every memory segment the same length, so that the read operation data obtained by reading the equal-length memory segments can be compared. If the segments were not all the same length, the read durations obtained for segments of different lengths would differ even when the same core of the CPU performed the reads, so the durations would have no basis for comparison; durations obtained by different cores of the CPU would likewise not be comparable.
For example, if the pre-acquired memory data is 1024 bytes long and is divided into segments of 64 bytes each, the 1024 bytes of memory data are divided into 16 segments, that is, read in 16 reads of 64 bytes each. The read operation data may then be the time taken to read each 64-byte segment: 16 durations are obtained, and these 16 durations are determined as the read operation data.
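The bookkeeping of this example can be sketched in Python as below. The sketch only models how one duration per segment is collected; it is not the assembly-level access the scheme relies on, and Python-level timings say nothing about real memory throughput. The function name time_segment_reads and the use of a bytearray as the memory data are assumptions of the example.

```python
import time


def time_segment_reads(buffer, segment_len):
    """Read `buffer` segment by segment; return one duration (ns) per segment."""
    if segment_len <= 0 or len(buffer) % segment_len != 0:
        raise ValueError("buffer length must be a positive multiple of segment_len")
    view = memoryview(buffer)
    durations_ns = []
    for offset in range(0, len(buffer), segment_len):
        start = time.perf_counter_ns()
        _ = bytes(view[offset:offset + segment_len])  # copying the slice reads the segment
        durations_ns.append(time.perf_counter_ns() - start)
    return durations_ns


memory_data = bytearray(1024)                          # first preset length: 1024 bytes
read_durations = time_segment_reads(memory_data, 64)   # second preset length: 64 bytes
```

Rejecting a length that is not a multiple of the segment length keeps every segment the same size, which is the condition the text gives for the durations to be comparable.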
The memory read operation performance is determined based on the read operation data, that is, based on the read duration obtained for each segment of data. The memory read operation performance may be the shortest duration consumed by a read operation, the longest duration consumed by a read operation, the average duration consumed by a read operation, and/or a read operation performance measurement value.
Continuing the above example, 16 durations are obtained, each being the duration of reading one 64-byte segment. The maximum of the 16 durations is the longest duration consumed by a read operation; the minimum of the 16 durations is the shortest duration consumed by a read operation; the mean of the 16 durations is the average duration of a read operation; and the read performance measurement value may be obtained by dividing the data volume of one segment by the shortest duration.
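The four quantities named above can be derived from the list of durations as in the following sketch. The function name summarize_durations, the dictionary keys, the bytes-per-nanosecond unit, and the sample durations are assumptions chosen for illustration.

```python
def summarize_durations(durations_ns, segment_bytes):
    """Return shortest, longest and average duration plus a performance measure.

    The performance measure is the segment size divided by the shortest
    duration, expressed here in bytes per nanosecond (unit is an assumption).
    """
    if not durations_ns:
        raise ValueError("no durations to summarize")
    shortest = min(durations_ns)
    return {
        "shortest_ns": shortest,
        "longest_ns": max(durations_ns),
        "average_ns": sum(durations_ns) / len(durations_ns),
        # Guard against a zero reading from a coarse timer.
        "bytes_per_ns": segment_bytes / shortest if shortest > 0 else None,
    }


# Hypothetical durations for 16 reads of 64 bytes each.
sample = [120, 100, 110, 130] * 4
summary = summarize_durations(sample, 64)
```

The same function applies unchanged to write durations, with the segment size set to the write segment length.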
If the access is a write operation, the pre-stored memory data is segmented and each segment of data is then written in sequence to obtain write operation data for each segment, such as the duration of writing each segment of data and/or the duration of writing the complete pre-acquired memory data.
The memory data may be segmented as follows: the memory data of the first preset length is divided into a preset number of memory segments, wherein the length of each memory segment is a second preset length, and the product of the second preset length and the preset number is equal to the first preset length. Dividing the memory data of the first preset length evenly into the preset number of memory segments makes every memory segment the same length, so that the write operation data obtained by writing the equal-length memory segments can be compared.
For example, if the pre-acquired memory data is 256 bytes long and is divided into segments of 8 bytes each, the 256 bytes of memory data are divided into 32 segments, that is, written to memory in 32 writes of 8 bytes each. The write operation data may then be the time taken to write each 8-byte segment: 32 durations are obtained, and these 32 durations are determined as the write operation data.
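The write-side example can be sketched in the same way. As before, this Python code only models the collection of one duration per written segment and is not a substitute for the assembly-level store the scheme uses; the function name time_segment_writes and the fill value 0xA5 are assumptions of the example.

```python
import time


def time_segment_writes(buffer, segment_len, fill=0xA5):
    """Write `fill` into `buffer` segment by segment; return one duration (ns) per segment."""
    if segment_len <= 0 or len(buffer) % segment_len != 0:
        raise ValueError("buffer length must be a positive multiple of segment_len")
    view = memoryview(buffer)
    pattern = bytes([fill]) * segment_len   # data to store in every segment
    durations_ns = []
    for offset in range(0, len(buffer), segment_len):
        start = time.perf_counter_ns()
        view[offset:offset + segment_len] = pattern   # write only, no read of old data
        durations_ns.append(time.perf_counter_ns() - start)
    return durations_ns


memory_data = bytearray(256)                           # first preset length: 256 bytes
write_durations = time_segment_writes(memory_data, 8)  # second preset length: 8 bytes
```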
The memory write operation performance is determined based on the write operation data, that is, based on the write duration obtained for each segment of data. The memory write operation performance may be the shortest duration consumed by a write operation, the longest duration consumed by a write operation, the average duration consumed by a write operation, and/or a write operation performance measurement value.
Continuing the above example, 32 durations are obtained, each being the duration of writing one 8-byte segment to memory. The maximum of the 32 durations is the longest duration consumed by a write operation; the minimum of the 32 durations is the shortest duration consumed by a write operation; the mean of the 32 durations is the average duration of a write operation; and the write performance measurement value may be obtained by dividing the data volume of one segment by the shortest duration.
In this scheme, memory read operations and memory write operations are executed separately, so that the throughput value measured when a core of the CPU performs only reads or only writes is closer to the hardware limit and more accurate evaluation data is provided; the resulting performance data can even be used to estimate the performance of network applications such as DPDK.
The method for evaluating memory access performance of a processor disclosed in this embodiment obtains memory data of a first preset length, segments the memory data of the first preset length, accesses each segment of data in sequence to obtain access data, the access being a read operation or a write operation, and determines the memory read operation performance or the memory write operation performance based on the access data. Because the scheme performs only read operations, or only write operations, on the data in the memory segment by segment, the read operation performance and the write operation performance of the memory can be tested separately. This avoids the limitation of the prior art, in which only the combined read-write performance of the memory can be obtained, so that the performance of each individual operation is determined and a main performance index of the whole machine or the CPU is thereby determined.
The embodiment discloses a method for evaluating memory access performance of a processor, a flowchart of which is shown in fig. 2, and the method comprises the following steps:
step S21, obtaining memory data, wherein the memory data has a first preset length;
step S22, segmenting memory data with a first preset length, accessing a segment of data in parallel through each kernel in the multi-core processor, accessing different segments of data through different kernels in the multi-core processor to obtain access data, wherein the access is read operation or write operation;
step S23, determining the memory read operation performance or the memory write operation performance based on the access data.
For a multi-core processor of a reduced instruction set processor architecture, a read operation or a write operation can be executed in parallel by the cores of the multi-core processor, so that multi-core parallel access increases the memory access pressure and brings out the overall performance of the memory controller.
That is, when a read operation is performed, multiple cores of the multi-core processor perform the operation at the same time, and all of them perform read operations; none of the cores performs a write operation. Likewise, when a write operation is performed, multiple cores perform the operation at the same time, and all of them perform write operations; none of the cores performs a read operation. This ensures that the multi-core parallel access is purely a read operation or purely a write operation.
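The rule that a parallel pass is either all reads or all writes can be modeled as in the sketch below, which uses Python threads as stand-ins for cores. This is an illustration of the control flow only: CPython threads do not execute bytecode truly in parallel, so it is not a way to load a memory controller. The names read_segment, write_segment and run_pass and the worker count of 4 are assumptions of the example.

```python
from concurrent.futures import ThreadPoolExecutor


def read_segment(buffer, offset, length):
    """Touch every byte of one segment and return a checksum."""
    return sum(buffer[offset:offset + length])


def write_segment(buffer, offset, length, value=0xFF):
    """Overwrite one segment and return the number of bytes written."""
    buffer[offset:offset + length] = bytes([value]) * length
    return length


def run_pass(buffer, segment_len, operation, workers=4):
    """Apply a single operation type to every segment; a pass never mixes reads and writes."""
    if operation not in ("read", "write"):
        raise ValueError("operation must be 'read' or 'write'")
    if segment_len <= 0 or len(buffer) % segment_len != 0:
        raise ValueError("buffer length must be a positive multiple of segment_len")
    worker = read_segment if operation == "read" else write_segment
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(worker, buffer, off, segment_len)
                   for off in range(0, len(buffer), segment_len)]
        return [f.result() for f in futures]


memory_data = bytearray(1024)
written = run_pass(memory_data, 64, "write")    # write-only pass
checksums = run_pass(memory_data, 64, "read")   # read-only pass
```

Selecting the worker function once per pass, rather than per segment, is what guarantees that no worker performs the other kind of access during that pass.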
Specifically, in the underlying implementation of the method, multi-core parallel memory access is realized by calling OpenMP functions and directives.
In the underlying implementation, the memory data of the first preset length is processed by the body of a for loop, each iteration handling one segment: the PRFM instruction of the ARMv8-A assembly language is used to read the memory data of the first preset length segment by segment. Because the main step uses the PRFM assembly instruction, it is more compact than variable-assignment statements written in C and can reduce the number of software instructions by 80%, thereby putting greater access pressure on the memory and measuring a throughput value closer to the hardware limit.
The scheme uses the OpenMP directive #pragma omp parallel for, so that each loop iteration can be assigned individually to one core of the multi-core CPU. This exploits the potential of the multi-core CPU, puts greater access pressure on the memory, and measures a throughput value closer to the hardware limit.
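One way loop iterations could be distributed among cores is sketched below in Python. It models a static partition into contiguous, near-equal chunks; this is an assumption of the example, since the schedule actually used by a parallel for loop without a schedule clause is left to the OpenMP implementation. The function name static_chunks is hypothetical.

```python
def static_chunks(num_iterations, num_cores):
    """Split iterations 0..num_iterations-1 into contiguous chunks, one per core.

    Chunk sizes differ by at most one; earlier cores receive the extra iterations.
    """
    if num_cores <= 0:
        raise ValueError("num_cores must be positive")
    if num_iterations < 0:
        raise ValueError("num_iterations must not be negative")
    base, extra = divmod(num_iterations, num_cores)
    chunks = []
    start = 0
    for core in range(num_cores):
        size = base + (1 if core < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


# 16 segments (for example 1024 bytes in 64-byte segments) on a hypothetical 4-core CPU.
assignment = static_chunks(16, 4)
```

With 16 segments and 4 cores, each core receives four consecutive segments, so every core touches a distinct region of the memory data.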
It should be noted that existing schemes generally use the C language when determining memory access performance. When memory read-write code written in C that reads and writes the same physical address is compiled into assembly language, it yields many assembly statements, most of which are useless for the final read or write operation. Execution efficiency is therefore low, execution time is long, the measured throughput value is low, and the run is more easily disturbed by process scheduling and interrupts of the operating system. When the same operation is implemented directly in assembly language, only one assembly instruction is needed, execution efficiency is high, and the measured read-write throughput is closer to the true throughput value of the memory controller.
In addition, because C is a high-level language, it cannot control the low-level operation of the machine with complete precision, and it cannot guarantee that only read operations or only write operations are performed at a given time; assembly language, using appropriate statements, can perform only memory read operations or only memory write operations at a given time.
The method for evaluating memory access performance of a processor disclosed in this embodiment obtains memory data of a first preset length, segments the memory data of the first preset length, accesses each segment of data in sequence to obtain access data, the access being a read operation or a write operation, and determines the memory read operation performance or the memory write operation performance based on the access data. Because the scheme performs only read operations, or only write operations, on the data in the memory segment by segment, the read operation performance and the write operation performance of the memory can be tested separately. This avoids the limitation of the prior art, in which only the combined read-write performance of the memory can be obtained, so that the performance of each individual operation is determined and a main performance index of the whole machine or the CPU is thereby determined.
The embodiment discloses a method for evaluating memory access performance of a processor, a flowchart of which is shown in fig. 3, and the method comprises the following steps:
step S31, obtaining memory data, wherein the memory data has a first preset length;
step S32, starting accessing from the byte in the memory data at the first address, where the length of the currently accessed data is a second preset length;
step S33, starting accessing from a byte at a second address in the memory data, where the length of the currently accessed data is a second preset length until the memory data with the first preset length is accessed, where a difference between the second address and the first address is the second preset length;
step S34, determining the memory read operation performance or the memory write operation performance based on the access data.
When data is accessed segment by segment, each access is based on the first-byte address of the memory segment to be accessed. That is, the memory data of the first preset length is segmented in advance and the first-byte address of each memory segment is determined; a core of the CPU starts accessing from the byte at that address, and the current access covers the second preset length of bytes, that is, one memory segment.
When the CPU has multiple cores, the access can be performed by the multiple cores: one core accesses one memory segment and another core accesses another memory segment, until every core has been allocated a memory segment to access or until every memory segment has been allocated to a core.
Alternatively, the memory segments may be accessed one after another: after the first memory segment has been accessed, the second memory segment is accessed, and so on until all memory segments have been accessed.
In the underlying implementation, multi-core parallel operation is realized through a for loop, and the access is executed in assembly language. If the access is a read operation, each iteration of the for loop reads data of the second preset length, until the complete section of memory data of the first preset length has been read; in each iteration of the for loop, the first-byte address of the corresponding memory segment is used as the base address of the memory read.
Specifically, the first read starts at the first-byte address of the first memory segment, namely addr + 0, and the read length is the second preset length, which here is one cache line, that is, 64 bytes, the cache line size of the CPU. The next read starts at the first-byte address of the second memory segment, namely addr + 64, again with the second preset length, and so on until the whole section of memory data of the first preset length has been read.
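The address stepping described above can be illustrated with the following Python sketch, which only computes the base address of each access and does not perform any memory access. The function name segment_base_addresses and the base address 0x1000 are hypothetical values chosen for the example; the 64-byte step is the cache line size given in the text.

```python
def segment_base_addresses(base_addr, total_len, segment_len=64):
    """Return the first-byte address of every segment: addr + 0, addr + 64, ..."""
    if segment_len <= 0 or total_len % segment_len != 0:
        raise ValueError("total_len must be a positive multiple of segment_len")
    return [base_addr + offset for offset in range(0, total_len, segment_len)]


addr = 0x1000                                 # hypothetical start address of the memory data
bases = segment_base_addresses(addr, 1024)    # 1024 bytes read one 64-byte cache line at a time
```

For the write case described in the text, the same computation would apply with a segment length of 8, giving addr + 0, addr + 8, and so on.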
The core code may be as follows:
[Figure BDA0003104465340000101: code listing published as an image; not reproduced in this text]
If the access is a write operation, each iteration of the for loop writes data of the second preset length, until the complete section of memory data of the first preset length has been written; in each iteration of the for loop, the first-byte address of the corresponding memory segment is used as the base address of the memory write.
Specifically, the first write starts at the first-byte address of the first memory segment, namely addr + 0, and the write length is the second preset length. The next write starts at the first-byte address of the second memory segment, namely addr + 8, again with the second preset length, and so on until the whole section of memory data of the first preset length has been written.
The core code may be as follows:
Figure BDA0003104465340000111
When data is written, the execution flow of a single iteration of the for loop performs a store operation on a memory region 256 bytes long. That is, the STUR instruction of the ARMv8-A assembly language is used to write the data held in the CPU to the memory bank in 32 writes of 8 bytes each. The STUR assembly instruction is used in the main step of the memory write operation; compared with a C-language variable assignment, it is more compact and can reduce the number of software instructions by 80%, so that greater access pressure is placed on the memory and a throughput value closer to the hardware limit is measured.
In addition, the scheme uses the OpenMP directive #pragma omp parallel for, so that each loop iteration can be assigned to an individual core of the multi-core CPU. This exploits the potential of the multi-core CPU, places greater access pressure on the memory, and allows a throughput value closer to the hardware limit to be measured.
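The write listing of the publication is likewise an image. A minimal sketch of a write sweep in the same spirit is given below, assuming an 8-byte store per step. The names `write_sweep`, `store_segment` and `WRITE_BYTES` are hypothetical, the non-ARM fallback is an assumption, and the single flat loop is a simplification of the 256-byte loop body described above.

```c
#include <assert.h>
#include <stdint.h>

#define WRITE_BYTES 8  /* second preset length in the write example */

/* Store one 8-byte value at p without reading memory first. */
static inline void store_segment(uint64_t *p, uint64_t value)
{
#if defined(__aarch64__)
    /* STUR: store a 64-bit register to [p] with an unscaled offset of 0. */
    __asm__ volatile("stur %1, [%0]" : : "r"(p), "r"(value) : "memory");
#else
    /* Assumed portable fallback: a volatile 8-byte store. */
    *(volatile uint64_t *)p = value;
#endif
}

/* Write total_bytes starting at addr in WRITE_BYTES steps.
 * Returns the number of stores performed. */
static long write_sweep(uint64_t *addr, long total_bytes, uint64_t value)
{
    assert(total_bytes % WRITE_BYTES == 0);
    long segments = total_bytes / WRITE_BYTES;
    long written = 0;

    /* Each iteration writes a distinct segment, so iterations can run on
     * different cores when OpenMP is enabled. */
    #pragma omp parallel for reduction(+:written)
    for (long i = 0; i < segments; i++) {
        store_segment(addr + i, value);  /* byte offsets addr+0, addr+8, ... */
        written++;
    }
    return written;
}
```

For a 256-byte buffer this performs 32 stores, matching the 256-byte, 8-byte-per-write example in the description.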
The method for evaluating the memory access performance of the processor disclosed in this embodiment obtains memory data having a first preset length, segments the memory data of the first preset length, and accesses each segment of data in turn to obtain access data, where the access is a read operation or a write operation, and determines the memory read operation performance or the memory write operation performance based on the access data. In this scheme, either read operations or write operations are performed on the data in the memory segment by segment, so that the read operation performance and the write operation performance of the memory can each be tested separately. This avoids the limitation of the prior art, in which only the combined read-write performance of the memory can be obtained, and allows the performance of each individual operation to be determined and hence a main performance index of the whole machine or of the CPU to be determined.
The present embodiment discloses a system for evaluating processor memory access performance, a schematic structural diagram of which is shown in fig. 4, and the system includes:
an obtaining unit 41, an accessing unit 42 and a determining unit 43.
The obtaining unit 41 is configured to obtain memory data, where the memory data is a first preset length;
the access unit 42 is configured to segment the memory data with the first preset length, and sequentially access each segment of data to obtain access data, where the access is a read operation or a write operation;
the determination unit 43 is configured to determine a memory read operation performance or a memory write operation performance based on the access data.
ARMv8-A is a reduced instruction set processor architecture introduced by ARM, a company based in the UK, and is widely used in fields such as high-performance consumer electronics and servers. As on other CPU platforms, such as x86 or MIPS, memory read and write throughput is one of the main performance indexes of the whole machine or of the CPU.
The memory evaluation tool and algorithm currently in wide use in the high-performance CPU field is the Stream scheme and software tool proposed by the Department of Computer Science at the University of Virginia in the United States; the current version of the Stream software tool is 5.10, so the tool is referred to as Stream 5.10. The main idea of the Stream 5.10 algorithm is to use C or another high-level programming language, such as Fortran, to read and write a large contiguous region of virtual memory multiple times, record the time taken by each complete read-write pass, take the fastest pass, and calculate the throughput value from the data volume and that execution time.
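The "fastest pass" calculation described above can be illustrated with a short C sketch. This is not code from the Stream tool; the function name `best_pass_throughput` and its parameters are hypothetical, and the pass times are assumed to have been measured elsewhere.

```c
#include <stddef.h>

/* Throughput in bytes per second, computed from the fastest of n timed passes.
 * Returns 0.0 when there are no passes or the fastest time is not positive. */
static double best_pass_throughput(size_t bytes_per_pass,
                                   const double *pass_seconds, size_t n)
{
    if (n == 0)
        return 0.0;

    double best = pass_seconds[0];
    for (size_t i = 1; i < n; i++) {
        if (pass_seconds[i] < best)
            best = pass_seconds[i];  /* keep the shortest execution time */
    }
    return best > 0.0 ? (double)bytes_per_pass / best : 0.0;
}
```

Using the fastest pass rather than the mean reduces the influence of passes that were slowed by scheduling or interrupts.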
However, although this scheme can evaluate the combined performance of typical applications that mix reads and writes, it does not reflect memory read performance and memory write performance separately. If the CPU issues both read operations and write operations to a memory or memory controller at the same time, neither the reads nor the writes can reach their own maximum throughput bandwidth. In addition, in some systems the read and write loads are unbalanced; for example, on a general-purpose processor platform for non-network applications, memory reads are generally considered to be more frequent than memory writes, which makes it even harder for the above scheme to distinguish memory read performance from memory write performance.
To solve this problem, the present scheme performs read operations and write operations on the memory separately in the CPU, so as to obtain the memory read operation performance and the memory write operation performance respectively.
A segment of memory data whose length is known, for example 1024 bytes, is obtained in advance, and the pre-stored memory data is accessed in segments to obtain access data, where the access may be a read operation or a write operation.
If the access is a read operation, the pre-stored memory data is segmented and each segment is then read in turn to obtain read operation data for each segment, for example the duration of reading each segment and/or the duration of reading the complete pre-acquired memory data.
The memory data may be segmented as follows: the memory data of the first preset length is divided into a preset number of memory segments, each of a second preset length, where the product of the second preset length and the preset number is equal to the first preset length. Dividing the memory data evenly ensures that all memory segments have the same length, so that the read operation data obtained by reading the segments can be compared. If the segments were of different lengths, the read durations would differ even when the same core of the CPU performs the reads, so the durations would not be comparable; durations obtained when different cores of the CPU perform the reads would likewise be meaningless.
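The relation between the first preset length, the second preset length and the preset number can be written out as a small C sketch. The helper names `segment_count` and `segment_offset` are hypothetical and serve only to illustrate the arithmetic.

```c
#include <stddef.h>

/* Number of equal-length segments, or 0 if first_len is not an exact
 * multiple of second_len (the even division required by the text). */
static size_t segment_count(size_t first_len, size_t second_len)
{
    if (second_len == 0 || first_len % second_len != 0)
        return 0;
    return first_len / second_len;
}

/* Byte offset of the first byte of segment `index` from the base address. */
static size_t segment_offset(size_t index, size_t second_len)
{
    return index * second_len;
}
```

With 1024 bytes and 64-byte segments this gives 16 segments at offsets 0, 64, ..., 960; with 256 bytes and 8-byte segments it gives 32 segments.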
For example, if the pre-acquired memory data is 1024 bytes long and each segment is 64 bytes long, the 1024 bytes of memory data are divided into 16 segments, that is, they are read in 16 reads of 64 bytes each. The read operation data may be the time taken to read each 64-byte segment; 16 durations are obtained and are taken as the read operation data.
The memory read operation performance is determined based on the read operation data, that is, based on the read duration obtained for each segment of data. The memory read operation performance may be the shortest duration consumed by a read operation, the longest duration consumed by a read operation, the average duration consumed by a read operation, or a read operation performance measurement value.
Continuing the above example, 16 durations are obtained, each corresponding to the reading of one 64-byte segment. The maximum of the 16 durations is the longest duration consumed by a read operation; the minimum of the 16 durations is the shortest duration consumed by a read operation; the mean of the 16 durations is the average duration of a read operation; and the read performance measurement value may be obtained by dividing the amount of data in each segment by the shortest duration.
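As an illustration, the four quantities named above could be derived from the per-segment durations as in the following C sketch. The type `access_stats` and the function `summarize` are hypothetical names, and the durations are assumed to be in seconds.

```c
#include <stddef.h>

typedef struct {
    double min_s;        /* shortest duration of one segment access */
    double max_s;        /* longest duration of one segment access */
    double avg_s;        /* average duration over all segments */
    double bytes_per_s;  /* measurement value: segment size / shortest duration */
} access_stats;

/* Summarize n per-segment durations for segments of segment_bytes each. */
static access_stats summarize(const double *durations_s, size_t n,
                              size_t segment_bytes)
{
    access_stats st = {0.0, 0.0, 0.0, 0.0};
    if (n == 0)
        return st;

    double sum = 0.0;
    st.min_s = st.max_s = durations_s[0];
    for (size_t i = 0; i < n; i++) {
        double d = durations_s[i];
        if (d < st.min_s) st.min_s = d;
        if (d > st.max_s) st.max_s = d;
        sum += d;
    }
    st.avg_s = sum / (double)n;
    if (st.min_s > 0.0)
        st.bytes_per_s = (double)segment_bytes / st.min_s;
    return st;
}
```

The same summary applies unchanged to write durations, since only the source of the timings differs.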
If the access is a write operation, the pre-stored memory data is segmented and each segment is then written in turn to obtain write operation data for each segment, for example the duration of writing each segment and/or the duration of writing the complete pre-acquired memory data.
The memory data may be segmented as follows: the memory data of the first preset length is divided into a preset number of memory segments, each of a second preset length, where the product of the second preset length and the preset number is equal to the first preset length. Dividing the memory data evenly ensures that all memory segments have the same length, so that the write operation data obtained by writing the segments can be compared.
For example, if the pre-acquired memory data is 256 bytes long and each segment is 8 bytes long, the 256 bytes of memory data are divided into 32 segments, that is, they are written to the memory bank in 32 writes of 8 bytes each. The write operation data may be the time taken to write each 8-byte segment; 32 durations are obtained and are taken as the write operation data.
The memory write operation performance is determined based on the write operation data, that is, based on the write duration obtained for each segment of data. The memory write operation performance may be the shortest duration consumed by a write operation, the longest duration consumed by a write operation, the average duration consumed by a write operation, or a write operation performance measurement value.
Continuing the above example, 32 durations are obtained, each corresponding to the writing of one 8-byte segment to the memory bank. The maximum of the 32 durations is the longest duration consumed by a write operation; the minimum of the 32 durations is the shortest duration consumed by a write operation; the mean of the 32 durations is the average duration of a write operation; and the write performance measurement value may be obtained by dividing the amount of data in each segment by the shortest duration.
In this scheme, memory read operations and memory write operations are performed separately, so that the throughput measured while the cores of the CPU perform only reads or only writes is closer to the hardware limit and more accurate evaluation data is provided; the resulting performance data can even be used to estimate the performance of network applications such as DPDK.
Further, the accessing unit 42 accesses each piece of data in turn, including: the access unit accesses a segment of data in parallel through each core in the multi-core processor, and different cores in the multi-core processor access different segments of data.
For a multi-core processor in a reduced instruction set processor architecture, a read operation or a write operation can be executed in parallel by the cores of the multi-core processor, so that multi-core parallel access increases the memory access pressure and exercises the full performance of the memory controller.
That is, when a read operation is performed, multiple cores of the multi-core processor may operate at the same time, but all of them perform read operations and none of them performs a write operation; when a write operation is performed, multiple cores operate at the same time, but all of them perform write operations and none of them performs a read operation. This ensures that the cores perform either parallel reads only or parallel writes only.
Specifically, in the underlying implementation of the method for evaluating the memory performance of the processor, multi-core parallel memory access is realized by calling OpenMP functions and directives.
In the underlying implementation, the execution flow of a single iteration of the for loop obtains the memory data of the first preset length; that is, the PRFM instruction of the ARMv8-A assembly language is used to read the memory data of the first preset length in segments. The PRFM assembly instruction is used in the main step; compared with a C-language variable assignment, it is more compact and can reduce the number of software instructions by 80%, so that greater access pressure is placed on the memory and a throughput value closer to the hardware limit is measured.
The scheme uses the OpenMP directive #pragma omp parallel for, so that each loop iteration can be assigned to an individual core of the multi-core CPU. This exploits the potential of the multi-core CPU, places greater access pressure on the memory, and allows a throughput value closer to the hardware limit to be measured.
It should be noted that existing schemes generally use the C language when determining memory access performance. Memory read-write code written in C reads and writes the same physical address, but after compilation to assembly it consists of many assembly statements, most of which are not needed for the final read or write operation. The execution efficiency is therefore low, the execution time is long, the measured throughput value is low, and the run is more likely to be disturbed by operating system process scheduling and interrupts. When the same operation is implemented directly in assembly language, only one assembly instruction is needed, the execution efficiency is high, and the measured read-write throughput is closer to the true throughput of the memory controller.
In addition, because C is a high-level language, it cannot control the underlying machine operations with complete precision and cannot guarantee that only read operations or only write operations are in progress at a given time; assembly language, with suitable instructions, can perform only memory reads or only memory writes at a given time.
Further, the access unit 42 is configured to start accessing from a byte in the memory data at the first address, where the length of the currently accessed data is a second preset length; starting access from a byte at a second address in the memory data, wherein the length of the currently accessed data is a second preset length until the memory data access of the first preset length is completed, and the difference value between the second address and the first address is the second preset length.
When the data is accessed in segments, each access is based on the first byte address of the memory segment to be accessed. That is, the memory data of the first preset length is segmented in advance and the first byte address of each memory segment is determined; a core of the CPU starts the access from the byte at that first byte address, and the current access covers bytes of the second preset length, namely one memory segment.
When the CPU has a plurality of cores, the access may be performed by the plurality of cores: one core accesses one memory segment and another core accesses another memory segment, until every core has been allocated a memory segment to access, or until every memory segment has been allocated to a core.
Alternatively, the memory segments may be accessed sequentially: after the first memory segment has been accessed, the second memory segment is accessed, and so on until all memory segments have been accessed.
In the underlying implementation, multi-core parallel operation is realized through a for loop, and the access itself is executed in assembly language. If the access is a read operation, data of the second preset length is read in each iteration of the for loop until the complete memory data of the first preset length has been read; in each iteration of the for loop, the first byte address of the corresponding memory segment is used as the base address of the memory read.
Specifically, the first read starts at the first byte address of the first memory segment, namely addr+0, and the read length is the second preset length, namely one cache line; that is, the second preset length is 64 bytes, the cache line size of the CPU. The next read starts at the first byte address of the second memory segment, namely addr+64, with the same read length, and so on until the complete memory data of the first preset length has been read.
The core code may be as follows:
Figure BDA0003104465340000171
If the access is a write operation, data of the second preset length is written in each iteration of the for loop until the complete memory data of the first preset length has been written; in each iteration of the for loop, the first byte address of the corresponding memory segment is used as the base address of the memory write.
Specifically, the first write starts at the first byte address of the first memory segment, namely addr+0, and the write length is the second preset length; the next write starts at the first byte address of the second memory segment, namely addr+8, with the same write length, and so on until the complete memory data of the first preset length has been written.
The core code may be as follows:
Figure BDA0003104465340000181
When data is written, the execution flow of a single iteration of the for loop performs a store operation on a memory region 256 bytes long. That is, the STUR instruction of the ARMv8-A assembly language is used to write the data held in the CPU to the memory bank in 32 writes of 8 bytes each. The STUR assembly instruction is used in the main step of the memory write operation; compared with a C-language variable assignment, it is more compact and can reduce the number of software instructions by 80%, so that greater access pressure is placed on the memory and a throughput value closer to the hardware limit is measured.
In addition, the scheme uses the OpenMP directive #pragma omp parallel for, so that each loop iteration can be assigned to an individual core of the multi-core CPU. This exploits the potential of the multi-core CPU, places greater access pressure on the memory, and allows a throughput value closer to the hardware limit to be measured.
The system for evaluating the memory access performance of the processor disclosed in this embodiment obtains memory data having a first preset length, segments the memory data of the first preset length, and accesses each segment of data in turn to obtain access data, where the access is a read operation or a write operation, and determines the memory read operation performance or the memory write operation performance based on the access data. In this scheme, either read operations or write operations are performed on the data in the memory segment by segment, so that the read operation performance and the write operation performance of the memory can each be tested separately. This avoids the limitation of the prior art, in which only the combined read-write performance of the memory can be obtained, and allows the performance of each individual operation to be determined and hence a main performance index of the whole machine or of the CPU to be determined.
The present embodiment discloses a storage medium for storing at least one set of instructions for being called and performing at least the method of evaluating memory access performance of a processor as described in any of the above.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for evaluating memory access performance of a processor, comprising:
obtaining memory data, wherein the memory data is a first preset length;
segmenting the memory data with the first preset length, and sequentially accessing each segment of data to obtain access data, wherein the access is read operation or write operation;
determining a memory read operation performance or a memory write operation performance based on the access data.
2. The method according to claim 1, wherein segmenting the memory data of the first preset length and sequentially accessing each segment of data to obtain access data comprises:
dividing the memory data with the first preset length into a preset number of memory segments, wherein the length of each memory segment is a second preset length, and the product of the second preset length and the preset number is equal to the first preset length;
and sequentially accessing each memory segment data to obtain the access data of each memory segment data.
3. The method of claim 1, wherein said accessing each piece of data in turn comprises:
each core in the multi-core processor accesses one segment of data in parallel, and different cores in the multi-core processor access different segments of data.
4. The method according to claim 1, wherein segmenting the memory data of the first preset length and sequentially accessing each segment of data to obtain access data comprises:
starting access from the byte at the first address in the memory data, wherein the length of the currently accessed data is a second preset length;
starting access from a byte at a second address in the memory data, wherein the length of the currently accessed data is a second preset length until the access of the memory data with the first preset length is completed, and the difference value between the second address and the first address is the second preset length.
5. The method of claim 1, wherein determining a memory read operation performance or a memory write operation performance based on the access data comprises:
determining the shortest time length, the longest time length, the average time length and/or the read operation performance measurement value consumed by executing the read operation on the memory data with the first preset length based on the read operation data;
and determining the shortest time length, the longest time length, the average time length and/or the write operation performance measurement value consumed by performing the write operation on the memory data with the first preset length based on the write operation data.
6. The method of claim 1, wherein said accessing each piece of data in turn comprises:
each piece of data is accessed in turn by assembly language matched to the multi-core processor.
7. A system for evaluating performance of memory accesses of a processor, comprising:
an obtaining unit, configured to obtain memory data, wherein the memory data has a first preset length;
the access unit is used for segmenting the memory data with the first preset length and sequentially accessing each segment of data to obtain access data, wherein the access is read operation or write operation;
a determining unit, configured to determine a memory read operation performance or a memory write operation performance based on the access data.
8. The system of claim 7, wherein the access unit is configured to:
dividing the memory data with the first preset length into a preset number of memory segments, wherein the length of each memory segment is a second preset length, and the product of the second preset length and the preset number is equal to the first preset length; and sequentially accessing each memory segment data to obtain the access data of each memory segment data.
9. The system of claim 7, wherein the access unit accesses each piece of data in turn, comprising:
the access unit accesses a segment of data in parallel through each core in the multi-core processor, and different cores in the multi-core processor access different segments of data.
10. A storage medium storing at least one set of instructions;
the set of instructions is for being called and performing at least the method of evaluating processor memory access performance as described in any of the above.
CN202110633327.2A 2021-06-07 2021-06-07 Method and system for evaluating memory access performance of processor Active CN113254321B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110633327.2A CN113254321B (en) 2021-06-07 2021-06-07 Method and system for evaluating memory access performance of processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110633327.2A CN113254321B (en) 2021-06-07 2021-06-07 Method and system for evaluating memory access performance of processor

Publications (2)

Publication Number Publication Date
CN113254321A true CN113254321A (en) 2021-08-13
CN113254321B CN113254321B (en) 2023-01-24

Family

ID=77186864

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110633327.2A Active CN113254321B (en) 2021-06-07 2021-06-07 Method and system for evaluating memory access performance of processor

Country Status (1)

Country Link
CN (1) CN113254321B (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5896550A (en) * 1997-04-03 1999-04-20 Vlsi Technology, Inc. Direct memory access controller with full read/write capability
CN102567256A (en) * 2011-12-16 2012-07-11 龙芯中科技术有限公司 Processor system, as well as multi-channel memory copying DMA accelerator and method thereof
CN103257888A (en) * 2012-02-16 2013-08-21 阿里巴巴集团控股有限公司 Method and equipment for concurrently executing read and write access to buffering queue
US20130339635A1 (en) * 2012-06-14 2013-12-19 International Business Machines Corporation Reducing read latency using a pool of processing cores
CN103559079A (en) * 2013-11-15 2014-02-05 深圳市道通科技有限公司 Shared memory based data access method and device
CN103902467A (en) * 2012-12-26 2014-07-02 华为技术有限公司 Compressed memory access control method, device and system
CN104090795A (en) * 2014-07-08 2014-10-08 三星电子(中国)研发中心 Method, system and device for upgrading multi-core mobile terminal
CN105373456A (en) * 2015-11-19 2016-03-02 英业达科技有限公司 Memory testing method for reducing cache hit rate
CN105740164A (en) * 2014-12-10 2016-07-06 阿里巴巴集团控股有限公司 Multi-core processor supporting cache consistency, reading and writing methods and apparatuses as well as device
US20160292032A1 (en) * 2013-11-22 2016-10-06 Alcatel Lucent Detecting a read access to unallocated or uninitialized memory
CN109144419A (en) * 2018-08-20 2019-01-04 浪潮电子信息产业股份有限公司 A kind of solid state hard disk memory read-write method and system
CN109857342A (en) * 2019-01-16 2019-06-07 盛科网络(苏州)有限公司 A kind of data read-write method and device, exchange chip and storage medium
CN109901956A (en) * 2017-12-08 2019-06-18 英业达科技有限公司 The system and method for memory integrated testability
CN110502190A (en) * 2019-08-28 2019-11-26 上海航天电子通讯设备研究所 File read/write method
CN111739577A (en) * 2020-07-20 2020-10-02 成都智明达电子股份有限公司 DSP-based efficient DDR test method
CN112181893A (en) * 2020-09-29 2021-01-05 东风商用车有限公司 Communication method and system between multi-core processor cores in vehicle controller

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5896550A (en) * 1997-04-03 1999-04-20 Vlsi Technology, Inc. Direct memory access controller with full read/write capability
CN102567256A (en) * 2011-12-16 2012-07-11 龙芯中科技术有限公司 Processor system, as well as multi-channel memory copying DMA accelerator and method thereof
CN103257888A (en) * 2012-02-16 2013-08-21 阿里巴巴集团控股有限公司 Method and equipment for concurrently executing read and write access to buffering queue
US20130339635A1 (en) * 2012-06-14 2013-12-19 International Business Machines Corporation Reducing read latency using a pool of processing cores
CN103902467A (en) * 2012-12-26 2014-07-02 华为技术有限公司 Compressed memory access control method, device and system
CN103559079A (en) * 2013-11-15 2014-02-05 深圳市道通科技有限公司 Shared memory based data access method and device
US20160292032A1 (en) * 2013-11-22 2016-10-06 Alcatel Lucent Detecting a read access to unallocated or uninitialized memory
CN104090795A (en) * 2014-07-08 2014-10-08 三星电子(中国)研发中心 Method, system and device for upgrading multi-core mobile terminal
CN105740164A (en) * 2014-12-10 2016-07-06 阿里巴巴集团控股有限公司 Multi-core processor supporting cache consistency, reading and writing methods and apparatuses as well as device
CN105373456A (en) * 2015-11-19 2016-03-02 英业达科技有限公司 Memory testing method for reducing cache hit rate
CN109901956A (en) * 2017-12-08 2019-06-18 英业达科技有限公司 System and method for memory integrated testability
CN109144419A (en) * 2018-08-20 2019-01-04 浪潮电子信息产业股份有限公司 Solid state hard disk memory read-write method and system
CN109857342A (en) * 2019-01-16 2019-06-07 盛科网络(苏州)有限公司 Data read-write method and device, exchange chip and storage medium
CN110502190A (en) * 2019-08-28 2019-11-26 上海航天电子通讯设备研究所 File read/write method
CN111739577A (en) * 2020-07-20 2020-10-02 成都智明达电子股份有限公司 DSP-based efficient DDR test method
CN112181893A (en) * 2020-09-29 2021-01-05 东风商用车有限公司 Communication method and system between multi-core processor cores in vehicle controller

Also Published As

Publication number Publication date
CN113254321B (en) 2023-01-24

Similar Documents

Publication Title
US9477601B2 (en) Apparatus and method for determining a sector division ratio of a shared cache memory
US10268454B2 (en) Methods and apparatus to eliminate partial-redundant vector loads
KR101456976B1 (en) Memory test device and testing method for memory
US8359291B2 (en) Architecture-aware field affinity estimation
CN107145446B (en) Application program APP test method, device and medium
JP2020087470A (en) Data access method, data access device, apparatus, and storage medium
CN113254322B (en) Method and system for evaluating ultimate throughput performance of Stream system
JP4208079B2 (en) Database server, program, recording medium, and control method
Ibrahim et al. Characterizing the relation between Apex-Map synthetic probes and reuse distance distributions
CN113254321B (en) Method and system for evaluating memory access performance of processor
CN114706834A (en) High-efficiency dynamic set management method and system
CN109947667B (en) Data access prediction method and device
JP6145193B2 (en) Read or write to memory
US10102099B2 (en) Performance information generating method, information processing apparatus and computer-readable storage medium storing performance information generation program
KR20140093593A (en) Method and system for determining work-group size and computer readable recording medium therefor
US20180081581A1 (en) Device and method for determining data placement destination, and program recording medium
JP5521687B2 (en) Analysis apparatus, analysis method, and analysis program
CN114625719A (en) Dynamic set management method and system based on mobile filtering framework
US11500825B2 (en) Techniques for dynamic database access modes
US20100169322A1 (en) Efficient access of bitmap array with huge usage variance along linear fashion, using pointers
CN110674170A (en) Data caching method, device, equipment and medium based on linked list reverse order reading
CN107766729B (en) Virus characteristic matching method, terminal and computer readable storage medium
US11442658B1 (en) System and method for selecting a write unit size for a block storage device
TW201317780A (en) Method for identifying memory of virtual machine and computer system using the same
CN111352825B (en) Data interface testing method and device and server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20221221

Address after: 201114 Room 603C, Building 8, No. 2388, Chenhang Road, Minhang District, Shanghai

Applicant after: Shanghai Hengwei Intelligent Technology Co.,Ltd.

Address before: 6 / F, building 8, 2388 Chenhang Road, Minhang District, Shanghai, 201114

Applicant before: EMBEDWAY TECHNOLOGIES (SHANGHAI) Corp.

GR01 Patent grant