CN113254321B - Method and system for evaluating memory access performance of processor - Google Patents

Info

Publication number
CN113254321B
Authority
CN
China
Prior art keywords
memory
data
access
length
performance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110633327.2A
Other languages
Chinese (zh)
Other versions
CN113254321A (en
Inventor
李腾
叶飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Hengwei Intelligent Technology Co ltd
Original Assignee
Shanghai Hengwei Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Hengwei Intelligent Technology Co ltd
Priority to CN202110633327.2A
Publication of CN113254321A
Application granted
Publication of CN113254321B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3419 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3037 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a memory, e.g. virtual memory, cache

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a method and a system for evaluating the memory access performance of a processor. Memory data of a first preset length is obtained and segmented, each segment of data is accessed in sequence to obtain access data, the access being a read operation or a write operation, and the memory read operation performance or the memory write operation performance is determined based on the access data. By performing only read operations, or only write operations, on the memory data segment by segment, the scheme measures the read performance and the write performance of the memory separately. This avoids the limitation of the prior art, which yields only the combined read-write performance of the memory, and allows the performance of each operation to be determined on its own as a main performance index of the whole machine or the CPU.

Description

Method and system for evaluating memory access performance of processor
Technical Field
The present application relates to the field of electronic information technologies, and in particular, to a method and a system for evaluating memory access performance of a processor.
Background
On a general-purpose CPU platform such as ARM, memory read-write throughput (MB/s) is one of the main performance indexes of the whole machine or the CPU.
The memory evaluation tool and algorithm most widely used in the high-performance CPU field is the STREAM scheme and software tool from the computer science department of the University of Virginia. Its approach is to use C or another high-level programming language to read and write a large contiguous region of virtual memory (generally over 512 MB) many times, record the time taken by each complete read-write pass, take the time of the fastest pass, and calculate the throughput value from the data volume and that execution time.
However, this scheme tests the combined performance of a mixed read-write workload and cannot show the read performance or the write performance of the memory separately.
Disclosure of Invention
In view of the above, the present application provides a method and a system for evaluating the memory access performance of a processor. The specific scheme is as follows:
A method for evaluating processor memory access performance, comprising:
obtaining memory data, wherein the memory data has a first preset length;
segmenting the memory data of the first preset length, and sequentially accessing each segment of data to obtain access data, wherein the access is a read operation or a write operation; and
determining memory read operation performance or memory write operation performance based on the access data.
Further, segmenting the memory data of the first preset length and sequentially accessing each segment of data to obtain access data comprises:
dividing the memory data of the first preset length into a preset number of memory segments, wherein each memory segment has a second preset length, and the product of the second preset length and the preset number equals the first preset length; and
sequentially accessing each memory segment to obtain the access data of each memory segment.
Further, sequentially accessing each segment of data comprises:
accessing one segment of data on each core of a multi-core processor in parallel, wherein different cores of the multi-core processor access different segments of data.
Further, segmenting the memory data of the first preset length and sequentially accessing each segment of data to obtain access data comprises:
starting an access at the byte at a first address in the memory data, wherein the length of the data accessed is a second preset length; and
starting a further access at the byte at a second address in the memory data, wherein the length of the data accessed is the second preset length, and continuing in this way until the memory data of the first preset length has been fully accessed, wherein the difference between the second address and the first address is the second preset length.
Further, determining the memory read operation performance or the memory write operation performance based on the access data comprises:
determining, based on read operation data, the shortest duration, the longest duration, the average duration and/or a read operation performance metric for performing the read operation on the memory data of the first preset length;
and determining, based on write operation data, the shortest duration, the longest duration, the average duration and/or a write operation performance metric for performing the write operation on the memory data of the first preset length.
Further, sequentially accessing each segment of data comprises:
accessing each segment of data in turn using assembly language matched to the multi-core processor.
A system for evaluating memory access performance of a processor, comprising:
an obtaining unit, configured to obtain memory data, wherein the memory data has a first preset length;
an access unit, configured to segment the memory data of the first preset length and sequentially access each segment of data to obtain access data, wherein the access is a read operation or a write operation; and
a determining unit, configured to determine memory read operation performance or memory write operation performance based on the access data.
Further, the access unit is configured to:
divide the memory data of the first preset length into a preset number of memory segments, wherein each memory segment has a second preset length, and the product of the second preset length and the preset number equals the first preset length; and sequentially access each memory segment to obtain the access data of each memory segment.
Further, the access unit sequentially accessing each segment of data comprises:
the access unit accessing one segment of data on each core of the multi-core processor in parallel, wherein different cores of the multi-core processor access different segments of data.
Further, a storage medium is provided for storing at least one set of instructions;
the set of instructions is configured to be invoked to perform at least the method for evaluating processor memory access performance described in any of the above.
As the above technical scheme shows, the method and system disclosed in the present application obtain memory data of a first preset length, segment it, access each segment in sequence to obtain access data, the access being a read operation or a write operation, and determine the memory read operation performance or the memory write operation performance based on the access data. By reading the memory data segment by segment, or writing it segment by segment, the scheme tests the read performance and the write performance of the memory separately. This avoids the limitation of the prior art, which yields only the combined read-write performance of the memory, and allows the performance of each operation to be determined on its own as a main performance index of the whole machine or the CPU.
Drawings
To illustrate the embodiments of the present application and the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for evaluating memory access performance of a processor according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a method for evaluating memory access performance of a processor according to an embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating a method for evaluating memory access performance of a processor according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a system for evaluating memory access performance of a processor according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The present application discloses a method for evaluating processor memory access performance. Its flowchart is shown in FIG. 1, and the method includes:
Step S11, obtaining memory data, wherein the memory data has a first preset length;
Step S12, segmenting the memory data of the first preset length, and sequentially accessing each segment of data to obtain access data, wherein the access is a read operation or a write operation;
Step S13, determining the memory read operation performance or the memory write operation performance based on the access data.
ARMv8-A is a reduced instruction set (RISC) processor architecture introduced by the UK company ARM, and ARMv8-A CPUs are widely used in high-performance consumer electronics, servers and similar fields. As on other CPU platforms such as x86 or MIPS, memory read-write throughput is one of the main performance indexes of the whole machine or the CPU.
The memory evaluation tool and algorithm currently in wide use in the high-performance CPU field is the STREAM scheme and software tool from the computer science department of the University of Virginia in the United States. The current version of the STREAM software tool is 5.10, so the tool is referred to here as Stream5.10. The main idea of the Stream5.10 scheme is to use C or another high-level programming language, such as Fortran, to read and write a large contiguous region of virtual memory multiple times, record the time taken by each complete read-write pass, take the fastest pass, and calculate the throughput value from the data volume and that execution time.
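The throughput calculation described above (fastest pass, data volume divided by execution time) can be sketched in C as follows. This is a minimal illustration and not code from the STREAM tool; the function names `best_time` and `throughput_mb_per_s` are hypothetical, and 1 MB is assumed to mean 10^6 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the shortest of n measured pass times (in seconds). */
double best_time(const double *times, size_t n) {
    assert(times != NULL && n > 0);
    double best = times[0];
    for (size_t i = 1; i < n; ++i) {
        if (times[i] < best) {
            best = times[i];
        }
    }
    return best;
}

/* Throughput in MB/s for a pass that moved bytes_moved bytes in
 * the given number of seconds. Assumption: 1 MB = 1e6 bytes. */
double throughput_mb_per_s(size_t bytes_moved, double seconds) {
    assert(seconds > 0.0);
    return ((double)bytes_moved / 1.0e6) / seconds;
}
```

For instance, passing the fastest of several measured pass times together with the number of bytes moved per pass gives a single MB/s figure, which is a combined read-write figure when each pass both reads and writes.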
This scheme can evaluate the combined performance of some typical mixed read-write applications, but it does not show the read performance or the write performance of the memory separately. If the CPU issues read operations and write operations to a memory or memory controller at the same time, neither the reads nor the writes can reach their own maximum throughput bandwidth. In addition, in some systems the read and write workloads are unbalanced. For example, on a general-purpose processor platform running non-network applications, memory reads are generally considered more frequent than memory writes, which makes it even harder for the above scheme to characterize the memory read performance and the memory write performance clearly.
To solve this problem, the present scheme performs read operations and write operations on the memory separately from the CPU, so as to obtain the memory read operation performance and the memory write operation performance separately.
A region of memory data of known length, for example 1024 bytes, is obtained in advance, and the pre-acquired memory data is accessed segment by segment to obtain access data, where the access may be a read operation or a write operation.
If the access is a read operation, the pre-acquired memory data is segmented and each segment is read in turn to obtain read operation data for each segment, for example the time taken to read each segment and/or the time taken to read the complete pre-acquired memory data.
The memory data may be segmented as follows: the memory data of the first preset length is divided into a preset number of memory segments, each of a second preset length, such that the product of the second preset length and the preset number equals the first preset length. Dividing the memory data evenly makes every memory segment the same length, which ensures that the read operation data obtained from the segments can be compared. If the segments had different lengths, the read durations would differ even when the same CPU core performed the reads, so the durations would not be comparable; the durations would be equally meaningless if different CPU cores performed the reads.
For example, if the pre-acquired memory data is 1024 bytes long and each segment is 64 bytes, the 1024 bytes are divided into 16 segments, that is, read in 16 operations of 64 bytes each. The read operation data may then be the time taken to read each 64-byte segment, giving 16 durations, and these 16 durations are taken as the read operation data.
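The segmentation and per-segment timing in this example can be sketched in portable C as follows. This is a hypothetical illustration rather than the patent's implementation: the names `now_seconds` and `timed_segment_reads` are assumptions, the C11 `timespec_get` wall clock stands in for whatever timer a real measurement would use, and timing a single 64-byte read is dominated by timer overhead, so the recorded values only show the bookkeeping.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Wall-clock time in seconds (C11 timespec_get; not monotonic). */
static double now_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Splits total_len bytes into equal segments of seg_len bytes,
 * reads each segment in order, and stores one duration per segment.
 * Returns the number of segments (the "preset number"). */
size_t timed_segment_reads(const unsigned char *buf, size_t total_len,
                           size_t seg_len, double *durations,
                           size_t max_segments) {
    /* second preset length * preset number == first preset length */
    assert(seg_len > 0 && total_len % seg_len == 0);
    size_t count = total_len / seg_len;
    assert(count <= max_segments);

    volatile unsigned char sink = 0;  /* keeps the reads from being elided */
    for (size_t s = 0; s < count; ++s) {
        const volatile unsigned char *seg = buf + s * seg_len;
        double t0 = now_seconds();
        for (size_t i = 0; i < seg_len; ++i) {
            sink = seg[i];            /* read only, no write to buf */
        }
        durations[s] = now_seconds() - t0;
    }
    (void)sink;
    return count;
}
```

Calling it with a 1024-byte buffer and a segment length of 64 yields 16 durations, matching the example in the text.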
The memory read operation performance is then determined from the read operation data, that is, from the read duration obtained for each segment. The memory read operation performance may be the shortest duration consumed by a read operation, the longest duration, the average duration, or a read operation performance metric.
Continuing the example, each of the 16 durations is the time taken to read one 64-byte segment. The maximum of the 16 durations is the longest duration consumed by a read operation, the minimum is the shortest duration, and the mean of the 16 values is the average duration. The read operation performance metric may be obtained by dividing the length of each segment by the shortest duration.
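A minimal C sketch of these four quantities is shown below. The struct `access_stats` and the function `summarize_durations` are hypothetical names, and the performance metric is assumed to be the segment length in bytes divided by the shortest duration, which is one reading of the description above.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double min_s;        /* shortest duration, seconds */
    double max_s;        /* longest duration, seconds */
    double avg_s;        /* average duration, seconds */
    double bytes_per_s;  /* assumed metric: seg_len / shortest duration */
} access_stats;

/* Summarizes n per-segment durations for segments of seg_len bytes. */
access_stats summarize_durations(const double *d, size_t n, size_t seg_len) {
    assert(d != NULL && n > 0);
    access_stats st = { d[0], d[0], 0.0, 0.0 };
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (d[i] < st.min_s) st.min_s = d[i];
        if (d[i] > st.max_s) st.max_s = d[i];
        sum += d[i];
    }
    st.avg_s = sum / (double)n;
    if (st.min_s > 0.0) {             /* avoid dividing by a zero reading */
        st.bytes_per_s = (double)seg_len / st.min_s;
    }
    return st;
}
```

The same summary applies unchanged to write durations, since only the meaning of the input array differs.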
If the access is a write operation, the pre-acquired memory data is segmented and each segment is written in turn to obtain write operation data for each segment, for example the time taken to write each segment and/or the time taken to write the complete pre-acquired memory data.
The memory data may be segmented as follows: the memory data of the first preset length is divided into a preset number of memory segments, each of a second preset length, such that the product of the second preset length and the preset number equals the first preset length. Dividing the memory data evenly makes every memory segment the same length, so that the write operation data obtained from the segments can be compared.
For example, if the pre-acquired memory data is 256 bytes long and each segment is 8 bytes, the 256 bytes are divided into 32 segments, that is, written to memory in 32 operations of 8 bytes each. The write operation data may then be the time taken to write each 8-byte segment, giving 32 durations, and these 32 durations are taken as the write operation data.
The memory write operation performance is then determined from the write operation data, that is, from the write duration obtained for each segment. The memory write operation performance may be the shortest duration consumed by a write operation, the longest duration, the average duration, or a write operation performance metric.
Continuing the example, each of the 32 durations is the time taken to write one 8-byte segment to memory. The maximum of the 32 durations is the longest duration consumed by a write operation, the minimum is the shortest duration, and the mean of the 32 values is the average duration. The write operation performance metric may be obtained by dividing the length of each segment by the shortest duration.
Because the scheme performs memory reads and memory writes separately, the throughput measured while a CPU core performs only reads or only writes can come closer to the hardware limit and provides more accurate evaluation data. The resulting performance data can even be used to estimate the performance of network applications such as those built on DPDK (Data Plane Development Kit).
The method for evaluating processor memory access performance disclosed in this embodiment obtains memory data of a first preset length, segments it, accesses each segment in sequence to obtain access data, the access being a read operation or a write operation, and determines the memory read operation performance or the memory write operation performance based on the access data. By reading the memory data segment by segment, or writing it segment by segment, the scheme tests the read performance and the write performance of the memory separately. This avoids the limitation of the prior art, which yields only the combined read-write performance of the memory, and allows the performance of each operation to be determined on its own as a main performance index of the whole machine or the CPU.
This embodiment discloses a method for evaluating processor memory access performance. Its flowchart is shown in FIG. 2, and the method includes:
Step S21, obtaining memory data, wherein the memory data has a first preset length;
Step S22, segmenting the memory data of the first preset length, and accessing one segment of data on each core of the multi-core processor in parallel to obtain access data, wherein different cores of the multi-core processor access different segments of data, and the access is a read operation or a write operation;
Step S23, determining the memory read operation performance or the memory write operation performance based on the access data.
For a multi-core processor with a reduced instruction set architecture, a read operation or a write operation can be executed in parallel on every core, so that multi-core parallel access increases the pressure on the memory and exercises the full performance of the memory controller.
That is, when read operations are being measured, multiple cores of the multi-core processor run at the same time, and all of them perform reads; no core may perform a write. Likewise, when write operations are being measured, multiple cores run at the same time, and all of them perform writes; no core may perform a read. This guarantees that the cores perform only parallel reads or only parallel writes.
Specifically, the low-level implementation of the method calls OpenMP functions and directives to achieve multi-core parallel memory access.
In the low-level implementation, a single iteration of the for loop fetches part of the memory data of the first preset length using the PRFM instruction of ARMv8-A assembly language, so that the memory data can be read segment by segment. Because the main step uses the PRFM assembly instruction, it is more compact than a variable assignment written in C and, according to the scheme, can reduce the number of software instructions by 80%, which puts greater access pressure on the memory and yields a measured throughput closer to the hardware limit.
The scheme uses the OpenMP directive #pragma omp parallel for, so that each loop iteration can be assigned to a separate core of the multi-core CPU. This exploits the potential of the multi-core CPU, puts greater access pressure on the memory, and yields a measured throughput closer to the hardware limit.
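The way `#pragma omp parallel for` distributes segments across cores can be sketched as follows. This is an illustrative assumption, not the patent's code: the function name `parallel_segment_read` is hypothetical, the loop body uses plain C reads rather than assembly, and a checksum is returned only so that the reads cannot be optimized away. When the file is compiled without OpenMP support the directive is ignored and the loop runs on one core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads total_len bytes in segments of seg_len bytes. With OpenMP
 * enabled, different loop iterations (segments) may run on different
 * cores. Every iteration only reads from buf; nothing is written to it. */
uint64_t parallel_segment_read(const unsigned char *buf, size_t total_len,
                               size_t seg_len) {
    assert(seg_len > 0 && total_len % seg_len == 0);
    long count = (long)(total_len / seg_len);
    uint64_t total = 0;

    #pragma omp parallel for reduction(+:total)
    for (long s = 0; s < count; ++s) {
        const unsigned char *seg = buf + (size_t)s * seg_len;
        uint64_t local = 0;
        for (size_t i = 0; i < seg_len; ++i) {
            local += seg[i];          /* read-only access to this segment */
        }
        total += local;               /* combined by the reduction clause */
    }
    return total;
}
```

With GCC or Clang, OpenMP is typically enabled with the `-fopenmp` flag; the `reduction` clause gives each thread a private partial sum, so the loop has no data race.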
It should be noted that existing schemes generally use C when determining memory access performance. When memory read-write code written in C operates on a given physical address, it compiles to several assembly statements, most of which are redundant with respect to the final read or write. Execution efficiency is low, the operation is easily interrupted by operating system process scheduling and interrupts, execution takes longer, and the measured throughput is low. When the same operation is implemented directly in assembly language, only one assembly instruction is needed, execution efficiency is high, and the measured read-write throughput is closer to the real throughput of the memory controller.
In addition, C is a high-level language and cannot control low-level machine operations with complete precision, so it cannot guarantee that only reads, or only writes, are performed at a given time. Assembly language, with suitable statements, can perform only memory reads or only memory writes at a given time.
The method for evaluating processor memory access performance disclosed in this embodiment obtains memory data of a first preset length, segments it, accesses each segment to obtain access data, the access being a read operation or a write operation, and determines the memory read operation performance or the memory write operation performance based on the access data. By reading the memory data segment by segment, or writing it segment by segment, the scheme tests the read performance and the write performance of the memory separately. This avoids the limitation of the prior art, which yields only the combined read-write performance of the memory, and allows the performance of each operation to be determined on its own as a main performance index of the whole machine or the CPU.
This embodiment discloses a method for evaluating processor memory access performance. Its flowchart is shown in FIG. 3, and the method includes:
Step S31, obtaining memory data, wherein the memory data has a first preset length;
Step S32, starting an access at the byte at a first address in the memory data, wherein the length of the data accessed is a second preset length;
Step S33, starting a further access at the byte at a second address in the memory data, wherein the length of the data accessed is the second preset length, and continuing until the memory data of the first preset length has been fully accessed, wherein the difference between the second address and the first address is the second preset length;
Step S34, determining the memory read operation performance or the memory write operation performance based on the access data.
When data is accessed segment by segment, each access is based on the address of the first byte of the memory segment to be accessed. That is, the memory data of the first preset length is segmented in advance and the first-byte address of each memory segment is determined; a CPU core starts the access at that first byte, and each access covers a second preset length, which is exactly one memory segment.
When the CPU has multiple cores, the access can be carried out by several cores: one core accesses one memory segment while another core accesses another memory segment, until every core has been assigned a memory segment or every memory segment has been assigned to a core.
Alternatively, the memory segments may be accessed in order: after the first memory segment has been accessed, the second is accessed, and so on until all memory segments have been accessed.
In the low-level implementation, multi-core parallel operation is achieved through a for loop, and the access itself is performed in assembly language. If the access is a read operation, each iteration of the for loop reads data of the second preset length, until the complete memory data of the first preset length has been read. In each iteration of the for loop, the first-byte address of the corresponding memory segment is used as the base address for the read.
Specifically, the first read starts at the address of the first byte of the first memory segment, addr+0, and its length is the second preset length, which is one cache line, that is, the CPU cache line size of 64 bytes. The next read starts at the address of the first byte of the second memory segment, addr+64, again with the second preset length, and so on until the whole memory data of the first preset length has been read.
The core code may be as follows:
[Code listing: image BDA0003104465340000101 in the original publication; not reproduced in this text.]
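The original read listing is available only as an image, so the following is a hypothetical C sketch of the read loop described above, not a reconstruction of the patent's code. The names `CACHELINE`, `touch_line` and `read_by_cacheline` are assumptions, as is the `pldl1keep` prefetch hint; the inline assembly uses GCC/Clang syntax and is compiled only on AArch64, with a plain volatile read as a fallback on other platforms.

```c
#include <stddef.h>

#define CACHELINE 64  /* second preset length: assumed 64-byte cache line */

/* Touches the cache line starting at p without writing to memory. */
static inline void touch_line(const unsigned char *p) {
#if defined(__aarch64__)
    /* ARMv8-A prefetch; the pldl1keep hint is an assumption */
    __asm__ volatile("prfm pldl1keep, [%0]" : : "r"(p) : "memory");
#else
    (void)*(const volatile unsigned char *)p;  /* portable fallback read */
#endif
}

/* Walks addr+0, addr+64, addr+128, ... until total_len bytes are covered.
 * Returns the number of cache lines touched. */
size_t read_by_cacheline(const unsigned char *addr, size_t total_len) {
    size_t lines = 0;
    for (size_t off = 0; off + CACHELINE <= total_len; off += CACHELINE) {
        touch_line(addr + off);
        ++lines;
    }
    return lines;
}
```

For a 1024-byte region the loop runs 16 times. The sketch is single-threaded; distributing iterations across cores would additionally need an OpenMP directive and a race-free way of counting.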
If the access is a write operation, each iteration of the for loop writes data of the second preset length, until the complete memory data of the first preset length has been written. In each iteration of the for loop, the first-byte address of the corresponding memory segment is used as the base address for the write.
Specifically, the first write starts at the address of the first byte of the first memory segment, addr+0, and its length is the second preset length. The next write starts at the address of the first byte of the second memory segment, addr+8, again with the second preset length, and so on until the whole memory data of the first preset length has been written.
The core code may be as follows:
Figure BDA0003104465340000111
When data is written, a single iteration of the for loop performs the store operation on a piece of memory 256 bytes long using the STUR instruction of the ARMv8-A assembly language, writing the data held in the CPU to the memory bank in 32 writes of 8 bytes each. Because the STUR assembly instruction carries the main step of the memory write, the code is more compact than C-language variable assignment and can reduce the number of software instructions by 80%, which puts greater access pressure on the memory and yields a measured throughput closer to the hardware limit.
In addition, the scheme uses the OpenMP directive "#pragma omp parallel for", so that each loop iteration can be assigned to a separate core of the multi-core CPU. This exploits the potential of the multi-core CPU, puts greater access pressure on the memory, and yields a measured throughput closer to the hardware limit.
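A minimal sketch of the write loop with the OpenMP directive might look like the following. It is not the code of the patent figure: the function name write_all_segments is hypothetical, and a volatile 8-byte C store stands in for the STUR instruction, which is an assumption made for portability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STORE_BYTES 8u /* second preset length in the write example */

/* Write total_bytes (the first preset length) in 8-byte stores.
 * Iteration i writes at addr + i * 8 bytes. Hypothetical helper;
 * the volatile store replaces the STUR assembly instruction. */
void write_all_segments(uint64_t *addr, size_t total_bytes, uint64_t value)
{
    assert(total_bytes % STORE_BYTES == 0);

    volatile uint64_t *base = addr;
    const long stores = (long)(total_bytes / STORE_BYTES);

    /* The iterations are independent, so OpenMP may hand them to
     * different cores. Without OpenMP support the loop runs serially. */
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long i = 0; i < stores; ++i)
        base[i] = value;
}
```

For the 256-byte example this performs 32 stores. Building with an OpenMP flag (for example -fopenmp on GCC) is what allows the iterations to be spread across cores.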
In the method for evaluating the memory access performance of a processor disclosed in this embodiment, memory data of a first preset length is obtained, the memory data of the first preset length is segmented, each segment of data is accessed in turn to obtain access data, where the access is a read operation or a write operation, and the memory read operation performance or the memory write operation performance is determined based on the access data. Because the data in memory is read segment by segment or written segment by segment, the read performance and the write performance of the memory are tested separately. This avoids the limitation of the prior art, which can only obtain the combined read/write performance of the memory, and determines the performance of each operation on its own, thereby determining a main performance indicator of the whole machine or the CPU.
The present embodiment discloses a system for evaluating processor memory access performance, a schematic structural diagram of which is shown in fig. 4, and the system includes:
an obtaining unit 41, an accessing unit 42 and a determining unit 43.
The obtaining unit 41 is configured to obtain memory data, where the memory data is a first preset length;
the access unit 42 is configured to segment the memory data with the first preset length, and sequentially access each segment of data to obtain access data, where the access is a read operation or a write operation;
the determination unit 43 is configured to determine a memory read operation performance or a memory write operation performance based on the access data.
The ARMv8-A CPU is based on a reduced instruction set processor architecture introduced by the UK company ARM and is widely used in high-performance consumer electronics, servers and other fields. As with other CPU platforms, such as x86 or MIPS, memory read/write throughput is one of the main performance indicators of the whole machine or the CPU.
The memory evaluation tool and algorithm currently in wide use in the high-performance CPU field is the Stream scheme and software tool proposed by the computer science department of the University of Virginia in the United States. The current version of the Stream software tool is 5.10, so the tool is referred to as Stream5.10. The main idea of the Stream5.10 algorithm is to use C or another high-level programming language, such as Fortran, to read and write a large segment of contiguous virtual memory multiple times, record the time of each complete read/write pass, take the fastest pass, and calculate the throughput value from the data volume and that execution time.
However, this scheme can only evaluate the combined performance of typical applications that mix reads and writes; it does not reflect the memory read performance or the memory write performance individually. If the CPU issues both read and write operations to a given memory or memory controller at the same time, neither the reads nor the writes can reach their own maximum throughput bandwidth. In addition, in some systems the read and write loads are unbalanced. For example, on a general-purpose processor platform for non-network applications, memory reads are generally considered more frequent than memory writes, which makes it even harder for the above scheme to separate memory read performance from memory write performance.
To solve this problem, in the present scheme the CPU performs read operations and write operations on the memory separately, so as to obtain the memory read operation performance and the memory write operation performance separately.
A segment of memory data of known length, for example 1024 bytes, is obtained in advance, and this pre-stored memory data is accessed segment by segment to obtain access data, where the access can be a read operation or a write operation.
If the access is a read operation, the pre-stored memory data is segmented and each segment of data is then read in turn to obtain read operation data for each segment, for example the duration of reading each segment of data and/or the duration of reading the complete pre-acquired memory data.
Segmenting the memory data may be done as follows: the memory data of the first preset length is divided into a preset number of memory segments, each of a second preset length, where the product of the second preset length and the preset number equals the first preset length. Dividing the memory data of the first preset length evenly into the preset number of memory segments makes every memory segment the same length, so that the read operation data obtained by reading these equal-length memory segments can be compared. If the segments were of different lengths, the read durations would differ even when the same CPU core performed the reads, so the durations would not be comparable; durations obtained by different CPU cores reading segments of different lengths would be equally meaningless.
For example, if the pre-acquired memory data is 1024 bytes long and each segment is 64 bytes long, the 1024 bytes of memory data are divided into 16 segments, that is, read in 16 reads of 64 bytes each. The read operation data may then be the time taken to read each 64-byte segment: 16 durations are obtained, and these 16 durations are taken as the read operation data.
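The segmentation arithmetic can be illustrated with a short C sketch. The helper names segment_count and segment_offset are hypothetical and only restate the relation given in the description: the preset number multiplied by the second preset length equals the first preset length.

```c
#include <assert.h>
#include <stddef.h>

/* Number of equal-length segments (the preset number).
 * The first preset length must be an exact multiple of the second,
 * otherwise the segments would not all have the same length. */
size_t segment_count(size_t first_len, size_t second_len)
{
    assert(second_len > 0 && first_len % second_len == 0);
    return first_len / second_len;
}

/* Byte offset of segment i from the base address, so that the
 * first-byte address of segment i is addr + i * second_len. */
size_t segment_offset(size_t i, size_t second_len)
{
    return i * second_len;
}
```

For the read example, segment_count(1024, 64) gives 16 segments, and the second segment starts at offset 64.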
The memory read operation performance is determined from the read operation data, that is, from the read duration obtained for each segment of data. The memory read operation performance may be the shortest duration consumed by a read operation, the longest duration consumed by a read operation, the average duration consumed by a read operation, or a read operation performance measurement value.
Continuing the above example, 16 durations are obtained, each being the time taken to read one 64-byte segment. The maximum of the 16 durations is the longest duration consumed by a read operation; the minimum of the 16 durations is the shortest duration consumed by a read operation; the mean of the 16 durations is the average duration of a read operation; and the read performance measurement value may be obtained by dividing the length of each segment of data by the shortest duration.
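As a sketch of how these four values could be derived from the per-segment durations, the following C function assumes the durations have already been measured elsewhere and are given in seconds. The type access_stats and the function summarize are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of per-segment access durations (hypothetical type). */
typedef struct {
    double min_s;       /* shortest duration, in seconds */
    double max_s;       /* longest duration, in seconds */
    double avg_s;       /* average duration, in seconds */
    double bytes_per_s; /* segment length divided by the shortest duration */
} access_stats;

/* Derive the shortest, longest and average duration and the
 * performance measurement value from n per-segment durations. */
access_stats summarize(const double *dur_s, size_t n, size_t segment_bytes)
{
    assert(dur_s != NULL && n > 0);

    access_stats st = { dur_s[0], dur_s[0], 0.0, 0.0 };
    double total = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (dur_s[i] < st.min_s) st.min_s = dur_s[i];
        if (dur_s[i] > st.max_s) st.max_s = dur_s[i];
        total += dur_s[i];
    }
    assert(st.min_s > 0.0); /* a zero duration would make the rate undefined */

    st.avg_s = total / (double)n;
    st.bytes_per_s = (double)segment_bytes / st.min_s;
    return st;
}
```

The same function applies unchanged to write durations, since the description defines the write-side values in the same way.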
If the access is a write operation, the pre-stored memory data is segmented and each segment of data is then written in turn to obtain write operation data for each segment, for example the duration of writing each segment of data and/or the duration of writing the complete pre-acquired memory data.
Segmenting the memory data may be done as follows: the memory data of the first preset length is divided into a preset number of memory segments, each of a second preset length, where the product of the second preset length and the preset number equals the first preset length. Dividing the memory data of the first preset length evenly into the preset number of memory segments makes every memory segment the same length, so that the write operation data obtained by writing these equal-length memory segments can be compared.
For example, if the pre-acquired memory data is 256 bytes long and each segment is 8 bytes long, the 256 bytes of memory data are divided into 32 segments, that is, written to the memory bank in 32 writes of 8 bytes each. The write operation data may then be the time taken to write each 8-byte segment: 32 durations are obtained, and these 32 durations are taken as the write operation data.
The memory write operation performance is determined from the write operation data, that is, from the write duration obtained for each segment of data. The memory write operation performance may be the shortest duration consumed by a write operation, the longest duration consumed by a write operation, the average duration consumed by a write operation, or a write operation performance measurement value.
Continuing the above example, 32 durations are obtained, each being the time taken to write one 8-byte segment to the memory bank. The maximum of the 32 durations is the longest duration consumed by a write operation; the minimum of the 32 durations is the shortest duration consumed by a write operation; the mean of the 32 durations is the average duration of a write operation; and the write performance measurement value may be obtained by dividing the length of each segment of data by the shortest duration.
In this scheme, memory read operations and memory write operations are executed separately, so that when the CPU cores perform only reads or only writes the measured throughput can come closer to the hardware limit and more accurate evaluation data can be provided; the resulting performance data can even be used to estimate the performance of network applications such as DPDK.
Further, the accessing unit 42 accesses each piece of data in turn, including: the access unit accesses a segment of data in parallel through each core in the multi-core processor, and different cores in the multi-core processor access different segments of data.
For a multi-core processor in a reduced instruction set processor architecture, a read operation or a write operation can be executed in parallel by each core of the multi-core processor, so that multi-core parallel access increases the memory access pressure and brings out the full performance of the memory controller.
That is, when read operations are performed, multiple cores of the multi-core processor may operate at the same time, but all of them perform read operations and none of them performs a write operation; when write operations are performed, multiple cores operate at the same time, but all of them perform write operations and none of them performs a read operation. This ensures that the cores perform only parallel reads or only parallel writes at any given time.
Specifically, in the underlying implementation of the method for evaluating processor memory performance, multi-core parallel memory access is realized by calling OpenMP functions and directives.
In the underlying implementation, the memory data of the first preset length is fetched by a single iteration of the for loop using the PRFM instruction of the ARMv8-A assembly language, so that the memory data of the first preset length can be read segment by segment. Because the PRFM assembly instruction carries the main step, the code is more compact than C-language variable assignment and can reduce the number of software instructions by 80%, which puts greater access pressure on the memory and yields a measured throughput closer to the hardware limit.
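One possible way to issue a PRFM instruction per segment from C is GCC-style inline assembly, sketched below. The prefetch operation pldl1keep, the 64-byte step and the names prefetch_line and prefetch_all_segments are assumptions chosen for illustration; the description does not state which PRFM variant is used. On targets other than AArch64 the sketch falls back to the compiler builtin __builtin_prefetch, or to a no-op, so that it still compiles.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64u /* second preset length in the read example */

/* Issue one prefetch for the cache line containing p.
 * The PRFM operand (pldl1keep) is an assumed choice. */
static inline void prefetch_line(const void *p)
{
#if defined(__aarch64__) && defined(__GNUC__)
    __asm__ volatile("prfm pldl1keep, [%0]" : : "r"(p));
#elif defined(__GNUC__)
    __builtin_prefetch(p); /* portable stand-in on other targets */
#else
    (void)p;               /* no prefetch available: do nothing */
#endif
}

/* Walk total_bytes in 64-byte steps, one prefetch per segment,
 * and return how many prefetches were issued. */
size_t prefetch_all_segments(const void *addr, size_t total_bytes)
{
    assert(addr != NULL);

    const unsigned char *base = addr;
    size_t issued = 0;
    for (size_t off = 0; off + CACHE_LINE <= total_bytes; off += CACHE_LINE) {
        prefetch_line(base + off); /* segment start: addr + off */
        ++issued;
    }
    return issued;
}
```

Each loop iteration here compiles to little more than the prefetch and the address increment, which is the compactness the description attributes to using the assembly instruction directly.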
The scheme uses the OpenMP directive "#pragma omp parallel for", so that each loop iteration can be assigned to a separate core of the multi-core CPU. This exploits the potential of the multi-core CPU, puts greater access pressure on the memory, and yields a measured throughput closer to the hardware limit.
It should be noted that existing schemes generally use the C language when determining memory access performance. When memory read/write code written in C is compiled, a read or write of a single physical address becomes multiple assembly statements, most of which are unnecessary for the final read or write. Execution efficiency is therefore low, execution time is long, the measured throughput value is low, and the run is more likely to be disturbed by operating-system process scheduling and interrupts. Implemented directly in assembly language, the same operation needs only one assembly instruction, execution efficiency is high, and the measured read/write throughput is closer to the real throughput of the memory controller.
In addition, because C is a high-level language, it cannot control the underlying machine operations with complete precision and cannot guarantee that only read operations, or only write operations, are in progress at a given time; assembly language, with the appropriate statements, can perform only memory reads or only memory writes at a given time.
Further, the access unit 42 is configured to start accessing from a byte in the memory data at the first address, where the length of the currently accessed data is a second preset length; starting access from a byte at a second address in the memory data, wherein the length of the currently accessed data is a second preset length until the memory data access of the first preset length is completed, and the difference value between the second address and the first address is the second preset length.
When the data is accessed segment by segment, each access is based on the first-byte address of the memory segment to be accessed. That is, the memory data of the first preset length is segmented in advance and the first-byte address of each memory segment is determined; a CPU core starts accessing at the byte at that first-byte address, and the current access covers the second preset length of bytes, that is, one memory segment.
When the CPU has multiple cores, the access can be distributed across them: one core accesses one memory segment, another core accesses another memory segment, and so on, until every core has been allocated a memory segment to access or every memory segment has been allocated to a core.
Alternatively, the following may be used: and sequentially accessing each memory segment, namely after the first memory segment is accessed, accessing the second memory segment until all the memory segments are accessed.
In the underlying implementation, multi-core parallel operation is realized through a for loop, and the access itself is executed in assembly language. If the access is a read operation, data of the second preset length is read in each iteration of the for loop until the complete segment of memory data of the first preset length has been read; in each iteration of the for loop, the first-byte address of the corresponding memory segment is used as the reference address for that read.
Specifically, the first read starts at the first-byte address of the first memory segment, namely addr + 0, and its length is the second preset length, namely one cache line; that is, the second preset length is 64 bytes, the size of a CPU cache line. The next read starts at the first-byte address of the second memory segment, namely addr + 64, again with the second preset length, and so on until the whole segment of memory data of the first preset length has been read.
Its core code may be as follows:
Figure BDA0003104465340000171
If the access is a write operation, data of the second preset length is written in each iteration of the for loop until the complete segment of memory data of the first preset length has been written; in each iteration of the for loop, the first-byte address of the corresponding memory segment is used as the reference address for that write.
Specifically, the first write starts at the first-byte address of the first memory segment, namely addr + 0, and its length is the second preset length. The next write starts at the first-byte address of the second memory segment, namely addr + 8, again with the second preset length, and so on until the whole segment of memory data of the first preset length has been written.
Its core code may be as follows:
Figure BDA0003104465340000181
When data is written, a single iteration of the for loop performs the store operation on a piece of memory 256 bytes long using the STUR instruction of the ARMv8-A assembly language, writing the data held in the CPU to the memory bank in 32 writes of 8 bytes each. Because the STUR assembly instruction carries the main step of the memory write, the code is more compact than C-language variable assignment and can reduce the number of software instructions by 80%, which puts greater access pressure on the memory and yields a measured throughput closer to the hardware limit.
In addition, the scheme uses the OpenMP directive "#pragma omp parallel for", so that each loop iteration can be assigned to a separate core of the multi-core CPU. This exploits the potential of the multi-core CPU, puts greater access pressure on the memory, and yields a measured throughput closer to the hardware limit.
In the system for evaluating the memory access performance of a processor disclosed in this embodiment, memory data of a first preset length is obtained, the memory data of the first preset length is segmented, each segment of data is accessed in turn to obtain access data, where the access is a read operation or a write operation, and the memory read operation performance or the memory write operation performance is determined based on the access data. Because the data in memory is read segment by segment or written segment by segment, the read performance and the write performance of the memory are tested separately. This avoids the limitation of the prior art, which can only obtain the combined read/write performance of the memory, and determines the performance of each operation on its own, thereby determining a main performance indicator of the whole machine or the CPU.
The present embodiment discloses a storage medium for storing at least one set of instructions for being called and performing at least the method of evaluating memory access performance of a processor as described in any of the above.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A method for evaluating memory access performance of a processor, comprising:
obtaining memory data, wherein the memory data is a first preset length;
segmenting the memory data with the first preset length, and sequentially accessing each segment of data to obtain access data, wherein the access is read operation or write operation;
determining memory read operation performance or memory write operation performance based on the access data;
the determining of the memory read operation performance or the memory write operation performance based on the access data includes:
determining the shortest time length, the longest time length, the average time length and/or the read operation performance measurement value consumed by executing the read operation on the memory data with the first preset length based on the read operation data;
and determining the shortest time length, the longest time length, the average time length and/or the write operation performance measurement value consumed by performing the write operation on the memory data with the first preset length based on the write operation data.
2. The method according to claim 1, wherein segmenting the memory data of the first preset length and sequentially accessing each segment of data to obtain access data comprises:
dividing the memory data with the first preset length into a preset number of memory segments, wherein the length of each memory segment is a second preset length, and the product of the second preset length and the preset number is equal to the first preset length;
and sequentially accessing each memory segment data to obtain the access data of each memory segment data.
3. The method of claim 1, wherein said accessing each piece of data in turn comprises:
accessing a segment of data in parallel by each core in a multi-core processor, wherein different cores in the multi-core processor access different segments of data.
4. The method according to claim 1, wherein the segmenting the memory data of the first preset length and sequentially accessing each segment of data to obtain access data comprises:
starting access from the byte at the first address in the memory data, wherein the length of the currently accessed data is a second preset length;
starting access from a byte at a second address in the memory data, wherein the length of the currently accessed data is a second preset length until the access of the memory data with the first preset length is completed, and the difference value between the second address and the first address is the second preset length.
5. The method of claim 1, wherein said accessing each piece of data in turn comprises:
each piece of data is accessed in turn by assembly language matched to the multi-core processor.
6. A system for evaluating memory access performance of a processor, comprising:
an obtaining unit, configured to obtain memory data, wherein the memory data is of a first preset length;
an access unit, configured to segment the memory data of the first preset length and sequentially access each segment of data to obtain access data, wherein the access is a read operation or a write operation;
a determination unit, configured to determine memory read operation performance or memory write operation performance based on the access data;
wherein the determining of the memory read operation performance or the memory write operation performance based on the access data comprises:
determining the shortest time length, the longest time length, the average time length and/or the read operation performance measurement value consumed by executing the read operation on the memory data with the first preset length based on the read operation data;
and determining the shortest time length, the longest time length, the average time length and/or the write operation performance measurement value consumed by performing the write operation on the memory data with the first preset length based on the write operation data.
7. The system of claim 6, wherein the access unit is configured to:
dividing the memory data with the first preset length into a preset number of memory segments, wherein the length of each memory segment is a second preset length, and the product of the second preset length and the preset number is equal to the first preset length; and sequentially accessing each memory segment data to obtain the access data of each memory segment data.
8. The system of claim 6, wherein the access unit accesses each piece of data in turn, comprising:
the access unit accesses a segment of data in parallel through each core in the multi-core processor, and different cores in the multi-core processor access different segments of data.
9. A storage medium storing at least one set of instructions;
the set of instructions is for being called and performing at least the method of evaluating processor memory access performance as described in any of the above.
CN202110633327.2A 2021-06-07 2021-06-07 Method and system for evaluating memory access performance of processor Active CN113254321B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110633327.2A CN113254321B (en) 2021-06-07 2021-06-07 Method and system for evaluating memory access performance of processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110633327.2A CN113254321B (en) 2021-06-07 2021-06-07 Method and system for evaluating memory access performance of processor

Publications (2)

Publication Number Publication Date
CN113254321A CN113254321A (en) 2021-08-13
CN113254321B true CN113254321B (en) 2023-01-24

Family

ID=77186864

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110633327.2A Active CN113254321B (en) 2021-06-07 2021-06-07 Method and system for evaluating memory access performance of processor

Country Status (1)

Country Link
CN (1) CN113254321B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567256A (en) * 2011-12-16 2012-07-11 龙芯中科技术有限公司 Processor system, as well as multi-channel memory copying DMA accelerator and method thereof
CN103559079A (en) * 2013-11-15 2014-02-05 深圳市道通科技有限公司 Shared memory based data access method and device
CN104090795A (en) * 2014-07-08 2014-10-08 三星电子(中国)研发中心 Method, system and device for upgrading multi-core mobile terminal
CN105373456A (en) * 2015-11-19 2016-03-02 英业达科技有限公司 Memory testing method for reducing cache hit rate
CN105740164A (en) * 2014-12-10 2016-07-06 阿里巴巴集团控股有限公司 Multi-core processor supporting cache consistency, reading and writing methods and apparatuses as well as device
CN109144419A (en) * 2018-08-20 2019-01-04 浪潮电子信息产业股份有限公司 Solid state disk memory read-write method and system
CN109857342A (en) * 2019-01-16 2019-06-07 盛科网络(苏州)有限公司 A kind of data read-write method and device, exchange chip and storage medium
CN109901956A (en) * 2017-12-08 2019-06-18 英业达科技有限公司 The system and method for memory integrated testability
CN111739577A (en) * 2020-07-20 2020-10-02 成都智明达电子股份有限公司 DSP-based efficient DDR test method
CN112181893A (en) * 2020-09-29 2021-01-05 东风商用车有限公司 Communication method and system between multi-core processor cores in vehicle controller

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5896550A (en) * 1997-04-03 1999-04-20 Vlsi Technology, Inc. Direct memory access controller with full read/write capability
CN103257888A (en) * 2012-02-16 2013-08-21 阿里巴巴集团控股有限公司 Method and equipment for concurrently executing read and write access to buffering queue
US8930633B2 (en) * 2012-06-14 2015-01-06 International Business Machines Corporation Reducing read latency using a pool of processing cores
CN103902467B (en) * 2012-12-26 2017-02-22 华为技术有限公司 Compressed memory access control method, device and system
EP2876557B1 (en) * 2013-11-22 2016-06-01 Alcatel Lucent Detecting a read access to unallocated or uninitialized memory
CN110502190B (en) * 2019-08-28 2023-03-17 上海航天电子通讯设备研究所 File reading and writing method

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567256A (en) * 2011-12-16 2012-07-11 龙芯中科技术有限公司 Processor system, as well as multi-channel memory copying DMA accelerator and method thereof
CN103559079A (en) * 2013-11-15 2014-02-05 深圳市道通科技有限公司 Shared memory based data access method and device
CN104090795A (en) * 2014-07-08 2014-10-08 三星电子(中国)研发中心 Method, system and device for upgrading multi-core mobile terminal
CN105740164A (en) * 2014-12-10 2016-07-06 阿里巴巴集团控股有限公司 Multi-core processor supporting cache consistency, reading and writing methods and apparatuses as well as device
CN105373456A (en) * 2015-11-19 2016-03-02 英业达科技有限公司 Memory testing method for reducing cache hit rate
CN109901956A (en) * 2017-12-08 2019-06-18 英业达科技有限公司 The system and method for memory integrated testability
CN109144419A (en) * 2018-08-20 2019-01-04 浪潮电子信息产业股份有限公司 Solid state disk memory read-write method and system
CN109857342A (en) * 2019-01-16 2019-06-07 盛科网络(苏州)有限公司 Data read-write method and device, switch chip and storage medium
CN111739577A (en) * 2020-07-20 2020-10-02 成都智明达电子股份有限公司 DSP-based efficient DDR test method
CN112181893A (en) * 2020-09-29 2021-01-05 东风商用车有限公司 Communication method and system between multi-core processor cores in vehicle controller

Also Published As

Publication number Publication date
CN113254321A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
US20150339229A1 (en) Apparatus and method for determining a sector division ratio of a shared cache memory
US10268454B2 (en) Methods and apparatus to eliminate partial-redundant vector loads
JP2020087470A (en) Data access method, data access device, apparatus, and storage medium
CN113254322B (en) Method and system for evaluating ultimate throughput performance of Stream system
CN113254321B (en) Method and system for evaluating memory access performance of processor
CN109947667B (en) Data access prediction method and device
Ibrahim et al. Characterizing the relation between Apex-Map synthetic probes and reuse distance distributions
KR101537725B1 (en) Method and system for determining work-group size and computer readable recording medium therefor
JP6145193B2 (en) Read or write to memory
US10102099B2 (en) Performance information generating method, information processing apparatus and computer-readable storage medium storing performance information generation program
US10120602B2 (en) Device and method for determining data placement destination, and program recording medium
JP2005327138A (en) Data base server, program, recording medium, and control method
CN110737509A (en) Thermal migration processing method and device, storage medium and electronic equipment
JP5521687B2 (en) Analysis apparatus, analysis method, and analysis program
CN107943415A (en) The method and system of lookup free cluster based on FAT file system
JP6519228B2 (en) Data allocation determination device, data allocation determination program, and data allocation determination method
US10452368B2 (en) Recording medium having compiling program recorded therein, information processing apparatus, and compiling method
CN111352825B (en) Data interface testing method and device and server
CN107766729B (en) Virus characteristic matching method, terminal and computer readable storage medium
CN110674170A (en) Data caching method, device, equipment and medium based on linked list reverse order reading
JP7168731B1 (en) MEMORY ACCESS CONTROL DEVICE, MEMORY ACCESS CONTROL METHOD, AND MEMORY ACCESS CONTROL PROGRAM
US11442658B1 (en) System and method for selecting a write unit size for a block storage device
TW201317780A (en) Method for identifying memory of virtual machine and computer system using the same
JP6318923B2 (en) Test method, test program, and information processing apparatus
JP7156376B2 (en) OBSERVED EVENT DETERMINATION DEVICE, OBSERVED EVENT DETERMINATION METHOD, AND PROGRAM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20221221

Address after: 201114 Room 603C, Building 8, No. 2388, Chenhang Road, Minhang District, Shanghai

Applicant after: Shanghai Hengwei Intelligent Technology Co.,Ltd.

Address before: 6 / F, building 8, 2388 Chenhang Road, Minhang District, Shanghai, 201114

Applicant before: EMBEDWAY TECHNOLOGIES (SHANGHAI) Corp.

GR01 Patent grant