CN116860256A

CN116860256A - RISC-V base C library-oriented optimization method

Info

Publication number: CN116860256A
Application number: CN202310848297.6A
Authority: CN
Inventors: 张飞; 于佳耕; 武延军
Original assignee: Zhongke Nanjing Software Technology Research Institute; Institute of Software of CAS
Current assignee: Zhongke Nanjing Software Technology Research Institute; Institute of Software of CAS
Priority date: 2023-07-11
Filing date: 2023-07-11
Publication date: 2023-10-10

Abstract

The invention discloses an optimization method for the RISC-V base C library, belonging to the technical field of computer software. The method uses a compiler-predefined macro to make the base instruction set and the RVV extension instruction set coexist, focuses on optimizing the string manipulation functions of the base C library, and provides two assembly implementations: one using only the base instruction set and one using the RVV instruction set. The string manipulation functions implemented with the base instruction set use optimizations such as fine-grained data partitioning, address alignment, loop unrolling, address jumps, and magic numbers to improve performance and efficiency. The string manipulation functions implemented with the RVV extension instruction set use address alignment, vectorization, and other optimizations to improve the execution efficiency of the base C library functions.

Description

RISC-V base C library-oriented optimization method

Technical Field

The invention belongs to the technical field of computer software, and particularly provides an optimization method for a RISC-V base C library.

Background

RISC-V is an open-source instruction set architecture (ISA) based on reduced instruction set computing (RISC), developed mainly at the University of California, Berkeley, by David Patterson, Krste Asanović, and others. The RISC-V architecture is highly extensible, flexible, free, and open, and has attracted wide attention in the computing field. It is highly regarded both in academia and in industry, and is considered an important direction for future processors.

RISC-V has a range of technical features, the most notable of which is its modular functionality, so that RISC-V cores can be implemented using different subsets for different application scenarios. In addition, RISC-V has the characteristics of high performance, low power consumption, easy design and debugging, and the like, and can be widely applied to the fields of embedded systems, servers, and the like. Furthermore, the RISC-V instruction set also follows the open principle, being freely accessible and extensible, meaning that it can be used by anyone and can be developed and optimized according to own needs.

RISC-V has also developed very rapidly. Since it was first published in 2010, it has been supported by numerous businesses and organizations, such as Intel, Apple, Google, ARM, and Huacheng, and has found widespread use in fields such as mobile devices, the Internet of Things, and artificial intelligence. In addition, global investment in processor technology has increased in recent years, driving the rapid development of the RISC-V instruction set architecture. RISC-V is expected to become increasingly important in computer technology as it continues to mature and evolve.

In modern computer systems, C is a widely used programming language, and the base C library is a general-purpose library that provides a set of common functions and constants as a convenient programming interface. The efficiency of the base C library has a significant impact on program performance, which matters especially for applications that perform large-scale data computation and multimedia processing. However, because the RISC-V instruction set differs from other instruction set architectures, existing base C libraries do not fully exploit its performance advantages. For example, existing base C libraries make little use of the vector (V) registers of the RISC-V instruction set and therefore cannot benefit from SIMD instructions. As a result, their performance on RISC-V is relatively weak, and applications that need high-performance computing cannot take full advantage of the RISC-V instruction set.

To solve these problems, a new method for accelerating the base C library is needed, one that makes full use of the performance characteristics of the RISC-V instruction set and improves the performance and efficiency of applications. The base C library can be optimized with vectorized programming: by rewriting its data processing functions as vectorized code, the library can exploit the SIMD instructions of the RISC-V instruction set, which improves its performance on RISC-V.

Disclosure of Invention

To address the technical problems in the prior art, the invention aims to provide an optimization method for the RISC-V base C library. The method uses a compiler-predefined macro to make the base instruction set and the RVV extension instruction set coexist. It focuses on the string manipulation functions of the base C library, provides one assembly implementation that uses only the base instruction set and one that uses the RVV extension instruction set, and optimizes each accordingly, improving the performance and execution efficiency of the base C library functions.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

An optimization method for the RISC-V base C library comprises the following steps:

1) On the RISC-V architecture, use the compiler-predefined macro __riscv_vector to check at compile time whether the vector instruction set extension is supported; if not, perform step 2); if so, perform step 3);

2) Implement the string manipulation functions of the base C library with the base instruction set, and optimize them using compilation optimization methods;

3) Implement the string manipulation functions of the base C library as vectorized code using the RVV extension instruction set, and optimize them.

Further, step 1) checks at compile time whether the vector instruction set extension is supported by testing whether the compiler-predefined macro __riscv_vector is defined: if the macro is defined, the compilation environment supports the vector instruction set extension; if it is not defined, the compilation environment does not.

Further, the string manipulation functions in steps 2) and 3) include strlen, memset, memcpy, and memmove.

Further, the compilation optimization methods in step 2) include fine-grained data partitioning, address alignment, loop unrolling, address jump, and magic number optimization.

Further, the fine-grained data partitioning optimization in step 2) comprises: partitioning memory into fixed-size blocks; for the memset, memcpy, and memmove functions, reading the amount of input data and processing different data amounts in batches.

Further, the double-pointer optimization in step 2) comprises: for the memset function, when the amount of data is smaller than a set threshold, and when processing the tail, computing the first and last store addresses, storing data from both ends, and checking the remaining amount to prevent out-of-bounds accesses, until the store is complete.
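
The two-ended store described above can be sketched in C as follows. This is only an illustration: the patent implements the idea in RISC-V assembly, and the function name fill_small_two_ended is a hypothetical one chosen here.

```c
#include <stddef.h>

/* Hypothetical helper (name not from the patent): fill n bytes at s with c,
 * storing alternately from the front and from the back of the region. */
void fill_small_two_ended(unsigned char *s, unsigned char c, size_t n)
{
    unsigned char *head = s;       /* next byte to write from the front */
    unsigned char *tail = s + n;   /* one past the next byte to write from the back */

    while (head < tail) {
        *head++ = c;               /* store at the front */
        if (head == tail)          /* pointers met: nothing is left to write */
            break;
        *--tail = c;               /* store at the back */
    }
}
```

Comparing the two pointers before each store is what keeps the writes inside the region [s, s+n), including the cases n = 0 and odd n.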

Further, the address alignment optimization in step 2) comprises: for the strlen, memset, memcpy, and memmove functions, checking whether the start address is 8-byte aligned; if not, processing data with byte instructions up to the next 8-byte-aligned address, and then processing data one word at a time in the core loop.

Further, the loop unrolling optimization in step 2) comprises: placing multiple operations of the same type in one loop iteration; for the memset function, the store loop is unrolled 32 times.

Further, the address jump optimization in step 2) comprises: for the memset function, when the amount of data is not enough for 32 consecutive stores, computing the number of unrolled stores needed and the corresponding offset from the first instruction of the loop body, and jumping directly into the loop at that point.
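
In C, jumping into the middle of an unrolled loop can be approximated with a switch whose case labels sit inside the loop body (the construct known as Duff's device). The sketch below is an assumption-laden illustration, not the patented assembly: it unrolls 8 times instead of 32 to stay short, and the name fill_words_unrolled is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store `words` copies of `pattern` at dst using an
 * 8-way unrolled loop (the patent unrolls 32 times). */
void fill_words_unrolled(uint64_t *dst, uint64_t pattern, size_t words)
{
    if (words == 0)
        return;

    size_t passes = (words + 7) / 8;   /* trips through the unrolled body */

    /* Enter the body at the store that leaves exactly the right number of
     * stores in the first, partial pass; the cases fall through on purpose. */
    switch (words % 8) {
    case 0: do { *dst++ = pattern;
    case 7:      *dst++ = pattern;
    case 6:      *dst++ = pattern;
    case 5:      *dst++ = pattern;
    case 4:      *dst++ = pattern;
    case 3:      *dst++ = pattern;
    case 2:      *dst++ = pattern;
    case 1:      *dst++ = pattern;
            } while (--passes > 0);
    }
}
```

In assembly the same effect is obtained by adding a computed offset to the address of the first store instruction and jumping there, so a partial pass needs no separate remainder loop.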

Further, the magic number optimization in step 2) comprises: for the strlen function, locating the terminator by applying logical operations to a 64-bit magic number and the data.
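
The patent does not state which magic numbers are used. A widely known choice is the pair 0x0101010101010101 and 0x8080808080808080, which is assumed in the sketch below; the names has_zero_byte and strlen_magic are likewise hypothetical. Note that the word loop reads up to 7 bytes past the terminator within the same aligned word, which is harmless in assembly but outside what ISO C formally guarantees.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101ULL   /* assumed magic number: 0x01 in every byte */
#define HIGHS 0x8080808080808080ULL   /* assumed magic number: 0x80 in every byte */

/* Nonzero if and only if at least one byte of w is 0x00. */
static inline uint64_t has_zero_byte(uint64_t w)
{
    return (w - ONES) & ~w & HIGHS;
}

size_t strlen_magic(const char *s)
{
    const char *p = s;

    /* Byte by byte until p is 8-byte aligned, so that every word load
     * below stays inside one aligned 8-byte word. */
    while ((uintptr_t)p % 8 != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Test 8 bytes at a time until a word contains the terminator. */
    for (;;) {
        uint64_t w;
        memcpy(&w, p, sizeof w);      /* aligned 8-byte load */
        if (has_zero_byte(w))
            break;
        p += 8;
    }

    /* Locate the terminator inside the final word. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

The subtraction borrows out of any byte that is zero, and the masks isolate that borrow, so one arithmetic test replaces eight byte comparisons.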

Further, in step 3) the string manipulation functions are optimized with an address alignment method comprising: for the strlen, memset, memcpy, and memmove functions, reading the vlenb register to obtain the vector register width, checking whether the start address is aligned to that width, and if not, processing data byte by byte until the address is aligned to the vector register width.

Further, in step 3) the string manipulation functions are optimized with a vectorization method comprising:

setting the vl register to the maximum vector length;

for the strlen function, loading data from memory into a vector register with the vle8.v instruction of the RVV extension instruction set, searching for the terminator element with the vfirst.m instruction, writing its position into a scalar register, and finally combining the position information in the scalar register to obtain the string length;

for the memset function, storing vector data into memory with the vmv.v.x and vse8.v instructions;

for the memcpy and memmove functions, loading data from memory into a vector register with vle8.v and then storing it to memory with vse8.v.

Compared with the prior art, the invention has the following positive effects:

(1) Depending on hardware and compiler support, a developer can freely choose to build the base C library implemented with base instructions or the one implemented with RVV instructions, so that base-instruction and RVV assembly implementations of the same function coexist; this aspect of base C library support has so far received little attention;

(2) The invention specifically addresses the relatively weak performance of the base C library on RISC-V and its failure to exploit the advantages of the RISC-V instruction set, and achieves higher execution efficiency than a conventional base C library.

Drawings

FIG. 1 is a flowchart of the optimization method for the RISC-V base C library according to the present invention.

FIG. 2 is a flowchart of the memset function implemented with base instructions.

FIG. 3 is a flowchart of the memset function implemented with RVV instructions.

Detailed Description

In order to make the technical features and advantages or technical effects of the technical scheme of the invention more obvious and understandable, the following detailed description is given with reference to the accompanying drawings.

The invention provides an optimization method for the RISC-V base C library. The optimization flow is shown in FIG. 1, and the specific optimization steps are described as follows:

1. RVV compatibility scheme

Because the RISC-V instruction set is extensible, functions optimized with vector instructions cannot run on hardware that does not support the vector extension. Linux systems commonly use the hwcap (hardware capabilities) mechanism to detect and identify hardware features. The hwcap mechanism represents different hardware features with a set of bit flags, which are encoded in one or more specific registers or memory locations and set by the operating system kernel at system start-up. Because hwcap support in the current RISC-V kernel is incomplete, and because reading the hardware configuration dynamically at run time adds overhead, the invention instead uses the compiler-predefined macro __riscv_vector to check at compile time whether the vector instruction set extension is supported and thus whether to use the vectorized code path, so that two builds are possible: one containing only the base instruction set and one with RVV optimizations. A compiler-predefined macro is an identifier supplied by the compiler for conditional compilation. Checking whether __riscv_vector is defined determines whether the compilation environment supports the RISC-V vector instruction set extension. If the macro is defined, the environment supports the extension, the vectorized code path can be used, and the compiler can generate vector instructions for more efficient computation. If the macro is not defined, the environment does not support the extension, and the compiler generates code that contains only the base instruction set, without the vectorized path.

2. Base instruction set implementation of the string manipulation functions

String manipulation functions are a common class of functions in the C standard library and include strlen, memset, memcpy, and memmove. The strlen function computes the length of a string; the memset function sets a region of memory to a specified value; the memcpy function copies data from one memory region to another; the memmove function copies data from a source address to a destination address and handles the case where the source and destination regions overlap. The RISC-V instruction set is extensible, reduced and uniform, and hierarchically designed. The invention uses these characteristics together with compilation optimization techniques, including fine-grained data partitioning, address alignment, loop unrolling, address jumps, and magic numbers, to implement more efficient string manipulation functions with base instructions.

The specific steps are as follows:

1) Fine-grained data partitioning. Memory is partitioned into fixed-size blocks to improve processing efficiency. For the memset, memcpy, and memmove functions, the amount of input data is read and different data amounts are processed in batches;

2) Double-pointer optimization. For the memset function, when the amount of data is small, and when processing the tail, the first and last store addresses are computed, data is stored from both ends, and the remaining amount is checked to prevent out-of-bounds accesses, until the store is complete;

3) Address alignment. This avoids the slowdown caused by misaligned accesses. For the strlen, memset, memcpy, and memmove functions, the start address is checked for 8-byte alignment; if it is not aligned, data is processed with byte instructions up to the next 8-byte-aligned address, and the core loop then processes data one word at a time;

4) Loop unrolling. Multiple operations of the same type are placed in one loop iteration, which reduces the number of condition checks and branches in the loop. For the memset function, the store loop is unrolled 32 times to improve execution efficiency;

5) Address jump. For the memset function, when the amount of data is not enough for 32 consecutive stores, the number of unrolled stores needed and the corresponding offset from the first instruction of the loop body are computed, and execution jumps directly into the loop at that point, which achieves the same effect as loop unrolling;

6) Magic number. For the strlen function, logical operations on a 64-bit magic number and the data are used to locate the terminator quickly.

3. RVV extension instruction set implementation of the string manipulation functions

The features of the RVV extension instruction set and its vector instructions make vectorized implementations of the string manipulation functions more concise. The rich set of vector instructions transfers and copies data efficiently between vector registers and between vector and scalar registers, and the vsetvli instruction controls the number of elements processed, which avoids both the separate handling of different data amounts and the tail processing required in the base instruction set implementation. The string manipulation functions implemented with the RVV extension instruction set use optimizations such as address alignment and vectorization to improve performance and efficiency.

The specific steps are as follows:

1) Address alignment. For the strlen, memset, memcpy, and memmove functions, the vlenb register is read to obtain the vector register width, and the start address is checked for alignment to that width; if it is not aligned, data is processed byte by byte until the address is aligned to the vector register width;

2) Vectorization. The vl register is set to the maximum vector length. For strlen, data is loaded from memory into a vector register with the vle8.v instruction of the RVV extension instruction set, the terminator element is searched for with the vfirst.m instruction, its position is written into a scalar register, and the position information in the scalar register is finally combined to obtain the string length. For memset, vector data is stored into memory with the vmv.v.x and vse8.v instructions. For memcpy and memmove, data in memory is loaded into a vector register with vle8.v and then stored to memory with vse8.v.
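
The load/store pattern for memcpy can be sketched with the RVV C intrinsics from riscv_vector.h instead of hand-written assembly. This is an assumption-based illustration: the function name memcpy_sketch is hypothetical, the intrinsic spellings follow the ratified RVV intrinsics naming (__riscv_ prefix), and the scalar branch exists only so that the file also builds where __riscv_vector is not defined.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__riscv_vector)
#include <riscv_vector.h>
#endif

/* Hypothetical sketch: copy n bytes from src to dst (regions must not overlap). */
void *memcpy_sketch(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

#if defined(__riscv_vector)
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e8m8(n);          /* vsetvli with SEW=8, LMUL=8 */
        vuint8m8_t v = __riscv_vle8_v_u8m8(s, vl);   /* vle8.v: load vl bytes */
        __riscv_vse8_v_u8m8(d, v, vl);               /* vse8.v: store vl bytes */
        s += vl;
        d += vl;
        n -= vl;
    }
#else
    while (n-- > 0)                                  /* scalar fallback without the V extension */
        *d++ = *s++;
#endif
    return dst;
}
```

Because vsetvl returns the number of elements actually processed in each pass, the last partial pass needs no separate tail loop. A memmove built this way would additionally have to copy backwards when the destination overlaps the end of the source.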

For the method proposed by the present invention, a specific example is given below:

1. First, the compiler-predefined macro __riscv_vector is used to check at compile time whether the vector instruction set extension is supported, which decides whether the vectorized code path is used. If the compiler defines __riscv_vector, the RVV-optimized memset function of the base C library is compiled; otherwise, the base-instruction-optimized memset function is compiled. The pseudo code is as follows:
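
A minimal C sketch of such a compile-time check is given here as an illustration; it is not the patent's own pseudo code. The macro MEMSET_VARIANT and the function memset_variant are hypothetical names. GCC and Clang define __riscv_vector when the V extension is enabled, for example with -march=rv64gcv.

```c
/* Select the implementation at compile time.  Only one branch survives
 * preprocessing, so no run-time hardware probing is needed. */
#if defined(__riscv_vector)
#define MEMSET_VARIANT "rvv"    /* build the RVV-optimized implementation */
#else
#define MEMSET_VARIANT "base"   /* build the base-instruction implementation */
#endif

/* Hypothetical helper: report which variant this translation unit was built with. */
const char *memset_variant(void)
{
    return MEMSET_VARIANT;
}
```

In an assembly source file that is run through the C preprocessor, the same #if/#else/#endif structure would wrap the two assembly bodies of memset.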

Specific embodiments of the present invention are described in detail below using the memset function. The prototype of memset is void *memset(void *s, int c, size_t n), where s is the start address of the memory to be set, c is the value to set, and n is the number of bytes to set.

2. If the user's compiler supports only the RISC-V base instruction set, the base-instruction-optimized memset function of the base C library is used, as shown in FIG. 2. The implementation steps are as follows:

1) Read the parameters of the memset call and first check whether the amount of input data n is smaller than 16 bytes;

2) If n in step 1) is smaller than 16 bytes, use the double-pointer optimization: store the data c from the head and from the tail of the region, starting at addresses s and s+n respectively, until the store is complete;

5) If n in step 1) is not smaller than 16 bytes, use the address alignment optimization: check whether the start address s is 8-byte aligned; if not, use the sb instruction to store byte by byte up to the next 8-byte-aligned address. After alignment, replicate the fill value c into all 8 bytes of a register using slli shift operations; for example, if c is 0xff, it becomes 0xffffffffffffffff;

6) Use the loop unrolling optimization: store the 8-byte replicated value of c one word at a time, with the store loop unrolled 32 times into a sequence of repeated instructions to improve execution efficiency;

7) Use the address jump optimization: when the remaining amount of data is not enough for 32 consecutive stores, compute the number of unrolled stores needed and the corresponding offset from the first instruction of the loop body, and jump directly into the loop at that point, which achieves the same effect as loop unrolling;

8) For the final remaining data, repeat step 2) and store from both ends until the store is complete.
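
The overall flow of these steps can be sketched in C as below. This is a simplified illustration under stated assumptions rather than the patented assembly: the name memset_base_sketch is hypothetical, the two-ended stores and the 32-way unrolling with a jump into the loop are replaced by plain loops for brevity, and 8-byte stores are written with memcpy so that the C code stays free of aliasing problems (compilers typically turn this into a single store).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the base-instruction memset flow. */
void *memset_base_sketch(void *s, int c, size_t n)
{
    unsigned char *p = s;
    unsigned char byte = (unsigned char)c;

    /* Small sizes: below the 16-byte threshold, store byte by byte
     * (the patent stores from both ends here). */
    if (n < 16) {
        while (n-- > 0)
            *p++ = byte;
        return s;
    }

    /* Head: byte stores until p is 8-byte aligned (at most 7 bytes). */
    while ((uintptr_t)p % 8 != 0) {
        *p++ = byte;
        n--;
    }

    /* Replicate the byte into all 8 bytes of a word with shifts,
     * e.g. 0xff becomes 0xffffffffffffffff. */
    uint64_t word = byte;
    word |= word << 8;
    word |= word << 16;
    word |= word << 32;

    /* Core loop: one aligned 8-byte store per iteration
     * (the patent unrolls this loop 32 times). */
    while (n >= 8) {
        memcpy(p, &word, sizeof word);
        p += 8;
        n -= 8;
    }

    /* Tail: the remaining 0 to 7 bytes. */
    while (n-- > 0)
        *p++ = byte;

    return s;
}
```

Since n is at least 16 when the aligned path is taken and the head consumes at most 7 bytes, at least one full word store always follows the alignment step.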

3. If the user's compiler supports the RISC-V RVV extension instruction set, the RVV-optimized memset function of the base C library is used, as shown in FIG. 3. The implementation steps are as follows:

1) Use the address alignment optimization: read the vlenb register with the csrr instruction to obtain the vector register width, and then check whether the start address s is aligned to that width;

2) If the start address s in step 1) is not aligned, use the sb instruction to store data byte by byte until the address is aligned to the vector register width;

3) If the address in step 1) is aligned, use the vectorization optimization: first set the vl register to the maximum vector length, that is, set SEW to 8 and LMUL to 8; then use the vmv.v.x instruction to copy the data c into a vector register, with the number of elements determined by vsetvli; finally store the vector data to memory in parallel with vse8.v.
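
These three steps can be sketched with the RVV C intrinsics as follows. The sketch rests on several assumptions: the name memset_rvv_sketch is hypothetical, the intrinsic spellings follow the ratified RVV intrinsics naming, the vector register width in bytes is obtained from __riscv_vsetvlmax_e8m1() (which equals the value a csrr read of vlenb returns), and the scalar branch is present only so that the file builds where __riscv_vector is not defined.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__riscv_vector)
#include <riscv_vector.h>
#endif

/* Hypothetical sketch of the RVV memset flow. */
void *memset_rvv_sketch(void *s, int c, size_t n)
{
    uint8_t *p = s;
    uint8_t byte = (uint8_t)c;

#if defined(__riscv_vector)
    /* Vector register width in bytes (VLEN / 8). */
    size_t vlenb = __riscv_vsetvlmax_e8m1();

    /* Steps 1) and 2): byte stores until p is aligned to the register width. */
    while (n > 0 && (uintptr_t)p % vlenb != 0) {
        *p++ = byte;
        n--;
    }

    /* Step 3): vector stores with SEW=8 and LMUL=8. */
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e8m8(n);              /* elements this pass */
        vuint8m8_t v = __riscv_vmv_v_x_u8m8(byte, vl);   /* vmv.v.x: splat c */
        __riscv_vse8_v_u8m8(p, v, vl);                   /* vse8.v: store vl bytes */
        p += vl;
        n -= vl;
    }
#else
    while (n-- > 0)                                      /* scalar fallback without the V extension */
        *p++ = byte;
#endif
    return s;
}
```

With LMUL set to 8, each pass can fill up to eight vector registers' worth of bytes, and the value returned by vsetvl shrinks automatically on the last pass.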

Experimental test:

The invention uses the test cases provided officially by ARM (https://github.com/ARM-software/optimized-routines/tree/master/string/bench) and measures the execution efficiency of the memset function on the gem5 simulator. The random test measures performance with randomly varying data sizes, the medium test with medium data sizes, and the large test with large data sizes; a larger value indicates better performance. The test results show that memset implemented with either base instructions or RVV outperforms the C-language memset in the base C library. Detailed data are shown in Tables 1 to 3.

Table 1. Random memset performance comparison (unit: bytes/ns)

Table 2. Medium memset performance comparison (unit: bytes/ns)

Table 3. Large memset performance comparison (unit: bytes/ns)

Although the present invention has been described with reference to the above embodiments, it should be understood that the invention is not limited thereto, and that modifications and equivalents may be made thereto by those skilled in the art, which modifications and equivalents are intended to be included within the scope of the present invention as defined by the appended claims.

Claims

1. An optimization method for the RISC-V base C library, characterized by comprising the following steps:

2. The method of claim 1, wherein the string manipulation functions of steps 2) and 3) include strlen, memset, memcpy, and memmove.

3. The method of claim 2, wherein the compilation optimization methods in step 2) include a fine-grained data partitioning optimization comprising: partitioning memory into fixed-size blocks; for the memset, memcpy, and memmove functions, reading the amount of input data and processing different data amounts in batches.

4. The method of claim 2, wherein the compilation optimization methods in step 2) include a double-pointer optimization comprising: for the memset function, when the amount of data is smaller than a set threshold, and when processing the tail, computing the first and last store addresses, storing data from both ends, and checking the remaining amount to prevent out-of-bounds accesses, until the store is complete.

5. The method of claim 2, wherein the compilation optimization methods in step 2) include an address alignment optimization comprising: for the strlen, memset, memcpy, and memmove functions, checking whether the start address is 8-byte aligned; if not, processing data with byte instructions up to the next 8-byte-aligned address, and then processing data one word at a time in the core loop.

6. The method of claim 2, wherein the compilation optimization methods in step 2) include a loop unrolling optimization comprising: placing multiple operations of the same type in one loop iteration; for the memset function, the store loop is unrolled 32 times.

7. The method of claim 2, wherein the compilation optimization methods in step 2) include an address jump optimization comprising: for the memset function, when the amount of data is not enough for 32 consecutive stores, computing the number of unrolled stores needed and the corresponding offset from the first instruction of the loop body, and jumping directly into the loop at that point.

8. The method of claim 2, wherein the compilation optimization methods in step 2) include a magic number optimization comprising: for the strlen function, locating the terminator by applying logical operations to a 64-bit magic number and the data.

9. The method of claim 2, wherein optimizing the string manipulation functions in step 3) with an address alignment method comprises: for the strlen, memset, memcpy, and memmove functions, reading the vlenb register to obtain the vector register width, checking whether the start address is aligned to that width, and if not, processing data byte by byte until the address is aligned to the vector register width.

10. The method of claim 2, wherein optimizing the string manipulation functions in step 3) with a vectorization method comprises:

setting the vl register to the maximum vector length;