CN111104093A - Finite field operation method, system, operation device and computer readable storage medium - Google Patents

Finite field operation method, system, operation device and computer readable storage medium

Info

Publication number
CN111104093A
CN111104093A (application CN201811250597.XA)
Authority
CN
China
Prior art keywords
cpu
finite field
data vector
size
instruction set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811250597.XA
Other languages
Chinese (zh)
Inventor
徐祥曦
张炎泼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Baishancloud Technology Co Ltd
Original Assignee
Guizhou Baishancloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Baishancloud Technology Co Ltd filed Critical Guizhou Baishancloud Technology Co Ltd
Priority to CN201811250597.XA priority Critical patent/CN111104093A/en
Publication of CN111104093A publication Critical patent/CN111104093A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/60Methods or arrangements for performing computations using a digital non-denominational number representation, i.e. number representation without radix; Computing devices using combinations of denominational and non-denominational quantity representations, e.g. using difunction pulse trains, STEELE computers, phase computers
    • G06F7/72Methods or arrangements for performing computations using a digital non-denominational number representation, i.e. number representation without radix; Computing devices using combinations of denominational and non-denominational quantity representations, e.g. using difunction pulse trains, STEELE computers, phase computers using residue arithmetic
    • G06F7/724Finite field arithmetic

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention provides a finite field operation method, a finite field operation system, an operation device and a computer readable storage medium. The invention relates to the field of computer technology and addresses the problem that finite field operations consume large amounts of CPU resources. The method comprises the following steps: obtaining cache information of a server CPU; partitioning the vector of an operation task into blocks according to the cache information, to obtain a plurality of data vector blocks no larger than the cache size; and operating on the data vector blocks. The technical scheme provided by the invention is suitable for processor performance optimization and achieves efficient CPU operation.

Description

Finite field operation method, system, operation device and computer readable storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a finite field operation method, a finite field operation system, an operation device, and a computer-readable storage medium.
Background
Finite field operations consume a large amount of CPU time and may become a performance bottleneck in a system. Improving the efficiency of finite field operations on the CPU is therefore a problem of long-standing concern.
Disclosure of Invention
The present invention is directed to solving the problems described above. It is an object of the present invention to provide a finite field arithmetic method, system, arithmetic device, and computer-readable storage medium that solve any one of the above problems.
According to a first aspect of the present invention, a finite field operation method includes:
obtaining cache information of a server CPU;
partitioning the vector of an operation task into blocks according to the cache information, to obtain a plurality of data vector blocks no larger than the cache size;
and operating on the data vector blocks.
Preferably, the size of the data vector block is not larger than the size of the first-level (L1) cache of the CPU.
Preferably, the size of the data vector block is not larger than half of the size of the first-level cache of the CPU.
Preferably, the step of operating on the data vector block includes:
acquiring an optimal instruction set according to the characteristic parameters of the CPU, wherein the optimal instruction set comprises supported extended instructions;
and operating the data vector block by using the optimal instruction set.
Preferably, the step of operating on the data vector block using the optimal instruction set further includes:
and when the operation task is an addition or subtraction operation and is larger than the size of the second-level (L2) cache of the CPU, writing the operation result into memory using a Non-Temporal Hint instruction.
According to still another aspect of the present invention, a finite field computing system includes:
the CPU information acquisition module is used for acquiring cache information of a server CPU;
the blocking module is used for partitioning the vector of an operation task into blocks according to the cache information, to obtain a plurality of data vector blocks no larger than the cache size;
and the operation module is used for operating on the data vector blocks.
Preferably, the operation module includes:
the instruction set acquisition unit is used for acquiring an optimal instruction set according to the characteristic parameters of the CPU, and the optimal instruction set comprises supported extended instructions;
and the instruction operation unit is used for operating the data vector block by using the optimal instruction set.
Preferably, the operation module further includes:
and the memory writing unit is used for writing the operation result into the memory by using a Non-Temporal Hint instruction when the operation task is addition or subtraction operation and the operation task is larger than the size of a second-level cache of the CPU.
According to yet another aspect of the present invention, an arithmetic device comprises a memory, a processor and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the finite field arithmetic method when executing the computer instructions.
According to yet another aspect of the present invention, a computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the above-described finite field operation method.
The invention provides a finite field operation method, system, operation device and computer readable storage medium, which acquire cache information of a server CPU, partition the vector of an operation task into blocks according to the cache information to obtain a plurality of data vector blocks no larger than the cache size, and then operate on the data vector blocks. By partitioning the vector of the operation task and automatically selecting optimal instructions, the operation efficiency of the CPU is improved and the problem of finite field operations consuming large amounts of resources is solved.
Other features and advantages of the invention will become apparent from the following description of exemplary embodiments, read in conjunction with the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention. In the drawings, like reference numerals are used to indicate like elements. The drawings in the following description are directed to some, but not all embodiments of the invention. For a person skilled in the art, other figures can be derived from these figures without inventive effort.
Fig. 1 exemplarily illustrates the flow of the finite field operation method provided by an embodiment of the present invention;
FIG. 2 schematically shows a detailed flow of step 102 in FIG. 1;
FIG. 3 is a diagram illustrating an architecture of a finite field computing system according to an embodiment of the present invention;
fig. 4 exemplarily shows the structure of the operation module 303.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
Finite field operations consume a large amount of CPU time and may become a performance bottleneck in a system. Improving the efficiency of finite field operations on the CPU is therefore a problem of long-standing concern.
Specifically, the existing finite field operation method has the following problems:
1. The latest extended instruction sets are not utilized. For example, AVX2 is not supported by Jerasure, the most widely used finite field library.
2. Secondary development is needed to avoid performance penalties or data errors. For example, the caller may be required to clear registers, or, when NT store is enabled, to execute sfence before reading the encoded data to avoid reading stale data.
3. Multi-core technology is used incorrectly, causing cache pollution and reducing overall I/O throughput. Multi-core processing can raise the efficiency of a single task, but with multiple tasks the competition for the cache lowers overall efficiency; the multi-core single-task computation model should therefore be avoided.
4. The cache is not fully utilized, so I/O efficiency is low: the data needed by the next loop iteration is not brought into the cache in advance.
5. Unnecessary data is wrongly cached, lowering I/O efficiency.
6. No accelerated processing is applied to data that is not aligned to the register width.
In order to solve the above problem, embodiments of the present invention provide a finite field operation method, a finite field operation system, an operation device, and a computer-readable storage medium, where a vector of an operation task to be operated is subjected to block processing, and an optimal instruction is automatically selected, so that the operation efficiency of a CPU is improved, and the problem of large resource consumption of a finite field is solved.
An embodiment of the present invention provides a finite field operation method, where a flow of finite field operation completed by using the method is shown in fig. 1, and the method includes:
and step 101, obtaining cached information of a server CPU.
And 102, carrying out blocking processing on the vectors of the operation tasks according to the cached information to obtain a plurality of data vector blocks which are not larger than the cache size.
In this step, specifically, the size of the data vector block is not greater than the size of the first-level cache of the CPU. Preferably, the size of the data vector block is not larger than half of the size of the CPU level cache. The data vector chunk is specifically a vector chunk (Split vector).
Specifically, the vector to be operated is subjected to blocking processing, and each operation does not exceed the Data vector of the first-level Cache Size (L1Data Cache Size) of 1/2. The speed of the L1Cache is fastest, the block Size range is preferably smaller than that of the L1Data Cache Size, and the best performance is obtained under most conditions according to a large number of tests 1/2L1Data Cache Size.
Step 103: operate on the data vector blocks.
As shown in Fig. 2, where the vertical direction is the data stream and the horizontal direction is the instruction stream, this step includes:
and 1031, acquiring an optimal instruction set according to the characteristic parameters of the CPU, wherein the optimal instruction set comprises supported expansion instructions.
Since not all CPUs contain all extension instructions, support of the operating system is also determined in this step (e.g., some extension instructions may not be supported after virtualization).
After completing splitting vector (Split vector), initializing CPU characteristic parameters, and obtaining an optimal instruction set, wherein the CPU characteristic can be the support condition of an extended instruction, the optimal instruction set is the result selected about the extended instruction, and the result is reused in each operation. The optimal instruction set can be selected according to the register bit width, for example, the instruction set with the widest register bit width is preferentially selected as the optimal instruction set. Specifically, the AVX512 supports a register with a bit width of 512 bits, the AVX2 supports a register with a bit width of 256 bits, and the SSSE3 supports a register with a bit width of 128 bits.
For example, the support of the extended instruction of the CPU can be obtained by the CPU id and the xgetbv instruction, and stored in the data structure, so as to select the optimal instruction set based on this.
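The widest-register-wins selection rule can be sketched as follows. This is a minimal illustration in Python: the feature flags are passed in as booleans rather than read from the cpuid/xgetbv instructions, and the function name `select_instruction_set` is hypothetical, not from the patent.

```python
def select_instruction_set(has_avx512: bool, has_avx2: bool, has_ssse3: bool) -> str:
    """Prefer the extension with the widest registers, as described above:
    AVX512 (512-bit ZMM) > AVX2 (256-bit YMM) > SSSE3 (128-bit XMM)."""
    if has_avx512:
        return "AVX512"   # 512-bit registers
    if has_avx2:
        return "AVX2"     # 256-bit registers
    if has_ssse3:
        return "SSSE3"    # 128-bit registers
    return "NoSIMD"       # scalar fallback
```

The result would be computed once, stored, and reused for every subsequent operation, as the step above describes.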
Step 1032: operate on the data vector blocks using the optimal instruction set.
For example: 1. The support status of extended instructions is obtained through the cpuid and xgetbv instructions and stored in a data structure, according to which the optimal instruction set is selected. The instruction sets supported by the algorithm are AVX512, AVX2 and SSSE3; the AVX512/AVX2/AVX/loop routines are the actual arithmetic units. For the same operation task:
a. When the server supports AVX512, AVX512 instructions are used preferentially; AVX2 is used when fewer than 64 Bytes remain, AVX when fewer than 32 Bytes remain, and the vector is padded when fewer than 16 Bytes remain.
AVX512 supports ZMM registers (512 bits wide), AVX2 supports YMM registers (256 bits wide), and AVX supports some XMM register (128 bits wide) operations. Selecting the widest available register at each step therefore reduces instruction overhead and improves throughput, and it allows the whole operation to run entirely in single instruction, multiple data (SIMD) mode, preventing other code paths from polluting the cache.
b. When the server supports AVX2, AVX2 instructions are used preferentially; AVX is used when fewer than 32 Bytes remain, and the vector is padded when fewer than 16 Bytes remain.
c. When the server supports SSSE3, SSSE3 instructions are used preferentially, and the vector is padded when fewer than 16 Bytes remain.
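The fallback hierarchy in points a to c can be sketched as a planning function: at each step the widest register that still fits the remaining bytes is chosen, and a tail shorter than 16 Bytes is padded. The names `REG_WIDTH` and `plan_block` are hypothetical, introduced only for this illustration; the widths follow the register sizes stated above.

```python
# Register widths in bytes: ZMM = 64 (AVX512), YMM = 32 (AVX2), XMM = 16 (AVX/SSSE3).
REG_WIDTH = {"AVX512": 64, "AVX2": 32, "AVX": 16, "SSSE3": 16}

def plan_block(n_bytes: int, supported: list) -> list:
    """Return the (instruction, bytes) steps that cover one data vector block
    of n_bytes, given the supported ISAs ordered widest-first."""
    steps, remaining = [], n_bytes
    while remaining > 0:
        for isa in supported:
            if remaining >= REG_WIDTH[isa]:
                steps.append((isa, REG_WIDTH[isa]))
                remaining -= REG_WIDTH[isa]
                break
        else:
            # fewer than 16 Bytes left: pad the vector to 16 and truncate afterwards
            steps.append(("PAD", remaining))
            remaining = 0
    return steps
```

For instance, a 100-Byte block on an AVX512 server is covered by one 64-Byte AVX512 step, one 32-Byte AVX2 step, and a padded 4-Byte tail.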
When the operation task is an addition or subtraction operation and is larger than the size of the second-level cache of the CPU, the operation result is written into memory with a cache-bypassing Non-Temporal Hint instruction.
Specifically, when the operation task is addition or subtraction (exclusive-or) and is larger than the second-level cache size (L2 Cache Size), the Non-Temporal Hint technique is used. Since data may be used immediately after an operation, unconditionally bypassing the cache with Non-Temporal Hint would hurt the performance of subsequent code. Finite field multiplication/division is typically applied in matrix operations, which involve a large number of repeated reads of the same data, so using Non-Temporal Hint there may degrade overall performance. The exclusive-or operation does not have this problem, and when the data to be cached is too large, Non-Temporal Hint effectively improves performance. Extensive testing shows that the L2 Cache Size is a suitable threshold for enabling Non-Temporal Hint.
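The resulting decision rule — non-temporal stores only for add/subtract (XOR) tasks larger than the L2 cache — can be expressed as a simple predicate. This is a sketch; the function name `use_non_temporal_store` is hypothetical.

```python
def use_non_temporal_store(op: str, task_bytes: int, l2_cache_size: int) -> bool:
    """NT stores bypass the cache, so they pay off only when the result will
    not be re-read soon: add/sub (XOR) tasks larger than the L2 cache.
    Multiply/divide feeds matrix operations that re-read data, so those
    always go through the cache."""
    return op in ("add", "sub") and task_bytes > l2_cache_size
```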
An embodiment of the present invention further provides a finite field operation method, where a process of completing an operation on an operation task by using the method is shown in fig. 2, and the process includes:
step 201, acquiring CPU Feature, and storing the CPU Feature in a data structure. The data structure defines elements, the element is named as 'cpu' and the value is 0/1/2/3, which respectively represent AVX512/AVX2/SSSE 3/NoSIMD.
Step 202: obtain the CPU L2 Cache Size; when the length of the vector to be operated on is larger than the L2 Cache Size and the operation is a finite field addition, the operation result is written directly into memory with a Non-Temporal Hint store. After using NT store, sfence is executed before the function returns (ret) to ensure that the data the caller reads is correct.
The vector is then split, with a minimum block size of 16 Bytes and a maximum of 16 KB. The splitting algorithm is as follows: let the length of the vector be n; when n is smaller than 16 Bytes, return 16; when n is smaller than 1/2 of the L1 Data Cache Size, return (n >> 4) << 4; otherwise return 1/2 of the L1 Data Cache Size.
Data that does not fill 16 Bytes is padded, and the result is truncated after the operation.
After splitting, the operation can be performed on the split block vectors.
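The splitting rule above can be written out directly. This is a sketch: `split_size` is a hypothetical name, and a 32 KB L1 data cache is assumed for the default value, which is what yields the 16 KB maximum mentioned above.

```python
def split_size(n: int, l1_data_cache_size: int = 32 * 1024) -> int:
    """Block size rule from the embodiment: minimum 16 Bytes, maximum half
    the L1 data cache (16 KB when L1D is 32 KB)."""
    half_l1 = l1_data_cache_size // 2
    if n < 16:
        return 16              # short vectors are padded up to one 16-Byte register
    if n < half_l1:
        return (n >> 4) << 4   # round down to a multiple of 16 Bytes
    return half_l1
```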
Step 203: select the most appropriate instructions according to the CPU Feature to operate on the split block vectors; the operation includes loops handling from 1 Byte up to 64 Bytes. When selecting instructions, platform support for AVX512/AVX2/SSSE3 is checked in order of priority (highest first), and the highest-priority instruction level the platform supports (i.e. the supported instruction set with the widest registers) is selected. All functions that use AVX-family instructions execute vzeroupper before ret to avoid AVX-SSE transition penalties.
Preferably, if the Non-Temporal Hint technique is used, sfence is also executed before ret to guarantee data ordering.
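Putting steps 202 and 203 together, the block-by-block processing for finite field addition (XOR) can be sketched end to end. Plain byte-wise Python stands in for the SIMD arithmetic units, the 32 KB L1 data cache size is an assumption, and the function name `gf2_add` is hypothetical.

```python
def gf2_add(a: bytes, b: bytes, l1_data_cache_size: int = 32 * 1024) -> bytes:
    """XOR two equal-length vectors block by block; each block is at most
    half the L1 data cache, so the operands stay cache-resident."""
    assert len(a) == len(b)
    block = max(16, min(len(a), l1_data_cache_size // 2))
    out = bytearray(len(a))
    for i in range(0, len(a), block):
        chunk_a = a[i:i + block]
        chunk_b = b[i:i + block]
        # In the patent this inner loop is the SIMD unit (AVX512/AVX2/SSSE3);
        # a scalar XOR stands in for it here.
        for j, (x, y) in enumerate(zip(chunk_a, chunk_b)):
            out[i + j] = x ^ y
    return bytes(out)
```

In GF(2^w), addition and subtraction are both XOR, which is why the same routine serves for both.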
An embodiment of the present invention further provides a finite field computing system, whose structure is shown in fig. 3, including:
a CPU information obtaining module 301, configured to obtain cached information of a server CPU;
a blocking module 302, configured to perform blocking processing on a vector of an operation task according to the cached information to obtain a plurality of data vector blocks that are not larger than the cache size;
and the operation module 303 is configured to perform an operation on the data vector block.
Preferably, the structure of the operation module 303 is as shown in fig. 4, and includes:
an instruction set obtaining unit 3031, configured to obtain an optimal instruction set according to the feature parameters of the CPU, where the optimal instruction set includes supported extended instructions;
an instruction operation unit 3032, configured to perform an operation on the data vector block by using the optimal instruction set.
Preferably, the operation module 303 further includes:
and an internal storage unit 3033, configured to write the operation result into an internal memory by using a Non-Temporal Hint instruction when the operation task is an addition or subtraction operation and the operation task is larger than the size of the second-level cache of the CPU.
The embodiment of the present invention further provides an arithmetic device, which includes a memory, a processor, and a computer instruction stored in the memory and executable on the processor, where the processor executes the computer instruction to implement the steps of the finite field arithmetic method provided in any embodiment of the present invention.
Embodiments of the present invention further provide a computer-readable storage medium, which stores computer instructions, and when the computer instructions are executed by a processor, the computer instructions implement the steps of the finite field operation method provided in any embodiment of the present invention.
The embodiments of the present invention provide a finite field operation method, system, operation device and computer readable storage medium, which acquire cache information of a server CPU, partition the vector of an operation task into blocks according to the cache information to obtain a plurality of data vector blocks no larger than the cache size, and then operate on the data vector blocks. By partitioning the vector of the operation task and automatically selecting optimal instructions, the operation efficiency of the CPU is improved and the problem of finite field operations consuming large amounts of resources is solved.
The technical scheme provided by the embodiments of the invention makes full and reasonable use of the CPU cache and implements a simple, efficient algorithm: instruction sets are selected fully automatically, so the optimal instructions are chosen on different platforms; CPU instruction details are hidden, with no intervention required from outer layers; single-core, single-task execution avoids cache pollution; the operation task vector is partitioned to make full use of the cache; caching rules are selected automatically, and throughput is accelerated by cache-bypassing techniques; register-misaligned data is still processed with vector operations, providing acceleration while avoiding cache pollution. The result is optimal performance together with maximum user friendliness.
The above-described aspects may be implemented individually or in various combinations, and such variations are within the scope of the present invention.
Finally, it should be noted that: the above examples are only for illustrating the technical solutions of the present invention, and are not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A finite field operation method, comprising:
obtaining cache information of a server CPU;
partitioning the vector of an operation task into blocks according to the cache information, to obtain a plurality of data vector blocks no larger than the cache size;
and operating on the data vector blocks.
2. The finite field operation method according to claim 1, wherein the size of the data vector block is not larger than the size of the first-level cache of the CPU.
3. The finite field operation method according to claim 2, wherein the size of the data vector block is not larger than one half of the size of the first-level cache of the CPU.
4. The finite field operation method according to claim 1, wherein the step of operating on the data vector block comprises:
acquiring an optimal instruction set according to the characteristic parameters of the CPU, wherein the optimal instruction set comprises supported extended instructions;
and operating the data vector block by using the optimal instruction set.
5. The finite field operation method of claim 4, wherein the step of operating on the data vector chunks using the optimal instruction set further comprises:
and when the operation task is addition or subtraction operation and the operation task is larger than the size of the second-level cache of the CPU, writing an operation result into a memory by using a Non-Temporal Hint instruction.
6. A finite field computing system, comprising:
the CPU information acquisition module is used for acquiring cache information of a server CPU;
the blocking module is used for partitioning the vector of an operation task into blocks according to the cache information, to obtain a plurality of data vector blocks no larger than the cache size;
and the operation module is used for operating on the data vector blocks.
7. The finite field computing system of claim 6, wherein the computing module comprises:
the instruction set acquisition unit is used for acquiring an optimal instruction set according to the characteristic parameters of the CPU, and the optimal instruction set comprises supported extended instructions;
and the instruction operation unit is used for operating the data vector block by using the optimal instruction set.
8. The finite field computing system of claim 7, wherein the computing module further comprises:
and the memory writing unit is used for writing the operation result into the memory by using a Non-Temporal Hint instruction when the operation task is addition or subtraction operation and the operation task is larger than the size of a second-level cache of the CPU.
9. An arithmetic device comprising a memory, a processor and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1-5 when executing the computer instructions.
10. A computer-readable storage medium storing computer instructions, wherein the computer instructions, when executed by a processor, implement the steps of the method of any one of claims 1-5.
CN201811250597.XA 2018-10-25 2018-10-25 Finite field operation method, system, operation device and computer readable storage medium Pending CN111104093A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811250597.XA CN111104093A (en) 2018-10-25 2018-10-25 Finite field operation method, system, operation device and computer readable storage medium


Publications (1)

Publication Number Publication Date
CN111104093A true CN111104093A (en) 2020-05-05

Family

ID=70417596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811250597.XA Pending CN111104093A (en) 2018-10-25 2018-10-25 Finite field operation method, system, operation device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111104093A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1481526A * 2000-12-13 2004-03-10 Cryptographic processor
CN1776598A * 2004-11-19 2006-05-24 International Business Machines Corp RAID environment incorporating hardware-based finite field multiplier for on-the-fly XOR
US8347192B1 * 2010-03-08 2013-01-01 Altera Corporation Parallel finite field vector operators
CN105204820A * 2014-06-26 2015-12-30 Intel Corp Instructions and logic to provide general purpose gf(256) simd cryptographic arithmetic functionality
US20170272237A1 * 2007-12-28 2017-09-21 Intel Corporation Instructions and logic to provide general purpose gf(256) simd cryptographic arithmetic functionality
CN107656832A * 2017-09-18 2018-02-02 Huazhong University of Science and Technology An erasure coding method with low data reconstruction overhead


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhu Min et al.: "Distributed Heterogeneous Parallel Optimization of Boolean Matrix Multiplication", Computer Engineering & Science (《计算机工程与科学》) *

Similar Documents

Publication Publication Date Title
CN109522254B (en) Arithmetic device and method
US10877766B2 (en) Embedded scheduling of hardware resources for hardware acceleration
KR102123633B1 (en) Matrix computing device and method
JP6141421B2 (en) Single data buffer parallel processing
JP5776688B2 (en) Information processing apparatus and task switching method
KR102594657B1 (en) Method and apparatus for implementing out-of-order resource allocation
CN108108190B (en) Calculation method and related product
US20190012176A1 (en) Vector processing using loops of dynamic vector length
CN111078394B (en) GPU thread load balancing method and device
CN111111176B (en) Method and device for managing object LOD in game and electronic equipment
WO2013101042A1 (en) Indicating a length of an instruction of a variable length instruction set
JP6551751B2 (en) Multiprocessor device
CN107870780B (en) Data processing apparatus and method
CN111443949A (en) Kernel memory page copying acceleration method under Feiteng server platform
JP2008310693A (en) Information processor
US20240045787A1 (en) Code inspection method under weak memory ordering architecture and corresponding device
CN108021563B (en) Method and device for detecting data dependence between instructions
WO2018107331A1 (en) Computer system and memory access technology
CN111104093A (en) Finite field operation method, system, operation device and computer readable storage medium
CN112348182A (en) Neural network maxout layer computing device
JP4444305B2 (en) Semiconductor device
JP6488962B2 (en) Cache control device, cache control method, and cache control program
CN112000611A (en) Graph data dividing method, graph data processing method and electronic equipment
US11550572B2 (en) Splitting vector instructions into microinstructions for parallel execution based on index comparisons of completed microinstructions
CN111258657A (en) Pipeline control method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200505

RJ01 Rejection of invention patent application after publication