CN118153494A - Hardware acceleration system for realizing matrix SVD decomposition based on AXI bus - Google Patents

Info

Publication number: CN118153494A
Application number: CN202410578790.5A
Authority: CN (China)
Prior art keywords: matrix, calculation, hardware, SVD, data
Legal status: Pending (the status is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 刘上, 王静
Assignees: Zhenjiang Nanjing University Of Posts And Telecommunications Research Institute; Nanjing University of Posts and Telecommunications
Application filed by the assignees, with priority to CN202410578790.5A

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a hardware acceleration system for realizing matrix SVD decomposition based on an AXI bus. The system comprises a hardware mathematical accelerator, DMA, RAM, registers, a FIFO and an AXI bus module, together with a driver for software and hardware cooperation. In operation, externally written driver instructions direct the system to execute the sub-functions of SVD calculation; data read and write operations take place in the registers and RAM, and data is exchanged with the external system over the AXI bus. The hardware mathematical accelerator contains 4 trigonometric-function modules that can be invoked in parallel during the iterative stage of SVD, shortening the clock cycles needed for calculation. The invention adopts 64 64-bit registers and 4 32KB RAMs, which satisfy the storage requirements of large-scale high-order matrices, and supports both single-precision and double-precision calculation. The functions of the internal hardware acceleration system are configured through write instructions, improving the flexibility and configurability of the system so that different application requirements can be met.

Description

Hardware acceleration system for realizing matrix SVD decomposition based on AXI bus
Technical Field
The invention relates to the field of digital chip design and signal processing, in particular to a hardware acceleration system for realizing matrix SVD decomposition based on an AXI bus.
Background
SVD, short for singular value decomposition, is a matrix decomposition technique. For a matrix A, SVD decomposes it into the product of three matrices: A = UΣV^T, where U and V are referred to as the left and right singular matrices respectively, and Σ is a diagonal matrix whose diagonal elements are called singular values. SVD has wide application in radar, GPS navigation, data dimensionality reduction and many other fields. Its principle can be understood as follows: a relatively complex large-scale matrix is represented as the product of three smaller, simpler sub-matrices that describe the important characteristics of the large matrix. The singular values are the most important part of the SVD: they describe the importance, or weight, of the matrix in each feature direction, and the larger a singular value, the more important the feature space described by the corresponding singular vector. Such matrix factorization techniques can make a data set easier to use, reduce the computational overhead of algorithms, and make results easier to interpret.
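For reference, the decomposition can be written out with its dimensions (standard linear-algebra notation, not specific to this invention): for A ∈ R^(m x n),

A = U Σ V^T,  with  U ∈ R^(m x m),  Σ ∈ R^(m x n),  V ∈ R^(n x n),
U^T U = I_m,  V^T V = I_n,  Σ = diag(σ1, ..., σ_min(m,n)),  σ1 ≥ σ2 ≥ ... ≥ 0.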
The AXI bus is a high-performance, high-bandwidth, low-latency, scalable on-chip bus interface standard, widely used for system-level interconnect inside chips. Within a chip it connects the functional modules, provides high-performance data transmission and communication, supports complex interconnect structures, and offers flow control and error handling; it is one of the key technologies in modern SoC and MCU chip design. The bus is based on a master-slave architecture and provides high scalability and flexibility. AXI defines several independent channels, including read and write address channels, read and write data channels, and a write response channel. The AXI bus is designed for performance, reliability and scalability, so that different hardware modules can communicate through this standardized interface.
At present, SVD decomposition faces several problems in implementation, and its operational performance largely determines the overall performance of a system. As a typical matrix operation, the need to support high-performance, high-order matrix decomposition is also emerging. First, as a computation- and memory-access-intensive operation, SVD decomposition involves a large number of iterative operations; as the matrix scale increases, the computational load and the space-time complexity grow sharply, and the time required for the iterative operations becomes longer. Second, since the original matrix, the left singular matrix, the singular values, the right singular matrix and the intermediate data must all be stored during decomposition, a large amount of memory is occupied, and the storage requirement is especially large for large-scale high-order matrices. In addition, the iterative computation of SVD carries data dependences: successive iterations depend on one another, so the computational task is difficult to decompose fully into independent subtasks for parallel computation, which greatly limits the parallelism of the calculation. Meanwhile, because the computational volume of SVD decomposition is huge, error accumulation during the iterative processing becomes larger and precision loss becomes serious. In summary, matrix SVD decomposition implemented purely in software performs poorly in real-time computation and takes a long time, while a pure hardware implementation must address precision and memory-access problems. These are the problems that SVD decomposition urgently needs to solve in the chip design field, and the pursuit of large-scale, fast and high-precision operation is the key breakthrough sought in current SVD calculation.
Those skilled in the art have proposed many ways to improve hardware implementations of SVD decomposition. First, for parallel SVD operation, researchers have proposed a scalable FPGA engine based on the Hestenes-Jacobi method to accelerate SVD calculation in parallel. The method introduces a new MDS data ordering, which provides higher parallelism for the Jacobi algorithm and maximizes the reuse rate of on-chip data. However, the ordering method is too complex to implement conveniently in a digital chip with hardware Verilog code. Moreover, an FPGA relies mainly on lookup tables for computation at a very fine granularity, and a large amount of its resources is spent on configurable on-chip routing and wiring, which results in low utilization of computing resources; implementing SVD decomposition on an FPGA also brings a series of disadvantages including high power consumption, high cost, limited clock frequency and low integration, so implementing matrix SVD computation on an FPGA is still not an optimal choice. Besides, considering chip running speed and real-time application of chip data, researchers have comprehensively analyzed the common algorithms for computing SVD and proposed a method combining the Jacobi algorithm with the CORDIC algorithm: the Jacobi algorithm needs sine, cosine, square root and arctangent values in each iteration, and these basic trigonometric functions can conveniently be calculated with the CORDIC algorithm, so combining the two yields a new SVD decomposition scheme that is greatly improved in computation speed. In terms of implementation, this is the most effective method for realizing SVD calculation, but it is implemented on an FPGA: although extensible, it has relatively low flexibility and a single application scenario, and if the SVD computation model needs adjustment, the calculation module must be redesigned. Although the FPGA is programmable, its logic and clock resources are limited and cannot reach the performance level of an ASIC; applying that design method directly to an ASIC, a customized chip, would create a performance bottleneck. There are therefore still too many limitations in implementing SVD decomposition in pure hardware.
For intensive operations on large-scale high-order matrices similar to SVD decomposition, many scholars have proposed hardware acceleration methods to improve calculation speed. In the invention "a random-order matrix inversion hardware acceleration system based on Cholesky decomposition in a cyclic iteration mode", the author adopts a cyclic iteration method to replace traditional multiply-accumulate calculation, solves the Cholesky decomposition and the inverse of the triangular matrix, and optimizes the addressing complexity of data reads and writes. The method also adopts a novel matrix multiplication algorithm suitable for triangular matrices, shortens matrix multiplication time, supports inversion of complex matrices of any order from 4 to 256, and finally achieves low hardware complexity and high storage resource utilization. However, in practical engineering applications, especially in matrix SVD decomposition, the matrices to be computed may be real as well as complex, and are by no means limited to triangular matrices. In that invention, one matrix inversion operation is decomposed into three modules: Cholesky decomposition, triangular matrix inversion and triangular matrix multiplication; the original function of SVD decomposition is already huge, and splitting one matrix inversion into several modules inevitably increases the system's operation instructions. Meanwhile, SVD decomposition also requires multiplication of non-triangular matrices, because when SVD calculation is applied to radar and vehicle-mounted GPS navigation, the dimension and precision of the matrix are determined by factors such as the number of visible satellites, measurement errors, the solving algorithm and system dynamics; whether the matrix is triangular or non-triangular, square or non-square, changes continuously with the actual application. Therefore, an independent multiplication module would still have to be designed for non-triangular matrix multiplication, which inevitably leads to more internal accelerator modules, higher design difficulty, larger chip area and higher cost. That invention greatly improves the single function of complex matrix inversion, but cannot adapt to large-scale, general matrix operations such as SVD decomposition; the present invention therefore needs to avoid limitations on functions and data types, achieve higher compatibility, and reduce design modules as much as possible. Besides, in the invention "an overspeed matrix operation coprocessor system", the author proposes a fast operation method for matrix addition and multiplication that is theoretically n times faster than the traditional matrix multiply-add method, with a simple structure, high computational efficiency and strong application pertinence.
However, that invention is limited in data type: the coprocessor system's data bus is 32 bits, so 64-bit data transmission cannot be supported, which limits data precision. Meanwhile, the four "magic cube" DRAM storage arrays each hold 8 x 8 x 32 bits of data and the supported matrix dimension is 8 x 8, which greatly limits the matrix dimension; the DRAM is not large enough, each holding only 2048 bits, namely 0.25KB. That invention is fast for the single operation of matrix multiplication and addition, but only for low-order matrices, which is completely insufficient for the application of SVD decomposition in practical engineering, whether in terms of data bit width or matrix dimension. SVD decomposition is a continuously iterative, intensive operation: if a higher data bit width cannot be guaranteed, the error of the final result becomes large, and to ensure that the SVD decomposition result is representative, a larger matrix dimension must be supported. Therefore, SVD decomposition must support high-order, high-bit-width matrix calculation to ensure the accuracy, availability and representativeness of the final result.
In summary, research on implementing matrix operations in hardware is far from sufficient, and hardware implementations of SVD decomposition are even rarer. Although certain single matrix-operation functions have been realized, great limitations remain in practical application, and these limitations strongly affect large-scale operations such as SVD decomposition, which combine matrix operations, trigonometric function operations, the four arithmetic operations and more. Existing implementations offer low matrix dimensions, small data bit widths and a single data type, and cannot change the calculation flow on demand, so the final result of SVD decomposition can be inaccurate or even wrong. In practical application, to solve these problems of SVD decomposition and calculation and to meet the universality of the operation functions, a method that continuously optimizes and solves the related matrix operations and other operations is still needed.
Disclosure of Invention
The invention aims to overcome the defects in the prior art. Analyzing the research status of SVD decomposition calculation and intensive matrix operation: if current SVD calculation is realized entirely in software, its computational performance is very low and it has no real-time application value; if SVD decomposition is realized in pure hardware, the structure of the hardware circuit is usually fixed, functions and performance are relatively fixed, later modification and improvement are difficult, the SVD calculation steps and implementation method cannot be adjusted after tape-out, and the application limitations are severe; furthermore, SVD decomposition implemented in pure hardware requires deep knowledge of chip functional design and verification, and the development and debugging period increases exponentially. To further optimize SVD decomposition calculation in digital chips, the invention discloses a hardware acceleration system for realizing matrix SVD decomposition based on an AXI bus. It is designed as a mathematical hardware acceleration system with general functions, integrated into an MCU chip, that communicates and exchanges data with other functional modules by means of the AXI bus, so that the complete function of SVD decomposition calculation is realized by instructions while guaranteeing calculation precision and speed. The system is suitable for common hardware platforms such as FPGA or ASIC, and offers better performance, lower power consumption and lower cost in a custom-designed integrated circuit such as an ASIC. It has application value in the current MCU chip design field and facilitates extended application to various scenes.
The invention adopts the following technical scheme to realize a hardware acceleration system:
In the first aspect, parallel computation is accelerated and optimized by hardware mathematics with general functions. The invention adopts a dedicated hardware mathematical accelerator module to accelerate the computation of each sub-function required by SVD decomposition, making full use of hardware parallelism to improve execution efficiency. Realizing the key SVD step functions at the hardware level markedly reduces the processing time of the decomposition. The specific hardware mathematical accelerator modules are as follows:
Module M1: the FUNC functional module is designed based on the CORDIC algorithm. Functionally it mainly provides sine, cosine, arctangent, square root and the other trigonometric-type functions, which covers all of SVD decomposition's requirements for trigonometric calculation. In terms of precision, the module supports 32-bit single-precision and 64-bit double-precision data under the IEEE754 standard, meeting different numerical precision requirements. Meanwhile, there are 4 FUNC functional modules, namely FUNC0, FUNC1, FUNC2 and FUNC3, so several trigonometric functions can be computed in parallel and repeatedly at the same time, which raises calculation speed, greatly reduces the clock cycles of computation, and meets the large-scale iterative calculation demand of SVD decomposition.
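The patent does not give the module's internal RTL, but the CORDIC iterations it builds on are standard shift-and-add rotations; the following is a minimal C model of rotation-mode CORDIC for sine/cosine (an illustrative sketch, not the FUNC module's actual implementation):

#include <math.h>

#define CORDIC_ITERS 32

/* Rotation-mode CORDIC: rotate (1, 0) toward angle theta (convergence for
 * |theta| up to about 1.74 rad). In hardware, the multiplies by 2^-i are
 * wire shifts and atan(2^-i) comes from a small ROM table. */
static void cordic_sincos(double theta, double *s, double *c)
{
    double x = 1.0, y = 0.0, z = theta, gain = 1.0;
    for (int i = 0; i < CORDIC_ITERS; i++) {
        double d = (z >= 0.0) ? 1.0 : -1.0;   /* rotation direction */
        double p = ldexp(1.0, -i);            /* 2^-i */
        double xn = x - d * y * p;
        double yn = y + d * x * p;
        z -= d * atan(p);                     /* ROM lookup in hardware */
        x = xn;
        y = yn;
        gain *= sqrt(1.0 + p * p);            /* accumulated CORDIC gain K */
    }
    *c = x / gain;                            /* cos(theta) */
    *s = y / gain;                            /* sin(theta) */
}

In a fixed-iteration hardware implementation the gain K ≈ 1.6468 is a constant, so it is usually folded into the initial value of x rather than divided out at the end.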
Module M2: the MAT functional module. Functionally it mainly provides matrix transposition, matrix copy, matrix multiplication, matrix addition, matrix initialization, matrix autocorrelation, matrix row-column transformation and other functions. In terms of calculation precision, the module likewise supports 32-bit single-precision and 64-bit double-precision calculation under the IEEE754 standard. In terms of data types, the MAT module supports both complex and real data. In terms of matrix dimension, the MAT module lets the matrix dimension be specified; in the common case of double-precision real calculation, the maximum dimension supported is 64 x 64. It supports not only square matrices but also non-square matrices, triangular or non-triangular matrices, and vectors.
If the operand is single-precision data under the IEEE754 standard, only the low 32 bits of a register are used, reducing register read-write operations; if the data is double-precision, all 64 bits are used. This design flexibly adapts to different operand sizes, making the CPU more flexible and efficient when processing data of different precisions, while saving hardware resources and reducing actual power consumption.
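A minimal sketch of this convention, viewed from the driver side as a 64-bit register image (the type and helper names are illustrative, not from the patent):

#include <stdint.h>
#include <string.h>

typedef uint64_t accel_reg_t;   /* image of one 64-bit accelerator register */

/* Double precision: the value occupies all 64 bits of the register. */
static accel_reg_t pack_double(double v)
{
    accel_reg_t r;
    memcpy(&r, &v, sizeof r);
    return r;
}

/* Single precision: only the low 32 bits of the register are written. */
static accel_reg_t pack_single(float v)
{
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);
    return (accel_reg_t)bits;
}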
Module M3: the Queue functional module performs the related operations on instructions in the FIFO and fetches instructions from the FIFO for the internal hardware mathematical accelerator to execute. The module also contains single-clock-cycle function calculations, mainly conditionally inverting a value, conditionally taking an absolute value, conditionally interrupting an instruction, and related operations, again supporting single and double precision. Because these functions complete in one clock cycle, registers can be operated on directly without passing through a complex state machine, avoiding wasted time and reducing computational complexity.
Second aspect: the invention provides flexibility and configuration options. A software driver can configure the options of the hardware math acceleration system, including calculation precision, data format, matrix dimension, matrix type, function, iteration count, and the selection of register and RAM numbers to use, so as to adapt SVD decomposition calculation to matrices of different orders and data types. This allows flexible application in various scenarios and lets the related algorithm flow be customized to specific requirements, giving the system high flexibility.
There are 4 RAMs in total, namely RAM_A, RAM_B, RAM_C and RAM_D, each of size 32KB, i.e. 256 x 1024 bits, used to store matrix data. The RAM used during SVD calculation can be assigned numbers freely according to the matrix order and the calculation demand, so intermediate and final results of high-order matrix SVD calculation can be accessed and stored in RAM in parallel and without conflict. When matrix calculation is performed, the input matrix is written into RAM in advance; the four data storage formats of a matrix in RAM are as follows:
Single-precision real: bits 0-31 store the (0,0) element, bits 32-63 store the (0,1) element, bits 64-95 store the (0,2) element, and so on;
Double-precision real: bits 0-63 store the (0,0) element, bits 64-127 store the (0,1) element, bits 128-191 store the (0,2) element, and so on;
Single-precision complex: bits 0-31 store the real part of the (0,0) element, bits 32-63 store its imaginary part, bits 64-95 store the real part of the (0,1) element, bits 96-127 store its imaginary part, and so on;
Double-precision complex: bits 0-63 store the real part of the (0,0) element, bits 64-127 store its imaginary part, bits 128-191 store the real part of the (0,1) element, bits 192-255 store its imaginary part, and so on;
The (0,0), (0,1) and (0,2) above denote the row and column coordinates of the matrix. One RAM address accesses 1024 bits of data; when an address would exceed 1024 bits, or when the next row of the matrix begins, a new RAM address is started. Based on this storage mode, the matrix module can conveniently read the real and imaginary parts of several elements at once and process them simultaneously; a sketch of this addressing is given below.
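A minimal C sketch of the element addressing implied by this layout (the helper names are illustrative, not from the patent; note that element widths of 32, 64 and 128 bits all divide the 1024-bit word, so an element never straddles a word boundary):

#include <stdint.h>

#define RAM_WORD_BITS 1024u

/* Bit width of one stored element for the four formats above. */
static uint32_t elem_bits(int is_double, int is_complex)
{
    uint32_t b = is_double ? 64u : 32u;   /* width of the real part */
    return is_complex ? 2u * b : b;       /* complex adds an imaginary part */
}

/* RAM word index and bit offset of element (i, j) of a matrix with 'cols'
 * columns, given that every matrix row starts on a fresh 1024-bit word. */
static void elem_addr(uint32_t i, uint32_t j, uint32_t cols,
                      int is_double, int is_complex,
                      uint32_t *word, uint32_t *bit)
{
    uint32_t eb = elem_bits(is_double, is_complex);
    uint32_t row_bits = cols * eb;
    uint32_t words_per_row = (row_bits + RAM_WORD_BITS - 1) / RAM_WORD_BITS;
    uint32_t off = j * eb;                /* offset of element j inside row i */
    *word = i * words_per_row + off / RAM_WORD_BITS;
    *bit  = off % RAM_WORD_BITS;
}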
Third aspect: the invention optimizes the hardware design so that, when processing large-scale matrices, system resource utilization and speed are maximized. The hardware plays the role of a coprocessor, responsible for calculating and returning results; adopting hardware offloads CPU computing resources while reducing calculation latency. This involves intelligent allocation and management of hardware resources to avoid unnecessary waste while improving computational efficiency.
Fourth aspect: the invention adopts the AXI bus standard in the hardware implementation to integrate the hardware acceleration system realizing the SVD sub-functions seamlessly with other system components; by means of the AXI bus, DMA controls and realizes efficient data exchange between the internal hardware acceleration system and the external storage system.
The method for realizing SVD decomposition and calculation based on the AXI bus comprises the following steps:
step S1: writing a driver instruction on the outside according to a C model of an SVD algorithm, determining a specific matrix address, a register and a RAM number and a specific function serial number for calculation according to requirements, and sequentially executing related operation instructions to realize a complete calculation flow of the SVD;
Step S2: the input value of the driver is written into an internal 64-bit FIFO, and when the internal hardware math accelerator detects that the instruction exists in the FIFO, the instruction is fetched for execution, and the relevant calculation operation is carried out on the input value.
Step S3: during the calculation process, the hardware acceleration system uses internal 64-bit registers, and the internal module can read and write the registers. In addition, the internal part also comprises 4 blocks of RAM, each block is 32KB bytes, and the four RAMs can be accessed by the CPU and internal control logic to store and read data of the matrix.
Step S4: executing the complete instruction and writing the calculated single numerical value or matrix data into a register or RAM with a designated number.
Step S5: DMA moves the internal data from the registers and RAM to the external storage system through the AXI bus interface, completing the whole hardware acceleration of SVD decomposition.
Further, the system flexibly configures the instructions for executing SVD calculation according to the following format; the writing method is:
FIFO_WR_EN = 1;
FIFO[31:0] = (cmd_idx<<0) + (idx_r0<<8) + (idx_r1<<14) +
(idx_r2<<20) + (complex_fmt<<26) + (precision<<27) + (gen_intr<<31);
FIFO[63:32] = (Row_size<<8) + (Col_size<<16) + (inter_num<<24);
Wherein, FIFO_WR_EN set to 1 means the 64-bit FIFO is enabled for a write instruction; cmd_idx is the function number used; idx_r0, idx_r1 and idx_r2 represent the two numerical input ports and the one result output port, respectively; complex_fmt is the data type, 1 for complex and 0 for real; precision is the precision type, 1 for single precision and 0 for double precision; inter_num is the number of iterations; gen_intr requests an interrupt at the end of the command; Row_size and Col_size represent the rows and columns of the input matrix, respectively;
The instruction format, including the bit values, function numbers, matrix dimensions, data precision and types shifted into the instruction and other fields, can be adjusted to match the sub-function modules designed in the hardware acceleration system; the instruction format is configured according to the called function type so that the specific function is started to execute SVD calculation.
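As an illustration, a driver-side helper that packs these fields exactly as laid out above might look as follows (the helper name and types are assumptions, not part of the patent):

#include <stdint.h>

/* Compose the 64-bit instruction word from the fields described above;
 * the low word goes to FIFO[31:0] and the high word to FIFO[63:32]. */
static uint64_t build_svd_cmd(uint32_t cmd_idx, uint32_t idx_r0,
                              uint32_t idx_r1, uint32_t idx_r2,
                              uint32_t complex_fmt, uint32_t precision,
                              uint32_t gen_intr, uint32_t row_size,
                              uint32_t col_size, uint32_t inter_num)
{
    uint32_t lo = (cmd_idx << 0) | (idx_r0 << 8) | (idx_r1 << 14) |
                  (idx_r2 << 20) | (complex_fmt << 26) |
                  (precision << 27) | (gen_intr << 31);
    uint32_t hi = (row_size << 8) | (col_size << 16) | (inter_num << 24);
    return ((uint64_t)hi << 32) | lo;
}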
The principle realized in the invention is as follows:
All sub-functions required in SVD decomposition and calculation are split into different sub-modules and realized in hardware, which simplifies the complexity of SVD decomposition and calculation and facilitates the design and verification of the modules. SVD decomposition uses multiple trigonometric functions, matrix operations and some addition, subtraction, multiplication and division, so these detailed functions are implemented one by one in the hardware acceleration system, which is then integrated into the whole MCU chip. After tape-out, driver instructions written outside the chip configure a series of hardware functions; the AXI bus realizes communication between the external storage system and the internal hardware acceleration system and the exchange of internal and external data; and the corresponding complete SVD decomposition calculation is realized. This meets the demands of configurable, extensible and fast SVD operation, guarantees universality in function, and allows the SVD decomposition instructions to be adjusted as required in real-life scenarios to achieve optimal performance.
Compared with the prior art, the invention has the beneficial effects that:
1. The hardware acceleration system for realizing matrix SVD decomposition based on the AXI bus is more flexible than SVD decomposition realized directly in hardware. By using the AXI bus standard, the hardware acceleration system can easily be integrated with other hardware modules, work seamlessly with other hardware components, and support the cooperation of software and hardware.
2. The invention provides a hardware acceleration system that accelerates and optimizes SVD decomposition and calculation performance. The system comprises 4 parallel FUNC modules plus the MAT and Queue functional modules, allowing several functions to perform parallel calculation and pipelined operation at the same time; this shortens the clock cycles of calculation and improves SVD execution speed, while the system acts as a coprocessor that relieves the burden on the main CPU so the CPU can focus on other tasks.
3. The invention provides configuration options for SVD decomposition and calculation, allowing adjustment to different application scenarios and functional requirements, such as matrix dimension, data precision, function type, data type, iteration count and data storage addresses. This makes the invention suitable for a variety of application environments, providing greater flexibility and configurability in applications.
4. The hardware acceleration system for realizing matrix SVD decomposition based on the AXI bus provides 64 64-bit registers, namely Reg0-Reg63, and 4 32KB RAMs, namely RAM_A, RAM_B, RAM_C and RAM_D, which can meet the storage and read requirements of large-scale matrix decomposition results and intermediate values, effectively avoiding insufficient data storage space or data access conflicts.
5. In the hardware acceleration system for realizing matrix SVD decomposition based on the AXI bus, the functions in the internal hardware mathematical accelerator are general-purpose. As to matrix type, it supports the calculation of square and non-square matrices, of triangular and non-triangular matrices, and of vectors. As to data type, it supports 32-bit single-precision and 64-bit double-precision operation under the IEEE754 standard, meeting the calculation requirements of various complex situations. The invention therefore places few limitations on functions, lets function usage be determined by need, and has good universality.
6. The hardware acceleration system for realizing matrix SVD decomposition based on the AXI bus can be applied to FPGA or ASIC designs; owing to its overall functionality and versatility, it can provide high customization, excellent performance and lower power consumption in ASIC designs. Compared with a general-purpose processor or an FPGA, an ASIC design can achieve higher running speed, lower energy consumption, smaller chip area, higher integration and lower cost, suiting application scenarios that need high efficiency and high performance.
Drawings
FIG. 1 is a system architecture diagram of the hardware acceleration system for implementing matrix SVD decomposition based on an AXI bus according to the present invention;
FIG. 2 is a diagram of the internal structure of the hardware mathematical accelerator in the hardware acceleration system for implementing matrix SVD decomposition based on an AXI bus;
FIG. 3 is a simulation waveform diagram of multiple parallel computations of multiple trigonometric functions based on an AXI bus according to the first embodiment;
FIG. 4 is a register allocation waveform diagram for multiple parallel computations of multiple trigonometric functions based on an AXI bus according to the first embodiment;
FIG. 5 is a simulation waveform diagram of matrix transpose computation based on an AXI bus according to the second embodiment;
FIG. 6 is a waveform diagram showing the assignment of different RAMs for input and output data in matrix transpose computation based on an AXI bus according to the second embodiment;
FIG. 7 is a flow chart of matrix SVD decomposition based on an AXI bus according to the present invention;
FIG. 8 is a simulation waveform diagram of the three decomposition matrix results of matrix SVD decomposition based on an AXI bus according to the third embodiment;
FIG. 9 is a calculation error analysis chart of the final singular value results of matrix SVD decomposition based on an AXI bus according to the third embodiment.
Detailed Description
The present invention is further illustrated in the following drawings and detailed description, which are to be understood as being merely illustrative of the invention and not limiting the scope of the invention. It should be noted that the words "front", "rear", "left", "right", "upper" and "lower" used in the following description refer to directions in the drawings, and the words "inner" and "outer" refer to directions toward or away from, respectively, the geometric center of a particular component.
Those skilled in the art will appreciate that the related functions referred to in the present invention implement one or more of the described steps, measures and schemes, and that the hardware modules may be specially designed and constructed for the required purposes.
Embodiment one: multiple parallel computation of multiple trigonometric functions based on the AXI bus
FIG. 1 shows the system architecture diagram of the hardware acceleration system for implementing matrix SVD decomposition based on an AXI bus; based on this architecture, multiple parallel computation of multiple trigonometric functions is realized by the AXI bus and the internal hardware acceleration system. The implementation writes operation instructions in the driver Hardware_Svd to realize multiple parallel computations of several trigonometric functions; in the Hardware_Svd driver, the function numbers of the trigonometric-type functions COS, SIN, TAN, LN, LOG and SQRT are 32, 33, 34, 45, 46 and 47, respectively. Each of these functions is computed 3 times, except the LOG function (computed 10 times, as detailed below); the calculation precision is designated as the 64-bit double-precision type, and the data type is real. Between the hardware acceleration system and the driver Hardware_Svd is a 64-bit FIFO: the driver writes the instructions realizing the multiple parallel trigonometric computations into the FIFO, and when the hardware mathematical accelerator in the hardware acceleration system detects instructions in the FIFO, it fetches them and executes the related functions. The internal hardware mathematical accelerator module that executes the calculation mainly contains four mathematical library function units, namely FUNC0, FUNC1, FUNC2 and FUNC3; FIG. 2, the internal structure diagram of the hardware mathematical accelerator, shows how these units perform parallel operation when several trigonometric functions are calculated repeatedly, shortening the clock cycles of computation. FIG. 3 shows the simulation waveform of the AXI-bus-based parallel trigonometric computation in this embodiment: for repeated computations, several FUNC modules can be triggered to reduce the clock cycles, and the waveform shows new_func0_code_valid and new_func1_code_valid driven high multiple times, ensuring that when the computation load is large, several FUNC modules operate in parallel. When large-scale iterative operation is needed, new_func0_code_valid, new_func1_code_valid, new_func2_code_valid and new_func3_code_valid can all be driven high, so that the 4 FUNC modules calculate in parallel and the iteration speed increases.
In addition to the 4 FUNC modules, the hardware math accelerator has a Queue module, which can execute instructions in the FIFO and also execute single-cycle function operations. In this embodiment, the driver performs two such functions, data inversion and data absolute value, with function numbers 1 and 2 respectively; each is computed once, the data type is real, and the calculation precision is again the 64-bit double-precision type.
In addition, there are 64 64-bit registers in the hardware. In this embodiment the final trigonometric calculation results are assigned to different registers to avoid conflict and overwriting between input and output data. The COS function is calculated 3 times: input values are stored in registers Reg0, Reg2 and Reg4, and output results in Reg1, Reg3 and Reg5. The SIN function is calculated 3 times: inputs in Reg6, Reg8 and Reg10, outputs in Reg7, Reg9 and Reg11. The TAN function is calculated 3 times: inputs in Reg12, Reg14 and Reg16, outputs in Reg13, Reg15 and Reg17. The LN function is calculated 3 times: inputs in Reg18, Reg20 and Reg22, outputs in Reg19, Reg21 and Reg23. The LOG function is calculated 10 times; since LOG has two inputs and one output (input1 and input2), 30 registers are needed: input1 data are stored in Reg24, Reg27, Reg30, Reg33, Reg36, Reg39, Reg42, Reg45, Reg48 and Reg51; input2 data in Reg25, Reg28, Reg31, Reg34, Reg37, Reg40, Reg43, Reg46, Reg49 and Reg52; and the results in Reg26, Reg29, Reg32, Reg35, Reg38, Reg41, Reg44, Reg47, Reg50 and Reg53. The SQRT function is calculated 3 times: inputs in Reg54, Reg56 and Reg58, outputs in Reg55, Reg57 and Reg59. The data inversion function is calculated once: input in Reg60, output in Reg61. Finally the absolute-value function is applied: input in Reg62, output in Reg63. FIG. 4 is the register allocation waveform diagram for the multiple parallel trigonometric computations of this embodiment; the registers corresponding to op_r0-op_r63, i.e. Reg0-Reg63, realize on-demand allocation of registers for the computed data, fully mobilizing the 64-bit registers in the hardware module while ensuring that input and output data do not collide or overlap during a large-scale data operation; a driver sketch follows.
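As an illustration of how the driver might issue these commands, the following sketch reuses the hypothetical build_svd_cmd() helper from the instruction-format section; the COS function number 32 and the register assignments come from this embodiment, while fifo_write64() is an assumed FIFO-write primitive:

#include <stdint.h>

enum { CMD_COS = 32 };            /* COS function number in Hardware_Svd */

static void issue_cos_three_times(void)
{
    for (int k = 0; k < 3; k++) {
        uint32_t in  = 2u * (uint32_t)k;       /* inputs: Reg0, Reg2, Reg4  */
        uint32_t out = 2u * (uint32_t)k + 1u;  /* outputs: Reg1, Reg3, Reg5 */
        /* real data (complex_fmt = 0), double precision (precision = 0),
         * scalar operation, second input port unused */
        uint64_t cmd = build_svd_cmd(CMD_COS, in, 0, out,
                                     0, 0, 0, 0, 0, 0);
        fifo_write64(cmd);                     /* assumed driver primitive */
    }
}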
The implementation method of the embodiment specifically comprises the following steps:
Step S1: the write command driving is performed externally, and specific function functions, calculation times, input/output register numbers and data precision are determined in the driver.
Step S2: the input numerical value and the operation instruction of the external driver are compiled and operated into an internal 64-bit FIFO, and when the internal math accelerator of the hardware acceleration system detects that the instruction exists in the FIFO, the instruction execution state machine is taken out, the relevant calculation operation is carried out on the input numerical value, and the relevant functions of the instruction are sequentially executed.
Step S3: after the execution of the trigonometric function instruction is completed, the internal state machine writes the input numerical value and the calculated numerical value into 64-bit registers with the designated number.
Step S4: DMA moves the data in the internal hardware acceleration system from the registers to the external storage system through the AXI bus interface, completing the multiple parallel computation of multiple trigonometric functions.
In this embodiment, the registers configurable by the software driver are Reg0-Reg63, and the driver may direct calls to any of the 64-bit configuration registers in the hardware acceleration system to perform the calculations of this embodiment. If the data is single precision, only the low 32 bits of a register are used; if double precision, all 64 bits of registers Reg0-Reg63 are used. It will be appreciated by those skilled in the art that the configured registers may be any one or more of Reg0-Reg63 and the stored data size may be 32 or 64 bits, depending on the function implemented; the application is not limited in this respect. The number and size of the required registers can be adjusted to appropriate values according to actual functional needs, and specific register numbers can be selected as the calculation requires to store input and output data and temporary intermediate data.
Embodiment two: computation of multidimensional matrix transpose based on the AXI bus
Based on the system architecture of the hardware acceleration system for implementing matrix SVD decomposition based on the AXI bus shown in FIG. 1, the operation of matrix transposition over the AXI bus is realized. As before, matrix transposition is realized by writing instructions in the software driver Hardware_Svd; the given data precision is the 64-bit double-precision type, the data type is complex, and the matrix is an 8 x 8 square matrix. Since matrix data is batch data, the initial input and output addresses of the matrix must be set: in this embodiment the initial address of the matrix input data is 0x80000000 and the initial address of the output data is 0x81000000. Given only the initial addresses, the system can load all data according to the configured data type and precision type and store it into the designated RAM of the hardware acceleration system. FIG. 5 shows the simulation waveform of this embodiment, in which the input matrix data is stored in RAM_A and the transposed output data in RAM_C. In the global view of FIG. 5, a_ram_rd_data represents matrix data being read from RAM_A and c_ram_wr_data represents the transposed matrix being written into RAM_C; in the partial view of FIG. 5, since a complex number consists of a real part and an imaginary part and the data is double precision, each matrix element occupies 128 bits, i.e. 32 hexadecimal digits.
In this embodiment, the allocation of RAM may be designated, storing input and output data in specified RAMs. To demonstrate this, the input original matrix is stored in RAM_B, the matrix input data is read out of RAM_B, and the transposed result is stored in RAM_D. In the waveform of FIG. 6 it can be seen that, as the clock cycles advance, data is read on b_ram_rd_data and written on d_ram_wr_data, while a_ram_wr_data and c_ram_wr_data stay at 0, i.e. there is no data movement in RAM_A or RAM_C. Through such flexible allocation, the RAM storing matrix data avoids data conflict and overwriting, satisfying the access of several matrices during complex calculations.
In this embodiment, the RAMs configurable by the software driver Hardware_Svd are RAM_A, RAM_B, RAM_C and RAM_D, and the driver may call any of the 4 32KB RAMs in the hardware acceleration system to perform the calculations of this embodiment. It will be appreciated by those skilled in the art that the configured RAMs may be any one or more of RAM_A, RAM_B, RAM_C and RAM_D, and the size of individual elements of the stored matrix may be 32, 64 or 128 bits depending on the data type and precision type, according to the implementation and complexity of the calculation; the application is not limited in this respect. The required number of RAMs and data storage sizes can be adjusted to appropriate values according to actual functional needs, and specific RAM numbers can be selected as the calculation requires to store input and output matrix data; an illustrative driver fragment follows.
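A sketch of what the driver sequence for this embodiment might look like. The transpose function number and the base-address helpers are hypothetical; only the 8 x 8 shape, the complex double-precision format and the 0x80000000/0x81000000 addresses come from the embodiment text, and build_svd_cmd()/fifo_write64() are the assumed helpers introduced earlier:

#include <stdint.h>

enum { CMD_MAT_TRANSPOSE = 0 };   /* hypothetical function number */

static void transpose_8x8_complex(void)
{
    set_src_base(0x80000000u);    /* assumed: input matrix base address  */
    set_dst_base(0x81000000u);    /* assumed: output matrix base address */
    uint64_t cmd = build_svd_cmd(CMD_MAT_TRANSPOSE,
                                 0, 0, 0,   /* register ports unused here */
                                 1,         /* complex_fmt = 1: complex   */
                                 0,         /* precision = 0: double      */
                                 1,         /* gen_intr: interrupt at end */
                                 8, 8,      /* Row_size, Col_size         */
                                 0);        /* inter_num                  */
    fifo_write64(cmd);
}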
Embodiment three: matrix SVD decomposition based on the AXI bus and the internal hardware acceleration system
In this embodiment, the SVD input matrix is a 4 x 4 matrix, and the execution flow of the SVD decomposition is shown in FIG. 7; the calculation precision is the 64-bit double-precision type and the data type is real. To achieve higher precision, the iteration count of the decomposition calculation is 15. The final results are the left singular matrix, the singular values and the right singular matrix. FIG. 8 gives the results calculated by the hardware acceleration system: the values of c_ram_wr_data, from top to bottom, are the 4 singular values of the singular value matrix Σ, the left singular matrix U, and the transpose V^T of the right singular matrix; the figure also shows the 4 singular values distributed on the main diagonal with all other elements 0.
To demonstrate the accuracy of the hardware acceleration system's results, this embodiment calculates matrices of various dimensions, data sources and precisions and compares the hardware-computed singular values with the software-computed results; as FIG. 9 shows, the calculation results are accurate. In double-precision mode, if the source data is small the error reaches 1e-7 or better, and with larger source data it still reaches 1e-6; in single-precision mode, small source data reaches 1e-5 or better and larger source data still reaches 1e-4. The performance of the hardware SVD decomposition therefore meets the preset target and is sufficient for actual application requirements. Those skilled in the art can select specific RAM numbers to store the input and output matrices according to their own functional needs, adjust the dimension of the input matrix to a suitable value, and adjust the precision in light of actual requirements. In this embodiment, since the hardware simulation platform used by the inventors is Verdi, simulation performance is limited and it is inconvenient to simulate a large-scale high-order matrix, but the function does not deviate from the above description.
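The software cross-check behind FIG. 9 can be sketched as a simple relative-error comparison (an illustrative snippet; the function name and the choice of relative error are assumptions, not the patent's stated method):

#include <math.h>

/* Worst relative error between hardware and software singular values. */
static double max_rel_error(const double *hw, const double *sw, int n)
{
    double worst = 0.0;
    for (int i = 0; i < n; i++) {
        double denom = fabs(sw[i]) > 0.0 ? fabs(sw[i]) : 1.0;
        double e = fabs(hw[i] - sw[i]) / denom;
        if (e > worst)
            worst = e;
    }
    return worst;   /* the embodiment reports ~1e-6 to 1e-7 in double precision */
}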
The embodiments of the invention can effectively solve problems existing in the prior art, optimizing away the high computational complexity, large memory consumption, low calculation speed, poor universality, unsuitability for real-time application, inflexible configuration and inextensibility of prior SVD decomposition. In the hardware acceleration system for realizing matrix SVD decomposition based on the AXI bus, the sub-functions required by SVD calculation are realized in hardware, and SVD calculation speed is improved by combining a driver with the AXI bus to realize the decomposition; at the same time, the hardware acceleration system contains several large storage RAMs and registers, so it suits the needs of large-scale matrix decomposition and resolves data access conflicts. Moving the functional modules down to the hardware level and realizing the complete SVD function through the driver improves the performance of the whole system, with the characteristics of simplicity, efficiency, universality and flexibility. The AXI-bus-based method can realize functions for different application scenarios, and the SVD calculation implementation scheme can be continuously optimized as required.
The technical means disclosed by the scheme of the invention are not limited to those disclosed in the embodiments, and also include technical schemes formed by any combination of the technical features.

Claims (10)

1. A hardware acceleration system for realizing matrix SVD decomposition based on an AXI bus, characterized in that the cooperative work and data exchange between the internal hardware acceleration system and an external system are realized through the AXI bus; the complete flow of SVD decomposition is executed according to instructions; the sub-functions of SVD calculation are realized by the hardware acceleration system; the instructions can be freely configured to match different calculation requirements; and the calculation of the left singular matrix, the right singular matrix and the singular value matrix is finally realized; the hardware acceleration system comprises a hardware mathematical accelerator, a DMA (direct memory access), a RAM (random access memory), registers, a FIFO (first in first out) and an AXI (advanced extensible interface) bus module, wherein the hardware mathematical accelerator comprises 4 FUNC functional modules, a MAT functional module and a Queue functional module, and when the system performs matrix SVD decomposition and calculation the execution flow of each module is as follows: first, the whole SVD decomposition function is stored as instructions in the driver Hardware_Svd, and the instructions are compiled with the gcc compiler and written into the FIFO of the hardware acceleration system; then, when the hardware mathematical accelerator in the hardware acceleration system detects that the FIFO holds instructions, it fetches them, executes the corresponding functional module in the hardware mathematical accelerator, and calls the corresponding sub-functions of the idle FUNC, MAT and Queue modules according to the configuration instructions to calculate; finally, the intermediate or final results calculated in the hardware mathematical accelerator are stored in the designated registers and RAM for the external system to read.
2. The hardware acceleration system for implementing matrix SVD decomposition based on an AXI bus according to claim 1, wherein the system has 64 registers, namely Reg0-Reg63, and the registers used in the iterative SVD calculation are assigned numbers according to the matrix order and the calculation requirement, realizing parallel, conflict-free access to a large number of values.
3. The hardware acceleration system for implementing matrix SVD decomposition based on an AXI bus according to claim 1, wherein there are 4 RAMs in total, namely RAM_A, RAM_B, RAM_C and RAM_D, each of size 32KB, i.e. 256 x 1024 bits, for storing matrix data; the RAM used during SVD calculation can be assigned numbers freely according to the matrix order and the calculation demand, so intermediate and final results of high-order matrix SVD calculation can be accessed and stored in RAM in parallel and without conflict; when matrix calculation is performed, the input matrix is written into RAM in advance, and the four data storage formats of a matrix in RAM are as follows:
Single-precision real: bits 0-31 store the (0,0) element, bits 32-63 store the (0,1) element, bits 64-95 store the (0,2) element, and so on;
Double-precision real: bits 0-63 store the (0,0) element, bits 64-127 store the (0,1) element, bits 128-191 store the (0,2) element, and so on;
Single-precision complex: bits 0-31 store the real part of the (0,0) element, bits 32-63 store its imaginary part, bits 64-95 store the real part of the (0,1) element, bits 96-127 store its imaginary part, and so on;
Double-precision complex: bits 0-63 store the real part of the (0,0) element, bits 64-127 store its imaginary part, bits 128-191 store the real part of the (0,1) element, bits 192-255 store its imaginary part, and so on;
The (0,0), (0,1) and (0,2) above denote the row and column coordinates of the matrix; one RAM address accesses 1024 bits of data, and when an address would exceed 1024 bits or the next row of the matrix begins, a new RAM address is started; based on this storage mode, the matrix module can conveniently read the real and imaginary parts of several elements at once and process them simultaneously.
4. The hardware acceleration system for implementing matrix SVD decomposition based on an AXI bus according to claim 1, wherein the FUNC and Queue modules in the system's hardware mathematical accelerator support single- and double-precision real calculation under the IEEE754 standard, the MAT module supports single- and double-precision calculation of both real and complex types, and the data type and precision can be configured as required.
5. The hardware acceleration system according to claim 4, wherein if the operand is single-precision data under the IEEE754 standard, only the low 32 bits of a register are used, reducing register read-write operations; in the case of double-precision data, all 64 bits are used.
6. The hardware acceleration system for realizing matrix SVD decomposition based on an AXI bus according to claim 1, wherein the system's hardware mathematical accelerator has 4 parallel modules, namely FUNC0, FUNC1, FUNC2 and FUNC3; when the system performs SVD iterative operation, if FUNC0 is busy then FUNC1 is called, and if FUNC1 is not idle then FUNC2 and FUNC3 are called in turn, so that the parallel operation of large-scale data is satisfied during high-order matrix iteration, the iterative process of the SVD operation is kept out of a waiting state, and SVD decomposition is accelerated.
7. The hardware acceleration system according to claim 1, wherein the matrix calculation dimensions supported by the MAT module in the hardware mathematical accelerator include 2 x 2, 12 x 16, 32 x 32 and many other dimensions; for the common double-precision real matrix calculation, the maximum dimension supported is 64 x 64, and the matrix type may be a square or non-square matrix, a triangular or non-triangular matrix, or a vector operation.
8. The hardware acceleration system for implementing matrix SVD decomposition based on an AXI bus according to claim 1, wherein the hardware mathematical accelerator in the system supports various trigonometric functions, matrix and vector operations, floating-point addition, subtraction, multiplication and division and other types of computation, numbers all sub-functions individually, and can interrupt a sub-function instruction after execution and proceed to the next sub-function calculation; this design makes it convenient to write instructions in the driver that continuously call different functions to execute SVD calculation without function conflicts, avoiding instructions being overwritten or failing to execute, and every sub-function is general rather than limited to special cases.
9. A method for implementing hardware-accelerated matrix SVD decomposition based on an AXI bus, characterized by comprising the following steps (a driver-side sketch follows step S6):
Step S1: externally write the driver instruction stream according to the SVD computation principle, determining as required the specific data addresses, the register and RAM numbers, and the serial numbers of the functions used for computation, and realize the complete SVD decomposition flow by writing the related operation instructions;
Step S2: compile the driver instruction stream with gcc; the CPU writes the instructions into the internal 64-bit FIFO, and when the internal hardware mathematical accelerator detects instructions in the FIFO, it fetches and executes them in order;
Step S3: according to each instruction, the internal hardware mathematical accelerator calls the corresponding sub-functions in the 4 parallel FUNC modules and the MAT and Queue modules and performs the related computation on the input values, including matrix initialization, computation of the two matrix products A·Aᵀ and Aᵀ·A, Jacobi iterative computation, and taking the square roots of the singular values;
Step S4: during computation, the hardware acceleration system uses the internal 64-bit registers, which the internal modules can read and write; the system also contains 4 RAMs of 32 KB each, accessible by both the CPU and the internal control logic for storing and reading matrices;
Step S5: after a complete instruction has executed, write the computed intermediate values, singular values, left singular matrix and right singular matrix into the registers or RAM with the designated numbers;
Step S6: the DMA control module moves the internal data from the designated registers and RAM to the external storage system through the AXI bus interface, completing the whole hardware-accelerated SVD decomposition.
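A minimal driver-side sketch of steps S1-S2 and the completion wait (the MMIO addresses, register layout and command numbers below are illustrative assumptions, not the claimed interface):

#include <stdint.h>

/* Hypothetical memory-mapped addresses of the accelerator's FIFO port. */
#define ACC_BASE     0x40000000u
#define FIFO_LO      (*(volatile uint32_t *)(ACC_BASE + 0x00)) /* bits 31:0  */
#define FIFO_HI      (*(volatile uint32_t *)(ACC_BASE + 0x04)) /* bits 63:32 */
#define FIFO_WR_EN   (*(volatile uint32_t *)(ACC_BASE + 0x08))
#define STATUS       (*(volatile uint32_t *)(ACC_BASE + 0x0C))
#define STATUS_DONE  0x1u

/* Push one 64-bit instruction word into the accelerator FIFO. */
static void push_insn(uint64_t insn) {
    FIFO_LO = (uint32_t)(insn & 0xFFFFFFFFu);
    FIFO_HI = (uint32_t)(insn >> 32);
    FIFO_WR_EN = 1;              /* enable the write into the 64-bit FIFO */
}

int main(void) {
    /* S1: the driver is a sequence of sub-function instructions, e.g.
     *   matrix init -> A·Aᵀ and Aᵀ·A -> Jacobi iterations -> sqrt.
     * The constants stand for hypothetical sub-function numbers. */
    push_insn(0x0000000000000001ull);  /* cmd: matrix initialization   */
    push_insn(0x0000000000000002ull);  /* cmd: A·Aᵀ and Aᵀ·A products  */
    push_insn(0x0000000000000003ull);  /* cmd: Jacobi iteration        */
    push_insn(0x0000000000000004ull);  /* cmd: singular-value sqrt     */

    while (!(STATUS & STATUS_DONE))
        ;                        /* S5/S6: wait, then DMA moves results */
    return 0;
}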
10. The method for implementing matrix SVD decomposition hardware acceleration based on an AXI bus according to claim 9, wherein the instructions with which the system performs the SVD computation are flexibly configured in the following format, written as:
FIFO_WR_EN = 1;
FIFO[31:0]  = (cmd_idx << 0) + (idx_r0 << 8) + (idx_r1 << 14) +
              (idx_r2 << 20) + (complex_fmt << 26) + (precision << 27) +
              (gen_intr << 31);
FIFO[63:32] = (Row_size << 8) + (Col_size << 16) + (inter_num << 24);
Wherein setting FIFO_WR_EN to 1 enables a write of the instruction into the 64-bit FIFO, and cmd_idx is the number of the function being used; idx_r0 and idx_r1 designate the two value input ports and idx_r2 the result output port; complex_fmt is the data type, 1 for complex and 0 for real; precision is the precision type, 1 for single precision and 0 for double precision; inter_num is the number of iterations; gen_intr requests an interrupt operation when the instruction finishes; Row_size and Col_size are, respectively, the number of rows and columns of the input matrix;
The instruction format, including the bit values and left-shift positions of each field, the function numbers, the matrix dimensions, and the data precision and type, can be reasonably adjusted to match the sub-function modules designed into the hardware acceleration system; the instruction is configured according to the type of the function being called, so that the specific function is started to execute the SVD computation.
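Packed in C, the format above reads as follows (a sketch that simply restates the shifts of claim 10; the function name and the example command number are illustrative):

#include <stdint.h>

/* Pack one 64-bit accelerator instruction per the field shifts above. */
static uint64_t make_insn(uint32_t cmd_idx, uint32_t idx_r0, uint32_t idx_r1,
                          uint32_t idx_r2, uint32_t complex_fmt,
                          uint32_t precision, uint32_t gen_intr,
                          uint32_t row_size, uint32_t col_size,
                          uint32_t inter_num) {
    uint32_t lo = (cmd_idx << 0) | (idx_r0 << 8) | (idx_r1 << 14) |
                  (idx_r2 << 20) | (complex_fmt << 26) | (precision << 27) |
                  (gen_intr << 31);
    uint32_t hi = (row_size << 8) | (col_size << 16) | (inter_num << 24);
    return ((uint64_t)hi << 32) | lo;
}

/* Example: single-precision real 32x32 computation, 10 iterations,
 * interrupt on completion; command number 3 is a hypothetical placeholder:
 *   uint64_t insn = make_insn(3, 0, 1, 2, 0, 1, 1, 32, 32, 10); */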

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410578790.5A CN118153494A (en) 2024-05-11 2024-05-11 Hardware acceleration system for realizing matrix SVD decomposition based on AXI bus

Publications (1)

Publication Number Publication Date
CN118153494A true CN118153494A (en) 2024-06-07

Family

ID=91292238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410578790.5A Pending CN118153494A (en) 2024-05-11 2024-05-11 Hardware acceleration system for realizing matrix SVD decomposition based on AXI bus

Country Status (1)

Country Link
CN (1) CN118153494A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723336A (en) * 2020-06-01 2020-09-29 南京大学 Cholesky decomposition-based arbitrary-order matrix inversion hardware acceleration system adopting loop iteration mode
CN116382617A (en) * 2023-06-07 2023-07-04 之江实验室 Singular value decomposition accelerator with parallel ordering function based on FPGA
CN117093538A (en) * 2023-08-21 2023-11-21 华南理工大学 Sparse Cholesky decomposition hardware acceleration system and solving method thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHU Fangjie (朱方杰): "Research on SoPC Design Technology of an Image Matching System Based on the SIFT-RANSAC Algorithm", China Master's Theses Full-text Database, Information Science and Technology, vol. 2020, no. 06, 15 June 2020 (2020-06-15), pages 3-6 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination