CN108763653B

CN108763653B - Reconfigurable linear equation set solving accelerator based on FPGA

Info

Publication number: CN108763653B
Application number: CN201810412917.0A
Authority: CN
Inventors: 潘红兵; 苏岩; 秦子迪; 何书专; 李丽; 李伟
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2018-04-30
Filing date: 2018-04-30
Publication date: 2022-04-22
Anticipated expiration: 2038-04-30
Also published as: CN108763653A

Abstract

The invention provides an FPGA-based reconfigurable accelerator for solving systems of linear equations, comprising: a data distribution module, which distributes data from the internal memory to the calculation array module and adjusts the data distribution mode, under the control of the main control module, according to the scale and type of the input coefficient matrix; a main control module, which controls the operation of the data distribution module, the reconstruction control module and the calculation array module, as well as the communication among these modules; a reconstruction control module, which resets the calculation mode according to the scale and type of the coefficient matrix; an internal memory module, which stores the coefficient matrix and the vector data; and a calculation array module, which computes the solution of the linear equation system. The reconfiguration method of the invention can adjust the storage and transmission modes of data at the same time and can adopt different operation modes in scenarios with different requirements on computing resources and precision, so it is more general than existing accelerators for solving linear equation systems.

Description

Reconfigurable linear equation set solving accelerator based on FPGA

Technical Field

The invention belongs to the technical field of data processing, and relates to a linear equation system solving technology.

Background

Solving systems of linear equations is part of matrix computation and lies at the core of scientific and engineering computing. It is a computation-intensive task that arises widely in fields such as data mining, signal processing and numerical approximation.

Large-scale linear equation systems generally fall into two types: those whose coefficient matrix is sparse and those whose coefficient matrix is dense. When solving a sparse linear equation system, the coefficient matrix contains a large number of zero elements, so an iterative method is used; this saves considerable computing resources, and the solution vector approaches the exact solution over multiple iterations. When solving a dense linear equation system, LU decomposition is generally used to obtain an exact solution. Because the two algorithms differ, large-scale linear equation systems are usually solved in software on general-purpose processors in high-performance servers.

An FPGA has a large number of computing components and can process large-scale repeated computations in parallel pipelines, which parallelizes the algorithms and makes it a good choice for accelerating the solution of large-scale linear equation systems. By combining FPGA reconfigurable technology with designs tailored to the different algorithms, a general solving approach can be realized.

Disclosure of Invention

To solve the above problems and make the solving approach general, the invention provides an FPGA (field programmable gate array) based general-purpose accelerator for solving sparse and dense linear equation systems using reconfigurable technology, together with a reconfiguration method for the two types of linear equation systems. This is achieved by the following technical scheme.

The reconfigurable linear equation system solving accelerator based on the FPGA comprises:

the data distribution module is used for distributing the data in the internal memory to the calculation array module and adjusting the data distribution mode under the control of the main control module according to the scale and the type of the input coefficient matrix;

the main program control module is used for controlling the operation of the data distribution module, the reconstruction control module and the calculation array module and the communication among the modules;

the reconstruction control module is used for resetting the calculation mode according to the scale and the type of the coefficient matrix;

an internal memory module for storing the coefficient matrix and the vector data;

and the calculation array module is used for calculating the solution of the linear equation system.

In a further design of the accelerator, the data distribution module controls, at each moment, the data path from the internal memory module to the caches of the calculation array module, performs data exchange within the internal memory module, and adds a head label to each distributed column of data so that the data reaches the cache of the matching computing unit in the calculation array module.

In a further design, the coefficient matrix types handled by the data distribution module are divided into sparse coefficient matrices and dense coefficient matrices.

In a further design, the main control module is connected to the data distribution module, the reconfiguration control module and the calculation array module by bidirectional communication links, so that it controls the operation of these modules and the communication among them, forming the top-level controller of the linear equation system solving accelerator.

In a further design, the reconfiguration control module sets the operation mode of the calculation array module according to the type of the coefficient matrix: the iterative method when the matrix is a sparse coefficient matrix, and the direct method when it is a dense coefficient matrix.

In a further design, the iterative method follows the Jacobi iteration and is used to solve linear equation systems whose coefficient matrix is a large-scale sparse matrix, yielding an approximate solution of the required accuracy; the direct method follows LU decomposition with column pivoting (selection of the principal element, or pivot, within a column) and is used to solve linear equation systems whose coefficient matrix is dense, yielding an exact solution.

In a further design, the internal storage module is configured with RAMs of different depths according to the scale of the coefficient matrix; the RAMs store the column and row data of the coefficient matrix and communicate with the caches in the calculation array module over a data bus to complete data reads and writes.

In a further design, the calculation array module comprises:

the preprocessing unit is used for completing preprocessing before calculation and distribution work of specific data;

12 × 12 computing unit array for performing parallelization large-scale data computation, and executing LU decomposition process of direct method and iterative computation process of iterative method;

the back substitution unit is used for calculating the back substitution process of the linear equation set solution after the LU decomposition is finished;

and the iteration judging unit is used for calculating the vector x after the single iteration is finished and judging whether the iteration is finished or not according to the accuracy.
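The iterative path through these units can be sketched in software. The following is a minimal Python illustration of a Jacobi iteration with an accuracy-based stopping test, not the hardware design itself; the function name `jacobi`, the parameters `tol` and `max_iters`, and the choice of the largest component change as the stopping criterion are assumptions made for the example, since the text only states that the iteration judging unit checks the accuracy.

```python
def jacobi(a, b, tol=1e-10, max_iters=500):
    """Solve a x = b by Jacobi iteration; a is a list of rows (hypothetical helper)."""
    n = len(b)
    x = [0.0] * n
    for iteration in range(1, max_iters + 1):
        # Every component is computed from the previous vector only, so the
        # n updates are independent and could run on parallel units.
        x_new = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
        # Iteration judgement (assumed criterion): stop once the largest
        # change between successive vectors is within the tolerance.
        if max(abs(x_new[i] - x[i]) for i in range(n)) <= tol:
            return x_new, iteration
        x = x_new
    raise RuntimeError("Jacobi iteration did not converge")
```

Called on a diagonally dominant system such as `[[4, 1, 0], [1, 4, 1], [0, 1, 4]]` with right-hand side `[5, 6, 5]`, the sketch converges to a vector close to `[1, 1, 1]`.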

In a further design, in the direct-method calculation mode the preprocessing unit performs column pivot selection, communicates the row information of the selected pivot to the data distribution module, computes the quotients $a_{mn}/a_{piv}$ in sequence, where $a_{mn}$ is a non-pivot entry of the pivot column and $a_{piv}$ is the column pivot, and communicates with the 12 × 12 computing unit array; in the iterative calculation mode, the preprocessing unit communicates with the 12 × 12 computing unit array and distributes each component $x_n$ of the vector x in turn to the computing unit that stores the nth data.
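The direct-mode behaviour of the preprocessing unit can be illustrated with a short Python sketch. It assumes that the expression evaluated after pivot selection is the quotient of each non-pivot entry a_mn by the pivot a_piv, which is the usual multiplier in LU decomposition; the function name `select_pivot_and_scale` and the dictionary return format are hypothetical.

```python
def select_pivot_and_scale(column, start):
    """Return (pivot_row, multipliers) for rows start.. of one column (hypothetical helper)."""
    # Comparator step: find the row holding the entry of largest magnitude.
    pivot_row = max(range(start, len(column)), key=lambda m: abs(column[m]))
    a_piv = column[pivot_row]
    if a_piv == 0:
        raise ZeroDivisionError("column is zero from the diagonal down; matrix is singular")
    # Divider step (assumed form): one quotient a_mn / a_piv per non-pivot
    # entry, keyed by the row index of that entry.
    multipliers = {
        m: column[m] / a_piv
        for m in range(start, len(column))
        if m != pivot_row
    }
    return pivot_row, multipliers
```

For the column `[2.0, -8.0, 4.0]` with `start=0`, the sketch selects row 1 as the pivot row and returns the quotients `-0.25` and `-0.5` for rows 0 and 2.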

In a further design, each computing unit of the array comprises a head label matching unit and a multiply-add computing unit.

Advantages of the Invention

The linear equation system solving accelerator adopts a suitable and efficient solving method according to the type of equation system to be solved. It can solve linear equation systems both by LU decomposition and by the Jacobi iteration method. Different operation modes can be adopted in scenarios with different requirements on computing resources and precision, so the accelerator is more general than existing linear equation system solving accelerators.

The invention analyzes the procedures of the two algorithms, extracts the operations they have in common, and accelerates them with parallel pipelining, which effectively improves computational efficiency compared with software solving on a general-purpose processor.

The reconstruction control module can adjust the storage and transmission modes of data at the same time, and gives a good acceleration effect for linear equation systems with large-scale coefficient matrices.

Drawings

FIG. 1 is an overall architecture diagram of an accelerator for solving a system of linear equations.

FIG. 2 is a block diagram of a compute array module.

FIG. 3 is a structure diagram of a computing unit.

Detailed Description

The following describes the present invention in detail with reference to the accompanying drawings.

As shown in FIG. 1, the present invention is a linear equation system solving accelerator based on the reconfigurable concept. The accelerator consists of a main control module, an internal storage module, a reconstruction control module, a data distribution module and a calculation array module, where:

The main control module is the top-level control module of the system and controls the entire operation of the solving accelerator. When the accelerator starts, the main control module obtains the parameters (scale and type) of the coefficient matrix stored in the off-chip DRAM and, according to these parameters, communicates with the reconstruction control module to complete the initialization of the solve. The main control module also communicates with the data distribution module, the internal storage module and the calculation array module to carry out data distribution, data exchange and preprocessing, and data calculation.

The internal storage module stores the data required for calculation and the results of the calculation. It is controlled by the reconstruction control module, which reconfigures the memory during reconfiguration to suit the scale and type of the coefficient matrix. The data distribution module can also communicate with the internal storage module to carry out data processing such as exchanging part of the data and adding head labels. The internal memory is divided into 12 groups, each connected to 12 computing units via a data bus.
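The mapping of matrix columns onto the 12 memory groups can be sketched as follows. The sketch assumes the cyclic column assignment given for the dense 200 × 200 example in this description (column 1 in group 1, column 13 again in group 1); the names `NUM_GROUPS`, `group_for_column` and `columns_in_group` are hypothetical.

```python
NUM_GROUPS = 12  # the description states 12 internal-memory groups


def group_for_column(col):
    """Map a 1-based column number to a 1-based memory group, cyclically."""
    return (col - 1) % NUM_GROUPS + 1


def columns_in_group(group, n_cols):
    """List the 1-based columns of an n_cols-wide matrix held by one group."""
    return list(range(group, n_cols + 1, NUM_GROUPS))
```

Under this assumption, a 200-column matrix places 17 columns in each of groups 1 to 8 and 16 columns in each of groups 9 to 12, which is one reason the RAM depth has to be adjusted to the matrix scale.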

The reconstruction control module performs the reconfiguration at the initial stage of the solve, according to the scale and type of the coefficient matrix provided by the main controller. When reconfiguring the internal storage module, it adjusts the depth and size of the internal storage according to the scale of the coefficient matrix, and sets the storage mode to row storage or column storage according to the type of the coefficient matrix. When reconfiguring the calculation array module, it sets the operation mode of the calculation array module, according to the coefficient matrix type or a specified calculation requirement, to either the direct method or the iterative method. In the direct method, the preprocessing unit in the calculation array module is set to column selection mode and the back-substitution unit is enabled; in the iterative method, the preprocessing unit is set to distribution mode and the iteration judgment unit is enabled.
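The mode selection described above can be summarized as a small configuration table. The Python sketch below is only an illustration; the names `MatrixType`, `ArrayConfig` and `reconfigure` are hypothetical, and disabling the back-substitution unit in the iterative mode is an assumption, since the text only states which unit is enabled in that mode.

```python
from dataclasses import dataclass
from enum import Enum


class MatrixType(Enum):
    SPARSE = "sparse"
    DENSE = "dense"


@dataclass(frozen=True)
class ArrayConfig:
    solve_mode: str            # "direct" or "iterative"
    preprocess_mode: str       # "column_select" or "distribute"
    back_substitution_on: bool
    iteration_judge_on: bool


def reconfigure(matrix_type):
    """Pick the calculation-array settings for a coefficient matrix type."""
    if matrix_type is MatrixType.DENSE:
        # Direct method: column selection mode, back substitution enabled.
        return ArrayConfig("direct", "column_select", True, False)
    # Iterative method: distribution mode, iteration judgement enabled.
    # Back substitution is assumed to be off here (not stated in the text).
    return ArrayConfig("iterative", "distribute", False, True)
```

A specified calculation requirement could override the matrix type in the same way, by passing the desired type to `reconfigure` directly.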

The data distribution module carries out the distribution of data: mainly, it adds a head label to the data so that the data is sent to the computing unit corresponding to that head label. It also communicates with the preprocessing unit in the calculation array module to obtain the parameters of the column pivot and controls the internal storage module to exchange the pivot column data.

As shown in FIG. 2, the calculation array module consists of a preprocessing unit, a 12 × 12 computing unit array, a back-substitution unit and an iteration judgment unit. The preprocessing unit consists of a multi-input comparator, a buffer and a divider. Data is stored in the buffer, the multi-input comparator performs column pivot selection, and after pivot selection the divider evaluates $a_{mn}/a_{piv}$, where $a_{mn}$ is a non-pivot entry of the pivot column and $a_{piv}$ is the pivot of that column.

The preprocessing unit has two modes, selected by the reconstruction control module. In the direct method it operates in column selection mode: it selects the column pivot, computes the divider quotients and distributes the data. In the iterative method it operates in distribution mode and only distributes data. The 12 × 12 computing unit array performs the multiply-add operations on the data. As shown in FIG. 3, once the head label matching module of a computing unit matches its head label, the data is sent to the cache and the subsequent stages perform multiplication and accumulation. After the LU decomposition of the direct method is complete, the back-substitution unit performs the back substitution to obtain the solution of the equation system. The iteration judgment unit computes the vector x of one iteration and checks whether it has reached the set accuracy; if so, x is taken as the solution of the equation system; if not, the next iteration is carried out.
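The head label matching of a single computing unit can be modelled in a few lines of Python. This is a behavioural sketch only: the class name `ComputeUnit`, its methods, and the use of a dot product as the multiply-accumulate operation are assumptions for illustration, since the text does not specify the exact operands.

```python
class ComputeUnit:
    """Behavioural model of one computing unit (hypothetical names)."""

    def __init__(self, head_label):
        self.head_label = head_label
        self.cache = []
        self.acc = 0.0

    def offer(self, label, data):
        """Latch data into the cache only when the head label matches."""
        if label != self.head_label:
            return False
        self.cache = list(data)
        return True

    def multiply_accumulate(self, operands):
        """Multiply cached values by operands pairwise and accumulate (assumed operation)."""
        self.acc = 0.0
        for a, x in zip(self.cache, operands):
            self.acc += a * x
        return self.acc
```

Broadcasting a column with label 2 to three units labelled 1, 2 and 3 leaves the caches of units 1 and 3 untouched, and only unit 2 takes part in the following multiply-accumulate step.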

The working principle of the linear equation system solving accelerator is described below, taking a 200 × 200 dense coefficient matrix linear equation system as an example. In the initial stage, after the main controller obtains the scale and type of the matrix, it communicates with the reconstruction controller and sets the storage mode of the internal storage module. For a dense coefficient matrix, the coefficient matrix is stored by columns, and each group of the internal memory stores whole columns of data: the first column is stored in the first group, the second column in the second group, and so on, with the cycle restarting at the thirteenth column, which is again stored in the first group. The reconstruction control module also configures the calculation array module: it sets the preprocessing unit to column selection mode, sets the head label of each computing unit, enables the back-substitution unit and disables the iteration judgment unit. After the reconfiguration is complete, the off-chip DRAM transmits the data to the internal storage module over the data bus. Once the data is stored, the data distribution module adds a head label in front of each column of data; the head label of a column is its column number (the vector x carries a special mark). The main control module then sends the first row of data into the preprocessing unit of the calculation array module; after pivot selection is complete, the divider results are computed and sent to the respective computing units.

Then each column of data is transmitted from its group of the internal memory onto the corresponding data bus. There are 12 data buses connected to the 12 × 12 computing units, so 12 computing units are matched with column data carrying the corresponding head labels at the same time. In this way the transmission of 144 columns (143 columns of data and the vector x) is completed. The main control module starts the calculation and the computing units begin their multiply-add operations simultaneously; depending on the data it stores, the 1st calculation unit calculates

The 2nd calculation unit calculates

Sequentially, the 143rd calculation unit calculates

The 144th calculation unit calculates

The first round of calculation completes the update of the results. The preprocessing unit completes the first round of calculation

After the first round of calculation is completed, the result is sent to each calculation unit. The second round of calculation then starts, i.e. the 1st calculation unit

The 2nd calculation unit calculates

Sequentially, the 143rd calculation unit calculates

The 144th calculation unit calculates

Because the cache of a computing unit cannot hold an entire column of data, once the stored portion of the column has been processed the main control module directs the internal storage module to send the remaining column data; this data overwrites the cache entries of the computing unit from the 2nd to the last, while the first-row data is retained. The calculation continues until the LU decomposition step for columns 1 to 143 is complete. The data of the remaining columns is then loaded into the computing units, the vector x being kept in place without further operation, and the same calculation is performed; this finally yields the coefficient matrix after one LU decomposition step, and the result overwrites the original data in the internal memory. The main controller then starts the second LU decomposition step: the new coefficient matrix is the original matrix with its first row and first column removed, and the vector x has x₁ removed; the other operations are the same as in the first step. Each time a new LU decomposition step starts, one row and one column of coefficients are removed, and one component of the vector x is dropped. When the LU decomposition is finally complete, the back-substitution unit starts the back substitution to obtain the solution of the linear equation system.
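The overall direct-method flow, in which each elimination step removes one row and one column and back substitution follows at the end, corresponds to the standard LU (Gaussian) elimination with partial pivoting. The Python sketch below is a sequential software reference under that assumption, not a model of the 12 × 12 array or its caching scheme; the function name `solve_dense` is hypothetical, and the right-hand-side vector is carried as an extra column, mirroring the way the vector is transmitted alongside the matrix columns in the example.

```python
def solve_dense(a, b):
    """Solve a x = b by elimination with partial pivoting (hypothetical helper)."""
    n = len(b)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for k in range(n):
        # Column pivoting: bring the largest-magnitude entry of column k to row k.
        p = max(range(k, n), key=lambda r: abs(m[r][k]))
        if m[p][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[p] = m[p], m[k]
        for r in range(k + 1, n):
            # Quotient of a non-pivot entry by the pivot (assumed form of
            # the divider result in the description).
            factor = m[r][k] / m[k][k]
            # Multiply-add update of the rest of the row, including the
            # right-hand side held in the last column.
            for c in range(k, n + 1):
                m[r][c] -= factor * m[k][c]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x
```

For the 3 × 3 system with rows `[2, 1, -1]`, `[-3, -1, 2]`, `[-2, 1, 2]` and right-hand side `[8, -11, -3]`, the sketch returns a vector close to `[2, 3, -1]`. In the accelerator, the inner multiply-add updates are the part distributed across the computing unit array.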

The invention is a linear equation system solving accelerator designed on the reconfigurable concept, which adopts a suitable and efficient solving method according to the type of equation system to be solved. It can solve linear equation systems both by LU decomposition and by the Jacobi iteration method. Different operation modes can be adopted in scenarios with different requirements on computing resources and precision, so the accelerator is more general than existing linear equation system solving accelerators. The invention analyzes the procedures of the two algorithms, extracts the operations they have in common, and accelerates them with parallel pipelining, which effectively improves computational efficiency. The reconstruction control module of the solving accelerator can adjust the storage and transmission modes of data at the same time, and gives a good acceleration effect for large-scale coefficient matrices.

The reconfigurable linear equation system solving accelerator provided by the invention has been described in detail above to facilitate understanding of the invention and its core idea. A person skilled in the art can make many modifications and variations in a concrete implementation according to the core idea of the invention. Accordingly, this description should not be taken in a limiting sense.

Claims

1. A reconfigurable linear equation set solution accelerator based on an FPGA is characterized by comprising:

the calculation array module is used for calculating the solution of the linear equation set;

the reconstruction control module, which, according to the type of the coefficient matrix, resets the operation mode of the calculation array module to the iterative method when the matrix is a sparse coefficient matrix, and to the direct method when the matrix is a dense coefficient matrix;

wherein the iterative method follows the Jacobi iteration and solves linear equation systems whose coefficient matrix is a large sparse matrix to obtain an approximate solution of the required accuracy, and the direct method follows LU decomposition with column selection of the principal element and solves linear equation systems whose coefficient matrix is a dense matrix to obtain an exact solution.

2. The accelerator according to claim 1, wherein the data distribution module controls data paths of data in the internal memory module to the cache of the computing array module at each time, operates data exchange in the internal memory module, and adds a header mark to each distributed column data to distribute the data to the cache of the computing unit of the matching computing array module.

3. The accelerator according to claim 1, wherein the coefficient matrix types processed by the data distribution module are classified into sparse coefficient matrices and dense coefficient matrices.

4. The accelerator for solving the reconfigurable linear equation set based on the FPGA as claimed in claim 1, wherein the main program module is respectively connected with the data distribution module, the reconfiguration control module and the calculation array module in a bidirectional communication manner, so that the operation of the data distribution module, the reconfiguration control module and the calculation array module and the communication among the modules are controlled, and an uppermost controller of the accelerator for solving the linear equation set is formed.

5. The accelerator according to claim 1, wherein the internal storage module configures RAMs with different depths according to the scale of the coefficient matrix, and is configured to store data of each column and row of the coefficient matrix, and communicate with the cache in the calculation array module through a data bus to complete reading and writing of the data.

6. The accelerator according to claim 1, wherein the computational array module comprises:

7. The accelerator according to claim 6, wherein in the direct calculation mode, the preprocessing unit completes column selection of the principal element, communicates the row information of the obtained principal element to the data distribution module, and sequentially calculates $a_{mn}/a_{piv}$, where $a_{mn}$ is data other than the principal element in the selected principal element column and $a_{piv}$ is the column-selected principal element, and communicates with the 12 × 12 computing unit array; in the iterative calculation mode, the preprocessing unit communicates with the 12 × 12 computing unit array and distributes each parameter $x_n$ of the vector x in turn to the computing unit storing the nth data.

8. The FPGA-based reconfigurable linear equation set solution accelerator of claim 7, wherein each computing element array comprises a header tag matching element and a multiply-add computing element.