CN114443616B - FPGA-based parallel heterogeneous database acceleration method and device - Google Patents

Info

Publication number
CN114443616B
CN114443616B (application CN202111667493.0A)
Authority
CN
China
Prior art date
Legal status (assumption, not a legal conclusion): Active
Application number
CN202111667493.0A
Other languages
Chinese (zh)
Other versions
CN114443616A (en)
Inventor
任智新
张闯
黄广奎
刘科
孙忠祥
王敏
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202111667493.0A priority Critical patent/CN114443616B/en
Publication of CN114443616A publication Critical patent/CN114443616A/en
Application granted granted Critical
Publication of CN114443616B publication Critical patent/CN114443616B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/214Database migration support
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24552Database cache management
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an FPGA-based parallel heterogeneous database acceleration method comprising: obtaining original data to be calculated; the Host starting DMA to move the original data into the cache DDR of the FPGA accelerator; the FPGA accelerator reading the data and judging whether a writable Ram exists; if not, the FPGA raising an interrupt and returning to obtain the next batch of original data; if so, starting a DDR read operation inside the FPGA accelerator and writing the fetched data into any free local Ram; once a Ram is full, starting data analysis and calculation, returning the Ram to the writable state after the calculation completes and recording the DDR read address; and judging whether all data calculation is completed — if so, ending, and if not, returning to step 103 until all the data has been moved. This scheme reduces the waiting time caused by data movement in the prior art and thereby maximizes the acceleration effect of the system.

Description

FPGA-based parallel heterogeneous database acceleration method and device
Technical Field
The invention relates to the field of data processing, in particular to a parallel heterogeneous database acceleration method and device based on an FPGA.
Background
As the volume of data to be analyzed and queried keeps growing, the traditional way to relieve the resulting CPU load is to scale out compute nodes. That approach is poorly suited to today's drive for cost reduction and efficiency, delivers low economic value, and is often impractical because of environmental constraints. Much current research therefore focuses on hardware acceleration, which has proven to be an effective way to improve the performance of data-intensive applications in database management systems. Commonly used hardware accelerators include FPGAs, GPUs, and ASICs. Among these, the FPGA offers higher parallelism and lower power consumption, so it is increasingly favored, and the industry has developed a series of FPGA-based heterogeneous acceleration solutions.
In FPGA-based heterogeneous database acceleration schemes, the central idea is that the CPU offloads computation-intensive tasks such as data queries to the FPGA for execution. There is, however, no concrete solution at present for fully exploiting the FPGA's parallelism, fully utilizing its resources, and maximizing the benefit of pipelining. In prior-art parallel computing offload for database accelerators, the FPGA or ASIC acting as the acceleration unit may be configured with multiple processing units computing in parallel, but no concrete data transmission method is given, nor a scheme for organizing the parallel computation to achieve the best acceleration effect. Designing an FPGA-based heterogeneous database acceleration scheme is therefore one of the key problems to be solved.
Disclosure of Invention
In view of the above, the present invention aims to provide an FPGA-based parallel heterogeneous database acceleration method that solves the inefficiency of the prior art, in which the FPGA merely waits, without starting any calculation, while the database data to be queried is moved from the Host to the FPGA accelerator.
Based on the above purpose, the invention provides an FPGA-based parallel heterogeneous database acceleration method comprising the following steps:
step 101, obtaining original data to be calculated;
step 102, the Host starts DMA to move the original data to be calculated into a cache DDR of the FPGA accelerator;
step 103, after the move completes, the FPGA accelerator reads and stores the data and judges whether a writable Ram exists;
step 104, if no writable Ram exists, the FPGA raises an interrupt and returns to step 101 to obtain the next original data to be calculated;
step 105, if a writable Ram exists, a DDR read operation is started inside the FPGA accelerator, the data is read according to the storage size of the original data, and the fetched data is written into any local Ram;
step 106, once any Ram is full, data analysis and calculation start; after the calculation completes, the Ram returns to the writable state and the DDR read address is recorded;
step 107, judging whether all data calculation is completed; if so, ending;
if not, returning to step 103 until all the original data to be calculated has been moved.
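The control flow of steps 101-107 can be sketched as a small software model. This is an illustrative sketch only — the class and function names are assumptions, not from the patent — showing how pages are dispatched from DDR into whichever local Ram is writable and how each Ram returns to the writable state after its computation:

```python
# Hypothetical software model of steps 101-107 (names are illustrative).
PAGE = 8 * 1024  # one database page; the embodiment uses PostgreSQL's 8K pages


class RamBank:
    """One local Ram of the FPGA accelerator."""
    def __init__(self):
        self.writable = True  # step 106 returns the bank to this state


def move_and_compute(ddr_pages, banks):
    """Mimic steps 103-107: return the number of pages processed."""
    processed = 0
    for page in ddr_pages:                         # step 103: read from DDR
        bank = next((b for b in banks if b.writable), None)
        if bank is None:                           # step 104: no writable Ram
            continue                               # (would interrupt and retry)
        bank.writable = False                      # step 105: write page to Ram
        processed += 1                             # step 106: analyse/calculate
        bank.writable = True                       # Ram is writable again
    return processed


banks = [RamBank() for _ in range(8)]              # 8 parallel PUs (embodiment)
assert move_and_compute(range(16), banks) == 16    # all 16 pages processed
```

In hardware the per-bank computation runs concurrently; the sequential loop here only illustrates the writable-Ram bookkeeping.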
In some embodiments, the method further comprises:
in step 101, the original data to be calculated is the database table to be queried, which the DBMS extracts from the database upon receiving a query plan to be executed on the database.
In some embodiments, the method further comprises:
in step 102, the raw data to be calculated is stored into the cache DDR of the FPGA accelerator as follows: the data is moved over the communication channel between the FPGA accelerator and the CPU into the accelerator's cache DDR, of which there is more than one path. After the move into the first path DDR1 completes, the subsequent module initiates read operations on DDR1 and no write operations are performed on it; after the first-path transmission completes, the Host starts the write operation on the second path DDR2.
In some embodiments, the method further comprises:
and 105, after the original data to be calculated is stored in the accelerator card DDR, a data cache processing module in the accelerator card starts DDR reading operation, reads and stores the data according to the storage size of an original data page, and takes out and writes the data into the local Ram.
In some embodiments, the method further comprises:
the FPGA comprises a plurality of processing units, and the processing units read and store data to be calculated in parallel.
In some embodiments, the method further comprises:
after each processing unit finishes its calculation, its result is input into a calculation result arranging module, which arranges the results of the parallel calculation units and feeds them back to the Host, completing one round of parallel calculation.
In some embodiments, the method further comprises:
while the parallel units of the first path DDR1 are calculating, the second path DDR2 transmits original data and starts calculating once the transmission completes; meanwhile, write control returns to the first path DDR1, and if no read operation is in progress there, the Host initiates a DMA move to complete the data write operation.
In another aspect of the present invention, there is also provided an FPGA-based parallel heterogeneous database acceleration apparatus, including:
the original data acquisition module is used for acquiring original data to be calculated;
the data handling module is used for handling the original data to be calculated into a cache DDR of the FPGA accelerator;
the reading module is used for reading and storing the data and judging whether writable Ram exists or not;
the first judging module is used for judging whether writable Ram exists or not; if the writable Ram is not available, the FPGA is interrupted, and next original data to be calculated is obtained;
if writable Ram exists, starting DDR read operation in the FPGA accelerator, reading and storing according to the storage size of the original data, and taking out the data and writing the data into any Ram in the local area;
the data analysis and calculation module is used for analyzing and calculating the data after any Ram is filled, the Ram position is in a writable state after the calculation is completed, and the DDR read address is recorded;
a second judging module for judging whether all data calculation is completed,
if yes, ending;
if not, returning to step 103 until all the original data to be calculated are completely moved.
In yet another aspect of the present invention, there is also provided a computer readable storage medium storing computer program instructions which, when executed, implement any of the methods described above.
In yet another aspect of the present invention, there is also provided a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, performs any of the methods described above.
The invention has at least the following beneficial technical effects:
according to the invention, more than one path of DDR is set by adopting a ping-pong data transmission mode, after a Host moves a certain amount of source data to the cache of the FPGA accelerator, the FPGA calculation is started, a plurality of calculation units are started simultaneously, and in the calculation process of the calculation unit of one DDR1 channel, the data transmission of the other path of DDR2 channel is started again, and the data is moved from the Host to the data of the cache DDR2 of the FPGA accelerator. The Host performs the write operation on the two paths of DDRs in a time sharing manner, and the calculation unit performs the read operation on the DDRs in a time sharing manner, so that the calculation unit is always in a working state, the waiting time for data movement in the prior art is reduced, and the acceleration effect of the system is improved to the greatest extent.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are necessary for the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention and that other embodiments may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block diagram of an FPGA-based database parallel heterogeneous acceleration system implementation according to an embodiment of the present invention;
FIG. 2 is a flow chart of an FPGA heterogeneous acceleration system query task provided according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a computer readable storage medium implementing a resource monitoring method according to an embodiment of the present invention;
fig. 4 is a schematic hardware structure of a computer device for performing a resource monitoring method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
It should be noted that, in the embodiments of the present invention, all the expressions "first" and "second" are used to distinguish two non-identical entities with the same name or non-identical parameters, and it is noted that the "first" and "second" are only used for convenience of expression, and should not be construed as limiting the embodiments of the present invention. Furthermore, the terms "comprise" and "have," and any variations thereof, are intended to cover a non-exclusive inclusion, such as a process, method, system, article, or other step or unit that comprises a list of steps or units.
In existing FPGA-based database acceleration schemes, data is moved as follows: all the data to be queried is moved from the Host side to the FPGA in one pass, the FPGA starts its accelerated calculation only after the move completes, and it sits waiting before that. So-called parallel computing typically just instantiates multiple computing units; calculation still starts only after all the data has been cached on the FPGA accelerator. The following terms and English abbreviations used in the embodiments are explained as follows:
FPGA: Field Programmable Gate Array
DMA: Direct Memory Access
PCIe: Peripheral Component Interconnect Express, a high-speed serial computer expansion bus standard
DDR SDRAM: double data rate synchronous dynamic random access memory, abbreviated herein as DDR.
Based on the above object, in a first aspect of the embodiment of the present invention, an embodiment of a parallel heterogeneous database acceleration method based on FPGA is provided.
As shown in fig. 1, an implementation block diagram of an FPGA-based parallel heterogeneous database acceleration system according to an embodiment of the present invention includes: a Host comprising a CPU, a PCIe driver, and a Host program; and an FPGA accelerator connected to the CPU through a PCIe interface. The PCIe driver establishes the data transmission path between the CPU and the FPGA;
the Host program is responsible for operation analysis and distribution, enabling the CPU to offload computation-intensive tasks to the FPGA for execution;
the FPGA accelerator executes the computation-intensive tasks offloaded from the CPU;
data in a database table is stored in pages, and the page size differs between vendors; taking PostgreSQL as an example, each page is 8K in size, and the modules below use 8K in their descriptions. The embodiment of the invention comprises the following steps:
a parallel heterogeneous database acceleration method based on FPGA comprises the following steps:
step 101, obtaining original data to be calculated: the DBMS receives a query plan to be executed on the database and extracts the database table to be queried as the original calculation data. Data in the table is stored in pages, whose size differs between vendors; taking PostgreSQL as an example, each page is 8K in size.
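The figures used throughout the embodiment relate simply: a 64 Mbyte DMA chunk holds exactly 8192 of PostgreSQL's 8K pages, and with the 8 parallel processing units described later each PU handles 1024 pages per chunk. A quick sanity check of that arithmetic:

```python
# Sanity arithmetic for the embodiment's figures (64 MB chunks, 8K pages, 8 PUs).
CHUNK = 64 * 1024 * 1024   # one DMA move, as in the embodiment
PAGE = 8 * 1024            # one PostgreSQL page
NUM_PUS = 8                # parallel processing units in the embodiment

pages_per_chunk = CHUNK // PAGE
assert pages_per_chunk == 8192            # pages per 64 MB move
assert pages_per_chunk % NUM_PUS == 0     # divides evenly across the PUs
assert pages_per_chunk // NUM_PUS == 1024  # pages handled per PU per chunk
```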
step 102, the Host starts DMA to move the original data to be calculated into a cache DDR of the FPGA accelerator. Specifically: the data is moved over the communication channel between the FPGA accelerator and the CPU into the accelerator's cache DDR, of which there is more than one path; after the move into the first path DDR1 completes, the subsequent module initiates read operations on DDR1 and no write operations are performed on it; after the first-path transmission completes, the Host starts the write operation on the second path DDR2;
in practice, the optimal amount of data per move can be determined by testing against the hardware and software conditions and used as the per-move quantity in an engineering implementation. Taking 64 Mbyte per DMA move as the working example, 64M of data is moved through the PCIe communication interface into DDR1 of the FPGA accelerator card; after the move into the first path DDR1 completes, the subsequent module initiates read operations on DDR1 while the Host performs no write operations on it;
to avoid the time wasted waiting on the Host's transmission, as soon as the first path DDR1 has finished transmitting, the Host immediately starts the write operation on the second path DDR2, moving another 64M of data.
Step 103, after the move completes, the FPGA accelerator reads and stores the data and judges whether a writable Ram exists. Once the 64M of data to be calculated is stored in the accelerator card's DDR, the data cache processing module in the FPGA accelerator card starts a DDR1 read operation and reads the data according to the 8K storage size of an original data page;
step 104, if the writable Ram is not available, the FPGA is interrupted, and the step 101 is returned to obtain the next original data to be calculated;
step 105, if a writable Ram exists, a DDR read operation is started inside the FPGA accelerator, the data is read according to the storage size of the original data, and the fetched data is written into any local Ram, i.e. 8K of data is taken out and written into a local Ram;
step 106, once any Ram is full, data analysis and calculation start; after the calculation completes, the Ram returns to the writable state and the DDR read address is recorded. As soon as any Ram has been filled with data, the subsequent data analysis and calculation module can be started;
step 107, judging whether all data calculation is completed, if yes, ending;
if not, returning to step 103 until all the original data to be calculated are completely moved.
And 105, after the original data to be calculated is stored in the accelerator card DDR, a data cache processing module in the accelerator card starts DDR reading operation, reads and stores the data according to the storage size of an original data page, and takes out and writes the data into the local Ram.
In some embodiments, the FPGA accelerator includes a plurality of processing units PU that read and store the data to be calculated in parallel. To further improve the FPGA's parallel processing capability, this embodiment uses 8 local Rams of 8K each, which amounts to 8 computing-unit paths running synchronously and effectively raises parallel throughput.
After each 8K local Ram has been written, the subsequent original-data analysis and calculation functions start, including page information acquisition, row information acquisition, column information acquisition, and operations such as query, comparison, and aggregation in the subsequent calculation modules, depending on the specific application scenario.
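One plausible way to read "8 paths of computing units performed synchronously" is that consecutive 8K pages are distributed round-robin across the 8 local Rams so every PU gets an even share. The patent does not fix the distribution order, so the scheme below is an assumption for illustration:

```python
# Illustrative round-robin dispatch of 8K pages across 8 parallel PUs
# (the distribution order is an assumption, not specified in the patent).

def dispatch(num_pages, num_pus=8):
    """Assign page indices to PU lanes in round-robin order."""
    lanes = [[] for _ in range(num_pus)]
    for page in range(num_pages):
        lanes[page % num_pus].append(page)  # page k goes to PU k mod 8
    return lanes


lanes = dispatch(16)
assert [len(lane) for lane in lanes] == [2] * 8   # even load across all PUs
assert lanes[0] == [0, 8]                         # PU0 gets pages 0 and 8
```

Any distribution that keeps the lanes balanced would serve; round-robin is simply the most common choice for fixed-size pages.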
After the calculation of each processing unit PU is completed, the calculation result is input into a calculation result arranging module, and the calculation result arranging module arranges the results of the parallel calculation units and feeds back the results to the Host to complete the one-time parallel calculation.
While the parallel units of the first path DDR1 are calculating, the other path DDR2 transmits original data; once that transmission completes, calculation on the other path DDR2 (its 8 parallel PU units) can start. At this moment, the Host's write control right returns to the first path DDR1, so as long as no read operation is in progress the Host can initiate another DMA move to complete the data write, and step 103 continues to execute after the write completes, until all the original data to be calculated has been moved.
Afterwards the Host can turn to other tasks; the database calculation is completed independently by the FPGA, which, once finished, notifies the Host by interrupt to read the calculation result, completing one database acceleration operation.
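The alternation described above can be summarized as a schedule in which each round pairs "compute chunk k on one DDR" with "write chunk k+1 to the other DDR", so a DDR is never read and written at the same time. A minimal sketch of that event ordering (chunk labels are illustrative):

```python
# Sketch of the two-DDR ping-pong ordering: each round overlaps computing
# the current chunk with writing the next chunk to the other DDR path.

def ping_pong_schedule(chunks):
    schedule = []
    for k in range(chunks):
        ddr = 1 + (k % 2)          # chunks alternate between DDR1 and DDR2
        nxt = 1 + ((k + 1) % 2)    # the Host writes the other path meanwhile
        write = (f"write chunk {k + 1} to DDR{nxt}"
                 if k + 1 < chunks else "idle")
        schedule.append((f"compute chunk {k} on DDR{ddr}", write))
    return schedule


sched = ping_pong_schedule(3)
assert sched[0] == ("compute chunk 0 on DDR1", "write chunk 1 to DDR2")
assert sched[1] == ("compute chunk 1 on DDR2", "write chunk 2 to DDR1")
assert sched[2][1] == "idle"       # last chunk has nothing left to prefetch
```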
In another aspect of the present invention, there is also provided an FPGA-based parallel heterogeneous database acceleration apparatus, including:
the original data acquisition module is used for acquiring original data to be calculated;
the data handling module is used for handling the original data to be calculated into a cache DDR of the FPGA accelerator;
the reading module is used for reading and storing the data and judging whether writable Ram exists or not;
the first judging module is used for judging whether writable ram exists or not; if the writable Ram is not available, the FPGA is interrupted, and next original data to be calculated is obtained;
if writable Ram exists, starting DDR read operation in the FPGA accelerator, reading and storing according to the storage size of the original data, and taking out the data and writing the data into any Ram in the local area;
the data analysis and calculation module is used for analyzing and calculating the data after any Ram is filled, the Ram position is in a writable state after the calculation is completed, and the DDR read address is recorded;
a second judging module for judging whether all data calculation is completed,
if yes, ending;
if not, returning to step 103 until all the original data to be calculated are completely moved.
In a third aspect of the embodiment of the present invention, a computer readable storage medium is provided, and fig. 3 shows a schematic diagram of a computer readable storage medium implementing a resource monitoring method according to an embodiment of the present invention. As shown in fig. 3, the computer-readable storage medium 3 stores computer program instructions 31, which computer program instructions 31 are executable by a processor. The computer program instructions 31 when executed implement the method of any of the embodiments described above.
It should be understood that all of the embodiments, features and advantages set forth above for the resource monitoring method according to the invention equally apply to the resource monitoring system and storage medium according to the invention, without conflicting therewith.
In a fourth aspect of the embodiments of the present invention, there is also provided a computer device comprising a memory 402 and a processor 401, the memory storing a computer program which, when executed by the processor, implements the method of any of the embodiments described above.
Fig. 4 is a schematic hardware structure diagram of an embodiment of a computer device for performing the resource monitoring method according to the present invention. Taking the example of a computer device as shown in fig. 4, a processor 401 and a memory 402 are included in the computer device, and may further include: an input device 403 and an output device 404. The processor 401, memory 402, input device 403, and output device 404 may be connected by a bus or otherwise, for example in fig. 4. The input device 403 may receive entered numeric or character information and generate key signal inputs related to user settings and function control of the resource monitoring system. The output 404 may include a display device such as a display screen.
The memory 402 is used as a non-volatile computer readable storage medium, and may be used to store non-volatile software programs, non-volatile computer executable programs, and modules, such as program instructions/modules corresponding to the resource monitoring method in the embodiments of the present application. Memory 402 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created by use of the resource monitoring method, and the like. In addition, memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 402 may optionally include memory located remotely from processor 401, which may be connected to the local module via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor 401 executes various functional applications of the server and data processing, i.e., implements the resource monitoring method of the above-described method embodiment, by running nonvolatile software programs, instructions, and modules stored in the memory 402.
Finally, it should be noted that the computer-readable storage media (e.g., memory) herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. By way of example, and not limitation, nonvolatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which acts as external cache memory. By way of example, and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein may be implemented or performed with the following components designed to perform the functions herein: a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP and/or any other such configuration.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that as used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly supports an exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items. The serial numbers of the foregoing embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
Those of ordinary skill in the art will appreciate that: the above discussion of any embodiment is merely exemplary and is not intended to imply that the scope of the disclosure of embodiments of the invention, including the claims, is limited to such examples; combinations of features of the above embodiments or in different embodiments are also possible within the idea of an embodiment of the invention, and many other variations of the different aspects of the embodiments of the invention as described above exist, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement, etc. of the embodiments should be included in the protection scope of the embodiments of the present invention.

Claims (5)

1. An FPGA-based parallel heterogeneous database acceleration method, characterized by comprising the following steps:
step 101, obtaining original data to be calculated;
step 102, the host starts DMA to move the original data to be calculated into cache DDR of an FPGA accelerator, wherein there is more than one DDR channel; a 64M block of the original data to be calculated is moved into DDR1 of the FPGA accelerator through the communication interface PCIe; after the move into the first channel DDR1 is completed, a subsequent module initiates a read operation on DDR1, during which the host performs no write operation on DDR1; once the first-channel DDR1 transfer is completed, the host immediately starts a write operation on the second channel DDR2, likewise moving a 64M block of data;
step 103, after the move is completed and the 64M block of original data to be calculated is stored in the DDR of the accelerator, a data cache processing module in the FPGA accelerator starts a DDR1 read operation, reading and storing the data according to the 8K storage size of an original data page, and judging whether a writable RAM exists;
step 104, if no writable RAM exists, the FPGA raises an interrupt and returns to step 101 to obtain the next original data to be calculated;
step 105, if a writable RAM exists, starting a DDR read operation inside the FPGA accelerator, reading and storing according to the storage size of the original data, and taking out an 8K block of data and writing it into any local RAM; the FPGA comprises a plurality of processing units, which read and store the data to be calculated in parallel; after each processing unit finishes its calculation, it inputs the calculation result to a calculation result arranging module, which arranges the results of the parallel calculation units and feeds them back to the host, completing one round of parallel calculation; while the parallel units of the first channel DDR1 are calculating, the second channel DDR2 transfers original data and starts calculation once the transfer is completed; meanwhile, write control of the host returns to the first channel DDR1, and if no read operation is currently in progress, the host initiates a DMA move to complete the data write operation;
step 106, after any RAM is fully written, starting data parsing and calculation; after the calculation is completed, the RAM returns to a writable state and the DDR read address is recorded;
step 107, judging whether all data calculation is completed; if yes, ending;
if not, returning to step 103 until all the original data to be calculated has been moved.
2. The method of claim 1, wherein in step 101, the original data to be calculated comes from a query plan, received by the DBMS, that is to be executed on a database, and the database table to be queried is extracted from the database.
3. An FPGA-based parallel heterogeneous database acceleration apparatus, characterized by comprising:
an original data acquisition module, configured to obtain original data to be calculated;
a data moving module, configured to move the original data to be calculated into cache DDR of an FPGA accelerator, wherein there is more than one DDR channel; a 64M block of the original data to be calculated is moved into DDR1 of the FPGA accelerator through the communication interface PCIe; after the move into the first channel DDR1 is completed, a subsequent module initiates a read operation on DDR1, during which the host performs no write operation on DDR1; once the first-channel DDR1 transfer is completed, the host immediately starts a write operation on the second channel DDR2, likewise moving a 64M block of data;
a reading module, configured to read and store the data and judge whether a writable RAM exists, wherein after the 64M block of original data to be calculated is stored in the DDR of the accelerator, a data cache processing module in the FPGA accelerator starts a DDR1 read operation and reads and stores the data according to the 8K storage size of an original data page, so as to judge whether a writable RAM exists;
a first judging module, configured to judge whether a writable RAM exists; if no writable RAM exists, the FPGA raises an interrupt and obtains the next original data to be calculated;
if a writable RAM exists, a DDR read operation is started inside the FPGA accelerator, the data is read and stored according to the storage size of the original data, and an 8K block of data is taken out and written into any local RAM; the FPGA comprises a plurality of processing units, which read and store the data to be calculated in parallel; after each processing unit finishes its calculation, it inputs the calculation result to a calculation result arranging module, which arranges the results of the parallel calculation units and feeds them back to the host, completing one round of parallel calculation; while the parallel units of the first channel DDR1 are calculating, the second channel DDR2 transfers original data and starts calculation once the transfer is completed; meanwhile, write control of the host returns to the first channel DDR1, and if no read operation is currently in progress, the host initiates a DMA move to complete the data write operation;
a data parsing and calculation module, configured to parse and calculate the data after any RAM is fully written, wherein after the calculation is completed the RAM returns to a writable state and the DDR read address is recorded;
a second judging module, configured to judge whether all data calculation is completed;
if yes, ending;
if not, returning to the reading module until all the original data to be calculated has been moved.
4. A computer-readable storage medium, characterized in that computer program instructions are stored thereon which, when executed, implement the method of any one of claims 1-2.
5. A computer device, comprising a memory and a processor, characterized in that the memory stores a computer program which, when executed by the processor, performs the method of any one of claims 1-2.
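As a rough illustration only (not part of the patent text), the data flow of claims 1 and 3 can be modeled sequentially in software: the host moves fixed-size chunks into alternating DDR channels, the accelerator drains each chunk page by page into a pool of local RAMs, computes on each full RAM, and returns the RAM to the writable pool. The constants `CHUNK` and `PAGE` are scaled-down stand-ins for the 64M DMA block and the 8K data page, and the byte-sum "calculation" and `num_rams` parameter are placeholder assumptions.

```python
from collections import deque

# Scaled-down stand-ins for the 64M DMA chunk and the 8K original-data page.
CHUNK = 1 << 16
PAGE = 1 << 10

def accelerate(raw: bytes, num_rams: int = 2) -> list:
    """Sequential software model of the two-channel ping-pong pipeline."""
    writable = deque(range(num_rams))  # pool of writable RAM ids (steps 104/106)
    results = []
    bank = 0  # which DDR channel the host writes next: 0 -> DDR1, 1 -> DDR2
    for offset in range(0, len(raw), CHUNK):
        chunk = raw[offset:offset + CHUNK]  # step 102: DMA one chunk into DDR[bank]
        for p in range(0, len(chunk), PAGE):
            ram = writable.popleft()        # step 105: write one page into a free RAM
            page = chunk[p:p + PAGE]
            results.append(sum(page))       # step 106: placeholder "calculation"
            writable.append(ram)            # RAM returns to the writable state
        bank ^= 1  # write control ping-pongs to the other DDR channel
    return results
```

In the hardware described by the claims, an empty `writable` pool would raise the step-104 interrupt and the two channels would overlap transfer with computation; in this strictly sequential model a RAM is always freed before the next page, so that branch and the overlap are omitted.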
CN202111667493.0A 2021-12-30 2021-12-30 FPGA-based parallel heterogeneous database acceleration method and device Active CN114443616B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111667493.0A CN114443616B (en) 2021-12-30 2021-12-30 FPGA-based parallel heterogeneous database acceleration method and device

Publications (2)

Publication Number Publication Date
CN114443616A CN114443616A (en) 2022-05-06
CN114443616B true CN114443616B (en) 2024-01-16

Family

ID=81366453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111667493.0A Active CN114443616B (en) 2021-12-30 2021-12-30 FPGA-based parallel heterogeneous database acceleration method and device

Country Status (1)

Country Link
CN (1) CN114443616B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021047120A1 (en) * 2019-09-12 2021-03-18 苏州浪潮智能科技有限公司 Resource allocation method in fpga heterogeneous accelerator card cluster, device, and medium
CN113468220A (en) * 2021-09-03 2021-10-01 苏州浪潮智能科技有限公司 Data query method, device, equipment and medium
CN113704301A (en) * 2021-07-15 2021-11-26 苏州浪潮智能科技有限公司 Data processing method, device, system, equipment and medium for heterogeneous computing platform

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9830354B2 (en) * 2013-08-07 2017-11-28 International Business Machines Corporation Accelerating multiple query processing operations

Similar Documents

Publication Publication Date Title
US11237728B2 (en) Method for accessing extended memory, device, and system
CN103729442B (en) Record the method and database engine of transaction journal
US8140741B2 (en) Semiconductor storage device and control method thereof
US11360705B2 (en) Method and device for queuing and executing operation commands on a hard disk
US10417218B2 (en) Techniques to achieve ordering among storage device transactions
CN114527942B (en) Method, system, storage medium and equipment for writing data based on solid state disk
CN102521179A (en) Achieving device and achieving method of direct memory access (DMA) reading operation
US20150234602A1 (en) Data storage device for filtering page in two steps, system including the same, and method of operating the same
US20220253668A1 (en) Data processing method and device, storage medium and electronic device
CN102567254B (en) The method that adopts dma controller to carry out data normalization processing
CN103377292B (en) Database result set caching method and device
CN114443616B (en) FPGA-based parallel heterogeneous database acceleration method and device
US9003217B2 (en) Semiconductor integrated circuit apparatus
CN116627867A (en) Data interaction system, method, large-scale operation processing method, equipment and medium
CN108182169B (en) Method for realizing high-efficiency FFT in MTD filter
CN110825326A (en) Method and device for improving SSD random reading performance, computer equipment and storage medium
CN109814940A (en) Configure the method, apparatus and processor of hardware accelerator
CN103577110A (en) System on chip and read-write method thereof
CN103914413A (en) External-storage access interface for coarseness reconfigurable system and access method of external-storage access interface
CN111966300B (en) Memory data writing method and device for saving main control SRAM
CN107169313A (en) The read method and computer-readable recording medium of DNA data files
CN102065038B (en) Reconstruction method and device for realizing interference cancellation in wireless communication system
CN112069768A (en) Method for optimizing input and output delay of dual-port SRAM (static random Access memory)
TWI646460B (en) Data processing circuit and data processing method
CN109582516A (en) The rear end SSD method for analyzing performance, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant