CN112905239B - Point cloud preprocessing acceleration method based on FPGA, accelerator and electronic equipment - Google Patents
- Publication number
- CN112905239B (application number CN202110191083.7A)
- Authority
- CN
- China
- Prior art keywords
- point cloud
- voxel
- data
- fpga
- ddr
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3867—Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7807—System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
- G06F15/7817—Specially adapted for signal processing, e.g. Harvard architectures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The embodiments of the present application provide an FPGA-based point cloud preprocessing acceleration method, an accelerator and an electronic device. The acceleration method comprises the following steps: receiving a parameter configuration and a start command; dividing the top view into voxels, each voxel containing a plurality of point clouds; maintaining a voxel table on the FPGA that records the number of point clouds in each voxel and related information; reading the point cloud data to be processed, performing the computation, and storing the computation results in DDR; and reading the computation result data, performing max pooling, writing the pooled results into the DDR while completing the scatter function at the same time. In this processing scheme, the data storage structure uses FPGA on-chip memory, keeping the consumption of on-chip storage resources low; pipeline parallelism is fully exploited to process the point cloud data in parallel, greatly increasing data throughput, so that point cloud preprocessing is accelerated and the requirements of autonomous driving scenarios are met.
Description
Technical Field
The application relates to the technical field of point cloud data processing, in particular to a point cloud preprocessing acceleration method based on an FPGA, an accelerator and electronic equipment.
Background
Most existing machine learning systems run on a CPU or GPU. They are structurally complex, involve enormous amounts of computation and data, and consume considerable power and energy, so they cannot be applied directly to embedded environments such as automobiles. This greatly restricts the usage scenarios of deep learning.
In practical applications, excessive processing latency or heavy consumption of compute and storage resources greatly limits the application fields of CNNs (Convolutional Neural Networks). For example, when processing images and lidar point clouds in the autonomous driving field, the data volume is large and the real-time requirements are strict, so research on optimizing CNNs for high throughput and low latency is urgent.
In general, a point cloud preprocessing algorithm runs on a CPU, which processes the point cloud data serially and cannot exploit pipeline parallelism. When large-scale point cloud data is processed, this traditional CPU-based approach takes a long time and tends to become the bottleneck of system performance, so the real-time requirement of autonomous driving cannot be met.
Disclosure of Invention
In view of this, the embodiments of the present application provide a point cloud preprocessing acceleration method, an accelerator and an electronic device based on FPGA, which at least partially solve the problems existing in the prior art.
In a first aspect, an embodiment of the present application provides a method for accelerating point cloud preprocessing based on FPGA, including the following steps:
receiving parameter configuration and starting commands;
dividing the top view to be processed into a plurality of voxels according to a grid, wherein each voxel comprises a plurality of point clouds;
maintaining a voxel table on an FPGA, wherein the voxel table records the quantity and related information of point clouds in each voxel;
reading the point cloud data to be processed from DDR in a pipelined manner;
calculating the point cloud data, and storing a calculation result in the DDR;
reading the calculation result data in the DDR according to the point cloud related information in the voxel table;
and respectively carrying out max pooling on the point cloud data in each voxel according to the read computation result data, writing the pooled results into the DDR and completing the scatter function at the same time.
According to a specific implementation manner of the embodiment of the application, the point cloud computing module performs computing processing on the point cloud data, where the computing processing includes computing coordinates of a voxel center, computing coordinates of a point cloud offset voxel center, convolution operation, activation function and normalization.
According to a specific implementation manner of the embodiment of the application, the coordinates of the voxel center and the coordinates of the offset voxel center of the point cloud are calculated in a serial pipeline mode, and a set of point cloud data generates a set of coordinates.
According to a specific implementation manner of the embodiment of the application, when the convolution operation is performed, a group of point cloud data generates a plurality of channels and adopts multi-channel convolution operation, and the multi-channel convolution operation adopts parallel processing.
According to a specific implementation manner of the embodiment of the application, the parameter configuration includes the data volume to be processed, the source data address, the destination address and parameters used in the point cloud computing.
According to a specific implementation manner of the embodiment of the application, the relevant information recorded in the voxel table includes the number of point clouds processed in each voxel and the storage address of the calculation result in the DDR.
In a second aspect, an embodiment of the present application further provides a point cloud preprocessing accelerator based on an FPGA, where the point cloud preprocessing accelerator includes: the system comprises a voxel table, a top layer control module, a point cloud computing module and a maximum value pooling module;
the voxel table records the quantity and related information of point clouds in each voxel by using storage resources on an FPGA (field programmable gate array) chip;
one end of the top layer control module is connected with the APU and is used for receiving parameter configuration and starting commands, and the other end of the top layer control module is connected with the point cloud computing module and controls the pipeline flow between the top layer control module and the point cloud computing module;
the point cloud computing module is connected with the voxel table and is used for computing point cloud data, and computing results are stored in the DDR, wherein the computing processing of the point cloud data comprises the steps of computing voxel coordinates, computing coordinates of a point cloud offset voxel center, carrying out convolution operation, activating functions and normalizing;
the max pooling module is connected with the voxel table and is used for carrying out max pooling on the point cloud data in each voxel, writing the pooled results into the DDR and completing the scatter function at the same time.
According to a specific implementation manner of the embodiment of the application, the top-level control module is further connected with the Datamover and controls the Datamover to read and write the data in the DDR.
In a third aspect, embodiments of the present application further provide an electronic device, including:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the FPGA-based point cloud preprocessing acceleration method of any one of the preceding first aspects.
In a fourth aspect, embodiments of the present application further provide a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the FPGA-based point cloud preprocessing acceleration method of any one of the preceding first aspects.
In the FPGA-based point cloud preprocessing acceleration method of the present application, the top-level control module receives the parameter configuration and a start command; the top view is divided into a plurality of voxels according to a grid; a voxel table is maintained on the FPGA, recording the relevant information of the point clouds in each voxel; the point cloud data is fed into the point cloud computing module for computation, and the computation results are stored in DDR; and, in the max pooling module, max pooling is performed on the point cloud data in each voxel according to the computation result data, the pooled results are written into the DDR, and the scatter function is completed at the same time. In this processing scheme, an efficient hardware architecture is designed for the point cloud preprocessing algorithm on the FPGA. The architecture makes full use of pipelining to process the point cloud data in parallel, greatly increasing data throughput; it optimizes the data storage structure, using on-chip storage for the key data so that on-chip storage resource consumption is kept as low as possible while performance requirements are met; and through parameterized configuration it can meet the requirements of different application scenarios, improving practicality.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of an acceleration method for preprocessing point cloud based on FPGA according to an embodiment of the present application;
fig. 2 is a system frame diagram of an FPGA-based point cloud preprocessing accelerator according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below with reference to the accompanying drawings.
Other advantages and effects of the present application will become apparent to those skilled in the art from the present disclosure, when the following description of the embodiments is taken in conjunction with the accompanying drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. The present application may be embodied or carried out in other specific embodiments, and the details of the present application may be modified or changed from various points of view and applications without departing from the spirit of the present application. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
It is noted that various aspects of the embodiments are described below within the scope of the following claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present application, one skilled in the art will appreciate that one aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such apparatus may be implemented and/or such methods practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
It should also be noted that the illustrations provided in the following embodiments merely illustrate the basic concepts of the application by way of illustration, and only the components related to the application are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
In addition, in the following description, specific details are provided in order to provide a thorough understanding of the examples. However, it will be understood by those skilled in the art that the aspects may be practiced without these specific details.
In a first aspect, an embodiment of the present application provides an FPGA (Field Programmable Gate Array)-based point cloud preprocessing acceleration method, whose steps are shown in fig. 1. The specific steps are as follows:
the first step: a Top level control module (Top Controller) receives the parameter configuration and initiates the command.
In the embodiment of the application, the top layer control module is connected with the APU through the AXI_Lite interface to receive parameter configuration, wherein the parameter configuration mainly comprises data quantity to be processed, a source data address, a destination address and the like, and parameters used in calculation. Through parameterized configuration, the requirements of different application scenes can be met.
And a second step of: dividing the top view into H×W voxels (pixels) according to grids, wherein each Voxel comprises a plurality of point clouds, H represents the number of voxels in the height direction, and W represents the number of voxels in the width direction.
And a third step of: a Voxel Table (Voxel Table) is maintained on the FPGA, and the Voxel Table records the number of point clouds and related information in each Voxel.
In this embodiment, the voxel table data is stored on the FPGA chip, which is energy-efficient and compact. Implementing the voxel table with on-chip storage resources reduces accesses to external memory, and the access latency is far lower than that of DDR (double data rate synchronous dynamic random access memory). Since the point cloud top view is divided into H×W voxels and each entry in the voxel table corresponds to one voxel, the voxel table has H×W entries. The information recorded in the voxel table mainly comprises the number of processed point clouds in each voxel and the storage address of the computation results in DDR.
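The per-voxel bookkeeping described above can be sketched as follows. This is an illustrative software model, not the hardware implementation: the entry layout (a count plus a DDR base address) follows the text, while the toy grid size and the `record_point` helper are assumptions for demonstration.

```python
# Hedged sketch of the H x W voxel table kept in on-chip memory: one entry per
# voxel, holding the number of point clouds processed so far and the DDR base
# address of that voxel's computation results.
from dataclasses import dataclass

@dataclass
class VoxelEntry:
    count: int = 0      # point clouds processed in this voxel
    ddr_addr: int = -1  # base address of results in DDR (-1 = not yet allocated)

H, W = 4, 4             # toy grid size for illustration only
voxel_table = [[VoxelEntry() for _ in range(W)] for _ in range(H)]

def record_point(row: int, col: int, addr_if_new: int) -> None:
    """Hypothetical helper: update the entry when a point in voxel (row, col) is processed."""
    entry = voxel_table[row][col]
    if entry.count == 0:
        entry.ddr_addr = addr_if_new  # first point in this voxel: remember its DDR slot
    entry.count += 1
```

In hardware the (row, col) pair computed from the voxel coordinate indexes this table directly, which is what makes the lookup in the sixth step fast.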
Fourth step: and (3) reading point cloud data to be processed from the DDR in the running water, and inputting the point cloud data to a point cloud computing module (calculation).
Fifth step: and the point cloud computing module performs computing processing on the point cloud data, and a computing result is stored in the DDR.
The specific process of the calculation processing of the point cloud data comprises the following steps: voxel coordinates are calculated (Calculate Voxel Coordinate), coordinates of the point cloud offset voxel center are calculated (Calculate Coordinate Offset), convolution operation (Conv), activation function (Relu) and normalization.
When the coordinates are computed, a serial pipeline is used, and a set of point cloud data generates a set of coordinates. Computing the voxel coordinate means computing the coordinate of the voxel on the top view; through this coordinate, the entry (lookup table entry) in the voxel table corresponding to the voxel can then be accessed quickly.
When computing the convolution, a set of point cloud data (x, y, z) produces several channels, so the multi-channel convolution can be computed in parallel to increase speed. The parallelism is chosen as follows: in this embodiment, a set of point cloud data (x, y, z) generates 32 channels, i.e. 32 data values, with a data bit width of 8 bits. The achievable parallelism depends on the available DDR bandwidth; for example, with a 128-bit data bus (clocked at 200 MHz), the parallelism is the data bus width divided by the data width, 128/8 = 16. It should be understood that the choice of 32 channels per set of point cloud data and an 8-bit data width is specific to this embodiment; in practice these may be determined according to the actual situation.
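The parallelism rule above is a one-line calculation; the following check simply reproduces the arithmetic from the text.

```python
# Worked check of the parallelism rule: parallelism is bounded by the DDR data
# bus width divided by the bit width of one channel value.
bus_width_bits = 128   # AXI data bus width at 200 MHz, as in the example
data_width_bits = 8    # one 8-bit convolution output value
parallelism = bus_width_bits // data_width_bits
assert parallelism == 16  # matches 128/8 = 16 in the text
```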
An activation function operation is then performed on the result of the convolution, with the parallelism of the activation function chosen to be consistent with that of the convolution.
The data produced by the activation function is written into the DDR after normalization.
The computation results are stored in DDR. Each voxel holds at most 32 sets of point cloud data; if a voxel contains fewer than 32, space for 32 is still reserved. Each set of point cloud data corresponds to 32 channels and the data bit width is 8 bits, so for an H×W voxel grid the DDR capacity occupied is H×W×32×32×8 bits.
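The buffer-size formula can be made concrete with a worked example. The H and W values below are illustrative assumptions (they match the earlier sketch's assumed grid), not values fixed by the patent.

```python
# Worked example of the result-buffer size formula H x W x 32 x 32 x 8 bits:
# space for 32 point clouds is reserved per voxel, each point cloud has 32
# channels, and each value is 8 bits. H and W are illustrative assumptions.
H, W = 500, 440
bits = H * W * 32 * 32 * 8
total_bytes = bits // 8    # = H * W * 1024 bytes, since 32 * 32 * 8 bits = 1024 bytes
```

For this assumed grid that is roughly 215 MiB, which motivates buffering the results in DDR rather than in on-chip memory.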
Sixth step: and reading the calculation result data in the DDR according to the point cloud related information in the voxel table. The specific process is as follows: and traversing a voxel Table (Lookup Table) in sequence, calculating a voxel coordinate, namely the coordinate of the voxel on a top view through the point cloud calculation in the fifth step, quickly accessing an entry in the voxel Table corresponding to the voxel through the coordinate, obtaining the number of the point clouds in the voxel and the storage address of a calculation result in the DDR for the entry which is not empty, and reading the result of a point cloud calculation module through a Datamover (a form of direct memory access), thereby quickly reading calculation result data (Read DDR) in the DDR.
Seventh step: and according to the read calculation result data, carrying out maximum value pooling processing (only taking the maximum value as a reserved value) between point clouds in the same voxel by a maximum value pooling module (Maxpooling), writing the pooled result into the DDR (Compare & Write DDR), writing the pooled result into the DDR, and simultaneously completing the function of a sciter, wherein the function of the sciter is the coordinates of the point clouds, and storing the point cloud data into corresponding DDR addresses. The data size of the pooled result after the storage structure in DDR is pooled is: H×W×32×8 bits.
According to a specific implementation manner of the embodiment of the application, the coordinates of the voxel center and the coordinates of the offset voxel center of the point cloud are calculated in a serial pipeline mode, and a set of point cloud data generates a set of coordinates.
According to a specific implementation manner of the embodiment of the application, when the convolution operation is performed, a group of point cloud data generates a plurality of channels and adopts multi-channel convolution operation, and the multi-channel convolution operation adopts parallel processing.
According to the embodiment of the application, the parallel processing of the point cloud data is realized by fully utilizing the pipeline technology, the data throughput rate is greatly improved, and the point cloud computing performance is improved.
In a second aspect, an embodiment of the present application further provides a point cloud preprocessing accelerator based on FPGA, where a specific frame diagram of the point cloud preprocessing accelerator is shown in fig. 2, and the point cloud preprocessing accelerator includes: voxel table (Voxel table), top control module (Top Controller), point cloud computing module (calculation), and maximum pooling module (Maxpooling).
The voxel table stores point cloud related data by utilizing storage resources on an FPGA chip;
one end of the top layer control module is connected with the APU through an AXI_lite interface and is used for receiving parameter configuration, starting commands and initializing a voxel table, and the other end of the top layer control module is connected with the point cloud computing module and controls the pipeline flow between the top layer control module and the point cloud computing module;
the point cloud computing module is connected with the voxel table and is used for computing the point cloud data in the voxel table, and the computing result is stored in the DDR, wherein the computing process of the point cloud data comprises the steps of computing the coordinates of the voxels, computing the coordinates of the offset voxel centers of the point cloud, carrying out convolution operation, activating functions and normalizing.
The max pooling module is connected with the voxel table; it performs max pooling on the computation results of the point cloud computing module, completes the scatter function, and stores the pooled results in the DDR.
According to a specific implementation manner of the embodiment of the present application, the top layer control module is further connected to and controls the Datamover to read and write data in the DDR.
For understanding, the embodiment specifically describes the workflow of the point cloud preprocessing accelerator, including the following steps:
s1, a top layer control module configures parameters of an accelerator through an AXI_Lite interface, and the method mainly comprises the following steps: the data volume to be processed, the source data address, the destination address and the like, and parameters used in calculation;
s2, the APU sends a start command to start the accelerator to work through an AXI_Lite interface;
s3, the accelerator firstly initializes all list items of the Voxel Table to be 0;
s4, sending a command for reading the weight from the DDR to the Datamver through an axis_mm2s_cmd interface, receiving the weight (used for convolution operation) through the axis_mm2s interface and caching the weight in an on-chip memory;
s5, sending a command for reading the point cloud from the DDR to the Datamver through an axis_mm2s_cmd interface, and receiving the point cloud data through the axis_mm2s interface for calculation of the next step;
s6, judging whether the point cloud data is in the selected boundary range, and calculating the center coordinates of the Voxel where the point cloud data is located and the offset coordinates of the point cloud data in the range;
s7, looking up a Table, and searching the number of existing point clouds of the Voxel where the point clouds are located in the Voxel Table;
s8, when the Voxel of the point cloud is not built in the Voxel Table, if the total number of the Voxels is smaller than 20000, creating a new Table entry; the Voxel entry already exists in the table, which contains less than 32 point clouds. The next step is carried out in both cases, otherwise, the point cloud data are discarded;
s9, calculating convolution and activation functions in parallel, and writing the result into the DDR through an axis_s2mm interface;
s10, maxpooling: reading the table in sequence, obtaining the storage address of each point cloud data volume and calculation result in the DDR, and sending a reading command through an axis_mm2s_cmd interface;
s11, maxpooling: carrying out maximum value pooling operation on the data comparison, and then storing the pooled result in the DDR for the DPU to use;
S12: the accelerator signals completion of its work by sending a done command through the AXI_Lite interface.
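The admission check in step S8 above can be sketched as follows. This is an illustrative software model of the bookkeeping only; the dictionary-based table and the `accept_point` helper are assumptions, while the 20000-voxel and 32-points-per-voxel limits come from the text.

```python
# Hedged sketch of the S8 admission logic: a point cloud is kept only if its
# voxel can still be created (fewer than 20000 voxels so far) or already
# exists with fewer than 32 point clouds; otherwise it is discarded.
MAX_VOXELS = 20000
MAX_POINTS_PER_VOXEL = 32

def accept_point(table: dict, key) -> bool:
    """table maps voxel key -> current point count; returns True if the point is kept."""
    if key not in table:
        if len(table) >= MAX_VOXELS:
            return False          # no room for a new voxel entry: discard the point
        table[key] = 1            # create a new entry with this first point
        return True
    if table[key] >= MAX_POINTS_PER_VOXEL:
        return False              # voxel already holds 32 point clouds: discard
    table[key] += 1
    return True
```

Capping both the entry count and the per-voxel point count bounds the on-chip table size and the DDR result buffer, which is what makes the fixed H×W×32×32×8-bit layout possible.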
In a third aspect, embodiments of the present application further provide an electronic device, including:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the FPGA-based point cloud preprocessing acceleration method of any one of the preceding first aspects.
In a fourth aspect, embodiments of the present application further provide a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the steps of the FPGA-based point cloud preprocessing acceleration method of any one of the preceding first aspects:
receiving parameter configuration and starting commands;
dividing the top view to be processed into a plurality of voxels according to a grid, wherein each voxel comprises a plurality of point clouds;
maintaining a voxel table on an FPGA, wherein the voxel table records the quantity and related information of point clouds in each voxel;
reading the point cloud data to be processed from DDR in a pipelined manner;
calculating the point cloud data, and storing a calculation result in the DDR;
reading the calculation result data in the DDR according to the point cloud related information in the voxel table;
and respectively carrying out max pooling on the point cloud data in each voxel according to the read computation result data, writing the pooled results into the DDR and completing the scatter function at the same time.
Aiming at the problem that a point cloud preprocessing algorithm running on a CPU cannot meet the real-time and throughput requirements, the embodiments provided by the present application give an FPGA-based point cloud preprocessing acceleration method, accelerator and electronic device, and design an efficient hardware architecture: the data storage structure uses FPGA on-chip memory, keeping on-chip storage resource consumption as low as possible while performance requirements are met; pipeline parallelism is fully exploited to process the point cloud data in parallel, greatly increasing data throughput, so that point cloud preprocessing is accelerated and the requirements of autonomous driving scenarios are met.
The foregoing describes only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto; any change or substitution readily conceivable by those skilled in the art within the technical scope disclosed in the present application shall be covered by the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (8)
1. An FPGA-based point cloud preprocessing acceleration method, characterized by comprising the following steps:
receiving parameter configuration and a start command;
dividing a frame to be processed into a plurality of voxels according to a grid, wherein each voxel contains a plurality of points of the point cloud;
maintaining a voxel table on the FPGA, wherein the voxel table records the number of points in each voxel and related information;
reading the point cloud data to be processed from the DDR in a pipelined manner;
performing calculation on the point cloud data, wherein the calculation result is stored in the DDR, and the calculation process comprises calculating the coordinates of the voxel center, calculating the coordinates of the point cloud offset from the voxel center, performing a convolution operation, applying an activation function, and performing normalization;
reading the calculation result data from the DDR according to the point cloud related information in the voxel table;
performing maximum pooling on the point cloud data in each voxel according to the read calculation result data, writing the pooled result into the DDR, and completing the scatter function at the same time;
wherein the related information recorded in the voxel table comprises the number of processed points in each voxel and the storage addresses of the calculation results in the DDR;
and the calculation and the maximum pooling of the point cloud data are both performed on the FPGA.
2. The FPGA-based point cloud preprocessing acceleration method of claim 1, wherein the calculation of the voxel center coordinates and the calculation of the coordinates of the point cloud offset from the voxel center are performed in a serial pipeline, and one set of point cloud data generates one set of coordinates.
3. The FPGA-based point cloud preprocessing acceleration method of claim 1, wherein, in the convolution operation, one set of point cloud data generates a plurality of channels, a multi-channel convolution operation is adopted, and the multi-channel convolution operation is processed in parallel.
4. The FPGA-based point cloud preprocessing acceleration method of claim 1, wherein the parameter configuration comprises the amount of data to be processed, a source data address, a destination address, and the parameters used in the point cloud calculation.
5. An FPGA-based point cloud preprocessing accelerator, characterized by comprising: a voxel table, a top-level control module, a point cloud calculation module, and a maximum pooling module, wherein
the voxel table records the number of points in each voxel and related information using FPGA on-chip storage resources;
one end of the top-level control module is connected to an APU and is used for receiving parameter configuration and a start command, and the other end of the top-level control module is connected to the point cloud calculation module and controls the pipeline flow between the two;
the point cloud calculation module is connected to the voxel table and is used for calculating the point cloud data, and the calculation result is stored in the DDR, wherein the calculation process of the point cloud data comprises calculating the voxel coordinates, calculating the coordinates of the point cloud offset from the voxel center, a convolution operation, an activation function, and normalization;
the maximum pooling module is connected to the voxel table and is used for performing maximum pooling on the point cloud data in each voxel, writing the pooled result into the DDR, and completing the scatter function at the same time;
the related information recorded in the voxel table comprises the number of processed points in each voxel and the storage addresses of the calculation results in the DDR;
and the calculation and the maximum pooling of the point cloud data are both performed on the FPGA.
6. The FPGA-based point cloud preprocessing accelerator of claim 5, wherein the top-level control module is further connected to a data mover and controls the data mover to read and write data in the DDR.
7. An electronic device, characterized by comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the FPGA-based point cloud preprocessing acceleration method of any one of claims 1-4.
8. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the FPGA-based point cloud preprocessing acceleration method of any one of claims 1-4.
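The maximum pooling and scatter steps recited in claims 1 and 5 can be sketched as follows. Representing the DDR as a Python list indexed by address, and scattering each pooled vector into a dense per-voxel canvas (as in PointPillars-style preprocessing), are assumptions for illustration, not the claimed hardware:

```python
# Illustrative max-pool + scatter over the voxel table; the canvas layout
# (grid_x, grid_y, channels) is an assumption for the sketch.
def max_pool_and_scatter(features_ddr, voxel_table, grid, channels):
    """Per-voxel max pooling of the calculation results read back from
    'DDR', followed by a scatter into a dense canvas."""
    canvas = [[[0.0] * channels for _ in range(grid[1])]
              for _ in range(grid[0])]
    for (vx, vy), entry in voxel_table.items():
        pooled = [float("-inf")] * channels
        for addr in entry["addresses"]:        # read results back from DDR
            feat = features_ddr[addr]
            pooled = [max(p, f) for p, f in zip(pooled, feat)]
        canvas[vx][vy] = pooled                # scatter into the canvas
    return canvas
```

Doing the scatter while the pooled result is written back, as the claims describe, saves a separate pass over the data; the sketch above performs both in one loop for the same reason.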
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110191083.7A CN112905239B (en) | 2021-02-19 | 2021-02-19 | Point cloud preprocessing acceleration method based on FPGA, accelerator and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110191083.7A CN112905239B (en) | 2021-02-19 | 2021-02-19 | Point cloud preprocessing acceleration method based on FPGA, accelerator and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112905239A CN112905239A (en) | 2021-06-04 |
CN112905239B true CN112905239B (en) | 2024-01-12 |
Family
ID=76123904
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110191083.7A Active CN112905239B (en) | 2021-02-19 | 2021-02-19 | Point cloud preprocessing acceleration method based on FPGA, accelerator and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112905239B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110060288A (en) * | 2019-03-15 | 2019-07-26 | 华为技术有限公司 | Generation method, device and the storage medium of point cloud characteristic pattern |
CN111814679A (en) * | 2020-07-08 | 2020-10-23 | 上海雪湖科技有限公司 | FPGA-based realization algorithm for voxel-encoder and VFE of voxel 3D network |
CN111860340A (en) * | 2020-07-22 | 2020-10-30 | 上海科技大学 | Efficient K-nearest neighbor search algorithm for three-dimensional laser radar point cloud in unmanned driving |
WO2020258529A1 (en) * | 2019-06-28 | 2020-12-30 | 东南大学 | Bnrp-based configurable parallel general convolutional neural network accelerator |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10783698B2 (en) * | 2018-07-31 | 2020-09-22 | Intel Corporation | Point cloud operations |
US11164363B2 (en) * | 2019-07-08 | 2021-11-02 | Waymo Llc | Processing point clouds using dynamic voxelization |
2021-02-19: CN CN202110191083.7A patent/CN112905239B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110060288A (en) * | 2019-03-15 | 2019-07-26 | 华为技术有限公司 | Generation method, device and the storage medium of point cloud characteristic pattern |
WO2020258529A1 (en) * | 2019-06-28 | 2020-12-30 | 东南大学 | Bnrp-based configurable parallel general convolutional neural network accelerator |
CN111814679A (en) * | 2020-07-08 | 2020-10-23 | 上海雪湖科技有限公司 | FPGA-based realization algorithm for voxel-encoder and VFE of voxel 3D network |
CN111860340A (en) * | 2020-07-22 | 2020-10-30 | 上海科技大学 | Efficient K-nearest neighbor search algorithm for three-dimensional laser radar point cloud in unmanned driving |
Non-Patent Citations (2)
Title |
---|
Design of a high-density memory logical structure for massive laser point cloud data in a cloud computing environment; Liu Hui; Laser Journal (Issue 09); 91-95 *
Liu Zhongyu et al., Introduction to Graph Neural Networks: GNN Principles Explained (《深入浅出图神经网络 GNN原理解析》), China Machine Press, 2020, p. 173. *
Also Published As
Publication number | Publication date |
---|---|
CN112905239A (en) | 2021-06-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110097174B (en) | Method, system and device for realizing convolutional neural network based on FPGA and row output priority | |
CN107437110B (en) | Block convolution optimization method and device of convolutional neural network | |
CN109214504B (en) | FPGA-based YOLO network forward reasoning accelerator design method | |
US20210103818A1 (en) | Neural network computing method, system and device therefor | |
US10684946B2 (en) | Method and device for on-chip repetitive addressing | |
US11030095B2 (en) | Virtual space memory bandwidth reduction | |
CN110390382B (en) | Convolutional neural network hardware accelerator with novel feature map caching module | |
CN113792621B (en) | FPGA-based target detection accelerator design method | |
US20210295607A1 (en) | Data reading/writing method and system in 3d image processing, storage medium and terminal | |
CN114779209B (en) | Laser radar point cloud voxelization method and device | |
WO2021147276A1 (en) | Data processing method and apparatus, and chip, electronic device and storage medium | |
CN116227599A (en) | Inference model optimization method and device, electronic equipment and storage medium | |
WO2020103883A1 (en) | Method for executing matrix multiplication, circuit and soc | |
CN115033185A (en) | Memory access processing method and device, storage device, chip, board card and electronic equipment | |
CN114265696A (en) | Pooling device and pooling accelerating circuit for maximum pooling layer of convolutional neural network | |
CN115249057A (en) | System and computer-implemented method for graph node sampling | |
CN112905239B (en) | Point cloud preprocessing acceleration method based on FPGA, accelerator and electronic equipment | |
CN110490312A (en) | A kind of pond calculation method and circuit | |
CN115577747A (en) | High-parallelism heterogeneous convolutional neural network accelerator and acceleration method | |
CN112035056B (en) | Parallel RAM access equipment and access method based on multiple computing units | |
CN113111013B (en) | Flash memory data block binding method, device and medium | |
US20210224632A1 (en) | Methods, devices, chips, electronic apparatuses, and storage media for processing data | |
CN110390392B (en) | Convolution parameter accelerating device based on FPGA and data reading and writing method | |
CN112101538A (en) | Graph neural network hardware computing system and method based on memory computing | |
CN111488970A (en) | Execution optimization method and device of neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |