CN114691345A - Calculation framework suitable for SLAM nonlinear parallelization chip and working method - Google Patents

Calculation framework suitable for SLAM nonlinear parallelization chip and working method Download PDF

Info

Publication number
CN114691345A
CN114691345A (application CN202011564008.2A)
Authority
CN
China
Prior art keywords
matrix
slam
parallelization
unit
scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011564008.2A
Other languages
Chinese (zh)
Inventor
董志岩
张立华
成祥
陈迟晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202011564008.2A priority Critical patent/CN114691345A/en
Publication of CN114691345A publication Critical patent/CN114691345A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/11Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443Sum of products
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras

Abstract

The invention discloses a computing architecture for a SLAM nonlinear parallelization chip, comprising: at least one systolic-array-based block-structured parallel matrix multiply-add unit, which decomposes large-scale matrix operations into parallel multiply-add operations on block matrices of at most 6 × 6 scale; at least one iterative solver implementing the preconditioned conjugate gradient method for solving large-scale symmetric positive definite matrix equations; and a hardware mapping module for processing and analyzing the complex data streams in SLAM back-end optimization. The invention provides an acceleration framework targeting SLAM back-end optimization and realizes a back-end optimization hardware accelerator based on the bundle adjustment method; it can be flexibly adapted to the back-end optimization components of various SLAM algorithms and offers flexible configuration, high operation speed, and low power consumption.

Description

Calculation framework suitable for SLAM nonlinear parallelization chip and working method
Technical Field
The invention belongs to the technical field of computation acceleration chips, and in particular relates to a computing architecture and working method for a SLAM nonlinear parallelization chip.
Background
In the prior art, the SLAM algorithm is usually implemented on a general-purpose processor, which cannot meet the requirements of common real-time SLAM operation; the frequency of back-end optimization is therefore usually reduced to preserve real-time performance, which greatly degrades the quality of the back-end optimization. Another way to implement the SLAM algorithm is to run it on a graphics processing unit (GPU), but the heterogeneous complexity of SLAM operations cannot be fully mapped onto the GPU, so globally efficient acceleration cannot be achieved.
In response to this computing-power bottleneck, manufacturers and research organizations have sought to accelerate robot algorithms through hardware design. eSLAM provides an energy-efficient framework for real-time ORB-SLAM that accelerates the feature extraction and matching stages on an FPGA platform, realizing a real-time SLAM algorithm on a low-power platform. However, eSLAM only accelerates the front end, while the more computationally demanding back-end operations are not considered.
Intel has proposed a general multi-robot system that integrates multiple functions such as SLAM and path planning and can perform tasks such as search and rescue, precision agriculture, and industrial automation. The system uses general-purpose processors for robot computation and integrates a host processor for acquiring and preprocessing sensor data; a Tensilica DSP processor for localization/mapping, collision avoidance, and collaborative intelligent decision-making; a dedicated path planning and motion control hardware accelerator; an audio accelerator for human speech detection; and a CNN accelerator for object detection and recognition. There is no dedicated accelerator designed specifically for SLAM. The University of Michigan proposed a parallel processor that accelerates the semi-global matching process: the design achieves dense real-time 3-D depth and 3-D motion perception through neighbor-guided semi-global matching at full-high-definition (1920 × 1080, FHD) resolution, enabling real-time autonomous drone flight at that resolution. However, this processor only accelerates the pose estimation part, which is a very small module of the SLAM system.
In addition, related prior-art SLAM patents, such as one on a SLAM operation device and method, realize a SLAM hardware accelerator comprising three major parts (storage, operation, and control) and disclose acceleration devices for vector and matrix operation units. These can effectively accelerate SLAM algorithms according to different requirements, are applicable to various SLAM algorithms and various input data types, and offer strong flexibility, high configurability, high operation speed, and low power consumption. However, the matrix operations are designed for 16-dimensional square matrices and no special design is made for back-end optimization, so this is acceleration of general SLAM operations.
Another patent, on an FPGA accelerator for bundle adjustment with known self pose in SLAM, discloses an FPGA accelerator for bundle adjustment updates under a known pose. It implements two hardware modules, a rotation matrix processor and a reprojection processor, which balance computation time and reuse intermediate variables by dividing the computation into stages, improving operation speed and saving hardware resources.
However, that method assumes a scene with known pose, and the accelerated step is only one step of the bundle adjustment method; the key steps with larger computational load are not covered, so its applicability is limited.
Disclosure of Invention
The invention aims to provide a computing architecture and working method for a SLAM nonlinear parallelization chip that apply a dedicated SLAM hardware acceleration structure to the back-end optimization process, solving the poor adaptability of existing approaches that accelerate only the front end.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a calculation framework suitable for an SLAM nonlinear parallelization chip, which comprises at least one block structure parallelization matrix multiply-add unit based on a systolic array, wherein the multiply-add unit is used for decomposing a large-scale matrix into block structure matrix parallelization multiply-add operation with the maximum 6 x 6 scale;
at least one iterative solver implementing the preconditioned conjugate gradient method for solving large-scale symmetric positive definite matrix equations;
and the hardware mapping module is used for processing and analyzing the complex data stream in SLAM back-end optimization.
Preferably, the iterative solver solves the positive definite matrix equation of the block-structured Schur complement system using the preconditioned conjugate gradient method in a block-parallel manner, obtaining the optimized change of the camera pose parameters and, from it, the change of the optimized map points.
Preferably, the iterative solver uses a parallel Schur complement matrix construction acceleration unit to rapidly perform Schur elimination over m three-dimensional map points and n six-dimensional camera poses, so that the Hessian matrix of the observed projection coordinate errors is reduced to a 6n × 6n scale.
Preferably, the number of map points m is greater than the number of camera poses n; the Schur complement matrix parallelized construction acceleration unit therefore reduces the scale of the operation matrix. The matrix operations are accelerated by the parallelized matrix operation units, whose number is determined by the accelerator resources.
The invention further provides a working method for the SLAM nonlinear parallelization chip computing architecture, which comprises the following steps:
Step 1: pre-calculate the correlation matrices, with the actual calculation accelerated by the matrix operation unit;
Step 2: the iterative solver constructs the parallel Schur complement matrix through the Schur complement construction acceleration unit;
Step 3: iteratively solve the matrix equation using the dedicated preconditioned-conjugate-gradient matrix iteration acceleration unit to obtain the change of the camera pose parameters;
Step 4: the matrix operation unit participates in accelerating the calculation of the map point spatial changes.
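The four steps above can be sketched end-to-end as a software model. This is a toy single-camera, single-point instance; the function name, matrix sizes, damping value, and the plain conjugate gradient loop standing in for the preconditioned solver unit are all illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def backend_step(Jc, Jp, e, lam=1e-3):
    """Software model of the four-step flow for one camera (6 params)
    and one map point (3 params); all names are illustrative.
    Step 1: precompute the normal-equation (correlation) blocks.
    Step 2: build the Schur complement system for the camera block.
    Step 3: solve it iteratively (plain CG stands in for the PCG unit).
    Step 4: back-substitute the map-point change."""
    # Step 1: normal-equation blocks with LM damping on the diagonals
    U = Jc.T @ Jc + lam * np.eye(6)
    W = Jc.T @ Jp
    V = Jp.T @ Jp + lam * np.eye(3)
    ec = -Jc.T @ e
    ep = -Jp.T @ e
    # Step 2: Schur complement of the map-point block
    V_inv = np.linalg.inv(V)
    H = U - W @ V_inv @ W.T
    b = ec - W @ V_inv @ ep
    # Step 3: iterative solve for the camera pose change
    dc = np.zeros(6)
    r = b.copy()
    p = r.copy()
    for _ in range(50):
        Hp = H @ p
        alpha = (r @ r) / (p @ Hp)
        dc += alpha * p
        r_new = r - alpha * Hp
        if np.linalg.norm(r_new) < 1e-10:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    # Step 4: map-point change by back-substitution
    dp = V_inv @ (ep - W.T @ dc)
    return dc, dp
```

The result matches a direct solve of the full block system, which is what the pipeline of dedicated units computes in stages.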
The present invention also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above working method.
Compared with the prior art, the invention has the beneficial effects that:
1. An acceleration framework for SLAM back-end optimization is provided: a back-end optimization hardware accelerator based on the bundle adjustment method, which can be flexibly adapted to the back-end optimization components of various SLAM algorithms and is characterized by flexible configuration, high operation speed, and low power consumption.
2. The hardware accelerator reduces the computation cost of the SLAM algorithm and thereby improves its performance. At the same time, a dedicated computing framework co-designed across the software and hardware layers improves computing capability and reduces hardware cost, lowering the threshold for robot design and application, improving SLAM system performance, and leaving room for expansion.
3. The camera pose parameters and map point information are updated, and the bundle adjustment computation of the SLAM back-end optimization algorithm is accelerated by the hardware-parallel computing framework. This improves the computing power of the SLAM system hardware, increases computation speed, and saves computation time, laying a foundation for extending SLAM to more complex algorithms, achieving high SLAM system performance, and broadening its application scenarios.
Drawings
Fig. 1 is a schematic diagram of an overall structure of a computation architecture suitable for a SLAM nonlinear parallelized chip according to the present invention;
FIG. 2 is a flow chart of multiply-add operations on 6-dimensional square matrices according to the present invention;
fig. 3 is a schematic structural diagram of the Schur complement construction unit provided by the present invention;
fig. 4 is a schematic structural diagram of the preconditioned conjugate gradient matrix solver unit provided by the present invention;
fig. 5 is a schematic diagram of the optimized hardware acceleration structure of the SLAM back end provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
An effective and common classical method for solving the SLAM back-end nonlinear optimization problem is bundle adjustment. In bundle adjustment the number of map points is far larger than the number of camera pose parameters and can reach thousands; even when keyframes and sparsity are exploited, hundreds of keypoints may be observed in each keyframe, and the computation involves a large number of matrix operations and a complex matrix-equation solving process. Specifically, in the back-end optimization of the SLAM algorithm, a large amount of time is spent on Schur complement equation construction, large-scale positive definite matrix solving, and other matrix computations. These consume substantial computing resources on current mainstream CPUs and place high demands on CPU computing power.
When solving with the bundle adjustment method, the Levenberg-Marquardt (LM) algorithm is used for the optimization, as follows:
(J(x)ᵀJ(x) + λ D(x)ᵀD(x)) δ* = −J(x)ᵀ ε        (1)
where x is the parameter to be optimized, divided into camera pose parameters (n groups) and map point parameters (m groups); f(x) is the perspective projection function; J(x) is the Jacobian matrix with respect to x; D(x) is its diagonal matrix; ε is the error between the actual map point and the projection point predicted by the camera model; λ is the LM algorithm parameter; and δ* is the computed optimized change. Splitting x into the camera part and the map-point part, let Jc and Jp denote the Jacobian blocks with respect to the camera pose and map point parameters, and let

U = ∑ JcᵀJc,  W = ∑ JcᵀJp,  V = ∑ JpᵀJp,  δ* = [δc; δp],

with εc and εp the corresponding camera and map-point blocks of the right-hand side and the λ DᵀD damping terms absorbed into the diagonal blocks U and V.
A matrix equation can be obtained, as in equation (2):
[ U   W ] [ δc ]   [ εc ]
[ Wᵀ  V ] [ δp ] = [ εp ]        (2)
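As a hedged software sketch of how the blocks of equation (2) accumulate from per-observation Jacobians, the function below assumes 2 × 6 camera Jacobians Jc, 2 × 3 map-point Jacobians Jp, identity damping (DᵀD = I), and a minus-sign convention on the right-hand side; all of these, and the function name, are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

def assemble_normal_blocks(obs, n_cams, n_pts, lam=1e-3):
    """Accumulate the block normal equations of equation (2):
    U = sum Jc^T Jc, W = sum Jc^T Jp, V = sum Jp^T Jp,
    with LM damping lam on the diagonal (D^T D = I assumed).
    obs: list of (cam_i, pt_j, Jc (2x6), Jp (2x3), e (2,))."""
    U = np.zeros((6 * n_cams, 6 * n_cams))
    W = np.zeros((6 * n_cams, 3 * n_pts))
    V = np.zeros((3 * n_pts, 3 * n_pts))
    eps_c = np.zeros(6 * n_cams)
    eps_p = np.zeros(3 * n_pts)
    for i, j, Jc, Jp, e in obs:
        ci = slice(6 * i, 6 * i + 6)
        pj = slice(3 * j, 3 * j + 3)
        U[ci, ci] += Jc.T @ Jc
        W[ci, pj] += Jc.T @ Jp
        V[pj, pj] += Jp.T @ Jp
        eps_c[ci] += -Jc.T @ e   # right-hand side blocks
        eps_p[pj] += -Jp.T @ e   # (sign convention: -J^T e)
    U += lam * np.eye(6 * n_cams)  # LM damping
    V += lam * np.eye(3 * n_pts)
    return U, W, V, eps_c, eps_p
```

Because each observation touches only one camera block and one point block, V is block-diagonal with 3 × 3 blocks, which is what makes the Schur elimination below cheap.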
From the above problem and its solving method, the SLAM back-end nonlinear optimization has the following special properties:
firstly, matrix operations in back-end optimization are currently the main bottleneck limiting performance: the Jacobian and Hessian matrices are pre-computed, the large-scale equation can be reduced by Schur elimination as in equation (3), and the matrix equation is solved by the preconditioned conjugate gradient method; all of these key steps require matrix operations;
secondly, because the Jacobian matrix has a sparse structure, the first term on the left of equation (2) is a special sparse block-structured matrix, and each block can be computed in parallel at every step without mutual interference, which motivates designing a parallel operation module and device for acceleration.
Hschur δc = (U − W V⁻¹ Wᵀ) δc = εc − W V⁻¹ εp = bschur        (3)
δp = V⁻¹ (εp − Wᵀ δc)        (4)
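The Schur elimination can be sketched in software as follows. This is a minimal dense-matrix sketch: the function name and the use of explicit inverses are illustrative assumptions (the hardware exploits the 3 × 3 block-diagonal structure of V rather than inverting a full matrix).

```python
import numpy as np

def schur_solve(U, W, V, eps_c, eps_p):
    """Solve [[U, W], [W^T, V]] [dc; dp] = [eps_c; eps_p] by Schur
    elimination of the map-point block V (assumed invertible)."""
    V_inv = np.linalg.inv(V)
    H_schur = U - W @ V_inv @ W.T            # reduced camera system
    b_schur = eps_c - W @ V_inv @ eps_p
    dc = np.linalg.solve(H_schur, b_schur)   # camera pose update
    dp = V_inv @ (eps_p - W.T @ dc)          # back-substitute map points
    return dc, dp
```

With m map points and n poses, the system shrinks from (3m + 6n) unknowns to the 6n camera unknowns before the iterative solve, which is the scale reduction the construction unit targets.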
The computing framework implementing hardware acceleration of the back-end optimization part of the SLAM algorithm in this embodiment is shown in fig. 1. It is mainly divided into a matrix operation acceleration unit, a preconditioned conjugate gradient matrix-equation solving acceleration unit, and further parts supporting data storage and algorithm control for the SLAM algorithm. A general-purpose processor controls and schedules data transmission among the bus, the storage module, the general-purpose processor, and the dedicated operation modules; the matrix operation acceleration unit realizes fast multiply-add operations among matrices, vectors, and scalars; the Schur complement matrix parallelized construction acceleration unit realizes the elimination of the large-scale matrix equation to construct the Schur complement equation; and the preconditioned conjugate gradient matrix-equation solving acceleration unit solves each block structure of the matrix equation in parallel.
As shown in FIG. 2, the operation is performed in systolic-array fashion: the operation units form a network connected by data-flow relationships. According to the instruction, the required data are read from the start addresses of the two matrices to be operated on and fed into designated units in a time-sequenced direction, propagating through the multidimensional array of operation units. In each clock cycle, the data of the units move rightward/downward in sequence; each unit multiplies its input data, adds the intermediate value stored in the unit to obtain its partial result, stores that result, and passes data to the adjacent right/lower unit. Data computed in a unit propagate directly through the array as intermediate values, and this reuse of operands greatly reduces the number of data transfers. The compute-and-transfer steps repeat until the matrices have been fully fed into the array and the calculation is finished, after which the final result is written to the storage destination address specified by the instruction, completing the operation instruction flow. In addition, the unit can change the size of the directed input data stream by instruction, thereby supporting operations such as 1 × 3 vectors, 1 × 6 vectors, vector multiplication between 3-dimensional square matrices and 3 × 6 matrices, and matrix multiply-add.
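The block decomposition performed by this unit can be modeled in software as tiled matrix multiplication. The sketch below is illustrative only (the block size parameter and function name are mine, not the patent's); it shows that each block multiply-add over at most 6 × 6 tiles is independent and can therefore be assigned to parallel units.

```python
import numpy as np

def blocked_matmul(A, B, bs=6):
    """Compute A @ B by tiling into blocks of at most bs x bs.
    Each output tile C[i:i+bs, j:j+bs] accumulates independent
    block multiply-adds, so the inner products over p can be
    dispatched to parallel multiply-add units."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            for p in range(0, k, bs):
                # one block multiply-add: C_ij += A_ip @ B_pj
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C
```

Edge tiles smaller than 6 × 6 are handled naturally by the slicing, which mirrors padding or short data streams in the hardware unit.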
As shown in fig. 3, in the Schur complement construction unit provided by this embodiment, the left-hand side of the equation to be solved is a (3m + 6n)-dimensional square matrix; by Schur elimination its scale is reduced to a 6n-dimensional square matrix, greatly simplifying and accelerating the subsequent matrix-equation solving process.
The unit takes six types of matrices as input, including the projection error, the camera-parameter Jacobian matrix, and the map-point-parameter Jacobian matrix. According to the data dependencies of the computation it is divided into five parts, twelve computation stages in total; the parts are assigned to computation stages according to computation size and data dependency, so as to balance computation delay and increase computation speed.
The shaded portion is a dedicated matrix multiply add unit.
The first part calculates the matrix multiply-add of the map-point Jacobian Jp transposed with itself, V = ∑ JpᵀJp, and calculates its inverse V⁻¹.
The second part calculates the product of the transposed map-point Jacobian Jp and the projection error ep, giving εp, and then performs matrix multiply-add in sequence with V⁻¹, Jp, and Jc to obtain ∑ JcᵀJp V⁻¹ εp.
The third part calculates εc, the product of the transposed camera-parameter Jacobian Jc and the projection error ec, then subtracts the result of the second part to obtain bschur = εc − ∑ JcᵀJp V⁻¹ εp.
The fourth part calculates the matrix multiply-add of the transposed map-point Jacobian Jp with the camera-parameter Jacobian Jc, Wᵀ = ∑ JpᵀJc; after calculating its transpose W, it performs matrix multiply-add in sequence with V⁻¹ and Wᵀ to obtain W V⁻¹ Wᵀ = ∑ JcᵀJp V⁻¹ Wᵀ.
The fifth part calculates the matrix multiply-add of the transposed camera-parameter Jacobian Jc with itself, U = ∑ JcᵀJc, then performs a matrix subtraction with the result of the fourth part: Hschur = U − ∑ JcᵀJp V⁻¹ Wᵀ.
The intermediate variables generated in the five computation stages are stored in on-chip RAM or a register file; meanwhile, the RAM between computation stages is enlarged to form a ping-pong buffer structure to improve computational parallelism. At this point the unit has completed the construction of the equation to be solved after Schur elimination.
As shown in fig. 4, the acceleration unit of this embodiment comprises four kinds of operations: matrix multiplication (dashed box), vector dot product (double solid box), vector axpy operation (shaded box), and scalar operation (single solid box).
The specific description is as follows:
the preprocessing gradient descent matrix solving method is divided into a plurality of calculation stages according to the size of calculated amount and the data dependency so as to balance calculation delay and increase calculation speed. Before using the unit to solve, each piece of data, such as the first line of algorithm 1, is first initialized according to the algorithm, whereThe Ax ═ b equation corresponds to the matrix in this example with the following relationship, a: h ═ Hschur,b:=bschur
The first stage computes lines 8, 9, and 10 of Algorithm 1;
the second stage computes lines 5 and 4 of the algorithm, split into two parts executed in parallel;
the third stage computes lines 6 and 7 of the algorithm, likewise split into two parts executed in parallel.
The vectors x, r, w, and p generated in each iteration serve as the data to be updated in the next iteration, until the exit conditions of lines 2 and 3 of Algorithm 1 are met, finally yielding the optimized solution x. The intermediate variables generated in the three computation stages are stored in on-chip RAM or a register file; meanwhile, the RAM between computation stages is enlarged to form a ping-pong buffer structure to improve computational parallelism. The unit thus completes the preconditioned conjugate gradient solve and obtains the optimized x, i.e. the optimized change of the camera parameters.
Algorithm 1 preprocessing conjugate gradient algorithm
1:  k := 0; x0 given; r0 := b − A·x0; w0 := M⁻¹·r0; p0 := w0
2:  while k < kmax
3:    and ‖rk‖ > tol do
4:      αk := (rkᵀ·wk) / (pkᵀ·A·pk)
5:      xk+1 := xk + αk·pk
6:      rk+1 := rk − αk·A·pk
7:      wk+1 := M⁻¹·rk+1
8:      βk := (rk+1ᵀ·wk+1) / (rkᵀ·wk)
9:      pk+1 := wk+1 + βk·pk
10:     k := k + 1
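A minimal software sketch of the preconditioned conjugate gradient iteration implemented by the solver unit follows. The choice of a Jacobi (diagonal) preconditioner M, the function name, and the tolerances are assumptions for illustration; the patent does not specify the preconditioner.

```python
import numpy as np

def pcg(A, b, tol=1e-8, max_iter=100):
    """Preconditioned conjugate gradient with a Jacobi (diagonal)
    preconditioner, solving A x = b for symmetric positive definite A."""
    M_inv = 1.0 / np.diag(A)          # Jacobi preconditioner M^-1
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x                     # residual
    w = M_inv * r                     # preconditioned residual
    p = w.copy()                      # search direction
    rw = r @ w
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:   # exit condition (lines 2-3)
            break
        Ap = A @ p
        alpha = rw / (p @ Ap)         # step length (line 4)
        x += alpha * p                # update solution (line 5)
        r -= alpha * Ap               # update residual (line 6)
        w = M_inv * r                 # precondition (line 7)
        rw_new = r @ w
        beta = rw_new / rw            # (line 8)
        p = w + beta * p              # new conjugate direction (line 9)
        rw = rw_new
    return x
```

In the accelerator, A would be Hschur and b would be bschur, and the dot products, axpy updates, and matrix-vector products map onto the unit's four kinds of operations.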
As shown in fig. 5, this embodiment accelerates the computation of SLAM algorithm back-end optimization in a parallelized manner, reduces data exchange, and saves storage space.
To control the SLAM back-end optimization process more efficiently, the pre-computation matrix operations, the Schur complement equation construction, and the PCG solution of the positive definite matrix equation are the main accelerated operations and are connected via the bus, so that the units execute as a pipeline and can also execute concurrently.
The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto; any equivalent substitution or change made by a person skilled in the art on the basis of the technical solution and inventive concept disclosed herein shall fall within the protection scope of the present invention.

Claims (6)

1. A computing architecture for a SLAM nonlinear parallelization chip, characterized in that it comprises at least one systolic-array-based block-structured parallel matrix multiply-add unit, which decomposes large-scale matrix operations into parallel multiply-add operations on block matrices of at most 6 × 6 scale;
at least one iterative solver implementing the preconditioned conjugate gradient method for solving large-scale symmetric positive definite matrix equations;
and the hardware mapping module is used for processing and analyzing the complex data stream in SLAM back-end optimization.
2. The architecture of claim 1, wherein the iterative solver solves the positive definite matrix equation of the block-structured Schur complement system using the preconditioned conjugate gradient method in a block-parallel manner, obtaining the optimized change of the camera pose parameters and, from it, the change of the optimized map points.
3. The architecture of claim 1, wherein the iterative solver uses the Schur complement matrix parallelized construction acceleration unit to rapidly perform Schur elimination on the Hessian matrix of the observed projection coordinate errors over m three-dimensional map points and n six-dimensional camera poses, so that the large-scale matrix equation is reduced to a 6n × 6n scale.
4. The architecture of claim 3, wherein the number of map points m is greater than the number of camera poses n; the Schur complement matrix parallelized construction acceleration unit reduces the scale of the operation matrix, the matrix operations are accelerated by the parallelized matrix operation units, and their number is determined by the accelerator resources.
5. The working method of the SLAM nonlinear parallelization chip computing architecture according to claim 1, comprising the following steps:
Step 1: pre-calculate the correlation matrices, with the actual calculation accelerated by the matrix operation unit;
Step 2: the Schur complement construction acceleration unit of the iterative solver constructs the parallel Schur complement matrix;
Step 3: iteratively solve the matrix equation using the dedicated preconditioned-conjugate-gradient matrix iteration acceleration unit to obtain the change of the camera pose parameters;
Step 4: the matrix operation unit participates in accelerating the calculation of the map point spatial changes.
6. A computer storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as claimed in claim 5.
CN202011564008.2A 2020-12-25 2020-12-25 Calculation framework suitable for SLAM nonlinear parallelization chip and working method Pending CN114691345A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011564008.2A CN114691345A (en) 2020-12-25 2020-12-25 Calculation framework suitable for SLAM nonlinear parallelization chip and working method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011564008.2A CN114691345A (en) 2020-12-25 2020-12-25 Calculation framework suitable for SLAM nonlinear parallelization chip and working method

Publications (1)

Publication Number Publication Date
CN114691345A true CN114691345A (en) 2022-07-01

Family

ID=82129318

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011564008.2A Pending CN114691345A (en) 2020-12-25 2020-12-25 Calculation framework suitable for SLAM nonlinear parallelization chip and working method

Country Status (1)

Country Link
CN (1) CN114691345A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116382921A (en) * 2023-05-08 2023-07-04 深圳市欧朗博科技有限公司 Baseband chip architecture and method for pre-allocation and parallelism self-adjustment of data streams


Similar Documents

Publication Publication Date Title
Guo et al. Software-hardware codesign for efficient neural network acceleration
EP3955173B1 (en) Training neural network accelerators using mixed precision data formats
US10691996B2 (en) Hardware accelerator for compressed LSTM
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
CN109657794B (en) Instruction queue-based distributed deep neural network performance modeling method
Qin et al. π-ba: Bundle adjustment acceleration on embedded fpgas with co-observation optimization
Bai et al. Pointnet on fpga for real-time lidar point cloud processing
Gankidi FPGA accelerator architecture for Q-learning and its applications in space exploration rovers
CN114691345A (en) Calculation framework suitable for SLAM nonlinear parallelization chip and working method
Wang et al. Briefly Analysis about CNN Accelerator based on FPGA
CN111709270A (en) Three-dimensional shape recovery and attitude estimation method and device based on depth image
Zhang et al. A-u3d: A unified 2d/3d cnn accelerator on the versal platform for disparity estimation
US20220230069A1 (en) Neural network sparsification device and method, and related product
Ferreira et al. Fast exact Bayesian inference for high-dimensional models
Lu et al. A reconfigurable DNN training accelerator on FPGA
Xia et al. PAI-FCNN: FPGA based inference system for complex CNN models
CN113298241B (en) Deep separable convolutional neural network acceleration method and accelerator
Hashimoto et al. Fadec: FPGA-based acceleration of video depth estimation by hw/sw co-design
Chen et al. Hardware acceleration implementation of three-dimensional convolutional neural network on vector digital signal processors
Niu et al. A Novel Distributed Duration-Aware LSTM for Large Scale Sequential Data Analysis
Wang et al. An efficient architecture for floating-point eigenvalue decomposition
Maliţa et al. Map-scan node accelerator for big-data
Kästner et al. Analysis of hardware implementations to accelerate convolutional and recurrent neuronal networks
CN113673704B (en) Relational network reasoning optimization method based on software and hardware cooperative acceleration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination