CN114691345A - Calculation framework suitable for SLAM nonlinear parallelization chip and working method - Google Patents

Calculation framework suitable for SLAM nonlinear parallelization chip and working method Download PDF

Info

Publication number
CN114691345A
CN114691345A (application CN202011564008.2A)
Authority
CN
China
Prior art keywords
matrix
slam
parallelization
unit
scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011564008.2A
Other languages
Chinese (zh)
Inventor
董志岩
张立华
成祥
陈迟晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202011564008.2A priority Critical patent/CN114691345A/en
Publication of CN114691345A publication Critical patent/CN114691345A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/11Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443Sum of products
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras

Abstract

The invention discloses a computing architecture for a SLAM nonlinear parallelization chip, comprising: at least one systolic-array-based block-structured parallel matrix multiply-add unit, which decomposes large-scale matrix operations into parallel multiply-add operations on block matrices of at most 6 × 6 scale; at least one iterative solver implementing the preconditioned conjugate gradient method for solving large-scale symmetric positive definite matrix equations; and a hardware mapping module for processing and analyzing the complex data streams in SLAM back-end optimization. The invention provides an acceleration framework targeting SLAM back-end optimization and realizes a back-end optimization hardware accelerator based on the bundle adjustment method; it can be flexibly adapted to the back-end optimization components of various SLAM algorithms and offers flexible configuration, high operation speed, and low power consumption.

Description

Calculation framework suitable for SLAM nonlinear parallelization chip and working method
Technical Field
The invention belongs to the technical field of computation acceleration chips, and in particular relates to a computing architecture and working method for a SLAM nonlinear parallelization chip.
Background
In the prior art, the SLAM algorithm is usually implemented on a general-purpose processor, which cannot meet the requirements of common real-time SLAM operation; the frequency of back-end optimization is therefore usually reduced to preserve real-time performance, which greatly degrades the quality of the back-end optimization. Another way to implement the SLAM algorithm is to run it on a graphics processing unit (GPU), but the heterogeneous complexity of SLAM operations cannot be fully mapped onto the GPU, so globally efficient acceleration cannot be achieved.
In response to this computing-power bottleneck, manufacturers and research organizations have sought to accelerate robot algorithms through hardware design. eSLAM provides an energy-efficient framework for real-time ORB-SLAM that accelerates the feature extraction and matching stages on an FPGA platform, realizing a real-time SLAM algorithm on a low-power platform. However, eSLAM only accelerates the front end, while the more computationally demanding back-end operations are not considered.
Intel has proposed a general multi-robot system that integrates multiple functions such as SLAM and path planning and can perform tasks such as search and rescue, precision agriculture, and industrial automation. The system uses general-purpose processors for robot computation and integrates a host processor for acquiring and preprocessing sensor data; a Tensilica DSP processor for localization/mapping, collision avoidance, and collaborative intelligent decision-making; a dedicated path planning and motion control hardware accelerator; an audio accelerator for human speech detection; and a CNN accelerator for object detection and recognition. There is no dedicated accelerator designed specifically for SLAM. The University of Michigan proposed a parallel processor that accelerates the semi-global matching process: the design achieves dense real-time 3-D depth and 3-D motion perception through neighbor-guided semi-global matching at full-high-definition (1920 × 1080, FHD) resolution, enabling real-time autonomous drone flight at that resolution. However, this processor only accelerates the pose estimation part, which is a very small module of the SLAM system.
In addition, related prior-art SLAM patents, such as one on a SLAM operation device and method, realize a SLAM hardware accelerator comprising three major parts (storage, operation, and control) and disclose acceleration devices for vector and matrix operation units. These can effectively accelerate SLAM algorithms according to different requirements, are applicable to various SLAM algorithms and various input data types, and offer strong flexibility, high configurability, high operation speed, and low power consumption. However, the matrix operations are designed for 16-dimensional square matrices and no special design is made for back-end optimization, so this is acceleration of general SLAM operations.
Another patent, on an FPGA accelerator for bundle adjustment with known self pose in SLAM, discloses an FPGA accelerator for bundle adjustment updates under a known pose. It implements two hardware modules, a rotation matrix processor and a reprojection processor, which balance computation time and reuse intermediate variables by dividing the computation into stages, improving operation speed and saving hardware resources.
However, that method assumes a scene with known pose, and the accelerated step is only one step of the bundle adjustment method; the key steps with larger computational load are not covered, so its applicability is limited.
Disclosure of Invention
The invention aims to provide a computing architecture and working method for a SLAM nonlinear parallelization chip that apply a dedicated SLAM hardware acceleration structure to the back-end optimization process, solving the poor adaptability of existing approaches that accelerate only the front end.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a calculation framework suitable for an SLAM nonlinear parallelization chip, which comprises at least one block structure parallelization matrix multiply-add unit based on a systolic array, wherein the multiply-add unit is used for decomposing a large-scale matrix into block structure matrix parallelization multiply-add operation with the maximum 6 x 6 scale;
at least one iterative solver implementing the preconditioned conjugate gradient method for solving large-scale symmetric positive definite matrix equations;
and the hardware mapping module is used for processing and analyzing the complex data stream in SLAM back-end optimization.
Preferably, the iterative solver solves the positive definite matrix equation of the block-structured Schur complement system using the preconditioned conjugate gradient method in a block-parallel manner, obtaining the optimized change of the camera pose parameters and, from it, the change of the optimized map points.
Preferably, the iterative solver uses a parallel Schur complement matrix construction acceleration unit to rapidly perform Schur elimination over m three-dimensional map points and n six-dimensional camera poses, so that the Hessian matrix of the observed projection coordinate errors is reduced to a 6n × 6n scale.
Preferably, the number of map points m is greater than the number of camera poses n; the Schur complement matrix parallelized construction acceleration unit therefore reduces the scale of the operation matrix. The matrix operations are accelerated by the parallelized matrix operation units, whose number is determined by the accelerator resources.
The invention further provides a working method for the SLAM nonlinear parallelization chip computing architecture, which comprises the following steps:
Step 1: pre-calculate the correlation matrices, with the actual calculation accelerated by the matrix operation unit;
Step 2: the iterative solver constructs the parallel Schur complement matrix through the Schur complement construction acceleration unit;
Step 3: iteratively solve the matrix equation using the dedicated preconditioned-conjugate-gradient matrix iteration acceleration unit to obtain the change of the camera pose parameters;
Step 4: the matrix operation unit participates in accelerating the calculation of the map point spatial changes.
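The four steps above can be sketched end-to-end as a software model. This is a toy single-camera, single-point instance; the function name, matrix sizes, damping value, and the plain conjugate gradient loop standing in for the preconditioned solver unit are all illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def backend_step(Jc, Jp, e, lam=1e-3):
    """Software model of the four-step flow for one camera (6 params)
    and one map point (3 params); all names are illustrative.
    Step 1: precompute the normal-equation (correlation) blocks.
    Step 2: build the Schur complement system for the camera block.
    Step 3: solve it iteratively (plain CG stands in for the PCG unit).
    Step 4: back-substitute the map-point change."""
    # Step 1: normal-equation blocks with LM damping on the diagonals
    U = Jc.T @ Jc + lam * np.eye(6)
    W = Jc.T @ Jp
    V = Jp.T @ Jp + lam * np.eye(3)
    ec = -Jc.T @ e
    ep = -Jp.T @ e
    # Step 2: Schur complement of the map-point block
    V_inv = np.linalg.inv(V)
    H = U - W @ V_inv @ W.T
    b = ec - W @ V_inv @ ep
    # Step 3: iterative solve for the camera pose change
    dc = np.zeros(6)
    r = b.copy()
    p = r.copy()
    for _ in range(50):
        Hp = H @ p
        alpha = (r @ r) / (p @ Hp)
        dc += alpha * p
        r_new = r - alpha * Hp
        if np.linalg.norm(r_new) < 1e-10:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    # Step 4: map-point change by back-substitution
    dp = V_inv @ (ep - W.T @ dc)
    return dc, dp
```

The result matches a direct solve of the full block system, which is what the pipeline of dedicated units computes in stages.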
The present invention also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above working method.
Compared with the prior art, the invention has the beneficial effects that:
1. An acceleration framework for SLAM back-end optimization is provided: a back-end optimization hardware accelerator based on the bundle adjustment method, which can be flexibly adapted to the back-end optimization components of various SLAM algorithms and is characterized by flexible configuration, high operation speed, and low power consumption.
2. The hardware accelerator reduces the computation cost of the SLAM algorithm and thereby improves its performance. At the same time, a dedicated computing framework co-designed across the software and hardware layers improves computing capability and reduces hardware cost, lowering the threshold for robot design and application, improving SLAM system performance, and leaving room for expansion.
3. The camera pose parameters and map point information are updated, and the bundle adjustment computation of the SLAM back-end optimization algorithm is accelerated by the hardware-parallel computing framework. This improves the computing power of the SLAM system hardware, increases computation speed, and saves computation time, laying a foundation for extending SLAM to more complex algorithms, achieving high SLAM system performance, and broadening its application scenarios.
Drawings
Fig. 1 is a schematic diagram of an overall structure of a computation architecture suitable for a SLAM nonlinear parallelized chip according to the present invention;
FIG. 2 is a flow chart of multiply-add operations on 6-dimensional square matrices according to the present invention;
fig. 3 is a schematic structural diagram of the Schur complement construction unit provided by the present invention;
fig. 4 is a schematic structural diagram of the preconditioned conjugate gradient matrix solver unit provided by the present invention;
fig. 5 is a schematic diagram of the optimized hardware acceleration structure of the SLAM back end provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
An effective and common classical method for solving the SLAM back-end nonlinear optimization problem is bundle adjustment. In bundle adjustment the number of map points is far larger than the number of camera pose parameters and can reach thousands; even when keyframes and sparsity are exploited, hundreds of keypoints may be observed in each keyframe, and the computation involves a large number of matrix operations and a complex matrix-equation solving process. Specifically, in the back-end optimization of the SLAM algorithm, a large amount of time is spent on Schur complement equation construction, large-scale positive definite matrix solving, and other matrix computations. These consume substantial computing resources on current mainstream CPUs and place high demands on CPU computing power.
When solving with the bundle adjustment method, the Levenberg-Marquardt (LM) algorithm is used for the optimization, as follows:
(J(x)ᵀJ(x) + λ D(x)ᵀD(x)) δ* = −J(x)ᵀ ε        (1)
where x is the parameter to be optimized, divided into camera pose parameters (n groups) and map point parameters (m groups); f(x) is the perspective projection function; J(x) is the Jacobian matrix with respect to x; D(x) is its diagonal matrix; ε is the error between the actual map point and the projection point predicted by the camera model; λ is the LM algorithm parameter; and δ* is the computed optimized change. Splitting x into the camera part and the map-point part, let Jc and Jp denote the Jacobian blocks with respect to the camera pose and map point parameters, and let

U = ∑ JcᵀJc,  W = ∑ JcᵀJp,  V = ∑ JpᵀJp,  δ* = [δc; δp],

with εc and εp the corresponding camera and map-point blocks of the right-hand side and the λ DᵀD damping terms absorbed into the diagonal blocks U and V.
A matrix equation can be obtained, as in equation (2):
[ U   W ] [ δc ]   [ εc ]
[ Wᵀ  V ] [ δp ] = [ εp ]        (2)
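As a hedged software sketch of how the blocks of equation (2) accumulate from per-observation Jacobians, the function below assumes 2 × 6 camera Jacobians Jc, 2 × 3 map-point Jacobians Jp, identity damping (DᵀD = I), and a minus-sign convention on the right-hand side; all of these, and the function name, are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

def assemble_normal_blocks(obs, n_cams, n_pts, lam=1e-3):
    """Accumulate the block normal equations of equation (2):
    U = sum Jc^T Jc, W = sum Jc^T Jp, V = sum Jp^T Jp,
    with LM damping lam on the diagonal (D^T D = I assumed).
    obs: list of (cam_i, pt_j, Jc (2x6), Jp (2x3), e (2,))."""
    U = np.zeros((6 * n_cams, 6 * n_cams))
    W = np.zeros((6 * n_cams, 3 * n_pts))
    V = np.zeros((3 * n_pts, 3 * n_pts))
    eps_c = np.zeros(6 * n_cams)
    eps_p = np.zeros(3 * n_pts)
    for i, j, Jc, Jp, e in obs:
        ci = slice(6 * i, 6 * i + 6)
        pj = slice(3 * j, 3 * j + 3)
        U[ci, ci] += Jc.T @ Jc
        W[ci, pj] += Jc.T @ Jp
        V[pj, pj] += Jp.T @ Jp
        eps_c[ci] += -Jc.T @ e   # right-hand side blocks
        eps_p[pj] += -Jp.T @ e   # (sign convention: -J^T e)
    U += lam * np.eye(6 * n_cams)  # LM damping
    V += lam * np.eye(3 * n_pts)
    return U, W, V, eps_c, eps_p
```

Because each observation touches only one camera block and one point block, V is block-diagonal with 3 × 3 blocks, which is what makes the Schur elimination below cheap.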
From the above problem and its solving method, the SLAM back-end nonlinear optimization has the following special properties:
firstly, matrix operations in back-end optimization are currently the main bottleneck limiting performance: the Jacobian and Hessian matrices are pre-computed, the large-scale equation can be reduced by Schur elimination as in equation (3), and the matrix equation is solved by the preconditioned conjugate gradient method; all of these key steps require matrix operations;
secondly, because the Jacobian matrix has a sparse structure, the first term on the left of equation (2) is a special sparse block-structured matrix, and each block can be computed in parallel at every step without mutual interference, which motivates designing a parallel operation module and device for acceleration.
Hschur δc = (U − W V⁻¹ Wᵀ) δc = εc − W V⁻¹ εp = bschur        (3)
δp = V⁻¹ (εp − Wᵀ δc)        (4)
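The Schur elimination can be sketched in software as follows. This is a minimal dense-matrix sketch: the function name and the use of explicit inverses are illustrative assumptions (the hardware exploits the 3 × 3 block-diagonal structure of V rather than inverting a full matrix).

```python
import numpy as np

def schur_solve(U, W, V, eps_c, eps_p):
    """Solve [[U, W], [W^T, V]] [dc; dp] = [eps_c; eps_p] by Schur
    elimination of the map-point block V (assumed invertible)."""
    V_inv = np.linalg.inv(V)
    H_schur = U - W @ V_inv @ W.T            # reduced camera system
    b_schur = eps_c - W @ V_inv @ eps_p
    dc = np.linalg.solve(H_schur, b_schur)   # camera pose update
    dp = V_inv @ (eps_p - W.T @ dc)          # back-substitute map points
    return dc, dp
```

With m map points and n poses, the system shrinks from (3m + 6n) unknowns to the 6n camera unknowns before the iterative solve, which is the scale reduction the construction unit targets.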
The computing framework implementing hardware acceleration of the back-end optimization part of the SLAM algorithm in this embodiment is shown in fig. 1. It is mainly divided into a matrix operation acceleration unit, a preconditioned conjugate gradient matrix-equation solving acceleration unit, and further parts supporting data storage and algorithm control for the SLAM algorithm. A general-purpose processor controls and schedules data transmission among the bus, the storage module, the general-purpose processor, and the dedicated operation modules; the matrix operation acceleration unit realizes fast multiply-add operations among matrices, vectors, and scalars; the Schur complement matrix parallelized construction acceleration unit realizes the elimination of the large-scale matrix equation to construct the Schur complement equation; and the preconditioned conjugate gradient matrix-equation solving acceleration unit solves each block structure of the matrix equation in parallel.
As shown in FIG. 2, the operation is performed in systolic-array fashion: the operation units form a network connected by data-flow relationships. According to the instruction, the required data are read from the start addresses of the two matrices to be operated on and fed into designated units in a time-sequenced direction, propagating through the multidimensional array of operation units. In each clock cycle, the data of the units move rightward/downward in sequence; each unit multiplies its input data, adds the intermediate value stored in the unit to obtain its partial result, stores that result, and passes data to the adjacent right/lower unit. Data computed in a unit propagate directly through the array as intermediate values, and this reuse of operands greatly reduces the number of data transfers. The compute-and-transfer steps repeat until the matrices have been fully fed into the array and the calculation is finished, after which the final result is written to the storage destination address specified by the instruction, completing the operation instruction flow. In addition, the unit can change the size of the directed input data stream by instruction, thereby supporting operations such as 1 × 3 vectors, 1 × 6 vectors, vector multiplication between 3-dimensional square matrices and 3 × 6 matrices, and matrix multiply-add.
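The block decomposition performed by this unit can be modeled in software as tiled matrix multiplication. The sketch below is illustrative only (the block size parameter and function name are mine, not the patent's); it shows that each block multiply-add over at most 6 × 6 tiles is independent and can therefore be assigned to parallel units.

```python
import numpy as np

def blocked_matmul(A, B, bs=6):
    """Compute A @ B by tiling into blocks of at most bs x bs.
    Each output tile C[i:i+bs, j:j+bs] accumulates independent
    block multiply-adds, so the inner products over p can be
    dispatched to parallel multiply-add units."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            for p in range(0, k, bs):
                # one block multiply-add: C_ij += A_ip @ B_pj
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C
```

Edge tiles smaller than 6 × 6 are handled naturally by the slicing, which mirrors padding or short data streams in the hardware unit.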
As shown in fig. 3, in the Schur complement construction unit provided by this embodiment, the left-hand side of the equation to be solved is a (3m + 6n)-dimensional square matrix; by Schur elimination its scale is reduced to a 6n-dimensional square matrix, greatly simplifying and accelerating the subsequent matrix-equation solving process.
The unit takes six types of matrices as input, including the projection error, the camera-parameter Jacobian matrix, and the map-point-parameter Jacobian matrix. According to the data dependencies of the computation it is divided into five parts, twelve computation stages in total; the parts are assigned to computation stages according to computation size and data dependency, so as to balance computation delay and increase computation speed.
The shaded portion is a dedicated matrix multiply add unit.
The first part calculates the matrix multiply-add of the map-point Jacobian Jp transposed with itself, V = ∑ JpᵀJp, and calculates its inverse V⁻¹.
The second part calculates the product of the transposed map-point Jacobian Jp and the projection error ep, giving εp, and then performs matrix multiply-add in sequence with V⁻¹, Jp, and Jc to obtain ∑ JcᵀJp V⁻¹ εp.
The third part calculates εc, the product of the transposed camera-parameter Jacobian Jc and the projection error ec, then subtracts the result of the second part to obtain bschur = εc − ∑ JcᵀJp V⁻¹ εp.
The fourth part calculates the matrix multiply-add of the transposed map-point Jacobian Jp with the camera-parameter Jacobian Jc, Wᵀ = ∑ JpᵀJc; after calculating its transpose W, it performs matrix multiply-add in sequence with V⁻¹ and Wᵀ to obtain W V⁻¹ Wᵀ = ∑ JcᵀJp V⁻¹ Wᵀ.
The fifth part calculates the matrix multiply-add of the transposed camera-parameter Jacobian Jc with itself, U = ∑ JcᵀJc, then performs a matrix subtraction with the result of the fourth part: Hschur = U − ∑ JcᵀJp V⁻¹ Wᵀ.
The intermediate variables generated in the five computation stages are stored in on-chip RAM or a register file; meanwhile, the RAM between computation stages is enlarged to form a ping-pong buffer structure to improve computational parallelism. At this point the unit has completed the construction of the equation to be solved after Schur elimination.
As shown in fig. 4, the acceleration unit of this embodiment comprises four kinds of operations: matrix multiplication (dashed box), vector dot product (double solid box), vector axpy operation (shaded box), and scalar operation (single solid box).
The specific description is as follows:
the preprocessing gradient descent matrix solving method is divided into a plurality of calculation stages according to the size of calculated amount and the data dependency so as to balance calculation delay and increase calculation speed. Before using the unit to solve, each piece of data, such as the first line of algorithm 1, is first initialized according to the algorithm, whereThe Ax ═ b equation corresponds to the matrix in this example with the following relationship, a: h ═ Hschur,b:=bschur
The first stage computes lines 8, 9, and 10 of Algorithm 1;
the second stage computes lines 5 and 4 of the algorithm, split into two parts executed in parallel;
the third stage computes lines 6 and 7 of the algorithm, likewise split into two parts executed in parallel.
The vectors x, r, w, and p generated in each iteration serve as the data to be updated in the next iteration, until the exit conditions of lines 2 and 3 of Algorithm 1 are met, finally yielding the optimized solution x. The intermediate variables generated in the three computation stages are stored in on-chip RAM or a register file; meanwhile, the RAM between computation stages is enlarged to form a ping-pong buffer structure to improve computational parallelism. The unit thus completes the preconditioned conjugate gradient solve and obtains the optimized x, i.e. the optimized change of the camera parameters.
Algorithm 1 preprocessing conjugate gradient algorithm
1:  k := 0; x0 given; r0 := b − A·x0; w0 := M⁻¹·r0; p0 := w0
2:  while k < kmax
3:    and ‖rk‖ > tol do
4:      αk := (rkᵀ·wk) / (pkᵀ·A·pk)
5:      xk+1 := xk + αk·pk
6:      rk+1 := rk − αk·A·pk
7:      wk+1 := M⁻¹·rk+1
8:      βk := (rk+1ᵀ·wk+1) / (rkᵀ·wk)
9:      pk+1 := wk+1 + βk·pk
10:     k := k + 1
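A minimal software sketch of the preconditioned conjugate gradient iteration implemented by the solver unit follows. The choice of a Jacobi (diagonal) preconditioner M, the function name, and the tolerances are assumptions for illustration; the patent does not specify the preconditioner.

```python
import numpy as np

def pcg(A, b, tol=1e-8, max_iter=100):
    """Preconditioned conjugate gradient with a Jacobi (diagonal)
    preconditioner, solving A x = b for symmetric positive definite A."""
    M_inv = 1.0 / np.diag(A)          # Jacobi preconditioner M^-1
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x                     # residual
    w = M_inv * r                     # preconditioned residual
    p = w.copy()                      # search direction
    rw = r @ w
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:   # exit condition (lines 2-3)
            break
        Ap = A @ p
        alpha = rw / (p @ Ap)         # step length (line 4)
        x += alpha * p                # update solution (line 5)
        r -= alpha * Ap               # update residual (line 6)
        w = M_inv * r                 # precondition (line 7)
        rw_new = r @ w
        beta = rw_new / rw            # (line 8)
        p = w + beta * p              # new conjugate direction (line 9)
        rw = rw_new
    return x
```

In the accelerator, A would be Hschur and b would be bschur, and the dot products, axpy updates, and matrix-vector products map onto the unit's four kinds of operations.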
As shown in fig. 5, this embodiment accelerates the computation of SLAM algorithm back-end optimization in a parallelized manner, reduces data exchange, and saves storage space.
To control the SLAM back-end optimization process more efficiently, the pre-computation matrix operations, the Schur complement equation construction, and the PCG solution of the positive definite matrix equation are the main accelerated operations and are connected via the bus, so that the units execute as a pipeline and can also execute concurrently.
The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto; any equivalent substitution or change made by a person skilled in the art on the basis of the technical solution and inventive concept disclosed herein shall fall within the protection scope of the present invention.

Claims (6)

1. A computing architecture for a SLAM nonlinear parallelization chip, characterized in that it comprises at least one systolic-array-based block-structured parallel matrix multiply-add unit, which decomposes large-scale matrix operations into parallel multiply-add operations on block matrices of at most 6 × 6 scale;
at least one iterative solver implementing the preconditioned conjugate gradient method for solving large-scale symmetric positive definite matrix equations;
and the hardware mapping module is used for processing and analyzing the complex data stream in SLAM back-end optimization.
2. The architecture of claim 1, wherein the iterative solver solves the positive definite matrix equation of the block-structured Schur complement system using the preconditioned conjugate gradient method in a block-parallel manner, obtaining the optimized change of the camera pose parameters and, from it, the change of the optimized map points.
3. The architecture of claim 1, wherein the iterative solver uses the Schur complement matrix parallelized construction acceleration unit to rapidly perform Schur elimination on the Hessian matrix of the observed projection coordinate errors over m three-dimensional map points and n six-dimensional camera poses, so that the large-scale matrix equation is reduced to a 6n × 6n scale.
4. The architecture of claim 3, wherein the number of map points m is greater than the number of camera poses n; the Schur complement matrix parallelized construction acceleration unit reduces the scale of the operation matrix, the matrix operations are accelerated by the parallelized matrix operation units, and their number is determined by the accelerator resources.
5. The working method of the SLAM nonlinear parallelization chip computing architecture according to claim 1, comprising the following steps:
Step 1: pre-calculate the correlation matrices, with the actual calculation accelerated by the matrix operation unit;
Step 2: the Schur complement construction acceleration unit of the iterative solver constructs the parallel Schur complement matrix;
Step 3: iteratively solve the matrix equation using the dedicated preconditioned-conjugate-gradient matrix iteration acceleration unit to obtain the change of the camera pose parameters;
Step 4: the matrix operation unit participates in accelerating the calculation of the map point spatial changes.
6. A computer storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as claimed in claim 5.
CN202011564008.2A 2020-12-25 2020-12-25 Calculation framework suitable for SLAM nonlinear parallelization chip and working method Pending CN114691345A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011564008.2A CN114691345A (en) 2020-12-25 2020-12-25 Calculation framework suitable for SLAM nonlinear parallelization chip and working method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011564008.2A CN114691345A (en) 2020-12-25 2020-12-25 Calculation framework suitable for SLAM nonlinear parallelization chip and working method

Publications (1)

Publication Number Publication Date
CN114691345A true CN114691345A (en) 2022-07-01

Family

ID=82129318

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011564008.2A Pending CN114691345A (en) 2020-12-25 2020-12-25 Calculation framework suitable for SLAM nonlinear parallelization chip and working method

Country Status (1)

Country Link
CN (1) CN114691345A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116382921A (en) * 2023-05-08 2023-07-04 深圳市欧朗博科技有限公司 Baseband chip architecture and method for pre-allocation and parallelism self-adjustment of data streams


Similar Documents

Publication Publication Date Title
Guo et al. Software-hardware codesign for efficient neural network acceleration
EP3955173B1 (en) Training neural network accelerators using mixed precision data formats
US10691996B2 (en) Hardware accelerator for compressed LSTM
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
CN109657794B (en) Instruction queue-based distributed deep neural network performance modeling method
Qin et al. π-ba: Bundle adjustment acceleration on embedded fpgas with co-observation optimization
Bai et al. Pointnet on fpga for real-time lidar point cloud processing
Gankidi FPGA accelerator architecture for Q-learning and its applications in space exploration rovers
CN114691345A (en) Calculation framework suitable for SLAM nonlinear parallelization chip and working method
Wang et al. Briefly Analysis about CNN Accelerator based on FPGA
CN111709270A (en) Three-dimensional shape recovery and attitude estimation method and device based on depth image
Zhang et al. A-u3d: A unified 2d/3d cnn accelerator on the versal platform for disparity estimation
US20220230069A1 (en) Neural network sparsification device and method, and related product
Ferreira et al. Fast exact Bayesian inference for high-dimensional models
Lu et al. A reconfigurable DNN training accelerator on FPGA
Xia et al. PAI-FCNN: FPGA based inference system for complex CNN models
CN113298241B (en) Deep separable convolutional neural network acceleration method and accelerator
Hashimoto et al. Fadec: FPGA-based acceleration of video depth estimation by hw/sw co-design
Chen et al. Hardware acceleration implementation of three-dimensional convolutional neural network on vector digital signal processors
Niu et al. A Novel Distributed Duration-Aware LSTM for Large Scale Sequential Data Analysis
Wang et al. An efficient architecture for floating-point eigenvalue decomposition
Maliţa et al. Map-scan node accelerator for big-data
Kästner et al. Analysis of hardware implementations to accelerate convolutional and recurrent neuronal networks
CN113673704B (en) Relational network reasoning optimization method based on software and hardware cooperative acceleration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination