WO2018082229A1 - SLAM operation apparatus and method - Google Patents

SLAM operation apparatus and method

Info

Publication number
WO2018082229A1
WO2018082229A1 · PCT/CN2017/075134
Authority
WO
WIPO (PCT)
Prior art keywords
data
instruction
slam
dimensional
hardware accelerator
Prior art date
Application number
PCT/CN2017/075134
Other languages
English (en)
French (fr)
Inventor
陈云霁
杜子东
张磊
陈天石
Original Assignee
北京中科寒武纪科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京中科寒武纪科技有限公司
Publication of WO2018082229A1 publication Critical patent/WO2018082229A1/zh

Classifications

    • G06F9/3001 Arithmetic instructions
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/161 Computing infrastructure, e.g. computer clusters, blade chassis or hardware partitioning
    • G06F9/30021 Compare instructions, e.g. Greater-Than, Equal-To, MINMAX
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the invention relates to a SLAM (simultaneous localization and mapping) operation device and method for accelerating the operation of the SLAM algorithm according to different requirements.
  • Autonomous navigation in an unknown environment is a basic capability of mobile robots such as unmanned ground and aerial vehicles.
  • in the SLAM task, localization determines the position of the robot within the map.
  • the main task of mapping is for the robot to build a map corresponding to its environment. When no initial map of the environment exists, the robot must build the map in real time and use it to localize itself; SLAM algorithms were developed to accomplish this task.
  • accurately implementing the SLAM algorithm under the limited computing power and strict power budget of a mobile robot is one of the biggest practical challenges.
  • a way to implement the SLAM algorithm is to perform operations directly on a general-purpose processor (CPU).
  • one disadvantage of this method is that a single general-purpose processor offers low performance and cannot meet the real-time demands of common SLAM operations.
  • when multiple general-purpose processors operate in parallel, communication between them becomes a performance bottleneck.
  • another way to implement the SLAM algorithm is to perform operations on a graphics processing unit (GPU), which supports the above algorithms by executing general SIMD instructions using a general-purpose register file and generic stream processing units.
  • because the GPU is designed specifically for graphics image operations, and the SLAM algorithm is complex, this method does not support the later stages of the algorithm well; in other words, it cannot effectively accelerate the SLAM algorithm as a whole.
  • the GPU's on-chip cache is too small to meet the computational needs of large-scale SLAM algorithms.
  • an apparatus for a SLAM hardware accelerator comprising:
  • a storage portion for storing input data, temporary operation result data, final operation result data, an instruction set required for the operation process, and/or algorithm parameter data;
  • An operation portion coupled to the storage portion, for performing calculations on SLAM related algorithms and applications
  • a control portion, connecting the storage portion and the operation portion, for controlling and coordinating the storage portion and the operation portion.
  • the storage part comprises:
  • an input storage module, used to store input and output data;
  • an intermediate result storage module, used to store intermediate operation results;
  • a final result storage module, used to store final operation results;
  • an instruction storage module, used to store the set of instructions required for the operation;
  • a buffer storage module, used for buffering data.
  • the operation part comprises:
  • a vector operation unit and a matrix operation unit.
  • the arithmetic part is implemented by a hardware circuit.
  • the control part is composed of a first-in-first-out (FIFO) queue and a control processor: the FIFO queue stores control signals, and the control processor takes out the control signal to be executed and, after analyzing the control logic, controls and coordinates the storage part and the operation part.
  • the instruction set includes:
  • a multidimensional data operation instruction class for controlling the operation of multidimensional data
  • a one-dimensional data operation instruction class for controlling the operation of one-dimensional data.
  • control operation instruction class includes a jump instruction and a branch instruction
  • the jump instruction includes a direct jump instruction and an indirect jump instruction
  • the branch instruction includes a conditional branch instruction
  • the macro operation instruction class includes a convolution operation instruction or a pooling operation instruction.
  • the multi-dimensional data operation instruction class is used to require the operation unit to perform multi-dimensional data operation
  • the multi-dimensional data operations include operations between multi-dimensional data and multi-dimensional data, between multi-dimensional data and one-dimensional vector data, and between multi-dimensional data and one-dimensional scalar data.
  • the one-dimensional data operation instruction class is used to require an operation unit to perform one-dimensional data operation
  • the one-dimensional data includes a one-dimensional vector and a one-dimensional scalar.
  • the operation of the one-dimensional vector data includes an operation between a one-dimensional vector and a one-dimensional vector, and an operation between the one-dimensional vector and the scalar.
  • the operation of the one-dimensional scalar data includes an operation between a scalar and a scalar.
  • a method of performing a SLAM operation using any of the above apparatus, wherein the control portion controls the transfer of data, the operations, and the program flow through the instruction set in the storage portion, comprising:
  • Step 2 The operation part performs an operation according to a required instruction set of the operation process
  • Step 3 Transfer and save the operation result data
  • Step 4 Repeat the above process until the calculation is completed.
  • the device and method of the SLAM hardware accelerator provided by the invention can effectively accelerate the SLAM algorithm according to different requirements, can be applied to various SLAM algorithms and a plurality of different input data types, and can satisfy different requirements of operations, and has strong flexibility.
  • the configurability is high, the operation speed is fast, and the power consumption is low.
  • FIG. 1 is a schematic structural diagram of an apparatus of a SLAM hardware accelerator according to an embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of a SLAM hardware accelerator according to another embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of a scalar operation unit of a SLAM hardware accelerator according to an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a vector operation unit of a SLAM hardware accelerator according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a matrix operation unit of a SLAM hardware accelerator according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a three-dimensional coordinate L2 norm operation performed by a SLAM hardware accelerator according to an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of a 16×16 matrix-matrix multiplication operation performed by a SLAM hardware accelerator according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of an implementation of an SLAM algorithm based on an Extended Kalman Filtering Method (EKF) according to an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of instruction types according to an embodiment of the present invention.
  • FIG. 10 is a schematic diagram of application of a macro operation instruction according to an embodiment of the present invention.
  • FIG. 11 is a schematic diagram of a one-dimensional data operation instruction according to an embodiment of the present invention.
  • FIG. 12 is a schematic diagram of a SIFT feature extraction algorithm configured on the device according to an embodiment of the present invention.
  • FIG. 13 is a schematic diagram of a G2O framework-based graph optimization algorithm configured on the device according to an embodiment of the present invention.
  • FIG. 14 is a flowchart of execution of a convolution operation instruction according to an embodiment of the present invention.
  • FIG. 15 is a flowchart of execution of an image accumulation instruction according to an embodiment of the present invention.
  • FIG. 16 is a flowchart of execution of a filter operation instruction according to an embodiment of the present invention.
  • FIG. 17 is a flowchart of execution of a local extremum instruction according to an embodiment of the present invention.
  • FIG. 18 is a flowchart of execution of a two-dimensional convolution operation operation according to an embodiment of the present invention.
  • FIG. 19 is a flowchart of execution of a one-dimensional vector dot product operation according to an embodiment of the present invention.
  • FIG. 1 is a schematic structural diagram of an apparatus of a SLAM hardware accelerator according to an embodiment of the present invention.
  • the accelerator is mainly divided into three parts, a control part, an arithmetic part and a storage part.
  • the control section issues control signals to the arithmetic section and the storage section to control the operation of the two and coordinate the data transmission between the two.
  • the storage part is used for storing related data, including input data, intermediate results, final results, instructions, caches, etc., and different plans for specific storage data contents, storage organization methods, and access calling modes may be performed according to different requirements.
  • the operation part includes a plurality of arithmetic units for performing operations on the data, comprising one or more of a scalar operation unit, a vector operation unit, and a matrix operation unit; the operation units can process data of different input types according to different requirements.
  • the computing part can also realize a certain degree of data sharing through the buffer storage module, thereby reducing the reuse distance of the data. The design of the arithmetic and storage sections and the arrangement of the instructions greatly reduce power consumption during execution.
  • FIG. 2 is a schematic structural diagram of an apparatus of a SLAM hardware accelerator according to another embodiment of the present invention.
  • as shown in FIG. 2, this embodiment requires accelerating the operation of an image-based SLAM algorithm while reducing data exchange and saving storage space. Accordingly, the control part of the device is connected to both the storage part and the operation part, and is composed of a first-in-first-out queue and a control processor: the queue stores control signals, and the control processor takes out the control signal to be executed and, after analyzing the control logic, controls and coordinates the storage part and the operation part.
  • the storage part is divided into four modules, an input storage module, an output storage module, an intermediate result storage module, and a cache module.
  • the operation part is mainly used to accelerate image processing operations, point cloud map construction, image matching, and image optimization. The arithmetic unit is therefore divided into three modules, a scalar operation module, a vector operation module and a matrix operation module; the three modules can operate in a pipelined manner or in parallel.
  • FIG. 3 is a schematic diagram of an apparatus for a scalar operation unit of the present apparatus according to an embodiment of the present invention, wherein the SPE represents a separate scalar operation unit.
  • the scalar operation unit is mainly used to handle the parts of the SLAM algorithm that cannot otherwise be accelerated, as well as complicated operations such as trigonometric function calculation; it also addresses memory-access consistency, and is one of the important components of the accelerator.
  • the storage modules directly related to the scalar arithmetic unit are intermediate result storage modules and buffer storage modules.
  • the operands required for scalar operations can be in the intermediate result storage module or in the buffer storage module.
  • the result of the scalar operation can be stored in the intermediate result storage module or output to the buffer module, depending on actual needs.
  • the entire vector operation unit is composed of a plurality of basic operation units, and the VPE in the figure is a basic operation unit of vector operations.
  • the vector operation unit can be used to handle the vector operations in the SLAM algorithm and all operations with vector characteristics, such as vector dot products; it can also implement efficient data-level parallelism and task-level parallelism.
  • the storage modules directly related thereto are the intermediate result storage module and the buffer module. Each basic unit of the vector operation unit can be configured to perform the same operation in parallel, or to perform different operations.
  • the storage modules directly related to the vector operation unit are an intermediate result storage module and a buffer storage module.
  • the operands required for vector operations can be in the intermediate result storage module or in the buffer storage module.
  • the result of the vector operation can be stored in the intermediate result storage module or output to the buffer module, depending on actual needs.
  • FIG. 5 is a schematic diagram of a matrix operation unit apparatus according to an embodiment of the present invention, which can satisfy the requirements of accelerating the operation of all matrix operation types and operation types similar to the matrix operation type, wherein the MPE represents the basic operation unit of the matrix operation unit.
  • the matrix operation unit is composed of a plurality of basic operation units, and the illustrated case is an arithmetic unit array.
  • the matrix operation unit has various external data exchange modes, and may be a 2D exchange mode or a 1D exchange mode.
  • the arithmetic unit supports the data access mode between the internal units, which can greatly reduce the reuse distance of the local data and achieve efficient acceleration.
  • the storage modules directly related to the matrix operation unit are intermediate result storage modules and buffer storage modules.
  • the operands required for matrix operations can be in the intermediate result storage module or in the buffer storage module.
  • the result of the matrix operation can be stored in the intermediate result storage module or output to the buffer module, depending on the actual needs.
  • FIG. 6 is a flow chart of performing a three-dimensional coordinate L2 norm operation using the device according to an embodiment of the present invention. It is assumed that the three components of the three-dimensional coordinate are stored in the intermediate storage module.
  • through the configuration instruction, the three coordinate components are fetched from the intermediate storage module and input to the three basic operation units (VPEs) of the vector operation unit, respectively.
  • each of the three VPEs performs a multiplication whose two operands are the fetched coordinate component and itself (i.e., a squaring); the three products are passed through the buffer storage module to the scalar operation unit, which sums them and then performs the square-root operation.
  • the final operation result is output to the intermediate result storage module or the buffer storage module as needed.
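The dataflow just described can be illustrated with a small software sketch (illustrative only — the patent describes a hardware pipeline of three VPEs feeding a scalar unit, not this code; the function name is hypothetical):

```python
import math

def l2_norm_3d(coords):
    """L2 norm of a 3D coordinate, staged as in FIG. 6."""
    # Stage 1: each of three vector lanes (VPEs) squares one component.
    squares = [c * c for c in coords]
    # Stage 2: the scalar unit (SPE) sums the three products and takes the root.
    return math.sqrt(sum(squares))
```

For example, `l2_norm_3d([3.0, 4.0, 0.0])` yields `5.0`.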
  • FIG. 7 is a schematic diagram of a SLAM hardware accelerator performing a 16×16 matrix-matrix multiplication operation according to an embodiment of the present invention.
  • in this example N = 16: the multiplication of matrix A and matrix B is completed to obtain matrix C, and the matrix operation unit in the figure contains 256 basic operation units.
  • each operation unit is responsible for calculating one element of the final result during the operation; the matrix data required for the operation is stored in the intermediate result storage module.
  • the operation starts by moving the operands of A from the intermediate result storage module to the buffer storage module, which feeds the data in row order into each basic operation unit (MPE) of the matrix operation unit; the operands of matrix B are likewise fed step by step in column order into each PE. In each PE, the incoming A value and B value are multiplied; the product is not sent out, but accumulated into a register inside the PE, so that after all of B has been streamed in, the value accumulated in each PE is one element of the resulting matrix C.
  • the C data is stored in the intermediate result storage module or left in the buffer storage module as needed.
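Functionally, the per-PE accumulation above computes an ordinary matrix product; a minimal software sketch (illustrative — the hardware streams operands through a PE array rather than looping, and the function name is hypothetical):

```python
def pe_array_matmul(A, B, n=16):
    """C = A @ B, with one accumulator register per PE as in FIG. 7."""
    # Each of the n*n PEs holds an accumulator register, initialised to 0.
    acc = [[0.0] * n for _ in range(n)]
    # Operands stream in step by step: on step k, PE (i, j) receives
    # A[i][k] (row order) and B[k][j] (column order) and accumulates.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                acc[i][j] += A[i][k] * B[k][j]
    return acc  # each PE now holds one element of C
```

With `n = 2`, `pe_array_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], n=2)` returns `[[19, 22], [43, 50]]`.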
  • FIG. 8 is a schematic diagram of the configuration and operation of an algorithm for performing SLAM based on Extended Kalman Filtering Method (EKF) according to an embodiment of the present invention.
  • the EKF algorithm can be basically divided into three major steps: Compute True Data, EKF Predict, and EKF Update.
  • in Compute True Data, the real coordinates are obtained from the motion model.
  • in EKF Predict, the new pose of the robot is predicted from the pose updated in the previous step and the control input.
  • the association information with the surrounding reference points is calculated in the EKF Update, and the predicted pose and covariance matrix are updated.
  • the main operations in Compute True Data are low-dimensional vector operations, such as the Euclidean distance of three-dimensional coordinates, so most of them can be performed on the vector operation unit; typical scalar operations, such as trigonometric functions of angles, are also involved, so a small number of operations are performed on the scalar operation unit.
  • EKF Predict involves multiple large-scale matrix operations such as matrix multiplication. To obtain better acceleration, this part can be performed on the matrix operation unit, while some smaller vector operations require the vector operation unit.
  • EKF Update has many types of operations that alternate with each other, such as matrix SVD (singular value decomposition) and Cholesky decomposition. These are composed of matrix multiplication, vector addition and subtraction, and small operations such as vector norms and trigonometric functions, and therefore use the matrix operation unit, the vector operation unit, and the scalar operation unit.
  • the input of the EKF-based SLAM algorithm is the coordinates of points such as waypoints and landmarks; the amount of data is not large, so it is only necessary to load these data from the input storage module at the initial time.
  • the SLAM algorithm outputs the calculation result to the output storage module at the output, and completes the hardware configuration and implementation of the entire algorithm.
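To make the Predict/Update division of labor concrete, here is a deliberately simplified scalar Kalman sketch (illustrative only: a one-dimensional state with an identity motion model and direct observation — not the patent's EKF, whose state and covariance are matrices handled by the matrix and vector units; all names and models are assumptions):

```python
def ekf_predict(x, P, u, Q):
    # Predict: propagate the state with the control input (identity
    # motion model for simplicity) and grow the covariance by the
    # process noise Q.
    x_pred = x + u
    P_pred = P + Q
    return x_pred, P_pred

def ekf_update(x_pred, P_pred, z, R):
    # Update: fuse a direct observation z of the state (H = 1),
    # R being the measurement noise.
    K = P_pred / (P_pred + R)          # Kalman gain
    x_new = x_pred + K * (z - x_pred)  # corrected state
    P_new = (1.0 - K) * P_pred         # corrected covariance
    return x_new, P_new
```

In the full algorithm these scalars become matrices, which is why the large multiplications land on the matrix operation unit while norms and trigonometric terms fall to the vector and scalar units.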
  • FIG. 9 is a schematic diagram of an instruction type according to an embodiment of the present invention.
  • the design of the instructions supports a variety of basic types of operation, making the device highly configurable.
  • the instruction set of the present invention includes various types such as a control operation instruction class, a data operation instruction class, a macro operation instruction class, a multidimensional data operation instruction class, and a one-dimensional data operation instruction class.
  • each instruction class can be subdivided into a number of different instructions, each distinguished by its leading instruction code. As shown in FIG. 9, several representative instructions and their encodings are listed for each instruction class.
  • Control operation instruction class mainly used to control the operation of the program.
  • the instruction code JUMP indicates a jump instruction for performing the jump function; according to the following opcodes, it can be divided into direct jump instructions and indirect jump instructions.
  • the instruction code CB indicates a conditional branch instruction for performing the conditional branch function.
  • the data operation instruction class is mainly used to control the transmission of data.
  • the instruction code LD/ST is used for transferring data between DRAM (Dynamic Random Access Memory) and SRAM (Static Random Access Memory): LD means reading data from DRAM and loading it into SRAM, and ST means transferring data in SRAM to DRAM for storage.
  • the instruction code is MOV to transfer data between SRAMs.
  • the instruction code is RD/WR for transferring data between SRAM and BUFFER (buffer), where RD means reading data from SRAM to BUFFER, and WR means storing data in BUFFER back to SRAM.
  • the macro operation instruction class contains coarse-grained data operation instructions, each corresponding to a relatively complete operation.
  • the instruction code CONV represents a convolution operation instruction for implementing convolution and convolution-like operations, that is, multiplying the input data by the corresponding weights and summing; the instruction takes into account the local reusability of the data. The specific implementation process is shown in FIG. 14:
  • the image data is taken out from the start address of the image data as required by the instruction, and the weight data is extracted from the start address of the weight data.
  • each PE multiplies the input image data by the corresponding weight data, adds the product to the value in the register inside the operation unit, and stores the result back in the register (the register must be initialized to 0).
  • image data already in the multidimensional operation unit is transmitted internally according to the unit's transmission rule, while image data not yet in the unit is read from the BUFFER and transmitted to the designated operation position.
  • the multiply-accumulate and data-transmission steps are repeated until the PE calculation is complete, and the result is output to the destination address specified by the instruction and saved.
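The per-PE multiply-accumulate that CONV describes amounts to a standard 2D convolution (correlation) with valid padding; a minimal software sketch, with an illustrative function name (the hardware streams windows through the PE array rather than looping):

```python
def conv2d_valid(image, kernel):
    """Multiply each input window by the weights and sum (valid padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            acc = 0.0  # per-PE accumulator register, initialised to 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            out[i][j] = acc
    return out
```

For example, a 3×3 image `[[1,2,3],[4,5,6],[7,8,9]]` convolved with the 2×2 kernel `[[1,0],[0,1]]` gives `[[6,8],[12,14]]`.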
  • the instruction code POOL represents a pooling operation instruction, used for pooling and pooling-like operations, that is, averaging a predetermined number of data, taking their maximum/minimum, or performing a downsampling operation; the specific implementation flow is similar to that of the convolution operation instruction.
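A one-dimensional max-pooling sketch of what POOL describes, grouping a predetermined number of values and keeping each group's maximum (illustrative; the function name and the 1D formulation are assumptions — the hardware also supports averaging and minimum):

```python
def pool_max(data, window):
    # Group `window` consecutive values and keep the maximum of each
    # group, downsampling the stream as the POOL instruction describes.
    return [max(data[i:i + window]) for i in range(0, len(data), window)]
```

For example, `pool_max([1, 3, 2, 4], 2)` yields `[3, 4]`.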
  • the instruction code IMGACC represents an image accumulation instruction, used in image processing to perform accumulation or similar arithmetic functions. The specific implementation process is shown in FIG. 15:
  • the original data in the multi-dimensional operation unit is shifted up one row, and a new row of data is transferred in; the incoming row is accumulated, column by column, with the original last row, and the accumulated result becomes the new last row. This repeats until the multidimensional unit is filled.
  • each clock cycle, data in the multi-dimensional operation unit is transferred to the right and accumulated: in the first clock cycle, the first column is passed right and the second column adds the incoming data and saves it; in the second clock cycle, the second column is passed right and the third column adds the incoming data and saves it; and so on, finally yielding the integral accumulation of the desired image.
  • the data in the multi-dimensional operation unit is then re-initialized to 0 and the next block is processed, until the entire image has been calculated.
  • for every pass other than the first, the buffered data must also be accumulated to ensure a correct operation result.
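The two passes FIG. 15 describes — a vertical accumulation followed by a rightward accumulation — compute an integral image; a minimal sketch (illustrative function name, single block, no buffering):

```python
def integral_image(img):
    """Integral (summed-area) image via the two passes of IMGACC."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    # Vertical pass: each incoming row accumulates the row above it.
    for i in range(1, h):
        for j in range(w):
            out[i][j] += out[i - 1][j]
    # Horizontal pass: data is shifted right and accumulated per cycle.
    for i in range(h):
        for j in range(1, w):
            out[i][j] += out[i][j - 1]
    return out
```

For example, `integral_image([[1, 1], [1, 1]])` yields `[[1, 2], [2, 4]]`.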
  • the instruction code is BOX, which represents a filter instruction for completing the box filtering operation of the image.
  • the operation flow of the algorithm is to first build an array A whose width and height equal those of the original image, assigning each element A[i] the sum of all pixels in the rectangle formed by that point and the origin of the image; a local rectangle sum then requires only addition and subtraction of four elements of the A matrix. The macro is therefore divided into two steps, as shown in FIG. 16:
  • the required data is read from the start address according to the instruction and transmitted to the multi-dimensional operation unit, which sequentially accumulates the incoming data and stores it at the specified destination address 1.
  • the instruction supports data transmission inside the multidimensional operation unit.
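The second step — the four-element add/subtract on the array A — can be sketched as follows (illustrative function name; `ii` is the precomputed summed-area array A described above):

```python
def box_sum(ii, r0, c0, r1, c1):
    # Sum of pixels in rectangle [r0..r1] x [c0..c1] using only four
    # reads of the integral array A, as the BOX instruction describes.
    s = ii[r1][c1]
    if r0 > 0:
        s -= ii[r0 - 1][c1]
    if c0 > 0:
        s -= ii[r1][c0 - 1]
    if r0 > 0 and c0 > 0:
        s += ii[r0 - 1][c0 - 1]
    return s
```

With the integral array `[[1, 2], [2, 4]]` of an all-ones 2×2 image, `box_sum(ii, 0, 0, 1, 1)` returns `4` and `box_sum(ii, 1, 1, 1, 1)` returns `1`.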
  • the instruction code LOCALEXTERMA represents a local extremum instruction, used when processing an image to judge whether the data at a specified position is an extreme value within a set of data. The macro is divided into two steps, as shown in FIG. 17:
  • the register value in each PE of the multi-dimensional operation unit is initialized to a sufficiently small (or large) value; data is read from the data start address and transmitted to the multi-dimensional operation unit, and each PE compares the incoming data with the value stored in its register, saving the larger (or smaller) value back into the register until the specified data has been compared. Each PE thus obtains the maximum (or minimum) of its data stream.
  • each PE compares whether the data transmitted into the PE is the same as the maximum/small value stored in the register, the same output 1, different output 0 .
  • the instruction code COUNTCMP denotes a comparison instruction that performs counting comparison: the data to be compared and a threshold are read and sent to the multi-dimensional operation unit; each PE compares its incoming data stream with the threshold and keeps a count, and once the whole stream has been traversed, the number of data elements greater than or less than the threshold is output.
  • the multi-dimensional data operation instruction class, one of the fine-grained operation instruction classes, is mainly used to control operations on multi-dimensional data.
  • multi-dimensional data includes two-dimensional data and data of more than two dimensions; the class contains instructions for operations of multi-dimensional data with, respectively, multi-dimensional data, one-dimensional vector data, one-dimensional scalar data, and so on.
  • taking matrices as an example: MMmM is a matrix-matrix multiplication instruction, one of the instructions between multi-dimensional data and multi-dimensional data; similarly, MMaM is a matrix-matrix addition instruction. MMmV is a matrix-vector multiplication instruction, one of the instructions between multi-dimensional data and one-dimensional vector data; similarly, MMaV is a matrix-vector addition instruction. MMmS is a matrix-scalar multiplication instruction, one of the instructions between multi-dimensional data and one-dimensional scalar data; similarly, MMaS is a matrix-scalar addition instruction.
  • in addition, the multi-dimensional data operation instruction class is also compatible with operations between one-dimensional data: for example, MVmV implements the multiplication of a one-dimensional vector with a one-dimensional vector, and MMoV implements the outer product of two one-dimensional vectors.
  • the one-dimensional data operation instruction class is mainly used to control operations on one-dimensional data.
  • the one-dimensional data is mainly divided into one-dimensional vector data and one-dimensional scalar data.
  • VVmV is a multiplication operation instruction of a one-dimensional vector and a one-dimensional vector
  • a similar VVaV represents an addition instruction of a one-dimensional vector and a one-dimensional vector.
  • VVmS is a multiplication instruction between a one-dimensional vector and a one-dimensional scalar.
  • SSsS denotes a one-dimensional scalar operation instruction used to compute the square root of the one-dimensional scalar.
  • SSrS denotes an instruction for obtaining a random number.
  • MV is a move instruction used to fetch register or immediate operands during computation.
  • Figure 10 is a diagram showing an embodiment of a macro operation CONV of the present invention for performing a two-dimensional convolution operation on a hardware structure.
  • the two-dimensional convolution operation works as follows: for a two-dimensional input image, a convolution kernel slides over the input image; at each position the kernel filters the data of the two-dimensional image it currently covers, i.e. the kernel and the covered image data are multiplied element-wise, the products are then accumulated, and the sum is recorded as the required filtering result. The kernel then slides to the next position and the operation is repeated until the whole computation is complete.
  • the convolution operation designed by this patent can make full use of the data reusability on the hardware structure, rationally distribute and transmit the data, and maximize the utilization of the hardware.
  • the input is defined as an image or matrix
  • the output is also an image or a matrix, which are stored in a specified position in the form of a block.
  • the hardware structure is exemplified by a matrix operation unit (MPU), which includes m*n matrix operation units (MPEs), each of which contains a required arithmetic unit and a register for temporarily storing intermediate data.
  • S1: read a convolution macro-instruction, consisting of an operation code and operands.
  • the operation code is CONV, indicating that a convolution operation is being performed.
  • there are seven operands: DA, SA1, SA2, IX, IY, KX and KY, where DA is the destination address at which the result is stored; SA1 is start address 1, from which the image to be processed is read; SA2 is start address 2, from which the convolution kernel is read; IX and IY denote the size of the image in the X and Y directions respectively, i.e. these two variables define the size of
  • the image to be processed; KX and KY denote the size of the convolution kernel.
  • S2: the input image data is read from the SRAM into the corresponding positions in the BUFFER and waits to be processed.
  • here each MPE in the MPU is required to compute one pixel of the output image.
  • S3: the corresponding input image data is transmitted to each MPE. Since the convolution kernel used in the computation is the same in every MPE, the kernel is broadcast to every MPE. Each MPE then multiplies the incoming input data with the corresponding kernel data and stores the product in its own register.
  • S4: because convolution exhibits local data reuse, the input image data to be processed in the next cycle is the data on which the MPE to the right is operating in the current cycle; the input image data is therefore passed leftward in turn, while the data required by the rightmost MPE is not in the MPU and must be read again from the BUFFER.
  • once the data transfer is complete, each MPE multiplies the input image data with the corresponding kernel data, accumulates the product with the data in its register, and stores the result back in the register.
  • S5: step S4 is repeated until all kernel data and the corresponding input image data have been processed, at which point each MPE holds one pixel of the output image; the results are output and saved to the location defined by the destination address in the instruction.
  • S6: the above steps are repeated until all pixels of the output image have been computed.
  • macros can handle data reuse problems well, improve data utilization, reduce data transfer, reduce power consumption, and improve performance.
  • FIG. 11 is a multi-dimensional data operation instruction according to an embodiment of the present invention, which implements a dot product operation between a one-dimensional vector and a one-dimensional vector.
  • each vector operation unit (VPU) contains mm vector processing elements (VPEs), each of which can process one pair of input data.
  • first, mm pairs of operands are fed to the mm VPEs; each performs one multiplication and stores the product in its internal register. As the next mm pairs are fed in, each VPE multiplies them and accumulates the product with the previous product held in its internal register, writing the accumulated result back to the register. These steps are repeated until all inputs have been processed. The partial results are then passed leftward starting from the rightmost VPE: the rightmost VPE sends its register value directly to the VPE on its left; when a VPE receives data from its right, it accumulates it with the data in its own internal register and passes the sum onward to the left, and so on. Finally, the dot-product result is obtained in the leftmost VPE and output as required.
  • FIG. 12 is a process diagram of a configuration implementation of a SIFT feature extraction algorithm on the device according to an embodiment of the present invention.
  • the SIFT (Scale-invariant feature transform) feature extraction algorithm is one of the key operations of the RGBD SLAM algorithm.
  • the first step builds the image pyramid (Gaussian Pyramid), which includes basic image operations such as image smoothing and can be further decomposed in this device into multiple convolution and pooling operations.
  • next, the difference-of-Gaussians (DoG) operation is performed, which can be regarded as matrix subtraction between different levels of the image pyramid.
  • the local extrema search can be done by calling the macro LOCAL EXTREMA.
  • after the local-extrema search, the feature points are determined and filtered (KP filter).
  • this step is composed of a large number of vector and scalar operations, such as vector dot products and matrix determinants.
  • finally, the descriptor of each key point is computed from histograms of neighboring points through multiple vector and scalar operations.
  • the histogram computation can be completed by the macro-instruction HIST, which is composed of vector operations such as vector comparison.
  • the rotation of neighboring pixel regions is implemented by matrix-vector multiplication; certain special functions such as the exponential are implemented mainly on the scalar operation unit.
  • FIG. 13 is a schematic flowchart of a G2O map optimization algorithm configured on the device according to an embodiment of the present invention.
  • G2O is a framework for solving nonlinear graph optimization problems.
  • many typical SLAM algorithms, such as RGBD SLAM and ORB SLAM, are based on this framework.
  • given the pose constraints between two graph nodes and the initial poses, the error matrix and the Jacobian matrix can be computed by matrix and vector operations, such as matrix multiplication and accumulation.
  • a linear system capable of optimizing the objective function is established by the error matrix and the Jacobian matrix. This step can be done by matrix and vector operation units, including operations such as matrix multiplication and accumulation.
  • the linear system is then solved, which can be done with the Preconditioned Conjugate Gradient (PCG) algorithm (alternatively with Cholesky decomposition, sparse-matrix methods, or upper-triangular decomposition).
  • the PCG operation can be decomposed into blockwise matrix-vector multiplication and addition operations.
  • the concrete implementation can use the macro-instruction PCG.
  • the optimization of the final pose can also be done by operations such as multiplication and addition of matrices and vectors.
  • the design of the matrix and vector operation unit, combined with the design of the scalar unit, can support various types of operations and significantly speed up the operation.
  • the apparatus and method of the embodiments of the present invention can be applied to the following scenarios (including but not limited to): data processing; electronic products such as robots, drones, autonomous driving, computers, printers, scanners, telephones, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, cloud servers, cameras, camcorders, projectors, watches, earphones, mobile storage and wearable devices; means of transport such as aircraft, ships and vehicles; household appliances such as televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves and range hoods; and various kinds of medical equipment including nuclear magnetic resonance instruments, B-mode ultrasound scanners and electrocardiographs.

Abstract

A SLAM hardware accelerator apparatus, comprising: a storage part for storing input data, temporary operation result data, final operation result data, the instruction set required by the operation process, and/or algorithm parameter data; an operation part, connected to the storage part, for completing the computation of SLAM-related algorithms and applications; and a control part, connected to the storage part and the operation part, for controlling and coordinating the storage part and the operation part. A method of performing SLAM operations is also provided, in which instructions control data transport, data computation, program execution and so on. The apparatus and method can effectively accelerate SLAM algorithms according to different requirements and satisfy operations with different demands, with the advantages of high flexibility, high configurability, high operation speed and low power consumption.

Description

SLAM Operation Apparatus and Method — Technical Field
The present invention relates to a SLAM (Simultaneous Localization and Mapping) operation apparatus and method for accelerating the computation of SLAM algorithms according to different requirements.
Background Art
Autonomous navigation in unknown environments is a fundamental capability of mobile robots (e.g. unmanned ground and aerial vehicles). In a SLAM task, localization determines the robot's position within the map, while the main task of mapping is for the robot to build a map of the environment it operates in. In the absence of an initial map of the environment, the robot must build the map in real time and use it for its own localization; the SLAM algorithms required to accomplish this task arose accordingly. However, implementing SLAM algorithms accurately under the limited computing power and strict power-consumption requirements of mobile robots is one of the greatest practical challenges. First, because of real-time requirements, SLAM algorithms need extremely high operation speed to complete the large number of computations between, for example, consecutive frames within a short time; second, owing to the constraints of mobile robots, SLAM algorithms have stringent power-consumption requirements; finally, there are many kinds of SLAM algorithms covering a wide range of operation types, so the accelerator must support various types of SLAM algorithms.
In the prior art, one way to implement SLAM algorithms is to compute directly on a general-purpose processor (CPU). One drawback of this approach is that the computational performance of a single general-purpose processor is low and cannot meet the real-time requirements of common SLAM computations, while when multiple general-purpose processors execute in parallel, the communication between them becomes a performance bottleneck.
Another way to implement SLAM algorithms is to compute on a graphics processing unit (GPU), which supports the above algorithms by executing general-purpose SIMD instructions using a general-purpose register file and general-purpose stream processing units. Although the GPU is a device dedicated to graphics and image computation, because of the complexity of SLAM computations it cannot support the subsequent computations well, i.e. it cannot effectively accelerate the SLAM algorithm as a whole. Moreover, the GPU's on-chip cache is too small to meet the computational demands of a large number of SLAM algorithms. In addition, in practical applications it is rather difficult to port CPU- or GPU-like structures onto a robot.
Summary of the Invention
According to one aspect of the present invention, a SLAM hardware accelerator apparatus is provided, comprising:
a storage part for storing input data, temporary operation result data, final operation result data, the instruction set required by the operation process, and/or algorithm parameter data;
an operation part, connected to the storage part, for completing the computation of SLAM-related algorithms and applications;
a control part, connected to the storage part and the operation part, for controlling and coordinating the storage part and the operation part.
Preferably, the storage part comprises:
an input storage module, for storing input and output data;
an intermediate result storage module, for storing intermediate operation results;
a final result storage module, for storing final operation results;
an instruction storage module, for storing the instruction set required by the operation process; and/or
a buffer storage module, for buffering data.
Preferably, the operation part comprises:
an accelerated operation device, designed for SLAM-related algorithms and applications, for accelerating and processing SLAM operations;
another operation device, for the other operations contained in SLAM-related algorithms and applications that cannot be completed by the accelerated operation device.
Preferably, the accelerated operation device comprises a vector operation unit and a matrix operation unit.
Preferably, the
other operation device is used to complete operations that are used in the algorithms and applications but are not performed by the accelerated operation device.
Preferably, the operation part is implemented by hardware circuits.
Preferably, the control part
connects each module of the storage part and the operation part; the control part consists of a first-in-first-out queue and a control processor, where the FIFO queue stores control signals and the control processor fetches the control signal to be executed, analyses the control logic, and then controls and coordinates the storage part and the operation part.
Preferably, the instruction set comprises:
a control operation instruction class, used to control the selection of the run instructions to be executed;
a data operation instruction class, used to control data transfers;
a macro operation instruction class, used for complete operation tasks;
a multi-dimensional data operation instruction class, used to control operations on multi-dimensional data; and/or
a one-dimensional data operation instruction class, used to control operations on one-dimensional data.
Preferably, the control operation instruction class comprises jump instructions and branch instructions; the jump instructions comprise direct jump instructions and indirect jump instructions, and the branch instructions comprise conditional branch instructions.
Preferably, the macro operation instruction class comprises a convolution operation instruction or a pooling operation instruction.
Preferably, the multi-dimensional data operation instruction class is used to require the operation units to execute operations on multi-dimensional data, including operations between multi-dimensional data and multi-dimensional data, between multi-dimensional data and one-dimensional vector data, and between multi-dimensional data and one-dimensional scalar data.
Preferably, the one-dimensional data operation instruction class is used to require the operation units to execute operations on one-dimensional data, the one-dimensional data comprising one-dimensional vectors and one-dimensional scalars.
Preferably, the operations on one-dimensional vector data comprise operations between a one-dimensional vector and a one-dimensional vector, and operations between a one-dimensional vector and a scalar.
Preferably, the operations on one-dimensional scalar data comprise operations between scalars.
Preferably, the apparatus further comprises an assembler, for selecting, during operation, the instruction types in the instruction set to be used.
According to another aspect of the present invention, a method of performing SLAM operations with any of the apparatuses described above is also provided, in which the control part controls data transport, computation and program execution through the instruction set in the storage part, comprising:
step one: transporting the input data of the storage part to the operation part;
step two: executing operations in the operation part according to the instruction set required by the operation process;
step three: transferring and saving the operation result data;
step four: repeating the above process until the computation is complete.
The SLAM hardware accelerator apparatus and method provided by the present invention can effectively accelerate SLAM algorithms according to different requirements, are applicable to various SLAM algorithms and many different input data types, and satisfy operations with different demands, with the advantages of high flexibility, high configurability, high operation speed and low power consumption.
Brief Description of the Drawings
Figure 1 is a schematic structural diagram of a SLAM hardware accelerator apparatus according to an embodiment of the present invention.
Figure 2 is a schematic structural diagram of a SLAM hardware accelerator according to another embodiment of the present invention.
Figure 3 is a schematic structural diagram of the scalar operation unit of the SLAM hardware accelerator according to an embodiment of the present invention.
Figure 4 is a schematic structural diagram of the vector operation unit of the SLAM hardware accelerator according to an embodiment of the present invention.
Figure 5 is a schematic structural diagram of the matrix operation unit of the SLAM hardware accelerator according to an embodiment of the present invention.
Figure 6 is a schematic diagram of the SLAM hardware accelerator completing an L2-norm computation on three-dimensional coordinates according to an embodiment of the present invention.
Figure 7 is a schematic diagram of the SLAM hardware accelerator completing a 16-dimensional square-matrix multiplication according to an embodiment of the present invention.
Figure 8 is a schematic diagram of an extended-Kalman-filter (EKF) based SLAM algorithm configured and implemented on the apparatus according to an embodiment of the present invention.
Figure 9 is a schematic diagram of instruction types according to an embodiment of the present invention.
Figure 10 is a schematic diagram of the application of a macro operation instruction according to an embodiment of the present invention.
Figure 11 is a schematic diagram of a one-dimensional data operation instruction according to an embodiment of the present invention.
Figure 12 is a schematic diagram of a SIFT feature extraction algorithm configured and implemented on the apparatus according to an embodiment of the present invention.
Figure 13 is a schematic diagram of a graph optimization algorithm based on the G2O framework configured and implemented on the apparatus according to an embodiment of the present invention.
Figure 14 is an execution flowchart of a convolution operation instruction according to an embodiment of the present invention.
Figure 15 is an execution flowchart of an image accumulation instruction according to an embodiment of the present invention.
Figure 16 is an execution flowchart of a filtering operation instruction according to an embodiment of the present invention.
Figure 17 is an execution flowchart of a local extremum instruction according to an embodiment of the present invention.
Figure 18 is an execution flowchart of a two-dimensional convolution operation according to an embodiment of the present invention.
Figure 19 is an execution flowchart of a one-dimensional vector dot-product operation according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
Figure 1 is a schematic structural diagram of a SLAM hardware accelerator apparatus according to an embodiment of the present invention. As shown in Figure 1, the accelerator is divided into three main parts: a control part, an operation part and a storage part. The control part issues control signals to the operation part and the storage part to control their operation and coordinate the data transfers between them. The storage part stores the relevant data, including input data, intermediate results, final results, instructions, caches and so on; the specific stored contents, storage organization and access/invocation schemes can be planned differently according to requirements. The operation part contains various arithmetic units for computing on data, including a combination of one or more of a scalar operation unit, a vector operation unit and a matrix operation unit, where the arithmetic units can operate on different input data types according to different needs. The operation part can also share data to a certain extent through the buffer storage module, reducing the data reuse distance. The design of the operation and storage parts and the arrangement of the instructions greatly reduce power consumption during execution. Figure 2 is a schematic structural diagram of a SLAM hardware accelerator apparatus according to one embodiment of the present invention. As shown in Figure 2, this embodiment is required to accelerate the computation process of image-based SLAM algorithms, reduce data exchange and save storage space. The structure of the apparatus is therefore as follows: the control part connects each module of the storage part and the operation part and consists of a first-in-first-out queue and a control processor; the FIFO queue stores control signals, and the control processor fetches the control signal to be executed, analyses the control logic, and then controls and coordinates the storage part and the operation part. The storage part is divided into four modules: an input storage module, an output storage module, an intermediate result storage module and a buffer module. The operation part is mainly used to accelerate the computations of image processing, point-cloud construction, image matching and image optimization, so the operation units are likewise divided into three modules, a scalar operation module, a vector operation module and a matrix operation module, which can execute in a pipelined manner or in parallel.
Figure 3 is a schematic diagram of a scalar operation unit usable in the apparatus according to an embodiment of the present invention, where SPE denotes an individual scalar processing element. The scalar operation unit is mainly used to handle the parts of SLAM algorithms that cannot be accelerated, as well as some complex operations such as trigonometric functions; it can also resolve memory-access consistency issues, and is one of the important components of the accelerator. The storage modules directly associated with the scalar operation unit are the intermediate result storage module and the buffer storage module. The operands needed by scalar operations may reside in the intermediate result storage module or in the buffer storage module, and the results of scalar operations may be stored in the intermediate result storage module or output to the buffer module, depending on actual needs.
Figure 4 is a schematic diagram of the vector operation unit according to an embodiment of the present invention. The whole vector operation unit is composed of multiple basic operation elements; in the figure, VPE denotes the basic element of vector operation. The vector operation unit can handle the vector-operation parts of SLAM algorithms and all parts with vector-operation characteristics, such as vector dot products, and can also achieve efficient data-level and task-level parallelism. The storage modules directly associated with it are the intermediate result storage module and the buffer module. Each basic element of the vector operation unit can be configured to execute the same operation in parallel, or to execute different operations. The operands needed by vector operations may reside in the intermediate result storage module or in the buffer storage module, and the results of vector operations may be stored in the intermediate result storage module or output to the buffer module, depending on actual needs.
Figure 5 is a schematic diagram of the matrix operation unit according to an embodiment of the present invention, which can satisfy the requirement of accelerating all matrix-operation types and operation types similar to matrix operations; MPE denotes the basic operation element of the matrix operation unit. The matrix operation unit is composed of multiple basic operation elements, shown here as an array of operation elements. The matrix operation unit supports several external data-exchange modes, either 2D or 1D. It also supports data-access modes between internal elements, which can greatly reduce the reuse distance of local data and achieve efficient acceleration. The storage modules directly associated with the matrix operation unit are the intermediate result storage module and the buffer storage module. The operands needed by matrix operations may reside in the intermediate result storage module or in the buffer storage module, and the results of matrix operations may be stored in the intermediate result storage module or output to the buffer module, depending on actual needs.
Figure 6 is a flowchart of computing the L2 norm of three-dimensional coordinates with the apparatus according to an embodiment of the present invention. Assume the three coordinate values are stored in the intermediate storage module. First, configuration instructions fetch the operands from the intermediate storage module and feed them to three basic elements (VPEs) of the vector operation unit; each of the three VPEs performs a multiplication whose two operands are a fetched coordinate value and itself. The multiplication results pass through the buffer storage module into the scalar operation unit, where the three products are summed and a square-root operation is then performed. The final result is output to the intermediate result storage module or the buffer storage module as required.
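The dataflow just described (three squarings on the vector unit, then a sum and square root on the scalar unit) amounts to the following minimal pure-Python sketch; the function name is illustrative and not part of the patent's instruction set:

```python
import math

def l2_norm_3d(coords):
    """Each of three VPE lanes multiplies one coordinate by itself;
    the scalar unit then sums the three products and takes the
    square root."""
    squares = [c * c for c in coords]   # vector-unit multiplications
    return math.sqrt(sum(squares))      # scalar-unit sum + sqrt
```

For example, the coordinates (3, 4, 12) give a norm of 13.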
Figure 7 is a schematic diagram of an embodiment in which the SLAM hardware accelerator completes a 16-dimensional square-matrix multiplication. For example, for N = 16, suppose matrix A is to be multiplied by matrix B to obtain matrix C. In the figure, the matrix operation unit contains 256 basic operation elements, each of which is responsible for computing one element of the final result; the matrix data needed for the computation are stored in the intermediate result storage module. The computation begins by fetching each operand of A from the intermediate result storage module into the buffer storage module, which feeds the data in row order into each basic operation element (MPE) of the matrix operation unit; the operands of matrix B are likewise fetched into the buffer storage module and fed stepwise, in column order under instruction scheduling, into each PE. In each PE, the values of A and B are multiplied; after each multiplication, the result is not sent out but accumulated with the previous result stored in the PE's register, so that after all values of B have been fed in, the result held in each PE is the value at the corresponding position of the final matrix C. Finally, the data of C are stored in the intermediate result storage module or kept in the buffer storage module as required.
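The per-PE accumulation scheme can be expressed in a few lines of pure Python: each (i, j) position of `reg` models one PE's register, and the outer loop over k models the stepwise streaming of B. This is an illustrative sketch of the computation, not the hardware schedule:

```python
def mpu_matmul(A, B):
    """Each (i, j) PE accumulates A[i][k] * B[k][j] over k into its
    register; after all of B has streamed in, the register array
    holds exactly C = A * B."""
    n = len(A)
    reg = [[0] * n for _ in range(n)]   # one register per PE
    for k in range(n):                  # one streaming step per cycle
        for i in range(n):
            for j in range(n):
                reg[i][j] += A[i][k] * B[k][j]
    return reg
```

Nothing leaves a PE until the loop over k is finished, mirroring the text's "the result is not sent out but accumulated with the previous result".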
Figure 8 is a schematic diagram of configuring and running an extended-Kalman-filter (EKF) based SLAM algorithm on the apparatus according to an embodiment of the present invention. The EKF algorithm can basically be divided into three major steps: Compute True Data, EKF Predict, and EKF Update. In Compute True Data, the true coordinates are obtained through the motion model. In EKF Predict, the new robot pose is predicted from the previous prediction and the control input. In EKF Update, the association information with the surrounding landmark points is computed, and the predicted pose and the covariance matrix are updated. Compute True Data mainly involves low-dimensional vector operations, such as the Euclidean distance of three-dimensional coordinates, so most of its computation can run on the vector operation unit; it also involves typical scalar operations such as trigonometric functions of angles, so a small amount of computation is also needed on the scalar operation unit. EKF Predict involves several fairly large matrix operations such as matrix multiplication; to obtain good acceleration, these operations can be executed on the matrix operation unit, while some smaller vector operations also require the vector operation unit. EKF Update involves many kinds of operations alternating with each other, for example typical matrix SVD (singular value decomposition) and Cholesky decomposition, which are composed of small operations such as matrix multiplication, vector addition and subtraction, vector norms and trigonometric functions, using the matrix, vector and scalar operation units together. From the storage perspective, the input of the EKF-based SLAM algorithm is the coordinates of points such as waypoints and landmarks, which is not a large amount of data, so these data only need to be loaded from the input storage module at initialization. During the intermediate computation, by design the amount of data generally does not exceed the size of the intermediate result storage module, so frequent data exchange with the input storage module is usually unnecessary, reducing energy consumption and run time. Finally, the SLAM algorithm outputs its results to the output storage module, completing the hardware configuration and implementation of the whole algorithm.
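As a minimal illustration of the predict/update cycle, here is a scalar (one-dimensional, linear) Kalman filter in pure Python. It is a drastic simplification of the EKF SLAM described above, with assumed process and measurement noise parameters `q` and `r`; it is intended only to show the shape of the two steps, not the patent's implementation:

```python
def kalman_1d(x, P, u, z, q=0.1, r=0.2):
    """One predict/update cycle of a scalar Kalman filter.
    x, P: previous state estimate and its variance
    u: control input, z: new measurement
    q, r: assumed process and measurement noise variances."""
    # Predict: propagate the state with the control input
    x_pred = x + u
    P_pred = P + q
    # Update: fuse the measurement z
    K = P_pred / (P_pred + r)          # Kalman gain
    x_new = x_pred + K * (z - x_pred)
    P_new = (1 - K) * P_pred
    return x_new, P_new
```

In the full EKF, x and P become the pose/landmark vector and the covariance matrix, and the predict/update equations become the matrix multiplications and accumulations discussed above.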
Figure 9 is a schematic diagram of the instruction types provided by an embodiment of the present invention. The instruction design supports all kinds of basic operation types, making the apparatus highly configurable.
The instruction set of the present invention contains several types: the control operation instruction class, the data operation instruction class, the macro operation instruction class, the multi-dimensional data operation instruction class, the one-dimensional data operation instruction class, and so on. Each instruction class is further subdivided into different instructions, each distinguished by its leading instruction code; as shown in Figure 9, several representative instructions and their codes are selected and listed for each class.
The control operation instruction class is mainly used to control program execution. The instruction code JUMP denotes a jump instruction used to perform a jump; depending on the following operation code, it can be a direct jump instruction or an indirect jump instruction. The instruction code CB denotes a conditional jump instruction used to perform a conditional jump.
The data operation instruction class is mainly used to control data transfers. The instruction code LD/ST is used for transferring data between DRAM (Dynamic Random Access Memory) and SRAM (Static Random Access Memory): LD reads data from DRAM and loads it into SRAM, and ST transfers data from SRAM into DRAM for storage. The instruction code MOV transfers data between SRAMs. The instruction code RD/WR is used for transferring data between SRAM and the BUFFER: RD reads data from SRAM into the BUFFER, and WR stores data from the BUFFER back into SRAM. The macro operation instruction class, as coarse-grained data operation instructions, is used for relatively complete operation tasks.
The instruction code CONV denotes a convolution operation instruction used to implement convolution and convolution-like operations, i.e. multiplying the input data with the corresponding weights and summing; the instruction also takes the local reusability of the data into account. The specific execution process, as shown in Figure 14, is:
S1: as required by the instruction, fetch the image data starting from the image-data start address and fetch the weight data starting from the weight-data start address.
S2: send the image data to the corresponding multi-dimensional operation units according to the corresponding operation requirements, and broadcast the weight data to every operation element (PE) in the multi-dimensional operation unit.
S3: each PE multiplies the input image data with the corresponding weight data, adds the product to the data in the register inside the operation element, and stores the result back in the register (the register must be initialized to 0).
S4: image data already in the multi-dimensional operation unit is transferred inside the unit according to the unit's prescribed transfer rules, while image data not in the unit is read from the BUFFER and transferred to the designated operation position. This method exploits the data reusability of the convolution operation and thus greatly reduces the number of data transfers.
S5: repeat steps S3-S4 until the PE has finished its computation, and output the result to the destination address specified by the instruction.
S6: read data anew and repeat the above operations until all pixels of the output image have been computed and saved; the instruction then ends.
The instruction code POOL denotes a pooling operation instruction used to implement pooling and pooling-like operations, i.e. taking the average, taking the maximum/minimum, or down-sampling over a prescribed number of data; its implementation flow is similar to that of the convolution operation instruction.
The instruction code IMGACC denotes an image accumulation instruction used to process an image and perform accumulation or similar operations. Its specific execution process, as shown in Figure 15, is as follows:
S1: read the image data starting from the image-data start address as required by the instruction, and initialize all operation elements (PEs) in the multi-dimensional operation unit to 0.
S2: in each clock cycle, shift the existing data in the multi-dimensional operation unit up by one row, then feed a new row of data into the unit, accumulate the newly fed row with the corresponding columns of the previous last row, and take the accumulated result as the new last row. Repeat until the multi-dimensional operation unit is full.
S3: in each clock cycle, pass the data in the multi-dimensional operation unit to the right and accumulate: in the first clock cycle, the first column is passed right, and the second column adds the data from the first column and saves it; in the second clock cycle, the second column is passed right, and the third column adds the data from the second column and saves it; and so on. The required integral accumulation result of the image is finally obtained.
S4: save all data in the multi-dimensional operation unit to the destination address specified by the instruction, and cache the bottom row and the rightmost column of data.
S5: initialize the multi-dimensional operation data to 0 and start the next round of computation, until the whole image has been computed. Note that in subsequent rounds, when the width or length of the image exceeds the single-pass capacity of the multi-dimensional operation unit, the cached data must be accumulated in all rounds after the first to guarantee a correct result.
The instruction code BOX denotes a filtering instruction used to complete the box-filter operation on an image. The operation flow of the algorithm is: to obtain the sum of a local matrix of the image, first build an array A whose width and height equal those of the original image, then assign values to this array, the value A[i] of each element being the sum of all pixels in the rectangle formed by that point and the image origin; once this is done, the sum of a local matrix can be obtained by just additions and subtractions of four elements of the A matrix. The macro-instruction is therefore divided into two steps, as shown in Figure 16:
S1: read the required data from the start address according to the instruction, stream it into the multi-dimensional operation unit, accumulate the incoming data in sequence, and save the result at the specified destination address 1.
S2: according to the data required by the instruction, read data from destination address 1, perform additions and subtractions on it to obtain the filtering result, and save it to destination address 2 as the required final result.
Since the data exhibits local reuse during accumulation, much like the convolution operation instruction, this instruction supports data transfer inside the multi-dimensional operation unit.
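The two-step BOX flow maps naturally onto the classic summed-area-table trick. Below is a minimal pure-Python sketch for illustration only (the names `integral_image` and `box_sum` are my own, not part of the patent's instruction set): `integral_image` plays the role of the array A written to destination address 1, and `box_sum` is the four-element add/subtract step that produces the final result.

```python
def integral_image(img):
    """Build array A: A[y][x] = sum of all pixels in the rectangle
    spanned by the origin and (x, y) inclusive (step S1)."""
    h, w = len(img), len(img[0])
    A = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            A[y][x] = row_sum + (A[y - 1][x] if y > 0 else 0)
    return A

def box_sum(A, x0, y0, x1, y1):
    """Sum over the window [x0..x1] x [y0..y1] using only four
    elements of A (step S2)."""
    total = A[y1][x1]
    if x0 > 0:
        total -= A[y1][x0 - 1]
    if y0 > 0:
        total -= A[y0 - 1][x1]
    if x0 > 0 and y0 > 0:
        total += A[y0 - 1][x0 - 1]
    return total
```

For a 3x3 image, the sum over the bottom-right 2x2 window, for example, needs only four reads of A regardless of the window size.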
The instruction code LOCALEXTERMA denotes a local extremum instruction, used when processing an image to determine the local extremum, i.e. to determine whether the data at a specified position is an extremum of the group of data. Specifically, the macro-instruction is divided into two steps, as shown in Figure 17:
S1: initialize the register value in each PE of the multi-dimensional operation unit to a sufficiently small/large value; read data from the data start address and stream it into the multi-dimensional operation unit; each PE then compares the incoming data with the data saved in its register and saves the larger/smaller value back into the register, until the prescribed data has all been compared. Each PE thus holds the maximum/minimum of its assigned data stream.
S2: according to the instruction, read the data at the specified position and feed it into the multi-dimensional data operation unit again; each PE compares whether the data fed into it equals the maximum/minimum saved in its register, outputting 1 if equal and 0 otherwise.
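As a software analogue of the two LOCALEXTERMA steps, the following pure-Python sketch (an illustration under my own naming, not the hardware instruction) first scans the stream keeping a running maximum or minimum, as each PE's register does, and then outputs 1/0 flags for the candidate values:

```python
def local_extremum_flags(stream, candidates, find_max=True):
    """Step S1: scan the stream keeping the running max (or min),
    as each PE does in its register.  Step S2: for each candidate
    value, output 1 if it equals the extremum, else 0."""
    # register initialized to a sufficiently small/large value
    reg = float("-inf") if find_max else float("inf")
    for v in stream:
        reg = max(reg, v) if find_max else min(reg, v)
    return [1 if c == reg else 0 for c in candidates]
```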
The instruction code COUNTCMP denotes a comparison operation that uses counters to complete the comparison: the data to be compared and a threshold are read and passed to the multi-dimensional operation unit; each PE compares its incoming data stream with the threshold one by one and counts; after the incoming data has been traversed, the number of data elements greater than or less than the threshold is output.
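The COUNTCMP semantics can be sketched in a few lines of pure Python; the function name and keyword argument are illustrative assumptions, not the accelerator's interface:

```python
def countcmp(data, threshold, greater=True):
    """Compare the incoming stream against the threshold, as each
    PE does, and return the count of elements above (or below) it."""
    if greater:
        return sum(1 for v in data if v > threshold)
    return sum(1 for v in data if v < threshold)
```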
The multi-dimensional data operation instruction class, one of the fine-grained operation instruction classes, is mainly used to control operations on multi-dimensional data. Multi-dimensional data includes two-dimensional data and data of more than two dimensions, and the class contains instructions for operations of multi-dimensional data with, respectively, multi-dimensional data, one-dimensional vector data, one-dimensional scalar data, and so on. Taking matrices as an example: MMmM is a matrix-matrix multiplication instruction, one of the instructions between multi-dimensional data and multi-dimensional data, and similarly MMaM is a matrix-matrix addition instruction; MMmV is a matrix-vector multiplication instruction, one of the instructions between multi-dimensional data and one-dimensional vector data, and similarly MMaV is a matrix-vector addition instruction; MMmS is a matrix-scalar multiplication instruction, one of the instructions between multi-dimensional data and one-dimensional scalar data, and similarly MMaS is a matrix-scalar addition instruction. In addition, the multi-dimensional data operation instruction class is also compatible with operations between one-dimensional data: for example, MVmV implements the multiplication of a one-dimensional vector with a one-dimensional vector, and MMoV implements the outer product of two one-dimensional vectors.
The one-dimensional data operation instruction class, one of the fine-grained operation instruction classes, is mainly used to control operations on one-dimensional data, which is further divided into one-dimensional vector data and one-dimensional scalar data. For example, VVmV is a vector-vector multiplication instruction, and similarly VVaV denotes a vector-vector addition instruction. VVmS is a multiplication instruction between a one-dimensional vector and a one-dimensional scalar. SSsS denotes a one-dimensional scalar operation instruction used to compute the square root of the scalar. SSrS denotes an operation for obtaining a random number. MV is a move instruction used to fetch register or immediate operands during computation.
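For reference, here is a hedged pure-Python sketch of what several of these one-dimensional instructions compute; the function names deliberately mirror the instruction codes, but they are illustrations of the semantics, not the accelerator's implementation:

```python
import math
import random

def VVmV(a, b):          # vector * vector, element-wise
    return [x * y for x, y in zip(a, b)]

def VVaV(a, b):          # vector + vector, element-wise
    return [x + y for x, y in zip(a, b)]

def VVmS(a, s):          # vector * scalar
    return [x * s for x in a]

def SSsS(s):             # scalar square root
    return math.sqrt(s)

def SSrS():              # random-number generation
    return random.random()
```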
Figure 10 is an embodiment in which the macro-instruction CONV provided by the present invention completes a two-dimensional convolution operation on a hardware structure. The two-dimensional convolution works as follows: for a two-dimensional input image, a convolution kernel slides over the input image; at each position the kernel filters the data of the two-dimensional image it currently covers, i.e. the kernel and the covered image data are multiplied element-wise, the products are then accumulated, and the sum is recorded as the required filtering result. The kernel then slides to the next position, and the operation is repeated until the whole computation is complete. Since convolution operations are used very widely and appear in large numbers, the convolution operation designed in this patent can fully exploit the data reusability of the hardware structure, distributing and transferring data rationally so as to maximize hardware utilization. For clarity, a specific embodiment is attached, as shown in Figure 10. In this embodiment, the input is defined as an image or matrix, and the output is also an image or matrix, both stored in block form at specified locations. The hardware structure is exemplified by a matrix operation unit (MPU), which contains m*n matrix processing elements (MPEs), each containing the required arithmetic units and registers for temporarily storing intermediate data. As shown in Figure 18, the specific computation process is:
S1: read a convolution macro-instruction, consisting of an operation code and operands. The operation code is CONV, indicating a convolution operation. There are seven operands: DA, SA1, SA2, IX, IY, KX and KY, where DA is the destination address, i.e. the storage address of the output result; SA1 is start address 1, from which the image to be processed is read; SA2 is start address 2, from which the convolution kernel to be used is read; IX and IY denote the size of the image in the X and Y directions respectively, i.e. these two variables define the size of the image to be processed; and KX and KY denote the size of the convolution kernel.
S2: according to the instruction, read the input image data from SRAM into the corresponding positions in the BUFFER and wait for the computation; here each MPE in the MPU is required to compute one pixel of the output image.
S3: transfer the corresponding input image data to each MPE. Since the convolution kernel used in the computation is the same in every MPE, the kernel is broadcast to every MPE. Each MPE then multiplies the incoming input data with the corresponding kernel data and saves the product in its own register.
S4: since the operation data of the convolution exhibits local reuse, the input image data to be processed in the next cycle is exactly the data on which the MPE to the right is operating in the current cycle, so the input image data is passed leftward in turn, while the data required by the rightmost MPE is not in the MPU and must be read again from the BUFFER. Once the data transfer is complete, each MPE multiplies the input image data with the corresponding kernel data, accumulates the product with the data in its register, and stores the result back in the register.
S5: repeat step S4 until all kernel data and the corresponding input image data have been processed; each MPE then holds one pixel of the output image, and the results are output and saved to the location defined by the destination address in the instruction.
S6: repeat the above steps until all pixels of the output image have been computed.
Macro-instructions can make full use of the local reusability of data, greatly reducing the number of data transfers and improving computational efficiency. For example, when m=3 and n=3, the MPU can perform the convolution of 9 pixels simultaneously, taking 9 clock cycles.
Similarly, we provide a large number of macro operations, such as convolution; although the operations they complete could also be accomplished by other types of instructions, the existence of macro operations makes the instruction sequences more concise and efficient. Moreover, macro-instructions handle data reuse well, improving data utilization, reducing data transfers, lowering power consumption and improving performance.
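Functionally, the CONV macro-instruction described above computes an ordinary "valid"-mode two-dimensional convolution (more precisely a cross-correlation, since the kernel is not flipped). A straightforward pure-Python reference, without the leftward data-passing the hardware uses for reuse, is sketched below; each output pixel corresponds to what one MPE accumulates:

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image; at each position multiply
    element-wise with the covered patch and accumulate, producing
    one output pixel per position (as one MPE would)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for oy in range(ih - kh + 1):
        row = []
        for ox in range(iw - kw + 1):
            acc = 0
            for ky in range(kh):
                for kx in range(kw):
                    acc += image[oy + ky][ox + kx] * kernel[ky][kx]
            row.append(acc)
        out.append(row)
    return out
```

A 3x3 image convolved with a 2x2 kernel yields a 2x2 output, matching the "one pixel per MPE" mapping in the m=3, n=3 example only up to the smaller output size.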
Figure 11 shows a multi-dimensional data operation instruction provided by an embodiment of the present invention, implementing a dot product between two one-dimensional vectors; similar flows are used for operations such as vector multiplication, vector addition and vector comparison. Each vector operation unit (VPU) contains mm vector processing elements (VPEs), each of which can process one pair of input data. The detailed flow, shown in Figure 19, is as follows: first feed mm pairs of operands to the mm VPEs, which each perform one multiplication and store the product in their internal registers; then feed the next mm pairs of operands to the mm VPEs, which each perform one multiplication, accumulate the product with the previous product temporarily stored in the internal register, and write the accumulated result back to the internal register. Repeat the above steps until all inputs have been processed. Then the results of the vector operation unit are passed leftward starting from the rightmost end: the rightmost VPE passes the data in its register directly to the VPE on its left; when a VPE receives data from its right, it accumulates it with the data in its own internal register and continues passing the accumulated result leftward, and so on. Finally, the dot-product result is obtained in the leftmost VPE and output as required.
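The per-lane multiply-accumulate followed by the leftward accumulation pass can be mimicked in software. The sketch below is illustrative (the lane count `num_vpe` stands in for the mm VPEs, and the final loop models the left-passing of partial sums), not the accelerator's implementation:

```python
def vpu_dot(a, b, num_vpe=4):
    """Each of the num_vpe lanes multiply-accumulates its share of
    the input pairs into a local register; the partial sums are then
    passed leftward and folded into the leftmost lane."""
    regs = [0.0] * num_vpe
    for i, (x, y) in enumerate(zip(a, b)):
        regs[i % num_vpe] += x * y          # multiply + accumulate
    # pass partial results leftward, accumulating along the way
    for i in range(num_vpe - 1, 0, -1):
        regs[i - 1] += regs[i]
    return regs[0]
```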
Figure 12 is a process diagram of configuring and implementing the SIFT feature extraction algorithm on the apparatus according to an embodiment of the present invention. The SIFT (Scale-Invariant Feature Transform) feature extraction algorithm is one of the key computations of the RGBD SLAM algorithm. The first step builds the image pyramid (Gaussian Pyramid), which includes basic image operations such as image smoothing and can be further decomposed in this apparatus into multiple convolution and pooling (down-sampling) operations. Next, the difference-of-Gaussians (DoG) operation is performed, which can be regarded as matrix subtraction between different levels of the image pyramid. Once the DoG operation is complete, the local-extrema search can be performed by calling the macro-instruction LOCAL EXTREMA. After the local-extrema search, the feature points are determined and filtered (KP filter); this step is composed of a large number of vector and scalar operations, such as vector dot products and matrix determinants. Finally, the descriptor of each key point (Key Point) is computed from histograms of neighboring points through multiple vector and scalar operations. The histogram computation can be completed by the macro-instruction HIST, which is composed of vector operations such as vector comparison. The rotation of neighboring pixel regions is implemented by matrix-vector multiplication. Certain special function operations, such as the exponential, are implemented mainly on the scalar operation unit.
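Two of the building blocks named here are simple enough to sketch directly: the DoG step as an element-wise matrix subtraction between adjacent pyramid levels, and a HIST-style histogram built from comparisons. Both functions below are illustrative pure-Python stand-ins, not the macro-instructions themselves:

```python
def dog(level_a, level_b):
    """Difference of Gaussians: element-wise matrix subtraction
    between two adjacent pyramid levels."""
    return [[a - b for a, b in zip(ra, rb)]
            for ra, rb in zip(level_a, level_b)]

def hist(values, num_bins, lo, hi):
    """HIST-style binning built from comparisons: count how many
    values fall into each of num_bins equal bins over [lo, hi)."""
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) / width)] += 1
    return counts
```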
Figure 13 is a schematic flowchart of configuring and implementing the G2O graph optimization algorithm on the apparatus according to an embodiment of the present invention. G2O is a framework for solving nonlinear graph optimization problems; many typical graph-based SLAM algorithms, such as RGBD SLAM and ORB SLAM, are built on this framework. Given the pose constraints between two graph nodes and the initial poses, the computation of the error matrix and the Jacobian matrix can be completed by matrix and vector operations, such as matrix multiplication and accumulation. A linear system that optimizes the objective function is then established from the error matrix and the Jacobian matrix; this step can be completed by the matrix and vector operation units and also involves operations including matrix multiplication and accumulation. This linear system is then solved, which we can do with the Preconditioned Conjugate Gradient (PCG) algorithm (it can also be done by Cholesky decomposition, sparse-matrix methods, or upper-triangular decomposition). The PCG operation can be decomposed into blockwise matrix-vector multiplication and addition operations, and can be implemented concretely by the macro-instruction PCG. The final pose optimization can likewise be completed by operations such as matrix and vector multiplication and addition. The design of the matrix and vector operation units, combined with the design of the scalar unit, can support all kinds of operations and significantly speed up the computation.
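To make the linear-system step concrete, here is a pure-Python sketch of the unpreconditioned conjugate gradient method for a symmetric positive-definite system A x = b; the macro-instruction PCG additionally applies a preconditioner, which is omitted here for brevity. Note that the loop body consists only of the matrix-vector multiplications and vector additions mentioned above:

```python
def conjugate_gradient(A, b, iters=50, tol=1e-10):
    """Plain (unpreconditioned) conjugate gradient for A x = b with
    A symmetric positive definite, built only from matrix-vector
    products, dot products and vector additions."""
    n = len(b)
    matvec = lambda M, v: [sum(M[i][j] * v[j] for j in range(n))
                           for i in range(n)]
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    x = [0.0] * n
    r = b[:]                 # residual b - A*0
    p = r[:]
    rs = dot(r, r)
    for _ in range(iters):
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

On the accelerator, the matrix-vector product inside the loop is exactly the kind of blockwise multiplication the matrix operation unit accelerates.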
The apparatus and method of the embodiments of the present invention can be applied to the following scenarios (including but not limited to): data processing; electronic products such as robots, drones, autonomous driving, computers, printers, scanners, telephones, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, cloud servers, cameras, camcorders, projectors, watches, earphones, mobile storage and wearable devices; means of transport such as aircraft, ships and vehicles; household appliances such as televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves and range hoods; and various kinds of medical equipment including nuclear magnetic resonance instruments, B-mode ultrasound scanners and electrocardiographs.
The specific embodiments described above further explain the objects, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit the present invention; any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.

Claims (16)

  1. A SLAM hardware accelerator apparatus, characterized by comprising:
    a storage part, for storing input data, temporary operation result data, final operation result data, the instruction set required by the operation process, and/or algorithm parameter data;
    an operation part, connected to the storage part, for completing the computation of SLAM algorithms and applications;
    a control part, connected to the storage part and the operation part, for controlling and coordinating the storage part and the operation part.
  2. The SLAM hardware accelerator apparatus according to claim 1, characterized in that the storage part comprises:
    an input storage module, for storing input and output data;
    an intermediate result storage module, for storing intermediate operation results;
    a final result storage module, for storing final operation results;
    an instruction storage module, for storing the instruction set required by the operation process; and/or
    a buffer storage module, for buffering data.
  3. The SLAM hardware accelerator apparatus according to claim 1, characterized in that the operation part comprises:
    an accelerated operation device, designed for SLAM-related algorithms and applications, for accelerating and processing SLAM operations;
    another operation device, for the other operations contained in SLAM-related algorithms and applications that cannot be completed by the accelerated operation device.
  4. The SLAM hardware accelerator apparatus according to claim 3, characterized in that the accelerated operation device comprises a vector operation unit and a matrix operation unit.
  5. The SLAM hardware accelerator apparatus according to claim 3, characterized in that the other operation device is used to complete operations that are used in the algorithms and applications but are not performed by the accelerated operation device.
  6. The SLAM hardware accelerator apparatus according to claim 3, characterized in that the operation part is implemented by hardware circuits.
  7. The SLAM hardware accelerator apparatus according to claim 1, characterized in that the control part connects each module of the storage part and the operation part; the control part consists of a first-in-first-out queue and a control processor, where the FIFO queue stores control signals and the control processor fetches the control signal to be executed, analyses the control logic, and then controls and coordinates the storage part and the operation part.
  8. The SLAM hardware accelerator apparatus according to claim 2, characterized in that the instruction set in the instruction storage module comprises:
    a control operation instruction class, used to control the selection of the run instructions to be executed;
    a data operation instruction class, used to control data transfers;
    a macro operation instruction class, used for complete operation tasks;
    a multi-dimensional data operation instruction class, used to control operations on multi-dimensional data; and/or
    a one-dimensional data operation instruction class, used to control operations on one-dimensional data.
  9. The SLAM hardware accelerator apparatus according to claim 8, characterized in that the control operation instruction class comprises jump instructions and branch instructions; the jump instructions comprise direct jump instructions and indirect jump instructions, and the branch instructions comprise conditional branch instructions.
  10. The SLAM hardware accelerator apparatus according to claim 8, wherein the macro operation instruction class comprises a convolution operation instruction or a pooling operation instruction.
  11. The SLAM hardware accelerator apparatus according to claim 8, characterized in that the multi-dimensional data operation instruction class is used to require the operation units to execute operations on multi-dimensional data, including operations between multi-dimensional data and multi-dimensional data, between multi-dimensional data and one-dimensional vector data, and between multi-dimensional data and one-dimensional scalar data.
  12. The SLAM hardware accelerator apparatus according to claim 8, characterized in that the one-dimensional data operation instruction class is used to require the operation units to execute operations on one-dimensional data, the one-dimensional data comprising one-dimensional vectors and one-dimensional scalars.
  13. The SLAM hardware accelerator apparatus according to claim 12, characterized in that the operations on one-dimensional vector data comprise operations between a one-dimensional vector and a one-dimensional vector, and operations between a one-dimensional vector and a scalar.
  14. The SLAM hardware accelerator apparatus according to claim 12, characterized in that the operations on one-dimensional scalar data comprise operations between scalars.
  15. The SLAM hardware accelerator apparatus according to claim 8, further comprising an assembler, for selecting, during operation, the instruction types in the instruction set to be used.
  16. A method of performing SLAM operations with the apparatus according to claim 1, characterized in that the control part controls data transport, computation and program execution through the instruction set in the storage part, comprising:
    step one: transporting the input data of the storage part to the operation part;
    step two: executing operations in the operation part according to the instruction set required by the operation process;
    step three: transferring and saving the operation result data;
    step four: repeating steps one to three until the computation is complete.
PCT/CN2017/075134 2016-11-03 2017-02-28 Slam运算装置和方法 WO2018082229A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610958847.XA CN108021528B (zh) 2016-11-03 2016-11-03 Slam运算装置和方法
CN201610958847.X 2016-11-03

Publications (1)

Publication Number Publication Date
WO2018082229A1 true WO2018082229A1 (zh) 2018-05-11

Family

ID=62075642

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/075134 WO2018082229A1 (zh) 2016-11-03 2017-02-28 Slam运算装置和方法

Country Status (2)

Country Link
CN (12) CN109710559A (zh)
WO (1) WO2018082229A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023014588A1 (en) * 2021-08-03 2023-02-09 Micron Technology, Inc. Parallel matrix operations in a reconfigurable compute fabric

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111290788B (zh) * 2018-12-07 2022-05-31 上海寒武纪信息科技有限公司 运算方法、装置、计算机设备和存储介质
CN111290789B (zh) * 2018-12-06 2022-05-27 上海寒武纪信息科技有限公司 运算方法、装置、计算机设备和存储介质
CN111079915B (zh) * 2018-10-19 2021-01-26 中科寒武纪科技股份有限公司 运算方法、装置及相关产品
CN110058884B (zh) * 2019-03-15 2021-06-01 佛山市顺德区中山大学研究院 用于计算型存储指令集运算的优化方法、系统及存储介质
CN110991291B (zh) * 2019-11-26 2021-09-07 清华大学 一种基于并行计算的图像特征提取方法
CN113112481B (zh) * 2021-04-16 2023-11-17 北京理工雷科电子信息技术有限公司 一种基于矩阵网络的混合异构片上架构
CN113177211A (zh) * 2021-04-20 2021-07-27 深圳致星科技有限公司 用于隐私计算的fpga芯片、异构处理系统及计算方法
CN113342671B (zh) * 2021-06-25 2023-06-02 海光信息技术股份有限公司 对运算模块进行验证的方法、装置、电子设备和介质
CN113395551A (zh) * 2021-07-20 2021-09-14 珠海极海半导体有限公司 处理器、npu芯片和电子设备
CN113792867A (zh) * 2021-09-10 2021-12-14 中科寒武纪科技股份有限公司 运算电路、芯片和板卡
CN117093816B (zh) * 2023-10-19 2024-01-19 上海登临科技有限公司 矩阵乘运算方法、装置和电子设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1834898A (zh) * 2005-05-16 2006-09-20 威盛电子股份有限公司 执行指数乘法的微处理器装置与方法
CN102156637A (zh) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 向量交叉多线程处理方法及向量交叉多线程微处理器
CN102750127A (zh) * 2012-06-12 2012-10-24 清华大学 一种协处理器
CN103150596A (zh) * 2013-02-22 2013-06-12 百度在线网络技术(北京)有限公司 一种反向传播神经网络dnn的训练系统

Family Cites Families (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS60201472A (ja) * 1984-03-26 1985-10-11 Nec Corp マトリクス積計算装置
US5666300A (en) * 1994-12-22 1997-09-09 Motorola, Inc. Power reduction in a data processing system using pipeline registers and method therefor
JPH09230954A (ja) * 1996-02-28 1997-09-05 Olympus Optical Co Ltd ベクトル規格化装置
US7454451B2 (en) * 2003-04-23 2008-11-18 Micron Technology, Inc. Method for finding local extrema of a set of values for a parallel processing element
US7814297B2 (en) * 2005-07-26 2010-10-12 Arm Limited Algebraic single instruction multiple data processing
US8051124B2 (en) * 2007-07-19 2011-11-01 Itt Manufacturing Enterprises, Inc. High speed and efficient matrix multiplication hardware module
CN101609715B (zh) * 2009-05-11 2012-09-05 中国人民解放军国防科学技术大学 行列访问端口分离的矩阵寄存器文件
WO2011063824A1 (en) * 2009-11-30 2011-06-03 Martin Raubuch Microprocessor and method for enhanced precision sum-of-products calculation on a microprocessor
KR101206213B1 (ko) * 2010-04-19 2012-11-28 인하대학교 산학협력단 그래픽 가속기 기반 고속 slam 시스템 및 방법
WO2012012819A1 (en) * 2010-07-26 2012-02-02 Commonwealth Scientific And Industrial Research Organisation Three dimensional scanning beam system and method
CN101986264B (zh) * 2010-11-25 2013-07-31 中国人民解放军国防科学技术大学 用于simd向量微处理器的多功能浮点乘加运算装置
CN102012893B (zh) * 2010-11-25 2012-07-18 中国人民解放军国防科学技术大学 一种可扩展向量运算装置
CN102353379B (zh) * 2011-07-06 2013-02-13 上海海事大学 一种适用于自动驾驶车导航的环境建模方法
WO2013147885A1 (en) * 2012-03-30 2013-10-03 Intel Corporation Apparatus and method for accelerating operations in a processor which uses shared virtual memory
US9013490B2 (en) * 2012-05-17 2015-04-21 The United States Of America As Represented By The Administrator Of The National Aeronautics Space Administration Hilbert-huang transform data processing real-time system with 2-D capabilities
CN103208000B (zh) * 2012-12-28 2015-10-21 青岛科技大学 基于局部极值快速搜索的特征点提取方法
CN104252331B (zh) * 2013-06-29 2018-03-06 华为技术有限公司 乘累加器
US9449675B2 (en) * 2013-10-31 2016-09-20 Micron Technology, Inc. Apparatuses and methods for identifying an extremum value stored in an array of memory cells
CN103640018B (zh) * 2013-12-13 2014-09-03 江苏久祥汽车电器集团有限公司 一种基于surf算法进行定位的方法
CN103677741A (zh) * 2013-12-30 2014-03-26 南京大学 基于ncs算法的成像方法以及混合精度浮点协处理器
CN103955447B (zh) * 2014-04-28 2017-04-12 中国人民解放军国防科学技术大学 基于dsp芯片的fft加速器
CN105212922A (zh) * 2014-06-11 2016-01-06 吉林大学 面向fpga实现心电信号r波自动检测的方法及系统
CN105849690B (zh) * 2014-07-02 2019-03-15 上海兆芯集成电路有限公司 融合乘积-累加运算的处理器与方法
CN104317768B (zh) * 2014-10-15 2017-02-15 中国人民解放军国防科学技术大学 面向cpu+dsp异构系统的矩阵乘加速方法
CN104330090B (zh) * 2014-10-23 2017-06-06 北京化工大学 机器人分布式表征智能语义地图创建方法
KR102374160B1 (ko) * 2014-11-14 2022-03-14 삼성디스플레이 주식회사 스케일링을 사용하여 디스플레이 지연을 감소시키는 방법 및 장치
CN104391820B (zh) * 2014-11-25 2017-06-23 清华大学 基于fpga的通用浮点矩阵处理器硬件结构
CN104574508A (zh) * 2015-01-14 2015-04-29 山东大学 一种面向虚拟现实技术的多分辨率模型简化方法
US10285760B2 (en) * 2015-02-04 2019-05-14 Queen's University At Kingston Methods and apparatus for improved electromagnetic tracking and localization
CN104851094A (zh) * 2015-05-14 2015-08-19 西安电子科技大学 一种基于rgb-d的slam算法的改进方法
CN104915322B (zh) * 2015-06-09 2018-05-01 中国人民解放军国防科学技术大学 一种卷积神经网络硬件加速方法
CN104899182B (zh) * 2015-06-09 2017-10-31 中国人民解放军国防科学技术大学 一种支持可变分块的矩阵乘加速方法
CN105528082B (zh) * 2016-01-08 2018-11-06 北京暴风魔镜科技有限公司 三维空间及手势识别追踪交互方法、装置和系统

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1834898A (zh) * 2005-05-16 2006-09-20 威盛电子股份有限公司 执行指数乘法的微处理器装置与方法
CN102156637A (zh) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 向量交叉多线程处理方法及向量交叉多线程微处理器
CN102750127A (zh) * 2012-06-12 2012-10-24 清华大学 一种协处理器
CN103150596A (zh) * 2013-02-22 2013-06-12 百度在线网络技术(北京)有限公司 一种反向传播神经网络dnn的训练系统

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023014588A1 (en) * 2021-08-03 2023-02-09 Micron Technology, Inc. Parallel matrix operations in a reconfigurable compute fabric

Also Published As

Publication number Publication date
CN109376113B (zh) 2021-12-14
CN109710559A (zh) 2019-05-03
CN109634904A (zh) 2019-04-16
CN109376112B (zh) 2022-03-15
CN109634905B (zh) 2023-03-10
CN109697184A (zh) 2019-04-30
CN108021528A (zh) 2018-05-11
CN109376113A (zh) 2019-02-22
CN109656867A (zh) 2019-04-19
CN109726168B (zh) 2021-09-21
CN109376114A (zh) 2019-02-22
CN109726168A (zh) 2019-05-07
CN109376112A (zh) 2019-02-22
CN109684267A (zh) 2019-04-26
CN109376114B (zh) 2022-03-15
CN109684267B (zh) 2021-08-06
CN109656867B (zh) 2023-05-16
CN109697184B (zh) 2021-04-09
CN109710558A (zh) 2019-05-03
CN109634905A (zh) 2019-04-16
CN108021528B (zh) 2020-03-13
CN109634904B (zh) 2023-03-07

Similar Documents

Publication Publication Date Title
WO2018082229A1 (zh) Slam运算装置和方法
KR102258414B1 (ko) 처리 장치 및 처리 방법
KR102402111B1 (ko) 콘볼루션 신경망 정방향 연산 실행용 장치와 방법
KR102544275B1 (ko) 콘볼루션 신경망 트레이닝 실행용 장치와 방법
KR102470264B1 (ko) 완전연결층 신경망 역방향 트레이닝 실행용 장치와 방법
WO2017185389A1 (zh) 一种用于执行矩阵乘运算的装置和方法
CN108733348B (zh) 融合向量乘法器和使用其进行运算的方法
KR102354718B1 (ko) 계산 장치 및 방법
WO2017185336A1 (zh) 用于执行pooling运算的装置和方法
CN111651206A (zh) 一种用于执行向量外积运算的装置和方法
CN111160547A (zh) 一种人工神经网络运算的装置及方法
WO2017185335A1 (zh) 一种用于执行batch normalization运算的装置和方法
CN111860814A (zh) 一种用于执行batch normalization运算的装置和方法
CN111860772B (zh) 一种用于执行人工神经网络pooling运算的装置和方法
CN111367567B (zh) 一种神经网络计算装置和方法
CN113867800A (zh) 计算装置、集成电路芯片、板卡、电子设备和计算方法
CN117933314A (zh) 处理装置、处理方法、芯片及电子装置
CN117933327A (zh) 处理装置、处理方法、芯片及电子装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17866842

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17866842

Country of ref document: EP

Kind code of ref document: A1