CN109634905B - SLAM operation device and method - Google Patents

Info

Publication number: CN109634905B
Application number: CN201811529557.9A
Authority: CN (China)
Prior art keywords: data, instruction, vector, result, instructions
Legal status: Active (granted)
Other versions: CN109634905A
Original language: Chinese (zh)
Inventors: Chen Yunji (陈云霁), Du Zidong (杜子东), Zhang Lei (张磊), Chen Tianshi (陈天石)
Current and original assignee: Cambricon Technologies Corp Ltd

Classifications

    • G06F9/3001 Arithmetic instructions
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/161 Computing infrastructure, e.g. computer clusters, blade chassis or hardware partitioning
    • G06F9/30021 Compare instructions, e.g. Greater-Than, Equal-To, MINMAX
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The SLAM hardware accelerator device comprises a vector operation unit and a register unit. The vector operation unit contains multiple vector processing elements (VPEs); the data set to be operated on is distributed among the VPEs, and each VPE performs multiplication and addition on its input data to produce an operation result. The register unit stores the results of these multiplication and addition operations. The device and method can effectively accelerate the algorithm according to different requirements, support operations with differing requirements, and offer strong flexibility, high configurability, high operation speed, and low power consumption.

Description

SLAM operation device and method
Technical Field
The invention relates to a SLAM (Simultaneous Localization and Mapping) operation device and method for accelerating SLAM algorithm operations according to different requirements.
Background
Autonomous navigation in an unknown environment is a fundamental capability of mobile robots (e.g., unmanned ground and aerial vehicles). In a SLAM task, localization determines the robot's position within the map, while mapping builds a map of the environment as the robot perceives it. When no initial map of the environment exists, the robot must construct the map in real time and use it to localize itself, which is exactly the task the SLAM algorithm performs. However, implementing the SLAM algorithm accurately under a mobile robot's limited computing power and strict power budget is one of the biggest practical challenges. First, the real-time requirement means the SLAM algorithm needs extremely high operation speed to complete large amounts of per-frame and inter-frame computation in a short time; second, the constraints of a mobile platform impose severe power-consumption limits; and finally, SLAM involves many and varied operation types, so the designed accelerator must support multiple kinds of SLAM algorithms.
In the prior art, one way to implement the SLAM algorithm is to run it directly on a general-purpose processor (CPU). One disadvantage of this approach is that the operation performance of a single general-purpose processor is too low to meet the real-time requirements of common SLAM operations. When multiple general-purpose processors execute in parallel, the communication between them becomes a performance bottleneck.
Another way to implement the SLAM algorithm is to perform the operations on a graphics processor (GPU), which supports such algorithms by executing general-purpose SIMD instructions with a general-purpose register file and general-purpose stream processing units. Although the GPU is a device dedicated to graphics and image operations, the complexity of SLAM computation means it cannot support the algorithm's subsequent stages well, i.e., it cannot efficiently accelerate the SLAM algorithm end to end. Moreover, the on-chip cache of a GPU is too small to satisfy the operation requirements of many SLAM algorithms. In addition, in practical deployments it is difficult to fit a CPU- or GPU-like architecture onto a robot, so there is currently no dedicated SLAM hardware accelerator architecture that is both highly practical and highly flexible. The device designed here is a dedicated SLAM hardware accelerator that meets these requirements, together with a corresponding method; it can be realized as hardware such as a dedicated chip or an embedded chip, and applied to robots, computers, mobile phones, and the like.
Disclosure of Invention
Technical problem to be solved
The invention aims to provide a device and a method of a SLAM hardware accelerator.
(II) technical scheme
According to an aspect of the present invention, there is provided an apparatus of a SLAM hardware accelerator, including:
a storage part, for storing input data, temporary operation result data, final operation result data, an instruction set, and/or algorithm parameter data required by the operation process;
an operation part, connected with the storage part, for completing the computation of SLAM-related algorithms and applications;
and a control part, connected with the storage part and the operation part, for controlling and coordinating them.
Preferably, the storage section includes:
an input storage module: for storing input and output data;
an intermediate result storage module: used for storing the intermediate operation result;
a final result storage module: used for storing the final operation result;
an instruction storage module: the instruction set is used for storing an instruction set required by the operation process; and/or
A buffer storage module: for buffer storage of data.
Preferably, the operation section includes:
an accelerated operation device, designed for SLAM-related algorithms and applications, for accelerating and processing SLAM operations; and
other computing devices, for performing those operations in SLAM-related algorithms and applications that cannot be performed by the accelerated operation device.
Preferably, the acceleration operation means includes a vector operation unit and a matrix operation unit.
Preferably, the other computing devices are used to perform the operations in the algorithms and applications that are not performed by the accelerated operation device.
Preferably, the operation section is implemented by a hardware circuit.
Preferably, the control part is connected with each module of the storage part and the operation part. The control part comprises a first-in first-out (FIFO) queue and a control processor; the FIFO queue stores control signals, and the control processor fetches the control signal to be executed, analyzes the control logic, and then controls and coordinates the storage part and the operation part.
Preferably, the instruction set comprises:
a control operation instruction class, for selecting and controlling which operation instructions are executed;
a data operation instruction class, for controlling data transmission;
a macro operation instruction class, for relatively complete, coarse-grained operations;
a multidimensional data operation instruction class, for controlling operations on multidimensional data; and/or
a one-dimensional data operation instruction class, for controlling operations on one-dimensional data.
Preferably, the control operation instruction class includes a jump instruction and a branch instruction, the jump instruction includes a direct jump instruction and an indirect jump instruction, and the branch instruction includes a conditional branch instruction.
Preferably, the macro operation instruction class includes a convolution operation instruction or a pooling operation instruction.
Preferably, the multidimensional data operation instruction class is used for requiring the operation unit to execute operations on multidimensional data, and the operations on multidimensional data include operations between multidimensional data and multidimensional data, operations between multidimensional data and one-dimensional vector data, and operations between multidimensional data and one-dimensional scalar data.
Preferably, the one-dimensional data operation instruction class is used for requiring the operation unit to execute the operation of one-dimensional data, and the one-dimensional data includes a one-dimensional vector and a one-dimensional scalar.
Preferably, the operation of the one-dimensional vector data includes an operation between a one-dimensional vector and a one-dimensional vector, and an operation between a one-dimensional vector and a scalar.
Preferably, the operation of the one-dimensional scalar data includes an operation between scalars.
Preferably, the system further comprises an assembler for selecting the instruction type in the instruction set to use during the operation.
According to another aspect of the present invention, there is also provided a method of performing SLAM operations with any one of the above apparatuses, in which the control part controls the transport of data, the operations, and program execution through the instruction set held in the storage part. The method comprises the following steps:
Step 1: transport input data from the storage part to the operation part;
Step 2: execute operations in the operation part according to the instruction set required by the operation process;
Step 3: transmit and store the operation result data;
Step 4: repeat the above steps until the operation is finished.
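The four steps can be modeled as a simple fetch-execute-store loop. The sketch below is a hypothetical software model only; the Storage class, opcode names, and addresses are illustrative and are not taken from the patent:

```python
class Storage:
    """Toy stand-in for the storage part (illustrative, not the patent's design)."""
    def __init__(self, data):
        self.mem = dict(data)

    def load(self, addrs):
        return [self.mem[a] for a in addrs]

    def store(self, addr, value):
        self.mem[addr] = value


def execute(opcode, operands):
    """Toy operation part supporting two example opcodes."""
    if opcode == "MUL":
        return operands[0] * operands[1]
    if opcode == "ADD":
        return operands[0] + operands[1]
    raise ValueError(f"unsupported opcode: {opcode}")


def run(storage, program):
    for opcode, in_addrs, out_addr in program:
        operands = storage.load(in_addrs)   # step 1: transport input data
        result = execute(opcode, operands)  # step 2: execute the operation
        storage.store(out_addr, result)     # step 3: transmit and store the result
    # step 4: the loop repeats until the program is finished


storage = Storage({"a": 3.0, "b": 4.0})
run(storage, [("MUL", ["a", "b"], "t"), ("ADD", ["t", "a"], "out")])
print(storage.mem["out"])  # 15.0
```

In hardware the control part would drive this loop via control signals rather than a Python interpreter, but the data flow between storage and operation parts is the same.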
(III) advantageous effects
The device and method of the SLAM hardware accelerator can effectively accelerate the SLAM algorithm according to different requirements, are applicable to various SLAM algorithms and various input data types, support operations with differing requirements, and have the advantages of strong flexibility, high configurability, high operation speed, and low power consumption.
Compared with the prior art, the device and the method have the following effects:
1) The operation part can operate on data of different input types according to different requirements;
2) Through the buffer storage module, the operation part can also share data to a certain degree, thereby reducing the reuse distance of the data;
3) The instruction design supports a variety of basic operation types, giving the device high configurability;
4) The matrix and vector operation units, working together with the scalar operation unit, support many operation types and markedly increase operation speed;
5) The design of the operation and storage parts and the arrangement of the instructions greatly reduce power consumption during execution.
Drawings
Fig. 1 is a schematic structural diagram of an apparatus of a SLAM hardware accelerator according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a SLAM hardware accelerator according to another embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an embodiment of a scalar operation unit of the SLAM hardware accelerator according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of an embodiment of a vector operation unit of a SLAM hardware accelerator according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of an embodiment of a matrix operation unit of a SLAM hardware accelerator according to an embodiment of the present invention.
Fig. 6 is a schematic diagram illustrating an embodiment of a SLAM hardware accelerator for performing a three-dimensional coordinate L2 norm operation according to an embodiment of the present invention.
Fig. 7 is a diagram illustrating an embodiment of a SLAM hardware accelerator performing a 16-dimensional square matrix multiplication operation according to an embodiment of the present invention.
Fig. 8 is a schematic diagram of an implementation of the algorithm of SLAM based on the extended kalman filter method (EKF) in the present apparatus according to an embodiment of the present invention.
FIG. 9 is a diagram illustrating instruction types provided by an embodiment of the invention.
FIG. 10 is a block diagram illustrating an application of a macro instruction according to an embodiment of the present invention.
FIG. 11 is a block diagram of one embodiment of a one-dimensional data operation instruction, according to the present invention.
Fig. 12 is a schematic diagram of a configuration implementation of a SIFT feature extraction algorithm on the present apparatus according to an embodiment of the present invention.
Fig. 13 is a schematic diagram of a configuration implementation of a G2O framework-based graph optimization algorithm on the present apparatus according to an embodiment of the present invention.
FIG. 14 is a flowchart illustrating execution of a convolution operation instruction according to an embodiment of the present invention.
FIG. 15 is a flowchart illustrating execution of an image accumulate instruction according to an embodiment of the present invention.
FIG. 16 is a flowchart illustrating an exemplary embodiment of a filter operation instruction.
FIG. 17 is a flowchart illustrating an exemplary execution of a local extremum instruction according to an embodiment of the present invention.
FIG. 18 is a flowchart illustrating an exemplary implementation of a two-dimensional convolution operation according to an embodiment of the present invention.
Fig. 19 is a flowchart illustrating an implementation of a one-dimensional vector dot product operation according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
Fig. 1 is a schematic structural diagram of an apparatus of a SLAM hardware accelerator according to an embodiment of the present invention. As shown in Fig. 1, the accelerator is divided into three parts: a control part, an operation part, and a storage part. The control part sends control signals to the operation part and the storage part to control their operation and to coordinate the data transmission between them. The storage part stores the relevant data, including input data, intermediate results, final results, instructions, and cached data; the specific stored contents, the storage organization, and the access patterns can be planned differently according to different requirements. The operation part comprises several operators for data operations, made up of one or more combinations of a scalar operation unit, a vector operation unit, and a matrix operation unit; these operators can process data of different input types according to different requirements. Through the buffer storage module, the operation part can also share data to a certain degree, thereby reducing the reuse distance of the data.
Fig. 2 is a schematic structural diagram of an apparatus of a SLAM hardware accelerator according to another embodiment of the present invention. As shown in Fig. 2, this embodiment is intended to accelerate the operation of an image-based SLAM algorithm, reduce data exchange, and save memory space. The control part is connected with each module of the storage part and the operation part; it consists of a first-in first-out (FIFO) queue, which stores control signals, and a control processor, which fetches the control signal to be executed, analyzes the control logic, and then controls and coordinates the storage part and the operation part. The storage part is divided into four modules: an input storage module, an output storage module, an intermediate result storage module, and a cache module. The operation part mainly accelerates the image-processing operations, point-cloud map construction, image matching, and graph optimization, so the operation unit is likewise divided into three modules (a scalar operation module, a vector operation module, and a matrix operation module), which can execute in a pipelined or parallel fashion.
FIG. 3 is a schematic diagram of a scalar operation unit that can be used in the apparatus, according to an embodiment of the present invention, where SPE denotes a single scalar processing element. The scalar operation unit mainly handles the parts of the SLAM algorithm that cannot be accelerated, as well as some complex operations such as trigonometric functions; it can also resolve memory-access consistency, and is one of the important components of the accelerator. The storage modules directly related to the scalar operation unit are the intermediate result storage module and the buffer storage module. The operands required by scalar operations may reside in either module, and the result of a scalar operation can be stored in the intermediate result storage module or output to the buffer storage module, depending on actual requirements.
FIG. 4 is a schematic diagram of a vector operation unit applicable to the apparatus. The entire vector operation unit is composed of multiple basic operation units, where VPE denotes a basic operation unit of vector operation. The vector operation unit handles the vector-operation parts of the SLAM algorithm and all operations with vector characteristics, such as the dot product of vectors, and it enables efficient data-level and task-level parallelism. Through configuration, each basic unit of the vector operation unit can execute the same operation in parallel, or different units can execute different operations. The storage modules directly related to the vector operation unit are the intermediate result storage module and the buffer storage module. The operands required by vector operations may reside in either module, and the result of a vector operation can be stored in the intermediate result storage module or output to the buffer storage module, depending on actual requirements.
Fig. 5 illustrates, in another embodiment of the present invention, a matrix operation unit that can be used in the apparatus and can accelerate all matrix operation types and matrix-like operation types, where MPE denotes the basic operation unit of the matrix operation unit. The matrix operation unit is composed of multiple basic operation units, arranged in the illustrated case as an operation-unit array. The matrix operation unit supports several external data-exchange modes, such as a 2D exchange mode or a 1D exchange mode. It also supports data-access modes between its internal units, which greatly reduces the reuse distance of local data and achieves efficient acceleration. The storage modules directly related to the matrix operation unit are the intermediate result storage module and the buffer storage module. The operands required by matrix operations may reside in either module, and the result of a matrix operation can be stored in the intermediate result storage module or output to the buffer storage module, depending on actual needs.
Fig. 6 is a flow chart illustrating a three-dimensional-coordinate L2 norm operation performed by the apparatus, according to an embodiment of the present invention. Assume the three components of a three-dimensional coordinate are stored in the intermediate result storage module. First, a configuration instruction fetches the operands from the intermediate result storage module and feeds them to three basic operation units (VPEs) of the vector operation unit. Each of the three VPEs performs a multiplication whose two operands are a fetched coordinate component and that same component, i.e., it squares the component. The multiplication results are passed through the buffer storage module into the scalar operation unit, which sums the three products and then performs the square-root operation. The final result is output to the intermediate result storage module or the buffer storage module as required.
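The dataflow of Fig. 6 can be modeled in a few lines. This is a software sketch only: the three squarings stand in for the three parallel VPEs, and the sum and square root stand in for the scalar operation unit:

```python
import math

def l2_norm_3d(coords):
    """Model of Fig. 6: each of three VPEs multiplies one coordinate
    component by itself; the scalar unit then sums the three products
    and performs the square-root operation."""
    assert len(coords) == 3
    # Vector operation unit: three VPEs, each computing x * x in parallel.
    squares = [x * x for x in coords]
    # Scalar operation unit: summation followed by the square root.
    return math.sqrt(sum(squares))

print(l2_norm_3d([3.0, 4.0, 12.0]))  # 13.0
```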
Fig. 7 is a flow chart of one possible configuration of the apparatus for performing an N-dimensional square-matrix multiplication, according to an embodiment of the present invention. Take N = 16 as an example, and assume matrix C is obtained by multiplying matrix A by matrix B. The matrix operation unit in the figure contains 256 basic operation units, each responsible for computing one element of the final result, and the matrix data required for the operation is stored in the intermediate result storage module. When the operation starts, the operands of A are fetched from the intermediate result storage module into the buffer storage module, which feeds the data row by row into each basic operation unit (MPE) of the matrix operation unit; the operands of B are likewise fetched into the buffer storage module and fed step by step, column by column, into each PE under instruction scheduling. Inside each PE, the incoming element of A and element of B are multiplied; the product is not sent out but is accumulated with the previous value stored in the PE's register. Thus, once all the elements of B have been streamed into the PEs, the value held by each PE is the element of the resulting matrix C at its position. Finally, the data of C is stored in the intermediate result storage module or kept in the buffer storage module, as required.
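Functionally, the PE-array scheme of Fig. 7 computes an ordinary matrix product; each PE's register accumulates one output element. A minimal software model (the streaming order and loop structure here are illustrative, not the patent's exact schedule):

```python
def matmul_pe_array(A, B):
    """Model of Fig. 7: PE (i, j) accumulates products into its local
    register (initialized to 0); once all operands of B have streamed
    through, the register holds element C[i][j]."""
    n = len(A)
    # Each PE's accumulator register, initialized to zero.
    reg = [[0.0] * n for _ in range(n)]
    for k in range(n):  # operands streamed step by step
        for i in range(n):
            for j in range(n):
                # PE (i, j) multiplies the incoming A and B operands and
                # accumulates with the value previously stored in its register.
                reg[i][j] += A[i][k] * B[k][j]
    return reg

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_pe_array(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

The same model applies to the N = 16 case of the text, where the 256 PEs each hold one of the 16 x 16 output elements.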
Fig. 8, an embodiment of the present invention, shows the configuration and operation of the apparatus when running a SLAM algorithm based on the extended Kalman filter (EKF). The EKF algorithm can be divided into three major steps: Compute True Data, EKF Predict, and EKF Update. In Compute True Data, the true coordinates are derived from the motion model. In EKF Predict, the new robot pose is predicted from the pose updated with the previous prediction and from the control input. In EKF Update, the association information of the surrounding reference points is computed, and the predicted pose and covariance matrix are updated. The computation in Compute True Data is mainly low-dimensional vector processing, such as the Euclidean-distance operation on three-dimensional coordinates, so most of it can run on the vector operation unit; it also involves typical scalar computations, such as trigonometric functions of angles, so a small amount of computation runs on the scalar operation unit. The EKF Predict step involves repeated large-scale matrix operations such as matrix multiplication, which are best accelerated on the matrix operation unit, while its smaller vector operations still call on the vector operation unit. The EKF Update step performs many kinds of operations in alternation; typical examples are matrix SVD (singular value decomposition) and Cholesky decomposition, which are composed of fine-grained operations such as matrix multiplication, vector addition and subtraction, vector norms, and trigonometric functions, and therefore use the matrix, vector, and scalar operation units together.
As for the storage modules: because the input of the EKF-based SLAM algorithm consists of point coordinates such as waypoints and landmarks, and the amount of data is small, the data only needs to be loaded from the input storage module once at the start. During the intermediate computation, the amount of stored data generally does not exceed the capacity of the intermediate result storage module, so frequent data exchange with the input storage module is usually unnecessary, reducing both energy consumption and operation time. Finally, the SLAM algorithm writes the computed result to the output storage module, completing the hardware configuration and execution of the whole algorithm.
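To make concrete the kind of work the EKF Predict step maps onto the units, here is a minimal, generic textbook EKF predict for a 2-D pose under a velocity motion model. This is not the patent's implementation; the motion model, state layout, and all symbols are illustrative. The matrix products are matrix-unit work, while the sin/cos evaluations are scalar-unit work:

```python
import math

def matmul3(X, Y):
    """3x3 matrix product (matrix-operation-unit work)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def ekf_predict(pose, P, v, w, dt, Q):
    """Minimal EKF predict for pose (x, y, theta) with velocity v and
    angular velocity w. Illustrative only."""
    x, y, th = pose
    # Predicted pose from the motion model (scalar sin/cos, vector adds).
    pose_new = (x + v * math.cos(th) * dt,
                y + v * math.sin(th) * dt,
                th + w * dt)
    # Jacobian of the motion model with respect to the state.
    F = [[1.0, 0.0, -v * math.sin(th) * dt],
         [0.0, 1.0,  v * math.cos(th) * dt],
         [0.0, 0.0,  1.0]]
    Ft = [[F[j][i] for j in range(3)] for i in range(3)]
    # Covariance propagation: P' = F P F^T + Q.
    FPFt = matmul3(matmul3(F, P), Ft)
    P_new = [[FPFt[i][j] + Q[i][j] for j in range(3)] for i in range(3)]
    return pose_new, P_new

pose, P = ekf_predict((0.0, 0.0, 0.0),
                      [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
                      v=1.0, w=0.0, dt=1.0,
                      Q=[[0.0] * 3 for _ in range(3)])
print(pose)  # (1.0, 0.0, 0.0)
```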
FIG. 9 is a schematic diagram of the types of instructions provided by the present invention.
The instruction set comprises several types, including a control operation instruction class, a data operation instruction class, a macro operation instruction class, a multidimensional data operation instruction class, and a one-dimensional data operation instruction class. Each instruction class can be subdivided into several different instructions, distinguished by the leading instruction code; as shown in Fig. 9, several representative instructions and their codes are listed for each instruction class.
The control operation instruction class is mainly used for controlling program flow. The instruction coded JUMP is a jump instruction that performs the jump function; depending on the operation code that follows, it is divided into a direct jump instruction and an indirect jump instruction. The instruction coded CB is a conditional branch instruction that performs the conditional-jump function.
The data operation instruction class is mainly used for controlling data transmission. The instructions coded LD/ST are used for data transmission between DRAM (Dynamic Random Access Memory) and SRAM (Static Random Access Memory): LD reads data from DRAM and loads it into SRAM, while ST stores data from SRAM back into DRAM. The instruction coded MOV transfers data between SRAM locations. The instructions coded RD/WR transfer data between SRAM and BUFFER: RD reads data from SRAM into BUFFER, and WR stores data from BUFFER back into SRAM. The macro operation instruction class serves as coarse-grained data operation instructions for relatively complete operations.
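The LD/ST, MOV, and RD/WR codes form a small memory-hierarchy protocol (DRAM to/from SRAM, SRAM to/from BUFFER, moves within SRAM). A toy dispatcher makes the directions explicit; the dictionary memory model and address names are illustrative, not the hardware's addressing scheme:

```python
# Toy memory hierarchy: DRAM <-LD/ST-> SRAM <-RD/WR-> BUFFER; MOV within SRAM.
mem = {"DRAM": {}, "SRAM": {}, "BUFFER": {}}

def transfer(op, src, dst):
    if op == "LD":     # read from DRAM, load into SRAM
        mem["SRAM"][dst] = mem["DRAM"][src]
    elif op == "ST":   # store SRAM data back into DRAM
        mem["DRAM"][dst] = mem["SRAM"][src]
    elif op == "MOV":  # transfer between SRAM locations
        mem["SRAM"][dst] = mem["SRAM"][src]
    elif op == "RD":   # read from SRAM into BUFFER
        mem["BUFFER"][dst] = mem["SRAM"][src]
    elif op == "WR":   # write BUFFER data back to SRAM
        mem["SRAM"][dst] = mem["BUFFER"][src]
    else:
        raise ValueError(f"unknown transfer op: {op}")

mem["DRAM"]["a"] = 42
transfer("LD", "a", "x")    # DRAM -> SRAM
transfer("RD", "x", "b0")   # SRAM -> BUFFER
transfer("WR", "b0", "y")   # BUFFER -> SRAM
transfer("ST", "y", "out")  # SRAM -> DRAM
print(mem["DRAM"]["out"])   # 42
```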
The instruction is encoded as CONV to represent a convolution operation instruction, and is used for implementing convolution and convolution-like operations, that is, input data are multiplied and summed with corresponding weights respectively, and the instruction takes local reusability of data into consideration, and a specific implementation process is as shown in fig. 14:
S1, image data is read from the image-data start address as required by the instruction, and weight data is read from the weight-data start address.
S2, the image data are transmitted to the corresponding multidimensional operation unit according to the operation requirements, and the weight data are broadcast to every operation element (PE) in the multidimensional operation unit.
S3, each PE multiplies the incoming image data by the corresponding weight data, adds the product to the data in its register, and stores the sum back into the register (the register must be initialized to 0).
S4, the image data inside the multidimensional operation unit are transmitted internally according to the transmission rule specified by the multidimensional operation unit, and image data not yet in the multidimensional operation unit are read from the BUFFER and transmitted to the specified operation positions. This exploits the reusability of data in convolution operations and greatly reduces the number of data transfers.
S5, steps S3-S4 are repeated until each PE has finished its computation, and the results are output to the destination address specified by the instruction for storage.
S6, the data are read again and the operations repeated until all pixels of the output image have been calculated and stored, and the instruction ends.
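The multiply-accumulate flow of steps S1-S6 can be sketched in software as a plain valid-mode 2D convolution, where each output pixel plays the role of one PE with a register that starts at 0. This is a minimal illustrative sketch, not the patent's hardware implementation; all names are assumptions.

```python
import numpy as np

def conv2d_valid(image, kernel):
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for oy in range(oh):
        for ox in range(ow):
            acc = 0.0  # S3: the PE register, initialized to 0
            for ky in range(kh):
                for kx in range(kw):
                    # S3-S5: multiply input data by the weight and accumulate
                    acc += image[oy + ky, ox + kx] * kernel[ky, kx]
            out[oy, ox] = acc  # S5: store the finished pixel to the destination
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
ker = np.ones((2, 2))
result = conv2d_valid(img, ker)
```

The hardware version differs mainly in that the inner products for many output pixels run in parallel across PEs, with image data shifted between PEs instead of re-read.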
An instruction encoded as POOL is a pooling operation instruction, used to implement pooling and pooling-like operations, i.e. averaging a specified number of data, taking their maximum/minimum value, or performing down-sampling; the specific implementation flow is similar to that of the convolution operation instruction.
An instruction encoded as IMGACC is an image accumulation instruction, used to perform accumulation or similar operations when processing an image. The specific implementation process is shown in fig. 15:
S1, the image data is read from the image-data start address as required by the instruction, and all operation elements (PE) in the multidimensional operation unit are initialized to 0.
S2, in each clock cycle the existing data in the multidimensional operation unit are shifted up by one row, a new row of data is shifted in, the newly shifted-in row is added column-wise to the original last row, and the sum becomes the new last row. This is repeated until the multidimensional operation unit is full.
S3, the data in the multidimensional operation unit are then transmitted to the right and accumulated in each clock cycle: in the first clock cycle, the first column is transmitted to the right, and the second column adds the incoming data and stores the sum; in the second clock cycle, the second column is transmitted to the right, and the third column adds the incoming data and stores the sum, and so on. The result is the required overall accumulation of the image.
S4, all data in the multidimensional operation unit are stored to the destination address specified by the instruction, and the data in the bottom row and the rightmost column are cached.
S5, the multidimensional operation data are re-initialized to 0 and the next round of operations proceeds, until the whole image has been calculated. Note that when the width or height of the image exceeds what the multidimensional operation unit can process at once, the cached data must be accumulated in every round after the first to guarantee a correct operation result.
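Interpreting the two accumulation passes above as prefix sums, the net effect of S2 (accumulate down the columns as rows stream in) followed by S3 (accumulate rightwards across the columns) can be sketched for a single tile as follows. The function name and use of `cumsum` are illustrative assumptions.

```python
import numpy as np

def image_accumulate(tile):
    # S2: running accumulation down each column as rows are shifted in
    col_acc = np.cumsum(tile, axis=0)
    # S3: running accumulation rightwards across the columns
    return np.cumsum(col_acc, axis=1)

tile = np.ones((3, 3))
acc = image_accumulate(tile)
```

For a tile of ones, `acc[i, j]` equals `(i + 1) * (j + 1)`, i.e. the sum of the rectangle from the origin to `(i, j)` — which is why S4 caches the bottom row and rightmost column for stitching larger images.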
An instruction encoded as BOX is a filtering instruction, used to perform box filtering on an image. To obtain the sum over a local window of the image, the algorithm first builds an array A with the same width and height as the original image and assigns each element A[i] the sum of all pixels in the rectangle spanned by that point and the image origin; once A is available, each local sum can be obtained with additions and subtractions of just 4 elements of A. The macro instruction is therefore divided into two phases, as shown in FIG. 16:
S1, the required data are read from the start address according to the instruction and transmitted into the multidimensional operation unit, where they are accumulated in sequence; the accumulated data are stored at the specified destination address 1.
S2, the data required by the instruction are read from destination address 1, the additions and subtractions are performed to obtain the filtering result, and the result is stored at destination address 2, which holds the required final result.
In the data accumulation phase, as with the convolution operation instruction, the data have local reusability, so the instruction supports data transmission inside the multidimensional operation unit.
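The two phases can be sketched with a summed-area table: phase 1 builds the array A of rectangle sums, and phase 2 reads four corners of A per window. This is an illustrative sketch; the padding trick and the window size `k` are assumptions, not from the patent.

```python
import numpy as np

def box_filter_sums(image, k):
    h, w = image.shape
    # Phase 1: A[i, j] = sum of all pixels in the rectangle from (0, 0) to (i, j)
    A = np.cumsum(np.cumsum(image, axis=0), axis=1)
    # Pad with a zero row and column so corner lookups need no boundary checks
    A = np.pad(A, ((1, 0), (1, 0)))
    out = np.empty((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Phase 2: each local k x k sum costs only 4 adds/subtracts on A
            out[i, j] = (A[i + k, j + k] - A[i, j + k]
                         - A[i + k, j] + A[i, j])
    return out

img = np.ones((4, 4))
sums = box_filter_sums(img, 2)
```

Every 2 x 2 window of an all-ones image sums to 4, regardless of position, which is the constant-time-per-window property the macro instruction exploits.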
An instruction encoded as LOCAL EXTREMA is a local extremum instruction, used to decide, when processing an image, whether the data at a specified position is an extremum within its group of data. The macro instruction is divided into two steps, as shown in fig. 17:
S1, the register in each PE of the multidimensional operation unit is initialized to a sufficiently small/large value; data are read from the data start address and transmitted into the multidimensional operation unit; each PE then compares the incoming data with the data stored in its register and stores the larger/smaller value back into the register, until all the specified data have been compared. Each PE thus holds the maximum/minimum of its portion of the data stream.
S2, the data at the designated position are read according to the instruction and transmitted into the multidimensional data operation unit again; each PE compares whether the incoming data equals the maximum/minimum stored in its register, and outputs 1 if they are equal and 0 otherwise.
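A single PE's two-step behavior can be sketched as follows: first reduce the stream to a running maximum (S1), then test whether the value at a designated position equals that maximum (S2). The function and parameter names are illustrative.

```python
def is_local_maximum(stream, position):
    reg = float('-inf')            # S1: register initialized "small enough"
    for value in stream:
        reg = max(reg, value)      # keep the larger value in the register
    # S2: re-read the designated datum and compare with the stored maximum
    return 1 if stream[position] == reg else 0

flags = [is_local_maximum([3, 9, 4, 9], p) for p in range(4)]
```

Note that, as in the patent's description, equality with the extremum is what is tested, so several positions can report 1 when the extremum value repeats.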
An instruction encoded as COUNTCMP is a comparison operation instruction, used to complete a comparison with a counter: the data to be compared and a threshold are read and transmitted into the multidimensional operation unit, each PE compares the incoming data stream with the threshold in sequence while counting the traversed data, and the number of data larger (or smaller) than the threshold is output.
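Per PE, the COUNTCMP behavior reduces to streaming data past a threshold while incrementing a counter. A minimal sketch, with illustrative names:

```python
def count_above(stream, threshold):
    count = 0                    # the PE's counter
    for value in stream:
        if value > threshold:    # compare each incoming datum with the threshold
            count += 1
    return count

n = count_above([1, 5, 7, 2, 9], 4)
```

Counting values below the threshold is the symmetric variant with the comparison reversed.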
The multidimensional data operation instruction class is one of the fine-grained operation instruction classes and is mainly used for controlling operations on multidimensional data, i.e. two-dimensional data and data of more than two dimensions. It includes operation instructions between multidimensional data and, respectively, multidimensional data, one-dimensional vector data and one-dimensional scalar data. Taking matrices as an example, MMmM is a matrix-matrix multiplication instruction, one of the operations between multidimensional data and multidimensional data, with the analogous MMaM, a matrix-matrix addition instruction; MMmV is a matrix-vector multiplication instruction, one of the operations between multidimensional data and one-dimensional vector data, with the analogous MMaV, a matrix-vector addition instruction; MMmS is a matrix-scalar multiplication instruction, one of the operations between multidimensional data and one-dimensional scalar data, with the analogous MMaS, a matrix-scalar addition instruction. In addition, the multidimensional data operation instructions are also compatible with operations between one-dimensional data: for example, MVmV implements a vector-vector multiplication instruction, and MMoV implements a vector-vector outer product instruction.
The one-dimensional data operation instruction class is one of the fine-grained operation instruction classes and is mainly used for controlling operations on one-dimensional data, which divides into one-dimensional vector data and one-dimensional scalar data. For example, VVmV is a vector-vector multiplication instruction, and similarly VVaV is a vector-vector addition instruction. VVmS is a vector-scalar multiplication instruction. SSsS is a one-dimensional scalar operation instruction, used to compute the square root of a one-dimensional scalar. SSrS is an operation for obtaining a random number. MV is a move operation instruction, used to fetch a register value or an immediate during operation.
FIG. 10 illustrates an embodiment of the macro-instruction operation CONV provided by the present invention, which performs a two-dimensional convolution on the hardware structure. In a two-dimensional convolution, a convolution kernel slides over a two-dimensional input image; at each position the kernel filters the image data it currently covers, i.e. the kernel and the covered image data are multiplied element-wise and these products are accumulated to give the required filtering result. The kernel is then slid to the next position and the operation repeated until all positions are processed. Because convolution is widely used and appears in large quantities, the convolution operation designed in this patent fully exploits data reusability on the hardware structure, distributes and transmits data reasonably, and maximizes hardware utilization. For illustration, a specific embodiment is given in fig. 10. In this embodiment, the input is an image or matrix and the output is likewise an image or matrix, all stored block-wise at designated locations. The hardware structure is exemplified by a matrix operation unit (MPU) containing m × n matrix operation elements (MPE), each of which includes the required operators and a register for temporarily storing intermediate data. As shown in fig. 18, the specific operation process is:
S1, a convolution macro instruction, consisting of an operation code and operands, is read. The operation code is CONV, indicating that a convolution is to be performed. There are 7 operands: DA, SA1, SA2, IX, IY, KX and KY. DA is the destination address, i.e. the storage address of the output result; SA1 is start address 1, the address from which the image to be operated on is read; SA2 is start address 2, the address from which the convolution kernel is read; IX and IY give the size of the image in the X and Y directions, i.e. these two variables define the size of the image to be operated on; KX and KY give the size of the convolution kernel.
S2, the input image data are read from the SRAM into the corresponding positions in the BUFFER according to the instruction and await operation; each MPE in the MPU is assigned to compute one pixel of the output image.
S3, the corresponding input image data are transmitted into each MPE. Because every MPE uses the same convolution kernel, the kernel is broadcast to all MPEs. Each MPE then multiplies the incoming input data by the corresponding convolution kernel data and stores the result in its own register.
S4, because convolution has local data reusability, the input image data needed in the next beat is the data just operated on by the MPE to the right; the input image data are therefore shifted left in sequence, and the rightmost MPE, whose required data is not yet in the MPU, reads it from the BUFFER. After the data transfer, each MPE multiplies the input image data by the corresponding convolution kernel data, accumulates the product with the data in its register, and stores the sum back into the register.
S5, step S4 is repeated until all convolution kernel data and the corresponding input image data have been processed, i.e. each MPE has produced 1 pixel of the output image, and the results are output and stored at the location defined by the destination address in the instruction.
S6, the above steps are repeated until all pixels of the output image have been calculated.
Using this macro instruction, the local reusability of data is fully exploited, the number of data transfers is greatly reduced, and operation efficiency is improved. For example, when m = 3 and n = 3, the MPU can perform the convolution for 9 pixels simultaneously, taking 9 clock cycles.
Similarly, we provide a large number of macro-instruction operations. Operations such as convolution could also be expressed with other instruction types, but the macro instructions make the operation sequences more concise and efficient. In addition, macro instructions handle the data-reuse problem well: they improve data utilization, reduce data transmission, lower power consumption and improve performance.
FIG. 11 illustrates an embodiment of a multidimensional data operation instruction provided by the present invention, which implements the dot product between one-dimensional vectors; operations such as vector multiplication, vector addition and vector comparison follow similar flows. Each vector operation unit (VPU) comprises mm vector operation elements (VPEs), each of which can process one pair of input data. The detailed flow is shown in fig. 19. First, mm data to be operated on are input to the mm VPEs, each VPE performs one multiplication, and the products are stored in the VPEs' internal registers. Then the next mm data are input to the mm VPEs, each performs one multiplication, the new product is accumulated with the previous product held in the internal register, and the sum is written back to the internal register. This is repeated until all inputs have been processed. The partial results are then passed leftwards from the rightmost end of the vector operation unit: the rightmost VPE sends the data in its register to the VPE on its left; each VPE, on receiving data from its right, accumulates it with the data in its own register and passes the sum on to the left, and so on. Finally, the dot product result is obtained in the leftmost VPE and output as required.
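The two-phase flow above can be sketched as follows: each of mm lanes multiply-accumulates a strided share of the inputs, then the lane partials are folded leftwards. The value of `mm` and the round-robin lane assignment are illustrative assumptions.

```python
def vpe_dot(a, b, mm=4):
    lanes = [0.0] * mm             # one register per VPE
    # Phase 1: each VPE accumulates products of its share of the inputs
    for i in range(len(a)):
        lanes[i % mm] += a[i] * b[i]
    # Phase 2: pass partial sums from the rightmost VPE leftwards,
    # accumulating into each register along the way
    for i in range(mm - 1, 0, -1):
        lanes[i - 1] += lanes[i]
    return lanes[0]                # the result ends up in the leftmost VPE

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 1.0, 1.0, 1.0, 1.0]
d = vpe_dot(x, y)
```

In hardware the leftward fold is pipelined across clock cycles rather than run as a loop, but the dataflow is the same.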
Fig. 12 is an embodiment provided by the present invention, describing how the SIFT feature extraction algorithm is configured on the present apparatus. The SIFT (Scale-Invariant Feature Transform) feature extraction algorithm is one of the key operations of the RGBD SLAM algorithm. The first step builds the Gaussian image pyramid; it involves basic image operations such as image smoothing and, on the present apparatus, can be further decomposed into multiple convolution and down-sampling operations. Next, the difference of Gaussians (DOG) is computed, which can be regarded as matrix subtraction between adjacent planes of the image pyramid. Once the DOG operation is complete, the local extremum search can be completed by calling the macro instruction LOCAL EXTREMA. After the local extrema are found, feature point determination and feature point filtering (KP filter) are performed; this step consists of a large number of vector and scalar operations, such as vector dot products and matrix determinants. Finally, the key point descriptors are computed from histograms of neighbouring points via a number of vector and scalar operations. The histogram computation can be completed by the macro instruction HIST and consists of vector operations such as vector comparison. The rotation of neighbouring pixel regions is implemented with matrix-vector multiplication. Special function operations such as the exponential are mainly implemented by the scalar operation units.
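The pyramid/DOG decomposition can be sketched as repeated separable smoothing followed by plane-wise subtraction. The binomial kernel, the number of scales, and all names here are illustrative assumptions, not SIFT's actual parameters.

```python
import numpy as np

def smooth(img, k):
    # separable blur: convolve rows, then columns, with a small 1-D kernel
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, tmp)

def dog_octave(img, n_scales=3):
    k = np.array([0.25, 0.5, 0.25])  # binomial approximation to a Gaussian
    planes = [img]
    for _ in range(n_scales - 1):
        planes.append(smooth(planes[-1], k))
    # DOG: element-wise subtraction between adjacent pyramid planes
    return [planes[i + 1] - planes[i] for i in range(n_scales - 1)]

octave = dog_octave(np.ones((8, 8)))
```

On the apparatus, `smooth` maps to CONV macro instructions, the subtraction to matrix-matrix operations, and the subsequent extremum search to LOCAL EXTREMA.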
Fig. 13 is an embodiment provided by the present invention, introducing a schematic flow for configuring the G2O graph optimization algorithm on the present apparatus. G2O is a framework for solving nonlinear graph optimization problems, and many typical graph-based SLAM algorithms, such as RGBD SLAM and ORB SLAM, are built on it. Given the pose constraint between two graph nodes and their initial poses, the error matrix and the Jacobian matrix can be obtained through matrix and vector operations, such as matrix multiplication and accumulation. A linear system that optimizes the objective function is then built from the error matrix and the Jacobian matrix; this is again completed by the matrix and vector operation units and involves operations including matrix multiplication and accumulation. This linear system is then solved, for which the Preconditioned Conjugate Gradient (PCG) algorithm can be used (Cholesky decomposition, sparse matrix methods or upper triangular decomposition are also possible). The PCG operation decomposes into blocked matrix-vector multiplications and additions, and can be implemented with the macro instruction PCG. Finally, the pose optimization is completed through matrix and vector multiplications, additions and the like.
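A minimal Jacobi-preconditioned conjugate-gradient solver for A x = b shows how the linear-system step decomposes into exactly the matrix-vector multiplications and vector additions the apparatus provides. This is a textbook PCG sketch under a diagonal-preconditioner assumption, not the patent's PCG macro instruction.

```python
import numpy as np

def pcg(A, b, tol=1e-10, max_iter=100):
    x = np.zeros_like(b)
    r = b - A @ x                  # residual
    M_inv = 1.0 / np.diag(A)       # Jacobi preconditioner: inverse diagonal of A
    z = M_inv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p                 # the dominant matrix-vector product
        alpha = rz / (p @ Ap)
        x += alpha * p             # vector addition on the vector unit
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p  # update the search direction
        rz = rz_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = pcg(A, b)
```

Every step is a matrix-vector multiply, a dot product, or an axpy-style vector update, which is why the operation maps naturally onto the MPU and VPU described above.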
The apparatus and method of the embodiments of the present invention may be applied in the following (including but not limited to) scenarios: data processing, robots, unmanned planes, autopilots, computers, printers, scanners, telephones, tablet computers, intelligent terminals, mobile phones, automobile data recorders, navigators, sensors, cameras, cloud servers, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices and other electronic products; various vehicles such as airplanes, ships, vehicles and the like; various household appliances such as televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas stoves, range hoods and the like; and various medical devices including nuclear magnetic resonance apparatuses, B-ultrasonic apparatuses, electrocardiographs and the like.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (20)

1. A SLAM calculation apparatus for performing calculations of SLAM-related algorithms and applications, the apparatus comprising:
the vector operation unit comprises a plurality of vector operation elements VPE, and is used for respectively inputting a data set to be operated on into each of the plurality of vector operation elements, and each vector operation element executes multiplication and addition operations on the input data set to be operated on to obtain an operation result;
the register unit is used for storing operation results of multiplication operation and addition operation;
an instruction storage module: the instruction set is used for storing instruction sets required by the operation process;
the set of instructions includes: macro operation instruction class for complete operation; the macro-operation instruction class includes: convolution operation instruction, image accumulation operation instruction, image BOX filtering operation instruction, local extremum operation instruction, counter comparison operation instruction and/or pooling operation instruction.
2. The arithmetic device according to claim 1, wherein the data set to be operated on comprises mm data to be operated on, and the vector operation unit is specifically configured to, in terms of that each vector unit performs multiplication and addition operations on the input data to be operated on to obtain an operation result:
step 1: inputting each data to be operated in mm data to be operated into a plurality of VPEs and executing multiplication to obtain a product operation result, and storing the product operation result into a register inside the corresponding VPE in the VPEs;
step 2: accumulating the product operation result and a product operation result obtained by the last operation stored in a register to obtain an accumulated result, and storing the accumulated result into a register inside a corresponding VPE in the plurality of VPEs;
repeatedly executing the step 1 and the step 2 until the mm data to be operated are all calculated, and obtaining mm calculation results;
and sequentially transmitting the calculation result of the VPE at the rightmost end to the left and accumulating it, until the calculation result is transmitted to the VPE at the leftmost end, to obtain an operation result.
3. The computing device of claim 2, wherein the vector computing unit is further configured to, in terms of left-hand passing and accumulating the computation result of the rightmost VPE until the computation result is passed to the leftmost VPE to obtain the computation result:
step 1: transmitting the calculation result of the VPE at the rightmost end to the VPE adjacent to the left side of the VPE at the rightmost end;
step 2: the VPE adjacent to the left side performs addition operation on the calculation result of the VPE at the rightmost end and the calculation result of the VPE adjacent to the left side to obtain a superposition result;
and repeating the step 1 and the step 2 until the VPE at the leftmost end executes the superposition operation, to obtain an operation result.
4. The computing device of claim 1, wherein the device further comprises:
an input storage module: for storing input/output data;
an intermediate result storage module: the temporary operation device is used for storing temporary operation result data; and/or
A final result storage module: for storing the final operation result data.
5. The computing device of claim 4, wherein the set of instructions comprises:
a control operation instruction class used for selecting the control of the operation instruction to be executed;
a data operation instruction class for controlling the transmission of data;
the multidimensional data operation instruction class is used for controlling the operation of multidimensional data; and/or
And the one-dimensional data operation instruction class is used for controlling the operation of the one-dimensional data.
6. The computing device of claim 5, wherein the class of control operation instructions comprises jump instructions and branch instructions, wherein the jump instructions comprise direct jump instructions and indirect jump instructions, and wherein the branch instructions comprise conditional branch instructions.
7. The computing device of claim 5, wherein the class of data operation instructions comprises at least one of:
LD/ST instruction, used for transferring data in DRAM and SRAM;
MOV instructions for transferring data between SRAMs;
RD/WR instruction, for transferring data between the SRAM and the BUFFER.
8. The computing device of claim 1, the class of macro-operation instructions comprising at least one of:
matrix and matrix multiply instructions, matrix and matrix add instructions, matrix and vector multiply instructions, matrix and vector add instructions, matrix and scalar multiply instructions, matrix and scalar add instructions, vector and vector multiply instructions, and vector outer product instructions.
9. The arithmetic device of claim 1, the class of macro-operation instructions comprising at least one of:
vector-to-vector multiply instructions, vector-to-vector add instructions, vector-to-scalar multiply instructions, vector-to-scalar add instructions, scalar square-root instructions, scalar random-number instructions, and move instructions.
10. The computing device of claim 5, wherein the multi-dimensional data operation instruction class is used for requesting the computing unit to perform operations on multi-dimensional data, and the operations on multi-dimensional data include operations between multi-dimensional data and multi-dimensional data, operations between multi-dimensional data and one-dimensional vector data, and operations between multi-dimensional data and one-dimensional scalar data.
11. The arithmetic device of claim 5 wherein the class of one-dimensional data operation instructions is for requiring the arithmetic unit to perform operations on one-dimensional data, the one-dimensional data comprising one-dimensional vectors and one-dimensional scalars.
12. The computing device of claim 5, further comprising an assembler to select, during execution, a type of instruction in the instruction set for use.
13. The method of SLAM operation by the operation device according to any one of claims 1 to 12, wherein the method comprises:
respectively inputting a data set to be operated into each vector operation part in a plurality of vector operation parts included in a vector operation unit by using the vector operation unit, wherein each vector part executes multiplication operation and addition operation on the input data set to be operated to obtain an operation result;
storing operation results of multiplication operation and addition operation by using a register unit;
storing an instruction set required by the operation process by using an instruction storage module;
the set of instructions includes: macro operation instruction class for complete operation; the macro-operation instruction class includes: convolution operation instruction, image accumulation operation instruction, image BOX filtering operation instruction, local extremum operation instruction, counter comparison operation instruction and/or pooling operation instruction.
14. The operation method according to claim 13, wherein the data set to be operated on comprises mm data to be operated on, and each vector unit performs multiplication and addition operations on the input data to be operated on to obtain an operation result, including:
step 1: inputting each data to be operated in mm data to be operated into a plurality of VPEs and executing multiplication operation to obtain a product operation result, and storing the product operation result into a register inside the corresponding VPE in the plurality of VPEs;
and 2, step: accumulating the product operation result and a product operation result obtained by the last operation stored in a register to obtain an accumulated result, and storing the accumulated result into a register inside a corresponding VPE in the plurality of VPEs;
repeating the step 1 and the step 2 until the mm data to be calculated are all calculated, and obtaining mm calculation results;
and sequentially transmitting the calculation result of the VPE at the rightmost end to the left and accumulating it, until the calculation result is transmitted to the VPE at the leftmost end, to obtain an operation result.
15. The method of claim 14, wherein the left-hand side and accumulation of the rightmost VPE calculation result until the leftmost VPE calculation result is passed to obtain the calculation result, comprising:
step 1: transmitting the calculation result of the VPE at the rightmost end to the VPE adjacent to the left side of the VPE at the rightmost end;
step 2: the VPE adjacent to the left side performs addition operation on the calculation result of the VPE at the rightmost end and the calculation result of the VPE adjacent to the left side to obtain a superposition result;
and repeating the step 1 and the step 2 until the VPE at the leftmost end executes the superposition operation, to obtain an operation result.
16. The method of claim 13, further comprising:
storing input/output data using an input storage module;
using an intermediate result storage module to store temporary operation result data; and/or
And storing the final operation result data by using a final result storage module.
17. The method of claim 13, wherein the set of instructions comprises:
the control operation instruction class is used for selecting control of an operation instruction to be executed, the control operation instruction class comprises a jump instruction and a branch instruction, the jump instruction comprises a direct jump instruction and an indirect jump instruction, and the branch instruction comprises a conditional branch instruction;
a data operation instruction class for controlling the transmission of data;
the macro-operation instruction class includes at least one of:
a matrix and matrix multiply instruction, a matrix and matrix add instruction, a matrix and vector multiply instruction, a matrix and vector add instruction, a matrix and scalar multiply instruction, a matrix and scalar add instruction, a vector and vector multiply instruction, and a vector and vector outer product instruction;
alternatively, the macro-operation instruction class includes at least one of:
a vector and vector multiply instruction, a vector and vector add instruction, a vector and scalar multiply instruction, a vector and scalar add instruction, a scalar square-root instruction, a scalar random-number instruction, and a move instruction;
a multidimensional data operation instruction class for controlling operation of multidimensional data; and/or
a one-dimensional data operation instruction class for controlling an operation of one-dimensional data, the one-dimensional data including one-dimensional vectors and one-dimensional scalars.
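As an illustration of the macro-operation instruction class of claim 17, the named operation kinds can be modeled as a small dispatch table; the opcode mnemonics below are invented for this sketch and do not appear in the patent:

```python
# Hypothetical opcode names; the patent names only the operation kinds.
MACRO_OPS = {
    "MMMUL": lambda A, B: [[sum(a * b for a, b in zip(row, col))
                            for col in zip(*B)] for row in A],    # matrix x matrix
    "MMADD": lambda A, B: [[a + b for a, b in zip(ra, rb)]
                           for ra, rb in zip(A, B)],              # matrix + matrix
    "MVMUL": lambda A, v: [sum(a * x for a, x in zip(row, v))
                           for row in A],                         # matrix x vector
    "MSMUL": lambda A, s: [[a * s for a in row] for row in A],    # matrix x scalar
    "VVMUL": lambda u, v: [a * b for a, b in zip(u, v)],          # elementwise vector
    "VVOUT": lambda u, v: [[a * b for b in v] for a in u],        # vector outer product
}

A = [[1, 2], [3, 4]]
print(MACRO_OPS["MVMUL"](A, [1, 1]))   # [3, 7]
```

Each entry stands in for one macro-operation instruction; a real arithmetic unit would implement these in hardware rather than as Python lambdas.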
18. The method of claim 17, wherein the class of data manipulation instructions comprises at least one of:
an LD/ST instruction for transferring data between the DRAM and the SRAM;
a MOV instruction for transferring data between SRAMs;
an RD/WR instruction for transferring data between the SRAM and the BUFFER.
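The three transfer paths of claim 18 can be sketched as a toy memory-hierarchy model; the class and method names are invented for illustration and the patent says nothing about addressing or data width:

```python
class MemoryHierarchy:
    """Minimal model of the DRAM / SRAM / BUFFER levels named in claim 18."""

    def __init__(self):
        self.dram, self.sram, self.buffer = {}, {}, {}

    def ld(self, addr):        # LD: DRAM -> SRAM
        self.sram[addr] = self.dram[addr]

    def st(self, addr):        # ST: SRAM -> DRAM
        self.dram[addr] = self.sram[addr]

    def mov(self, src, dst):   # MOV: SRAM -> SRAM
        self.sram[dst] = self.sram[src]

    def rd(self, addr):        # RD: SRAM -> BUFFER
        self.buffer[addr] = self.sram[addr]

    def wr(self, addr):        # WR: BUFFER -> SRAM
        self.sram[addr] = self.buffer[addr]

m = MemoryHierarchy()
m.dram[0] = 42
m.ld(0)              # bring the word on chip
m.rd(0)              # stage it into the operand buffer
print(m.buffer[0])   # 42
```

The point of the sketch is only the direction of each transfer: LD/ST cross the off-chip boundary, MOV stays within SRAM, and RD/WR feed the operand buffer.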
19. The method of claim 17, wherein the multidimensional data operation instruction class is used to instruct the arithmetic unit to perform operations on multidimensional data, the operations on multidimensional data including operations between multidimensional data and multidimensional data, operations between multidimensional data and one-dimensional vector data, and operations between multidimensional data and one-dimensional scalar data.
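The three operand combinations of claim 19 amount to an elementwise operation with row-wise or full broadcast of the second operand; a minimal sketch (the helper name and its broadcast rules are illustrative, not from the patent):

```python
def md_add(A, other):
    """Add `other` to matrix A; `other` may be a matrix, a 1-D vector
    (broadcast across rows), or a scalar (broadcast to every element)."""
    if isinstance(other, (int, float)):          # multidimensional + scalar
        return [[a + other for a in row] for row in A]
    if other and isinstance(other[0], list):     # multidimensional + multidimensional
        return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, other)]
    # multidimensional + 1-D vector
    return [[a + b for a, b in zip(row, other)] for row in A]

A = [[1, 2], [3, 4]]
print(md_add(A, 10))        # [[11, 12], [13, 14]]
print(md_add(A, [1, 0]))    # [[2, 2], [4, 4]]
```

The same dispatch would apply to the other arithmetic operations; only addition is shown here.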
20. The method of claim 17, further comprising:
selecting, by an assembler, the type of instruction in the instruction set to be used during operation.
CN201811529557.9A 2016-11-03 2016-11-03 SLAM operation device and method Active CN109634905B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811529557.9A CN109634905B (en) 2016-11-03 2016-11-03 SLAM operation device and method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811529557.9A CN109634905B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201610958847.XA CN108021528B (en) 2016-11-03 2016-11-03 SLAM operation device and method

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201610958847.XA Division CN108021528B (en) 2016-11-03 2016-11-03 SLAM operation device and method

Publications (2)

Publication Number Publication Date
CN109634905A CN109634905A (en) 2019-04-16
CN109634905B true CN109634905B (en) 2023-03-10

Family

ID=62075642

Family Applications (12)

Application Number Title Priority Date Filing Date
CN201811521819.7A Active CN109376113B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811653568.8A Active CN109684267B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811529500.9A Active CN109697184B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811529556.4A Active CN109634904B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811529557.9A Active CN109634905B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811653560.1A Active CN109726168B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811545672.5A Pending CN109710558A (en) 2016-11-03 2016-11-03 SLAM arithmetic unit and method
CN201610958847.XA Active CN108021528B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811521820.XA Active CN109376114B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811653558.4A Active CN109656867B (en) 2016-11-03 2016-11-03 SLAM arithmetic device and method
CN201811521818.2A Active CN109376112B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811654180.XA Pending CN109710559A (en) 2016-11-03 2016-11-03 SLAM arithmetic unit and method

Family Applications Before (4)

Application Number Title Priority Date Filing Date
CN201811521819.7A Active CN109376113B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811653568.8A Active CN109684267B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811529500.9A Active CN109697184B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811529556.4A Active CN109634904B (en) 2016-11-03 2016-11-03 SLAM operation device and method

Family Applications After (7)

Application Number Title Priority Date Filing Date
CN201811653560.1A Active CN109726168B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811545672.5A Pending CN109710558A (en) 2016-11-03 2016-11-03 SLAM arithmetic unit and method
CN201610958847.XA Active CN108021528B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811521820.XA Active CN109376114B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811653558.4A Active CN109656867B (en) 2016-11-03 2016-11-03 SLAM arithmetic device and method
CN201811521818.2A Active CN109376112B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811654180.XA Pending CN109710559A (en) 2016-11-03 2016-11-03 SLAM arithmetic unit and method

Country Status (2)

Country Link
CN (12) CN109376113B (en)
WO (1) WO2018082229A1 (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111290789B (en) * 2018-12-06 2022-05-27 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN111290788B (en) * 2018-12-07 2022-05-31 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN111079915B (en) * 2018-10-19 2021-01-26 中科寒武纪科技股份有限公司 Operation method, device and related product
CN110058884B (en) * 2019-03-15 2021-06-01 佛山市顺德区中山大学研究院 Optimization method, system and storage medium for computational storage instruction set operation
CN110991291B (en) * 2019-11-26 2021-09-07 清华大学 Image feature extraction method based on parallel computing
CN113112481B (en) * 2021-04-16 2023-11-17 北京理工雷科电子信息技术有限公司 Hybrid heterogeneous on-chip architecture based on matrix network
CN113177211A (en) * 2021-04-20 2021-07-27 深圳致星科技有限公司 FPGA chip for privacy computation, heterogeneous processing system and computing method
CN113342671B (en) * 2021-06-25 2023-06-02 海光信息技术股份有限公司 Method, device, electronic equipment and medium for verifying operation module
CN113395551A (en) * 2021-07-20 2021-09-14 珠海极海半导体有限公司 Processor, NPU chip and electronic equipment
US20230056246A1 (en) * 2021-08-03 2023-02-23 Micron Technology, Inc. Parallel matrix operations in a reconfigurable compute fabric
CN113792867A (en) * 2021-09-10 2021-12-14 中科寒武纪科技股份有限公司 Arithmetic circuit, chip and board card
CN117093816B (en) * 2023-10-19 2024-01-19 上海登临科技有限公司 Matrix multiplication operation method and device and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101986264A (en) * 2010-11-25 2011-03-16 中国人民解放军国防科学技术大学 Multifunctional floating-point multiply and add calculation device for single instruction multiple data (SIMD) vector microprocessor

Family Cites Families (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS60201472A (en) * 1984-03-26 1985-10-11 Nec Corp Matrix product computing device
US5666300A (en) * 1994-12-22 1997-09-09 Motorola, Inc. Power reduction in a data processing system using pipeline registers and method therefor
JPH09230954A (en) * 1996-02-28 1997-09-05 Olympus Optical Co Ltd Vector standardizing device
US7454451B2 (en) * 2003-04-23 2008-11-18 Micron Technology, Inc. Method for finding local extrema of a set of values for a parallel processing element
US7664810B2 (en) * 2004-05-14 2010-02-16 Via Technologies, Inc. Microprocessor apparatus and method for modular exponentiation
US7814297B2 (en) * 2005-07-26 2010-10-12 Arm Limited Algebraic single instruction multiple data processing
US8051124B2 (en) * 2007-07-19 2011-11-01 Itt Manufacturing Enterprises, Inc. High speed and efficient matrix multiplication hardware module
CN101609715B (en) * 2009-05-11 2012-09-05 中国人民解放军国防科学技术大学 Matrix register file with separated row-column access ports
EP2507701B1 (en) * 2009-11-30 2013-12-04 Martin Raubuch Microprocessor and method for enhanced precision sum-of-products calculation
KR101206213B1 (en) * 2010-04-19 2012-11-28 인하대학교 산학협력단 High speed slam system and method based on graphic processing unit
JP2013540985A (en) * 2010-07-26 2013-11-07 コモンウェルス サイエンティフィック アンドインダストリアル リサーチ オーガナイゼーション Three-dimensional scanning beam system and method
CN102012893B (en) * 2010-11-25 2012-07-18 中国人民解放军国防科学技术大学 Extensible vector operation device
CN102156637A (en) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 Vector crossing multithread processing method and vector crossing multithread microprocessor
CN102353379B (en) * 2011-07-06 2013-02-13 上海海事大学 Environment modeling method applicable to navigation of automatic piloting vehicles
CN106708753B (en) * 2012-03-30 2021-04-02 英特尔公司 Apparatus and method for accelerating operation in processor using shared virtual memory
US9013490B2 (en) * 2012-05-17 2015-04-21 The United States Of America As Represented By The Administrator Of The National Aeronautics Space Administration Hilbert-huang transform data processing real-time system with 2-D capabilities
CN102750127B (en) * 2012-06-12 2015-06-24 清华大学 Coprocessor
CN103208000B (en) * 2012-12-28 2015-10-21 青岛科技大学 Based on the Feature Points Extraction of local extremum fast search
CN103150596B (en) * 2013-02-22 2015-12-23 百度在线网络技术(北京)有限公司 The training system of a kind of reverse transmittance nerve network DNN
CN104252331B (en) * 2013-06-29 2018-03-06 华为技术有限公司 Multiply-accumulator
US9449675B2 (en) * 2013-10-31 2016-09-20 Micron Technology, Inc. Apparatuses and methods for identifying an extremum value stored in an array of memory cells
CN103640018B (en) * 2013-12-13 2014-09-03 江苏久祥汽车电器集团有限公司 SURF (speeded up robust feature) algorithm based localization method
CN103677741A (en) * 2013-12-30 2014-03-26 南京大学 Imaging method based on NCS algorithm and mixing precision floating point coprocessor
CN103955447B (en) * 2014-04-28 2017-04-12 中国人民解放军国防科学技术大学 FFT accelerator based on DSP chip
CN105212922A (en) * 2014-06-11 2016-01-06 吉林大学 The method and system that R wave of electrocardiosignal detects automatically are realized towards FPGA
US9798519B2 (en) * 2014-07-02 2017-10-24 Via Alliance Semiconductor Co., Ltd. Standard format intermediate result
CN104317768B (en) * 2014-10-15 2017-02-15 中国人民解放军国防科学技术大学 Matrix multiplication accelerating method for CPU+DSP (Central Processing Unit + Digital Signal Processor) heterogeneous system
CN104330090B (en) * 2014-10-23 2017-06-06 北京化工大学 Robot distributed sign intelligent semantic map creating method
KR102374160B1 (en) * 2014-11-14 2022-03-14 삼성디스플레이 주식회사 A method and apparatus to reduce display lag using scailing
CN104391820B (en) * 2014-11-25 2017-06-23 清华大学 General floating-point matrix processor hardware structure based on FPGA
CN104574508A (en) * 2015-01-14 2015-04-29 山东大学 Multi-resolution model simplifying method oriented to virtual reality technology
US10285760B2 (en) * 2015-02-04 2019-05-14 Queen's University At Kingston Methods and apparatus for improved electromagnetic tracking and localization
CN104851094A (en) * 2015-05-14 2015-08-19 西安电子科技大学 Improved method of RGB-D-based SLAM algorithm
CN104915322B (en) * 2015-06-09 2018-05-01 中国人民解放军国防科学技术大学 A kind of hardware-accelerated method of convolutional neural networks
CN104899182B (en) * 2015-06-09 2017-10-31 中国人民解放军国防科学技术大学 A kind of Matrix Multiplication accelerated method for supporting variable partitioned blocks
CN105528082B (en) * 2016-01-08 2018-11-06 北京暴风魔镜科技有限公司 Three dimensions and gesture identification tracking exchange method, device and system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101986264A (en) * 2010-11-25 2011-03-16 中国人民解放军国防科学技术大学 Multifunctional floating-point multiply and add calculation device for single instruction multiple data (SIMD) vector microprocessor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sparse Matrix Vector Multiplication on a Field Programmable Gate Array;Marcel van der Veen;《https://essay.utwente.nl/781/1/scriptie_van_der_Veen.pdf》;20070930;全文 *

Also Published As

Publication number Publication date
CN109697184B (en) 2021-04-09
CN109710559A (en) 2019-05-03
CN109376114B (en) 2022-03-15
WO2018082229A1 (en) 2018-05-11
CN109697184A (en) 2019-04-30
CN109684267B (en) 2021-08-06
CN108021528A (en) 2018-05-11
CN109710558A (en) 2019-05-03
CN109726168A (en) 2019-05-07
CN109634904B (en) 2023-03-07
CN109376113A (en) 2019-02-22
CN109376112B (en) 2022-03-15
CN109656867A (en) 2019-04-19
CN109726168B (en) 2021-09-21
CN109376113B (en) 2021-12-14
CN109634905A (en) 2019-04-16
CN109376112A (en) 2019-02-22
CN109684267A (en) 2019-04-26
CN109656867B (en) 2023-05-16
CN108021528B (en) 2020-03-13
CN109376114A (en) 2019-02-22
CN109634904A (en) 2019-04-16

Similar Documents

Publication Publication Date Title
CN109634905B (en) SLAM operation device and method
KR102258414B1 (en) Processing apparatus and processing method
CN109240746B (en) Apparatus and method for performing matrix multiplication operation
CN108733348B (en) Fused vector multiplier and method for performing operation using the same
CN106970896B (en) Vector processor-oriented vectorization implementation method for two-dimensional matrix convolution
Banz et al. Real-time semi-global matching disparity estimation on the GPU
CN109389213B (en) Storage device and method, data processing device and method, and electronic device
CN111860772B (en) Device and method for executing artificial neural network mapping operation
CN111367567B (en) Neural network computing device and method
CN113867800A (en) Computing device, integrated circuit chip, board card, electronic equipment and computing method
CN114638352B (en) Processor architecture, processor and electronic equipment
CN117933314A (en) Processing device, processing method, chip and electronic device
CN117933327A (en) Processing device, processing method, chip and electronic device
CN115330683A (en) Target rapid detection system based on FPGA
El-Sayed et al. Real-time motion tracking using the CELL BE

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences

Applicant after: Zhongke Cambrian Technology Co.,Ltd.

Address before: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences

Applicant before: Beijing Zhongke Cambrian Technology Co.,Ltd.

GR01 Patent grant