CN109634905B - SLAM operation device and method - Google Patents

Info

Publication number: CN109634905B
Application number: CN201811529557.9A
Authority: CN (China)
Prior art keywords: data, instruction, vector, result, instructions
Legal status: Active (granted)
Other versions: CN109634905A
Original language: Chinese (zh)
Inventors: Chen Yunji (陈云霁), Du Zidong (杜子东), Zhang Lei (张磊), Chen Tianshi (陈天石)
Current and original assignee: Cambricon Technologies Corp Ltd

Classifications

    • G06F9/3001 Arithmetic instructions
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/161 Computing infrastructure, e.g. computer clusters, blade chassis or hardware partitioning
    • G06F9/30021 Compare instructions, e.g. Greater-Than, Equal-To, MINMAX
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The SLAM hardware accelerator device comprises a vector operation unit and a register unit. The vector operation unit contains multiple vector processing elements (VPEs); the data set to be operated on is distributed among the VPEs, and each VPE performs multiplication and addition on its input data to produce an operation result. The register unit stores the results of these multiplication and addition operations. The device and method can effectively accelerate the algorithm according to different requirements, support operations with differing requirements, and offer strong flexibility, high configurability, high operation speed, and low power consumption.

Description

SLAM operation device and method
Technical Field
The invention relates to a SLAM (Simultaneous Localization and Mapping) operation device and method for accelerating SLAM algorithm operations according to different requirements.
Background
Autonomous navigation in an unknown environment is a fundamental capability of mobile robots (e.g., unmanned ground and aerial vehicles). In a SLAM task, localization determines the robot's position within the map, while mapping builds a map of the environment as the robot perceives it. When no initial map of the environment exists, the robot must construct the map in real time and use it to localize itself, which is exactly the task the SLAM algorithm performs. However, implementing the SLAM algorithm accurately under a mobile robot's limited computing power and strict power budget is one of the biggest practical challenges. First, the real-time requirement means the SLAM algorithm needs extremely high operation speed to complete large amounts of per-frame and inter-frame computation in a short time; second, the constraints of a mobile platform impose severe power-consumption limits; and finally, SLAM involves many and varied operation types, so the designed accelerator must support multiple kinds of SLAM algorithms.
In the prior art, one way to implement the SLAM algorithm is to run it directly on a general-purpose processor (CPU). One disadvantage of this approach is that the operation performance of a single general-purpose processor is too low to meet the real-time requirements of common SLAM operations. When multiple general-purpose processors execute in parallel, the communication between them becomes a performance bottleneck.
Another way to implement the SLAM algorithm is to perform the operations on a graphics processor (GPU), which supports such algorithms by executing general-purpose SIMD instructions with a general-purpose register file and general-purpose stream processing units. Although the GPU is a device dedicated to graphics and image operations, the complexity of SLAM computation means it cannot support the algorithm's subsequent stages well, i.e., it cannot efficiently accelerate the SLAM algorithm end to end. Moreover, the on-chip cache of a GPU is too small to satisfy the operation requirements of many SLAM algorithms. In addition, in practical deployments it is difficult to fit a CPU- or GPU-like architecture onto a robot, so there is currently no dedicated SLAM hardware accelerator architecture that is both highly practical and highly flexible. The device designed here is a dedicated SLAM hardware accelerator that meets these requirements, together with a corresponding method; it can be realized as hardware such as a dedicated chip or an embedded chip, and applied to robots, computers, mobile phones, and the like.
Disclosure of Invention
Technical problem to be solved
The invention aims to provide a device and a method of a SLAM hardware accelerator.
(II) technical scheme
According to an aspect of the present invention, there is provided an apparatus of a SLAM hardware accelerator, including:
a storage part, for storing input data, temporary operation result data, final operation result data, an instruction set, and/or algorithm parameter data required by the operation process;
an operation part, connected with the storage part, for completing the computation of SLAM-related algorithms and applications;
and a control part, connected with the storage part and the operation part, for controlling and coordinating them.
Preferably, the storage section includes:
an input storage module: for storing input and output data;
an intermediate result storage module: used for storing the intermediate operation result;
a final result storage module: used for storing the final operation result;
an instruction storage module: the instruction set is used for storing an instruction set required by the operation process; and/or
A buffer storage module: for buffer storage of data.
Preferably, the operation section includes:
an accelerated operation device, designed for SLAM-related algorithms and applications, for accelerating and processing SLAM operations; and
other computing devices, for performing those operations in SLAM-related algorithms and applications that cannot be performed by the accelerated operation device.
Preferably, the acceleration operation means includes a vector operation unit and a matrix operation unit.
Preferably, the other computing devices are used to perform the operations in the algorithms and applications that are not performed by the accelerated operation device.
Preferably, the operation section is implemented by a hardware circuit.
Preferably, the control part is connected with each module of the storage part and the operation part. The control part comprises a first-in first-out (FIFO) queue and a control processor; the FIFO queue stores control signals, and the control processor fetches the control signal to be executed, analyzes the control logic, and then controls and coordinates the storage part and the operation part.
Preferably, the instruction set comprises:
a control operation instruction class, for selecting and controlling which operation instructions are executed;
a data operation instruction class, for controlling data transmission;
a macro operation instruction class, for relatively complete, coarse-grained operations;
a multidimensional data operation instruction class, for controlling operations on multidimensional data; and/or
a one-dimensional data operation instruction class, for controlling operations on one-dimensional data.
Preferably, the control operation instruction class includes a jump instruction and a branch instruction, the jump instruction includes a direct jump instruction and an indirect jump instruction, and the branch instruction includes a conditional branch instruction.
Preferably, the macro operation instruction class includes a convolution operation instruction or a pooling operation instruction.
Preferably, the multidimensional data operation instruction class is used for requiring the operation unit to execute operations on multidimensional data, and the operations on multidimensional data include operations between multidimensional data and multidimensional data, operations between multidimensional data and one-dimensional vector data, and operations between multidimensional data and one-dimensional scalar data.
Preferably, the one-dimensional data operation instruction class is used for requiring the operation unit to execute the operation of one-dimensional data, and the one-dimensional data includes a one-dimensional vector and a one-dimensional scalar.
Preferably, the operation of the one-dimensional vector data includes an operation between a one-dimensional vector and a one-dimensional vector, and an operation between a one-dimensional vector and a scalar.
Preferably, the operation of the one-dimensional scalar data includes an operation between scalars.
Preferably, the system further comprises an assembler for selecting the instruction type in the instruction set to use during the operation.
According to another aspect of the present invention, there is also provided a method of performing SLAM operations with any one of the above apparatuses, in which the control part controls the transport of data, the operations, and program execution through the instruction set held in the storage part. The method comprises the following steps:
Step 1: transport input data from the storage part to the operation part;
Step 2: execute operations in the operation part according to the instruction set required by the operation process;
Step 3: transmit and store the operation result data;
Step 4: repeat the above steps until the operation is finished.
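The four steps can be modeled as a simple fetch-execute-store loop. The sketch below is a hypothetical software model only; the Storage class, opcode names, and addresses are illustrative and are not taken from the patent:

```python
class Storage:
    """Toy stand-in for the storage part (illustrative, not the patent's design)."""
    def __init__(self, data):
        self.mem = dict(data)

    def load(self, addrs):
        return [self.mem[a] for a in addrs]

    def store(self, addr, value):
        self.mem[addr] = value


def execute(opcode, operands):
    """Toy operation part supporting two example opcodes."""
    if opcode == "MUL":
        return operands[0] * operands[1]
    if opcode == "ADD":
        return operands[0] + operands[1]
    raise ValueError(f"unsupported opcode: {opcode}")


def run(storage, program):
    for opcode, in_addrs, out_addr in program:
        operands = storage.load(in_addrs)   # step 1: transport input data
        result = execute(opcode, operands)  # step 2: execute the operation
        storage.store(out_addr, result)     # step 3: transmit and store the result
    # step 4: the loop repeats until the program is finished


storage = Storage({"a": 3.0, "b": 4.0})
run(storage, [("MUL", ["a", "b"], "t"), ("ADD", ["t", "a"], "out")])
print(storage.mem["out"])  # 15.0
```

In hardware the control part would drive this loop via control signals rather than a Python interpreter, but the data flow between storage and operation parts is the same.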
(III) advantageous effects
The device and method of the SLAM hardware accelerator can effectively accelerate the SLAM algorithm according to different requirements, are applicable to various SLAM algorithms and various input data types, support operations with differing requirements, and have the advantages of strong flexibility, high configurability, high operation speed, and low power consumption.
Compared with the prior art, the device and the method have the following effects:
1) The operation part can operate on data of different input types according to different requirements;
2) Through the buffer storage module, the operation part can also share data to a certain degree, thereby reducing the reuse distance of the data;
3) The instruction design supports a variety of basic operation types, giving the device high configurability;
4) The matrix and vector operation units, working together with the scalar operation unit, support many operation types and markedly increase operation speed;
5) The design of the operation and storage parts and the arrangement of the instructions greatly reduce power consumption during execution.
Drawings
Fig. 1 is a schematic structural diagram of an apparatus of a SLAM hardware accelerator according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a SLAM hardware accelerator according to another embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an embodiment of a scalar operation unit of the SLAM hardware accelerator according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of an embodiment of a vector operation unit of a SLAM hardware accelerator according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of an embodiment of a matrix operation unit of a SLAM hardware accelerator according to an embodiment of the present invention.
Fig. 6 is a schematic diagram illustrating an embodiment of a SLAM hardware accelerator for performing a three-dimensional coordinate L2 norm operation according to an embodiment of the present invention.
Fig. 7 is a diagram illustrating an embodiment of a SLAM hardware accelerator performing a 16-dimensional square matrix multiplication operation according to an embodiment of the present invention.
Fig. 8 is a schematic diagram of an implementation of the algorithm of SLAM based on the extended kalman filter method (EKF) in the present apparatus according to an embodiment of the present invention.
FIG. 9 is a diagram illustrating instruction types provided by an embodiment of the invention.
FIG. 10 is a block diagram illustrating an application of a macro instruction according to an embodiment of the present invention.
FIG. 11 is a block diagram of one embodiment of a one-dimensional data operation instruction, according to the present invention.
Fig. 12 is a schematic diagram of a configuration implementation of a SIFT feature extraction algorithm on the present apparatus according to an embodiment of the present invention.
Fig. 13 is a schematic diagram of a configuration implementation of a G2O framework-based graph optimization algorithm on the present apparatus according to an embodiment of the present invention.
FIG. 14 is a flowchart illustrating execution of a convolution operation instruction according to an embodiment of the present invention.
FIG. 15 is a flowchart illustrating execution of an image accumulate instruction according to an embodiment of the present invention.
FIG. 16 is a flowchart illustrating an exemplary embodiment of a filter operation instruction.
FIG. 17 is a flowchart illustrating an exemplary execution of a local extremum instruction according to an embodiment of the present invention.
FIG. 18 is a flowchart illustrating an exemplary implementation of a two-dimensional convolution operation according to an embodiment of the present invention.
Fig. 19 is a flowchart illustrating an implementation of a one-dimensional vector dot product operation according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
Fig. 1 is a schematic structural diagram of an apparatus of a SLAM hardware accelerator according to an embodiment of the present invention. As shown in Fig. 1, the accelerator is divided into three parts: a control part, an operation part, and a storage part. The control part sends control signals to the operation part and the storage part to control their operation and to coordinate the data transmission between them. The storage part stores the relevant data, including input data, intermediate results, final results, instructions, and cached data; the specific stored contents, the storage organization, and the access patterns can be planned differently according to different requirements. The operation part comprises several operators for data operations, made up of one or more combinations of a scalar operation unit, a vector operation unit, and a matrix operation unit; these operators can process data of different input types according to different requirements. Through the buffer storage module, the operation part can also share data to a certain degree, thereby reducing the reuse distance of the data.
Fig. 2 is a schematic structural diagram of an apparatus of a SLAM hardware accelerator according to another embodiment of the present invention. As shown in Fig. 2, this embodiment is intended to accelerate the operation of an image-based SLAM algorithm, reduce data exchange, and save memory space. The control part is connected with each module of the storage part and the operation part; it consists of a first-in first-out (FIFO) queue, which stores control signals, and a control processor, which fetches the control signal to be executed, analyzes the control logic, and then controls and coordinates the storage part and the operation part. The storage part is divided into four modules: an input storage module, an output storage module, an intermediate result storage module, and a cache module. The operation part mainly accelerates the image-processing operations, point-cloud map construction, image matching, and graph optimization, so the operation unit is likewise divided into three modules (a scalar operation module, a vector operation module, and a matrix operation module), which can execute in a pipelined or parallel fashion.
FIG. 3 is a schematic diagram of a scalar operation unit that can be used in the apparatus, according to an embodiment of the present invention, where SPE denotes a single scalar processing element. The scalar operation unit mainly handles the parts of the SLAM algorithm that cannot be accelerated, as well as some complex operations such as trigonometric functions; it can also resolve memory-access consistency, and is one of the important components of the accelerator. The storage modules directly related to the scalar operation unit are the intermediate result storage module and the buffer storage module. The operands required by scalar operations may reside in either module, and the result of a scalar operation can be stored in the intermediate result storage module or output to the buffer storage module, depending on actual requirements.
FIG. 4 is a schematic diagram of a vector operation unit applicable to the apparatus. The entire vector operation unit is composed of multiple basic operation units, where VPE denotes a basic operation unit of vector operation. The vector operation unit handles the vector-operation parts of the SLAM algorithm and all operations with vector characteristics, such as the dot product of vectors, and it enables efficient data-level and task-level parallelism. Through configuration, each basic unit of the vector operation unit can execute the same operation in parallel, or different units can execute different operations. The storage modules directly related to the vector operation unit are the intermediate result storage module and the buffer storage module. The operands required by vector operations may reside in either module, and the result of a vector operation can be stored in the intermediate result storage module or output to the buffer storage module, depending on actual requirements.
Fig. 5 illustrates, in another embodiment of the present invention, a matrix operation unit that can be used in the apparatus and can accelerate all matrix operation types and matrix-like operation types, where MPE denotes the basic operation unit of the matrix operation unit. The matrix operation unit is composed of multiple basic operation units, arranged in the illustrated case as an operation-unit array. The matrix operation unit supports several external data-exchange modes, such as a 2D exchange mode or a 1D exchange mode. It also supports data-access modes between its internal units, which greatly reduces the reuse distance of local data and achieves efficient acceleration. The storage modules directly related to the matrix operation unit are the intermediate result storage module and the buffer storage module. The operands required by matrix operations may reside in either module, and the result of a matrix operation can be stored in the intermediate result storage module or output to the buffer storage module, depending on actual needs.
Fig. 6 is a flow chart illustrating a three-dimensional-coordinate L2 norm operation performed by the apparatus, according to an embodiment of the present invention. Assume the three components of a three-dimensional coordinate are stored in the intermediate result storage module. First, a configuration instruction fetches the operands from the intermediate result storage module and feeds them to three basic operation units (VPEs) of the vector operation unit. Each of the three VPEs performs a multiplication whose two operands are a fetched coordinate component and that same component, i.e., it squares the component. The multiplication results are passed through the buffer storage module into the scalar operation unit, which sums the three products and then performs the square-root operation. The final result is output to the intermediate result storage module or the buffer storage module as required.
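The dataflow of Fig. 6 can be modeled in a few lines. This is a software sketch only: the three squarings stand in for the three parallel VPEs, and the sum and square root stand in for the scalar operation unit:

```python
import math

def l2_norm_3d(coords):
    """Model of Fig. 6: each of three VPEs multiplies one coordinate
    component by itself; the scalar unit then sums the three products
    and performs the square-root operation."""
    assert len(coords) == 3
    # Vector operation unit: three VPEs, each computing x * x in parallel.
    squares = [x * x for x in coords]
    # Scalar operation unit: summation followed by the square root.
    return math.sqrt(sum(squares))

print(l2_norm_3d([3.0, 4.0, 12.0]))  # 13.0
```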
Fig. 7 is a flow chart of one possible configuration of the apparatus for performing an N-dimensional square-matrix multiplication, according to an embodiment of the present invention. Take N = 16 as an example, and assume matrix C is obtained by multiplying matrix A by matrix B. The matrix operation unit in the figure contains 256 basic operation units, each responsible for computing one element of the final result, and the matrix data required for the operation is stored in the intermediate result storage module. When the operation starts, the operands of A are fetched from the intermediate result storage module into the buffer storage module, which feeds the data row by row into each basic operation unit (MPE) of the matrix operation unit; the operands of B are likewise fetched into the buffer storage module and fed step by step, column by column, into each PE under instruction scheduling. Inside each PE, the incoming element of A and element of B are multiplied; the product is not sent out but is accumulated with the previous value stored in the PE's register. Thus, once all the elements of B have been streamed into the PEs, the value held by each PE is the element of the resulting matrix C at its position. Finally, the data of C is stored in the intermediate result storage module or kept in the buffer storage module, as required.
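Functionally, the PE-array scheme of Fig. 7 computes an ordinary matrix product; each PE's register accumulates one output element. A minimal software model (the streaming order and loop structure here are illustrative, not the patent's exact schedule):

```python
def matmul_pe_array(A, B):
    """Model of Fig. 7: PE (i, j) accumulates products into its local
    register (initialized to 0); once all operands of B have streamed
    through, the register holds element C[i][j]."""
    n = len(A)
    # Each PE's accumulator register, initialized to zero.
    reg = [[0.0] * n for _ in range(n)]
    for k in range(n):  # operands streamed step by step
        for i in range(n):
            for j in range(n):
                # PE (i, j) multiplies the incoming A and B operands and
                # accumulates with the value previously stored in its register.
                reg[i][j] += A[i][k] * B[k][j]
    return reg

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_pe_array(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

The same model applies to the N = 16 case of the text, where the 256 PEs each hold one of the 16 x 16 output elements.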
Fig. 8, an embodiment of the present invention, shows the configuration and operation of the apparatus when running a SLAM algorithm based on the extended Kalman filter (EKF). The EKF algorithm can be divided into three major steps: Compute True Data, EKF Predict, and EKF Update. In Compute True Data, the true coordinates are derived from the motion model. In EKF Predict, the new robot pose is predicted from the pose updated with the previous prediction and from the control input. In EKF Update, the association information of the surrounding reference points is computed, and the predicted pose and covariance matrix are updated. The computation in Compute True Data is mainly low-dimensional vector processing, such as the Euclidean-distance operation on three-dimensional coordinates, so most of it can run on the vector operation unit; it also involves typical scalar computations, such as trigonometric functions of angles, so a small amount of computation runs on the scalar operation unit. The EKF Predict step involves repeated large-scale matrix operations such as matrix multiplication, which are best accelerated on the matrix operation unit, while its smaller vector operations still call on the vector operation unit. The EKF Update step performs many kinds of operations in alternation; typical examples are matrix SVD (singular value decomposition) and Cholesky decomposition, which are composed of fine-grained operations such as matrix multiplication, vector addition and subtraction, vector norms, and trigonometric functions, and therefore use the matrix, vector, and scalar operation units together.
As for the storage modules: because the input of the EKF-based SLAM algorithm consists of point coordinates such as waypoints and landmarks, and the amount of data is small, the data only needs to be loaded from the input storage module once at the start. During the intermediate computation, the amount of stored data generally does not exceed the capacity of the intermediate result storage module, so frequent data exchange with the input storage module is usually unnecessary, reducing both energy consumption and operation time. Finally, the SLAM algorithm writes the computed result to the output storage module, completing the hardware configuration and execution of the whole algorithm.
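To make concrete the kind of work the EKF Predict step maps onto the units, here is a minimal, generic textbook EKF predict for a 2-D pose under a velocity motion model. This is not the patent's implementation; the motion model, state layout, and all symbols are illustrative. The matrix products are matrix-unit work, while the sin/cos evaluations are scalar-unit work:

```python
import math

def matmul3(X, Y):
    """3x3 matrix product (matrix-operation-unit work)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def ekf_predict(pose, P, v, w, dt, Q):
    """Minimal EKF predict for pose (x, y, theta) with velocity v and
    angular velocity w. Illustrative only."""
    x, y, th = pose
    # Predicted pose from the motion model (scalar sin/cos, vector adds).
    pose_new = (x + v * math.cos(th) * dt,
                y + v * math.sin(th) * dt,
                th + w * dt)
    # Jacobian of the motion model with respect to the state.
    F = [[1.0, 0.0, -v * math.sin(th) * dt],
         [0.0, 1.0,  v * math.cos(th) * dt],
         [0.0, 0.0,  1.0]]
    Ft = [[F[j][i] for j in range(3)] for i in range(3)]
    # Covariance propagation: P' = F P F^T + Q.
    FPFt = matmul3(matmul3(F, P), Ft)
    P_new = [[FPFt[i][j] + Q[i][j] for j in range(3)] for i in range(3)]
    return pose_new, P_new

pose, P = ekf_predict((0.0, 0.0, 0.0),
                      [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
                      v=1.0, w=0.0, dt=1.0,
                      Q=[[0.0] * 3 for _ in range(3)])
print(pose)  # (1.0, 0.0, 0.0)
```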
FIG. 9 is a schematic diagram of the types of instructions provided by the present invention.
The instruction set comprises several types, including a control operation instruction class, a data operation instruction class, a macro operation instruction class, a multidimensional data operation instruction class, and a one-dimensional data operation instruction class. Each instruction class can be subdivided into several different instructions, distinguished by the leading instruction code; as shown in Fig. 9, several representative instructions and their codes are listed for each instruction class.
The control operation instruction class is mainly used for controlling program flow. The instruction coded JUMP is a jump instruction that performs the jump function; depending on the operation code that follows, it is divided into a direct jump instruction and an indirect jump instruction. The instruction coded CB is a conditional branch instruction that performs the conditional-jump function.
The data operation instruction class is mainly used for controlling data transmission. The instructions coded LD/ST are used for data transmission between DRAM (Dynamic Random Access Memory) and SRAM (Static Random Access Memory): LD reads data from DRAM and loads it into SRAM, while ST stores data from SRAM back into DRAM. The instruction coded MOV transfers data between SRAM locations. The instructions coded RD/WR transfer data between SRAM and BUFFER: RD reads data from SRAM into BUFFER, and WR stores data from BUFFER back into SRAM. The macro operation instruction class serves as coarse-grained data operation instructions for relatively complete operations.
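The LD/ST, MOV, and RD/WR codes form a small memory-hierarchy protocol (DRAM to/from SRAM, SRAM to/from BUFFER, moves within SRAM). A toy dispatcher makes the directions explicit; the dictionary memory model and address names are illustrative, not the hardware's addressing scheme:

```python
# Toy memory hierarchy: DRAM <-LD/ST-> SRAM <-RD/WR-> BUFFER; MOV within SRAM.
mem = {"DRAM": {}, "SRAM": {}, "BUFFER": {}}

def transfer(op, src, dst):
    if op == "LD":     # read from DRAM, load into SRAM
        mem["SRAM"][dst] = mem["DRAM"][src]
    elif op == "ST":   # store SRAM data back into DRAM
        mem["DRAM"][dst] = mem["SRAM"][src]
    elif op == "MOV":  # transfer between SRAM locations
        mem["SRAM"][dst] = mem["SRAM"][src]
    elif op == "RD":   # read from SRAM into BUFFER
        mem["BUFFER"][dst] = mem["SRAM"][src]
    elif op == "WR":   # write BUFFER data back to SRAM
        mem["SRAM"][dst] = mem["BUFFER"][src]
    else:
        raise ValueError(f"unknown transfer op: {op}")

mem["DRAM"]["a"] = 42
transfer("LD", "a", "x")    # DRAM -> SRAM
transfer("RD", "x", "b0")   # SRAM -> BUFFER
transfer("WR", "b0", "y")   # BUFFER -> SRAM
transfer("ST", "y", "out")  # SRAM -> DRAM
print(mem["DRAM"]["out"])   # 42
```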
The instruction is encoded as CONV to represent a convolution operation instruction, and is used for implementing convolution and convolution-like operations, that is, input data are multiplied and summed with corresponding weights respectively, and the instruction takes local reusability of data into consideration, and a specific implementation process is as shown in fig. 14:
S1, image data is read from the image-data start address as required by the instruction, and weight data is read from the weight-data start address.
S2, the image data are transmitted to the corresponding multidimensional operation unit according to the operation requirements, and the weight data are broadcast to every operation element (PE) in the multidimensional operation unit.
S3, each PE multiplies the incoming image data by the corresponding weight data, adds the product to the data in its register, and stores the sum back into the register (the register must be initialized to 0).
S4, the image data inside the multidimensional operation unit are transmitted internally according to the transmission rule specified by the multidimensional operation unit, and image data not yet in the multidimensional operation unit are read from the BUFFER and transmitted to the specified operation positions. This exploits the reusability of data in convolution operations and greatly reduces the number of data transfers.
S5, steps S3-S4 are repeated until each PE has finished its computation, and the results are output to the destination address specified by the instruction for storage.
S6, the data are read again and the operations repeated until all pixels of the output image have been calculated and stored, and the instruction ends.
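The multiply-accumulate flow of steps S1-S6 can be sketched in software as a plain valid-mode 2D convolution, where each output pixel plays the role of one PE with a register that starts at 0. This is a minimal illustrative sketch, not the patent's hardware implementation; all names are assumptions.

```python
import numpy as np

def conv2d_valid(image, kernel):
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for oy in range(oh):
        for ox in range(ow):
            acc = 0.0  # S3: the PE register, initialized to 0
            for ky in range(kh):
                for kx in range(kw):
                    # S3-S5: multiply input data by the weight and accumulate
                    acc += image[oy + ky, ox + kx] * kernel[ky, kx]
            out[oy, ox] = acc  # S5: store the finished pixel to the destination
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
ker = np.ones((2, 2))
result = conv2d_valid(img, ker)
```

The hardware version differs mainly in that the inner products for many output pixels run in parallel across PEs, with image data shifted between PEs instead of re-read.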
An instruction encoded as POOL is a pooling operation instruction, used to implement pooling and pooling-like operations, i.e. averaging a specified number of data, taking their maximum/minimum value, or performing down-sampling; the specific implementation flow is similar to that of the convolution operation instruction.
An instruction encoded as IMGACC is an image accumulation instruction, used to perform accumulation or similar operations when processing an image. The specific implementation process is shown in fig. 15:
S1, the image data is read from the image-data start address as required by the instruction, and all operation elements (PE) in the multidimensional operation unit are initialized to 0.
S2, in each clock cycle the existing data in the multidimensional operation unit are shifted up by one row, a new row of data is shifted in, the newly shifted-in row is added column-wise to the original last row, and the sum becomes the new last row. This is repeated until the multidimensional operation unit is full.
S3, the data in the multidimensional operation unit are then transmitted to the right and accumulated in each clock cycle: in the first clock cycle, the first column is transmitted to the right, and the second column adds the incoming data and stores the sum; in the second clock cycle, the second column is transmitted to the right, and the third column adds the incoming data and stores the sum, and so on. The result is the required overall accumulation of the image.
S4, all data in the multidimensional operation unit are stored to the destination address specified by the instruction, and the data in the bottom row and the rightmost column are cached.
S5, the multidimensional operation data are re-initialized to 0 and the next round of operations proceeds, until the whole image has been calculated. Note that when the width or height of the image exceeds what the multidimensional operation unit can process at once, the cached data must be accumulated in every round after the first to guarantee a correct operation result.
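Interpreting the two accumulation passes above as prefix sums, the net effect of S2 (accumulate down the columns as rows stream in) followed by S3 (accumulate rightwards across the columns) can be sketched for a single tile as follows. The function name and use of `cumsum` are illustrative assumptions.

```python
import numpy as np

def image_accumulate(tile):
    # S2: running accumulation down each column as rows are shifted in
    col_acc = np.cumsum(tile, axis=0)
    # S3: running accumulation rightwards across the columns
    return np.cumsum(col_acc, axis=1)

tile = np.ones((3, 3))
acc = image_accumulate(tile)
```

For a tile of ones, `acc[i, j]` equals `(i + 1) * (j + 1)`, i.e. the sum of the rectangle from the origin to `(i, j)` — which is why S4 caches the bottom row and rightmost column for stitching larger images.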
An instruction encoded as BOX is a filtering instruction, used to perform box filtering on an image. To obtain the sum over a local window of the image, the algorithm first builds an array A with the same width and height as the original image and assigns each element A[i] the sum of all pixels in the rectangle spanned by that point and the image origin; once A is available, each local sum can be obtained with additions and subtractions of just 4 elements of A. The macro instruction is therefore divided into two phases, as shown in FIG. 16:
S1, the required data are read from the start address according to the instruction and transmitted into the multidimensional operation unit, where they are accumulated in sequence; the accumulated data are stored at the specified destination address 1.
S2, the data required by the instruction are read from destination address 1, the additions and subtractions are performed to obtain the filtering result, and the result is stored at destination address 2, which holds the required final result.
In the data accumulation phase, as with the convolution operation instruction, the data have local reusability, so the instruction supports data transmission inside the multidimensional operation unit.
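The two phases can be sketched with a summed-area table: phase 1 builds the array A of rectangle sums, and phase 2 reads four corners of A per window. This is an illustrative sketch; the padding trick and the window size `k` are assumptions, not from the patent.

```python
import numpy as np

def box_filter_sums(image, k):
    h, w = image.shape
    # Phase 1: A[i, j] = sum of all pixels in the rectangle from (0, 0) to (i, j)
    A = np.cumsum(np.cumsum(image, axis=0), axis=1)
    # Pad with a zero row and column so corner lookups need no boundary checks
    A = np.pad(A, ((1, 0), (1, 0)))
    out = np.empty((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Phase 2: each local k x k sum costs only 4 adds/subtracts on A
            out[i, j] = (A[i + k, j + k] - A[i, j + k]
                         - A[i + k, j] + A[i, j])
    return out

img = np.ones((4, 4))
sums = box_filter_sums(img, 2)
```

Every 2 x 2 window of an all-ones image sums to 4, regardless of position, which is the constant-time-per-window property the macro instruction exploits.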
An instruction encoded as LOCAL EXTREMA is a local extremum instruction, used to decide, when processing an image, whether the data at a specified position is an extremum within its group of data. The macro instruction is divided into two steps, as shown in fig. 17:
S1, the register in each PE of the multidimensional operation unit is initialized to a sufficiently small/large value; data are read from the data start address and transmitted into the multidimensional operation unit; each PE then compares the incoming data with the data stored in its register and stores the larger/smaller value back into the register, until all the specified data have been compared. Each PE thus holds the maximum/minimum of its portion of the data stream.
S2, the data at the designated position are read according to the instruction and transmitted into the multidimensional data operation unit again; each PE compares whether the incoming data equals the maximum/minimum stored in its register, and outputs 1 if they are equal and 0 otherwise.
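A single PE's two-step behavior can be sketched as follows: first reduce the stream to a running maximum (S1), then test whether the value at a designated position equals that maximum (S2). The function and parameter names are illustrative.

```python
def is_local_maximum(stream, position):
    reg = float('-inf')            # S1: register initialized "small enough"
    for value in stream:
        reg = max(reg, value)      # keep the larger value in the register
    # S2: re-read the designated datum and compare with the stored maximum
    return 1 if stream[position] == reg else 0

flags = [is_local_maximum([3, 9, 4, 9], p) for p in range(4)]
```

Note that, as in the patent's description, equality with the extremum is what is tested, so several positions can report 1 when the extremum value repeats.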
An instruction encoded as COUNTCMP is a comparison operation instruction, used to complete a comparison with a counter: the data to be compared and a threshold are read and transmitted into the multidimensional operation unit, each PE compares the incoming data stream with the threshold in sequence while counting the traversed data, and the number of data larger (or smaller) than the threshold is output.
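Per PE, the COUNTCMP behavior reduces to streaming data past a threshold while incrementing a counter. A minimal sketch, with illustrative names:

```python
def count_above(stream, threshold):
    count = 0                    # the PE's counter
    for value in stream:
        if value > threshold:    # compare each incoming datum with the threshold
            count += 1
    return count

n = count_above([1, 5, 7, 2, 9], 4)
```

Counting values below the threshold is the symmetric variant with the comparison reversed.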
The multidimensional data operation instruction class is one of the fine-grained operation instruction classes and is mainly used for controlling operations on multidimensional data, i.e. two-dimensional data and data of more than two dimensions. It includes operation instructions between multidimensional data and, respectively, multidimensional data, one-dimensional vector data and one-dimensional scalar data. Taking matrices as an example, MMmM is a matrix-matrix multiplication instruction, one of the operations between multidimensional data and multidimensional data, with the analogous MMaM, a matrix-matrix addition instruction; MMmV is a matrix-vector multiplication instruction, one of the operations between multidimensional data and one-dimensional vector data, with the analogous MMaV, a matrix-vector addition instruction; MMmS is a matrix-scalar multiplication instruction, one of the operations between multidimensional data and one-dimensional scalar data, with the analogous MMaS, a matrix-scalar addition instruction. In addition, the multidimensional data operation instructions are also compatible with operations between one-dimensional data: for example, MVmV implements a vector-vector multiplication instruction, and MMoV implements a vector-vector outer product instruction.
The one-dimensional data operation instruction class is one of the fine-grained operation instruction classes and is mainly used for controlling operations on one-dimensional data, which divides into one-dimensional vector data and one-dimensional scalar data. For example, VVmV is a vector-vector multiplication instruction, and similarly VVaV is a vector-vector addition instruction. VVmS is a vector-scalar multiplication instruction. SSsS is a one-dimensional scalar operation instruction, used to compute the square root of a one-dimensional scalar. SSrS is an operation for obtaining a random number. MV is a move operation instruction, used to fetch a register value or an immediate during operation.
FIG. 10 illustrates an embodiment of the macro-instruction operation CONV provided by the present invention, which performs a two-dimensional convolution on the hardware structure. In a two-dimensional convolution, a convolution kernel slides over a two-dimensional input image; at each position the kernel filters the image data it currently covers, i.e. the kernel and the covered image data are multiplied element-wise and these products are accumulated to give the required filtering result. The kernel is then slid to the next position and the operation repeated until all positions are processed. Because convolution is widely used and appears in large quantities, the convolution operation designed in this patent fully exploits data reusability on the hardware structure, distributes and transmits data reasonably, and maximizes hardware utilization. For illustration, a specific embodiment is given in fig. 10. In this embodiment, the input is an image or matrix and the output is likewise an image or matrix, all stored block-wise at designated locations. The hardware structure is exemplified by a matrix operation unit (MPU) containing m × n matrix operation elements (MPE), each of which includes the required operators and a register for temporarily storing intermediate data. As shown in fig. 18, the specific operation process is:
S1, a convolution macro instruction, consisting of an operation code and operands, is read. The operation code is CONV, indicating that a convolution is to be performed. There are 7 operands: DA, SA1, SA2, IX, IY, KX and KY. DA is the destination address, i.e. the storage address of the output result; SA1 is start address 1, the address from which the image to be operated on is read; SA2 is start address 2, the address from which the convolution kernel is read; IX and IY give the size of the image in the X and Y directions, i.e. these two variables define the size of the image to be operated on; KX and KY give the size of the convolution kernel.
S2, the input image data are read from the SRAM into the corresponding positions in the BUFFER according to the instruction and await operation; each MPE in the MPU is assigned to compute one pixel of the output image.
S3, the corresponding input image data are transmitted into each MPE. Because every MPE uses the same convolution kernel, the kernel is broadcast to all MPEs. Each MPE then multiplies the incoming input data by the corresponding convolution kernel data and stores the result in its own register.
S4, because convolution has local data reusability, the input image data needed in the next beat is the data just operated on by the MPE to the right; the input image data are therefore shifted left in sequence, and the rightmost MPE, whose required data is not yet in the MPU, reads it from the BUFFER. After the data transfer, each MPE multiplies the input image data by the corresponding convolution kernel data, accumulates the product with the data in its register, and stores the sum back into the register.
S5, step S4 is repeated until all convolution kernel data and the corresponding input image data have been processed, i.e. each MPE has produced 1 pixel of the output image, and the results are output and stored at the location defined by the destination address in the instruction.
S6, the above steps are repeated until all pixels of the output image have been calculated.
Using this macro instruction, the local reusability of data is fully exploited, the number of data transfers is greatly reduced, and operation efficiency is improved. For example, when m = 3 and n = 3, the MPU can perform the convolution for 9 pixels simultaneously, taking 9 clock cycles.
Similarly, we provide a large number of macro-instruction operations. Operations such as convolution could also be expressed with other instruction types, but the macro instructions make the operation sequences more concise and efficient. In addition, macro instructions handle the data-reuse problem well: they improve data utilization, reduce data transmission, lower power consumption and improve performance.
FIG. 11 illustrates an embodiment of a multidimensional data operation instruction provided by the present invention, which implements the dot product between one-dimensional vectors; operations such as vector multiplication, vector addition and vector comparison follow similar flows. Each vector operation unit (VPU) comprises mm vector operation elements (VPEs), each of which can process one pair of input data. The detailed flow is shown in fig. 19. First, mm data to be operated on are input to the mm VPEs, each VPE performs one multiplication, and the products are stored in the VPEs' internal registers. Then the next mm data are input to the mm VPEs, each performs one multiplication, the new product is accumulated with the previous product held in the internal register, and the sum is written back to the internal register. This is repeated until all inputs have been processed. The partial results are then passed leftwards from the rightmost end of the vector operation unit: the rightmost VPE sends the data in its register to the VPE on its left; each VPE, on receiving data from its right, accumulates it with the data in its own register and passes the sum on to the left, and so on. Finally, the dot product result is obtained in the leftmost VPE and output as required.
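The two-phase flow above can be sketched as follows: each of mm lanes multiply-accumulates a strided share of the inputs, then the lane partials are folded leftwards. The value of `mm` and the round-robin lane assignment are illustrative assumptions.

```python
def vpe_dot(a, b, mm=4):
    lanes = [0.0] * mm             # one register per VPE
    # Phase 1: each VPE accumulates products of its share of the inputs
    for i in range(len(a)):
        lanes[i % mm] += a[i] * b[i]
    # Phase 2: pass partial sums from the rightmost VPE leftwards,
    # accumulating into each register along the way
    for i in range(mm - 1, 0, -1):
        lanes[i - 1] += lanes[i]
    return lanes[0]                # the result ends up in the leftmost VPE

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 1.0, 1.0, 1.0, 1.0]
d = vpe_dot(x, y)
```

In hardware the leftward fold is pipelined across clock cycles rather than run as a loop, but the dataflow is the same.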
Fig. 12 is an embodiment provided by the present invention, describing how the SIFT feature extraction algorithm is configured on the present apparatus. The SIFT (Scale-Invariant Feature Transform) feature extraction algorithm is one of the key operations of the RGBD SLAM algorithm. The first step builds the Gaussian image pyramid; it involves basic image operations such as image smoothing and, on the present apparatus, can be further decomposed into multiple convolution and down-sampling operations. Next, the difference of Gaussians (DOG) is computed, which can be regarded as matrix subtraction between adjacent planes of the image pyramid. Once the DOG operation is complete, the local extremum search can be completed by calling the macro instruction LOCAL EXTREMA. After the local extrema are found, feature point determination and feature point filtering (KP filter) are performed; this step consists of a large number of vector and scalar operations, such as vector dot products and matrix determinants. Finally, the key point descriptors are computed from histograms of neighbouring points via a number of vector and scalar operations. The histogram computation can be completed by the macro instruction HIST and consists of vector operations such as vector comparison. The rotation of neighbouring pixel regions is implemented with matrix-vector multiplication. Special function operations such as the exponential are mainly implemented by the scalar operation units.
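The pyramid/DOG decomposition can be sketched as repeated separable smoothing followed by plane-wise subtraction. The binomial kernel, the number of scales, and all names here are illustrative assumptions, not SIFT's actual parameters.

```python
import numpy as np

def smooth(img, k):
    # separable blur: convolve rows, then columns, with a small 1-D kernel
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, tmp)

def dog_octave(img, n_scales=3):
    k = np.array([0.25, 0.5, 0.25])  # binomial approximation to a Gaussian
    planes = [img]
    for _ in range(n_scales - 1):
        planes.append(smooth(planes[-1], k))
    # DOG: element-wise subtraction between adjacent pyramid planes
    return [planes[i + 1] - planes[i] for i in range(n_scales - 1)]

octave = dog_octave(np.ones((8, 8)))
```

On the apparatus, `smooth` maps to CONV macro instructions, the subtraction to matrix-matrix operations, and the subsequent extremum search to LOCAL EXTREMA.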
Fig. 13 is an embodiment provided by the present invention, introducing a schematic flow for configuring the G2O graph optimization algorithm on the present apparatus. G2O is a framework for solving nonlinear graph optimization problems, and many typical graph-based SLAM algorithms, such as RGBD SLAM and ORB SLAM, are built on it. Given the pose constraint between two graph nodes and their initial poses, the error matrix and the Jacobian matrix can be obtained through matrix and vector operations, such as matrix multiplication and accumulation. A linear system that optimizes the objective function is then built from the error matrix and the Jacobian matrix; this is again completed by the matrix and vector operation units and involves operations including matrix multiplication and accumulation. This linear system is then solved, for which the Preconditioned Conjugate Gradient (PCG) algorithm can be used (Cholesky decomposition, sparse matrix methods or upper triangular decomposition are also possible). The PCG operation decomposes into blocked matrix-vector multiplications and additions, and can be implemented with the macro instruction PCG. Finally, the pose optimization is completed through matrix and vector multiplications, additions and the like.
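A minimal Jacobi-preconditioned conjugate-gradient solver for A x = b shows how the linear-system step decomposes into exactly the matrix-vector multiplications and vector additions the apparatus provides. This is a textbook PCG sketch under a diagonal-preconditioner assumption, not the patent's PCG macro instruction.

```python
import numpy as np

def pcg(A, b, tol=1e-10, max_iter=100):
    x = np.zeros_like(b)
    r = b - A @ x                  # residual
    M_inv = 1.0 / np.diag(A)       # Jacobi preconditioner: inverse diagonal of A
    z = M_inv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p                 # the dominant matrix-vector product
        alpha = rz / (p @ Ap)
        x += alpha * p             # vector addition on the vector unit
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p  # update the search direction
        rz = rz_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = pcg(A, b)
```

Every step is a matrix-vector multiply, a dot product, or an axpy-style vector update, which is why the operation maps naturally onto the MPU and VPU described above.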
The apparatus and method of the embodiments of the present invention may be applied in the following (including but not limited to) scenarios: data processing, robots, unmanned planes, autopilots, computers, printers, scanners, telephones, tablet computers, intelligent terminals, mobile phones, automobile data recorders, navigators, sensors, cameras, cloud servers, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices and other electronic products; various vehicles such as airplanes, ships, vehicles and the like; various household appliances such as televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas stoves, range hoods and the like; and various medical devices including nuclear magnetic resonance apparatuses, B-ultrasonic apparatuses, electrocardiographs and the like.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (20)

1. A SLAM calculation apparatus for performing calculations of SLAM-related algorithms and applications, the apparatus comprising:
the vector operation unit comprises a plurality of vector operation elements VPE, and is used for respectively inputting a data set to be operated on into each of the plurality of vector operation elements, and each vector operation element executes multiplication and addition operations on the input data set to be operated on to obtain an operation result;
the register unit is used for storing operation results of multiplication operation and addition operation;
an instruction storage module: the instruction set is used for storing instruction sets required by the operation process;
the set of instructions includes: macro operation instruction class for complete operation; the macro-operation instruction class includes: convolution operation instruction, image accumulation operation instruction, image BOX filtering operation instruction, local extremum operation instruction, counter comparison operation instruction and/or pooling operation instruction.
2. The arithmetic device according to claim 1, wherein the data set to be operated on comprises mm data to be operated on, and the vector operation unit is specifically configured to, in terms of that each vector unit performs multiplication and addition operations on the input data to be operated on to obtain an operation result:
step 1: inputting each data to be operated in mm data to be operated into a plurality of VPEs and executing multiplication to obtain a product operation result, and storing the product operation result into a register inside the corresponding VPE in the VPEs;
step 2: accumulating the product operation result and a product operation result obtained by the last operation stored in a register to obtain an accumulated result, and storing the accumulated result into a register inside a corresponding VPE in the plurality of VPEs;
repeatedly executing the step 1 and the step 2 until the mm data to be operated are all calculated, and obtaining mm calculation results;
and sequentially transmitting the calculation result of the VPE at the rightmost end to the left and accumulating it, until the calculation result is transmitted to the VPE at the leftmost end, to obtain an operation result.
3. The computing device of claim 2, wherein the vector computing unit is further configured to, in terms of left-hand passing and accumulating the computation result of the rightmost VPE until the computation result is passed to the leftmost VPE to obtain the computation result:
step 1: transmitting the calculation result of the VPE at the rightmost end to the VPE adjacent to the left side of the VPE at the rightmost end;
step 2: the VPE adjacent to the left side performs addition operation on the calculation result of the VPE at the rightmost end and the calculation result of the VPE adjacent to the left side to obtain a superposition result;
and repeating the step 1 and the step 2 until the VPE at the leftmost end executes the superposition operation, to obtain an operation result.
4. The computing device of claim 1, wherein the device further comprises:
an input storage module: for storing input/output data;
an intermediate result storage module: the temporary operation device is used for storing temporary operation result data; and/or
A final result storage module: for storing the final operation result data.
5. The computing device of claim 4, wherein the set of instructions comprises:
a control operation instruction class used for selecting the control of the operation instruction to be executed;
a data operation instruction class for controlling the transmission of data;
the multidimensional data operation instruction class is used for controlling the operation of multidimensional data; and/or
And the one-dimensional data operation instruction class is used for controlling the operation of the one-dimensional data.
6. The computing device of claim 5, wherein the class of control operation instructions comprises jump instructions and branch instructions, wherein the jump instructions comprise direct jump instructions and indirect jump instructions, and wherein the branch instructions comprise conditional branch instructions.
7. The computing device of claim 5, wherein the class of data operation instructions comprises at least one of:
LD/ST instruction, used for transferring data in DRAM and SRAM;
MOV instructions for transferring data between SRAMs;
RD/WR instruction, for transferring data between the SRAM and the BUFFER.
8. The computing device of claim 1, the class of macro-operation instructions comprising at least one of:
matrix and matrix multiply instructions, matrix and matrix add instructions, matrix and vector multiply instructions, matrix and vector add instructions, matrix and scalar multiply instructions, matrix and scalar add instructions, vector and vector multiply instructions, and vector outer product instructions.
9. The arithmetic device of claim 1, the class of macro-operation instructions comprising at least one of:
vector-to-vector multiply instructions, vector-to-vector add instructions, vector-to-scalar multiply instructions, vector-to-scalar add instructions, scalar square-root instructions, scalar random-number instructions, and move instructions.
10. The computing device of claim 5, wherein the multi-dimensional data operation instruction class is used for requesting the computing unit to perform operations on multi-dimensional data, and the operations on multi-dimensional data include operations between multi-dimensional data and multi-dimensional data, operations between multi-dimensional data and one-dimensional vector data, and operations between multi-dimensional data and one-dimensional scalar data.
11. The arithmetic device of claim 5 wherein the class of one-dimensional data operation instructions is for requiring the arithmetic unit to perform operations on one-dimensional data, the one-dimensional data comprising one-dimensional vectors and one-dimensional scalars.
12. The computing device of claim 5, further comprising an assembler to select, during execution, a type of instruction in the instruction set for use.
13. The method of SLAM operation by the operation device according to any one of claims 1 to 12, wherein the method comprises:
respectively inputting a data set to be operated into each vector operation part in a plurality of vector operation parts included in a vector operation unit by using the vector operation unit, wherein each vector part executes multiplication operation and addition operation on the input data set to be operated to obtain an operation result;
storing operation results of multiplication operation and addition operation by using a register unit;
storing an instruction set required by the operation process by using an instruction storage module;
the set of instructions includes: macro operation instruction class for complete operation; the macro-operation instruction class includes: convolution operation instruction, image accumulation operation instruction, image BOX filtering operation instruction, local extremum operation instruction, counter comparison operation instruction and/or pooling operation instruction.
14. The operation method according to claim 13, wherein the data set to be operated on comprises mm data to be operated on, and each vector unit performs multiplication and addition operations on the input data to be operated on to obtain an operation result, including:
step 1: inputting each data to be operated in mm data to be operated into a plurality of VPEs and executing multiplication operation to obtain a product operation result, and storing the product operation result into a register inside the corresponding VPE in the plurality of VPEs;
and 2, step: accumulating the product operation result and a product operation result obtained by the last operation stored in a register to obtain an accumulated result, and storing the accumulated result into a register inside a corresponding VPE in the plurality of VPEs;
repeating the step 1 and the step 2 until the mm data to be calculated are all calculated, and obtaining mm calculation results;
and sequentially transmitting the calculation result of the VPE at the rightmost end to the left and accumulating it, until the calculation result is transmitted to the VPE at the leftmost end, to obtain an operation result.
15. The method of claim 14, wherein the left-hand side and accumulation of the rightmost VPE calculation result until the leftmost VPE calculation result is passed to obtain the calculation result, comprising:
step 1: transmitting the calculation result of the VPE at the rightmost end to the VPE adjacent to the left side of the VPE at the rightmost end;
step 2: the VPE adjacent to the left side performs addition operation on the calculation result of the VPE at the rightmost end and the calculation result of the VPE adjacent to the left side to obtain a superposition result;
and repeating the step 1 and the step 2 until the VPE at the leftmost end executes the superposition operation, to obtain an operation result.
16. The method of claim 13, further comprising:
storing input/output data using an input storage module;
using an intermediate result storage module to store temporary operation result data; and/or
And storing the final operation result data by using a final result storage module.
17. The method of claim 13, wherein the set of instructions comprises:
the control operation instruction class is used for selecting control of an operation instruction to be executed, the control operation instruction class comprises a jump instruction and a branch instruction, the jump instruction comprises a direct jump instruction and an indirect jump instruction, and the branch instruction comprises a conditional branch instruction;
a data operation instruction class for controlling the transmission of data;
the macro-operation instruction class includes at least one of:
a matrix and matrix multiply instruction, a matrix and matrix add instruction, a matrix and vector multiply instruction, a matrix and vector add instruction, a matrix and scalar multiply instruction, a matrix and scalar add instruction, a vector and vector multiply instruction, and a vector and vector outer product instruction;
alternatively, the macro-operation instruction class includes at least one of:
a vector and vector multiply instruction, a vector and vector add instruction, a vector and scalar multiply instruction, a vector and scalar add instruction, a scalar square-root instruction, a scalar random-number instruction, and a move instruction;
a multidimensional data operation instruction class for controlling operation of multidimensional data; and/or
a one-dimensional data operation instruction class for controlling an operation of one-dimensional data, the one-dimensional data including one-dimensional vectors and one-dimensional scalars.
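As an illustration of the macro-operation instruction class of claim 17, the named operation kinds can be modeled as a small dispatch table; the opcode mnemonics below are invented for this sketch and do not appear in the patent:

```python
# Hypothetical opcode names; the patent names only the operation kinds.
MACRO_OPS = {
    "MMMUL": lambda A, B: [[sum(a * b for a, b in zip(row, col))
                            for col in zip(*B)] for row in A],    # matrix x matrix
    "MMADD": lambda A, B: [[a + b for a, b in zip(ra, rb)]
                           for ra, rb in zip(A, B)],              # matrix + matrix
    "MVMUL": lambda A, v: [sum(a * x for a, x in zip(row, v))
                           for row in A],                         # matrix x vector
    "MSMUL": lambda A, s: [[a * s for a in row] for row in A],    # matrix x scalar
    "VVMUL": lambda u, v: [a * b for a, b in zip(u, v)],          # elementwise vector
    "VVOUT": lambda u, v: [[a * b for b in v] for a in u],        # vector outer product
}

A = [[1, 2], [3, 4]]
print(MACRO_OPS["MVMUL"](A, [1, 1]))   # [3, 7]
```

Each entry stands in for one macro-operation instruction; a real arithmetic unit would implement these in hardware rather than as Python lambdas.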
18. The method of claim 17, wherein the class of data manipulation instructions comprises at least one of:
an LD/ST instruction for transferring data between the DRAM and the SRAM;
a MOV instruction for transferring data between SRAMs;
an RD/WR instruction for transferring data between the SRAM and the BUFFER.
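The three transfer paths of claim 18 can be sketched as a toy memory-hierarchy model; the class and method names are invented for illustration and the patent says nothing about addressing or data width:

```python
class MemoryHierarchy:
    """Minimal model of the DRAM / SRAM / BUFFER levels named in claim 18."""

    def __init__(self):
        self.dram, self.sram, self.buffer = {}, {}, {}

    def ld(self, addr):        # LD: DRAM -> SRAM
        self.sram[addr] = self.dram[addr]

    def st(self, addr):        # ST: SRAM -> DRAM
        self.dram[addr] = self.sram[addr]

    def mov(self, src, dst):   # MOV: SRAM -> SRAM
        self.sram[dst] = self.sram[src]

    def rd(self, addr):        # RD: SRAM -> BUFFER
        self.buffer[addr] = self.sram[addr]

    def wr(self, addr):        # WR: BUFFER -> SRAM
        self.sram[addr] = self.buffer[addr]

m = MemoryHierarchy()
m.dram[0] = 42
m.ld(0)              # bring the word on chip
m.rd(0)              # stage it into the operand buffer
print(m.buffer[0])   # 42
```

The point of the sketch is only the direction of each transfer: LD/ST cross the off-chip boundary, MOV stays within SRAM, and RD/WR feed the operand buffer.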
19. The method of claim 17, wherein the multidimensional data operation instruction class is used to instruct the arithmetic unit to perform operations on multidimensional data, the operations on multidimensional data including operations between multidimensional data and multidimensional data, operations between multidimensional data and one-dimensional vector data, and operations between multidimensional data and one-dimensional scalar data.
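The three operand combinations of claim 19 amount to an elementwise operation with row-wise or full broadcast of the second operand; a minimal sketch (the helper name and its broadcast rules are illustrative, not from the patent):

```python
def md_add(A, other):
    """Add `other` to matrix A; `other` may be a matrix, a 1-D vector
    (broadcast across rows), or a scalar (broadcast to every element)."""
    if isinstance(other, (int, float)):          # multidimensional + scalar
        return [[a + other for a in row] for row in A]
    if other and isinstance(other[0], list):     # multidimensional + multidimensional
        return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, other)]
    # multidimensional + 1-D vector
    return [[a + b for a, b in zip(row, other)] for row in A]

A = [[1, 2], [3, 4]]
print(md_add(A, 10))        # [[11, 12], [13, 14]]
print(md_add(A, [1, 0]))    # [[2, 2], [4, 4]]
```

The same dispatch would apply to the other arithmetic operations; only addition is shown here.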
20. The method of claim 17, further comprising:
selecting, by an assembler, the type of instruction in the instruction set to be used during operation.
CN201811529557.9A 2016-11-03 2016-11-03 SLAM operation device and method Active CN109634905B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811529557.9A CN109634905B (en) 2016-11-03 2016-11-03 SLAM operation device and method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811529557.9A CN109634905B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201610958847.XA CN108021528B (en) 2016-11-03 2016-11-03 SLAM operation device and method

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201610958847.XA Division CN108021528B (en) 2016-11-03 2016-11-03 SLAM operation device and method

Publications (2)

Publication Number Publication Date
CN109634905A CN109634905A (en) 2019-04-16
CN109634905B true CN109634905B (en) 2023-03-10

Family

ID=62075642

Family Applications (12)

Application Number Title Priority Date Filing Date
CN201811521819.7A Active CN109376113B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811653568.8A Active CN109684267B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811529500.9A Active CN109697184B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811529556.4A Active CN109634904B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811529557.9A Active CN109634905B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811653560.1A Active CN109726168B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811545672.5A Pending CN109710558A (en) 2016-11-03 2016-11-03 SLAM arithmetic unit and method
CN201610958847.XA Active CN108021528B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811521820.XA Active CN109376114B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811653558.4A Active CN109656867B (en) 2016-11-03 2016-11-03 SLAM arithmetic device and method
CN201811521818.2A Active CN109376112B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811654180.XA Pending CN109710559A (en) 2016-11-03 2016-11-03 SLAM arithmetic unit and method

Family Applications Before (4)

Application Number Title Priority Date Filing Date
CN201811521819.7A Active CN109376113B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811653568.8A Active CN109684267B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811529500.9A Active CN109697184B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811529556.4A Active CN109634904B (en) 2016-11-03 2016-11-03 SLAM operation device and method

Family Applications After (7)

Application Number Title Priority Date Filing Date
CN201811653560.1A Active CN109726168B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811545672.5A Pending CN109710558A (en) 2016-11-03 2016-11-03 SLAM arithmetic unit and method
CN201610958847.XA Active CN108021528B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811521820.XA Active CN109376114B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811653558.4A Active CN109656867B (en) 2016-11-03 2016-11-03 SLAM arithmetic device and method
CN201811521818.2A Active CN109376112B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811654180.XA Pending CN109710559A (en) 2016-11-03 2016-11-03 SLAM arithmetic unit and method

Country Status (2)

Country Link
CN (12) CN109376113B (en)
WO (1) WO2018082229A1 (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111290789B (en) * 2018-12-06 2022-05-27 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN111290788B (en) * 2018-12-07 2022-05-31 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN111079915B (en) * 2018-10-19 2021-01-26 中科寒武纪科技股份有限公司 Operation method, device and related product
CN110058884B (en) * 2019-03-15 2021-06-01 佛山市顺德区中山大学研究院 Optimization method, system and storage medium for computational storage instruction set operation
CN110991291B (en) * 2019-11-26 2021-09-07 清华大学 Image feature extraction method based on parallel computing
CN113112481B (en) * 2021-04-16 2023-11-17 北京理工雷科电子信息技术有限公司 Hybrid heterogeneous on-chip architecture based on matrix network
CN113177211A (en) * 2021-04-20 2021-07-27 深圳致星科技有限公司 FPGA chip for privacy computation, heterogeneous processing system and computing method
CN113342671B (en) * 2021-06-25 2023-06-02 海光信息技术股份有限公司 Method, device, electronic equipment and medium for verifying operation module
CN113395551A (en) * 2021-07-20 2021-09-14 珠海极海半导体有限公司 Processor, NPU chip and electronic equipment
US20230056246A1 (en) * 2021-08-03 2023-02-23 Micron Technology, Inc. Parallel matrix operations in a reconfigurable compute fabric
CN113792867A (en) * 2021-09-10 2021-12-14 中科寒武纪科技股份有限公司 Arithmetic circuit, chip and board card
CN117093816B (en) * 2023-10-19 2024-01-19 上海登临科技有限公司 Matrix multiplication operation method and device and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101986264A (en) * 2010-11-25 2011-03-16 中国人民解放军国防科学技术大学 Multifunctional floating-point multiply and add calculation device for single instruction multiple data (SIMD) vector microprocessor

Family Cites Families (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS60201472A (en) * 1984-03-26 1985-10-11 Nec Corp Matrix product computing device
US5666300A (en) * 1994-12-22 1997-09-09 Motorola, Inc. Power reduction in a data processing system using pipeline registers and method therefor
JPH09230954A (en) * 1996-02-28 1997-09-05 Olympus Optical Co Ltd Vector standardizing device
US7454451B2 (en) * 2003-04-23 2008-11-18 Micron Technology, Inc. Method for finding local extrema of a set of values for a parallel processing element
US7664810B2 (en) * 2004-05-14 2010-02-16 Via Technologies, Inc. Microprocessor apparatus and method for modular exponentiation
US7814297B2 (en) * 2005-07-26 2010-10-12 Arm Limited Algebraic single instruction multiple data processing
US8051124B2 (en) * 2007-07-19 2011-11-01 Itt Manufacturing Enterprises, Inc. High speed and efficient matrix multiplication hardware module
CN101609715B (en) * 2009-05-11 2012-09-05 中国人民解放军国防科学技术大学 Matrix register file with separated row-column access ports
EP2507701B1 (en) * 2009-11-30 2013-12-04 Martin Raubuch Microprocessor and method for enhanced precision sum-of-products calculation
KR101206213B1 (en) * 2010-04-19 2012-11-28 인하대학교 산학협력단 High speed slam system and method based on graphic processing unit
JP2013540985A (en) * 2010-07-26 2013-11-07 コモンウェルス サイエンティフィック アンドインダストリアル リサーチ オーガナイゼーション Three-dimensional scanning beam system and method
CN102012893B (en) * 2010-11-25 2012-07-18 中国人民解放军国防科学技术大学 Extensible vector operation device
CN102156637A (en) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 Vector crossing multithread processing method and vector crossing multithread microprocessor
CN102353379B (en) * 2011-07-06 2013-02-13 上海海事大学 Environment modeling method applicable to navigation of automatic piloting vehicles
CN106708753B (en) * 2012-03-30 2021-04-02 英特尔公司 Apparatus and method for accelerating operation in processor using shared virtual memory
US9013490B2 (en) * 2012-05-17 2015-04-21 The United States Of America As Represented By The Administrator Of The National Aeronautics Space Administration Hilbert-huang transform data processing real-time system with 2-D capabilities
CN102750127B (en) * 2012-06-12 2015-06-24 清华大学 Coprocessor
CN103208000B (en) * 2012-12-28 2015-10-21 青岛科技大学 Based on the Feature Points Extraction of local extremum fast search
CN103150596B (en) * 2013-02-22 2015-12-23 百度在线网络技术(北京)有限公司 The training system of a kind of reverse transmittance nerve network DNN
CN104252331B (en) * 2013-06-29 2018-03-06 华为技术有限公司 Multiply-accumulator
US9449675B2 (en) * 2013-10-31 2016-09-20 Micron Technology, Inc. Apparatuses and methods for identifying an extremum value stored in an array of memory cells
CN103640018B (en) * 2013-12-13 2014-09-03 江苏久祥汽车电器集团有限公司 SURF (speeded up robust feature) algorithm based localization method
CN103677741A (en) * 2013-12-30 2014-03-26 南京大学 Imaging method based on NCS algorithm and mixing precision floating point coprocessor
CN103955447B (en) * 2014-04-28 2017-04-12 中国人民解放军国防科学技术大学 FFT accelerator based on DSP chip
CN105212922A (en) * 2014-06-11 2016-01-06 吉林大学 The method and system that R wave of electrocardiosignal detects automatically are realized towards FPGA
US9798519B2 (en) * 2014-07-02 2017-10-24 Via Alliance Semiconductor Co., Ltd. Standard format intermediate result
CN104317768B (en) * 2014-10-15 2017-02-15 中国人民解放军国防科学技术大学 Matrix multiplication accelerating method for CPU+DSP (Central Processing Unit + Digital Signal Processor) heterogeneous system
CN104330090B (en) * 2014-10-23 2017-06-06 北京化工大学 Robot distributed sign intelligent semantic map creating method
KR102374160B1 (en) * 2014-11-14 2022-03-14 삼성디스플레이 주식회사 A method and apparatus to reduce display lag using scailing
CN104391820B (en) * 2014-11-25 2017-06-23 清华大学 General floating-point matrix processor hardware structure based on FPGA
CN104574508A (en) * 2015-01-14 2015-04-29 山东大学 Multi-resolution model simplifying method oriented to virtual reality technology
US10285760B2 (en) * 2015-02-04 2019-05-14 Queen's University At Kingston Methods and apparatus for improved electromagnetic tracking and localization
CN104851094A (en) * 2015-05-14 2015-08-19 西安电子科技大学 Improved method of RGB-D-based SLAM algorithm
CN104915322B (en) * 2015-06-09 2018-05-01 中国人民解放军国防科学技术大学 A kind of hardware-accelerated method of convolutional neural networks
CN104899182B (en) * 2015-06-09 2017-10-31 中国人民解放军国防科学技术大学 A kind of Matrix Multiplication accelerated method for supporting variable partitioned blocks
CN105528082B (en) * 2016-01-08 2018-11-06 北京暴风魔镜科技有限公司 Three dimensions and gesture identification tracking exchange method, device and system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101986264A (en) * 2010-11-25 2011-03-16 中国人民解放军国防科学技术大学 Multifunctional floating-point multiply and add calculation device for single instruction multiple data (SIMD) vector microprocessor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sparse Matrix Vector Multiplication on a Field Programmable Gate Array;Marcel van der Veen;《https://essay.utwente.nl/781/1/scriptie_van_der_Veen.pdf》;20070930;全文 *

Also Published As

Publication number Publication date
CN109697184B (en) 2021-04-09
CN109710559A (en) 2019-05-03
CN109376114B (en) 2022-03-15
WO2018082229A1 (en) 2018-05-11
CN109697184A (en) 2019-04-30
CN109684267B (en) 2021-08-06
CN108021528A (en) 2018-05-11
CN109710558A (en) 2019-05-03
CN109726168A (en) 2019-05-07
CN109634904B (en) 2023-03-07
CN109376113A (en) 2019-02-22
CN109376112B (en) 2022-03-15
CN109656867A (en) 2019-04-19
CN109726168B (en) 2021-09-21
CN109376113B (en) 2021-12-14
CN109634905A (en) 2019-04-16
CN109376112A (en) 2019-02-22
CN109684267A (en) 2019-04-26
CN109656867B (en) 2023-05-16
CN108021528B (en) 2020-03-13
CN109376114A (en) 2019-02-22
CN109634904A (en) 2019-04-16

Similar Documents

Publication Publication Date Title
CN109634905B (en) SLAM operation device and method
KR102258414B1 (en) Processing apparatus and processing method
CN109240746B (en) Apparatus and method for performing matrix multiplication operation
CN108733348B (en) Fused vector multiplier and method for performing operation using the same
CN106970896B (en) Vector processor-oriented vectorization implementation method for two-dimensional matrix convolution
Banz et al. Real-time semi-global matching disparity estimation on the GPU
CN109389213B (en) Storage device and method, data processing device and method, and electronic device
CN111860772B (en) Device and method for executing artificial neural network mapping operation
CN111367567B (en) Neural network computing device and method
CN113867800A (en) Computing device, integrated circuit chip, board card, electronic equipment and computing method
CN114638352B (en) Processor architecture, processor and electronic equipment
CN117933314A (en) Processing device, processing method, chip and electronic device
CN117933327A (en) Processing device, processing method, chip and electronic device
CN115330683A (en) Target rapid detection system based on FPGA
El-Sayed et al. Real-time motion tracking using the CELL BE

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences

Applicant after: Zhongke Cambrian Technology Co.,Ltd.

Address before: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences

Applicant before: Beijing Zhongke Cambrian Technology Co.,Ltd.

GR01 Patent grant