CN109684267A - SLAM arithmetic unit and method - Google Patents

SLAM arithmetic unit and method

Info

Publication number
CN109684267A
CN109684267A (application CN201811653568.8A)
Authority
CN
China
Prior art keywords
data
instruction
multidimensional
operational order
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811653568.8A
Other languages
Chinese (zh)
Other versions
CN109684267B (en)
Inventor
陈云霁 (Chen Yunji)
杜子东 (Du Zidong)
张磊 (Zhang Lei)
陈天石 (Chen Tianshi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Beijing Zhongke Cambrian Technology Co Ltd
Original Assignee
Beijing Zhongke Cambrian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Cambrian Technology Co Ltd
Priority to CN201811653568.8A
Publication of CN109684267A
Application granted
Publication of CN109684267B
Active legal status
Anticipated expiration legal status


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F 9/3001 Arithmetic instructions
    • G06F 9/30021 Compare instructions, e.g. Greater-Than, Equal-To, MINMAX
    • G06F 9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/161 Computing infrastructure, e.g. computer clusters, blade chassis or hardware partitioning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES (ICT), I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A SLAM hardware accelerator device comprises: a storage section for storing input data, intermediate operation results, final operation results, the instruction set required by the computation, and/or algorithm parameter data; an arithmetic section, connected to the storage section, for performing the computations of SLAM-related algorithms and applications; and a control section, connected to the storage section and the arithmetic section, for controlling and coordinating both. The invention also provides a method for performing SLAM operations, in which instructions control the transfer of data, the operations on data, and the execution of the program. The device and method can effectively accelerate SLAM algorithms according to different needs and satisfy a variety of workloads, offering high flexibility, high configurability, fast operation, and low power consumption.

Description

SLAM arithmetic unit and method
Technical field
The present invention relates to a SLAM (Simultaneous Localization and Mapping) arithmetic device and method for accelerating the operations of SLAM algorithms according to different demands.
Background technique
Autonomous navigation is a basic capability of mobile robots (such as unmanned ground and aerial vehicles) in unknown environments. In a SLAM task, localization mainly determines the robot's position within a map, while mapping is the robot's construction of a map of its surroundings. When no initial map of the environment is available, the robot must build the map in real time and use it to localize itself; SLAM algorithms arose to solve exactly this task. However, realizing SLAM accurately under a mobile robot's limited computing capability and strict power budget is one of the greatest practical challenges. First, SLAM's real-time requirement demands high arithmetic speed to complete the large number of intra-frame and inter-frame operations within a short time. Second, the constraints of mobile robots impose harsh power-consumption requirements. Finally, SLAM algorithms are numerous and use a wide range of operation types, so a dedicated accelerator must support many kinds of SLAM algorithms.
In the prior art, one way to realize a SLAM algorithm is to run it directly on a general-purpose processor (CPU). One disadvantage of this approach is that the computational performance of a single general-purpose processor is low and cannot meet the real-time requirements of common SLAM workloads; and when multiple general-purpose processors execute in parallel, the communication between them becomes the performance bottleneck.
Another way to realize a SLAM algorithm is to run it on a graphics processor (GPU), which supports the algorithms by executing general SIMD instructions with general-purpose register files and stream processing units. Although the GPU is a device specialized for graphics and image operations, the complexity of SLAM computation means it cannot support the later stages of the pipeline well; that is, it cannot effectively accelerate the algorithm as a whole. Moreover, the on-chip cache of a GPU is too small to satisfy the memory demands of large SLAM workloads. Furthermore, in practical deployments it is difficult to embed CPU- or GPU-like architectures into a robot, so there has been no practical, highly flexible dedicated SLAM hardware accelerator architecture. The device designed here is such a dedicated SLAM hardware accelerator, together with a corresponding method; it can be implemented as dedicated or embedded chips and applied in robots, computers, mobile phones, and similar products.
Summary of the invention
(1) Technical problem to be solved
The object of the present invention is to provide a device and a method for a SLAM hardware accelerator.
(2) Technical solution
According to one aspect of the present invention, a SLAM hardware accelerator device is provided, comprising:
a storage section for storing input data, intermediate operation results, final operation results, the instruction set required by the computation, and/or algorithm parameter data;
an arithmetic section, connected to the storage section, for performing the computations of SLAM-related algorithms and applications; and
a control section, connected to the storage section and the arithmetic section, for controlling and coordinating both.
Preferably, the storage section includes:
an input memory module for storing input and output data;
an intermediate-result memory module for storing intermediate operation results;
a final-result memory module for storing final operation results;
an instruction memory module for storing the instruction set required by the computation; and/or
a buffer memory module for the buffered storage of data.
Preferably, the arithmetic section includes:
an acceleration arithmetic unit designed for accelerating and processing the operations of SLAM-related algorithms and applications; and
other arithmetic units for the remaining operations in SLAM-related algorithms and applications that cannot be completed by the acceleration arithmetic unit.
Preferably, the acceleration arithmetic unit includes a vector operation unit and a matrix operation unit.
Preferably, the other arithmetic units are used to complete operations that appear in the algorithms and applications but are not handled by the acceleration arithmetic unit.
Preferably, the arithmetic section is realized by hardware circuits.
Preferably, the control section connects each module of the storage section and the arithmetic section. It consists of a FIFO queue and a control processor: the FIFO queue stores control signals, and the control processor takes out the pending control signal, analyzes the control logic, and then controls and coordinates the storage section and the arithmetic section.
Preferably, the instruction set includes:
a control instruction class for selecting the instructions to be executed;
a data operation instruction class for controlling the transfer of data;
a macro operation instruction class for performing complete operations;
a multidimensional data operation instruction class for controlling arithmetic operations on multidimensional data; and/or
a one-dimensional data operation instruction class for controlling arithmetic operations on one-dimensional data.
Preferably, the control instruction class includes jump instructions and branch instructions; the jump instructions include direct jump instructions and indirect jump instructions, and the branch instructions include conditional branch instructions.
Preferably, the macro operation instruction class includes a convolution operation instruction or a pooling operation instruction.
Preferably, the multidimensional data operation instruction class requires the arithmetic unit to execute operations on multidimensional data, including operations between multidimensional data and multidimensional data, between multidimensional data and one-dimensional vector data, and between multidimensional data and one-dimensional scalar data.
Preferably, the one-dimensional data operation instruction class requires the arithmetic unit to execute operations on one-dimensional data, where one-dimensional data includes one-dimensional vectors and one-dimensional scalars.
Preferably, the operations on one-dimensional vector data include operations between a one-dimensional vector and a one-dimensional vector, and between a one-dimensional vector and a scalar.
Preferably, the operations on one-dimensional scalar data include operations between a scalar and a scalar.
Preferably, the device further includes an assembler for selecting, at run time, which instruction types in the instruction set to use.
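As an illustration of the five instruction classes above, the following minimal Python sketch dispatches an assembly line to its class by its leading opcode. The mnemonics JUMP, CB, LD, ST, MOV, RD, WR, CONV, and POOL follow the description in this document; the multidimensional and one-dimensional mnemonics (MM, MV, MS, VV, VS, SS), the class names, and the dispatch table are illustrative assumptions, not the patented encoding.

```python
# Hypothetical opcode-to-class table; only the structure mirrors the text.
INSTRUCTION_CLASSES = {
    "control":  {"JUMP", "CB"},                  # program-flow control
    "data":     {"LD", "ST", "MOV", "RD", "WR"}, # data movement
    "macro":    {"CONV", "POOL"},                # coarse-grained macro operations
    "multidim": {"MM", "MV", "MS"},              # multidimensional-data ops (assumed mnemonics)
    "onedim":   {"VV", "VS", "SS"},              # one-dimensional-data ops (assumed mnemonics)
}

def classify(instruction: str) -> str:
    """Return the instruction class of an assembly line by its opcode."""
    opcode = instruction.split()[0]
    for cls, opcodes in INSTRUCTION_CLASSES.items():
        if opcode in opcodes:
            return cls
    raise ValueError(f"unknown opcode: {opcode}")
```

A decoder structured this way lets each class be routed to its own control path (jumps to the control processor, LD/ST to the memory interface, CONV/POOL to the macro pipeline).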
According to another aspect of the present invention, a method of performing SLAM operations with any of the devices described above is also provided, in which the control section uses the instruction set in the storage section to control the transfer of data, the operations on data, and the execution of the program, including:
Step 1: transferring the input data of the storage section to the arithmetic section;
Step 2: executing operations in the arithmetic section according to the instruction set required by the computation;
Step 3: transferring and saving the operation result data; and
Step 4: repeating the above process until the computation is finished.
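The four steps of the method can be sketched as a simple driver loop. This is a behavioral model only, under the assumption that the arithmetic section is represented by a callable; all names here are illustrative, not part of the patented device.

```python
def run_slam_device(batches, instruction_set, execute):
    """Drive the accelerator over a sequence of input data blocks.

    batches: iterable of input data blocks held in the storage section.
    instruction_set: the instructions required by the computation.
    execute: callable modelling the arithmetic section.
    """
    results = []
    for input_data in batches:                       # Step 4: repeat until done
        operands = input_data                        # Step 1: move inputs to the arithmetic section
        result = execute(instruction_set, operands)  # Step 2: execute per the instruction set
        results.append(result)                       # Step 3: transfer and save the result
    return results
```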
(3) Beneficial effects
The device and method of the SLAM hardware accelerator provided by the invention can effectively accelerate SLAM algorithms according to different needs, are applicable to various SLAM algorithms and a variety of input data types, satisfy operations with different demands, and offer high flexibility, high configurability, fast operation, and low power consumption.
Compared with the prior art, the device and method of the present invention have the following effects:
1) the arithmetic section can operate on data of different input types according to different demands;
2) the arithmetic section can also share data to some degree through the buffer memory module, reducing the reuse distance of data;
3) the instruction design supports a variety of basic operation types, so the configurability of the device is very high;
4) the design of the matrix and vector operation units, together with the scalar operation unit, supports various types of operations and significantly accelerates computation;
5) the design of the arithmetic and storage sections and the arrangement of the instructions significantly reduce the power consumption of execution.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of the device of the SLAM hardware accelerator provided by one embodiment of the invention.
Fig. 2 is a structural schematic diagram of the SLAM hardware accelerator provided by a further embodiment of the invention.
Fig. 3 is a structural schematic diagram of an embodiment of the scalar operation unit of the SLAM hardware accelerator provided by one embodiment of the invention.
Fig. 4 is a structural schematic diagram of an embodiment of the vector operation unit of the SLAM hardware accelerator provided by one embodiment of the invention.
Fig. 5 is a structural schematic diagram of an embodiment of the matrix operation unit of the SLAM hardware accelerator provided by one embodiment of the invention.
Fig. 6 is a schematic diagram of an embodiment in which the SLAM hardware accelerator provided by one embodiment of the invention computes the L2 norm of a three-dimensional coordinate.
Fig. 7 is a schematic diagram of an embodiment in which the SLAM hardware accelerator provided by one embodiment of the invention performs a 16-dimensional square-matrix multiplication.
Fig. 8 is a schematic diagram of the configuration and realization, on this device, of a SLAM algorithm based on the extended Kalman filter (EKF), provided by one embodiment of the invention.
Fig. 9 is a schematic diagram of the instruction types provided by one embodiment of the invention.
Fig. 10 is an application schematic diagram of a macro operation instruction provided by one embodiment of the invention.
Fig. 11 is an embodiment of a one-dimensional data operation instruction provided by one embodiment of the invention.
Fig. 12 is a schematic diagram of the configuration and realization, on this device, of a SIFT feature extraction algorithm provided by one embodiment of the invention.
Fig. 13 is a schematic diagram of the configuration and realization, on this device, of a graph optimization algorithm based on the g2o framework, provided by one embodiment of the invention.
Fig. 14 is an execution flow chart of a convolution operation instruction provided by one embodiment of the invention.
Fig. 15 is an execution flow chart of an image accumulation instruction provided by one embodiment of the invention.
Fig. 16 is an execution flow chart of a filtering operation instruction provided by one embodiment of the invention.
Fig. 17 is an execution flow chart of a local extremum instruction provided by one embodiment of the invention.
Fig. 18 is an execution flow chart of a two-dimensional convolution operation provided by one embodiment of the invention.
Fig. 19 is an execution flow chart of a one-dimensional vector dot-product operation provided by one embodiment of the invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in more detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a structural schematic diagram of the device of the SLAM hardware accelerator provided by one embodiment of the invention. As shown in Fig. 1, the accelerator is broadly divided into three parts: the control section, the arithmetic section, and the storage section. The control section issues control signals to the arithmetic section and the storage section to control their operation and coordinate the data transfer between them. The storage section stores the relevant data, including input data, intermediate results, final results, instructions, and caches; according to demand, the stored data content, storage organization, and access methods can be planned differently. The arithmetic section includes a variety of arithmetic units for operating on data, comprising one or more combinations of a scalar operation unit, a vector operation unit, and a matrix operation unit, where the arithmetic units can operate on data of different input types according to different demands. The arithmetic section can also share data to some degree through the buffer memory module, reducing the reuse distance of data.
Fig. 2 is a structural schematic diagram of the device of the SLAM hardware accelerator of another embodiment of the invention. As shown in Fig. 2, this embodiment is required to accelerate the computation of an image-based SLAM algorithm while reducing data exchange and saving memory space. The structure of the device is therefore as follows. The control section connects each module of the storage section and the arithmetic section; it consists of a FIFO queue and a control processor, where the FIFO queue stores control signals and the control processor takes out the pending control signal, analyzes the control logic, and then controls and coordinates the storage section and the arithmetic section. The storage section is divided into four modules: an input memory module, an output memory module, an intermediate-result memory module, and a cache module. The arithmetic section is mainly used to accelerate the image-processing operations, the point-cloud map construction, the image matching, and the image optimization, and is likewise divided into three modules: a scalar operation module, a vector operation module, and a matrix operation module, which can execute in a pipelined manner or in parallel.
Fig. 3 is an embodiment of the invention describing a schematic diagram of a scalar operation unit usable in this device, where SPE denotes an individual scalar processing element. The scalar operation unit is mainly used for the scalar portions of SLAM algorithms and for some complex operations that are hard to accelerate otherwise, such as trigonometric functions; it can also resolve memory-access consistency problems and is one of the important components of the accelerator. The memory modules directly related to the scalar operation unit are the intermediate-result memory module and the buffer memory module. The operands needed by a scalar operation may reside in the intermediate-result memory module or in the buffer memory module, and the result of a scalar operation may be stored in the intermediate-result memory module or output to the buffer module, depending on actual needs.
Fig. 4 is an embodiment of the invention describing a schematic diagram of a vector operation unit usable in this device. The entire vector operation unit is composed of multiple basic processing elements; VPE in the figure is the basic processing element of vector operation. The vector operation unit can handle the vector portions of SLAM algorithms and all portions with vector-like characteristics, such as vector dot products, and can realize efficient data-level and task-level parallelism. Each basic element of the vector operation unit can be configured to execute the same operation in parallel, or configured to perform different operations. The memory modules directly related to the vector operation unit are the intermediate-result memory module and the buffer memory module. The operands needed by a vector operation may be stored in the intermediate-result memory module or in the buffer memory module, and the result may be stored in the intermediate-result memory module or output to the buffer module, depending on actual needs.
Fig. 5 is another embodiment of the invention describing a schematic diagram of a matrix operation unit usable in this device, which satisfies the requirement of accelerating all matrix operations and operation types similar to matrix operations; MPE denotes the basic processing element of the matrix operation unit. The matrix operation unit is composed of multiple basic processing elements, shown in the figure as an array of processing elements. The matrix operation unit supports a variety of external data-exchange modes, which may be 2D or 1D. The processing elements also support data-access patterns between internal elements, which can greatly reduce the reuse distance of local data and realize efficient acceleration. The memory modules directly related to the matrix operation unit are the intermediate-result memory module and the buffer memory module. The operands needed by a matrix operation may be stored in the intermediate-result memory module or in the buffer memory module, and the result may be stored in the intermediate-result memory module or output to the buffer module, depending on actual needs.
Fig. 6 is an embodiment of the invention describing a flow for computing the L2 norm of a three-dimensional coordinate with this device. Suppose the three coordinate components are stored in the intermediate memory module. First, configuration instructions fetch the operands from the intermediate memory module and feed them to three basic processing elements (VPEs) of the vector operation unit. Each of the three VPEs performs a multiplication whose two operands are the fetched coordinate component and itself. The multiplication results are then passed through the buffer memory module into the scalar operation unit, which sums the three products and then performs the square-root operation. The final result is output, as needed, to the intermediate-result memory module or the buffer memory module.
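The Fig. 6 dataflow can be sketched in a few lines: a vector stage in which each of three lanes squares one component in parallel, followed by a scalar stage that sums and takes the square root. This is a software model of the dataflow only, not the hardware itself.

```python
import math

def l2_norm_3d(coord):
    """Model the Fig. 6 flow for a 3-D coordinate (x, y, z): three vector
    lanes (VPEs) each multiply one component by itself, then the scalar
    unit sums the three products and takes the square root."""
    # Vector stage: one multiply per VPE (parallel in hardware).
    squares = [c * c for c in coord]
    # Scalar stage: accumulate the partial products, then sqrt.
    return math.sqrt(sum(squares))
```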
Fig. 7 is an embodiment of the invention describing one possible arrangement for performing an N-dimensional square-matrix multiplication with this device, for example N = 16. Suppose the multiplication of matrix A and matrix B is to produce matrix C, the matrix operation unit in the figure has 256 basic processing elements, each responsible for computing one element of the final result, and the matrix data needed by the operation are stored in the intermediate-result memory module. The operation begins by fetching the operands of A from the intermediate-result memory module into the buffer memory module, from which the data are fed row by row into the basic processing elements (MPEs) of the matrix operation unit. The operands of matrix B are likewise fetched into the buffer memory module and fed, under instruction scheduling, column by column into the PEs step by step. Each PE multiplies its A value with its B value; instead of sending out each product, the PE accumulates it with the previous result held in its local register. After all the values of B have been fed in, the value held by each PE is exactly the corresponding element of the final matrix C. The data of C are then stored in the intermediate-result memory module or kept in the buffer memory module, as needed.
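The accumulation scheme of Fig. 7 can be modelled as follows: PE (i, j) owns output element C[i][j] and, over N steps, multiply-accumulates one A value against one B value into a local register initialized to zero. The sketch below models only this per-PE accumulation, under the assumption of an N x N grid; the actual streaming of rows and columns through the buffer is abstracted away.

```python
def pe_array_matmul(A, B):
    """Model of the Fig. 7 scheme: PE (i, j) accumulates A[i][k] * B[k][j]
    in a local register over the k steps and holds the finished element
    C[i][j] at the end (all registers start at 0)."""
    n = len(A)
    # One accumulator register per PE, initialized to zero.
    C = [[0] * n for _ in range(n)]
    for k in range(n):           # step k: next A value per row, B value per column
        for i in range(n):
            for j in range(n):
                C[i][j] += A[i][k] * B[k][j]   # each PE multiply-accumulates in place
    return C
```

Because partial products never leave the PEs, only the finished C elements are written back, which is what keeps the data-movement cost low in the described arrangement.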
Fig. 8 is an embodiment of the invention describing the configuration and operation, on this device, of a SLAM algorithm based on the extended Kalman filter (EKF). The EKF algorithm can be roughly divided into three major steps: Compute True Data, EKF Predict, and EKF Update. In Compute True Data, the true coordinates are obtained through the motion model. In EKF Predict, the new robot pose is predicted from the previous estimate and the updated control input. In EKF Update, the associations with the environment reference points are computed, and the predicted pose and covariance matrix are updated. Compute True Data mainly involves low-dimensional vector processing, such as the Euclidean distance of three-dimensional coordinates, so most of it can run on the vector operation unit; it also involves typical scalar operations such as trigonometric functions of angles, so a small amount of work is done on the scalar operation unit. EKF Predict repeatedly involves fairly large matrix operations such as matrix multiplication; to obtain good acceleration, this part can be placed on the matrix operation unit, while the smaller vector operations it contains still use the vector operation unit. EKF Update has the most operation types, with various operations interleaved, such as the typical matrix SVD (singular value decomposition) and Cholesky decomposition; these are composed of smaller operations such as matrix multiplication, vector addition and subtraction, vector norms, and trigonometric functions, and use the matrix, vector, and scalar operation units simultaneously. From the storage point of view, the input of the EKF-based SLAM algorithm consists of the coordinates of points such as waypoints (path points) and landmarks (environment reference points); the data volume is small, so these data only need to be loaded from the input memory module at the start. During the intermediate computation, given the storage design, the data volume normally does not exceed the size of the intermediate-result memory module, so frequent data exchange with the input memory module is generally unnecessary, reducing energy consumption and run time. Finally, the SLAM algorithm outputs the computed results to the output memory module, completing the hardware configuration and realization of the whole algorithm.
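The Predict/Update split of the EKF can be illustrated with a toy one-dimensional sketch. The real algorithm operates on pose vectors and covariance matrices (the matrix unit's workload), with vector and trigonometric scalar operations around them; here scalars stand in for those structures, and all symbols (x, P, u, q, z, r, K) are the usual textbook Kalman quantities, not anything specific to this device.

```python
def ekf_predict(x, P, u, q):
    """Predict: propagate state x by control u; grow covariance P by process noise q."""
    return x + u, P + q

def ekf_update(x, P, z, r):
    """Update: fuse measurement z (with noise r) into the predicted state."""
    K = P / (P + r)        # Kalman gain (a matrix inversion in the full EKF)
    x = x + K * (z - x)    # corrected state
    P = (1 - K) * P        # corrected covariance
    return x, P
```

Even in this scalar form the structure mirrors the text: Predict is a cheap propagation, while Update contains the gain computation whose matrix form (inversions, decompositions) dominates the accelerator's matrix-unit workload.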
Fig. 9 is a schematic diagram of the instruction types provided by the invention.
The instruction set of the invention includes several classes: the control instruction class, the data operation instruction class, the macro operation instruction class, the multidimensional data operation instruction class, and the one-dimensional data operation instruction class. Each instruction class can be subdivided into a variety of instructions, each distinguished by its leading instruction encoding; as shown in Fig. 9, several representative instructions and their encodings are selected and listed for each class.
The control instruction class is mainly used to control the execution of the program. The instruction encoding JUMP denotes a jump instruction, used to perform jumps; according to the subsequent operation code, it can be divided into a direct jump instruction and an indirect jump instruction. The instruction encoding CB denotes a conditional branch instruction, used to perform conditional jumps.
The data operation instruction class is mainly used to control the transfer of data. The instruction encoding LD/ST is used to transfer data between DRAM (Dynamic Random Access Memory) and SRAM (Static Random Access Memory): LD loads data read from DRAM into SRAM, and ST transfers data in SRAM back to DRAM for storage. The instruction encoding MOV transfers data between SRAM locations. The instruction encoding RD/WR transfers data between SRAM and a BUFFER: RD reads data from SRAM into the BUFFER, and WR stores data in the BUFFER back into SRAM. The macro operation instruction class serves as coarse-grained data operation instructions for relatively complete operations.
The instruction encoding CONV denotes a convolution operation instruction, used to implement convolution and convolution-like operations, in which the input data are multiplied by their corresponding weights and the products are summed; the instruction also takes the local reusability of the data into account. Its specific execution process is as follows (see Fig. 14):
S1: As required by the instruction, fetch the image data starting from the start address of the image data, and fetch the weight data starting from the start address of the weight data.
S2: According to the operation requirements, transfer the image data into the corresponding multidimensional operation unit, and broadcast the weight data to each processing element (PE) in the multidimensional operation unit.
S3: Each PE multiplies the input image data by the corresponding weight data, adds the product to the data in the register inside the operation unit, and stores the result back into the register (the register must be initialized to 0).
S4: Image data already present in the multidimensional operation unit are transferred within the unit according to the transfer rules defined by the unit, while image data not yet in the unit are read from the BUFFER and transferred to the specified working location. This approach exploits the data reusability of the convolution operation, greatly reducing the number of data transfers.
S5: Repeat steps S3-S4 until the PE finishes its computation, then output the result to the destination address defined by the instruction.
S6: Re-read data and repeat the above operations until all pixels of the output image have been computed and saved; the instruction then terminates.
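The arithmetic that steps S1-S6 carry out can be sketched in plain Python as follows. This is a minimal functional model of what CONV computes (an aligned multiply-accumulate over each kernel window), not the patented hardware; the function name and loop structure are illustrative only.

```python
def conv2d(image, kernel):
    """Valid 2-D convolution: each output pixel is the sum of the
    element-wise products of the kernel and the window it covers
    (the multiply-accumulate of step S3, repeated as in S5-S6)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            acc = 0  # models the PE register, initialized to 0
            for dy in range(kh):
                for dx in range(kw):
                    acc += image[y + dy][x + dx] * kernel[dy][dx]
            row.append(acc)  # write-back to the destination address
        out.append(row)
    return out
```

Note that, as in the patent's description, the kernel is applied by aligned multiplication without flipping (i.e. cross-correlation in signal-processing terms).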
The instruction encoding POOL denotes a pooling operation instruction, used to implement pooling and pooling-like operations, i.e. averaging a specified amount of data, taking its maximum/minimum, or down-sampling it. Its execution flow is similar to that of the convolution operation instruction.
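As a functional sketch only (the window size `k` and the `mode` parameter are illustrative, not part of the instruction format), the pooling operations named above reduce to:

```python
def pool2d(image, k, mode="max"):
    """Non-overlapping k x k pooling: take the max, min, or average of each
    window (the 'specified amount of data' in the POOL instruction)."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(0, h - k + 1, k):
        row = []
        for x in range(0, w - k + 1, k):
            window = [image[y + dy][x + dx] for dy in range(k) for dx in range(k)]
            if mode == "max":
                row.append(max(window))
            elif mode == "min":
                row.append(min(window))
            else:  # "avg": arithmetic mean of the window
                row.append(sum(window) / len(window))
        out.append(row)
    return out
```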
The instruction encoding IMGACC denotes an image accumulation instruction, used to process an image and perform accumulation or similar computations. Its specific execution process is as follows (see Fig. 15):
S1: As required by the instruction, read the image data starting from the start address of the image data, and initialize all processing elements (PEs) in the multidimensional operation unit to 0.
S2: In each clock cycle, shift the existing data in the multidimensional operation unit up by one row, then feed a new row of data into the unit, adding the newly arrived row column-wise to the data of the former last row; the accumulated result becomes the new last row. Repeat this operation until the multidimensional operation unit is full.
S3: In each clock cycle, transfer and accumulate the data in the multidimensional operation unit to the right, column by column: in the first clock cycle the first column is transferred to the right and added into the second column, which saves the result; in the second clock cycle the second column is transferred to the right and added into the third column, which saves the result; and so on. This finally yields the integral accumulation result of the desired image.
S4: Save all data in the multidimensional operation unit to the destination address specified by the instruction, and cache the bottom row and the rightmost column of data.
S5: Initialize the multidimensional operation unit's data to 0 and start the next operation, until all images have been computed. Note that in subsequent operations, when the width or height of the image exceeds what the multidimensional operation unit can handle in a single pass, the data cached in every pass after the first must be accumulated in to guarantee a correct result.
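Steps S2 and S3 together compute a summed-area (integral) image: a column-wise running sum followed by a row-wise running sum. A minimal sketch, assuming the whole image fits in the unit (so the S4/S5 caching is not modeled):

```python
def integral_image(image):
    """Two-pass integral image: accumulate down the columns (step S2),
    then accumulate across the rows to the right (step S3)."""
    h, w = len(image), len(image[0])
    col = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            col[y][x] = image[y][x] + (col[y - 1][x] if y > 0 else 0)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = col[y][x] + (out[y][x - 1] if x > 0 else 0)
    return out
```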
The instruction encoding BOX denotes a filter instruction, used to perform the box filtering operation on an image. The algorithm works as follows: to obtain the sums of local rectangles of an image, first build an array A whose width and height equal those of the original image, then assign to each element A[i] the sum of all pixels in the rectangle formed by that point and the image origin; after that, any local rectangle sum can be obtained with add/subtract operations on only 4 elements of the A matrix. The macro-instruction is therefore divided into two main steps (see Fig. 16):
S1: Read the required data from the start address according to the instruction, transfer them into the multidimensional operation unit, accumulate the incoming data successively, and store the result at the defined destination address 1.
S2: Read the data from destination address 1 and perform add/subtract operations on them as required by the instruction to obtain the filtering result, which is saved to destination address 2 as the required final result.
Because the data accumulation phase resembles the convolution operation instruction and the data exhibit local reusability, this instruction also supports transferring data within the multidimensional operation unit.
The instruction encoding LOCALEXTERMA denotes a local extremum instruction, used to perform the local-extremum test when processing an image, i.e. to determine whether the data at a specified position is an extremum within its group of data. The macro-instruction is divided into two main steps (see Fig. 17):
S1: Initialize the register value in each PE of the multidimensional operation unit to a sufficiently small/large value, read data starting from the data start address, and pass them into the multidimensional operation unit; each PE then compares the incoming data with the data saved in its register, keeps the larger/smaller value, and saves it back into the register, until the specified data have all been compared. Each PE thus holds the maximum/minimum of its assigned data stream.
S2: According to the instruction, read the data at the specified position and pass it into the multidimensional operation unit again; each PE compares the incoming data with the maximum/minimum saved in its register, outputting 1 if they are identical and 0 otherwise.
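For one PE and the maximum case, the two steps reduce to the following sketch (the function name is illustrative; the minimum case is symmetric with `min` and a sufficiently large initial value):

```python
def is_local_maximum(stream, candidate):
    """Step S1: fold the data stream through max, starting from a
    sufficiently small register value. Step S2: output 1 iff the
    candidate equals the saved maximum, else 0."""
    reg = float("-inf")  # register initialized to a sufficiently small value
    for v in stream:
        reg = max(reg, v)
    return 1 if candidate == reg else 0
```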
The instruction encoding COUNTCMP denotes a compare-and-count instruction, used to perform comparison with a counter: the data to be compared and a threshold are read and transferred into the multidimensional operation unit, and each PE successively compares the incoming data stream against the threshold and counts, outputting, once the incoming data have been traversed, the number of data greater than or less than the threshold.
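Functionally, one PE's COUNTCMP behaviour is just a threshold comparison with two counters; returning both counts in a tuple is a choice of this sketch, not the instruction format:

```python
def count_cmp(stream, threshold):
    """Traverse the incoming data, comparing each element with the
    threshold; return (count above threshold, count below threshold)."""
    above = sum(1 for v in stream if v > threshold)
    below = sum(1 for v in stream if v < threshold)
    return above, below
```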
The multidimensional-data operation instruction class, one of the fine-grained arithmetic operation instruction classes, is mainly used to control arithmetic operations on multidimensional data. Multidimensional data comprise data of two or more dimensions, and the class includes operation instructions between multidimensional data and multidimensional data, one-dimensional vector data, or one-dimensional scalar data. Taking matrices as an example, MMmM is the matrix-matrix multiplication instruction, a kind of operation between multidimensional data and multidimensional data; similarly there is MMaM, the matrix-matrix addition instruction. MMmV is the matrix-vector multiplication instruction, a kind of operation between multidimensional data and one-dimensional vector data; similarly there is MMaV, the matrix-vector addition instruction. MMmS is the matrix-scalar multiplication instruction, a kind of operation between multidimensional data and one-dimensional scalar data; similarly there is MMaS, the matrix-scalar addition instruction. In addition, the multidimensional-data operation instruction class is also compatible with operations between one-dimensional data: for example, MVmV implements the multiplication of a one-dimensional vector with a one-dimensional vector, and MMoV implements the outer product of two one-dimensional vectors.
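The semantics of a few of the instructions named above can be sketched as plain functions; the function names are illustrative mappings from the instruction encodings, not APIs of the device:

```python
def mm_mul(A, B):
    """MMmM: matrix-matrix multiplication."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mv_mul(A, v):
    """MMmV: matrix times one-dimensional vector."""
    return [sum(A[i][k] * v[k] for k in range(len(v))) for i in range(len(A))]

def ms_mul(A, s):
    """MMmS: matrix times one-dimensional scalar."""
    return [[x * s for x in row] for row in A]

def vv_outer(u, v):
    """MMoV: outer product of two one-dimensional vectors (2-D result)."""
    return [[x * y for y in v] for x in u]
```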
The one-dimensional-data operation instruction class, one of the fine-grained arithmetic operation instruction classes, is mainly used to control arithmetic operations on one-dimensional data, which are further divided into one-dimensional vector data and one-dimensional scalar data. For example, VVmV is the vector-vector multiplication instruction, and similarly VVaV denotes the vector-vector addition instruction. VVmS is the vector-scalar multiplication instruction. SSsS denotes a one-dimensional scalar operation instruction, used to perform a root-extraction operation on the one-dimensional scalar. SSrS denotes the operation of generating a random number. MV is the move instruction, used to fetch a register or an immediate value during computation.
Fig. 10 shows an embodiment in which the macro-instruction CONV provided by the invention completes a two-dimensional convolution operation on a hardware structure. The computation of a two-dimensional convolution is as follows: for a two-dimensional input image, a convolution kernel slides over the input image; at each position the kernel filters the data of the two-dimensional image that it currently covers, i.e. the kernel and the covered image data are multiplied element-wise, the products are accumulated, and the required filter result is recorded. The kernel then slides to the next position, and the operation repeats until all operations are completed. Since convolution is used very widely and occurs in large quantities, the convolution operation designed in this patent makes full use of the data reusability on the hardware structure, distributing and transferring the data rationally so as to maximize hardware utilization. For clarity, a specific embodiment is given, as shown in Fig. 10. In this embodiment, the input is defined to be an image or matrix, and the output is likewise an image or matrix, stored in block form at a specified location. The hardware structure is exemplified by a matrix operation unit (MPU) containing m*n matrix processing elements (MPEs), each of which includes the required arithmetic units and registers for holding intermediate data. The specific operation process (see Fig. 18) is as follows:
S1: Read the macro-instruction of a convolution operation, which consists of an operation encoding and operands. The operation encoding is CONV, indicating that a convolution operation is to be performed. There are 7 operands: DA, SA1, SA2, IX, IY, KX, KY. DA is the destination address, i.e. the storage address of the output result; SA1 is start address 1, the start address for reading the image to be operated on; SA2 is start address 2, the start address for reading the convolution kernel to be operated on; IX and IY are the sizes of the image in the X and Y directions, i.e. these two variables define the size of the image to be operated on; KX and KY are the sizes of the convolution kernel.
S2: According to the instruction, read the input image data awaiting operation from the corresponding position in SRAM into the BUFFER; here each MPE in the MPU is required to compute one pixel of the output image.
S3: Transfer the corresponding input image data into each MPE. Since the convolution kernel is the same in every MPE during the operation, the kernel is broadcast to each MPE. Each MPE then multiplies the incoming input data by the corresponding convolution kernel data and saves the product in its own register.
S4: Because the operand data of the convolution operation exhibit local reusability, the input image data that a given MPE needs in the next beat are the data on which the MPE to its right is operating in the current beat; the input image data are therefore shifted successively to the left, and only the data needed by the rightmost MPE, which are not yet in the MPU, must be read again from the BUFFER. Once the data transfer is complete, each MPE multiplies the input image data by the corresponding convolution kernel data, adds the resulting product to the data in its register, and stores the sum back into the register.
S5: Repeat step S4 until all convolution kernel data have been combined with the corresponding input image data, at which point each MPE has obtained 1 pixel of the output image; output the results and save them at the location defined by the destination address in the instruction.
S6: Repeat the above steps until all pixels of the output image have been computed.
Using macro-instructions makes full use of the local reusability of the data, greatly reducing the number of data transfers and improving operation efficiency. For example, when m=3 and n=3, the structure can carry out the convolution operation for 9 pixels simultaneously, taking 9 clock cycles.
Similarly, we provide a large number of macro-instruction operations, such as convolution. Although the operations they complete could also be completed by other instruction types, the presence of macro-instruction operations makes the operation instructions more concise and efficient. In addition, macro-instructions handle data reuse well: they improve data utilization, reduce data transfers, lower power consumption, and improve performance.
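The leftward-shifting data reuse of steps S2-S5 can be modeled in one dimension as follows. This is a behavioural sketch under stated assumptions (a single row of `n_pe` MPEs, a 1-D signal and kernel; all names are illustrative), not the m*n hardware itself; its point is that within a group of output pixels only the rightmost MPE re-reads from the buffer each beat.

```python
def conv1d_shift(signal, kernel, n_pe):
    """Each of n_pe MPEs accumulates one output pixel. Every beat, one
    kernel tap is broadcast to all MPEs (S3), then the input window is
    shifted left by one position so only the rightmost MPE needs a
    fresh read from the BUFFER (S4)."""
    k = len(kernel)
    n_out = len(signal) - k + 1
    out = []
    for base in range(0, n_out, n_pe):
        pes = min(n_pe, n_out - base)              # active MPEs this group
        acc = [0] * pes                            # per-MPE registers, init 0
        window = [signal[base + p] for p in range(pes)]
        for tap in range(k):
            for p in range(pes):
                acc[p] += window[p] * kernel[tap]  # broadcast kernel tap
            nxt = base + pes + tap                 # only the rightmost MPE reads
            window = window[1:] + [signal[nxt] if nxt < len(signal) else 0]
        out.extend(acc)                            # each MPE yields one pixel (S5)
    return out
```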
Fig. 11 shows an embodiment of the multidimensional-data operation instructions provided by the invention, implementing the dot product between two one-dimensional vectors; similar operations such as vector multiplication, vector addition, and vector comparison all use a similar process. Each vector operation unit (VPU) contains mm vector processing elements (VPEs), and each VPE can operate on one pair of input data. The detailed operation process (see Fig. 19) is as follows: first, mm data awaiting operation are fed to the mm VPEs respectively; after each performs a multiplication, the product is stored in the register inside the VPE. Then another mm data awaiting operation are fed to the mm VPEs; after each performs a multiplication, the product is accumulated with the previous product held in the internal register, and the accumulated result is again stored temporarily in the internal register. These operations repeat until all inputs have been computed. The result of the vector operation unit is then passed leftward, starting from the rightmost VPE, which passes the data in its register directly to the VPE on its left; after a VPE receives data transferred from its right, it accumulates them with the data in its own internal register and passes the accumulated result further left, and so on. Finally, the dot-product result is obtained in the leftmost VPE and output as required.
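The VPU process above can be sketched as a two-phase model: a multiply-accumulate phase in which each VPE keeps a running partial sum, followed by a left-pass reduction. The parameter `n_vpe` stands in for the `mm` of the embodiment; the names are illustrative.

```python
def dot_product(a, b, n_vpe):
    """Phase 1: n_vpe VPEs multiply-accumulate strided chunks of the
    inputs into their internal registers. Phase 2: registers are passed
    leftward and accumulated; the result lands in the leftmost VPE."""
    regs = [0] * n_vpe
    for i in range(0, len(a), n_vpe):
        for p in range(n_vpe):
            if i + p < len(a):
                regs[p] += a[i + p] * b[i + p]
    # left-pass reduction, starting from the rightmost VPE
    for p in range(n_vpe - 1, 0, -1):
        regs[p - 1] += regs[p]
    return regs[0]
```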
Fig. 12 shows an embodiment provided by the invention, describing how the SIFT feature extraction algorithm is configured and realized on the present device. The SIFT (Scale-Invariant Feature Transform) feature extraction algorithm is one of the key operations of the RGBD SLAM algorithm. The first step builds the Gaussian image pyramid (Gaussian Pyramid), which contains basic image operations such as image smoothing and can be further decomposed on the present device into multiple convolution and pooling (down-sampling) operations. Next comes the difference-of-Gaussians (DOG) operation, which can be viewed as matrix subtraction between different levels of the image pyramid. Once the DOG operation is complete, the local-extremum search can be completed by calling the macro-instruction LOCAL EXTREMA. After the local-extremum search, key points are determined and filtered (KP filter); this step consists of a large number of vector and scalar operations, such as vector dot products and matrix determinants. Finally, the histogram of the neighboring points is computed through multiple vector and scalar operations to obtain the key-point descriptor (Key Point). The histogram computation can be completed by the macro-instruction HIST, which is composed of vector operations such as vector comparison. The rotation of the neighborhood pixel region is realized by matrix-vector multiplication. Certain special function operations, such as exponentiation, are realized mainly by the scalar operation unit.
Fig. 13 shows an embodiment provided by the invention, a schematic flow diagram describing how the G2O graph optimization algorithm is configured and realized on the present device. G2O is a framework for solving nonlinear graph optimization problems, on which many typical SLAM algorithms, such as RGBD SLAM and graph-based SLAM algorithms like ORB SLAM, are based. Given the pose constraint between two graph nodes and an initial pose, the computation of the error matrix and the Jacobian matrix can be completed by matrix operations and vector operations, such as matrix multiplication and accumulation. A linear system for the objective function to be optimized can then be built from the error matrix and the Jacobian matrix; this step can be completed by the matrix and vector operation units, again involving operations such as matrix multiplication and accumulation. This linear system is then solved, which we can realize with the Preconditioned Conjugate Gradient (PCG) algorithm (it could also be realized by Cholesky decomposition, sparse matrix methods, or upper triangular decomposition). The PCG operation can be decomposed into multiplications and additions of block matrices and vectors, and in a concrete realization can be carried out by the macro-instruction PCG. The final pose optimization can likewise be completed by operations such as matrix-vector multiplication and addition.
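To illustrate that PCG indeed decomposes into matrix-vector products and vector additions (as the paragraph above claims), here is a minimal textbook sketch with a Jacobi (diagonal) preconditioner. It is not the PCG macro-instruction of the device; the function signature, tolerance, and preconditioner choice are all assumptions of this example.

```python
def pcg(A, b, tol=1e-10, max_iter=100):
    """Preconditioned conjugate gradient for a symmetric positive-definite
    matrix A (list of rows), using M = diag(A) as preconditioner. Only
    matrix-vector products, dot products, and vector adds are used."""
    n = len(b)
    matvec = lambda M, v: [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    x = [0.0] * n
    r = b[:]                                  # residual b - A x, with x = 0
    z = [r[i] / A[i][i] for i in range(n)]    # apply M^-1
    p = z[:]
    rz = dot(r, z)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        if dot(r, r) < tol:
            break
        z = [r[i] / A[i][i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x
```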
The device and method of the embodiments of the present invention can be applied in the following scenarios (including but not limited to): data processing; electronic products such as robots, drones, autonomous driving, computers, printers, scanners, telephones, tablet computers, intelligent terminals, mobile phones, driving recorders, navigators, sensors, webcams, cloud servers, cameras, video cameras, projectors, watches, earphones, mobile storage, and wearable devices; vehicles of all kinds such as aircraft, ships, and automobiles; household appliances of all kinds such as televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; and medical devices of all kinds including nuclear magnetic resonance apparatuses, B-mode ultrasound scanners, and electrocardiographs.
The specific embodiments described above further elaborate the purpose, technical solutions, and beneficial effects of the present invention. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit the invention; any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of protection of the invention.

Claims (14)

1. A processor, characterized in that the processor comprises: a control section, a storage section, and an arithmetic section, the storage section comprising an input memory module, and the arithmetic section comprising a multidimensional operation unit, the multidimensional operation unit comprising multiple basic processing units and an internal register, wherein
the control section is configured to send reference image data to the corresponding multidimensional operation unit according to an operation instruction, and to broadcast first weight data to each basic processing unit in the corresponding multidimensional operation unit, the reference image data being a part of target image data;
the basic processing unit is configured to multiply the reference image data by the corresponding first weight data to obtain a product result, add the product result to the data in the internal register of the multidimensional operation unit to obtain an accumulation result, and store the accumulation result to the internal register of the multidimensional operation unit, the initial value of the internal register being 0, and to accumulate the image data in the multidimensional operation unit according to the transfer sequence of a preset transfer mode to obtain a target calculation result, the preset transfer mode being the transfer mode stored in the multidimensional operation unit;
the multidimensional operation unit is configured to obtain, according to the method for obtaining the target calculation result of the reference image data, the final calculation result of the target image, the final calculation result comprising the target calculation result, and to send the final calculation result to the storage section.
2. The processor according to claim 1, characterized in that
the storage section further comprises a final-result memory module, the final-result memory module being configured to store the final calculation result;
the control section is further configured to, according to the operation instruction, extract the reference image data from the address of the target image data in the input memory module, and extract the first weight data from the start address of the weight data in the input memory module, the first weight data corresponding to the reference image data, and the reference image data being a part of the target image data.
3. The processor according to claim 1, characterized in that the storage section comprises:
a buffer memory module, configured to buffer data;
an intermediate-result memory module, configured to store temporary operation result data; and/or
an instruction memory module, configured to store the instruction set during computation.
4. The processor according to claim 1, characterized in that the arithmetic section comprises:
an acceleration arithmetic unit, configured to execute target operations in SLAM-related algorithms and applications, the target operations comprising operations for accelerating and processing SLAM;
other arithmetic units, configured to execute other operations in SLAM-related algorithms and applications, the other operations being the operations in the SLAM-related algorithms and applications other than the target operations.
5. The processor according to claim 4, characterized in that the acceleration arithmetic unit comprises a vector operation unit and a matrix operation unit.
6. The processor according to claim 2, characterized in that the control section is configured to connect to each module in the storage section and to the arithmetic section, the control section comprising a FIFO queue and a control processor, wherein
the FIFO queue is configured to store control signals;
the control processor is configured to fetch a pending control signal, parse the pending control signal to obtain control logic, and control and coordinate the storage section and the arithmetic section according to the control logic.
7. The processor according to claim 3, characterized in that the instruction set comprises:
a control operation instruction class, for controlling the selection of the operation instruction to be executed, the control operation instruction class comprising jump instructions and branch instructions, the jump instructions comprising direct jump instructions and indirect jump instructions, and the branch instructions comprising conditional branch instructions;
a data operation instruction class, for controlling the transfer of data, the data operation instruction class comprising at least one of the following: an LD/ST instruction, for transferring data between DRAM and SRAM; a MOV instruction, for transferring data between SRAM locations; an RD/WR instruction, for transferring data between SRAM and a BUFFER;
a macro operation instruction class, for complete operations;
the macro operation instructions comprising at least one of the following: a convolution arithmetic instruction, a convolution operation instruction, an image accumulation operation instruction, an image BOX filtering operation instruction, a local extremum operation instruction, a counter comparison operation instruction, and/or a pooling operation instruction;
alternatively, the macro operation instruction class comprising at least one of the following:
a matrix-matrix multiplication instruction, a matrix-matrix addition instruction, a matrix-vector multiplication instruction, a matrix-vector addition instruction, a matrix-scalar multiplication instruction, a matrix-scalar addition instruction, a vector-vector multiplication instruction, and a vector-vector outer product instruction;
alternatively, the macro operation instruction class comprising at least one of the following:
a vector-vector multiplication instruction, a vector-vector addition instruction, a vector-scalar multiplication instruction, a vector-scalar addition instruction, a scalar root-extraction instruction, a scalar random-number instruction, and a move instruction;
a multidimensional-data operation instruction class, for controlling arithmetic operations on multidimensional data, the arithmetic operations on multidimensional data comprising arithmetic operations between multidimensional data and multidimensional data, arithmetic operations between multidimensional data and one-dimensional vector data, and arithmetic operations between multidimensional data and one-dimensional scalar data; and/or
a one-dimensional-data operation instruction class, for controlling arithmetic operations on one-dimensional data, the one-dimensional data comprising one-dimensional vectors and one-dimensional scalars.
8. The processor according to claim 7, characterized in that the processor further comprises an assembler which, during computation, selects an instruction type from the instruction set and executes the selected instruction type.
9. An operation method, characterized in that it is applied to a processor, the processor comprising: a control section, a storage section, and an arithmetic section, the storage section comprising an input memory module, an intermediate-result memory module, a buffer memory module, and a final-result memory module, the arithmetic section comprising a multidimensional operation unit, the multidimensional operation unit comprising multiple basic processing units and an internal register, the method comprising:
the control section, according to an operation instruction, extracting reference image data from the address of the target image data in the input memory module, and extracting first weight data from the start address of the weight data in the intermediate-result memory module, the first weight data corresponding to the reference image data, and the reference image data being a part of the target image data;
the control section, according to the operation instruction, sending the reference image data to the corresponding multidimensional operation unit, and broadcasting the first weight data to each basic processing unit in the corresponding multidimensional operation unit;
each basic processing unit in the multidimensional operation unit multiplying the reference image data by the corresponding first weight data to obtain a product result, adding the product result to the data in the internal register of the multidimensional operation unit to obtain an accumulation result, and storing the accumulation result to the internal register of the multidimensional operation unit, the initial value of the internal register being 0, the image data in the multidimensional operation unit being accumulated according to the transfer sequence of a preset transfer mode to obtain a target calculation result, the preset transfer mode being the transfer mode stored in the multidimensional operation unit;
according to the method for obtaining the target calculation result of the reference image data, obtaining the final calculation result of the target image, the final calculation result comprising the target calculation result, and storing the final calculation result to the final-result memory module.
10. The method according to claim 9, characterized in that the storage section further comprises an intermediate-result memory module, an instruction memory module, and a buffer memory module, and the method further comprises:
the buffer memory module buffering data;
the intermediate-result memory module storing temporary operation result data; and/or
the instruction memory module storing the instruction set during computation.
11. The method according to claim 10, wherein the instruction set comprises:
a control operation instruction class, for selecting the control of an operation instruction to be executed, the control operation instruction class comprising jump instructions and branch instructions, the jump instructions comprising direct jump instructions and indirect jump instructions, and the branch instructions comprising conditional branch instructions;
a data manipulation instruction class, for controlling the transfer of data; the data manipulation instruction class comprising at least one of the following: an LD/ST instruction, for transferring data between the DRAM and the SRAM;
a MOV instruction, for transferring data between SRAMs; an RD/WR instruction, for transferring data between the SRAM and the BUFFER;
a macro operation instruction class, for performing complete operations;
the macro operation instruction class comprising at least one of the following: a convolution operation instruction, an image accumulation operation instruction, an image BOX filter operation instruction, a local extremum operation instruction, a counter comparison operation instruction and/or a pooling operation instruction;
alternatively, the macro operation instruction class comprises at least one of the following:
a matrix-matrix multiplication instruction, a matrix-matrix addition instruction, a matrix-vector multiplication instruction, a matrix-vector addition instruction, a matrix-scalar multiplication instruction, a matrix-scalar addition instruction, a vector-vector multiplication instruction and a vector outer product instruction;
alternatively, the macro operation instruction class comprises at least one of the following:
a vector-vector multiplication instruction, a vector-vector addition instruction, a vector-scalar multiplication instruction, a vector-scalar addition instruction, a scalar extraction instruction, a scalar random-access instruction and a move instruction;
a multidimensional data operation instruction class, for controlling arithmetic operations on multidimensional data, the arithmetic operations on multidimensional data comprising arithmetic operations between multidimensional data and multidimensional data, arithmetic operations between multidimensional data and one-dimensional vector data, and arithmetic operations between multidimensional data and one-dimensional scalar data; and/or
a one-dimensional data operation instruction class, for controlling arithmetic operations on one-dimensional data, the one-dimensional data comprising one-dimensional vectors and one-dimensional scalars.
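The instruction-set taxonomy of claim 11, together with two of its macro operations written out explicitly, can be sketched in plain Python. The class and function names here are illustrative only and do not appear in the patent:

```python
from enum import Enum, auto

class InstrClass(Enum):
    """Illustrative grouping of the instruction classes named in claim 11."""
    CONTROL = auto()    # jump (direct/indirect) and conditional branch instructions
    DATA_MOVE = auto()  # LD/ST (DRAM<->SRAM), MOV (SRAM<->SRAM), RD/WR (SRAM<->BUFFER)
    MACRO = auto()      # convolution, BOX filter, local extremum, pooling, matrix/vector ops
    MULTIDIM = auto()   # multidim-multidim, multidim-vector, multidim-scalar arithmetic
    ONEDIM = auto()     # one-dimensional vector and scalar arithmetic

def matvec_mul(m, v):
    """Matrix-vector multiplication, one of the claimed macro operations."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def vec_add(a, b):
    """Vector-vector addition, another claimed macro operation."""
    return [x + y for x, y in zip(a, b)]
```

A single macro instruction of this kind would replace the loop nest that a scalar instruction set needs for the same result, which is the point of the "complete operation" wording in the claim.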
12. The method according to claim 11, wherein the processor further comprises an assembler, and the method further comprises:
the assembler selecting an instruction type from the instruction set during computation, and executing the instruction type selected from the instruction set.
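Claim 12's assembler step, choosing an instruction type from the set during computation, can be illustrated with a hypothetical selection rule keyed to operand dimensionality, mirroring the multidimensional/one-dimensional split of claim 11. The rule itself is an assumption for illustration; the patent does not disclose the selection criterion:

```python
def select_instruction_class(lhs_ndim, rhs_ndim):
    """Hypothetical rule: pick the operation-instruction class for a binary
    arithmetic op from the number of dimensions of its operands.
    lhs_ndim/rhs_ndim: 0 = scalar, 1 = vector, 2+ = multidimensional data."""
    if max(lhs_ndim, rhs_ndim) >= 2:
        # multidim-multidim, multidim-vector, or multidim-scalar arithmetic
        return "multidimensional data operation"
    # vector-vector, vector-scalar, or scalar arithmetic
    return "one-dimensional data operation"
```

Under this sketch, a matrix-vector product would be routed to the multidimensional class, while a vector-scalar addition would stay in the one-dimensional class.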
13. A processing apparatus, wherein the processing apparatus is configured to perform the method according to any one of claims 9-12.
14. An electronic device, wherein the electronic device comprises the processing apparatus according to any one of claims 1-8, and the electronic device is applied in at least one of the following: robots, unmanned aerial vehicles, autonomous driving, computers, printers, scanners, telephones, tablet computers, intelligent terminals, mobile phones, driving recorders, navigators, sensors, webcams, cloud servers, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices and other electronic products; aircraft, ships, vehicles and other means of transport; televisions, air conditioners, microwave ovens, refrigerators, electric rice cookers, humidifiers, washing machines, electric lights, gas stoves, range hoods and other household appliances; and medical equipment including nuclear magnetic resonance instruments, B-mode ultrasound scanners, electrocardiographs and other medical devices.
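Two of the image macro operations named in claim 11, BOX filtering and local-extremum detection, can be sketched as follows. The windowing and border handling shown are assumptions for illustration, since the claims name the operations but not their algorithms:

```python
def box_filter(img, k):
    """BOX-filter macro op: mean over a k x k window, valid region only
    (no padding). img is a list of equal-length rows of numbers."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(h - k + 1):
        row = []
        for j in range(w - k + 1):
            s = sum(img[i + di][j + dj] for di in range(k) for dj in range(k))
            row.append(s / (k * k))
        out.append(row)
    return out

def local_extremum(img, i, j):
    """Local-extremum macro op: True if pixel (i, j) is a strict maximum of
    its 3x3 neighborhood (neighbors outside the image are ignored)."""
    h, w = len(img), len(img[0])
    neigh = [img[i + di][j + dj]
             for di in (-1, 0, 1) for dj in (-1, 0, 1)
             if (di, dj) != (0, 0) and 0 <= i + di < h and 0 <= j + dj < w]
    return all(img[i][j] > v for v in neigh)
```

Operations like these are the inner loops of SLAM feature detection (e.g. SURF-style interest points cited in this family), which is why the device claims them as single macro instructions.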
CN201811653568.8A 2016-11-03 2016-11-03 SLAM operation device and method Active CN109684267B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811653568.8A CN109684267B (en) 2016-11-03 2016-11-03 SLAM operation device and method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610958847.XA CN108021528B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811653568.8A CN109684267B (en) 2016-11-03 2016-11-03 SLAM operation device and method

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201610958847.XA Division CN108021528B (en) 2016-11-03 2016-11-03 SLAM operation device and method

Publications (2)

Publication Number Publication Date
CN109684267A true CN109684267A (en) 2019-04-26
CN109684267B CN109684267B (en) 2021-08-06

Family

ID=62075642

Family Applications (12)

Application Number Title Priority Date Filing Date
CN201811545672.5A Pending CN109710558A (en) 2016-11-03 2016-11-03 SLAM arithmetic unit and method
CN201811653560.1A Active CN109726168B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811521818.2A Active CN109376112B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811653568.8A Active CN109684267B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201610958847.XA Active CN108021528B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811653558.4A Active CN109656867B (en) 2016-11-03 2016-11-03 SLAM arithmetic device and method
CN201811529500.9A Active CN109697184B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811654180.XA Pending CN109710559A (en) 2016-11-03 2016-11-03 SLAM arithmetic unit and method
CN201811521820.XA Active CN109376114B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811521819.7A Active CN109376113B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811529557.9A Active CN109634905B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811529556.4A Active CN109634904B (en) 2016-11-03 2016-11-03 SLAM operation device and method

Family Applications Before (3)

Application Number Title Priority Date Filing Date
CN201811545672.5A Pending CN109710558A (en) 2016-11-03 2016-11-03 SLAM arithmetic unit and method
CN201811653560.1A Active CN109726168B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811521818.2A Active CN109376112B (en) 2016-11-03 2016-11-03 SLAM operation device and method

Family Applications After (8)

Application Number Title Priority Date Filing Date
CN201610958847.XA Active CN108021528B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811653558.4A Active CN109656867B (en) 2016-11-03 2016-11-03 SLAM arithmetic device and method
CN201811529500.9A Active CN109697184B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811654180.XA Pending CN109710559A (en) 2016-11-03 2016-11-03 SLAM arithmetic unit and method
CN201811521820.XA Active CN109376114B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811521819.7A Active CN109376113B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811529557.9A Active CN109634905B (en) 2016-11-03 2016-11-03 SLAM operation device and method
CN201811529556.4A Active CN109634904B (en) 2016-11-03 2016-11-03 SLAM operation device and method

Country Status (2)

Country Link
CN (12) CN109710558A (en)
WO (1) WO2018082229A1 (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111290788B (en) * 2018-12-07 2022-05-31 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN111290789B (en) * 2018-12-06 2022-05-27 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN111079915B (en) * 2018-10-19 2021-01-26 中科寒武纪科技股份有限公司 Operation method, device and related product
CN110058884B (en) * 2019-03-15 2021-06-01 佛山市顺德区中山大学研究院 Optimization method, system and storage medium for computational storage instruction set operation
CN110991291B (en) * 2019-11-26 2021-09-07 清华大学 Image feature extraction method based on parallel computing
CN113112481B (en) * 2021-04-16 2023-11-17 北京理工雷科电子信息技术有限公司 Hybrid heterogeneous on-chip architecture based on matrix network
CN113177211A (en) * 2021-04-20 2021-07-27 深圳致星科技有限公司 FPGA chip for privacy computation, heterogeneous processing system and computing method
CN113342671B (en) * 2021-06-25 2023-06-02 海光信息技术股份有限公司 Method, device, electronic equipment and medium for verifying operation module
CN113395551A (en) * 2021-07-20 2021-09-14 珠海极海半导体有限公司 Processor, NPU chip and electronic equipment
US20230056246A1 (en) * 2021-08-03 2023-02-23 Micron Technology, Inc. Parallel matrix operations in a reconfigurable compute fabric
CN113792867A (en) * 2021-09-10 2021-12-14 中科寒武纪科技股份有限公司 Arithmetic circuit, chip and board card
CN117093816B (en) * 2023-10-19 2024-01-19 上海登临科技有限公司 Matrix multiplication operation method and device and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012012819A1 (en) * 2010-07-26 2012-02-02 Commonwealth Scientific And Industrial Research Organisation Three dimensional scanning beam system and method
CN103640018A (en) * 2013-12-13 2014-03-19 江苏久祥汽车电器集团有限公司 SURF (speeded up robust feature) algorithm based localization method and robot
CN104330090A (en) * 2014-10-23 2015-02-04 北京化工大学 Robot distributed type representation intelligent semantic map establishment method

Family Cites Families (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS60201472A (en) * 1984-03-26 1985-10-11 Nec Corp Matrix product computing device
US5666300A (en) * 1994-12-22 1997-09-09 Motorola, Inc. Power reduction in a data processing system using pipeline registers and method therefor
JPH09230954A (en) * 1996-02-28 1997-09-05 Olympus Optical Co Ltd Vector standardizing device
US7454451B2 (en) * 2003-04-23 2008-11-18 Micron Technology, Inc. Method for finding local extrema of a set of values for a parallel processing element
US7664810B2 (en) * 2004-05-14 2010-02-16 Via Technologies, Inc. Microprocessor apparatus and method for modular exponentiation
US7814297B2 (en) * 2005-07-26 2010-10-12 Arm Limited Algebraic single instruction multiple data processing
US8051124B2 (en) * 2007-07-19 2011-11-01 Itt Manufacturing Enterprises, Inc. High speed and efficient matrix multiplication hardware module
CN101609715B (en) * 2009-05-11 2012-09-05 中国人民解放军国防科学技术大学 Matrix register file with separated row-column access ports
KR101395260B1 (en) * 2009-11-30 2014-05-15 라코르스 게엠바하 Microprocessor and method for enhanced precision sum-of-products calculation on a microprocessor
KR101206213B1 (en) * 2010-04-19 2012-11-28 인하대학교 산학협력단 High speed slam system and method based on graphic processing unit
CN102012893B (en) * 2010-11-25 2012-07-18 中国人民解放军国防科学技术大学 Extensible vector operation device
CN101986264B (en) * 2010-11-25 2013-07-31 中国人民解放军国防科学技术大学 Multifunctional floating-point multiply and add calculation device for single instruction multiple data (SIMD) vector microprocessor
CN102156637A (en) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 Vector crossing multithread processing method and vector crossing multithread microprocessor
CN102353379B (en) * 2011-07-06 2013-02-13 上海海事大学 Environment modeling method applicable to navigation of automatic piloting vehicles
EP3373105B1 (en) * 2012-03-30 2020-03-18 Intel Corporation Apparatus and method for accelerating operations in a processor which uses shared virtual memory
US9013490B2 (en) * 2012-05-17 2015-04-21 The United States Of America As Represented By The Administrator Of The National Aeronautics Space Administration Hilbert-huang transform data processing real-time system with 2-D capabilities
CN102750127B (en) * 2012-06-12 2015-06-24 清华大学 Coprocessor
CN103208000B (en) * 2012-12-28 2015-10-21 青岛科技大学 Based on the Feature Points Extraction of local extremum fast search
CN103150596B (en) * 2013-02-22 2015-12-23 百度在线网络技术(北京)有限公司 The training system of a kind of reverse transmittance nerve network DNN
CN104252331B (en) * 2013-06-29 2018-03-06 华为技术有限公司 Multiply-accumulator
US9449675B2 (en) * 2013-10-31 2016-09-20 Micron Technology, Inc. Apparatuses and methods for identifying an extremum value stored in an array of memory cells
CN103677741A (en) * 2013-12-30 2014-03-26 南京大学 Imaging method based on NCS algorithm and mixing precision floating point coprocessor
CN103955447B (en) * 2014-04-28 2017-04-12 中国人民解放军国防科学技术大学 FFT accelerator based on DSP chip
CN105212922A (en) * 2014-06-11 2016-01-06 吉林大学 The method and system that R wave of electrocardiosignal detects automatically are realized towards FPGA
US9798519B2 (en) * 2014-07-02 2017-10-24 Via Alliance Semiconductor Co., Ltd. Standard format intermediate result
CN104317768B (en) * 2014-10-15 2017-02-15 中国人民解放军国防科学技术大学 Matrix multiplication accelerating method for CPU+DSP (Central Processing Unit + Digital Signal Processor) heterogeneous system
KR102374160B1 (en) * 2014-11-14 2022-03-14 삼성디스플레이 주식회사 A method and apparatus to reduce display lag using scailing
CN104391820B (en) * 2014-11-25 2017-06-23 清华大学 General floating-point matrix processor hardware structure based on FPGA
CN104574508A (en) * 2015-01-14 2015-04-29 山东大学 Multi-resolution model simplifying method oriented to virtual reality technology
CA2919901A1 (en) * 2015-02-04 2016-08-04 Hossein Sadjadi Methods and apparatus for improved electromagnetic tracking and localization
CN104851094A (en) * 2015-05-14 2015-08-19 西安电子科技大学 Improved method of RGB-D-based SLAM algorithm
CN104899182B (en) * 2015-06-09 2017-10-31 中国人民解放军国防科学技术大学 A kind of Matrix Multiplication accelerated method for supporting variable partitioned blocks
CN104915322B (en) * 2015-06-09 2018-05-01 中国人民解放军国防科学技术大学 A kind of hardware-accelerated method of convolutional neural networks
CN105528082B (en) * 2016-01-08 2018-11-06 北京暴风魔镜科技有限公司 Three dimensions and gesture identification tracking exchange method, device and system


Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
DANIEL TÖRTEI TERTEI: "FPGA design of EKF block accelerator for 3D visual SLAM", Computers and Electrical Engineering *
YU SHUMEI: "Microcomputer Principles, Interface Technology and Their Applications", 30 September 1990, China Radio and Television Press *
KUANG GANGYAO: "Fundamental Theory and Applications of Polarimetric Synthetic Aperture Radar", 30 June 2011, National University of Defense Technology Press *
YAO SHOUWEN: "Optimal Design of Mechanical Structures", 30 September 2015, Beijing Institute of Technology Press *
YAO RUI: "Robust Object Detection and Tracking in Complex Environments", 30 June 2015, China University of Mining and Technology Press *
ZHANG QINGJIE: "Learn DSP Step by Step: Based on the MS320F28335", 31 January 2015, Beihang University Press *
YANG XINGJIANG: "New Fundamentals of Computer Applications", 31 August 2009, University of Electronic Science and Technology of China Press *
YANG HOUHUA: "Metalworking Practice", 28 February 2009, Guizhou University Press *
JIANG YAN: "PLC Technology and Applications: Mitsubishi FX Series", 31 January 2013, China Railway Publishing House *
CHEN WENBAI: "Principles and Practice of Artificial Neural Networks", 31 January 2016, Xidian University Press *

Also Published As

Publication number Publication date
CN109697184B (en) 2021-04-09
CN109726168B (en) 2021-09-21
CN109376113B (en) 2021-12-14
WO2018082229A1 (en) 2018-05-11
CN109376114A (en) 2019-02-22
CN108021528B (en) 2020-03-13
CN109684267B (en) 2021-08-06
CN109634904B (en) 2023-03-07
CN109710559A (en) 2019-05-03
CN109376112B (en) 2022-03-15
CN109726168A (en) 2019-05-07
CN109634905B (en) 2023-03-10
CN109710558A (en) 2019-05-03
CN109634904A (en) 2019-04-16
CN108021528A (en) 2018-05-11
CN109376114B (en) 2022-03-15
CN109634905A (en) 2019-04-16
CN109376113A (en) 2019-02-22
CN109656867B (en) 2023-05-16
CN109376112A (en) 2019-02-22
CN109697184A (en) 2019-04-30
CN109656867A (en) 2019-04-19

Similar Documents

Publication Publication Date Title
CN109684267A (en) SLAM arithmetic unit and method
US11922132B2 (en) Information processing method and terminal device
KR102258414B1 (en) Processing apparatus and processing method
CN109240746B (en) Apparatus and method for performing matrix multiplication operation
Banz et al. Real-time semi-global matching disparity estimation on the GPU
CN107632965B (en) Restructural S type arithmetic unit and operation method
KR20190003610A (en) Apparatus and method for performing convolution neural network forward operation
KR20190004306A (en) Devices and Methods for Running Convolution Neural Network Training
Saegusa et al. How fast is an FPGA in image processing?
Colleman et al. High-utilization, high-flexibility depth-first CNN coprocessor for image pixel processing on FPGA
CN111783966A (en) Hardware device and method of deep convolutional neural network hardware parallel accelerator
CN111860814B (en) Apparatus and method for performing batch normalization operations
CN111860772B (en) Device and method for executing artificial neural network mapping operation
CN116185377A (en) Optimization method and device for calculation graph and related product
CN112817898A (en) Data transmission method, processor, chip and electronic equipment
JP5045652B2 (en) Correlation processing device and medium readable by correlation processing device
Chenini An embedded FPGA architecture for efficient visual saliency based object recognition implementation
CN116185378A (en) Optimization method of calculation graph, data processing method and related products
CN117933314A (en) Processing device, processing method, chip and electronic device
Jung et al. Massively parallel processors in real-time applications

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 644, No. 6 South Road, Academy of Sciences, Beijing 100000

Applicant after: Zhongke Cambrian Technology Co., Ltd

Address before: Room 644, No. 6 South Road, Academy of Sciences, Beijing 100000

Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant