CN113536024A - ORB-SLAM relocation feature point retrieval acceleration method based on FPGA - Google Patents
- Publication number: CN113536024A
- Application number: CN202110918561.XA
- Authority
- CN
- China
- Prior art keywords
- slam
- output
- fpga
- orb
- feature point
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/50—Adding; Subtracting
- G06F7/501—Half or full adders, i.e. basic adder cells for one denomination
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06M—COUNTING MECHANISMS; COUNTING OF OBJECTS NOT OTHERWISE PROVIDED FOR
- G06M1/00—Design features of general application
- G06M1/27—Design features of general application for representing the result of count in the form of electric signals, e.g. by sensing markings on the counter drum
- G06M1/272—Design features of general application for representing the result of count in the form of electric signals, e.g. by sensing markings on the counter drum using photoelectric means
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides an FPGA-based ORB-SLAM relocation feature point retrieval acceleration method, which comprises the following steps: S1, buffering the input picture and extracting descriptors; S2, entering the working space (Workspace), where computing circuits calculate the distances to the nodes; S3, the results of all computing circuits flow into a parallel comparison circuit, which finds the node with the minimum value; S4, finally, judging whether that node is in the bottom layer; if so, the search finishes and the final node is obtained; S5, each node carries an offset value used to find the address of its child nodes and obtain the key frames, after which relocation is performed on the key frame set. To reduce circuit resource consumption, the invention places an approximation unit (AU) in front of the counter, forming the circuit structure of an accumulating parallel counter (APC). Based on the principle of approximate computing, this reduces hardware resource consumption and raises circuit computation speed when the bit stream is long and many identical structures must be replicated.
Description
Technical Field
The invention relates to the field of image processing, in particular to an FPGA-based ORB-SLAM relocation feature point retrieval acceleration method.
Background
When SLAM navigation and positioning fail, the system starts relocation: based on the frame currently captured by the system, similar key frames are found in the frame library according to the frame's feature points, the map point matches of the current frame are updated, and relocation is performed. In the existing ORB relocation algorithm, the frame library is organized as a k-means tree, as shown in fig. 1. The feature points of all training pictures are extracted and computed into descriptors, and k-means clustering is applied d times to obtain a retrieval tree of degree k and depth d. The center of each cluster is represented by its mean descriptor, and the associated frames are recorded in the bottom-layer classes. Each relocation therefore finds, through the retrieval tree, the class to which every feature point (a 256-bit descriptor) extracted from the current frame belongs: at each layer the Hamming distance to all child nodes is computed, and the minimum determines the direction of the next layer of the search path, down to the last layer. Similarity scores are then computed from the associated frames found, and key frames are screened out to complete the relocation computation.
However, when the number of frames in the library becomes huge, the depth and breadth of the retrieval tree grow and feature point retrieval slows accordingly. Although the ORB algorithm maintains a forward index for fast retrieval, real-time performance is difficult to achieve with a huge frame library.
Disclosure of Invention
The invention aims to solve at least the above technical problems of the prior art, and in particular provides an FPGA-based ORB-SLAM relocation feature point retrieval acceleration method.
In order to achieve the above object, the present invention provides an FPGA-based ORB-SLAM relocation feature point retrieval acceleration method, which comprises:
S1, buffering the input picture and extracting descriptors;
S2, entering the working space (Workspace), where computing circuits calculate the distances to the nodes;
S3, the results of all computing circuits flow into a parallel comparison circuit, which finds the node with the minimum value;
S4, finally, judging whether that node is in the bottom layer; if so, the search finishes and the final node is obtained;
S5, each node carries an offset value used to find the address of its child nodes and obtain the key frames, after which relocation is performed on the key frame set.
In a preferred embodiment of the present invention, the calculation circuit includes:
first, the data passes through exclusive-OR gates, then through either an accumulating parallel counter (APC) or a parallel counter (PC);
the accumulating parallel counter APC adds X levels of approximation units (AU) in front of the counter PC;
the approximation unit AU comprises: a first-level AU approximation unit, which is a column of AND gates and OR gates;
the parallel counter PC comprises a plurality of full adders.
In a preferred embodiment of the present invention, the parallel counter PC includes:
every three bits of the same weight are sent as a group to a full adder; each full adder has a weight, its three inputs share that weight, its sum output Sum sends its value to a full adder of the same weight, and its carry output Cout sends its value to the next higher weight. The intermediate results are registered for one clock ('beaten'), the full adders whose inputs are all present are then computed, and the intermediate results are registered again, until the highest bit of the output result is computed; each final output wire carries one bit.
In a preferred embodiment of the present invention, the parallel counter PC further includes:
let v denote the output bit width, and let the input bit width be N = 2^v; for each added output bit, the number of full adders consumed becomes 2 times the previous count plus v − 1, i.e.:
f(v) = 2*f(v-1) + v - 1
where f(v) denotes the number of full adders required for output bit width v, and f(v-1) the number required for output bit width v-1.
In a preferred embodiment of the invention, the gate level resources consumed by the parallel counter PC are:
g(v) = (2^v − v − 1)*5 = (N − log2 N − 1)*5
where N represents the input bit width and v represents the output bit width.
In a preferred embodiment of the present invention, the accumulating parallel counter APC adds X levels of approximation units AU in front of the full adders, and the consumed gate-level resources are:
where N represents the input bit width and X represents the number of levels of the approximation unit.
In a preferred embodiment of the present invention, the full adder includes:
two exclusive-OR gates, two AND gates and one OR gate, with the following logic expressions:
Sum=(A^B)^Cin
Cout=(A&B)|((A^B)&Cin)
where A, B and Cin are the inputs of the full adder, namely the two addends and the carry from the adjacent lower bit; Sum and Cout are the outputs, namely the local sum and the carry to the adjacent higher bit; ^ denotes the exclusive-OR operation, & the AND operation, and | the OR operation.
In a preferred embodiment of the present invention, step S5 includes the following steps:
S51, randomly selecting I feature points from the key frame set, where I is a positive integer greater than or equal to 1, and calculating the pose (α, γ) of the current frame, where α denotes the rotation angle and γ the translation amount;
S52, calculating the reprojection errors of the remaining key frames from the pose (α, γ) of step S51; if a calculated reprojection error is less than or equal to the set error threshold, the point is a key point;
S53, counting the number of key points and the corresponding poses (α, γ);
S54, locally optimizing the pose of the current frame with the pose (α, γ) of step S53 as the initial pose value, the optimization objective function being:
where e_x is the x-th reprojection error observed by the camera, ‖·‖ denotes the norm, h_x is the number of observations, and O denotes the number of reprojections observed by the camera;
S55, if the number of key points after optimization exceeds the set number, relocation is considered successful.
In a preferred embodiment of the present invention, the method for calculating the pose (α, γ) of the current frame in step S51 is:
where I denotes the total number of feature points in the reference frame;
(X_i, Y_i) denotes the position coordinates of the i-th feature point in the current frame;
(X_j, Y_j) denotes the position coordinates of the j-th feature point in the current frame, j ≠ i;
(X_i′, Y_i′) denotes the position coordinates, in the reference frame, of the point corresponding to the i-th feature point of the current frame;
(X_j′, Y_j′) denotes the position coordinates, in the reference frame, of the point corresponding to the j-th feature point of the current frame;
(x_0, y_0) denotes the reference starting point;
[X_i − x_0, Y_i − y_0] denotes the vector of the i-th feature point in the current frame;
[X_j − x_0, Y_j − y_0] denotes the vector of the j-th feature point in the current frame;
|X_i − x_0, Y_i − y_0| denotes the distance value of the i-th feature point in the current frame;
|X_j − x_0, Y_j − y_0| denotes the distance value of the j-th feature point in the current frame.
In a preferred embodiment of the present invention, the reprojection errors of the remaining key frames in step S52 are calculated as:
where ε denotes the balance coefficient;
S_(α,γ) denotes the degree of shift of the pose (α, γ) over the remaining key frames;
K_k denotes the reprojection error of the k-th remaining key frame;
when K_k ≤ τ, where τ denotes the set error threshold, the selected feature point is a key point;
when K_k > τ, the selected feature point is not a key point.
In summary, owing to the adoption of the above technical scheme, the invention has the following beneficial effects:
1) For computing the Hamming distance of two 256-bit feature points, the invention adopts the circuit structure of a parallel counter PC, forming pipelined accelerated computation.
2) To reduce circuit resource consumption, the invention places the approximation unit AU in front of the counter, forming the circuit structure of the accumulating parallel counter APC. Based on the principle of approximate computing, this reduces hardware resource consumption and raises circuit computation speed when the bit stream is long and many identical structures must be replicated.
3) Exploiting the coarse clustering granularity of the upper layers of the k-means tree and the fine granularity of the lower layers, the upper layers adopt multi-level approximation units while the lower layers adopt one level or none.
4) Because the distance calculation of each child node is independent, the design uses k Workers, where k is the number of clusters; each Worker holds a copy of the acceleration circuit and controls, in parallel, the reading of its own input data and its own calculation.
5) The invention stores the feature points of the frame library in a dynamic random access memory (DRAM); exploiting the read/write parallelism of the memory banks in the DRAM, the child nodes are placed in different banks to raise the data reading speed, and each Worker manages one bank.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a block diagram of a prior art feature point search tree;
FIG. 2 is an overall block diagram of the present invention;
FIG. 3 is a schematic diagram of a Hamming distance parallel counter circuit according to the present invention (16-bit input is an example);
FIG. 4 is a schematic diagram of a gate level circuit for a full adder;
FIG. 5 is a schematic diagram of a parallel counting circuit employing an approximation unit;
FIG. 6 is a diagram illustrating a storage form of search tree data in the DRAM.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As shown in fig. 1, a retrieval tree is trained offline from all feature points of the frame library; each feature point is represented by a 256-bit descriptor, and each tree node by the mean of all descriptors belonging to its class. During retrieval, the class a point belongs to is found layer by layer, i.e., the minimum Hamming distance is found, down to the lowest layer. The feature points are feature points of a specific picture, while the nodes are nodes of the tree structure and generally represent the mean of their class.
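The layer-by-layer descent described above can be sketched in software (a minimal illustration, not the patent's hardware design; the `Node` layout and the 4-bit toy descriptors are assumptions chosen for brevity):

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two descriptors held as integers."""
    return bin(a ^ b).count("1")

class Node:
    def __init__(self, center, children=None, frames=None):
        self.center = center            # cluster mean descriptor
        self.children = children or []  # empty at the bottom layer
        self.frames = frames or []      # associated key frames (leaves only)

def lookup(root: Node, query: int):
    """Descend the tree, always taking the child at minimum Hamming distance."""
    node = root
    while node.children:
        node = min(node.children, key=lambda c: hamming(c.center, query))
    return node.frames

# Toy 2-layer tree with 4-bit "descriptors" (the real tree uses 256 bits).
leaf_a = Node(0b0001, frames=["frame1"])
leaf_b = Node(0b1110, frames=["frame2"])
root = Node(0b0000, children=[leaf_a, leaf_b])
```

A query descriptor close to `0b0001` ends up in `leaf_a` and returns its associated frames.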
The invention provides an FPGA-based ORB-SLAM relocation feature point retrieval acceleration method, which comprises the following steps:
S1, buffering the input picture and extracting descriptors;
S2, entering the working space (Workspace), where computing circuits calculate the distances to the nodes;
S3, the results of all computing circuits flow into a parallel comparison circuit, which finds the node with the minimum value;
S4, finally, judging whether that node is in the bottom layer; if so, the search finishes and the final node is obtained;
S5, each node carries an offset value used to find the address of its child nodes and obtain the key frames, after which relocation is performed on the key frame set.
In a preferred embodiment of the present invention, the calculation circuit includes:
firstly, the data passes through an exclusive-OR gate and then passes through an accumulation parallel counter APC or a parallel counter PC;
the accumulated parallel counter APC is added with an X-level approximate unit AU in front of the counter PC;
the approximation unit AU comprises: the first-level AU approximate unit is an AND gate and an OR gate of a column;
the parallel counter PC comprises a plurality of full adders.
In a preferred embodiment of the present invention, the parallel counter PC includes:
every three bits of the same weight are sent as a group to a full adder; each full adder has a weight, its three inputs share that weight, its sum output Sum sends its value to a full adder of the same weight, and its carry output Cout sends its value to the next higher weight. The intermediate results are registered for one clock ('beaten'), the full adders whose inputs are all present are then computed, and the intermediate results are registered again, until the highest bit of the output result is computed; each final output wire carries one bit.
In a preferred embodiment of the present invention, the parallel counter PC further includes:
let v denote the output bit width, and let the input bit width be N = 2^v; for each added output bit, the number of full adders consumed becomes 2 times the previous count plus v − 1, i.e.:
f(v) = 2*f(v-1) + v - 1
where f(v) denotes the number of full adders required for output bit width v, and f(v-1) the number required for output bit width v-1.
In a preferred embodiment of the invention, the gate level resources consumed by the parallel counter PC are:
g(v) = (2^v − v − 1)*5 = (N − log2 N − 1)*5
where N represents the input bit width and v represents the output bit width.
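The adder-count recurrence and the closed form in g(v) can be cross-checked numerically (a sketch; f(2) = 1 and the factor of 5 gates per full adder are taken from the text above):

```python
def f(v: int) -> int:
    """Full adders needed for output bit width v: f(v) = 2*f(v-1) + (v-1), f(2) = 1."""
    return 1 if v == 2 else 2 * f(v - 1) + (v - 1)

def g(v: int) -> int:
    """Gate-level resources: 5 gates per full adder (2 XOR, 2 AND, 1 OR)."""
    return (2 ** v - v - 1) * 5

# The recurrence matches the closed form 2^v - v - 1 used in g(v).
for v in range(2, 10):
    assert f(v) == 2 ** v - v - 1
```

For a 16-bit input (v = 4) this gives 11 full adders and 55 gates; for a 256-bit input (v = 8), 247 full adders.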
In a preferred embodiment of the present invention, the accumulating parallel counter APC adds X levels of approximation units AU in front of the full adders, and the consumed gate-level resources are:
where N represents the input bit width and X represents the number of levels of the approximation unit.
In a preferred embodiment of the present invention, the full adder includes:
two exclusive-OR gates, two AND gates and one OR gate, with the following logic expressions:
Sum=(A^B)^Cin
Cout=(A&B)|((A^B)&Cin)
where A, B and Cin are the inputs of the full adder, namely the two addends and the carry from the adjacent lower bit; Sum and Cout are the outputs, namely the local sum and the carry to the adjacent higher bit; ^ denotes the exclusive-OR operation, & the AND operation, and | the OR operation.
FIG. 2 is the overall acceleration structure diagram of the invention. The input query is a picture carrying many feature points for search and query; since the inputs are many and cannot all be sent at once, a queue with FIFO behavior is used for storage. The input module buffers the input pictures and the descriptors extracted from them;
Workspace is the main working space, in which each Worker controls, in parallel, the distance computation of one node. Each Worker also has a cache and a computing circuit (PC or APC); the cache buffers child nodes that have already been visited, because the distances between adjacent feature points are relatively small and some search paths are likely to repeat. When the data is not in the cache, the DRAM is accessed. The results of all computing circuits flow into a parallel comparison circuit, which finds the node with the minimum value; finally it is judged whether this node is in the bottom layer, and if so the search finishes. Each node carries an offset value used to find the address of its child nodes, after which relocation is performed.
DRAM is a storage structure on the FPGA and can be understood as the main memory of a computer, while the cache is smaller but faster to access than DRAM. The cache holds part of the DRAM data: every access checks the cache first, and if the corresponding data is absent, the DRAM is accessed and the data is buffered into the cache (because of its small size, some old data may be overwritten). This mechanism exploits the locality of data accesses: recently accessed data is likely to be accessed again, so the data access speed is improved.
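The cache-before-DRAM access pattern described above can be sketched as follows (a minimal software analogy; the `NodeCache` name, the dict-based store, and the oldest-entry eviction are illustrative assumptions, not the patent's replacement policy):

```python
class NodeCache:
    """Check the cache first; on a miss, fetch from (slow) DRAM and buffer it."""

    def __init__(self, dram_read, capacity=4):
        self.dram_read = dram_read   # function standing in for a DRAM access
        self.capacity = capacity
        self.store = {}              # address -> node data (insertion-ordered)
        self.hits = self.misses = 0

    def read(self, addr):
        if addr in self.store:       # cache hit: no DRAM access needed
            self.hits += 1
            return self.store[addr]
        self.misses += 1             # miss: go to DRAM and buffer the data
        if len(self.store) >= self.capacity:
            self.store.pop(next(iter(self.store)))  # evict the oldest entry
        self.store[addr] = self.dram_read(addr)
        return self.store[addr]
```

Repeated reads of the same node address hit the cache, which is exactly why repeating search paths speed up.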
FIG. 3 shows the core computing circuit of the invention, which adopts a parallel counting structure, taking 16-bit input as an example. Each unit is a full adder built from combinational logic; the logic expressions are Sum = (A ^ B) ^ Cin and Cout = (A & B) | ((A ^ B) & Cin), consuming in total two exclusive-OR gates, two AND gates and one OR gate, as shown in fig. 4.
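The full adder logic above can be checked exhaustively in a few lines (a software model of the combinational logic, not HDL):

```python
def full_adder(a: int, b: int, cin: int):
    """One-bit full adder: Sum = (A ^ B) ^ Cin, Cout = (A & B) | ((A ^ B) & Cin)."""
    s = (a ^ b) ^ cin
    cout = (a & b) | ((a ^ b) & cin)
    return s, cout

# Exhaustive check against integer addition: a + b + cin == 2*cout + s.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            assert a + b + cin == 2 * cout + s
```

This identity (the carry carries weight 2) is what lets the parallel counter treat Cout as a bit of the next higher weight.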
Let v denote the number of bits needed to represent the output, i.e., the output bit width, and let the input bit width be N = 2^v. From fig. 3 it can be seen that each added output bit doubles the input, and the number of full adders consumed becomes 2 times the previous count plus v − 1, i.e.:
f(v) = 2*f(v-1) + v - 1
where f(2) = 1.
Then the gate level resources consumed are:
g(v) = (2^v − v − 1)*5 = (N − log2 N − 1)*5
where N denotes the input bit width and v the number of bits of the result, i.e., the output bit width. Based on the principle of approximate computing, the APC adds a first-level AU approximation unit, i.e., a column of AND gates and OR gates, before the full adders, as shown in fig. 5; the calculation result is then obtained with a small error (experimentally verified to be within 5) while resource consumption is reduced, the consumed gate-level resources becoming:
Similarly, when the APC adds an X-level AU approximation unit before the full adders, the consumed gate-level resources become:
for the characteristic that the upper layer of the k-means tree has coarse clustering granularity and the lower layer has fine clustering granularity, the upper layer adopts a multi-level approximate unit, and the lower layer adopts one layer or does not use an approximate unit. The nodes of the same hierarchy adopt the calculation circuits with the same structure, and the nodes are parallel.
For 256-bit input, the one-stage (gate-stage) approximation unit can reduce resource consumption by about 50%, and meanwhile, the calculation speed is improved.
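One plausible reading of a single AU stage (an assumption: the exact AND/OR wiring is not detailed in this text) is that each adjacent bit pair is compressed into a single weight-2 bit, alternating OR and AND so over- and under-estimates tend to cancel. A sketch of that interpretation:

```python
def approx_popcount(bits):
    """One hypothetical AU stage: compress adjacent bit pairs to single
    weight-2 bits, alternating OR (over-estimates by at most 1 per pair)
    and AND (under-estimates by at most 1 per pair)."""
    assert len(bits) % 2 == 0
    compressed = []
    for i in range(0, len(bits), 2):
        a, b = bits[i], bits[i + 1]
        compressed.append((a | b) if (i // 2) % 2 == 0 else (a & b))
    # Half as many bits survive, each carrying weight 2^1, as in the text.
    return 2 * sum(compressed)
```

For an 8-bit input (2 OR pairs, 2 AND pairs) the error is bounded by ±2, illustrating the "small error, halved input" trade-off.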
The circuit includes exclusive-OR gates before the computing circuit PC or APC, entered after the two 256-bit descriptors whose distance is sought arrive. First the exclusive-OR is computed with combinational logic; the number of 1s in the resulting 256 bits is the distance between the two descriptors. The 256 bits after the XOR first enter the approximation unit, which halves the input: after one AU stage the input is equivalent to 128 bits, but the weight of each bit becomes 2^1. The intermediate result is then registered on the clock rising edge with non-blocking assignments, after which every three bits of equal weight are fed into a full adder (starting with the weight-2^0 full adders). Each full adder has a weight: its three inputs share that weight, its Sum output sends its value to a full adder of the same weight, and its carry Cout sends its value to the next higher weight. The intermediate results are registered ('beaten') by the same operation and the full adders whose inputs are all present are computed; this repeats until the highest bit of the output is computed. Each final output wire carries one bit (0 or 1); taking 16 bits as an example, as shown in fig. 5, the output can be represented by 4 bits, the power of 2 indicating which of the 4 bits the wire points to. The circuit is directly wired: in the actual computation the 0 or 1 is a coefficient, the power of 2 is a weight, and the weighted bits are accumulated. This forms a pipeline inside the computing circuit: the first descriptor yields its result in the 11th cycle, and one descriptor result is obtained every cycle thereafter. Since the distance calculations with the child nodes run in parallel, once the results are obtained they are all sent to a comparison tree, and the minimum value determines where the next layer of the path goes.
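The staged full-adder reduction can be modeled in software as a carry-save column reduction (a behavioral sketch of the combinational tree, ignoring the pipeline registers; `parallel_count` is an illustrative name):

```python
def full_adder(a, b, cin):
    s = (a ^ b) ^ cin
    return s, (a & b) | ((a ^ b) & cin)

def parallel_count(bits):
    """Reduce bits with full adders, weight column by weight column, as the
    PC does in hardware: sums stay in the column, carries move one column up."""
    cols = {0: list(bits)}               # weight -> pending bits of that weight
    w = 0
    while w in cols:
        col = cols[w]
        while len(col) >= 3:             # groups of three bits per full adder
            s, c = full_adder(col.pop(), col.pop(), col.pop())
            col.append(s)
            cols.setdefault(w + 1, []).append(c)
        if len(col) == 2:                # two leftovers: add with cin = 0
            s, c = full_adder(col.pop(), col.pop(), 0)
            col.append(s)
            cols.setdefault(w + 1, []).append(c)
        w += 1
    # At most one bit remains per weight column: read off the binary result.
    return sum((1 << w) * col[0] for w, col in cols.items() if col)
```

Because each full adder preserves the sum (a + b + cin = Sum + 2·Cout), the reduction yields the exact population count, i.e., the Hamming weight of the XOR result.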
The distances between clusters are larger at the upper layers of the retrieval tree and become smaller going down. The number of AU stages adopted at each layer is controlled according to the actual data set: the upper layers, where the differences are large, may use two to three stages, while at the bottom layer whether to use an approximation unit at all is decided by the data set.
Figure 6 shows the tree storage inside the DRAM. The banks of the DRAM share one I/O control port, but reads and writes inside each bank can proceed in parallel. The child nodes are stored in corresponding banks, and when the needed data is not in the cache, each Worker can read its own bank in parallel.
In a preferred embodiment of the present invention, step S5 includes the following steps:
S51, randomly selecting I feature points from the key frame set, where I is a positive integer greater than or equal to 1, and calculating the pose (α, γ) of the current frame, where α denotes the rotation angle and γ the translation amount;
S52, calculating the reprojection errors of the remaining key frames from the pose (α, γ) of step S51; if a calculated reprojection error is less than or equal to the set error threshold, the point is a key point;
S53, counting the number of key points and the corresponding poses (α, γ);
S54, locally optimizing the pose of the current frame with the pose (α, γ) of step S53 as the initial pose value, the optimization objective function being:
where e_x is the x-th reprojection error observed by the camera, ‖·‖ denotes the norm, h_x is the number of observations, and O denotes the number of reprojections observed by the camera;
S55, if the number of key points after optimization exceeds the set number, relocation is considered successful.
In a preferred embodiment of the present invention, the method for calculating the pose (α, γ) of the current frame in step S51 is:
where I denotes the total number of feature points in the reference frame;
(X_i, Y_i) denotes the position coordinates of the i-th feature point in the current frame;
(X_j, Y_j) denotes the position coordinates of the j-th feature point in the current frame, j ≠ i;
(X_i′, Y_i′) denotes the position coordinates, in the reference frame, of the point corresponding to the i-th feature point of the current frame;
(X_j′, Y_j′) denotes the position coordinates, in the reference frame, of the point corresponding to the j-th feature point of the current frame;
(x_0, y_0) denotes the reference starting point;
[X_i − x_0, Y_i − y_0] denotes the vector of the i-th feature point in the current frame;
[X_j − x_0, Y_j − y_0] denotes the vector of the j-th feature point in the current frame;
|X_i − x_0, Y_i − y_0| denotes the distance value of the i-th feature point in the current frame;
|X_j − x_0, Y_j − y_0| denotes the distance value of the j-th feature point in the current frame.
In a preferred embodiment of the present invention, the reprojection errors of the remaining key frames in step S52 are calculated as:
where ε denotes the balance coefficient;
S_(α,γ) denotes the degree of shift of the pose (α, γ) over the remaining key frames;
K_k denotes the reprojection error of the k-th remaining key frame;
when K_k ≤ τ, where τ denotes the set error threshold, the selected feature point is a key point;
when K_k > τ, the selected feature point is not a key point.
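Steps S52/S53/S55 amount to counting inliers under the threshold τ. A hedged sketch, assuming a 2-D rigid pose model (rotation α, translation γ) since the patent's exact pose and error formulas are not reproduced in this text; `reproject` and `count_keypoints` are illustrative names:

```python
import math

def reproject(p, alpha, gamma):
    """Apply the assumed 2-D rigid pose: rotate by alpha, translate by gamma."""
    x, y = p
    gx, gy = gamma
    return (x * math.cos(alpha) - y * math.sin(alpha) + gx,
            x * math.sin(alpha) + y * math.cos(alpha) + gy)

def count_keypoints(matches, alpha, gamma, tau):
    """matches: list of (reference_point, current_point) pairs.
    Counts points whose reprojection error K_k is within the threshold tau."""
    n = 0
    for ref, cur in matches:
        px, py = reproject(ref, alpha, gamma)
        err = math.hypot(px - cur[0], py - cur[1])  # stands in for K_k
        if err <= tau:                              # K_k <= tau: key point
            n += 1
    return n
```

Relocation succeeds (S55) when this count exceeds the set number of key points.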
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
Claims (8)
1. An FPGA-based ORB-SLAM relocation feature point retrieval acceleration method, characterized by comprising the following steps:
S1, buffering the input picture and extracting descriptors;
S2, entering the working space (Workspace), where computing circuits calculate the distances to the nodes;
S3, the results of all computing circuits flow into a parallel comparison circuit, which finds the node with the minimum value;
S4, finally, judging whether that node is in the bottom layer; if so, the search finishes and the final node is obtained;
S5, each node carries an offset value used to find the address of its child nodes and obtain the key frames, after which relocation is performed on the key frame set.
2. The FPGA-based ORB-SLAM relocation feature point retrieval acceleration method according to claim 1, wherein the computing circuit comprises:
first, the data passes through exclusive-OR gates, then through either an accumulating parallel counter (APC) or a parallel counter (PC);
the accumulating parallel counter APC adds X levels of approximation units (AU) in front of the counter PC;
the approximation unit AU comprises: a first-level AU approximation unit, which is a column of AND gates and OR gates;
the parallel counter PC comprises a plurality of full adders.
3. The FPGA-based ORB-SLAM relocation feature point retrieval acceleration method according to claim 2, wherein the parallel counter PC comprises:
every three bits of the same weight are sent as a group to a full adder; each full adder has a weight, its three inputs share that weight, its sum output Sum sends its value to a full adder of the same weight, and its carry output Cout sends its value to the next higher weight; the intermediate results are registered for one clock ('beaten'), the full adders whose inputs are all present are then computed, and the intermediate results are registered again, until the highest bit of the output result is computed; each final output wire carries one bit.
4. The FPGA-based ORB-SLAM relocation feature point retrieval acceleration method according to claim 2, wherein the parallel counter PC further satisfies:
with v denoting the output bit width, the input bit width is N = 2^v, and for each added bit of output the number of full adders consumed doubles and increases by v − 1, i.e.:
f(v) = 2·f(v−1) + v − 1
wherein f(v) represents the number of full adders required for output bit width v, and f(v−1) represents the number of full adders required for output bit width v − 1.
5. The FPGA-based ORB-SLAM relocation feature point retrieval acceleration method according to claim 2, wherein the gate-level resources consumed by the parallel counter PC are:
g(v) = (2^v − v − 1) · 5 = (N − log2(N) − 1) · 5
where N represents the input bit width and v represents the output bit width.
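The recurrence of claim 4 and the closed form inside claim 5 agree, taking f(1) = 0 as the base case (an assumption consistent with the closed form: a 2-bit count needs no full adder). A quick check, using the 5 gates per full adder stated in claim 7:

```python
def f(v: int) -> int:
    # Claim 4 recurrence; base case f(1) = 0 is assumed
    return 0 if v == 1 else 2 * f(v - 1) + v - 1

def f_closed(v: int) -> int:
    return 2**v - v - 1           # full-adder count implied by claim 5

def g(v: int) -> int:
    return (2**v - v - 1) * 5     # claim 5: gate-level resources, 5 gates per FA

for v in range(1, 11):
    assert f(v) == f_closed(v)
    assert g(v) == 5 * f(v)
```

For example, v = 8 (a 256-bit input, the width of an ORB descriptor distance) gives f(8) = 247 full adders, i.e. g(8) = 1235 gates.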
7. The FPGA-based ORB-SLAM relocation feature point retrieval acceleration method according to claim 2, wherein the full adder comprises:
two XOR gates, two AND gates and one OR gate, with the following logic expressions:
Sum=(A^B)^Cin
Cout=(A&B)|((A^B)&Cin)
where A, B and Cin are the inputs of the full adder, namely the two addends and the carry from the adjacent lower bit; Sum and Cout are the outputs of the full adder, namely the local sum and the carry to the adjacent higher bit; ^ represents the XOR operation, & represents the AND operation, and | represents the OR operation.
8. The FPGA-based ORB-SLAM relocation feature point retrieval acceleration method according to claim 1, wherein the step S5 comprises the following steps:
S51, randomly selecting I feature points from the keyframe set, where I is a positive integer greater than or equal to 1, and calculating the pose (α, γ) of the current frame, where α represents a rotation angle;
S52, calculating the reprojection errors of the remaining feature points according to the pose (α, γ) of step S51; if a calculated reprojection error is less than or equal to the set error threshold, the corresponding point is a key point;
S53, counting the number of key points and the corresponding poses (α, γ);
S54, using the pose (α, γ) of step S53 as the initial pose value to locally optimize the pose of the current frame, where the optimized objective function is:
(α, γ) = argmin Σ_{x=1}^{o} h_x · ||e_x||²
where e_x is the x-th reprojection error observed by the camera, ||·|| represents the norm, h_x is the number of observations, and o represents the number of reprojections observed by the camera;
S55, if the number of key points after optimization exceeds the set number, the relocation is considered successful.
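Steps S51–S53 follow the usual RANSAC pattern: hypothesize a pose from a minimal sample, count the points it explains, keep the best-supported hypothesis. A toy 1-D sketch, where the scalar "pose", the helper callables, and the all-inlier test data are illustrative assumptions, not the patented (α, γ) solver:

```python
import random

def ransac_pose(points, solve_pose, reproj_err, err_thresh, iters=50, I=1):
    best_pose, best_count = None, 0
    for _ in range(iters):
        sample = random.sample(points, I)       # S51: pick I feature points
        pose = solve_pose(sample)               # hypothesize a current-frame pose
        # S52: a point is a key point if its reprojection error is within threshold
        inliers = [p for p in points if reproj_err(pose, p) <= err_thresh]
        if len(inliers) > best_count:           # S53: keep the best-supported pose
            best_pose, best_count = pose, len(inliers)
    return best_pose, best_count                # S54 would refine best_pose locally

# Toy model: the "pose" is a scalar shift t, with observation y = x + t
points = [(x, x + 5) for x in range(10)]
pose, count = ransac_pose(
    points,
    solve_pose=lambda s: s[0][1] - s[0][0],
    reproj_err=lambda t, p: abs(p[0] + t - p[1]),
    err_thresh=0.5,
)
```

Since every toy point is consistent with the shift t = 5, any sample recovers the same pose with all 10 points as key points; S55's success test then compares that count against the set number.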
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110918561.XA CN113536024B (en) | 2021-08-11 | 2021-08-11 | ORB-SLAM relocation feature point retrieval acceleration method based on FPGA |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113536024A true CN113536024A (en) | 2021-10-22 |
CN113536024B CN113536024B (en) | 2022-09-09 |
Family
ID=78091542
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110918561.XA Active CN113536024B (en) | 2021-08-11 | 2021-08-11 | ORB-SLAM relocation feature point retrieval acceleration method based on FPGA |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113536024B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019057179A1 (en) * | 2017-09-22 | 2019-03-28 | 华为技术有限公司 | Visual slam method and apparatus based on point and line characteristic |
CN109919825A (en) * | 2019-01-29 | 2019-06-21 | 北京航空航天大学 | A kind of ORB-SLAM hardware accelerator |
CN110070580A (en) * | 2019-03-29 | 2019-07-30 | 南京华捷艾米软件科技有限公司 | Based on the matched SLAM fast relocation method of local key frame and image processing apparatus |
CN110782494A (en) * | 2019-10-16 | 2020-02-11 | 北京工业大学 | Visual SLAM method based on point-line fusion |
CN111583093A (en) * | 2020-04-27 | 2020-08-25 | 西安交通大学 | Hardware implementation method for ORB feature point extraction with good real-time performance |
CN111795704A (en) * | 2020-06-30 | 2020-10-20 | 杭州海康机器人技术有限公司 | Method and device for constructing visual point cloud map |
CN112381890A (en) * | 2020-11-27 | 2021-02-19 | 上海工程技术大学 | RGB-D vision SLAM method based on dotted line characteristics |
CN112991447A (en) * | 2021-03-16 | 2021-06-18 | 华东理工大学 | Visual positioning and static map construction method and system in dynamic environment |
CN113160130A (en) * | 2021-03-09 | 2021-07-23 | 北京航空航天大学 | Loop detection method and device and computer equipment |
Non-Patent Citations (4)
Title |
---|
AYOUB MAMRI et al.: "ORB-SLAM accelerate on heterogeneous parallel architectures", E3S Web of Conferences 229, 01055 (2021) *
WEIKANG FANG et al.: "FPGA-based ORB feature extraction for real-time visual SLAM", 2017 International Conference on Field Programmable Technology (ICFPT) *
唐醅林: "Research on Feature Matching and Mapping Methods Based on ORB-SLAM", China Masters' Theses Full-text Database, Information Science and Technology Series *
张超凡: "Research on SLAM Methods Based on Fusion of Multi-camera Vision and Inertial Navigation", China Doctoral Dissertations Full-text Database, Information Science and Technology Series *
Also Published As
Publication number | Publication date |
---|---|
CN113536024B (en) | 2022-09-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wang et al. | Suppressing model overfitting in mining concept-drifting data streams | |
US9208374B2 (en) | Information processing apparatus, control method therefor, and electronic device | |
He et al. | Queryprop: Object query propagation for high-performance video object detection | |
Mayer et al. | Hype: Massive hypergraph partitioning with neighborhood expansion | |
Zhu | Dynamic feature pyramid networks for object detection | |
CN111160461A (en) | Fuzzy clustering-based weighted online extreme learning machine big data classification method | |
US20230161811A1 (en) | Image search system, method, and apparatus | |
CN111597230A (en) | Parallel density clustering mining method based on MapReduce | |
CN107426315B (en) | Distributed cache system Memcached improvement method based on BP neural network | |
CN112906865A (en) | Neural network architecture searching method and device, electronic equipment and storage medium | |
Zhang et al. | COLIN: a cache-conscious dynamic learned index with high read/write performance | |
CN115795065A (en) | Multimedia data cross-modal retrieval method and system based on weighted hash code | |
CN109818971B (en) | Network data anomaly detection method and system based on high-order association mining | |
US9135984B2 (en) | Apparatuses and methods for writing masked data to a buffer | |
CN113536024B (en) | ORB-SLAM relocation feature point retrieval acceleration method based on FPGA | |
Li et al. | Multi-scale global context feature pyramid network for object detector | |
Sun | Personalized music recommendation algorithm based on spark platform | |
CN108897847A (en) | Multi-GPU Density Peak Clustering Method Based on Locality Sensitive Hashing | |
US10997497B2 (en) | Calculation device for and calculation method of performing convolution | |
Ding et al. | An Error-Bounded Space-Efficient Hybrid Learned Index with High Lookup Performance | |
Beutel et al. | A machine learning approach to databases indexes | |
Huang et al. | Unsupervised fusion feature matching for data bias in uncertainty active learning | |
Kargar et al. | E2-NVM: A Memory-Aware Write Scheme to Improve Energy Efficiency and Write Endurance of NVMs using Variational Autoencoders. | |
Lovagnini et al. | CIRCE: Real-time caching for instance recognition on cloud environments and multi-core architectures | |
Lee et al. | StaleLearn: Learning acceleration with asynchronous synchronization between model replicas on PIM |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||