Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The replacement of manpower by machines has long been a direction of technological effort, and target tracking is a capability that machines must acquire on the path to intelligence. The key factor restricting the application of vision technology to target tracking is the insufficient real-time performance caused by massive data processing. To solve this problem, research institutions at home and abroad have mainly proceeded along two lines: first, adopting higher-performance processors; second, proposing new visual processing algorithms.
The inventor of the present invention selects a tracking algorithm particularly suitable for highly parallel hardware implementation and uses logic hardware to replace at least part of the target tracking functions that would otherwise be implemented in software on a processor. This realizes a real-time tracking system that satisfies practical applications while keeping system power consumption low enough to enable miniaturization.
Fig. 1 shows the general flow of target tracking. The current target area is first input to the target tracking unit. A partial image containing the tracking target, cut out from the current video image frame, may be input directly. Preferably, in order to reduce the data dimension to be processed by the subsequent tracking algorithm, the partial image may instead be fed to an image feature extraction unit that extracts target-related image features from it. Feature extraction yields features that reflect the essential attributes of the underlying patterns, which facilitates the tracking of a specific target by the subsequent tracking algorithm. The target tracking system then predicts the target position in the next frame based on the current input; finally, the processor renders the tracking result (e.g., a box on the image frame) via the display and supplies the picture input for the next frame.
Target tracking systems typically employ a specific tracking algorithm to track a given target. Since target tracking can be regarded as an online learning problem, a classifier is trained that can distinguish the appearance of the target from its environment, given the initial image region containing the target. For such a discriminative learning method, learning from negative samples (the background environment) is as important as learning from positive samples (the tracking target). Because the computational load of a common tracking algorithm must be kept at an acceptable level, only a small number of environment areas are randomly selected in each image frame for negative-sample training. This lack of negative examples is often a major factor limiting tracking performance.
To address this dilemma, a Fourier-domain training method is proposed. With a suitable model, some learning algorithms actually become easier in the Fourier domain as large numbers of samples are added. By transforming a complex algebraic solution process into simple operations in the Fourier domain, training over a large number of samples can be solved simply, thereby algorithmically ensuring tracking accuracy while increasing tracking speed.
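As an illustration of this effect (a worked example, not part of the claimed implementation), consider ridge regression where the data matrix is circulant, i.e., built from all cyclic shifts of a signal x. The algebraic solution w = (XᵀX + λI)⁻¹Xᵀy requires a matrix inversion, but in the Fourier domain it reduces to the element-wise expression

    ŵ = (x̂* ⊙ ŷ) / (x̂* ⊙ x̂ + λ)

where x̂ denotes the discrete Fourier transform of x, x̂* its complex conjugate, and ⊙ element-wise multiplication. Every training sample (every cyclic shift) is accounted for, yet no matrix is ever inverted.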
Fig. 2 shows a schematic view of an object tracking system according to the invention. As shown, the target tracking system 200 includes a target feature training module 210, a candidate feature training module 220, and a target region calculation module 230, with the candidate feature training module 220 and the target region calculation module 230 sharing a common Fourier transform module, preferably a Fast Fourier Transform (FFT) module, implemented in logic hardware.
The target feature training module 210 may train on image information of a target region in the current video image frame in the Fourier domain to obtain current training features. The training may be sample-based (e.g., using a plurality of positive samples for the target and negative samples for the environment), and acquisition of the current training features may also draw on historical training features, further enhancing the richness of the sample training. The candidate feature training module 220 may train on image information of a plurality of candidate target regions in the next video image frame in the Fourier domain to obtain a plurality of candidate training features. The target region calculation module 230 may then select a predicted next target region from the plurality of candidate target regions based on the current training feature and the plurality of candidate training features. The common FFT module performs Fourier-domain training for the target feature training module 210 and the candidate feature training module 220 in different time periods.
Depending on the implementation of the particular tracking algorithm, Fourier-domain training may require the computation of a two-dimensional FFT. The Fourier transform module may therefore comprise two Fast Fourier Transform (FFT) module units to perform the two-dimensional FFT. Preferably, however, the Fourier transform module comprises only one FFT module unit, e.g., a known one-dimensional FFT IP core, and implements the two-dimensional discrete Fourier transform (DFT) through specific configuration and time-multiplexing of that single unit.
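This multiplexing works because a two-dimensional FFT decomposes into one-dimensional FFTs applied first along the rows and then along the columns. A minimal software sketch of the decomposition (NumPy is used purely for illustration; the hardware unit processes the rows serially):

```python
import numpy as np

def fft2_via_1d(x):
    """Two-dimensional FFT built from 1-D FFT passes only."""
    rows = np.fft.fft(x, axis=1)     # first pass: 1-D FFT of every row
    return np.fft.fft(rows, axis=0)  # second pass: 1-D FFT of every column

# The result matches a direct 2-D FFT:
x = np.random.rand(8, 8)
assert np.allclose(fft2_via_1d(x), np.fft.fft2(x))
```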
Fourier-domain training may also include the computation of an Inverse Fast Fourier Transform (IFFT) to convert the results of Fourier-domain operations back to the algebraic domain. The one or more FFT module units described above may likewise be used to implement the IFFT computation.
The inputs for feature training are preferably matrices, so that Fourier-domain computation can be exploited to simplify matrix operations (e.g., matrix inversion) that would otherwise be burdensome. In one embodiment, the target feature training module 210 and the candidate feature training module 220 train feature matrices of the target region and the candidate target regions, respectively, in the Fourier domain to obtain a current training feature matrix and a plurality of candidate training feature matrices. More preferably, a matrix transposition function may be added to the IFFT computation: while performing the inverse transform, the IFFT stage transposes the matrix, which was left transposed by the second FFT pass, back to its original orientation, further streamlining the operation and improving computational efficiency.
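The following sketch illustrates this arrangement, under the assumption (common when a single row-wise FFT unit is multiplexed) that the second FFT pass leaves its output transposed, so the inverse pass can undo the transposition at no extra cost:

```python
import numpy as np

def fft2_transposed(x):
    # First pass: row-wise 1-D FFT.
    a = np.fft.fft(x, axis=1)
    # Second pass reuses the same row-wise unit on the transpose;
    # the output is deliberately left in transposed orientation.
    return np.fft.fft(a.T, axis=1)

def ifft2_untransposed(X_t):
    # Inverse pass, also row-wise; the final transpose restores the
    # original orientation for free.
    b = np.fft.ifft(X_t, axis=1)
    return np.fft.ifft(b.T, axis=1)

x = np.random.rand(8, 8)
assert np.allclose(ifft2_untransposed(fft2_transposed(x)), x)
```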
The target region calculation module 230 selects the next target region based on a plurality of region calculation result matrices obtained from the plurality of candidate training feature matrices and the current training feature matrix. In the next video image frame, n image areas (n being an integer of 2 or more) may be selected as candidate areas around the position of the current target area according to a predetermined rule. The image feature matrix of each candidate region is trained to obtain a corresponding candidate training feature matrix. The current training feature matrix is then operated on with each of the n candidate training feature matrices, yielding one region calculation result matrix per candidate region, n in total. Finally, the next target area is determined from among the n candidate areas according to the n region calculation result matrices.
In one embodiment, the next target region is determined by selecting the largest element across the n region calculation result matrices. The target area calculation module 230 may therefore further include a maximum element selection unit for selecting the maximum element among all elements of the plurality of region calculation result matrices. For example, with n = 3 candidate areas each producing a 16 × 16 region calculation result matrix, the largest of all 16 × 16 × 3 elements is selected, and the area whose matrix contains that element is taken as the predicted next target area.
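A simple illustration of this selection step (a sketch only, with NumPy standing in for the hardware comparator):

```python
import numpy as np

def select_next_region(responses):
    """responses: list of n region calculation result matrices (e.g., 16x16)."""
    stacked = np.stack(responses)                        # shape (n, 16, 16)
    n_idx, row, col = np.unravel_index(np.argmax(stacked), stacked.shape)
    return n_idx, (row, col), stacked[n_idx, row, col]   # region, position, value
```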
The maximum element selection unit may further record the values of the maximum element and its neighboring elements, as well as the position of the maximum element. The target area calculation module may further comprise a target area optimization unit that adjusts the next target area based on the values and position recorded by the maximum element selection unit, obtaining an optimized next target area. The target region optimization unit may find a position offset value by quadratic fitting over the neighbors of the largest element (e.g., the four elements above, below, left, and right, or the eight surrounding elements), and adjust the position and size of the selected next target region accordingly.
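A one-dimensional version of such a quadratic fit, applied independently per axis, might look as follows. This is a sketch of the standard three-point parabola-vertex formula and not necessarily the exact fit used in a given implementation:

```python
def quadratic_offset(left, center, right):
    """Sub-element offset of the peak, in (-0.5, 0.5), from three samples
    where 'center' is the maximum of a response row or column."""
    denom = 2.0 * (left + right - 2.0 * center)
    return 0.0 if denom == 0 else (left - right) / denom

# Example: the peak leans toward the right-hand neighbour.
dx = quadratic_offset(0.5, 1.0, 0.9)   # ~0.33, so shift the box right
```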
As shown in Fig. 3, the maximum element selection unit is preferably implemented with a shift register. The shift register need only have capacity for two rows of elements of a region calculation result matrix; by streaming all elements of the plurality of region calculation result matrices serially through the shift register and comparing them, the maximum element can be determined and its value and position acquired.
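The sketch below emulates that serial comparison in software, under the assumptions of row-major streaming and a two-row buffer; the down and right neighbours of a new maximum would be latched in subsequent cycles, which is omitted here for brevity:

```python
from collections import deque

def streaming_argmax(stream, width):
    """Serial maximum search over a row-major element stream."""
    buf = deque(maxlen=2 * width)        # holds the last two rows only
    best_val, best_idx = float("-inf"), -1
    up = left = None
    for i, v in enumerate(stream):
        if v > best_val:
            best_val, best_idx = v, i
            left = buf[-1] if i % width else None            # same row, previous column
            up = buf[-width] if len(buf) >= width else None  # previous row, same column
        buf.append(v)
    return best_val, divmod(best_idx, width), up, left
```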
Referring back to Fig. 1, the current target area is the basis for determining the next target area, which in turn is the basis for determining the one after it. Thus, the training features of any current target region may also, preferably, be based on the historical training features of previous training. In one embodiment, current training feature = a × historical training feature + b × current image feature training result (with 0 < a < 1, 0 < b < 1, and a + b = 1), and the values of the coefficients a and b may be flexibly adjusted according to the specific implementation to set the weights given to the historical training features and the current image training features.
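This is a linear (exponential moving average) blend; a minimal sketch, with a = 0.9 as an assumed example value:

```python
def update_training_feature(history, current, a=0.9):
    """Blend historical and current training features; a + b = 1."""
    b = 1.0 - a
    return a * history + b * current   # element-wise for matrix features
```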
Preferably, the object tracking system of the present invention is particularly adapted to target tracking using the Kernelized Correlation Filter (KCF) tracking algorithm. The basic idea of KCF is to cyclically shift the tracking target area so as to construct a large number of samples with which to train a classifier. The degree of similarity between each candidate region and the tracking target is computed through a kernel function, the candidate region with the greatest similarity is selected as the new tracking target, and the discrete Fourier transform is used to reduce the amount of computation in both the training and detection stages of the classifier. Because the training samples are constructed by cyclic shifting, the data matrix becomes a circulant matrix; the solution of the problem can then be transformed to the discrete Fourier transform domain using the properties of circulant matrices, avoiding matrix inversion and reducing the algorithmic complexity by several orders of magnitude. Although the present invention is particularly suitable for target tracking implemented with the KCF algorithm, it should be understood that the tracking system of the present invention may use other correlation methods capable of training image features in the Fourier domain and obtain similar benefits in tracking accuracy and real-time performance.
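For concreteness, a minimal single-channel sketch of the KCF training and detection equations, simplified to a linear kernel (following the published KCF formulation; the regularization value is illustrative):

```python
import numpy as np

def kcf_train(x, y, lam=1e-4):
    """x: target patch features; y: desired (e.g., Gaussian) response."""
    X = np.fft.fft2(x)
    kxx = X * np.conj(X)                 # autocorrelation of all cyclic shifts
    return np.fft.fft2(y) / (kxx + lam)  # ridge regression, element-wise

def kcf_detect(alpha_hat, x, z):
    """z: candidate patch features; returns a response map over shifts."""
    kxz = np.fft.fft2(z) * np.conj(np.fft.fft2(x))
    return np.real(np.fft.ifft2(alpha_hat * kxz))

# The location of the response maximum gives the predicted displacement.
```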
In one embodiment, the target tracking system of the present invention can track multiple targets simultaneously. For example, after the prediction for target A in the current frame is completed, the prediction result for target A is stored, the input related to target B is read, and tracking prediction is performed for target B, and so on. In this way, the target feature training module, the candidate feature training module, and the target region calculation module repeat the operations of feature training and region selection for different targets, tracking multiple targets within the same video image frame.
Preferred embodiments of the modular implementation of the object tracking system of the present invention have been described above in connection with Figs. 2 and 3. Although the figures illustrate the target feature training module, the candidate feature training module, and the target region calculation module as units, each of them may be implemented in part by logic hardware. It should be understood that, apart from the Fourier transform module implemented in logic hardware, the target feature training module and the candidate feature training module may be implemented in software and/or hardware depending on the specific implementation. Similarly, the target area calculation module may be implemented in whole or in part in software or in logic hardware.
Preferably, the logic hardware used by the present invention may be an FPGA, an ASIC, another hardware platform, or any combination thereof. The target tracking system may be implemented on a system on a chip (SoC) that includes a general-purpose processor and logic hardware. The object tracking system of the present invention may also be part of a complete object detection and tracking system. Fig. 4 shows a schematic diagram of the operation of such a complete system; the object tracking system 200 (and 300) of the present invention may be implemented as the object tracking module in this figure.
The modules described above may be combined in different ways in specific applications. In one embodiment, the target feature training module may be regarded as a sample training module that performs sample training on the image features of the target region of the current video image frame to obtain feature mapping parameters. At the same time, part of the target region calculation module may be combined with the candidate feature training module into a target detection module that directly produces the region calculation result matrices. The remaining part of the target area calculation module, i.e., the maximum element selection unit (preferably together with the target area optimization unit), may be regarded as a separate detection result generation module or be incorporated into the above-mentioned target detection module.
Fig. 4 illustrates a system that achieves high-performance real-time target detection and tracking with little hardware resource consumption. In particular, the system may be a software/hardware-coordinated system for real-time target detection and tracking on an FPGA-based system-on-a-chip, where the inputs and outputs on the FPGA are preferably both fixed-point values, further simplifying the computation and improving computational efficiency. The system comprises an object detection module for globally locating the position and size of an object and for relocating the object when the system starts, a user provides input, a timer expires, or the object is lost. In this example, the target detection module is implemented with a convolutional neural network (CNN) algorithm.
The system also includes a target tracking module for tracking the target in real time: after the target is detected (i.e., a local image containing the target is obtained), the target is tracked in real time. In this example, image features are extracted using the histogram of oriented gradients (HOG) algorithm, and the target position and size are predicted in real time from the image features using the kernelized correlation filter (KCF) algorithm.
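A bare-bones HOG sketch follows (per-cell orientation histograms only; real HOG implementations add block normalization and other refinements, and the cell size and bin count here are assumed values):

```python
import numpy as np

def hog_features(img, cell=4, bins=9):
    gy, gx = np.gradient(img.astype(float))          # image gradients
    mag = np.hypot(gx, gy)                           # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientation
    hist = np.zeros((img.shape[0] // cell, img.shape[1] // cell, bins))
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    for i in range(hist.shape[0] * cell):            # accumulate per cell
        for j in range(hist.shape[1] * cell):
            hist[i // cell, j // cell, bin_idx[i, j]] += mag[i, j]
    return hist
```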
The system further includes a control module for controlling the operation of the whole system, such as the input and output of video images and the invocation of the other modules. For example, the image input module described above may be part of the control module.
In operation, the CNN is called first and the target is located on the full input image to obtain its position and size in the video image (i.e., a local image is detected). Feature calculation is then performed on the corresponding partial image of the input video frame using the HOG algorithm, according to the target position and size calculated in the previous step (per the requirements of the KCF algorithm, this module runs four times on the same image). The HOG module sends the computed feature maps to the KCF calculation module. The Fourier transform, inverse Fourier transform, and point-wise multiplication functional blocks of the KCF calculation module, which suit parallel processing, are preferably realized on the FPGA, while the overall KCF calculation flow may be controlled by a microprocessor. The KCF module sends the calculated target position and size to the controller, which marks them on the output video image. After each KCF calculation, the controller compares the confidence probability produced by the calculation with a preset threshold; if the confidence is below the threshold, tracking is considered lost and the system recalls the CNN to re-locate the target. If the confidence is above the threshold, the system considers operation normal and the HOG feature calculation begins for the next frame.
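The overall control flow can be summarized by the following sketch; all function names and the threshold value are hypothetical placeholders for the modules described above:

```python
def tracking_loop(frames, cnn_detect, hog, kcf, threshold=0.5):
    box = None
    for frame in frames:
        if box is None:
            box = cnn_detect(frame)          # global localization by CNN
        feats = hog(frame, box)              # features of the local image
        box, confidence = kcf(feats)         # predicted position and size
        if confidence < threshold:
            box = None                       # tracking lost: re-detect
        yield frame, box                     # controller marks the box
```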
The object tracking system of the present invention has been described above in the form of functional modules in connection with Figs. 2-4. The hardware implementation of the object tracking device of the present invention is now described with reference to Fig. 5, which shows a hardware functional diagram of the object tracking device. The target tracking device comprises: a receiving module, preferably implemented as a FIFO memory, for receiving and storing externally input picture information; a sample training module for performing feature training on the input partial picture information; a DFT calculation module for computing the discrete Fourier transform of the picture information; a target detection module for performing position detection on the input picture information; and an output module, also preferably implemented as a FIFO memory, for sending the predicted tracking-target information to the general-purpose processor.
The steps involved in the target tracking process are as follows:
Step 1: The general-purpose processor determines a screenshot area according to the detection result and sends the picture features of the screenshot area to the target tracking device.
Step 2: The receiving module receives the image information of the detection area input from the general-purpose processor, preferably as a multi-channel feature matrix, and passes the multi-channel image feature information matrix to the sample training module, which processes it to obtain a feature mapping matrix.
a. The sample training module receives the multi-channel picture feature information matrix. It first takes the feature matrix data from the input FIFO memory and then performs a two-dimensional fast Fourier transform on the feature matrix using the multiplexed DFT calculation module.
b. The Fourier transform result is stored in block memory (BLOCK RAM, also called BRAM); the result and its complex conjugate are then multiplied using a complex multiplier, and the complex multiplication result is stored in BRAM.
c. The complex multiplication results of the channels are accumulated and the accumulated result is stored in BRAM, until the feature inputs of all channels have been processed. An inverse Fourier transform is then performed on the accumulated result.
d. The result of the inverse Fourier transform is rearranged as a matrix, the imaginary part of the complex matrix is discarded, and a real matrix is output. This matrix is combined with the historically obtained feature mapping matrix to yield a new feature mapping matrix (see the sketch following this step).
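Steps a-d can be summarized in software as follows. This is a sketch only: the blend weight a is an assumed example value, and the hardware performs the same operations with fixed-point arithmetic and BRAM buffering:

```python
import numpy as np

def sample_training(channel_feats, history_map, a=0.9):
    """channel_feats: list of per-channel feature matrices."""
    acc = None
    for ch in channel_feats:
        F = np.fft.fft2(ch)                 # step a: 2-D FFT per channel
        prod = F * np.conj(F)               # step b: multiply by conjugate
        acc = prod if acc is None else acc + prod   # step c: accumulate
    real_part = np.real(np.fft.ifft2(acc))  # steps c-d: IFFT, drop imaginary part
    return a * history_map + (1 - a) * real_part    # step d: update history
```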
Step 3: The receiving module receives the image information of the prediction region input from the general-purpose processor, preferably as a multi-channel feature matrix, and passes it to the target detection module, which processes it to obtain the predicted position information.
a. The target detection module receives three or more multi-channel image matrices of the area to be detected. The input data are first Fourier-transformed using the multiplexed DFT calculation module.
b. The Fourier-transformed results of two consecutive input image matrices are complex-multiplied, and the complex multiplication results are stored in BRAM.
c. The complex multiplication results of the channels are accumulated and the accumulated result is stored in BRAM, until the image matrix inputs of all channels have been processed. An inverse Fourier transform is then performed on the accumulated result.
d. The result of the inverse Fourier transform is added to the ridge regression coefficient matrix, and the sum is multiplied by the feature mapping matrix obtained by the sample training module, yielding the prediction result for the picture of the region to be predicted (a software summary follows this step).
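A software summary of steps a-d of the detection pass, under the same assumptions as the training sketch above; 'ridge_coeff' stands for the ridge regression coefficient matrix of step d:

```python
import numpy as np

def target_detection(train_feats, cand_feats, feature_map, ridge_coeff):
    """train_feats, cand_feats: per-channel feature matrices of two
    consecutive inputs (training region and candidate region)."""
    acc = None
    for xt, zc in zip(train_feats, cand_feats):
        prod = np.fft.fft2(zc) * np.conj(np.fft.fft2(xt))  # steps a-b
        acc = prod if acc is None else acc + prod           # step c
    kxz = np.real(np.fft.ifft2(acc))                        # step c: IFFT
    return (kxz + ridge_coeff) * feature_map                # step d
```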
Step 4: The position fitting module receives the output of the target detection module, and the final predicted position information is obtained after the quadratic fitting processing realized with the shift register.
Step 5: The output module sends the final prediction result to the general-purpose processor through the FIFO, and the general-purpose processor displays the final tracking result according to the prediction information.
To improve operating efficiency, a single set of DFT calculation resources can be used in the sample training module by operating directly on the DFT result together with its transposed result, eliminating one of the two DFT passes of the original algorithm. In addition, because the entire KCF tracking algorithm runs serially, the sample training module and the target detection module can repeatedly share one DFT computation unit.
In addition, the implementation framework designed by the invention allows the configured precision parameters to be modified according to different precision requirements, so that discrete Fourier transform calculation precisions of 16, 24, or 32 bits can be provided.
The object tracking system and its implementation method according to the present invention have been described in detail above with reference to the accompanying drawings. The system can be regarded as a hardware-based target tracking accelerator and can be implemented on a field-programmable gate array (FPGA), on an application-specific integrated circuit (ASIC) chip, or as a hard core integrated into an ARM, CPU, or GPU chip.
The device and method of the present invention solve the problem of insufficient CPU operation performance, effectively increase the operation speed, and improve the real-time performance of target tracking. Exploiting the serial execution of the KCF tracking algorithm, the reusability of the discrete Fourier transform calculation module is fully exploited, greatly reducing the consumption of on-chip computing resources. Different parameters can also be configured to achieve different tracking performance according to different precision requirements. The target tracking system and tracking method realized by the invention therefore achieve an optimized tracking implementation with few hardware resources: the computational load of the algorithm and the required hardware resources and power consumption are greatly reduced, and excellent real-time tracking performance can be achieved on small devices.
It should be appreciated that the preferred features of the embodiments described above with reference to Figs. 2-5 may be combined or split to provide new embodiments. Embodiments combining these features are intended to be within the scope of the present invention as defined by the appended claims.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.