CN113436232A - Hardware acceleration method based on tracking algorithm - Google Patents

Hardware acceleration method based on tracking algorithm

Info

Publication number
CN113436232A
CN113436232A
Authority
CN
China
Prior art keywords
video data
algorithm
processing
data stream
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110723521.XA
Other languages
Chinese (zh)
Other versions
CN113436232B (en)
Inventor
Hu Mingde (胡铭德)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Lexin Information Technology Co ltd
Original Assignee
Shanghai Lexin Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Lexin Information Technology Co ltd filed Critical Shanghai Lexin Information Technology Co ltd
Priority to CN202110723521.XA priority Critical patent/CN113436232B/en
Publication of CN113436232A publication Critical patent/CN113436232A/en
Application granted granted Critical
Publication of CN113436232B publication Critical patent/CN113436232B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/277 Analysis of motion involving stochastic approaches, e.g. using Kalman filters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/73 Deblurring; Sharpening
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/215 Motion-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a hardware acceleration method based on a tracking algorithm: S1, the hardware receives and divides the data stream information; S2, the CPU distributes the compressed video data to the GPU and the APU for processing; S3, the GPU and the APU post-process the video data stream with their own algorithms; S4, the algorithm processing of the video data stream is parallelized; S5, the processed video data stream is received and played by the CPU. The invention divides the data information into many small blocks for processing, which effectively improves the accelerated operation of the hardware, and the algorithm parallelization, data parallelization and operation parallelization adopted during hardware processing effectively improve the running speed and efficiency of the hardware.

Description

Hardware acceleration method based on tracking algorithm
Technical Field
The invention belongs to the technical field of hardware acceleration, and particularly relates to a hardware acceleration method based on a tracking algorithm.
Background
Hardware acceleration is a technique in which a very computationally intensive job is offloaded from the central processing unit to dedicated hardware in order to reduce the CPU workload. The technique is used above all in image processing. The structure of the central processor allows it to carry out a wide variety of instructions in a short time; which instructions it can process is mainly determined by software. Because of its structure, however, the central processor cannot handle certain repetitive tasks very efficiently or quickly. Dedicated hardware units do not have to be as flexible as the central processor, so their hardware design can be optimized for these special problems, leaving the central processor time for other tasks. Some tasks can be solved very efficiently by breaking them down into thousands of smaller tasks, for example a Fourier transform over a certain frequency band or the rendering of a small image. These small tasks can be computed in parallel, independently of one another. Processing such special tasks with a large number of small processors running in parallel greatly increases the overall computation speed, and in many cases the speed increases linearly with the number of parallel processors. Such parallel computation is also significant from the viewpoint of efficient energy use: energy consumption grows linearly with the number of parallel processors but with the square of the processor frequency, so the frequency of the parallel processors does not have to be high and the energy used remains relatively small. Nevertheless, the various hardware acceleration schemes on the market still have a number of problems.
Although the target detection hardware accelerator and acceleration method disclosed in grant publication No. CN112230884B can reduce the time and power consumption required for data transport and improve the accelerator's working efficiency, they do not solve the problem that existing hardware acceleration technology cannot acquire a target through a tracking algorithm, process the target to be detected precisely, and thereby accelerate the hardware. We therefore propose a hardware acceleration method based on a tracking algorithm.
Disclosure of Invention
The present invention is directed to a hardware acceleration method based on a tracking algorithm, so as to solve the problems set forth in the above background art.
In order to achieve the purpose, the invention provides the following technical scheme: a hardware acceleration method based on a tracking algorithm comprises the following steps:
S1, the hardware receives and divides the data stream information: the CPU receives the data stream information, separates and compresses it to obtain video data, places the compressed video data in the separation area of the system memory, and divides the compressed video data into a number of small parts;
S2, the CPU distributes the compressed video data to the GPU and the APU for processing: the CPU transmits the compressed video data, divided into a number of small parts, to the GPU and the APU for data-parallel and operation-parallel processing, so that the compressed video data stream is decompressed and the decompressed data information is stored in the sound card;
S3, the GPU and the APU post-process the video data stream with their own algorithms: the GPU and the APU lock onto characteristic objects or persons in the video with a tracking algorithm, then track and position the target with Kalman filtering, particle filtering, Meanshift, Camshift, MOSSE, CSK, KCF, BACF or SAMF, then apply fine processing to the small video data stream of the area where the target is located so that it is processed at higher definition, and apply blurring to the small video data streams of the non-target areas, thereby increasing the speed of hardware processing;
S4, algorithm parallelization is adopted for the algorithm processing of the video data stream: when the GPU and the APU process the video data stream with the algorithm, the algorithm is parallelized so that the GPU and the APU can use their processing space simultaneously, which effectively increases the running speed and efficiency of the hardware and completes the accelerated processing of the video data stream;
S5, the processed video data stream is received and played by the CPU: when the GPU and the APU finish processing the video data stream, the CPU receives the video data, splices the video data stream according to the order of the segmented data, arranges it with a bubble sort, and then plays the video data stream quickly.
Preferably, the segmentation processing in S1 adopts a spatial-domain algorithm and a temporal-domain algorithm. The spatial-domain algorithm works macroblock by macroblock: the processing of each pixel is carried out locally in the spatial domain, each pixel is processed in turn to produce an output, and no result from the previous or next pixel is accumulated into that output;
the temporal-domain algorithm looks for changes or similarities in a particular pixel or pixel region between frames and converts interlaced fields to progressive format by line doubling or filtering.
Preferably, the macroblock algorithm low-pass filters a pixel with the average value computed from the neighbouring pixels around it, i.e. a 5 × 5 two-dimensional convolution kernel is used to detect edge information in the image and to draw the relevant information from the large number of surrounding pixels; spatial processing and temporal processing are combined into a new category called "spatio-temporal processing": each image frame is decomposed into a number of macroblocks, i.e. regions of 16 × 16 pixels, which are then tracked and compared frame by frame to extract approximate motion estimation and compensation values.
Preferably, the data parallelization in S2 divides the data into several small blocks that can be processed simultaneously, where the data blocks can be 16 by 16 or 32 by 32; data parallelization requires a Stream object whose parallel method is called to give it the capability of parallel operation, or a stream created from a collection class through parallelStream, which immediately yields a stream capable of parallel execution.
Preferably, the operation parallelization in S2 is a detail-level optimization of the algorithm processing, reducing the generation of intermediate variables and computing in a single step wherever possible.
Preferably, the target in S3 is tracked and positioned by particle filtering, the operation steps of which are as follows:
s301, initialization stage-extraction of target features: selecting a target, and extracting the characteristics of a target area, namely a target color histogram;
s302, initializing particles: a) scattering particles evenly over the entire image, b) scattering particles in a gaussian distribution near the target of the previous frame;
s303, a search stage: counting the color histogram of each particle, comparing it with the color histogram of the target model, calculating the weight from the Bhattacharyya distance, and normalizing the weights so that the weights of all particles sum to 1;
s304, particle resampling: few particles are kept where the similarity is low and many particles are placed where the similarity is high, and particles with low weight are discarded;
s305, state transition: the position of each particle at the next moment is calculated according to the state transition equation s_t = A·s_{t-1} + w_{t-1};
s306, an observation stage: calculating the similarity between each particle and the target characteristic, and updating the weight of the particle;
s307, decision stage: calculating a weighted average value of the coordinates and the similarity to obtain the position of the next frame of the tracking target;
s308, repeating S303, S304, S305, S306 and S307 according to the predicted position.
Preferably, the algorithm parallelization in S4 adopts the PRAM model, which has a centralized shared memory (SM) and an instruction controller and performs implicitly synchronized computation by exchanging data through reads and writes (R/W) of the shared memory.
Preferably, the step of comparing the bubble sort method in S5 is as follows:
s501, comparing adjacent elements, and if the first element is larger than the second element, exchanging the two elements;
s502, performing the same comparison for every pair of adjacent elements, from the first pair at the beginning to the last pair at the end; when this pass is completed, the last element is the largest;
s503, repeating the steps for all the elements except the last element;
s504, repeating the above steps for fewer and fewer elements each time until no pair of numbers needs to be compared.
Preferably, the dividing process numbers the divided video data stream segments during division, and the bubble sorting is then carried out according to these numbers, so that the video data stream is played back in sequence.
Preferably, the high-definition processing in S3 uses a cubic convolution method for scaling the video signal; each pixel of the output image is the result of an operation on 16 pixels of the original image, and when cubic convolution interpolation is used, the value of a target point is calculated by resampling the values of the 16 known surrounding pixels.
Compared with the prior art, the invention has the beneficial effects that:
the invention realizes the segmentation of the data information, so that the data information can be divided into a plurality of small blocks for processing, then realizes the fine processing of the target area through the tracking algorithm, and performs the fuzzy processing on other areas, thereby effectively improving the accelerated operation of hardware, and effectively improving the operation speed of the hardware through the algorithm parallelization, the data parallelization and the operation parallelization adopted when the hardware is processed, thereby improving the acceleration of the hardware through the fuzzy processing and the processing of the hardware algorithm, and improving the operation speed and the efficiency of the hardware.
Drawings
FIG. 1 is a schematic flow chart of the steps of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a technical solution: a hardware acceleration method based on a tracking algorithm comprises the following steps:
S1, the hardware receives and divides the data stream information: the CPU receives the data stream information, separates and compresses it to obtain video data, places the compressed video data in the separation area of the system memory, and divides the compressed video data into a number of small parts;
S2, the CPU distributes the compressed video data to the GPU and the APU for processing: the CPU transmits the compressed video data, divided into a number of small parts, to the GPU and the APU for data-parallel and operation-parallel processing, so that the compressed video data stream is decompressed and the decompressed data information is stored in the sound card;
S3, the GPU and the APU post-process the video data stream with their own algorithms: the GPU and the APU lock onto characteristic objects or persons in the video with a tracking algorithm, then track and position the target with Kalman filtering, particle filtering, Meanshift, Camshift, MOSSE, CSK, KCF, BACF or SAMF, then apply fine processing to the small video data stream of the area where the target is located so that it is processed at higher definition, and apply blurring to the small video data streams of the non-target areas, thereby increasing the speed of hardware processing;
S4, algorithm parallelization is adopted for the algorithm processing of the video data stream: when the GPU and the APU process the video data stream with the algorithm, the algorithm is parallelized so that the GPU and the APU can use their processing space simultaneously, which effectively increases the running speed and efficiency of the hardware and completes the accelerated processing of the video data stream;
S5, the processed video data stream is received and played by the CPU: when the GPU and the APU finish processing the video data stream, the CPU receives the video data, splices the video data stream according to the order of the segmented data, arranges it with a bubble sort, and then plays the video data stream quickly.
In this embodiment, it is preferable that the segmentation processing in S1 adopts a spatial-domain algorithm and a temporal-domain algorithm. The spatial-domain algorithm works macroblock by macroblock: the processing of each pixel is carried out locally in the spatial domain, each pixel is processed in turn to produce an output, and no result from the previous or next pixel is accumulated into that output;
the temporal-domain algorithm looks for changes or similarities in a particular pixel or pixel region between frames and converts interlaced fields to progressive format by line doubling or filtering.
In this embodiment, preferably, the macroblock algorithm low-pass filters a pixel with the average value computed from the neighbouring pixels around it, i.e. a 5 × 5 two-dimensional convolution kernel is used to detect edge information in the image and to draw the relevant information from the large number of surrounding pixels; spatial processing and temporal processing are combined into a new category called "spatio-temporal processing": each image frame is decomposed into a number of macroblocks, i.e. regions of 16 × 16 pixels, which are then tracked and compared frame by frame to extract approximate motion estimation and compensation values.
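As an illustration of the two per-pixel operations just described, the following is a minimal Java sketch that decomposes an 8-bit grayscale frame (stored row-major in an int array) into 16 × 16 macroblocks and applies a 5 × 5 averaging (low-pass) kernel; the class and method names are illustrative assumptions, not part of the patent.

```java
// MacroblockDemo.java: illustrative sketch, assumes 8-bit grayscale frames in a row-major int[].
public class MacroblockDemo {
    static final int MB = 16;   // macroblock size (16 x 16 pixels) as described in the text
    static final int K = 5;     // 5 x 5 averaging (low-pass) kernel

    // Split a width x height frame into 16 x 16 macroblocks (edge remainders are ignored here).
    static int[][][] toMacroblocks(int[] frame, int width, int height) {
        int bx = width / MB, by = height / MB;
        int[][][] blocks = new int[by * bx][MB][MB];
        for (int b = 0; b < by * bx; b++) {
            int ox = (b % bx) * MB, oy = (b / bx) * MB;
            for (int y = 0; y < MB; y++)
                for (int x = 0; x < MB; x++)
                    blocks[b][y][x] = frame[(oy + y) * width + (ox + x)];
        }
        return blocks;
    }

    // 5 x 5 mean filter: each output pixel is the average of its neighbourhood (borders handled by skipping).
    static int[] meanFilter5x5(int[] frame, int width, int height) {
        int[] out = new int[frame.length];
        int r = K / 2;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int sum = 0, n = 0;
                for (int dy = -r; dy <= r; dy++)
                    for (int dx = -r; dx <= r; dx++) {
                        int yy = y + dy, xx = x + dx;
                        if (yy >= 0 && yy < height && xx >= 0 && xx < width) { sum += frame[yy * width + xx]; n++; }
                    }
                out[y * width + x] = sum / n;
            }
        }
        return out;
    }
}
```

The macroblocks returned by toMacroblocks are the units that would then be tracked and compared frame by frame for motion estimation and compensation.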
In this embodiment, preferably, the data parallelization in S2 divides the data into several small blocks that can be processed simultaneously, where the data blocks can be 16 by 16 or 32 by 32; data parallelization requires a Stream object whose parallel method is called to give it the capability of parallel operation, or a stream created from a collection class through parallelStream, which immediately yields a stream capable of parallel execution.
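The passage above describes the Java Stream API (parallel / parallelStream). A minimal sketch of cutting a video segment into fixed-size data blocks and processing them through a parallel stream follows; the chunk size and the placeholder processChunk body are assumptions made for illustration only.

```java
// ParallelChunks.java: illustrative sketch of the data-parallel step using parallelStream().
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelChunks {
    // Placeholder for the real per-block work (e.g. decompressing or filtering one 32 x 32 block).
    static byte[] processChunk(byte[] chunk) {
        return chunk;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1 << 20];   // stand-in for one compressed video segment
        int chunkSize = 32 * 32;           // the 32-by-32 data block mentioned in the text

        // Cut the buffer into fixed-size chunks.
        List<byte[]> chunks = IntStream.iterate(0, off -> off < data.length, off -> off + chunkSize)
                .mapToObj(off -> Arrays.copyOfRange(data, off, Math.min(off + chunkSize, data.length)))
                .collect(Collectors.toList());

        // parallelStream() (equivalently stream().parallel()) processes the chunks concurrently
        // on the common fork/join pool.
        List<byte[]> processed = chunks.parallelStream()
                .map(ParallelChunks::processChunk)
                .collect(Collectors.toList());

        System.out.println("processed " + processed.size() + " chunks");
    }
}
```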
In this embodiment, preferably, the operation parallelization in S2 is a detail-level optimization of the algorithm processing, reducing the generation of intermediate variables and computing in a single step wherever possible.
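A tiny sketch of what such a detail-level optimisation can look like: two per-pixel passes with a throw-away intermediate array are folded into a single pass. The gain and offset values are made up for the example.

```java
// FusedOps.java: illustrative sketch of "operation parallelization" as detail optimisation.
public class FusedOps {
    // Naive version: two passes and one intermediate array.
    static int[] twoPass(int[] px) {
        int[] tmp = new int[px.length];
        for (int i = 0; i < px.length; i++) tmp[i] = px[i] * 2;                    // pass 1: scale
        int[] out = new int[px.length];
        for (int i = 0; i < px.length; i++) out[i] = Math.min(255, tmp[i] + 10);   // pass 2: offset and clamp
        return out;
    }

    // Fused version: the same result in one pass, with no intermediate variable or array.
    static int[] fused(int[] px) {
        int[] out = new int[px.length];
        for (int i = 0; i < px.length; i++) out[i] = Math.min(255, px[i] * 2 + 10);
        return out;
    }
}
```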
In this embodiment, preferably, the target in S3 is tracked and positioned by particle filtering, the operation steps of which are as follows (a code sketch follows these steps):
s301, initialization stage-extraction of target features: selecting a target, and extracting the characteristics of a target area, namely a target color histogram;
s302, initializing particles: a) scattering particles evenly over the entire image, b) scattering particles in a gaussian distribution near the target of the previous frame;
s303, a search stage: counting the color histogram of each particle, comparing it with the color histogram of the target model, calculating the weight from the Bhattacharyya distance, and normalizing the weights so that the weights of all particles sum to 1;
s304, particle resampling: few particles are kept where the similarity is low and many particles are placed where the similarity is high, and particles with low weight are discarded;
s305, state transition: the position of each particle at the next moment is calculated according to the state transition equation s_t = A·s_{t-1} + w_{t-1};
s306, an observation stage: calculating the similarity between each particle and the target characteristic, and updating the weight of the particle;
s307, decision stage: calculating a weighted average value of the coordinates and the similarity to obtain the position of the next frame of the tracking target;
s308, repeating S303, S304, S305, S306 and S307 according to the predicted position.
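The following is a minimal, self-contained Java sketch of steps S301 to S308, assuming a two-dimensional position state and an abstract similarity(x, y) score that stands in for the colour-histogram comparison via the Bhattacharyya distance; all constants, names and the noise model are illustrative assumptions.

```java
// ParticleFilterSketch.java: illustrative sketch of the particle filter loop (S301-S308).
import java.util.Arrays;
import java.util.Random;
import java.util.function.DoubleBinaryOperator;

public class ParticleFilterSketch {
    static final int N = 200;                  // number of particles (assumed)
    final Random rnd = new Random(42);
    double[] px = new double[N], py = new double[N], w = new double[N];

    // S302 b): spread particles as a Gaussian around the target position of the previous frame.
    void init(double tx, double ty) {
        for (int i = 0; i < N; i++) {
            px[i] = tx + rnd.nextGaussian() * 5.0;
            py[i] = ty + rnd.nextGaussian() * 5.0;
            w[i] = 1.0 / N;
        }
    }

    // S305: state transition s_t = A*s_{t-1} + w_{t-1}; here A is the identity and w is Gaussian noise.
    void predict() {
        for (int i = 0; i < N; i++) {
            px[i] += rnd.nextGaussian() * 2.0;
            py[i] += rnd.nextGaussian() * 2.0;
        }
    }

    // S303/S306: score each particle against the target model and normalise the weights to sum to 1.
    void weight(DoubleBinaryOperator similarity) {
        double sum = 0;
        for (int i = 0; i < N; i++) { w[i] = similarity.applyAsDouble(px[i], py[i]); sum += w[i]; }
        for (int i = 0; i < N; i++) w[i] = (sum > 0) ? w[i] / sum : 1.0 / N;
    }

    // S307: the weighted average of the particle positions is the estimate for the next frame.
    double[] estimate() {
        double ex = 0, ey = 0;
        for (int i = 0; i < N; i++) { ex += w[i] * px[i]; ey += w[i] * py[i]; }
        return new double[] {ex, ey};
    }

    // S304: resampling; many copies survive where the weight is high, few where it is low.
    void resample() {
        double[] nx = new double[N], ny = new double[N], cdf = new double[N];
        double c = 0;
        for (int i = 0; i < N; i++) { c += w[i]; cdf[i] = c; }
        for (int i = 0; i < N; i++) {
            double u = rnd.nextDouble() * c;
            int j = 0;
            while (j < N - 1 && cdf[j] < u) j++;
            nx[i] = px[j]; ny[i] = py[j];
        }
        px = nx; py = ny;
        Arrays.fill(w, 1.0 / N);
    }
}
```

One tracking step (S308) would call predict(), weight(...), estimate() and resample() in that order for every new frame.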
In this embodiment, preferably, the model adopted for the algorithm parallelization in S4 is the PRAM model, which has a centralized shared memory (SM) and an instruction controller and performs implicitly synchronized computation by exchanging data through reads and writes (R/W) of the shared memory.
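The PRAM model is an abstract machine, so the sketch below only simulates its flavour: a set of notional processors share one memory array and advance in synchronous rounds, exchanging data solely through shared-memory reads and writes. It is a toy reduction, not a claim about the patent's actual hardware mapping.

```java
// PramReduceSketch.java: toy simulation of a PRAM-style synchronous parallel reduction.
public class PramReduceSketch {
    static int parallelSum(int[] sharedMem) {
        int n = sharedMem.length;
        // Each pass of the outer loop is one synchronous PRAM step; each iteration of the inner
        // loop is the work of one notional processor reading two cells and writing one.
        for (int stride = 1; stride < n; stride *= 2) {
            for (int i = 0; i + stride < n; i += 2 * stride) {
                sharedMem[i] = sharedMem[i] + sharedMem[i + stride];
            }
        }
        return sharedMem[0];
    }

    public static void main(String[] args) {
        int[] mem = {3, 1, 4, 1, 5, 9, 2, 6};
        System.out.println(parallelSum(mem)); // prints 31
    }
}
```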
In this embodiment, preferably, the step of comparing the bubble sort method in S5 is as follows:
s501, comparing adjacent elements, and if the first element is larger than the second element, exchanging the two elements;
s502, performing the same comparison for every pair of adjacent elements, from the first pair at the beginning to the last pair at the end; when this pass is completed, the last element is the largest;
s503, repeating the steps for all the elements except the last element;
s504, repeating the above steps for fewer and fewer elements each time until no pair of numbers needs to be compared.
In this embodiment, preferably, the dividing process numbers the divided video data stream segments during division, and the bubble sorting is then carried out according to these numbers, so that the video data stream is played back in sequence, as sketched below.
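A direct Java sketch of steps S501 to S504, here applied to segment records ordered by the sequence number assigned at division time; the record type and field names are assumptions made for the example.

```java
// BubbleSortSegments.java: illustrative bubble sort of numbered video-data segments (S501-S504).
public class BubbleSortSegments {
    record Segment(int seq, byte[] payload) {}   // seq is the number assigned when the stream was divided

    static void bubbleSort(Segment[] a) {
        // After each outer pass the largest remaining sequence number has bubbled to the end,
        // so the compared range shrinks by one element each time (S503/S504).
        for (int end = a.length - 1; end > 0; end--) {
            for (int i = 0; i < end; i++) {
                if (a[i].seq() > a[i + 1].seq()) {                    // S501: compare adjacent elements
                    Segment t = a[i]; a[i] = a[i + 1]; a[i + 1] = t;  // swap if out of order
                }
            }
        }
    }
}
```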
In this embodiment, preferably, the high-definition processing in S3 uses a cubic convolution method for scaling the video signal; each pixel of the output image is the result of an operation on 16 pixels of the original image, and when cubic convolution interpolation is used, the value of a target point is calculated by resampling the values of the 16 known surrounding pixels.
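A minimal sketch of cubic-convolution (bicubic) resampling in which every output pixel is computed from the 4 × 4 = 16 surrounding source pixels. The kernel coefficient a = -0.5 (a Catmull-Rom style kernel) and the grayscale, clamped-border handling are assumptions; the patent does not fix these details.

```java
// CubicConvolutionSketch.java: illustrative bicubic (cubic convolution) scaling of a grayscale image.
public class CubicConvolutionSketch {
    // Standard cubic convolution kernel with a = -0.5 (assumed coefficient).
    static double cubic(double x) {
        double a = -0.5;
        x = Math.abs(x);
        if (x <= 1) return (a + 2) * x * x * x - (a + 3) * x * x + 1;
        if (x < 2)  return a * x * x * x - 5 * a * x * x + 8 * a * x - 4 * a;
        return 0;
    }

    static int clamp(int v, int lo, int hi) { return Math.max(lo, Math.min(hi, v)); }

    // Resample the source image at a non-integer position (sx, sy) from its 16 known neighbours.
    static double sample(int[][] img, double sx, double sy) {
        int x0 = (int) Math.floor(sx), y0 = (int) Math.floor(sy);
        double val = 0;
        for (int m = -1; m <= 2; m++)            // 4 rows ...
            for (int n = -1; n <= 2; n++) {      // ... times 4 columns = 16 known pixels
                int yy = clamp(y0 + m, 0, img.length - 1);
                int xx = clamp(x0 + n, 0, img[0].length - 1);
                val += img[yy][xx] * cubic(sx - (x0 + n)) * cubic(sy - (y0 + m));
            }
        return val;
    }

    // Scale a grayscale image to newW x newH by resampling every target pixel.
    static int[][] scale(int[][] img, int newW, int newH) {
        int h = img.length, w = img[0].length;
        int[][] out = new int[newH][newW];
        for (int y = 0; y < newH; y++)
            for (int x = 0; x < newW; x++)
                out[y][x] = clamp((int) Math.round(sample(img, x * (double) w / newW, y * (double) h / newH)), 0, 255);
        return out;
    }
}
```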
The working principle and the using process of the invention are as follows:
Firstly, the hardware receives and divides the data stream information: the CPU receives the data stream information, separates and compresses it to obtain video data, places the compressed video data in the separation area of the system memory, and divides the compressed video data into a number of small parts;
secondly, the CPU distributes the compressed video data to the GPU and the APU for processing: the CPU transmits the compressed video data, divided into a number of small parts, to the GPU and the APU for data-parallel and operation-parallel processing, so that the compressed video data stream is decompressed and the decompressed data information is stored in the sound card;
thirdly, the GPU and the APU post-process the video data stream with their own algorithms: the GPU and the APU lock onto characteristic objects or persons in the video with a tracking algorithm, then track and position the target with Kalman filtering, particle filtering, Meanshift, Camshift, MOSSE, CSK, KCF, BACF or SAMF, then apply fine processing to the small video data stream of the area where the target is located so that it is processed at higher definition, and apply blurring to the small video data streams of the non-target areas, thereby increasing the speed of hardware processing;
fourthly, algorithm parallelization is adopted for the algorithm processing of the video data stream: when the GPU and the APU process the video data stream with the algorithm, the algorithm is parallelized so that the GPU and the APU can use their processing space simultaneously, which effectively increases the running speed and efficiency of the hardware and completes the accelerated processing of the video data stream;
fifthly, the processed video data stream is received and played by the CPU: when the GPU and the APU finish processing the video data stream, the CPU receives the video data, splices the video data stream according to the order of the segmented data, arranges it with a bubble sort, and then plays the video data stream quickly.
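Pulling the five steps together, the following high-level Java sketch splits a buffer on the CPU, processes the numbered pieces concurrently, and reassembles them in order. The worker body is a placeholder and the thread pool merely stands in for the GPU/APU offload, which in practice would go through a native or OpenCL/CUDA binding; the class, record and method names are assumptions.

```java
// PipelineSketch.java: high-level illustration of the split / parallel-process / reassemble flow.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PipelineSketch {
    record Piece(int seq, byte[] data) {}

    static byte[] process(byte[] data) { return data; }   // placeholder for the per-piece work

    public static void main(String[] args) throws Exception {
        byte[] compressed = new byte[64 * 1024];           // stand-in for the received, compressed stream
        int chunk = 16 * 1024;

        // Step 1: the CPU splits the compressed stream into numbered pieces.
        List<Piece> pieces = new ArrayList<>();
        for (int off = 0, seq = 0; off < compressed.length; off += chunk, seq++)
            pieces.add(new Piece(seq, Arrays.copyOfRange(compressed, off, Math.min(off + chunk, compressed.length))));

        // Steps 2-4: the pieces are processed concurrently (decompression, tracking, sharpening/blurring).
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Piece>> futures = new ArrayList<>();
        for (Piece p : pieces)
            futures.add(pool.submit(() -> new Piece(p.seq(), process(p.data()))));

        // Step 5: the CPU collects the results and restores the original order by sequence number
        // (ordering here uses the library sort for brevity; the patent's bubble sort is sketched earlier).
        List<Piece> done = new ArrayList<>();
        for (Future<Piece> f : futures) done.add(f.get());
        done.sort(Comparator.comparingInt(Piece::seq));
        pool.shutdown();
        System.out.println("reassembled " + done.size() + " pieces in order");
    }
}
```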
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A hardware acceleration method based on a tracking algorithm is characterized by comprising the following steps:
S1, the hardware receives and divides the data stream information: the CPU receives the data stream information, separates and compresses it to obtain video data, places the compressed video data in the separation area of the system memory, and divides the compressed video data into a number of small parts;
S2, the CPU distributes the compressed video data to the GPU and the APU for processing: the CPU transmits the compressed video data, divided into a number of small parts, to the GPU and the APU for data-parallel and operation-parallel processing, so that the compressed video data stream is decompressed and the decompressed data information is stored in the sound card;
S3, the GPU and the APU post-process the video data stream with their own algorithms: the GPU and the APU lock onto characteristic objects or persons in the video with a tracking algorithm, then track and position the target with Kalman filtering, particle filtering, Meanshift, Camshift, MOSSE, CSK, KCF, BACF or SAMF, then apply fine processing to the small video data stream of the area where the target is located so that it is processed at higher definition, and apply blurring to the small video data streams of the non-target areas, thereby increasing the speed of hardware processing;
S4, algorithm parallelization is adopted for the algorithm processing of the video data stream: when the GPU and the APU process the video data stream with the algorithm, the algorithm is parallelized so that the GPU and the APU can use their processing space simultaneously, which effectively increases the running speed and efficiency of the hardware and completes the accelerated processing of the video data stream;
S5, the processed video data stream is received and played by the CPU: when the GPU and the APU finish processing the video data stream, the CPU receives the video data, splices the video data stream according to the order of the segmented data, arranges it with a bubble sort, and then plays the video data stream quickly.
2. The hardware acceleration method based on a tracking algorithm of claim 1, characterized in that: the segmentation processing in S1 adopts a spatial-domain algorithm and a temporal-domain algorithm; the spatial-domain algorithm works macroblock by macroblock, the processing of each pixel is carried out locally in the spatial domain, each pixel is processed in turn to produce an output, and no result from the previous or next pixel is accumulated into that output;
the temporal-domain algorithm looks for changes or similarities in a particular pixel or pixel region between frames and converts interlaced fields to progressive format by line doubling or filtering.
3. The hardware acceleration method based on a tracking algorithm of claim 2, characterized in that: the macroblock algorithm low-pass filters a pixel with the average value computed from the neighbouring pixels around it, i.e. a 5 × 5 two-dimensional convolution kernel is used to detect edge information in the image and to draw the relevant information from the large number of surrounding pixels; spatial processing and temporal processing are combined into a new category called "spatio-temporal processing": each image frame is decomposed into a number of macroblocks, i.e. regions of 16 × 16 pixels, which are then tracked and compared frame by frame to extract approximate motion estimation and compensation values.
4. The hardware acceleration method based on a tracking algorithm of claim 1, characterized in that: the data parallelization in S2 divides the data into blocks that can be processed simultaneously, where the data blocks can be 16 by 16 or 32 by 32; data parallelization requires a Stream object whose parallel method is called to give it the capability of parallel operation, or a stream created from a collection class through parallelStream, which immediately yields a stream capable of parallel execution.
5. The hardware acceleration method based on a tracking algorithm of claim 1, characterized in that: the operation parallelization in S2 is a detail-level optimization of the algorithm processing, reducing the generation of intermediate variables and computing in a single step wherever possible.
6. The hardware acceleration method based on a tracking algorithm of claim 1, characterized in that: the target in S3 is tracked and positioned by particle filtering, the operation steps of which are as follows:
s301, initialization stage-extraction of target features: selecting a target, and extracting the characteristics of a target area, namely a target color histogram;
s302, initializing particles: a) scattering particles evenly over the entire image, b) scattering particles in a gaussian distribution near the target of the previous frame;
s303, a search stage: counting the color histogram of each particle, comparing it with the color histogram of the target model, calculating the weight from the Bhattacharyya distance, and normalizing the weights so that the weights of all particles sum to 1;
s304, particle resampling: few particles are kept where the similarity is low and many particles are placed where the similarity is high, and particles with low weight are discarded;
s305, state transition: the position of each particle at the next moment is calculated according to the state transition equation s_t = A·s_{t-1} + w_{t-1};
s306, an observation stage: calculating the similarity between each particle and the target characteristic, and updating the weight of the particle;
s307, decision stage: calculating a weighted average value of the coordinates and the similarity to obtain the position of the next frame of the tracking target;
s308, repeating S303, S304, S305, S306 and S307 according to the predicted position.
7. The hardware acceleration method based on a tracking algorithm of claim 1, characterized in that: the algorithm parallelization in S4 adopts the PRAM model, which has a centralized shared memory (SM) and an instruction controller and performs implicitly synchronized computation by exchanging data through reads and writes (R/W) of the shared memory.
8. The hardware acceleration method based on tracking algorithm of claim 1, characterized in that: the step of comparing the bubble sort method in S5 is as follows:
s501, comparing adjacent elements, and if the first element is larger than the second element, exchanging the two elements;
s502, performing the same comparison for every pair of adjacent elements, from the first pair at the beginning to the last pair at the end; when this pass is completed, the last element is the largest;
s503, repeating the steps for all the elements except the last element;
s504, repeating the above steps for fewer and fewer elements each time until no pair of numbers needs to be compared.
9. The hardware acceleration method based on a tracking algorithm of claim 8, characterized in that: the dividing process numbers the divided video data stream segments during division, and the bubble sorting is then carried out according to these numbers, so that the video data stream is played back in sequence.
10. The hardware acceleration method based on a tracking algorithm of claim 1, characterized in that: the high-definition processing in S3 uses a cubic convolution method for scaling the video signal; each pixel of the output image is the result of an operation on 16 pixels of the original image, and when cubic convolution interpolation is used, the value of a target point is calculated by resampling the values of the 16 known surrounding pixels.
CN202110723521.XA 2021-06-29 2021-06-29 Hardware acceleration method based on tracking algorithm Active CN113436232B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110723521.XA CN113436232B (en) 2021-06-29 2021-06-29 Hardware acceleration method based on tracking algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110723521.XA CN113436232B (en) 2021-06-29 2021-06-29 Hardware acceleration method based on tracking algorithm

Publications (2)

Publication Number Publication Date
CN113436232A true CN113436232A (en) 2021-09-24
CN113436232B CN113436232B (en) 2023-03-24

Family

ID=77757427

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110723521.XA Active CN113436232B (en) 2021-06-29 2021-06-29 Hardware acceleration method based on tracking algorithm

Country Status (1)

Country Link
CN (1) CN113436232B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070279411A1 (en) * 2003-11-19 2007-12-06 Reuven Bakalash Method and System for Multiple 3-D Graphic Pipeline Over a Pc Bus
CN112732637A (en) * 2021-01-22 2021-04-30 湖南师范大学 Bayesian resampling-based FPGA hardware implementation method and device for particle filtering, and target tracking method
CN112927127A (en) * 2021-03-11 2021-06-08 华南理工大学 Video privacy data fuzzification method running on edge device
CN112669196A (en) * 2021-03-16 2021-04-16 浙江欣奕华智能科技有限公司 Method and equipment for optimizing data by factor graph in hardware acceleration engine

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114418824A (en) * 2022-01-27 2022-04-29 支付宝(杭州)信息技术有限公司 Image processing method, device and storage medium

Also Published As

Publication number Publication date
CN113436232B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
CN105205441B (en) Method and apparatus for extracting feature regions from point cloud
JP2005011005A (en) Information processor
Ttofis et al. High-quality real-time hardware stereo matching based on guided image filtering
CN109191364A (en) Accelerate the hardware structure of artificial intelligence process device
CN113989276B (en) Detection method and detection device based on depth image and camera equipment
CN113436232B (en) Hardware acceleration method based on tracking algorithm
Mahmoudi et al. Multi-gpu based event detection and localization using high definition videos
Chen et al. Efficient parallel connected component labeling with a coarse-to-fine strategy
Su et al. Artificial intelligence design on embedded board with edge computing for vehicle applications
CN105303519A (en) Method and apparatus for generating temporally consistent superpixels
Li et al. Gpu and cpu cooperative accelaration for face detection on modern processors
Wasala et al. Real-time HOG+ SVM based object detection using SoC FPGA for a UHD video stream
Mahmoudi et al. Multi-CPU/multi-GPU based framework for multimedia processing
CN114937159A (en) Binocular matching method based on GPU acceleration
CN111680619A (en) Pedestrian detection method based on convolutional neural network and double-attention machine mechanism
Mahmoudi et al. Taking advantage of heterogeneous platforms in image and video processing
Jia et al. NSLIC: SLIC superpixels based on nonstationarity measure
Messom et al. Stream processing of integral images for real-time object detection
CN109493349B (en) Image feature processing module, augmented reality equipment and corner detection method
CN110942416B (en) General morphological acceleration method for GPU
CA2780710A1 (en) Video segmentation method
Denoulet et al. Implementing motion markov detection on general purpose processor and associative mesh
Ramirez-Martinez et al. Dynamic management of a partial reconfigurable hardware architecture for pedestrian detection in regions of interest
Gu et al. High frame-rate tracking of multiple color-patterned objects
Cabido et al. High speed articulated object tracking using GPUs: A particle filter approach

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant