CN113436232B

CN113436232B - Hardware acceleration method based on tracking algorithm

Info

Publication number: CN113436232B
Application number: CN202110723521.XA
Authority: CN
Inventors: 胡铭德
Original assignee: Shanghai Lexin Information Technology Co ltd
Current assignee: Shanghai Lexin Information Technology Co ltd
Priority date: 2021-06-29
Filing date: 2021-06-29
Publication date: 2023-03-24
Anticipated expiration: 2041-06-29
Also published as: CN113436232A

Abstract

The invention discloses a hardware acceleration method based on a tracking algorithm, comprising: S1, hardware receives and segments data stream information; S2, the CPU distributes the compressed video data to the GPU and the APU for processing; S3, the GPU and the APU post-process the video data stream using their own algorithms; S4, algorithm parallelization is applied to the algorithmic processing of the video data stream; S5, the CPU receives and plays the processed video data stream. The invention segments the data information so that it can be processed as a plurality of small blocks, which improves hardware acceleration, and by adopting algorithm parallelization, data parallelization and operation parallelization during processing it improves the running speed and efficiency of the hardware.

Description

Hardware acceleration method based on tracking algorithm

Technical Field

The invention belongs to the technical field of hardware acceleration, and particularly relates to a hardware acceleration method based on a tracking algorithm.

Background

Hardware acceleration is a technique that reduces the workload of the central processing unit (CPU) by assigning computationally intensive jobs to dedicated hardware. It is used especially often in image processing. The CPU is built to execute a wide variety of instructions in a short time, and which instructions it processes is determined mainly by software; because of this general-purpose structure, however, some repetitive tasks cannot be handled efficiently or quickly. Dedicated hardware elements do not need to be as flexible as the CPU, so their design can be optimized for these particular problems, which leaves the CPU free for other tasks. Some tasks can be solved very efficiently by breaking them down into thousands of smaller tasks, such as Fourier transforming a certain frequency band or rendering a small image. These small tasks can be computed in parallel, independently of each other. Massively parallel computation, that is, using a large number of small processors running in parallel, can greatly increase the overall speed of such tasks, and in many cases the computation speed increases linearly with the number of parallel processors. Parallel computation is also significant for energy efficiency: energy usage increases linearly with the number of parallel processors but with the square of the processor frequency, so parallel processors do not need a high clock frequency and consume relatively little energy. Nevertheless, the hardware acceleration solutions currently on the market still have various problems.

Although the target detection hardware accelerator and acceleration method disclosed in CN112230884B can reduce the time and power the accelerator spends on data transfer and improve its working efficiency, existing hardware acceleration technology cannot acquire a target through a tracking algorithm, process the target to be detected precisely, and thereby accelerate the hardware. A hardware acceleration method based on a tracking algorithm is therefore proposed to solve these problems.

Disclosure of Invention

The present invention is directed to a hardware acceleration method based on a tracking algorithm, so as to solve the problems set forth in the above background art.

In order to achieve the purpose, the invention provides the following technical scheme: a hardware acceleration method based on a tracking algorithm comprises the following steps:

s1, hardware receives and segments data stream information: the CPU receives the data stream information, separates and compresses it to obtain video data, places the separated, compressed video data in system memory, and divides the compressed video data into a plurality of small parts;

s2, the CPU distributes the compressed video data to the GPU and the APU for processing: the CPU transmits the compressed video data, divided into a plurality of small parts, to the GPU and the APU for data-parallel and operation-parallel processing, so that the compressed video data stream is decompressed, and the decompressed data information is stored in the sound card;

s3, the GPU and the APU post-process the video data stream through their own algorithms: the GPU and the APU lock onto characteristic objects or persons in the video through a tracking algorithm, then track and position the target through Kalman filtering, particle filtering, meanshift, camshift, MOSSE, CSK, KCF, BACF or SAMF, then apply fine processing to the small video data streams of the area where the target is located, so that this area is processed at higher definition, and apply blurred processing to the small video data streams of the non-target areas, thereby increasing the speed of hardware processing;

s4, algorithm parallelization is adopted for the algorithmic processing of the video data stream: when the GPU and the APU process the video data stream, algorithm parallelization allows both of them to use their own processing space at the same time, which increases the running speed and efficiency of the hardware and completes the accelerated processing of the video data stream;

s5, the CPU receives and plays the processed video data stream: when the GPU and the APU finish processing the video data stream, the CPU receives the video data, splices the video data stream according to the order of the segmented data, arranges it using the bubble sorting method, and then plays the video data stream quickly.

Preferably, the segmentation processing in S1 adopts a space-domain algorithm and a time-domain algorithm; the space-domain algorithm operates on macroblocks, each pixel is processed locally in the space domain, the pixels are processed in sequence and each produces one output, and the output does not accumulate results from the previous or next pixel;

the time-domain algorithm looks for where a change or similarity occurs in a particular pixel or pixel region between frames, and converts interlaced fields to progressive format by line doubling or filtering.

Preferably, the macroblock algorithm low-pass filters a pixel using the average of the adjacent pixels around it, that is, a 5 × 5 two-dimensional convolution kernel is used to detect edge information in the image and to draw related information from the many pixels surrounding the pixel; spatial processing and temporal processing are combined to form a new category called "space-time processing", in which each image frame is decomposed into a plurality of macroblocks, namely areas of 16 × 16 pixels, and the macroblocks are then tracked and compared frame by frame to extract approximate values for motion estimation and compensation.
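The spatial steps described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `mean_filter_5x5`, `split_macroblocks` and `sad`, the border handling, and the use of the sum of absolute differences as the block comparison measure are all assumptions made for illustration.

```python
from typing import Iterator, List, Tuple

Image = List[List[float]]


def mean_filter_5x5(img: Image) -> Image:
    """Low-pass filter: replace each pixel by the mean of its 5x5 neighbourhood.

    Assumption: border pixels average only the neighbours that exist.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-2, 3):
                for dx in range(-2, 3):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += img[yy][xx]
                        count += 1
            out[y][x] = total / count
    return out


def split_macroblocks(img: Image, size: int = 16) -> Iterator[Tuple[int, int, Image]]:
    """Yield (block_row, block_col, block) for each size x size macroblock.

    Blocks on the right and bottom edges may be smaller than size x size.
    """
    h, w = len(img), len(img[0])
    for by in range(0, h, size):
        for bx in range(0, w, size):
            block = [row[bx:bx + size] for row in img[by:by + size]]
            yield by // size, bx // size, block


def sad(block_a: Image, block_b: Image) -> float:
    """Sum of absolute differences between two equally sized blocks.

    Assumed here as the frame-to-frame comparison measure; the source
    does not name a specific one.
    """
    return sum(abs(p - q)
               for row_a, row_b in zip(block_a, block_b)
               for p, q in zip(row_a, row_b))
```

Comparing the macroblock at the same position in two consecutive frames with `sad` gives a rough indication of motion: a value of zero means the block did not change.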

Preferably, the data parallelization in S2 divides the data into a plurality of small blocks that can be processed simultaneously, using block sizes of 16 × 16 and 32 × 32; data parallelization requires a Stream object: either the parallel method of an existing Stream object is called to give it parallel operation capability, or a parallel stream is created directly from a collection class to obtain a Stream with parallel capability immediately.
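As a rough illustration of block-wise data parallelization, the sketch below splits a frame into tiles and processes them concurrently. It swaps the Stream object named in the text for a Python thread pool (`concurrent.futures.ThreadPoolExecutor`); the function name `process_frame_parallel` and the per-pixel operation (inverting 8-bit values) are placeholders chosen for illustration, not part of the source.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Frame = List[List[int]]


def process_frame_parallel(frame: Frame, size: int = 16, workers: int = 4) -> Frame:
    """Split a frame into size x size tiles, process the tiles concurrently,
    and write each result back at its original position."""
    h, w = len(frame), len(frame[0])
    out = [[0] * w for _ in range(h)]
    origins = [(y, x) for y in range(0, h, size) for x in range(0, w, size)]

    def work(origin: Tuple[int, int]) -> Tuple[Tuple[int, int], Frame]:
        y0, x0 = origin
        # Placeholder per-tile operation: invert 8-bit pixel values.
        tile = [[255 - p for p in row[x0:x0 + size]]
                for row in frame[y0:y0 + size]]
        return origin, tile

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Tiles are independent, so they can be processed in any order.
        for (y0, x0), tile in pool.map(work, origins):
            for dy, row in enumerate(tile):
                out[y0 + dy][x0:x0 + len(row)] = row
    return out
```

Because each tile carries its origin, the result is assembled correctly regardless of which worker finishes first. In CPython a thread pool mainly demonstrates the structure; real speed-ups for pixel arithmetic would need processes or native code.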

Preferably, the operation parallelization in S2 is a detailed optimization of the algorithm processing that generates as few intermediate variables as possible and completes the calculation in as few steps as possible.

Preferably, particle filtering is selected for tracking and positioning the target in S3, and the particle filtering comprises the following operation steps:

s301, initialization stage-extraction of target features: selecting a target, and extracting the characteristics of a target area, namely a target color histogram;

s302, initializing particles: a) Scattering particles evenly over the entire image, b) scattering particles in a gaussian distribution near the target of the previous frame;

s303, a searching stage: computing the color histogram of each particle, comparing it with the color histogram of the target model, calculating the weight according to the Bhattacharyya distance, and normalizing the weights so that the weights of all particles sum to 1;

s304, resampling particles: a small number of particles are placed at a place with low similarity, a plurality of particles are placed at a place with high similarity, and the particles with low weight are discarded;

s305, state transition: calculating the position of the particle at the next moment according to $s_t = A s_{t-1} + w_{t-1}$;

s306, an observation stage: calculating the similarity between each particle and the target characteristic, and updating the weight of the particle;

s307, decision stage: calculating a weighted average value of the coordinates and the similarity to obtain the position of the next frame of the tracking target;

s308, repeating S303, S304, S305, S306 and S307 according to the predicted position.
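The particle filtering cycle in S301 to S308 can be sketched in Python as follows. This is a simplified illustration under several assumptions: the image is a grid of quantized intensity values standing in for colors, the transition matrix A is the identity, the noise is Gaussian, and the weight is computed as exp(-lam · d²) from the Bhattacharyya distance d. The names `histogram`, `bhattacharyya_distance`, `track_step` and the parameters `half`, `bins`, `sigma`, `lam` are hypothetical. The function runs one pass of the cycle starting at the state transition (S305, S306, S307, then S304).

```python
import math
import random
from typing import List, Tuple

Point = Tuple[int, int]


def histogram(img: List[List[int]], cx: int, cy: int, half: int, bins: int) -> List[float]:
    """Normalised histogram of the patch centred at (cx, cy); stands in for
    the color histogram of S301/S303. Pixel values must lie in range(bins)."""
    h, w = len(img), len(img[0])
    hist = [0.0] * bins
    n = 0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            if 0 <= y < h and 0 <= x < w:
                hist[img[y][x]] += 1.0
                n += 1
    return [v / n for v in hist] if n else hist


def bhattacharyya_distance(p: List[float], q: List[float]) -> float:
    """d = sqrt(1 - sum(sqrt(p_i * q_i))) for two normalised histograms."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.sqrt(max(0.0, 1.0 - bc))


def track_step(img, particles: List[Point], target_hist, rng: random.Random,
               half: int = 2, bins: int = 4, sigma: float = 2.0, lam: float = 20.0):
    """One cycle: transition, observation, decision, resampling."""
    h, w = len(img), len(img[0])
    # S305: s_t = A s_{t-1} + w_{t-1}, with A = identity (assumption)
    # and Gaussian noise; positions are clamped to the image.
    moved = [(min(w - 1, max(0, round(x + rng.gauss(0, sigma)))),
              min(h - 1, max(0, round(y + rng.gauss(0, sigma)))))
             for x, y in particles]
    # S303/S306: weight each particle by histogram similarity, then normalise.
    weights = [math.exp(-lam * bhattacharyya_distance(
        histogram(img, x, y, half, bins), target_hist) ** 2) for x, y in moved]
    total = sum(weights)
    weights = [wt / total for wt in weights]
    # S307: weighted average of the particle coordinates.
    est = (sum(wt * x for wt, (x, _) in zip(weights, moved)),
           sum(wt * y for wt, (_, y) in zip(weights, moved)))
    # S304: resample so that high-weight particles are duplicated
    # and low-weight particles tend to be discarded.
    resampled = rng.choices(moved, weights=weights, k=len(moved))
    return est, resampled, weights
```

To use it, compute `target_hist` once from the selected target region (S301), scatter the initial particles uniformly or around the previous position (S302), and call `track_step` once per frame (S308).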

Preferably, the algorithm parallelization in S4 adopts the PRAM (parallel random access machine) model, which has a centralized shared memory (SM) and an instruction controller; data are exchanged through reads and writes (R/W) of the shared memory, and the computation is implicitly synchronized.
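To make the PRAM idea concrete, the following Python sketch simulates a tree-style parallel sum on a single thread. The list `shared` plays the role of the shared memory, and each loop iteration is one synchronous step in which every active "processor" first reads its operands and then writes its result. The function name `pram_sum` and the choice of summation as the example task are assumptions; the source does not say which computation is parallelized this way.

```python
from typing import Iterable, Tuple


def pram_sum(values: Iterable[float]) -> Tuple[float, int]:
    """Simulate a PRAM tree reduction; returns (sum, number of parallel steps)."""
    shared = list(values)          # shared memory (SM)
    n = len(shared)
    if n == 0:
        return 0, 0
    stride, steps = 1, 0
    while stride < n:
        # Read phase: all active processors read their two operands.
        reads = [(i, shared[i], shared[i + stride])
                 for i in range(0, n - stride, 2 * stride)]
        # Write phase: all processors write back; no cell is written twice.
        for i, a, b in reads:
            shared[i] = a + b
        stride *= 2
        steps += 1
    return shared[0], steps
```

With enough processors, n values are summed in about log2(n) synchronous steps instead of n - 1 sequential additions; for example, eight values need three steps.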

Preferably, the step of comparing in the bubble sort method in S5 is as follows:

s501, comparing adjacent elements, and if the first element is larger than the second element, exchanging the two elements;

s502, performing the same work on each pair of adjacent elements, namely, from the first pair to the last pair at the end, wherein the last element is the maximum number after the work is completed;

s503, repeating the steps for all the elements except the last element;

s504, repeating the above steps for fewer and fewer elements each time until no pair of numbers needs to be compared.

Preferably, the segmentation processing numbers the divided video data stream segments as it divides them, and the bubble sorting method then sorts the segments by these numbers, so that the video data stream is played in sequence.
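The bubble sort steps S501 to S504, applied to numbered segments, can be sketched as follows. The representation of a segment as a `(number, payload)` tuple and the function names are assumptions for illustration; the early exit when a pass makes no swap is a common refinement that the text does not mention.

```python
from typing import List, Tuple

Segment = Tuple[int, bytes]


def bubble_sort_segments(segments: List[Segment]) -> List[Segment]:
    """Sort (number, payload) segments by number with bubble sort."""
    items = list(segments)
    for end in range(len(items) - 1, 0, -1):
        swapped = False
        # S501/S502: compare each adjacent pair up to the unsorted end;
        # after the pass, the largest number sits at position `end`.
        for i in range(end):
            if items[i][0] > items[i + 1][0]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        # S503/S504: repeat on a shrinking range; stop early if already sorted.
        if not swapped:
            break
    return items


def reassemble(segments: List[Segment]) -> bytes:
    """Concatenate payloads in segment-number order for playback."""
    return b"".join(payload for _, payload in bubble_sort_segments(segments))
```

For instance, `reassemble([(2, b"c"), (0, b"a"), (1, b"b")])` returns `b"abc"`.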

Preferably, in the high-definition processing in S3, the video signal is scaled by the cubic convolution method: each pixel of the output image is computed from 16 pixels of the original image, that is, with cubic convolution interpolation the value of the target point is obtained by resampling the values of the 16 known pixels surrounding it.
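A minimal Python sketch of cubic convolution interpolation over a 4 × 4 neighbourhood is given below. It uses the widely used Keys kernel with a = -0.5, which is an assumption, since the source does not specify the kernel coefficients; the function names and the clamping of coordinates at the image border are likewise illustrative choices.

```python
import math
from typing import List


def cubic_kernel(x: float, a: float = -0.5) -> float:
    """Keys cubic convolution kernel; a = -0.5 is an assumed, common choice."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x ** 3 - (a + 3) * x ** 2 + 1
    if x < 2:
        return a * x ** 3 - 5 * a * x ** 2 + 8 * a * x - 4 * a
    return 0.0


def bicubic_sample(img: List[List[float]], fx: float, fy: float) -> float:
    """Interpolate the value at fractional position (fx, fy) from the
    16 surrounding source pixels (4 rows x 4 columns)."""
    h, w = len(img), len(img[0])
    x0, y0 = math.floor(fx), math.floor(fy)
    value = 0.0
    for j in range(-1, 3):
        wy = cubic_kernel(fy - (y0 + j))
        yy = min(h - 1, max(0, y0 + j))      # clamp at the border (assumption)
        for i in range(-1, 3):
            wx = cubic_kernel(fx - (x0 + i))
            xx = min(w - 1, max(0, x0 + i))
            value += wx * wy * img[yy][xx]
    return value
```

To scale an image, each output pixel is mapped back to a fractional position in the source image and `bicubic_sample` is evaluated there. At integer positions the kernel reproduces the source pixel itself.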

Compared with the prior art, the invention has the beneficial effects that:

the invention segments the data information so that it can be processed as a plurality of small blocks, applies fine processing to the target area located by the tracking algorithm, and applies blurred processing to the other areas, which effectively accelerates the hardware; in addition, the algorithm parallelization, data parallelization and operation parallelization adopted during processing further improve the running speed and efficiency of the hardware.

Drawings

FIG. 1 is a schematic flow chart of the steps of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, the present invention provides a technical solution: a hardware acceleration method based on a tracking algorithm comprises the following steps:

s3, the GPU and the APU realize post processing of the video data stream through the algorithm of the GPU and the APU: the GPU and the APU realize target locking on characteristic objects or characters in the video through a tracking algorithm, then realize tracking and positioning on the target through Kalman filtering, particle filtering, meanshift, camshift, MOSSE, CSK, KCF, BACF or SAMF, then realize fine processing on the small video data stream of the area where the target is located, so that the video data stream can be processed in a higher definition mode, then the small video data stream of the area where the non-target is located is processed in a fuzzy mode, and further the speed of hardware processing is improved;

s4, algorithm parallelization is adopted for the algorithm processing process of the video data stream: when the GPU and the APU process the video data stream by the algorithm, the algorithm is parallelized, so that the GPU and the APU can simultaneously utilize the processing space of the GPU and the APU, the running speed and the efficiency of hardware can be effectively accelerated, and the accelerated processing of the video data stream is finished;

In this embodiment, it is preferable that the segmentation processing in S1 employs a spatial domain algorithm and a time domain algorithm, the spatial domain algorithm is performed by a macroblock, the processing of each pixel is performed locally in a spatial domain, each pixel is processed in sequence and then an output is generated, and there is no result accumulation effect when the output is generated from the previous or next pixel;

In this embodiment, preferably, the algorithm of the macroblock performs low-pass filtering on a pixel by using an average value calculated by neighboring pixels around the pixel, that is, detecting edge information in an image by using a two-dimensional convolution kernel of 5 × 5, and drawing related information from a large number of pixels around the pixel; the spatial processing and the temporal processing are combined to form a new category, which is called "space-time processing", each image frame is decomposed into a plurality of macro blocks, namely an area of 16 × 16 pixels, and then the macro blocks are tracked and compared frame by frame to extract approximate values of motion estimation and compensation.

In this embodiment, preferably, the data parallelization in S2 is to divide the data block into a plurality of small blocks capable of being processed simultaneously, where the data block can implement 16 by 16 and 32 by 32 data blocks, and the data parallelization needs to have a Stream object, call a parallel method of the Stream object to enable the Stream object to have a parallel operation capability, or create a Stream from a set class to call a parallel Stream to immediately obtain a Stream with a parallel capability.

In this embodiment, preferably, the operation parallelization in S2 is a detail optimization of algorithm processing, so as to reduce generation of intermediate variables as much as possible and achieve calculation in one step as much as possible.

In this embodiment, preferably, particle filtering is selected for tracking and positioning the target in S3, and the operation steps of the particle filtering are as follows:

s301, initialization stage-target feature extraction: selecting a target, and extracting the characteristics of a target area, namely a target color histogram;

s303, a search stage: computing the color histogram of each particle, comparing it with the color histogram of the target model, calculating the weight according to the Bhattacharyya distance, and normalizing the weights so that the weights of all particles sum to 1;

s304, particle resampling: a small number of particles are placed at a place with low similarity, a plurality of particles are placed at a place with high similarity, and the particles with low weight are discarded;

s305, state transition: calculating the position of the particle at the next moment according to $s_t = A s_{t-1} + w_{t-1}$;

In this embodiment, preferably, the model adopted for the algorithm parallelization in S4 is the PRAM (parallel random access machine) model, which has a centralized shared memory (SM) and an instruction controller; data are exchanged through reads and writes (R/W) of the shared memory, and the computation is implicitly synchronized.

In this embodiment, preferably, the step of comparing the bubble sort method in S5 is as follows:

s503, repeating the steps for all the elements except the last element;

In this embodiment, preferably, the segmentation processing numbers the divided video data stream segments as it divides them, and the bubble sorting method then sorts the segments by these numbers, so that the video data stream is played in sequence.

In this embodiment, preferably, in the high definition processing in S3, a cubic convolution method is adopted for the scaling processing of the video signal, each pixel of the output image of the cubic convolution method is a result of operation of 16 pixels of the original image, and when cubic convolution interpolation is used, the value of the target point is calculated by resampling values of 16 known pixels around the target point.

The working principle and the using process of the invention are as follows:

firstly, receiving and dividing data stream information by hardware: the CPU receives the data stream information, separates and compresses the data stream information to obtain video data, places the compressed video data at the separation part in a system memory, and divides the compressed video data into a plurality of small parts;

secondly, the CPU distributes the compressed video data to the GPU and the APU for processing: the CPU transmits the compressed video data, divided into a plurality of small parts, to the GPU and the APU for data-parallel and operation-parallel processing, so that the compressed video data stream is decompressed, and the decompressed data information is stored in the sound card;

thirdly, the GPU and the APU realize post-processing of the video data stream through self algorithms: the GPU and the APU realize target locking on characteristic objects or characters in the video through a tracking algorithm, then realize tracking and positioning on the target through Kalman filtering, particle filtering, meanshift, camshift, MOSSE, CSK, KCF, BACF or SAMF, then realize fine processing on the small video data stream of the area where the target is located, so that the video data stream can be processed in a higher definition mode, then the small video data stream of the area where the non-target is located is processed in a fuzzy mode, and further the speed of hardware processing is improved;

fourthly, algorithm parallelization is adopted for the algorithm processing process of the video data stream: when the GPU and the APU process the video data stream by the algorithm, the algorithm is parallelized, so that the GPU and the APU can simultaneously utilize the processing space of the GPU and the APU, the running speed and the efficiency of hardware can be effectively accelerated, and the accelerated processing of the video data stream is finished;

fifthly, receiving and playing the processed video data stream by the CPU: when the GPU and the APU finish processing the video data stream, the CPU receives the video data, splices the video data stream according to the partitioned data sequence, arranges the video data stream by using a bubble sorting method, and then finishes quickly playing the video data stream.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A hardware acceleration method based on a tracking algorithm is characterized by comprising the following steps:

s1, receiving and segmenting data stream information by hardware: the CPU receives the data stream information, separates and compresses the video data from the data stream information, and places the separated and compressed video data in a system memory to realize the division of the compressed video data and divide the compressed video data into a plurality of small parts;

s5, receiving and playing the processed video data stream by the CPU: when the GPU and the APU finish processing the video data stream, the CPU receives the video data, splices the video data stream according to the partitioned data sequence, arranges the video data stream by using a bubble sorting method, and then finishes quickly playing the video data stream.

2. The hardware acceleration method based on tracking algorithm of claim 1, characterized in that: the segmentation processing in the S1 adopts a space domain algorithm and a time domain algorithm, the space domain algorithm is carried out through a macro block, the processing of each pixel is carried out in a space domain part, each pixel is processed according to a sequence and then an output is generated, and the output generated from the previous pixel or the next pixel has no result accumulation effect;

the time domain algorithm looks for where a change or similarity occurs in a particular pixel or pixel region between frames and converts the interlaced field to progressive scan format by line doubling or filtering.

3. The hardware acceleration method based on tracking algorithm of claim 2, characterized in that: the spatial domain algorithm performs low-pass filtering on a pixel by using an average value calculated by adjacent pixels around the pixel, namely, 5 × 5 two-dimensional convolution kernels are used for detecting edge information in an image and drawing related information from the surrounding pixels; the spatial domain algorithm and the time domain algorithm are combined to form a new category called space-time processing, each image frame is decomposed into a plurality of macro blocks, namely an area of 16 × 16 pixels, then the macro blocks are tracked and compared frame by frame, and approximate values of motion estimation and compensation are extracted from the macro blocks.

4. The hardware acceleration method based on tracking algorithm of claim 1, characterized in that: the data parallelization in S2 is to divide data into blocks and divide the data blocks into a plurality of small blocks that can be processed simultaneously, where the data blocks implement two data blocks of 16 by 16 and 32 by 32, and the data parallelization needs to have a Stream object, call a parallel method of the Stream object to enable the Stream object to have a parallel operation capability, or create a Stream from a set class to call the parallel method to immediately obtain a Stream with the parallel capability.

5. The hardware acceleration method based on tracking algorithm of claim 1, characterized in that: the operation parallelization in the S2 is detail optimization of algorithm processing, so that the generation of intermediate variables is reduced as much as possible, and the calculation is completed as much as possible.

6. The hardware acceleration method based on tracking algorithm of claim 1, characterized in that: particle filtering is selected for tracking and positioning the target in the S3, and the particle filtering comprises the following operation steps:

s304, particle resampling: a small number of particles are placed at the position with low similarity in the comparison with the target model color histogram, a large number of particles are placed at the position with high similarity, and the particles with low weight are discarded;

s305, state transition: calculating the position of the particle at the next moment according to $s_t = A s_{t-1} + w_{t-1}$;

s307, a decision stage: calculating a weighted average value of the coordinates and the similarity to obtain the position of the next frame of the tracking target;

7. The hardware acceleration method based on tracking algorithm of claim 1, characterized in that: the model adopted by the algorithm parallelization in the S4 is a PRAM model, the PRAM model is provided with a centralized shared memory and an instruction controller, data are exchanged through the R/W of the SM, and implicit synchronous calculation is carried out.

8. The hardware acceleration method based on tracking algorithm of claim 1, characterized in that: the step of comparing the bubble sorting method in S5 is as follows:

s503, repeating the steps for all the elements except the last element;

9. The hardware acceleration method based on tracking algorithm of claim 8, characterized in that: during the segmentation processing, the segmented video data stream segments are numbered, and the bubble sorting method then sorts the segments according to the numbers, so that the video data stream is played in sequence.

10. The hardware acceleration method based on tracking algorithm of claim 1, characterized in that: in the high-definition processing in S3, a cubic convolution method is adopted for scaling the video signal, each pixel of the image output by the cubic convolution method is a result of operation of 16 pixels of the original image, and when interpolation by the cubic convolution method is used, a value of a target point is obtained by resampling values of 16 known pixels around the target point.