CN109711323B - Real-time video stream analysis acceleration method, device and equipment - Google Patents

Real-time video stream analysis acceleration method, device and equipment

Info

Publication number
CN109711323B
CN109711323B (application CN201811585634.2A)
Authority
CN
China
Prior art keywords
cache
decoding
caches
analysis
writable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811585634.2A
Other languages
Chinese (zh)
Other versions
CN109711323A (en)
Inventor
谈鸿韬
陆辉
刘树惠
杨波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Fiberhome Digtal Technology Co Ltd
Original Assignee
Wuhan Fiberhome Digtal Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Fiberhome Digtal Technology Co Ltd filed Critical Wuhan Fiberhome Digtal Technology Co Ltd
Priority to CN201811585634.2A priority Critical patent/CN109711323B/en
Publication of CN109711323A publication Critical patent/CN109711323A/en
Application granted granted Critical
Publication of CN109711323B publication Critical patent/CN109711323B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention relates to a real-time video stream analysis acceleration method, device and equipment, aimed at optimizing and expanding the number of streams a real-time video stream algorithm can analyze. A GPU is called to decode each stream of real-time video, and the decoding result is returned to the algorithm directly through a video memory address. The algorithm end maintains double caches: one cache accumulates decoded data from multiple streams while the other is handed to the algorithm for GPU batch processing; when a batch completes, the two caches swap roles, minimizing system delay.

Description

Real-time video stream analysis acceleration method, device and equipment
Technical Field
The invention relates to the technical field of video image processing, in particular to a real-time video stream analysis acceleration method, a real-time video stream analysis acceleration device and real-time video stream analysis acceleration equipment.
Background
With the gradual rollout of large security programs and projects such as "Safe City", "Smart City" and the "Sharp Eyes" project, urban video surveillance construction has entered a mature phase. As massive volumes of video data accumulate, simply "watching" video no longer suffices: faced with large numbers of video scenes, traditional manual review consumes enormous manpower and material resources yet often yields little, and cannot meet the case-handling demands of the public security industry. Against this background, intelligent video analysis algorithms structure the people, vehicles and objects in video — for example tripwire detection, target tracking and face detection — extract target features from the videos, replace human eyes with automatic program extraction, and combine technical means such as big data to search by keyword and find clues; this has gradually become the mainstream approach in the security industry.
However, intelligent analysis algorithms face enormous performance pressure in massive video processing scenarios. Taking the currently most widely deployed 1080P H.264 video streams as an example, a mainstream x86 Intel Xeon server can only reach roughly 200 to 300 fps with CPU-based decoding. An intelligent video analysis pipeline runs video stream -> decoding -> YUV/RGB data -> algorithm; once the algorithm stage is added, image algorithms consume so much CPU that the effective decoding throughput drops even lower, so the number of concurrent real-time streams a server can support is hard to raise. Scaling out horizontally by adding analysis servers improves throughput, but the cost is too high and the price-performance ratio too low to support large-scale video analysis scenarios.
Disclosure of Invention
The present invention is directed to overcoming the drawbacks of the prior art and providing a method, an apparatus and a device for accelerating real-time video stream analysis, which can significantly improve the performance of a system based on GPU hardware acceleration.
The invention is realized by the following steps: the invention provides a real-time video stream analysis acceleration method, which comprises the following steps:
1) For each GPU card, set up at least two caches. Each cache carries a flag bit and a decoded-stream count k, which stores the accumulated number of decoded streams. When a cache's flag bit is false, the cache is writable and decoded data may be stored into it; when the flag bit is true, the cache is readable and the multi-stream decoded data it holds may be handed to an algorithm analysis module in one batch for analysis. Initialize the flag bits of all caches of each GPU card to false, and start two monitoring threads: one is a cache-write monitoring thread, the other a double-cache-read monitoring thread;
2) After receiving decoded data, the algorithm end first checks the flag bits of the caches to determine whether a writable cache exists. If at least one cache's flag bit is false, a writable cache exists: randomly select one writable cache whose flag bit is false, store this stream's decoded data in it, and add 1 to that cache's decoded-stream count k. Otherwise, discard this stream's decoded data and return immediately without processing;
3) The cache-write monitoring thread checks the state of the caches at a specified interval; when a cache's decoded-stream count k is greater than or equal to a set value K, the cache is considered readable and its flag bit is set to true, otherwise the flag bit is set to false. Meanwhile, the cache-read monitoring thread checks the state of the caches at a specified interval; when a cache's flag bit is true, the cache is considered readable, the multi-stream decoded data stored in it is handed to the algorithm analysis module in one batch for analysis, and after processing completes the cache's flag bit is reset to false, making it writable again.
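The three steps above amount to a small flag-and-counter protocol. The sketch below is a minimal, illustrative Python rendering of it — the patent does not prescribe an implementation; names such as `Cache`, `store_decoded` and the `capacity` field are assumptions, and real decoded frames would live in GPU video memory rather than a Python list:

```python
import random
import threading

class Cache:
    """One per-GPU buffer: a readable/writable flag bit plus a decoded-stream count k."""
    def __init__(self, capacity):
        self.capacity = capacity    # M: most streams this buffer may hold
        self.readable = False       # flag bit: False = writable, True = readable
        self.k = 0                  # accumulated number of decoded streams
        self.frames = []            # stand-in for decoded data in GPU memory
        self.lock = threading.Lock()

def store_decoded(caches, frame):
    """Step 2): pick a random writable cache; drop the data if none exists."""
    writable = [c for c in caches if not c.readable and c.k < c.capacity]
    if not writable:
        return False                # no writable cache: discard this stream's data
    c = random.choice(writable)
    with c.lock:
        c.frames.append(frame)
        c.k += 1
    return True

def write_monitor_pass(caches, threshold):
    """Step 3), write side: mark a cache readable once k >= K."""
    for c in caches:
        with c.lock:
            if c.k >= threshold:
                c.readable = True

def read_monitor_pass(caches, analyze_batch):
    """Step 3), read side: hand readable caches to the analyzer, then reset to writable."""
    for c in caches:
        if c.readable:
            with c.lock:
                analyze_batch(c.frames)
                c.frames, c.k, c.readable = [], 0, False
```

`write_monitor_pass` and `read_monitor_pass` would each be invoked periodically by its monitoring thread; because a cache is never simultaneously writable and readable, the writer and the batch analyzer never touch the same buffer at the same time.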
Further, two caches are set for each GPU card and bound to it; each double cache is responsible for receiving the decoded data on its corresponding GPU. N GPU cards correspond to N double caches.
Furthermore, each cache is allowed to store at most M streams of decoded data, where M is the maximum number of parallel tasks allowed on each GPU card, obtained through testing.
Further, the set value K is M/2.
Further, each time the main thread of the application program finishes decoding one frame, it passes the decoded-data information to the algorithm end through the algorithm end's data receiving interface.
The algorithm analysis module exposes a data receiving interface for the decoding layer to call, somewhat like a push operation on a data structure. The decoding module and the algorithm analysis module both run mainly on the GPU; the algorithm analysis module analyzes the data. They are the core computation modules of the application, responsible for the decoding and analysis functions respectively, and both depend on the corresponding GPU hardware components: an NVIDIA GPU contains dedicated video codec cores and CUDA cores.
Further, the step of obtaining the maximum parallel processing task number M allowed on each GPU card through testing specifically includes the steps of:
selecting a reference test file;
decoding and analyzing the M test files with a benchmark program and outputting the per-stream analysis frame rate fps; increase M starting from M = 1, 2, 3, ..., and when fps falls to approach a set value Q, record the current M, which is the optimal number of streams a single card can analyze. The benchmark program decodes and algorithmically analyzes the multi-stream video files. The frame rate of a real-time stream is generally 25-30 fps, for example 25. When a file is used for the simulated test, the per-stream fps is large while M is small (for example, fps can reach 200 when M = 2); as M keeps increasing, fps keeps dropping. Once fps falls to 25-30, M can grow no further: if increasing M pushes fps below 25, the real-time requirement can no longer be met. "Approaching Q" means fps is slightly greater than or equal to Q, measured on the minimum per-stream fps (the average generally differs little from it).
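The M-search loop just described can be sketched as follows. `measure_fps` is a hypothetical callback standing in for the benchmark program, which would actually decode and analyze m parallel streams and report the per-stream frame rate:

```python
def find_optimal_m(measure_fps, q=25.0, m_max=64):
    """Increase the number of parallel streams m until per-stream fps
    drops below the real-time floor q; return the last m that still met it.

    measure_fps(m) is an assumed benchmark hook: it runs decoding plus
    algorithm analysis with m parallel streams and returns the minimum
    per-stream analysis frame rate.
    """
    best = 0
    for m in range(1, m_max + 1):
        fps = measure_fps(m)
        if fps < q:        # below real time: the previous m was the optimum
            break
        best = m
    return best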
The analysis speed is highest when the product M × fps is maximal, as the following derivation shows:
(1) assuming that the duration of a video file is T and the frame rate is FR;
(2) defining index analysis acceleration ratio as video recording duration/analysis time to measure analysis efficiency;
(3) for simplicity of the analysis model, assume the GPU server has N GPU cards; the video is first cut evenly across the N cards for analysis, so the duration of the video segment assigned to each card is:
t = T / N
(4) assume the video of duration t on each card is further cut into M segments, so each card effectively analyzes M video streams in parallel; with a per-stream analysis frame rate of fps, the time required to analyze each stream is:
t1 = (t / M) × FR / fps = (T × FR) / (N × M × fps)
the total analysis time of the video can be approximated by t1, so the analysis acceleration ratio is
acceleration ratio = T / t1 = (N × M × fps) / FR
N (the number of GPU cards) and FR (the video frame rate) are fixed values; the only variables are the per-card slice count M and the per-slice analysis frame rate fps, so the analysis speed is highest when their product is maximal.
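As a worked example of the model: the acceleration ratio T / t1 reduces to (N × M × fps) / FR, independent of the video duration T. The function below computes it; the numbers in the usage note are hypothetical, chosen only to illustrate the formula:

```python
def analysis_speedup(n_cards, m_slices, fps, frame_rate):
    """Acceleration ratio R = T / t1 = (N * M * fps) / FR.

    The video duration T cancels out: only the card count N, the per-card
    slice count M, the per-slice analysis frame rate fps, and the video
    frame rate FR matter.
    """
    return n_cards * m_slices * fps / frame_rate
```

For instance, with N = 4 cards, M = 8 slices per card, fps = 25 and FR = 25, the ratio is 4 × 8 × 25 / 25 = 32, i.e. the recording is analyzed 32 times faster than its own duration.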
Step 2) and step 3) are performed asynchronously.
The method also comprises the following step before step 2): calling the GPU to decode each stream of real-time video, and returning the decoding result directly to the algorithm end through the video memory address.
The invention provides a real-time video stream analysis accelerating device, which comprises a decoding data receiving module, a decoding module, a writing module, a cache writing monitoring module and a cache reading monitoring module;
the decoded-data receiving module is used for receiving each stream's decoded data;
the write module is used for checking the flag bits of the corresponding caches and judging whether a writable cache exists; when at least one cache's flag bit is false, a writable cache exists: a writable cache whose flag bit is false is randomly selected, this stream's decoded data is stored in it, and the cache's decoded-stream count k is increased by 1; otherwise this stream's decoded data is discarded;
the cache-write monitoring module is used for checking the state of the caches at a specified interval; when a cache's decoded-stream count k is greater than or equal to a set value K, the cache is considered readable and its flag bit is set to true, otherwise the flag bit is set to false;
the cache-read monitoring module is used for checking the state of the caches at a specified interval; when a cache's flag bit is true, the cache is considered readable, the multi-stream decoded data stored in it is handed to the algorithm analysis module in one batch for algorithm analysis, and after processing completes the cache's flag bit is reset to false, making it writable again.
The invention provides a real-time video stream analysis accelerating device, which comprises a memory for storing a program;
and a processor for implementing the steps of the real-time video stream analysis acceleration method as described above when executing the program.
Compared with the prior art, the invention has the following beneficial effects:
the method aims at the optimization and the expansion of the number of paths analyzed by a real-time video stream algorithm, the GPU is called to decode each path of real-time video, the decoding result is directly returned to the algorithm through a video memory address, the algorithm end is provided with double caches, one cache is used for storing decoding data in multiple paths, the other cache is used for transferring the decoding data to the algorithm for GPU batch processing, after the batch processing is completed, the functions of the two caches are switched, the purpose of minimizing the system delay is achieved, and the problem of time delay when the main performance bottleneck of the real-time video stream is parallel processing of multiple GPU tasks is solved.
The invention provides a corresponding acceleration method aiming at real-time flow analysis, and can obviously improve the system efficiency based on GPU hardware acceleration.
Drawings
FIG. 1 is a diagram of an embodiment for a real-time video analytics task;
FIG. 2 is a diagram illustrating a detailed embodiment of a double-buffer switching procedure according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
Referring to fig. 1 and fig. 2, the present embodiment provides a real-time video stream analysis acceleration method, including the following steps:
1) the algorithm analysis module sets up two identical GPU caches for each GPU, denoted the first cache and the second cache. Each cache can store decoded data for at most M streams, and carries a flag bit and a decoded-stream count k storing the accumulated number of decoded streams. When the algorithm module starts, the flag bits of both GPU caches are initialized to false. The module also provides a data receiving interface; each decoding stream can hand its decoded data L to the algorithm analysis module by calling this interface.
A. when a double cache's flag bit is false, the cache is writable and multi-stream decoded data may be stored into it;
B. when a double cache's flag bit is true, the cache is readable and the stored multi-stream decoded data may be handed to the algorithm analysis module for batch processing;
C. one double cache corresponds to one GPU; with N GPUs there are N double caches, each bound to its card and responsible for receiving the M streams of decoded data on that GPU. The following steps are described for a single card;
2) the algorithm analysis module starts two threads, a cache-write monitoring thread and a double-cache-read monitoring thread, each performing a monitoring check every 10 ms;
3) when the i-th stream's decoded data arrives (1 ≤ i ≤ M), the data receiving interface of the algorithm analysis module is called;
4) the data receiving interface first checks the double-cache flag bits of the algorithm analysis module; if at least one is false, a writable double cache exists and the next step proceeds, otherwise this stream's decoded data is discarded without any processing;
5) a writable double cache whose flag is false is randomly selected, the i-th stream's decoded data is stored in it, k is increased by 1, and the data receiving interface call completes;
steps 3), 4) and 5) are the flow by which the decoding module calls the data receiving interface of the algorithm analysis module; step 6) and the subsequent processing inside the algorithm analysis module execute asynchronously;
6) the cache-write monitoring thread of the algorithm analysis module checks the double-cache state every 10 ms; when the number of decoded streams stored in a cache reaches half its maximum (k ≥ M/2), the cache is considered readable and its flag is set to true;
7) the cache-read monitoring thread of the algorithm analysis module checks the double-cache state every 10 ms; when a double cache's flag bit is true it is considered readable, the whole cache is handed to the analysis module for batch processing, and after processing completes the flag is reset to false, making the cache writable again.
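Steps 6) and 7) describe two polling threads with a ~10 ms period. Below is a minimal sketch using a dict as a stand-in for one GPU cache; `make_buffer`, `start_monitors` and the field names are illustrative assumptions, not names from the patent:

```python
import threading
import time

def make_buffer():
    """A minimal stand-in for one GPU cache: frames, count k, and the flag bit."""
    return {"frames": [], "k": 0, "readable": False, "lock": threading.Lock()}

def start_monitors(buffers, threshold, analyze_batch, period=0.01):
    """Start the two monitoring threads of embodiment steps 6)-7).

    Every ~10 ms (period), the write monitor flips a buffer to readable
    once k >= threshold, and the read monitor drains readable buffers
    into analyze_batch and resets them to writable. Returns an Event
    that stops both daemon threads when set.
    """
    stop = threading.Event()

    def write_monitor():
        while not stop.is_set():
            for b in buffers:
                with b["lock"]:
                    if b["k"] >= threshold:
                        b["readable"] = True
            stop.wait(period)

    def read_monitor():
        while not stop.is_set():
            for b in buffers:
                if b["readable"]:
                    with b["lock"]:
                        analyze_batch(b["frames"])   # hand the batch over
                        b["frames"], b["k"], b["readable"] = [], 0, False
            stop.wait(period)

    threading.Thread(target=write_monitor, daemon=True).start()
    threading.Thread(target=read_monitor, daemon=True).start()
    return stop
```

Once a producer has filled a buffer past the threshold, the two threads cooperate without further coordination: the write monitor publishes the buffer by flipping its flag, and the read monitor consumes and recycles it.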
The scheduling and decoding steps before the real-time video stream analysis are as follows:
(1) detecting and managing various GPU models, and automatically identifying the card types and the number;
(2) for a specified GPU card type, use a mainstream H.264 or H.265 1080P real-time video stream as the benchmark test source;
(3) writing a benchmark test analysis program, realizing the decoding and algorithm analysis functions of a plurality of paths of video files, and outputting the analysis frame rate fps of each path;
(4) connect M real-time streams to a single card while printing the analysis frame rate fps of the algorithm stage; increase M starting from M = 1, 2, 3, ..., and when fps drops to approach the value Q, e.g. Q = 25 (25 fps is the most common real-time video stream frame rate in video surveillance; the Q value can be adjusted to the actual frame rate), record the current M as the optimal number of streams a single card supports;
(5) the GPU scheduler initializes the maximum parallel task count on each GPU card to P = M and the running task count to C = 0;
(6) for each real-time stream analysis task K, traverse the N cards in order; when the count C of tasks being analyzed on the i-th card is less than P, return the i-th card's id to the algorithm for processing and add 1 to C; if the traversal finishes with no idle card (C ≥ P on every GPU), wait;
(7) when a task finishes analyzing, release the corresponding GPU resource id, subtract 1 from the count C being analyzed on the i-th card, and assign the freed resource to a waiting task;
(8) use the GPU scheduler to obtain the corresponding GPU card id j and the analysis task Ti (1 ≤ i ≤ M);
(9) call the GPU decoder SDK to hard-decode analysis task Ti on GPU j, and store the decoded data in GPU video memory L;
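The scheduler of steps (5)-(7) is essentially a counting allocator: each card carries a capacity P = M and a running count C. A possible sketch — the class and method names are assumptions, and a real implementation would hand the returned card id to the GPU decoder SDK:

```python
import threading

class GpuScheduler:
    """Allocate analysis tasks across n_cards GPUs, at most p_max tasks
    per card (p_max = M from the benchmark step)."""

    def __init__(self, n_cards, p_max):
        self.running = [0] * n_cards   # C: tasks currently analyzing per card
        self.p_max = p_max             # P: per-card parallel task limit
        self.cond = threading.Condition()

    def acquire(self):
        """Traverse the cards in order and return the id of the first one
        with a free slot; block until a slot frees up if all are busy."""
        with self.cond:
            while True:
                for gpu_id, c in enumerate(self.running):
                    if c < self.p_max:
                        self.running[gpu_id] += 1
                        return gpu_id
                self.cond.wait()       # every card at capacity: wait

    def release(self, gpu_id):
        """A task on gpu_id finished: free its slot and wake any waiters."""
        with self.cond:
            self.running[gpu_id] -= 1
            self.cond.notify_all()
```

Usage mirrors steps (6)-(8): each incoming stream calls `acquire()` to get a card id, runs decode plus analysis there, and calls `release(gpu_id)` when done.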
The decoding result is returned to the algorithm directly through the video memory address; the algorithm end sets at least two caches for each GPU card, and double-cache switching and batch processing of the decoded data follow the steps of embodiment one, as shown in FIG. 2.
Example two
The embodiment provides a real-time video stream analysis accelerating device, which comprises a decoding data receiving module, a decoding module, a writing module, a cache writing monitoring module and a cache reading monitoring module;
the decoded-data receiving module is used for receiving each stream's decoded data;
the write module is used for checking the flag bits of the corresponding caches and judging whether a writable cache exists; when at least one cache's flag bit is false, a writable cache exists: a writable cache whose flag bit is false is randomly selected, this stream's decoded data is stored in it, and the cache's decoded-stream count k is increased by 1; otherwise this stream's decoded data is discarded;
the cache-write monitoring module is used for checking the state of the caches at a specified interval; when a cache's decoded-stream count k is greater than or equal to a set value K, the cache is considered readable and its flag bit is set to true, otherwise the flag bit is set to false;
the cache-read monitoring module is used for checking the state of the caches at a specified interval; when a cache's flag bit is true, the cache is considered readable, the multi-stream decoded data stored in it is handed to the algorithm analysis module in one batch for algorithm analysis, and after processing completes the cache's flag bit is reset to false, making it writable again.
EXAMPLE III
The embodiment provides a real-time video stream analysis accelerating device, comprising a memory for storing a program;
and a processor for implementing the steps of the real-time video stream analysis acceleration method as described above when executing the program.
The invention adopts double buffering for real-time video (whose frame rate is fixed by the live source, generally 25-30 fps), and emphasizes supporting as many streams as possible (generally 10-30) while still meeting the real-time requirement. With many streams, however, data transfer and latency between the CPU and the GPU, and within the GPU itself, become a major bottleneck, which the double-cache batch processing is designed to relieve.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (6)

1. A real-time video stream analysis acceleration method is characterized by comprising the following steps:
1) setting at least two caches for each GPU card, wherein each cache carries a flag bit and a decoded-stream count k, the count k storing the accumulated number of decoded streams; when a cache's flag bit is false, the cache is writable and decoded data may be stored into the writable cache; when a cache's flag bit is true, the cache is readable and the multi-stream decoded data stored in it may be handed to an algorithm analysis module in one batch for analysis; the flag bits of the caches corresponding to each GPU card are initialized to false, and two monitoring threads are started, one being a cache-write monitoring thread and the other a double-cache-read monitoring thread; each cache allows at most M streams of decoded data to be stored, where M is the maximum number of parallel tasks allowed on each GPU card, obtained through testing;
the step of testing the maximum parallel processing task number M allowed on each GPU card specifically comprises the following steps:
selecting a reference test file;
decoding and analyzing the M test files through a benchmark program and outputting the analysis frame rate fps, with M = 1, 2, 3, ...; M is increased starting from M = 1, and when fps falls to approach a set value Q, the current M is recorded as the optimal number of streams a single card can analyze; the benchmark program decodes and algorithmically analyzes the multi-stream video files;
2) calling a GPU to decode each stream of real-time video and returning the decoding result directly to the algorithm end through a video memory address; after receiving decoded data, the algorithm end first checks the flag bits of the caches to determine whether a writable cache exists; when at least one cache's flag bit is false, a writable cache exists: a writable cache whose flag bit is false is randomly selected, this stream's decoded data is stored in it, and the cache's decoded-stream count k is increased by 1; otherwise this stream's decoded data is discarded and the call returns without processing;
3) the cache-write monitoring thread checks the state of the caches at a specified interval; when a cache's decoded-stream count k is greater than or equal to a set value K, the cache is considered readable and its flag bit is set to true, otherwise the flag bit is set to false; meanwhile, the cache-read monitoring thread checks the state of the caches at a specified interval; when a cache's flag bit is true, the cache is considered readable, the multi-stream decoded data stored in it is handed to the algorithm analysis module in one batch for analysis, and after processing completes the cache's flag bit is reset to false, making it writable again.
2. The method of claim 1, wherein: two caches are set for each GPU card; the two caches are bound to the corresponding GPU card; each double cache is responsible for receiving the decoded data on the corresponding GPU.
3. The method of claim 1, wherein: the set value K is M/2.
4. The method of claim 1, wherein: step 2) and step 3) are performed asynchronously.
5. A real-time video stream analysis acceleration apparatus, characterized by: the device comprises a decoding data receiving module, a decoding module, a writing module, a cache writing monitoring module and a cache reading monitoring module;
the decoded-data receiving module is used for receiving each stream's decoded data;
the write module is used for checking the flag bits of the corresponding caches and judging whether a writable cache exists; when at least one cache's flag bit is false, a writable cache exists: a writable cache whose flag bit is false is randomly selected, the decoded data is stored in it, and the cache's decoded-stream count k is increased by 1; otherwise the decoded data is discarded;
the cache-write monitoring module is used for checking the state of the caches at a specified interval; when a cache's decoded-stream count k is greater than or equal to a set value K, the cache is considered readable and its flag bit is set to true, otherwise the flag bit is set to false;
the cache-read monitoring module is used for checking the state of the caches at a specified interval; when a cache's flag bit is true, the cache is considered readable, the multi-stream decoded data stored in it is handed to the algorithm analysis module in one batch for algorithm analysis, and after processing completes the cache's flag bit is reset to false, making it writable again.
6. A real-time video stream analysis acceleration apparatus, characterized by: comprises a memory for storing a program;
and a processor for implementing the steps of the real-time video stream analysis acceleration method according to any one of claims 1 to 4 when executing said program.
CN201811585634.2A 2018-12-25 2018-12-25 Real-time video stream analysis acceleration method, device and equipment Active CN109711323B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811585634.2A CN109711323B (en) 2018-12-25 2018-12-25 Real-time video stream analysis acceleration method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811585634.2A CN109711323B (en) 2018-12-25 2018-12-25 Real-time video stream analysis acceleration method, device and equipment

Publications (2)

Publication Number Publication Date
CN109711323A CN109711323A (en) 2019-05-03
CN109711323B true CN109711323B (en) 2021-06-15

Family

ID=66257447

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811585634.2A Active CN109711323B (en) 2018-12-25 2018-12-25 Real-time video stream analysis acceleration method, device and equipment

Country Status (1)

Country Link
CN (1) CN109711323B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111988561B (en) * 2020-07-13 2022-05-03 浙江大华技术股份有限公司 Adaptive adjustment method and device for video analysis, computer equipment and medium
CN112528961B (en) * 2020-12-28 2023-03-10 山东巍然智能科技有限公司 Video analysis method based on Jetson Nano
CN112822494A (en) * 2020-12-30 2021-05-18 稿定(厦门)科技有限公司 Double-buffer coding system and control method thereof
CN113572997A (en) * 2021-07-22 2021-10-29 中科曙光国际信息产业有限公司 Video stream data analysis method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105163127A (en) * 2015-09-07 2015-12-16 浙江宇视科技有限公司 Video analysis method and device
WO2015192806A1 (en) * 2014-06-20 2015-12-23 Tencent Technology (Shenzhen) Company Limited Model parallel processing method and apparatus based on multiple graphic processing units
CN105224410A (en) * 2015-10-19 2016-01-06 成都卫士通信息产业股份有限公司 A kind of GPU of scheduling carries out method and the device of batch computing
CN105912479A (en) * 2016-04-07 2016-08-31 武汉数字派特科技有限公司 Concurrent data caching method and structure
CN107784001A (en) * 2016-08-26 2018-03-09 北京计算机技术及应用研究所 Parallel spatial querying method based on CUDA

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9268596B2 (en) * 2012-02-02 2016-02-23 Intel Corporation Instruction and logic to test transactional execution status

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015192806A1 (en) * 2014-06-20 2015-12-23 Tencent Technology (Shenzhen) Company Limited Model parallel processing method and apparatus based on multiple graphic processing units
CN105163127A (en) * 2015-09-07 2015-12-16 浙江宇视科技有限公司 Video analysis method and device
CN105224410A (en) * 2015-10-19 2016-01-06 成都卫士通信息产业股份有限公司 A kind of GPU of scheduling carries out method and the device of batch computing
CN105912479A (en) * 2016-04-07 2016-08-31 武汉数字派特科技有限公司 Concurrent data caching method and structure
CN107784001A (en) * 2016-08-26 2018-03-09 北京计算机技术及应用研究所 Parallel spatial querying method based on CUDA

Also Published As

Publication number Publication date
CN109711323A (en) 2019-05-03

Similar Documents

Publication Publication Date Title
CN109769115B (en) Method, device and equipment for optimizing intelligent video analysis performance
CN109711323B (en) Real-time video stream analysis acceleration method, device and equipment
CN110537194B (en) Power efficient deep neural network processor and method configured for layer and operation protection and dependency management
CN113221706B (en) AI analysis method and system for multi-process-based multi-path video stream
CN110610510A (en) Target tracking method and device, electronic equipment and storage medium
CN110298213B (en) Video analysis system and method
CN107451066A (en) Interim card treating method and apparatus, storage medium, terminal
CN113613065A (en) Video editing method and device, electronic equipment and storage medium
CN107870928A (en) File reading and device
WO2022152104A1 (en) Action recognition model training method and device, and action recognition method and device
CN113591674B (en) Edge environment behavior recognition system for real-time video stream
CN102810133A (en) Ray query method for network game, and scene server
CN109086737B (en) Convolutional neural network-based shipping cargo monitoring video identification method and system
CN113535366A (en) High-performance distributed combined multi-channel video real-time processing method
CN114973152B (en) Monitoring method, device and medium of micromolecule recyclable fracturing fluid storage tank based on neural network
CN109039804B (en) File reading method and electronic equipment
CN115022645A (en) Video compression method and device, electronic equipment and machine-readable storage medium
CN109274966A (en) A kind of monitor video content De-weight method and system based on motion vector
CN107749065A (en) VIBE background modeling methods based on CUDA
CN110826471B (en) Video tag labeling method, device, equipment and computer readable storage medium
CN114330675A (en) Chip, accelerator card, electronic equipment and data processing method
CN107169480B (en) Distributed character recognition system of real-time video stream
CN111917600A (en) Spark performance optimization-based network traffic classification device and classification method
CN113627354B (en) A model training and video processing method, which comprises the following steps, apparatus, device, and storage medium
CN112188215B (en) Video decoding method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant