CN109711323A - Real-time video stream analysis acceleration method, device and equipment - Google Patents

Real-time video stream analysis acceleration method, device and equipment

Info

Publication number
CN109711323A
CN109711323A (application number CN201811585634.2A)
Authority
CN
China
Prior art keywords
caching
decoding
false
module
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811585634.2A
Other languages
Chinese (zh)
Other versions
CN109711323B (en)
Inventor
谈鸿韬
陆辉
刘树惠
杨波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Fiberhome Digtal Technology Co Ltd
Original Assignee
Wuhan Fiberhome Digtal Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Fiberhome Digtal Technology Co Ltd filed Critical Wuhan Fiberhome Digtal Technology Co Ltd
Priority to CN201811585634.2A priority Critical patent/CN109711323B/en
Publication of CN109711323A publication Critical patent/CN109711323A/en
Application granted granted Critical
Publication of CN109711323B publication Critical patent/CN109711323B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present invention relates to a real-time video stream analysis acceleration method, device and equipment. The method optimizes and extends the number of concurrent channels for real-time video stream algorithm analysis: a GPU is called to decode each real-time video channel, and the decoding result is passed back to the algorithm directly by its video-memory address. The algorithm side maintains a double buffer: one buffer stores decoded data from multiple channels while the other is handed to the algorithm for GPU batch processing; when a batch completes, the two buffers swap roles, minimizing system latency.

Description

Real-time video stream analysis acceleration method, device and equipment
Technical field
The present invention relates to the technical field of video image processing, and in particular to a real-time video stream analysis acceleration method, device and equipment.
Background technique
With the gradual advance and deployment of large-scale public-security projects such as "Safe City", "Smart City" and the "Sharp Eyes" project, urban video surveillance construction has entered an in-depth phase and accumulated massive video data. The stage of simply "watching" video no longer suffices: faced with massive video footage, traditional manual review consumes enormous manpower and material resources, often proves inadequate, and cannot meet the practical case-handling demands of the public security industry. Against this background, applying intelligent video analysis algorithms such as line-crossing detection, target tracking and face detection to structure the people, vehicles and objects in video, extract their target features automatically in place of human eyes, and combine big-data techniques for keyword retrieval to find clues has gradually become the main approach in the security industry.
However, intelligent analysis faces huge performance pressure in massive-video processing scenarios. Taking the most widely used 1080P H.264 video streams as an example, a mainstream x86 Intel Xeon server using CPU-based decoding typically reaches only about 200-300 fps. An intelligent video analysis pipeline is usually video stream -> decoding -> YUV/RGB data -> algorithm processing; once the algorithm stage is added, decoding performance drops further because image algorithms are extremely CPU-intensive. In practice this means the number of concurrent real-time video streams that can be supported is hard to raise, and improving throughput by horizontally adding analysis server nodes is too expensive and offers poor cost-effectiveness, making large-scale video analysis scenarios hard to support.
Summary of the invention
The object of the present invention is to overcome the defects of the prior art and to provide a real-time video stream analysis acceleration method, device and equipment that can significantly improve the efficiency of a system based on GPU hardware acceleration.
The present invention is implemented as follows: the present invention provides a real-time video stream analysis acceleration method, comprising the following steps:
1) At least two buffers are set up for each GPU card. Each buffer internally carries a flag bit and a decode count k; the decode count k records the number of decoded frames accumulated in the buffer. When a buffer's flag is false, the buffer is writable and decoded data may be stored into it; when the flag is true, the buffer is readable and the multi-channel decoded batch it holds may be passed to the algorithm analysis module for processing. At initialization the flags of all buffers of every GPU card are set to false, and two monitoring threads are started: a buffer-write monitoring thread and a double-buffer-read monitoring thread (a minimal data-structure sketch follows these steps);
2) After the algorithm side receives decoded data, it first checks the flags of the buffers to determine whether a writable buffer exists. If the flag of at least one buffer is false, a writable buffer exists; a buffer whose flag is false is then chosen at random, the decoded data of that channel is stored into it, and that buffer's decode count k is incremented by 1. Otherwise the decoded data of that channel is discarded directly and the call returns without processing;
3) The buffer-write monitoring thread checks the state of the buffers at a specified interval; when a buffer's decode count k is greater than or equal to a set value K, the buffer is considered readable and its flag is set to true, otherwise the flag is set to false. Meanwhile, the buffer-read monitoring thread checks the state of the buffers at the specified interval; when a buffer's flag is true it is considered readable, and the multi-channel decoded batch it holds is passed to the algorithm analysis module for processing. After processing completes, the buffer's flag is set back to false, making it writable again.
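For illustration, a minimal C++ sketch of the per-card buffer pair described in steps 1) to 3) is given below. The type and field names (DecodedFrame, BatchBuffer, GpuBufferPair) are assumptions made for this sketch and are not defined by the patent.

```cpp
// Minimal sketch of the per-GPU double buffer from steps 1)-3).
// Names are illustrative only; the patent does not define a concrete API.
#include <atomic>
#include <cstdint>
#include <mutex>
#include <vector>

struct DecodedFrame {                 // one decoded frame kept in GPU video memory
    int      channel_id = 0;          // which real-time stream the frame belongs to
    uint8_t* gpu_ptr    = nullptr;    // video-memory address returned by the decoder
};

struct BatchBuffer {
    std::atomic<bool>         readable{false}; // flag bit: false = writable, true = readable
    std::atomic<int>          k{0};            // accumulated decode count
    std::vector<DecodedFrame> frames;          // multi-channel decoded batch
    std::mutex                mtx;             // guards frames during concurrent writes
};

struct GpuBufferPair {                // one double buffer bound to one GPU card
    int         gpu_id = 0;
    BatchBuffer buf[2];               // both flags start as false at initialization
};
```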
Further, two buffers are set up for each GPU card; the two buffers are bound to the corresponding GPU card, and each double buffer is responsible for receiving the decoded data of its GPU. N GPU cards thus correspond to N double buffers.
Further, each buffer may hold decoded data from at most M channels, where M is the maximum number of parallel processing tasks allowed on each GPU card as determined by testing.
Further, the set value K is M/2.
Each time the application's main thread finishes decoding a frame, it passes the decoded frame information to the algorithm side through the algorithm side's data-receive interface.
The algorithm analysis module provides this data-receive interface for the decoding layer to call, somewhat like a push operation on a data structure. The decoding module and the algorithm analysis module mainly run on the GPU: the algorithm analysis module analyses the data, and both decoding and analysis rely on the corresponding hardware components of the GPU; an NVIDIA GPU, for example, contains dedicated video codec cores and CUDA cores.
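A sketch of such a data-receive interface, reusing the BatchBuffer and GpuBufferPair types from the previous sketch, might look as follows; it is a hypothetical signature, and the random choice among writable buffers is simplified to taking the first writable one.

```cpp
// Hypothetical data-receive interface called by the decoding layer (see step 2):
// pick a writable buffer, append the frame and bump its decode count, or drop the
// frame when no buffer is currently writable.
bool receive_decoded_frame(GpuBufferPair& pair, const DecodedFrame& frame) {
    for (BatchBuffer& b : pair.buf) {
        if (!b.readable.load()) {                      // flag false => buffer is writable
            std::lock_guard<std::mutex> lock(b.mtx);
            b.frames.push_back(frame);                 // save this channel's decoded data
            b.k.fetch_add(1);                          // decode count k += 1
            return true;
        }
    }
    return false;                                      // no writable buffer: discard frame
}
```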
Further, testing for the maximum number M of parallel processing tasks allowed on each GPU card specifically comprises the following steps:
Selecting a benchmark test file;
Decoding and analysing M copies of the test file with the benchmark program and outputting the analysis frame rate fps. Starting from M = 1, 2, 3, ..., M is increased continuously; when fps falls to just above the set value Q, the M value at that point is recorded as the optimal number of analysis channels supported by a single card. The benchmark program decodes and algorithm-analyses multi-channel video stream files. The frame rate of a real-time stream is usually 25-30 fps (take 25 as an example). When simulating with files, the smaller M is, the higher the per-channel fps; at M = 2, for example, it may reach 200 fps. As M increases, fps keeps decreasing; once fps drops to 25-30, M cannot be increased further, because a larger M would push the per-channel fps below 25 and real-time requirements could no longer be met. "Approaching Q" means slightly greater than or equal to Q. The minimum per-channel fps is taken as the criterion, although in general the per-channel rates are fairly even and do not differ much.
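The calibration loop described above can be sketched as follows. run_benchmark() stands in for the benchmark program (the 480/M curve is synthetic and exists only so the example runs), and Q is the real-time threshold; both names are assumptions for this sketch.

```cpp
// Sketch of the single-card calibration: increase M until the per-stream analysis
// frame rate approaches the real-time threshold Q, then keep the last sustainable M.
#include <cstdio>

// Stand-in for "decode and analyse M copies of the benchmark file and return the
// per-stream fps"; the synthetic 480/M curve only makes the example runnable.
double run_benchmark(int M) { return 480.0 / M; }

int find_max_streams(double Q = 25.0) {
    int best_M = 1;
    for (int M = 1; M <= 128; ++M) {
        double fps = run_benchmark(M);         // per-stream analysis frame rate at M streams
        std::printf("M=%d fps=%.1f\n", M, fps);
        if (fps < Q) break;                    // below real time: the previous M is the limit
        best_M = M;                            // still >= Q, so M streams are sustainable
    }
    return best_M;                             // best per-card concurrent stream count
}

int main() {
    std::printf("max streams per card: %d\n", find_max_streams());
}
```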
The analysis speed is highest when fps*M is at its maximum, as the following derivation shows:
(1) Assume the video file has duration T and frame rate FR;
(2) Define the metric analysis speed-up ratio = recording duration / analysis time to measure analysis efficiency;
(3) For simplicity of the analysis model, assume the GPU server has N GPU cards. The recording is first cut evenly into N segments analysed on the N cards, so the video segment duration assigned to each card is t = T/N;
(4) Assume the segment of duration t on each card is cut again into M slices, which is equivalent to M video streams being analysed in parallel on each card. If the analysis frame rate of each stream is fps, the time needed to analyse each stream is t1 = (t/M) * FR / fps;
The overall analysis time of the recording can be approximated by t1, so the analysis speed-up ratio is T / t1 = N * M * fps / FR.
N is the number of GPU cards and FR, the frame rate of the video, is a fixed value. The only variables are the number of slices M on a single card and the analysis frame rate fps of each slice, so the analysis speed is highest exactly when their product fps*M is at its maximum.
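A quick check with illustrative numbers (not from the patent: N = 2 cards, M = 16 slices per card, per-slice analysis rate fps = 30, recording frame rate FR = 25) shows how the speed-up scales with the product M*fps:

```latex
% Illustrative numbers only; the symbols follow the derivation above.
\[
\text{speed-up} \;=\; \frac{T}{t_1}
  \;=\; \frac{T}{\dfrac{T}{N\,M}\cdot\dfrac{FR}{fps}}
  \;=\; \frac{N \, M \, fps}{FR}
  \;=\; \frac{2 \times 16 \times 30}{25} \;=\; 38.4
\]
```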
Step 2) and step 3) are executed asynchronously.
The following step is further included before step 2): the GPU is called to decode each real-time video channel, and the decoding result is passed back to the algorithm side directly by its video-memory address.
The present invention provides a real-time video stream analysis acceleration device, comprising a decoded-data receiving module, a decoding module, a writing module, a buffer-write monitoring module and a buffer-read monitoring module;
The decoded-data receiving module is used for receiving the decoded data of each channel;
The writing module is used for checking the flags of the corresponding buffers and determining whether a writable buffer exists; when the flag of at least one buffer is false, a writable buffer exists, a buffer whose flag is false is chosen at random, the decoded data of that channel is stored into that writable buffer, and its decode count k is incremented by 1; otherwise the decoded data of that channel is discarded directly;
The buffer-write monitoring module is used for checking the state of the buffers at a specified interval; when a buffer's decode count k is greater than or equal to the set value K, the buffer is considered readable and its flag is set to true, otherwise the flag is set to false;
The buffer-read monitoring module is used for checking the state of the buffers at the specified interval; when a buffer's flag is true it is considered readable, and the multi-channel decoded batch it holds is passed to the algorithm analysis module for algorithm analysis; after processing completes, the buffer's flag is set to false, making it writable again.
The present invention provides a real-time video stream analysis acceleration equipment, comprising a memory for storing a program;
and a processor which, when executing the program, implements the steps of the real-time video stream analysis acceleration method described above.
Compared with the prior art, the present invention has the following advantages:
The present invention optimizes and extends the number of concurrent channels for real-time video stream algorithm analysis. The GPU is called to decode each real-time video channel, and the decoding result is passed back to the algorithm directly by its video-memory address. The algorithm side maintains a double buffer: one buffer stores decoded data from multiple channels while the other is handed to the algorithm for GPU batch processing; when a batch completes, the two buffers swap roles. This minimizes system latency and solves the latency problem that is the main performance bottleneck when multiple GPU tasks process real-time video streams in parallel.
The present invention provides a corresponding acceleration method for real-time stream analysis and can significantly improve the efficiency of a system based on GPU hardware acceleration.
Detailed description of the invention
Fig. 1 is an embodiment diagram of a real-time video analysis task;
Fig. 2 is a detailed embodiment diagram of the double-buffer switching steps of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described clearly and completely below. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present invention.
Embodiment one
Referring to Fig. 1 and Fig. 2, this embodiment provides a real-time video stream analysis acceleration method, comprising the following steps:
1) For the GPU serving as the algorithm analysis module, two buffers are set up for each GPU card, labelled buffer No. 1 and buffer No. 2. Each buffer can hold decoded data from at most M channels and internally carries a flag bit and a decode count k recording the number of decoded frames accumulated. When the algorithm module starts, the flags of the two GPU buffers are initialized to false. The algorithm analysis module also provides a data-receive interface, and each decoding channel delivers its decoded data L to the algorithm analysis module by calling this interface.
a. When a double-buffer flag is false, the buffer is writable, and multi-channel decoded data may be stored into it;
b. When a double-buffer flag is true, the buffer is readable, and the multi-channel decoded batch it holds may be passed to the algorithm analysis module for batch processing;
c. Each double buffer corresponds to one GPU card; with N GPU cards there are N double buffers bound to the cards, and each double buffer is responsible for receiving the M channels of decoded data on its GPU. The following steps are illustrated for a single card;
2) The algorithm analysis module starts two threads, a buffer-write monitoring thread and a double-buffer-read monitoring thread, each performing a monitoring check every 10 ms;
3) When the decoded data of channel i (1 ≤ i ≤ M) arrives, the data-receive interface of the algorithm analysis module is called;
4) Inside the data-receive interface, the double-buffer flags of the algorithm analysis module are checked first. When at least one flag is false, a writable double buffer exists and the flow proceeds to the next step; otherwise the decoded data of that channel is discarded directly without any processing;
5) A writable double buffer whose flag is false is chosen at random, the decoded data of channel i is saved into it, k is incremented by 1, and the data-receive interface call finishes;
Steps 3), 4) and 5) form the execution flow in which the decoding module calls the data-receive interface of the algorithm analysis module; this flow is executed asynchronously with the processing of step 6) and the following steps inside the algorithm analysis module;
6) The buffer-write monitoring thread of the algorithm analysis module checks the double-buffer state every 10 ms; when the number of decoded frames saved in a buffer exceeds half the maximum (k ≥ M/2), the buffer is considered readable and its flag is set to true;
7) The buffer-read monitoring thread of the algorithm analysis module checks the double-buffer state every 10 ms; when a double-buffer flag is true, the buffer is considered readable and is passed to the analysis module for batch processing. After processing completes, the flag is set to false, making the buffer writable again.
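A self-contained sketch of the two 10 ms monitoring threads of steps 6) and 7) is given below; the type names and the placeholder analyze_batch() are assumptions, and the decoder threads that fill the buffers through the data-receive interface are elided.

```cpp
// Sketch of the write/read monitoring threads of steps 6) and 7).
// The buffers here are simplified stand-ins; real frames would live in GPU memory.
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

struct Buffer {
    std::atomic<bool> readable{false};     // false = writable, true = readable
    std::atomic<int>  k{0};                // accumulated decoded frames
    std::vector<int>  frames;              // stand-in for decoded-frame handles
    std::mutex        mtx;
};

// Placeholder for handing the whole batch to the GPU algorithm analysis module.
void analyze_batch(std::vector<int>& frames) { frames.clear(); }

int main() {
    const int M = 16;                      // calibrated per-card stream count
    Buffer buf[2];                         // the double buffer of one GPU card
    std::atomic<bool> running{true};

    std::thread write_monitor([&] {        // step 6): mark half-full buffers readable
        while (running) {
            for (Buffer& b : buf)
                if (!b.readable && b.k >= M / 2) b.readable = true;
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
    });

    std::thread read_monitor([&] {         // step 7): batch-process readable buffers
        while (running) {
            for (Buffer& b : buf)
                if (b.readable) {
                    std::lock_guard<std::mutex> lock(b.mtx);
                    analyze_batch(b.frames);   // pass the multi-channel batch to analysis
                    b.k = 0;
                    b.readable = false;        // writable again
                }
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
    });

    // Decoder threads filling buf[] through the data-receive interface are elided.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    running = false;
    write_monitor.join();
    read_monitor.join();
}
```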
The scheduling and decoding steps before real-time video stream analysis are as follows:
(1) Detect and manage the various GPU models, automatically identifying card types and card counts;
(2) For the specified GPU card type, use mainstream H.264/H.265 1080P real-time video streams as the benchmark test source;
(3) Write a benchmark test analysis program that can decode and algorithm-analyse multi-channel video files and output the analysis frame rate fps of each channel;
(4) Access M real-time streams on a single card while printing the analysis frame rate fps of the algorithm stage. Starting from M = 1, 2, 3, ..., keep increasing M; when fps falls to just above the value Q, for example Q = 25 (fps ≥ 25; 25 is the most common real-time video stream frame rate in the video surveillance field, and Q can be adjusted to the actual frame rate), record the M value at that point as the optimal number of analysis channels supported by a single card;
(5) The GPU scheduler initializes the maximum number of parallel processing tasks on each GPU card to P = M and the currently running task count to C = 0;
(6) For each real-time stream analysis task K, traverse the N cards in order; when the running analysis count C on card i is less than P, return card id i to the algorithm for processing and increase C by 1; if the traversal finishes with no idle GPU (C ≥ P on every card), wait;
(7) When a task that obtained GPU resources finishes its analysis, release the corresponding GPU resource id, decrease the analysis count C of card i by 1, and allocate the resource to a task waiting in the overall task queue;
(8) Use the GPU scheduler to obtain the corresponding GPU card id = j and the analysis task Ti (1 ≤ i ≤ M);
(9) Call the GPU decoder (SDK) to hardware-decode analysis task Ti on GPU j; the decoded data is stored in GPU video memory L;
The decoding result is passed back to the algorithm directly by its video-memory address, and the algorithm side sets up at least two buffers for each GPU card; the double-buffer switching and decoded-data batch-processing steps follow embodiment one, as shown in Fig. 2.
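A minimal sketch of the scheduler in steps (5) to (7) could look like this; GpuScheduler and its method names are illustrative, and the waiting in step (6) is left to the caller.

```cpp
// Illustrative GPU scheduler: each of the N cards runs at most P concurrent
// analysis tasks (P is the calibrated M). acquire() returns a card id or -1
// when every card is saturated; release() frees the slot when a task finishes.
#include <mutex>
#include <vector>

class GpuScheduler {
public:
    GpuScheduler(int n_cards, int per_card_limit)
        : running_(n_cards, 0), P_(per_card_limit) {}

    int acquire() {                              // step (6): traverse cards in order
        std::lock_guard<std::mutex> lock(mtx_);
        for (int i = 0; i < (int)running_.size(); ++i)
            if (running_[i] < P_) { ++running_[i]; return i; }
        return -1;                               // no idle card: the caller waits
    }

    void release(int card_id) {                  // step (7): task finished on that card
        std::lock_guard<std::mutex> lock(mtx_);
        if (card_id >= 0 && card_id < (int)running_.size() && running_[card_id] > 0)
            --running_[card_id];
    }

private:
    std::mutex       mtx_;
    std::vector<int> running_;                   // per-card running task count C
    int              P_;                         // per-card limit P = M
};
```

In this sketch a task would call acquire() before starting hardware decoding on the returned card and call release() once its analysis finishes, mirroring steps (6) and (7).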
Embodiment two
This embodiment provides a real-time video stream analysis acceleration device, comprising a decoded-data receiving module, a decoding module, a writing module, a buffer-write monitoring module and a buffer-read monitoring module;
The decoded-data receiving module is used for receiving the decoded data of each channel;
The writing module is used for checking the flags of the corresponding buffers and determining whether a writable buffer exists; when the flag of at least one buffer is false, a writable buffer exists, a buffer whose flag is false is chosen at random, the decoded data of that channel is stored into that writable buffer, and its decode count k is incremented by 1; otherwise the decoded data of that channel is discarded directly;
The buffer-write monitoring module is used for checking the state of the buffers at a specified interval; when a buffer's decode count k is greater than or equal to the set value K, the buffer is considered readable and its flag is set to true, otherwise the flag is set to false;
The buffer-read monitoring module is used for checking the state of the buffers at the specified interval; when a buffer's flag is true it is considered readable, and the multi-channel decoded batch it holds is passed to the algorithm analysis module for algorithm analysis; after processing completes, the buffer's flag is set to false, making it writable again.
Embodiment three
This embodiment provides a real-time video stream analysis acceleration equipment, comprising a memory for storing a program;
and a processor which, when executing the program, implements the steps of the real-time video stream analysis acceleration method described above.
The present invention uses a double buffer for real-time video, whose frame rate is fixed by the sender, generally 25-30 fps. The emphasis is on supporting as many channels as possible (generally 10-30) while meeting real-time requirements. The more channels there are, however, the more data transfer occurs between CPU and GPU and inside the GPU, and the resulting latency becomes a major bottleneck; the double-buffer batch processing is therefore designed to alleviate it.
The above are only preferred embodiments of the present invention and are not intended to limit the invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (9)

1. A real-time video stream analysis acceleration method, characterized by comprising the following steps:
1) at least two buffers are set up for each GPU card, each buffer internally carrying a flag bit and a decode count k, the decode count k recording the number of decoded frames accumulated; when a buffer's flag is false, the buffer is writable and decoded data may be stored into it; when the flag is true, the buffer is readable and the multi-channel decoded batch it holds may be passed to the algorithm analysis module for processing; at initialization the flags of the buffers of every GPU card are set to false, and two monitoring threads are started, one being a buffer-write monitoring thread and the other a double-buffer-read monitoring thread;
2) after the algorithm side receives decoded data, the flags of the buffers are first checked to determine whether a writable buffer exists; when the flag of at least one buffer is false, a writable buffer exists, a buffer whose flag is false is chosen at random, the decoded data of that channel is stored into it, and that buffer's decode count k is incremented by 1; otherwise the decoded data of that channel is discarded directly and the call returns without processing;
3) the buffer-write monitoring thread checks the state of the buffers at a specified interval; when a buffer's decode count k is greater than or equal to a set value K, the buffer is considered readable and its flag is set to true, otherwise the flag is set to false; meanwhile, the buffer-read monitoring thread checks the state of the buffers at the specified interval; when a buffer's flag is true it is considered readable, and the multi-channel decoded batch it holds is passed to the algorithm analysis module for processing; after processing completes, the buffer's flag is set back to false, making it writable again.
2. The method according to claim 1, characterized in that: two buffers are set up for each GPU card; the two buffers are bound to the corresponding GPU card; each double buffer is responsible for receiving the decoded data of its GPU.
3. The method according to claim 1, characterized in that: each buffer may hold decoded data from at most M channels, M being the maximum number of parallel processing tasks allowed on each GPU card as determined by testing.
4. The method according to claim 3, characterized in that: the set value K is M/2.
5. The method according to claim 3 or 4, characterized in that: testing for the maximum number M of parallel processing tasks allowed on each GPU card specifically comprises the following steps:
selecting a benchmark test file;
decoding and analysing M copies of the test file with the benchmark program and outputting the analysis frame rate fps; starting from M = 1, 2, 3, ..., M is increased continuously, and when fps falls to just above the set value Q, the M value at that point is recorded as the optimal number of analysis channels supported by a single card; the benchmark program decodes and algorithm-analyses multi-channel video stream files.
6. The method according to claim 1, characterized in that: step 2) and step 3) are executed asynchronously.
7. The method according to claim 1, characterized in that: the following step is further included before step 2): the GPU is called to decode each real-time video channel, and the decoding result is passed back to the algorithm side directly by its video-memory address.
8. A real-time video stream analysis acceleration device, characterized by comprising: a decoded-data receiving module, a decoding module, a writing module, a buffer-write monitoring module and a buffer-read monitoring module;
the decoded-data receiving module is used for receiving the decoded data of each channel;
the writing module is used for checking the flags of the corresponding buffers and determining whether a writable buffer exists; when the flag of at least one buffer is false, a writable buffer exists, a buffer whose flag is false is chosen at random, the decoded data of that channel is stored into that writable buffer, and its decode count k is incremented by 1; otherwise the decoded data of that channel is discarded directly;
the buffer-write monitoring module is used for checking the state of the buffers at a specified interval; when a buffer's decode count k is greater than or equal to the set value K, the buffer is considered readable and its flag is set to true, otherwise the flag is set to false;
the buffer-read monitoring module is used for checking the state of the buffers at the specified interval; when a buffer's flag is true it is considered readable, and the multi-channel decoded batch it holds is passed to the algorithm analysis module for algorithm analysis; after processing completes, the buffer's flag is set to false, making it writable again.
9. A real-time video stream analysis acceleration equipment, characterized by comprising: a memory for storing a program;
and a processor which, when executing the program, implements the steps of the real-time video stream analysis acceleration method according to any one of claims 1 to 7.
CN201811585634.2A 2018-12-25 2018-12-25 Real-time video stream analysis acceleration method, device and equipment Active CN109711323B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811585634.2A CN109711323B (en) 2018-12-25 2018-12-25 Real-time video stream analysis acceleration method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811585634.2A CN109711323B (en) 2018-12-25 2018-12-25 Real-time video stream analysis acceleration method, device and equipment

Publications (2)

Publication Number Publication Date
CN109711323A true CN109711323A (en) 2019-05-03
CN109711323B CN109711323B (en) 2021-06-15

Family

ID=66257447

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811585634.2A Active CN109711323B (en) 2018-12-25 2018-12-25 Real-time video stream analysis acceleration method, device and equipment

Country Status (1)

Country Link
CN (1) CN109711323B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111913799A (en) * 2020-07-14 2020-11-10 北京华夏启信科技有限公司 Video stream online analysis task scheduling method and computer equipment
CN111988561A (en) * 2020-07-13 2020-11-24 浙江大华技术股份有限公司 Adaptive adjustment method and device for video analysis, computer equipment and medium
CN112528961A (en) * 2020-12-28 2021-03-19 山东巍然智能科技有限公司 Video analysis method based on Jetson Nano
CN113572997A (en) * 2021-07-22 2021-10-29 中科曙光国际信息产业有限公司 Video stream data analysis method, device, equipment and storage medium
WO2022142157A1 (en) * 2020-12-30 2022-07-07 稿定(厦门)科技有限公司 Double-buffering encoding system and control method therefor

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105163127A (en) * 2015-09-07 2015-12-16 浙江宇视科技有限公司 Video analysis method and device
WO2015192806A1 (en) * 2014-06-20 2015-12-23 Tencent Technology (Shenzhen) Company Limited Model parallel processing method and apparatus based on multiple graphic processing units
CN105224410A (en) * 2015-10-19 2016-01-06 成都卫士通信息产业股份有限公司 A kind of GPU of scheduling carries out method and the device of batch computing
US20160203068A1 (en) * 2012-02-02 2016-07-14 Intel Corporation Instruction and logic to test transactional execution status
CN105912479A (en) * 2016-04-07 2016-08-31 武汉数字派特科技有限公司 Concurrent data caching method and structure
CN107784001A (en) * 2016-08-26 2018-03-09 北京计算机技术及应用研究所 Parallel spatial querying method based on CUDA

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160203068A1 (en) * 2012-02-02 2016-07-14 Intel Corporation Instruction and logic to test transactional execution status
WO2015192806A1 (en) * 2014-06-20 2015-12-23 Tencent Technology (Shenzhen) Company Limited Model parallel processing method and apparatus based on multiple graphic processing units
CN105163127A (en) * 2015-09-07 2015-12-16 浙江宇视科技有限公司 Video analysis method and device
CN105224410A (en) * 2015-10-19 2016-01-06 成都卫士通信息产业股份有限公司 A kind of GPU of scheduling carries out method and the device of batch computing
CN105912479A (en) * 2016-04-07 2016-08-31 武汉数字派特科技有限公司 Concurrent data caching method and structure
CN107784001A (en) * 2016-08-26 2018-03-09 北京计算机技术及应用研究所 Parallel spatial querying method based on CUDA

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111988561A (en) * 2020-07-13 2020-11-24 浙江大华技术股份有限公司 Adaptive adjustment method and device for video analysis, computer equipment and medium
CN111913799A (en) * 2020-07-14 2020-11-10 北京华夏启信科技有限公司 Video stream online analysis task scheduling method and computer equipment
CN111913799B (en) * 2020-07-14 2024-04-19 北京华夏启信科技有限公司 Video stream online analysis task scheduling method and computer equipment
CN112528961A (en) * 2020-12-28 2021-03-19 山东巍然智能科技有限公司 Video analysis method based on Jetson Nano
CN112528961B (en) * 2020-12-28 2023-03-10 山东巍然智能科技有限公司 Video analysis method based on Jetson Nano
WO2022142157A1 (en) * 2020-12-30 2022-07-07 稿定(厦门)科技有限公司 Double-buffering encoding system and control method therefor
CN113572997A (en) * 2021-07-22 2021-10-29 中科曙光国际信息产业有限公司 Video stream data analysis method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN109711323B (en) 2021-06-15

Similar Documents

Publication Publication Date Title
CN109711323A (en) Real-time video stream analysis acceleration method, device and equipment
CN109769115B (en) Method, device and equipment for optimizing intelligent video analysis performance
CN102200906B (en) Processing system and processing method of large-scale concurrent data stream
CN106339484B (en) A kind of system and method for video intelligent retrieval process
CN106951322A (en) The image collaboration processing routine acquisition methods and system of a kind of CPU/GPU isomerous environments
CN107241305B (en) Network protocol analysis system based on multi-core processor and analysis method thereof
CN101282300A (en) Method for processing HTTP packet based on non-blockage mechanism
CN106911939A (en) A kind of video transcoding method, apparatus and system
CN103034581B (en) A kind of embedded system trace debug method and device
CN109062697A (en) It is a kind of that the method and apparatus of spatial analysis service are provided
CN114466227B (en) Video analysis method and device, electronic equipment and storage medium
CN116016628A (en) API gateway buried point analysis method and device
CN117156172B (en) Video slice reporting method, system, storage medium and computer
CN114185885A (en) Streaming data processing method and system based on column storage database
CN108540822A (en) A kind of key frame of video extraction acceleration system and its extracting method based on OpenCL
CN106506402B (en) A kind of unrelated forwarding flow caching method of agreement
CN115623019B (en) Distributed operation flow scheduling execution method and system
CN109274966A (en) A kind of monitor video content De-weight method and system based on motion vector
CN109284257A (en) A kind of log write-in method, apparatus, electronic equipment and storage medium
CN107169480B (en) Distributed character recognition system of real-time video stream
CN112995532B (en) Video processing method and device
CN109298884B (en) Universal character operation accelerated processing hardware device and control method
CN115242735B (en) Real-time voice stream slice analysis method, system and computer equipment
CN111045831B (en) Python-based radar real-time signal processing method
Ikeuchi et al. GPU-based multi-stream analyzer on application layer for service-oriented router

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant