CN106934757A - Monitoring video foreground extraction acceleration method based on CUDA - Google Patents

Monitoring video foreground extraction acceleration method based on CUDA

Info

Publication number
CN106934757A
Authority
CN
China
Prior art keywords: video, frame, GPU, CUDA, CPU
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710057317.2A
Other languages
Chinese (zh)
Other versions
CN106934757B (en)
Inventor
袁飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Detective Technology Co Ltd
Original Assignee
Beijing Zhongke Detective Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Detective Technology Co Ltd
Priority to CN201710057317.2A
Publication of CN106934757A
Application granted
Publication of CN106934757B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Abstract

The present invention relates to a CUDA-based monitoring video foreground extraction acceleration method, applied to a graphics processing apparatus comprising a CPU and a GPU, including: the GPU performs foreground extraction processing on a video frame according to a background model to obtain foreground information; after obtaining the foreground information of a video frame output by the GPU, the CPU corrects the background model according to the foreground information of that frame, and the corrected background model is used for the GPU's foreground extraction processing of the next video frame. The present invention combines the high-performance computing capability of the GPU with the branch-handling capability of the CPU, makes full use of the hardware resources, greatly increases the computation speed, and meets the demand for efficient extraction of the video foreground.

Description

Monitoring video foreground extraction acceleration method based on CUDA
Technical field
The present invention relates to the fields of video processing and parallel computing, and more particularly to a CUDA-based monitoring video foreground extraction acceleration method.
Background technology
With the widespread use of video surveillance, video-based moving object detection and tracking applications are becoming increasingly common and important, and surveillance is developing toward networked, high-definition and intelligent systems, which places higher demands on the real-time performance and reliability of monitoring systems. To improve the real-time performance of a system, one can start from reducing the quantization bit depth of images, selecting more efficient image processing algorithms, or selecting hardware with stronger processing capability. However, reducing the bit depth of pixels loses much information and degrades image quality, and the efficiency and accuracy of image algorithms are often difficult to balance, especially in complex application scenarios; selecting hardware with stronger processing capability therefore often becomes the practical choice. At present, the graphics cards in PCs generally carry a GPU (Graphics Processing Unit), which has stronger computing capability than the CPU.
The CUDA (Compute Unified Device Architecture) parallel computing framework released by NVIDIA in 2007 can effectively use the powerful processing capability of the GPU for general-purpose computation beyond graphics rendering, and has attracted enormous attention from industry and academia. Compared with traditional GPU general-purpose computing, CUDA programming is simpler, makes it easier to use the hardware resources of the GPU, and provides more powerful functionality. CUDA technology has now been widely applied in fields such as astrophysics, oil exploration, pattern recognition and bioengineering. With the rapid development of GPU general-purpose computing, computation has evolved from CPU-only calculation toward CPU+GPU cooperative computing.
In this context, on the basis of existing foreground extraction algorithms, how to combine the high-performance computing capability of the GPU with the branch-handling capability of the CPU to achieve efficient video foreground extraction has become an urgent problem to be solved.
Summary of the invention
In order to solve the above problem in the prior art, namely how to combine the high-performance computing capability of the GPU with the branch-handling capability of the CPU to achieve efficient video foreground extraction, the present invention provides a CUDA-based monitoring video foreground extraction acceleration method, applied to a graphics processing apparatus comprising a CPU and a GPU, including:
the GPU performs foreground extraction processing on a video frame according to a background model to obtain foreground information;
after obtaining the foreground information of a video frame output by the GPU, the CPU corrects the background model according to the foreground information of that frame, and the corrected background model is used for the GPU's foreground extraction processing of the next video frame.
Preferably, the GPU performs the foreground extraction processing using three CUDA streams that can process data in parallel; in the n-th processing cycle, the processing includes:
the first CUDA stream receives the n-th frame of video data from the CPU;
the second CUDA stream performs foreground extraction processing on the (n-1)-th frame of video data according to the background model, using the configured foreground information calculation method;
the third CUDA stream sends the foreground information of the (n-2)-th frame of video data to the CPU.
Preferably, while the three CUDA streams are initialized, page-locked memory used by the CUDA streams is allocated in host memory for 3 consecutive frames, page-locked memory for the returned data is allocated for 3 consecutive frames, and storage space for 3 consecutive frames together with a Boolean space of the same size as a video frame is allocated in the global memory of the GPU.
Preferably, data transfer between the CPU and the GPU is performed asynchronously.
Preferably, the foreground extraction processing includes video frame preprocessing, foreground probability calculation and random number generation.
Preferably, the foreground probability σ of the current pixel is calculated from μ, the mean coefficient obtained from the preset background model, Pt, the value of the pixel of the current video frame after normalization in the video frame preprocessing, and α, an influence parameter.
Preferably, the background model is initialized on the basis of a base image, as follows:
a video frame without any foreground object is used as the base image, and the base image is copied N times to obtain the initialized background model; N is a preset number of times.
Preferably, after the three CUDA streams are initialized, the method further includes a step of calculating the number of thread blocks, in which Bx and By, the numbers of thread blocks in the x and y directions respectively, are determined from tx and ty, the numbers of threads per thread block, and from w and h, the horizontal and vertical numbers of pixels of a video frame.
Preferably, tx and ty are preset values, and the product of tx and ty is a multiple of 32.
Preferably, the background model is corrected as follows:
according to the Boolean values in the foreground information of the video frame output by the GPU, the random numbers corresponding to the Boolean values that are true are determined, and the background model is further corrected accordingly.
Compared with the prior art, the present invention has at least the following advantages:
through the design of the CUDA-based monitoring video foreground extraction acceleration method of the present invention, the high-performance computing capability of the GPU is combined with the branch-handling capability of the CPU, achieving efficient extraction processing of the video foreground.
Brief description of the drawings
Fig. 1 is a schematic structural diagram of the CUDA-based monitoring video foreground extraction acceleration method provided by the present invention;
Fig. 2 is a schematic diagram of the cooperative, parallel division of labor for data transfer among the CPU, the GPU and the CUDA streams provided by the present invention;
Fig. 3 is a schematic flowchart of the steps of the CUDA-based monitoring video foreground extraction acceleration method provided by the present invention.
Detailed description of the embodiments
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. Those skilled in the art should understand that these embodiments are only used to explain the technical principles of the present invention and are not intended to limit the scope of protection of the present invention.
In the present invention, the algorithm is split according to the characteristics of CUDA into a part suitable for GPU processing and a part suitable for CPU processing, and page-locked memory and asynchronous transfer are used to hide the time of transferring data between the CPU-side memory and the device-side memory, thereby greatly accelerating the calculation of video foreground extraction and meeting the requirements of video processing.
The present invention mainly uses CUDA to parallelize and accelerate the extraction of the video foreground, and at the same time splits and optimizes the algorithm according to the respective characteristics of the CPU and the GPU; the maximum speed-up ratio can reach more than 15 times.
To achieve the purpose of the present invention, and with reference to Figs. 1 and 2, a CUDA-based monitoring video foreground extraction acceleration method is applied to a graphics processing apparatus comprising a CPU and a GPU, and includes:
the GPU performs foreground extraction processing on a video frame according to a background model to obtain foreground information;
after obtaining the foreground information of a video frame output by the GPU, the CPU corrects the background model according to the foreground information of that frame, and the corrected background model is used for the foreground extraction processing of the next video frame on the GPU.
The GPU performs the foreground extraction processing using three CUDA streams that can process data in parallel; for the n-th processing cycle, the processing includes:
the first CUDA stream receives the n-th frame of video data from the CPU;
the second CUDA stream performs foreground extraction processing on the (n-1)-th frame of video data according to the background model, using the configured foreground information calculation method;
the third CUDA stream sends the foreground information of the (n-2)-th frame of video data to the CPU.
In the present invention, data transfer between the CPU and the GPU is performed asynchronously.
The processing flow of a video frame in the present invention is further detailed below through the steps of an embodiment, as shown in Fig. 3, including:
Step 301: generate an initialized background model from a base image.
This embodiment uses unified memory (Unified Memory) and stores the background model in it. Because the background model is needed by both the CPU and the GPU, and the amount of modification to it is small, the transfer can be left to the CUDA driver to complete automatically at a suitable time, so the background model can be stored in unified memory.
The base image is a special frame that contains no foreground object; in this embodiment a video frame without any foreground object is selected as the base image, and the base image is copied N times to obtain the initialized background model. The value of N is related to the video frame rate; in this embodiment N is 20.
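By way of illustration, a minimal CUDA C++ sketch of step 301 follows, assuming 8-bit grayscale frames; the function name initBackgroundModel and the flat frame-by-frame layout of the model are choices made for this sketch, not taken from the patent.

#include <cuda_runtime.h>
#include <cstdint>
#include <cstring>

// Sketch of step 301: allocate the background model in unified memory and
// fill it with N copies of a foreground-free base frame (N = 20 in this
// embodiment).
uint8_t* initBackgroundModel(const uint8_t* baseFrame, int w, int h, int N)
{
    uint8_t* model = nullptr;
    size_t frameBytes = static_cast<size_t>(w) * h;

    // Unified memory: the model is visible to both the GPU (foreground kernel)
    // and the CPU (model correction in step 306); the CUDA driver migrates it
    // between host and device automatically when needed.
    cudaMallocManaged((void**)&model, frameBytes * N);

    for (int i = 0; i < N; ++i)
        std::memcpy(model + static_cast<size_t>(i) * frameBytes, baseFrame, frameBytes);

    return model;
}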
Step 302: initialize the CUDA streams and the memory space.
Three CUDA streams are initialized; page-locked memory used by the CUDA streams is allocated in host memory for 3 consecutive frames, page-locked memory for the returned data is allocated for 3 consecutive frames, and storage space for 3 consecutive frames together with a Boolean space of the same size as a video frame is allocated in the global memory (Global Memory) of the GPU.
Because the pixel values of a video frame are accessed only once, these data are stored in global memory. The calculations performed by the 3 streams are, in order, copying data to the GPU, performing the GPU calculation, and copying the data back to the host, and the 3 streams are started at the same time. The GPU calculation consists of formula evaluations and the like; branch-heavy statements such as if statements and the modification of the model are avoided as far as possible, and these steps are instead performed by the CPU in step 306.
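A sketch of the allocations described in step 302 might look as follows; the PipelineBuffers structure and its field names are illustrative, and a per-pixel Boolean mask is assumed as the returned foreground information.

#include <cuda_runtime.h>
#include <cstdint>

// Sketch of step 302: three CUDA streams plus the buffers used by the
// three-stage pipeline (upload frame n, process frame n-1, download n-2).
struct PipelineBuffers {
    cudaStream_t stream[3];
    uint8_t*     hostFrame[3];    // page-locked input frames on the host
    bool*        hostMask[3];     // page-locked result masks returned to the host
    uint8_t*     devFrame[3];     // frames in GPU global memory
    bool*        devMask[3];      // Boolean foreground masks in global memory
};

void initPipeline(PipelineBuffers& p, int w, int h)
{
    size_t frameBytes = static_cast<size_t>(w) * h;
    size_t maskBytes  = static_cast<size_t>(w) * h * sizeof(bool);

    for (int i = 0; i < 3; ++i) {
        cudaStreamCreate(&p.stream[i]);
        // Page-locked (pinned) host memory is required for asynchronous copies.
        cudaHostAlloc((void**)&p.hostFrame[i], frameBytes, cudaHostAllocDefault);
        cudaHostAlloc((void**)&p.hostMask[i],  maskBytes,  cudaHostAllocDefault);
        // Per-frame storage plus a same-sized Boolean space in global memory.
        cudaMalloc((void**)&p.devFrame[i], frameBytes);
        cudaMalloc((void**)&p.devMask[i],  maskBytes);
    }
}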
Step 303: calculate the thread blocks for the CUDA streams.
The number of threads in each thread block (Block) is set, and the required number of thread blocks is calculated. The number of thread blocks depends on the resolution of the video frame and is given by formula (1):
Bx = ⌈w / tx⌉,  By = ⌈h / ty⌉    (1)
where Bx and By are the numbers of thread blocks in the x and y directions respectively, tx and ty are the numbers of threads of each thread block in those directions, and w and h are the horizontal and vertical numbers of pixels of the video frame.
The number of thread blocks computed by formula (1), together with the number of threads contained in each block in the x and y directions, guarantees that the total number of threads is at least the number of pixels of the image, so that the thread allocation neither wastes threads through redundancy nor fails through an insufficient thread count. tx and ty are predefined; it suffices to ensure that tx × ty is a multiple of 32.
In this embodiment, tx and ty are both taken as 16.
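The launch configuration of step 303 can be sketched as below, assuming the usual ceiling division of the frame size by the block size.

#include <cuda_runtime.h>

// Sketch of step 303: 16x16 threads per block (tx * ty = 256, a multiple of 32),
// and enough blocks in x and y to cover every pixel of a w-by-h frame.
void makeLaunchConfig(int w, int h, dim3& grid, dim3& block)
{
    const int tx = 16, ty = 16;            // preset threads per block
    block = dim3(tx, ty);
    grid  = dim3((w + tx - 1) / tx,        // Bx = ceil(w / tx)
                 (h + ty - 1) / ty);       // By = ceil(h / ty)
}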
Step 304: asynchronously copy the video data from page-locked memory to the device, and launch the kernel function to perform foreground extraction processing on the video frame.
The functions assigned to the three CUDA streams and the numbering rule of the processed video frames have been explained above, so the correspondence between each CUDA stream and its video frame is not described again here; only the foreground extraction processing of the video frame in the second CUDA stream is explained.
In this embodiment, the foreground extraction processing includes video frame preprocessing, foreground probability calculation and random number generation.
In the calculation of the kernel function, the strategy of one thread per pixel is adopted; since the calculations of different pixels are relatively independent, no interaction between threads is needed. In the preprocessing, each pixel of the video frame is normalized.
The foreground probability σ is calculated using formula (2) from μ, the mean coefficient obtained from the preset background model, Pt, the value of the pixel of the current video frame after normalization in the preprocessing, and α, an influence parameter.
The N base images of the background model are evaluated to obtain N values of μ; substituting each μ into the formula for σ yields the final result σ, which represents the probability that this pixel is a foreground point: the larger the value, the more likely the point is a foreground point. The point is then segmented using a threshold ε: if σ > ε the point is a foreground point, otherwise it is a background point. The resulting Boolean value is stored in the pre-allocated memory.
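A sketch of such a per-pixel kernel is given below. Since formula (2) is not reproduced in this text, foregroundScore is a clearly hypothetical stand-in for it; everything else follows the one-thread-per-pixel, normalize-then-threshold flow described above.

#include <cuda_runtime.h>
#include <cstdint>

// Placeholder: NOT the patent's formula (2). It merely combines the normalized
// pixel value Pt with the N mean coefficients mu and the influence parameter
// alpha so that larger deviations from the model yield a larger score.
__device__ float foregroundScore(float Pt, const float* mu, int N, float alpha)
{
    float s = 0.0f;
    for (int i = 0; i < N; ++i)
        s += fabsf(Pt - mu[i]);
    return alpha * s / N;
}

// Sketch of step 304: one thread per pixel; normalize the pixel, evaluate the
// foreground probability, and threshold it with epsilon (0.85 in the embodiment).
__global__ void extractForeground(const uint8_t* frame, const float* mu,
                                  bool* mask, int w, int h,
                                  int N, float alpha, float epsilon)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;            // guard the padded threads

    int idx = y * w + x;
    float Pt = frame[idx] / 255.0f;          // preprocessing: normalization
    float sigma = foregroundScore(Pt, mu, N, alpha);
    mask[idx] = (sigma > epsilon);           // true = foreground point
}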
In the random number generation stage, random numbers are generated using the random number generator of CUDA. In this embodiment the random number is an integer from 1 to 20, representing a specific position in the background model.
In this embodiment, the threshold ε is set to 0.85.
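The random-number stage could be realized with the cuRAND host API roughly as follows, assuming a curandGenerator_t created elsewhere with curandCreateGenerator; mapping uniform floats to integers 1 to 20 is one possible way to obtain the model positions.

#include <curand.h>
#include <cuda_runtime.h>

// Map uniform floats in (0, 1] to integers 1..20, one per pixel; the integer
// selects which of the N = 20 background-model samples may be replaced.
__global__ void toModelIndex(const float* uniform, int* index, int nPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nPixels)
        index[i] = 1 + static_cast<int>(uniform[i] * 20.0f) % 20;
}

// Sketch of the random-number stage of step 304, issued on the same CUDA
// stream as the foreground kernel.
void generateModelIndices(curandGenerator_t gen, cudaStream_t stream,
                          float* dUniform, int* dIndex, int nPixels)
{
    curandSetStream(gen, stream);
    curandGenerateUniform(gen, dUniform, nPixels);   // floats in (0, 1]
    int threads = 256;
    int blocks  = (nPixels + threads - 1) / threads;
    toModelIndex<<<blocks, threads, 0, stream>>>(dUniform, dIndex, nPixels);
}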
Step 305: asynchronously copy the calculated Boolean space and the random numbers back to page-locked memory.
Step 306: correct the background model.
According to the Boolean values in the foreground information of the video frame output by the GPU, if a Boolean value is true, the corresponding generated random number is obtained, and the background model is corrected based on the Boolean space and the random number.
After obtaining the foreground information of a video frame output by the GPU, the CPU corrects the background model according to the foreground information of that frame, and the corrected background model is used for the foreground extraction processing of the next video frame on the GPU.
The above manner of correcting the background model is a rather typical algorithm in this field and common knowledge in the art, and is not described in detail here.
Because correcting the background model does not match the coalesced access pattern of the GPU's global memory and the required access cycles are long, it is more suitable to perform the correction on the CPU and then transfer the model through unified memory.
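A sketch of the CPU-side correction of step 306 follows. Because the patent describes the concrete update rule only as common knowledge in the art, the rule shown here (overwriting the randomly selected model sample with the current pixel value) is an assumption made for illustration.

#include <cstdint>
#include <cstddef>

// Sketch of step 306, run on the CPU after the stream-3 copy-back completes.
// ASSUMPTION: the flagged pixel overwrites the model sample chosen by the
// random index; the patent itself only references a typical update rule.
void correctBackgroundModel(uint8_t* model,          // unified memory, N model frames
                            const uint8_t* frame,    // page-locked copy of the frame
                            const bool* mask,        // Boolean foreground flags
                            const int* modelIndex,   // random indices in 1..N
                            int w, int h, int N)
{
    size_t frameBytes = static_cast<size_t>(w) * h;
    for (size_t i = 0; i < frameBytes; ++i) {
        if (!mask[i])
            continue;                                // only flagged pixels trigger a correction
        int sample = modelIndex[i] - 1;              // random model position, 0-based
        if (sample >= 0 && sample < N)
            model[static_cast<size_t>(sample) * frameBytes + i] = frame[i];
    }
}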
The above steps only describe the processing flow of the video frame data and do not elaborate on the details of the parallel processing mechanism of the three CUDA streams; refer to Figs. 1 and 2. In one processing cycle, assuming the cycle is n, the three CUDA streams are started at the same time and respectively perform: the first CUDA stream receives the n-th frame of video data from the CPU; the second CUDA stream performs foreground extraction processing on the (n-1)-th frame of video data according to the background model, using the configured foreground information calculation method; and the third CUDA stream sends the foreground information of the (n-2)-th frame of video data to the CPU. By decomposing the video frame processing flow and executing it in parallel, the processing cycle is shortened.
While the second CUDA stream performs the foreground extraction processing, and before the foreground probability calculation, the CPU has already completed the reception of the data output by the third CUDA stream and the correction of the background model according to the foreground information of the (n-2)-th frame of video data, so the corrected model can be used by the second CUDA stream in processing the (n-1)-th frame of video data.
In this embodiment, the data transfer among the CPU, the GPU and the CUDA streams and the division of labor and timing relations among the three are shown in Fig. 2: within the same processing cycle, the input of one video frame, the processing of the current video frame, and the output and processing of the previous frame's processing data need to be completed in parallel; the output and processing of the previous frame's data are two steps carried out in sequence: the output of the previous frame's processing data, followed by the correction of the background model according to the foreground information of the previous frame's data.
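Tying the previous sketches together, one processing cycle n could be issued as below (PipelineBuffers, extractForeground and dMu come from the earlier sketches and are illustrative); the warm-up of the first two cycles and error checking are omitted.

#include <cuda_runtime.h>
#include <cstdint>

// Sketch of processing cycle n (cf. Fig. 2): the three streams are issued
// together, so the upload of frame n, the processing of frame n-1 and the
// download of the results of frame n-2 overlap in time.
void processCycle(PipelineBuffers& p, int n, int w, int h,
                  const float* dMu, int N, float alpha, float epsilon,
                  dim3 grid, dim3 block)
{
    size_t frameBytes = static_cast<size_t>(w) * h;
    int in = n % 3, cur = (n + 2) % 3, out = (n + 1) % 3;   // buffers for n, n-1, n-2

    // Stream 1: upload frame n asynchronously from pinned host memory.
    cudaMemcpyAsync(p.devFrame[in], p.hostFrame[in], frameBytes,
                    cudaMemcpyHostToDevice, p.stream[0]);

    // Stream 2: foreground extraction for frame n-1.
    extractForeground<<<grid, block, 0, p.stream[1]>>>(
        p.devFrame[cur], dMu, p.devMask[cur], w, h, N, alpha, epsilon);

    // Stream 3: return the Boolean mask of frame n-2 to the host; the CPU then
    // corrects the background model from it (step 306).
    cudaMemcpyAsync(p.hostMask[out], p.devMask[out], frameBytes * sizeof(bool),
                    cudaMemcpyDeviceToHost, p.stream[2]);

    cudaStreamSynchronize(p.stream[2]);   // wait for the n-2 results before the CPU update
}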
The CUDA-based monitoring video foreground extraction acceleration method used by the present invention has the following advantages:
1. Using the advantages of the CUDA computation model, the calculation of each pixel is assigned to a virtual thread on the GPU, and all threads execute simultaneously, which greatly increases the execution speed of the algorithm while keeping the extraction quality unaffected;
2. According to the respective computing characteristics of the CPU and the GPU, the computing tasks are divided: computation-intensive work is handed to the GPU, while work with more branches and memory accesses, such as model updating, is handled by the CPU;
3. Using the stream processing feature of CUDA with a three-stream design, data transfer is carried out while computing, so the time needed for data transfer is hidden.
Those skilled in the art should be aware that the method steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of electronic hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in electronic hardware or software depends on the specific application and the design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered to be beyond the scope of the present invention.
The term "comprising" or any other similar term is intended to cover a non-exclusive inclusion, so that a process, method, article or device/apparatus that comprises a series of elements includes not only those elements but also other elements not expressly listed, or also includes elements inherent to the process, method, article or device/apparatus.
So far, the technical solutions of the present invention have been described with reference to the preferred embodiments shown in the drawings. However, those skilled in the art will readily understand that the protection scope of the present invention is obviously not limited to these specific embodiments. Without departing from the principles of the present invention, those skilled in the art may make equivalent changes or substitutions to the relevant technical features, and the technical solutions after such changes or substitutions fall within the protection scope of the present invention.

Claims (10)

1. A CUDA-based monitoring video foreground extraction acceleration method, applied to a graphics processing apparatus comprising a CPU and a GPU, characterized by comprising:
the GPU performs foreground extraction processing on a video frame according to a background model to obtain foreground information;
after obtaining the foreground information of a video frame output by the GPU, the CPU corrects the background model according to the foreground information of that frame, and the corrected background model is used for the GPU's foreground extraction processing of the next video frame.
2. The method according to claim 1, characterized in that the GPU performs the foreground extraction processing using three CUDA streams that can process data in parallel; for the n-th processing cycle, the processing comprises:
the first CUDA stream receives the n-th frame of video data from the CPU;
the second CUDA stream performs foreground extraction processing on the (n-1)-th frame of video data according to the background model, using the configured foreground information calculation method;
the third CUDA stream sends the foreground information of the (n-2)-th frame of video data to the CPU.
3. The method according to claim 2, characterized in that, while the three CUDA streams are initialized, page-locked memory used by the CUDA streams is allocated in host memory for 3 consecutive frames, page-locked memory for the returned data is allocated for 3 consecutive frames, and storage space for 3 consecutive frames together with a Boolean space of the same size as a video frame is allocated in the global memory of the GPU.
4. The method according to claim 3, characterized in that data transfer between the CPU and the GPU is performed asynchronously.
5. The method according to claim 4, characterized in that the foreground extraction processing comprises video frame preprocessing, foreground probability calculation and random number generation.
6. The method according to claim 5, characterized in that the foreground probability σ of the current pixel is calculated from μ, the mean coefficient obtained from the preset background model, Pt, the value of the pixel of the current video frame after normalization in the video frame preprocessing, and α, an influence parameter.
7. The method according to any one of claims 1 to 6, characterized in that the background model is initialized on the basis of a base image, as follows:
a video frame without any foreground object is used as the base image, and the base image is copied N times to obtain the initialized background model; N is a preset number of times.
8. The method according to any one of claims 3 to 6, characterized in that, after the three CUDA streams are initialized, the method further comprises a step of calculating the number of thread blocks, in which Bx and By, the numbers of thread blocks in the x and y directions respectively, are determined from tx and ty, the numbers of threads per thread block, and from w and h, the horizontal and vertical numbers of pixels of a video frame.
9. The method according to claim 8, characterized in that tx and ty are preset values, and the product of tx and ty is a multiple of 32.
10. The method according to any one of claims 1 to 6, characterized in that the background model is corrected as follows:
according to the Boolean values in the foreground information of the video frame output by the GPU, the random numbers corresponding to the Boolean values that are true are determined, and the background model is further corrected accordingly.
CN201710057317.2A 2017-01-26 2017-01-26 Monitoring video foreground extraction acceleration method based on CUDA Active CN106934757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710057317.2A CN106934757B (en) 2017-01-26 2017-01-26 Monitoring video foreground extraction acceleration method based on CUDA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710057317.2A CN106934757B (en) 2017-01-26 2017-01-26 Monitoring video foreground extraction acceleration method based on CUDA

Publications (2)

Publication Number Publication Date
CN106934757A true CN106934757A (en) 2017-07-07
CN106934757B CN106934757B (en) 2020-05-19

Family

ID=59423202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710057317.2A Active CN106934757B (en) 2017-01-26 2017-01-26 Monitoring video foreground extraction acceleration method based on CUDA

Country Status (1)

Country Link
CN (1) CN106934757B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102025981A (en) * 2010-12-23 2011-04-20 北京邮电大学 Method for detecting foreground in monitoring video
US20160232652A1 (en) * 2012-06-29 2016-08-11 Behavioral Recognition Systems, Inc. Automatic gain control filter in a video analysis system
CN103440668A (en) * 2013-08-30 2013-12-11 中国科学院信息工程研究所 Method and device for tracing online video target
CN103997609A (en) * 2014-06-12 2014-08-20 四川川大智胜软件股份有限公司 Multi-video real-time panoramic fusion splicing method based on CUDA
CN104751485A (en) * 2015-03-20 2015-07-01 安徽大学 GPU adaptive foreground extracting method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李晓阳: "GPU-accelerated moving object detection and segmentation", China Master's Theses Full-text Database, Information Science and Technology Series *
谢尊中: "Research and application of CUDA-based real-time intelligent video analysis algorithms", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107993185A (en) * 2017-11-28 2018-05-04 北京潘达互娱科技有限公司 Data processing method and device
CN110300253A (en) * 2018-03-22 2019-10-01 佳能株式会社 The storage medium of image processing apparatus and method and store instruction
CN110300253B (en) * 2018-03-22 2021-06-29 佳能株式会社 Image processing apparatus and method, and storage medium storing instructions
CN114327900A (en) * 2021-12-30 2022-04-12 四川启睿克科技有限公司 Method for preventing memory leakage by thread call in management double-buffer technology

Also Published As

Publication number Publication date
CN106934757B (en) 2020-05-19

Similar Documents

Publication Publication Date Title
CN111898696B (en) Pseudo tag and tag prediction model generation method, device, medium and equipment
Cavigelli et al. Origami: A convolutional network accelerator
TW202025081A (en) Block operations for an image processor having a two-dimensional execution lane array and a two-dimensional shift register
TWI690896B (en) Image processor, method performed by the same, and non-transitory machine readable storage medium
Budden et al. Deep tensor convolution on multicores
CN112149795A (en) Neural architecture for self-supervised event learning and anomaly detection
CN106095588A (en) CDVS based on GPGPU platform extracts process accelerated method
EP4016473A1 (en) Method, apparatus, and computer program product for training a signature encoding module and a query processing module to identify objects of interest within an image utilizing digital signatures
CN106934757A (en) Monitor video foreground extraction accelerated method based on CUDA
US10706609B1 (en) Efficient data path for ray triangle intersection
Toharia et al. Shot boundary detection using Zernike moments in multi-GPU multi-CPU architectures
Fan et al. Real-time implementation of stereo vision based on optimised normalised cross-correlation and propagated search range on a gpu
Su et al. Artificial intelligence design on embedded board with edge computing for vehicle applications
Peng et al. FPGA-based parallel hardware architecture for SIFT algorithm
CN114202454A (en) Graph optimization method, system, computer program product and storage medium
CN109472734A (en) A kind of target detection network and its implementation based on FPGA
Jensen et al. A two-level real-time vision machine combining coarse-and fine-grained parallelism
Chouchene et al. Efficient implementation of Sobel edge detection algorithm on CPU, GPU and FPGA
Shen et al. ImLiDAR: cross-sensor dynamic message propagation network for 3D object detection
DE102020106728A1 (en) Background estimation for object segmentation using coarse level tracking
CN107316324A (en) Method based on the CUDA real-time volume matchings realized and optimization
CN116127685A (en) Performing simulations using machine learning
WO2010002626A2 (en) Vectorized parallel collision detection pipeline
Zhang et al. A hardware-oriented histogram of oriented gradients algorithm and its VLSI implementation
Botella et al. Hardware implementation of machine vision systems: image and video processing

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant