CN116708807A - Compression reconstruction method and compression reconstruction device for monitoring video - Google Patents


Publication number
CN116708807A
Authority
CN
China
Prior art keywords
video
reconstruction
network
reconstructed
key frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310125528.0A
Other languages
Chinese (zh)
Inventor
陆晓
徐春雷
蒋承伶
马洲俊
吴强
姚建伟
李海涛
李波
陈洁
李元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Jiangsu Electric Power Co Ltd
Changzhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Jiangsu Electric Power Co Ltd
Changzhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Jiangsu Electric Power Co Ltd, and Changzhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Priority to CN202310125528.0A
Publication of CN116708807A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention provides a compression and reconstruction method and device for surveillance video. The method comprises the following steps: sampling key frames of the surveillance video based on a Gaussian mixture model to screen its key frames; converting the surveillance video into corresponding video snapshots and then compressing and reconstructing it to obtain a low-sampling-rate reconstructed video; compressing the key frames of the surveillance video with a deep convolutional network to obtain reconstructed images of the key-frame hot-spot regions; and synthesizing the reconstructed video with the key-frame hot-spot reconstructions to obtain the final compressed and reconstructed video. The method compresses all video frames with video snapshot compressed sensing, compresses all or part of each key frame with a deep convolutional neural network and then reconstructs the compressed representation, and applies a higher sampling rate to hot-spot regions for better reconstruction quality before synthesizing the final reconstructed video, balancing compression speed against reconstruction quality.

Description

Compression reconstruction method and compression reconstruction device for monitoring video
Technical Field
The invention relates to the technical field of data processing, in particular to a compression reconstruction method and a compression reconstruction device of a monitoring video.
Background
Current video coding frameworks capture the intra-frame and inter-frame correlation of video by performing motion estimation, motion compensation, and similar operations at the encoding end, and compress the video by exploiting this redundant information. In resource-constrained edge applications, however, performing complex motion estimation and compensation at the edge consumes considerable computing power and battery energy.
Compression algorithms based on compressed-sensing theory can shift the complex motion estimation and motion compensation operations to a better-resourced decoding end, while the edge obtains a compressed representation of the data with a simple matrix multiplication. Ignoring extraneous variables, a video can be abstracted as a time sequence of two-dimensional images. According to CS (Compressive Sensing) theory, a single image can be accurately reconstructed from a small number of measurements when the measurement matrix and the sparse-representation matrix are sufficiently incoherent.
However, since independent CS reconstruction of a single image exploits only the sparsity of the image itself, the reconstruction often fails to reach the required accuracy when few measurements are available.
Disclosure of Invention
The first object of the present invention is to provide a compressed reconstruction method for a surveillance video.
A second object of the present invention is to provide a compression reconstruction device for surveillance video.
The technical scheme adopted by the invention is as follows:
An embodiment of the first aspect of the present invention provides a compression and reconstruction method for surveillance video, including the following steps: sampling key frames of the surveillance video based on a Gaussian mixture model and acquiring a key-frame sampling sequence to screen the key frames; converting the surveillance video into corresponding video snapshots and then compressing and reconstructing it to obtain a low-sampling-rate reconstructed video; compressing the key frames of the surveillance video with a deep convolutional network to obtain reconstructed images of the key-frame hot-spot regions; and synthesizing the reconstructed video and the key-frame hot-spot reconstructions according to the sampling sequence to obtain the final compressed and reconstructed video.
The compression reconstruction method of the monitoring video provided by the invention can also have the following additional technical characteristics:
According to one embodiment of the invention, sampling the key frames of the surveillance video based on the Gaussian mixture model specifically includes: taking the n_history frames nearest the current time in the surveillance video and estimating the parameters of a Gaussian mixture model; when a new video frame arrives, marking a pixel as a change point if its value exceeds three times the model mean; judging that the new frame contains an abnormal moving object if the number of change points in a designated area reaches a set proportion; and, when an abnormal moving object is detected, sampling key frames with a momentum-based unequal-interval temporal sub-sampling algorithm.
According to an embodiment of the present invention, converting the surveillance video into corresponding video snapshots and then compressing and reconstructing it to obtain a low-sampling-rate reconstructed video specifically includes: performing color downsampling on the surveillance video and multiplying it by a sampling matrix to convert it into the corresponding video snapshot; preprocessing the video snapshot to roughly reconstruct the surveillance video corresponding to the snapshot; inputting the roughly reconstructed video into a prior reconstruction network for prior reconstruction; and inputting the prior-reconstructed video into an optimized reconstruction network for optimized reconstruction to obtain the low-sampling-rate reconstructed video.
According to one embodiment of the invention, the prior reconstruction network comprises a coding layer, a feature extraction layer, and a decoding layer connected in series, wherein the coding layer comprises: a first three-dimensional convolution module, a first ReLU (Rectified Linear Unit) module, a second three-dimensional convolution module, a second ReLU module, a third three-dimensional convolution module, and a third ReLU module connected in series; the feature extraction layer comprises a plurality of reversible three-dimensional convolution modules connected in series; and the decoding layer comprises a first three-dimensional transpose convolution module, a fourth ReLU module, a second three-dimensional transpose convolution module, a fifth ReLU module, a third three-dimensional transpose convolution module, and a sixth ReLU module connected in series.
According to one embodiment of the present invention, the optimized reconstruction network is an iterative network comprising a plurality of stages; the network structure of each stage is the same as that of the prior reconstruction network, and the input of each stage is the output and the input of the network of the previous stage.
According to one embodiment of the present invention, compressing the key frames of the surveillance video based on a deep convolutional network to obtain a reconstructed image of the key-frame hot-spot region specifically includes: compressing the key frames with a deep convolutional compression network to obtain a compressed representation of the key frames; acquiring the change region from the compressed representation and deriving an ROI (Region of Interest) matrix from its values; extracting the hot-spot region according to the ROI matrix; compressing the images of the hot-spot region with a hot-spot model to obtain a compressed representation of the corresponding feature map; and reconstructing that compressed representation with a deep convolutional reconstruction network to obtain the reconstructed image of the key-frame hot-spot region.
According to an embodiment of the present invention, the compression reconstruction method further includes: if no hot-spot region is extracted, reconstructing from the compressed representation of the key frame to obtain the reconstructed image of the key-frame hot-spot region.
According to one embodiment of the invention, the final compressed reconstructed video is obtained by combining, according to a fixed formula, the low-sampling-rate reconstructed video, the sampling sequence, and the reconstructed images of the key-frame hot-spot regions.
An embodiment of the second aspect of the present invention provides a compression and reconstruction device for surveillance video, including: a sampling module for sampling key frames of the surveillance video based on the Gaussian mixture model and acquiring a key-frame sampling sequence to screen the key frames; a first reconstruction module for converting the surveillance video into corresponding video snapshots and then compressing and reconstructing it to obtain a low-sampling-rate reconstructed video; a second reconstruction module for compressing the key frames based on a deep convolutional network to obtain key-frame hot-spot-region reconstructed images; and a third reconstruction module for synthesizing the reconstructed video and the key-frame hot-spot reconstructions according to the sampling sequence to obtain the final compressed reconstructed video.
The invention has the beneficial effects that:
according to the invention, a momentum-based self-adaptive video key frame sampling algorithm is realized based on a Gaussian mixture model, the sampling frequency of a video key frame is reduced when no abnormal moving target appears, the method is suitable for the characteristics of low-speed monitoring video, the subsequent compression speed is improved, then all video frames are compressed by using a video snapshot compressed sensing method, all or part of the key frame is compressed based on a deep convolutional neural network, compressed representation is reconstructed, a higher sampling rate is used for a hot spot area to obtain better reconstruction quality, a final reconstructed video is synthesized, and the compression speed and the reconstruction quality are both considered.
Drawings
FIG. 1 is a flow chart of a method for compressed reconstruction of surveillance video according to an embodiment of the invention;
FIG. 2 is a schematic diagram of acquiring a reconstructed video at a low sampling rate according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a color video snapshot compressed sensing reconstruction network according to one embodiment of the present invention;
FIG. 4 is a schematic diagram of the architecture of an a priori reconstruction network in accordance with one embodiment of the present invention;
FIG. 5 is a schematic diagram of the structure of stage t of the optimized reconstruction network according to one embodiment of the present invention;
FIG. 6 is a schematic diagram of the architecture of a deep convolutional network in accordance with one embodiment of the present invention;
FIG. 7 is an acquisition schematic diagram of a key frame hot spot area reconstructed image according to one embodiment of the invention;
fig. 8 is a block diagram of a compressed reconstruction apparatus for surveillance video according to an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a flowchart of a method for compressed reconstruction of surveillance video according to an embodiment of the present invention, as shown in fig. 1, the method includes the following steps S1-S4.
S1, sampling key frames of the monitoring video based on a Gaussian mixture model, and acquiring a key frame sampling sequence to screen the key frames of the monitoring video.
Further, according to an embodiment of the present invention, the sampling of the key frames of the surveillance video based on the Gaussian mixture model specifically includes: taking the n_history frames nearest the current moment in the surveillance video and estimating the parameters of a Gaussian mixture model; when a new video frame arrives, marking a pixel as a change point if its value exceeds three times the model mean; judging that the new frame contains an abnormal moving object if the number of change points in a designated area reaches a set proportion; and, when an abnormal moving object is detected, sampling key frames with a momentum-based unequal-interval temporal sub-sampling algorithm.
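The detection steps above can be sketched as follows. This is a minimal illustration that uses a single per-pixel Gaussian as a stand-in for the full mixture model; the change-point rule here compares each pixel against three standard deviations of the background statistics, and the function/parameter names (`detect_anomaly`, `ratio_thresh`) are illustrative, not the patent's.

```python
import numpy as np

def detect_anomaly(history, frame, region, ratio_thresh=0.05):
    """Flag an abnormal moving object inside `region` of `frame`.

    Simplified single-Gaussian stand-in for the Gaussian mixture model:
    a pixel is a change point when it deviates from the per-pixel mean
    of the last n_history frames by more than three standard deviations.
    """
    mean = history.mean(axis=0)
    std = history.std(axis=0) + 1e-6          # avoid division by zero
    change = np.abs(frame - mean) > 3 * std   # per-pixel change points
    y0, y1, x0, x1 = region                   # designated area
    roi = change[y0:y1, x0:x1]
    return roi.mean() >= ratio_thresh, change

# toy usage: static background plus a bright moving blob
rng = np.random.default_rng(0)
history = rng.normal(100, 2, size=(20, 64, 64))   # last n_history frames
frame = history.mean(axis=0).copy()
frame[10:20, 10:20] += 50                         # abnormal bright object
is_abnormal, _ = detect_anomaly(history, frame, (0, 32, 0, 32))
```

The returned flag drives the sampling-interval update: only when it is true does the key-frame sampler shorten its interval.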
Specifically, one characteristic of surveillance video is that the picture remains largely unchanged most of the time, so compressing the original video by reducing the key-frame sampling rate under normal conditions has little influence on the final reconstruction result. To exploit the temporal locality and periodicity of video anomalies (abnormal moving objects), the invention adopts a momentum-based unequal-interval temporal sub-sampling algorithm: when an abnormal moving object is detected, the sampling interval is gradually shortened. The specific sampling interval follows formula (1):
where Δt_i is the current sampling interval, Δt_{i-1} is the previous sampling interval, t_i is the index of the next sampled video frame, t_{i-1} is the index of the previously sampled video frame, T is the period of the lowest sampling rate, and η_t is the temporal sampling momentum parameter.
Sampling with the momentum-based unequal-interval temporal sub-sampling algorithm yields a sub-sampled video of the original surveillance video together with the corresponding key-frame sampling sequence S_key, where Σ S_key = n_key, n_frame is the total number of frames contained in the surveillance video, n_key is the number of key frames, W is the video frame width, and H is the video frame height.
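The momentum idea can be illustrated with a small sketch. The exact formula (1) did not survive extraction, so the update rule below is only one plausible momentum-style interpolation consistent with the text: shrink the interval toward a minimum while an anomaly is present, and relax it toward the lowest-rate period T otherwise (`eta`, `dt_min` are assumed names).

```python
def next_interval(dt_prev, anomaly, T, eta=0.5, dt_min=1):
    """Illustrative momentum-style sampling-interval update (not the
    patent's exact formula (1)): blend the previous interval with a
    target of dt_min (anomaly present) or T (no anomaly)."""
    target = dt_min if anomaly else T
    dt = eta * dt_prev + (1 - eta) * target  # momentum interpolation
    return max(dt_min, int(round(dt)))
```

Repeated calls with `anomaly=True` shorten the interval toward one frame, giving denser key frames during an anomaly, while quiet periods drift back toward the lowest sampling rate.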
S2, after the monitoring video is converted into the corresponding video snapshot, compression and reconstruction of the monitoring video are carried out, so that a reconstructed video with a low sampling rate is obtained.
According to one embodiment of the present invention, converting the surveillance video into corresponding video snapshots and then compressing and reconstructing it to obtain a low-sampling-rate reconstructed video specifically includes: performing color downsampling on the surveillance video and multiplying it by a sampling matrix to convert it into the corresponding video snapshot; preprocessing the video snapshot to roughly reconstruct the surveillance video corresponding to the snapshot; inputting the roughly reconstructed video into a prior reconstruction network for prior reconstruction; and inputting the prior-reconstructed video into an optimized reconstruction network for optimized reconstruction to obtain the low-sampling-rate reconstructed video.
Specifically, the procedure for acquiring the low-sampling-rate reconstructed video is shown in fig. 2: video snapshot compression first converts the surveillance video into video snapshots, and the snapshots are then reconstructed by the color video snapshot compressed-sensing reconstruction network to obtain the low-sampling-rate reconstructed video. The structure of the color video snapshot compressed-sensing reconstruction network, shown in fig. 3, comprises a preprocessing module, a prior reconstruction network, and an optimized reconstruction network.
When a video snapshot is acquired, color downsampling is performed on the principle of a Bayer color filter: the original color image is converted into a single-channel image, which reduces the model's input dimensionality and parameter count. The color downsampling is expressed as formula (2):
where X_bayer ∈ R^{W×H} is the downsampled single-channel image, X is a color video frame, X_i is the data of the i-th color channel of the frame, the Bayer color-filter matrices are in the RGGB arrangement, and W and H are the width and height of the video frame, respectively.
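As an illustration, the RGGB color downsampling can be sketched in NumPy. This is a minimal sketch that assumes the standard RGGB mosaic layout (R at even-row/even-column sites, B at odd/odd, G elsewhere); the patent's exact filter matrices of equation (2) are not reproduced here.

```python
import numpy as np

def bayer_downsample(rgb):
    """Collapse an (H, W, 3) RGB frame to one channel with an RGGB
    Bayer mosaic: each output pixel keeps exactly one color sample."""
    h, w, _ = rgb.shape
    mosaic = np.zeros((h, w), dtype=rgb.dtype)
    mosaic[0::2, 0::2] = rgb[0::2, 0::2, 0]  # R at even rows, even cols
    mosaic[0::2, 1::2] = rgb[0::2, 1::2, 1]  # G at even rows, odd cols
    mosaic[1::2, 0::2] = rgb[1::2, 0::2, 1]  # G at odd rows, even cols
    mosaic[1::2, 1::2] = rgb[1::2, 1::2, 2]  # B at odd rows, odd cols
    return mosaic
```

The single-channel mosaic is what gets multiplied by the sampling matrices in the snapshot-compression step that follows.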
As shown in fig. 2, the image after the color downsampling is multiplied by the sampling matrix to obtain a snapshot of the video, and the video snapshot compression process can be represented by the following formula (3):
X_meas = Σ_{k=1}^{B} C_k ⊙ X_k (3)
where X_k ∈ R^{W×H} is the k-th color-downsampled video frame to be compressed, C_k ∈ R^{W×H} is the k-th random sampling matrix, ⊙ denotes element-wise matrix multiplication, B is the total number of video frames contained in one snapshot, and X_meas ∈ R^{W×H} is the video snapshot corresponding to the B video frames.
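The snapshot compression of B frames into one measurement can be sketched as follows; the shapes and the binary random masks standing in for C_k are illustrative assumptions.

```python
import numpy as np

def snapshot_compress(frames, masks):
    """Video snapshot compression in the pattern of equation (3):
    X_meas = sum_k C_k ⊙ X_k, collapsing B frames of shape [B, W, H]
    into a single W×H measurement."""
    return (masks * frames).sum(axis=0)

rng = np.random.default_rng(1)
B, W, H = 8, 32, 32
frames = rng.random((B, W, H))            # color-downsampled frames X_k
masks = rng.integers(0, 2, (B, W, H))     # random binary sampling matrices C_k
x_meas = snapshot_compress(frames, masks)  # one W×H snapshot for B frames
```

The edge end only performs this mask-and-sum, which is why the heavy lifting moves to the decoder.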
The obtained video snapshot X_meas is first preprocessed by the preprocessing module according to formula (4) to obtain a rough reconstruction of the surveillance-video frames contained in the snapshot. The roughly reconstructed video is then input into the prior reconstruction network for prior reconstruction.
According to one embodiment of the invention, as shown in fig. 4, the prior reconstruction network comprises a coding layer, a feature extraction layer, and a decoding layer connected in series, wherein the coding layer comprises: the first three-dimensional convolution module Conv3D1, the first ReLU module ReLU1, the second three-dimensional convolution module Conv3D2, the second ReLU module ReLU2, the third three-dimensional convolution module Conv3D3, and the third ReLU module ReLU3 connected in series; the feature extraction layer comprises a plurality of reversible three-dimensional convolution modules RevConv3D connected in series; and the decoding layer comprises the first three-dimensional transposed convolution module ConvT3D1, the fourth ReLU module ReLU4, the second three-dimensional transposed convolution module ConvT3D2, the fifth ReLU module ReLU5, the third three-dimensional transposed convolution module ConvT3D3, and the sixth ReLU module ReLU6 connected in series.
Specifically, as shown in fig. 4, the coding layer consists of: a first three-dimensional convolution module Conv3D1 with feature-map depth M_e1, kernel size 5x5x5, stride 1, and padding 2; a first ReLU module ReLU1; a second three-dimensional convolution module Conv3D2 with feature-map depth M_e2, kernel size 3x3x3, stride 1, and padding 1; a second ReLU module ReLU2; a third three-dimensional convolution module Conv3D3 with feature-map depth M_e3, kernel size 3x3x3, stride (1, scale_w, scale_h), and padding 1; and a third ReLU module ReLU3. The resulting output, of dimension [M_e3, B, W/scale_w, H/scale_h], is then input into the feature extraction layer.
A neural network without special design loses information at every layer, so when back-propagating from the network output it must store all intermediate results in order to update the weights, which makes the GPU-memory footprint of model training large. A reversible module avoids this by recomputing its inputs from its outputs.
Specifically, as shown in fig. 4, the reversible three-dimensional convolution module RevConv3D used to build the feature extraction layer is described by the following formulas, where F_rc3d1 and F_rc3d2 are three-dimensional convolution stacks, x is the module input, and y is the module output. The input x is split along the first dimension into two vectors x_1, x_2 of dimension [M_e3/2, B, W/scale_w, H/scale_h]; the output y = [y_1, y_2] is the concatenation, along the first dimension, of two vectors y_1, y_2 of the same dimension.
y_1 = x_1 + F_rc3d1(x_2)
y_2 = x_2 + F_rc3d2(y_1) (5)
The reverse computation is given by equation (6):
x_2 = y_2 - F_rc3d2(y_1)
x_1 = y_1 - F_rc3d1(x_2) (6)
F_rc3d1 and F_rc3d2 have the same structure: a three-dimensional convolution module with feature-map depth M_e3, kernel size 3x3x3, stride 1, and padding 1; a ReLU module; and a three-dimensional convolution module with feature-map depth M_e3, kernel size 3x3x3, stride 1, and padding 1.
the input and output dimensions of the reversible three-dimensional convolution module RevConv3D are [ M ] e3 ,B,W/scale w ,H/scale h ]A specified number of reversible three-dimensional convolution modules may be stacked to deepen the network.
After the feature extraction layer, its output is taken as the input of the prior reconstruction network's decoding layer, whose modules are: a first three-dimensional transpose convolution ConvT3D1 with feature-map depth M_e3, kernel size 3x3x3, stride (1, scale_w, scale_h), and padding 1; ReLU4; a second three-dimensional transpose convolution ConvT3D2 with feature-map depth M_e2, kernel size 3x3x3, stride 1, and padding 1; ReLU5; a third three-dimensional transpose convolution ConvT3D3 with feature-map depth M_e1, kernel size 1x1x1, stride 1, and padding 1; and ReLU6.
As shown in fig. 3, the present invention adopts an iterative reconstruction method, realized by dividing the reconstruction network into two parts: the prior reconstruction network and the optimized reconstruction network. The optimized reconstruction network has the same structure as the prior reconstruction network but differs in training procedure and weights.
According to one embodiment of the present invention, the optimized reconstruction network is an iterative network comprising a plurality of stages; the network structure of each stage is identical to that of the prior reconstruction network, and the input of each stage is the output and the input of the network of the previous stage. Fig. 5 is a schematic diagram of the structure of stage t of the optimized reconstruction network according to one embodiment of the present invention.
The input of each stage of the optimized reconstruction network is the reconstructed frames and the reference output from the previous stage, and its output is the refined reconstructed frames. When training the network of the t-th iteration, the parameters of the preceding networks are fixed. Multi-step optimized reconstruction improves the interpretability of the network and provides adjustable reconstruction accuracy and speed.
The present invention uses the mean square error (Mean Square Error, MSE) as the loss function in the network training described above; the mean square error is widely used in image processing because it is easy to differentiate. It is given by formula (7):
MSE = (1 / (w · h · n_c)) Σ_{i=1..w} Σ_{j=1..h} Σ_{c=1..n_c} (X̂_ijc − X_ijc)^2 (7)
where w is the image width, h is the image height, c indexes the n_c color channels, X̂_ijc is the reconstructed pixel value at image coordinates (i, j, c), and X_ijc is the pixel value at the same coordinates of the real image.
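The loss reduces to a one-liner; the sketch below averages the squared pixel differences over width, height, and color channels, as in formula (7).

```python
import numpy as np

def mse_loss(x_hat, x):
    """Mean square error: mean of squared per-pixel differences over
    all spatial positions and color channels."""
    return np.mean((x_hat - x) ** 2)
```

Because the gradient of the squared difference is linear in the error, this loss back-propagates cheaply through the reconstruction networks.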
S3, performing key frame compression of the surveillance video based on the deep convolutional network to obtain a reconstructed image of the key-frame hot-spot area.
According to one embodiment of the invention, key-frame compression of the surveillance video based on the deep convolutional network to obtain a reconstructed image of the key-frame hot-spot area specifically includes the following steps: compressing the key frames with a deep convolutional compression network to obtain a compressed representation of the key frames; acquiring the change region from the compressed representation and deriving an ROI matrix from its values; extracting the hot-spot region according to the ROI matrix; compressing the images of the hot-spot region with a hot-spot model to obtain a compressed representation of the corresponding feature map; and reconstructing that compressed representation with a deep convolutional reconstruction network to obtain the key-frame hot-spot-region reconstructed image.
According to one embodiment of the invention, if no hot-spot region is extracted, reconstruction is performed from the compressed representation of the key frame to obtain the key-frame hot-spot-region reconstructed image.
Specifically, another characteristic of surveillance video is that the changed area within a key frame is relatively small, leaving further room for compression. The invention therefore adopts a key-frame differential compression flow based on a deep convolutional network, which specifically includes: compressing the key frame to obtain its compressed representation; and extracting the hot-spot region and compressing the hot-spot-region image to obtain its compressed representation. If a hot-spot-region image exists, only its compressed representation is transmitted; otherwise the compressed representation of the whole key frame is transmitted, further improving compression efficiency.
In an embodiment of the present invention, the deep convolutional network is shown in fig. 6. The deep convolutional compression network consists of a two-dimensional convolution layer Conv2D with feature-map depth M_bg, finally yielding a feature map of dimension [M_bg, W_mbg, H_mbg] as the compressed representation of the original video frame image, where W_mbg and H_mbg are the width and height of the feature map, respectively.
The structure of the deep convolutional reconstruction network, shown in fig. 6, includes: several RevConv2D modules connected in series; a two-dimensional transpose convolution module ConvT2D with feature-map depth M_d1, kernel size 3x3, stride 1, and padding 1; a ReLU module; a two-dimensional transpose convolution module ConvT2D with feature-map depth M_d2, kernel size 3x3, stride 1, and padding 1; a ReLU module; a two-dimensional transpose convolution module with feature-map depth 3, kernel size 3x3, stride 1, and padding 1; and a ReLU module.
The RevConv2D module used in the present invention is described by the following formulas, where F_rc2d1 and F_rc2d2 are two-dimensional convolution stacks, x is the module input, and y is the module output. The input x is split along the feature-map depth dimension (that of the coding layer's output feature map) into two vectors x_1, x_2 of dimension [M_rc2d/2, W_rc2d, H_rc2d]; the output y = [y_1, y_2] is the concatenation, along the first dimension, of two vectors y_1, y_2 of the same dimension, where M_rc2d is the depth of the module's input feature map and W_rc2d, H_rc2d are its width and height:
y_1 = x_1 + F_rc2d1(x_2)
y_2 = x_2 + F_rc2d2(y_1)   (8)
The reverse calculation process is as follows:
x_2 = y_2 - F_rc2d2(y_1)
x_1 = y_1 - F_rc2d1(x_2)   (9);
wherein F_rc2d1 and F_rc2d2 have the same structure, specifically comprising: a two-dimensional convolution module whose feature map has depth M_rc2d/2, with kernel size 3x3, stride 1 and padding 1; a ReLU module; and a two-dimensional convolution module whose feature map has depth M_rc2d/2, with kernel size 3x3, stride 1 and padding 1.
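A minimal numpy sketch of an invertible additive coupling of this kind follows, using the standard reversible-network ordering in which the second branch takes y_1 as input, so that the module can be inverted exactly. The 1x1 linear maps with tanh are stand-ins for the Conv2D+ReLU+Conv2D stacks, chosen only to keep the example short:

```python
import numpy as np

rng = np.random.default_rng(0)
M, W, H = 8, 4, 4                        # feature-map depth, width, height
W1 = rng.normal(size=(M // 2, M // 2))   # stand-ins for the two branch stacks
W2 = rng.normal(size=(M // 2, M // 2))

def F1(x):  # hypothetical stand-in for Conv2D(3x3) + ReLU + Conv2D(3x3)
    return np.tanh(np.einsum("ij,jwh->iwh", W1, x))

def F2(x):
    return np.tanh(np.einsum("ij,jwh->iwh", W2, x))

def rev_forward(x):
    x1, x2 = np.split(x, 2, axis=0)      # split along feature-map depth
    y1 = x1 + F1(x2)
    y2 = x2 + F2(y1)
    return np.concatenate([y1, y2], axis=0)

def rev_inverse(y):
    y1, y2 = np.split(y, 2, axis=0)
    x2 = y2 - F2(y1)                     # recover x2 first, then x1
    x1 = y1 - F1(x2)
    return np.concatenate([x1, x2], axis=0)

x = rng.normal(size=(M, W, H))
assert np.allclose(rev_inverse(rev_forward(x)), x)  # exact invertibility
```

The design point is that the inverse needs no stored activations: the input can be recomputed from the output, which is what makes reversible blocks memory-efficient in deep compression networks.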
In this deep convolutional network structure, the encoding layer uses feature maps of depth M_ROI, and the decoding layer is similar to the decoding layer of the key frame general-region compression and reconstruction network.
As shown in fig. 7, a key frame is encoded using the encoding layer of the general region; the resulting feature map is then differenced against the feature map of the previous key frame, and the differences are summed to obtain a delta value for the change region, from which the ROI matrix ROI[w, h] is derived. The ROI matrix is scaled to the video frame size and then expanded with a kernel size of 64x64 and a stride of 64 to obtain the final hot spot region, where the delta is computed between the feature maps of the current key frame and the previous key frame.
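The ROI-to-hot-spot step above can be sketched in numpy as follows. This is a hedged illustration: the function name `hotspot_mask`, the change threshold, and the nearest-neighbour scaling are assumptions, while the 64x64-block expansion at stride 64 follows the description:

```python
import numpy as np

def hotspot_mask(feat_cur, feat_prev, frame_hw, thresh=0.5, block=64):
    """Hypothetical sketch of the hot spot extraction step."""
    delta = np.abs(feat_cur - feat_prev).sum(axis=0)   # per-location change
    roi = (delta > thresh).astype(np.uint8)            # ROI matrix [w, h]
    # Scale the ROI matrix up to the video frame size (nearest neighbour).
    fh, fw = frame_hw
    scaled = roi[(np.arange(fh) * roi.shape[0] // fh)[:, None],
                 (np.arange(fw) * roi.shape[1] // fw)[None, :]]
    # Expand with a block x block kernel at stride block: any block that
    # contains a change point is marked hot in full.
    for r in range(0, fh, block):
        for c in range(0, fw, block):
            if scaled[r:r + block, c:c + block].any():
                scaled[r:r + block, c:c + block] = 1
    return scaled

mask = hotspot_mask(np.ones((4, 8, 8)), np.zeros((4, 8, 8)), (128, 128))
print(mask.all())  # every block contains change, so the whole frame is hot
```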
as shown in fig. 7, a hotspot model is used to compress a plurality of images of a hotspot region, so as to obtain a compressed representation of a feature map corresponding to the hotspot region image, and the transmission data is the compressed representation and a corresponding ROI matrix. The reconstruction flow is to input the compressed representation of the feature map into a reconstruction network to obtain a reconstructed image of the key frame hot spot region.
S4, taking the reconstructed video and the reconstructed images of the key frame hot spot areas, and synthesizing them according to the sampling sequence to obtain the final compressed reconstructed video.
Further, according to one embodiment of the present invention, the final compressed reconstructed video is derived from the low-sampling-rate reconstructed video, the sampling sequence, and the reconstructed images of the key frame hot spot regions.
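The synthesis step can be sketched as pasting each reconstructed hot spot region back into the low-sampling-rate reconstruction at its sampled key frame position. The function name `synthesize` and the mask-based pasting are assumptions made for illustration:

```python
import numpy as np

def synthesize(recon_video, sample_seq, hotspot_frames, masks):
    """Hedged sketch: overwrite only the hot spot pixels of the sampled
    key frames with their higher-quality reconstructions."""
    out = recon_video.copy()
    for t, frame, mask in zip(sample_seq, hotspot_frames, masks):
        out[t][mask > 0] = frame[mask > 0]   # replace only the hot region
    return out

video = np.zeros((4, 8, 8))        # low-sampling-rate reconstruction
hot = np.ones((1, 8, 8))           # one reconstructed hot spot key frame
mask = np.zeros((8, 8))
mask[:4, :4] = 1                   # hot spot covers a 4x4 corner
final = synthesize(video, [2], hot, [mask])
print(final[2].sum())  # -> 16.0 (only the masked corner was replaced)
```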
In summary, the compression reconstruction method for surveillance video provided by the embodiments of the invention implements a momentum-based adaptive video key frame sampling algorithm on top of a Gaussian mixture model, reducing the key frame sampling frequency when no abnormal moving object appears; this suits the characteristics of low-speed surveillance video and speeds up subsequent compression. All video frames are then compressed with a video snapshot compressed sensing method, all or part of each key frame is compressed with a deep convolutional neural network, and the compressed representations are reconstructed, with a higher sampling rate used for hot spot regions to obtain better reconstruction quality. The final reconstructed video is synthesized from these parts, balancing compression speed and reconstruction quality.
In addition, the invention also provides a compression reconstruction device for surveillance video, which applies the above compression reconstruction method.
Fig. 8 is a block diagram of a compressed reconstruction apparatus for surveillance video according to an embodiment of the invention, as shown in fig. 8, the apparatus includes: the sampling module 10, the first reconstruction module 20, the second reconstruction module 30 and the third reconstruction module 40.
The sampling module 10 is used for sampling key frames of the surveillance video based on the Gaussian mixture model and acquiring a key frame sampling sequence, so as to screen the key frames of the surveillance video. The first reconstruction module 20 is configured to convert the surveillance video into a corresponding video snapshot and then compress and reconstruct it, so as to obtain a reconstructed video at a low sampling rate. The second reconstruction module 30 is configured to perform key frame compression of the surveillance video based on the deep convolutional network, so as to obtain a reconstructed image of the key frame hot spot area. The third reconstruction module 40 is configured to take the reconstructed video and the reconstructed images of the key frame hot spot areas and synthesize them according to the sampling sequence to obtain the final compressed reconstructed video.
The compression reconstruction device for surveillance video implements a momentum-based adaptive video key frame sampling algorithm on top of the Gaussian mixture model, reducing the key frame sampling frequency when no abnormal moving target appears; this suits the characteristics of low-speed surveillance video and speeds up subsequent compression. All video frames are then compressed with a video snapshot compressed sensing method, all or part of each key frame is compressed with a deep convolutional neural network, and the compressed representations are reconstructed, with a higher sampling rate used for hot spot areas to obtain better reconstruction quality. The final reconstructed video is synthesized from these parts, balancing compression speed and reconstruction quality.
In the description of the present invention, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. The meaning of "a plurality of" is two or more, unless specifically defined otherwise.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily for the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and additional implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order from that shown or discussed, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present invention.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like. While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (9)

1. The compression reconstruction method of the monitoring video is characterized by comprising the following steps of:
sampling the key frames of the monitoring video based on a Gaussian mixture model, and acquiring a key frame sampling sequence to screen the key frames of the monitoring video;
after converting the monitoring video into a corresponding video snapshot, compressing and reconstructing the monitoring video to obtain a reconstructed video with a low sampling rate;
performing key frame compression of the monitoring video based on a depth convolution network to obtain a reconstructed image of a key frame hot spot area;
and reconstructing images of the reconstructed video and the key frame hot spot area, and synthesizing according to a sampling sequence to obtain a final compressed reconstructed video.
2. The method for compressed reconstruction of surveillance video according to claim 1, wherein the step of sampling key frames of the surveillance video based on a gaussian mixture model specifically comprises:
acquiring the n_ghi frames nearest to the current moment in the monitoring video, and estimating the parameters of the Gaussian mixture model from these frames;
when a new video frame enters, if the value of a pixel is greater than three times the mean value of the model, marking the pixel as a change point;
if the number of the change points in the designated area is greater than or equal to the set proportion, judging that the new video frame has an abnormal moving object;
and when an abnormal moving object is detected, performing key frame sampling by adopting a momentum-based video time unequal sub-sampling algorithm.
3. The method for compressing and reconstructing a surveillance video according to claim 1, wherein after converting the surveillance video into a corresponding video snapshot, compressing and reconstructing the surveillance video to obtain a reconstructed video with a low sampling rate, specifically comprising:
multiplying the monitoring video by a sampling matrix after performing color downsampling to convert the monitoring video into a corresponding video snapshot;
preprocessing the video snapshot to roughly reconstruct a monitoring video corresponding to the video snapshot;
inputting the video after the rough reconstruction into a priori reconstruction network for priori reconstruction;
and inputting the video after priori reconstruction into an optimized reconstruction network for optimized reconstruction so as to obtain a reconstructed video with a low sampling rate.
4. A method of compressed reconstruction of surveillance video according to claim 3, wherein the a priori reconstruction network comprises: an encoding layer, a feature extraction layer and a decoding layer which are sequentially connected in series, wherein,
the coding layer includes: the system comprises a first three-dimensional convolution module, a first ReLU module, a second three-dimensional convolution module, a second ReLU module, a third three-dimensional convolution module and a third ReLU module which are sequentially connected in series;
the feature extraction layer includes: a plurality of reversible three-dimensional convolution modules connected in series with each other;
the decoding layer includes: the three-dimensional transpose convolution module, the fourth ReLU module, the second three-dimensional transpose convolution module, the fifth ReLU module, the third three-dimensional transpose convolution module, and the sixth ReLU module are sequentially connected in series.
5. The method for compressed reconstruction of surveillance video according to claim 4, wherein the optimized reconstruction network is an iterative network, and the optimized reconstruction network includes a plurality of stages, and a network structure of each stage is identical to a structure of the prior reconstruction network, and inputs of the network of each stage of the optimized reconstruction network are: the output of the network of the previous stage and the input of the network of the previous stage.
6. The method for compressing and reconstructing a surveillance video according to claim 1, wherein the key frame compression of the surveillance video is performed based on a depth convolution network to obtain a reconstructed image of a key frame hot spot region, specifically comprising:
compressing the key frames by adopting a depth convolution compression network to obtain compressed representation of the key frames;
acquiring a change area according to the compressed representation of the key frame, and obtaining an ROI matrix according to the numerical value of the change area;
extracting a hot spot region according to the ROI matrix;
compressing a plurality of images of the hot spot area by using a hot spot model to obtain a compressed representation of a feature map corresponding to the image of the hot spot area;
and reconstructing the compressed representation of the feature map corresponding to the hot spot region image by adopting a depth convolution reconstruction network to obtain the reconstructed image of the key frame hot spot region.
7. The method for compressed reconstruction of surveillance video of claim 6, further comprising: and if the hot spot area is not extracted, reconstructing according to the compressed representation of the key frame to obtain a reconstructed image of the hot spot area of the key frame.
8. The compression reconstruction method of surveillance video according to claim 1, wherein the final compressed reconstructed video is obtained by combining the low-sampling-rate reconstructed video, the sampling sequence, and the reconstructed images of the key frame hot spot region.
9. A compression reconstruction device for monitoring video, comprising:
the sampling module is used for sampling the key frames of the monitoring video based on the Gaussian mixture model and acquiring a key frame sampling sequence so as to screen the key frames of the monitoring video;
the first reconstruction module is used for compressing and reconstructing the monitoring video after converting the monitoring video into the corresponding video snapshot so as to obtain a reconstructed video with a low sampling rate;
the second reconstruction module is used for carrying out key frame compression of the monitoring video based on a depth convolution network so as to obtain a key frame hot spot area reconstructed image;
and the third reconstruction module is used for reconstructing images of the reconstructed video and the key frame hot spot area, and synthesizing the reconstructed images according to a sampling sequence to obtain a final compressed reconstructed video.
CN202310125528.0A 2023-02-16 2023-02-16 Compression reconstruction method and compression reconstruction device for monitoring video Pending CN116708807A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310125528.0A CN116708807A (en) 2023-02-16 2023-02-16 Compression reconstruction method and compression reconstruction device for monitoring video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310125528.0A CN116708807A (en) 2023-02-16 2023-02-16 Compression reconstruction method and compression reconstruction device for monitoring video

Publications (1)

Publication Number Publication Date
CN116708807A true CN116708807A (en) 2023-09-05

Family

ID=87829997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310125528.0A Pending CN116708807A (en) 2023-02-16 2023-02-16 Compression reconstruction method and compression reconstruction device for monitoring video

Country Status (1)

Country Link
CN (1) CN116708807A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115718A (en) * 2023-10-20 2023-11-24 思创数码科技股份有限公司 Government affair video data processing method, system and computer readable storage medium
CN117115718B (en) * 2023-10-20 2024-01-09 思创数码科技股份有限公司 Government affair video data processing method, system and computer readable storage medium

Similar Documents

Publication Publication Date Title
Mignotte A fractal projection and Markovian segmentation-based approach for multimodal change detection
CN109146787B (en) Real-time reconstruction method of dual-camera spectral imaging system based on interpolation
CN109584319A (en) A kind of compression of images sensing reconstructing algorithm based on non-local low rank and full variation
CN113222824B (en) Infrared image super-resolution and small target detection method
CN116708807A (en) Compression reconstruction method and compression reconstruction device for monitoring video
CN115345866A (en) Method for extracting buildings from remote sensing images, electronic equipment and storage medium
CN117197462A (en) Lightweight foundation cloud segmentation method and system based on multi-scale feature fusion and alignment
CN116773018A (en) Space spectrum combined image reconstruction method and system for calculating spectrum imaging
CN112989593B (en) High-spectrum low-rank tensor fusion calculation imaging method based on double cameras
CN113408540B (en) Synthetic aperture radar image overlap area extraction method and storage medium
CN112837220B (en) Method for improving resolution of infrared image and application thereof
CN103903239B (en) A kind of video super-resolution method for reconstructing and its system
CN111539434B (en) Infrared weak and small target detection method based on similarity
Song et al. SODAS-Net: side-information-aided deep adaptive shrinkage network for compressive sensing
Han et al. Quaternion-based dynamic mode decomposition for background modeling in color videos
US20130235939A1 (en) Video representation using a sparsity-based model
Xia et al. A regularized tensor decomposition method with adaptive rank adjustment for compressed-sensed-domain background subtraction
Yufeng et al. Research on SAR image change detection algorithm based on hybrid genetic FCM and image registration
Liu et al. Hyperspectral image super-resolution employing nonlocal block and hybrid multiscale three-dimensional convolution
Qiu et al. Adaptive and cascaded compressive sensing
CN113256603A (en) Salient object detection method based on double-current network secondary fusion
CN111397733A (en) Single/multi-frame snapshot type spectral imaging method, system and medium
Kim et al. Super resolution reconstruction based on block matching and three‐dimensional filtering with sharpening
Attarde et al. Super resolution of image using sparse representation of image patches with LASSO approximation on CUDA platform
Cao et al. Single image super-resolution via deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination