CN118118620B - Video conference abnormal reconstruction method, computer device and storage medium - Google Patents


Info

Publication number
CN118118620B
Authority
CN
China
Prior art keywords
key frame
data
image
frame
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410533157.4A
Other languages
Chinese (zh)
Other versions
CN118118620A (en)
Inventor
Name not to be published, at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunji Digital Technology Qingdao Co ltd
Shenzhen Yuntian Changxiang Information Technology Co ltd
Original Assignee
Yunji Digital Technology Qingdao Co ltd
Shenzhen Yuntian Changxiang Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunji Digital Technology Qingdao Co ltd, Shenzhen Yuntian Changxiang Information Technology Co ltd filed Critical Yunji Digital Technology Qingdao Co ltd
Priority to CN202410533157.4A
Publication of CN118118620A
Application granted
Publication of CN118118620B

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a video conference abnormal reconstruction method, a computer device and a storage medium. The method comprises the following steps: acquiring continuous online video conference images as video image training data, and acquiring key frame data and non-key frame data of each image group through compressed sensing sampling; obtaining key frame target blocks from the key frame data with an image reconstruction algorithm, constructing, through a deep learning prediction network, a mapping network between the key frame target blocks and their corresponding non-key frame sampling values, and obtaining global spatial-domain data of the video conference; and extracting similar blocks from several frames before and after the spatial-domain data to form a low-rank tensor of image information, performing video compression on the video conference images to obtain tensor approximation data, and reconstructing the video frames of the video conference images from the tensor approximation data. A reconstruction algorithm based on non-local similarity, low rank and group sparsity effectively removes image blocking artifacts, and an end-to-end deep compressed sensing reconstruction network effectively reconstructs the video signal and improves reconstruction performance.

Description

Video conference abnormal reconstruction method, computer device and storage medium
Technical Field
The invention relates to the technical field of video abnormal reconstruction, in particular to a video conference abnormal reconstruction method, a computer device and a storage medium.
Background
Video reconstruction technology aims to recover high-quality video signals from degraded video signals or observations, and is widely applied in industries such as communication and video conferencing. In recent years, image and video reconstruction algorithms based on compressed sensing have developed rapidly; however, the reconstruction quality of existing algorithms still struggles to meet application requirements. Fully mining the prior information of images and videos, and designing reconstruction algorithms with better subjective and objective performance, remain problems to be solved in the field of image and video compressed sensing.
The ultimate goal of both existing video compressed sensing technology and super-resolution technology is to recover high-dimensional signals from low-dimensional signals, but their usage scenarios differ. Video compressed sensing adopts, at the encoding end, a sampling mode different from traditional framing in order to cope with the limited resources there, and then recovers the original signal from the measurements at the decoding end; video super-resolution does not require changing the encoding end, and is mainly used for super-resolution reconstruction of existing low-resolution video to recover video of higher resolution. These two common video abnormal-reconstruction methods currently have the following problems: (1) existing video reconstruction algorithms run fast and reconstruct well at low sampling rates, but have poor interpretability and difficulty recovering high-frequency information; neural network realizations in the field of video compressed sensing remain under-researched and model temporal correlation poorly; (2) adopting a fully convolutional measurement network in video reconstruction preserves the structural information of scene images and can effectively eliminate blocking artifacts, but the algorithm complexity is high and the reconstruction time cost is excessive. Reconstruction of image and video signals has thus been realized to a certain extent, but it is not yet satisfactory and struggles to meet the application requirements of various scenarios.
Disclosure of Invention
The invention aims to provide a video conference abnormal reconstruction method, a computer device and a storage medium, which are used to solve the technical problems in the prior art of poor interpretability, difficulty in recovering high-frequency information, and difficulty in meeting the application requirements of various scenarios.
In order to solve the technical problems, the invention specifically provides the following technical scheme:
In a first aspect of the present invention, there is provided a video conference abnormal reconstruction method comprising the steps of:
Acquiring continuous online video conference images as video image training data, dividing the video image training data into a plurality of image groups according to a video sequence, and acquiring key frame data and non-key frame data of the image groups through compressed sensing sampling;
Obtaining key frame target blocks from the key frame data by an image reconstruction algorithm, obtaining non-key frame sampling values from the non-key frame data by adaptive sampling, associating each key frame target block with its corresponding non-key frame sampling values in the time domain, constructing a mapping network between the key frame target blocks and the corresponding non-key frame sampling values through a deep learning prediction network, and obtaining global spatial-domain data of the video conference;
And extracting similar blocks from several frames before and after the spatial-domain data to form a low-rank tensor of image information, performing video compression on the video conference images according to the low-rank tensor to obtain tensor approximation data, and reconstructing the video frames of the video conference images according to the tensor approximation data.
As a preferred aspect of the present invention, acquiring the key frame data and the non-key frame data of the image group through compressed sensing sampling includes:
Taking the first frame of the image group, in time order, as a key frame; taking the key frame nearest to the current frame as a reference frame; and optimizing the reference frame through an inter-frame sparse network to obtain similar features of the key frame;
optimizing the similar features with a loss function L through a convolutional network, the aim being to optimize the whole video conference picture, where the loss function L is expressed as:

L = Σ_{t=1}^{T} ‖x_t − (α₁ x̄_{t−1} + α₂ x̄_{t+1})‖₂²

where x_t represents the current frame data, x̄_{t−1} represents the motion-compensated frame of the previous video frame adjacent to the current frame, x̄_{t+1} represents the motion-compensated frame of the subsequent video frame adjacent to the current frame, α₁ and α₂ represent the coefficients of the motion-compensated frames, T represents the total length of the time series, and t represents the current time;
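As an illustrative numeric sketch of the loss above (the function name, the toy frames, and the use of plain NumPy in place of the convolutional network are our own assumptions, not part of the claimed method):

```python
import numpy as np

def motion_compensated_loss(frames, mc_prev, mc_next, a1=0.5, a2=0.5):
    """Sum over the sequence of the squared error between each current
    frame and the weighted combination of its two motion-compensated
    neighbours (a1, a2 play the role of the compensation coefficients)."""
    loss = 0.0
    for x_t, xp, xn in zip(frames, mc_prev, mc_next):
        loss += np.sum((x_t - (a1 * xp + a2 * xn)) ** 2)
    return loss

# Toy 2x2 "frames": when both compensated neighbours equal the current
# frame, the loss is exactly zero.
f = [np.ones((2, 2))] * 3
print(motion_compensated_loss(f, f, f))  # 0.0
```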
acquiring single-frame information of a key frame in the image group, and taking the single-frame information as key frame data;
And searching, with a residual network, all overlapping blocks within a window of the corresponding reference frame for the single-frame information, taking them as non-target data, and predicting the target blocks in the non-key frames from the non-target data through linear weighting, expressed as:

p̂_i = Σ_{j=1}^{K} w_{i,j} h_{i,j}

where p̂_i represents the prediction result for the i-th non-target data of the current reference frame, w_{i,j} represents the linear weighting weights, h_{i,j} represents the non-target data of the j-th overlapping block corresponding to the i-th search window, and K represents the number of overlapping blocks; and taking the data in the target blocks as the non-key frame data.
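The multi-hypothesis linear-weighting prediction can be sketched as follows (array shapes and names are our own illustrative choices):

```python
import numpy as np

def multihypothesis_predict(hypotheses, weights):
    """Predict a non-key-frame target block as the linear weighted sum
    of the overlapping candidate blocks drawn from the search window.
    hypotheses: (K, B, B) stack of overlapping blocks; weights: (K,)."""
    return np.tensordot(weights, hypotheses, axes=1)

# Two 2x2 hypothesis blocks with equal weights: the prediction is
# their element-wise mean.
h = np.stack([np.zeros((2, 2)), np.ones((2, 2))])
p = multihypothesis_predict(h, np.array([0.5, 0.5]))
print(p)  # every entry is 0.5
```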
As a preferred solution of the present invention, obtaining key frame target blocks from the key frame data by an image reconstruction algorithm includes: dividing the key frame data, in time order, into n image blocks x_i of size B×B, where an image block x_i is expressed as x_i = R_i x, with R_i representing the block operator and x representing the image corresponding to the key frame data; linearly combining each image block x_i with the similar blocks in its corresponding reference frame to obtain a weighted residual, and establishing an image reconstruction function, expressed as:

x̂ = argmin_x ‖y − Φx‖₂² + λ Σ_{i=1}^{n} ‖D_i (R_i x − H_i ω_i)‖₂²

where λ represents the regularization coefficient, y represents the low-dimensional measurement after adaptive sampling, Φ represents the compressed sensing sampling matrix, ω_i represents the linear weights, H_i ω_i represents the linear combination of the corresponding target block x_i with its similar blocks, and D_i represents the diagonal matrix of the corresponding image; and acquiring the corresponding key frame target blocks through the image reconstruction function.
As a preferred solution of the present invention, obtaining a non-key frame sampling value by adaptive sampling for the non-key frame data, and associating the non-key frame sampling value corresponding to the key frame target block through a time domain, including:
extracting all overlapping blocks as hypothesized target blocks by dividing a search window within the key frame target blocks, and arranging the hypothesized target blocks as an input frame of height H and width W;
inputting the input frame into a convolution layer, whose convolution operation samples each of the n non-overlapping blocks of size B×B separately; the measurements sampled from each block are stacked together to output the non-key frame sampling values;
And associating the non-key frame sampling values with the corresponding key frame target blocks according to the time sequence to obtain the output tensor of the non-key frame.
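Block-wise compressed sensing sampling with a shared measurement operator can be sketched as follows (a dense matrix multiply stands in for the sampling convolution layer; the function and variable names are ours):

```python
import numpy as np

def block_cs_sample(frame, phi):
    """Split a frame into non-overlapping BxB blocks and sample each
    with the same measurement matrix phi of shape (n_M, B*B); the
    per-block measurements are stacked into one output array."""
    B = int(np.sqrt(phi.shape[1]))
    H, W = frame.shape
    meas = []
    for i in range(0, H, B):
        for j in range(0, W, B):
            block = frame[i:i + B, j:j + B].reshape(-1)
            meas.append(phi @ block)
    return np.stack(meas)  # shape: (num_blocks, n_M)

rng = np.random.default_rng(0)
frame = rng.standard_normal((8, 8))
phi = rng.standard_normal((4, 16))       # 4 measurements per 4x4 block
print(block_cs_sample(frame, phi).shape)  # (4, 4): 4 blocks, 4 measurements each
```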
As a preferred scheme of the invention, constructing a mapping network between the key frame target blocks and the non-key frame sampling values through a deep learning prediction network, and obtaining global spatial-domain data of the video conference, comprises the following steps:
Taking the output tensor as the input of a deep learning prediction network, and establishing a mapping network from a measurement domain to a high-dimensional pixel domain;
the multi-hypothesis prediction module with increasing linear weights is updated through continuous iteration, and the network mapping parameters between the key frame target blocks and the non-key frame sampling values are trained according to the time-sequence correspondence;
and in each training stage of the network, adopting the mean square error as the training loss function to acquire the reconstructed key frame data of the deep learning prediction network, where the training loss function is expressed as:

L(θ) = (1/B) Σ_{i=1}^{B} ‖x̂_i − x_i‖₂²

where B represents the batch size of the key frame target blocks, x̂_i represents the reconstructed key frame corresponding to the i-th input frame, θ represents the corresponding network parameters, x̂_i = f(·; θ) is the output of the network, and x_i represents the image block corresponding to the i-th input frame.

As a preferred scheme of the invention, carrying out a full-connection-layer convolution operation on the reconstructed key frame data in units of image blocks to obtain the global spatial-domain data of the video conference comprises the following steps:
Convolving the multi-hypothesis prediction module through Conv_phi to obtain a measured value, and subtracting the measured value of the multi-hypothesis prediction from the measured value of the current frame to obtain a residual measured value tensor;
And expanding the channel dimension of the residual measurement through the convolution layer Conv_e to obtain a multi-dimensional tensor, and carrying out a full-connection-layer convolution operation on the tensor in units of image blocks to obtain the global spatial-domain data of the video conference.
As a preferred scheme of the present invention, extracting similar blocks from several frames before and after the spatial-domain data to form a low-rank tensor of image information, and performing video compression on the video conference images according to the low-rank tensor to obtain tensor approximation data, includes:
dividing the spatial-domain data into a plurality of b×b data blocks, extracting several similar blocks from the data blocks, and combining the similar blocks, in time order, into a third-order tensor to obtain the corresponding low-rank tensor T;
for the low-rank tensor T, constructing a three-dimensional coefficient array C from the low-rank tensors in an overlapping manner, and extracting the similar-block tensors to obtain tensor-combined reconstructed images;
combining the extracted tensors using a non-local strategy to form a high-dimensional tensor, matching the high-dimensional tensor to the spatial-domain data in time order, and reconstructing the video images by introducing temporal and spatial similarity to obtain the tensor approximation data.
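The grouping of similar blocks into a third-order tensor and its low-rank approximation can be sketched as follows (the fixed-position block search and the mode-1 truncated SVD are simplifying assumptions of ours; the patent's tensor approximation is richer):

```python
import numpy as np

def stack_similar_blocks(frames, top_left, size, k):
    """Collect the block at the same position from k consecutive frames
    and stack them in temporal order into a third-order tensor (a crude
    stand-in for similar-block search, assuming negligible motion)."""
    i, j = top_left
    return np.stack([f[i:i + size, j:j + size] for f in frames[:k]])

def lowrank_approx(tensor, rank):
    """Low-rank approximation of the temporal mode via truncated SVD
    of the mode-1 unfolding."""
    k = tensor.shape[0]
    unf = tensor.reshape(k, -1)
    u, s, vt = np.linalg.svd(unf, full_matrices=False)
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank]
    return approx.reshape(tensor.shape)

# Constant toy frames give a rank-1 temporal structure, so a rank-1
# approximation reproduces the stacked tensor exactly.
frames = [np.full((6, 6), t, dtype=float) for t in range(4)]
T = stack_similar_blocks(frames, (1, 1), 3, 4)
A = lowrank_approx(T, 1)
print(A.shape)  # (4, 3, 3)
```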
As a preferred embodiment of the present invention, the reconstructing of the video frame of the video conference image according to the tensor approximation data includes:
Each frame of the tensor approximation data is regarded as an independent image, each frame of image is respectively reconstructed, and a high-quality initial reconstructed video is obtained through image reconstruction of each frame of image;
extracting similar blocks by selecting a target frame and several frames before and after it in the initial reconstructed video, and, for a given target block in the target frame, searching similar blocks from the preceding and following frames to form tensors;
And performing block matching on the tensors in time and space, and realizing effective reconstruction of the video signal through tensor combination.
In a second aspect of the invention, a computer apparatus is provided,
Comprising the following steps: at least one processor; and a memory communicatively coupled to the at least one processor;
Wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to cause the computer apparatus to perform the video conference abnormal reconstruction method described above.
In a third aspect of the present invention, a computer-readable storage medium is provided,
The computer-readable storage medium stores computer-executable instructions that, when executed by a processor, implement the video conference abnormal reconstruction method described above.
Compared with the prior art, the invention has the following beneficial effects:
Based on tensor approximation and spatio-temporal correlation, the invention reconstructs each frame of the video sequence as an independent image, fully exploiting the spatio-temporal correlation of the video sequence. Tensors are constructed by searching similar blocks in the frames before and after the frame containing the target block; the tensors preserve spatial information, and on top of image reconstruction the spatio-temporal correlation preserves image information, so effective reconstruction of the video signal can be realized and the reconstruction performance effectively improved.
A convolution layer serves as the measurement matrix, and various priors are fused into the deep network. The reconstruction algorithm based on non-local similarity, low rank and group sparsity removes image blocking artifacts more effectively, improving image reconstruction performance; establishing an end-to-end deep compressed sensing reconstruction network effectively improves image reconstruction quality and achieves excellent image reconstruction performance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It will be apparent to those of ordinary skill in the art that the drawings in the following description are exemplary only and that other implementations can be obtained from the extensions of the drawings provided without inventive effort.
Fig. 1 is a flowchart of a video conference abnormal reconstruction method according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, the present invention provides a video conference abnormal reconstruction method, which includes the following steps:
Acquiring continuous online video conference images as video image training data, dividing the video image training data into a plurality of image groups according to a video sequence, and acquiring key frame data and non-key frame data of the image groups through compressed sensing sampling;
in this embodiment, the original signals of the key frame and the current non-key frame are reconstructed through the video image training data, so that the intrinsic characteristics of the signals can be more fully utilized, and a better reconstruction result can be obtained.
Obtaining key frame target blocks from the key frame data by an image reconstruction algorithm, obtaining non-key frame sampling values from the non-key frame data by adaptive sampling, associating each key frame target block with its corresponding non-key frame sampling values in the time domain, constructing a mapping network between the key frame target blocks and the corresponding non-key frame sampling values through a deep learning prediction network, and obtaining global spatial-domain data of the video conference;
In this embodiment, all overlapping blocks within the key frame search window are used as hypothesis atoms, and a linear weighted sum is constructed to predict the target blocks in the non-key frames; the measurements are projected into a higher-dimensional space, similarity is measured by vector dot products, the network mapping is realized, and the global spatial-domain data of the video conference is obtained.
And extracting similar blocks from several frames before and after the spatial-domain data to form a low-rank tensor of image information, performing video compression on the video conference images according to the low-rank tensor to obtain tensor approximation data, and reconstructing the video frames of the video conference images according to the tensor approximation data.
In this embodiment, the low-rank attribute of the image signal is utilized, the image reconstruction model is built by taking the tensor as a basic unit, then, the tensor is respectively formed from the space and the time by combining the space-time correlation of the video signal, and the two-stage video signal reconstruction model is built, so as to realize the end-to-end depth image reconstruction network.
Obtaining the key frame data and the non-key frame data from the image group through compressed sensing sampling, including:
Taking a first frame of the image group as a key frame through a time sequence, taking the key frame nearest to the current frame as a reference frame, optimizing the reference frame through an inter-frame sparse network, and obtaining similar characteristics of the key frame;
in this embodiment, an inter-frame sparse network is employed to generate a mean square error between the predicted frame and the actual future frame as a prediction loss, and the model predicted future frame and the actual future frame are made similar in pixel by minimizing the mean square error loss.
Optimizing the similar features with a loss function L through a convolutional network, the aim being to optimize the whole video conference picture, where the loss function L is expressed as:

L = Σ_{t=1}^{T} ‖x_t − (α₁ x̄_{t−1} + α₂ x̄_{t+1})‖₂²

where x_t represents the current frame data, x̄_{t−1} represents the motion-compensated frame of the previous video frame adjacent to the current frame, x̄_{t+1} represents the motion-compensated frame of the subsequent video frame adjacent to the current frame, α₁ and α₂ represent the coefficients of the motion-compensated frames, T represents the total length of the time series, and t represents the current time;
In this embodiment, in a fast-motion video sequence, when the temporal distance between the key frame and the current frame is long, it is difficult for the current frame to capture useful information in the key frame; therefore, in video reconstruction, the frames adjacent to the current frame are used as reference frames and directly input into the convolutional network to obtain two motion-compensated frames, and the loss function L is used to optimize the convolutional network so that the existing high-quality reconstructed frames are utilized as much as possible.
Acquiring single-frame information of a key frame in the image group, and taking the single-frame information as key frame data;
And searching, with a residual network, all overlapping blocks within a window of the corresponding reference frame for the single-frame information, taking them as non-target data, and predicting the target blocks in the non-key frames from the non-target data through linear weighting, expressed as:

p̂_i = Σ_{j=1}^{K} w_{i,j} h_{i,j}

where p̂_i represents the prediction result for the i-th non-target data of the current reference frame, w_{i,j} represents the linear weighting weights, h_{i,j} represents the non-target data of the j-th overlapping block corresponding to the i-th search window, and K represents the number of overlapping blocks;
And taking the data in the target blocks as the non-key frame data. In this embodiment, hypothesized target blocks are extracted from the search window of the reconstructed key frame and the target block is predicted by a weighted sum; the resulting measurements are projected into a higher-dimensional space, and through learning from a large number of samples the relative distance between a hypothesized target block and the target block can be better described, so that measuring the similarity of target blocks in the non-key frames is easily realized in the deep neural network.
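The dot-product similarity in the projected measurement domain can be sketched as follows (the softmax normalization of the scores into prediction weights is our own assumption):

```python
import numpy as np

def measurement_similarity(m_target, m_hyp):
    """Score each hypothesized block against the target block by the
    dot product of their (projected) measurement vectors; a softmax
    turns the scores into linear prediction weights."""
    scores = m_hyp @ m_target
    e = np.exp(scores - scores.max())
    return e / e.sum()

m_t = np.array([1.0, 0.0])
m_h = np.array([[1.0, 0.0],   # identical to the target block
                [0.0, 1.0]])  # orthogonal to the target block
w = measurement_similarity(m_t, m_h)
print(w[0] > w[1])  # True: the matching block gets the larger weight
```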
Obtaining a key frame target block from the key frame data by adopting an image reconstruction algorithm, wherein the method comprises the following steps:
Dividing the key frame data, in time order, into n image blocks x_i of size B×B, where an image block x_i is expressed as x_i = R_i x, with R_i representing the block operator and x representing the image corresponding to the key frame data; linearly combining each image block x_i with the similar blocks in its corresponding reference frame to obtain a weighted residual, and establishing the image reconstruction function, expressed as:

x̂ = argmin_x ‖y − Φx‖₂² + λ Σ_{i=1}^{n} ‖D_i (R_i x − H_i ω_i)‖₂²

where λ represents the regularization coefficient, y represents the low-dimensional measurement after adaptive sampling, Φ represents the compressed sensing sampling matrix, ω_i represents the linear weights, H_i ω_i represents the linear combination of the corresponding target block x_i with its similar blocks, and D_i represents the diagonal matrix of the corresponding image;
In this embodiment, the image reconstruction function takes the residual error of the linear combination of the image block and the similar block as a part of the objective function, and then weights the different residual values in a weighted manner, and meanwhile, the sparsity of the image block is combined into the reconstruction model by considering the sparsity of the image block, so that the reconstruction performance of the model is further improved.
And acquiring a corresponding key frame target block through the image reconstruction function.
In this embodiment, the image reconstruction function combines image sparsity with residual sparsity based on non-local similarity, builds the model with linear weights, and fully exploits the sparsity and intrinsic structural information of the signal; the reconstruction of the signal is then realized by solving the model, which has the advantages of the fast running speed and low time cost of deep learning and greatly improves the image reconstruction performance.

Obtaining the non-key frame sampling values by adaptive sampling of the non-key frame data, and associating the key frame target blocks with the corresponding non-key frame sampling values in the time domain, includes:
extracting all overlapping blocks as hypothesized target blocks by dividing a search window within the key frame target blocks, and arranging the hypothesized target blocks as an input frame of height H and width W;
inputting the input frame into a convolution layer, whose convolution operation samples each of the n non-overlapping blocks of size B×B separately; the measurements sampled from each block are stacked together to output the non-key frame sampling values;
And associating the non-key frame sampling values with the corresponding key frame target blocks according to the time sequence to obtain the output tensor of the non-key frame.
In this embodiment, a search window is divided from the key frame target blocks through a deep neural network; a measurement is obtained by convolving the hypothesized target blocks through Conv_phi, and the multi-hypothesis prediction's measurement is subtracted from the measurement of the current frame to obtain a residual measurement tensor. The channel dimension of the residual measurement is expanded by the convolution layer Conv_e and reshaped into a tensor, which can be regarded as an initial reconstruction of the original signal; further fine-tuning is performed through 8 convolution layers until a residual reconstruction result is output, and the final reconstruction is obtained by adding the residual reconstruction result to the multi-hypothesis prediction.
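The residual-measurement pipeline of this embodiment can be sketched as follows (here `decoder` is a stand-in for the Conv_e expansion and the 8 refinement convolution layers, and a pseudo-inverse is used purely for illustration; names are ours):

```python
import numpy as np

def residual_reconstruct(y, prediction, phi, decoder):
    """Measure the multi-hypothesis prediction with phi, form the
    residual measurement, decode it back to the pixel domain, and add
    the decoded residual onto the prediction."""
    y_pred = phi @ prediction.reshape(-1)
    r = y - y_pred                      # residual measurement
    return prediction + decoder(r).reshape(prediction.shape)

rng = np.random.default_rng(1)
phi = rng.standard_normal((8, 16))
truth = rng.standard_normal((4, 4))
y = phi @ truth.reshape(-1)
# With a pseudo-inverse "decoder" and a zero prediction, the output is
# a least-squares reconstruction from the measurements.
out = residual_reconstruct(y, np.zeros((4, 4)), phi,
                           lambda r: np.linalg.pinv(phi) @ r)
print(out.shape)  # (4, 4)
```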
Constructing a mapping network between the key frame target blocks and the non-key frame sampling values through a deep learning prediction network to obtain the global spatial-domain data of the video conference comprises the following steps:
Taking the output tensor as the input of a deep learning prediction network, and establishing a mapping network from a measurement domain to a high-dimensional pixel domain;
the multi-hypothesis prediction module with increasing linear weights is updated through continuous iteration, and the network mapping parameters between the key frame target blocks and the non-key frame sampling values are trained according to the time-sequence correspondence;
and in each training stage of the network, adopting the mean square error as the training loss function to acquire the reconstructed key frame data of the deep learning prediction network, where the training loss function is expressed as:

L(θ) = (1/B) Σ_{i=1}^{B} ‖x̂_i − x_i‖₂²

where B represents the batch size of the key frame target blocks, x̂_i represents the reconstructed key frame corresponding to the i-th input frame, θ represents the corresponding network parameters, x̂_i = f(·; θ) is the output of the network, and x_i represents the image block corresponding to the i-th input frame.
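The batch mean-square-error training loss can be sketched as follows (function and array names are ours):

```python
import numpy as np

def batch_mse_loss(recon, target):
    """Mean over the batch of the squared L2 error between the
    reconstructed key frames and the ground-truth image blocks."""
    B = recon.shape[0]
    return float(np.sum((recon - target) ** 2) / B)

# Batch of two 4x4 blocks: each block contributes 16 unit errors.
x_hat = np.zeros((2, 4, 4))
x = np.ones((2, 4, 4))
print(batch_mse_loss(x_hat, x))  # 16.0
```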
In this embodiment, a sub-network comprising the convolution layers Conv_phi, Conv_p1, Conv_p2 and Conv_p3 is pre-trained in the deep learning prediction network, with non-key frames as the input and output of the network; the pre-training mainly helps the key frame data establish a mapping from the measurement domain to the high-dimensional pixel domain. A multi-hypothesis prediction module is added on top of this sub-network, with the multi-hypothesis prediction result as the output, so that carrying over the parameters of the pre-trained network improves the accuracy of the multi-hypothesis prediction result.

Carrying out a full-connection-layer convolution operation on the reconstructed key frame data in units of image blocks to obtain the global spatial-domain data of the video conference comprises the following steps:
Convolving the multi-hypothesis prediction module through Conv_phi to obtain a measured value, and subtracting the measured value of the multi-hypothesis prediction from the measured value of the current frame to obtain a residual measured value tensor;
And expanding the channel dimension of the residual measurement through the convolution layer Conv_e to obtain a multi-dimensional tensor, and carrying out a full-connection-layer convolution operation on the tensor in units of image blocks to obtain the global spatial-domain data of the video conference.
In this embodiment, the parameters in the deep learning prediction network are initialized from a zero-mean normal distribution with a fixed standard deviation: the standard deviations of the convolution layers Conv_e, Conv_p1, Conv_p2 and Conv_p3 are set to 0.01, those of the other convolution layers to 0.1, and the biases are initialized to 0. The pre-training result of each stage is used as the initialization of the next stage; training ends when the training loss no longer decreases, and the global spatial-domain data of the video conference is obtained.
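The initialization scheme above can be sketched as follows (the weight shapes and the `init_conv` helper are hypothetical; only the zero mean, the 0.01 / 0.1 standard deviations and the zero biases come from the text):

```python
import numpy as np

def init_conv(shape, std, rng):
    """Zero-mean normal weight initialization with a fixed standard
    deviation; biases start at 0, as described for the network."""
    w = rng.normal(0.0, std, size=shape)
    b = np.zeros(shape[0])
    return w, b

rng = np.random.default_rng(0)
w_e, b_e = init_conv((32, 3, 3), std=0.01, rng=rng)  # Conv_e-style layer
w_o, b_o = init_conv((32, 3, 3), std=0.1, rng=rng)   # any other layer
print(b_e.sum(), w_e.std() < w_o.std())
```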
Extracting similar blocks from several frames before and after the spatial-domain data to form a low-rank tensor of image information, and performing video compression on the video conference images according to the low-rank tensor to obtain tensor approximation data, comprises:
dividing the spatial-domain data into a plurality of b×b data blocks, extracting several similar blocks from the data blocks, and combining the similar blocks, in time order, into a third-order tensor to obtain the corresponding low-rank tensor T;

for the low-rank tensor T, constructing a three-dimensional coefficient array C from the low-rank tensors in an overlapping manner, and extracting the similar-block tensors to obtain tensor-combined reconstructed images;

combining the extracted tensors using a non-local strategy to form a high-dimensional tensor, matching the high-dimensional tensor to the spatial-domain data in time order, and reconstructing the video images by introducing temporal and spatial similarity to obtain the tensor approximation data.
In the present embodiment, a plurality of similar image blocks are extracted within a certain search area from data blocks of a fixed size, and the similar image blocks are combined into a third-order tensor to obtain the corresponding low-rank tensor. High-order singular value decomposition (HOSVD) is then performed on the low-rank tensor to obtain a low-rank tensor approximate solution. The approximation is obtained by setting the trailing entries of each mode of the core tensor to zero; that is, the approximation process can be achieved by forcing the core tensor to be sparse.
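The HOSVD approximation described here (factor matrices from each mode unfolding, then zeroing the trailing entries of the core tensor) can be sketched as follows; the tensor sizes and the per-mode ranks are illustrative assumptions, and the truncation is written explicitly as zeroing trailing core slices to mirror the text:

```python
import numpy as np

def unfold(t, mode):
    """Mode-m unfolding: move the mode to the front and flatten the rest."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def mode_mult(t, mat, mode):
    """Mode-m product of tensor t with matrix mat (shape J x I_m)."""
    return np.moveaxis(np.tensordot(mat, np.moveaxis(t, mode, 0), axes=1), 0, mode)

def hosvd_approx(t, ranks):
    """Low-rank HOSVD approximation of a third-order tensor: compute the
    core tensor, zero its trailing entries along every mode (forcing the
    core to be sparse), and map back with the factor matrices."""
    factors = [np.linalg.svd(unfold(t, m), full_matrices=False)[0]
               for m in range(t.ndim)]
    core = t
    for m, u in enumerate(factors):
        core = mode_mult(core, u.T, m)      # project onto factor bases
    core[ranks[0]:, :, :] = 0.0             # zero trailing mode-1 slices
    core[:, ranks[1]:, :] = 0.0             # zero trailing mode-2 slices
    core[:, :, ranks[2]:] = 0.0             # zero trailing mode-3 slices
    approx = core
    for m, u in enumerate(factors):
        approx = mode_mult(approx, u, m)    # map back to the original space
    return approx

# A rank-(1,1,1) tensor is recovered exactly by the truncated HOSVD.
rng = np.random.default_rng(0)
a, b, c = rng.random(4), rng.random(5), rng.random(6)
t = np.einsum("i,j,k->ijk", a, b, c)
approx = hosvd_approx(t, (1, 1, 1))
assert np.allclose(approx, t)
```

For a similar-block tensor whose slices are nearly identical, the core energy concentrates in the leading entries, so zeroing the trailing entries discards mostly noise, which is the compression effect the embodiment relies on.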
Reconstructing video frames of the video conference image according to the tensor approximation data, including:
Each frame of the tensor approximation data is treated as an independent image and reconstructed separately, and a high-quality initial reconstructed video is obtained through per-frame image reconstruction;
similar blocks are extracted by selecting a target frame and several frames before and after it in the initial reconstructed video; for a given target block in the target frame, similar blocks are searched in the preceding and following frames to form tensors;
block matching is performed on the tensors in time and space, and effective reconstruction of the video signal is achieved through tensor combination.
In this embodiment, in the first stage of video reconstruction, each frame of the input video is treated as an independent image and reconstructed separately to achieve the initial spatial recovery of the video; a high-quality initial reconstructed video is obtained through per-frame image reconstruction. In the second stage, the temporal correlation between frames is considered: similar blocks are extracted by selecting a target frame and several frames before and after it, and for a given target block in the target frame, similar blocks are searched in the preceding and following frames to form tensors. Low-rank tensor approximation is then achieved using HOSVD, further improving the reconstruction quality.
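The second-stage search for similar blocks across neighbouring frames can be illustrated with a minimal sketch; the frame span, block size, search radius, number of kept blocks, and the sum-of-squared-differences similarity metric are all assumptions for illustration, not choices fixed by the text:

```python
import numpy as np

def similar_block_tensor(frames, t, i, j, bs=8, radius=4, k=6, span=2):
    """For the target block at (i, j) in frame t, search the `span` frames
    before and after (and frame t itself) within a +/- `radius` window,
    rank candidate blocks by sum of squared differences, and stack the
    k most similar blocks into a bs x bs x k third-order tensor."""
    target = frames[t, i:i + bs, j:j + bs]
    scored = []
    for f in range(max(0, t - span), min(frames.shape[0], t + span + 1)):
        for di in range(-radius, radius + 1):
            for dj in range(-radius, radius + 1):
                ii, jj = i + di, j + dj
                if 0 <= ii <= frames.shape[1] - bs and 0 <= jj <= frames.shape[2] - bs:
                    block = frames[f, ii:ii + bs, jj:jj + bs]
                    scored.append((float(np.sum((block - target) ** 2)), block))
    scored.sort(key=lambda s: s[0])            # most similar first
    return np.stack([b for _, b in scored[:k]], axis=2)

rng = np.random.default_rng(1)
video = rng.random((5, 32, 32))                # 5 frames of 32x32 pixels
tensor = similar_block_tensor(video, t=2, i=10, j=10)
```

The resulting bs x bs x k tensor is exactly the kind of similar-block tensor to which the embodiment applies the HOSVD low-rank approximation before tensor combination.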
In a second embodiment, a computer apparatus is provided,
Comprising the following steps: at least one processor; and a memory communicatively coupled to the at least one processor;
Wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor so that the at least one processor can perform the video conference abnormal reconstruction method described above.
In a third embodiment, a computer-readable storage medium is provided,
The computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the video conference abnormal reconstruction method described above.
In this embodiment, the real-time operation states of the hardware, system, virtual machines, containers and processes are monitored in real time by the cloud computing platform, so that the image quality and fluency of abnormal-image reconstruction can be monitored at any time according to the video conference requirements, and cloud instructions can be issued in real time for adjustment, ensuring that the operation state of the whole platform is known and controllable.
In this embodiment, computing and storage are separated in the form of cloud clusters; the operating system is booted from a dedicated storage cluster, and video management can be realized quickly and efficiently based on an internal distribution mechanism, thereby reducing deployment cost, supporting fast reset, and achieving efficient utilization of resources.
The invention reconstructs each frame of the video sequence as an independent image based on tensor approximation and spatio-temporal correlation, fully exploiting the spatio-temporal correlation of the video sequence. Tensors are constructed by searching for similar blocks in the frames before and after the frame containing the target block; the tensors preserve spatial information, and the spatio-temporal correlation preserves image information on top of the image reconstruction, enabling effective reconstruction of the video signal and effectively improving reconstruction performance.
The convolution layer serves as the measurement matrix, multiple priors are fused into the deep network, and a reconstruction algorithm exploiting non-local similarity, low rank and group sparsity removes image blocking artifacts more effectively, improving image reconstruction performance. By building an end-to-end deep compressed-sensing reconstruction network, an effective improvement in image reconstruction quality is achieved and excellent image reconstruction performance is obtained.
The above embodiments are only exemplary embodiments of the present application and are not intended to limit it; the scope of the application is defined by the claims. Those skilled in the art may make various modifications and equivalent arrangements to the application, and such modifications and equivalents are also intended to fall within the spirit and scope of the application.

Claims (8)

1. A video conference abnormal reconstruction method, characterized by comprising the following steps:
Acquiring continuous online video conference images as video image training data, dividing the video image training data into a plurality of image groups according to a video sequence, and acquiring key frame data and non-key frame data of the image groups through compressed sensing sampling;
Obtaining a key frame target block from the key frame data by an image reconstruction algorithm, obtaining non-key frame sampling values from the non-key frame data by adaptive sampling, associating the key frame target block with the corresponding non-key frame sampling values in the time domain, constructing a mapping network between the key frame target block and the corresponding non-key frame sampling values through a deep learning prediction network, and obtaining global spatial-domain data of the video conference;
Extracting similar blocks of several frames before and after from the spatial-domain data to form a low-rank tensor of image information, performing video compression on the video conference image according to the low-rank tensor to obtain tensor approximation data, and reconstructing video frames of the video conference image according to the tensor approximation data;
obtaining the key frame data and the non-key frame data from the image group through compressed sensing sampling, including:
Taking the first frame of the image group, in time order, as a key frame, taking the key frame nearest to the current frame as a reference frame, optimizing the reference frame through an inter-frame sparse network, and obtaining similar features of the key frame;
optimizing the similar features with a loss function through a convolution network, the aim being to optimize the whole video conference picture, wherein the loss function is computed, over a time series of a given total length at the current time, from the current frame data, the motion-compensated frame of the previous video frame adjacent to the current frame, the motion-compensated frame of the subsequent video frame adjacent to the current frame, and the coefficients of the motion-compensated frames;
acquiring single-frame information of a key frame in the image group, and taking the single-frame information as key frame data;
searching, with a residual network, all overlapping blocks within a window of the corresponding reference frame for the single frame information to serve as non-target data, and predicting the target block in a non-key frame from the non-target data by linear weighting, wherein the prediction result of the non-target data for the current reference frame is the linearly weighted sum, over each search window, of the non-target data of the corresponding overlapping blocks, the linear weights being determined by the overlapping-block duty cycle;
Taking the data in the target block as non-key frame data;
Obtaining a key frame target block from the key frame data by adopting an image reconstruction algorithm, wherein the method comprises the following steps:
Dividing the key frame data, in time order, into image blocks of a fixed size, each image block being obtained by applying a block operator to the image corresponding to the key frame data; linearly combining each image block with similar blocks in the corresponding reference frame to obtain a weighted residual, and establishing an image reconstruction function whose expression involves a regularization coefficient, the low-dimensional measurement after adaptive sampling, the respective linear weights, the linear combination of the corresponding target blocks, and a diagonal matrix of the corresponding image;
And acquiring a corresponding key frame target block through the image reconstruction function.
2. The method for video conference abnormal reconstruction as set forth in claim 1, wherein,
Obtaining a non-key frame sampling value by adopting self-adaptive sampling to the non-key frame data, and associating the key frame target block with the corresponding non-key frame sampling value through a time domain, wherein the method comprises the following steps:
extracting all overlapping blocks as assumed target blocks by dividing a search window within the key frame target block, and forming the assumed target blocks into an input frame of a given height and width;
inputting the input frame into a convolution layer, performing the convolution operation to sample each of the non-overlapping blocks of a given size separately, and stacking the measured values of each block sampling together to output the non-key frame sampling values;
And associating the non-key frame sampling values with the corresponding key frame target blocks according to the time sequence to obtain the output tensor of the non-key frame.
3. The method for video conference abnormal reconstruction as set forth in claim 2, wherein,
Constructing a mapping network between the key frame target blocks and the non-key frame sampling values through a deep learning prediction network to obtain the global spatial-domain data of the video conference, comprises:
Taking the output tensor as the input of a deep learning prediction network, and establishing a mapping network from a measurement domain to a high-dimensional pixel domain;
the multi-hypothesis prediction module, augmented with linear weights, is updated through continuous iteration, and the network mapping parameters between the key frame target block and the non-key frame sampling values are trained according to the time-order correspondence;
and in each training stage of the network, adopting the mean square error as the training loss function to obtain the reconstructed key frame data of the deep learning prediction network, wherein the training loss function is the mean, over the batch of key frame target blocks, of the squared error between the network output for each input frame, given the network parameters, and the image block corresponding to that input frame, the network output for each input frame being the corresponding reconstructed key frame.
4. The method for video conference abnormal reconstruction as set forth in claim 3, wherein,
Performing a full-connection-layer convolution operation on the reconstructed key frame data in units of image blocks to obtain the global spatial-domain data of the video conference, comprises:
convolving the multi-hypothesis prediction result through the convolution layer Conv_phi to obtain a measured value, and subtracting the measured value of the multi-hypothesis prediction from the measured value of the current frame to obtain a residual measurement tensor;
and expanding the channel dimension of the residual measurement through the convolution layer Conv_e to obtain a multi-dimensional tensor, and performing a full-connection-layer convolution operation on the tensor in units of image blocks to obtain the global spatial-domain data of the video conference.
5. The method for video conference abnormal reconstruction as claimed in claim 4, wherein,
Extracting a low-rank tensor of image information formed by similar blocks of several frames before and after the spatial-domain data, and performing video compression on the video conference image according to the low-rank tensor to obtain tensor approximation data, comprises:
dividing the spatial-domain data into data blocks of a fixed size, extracting a plurality of similar blocks from each data block, combining the similar blocks into a third-order tensor in time order, and obtaining the corresponding low-rank tensor;
constructing a three-dimensional coefficient array from the low-rank tensors in an overlapping manner, and extracting the similar-block tensors to obtain a tensor-combined reconstructed image;
combining the extracted tensors with a non-local strategy to form a high-dimensional tensor, aligning the high-dimensional tensor with the spatial-domain data in time order, and reconstructing the video image by exploiting temporal and spatial similarity to obtain the tensor approximation data.
6. The method for video conference abnormal reconstruction as set forth in claim 5, wherein,
Reconstructing video frames of the video conference image according to the tensor approximation data, including:
each frame of the tensor approximation data is treated as an independent image and reconstructed separately, and a high-quality initial reconstructed video is obtained through per-frame image reconstruction;
similar blocks are extracted by selecting a target frame and several frames before and after it in the initial reconstructed video; for a given target block in the target frame, similar blocks are searched in the preceding and following frames to form tensors;
and block matching is performed on the tensors in time and space, and effective reconstruction of the video signal is achieved through tensor combination.
7. A computer device, characterized in that,
Comprising the following steps: at least one processor; and a memory communicatively coupled to the at least one processor;
Wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor, whereby the method of any one of claims 1-6 is performed by the processor.
8. A computer-readable storage medium, characterized in that,
The computer readable storage medium having stored therein computer executable instructions which, when executed by a processor, implement the method of any of claims 1-6.
CN202410533157.4A 2024-04-30 2024-04-30 Video conference abnormal reconstruction method, computer device and storage medium Active CN118118620B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410533157.4A CN118118620B (en) 2024-04-30 2024-04-30 Video conference abnormal reconstruction method, computer device and storage medium


Publications (2)

Publication Number Publication Date
CN118118620A CN118118620A (en) 2024-05-31
CN118118620B (en) 2024-07-12

Family

ID=91208856

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410533157.4A Active CN118118620B (en) 2024-04-30 2024-04-30 Video conference abnormal reconstruction method, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN118118620B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541965A (en) * 2020-12-02 2021-03-23 国网重庆市电力公司电力科学研究院 Compressed sensing image and video recovery based on tensor approximation and space-time correlation
WO2021093393A1 (en) * 2019-11-13 2021-05-20 南京邮电大学 Video compressed sensing and reconstruction method and apparatus based on deep neural network




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant