CN117079108A - Cloud edge collaborative video stream analysis method and device based on depth estimation - Google Patents
- Publication number
- CN117079108A (application CN202310477632.6A)
- Authority
- CN
- China
- Prior art keywords
- cloud
- video
- edge
- reasoning
- dnn
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/94—Hardware or software architectures specially adapted for image or video understanding
- G06V10/95—Hardware or software architectures specially adapted for image or video understanding structured as a network, e.g. client-server architectures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/54—Surveillance or monitoring of activities, e.g. for recognising suspicious objects of traffic, e.g. cars on the road, trains or boats
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
- H04N19/176—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/18—Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
Abstract
The application belongs to the field of video analysis, and particularly relates to a cloud-edge collaborative video stream analysis method and device based on depth estimation, comprising the following steps: deploying a Server DNN model at the cloud and an Edge DNN model at the edge; generating a partitioning rule for video frames at the cloud based on the depth of the video frames; the acquisition end encoding different blocks at different qualities according to the cloud partitioning rule and transmitting them to the edge end; the edge end performing region cutting on the received video, transmitting the high-quality-coded blocks to the cloud server for reasoning, and simultaneously reasoning locally on the received video; the cloud reasoning over the received video and returning the reasoning result to the edge end; the edge end aggregating its own reasoning results with the cloud reasoning results as the final result, and tracking the aggregated reasoning results. The application provides a method for selecting regions of video frames according to image depth, and solves the problem that different DNN models define the ROI differently.
Description
Technical Field
The application belongs to the field of video analysis, and particularly relates to a cloud edge collaborative video stream analysis method and device based on depth estimation.
Background
With the popularization of deep neural networks and cameras, video stream reasoning has wide application scenarios. In these scenarios (e.g., urban traffic analysis and security anomaly detection), cameras continuously collect video and stream it to a remote server; upon receipt, the server runs a deep neural network (DNN) model to analyze it and returns the analysis results to the cameras. This is the prototype of the video stream analysis system. However, as these systems came into wider use, various problems began to appear. The first is that analysis delay is too long. To reduce it, many researchers offload only the region of interest (ROI), reducing the amount of data transmitted over the network. The focus of these studies lies in ROI selection. Since the goal of ROI selection is to reduce network transmission delay, in a video stream analysis system the ROI selection module can only run at the video source, i.e., the camera. Researchers have therefore proposed a variety of heuristic algorithms for ROI selection using the limited computational resources on the camera, and video stream analysis systems began to develop vigorously, such as Glimpse (Chen, Y. H., et al., "Glimpse: Continuous, Real-Time Object Recognition on Mobile Devices," Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, ACM, 2015) and Reducto (Li, Yuanqi, et al., "Reducto: On-Camera Filtering for Resource-Efficient Real-Time Video Analytics," Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM), ACM, 2020). These systems all achieve low latency without excessive degradation in analysis accuracy.
However, further research pointed out that only a small fraction of the cameras in use today have substantial computing resources; most cameras have little compute and can only collect and transmit video. Continuing with this approach would therefore severely limit the scenarios in which video stream analysis systems can be used.
To address the limitations of camera computing resources, researchers turned their attention to the server side, arguing that there is another way to reduce network transmission delay: changing the video coding scheme. Traditional video streaming systems compress and encode video based on user-perceived quality of experience (QoE). In a video stream analysis system, however, the server receives video only so that the DNN model can reason over it, so whether QoE degrades during transmission need not be considered. Taking the object detection task as an example, compressing or even clipping the background of a video frame does not affect the recognition accuracy of the DNN model, even though a human would perceive an obvious degradation of video quality. Researchers therefore began to trade off the latency and accuracy of video analysis by actively or passively adjusting the quality of the video transmitted over the network, and a series of video stream analysis systems centered on this idea appeared. For example, AWStream dynamically adjusts the coding quality of the next video clip to cope with bandwidth fluctuations (Zhang, B., et al., "AWStream: Adaptive Wide-Area Streaming Analytics," ACM Special Interest Group on Data Communication (SIGCOMM), ACM, 2018). DDS divides the offloading process into two passes: in the first pass the video is encoded with higher quality loss, transmitted to the server, and server feedback is received; in the second pass, according to that feedback, only part of the region is compressed losslessly and transmitted to the server for reasoning by the DNN model (Du, K., et al., "Server-Driven Video Streaming for Deep Learning Inference," SIGCOMM '20: Annual Conference of the ACM Special Interest Group on Data Communication, ACM, 2020).
These offloading modes guided by server feedback improve accuracy, but they also greatly increase inference delay and do not achieve the best trade-off between accuracy and delay.
Disclosure of Invention
The application aims to overcome the problems in the prior art and provides a cloud-edge collaborative video stream analysis method and device based on depth estimation.
The application adopts the following technical scheme: a cloud-edge collaborative video stream analysis method based on depth estimation, comprising the following steps:
deploying a Server DNN model at the cloud and deploying an Edge DNN model at the edge;
generating a partitioning rule for video frames at the cloud based on the depth of the video frames;
the acquisition end encodes different blocks at different qualities according to the cloud partitioning rule and then transmits them to the edge end;
the edge end performs region cutting on the received video, transmits the high-quality-coded blocks to the cloud server for reasoning, and simultaneously reasons locally over the received video;
the cloud reasons over the received video and returns the reasoning result to the edge end;
the edge end aggregates its own reasoning results with the cloud reasoning results as the final result, and tracks the aggregated reasoning results.
The Server DNN model is a high-complexity model and the Edge DNN model is a low-complexity model; complexity is distinguished by the number of model layers: a model with 100 or more layers is a high-complexity model, and a model with fewer than 100 layers is a low-complexity model.
Generating the partitioning rule of the video frame includes:
the camera at the acquisition end continuously captures video, encodes it at high quality, and streams it to the cloud;
the cloud decodes the received data, then performs depth estimation and DNN reasoning respectively, generates the camera's partition rule, and transmits it to the acquisition end.
Generating the partitioning rule by which different blocks are encoded at different qualities includes:
1) Dividing a video frame into a plurality of tiles, wherein the tiles are square macroblocks whose width is the greatest common factor of the width and the height of the video frame;
2) The cloud infers the decoded image with the Server DNN and the Edge DNN respectively, and finds the tiles whose inference results differ;
3) Performing depth estimation on the video frame to obtain the depth value of each pixel in the frame, and computing the average depth value of the pixels contained in each tile;
4) Averaging the depth values (from step 3)) of the tiles whose inference results differ to obtain the offloading threshold;
5) Marking each tile according to the offloading threshold from step 4) to complete the tile partition rule, and returning the partition rule to the camera in the form of a hash value as the basis for the camera's partition coding.
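The five steps above can be sketched in code. The following is a minimal illustration only; the patent does not specify an implementation, and the function names (`tile_size`, `partition_rule`) and the NumPy depth-map representation are assumptions, with the DNN difference set and the depth estimator treated as given inputs.

```python
import numpy as np
from math import gcd

def tile_size(width, height):
    """Step 1: tile width = greatest common factor of frame width and height."""
    return gcd(width, height)

def partition_rule(depth_map, diff_tiles):
    """Generate per-tile HQ/LQ labels from a per-pixel depth map.

    depth_map  : (H, W) array of estimated per-pixel depths (step 3 input).
    diff_tiles : set of (row, col) tiles where Server DNN and Edge DNN
                 results differ (step 2 output, assumed given).
    Returns {(row, col): "HQ" or "LQ"}.
    """
    h, w = depth_map.shape
    t = tile_size(w, h)
    # Step 3: average depth of the pixels inside each tile.
    avg = {(r, c): depth_map[r*t:(r+1)*t, c*t:(c+1)*t].mean()
           for r in range(h // t) for c in range(w // t)}
    # Step 4: offloading threshold = mean depth over the differing tiles.
    avd = float(np.mean([avg[rc] for rc in diff_tiles]))
    # Step 5: tiles deeper than the threshold get high-quality coding.
    return {rc: ("HQ" if d > avd else "LQ") for rc, d in avg.items()}
```

For instance, a 4×8 depth map whose left half has depth 1.0 and right half 3.0 yields a threshold of 2.0, so only the deeper right tile is marked HQ.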
The region cutting includes:
1) Decoding the received video data;
2) Reasoning on the decoded video;
3) Traversing the blocks of the video frame and covering all blocks encoded at low quality by the acquisition end with a single uniform pixel value;
4) Sending the modified video frames to the cloud for reasoning.
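A minimal sketch of the masking step above, under the same assumptions as before (NumPy frames, a hypothetical `rule` dict of per-tile labels); decoding, re-encoding, and transmission are elided:

```python
import numpy as np

def cut_regions(frame, rule, tile, fill=128):
    """Step 3 of region cutting: overwrite every tile that the acquisition
    end encoded at low quality with one uniform pixel value, so the masked
    frame compresses well before being offloaded to the cloud (step 4).

    frame : decoded (H, W) or (H, W, C) array (step 1 output).
    rule  : {(row, col): "HQ"/"LQ"} partition rule received from the cloud.
    tile  : tile side length in pixels.
    """
    out = frame.copy()
    for (r, c), label in rule.items():
        if label == "LQ":
            out[r*tile:(r+1)*tile, c*tile:(c+1)*tile] = fill
    return out
```

Filling the low-quality tiles with one constant value is what makes the masked regions cheap to encode while leaving the high-quality tiles untouched for cloud inference.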
The aggregated reasoning result is tracked by an attention-based LSTM module.
The device comprises an acquisition end, an edge end and a cloud end; the cloud end deploys a DNN model with a complex structure, the edge end deploys a DNN model with a simple structure, and the above cloud-edge collaborative video stream analysis method based on depth estimation runs among the acquisition end, the edge end and the cloud end.
Compared with the prior art, the application has the following beneficial effects:
1. The application proposes selecting the ROI of a video frame according to the depth of the image, which solves the problem that different DNN models define the ROI differently, since all models can share the same depth estimation result; the DNN model can thus be replaced in the middle of system reasoning without any additional operation.
2. The application proposes using an LSTM to dynamically predict the reasoning result from historical reasoning results as a supplement to video analysis, greatly reducing delay while preserving accuracy.
Drawings
FIG. 1 is a block diagram of the method of the present application;
FIG. 2 is a workflow diagram of cloud generation of partition rules.
Detailed Description
The present application will be described and illustrated with reference to the accompanying drawings and examples in order to make its objects, technical solutions and advantages more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. All other embodiments obtainable by a person of ordinary skill in the art from the embodiments provided herein without inventive effort fall within the scope of the application. Moreover, while such a development effort might be complex and lengthy, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by those of ordinary skill in the art that the described embodiments of the application can be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. The terms "a," "an," "the," and similar referents in the context of the application are not to be construed as limiting the quantity, but rather as singular or plural. The terms "comprising," "including," "having," and any variations thereof, are intended to cover a non-exclusive inclusion; the term "plurality" as used herein means greater than or equal to two. "and/or" describes an association relationship of an association object, meaning that there may be three relationships, e.g., "a and/or B" may mean: a exists alone, A and B exist together, and B exists alone. The terms "first," "second," "third," and the like, as used herein, are merely distinguishing between similar objects and not representing a particular ordering of objects.
In general terms, the method of the application comprises: deploying a higher-complexity DNN inference model at the cloud and a lower-complexity DNN inference model at the edge; generating a partitioning rule for video frames at the cloud, according to the depth of the video frames, by introducing a depth estimation technique; the acquisition end encoding different blocks at different qualities according to the cloud partitioning rule and transmitting them to the edge server; the edge server performing DNN reasoning on the video while streaming the high-quality-coded blocks to the cloud for DNN reasoning; and finally the edge server aggregating its own reasoning results with the cloud reasoning results as the final result, and tracking the aggregated reasoning results with an attention-based LSTM module.
In order to achieve the purpose of the application, the technical scheme adopted is summarized as follows:
a cloud edge collaborative video stream analysis method based on depth estimation comprises the following steps:
(1) A DNN model with a complex structure is deployed at the cloud, referred to herein as the Server DNN, and a DNN model with a simple structure is deployed at the edge, referred to herein as the Edge DNN.
(2) The camera at the acquisition end continuously captures video, encodes it at high quality, and streams it to the cloud.
(3) The cloud end decodes the received data, then carries out depth estimation and DNN reasoning respectively, and then generates partition rules of the camera and transmits the partition rules to the acquisition end.
The partition rule generation in step (3) comprises the following steps:
1) Dividing a video frame into a plurality of tiles, wherein the tiles are square macroblocks whose width is the greatest common factor of the width and the height of the video frame;
2) The cloud infers the decoded image with the Server DNN and the Edge DNN respectively, and finds the tiles whose inference results differ.
3) Performing depth estimation on the video frame to obtain the depth value of each pixel in the frame, and computing the average depth value of the pixels contained in each tile.
4) Averaging the depth values (from step 3)) of the tiles whose inference results differ to obtain the offloading threshold.
5) Marking each tile according to the offloading threshold from step 4) to complete the tile partition rule, and returning the partition rule to the camera in the form of a hash value as the basis for the camera's partition coding.
(4) The acquisition end encodes the captured video block by block at different qualities according to the rule generated by the cloud, and streams it to the edge server once encoding is finished.
(5) The edge end performs region cutting on the received video, transmits the high-quality-coded blocks to the cloud server for reasoning, and simultaneously reasons locally over the received video.
The region cutting in step (5) comprises the following steps:
1) The received video data is decoded.
2) The decoded video is inferred.
3) The blocks of the video frame are traversed, and all blocks encoded at low quality by the acquisition end are covered with a single uniform pixel value.
4) The modified video frames are sent to the cloud for reasoning.
(6) The cloud terminal infers the received video and returns an inference result to the edge terminal.
(7) The edge end aggregates the local reasoning result and the cloud reasoning result as the final reasoning result of the video frame, and tracks it using a classical attention-based LSTM module.
The embodiment provides an example of a cloud edge collaborative video stream analysis system based on depth estimation.
In this embodiment, as shown in fig. 1, an overall workflow of video stream analysis in a cloud-edge network environment is illustrated. The method comprises the following specific steps:
Pre-start stage: the acquisition end losslessly encodes the captured video and transmits it to the cloud; the cloud decodes the received video and passes it to the partition rule generation module, which generates a partition scheme using the partition rule generation algorithm and transmits the result to the acquisition end. As shown in fig. 2, which illustrates the workflow of partition rule generation, its specific steps are as follows:
Assume a video frame is divided into 5×8 blocks, 8 along its length and 5 along its width.
The cloud infers the decoded video frames with the Server DNN and the Edge DNN respectively, finds the blocks whose inference results differ, and records them as a block set, in which each number is the index of a block where the cloud and edge reasoning results differ.
Depth estimation is performed on the video frame to obtain the set AveDep of average depths of the divided blocks, where each element represents the average depth of one block.
The average depth AVD over the blocks in the difference set is then obtained; with B denoting the set of differing blocks, the formula is AVD = (1/|B|) · Σ_{i∈B} AveDep_i.
Using the obtained AVD as the offloading threshold, each block in AveDep is marked: blocks with depth greater than AVD are marked HQ, denoting high-quality coding, and blocks with depth less than AVD are marked LQ, denoting low-quality coding.
Partition transmission stage: the acquisition end partition-encodes the video according to the partition scheme from the first stage and streams it to the edge end. After decoding the video, the edge end calls the region cutting algorithm in the region cutting module to cut the video frame by frame, and transmits the cut result to the cloud server. The region cutting steps are as follows:
the received video data is decoded.
And (3) reasoning the decoded video to obtain a reasoning result LD.
And traversing different blocks of the video frame, covering all the blocks with the low-quality codes at the acquisition end with the same pixel value, and recoding the blocks into B.
And sending the modified video frames to the cloud for reasoning.
Real-time reasoning stage: the cloud and the edge infer the decoded video with the Server DNN and the Edge DNN respectively. The cloud reasoning result is returned to the edge end, the edge end aggregates the cloud and edge reasoning results to obtain the final reasoning result, and the attention-based LSTM module tracks the reasoning results of historical frames. The aggregation steps are as follows:
The edge end receives the reasoning result HD returned by the cloud.
For the first block, the edge compares the HD result with the LD result obtained during region cutting. If they are the same, no processing is performed; if they differ, the LD reasoning result is overwritten by the HD reasoning result.
The 40 blocks are iterated in turn according to the above step to obtain the final result JD.
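The block-wise aggregation loop can be sketched as follows; this is a minimal illustration in which the per-block result format (plain strings keyed by block number) is an assumption:

```python
def aggregate(hd, ld):
    """Merge the cloud result HD into the edge result LD block by block:
    where the two differ, the cloud (Server DNN) result overrides the
    edge result; where they agree, the edge result is kept unchanged.

    hd, ld : dicts mapping block number -> per-block inference result.
    Returns the final aggregated result JD.
    """
    jd = dict(ld)
    for block, hd_result in hd.items():
        if jd.get(block) != hd_result:   # results differ -> trust the cloud
            jd[block] = hd_result
    return jd
```

For example, aggregating `{1: "car"}` from the cloud with `{1: "truck", 2: "bus"}` from the edge replaces block 1 with the cloud's result and keeps block 2 as the edge reported it.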
Wherein the attention-based LSTM module reasoning about historical frames is as follows:
inputting the aggregation reasoning result of the previous N frames into the LSTM module。
The LSTM module outputs the processing result of the new frame。
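The patent does not give the module's internals. As a rough sketch only, the snippet below implements just a dot-product attention step over the last N frame results, with a fixed query vector standing in for the LSTM hidden state; the LSTM recurrence itself is omitted, so this is a simplified stand-in rather than the attention-based LSTM the application describes:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_next(history, query):
    """Attention over historical frame results.

    history : (N, D) array, feature vectors of the last N aggregated
              frame results (format assumed for illustration).
    query   : (D,) array standing in for the LSTM hidden state.
    Returns the attention-weighted prediction for the new frame.
    """
    scores = history @ query      # dot-product attention scores, shape (N,)
    weights = softmax(scores)     # attention distribution over the N frames
    return weights @ history      # weighted combination, shape (D,)
```

When every historical frame carries the same feature vector, the prediction reduces to that vector regardless of the attention weights.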
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations are described; however, any combination of these technical features that involves no contradiction should be considered within the scope of this description.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.
Claims (7)
1. A cloud edge collaborative video stream analysis method based on depth estimation is characterized by comprising the following steps:
deploying a Server DNN model at the cloud and deploying an Edge DNN model at the Edge;
generating a partitioning rule of the video frame based on the depth of the video frame at the cloud;
the acquisition end encodes different blocks at different qualities according to the cloud partitioning rule and then transmits them to the edge end;
the edge end performs region cutting on the received video, transmits the high-quality-coded blocks to the cloud server for reasoning, and simultaneously reasons locally over the received video;
the cloud reasons over the received video and returns the reasoning result to the edge end;
the edge end aggregates its own reasoning results with the cloud reasoning results as the final result, and tracks the aggregated reasoning results.
2. The depth estimation-based cloud edge collaborative video stream analysis method according to claim 1, wherein the Server DNN model is a high-complexity model and the Edge DNN model is a low-complexity model; complexity is distinguished by the number of model layers, a model with 100 or more layers being a high-complexity model and a model with fewer than 100 layers being a low-complexity model.
3. The depth estimation-based cloud edge collaborative video stream analysis method according to claim 1, wherein generating the partitioning rule of the video frame comprises:
the camera of the acquisition end continuously acquires video, codes the acquired video and then streams the high-quality code to the cloud;
the cloud end decodes the received data, then carries out depth estimation and DNN reasoning respectively, and then generates partition rules of the camera and transmits the partition rules to the acquisition end.
4. The depth estimation-based cloud-edge collaborative video streaming analysis method according to claim 2, wherein the partitioning rules perform different quality encodings on different blocks comprising:
1) Dividing the video frame into a plurality of tiles, where each tile is a square macroblock whose width is the greatest common divisor of the frame width and the frame height;
2) The cloud performs inference on the decoded image with a structurally complex DNN and a structurally simple DNN respectively, and finds the tile regions where the two inference results differ;
3) Performing depth estimation on the video frame to obtain the depth value of every pixel in the frame, and computing the average depth value of the pixels contained in each tile;
4) Deriving the offloading threshold from the average depth values of the different tiles obtained in step 3);
5) Generating the complete tile partition rule according to the offloading threshold in step 4), and returning the partition rule to the camera in the form of a hash value as the basis for the camera's partition encoding.
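Steps 1), 3), 4), and 5) above can be sketched as follows. This is an illustrative reading of the claim, not the claimed implementation; the function names and the simple "above-threshold tiles get high quality" policy are assumptions:

```python
import math
import numpy as np

def tile_width(frame_w, frame_h):
    # Step 1): tiles are square macroblocks whose side length is the
    # greatest common divisor of the frame width and height.
    return math.gcd(frame_w, frame_h)

def tile_avg_depths(depth_map):
    # Step 3): average depth per tile from a per-pixel depth map
    # (2-D array, shape = frame height x frame width).
    h, w = depth_map.shape
    t = tile_width(w, h)
    avgs = {}
    for y in range(0, h, t):
        for x in range(0, w, t):
            avgs[(y // t, x // t)] = float(depth_map[y:y + t, x:x + t].mean())
    return avgs

def partition_rule(avg_depths, threshold):
    # Steps 4)-5): tiles whose average depth exceeds the offloading
    # threshold are marked for high-quality encoding; the rest are
    # encoded at low quality.
    return {tid: ('high' if d > threshold else 'low')
            for tid, d in avg_depths.items()}
```

The resulting per-tile rule is what the claim describes being hashed and returned to the camera.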
5. The depth estimation-based cloud edge collaborative video stream analysis method according to claim 1, wherein the region cutting comprises:
1) Decoding the received video data;
2) Performing inference on the decoded video;
3) Traversing the different blocks of the video frame and covering all blocks that the acquisition end encoded at low quality with a single uniform pixel value;
4) Sending the modified video frames to the cloud for inference.
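Step 3) above can be sketched as follows; masking the low-quality tiles with one uniform value makes the forwarded frame highly compressible. The function name and the fill value of 128 are assumptions for illustration:

```python
import numpy as np

def mask_low_quality_tiles(frame, rule, tile, fill=128):
    # Step 3): cover every tile the acquisition end encoded at low
    # quality with a single uniform pixel value, leaving only the
    # high-quality tiles intact before sending the frame to the cloud.
    out = frame.copy()
    for (ty, tx), quality in rule.items():
        if quality == 'low':
            out[ty * tile:(ty + 1) * tile, tx * tile:(tx + 1) * tile] = fill
    return out
```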
6. The depth estimation-based cloud edge collaborative video stream analysis method according to claim 1, wherein an attention-based LSTM module is adopted to track the aggregated inference result.
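The claim does not specify the attention variant; as an illustration only, the attention step of such a module can be sketched as simple dot-product attention over the LSTM's hidden states (all shapes and names assumed):

```python
import numpy as np

def attention(hidden_states, query):
    # Dot-product attention over a sequence of LSTM hidden states
    # (shape T x D): score each state against the query, softmax the
    # scores into weights, and return the weighted context vector.
    scores = hidden_states @ query                 # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over time steps
    return weights @ hidden_states                 # context vector, shape (D,)
```

In a tracker, the context vector would condition the next-step prediction on the most relevant past states rather than only the last one.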
7. An apparatus, characterized by comprising an acquisition end, an edge end, and a cloud end, wherein the cloud end deploys a DNN model with a complex structure, the edge end deploys a DNN model with a simple structure, and the depth estimation-based cloud edge collaborative video stream analysis method according to claim 1 runs among the acquisition end, the edge end, and the cloud end.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310477632.6A CN117079108A (en) | 2023-04-28 | 2023-04-28 | Cloud edge collaborative video stream analysis method and device based on depth estimation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117079108A true CN117079108A (en) | 2023-11-17 |
Family
ID=88708589
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310477632.6A Pending CN117079108A (en) | 2023-04-28 | 2023-04-28 | Cloud edge collaborative video stream analysis method and device based on depth estimation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117079108A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||