CN117079108A - Cloud edge collaborative video stream analysis method and device based on depth estimation - Google Patents
- Publication number
- CN117079108A (application CN202310477632.6A)
- Authority
- CN
- China
- Prior art keywords
- cloud
- video
- edge
- reasoning
- dnn
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/94—Hardware or software architectures specially adapted for image or video understanding
- G06V10/95—Hardware or software architectures specially adapted for image or video understanding structured as a network, e.g. client-server architectures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/54—Surveillance or monitoring of activities, e.g. for recognising suspicious objects of traffic, e.g. cars on the road, trains or boats
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
- H04N19/176—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/18—Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
Abstract
The application belongs to the field of video analysis, and particularly relates to a cloud-edge collaborative video stream analysis method and device based on depth estimation, comprising the following steps: deploying a Server DNN model at the cloud and an Edge DNN model at the edge; generating a partitioning rule for video frames at the cloud based on the depth of the video frames; the acquisition end encoding different blocks at different qualities according to the cloud partitioning rule and transmitting them to the edge end; the edge end performing region cutting on the received video, transmitting the high-quality-coded blocks to the cloud server for reasoning, and simultaneously reasoning locally on the received video; the cloud reasoning over the received video and returning the reasoning result to the edge end; the edge end aggregating its own reasoning results with the cloud reasoning results as the final result, and tracking the aggregated reasoning results. The application provides a method for selecting regions of video frames according to image depth, and solves the problem that different DNN models define the ROI differently.
Description
Technical Field
The application belongs to the field of video analysis, and particularly relates to a cloud edge collaborative video stream analysis method and device based on depth estimation.
Background
With the popularization of deep neural networks and cameras, video stream reasoning has wide application scenarios. In these scenarios (e.g., urban traffic analysis and security anomaly detection), cameras continuously collect video and stream it to a remote server; upon receipt, the server runs a deep neural network (DNN) model to analyze it and returns the analysis results to the cameras. This is the prototype of the video stream analysis system. However, as these systems came into wider use, various problems began to appear. The first is that analysis delay is too long. To reduce it, many researchers offload only the region of interest (ROI), reducing the amount of data transmitted over the network. The focus of these studies lies in ROI selection. Since the goal of ROI selection is to reduce network transmission delay, in a video stream analysis system the ROI selection module can only run at the video source, i.e., the camera. Researchers have therefore proposed a variety of heuristic algorithms for ROI selection using the limited computational resources on the camera, and video stream analysis systems began to develop vigorously, such as Glimpse (Chen, Y. H., et al., "Glimpse: Continuous, Real-Time Object Recognition on Mobile Devices," Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, ACM, 2015) and Reducto (Li, Yuanqi, et al., "Reducto: On-Camera Filtering for Resource-Efficient Real-Time Video Analytics," Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM), ACM, 2020). These systems all achieve low latency without excessive degradation in analysis accuracy.
However, further research pointed out that only a small fraction of the cameras in use today have substantial computing resources; most cameras have little compute and can only collect and transmit video. Continuing with this approach would therefore severely limit the scenarios in which video stream analysis systems can be used.
To address the limitations of camera computing resources, researchers turned their attention to the server side, arguing that there is another way to reduce network transmission delay: changing the video coding scheme. Traditional video streaming systems compress and encode video based on user-perceived quality of experience (QoE). In a video stream analysis system, however, the server receives video only so that the DNN model can reason over it, so whether QoE degrades during transmission need not be considered. Taking the object detection task as an example, compressing or even clipping the background of a video frame does not affect the recognition accuracy of the DNN model, even though a human would perceive an obvious degradation of video quality. Researchers therefore began to trade off the latency and accuracy of video analysis by actively or passively adjusting the quality of the video transmitted over the network, and a series of video stream analysis systems centered on this idea appeared. For example, AWStream dynamically adjusts the coding quality of the next video clip to cope with bandwidth fluctuations (Zhang, B., et al., "AWStream: Adaptive Wide-Area Streaming Analytics," ACM Special Interest Group on Data Communication (SIGCOMM), ACM, 2018). DDS divides the offloading process into two passes: in the first pass the video is encoded with higher quality loss, transmitted to the server, and server feedback is received; in the second pass, according to that feedback, only part of the region is compressed losslessly and transmitted to the server for reasoning by the DNN model (Du, K., et al., "Server-Driven Video Streaming for Deep Learning Inference," SIGCOMM '20: Annual Conference of the ACM Special Interest Group on Data Communication, ACM, 2020).
These offloading modes guided by server feedback improve accuracy, but they also greatly increase inference delay and do not achieve the best trade-off between accuracy and delay.
Disclosure of Invention
The application aims to overcome the problems in the prior art and provides a cloud-edge collaborative video stream analysis method and device based on depth estimation.
The application adopts the following technical scheme: a cloud-edge collaborative video stream analysis method based on depth estimation, comprising the following steps:
deploying a Server DNN model at the cloud and deploying an Edge DNN model at the edge;
generating a partitioning rule for video frames at the cloud based on the depth of the video frames;
the acquisition end encodes different blocks at different qualities according to the cloud partitioning rule and then transmits them to the edge end;
the edge end performs region cutting on the received video, transmits the high-quality-coded blocks to the cloud server for reasoning, and simultaneously reasons locally over the received video;
the cloud reasons over the received video and returns the reasoning result to the edge end;
the edge end aggregates its own reasoning results with the cloud reasoning results as the final result, and tracks the aggregated reasoning results.
The Server DNN model is a high-complexity model and the Edge DNN model is a low-complexity model; complexity is distinguished by the number of model layers: a model with 100 or more layers is a high-complexity model, and a model with fewer than 100 layers is a low-complexity model.
Generating the partitioning rule of the video frame includes:
the camera at the acquisition end continuously captures video, encodes it at high quality, and streams it to the cloud;
the cloud decodes the received data, then performs depth estimation and DNN reasoning respectively, generates the camera's partition rule, and transmits it to the acquisition end.
Generating the partitioning rule by which different blocks are encoded at different qualities includes:
1) Dividing a video frame into a plurality of tiles, wherein the tiles are square macroblocks whose width is the greatest common factor of the width and the height of the video frame;
2) The cloud infers the decoded image with the Server DNN and the Edge DNN respectively, and finds the tiles whose inference results differ;
3) Performing depth estimation on the video frame to obtain the depth value of each pixel in the frame, and computing the average depth value of the pixels contained in each tile;
4) Averaging the depth values (from step 3)) of the tiles whose inference results differ to obtain the offloading threshold;
5) Marking each tile according to the offloading threshold from step 4) to complete the tile partition rule, and returning the partition rule to the camera in the form of a hash value as the basis for the camera's partition coding.
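The five steps above can be sketched in code. The following is a minimal illustration only; the patent does not specify an implementation, and the function names (`tile_size`, `partition_rule`) and the NumPy depth-map representation are assumptions, with the DNN difference set and the depth estimator treated as given inputs.

```python
import numpy as np
from math import gcd

def tile_size(width, height):
    """Step 1: tile width = greatest common factor of frame width and height."""
    return gcd(width, height)

def partition_rule(depth_map, diff_tiles):
    """Generate per-tile HQ/LQ labels from a per-pixel depth map.

    depth_map  : (H, W) array of estimated per-pixel depths (step 3 input).
    diff_tiles : set of (row, col) tiles where Server DNN and Edge DNN
                 results differ (step 2 output, assumed given).
    Returns {(row, col): "HQ" or "LQ"}.
    """
    h, w = depth_map.shape
    t = tile_size(w, h)
    # Step 3: average depth of the pixels inside each tile.
    avg = {(r, c): depth_map[r*t:(r+1)*t, c*t:(c+1)*t].mean()
           for r in range(h // t) for c in range(w // t)}
    # Step 4: offloading threshold = mean depth over the differing tiles.
    avd = float(np.mean([avg[rc] for rc in diff_tiles]))
    # Step 5: tiles deeper than the threshold get high-quality coding.
    return {rc: ("HQ" if d > avd else "LQ") for rc, d in avg.items()}
```

For instance, a 4×8 depth map whose left half has depth 1.0 and right half 3.0 yields a threshold of 2.0, so only the deeper right tile is marked HQ.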
The region cutting includes:
1) Decoding the received video data;
2) Reasoning on the decoded video;
3) Traversing the blocks of the video frame and covering all blocks encoded at low quality by the acquisition end with a single uniform pixel value;
4) Sending the modified video frames to the cloud for reasoning.
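A minimal sketch of the masking step above, under the same assumptions as before (NumPy frames, a hypothetical `rule` dict of per-tile labels); decoding, re-encoding, and transmission are elided:

```python
import numpy as np

def cut_regions(frame, rule, tile, fill=128):
    """Step 3 of region cutting: overwrite every tile that the acquisition
    end encoded at low quality with one uniform pixel value, so the masked
    frame compresses well before being offloaded to the cloud (step 4).

    frame : decoded (H, W) or (H, W, C) array (step 1 output).
    rule  : {(row, col): "HQ"/"LQ"} partition rule received from the cloud.
    tile  : tile side length in pixels.
    """
    out = frame.copy()
    for (r, c), label in rule.items():
        if label == "LQ":
            out[r*tile:(r+1)*tile, c*tile:(c+1)*tile] = fill
    return out
```

Filling the low-quality tiles with one constant value is what makes the masked regions cheap to encode while leaving the high-quality tiles untouched for cloud inference.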
The aggregated reasoning result is tracked by an attention-based LSTM module.
The device comprises an acquisition end, an edge end and a cloud end; the cloud end deploys a DNN model with a complex structure, the edge end deploys a DNN model with a simple structure, and the above cloud-edge collaborative video stream analysis method based on depth estimation runs among the acquisition end, the edge end and the cloud end.
Compared with the prior art, the application has the following beneficial effects:
1. The application proposes selecting the ROI of a video frame according to the depth of the image, which solves the problem that different DNN models define the ROI differently, since all models can share the same depth estimation result; the DNN model can thus be replaced in the middle of system reasoning without any additional operation.
2. The application proposes using an LSTM to dynamically predict the reasoning result from historical reasoning results as a supplement to video analysis, greatly reducing delay while preserving accuracy.
Drawings
FIG. 1 is a block diagram of the method of the present application;
FIG. 2 is a workflow diagram of cloud generation of partition rules.
Detailed Description
The present application will be described and illustrated with reference to the accompanying drawings and examples in order to make its objects, technical solutions and advantages more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. All other embodiments obtainable by a person of ordinary skill in the art from the embodiments provided herein without inventive effort fall within the scope of the application. Moreover, while such a development effort might be complex and lengthy, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by those of ordinary skill in the art that the described embodiments of the application can be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. The terms "a," "an," "the," and similar referents in the context of the application are not to be construed as limiting the quantity, but rather as singular or plural. The terms "comprising," "including," "having," and any variations thereof, are intended to cover a non-exclusive inclusion; the term "plurality" as used herein means greater than or equal to two. "and/or" describes an association relationship of an association object, meaning that there may be three relationships, e.g., "a and/or B" may mean: a exists alone, A and B exist together, and B exists alone. The terms "first," "second," "third," and the like, as used herein, are merely distinguishing between similar objects and not representing a particular ordering of objects.
In general terms, the method of the application comprises: deploying a higher-complexity DNN inference model at the cloud and a lower-complexity DNN inference model at the edge; generating a partitioning rule for video frames at the cloud, according to the depth of the video frames, by introducing a depth estimation technique; the acquisition end encoding different blocks at different qualities according to the cloud partitioning rule and transmitting them to the edge server; the edge server performing DNN reasoning on the video while streaming the high-quality-coded blocks to the cloud for DNN reasoning; and finally the edge server aggregating its own reasoning results with the cloud reasoning results as the final result, and tracking the aggregated reasoning results with an attention-based LSTM module.
In order to achieve the purpose of the application, the technical scheme adopted is summarized as follows:
a cloud edge collaborative video stream analysis method based on depth estimation comprises the following steps:
(1) A DNN model with a complex structure is deployed at the cloud, referred to herein as the Server DNN, and a DNN model with a simple structure is deployed at the edge, referred to herein as the Edge DNN.
(2) The camera at the acquisition end continuously captures video, encodes it at high quality, and streams it to the cloud.
(3) The cloud end decodes the received data, then carries out depth estimation and DNN reasoning respectively, and then generates partition rules of the camera and transmits the partition rules to the acquisition end.
The partition rule generation in step (3) comprises the following steps:
1) Dividing a video frame into a plurality of tiles, wherein the tiles are square macroblocks whose width is the greatest common factor of the width and the height of the video frame;
2) The cloud infers the decoded image with the Server DNN and the Edge DNN respectively, and finds the tiles whose inference results differ.
3) Performing depth estimation on the video frame to obtain the depth value of each pixel in the frame, and computing the average depth value of the pixels contained in each tile.
4) Averaging the depth values (from step 3)) of the tiles whose inference results differ to obtain the offloading threshold.
5) Marking each tile according to the offloading threshold from step 4) to complete the tile partition rule, and returning the partition rule to the camera in the form of a hash value as the basis for the camera's partition coding.
(4) The acquisition end encodes the captured video block by block at different qualities according to the rule generated by the cloud, and streams it to the edge server once encoding is finished.
(5) The edge end performs region cutting on the received video, transmits the high-quality-coded blocks to the cloud server for reasoning, and simultaneously reasons locally over the received video.
The region cutting in step (5) comprises the following steps:
1) The received video data is decoded.
2) The decoded video is inferred.
3) The blocks of the video frame are traversed, and all blocks encoded at low quality by the acquisition end are covered with a single uniform pixel value.
4) The modified video frames are sent to the cloud for reasoning.
(6) The cloud terminal infers the received video and returns an inference result to the edge terminal.
(7) The edge end aggregates the local reasoning result and the cloud reasoning result as the final reasoning result of the video frame, and tracks it using a classical attention-based LSTM module.
The embodiment provides an example of a cloud edge collaborative video stream analysis system based on depth estimation.
In this embodiment, as shown in fig. 1, an overall workflow of video stream analysis in a cloud-edge network environment is illustrated. The method comprises the following specific steps:
Pre-start stage: the acquisition end losslessly encodes the captured video and transmits it to the cloud; the cloud decodes the received video and passes it to the partition rule generation module, which generates a partition scheme using the partition rule generation algorithm and transmits the result to the acquisition end. As shown in fig. 2, which illustrates the workflow of partition rule generation, its specific steps are as follows:
Assume a video frame is divided into 5×8 blocks, 8 along its length and 5 along its width.
The cloud infers the decoded video frames with the Server DNN and the Edge DNN respectively, finds the blocks whose inference results differ, and records them as a block set, in which each number is the index of a block where the cloud and edge reasoning results differ.
Depth estimation is performed on the video frame to obtain the set AveDep of average depths of the divided blocks, where each element represents the average depth of one block.
The average depth AVD over the blocks in the difference set is then obtained; with B denoting the set of differing blocks, the formula is AVD = (1/|B|) · Σ_{i∈B} AveDep_i.
Using the obtained AVD as the offloading threshold, each block in AveDep is marked: blocks with depth greater than AVD are marked HQ, denoting high-quality coding, and blocks with depth less than AVD are marked LQ, denoting low-quality coding.
Partition transmission stage: the acquisition end partition-encodes the video according to the partition scheme from the first stage and streams it to the edge end. After decoding the video, the edge end calls the region cutting algorithm in the region cutting module to cut the video frame by frame, and transmits the cut result to the cloud server. The region cutting steps are as follows:
the received video data is decoded.
And (3) reasoning the decoded video to obtain a reasoning result LD.
And traversing different blocks of the video frame, covering all the blocks with the low-quality codes at the acquisition end with the same pixel value, and recoding the blocks into B.
And sending the modified video frames to the cloud for reasoning.
Real-time reasoning stage: the cloud and the edge infer the decoded video with the Server DNN and the Edge DNN respectively. The cloud reasoning result is returned to the edge end, the edge end aggregates the cloud and edge reasoning results to obtain the final reasoning result, and the attention-based LSTM module tracks the reasoning results of historical frames. The aggregation steps are as follows:
The edge end receives the reasoning result HD returned by the cloud.
For the first block, the edge compares the HD result with the LD result obtained during region cutting. If they are the same, no processing is performed; if they differ, the LD reasoning result is overwritten by the HD reasoning result.
The 40 blocks are iterated in turn according to the above step to obtain the final result JD.
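The block-wise aggregation loop can be sketched as follows; this is a minimal illustration in which the per-block result format (plain strings keyed by block number) is an assumption:

```python
def aggregate(hd, ld):
    """Merge the cloud result HD into the edge result LD block by block:
    where the two differ, the cloud (Server DNN) result overrides the
    edge result; where they agree, the edge result is kept unchanged.

    hd, ld : dicts mapping block number -> per-block inference result.
    Returns the final aggregated result JD.
    """
    jd = dict(ld)
    for block, hd_result in hd.items():
        if jd.get(block) != hd_result:   # results differ -> trust the cloud
            jd[block] = hd_result
    return jd
```

For example, aggregating `{1: "car"}` from the cloud with `{1: "truck", 2: "bus"}` from the edge replaces block 1 with the cloud's result and keeps block 2 as the edge reported it.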
Wherein the attention-based LSTM module reasoning about historical frames is as follows:
inputting the aggregation reasoning result of the previous N frames into the LSTM module。
The LSTM module outputs the processing result of the new frame。
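The patent does not give the module's internals. As a rough sketch only, the snippet below implements just a dot-product attention step over the last N frame results, with a fixed query vector standing in for the LSTM hidden state; the LSTM recurrence itself is omitted, so this is a simplified stand-in rather than the attention-based LSTM the application describes:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_next(history, query):
    """Attention over historical frame results.

    history : (N, D) array, feature vectors of the last N aggregated
              frame results (format assumed for illustration).
    query   : (D,) array standing in for the LSTM hidden state.
    Returns the attention-weighted prediction for the new frame.
    """
    scores = history @ query      # dot-product attention scores, shape (N,)
    weights = softmax(scores)     # attention distribution over the N frames
    return weights @ history      # weighted combination, shape (D,)
```

When every historical frame carries the same feature vector, the prediction reduces to that vector regardless of the attention weights.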
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations are described; however, any combination of these technical features that involves no contradiction should be considered within the scope of this description.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.
Claims (7)
1. A cloud edge collaborative video stream analysis method based on depth estimation is characterized by comprising the following steps:
deploying a Server DNN model at the cloud and deploying an Edge DNN model at the Edge;
generating a partitioning rule of the video frame based on the depth of the video frame at the cloud;
the acquisition end encodes different blocks at different qualities according to the cloud partitioning rule and then transmits them to the edge end;
the edge end performs region cutting on the received video, transmits the high-quality-coded blocks to the cloud server for reasoning, and simultaneously reasons locally over the received video;
the cloud reasons over the received video and returns the reasoning result to the edge end;
the edge end aggregates its own reasoning results with the cloud reasoning results as the final result, and tracks the aggregated reasoning results.
2. The depth estimation-based cloud edge collaborative video stream analysis method according to claim 1, wherein the Server DNN model is a high-complexity model and the Edge DNN model is a low-complexity model; complexity is distinguished by the number of model layers, a model with 100 or more layers being a high-complexity model and a model with fewer than 100 layers being a low-complexity model.
3. The depth estimation-based cloud edge collaborative video stream analysis method according to claim 1, wherein generating the partitioning rule of the video frame comprises:
the camera of the acquisition end continuously acquires video, codes the acquired video and then streams the high-quality code to the cloud;
the cloud end decodes the received data, then carries out depth estimation and DNN reasoning respectively, and then generates partition rules of the camera and transmits the partition rules to the acquisition end.
4. The depth estimation-based cloud-edge collaborative video streaming analysis method according to claim 2, wherein the partitioning rules perform different quality encodings on different blocks comprising:
1) Dividing the video frame into a plurality of tiles, where each tile is a square macroblock whose width is the greatest common divisor of the frame width and the frame height;
2) The cloud performs inference on the decoded image with a structurally complex DNN and a structurally simple DNN respectively, and finds the tile regions where the two inference results differ;
3) Performing depth estimation on the video frame to obtain the depth value of every pixel in the frame, and computing the average depth value of the pixels contained in each tile;
4) Deriving the offloading threshold from the average depth values of the different tiles obtained in step 3);
5) Generating the complete tile partition rule according to the offloading threshold in step 4), and returning the partition rule to the camera in the form of a hash value as the basis for the camera's partition encoding.
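Steps 1), 3), 4), and 5) above can be sketched as follows. This is an illustrative reading of the claim, not the claimed implementation; the function names and the simple "above-threshold tiles get high quality" policy are assumptions:

```python
import math
import numpy as np

def tile_width(frame_w, frame_h):
    # Step 1): tiles are square macroblocks whose side length is the
    # greatest common divisor of the frame width and height.
    return math.gcd(frame_w, frame_h)

def tile_avg_depths(depth_map):
    # Step 3): average depth per tile from a per-pixel depth map
    # (2-D array, shape = frame height x frame width).
    h, w = depth_map.shape
    t = tile_width(w, h)
    avgs = {}
    for y in range(0, h, t):
        for x in range(0, w, t):
            avgs[(y // t, x // t)] = float(depth_map[y:y + t, x:x + t].mean())
    return avgs

def partition_rule(avg_depths, threshold):
    # Steps 4)-5): tiles whose average depth exceeds the offloading
    # threshold are marked for high-quality encoding; the rest are
    # encoded at low quality.
    return {tid: ('high' if d > threshold else 'low')
            for tid, d in avg_depths.items()}
```

The resulting per-tile rule is what the claim describes being hashed and returned to the camera.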
5. The depth estimation-based cloud edge collaborative video stream analysis method according to claim 1, wherein the region cutting comprises:
1) Decoding the received video data;
2) Performing inference on the decoded video;
3) Traversing the different blocks of the video frame and covering all blocks that the acquisition end encoded at low quality with a single uniform pixel value;
4) Sending the modified video frames to the cloud for inference.
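Step 3) above can be sketched as follows; masking the low-quality tiles with one uniform value makes the forwarded frame highly compressible. The function name and the fill value of 128 are assumptions for illustration:

```python
import numpy as np

def mask_low_quality_tiles(frame, rule, tile, fill=128):
    # Step 3): cover every tile the acquisition end encoded at low
    # quality with a single uniform pixel value, leaving only the
    # high-quality tiles intact before sending the frame to the cloud.
    out = frame.copy()
    for (ty, tx), quality in rule.items():
        if quality == 'low':
            out[ty * tile:(ty + 1) * tile, tx * tile:(tx + 1) * tile] = fill
    return out
```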
6. The depth estimation-based cloud edge collaborative video stream analysis method according to claim 1, wherein an attention-based LSTM module is adopted to track the aggregated inference result.
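The claim does not specify the attention variant; as an illustration only, the attention step of such a module can be sketched as simple dot-product attention over the LSTM's hidden states (all shapes and names assumed):

```python
import numpy as np

def attention(hidden_states, query):
    # Dot-product attention over a sequence of LSTM hidden states
    # (shape T x D): score each state against the query, softmax the
    # scores into weights, and return the weighted context vector.
    scores = hidden_states @ query                 # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over time steps
    return weights @ hidden_states                 # context vector, shape (D,)
```

In a tracker, the context vector would condition the next-step prediction on the most relevant past states rather than only the last one.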
7. An apparatus, characterized by comprising an acquisition end, an edge end, and a cloud end, wherein the cloud end deploys a DNN model with a complex structure, the edge end deploys a DNN model with a simple structure, and the depth estimation-based cloud edge collaborative video stream analysis method according to claim 1 runs among the acquisition end, the edge end, and the cloud end.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310477632.6A CN117079108A (en) | 2023-04-28 | 2023-04-28 | Cloud edge collaborative video stream analysis method and device based on depth estimation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117079108A true CN117079108A (en) | 2023-11-17 |
Family
ID=88708589
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310477632.6A Pending CN117079108A (en) | 2023-04-28 | 2023-04-28 | Cloud edge collaborative video stream analysis method and device based on depth estimation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117079108A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||