CN114257817A - Encoding method and decoding method of multitask digital retina feature stream - Google Patents


Info

Publication number
CN114257817A
CN114257817A (application CN202210189806.4A)
Authority
CN
China
Prior art keywords: network, depth, feature, task, data
Prior art date: 2022-03-01
Legal status (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed): Granted
Application number
CN202210189806.4A
Other languages: Chinese (zh)
Other versions: CN114257817B (en)
Inventor
滕波
向国庆
牛梅梅
洪一帆
陆嘉瑶
焦立欣
张羿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Smart Video Security Innovation Center Co Ltd
Original Assignee
Zhejiang Smart Video Security Innovation Center Co Ltd
Priority date (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed): 2022-03-01
Filing date: 2022-03-01
Publication date: 2022-03-29
Application filed by Zhejiang Smart Video Security Innovation Center Co Ltd
Priority to CN202210189806.4A
Publication of CN114257817A
Application granted
Publication of CN114257817B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/70: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Abstract

The invention discloses an encoding method and a decoding method for a multitask digital retina feature stream. The extra storage bits required by the differential coding are very few, and machine analysis tasks added in the future can continue to use differential coding on the basis of the transformed depth features, so each new machine analysis task brings only a small additional storage requirement, achieving low-cost upgradeability. No matter how many machine analysis tasks are added to the digital retina system, each new task can be realized dynamically in the cloud simply by separating the corresponding transformed depth features from the feature stream data.

Description

Encoding method and decoding method of a multitask digital retina feature stream
Technical Field
The invention relates to the technical field of video coding, and in particular to an encoding method and a decoding method for a multitask digital retina feature stream.
Background
Since the digital retina concept was introduced, it has attracted great attention in fields such as video coding/decoding and video surveillance. In traditional image processing, video compression and video analysis belong to two different fields; inspired by the biological function of the human retina, digital retina technology was the first to propose an intelligent image sensor that integrates video compression with video analysis. Specifically, a digital retina obtains compressed video data and video feature data simultaneously and transmits both to the cloud as data streams, which facilitates later playback and various machine analysis tasks. To obtain the feature stream of an image, digital retina technology introduces the concept of a model stream: the image acquisition front end can apply different feature extraction models as required, and these models can be stored in the cloud and sent back to the front end, where one model can be regarded as one task. On the standards side, the international standardization organizations have launched a new video-coding-for-machines project intended to bridge video coding and video analysis, and in particular to address depth feature compression for automated machine analysis.
In video coding, the basic idea is to remove the spatio-temporal redundancy of the compressed video. The basic paradigm of video compression has not changed greatly over the past decades, while block-based video codecs have matured, offering moderate computational complexity, high compression ratio, high reconstruction quality, and so on, and have therefore been widely deployed; the mainstream codecs, including H.264/H.265/H.266 and MPEG-2/MPEG-4, are all mainly based on block-based coding. Since the beginning of video coding, the paradigm of coding theory has not changed, and each new generation of coding standards raises the compression ratio by computing in a transform space. For example, the evolution from H.264 to H.265 improved the compression ratio by about 50% but also brought greater computational requirements, owing to more flexible coding units that let motion-compensation-based compression exploit more of the compression potential. In general, with respect to the goal of signal fidelity, block-based video compression is considered to be highly developed. Video coding can therefore be regarded as a pixel-oriented reconstruction task.
As previously mentioned, the cloud under the digital retina framework stores a variety of models corresponding to different tasks. The tasks may also be updated dynamically, i.e., the cloud may update the task library as users' needs change. However, if the cloud wants the retinal front-end device to extract features for a new task, the current approach is to update the front end's feature extraction model through the model stream, which means the old task must be shut down at the same time. Such a method, which supports only a single task, falls far short of actual requirements. In practice a machine analysis task may not arrive in real time, so a particular task may need to be executed against stored data. A direct solution is, when a new task starts, for the front end to simultaneously deploy a new feature extraction model for that task, extract feature data from the video separately, and encode and send the data to the cloud separately. This approach is clearly inefficient, because the computational load of a feature extraction model is not negligible and the amount of encoded data it generates is also large. If new models keep being added, the computational load and bandwidth requirements quickly exceed tolerable limits.
Disclosure of Invention
The invention aims to provide an encoding method and a decoding method for a multitask digital retina feature stream, solving the problems that existing methods can support only a single task and execute analysis tasks inefficiently.
A method for encoding a multitask digital retina feature stream comprises the following steps:
step A, constructing a BP (back-propagation) neural network for multitasking;
step B, deploying the low-level network of the BP neural network at the video front end as a feature extraction network;
step C, training a transformation network based on the features extracted by the feature extraction network;
step D, deploying the low-level network of the transformation network at the front end as the feature extraction network of the newly added task;
step E, acquiring feature data and inputting it into the newly added task's feature extraction network to obtain the newly added task's feature data;
step F, jointly encoding and/or transmitting the feature data and the newly added task feature data.
As a further preference, the tasks of the BP neural network include at least one task for video reconstruction and at least one task for machine analysis.
As a further preference, the step C includes:
step C1, reconstructing the video with the video reconstruction network, the reconstructed video being regarded as the depth feature output by a certain layer of the transformation network;
step C2, outputting the output required by the new machine analysis task.
As a further preference, the step D includes obtaining the low-level network of the transformation network and deploying that low-level network at the video front end.
As a further preference, the step F includes: step F1, the transformation low-level network outputs transformed depth features, which are jointly encoded with the depth features and then output or stored in the cloud.
As a further preference, the step F further includes: step F2, independently encoding the depth features and encoding the difference between the transformed depth features and the depth features.
As a further preference, the step F2 specifically comprises:
step F21, video-coding the depth features to obtain the coded data of the depth features;
step F22, video-coding the difference between the transformed depth features and the depth features to obtain the coded data of the difference.
A decoding method of a multitask digital retina feature stream comprises the above encoding method of the multitask digital retina feature stream, and further comprises the following steps:
step a, decoding the feature stream data and separating the depth features and the transformed depth features from it;
step b, inputting the depth features into the high-level network of a previously deployed machine task to obtain the target data of that machine task;
step c, inputting the transformed depth features into the transformation high-level network to obtain the target data of the newly deployed machine task.
As a further preference, in the step b, when the target data of the machine task is obtained, the high-level network undergoes no change.
Preferably, the step c requires only separating the corresponding transformed depth features from the feature stream data, so that a newly added machine analysis task can be implemented dynamically in the cloud.
An electronic device, comprising:
a memory and one or more processors;
wherein the memory is communicatively coupled to the one or more processors and stores instructions executable by the one or more processors; when the instructions are executed by the one or more processors, the electronic device implements the method of any of the above embodiments.
A computer-readable storage medium having stored thereon computer-executable instructions operable, when executed by a computing device, to implement the method of any of the above embodiments.
A computer program product comprising a computer program stored on a computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, are operable to carry out the method of any of the above embodiments.
The technical scheme has the following advantages or beneficial effects:
the encoding method and the decoding method of the multitask digital retina characteristic stream have very small extra storage bit requirements after differential encoding, and other machine analysis tasks which are newly added in the future can also continuously use a differential encoding mode on the basis of converting depth characteristics, so that only a small amount of extra storage bit requirements are required for each newly added machine analysis task, and the low-cost upgrading capability is realized. No matter how many machine analysis tasks need to be newly added to the digital retina system, the newly added machine analysis tasks can be dynamically realized at the cloud end only by separating the corresponding transformation depth characteristics from the characteristic flow data.
Drawings
FIG. 1 is a schematic flow chart of the encoding method of a multitask digital retina feature stream of the present invention;
FIG. 2 is a schematic flow chart of the decoding method of a multitask digital retina feature stream of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
The front-end device performs both video compression and depth-model-based video feature extraction. Since the back end can deploy different models to the front end by transmission, the front-end device can be understood as capable of adaptively acquiring any depth model. Therefore, as long as a model with a special feature extraction capability is trained offline, it can be deployed to the front-end device through the model stream. In the cloud, the main purpose of the feature stream is to serve various machine analysis tasks, such as image classification and target detection. More precisely, the feature stream carries feature data for machine analysis tasks, without prescribing the nature of the data or the compression strategy. Specifically, two different strategies can each provide feature stream data for a machine analysis task:
a: the feature stream data still uses block-based image coding, but the signal reconstructed at the decoding end is no longer viewed by human eyes; instead it is used to improve the performance of machine analysis tasks. This method is still based on signal fidelity, and the feature stream remains video data in essence, only processed, so it may differ from the data of the video stream;
b: the feature stream data is the compressed depth features of machine analysis tasks, generally feature map data processed by a deep neural network; the compressed feature map data can be used directly for various machine analysis tasks in the cloud.
Existing research results show that dependencies exist between the features of different machine analysis tasks. That is, feature data extracted for one machine analysis task may be used for another, different machine analysis task; for example, surface normals may be used for depth estimation. However, such dependencies between different machine analysis tasks do not always exist. If the features extracted by the deployed feature extraction network cannot be used for the newly added task, the whole network has to be retrained, which causes compatibility problems.
Referring to fig. 1, a method for encoding a multitask digital retina feature stream includes the following steps:
constructing a BP neural network for multitasking;
deploying the low-level network of the BP neural network at the video front end as a feature extraction network;
training a transformation network based on the features extracted by the feature extraction network;
deploying the low-level network of the transformation network at the front end as the feature extraction network of the newly added task;
acquiring feature data and inputting it into the newly added task's feature extraction network to obtain the newly added task's feature data;
jointly encoding and/or transmitting the feature data and the newly added task feature data.
A deep neural network for multitasking is trained, including at least one task for video reconstruction and at least one task for machine analysis. The low-level network is deployed at the video front end as the feature extraction network. When a new task is obtained, a transformation network is trained on the features extracted by the feature extraction network; the transformation network generates the target data of the new task from that feature data. The low-level network of the transformation network is then deployed at the video front end as the feature extraction network of the newly added task. The front-end device obtains video data, obtains feature data through the feature extraction network, and inputs the feature data into the newly added task's feature extraction network to obtain the newly added task's features. The feature data and the newly added task feature data are jointly encoded and then transmitted or stored.
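By way of illustration only, the following PyTorch sketch shows one way such a multitask network could be organized: a shared low-level feature extraction network that would sit at the video front end, a high-level head for video reconstruction, and a high-level head for one machine analysis task. The patent does not specify any architecture; all layer shapes, channel counts, and the choice of a classifier head are assumptions.

```python
import torch.nn as nn

class LowLevelNet(nn.Module):
    """Front-end feature extraction network; its output is the depth feature."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, frame):
        return self.body(frame)  # depth feature map, shape (N, 64, H/4, W/4)

class ReconstructionHead(nn.Module):
    """Cloud-side high-level network for the video reconstruction task."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, feat):
        return self.body(feat)  # reconstructed frame

class AnalysisHead(nn.Module):
    """Cloud-side high-level network for one machine analysis task
    (a classifier here, purely as an assumption)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, feat):
        return self.fc(self.pool(feat).flatten(1))
```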
Further, in a preferred embodiment of the method for encoding a multitask digital retina feature stream according to the present invention, the tasks of the BP neural network include at least one task for video reconstruction and at least one task for machine analysis.
In the task setting process, video reconstruction is always included as one task in the training process. In addition, other known machine analysis tasks are included as targets of the multitask joint training process.
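Continuing the hypothetical sketch above, the joint training could look as follows; the optimizer, the loss weighting, and the dummy data standing in for a real (frame, class-label) dataset are all illustrative assumptions.

```python
import torch
import torch.nn as nn

low_net, rec_head, cls_head = LowLevelNet(), ReconstructionHead(), AnalysisHead()
opt = torch.optim.Adam(
    list(low_net.parameters()) + list(rec_head.parameters()) + list(cls_head.parameters()),
    lr=1e-4,
)
mse, ce = nn.MSELoss(), nn.CrossEntropyLoss()

# dummy stand-in for a real (frame, class-label) dataset
loader = [(torch.randn(8, 3, 64, 64), torch.randint(0, 10, (8,))) for _ in range(10)]

for frame, label in loader:
    feat = low_net(frame)  # shared depth feature
    # joint multitask loss: video reconstruction plus one machine analysis task
    loss = mse(rec_head(feat), frame) + 0.1 * ce(cls_head(feat), label)
    opt.zero_grad()
    loss.backward()        # back-propagation, matching the BP network of step A
    opt.step()
```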
Further, in a preferred embodiment of the method for encoding a multitask digital retina feature stream according to the present invention, the step C includes:
step C1, reconstructing the video with the video reconstruction network, the reconstructed video being regarded as the depth feature output by a certain layer of the transformation network;
step C2, outputting the output required by the new machine analysis task.
When a new task is obtained at the cloud, a transformation network is trained on top of the deployed feature extraction network, which is treated as a frozen (solidified) network. The transformation network takes the depth features as input and the target data of the newly added task as output; that is, it transforms the extracted depth features into the output required by the new machine analysis task.
In the training of the low-level network, one task is image reconstruction, so the depth features already contain the information needed to reconstruct all pixels of an image. The transformation network can therefore be implemented with two separate networks: one reconstructs the original image, and its successor serves the new machine analysis task. The transformation network contains a video reconstruction network that can complete the reconstruction of the video, since that network is the first high-level network; the reconstructed video can be regarded as the depth feature output by a certain layer of the transformation network. On top of this depth feature, the transformation network then cascades a deep network for the machine analysis task, which serves the newly added task, and the transformation network finally outputs the output required by the newly added machine analysis task. The purpose of the transformation network is to show that, with this method, the depth features can always provide the features that a newly added machine analysis task depends on: a deep neural network can extract the output required by the newly added machine analysis task directly from the depth features.
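Under the same illustrative assumptions as the sketches above, the two-part transformation network could be organized as follows: the deployed extractor is frozen, the low-level part outputs a transformed depth feature with the same shape as the depth feature (which is what makes differential coding possible later), and the high-level part maps it to the new task's target. The 4-dimensional output is an arbitrary stand-in for whatever the new task requires.

```python
import torch
import torch.nn as nn

class TransformLowNet(nn.Module):
    """Deployed at the front end for the new task; its output has the same
    size as the depth feature, which is what enables differential coding."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, depth_feat):
        return self.body(depth_feat)  # transformed depth feature

class TransformHighNet(nn.Module):
    """Stays in the cloud; maps the transformed depth feature to the output
    required by the newly added machine analysis task."""
    def __init__(self, channels=64, out_dim=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, out_dim)

    def forward(self, t_feat):
        return self.fc(self.pool(t_feat).flatten(1))

# The deployed feature extraction network is "solidified": freeze it and
# train only the transformation network on the new task's target data.
for p in low_net.parameters():
    p.requires_grad = False

t_low, t_high = TransformLowNet(), TransformHighNet()
t_feat = t_low(low_net(torch.randn(1, 3, 64, 64)))  # transformed depth feature
new_task_out = t_high(t_feat)                       # target data of the new task
```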
Further, in a preferred embodiment of the method for encoding a multitask digital retina feature stream according to the present invention, the newly added task step includes obtaining the low-level network of the transformation network and deploying that low-level network at the video front end.
After the transformation network for a newly added machine analysis task is obtained, its low-level network is extracted and deployed at the video front end. For the video front end this is equivalent to deploying an additional independent network; the original feature extraction network is frozen and does not change. Further, the transformation low-level network outputs transformed depth features, which are jointly encoded with the depth features and then output or stored in the cloud. Under this framework, any newly added task can be accommodated by adding an independent transformation low-level network at the front end and a corresponding transformed depth feature in the encoded data stream.
Further, in a preferred embodiment of the method for encoding a multitask digital retina feature stream according to the present invention, the step F includes: step F1, the transformation low-level network outputs transformed depth features, which are jointly encoded with the depth features and then output or stored in the cloud.
The transformed depth features output by the transformation low-level network have the same size as the depth features. In this case, the joint encoding may be performed by differential coding: the depth features are coded independently, and the difference between the transformed depth features and the depth features is coded.
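A minimal NumPy sketch of the differential coding idea follows; the feature values and the quantization step are invented for illustration. Because the transformed depth feature is close to the depth feature, the residual quantizes to mostly zeros, which an entropy coder can store very cheaply.

```python
import numpy as np

rng = np.random.default_rng(0)
depth_feat = rng.standard_normal((64, 60, 80)).astype(np.float32)  # stand-in depth feature
trans_feat = depth_feat + 0.01 * rng.standard_normal((64, 60, 80)).astype(np.float32)

residual = trans_feat - depth_feat               # the only extra data coded for the new task
q = np.round(residual / 0.05).astype(np.int8)    # coarse quantization (step is an assumption)
print(f"zero fraction: {(q == 0).mean():.3f}")   # mostly zeros, hence few extra bits

# decoder side: rebuild the transformed depth feature from base plus residual
rebuilt = depth_feat + q.astype(np.float32) * 0.05
```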
In another embodiment of the present invention, the step F further includes: step F2, independently encoding the depth features, and encoding the difference between the transformed depth features and the depth features.
Further, in a preferred embodiment of the method for encoding a multitask digital retina feature stream according to the present invention, the step F2 specifically comprises:
step F21, video-coding the depth features to obtain the coded data of the depth features;
step F22, video-coding the difference between the transformed depth features and the depth features to obtain the coded data of the difference.
An existing video coding method is applied to the depth features to obtain the coded data of the depth features, and the difference between the transformed depth features and the depth features is video-coded to obtain the coded data of the difference. Through a multiplexer, the data for the newly added machine analysis task is embedded into the data stream generated by the front-end device. If further tasks are added, the front-end device only needs to add the corresponding difference-coded data.
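A hedged sketch of the multiplexer step: the coded depth-feature payload is followed by one coded-difference payload per newly added task. The 4-byte length-prefixed framing is purely an assumption for illustration; neither the patent nor any standard defines this syntax.

```python
import struct

def mux(coded_depth: bytes, coded_diffs: list[bytes]) -> bytes:
    """Pack the coded depth features and the per-task coded differences
    into one feature stream (hypothetical framing)."""
    payloads = [coded_depth] + coded_diffs
    stream = struct.pack(">B", len(payloads))        # number of payloads
    for p in payloads:
        stream += struct.pack(">I", len(p)) + p      # length-prefixed payload
    return stream

def demux(stream: bytes) -> list[bytes]:
    """Inverse of mux: recover the individual coded payloads."""
    n, off, payloads = stream[0], 1, []
    for _ in range(n):
        (length,) = struct.unpack(">I", stream[off:off + 4])
        payloads.append(stream[off + 4:off + 4 + length])
        off += 4 + length
    return payloads
```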
In this framework, the difference between the transformed depth features and the depth features is small, so the extra storage bits required after differential coding are very few. In addition, machine analysis tasks added in the future can also continue to use differential coding on the basis of the transformed depth features, so each new machine analysis task brings only a small additional storage requirement. This achieves low-cost upgradeability.
The low-level network of a machine analysis task processes the original image to obtain a depth feature map, which is encoded and then transmitted or stored. When the machine analysis task is executed, the feature stream data is decoded and input into the high-level network of the machine analysis task, which outputs the result. The networks of multiple machine analysis tasks may share the same depth feature map data.
The video front end extracts a depth feature map from the original video through the feature extraction network. After feature coding, transmission or storage, and feature stream decoding, the depth feature map is obtained again. The depth feature map may be used for different machine analysis tasks; since the tasks differ (for example, one machine analysis task may be a classifier and another target detection), the parameters of the high-level networks also differ.
The low-level network and the high-level networks are pre-trained before deployment. That is, the features extracted by the low-level network can satisfy multiple machine analysis tasks simultaneously, and the training process can use different strategies to achieve this, for example joint training. From another perspective, the depth features extracted by the low-level network can be shared by multiple tasks. In short, after the low-level network is deployed, only the depth features corresponding to each image frame need to be transmitted or stored, and the corresponding machine analysis tasks can be completed through different high-level networks at any time in the future.
A feature extraction network is trained: in the task setting process, video reconstruction is included as one task in the training process, and other machine analysis tasks are included as known multitask targets in the joint training. In the training of the low-level network, one task is image reconstruction, which means the depth features already contain the information needed to reconstruct all pixels of the image; therefore the transformation network can always be realized with two independent networks, one for reconstructing the original image and its successor for the new machine analysis task;
when a new task is obtained at the cloud, a transformation network is trained on top of the deployed feature extraction network, which is treated as a frozen network; the transformation network takes the depth features as input and the target data of the newly added task as output, transforming the extracted depth features into the output required by the new machine analysis task;
after the transformation network for a newly added machine analysis task is obtained, its low-level network is extracted and deployed at the video front end; for the video front end this is equivalent to deploying an additional independent network, while the original feature extraction network is frozen and does not change.
Referring to fig. 2, the method for decoding a multitask digital retina feature stream according to the present invention corresponds to the above encoding method of a multitask digital retina feature stream and includes the following steps:
step a, decoding the feature stream data and separating the depth features and the transformed depth features from it;
step b, inputting the depth features into the high-level networks of previously deployed machine tasks to obtain the target data of those machine tasks;
step c, inputting the transformed depth features into the transformation high-level network to obtain the target data of the newly deployed machine task.
Further, in a preferred embodiment of the decoding method of the multitask digital retina feature stream according to the invention, in the step b, when the target data of the machine task is obtained, the high-level network undergoes no change.
Further, in a preferred embodiment of the decoding method of the multitask digital retina feature stream, the step c requires only separating the corresponding transformed depth features from the feature stream data, so that the newly added machine analysis task can be implemented dynamically in the cloud.
In the cloud of the digital retina, the workflow of the decoder is as follows. The feature stream data is first decoded, and the depth features and the transformed depth features are separated from it. The depth features are then input into the high-level networks of previously deployed machine analysis tasks to obtain the target data of those tasks; note that at this point the high-level networks of those tasks undergo no change. Meanwhile, the decoder inputs the transformed depth features into the transformation high-level network to obtain the target data of the newly deployed machine analysis task. In other words, no matter how many machine analysis tasks need to be added to the digital retina system, each new task can be realized dynamically in the cloud simply by separating the corresponding transformed depth features from the feature stream data.
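Tying the sketches together, the cloud-side decoder workflow could look like the following; `demux`, `cls_head`, and the residual scheme come from the earlier illustrative blocks, and `decode_payload` is a hypothetical stand-in for the actual feature (video) decoder.

```python
import numpy as np
import torch

def decode_payload(payload: bytes) -> np.ndarray:
    """Stand-in for the actual feature (video) decoder assumed by the patent."""
    raise NotImplementedError

def decode_feature_stream(stream: bytes, transform_high_nets):
    """Hypothetical cloud-side decoder: separate the depth feature and the
    transformed depth features, then run the corresponding high-level networks."""
    payloads = demux(stream)                     # demux from the sketch above
    depth_feat = decode_payload(payloads[0])     # base depth feature
    x = torch.from_numpy(depth_feat)[None]       # add a batch dimension
    outputs = [cls_head(x)]                      # previously deployed task, unchanged
    for diff_payload, high_net in zip(payloads[1:], transform_high_nets):
        t_feat = depth_feat + decode_payload(diff_payload)  # base plus residual
        outputs.append(high_net(torch.from_numpy(t_feat)[None]))
    return outputs
```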
An electronic device, comprising:
a memory and one or more processors;
wherein the memory is communicatively coupled to the one or more processors and stores instructions executable by the one or more processors; when the instructions are executed by the one or more processors, the electronic device implements the method of any of the above embodiments.
In particular, the processor and the memory may be connected by a bus or in another manner; connection by a bus is taken as the example here. The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination thereof.
The memory, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program modules corresponding to the methods in the embodiments of the present application. The processor executes the various functional applications and data processing of the processor by running the non-transitory software programs/instructions and functional modules stored in the memory.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor, and the like. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and such remote memory may be coupled to the processor via a network, such as through a communications interface. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
A computer-readable storage medium having stored thereon computer-executable instructions operable, when executed by a computing device, to implement a method as in any above.
The foregoing computer-readable storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. The computer-readable storage medium specifically includes, but is not limited to, a USB flash drive, a removable hard drive, a read-only memory (ROM), a random access memory (RAM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), flash memory or other solid-state memory technology, a CD-ROM, a digital versatile disk (DVD), an HD-DVD, a Blu-ray or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
While the subject matter described herein is provided in the general context of execution in conjunction with the execution of an operating system and application programs on a computer system, those skilled in the art will recognize that other implementations may also be performed in combination with other types of program modules. Generally, program modules include routines, programs, components, data structures, and other types of structures that perform particular tasks or implement particular abstract data types. Those skilled in the art will appreciate that the subject matter described herein may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like, as well as distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the method according to the embodiments of the present application.
In summary, with the encoding method and the decoding method of the multitask digital retina feature stream of the present invention, the extra storage bits required after differential coding are very few, and machine analysis tasks added in the future can continue to use differential coding on the basis of the transformed depth features, so each new machine analysis task brings only a small additional storage requirement, achieving low-cost upgradeability. No matter how many machine analysis tasks need to be added to the digital retina system, each new task can be realized dynamically in the cloud simply by separating the corresponding transformed depth features from the feature stream data.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.
In the description of the present invention, it is to be understood that the terms "upper", "lower", "front", "rear", and the like, which indicate orientations or positional relationships, are based on the orientations or positional relationships shown in the drawings, are only for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the referred devices or elements must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention.

Claims (12)

1. A method for encoding a multitask digital retina feature stream, comprising the steps of:
step A, constructing a BP neural network for multitasking;
step B, deploying the low-level network of the BP neural network at the video front end as a feature extraction network;
step C, training a transformation network based on the features extracted by the feature extraction network;
step D, deploying the low-level network of the transformation network at the front end as the feature extraction network of the newly added task;
step E, acquiring feature data and inputting it into the newly added task's feature extraction network to obtain the newly added task's feature data;
step F, jointly encoding and/or transmitting the feature data and the newly added task feature data.
2. The method of claim 1, wherein the tasks of the BP neural network include at least one task for video reconstruction and at least one task for machine analysis.
3. The method of claim 1, wherein step C comprises:
step C1, reconstructing the video with the video reconstruction network, the reconstructed video being regarded as the depth feature output by a certain layer of the transformation network;
step C2, outputting the output required by the new machine analysis task.
4. The method of claim 3, wherein the newly added task step comprises obtaining the low-level network of the transformation network and deploying that low-level network at the video front end.
5. The method of claim 1, wherein step F comprises: step F1, the transformation low-level network outputs transformed depth features, which are jointly encoded with the depth features and then output or stored in the cloud.
6. The method of claim 1, wherein step F further comprises: step F2, independently encoding the depth features and encoding the difference between the transformed depth features and the depth features.
7. The method according to claim 6, wherein said step F2 specifically comprises:
step F21, video-coding the depth features to obtain the coded data of the depth features;
step F22, video-coding the difference between the transformed depth features and the depth features to obtain the coded data of the difference.
8. A method for decoding a multitask digital retina feature stream, comprising the method for encoding a multitask digital retina feature stream according to any one of claims 1 to 7, and further comprising the steps of:
step a, decoding the feature stream data and separating the depth features and the transformed depth features from it;
step b, inputting the depth features into the high-level networks of previously deployed machine tasks to obtain the target data of those machine tasks;
step c, inputting the transformed depth features into the transformation high-level network to obtain the target data of the newly deployed machine task.
9. The method for decoding a multitask digital retina feature stream according to claim 8, wherein in said step b, when the target data of said machine task is obtained, said high-level network undergoes no change.
10. The method of claim 8, wherein step c comprises only separating the corresponding transformed depth features from the feature stream data, so as to dynamically implement the newly added machine analysis task in the cloud.
11. An electronic device, comprising:
a memory and one or more processors;
wherein the memory is communicatively coupled to the one or more processors and stores instructions executable by the one or more processors; when the instructions are executed by the one or more processors, the electronic device implements the method of any of claims 1-7.
12. A computer-readable storage medium having stored thereon computer-executable instructions operable, when executed by a computing device, to implement the method of any of claims 1-7.
CN202210189806.4A, filed 2022-03-01, priority 2022-03-01: Encoding method and decoding method of multitask digital retina feature stream. Status: Active. Granted as CN114257817B (en).

Priority Applications (1)

Application number: CN202210189806.4A; priority date: 2022-03-01; filing date: 2022-03-01; title: Encoding method and decoding method of multitask digital retina feature stream

Applications Claiming Priority (1)

Application number: CN202210189806.4A; priority date: 2022-03-01; filing date: 2022-03-01; title: Encoding method and decoding method of multitask digital retina feature stream

Publications (2)

Publication number | Publication date
CN114257817A (en) | 2022-03-29
CN114257817B (en) | 2022-09-02

Family

ID: 80800117

Family Applications (1)

Application number: CN202210189806.4A (Active, granted as CN114257817B (en)); priority date: 2022-03-01; filing date: 2022-03-01; title: Encoding method and decoding method of multitask digital retina feature stream

Country Status (1)

CN: CN114257817B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20140355861A1 * | 2011-08-25 | 2014-12-04 | Cornell University | Retinal encoder for machine vision
CN112306985A * | 2019-07-31 | 2021-02-02 | Beijing Boya Huishi Intelligent Technology Research Institute Co Ltd | Digital retina multi-modal feature combined accurate retrieval method
CN110719438A * | 2019-08-28 | 2020-01-21 | Peking University | Synchronous transmission control method for digital retina video stream and feature stream
CN111090773A * | 2019-08-28 | 2020-05-01 | Peking University | Digital retina architecture and software architecture method and system
CN113242271A * | 2021-03-17 | 2021-08-10 | Peking University | Digital retina-based end, edge and cloud cooperation system, method and equipment
CN113422959A * | 2021-05-31 | 2021-09-21 | Zhejiang Smart Video Security Innovation Center Co Ltd | Video encoding and decoding method and device, electronic equipment and storage medium
CN113591573A * | 2021-06-28 | 2021-11-02 | Beijing Baidu Netcom Science and Technology Co Ltd | Training and target detection method and device for multi-task learning deep network model
CN113382285A * | 2021-08-11 | 2021-09-10 | Zhejiang Smart Video Security Innovation Center Co Ltd | Digital retina data transmission method and device, electronic equipment and storage medium
CN113473142A * | 2021-09-03 | 2021-10-01 | Zhejiang Smart Video Security Innovation Center Co Ltd | Video encoding method, video decoding method, video encoding device, video decoding device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
E. Da F. Melo et al.: "An autonomous multi-feature selective attention neural network model", IJCNN'99 International Joint Conference on Neural Networks, Proceedings *
Gao Wen et al.: "Digital retina: a key link in the evolution of smart city systems" (数字视网膜:智慧城市系统演进的关键环节), Scientia Sinica Informationis (中国科学:信息科学) *

Also Published As

Publication number | Publication date
CN114257817B (en) | 2022-09-02

Similar Documents

Publication Publication Date Title
US10462476B1 (en) Devices for compression/decompression, system, chip, and electronic device
WO2018150083A1 (en) A method and technical equipment for video processing
CN107231566B (en) Video transcoding method, device and system
CN112565777B (en) Deep learning model-based video data transmission method, system, medium and device
CN111641826B (en) Method, device and system for encoding and decoding data
CN116233445B (en) Video encoding and decoding processing method and device, computer equipment and storage medium
CN115409716B (en) Video processing method, device, storage medium and equipment
US10373384B2 (en) Lightfield compression using disparity predicted replacement
CN111918071A (en) Data compression method, device, equipment and storage medium
CN114897189A (en) Model training method, video coding method and decoding method
CN117395381B (en) Compression method, device and equipment for telemetry data
CN114730450A (en) Watermark-based image reconstruction
CN113409803B (en) Voice signal processing method, device, storage medium and equipment
KR20220043912A (en) Method and Apparatus for Coding Feature Map Based on Deep Learning in Multitasking System for Machine Vision
CN114257817B (en) Encoding method and decoding method of multi-task digital retina characteristic stream
CN114501031B (en) Compression coding and decompression method and device
JP2022069398A (en) Video coding device and method, video decoding device and method, and codec system
CN116366852A (en) Video coding and decoding method, device, equipment and medium for machine vision task
KR101581131B1 (en) Transmitting method for video data, video encoder and video decoder
CN115471875B (en) Multi-code-rate pedestrian recognition visual feature coding compression method and device
CN114140363B (en) Video deblurring method and device and video deblurring model training method and device
CN116546220A (en) Man-machine hybrid-oriented video encoding and decoding method, system, equipment and medium
CN113628108B (en) Image super-resolution method and system based on discrete representation learning and terminal
CN116546214A (en) Video coding method, device, system, equipment and medium for man-machine mixing
US20230229894A1 (en) Method and apparatus for compression and training of neural network

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant