CN111953973B

CN111953973B - General video compression coding method supporting machine intelligence

Info

Publication number: CN111953973B
Application number: CN202010895946.4A
Authority: CN
Inventors: 陈志波; 金鑫; 孙思萌; 冯若愚; 冯润森
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2020-08-31
Filing date: 2020-08-31
Publication date: 2022-10-28
Anticipated expiration: 2040-08-31
Also published as: CN111953973A

Abstract

The invention discloses a general video compression coding method supporting machine intelligence. The method compresses video for machine intelligent analysis tasks, so that, for the same analysis task, it can achieve a higher compression ratio than compression aimed at the human eye, which reduces the information to be transmitted and lightens the transmission load. The compressed features can be applied directly to machine intelligent analysis tasks without extra decoding and processing, which reduces the amount of computation, accelerates the execution of machine analysis tasks, and supports edge computing. In addition, partial analysis of the original video/image is supported before compression coding, which can not only improve the accuracy of intelligent analysis but also generate a structured compressed code stream that supports more subsequent intelligent analysis tasks. In summary, the method makes machine-oriented video/image compression more general, flexible and efficient.

Description

General video compression coding method supporting machine intelligence

Technical Field

The invention relates to the technical field of video/image compression coding, in particular to a general video compression coding method supporting machine intelligence.

Background

Existing video/image compression standards mainly target human vision: they aim to keep the bit rate as low as possible for a given level of distortion perceived by the human eye. As machine learning algorithms mature, machine intelligent analysis is gradually being applied in many areas of social life and production, such as intelligent factories, intelligent cities, and intelligent transportation. These applications usually involve analyzing large amounts of video/image data. In the conventional approach, the video/image is compressed with an existing standard, the compressed code stream is decoded before analysis to recover the video/image, and the recovered video/image is then analyzed. This approach has the following problems: 1) Because conventional video/image compression standards target human vision, a large share of the bit rate in the compressed code stream may be spent on content that the analysis does not need, which places a heavy burden on transmission. 2) Because the compressed video/image must be decoded and restored before it can be analyzed, a time delay is introduced, which degrades the user experience. 3) Because the restored video/image contains some distortion, the analysis may produce errors or more serious problems.

With the development of edge computing and terminal intelligence technologies, more machine intelligent analysis can be performed on videos/images at an edge server or on terminal equipment. If a machine-oriented coding method can be realized, so that the coded code stream contains only content useful for machine intelligent analysis, the amount of data that a machine intelligent analysis task must transmit can be greatly reduced. At the same time, the coded code stream can be used directly in the machine intelligent analysis task without recovering the compressed video/image, which reduces computation delay and improves processing efficiency. In addition, performing part of the machine intelligent analysis before coding improves the structure of the code stream and benefits the execution of subsequent intelligent analysis tasks.

In the prior art, the Compact Descriptors for Visual Search (CDVS) international standard encodes the video/image features required by a retrieval task, which meets the above requirements to some extent. However, its feature code stream can only be used for the search task, so the application scenario is narrow and the compression coding needs of more general intelligent applications cannot be met. A general video compression coding method supporting machine intelligence is therefore highly desirable.

Disclosure of Invention

The invention aims to provide a general video compression coding method supporting machine intelligence that encodes the video/image feature information required by each task, thereby improving the analysis accuracy of intelligent tasks and reducing the data transmission pressure.

The purpose of the invention is realized by the following technical scheme:

A general video compression coding method supporting machine intelligence comprises an intra-frame coding part and an inter-frame coding part, wherein:

the intra-frame coding part includes: for an input video frame, first performing object detection to obtain the spatial position information and category information of each object; performing attribute analysis and relationship inference based on the spatial position information and category information of each object to obtain the attribute information of each object and the topological relationship between the objects; then, using the spatial position information and category information of each object as guiding information, dividing the input video frame into coding units using the spatial position information of the objects and encoding the divided coding units, wherein the category information of the objects contained in the resulting code stream is used in the video frame reconstruction process of the inter-frame coding part;

the inter-frame coding part includes: reconstructing a video frame in units of the input video frame or of objects, and obtaining optical flow prediction information and residual coding information through motion compensation;

and the spatial position information and category information of each object, the attribute information of each object, the topological relationship between the objects and the coded coding units obtained by the intra-frame coding part, together with the optical flow prediction information and residual coding information obtained by the inter-frame coding part, are entropy coded to obtain the corresponding code stream.

As can be seen from the technical scheme provided by the invention: 1) The method can support a variety of existing and possible future tasks, so it has a wide application range and strong practical value. 2) Because compression targets machine intelligent analysis tasks, a higher compression ratio than human-eye-oriented compression can be obtained for the same analysis task, which reduces the information to be transmitted and lightens the transmission burden. 3) The compressed features can be applied directly to machine intelligent analysis tasks without extra decoding and processing, which reduces the amount of computation, accelerates the execution of machine analysis tasks, and supports edge computing. 4) The general coding framework supports partial analysis of the original video/image before compression coding, which can improve intelligent analysis accuracy and generate a structured compressed code stream that supports more subsequent intelligent analysis tasks. In summary, the scheme makes machine-oriented video/image compression more general, flexible and efficient.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a block diagram of a general video compression encoding method supporting machine intelligence according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of an encoding process according to an embodiment of the present invention;

Fig. 3 is a schematic view of a code stream structure of an intra-frame coding part according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention provides a general video compression coding method supporting machine intelligence. Unlike the coding mode of a traditional video coding framework, it performs compression with a coding framework based on deep machine learning. Coding units can be divided in the pixel domain, and division in the latent-variable (hidden-variable) domain is also supported. As shown in Fig. 1, the method mainly comprises an intra-frame coding part and an inter-frame coding part.

1. Intra-frame coding part.

As shown in Fig. 2, the intra-frame coding part includes an object detection module, an encoder, a spatial relationship inference module, a semantic relationship inference module, and an attribute analysis module.

The main process is as follows. For an input video frame x_t, object detection is performed first to obtain the spatial position information and category information of each object. Then, combining the video frame x_t with the spatial position information and category information of each object, further mining is carried out, including attribute analysis and relationship inference, to obtain the attribute information of each object and the topological relationship between the objects. For a pedestrian, for example, the attribute information includes features of each body part, such as head features, upper/lower body features, and accessory features. Finally, the spatial position information and category information of each object are used as guiding information to divide the input video frame into coding units, and the divided coding units are encoded.
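The order of the stages described above can be sketched as follows. This is only an illustration of the data flow: the names `DetectedObject`, `IntraResult`, `intra_encode` and the toy stand-in functions are hypothetical and do not come from the patent, and a real system would plug in learned detection, analysis and encoding networks.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels; an assumed convention


@dataclass
class DetectedObject:
    box: Box                 # spatial position information
    category: str            # category information
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class IntraResult:
    objects: List[DetectedObject]
    relations: List[Tuple[int, str, int]]  # (object i, relation, object j)
    coded_units: List[bytes]


def intra_encode(frame, detect, analyze, infer, encode_units) -> IntraResult:
    """Run the intra-frame stages in the order given in the description."""
    objects = detect(frame)                      # 1. object detection
    for obj in objects:
        obj.attributes = analyze(frame, obj)     # 2a. attribute analysis
    relations = infer(objects)                   # 2b. relationship inference
    coded_units = encode_units(frame, objects)   # 3. guided division and coding
    return IntraResult(objects, relations, coded_units)


# Toy stand-ins so the sketch runs; they are placeholders, not real models.
def toy_detect(frame):
    return [DetectedObject((0, 0, 2, 2), "pedestrian")]


def toy_analyze(frame, obj):
    return {"upper_body": "unknown"}


def toy_infer(objects):
    return []


def toy_encode_units(frame, objects):
    # One placeholder unit per object plus one for the background.
    return [b"obj"] * len(objects) + [b"bg"]


frame = [[0] * 4 for _ in range(4)]
result = intra_encode(frame, toy_detect, toy_analyze, toy_infer, toy_encode_units)
```

Passing the stages in as callables keeps the sketch independent of any particular detector or encoder.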

In the embodiment of the present invention, the processing unit is an object in the video or the background outside the objects. An object may be delimited by a rectangular box containing one or more objects or by a closed boundary of arbitrary shape containing one or more objects, as shown in Fig. 2.

In the embodiment of the present invention, the relationship inference includes spatial relationship inference and semantic relationship inference. Spatial relationship inference uses the spatial position information of each object to obtain the spatial relationships among the objects. Semantic relationship inference uses the category information of the objects to obtain the semantic relationships among the objects. Together, the spatial relationships and the semantic relationships form the topological relationship.
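A minimal sketch of how the two kinds of relationship could be combined into one topology is given below. The patent does not specify the inference modules, so everything here is an assumption for illustration: boxes are `(x0, y0, x1, y1)` with y increasing downward, the spatial relation is a simple geometric rule, and `SEMANTIC_RULES` is a hypothetical lookup table standing in for a learned semantic module.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), y grows downward (assumed)

# Hypothetical category-pair rules; a real system would learn these.
SEMANTIC_RULES: Dict[Tuple[str, str], str] = {
    ("pedestrian", "vehicle"): "may_interact",
    ("pedestrian", "pedestrian"): "same_class",
}


def spatial_relation(a: Box, b: Box) -> str:
    """Describe where box a lies relative to box b using positions only."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    overlap_w = min(ax1, bx1) - max(ax0, bx0)
    overlap_h = min(ay1, by1) - max(ay0, by0)
    if overlap_w > 0 and overlap_h > 0:
        return "overlaps"
    if ax1 <= bx0:
        return "left_of"
    if bx1 <= ax0:
        return "right_of"
    if ay1 <= by0:
        return "above"
    return "below"


def semantic_relation(cat_a: str, cat_b: str) -> str:
    """Describe the relation between two objects using categories only."""
    return SEMANTIC_RULES.get((cat_a, cat_b), "unrelated")


def topological_relations(
    boxes: List[Box], categories: List[str]
) -> List[Tuple[int, int, str, str]]:
    """Pair every two objects and attach both relation types."""
    edges = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            edges.append((
                i,
                j,
                spatial_relation(boxes[i], boxes[j]),
                semantic_relation(categories[i], categories[j]),
            ))
    return edges


edges = topological_relations(
    [(0, 0, 10, 10), (20, 0, 30, 10)], ["pedestrian", "vehicle"]
)
```

The spatial part reads only positions and the semantic part reads only categories, which mirrors the split described in the text.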

In the embodiment of the invention, the spatial position information and the category information of each object are used as the guiding information. Dividing the input video frame into coding units using the spatial position information of the objects, and encoding the divided coding units, proceeds as follows. Each object is mapped into the latent-variable (hidden-variable) space to be coded according to its spatial position information. The latent variables, which are one form of coding unit, are then divided semantically according to the mapped spatial position information, giving the latent variables to be coded for each semantic region. The divided latent variables are coded in order from top to bottom and from left to right. The resulting code stream also contains the category information of the objects, which serves as the object mark information (for example pedestrian-1, vehicle-2, pedestrian-3) required by the decoder when reconstructing video frames in the inter-frame coding part.
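The mapping and division step might look like the sketch below. It assumes, purely for illustration, that the encoder downsamples the frame by a fixed `stride` so that a pixel box maps to a latent box by integer division, and that the latent tensor has shape `(channels, height, width)`. The function names are hypothetical and the patent does not prescribe this particular mapping.

```python
import math
from typing import List, Tuple

import numpy as np

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels (assumed)


def box_to_latent(box: Box, stride: int) -> Box:
    """Map a pixel box to latent coordinates, rounding outward."""
    x0, y0, x1, y1 = box
    return (x0 // stride, y0 // stride,
            math.ceil(x1 / stride), math.ceil(y1 / stride))


def partition_latent(latent: np.ndarray, boxes: List[Box], stride: int):
    """Split a (C, H, W) latent into per-object units plus a background mask.

    Units are returned top to bottom, then left to right, following the
    coding order stated in the description.
    """
    _, h, w = latent.shape
    mapped = [(i, box_to_latent(b, stride)) for i, b in enumerate(boxes)]
    mapped.sort(key=lambda item: (item[1][1], item[1][0]))  # by top, then left
    background = np.ones((h, w), dtype=bool)
    units = []
    for idx, (lx0, ly0, lx1, ly1) in mapped:
        lx1, ly1 = min(lx1, w), min(ly1, h)  # clip to the latent grid
        units.append((idx, latent[:, ly0:ly1, lx0:lx1]))
        background[ly0:ly1, lx0:lx1] = False
    return units, background


latent = np.zeros((4, 8, 8), dtype=np.float32)   # e.g. a 128x128 frame, stride 16
boxes = [(64, 64, 128, 128), (0, 0, 32, 32)]
units, background = partition_latent(latent, boxes, stride=16)
```

Each unit keeps its original object index, so category marks such as pedestrian-1 can still be attached after the reordering.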

2. Inter-frame coding part.

The inter-frame coding part reconstructs a video frame in units of the input video frame or of objects, and obtains optical flow prediction information and residual coding information through motion compensation. This may be achieved by conventional techniques.
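As a rough illustration of motion compensation with optical flow and a residual, the sketch below warps a reference frame with a given flow field and stores what the prediction misses. It assumes the flow is already available (flow estimation is outside the sketch), uses nearest-neighbour backward warping on a single-channel frame, and leaves the residual uncompressed; none of these choices is specified by the patent.

```python
import numpy as np


def warp(reference: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp an (H, W) frame with an (H, W, 2) flow of (dx, dy)."""
    h, w = reference.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return reference[src_y, src_x]


def inter_encode(current: np.ndarray, reference: np.ndarray, flow: np.ndarray):
    """Return the residual left after motion-compensated prediction."""
    prediction = warp(reference, flow)
    return current.astype(np.int16) - prediction.astype(np.int16)


def inter_decode(reference: np.ndarray, flow: np.ndarray, residual: np.ndarray):
    """Rebuild the current frame from the reference, the flow and the residual."""
    prediction = warp(reference, flow).astype(np.int16)
    return (prediction + residual).astype(reference.dtype)


reference = np.arange(16, dtype=np.uint8).reshape(4, 4)
current = np.roll(reference, 1, axis=1)          # content moved one pixel right
flow = np.zeros((4, 4, 2), dtype=np.float32)
flow[..., 0] = -1.0                              # each pixel looks one step left
residual = inter_encode(current, reference, flow)
decoded = inter_decode(reference, flow, residual)
```

In this toy case the flow explains everything except the first column, so only that column carries a non-zero residual.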

3. Entropy coding to generate the code stream.

As shown in Fig. 1, the spatial position information and category information of each object, the attribute information of each object, the topological relationship between the objects and the coded coding units obtained by the intra-frame coding part, together with the optical flow prediction information and residual coding information obtained by the inter-frame coding part, are entropy coded to obtain the corresponding code stream.
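The patent does not name a particular entropy coder. To give a feel for what entropy coding buys, the sketch below estimates the ideal (Shannon) code length of a symbol sequence under its own empirical distribution, which is roughly the size an ideal entropy coder with that static model would approach, ignoring the cost of sending the model. The function name `ideal_code_length_bits` is hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def ideal_code_length_bits(symbols: Sequence[Hashable]) -> float:
    """Sum of -log2(p(s)) over the sequence, with p taken from the counts."""
    counts = Counter(symbols)
    total = len(symbols)
    return sum(-c * math.log2(c / total) for c in counts.values())


skewed = [0, 0, 0, 0, 0, 0, 0, 1]   # e.g. a mostly-zero residual
uniform = [0, 1, 2, 3, 0, 1, 2, 3]  # four equally likely symbols

skewed_bits = ideal_code_length_bits(skewed)
uniform_bits = ideal_code_length_bits(uniform)
```

A sequence dominated by one symbol needs far fewer bits than a uniform one of the same length, which is why sparse residuals and well-predicted latents compress well.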

Fig. 3 shows the code stream structure of the intra-frame coding part.

The code stream of the intra-frame coding part consists of object header information, object attribute information, and an object information stream. The object header information includes the spatial position information, category information, and topological relation information of the objects. The object information stream carries the image content of each object and of the background; the objects are those found in the initial object detection, and the remaining portion of the frame is the background.
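The three-part structure can be modelled with plain data classes, as in the sketch below. The field names and types are assumptions made for illustration; the patent lists which kinds of information each part contains but does not define a concrete syntax.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), an assumed convention


@dataclass
class ObjectHeader:
    """Object header information: position, category and topology."""
    boxes: List[Box]
    categories: List[str]                   # e.g. "pedestrian-1"
    relations: List[Tuple[int, str, int]]   # (object i, relation, object j)


@dataclass
class IntraCodeStream:
    header: ObjectHeader
    attributes: Dict[int, Dict[str, str]]   # object attribute information
    object_payloads: List[bytes]            # object information stream: objects
    background_payload: bytes               # object information stream: background

    def payload_bytes(self) -> int:
        """Size of the object information stream (objects plus background)."""
        return sum(len(p) for p in self.object_payloads) + len(self.background_payload)


stream = IntraCodeStream(
    header=ObjectHeader(
        boxes=[(0, 0, 32, 64), (40, 10, 120, 60)],
        categories=["pedestrian-1", "vehicle-2"],
        relations=[(0, "left_of", 1)],
    ),
    attributes={0: {"upper_body": "dark jacket"}},
    object_payloads=[b"\x01" * 10, b"\x02" * 20],
    background_payload=b"\x00" * 5,
)
```

Keeping the header and attributes apart from the payloads is what lets a consumer answer simple queries (how many vehicles, where they are) without touching the image data.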

The compression coding process can be carried out at the edge for certain specific tasks, or in the cloud for a variety of tasks.

In practical application, the code stream is transmitted or stored. When the terminal decompresses it, the terminal decodes the relevant parts according to the header information defined during compression coding (that is, the header information required for decompression and the corresponding object information stream) to obtain the feature information for a specific task. This feature information is then fed to the task to obtain its analysis result.
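One way to make such task-specific decoding cheap is to prefix every section with its length, so that a reader can skip the parts a task does not need. The sketch below assumes a hypothetical layout of three sections in a fixed order, each preceded by a 4-byte big-endian length; this layout is an illustration and is not defined by the patent.

```python
import io
import struct
from typing import Dict, Iterable

SECTION_NAMES = ("header", "attributes", "objects")  # hypothetical fixed order


def write_stream(sections: Dict[str, bytes]) -> bytes:
    """Concatenate the sections, each prefixed with its byte length."""
    out = io.BytesIO()
    for name in SECTION_NAMES:
        payload = sections[name]
        out.write(struct.pack(">I", len(payload)))
        out.write(payload)
    return out.getvalue()


def read_selected(stream: bytes, wanted: Iterable[str]) -> Dict[str, bytes]:
    """Return only the requested sections, seeking past the others."""
    wanted = set(wanted)
    reader = io.BytesIO(stream)
    result = {}
    for name in SECTION_NAMES:
        (length,) = struct.unpack(">I", reader.read(4))
        if name in wanted:
            result[name] = reader.read(length)
        else:
            reader.seek(length, io.SEEK_CUR)  # skip without reading the payload
    return result


packed = write_stream({
    "header": b"positions+categories",
    "attributes": b"attrs",
    "objects": b"\x00" * 1000,
})
header_only = read_selected(packed, ["header"])
```

A counting or localization task could stop at the header, while a task that needs appearance features would also request the attribute or object sections.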

Based on the scheme of the embodiment of the invention, parsing only part of the code stream data can support image analysis tasks such as object detection, object segmentation, image enhancement and image understanding, as well as video analysis tasks such as pedestrian tracking, behavior recognition and anomaly detection. The data can also be decoded to support visual analysis and manual identification, and the entire code stream can be decoded to generate the complete image/video data.

From the description of the above embodiments, it is clear to those skilled in the art that the embodiments may be implemented by software, or by software plus a necessary general hardware platform. Based on this understanding, the technical solutions of the embodiments may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and which includes several instructions that enable a computer device (such as a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are also within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for universal video compression coding that supports machine intelligence, comprising: intra-frame coding and inter-frame coding; wherein:

the intra-frame encoding section includes: for an input video frame, firstly carrying out object detection to obtain spatial position information and category information of each object; performing attribute analysis and relationship reasoning based on the spatial position information and the category information of each object to obtain attribute information of each object and a topological relationship between the objects; then, the spatial position information and the category information of each object are used as guiding information, the spatial position information of the object is used for dividing the coding unit of the input video frame, the divided coding unit is coded, and the category information of the object contained in the code stream obtained by coding is used for the video frame reconstruction process of the interframe coding part;

the inter-frame encoding section includes: reconstructing a video frame by taking an input video frame or an object as a unit, and obtaining optical flow prediction information and residual coding information through motion compensation;

entropy coding spatial position information and category information of each object obtained by the intra-frame coding part, attribute information of each object, topological relation between the objects, a coded coding unit and optical flow prediction information and residual coding information obtained by the inter-frame coding part to obtain a corresponding code stream;

wherein using the spatial position information and category information of each object as guiding information and dividing the input video frame into coding units using the spatial position information of the objects comprises: mapping each object to the latent-variable (hidden-variable) space to be coded according to the spatial position information of the object, performing semantic division of the latent variables according to the mapped spatial position information to obtain the latent variables to be coded corresponding to each semantic region, and then coding the divided latent variables in order from top to bottom and from left to right; and the category information of the objects contained in the code stream obtained by coding is used as the object mark information required by a decoder in the video frame reconstruction process of the inter-frame coding part.

2. The method of claim 1, wherein the relationship reasoning comprises: spatial relationship reasoning and semantic relationship reasoning;

carrying out spatial relationship reasoning by utilizing the spatial position information of each object to obtain the spatial relationship among the objects;

performing semantic relation reasoning by utilizing the category information of the objects to obtain the semantic relation among the objects;

the spatial relationship and the semantic relationship form a topological relationship.

3. The method of claim 1, wherein the code stream structure of the intra-frame coding part is: object header information, object attribute information, and an object information stream;

wherein the object header information includes: spatial position information, category information and topological relation information of the objects; and the object information stream includes: images corresponding to the objects and to the background.