CN113034626A

CN113034626A - Optimization method for alignment of target object in feature domain in structured image coding

Info

Publication number: CN113034626A
Application number: CN202110235413.8A
Authority: CN
Inventors: 陈志波; 孙思萌; 冯润森; 金鑫; 冯若愚
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2021-03-03
Filing date: 2021-03-03
Publication date: 2021-06-25
Anticipated expiration: 2041-03-03
Also published as: CN113034626B

Abstract

The invention discloses an optimization method for aligning a target object in the feature domain in structured image coding. The method aligns the target object in the feature domain and resolves the positional offset between object information in the original image and in the compressed features that arises in existing neural-network-based structured image coding frameworks. It thereby ensures the integrity of the target object information in the object code stream of the structured code stream, improving both the quality of partial decoding and the accuracy of partial analysis tasks.

Description

Optimization method for alignment of target object in feature domain in structured image coding

Technical Field

The invention relates to the technical field of image coding, in particular to an optimization method for alignment of a target object in a feature domain in structured image coding.

Background

Existing video/image compression standards mainly target compression for human viewing. As machine learning algorithms mature, machine intelligent analysis tasks are increasingly applied across human social life and production, for example in smart factories, smart cities, and intelligent transportation. To ensure the interpretability and robustness of intelligent analysis results in many open scenarios, new paradigms such as human-machine intelligent interaction and collaboration and hybrid-augmented intelligence often need to be introduced.

To support human-machine hybrid intelligence applications more efficiently, existing work has proposed the concept of a semantically structured code stream. Examples include, first, a task-driven image coding method with a structured code stream, and second, a general video compression coding method that supports machine intelligence. Taking the first as an example, it introduces a region decision network and an alignment module for object detection, extracts bounding boxes of regions where objects may exist based on the compressed features, and segments the features at the spatial level. The segmented features are entropy-coded separately and placed into the code stream in sequence to form a structured code stream.

However, when the structured code stream coding method is combined directly with deep-learning-based compression coding methods, the two are often incompatible. Specifically, existing neural-network-based coding frameworks typically consist of an encoder and a decoder. The encoder maps the input image to compressed features (that is, latent features used for compression, whose size is usually smaller than that of the original image). The compressed features are quantized and entropy-coded to form a code stream for storage and transmission; the code stream is then entropy-decoded to recover the compressed features, from which the decoder reconstructs the original image. When such a framework is combined with semantically structured code stream coding, object detection results (category labels and bounding boxes) are first obtained from the original image. After the compressed features are quantized, the region information related to each object is extracted according to its bounding box and entropy-coded in sequence to form a structured code stream. However, the bounding box of an object in the compressed features cannot be obtained simply by downsampling the bounding box obtained from the image: doing so does not preserve all of the object-related information in the compressed features, which can degrade the quality of reconstruction performed by the terminal from the structured code stream or the results of intelligent analysis tasks.
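To make the naive baseline concrete, the following is a minimal sketch of downsampling an image-domain bounding box to feature resolution. The function name `downsample_bbox` and the total encoder stride of 16 are illustrative assumptions, not values given in this document. One plausible reason this baseline loses object information, which the source does not state, is that each feature element is computed from an image neighbourhood wider than the stride, so object information can spread beyond the scaled box.

```python
def downsample_bbox(x, y, w, h, stride=16):
    """Scale a pixel-domain box (x, y, w, h) to feature-domain coordinates.

    The stride of 16 is an assumed total downsampling factor of the encoder.
    The box is rounded outward so every feature cell it touches is included.
    """
    x0 = x // stride                 # floor of the left edge
    y0 = y // stride                 # floor of the top edge
    x1 = -(-(x + w) // stride)       # ceiling of the right edge
    y1 = -(-(y + h) // stride)       # ceiling of the bottom edge
    return x0, y0, x1 - x0, y1 - y0


feature_box = downsample_bbox(37, 50, 100, 60)
```

Even with outward rounding, the result is still a fixed geometric rescaling; it does not adapt to how a particular encoder distributes object information across its features.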

Disclosure of Invention

The invention aims to provide an optimization method for aligning a target object in a feature domain in structured image coding, which can realize the alignment of the target object in the feature domain so as to improve the quality of partial decoding and improve the accuracy of partial analysis tasks.

The purpose of the invention is realized by the following technical scheme:

a method for optimizing the alignment of a target object in a feature domain in the coding of a structured image comprises the following steps:

setting an optimization module in a structured image coding framework to realize adaptive mapping between the position of a specified target object in the original image and its position in the compressed features; the input of the optimization module is the pixel-level position information of the specified target object in the original image; the optimization module maps the pixel-level position information to obtain compressed-feature-level position information;

and then, based on the compressed-feature-level position information, transforming the quantized compressed features output by the encoder in the structured image coding framework to obtain compressed features containing only the specified target object.

The technical scheme provided by the invention solves the problem that the positions of object information in the original image and in the compressed features are misaligned in existing neural-network-based structured image coding frameworks, and ensures the integrity of the target object information in the object code stream of the structured code stream, thereby improving the quality of partial decoding and the accuracy of partial analysis tasks.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of an optimization method for aligning a target object in a feature domain in structured image coding according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention provides an optimization method for aligning a target object in the feature domain in structured image coding, which solves the problem of misalignment between the positions of object information in the original image and in the compressed features in existing neural-network-based structured image coding frameworks. The codec part can be any existing neural-network-based structured image coding framework (model); during training its parameters are fixed and only the internal parameters of the optimization module are trained. In actual use, an arbitrary image is input together with the position information of a target object, and the compressed-feature-level position information is obtained and used as the basis for generating the semantically structured code stream.

The advantages of the method lie mainly in the following three aspects:

1) An optimization module is designed that realizes adaptive mapping between the object-related region in the image and the object-related region in the compressed features, which effectively solves the problem of positional offset between object information in the original image and in the compressed features in neural-network-based structured image coding frameworks.

2) The optimization module is trained as an independent module, so the performance of the original coding framework is not affected and the coding performance of semantically structured image coding is preserved.

3) The optimization module is simple and easy to implement, and can be combined with various neural-network-based image coding frameworks to realize a semantically structured image coding framework. This greatly improves the flexibility of the semantically structured image coding framework, in that the encoder and decoder modules are replaceable.

For ease of understanding, the methods provided by the present invention are described in detail below.

As shown in Fig. 1, the input of the optimization module is the pixel-level position information of a specified target object in the original image, and the optimization module maps the pixel-level position information to obtain compressed-feature-level position information. Then, based on the compressed-feature-level position information, the quantized compressed features output by the encoder in the structured image coding framework are transformed to obtain compressed features containing only the specified target object.

In the embodiment of the invention, the optimization module is implemented by several (for example, 2 to 3) two-dimensional convolutional layers, and a Sigmoid function provides the nonlinearity of the mapping function. Specifically, the Sigmoid function serves two purposes:

1) It is used together with the convolutional layers to make the mapping nonlinear. A two-dimensional convolutional layer by itself realizes a linear mapping between layers; combined with the Sigmoid function, it realizes a nonlinear mapping. The parameters of the optimization network formed by the 2 to 3 convolutional layers are updated through the gradient backpropagation algorithm, so that fitting yields the optimal nonlinear mapping function from the input pixel-level position information to the compressed-feature-level position information.

2) At the end of the optimization module, it constrains every element of the output compressed-feature-level position information to be as close to binary as possible (i.e., either 0 or 1).
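The forward pass of such a module can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: a single input and output channel, two layers with 5x5 kernels and stride 4 each (total stride 16), and random untrained weights. None of these values come from the source, which specifies only 2 to 3 two-dimensional convolutional layers combined with a Sigmoid function; in the described method the output would also match the full dimensionality of the quantized compressed features.

```python
import math
import random


def conv2d(x, kernel, bias, stride):
    """Single-channel 2D convolution with zero padding of k // 2."""
    k = len(kernel)
    pad = k // 2
    h, w = len(x), len(x[0])
    out_h = (h + 2 * pad - k) // stride + 1
    out_w = (w + 2 * pad - k) // stride + 1
    out = []
    for oy in range(out_h):
        row = []
        for ox in range(out_w):
            acc = bias
            for ky in range(k):
                for kx in range(k):
                    iy = oy * stride + ky - pad
                    ix = ox * stride + kx - pad
                    if 0 <= iy < h and 0 <= ix < w:  # zero padding outside
                        acc += kernel[ky][kx] * x[iy][ix]
            row.append(acc)
        out.append(row)
    return out


def sigmoid_map(x):
    """Apply the Sigmoid function element-wise."""
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in x]


def optimization_module(pixel_mask, layers):
    """Map a pixel-level mask to a soft feature-level mask in (0, 1)."""
    x = pixel_mask
    for kernel, bias, stride in layers:
        x = sigmoid_map(conv2d(x, kernel, bias, stride))
    return x


# Hypothetical untrained parameters: two 5x5 layers, stride 4 each (16 total).
rng = random.Random(0)
layers = [
    ([[rng.uniform(-0.5, 0.5) for _ in range(5)] for _ in range(5)], 0.0, 4)
    for _ in range(2)
]
pixel_mask = [[1.0 if 16 <= r < 48 and 8 <= c < 40 else 0.0 for c in range(64)]
              for r in range(64)]
soft_mask = optimization_module(pixel_mask, layers)  # 4 x 4 values in (0, 1)
```

The final Sigmoid keeps every output element strictly between 0 and 1; training is what would push those values toward 0 or 1.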

In this embodiment of the present invention, the pixel-level position information may be obtained in a conventional manner. Specifically, the pixel-level position information of the specified target object may be a bounding box (x_i, y_i, w_i, h_i) containing the specified target object, where (x_i, y_i) is the position of the top-left vertex of the bounding box, and w_i and h_i are the width and height of the bounding box, respectively.

In the embodiment of the present invention, the pixel-level position information may be converted into a binary region mask, that is, pixels inside the bounding box are set to 1 and pixels outside the bounding box are set to 0. The optimization module then maps the binary region mask to obtain compressed-feature-level position information with the same dimensionality as the quantized compressed features.
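The conversion from a bounding box to a binary region mask can be sketched as below. The function name `bbox_to_mask` is hypothetical, and treating the box as the half-open ranges [x, x + w) and [y, y + h) is an assumption, since the source does not say whether the far edges are inclusive.

```python
def bbox_to_mask(x, y, w, h, height, width):
    """Return a height x width binary mask: 1 inside the box, 0 outside."""
    return [
        [1 if x <= col < x + w and y <= row < y + h else 0
         for col in range(width)]
        for row in range(height)
    ]


mask = bbox_to_mask(x=2, y=1, w=3, h=2, height=4, width=6)
```

Here the mask has the same height and width as the image, so it can be fed directly to the optimization module as its pixel-level input.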

In the embodiment of the present invention, the compressed-feature-level position information may be in binary form. It is used as an optimized region mask and multiplied element-wise with the quantized compressed features to obtain compressed features containing only the specified target object, which the decoder then decodes to obtain a reconstructed image containing only that object.
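The masking step can be illustrated with a small sketch. The helper names `binarize` and `apply_mask` and the 0.5 threshold are assumptions made for illustration; the source states only that the position information may be in binary form and is multiplied element-wise with the quantized compressed features.

```python
def binarize(soft_mask, threshold=0.5):
    """Round a soft mask (C x H x W nested lists) to hard 0/1 values."""
    return [[[1 if v >= threshold else 0 for v in row] for row in ch]
            for ch in soft_mask]


def apply_mask(features, mask):
    """Element-wise product of two C x H x W nested lists."""
    return [[[f * m for f, m in zip(f_row, m_row)]
             for f_row, m_row in zip(f_ch, m_ch)]
            for f_ch, m_ch in zip(features, mask)]


quantized = [[[3, -1], [0, 2]], [[-4, 5], [1, -2]]]   # toy 2 x 2 x 2 features
soft_mask = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.6, 0.4]]]
object_features = apply_mask(quantized, binarize(soft_mask))
```

Because the mask has the same dimensionality as the features, it can keep or drop each channel element independently rather than cutting out one rectangle shared by all channels.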

During training, the error between the reconstructed image containing only the specified target object and the image containing only the specified target object in the original image (i.e., the reconstruction distortion D of the object-related information shown at the bottom of Fig. 1) is used as the loss, and the parameters of the optimization module are updated through the gradient backpropagation algorithm. Training iterates until convergence (i.e., the loss is essentially unchanged and the model parameters are essentially no longer updated), after which the optimization module achieves adaptive mapping between the object-related region in the image and the object-related region in the compressed features. The gradient backpropagation algorithm used in training may be any conventional scheme.
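The loss described above can be sketched as follows. Mean squared error is an assumed choice of distortion measure, since the source names only a reconstruction distortion D; the images are toy single-channel arrays, and `reconstructed` stands in for the output of the frozen decoder.

```python
def object_only(image, pixel_mask):
    """Keep pixels inside the object region and zero out the rest."""
    return [[p * m for p, m in zip(i_row, m_row)]
            for i_row, m_row in zip(image, pixel_mask)]


def mse(a, b):
    """Mean squared error between two equally sized 2D images."""
    count = sum(len(row) for row in a)
    total = sum((p - q) ** 2
                for a_row, b_row in zip(a, b)
                for p, q in zip(a_row, b_row))
    return total / count


original = [[10, 20], [30, 40]]
pixel_mask = [[1, 0], [1, 1]]
reconstructed = [[12, 0], [30, 37]]   # stand-in for the decoder output
distortion = mse(reconstructed, object_only(original, pixel_mask))
```

In a full training loop this value would be backpropagated through the frozen decoder and the masking step, with only the parameters of the optimization module receiving updates.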

When structured code stream coding is performed, the pixel-level position information and the compressed-feature-level position information are transmitted as part of the header information of the structured code stream. The pixel-level position information is used for machine intelligent analysis tasks, and the compressed-feature-level position information is used to recover the complete compressed features when the complete image is reconstructed. The syntax structure of the header information is shown in Table 1, where bbox_enabled_flag is the switch flag for the pixel-level position information (bounding box coordinates), refined_bbox_enabled_flag is the switch flag for the compressed-feature-level position information, and bbox_length_minus1 and refined_bbox_length_minus1 are the code stream lengths of the two types of position information, respectively.

Table 1: Syntax structure of the header information
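As an illustration of how the four named syntax elements could be carried, the sketch below packs and parses a header. The layout is purely hypothetical: the field order, the single flags byte, the 16-bit big-endian length fields, and the reading of each length as a byte count minus one are all assumptions and are not taken from Table 1.

```python
import struct


def pack_header(bbox_payload, refined_payload):
    """Serialize both kinds of position information into header bytes."""
    bbox_enabled_flag = 1 if bbox_payload else 0
    refined_bbox_enabled_flag = 1 if refined_payload else 0
    out = bytearray([(bbox_enabled_flag << 1) | refined_bbox_enabled_flag])
    for enabled, payload in ((bbox_enabled_flag, bbox_payload),
                             (refined_bbox_enabled_flag, refined_payload)):
        if enabled:
            out += struct.pack(">H", len(payload) - 1)  # *_length_minus1
            out += payload
    return bytes(out)


def unpack_header(data):
    """Parse header bytes back into (bbox_payload, refined_payload)."""
    flags, pos, payloads = data[0], 1, []
    for bit in (1, 0):  # bit 1: bbox flag, bit 0: refined bbox flag
        payload = b""
        if (flags >> bit) & 1:
            (length_minus1,) = struct.unpack_from(">H", data, pos)
            pos += 2
            payload = data[pos:pos + length_minus1 + 1]
            pos += length_minus1 + 1
        payloads.append(payload)
    return payloads[0], payloads[1]


header = pack_header(b"\x01\x02\x03\x04", b"\xff")
```

Storing a length as its value minus one is a common convention in video coding syntax for fields that can never be zero; when a flag is off, the corresponding length and payload are simply omitted.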

From the above description of the embodiments, it is clear to those skilled in the art that the embodiments can be implemented by software, or by software plus the necessary general-purpose hardware platform. On this understanding, the technical solutions of the embodiments can be embodied as a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and which includes several instructions that cause a computer device (such as a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for optimizing the alignment of a target object in a feature domain in structured image coding is characterized by comprising the following steps:

2. The method of claim 1, wherein the step of mapping the pixel-level location information to obtain the compressed feature-level location information by the optimization module comprises:

the pixel-level position information of the specified target object is a bounding box (x_i, y_i, w_i, h_i) containing the specified target object, where (x_i, y_i) is the position of the top-left vertex of the bounding box, and w_i and h_i are the width and height of the bounding box, respectively; the bounding box containing the specified target object is converted into a binary region mask, that is, pixels inside the bounding box are set to 1 and pixels outside the bounding box are set to 0;

the optimization module maps the area mask in the binary form to obtain the compression feature level position information with the same dimensionality as the quantized compression feature.

3. The method as claimed in claim 1, wherein the optimization module is implemented by multiple layers of two-dimensional convolution layers, and uses Sigmoid function to implement the non-linearity of the mapping function.

4. The method of claim 1, wherein the transforming the quantized compressed features output by an encoder in the framework of structured image coding based on the compressed feature level position information comprises:

element multiplying the compressed feature level position information with the quantized compressed feature, wherein the compressed feature level position information and the quantized compressed feature have the same dimensionality.

5. The method for optimizing the alignment of the target object in the feature domain in the structured image coding according to any one of claims 1 to 4, wherein after the compression features only containing the designated target object are obtained, the decoding reconstruction is performed by a decoder in a structured image coding frame to obtain the reconstructed image only containing the designated target object;

during training, the error between the reconstructed image containing only the specified target object and the image containing only the specified target object in the original image is used as the loss, and the parameters of the optimization module are updated through the gradient backpropagation algorithm.

6. The method according to any one of claims 1 to 4, wherein the pixel-level position information and the compressed-feature-level position information are transmitted as part of the header information of the structured code stream, the pixel-level position information is used for machine intelligent analysis tasks, and the compressed-feature-level position information is used to recover the complete compressed features when the complete image is reconstructed.