WO2024078403A1

WO2024078403A1 - Image processing method and apparatus, and device

Info

Publication number: WO2024078403A1
Application number: PCT/CN2023/123322
Authority: WO
Inventors: 李胜曦; 刘铁; 陈超然; 张子夫; 徐迈; 吕卓逸
Original assignee: 维沃移动通信有限公司
Priority date: 2022-10-13
Filing date: 2023-10-08
Publication date: 2024-04-18

Abstract

The present application belongs to the technical field of coding and decoding. Disclosed are an image processing method and apparatus, and a device. The image processing method in the embodiments of the present application comprises: acquiring a first image to be processed of a target object, wherein the first image to be processed comprises a first feature map of a first image of the target object or comprises a first sub-video in a first video of the target object; and on the basis of a first compression network, processing the first image to be processed, so as to acquire a reconstructed second feature map or a reconstructed second sub-video, wherein the second feature map is a feature map of the first image, image features of the reconstructed second feature map are different from image features of the first feature map, and an image of the reconstructed second sub-video is different from an image of the first sub-video.

Description

Image processing method, device and equipment

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to Chinese Patent Application No. 202211254689.1 filed in China on October 13, 2022, the entire contents of which are incorporated herein by reference.

Technical Field

The present application belongs to the field of coding and decoding technology, and specifically relates to an image processing method, device and equipment.

Background Art

In machine vision applications, videos or images are generally compressed in order to avoid directly transmitting videos or images with a large amount of data. Traditional image compression standards are designed for a wide range of image and video compression tasks. Images or videos are highly correlated in space and time, but the feature maps of images or videos do not have this property. Directly using traditional image compression methods to process image feature maps therefore cannot guarantee coding efficiency or the quality of the reconstructed feature maps.

Summary of the invention

The embodiments of the present application provide an image processing method, apparatus and device, which can solve the problem in the related art that the traditional image compression method used to process the image feature map cannot guarantee the encoding efficiency and the quality of the reconstructed feature map.

In a first aspect, an image processing method is provided, comprising:

acquiring a first image to be processed of a target object, wherein the first image to be processed includes a first feature map of a first image of the target object or a first sub-video in a first video of the target object;

processing the first image to be processed based on a first compression network to obtain a reconstructed second feature map or a reconstructed second sub-video;

wherein the second feature map is a feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map;

and the image of the reconstructed second sub-video is different from the image of the first sub-video.

In a second aspect, an image processing device is provided, comprising:

A first acquisition module, used to acquire a first image to be processed of the target object, wherein the first image to be processed includes a first feature map of the first image of the target object or a first sub-video in a first video of the target object;

A second acquisition module, used to process the first to-be-processed image based on the first compression network to obtain a reconstructed second feature map or a second sub-video;

The second sub-video is a partial video in the first video, and the image of the reconstructed second sub-video is different from the image of the first sub-video.

In a third aspect, an image processing device is provided, which includes a processor and a memory, wherein the memory stores a program or instruction that can be run on the processor, and when the program or instruction is executed by the processor, the steps of the method described in the first aspect are implemented.

In a fourth aspect, an image processing device is provided, comprising a processor and a communication interface, wherein the processor is used to obtain a first image to be processed of a target object, wherein the first image to be processed includes a first feature map of the first image of the target object or includes a first sub-video in a first video of the target object;

In a fifth aspect, a readable storage medium is provided, on which a program or instruction is stored. When the program or instruction is executed by a processor, the steps of the method described in the first aspect are implemented.

In a sixth aspect, a chip is provided, comprising a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used to run a program or instruction to implement the method described in the first aspect.

In a seventh aspect, a computer program/program product is provided, wherein the computer program/program product is stored in a storage medium and is executed by at least one processor to implement the steps of the method described in the first aspect.

In the embodiments of the present application, a first image to be processed of a target object is acquired, wherein the first image to be processed includes a first feature map of a first image of the target object or a first sub-video in a first video of the target object; the first image to be processed is processed based on a first compression network to obtain a reconstructed second feature map or a reconstructed second sub-video; the second feature map is a feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map; the image of the reconstructed second sub-video is different from the image of the first sub-video. Because the reconstructed second feature map or second sub-video of the target object can be obtained through the first compression network, the second feature map or the second sub-video does not need to be encoded and transmitted, which improves the encoding efficiency; in addition, obtaining the reconstructed second feature map or second sub-video based on the compression network can effectively ensure the image quality.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic flow chart of an image processing method according to an embodiment of the present application;

FIG. 2 is a schematic diagram showing the network architecture of a feature pyramid network according to an embodiment of the present application;

FIG. 3 is a schematic diagram showing a first compression network in an embodiment of the present application;

FIG. 4 is a schematic diagram showing a prediction and restoration network in an embodiment of the present application;

FIG. 5 is a schematic diagram showing a first compression network and a second compression network processing feature maps in an embodiment of the present application;

FIG. 6 is a schematic diagram showing a second compression network in an embodiment of the present application;

FIG. 7 is a schematic diagram showing the modules of an image processing device according to an embodiment of the present application;

FIG. 8 is a block diagram showing a structure of an image processing device in an embodiment of the present application;

FIG. 9 is a block diagram showing a structure of a terminal according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application are described clearly below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application fall within the scope of protection of the present application.

The terms "first", "second", etc. in the specification and claims of the present application are used to distinguish similar objects, and are not used to describe a specific order or sequence. It should be understood that the terms used in this way are interchangeable under appropriate circumstances, so that the embodiments of the present application can be implemented in an order other than those illustrated or described here, and the objects distinguished by "first" and "second" are generally of the same type, and the number of objects is not limited. For example, the first object can be one or more. In addition, "and/or" in the specification and claims represents at least one of the connected objects, and the character "/" generally represents that the objects associated with each other are in an "or" relationship.

The image processing device corresponding to the image processing method in the embodiments of the present application may be a terminal, which may also be referred to as a terminal device or user equipment (UE). The terminal may be a terminal-side device such as a mobile phone, a tablet personal computer, a laptop computer (also called a notebook computer), a personal digital assistant (PDA), a handheld computer, a netbook, an ultra-mobile personal computer (UMPC), a mobile Internet device (MID), an augmented reality (AR)/virtual reality (VR) device, a robot, a wearable device, vehicle user equipment (VUE) or pedestrian user equipment (PUE). Wearable devices include smart watches, bracelets, headsets, glasses and the like. It should be noted that the specific type of the terminal is not limited in the embodiments of the present application.

In order to enable those skilled in the art to better understand the embodiments of the present application, the following description is first made.

1. Traditional image encoding and decoding solution.

Most traditional coding schemes follow three steps: transformation, quantization and entropy coding. The Joint Photographic Experts Group (JPEG) still-image compression standard is the most widely used compression standard; it applies a series of transformation, quantization and entropy coding steps to reduce spatial and other coding redundancy as much as possible. To reduce spatial redundancy, the image is first divided into small blocks, and each block is converted from the spatial domain to the frequency domain through the Discrete Cosine Transform (DCT) to obtain a more compact representation. The transformed image information is then quantized and passed to the entropy coding process. Quantization is the only lossy step in the entire compression method. Because the wavelet transform is well suited to non-stationary processes, JPEG2000 uses the Discrete Wavelet Transform (DWT) instead of the DCT to reduce information loss in the quantization process and achieve better compression quality. In the High Efficiency Video Coding (HEVC) standard, intra-frame coding units can compress a single frame independently, and an image format likewise compresses the image information within a single frame. Based on this similarity, the Better Portable Graphics (BPG) codec was proposed for image compression. Current compression standards are designed for a wide range of image compression tasks, so they have no advantage when faced with special, highly correlated image sets.
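As a rough illustration of the transform, quantization and reconstruction steps described above, the following Python sketch applies an orthonormal DCT to a single row of eight samples. It is a simplified one-dimensional stand-in, not the JPEG algorithm itself: the sample values, the uniform step size of 10 and all function names are assumptions chosen for illustration, and entropy coding is omitted because it is lossless.

```python
import math

def dct_1d(block):
    """Orthonormal DCT-II of a 1-D block of samples."""
    n = len(block)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, x in enumerate(block)))
    return out

def idct_1d(coeffs):
    """Inverse of dct_1d (DCT-III with the matching scaling)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out

def quantize(coeffs, step):
    """Uniform quantization: the step where information is discarded."""
    return [round(c / step) for c in coeffs]

def dequantize(levels, step):
    """Map integer levels back to approximate coefficient values."""
    return [q * step for q in levels]

block = [52, 55, 61, 66, 70, 61, 64, 73]   # hypothetical row of pixel values
coeffs = dct_1d(block)
levels = quantize(coeffs, step=10.0)        # integers that would go to the entropy coder
restored = idct_1d(dequantize(levels, step=10.0))
lossless = idct_1d(coeffs)                  # without quantization the transform inverts exactly
```

Comparing `restored` and `lossless` against `block` shows that the transform alone is reversible up to floating-point error, while the quantization step introduces the reconstruction error.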

2. Image encoding and decoding based on deep learning.

With the development of deep learning, neural network-based methods have been widely used in image processing, and image compression is one such application. In deep learning-based schemes, the codec is trained end to end: the compression task is regarded as an encoding process, and the corresponding reconstruction task is regarded as a decoding process. The encoder-decoder structure is widely used in learning-based compression methods. For example, a compression model based on a recurrent neural network (RNN) uses an autoencoder to extract image features as the transformation process, and a residual structure is used for feature extraction in the encoder and decoder. As another example, the related art has proposed an optimization method that jointly optimizes the compression rate and the reconstruction distortion; its smooth, adjustable compression ratio made it popular in later methods. To obtain better performance, the related art has proposed a hyperprior-based compression method to reduce spatial redundancy, and has improved the entropy coding module and attached it to the hyperprior structure to further reduce coding redundancy. A related paper takes advantage of the residual structure and the attention mechanism to propose a better-structured autoencoder, together with a Gaussian mixture likelihood entropy model that improves flexibility and accuracy. These methods can change the compression ratio by adjusting parameters; however, because they pay no special attention to extreme cases, their compression quality usually drops very quickly when compressing at low bit rates.

3. Solutions for low bit rates.

The related art has also optimized low bit rate performance. For the JPEG codec, it has been proposed to apply 2×2 average pooling to the image to obtain a smaller image, encode and decode that image with JPEG, and interpolate back to the original size during reconstruction. The method is optimized by designing filters for the downsampling and interpolation processes, but these filters depend on the image information and have to be designed manually. The related art has also attempted to apply Generative Adversarial Networks (GAN) to low bit rate compression. One approach proposes a generative compression architecture that generates images from the image distribution encoded by the encoder, with a corresponding loss function designed to balance visual quality and reconstruction quality. Another uses a GAN as an enhancement module of the decoder structure: a classic encoder-decoder pair is trained by optimizing the rate-distortion loss, the trained encoder is then frozen, and the decoder is fine-tuned so that it becomes the generator in the GAN. Finally, the decoder and generator parameters are interpolated to reduce the artifacts of compressed images at low bit rates. In these approaches the compression network is optimized entirely through a newly designed network structure, so obtaining better results requires redesigning the network structure, and there is essentially no compatibility.
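The downsample-then-interpolate idea described above for low bit rates can be sketched as follows. This is a minimal illustration under stated assumptions: nearest-neighbour repetition stands in for the hand-designed interpolation filters, the JPEG encode/decode step between the two functions is omitted, and the function names and the 4×4 sample image are hypothetical.

```python
def avg_pool_2x2(img):
    """Halve width and height by averaging each 2x2 block (even dimensions assumed)."""
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
             for x in range(0, len(img[0]), 2)]
            for y in range(0, len(img), 2)]

def upsample_nearest_2x(img):
    """Return to the original size by repeating every sample in a 2x2 block."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

image = [[10, 10, 20, 20],
         [10, 10, 20, 20],
         [30, 30, 40, 40],
         [30, 30, 40, 40]]
small = avg_pool_2x2(image)            # a quarter of the samples to encode
restored = upsample_nearest_2x(small)  # same size as the original image
```

Only `small` would need to be coded, which is where the bit rate saving comes from; the quality of `restored` then depends on the interpolation filter that replaces the nearest-neighbour stand-in.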

The image processing method provided in the embodiment of the present application is described in detail below through some embodiments and their application scenarios in combination with the accompanying drawings.

As shown in FIG. 1, an embodiment of the present application provides an image processing method, including:

Step 101: Acquire a first image to be processed of a target object, where the first image to be processed includes a first feature map of a first image of the target object or a first sub-video in a first video of the target object.

The first feature map is extracted from the first image or the first video using a neural network. The target object is a photographed object (or photographed content) corresponding to the first video or the first image.

Optionally, the first video is a multi-view video (multi-view) of the target object, or the first video is a scalable video of the target object.

The multi-view video (also described as stereoscopic video) refers to the video of each view obtained by shooting the same object (or the same scene) with multiple cameras at different viewpoints. For example, the first sub-video is the video of the target object at a certain viewpoint.

Scalable video includes videos of the same video source at different resolutions or different frame rates. For example, the first sub-video is the video in the first video that is used to transmit and display the target object at a certain resolution. That is to say, the method of the embodiments of the present application can be used not only for processing feature maps but also for processing videos of different viewing angles or different resolutions.

Step 102: Processing the first to-be-processed image based on the first compression network to obtain a reconstructed second feature map or a second sub-video;

Optionally, the image features include at least one of resolution and feature quantity.

For example, the resolution of the first feature map is different from the resolution of the second feature map;

Alternatively, the number of features corresponding to the first feature map is different from the number of features corresponding to the second feature map.

The first compression network is used to output a feature map having different image features from the input feature map. The image features in the embodiment of the present application include but are not limited to resolution and number of features. That is to say, in the embodiment of the present application, the feature map in the image can be compressed by the first compression network, and part of the video in the multi-view video or scalable video (such as a video of a certain resolution or a video of a certain view) can also be compressed.

Optionally, the first compression network is a learnable compression network. The first compression network is trained by a rate loss function and a distortion loss function.

In one implementation of the present application, the method of the embodiment of the present application is applied to a neural network that extracts multiple feature maps from an image, that is, the first feature map is extracted by a neural network. There is information redundancy between the multiple feature maps extracted by the neural network, and this feature is used to perform mutual prediction between the feature maps based on the first compression network, that is, the reconstructed second feature map is obtained by the first feature map.

In another implementation of the present application, mutual prediction is performed between different videos based on the first compression network. For example, a reconstructed second sub-video is obtained based on the first sub-video. The first sub-video and the second sub-video differ in resolution or frame rate, or the first sub-video and the second sub-video differ in shooting angle.

In the embodiments of the present application, a first image to be processed of a target object is acquired, wherein the first image to be processed includes a first feature map of a first image of the target object or a first sub-video in a first video of the target object; the first image to be processed is processed based on a first compression network to obtain a reconstructed second feature map or a reconstructed second sub-video; the second feature map is a feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map; the image of the reconstructed second sub-video is different from the image of the first sub-video. Because the first compression network can obtain the reconstructed second feature map or second sub-video of the target object, there is no need to encode and transmit the second feature map or second sub-video, which improves the encoding efficiency. In addition, obtaining the reconstructed second feature map or second sub-video based on the compression network can effectively ensure the image quality.

Optionally, obtaining a first feature map of a first image of the target object includes:

Acquire a plurality of feature maps of the first image using a target neural network, where the target neural network is a neural network for extracting image features;

A feature map is selected from the multiple feature maps as the first feature map.

In one implementation of the present application, the target neural network is a feature pyramid network, for example, a fast region-based convolutional neural network (Fast Region Convolutional Neural Network, FastRCNN). The feature pyramid network can be used to extract feature maps of different resolutions.

Of course, the target neural network may also be another form of neural network that extracts multiple feature maps. The embodiments of the present application are explained using the example of a feature pyramid network that implements a target detection task.

The network architecture of the feature pyramid network is shown in FIG. 2, where the input of the neural network is an image with a resolution of W×H consisting of three color channels (RGB). The feature maps of the P layers, namely P2, P3, P4 and P5, are obtained from the neural network; each P layer has its own resolution, and the number of feature channels is 256.

Optionally, the first compression network includes a first compression encoding network, a first processing unit and a first compression decoding network;

The step of processing the first image to be processed based on the first compression network to obtain a reconstructed second feature map or a second sub-video includes:

Encoding the first image to be processed based on the first compression coding network to obtain a first variable;

quantizing, arithmetic coding, arithmetic decoding and inverse quantizing the first variable based on the first processing unit to obtain a decoded first variable;

The decoded first variable is decoded based on the first compression decoding network to obtain a reconstructed second feature map or a second sub-video.

Optionally, before encoding the first image to be processed based on the first compression coding network, pixels of the first image to be processed are normalized.
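The normalization step mentioned above could look like the following sketch. The source does not say which normalization is used, so min-max scaling to the range [0, 1] is an assumption here, as are the function names and the sample values; the minimum and maximum are returned so that the scaling can be undone after decoding.

```python
def normalize(feature_map):
    """Min-max scale a 2-D feature map to [0, 1] (assumed normalization scheme)."""
    flat = [v for row in feature_map for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0          # avoid division by zero for a constant map
    scaled = [[(v - lo) / span for v in row] for row in feature_map]
    return scaled, lo, hi

def denormalize(scaled, lo, hi):
    """Undo normalize() using the stored minimum and maximum."""
    span = (hi - lo) or 1.0
    return [[v * span + lo for v in row] for row in scaled]

p2 = [[-2.0, 0.0],
      [2.0, 6.0]]                    # hypothetical feature values
norm, lo, hi = normalize(p2)
back = denormalize(norm, lo, hi)
```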

The first compression network of the embodiments of the present application can also be described as a prediction compression module. As shown in FIG. 3, the prediction compression module includes an encoder, a first processing unit and a decoder. The encoder includes the first compression encoding network, and the decoder includes the first compression decoding network. Assume that the first feature map is the feature map P2 obtained from the neural network shown in FIG. 2 and the second feature map is the feature map P3 obtained from the same network, that is, the feature maps used for mutual prediction and restoration are P2 and P3. Of course, the feature maps used for mutual prediction and restoration can also be any other two feature maps among P2 to P5. The input of the encoder is the feature map P2. Its pixel values are first normalized, and the input feature map P2 is then compressed (encoded) using the first compression encoding network to obtain a latent variable (the first variable), that is, the variable c. The variable c is then quantized and arithmetically encoded to obtain a binary bit stream. After the decoder receives the binary bit stream, it first performs arithmetic decoding and dequantization to obtain the decoded latent variable ĉ, that is, the decoded first variable, which is then decoded by the first compression decoding network to obtain the reconstructed second feature map, that is, the feature map P3'.

The first compression network shown in FIG. 3 can be selected and designed according to actual needs, for example by using commonly used compression networks such as those of Ballé and Cheng. The process executed by the first compression network in FIG. 3 can be expressed as the following formulas:
c = Enc(P2);
ĉ = Q(c);
P3' = Dec(ĉ);

Here, Q(·) denotes the quantization, arithmetic coding and arithmetic decoding operations, and Enc(·) and Dec(·) are the encoder and decoder respectively.
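The encode, quantize and decode flow of FIG3 can be sketched with toy stand-ins for the learned networks. Everything below is an assumption made for illustration: `toy_enc` uses 2×2 average pooling in place of the first compression encoding network, `toy_dec` is an identity mapping in place of the first compression decoding network, the step size is arbitrary, and arithmetic coding is left out of `make_q` because it is lossless. The only point carried over from the text is the order of operations and that the output has a different resolution from the input.

```python
from typing import Callable, List

FeatureMap = List[List[float]]

def make_q(step: float) -> Callable[[FeatureMap], FeatureMap]:
    """Stand-in for Q(.): quantize to integer levels, then dequantize."""
    def q(latent: FeatureMap) -> FeatureMap:
        return [[round(v / step) * step for v in row] for row in latent]
    return q

def toy_enc(p2: FeatureMap) -> FeatureMap:
    """Hypothetical encoder: 2x2 average pooling halves the resolution."""
    return [[(p2[y][x] + p2[y][x + 1] + p2[y + 1][x] + p2[y + 1][x + 1]) / 4.0
             for x in range(0, len(p2[0]), 2)]
            for y in range(0, len(p2), 2)]

def toy_dec(c_hat: FeatureMap) -> FeatureMap:
    """Hypothetical decoder: passes the decoded latent through unchanged."""
    return [list(row) for row in c_hat]

def predict_compress(p2, enc, q, dec):
    c = enc(p2)          # latent variable (first variable)
    c_hat = q(c)         # decoded latent after quantization and dequantization
    return dec(c_hat)    # reconstructed second feature map

p2 = [[0.1, 0.3, 0.5, 0.7],
      [0.1, 0.3, 0.5, 0.7],
      [0.2, 0.2, 0.8, 0.8],
      [0.2, 0.2, 0.8, 0.8]]
p3_rec = predict_compress(p2, toy_enc, make_q(0.25), toy_dec)
```

In a real system `enc` and `dec` would be trained networks; passing them in as callables keeps the sketch independent of any particular architecture.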

Optionally, the method of the embodiment of the present application further includes:

Performing target processing on the reconstructed second feature map based on the prediction restoration network to obtain a reconstructed first feature map;

Alternatively, target processing is performed on the reconstructed second sub-video based on the prediction restoration network to obtain the reconstructed first sub-video;

Wherein, the prediction and restoration network is obtained by training through an enhanced loss function.

Optionally, the target processing includes sampling processing and residual recovery processing, and the sampling processing includes upsampling processing or downsampling processing.

In one embodiment of the present application, as shown in FIG. 4, the decoder may also include the prediction and restoration network (prediction and restoration module), which is used to predict and restore the reconstructed first feature map from the second feature map obtained by decoding and reconstruction. Specifically, the prediction and restoration network is designed based on residual units. The width and height of the reconstructed second feature map are first upsampled to twice their original size; this can be done by an interpolation method such as bilinear interpolation, which is not limited here. Multiple stacked residual units are then used to restore the residual, which is added to the upsampled input to obtain the predicted first feature map (i.e., the reconstructed first feature map). The specific prediction and restoration network shown in FIG. 4 can also be selected and designed according to actual needs, for example by using commonly used enhancement networks such as residual networks and dense networks. The process of FIG. 4 can be expressed as the following formula:
P2'=Up(P3')+Res(Up(P3'));

Here, Up(·) represents a 2-fold upsampling operation (the upsampling or downsampling factor is determined by the feature maps used for mutual prediction and restoration), and Res(·) represents multiple stacked residual units.
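The formula P2' = Up(P3') + Res(Up(P3')) can be traced with a small sketch. The upsampling below uses nearest-neighbour repetition for brevity (the text mentions bilinear interpolation as one option), and `toy_res` is a hypothetical placeholder for the stacked residual units, which in practice would be a trained network; all names and values are assumptions.

```python
def up2(fm):
    """2-fold upsampling of width and height by nearest-neighbour repetition."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def predict_restore(p3_rec, res):
    """P2' = Up(P3') + Res(Up(P3')), with res standing in for the residual units."""
    up = up2(p3_rec)
    residual = res(up)
    return [[u + r for u, r in zip(up_row, res_row)]
            for up_row, res_row in zip(up, residual)]

def toy_res(fm):
    """Hypothetical residual branch: returns half of each input value."""
    return [[0.5 * v for v in row] for row in fm]

def zero_res(fm):
    """Residual branch that contributes nothing, so P2' equals Up(P3')."""
    return [[0.0 for _ in row] for row in fm]

p3_rec = [[1.0, 2.0]]
p2_plain = predict_restore(p3_rec, zero_res)
p2_rec = predict_restore(p3_rec, toy_res)
```

The skip connection means the residual branch only has to learn the detail that upsampling alone cannot recover.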

Optionally, processing the first to-be-processed image based on the first compression network to obtain a reconstructed second feature map or a second sub-video includes:

When the first condition is met, the first to-be-processed image is processed based on the first compression network to obtain a reconstructed second feature map or a second sub-video;

The first condition includes at least one of the following:

The network bandwidth is less than or equal to the first threshold;

The data volume of the first image to be processed is greater than or equal to the second threshold.

Optionally, a second image to be processed of the target object is processed based on a second compression network to obtain a reconstructed third feature map or a reconstructed third sub-video;

The second image to be processed includes at least one third feature map of the first image or includes at least one third sub-video of the first sub-video.

Specifically, at least one third feature map of the target object is processed based on the second compression network to obtain a reconstructed third feature map, wherein image features of the third feature map are different from image features of the first feature map or the second feature map;

Alternatively, at least one third sub-video of the target object is processed based on the second compression network to obtain a reconstructed third sub-video, wherein the image of the third sub-video is different from the image of the first sub-video, or the image of the third sub-video is different from the image of the second sub-video.

Optionally, processing the second to-be-processed image of the target object based on the second compression network to obtain a reconstructed third feature map or a third sub-video includes:

When the second condition is met, the second to-be-processed image of the target object is processed based on the second compression network to obtain a reconstructed third feature map or a third sub-video;

The second condition includes at least one of the following:

The network bandwidth is greater than a first threshold;

The data volume of the second image to be processed is less than the second threshold.
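The first and second conditions can be written as two small predicates. This is only a sketch: the function and parameter names are hypothetical, the units of bandwidth and data volume are left open as in the text, and each condition is evaluated on its own image to be processed because the text defines them separately.

```python
def first_condition(bandwidth, first_image_volume, first_threshold, second_threshold):
    """True when the first compression network should process the first image."""
    return bandwidth <= first_threshold or first_image_volume >= second_threshold

def second_condition(bandwidth, second_image_volume, first_threshold, second_threshold):
    """True when the second compression network should process the second image."""
    return bandwidth > first_threshold or second_image_volume < second_threshold

# Hypothetical values: thresholds of 10 (bandwidth) and 100 (data volume).
use_first = first_condition(bandwidth=8, first_image_volume=50,
                            first_threshold=10, second_threshold=100)
use_second = second_condition(bandwidth=8, second_image_volume=50,
                              first_threshold=10, second_threshold=100)
```

Because each condition is satisfied by "at least one of" its clauses, both can hold at the same time, as in the example values above (low bandwidth and a small second image).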

Optionally, the second compression network includes a second compression encoding network, a second processing unit and a second compression decoding network;

The processing of the second image to be processed of the target object based on the second compression network to obtain a reconstructed third feature map or a third sub-video includes:

Encoding the second image to be processed based on the second compression encoding network to obtain a second variable;

quantizing, arithmetic coding, arithmetic decoding and inverse quantizing the second variable based on the second processing unit to obtain a decoded second variable;

The decoded second variable is decoded based on the second compression decoding network to obtain a reconstructed third feature map or a third sub-video.

Optionally, the second compression network is a learnable codec network. Taking a feature pyramid network as the target neural network as an example, the at least one third feature map consists of the feature map P4 and the feature map P5. By using the second compression network to compress and reconstruct the feature maps P4 and P5, the reconstruction quality of P4 and P5 can be guaranteed at a low bit rate, thereby ensuring the accuracy of the machine vision task. The specific processing flow is shown in FIG. 5. At the encoding end, the feature maps P2, P3, P4 and P5 are first obtained from the FastRCNN network and their pixel values are normalized; the feature maps P2 and P3, which have larger resolutions, are predicted and restored using the mutual prediction and restoration network. For the basic feature maps P4 and P5, which occupy less bit rate and are more important for completing visual tasks, the encoding end uses the basic feature compression network (the second compression network) to compress P4 and P5 separately. The decoding end uses the decoding network of the second compression network to obtain the reconstructed feature maps P4' and P5', and combines them with the reconstructed feature maps P2' and P3' obtained at the decoding end to perform visual task analysis (e.g., a target detection task) and obtain the final target detection result.
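The data flow of Figure 5 can be summarized in a short sketch that shows which feature maps are actually encoded. The stand-in functions operate on flat lists and are purely hypothetical, as are all names; normalization and quantization are omitted. The point illustrated is that only P2, P4 and P5 are encoded, while P3 is obtained at the decoding end from the latent of P2.

```python
def encode_side(features, first_enc, base_enc):
    """Produce the latents to transmit; P3 itself is never encoded."""
    return {
        "c": first_enc(features["P2"]),   # first compression network
        "d": base_enc(features["P4"]),    # second (basic feature) compression network
        "e": base_enc(features["P5"]),
    }

def decode_side(latents, first_dec, restore, base_dec):
    """Rebuild all four feature maps from the three received latents."""
    p3_rec = first_dec(latents["c"])      # reconstructed P3 predicted from P2's latent
    p2_rec = restore(p3_rec)              # prediction and restoration network
    return {"P2": p2_rec, "P3": p3_rec,
            "P4": base_dec(latents["d"]), "P5": base_dec(latents["e"])}

# Hypothetical stand-ins for the trained networks.
def pair_average(p):
    return [(p[i] + p[i + 1]) / 2 for i in range(0, len(p), 2)]

def copy(p):
    return list(p)

def repeat_twice(p):
    return [v for v in p for _ in range(2)]

features = {"P2": [1.0, 2.0, 3.0, 4.0], "P3": [1.5, 3.5], "P4": [2.0], "P5": [5.0]}
latents = encode_side(features, first_enc=pair_average, base_enc=copy)
outputs = decode_side(latents, first_dec=copy, restore=repeat_twice, base_dec=copy)
```

The reconstructed maps in `outputs` would then be handed to the visual task head, for example a target detection head.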

Optionally, as shown in FIG. 6, the second compression network includes an encoder, a second processing unit and a decoder; the encoder includes the second compression encoding network, and the decoder includes the second compression decoding network. The second compression network is a learnable encoding and decoding network, which can be selected and designed according to actual needs, for example by using commonly used compression networks such as those of Ballé and Cheng. The process of FIG. 6 can be expressed as the following formulas:
d = Enc_base(P4);
d̂ = Q(d);
P4' = Dec_base(d̂);

e = Enc_base(P5);
ê = Q(e);
P5' = Dec_base(ê);

Here, Q(·) denotes the quantization, arithmetic coding, arithmetic decoding and inverse quantization performed by the second processing unit, and Enc_base(·) and Dec_base(·) are the encoder and decoder of the basic feature compression network respectively.

In one embodiment of the present application, loss functions are used to train the first compression network, the second compression network and the prediction and restoration network. The specific loss functions include a rate loss function, a distortion loss function and an enhancement loss function.

For the rate loss function, the input features are converted into latent variables; ŷ and ẑ can be output through arithmetic coding, and the bit rate is calculated from them. The specific function is as follows:
R(θ) = (1/N)·Σ_{i=1}^{N} E[−log₂ p(ŷ_i | ẑ_i) − log₂ p(ẑ_i)];

where ŷ is the latent variable output by the encoder, ẑ is the additional side information introduced, p(ŷ | ẑ) is the probability distribution of the latent variable ŷ conditioned on ẑ, and p(ẑ) is the probability distribution of the side information ẑ. In addition, N is the number of training samples, θ represents the network parameters, R represents the rate and E represents the expected value.

The distortion loss function is used to measure the difference between the original feature P and the reconstructed feature P′. The distortion loss function is as follows:
D(θ) = (1/N)·Σ_{i=1}^{N} l₂(P_i, P′_i);

where P′ is the reconstructed feature map and θ represents the network parameters. In addition, N is the number of training samples, l₂ measures the difference between the original feature map and the reconstructed feature map, and D(θ) represents the distortion.

The enhancement loss is used to measure the difference between the output feature P′₂ and the original feature P₂, and its formula is as follows:

Where E(θ) represents the expected value.

When training the first compression network and the second compression network, the rate loss function and the distortion loss function are used for training. Specifically, the following formula is used for training:
L_total(θ) = λ·D(θ) + R(θ);

Here, L_total(θ) represents the total loss function; compression models with different compression rates can be obtained by adjusting λ.
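As an illustration of how the total loss combines rate and distortion, the following is a minimal sketch rather than the training code of the embodiment. It assumes that the entropy model supplies a likelihood for each coded symbol, that the rate is estimated as the sum of −log₂ of those likelihoods, and that l₂ is realized as a mean squared error; the function names and sample values are hypothetical.

```python
import math


def rate_bits(likelihoods):
    """Estimated bit cost: sum of -log2(p) over the likelihoods of coded symbols."""
    return sum(-math.log2(p) for p in likelihoods)


def distortion_mse(original, reconstructed):
    """Mean squared error between two flat feature lists (assumed form of l2)."""
    if len(original) != len(reconstructed):
        raise ValueError("feature maps must have the same number of elements")
    squared = [(a - b) ** 2 for a, b in zip(original, reconstructed)]
    return sum(squared) / len(original)


def total_loss(original, reconstructed, latent_likelihoods, side_likelihoods, lam):
    """L_total = lam * D + R, with R covering the latent and the side information."""
    rate = rate_bits(latent_likelihoods) + rate_bits(side_likelihoods)
    return lam * distortion_mse(original, reconstructed) + rate


# Hypothetical numbers for a single training sample.
loss = total_loss(
    original=[1.0, 2.0],
    reconstructed=[1.0, 4.0],
    latent_likelihoods=[0.5, 0.25],  # costs 1 + 2 = 3 bits
    side_likelihoods=[0.5],          # costs 1 bit
    lam=0.5,
)
```

With the formula as written, a larger λ weights distortion more heavily and therefore tends toward higher-rate, higher-quality models, while a smaller λ favors lower bit rates.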

The above-mentioned enhancement loss function can be used when training the prediction and restoration network, to ensure the quality of the predicted and restored feature map P2'.

In an embodiment of the present application, a first image to be processed of a target object is obtained, wherein the first image to be processed includes a first feature map of a first image of the target object or a first sub-video in a first video of the target object; the first image to be processed is processed based on a first compression network to obtain a reconstructed second feature map or a reconstructed second sub-video; the second feature map is a feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map; the image of the reconstructed second sub-video is different from the image of the first sub-video. Because the reconstructed second feature map or second sub-video of the target object can be obtained through the first compression network, there is no need to encode and transmit the second feature map or the second sub-video, which improves the encoding efficiency; obtaining the reconstructed second feature map or second sub-video based on the compression network also effectively ensures the image quality.

The image processing method provided in the embodiment of the present application can be executed by an image processing device. In the embodiment of the present application, an image processing device executing the image processing method is taken as an example to illustrate the image processing device provided in the embodiment of the present application.

As shown in FIG. 7 , the embodiment of the present application further provides an image processing device 700, including:

A first acquisition module 701 is used to acquire a first image to be processed of a target object, where the first image to be processed includes a first feature map of a first image of the target object or a first sub-video in a first video of the target object;

A second acquisition module 702 is used to process the first to-be-processed image based on the first compression network to obtain a reconstructed second feature map or a second sub-video;

Optionally, the first acquisition module 701 includes:

A first acquisition submodule, used to acquire a plurality of feature maps of the first image using a target neural network, wherein the target neural network is a neural network used to extract image features;

The second acquisition submodule is used to select a feature map from the multiple feature maps as the first feature map.

The second acquisition module 702 includes:

A third acquisition submodule is used to encode the first to-be-processed image based on the first compression coding network to obtain a first variable;

a fourth acquisition submodule, configured to perform quantization, arithmetic coding, arithmetic decoding, and inverse quantization on the first variable based on the first processing unit to obtain a decoded first variable;

The fifth acquisition submodule is used to decode the decoded first variable based on the first compression decoding network to obtain a reconstructed second feature map or a second sub-video.
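The role of the first processing unit (quantization, arithmetic coding, arithmetic decoding and inverse quantization) can be sketched as follows. This is an illustration under stated assumptions, not the embodiment's implementation: the learned encoding and decoding networks are omitted, uniform scalar quantization with a hypothetical step size `STEP` is assumed, and `zlib` is used only as a lossless stand-in for the arithmetic coder.

```python
import struct
import zlib

STEP = 0.5  # hypothetical quantization step size


def quantize(values, step=STEP):
    """Map each real value to the index of the nearest multiple of `step`."""
    return [round(v / step) for v in values]


def dequantize(indices, step=STEP):
    """Map integer indices back to real values (inverse quantization)."""
    return [i * step for i in indices]


def entropy_encode(indices):
    """Lossless stand-in for arithmetic coding: pack as int32 and compress."""
    raw = struct.pack(f"<{len(indices)}i", *indices)
    return zlib.compress(raw)


def entropy_decode(bitstream):
    """Lossless stand-in for arithmetic decoding: decompress and unpack int32."""
    raw = zlib.decompress(bitstream)
    return list(struct.unpack(f"<{len(raw) // 4}i", raw))


def processing_unit(first_variable):
    """Quantize, code, decode and dequantize the first variable."""
    indices = quantize(first_variable)
    bitstream = entropy_encode(indices)          # what would be transmitted
    decoded_indices = entropy_decode(bitstream)  # recovered at the decoding end
    return dequantize(decoded_indices), bitstream


# Hypothetical output of the first compression encoding network.
decoded_first_variable, bitstream = processing_unit([0.3, -1.2, 2.0])
```

Only the quantization step loses information in this sketch; the coding and decoding stages are an exact round trip, which is also the case for arithmetic coding.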

Optionally, the first compression network is trained by a rate loss function and a distortion loss function.

Optionally, the image processing device 700 of the embodiment of the present application further includes:

A third acquisition module is used to perform target processing on the reconstructed second feature map based on the prediction restoration network to obtain a reconstructed first feature map;

Alternatively, the reconstructed second sub-video is subjected to target processing based on the prediction restoration network to obtain the reconstructed first sub-video;

Optionally, the second acquisition module 702 is used to process the first to-be-processed image based on the first compression network to obtain a reconstructed second feature map or a second sub-video when a first condition is met;

The first condition includes at least one of the following:

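The network bandwidth is less than or equal to the first threshold;

The data volume of the first image to be processed is greater than or equal to the second threshold.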

A fourth acquisition module, used for processing the second to-be-processed image of the target object based on the second compression network to obtain a reconstructed third feature map or a third sub-video;

Optionally, the fourth acquisition module is used to process the second to-be-processed image of the target object based on the second compression network to obtain a reconstructed third feature map or a third sub-video when the second condition is met;

The second condition includes at least one of the following:

The network bandwidth is greater than a first threshold;

The data volume of the second image to be processed is less than the second threshold.
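The first and second conditions can be checked as in the following minimal sketch. The class `LinkState`, the function names and the numeric values are hypothetical, and the data-volume item of the first condition is taken from the claims. Because each condition holds when at least one of its items holds, both conditions can be satisfied at the same time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkState:
    """Hypothetical container for the quantities the two conditions refer to."""
    bandwidth: float    # current network bandwidth
    data_volume: float  # data volume of the image to be processed


def first_condition_met(state, first_threshold, second_threshold):
    """True if bandwidth <= first threshold or data volume >= second threshold."""
    return (state.bandwidth <= first_threshold
            or state.data_volume >= second_threshold)


def second_condition_met(state, first_threshold, second_threshold):
    """True if bandwidth > first threshold or data volume < second threshold."""
    return (state.bandwidth > first_threshold
            or state.data_volume < second_threshold)


# Hypothetical thresholds and link states, in arbitrary units.
T1, T2 = 10.0, 50.0
congested = LinkState(bandwidth=5.0, data_volume=100.0)
relaxed = LinkState(bandwidth=20.0, data_volume=10.0)
mixed = LinkState(bandwidth=5.0, data_volume=10.0)
```

For the `mixed` state both checks return true, so an implementation would need its own rule for which compression network, or which combination, to use in that case.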

The fourth acquisition module includes:

a sixth acquisition submodule, configured to encode the second image to be processed according to the second compression coding network to obtain a second variable;

a seventh acquisition submodule, configured to perform quantization, arithmetic coding, arithmetic decoding, and inverse quantization on the second variable based on the second processing unit to obtain a decoded second variable;

An eighth acquisition submodule is used to decode the decoded second variable based on the second compression decoding network to obtain a reconstructed third feature map or a third sub-video.

Optionally, the second compression network is trained by a rate loss function and a distortion loss function.

Optionally, the first video is a multi-view video of the target object, or the first video is a scalable video of the target object.

The image processing device of the embodiment of the present application obtains a first image to be processed of a target object, wherein the first image to be processed includes a first feature map of the first image of the target object or a first sub-video in a first video of the target object; processes the first image to be processed based on a first compression network to obtain a reconstructed second feature map or a reconstructed second sub-video; the second feature map is a feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map; the image features of the reconstructed second sub-video are different from the image features of the first sub-video. Because the reconstructed second feature map or second sub-video of the target object can be obtained through the first compression network, there is no need to encode and transmit the second feature map or second sub-video, which improves the encoding efficiency; obtaining the reconstructed second feature map or second sub-video based on the compression network also effectively ensures the image quality.

The image processing device in the embodiment of the present application can be an electronic device, such as an electronic device with an operating system, or a component in an electronic device, such as an integrated circuit or a chip. The electronic device can be a terminal or a device other than a terminal. Exemplarily, the terminal can include but is not limited to the types of terminal 11 listed above, and the other device can be a server, a network attached storage (NAS) device, etc., which is not specifically limited in the embodiment of the present application.

The image processing device provided in the embodiment of the present application can implement each process implemented by the method embodiments of Figures 1 to 6 and achieve the same technical effect. To avoid repetition, it will not be repeated here.

Optionally, as shown in FIG. 8, the embodiment of the present application further provides an image processing device 800, including a processor 801 and a memory 802, wherein the memory 802 stores a program or instruction that can be run on the processor 801, and when the program or instruction is executed by the processor 801, each step of the above-mentioned image processing method embodiment is implemented, and the same technical effect can be achieved. To avoid repetition, it will not be described here.

The embodiment of the present application also provides an image processing device, including a processor and a communication interface, the processor is used to obtain a first image to be processed of a target object, the first image to be processed includes a first feature map of the first image of the target object or includes a first sub-video in a first video of the target object; the first image to be processed is processed based on a first compression network to obtain a reconstructed second feature map or a second sub-video; wherein the second feature map is a feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map; the image of the reconstructed second sub-video is different from the image of the first sub-video. The device embodiment corresponds to the above-mentioned method embodiment, and each implementation process and implementation method of the above-mentioned method embodiment can be applied to the device embodiment and can achieve the same technical effect. Specifically, Figure 9 is a schematic diagram of the hardware structure of an image processing device that implements the embodiment of the present application. The image processing device is specifically a terminal 900.

The terminal 900 includes but is not limited to at least some of the following components: a radio frequency unit 901, a network module 902, an audio output unit 903, an input unit 904, a sensor 905, a display unit 906, a user input unit 907, an interface unit 908, a memory 909 and a processor 910.

Those skilled in the art will appreciate that the terminal 900 may also include a power source (such as a battery) for supplying power to each component, and the power source may be logically connected to the processor 910 through a power management system, so that functions such as charging management, discharging management and power consumption management are implemented through the power management system. The terminal structure shown in FIG. 9 does not constitute a limitation on the terminal, and the terminal may include more or fewer components than shown, or combine certain components, or arrange components differently, which will not be described in detail here.

It should be understood that in the embodiment of the present application, the input unit 904 may include a graphics processing unit (GPU) 9041 and a microphone 9042. The graphics processing unit 9041 processes the image data of a static picture or video obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode. The display unit 906 may include a display panel 9061, and the display panel 9061 may be configured in the form of a liquid crystal display, an organic light emitting diode, etc. The user input unit 907 includes at least one of a touch panel 9071 and other input devices 9072. The touch panel 9071 is also called a touch screen. The touch panel 9071 may include two parts: a touch detection device and a touch controller. The other input devices 9072 may include, but are not limited to, a physical keyboard, function keys (such as a volume control key, a switch key, etc.), a trackball, a mouse, and a joystick, which will not be repeated here.

In the embodiment of the present application, after receiving downlink data from the network side device, the RF unit 901 can transmit the data to the processor 910 for processing; in addition, the RF unit 901 can send uplink data to the network side device. Generally, the RF unit 901 includes but is not limited to an antenna, an amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, etc.

The memory 909 can be used to store software programs or instructions and various data. The memory 909 may mainly include a first storage area for storing programs or instructions and a second storage area for storing data, wherein the first storage area may store an operating system, an application program or instruction required for at least one function (such as a sound playback function, an image playback function, etc.), etc. In addition, the memory 909 may include a volatile memory or a non-volatile memory, or the memory 909 may include both volatile and non-volatile memories. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), a static random access memory (SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a double data rate synchronous dynamic random access memory (DDRSDRAM), an enhanced synchronous dynamic random access memory (ESDRAM), a synchronous link dynamic random access memory (SLDRAM) or a direct Rambus random access memory (DRRAM). The memory 909 in the embodiment of the present application includes but is not limited to these and any other suitable types of memories.

The processor 910 may include one or more processing units; optionally, the processor 910 integrates an application processor and a modem processor, wherein the application processor mainly processes operations related to an operating system, a user interface, and application programs, and the modem processor mainly processes wireless communication signals, such as a baseband processor. It is understandable that the modem processor may not be integrated into the processor 910.

The processor 910 is configured to obtain a first image to be processed of the target object, wherein the first image to be processed includes a first feature map of the first image of the target object or a first sub-video in a first video of the target object;

In the embodiment of the present application, a first image to be processed of the target object is obtained, wherein the first image to be processed includes the first feature map of the first image of the target object or the first sub-video in the first video of the target object; the first image to be processed is processed based on the first compression network to obtain the reconstructed second feature map or the reconstructed second sub-video; the second feature map is a feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map; the second sub-video is a partial video in the first video, and the image of the reconstructed second sub-video is different from the image of the first sub-video. Because the reconstructed second feature map or second sub-video of the target object can be obtained through the first compression network, there is no need to encode and transmit the second feature map or second sub-video, which improves the encoding efficiency; obtaining the reconstructed second feature map or second sub-video based on the compression network also effectively ensures the image quality.

Optionally, the processor 910 is further configured to:

The processor 910 is further configured to:

Optionally, the processor 910 is further configured to:

The first condition includes at least one of the following:

The network bandwidth is less than or equal to the first threshold;

Optionally, the processor 910 is further configured to:

The second condition includes at least one of the following:

The network bandwidth is greater than a first threshold;

The processor 910 is further configured to:

An embodiment of the present application also provides a readable storage medium, which may be volatile or non-volatile. A program or instruction is stored on the readable storage medium. When the program or instruction is executed by a processor, the various processes of the above-mentioned image processing method embodiment are implemented and the same technical effect can be achieved. To avoid repetition, it will not be repeated here.

The processor is the processor in the terminal described in the above embodiment. The readable storage medium includes a computer-readable storage medium, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.

An embodiment of the present application further provides a chip, which includes a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used to run programs or instructions to implement the various processes of the above-mentioned image processing method embodiment, and can achieve the same technical effect. To avoid repetition, it will not be repeated here.

It should be understood that the chip mentioned in the embodiments of the present application can also be called a system-level chip, a system chip, a chip system or a system-on-chip, etc.

The embodiment of the present application further provides a computer program/program product, which is stored in a storage medium, and is executed by at least one processor to implement the various processes of the above-mentioned image processing method embodiment, and can achieve the same technical effect. To avoid repetition, it will not be repeated here.

It should be noted that, herein, the terms "comprise", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements, but also other elements not explicitly listed, or elements inherent to such process, method, article or device. In the absence of further restrictions, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device including the element. In addition, it should be noted that the scope of the method and device in the embodiment of the present application is not limited to performing functions in the order shown or discussed, and may also include performing functions in a substantially simultaneous manner or in reverse order according to the functions involved; for example, the described method may be performed in an order different from that described, and various steps may also be added, omitted, or combined. In addition, the features described with reference to certain examples may be combined in other examples.

Through the description of the above implementation methods, those skilled in the art can clearly understand that the above-mentioned embodiment methods can be implemented by means of software plus a necessary general hardware platform, and of course by hardware, but in many cases the former is a better implementation method. Based on such an understanding, the technical solution of the present application, or the part that contributes to the prior art, can be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk), and includes a number of instructions for enabling a terminal (which can be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the methods described in each embodiment of the present application.

The embodiments of the present application are described above in conjunction with the accompanying drawings, but the present application is not limited to the above-mentioned specific implementation methods. The above-mentioned specific implementation methods are merely illustrative and not restrictive. Under the guidance of the present application, those of ordinary skill in the art can also devise many other forms without departing from the purpose of the present application and the scope of protection of the claims, all of which fall within the protection of the present application.

Claims

An image processing method, comprising:

Acquire a first image to be processed of the target object, where the first image to be processed includes a first feature map of the first image of the target object or a first sub-video in a first video of the target object;

Processing the first to-be-processed image based on the first compression network to obtain a reconstructed second feature map or a second sub-video;

The second feature map is the feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map;

The image of the reconstructed second sub-video is different from the image of the first sub-video.
The method according to claim 1, wherein obtaining a first feature map of a first image of the target object comprises:

Acquire a plurality of feature maps of the first image using a target neural network, where the target neural network is a neural network for extracting image features;

A feature map is selected from the multiple feature maps as the first feature map.
The method according to claim 1, wherein the first compression network comprises a first compression encoding network, a first processing unit and a first compression decoding network;

The step of processing the first image to be processed based on the first compression network to obtain a reconstructed second feature map or a second sub-video includes:

Encoding the first image to be processed based on the first compression coding network to obtain a first variable;

quantizing, arithmetic coding, arithmetic decoding and inverse quantizing the first variable based on the first processing unit to obtain a decoded first variable;

The decoded first variable is decoded based on the first compression decoding network to obtain a reconstructed second feature map or a second sub-video.
The method according to claim 1, wherein the first compression network is trained by a rate loss function and a distortion loss function.
The method according to claim 1, further comprising:

Performing target processing on the reconstructed second feature map based on the prediction restoration network to obtain a reconstructed first feature map;

Alternatively, target processing is performed on the reconstructed second sub-video based on the prediction restoration network to obtain the reconstructed first sub-video;

Wherein, the prediction and restoration network is obtained by training through an enhancement loss function.
The method according to claim 5, wherein the target processing includes a sampling process and a residual recovery process, and the sampling process includes an upsampling process or a downsampling process.
The method according to claim 1, wherein processing the first to-be-processed image based on the first compression network to obtain a reconstructed second feature map or a second sub-video comprises:

When the first condition is met, the first image to be processed is processed based on the first compression network to obtain a reconstructed second feature map or a second sub-video;

The first condition includes at least one of the following:

The network bandwidth is less than or equal to the first threshold;

The data volume of the first image to be processed is greater than or equal to the second threshold.
The method according to claim 1, further comprising:

Processing the second to-be-processed image of the target object based on the second compression network to obtain a reconstructed third feature map or a third sub-video;

The second image to be processed includes at least one third feature map of the first image or includes at least one third sub-video of the first sub-video.
The method according to claim 8, wherein the second to-be-processed image of the target object is processed based on the second compression network to obtain a reconstructed third feature map or a third sub-video, comprising:

When the second condition is met, the second to-be-processed image of the target object is processed based on the second compression network to obtain a reconstructed third feature map or a third sub-video;

The second condition includes at least one of the following:

The network bandwidth is greater than a first threshold;

The data volume of the second image to be processed is less than the second threshold.
The method according to claim 8, wherein the second compression network comprises a second compression encoding network, a second processing unit, and a second compression decoding network;

The processing of the second to-be-processed image of the target object based on the second compression network to obtain a reconstructed third feature map or a third sub-video includes:

Encoding the second image to be processed according to the second compression coding network to obtain a second variable;

quantizing, arithmetic coding, arithmetic decoding and inverse quantizing the second variable based on the second processing unit to obtain a decoded second variable;

The decoded second variable is decoded based on the second compression decoding network to obtain a reconstructed third feature map or a third sub-video.
The method according to claim 8, wherein the second compression network is trained by a rate loss function and a distortion loss function.
The method according to claim 1, wherein the first video is a multi-view video of the target object, or the first video is a scalable video of the target object.
The method according to claim 1, wherein the image features include at least one of resolution and number of features.
An image processing device, comprising:

A first acquisition module, used to acquire a first image to be processed of the target object, wherein the first image to be processed includes a first feature map of the first image of the target object or a first sub-video in a first video of the target object;

The second acquisition module is used to process the first to-be-processed image based on the first compression network to obtain a reconstructed second feature map or a second sub-video;

The second feature map is the feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map;

The image features of the reconstructed second sub-video are different from the image features of the first sub-video.
The apparatus according to claim 14, wherein the first acquisition module comprises:

A first acquisition submodule, used to acquire a plurality of feature maps of the first image using a target neural network, wherein the target neural network is a neural network used to extract image features;

The second acquisition submodule is used to select a feature map from the multiple feature maps as the first feature map.
The apparatus according to claim 14, wherein the first compression network comprises a first compression encoding network, a first processing unit, and a first compression decoding network;

The second acquisition module includes:

A third acquisition submodule is used to encode the first to-be-processed image based on the first compression coding network to obtain a first variable;

a fourth acquisition submodule, configured to perform quantization, arithmetic coding, arithmetic decoding, and inverse quantization on the first variable based on the first processing unit to obtain a decoded first variable;

The fifth acquisition submodule is used to decode the decoded first variable based on the first compression decoding network to obtain a reconstructed second feature map or a second sub-video.
The apparatus according to claim 14, wherein the first compression network is trained by a rate loss function and a distortion loss function.
The device according to claim 14, further comprising:

A third acquisition module is used to perform target processing on the reconstructed second feature map based on the prediction restoration network to obtain a reconstructed first feature map;

Alternatively, target processing is performed on the reconstructed second sub-video based on the prediction restoration network to obtain the reconstructed first sub-video;

Wherein, the prediction and restoration network is obtained by training through an enhancement loss function.
The apparatus according to claim 18, wherein the target processing comprises a sampling process and a residual recovery process, and the sampling process comprises an upsampling process or a downsampling process.
The device according to claim 14, further comprising:

A fourth acquisition module, used for processing the second to-be-processed image of the target object based on the second compression network to obtain a reconstructed third feature map or a third sub-video;

The second image to be processed includes at least one third feature map of the first image or includes at least one third sub-video of the first sub-video.
The apparatus according to claim 20, wherein the second compression network comprises a second compression encoding network, a second processing unit, and a second compression decoding network;

The fourth acquisition module includes:

a sixth acquisition submodule, configured to encode the second image to be processed according to the second compression coding network to obtain a second variable;

a seventh acquisition submodule, configured to perform quantization, arithmetic coding, arithmetic decoding, and inverse quantization on the second variable based on the second processing unit to obtain a decoded second variable;

An eighth acquisition submodule is used to decode the decoded second variable based on the second compression decoding network to obtain a reconstructed third feature map or a third sub-video.
The apparatus according to claim 20, wherein the second compression network is trained by a rate loss function and a distortion loss function.
The device according to claim 14, wherein the first video is a multi-view video of the target object, or the first video is a scalable video of the target object.
The apparatus according to claim 14, wherein the image features include at least one of resolution and number of features.
An image processing device comprises a processor and a memory, wherein the memory stores a program or instruction that can be run on the processor, and when the program or instruction is executed by the processor, the steps of the image processing method according to any one of claims 1 to 13 are implemented.
A readable storage medium stores a program or instruction, and when the program or instruction is executed by a processor, the steps of the image processing method according to any one of claims 1 to 13 are implemented.