WO2024078403A1 - Image processing method and apparatus, and device - Google Patents

Image processing method and apparatus, and device

Info

Publication number
WO2024078403A1
WO2024078403A1 (PCT/CN2023/123322)
Authority
WO
WIPO (PCT)
Prior art keywords
image
video
network
feature map
sub
Prior art date
Application number
PCT/CN2023/123322
Other languages
English (en)
Chinese (zh)
Inventor
李胜曦
刘铁
陈超然
张子夫
徐迈
吕卓逸
Original Assignee
维沃移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 维沃移动通信有限公司 filed Critical 维沃移动通信有限公司
Publication of WO2024078403A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 Image coding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N 19/146 Data rate or code amount at the encoder output
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/42 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation

Definitions

  • the present application belongs to the field of coding and decoding technology, and specifically relates to an image processing method, device and equipment.
  • the embodiments of the present application provide an image processing method, apparatus and device, which can solve the problem in the related art that the traditional image compression method used to process the image feature map cannot guarantee the encoding efficiency and the quality of the reconstructed feature map.
  • an image processing method comprising:
  • the first image to be processed includes a first feature map of the first image of the target object or a first sub-video in a first video of the target object;
  • the second feature map is the feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map;
  • the reconstructed image of the second sub-video is different from the image of the first sub-video.
  • an image processing device comprising:
  • a first acquisition module used to acquire a first image to be processed of the target object, wherein the first image to be processed includes a first feature map of the first image of the target object or a first sub-video in a first video of the target object;
  • a second acquisition module used to process the first to-be-processed image based on the first compression network to obtain a reconstructed second feature map or a second sub-video;
  • the second feature map is the feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map;
  • the second sub-video is a partial video in the first video, and the image of the reconstructed second sub-video is different from the image of the first sub-video.
  • an image processing device which includes a processor and a memory, wherein the memory stores a program or instruction that can be run on the processor, and when the program or instruction is executed by the processor, the steps of the method described in the first aspect are implemented.
  • an image processing device comprising a processor and a communication interface, wherein the processor is used to obtain a first image to be processed of a target object, wherein the first image to be processed includes a first feature map of the first image of the target object or includes a first sub-video in a first video of the target object;
  • the second feature map is the feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map;
  • the reconstructed image of the second sub-video is different from the image of the first sub-video.
  • a readable storage medium on which a program or instruction is stored.
  • the program or instruction is executed by a processor, the steps of the method described in the first aspect are implemented.
  • a chip comprising a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used to run a program or instruction to implement the method described in the first aspect.
  • a computer program/program product is provided, wherein the computer program/program product is stored in a storage medium and is executed by at least one processor to implement the steps of the method described in the first aspect.
  • a first to-be-processed image of a target object is obtained, wherein the first to-be-processed image includes a first feature map of the first image of the target object or includes a first sub-video in a first video of the target object; the first to-be-processed image is processed based on a first compression network to obtain a reconstructed second feature map or a second sub-video; the second feature map is a feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map; the image of the reconstructed second sub-video is different from the image of the first sub-video.
  • the reconstructed second feature map or second sub-video of the target object can be obtained through the above-mentioned first compression network, so that the second feature map or the second sub-video does not need to be encoded and transmitted, which improves the encoding efficiency, and obtaining the reconstructed second feature map or the second sub-video based on the compression network can effectively ensure the image quality.
  • FIG1 is a schematic diagram showing a flow chart of an image processing method according to an embodiment of the present application.
  • FIG2 is a schematic diagram showing the network architecture of a feature pyramid network according to an embodiment of the present application.
  • FIG3 is a schematic diagram showing a first compression network in an embodiment of the present application.
  • FIG4 is a schematic diagram showing a prediction and restoration network in an embodiment of the present application.
  • FIG5 is a schematic diagram showing a first compression network and a second compression network processing feature graph in an embodiment of the present application
  • FIG6 is a schematic diagram showing a second compression network in an embodiment of the present application.
  • FIG7 is a schematic diagram showing a module of an image processing device according to an embodiment of the present application.
  • FIG8 is a block diagram showing a structure of an image processing device in an embodiment of the present application.
  • FIG. 9 is a block diagram showing a structure of a terminal according to an embodiment of the present application.
  • first, second, etc. in the specification and claims of the present application are used to distinguish similar objects, and are not used to describe a specific order or sequence. It should be understood that the terms used in this way are interchangeable under appropriate circumstances, so that the embodiments of the present application can be implemented in an order other than those illustrated or described here, and the objects distinguished by “first” and “second” are generally of the same type, and the number of objects is not limited.
  • the first object can be one or more.
  • “and/or” in the specification and claims represents at least one of the connected objects, and the character “/” generally represents that the objects associated with each other are in an “or” relationship.
  • the image processing device corresponding to the image processing method in the embodiment of the present application may be a terminal, which may also be referred to as a terminal device or a user terminal (User Equipment, UE).
  • the terminal may be a mobile phone, a tablet computer (Tablet Personal Computer), a laptop computer (Laptop Computer) or a notebook computer, a personal digital assistant (Personal Digital Assistant, PDA), a handheld computer, a netbook, an ultra-mobile personal computer (ultra-mobile personal computer, UMPC), a mobile Internet device (Mobile Internet Device, MID), an augmented reality (augmented reality, AR)/virtual reality (virtual reality, VR) device, a robot, a wearable device (Wearable Device) or a vehicle-mounted device (Vehicle User Equipment, VUE), a pedestrian terminal (Pedestrian User Equipment, PUE) and other terminal-side devices, and the wearable device includes: a smart watch, a bracelet, a headset, glasses, etc. It should be noted that the specific type of the terminal is not limited in the embodiments of the present application.
  • JPEG: Joint Photographic Experts Group
  • DCT: Discrete Cosine Transform
  • JPEG2000 uses the Discrete Wavelet Transform (DWT) instead of the DCT to achieve better compression quality.
  • DWT: Discrete Wavelet Transform
  • intra-frame coding units can independently compress a single frame
  • the new image format can also compress the image information in the same frame.
  • BPG: Better Portable Graphics
  • the compression task is regarded as an encoding process and trained using an end-to-end learning method.
  • the specific process can be decomposed into an encoding process, and the corresponding reconstruction task can be regarded as a decoding process.
  • the encoder-decoder structure is widely used in learning-based compression methods.
  • a compression model based on a recurrent neural network (RNN) uses an autoencoder to extract image features as a transformation process.
  • the residual structure is used for feature extraction in the encoder and decoder.
  • the related art's joint optimization of compression rate and reconstruction distortion made this approach very popular in later methods.
  • the related art has proposed a compression method based on a hyperprior (super prior) to reduce spatial redundancy, and improved the entropy coding module attached to the hyperprior structure to further reduce coding redundancy.
  • the related paper takes advantage of the residual structure and attention mechanism to propose a better-structured autoencoder, together with a Gaussian mixture likelihood entropy model that improves flexibility and accuracy.
  • the above methods can certainly change the compression ratio by adjusting parameters; however, because they pay no special attention to extreme cases, their compression quality usually drops very quickly at low bit rates.
  • the low bit rate performance is optimized.
  • for the JPEG codec method, it has been proposed to apply 2×2 average pooling to the image to obtain a smaller image.
  • the original size image is interpolated during reconstruction.
  • the method is optimized by designing filters in the downsampling and interpolation process, but the filters designed by this method are related to the image information and need to be designed manually.
  • GAN: Generative Adversarial Network
  • a generative compression architecture is proposed to generate images from the image distribution encoded by the encoder, and the corresponding loss function is designed to balance the visual quality and reconstruction quality.
  • a generative adversarial network is used as an enhancement module of the decoder structure.
  • a pair of classic codec structures are trained by optimizing the rate-distortion loss, and the trained encoder is frozen to fine-tune the decoder to make it a generator in the GAN.
  • the decoder and generator parameters are interpolated to reduce the artifacts of compressed images at low bit rates.
  • the compression network is optimized entirely through a newly designed network structure; to obtain better results, the network structure must be redesigned, and there is essentially no compatibility between designs.
  • the embodiment of the present application provides an image processing method, including:
  • Step 101 Acquire a first image to be processed of a target object, where the first image to be processed includes a first feature map of a first image of the target object or a first sub-video in a first video of the target object.
  • the first feature map is extracted from the first image or the first video using a neural network.
  • the target object is a photographed object (or photographed content) corresponding to the first video or the first image.
  • the first video is a multi-view video (multi-view) of the target object, or the first video is a scalable video of the target object.
  • the multi-view video (also described as stereoscopic video) refers to the video of each view obtained by shooting the same object (or the same scene) with multiple cameras at different viewpoints.
  • the first sub-video is the video of the target object at a certain viewpoint.
  • a scalable video includes videos of different resolutions or different frame rates derived from the same video source.
  • the first sub-video is a video in the first video that transmits and displays the target object at a certain resolution. That is to say, the method of the embodiment of the present application can be used not only for processing feature maps but also for processing videos of different viewing angles or different resolutions.
  • Step 102 Processing the first to-be-processed image based on the first compression network to obtain a reconstructed second feature map or a second sub-video;
  • the second feature map is the feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map;
  • the reconstructed image of the second sub-video is different from the image of the first sub-video.
  • the image features include at least one of resolution and feature quantity.
  • the resolution of the first feature map is different from the resolution of the second feature map
  • the number of features corresponding to the first feature map is different from the number of features corresponding to the second feature map.
  • the first compression network is used to output a feature map having different image features from the input feature map.
  • the image features in the embodiment of the present application include but are not limited to resolution and number of features. That is to say, in the embodiment of the present application, the feature map in the image can be compressed by the first compression network, and part of the video in the multi-view video or scalable video (such as a video of a certain resolution or a video of a certain view) can also be compressed.
  • the first compression network is a learnable compression network.
  • the first compression network is trained by a rate loss function and a distortion loss function.
  • the method of the embodiment of the present application is applied to a neural network that extracts multiple feature maps from an image, that is, the first feature map is extracted by a neural network.
  • this feature is used to perform mutual prediction between the feature maps based on the first compression network, that is, the reconstructed second feature map is obtained from the first feature map.
  • mutual prediction is performed between different videos based on the above-mentioned first compression network.
  • a reconstructed second sub-video is obtained based on the above-mentioned first sub-video.
  • the resolution or frame rate corresponding to the first sub-video and the second sub-video are different, or the shooting angles corresponding to the first sub-video and the second sub-video are different.
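  • as a toy illustration only (the container, field names and select() helper below are assumptions made for this sketch, not structures defined by the present application), choosing such a sub-video from a multi-view or scalable video might look as follows:

```python
# Toy sketch: selecting the "first sub-video" from a multi-view / scalable video.
# All names here are illustrative assumptions, not structures from the application.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class SubVideo:
    frames: list                            # decoded frames of this representation
    view: int = 0                           # camera viewpoint (multi-view case)
    resolution: Tuple[int, int] = (0, 0)    # W x H (scalable case)

@dataclass
class FirstVideo:
    sub_videos: List[SubVideo] = field(default_factory=list)

    def select(self, view: Optional[int] = None,
               resolution: Optional[Tuple[int, int]] = None) -> SubVideo:
        """Return the sub-video matching the requested viewpoint or resolution."""
        for sv in self.sub_videos:
            if (view is None or sv.view == view) and \
               (resolution is None or sv.resolution == resolution):
                return sv
        raise KeyError("no matching sub-video")
```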
  • a first to-be-processed image of a target object is obtained, wherein the first to-be-processed image includes a first feature map of the first image of the target object or a first sub-video in a first video of the target object; the first to-be-processed image is processed based on a first compression network to obtain a reconstructed second feature map or a second sub-video; the second feature map is a feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map; the image of the reconstructed second sub-video is different from the image of the first sub-video.
  • the network can obtain the reconstructed second feature map or second sub-video of the target object, so there is no need to encode and transmit the second feature map or second sub-video, which improves the encoding efficiency.
  • obtaining the reconstructed second feature map or second sub-video based on the compression network can effectively ensure the image quality.
  • obtaining a first feature map of a first image of the target object includes:
  • a feature map is selected from the multiple feature maps as the first feature map.
  • the target neural network is a feature pyramid network, for example the feature pyramid used in a fast region-based convolutional neural network (FastRCNN).
  • the feature pyramid network can be used to extract feature maps of different resolutions.
  • the target neural network can also be another form of neural network that extracts multiple feature maps.
  • the embodiment of the present application is explained by using the feature pyramid network to implement the target detection task as an example.
  • the network architecture of the feature pyramid network is shown in Figure 2, where the input of the neural network is an image with a resolution of W×H consisting of three color channels (RGB).
  • the feature maps of the P layers are obtained from the neural network; the resolutions of the feature maps P2, P3, P4 and P5 decrease successively by a factor of 2, and the number of feature channels of each is 256.
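  • for orientation only, the sketch below builds such a pyramid in PyTorch; the backbone channel widths and the nearest-neighbour top-down wiring are illustrative assumptions rather than the exact network used here:

```python
# Minimal feature-pyramid sketch (assumes PyTorch; channel widths are illustrative).
# Produces P2-P5, each with 256 channels, with resolutions halving level by level.
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        # Top-down pathway: start from the coarsest level, upsample by 2 each step.
        p5 = self.lateral[3](c5)
        p4 = self.lateral[2](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        return [s(p) for s, p in zip(self.smooth, (p2, p3, p4, p5))]  # P2..P5
```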
  • the first compression network includes a first compression encoding network, a first processing unit and a first compression decoding network;
  • the step of processing the first image to be processed based on the first compression network to obtain a reconstructed second feature map or a second sub-video includes:
  • the decoded first variable is decoded based on the first compression decoding network to obtain a reconstructed second feature map or a second sub-video.
  • pixels of the first image to be processed are normalized.
  • the first compression network of the embodiment of the present application can also be described as a prediction compression module.
  • the prediction compression module includes an encoder, a first processing unit and a decoder.
  • the encoder includes a first compression encoding network
  • the decoder includes a first compression decoding network.
  • the above-mentioned first feature map is the feature map P2 obtained based on the neural network shown in FIG2
  • the above-mentioned second feature map is the feature map P3 obtained based on the neural network shown in FIG2
  • the feature maps for mutual prediction restoration are P2 and P3.
  • the feature maps for mutual prediction restoration can also be any two feature maps in P2-P5 except P2 and P3.
  • the input of the encoder is the feature map P2.
  • the input feature map P2 is compressed (encoded) using the first compression coding network to obtain a latent variable (the first variable), that is, a variable c = Enc(P2).
  • the variable c is then quantized and arithmetically encoded to obtain a binary bit stream, that is, ĉ = Q(c).
  • after the decoder obtains the input binary bit stream, it first performs arithmetic decoding and dequantization to obtain the decoded latent variable ĉ, that is, the decoded first variable, which is then decoded through the first compression decoding network to obtain the reconstructed second feature map, that is, the feature map P3' = Dec(ĉ).
  • Q(·) is the quantization operation, and Enc(·) and Dec(·) are the encoder and decoder respectively.
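  • this flow, c = Enc(P2) → ĉ = Q(c) → bit stream → ĉ → P3' = Dec(ĉ), can be sketched as follows (Encoder, Decoder and the entropy-coding helpers are illustrative stand-ins; the present application does not fix a particular arithmetic coder):

```python
# Hedged sketch of the first compression network's encode/decode flow.
import torch

def compress(p2, encoder, entropy_encode):
    c = encoder(p2)                    # c = Enc(P2): the latent (first) variable
    c_hat = torch.round(c)             # ĉ = Q(c): scalar quantization
    return entropy_encode(c_hat)       # arithmetic coding -> binary bit stream

def decompress(bitstream, decoder, entropy_decode):
    c_hat = entropy_decode(bitstream)  # arithmetic decoding + dequantization
    return decoder(c_hat)              # P3' = Dec(ĉ): reconstructed second feature map
```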
  • the method of the embodiment of the present application further includes:
  • target processing is performed on the reconstructed second sub-video based on the prediction restoration network to obtain the reconstructed first sub-video;
  • the prediction and restoration network is obtained by training through an enhanced loss function.
  • the target processing includes sampling processing and residual recovery processing
  • the sampling processing includes upsampling processing or downsampling processing.
  • the above-mentioned decoder may also include the above-mentioned prediction and restoration network (prediction and restoration module), which is used to predict and restore the reconstructed first feature map from the second feature map obtained by decoding and reconstruction.
  • the prediction and restoration network is designed around residual units: the width and height of the reconstructed second feature map are first upsampled to twice their original size (obtained by interpolation methods such as bilinear interpolation, which is not limited here); multiple stacked residual units then restore the residual, which is added to the upsampled input to obtain the predicted first feature map (i.e., the reconstructed first feature map).
  • the specific prediction and restoration network can also be selected and designed according to actual needs, such as using commonly used residual networks, dense networks and other enhanced networks.
  • that is, P2' = Up(P3') + Res(Up(P3')), where Up(·) represents a 2-fold upsampling operation (the upsampling or downsampling multiple is determined according to the feature maps used for inter-prediction restoration) and Res(·) represents multiple stacked residual units.
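  • a minimal sketch of this module, assuming PyTorch, bilinear 2-fold upsampling and an illustrative residual-branch depth, is given below:

```python
# Sketch of the prediction-restoration module: P2' = Up(P3') + Res(Up(P3')).
# Depth and width of the residual branch are assumptions for illustration.
import torch.nn as nn
import torch.nn.functional as F

class ResUnit(nn.Module):
    def __init__(self, ch=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class PredictRestore(nn.Module):
    def __init__(self, ch=256, n_units=3):
        super().__init__()
        self.units = nn.Sequential(*[ResUnit(ch) for _ in range(n_units)])
        self.tail = nn.Conv2d(ch, ch, 3, padding=1)   # predicts the residual

    def forward(self, p3_rec):
        up = F.interpolate(p3_rec, scale_factor=2,
                           mode="bilinear", align_corners=False)  # Up(P3')
        return up + self.tail(self.units(up))                     # add restored residual
```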
  • processing the first to-be-processed image based on the first compression network to obtain a reconstructed second feature map or a second sub-video includes:
  • the first to-be-processed image is processed based on the first compression network to obtain a reconstructed second feature map or a second sub-video;
  • the first condition includes at least one of the following:
  • the network bandwidth is less than or equal to the first threshold
  • the data volume of the first image to be processed is greater than or equal to the second threshold.
  • the method of the embodiment of the present application further includes:
  • the second image to be processed of the target object is processed based on the second compression network to obtain a reconstructed third feature map or a third sub-video;
  • the second image to be processed includes at least one third feature map of the first image or includes at least one third sub-video of the first sub-video.
  • At least one third feature map of the target object is processed based on the second compression network to obtain a reconstructed third feature map, wherein image features of the third feature map are different from image features of the first feature map or the second feature map;
  • At least one third sub-video of the target object is processed based on the second compression network to obtain a reconstructed third sub-video, wherein the image of the third sub-video is different from the image of the first sub-video, or the image of the third sub-video is different from the image of the second sub-video.
  • processing the second to-be-processed image of the target object based on the second compression network to obtain a reconstructed third feature map or a third sub-video includes:
  • the second to-be-processed image of the target object is processed based on the second compression network to obtain a reconstructed third feature map or a third sub-video;
  • the second condition includes at least one of the following:
  • the network bandwidth is greater than a first threshold
  • the data volume of the second image to be processed is less than the second threshold.
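  • the two condition tests above can be summarized as in the sketch below (the threshold variables are illustrative; the present application leaves their concrete values open):

```python
# Hedged sketch of the "at least one of" condition tests.
def first_condition(bandwidth, data_volume, thr1, thr2):
    # low bandwidth OR a large first to-be-processed image -> first (predictive) network
    return bandwidth <= thr1 or data_volume >= thr2

def second_condition(bandwidth, data_volume, thr1, thr2):
    # ample bandwidth OR a small second to-be-processed image -> second (basic) network
    return bandwidth > thr1 or data_volume < thr2
```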
  • the second compression network includes a second compression encoding network, a second processing unit and a second compression decoding network;
  • the processing of the second to-be-processed image of the target object based on the second compression network to obtain a reconstructed third feature map or a third sub-video includes:
  • the decoded second variable is decoded based on the second compression decoding network to obtain a reconstructed third feature map or a third sub-video.
  • the second compression network is a learnable codec network
  • the target neural network is described as a feature pyramid network as an example
  • the at least one third feature map includes the feature map P4 and the feature map P5.
  • by using the second compression network to compress and reconstruct the feature maps P4 and P5, the reconstruction quality of P4 and P5 can be guaranteed at a low bit rate, thereby ensuring the accuracy of the machine vision task.
  • the specific processing flow is shown in Figure 5.
  • the feature maps P2, P3, P4 and P5 are first obtained from the FastRCNN network, their pixel values are normalized, and the feature maps P2 and P3 with larger resolution are predicted and restored using the mutual prediction restoration network.
  • the encoding end uses the basic feature compression network (second compression network) to compress P4 and P5 respectively.
  • the decoding end uses the decoding network of the second compression network to obtain reconstructed feature maps P4' and P5', and combines the reconstructed feature maps P2' and P3' obtained at the decoding end for visual task (eg, target detection task) analysis to obtain the final target detection result.
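  • putting the pieces together, the flow of Figure 5 can be sketched as follows (the module objects are the illustrative stand-ins from the sketches above, not APIs defined by the present application):

```python
# Hedged end-to-end sketch of the Figure-5 processing flow.
def process_pyramid(p2, p3, p4, p5, first_net, second_net, predict_restore, normalize):
    p2, p3, p4, p5 = map(normalize, (p2, p3, p4, p5))

    # Encoding side: P2 goes through the predictive first network; P4 and P5
    # go through the basic second network; P3 itself is never transmitted.
    bits_p2 = first_net.compress(p2)
    bits_p4 = second_net.compress(p4)
    bits_p5 = second_net.compress(p5)

    # Decoding side: reconstruct what was sent, then predict what was not.
    p3_rec = first_net.decompress(bits_p2)   # P3' decoded from P2's latent
    p2_rec = predict_restore(p3_rec)         # P2' = Up(P3') + Res(Up(P3'))
    p4_rec = second_net.decompress(bits_p4)
    p5_rec = second_net.decompress(bits_p5)
    return p2_rec, p3_rec, p4_rec, p5_rec    # fed to the vision task (e.g. detection)
```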
  • the second compression network includes an encoder, a second processing unit and a decoder, the encoder includes a second compression encoding network, and the decoder includes a second compression decoding network.
  • the second compression network is a learnable encoding and decoding network, which can be selected and designed according to actual needs, such as using commonly used compression networks such as Balle and Cheng.
  • Enc_base(·) and Dec_base(·) are the encoder and decoder of the basic feature compression network respectively.
  • loss functions are used to train the first compression network, the second compression network and the prediction and restoration network.
  • Specific loss functions include rate loss function, distortion loss function and enhancement loss function.
  • the rate loss function converts the input features into latent variables, which are output through arithmetic coding, and calculates the bit rate.
  • the specific function is as follows:
  • L_R(θ) = (1/N) Σ_{i=1..N} E[−log₂ p(ĉ_i)]
  • where N is the number of training samples, θ represents the network parameters, R represents the rate (the expected number of bits of the quantized latent variable ĉ), and E represents the expected value.
  • the distortion loss function is used to measure the difference between the original feature P and the reconstructed feature P'.
  • the distortion loss function is as follows:
  • L_D(θ) = (1/N) Σ_{i=1..N} ||P_i − P'_i||₂²
  • where P' is the reconstructed feature map, θ represents the network parameters, N is the number of training samples, the l₂ norm measures the difference between the original feature map and the reconstructed feature map, and D(·) represents the distortion.
  • the enhancement loss is used to measure the difference between the output feature P2' and the original feature P2, and its formula is as follows:
  • L_enh(θ) = (1/N) Σ_{i=1..N} ||P2_i − P2'_i||₂²
  • the total loss combines the rate and distortion terms as L_total(θ) = L_R(θ) + λ·L_D(θ), where L_total(·) represents the total loss function.
  • compression models with different compression rates can be obtained by adjusting λ.
  • the above-mentioned enhanced loss function can be used for training when training the prediction and restoration module to ensure the quality of the predicted and restored feature map P2'.
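  • for concreteness, the three losses can be sketched as below (the entropy model p_c and the factorized rate estimate are assumptions for illustration):

```python
# Hedged sketch of the rate, distortion/enhancement and total training losses.
import torch
import torch.nn.functional as F

def rate_loss(c_hat, p_c):
    # Expected bit rate: -log2 likelihood of the quantized latents under p_c.
    return (-torch.log2(p_c(c_hat))).mean()

def distortion_loss(p, p_rec):
    # l2 difference between original and reconstructed feature maps;
    # the enhancement loss applies the same measure to P2 and P2'.
    return F.mse_loss(p_rec, p)

def total_loss(c_hat, p_c, p, p_rec, lam):
    # L_total = L_R + lambda * L_D; varying lam trades rate against distortion.
    return rate_loss(c_hat, p_c) + lam * distortion_loss(p, p_rec)
```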
  • a first to-be-processed image of a target object is obtained, wherein the first to-be-processed image includes a first feature map of a first image of the target object or includes a first sub-video in a first video of the target object;
  • the first image to be processed is processed in a first compression network to obtain a reconstructed second feature map or a second sub-video;
  • the second feature map is a feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map;
  • the image of the reconstructed second sub-video is different from the image of the first sub-video.
  • the reconstructed second feature map or the second sub-video of the target object can be obtained through the above-mentioned first compression network, so there is no need to encode and transmit the second feature map or the second sub-video, which improves the encoding efficiency, and obtaining the reconstructed second feature map or the second sub-video based on the compression network can effectively ensure the image quality.
  • the image processing method provided in the embodiment of the present application can be executed by an image processing device.
  • an image processing device executing the image processing method is taken as an example to illustrate the image processing device provided in the embodiment of the present application.
  • the embodiment of the present application further provides an image processing device 700, including:
  • a first acquisition module 701 is used to acquire a first image to be processed of a target object, where the first image to be processed includes a first feature map of a first image of the target object or a first sub-video in a first video of the target object;
  • a second acquisition module 702 is used to process the first to-be-processed image based on the first compression network to obtain a reconstructed second feature map or a second sub-video;
  • the second feature map is the feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map;
  • the reconstructed image of the second sub-video is different from the image of the first sub-video.
  • the first acquisition module 701 includes:
  • a first acquisition submodule used to acquire a plurality of feature maps of the first image using a target neural network, wherein the target neural network is a neural network used to extract image features;
  • the second acquisition submodule is used to select a feature map from the multiple feature maps as the first feature map.
  • the first compression network includes a first compression encoding network, a first processing unit and a first compression decoding network;
  • the second acquisition module 702 includes:
  • a third acquisition submodule is used to encode the first to-be-processed image based on the first compression coding network to obtain a first variable
  • a fourth acquisition submodule configured to perform quantization, arithmetic coding, arithmetic decoding, and inverse quantization on the first variable based on the first processing unit to obtain a decoded first variable
  • the fifth acquisition submodule is used to decode the decoded first variable based on the first compression decoding network to obtain a reconstructed second feature map or a second sub-video.
  • the first compression network is trained by a rate loss function and a distortion loss function.
  • the image processing device 700 of the embodiment of the present application further includes:
  • a third acquisition module is used to perform target processing on the reconstructed second feature map based on the prediction restoration network to obtain a reconstructed first feature map
  • the reconstructed second sub-video is subjected to target processing based on the prediction restoration network to obtain the reconstructed first sub-video.
  • the prediction and restoration network is obtained by training through an enhanced loss function.
  • the target processing includes sampling processing and residual recovery processing
  • the sampling processing includes upsampling processing or downsampling processing.
  • the second acquisition module 702 is used to process the first to-be-processed image based on a first compression network to obtain a reconstructed second feature map or a second sub-video when a first condition is met;
  • the first condition includes at least one of the following:
  • the network bandwidth is less than or equal to the first threshold
  • the data volume of the first image to be processed is greater than or equal to the second threshold.
  • the image processing device 700 of the embodiment of the present application further includes:
  • a fourth acquisition module used for processing the second to-be-processed image of the target object based on the second compression network to obtain a reconstructed third feature map or a third sub-video;
  • the second image to be processed includes at least one third feature map of the first image or includes at least one third sub-video of the first sub-video.
  • the fourth acquisition module is used to process the second to-be-processed image of the target object based on the second compression network to obtain a reconstructed third feature map or a third sub-video when the second condition is met;
  • the second condition includes at least one of the following:
  • the network bandwidth is greater than a first threshold
  • the data volume of the second image to be processed is less than the second threshold.
  • the second compression network includes a second compression encoding network, a second processing unit and a second compression decoding network;
  • the fourth acquisition module includes:
  • a sixth acquisition submodule configured to encode the second to-be-processed image based on the second compression coding network to obtain a second variable
  • a seventh acquisition submodule configured to perform quantization, arithmetic coding, arithmetic decoding, and inverse quantization on the second variable based on the second processing unit to obtain a decoded second variable
  • An eighth acquisition submodule is used to decode the decoded second variable based on the second compression decoding network to obtain a reconstructed third feature map or a third sub-video.
  • the second compression network is trained by a rate loss function and a distortion loss function.
  • the first video is a multi-view video of the target object, or the first video is a scalable video of the target object.
  • the image features include at least one of resolution and feature quantity.
  • the image processing device of the embodiment of the present application obtains a first to-be-processed image of a target object, wherein the first to-be-processed image includes a first feature map of the first image of the target object or a first sub-video in a first video of the target object; processes the first to-be-processed image based on a first compression network to obtain a reconstructed second feature map or a second sub-video; the second feature map is the feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map; the image of the reconstructed second sub-video is different from the image of the first sub-video.
  • the reconstructed second feature map or second sub-video of the target object can be obtained through the above-mentioned first compression network, so there is no need to encode and transmit the second feature map or second sub-video, which improves the encoding efficiency, and obtaining the reconstructed second feature map or second sub-video based on the compression network can effectively ensure the image quality.
  • the image processing device in the embodiment of the present application can be an electronic device, such as an electronic device with an operating system, or a component in an electronic device, such as an integrated circuit or a chip.
  • the electronic device can be a terminal, or it can be other devices other than a terminal.
  • the terminal can include but is not limited to the types listed above, and other devices can be servers, network attached storage (NAS), etc., which are not specifically limited in the embodiment of the present application.
  • the image processing device provided in the embodiment of the present application can implement each process implemented by the method embodiments of Figures 1 to 6 and achieve the same technical effect. To avoid repetition, it will not be repeated here.
  • the embodiment of the present application further provides an image processing device 800, including a processor 801 and a memory 802, wherein the memory 802 stores a program or instruction that can be run on the processor 801, and when the program or instruction is executed by the processor 801, each step of the above-mentioned image processing method embodiment is implemented, and the same technical effect can be achieved. To avoid repetition, it will not be described here.
  • the embodiment of the present application also provides an image processing device, including a processor and a communication interface, the processor is used to obtain a first image to be processed of a target object, the first image to be processed includes a first feature map of the first image of the target object or includes a first sub-video in a first video of the target object; the first image to be processed is processed based on a first compression network to obtain a reconstructed second feature map or a second sub-video; wherein the second feature map is a feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map; the image of the reconstructed second sub-video is different from the image of the first sub-video.
  • FIG. 9 is a schematic diagram of the hardware structure of an image processing device that implements the embodiment of the present application.
  • the image processing device is specifically a terminal 900.
  • the terminal 900 includes but is not limited to at least some of the following components: a radio frequency unit 901, a network module 902, an audio output unit 903, an input unit 904, a sensor 905, a display unit 906, a user input unit 907, an interface unit 908, a memory 909 and a processor 910.
  • the terminal 900 may also include a power source (such as a battery) for supplying power to each component, and the power source may be logically connected to the processor 910 through a power management system, so as to implement functions such as managing charging, discharging, and power consumption management through the power management system.
  • the terminal structure shown in FIG9 does not constitute a limitation on the terminal, and the terminal may include more or fewer components than shown, or combine certain components, or arrange components differently, which will not be described in detail here.
  • the input unit 904 may include a graphics processing unit (GPU) 9041 and a microphone 9042.
  • the graphics processor 9041 processes the image data of a static picture or video obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode.
  • a display panel 9061 may be included, and the display panel 9061 may be configured in the form of a liquid crystal display, an organic light emitting diode, etc.
  • the user input unit 907 includes a touch panel 9071 and at least one of other input devices 9072.
  • the touch panel 9071 is also called a touch screen.
  • the touch panel 9071 may include two parts: a touch detection device and a touch controller.
  • Other input devices 9072 may include, but are not limited to, a physical keyboard, a function key (such as a volume control key, a switch key, etc.), a trackball, a mouse, and a joystick, which will not be repeated here.
  • after receiving downlink data from a network side device, the RF unit 901 can transmit the data to the processor 910 for processing; in addition, the RF unit 901 can send uplink data to the network side device.
  • the RF unit 901 includes but is not limited to an antenna, an amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, etc.
  • the memory 909 can be used to store software programs or instructions and various data.
  • the memory 909 may mainly include a first storage area for storing programs or instructions and a second storage area for storing data, wherein the first storage area may store an operating system, an application program or instruction required for at least one function (such as a sound playback function, an image playback function, etc.), etc.
  • the memory 909 may include a volatile memory or a non-volatile memory, or the memory 909 may include both volatile and non-volatile memories.
  • the non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory.
  • the volatile memory may be a random access memory (RAM), a static random access memory (SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a double data rate synchronous dynamic random access memory (DDRSDRAM), an enhanced synchronous dynamic random access memory (ESDRAM), a synchronous link dynamic random access memory (SLDRAM) and a direct rambus random access memory (DRRAM).
  • the memory 909 in the embodiment of the present application includes but is not limited to these and any other suitable types of memories.
  • the processor 910 may include one or more processing units; optionally, the processor 910 integrates an application processor and a modem processor, wherein the application processor mainly processes operations related to an operating system, a user interface, and application programs, and the modem processor mainly processes wireless communication signals, such as a baseband processor. It is understandable that the modem processor may not be integrated into the processor 910.
  • the processor 910 is configured to obtain a first image to be processed of the target object, wherein the first image to be processed includes a first feature map of the first image of the target object or a first sub-video in a first video of the target object;
  • the second feature map is the feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map;
  • the reconstructed image of the second sub-video is different from the image of the first sub-video.
  • a first to-be-processed image of the target object is obtained, wherein the first to-be-processed image includes the first feature map of the first image of the target object or the first sub-video in the first video of the target object; the first to-be-processed image is processed based on the first compression network to obtain the reconstructed second feature map or second sub-video; the second feature map is the feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map; the second sub-video is a partial video in the first video, and the image of the reconstructed second sub-video is different from the image of the first sub-video.
  • the reconstructed second feature map or second sub-video of the target object can be obtained through the above-mentioned first compression network, so there is no need to encode and transmit the second feature map or second sub-video, which improves the encoding efficiency, and obtaining the reconstructed second feature map or second sub-video based on the compression network can effectively ensure the image quality.
  • processor 910 is further configured to:
  • a feature map is selected from the multiple feature maps as the first feature map.
  • the first compression network includes a first compression encoding network, a first processing unit and a first compression decoding network;
  • the processor 910 is further configured to:
  • the decoded first variable is decoded based on the first compression decoding network to obtain a reconstructed second feature map or a second sub-video.
  • the first compression network is trained by a rate loss function and a distortion loss function.
  • processor 910 is further configured to:
  • target processing is performed on the reconstructed second sub-video based on the prediction restoration network to obtain the reconstructed first sub-video;
  • the prediction and restoration network is obtained by training through an enhanced loss function.
  • the target processing includes sampling processing and residual recovery processing
  • the sampling processing includes upsampling processing or downsampling processing.
  • processor 910 is further configured to:
  • the first to-be-processed image is processed based on the first compression network to obtain a reconstructed second feature map or a second sub-video;
  • the first condition includes at least one of the following:
  • the network bandwidth is less than or equal to the first threshold
  • the data volume of the first image to be processed is greater than or equal to the second threshold.
  • processor 910 is further configured to:
  • the second image to be processed of the target object is processed based on the second compression network to obtain a reconstructed third feature The picture or the third sub video;
  • the second image to be processed includes at least one third feature map of the first image or includes at least one third sub-video of the first sub-video.
  • processor 910 is further configured to:
  • the second to-be-processed image of the target object is processed based on the second compression network to obtain a reconstructed third feature map or a third sub-video;
  • the second condition includes at least one of the following:
  • the network bandwidth is greater than a first threshold
  • the data volume of the second image to be processed is less than the second threshold.
  • the second compression network includes a second compression encoding network, a second processing unit and a second compression decoding network;
  • the processor 910 is further configured to:
  • the decoded second variable is decoded based on the second compression decoding network to obtain a reconstructed third feature map or a third sub-video.
  • the second compression network is trained by a rate loss function and a distortion loss function.
  • the first video is a multi-view video of the target object, or the first video is a scalable video of the target object.
  • the image features include at least one of resolution and feature quantity.
  • a first to-be-processed image of a target object is obtained, wherein the first to-be-processed image includes a first feature map of the first image of the target object or includes a first sub-video in a first video of the target object; the first to-be-processed image is processed based on a first compression network to obtain a reconstructed second feature map or a second sub-video; the second feature map is a feature map of the first image, and the image features of the reconstructed second feature map are different from the image features of the first feature map; the image of the reconstructed second sub-video is different from the image of the first sub-video.
  • the reconstructed second feature map or second sub-video of the target object can be obtained through the above-mentioned first compression network, so that the second feature map or the second sub-video does not need to be encoded and transmitted, which improves the encoding efficiency, and obtaining the reconstructed second feature map or the second sub-video based on the compression network can effectively ensure the image quality.
  • An embodiment of the present application also provides a readable storage medium, which may be volatile or non-volatile.
  • a program or instruction is stored on the readable storage medium.
  • the program or instruction is executed by a processor, the various processes of the above-mentioned image processing method embodiment are implemented and the same technical effect can be achieved. To avoid repetition, it will not be repeated here.
  • the processor is the processor in the terminal described in the above embodiment.
  • the readable storage medium includes a computer readable storage medium, such as a computer read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
  • An embodiment of the present application further provides a chip, which includes a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used to run programs or instructions to implement the various processes of the above-mentioned image processing method embodiment, and can achieve the same technical effect. To avoid repetition, it will not be repeated here.
  • the chip mentioned in the embodiments of the present application can also be called a system-level chip, a system chip, a chip system or a system-on-chip chip, etc.
  • the embodiment of the present application further provides a computer program/program product, which is stored in a storage medium, and is executed by at least one processor to implement the various processes of the above-mentioned image processing method embodiment, and can achieve the same technical effect. To avoid repetition, it will not be repeated here.
  • the technical solution of the present application can be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk), and includes a number of instructions for enabling a terminal (which can be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the methods described in each embodiment of the present application.
  • a storage medium such as ROM/RAM, a magnetic disk, or an optical disk
  • a terminal which can be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The present application belongs to the technical field of encoding and decoding. Disclosed are an image processing method and apparatus, and a device. The image processing method in the embodiments of the present application comprises: acquiring a first to-be-processed image of a target object, the first to-be-processed image comprising a first feature map of a first image of the target object or comprising a first sub-video in a first video of the target object; and processing the first to-be-processed image on the basis of a first compression network so as to acquire a reconstructed second feature map or a reconstructed second sub-video, the second feature map being a feature map of the first image, image features of the reconstructed second feature map being different from image features of the first feature map, and an image of the reconstructed second sub-video being different from an image of the first sub-video.
PCT/CN2023/123322 2022-10-13 2023-10-08 Image processing method and apparatus, and device WO2024078403A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211254689.1A 2022-10-13 2022-10-13 Image processing method, apparatus and device
CN202211254689.1 2022-10-13

Publications (1)

Publication Number Publication Date
WO2024078403A1 (fr)

Family

ID: 90668812

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/123322 2022-10-13 2023-10-08 Image processing method and apparatus, and device WO2024078403A1 (fr)

Country Status (2)

Country Link
CN (1) CN117939157A (fr)
WO (1) WO2024078403A1 (fr)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106791927A (zh) * 2016-12-23 2017-05-31 福建帝视信息科技有限公司 一种基于深度学习的视频增强与传输方法
CN111970513A (zh) * 2020-08-14 2020-11-20 成都数字天空科技有限公司 一种图像处理方法、装置、电子设备及存储介质
WO2022057837A1 (fr) * 2020-09-16 2022-03-24 广州虎牙科技有限公司 Procédé et appareil de traitement d'image, procédé et appareil de reconstruction de super-résolution de portrait, et procédé et appareil d'apprentissage de modèle de reconstruction de super-résolution de portrait, dispositif électronique et support de stockage
CN114501013A (zh) * 2022-01-14 2022-05-13 上海交通大学 一种可变码率视频压缩方法、系统、装置及存储介质
CN114882350A (zh) * 2022-03-30 2022-08-09 北京市测绘设计研究院 图像处理方法及装置、电子设备和存储介质
CN114897711A (zh) * 2022-04-06 2022-08-12 厦门美图之家科技有限公司 一种视频中图像处理方法、装置、设备及存储介质
US20220286696A1 (en) * 2021-03-02 2022-09-08 Samsung Electronics Co., Ltd. Image compression method and apparatus

Also Published As

Publication number Publication date
CN117939157A (zh) 2024-04-26

Similar Documents

Publication Publication Date Title
US11057646B2 (en) Image processor and image processing method
Chen et al. Dynamic measurement rate allocation for distributed compressive video sensing
US20220188976A1 (en) Image processing method and apparatus
KR101266667B1 (ko) 장치 내 제어기에서 프로그래밍되는 압축 방법 및 시스템
EP3571841B1 (fr) Schéma de codage de signe de coefficient dc
US10277905B2 (en) Transform selection for non-baseband signal coding
WO2022155974A1 (fr) Codage et décodage vidéo ainsi que procédé et appareil d'apprentissage de modèle
WO2023279961A1 (fr) Procédé et appareil de codage d'image vidéo, et procédé et appareil de décodage d'image vidéo
CN107018416B (zh) 用于视频和图像压缩的自适应贴片数据大小编码
Chen et al. Learning to compress videos without computing motion
WO2022266955A1 (fr) Procédé et appareil de décodage d'images, procédé et appareil de traitement d'images, et dispositif
WO2024078066A1 (fr) Procédé et appareil de décodage vidéo, procédé et appareil de codage vidéo, support de stockage et dispositif
WO2023193629A1 (fr) Procédé et appareil de codage pour couche d'amélioration de région, et procédé et appareil de décodage pour couche d'amélioration de zone
CN116847087A (zh) 视频处理方法、装置、存储介质及电子设备
WO2023225808A1 (fr) Compression et décompression d'image apprise à l'aide d'un module d'attention long et court
WO2024078403A1 (fr) Procédé et appareil de traitement d'image, et dispositif
WO2020053688A1 (fr) Optimisation de distorsion de débit destinée à un codage de sous-bande adaptatif d'une transformée de haar adaptative régionale (raht)
TW202324308A (zh) 圖像編解碼方法和裝置
US20130308698A1 (en) Rate and distortion estimation methods and apparatus for coarse grain scalability in scalable video coding
CN116918329A (zh) 一种视频帧的压缩和视频帧的解压缩方法及装置
CN111491166A (zh) 基于内容分析的动态压缩系统及方法
WO2023133888A1 (fr) Procédé et appareil de traitement d'image, dispositif de commande à distance, système et support de stockage
WO2023279968A1 (fr) Appareil et procédé de codage et de décodage d'une image vidéo
WO2023133889A1 (fr) Procédé et appareil de traitement d'image, dispositif de commande à distance, système et support de stockage
WO2024007977A1 (fr) Procédé et appareil de traitement d'image, et dispositif